spoaceman7777 6 hours ago [-]
I'm somewhat confused as to why this is on the front page. It doesn't go into any real detail, and the advice it gives is... not good. You should definitely not be quantizing your own GGUFs using an old method like that HF script. There are lots of ways to run LLMs via podman (some even officially recommended by the project!). The chip has been out for almost a year now, and its most notable (and relevant-to-AI) feature is not mentioned in this article: it's the only x86_64 chip below workstation/server grade that has quad-channel RAM, and inference is generally RAM-constrained. I'm also quite puzzled by the bit about running PyTorch via uv.
Anyway. I wouldn't recommend following the steps posted in there. Poke around Google, or ask your friendly neighborhood LLM for advice on how to set up your Strix Halo laptop/desktop for the tasks described. A good resource to start with would probably be the Unsloth page for whichever model you are trying to run. (There are a few quantization groups competing for top place with GGUFs, and Unsloth is regularly at the top, with incredible documentation on inference, training, etc.)
Anyway, sorry to be harsh. I understand that this is just a blog for jotting down stuff you're doing, which is a great thing to do. I'm mostly just commenting on the fact that this is on the front page of hn for some reason.
pierrekin 5 hours ago [-]
Thanks for writing this comment. I think seeing someone's "first impressions" and then someone else's response to those thoughts is more interesting, and feels more socially connected, than just reading a "correct" guide, especially when it's something I'm curious about but wouldn't necessarily be motivated enough to actually try out myself.
fwipsy 4 hours ago [-]
Quad-channel RAM is common on consumer desktops. Strix Halo has *8* channels, and also very fast RAM (soldered RAM can be faster than DIMMs because the traces are shorter).
fluoridation 4 hours ago [-]
Quad channel memory is not common on consumer desktops, it's a strictly HEDT and above feature. The vast majority of consumer desktops have 2 channels or fewer.
phonon 3 hours ago [-]
4 DIMMs =/= 4 channels
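[Editor's note: for reference on why channel count matters for inference, peak DRAM bandwidth is roughly bus width times transfer rate. A quick sketch; the DDR5-6000 desktop and LPDDR5X-8000 Strix Halo figures are my own assumptions, not from the thread:]

```python
def peak_bandwidth_gbs(bus_width_bits: int, mt_per_s: int) -> float:
    """Peak DRAM bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * mt_per_s * 1e6 / 1e9

# Typical dual-channel desktop: 128-bit DDR5-6000
desktop = peak_bandwidth_gbs(128, 6000)   # 96.0 GB/s
# Strix Halo: 256-bit LPDDR5X-8000 (quad-channel by the 64-bit convention)
strix = peak_bandwidth_gbs(256, 8000)     # 256.0 GB/s
```

Since token generation is bandwidth-bound, that ratio is roughly the expected speedup over a desktop iGPU with the same compute.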
suprjami 3 hours ago [-]
If you are using quants below Q8 then get them from Unsloth or Bartowski.
They are higher quality than the quants you can make yourself due to their imatrix datasets and selective quantisation of different parts of the model.
For Qwen 3.5, Unsloth did 9 terabytes of quants to benchmark the effects of this: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
That used to be a good suggestion, and it still most likely is if you're using a recent Nvidia dGPU, but absolutely not for iGPUs like the Halo/Point or Arc LPG. The problem is bf16.
In short, even lower quants leave some layers at original precision, and llama.cpp in its endless wisdom does no conversion at load time based on what your card supports, so every time you run inference it hits a brick wall when there's no bf16 acceleration. It then has to convert to fp16 (or something else) on the fly, which can literally drop tg (token generation) by half or even more. I've seen fp16 models literally run faster than Q8 on Arc despite being twice the size at the same bandwidth, and it's expectedly similar [0] on AMD.
Models used to be released as fp16, which was fine. Then Gemma did native bf16, and Bartowski initially came up with a compatibility scheme where they converted bf16 to fp32, then fp16, and used that for quants. Most models are released as bf16 these days, though, and Bartowski has given up on doing that (while Unsloth never did it to begin with). So if you do want max speed, you kind of have to do static quants yourself and follow the same multi-step process to remove all the stupid bf16 weights from the model. I don't get why this can't be done once at model load, but this is what we've got.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_st...
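[Editor's note: the bf16-to-fp32-to-fp16 dance described above comes down to bit layout: bf16 is the top 16 bits of an fp32, so widening is exact and narrowing only truncates mantissa. A minimal numpy illustration, not the actual llama.cpp or Bartowski conversion code:]

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    # bf16 is the top 16 bits of an IEEE-754 fp32: same exponent, truncated mantissa.
    # Real converters round-to-nearest-even; truncation keeps the sketch short.
    return (x.view(np.uint32) >> 16).astype(np.uint16)

def bf16_to_fp32(bits: np.ndarray) -> np.ndarray:
    # Widening is exact: append 16 zero mantissa bits.
    return (bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, -2.5, 3.14159], dtype=np.float32)
back = bf16_to_fp32(fp32_to_bf16(x))   # 1.0 and -2.5 survive exactly
fp16 = back.astype(np.float16)         # what a GPU without bf16 support would rather load
```

The round trip through fp32 exists because fp16 has a much smaller exponent range than bf16, so a direct bit-level cast is impossible.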
Check out the officially supported project Lemonade [0] by AMD. It has gfx1151-specific builds of vLLM, llama.cpp, and ComfyUI, and even a PR to merge a Strix Halo port of Apple's MLX [1], with a quick and easy install.
[0] https://www.amd.com/en/developer/resources/technical-article...
[1] https://github.com/lemonade-sdk/lemonade/issues/1642
> It seems that things wouldn't work without a BIOS update: PyTorch was unable to find the GPU. This was easily done on the BIOS settings: it was able to connect to my Wifi network and download it automatically.
Call me traditional, but I find it a bit scary for my BIOS to be connecting to WiFi and doing the downloading. It makes me wonder whether the new BIOS blob would be secure, i.e. did the BIOS connect securely over HTTPS? Did it check the appropriate hash/signature, etc.? I would suppose all this is more difficult to do in the BIOS. I would expect better security if this were done in user space in the OS.
I'd much prefer it if the OS did the actual downloading, with the BIOS just doing the installation of the update.
ZiiS 3 hours ago [-]
I have never seen a BIOS that didn't allow offline updates. However, SSL is much less processing than a WPA2 WiFi stack; I would certainly expect this to be fully secure, and would boycott a manufacturer who failed at it. Conversely, being able to update your BIOS without worrying whether your OS is rooted is nice.
imp0cat 3 hours ago [-]
Isn't this pretty much standard in this day and age? HP, for example, also has this option in the BIOS of their laptops (but you can still either download the BIOS blob manually in Linux or use the automatic updater in Windows if you want).
sidkshatriya 3 hours ago [-]
> Isn't this pretty much standard in this day and age?
If something is "standard" nowadays, does that mean it is the right way to go?
One of my main issues is that this means your BIOS has to have a WiFi software stack, a TLS stack, etc. Basically millions of lines of extra code, most of it in a blob never to be seen by more than a few engineers.
Though, in another way, allowing the BIOS to perform self-updates is good, because it doesn't matter whether you've installed FreeBSD, OpenBSD, Linux, Windows, or any other OS: you will be able to update your BIOS.
trvz 1 hour ago [-]
I fully expect any BIOS to have millions of unnecessary lines of code already though. May as well have a bit more for user convenience.
anko 7 hours ago [-]
I would be interested to know what speeds you can get from gemma4 26b + 31b from this machine.
Also, how does ROCm compare to Triton?
SwellJoe 5 hours ago [-]
Currently running Gemma 4 26B A4B at 8-bit quantization, reasoning off. The most recent job performed as follows (which seems about average, though these are short-running tasks, <2 seconds for each prompt):
prompt eval time = 315.66 ms / 221 tokens ( 1.43 ms per token, 700.13 tokens per second)
eval time = 1431.96 ms / 58 tokens ( 24.69 ms per token, 40.50 tokens per second)
total time = 1747.62 ms / 279 tokens
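[Editor's note: the per-second figures in llama.cpp's summary are easy to sanity-check, since tokens per second is just the token count divided by the elapsed time:]

```python
def tok_per_s(ms: float, tokens: int) -> float:
    # llama.cpp reports elapsed milliseconds and token counts; rate is tokens/second
    return tokens / (ms / 1000.0)

pp = tok_per_s(315.66, 221)    # prompt processing: ~700 tok/s
tg = tok_per_s(1431.96, 58)    # token generation: ~40.5 tok/s
```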
With reasoning enabled, it's about a quarter or a fifth of that performance: quite a lot slower, but still reasonably comfortable to use interactively. The dense model is even slower. For some reason, Gemma 4 is pretty slow on the Strix Halo with reasoning enabled, compared to similar models. It reasons really hard, I guess. I don't understand what makes models of similar size slower or faster; it surprised me.
Qwen 3.5 and 3.6 in the similar sized MoE versions at 8-bit quantization are notably faster on this hardware. If I were using Gemma 4 31B with reasoning interactively, I'd use a smaller 6-bit or even 5-bit quantization, to speed it up to something sort of comfortable to use. Because it is dog slow at 8-bit quantization, but shockingly smart and effective for such a tiny model.
Edit: Here's some benchmarks which feel right, based on my own experiences. https://kyuz0.github.io/amd-strix-halo-toolboxes/
- gemma4-31b normal q8 -> 5.1 tok/s
- gemma4-31b normal q16 -> 3.7 t/s
- gemma4-31b distil q16 -> 3.6 t/s
- gemma4-31b distil q8 -> 5.7 tok/s (!)
- gemma4-26b-a4b ud q8kxl -> 38 t/s (!)
- gemma4-26b-a4b ud q16 -> 12 t/s
- gemma4-26b-a4b cl q8 -> 42 t/s (!)
- gemma4-26b-a4b cl q16 -> 12 t/s
- qwen3.5-35b-a3b-UD@q6_k -> 52 t/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@q8_0 -> 34 tok/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@bf16 -> 11 tok/s
- qwen3.5-27b-claude-4.6-opus-reasoning-distilled-v2 q8 -> 8 tok/s
- qwen3.5 122B A10B MXFP4 Mo qwen3.5-122b-a10b (q4) -> 11 tok/s
- qwen3.5-122b-a10b-uncensored-hauhaucs-aggressive (q6) -> 10 tok/s
Owning the GGUF conversion step is good in some circumstances, but running in fp16 is suboptimal for this hardware due to its low-ish bandwidth.
It looks like context is set to 32k, which is the bare minimum needed for OpenCode with its ~10k initial system prompt. So overall, something like Unsloth's UD q8 XL or q6 XL quants would free up a lot of memory and bandwidth, moving it into the next tier of usefulness.
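[Editor's note: rough weight-size arithmetic shows what dropping a quant tier buys on a unified-memory machine. The 8.5 and 6.5 bits-per-weight averages are my ballpark assumptions for Q8/Q6-style GGUFs, which mix tensor precisions:]

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    # Approximate weight-only footprint; KV cache and activations come on top.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

q8 = weights_gb(26, 8.5)   # roughly 27.6 GB for a 26B model at a Q8-ish average
q6 = weights_gb(26, 6.5)   # roughly 21.1 GB at Q6-ish: about 6.5 GB freed
```

That freed memory can go to a larger context window, and since generation is bandwidth-bound, the smaller weights also read faster per token.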
roenxi 7 hours ago [-]
I thought the point of something like Strix Halo was to avoid ROCm altogether? AMD's strategy seems to have been to unify GPU/CPU memory and then let people write their own libraries.
The industry looks like it's started to move towards Vulkan. If AMD cards have figured out how to reliably run compute shaders without locking up (never a given in my experience, but that was some time ago) then there shouldn't be a reason to use speciality APIs or software written by AMD outside of drivers.
ROCm was always a bit problematic, but the issue was that if AMD cards weren't good enough for AMD engineers to reliably support tensor multiplication, then there was no way anyone else was going to be able to do it. It isn't as if anyone is confused about multiplying matrices together; it isn't for everyone, but the naive algorithm is a core undergrad topic, and the advanced algorithms surely aren't that crazy to implement. It was never a library problem.
SwellJoe 5 hours ago [-]
You misunderstand the point, and ROCm. The GPU and CPU share memory; that doesn't mean you don't need to interact with the GPU anymore.
You can use Vulkan instead of ROCm on Radeon GPUs, including on the Strix Halo (and for a while, Vulkan was more likely to work on the Strix Halo, as ROCm support was slow to arrive and stabilize), but you need something that talks to the GPU.
Current ROCm, 7.2.1, works quite well on the Strix Halo. Vulkan does, too. ROCm tends to be a little faster, though. Not always, but mostly. People used to benchmark to figure out which was the best for a given model/workload, but now, I think most folks just assume ROCm is the better choice and use it exclusively. That's what I do, though I did find Gemma 4 wouldn't work on ROCm for a little bit after release (I think that was a llama.cpp issue, though).
roenxi 5 hours ago [-]
> The GPU and CPU share memory, that doesn't mean you don't need to interact with the GPU, anymore.
But we already have software that talks to the GPU: mesa3d and the ecosystem around it. It has existed for decades. My understanding was that the main reason not to use it was that memory management was too complicated, and CUDA solved that problem.
If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?
SwellJoe 5 hours ago [-]
> CUDA solved that problem.
CUDA is a proprietary Nvidia product. CUDA solved the problem for Nvidia chips.
On AMD GPUs, you use ROCm. On Intel, you use OpenVINO. On Apple silicon you use MLX. All work fine with all the common AI tasks you'd want to do on self-hosted hardware. CUDA was there first and so it has a more mature ecosystem, but, so far, I've found 0 models or tasks I haven't been able to use with ROCm. llama.cpp works fine. ComfyUI works fine. Transformers library works fine. LM Studio works fine.
Unless you believe Nvidia having a monopoly on inference or training AI models is good for the world, you can't oppose all the other GPU makers having a way for their chips to be used for those purposes. CUDA is a proprietary vendor-specific solution.
Edit: But, also, Vulkan works fine on the Strix Halo. It is reliable and usually not that much slower than ROCm (and occasionally faster, somehow). Here's some benchmarks: https://kyuz0.github.io/amd-strix-halo-toolboxes/
roenxi 5 hours ago [-]
Why? What is the point of focusing on something that seems to be a memory management solution when the memory management problem theoretically just went away?
That has been one of the big themes in GPU hardware since around 2010, when AMD committed to ATI. Nvidia tried to solve the memory management problem in the software layer; AMD committed to doing it in hardware. Software was a better bet by around a trillion dollars so far, but if the hardware solutions have finally come to fruition, then why the focus on ROCm?
SwellJoe 3 hours ago [-]
I dunno. GPU programming and performance is above my pay grade. I assume the reason every GPU maker is investing in software is because they understand the problems to be solved and feel it's worth the investment to solve them. I like AMD because their Linux drivers are open source. I like Intel because all their stuff is Open Source. I like Nvidia notably less because none of their stuff is Open Source, not even the Linux drivers.
sabedevops 5 hours ago [-]
The problem with ROCm, unlike CUDA, is that it doesn’t run on much of AMDs own hardware, most notably their iGPU.
SwellJoe 4 hours ago [-]
Yeah, that kinda sucks, but all their new-generation onboard GPUs are supported by ROCm, e.g. the Ryzen AI 395 and 400 series, which will be found in mid-to-high-end laptops, desktops, and motherboards. They seem to have realized that the reason Nvidia is kicking their ass is that people can develop with CUDA on all sorts of hardware, including their personal laptop or desktop.
dragontamer 4 hours ago [-]
> If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?
And the memory barriers? How do you sync up the L1/L2 cache of a CPU core with the GPU's cache?
Exactly. ROCm provides memory barriers, enabling parallelism between CPU and GPU while also providing a mechanism for synchronization.
GPU and CPU can share memory, but they do not share caches. You need programming effort to make ANY of this work.
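[Editor's note: the same point can be made with plain CPU threads: shared memory alone gives you races, not coordination, and something has to play the role of the barrier/signal. A sketch in Python as an analogy only, not actual ROCm/HIP code, where the equivalent would be a system-scope fence such as __threadfence_system():]

```python
import threading

data = []
ready = threading.Event()  # plays the role of the memory barrier / signal

def producer():
    data.append(42)        # write to shared memory...
    ready.set()            # ...then publish: consumer must not look before this

def consumer(out):
    ready.wait()           # synchronize before reading shared memory
    out.append(data[0])

out = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(out,))
t2.start(); t1.start()     # consumer may even start first; it still waits
t1.join(); t2.join()
```

Remove the `set()`/`wait()` pair and the consumer can observe an empty list; unified CPU/GPU memory has the same failure mode without explicit cache/fence handling.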
timmy777 8 hours ago [-]
Thanks for sharing. However, this missed being a good writeup due to a lack of numbers and data.
I'll give a specific example in my feedback. You said:
```
so far, so good, I was able to play with PyTorch and run Qwen3.6 on llama.cpp with a large context window
```
But there are no numbers, results, or pasted output. No performance figures or timings.
Anyone with enough RAM can run these models; it will just be impractically slow. The Strix Halo is for decent performance, so your sharing numbers would be valuable here.
Do you mind sharing these? Thanks!
gessha 7 hours ago [-]
This is more of a “succeeding to get anywhere close to messing around” rather than “it works so now I can run some benchmarks” type of article.
l33tfr4gg3r 7 hours ago [-]
To give the benefit of the doubt, the author does state multiple times (including in the title) that these were "first impressions", so perhaps they should have mentioned something like "...in the next post, we'll explore performance and numbers" to avoid a cliffhanger, or done a part 1 (assuming the intention was to follow up with a part 2).
IamTC 7 hours ago [-]
Nice. Thanks for the writeup. My Strix Halo machine is arriving next week. This is handy and helpful.
JSR_FDED 7 hours ago [-]
Perfect. No fluff, just the minimum needed to get things working.