ARM is an older Reduced Instruction Set Computer (RISC) design too, rooted in the same Berkeley RISC research (by way of Acorn in the UK). There are not a lot of differences here. x86 could even be better. American companies are mostly run by incompetent misers that extract value through exploitation instead of innovating toward the edge and the future. Intel has crashed and burned because it failed to keep pace with the competition. Like much of the newer x86 stuff is RISC-like wrappers on CISC instructions under the hood, to loosely quote others at places like Linux Plumbers Conference talks.
ARM costs a fortune in royalties. RISC-V removes those royalties and creates an entire ecosystem where companies can independently sell their own IP blocks, instead of places like Intel using this space for manipulative exploitation through vendor lock-in. If China invests in RISC-V, it will antiquate the entire West within 5-10 years’ time, similar to what they did with electric vehicles in the face of western privateer-pirate capitalist incompetence.
Like much of the newer x86 stuff is RISC-like wrappers on CISC instructions under the hood
I think it’s actually the opposite. The actual execution units tend to be more RISC-like, but the “public” interfaces are CISC to allow backwards compatibility. Otherwise, they would have to publish new developer docs for every microcode update or generational change.
Not necessarily a bad strategy, but it definitely results in greater complexity over time to translate between the “external” and “internal” architecture, and it also makes it challenging to really tune the interface between hardware and software because of the abstraction layer.
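To make the decode idea concrete, here is a toy C model of a CISC-style read-modify-write add being cracked into load / add / store micro-ops. It is purely illustrative (the struct, the names, and the “decoder” are invented for the sketch); real front ends, renaming, and scheduling are vastly more complicated and largely undocumented.

```c
/* Toy model of a CISC front end cracking one complex instruction into
 * RISC-like micro-ops. Purely illustrative; real decoders, register
 * renaming, and schedulers are far more involved (and undocumented). */
#include <stdint.h>
#include <stdio.h>

typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } UopKind;

typedef struct {
    UopKind kind;
    int     dst;   /* register/temp index */
    int     src;   /* register/temp index */
    size_t  addr;  /* memory index for load/store */
} Uop;

/* "Decode" the CISC-style instruction  add [addr], reg
 * into three simple micro-ops: load, add, store. */
static int decode_add_mem_reg(size_t addr, int reg, Uop out[3]) {
    out[0] = (Uop){ UOP_LOAD,  /*dst=*/2, /*src=*/0,   addr }; /* tmp <- mem[addr] */
    out[1] = (Uop){ UOP_ADD,   /*dst=*/2, /*src=*/reg, 0    }; /* tmp <- tmp + reg */
    out[2] = (Uop){ UOP_STORE, /*dst=*/0, /*src=*/2,   addr }; /* mem[addr] <- tmp */
    return 3;
}

int main(void) {
    int64_t regs[3] = { 7, 0, 0 };   /* regs[0..1] "architectural", regs[2] a temp */
    int64_t mem[4]  = { 100, 0, 0, 0 };

    Uop uops[3];
    int n = decode_add_mem_reg(/*addr=*/0, /*reg=*/0, uops);

    for (int i = 0; i < n; i++) {
        switch (uops[i].kind) {
        case UOP_LOAD:  regs[uops[i].dst] = mem[uops[i].addr];  break;
        case UOP_ADD:   regs[uops[i].dst] += regs[uops[i].src]; break;
        case UOP_STORE: mem[uops[i].addr] = regs[uops[i].src];  break;
        }
    }
    printf("mem[0] = %lld\n", (long long)mem[0]); /* 107: one "instruction", three uops */
    return 0;
}
```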
You caught me. I meant this, but was thinking backwards from the bottom up. Like building the logic and registers required to satisfy the CISC instruction.
This mental space is my “thar be dragons and wizards” space, right at the edge of my comprehension and curiosity. The pipelines involved in executing a complex instruction like an AVX load of a 512-bit word, while two logical cores are simultaneously multithreading with cache prediction, all within the DRAM bus width limitations, to run tensor math, are baffling to me.
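The 512-bit part, at least, can be made concrete: a single AVX-512 load intrinsic moves 64 bytes, a full cache line on typical x86 parts. A minimal sketch, assuming a CPU and compiler with AVX-512F support (e.g. gcc -O2 -mavx512f); it says nothing about the SMT or DRAM side of the question.

```c
/* One AVX-512 load: a single instruction that pulls in 64 bytes, i.e. an
 * entire cache line on most x86 parts. Requires AVX-512F hardware and a
 * flag like `gcc -O2 -mavx512f`. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float src[16];
    for (int i = 0; i < 16; i++) src[i] = (float)i;

    __m512 v = _mm512_loadu_ps(src);        /* 16 floats = 512 bits = 64 bytes */
    __m512 doubled = _mm512_add_ps(v, v);   /* one instruction across all 16 lanes */

    float out[16];
    _mm512_storeu_ps(out, doubled);
    printf("out[15] = %.1f (64 bytes moved per load)\n", out[15]);
    return 0;
}
```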
I barely understood the Chips and Cheese article explaining how the primary bottleneck for running LLMs on a CPU is the L2-to-L1 cache bus throughput. Conceptually that makes sense, but thinking in terms of the actual hardware, I can’t answer, “why aren’t AI models packaged and processed in blocks specifically sized for this cache bus limitation?” If my cache bus is the limiting factor, dual threading on logical cores seems like asinine stupidity that poisons the cache. And why an OS CPU scheduler is not equipped to automatically detect or flag tensor math and isolate those threads from kernel interrupts is beyond me.
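For what it’s worth, the “blocks sized for the cache” idea is roughly what cache-blocked (tiled) matrix kernels already do: work on tiles small enough to stay resident in L1/L2 instead of streaming everything through the memory hierarchy. A minimal sketch below; the block size is an arbitrary illustration, not a number from the article, and real kernels tune it per CPU.

```c
/* Minimal cache-blocked matrix multiply: process BLOCK x BLOCK tiles so the
 * working set stays resident in the inner caches instead of streaming from
 * DRAM. BLOCK here is an illustrative guess; real kernels tune it per CPU. */
#include <stdio.h>

#define N     256
#define BLOCK 32   /* 32*32 floats = 4 KiB per tile, comfortably inside L1 */

static float A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 2.0f; }

    /* Loop over tiles, then do the dense work inside each tile. */
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        float a = A[i][k];
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i][j] += a * B[k][j];
                    }

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0f * N);
    return 0;
}
```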
Adding a layer to that and saying all of this is RISC cosplaying as CISC is my mental party clown cum serial killer… “but… but… it is 1 instruction…”
You caught me. I meant this, but was thinking backwards from the bottom up. Like building the logic and registers required to satisfy the CISC instruction.
Yeah. I’m from more of a SysAdmin/DevOps/(kinda)SWE background, so I tend to think of it in a similar manner to APIs. The x86_64 CISC registers are like a public API, and the ??? RISC-y registers are like an internal API that may or may not even be accessible outside of intra-die communication.
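A very loose toy version of that analogy, using register renaming as the “internal API”: software can only name a handful of architectural registers, while the core quietly maps each write onto a larger physical register file it never exposes. The sizes and the allocation scheme below are made up for illustration.

```c
/* Toy sketch of the "public vs internal API" analogy: software only ever
 * names a few architectural registers, while the core maps each new write
 * onto a larger physical register file that it never exposes. */
#include <stdio.h>

#define ARCH_REGS 4     /* the "public API": what the ISA lets you name */
#define PHYS_REGS 16    /* the "internal API": invisible to software */

static long phys[PHYS_REGS];          /* physical register file */
static int  rename_table[ARCH_REGS];  /* arch reg -> current phys reg */
static int  next_free = 0;

/* Writing an architectural register grabs a fresh physical register, so
 * older in-flight readers of the old value are unaffected. */
static void write_arch(int arch, long value) {
    rename_table[arch] = next_free;
    phys[next_free] = value;
    next_free = (next_free + 1) % PHYS_REGS;   /* toy allocator, never frees */
}

static long read_arch(int arch) {
    return phys[rename_table[arch]];
}

int main(void) {
    write_arch(0, 42);   /* "mov rax, 42" */
    write_arch(0, 99);   /* "mov rax, 99" lands in a different phys reg */
    printf("arch reg 0 = %ld, currently backed by phys reg %d\n",
           read_arch(0), rename_table[0]);
    return 0;
}
```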
This mental space is my “thar be dragons and wizards” space, right at the edge of my comprehension and curiosity. The pipelines involved in executing a complex instruction like an AVX load of a 512-bit word, while two logical cores are simultaneously multithreading with cache prediction, all within the DRAM bus width limitations, to run tensor math, are baffling to me.
Very similar to where I’m at. I’ve finally gotten my AuADHD brain to get Vivado set up for my Zynq dev board, and I think I finally have everything I need to try to unbrick my Fomu (it doesn’t have a hard USB controller, so I have to use a pogo-pin jig to load a basic USB softcore that will allow it to be programmed normally).
I barely understood the Chips and Cheese article explaining how the primary bottleneck for running LLMs on a CPU is the L2-to-L1 cache bus throughput. Conceptually that makes sense, but thinking in terms of the actual hardware, I can’t answer, “why aren’t AI models packaged and processed in blocks specifically sized for this cache bus limitation?” If my cache bus is the limiting factor, dual threading on logical cores seems like asinine stupidity that poisons the cache. And why an OS CPU scheduler is not equipped to automatically detect or flag tensor math and isolate those threads from kernel interrupts is beyond me.
Mind sharing that article?
Adding a layer to that and saying all of this is RISC cosplaying as CISC is my mental party clown cum serial killer… “but… but… it is 1 instruction…”
I think it’s like the API way of thinking about it above, but I could be entirely incorrect. I don’t think I am, though. Because the registers that programs interact with are standardized, those probably are “actual” x86, in that they are expected to handle x86 instructions in the spec-defined manner. Past those externally addressable registers is just a black box that does the work to allow the registers to act in the expected manner. Some of that black box also must include programmable logic to allow microcode to be a thing.
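One way to picture that “black box behind a fixed contract” idea in code: keep the caller-facing signature fixed and swap the implementation behind it, loosely analogous to loading a microcode update. Everything here is an invented analogy, not how microcode actually works.

```c
/* The caller-facing function (the "spec") never changes, but the
 * implementation behind it can be swapped out at runtime, loosely like a
 * microcode update. All names are invented for illustration. */
#include <stdio.h>

typedef long (*add_impl_fn)(long a, long b);

static long add_v1(long a, long b) { return a + b; }

/* A "patched" implementation: same externally visible behavior, different
 * internals (imagine a workaround for a hardware bug). */
static long add_v2(long a, long b) { long r = a; r += b; return r; }

static add_impl_fn current_add = add_v1;     /* the programmable part */

long public_add(long a, long b) {            /* the stable "architectural" interface */
    return current_add(a, b);
}

int main(void) {
    printf("before patch: %ld\n", public_add(2, 3));
    current_add = add_v2;                    /* "load new microcode" */
    printf("after  patch: %ld\n", public_add(2, 3));
    return 0;
}
```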
It’s a crazy and magical side of technology.