• 𞋴𝛂𝛋𝛆@lemmy.world · 5 days ago

    You caught me. I meant this, but was thinking backwards from the bottom up. Like building the logic and registers required to satisfy the CISC instruction.

    This mental space is my thar-be-dragons-and-wizards space on the edge of my comprehension and curiosity. The pipelines involved in executing a complex instruction like an AVX-512 load of a 512-bit word, while two logical cores multithread with cache prefetching, all within the DRAM bus width limitations, to run tensor maths, are baffling to me.
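
    (For the curious, here is roughly what that one “complex instruction” looks like from C, as a minimal sketch using the standard immintrin.h AVX-512 intrinsics; everything below this level, the pipelines, ports, and scheduling, stays invisible to the programmer.)

    ```c
    #include <immintrin.h>
    #include <stddef.h>

    /* acc[i] += a[i] * b[i], 16 floats per iteration.
       Each _mm512_loadu_ps pulls one 512-bit (64-byte) chunk into a
       single zmm register; _mm512_fmadd_ps then performs 16 fused
       multiply-adds in one instruction. Compile with -mavx512f. */
    void fma_block(const float *a, const float *b, float *acc, size_t n) {
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            __m512 vc = _mm512_loadu_ps(acc + i);
            vc = _mm512_fmadd_ps(va, vb, vc);
            _mm512_storeu_ps(acc + i, vc);
        }
    }
    ```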

    I barely understood the Chips and Cheese article explaining how the primary bottleneck for running LLMs on a CPU is the L2-to-L1 cache bus throughput. Conceptually that makes sense, but thinking in terms of the actual hardware, I can’t answer, “why aren’t AI models packaged and processed in blocks specifically sized for this cache bus limitation?” If my cache bus is the limiting factor, dual-threading the logical cores seems like asinine stupidity that poisons the cache. And why an OS CPU scheduler isn’t equipped to automatically detect or flag tensor math and isolate those threads from kernel interrupts is beyond me.
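
    (That blocking idea is, in fact, what tiled kernels do. A minimal loop-tiling sketch in C, with BLOCK as a made-up tile size you would tune so the working set stays cache-resident:)

    ```c
    #define BLOCK 64  /* hypothetical tile size, tuned per cache level */

    /* C += A * B, all n x n, row-major. Each tile of A and B is reused
       many times while it is still hot in cache, so the L2->L1 bus
       moves each element far fewer times than the naive triple loop. */
    void matmul_tiled(const float *A, const float *B, float *C, int n) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    for (int i = ii; i < n && i < ii + BLOCK; i++)
                        for (int k = kk; k < n && k < kk + BLOCK; k++) {
                            float a = A[i * n + k];
                            for (int j = jj; j < n && j < jj + BLOCK; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }
    ```

    (The thread-isolation half of the question also exists, just as a manual process: pinning threads with sched_setaffinity(2) plus kernel parameters like isolcpus is roughly the hand-rolled version of that automatic isolation.)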

    Adding a layer to that and saying all of this is RISC cosplaying as CISC is my mental party-clown-cum-serial-killer… “but… but… it is 1 instruction…”

• nickwitha_k (he/him)@lemmy.sdf.org · edited · 4 days ago

      You caught me. I meant this, but was thinking backwards from the bottom up. Like building the logic and registers required to satisfy the CISC instruction.

      Yeah. I’m from more of a SysAdmin/DevOps/(kinda)SWE background, so I tend to think of it in a manner similar to APIs. The x86_64 CISC registers are like a public API, and the ??? RISC-y registers are like an internal API that may or may not even be accessible outside of intra-die communication.
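
      (A toy C sketch of that analogy, with every name invented for illustration: the prototype is the stable architectural contract, the static function is the internal machinery that is free to change.)

      ```c
      /* The “public API”: a stable, documented contract, like the
         x86_64 architectural registers. Callers only ever see this. */
      unsigned long alu_add(unsigned long a, unsigned long b);

      /* The “internal API”: free to change between die revisions, like
         the RISC-y internals. static = invisible outside this unit. */
      static unsigned long uop_add(unsigned long a, unsigned long b) {
          return a + b;
      }

      unsigned long alu_add(unsigned long a, unsigned long b) {
          return uop_add(a, b);
      }
      ```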

      This mental space is my thar-be-dragons-and-wizards space on the edge of my comprehension and curiosity. The pipelines involved in executing a complex instruction like an AVX-512 load of a 512-bit word, while two logical cores multithread with cache prefetching, all within the DRAM bus width limitations, to run tensor maths, are baffling to me.

      Very similar to where I’m at. I’ve finally gotten my AuADHD brain to get Vivado set up for my Zynq dev board, and I think I finally have everything I need to try to unbrick my Fomu (it doesn’t have a hard USB controller, so I have to use a pogo-pin jig to load a basic USB softcore that will allow it to be programmed normally).

      I barely understood the Chips and Cheese article explaining how the primary bottleneck for running LLMs on a CPU is the L2-to-L1 cache bus throughput. Conceptually that makes sense, but thinking in terms of the actual hardware, I can’t answer, “why aren’t AI models packaged and processed in blocks specifically sized for this cache bus limitation?” If my cache bus is the limiting factor, dual-threading the logical cores seems like asinine stupidity that poisons the cache. And why an OS CPU scheduler isn’t equipped to automatically detect or flag tensor math and isolate those threads from kernel interrupts is beyond me.

      Mind sharing that article?

      Adding a layer to that and saying all of this is RISC cosplaying as CISC is my mental party-clown-cum-serial-killer… “but… but… it is 1 instruction…”

      I think it works like the API analogy above, but I could be entirely incorrect. I don’t think I am, though. Because the registers that programs interact with are standardized, those probably are “actual” x86, in that they are expected to handle x86 instructions in the spec-defined manner. Past those externally-addressable registers is just a black box that does the work to make the registers behave in the expected manner. Some of that black box must also include programmable logic to allow microcode to be a thing.
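
      (A toy model of that black box in C, with every name invented for illustration and nothing here reflecting real Intel/AMD internals: the front end cracks one architectural instruction into several internal micro-ops.)

      ```c
      #include <stdio.h>

      /* Hypothetical micro-op encoding -- illustrative only. */
      typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind;
      typedef struct { uop_kind kind; const char *effect; } uop;

      /* "Decode" the single CISC instruction `add [rbx], rax`
         into the RISC-like steps a back end would actually run. */
      static int decode_add_mem_reg(uop out[]) {
          out[0] = (uop){UOP_LOAD,  "tmp      <- mem[rbx]"};
          out[1] = (uop){UOP_ADD,   "tmp      <- tmp + rax"};
          out[2] = (uop){UOP_STORE, "mem[rbx] <- tmp"};
          return 3;  /* "but... it is 1 instruction..." -> three uops */
      }

      int main(void) {
          uop uops[3];
          int n = decode_add_mem_reg(uops);
          for (int i = 0; i < n; i++)
              printf("uop %d: %s\n", i, uops[i].effect);
          return 0;
      }
      ```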

      It’s a crazy and magical side of technology.