Have you ever looked at a char and thought: “Damn, 1 byte for a single char is pretty darn inefficient”? No? Well, I did. So what I decided to do instead is pack 5 chars together: convert each char to a 2-digit integer, then concatenate those five 2-digit ints into one big unsigned int, and boom, I stored 5 chars in only 4 bytes instead of 5. The reason this works is that an unsigned int is up to ten decimal digits long, so each char can be stored in 2 of those digits. In theory you could encode 32 different chars with this technique (the first two digits of the maximum unsigned int are 42, and if you don’t want to account for a possible 0 at the beginning, you end up with 32 usable codes). If you decided to use a char’s full 0–255 range instead (3 decimal digits per char), the 10 digits would hold exactly 3 chars. Why should anyone do that? Idk. Is it way too much work to be useful? Yes. Was it funny? Yes.

Anyone who’s interested in the code: here’s how I did it in C: https://pastebin.com/hDeHijX6

Yes, I know the code is probably bad, but I do not care. It was just a funny, useless idea I had.
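
For anyone who doesn’t want to click through: this isn’t the pastebin code, just a minimal sketch of the same idea, assuming a mapping of ‘a’–‘z’, space, and ‘.’ (the 28 supported symbols) to the two-digit codes 10–37:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical mapping: 'a'..'z' -> 10..35, ' ' -> 36, '.' -> 37.
       Every code is exactly two decimal digits, so five codes fit in
       the ten decimal digits of a 32-bit unsigned int. */
    static uint32_t char_to_code(char c) {
        if (c >= 'a' && c <= 'z') return (uint32_t)(c - 'a') + 10;
        return c == ' ' ? 36 : 37; /* anything else is treated as '.' */
    }

    static char code_to_char(uint32_t code) {
        if (code <= 35) return (char)('a' + (code - 10));
        return code == 36 ? ' ' : '.';
    }

    /* Five chars -> one unsigned int, two decimal digits per char. */
    static uint32_t pack5(const char *s) {
        uint32_t packed = 0;
        for (int i = 0; i < 5; i++)
            packed = packed * 100 + char_to_code(s[i]);
        return packed; /* worst case 3737373737, below UINT32_MAX */
    }

    static void unpack5(uint32_t packed, char out[6]) {
        for (int i = 4; i >= 0; i--) {
            out[i] = code_to_char(packed % 100);
            packed /= 100;
        }
        out[5] = '\0';
    }

    int main(void) {
        char buf[6];
        uint32_t p = pack5("hello");
        unpack5(p, buf);
        printf("%u -> %s\n", (unsigned)p, buf); /* 1714212124 -> hello */
        return 0;
    }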

    • Billegh@lemmy.world · 4 months ago

      But that’s only four chars in four bytes. This absolute madlad has put five chars in four bytes.

  • Zacryon@feddit.org · 4 months ago

    At first I thought, “How are they going to compress 256 values, i.e. 1-byte-sized data, by ‘rearranging them into integers’?”

    Then I saw your code and realized you are discarding 228 of them, effectively reducing the available symbol set by about 89%.

    Speaking of efficiency: since chars are essentially unsigned integers of size 1 byte, and ‘a’ to ‘z’ are the values 97 to 122 (decimal, both inclusive), you can greatly simplify your turn_char_to_int method by just subtracting 87 from each symbol to get it into your desired value range, instead of using this cumbersome switch-case structure. Space (32) and dot (46) would still need special handling to fit your desired range, though.
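
    For example, a sketch of that simplification (the codes I picked for space and dot are arbitrary):

        /* 'a' (97) .. 'z' (122) minus 87 gives 10..35; space and dot
           get the next two codes. Returns -1 for unsupported symbols. */
        int turn_char_to_int(char c) {
            if (c >= 'a' && c <= 'z') return c - 87;
            if (c == ' ') return 36;
            if (c == '.') return 37;
            return -1;
        }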

    Bit-encoding your chosen 28 values directly would require 5 bits.

    • DacoTaco@lemmy.world · 4 months ago (edited)

      Cache, man, it’s a fun thing. 32k is a common cache line size. Some compilers realise that your data might be hit often and align it to the start of a cache line to make access fast and easy. So yes, it might allocate more memory than it strictly needs, but that’s to align the data to something like a cache line.
      There are also hardware reasons why that might be the case. I know the Wii’s main processor communicates with its co-processor over memory locations that should be 32k-aligned because of access speed, not only because of cache. Sometimes more is less :')

      Hell, it might even be a case of instruction speed, where loading and handling 32k of data is faster than a single byte :')

    • enumerator4829@sh.itjust.works · 4 months ago

      Lol, using RAM like it’s last century. We have enough L3 cache to fit a full Linux desktop. Git gud and don’t miss it (/s).

      (As an aside: now I want to see a version of Puppy Linux running entirely in L3 cache.)

      • BartyDeCanter@lemmy.sdf.org · 4 months ago

        I decided to take a look, and my current CPU has as much L1 cache as my high school computer had total RAM. And its L3 is the same as the total for the machine I built in college. It should be possible to run a great desktop environment entirely in L3.

  • solrize@lemmy.ml · 4 months ago

    Back in the day those tricks were common. Some PDP-11 OSes supported a “Radix-50” encoding (50 octal = 40 decimal) that packed 3 characters into a 16-bit word (40 codepoints: 26 letters, 10 digits, and a few punctuation characters). So you could fit a 6.3 filename in 3 words.
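
    A sketch of that trick in C (this follows the common DEC code assignment, but the table details varied between systems):

        #include <stdint.h>

        /* Radix-50: 40 codepoints, so three characters give
           40^3 = 64000 combinations, which fit in one 16-bit
           word (64000 < 65536). */
        static uint16_t rad50_code(char c) {
            if (c == ' ') return 0;
            if (c >= 'A' && c <= 'Z') return (uint16_t)(c - 'A' + 1);  /* 1..26  */
            if (c == '$') return 27;
            if (c == '.') return 28;
            if (c >= '0' && c <= '9') return (uint16_t)(c - '0' + 30); /* 30..39 */
            return 29; /* the leftover codepoint */
        }

        /* Three characters -> one 16-bit word. */
        static uint16_t rad50_pack(const char s[3]) {
            return (uint16_t)((rad50_code(s[0]) * 40 + rad50_code(s[1])) * 40
                              + rad50_code(s[2]));
        }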

  • hdsrob@lemmy.world · 4 months ago

    We have a binary file that has to maintain compatibility with a 16-bit Power Basic app that hasn’t been recompiled since '99 or '00. We have storage for 8-character strings in two ints, and for 12-character strings in two ints and two shorts.
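
    Presumably something like this hypothetical layout (field names invented; assumes 32-bit ints and 16-bit shorts):

        #include <stdint.h>

        /* Two 4-byte ints hold exactly 8 characters; two ints plus
           two 2-byte shorts hold exactly 12 (8 + 4). */
        struct legacy_record {
            int32_t name_a, name_b;   /* 8-character string          */
            int32_t desc_a, desc_b;   /* first 8 of a 12-char string */
            int16_t desc_c, desc_d;   /* last 4 of the 12 chars      */
        };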

  • drath@lemmy.world · 4 months ago (edited)

    Oh god, please don’t. Just use utf8mb4 like a normal human being, and let the encoding issues finally die out (when Microsoft kills code pages). If space is a consideration, just use compression, like gz or something.

      • magic_lobster_party@fedia.io · 4 months ago

        Well, it’s certainly possible to fit both uppercase and lowercase plus 11 additional characters inside an int (26 + 26 + 11 = 63). Then you need a null-terminating char, which brings it up to 64 values, which fits in 6 bits.

        So all you need is 6 bits per char. 6 * 5 = 30, which is less than 32.

        It’s easier to do this by thinking in binary rather than decimal. Look into bit shifting and other bitwise operations.
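
        A sketch of that packing (the 6-bit code assignment here is my own arbitrary choice):

            #include <stdint.h>

            /* Arbitrary 6-bit alphabet: 0 = terminator, 1..26 = 'a'..'z',
               27..52 = 'A'..'Z', leaving 11 spare codes (53..63). */
            static uint32_t to6(char c) {
                if (c >= 'a' && c <= 'z') return (uint32_t)(c - 'a') + 1;
                if (c >= 'A' && c <= 'Z') return (uint32_t)(c - 'A') + 27;
                return 0;
            }

            /* Packs up to five chars into the low 30 bits of a uint32_t,
               6 bits each, zero-padded once the string ends. */
            static uint32_t pack5x6(const char *s) {
                uint32_t packed = 0;
                for (int i = 0; i < 5; i++) {
                    packed <<= 6;
                    if (*s) packed |= to6(*s++);
                }
                return packed;
            }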

        • lad@programming.dev · 4 months ago

          Depending on the use case you might also want to add a special escape value, like @Redkey@programming.dev did in their example, and get something like UTF-8 code pages. Then you can pack lowercase letters into 5 bits, and uppercase and some special symbols into 10 bits, and the result will be smaller if uppercase characters are rare.
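
          A hypothetical sketch of that idea (using 5-bit code 31 as the escape marker is my choice; @Redkey@programming.dev’s original example isn’t reproduced here):

              #include <stdint.h>

              /* Minimal bit writer: appends the low n bits of val, MSB first. */
              typedef struct { uint8_t buf[64]; int bits; } bitwriter;

              static void put_bits(bitwriter *w, unsigned val, int n) {
                  for (int i = n - 1; i >= 0; i--, w->bits++)
                      if ((val >> i) & 1)
                          w->buf[w->bits / 8] |= (uint8_t)(0x80u >> (w->bits % 8));
              }

              /* Lowercase letters cost 5 bits; code 31 escapes to an
                 "uppercase page", so uppercase letters cost 10 bits. */
              static void put_char(bitwriter *w, char c) {
                  if (c >= 'a' && c <= 'z') {
                      put_bits(w, (unsigned)(c - 'a'), 5);
                  } else if (c >= 'A' && c <= 'Z') {
                      put_bits(w, 31, 5); /* escape to the uppercase page */
                      put_bits(w, (unsigned)(c - 'A'), 5);
                  }
              }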

      • Cethin@lemmy.zip · 4 months ago (edited)

        If you’re ever doing optimizations like this, always think in binary. You’re causing yourself more trouble by thinking in decimal. With n bits you can represent 2^n different values. Using this, you can figure out how many bits you need to store however many different potential characters. 26 letters can be stored in 5 bits, with 6 extra possibilities. 52 letters can be stored in 6 bits, with 12 extra possibilities. Assuming you want an end-of-string character, you have 11 more options.

        If you want optimal packing, you could fit 8 characters into 48 bits, i.e. 6 bytes.
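
        For instance, a sketch (code assignment is arbitrary: 0 as end-of-string, 1–26 lowercase, 27–52 uppercase, 11 codes spare):

            #include <stdint.h>

            /* Eight 6-bit codes fill the low 48 bits of a 64-bit int;
               0 doubles as the end-of-string marker. */
            static uint64_t pack8x6(const char *s) {
                uint64_t packed = 0;
                for (int i = 0; i < 8; i++) {
                    uint64_t code = 0;
                    if (*s >= 'a' && *s <= 'z') code = (uint64_t)(*s - 'a') + 1;
                    else if (*s >= 'A' && *s <= 'Z') code = (uint64_t)(*s - 'A') + 27;
                    packed = (packed << 6) | code;
                    if (*s) s++;
                }
                return packed;
            }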

    • AllNewTypeFace@leminal.space · 4 months ago

      I did something like this once, in the course of a project whose purpose I don’t remember. Realising that 8-bit ASCII was wasted on the constrained alphabet of some kind of identifiers, I packed them into 6 bits, 1⅓ to a byte. I recall naming the code to do this “ShortASC”, pronounced “short-ass”.

  • DarkCloud@lemmy.world · 4 months ago

    Use strings for everything, and use a single universal method to convert some of them to floats only when you absolutely have to.

  • HyperfocusSurfer@lemmy.dbzer0.com · 4 months ago

    Mostly because compilers do this kind of stuff if you optimize for space, iirc. Not that you should never do it, but it kinda looks like premature optimization to me.

    • Cethin@lemmy.zip · 4 months ago

      No, it doesn’t. It can’t know if you need a full set of characters or only a subset of them, so it can’t optimize like this. If you know you only need to represent capital letters and a few punctuations, you can do something like the OP. The compiler has to assume you could need the full range of characters represented by the format though (especially since it doesn’t even know if you’ll continue to use them as characters—you may want to do arithmetic on them).

      Definitely premature optimization, and not even close to optimal either. It’s next to useless, but I think the OP was just having fun.