Did you ever see a char and think: “Damn, 1 byte for a single char is pretty darn inefficient”? No? Well, I did. So what I decided to do instead was to take 5 chars, convert each char to a 2-digit integer, and then concat those five 2-digit ints into one big unsigned int, and boom, I saved 5 chars using only 4 bytes instead of 5. The reason this works is that one unsigned int is a ten-digit number, so I can save one char in 2 digits. In theory you could encode 32 different chars with this technique (the first two digits of the maximum unsigned int are 42, and if you don’t want to account for a possible 0 at the beginning, you end up with 32 chars). If you instead wanted to support all 256 char values (3 digits each), you could save exactly 3 chars. Why should anyone do that? Idk. Is it way too much work to be useful? Yes. Was it funny? Yes.
Anyone who’s interested in the code: here’s how I did it in C: https://pastebin.com/hDeHijX6
Yes I know, the code is probably bad, but I do not care. It was just a funny useless idea I had.
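For anyone who doesn’t want to click through, here’s a minimal sketch of the idea (my own reconstruction, not the pastebin code): a–z map to the two-digit codes 10–35, space and dot to 36 and 37 (my own assignments), and five codes get concatenated in base 100. Since every code is at least 10, the biggest packed value is 3737373737, which still fits under the unsigned int maximum of 4294967295.

```c
#include <assert.h>

/* Sketch of the decimal-packing idea: each char becomes a two-digit code
 * (10..37), five codes concatenated in base 100 into one unsigned int.
 * The 36/37 codes for space and dot are an assumption, not the OP's exact table. */

static unsigned encode_char(char c) {
    if (c >= 'a' && c <= 'z') return (unsigned)(c - 'a') + 10; /* 10..35 */
    if (c == ' ') return 36;
    return 37; /* '.' (and anything else, for simplicity) */
}

static char decode_char(unsigned code) {
    if (code >= 10 && code <= 35) return (char)('a' + (code - 10));
    if (code == 36) return ' ';
    return '.';
}

/* Pack exactly five chars; first char lands in the most significant digit pair. */
unsigned pack5(const char *s) {
    unsigned packed = 0;
    for (int i = 0; i < 5; i++)
        packed = packed * 100u + encode_char(s[i]);
    return packed;
}

void unpack5(unsigned packed, char out[6]) {
    for (int i = 4; i >= 0; i--) {
        out[i] = decode_char(packed % 100u);
        packed /= 100u;
    }
    out[5] = '\0';
}
```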
C lets you do this by putting text in single quotes:
int foo = 'Abcd';
works
But that’s only four chars in four bytes. This absolute madlad has put five chars in four bytes.
At first I thought, “How are they going to compress 256 values, i.e. 1-byte-sized data, by ‘rearranging into integers’?”
Then I saw your code and realized you are discarding 228 of them, effectively reducing the available symbol set by about 89%.
Speaking of efficiency: Since chars are essentially unsigned integers of size 1 byte and ‘a’ to ‘z’ are values 97 to 122 (decimal, both inclusive), you can greatly simplify your turn_char_to_int method by just subtracting 87 from each symbol to get them into your desired value range, instead of using this cumbersome switch-case structure. Space (32) and dot (46) would still need special handling though, to fit your desired range. Bit-encoding your chosen 28 values directly would require 5 bits.
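As a sketch, the subtraction trick could look like this (the codes 36 and 37 for space and dot are my own choice, not from the pastebin):

```c
#include <assert.h>

/* 'a'..'z' are 97..122 in ASCII, so subtracting 87 maps them straight
 * to 10..35; only space and dot need special cases. Replaces the big
 * switch-case with three lines. */
unsigned turn_char_to_int(char c) {
    if (c == ' ') return 36;
    if (c == '.') return 37;
    return (unsigned)c - 87; /* 'a' (97) -> 10, 'z' (122) -> 35 */
}
```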

Now use the int-chars to store prefixes for a smart string, or pack them to make a SIMD-optimized B-tree with 40 string prefixes per node instead of 32.
CPU still pulls a 32kb block from RAM…
Cache man, it’s a fun thing. 32k is a common cache line size. Some compilers realise that your data might be hit often and align it to a cache line start to make its access fast and easy. So yes, it might allocate more memory than it should need, but then it’s to align the data to something like a cache line.
There are also hardware reasons why that might be the case. I know the Wii’s main processor communicates with the co-processor over memory locations that should be 32k-aligned because of access speed, not only because of cache. Sometimes more is less :') Hell, it might even be that for instruction-speed reasons, loading and handling 32k of data is faster than a single byte :')
Cache Man, I would watch that movie.
Cache lines are 64 bytes though? Pages are 4k.
Ye derp, I’m used to 32, not 32k lol.
Lol, using RAM like last century. We have enough L3 cache for a full Linux desktop these days. Git gud and don’t miss it (/s).
(As an aside, now I want to see a version of puppylinux running entirely in L3 cache)
I decided to take a look and my current CPU has the same L1 as my high school computer had total RAM. And the L3 is the same as the total for the machine I built in college. It should be possible to run a great desktop environment entirely in L3.
Look at this guy with their fancy RAM caches.
Dammit, yesterday was too long, I thought this was a DnD joke at first.
Me too! Haven’t had my coffee yet. I was like
“… character…? Charisma…? (blink blink)”
Coffee? Fuck, that’s what’s going on, I knew it. Hold on.
Check out “densely packed decimal” encoding.
Back in the day those tricks were common. Some PDP-11 OSes supported a “Radix-50” encoding (50 octal = 40 decimal) that packed 3 characters into a 16-bit word (40 code points = 26 letters, 10 digits, and a few punctuation marks). So you could have a 6.3 filename in 3 words.
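A rough reconstruction of that packing (the exact RAD50 character order varied a bit between DEC systems; this is one common layout) — three codes fit because 40³ = 64000 ≤ 65536:

```c
#include <assert.h>
#include <string.h>

/* One common Radix-50 charset: space, A-Z, '$', '.', an unused slot, 0-9.
 * Three characters per 16-bit word: c0*40*40 + c1*40 + c2. */
static const char rad50_chars[41] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789";

static unsigned rad50_code(char c) {
    const char *p = strchr(rad50_chars, c);
    return p ? (unsigned)(p - rad50_chars) : 0; /* unknown chars become space */
}

unsigned short rad50_pack3(const char s[3]) {
    return (unsigned short)(rad50_code(s[0]) * 1600u
                          + rad50_code(s[1]) * 40u
                          + rad50_code(s[2]));
}

void rad50_unpack3(unsigned short w, char out[4]) {
    out[0] = rad50_chars[(w / 1600) % 40];
    out[1] = rad50_chars[(w / 40) % 40];
    out[2] = rad50_chars[w % 40];
    out[3] = '\0';
}
```

A 6.3 filename is then two words for the name plus one word for the extension.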
We have a binary file that has to maintain compatibility with a 16-bit Power Basic app that hasn’t been recompiled since '99 or '00. We have storage for 8-character strings in two ints, and 12-character strings in two ints and two shorts.
Damn, those are setups where you have to get creative.
Oh god, please don’t. Just use utf8mb4 like a normal human being, and let the encoding issues finally die out (when Microsoft kills code pages). If space is a concern, just use compression, like gz or something.
It’s all fun and games until the requirement changes and you need to support uppercase letters and digits as well.
I am constantly thinking about how I can allow uppercase without significantly reducing the possible number of chars.
Well it’s certainly possible to fit both uppercase and lowercase + 11 additional characters inside an int (26 + 26 + 11 = 63). Then you need a null-terminating char, which adds up to 64 values, which is 6 bits.
So all you need is 6 bits per char. 6 * 5 = 30, which is less than 32.
It’s easier to do this by thinking in binary rather than decimals. Look into bit shifting and other bitwise operations.
Depending on the use case you might also want to add special-case values like @Redkey@programming.dev did in their example, and get something like UTF-8 code pages. Then you can pack lowercase into 5 bits, and uppercase and some special symbols into 10 bits, and it will be smaller if uppercase characters are rare.
If you’re ever doing optimizations like this, always think in binary. You’re causing yourself more trouble by thinking in decimal. With n bits you can represent 2^n different results. Using this you can figure out how many bits you need to store however many different potential characters. 26 letters can be stored in 5 bits, with 6 extra possibilities. 52 letters can be stored in 6 bits, with 12 extra possibilities. Assuming you want an end of string character, you have 11 more options.
If you want optimal packing, you could pack this into 48 bits, or 6 bytes/chars, for 8 characters.
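Putting the comments above together, here’s a sketch of the 6-bit variant in C (the specific code assignments — 0 for end-of-string, 1–26 for a–z, 27–52 for A–Z — are my own; five 6-bit codes use 30 of the 32 bits):

```c
#include <assert.h>

/* 6 bits per char: 0 = end-of-string, 1..26 = a-z, 27..52 = A-Z,
 * leaving 11 spare codes. Packed with shifts instead of decimal math. */
static unsigned code6(char c) {
    if (c >= 'a' && c <= 'z') return (unsigned)(c - 'a') + 1;
    if (c >= 'A' && c <= 'Z') return (unsigned)(c - 'A') + 27;
    return 0;
}

static char char6(unsigned v) {
    if (v >= 1 && v <= 26) return (char)('a' + v - 1);
    if (v >= 27 && v <= 52) return (char)('A' + v - 27);
    return '\0';
}

/* Up to five chars per 32-bit int; shorter strings end in code 0. */
unsigned pack5x6(const char *s) {
    unsigned packed = 0;
    for (int i = 0; i < 5 && s[i]; i++)
        packed |= code6(s[i]) << (6 * i);
    return packed;
}

void unpack5x6(unsigned packed, char out[6]) {
    for (int i = 0; i < 5; i++)
        out[i] = char6((packed >> (6 * i)) & 0x3Fu);
    out[5] = '\0';
}
```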
Bro reinventing SIXBIT
Didn’t know that this existed, but yeah, it’s kinda like that. Except that I only allow 5 characters.
Have you heard of Proquints?
https://www.ietf.org/archive/id/draft-rayner-proquint-03.html
Funny how they have a typo in the test vectors:
0x0000 -> babab
0xFFFF -> zvzuz
0x1234 -> damuh
0xF00D -> zabat
0xBEEF -> ruroz
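The vectors are easy to check with a tiny encoder (alphabets per the draft: 16 consonants “bdfghjklmnprstvz”, 4 vowels “aiou”; a 16-bit word becomes consonant-vowel-consonant-vowel-consonant). Running it, 0xFFFF comes out as “zuzuz”, which confirms the “zvzuz” typo.

```c
#include <assert.h>

/* Proquint encoding of one 16-bit word: split into 4-2-4-2-4 bit fields,
 * consonants index the 4-bit fields, vowels the 2-bit fields. */
static const char con[17] = "bdfghjklmnprstvz";
static const char vow[5]  = "aiou";

void proquint16(unsigned short w, char out[6]) {
    out[0] = con[(w >> 12) & 0xF];
    out[1] = vow[(w >> 10) & 0x3];
    out[2] = con[(w >> 6) & 0xF];
    out[3] = vow[(w >> 4) & 0x3];
    out[4] = con[w & 0xF];
    out[5] = '\0';
}
```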
This is hilarious. I’m not sure how often anyone would actually need to verbalize arbitrary binary data, but I do see an advantage over base64 since the English letter names are so often phonetically similar.
Given you only use a–z plus space and dot, you could do 5 bits and pack 6 characters per 32-bit word with room to spare.
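Sketch of that layout (the code assignments — 1–26 for a–z, 27 for space, 28 for dot, 0 for end — are arbitrary; six 5-bit codes use 30 of the 32 bits):

```c
#include <assert.h>

/* 5 bits per char: 0 = end-of-string, 1..26 = a-z, 27 = space, 28 = dot.
 * Six codes fit in one 32-bit word, leaving 2 bits spare. */
static unsigned code5(char c) {
    if (c >= 'a' && c <= 'z') return (unsigned)(c - 'a') + 1;
    if (c == ' ') return 27;
    if (c == '.') return 28;
    return 0;
}

unsigned pack6x5(const char *s) {
    unsigned packed = 0;
    for (int i = 0; i < 6 && s[i]; i++)
        packed |= code5(s[i]) << (5 * i);
    return packed;
}
```

Decoding is the mirror image: mask with 0x1F and shift right in steps of 5.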
I did something like this once, in the course of a project whose purpose I don’t remember. Realising that 8-bit ASCII was wasted on the constrained alphabet of some kind of identifiers, I packed them into 6 bits, 1⅓ to a byte. I recall naming the code to do this “ShortASC”, pronounced “short-ass”
Use strings for everything and use a single universal method to convert some to floats only when you absolutely have to.
Mostly because compilers do this kind of stuff if you optimize for space, iirc. Not that you should never do it or something, but it kinda looks like premature optimization to me.
No, it doesn’t. It can’t know if you need a full set of characters or only a subset of them, so it can’t optimize like this. If you know you only need to represent capital letters and a few punctuations, you can do something like the OP. The compiler has to assume you could need the full range of characters represented by the format though (especially since it doesn’t even know if you’ll continue to use them as characters—you may want to do arithmetic on them).
Definitely premature optimization, and not even close to optimal either. It’s next to useless, but I think the OP was just having fun.
Idk, I just had this funny idea, and thought I could do this as a cool and quick proof of concept.
In typical C fashion, there’s undefined behavior in turn_char_to_int. xD

Did not want to always scroll past that behemoth of a switch-case xD