Did you ever see a char and think: “Damn, 1 byte for a single char is pretty darn inefficient”? No? Well, I did. So what I decided to do instead was to take 5 chars, convert each char to a 2-digit integer, and then concat those five 2-digit ints into one big unsigned int, and boom, I saved 5 chars using only 4 bytes instead of 5. The reason this works is that one unsigned int is a ten-digit number, so I can save one char in 2 digits. In theory you could encode 32 different chars with this technique (the first two digits of the maximum unsigned int are 42, and if you don’t want to account for a possible 0 at the beginning, you end up with 32 chars). If you instead wanted to support all 256 char values (3 digits each), you could save exactly 3 chars. Why should anyone do that? Idk. Is it way too much work to be useful? Yes. Was it funny? Yes.
Anyone who’s interested in the code: here’s how I did it in C: https://pastebin.com/hDeHijX6
Yes I know, the code is probably bad, but I do not care. It was just a funny useless idea I had.
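For anyone who doesn’t want to click through, here’s a minimal sketch of the idea (my own reconstruction, not the pastebin code): a–z map to the two-digit codes 10–35, space and dot to 36 and 37 (my own assignments), and five codes get concatenated in base 100. Since every code is at least 10, the biggest packed value is 3737373737, which still fits under the unsigned int maximum of 4294967295.

```c
#include <assert.h>

/* Sketch of the decimal-packing idea: each char becomes a two-digit code
 * (10..37), five codes concatenated in base 100 into one unsigned int.
 * The 36/37 codes for space and dot are an assumption, not the OP's exact table. */

static unsigned encode_char(char c) {
    if (c >= 'a' && c <= 'z') return (unsigned)(c - 'a') + 10; /* 10..35 */
    if (c == ' ') return 36;
    return 37; /* '.' (and anything else, for simplicity) */
}

static char decode_char(unsigned code) {
    if (code >= 10 && code <= 35) return (char)('a' + (code - 10));
    if (code == 36) return ' ';
    return '.';
}

/* Pack exactly five chars; first char lands in the most significant digit pair. */
unsigned pack5(const char *s) {
    unsigned packed = 0;
    for (int i = 0; i < 5; i++)
        packed = packed * 100u + encode_char(s[i]);
    return packed;
}

void unpack5(unsigned packed, char out[6]) {
    for (int i = 4; i >= 0; i--) {
        out[i] = decode_char(packed % 100u);
        packed /= 100u;
    }
    out[5] = '\0';
}
```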
C lets you do this by putting text in single quotes:
int foo = 'Abcd';
works
But that’s only four chars in four bytes. This absolute madlad has put five chars in four bytes.
At first I thought, “How are they going to compress 256 values, i.e. 1-byte-sized data, by ‘rearranging into integers’?”
Then I saw your code and realized you are discarding 228 of them, effectively reducing the available symbol set by about 89%.
Speaking of efficiency: Since chars are essentially unsigned integers of size 1 byte and ‘a’ to ‘z’ are values 97 to 122 (decimal, both inclusive), you can greatly simplify your turn_char_to_int method by just subtracting 87 from each symbol to get them into your desired value range, instead of using this cumbersome switch-case structure. Space (32) and dot (46) would still need special handling though, to fit your desired range. Bit-encoding your chosen 28 values directly would require 5 bits.
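As a sketch, the subtraction trick could look like this (the codes 36 and 37 for space and dot are my own choice, not from the pastebin):

```c
#include <assert.h>

/* 'a'..'z' are 97..122 in ASCII, so subtracting 87 maps them straight
 * to 10..35; only space and dot need special cases. Replaces the big
 * switch-case with three lines. */
unsigned turn_char_to_int(char c) {
    if (c == ' ') return 36;
    if (c == '.') return 37;
    return (unsigned)c - 87; /* 'a' (97) -> 10, 'z' (122) -> 35 */
}
```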

Now use the int-chars to store prefixes for a smart string, or pack them to make a SIMD-optimized B-tree with 40 string prefixes per node instead of 32.
CPU still pulls a 32kb block from RAM…
Cache man, it’s a fun thing. 32k is a common cache line size. Some compilers realise that your data might be hit often and align it to a cache line start to make its access fast and easy. So yes, it might allocate more memory than it should need, but then it’s to align the data to something like a cache line.
There are also hardware reasons why that might be the case. I know the Wii’s main processor communicates with the co-processor over memory locations that should be 32k-aligned because of access speed, not only because of cache. Sometimes more is less :') Hell, it might even be that for instruction-speed reasons, loading and handling 32k of data is faster than a single byte :')
Cache Man, I would watch that movie.
Cache lines are 64 bytes though? Pages are 4k.
Ye derp, I’m used to 32, not 32k lol.
Lol, using RAM like last century. We have enough L3 cache for a full Linux desktop these days. Git gud and don’t miss it (/s).
(As an aside, now I want to see a version of puppylinux running entirely in L3 cache)
I decided to take a look and my current CPU has the same L1 as my high school computer had total RAM. And the L3 is the same as the total for the machine I built in college. It should be possible to run a great desktop environment entirely in L3.
Look at this guy with their fancy RAM caches.
Dammit, yesterday was too long, I thought this was a DnD joke at first.
Me too! Haven’t had my coffee yet. I was like
“… character…? Charisma…? (blink blink)”
Coffee? Fuck, that’s what’s going on, I knew it. Hold on.
Check out “densely packed decimal” encoding.
Back in the day those tricks were common. Some PDP-11 OSes supported a “Radix-50” encoding (50 octal = 40 decimal) that packed 3 characters into a 16-bit word (40 code points = 26 letters, 10 digits, and a few punctuation marks). So you could have a 6.3 filename in 3 words.
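A rough reconstruction of that packing (the exact RAD50 character order varied a bit between DEC systems; this is one common layout) — three codes fit because 40³ = 64000 ≤ 65536:

```c
#include <assert.h>
#include <string.h>

/* One common Radix-50 charset: space, A-Z, '$', '.', an unused slot, 0-9.
 * Three characters per 16-bit word: c0*40*40 + c1*40 + c2. */
static const char rad50_chars[41] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789";

static unsigned rad50_code(char c) {
    const char *p = strchr(rad50_chars, c);
    return p ? (unsigned)(p - rad50_chars) : 0; /* unknown chars become space */
}

unsigned short rad50_pack3(const char s[3]) {
    return (unsigned short)(rad50_code(s[0]) * 1600u
                          + rad50_code(s[1]) * 40u
                          + rad50_code(s[2]));
}

void rad50_unpack3(unsigned short w, char out[4]) {
    out[0] = rad50_chars[(w / 1600) % 40];
    out[1] = rad50_chars[(w / 40) % 40];
    out[2] = rad50_chars[w % 40];
    out[3] = '\0';
}
```

A 6.3 filename is then two words for the name plus one word for the extension.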
We have a binary file that has to maintain compatibility with a 16-bit Power Basic app that hasn’t been recompiled since '99 or '00. We have storage for 8-character strings in two ints, and 12-character strings in two ints and two shorts.
Damn, those are setups where you have to get creative.
Oh god, please don’t. Just use utf8mb4 like a normal human being, and let the encoding issues finally die out (when Microsoft kills code pages). If space is a concern, just use compression, like gz or something.
It’s all fun and games until the requirement changes and you need to support uppercase letters and digits as well.
I am constantly thinking about how I can allow uppercase without significantly reducing the possible number of chars.
Well it’s certainly possible to fit both uppercase and lowercase + 11 additional characters inside an int (26 + 26 + 11 = 63). Then you need a null-terminating char, which adds up to 64 values, which is 6 bits.
So all you need is 6 bits per char. 6 * 5 = 30, which is less than 32.
It’s easier to do this by thinking in binary rather than decimals. Look into bit shifting and other bitwise operations.
Depending on the use case you might also want to add special-case values like @Redkey@programming.dev did in their example, and get something like UTF-8 code pages. Then you can pack lowercase into 5 bits, and uppercase and some special symbols into 10 bits, and it will be smaller if uppercase characters are rare.
If you’re ever doing optimizations like this, always think in binary. You’re causing yourself more trouble by thinking in decimal. With n bits you can represent 2^n different results. Using this you can figure out how many bits you need to store however many different potential characters. 26 letters can be stored in 5 bits, with 6 extra possibilities. 52 letters can be stored in 6 bits, with 12 extra possibilities. Assuming you want an end of string character, you have 11 more options.
If you want optimal packing, you could pack this into 48 bits, or 6 bytes/chars, for 8 characters.
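Putting the comments above together, here’s a sketch of the 6-bit variant in C (the specific code assignments — 0 for end-of-string, 1–26 for a–z, 27–52 for A–Z — are my own; five 6-bit codes use 30 of the 32 bits):

```c
#include <assert.h>

/* 6 bits per char: 0 = end-of-string, 1..26 = a-z, 27..52 = A-Z,
 * leaving 11 spare codes. Packed with shifts instead of decimal math. */
static unsigned code6(char c) {
    if (c >= 'a' && c <= 'z') return (unsigned)(c - 'a') + 1;
    if (c >= 'A' && c <= 'Z') return (unsigned)(c - 'A') + 27;
    return 0;
}

static char char6(unsigned v) {
    if (v >= 1 && v <= 26) return (char)('a' + v - 1);
    if (v >= 27 && v <= 52) return (char)('A' + v - 27);
    return '\0';
}

/* Up to five chars per 32-bit int; shorter strings end in code 0. */
unsigned pack5x6(const char *s) {
    unsigned packed = 0;
    for (int i = 0; i < 5 && s[i]; i++)
        packed |= code6(s[i]) << (6 * i);
    return packed;
}

void unpack5x6(unsigned packed, char out[6]) {
    for (int i = 0; i < 5; i++)
        out[i] = char6((packed >> (6 * i)) & 0x3Fu);
    out[5] = '\0';
}
```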
Bro reinventing SIXBIT
Didn’t know that this existed, but yeah, it’s kinda like that. Except that I only allow 5 characters.
Have you heard of Proquints?
https://www.ietf.org/archive/id/draft-rayner-proquint-03.html
Funny how they have a typo in the test vectors:
0x0000 -> babab
0xFFFF -> zvzuz
0x1234 -> damuh
0xF00D -> zabat
0xBEEF -> ruroz
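The vectors are easy to check with a tiny encoder (alphabets per the draft: 16 consonants “bdfghjklmnprstvz”, 4 vowels “aiou”; a 16-bit word becomes consonant-vowel-consonant-vowel-consonant). Running it, 0xFFFF comes out as “zuzuz”, which confirms the “zvzuz” typo.

```c
#include <assert.h>

/* Proquint encoding of one 16-bit word: split into 4-2-4-2-4 bit fields,
 * consonants index the 4-bit fields, vowels the 2-bit fields. */
static const char con[17] = "bdfghjklmnprstvz";
static const char vow[5]  = "aiou";

void proquint16(unsigned short w, char out[6]) {
    out[0] = con[(w >> 12) & 0xF];
    out[1] = vow[(w >> 10) & 0x3];
    out[2] = con[(w >> 6) & 0xF];
    out[3] = vow[(w >> 4) & 0x3];
    out[4] = con[w & 0xF];
    out[5] = '\0';
}
```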
This is hilarious. I’m not sure how often anyone would actually need to verbalize arbitrary binary data, but I do see an advantage over base64 since the English letter names are so often phonetically similar.
Given you only use a–z plus space and dot, you could do 5 bits and pack 6 characters per 32-bit word with room to spare.
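Sketch of that layout (the code assignments — 1–26 for a–z, 27 for space, 28 for dot, 0 for end — are arbitrary; six 5-bit codes use 30 of the 32 bits):

```c
#include <assert.h>

/* 5 bits per char: 0 = end-of-string, 1..26 = a-z, 27 = space, 28 = dot.
 * Six codes fit in one 32-bit word, leaving 2 bits spare. */
static unsigned code5(char c) {
    if (c >= 'a' && c <= 'z') return (unsigned)(c - 'a') + 1;
    if (c == ' ') return 27;
    if (c == '.') return 28;
    return 0;
}

unsigned pack6x5(const char *s) {
    unsigned packed = 0;
    for (int i = 0; i < 6 && s[i]; i++)
        packed |= code5(s[i]) << (5 * i);
    return packed;
}
```

Decoding is the mirror image: mask with 0x1F and shift right in steps of 5.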
I did something like this once, in the course of a project whose purpose I don’t remember. Realising that 8-bit ASCII was wasted on the constrained alphabet of some kind of identifiers, I packed them into 6 bits, 1⅓ to a byte. I recall naming the code to do this “ShortASC”, pronounced “short-ass”
Use strings for everything and use a single universal method to convert some to floats only when you absolutely have to.
Mostly because compilers do this kind of stuff if you optimize for space, iirc. Not that you should never do it or something, but it kinda looks like premature optimization to me.
No, it doesn’t. It can’t know if you need a full set of characters or only a subset of them, so it can’t optimize like this. If you know you only need to represent capital letters and a few punctuations, you can do something like the OP. The compiler has to assume you could need the full range of characters represented by the format though (especially since it doesn’t even know if you’ll continue to use them as characters—you may want to do arithmetic on them).
Definitely premature optimization, and not even close to optimal either. It’s next to useless, but I think the OP was just having fun.
Idk, I just had this funny idea, and thought I could do this as a cool and quick proof of concept.
In typical C fashion, there’s undefined behavior in turn_char_to_int. xD

Did not want to always scroll past that behemoth of a switch-case xD