Digraph Compression
#3
That said, I did some preliminary tests on my method, and it looks like English text, even complex English with lots of symbols, numbers and some capital letters, comes in well under 5 bits per character. A hand run on a few sentences came to 4.17 bits per character. Since the shortest code is 4 bits, it can't compress to less than 4 bits per character even in theory, so getting that close is quite neat.

The entropy of English is, in theory, usually put at under 2 bits per character, but reaching that with a simple algorithm probably isn't possible. A near 50% reduction from the 8 bits per character of plain text isn't half bad.
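For a rough feel of where a simple scheme can land, here is a minimal sketch that measures the order-0 entropy of a sample, meaning the entropy from single-character frequencies alone, ignoring context. The sub-2-bit figure depends on context between letters, so a code that gives each character a fixed codeword regardless of its neighbours is bounded by this higher number instead. The function name is my own, not part of the method.

```python
from collections import Counter
from math import log2


def order0_entropy(text):
    """Bits per character from single-character frequencies only."""
    counts = Counter(text)
    total = len(text)
    # Shannon entropy: -sum(p * log2(p)) over the observed characters.
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

Running it over the same sentences used for a hand test gives a number to compare the measured bits per character against.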

Basically it fits the most common 60% of the letters into a 4 bit block, so quite commonly it does one character in 4 bits. Failing that, it usually does it in 8 bits. Occasionally it falls back to an expensive 16 bits, where we forget about compression, flag up a standard byte, and emit that as-is.
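The three cost tiers can be sketched as a quick estimator. This is only an illustration: the two tables below are made up (13 characters at 4 bits, leaving 3 nibble values as prefixes for the 8 bit codes), and the real assignments in the method may well differ.

```python
# Hypothetical tables, not the method's actual assignments.
TIER1 = " etaoinshrdlu"                             # 4 bits each
TIER2 = "cmfwygpbvkxjqz0123456789.,'\"-!?\n"        # 8 bits each


def bits_for(ch):
    """Cost in bits of one character under the assumed tables."""
    if ch in TIER1:
        return 4
    if ch in TIER2:
        return 8
    return 16  # escape code plus a standard byte


def bits_per_char(text):
    """Average cost in bits per character for a non-empty string."""
    return sum(bits_for(ch) for ch in text) / len(text)
```

With these tables, `bits_per_char("the rain")` is exactly 4.0, and a single capital letter costs 16 bits because it falls through to the escape tier.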

One imagines that, given there's no ASCII above 127, using the first (top) bit of such a "normal" character to say "the next one isn't compressed either" might be clever. It means that a long string of incompressible data only costs us one flag byte in total.
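A minimal sketch of that continuation-bit idea, assuming byte-aligned output and a made-up escape value of 0xF0 (both are assumptions for illustration, as are the function names):

```python
ESCAPE = 0xF0  # hypothetical flag byte meaning "raw characters follow"


def encode_raw_run(data: bytes) -> bytes:
    """Emit one flag byte, then the raw bytes with the top bit as 'more follows'."""
    if not data:
        raise ValueError("empty run")
    if any(b > 127 for b in data):
        raise ValueError("7-bit ASCII only")
    out = bytearray([ESCAPE])
    last = len(data) - 1
    for i, b in enumerate(data):
        # Set the top bit on every byte except the last one in the run.
        out.append(b | 0x80 if i < last else b)
    return bytes(out)


def decode_raw_run(buf: bytes, pos: int = 0):
    """Return (raw bytes, position just after the run)."""
    if buf[pos] != ESCAPE:
        raise ValueError("not a raw run")
    pos += 1
    out = bytearray()
    while True:
        b = buf[pos]
        pos += 1
        out.append(b & 0x7F)
        if not b & 0x80:  # top bit clear: this was the last raw byte
            break
    return bytes(out), pos
```

A run of n raw characters then costs n + 1 bytes rather than 2n, so `encode_raw_run(b"XYZ")` is 4 bytes where flagging each character separately would take 6.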

Anyway. I think I might press on with this one a bit and see where it takes me.