Digraph Compression - Printable Version

+- Forum (https://www.boriel.com/forum)
+-- Forum: Compilers and Computer Languages (https://www.boriel.com/forum/forumdisplay.php?fid=12)
+--- Forum: ZX Basic Compiler (https://www.boriel.com/forum/forumdisplay.php?fid=11)
+--- Thread: Digraph Compression (/showthread.php?tid=724)


Digraph Compression - britlion - 02-15-2016

After reading a note by Cheveron about digraph compression (http://foro.speccy.org/viewtopic.php?f=6&t=2997&p=33001&hilit=digraph#p33001), I thought I'd run up the English-language version to see what it looked like.

If we take http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/digraphs.html as a start, and work out how frequently each letter appears as the first and as the second letter of a digraph, I end up with this table:

Code:
1st Frq    2nd Frq

Thinking about this, you could store each of those letters as a nybble:

Code:
Val bin    Lft    Rght

I haven't done anything else with it yet, but I thought the thought experiment might be useful if you are going to allocate bits to such a compression scheme.
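[Editor's note: a minimal sketch of the nybble-per-letter packing described above, in Python rather than anything Spectrum-side, just to make the idea concrete. The two letter rankings are placeholders, not the table from the post.]

Code:
# Pack a common digraph into one byte: high nybble = code of the first
# letter, low nybble = code of the second letter.
# ASSUMPTION: these two 15-letter rankings are illustrative placeholders;
# the real assignments would be read off the digraph frequency table.
LEFT  = "taoiswhbecnrfml"   # hypothetical most common first letters
RIGHT = "ehonatirsuldfwy"   # hypothetical most common second letters

def pack_digraph(a, b):
    """Return the digraph packed into a single byte, or None if it doesn't fit."""
    if a in LEFT and b in RIGHT:
        return (LEFT.index(a) << 4) | RIGHT.index(b)
    return None              # caller falls back to storing the letters normally

def unpack_digraph(byte):
    """Recover the two letters from a packed byte."""
    return LEFT[byte >> 4] + RIGHT[byte & 0x0F]

if __name__ == "__main__":
    code = pack_digraph("t", "h")           # "th", the most common English digraph
    print(hex(code), unpack_digraph(code))  # 0x1 th

With 15 letters on each side there is one nybble value spare per side, which could serve as an escape code for pairs that don't fit.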
Re: Digraph Compression - britlion - 02-15-2016

And of course, genius that I am, I think I missed an obvious thing: there are only 40 digraphs in the table. So while my version allows any combination of those letters to be treated as a digraph, most of those combinations are uncommon. It would probably be better to allocate 40 of the 256 byte values to digraphs; then you still have all the symbols uniquely, as well as room for a pile more things longer than digraphs - e.g. your most common whole-word list.

This is what happens when I say "try to do this in English the same way Andrew did there in Spanish". It was fun maths, though.


Re: Digraph Compression - britlion - 02-19-2016

That said, I did some preliminary tests on my method, and for English - even complex English with lots of symbols, numbers and some capital letters - it does look like it comes in well under 5 bits per character. A hand run on a few sentences came to 4.17 bits per character. The scheme can't theoretically compress to less than 4 bits per character, so getting close is quite neat. The entropy of English is, in theory, just under 2 bits per character, but reaching that with a simple algorithm probably isn't possible, and a near 50% reduction isn't half bad.

Basically, it puts the most common letters - covering about 60% of the text - into 4-bit codes, so quite commonly it stores one character in 4 bits. Failing that, it usually does it in 8 bits. Sometimes it falls back to an expensive 16 bits, where we have to forget about the compression, flag up a standard byte, and go with that. One imagines that, given there's no ASCII value above 127, using the top bit of such a "normal" character to say "the next one isn't compressed either" might be clever. It would mean that long strings of uncompressible data only cost us one flag byte.

Anyway, I think I might press on with this one a bit and see where it takes me.
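[Editor's note: a rough Python cost model of one reading of the 4/8/16-bit layout described above - the two character tiers are assumptions, not the actual frequency tables from the posts.]

Code:
# Rough bit-cost model of the 4 / 8 / 16-bit scheme described above.
# ASSUMPTION: the two tiers below are illustrative; the real sets would come
# from the letter frequency tables, with a couple of nybble values kept back
# as escape codes.
TIER4 = " etaoinshrdlu"      # ~13 very common characters -> one nybble each
TIER8 = "cmwfgypbvkjxqz."    # next tier -> escape nybble + a second nybble

def estimate_bits(text):
    """Count the bits the scheme would spend, without building real codes."""
    bits = 0
    for ch in text:
        if ch in TIER4:
            bits += 4        # single nybble
        elif ch in TIER8:
            bits += 8        # escape nybble + code nybble
        else:
            bits += 16       # escape + a full uncompressed byte
    return bits

if __name__ == "__main__":
    sample = "A hand run on a few sentences came to 4.17 bits per character."
    print(estimate_bits(sample) / len(sample), "bits per character (estimated)")

For ordinary lowercase prose most characters land in the 4-bit tier, with capitals, digits and unusual punctuation paying the 16-bit penalty.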
Re: Digraph Compression - boriel - 02-19-2016

britlion!!! Glad to see you back! Actually, I started to read more about digraphs thanks to Cheveron and you. Now I use them a lot.


Re: Digraph Compression - britlion - 03-01-2016

Interesting. What have you used them for, Boriel?


Re: Digraph Compression - boriel - 03-06-2016

britlion Wrote:
Interesting. What have you used them for, Boriel?

I created my own statistical language recognizer, by first analysing the statistical frequency of digraphs and trigraphs in each language. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one. :roll:


Re: Digraph Compression - britlion - 03-10-2016

boriel Wrote:
I created my own statistical language recognizer, by first analysing the statistical frequency of digraphs and trigraphs in each language. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one.

Ooh. I'd never have thought of that one. I'd have been messing with dictionaries and the highest count of recognised words, I think. I wonder how Google Translate does it.


Re: Digraph Compression - boriel - 03-10-2016

I used "trigraphs" later, and it works much better. For 10+ words, the accuracy I got is near 100%.

To "train" the frequencies I use these word count lists from here.
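[Editor's note: a minimal Python sketch of the histogram approach described above. The toy training strings stand in for the word count lists mentioned in the last post, and the sum-of-absolute-differences distance is an assumption; the posts don't say which metric is used.]

Code:
# N-gram histogram language guesser along the lines described above.
# ASSUMPTION: the toy training texts and the L1 (sum of absolute differences)
# distance are placeholders, not the actual word count lists or metric.
from collections import Counter

def ngram_histogram(text, n=3):
    """Relative frequency of each n-gram (trigraphs by default)."""
    text = " ".join(text.lower().split())
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return Counter({g: c / total for g, c in Counter(grams).items()})

def distance(a, b):
    """Sum of absolute frequency differences over all n-grams seen in either."""
    return sum(abs(a[g] - b[g]) for g in set(a) | set(b))

def guess_language(text, profiles):
    """Return the language whose trained histogram is closest to the text's."""
    hist = ngram_histogram(text)
    return min(profiles, key=lambda lang: distance(hist, profiles[lang]))

if __name__ == "__main__":
    profiles = {
        "english": ngram_histogram("the quick brown fox jumps over the lazy dog"
                                   " and the cat sat on the mat"),
        "spanish": ngram_histogram("el rapido zorro marron salta sobre el perro"
                                   " perezoso y el gato se sienta en la alfombra"),
    }
    print(guess_language("the dog and the cat", profiles))   # likely "english"
    print(guess_language("el gato y el perro", profiles))    # likely "spanish"

Switching n from 2 to 3 is the digraph-to-trigraph move mentioned in the last post; longer n-grams carry more context per sample, which is why accuracy improves once the text is more than a few words long.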