Digraph Compression - Printable Version +- Forum (https://www.boriel.com/forum) +-- Forum: Compilers and Computer Languages (https://www.boriel.com/forum/forumdisplay.php?fid=12) +--- Forum: ZX Basic Compiler (https://www.boriel.com/forum/forumdisplay.php?fid=11) +--- Thread: Digraph Compression (/showthread.php?tid=724) Digraph Compression - britlion - 02-15-2016 After reading a note by Cheveron about digraph compression ( http://foro.speccy.org/viewtopic.php?f=6&t=2997&p=33001&hilit=digraph#p33001) I thought I'd run up the English language version to see what it looked like. If we take http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/digraphs.html as a start, and work out what the value of first letter and second letter are, I get this table in the end: Code:```1st    Frq    2nd    Frq e    3.24        n    2.88 t    2.65        e    2.52 h    2.3        t    2.39 i    1.9        h    1.52 a    1.87        r    1.43 o    1.66        s    1.35 n    1.37        d    1.18 s    0.74        a    1.13 r    0.72        i    0.85 l    0.1        o    0.52 d    0.09        u    0.5 v    0.04        g    0.18 u    0.02        f    0.16                 l    0.09``` Thinking about this, you could store these as a nybble each: Code:```Val    bin    Lft    Rght 00    0000    e    n 01    0001    t    e 02    0010    h    t 03    0011    i    h 04    0100    a    r 05    0101    o    s 06    0110    n    d 07    0111    s    a 08    1000    r    i 09    1001    l    o 10    1010    d    u 11    1011    v    g 12    1100    u    f 13    1101    space    l 14    1110    Not Digraph    space 15    1111    Not Digraph```And if the first three bits are 1 (nybble is > 13), you could use the last 5 bits for the letters A-Z comfortably. There's even room for a few other symbols in there. You might have to use one of them to express some non compression symbol and use a separate byte for capitals (or non capitals), and other symbols. Though clearly 26 letters + space is the default for the first 27 numbers of the 5 bits. In a real world usage, you'd probably find the four most common non lowercase characters, and put those in, and then use 255 to mean "use the full next byte as the character code". This does mean that some symbols go double byte, but that's compression for you - you can always find a case that's bigger than the original. Not done anything else with it yet, but I thought the thought experiment might be useful if you are going to allocate bits to such a compression scheme. Re: Digraph Compression - britlion - 02-15-2016 And of course, genius that I am, I think I missed an obvious thing: There are only 40 digraphs in the table. So while my version allows any combo of those letters to be treated as a digraph, most of those combos are uncommon. It would be better, probably to allocate 40 of the 256 possibilities to digraphs, and then you still have all the symbols uniquely, as well as room for a pile more things longer than digraphs - e.g. your most common whole word list. This is what happens when I say "Try to do this in English the same way Andrew did there in Spanish" Was fun maths though Re: Digraph Compression - britlion - 02-19-2016 That said, I did some preliminary tests on my method, and it does look like, for English - even complex English with lots of symbols and numbers and some capital letters, it's well under 5 bits per character. A hand run on a few sentences came to 4.17 bits per character. It can't even theoretically compress to less than 4, so getting close is quite neat. While entropy is for English, technically just under 2 bits per character in theory, reaching that with a simple algorithm probably isn't possible. A near 50% reduction from this isn't half bad. Basically it puts the most common 60% of the letters into each 4 bit block, so quite commonly does one character in 4 bits. Failing that it usually does it in 8 bits. Sometimes it falls back to an expensive 16 bits, as we need to forget about the compression, label up a standard byte, and go with that. One imagines that given there's no ascii > 127, that using the first bit of such a "normal" character to say "the next one isn't compressed either" might be clever. It means that long strings of uncompressable data only cost us one flag byte. Anyway. I think I might press on with this one a bit and see where it takes me. Re: Digraph Compression - boriel - 02-19-2016 britlion!!! Glad to see you back! Actually I started to read more about digraphs thanks to cheveron and you Now I use that a lot. Re: Digraph Compression - britlion - 03-01-2016 Interesting. What have you used them for, Boriel? Re: Digraph Compression - boriel - 03-06-2016 britlion Wrote:Interesting. What have you used them for, Boriel?I create my own statistical language recognizer, by previously analysing statistical frequency of digraphs and trigraphs. Then for a new text, I do the same, and return the "guessed" language by taken the closest histogram to this new one. :roll: Re: Digraph Compression - britlion - 03-10-2016 boriel Wrote:britlion Wrote:Interesting. What have you used them for, Boriel?I create my own statistical language recognizer, by previously analysing statistical frequency of digraphs and trigraphs. Then for a new text, I do the same, and return the "guessed" language by taken the closest histogram to this new one. :roll: Ooh. I'd have never thought of that one. I'd have been messing with dictionaries and highest count of recognised words, I think. I wonder how google translate does it. Re: Digraph Compression - boriel - 03-10-2016 britlion Wrote:boriel Wrote:britlion Wrote:Interesting. What have you used them for, Boriel?I create my own statistical language recognizer, by previously analysing statistical frequency of digraphs and trigraphs. Then for a new text, I do the same, and return the "guessed" language by taken the closest histogram to this new one. :roll: Ooh. I'd have never thought of that one. I'd have been messing with dictionaries and highest count of recognised words, I think. I wonder how google translate does it.I used "trigraphs" later,and works much better. For 10+ words, the accuracy I got is near 100%. To "train" the frequencies I use these word count lists from here.