## Digraph Compression

Posts: 770

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

### Digraph Compression

After reading a note by Cheveron about digraph compression ( http://foro.speccy.org/viewtopic.php?f= ... aph#p33001 ) I thought I'd run up the English language version to see what it looked like.

If we take http://www.math.cornell.edu/~mec/2003-2 ... raphs.html as a start, and work out what the values of the first and second letters are, I get this table in the end:

Code:
```
1st   Frq        2nd   Frq
e     3.24       n     2.88
t     2.65       e     2.52
h     2.3        t     2.39
i     1.9        h     1.52
a     1.87       r     1.43
o     1.66       s     1.35
n     1.37       d     1.18
s     0.74       a     1.13
r     0.72       i     0.85
l     0.1        o     0.52
d     0.09       u     0.5
v     0.04       g     0.18
u     0.02       f     0.16
                 l     0.09
```

Code:
```
Val   bin    Lft           Rght
00    0000   e             n
01    0001   t             e
02    0010   h             t
03    0011   i             h
04    0100   a             r
05    0101   o             s
06    0110   n             d
07    0111   s             a
08    1000   r             i
09    1001   l             o
10    1010   d             u
11    1011   v             g
12    1100   u             f
13    1101   space         l
14    1110   Not Digraph   space
15    1111   Not Digraph
```

And if the first three bits are 1 (i.e. the nybble is > 13), you could use the last 5 bits of the byte for the letters A-Z comfortably. There's even room for a few other symbols in there. You might have to use one of them as an escape for non-compressed symbols, and use a separate byte for capitals (or non-capitals) and other symbols. Though clearly 26 letters + space is the natural default for the first 27 of the 32 values those 5 bits allow. In real-world usage, you'd probably find the four most common non-lowercase characters and put those in, then use 255 to mean "use the full next byte as the character code". That does mean some symbols become double-byte, but that's compression for you - you can always find a case that comes out bigger than the original.
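To make the nybble scheme concrete, here's a minimal Python sketch (my own illustration, not code from the post): the `LEFT` and `RIGHT` strings follow the table above, pairs of table letters pack into one byte, and anything else falls back to a 111-prefixed 5-bit character. The fallback alphabet (a-z plus space) is an assumption.

```python
LEFT  = "ethiaonsrldvu "                # first-nybble letters, values 0-13
RIGHT = "nethrsdaiougfl "               # second-nybble letters, values 0-14
ALPHA = "abcdefghijklmnopqrstuvwxyz "   # assumed 5-bit fallback alphabet

def encode(text):
    # sketch only: handles lowercase letters and spaces
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair[0] in LEFT and pair[1] in RIGHT:
            # both halves hit the digraph tables: two characters, one byte
            out.append((LEFT.index(pair[0]) << 4) | RIGHT.index(pair[1]))
            i += 2
        else:
            # fall back to a 111-prefixed 5-bit single character
            out.append(0b11100000 | ALPHA.index(text[i]))
            i += 1
    return bytes(out)
```

So "the" packs into two bytes: 0x13 for the "th" digraph and 0xE4 for the 111-prefixed lone "e".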

Not done anything else with it yet, but I thought the thought experiment might be useful if you are going to allocate bits to such a compression scheme.

Posts: 770

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

### Re: Digraph Compression

And of course, genius that I am, I think I missed an obvious thing: there are only 40 digraphs in the table. So while my version allows any combination of those letters to be treated as a digraph, most of those combinations are uncommon. It would probably be better to allocate 40 of the 256 possibilities to digraphs; then you still have all the symbols uniquely, as well as room for a pile more things longer than digraphs - e.g. your most common whole-word list.
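A sketch of that byte-codebook idea (my own illustration, with placeholder digraph and word lists - the real scheme would fill in all 40 common digraphs and a measured word list): codes 0-95 map straight to printable ASCII, the next slots hold digraphs, and the remainder holds whole words.

```python
DIGRAPHS = ["th", "he", "in", "er", "an", "re"]   # placeholder; up to 40 slots
WORDS    = ["the ", "and ", "of "]                # placeholder common-word slots

codebook = [chr(c) for c in range(32, 128)]       # 96 literal printable chars
codebook += DIGRAPHS
codebook += WORDS

def compress(text):
    out = []
    i = 0
    while i < len(text):
        # greedily take the longest codebook entry matching at this position
        best = max((s for s in codebook if text.startswith(s, i)), key=len)
        out.append(codebook.index(best))
        i += len(best)
    return bytes(out)
```

With this layout every single character still costs at most one byte, and the digraph/word slots only ever save space.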

This is what happens when I say "Try to do this in English the same way Andrew did there in Spanish"

It was fun maths, though.

Posts: 770

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

### Re: Digraph Compression

That said, I did some preliminary tests on my method, and it does look like, for English - even complex English with lots of symbols, numbers and some capital letters - it comes in well under 5 bits per character. A hand run on a few sentences came to 4.17 bits per character. It can't compress below 4 bits per character even in theory, so getting close is quite neat.

While the entropy of English is, technically, just under 2 bits per character in theory, reaching that with a simple algorithm probably isn't possible. Still, a near-50% reduction from plain 8-bit characters isn't half bad.
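For comparison, the zeroth-order character entropy of a sample is easy to compute (a small helper of my own, not from the post). The sub-2-bit figure for English only emerges once you account for context between characters, which a per-character count like this can't see:

```python
import math
from collections import Counter

def entropy_bits_per_char(text):
    # zeroth-order entropy: treats each character as independent
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

On typical English prose this lands around 4-4.5 bits per character, which is why a hand-counted 4.17 is respectable for such a simple scheme.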

Basically it puts the most common ~60% of letters into the 4-bit blocks, so it quite commonly encodes one character in 4 bits. Failing that, it usually manages 8 bits. Sometimes it falls back to an expensive 16 bits, where we have to forget about the compression, label a standard byte, and go with that.

One imagines that, given there's no ASCII above 127, using the top bit of such a "normal" character byte to say "the next one isn't compressed either" might be clever. It would mean that long runs of uncompressible data only cost us one flag byte.
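Putting the cost model from the last two paragraphs into numbers - a rough per-character tally of my own, and a simplification, since the real scheme only achieves 4 bits when two table letters pair up in one byte:

```python
COMMON = set("ethiaonsrldvu")                  # roughly the digraph-table letters
LOWER  = set("abcdefghijklmnopqrstuvwxyz ")    # assumed 5-bit fallback alphabet

def bits_per_char(text):
    bits = 0
    for ch in text:
        if ch in COMMON:
            bits += 4      # half of a digraph byte
        elif ch in LOWER:
            bits += 8      # 111-prefixed 5-bit fallback byte
        else:
            bits += 16     # escape flag byte + full literal byte
    return bits / len(text)
```

Running ordinary lowercase prose through this lands near the 4.17 bits/char hand count; text heavy with digits and symbols drifts towards 16.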

Anyway. I think I might press on with this one a bit and see where it takes me.

Posts: 1488

Joined: Wed Nov 01, 2006 6:18 pm

Location: Santa Cruz de Tenerife, Spain

### Re: Digraph Compression

britlion!!! Glad to see you back!

Actually, I started to read more about digraphs thanks to cheveron and you.
Now I use them a lot.

Posts: 770

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

### Re: Digraph Compression

Interesting. What have you used them for, Boriel?

Posts: 1488

Joined: Wed Nov 01, 2006 6:18 pm

Location: Santa Cruz de Tenerife, Spain

### Re: Digraph Compression

britlion wrote:Interesting. What have you used them for, Boriel?

I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one.
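A small sketch of the digraph-histogram approach Boriel describes (my own illustration - the training texts and the L1 distance metric are assumptions, not his actual code): each language gets a normalised digraph-frequency histogram, and an unknown text is assigned to the nearest one.

```python
from collections import Counter

def digraph_histogram(text):
    # normalised frequency of every adjacent character pair
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {p: n / total for p, n in pairs.items()}

def distance(h1, h2):
    # L1 distance over the union of observed digraphs (assumed metric)
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0) - h2.get(k, 0)) for k in keys)

def guess_language(text, trained):
    # trained: {"en": histogram, "es": histogram, ...}
    h = digraph_histogram(text)
    return min(trained, key=lambda lang: distance(h, trained[lang]))
```

Training just means running `digraph_histogram` once over a large sample of each language and keeping the results.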

Posts: 770

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

### Re: Digraph Compression

boriel wrote:
britlion wrote:Interesting. What have you used them for, Boriel?

I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one.

Ooh. I'd never have thought of that one. I'd have been messing with dictionaries and the highest count of recognised words, I think.

I wonder how Google Translate does it.

Posts: 1488

Joined: Wed Nov 01, 2006 6:18 pm

Location: Santa Cruz de Tenerife, Spain

### Re: Digraph Compression

britlion wrote:
boriel wrote:
britlion wrote:Interesting. What have you used them for, Boriel?

I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one.

Ooh. I'd never have thought of that one. I'd have been messing with dictionaries and the highest count of recognised words, I think.

I wonder how Google Translate does it.

I used trigraphs later, and it works much better. With 10+ words, the accuracy I got is near 100%.
To "train" the frequencies I used these word-count lists from here.