Digraph Compression
#1
After reading a note by Cheveron about digraph compression ( http://foro.speccy.org/viewtopic.php?f=6...aph#p33001) I thought I'd run up the English language version to see what it looked like.

If we take http://www.math.cornell.edu/~mec/2003-20...raphs.html as a start, and work out the frequency values for the first and second letters of those digraphs, I get this table in the end:

Code:
1st   Freq    2nd   Freq
e     3.24    n     2.88
t     2.65    e     2.52
h     2.30    t     2.39
i     1.90    h     1.52
a     1.87    r     1.43
o     1.66    s     1.35
n     1.37    d     1.18
s     0.74    a     1.13
r     0.72    i     0.85
l     0.10    o     0.52
d     0.09    u     0.50
v     0.04    g     0.18
u     0.02    f     0.16
              l     0.09

Thinking about this, you could store these as a nybble each:
Code:
Val   Bin    Left          Right
00    0000   e             n
01    0001   t             e
02    0010   h             t
03    0011   i             h
04    0100   a             r
05    0101   o             s
06    0110   n             d
07    0111   s             a
08    1000   r             i
09    1001   l             o
10    1010   d             u
11    1011   v             g
12    1100   u             f
13    1101   space         l
14    1110   Not Digraph   space
15    1111   Not Digraph   Not Digraph
And if the first three bits are 1 (the nybble is > 13), you could comfortably use the last 5 bits of the byte for the letters A-Z. There's even room for a few other symbols in there. You might have to use one of those spare values as a non-compression escape, and use a separate byte for capitals (or non-capitals) and other symbols, though clearly 26 letters + space is the default for the first 27 of the 32 values those 5 bits give you. In real-world usage, you'd probably find the four most common non-lowercase characters and put those in, and then use 255 to mean "use the full next byte as the character code". That does mean some symbols go double byte, but that's compression for you - you can always find a case that comes out bigger than the original.
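
Here's a rough Python sketch of how that might hang together. The letter tables follow the nybble table above, but the choice of extra symbols, the 5-bit escape and the 255 fallback are just one reading of the idea, not a finished design:

Code:
# Each output byte is either two 4-bit codes (a first-position letter
# followed by a second-position letter), or an escape with the top three
# bits set plus a 5-bit index into a single-character alphabet, or the
# 255 flag byte followed by a raw character code.

FIRST  = "ethiaonsrldvu "    # codes 0-13; 14/15 mean "not a digraph"
SECOND = "nethrsdaiougfl "   # codes 0-14; 15 means "not a digraph"
SINGLE = "abcdefghijklmnopqrstuvwxyz .,'?"   # 31 values; a guess at which
                                             # four extra symbols to keep

def encode(text):
    out, i = bytearray(), 0
    while i < len(text):
        a = text[i]
        b = text[i + 1] if i + 1 < len(text) else None
        if b is not None and a in FIRST and b in SECOND:
            # Two common letters: pack both into one byte, 4 bits each.
            out.append((FIRST.index(a) << 4) | SECOND.index(b))
            i += 2
        elif a in SINGLE:
            # Escape: top three bits set, then the 5-bit character index.
            out.append(0b11100000 | SINGLE.index(a))
            i += 1
        else:
            # Anything else: 255, then the full character code.
            out.append(255)
            out.append(ord(a) & 0xFF)
            i += 1
    return bytes(out)

text = "the cat sat on the mat"
print(len(text), "characters ->", len(encode(text)), "bytes")

On that little test string it comes out at 12 bytes for 22 characters, which is in the 4-to-5 bits per character ballpark.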


Not done anything else with it yet, but I thought the thought experiment might be useful if you are going to allocate bits to such a compression scheme.
#2
And of course, genius that I am, I think I missed an obvious thing: there are only 40 digraphs in that table. So while my version allows any combination of those letters to be treated as a digraph, most of those combinations are uncommon. It would probably be better to allocate 40 of the 256 possibilities to digraphs; then you still have all the symbols uniquely, as well as room for a pile more things longer than digraphs - e.g. your most common whole-word list.
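
Something like this, roughly - keep the 128 ASCII codes as they are and hand the spare 128 codes to multi-character tokens. The token list here is just a handful of illustrative digraphs and words, not a tuned top-40:

Code:
# Codes 0-127 stay as plain ASCII; codes 128+ are multi-character tokens.
TOKENS = [" the ", " and ", "th", "he", "in", "er", "an", "re",
          "on", "at", "en", "nd"]            # illustrative only - room for ~128
CODES = {tok: 128 + i for i, tok in enumerate(TOKENS)}
BY_LENGTH = sorted(TOKENS, key=len, reverse=True)

def encode(text):
    out, i = bytearray(), 0
    while i < len(text):
        for tok in BY_LENGTH:
            if text.startswith(tok, i):      # greedy: longest token first
                out.append(CODES[tok])
                i += len(tok)
                break
        else:
            out.append(ord(text[i]))         # everything else passes through
            i += 1
    return bytes(out)

sample = "the cat and the dog sat on the mat"
print(len(sample), "characters ->", len(encode(sample)), "bytes")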

This is what happens when I say "Try to do this in English the same way Andrew did there in Spanish"

Was fun maths though. :)
#3
That said, I did some preliminary tests on my method, and it does look like, for English - even complex English with lots of symbols, numbers and some capital letters - it comes out well under 5 bits per character. A hand run on a few sentences came to 4.17 bits per character. The scheme can't theoretically get below 4 bits per character, so getting close to that is quite neat.

While the entropy of English is technically just under 2 bits per character in theory, reaching that with a simple algorithm probably isn't possible. A near-50% reduction from plain 8-bit characters isn't half bad.

Basically it puts the most common letters - around 60% of typical text - into 4-bit codes, so quite often it does one character in 4 bits. Failing that it usually manages it in 8 bits. Sometimes it falls back to an expensive 16 bits, where we have to forget about the compression, flag up a standard byte, and go with that.
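
As a rough tally, using the same toy tables as the sketch in the first post, the three cost tiers look like this (back-of-envelope accounting only, not the real figures):

Code:
# 4 bits when a character lands in its nybble table as part of a pair,
# 8 bits for the 3+5-bit escape, 16 bits for the flag-byte fallback.

FIRST, SECOND = set("ethiaonsrldvu "), set("nethrsdaiougfl ")
SINGLE = set("abcdefghijklmnopqrstuvwxyz .,'?")

def bits_per_char(text):
    bits, i = 0, 0
    while i < len(text):
        a = text[i]
        b = text[i + 1] if i + 1 < len(text) else None
        if b is not None and a in FIRST and b in SECOND:
            bits += 8            # two characters share one byte
            i += 2
        elif a in SINGLE:
            bits += 8            # one character in an escaped byte
            i += 1
        else:
            bits += 16           # flag byte plus a raw byte
            i += 1
    return bits / len(text)

print(round(bits_per_char("Even text with SOME capitals and 123 digits."), 2))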

One imagines that, given there's no ASCII above 127, using the top bit of such a "normal" character to say "the next one isn't compressed either" might be clever. It means that long strings of uncompressible data only cost us one flag byte.
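
A quick sketch of that idea, assuming the 255 flag byte from earlier: since plain ASCII never sets bit 7, each raw byte in a run can set it to mean "another raw byte follows", so the flag is only paid once per run:

Code:
def emit_raw_run(out, chars):
    out.append(255)                  # one flag byte for the whole run
    for j, ch in enumerate(chars):
        raw = ord(ch)                # plain ASCII, so bit 7 is free
        if j < len(chars) - 1:
            raw |= 0x80              # bit 7 set: the next byte is raw too
        out.append(raw)

buf = bytearray()
emit_raw_run(buf, "1234-ABC!")       # 10 bytes instead of 18 with a flag each
print(len(buf))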

Anyway. I think I might press on with this one a bit and see where it takes me.
#4
britlion!!! Glad to see you back! :)

Actually, I started reading more about digraphs thanks to cheveron and you. :)
Now I use them a lot.
#5
Interesting. What have you used them for, Boriel?
#6
britlion Wrote: Interesting. What have you used them for, Boriel?
I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one. :roll:
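
Something like this, perhaps - a reconstruction of the idea rather than the actual code. Build a normalised digraph histogram per language from sample text, then pick whichever stored histogram is closest to the new text's:

Code:
from collections import Counter

def digraph_histogram(text):
    text = "".join(c.lower() for c in text if c.isalpha() or c == " ")
    pairs = Counter(text[i:i+2] for i in range(len(text) - 1))
    total = sum(pairs.values()) or 1
    return {dg: n / total for dg, n in pairs.items()}

def distance(h1, h2):
    # L1 distance between two normalised histograms.
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0) - h2.get(k, 0)) for k in keys)

# Training samples are tiny placeholders - real use needs much more text.
profiles = {
    "english": digraph_histogram("the quick brown fox jumps over the lazy dog"),
    "spanish": digraph_histogram("el veloz zorro marron salta sobre el perro perezoso"),
}

def guess_language(text):
    h = digraph_histogram(text)
    return min(profiles, key=lambda lang: distance(profiles[lang], h))

print(guess_language("this is some new text to classify"))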
#7
boriel Wrote:
britlion Wrote: Interesting. What have you used them for, Boriel?
I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one. :roll:


Ooh. I'd have never thought of that one. I'd have been messing with dictionaries and highest count of recognised words, I think.

I wonder how Google Translate does it.
#8
britlion Wrote:
boriel Wrote:
britlion Wrote: Interesting. What have you used them for, Boriel?
I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then, for a new text, I do the same and return the "guessed" language by taking the closest histogram to the new one. :roll:


Ooh. I'd have never thought of that one. I'd have been messing with dictionaries and highest count of recognised words, I think.

I wonder how Google Translate does it.
I used "trigraphs" later,and works much better. For 10+ words, the accuracy I got is near 100%.
To "train" the frequencies I use these word count lists from here.

