Thread Rating:
  • 1 Vote(s) - 5 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Digraph Compression
#1
After reading a note by Cheveron about digraph compression ( http://foro.speccy.org/viewtopic.php?f=6...aph#p33001) I thought I'd run up the English language version to see what it looked like.

If we take http://www.math.cornell.edu/~mec/2003-20...raphs.html as a start, and work out what the value of first letter and second letter are, I get this table in the end:

Code:
1st    Frq    2nd    Frq
e    3.24        n    2.88
t    2.65        e    2.52
h    2.3        t    2.39
i    1.9        h    1.52
a    1.87        r    1.43
o    1.66        s    1.35
n    1.37        d    1.18
s    0.74        a    1.13
r    0.72        i    0.85
l    0.1        o    0.52
d    0.09        u    0.5
v    0.04        g    0.18
u    0.02        f    0.16
                l    0.09

Thinking about this, you could store these as a nybble each:
Code:
Val    bin    Lft    Rght
00    0000    e    n
01    0001    t    e
02    0010    h    t
03    0011    i    h
04    0100    a    r
05    0101    o    s
06    0110    n    d
07    0111    s    a
08    1000    r    i
09    1001    l    o
10    1010    d    u
11    1011    v    g
12    1100    u    f
13    1101    space    l
14    1110    Not Digraph    space
15    1111    Not Digraph
And if the first three bits are 1 (nybble is > 13), you could use the last 5 bits for the letters A-Z comfortably. There's even room for a few other symbols in there. You might have to use one of them to express some non compression symbol and use a separate byte for capitals (or non capitals), and other symbols. Though clearly 26 letters + space is the default for the first 27 numbers of the 5 bits. In a real world usage, you'd probably find the four most common non lowercase characters, and put those in, and then use 255 to mean "use the full next byte as the character code". This does mean that some symbols go double byte, but that's compression for you - you can always find a case that's bigger than the original.


Not done anything else with it yet, but I thought the thought experiment might be useful if you are going to allocate bits to such a compression scheme.
Reply


Messages In This Thread

Forum Jump:


Users browsing this thread: 3 Guest(s)