02-15-2016, 09:20 PM
(This post was last modified: 03-12-2021, 11:34 AM by boriel.
Edit Reason: Fix URL
)
After reading a note by Cheveron about digraph compression ( http://foro.speccy.org/viewtopic.php?f=6...aph#p33001) I thought I'd run up the English language version to see what it looked like.
If we take http://www.math.cornell.edu/~mec/2003-20...raphs.html as a start, and work out what the value of first letter and second letter are, I get this table in the end:
Thinking about this, you could store these as a nybble each:
And if the first three bits are 1 (nybble is > 13), you could use the last 5 bits for the letters A-Z comfortably. There's even room for a few other symbols in there. You might have to use one of them to express some non compression symbol and use a separate byte for capitals (or non capitals), and other symbols. Though clearly 26 letters + space is the default for the first 27 numbers of the 5 bits. In a real world usage, you'd probably find the four most common non lowercase characters, and put those in, and then use 255 to mean "use the full next byte as the character code". This does mean that some symbols go double byte, but that's compression for you - you can always find a case that's bigger than the original.
Not done anything else with it yet, but I thought the thought experiment might be useful if you are going to allocate bits to such a compression scheme.
If we take http://www.math.cornell.edu/~mec/2003-20...raphs.html as a start, and work out what the value of first letter and second letter are, I get this table in the end:
Code:
1st Frq 2nd Frq
e 3.24 n 2.88
t 2.65 e 2.52
h 2.3 t 2.39
i 1.9 h 1.52
a 1.87 r 1.43
o 1.66 s 1.35
n 1.37 d 1.18
s 0.74 a 1.13
r 0.72 i 0.85
l 0.1 o 0.52
d 0.09 u 0.5
v 0.04 g 0.18
u 0.02 f 0.16
l 0.09
Thinking about this, you could store these as a nybble each:
Code:
Val bin Lft Rght
00 0000 e n
01 0001 t e
02 0010 h t
03 0011 i h
04 0100 a r
05 0101 o s
06 0110 n d
07 0111 s a
08 1000 r i
09 1001 l o
10 1010 d u
11 1011 v g
12 1100 u f
13 1101 space l
14 1110 Not Digraph space
15 1111 Not Digraph
Not done anything else with it yet, but I thought the thought experiment might be useful if you are going to allocate bits to such a compression scheme.