
Digraph Compression


britlion

Posts: 766

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

Post Mon Feb 15, 2016 9:20 pm

Digraph Compression

After reading a note by Cheveron about digraph compression ( http://foro.speccy.org/viewtopic.php?f= ... aph#p33001 ) I thought I'd run up the English language version to see what it looked like.

If we take http://www.math.cornell.edu/~mec/2003-2 ... raphs.html as a start, and work out the frequency contribution of the first and second letters of each digraph, I get this table in the end:

  Code:
1st   Freq      2nd   Freq
e     3.24      n     2.88
t     2.65      e     2.52
h     2.30      t     2.39
i     1.90      h     1.52
a     1.87      r     1.43
o     1.66      s     1.35
n     1.37      d     1.18
s     0.74      a     1.13
r     0.72      i     0.85
l     0.10      o     0.52
d     0.09      u     0.50
v     0.04      g     0.18
u     0.02      f     0.16
                l     0.09


Thinking about this, you could store these as a nybble each:
  Code:
Val   Bin    Left          Right
00    0000   e             n
01    0001   t             e
02    0010   h             t
03    0011   i             h
04    0100   a             r
05    0101   o             s
06    0110   n             d
07    0111   s             a
08    1000   r             i
09    1001   l             o
10    1010   d             u
11    1011   v             g
12    1100   u             f
13    1101   space         l
14    1110   not digraph   space
15    1111   not digraph   -

And if the first three bits are all 1 (the nybble is > 13), you can comfortably use the remaining 5 bits of the byte for the letters A-Z, with room to spare for a few other symbols. One of the spare codes might have to act as an escape, with a separate byte carrying capitals (or non-capitals) and other symbols. Clearly, though, 26 letters + space take the first 27 of the 32 five-bit values by default. In real-world usage you'd probably put the four most common non-lowercase characters into the spare slots, and then use 255 to mean "use the full next byte as the character code". That does send some symbols double-byte, but that's compression for you - you can always find a case that ends up bigger than the original.
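As a rough sketch of the encoder side (in Python, with assumed details that aren't pinned down above: letter codes a=0..z=25 and space=26 in the five-bit space, and 255 as the raw-byte escape):

```python
# Sketch of the nybble scheme above.  LEFT/RIGHT are the table columns;
# a byte below 0xE0 packs two nybbles (a digraph), bytes 0xE0-0xFE are
# 111 + 5-bit single characters, and 0xFF escapes to a raw byte.
LEFT = "ethiaonsrldvu "    # first-letter nybble values 0-13
RIGHT = "nethrsdaiougfl "  # second-letter nybble values 0-14

def encode(text):
    out = []
    i = 0
    while i < len(text):
        a = text[i]
        b = text[i + 1] if i + 1 < len(text) else None
        if b is not None and a in LEFT and b in RIGHT:
            # Both letters fit the tables: pack the digraph into one byte.
            out.append((LEFT.index(a) << 4) | RIGHT.index(b))
            i += 2
        elif a == " " or "a" <= a <= "z":
            # Single character: 111 prefix + 5-bit letter code (space = 26).
            code = 26 if a == " " else ord(a) - ord("a")
            out.append(0b11100000 | code)
            i += 1
        else:
            # Anything else: escape byte, then the raw character code.
            out.extend([0xFF, ord(a)])
            i += 1
    return bytes(out)
```

So "that" packs into two bytes (two digraphs), while a capital letter costs two bytes via the escape.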


I haven't done anything else with it yet, but I thought the thought experiment might be useful if you're going to allocate bits to such a compression scheme.

britlion

Posts: 766

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

Post Mon Feb 15, 2016 9:42 pm

Re: Digraph Compression

And of course, genius that I am, I think I missed an obvious thing: there are only 40 digraphs in the table. So while my version allows any combination of those letters to be treated as a digraph, most of those combinations are uncommon. It would probably be better to allocate 40 of the 256 possibilities to digraphs; then you still have all the symbols uniquely, as well as room for a pile more things longer than digraphs - e.g. your most common whole-word list.
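A quick sketch of that 256-slot codebook idea in Python. The 40 digraphs below are a plausible common-English list, not the actual table from the link (an assumption on my part), and the choice of codes 128-167 for digraphs is mine:

```python
# Codebook approach: 40 byte values are reserved for whole digraphs,
# plain ASCII (0-127) passes through untouched, and the remaining
# codes (168-255) stay free for e.g. a common-word list.
DIGRAPHS = ["th", "he", "in", "er", "an", "re", "nd", "at", "on", "nt",
            "ha", "es", "st", "en", "ed", "to", "it", "ou", "ea", "hi",
            "is", "or", "ti", "as", "te", "et", "ng", "of", "al", "de",
            "se", "le", "sa", "si", "ar", "ve", "ra", "ld", "ur", " t"]
BASE = 128  # assumed: codes 128-167 mean "digraph number n - 128"

def encode(text):
    out, i = bytearray(), 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            # Two characters collapse into one reserved byte.
            out.append(BASE + DIGRAPHS.index(pair))
            i += 2
        else:
            # Everything else is ordinary one-byte ASCII.
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```

The nice property is that nothing ever expands: the worst case is one byte per character, same as the original.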

This is what happens when I say "Try to do this in English the same way Andrew did there in Spanish"

Was fun maths though :)

britlion

Posts: 766

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

Post Fri Feb 19, 2016 10:38 am

Re: Digraph Compression

That said, I did some preliminary tests on my method, and it does look like, for English - even complex English with lots of symbols, numbers and some capital letters - it comes out well under 5 bits per character. A hand run on a few sentences came to 4.17 bits per character. The scheme can't theoretically compress below 4 bits per character, so getting close to that is quite neat.

While the entropy of English is technically just under 2 bits per character, reaching that with a simple algorithm probably isn't possible. A near-50% reduction isn't half bad.

Basically it puts the most common 60% of letters into the 4-bit tables, so quite often it encodes a character in 4 bits. Failing that, it usually manages 8 bits. Sometimes it falls back to an expensive 16 bits, where we have to forget about the compression, flag up a standard byte, and go with that.

One imagines that, given there's no ASCII above 127, using the top bit of such a "normal" character byte to say "the next one isn't compressed either" might be clever. That way, long strings of uncompressible data only cost us one flag byte.
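The 4/8/16-bit cost model can be sanity-checked mechanically. A minimal counter, assuming the nybble tables from the first post (the run-flag idea above isn't modelled):

```python
# Bits-per-character counter for the nybble scheme: a matched digraph
# costs 8 bits for two characters, a lone lowercase letter or space
# costs 8 bits, anything else costs 16 (escape byte + raw byte).
LEFT = "ethiaonsrldvu "
RIGHT = "nethrsdaiougfl "

def bits_per_char(text):
    bits, i = 0, 0
    while i < len(text):
        c = text[i]
        if i + 1 < len(text) and c in LEFT and text[i + 1] in RIGHT:
            bits += 8; i += 2   # digraph: one byte, two characters
        elif c == " " or "a" <= c <= "z":
            bits += 8; i += 1   # 111-prefixed single character
        else:
            bits += 16; i += 1  # escape byte plus raw byte
    return bits / len(text)
```

Running this over a decent chunk of prose should land in the same region as the 4.17 bits/character hand count, depending on the text.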

Anyway. I think I might press on with this one a bit and see where it takes me.

boriel

Site Admin

Posts: 1463

Joined: Wed Nov 01, 2006 6:18 pm

Location: Santa Cruz de Tenerife, Spain

Post Fri Feb 19, 2016 7:52 pm

Re: Digraph Compression

britlion!!! Glad to see you back! :)

Actually I started to read more about digraphs thanks to cheveron and you :)
Now I use them a lot.

britlion

Posts: 766

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

Post Tue Mar 01, 2016 5:11 pm

Re: Digraph Compression

Interesting. What have you used them for, Boriel?

boriel

Site Admin

Posts: 1463

Joined: Wed Nov 01, 2006 6:18 pm

Location: Santa Cruz de Tenerife, Spain

Post Sun Mar 06, 2016 5:31 pm

Re: Digraph Compression

britlion wrote:Interesting. What have you used them for, Boriel?

I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then for a new text I do the same, and return the "guessed" language by taking the closest histogram to the new one. :roll:
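A minimal sketch of that idea in Python - the tiny sample texts, the frequency normalisation and the squared-difference distance are all my assumptions, not boriel's actual code:

```python
# Language guessing by digraph histograms: build a normalised digraph
# frequency profile per language, then pick the language whose profile
# is closest to the profile of the new text.
from collections import Counter

def profile(text):
    text = text.lower()
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {p: n / total for p, n in pairs.items()}

def distance(p, q):
    # Sum of squared frequency differences over all digraphs seen.
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys)

def guess(text, profiles):
    p = profile(text)
    return min(profiles, key=lambda lang: distance(p, profiles[lang]))

# Illustrative training samples -- real profiles would use far more text.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the end",
    "es": "el perro corre por la calle y luego llega a la casa con ella",
}
profiles = {lang: profile(text) for lang, text in samples.items()}
```

With real training data the profiles stabilise and short inputs classify surprisingly well, which matches the near-100% figure boriel reports below for 10+ words.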

britlion

Posts: 766

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

Post Thu Mar 10, 2016 3:19 pm

Re: Digraph Compression

boriel wrote:
britlion wrote:Interesting. What have you used them for, Boriel?

I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then for a new text I do the same, and return the "guessed" language by taking the closest histogram to the new one. :roll:



Ooh. I'd have never thought of that one. I'd have been messing with dictionaries and highest count of recognised words, I think.

I wonder how Google Translate does it.

boriel

Site Admin

Posts: 1463

Joined: Wed Nov 01, 2006 6:18 pm

Location: Santa Cruz de Tenerife, Spain

Post Thu Mar 10, 2016 3:35 pm

Re: Digraph Compression

britlion wrote:
boriel wrote:
britlion wrote:Interesting. What have you used them for, Boriel?

I created my own statistical language recognizer by first analysing the statistical frequency of digraphs and trigraphs. Then for a new text I do the same, and return the "guessed" language by taking the closest histogram to the new one. :roll:



Ooh. I'd have never thought of that one. I'd have been messing with dictionaries and highest count of recognised words, I think.

I wonder how Google Translate does it.

I used trigraphs later, and it works much better. For 10+ words, the accuracy I get is near 100%.
To "train" the frequencies I use these word count lists from here.

Return to ZX Basic Compiler

Who is online

Users browsing this forum: No registered users and 2 guests

cron
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group.
Designed by Vjacheslav Trushkin for Free Forums/DivisionCore.

phpBB SEO