Wordlist

A few years ago, a friend of mine, Mike, sent me his Fortran program for compressing the wordlists he uses to solve puzzles. I thought it looked vaguely familiar, and sure enough, it is more or less equivalent to the DAWG (Directed Acyclic Word Graph) algorithm used by Crack (and others). This algorithm is described by Alec Muffett in the Crack FAQ:

sort the wordlist into normal Unix order. (beware localization!)
for each word that the DAWG preprocessor reads...
count how many leading characters it shares with the previous word that was read...
encode that number as a character from the set [0-9A-Za-z] for values 0..61 (if the value is >61 then stop there)
print said character (the encoded number) and the remaining stem of the word
end-for-loop

When I suggested to Mike that he had reinvented a scheme dating to at least 1990, he said no, he first heard about it years ago at a talk given by a Navy Admiral (Grace Hopper?). He wrote

I first saw this wordlist scheme used in 1972. I think it was about 25 years old then. I have used it to store wordlists on my Atari which had a wopping 48K bytes!

At the time, I coded it up in Perl and even packaged it all up as a proper Perl module. Note that the version of Crack that I have seems to use a numarr that matches the ASCII table starting at '0', rather than the one implied by the Crack FAQ (described above). That is, to actually decode the dwg files that come with Crack, I have to use the following:

    use Compress::DAWG numarr => ['0'..'9',':',';','<','=','>',
                                  '?','@','A'..'Z','[','\\',']',
                                  '^','_','`','a'..'z'];

I thought about making that the default in the module, but I never followed through. Anyway, I was going to upload the module to CPAN, but then I realized that since Mike's alphabet and the alphabet actually used by Crack are both contiguous, the whole thing boils down to just four one-liners:

    # Mike's compress.
    perl -ple '@c=split q//;$n=0;$n++ while $p[$n] eq $c[$n];$_=chr
    64+$n;$_.=join q//, @c[$n..$#c];@p=@c;' words > words.cpt

    # Mike's decompress.
    perl -ple '@c=split q//;$i=shift @c;$n=(ord $i)-64;$_=join q//,
    @p[0..$n-1], @c;@p=split q//' words.cpt > words

    # What Crack seems to actually use for compress.
    perl -ple '@c=split q//;$n=0;$n++ while $p[$n] eq $c[$n];$_=chr
    48+$n;$_.=join q//, @c[$n..$#c];@p=@c;' words > words.dwg

    # What Crack seems to actually use for decompress.
    perl -ple '@c=split q//;$i=shift @c;$n=(ord $i)-48;$_=join q//,
    @p[0..$n-1], @c;@p=split q//' words.dwg > words

I decided I was probably the only one interested in such a module.

Lately, I have become interested in Python, so today I coded it up in Python just for fun. You might use it something like this

    >>> import wordlist
    >>> wordlist.compress(['foo', 'foot', 'footle', 'fubar', 'fub'])
    ['#!xdawg', '0foo', '3t', '4le', '1ubar', '3']
    >>>

Or perhaps like this

    #!/usr/bin/python

    import wordlist

    # Compress with DAWG format.
    inf = open('wubba')
    outf = open('wubba.dwg', 'w')
    print "Compressing file", inf.name
    wordlist.compress(inf, outf, 'DAWG')
    print "Output written to", outf.name

    # Decompress with Mike format.
    inf = open('wubba.cpt')
    outf = open('wubba.out', 'w')
    print "Decompressing file", inf.name
    wordlist.decompress(inf, outf, 'Mike')
    print "Output written to", outf.name

Or, more likely, you won't use it at all.

I am also quite fond of Ruby, but someone called Skrud has already implemented a version of Crack in Ruby which includes handling the DAWG case, so go look at his.