Thursday, 19 February 2009

Acronyms

In the last post I mentioned the book Programming Pearls. In the second chapter, Jon Bentley presents a simple, but interesting problem: Given a list of words, group those that are acronyms together.

If you don't know what acronyms are, it is simply a words formed from the initial letters of others. For instance, the words "pot" and "top" are acronyms; so are the words "slow" and "owls".

When I read it, I thought "This could be done with unix tools and pipes". Actually, I didn't find a way of doing it efficiently with native unix tools. The reason being that there's no horizontal sort implementation in unix. You could write something like this:


for i in `cat dict | sed 'p' | sed 'N;s/\n/*-/g'`; do echo $i |
cut -d "*" -f 2 | sed "s/\(.\)/\1-/g" | tr "-" "\n" | sort |
tr -d "\n"; echo "$i" | cut -d "*" -f 2 ; done | sort


But it won't be very efficient and it's very ugly. I didn't spend too much time thinking of a better way to do it unix-style. If you have one, please post it on the comments =)

Instead, I wrote a small python script


#!/usr/bin/env python
import sys

if __name__ == '__main__':
f = open(sys.argv[1])
for line in f.xreadlines():
line = line.strip()
print '%s-%s' % (''.join(sorted(line)), line)


and I executed:


./expand.py dict2 | sort


which runs in about 0.47 seconds. Then, we could add another step to pretty print the acronyms, but I wasn't interested in that.

Unfortunately, the only thing that remains 'unix' about this solution is the use of sort and one pipe. This post is not so much to show my findings but to look for a better solution. Does anyone have a suggestion?


No comments:

Post a Comment