Thursday, 19 February 2009


In the last post I mentioned the book Programming Pearls. In the second chapter, Jon Bentley presents a simple, but interesting problem: Given a list of words, group those that are acronyms together.

If you don't know what acronyms are, it is simply a words formed from the initial letters of others. For instance, the words "pot" and "top" are acronyms; so are the words "slow" and "owls".

When I read it, I thought "This could be done with unix tools and pipes". Actually, I didn't find a way of doing it efficiently with native unix tools. The reason being that there's no horizontal sort implementation in unix. You could write something like this:

for i in `cat dict | sed 'p' | sed 'N;s/\n/*-/g'`; do echo $i |
cut -d "*" -f 2 | sed "s/\(.\)/\1-/g" | tr "-" "\n" | sort |
tr -d "\n"; echo "$i" | cut -d "*" -f 2 ; done | sort

But it won't be very efficient and it's very ugly. I didn't spend too much time thinking of a better way to do it unix-style. If you have one, please post it on the comments =)

Instead, I wrote a small python script

#!/usr/bin/env python
import sys

if __name__ == '__main__':
f = open(sys.argv[1])
for line in f.xreadlines():
line = line.strip()
print '%s-%s' % (''.join(sorted(line)), line)

and I executed:

./ dict2 | sort

which runs in about 0.47 seconds. Then, we could add another step to pretty print the acronyms, but I wasn't interested in that.

Unfortunately, the only thing that remains 'unix' about this solution is the use of sort and one pipe. This post is not so much to show my findings but to look for a better solution. Does anyone have a suggestion?

No comments:

Post a Comment