Anyway, one of the optional programming assignments was to decode the following message:
"Esp qtcde nzyqpcpynp zy esp ezatn zq Lcetqtntlw Tyepwwtrpynp hld spwo le Olcexzfes Nzwwprp ty estd jplc."
The message is encoded using a Caesar Cipher - a very simple shift cipher, where each letter is substituted for another a fixed number of positions down the alphabet. Those of you old enough to remember Usenet may have encountered rot13, which is an example of a Caesar cipher.
The key to solving this problem is probabilistic analysis of the text. My first, rather naive approach based on intuition rather a strict application of probability theory was simply to analyse the text with the assumption that the most frequent letter in English is 'e'. Using AI jargon, we could say I used a probabilistic unigram letter model.
Anyway, I thought it was a great opportunity to practice my freshly minted Haskell skills, and to my surprise this naive approach actually worked!
Blogger does a lousy job of inlining code so I have hosted the code here on ideone.com, an awesome site that not only displays code with syntax highlighting, it can even take input, run the code and display the output. Thanks to Maxim on aiqus for improving my Haskell code.
Flush with my success, I tried a more sophisticated approach. This time I would use a trigram letter model, which are basically sequences of three adjacent letters- for example, one of the most common letter trigrams in English is 'THE". The approach was to generate all 26 possible decoding of the string, and score each one based on the trigram probabilities, choosing the one with the highest score as the most likely candidate. A few tricks that I learned in AI class were employed here - I used Laplace Smoothing to ensure that unlikely trigrams that I didn't have scores for would not invalidate my probability score due to multiplying by zero. I also used the log of the probabilities so I could simply sum them and didn't have to deal with ridiculously small numbers.
The source code, along with the trigram input data and output can be found here on here on oneide.com.
I have copied the output below
Caesar Cypher Solver (ai-class)
encoded message:
Esp qtcde nzyqpcpynp zy esp ezatn zq Lcetqtntlw Tyepwwtrpynp hld spwo le Olcexzfes Nzwwprp ty estd jplc.
The top 3 candidates based on trigram letter probabilities for english language
The first conference on the topic of Artificial Intelligence was held at Dartmouth College in this year.
score: -742.9796021883617
Nby zclmn wihzylyhwy ih nby nijcw iz Ulnczcwcuf Chnyffcayhwy qum byfx un Xulngionb Wiffyay ch nbcm syul.
score: -913.8004146160043
Esp qtcde nzyqpcpynp zy esp ezatn zq Lcetqtntlw Tyepwwtrpynp hld spwo le Olcexzfes Nzwwprp ty estd jplc.
score: -921.9751546970537
Cool.
ReplyDeletePaul can you make it so that you give it a page of an english novel, and the cypher message, and it does some kind of correlation between the shifted cyphers and the text you give it to work out which is english? That would impress me.
I would if I could understand what you just said...
ReplyDeletelol
ReplyDelete