Saturday, January 07, 2012

Project Euler in R: Problem 22

I solve Project Euler problems for recreation. I am using the Statistical Language R to solve these. R is free for use and download, so I would recommend downloading it if you are interested in Statistical computation.
This is problem 22 from Project Euler:
Using names.txt, a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.
For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 x 53 = 49714.
What is the total of all the name scores in the file?
While this is an easy problem, its solution involves some nice R concepts. Let's go through solving this in R.

Reading free form text using scan()
The first order of business is to read the text file. The file names.txt is a single line of text, separated by commas:
"MARY","PATRICIA","LINDA","BARBARA", ...
The scan function reads free form text and stores each element in a vector. We specify that the input is character data separated by commas. That should do the trick, right?
nlist <- scan("names.txt", what= "", sep=",",)
Unfortunately, this does not work. After many minutes of scratching your head you'll find that one of the names in the file is 'NA'. This confuses the scan function since it is interpreted as the constant NA, or "Not Available" in R. This can be fixed by setting an na.strings argument in the scan function.
nlist <- scan("names.txt", what= "", sep=",", na.strings="")
  
Dictionaries / hashtables in R
Getting the alphabetical value of each character is easiest if it is stored in a hashtable somewhere. In R, named vectors can be used as hashtables. The traditional definition would go as follows:
dict = c("A" = 1, "B" = 2, ...)
While you can sit and write the entire dictionary by hand, you should avoid such repeated work. Here's a niftier definition using the names() function:
dict = 1:26
names(dict) = unlist(strsplit("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ""))
This should be a trivial method to write in a C-style language where you can turn a character into an integer. R, however, has no internal ASCII table that I could find. If there is a simpler solution, please let me know.

Name scoring function
The scoring function gets the name as an input, pulls the score for each letter in the name and then adds up the letter scores.
score <- function (name) {
    return (sum(dict[unlist(strsplit(name, ""))]))
}
Calculating weighted scores
The vector of name scores is obtained by applying the score function defined above to each name in nlist. We use the sapply function, which gives a simplified vector output by default. The vector of weights is a sequence of numbers from 1 to the length of nlist. Multiply the two vectors to get the vector of weighted scores.
weighted.score = (1:length(nlist))*(sapply(nlist, score)) 
  
The full solution
nlist <- scan("names.txt", what= "", sep=",", na.strings="")
dict = 1:26
names(dict) = unlist(strsplit("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ""))

score <- function (name) {
    return (sum(dict[unlist(strsplit(name, ""))]))
}

nlist <- sort(nlist)
weighted.score = (1:length(nlist))*(sapply(nlist, score)) 
sum(weighted.score)
This was an interesting problem to solve in R. It demonstrates the flexibility of R in reading and processing free form text. Trying to do a similar task in SAS would be hard.