Vikram Aggarwal: January 2012

Tuesday, January 24, 2012

Project Euler in R: Problem 25

Solutions in R to Project Euler problems are generally hard to find, so I've recently started posting R solutions on this site. Here are the previous problems: problem 22, problem 23 and problem 24

Now let's look at Problem 25 from Project Euler. Here is the problem statement:

The Fibonacci sequence is defined by the recurrence relation: Fn = Fn1 + Fn2, where F1 = 1 and F2 = 1.
Hence the first 12 terms will be:
F1 = 1
F2 = 1
...
F11 = 89
F12 = 144
The 12th term, F12, is the first term to contain three digits. What is the first term in the Fibonacci sequence to contain 1000 digits?

This is an easy problem to solve for a small number of digits. It gets interesting once we increase the required number of digits to 1000. The R integer type can hold 10 digits at most, and the type double can hold 309 digits. There is no Big Integer class in R that can hold larger numbers, so to compute the 1000 digit Fibonacci number we will have to be creative.

Computing Fibonacci numbers iteratively

Here is a basic iterative Fibonacci function, that works for Fibonacci numbers with 309 or fewer digits.

fib <- function(n) {
  a = 0
  b = 1
  for (i in 1:n) {
    tmp = b
    b = a
    a = a + tmp
  }
  return (a)
}

The version that works with larger numbers is similar, with the following changes

Integer vectors are used to store each Fibonacci number.
Each element of the vector is at most nine digits long.
Once all the nine digits are used, a new element is added to the vector and the Fibonacci number computation is carried over to the new element.
The variable numberDigits counts the number of digits in the latest Fibonacci number. When numberDigits exceeds 999, the function exits, returning the Fibonacci number index.

fib.bigint <- function(lim) {
  a = c(0)
  b = c(1)
  n = 0           # Fibonacci number index
  integerSize <- 1000000000  ## How big is each integer
  while (1) {
    n = n + 1     # Compute next Fibonacci number
    tmp = c(rep(0,length(a)-length(b)), b)  # Pad beginning with zero
    b = a
 
    # Add a and tmp
    carry = 0
    for (i in length(a):1) {
      sum.i = tmp[i] + a[i] + carry
      a[i] = sum.i %% integerSize
      carry = floor (sum.i / integerSize)
    }
    if (carry > 0) {a <- c(carry, a)}
    numberDigits <- (length(a) - 1) * 9 + countDigits(a[1])
    if (numberDigits > (lim - 1)) { return (n) }
  }
}

countDigits <- function(n) {
  count <- 1
  for (i in 1:9){
    if ((n/10) > 1) {
      count <- count + 1
      n = n / 10
    } else {
      return (count)
    }
  }
}

On my Intel Core 2 Duo 1.6Ghz laptop running Linux, this function less than 4 seconds to find the Fibonacci number with 1000 digits.

The analytical solution
Fibonacci numbers have a closed form solution, which leads to a simpler approximation by rounding (from Wikipedia),

F_n = the closest integer to (golden ratio)^n / sqrt(5)

We need the first Fibonacci number with 1000 digits. This number will be greater than or equal to 10^999.

F_n = the closest integer to (golden ratio)^n / sqrt(5) >= 10^999

We can drop the closest integer function because if F_n rounds off to the floor of the ratio, the resulting Fibonacci number will have less than the required digits. So we need smallest n satisfying

(golden ratio)^n / sqrt(5) >= 10^999

Or, n >= (999 log(10) + log(sqrt(5)))/log(golden ratio)

Or, n >= 4781.86

So we need the 4782nd Fibonacci number to get the required 1000 digits.

This problem was particularly interesting to R users because it runs into an R-specific limitation. Many other languages don't have this problem and can just use a simple Fibonacci generator to get the answer. For example, in Scheme (or other Lisp variants), a naive Fibonacci implementation produces all the digits.

Aside: Having started writing about R, I have found an excellent online resource, R-bloggers - an aggregator of content from various blogs about R. You can search for other R Project Euler solutions here. Be sure to check out the top articles box on R-bloggers landing page if you're looking for new and interesting applications of R.

Thursday, January 19, 2012

Project Euler in R: Problem 24

I had previously posted solutions in R to Project Euler problem 23 and problem 22. This is the next problem from Project Euler. The statement of problem 24 is as follows.

A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4. If all of the permutations are listed numerically or alphabetically, we call it lexicographic order. The lexicographic permutations of 0, 1 and 2 are: 012 021 102 120 201 210

What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?

This is a very interesting problem and it can be solved analytically. The key insight is that n distinct digits can be arranged in n! permutations.

Here's the logic you could use. If the permutations of the numbers 0 through 9 are arranged in lexicographic order,

The first 9! permutations, #1 to #362,880 are numbers starting with 0, in order from 0123456789 to 0987654321
The next 9! permutations, #362,881 to #725,760, are numbers from 1023456789 to 1987654320
The next 9! permutations, #725,761 to #1,088,640 are numbers from 2013456789 to 2987654310. The millionth permutation is contained in this set. Thus we know that the millionth number must start with a 2. In fact, it is the 274,240th = (1000000 - 725760) number contained in this set.
Our problem has now reduced to finding the 274,240th lexicographic permutation of the numbers 0, 1, 3, 4, 5, 6, 7, 8 and 9 - the nine digits excluding 2. We can repeat the above process to get each successive digit.

Essentially we're writing one million as the following sum

1000000 = (a1 x 9!) + (a2 x 8!) + ... + (a10 x 1)

As shown above, the index a1 is 2. That is, only two times 9! numbers fit perfectly within 1,000,000. So the first digit of the millionth number is the third digit in 0123456789. The index a2 is 6, so the second digit of the millionth number is the seventh digit of the remaining numbers 013456789, which is 7. The remaining digits can be found similarly. While finding the digits you have to be careful to ignore the digits that have been previously used. This is a permutation, so each digit appears exactly once.

The process is tedious, so here is some R code that will make it easier. This generalizes the solution to find the nth permutation of the numbers 0 to 9.

findperm <- function (n) {
  # Find the indices a1 - a10
  indices = c()
  for (i in 9:0) {
    index = 0
    while (factorial (i) < n) {
      n = n - factorial (i)
      index = index + 1
    }
    indices = c(indices, index)
  }
 
  # Use the indices to find the nth permutation
  input = 0:9                         # The list of numbers to permute
  output = c()
  for (j in 1:10) {
    output[j] = input[indices[j] + 1] # Move digit to output
    input = input[-(indices[j] + 1)]  # Remove the assigned digits from input
  }
  return (output)
}

That was straight-forward and a fun illustration of R's capabilities.

But you can do better. The strength of R lies in the wealth of user defined libraries and functions that can ease the implementation of tricky computation. For a simpler solution, try the lexicographicPermutation function in the library R.basic. Here is the entire solution.

library(R.basic)
lexicographicPermutation(c(0,1,2,3,4,5,6,7,8,9),999999)

The first solution illustrates that even tricky computation can be coded easily in R and the code is pretty readable. The second solution suggests that before writing your own code, it is a good idea to look for existing libraries that can help. Statistical computations can get complex and tricky. You can write your own k-means clustering algorithm but it is difficult getting all the edge cases correct. Rather than writing your own routine, look around and see if it has been written already. An expert does a better job writing such routines. And you can leverage their code and their expertise.

Sunday, January 15, 2012

Project Euler in R: Problem 23

I was just looking through the programming language statistics on Project Euler. It shows that only 7% of the problems have been solved in R, whereas 8% have been solved on any kind of spreadsheet. This is outrageous!

Let's look at the solution of problem 23 in R. Here is the statement of Project Euler's problem 23:

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.
A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.
As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.
Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.

Find proper divisors
First we write a function to find the proper divisors (that is divisors less than the number itself) of a number n.

divisor <- function (n) {
    divs <- c(1)
    i = 2
    lim = floor (sqrt(n))
 
    while (i <= lim) {
        if (n %% i == 0) {
          divs <- c(divs, i)
          if (n/i != i) { divs <- c(divs, n/i) }
        }
        i = i + 1
      }
    return (divs)
  }

Check if number is abundant
Next we create a simple function which checks whether a number n is abundant.

abundant <- function (n) {
    if (sum (divisor(n)) > n) return (TRUE) else return (FALSE)
  }

The solution
With the above two functions we have the tools we need to solve the main problem: finding the sum of all positive integers which can not be written as the sum of two abundant numbers.

All the numbers greater than 28,123 can be written as sum of two abundant numbers, so we only need to check numbers smaller than 28,123 for this property. The vector abunlist in the code below holds all abundant numbers starting from 12 to 28,123. With every new abundant number than we find, we create all possible two abundant number sums and store them in the vector sumno. The answer we're looking for is the sum of all numbers 1 to 28,123 except the ones in the vector sumno.

abunlist <- c(12)            # The smallest abundant number = 12
sumno <- c(24)               # The smallest abundant number sum = 12 + 12 = 24
 
for (i in 13:28123) {
  if (abundant(i)) {
    abunlist <- c(abunlist, i)                   # Latest abundant number = i
    tmp <- abunlist[(abunlist + i) <= 28123] + i   # New abundant sums <= 28123 using i
    sumno <-unique(c(sumno, tmp))                # Add to existing list
  }
}

This is certainly not the most efficient solution, but runs in approximately 25 seconds - well within the stated 1 minute goal.

Note: The code for this post was formatted in Pretty R at inside-R.org.

Wednesday, January 11, 2012

Generate UML Diagrams From Java Source

I was navigating through a large Java project today, and I needed some source tools to help me visualize the massive class hierarchy. Looking on Google didn't produce anything obvious: people pointed to eUML2, which was difficult to install, and some tools didn't generate UML from existing code.

After a while of searching, I remembered the Doxygen utility which writes documentation from source code. It works on C,C++, and Java, among many other languages. While its diagrams might not be perfect, they were very helpful for C++ projects in the past.

So I decided to give it a shot. It worked out beautifully.

Here are the steps involved in generating UML diagrams.
First, you install doxygen (to parse the files) and graphviz (to write out PNG of graphs)

$ sudo apt-get install doxygen graphviz

Then, run doxygen to generate a config file.

$ cd <path-to-source>
$ doxygen -g

Now you have a config file called Doxyfile. Edit this to allow recursive searching, to look for java source, to generate call graphs and to generate pictures, and to generate UML style diagrams. Here are all the changes I made to this file.

FILE_PATTERNS     = *.java
RECURSIVE       = YES
HAVE_DOT        = YES
CALL_GRAPH       = YES
CALLER_GRAPH      = YES
DOT_GRAPH_MAX_NODES  = 500
UML_LOOK            = YES

Now, generate documentation with this command:

$ doxygen Doxyfile

To test this out for a blog post, I downloaded the source code to Azureus, a large Java-based project. Here is a graph from one of the files. The top right shows a close-up view of a tiny section of the picture. Even in this simple example, you can see how a zoomed-out view allows you to understand the complex structure.

Doxygen reads code written in C, C++, Java, and many other languages. So if you are in need of UML style diagrams to better understand a class hierarchy, give it a try. It is freely available, easy to install, and easy to use. In addition to class graphs, it can also make call graphs and caller graphs.

Saturday, January 07, 2012

Project Euler in R: Problem 22

I solve Project Euler problems for recreation. I am using the Statistical Language R to solve these. R is free for use and download, so I would recommend downloading it if you are interested in Statistical computation.
This is problem 22 from Project Euler:

Using names.txt, a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.
For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 x 53 = 49714.
What is the total of all the name scores in the file?

While this is an easy problem, its solution involves some nice R concepts. Let's go through solving this in R.

Reading free form text using scan()
The first order of business is to read the text file. The file names.txt is a single line of text, separated by commas:

"MARY","PATRICIA","LINDA","BARBARA", ...

The scan function reads free form text and stores each element in a vector. We specify that the input is character data separated by commas. That should do the trick, right?

nlist <- scan("names.txt", what= "", sep=",",)

Unfortunately, this does not work. After many minutes of scratching your head you'll find that one of the names in the file is 'NA'. This confuses the scan function since it is interpreted as the constant NA, or "Not Available" in R. This can be fixed by setting an na.strings argument in the scan function.

nlist <- scan("names.txt", what= "", sep=",", na.strings="")

Dictionaries / hashtables in R
Getting the alphabetical value of each character is easiest if it is stored in a hashtable somewhere. In R, named vectors can be used as hashtables. The traditional definition would go as follows:
dict = c("A" = 1, "B" = 2, ...)
While you can sit and write the entire dictionary by hand, you should avoid such repeated work. Here's a niftier definition using the names() function:

dict = 1:26
names(dict) = unlist(strsplit("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ""))

This should be a trivial method to write in a C-style language where you can turn a character into an integer. R, however, has no internal ASCII table that I could find. If there is a simpler solution, please let me know.

Name scoring function
The scoring function gets the name as an input, pulls the score for each letter in the name and then adds up the letter scores.

score <- function (name) {
    return (sum(dict[unlist(strsplit(name, ""))]))
}

Calculating weighted scores
The vector of name scores is obtained by applying the score function defined above to each name in nlist. We use the sapply function, which gives a simplified vector output by default. The vector of weights is a sequence of numbers from 1 to the length of nlist. Multiply the two vectors to get the vector of weighted scores.

weighted.score = (1:length(nlist))*(sapply(nlist, score))

The full solution

nlist <- scan("names.txt", what= "", sep=",", na.strings="")
dict = 1:26
names(dict) = unlist(strsplit("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ""))

score <- function (name) {
    return (sum(dict[unlist(strsplit(name, ""))]))
}

nlist <- sort(nlist)
weighted.score = (1:length(nlist))*(sapply(nlist, score)) 
sum(weighted.score)

This was an interesting problem to solve in R. It demonstrates the flexibility of R in reading and processing free form text. Trying to do a similar task in SAS would be hard.

Monday, January 02, 2012

Book Review: Binary, by Michael Crichton

Want to read good fiction? Read "Binary", by Michael Crichton, one of his best works that you've never read. Yes, that Michael Crichton.

The first book I read by Michael Crichton was an unknown book called "Binary". It was an obscure book by an obscure author called John Lange. I was in high school, and my father's friend had dropped this short book at our place. The author was John Lange, a nobody. Inside, the book mentioned that the author had been revealed as the great Michael Crichton, but I hadn't heard of Crichton either, so expectations ran low.

The book starts with a gripping description of a heist. It grabs you, and pulls you in. The book is a thriller, of a plot involving dangerous weapons, and one man's race to stop it all. It is a short book, written very well.

I had harbored fond memories of this book. I had recommended it to many friends, knowing full well how difficult it was to obtain a copy. Recently, I received a copy of this book, and I sat down to read it. A short evening passed, and I was done. Despite having read it over a decade ago, I remembered most of the plot and many of the scenes. It was a thrilling book made punchier by the fact that I knew the story. I marveled at the plot and the excellent writing.

A fantastic novel, a thrilling plot. The ending isn't blockbuster by today's standards, and that is part of the lure. No superfluous nonsense, and no extra pages.

A gripping story snappily told.