Tuesday, January 24, 2012

Project Euler in R: Problem 25

Solutions in R to Project Euler problems are generally hard to find, so I've recently started posting R solutions on this site.  Here are the previous problems: problem 22problem 23 and problem 24

Now let's look at Problem 25 from Project Euler.  Here is the problem statement:
The Fibonacci sequence is defined by the recurrence relation: Fn = Fn1 + Fn2, where F1 = 1 and F2 = 1.
Hence the first 12 terms will be:
F1 = 1
F2 = 1
...
F11 = 89
F12 = 144
The 12th term, F12, is the first term to contain three digits. What is the first term in the Fibonacci sequence to contain 1000 digits?
This is an easy problem to solve for a small number of digits. It gets interesting once we increase the required number of digits to 1000. The R integer type can hold 10 digits at most, and the type double can hold 309 digits.  There is no Big Integer class in R that can hold larger numbers, so to compute the 1000 digit Fibonacci number we will have to be creative.

Computing Fibonacci numbers iteratively

Here is a basic iterative Fibonacci function, that works for Fibonacci numbers with 309 or fewer digits.
fib <- function(n) {
  a = 0
  b = 1
  for (i in 1:n) {
    tmp = b
    b = a
    a = a + tmp
  }
  return (a)
}
The version that works with larger numbers is similar, with the following changes
  • Integer vectors are used to store each Fibonacci number.
  • Each element of the vector is at most nine digits long.  
  • Once all the nine digits are used, a new element is added to the vector and the Fibonacci number computation is carried over to the new element.
  • The variable numberDigits counts the number of digits in the latest Fibonacci number.  When numberDigits exceeds 999, the function exits, returning the Fibonacci number index.
fib.bigint <- function(lim) {
  a = c(0)
  b = c(1)
  n = 0           # Fibonacci number index
  integerSize <- 1000000000  ## How big is each integer
  while (1) {
    n = n + 1     # Compute next Fibonacci number
    tmp = c(rep(0,length(a)-length(b)), b)  # Pad beginning with zero
    b = a
 
    # Add a and tmp
    carry = 0
    for (i in length(a):1) {
      sum.i = tmp[i] + a[i] + carry
      a[i] = sum.i %% integerSize
      carry = floor (sum.i / integerSize)
    }
    if (carry > 0) {a <- c(carry, a)}
    numberDigits <- (length(a) - 1) * 9 + countDigits(a[1])
    if (numberDigits > (lim - 1)) { return (n) }
  }
}
countDigits <- function(n) {
  count <- 1
  for (i in 1:9){
    if ((n/10) > 1) {
      count <- count + 1
      n = n / 10
    } else {
      return (count)
    }
  }
}

On my Intel Core 2 Duo 1.6Ghz laptop running Linux, this function less than 4 seconds to find the Fibonacci number with 1000 digits.

The analytical solution
Fibonacci numbers have a closed form solution, which leads to a simpler approximation by rounding (from Wikipedia),
Fn = the closest integer to (golden ratio)^n / sqrt(5) 
We need the first Fibonacci number with 1000 digits.  This number will be greater than or equal to 10^999.
Fn = the closest integer to (golden ratio)^n / sqrt(5) >= 10^999
We can drop the closest integer function because if Fn rounds off to the floor of the ratio, the resulting Fibonacci number will have less than the required digits.  So we need smallest n satisfying
(golden ratio)^n / sqrt(5) >= 10^999 
Or, n >= (999 log(10) + log(sqrt(5)))/log(golden ratio) 
Or, n >= 4781.86
So we need the 4782nd Fibonacci number to get the required 1000 digits.

This problem was particularly interesting to R users because it runs into an R-specific limitation. Many other languages don't have this problem and can just use a simple Fibonacci generator to get the answer. For example, in Scheme (or other Lisp variants), a naive Fibonacci implementation produces all the digits.

Aside: Having started writing about R, I have found an excellent online resource, R-bloggers -  an aggregator of content from various blogs about R. You can search for other R Project Euler solutions here.  Be sure to check out the top articles box on R-bloggers landing page if you're looking for new and interesting applications of R.

Thursday, January 19, 2012

Project Euler in R: Problem 24

I had previously posted solutions in R to Project Euler problem 23 and problem 22.  This is the next problem from Project Euler. The statement of problem 24 is as follows.
A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4. If all of the permutations are listed numerically or alphabetically, we call it lexicographic order. The lexicographic permutations of 0, 1 and 2 are: 012   021   102   120   201   210
What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?
This is a very interesting problem and it can be solved analytically. The key insight is that n distinct digits can be arranged in n! permutations.

Here's the logic you could use. If the permutations of the numbers 0 through 9 are arranged in lexicographic order,
  • The first 9! permutations, #1 to #362,880 are numbers starting with 0, in order from 0123456789 to 0987654321
  • The next 9! permutations, #362,881 to #725,760, are numbers from 1023456789 to 1987654320
  • The next 9! permutations, #725,761 to #1,088,640  are numbers from 2013456789 to 2987654310.  The millionth permutation is contained in this set.  Thus we know that the millionth number must start with a 2.  In fact, it is the 274,240th = (1000000 - 725760) number contained in this set.
  • Our problem has now reduced to finding the 274,240th lexicographic permutation of the numbers 0, 1, 3, 4, 5, 6, 7, 8 and 9 - the nine digits excluding 2.  We can repeat the above process to get each successive digit.
Essentially we're writing one million as the following sum
1000000 = (a1 x 9!) + (a2 x 8!) + ... + (a10 x 1)
As shown above, the index a1 is 2.  That is, only two times 9! numbers fit perfectly within 1,000,000.  So the first digit of the millionth number is the third digit in 0123456789.  The index a2 is 6, so the second digit of the millionth number is the seventh digit of the remaining numbers 013456789, which is 7.  The remaining digits can be found similarly. While finding the digits you have to be careful to ignore the digits that have been previously used. This is a permutation, so each digit appears exactly once.

The process is tedious, so here is some R code that will make it easier.  This generalizes the solution to find the nth permutation of the numbers 0 to 9.
findperm <- function (n) {
  # Find the indices a1 - a10
  indices = c()
  for (i in 9:0) {
    index = 0
    while (factorial (i) < n) {
      n = n - factorial (i)
      index = index + 1
    }
    indices = c(indices, index)
  }
 
  # Use the indices to find the nth permutation
  input = 0:9                         # The list of numbers to permute
  output = c()
  for (j in 1:10) {
    output[j] = input[indices[j] + 1] # Move digit to output
    input = input[-(indices[j] + 1)]  # Remove the assigned digits from input
  }
  return (output)
}
That was straight-forward and a fun illustration of R's capabilities.

But you can do better. The strength of R lies in the wealth of user defined libraries and functions that can ease the implementation of tricky computation.  For a simpler solution, try the lexicographicPermutation function in the library R.basic. Here is the entire solution.
library(R.basic)
lexicographicPermutation(c(0,1,2,3,4,5,6,7,8,9),999999)

The first solution illustrates that even tricky computation can be coded easily in R and the code is pretty readable.  The second solution suggests that before writing your own code, it is a good idea to look for existing libraries that can help. Statistical computations can get complex and tricky. You can write your own k-means clustering algorithm but it is difficult getting all the edge cases correct. Rather than writing your own routine, look around and see if it has been written already. An expert does a better job writing such routines. And you can leverage their code and their expertise.

Sunday, January 15, 2012

Project Euler in R: Problem 23

I was just looking through the programming language statistics on Project Euler. It shows that only 7% of the problems have been solved in R, whereas 8% have been solved on any kind of spreadsheet.  This is outrageous!

Let's look at the solution of problem 23 in R. Here is the statement of Project Euler's problem 23:
A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.
A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.
As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.
Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.
Find proper divisors
First we write a function to find the proper divisors (that is divisors less than the number itself) of a number n.
divisor <- function (n) {
    divs <- c(1)
    i = 2
    lim = floor (sqrt(n))
 
    while (i <= lim) {
        if (n %% i == 0) {
          divs <- c(divs, i)
          if (n/i != i) { divs <- c(divs, n/i) }
        }
        i = i + 1
      }
    return (divs)
  }

Check if number is abundant
Next we create a simple function which checks whether a number n is abundant.
abundant <- function (n) {
    if (sum (divisor(n)) > n) return (TRUE) else return (FALSE)
  }

The solution
With the above two functions we have the tools we need to solve the main problem: finding the sum of all positive integers which can not be written as the sum of two abundant numbers.

All the numbers greater than 28,123 can be written as sum of two abundant numbers, so we only need to check numbers smaller than 28,123 for this property.  The vector abunlist in the code below holds all abundant numbers starting from 12 to 28,123.  With every new abundant number than we find, we create all possible two abundant number sums and store them in the vector sumno.  The answer we're looking for is the sum of all numbers 1 to 28,123 except the ones in the vector sumno.

abunlist <- c(12)            # The smallest abundant number = 12
sumno <- c(24)               # The smallest abundant number sum = 12 + 12 = 24
 
for (i in 13:28123) {
  if (abundant(i)) {
    abunlist <- c(abunlist, i)                   # Latest abundant number = i
    tmp <- abunlist[(abunlist + i) <= 28123] + i   # New abundant sums <= 28123 using i
    sumno <-unique(c(sumno, tmp))                # Add to existing list
  }
}

This is certainly not the most efficient solution, but runs in approximately 25 seconds - well within the stated 1 minute goal.

Note: The code for this post was formatted in Pretty R at inside-R.org.

Wednesday, January 11, 2012

Generate UML Diagrams From Java Source

I was navigating through a large Java project today, and I needed some source tools to help me visualize the massive class hierarchy. Looking on Google didn't produce anything obvious: people pointed to eUML2, which was difficult to install, and some tools didn't generate UML from existing code.

After a while of searching, I remembered the Doxygen utility which writes documentation from source code. It works on C,C++, and Java, among many other languages. While its diagrams might not be perfect, they were very helpful for C++ projects in the past.

So I decided to give it a shot. It worked out beautifully.

Here are the steps involved in generating UML diagrams.
First, you install doxygen (to parse the files) and graphviz (to write out PNG of graphs)
$ sudo apt-get install doxygen graphviz
Then, run doxygen to generate a config file.
$ cd <path-to-source>
$ doxygen -g
Now you have a config file called Doxyfile. Edit this to allow recursive searching, to look for java source, to generate call graphs and to generate pictures, and to generate UML style diagrams. Here are all the changes I made to this file.
FILE_PATTERNS     = *.java
RECURSIVE       = YES
HAVE_DOT        = YES
CALL_GRAPH       = YES
CALLER_GRAPH      = YES
DOT_GRAPH_MAX_NODES  = 500
UML_LOOK            = YES
Now, generate documentation with this command:
$ doxygen Doxyfile
To test this out for a blog post, I downloaded the source code to Azureus, a large Java-based project. Here is a graph from one of the files. The top right shows a close-up view of a tiny section of the picture. Even in this simple example, you can see how a zoomed-out view allows you to understand the complex structure.

Doxygen reads code written in C, C++, Java, and many other languages. So if you are in need of UML style diagrams to better understand a class hierarchy, give it a try. It is freely available, easy to install, and easy to use. In addition to class graphs, it can also make call graphs and caller graphs.

Monday, January 09, 2012

Entertainment comes from many activities

I came across an interesting post by Roger Ebert about why movie revenue is dropping. He raises important points about high ticket prices and the poor theater experience.  I agree with most of the post. Many years ago, I had written about the abysmal theater experience and it is nice to see that my concern is shared.

Movies are losing revenue because they are chasing ghosts. The days of mass theater watching is history. Even in India, people are comfortable renting or buying DVDs and watching the movie at home. The theater experience deserves much blame. Theaters function in a world of limited supply. Their glory days were when few movies were released and there was little alternate entertainment. Both assumptions are false today. Thanks to streaming video and increased access to foreign movies, I can watch Korean movies from 2011 if Hollywood produced poor movies. And I don't need to watch movies. I can play video games, watch short videos online, or spend my evening looking at cute kittens online.

Our definition of entertainment is changing. Movies are just a way of spending time having fun. We have many more ways of having fun. Traditional media likes their definition of "entertainment", which is "consuming content". This is a very narrow definition. It has never held true. Children's idea of enjoyment ranges from drawing to making a mess out of newspaper. Adults are similar.

Computers on the Internet are the ultimate tool. I write articles (like this one) and it entertains me. Others use Facebook or Google Plus to keep in touch with their friends and family. Still others contribute to Wikipedia or online forums. Each of these activities provides enjoyment for the person. Each of these are entertainment.

Some of the terminology from the movie industry indicates this bias. "Entertainment Online" and "Entertainment Weekly" are focused entirely on the movie and television industry. Entertainment on MSN is all about movie stars.

Even in the absence of online streaming videos and DVDs, the number of people watching movies would drop. There is a lot more stuff you can do on a computer. And sitting in an uncomfortable chair for ninety minutes staring at a screen is not always entertainment.


Sunday, January 08, 2012

Learning Languages as a puzzle game

I am interested in languages. I tried learning some Japanese when I was young. I gave up rather quickly. It might have been due to the lack of good books, or the lack of teachers. It might have been the lack of native Japanese speakers with whom I could practice my skills. My experience with Japanese instilled a love for learning new languages, and understanding a culture from their perspective, through their language.

A few months ago I started learning Mandarin Chinese. We found ourselves in Beijing, and we came across some great Chinese books lesson books for English speakers. These were useful in teaching characters and sentence construction in a very academic setting. Learning a language in this way is difficult. For one: your pronunciation is all wrong. For another, it is difficult to apply such academic learning to everyday speech.

Next, I tried using audio lessons only. These were a bit more helpful. They had correct pronunciation and useful everyday phrases. I did find that the lessons were geared more towards scoring women rather than real everyday phrases. Early lessons started with, "Where do you want to go to drink?", "Would you like to come to my place?" Useful stuff I'm sure, but not for a boring family man like me. A person like me needs directions to the hospital and the rest room. Further, the audio lessons made it difficult to distinguish between close syllables like 'ga' and 'ka'. You could easily mispronounce words and not realize it. Finally, you have no idea of written Chinese. This is a real limitation. Chinese is written in a strange and confusing way. Starting out with characters is much better than trying to learn them later. Building an initial comfort level helps if you travel.

Finally, we settled on Rosetta Stone (RS for short). RS teaches language in an interesting way. You don't see any of your native language written down. Concepts are introduced entirely through pictures, and you see Chinese to go along with the concepts. It is a strange feeling. You see a big ball and a small ball, and there are strange words. With time, you form the connection between the word and the concept. All without relying on the equivalent word in your native language. I was skeptical of this idea when a friend explained it to me. And RS costs an arm and a leg, which made me even more skeptical.

Luckily for me, I seem to learn Chinese with RS. I use RS as a puzzle game rather than language learning. It is fun to try to understand the game and get better. Now I'm on the final unit in Level One, I am surprised at how much Chinese I can understand. We have a few Chinese friends and I can understand bits of their conversation with their children. It is a great feeling.

Chinese is a devilishly hard language to learn. Right at the beginning, you start out learning pictographs (characters), tones in pronunciation, vocabulary, and grammar. If RS works for Chinese, I suspect it will work for Japanese too. Once I go through RS Mandarin lessons, I might just look at the RS Japanese lessons.

Fun fact for the day: The original rosetta stone was a stone which had three languages inscribed on it. The three languages had the same message. The discovery of this stone led to a breakthrough in deciphering Ancient Egyptian. It is ironic that the Rosetta Stone software teaches you a language by only showing you the language you wish to learn. Rosetta Stone never shows you a rosetta stone.