Friday, September 16, 2011

Linux udev: two-factor USB token turned into a car ignition key

We use two-factor authentication at work. The one-time token generator looks a lot like an automobile key. It is easy to modify Linux to make a silly car noise when the token generator is plugged in. It turns your lame one-time token into a cool ignition key for your laptop.

Here is a video that shows what it looks like:

To do this for your device, create a shell script to make the sound of a car. I have created an audio file called car.mp3 containing the car sound. The shell script plays this file with mplayer. Record the sound of your car, or grab a file from the internet.
$ cat /root/car.sh
#!/bin/bash
# Play a silly car starting sound
mplayer /root/car.mp3 &
Now you need obtain the vendor ID. Disconnect your token generator and run this command:
$ udevadm monitor
Now insert your device. You will see an output like this:
KERNEL[] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1 (usb)
KERNEL[] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0 (usb)
KERNEL[] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/0003:1050:0010.0015 (hid)
UDEV  [] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1 (usb)
UDEV  [] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0 (usb)
UDEV  [] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/0003:1050:0010.0015 (hid)
KERNEL[] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/input/input33 (input)
UDEV  [] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/0003:1050:0010.0015/hidraw/hidraw0 (hidraw)
KERNEL[] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/input/input33/event13 (input)
KERNEL[] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/0003:1050:0010.0015/hidraw/hidraw0 (hidraw)
UDEV  [] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/input/input33 (input)
UDEV  [] add  /devices/pci0000:00/0000:00:1a.0/usb3/3-1/3-1:1.0/input/input33/event13 (input)
The part after the 0003: is your vendor ID. Of course, it will be different for you. Now create a udev rule to specify the action. This is what my rule looks like:
$ cat /etc/udev/rules.d/90-token.rules
# Make a silly sound like the starting of a car
SUBSYSTEM=="usb", ATTR{idVendor}=="1050", MODE="0664", GROUP="plugdev", RUN+="/root/car.sh"
That's it, you don't need to restart anything. If you have done it correctly, then the car sound will play every time you insert your key.

Tuesday, September 13, 2011

Android ARM Assembly: Calling Assembly from Android (Part 8)

This is part eight in a series on learning ARM assembly on Android. This part covers calling Assembly code from Android applications.

Part 1: Motivation and device set up
Part 2: A walk-through of a simple ARM assembly program
Part 3: Registers, memory, and addressing modes
Part 4: Gnu tools for assembly; GCC and GDB
Part 5: Stack and Functions
Part 6: Arithmetic and Logical Expressions
Part 7: Conditional Execution
=> Part 8: Assembly in Android code

The articles follow in series, each article builds on the previous.

Native Development Kit
You've written some assembly code, made it run real fast, and now you want to include it in an Android application. What's the advantage of running little assembly programs that can't use the full feature set available to Android? You need the Android Native Development Kit.

Download the Native Development Kit from the Android development page. Follow the instructions on that page to install it. Installation is little more than unzipping the file in the correct directory.

Sample Application
My sample application has three parts. The first is the most familiar part: this is the assembly source. This is in an assembly source file called jni/multiple.s. This code computes 10y for a given value y.
	@ This file is jni/multiple.s
	.text
	.align	2
	.global	armFunction
	.type	armFunction, %function
armFunction:
	@ Multiply by 10. Input value and return value in r0
	stmfd	sp!, {fp,ip,lr}
	mov	r3, r0, asl #3
	add	r0, r3, r0, asl #1
	ldmfd	sp!, {fp,ip,lr}
	bx	lr
	.size	armFunction, .-armFunction

The assembly code is called from a C stub. The C stub must have a very fixed name: Java_name_of_package_Class_function. This looks downright ugly but is required for Java to look up the correct function. I create a C stub to hold the strange name, and to accept the weird JNI arguments. You don't need to have a C stub, but it makes life easy.

The type jint is a java int that you can treat as a 32 bit int value. Other types are jboolean, jbyte, jchar, jshort, jlong, jfloat, jdouble, and jobject. Notice the signature of the JNI function: it accepts the environment, which is a JNIEnv pointer, and an arbitrary object. Finally, we have the input value, which is an integer. In its implementation, we call our ARM assembly function on the input. The return value is a jint, which indicates that we are returning an integer.
/* This file is jni/hello-jni.c */

#include <jni.h>

/* This stub calls the function. It helps to have a stub like this to
 * save yourself the hassle of defining the function call in
 * Assembly. */
jint Java_com_eggwall_android_assembly_AssemblyActivity_factorialJNI(
	JNIEnv* env, jobject object, jint input) {
	/* Try calling some local code */
	return armFunction(input);
}
Finally, there is a file that defines the sources in a Makefile. This is the jni/Android.mk file. It puts together the stub and the assembly code into a library called "hello-jni".
# This file is jni/Android.mk

LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)

# I want ARM, not thumb.
LOCAL_ARM_MODE := arm

# Name of the local module
LOCAL_MODULE    := hello-jni
# The files that make up the source code
LOCAL_SRC_FILES := hello-jni.c multiple.s

include $(BUILD_SHARED_LIBRARY)

The stub and the assembly input are compiled by calling the following command from the root directory of your project. This is the directory that contains jni/, src/, res/, ..etc. I am assuming the ndk is installed in /usr/local/android-sdk-linux_x86/android-ndk-r6b.
$ ls
AndroidManifest.xml  assets/  bin/  build.properties  build.xml  default.properties
gen/  jni/  libs/  local.properties  obj/  proguard.cfg  res/  src/
$ /usr/local/android-sdk-linux_x86/android-ndk-r6b/ndk-build
Compile arm    : hello-jni <= multiple.s
SharedLibrary  : libhello-jni.so
Install        : libhello-jni.so => libs/armeabi/libhello-jni.so

Finally, there is a Java source code to create the Activity in Android. This code creates an Android application. It extends Activity, and overrides the onCreate method. In this, it creates a TextView, which is a label, and then sets the contents of the label to the return value of the function. It defines a function called factorialJNI which accepts an integer input and returns an integer. It is marked as native, indicating that its implementation is not in Java.

Finally, a static initialisation loads the jni library that was defined in the XML file.
package com.eggwall.android.assembly; 
 
import android.app.Activity; 
import android.widget.TextView; 
import android.os.Bundle; 
 
public class AssemblyActivity extends Activity { 
	@Override 
	public void onCreate(Bundle savedInstanceState) { 
		super.onCreate(savedInstanceState); 
		// Create a new Textview 
		TextView  tv = new TextView(this); 
		// Print the multiple of 13 through assembly. 
		tv.setText("The multiple was: " + factorialJNI(13)); 
		setContentView(tv); 
	} 
	/**
	 * Multiply the number by 10.
	 * @param input, the number to be multiplied
	 * @return the multiple of the number
	 */ 
	public native int factorialJNI(int input); 
	/* This is used to load the 'hello-jni' library on application
	 * startup. The library has already been unpacked into
	 * /data/data/com.eggwall.android.AssemblyActivity/lib/libhello-jni.so at
	 * installation time by the package manager.
	 */ 
	static { 
		System.loadLibrary("hello-jni"); 
	} 

} 
That is a lot of code to run a single assembly function! But now that you've seen the overall structure, you can begin modifying it to run your own assembly code. This is not a good way to experiment with assembly programming, though. Assembly programs can be hard to debug and it helps to have good tools during development. I would recommend developing using emacs, gcc, gdb, and other GNU tools, just as before. When the code is working correctly, hook it into Android Java source. The Android NDK has some useful debugging facilities, but I would consider them options of last resort.
You can download the entire ARM Android assembly example as an Eclipse project here.

Speed versus Complexity
Just because you are calling assembly code does not automatically make your program faster. The Dalvik virtual machine runs most code fairly fast, and the effort to develop assembly code is not worth the minor improvement in code execution speed. Here are some reasons why you might want to use native code.
  1. Legacy code. You have a lot of existing code that you want to plug into Android.
  2. Optimised code. You have CPU-intensive code that has been carefully optimised.
Assembly code takes a longer to develop, more effort to maintain, and is difficult to port. There are few people who can read assembly code, so your code will be out of reach of many programmers.

A careful consideration of implementation speed and code complexity will lead you to the correct balance.
Things to try out
  1. Create a new function called factorial(int a) and its corresponding stub. Call it from Java.
  2. Create a new source file called factorial.s, and put the function in there. Modify jni/Android.mk to run it correctly.
  3. Try adding an input area, where a number is entered. Pass this number to the assembly source code, and print out its factorial.
My example builds upon the hello-jni example that ships with the NDK. You can read the other NDK samples for inspiration. You can learn more by reading the JNI reference. It covers the data types, and the various functions available from native code.

Monday, September 12, 2011

Android ARM Assembly: Conditional execution (Part 7)

This is part seven in a series on learning ARM assembly on Android. This part covers conditional execution.

Part 1: Motivation and device set up
Part 2: A walk-through of a simple ARM assembly program
Part 3: Registers, memory, and addressing modes
Part 4: Gnu tools for assembly; GCC and GDB
Part 5: Stack and Functions
Part 6: Arithmetic and Logical Expressions
=> Part 7: Conditional Execution
Part 8: Assembly in Android code

The articles follow in series, each article builds on the previous.

Compare and Jump
The previous lessons touched on a variety of microprocessor instructions. Let's move to conditional execution. These are the basis of all the if, while, for loops in programming languages.

Conditions are implemented in microprocessors using the Status Register CPSR. The various bits on the status register are set using instructions. Then, you can jump to a different position in the code if the specific bit is set or unset. While this scheme looks primitive, it forms the basis of all conditional execution in every language on computers. Let's see the status register again.

The conditions in the status register are:
  1. Negative: the result was negative (bit 31 of the result was set)
  2. Zero: the result was zero (all bits of the result were unset)
  3. Carry: integer addition, subtraction or shifts produce a carry or borrow (result bits could be anything)
  4. oVerflow: there was carry or borrow during signed addition or subtraction (result bits could be anything)
The status register has other bits relating to the different processor modes, but we can ignore them for now. The bits in the CPSR are set on four instructions: CMP, CMN, TEQ, TST. Let's see these instructions.

CMP Rn, shifter_operand

This performs Rn - shifter_operand, throws away the result and updates the status register. So doing a CMP R2, R2 would produce a zero output, and the Zero condition would be set. It is important to note that the result is never stored anywhere. From your perspective, you are asking the processor what the result would be like.

This is a full list of the comparison instructions.

InstructionRough IntentionThe condition flags are set based on this value
CMP Rn, shifter_operand(Addition)Rn - shifter_operand
CMN Rn, shifter_operand(Negation)Rn + shifter_operand
TST Rn, shifter_operand(Test)Rn & shifter_operand
TEQ Rn, shifter_operand(Test Equivalence)Rn ^ shifter_operand

Once the condition is set, you can use that condition to jump to a specific label. The easiest jump instruction doesn't care for flags. We saw it in part 5, where it was used to return execution back from our function.
        bx      lr
This instruction branched execution to the address contained in the Link Register. This could be done with any register, though you better be sure that the register contains a valid address. It also needs to be a 32-bit aligned address. All ARM instructions are 32-bit (word) aligned. This isn't Intel, every instruction is exactly 32 bits long. BX is a Branch, while BL is a Branch and Link.

Branch just jumps to the address specified in the register.

Branch and Link stores the address of the existing Program Counter in the Link Register, in case you want to jump back. This is something you do all the time, so there is one instruction to handle it.

ARM and Thumb Instructions
Most new ARM processors support two different instruction sets: ARM and Thumb. Thumb instructions are a subset of ARM instructions and are 16 bits each. Going back and forth between ARM and Thumb is possible, though it should be done correctly. The branch instructions ending in "X" do this. A "BX" instruction allows you to switch between ARM and Thumb while the "B" instruction doesn't. In general, you should use "BX" and "BLX" when you are unsure of the resulting code. If you are certain, you can use "B" for branch, and "BL" for branch and link.

Condition Codes
How about adding the actual condition? The condition goes as a suffix to the instruction. Say you only want to check if the Zero condition was true. Then, the condition is called EQual, with mnemonic EQ. So this operation only branches when the zero condition is set:
BLEQ lr

The full list of conditions is:
SuffixMeaningCondition tested
EQEqualZ == 1
NENot EqualZ == 0
CS or HSUnsigned Higher or Same (Carry Set)C == 1
CC or LOUnsigned Lower (Carry Clear)C == 0
MIMInusN == 1
PLPLus or ZeroN == 0
VSOverflow (V Set)V == 1
VCNo Overflow (V Clear)V == 0
HIUnsigned HigherC == 1 and Z == 0
LSUnsigned Lower or SameC == 0 or Z == 1
GESigned Greater Than or EqualN == V
LTSigned Lesser Than N != V
GTSigned Greater Than Z == 0 and N == V
LESigned Less Than or Equal ToZ==1 or N != V
ALAlways

So you could have an instruction BXNE r3, which will branch to r3, only if the previous comparison did not set the Zero condition code.

Conditional Execution
Let's look at a single if-then loop to see how to run a simple conditional in Assembly.
 1         .section        .data
 2         .align 2
 3 higher:
 4         .asciz "Yes, r2 is higher than or equal to r1\n"
 5 lower:
 6         .asciz "No, r2 is lower than r1\n"
 7 
 8         .text
 9         .align  2
10         .global main
11         .type   main, %function
12 main:
13         stmfd   sp!, {fp, lr}
14 
15         @ Load some values
16         mov     r1, #32
17         mov     r2, #33
18 
19         @ Check if r2 is lower than r1
20         cmp     r2, r1
21 
22         @ If it is greater or equal, jump ahead
23         bge     greaterOrEqual
24 
25         @ Otherwise it was lower
26         ldr     r0, =lower
27         @ Now skip past to the common point again
28         b       common
29 
30 greaterOrEqual:
31         ldr     r0, =higher
32 
33 common:
34         @ Print the message
35         bl      puts
36 
37         @ Return 0
38         mov     r0, #0
39         ldmfd   sp!, {fp, lr}
40         bx      lr

This program loads values in r2 and r3 and then compares the two. If (r2-r1) is greater than or equal to 0, then we skip to the greaterOrEqual label. Since there are only one kind of instructions (ARM instructions), we don't bother with the "X" variant of the branch. At the greaterOrEqual label, we move the string corresponding to the higher message. Otherwise, we move the string corresponding to the lower message into r0. Either case, we want to get to the common code, which prints the message, and returns 0. If we didn't return to the common case, it would continue running instructions linearly, and load r0 with the higher message. Then, we return 0, like every well behaved main() function should.

The Program Counter holds the address that will be executed next. Since it is also r15, you can directly modify it. This is a terrible idea in most cases, so you should have a good reason to do this. In theory, you could implement logic by conditionally moving values into the Program Counter. For example, you could add 8 to the current value in the PC to skip an instruction. In practice, you will not do this because it is prone to errors. Such logic fails when the program is modified, and the address of instructions changes.


Every Instruction is Conditional
Now that you know what conditional execution looks like, you are ready for a real twist. In ARM architectures, every instruction is conditional. So the ADD, SUB, LDR, STR instructions can all have these condition codes appended to them. We could write instructions containing very few BX or BLX calls.

In practice, however, you should use conditional operations only for a single instruction or two. If you find that a lot of operations act on the same condition, you should branch and skip all these operations. This is much more efficient, since ARM processors fetch more than one instruction and act upon parts of it in parallel. If you can start to execute the next few instructions, your code is considerably faster. So as a rule, use conditional operations only for two instructions or fewer. If you have a longer list of conditional operations that depend on the same condition, pull them out in a separate routine.

In addition to every instruction being conditional, every instruction can set the CPSR flags. For this, you need to append the "S" flag (for Status) to every instruction. Here is what a MOV instruction really looks like:
MOV{condition_code}{S} Rd, shifter_operand
This allows you to set the status flag without performing a CMP, saving another instruction. The status flag can be set on most operations you have seen till now. So you can automatically set the condition codes when performing arithmetic operations.

Exercises
Try your hand at these to see if you understand branches.
  1. What does the MOVGE r0, r2 instruction do?
  2. Rewrite the program to use just one branch rather than two.
  3. Rewrite the program using conditional move to avoid a branch. In this case, a conditional move is perhaps cleaner since the condition only changes one instruction.
  4. Write a function isZero(n) to accept an unsigned number n and to return 1 if n is equal to 0 and n otherwise.
  5. Improve on isZero() to write almostFactorial(n) which returns n*(n-1) if n is not zero, and 1 otherwise.
  6. Try your hand at turning almostFactorial(n) to factorial(n) which returns n! for small values of n. As n gets large, you run out of bits, so you will need to test your function with small inputs.
  7. Write a simple while { .. } loop in C and let GCC compile it to assembly. See if you can make sense of the code. GCC can compile code with optimisations turned on with the "-O2" flag. Try to read the assembly output with and without optimisations. See if you can figure out the optimisation logic.
This marks the end of all the functional pieces of ARM architecture. You should be able to read the disassembly of nearly any ARM program now. Isn't that impressive? Look up unknown instructions in the reference manual. As always, the ARM Reference Manual has more information on all condition codes, conditional execution and instructions that accept conditions and can set the flags.

The final piece will be to integrate this with Android. We will see how to call assembly code from Android to be able to run fast processing directly in assembly.


Sunday, September 11, 2011

Book Review: The Big Necessity by Rose George

Would you be worried if a Boeing 747 worth of people died? What if it were all children. How about 10 Boeing 747s of children dying, everyday?

Diarrhea causes 1.5 million deaths in children every year. That's roughly ten 747s worth every single day. And diarrhea is simple to cure. Good sanitation removes the occurrence of the disease. The reason why so many children still die of diarrhea is because of poor sanitation. Most deaths are in third-world countries, and the West is largely unaware of the problem because it is embarrassing to talk about poop.

Rose George's book, "The Big Necessity" talks about various aspects of poop in her book: from the science of waste water treatment, to the design of eco-friendly toilets and the problems of third-world sanitation. It is a delightful read which brings our awareness to many poop-related issues.

We don't really talk about poop, even though many of our swear-words include fecal matter. And that's largely the problem. We can solve problems of nutrition because we can casually talk about African children only eating 400 calories a day. What we cannot do is talk about the fecal matter that is contaminating their food. So we can talk about to 400 calories they are eating, but not about the crap they are eating with it.

Sanitation brings immense benefits. Children living in areas with good sanitation fall ill less, and are more likely to go to school. Girls' education is directly related to good sanitation. Yet, we're too civilised to talk about improving sanitation.

The book is an insightful read. I would highly recommend it to anyone who has witnessed poverty and wondered what can be done to improve health of the poor. It talks about open defecation in India, the general lack of sanitation in India, and glimmers of hope from rural China. Each chapter deals with a single aspect of sanitation. You will learn about how waste water is treated in the West, and why it is a poor design. You will learn about the difficulties in marketing toilets to the poor. It should be an easy sell, but there are reasons why it is hard, and the book talks about creative solutions.



Saturday, September 10, 2011

Android ARM Assembly: Arithmetic and Logical Operations (Part 6)

This is part six in a series on learning ARM assembly on Android. This part covers arithmetic and logic expressions.

Part 1: Motivation and device set up
Part 2: A walk-through of a simple ARM assembly program
Part 3: Registers, memory, and addressing modes
Part 4: Gnu tools for assembly; GCC and GDB
Part 5: Stack and Functions
=> Part 6: Arithmetic and Logical Expressions
Part 7: Conditional Execution
Part 8: Assembly in Android code

The articles follow in series, each article builds on the previous.

We covered a lot of data moving in the previous articles. Let's see what processing we can do.

Arithmetic
The ARM processors are packed with the usual arithmetic operators. If you have experience with some other processor, you might find the possible operations somewhat limited. This follows the Reduced Instruction Set principles: there are a few basic primitives and you are required to produce everything using these. So you lack complex instructions, but make up for it with faster code. More importantly, you make up for it with code that runs quiet, cool, and uses minimal battery life.

All arithmetic instructions look identical. Remember the MOV instructions?
  1. An immediate value. Ex.
    mov	r0, #3
  2. A register. Ex:
    mov	r0, r1
  3. A register with an offset. Ex:
    mov	r0, r1, #4
  4. A register with a shift. Ex:
    mov	r0, r1, lsr #3
  5. A register with a register specifying the shift. Ex:
    mov	r0, r1, lsr r2
The first part is called the destination register, since that's where the value is placed. The second part is called the shifter_operand, as a short-hand for one of many possibilities. In other words, MOV instructions have this structure:
MOV Rd, shifter_operand. This causes: Rd = shifter_operand

Arithmetic instructions follow the same structure. Instead of having one register, they have two registers. This is the general structure of arithmetic instructions:
OPERATION Rd, Rn, shifter_operand

OPERATION can be one of:
ADD (Addition)Rd = Rn + shifter_operand
SUB: (Subtraction) Rd = Rn - shifter_operand
RSB: (Reverse Subtraction) Rd = shifter_operand - Rn
ADC: (Add with Carry) Rd = Rn + shifter_operand + Carry
SBC: (Subtract with Carry) Rd = Rn - shifter_operand - Carry
RSC: (Reverse Subtract with Carry) Rd = shifter_operand - Rn - Carry
AND: (Logical And) Rd = Rn & shifter_operand
BIC: (Logical Bit Clear)Rd = Rn & ! shifter_operand
ORR: (Logical Or) Rd = Rn | shifter_operand
EOR: (Logical Exclusive Or)Rd = Rn ^ shifter_operand

The instructions are regular and follow the same pattern. The only unexplained thing is the Carry.

Carry is a bit in the status register CPSR that is set when an addition overflows past 32 bits. This could happen when adding 1 to 0xff ff ff ff. The resulting number is 1 << 32, which cannot be represented in 32 bits. It needs 33 bits. In such a case, the carry bit will be set. Addition with carry allows you to automatically add the carry bit.

The carry bit is also set when subtraction produces underflow: If you subtract INT_MAX from 0, the resulting number cannot be expressed in 2's complement because it needs to borrow a bit. In such a case, the carry is set again.

Multiply
Multiplications only take a register argument. They don't take a shifter_operand.

MLA Rd, Rm, Rs, Rn Multiply with Accummulate Rd = Rm * Rs + Rn
MUL Rd, Rm, Rs Multiply Rd = Rm * Rs
SMLAL Rd32, Rd64, Rm, Rs Signed Multiply with Accummulate LongRd64,Rd32 += Rm * Rs
SMULL Rd32, Rd64, Rm, Rs Signed Multiply Long Rd64,Rd32 = Rm * Rs
UMLAL Rd32, Rd64, Rm, Rs Unsigned Multiply with Accummulate Long Rd64,Rd32 += Rm * Rs
UMULL Rd32, Rd64, Rm, Rs Unsigned Multiply Long Rd64,Rd32 = Rm * Rs

MLA and MUL are easy to understand. The difference between Signed and Unsigned instructions is that signed instructions use bit 31 as a sign bit, and 2's complement arithmetic. If you know you have unsigned numbers, using the unsigned variants allows you to express higher numbers.

The Long version of the instructions stores the lower 32 bits of the result in Rd32 and the higher bits in Rd64. MLA and MUL store only the lower 32 bits of the result, so if you multiply two very large numbers, you will get incorrect results.

A multiply is slower than using the barrel shifter. If you need to multiply by 4, you are much better off supplying "lsl #2".

Links
Arithmetic and logical instructions are very easy in ARM processors. I have covered the instructions available in v5 of ARM processors. These instructions are guaranteed to be available nearly everywhere. v6 adds a few instructions, but the regular structure means that once you understand the v5 set, you can immediately learn the v6 additions.

You can try the following to test your knowledge till now:
  1. Write a function to input an integer y and return y * (y-1). Check by calling it inside main() and using the program from Part 5 to print the value.
  2. Test out the behaviour of BIC using a program and different inputs. Why is it called Bit Clear?
As before, the ARM Reference Manual has more information on all operations.


Friday, September 09, 2011

Android ARM Assembly: Stack and Functions (Part 5)

This is part five in a series on learning ARM assembly on Android. This part covers the stack and function calling.

Part 1: Motivation and device set up
Part 2: A walk-through of a simple ARM assembly program
Part 3: Registers, memory, and addressing modes
Part 4: Gnu tools for assembly; GCC and GDB
=> Part 5: Stack and Functions
Part 6: Arithmetic and Logical Expressions
Part 7: Conditional Execution
Part 8: Assembly in Android code

The articles follow in series, each article builds on the previous.

Stack
The stack is a linear data structure in memory that holds function-specific information. When a function call is made, the stack is grown, and we place local variables on the stack. When a function returns, we shrink the stack, and the local variables are lost. The stack is a temporary structure. For permanent non-constant variables, we need to allocate space on the heap. The heap is an operating system (OS) concept that we'll cover later. It is an OS concept because the processor is unaware of its presence. It is a programming paradigm rather than a facility provided by the silicon on the chip. The stack, on the other hand, is a processor-level concept. Life is a lot simpler if the processor supports stack primitives.

The stack grows up: to allocate space on the stack, you subtract multiples of 4 from the Stack Pointer. If you need 5 variables, you need to subtract 20 from the Stack Pointer. (In many documents, the memory is laid out from highest address to the lowest address. In such documents, the stack is said to grow downwards. Unfortunately, this is a standard convention even though humans think of stacks growing upward, so I explain everything with the stack growing upward.)

If you write memory as a linear list of addresses, you can see why this makes sense. Here is the memory laid out from top to bottom as a linear list going from the lowest address to the highest address. All addresses and data are in hexadecimal.
Data Location
ab 0000
cd 0004
de 0008
xx ...
f0 0080
f0 0084
f2 0088
f3 008c
f4 0090 <- SP
xx ...
ff ffff
The Stack Pointer is pointing at location 90, which means that it contains the value 0x90. The value at the top of the stack is f4, and this is the contents of memory location 0x90.

If we want to allocate two variables, we need to subtract 4 * 2 = 8 from the stack pointer. Say we store values 99 and AA at these locations, then the stack looks like this:
Data Location
ab 0000
cd 0004
de 0008
xx ...
f0 0080
f0 0084
99 0088 <- SP
AA 008c
f4 0090
xx ...
ff ffff
We ended up clobbering the existing values at those locations. To deallocate space from the stack, you add to the SP. Every allocation on the stack should be matched by a de-allocation. Otherwise you get out of sync, and read the wrong value. Functions start out by storing all the local registers on the stack. Then, on top of the stack, they allocate function-specific data.

Data Location
ab 0000
cd 0004
de 0008
xx ...
current function's stack 0080 <- SP
SP 0084
LR 0088
PC 008c
previous function's stack 0090
xx ...
ff ffff
As a result, the new function has space to grow, and when it returns, the previous variable values are moved from the stack to the registers. From the perspective of the calling function, nothing has changed.

Writing Functions
Let's look at a program to see how variables are stored on the stack. I'll use a long program, because we need a main function and another function that we can change to our liking. The main() function has a fixed signature, so we need something different.
 1         .text
 2         .align  2
 3         .global multiplyByTen
 4         .type   multiplyByTen, %function
 5 multiplyByTen:
 6         stmfd   sp!, {fp, ip, lr}
 7         mov     r3, r0, asl #3
 8         add     r0, r3, r0, asl #1
 9         ldmfd   sp!, {fp, ip, lr}
10         bx      lr
11         .size   multiplyByTen, .-multiplyByTen
12 
13         .section        .rodata
14         .align  2
15 .LC0:
16         .ascii  "The number is %d\012\000"
17 
18         .text
19         .align  2
20         .global main
21         .type   main, %function
22 main:
23         stmfd   sp!, {fp, lr}
24 
25         mov     r0, #32
26         bl      multiplyByTen
27 
28         mov     r1, r0
29         ldr     r0, .L3
30         bl      printf
31 
32         mov     r0, #0
33         ldmfd   sp!, {fp, lr}
34         bx      lr
35 
36         .align  2
37 .L3:
38         .word   .LC0
39         .size   main, .-main
The program is long, but you should be able to understand most of it by now.
This is the function we made:
 5 multiplyByTen:
 6         stmfd   sp!, {fp, ip, lr}
 7         mov     r3, r0, asl #3
 8         add     r0, r3, r0, asl #1
 9         ldmfd   sp!, {fp, ip, lr}
10         bx      lr
This function returns the input multiplied by 10 through its creative uses of shifts. Multiplying y by 10 is the same as (y << 3 + y << 1) = (8y + 2y). The function is made to illustrate how function calling works. The first four inputs to a function are kept in r0-r3. Here we have just one input, so it is in r0. Line 7 moves 8*r0 to r3, and line 8 adds 2*r0 to this, and stores it back in r0. The return value of the function is kept in r0. However, we also have line 6 and line 9. What are they doing?
 6         stmfd   sp!, {fp, ip, lr}
This is a multi-register move operation. This moves the registers FP,IP,LR into the area specified by the register SP. Since SP is the stack pointer, this is the same as pushing registers FP,IP,LR to the top of the stack in a single operation. FP is another name for r11. Once this is done, the stack pointer is updated since it has an exclamation mark. The order of specifying the registers doesn't matter. ARM processors have a bitmask to specify which registers are written, and they are always written in the same order, irrespective of the order you specify them in. GCC might complain if you specify them in the wrong order, though. STM is the base operation, there are variants depending on how you want the stack to grow. STM is STore Multiple, and it comes in variants like IA: Increment After, IB: Increment Before, DA: Decrement After, DB: Decrement Before. These describe when the stack pointer is updated (before writing to memory, or after), and whether it is incremented or decremented. These operations also have handy mnemonics describing their use for operating with stacks. Linux uses a Full Descending type of a stack. Full because the stack pointer points to the last used location, and Descending because the stack grows downward, when memory listed from highest address to lowest address. If you use the FD versions of these operations, your code will interoperate well with other libraries.
 9         ldmfd   sp!, {fp, ip, lr}
This is another multi-register move that undoes the action on line 6. This reads back the values that were written earlier, popping them from the stack. The combined effect of lines 6 and 9 is to store the important register values on the stack and then restore them. This allows the function to modify them in the main body. As with STMFD, the values are specified in a bitmask, so they are always read in the same order.

The function semantics require that r0-r3 are the only registers that can be clobbered during a function call. If you need more than four registers in a function, you need to include them in the STMFD and LDMFD lines to ensure they are restored before exit. If you need a lot of local variables in a function, you need to keep the rest in memory using load and store commands. You can safely keep variables past the stack, in locations like SP-4, SP-8. When your function returns, the stack pointer is restored and the next function can reuse this storage. You can either grow the stack by subtracting multiples of 4 from it, or you can refer to offsets from the stack. The ARM Procedure call Standard mentions that r4-r8, r10, r11, and the SP need to be preserved in a subroutine. So if you are modifying these registers, ensure that you store them at the start and load them back before the end of your routine.

If you need to pass more than four input values to a function, you need to store them on the stack. The function needs to read the exact number of values from the stack. Finally, if you need to return more than one value, you need to modify values in the stack, or return them in r0-r3, and hope your code is never called by anyone else's code.

Exercises
Try experimenting with the following:
  1. Break the function at various points with GDB and examine the memory starting at the Stack Pointer. Does the stack pointer point to the last variable on the stack, or past the last variable on the stack?
  2. Try removing the STMFD and LDMFD in the multiplyByTen function. Does the function still work?
  3. Change multiplyByTen to return the value of SP. Print it using '%p' in the printf. Does it match what GDB says?
  4. Change the multiplyByTen to modify IP, LR and FP in the function and assert that they are indeed restored. Verify the modification and the restoring through GDB.
  5. Rewrite the function to use LDR and STR rather than multi-register moves. You will need to modify the stack pointer with each load and store.
The ARM Reference Manual has more information on mutiple loads and stores, including information about the IA,IB,DA,DB, and FD versions, and what they mean.

Thursday, September 08, 2011

Android ARM Assembly: GCC and GDB (Part 4)

This is part four of a multi-part series. This part covers the Gnu C Compiler and the Gnu Debugger. Both are essential development tools.

Part 1: Motivation and device set up
Part 2: A walk-through of a simple ARM assembly program
Part 3: Registers, memory, and addressing modes
=> Part 4: Gnu tools for assembly; GCC and GDB
Part 5: Stack and Functions
Part 6: Arithmetic and Logical Expressions
Part 7: Conditional Execution
Part 8: Assembly in Android code

The articles follow in series, each article builds on the previous.

In case you already know how to use Gnu tools in the Intel world, they work in the same way for ARM machines.

GCC
GCC is the default compiler on Linux systems. It is a versatile compiler. We need very few options for our ARM assembly development.
$ gcc -S source.c
This is perhaps the most useful way in which we can use gcc. It creates an assembly source.s file which corresponds to the C source code. This is great for learning how gcc translates specific C constructs to assembly.
$ gcc -o hello source.c
Compile C file source.c into an executable called hello.
$ gcc -o hello source.s
Compile assembly file source.s into an executable called hello. This is probably what you will be using all along.
The full GCC manual can be downloaded online.

GAS
Gas is the GNU assembler, and is the default assembler on Linux systems. My tutorials don't call gas directly. If you invoke gas yourself, you get an object file that you need to link against glibc using the linker. Invoking gcc on assembly source code calls gas and the linker, so I prefer to do that.

Knowledge of the assembler helps when you want to use specific assembler directives. The gas manual is available online.

GDB
While programming assembly, you often need something to show you what the state of the machine is. You need to see every register and every memory location. This is where GDB comes in. It is free, it is easy to use. Here is a gentle introduction to gdb to cover most assembly needs.
$ gdb hello
Start gdb with the executable called hello. It prints a helpful message, and drops you to the (gdb) prompt. This prompt is where you type all commands.
(gdb) disassemble main
Dump of assembler code for function main:
   0x000083d0 <+0>:	push	{r11, lr}
   0x000083d4 <+4>:	add	r11, sp, #4
   0x000083d8 <+8>:	sub	sp, sp, #8
   0x000083dc <+12>:	str	r0, [r11, #-8]
   0x000083e0 <+16>:	str	r1, [r11, #-12]
   0x000083e4 <+20>:	ldr	r0, [pc, #20]	; 0x8400 
   0x000083e8 <+24>:	bl	0x82e8 
   0x000083ec <+28>:	mov	r3, #0
   0x000083f0 <+32>:	mov	r0, r3
   0x000083f4 <+36>:	sub	sp, r11, #4
   0x000083f8 <+40>:	pop	{r11, lr}
   0x000083fc <+44>:	bx	lr
   0x00008400 <+48>:	andeq	r8, r0, r8, lsl #9
End of assembler dump.
Look at the assembly source of any function. In this case, we looked through the assembly output of main, the entry point to our hello word function. There are some familiar instructions here already. Disassembly can be done for any executable. You don't need the source code for the program.
(gdb) break *0x000083e4
Breakpoint 1 at 0x83e4
This sets a breakpoint at the specified memory address. When you run the program, the execution will break at that location, and you will be dropped back on the gdb shell to inspect the state.
(gdb) run
Starting program: /home/user/ARM/hello 

Breakpoint 1, 0x000083e4 in main ()
Alright, we started the program and it broke exactly where we asked it to. This is a great time to examine the registers and the memory.
(gdb) info registers
r0             0x1	1
r1             0xbed9a924	3201935652
r2             0xbed9a92c	3201935660
r3             0x83d0	33744
r4             0x0	0
r5             0x0	0
r6             0x0	0
r7             0x0	0
r8             0x0	0
r9             0x0	0
r10            0x40025000	1073893376
r11            0xbed9a7d4	0xbed9a7d4
r12            0xbed9a840	3201935424
sp             0xbed9a7c8	0xbed9a7c8
lr             0x4003b508	1073984776
pc             0x83e4	0x83e4 
cpsr           0x60000010	1610612752
This command shows you the register state. As you can see, there are the standard registers r0-r12, and SP, LR, and PC. You can also see the status register CPSR printed in full. The function calling convention on ARM is that the first four arguments to a function are stored in r0-r3. Let's verify that this is the case.
The function we are looking at is main(int argc, char* argv[]). It has two arguments argc and argv, which should be in r0 and r1 respectively. r0 should contain argc, or the number of commandline arguments given. We invoked the program with no arguments, so the commandline arguments consist of only the program name. argc should be 1, which is what r0 contains
argv is trickier. It is a pointer to pointers containing strings. This is partly confirmed by r2, which is a large hex number: 0xbed9a924. It could be a memory location. Let's find out.
(gdb) x/w 0xbed9a924
0xbed9a924:	0xbed9aa0d
The "x/w" stands for eXamine memory/ parse as Word. Memory locations could contain anything, so we want to parse it as a 32 bit word to start out. The contents look a lot like the address itself. Let's see what the next few contents hold.
(gdb) x/12w 0xbed9a924
0xbed9a924:	0xbed9aa0d	0x00000000	0xbed9aa22	0xbed9aa32
0xbed9a934:	0xbed9aa3d	0xbed9aa47	0xbed9af37	0xbed9af43
0xbed9a944:	0xbed9af80	0xbed9af8f	0xbed9afa2	0xbed9afab
x/12w stand for eXamine memory/show me 12 Words. As you can see, all the contents of memory look like they are addresses. Let's see what is at the first address: at 0xbed9aa0d
(gdb) x/w 0xbed9aa0d
0xbed9aa0d:	0x6d6f682f
Hmm, that doesn't look like an address. This should be a string, and rather than converting the 0x6d 0x6f 0x68 ... to ascii myself, I'll let gdb help me out.
(gdb) x/s 0xbed9aa0d
0xbed9aa0d:	 "/home/user/ARM/hello"
We are asking gdb to "eXamine memory / as String". gdb knows that C strings are null terminated, so it helpfully walks over the successive memory locations, interpreting each byte as ASCII, till it comes to a null terminator. So we have verified that argv[1] is a pointer to a string, containing the program name. Let's see what the next few memory addresses hold.
(gdb) x/10s 0xbed9aa0d
0xbed9aa0d:	 "/home/user/ARM/hello"
0xbed9aa22:	 "SHELL=/bin/bash"
0xbed9aa32:	 "TERM=xterm"
0xbed9aa3d:	 "USER=user"
0xbed9aa47:	 "LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:...
0xbed9ab0f:	 ":*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tl...
0xbed9abd7:	 "eb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=0"...
0xbed9ac9f:	 ":*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.ti...
0xbed9ad67:	 "v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=0...
0xbed9ae2f:	 "yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv...
We "eXamine 10 memory locations as String", and we find that we have run past the end of argv. We are seeing the environment variables that are specified by the Bash shell, including the name of the shell, the username, and the colors for the different file types. I forgot, where were we?
(gdb) where
#0  0x000083e4 in main ()
We can find out our state in the execution by asking 'where'. Though we could just as easily have looked up the Program Counter register for this simple program.
(gdb) disassemble main
Dump of assembler code for function main:
   0x000083d0 <+0>:	push	{r11, lr}
   0x000083d4 <+4>:	add	r11, sp, #4
   0x000083d8 <+8>:	sub	sp, sp, #8
   0x000083dc <+12>:	str	r0, [r11, #-8]
   0x000083e0 <+16>:	str	r1, [r11, #-12]
=> 0x000083e4 <+20>:	ldr	r0, [pc, #20]	; 0x8400 
   0x000083e8 <+24>:	bl	0x82e8 
   0x000083ec <+28>:	mov	r3, #0
   0x000083f0 <+32>:	mov	r0, r3
   0x000083f4 <+36>:	sub	sp, r11, #4
   0x000083f8 <+40>:	pop	{r11, lr}
   0x000083fc <+44>:	bx	lr
   0x00008400 <+48>:	andeq	r8, r0, r8, lsl #9
End of assembler dump.
gdb shows a helpful arrow showing where we are. We can set another breakpoint if we like. After a long debugging session, you might forget which breakpoints you have set.
(gdb) info breakpoints 
Num     Type           Disp Enb Address    What
1       breakpoint     keep y   0x000083e4 
	breakpoint already hit 1 time
        info registers
3       breakpoint     keep y   0x000083f0 
        info registers
You can see all breakpoints with 'info breakpoints' and you can delete breakpoints with 'delete x', where x is the number of the breakpoint. When deleting breakpoints, gdb doesn't produce any output if it is successful.
(gdb) delete 3
A very helpful technique when debugging for loops is to run some commands automatically when a breakpoint is hit. This is done with the 'commands' directive as folows.
(gdb) break *0x000083ec
Breakpoint 4 at 0x83ec
(gdb) commands 4
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>info registers
>end
(gdb) 
Now, when the breakpoint is hit, gdb will automatically run the 'info registers' command. Let's continue running this program so it can hit the next breakpoint.
(gdb) continue
Continuing.
Hello World

Breakpoint 4, 0x000083ec in main ()
r0             0xc	12
r1             0x0	0
r2             0x40153228	1075130920
r3             0x83d0	33744
r4             0x0	0
r5             0x0	0
r6             0x0	0
r7             0x0	0
r8             0x0	0
r9             0x0	0
r10            0x40025000	1073893376
r11            0xbed9a7d4	0xbed9a7d4
r12            0x0	0
sp             0xbed9a7c8	0xbed9a7c8
lr             0x83ec	33772
pc             0x83ec	0x83ec 
cpsr           0x60000010	1610612752
gdb ran past the puts(), and printed "Hello World" on the screen. It hit the breakpoint, and automatically showed us the registers. Great. Let's finish up by continuing.
(gdb) continue 
Continuing.
[Inferior 1 (process 17307) exited normally]
(gdb) info registers 
The program has no registers now.
The program is done. We can't examine registers or memory because it isn't running anymore.
The full GDB documentation is available online.

Links
Now that you know how to examine registers and memory, you can write ARM programs and verify that they do the right thing. You can break at various locations and verify that your load store and move instructions are working as expected.

The Gnu tools are ubiquitous and mature. Once you learn how to use gdb and gcc on ARM, you can easily use the same tricks on another platform like Intel. Here are all the manual links again:
  1. GCC manual
  2. Gas (Gnu Assembler) manual
  3. GDB manual

Wednesday, September 07, 2011

Android ARM Assembly: Registers, Memory and Addressing Modes (Part 3)

This is part three of a series on learning ARM assembly on Android. This part covers registers, memory and addressing modes.

Part 1: Motivation and device set up
Part 2: A walk-through of a simple ARM assembly program
=> Part 3: Registers, memory, and addressing modes
Part 4: Gnu tools for assembly; GCC and GDB
Part 5: Stack and Functions
Part 6: Arithmetic and Logical Expressions
Part 7: Conditional Execution
Part 8: Assembly in Android code

The articles follow in series, each article builds on the previous.

Registers

On ARM processors you have 16 registers. Actually, that is not entirely true. ARM processors have 32 registers, each 32 bits wide. ARM processors have different programming modes to distinguish user-level and system-level access. Only some registers are visible in each mode. In the user-level mode you can access 16 registers. This is the mode you will use most frequently, so you can ignore the entire mode stuff for now. By the time you need to write Linux device drivers, you will be well past this introduction.

The registers are called r0-r15, and the last four are special.
r12: IP, or Intra-Procedure call stack register. This register is used by the linker as a scratch register between procedure calls. A procedure must not modify its value on return. This register isn't used by Linux gcc or glibc, but another system might.
r13: SP, or Stack Pointer. This register points to the top of the stack. The stack is area of memory used for local function-specific storage. This storage is reclaimed when the function returns. To allocate space on the stack, we subtract from the stack register. To allocate one 32-bit value, we subtract 4 from the stack pointer.
r14: LR, or Link Register. This register holds the return value of a subroutine. When a subroutine is called, the LR is filled with the program counter.
r15: PC, or Program Counter. This register holds the address of memory that is currently being executed.
There is one more register, the Current Program Status Register (CPSR) that contains values indicating some flags like Negative, Zero, Carry, etc. We'll visit it later, you can't read and write it like a normal register anyway.

In assembly language, these are all the variables you have access to. If you need to perform any computation, they need to operate upon these registers. Since there are a small number of registers, you need to be judicious in their use. A lot of optimisation boils down to being miserly with register allocation.

Here are all the data move instructions:
	mov	r0, r1
This moves the value from r1 to r0. This achieves r0 = r1
 	mov	r0, r1, lsl #2
Logical Shift Left (lsl) variable r1 by 2 bits, and then move to r0. This achieves r0 = r1 << 2. There is a limitation on how much rotation is allowed: rotation is specified using five bits, so any value of rotation between 0-31 is allowed. The effect of a single left-shift is multiplication by 2.

 	mov	r0, r1, lsr #3
Logical Shift Right (lsr) variable r1 by 3, and then move to r0. This achieves r0 = (int) (r1 >> 3) The effect of a single left-shift is (the integer part of) division by 2.
 	mov	r0, r1, lsl r2
Logical Shift Left (lsl) variable r1 by the amount given in r2, and then move to r0. The value in register r2 should be between 0 and 31. This achieves r0 = (int) (r1 << r2)
LSL and LSR are not the only shifts. There are also Arithmetic Shifts: ASL and ASR that maintain signed-ness. ASR is different from LSR for negative numbers. Negative numbers have leading bits set 1, and ASR right shifting propagates the 1 bits. In case you don't remember how negative numbers are represented, you might want to read a One's Complement review. ASL is the same as LSL because the sign extension doesn't happen when shifting left. Finally there is ROtate Right, or ROR. Rotation preserves all information, the bits are cycled like the barrel of a six-shooter. You can rotate by five bits, which means from 0-31 positions. There is no rotate left, since to rotate left by 3 bits, you can rotate right by 32-3 = 29 bits. Every shift can take either a five bit value or a register that contains a value between 0 and 31.
 	mov	r0,#10
Move the literal value 10 r0. This achieves r0 = 10. This is an example of Immediate addressing, the literal value of the constant is given in the instruction. There is a limit to what can be listed in immediate values. Any number that can be expressed in 8 bits constant, plus 5 bits in rotation is allowed. Anything other that that must be declared as a constant, and loaded from memory. Examples of valid values are 0x10, 0x100 (both using the constant alone), and 0x40000 (using rotate to left)

Loading from and Storing to Memory

The memory on the ARM architecture is laid out as a flat array. No more segment register nonsense of the Intel world. Addressing modes are equally straight forward and easy to remember.
 	ldr	r0, .L3
Move the contents of address label .L3 to r0. This achieves r0 = *(.L3)
 	ldr	r0, [r1]
Move the contents of data pointed to by r1 to r0. This achieves r0 = *(r1). This addressing mode is using a register for a base address.
 	ldr	r0, [r1, #4]
Move the contents of data pointed to by (r1 + 4) to r0. This achieves r0 = *(r1 + 1) since 32 bits are moved at a time. Byte alignment might enforce this, so you might not be able to do "ldr r0, [r1, #1]". This addressing mode is using an immediate offset value.
 	ldr	r0, [r1, r2]
Move the contents of data pointed to by (r1 + r2) to r0. This achieves r0 = *(r1 + (r2/4)). This addressing mode is using a register as an offset, in addition to a register as a bass address.
 	ldr	r0, [r1, -r2]
It is possible to specify the offsets as negative.
 	ldr	r0, [r1, r2, lsl #4]
Move the contents of data pointed to by (r1 + (r2 logical shift left by four bits)) to r0. This achieves r0 = *(r1 + ((r2 << 4)/4)). This addressing mode is using a register for base address, and a shifted register for offset.
You get the picture. The load instructions take the same barrel shift arguments as the move instructions, so LSL, LSR, ASL, ASR, and ROR are all valid. The interesting stuff happens when you have the ability to use a register as a pointer while modifying the pointer and the destination. There are two ways of doing this, post-index addressing, and pre-index addressing. These correspond roughly to the postincrement (a++) and preincrement (++a) operators in C. Here is a post-increment operator
 	ldr	r0, [r1], #4
Move the contents of data pointed to by r1 to r0, and stores the value r1+4 in r1. This achieves r0 = *(r1), r1 = r1 + 1. As before, variants where registers are used as offsets and shifted registers as offsets are valid.
 	ldr	r0, [r1], r2
 	ldr	r0, [r1], r2, asl #4

In pre-indexed addressing, we modify the base address register before loading the address from the memory to the register. This is indicated by an exclamation mark at the end of the address, indicating that it is written first.
 	ldr	r0, [r1, #4]!
Increase the content of r1 by 4, and then move the contents of data pointed to by (r1) to r0. This achieves r1 = r1 + 4, r0 = *(r1)
 	ldr	r0, [r1, r2]!
Increase the content of r1 by the contents of r2, and then move the contents of data pointed to by (r1) to r0. This achieves r1 = r1 + r2, r0 = *(r1)
 	ldr	r0, [r1, r2,  #4 ]!
You get the idea. Any registers r0-r15 can be used where r0, r1, and r2 are used in examples above.
All those examples dealt with loading contents from memory to registers. For storing to memory, we use the STR opcode instead of the LDR opcode. The same addressing modes can be used to store values from registers into memory. Here are a few examples.
 	str	r0, [r1], #4
Move the contents of r0 into memory pointed to by r1, and stores the value r1+4 in r1. This achieves *(r1) = r0, r1 = r1 + 1. As before, variants where registers are used as offsets and shifted registers as offsets are valid.
 	str	r0, [r1]
Move contents of r0 into address pointed to by register r1 *(r1) = r0
 	str	r0, [r1], r2
Move contents of r0 into address pointed to by r1, and increment r1 by contents of r2. *r1 = r0, r1 = r1 + r2

References
Register move instructions and indexing modes is perhaps the hardest part of ARM assembly. You should take solace in the fact that the addressing modes are considerably less complicated than the madness of segment registers on Intel processors.

This is a basic introduction to the registers and addressing modes in ARM processors. In case you want more depth, Jasper Vijn's excellent introduction to ARM assembly covers these topics.

The authoritative technical reference for the ARM architecture is the ARM Application Reference Manual. It contains valuable technical information on all instructions in the ARM instruction set.

Monday, September 05, 2011

Android ARM Assembly: A trivial program (Part 2)

This is part two of a series on learning ARM assembly on Android. This part covers a walkthrough of a ARM assembly program.

Part 1: Motivation and device set up
=> Part 2: A walk-through of a simple ARM assembly program
Part 3: Registers, memory, and addressing modes
Part 4: Gnu tools for assembly; GCC and GDB
Part 5: Stack and Functions
Part 6: Arithmetic and Logical Expressions
Part 7: Conditional Execution
Part 8: Assembly in Android code

The articles follow in series, each article builds on the previous.

Hello world assembly program

The easiest program that most languages introduce is the Hello World program. If you had followed with Part one, you typed the Hello World program in C. Now generate the assembly language for that example using the command given at the end of part one.

Here is the assembly language program in full:
 1 	.cpu arm9tdmi
 2 	.fpu softvfp
 3 	.eabi_attribute 20, 1
 4 	.eabi_attribute 21, 1
 5 	.eabi_attribute 23, 3
 6 	.eabi_attribute 24, 1
 7 	.eabi_attribute 25, 1
 8 	.eabi_attribute 26, 2
 9 	.eabi_attribute 30, 6
10 	.eabi_attribute 18, 4
11 	.file	"hello.c"
12 	.section	.rodata
13 	.align	2
14 .LC0:
15 	.ascii	"Hello World\000"
16 	.text
17 	.align	2
18 	.global	main
19 	.type	main, %function
20 main:
21 	@ Function supports interworking.
22 	@ args = 0, pretend = 0, frame = 8
23 	@ frame_needed = 1, uses_anonymous_args = 0
24 	mov	ip, sp
25 	stmfd	sp!, {fp, ip, lr, pc}
26 	sub	fp, ip, #4
27 	sub	sp, sp, #8
28 	str	r0, [fp, #-16]
29 	str	r1, [fp, #-20]
30 	ldr	r0, .L3
31 	bl	puts
32 	mov	r3, #0
33 	mov	r0, r3
34 	sub	sp, fp, #12
35 	ldmfd	sp, {fp, sp, lr}
36 	bx	lr
37 .L4:
38 	.align	2
39 .L3:
40 	.word	.LC0
41 	.size	main, .-main
42 	.ident	"GCC: (Debian 4.3.2-1.1) 4.3.2"
43 	.section	.note.GNU-stack,"",%progbits
As you can see, it is quite a lot of code for a simple program. Luckily, most of it is boilerplate.  Let's break it down piece by piece.

Declaration and options
 1 	.cpu arm9tdmi
 2 	.fpu softvfp
 3 	.eabi_attribute 20, 1
 4 	.eabi_attribute 21, 1
 5 	.eabi_attribute 23, 3
 6 	.eabi_attribute 24, 1
 7 	.eabi_attribute 25, 1
 8 	.eabi_attribute 26, 2
 9 	.eabi_attribute 30, 6
10 	.eabi_attribute 18, 4
11 	.file	"hello.c"
The first eleven lines are declarations of various options in the ARM cpu. You can ignore them for now. In case you are curious, we specify the CPU type, the way we want the Floating Point Unit (FPU) to operate, and then specify options for the ARM Embedded Application Binary Interface (EABI). The filename is specified on line 11.

Declaring constants
12 	.section	.rodata
13 	.align	2
14 .LC0:
15 	.ascii	"Hello World\000"
The string "Hello World" is specified as a constant in assembly on lines 12-15. It is in the Read Only DATA section (.section .rodata), it needs to be aligned on two-byte boundaries, and the string is specified as an ASCII string. Byte alignment is very important in assembly language programming.  You must pay careful attention to data that needs to be word aligned (2-byte), quad aligned (4-byte) or byte aligned (1-byte). In general, you can't go wrong with word alignment, so if you are uncertain, add a .align 2 at the top of data and functions.
The data is specified as a string in assembly, though the assembler writes it out as bytes behind the scenes.

Declaring functions
16 	.text
17 	.align	2
18 	.global	main
19 	.type	main, %function
20 main:
The program consists of a single function called main. Program code is always in the text section, thus the declaration on line 16. It is word aligned (line 17). main is a global variable, and line 18 allows it to be visible elsewhere in the program. Finally, it is listed as a function, and line 20 is a label containing the name 'main'.

Registers and data
Before we look at the function contents, it might be useful to know what the ARM architecture is like. ARM processors are an example of Reduced Instruction Set Computing (RISC). This means that there are few instructions, and most instructions operate on registers. For user programs, there are 16 registers, called r0, r1, r2, .., r15. Each register is 32 bits long. Registers correspond roughly to variables, though they don't have a data type. Most arithmetical and logical operations are performed on registers. The processor can also access the entire memory using load and store instructions.

The registers r12-r15 are special:
r12: IP, or Intra-Procedure call stack register. This register is used by the linker as a scratch register between procedure calls. A procedure must not modify its value on return. This register isn't used by Linux gcc or glibc, but another system might.
r13: SP, or Stack Pointer. This register points to the top of the stack. The stack is area of memory used for local function-specific storage. This storage is reclaimed when the function returns. To allocate space on the stack, we subtract from the stack register. To allocate one 32-bit value, we subtract 4 from the stack pointer.
r14: LR, or Link Register. This register holds the return value of a subroutine. When a subroutine is called, the LR is filled with the program counter.
r15: PC, or Program Counter. This register holds the address of memory that is currently being executed.

Here are some data move instructions:
24 	mov	ip, sp
This moves the value from sp (r13) to ip (r12). This achieves ip = sp.
32 	mov	r3, #0
This moves the value 0 into r3. This achieves r3 = 0
 In addition to moving values within registers, you can load values from memory into registers, and store registers into memory. Here are instructions that achieve this:
28 	str	r0, [fp, #-16]
This stores the contents of r0 into the memory location pointed to by (fp -16). Since memory is addressed by bytes, and registers are 4 bytes each, memory offsets are often multiples of 4. This is the same as the C statement *(fp - 4) = r0
30 	ldr	r0, [fp]
This loads the data from memory pointed to by register fp into register r0. This is the same as the C statement r0 = *(fp)
25 	stmfd	sp!, {fp, ip, lr, pc}
This is a multi-register move operation. This moves the registers FP,IP,LR,PC into the area specified by the register SP. Since SP is the stack pointer, this is the same as pushing registers FP,IP,LR,PC to the top of the stack in a single operation. Once this is done, the stack pointer is updated since it has an exclamation mark.
35 	ldmfd	sp, {fp, sp, lr}
This is another multi-register move that undoes the action on line 25. This reads back the values that were written earlier, popping them from the stack. The combined effect of lines 25 and 35 is to store the important register values on the stack and then restore them. This allows the function to modify them in the main body. We don't restore IP

Manipulating data
Assembly instructions to manipulate data are very basic: you can do basic arithmetic operations: ADD, SUB, and basic logical operations: AND, OR. ARM Assembly language operations are of the type: OPERATION ARG1, ARG2, ARG3 This performs the operation ARG1 = ARG2 OPERATION ARG3. In the above program, you see some basic arithmetic.

26 	sub	fp, ip, #4
This performs the action fp = ip - 4.

Function calls and returns
Assembly language is written as a flat set of instructions: there is very little structure once the instructions are written to memory in the computer. In order to make code modular, functions can be written. Without the protection of the C compiler, assembly programs must manage their own function calling.
The basic function call involves the following structure:
 	.text
 	.align	2
 	.global	functionName
 	.type	functionName, %function
 functionName:
 	mov	ip, sp
 	stmfd	sp!, {fp, ip, lr, pc}
 	sub	fp, ip, #4  @ Space for local variables
 	sub	sp, sp, #8
 
 	sub	sp, fp, #12
 	ldmfd	sp, {fp, sp, lr}
 	bx	lr
Subroutines are required to preserve every register except for r0-r3. So if you need to use the other registers (r4 onwards), you should save them on the stack before over-writing them. The exact function call convention is listed in the ARM procedure call standard.
In order to call a function, you Branch and Link using the BL instruction. The return address is placed in the LR register. To return from a function, you call a branch on the LR register. This is a BX rather than a B to correctly move between ARM and Thumb instructions. (Ignore ARM and Thumb differences for now, they will be made clear later).

Hello World functionality
Put together, the lines 30-33 lines print out the message Hello ARM world. Let's break this down instruction-by-instruction to see how this is achieved.
30 	ldr	r0, .L3
31 	bl	puts
32 	mov	r3, #0
33 	mov	r0, r3
Line 30 loads the address of label .L3 into register r0. The function calling convention is that the first four arguments are stored in r0-r3, and subsequent arguments are stored on the stack. The function to put a value on the screen is called puts, and it accepts just one argument: the string to be printed. The address of this string is stored in the first register: r0.
Line 31 calls the puts function, which consumes r0 and prints the value on the screen. After calling the function, we can expect the registers r0-r3 to be trashed. The return value of the puts is in r0, but we don't care for it.

Line 32 and 33 put together achieve r0 = 0. This is the return value that the main method returns.

Final word
You are now capable of reading ARM assembly and understanding the main elements in the program. As an exercise, you could try reducing the size of the code while keeping the functionality intact.

 In the next article, we can examine each piece in some detail.