Yacpb: LLVM assembly and integer signedness

LLVM assembly does not encode integer signedness which can be a nuisance when reading the assembly code. On the other hand the policy simplifies the representation, as the sign does not make a difference in most of the operations (such as integer addition). In some operations (such as boolean comparisons on integers <, <=, >, >=) signedness is significant, however. For completeness, let's remember that the (dis)equality operator does not care about the sign.

Let's have a look on the following C program block:

if (x < UINT_MAX-1)
    puts("foo");
else
    puts("bar")

Piping clangs's output to llvm-dis gives us the following LLVM assembly representation:

%1 = load i32* %x, align 4
%2 = icmp ult i32 %1, -2
br i1 %2, label %bb, label %bb1

bb:
%3 = call i32 @puts(i8* getelementptr ([4 x i8]* @.str, i32 0, i32 0)) nounwind
br label %bb2

bb1:
%4 = call i32 @puts(i8* getelementptr ([4 x i8]* @.str1, i32 0, i32 0)) nounwind
br label %bb2

bb2:

Now, in the assembly code, it can be read that the %1 register is compared with -2, and the operands are to be interpreted as unsigned 32-bit integers. On the bit level, that is of course correct. However, a human needs to make a conversion in her head to understand on what concrete values of %1 the first branch is taken.

In the two's complement, the first bit represents the sign. For negative numbers, the sign bit is set, for positive, it is unset. The number zero is represented as a value where all bits are unset. The smallest negative value of an n-bit integer is represented as one followed by n-1 zeros. Semantically it can be thought as if the rightmost n-1 bits represent a positive value that is summed with the smallest negative value available (-2^(n-1)), and the result is something between -2^(n-1)..-1.

Assuming 4-bit integers, -2 is represented as 1110 (-2^3 + 6). When 1110 is interpreted as an unsigned integer, that is 14 (2^3 + 2^2 + 2^1). However, in our example, we have 32-bit integers (the i32 type in LLVM), where the min value for int is -2147483648. In the 32-bit world, -2 is represented by a number where there are 31 ones followed by a zero. Interpreted as unsigned, that is 2^31 + 2^30 ... 2^2 + 2^1 = 4294967294.

So we came to the conclusion, for values of %1 < 4294967294, the first branch is taken. Otherwise the second branch is taken. That was not obvious directly from the assembly code. We had to consider the integer width to deduce the concrete unsigned value.

It should be noted that one can decide whether to write integer literals as the llvm tools do or not. Since the bit representation is identical for signed i32 -2 and unsigned i32 4294967294, it does not matter which way it is written. I would still stick to the signed interpretation for consistency's sake. Otherwise it becomes a hopeless mess.

Yacpb

Friday, August 14, 2009

LLVM assembly and integer signedness

No comments:

Tweets

Blog Archive

Labels

Endorsed blogs

About Me