Machine details

In the memory of a computer, all data is stored in the form of bits (or binary digits). In this section we will investigate the exact manner in which various forms of information are stored as bits.

Bytes

These days, most computers organise their memory in bytes of 8 bits each, and words of 2 or 4 bytes each. To write out the value of a byte one could write out the bits, but it is more economical to use hexadecimal numbers.

In the hexadecimal number system we extend the usual decimal digits 0,...,9 with A,B,C,D,E,F for 10,...,15. Since four bits are needed for one hexadecimal digit, we can write byte values using 2 hexadecimal digits.

Occasionally the octal number system is used. In it, only the digits 0,...,7 are used, and they can be represented in three bits.

If there is a possibility of confusion between the number systems, the base is appended as a subscript to the number. Eg, 2A16 = 4210 = 528 = 1010102.

Integers

The positive and negative integers ...,-2,-1,0,1,2,... are easily represented in the binary number system. Like our everyday decimal numbers, this is a positional system: the rightmost positions indicates the ones, the position to the left of that the powers of 2, then the powers of 4, 8, 16, et cetera. For example: 10011 stands for (from right to left) 1+2+16=19. This way, in n bits we can store the numbers zero through 2^n-1.

However, we also need negative numbers. We declare the leftmost bit to be the sign bit: if the sign bit is zero the number is postive, if it is one the number is negative. The easiest realisation of negative numbers would now be to let a number and its negative have the same bit pattern up to the sign bit. For three bit numbers this would give:

 000 001 010 011 100 101 110 111
   0   1   2   3  -0  -1  -2  -3

The big disadvantage of this scheme is first of all that there would be two different numbers zero. Also, note that adding one to the bit pattern of a negative number would actually subtract one from that number.

A more sensible solution would then take a number, and complement each bit to arrive at its negative. This is called one's complement, and again there is the problem of two zeros.

 000 001 010 011 100 101 110 111
   0   1   2   3  -3  -2  -1  -0

The so-called two's complement system takes the complement of every bit individually, then adds one to the result. Eg,

 2 = 10 -> 01 -> 10 ==> -2 = 110

 000 001 010 011 100 101 110 111
   0   1   2   3  -4  -3  -2  -1

Commonly, integers are stored in 2 bytes, that is, 16 bits. The two's complement scheme stores the negative numbers in the bit patterns for 2^15 through 2^16-1. As a result, we can represent the numbers -32768=-2^15 through 32767=2^15-1. (Integers can also be stored in 4 bytes; then the range is between plus and minus 2.14 billion.)

As a consequence, if you add two numbers with a sum of 2^15 or more, you may get a result that is negative. Maybe the computer system has an option of telling you that a integer overflow error occurred, but having to test for this inevitable slows your program down. It is easier for the program to watch for potential occurrences of this problem and prevent them.

Reals

In computer languages, the term `real' does not really correspond to real numbers in mathematics. Reals are more like fractions, binary fractions at that. Historically there are two types of reals: fixed point and floating point numbers. Fixed point numbers are hardly used any more, so you will often encounter the term floating point number as a synonym for real number.

Representation

The representation of real numbers is based on so-called scientific notation: a number is written as a mantissa, that is, a fraction between zero and one and an exponent, indicating the powers of 10 or 2 with which to multiply the fractional part. Example: 12.37 = 0.123 * 10^2. We could also have written 12.37 = 0.01237 * 10^3, but by convention we use a normalisation: the first digit after the decimal point is always nonzero.

Both the mantissa and the exponent incorporate a sign: the former to represent negative numbers, the latter to represent fractions less than one in absolute value.

Errors

Real numbers, like integers, are subject to overflow errors: there is a largest representable number, and a calculation with a result larger than that will give an undefined result. However, there are a few more types of numerical error, specific to floating point numbers.

First of all, there is representation or round-off error. Just as 1/3 is not expressable as a finite decimal fraction (1/3 = 0.33333... which is a repeating fraction), most fractions are not expressable as finite binary fractions. Truncating a repeating fraction to a finite mantissa length leads to representation error. If round-off is applied here, we also call this round-off error.

Floating point numbers have a converse to overflow where numbers are too big: underflow, where numbers are too close to zero.. The smallest positive number is attained if the exponent takes its most negative value, and the mantissa is the smallest value that satisfies the normalisation condition above. Any calculation with a result less in absolute value than this number is said to cause floating point underflow.

A quantity related to the representation of real numbers is the so-called machine constant (sometimes called machine epsilon). Its formal definition is: the smallest number that, added to one, gives a result greater than one.

To understand this, consider how numbers in floating point format are added. Suppose we havea mantissa with 2 decimal places. To add 12 (= 0.12 * 10^2) and 3.5 (= 0.35 * 10^1) we normalise both to the exponent of the larger number: 0.12 * 10^2 and 0.035 * 10^2. Adding these would give 0.155 * 10^2, but our mantissa can only accomodate 2 places, so the second operand is first truncated to 0.03 * 10^2. The result of our computation is then 15, rather than the exact sum 15.5. Even if we had had some extra space to prevent truncation of the second operand, we still would have round-off error in representing the final result as a normalised floating point number.

Now, if our second operand had been even smaller, say 0.35, its representation to the exponent of the other number would have been zero! Hence, adding it to the first operand would simply give the first operand. Hence the definition of the machine epsilon.

As a final instance of numerical error we will consider loss of precision. This happens when almost equal numbers are subtracted. For instance, in two significant digits, subtracting 0.23 from 0.24 is 0.01, or after normalisation, 0.10 * 10^-1 in only one significant digit. To understand this, consider that the number 0.23 can, because of round-off or representation error, stand for anything between 0.225 and 0.235. The error is in the third decimal. However, after the subtraction, the number 0.10 * 10^-1 has an error in the second decimal.

This example also illustrates that addition of real numbers in a computer is not associative: in exact arithmetic .20+.035-.20-.03 = .005, but

    (.20 + .35*10^-1) - (.20 + .30*10^-1) = .24 - .23 = .10*10^-1

with a relative error of fifty percent, and

    (.20 - .20) + (.35*10^-1 - .30*10^-1) = 0 + .5*10^-2

which is exact.

Example: roots of a 2nd degree polynomial

In computing the roots of a second degree polynomial the matter of numerical errors comes up twice. First of all, the test whether the discriminant

      disc = b**2-4*a*c

is zero should not be written as

      IF (disc.eq.0.0) THEN

Even if the values of a,b,c are such that in exact arithmetic the discriminant would be zero, this is not likely to be the case in computer arithmetic. Instead, using the machine epsilon, one should test whether b**2 and 4*a*c differ relatively by less than a small number of times the machine constant:

      IF (abs(disc).LT.10*eps*b**2) THEN

Next, if b and sqrt(disc) are approximately equal in absolute value, computation of one of the roots will suffer from severe loss of precision. The solution is to compute the stable root, and to use the fact that the product of the roots is c/a:

      IF (b.LT.0.0) THEN
         x1 = (-b+sqrt(disc))/(2*a)
      ELSE
         x1 = (-b-sqrt(disc))/(2*a)
      END IF
      x2 = c/(a*x1)

Characters

There is no obvious way to represent character data in bit patterns. There only solution is to draw up a list of interpretations of the bit patterns. By far the most common character table is called Ascii: American Standard Code for Information Interchange. Occasionally you may still come across Ebcdic: the Extended Binary Coded Decimal Information Code. This is an IBM invention, so IBM mainframes are just about the only place it is still used.

There are in fact two Ascii tables: the 7-bit and 8-bit tables. While the meaning of 7-bit Ascii is pretty much standardised, various manufacturers consider the meaning of the higher Ascii codes fair game for their own purposes. Much software, eg much Internet communication software, is based on 7-bit ascii; in order to send 8-bit data you have to translate it down to 7-bit data.

In fact, the 7-bit ascii codes are conceptually split in the printable and unprintable Ascii codes. The printable codes are the alphabet, the numbers, and various other characters; the unprintable codes can have various meanings. Sending them to your screen may cause your cursor to jump, or parts of the screen to start blinking.

Logical data

There are only two logical values: true and false. Strictly speaking a single bit of memory would be enough to store that, but on most computers addressing a single bit is expensive, typically requiring extraction from a byte or a word. Therefore, logical data is typically allocated a byte or a word. Various conventions are possible, eg, all zeros representing false and all ones representing true. Alternative, any nonzero bit could imply a true value.