Basic Data Types
What our computer data is made of.
Simple Data Types
There are many datatypes available in most modern programming languages. We shall ignore languages like Haskell for now, and restrict our attention to C, C++ and Python (version 3). In C, there are a small number of basic data types, which closely map to data manipulated by the CPU. So for example, a char corresponds to a single byte, taking values from -128 to 127 (with an unsigned char taking values from 0 to 255). On a typical CPU (e.g. an Intel x86) there are three classes of basic datatype: integer, floating point, and array. An array lives a double life, being both an array and a pointer at the low level (to the CPU, an array is a pointer to where the array starts in memory). In a higher level language like Python, there are two datatypes resembling an array: the list and the tuple.
CPU Data Types
It seems best to start at the bottom and work our way up. At the very foundation, everything is a bit, or a collection of bits. The CPU is built out of a collection of logic gates which take bits as input, and produce bits as outputs. An integer is a collection of bits; a floating point number is a collection of bits; and in languages like C and C++, every other type boils down to a collection of integers, floating point numbers, or bits packed into integers.
Now a CPU stores a small number of pieces of data in its registers. There are integer registers, which represent integers in binary as a sequence of bits (in groups of 8, 16, 32 or 64, depending on the CPU), and there are often floating point registers: all x86 processors from the Pentium onwards have floating point registers, whereas earlier chips like the 386 needed a separate chip (the 387) to handle floating point maths on behalf of the main processor.
In general, data that the CPU is working with is stored in main memory, which at the time of writing has a capacity measured in gigabytes (e.g. the machine on which I am typing this has 24GB of main memory). Between the CPU and main memory are a number of smaller, intermediate memories called caches. When the CPU wants to work with data, it loads a chunk of data (say 64 bits) into a register. It works with data stored in registers, or possibly loaded from main memory for the purpose of a single instruction, and the results of an instruction are stored either in a register, or back into main memory. Most modern RISC processors only work with data in registers, apart from explicit load and store instructions to and from main memory.
So in summary, to the CPU, data is either integer, floating point, or possibly some special purpose registers, and is ultimately an ordered collection of bits.
Bytes and Words
Bits are grouped into bytes, and bytes are the smallest data type that a CPU will work with. A byte, in modern day computing, is standardised as 8 bits. (In the early days of computing, a byte might have been 6 bits, for example, until things were standardised.) Bytes are then grouped into words, of say 2, 4, or 8 bytes (corresponding to 16, 32 and 64 bits respectively). Floating point numbers generally come in two formats, according to their size: single precision floating point numbers (which are 32 bits), and double precision floating point numbers (which are 64 bits). Historically there was also an 80 bit floating point format, used internally by the x87 floating point coprocessors, and in modern times there are also 8 and 16 bit floating point formats, which turn up in modern GPUs (graphics processing units) in the context of 3D graphics and AI. The advantage of smaller floating point formats is that they take less time and energy to work with, and less memory to store; the disadvantage is that they have less precision. Likewise with integers: the advantage of smaller integers is that they take less space to store, at the expense of the range of integers they can represent; for example an 8 bit integer can store 256 values, usually in the range 0–255 or -128–127. The range of values expressible using floating point formats is a bit more complicated, and we shall explore that when we get to them.
In summary, a byte is a collection of 8 bits, and other datatypes are represented using an integral number of bytes (even booleans are ultimately stored in bytes, and accessed using bitwise logic operations). Note that while in earlier times there was a direct relationship between bytes and the notion of a character (which is the historical reason a byte in languages like C, C++ and Java is called a char or an unsigned char), with the modern system of Unicode this is no longer the case. We shall discuss characters a little more later.
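To make these sizes concrete, Python's standard struct module can report how many bytes each fixed-width format occupies; a minimal sketch (the sizes shown are the standard ones on typical modern platforms):

```python
import struct

# Bytes occupied by each fixed-width format.
print(struct.calcsize("b"))  # signed char: 1 byte  (8 bits)
print(struct.calcsize("h"))  # short:       2 bytes (16 bits)
print(struct.calcsize("i"))  # int:         4 bytes (32 bits)
print(struct.calcsize("q"))  # long long:   8 bytes (64 bits)
print(struct.calcsize("f"))  # single precision float: 4 bytes
print(struct.calcsize("d"))  # double precision float: 8 bytes
```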
Integers
In mathematical terms, the integers are all the whole numbers, from -∞ to ∞ (i.e. as low and as high as we like). With a fixed number of bits, however, there is a fixed number of integers we can represent. With 1 bit, we can only represent the numbers 0 and 1. With 2 bits we can represent the numbers 0, 1, 2, and 3. And for each additional bit we have, the range of numbers representable increases by a factor of 2 (much like how allowing an additional digit in base-10 notation increases the range of numbers representable by a factor of 10). In software, we can simulate integers as large as we like (subject to limitations of memory and storage; these are called arbitrary precision, multiple precision, or simply big integers), but in the CPU itself, the largest integers we can work with are of the 64 bit variety, ranging either from 0 to 18446744073709551615 (as unsigned integers) or from -9223372036854775808 to 9223372036854775807 (as signed integers). By signed, we mean integers that can be either negative or positive, whereas unsigned integers can only be non-negative (that is, positive or zero).
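Since Python's built-in integers are unbounded, they make a convenient calculator for checking these ranges; for example:

```python
# The range representable with n bits, unsigned and signed.
for bits in (8, 16, 32, 64):
    print(f"{bits} bits: unsigned 0 to {2**bits - 1}, "
          f"signed {-(2**(bits - 1))} to {2**(bits - 1) - 1}")
# The last line printed shows the 64 bit ranges quoted above.
```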
Hexadecimal
Conventionally we use base 10 in everyday life. We count 1, 2, 3, …, 9, 10, …, 19, 20, and so on, rolling over every 10. That is not a convenient system when it comes to computers, which work in binary, and binary itself is very lengthy to write down, and hard to read back. To solve this issue, we use base 16, known as hexadecimal. Here we count 1, 2, 3, …, 9, A, B, C, …, F, 10, 11. (There is also a number format called octal, which is base 8, and is based on grouping bits into groups of 3, rather than 4 as is the case in hexadecimal; octal still finds use in some operating systems such as Unix/Linux, for representing the permission bits of a file.)
Now we have another issue: if we write 10, do we mean 2 (reading it as binary), 10 (reading it as base 10), or 16 (reading it as hexadecimal)? To disambiguate these, we use the notations 0b10 for binary, 0x10 for hexadecimal, and plain 10 when we mean base 10.
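Python understands all three notations directly, and can convert between them; for example:

```python
print(0b10)      # 2  (binary)
print(10)        # 10 (decimal)
print(0x10)      # 16 (hexadecimal)

# Converting in the other direction:
print(bin(202))  # 0b11001010
print(hex(202))  # 0xca
print(oct(202))  # 0o312 (octal, as used for Unix file permissions)
```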
Floating Point Numbers
When we want to handle numbers with fractional parts (such as 0.5, or an approximation to π), we store these, either exactly or approximately, with three pieces of information:
- the sign: is the number positive or negative?
- the mantissa; and
- the exponent.
This works like modern scientific notation, that is, writing numbers like 2.997 × 10⁸. The sign is simply a single bit which indicates whether or not the number is negative. The mantissa corresponds to the 2.997 in the above example, and the exponent corresponds to the superscript 8, the power of 10 by which the mantissa is multiplied. In binary things are done likewise, except that things now look like e.g. 1.0001110 × 2²¹. Now unless our floating point number is zero, we don’t have to write down the leading 1 (since what else could it be?), so we write down the 0001110 that follows it, and the 21 (which is 10101 in binary). The modern standard for this is called IEEE 754. In C, a 32 bit floating point number is referred to as a float, and a 64 bit floating point number is referred to as a double.
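We can inspect the actual IEEE 754 bit pattern of a float from Python using the struct module; a minimal sketch (0.15625 is chosen because it is exactly representable in binary):

```python
import struct

# Pack 0.15625 (= 1.01 binary × 2^-3) as a 32 bit single precision float,
# big-endian, and view its bits: 1 sign bit, 8 exponent bits, 23 mantissa bits.
bits = int.from_bytes(struct.pack(">f", 0.15625), "big")
print(f"{bits:032b}")           # 00111110001000000000000000000000

sign     = bits >> 31           # 0: positive
exponent = (bits >> 23) & 0xFF  # 124 = -3 + 127 (stored with a bias of 127)
mantissa = bits & 0x7FFFFF      # the 01 after the implied leading 1, then zeros
print(sign, exponent, mantissa) # 0 124 2097152
```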
Fixed Point
Some processors have explicit support for a different kind of fractional number, called fixed point. This models our conventional notation where we write e.g. 3.14159, but where the number of decimal digits after the decimal point is a fixed number, say 5. Such numbers find use in digital signal processing applications, though we will not discuss these further. Suffice to say that they are relatively easy to emulate using integer arithmetic (and what are known as bit shifts).
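As an illustration, here is a minimal sketch of such emulation using integers and bit shifts, assuming a Q16.16 format (16 integer bits, 16 fractional bits):

```python
SHIFT = 16                     # number of fractional bits
ONE = 1 << SHIFT               # fixed point representation of 1.0

def to_fixed(x):
    # Convert a real number to its fixed point representation.
    return int(round(x * ONE))

def fixed_mul(a, b):
    # Multiplying two scaled numbers doubles the scaling,
    # so shift right to undo one factor of 2**SHIFT.
    return (a * b) >> SHIFT

a, b = to_fixed(3.25), to_fixed(2.0)
print(fixed_mul(a, b) / ONE)   # 6.5
```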
Simple Data Types in C
Having discussed how things are at the CPU level, we now look at basic data types as they manifest in a relatively low-level language, in this case C. These closely match what the CPU handles in hardware, with little abstraction.
Integers, Floating Point Numbers and Booleans
To the C programmer, there are 8 bit integers, 16 bit integers, 32 bit integers and 64 bit integers, along with 32 bit and 64 bit floating point numbers. On a modern 64 bit system, the integer types will be called char, short, int, and long, respectively, and the floating point types float and double. (Note that the exact size of each of these is not mandated by the C standard, so for example on an Arduino UNO, which has an 8 bit Atmel CPU, the int is 16 bits, and the long is 32 bits.)
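Although we are in C territory here, Python's ctypes module exposes the C types of the platform it runs on, which lets us check these sizes directly; the exact numbers printed depend on your platform:

```python
import ctypes

# Size, in bytes, of each C type on this particular platform.
for name, ctype in [("char", ctypes.c_char), ("short", ctypes.c_short),
                    ("int", ctypes.c_int), ("long", ctypes.c_long),
                    ("float", ctypes.c_float), ("double", ctypes.c_double)]:
    print(name, ctypes.sizeof(ctype))
# On a typical 64 bit Linux system this prints 1, 2, 4, 8, 4 and 8.
```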
For true/false booleans, C adopts the convention that 0 is false, and anything else is true. In C++, however, there is an explicit bool datatype, which has values true or false.
Arrays
Before we can discuss strings, we need to discuss arrays, for a string is essentially an array of characters. Now an array of a basic data type (in C at least) is a contiguous stretch of memory whose length in bytes is an integral multiple of the size, in bytes, of that datatype. For example an array of 10 ints, assuming that an int uses 32 bits (which is 4 bytes), is an area in memory that is 40 bytes long. The first 4 bytes correspond to the first element of the array. Now a word is in order about indexing array elements, and what 0-based indexing means. When we index elements of an array, the first element is at index 0, the second at index 1, and so on. Often people talk of the zeroth element of an array, and so on. To avoid communication errors, it is important to be clear which is meant when we say ‘the first element of an array’.
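Python's array module provides exactly this kind of contiguous, homogeneous array, which lets us verify the arithmetic above; a small sketch:

```python
from array import array

a = array("i", range(10))   # an array of 10 C-style ints
print(a.itemsize)           # 4 bytes per element (on typical platforms)
print(a.itemsize * len(a))  # 40 bytes in total
print(a[0])                 # 0-based indexing: the element at index 0
```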
Note that in higher level languages (and we’ll get to the Python point of view in a bit), things are more abstract: we (usually) don’t care how things are represented at a machine level, only that variables have a well-defined meaning to the programmer.
Characters and Strings
Consider a word in the English language as we would normally write it. For example ‘turnip’. This is spelt with the six letters t, u, r, n, i and p. That said, when we think about food, in particular a turnip, we think about the actual vegetable, not how it is spelt: we only need to think about the spelling when we read or write the word ‘turnip’. In a computer, things are similar, albeit done in a way so idiot proof that a machine can understand it (though unfortunately not so idiot proof that programmers can’t make elementary mistakes in handling them). When the C language was originally devised, a byte was 8 bits, and a character (which at the time needed 7 bits) was essentially identified with the byte that contained it. The 8th bit was then used in various ways to extend the character set, for things such as accented characters or box drawing in the days of text mode interfaces. In C, a string is just an array of characters, that is, a contiguous stretch of memory containing 1 byte for each character of the string (plus an extra byte, with value 0, to denote the end of the string). This is not the only way to store a string, however: we could alternatively store the length of the string, rather than relying on the presence of a 0 byte at the end.
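We can mimic the C layout with a Python bytes object; a sketch of both conventions (the terminating zero here is added by hand, purely for illustration):

```python
# C-style: the string's bytes followed by a terminating 0 byte.
c_style = b"turnip\x00"
print(len(c_style))        # 7: six characters plus the terminator
print(c_style.index(0))    # 6: scanning for the 0 recovers the length

# The alternative: store the length explicitly alongside the bytes.
prefixed = bytes([6]) + b"turnip"
print(prefixed[0])         # 6: the length, with no scanning required
```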
Things have become a little more complicated as time has progressed. Windows, for example, moved to using 2-byte characters for its strings, before much of the modern world standardised on an encoding called UTF-8. In an encoding like UTF-8, different characters may take different numbers of bytes to encode. This allows strings like “turnip” to use only 6 bytes (plus any terminating zero), yet also allows millions of other characters to be encoded (including emoji such as ☺ and 💩). So with an encoding such as UTF-8, a character no longer corresponds to a fixed number of bytes.
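The variable-width nature of UTF-8 is easy to see from Python:

```python
print(len("turnip".encode("utf-8")))  # 6: one byte per ASCII character
print(len("☺".encode("utf-8")))       # 3: this character needs 3 bytes
print(len("💩".encode("utf-8")))      # 4: and this one needs 4
print("💩".encode("utf-8"))           # b'\xf0\x9f\x92\xa9'
```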
Simple Data Types in Python
Python is a higher level language than C, allowing for much more expressivity at the expense of performance and of the capacity to manipulate the machine at a lower level. We shall describe briefly how things are in Python 3.
In Python 3, integers are big integers by default, that is, they can hold integers of any size. Presumably integers that fit into a 64 bit integer (that is, those between -9223372036854775808 and 9223372036854775807) are stored and manipulated as 64 bit integers native to the CPU, and larger integers are handled by clever software routines that break arithmetical operations down into ones that the CPU can natively handle. For arbitrarily sized floating point numbers, as in C, we need to rely upon a library which is not built into the language.
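A quick demonstration of how seamless this is:

```python
x = 2**63 - 1     # the largest signed 64 bit integer
print(x)          # 9223372036854775807
print(x + 1)      # 9223372036854775808: no overflow; Python quietly
                  # switches to its big integer routines
print(2**200)     # a 201 bit number, handled without complaint
```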
Arrays come in two basic types: tuples and lists. Tuples are immutable, meaning that once a tuple such as (1,2,3) is assigned to a variable, say x, we cannot just modify the second element (the 2) without creating a whole new tuple to replace the one we stored in x. In the case of lists, we can. There is, in addition, a type known as a dict in Python. A string in Python is an array of Unicode characters, and the programmer is (deliberately) insulated from how these characters are stored and manipulated. To get at an explicit encoding of a string, the programmer needs to convert it via the .encode() and .decode() methods. There are types in Python which correspond to the C notion of an array of bytes (a bytes object), and indeed there are modules that let you construct and manipulate the kind of arrays that C uses, but the arrays in C are not native to Python, just as Unicode strings are not native to C.
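A short sketch of the differences just described:

```python
t = (1, 2, 3)
try:
    t[1] = 5               # tuples are immutable...
except TypeError as e:
    print(e)               # 'tuple' object does not support item assignment

l = [1, 2, 3]
l[1] = 5                   # ...but lists may be modified in place
print(l)                   # [1, 5, 3]

s = "café"
b = s.encode("utf-8")      # str -> bytes, via an explicit encoding
print(b)                   # b'caf\xc3\xa9'
print(b.decode("utf-8"))   # bytes -> str: café
```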
Booleans and Bitwise Operations
It may seem odd to end with booleans, but in order to talk about them at a low level, we first needed to discuss how integers are handled; and in doing so, it made sense to describe floating point numbers and arrays along the way.
Booleans
A boolean value is simply one with two possible values, called true and false. These correspond naturally to the true and false of traditional logic. So if I wanted to consider the truth of the statement ‘I have a cat called Gerald’, which is false in this case, I could represent that in a programming language as a boolean.
Now in programming terms, a boolean can correspond to a single bit in memory, or a single byte, or even a 64 bit integer. It does not matter, so long as we know what corresponds to true and what corresponds to false. As mentioned earlier, C adopts the convention that a 0 value is false, and any other integer is true. Many languages follow this convention, with some, like Perl, additionally treating a 0-length string (and the string "0") as false, and any other string as true.
In the case of booleans, we can also pack them into integers, for example 64 separate true/false values can be stored in a single 64 bit integer. How we get at these is briefly discussed shortly.
Logical Operators
In traditional logic, there are a small number of operations we can perform: AND, OR, XOR and NOT (and a few others such as NAND, which can be represented in terms of AND, OR, XOR and NOT—indeed every logical operation can be expressed solely in terms of NAND). These are defined such that (see the sketch after this list):
- true AND true is true; anything else is false;
- false OR false is false; anything else is true;
- false XOR false is false; true XOR true is false; anything else is true;
- NOT true is false; NOT false is true.
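These truth tables can be verified directly in Python, where ^ acts as XOR:

```python
# Print the truth table for AND, OR, XOR and NOT.
for a in (False, True):
    for b in (False, True):
        print(a, b, "AND:", a and b, "OR:", a or b,
              "XOR:", a ^ b, "NOT a:", not a)
```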
Bitwise Logical Operators
When representing numbers as binary digits, we can consider them as a sequence of boolean values. For example, the 8 bit number 0b10110011 could represent true, false, true, true, false, false, true, true. Given two such 8 bit numbers, the above and, say, 0b00010001, we can form the bitwise AND of these two numbers by ANDing each pair of corresponding bits: in this example, 1 AND 0, 0 AND 0, 1 AND 0, 1 AND 1, 0 AND 0, 0 AND 0, 1 AND 0, and 1 AND 1. These evaluate respectively to false, false, false, true, false, false, false, true, and can be represented as the 8 bit number 0b00010001. We can do likewise for OR, XOR, and NOT. Essentially any CPU will have instructions to do exactly this, and many programming languages also have explicit notation for such operations.
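In Python (as in C), the operators &, |, ^ and ~ perform these bitwise operations; the example above looks like this:

```python
a = 0b10110011
b = 0b00010001
print(f"{a & b:#010b}")      # 0b00010001 (bitwise AND)
print(f"{a | b:#010b}")      # 0b10110011 (bitwise OR)
print(f"{a ^ b:#010b}")      # 0b10100010 (bitwise XOR)
print(f"{~a & 0xFF:#010b}")  # 0b01001100 (bitwise NOT, masked to 8 bits)
```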
Manipulating Individual Bits
If we have an 8 bit integer, say 0b00110011, corresponding to 8 boolean values, and want to inspect or change one of them, there is a means to do this with bitwise operations. If we take the 8 bit number 0b00010000 and AND it with any other 8 bit number, the result will be either 0b00010000 (in the case that the other number has a 1 in the same position) or 0b00000000 (otherwise). In this way we can inspect individual bits.
If we wish to make a given bit zero, say the second bit (from the left) in 0b01000011, we could compute 0b01000011 AND 0b10111111. The pattern 0b10111111 is called a bit mask. Similarly we could make the first bit 1 via 0b01000011 OR 0b10000000.
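A sketch of testing, clearing and setting individual bits with such masks:

```python
x = 0b01000011

# Test a bit: AND with a mask that has a single 1 in the position of interest.
print(bool(x & 0b00010000))      # False: that bit is not set

# Clear the second bit from the left: AND with a mask of 1s with a 0 there.
print(f"{x & 0b10111111:#010b}") # 0b00000011

# Set the first bit from the left: OR with a mask with a single 1 there.
print(f"{x | 0b10000000:#010b}") # 0b11000011
```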
Summary
We have gone, briefly, from the low level of how basic datatypes are represented in the CPU, through the differences in higher level languages, to finally discussing booleans and bitwise operations. Together with testing, comparison and branching operations, this is essentially the entirety of computer programs as seen from the point of view of the CPU, and of how the low level details are translated into a form that humans can more easily deal with.