Recently a curious intern at work asked why indexing in an array starts at zero. My explanation to him was that it pertains to offsets from the address of the start of the array. While this is an acceptable answer, it is not the only explanation as to why many programming languages (even those without pointer arithmetic) have adopted indexing of arrays starting at zero. I will begin by explaining my original answer and briefly go over a couple other compelling arguments.
Argument 1: “If Memory serves me right…”
The following will assume that the reader has some knowledge of what pointers are in the context of programming languages. Pointers are simply variables which store the address of another variable, hence they “point to” another variable. If that does not quite make sense to you then refer to either this C++ Pointer tutorial or the Wikipedia article on pointers.
I will begin this section with an example in C++. When we work with arrays in C/C++, the array name in most cases will decay into a pointer to the first element of the array. This does not mean that the array name itself is a pointer! It is not! A counter example to this common misunderstanding is the sizeof
operator where the array name is instead treated as the array itself and not a pointer to the first element. Although the array name can usually can be treated as a pointer to the first element of the array.
Now when we access the elements of an array in C/C++ we use the dereferencing operator known as the offset operator ([]). As a simple example of what I am referring to:
int main() { // Declare and initialize our array int myArray [5] = { 8, -3, 89, -30, 77 }; // Accessing elements of an array using the offset operator cout << "Index 2: " << myArray[2] << endl; cout << "Index 3: " << myArray[3] << endl; return 0; }
What some programmers might not realize is that the offset operator adds the number between the brackets to the address that is being dereferenced which is the array name (treated as a pointer). So the following snippet is equivalent to the previous snippet:
int main() { // Declare and initialize our array int myArray [5] = { 8, -3, 89, -30, 77 }; // Accessing elements by offsetting the address then dereferencing cout << "Index 2: " << *(myArray + 2) << endl; cout << "Index 3: " << *(myArray + 3) << endl; return 0; }
So at this point, it should be clear that when we access index 0 (ie. myArray[0]
) of an array we are really just dereferencing the pointer to the first element of the array offset by 0 (ie. *(myArray + 0)
). For those of us who are either very tired or mathematically challenged, that is just dereferencing myArray
being treated as the pointer to the first element of the array and thus obtaining the value of that first element.
Personally, the fact that arrays in C/C++ are just a series of elements placed in a contiguous block of memory never really stuck with me all that much until I programmed in assembly language. What is that I hear? You want me to break out the wonderful EASy68K IDE? Well okay then!
Here is a quick little Motorola 68000 assembly language program that I wrote to further bring home the point of array traversal via offsets in memory by stressing the fact that arrays are just a collection of elements neatly placed side-by-side in the computers memory. Assembly language also does a great job of pushing the semantics onto the programmer.
*----------------------------------------------------------- * Title : Arrays in Memory Example * Written by : Derek Will * Date : 7-8-2014 * Description: Showing how arrays can be traversed in assembly language *----------------------------------------------------------- ORG $1000 START: ; first instruction of program * Program code starts here LEA P, A0 ; Load Array P into Address Register A0 MOVE.W #10, D2 ; Data Register D2 holds the size of the array (count) * Start of Loop * Begin Trap Task 3 - To print out the current element (signed integer) of the array LOOP MOVE.B (A0), D1 ; Copy the contents pointed at by A0 into D1 MOVE.B #3, D0 ; Move 3 into D0 TRAP #15 ; Display the contents of D1 * End Trap Task 3 ADDA.L #1, A0 ; Point to the next element in the array * Begin Trap Task 14 - To print out a space (' ') between the elements LEA EMPTY, A1 ; Put the address of EMPTY into A1 MOVE.B #14,D0 ; Move 14 into D0 TRAP #15 ; Display a space between entries * End Trap Task 14 SUB.B #1, D2 ; Decrement from count BNE LOOP ; Continue until count is 0 * End of Loop SIMHALT ; halt simulator * Variables and constants ORG $1100 ; Store Array at Address $1100 P DC.B 5, 2, 4, 12, 55, 60, 1, 8, 3, 11 ; Declaring our array called P ORG $1200 EMPTY DC.B ' ', 0 ; Displays ' ' for display purposes END START ; last line of source
Now when you execute this program you will receive the following output:
A really cool aspect of the EASy68K IDE is the fact that you can view the raw memory of your 68K program. It works very well for this topic. You can read that C/C++ arrays are just a series of elements placed in contiguous memory locations until you are blue in the face, but sometimes visuals can aid in cementing our understanding. So here is the array beginning at address $1100 as declared in the code where each subsequent element is placed next to the previous.
I hope that it is now clear why indexing in an array starts at zero, at least for C/C++. You may be asking why it needs to be this way in languages like C#, Java or Python where we could possibly get around this fact. I think that it has more to do with consistency than anything else. Personally I started out programming in C++, later when I began learning to program in Java and C# it did not require any new knowledge to make use of their collection classes. Languages like Java and C# strive to be easy for C/C++ developers to pick up. Indeed, there are languages that begin their indexing at 1, but the vast majority of them start at 0.
Argument 2: Hardware influencing the Software
The first of the two other arguments is connected to the design of digital systems.
The argument begins with the realization that for any base b, the first b^N positive integers are represented by exactly N digits (including leading zeros) only if numbering starts at 0. So for base 2, we can represent 2^4 = 16 digits using 4 bits when numbering starts at 0. Which makes sense given that the sixteenth positive integer is 15 (%1111), if we start numbering at 0. However, if we do not start numbering at 0, then the sixteenth number becomes 16 (%10000) which is now 5 digits!
The larger implication of all this is that computer memory addresses have 2^N cells addressed by N bits. If we start counting at 1, then 2^N cells would need N+1 address lines! You then need an entirely new bit to access just a single address! You could leave this last address inaccessible, but why would you want to do that? So there you have it, the design of digital systems to start numbering at 0 has influenced the software which runs on them.
Argument 3: Mathematically Favorable
The extraordinary computer scientist Edsger Dijkstra outlined an argument built upon what is the most mathematically logical way to index a given sequence of elements in this EWD manuscript. The argument goes like this:
We can represent a sequence of natural numbers starting at a and ending at b in four possible different ways:
- a <= i < b + 1
- a – 1 < i <= b
- a <= i <= b
- a – 1 < i < b - 1
Let us start with the low hanging fruit. Number Three has the issue where writing an empty sequence is downright ugly. We can not simply write a <= i <= a as that is not empty (a is still in the sequence), so therefore we have to write a <= i <= a - 1 to achieve an empty sequence. Furthermore, if we assume that a is the smallest natural number, then our upper bound has now become an unnatural number! This option just went from ugly to hideous and therefore Number Three is out of the running.
Number Two and Number Four ultimately share the same fate. Say we choose to start our sequence at the smallest natural number, by the lack of the inclusion of the lower bound we force the lower bound into the realm of the unnatural numbers. For example, if a were the smallest natural number, then we would represent the lower bound as a – 1 < i ... which we can conclude that the number smaller than the smallest natural number is indeed an unnatural number.
Well that leaves Number One and although Harry Nilsson sings that it is the loneliest number, it is also the winner of this competition. Number One does not contain any of the undesirable traits of it’s brethren. It also has the added bonus that the difference between the bounds is the length of the sequence (Number Two has this trait as well).
From here we can index each element in the sequence of length N via a subscript (ordinal) in ideally one of two ways:
- 1 <= i < N + 1
- 0 <= i < N
Dijkstra argues (and I agree) that 0 <= i < N is the cleaner solution of these two for a couple of reasons. First, the ordinal of the element represents the number of elements preceding it in the sequence. Secondly, using N + 1 to represent the upper bound is a loss of elegance. You can easily inspect the 0 <= i < N representation to see that it is a sequence of length N.
So this concludes the discussion. Agree or disagree, these are just some of the common arguments for indexing an array starting at zero.