If the compiler sill performed TCO, then all of the external foo(arr) calls would receive the same pointer. That's a visible semantics change, and thus no longer a pure optimization.
The fact that the local variable in question is an array is probably a red herring here; it is its visibility to the outside via a pointer that is important.
Consider this program:
#include <stdio.h>
int *valptr[7], **curptr = valptr, **endptr = valptr + 7;
void reset(void)
{
curptr = valptr;
}
int record(int *ptr)
{
if (curptr >= endptr)
return 1;
*curptr++ = ptr;
return 0;
}
int tally(void)
{
int **pp;
int count = 0;
for (pp = valptr; pp < curptr; pp++)
count += **pp;
return count;
}
int tail_function(int x)
{
return record(&x) ? tally() : tail_function(x + 1);
}
int main(void)
{
printf("tail_function(0) = %d\n", tail_function(0));
return 0;
}
As the tail_function recurses, which it does via a tail call, the record function records the addresses of different instances of the local variable x. When it runs out of room, it returns 1, and that triggers tail_function to call tally and return. tally sweeps through the recorded memory locations and adds their values.
If tally were subject to TCO, then there would just be one instance of x. Effectively, it would be this:
int tail_function(int x)
{
tail:
if (record(&x))
return tally();
x = x + 1;
goto tail;
}
And so now, record is recording the same location over and over again, causing tally to calculate an incorrect value instead of the expected 21.
The logic of record and tally depends on x being actually instantiated on each activation of the scope, and that outer activations of the scope have a lifetime which endures until the inner ones terminate. That requirement precludes tail_function from recursing in constant space; it must allocate separate x instances.