How does Numpy move data when transpose a matrix?

Question

It seems numpy.transpose only save strides, and do actually transpose lazily according to this

So, when data movement actually happened and how to move? use many many memcpy? or some other trick?

I follow the path: array_reshape, PyArray_Newshape, PyArray_NewCopy, PyArray_NewLikeArray, PyArray_NewFromDescr, PyArray_NewFromDescrAndBase, PyArray_NewFromDescr_int but see nothing about axis permute. When did it happen indeed?

Update 2021/1/19

Thanks for answers, numpy array copy with transpose is here, which use a common macro to implement it, this algorithm is very native, and it does not consider any of simd acceleration or cache friendliness

Data is copied from the source array according to its strides to the allocated target according to its strides. — hpaulj, Jan 04 '19 at 15:55
@hpaulj thank you for comment, could you tell me the code about it? I want to know about its algorithm of transpose. — Hao Zhang, Jan 04 '19 at 16:08

Ed Smith · Answer 1 · 2019-01-04T14:04:22.067

The answer to your question is: Numpy doesn't move data.

Did you see PyArray_Transpose on line 688 of your above links? There is a permute in this function,

    n = permute->len;
    axes = permute->ptr;
    ...
    for (i = 0; i < n; i++) {
        int axis = axes[i];
        ...
        permutation[i] = axis;
}

Any array shape is purely metadata, used by Numpy to understand how to handle the data, as memory is always stored linearly and contiguously. There is therefore no reason to move or reorder any data, from the docs here,

Other operations, such as transpose, don't move data elements around in the array, but rather change the information about the shape and strides so that the indexing of the array changes, but the data in the doesn't move. Typically these new versions of the array metadata but the same data buffer are new 'views' into the data buffer. There is a different ndarray object, but it uses the same data buffer. This is why it is necessary to force copies through use of the .copy() method if one really wants to make a new and independent copy of the data buffer.

The only reason to copy may be to maximize cache efficiency, although Numpy already considers this,

As it turns out, numpy is smart enough when dealing with ufuncs to determine which index is the most rapidly varying one in memory and uses that for the innermost loop.

I know `np.transpose` call will not move data, but `reshape` or `copy` after it **will** move data indeed. so where is the code moving data when I call `reshape`? — Hao Zhang, Jan 04 '19 at 14:56
@HaoZhang under `alloc.c`, on line 221 is `result = malloc(size);` (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c). The word "move" is a bit odd here, I assume `Numpy` will allocate a new data array in `np.copy` with `PyDataMem_NEW` and then fill it with your copied array values, before setting/copying the permutation/array shape metadata. — Ed Smith, Jan 04 '19 at 18:47
Also. see this https://github.com/numpy/numpy/blob/aef982e4482773e802cc0ef076bf5e76ff650cf9/numpy/core/src/multiarray/array_assign_array.c#L225 for copying arrays — Ed Smith, Jan 04 '19 at 18:54

hpaulj · Answer 2 · 2019-01-04T17:37:31.813

Tracing through the numpy C code is a slow and tedious process. I prefer to deduce patterns of behavior from timings.

Make a sample array and its transpose:

In [168]: A = np.random.rand(1000,1000)
In [169]: At = A.T

First a fast view - no coping of the databuffer:

In [171]: timeit B = A.ravel()
262 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

A fast copy (presumably uses some fast block memory coping):

In [172]: timeit B = A.copy()
2.2 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A slow copy (presumably requires traversing the source in its strided order, and the target in its own order):

In [173]: timeit B = A.copy(order='F')
6.29 ms ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Copying At without having to change the order - fast:

In [174]: timeit B = At.copy(order='F')
2.23 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Like [173] but going from 'F' to 'C':

In [175]: timeit B = At.copy(order='C')
6.29 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [176]: timeit B = At.ravel()
6.54 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Copies with simpler strided reordering fall somewhere in between:

In [177]: timeit B = A[::-1,::-1].copy()
3.75 ms ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [178]: timeit B = A[::-1].copy()
3.73 ms ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: timeit B = At[::-1].copy(order='K')
3.98 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This astype also requires the slower copy:

In [182]: timeit B = A.astype('float128')
6.7 ms ± 8.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

PyArray_NewFromDescr_int is described as Generic new array creation routine. While I can't figure out where it copies data from the source to the target, it clearly is checking order and strides and dtype. Presumably it handles all cases where the generic copy is required. The axis permutation isn't a special case.

How does Numpy move data when transpose a matrix?

2 Answers2