Correlation between a 2D matrix/array and 1D array/vector :
We can adapt corr2_coeff_rowwise for correlation between a 2D array/matrix and a 1D array/vector, like so -
def corr2_coeff_2d_1d(A, B):
# Rowwise mean of input arrays & subtract from input arrays themeselves
A_mA = A - A.mean(1,keepdims=1)
B_mB = B - B.mean()
# Sum of squares across rows
ssA = np.einsum('ij,ij->i',A_mA,A_mA)
ssB = B_mB.dot(B_mB)
# Finally get corr coeff
return A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
To shuffle each row and do this for all rows, we can make use of np.random.shuffle. Now, this shuffle function works along the first axis. So, to solve our case, we need to feed in the transposed version. Also, note that this shuffling would be done in-place. So, if original dataframe is needed elsewhere, do make a copy before processing. Thus, the solution would be -
Hence, let's use this to solve our case -
# Extract underlying arry data for faster NumPy processing in loop later on
a = df.values
s_ar = s.values
# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]
for i in range(N):
# Get shuffled indices per row with `rand+argsort/argpartition` trick from -
# https://stackoverflow.com/a/45438143/
idx = np.random.rand(*a.shape).argsort(1)
# Shuffle array data with NumPy's advanced indexing
shuffled_a = a[r, idx]
# Compute correlation
corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
Optimized version #1
Now, we could pre-compute for the parts involving the series that stays the same between iterations. Hence, a further optimized version would look like this -
a = df.values
s_ar = s.values
r = np.arange(a.shape[0])[:,None]
B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)
A = a
A_mean = A.mean(1,keepdims=1)
for i in range(N):
# Get shuffled indices per row with `rand+argsort/argpartition` trick from -
# https://stackoverflow.com/a/45438143/
idx = np.random.rand(*a.shape).argsort(1)
# Shuffle array data with NumPy's advanced indexing
shuffled_a = a[r, idx]
# Compute correlation
A = shuffled_a
A_mA = A - A_mean
ssA = np.einsum('ij,ij->i',A_mA,A_mA)
corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
Benchmarking
Setup inputs with actual use-case shapes/sizes
In [302]: df = pd.DataFrame(np.random.rand(10000,1000))
In [303]: s = pd.Series(df.iloc[0])
1. Original method
In [304]: %%timeit
...: df_sh = df.apply(np.random.permutation, axis=1)
...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop
2. Proposed method
The pre-processing part (only done once before starting loop, so not including in timings) -
In [305]: a = df.values
...: s_ar = s.values
...: r = np.arange(a.shape[0])[:,None]
...:
...: B = s_ar
...: B_mB = B - B.mean()
...: ssB = B_mB.dot(B_mB)
...:
...: A = a
...: A_mean = A.mean(1,keepdims=1)
Part of proposed solution that runs in loop -
In [306]: %%timeit
...: idx = np.random.rand(*a.shape).argsort(1)
...: shuffled_a = a[r, idx]
...:
...: A = shuffled_a
...: A_mA = A - A_mean
...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop
Thus, we are seeing a speedup of around 3x here!