This question extends this post. It relates to a machine learning feature-selection procedure where I have a large matrix of features, and I would like to perform fast and crude feature selection by measuring the correlation between the outer product of each pair of features and the response, since I will be using a random forest or boosting classifier.
The number of features is ~60,000 and the number of responses (observations) is ~2,200,000.
Given unlimited memory, perhaps the fastest way to go about this would be to generate a matrix whose columns are the outer products of all pairs of features and run cor of that matrix against the response. As a smaller-dimension example:
set.seed(1)
feature.mat <- matrix(rnorm(2200*100),nrow=2200,ncol=100)
response.vec <- rnorm(2200)
#generate indices of all unique pairs of features and get the outer products:
feature.pairs <- t(combn(1:ncol(feature.mat),2))
feature.pairs.prod <- feature.mat[,feature.pairs[,1]]*feature.mat[,feature.pairs[,2]]
#compute the correlation coefficients
res <- cor(feature.pairs.prod,response.vec)
But for my real dimensions, feature.pairs.prod would be 2,200,000 by 1,799,970,000, which obviously cannot be stored in memory.
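(As a rough back-of-the-envelope check, assuming 8-byte doubles, the full product matrix would need on the order of tens of petabytes:)

# approximate memory for the full 2,200,000 x 1,799,970,000 double matrix
2200000 * 1799970000 * 8 / 2^50  # roughly 28 PiB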
So my question is: is it possible to get all the correlations in a reasonable computation time, and if so, how?
I was thinking that breaking feature.pairs.prod into chunks that fit in memory and then running cor between each chunk and response.vec, one chunk at a time, might be the fastest approach, but I'm not sure how to determine automatically in R what dimensions those chunks should be.
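For concreteness, here is a minimal sketch of that chunked idea, reusing the small example above; chunk.size is an arbitrary value I made up and would have to be tuned to the available memory:

# hypothetical chunk size -- would need tuning to available memory
chunk.size <- 500
n.pairs <- nrow(feature.pairs)
res.chunked <- numeric(n.pairs)
for (s in seq(1, n.pairs, by = chunk.size)) {
  idx <- s:min(s + chunk.size - 1, n.pairs)
  # build only this chunk of pairwise products (2200 x length(idx))
  prod.chunk <- feature.mat[, feature.pairs[idx, 1], drop = FALSE] *
                feature.mat[, feature.pairs[idx, 2], drop = FALSE]
  res.chunked[idx] <- cor(prod.chunk, response.vec)
}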
Another option is to apply a function over the rows of feature.pairs that computes the outer product and then the correlation between that product and response.vec.
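Something like this, I assume, although I suspect the per-pair overhead would make it slow:

# one cor call per pair of features
res.apply <- apply(feature.pairs, 1, function(p)
  cor(feature.mat[, p[1]] * feature.mat[, p[2]], response.vec))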
Any suggestions?