I am building a simple user-based recommendation system using the MovieLens 10M dataset. While calculating the Pearson correlations, the sheer size of the data (69,878 rows × 10,677 columns) overwhelms my 16 GB of memory, so I get a MemoryError and the computation stops.
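For context, a rough back-of-the-envelope estimate (assuming 8 bytes per float64 entry) shows why 16 GB is not enough: the ratings matrix itself fits, but the full 69,878 × 69,878 user-user correlation matrix does not:

```python
# Back-of-the-envelope memory estimate for the dense float64 matrices.
rows, cols = 69878, 10677
ratings_gb = rows * cols * 8 / 1024**3  # the user x movie ratings matrix
corr_gb = rows * rows * 8 / 1024**3     # the user x user correlation matrix
print(f"ratings: {ratings_gb:.1f} GB, correlations: {corr_gb:.1f} GB")
# ratings: 5.6 GB, correlations: 36.4 GB
```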
This is the `matrix` DataFrame I've been computing the Pearson correlations on, via `matrix.T.corr()`:
userid | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
5 | -2.850575 | NaN | NaN | NaN | NaN | NaN | -0.850575 | NaN | NaN | NaN | ... |
I thought of slicing the data into chunks and applying the Pearson correlation chunk by chunk, but here is the catch: to correlate the first row of the first chunk with the other rows, I still have to visit every other chunk. It gets complicated quickly, and I fear I would stray too far from an appropriate solution.
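For what it's worth, here is the kind of chunked scheme I have in mind, sketched in NumPy under a simplifying assumption: the NaNs are already filled (e.g. with each user's mean rating), whereas pandas' `corr()` uses pairwise-complete observations. Each chunk of rows is correlated against *all* rows, and only the top-k neighbours per user are kept, so the full n × n matrix never materializes (the function and parameter names are mine):

```python
import numpy as np

def chunked_topk_corr(R, chunk=1000, k=20):
    """Correlate each chunk of rows against all rows, keeping only the
    top-k most similar users per row. R is a dense float array with the
    NaNs already filled -- a simplifying assumption compared to pandas'
    pairwise-complete corr()."""
    R = R - R.mean(axis=1, keepdims=True)           # center each row
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                         # guard constant rows
    Z = R / norms                                   # unit-norm rows, so
    n = Z.shape[0]                                  # dot product = Pearson r
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=R.dtype)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        sims = Z[start:stop] @ Z.T                  # (chunk, n) correlations
        for i, row in enumerate(sims):
            row[start + i] = -np.inf                # drop self-similarity
            idx = np.argpartition(row, -k)[-k:]     # k largest, unsorted
            top_idx[start + i] = idx
            top_sim[start + i] = row[idx]
    return top_idx, top_sim
```

Keeping the data in float32 and only ever holding one `(chunk, n)` block in memory is what keeps the footprint bounded.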
I also tried to reduce the DataFrame's size by converting the columns from float64 to float16, which shrank it to a quarter of its original size (from 5.6 GB to 1.4 GB). However, the `corr` method appears to convert the data back to float64 internally, so it runs out of memory again.
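A tiny demo confirms the suspicion (the shapes and values here are arbitrary): pandas converts the underlying array to float64 before computing the correlations, so the float16 savings don't survive the call:

```python
import numpy as np
import pandas as pd

# A float16 DataFrame still produces a float64 correlation matrix,
# because corr() works on a float64 copy of the data internally.
df = pd.DataFrame(np.random.rand(4, 6).astype(np.float16))
print(df.dtypes.unique())           # float16 going in
print(df.T.corr().dtypes.unique())  # float64 coming out
```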
I've been looking for a proper pandas/NumPy implementation that I could reuse across other large-dataset projects, but so far without success.
I've uploaded a simplified version of the code here, in case you'd like to inspect it and run some tests.