
I am building a simple user-based recommendation system using the MovieLens 10M dataset. While calculating the Pearson correlation, the sheer size of the data (69,878 rows x 10,677 columns) overwhelms my 16 GB of memory, so the computation stops with a memory error.

The matrix DataFrame (users as rows, movies as columns) on which I've been trying to compute the Pearson correlation via matrix.T.corr():

userid 1 2 3 4 5 6 7 8 9 10 ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
5 -2.850575 NaN NaN NaN NaN NaN -0.850575 NaN NaN NaN ...
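
For clarity, here is a simplified sketch of the setup (the file name, column names, and loading step are assumptions and may differ from my actual code; any normalization visible in the preview above is omitted here, since it doesn't change the memory problem):

    import pandas as pd

    # Assumed loading step: MovieLens 10M ratings.dat uses "::"-separated
    # UserID::MovieID::Rating::Timestamp records.
    ratings = pd.read_csv(
        "ratings.dat", sep="::", engine="python",
        names=["userid", "movieid", "rating", "timestamp"],
    )

    # 69,878 users x 10,677 movies, mostly NaN (unrated pairs).
    matrix = ratings.pivot(index="userid", columns="movieid", values="rating")

    # corr() correlates columns, so the transpose yields user-user Pearson
    # correlations -- this pairwise pass is what exhausts the 16 GB of RAM.
    user_corr = matrix.T.corr()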

I thought of slicing my data into chunks and applying the Pearson correlation chunk by chunk, but here is the catch: when I compute the correlation of the first chunk's rows with the other rows, I also need to pull in every other chunk, not just the first one. It quickly becomes complicated, and I fear I would stray too far from a proper solution.
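
One way to keep the chunked idea manageable is to mean-center and normalize the whole matrix once, then compute each chunk of users against all users as a plain matrix product, streaming the result into a disk-backed array. This is only a sketch (the function name, chunk size, and output path are arbitrary), and it treats missing ratings as the user's mean, so it approximates rather than exactly reproduces DataFrame.corr's pairwise-complete Pearson:

    import numpy as np

    def chunked_user_pearson(matrix, chunk_size=1000, out_path="user_corr.npy"):
        # Work in float32 to halve the footprint of the dense matrix.
        X = matrix.to_numpy(dtype=np.float32)

        # Mean-center each user over the items they actually rated, then
        # treat missing entries as 0 (i.e. as the user's mean rating).
        X = np.nan_to_num(X - np.nanmean(X, axis=1, keepdims=True))

        # L2-normalize rows so that a dot product equals the correlation.
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        X /= norms

        n = X.shape[0]
        # Keep the n x n result on disk instead of in RAM.
        corr = np.lib.format.open_memmap(
            out_path, mode="w+", dtype=np.float32, shape=(n, n)
        )
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            # One chunk of users against *all* users: a (chunk x n) block.
            corr[start:stop] = X[start:stop] @ X.T
        return corr

With chunk_size=1000, each in-memory block is only about 1000 x 69,878 float32 values (~280 MB); the full result still ends up around 19.5 GB, but on disk rather than in RAM.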

I also tried to reduce the DataFrame's size by converting the column dtypes from float64 to float16, which shrank the DataFrame to roughly a quarter of its original size (from 5.6 GB to 1.4 GB). However, when the corr method runs, I think it converts the data back to float64 internally, and it raises a memory error again.
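
To illustrate the effect: downcasting does shrink what is stored, but pandas' corr does its pairwise work on a float64 copy internally, which is consistent with what I'm seeing. A small check, with matrix standing in for the actual DataFrame:

    import numpy as np

    # Downcasting shrinks the stored DataFrame...
    small = matrix.astype(np.float16)
    print(matrix.memory_usage(deep=True).sum() / 2**30, "GiB")  # ~5.6 GiB
    print(small.memory_usage(deep=True).sum() / 2**30, "GiB")   # ~1.4 GiB

    # ...but DataFrame.corr converts back to float64 for the computation, so
    # the peak memory of the next line is about the same as before:
    # user_corr = small.T.corr()   # still raises MemoryError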

I've been looking for a proper Pandas & NumPy implementation so I can reuse the solution in other projects with large datasets, but I haven't succeeded yet.

I uploaded a simplified version of the code here, in case it's easier to look it over and run some tests.

  • Try dask.dataframe.DataFrame.corr
    – PaulS
    Commented Jul 7 at 15:56
  • Hey Paul, thanks for the tip. I've tried it and a few other approaches, but I'm afraid they didn't work due to the size of the data. I will just complete this project with a smaller dataset, although I'll keep working on finding a proper way to do it; maybe I can come up with something.
    – Can Demir
    Commented Jul 9 at 9:28
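
For completeness, the Dask suggestion from the comments would look roughly like the sketch below (the partition count is an arbitrary choice). Even if the computation itself stays out of core, the full 69,878 x 69,878 result is around 39 GB in float64, which is consistent with it still not fitting:

    import dask.dataframe as dd

    # Wrap the transposed user-item matrix; npartitions is an arbitrary guess.
    ddf = dd.from_pandas(matrix.T, npartitions=64)

    # .corr() builds a lazy task graph; .compute() runs it and materializes
    # the full 69,878 x 69,878 user-user matrix (~39 GB in float64).
    user_corr = ddf.corr().compute()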
