I am building a simple user-based recommendation system using the MovieLens 10M dataset. While calculating the Pearson correlations, the sheer size of the data (69,878 rows × 10,677 columns) overwhelms my 16 GB of memory, so I get a MemoryError and the computation stops.
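For context, a rough back-of-the-envelope estimate (assuming 8 bytes per float64 entry) shows why 16 GB is not enough: the ratings matrix itself fits, but the full 69,878 × 69,878 user-user correlation matrix does not:

```python
# Back-of-the-envelope memory estimate for the dense float64 matrices.
rows, cols = 69878, 10677
ratings_gb = rows * cols * 8 / 1024**3  # the user x movie ratings matrix
corr_gb = rows * rows * 8 / 1024**3     # the user x user correlation matrix
print(f"ratings: {ratings_gb:.1f} GB, correlations: {corr_gb:.1f} GB")
# ratings: 5.6 GB, correlations: 36.4 GB
```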
This is the `matrix` DataFrame I've been computing the Pearson correlations on, via `matrix.T.corr()`:
userid | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... |
5 | -2.850575 | NaN | NaN | NaN | NaN | NaN | -0.850575 | NaN | NaN | NaN | ... |
I thought of slicing the data into chunks and applying the Pearson correlation chunk by chunk, but here is the catch: to correlate the first row of the first chunk with the other rows, I still have to visit every other chunk. It gets complicated quickly, and I fear I would stray too far from an appropriate solution.
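For what it's worth, here is the kind of chunked scheme I have in mind, sketched in NumPy under a simplifying assumption: the NaNs are already filled (e.g. with each user's mean rating), whereas pandas' `corr()` uses pairwise-complete observations. Each chunk of rows is correlated against *all* rows, and only the top-k neighbours per user are kept, so the full n × n matrix never materializes (the function and parameter names are mine):

```python
import numpy as np

def chunked_topk_corr(R, chunk=1000, k=20):
    """Correlate each chunk of rows against all rows, keeping only the
    top-k most similar users per row. R is a dense float array with the
    NaNs already filled -- a simplifying assumption compared to pandas'
    pairwise-complete corr()."""
    R = R - R.mean(axis=1, keepdims=True)           # center each row
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                         # guard constant rows
    Z = R / norms                                   # unit-norm rows, so
    n = Z.shape[0]                                  # dot product = Pearson r
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=R.dtype)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        sims = Z[start:stop] @ Z.T                  # (chunk, n) correlations
        for i, row in enumerate(sims):
            row[start + i] = -np.inf                # drop self-similarity
            idx = np.argpartition(row, -k)[-k:]     # k largest, unsorted
            top_idx[start + i] = idx
            top_sim[start + i] = row[idx]
    return top_idx, top_sim
```

Keeping the data in float32 and only ever holding one `(chunk, n)` block in memory is what keeps the footprint bounded.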
I also tried to reduce the DataFrame's size by converting the columns from float64 to float16, which shrank it to a quarter of its original size (from 5.6 GB to 1.4 GB). However, the `corr` method appears to convert the data back to float64 internally, so it runs out of memory again.
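A tiny demo confirms the suspicion (the shapes and values here are arbitrary): pandas converts the underlying array to float64 before computing the correlations, so the float16 savings don't survive the call:

```python
import numpy as np
import pandas as pd

# A float16 DataFrame still produces a float64 correlation matrix,
# because corr() works on a float64 copy of the data internally.
df = pd.DataFrame(np.random.rand(4, 6).astype(np.float16))
print(df.dtypes.unique())           # float16 going in
print(df.T.corr().dtypes.unique())  # float64 coming out
```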
I've been looking for a proper pandas/NumPy implementation that I could reuse across other large-dataset projects, but so far without success.
I've uploaded a simplified version of the code here, in case you'd like to inspect it and run some tests.