
I'm trying to merge large data frames in R. My computer has a 3 GHz quad-core Intel Core i5 CPU and 8 GB of RAM. I'm using the Reduce function rather than a loop. All the data frames are in a list; their total size is 160 MB.

merged_df <- Reduce(function(x, y) merge(x, y, all = TRUE, by = my_column_ID), my_list)  # merged_df is a placeholder result name

Before running the script, I expanded R's virtual memory limit to 50 GB from the Terminal, as explained below.

cd ~
touch .Renviron
open .Renviron

R_MAX_VSIZE=50Gb

https://r.789695.n4.nabble.com/R-3-5-0-vector-memory-exhausted-error-on-readBin-td4750237.html
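
A quick way to confirm the limit was actually picked up (a minimal check, assuming .Renviron is in the home directory and R was restarted afterwards):

# In a fresh R session, the value set in ~/.Renviron should be visible
Sys.getenv("R_MAX_VSIZE")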

The computer was also restarted before running the script to clear the RAM, and all other programs are closed; only RStudio is running.

The script has been running for hours, so I decided to check the Activity Monitor on my Mac.

The CPU usage is very low (kernel 2.3% and rsession 0.5%), but the memory usage is very high (kernel 30 MB, rsession 36 GB and rstudio 200 MB).

How can that be explained? Why is the CPU not working hard, if the data is all in memory and should therefore be quickly accessible?

  • What is the speed of the hard drive? A slow drive could cause what you are seeing
    – anon
    Commented May 25, 2021 at 19:43
  • Is that right? 36 GB? That's more than 4 times the amount of RAM you have. This isn't going to work. Just to make sure, please provide the output of free -m.
    – Daniel B
    Commented May 25, 2021 at 20:09

2 Answers

1

You have a machine with 8 GB of RAM, but your program is using 36 GB. That means it will be using swap.

Using that much swap means your program is not going to be efficient: in order to read one part of memory back in, it has to push another part out to disk.

Your program is using almost no CPU time because it spends nearly all of its run time waiting for blocks of memory to be paged in and out of your disk.

Your program requires at least four times as much memory as your system actually has. If you want this job to run at a reasonable speed, you will need to install more RAM or run the program on a machine that has more.

An SSD would swap faster than an HDD, but even an SSD is not going to handle this kind of load well.
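
As a rough sanity check, something like the following (a sketch using base R's object.size() and gc(), with my_list as in the question) compares the size of the inputs against physical RAM and reports the session's memory use:

# Total size of the input data frames, in MB (should be roughly 160 MB here)
sum(sapply(my_list, function(df) as.numeric(object.size(df)))) / 1024^2

# gc() reports, among other things, the session's current and peak memory use
gc()

If the inputs are only about 160 MB but the session grows to 36 GB, the blow-up is happening inside the merge itself, and raising R_MAX_VSIZE only lifts R's own cap; it does not add physical memory.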

0

You may be merging one-to-many and this could cause unneeded memory usage. I'd do this to troubleshoot:

  1. Merge just two data frames, outside of the Reduce call: merge(x, y, all = TRUE, by = my_column_ID)

  2. Merge a subset of x and/or y. For example x.small <- x[1:100,]; merge(x.small, y, all = TRUE, by = my_column_ID). Do the same for y.

Check to see if the results are what you expect. Oftentimes you have duplicated keys (the by column) that multiply the number of rows in the result and eat up memory. This problem is often solved by de-duplicating your data.
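
For example, a minimal sketch of both checks (assuming my_column_ID holds the column name as a string, and taking the first two data frames from my_list):

x <- my_list[[1]]
y <- my_list[[2]]

# Count duplicated keys on each side; duplicates multiply rows in the merge
sum(duplicated(x[[my_column_ID]]))
sum(duplicated(y[[my_column_ID]]))

# Merge a small slice first and watch whether the row count explodes
x.small <- x[1:100, ]
nrow(merge(x.small, y, all = TRUE, by = my_column_ID))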

Of course, this all depends on your data. If you could share some (e.g., dput(head(x)) and dput(head(y))), then we could troubleshoot far better.
