
I am running a Spark job and got a "Not enough space to cache rdd_128_17000 in memory" warning. However, the attached screenshot clearly shows that only 90.8 GB out of 719.3 GB is used. Why is that? Thanks!


15/10/16 02:19:41 WARN storage.MemoryStore: Not enough space to cache rdd_128_17000 in memory! (computed 21.4 GB so far)
15/10/16 02:19:41 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 21.2 GB (scratch space shared across 1 thread(s)) = 25.2 GB. Storage limit = 36.0 GB.
15/10/16 02:19:44 WARN storage.MemoryStore: Not enough space to cache rdd_129_17000 in memory! (computed 9.4 GB so far)
15/10/16 02:19:44 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 30.6 GB (scratch space shared across 1 thread(s)) = 34.6 GB. Storage limit = 36.0 GB.
15/10/16 02:25:37 INFO metrics.MetricsSaver: 1001 MetricsLockFreeSaver 339 comitted 11 matured S3WriteBytes values
15/10/16 02:29:00 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt1/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0000 134217728 bytes md5: qkQ8nlvC8COVftXkknPE3A== md5hex: aa443c9e5bc2f023957ed5e49273c4dc
15/10/16 02:38:15 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0001 134217728 bytes md5: RgoGg/yJpqzjIvD5DqjCig== md5hex: 460a0683fc89a6ace322f0f90ea8c28a
15/10/16 02:42:20 INFO metrics.MetricsSaver: 2001 MetricsLockFreeSaver 339 comitted 10 matured S3WriteBytes values

[Spark UI screenshot showing 90.8 GB used out of 719.3 GB total]

  • Total used / Total does not matter for caching blocks; they are atomic in the memory sense. Can you try to increase the number of partitions for that specific RDD (see the sketch after these comments)? BTW, you have a nifty cluster. Commented Oct 16, 2015 at 5:10
  • So what would be the difference between a cached block and the (Total used / Total) shown in the UI? Thanks!
    – Edamame
    Commented Oct 16, 2015 at 16:24
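
A minimal sketch of the repartitioning suggested in the first comment, assuming the cached dataset is a PySpark RDD named rdd; the factor of 4 is purely illustrative. More partitions mean smaller blocks, so each one is more likely to fit into the storage region when cached:

    # hypothetical: split the data into more, smaller partitions before caching
    rdd = rdd.repartition(rdd.getNumPartitions() * 4)
    rdd.cache()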

4 Answers


This is likely caused by spark.storage.memoryFraction being set too low. Spark will only use this fraction of the allocated memory to cache RDDs.

Try one of the following:

  • increase the storage fraction
  • rdd.persist(StorageLevel.MEMORY_ONLY_SER) to reduce memory usage by storing the RDD data in serialized form
  • rdd.persist(StorageLevel.MEMORY_AND_DISK) to spill partitions to disk when the memory limit is reached (see the sketch below)
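
For illustration, a minimal PySpark sketch of the first and third suggestions; the app name, fraction value, and input path are placeholders. Note that spark.storage.memoryFraction only applies to the legacy memory manager (the default up to Spark 1.5); from Spark 1.6 onward the unified manager is tuned via spark.memory.fraction instead:

    from pyspark import SparkConf, SparkContext, StorageLevel

    # raise the share of the heap reserved for cached blocks (default is 0.6)
    conf = (SparkConf()
            .setAppName("cache-tuning-sketch")
            .set("spark.storage.memoryFraction", "0.7"))
    sc = SparkContext(conf=conf)

    rdd = sc.textFile("s3://your-bucket/your-input")  # placeholder input path
    # spill partitions that do not fit in memory to local disk instead of dropping them
    rdd.persist(StorageLevel.MEMORY_AND_DISK)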

This could be due to the following issue if you're loading lots of Avro files:

https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCANx3uAiJqO4qcTXePrUofKhO3N9UbQDJgNQXPYGZ14PWgfG5Aw@mail.gmail.com%3E

With a PR in progress at:

https://github.com/databricks/spark-avro/pull/95


I have a Spark-based batch application (a JAR with a main() method, not written by me; I'm not a Spark expert) that I run in local mode without spark-submit, spark-shell, or spark-defaults.conf. When I tried to use the IBM JRE (as one of my customers does) instead of the Oracle JRE, on the same machine and with the same data, I started getting those warnings.

Since the memory store is a fraction of the heap (see the page that Jacob suggested in his comment), I checked the heap size: the IBM JRE uses a different strategy to decide the default heap size, and it was too small. So I simply added appropriate -Xms and -Xmx parameters, and the problem disappeared: the batch now works fine with both the IBM and Oracle JREs.
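
For illustration only, the kind of launch command this fix amounts to might look like the following; the jar name and heap sizes are made-up placeholders, not values from the answer:

    java -Xms4g -Xmx4g -jar my-spark-batch.jar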

My usage scenario is not typical, I know, but I hope this can help someone.


You can fix this problem by increasing the memory allocation. In PySpark, for example, you can raise spark.driver.memory and spark.executor.memory to 4g with this configuration:

from pyspark.sql import SparkSession

# build (or reuse) a session with larger driver and executor heaps
spark = (SparkSession.builder
         .appName("Pandas_on_spark")
         .config("spark.driver.memory", "4g")
         .config("spark.executor.memory", "4g")
         .getOrCreate())
