I'm using pySpark in version 2.3 (and cannot update to 2.4 on my current dev system) and have the following questions concerning foreachPartition.
First a little context: As far as I understand, pySpark UDFs force the Python code to be executed outside the Java Virtual Machine (JVM) in a separate Python instance, which makes them costly in terms of performance. Since I need to apply some Python functions to my data and want to minimize the overhead, I had the idea to at least load a manageable chunk of data into the driver and process it as a Pandas DataFrame. However, that would sacrifice the parallelism advantage Spark offers.
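For context, a typical pySpark UDF looks like the following minimal sketch (the column names and data are made up); every value round-trips between the JVM and a Python worker, which is where the cost comes from:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

    # each value of "x" is serialized to a Python worker and back
    plus_one = udf(lambda x: x + 1, IntegerType())
    df.withColumn("y", plus_one("x")).show()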
Then I read that foreachPartition applies a function to all the data within a partition and hence allows parallel processing.
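For illustration, this is how I understand it would be used (a minimal sketch; handle_partition and the toy RDD are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(range(100), 4)

    def handle_partition(rows):
        # 'rows' is an iterator over all records of one partition;
        # foreachPartition is an action and is used for side effects only
        for row in rows:
            pass  # e.g. write the row to an external sink

    rdd.foreachPartition(handle_partition)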
My questions now are:
1. When I apply a Python function via foreachPartition, does the Python execution take place within the driver process (and is the partition data therefore transferred over the network to my driver)?
2. Is the data processed row-wise within foreachPartition (meaning every RDD row is transferred one by one to the Python instance), or is the partition data processed at once (meaning, for example, the whole partition is transferred to the instance and handled as a whole by one Python instance)? See the sketch below for the pattern I am hoping for.
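To make that second question concrete: the pattern I am hoping for is roughly the following, where a whole partition ends up in one Pandas DataFrame (process_partition is a hypothetical sketch, rdd an arbitrary RDD of rows):

    import pandas as pd

    def process_partition(rows):
        # materialize the whole partition at once ...
        pdf = pd.DataFrame(list(rows))
        # ... and apply my Python functions to pdf here

    rdd.foreachPartition(process_partition)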
Thank you in advance for your input!
Edit:
A working in-driver solution I used before looks like this, taken from SO here:
    for partition in rdd.mapPartitions(lambda partition: [list(partition)]).toLocalIterator():
        # Do stuff on the partition, which arrives as one complete list
        pass
As can be read from the docs, rdd.toLocalIterator() provides the necessary functionality:
Return an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD.
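To illustrate why the mapPartitions wrapper is needed at all, here is a small toy example (the data and partition count are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(range(6), 2)

    # plain toLocalIterator() yields the rows one by one:
    list(rdd.toLocalIterator())
    # -> [0, 1, 2, 3, 4, 5]

    # wrapping each partition in a one-element list yields whole partitions:
    list(rdd.mapPartitions(lambda p: [list(p)]).toLocalIterator())
    # -> [[0, 1, 2], [3, 4, 5]]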