14

I'm just getting started with Apache Spark (in Scala, but the language is irrelevant). I'm using standalone mode and I want to process a text file from the local file system (so nothing distributed like HDFS).

According to the documentation of the textFile method from SparkContext, it will

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

What is unclear to me is whether the whole text file can just be copied to all the nodes, or whether the input data should already be partitioned, e.g. when using 4 nodes and a CSV file with 1000 lines, keeping 250 lines on each node.

I suspect each node should have the whole file but I'm not sure.
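For concreteness, here is a minimal sketch of what I have in mind (the master URL, the path and the object name are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalCsvCount {
  def main(args: Array[String]): Unit = {
    // Standalone master and file path are placeholders.
    val conf = new SparkConf()
      .setAppName("local-csv-count")
      .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    // The 1000-line CSV on the local file system (not HDFS).
    val lines = sc.textFile("/path/to/input.csv")

    println(s"lines = ${lines.count()}, partitions = ${lines.partitions.length}")
    sc.stop()
  }
}
```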

6 Answers

10

Each node should contain a whole copy of the file. In that case, the local file system is logically indistinguishable from HDFS with respect to this file.
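For illustration, a minimal sketch (the paths and the HDFS URI are made up): the call looks the same whether the file is a local copy present on every node or a file in HDFS:

```scala
// Works only if /data/input.csv exists at this exact path on every node.
val fromLocalCopies = sc.textFile("file:///data/input.csv")

// Same usage pattern as reading from HDFS (the namenode address is hypothetical).
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/input.csv")
```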

6
  • No reference given, but assuming this is correct because of your reputation and experience. Thanks!
    – herman
    Commented Jul 14, 2014 at 12:30
  • Thank you! It is my understanding, and I am quite sure in this case. Anyway - please let me know if you run into any problems with it. Commented Jul 14, 2014 at 12:49
  • 2
    From the programming guide on external datasets: If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
    – Tobber
    Commented Feb 17, 2015 at 10:11
  • 3
    In this case, how would Spark process the file in parallel? For example, if there are 4 worker nodes in the Spark cluster and you copy the entire file to the same folder on each worker node, will Spark read the file 4 times (once on each worker) or just randomly pick the copy from one of the 4 worker nodes?
    – buqing
    Commented May 22, 2017 at 18:00
  • @DavidGruzman Is there any way via spark to distribute the data to the nodes over the wire? (running on k8s) Seems crazy to have to read the file N times?
    – jtlz2
    Commented Mar 19, 2019 at 12:20
4

Prepend file:// to your local file path.
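For example (the path is made up):

```scala
// file:// prepended to an absolute local path, giving three slashes in total
val lines = sc.textFile("file:///home/user/data/lines.txt")
```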

1
  • 1
    Didn't work in my case. However, it worked with just one slash: sc.textFile('file:/home/data/lines').count() Commented Feb 16, 2016 at 22:56
2

From Spark's FAQ page - If you don't use Hadoop/HDFS, "if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode."
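A minimal sketch of that setup, assuming /mnt/shared is an NFS mount available at the same path on every node and spark://master:7077 is the standalone master (both are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode, no Hadoop/HDFS: every node sees the same NFS-mounted path.
val conf = new SparkConf()
  .setAppName("nfs-shared-fs-example")
  .setMaster("spark://master:7077")
val sc = new SparkContext(conf)

val lines = sc.textFile("file:///mnt/shared/data/input.csv")
println(lines.count())
```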

https://spark.apache.org/faq.html

2

The proper way to use it is with three slashes: two for the URI scheme syntax (just like http://) and one for the mount point of the Linux file system, e.g. sc.textFile("file:///home/worker/data/my_file.txt"). If you are running in local mode, then the local path alone is sufficient. In the case of a standalone cluster, the file must be copied to each node. Note that the contents of the file must be exactly the same on every node, otherwise Spark returns strange results.

1

Spark 1.6.1

Java 1.7.0_99

Nodes in cluster: 3 (HDP)

Case 1:

Running in local mode: local[n]

file:///.. and file:/.. both read the file from the local file system

Case 2:

`--master yarn-cluster`

Input path does not exist for file:/ and file:///

And for file://:

java.lang.IllegalArgumentException: Wrong FS: file://.. expected: file:///

1

Add "file:///" uri in place of "file://". This solved the issue for me.
