14

I'm just getting started with Apache Spark (in Scala, but the language is irrelevant). I'm using standalone mode and I want to process a text file from the local file system (so nothing distributed like HDFS).

According to the documentation of the textFile method from SparkContext, it will

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

What is unclear to me is whether the whole text file can just be copied to all the nodes, or whether the input data should already be partitioned, e.g. when using 4 nodes and a CSV file with 1000 lines, keeping 250 lines on each node.

I suspect each node should have the whole file but I'm not sure.
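For concreteness, here is a minimal sketch of what I have in mind (the master URL, the path and the object name are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalCsvCount {
  def main(args: Array[String]): Unit = {
    // Standalone master and file path are placeholders.
    val conf = new SparkConf()
      .setAppName("local-csv-count")
      .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    // The 1000-line CSV on the local file system (not HDFS).
    val lines = sc.textFile("/path/to/input.csv")

    println(s"lines = ${lines.count()}, partitions = ${lines.partitions.length}")
    sc.stop()
  }
}
```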

6 Answers

10

Each node should contain a whole copy of the file. In that case, the local file system is logically indistinguishable from HDFS with respect to this file.
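For illustration, a minimal sketch (the paths and the HDFS URI are made up): the call looks the same whether the file is a local copy present on every node or a file in HDFS:

```scala
// Works only if /data/input.csv exists at this exact path on every node.
val fromLocalCopies = sc.textFile("file:///data/input.csv")

// Same usage pattern as reading from HDFS (the namenode address is hypothetical).
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/input.csv")
```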

6
  • No reference given, but assuming this is correct because of your reputation and experience. Thanks!
    – herman
    Commented Jul 14, 2014 at 12:30
  • Thank you! It is my understanding, and I am quite sure in this case. Anyway - please let me know if you run into any problems with it. Commented Jul 14, 2014 at 12:49
  • 2
    From the programming guide on external datasets: If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
    – Tobber
    Commented Feb 17, 2015 at 10:11
  • 3
    In this case, how would Spark process the file in parallel? For example, if there are 4 worker nodes in the Spark cluster and you copy the entire file to the same folder on each worker node, will Spark read the file 4 times (once on each worker) or just randomly pick the copy from one of the 4 worker nodes?
    – buqing
    Commented May 22, 2017 at 18:00
  • @DavidGruzman Is there any way via spark to distribute the data to the nodes over the wire? (running on k8s) Seems crazy to have to read the file N times?
    – jtlz2
    Commented Mar 19, 2019 at 12:20
4

Prepend file:// to your local file path.
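For example (the path is made up):

```scala
// file:// prepended to an absolute local path, giving three slashes in total
val lines = sc.textFile("file:///home/user/data/lines.txt")
```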

1
  • 1
    Didn't work in my case. However, it worked with just one slash: sc.textFile('file:/home/data/lines').count() Commented Feb 16, 2016 at 22:56
2

From Spark's FAQ page - If you don't use Hadoop/HDFS, "if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode."
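A minimal sketch of that setup, assuming /mnt/shared is an NFS mount available at the same path on every node and spark://master:7077 is the standalone master (both are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode, no Hadoop/HDFS: every node sees the same NFS-mounted path.
val conf = new SparkConf()
  .setAppName("nfs-shared-fs-example")
  .setMaster("spark://master:7077")
val sc = new SparkContext(conf)

val lines = sc.textFile("file:///mnt/shared/data/input.csv")
println(lines.count())
```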

https://spark.apache.org/faq.html

2

The proper way to use it is with three slashes: two for the URI scheme syntax (just like http://) and one for the mount point of the Linux file system, e.g. sc.textFile("file:///home/worker/data/my_file.txt"). If you are running in local mode, then the local path alone is sufficient. In the case of a standalone cluster, the file must be copied to each node. Note that the contents of the file must be exactly the same on every node, otherwise Spark returns strange results.

1

Spark 1.6.1

Java 1.7.0_99

Nodes in cluster: 3 (HDP)

Case 1:

Running in local mode: local[n]

file:///.. and file:/.. both read the file from the local file system

Case 2:

`--master yarn-cluster`

Input path does not exist for file:/ and file:///

And for file://:

java.lang.IllegalArgumentException: Wrong FS: file://.. expected: file:///

1

Add "file:///" uri in place of "file://". This solved the issue for me.
