Questions tagged [apache-spark]
Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases for Apache Spark include machine/deep learning and graph processing.
apache-spark · 82,564 questions
0 votes · 0 answers · 17 views
What is the recommended way to process large Spark DataFrames in chunks: `toPandas()` or `RDD.foreachPartition()`?
I am working with large datasets in PySpark and need to process my data in chunks of 500 records each. I am deciding between converting my Spark DataFrames to Pandas DataFrames using toPandas()...
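Neither approach from the question is reproduced here verbatim, but as a hedged sketch: `RDD.foreachPartition()` hands each partition to your function as a plain Python iterator, so the 500-record chunking can be done with ordinary batching and no driver-side collect (which is what `toPandas()` would require). The `handle` callback in the usage comment is a hypothetical stand-in.

```python
from itertools import islice

def in_batches(rows, batch_size=500):
    """Yield lists of up to batch_size rows from an iterator.

    This is the kind of helper you would pass to RDD.foreachPartition,
    which gives your function each partition as a plain iterator, so
    the full dataset never has to fit on the driver.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Inside Spark, usage would look roughly like (hypothetical `handle`):
#   df.rdd.foreachPartition(
#       lambda part: [handle(b) for b in in_batches(part)])
```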
0 votes · 0 answers · 14 views
"ps: unknown option -- o" error when starting an Apache Spark HTTP server
When I run an Apache Spark server from the Git Bash command line of my Windows 10 (v. 22H2) laptop (x64), using the following command:
spark-submit Documents/Python/Temp/rumbledb-1.21.0-for-spark-3.5....
0 votes · 0 answers · 23 views
Spark SQL query returns a column with 0 length but non-null values
I have a Spark DataFrame for a Parquet file. The column is of string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...
-1 votes · 0 answers · 15 views
How do you handle big datasets for training XGBoost?
I have a dataset with 530 classes that is heavily imbalanced. As I am new to handling such large datasets, I undersampled the top 10 majority classes and then concatenated it with the other data ...
0 votes · 0 answers · 21 views
Spark-Scala vs PySpark: why is the DAG different?
I am converting a PySpark job to Scala; the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different, and the DAG that gets created is also different. Here I ...
0 votes · 0 answers · 15 views
How to use API and API key on Python [closed]
I am trying to implement API keys for Alpha Vantage, Bloomberg, and NewsAPI to load data into Hadoop using Spark
ALPHA_VANTAGE_API_URL = "https://www.alphavantage.co/query?function=...
0 votes · 1 answer · 36 views
Expand a JSON string to multiple columns in PySpark (Python)
I need to expand the JSON object (column B) to multiple columns.
From this table:

Column A | Column B
---------+----------------------------
id1      | [{a:1,b:'letter1'}]
id2      | [{a:1,b:'letter2',c:3,d:4}]

To this table:

Column A | a | b   | c | d
---------+---+-----+---+---
id1      | 1 | ...
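In PySpark this is typically done with `pyspark.sql.functions.from_json` plus a schema, then selecting the struct fields as columns. The flattening step itself, sketched in plain Python on the question's (hypothetical) sample shape — missing keys become None, just as Spark would yield null:

```python
def expand_json_column(rows, keys=("a", "b", "c", "d")):
    """Flatten a one-element list of objects ("Column B") into fixed columns.

    Mirrors what from_json plus a select of struct fields would produce
    in PySpark; keys absent from a row's object come out as None.
    """
    out = []
    for row in rows:
        obj = row["B"][0] if row["B"] else {}
        flat = {"A": row["A"]}
        for k in keys:
            flat[k] = obj.get(k)
        out.append(flat)
    return out
```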
0 votes · 1 answer · 35 views
A seemingly stopped HTTP service continues to work
I wrote a Python function, detailed in another post, which communicates with an HTTP-based RumbleDB server running on my local machine in order to evaluate JSONiq queries.
I tested the function in a ...
1 vote · 1 answer · 23 views
How do I join two DataFrames on a nested array attribute?
I have two DataFrames created by ingesting JSON data with the schemas below
provider:
{
npi: "..."
name: "..."
location: {
address: "...",
...
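In Spark, a join on a value inside a nested array is usually handled with `pyspark.sql.functions.explode` on the array column followed by a plain equi-join on the exploded field. The same shape in plain Python, with all field names hypothetical stand-ins for the question's schema:

```python
def join_on_nested(left, right, left_array_key, right_key):
    """Join two lists of dicts where the join key sits inside an array
    on the left side.

    This mirrors explode-then-join in Spark: each element of the left
    row's array is matched against right rows indexed by right_key,
    producing one output row per match.
    """
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    out = []
    for l in left:
        for item in l[left_array_key]:      # "explode" the nested array
            for r in index.get(item, []):
                out.append({**l, "matched": r})
    return out
```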
0 votes · 1 answer · 33 views
Is it preferable to use PySpark/SparkR in-built API functions over SQL queries for Databricks?
I am currently working on Databricks (using SparkR, but I'd imagine my question is still relevant for PySpark). I have a general question about whether there is a performance difference in using ...
0 votes · 0 answers · 18 views
Spark not asking for delegation tokens from ticket cache after upgrading to 3.5.1
We are not able to upgrade Spark from version 3.1.3 to 3.5.1 due to an authentication error.
Steps to reproduce the issue:
Create the Spark image from the binary.
Edit the image entrypoint in order ...
0 votes · 0 answers · 22 views
Writing a small Parquet DataFrame to Google Cloud Storage using Spark 3.5.0 takes too long
We are using Spark on-premise to simply read a Parquet file from GCS (Google Cloud Storage) into a DataFrame and write the DataFrame to another folder in GCS in Parquet format, using the code below:
...
0 votes · 0 answers · 17 views
Delta Lake table: merge command generates ONE huge file (3 GB or larger)
I have a Delta table that, when merging new data, generates ONLY one HUGE file (> 3 GB) per partition.
Create Session
builder = pyspark.sql.SparkSession.builder.appName("...
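Not part of the question, but a common mitigation worth noting as a hedged sketch: Spark's `spark.sql.files.maxRecordsPerFile` setting caps how many rows land in each output file, so a merge cannot emit a single multi-GB file per partition. The threshold below is an arbitrary example value, and `spark` is an assumed, already-built session.

```python
# Config sketch (assumed session variable `spark`); tune the value to taste.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5000000)
# Delta-specific alternatives: enable optimized writes, or run OPTIMIZE
# after the merge to compact/split files toward a target size.
```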
-1 votes · 0 answers · 17 views
How to access the Spark session in a Java JAR running on a Databricks cluster
I have a Java app in the form of a JAR file which is installed on a Databricks cluster. The app reads from and writes to tables in Databricks, so it needs a Spark session to perform these actions. I need to somehow ...
0 votes · 0 answers · 29 views
FileNotFound when creating SparkContext with YARN without HDFS
I'm just getting started with Spark / PySpark and am trying to switch the resource manager from Spark Standalone to Hadoop YARN. We're using MinIO for storage and so are not running HDFS nodes.
The problem ...