Questions tagged [apache-spark]
Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases for Apache Spark include machine/deep learning and graph processing.
apache-spark · 82,564 questions
0 votes · 0 answers · 17 views
What is the recommended way to process large Spark DataFrames in chunks: `toPandas()` or `RDD.foreachPartition()`?
I am working with large datasets in PySpark and need to process my data in chunks of 500 records each. I am deciding between converting my Spark DataFrames to Pandas DataFrames using toPandas()...
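Neither approach from the question is reproduced here verbatim, but as a hedged sketch: `RDD.foreachPartition()` hands each partition to your function as a plain Python iterator, so the 500-record chunking can be done with ordinary batching and no driver-side collect (which is what `toPandas()` would require). The `handle` callback in the usage comment is a hypothetical stand-in.

```python
from itertools import islice

def in_batches(rows, batch_size=500):
    """Yield lists of up to batch_size rows from an iterator.

    This is the kind of helper you would pass to RDD.foreachPartition,
    which gives your function each partition as a plain iterator, so
    the full dataset never has to fit on the driver.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Inside Spark, usage would look roughly like (hypothetical `handle`):
#   df.rdd.foreachPartition(
#       lambda part: [handle(b) for b in in_batches(part)])
```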
0 votes · 0 answers · 14 views
"ps: unknown option -- o" error when starting an Apache Spark HTTP server
When I run an Apache Spark server from the Git Bash command line of my Windows 10 (v. 22H2) laptop (x64), using the following command:
spark-submit Documents/Python/Temp/rumbledb-1.21.0-for-spark-3.5....
0 votes · 0 answers · 23 views
Spark SQL query returns a column with 0 length but non-null values
I have a Spark DataFrame for a Parquet file. The column is of string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...
-1 votes · 0 answers · 15 views
How do you handle big datasets for training XGBoost?
I have a dataset with 530 classes that is heavily imbalanced. As I am new to handling such large datasets, I undersampled the top 10 majority classes and then concatenated it with the other data ...
0 votes · 0 answers · 21 views
Spark-Scala vs PySpark: why is the DAG different?
I am converting a PySpark job to Scala; the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different, and the DAG that gets created is also different. Here I ...
0 votes · 0 answers · 15 views
How to use API and API key on Python [closed]
I am trying to implement API keys for Alpha Vantage, Bloomberg, and NewsAPI to load data into Hadoop using Spark
ALPHA_VANTAGE_API_URL = "https://www.alphavantage.co/query?function=...
0 votes · 1 answer · 36 views
Expand a JSON string to multiple columns in PySpark (Python)
I need to expand the JSON object (column B) to multiple columns.
From this table:

Column A | Column B
---------+----------------------------
id1      | [{a:1,b:'letter1'}]
id2      | [{a:1,b:'letter2',c:3,d:4}]

To this table:

Column A | a | b   | c | d
---------+---+-----+---+---
id1      | 1 | ...
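In PySpark this is typically done with `pyspark.sql.functions.from_json` plus a schema, then selecting the struct fields as columns. The flattening step itself, sketched in plain Python on the question's (hypothetical) sample shape — missing keys become None, just as Spark would yield null:

```python
def expand_json_column(rows, keys=("a", "b", "c", "d")):
    """Flatten a one-element list of objects ("Column B") into fixed columns.

    Mirrors what from_json plus a select of struct fields would produce
    in PySpark; keys absent from a row's object come out as None.
    """
    out = []
    for row in rows:
        obj = row["B"][0] if row["B"] else {}
        flat = {"A": row["A"]}
        for k in keys:
            flat[k] = obj.get(k)
        out.append(flat)
    return out
```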
0 votes · 1 answer · 35 views
A seemingly stopped HTTP service continues to work
I wrote a Python function, detailed in another post, which communicates with an HTTP-based RumbleDB server running on my local machine in order to evaluate JSONiq queries.
I tested the function in a ...
1 vote · 1 answer · 23 views
How do I join two DataFrames on a nested array attribute?
I have two DataFrames created by ingesting JSON data with the schemas below
provider:
{
npi: "..."
name: "..."
location: {
address: "...",
...
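In Spark, a join on a value inside a nested array is usually handled with `pyspark.sql.functions.explode` on the array column followed by a plain equi-join on the exploded field. The same shape in plain Python, with all field names hypothetical stand-ins for the question's schema:

```python
def join_on_nested(left, right, left_array_key, right_key):
    """Join two lists of dicts where the join key sits inside an array
    on the left side.

    This mirrors explode-then-join in Spark: each element of the left
    row's array is matched against right rows indexed by right_key,
    producing one output row per match.
    """
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    out = []
    for l in left:
        for item in l[left_array_key]:      # "explode" the nested array
            for r in index.get(item, []):
                out.append({**l, "matched": r})
    return out
```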
0 votes · 1 answer · 33 views
Is it preferable to use PySpark/SparkR in-built API functions over SQL queries for Databricks?
I am currently working on Databricks (using SparkR, but I'd imagine my question is still relevant for PySpark). I have a general question about whether there is a performance difference in using ...
0 votes · 0 answers · 18 views
Spark not asking for delegation tokens from ticket cache after upgrading to 3.5.1
We are not able to upgrade Spark from version 3.1.3 to 3.5.1 due to an authentication error.
Steps to reproduce the issue:
Create the Spark image from the binary.
Edit the image entrypoint in order ...
0 votes · 0 answers · 22 views
Writing a small Parquet DataFrame to Google Cloud Storage using Spark 3.5.0 takes too long
We are using Spark on-premise to simply read a Parquet file from GCS (Google Cloud Storage) into a DataFrame and write the DataFrame to another folder in GCS in Parquet format, using the code below:
...
0 votes · 0 answers · 17 views
Delta Lake table: merge command generates ONE huge file (3 GB or larger)
I have a Delta table that, when merging new data, generates ONLY one HUGE file (> 3 GB) per partition.
Create Session
builder = pyspark.sql.SparkSession.builder.appName("...
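Not part of the question, but a common mitigation worth noting as a hedged sketch: Spark's `spark.sql.files.maxRecordsPerFile` setting caps how many rows land in each output file, so a merge cannot emit a single multi-GB file per partition. The threshold below is an arbitrary example value, and `spark` is an assumed, already-built session.

```python
# Config sketch (assumed session variable `spark`); tune the value to taste.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5000000)
# Delta-specific alternatives: enable optimized writes, or run OPTIMIZE
# after the merge to compact/split files toward a target size.
```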
-1 votes · 0 answers · 17 views
How to access the Spark session in a Java JAR running on a Databricks cluster
I have a Java app in the form of a JAR file which is installed on a Databricks cluster. The app reads from and writes to tables in Databricks, so it needs a Spark session to perform these actions. I need to somehow ...
0 votes · 0 answers · 29 views
FileNotFound when creating SparkContext with YARN without HDFS
I'm just getting started with Spark / PySpark and am trying to switch the resource manager from Spark Standalone to Hadoop YARN. We're using MinIO for storage and so are not running HDFS nodes.
The problem ...