
Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

0 votes
0 answers
17 views

What is the recommended way to process large Spark DataFrames in chunks: `toPandas()` or `RDD.foreachPartition()`?

I am working with large datasets using PySpark and need to process my data in chunks of 500 records each. I am deciding between converting my Spark DataFrames to pandas DataFrames using toPandas()...
Bo Yuan • 109
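A minimal sketch of the foreachPartition approach, which streams each partition in fixed-size chunks without collecting the whole dataset to the driver (unlike toPandas()). `df` and `process_chunk` are hypothetical stand-ins:

```python
# Sketch: process a Spark DataFrame in chunks of 500 rows per partition.
# `df` is an existing DataFrame; `process_chunk` is a placeholder for the
# per-chunk work (e.g., a batched API call).
from itertools import islice

CHUNK_SIZE = 500

def process_chunk(rows):
    pass  # placeholder for the actual per-chunk logic

def handle_partition(rows_iter):
    # Pull CHUNK_SIZE rows at a time from the partition's iterator.
    while True:
        chunk = list(islice(rows_iter, CHUNK_SIZE))
        if not chunk:
            break
        process_chunk(chunk)

df.rdd.foreachPartition(handle_partition)
```

This keeps memory bounded per executor, whereas toPandas() materializes the entire DataFrame on the driver.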
0 votes
0 answers
14 views

"ps: unknown option -- o" error when starting an Apache Spark HTTP server

When I run an Apache Spark server from the Git Bash command line of my Windows 10 (v. 22H2) laptop (x64), using the following command: spark-submit Documents/Python/Temp/rumbledb-1.21.0-for-spark-3.5....
Evan Aad • 5,965
0 votes
0 answers
23 views

Spark SQL query returns a column that has length 0 but is non-null

I have a Spark DataFrame for a parquet file. The column is of string type. spark.sql("select col_a, length(col_a) from df where col_a is not null") +-------------------+------------------------...
Dozel • 159
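A short diagnostic sketch for this kind of symptom: a non-null string column can still contain empty strings, so length() legitimately returns 0. Assuming a DataFrame `df` with the column in question:

```python
# Sketch: distinguish NULLs from empty strings and from whitespace-only
# values. `df` and `col_a` come from the question's setup.
from pyspark.sql import functions as F

# Rows that are non-null yet zero-length (i.e., empty strings).
df.where(F.col("col_a").isNotNull() & (F.length("col_a") == 0)).count()

# Compare raw vs. trimmed length to spot invisible/whitespace characters.
df.select(
    F.length("col_a").alias("raw_len"),
    F.length(F.trim(F.col("col_a"))).alias("trimmed_len"),
).show()
```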
-1 votes
0 answers
15 views

How do you handle big datasets for training XGBoost?

I have a dataset with 530 classes that is heavily imbalanced. As I am new to handling such large datasets, I undersampled the top 10 majority classes and then concatenated the result with the other data ...
Cookies
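One common alternative to undersampling is per-sample class weighting, which keeps all the data. A hedged sketch, where `X` and `y` are hypothetical feature and label arrays:

```python
# Sketch: weight each sample inversely to its class frequency instead of
# discarding majority-class rows. X and y are placeholders.
import xgboost as xgb
from sklearn.utils.class_weight import compute_sample_weight

weights = compute_sample_weight(class_weight="balanced", y=y)
dtrain = xgb.DMatrix(X, label=y, weight=weights)

params = {
    "objective": "multi:softprob",
    "num_class": 530,        # matches the question's class count
    "tree_method": "hist",   # memory-friendly for large datasets
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```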
0 votes
0 answers
21 views

Spark Scala vs PySpark: why is the DAG different?

I am converting a PySpark job to Scala; the jobs execute on EMR. The parameters, data, and code are the same, yet the run time is different and so is the DAG that gets created. Here I ...
user3858193 • 1,438
0 votes
0 answers
15 views

How to use an API and API key in Python [closed]

I am trying to use API keys for Alpha Vantage, Bloomberg, and NewsAPI to load data into Hadoop using Spark. ALPHA_VANTAGE_API_URL = "https://www.alphavantage.co/query?function=...
Onuh John Edoh Adanu
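A minimal sketch of the usual pattern: pass the key as a query parameter, parse the JSON response, and hand it to Spark. The environment-variable name and the response key are assumptions based on Alpha Vantage's documented TIME_SERIES_DAILY format:

```python
# Sketch: fetch JSON from Alpha Vantage and load it into a Spark DataFrame.
# ALPHA_VANTAGE_API_KEY is a hypothetical env var; keep keys out of source.
import os
import requests
from pyspark.sql import SparkSession

api_key = os.environ["ALPHA_VANTAGE_API_KEY"]
resp = requests.get(
    "https://www.alphavantage.co/query",
    params={"function": "TIME_SERIES_DAILY", "symbol": "IBM", "apikey": api_key},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()

spark = SparkSession.builder.getOrCreate()
# "Time Series (Daily)" is the documented key for this endpoint; each value
# is a dict of fields, which Spark infers as a map column.
rows = list(payload.get("Time Series (Daily)", {}).items())
df = spark.createDataFrame(rows, ["date", "values"])
```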
0 votes
1 answer
36 views

Expand a JSON string to multiple columns in PySpark (Python)

I need to expand the JSON object (column B) to multiple columns. From this table: Column A | Column B; id1 | [{a:1,b:'letter1'}]; id2 | [{a:1,b:'letter2',c:3,d:4}]. To this table: Column A | a | b | c | d; id1 | 1 ...
Jack9406
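A minimal sketch using from_json: parse the string, explode the array, and promote the struct fields to top-level columns. The schema is an assumption covering the superset of keys in the sample rows, and the JSON options handle the unquoted keys and single quotes shown in the question:

```python
# Sketch: expand a JSON-array string column into one row per object,
# with each object's fields as columns. `df` is the question's DataFrame.
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

schema = ArrayType(StructType([
    StructField("a", IntegerType()),
    StructField("b", StringType()),
    StructField("c", IntegerType()),
    StructField("d", IntegerType()),
]))

parsed = df.withColumn(
    "parsed",
    F.from_json("Column B", schema,
                {"allowUnquotedFieldNames": "true", "allowSingleQuotes": "true"}),
)
exploded = parsed.select("Column A", F.explode("parsed").alias("obj"))
result = exploded.select("Column A", "obj.*")  # keys absent in a row become null
```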
0 votes
1 answer
35 views

A seemingly stopped HTTP service continues to work

I wrote a Python function, detailed in another post, which communicates with an HTTP-based RumbleDB server, running on my local machine, in order to evaluate JSONiq queries. I tested the function in a ...
Evan Aad • 5,965
1 vote
1 answer
23 views

How do I join two DataFrames on a nested array attribute?

I have two DataFrames created from ingested JSON data with the below schemas. provider: { npi: "..." name: "..." location: { address: "...", ...
xxyyxx • 2,386
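The usual pattern is to explode the array side so each element becomes its own row, then join on the nested field. A hedged sketch; `plans`, `provider_npis`, and `plan_id` are hypothetical names, since the question's schemas are truncated:

```python
# Sketch: join a DataFrame whose join key sits inside an array column
# against one where it is a top-level field (`npi` in `providers`).
from pyspark.sql import functions as F

exploded = plans.select(
    F.col("plan_id"),
    F.explode("provider_npis").alias("npi"),  # one row per array element
)
joined = exploded.join(providers, on="npi", how="inner")
```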
0 votes
1 answer
33 views

Is it preferable to use PySpark/SparkR built-in API functions over SQL queries for Databricks?

I am currently working on Databricks (using SparkR, but I'd imagine my question is still relevant for PySpark). I have a general question about whether there is a performance difference in using ...
mgmf46 • 173
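Both forms compile to the same Catalyst plan, which is why performance is usually identical; the plans can be compared directly with explain(). A minimal sketch, assuming a DataFrame `df` and an active `spark` session:

```python
# Sketch: the SQL string and the DataFrame API version of the same query
# should produce the same optimized plan.
from pyspark.sql import functions as F

df.createOrReplaceTempView("t")

sql_version = spark.sql("SELECT col_a, COUNT(*) AS n FROM t GROUP BY col_a")
api_version = df.groupBy("col_a").agg(F.count("*").alias("n"))

sql_version.explain(True)  # compare the "Optimized Logical Plan" sections
api_version.explain(True)
```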
0 votes
0 answers
18 views

Spark not asking for delegation tokens from ticket cache after upgrading to 3.5.1

We are not able to upgrade Spark from version 3.1.3 to 3.5.1 due to an authentication error. Steps to reproduce the issue: Create the Spark image from the binary. Edit the image entrypoint in order ...
maron • 1
0 votes
0 answers
22 views

Writing a small parquet DataFrame to Google Cloud Storage using Spark 3.5.0 takes too long

We are using Spark on-premises to simply read a parquet file from GCS (Google Cloud Storage) into a DataFrame and write the DataFrame to another folder in GCS in parquet format, using the code below: ...
user3830120
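A hedged sketch of the round trip described above; the bucket and paths are placeholders, and the committer setting is a common thing to check for slow object-store writes, not a confirmed diagnosis:

```python
# Sketch: read/write parquet against GCS. For small outputs the commit
# phase often dominates; v2 of the file output committer avoids a second
# rename pass (assumption, worth verifying for your setup).
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

df = spark.read.parquet("gs://my-bucket/input/")      # placeholder path
df.write.mode("overwrite").parquet("gs://my-bucket/output/")
```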
0 votes
0 answers
17 views

Delta Lake table: MERGE command generates ONE huge file (3 GB or larger)

I have a Delta table that, when merging new data, generates only ONE huge file (> 3 GB) per partition. Create session: builder = pyspark.sql.SparkSession.builder.appName("...
David Sánchez
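One lever that applies here is Spark's standard cap on rows per output file. A hedged sketch; the threshold is an assumption to tune, and `delta_table`/`updates` are placeholders for the question's objects:

```python
# Sketch: cap rows per written file so a MERGE cannot emit a single
# multi-GB file per partition. spark.sql.files.maxRecordsPerFile is a
# standard Spark SQL setting (0 = unlimited, the default).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

# The merge itself is unchanged; `delta_table` would come from
# delta.tables.DeltaTable.forPath(...) and `updates` is the new data.
delta_table.alias("t").merge(
    updates.alias("u"), "t.id = u.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```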
-1 votes
0 answers
17 views

How to access the Spark session in a Java jar running on a Databricks cluster

I have a Java app in the form of a jar file which is installed on a Databricks cluster. The app reads from and writes to tables in Databricks, so it needs a Spark session to perform these actions. I need to somehow ...
Ayush • 29
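On a cluster that already has a running session, the builder's getOrCreate() returns the existing one rather than starting a new one; the Java API exposes the same call as SparkSession.builder().getOrCreate(). A PySpark sketch of the pattern:

```python
# Sketch: attach to the session the cluster already created instead of
# constructing a new one. The app name is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-jar-logic").getOrCreate()
spark.sql("SELECT 1").show()  # sanity check that the session is live
```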
0 votes
0 answers
29 views

FileNotFound when creating SparkContext with YARN without HDFS

I'm just getting started with Spark / PySpark and trying to switch the resource manager from Spark Standalone to Hadoop YARN. We're using Minio for storage and so are not running HDFS nodes. The problem ...
elhefe • 3,454
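Without HDFS, YARN's staging directory has to point at the object store instead. A hedged sketch using the S3A connector against Minio; the endpoint, bucket, and env-var names are placeholders:

```python
# Sketch: run Spark on YARN with Minio (via S3A) instead of HDFS.
# spark.yarn.stagingDir controls where YARN uploads job files, which
# otherwise defaults to the user's HDFS home directory.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.yarn.stagingDir", "s3a://spark-staging/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.local:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["MINIO_ACCESS_KEY"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["MINIO_SECRET_KEY"])
    .getOrCreate()
)
```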
