Questions tagged [pyspark]
The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.
40,802 questions
0 votes · 0 answers · 8 views
Pyspark Structured Streaming
I've come across foreachBatch while reading about PySpark streaming. If we use foreachBatch to output the data to an arbitrary sink, how does outputMode work with foreachBatch?
e.g.,
df_delta....
0 votes · 0 answers · 23 views
Spark SQL query returns a column that has 0 length but is not null
I have a Spark DataFrame for a parquet file. The column is string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...
0 votes · 0 answers · 20 views
Spark-Scala vs PySpark: why is the DAG different?
I am converting a PySpark job to Scala, and the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different, and so is the DAG that gets created. Here I ...
0 votes · 0 answers · 13 views
How to apply an expression from one column to another column in a PySpark DataFrame
I would like to know if it is possible to apply this.
For example, I have this table:
new_feed_dt | regex_to_apply | expr_to_apply
053021      | _(\d+)         | date_format(to_date(new_feed_dt, '...
0 votes · 0 answers · 13 views
What is the configuration for Apache Airflow, Django, PySpark together in Systemd services?
We are using Apache Airflow, Django (Python), and Spark (PySpark) in our project. Our DAGs run fine when 'airflow scheduler' is run from the command line. However, these DAGs are not working ...
0 votes · 0 answers · 36 views
Pyspark Filtering Array inside a Struct column
I have a column in my Spark DataFrame that has this schema:
root
|-- my_feature_name: struct (nullable = true)
| |-- first_profiles: map (nullable = true)
| | |-- key: string
| | |--...
0 votes · 0 answers · 12 views
Logistic regressions using statsmodels and PySpark result in different estimates
I have tested statsmodels and various pyspark.ml packages for logistic regression with the weightCol feature and found that the model estimates vary.
For my particular tests:
statsmodel and pyspark.ml....
0 votes · 1 answer · 14 views
Not able to retrieve data from Azure Event Hub with PySpark
I'm trying to retrieve data from Azure Event Hub using PySpark. The code just keeps running but doesn't display any data.
EH_CONN_STR = 'Endpoint=sb://event-hub-18-jul.servicebus.windows.net/;...
0 votes · 3 answers · 62 views
How to find the max and min timestamp when a value goes below a threshold in PySpark?
I have a table like the one below:
time_is_seconds | value
1  | 4.5
2  | 4
3  | 3
4  | 5
5  | 6
6  | 7
7  | 6
8  | 5
9  | 4.5
10 | 4.2
11 | 3
12 | 3.5
I want to find the min time and max time when the value goes below 5.
Expected ...
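In PySpark this kind of question is usually answered with a "gaps and islands" pattern: flag rows below the threshold, label each contiguous run with a running sum over a window (lag() + sum()), then group by the run label and take min/max of the time column. A minimal plain-Python sketch of that logic, using the sample data above (the threshold and the expected runs are inferred from the question, not stated by the asker):

```python
# Sketch of the "gaps and islands" logic normally written in PySpark with
# lag() + a running sum over a Window; shown in plain Python for clarity.
rows = [(1, 4.5), (2, 4), (3, 3), (4, 5), (5, 6), (6, 7),
        (7, 6), (8, 5), (9, 4.5), (10, 4.2), (11, 3), (12, 3.5)]
THRESHOLD = 5

runs = []        # (min_time, max_time) for each contiguous below-threshold run
start = None     # start time of the run in progress, if any
for t, v in rows:
    if v < THRESHOLD:
        if start is None:
            start = t          # a new run begins here
        last = t               # extend the current run
    elif start is not None:
        runs.append((start, last))   # run just ended
        start = None
if start is not None:
    runs.append((start, last))       # close a run that reaches the end

print(runs)  # [(1, 3), (9, 12)]
```

In Spark, the equivalent run label is `F.sum(run_start_flag).over(Window.orderBy("time_is_seconds"))`, where the flag marks rows whose previous value (via `F.lag`) was not below the threshold; grouping by that label and aggregating min/max of the time column yields the same pairs.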
1 vote · 1 answer · 23 views
How do I join two DataFrames on a nested array attribute?
I have two DataFrames created from ingested JSON data with the schemas below:
provider:
{
npi: "..."
name: "..."
location: {
address: "...",
...
1 vote · 1 answer · 18 views
Databricks Pyspark parse connection string
Is there an easy way to parse a connection string in this format?
HOST=HostName;Port=1234;ServiceName=Database;USer ID=User1;Password=Password123;
I need to parse out the host and port, database, ...
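For a single string on the driver, a minimal plain-Python sketch does the job (the keys below, including the oddly cased "USer ID", come verbatim from the example string in the question):

```python
def parse_conn_str(conn: str) -> dict:
    """Split 'Key=Value;Key=Value;...' into a dict, tolerating a trailing ';'."""
    result = {}
    for pair in conn.split(";"):
        if not pair:
            continue                      # skip the empty piece after a trailing ';'
        key, _, value = pair.partition("=")  # split on the first '=' only
        result[key.strip()] = value.strip()
    return result

conn = "HOST=HostName;Port=1234;ServiceName=Database;USer ID=User1;Password=Password123;"
parsed = parse_conn_str(conn)
print(parsed["HOST"], parsed["Port"], parsed["ServiceName"])  # HostName 1234 Database
```

If the connection strings live in a DataFrame column instead, Spark SQL's built-in `str_to_map` does the same split, e.g. `F.expr("str_to_map(conn_col, ';', '=')")` (where `conn_col` is an assumed column name for illustration).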
0 votes · 1 answer · 48 views
Extracting data from blob storage to Databricks [automation]
I have blob data in different folders by year, month, and date (nested folders), refreshing daily.
I need to design a pipeline which will efficiently load the historical data from blob to azure ...
0 votes · 1 answer · 29 views
PySpark saveAsTable gives an error on overwrite for a PySpark DataFrame
In my PySpark code I am performing more than 10 join operations with multiple groupBy operations in between. I want to avoid a large DAG, so I decided to save the DataFrame as a table to avoid re-computations. ...
0 votes · 1 answer · 32 views
Is it preferable to use PySpark/SparkR in-built API functions over SQL queries for Databricks?
I am currently working on Databricks (using SparkR, but I'd imagine my question is still relevant for PySpark). I have a general question about whether there is a performance difference in using ...
0 votes · 0 answers · 21 views
Writing a small parquet DataFrame to Google Cloud Storage using Spark 3.5.0 takes too long
We are using Spark on-premise to simply read a parquet file from GCS (Google Cloud Storage) into a DataFrame and write the DataFrame to another folder in parquet format in GCS, using the code below:
...