Questions tagged [pyspark]
The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.
40,802 questions
0 votes · 0 answers · 8 views
Pyspark Structured Streaming
I've come across foreachBatch while reading about PySpark streaming. If we use foreachBatch to output the data to an arbitrary sink, how does outputMode work with foreachBatch?
e.g.,
df_delta....
0 votes · 0 answers · 23 views
Spark SQL query returns a column that has 0 length but is not null
I have a Spark DataFrame for a parquet file. The column is string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...
0 votes · 0 answers · 20 views
Spark-Scala vs PySpark: why is the DAG different?
I am converting a PySpark job to Scala, and the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different, and so is the DAG that gets created. Here I ...
0 votes · 0 answers · 13 views
How to apply an expression from one column to another column in a PySpark DataFrame
I would like to know if it is possible to apply this.
For example, I have this table:
new_feed_dt | regex_to_apply | expr_to_apply
053021      | _(\d+)         | date_format(to_date(new_feed_dt, '...
0 votes · 0 answers · 13 views
What is the configuration for Apache Airflow, Django, PySpark together in Systemd services?
We are using Apache Airflow, Django (Python), and Spark (PySpark) in our project. Our DAGs run fine when 'airflow scheduler' is run from the command line. However, these DAGs are not working ...
0 votes · 0 answers · 36 views
Pyspark Filtering Array inside a Struct column
I have a column in my Spark DataFrame that has this schema:
root
|-- my_feature_name: struct (nullable = true)
| |-- first_profiles: map (nullable = true)
| | |-- key: string
| | |--...
0 votes · 0 answers · 12 views
Logistic regressions using statsmodels and PySpark result in different estimates
I have tested statsmodels and various pyspark.ml packages for logistic regression with the weightCol feature and found that the model estimates vary.
For my particular tests:
statsmodel and pyspark.ml....
0 votes · 1 answer · 14 views
Not able to retrieve data from Azure Event Hub with PySpark
I'm trying to retrieve data from Azure Event Hub using PySpark. The code just keeps running but doesn't display any data.
EH_CONN_STR = 'Endpoint=sb://event-hub-18-jul.servicebus.windows.net/;...
0 votes · 3 answers · 62 views
How to find the max and min timestamp when a value goes below a threshold in PySpark?
I have a table like the one below:
time_is_seconds | value
1  | 4.5
2  | 4
3  | 3
4  | 5
5  | 6
6  | 7
7  | 6
8  | 5
9  | 4.5
10 | 4.2
11 | 3
12 | 3.5
I want to find the min time and max time when the value goes below 5.
Expected ...
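In PySpark this kind of question is usually answered with a "gaps and islands" pattern: flag rows below the threshold, label each contiguous run with a running sum over a window (lag() + sum()), then group by the run label and take min/max of the time column. A minimal plain-Python sketch of that logic, using the sample data above (the threshold and the expected runs are inferred from the question, not stated by the asker):

```python
# Sketch of the "gaps and islands" logic normally written in PySpark with
# lag() + a running sum over a Window; shown in plain Python for clarity.
rows = [(1, 4.5), (2, 4), (3, 3), (4, 5), (5, 6), (6, 7),
        (7, 6), (8, 5), (9, 4.5), (10, 4.2), (11, 3), (12, 3.5)]
THRESHOLD = 5

runs = []        # (min_time, max_time) for each contiguous below-threshold run
start = None     # start time of the run in progress, if any
for t, v in rows:
    if v < THRESHOLD:
        if start is None:
            start = t          # a new run begins here
        last = t               # extend the current run
    elif start is not None:
        runs.append((start, last))   # run just ended
        start = None
if start is not None:
    runs.append((start, last))       # close a run that reaches the end

print(runs)  # [(1, 3), (9, 12)]
```

In Spark, the equivalent run label is `F.sum(run_start_flag).over(Window.orderBy("time_is_seconds"))`, where the flag marks rows whose previous value (via `F.lag`) was not below the threshold; grouping by that label and aggregating min/max of the time column yields the same pairs.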
1 vote · 1 answer · 23 views
How do I join two DataFrames on a nested array attribute?
I have two DataFrames created from ingested JSON data with the schemas below:
provider:
{
npi: "..."
name: "..."
location: {
address: "...",
...
1 vote · 1 answer · 18 views
Databricks Pyspark parse connection string
Is there an easy way to parse a connection string in this format?
HOST=HostName;Port=1234;ServiceName=Database;USer ID=User1;Password=Password123;
I need to parse out the host and port, database, ...
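For a single string on the driver, a minimal plain-Python sketch does the job (the keys below, including the oddly cased "USer ID", come verbatim from the example string in the question):

```python
def parse_conn_str(conn: str) -> dict:
    """Split 'Key=Value;Key=Value;...' into a dict, tolerating a trailing ';'."""
    result = {}
    for pair in conn.split(";"):
        if not pair:
            continue                      # skip the empty piece after a trailing ';'
        key, _, value = pair.partition("=")  # split on the first '=' only
        result[key.strip()] = value.strip()
    return result

conn = "HOST=HostName;Port=1234;ServiceName=Database;USer ID=User1;Password=Password123;"
parsed = parse_conn_str(conn)
print(parsed["HOST"], parsed["Port"], parsed["ServiceName"])  # HostName 1234 Database
```

If the connection strings live in a DataFrame column instead, Spark SQL's built-in `str_to_map` does the same split, e.g. `F.expr("str_to_map(conn_col, ';', '=')")` (where `conn_col` is an assumed column name for illustration).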
0 votes · 1 answer · 48 views
Extracting data from blob storage to Databricks [automation]
I have blob data in different folders by year, month, and date (nested folders), refreshing daily.
I need to design a pipeline which will efficiently load the historical data from blob to azure ...
0 votes · 1 answer · 29 views
PySpark saveAsTable gives an error on overwrite for a PySpark DataFrame
In my PySpark code I am performing more than 10 join operations with multiple groupBy operations in between. I want to avoid a large DAG, so I decided to save the DataFrame as a table to avoid re-computations. ...
0 votes · 1 answer · 32 views
Is it preferable to use PySpark/SparkR in-built API functions over SQL queries for Databricks?
I am currently working on Databricks (using SparkR, but I'd imagine my question is still relevant for PySpark). I have a general question about whether there is a performance difference in using ...
0 votes · 0 answers · 21 views
Writing a small parquet DataFrame to Google Cloud Storage using Spark 3.5.0 takes too long
We are using Spark on-premise to simply read a parquet file from GCS (Google Cloud Storage) into a DataFrame and write the DataFrame to another folder in parquet format in GCS, using the code below:
...