Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

0 votes
0 answers
8 views

Pyspark Structured Streaming

I've come across foreachBatch while reading about PySpark Structured Streaming. If we use foreachBatch to write the data to an arbitrary sink, how does outputMode work with foreachBatch? e.g., df_delta....
Manish Visave
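A minimal sketch of how the two fit together, assuming the built-in rate test source and an illustrative Delta path; for stateful queries, outputMode still decides which rows Spark hands to the batch function:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream_df = spark.readStream.format("rate").load()  # built-in test source

def write_batch(batch_df, batch_id):
    # batch_df is an ordinary (static) DataFrame holding one micro-batch,
    # so it can be written to any sink; the Delta path is illustrative.
    batch_df.write.format("delta").mode("append").save("/tmp/delta/rate_sink")

query = (stream_df.writeStream
         .outputMode("update")       # controls which rows reach foreachBatch in stateful queries
         .foreachBatch(write_batch)
         .start())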
0 votes
0 answers
23 views

Spark SQL query returns a column with 0 length but non-null values

I have a Spark DataFrame for a parquet file. The column is string type. spark.sql("select col_a, length(col_a) from df where col_a is not null") +-------------------+------------------------...
Dozel • 159
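This often turns out to be empty or whitespace-only strings, which are non-null but have length 0. A quick diagnostic sketch, using df and col_a as in the question:

from pyspark.sql import functions as F

df.select(
    F.count(F.when(F.col("col_a").isNull(), 1)).alias("nulls"),
    F.count(F.when(F.col("col_a") == "", 1)).alias("empty_strings"),
    F.count(F.when(F.length(F.trim("col_a")) == 0, 1)).alias("blank_or_empty"),
).show()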
0 votes
0 answers
20 views

Spark-Scala vs PySpark: why is the DAG different?

I am converting a PySpark job to Scala, and the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different, and the DAG that gets created is also different. Here I ...
user3858193 • 1,438
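One way to pin down such differences is to diff the plans and session settings of the two runs, since differing defaults (AQE, shuffle partitions) change the DAG even for identical logic. A sketch to run in both jobs, with df and spark as in the question:

df.explain(mode="formatted")    # formatted physical plan (Spark 3.0+); diff against the other job
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.shuffle.partitions"))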
0 votes
0 answers
13 views

How to apply an expression from a column to another column in a PySpark DataFrame

I would like to know if it is possible to apply this. For example, I have this table:

new_feed_dt | regex_to_apply | expr_to_apply
053021 | _(\d+) | date_format(to_date(new_feed_dt, '...
Tomás Jullier
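expr() cannot read its SQL from another column row by row, but if the set of distinct expressions is small, one workaround is to collect them and chain them into a single CASE WHEN. A sketch using the question's column names:

from pyspark.sql import functions as F

# Collect the distinct expression strings (assumes there are only a few).
exprs = [r["expr_to_apply"] for r in df.select("expr_to_apply").distinct().collect()]

# Build one conditional column: for each row, evaluate the expression
# whose text matches that row's expr_to_apply value.
applied = F.lit(None)
for e in exprs:
    applied = F.when(F.col("expr_to_apply") == e, F.expr(e)).otherwise(applied)

df = df.withColumn("applied", applied)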
0 votes
0 answers
13 views

What is the configuration for running Apache Airflow, Django, and PySpark together as systemd services?

We are using Apache Airflow, Django (Python), and Spark (PySpark) in our project. Our DAGs run fine when 'airflow scheduler' is run from the command line. However, these DAGs are not working ...
Govinda Padaki
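A common cause of "works from the shell, fails under systemd" is that systemd services do not inherit the login environment (PATH, JAVA_HOME, AIRFLOW_HOME), so spark-submit and the Airflow executable resolve differently. A minimal illustrative unit file; every path below is an assumption:

# /etc/systemd/system/airflow-scheduler.service (illustrative)
[Unit]
Description=Airflow scheduler
After=network.target

[Service]
User=airflow
Environment="AIRFLOW_HOME=/opt/airflow"
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
Environment="PATH=/opt/airflow/venv/bin:/usr/bin:/bin"
ExecStart=/opt/airflow/venv/bin/airflow scheduler
Restart=on-failure

[Install]
WantedBy=multi-user.target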
0 votes
0 answers
36 views

PySpark: filtering an array inside a struct column

I have a column in my Spark DataFrame that has this schema:

root
 |-- my_feature_name: struct (nullable = true)
 |    |-- first_profiles: map (nullable = true)
 |    |    |-- key: string
 |    |    |--...
MathLal • 392
0 votes
0 answers
12 views

Logistic regressions using statsmodels and PySpark result in different estimates

I have tested statsmodels and various PySpark ML packages for logistic regression with the weightCol feature and found that the model estimates vary. For my particular tests: statsmodels and pyspark.ml....
user16739 • 101
0 votes
1 answer
14 views

Not able to retrieve data from Azure Event Hub with PySpark

I'm trying to retrieve data from Azure Event Hub using PySpark. The code just keeps running but doesn't display any data. EH_CONN_STR = 'Endpoint=sb://event-hub-18-jul.servicebus.windows.net/;...
Shiva Kumar
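One option that avoids extra connector libraries is Event Hubs' Kafka-compatible endpoint, which works with Spark's built-in Kafka source. A sketch; the event hub name is assumed, and the connection string stays truncated as in the question:

EH_CONN_STR = "Endpoint=sb://event-hub-18-jul.servicebus.windows.net/;..."  # truncated in the question

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "event-hub-18-jul.servicebus.windows.net:9093")
      .option("subscribe", "my-event-hub")   # event hub name = Kafka topic (assumed)
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config",
              "org.apache.kafka.common.security.plain.PlainLoginModule required "
              f"username=\"$ConnectionString\" password=\"{EH_CONN_STR}\";")
      .load())

# Write to the console sink while debugging, so arriving batches are visible.
(df.selectExpr("CAST(value AS STRING)")
   .writeStream.format("console")
   .start()
   .awaitTermination())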
0 votes
3 answers
62 views

How to find the max and min timestamp when a value goes below a threshold in PySpark?

I have a table like the one below:

time_is_seconds | value
1 | 4.5
2 | 4
3 | 3
4 | 5
5 | 6
6 | 7
7 | 6
8 | 5
9 | 4.5
10 | 4.2
11 | 3
12 | 3.5

I want to find the min time and max time when the value goes below 5. Expected ...
user8178045
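If each consecutive run below the threshold should yield its own min/max (a gaps-and-islands problem), the row_number difference trick works. A sketch using the question's column names:

from pyspark.sql import functions as F, Window

w_all = Window.orderBy("time_is_seconds")
w_below = Window.partitionBy("below").orderBy("time_is_seconds")

result = (df.withColumn("below", (F.col("value") < 5).cast("int"))
            # rows in the same consecutive run share the same difference
            .withColumn("grp", F.row_number().over(w_all) - F.row_number().over(w_below))
            .filter(F.col("below") == 1)
            .groupBy("grp")
            .agg(F.min("time_is_seconds").alias("min_time"),
                 F.max("time_is_seconds").alias("max_time")))
result.show()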
1 vote
1 answer
23 views

How do I join two DataFrames on a nested array attribute?

I have two DataFrames created from ingesting JSON data, with the schemas below. provider: { npi: "..." name: "..." location: { address: "...", ...
xxyyxx • 2,386
1 vote
1 answer
18 views

Databricks PySpark: parse a connection string

Is there an easy way to parse a connection string in this format? HOST=HostName;Port=1234;ServiceName=Database;User ID=User1;Password=Password123; I need to parse out the host and port, database, ...
AMN • 51
0 votes
1 answer
48 views

Extracting data from blob storage to Databricks [automation]

I have blob data in different folders by year, month, and date (nested folders), refreshing daily. I need to design a pipeline that will efficiently load the historical data from blob to Azure ...
Saswat Ray
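On Databricks, Auto Loader is the usual fit for nested year/month/date folders refreshed daily, since it tracks which files have already been ingested. A sketch; all paths, the storage URI, and the table name are illustrative:

# Auto Loader discovers new files incrementally under the nested folders
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .load("wasbs://container@account.blob.core.windows.net/data/"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/blob_ingest")
   .trigger(availableNow=True)   # process everything available, then stop
   .toTable("bronze.blob_data"))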
0 votes
1 answer
29 views

PySpark saveAsTable gives an error on overwrite for a PySpark DataFrame

In my PySpark code I am performing more than 10 join operations, with multiple groupBy operations in between. I want to avoid a large DAG, so I decided to save the DataFrame as a table to avoid re-computation. ...
Ash • 55
0 votes
1 answer
32 views

Is it preferable to use PySpark/SparkR built-in API functions over SQL queries on Databricks?

I am currently working on Databricks (using SparkR, but I'd imagine my question is still relevant for PySpark). I have a general question about whether there is a performance difference in using ...
mgmf46 • 173
0 votes
0 answers
21 views

Writing a small parquet DataFrame to Google Cloud Storage using Spark 3.5.0 takes too long

We are using Spark on-premises simply to read a parquet file from GCS (Google Cloud Storage) into a DataFrame and write the DataFrame to another folder in GCS in parquet format, using the code below: ...
user3830120
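Slow writes of small files to object stores often point at the rename-based v1 output committer rather than the data volume. The v2 commit algorithm and fewer output files are common first steps; a sketch, not a guaranteed fix, with illustrative paths:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # v2 commits task output directly instead of renaming on job commit
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())

(spark.read.parquet("gs://my-bucket/input/")
      .coalesce(1)                  # small data: avoid many tiny output files
      .write.mode("overwrite")
      .parquet("gs://my-bucket/output/"))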
