How to drop all columns with null values in a PySpark DataFrame?

Question

I have a large dataset of which I would like to drop columns that contain null values and return a new dataframe. How can I do that?

The following only drops a single column or rows containing null.

df.where(col("dt_mvmt").isNull()) # doesn't work: I don't know all the column names, and there may be thousands of columns
df.filter(df.dt_mvmt.isNotNull()) #same reason as above
df.na.drop() #drops rows that contain null, instead of columns that contain null

For example

a |  b  | c
1 |     | 0
2 |  2  | 3

In the above case, the whole column b should be dropped because one of its values is empty.

Try checking - spark.apache.org/docs/2.1.0/api/python/… — Tom Ron, Commented Jul 13, 2018 at 10:17

MattSt · Accepted Answer · 2018-10-31 20:51:22Z

17

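Here is one possible approach for dropping all columns that have NULL values. It counts the NULL values in each column in a single pass and then drops every column whose count is greater than zero. The code for counting NULL values per column is taken from another source.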

import pandas as pd
import pyspark.sql.functions as F

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3'] })
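# sqlContext is assumed to exist already (e.g. in the PySpark shell);
# with a SparkSession named spark, use spark.createDataFrame(df) instead.
df = sqlContext.createDataFrame(df)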
df.show()

def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df

# Drops column x2, because it contains null values
drop_null_columns(df).show()

Before:

+---+----+---+
| x1|  x2| x3|
+---+----+---+
|  a|   b|  c|
|  1|null|  0|
|  2|   2|  3|
+---+----+---+

After:

+---+---+
| x1| x3|
+---+---+
|  a|  c|
|  1|  0|
|  2|  3|
+---+---+

Hope this helps!

edited Oct 31, 2018 at 20:51

MattSt


answered Jul 13, 2018 at 12:28

Florian


Yes sir! It did help. How beautiful! The other 3 earlier lines also worked perfectly
– PolarBear10
Commented Jul 13, 2018 at 13:23

Glad I could help! I removed the threshold part, since it may be a bit confusing to future readers who stumble upon this question.
– Florian
Commented Jul 13, 2018 at 17:53
@Florian You should keep the threshold part, it makes it a complete answer! It would be really helpful, thanks :)
– pissall
Commented Oct 30, 2019 at 3:58
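
The threshold variant mentioned in these comments is no longer in the answer, so what follows is only a sketch of what such a variant might look like, not Florian's original code. It assumes that `threshold` means the largest fraction of null values a column may contain before it is dropped. The names `columns_over_null_threshold`, `drop_columns_by_null_fraction`, and the `__total__` alias are hypothetical and chosen for illustration.

```python
def columns_over_null_threshold(null_counts, total_rows, threshold=0.0):
    """Return the names of columns whose null fraction exceeds `threshold`.

    null_counts: dict mapping column name -> number of nulls in that column
    total_rows:  number of rows in the DataFrame
    threshold:   assumed meaning: maximum tolerated fraction of nulls (0.0 to 1.0)
    """
    if total_rows == 0:
        # An empty DataFrame has no nulls, so nothing is dropped.
        return []
    return [name for name, n in null_counts.items() if n / total_rows > threshold]


def drop_columns_by_null_fraction(df, threshold=0.0):
    """Drop every column of a PySpark DataFrame whose null fraction exceeds `threshold`."""
    # Imported here so that the pure-Python helper above works without Spark installed.
    import pyspark.sql.functions as F

    # One pass over the data: total row count plus a null count per column.
    # "__total__" is a hypothetical alias; it must not clash with a real column name.
    row = df.select(
        F.count(F.lit(1)).alias("__total__"),
        *[F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]
    ).collect()[0].asDict()

    total_rows = row.pop("__total__")
    return df.drop(*columns_over_null_threshold(row, total_rows, threshold))
```

With the sample data from the answer, `threshold=0.0` should reproduce the original behaviour (x2 is dropped), while `threshold=0.5` would keep x2, because only one of its three values is null.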


user19195895 · Accepted Answer · 2022-10-19 06:55:27Z

1

If you instead need to keep only the rows in which at least one of the inspected columns is not null, use the following. Note that this filters rows rather than dropping columns. It runs quickly.

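from operator import or_
from functools import reduce
import pyspark.sql.functions as F

# Columns to inspect; here, all of them
inspected = df.columns
# Keep a row if any inspected column is not null
df = df.where(reduce(or_, (F.col(c).isNotNull() for c in inspected), F.lit(False)))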

answered Oct 19, 2022 at 6:55

user19195895

