How to bucketize a group of columns in pyspark?

Question

I am trying to bucketize the columns whose names contain the word "road" in a 5k dataset, and create a new dataframe from the result.

I am not sure how to do that. Here is what I have tried so far:

from pyspark.ml.feature import Bucketizer

spike_cols = [col for col in df.columns if "road" in col]

for x in spike_cols :

    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                        inputCol=x, outputCol=x + "bucket")

bucketedData = bucketizer.transform(df)

E. Zeytinci · Accepted Answer · 2018-12-25 16:49:21Z

Either modify df in the loop:

from pyspark.ml.feature import Bucketizer

for x in spike_cols :
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                    inputCol=x, outputCol=x + "bucket")
    df = bucketizer.transform(df)

or use Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer 

model = Pipeline(stages=[
    Bucketizer(
        splits=[-float("inf"), 10, 100, float("inf")],
        inputCol=x, outputCol=x + "bucket") for x in spike_cols
]).fit(df)

bucketedData = model.transform(df)
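
For reference, the splits [-inf, 10, 100, inf] used above define three buckets: values below 10, values from 10 up to (but not including) 100, and values of 100 or more. The pure-Python sketch below mimics that assignment without Spark. It assumes the documented Bucketizer convention that each bucket includes its lower bound and only the last bucket also includes its upper bound. `bucket_index` is a hypothetical helper written for illustration, not part of pyspark.

```python
import math
from bisect import bisect_right

# Same splits as in the answer: three buckets with open-ended outer bounds.
SPLITS = [-float("inf"), 10, 100, float("inf")]


def bucket_index(value, splits=SPLITS):
    """Hypothetical helper: bucket index for `value` under the given splits.

    Assumes buckets are [lower, upper), except the last one, which also
    includes its upper bound.
    """
    if math.isnan(value):
        raise ValueError("NaN is not handled in this sketch")
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the splits range")
    # bisect_right counts the split points <= value; subtract 1 to get the bucket.
    idx = bisect_right(splits, value) - 1
    # Clamp so a value equal to the final split lands in the last bucket.
    return float(min(idx, len(splits) - 2))


print([bucket_index(v) for v in (-3, 9.99, 10, 99, 100, 2500)])
# [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
```

The indices are returned as floats because the output column produced by Bucketizer holds doubles.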

edited Dec 25, 2018 at 16:49

E. Zeytinci

answered Jul 18, 2018 at 12:58

Aaron Makubuya


j-i-l · Accepted Answer · 2021-05-22 13:07:06Z

Since PySpark 3.0.0, Bucketizer can map multiple columns at once by setting the inputCols parameter (together with outputCols and splitsArray).

This makes the task simpler:

from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), 10, 100, float("inf")]
params = [(col, col+'bucket', splits) for col in df.columns if "road" in col]
input_cols, output_cols, splits_array = zip(*params)

bucketizer = Bucketizer(inputCols=input_cols, outputCols=output_cols,
                        splitsArray=splits_array)

bucketedData = bucketizer.transform(df)
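
The `zip(*params)` step can be checked without Spark. The sketch below runs the same list comprehension on a hypothetical list of column names (standing in for `df.columns`) and shows how the list of triples is transposed into three parallel tuples.

```python
splits = [-float("inf"), 10, 100, float("inf")]

# Hypothetical column names standing in for df.columns.
columns = ["id", "road_a", "speed", "road_b"]

# One (input column, output column, splits) triple per matching column.
params = [(c, c + "bucket", splits) for c in columns if "road" in c]

# zip(*params) transposes the triples into three parallel tuples.
input_cols, output_cols, splits_array = zip(*params)

print(input_cols)         # ('road_a', 'road_b')
print(output_cols)        # ('road_abucket', 'road_bbucket')
print(len(splits_array))  # 2
```

If no column name contains "road", `params` is empty and the three-way unpacking raises a ValueError, so a guard for that case may be worth adding.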
