342

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with the simple command:

df.columns = new_column_name_list

However, the same doesn't work in PySpark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This essentially defines the variable twice: it infers the schema first, then renames the columns, and then loads the dataframe again with the updated schema.

Is there a better and more efficient way to do this like we do in pandas?

My Spark version is 1.5.0

26 Answers

519

There are many ways to do that:

  • Option 1. Using selectExpr.

     data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                       ["Name", "askdaosdka"])
     data.show()
     data.printSchema()
    
     # Output
     #+-------+----------+
     #|   Name|askdaosdka|
     #+-------+----------+
     #|Alberto|         2|
     #| Dakota|         2|
     #+-------+----------+
    
     #root
     # |-- Name: string (nullable = true)
     # |-- askdaosdka: long (nullable = true)
    
     df = data.selectExpr("Name as name", "askdaosdka as age")
     df.show()
     df.printSchema()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
     #root
     # |-- name: string (nullable = true)
     # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed; notice that this method allows you to "overwrite" the same column. On Python 2, replace range with xrange.

     from functools import reduce
    
     oldColumns = data.schema.names
     newColumns = ["name", "age"]
    
     df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data)
     df.printSchema()
     df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

     from pyspark.sql.functions import col
    
     data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
     data.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

     sqlContext.registerDataFrameAsTable(data, "myTable")
     df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
     df2.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
16
  • 3
    I did it with a for loop + withColumnRenamed, but your reduce option is very nice :) Commented Nov 3, 2016 at 20:35
  • 2
    Well since nothing gets done in Spark until an action is called on the DF, it's just less elegant code... In the end the resulting DF is exactly the same! Commented Nov 3, 2016 at 21:41
  • 2
    @FelipeGerard Please check this post, bad things may happen if you have many columns. Commented Nov 3, 2016 at 21:48
  • 3
    @NuValue, you should first run from functools import reduce
    – joaofbsm
    Commented Aug 1, 2018 at 5:11
  • 2
    In PySpark 2.4 with Python 3.6.8 the only method that works out of these is df.select('id').withColumnRenamed('id', 'new_id') and spark.sql("SELECT id AS new_id FROM df")
    – rjurney
    Commented Jul 1, 2019 at 3:42
309
df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

The advantage of this approach: with a long list of columns, you may want to change only a few column names. This can be very convenient in those scenarios, and it is very useful when joining tables with duplicate column names.
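For example, a minimal sketch of the join case (df1, df2, the id join key, and the clashing name column are hypothetical, not from the answer above):

# Rename the clashing column on one side before joining,
# so the joined result has no ambiguous "name" reference.
df2 = df2.withColumnRenamed("name", "name_right")
joined = df1.join(df2, on="id", how="inner")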

5
  • 3
    is there a variant of this solution that leaves all other columns unchanged? with this method, and others, only the explicitly named columns remained (all others removed) Commented Dec 22, 2017 at 5:22
  • 6
    +1 it worked fine for me, just edited the specified column leaving others unchanged and no columns were removed.
    – mnis.p
    Commented Jul 18, 2018 at 5:51
  • 5
    @Quetzalcoatl This command appears to change only the specified column while maintaining all other columns. Hence, a great command to rename just one of potentially many column names
    – user989762
    Commented Aug 24, 2018 at 9:07
  • 1
    @user989762: agreed; my initial understanding was incorrect on this one...! Commented Aug 24, 2018 at 17:27
  • 1
This is great for renaming a few columns. See my answer for a solution that can programmatically rename columns. Say you have 200 columns and you'd like to rename 50 of them that have a certain type of column name and leave the other 150 unchanged. In that case, you won't want to manually run withColumnRenamed (running withColumnRenamed that many times would also be inefficient, as explained here).
    – Powers
    Commented Jul 19, 2020 at 22:31
124

If you want to change all column names, try df.toDF(*cols)
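toDF assigns names positionally, so cols must contain one new name per column, in the dataframe's current column order. A minimal sketch (assuming df currently has exactly these two columns, in this order):

# toDF() replaces every column name at once, matched by position
cols = ["name", "age"]
df = df.toDF(*cols)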

4
  • 14
    this solution is the closest to df.columns = new_column_name_list per the OP, both in how concise it is and its execution. Commented Mar 29, 2018 at 1:17
  • 1
    For me I was getting the header names from a pandas dataframe, so I just used df = df.toDF(*my_pandas_df.columns) Commented Jun 8, 2020 at 5:52
  • 4
This answer confuses me. Shouldn't there be a mapping from old column names to new names? Does this work by having cols be the new column names, and just assuming that the order of names in cols corresponds to the column order of the dataframe?
    – rbatt
    Commented Jun 23, 2020 at 22:15
  • 1
    @rbatt Using df.select in combination with pyspark.sql.functions col-method is a reliable way to do this since it maintains the mapping/alias applied & thus the order/schema is maintained after the rename operations. Checkout the comment for code snippet: stackoverflow.com/a/62728542/8551891 Commented May 17, 2021 at 16:40
82

In case you would like to apply a simple transformation on all column names, this code does the trick (here I am replacing all spaces with underscores):

new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.

1
  • 2
    This code generates a simple physical plan that's easy for Catalyst to optimize. It's also elegant. +1
    – Powers
    Commented Jul 19, 2020 at 22:25
22

df.withColumnRenamed('age', 'age2')

3
  • 2
    Pankaj Kumar's answer and Alberto Bonsanto's answer (which are from 2016 and 2015, respectively) already suggest using withColumnRenamed. Commented Jul 12, 2018 at 1:08
  • Thanks, yes, but there are a couple of different syntaxes; maybe we should collect them into a more formal answer? data.withColumnRenamed(oldColumns[idx], newColumns[idx]) vs data.withColumnRenamed(columnname, new columnname). I think it depends on which version of pyspark you're using Commented Oct 12, 2018 at 23:43
  • 3
    This is not a different syntax. The only difference is you did not store your column names in an array.
    – Ed Bordin
    Commented Jan 8, 2019 at 6:00
20

If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
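Here column_to_change and new_name are free variables; a hypothetical usage (reusing the col import above):

column_to_change = 'count'   # hypothetical: existing column to rename
new_name = 'new_count'       # hypothetical: its new name
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])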
18

This is the approach I used:

create pyspark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

create dataframe:

df = spark.createDataFrame(data = [('Bob', 5.62,'juice'),  ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])

view df with column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

create a list with new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

change the column names of the df:

for c,n in zip(df.columns,newcolnames):
    df=df.withColumnRenamed(c,n)

view df with new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+
14

I made an easy-to-use function to rename multiple columns of a pyspark dataframe, in case anyone wants to use it:

def renameCols(df, old_columns, new_columns):
    for old_col,new_col in zip(old_columns,new_columns):
        df = df.withColumnRenamed(old_col,new_col)
    return df

old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)

Be careful, both lists must be the same length.
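For instance, a defensive variant (a sketch; renameColsSafe is a hypothetical name, not part of the answer above) that fails fast instead of letting zip silently truncate the longer list:

def renameColsSafe(df, old_columns, new_columns):
    # zip() stops at the end of the shorter list, so check lengths explicitly
    if len(old_columns) != len(new_columns):
        raise ValueError("old_columns and new_columns must have the same length")
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df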

1
  • 1
    Nice job on this one. A bit of overkill for what I needed though. And you can just pass the df because old_columns would be the same as df.columns. Commented Sep 26, 2019 at 14:14
12

Method 1:

df = df.withColumnRenamed("old_column_name", "new_column_name")

Method 2: If you want to do some computation and store the result under a new name, create the new column and drop the old one:

import pyspark.sql.functions as F

df = df.withColumn("new_column_name",
                   F.when(F.col("old_column_name") > 1, F.lit(1))
                    .otherwise(F.col("old_column_name")))
df = df.drop("old_column_name")
2
  • 1
There are a lot of similar answers, so there was no need to post another duplicate.
    – astentx
    Commented Dec 15, 2020 at 15:47
  • 5
    The first argument in withColumnRenamed is the old column name. Your Method 1 is wrong
    – Sheldore
    Commented Jan 25, 2021 at 15:56
11

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
6

You can use the following function to rename all the columns of your dataframe.

def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X

In case you need to update only a few columns' names, you can use the same column name in the replace_with list.

To rename all columns

df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])

To rename only some columns

df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
2
  • I like that this uses the select statement with aliases and uses more of an "immutable" type of framework. I did, however, find that the toDF function and a list comprehension that implements whatever logic is desired was much more succinct. For example, def append_suffix_to_columns(spark_df, suffix): return spark_df.toDF(*[c + suffix for c in spark_df.columns]) Commented Oct 1, 2020 at 1:52
  • Since mapping is a dictionary, why can't you simply use mapping[c] instead of mapping.get(c, c)?
    – Sheldore
    Commented Jan 31, 2021 at 0:41
6

We can use col.alias to rename the column:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
1
  • 3
    While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
    – Isma
    Commented Jan 31, 2018 at 15:19
4

We can use various approaches to rename the column name.

First, let's create a simple DataFrame.

df = spark.createDataFrame([("x", 1), ("y", 2)],
                           ["col_1", "col_2"])

Now let's try to rename col_1 to col_3. Below are a few approaches to do the same.

# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()

# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()

# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()

# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()

Here is the output.

+-----+-----+
|col_3|col_2|
+-----+-----+
|    x|    1|
|    y|    2|
+-----+-----+

I hope this helps.

4

One way to use alias to change a column name:

col('my_column').alias('new_name')

Another way to use alias (possibly not mentioned above):

df.my_column.alias('new_name')
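On its own this expression only returns a Column object; to get a renamed dataframe you still have to use it inside a select. A minimal sketch (my_column is the hypothetical column from the snippet above):

df = df.select(df.my_column.alias('new_name'))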
3

You can use a for loop with zip to pair each old column name with its new name:

new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]

new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)
3

I like to use a dict to rename the df.

rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    df = df.withColumnRenamed(col, rename.get(col, col))  # keep the old name if it has no entry in the dict
1

For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
1

There are multiple approaches you can use:

  1. df1 = df.withColumn("new_column", col("old_column")).drop("old_column")

  2. df1 = df.withColumn("new_column", col("old_column"))

  3. df1 = df.select(col("old_column").alias("new_column"))

These require from pyspark.sql.functions import col. Note that the first two copy the values into a new column rather than renaming in place, and the second keeps the old column as well.

0
1

List comprehension + f-string:

df = df.toDF(*[f'n_{c}' for c in df.columns])

Simple list comprehension:

df = df.toDF(*[c.lower() for c in df.columns])
1

The simplest solution is:

# columns here is a list of (old_name, new_name) pairs
for old_name, new_name in columns:
    df = df.withColumnRenamed(old_name, new_name)
0

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = StructType([
    StructField("employee_name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("state", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("age", IntegerType(), True),  # the age values are integers, so IntegerType
    StructField("bonus", IntegerType(), True)
  ])

OurData = spark.createDataFrame(data=CreatingDataFrame, schema=schema)
OurData.show()

GrouppedBonusData = OurData.groupBy("department").sum("bonus")
GrouppedBonusData.show()
GrouppedBonusData.printSchema()

# Rename the aggregate column "sum(bonus)" using select + alias
from pyspark.sql.functions import col

BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("Department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()

GrouppedBonusData.groupBy("department").count().show()

GrouppedSalaryData = OurData.groupBy("department").sum("salary")
GrouppedSalaryData.show()

SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()

0

Try the following method, which allows you to rename the columns of multiple files.

Reference: https://www.linkedin.com/pulse/pyspark-methods-rename-columns-kyle-gibson/

from pyspark.sql.functions import col

# Rename using a dict of old -> new names, via select + alias
df_initial = spark.read.load('/mnt/datalake/bronze/testData')

rename_dict = {
    'FName': 'FirstName',
    'LName': 'LastName',
    'DOB': 'BirthDate'
}

df_renamed = df_initial \
    .select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])

# Or wrap the rename in a function and reuse it with transform
def renameColumns(df):
    rename_dict = {
        'FName': 'FirstName',
        'LName': 'LastName',
        'DOB': 'BirthDate'
    }
    return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])

df_renamed = spark.read.load('/mnt/datalake/bronze/testData') \
    .transform(renameColumns)
0

The simplest solution is using withColumnRenamed:

renamed_df = df.withColumnRenamed('name_1', 'New_name_1').withColumnRenamed('name_2', 'New_name_2')
renamed_df.show()

And if you would like to do this like we do with Pandas, you can use toDF:

Create an ordered list of the new column names and pass it to toDF:

df_list = ["newName_1", "newName_2", "newName_3", "newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()

0

This is an easy way to rename multiple columns with a loop:

cols_to_rename = ["col1","col2","col3"]

for col in cols_to_rename:
  df = df.withColumnRenamed(col,"new_{}".format(col))
0

The closest statement to df.columns = new_column_name_list is:

import pyspark.sql.functions as F
df = df.select(*[F.col(name_old).alias(name_new) 
                 for (name_old, name_new) 
                 in zip(df.columns, new_column_name_list)])

This doesn't require any rarely-used functions, and emphasizes some patterns that are very helpful in Spark. You could also break up the steps if you find this one-liner to be doing too many things:

import pyspark.sql.functions as F
column_mapping = [F.col(name_old).alias(name_new) 
                  for (name_old, name_new) 
                  in zip(df.columns, new_column_name_list)]
df = df.select(*column_mapping)
0

To apply any generic function to the spark dataframe columns and then rename them, you can use the quinn library. Please refer to the example code:

import quinn
def lower_case(col):
  return col.lower()

df_ = quinn.with_columns_renamed(lower_case)(df)

lower_case is the function name and df is the initial spark dataframe.

If you get an error importing the quinn library, install it first:

%pip install quinn
