Split Pandas dataset column based on values (suffixes: string operation)

Question

In Python using Pandas, I am splitting a dataset column into 4 lists based on the suffix of the values. For the 3 suffixes I am using a list comprehension then for the 4th one, a set operation that substracts the 3 lists from the original list with all values:

import pandas as pd

df = pd.DataFrame({
    "alcohol_by_volume": [],
    "barcode": [],
    "calcium_per_hundred": [],
    "calcium_unit": [],
    "carbohydrates_per_hundred": [],
    "carbohydrates_per_portion": [],
    "carbohydrates_unit": [],
    "cholesterol_per_hundred": [],
    "cholesterol_unit": [],
    "copper_cu_per_hundred": [],
    "copper_cu_unit": [],
    "country": [],
    "created_at": [],
    "energy_kcal_per_hundred": [],
    "energy_kcal_per_portion": [],
    "energy_kcal_unit": [],
    "energy_per_hundred": [],
    "energy_per_portion": [],
    "energy_unit": [],
    "fat_per_hundred": [],
    "fat_per_portion": [],
    "fat_unit": [],
    "fatty_acids_total_saturated_per_hundred": [],
    "fatty_acids_total_saturated_unit": [],
    "fatty_acids_total_trans_per_hundred": [],
    "fatty_acids_total_trans_unit": [],
    "fiber_insoluble_per_hundred": [],
    "fiber_insoluble_unit": [],
    "fiber_per_hundred": [],
    "fiber_per_portion": [],
    "fiber_soluble_per_hundred": [],
    "fiber_soluble_unit": [],
    "fiber_unit": [],
    "folate_total_per_hundred": [],
    "folate_total_unit": [],
    "folic_acid_per_hundred": [],
    "folic_acid_unit": [],
    "hundred_unit": [],
    "id": [],
    "ingredients_en": [],
    "iron_per_hundred": [],
    "iron_unit": [],
    "magnesium_per_hundred": [],
    "magnesium_unit": [],
    "manganese_mn_per_hundred": []
})

colnames_all = df.columns.to_list()
colnames_unit = [n for n in colnames_all if n.endswith("_unit")]
colnames_per_hundred = [n for n in colnames_all if n.endswith("_per_hundred")]
colnames_per_portion = [n for n in colnames_all if n.endswith("_per_portion")]
colnames_other = list(
    set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

Expected result (2 examples, other 2 lists are similar to 1st one):

colnames_unit:

['calcium_unit',
 'carbohydrates_unit',
 'cholesterol_unit',
 'copper_cu_unit',
 'energy_kcal_unit',
 'energy_unit',
 'fat_unit',
 'fatty_acids_total_saturated_unit',
 'fatty_acids_total_trans_unit',
 'fiber_insoluble_unit',
 'fiber_soluble_unit',
 'fiber_unit',
 'folate_total_unit',
 'folic_acid_unit',
 'hundred_unit',
 'iron_unit',
 'magnesium_unit']

 colnames_other:

 ['ingredients_en',
 'country',
 'id',
 'created_at',
 'barcode',
 'alcohol_by_volume']

However this does not look like the best way to do this. Is there a "better" way, i.e. shorter and/or more elegant/idiomatic?

It's hard to review this small fragment in isolation. It would be better to present a complete function, with its unit tests (or at least some sample input to illustrate it). — Toby Speight, Commented Jul 3, 2023 at 9:40
@TobySpeight Added full code for repro. I did not define a function for this, maybe this is already part of the better way to do it...? The code repetition and the set subtraction don't look good to me. — evilmandarine, Commented Jul 3, 2023 at 17:42

J_H · Accepted Answer · 2023-07-03 19:24:40Z

1

colnames_all = df.columns.to_list()

I don't see a clear need for this. We could simply refer to df.columns instead.

list(
    set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

That doesn't seem so bad, to me. Certainly the intent is clear.

colnames_unit = [n for n in colnames_all  if n.endswith("_unit")]

Consider rephrasing this as

colnames_unit = [n for n in colnames_all  if re.search(r'_unit$', n)]

That lets us generalize in this way:

colnames_measured = [n for n in df.columns  if re.search(r'_(unit|per_hundred|per_portion)$', n)]

To find the inverse:

colnames_other = [n for n in df.columns  if not re.search(r'_(unit|per_hundred|per_portion)$', n)]

answered Jul 3, 2023 at 19:24

J_H

30.7k3 gold badges30 silver badges119 bronze badges

\$\begingroup\$ Maybe it is not clear but I need a different list by suffix, so colnames_measured does not work for me. Also as it is a constant known suffix, endswith() seems ok, this is not the issue. The question is should this be a 10'000 list with say a collection of 100 suffixes, what would be the best way to address this. Thank you for your input though. \$\endgroup\$
– evilmandarine
Commented Jul 3, 2023 at 20:03

Add a comment |

Reinderien · Accepted Answer · 2023-07-03 22:29:57Z

1

Don't use comprehensions. Don't use lists. Don't use sets. Use Pandas string vectorisation:

colnames_all = df.columns
is_unit = colnames_all.str.endswith("_unit")
is_hundred = colnames_all.str.endswith("_per_hundred")
is_portion = colnames_all.str.endswith("_per_portion")
colnames_unit = colnames_all[is_unit]
colnames_per_hundred = colnames_all[is_hundred]
colnames_per_portion = colnames_all[is_portion]
colnames_other = colnames_all[~(is_unit | is_hundred | is_portion)]
print(colnames_other)

Index(['alcohol_by_volume', 'barcode', 'country', 'created_at', 'id',
       'ingredients_en'],
      dtype='object')

answered Jul 3, 2023 at 22:29

Reinderien

62.8k5 gold badges63 silver badges129 bronze badges

\$\begingroup\$ Why are you advising against lists, is it because of readability, performance, or other reason...? This SO Q&A goes through some detailed and interesting points about this. I like this method, though. I'm thinking a filtering function may be the best way to avoid code repetition. \$\endgroup\$
– evilmandarine
Commented Jul 4, 2023 at 8:42
\$\begingroup\$ It's not in the Pandas style, and (though it matters more for large input) vectorized operations will be faster \$\endgroup\$
– Reinderien
Commented Jul 4, 2023 at 12:53

Add a comment |

Stack Exchange Network

Split Pandas dataset column based on values (suffixes: string operation)

2 Answers 2

Not the answer you're looking for? Browse other questions tagged
python
pandas
dataframe
or ask your own question.

Hot Network Questions

Split Pandas dataset column based on values (suffixes: string operation)

2 Answers 2

Not the answer you're looking for? Browse other questions tagged pythonpandasdataframe or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
python
pandas
dataframe
or ask your own question.