1
\$\begingroup\$

In Python using Pandas, I am splitting a dataset column into 4 lists based on the suffix of the values. For the 3 suffixes I am using a list comprehension then for the 4th one, a set operation that substracts the 3 lists from the original list with all values:

import pandas as pd

df = pd.DataFrame({
    "alcohol_by_volume": [],
    "barcode": [],
    "calcium_per_hundred": [],
    "calcium_unit": [],
    "carbohydrates_per_hundred": [],
    "carbohydrates_per_portion": [],
    "carbohydrates_unit": [],
    "cholesterol_per_hundred": [],
    "cholesterol_unit": [],
    "copper_cu_per_hundred": [],
    "copper_cu_unit": [],
    "country": [],
    "created_at": [],
    "energy_kcal_per_hundred": [],
    "energy_kcal_per_portion": [],
    "energy_kcal_unit": [],
    "energy_per_hundred": [],
    "energy_per_portion": [],
    "energy_unit": [],
    "fat_per_hundred": [],
    "fat_per_portion": [],
    "fat_unit": [],
    "fatty_acids_total_saturated_per_hundred": [],
    "fatty_acids_total_saturated_unit": [],
    "fatty_acids_total_trans_per_hundred": [],
    "fatty_acids_total_trans_unit": [],
    "fiber_insoluble_per_hundred": [],
    "fiber_insoluble_unit": [],
    "fiber_per_hundred": [],
    "fiber_per_portion": [],
    "fiber_soluble_per_hundred": [],
    "fiber_soluble_unit": [],
    "fiber_unit": [],
    "folate_total_per_hundred": [],
    "folate_total_unit": [],
    "folic_acid_per_hundred": [],
    "folic_acid_unit": [],
    "hundred_unit": [],
    "id": [],
    "ingredients_en": [],
    "iron_per_hundred": [],
    "iron_unit": [],
    "magnesium_per_hundred": [],
    "magnesium_unit": [],
    "manganese_mn_per_hundred": []
})

colnames_all = df.columns.to_list()
colnames_unit = [n for n in colnames_all if n.endswith("_unit")]
colnames_per_hundred = [n for n in colnames_all if n.endswith("_per_hundred")]
colnames_per_portion = [n for n in colnames_all if n.endswith("_per_portion")]
colnames_other = list(
    set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

Expected result (2 examples, other 2 lists are similar to 1st one):

colnames_unit:

['calcium_unit',
 'carbohydrates_unit',
 'cholesterol_unit',
 'copper_cu_unit',
 'energy_kcal_unit',
 'energy_unit',
 'fat_unit',
 'fatty_acids_total_saturated_unit',
 'fatty_acids_total_trans_unit',
 'fiber_insoluble_unit',
 'fiber_soluble_unit',
 'fiber_unit',
 'folate_total_unit',
 'folic_acid_unit',
 'hundred_unit',
 'iron_unit',
 'magnesium_unit']

 colnames_other:

 ['ingredients_en',
 'country',
 'id',
 'created_at',
 'barcode',
 'alcohol_by_volume']

However this does not look like the best way to do this. Is there a "better" way, i.e. shorter and/or more elegant/idiomatic?

\$\endgroup\$
2
  • 1
    \$\begingroup\$ It's hard to review this small fragment in isolation. It would be better to present a complete function, with its unit tests (or at least some sample input to illustrate it). \$\endgroup\$ Commented Jul 3, 2023 at 9:40
  • 1
    \$\begingroup\$ @TobySpeight Added full code for repro. I did not define a function for this, maybe this is already part of the better way to do it...? The code repetition and the set subtraction don't look good to me. \$\endgroup\$ Commented Jul 3, 2023 at 17:42

2 Answers 2

1
\$\begingroup\$
colnames_all = df.columns.to_list()

I don't see a clear need for this. We could simply refer to df.columns instead.


list(
    set(colnames_all) - set(colnames_unit + colnames_per_hundred + colnames_per_portion)
)

That doesn't seem so bad, to me. Certainly the intent is clear.


colnames_unit = [n for n in colnames_all  if n.endswith("_unit")]

Consider rephrasing this as

colnames_unit = [n for n in colnames_all  if re.search(r'_unit$', n)]

That lets us generalize in this way:

colnames_measured = [n for n in df.columns  if re.search(r'_(unit|per_hundred|per_portion)$', n)]

To find the inverse:

colnames_other = [n for n in df.columns  if not re.search(r'_(unit|per_hundred|per_portion)$', n)]
\$\endgroup\$
1
  • \$\begingroup\$ Maybe it is not clear but I need a different list by suffix, so colnames_measured does not work for me. Also as it is a constant known suffix, endswith() seems ok, this is not the issue. The question is should this be a 10'000 list with say a collection of 100 suffixes, what would be the best way to address this. Thank you for your input though. \$\endgroup\$ Commented Jul 3, 2023 at 20:03
1
\$\begingroup\$

Don't use comprehensions. Don't use lists. Don't use sets. Use Pandas string vectorisation:

colnames_all = df.columns
is_unit = colnames_all.str.endswith("_unit")
is_hundred = colnames_all.str.endswith("_per_hundred")
is_portion = colnames_all.str.endswith("_per_portion")
colnames_unit = colnames_all[is_unit]
colnames_per_hundred = colnames_all[is_hundred]
colnames_per_portion = colnames_all[is_portion]
colnames_other = colnames_all[~(is_unit | is_hundred | is_portion)]
print(colnames_other)
Index(['alcohol_by_volume', 'barcode', 'country', 'created_at', 'id',
       'ingredients_en'],
      dtype='object')
\$\endgroup\$
2
  • \$\begingroup\$ Why are you advising against lists, is it because of readability, performance, or other reason...? This SO Q&A goes through some detailed and interesting points about this. I like this method, though. I'm thinking a filtering function may be the best way to avoid code repetition. \$\endgroup\$ Commented Jul 4, 2023 at 8:42
  • \$\begingroup\$ It's not in the Pandas style, and (though it matters more for large input) vectorized operations will be faster \$\endgroup\$
    – Reinderien
    Commented Jul 4, 2023 at 12:53

Not the answer you're looking for? Browse other questions tagged or ask your own question.