1
\$\begingroup\$

I am currently switching for R to Python so please be patient with me. Is the following a good way to count the number of rows given column names and values?

import pandas as pd

df = pd.DataFrame([["1", "2"], ["2", "4"], ["1", "4"]], columns=['A', 'B'])

cn1 = "A"
cn2 = "B"
cv1 = "1"
cv2 = "2"
no_rows = len(df[(df[cn1]==cv1) & (df[cn2]==cv2)].index)
print(no_rows)
\$\endgroup\$

3 Answers 3

1
\$\begingroup\$

First, it's a bad idea to input your numerics as strings in your dataframe. Use plain ints instead.

Your code currently forms a predicate, performs a slice on the frame and then finds the size of the frame. This is more work than necessary - the predicate itself is a series of booleans, and running a .sum() on it produces the number of matching values.

That, plus your current code is not general-purpose. A general-purpose implementation could look like

from typing import Dict, Any

import pandas as pd


def match_count(df: pd.DataFrame, **criteria: Any) -> int:
    pairs = iter(criteria.items())
    column, value = next(pairs)
    predicate = df[column] == value
    for column, value in pairs:
        predicate &= df[column] == value
    return predicate.sum()


def test() -> None:
    df = pd.DataFrame(
        [[1, 2],
         [2, 4],
         [1, 4]],
        columns=['A', 'B'],
    )

    print(match_count(df, A=1, B=2))


if __name__ == '__main__':
    test()
\$\endgroup\$
1
\$\begingroup\$

I usually use shape[0] because it's more readable, so in your case it would be:

no_rows = df[(df[cn1]==cv1) & (df[cn2]==cv2)].shape[0]
\$\endgroup\$
1
\$\begingroup\$

While this specific example can be completely refactored into Reinderien's top-notch functions, we don't always need something so elaborate (e.g., quick exploratory analysis).

Masking and counting come up very often in one form or another, so I think it's still worth reviewing how to do them idiomatically in pandas.


Revised code

Maintaining the spirit of the original code, I would use something like:

matches = df[cn1].eq(cv1) & df[cn2].eq(cv2)
len(df[matches])  # but remember that matches.sum() is faster

Comments on the original code

len(df[(df[cn1] == cv1) & (df[cn2] == cv2)].index)
^               ^       ^                  ^
3               2       4                  1
  1. No need to use .index explicitly since DataFrame.__len__ does it automatically:

    class DataFrame(NDFrame, OpsMixin):
        ...
        def __len__(self) -> int:
            return len(self.index)
    
  2. DataFrame.eq can sometimes be useful over ==:

    • supports axis / level broadcasting

    • arguably more readable when joining multiple tests

      df[cn1].eq(cv1) & df[cn2].eq(cv2) # (df[cn1] == cv1) & (df[cn2] == cv2)
      
    • arguably more readable when chaining methods (e.g., when comparing shifted columns)

      df[cn1].shift().eq(cv1).cumsum() # (df[cn1].shift() == cv1).cumsum()
      
  3. If speed is important, len(df) is faster than df.shape[0] (h/t @root):

  4. If you have a lot of conditions to join (e.g., generated via comprehension), consider np.logical_and.reduce:

    df[np.logical_and.reduce([
        df[cn1] == cv1,
        # ...
        df[cn2] == cv2,
    ])]
    
\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.