17

I wanted to create a module for my own projects and call its functions as methods on a DataFrame. For example, I wanted to do:

import numpy as np
import pandas as pd
from mymodule import *

df = pd.DataFrame(np.random.randn(4,4))
df.mymethod()

The problem is that I can't seem to call df.mymethod(), since I think methods can only be defined on classes I've created myself. A workaround is to make it a plain function, myfunc, that takes a pandas.DataFrame as an argument:

myfunc(df)

I'd rather not do this. Is there any way to implement the first version?

4
  • Why don't you want to make it a function? Otherwise you'll have to subclass or patch the data frame.
    – jonrsharpe
    Commented Apr 19, 2017 at 19:04
  • Depending on what the function does, you may be able to use apply, for example df.apply(myfunc). I realize this doesn't create a new method, but perhaps it gets you what you need. At the very least, you can do method chaining this way: df.apply(myfunc).apply(myotherfunc)...
    – johnchase
    Commented Apr 19, 2017 at 19:10
  • What about just using the apply method? How complex is your method? Commented Apr 19, 2017 at 19:10
  • 1
    As noted in an answer below, the pandas documentation provides a "way to extend pandas objects without subclassing them" using the decorator pandas.api.extensions.register_dataframe_accessor(). There is a long list of extensions in the pandas ecosystem page. Commented Nov 16, 2021 at 14:07

4 Answers

38

A nice solution can be found in the ffn package. Here is what its authors do:

from pandas.core.base import PandasObject
def your_fun(df):
    ...
PandasObject.your_fun = your_fun

After that, your function your_fun becomes a method of pandas.DataFrame objects, and you can write something like:

df.your_fun()

This approach works with both DataFrame and Series objects, because both inherit from PandasObject.
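
As a minimal sketch of the idea above, the following attaches a hypothetical function value_span (an illustrative name, not part of pandas or ffn) to PandasObject and calls it on both a DataFrame and a Series. It assumes pandas.core.base is importable; that is an internal module and may change between pandas versions.

```python
import pandas as pd
from pandas.core.base import PandasObject  # internal module; location may change


def value_span(obj):
    """Hypothetical helper: largest value minus smallest value."""
    return obj.max() - obj.min()


# Attach the function to the base class shared by DataFrame and Series.
PandasObject.value_span = value_span

df = pd.DataFrame({"a": [1, 5], "b": [10, 40]})

per_column = df.value_span()   # called on a DataFrame: one result per column
single = df["a"].value_span()  # called on a Series: a single scalar
```

Because the attribute is set on the shared base class, every object that inherits from it picks up the method, which may be broader than intended.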

4
  • 1
    Does this technique or way of coding have a name? I am trying to understand how and why it works, and I'm not sure I grasp it. Commented Aug 1, 2019 at 11:00
  • 3
    @monkeyintern It is called "monkey-patching" in the outdated docs pandas.pydata.org/pandas-docs/version/0.15/… ; however, I found a general, not pandas-specific, way to add methods here: medium.com/@mgarod/… Commented Aug 1, 2019 at 11:47
  • 1
    After experimenting, this seems to add the method to all pandas objects, including Series (columns), which may not be what you want: "self" (here "df") is then not a DataFrame but a Series. You would then have to stop the user from calling the method where it doesn't belong. The pandas API now lets you extend in other ways: pandas.pydata.org/docs/development/extending.html Take a look at pandichef's answer. Commented Feb 5, 2022 at 19:16
  • Note that this can also be done with an anonymous function, (e.g. pd.Series.vc = lambda x: x.value_counts(dropna=False))
    – Raisin
    Commented Jun 24, 2022 at 16:02
12

If you really need to add a method to a pandas.DataFrame you can inherit from it. Something like:

mymodule:

import pandas as pd

class MyDataFrame(pd.DataFrame):
    def mymethod(self):
        """Do my stuff"""

Use mymodule:

import numpy as np
from mymodule import *

df = MyDataFrame(np.random.randn(4,4))
df.mymethod()

To preserve your custom dataframe class:

pandas routinely returns new dataframes when performing operations on dataframes. So to preserve your dataframe class, you need to have pandas return your class when performing operations on an instance of your class. That can be done by providing a _constructor property like:

class MyDataFrame(pd.DataFrame):

    @property
    def _constructor(self):
        return MyDataFrame

    def mymethod(self):
        """Do my stuff"""

Test Code:

import pandas as pd

class MyDataFrame(pd.DataFrame):

    @property
    def _constructor(self):
        return MyDataFrame

df = MyDataFrame([1])
print(type(df))
df = df.rename(columns={})
print(type(df))

Test Results:

<class '__main__.MyDataFrame'>
<class '__main__.MyDataFrame'>
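
For contrast, here is a minimal sketch of what happens with and without the _constructor property. The class names PlainFrame and StickyFrame are hypothetical, chosen only for this illustration.

```python
import pandas as pd


class PlainFrame(pd.DataFrame):
    """Subclass that does NOT override _constructor."""


class StickyFrame(pd.DataFrame):
    """Subclass that overrides _constructor so results keep this type."""

    @property
    def _constructor(self):
        return StickyFrame


# rename() returns a new object; its type depends on _constructor.
plain_result = PlainFrame({"a": [1, 2]}).rename(columns={"a": "b"})
sticky_result = StickyFrame({"a": [1, 2]}).rename(columns={"a": "b"})

print(type(plain_result).__name__)
print(type(sticky_result).__name__)
```

The first result falls back to a plain pandas DataFrame, so any custom methods are lost after a single operation; the second keeps the subclass.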
2
  • 3
    Plus one for effort. But won't this be difficult, because pandas will just return a plain DataFrame in most cases? You would need some additional trickery to override every pd.DataFrame method that returns a pd.DataFrame. Otherwise, this is a one-use method and you are back to a pd.DataFrame, most likely.
    – piRSquared
    Commented Apr 19, 2017 at 19:21
  • 1
    @piRSquared, you are correct as usual. But there appears to be an easy workaround.
    – Stephen Rauch
    Commented Apr 19, 2017 at 19:43
12

This topic is well documented as of Nov 2019: Extending pandas

Note that the most obvious technique, Ivan Mishalkin's monkey patching, was actually removed from the official documentation at some point, probably for good reason.

Monkey patching works fine for small projects, but it has a serious drawback for a large-scale project: IDEs like PyCharm can't introspect the patched-in methods. If you right-click and choose "Go to declaration", PyCharm simply says "cannot find declaration to go to". It gets old fast if you're an IDE junkie.

I confirmed that PyCharm CAN introspect both the "custom accessors" and "subclassing" approaches discussed in the official documentation.
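
A minimal sketch of the custom accessor approach, using pandas.api.extensions.register_dataframe_accessor. The accessor name "my", the class MyAccessor, and its method are hypothetical and chosen only for illustration.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("my")
class MyAccessor:
    """Hypothetical accessor; the name "my" and its method are illustrative."""

    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is used on.
        self._obj = pandas_obj

    def mymethod(self):
        """Return the number of cells in the DataFrame."""
        rows, cols = self._obj.shape
        return rows * cols


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
n_cells = df.my.mymethod()
```

Note that the method lives under a namespace, so the call is df.my.mymethod() rather than df.mymethod().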

1
  • 2
    This is now the best answer!
    – n8yoder
    Commented Sep 8, 2023 at 15:58
2

I have used Ivan Mishalkin's handy solution extensively in our in-house Python library. At some point I thought it would be better to use his solution in the form of a decorator. The only restriction is that the first argument of the decorated function must be a DataFrame:

from copy import deepcopy
from functools import wraps
import pandas as pd
from pandas.core.base import PandasObject

def as_method(func):
    """
    This decorator makes a function also available as a method.
    The first passed argument must be a DataFrame.
    """

    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*deepcopy(args), **deepcopy(kwargs))

    setattr(PandasObject, wrapper.__name__, wrapper)

    return wrapper


@as_method
def augment_x(DF, x):
    """We will be able to see this docstring if we run ??augment_x"""
    DF[f"column_{x}"] = x

    return DF

Example:

df = pd.DataFrame({"A": [1, 2]})
df
   A
0  1
1  2

df.augment_x(10)
   A  column_10
0  1         10
1  2         10

As you can see, the original DataFrame is not changed, as if the call had inplace=False:

df
   A
0  1
1  2

You can still use augment_x as a plain function:

augment_x(df, 2)
   A  column_2
0  1         2
1  2         2
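
If attaching the method to every pandas object is broader than you want, one possible variation is to register it on pd.DataFrame only and copy just the DataFrame argument. This is a sketch under those assumptions, not the decorator above; the names as_dataframe_method and add_constant_column are hypothetical.

```python
from functools import wraps

import pandas as pd


def as_dataframe_method(func):
    """Hypothetical variant: register func as a method of pd.DataFrame only."""

    @wraps(func)
    def wrapper(df, *args, **kwargs):
        # Copy only the DataFrame so the caller's object is left untouched.
        return func(df.copy(), *args, **kwargs)

    # Attach to DataFrame itself, so Series does not receive the method.
    setattr(pd.DataFrame, wrapper.__name__, wrapper)
    return wrapper


@as_dataframe_method
def add_constant_column(df, x):
    """Add a column named column_<x> filled with x."""
    df[f"column_{x}"] = x
    return df


df = pd.DataFrame({"A": [1, 2]})
result = df.add_constant_column(10)
```

Here df keeps its single column A, while result carries the extra column_10.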
