5

Consider the following example:

import pandas as pd
from pandas import DataFrame
import statsmodels.formula.api as smf
df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})

Then I run smf.ols('a ~ b', df) and smf.ols('177sdays ~ b', df2).

The first works and the second does not. The only difference seems to be the presence of numerical characters in the variable name. Why is this?

4
  • In particular, it raises an "invalid syntax" error! Commented Nov 23, 2016 at 1:25
  • ... valid Python names cannot begin with numbers. Perhaps there is an eval under the hood in statsmodels. Try prefixing the name with an underscore. Commented Nov 23, 2016 at 1:28
  • Q can "quote" arbitrary variable names patsy.readthedocs.io/en/latest/…
    – Josef
    Commented Nov 23, 2016 at 4:06
  • @Josef what if there is a variable named Q which conflicts with the Q function? Commented Jul 27, 2018 at 10:45

2 Answers

7

Apparently, statsmodels uses a library called patsy to interpret the formulas passed to ols. From the docs, an expression of the form:

y ~ a + a:b + np.log(x)

will construct a patsy object of the form:

ModelDesc([Term([EvalFactor("y")])],
      [Term([]),
       Term([EvalFactor("a")]),
       Term([EvalFactor("a"), EvalFactor("b")]),
       Term([EvalFactor("np.log(x)")])])

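EvalFactor then "executes arbitrary Python code." Thus any variable name written bare in a formula must be a valid Python identifier, i.e. made up of the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9. The name 177sdays begins with a digit, so it cannot be parsed as a name, which is why the second call fails while the first one works.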
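The identifier rule can be checked without statsmodels at all. The following is a minimal sketch using only the standard library; is_plain_name is a hypothetical helper name, and the sketch assumes (as this answer suggests) that the formula factor ends up being parsed as ordinary Python code.

```python
import ast
import keyword


def is_plain_name(name: str) -> bool:
    """Return True if `name` is a valid, non-keyword Python identifier.

    Hypothetical helper for illustration; not part of patsy or statsmodels.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


# Column names from the question, plus the underscore-prefixed variant
for column in ["a", "b", "177sdays", "_177sdays"]:
    print(column, is_plain_name(column))

# Python's own parser rejects the bare name as an expression
try:
    ast.parse("177sdays", mode="eval")
except SyntaxError as exc:
    print("SyntaxError:", exc.msg)
```

Running a check like this over df2.columns before building a formula shows which names need to be renamed or quoted.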

1
  • This was super helpful. Otherwise it's a "gotcha" with an utterly vague error message. Thanks!
    – Jeff
    Commented Dec 30, 2017 at 9:54
3

As @Josef stated, one can use patsy's Q function to quote the variable name:

smf.ols('Q("177sdays") ~ b', df2).fit()
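If the column names are not known in advance, the quoting can be applied programmatically when the formula string is assembled. This is only a sketch: quote_term and build_formula are hypothetical helper names, not patsy or statsmodels functions, and the code only builds the string that would then be passed to smf.ols.

```python
import keyword


def quote_term(name: str) -> str:
    """Wrap `name` in Q("...") unless it is already a plain identifier.

    Hypothetical helper for illustration.
    """
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # Escape backslashes and double quotes so the quoted string stays valid
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'Q("{escaped}")'


def build_formula(response: str, predictors: list) -> str:
    """Assemble a 'response ~ p1 + p2' formula string, quoting where needed."""
    rhs = " + ".join(quote_term(p) for p in predictors)
    return f"{quote_term(response)} ~ {rhs}"


formula = build_formula("177sdays", ["b"])
print(formula)  # Q("177sdays") ~ b
```

The resulting string could then be used as smf.ols(formula, df2).fit(), matching the call above.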
