numpy/scipy equivalent of R ecdf(x)(x) function?

Question

What is the equivalent of R's ecdf(x)(x) function in Python, in either numpy or scipy? Is ecdf(x)(x) basically the same as:

import numpy as np
def ecdf(x):
  # normalize X to sum to 1
  x = x / np.sum(x)
  return np.cumsum(x)

or is something else required?

EDIT how can one control the number of bins used by ecdf?

This should help.
– agstudy
Commented Apr 3, 2013 at 16:17 — agstudy, Commented Apr 3, 2013 at 16:17

RubenLaguna · Accepted Answer · 2016-06-06 14:55:19Z

The OP implementation for ecdf is wrong, you are not supposed to cumsum() the values. So not ys = np.cumsum(x)/np.sum(x) but ys = np.cumsum(1 for _ in x)/float(len(x)) or better ys = np.arange(1, len(x)+1)/float(len(x))

You either go with statmodels's ECDF if you are OK with that extra dependency or provide your own implementation. See below:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
%matplotlib inline

grades = (93.5,93,60.8,94.5,82,87.5,91.5,99.5,86,93.5,92.5,78,76,69,94.5,
          89.5,92.8,78,65.5,98,98.5,92.3,95.5,76,91,95,61)


def ecdf_wrong(x):
    xs = np.sort(x) # need to be sorted
    ys = np.cumsum(xs)/np.sum(xs) # normalize so sum == 1
    return (xs,ys)
def ecdf(x):
    xs = np.sort(x)
    ys = np.arange(1, len(xs)+1)/float(len(xs))
    return xs, ys

xs, ys = ecdf_wrong(grades)
plt.plot(xs, ys, label="wrong cumsum")
xs, ys = ecdf(grades)
plt.plot(xs, ys, label="handwritten", marker=">", markerfacecolor='none')
cdf = ECDF(grades)
plt.plot(cdf.x, cdf.y, label="statmodels", marker="<", markerfacecolor='none')
plt.legend()
plt.show()

steffen · Accepted Answer · 2018-06-08 18:00:27Z

22

Try these links:

statsmodels.ECDF

ECDF in python without step function?

Example code

import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt

data = np.random.normal(0,5, size=2000)

ecdf = ECDF(data)
plt.plot(ecdf.x,ecdf.y)

edited Jun 8, 2018 at 18:00

steffen

2,1724 gold badges19 silver badges31 bronze badges

answered Apr 3, 2013 at 16:18

yasouser

5,1632 gold badges28 silver badges41 bronze badges

6

Is this not possible with scipy?
– Zhubarb
Commented Feb 19, 2014 at 16:19

Add a comment |

Tim · Accepted Answer · 2019-09-12 18:35:53Z

The ecdf function in R returns the empirical cumulative distribution function, so the have exact equivalent would be rather:

def ecdf(x):
    x = np.sort(x)
    n = len(x)
    def _ecdf(v):
        # side='right' because we want Pr(x <= v)
        return (np.searchsorted(x, v, side='right') + 1) / n
    return _ecdf

np.random.seed(42)
X = np.random.normal(size=10_000)
Fn = ecdf(X)
Fn([3, 2, 1]) - Fn([-3, -2, -1])
## array([0.9972, 0.9533, 0.682 ])

As shown, it gives the correct 68–95–99.7% probabilities for normal distribution.

CompEcon · Accepted Answer · 2014-07-09 21:01:08Z

This author has a very nice example of a user-written ECDF function: John Stachurski's Python lectures. His lecture series is geared towards graduate students in computational economics; however they are my go-to resource for anyone learning general scientific computing in Python.

Edit: This is a year old now, but I thought I'd still answer the "Edit" part of your question, in case you (or others) still fin it useful.

There really aren't any "bins" with ECDFs as there are with histograms. If G is your empirical distribution function formed using data vector Z, G(x) is literally the number of occurrences of Z <= x, divided by len(Z). This requires no "binning" to determine. Thus there is a sense in which the ECDF retains all possible information about a dataset (since it must retain the entire dataset for calculations), whereas a histogram actually loses some information about the dataset by binning. I much prefer to work with ecdfs vs histograms when possible, for this reason.

Fun bonus: if you need to create a small-footprint ECDF-like object from very large streaming data, you should look into this "Data Skeletons" paper by McDermott et al.

The link is now broken. The author moved his python lectures to: quant-econ.net/py/index.html — bersanri, Commented Aug 24, 2015 at 10:22
The new link is also broken. Maybe this is the new (hopefully permanent) link? python.quantecon.org/intro.html — Sterling, Commented Aug 19, 2021 at 20:28

Florian Jacob · Accepted Answer · 2023-11-13 21:59:14Z

SciPy 1.11 finally gained a built-in scipy.stats.ecdf(sample) function.

For a given sample one-dimensional array-like object, e.g., a list, the function returns an object cdf that represents the estimated, i.e., empirical cumulative distribution function for that sample as well as an object sf that represents the empirical survival function for that sample.

The question asks for an equivalent of R's ecdf(x)(x). Assuming that x is meant as the actual sample in both cases, the equivalent using the new SciPy function would be scipy.stats.ecdf(sample).cdf.probabilities. Assuming that x is only the actual sample in the first case and the second should actually be the function parameter of the estimated cdf, then the equivalent would be scipy.stats.ecdf(sample).cdf.eval(x).

Good points, thank you! For the time being I did skip a more practical usage example, as I do not yet have my hands on a SciPy version recent enough to contain the function. — Florian Jacob, Commented Nov 13, 2023 at 22:02

Pawan Kumar Gunjan · Accepted Answer · 2023-02-25 18:56:46Z

0

data <- c(10, 20, 50, 40, 40, 30, 60, 70, 80, 90)
# Define a function to compute the ECDF
ecdf_func <- function(data) {
     Length <- length(data)
     sorted <- sort(data)
     ecdf <- rep(0, Length)
     for (i in 1:Length) {
          ecdf[i] <- sum(sorted <= data[i]) / Length
      }
      return(ecdf)

 }
ecdf <- ecdf_func(data)
print(ecdf)

Output: [1] 0.1 0.2 0.6 0.5 0.5 0.3 0.7 0.8 0.9 1.0

# With stats library
library(stats)

ecdf_fun <- ecdf(data)
ecdf_ <- ecdf_fun(data)
print(ecdf_)

Output: [1] 0.1 0.2 0.6 0.5 0.5 0.3 0.7 0.8 0.9 1.0

edited Feb 25, 2023 at 18:56

answered Feb 20, 2023 at 6:46

Pawan Kumar Gunjan

13 bronze badges

our answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center
– Simas Joneliunas
Commented Feb 24, 2023 at 14:54

Add a comment |

Collectives™ on Stack Overflow

numpy/scipy equivalent of R ecdf(x)(x) function?

6 Answers 6

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

Linked

Related