
The DataFrame (input):

0     1.0     25.0
1     1.0     31.0
2     2.0     97.0
3     1.0     25.0
4     1.0     26.0

Output:

I want to get an array indexed from 1 up to and including 97 that tells how many intervals each index falls in. For example, 1 is in 4 intervals (the first two and the last two), 3 is in 5 intervals, and 96 is in just 1 interval. Note that I cannot use a loop; I have to do it with array operations (NumPy, pandas).

I want to get something like:

1    4 
2    5
3    5
.
.
.
25   5 
26   3
27   2
28   2
29   2
30   2
31   2
32   1
33   1
34   1
.
.
. 
97   1
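
For reference, the input above can be rebuilt with something like this (a minimal sketch; the integer column labels 1 and 2 are an assumption that matches how the first answer below refers to them):

import pandas as pd

# Sketch of the input frame shown above; the question shows no headers,
# so the column labels 1 and 2 are assumed here.
df = pd.DataFrame({
    1: [1.0, 1.0, 2.0, 1.0, 1.0],       # interval starts
    2: [25.0, 31.0, 97.0, 25.0, 26.0],  # interval ends (inclusive)
})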

5 Answers


try:

df1 = pd.DataFrame(data=[i for i in range(1, 98)])  # helper frame holding the values 1..97

df:

    0   1       2
0   0   1.0     25.0
1   1   1.0     31.0
2   2   2.0     97.0
3   3   1.0     25.0
4   4   1.0     26.0

df1.head():

    0
0   1
1   2
2   3
3   4
4   5

res = df1[0].apply(lambda x: sum((df[1] <= x) & (df[2] >= x)))  # count intervals containing x

res:

0     4
1     5
2     5
3     5
4     5
     ..
92    1
93    1
94    1
95    1
96    1
Name: 0, Length: 97, dtype: int64
  • Thanks a lot, it solved my problem. I really appreciate that, but I have a question: can we use (index + 1) instead of creating a new column?
    – john
    Commented Jun 20, 2021 at 7:45
  • @Hamed Yes, you can do that (a sketch follows these comments).
    – Pygirl
    Commented Jun 20, 2021 at 7:52
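
A sketch of that suggestion (not code from the post), reusing the helper frame's index instead of column 0:

# df1.index runs 0..96, so adding 1 yields the values 1..97 directly.
# Assumes df and df1 from the answer above.
res = df1.index.to_series().add(1).apply(
    lambda x: ((df[1] <= x) & (df[2] >= x)).sum()
)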

You can form ranges per row and then explode them. Counting the values gives the final result:

result = (pd.Series(np.arange(first, second+1) for first, second in df.to_numpy())
            .explode()
            .value_counts(sort=False))

to get

>>> result

1.0     4
2.0     5
3.0     5
4.0     5
5.0     5
       ..
93.0    1
94.0    1
95.0    1
96.0    1
97.0    1

This won't necessarily include all the values in 1..97, because a number that falls outside every interval won't be counted. To guarantee an index of 1..97, we can reindex with the min and max values (i.e., 1 and 97 here) and fill 0 for those that didn't appear:

values = df.to_numpy()
min_, max_ = values.min(), values.max()

result = result.reindex(np.arange(min_, max_+1), fill_value=0)

A final note: np.arange can be replaced with range if the values in the frame are integers, i.e., if df = df.astype(int) loses no information. If not, np.arange is needed. Since np.arange also covers the integer case, it works either way.
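
For example, the integer variant mentioned above might look like this (a sketch, assuming the cast really is lossless):

int_df = df.astype(int)  # assumed lossless: the bounds are whole numbers

result = (pd.Series(range(first, second + 1) for first, second in int_df.to_numpy())
            .explode()
            .value_counts(sort=False))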


First we need to know how many intervals open and close at each value:

>>> df
   col_1  col_2
0    1.0   25.0
1    1.0   31.0
2    2.0   97.0
3    1.0   25.0
4    1.0   26.0
>>> idx = pd.RangeIndex(1, 98)
>>> opencount = df['col_1'].value_counts().reindex(idx, fill_value=0)
>>> closecount = df['col_2'].value_counts().reindex(idx, fill_value=0)
>>> opencount
1     4
2     1
3     0
4     0
5     0
 ..
93    0
94    0
95    0
96    0
97    0
Name: col_1, Length: 97, dtype: int64
>>> closecount
1     0
2     0
3     0
4     0
5     0
 ..
93    0
94    0
95    0
96    0
97    1
Name: col_2, Length: 97, dtype: int64

Note that we used reindex to add zeros at all the values not available in col_1 and col_2.

If in fact the end of the interval is contained in the interval (as per the comments), you can simply shift closecount down by 1:

>>> closecount = closecount.shift(fill_value=0)
>>> closecount
1     0
2     0
3     0
4     0
5     0
     ..
93    0
94    0
95    0
96    0
97    0
Name: col_2, Length: 97, dtype: int64

Then we can compute the number of intervals at each point as the sum of intervals having opened before, minus the sum of intervals having closed before. This can be done with cumsum:

>>> opencount.cumsum() - closecount.cumsum()
1     4
2     5
3     5
4     5
5     5
     ..
93    1
94    1
95    1
96    1
97    1
Length: 97, dtype: int64
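
Putting these steps together, a compact version of this open/close approach might look like the following sketch (column names col_1/col_2 as above; the cast to int is an added assumption, used only to keep the value_counts index aligned with the integer range):

import pandas as pd

def interval_coverage(df, lo=1, hi=97):
    """For each value lo..hi, count how many closed intervals [col_1, col_2] contain it."""
    idx = pd.RangeIndex(lo, hi + 1)
    bounds = df.astype(int)  # assumes the float bounds are whole numbers
    opened = bounds['col_1'].value_counts().reindex(idx, fill_value=0)
    # shift by one so the end point still counts as inside its interval
    closed = bounds['col_2'].value_counts().reindex(idx, fill_value=0).shift(fill_value=0)
    # intervals opened so far minus intervals already closed = intervals covering each value
    return opened.cumsum() - closed.cumsum()
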
  • This is a really good way, except that I want 97 to be 1, not 0.
    – john
    Commented Jun 20, 2021 at 9:19
  • Ah, then you just need to add a shift() @Hamed, I’ll edit my answer.
    – Cimbali
    Commented Jun 20, 2021 at 9:34
  • One last question: to get a good idea of what you are doing (I have a rough idea of the cumulative-sum part but not a precise and complete understanding), what resource do you recommend, as you are quite the expert? I'm really grateful for your help.
    – john
    Commented Jun 20, 2021 at 16:23

try:

df1 = df.groupby(['ColumnName']).count()

This will use the column you choose as the index and give a count of the matching intervals.
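
Applied to the sample data with hypothetical column names (start and end stand in for 'ColumnName'; they are not from the question), that would be:

import pandas as pd

# Hypothetical column names for the sample input.
df = pd.DataFrame({"start": [1.0, 1.0, 2.0, 1.0, 1.0],
                   "end":   [25.0, 31.0, 97.0, 25.0, 26.0]})

df1 = df.groupby(["start"]).count()
# df1 is indexed by the distinct start values, with the row count per group:
#        end
# start
# 1.0      4
# 2.0      1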


You can do it as follows. First, create a new DataFrame with a column running from 1 to 97:

>>> df2 = pd.DataFrame(list(range(1,98,1)), dtype=float, columns=["range"])
>>> df2
    range
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
..    ...
92   93.0
93   94.0
94   95.0
95   96.0
96   97.0

To get counts of values from your original DataFrame:

>>> dfg = df.groupby('val').agg('count')
>>> dfg
     range
val
1.0      4
2.0      1

Now map from your df, and at the same time fill missing values with zeros. Then cast to integer to match OP's expected format.

>>> df2["count"] = df2["range"].map(dfg['range']).fillna(0)
>>> df2 = df2.astype(int)
>>> df2
    range  count
0       1      4
1       2      1
2       3      0
3       4      0
4       5      0
..    ...    ...
92     93      0
93     94      0
94     95      0
95     96      0
96     97      0

The final astype(int) cast above is only needed if you want integer counts instead of floats.

  • There are NaNs instead of counts; this is not the expected output, is it? Commented Jun 20, 2021 at 7:34
