
The DataFrame (input):

0     1.0     25.0
1     1.0     31.0
2     2.0     97.0
3     1.0     25.0
4     1.0     26.0

Output:

I want to get an array indexed from 1 up to and including 97 that tells how many intervals each index falls in. For example, 1 is in 4 intervals (the first two and the last two), 3 is in 5 intervals, and 96 is in just 1 interval. Note that I cannot use a loop; I have to do it with array operations (NumPy, pandas).

I want to get something like:

1    4 
2    5
3    5
.
.
.
25   5 
26   3
27   2
28   2
29   2
30   2
31   2
32   1
33   1
34   1
.
.
. 
97   1
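
For reference, the input above can be rebuilt with something like this (a minimal sketch; the integer column labels 1 and 2 are an assumption that matches how the first answer below refers to them):

import pandas as pd

# Sketch of the input frame shown above; the question shows no headers,
# so the column labels 1 and 2 are assumed here.
df = pd.DataFrame({
    1: [1.0, 1.0, 2.0, 1.0, 1.0],       # interval starts
    2: [25.0, 31.0, 97.0, 25.0, 26.0],  # interval ends (inclusive)
})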

5 Answers


try:

df1 = pd.DataFrame(data=[i for i in range(1, 98)])  # helper frame holding the values 1..97

df:

    0   1       2
0   0   1.0     25.0
1   1   1.0     31.0
2   2   2.0     97.0
3   3   1.0     25.0
4   4   1.0     26.0

df1.head():

    0
0   1
1   2
2   3
3   4
4   5

res = df1[0].apply(lambda x: sum((df[1] <= x) & (df[2] >= x)))  # count intervals containing x

res:

0     4
1     5
2     5
3     5
4     5
     ..
92    1
93    1
94    1
95    1
96    1
Name: 0, Length: 97, dtype: int64
  • Thanks a lot, it solved my problem. I really appreciate that, but I have a question: can we use (index + 1) instead of creating a new column?
    – john
    Commented Jun 20, 2021 at 7:45
  • @Hamed Yes, you can do that (a sketch follows these comments).
    – Pygirl
    Commented Jun 20, 2021 at 7:52
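
A sketch of that suggestion (not code from the post), reusing the helper frame's index instead of column 0:

# df1.index runs 0..96, so adding 1 yields the values 1..97 directly.
# Assumes df and df1 from the answer above.
res = df1.index.to_series().add(1).apply(
    lambda x: ((df[1] <= x) & (df[2] >= x)).sum()
)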

You can form ranges per row and then explode them. Counting the values gives the final result:

result = (pd.Series(np.arange(first, second+1) for first, second in df.to_numpy())
            .explode()
            .value_counts(sort=False))

to get

>>> result

1.0     4
2.0     5
3.0     5
4.0     5
5.0     5
       ..
93.0    1
94.0    1
95.0    1
96.0    1
97.0    1

This won't necessarily include all the values in 1..97, because a number that falls outside every interval won't be counted. To guarantee an index of 1..97, we can reindex with the min and max values (i.e., 1 and 97 here) and fill 0 for those that didn't appear:

values = df.to_numpy()
min_, max_ = values.min(), values.max()

result = result.reindex(np.arange(min_, max_+1), fill_value=0)

A final note: np.arange can be replaced with range if the values in the frame are integers, i.e., if df = df.astype(int) loses no information. If not, np.arange is needed. Since np.arange also covers the integer case, it works either way.
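
For example, the integer variant mentioned above might look like this (a sketch, assuming the cast really is lossless):

int_df = df.astype(int)  # assumed lossless: the bounds are whole numbers

result = (pd.Series(range(first, second + 1) for first, second in int_df.to_numpy())
            .explode()
            .value_counts(sort=False))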


First we need to know how many intervals open and close at each value:

>>> df
   col_1  col_2
0    1.0   25.0
1    1.0   31.0
2    2.0   97.0
3    1.0   25.0
4    1.0   26.0
>>> idx = pd.RangeIndex(1, 98)
>>> opencount = df['col_1'].value_counts().reindex(idx, fill_value=0)
>>> closecount = df['col_2'].value_counts().reindex(idx, fill_value=0)
>>> opencount
1     4
2     1
3     0
4     0
5     0
 ..
93    0
94    0
95    0
96    0
97    0
Name: col_1, Length: 97, dtype: int64
>>> closecount
1     0
2     0
3     0
4     0
5     0
 ..
93    0
94    0
95    0
96    0
97    1
Name: col_2, Length: 97, dtype: int64

Note that we used reindex to add zeros at all the values not available in col_1 and col_2.

If in fact the end of the interval is contained in the interval (as per the comments), you can simply shift closecount down by 1:

>>> closecount = closecount.shift(fill_value=0)
>>> closecount
1     0
2     0
3     0
4     0
5     0
     ..
93    0
94    0
95    0
96    0
97    0
Name: col_2, Length: 97, dtype: int64

Then we can compute the number of intervals at each point as the sum of intervals having opened before, minus the sum of intervals having closed before. This can be done with cumsum:

>>> opencount.cumsum() - closecount.cumsum()
1     4
2     5
3     5
4     5
5     5
     ..
93    1
94    1
95    1
96    1
97    1
Length: 97, dtype: int64
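
Putting these steps together, a compact version of this open/close approach might look like the following sketch (column names col_1/col_2 as above; the cast to int is an added assumption, used only to keep the value_counts index aligned with the integer range):

import pandas as pd

def interval_coverage(df, lo=1, hi=97):
    """For each value lo..hi, count how many closed intervals [col_1, col_2] contain it."""
    idx = pd.RangeIndex(lo, hi + 1)
    bounds = df.astype(int)  # assumes the float bounds are whole numbers
    opened = bounds['col_1'].value_counts().reindex(idx, fill_value=0)
    # shift by one so the end point still counts as inside its interval
    closed = bounds['col_2'].value_counts().reindex(idx, fill_value=0).shift(fill_value=0)
    # intervals opened so far minus intervals already closed = intervals covering each value
    return opened.cumsum() - closed.cumsum()
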
  • This is a really good way, except that I want 97 to be 1, not 0.
    – john
    Commented Jun 20, 2021 at 9:19
  • Ah, then you just need to add a shift() @Hamed, I’ll edit my answer.
    – Cimbali
    Commented Jun 20, 2021 at 9:34
  • One last question: to get a good idea of what you are doing (I have a rough idea of the cumulative-sum part but not a precise and complete understanding), what resource do you recommend, as you are quite the expert? I'm really grateful for your help.
    – john
    Commented Jun 20, 2021 at 16:23

try:

df1 = df.groupby(['ColumnName']).count()

This will use the column you choose as the index and give a count of the matching intervals.
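
Applied to the sample data with hypothetical column names (start and end stand in for 'ColumnName'; they are not from the question), that would be:

import pandas as pd

# Hypothetical column names for the sample input.
df = pd.DataFrame({"start": [1.0, 1.0, 2.0, 1.0, 1.0],
                   "end":   [25.0, 31.0, 97.0, 25.0, 26.0]})

df1 = df.groupby(["start"]).count()
# df1 is indexed by the distinct start values, with the row count per group:
#        end
# start
# 1.0      4
# 2.0      1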


You can do it as follows. First, create a new DataFrame with a column running from 1 to 97:

>>> df2 = pd.DataFrame(list(range(1,98,1)), dtype=float, columns=["range"])
>>> df2
    range
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
..    ...
92   93.0
93   94.0
94   95.0
95   96.0
96   97.0

To get counts of values from your original DataFrame:

>>> dfg = df.groupby('val').agg('count')
>>> dfg
     range
val
1.0      4
2.0      1

Now map from your df, and at the same time fill missing values with zeros. Then cast to integer to match OP's expected format.

>>> df2["count"] = df2["range"].map(dfg['range']).fillna(0)
>>> df2 = df2.astype(int)
>>> df2
    range  count
0       1      4
1       2      1
2       3      0
3       4      0
4       5      0
..    ...    ...
92     93      0
93     94      0
94     95      0
95     96      0
96     97      0

The final astype(int) cast above is only needed if you want integer counts instead of floats.

  • There are NaNs instead of counts; this is not the expected output, is it? Commented Jun 20, 2021 at 7:34
