
Specifically, I have a time series of a disease passing through a horse population. What I want to do is re-bin the data not by time but by case count, so that the DataFrame keeps its current order but every row represents exactly 1000 cases. If a row has too many cases it is split into new rows, and if it has too few it is merged with the following row(s), averaging severity and states weighted by the constituent case counts, so that I can get the heteroscedasticity out of the data. I realize I could do this with a massive loop, but I'm wondering whether there are any less computationally intensive apply-style methods that accomplish the same task. So in the example below, Time 0 would create four new rows; the last of these, holding the remaining 699 cases, would merge with the 230 cases from Time 1 plus 71 of the cases from Time 2, averaging their severity and states by the number of input cases.

Time    Severity    Cases    States
0       4           3699     39
1       7           230      15
2       2           1300     27
3       3           740      13
4       2           3000     23
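
To make the merge concrete, working from the numbers above: the combined row's severity would be the case-weighted mean (699*4 + 230*7 + 71*2) / 1000 = 4.548, and its states value (699*39 + 230*15 + 71*27) / 1000 = 32.628.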

2 Answers


IIUC, this should do:

import numpy as np
import pandas as pd

# Repeat each row once per case, so every resulting row stands for
# exactly one case. Note this blows up your dataframe by `Cases`,
# so be aware if you have lengthy data.
new_df = (df.loc[df.index.repeat(df.Cases)]
            .assign(Cases=1,                                   # each row is now a single case
                    groups=lambda x: np.arange(len(x))//1000)  # bin id per block of 1000 cases
            .groupby('groups')
            .agg({'Time':'first',       # the time the bin starts in
                  'Cases':'sum',        # 1000 everywhere except the last bin
                  'Severity':'mean',    # per-case mean == case-weighted mean
                  'States':'mean'})
         )
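
For reference, df here is the sample frame built from the question's table, so the result below is reproducible:

import pandas as pd

df = pd.DataFrame({'Time': [0, 1, 2, 3, 4],
                   'Severity': [4, 7, 2, 3, 2],
                   'Cases': [3699, 230, 1300, 740, 3000],
                   'States': [39, 15, 27, 13, 23]})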

Output:

        Time  Cases  Severity  States
groups                               
0          0   1000     4.000  39.000
1          0   1000     4.000  39.000
2          0   1000     4.000  39.000
3          0   1000     4.548  32.628
4          2   1000     2.000  27.000
5          2   1000     2.740  16.516
6          4   1000     2.000  23.000
7          4   1000     2.000  23.000
8          4    969     2.000  23.000

For anyone who finds this question in the future: from what I can tell there is no apply that will transform the time series in place. Quang's answer works great; however, as he pointed out, it really only works with small time series, since the frame is expanded to one row per case and the whole thing crashes due to memory limitations otherwise. There are two ways to skirt this. The first is to create a new DataFrame, run the script above on fractions of the whole time series, and append the results to it. The other option, so that the process isn't slow as molasses and doesn't hammer memory so severely, is to do the loops in Cython instead. You can find an example of the Cython loops code here: https://github.com/Jackal08/financial-data-structures/blob/master/cython_loops.c
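
Here is a minimal sketch of the first (chunked) idea, assuming the same column names as above; the function name rebin_in_chunks and the chunk size are hypothetical choices of mine, not from the answers. It explodes only a slice of rows at a time, keeps a running global case counter so bin ids stay consistent across chunk boundaries, and combines the partial aggregates with case-weighted means at the end:

import numpy as np
import pandas as pd

def rebin_in_chunks(df, bin_size=1000, chunk=50):
    # Hypothetical helper: processes `chunk` rows of the original frame
    # at a time, so only that slice is ever expanded to one row per case.
    parts, seen = [], 0
    for i in range(0, len(df), chunk):
        piece = df.iloc[i:i + chunk]
        rep = piece.loc[piece.index.repeat(piece.Cases)].assign(Cases=1)
        # global bin ids, carried across chunk boundaries
        rep['groups'] = (seen + np.arange(len(rep))) // bin_size
        seen += len(rep)
        # store partial sums per bin; summing per-case values lets
        # partial bins from different chunks combine correctly later
        parts.append(rep.groupby('groups')
                        .agg({'Time': 'first', 'Cases': 'sum',
                              'Severity': 'sum', 'States': 'sum'}))
    out = (pd.concat(parts)
             .groupby(level=0)
             .agg({'Time': 'first', 'Cases': 'sum',
                   'Severity': 'sum', 'States': 'sum'}))
    # turn the summed per-case values back into case-weighted means
    out[['Severity', 'States']] = out[['Severity', 'States']].div(out['Cases'], axis=0)
    return out

On the five-row example this reproduces Quang's output exactly (e.g. bin 3 ends up with 699 + 230 + 71 = 1000 cases and severity 4.548) while never holding more than one chunk's worth of exploded rows in memory.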
