5

I am trying to do a dataframe transformation that I cannot solve. I have tried multiple approaches from stackoverflow and the pandas documentation: apply, apply(lambda: ...), pivots, and joins. Too many attempts to list here, but not sure which approach is the best or if maybe I tried the right approach with the wrong syntax.

Basically, I have a dataframe, and I need to 1) offset the columns, 2) the number of columns to offset by varies and depends on a variable in the dataframe, 3) create columns at the end of the dataframe where needed to accommodate the offset, and 4) place zeros in the newly created intervals.

df1 = pd.DataFrame({'first' : ['John', 'Mary', 'Larry', 'jerry'], '1' : [5.5, 6.0,10,20], '2' : [100, 200, 300, 400], '3' : [150, 100, 240, 110], 'offset' : ([1,0,2,1])})
goal_df = pd.DataFrame({'first' : ['John', 'Mary', 'Larry', 'jerry'], '1' : [0.0, 6.0, 0.0, 0], '2' : [5.5, 200, 0.0, 20], '3' : [100, 100, 10, 400], '4' : [150, 0.0, 300, 110], '5' : [0.0, 0.0, 240, 0.0]})

df1
1         2        3    first      offset
5.5      100      150    John       1
6.0      200      100    Mary       0
10.0     300      240    Larry      2
20.0     400      110    jerry      1


goal_df
1      2    3    4    5  first
0    5.5  100  150    0   John
6  200.0  100    0    0   Mary
0    0.0   10  300  240  Larry
0   20.0  400  110    0  jerry

This data set will have c. 500 rows and c. 120 columns. The amount of the offset will very between 0-12. I thought about doing this with base Python functions, but I also found that difficult and the time consumer by the program would defeat the ultimate purpose which is to remove some tasks being done in Microsoft Excel.

I complain a lot about how Excel is inferior for big tasks like this, but it seems so far that the current spreadsheet offset() function in excel does do this in a very easy to use way but with thousands of formulas, is very slow. I have sold my workplace on the benefits of Python over Excel, and this is my first real trial, so speed is very important to me because I'm trying to convince my colleagues that Python can gobble up this spreadsheet much quicker than the current excel file weighing in a 96Mb in file size.

I came pretty close with the melt() function, and then taking the former column numbers and added the offset to them. However, I've had a lot of problems trying to reform the dataframe using pivot. No luck with apply or apply(lambda)!

Thanks for any help anyone can give!

1 Answer 1

3

This is not especially elegant or concise but ought to do the trick. I find it a little easier to shuffle columns around in numpy (also should be a bit faster) so I first convert from a dataframe to array.

arr    = df1.values[:,:-2]    # just the numbers
offset = df1.values[:,-1]     # just the offsets
column_pad = 2
arr2 = np.zeros( (arr.shape[0],arr.shape[1]+column_pad) )

Here's the key code, which simply shifts each row by the amount of the offset.

for i, j in enumerate(offset):
    arr2[i,j:3+j] = arr[i]

array([[   0. ,    5.5,  100. ,  150. ,    0. ],
       [   6. ,  200. ,  100. ,    0. ,    0. ],
       [   0. ,    0. ,   10. ,  300. ,  240. ],
       [   0. ,   20. ,  400. ,  110. ,    0. ]])

Beyond that it's just a little bit of manual labor to add space for the columns and put them in the right order.

df2 = df1.copy()
last_column = 6
for i in map(str,range(3,last_column)):
    df2[i] = 0
df2 = df2[ map(str,range(1,last_column))+['first','offset']]

Then load arr2 into df2.

df2.loc[:,'1':'5'] = arr2

   1      2    3    4    5  first  offset
0  0    5.5  100  150    0   John       1
1  6  200.0  100    0    0   Mary       0
2  0    0.0   10  300  240  Larry       2
3  0   20.0  400  110    0  jerry       1
3
  • Thanks for taking a look. I'm trying out your solution, so at least I have something that works. In the meantime, I'm going to try and develop that one path of melting the dataframe and mutating the column numbers (old column number + offset). The only problem is that when I go to "unmelt" the dataframe, the pivot completely messes everything up. Commented Jul 9, 2015 at 12:24
  • @nordicray OK, if you prefer the melt/pivot way, you might want to post what you tried and see if anyone can fix or improve it.
    – JohnE
    Commented Jul 9, 2015 at 12:57
  • Thanks @JohnE. The reason I didn't post it is because every time I thought I was close, I would realise how far away I actually was. I think I need to do some more research and really put my finger on how the multi-indexing for data frames in Pandas works before trying a more streamlined version of this. I also think that help in some other areas I'm working on. Commented Jul 24, 2015 at 16:15

Not the answer you're looking for? Browse other questions tagged or ask your own question.