
I am working on a data manipulation exercise where the original dataset looks like this:

df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5],
'x2': [2, -7, 4, 3, 2],
'a': [0, 1, 0, 1, 1],
'b': [0, 1, 1, 0, 0],
'c': [0, 1, 1, 1, 1],
'd': [0, 0, 1, 0, 1]})

Here the columns a, b, c, d are categories, whereas x1, x2 are features. The goal is to convert this dataset into the following format:

dfnew1 = pd.DataFrame({
'x1': [1, 2,2,2, 3,3,3, 4,4, 5,5,5],
'x2': [2, -7,-7,-7, 4,4,4, 3,3, 2,2,2],
'a': [0, 1,0,0, 0,0,0, 1,0,1,0,0],
'b': [0, 0,1,0, 1,0,0,0, 0, 0,0,0],
'c': [0,0,0,1,0,1,0,0,1,0,1,0],
'd': [0,0,0,0,0,0,1,0,0,0,0,1],
'y':[0,'a','b','c','b','c','d','a','c','a','c','d']})

Can I get some help on how to do it? So far, I was able to get it into the following form:


df.loc[:, 'a':'d']=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['label_concat']=df.loc[:, 'a':'d'].apply(lambda x: '-'.join([i for i in x if i!=0]),axis=1)

This gave me the following output:



   x1   x2  a   b   c   d   label_concat
0   1   2   0   0   0   0       
1   2   -7  a   b   c   0   a-b-c
2   3   4   0   b   c   d   b-c-d
3   4   3   a   0   c   0   a-c
4   5   2   a   0   c   d   a-c-d

As seen, this is not the desired output. Could I please get some help on how to modify my approach to get the desired output? Thanks.

  • I'm curious to know the source of this exercise. Can you post it?
    – mnm
    Commented Jul 16, 2020 at 22:29
  • Appreciate your reply. It is a type of multilabel classification task, which I am trying to convert into a modified single-label classification problem. The plan is then to obtain a probability distribution over the categories and do a ranking, as discussed in the paper, lkm.fri.uni-lj.si/xaigor/slo/pedagosko/dr-ui/…
    – jay
    Commented Jul 16, 2020 at 22:35
  • Thanks for the reference. "Curiosity killed the cat", as one might say now, because I wonder what the underlying reason is for transforming this multi-label classification into single-label. Let's say you achieve the task: how would you account for data compression loss or the significance of feature relevance lost in this cohesion?
    – mnm
    Commented Jul 17, 2020 at 5:08

2 Answers

You could try this, to get the desired output based on your original approach:

Option 1

temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
df=df.explode('y').fillna(0).reset_index(drop=True)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1).astype(int)
df.loc[1:, 'a':'d']=m.astype(int)

Another approach, similar to @ALollz's solution:

Option 2

df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
m = df.loc[:, 'a':'d'].groupby(level=0).cumsum(1).eq(df.y, axis=0) 
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
df['y']=df.loc[:, 'a':'d'].dot(df.columns[list(df.columns).index('a'):list(df.columns).index('d')+1]).replace('',0)

Output:

df
  x1  x2  a  b  c  d  y
0   1   2  0  0  0  0  0
1   2  -7  1  0  0  0  a
1   2  -7  0  1  0  0  b
1   2  -7  0  0  1  0  c
2   3   4  0  1  0  0  b
2   3   4  0  0  1  0  c
2   3   4  0  0  0  1  d
3   4   3  1  0  0  0  a
3   4   3  0  0  1  0  c
4   5   2  1  0  0  0  a
4   5   2  0  0  1  0  c
4   5   2  0  0  0  1  d

Explanation of Option 1:

First, we use your approach, but instead of changing the original data we work on a copy, temp, and instead of joining the column names into a string we keep them as a list:

temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)   #without join

df['y']
0           []
1    [a, b, c]
2    [b, c, d]
3       [a, c]
4    [a, c, d]

Then we can use pd.DataFrame.explode to expand the lists into one row per label, fillna(0) to fill y in the first row (its empty list becomes NaN after explode), and reset_index(drop=True) to renumber the rows:

df=df.explode('y').fillna(0).reset_index(drop=True)

df
    x1  x2  a  b  c  d            y
0    1   2  0  0  0  0            0
1    2  -7  1  1  1  0            a
2    2  -7  1  1  1  0            b
3    2  -7  1  1  1  0            c
4    3   4  0  1  1  1            b
5    3   4  0  1  1  1            c
6    3   4  0  1  1  1            d
7    4   3  1  0  1  0            a
8    4   3  1  0  1  0            c
9    5   2  1  0  1  1            a
10   5   2  1  0  1  1            c
11   5   2  1  0  1  1            d

Then we take df.loc[1:, 'a':'d'], replace each 1 with its column name, and compare every row against its value in the y column to build a mask. Finally, we cast the mask to int using astype(int):

m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)

m
        a      b      c      d
1    True  False  False  False
2   False   True  False  False
3   False  False   True  False
4   False   True  False  False
5   False  False   True  False
6   False  False  False   True
7    True  False  False  False
8   False  False   True  False
9    True  False  False  False
10  False  False   True  False
11  False  False  False   True



df.loc[1:, 'a':'d']=m.astype(int)

df.loc[1:, 'a':'d']
   a  b  c  d
1   1  0  0  0
2   0  1  0  0
3   0  0  1  0
4   0  1  0  0
5   0  0  1  0
6   0  0  0  1
7   1  0  0  0
8   0  0  1  0
9   1  0  0  0
10  0  0  1  0
11  0  0  0  1

Important: note that in the last step we exclude the first row. All of its category values are 0 and its y is also 0, so every entry of that row would be True in the mask. For a more general way that handles such rows, you could try this:

#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)

#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)
df.loc[:, 'a':'d']=msk.astype(int)

#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)
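
For reference, here is a self-contained sketch of the same idea (explode a list of labels, then mark only the column that matches y). It is a simplified variant rather than the exact code above: the names cats, labels and out are only illustrative, and it assumes the category columns contain just 0 and 1. Rows with no active category get the placeholder list [0], so they survive explode with y equal to 0 and need no special handling.

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})

cats = ['a', 'b', 'c', 'd']  # illustrative name for the category columns

# One list of active category names per row; an all-zero row gets [0].
labels = [[c for c in cats if row[c] == 1] or [0]
          for _, row in df[cats].iterrows()]

out = df.assign(y=pd.Series(labels, index=df.index))

# One output row per label.
out = out.explode('y').reset_index(drop=True)

# Keep a 1 only in the column whose name equals y.
for c in cats:
    out[c] = (out['y'] == c).astype(int)

print(out)
```

Because every category column is rebuilt from y, the all-zero row simply ends up with zeros everywhere and y equal to 0.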

Tricky problem. Here's one of probably many methods.

We set the index, then use .loc to repeat each row as many times as we will need, based on the row sum of the category columns (clipped at a minimum of 1 so every row appears at least once). Then we can use where to mask the DataFrame so that each repeated row keeps only one of its 1s and the rest become 0s. Finally, we dot with the columns to get the 'y' column you desire, replacing the empty string (when a row is all 0s) with 0.

df1 = df.set_index(['x1', 'x2'])
df1 = df1.loc[df1.index.repeat(df1.sum(1).clip(lower=1))] 
#       a  b  c  d
#x1 x2            
#1   2  0  0  0  0
#2  -7  1  1  1  0
#   -7  1  1  1  0
#   -7  1  1  1  0
#3   4  0  1  1  1
#    4  0  1  1  1
#    4  0  1  1  1
#4   3  1  0  1  0
#    3  1  0  1  0
#5   2  1  0  1  1
#    2  1  0  1  1
#    2  1  0  1  1

N = df1.groupby(level=0).cumcount()+1
m = df1.groupby(level=0).cumsum(1).eq(N, axis=0) 
     
df1 = df1.where(m).fillna(0, downcast='infer')
df1['y'] = df1.dot(df1.columns).replace('', 0)

df1 = df1.reset_index()

    x1  x2  a  b  c  d  y
0    1   2  0  0  0  0  0
1    2  -7  1  0  0  0  a
2    2  -7  0  1  0  0  b
3    2  -7  0  0  1  0  c
4    3   4  0  1  0  0  b
5    3   4  0  0  1  0  c
6    3   4  0  0  0  1  d
7    4   3  1  0  0  0  a
8    4   3  0  0  1  0  c
9    5   2  1  0  0  0  a
10   5   2  0  0  1  0  c
11   5   2  0  0  0  1  d
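
If groupby(...).cumsum(1) or fillna(..., downcast=...) emit deprecation warnings on a recent pandas, the same repeat-and-mask idea can be sketched with NumPy doing the masking. This is an adaptation under assumptions, not the code above: cats, reps, k, active, running, keep and rep are illustrative names, and the category columns are assumed to hold only 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})

cats = ['a', 'b', 'c', 'd']  # illustrative name for the category columns

# Repeat each row once per active category (at least once for all-zero rows).
reps = df[cats].sum(axis=1).clip(lower=1).to_numpy()
rep = df.loc[df.index.repeat(reps)]

# k = 1, 2, ... numbers the copies of each original row.
k = rep.groupby(level=0).cumcount().to_numpy() + 1

# In the k-th copy, keep only the k-th active category.
active = rep[cats].to_numpy() == 1
running = active.cumsum(axis=1)           # running count of 1s across columns
keep = active & (running == k[:, None])

rep = rep.reset_index(drop=True)
rep[cats] = keep.astype(int)

# y is the name of the single remaining category, or 0 if the row is all zeros.
rep['y'] = [cats[row.argmax()] if row.any() else 0 for row in keep]

print(rep)
```

The running count plays the role of cumsum(1) in the answer: the k-th copy of a row keeps the column where the count of 1s first reaches k.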
