6
$\begingroup$

I am dealing with a binary classification problem. The output column of my dataset is already encoded as 0/1. The problem is that I have many categorical features (columns) that are strings, and I would like to one-hot encode them.

I have 18 features (a few are integers; the others, the categorical ones, are strings) and 1 output column.

I tried this:

import pandas as pd
from keras.utils import to_categorical
dataframe = pd.read_csv('basic_df_export.csv', sep=';', encoding='ISO-8859-1', header=None)

dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:17]
Y = dataset[:,17]

# define example
encoded = to_categorical(X)
print(encoded)

but it doesn't work and throws this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-318da09d7033> in <module>()
      9 
     10 # define example
---> 11 encoded = to_categorical(X)
     12 print(encoded)

~/anaconda/lib/python3.6/site-packages/keras/utils/np_utils.py in to_categorical(y, num_classes)
     21         is placed last.
     22     """
---> 23     y = np.array(y, dtype='int')
     24     input_shape = y.shape
     25     if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:

ValueError: invalid literal for int() with base 10: 'photo'
$\endgroup$
3
  • 1
    $\begingroup$ Look at this: pbpython.com/categorical-encoding.html $\endgroup$
    – Mo-
    Commented Jan 7, 2019 at 21:47
  • $\begingroup$ I suggest that you modify your question and provide some actual data points so that people can see what kinds of variables are there. Otherwise you will not get useful answers. Clearly you do not know how to encode categorical features. You say you have a mix of categorical and numerical columns, but here, in "encoded = to_categorical(X)", you pass all your features to be encoded. Not to mention that this way of encoding categorical features is rather wrong as well! A place to start understanding categorical encoding could be: towardsdatascience.com/… $\endgroup$ Commented Jan 9, 2019 at 6:57
  • $\begingroup$ Check out this answer which shows you how to cast your string to float first, then you can use keras_to_categorical() and specify float dtype stackoverflow.com/a/42909410/772521 $\endgroup$
    – jlewkovich
    Commented Feb 26, 2019 at 23:37

2 Answers

3
$\begingroup$

For string data, use get_dummies() (from Pandas). to_categorical() takes integers as inputs.

There are two important differences between Keras: to_categorical() and Pandas: get_dummies().


Keras: to_categorical()

  • to_categorical() takes integers as input (no strings allowed).
  • to_categorical() by default creates one column for every integer from 0 up to the largest value in the input, even for values that never occur!
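
The first point is what produces the error in the question: judging by the traceback, to_categorical() runs np.array(y, dtype='int'), which cannot parse a string such as 'photo'. A minimal sketch that reproduces this with NumPy alone (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical sample of a string-valued categorical column.
content_type = ["photo", "video", "link", "event"]

try:
    # The same conversion that the traceback shows inside to_categorical().
    np.array(content_type, dtype="int")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

So the failure happens before any one-hot logic runs; the strings have to be encoded some other way first.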

The help text, shown with:

help(to_categorical)

says:

to_categorical(y, num_classes=None, dtype='float32')
    Converts a class vector (integers) to binary class matrix.

    E.g. for use with categorical_crossentropy.

    # Arguments
        y: class vector to be converted into a matrix
            (integers from 0 to num_classes).
        num_classes: total number of classes.
        dtype: The data type expected by the input, as a string
            (`float32`, `float64`, `int32`...)
...

So if your data consists of integers, you can use to_categorical(). You can check the element type of your data with .dtype and whether it is an np.array with type().

import numpy as np
npa = np.array([2,2,3,3,4,4])
print(npa.dtype, type(npa))
print(npa)

Result (the integer dtype is platform-dependent; on Linux and macOS it is typically int64):

int32 <class 'numpy.ndarray'>
[2 2 3 3 4 4]

Now you can use to_categorical():

from keras.utils import to_categorical
cat1 = to_categorical(npa)
print(cat1.dtype, type(cat1))
print(cat1)

Which yields a matrix:

float32 <class 'numpy.ndarray'>
[[0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1.]]

Note that the matrix has five columns, one for each integer from 0 to 4 (the maximum value in the np.array). The first two columns (representing 0 and 1) are all zeros, because neither value occurs in the original data.
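
If the integer codes do not start at 0 or have gaps, one way to avoid the empty columns is to remap them to contiguous indices first. The sketch below uses plain NumPy (np.unique with return_inverse=True, then np.eye) as a stand-in for to_categorical(); it illustrates the idea and is not Keras's implementation.

```python
import numpy as np

npa = np.array([2, 2, 3, 3, 4, 4])

# classes: sorted distinct values; codes: index of each element within classes.
classes, codes = np.unique(npa, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class index i.
one_hot = np.eye(len(classes), dtype="float32")[codes]

print(classes)
print(one_hot)
```

Here the result has three columns (for 2, 3 and 4) instead of five. Keep the classes array around if you need to map a column index back to its original value.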

to_categorical() also accepts input that is not an np.array, such as a list or a tuple. For instance, the statements below are also valid.

alt1 = to_categorical([0,0,1,1,2,2])
print(alt1.dtype, type(alt1))
print(alt1)

alt2 = to_categorical((0,0,1,1,2,2))
print(alt2.dtype, type(alt2))
print(alt2)

Because the values now range from 0 to 2, the matrix printed in both cases is:

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]

Pandas: get_dummies()

When you have a Pandas DataFrame, you can convert a column to dummies using get_dummies(), regardless of the column's data type. So it is also possible to convert a column of strings to dummies.

import pandas as pd
df = pd.DataFrame(data={'col1':["A", "A", "B", "B", "C", "C"]})
alt3 = pd.get_dummies(df['col1'])
print(type(alt3))
print(alt3)

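This gives (pandas 2.0 and later print True/False instead of 1/0, because the dummies are bool by default; pass dtype='uint8' to get_dummies() to reproduce the output shown here):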

<class 'pandas.core.frame.DataFrame'>
   A  B  C
0  1  0  0
1  1  0  0
2  0  1  0
3  0  1  0
4  0  0  1
5  0  0  1

Note that the result is again a Pandas DataFrame, so we need to convert it to an np.array.

alt3 = alt3.to_numpy()
print(alt3.dtype, type(alt3))
print(alt3)

This yields:

uint8 <class 'numpy.ndarray'>
[[1 0 0]
 [1 0 0]
 [0 1 0]
 [0 1 0]
 [0 0 1]
 [0 0 1]]

The array is now ready to be used with Keras.

Note that the matrix generated here does not (!) start at zero. Instead, each distinct value in the chosen Pandas column gets its own column in the dummy matrix.
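
For a dataset like the one in the question, with integer and string columns mixed, get_dummies() can be applied to the whole DataFrame while naming the columns to encode. A minimal sketch with made-up data: the column names 'industry' and 'content_type' follow the asker's comments, while 'likes' and 'label' are hypothetical. dtype is passed explicitly so the result does not depend on the pandas version.

```python
import pandas as pd

# Made-up rows; only the mix of integer and string columns mirrors the question.
df = pd.DataFrame({
    "likes": [10, 3, 7, 1],
    "industry": ["IT", "FoodTrade", "Automotive", "IT"],
    "content_type": ["photo", "video", "link", "event"],
    "label": [0, 1, 0, 1],
})

# Separate the already-encoded 0/1 target from the features.
y = df["label"].to_numpy()
features = df.drop(columns="label")

# Encode only the string columns; the integer column passes through unchanged.
encoded = pd.get_dummies(
    features, columns=["industry", "content_type"], dtype="float32"
)

X = encoded.to_numpy(dtype="float32")
print(encoded.columns.tolist())
print(X.shape, y.shape)
```

The dummy columns are named after the source column and the value (for example industry_IT), which makes it easier to trace a model input back to its category.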

$\endgroup$
0
$\begingroup$

Try:

X = dataset[:,0:17].astype(float).astype(int)

I think that if you have a string like '45.2', you have to cast it to floating point first, and from float you can then cast it to integer.
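
The two-step cast can be checked with NumPy alone. The sketch below uses made-up values: numeric strings such as '45.2' convert (the fractional part is truncated), while a category label such as 'photo' still raises a ValueError, which matches the follow-up comments.

```python
import numpy as np

numeric_strings = np.array(["45.2", "3", "7.9"])

# String -> float -> int; the fractional part is truncated.
as_int = numeric_strings.astype(float).astype(int)
print(as_int)

# A non-numeric label cannot be cast this way.
try:
    np.array(["photo"]).astype(float)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```

So this cast only helps when the strings are numbers stored as text; it does not encode categories.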

I will be glad if an editor could corroborate/correct this answer.

$\endgroup$
3
  • $\begingroup$ No, it doesn't work. I don't have any strings like that. All my columns are integers or strings (categorical features). For example, the column 'industry' can take one of 3 values ('IT', 'FoodTrade' or 'Automotive') and the column 'content_type' can take one of 4 values ('photo', 'video', 'link', 'event'). $\endgroup$
    – ZelelB
    Commented Jan 8, 2019 at 21:17
  • $\begingroup$ OK, can you describe your "to_categorical" method together with a few rows of your data? You certainly can't convert 'photo' into an integer. Did you already transform the strings 'photo', 'video', 'link', 'event' into 1, 2, 3, 4? $\endgroup$
    – aerijman
    Commented Jan 8, 2019 at 22:04
  • $\begingroup$ got it.. I used get_dummies() $\endgroup$
    – ZelelB
    Commented Jan 10, 2019 at 0:22
