2
$\begingroup$

Trying to use sc.fit_transform(X), I get a huge drop in accuracy on the same model. Without scaling the values of my dataset, I get accuracy values of 80 - 82%. When I try to scale them, using sc.fit_transform(X), I get accuracy values of 70 - 74%.

What could be the reasons for this huge drop in accuracy?

EDIT:

Here is the code I am using:

# read the dataset file
basic_df = pd.read_csv('posts.csv', sep=';', encoding = 'ISO-8859-1', parse_dates=[2], dayfirst=True) 

# One-Hot-Encoding for categorical (strings) features
basic_df = pd.get_dummies(basic_df, columns=['industry', 'weekday', 'category_name', 'page_name', 'type']) 

# bring the label column to the end 
cols = list(basic_df.columns.values) # Make a list of all of the columns in the df
cols.pop(cols.index('successful')) # Remove target column from list
basic_df = basic_df[cols+['successful']] # Add it at the end of dataframe

dataset = basic_df.values

# separate the data from the labels 
X = dataset[:,0:45].astype(float)
Y = dataset[:,45]

#standardizing the input feature
X = sc.fit_transform(X)

# evaluate model with standardized dataset
#estimator = KerasClassifier(build_fn=create_baseline, epochs=5, batch_size=5, verbose=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)
#estimator.fit(X_train, Y_train)
#predictions = estimator.predict(X_test)
#list(predictions)

# build the model 
model = Sequential()
model.add(Dense(100, input_dim=45, kernel_initializer='normal', activation='relu'))
model.add(Dense(50, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
history = model.fit(X_train, Y_train, validation_split=0.3, epochs=500, batch_size=10)

Part of the code is commented out, as I tried to use the KerasClassifier in the beginning. But both methods end up with much lower accuracy (as stated above) when I use fit_transform(X). Without fit_transform(X) I get an accuracy of 80 - 82%; with it, 70 - 74%. How come? Am I doing something wrong? Doesn't scaling the input data always lead to better (or at least almost the same) accuracy AND, primarily, faster fitting? Why is there such a big drop in accuracy when using it?

PS: 'sc' is StandardScaler() --> sc = StandardScaler()

Here is the dataframe used (in 2 photos, because it is too wide to fit in a single screenshot), with column 'successful' as the label column: df 1 df 2

$\endgroup$
4
  • $\begingroup$ What is sc in this context? $\endgroup$ Commented Jan 15, 2019 at 22:00
  • $\begingroup$ Also, what classifier are you using? $\endgroup$ Commented Jan 15, 2019 at 22:06
  • $\begingroup$ @timleathart I edited my question adding more context to the question. Thank you for helping! $\endgroup$
    – ZelelB
    Commented Jan 16, 2019 at 17:55
  • $\begingroup$ I think I need to "normalize" and not "standardize" data. Could you please take a look at the data, and tell me if normalizing or standardizing is right in this case? $\endgroup$
    – ZelelB
    Commented Jan 16, 2019 at 19:09

2 Answers

1
$\begingroup$

Without seeing the actual data it's kind of hard to say.
I do have one speculation: when you use the scaler on all the data (before the train/test split), you create data leakage.
Meaning, some of the data the model fitting should not see (the test set) is effectively included in the train set (because the scaler has seen it and used it for setting the scales).
This data leakage can cause overfitting and thus a lower accuracy score.
Try doing the scaling on each train/test split (use fit_transform on the train set and only transform on the test set). This is better practice for model research and obviously closer to how the trained models will perform live (on not-yet-seen data).
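
The split-first, fit-on-train order described above can be sketched as follows. This is only a minimal illustration on synthetic data, assuming scikit-learn's StandardScaler and train_test_split; the array shapes, the seed, and the `_scaled` variable names are placeholders, not the asker's dataset or code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and labels (assumption).
rng = np.random.RandomState(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 5))
Y = rng.randint(0, 2, size=200)

# Split first, so the test rows never influence the scaler.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = sc.transform(X_test)        # reuse the training mean/std
```

After this, the training columns have zero mean and unit variance, while the test columns are only approximately standardized, which is what unseen data would look like in production.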

$\endgroup$
5
  • $\begingroup$ added the data (just the head) $\endgroup$
    – ZelelB
    Commented Jan 16, 2019 at 18:11
  • $\begingroup$ does the edit give more context to the question? $\endgroup$
    – ZelelB
    Commented Jan 16, 2019 at 18:55
  • $\begingroup$ and why to fit the test data? O_o Doesn't one want to fit the training data and test the accuracy with the test data?? $\endgroup$
    – ZelelB
    Commented Jan 16, 2019 at 18:59
  • $\begingroup$ I think I need to "normalize" and not "standardize" data. Could you please take a look at the data, and tell me if normalizing or standardizing is right in this case? $\endgroup$
    – ZelelB
    Commented Jan 16, 2019 at 19:10
  • 1
    $\begingroup$ Sorry my bad, the test data needed to be transformed after the fit on train. edited the answer $\endgroup$
    – yoav_aaa
    Commented Jan 17, 2019 at 5:57
0
$\begingroup$

Please use early stopping. We cannot just choose a number of epochs and some hyperparameters in the beginning, then change (transform) the data and wait to see what happens. If your statement that normalizing the input data usually leads to a faster-fitting model is true (which is expected), the model should overfit faster as well. You should monitor training and catch the epoch where your model fits best. You can do this as follows:

Define early stopping:

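from keras.callbacks import EarlyStopping
# Stop once validation accuracy has not improved for 20 consecutive epochs.
# Note: newer Keras versions log this metric as 'val_accuracy' instead of 'val_acc'.
es = EarlyStopping(monitor='val_acc',
                   min_delta=0,
                   patience=20,
                   verbose=0, mode='auto',
                   restore_best_weights=True)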

Then add it to your training as:

model.fit(X_train, Y_train, validation_split=0.3, epochs=500, batch_size=10,callbacks=[es])

You can set the parameters of early stopping yourself. Early stopping lets you stop at the point where model performance starts to decrease due to overfitting. With restore_best_weights=True, the model is also restored to the weights from the epoch with the best performance when training stops. It will probably shorten your training time as well, since training is likely to stop before the final epoch.
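
To make the role of patience concrete, here is a rough sketch of the stopping rule in plain Python. It is an illustration of the idea only, not Keras's actual implementation; the function name `early_stopping_epoch` and the score list are made up for this example.

```python
def early_stopping_epoch(val_scores, patience, min_delta=0.0):
    """Return (stopped_epoch, best_epoch) for per-epoch validation scores.

    Higher scores are assumed to be better (e.g. validation accuracy).
    """
    best_score = float("-inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores):
        if score - min_delta > best_score:
            best_score = score
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Stop here; the weights from best_epoch would be restored.
                return epoch, best_epoch
    # Never ran out of patience: training ends at the last epoch.
    return len(val_scores) - 1, best_epoch


# Hypothetical validation accuracies, one per epoch.
scores = [0.70, 0.74, 0.78, 0.77, 0.76, 0.75, 0.79]
stopped, best = early_stopping_epoch(scores, patience=3)
```

With patience=3 this run stops at epoch index 5 and would keep the weights from epoch index 2. Note that it never reaches the later, better epoch 6, which is why patience should not be set too small.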

For more detail on early stopping, have a look at: https://stackoverflow.com/questions/43906048/keras-early-stopping

Also, you can use Batch Normalization to normalize not just the input but also the intermediate tensors between the hidden layers of your neural net.

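from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(100, input_dim=45, kernel_initializer='normal', activation='relu'))
model.add(BatchNormalization())  # normalize the activations of the first hidden layer
model.add(Dense(50, kernel_initializer='normal', activation='relu'))
model.add(BatchNormalization())  # normalize the activations of the second hidden layer
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))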

This will accelerate fitting, and it will also have a slight regularization effect.
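
For intuition, the training-time computation of a batch normalization layer can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: `batch_norm_train` is a made-up helper, gamma and beta are fixed here although they are learned in a real layer, and the moving averages used at inference time are left out.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly zero mean, unit variance
    return gamma * x_hat + beta              # scale and shift (learned in practice)


rng = np.random.RandomState(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(10, 4))  # batch of 10, 4 units
out = batch_norm_train(activations, gamma=np.ones(4), beta=np.zeros(4))
```

Whatever the scale of the incoming activations, the next layer sees values with roughly zero mean and unit variance per unit.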

For deeper insight (BatchNorm paper): https://arxiv.org/pdf/1502.03167.pdf

Lastly, to answer the core of your question: you need to normalize your data, not standardize it, in your context. That usually depends on the activation functions; ReLU works well with normalization. Do not forget to separate the training and test sets before normalization; as far as I can see, you currently do both at once. That's a fine way to cause data leakage. You can use the configuration below:

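from sklearn.preprocessing import Normalizer

# Fit once on the training set and reuse the same transformer for the test set.
# Note: Normalizer rescales each sample (row) to unit norm and learns nothing in fit().
transformer = Normalizer().fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)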
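
Since "normalize" can mean different things in scikit-learn, the following sketch contrasts three transformers on a tiny made-up array. Reading "normalization" as min-max scaling to [0, 1] is an assumption here; in that case MinMaxScaler, not Normalizer, is the usual tool. The data values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Tiny illustrative data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.0, 400.0]])

# Row-wise: every sample is divided by its own L2 norm.
row_normalized = Normalizer().fit_transform(X_train)

# Column-wise: each feature is mapped to [0, 1] using the training min/max.
minmax = MinMaxScaler().fit(X_train)
X_train_minmax = minmax.transform(X_train)
X_test_minmax = minmax.transform(X_test)

# Column-wise: each feature gets zero mean and unit variance on the training set.
X_train_std = StandardScaler().fit_transform(X_train)
```

The row-wise variant changes the relationship between features within a sample, so it is worth checking which behaviour is actually wanted before swapping one transformer for another.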

Hope I could help, good luck!

$\endgroup$
2
  • $\begingroup$ May I ask what the advantage of splitting the data before normalizing is? What if we split after normalization? Normalization is one of the preprocessing steps before feeding data to ML or DL, so normalizing before splitting seems to make more sense! $\endgroup$
    – Mario
    Commented Jun 16, 2019 at 15:48
  • $\begingroup$ You subtract a mean from the test set and divide it by a standard deviation, where you use the training data's values to calculate that mean and standard deviation, which causes you to leak knowledge from training data to test data. $\endgroup$
    – Ugur MULUK
    Commented Jun 19, 2019 at 7:02
