4
$\begingroup$

I am building a binary classifier, which classifies numerical data, using Keras.

I have 6992 data points in my dataset. The test set is 30% of the data, and the validation set is 30% of the training set.

When evaluating the model, I get these values:

recall:  0.8914240755310779
precision:  0.7006802721088435
f1_score:  0.7846260387811634
accuracy_score:  0.7035271816800843

How come the accuracy_score is about 10% lower than the F1-score?

Here is the code I'm using to evaluate the model:

from sklearn.metrics import recall_score, precision_score, f1_score

y_pred = model.predict(X_test)  # class predictions from the KerasClassifier wrapper

print('recall: ', recall_score(Y_test, y_pred))
print('precision: ', precision_score(Y_test, y_pred))
print('f1_score: ', f1_score(Y_test, y_pred))
print('accuracy_score: ', model.score(X_test, Y_test, verbose=0))
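
For context, the class balance of the test labels and of the predictions can be printed alongside the metrics (a small sketch, assuming the same Y_test and y_pred as above):

import numpy as np

# How many of each class are in the test labels vs. in the predictions
print('test label counts: ', np.bincount(np.asarray(Y_test).astype(int).ravel()))
print('prediction counts: ', np.bincount(np.asarray(y_pred).astype(int).ravel()))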

And here is my model:

# Assumes these imports (Keras 2.x with the scikit-learn wrapper)
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import TensorBoard, EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
import time

def create_model(neurons=23):

    model = Sequential()
    model.add(Dense(neurons, input_dim=37, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

    # Compile model (precision and recall here are custom metric functions defined elsewhere)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', precision, recall])

    return model

model = KerasClassifier(build_fn=create_model, epochs=500, batch_size=5, verbose=1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)

# Fit the normalizer on the training data only and reuse it for the test data
transformer = Normalizer().fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)

tensorboard = TensorBoard(log_dir="logs/{}".format(time.time()))
time_callback = TimeHistory()  # custom timing callback defined elsewhere
es = EarlyStopping(monitor='val_acc', min_delta=0, patience=20, verbose=0, mode='auto', restore_best_weights=True)

# Fit the model (the early-stopping callback is included so it actually takes effect)
history = model.fit(X_train, Y_train, validation_split=0.3, epochs=200, batch_size=5, verbose=1, callbacks=[tensorboard, time_callback, es])
$\endgroup$
2
  • $\begingroup$ Among the four metrics, only accuracy_score depends on the value TN (true negatives). Low TN would result in low accuracy_score, but would not influence the other three metrics. $\endgroup$
    – user12075
    Commented Jan 18, 2019 at 17:09
  • $\begingroup$ A bit late, but having had the same thing happen with my model, the problem was indeed imbalanced classes, which with each epoch were randomly becoming even more imbalanced. I found that XGBoost can give amazing results for this kind of dataset, especially for a binary classifier. $\endgroup$
    – M. Chris
    Commented Jul 25, 2023 at 7:49

2 Answers

6
$\begingroup$

You have imbalanced classes. Notice that your accuracy is very close to your precision, and quite dissimilar to your recall. This means that your precision (accuracy of positive predictions) is dominating the overall accuracy measure - nearly all of the cases in your data are classified as positive, so the accuracy among predicted positives is almost equivalent to the accuracy among all cases.

The F1 score is the harmonic mean of precision and recall, so unlike accuracy it ignores the true negatives entirely. Your model performs better on the positive class than on the negative class, which is evidenced by the nearly equivalent accuracy and precision together with the much higher recall.
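
To make the mechanism concrete, here is a small sketch with made-up confusion-matrix counts (not the actual counts from the question) in which most cases are predicted positive:

# Hypothetical counts, chosen only to illustrate the effect
tp, fp, fn, tn = 700, 300, 80, 20   # 1000 of 1100 cases are predicted positive

precision = tp / (tp + fp)                                  # 0.70
recall    = tp / (tp + fn)                                  # ~0.90
accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # ~0.65
f1        = 2 * precision * recall / (precision + recall)   # ~0.79

Because 1000 of the 1100 predictions are positive, the overall accuracy is pulled almost entirely toward the precision, while the F1-score, which never looks at the 20 true negatives, stays noticeably higher.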

$\endgroup$
3
  • $\begingroup$ But an F1-score of 78% is relatively Okay, isn't it? $\endgroup$
    – ZelelB
    Commented Jan 18, 2019 at 18:36
  • $\begingroup$ @ZelelB It's entirely dependent on your application. For some problems, that could be a totally respectable F1 score, for others, it might be a miserable failure. F1 is a good summary measure, but depending on your application, you may be more interested in optimizing precision or recall specifically. A medical screening test, for example, should have high recall, as we don't want to miss any true cases of disease. We can accept some false positives (lower precision) in order to achieve that, and optimizing the F1 measure would inappropriately try to balance them. $\endgroup$
    Commented Jan 18, 2019 at 18:44
  • $\begingroup$ Thank you so much for your answer, explanation and time! $\endgroup$
    – ZelelB
    Commented Jan 21, 2019 at 14:36
1
$\begingroup$

The F1-score is equal to:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
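
Plugging in the precision and recall reported in the question reproduces the reported F1-score almost exactly:

$$2 \cdot \frac{0.8914 \times 0.7007}{0.8914 + 0.7007} \approx 0.7846$$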

Notice that true negatives (TN) do not appear in this formula at all, whereas they do contribute to accuracy. Your model is probably good at identifying the positives, but it is likely predicting the actual negatives correctly at a much lower rate, which drags the accuracy down without affecting the F1-score.
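
One way to check this directly is to look at the full confusion matrix; here is a quick sketch with scikit-learn, assuming the same Y_test and y_pred as in the question:

from sklearn.metrics import confusion_matrix

# For binary labels, rows are actual classes and columns are predicted: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(Y_test, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)

A large FP count relative to TN would confirm that accuracy is being dragged down by the negatives while the F1-score is unaffected.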

$\endgroup$
