4
$\begingroup$

I am building a binary classifier, which classifies numerical data, using Keras.

I have 6992 data points in my dataset. The test set is 30% of the data, and the validation set is 30% of the training set.

When evaluating the model, I get these values:

recall:  0.8914240755310779
precision:  0.7006802721088435
f1_score:  0.7846260387811634
accuracy_score:  0.7035271816800843

How come the accuracy_score is about 10% lower than the F1-score?

Here is the code I'm using to evaluate the model:

from sklearn.metrics import recall_score, precision_score, f1_score

y_pred = model.predict(X_test)  # class predictions from the KerasClassifier wrapper

print('recall: ', recall_score(Y_test, y_pred))
print('precision: ', precision_score(Y_test, y_pred))
print('f1_score: ', f1_score(Y_test, y_pred))
print('accuracy_score: ', model.score(X_test, Y_test, verbose=0))
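
For context, the class balance of the test labels and of the predictions can be printed alongside the metrics (a small sketch, assuming the same Y_test and y_pred as above):

import numpy as np

# How many of each class are in the test labels vs. in the predictions
print('test label counts: ', np.bincount(np.asarray(Y_test).astype(int).ravel()))
print('prediction counts: ', np.bincount(np.asarray(y_pred).astype(int).ravel()))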

And here is my model:

# Assumes these imports (Keras 2.x with the scikit-learn wrapper)
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import TensorBoard, EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
import time

def create_model(neurons=23):

    model = Sequential()
    model.add(Dense(neurons, input_dim=37, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

    # Compile model (precision and recall here are custom metric functions defined elsewhere)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', precision, recall])

    return model

model = KerasClassifier(build_fn=create_model, epochs=500, batch_size=5, verbose=1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)

# Fit the normalizer on the training data only and reuse it for the test data
transformer = Normalizer().fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)

tensorboard = TensorBoard(log_dir="logs/{}".format(time.time()))
time_callback = TimeHistory()  # custom timing callback defined elsewhere
es = EarlyStopping(monitor='val_acc', min_delta=0, patience=20, verbose=0, mode='auto', restore_best_weights=True)

# Fit the model (the early-stopping callback is included so it actually takes effect)
history = model.fit(X_train, Y_train, validation_split=0.3, epochs=200, batch_size=5, verbose=1, callbacks=[tensorboard, time_callback, es])
$\endgroup$
2
  • $\begingroup$ Among the four metrics, only accuracy_score depends on the value TN (true negatives). Low TN would result in low accuracy_score, but would not influence the other three metrics. $\endgroup$
    – user12075
    Commented Jan 18, 2019 at 17:09
  • $\begingroup$ A bit late, but having had the same thing happen with my model, the problem was indeed imbalanced classes, which with each epoch were randomly becoming even more imbalanced. I found that XGBoost can give amazing results for this kind of dataset, especially for a binary classifier. $\endgroup$
    – M. Chris
    Commented Jul 25, 2023 at 7:49

2 Answers

6
$\begingroup$

You have imbalanced classes. Notice that your accuracy is very close to your precision, and quite dissimilar to your recall. This means that your precision (accuracy of positive predictions) is dominating the overall accuracy measure - nearly all of the cases in your data are classified as positive, so the accuracy among predicted positives is almost equivalent to the accuracy among all cases.

The F1 score is the harmonic mean of precision and recall, so unlike accuracy it ignores the true negatives entirely. Your model performs better on the positive class than on the negative class, which is evidenced by the nearly equivalent accuracy and precision together with the much higher recall.
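
To make the mechanism concrete, here is a small sketch with made-up confusion-matrix counts (not the actual counts from the question) in which most cases are predicted positive:

# Hypothetical counts, chosen only to illustrate the effect
tp, fp, fn, tn = 700, 300, 80, 20   # 1000 of 1100 cases are predicted positive

precision = tp / (tp + fp)                                  # 0.70
recall    = tp / (tp + fn)                                  # ~0.90
accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # ~0.65
f1        = 2 * precision * recall / (precision + recall)   # ~0.79

Because 1000 of the 1100 predictions are positive, the overall accuracy is pulled almost entirely toward the precision, while the F1-score, which never looks at the 20 true negatives, stays noticeably higher.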

$\endgroup$
3
  • $\begingroup$ But an F1-score of 78% is relatively Okay, isn't it? $\endgroup$
    – ZelelB
    Commented Jan 18, 2019 at 18:36
  • $\begingroup$ @ZelelB It's entirely dependent on your application. For some problems, that could be a totally respectable F1 score, for others, it might be a miserable failure. F1 is a good summary measure, but depending on your application, you may be more interested in optimizing precision or recall specifically. A medical screening test, for example, should have high recall, as we don't want to miss any true cases of disease. We can accept some false positives (lower precision) in order to achieve that, and optimizing the F1 measure would inappropriately try to balance them. $\endgroup$
    Commented Jan 18, 2019 at 18:44
  • $\begingroup$ Thank you so much for your answer, explanation and time! $\endgroup$
    – ZelelB
    Commented Jan 21, 2019 at 14:36
1
$\begingroup$

The F1-score is equal to:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
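
Plugging in the precision and recall reported in the question reproduces the reported F1-score almost exactly:

$$2 \cdot \frac{0.8914 \times 0.7007}{0.8914 + 0.7007} \approx 0.7846$$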

Notice that true negatives (TN) do not appear in this formula at all, whereas they do contribute to accuracy. Your model is probably good at identifying the positives, but it is likely predicting the actual negatives correctly at a much lower rate, which drags the accuracy down without affecting the F1-score.
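
One way to check this directly is to look at the full confusion matrix; here is a quick sketch with scikit-learn, assuming the same Y_test and y_pred as in the question:

from sklearn.metrics import confusion_matrix

# For binary labels, rows are actual classes and columns are predicted: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(Y_test, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)

A large FP count relative to TN would confirm that accuracy is being dragged down by the negatives while the F1-score is unaffected.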

$\endgroup$
