
I am working on an NLP classification task. My dataset is imbalanced, and some authors have only 1 text, so I want those texts to appear only in the training set. For the other authors I need to split the dataset into a 70% training set, 15% validation set and 15% test set.

I tried to use the train_test_split function from sklearn, but the results aren't that good.

My dataset is a dataframe that looks like this:

Title   Preprocessed_Text   Label
-----   -----------------   -----

Please help me out.

2 Answers


It is rather hard to obtain good classification results for a class that contains only 1 instance. Regardless, for imbalanced datasets you should use a stratified train_test_split (using stratify=y), which preserves the same proportions of instances in each class as observed in the original dataset.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
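Since the question asks for a 70/15/15 split, one possible sketch (assuming X and y hold your features and labels, as in the snippet above) is to call train_test_split twice, keeping stratify at every step:

from sklearn.model_selection import train_test_split

# First hold out 30% of the data, preserving class proportions (stratify=y)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, stratify=y, test_size=0.30, random_state=42)

# Then split that 30% in half: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, stratify=y_tmp, test_size=0.50, random_state=42)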

I should also add that if the dataset is rather small, say no more than 100 instances, it would be preferable to use cross-validation instead of train_test_split, and more specifically StratifiedKFold or RepeatedStratifiedKFold, which return stratified folds (see this answer to understand the difference between the two).
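As a rough sketch of what that cross-validated setup could look like (the classifier here is only a placeholder, not something from the question, and X, y are assumed as above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = LogisticRegression(max_iter=1000)  # placeholder model; any scikit-learn classifier works
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # folds keep class proportions
scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
print(scores.mean())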

When it comes to evaluation, you should consider metrics such as Precision, Recall and F1-score (the harmonic mean of Precision and Recall), using the weighted average for each of them, which weights every class by its number of true instances. As per the documentation:

'weighted':

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
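A minimal sketch of computing these scores (assuming y_test and y_pred already exist from a fitted model):

from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

# y_test holds the true labels, y_pred the model's predictions (both assumed here)
print(precision_score(y_test, y_pred, average='weighted'))
print(recall_score(y_test, y_pred, average='weighted'))
print(f1_score(y_test, y_pred, average='weighted'))

# classification_report prints per-class scores plus the weighted averages in one table
print(classification_report(y_test, y_pred))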

  • I did that, but I have the following error and I was wondering if you know any way to overcome this issue: 'ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.' Here y = df.Label.values, where Label is the name of a column in my data frame.
    – user18002341
    Commented Jan 27, 2022 at 12:38
  • This is due to how stratification works. By setting the stratify parameter when splitting the dataset, you ensure that the percentage of instances (samples) of each class is preserved in both splits (train and test set). However, in your case it cannot produce both splits with the same ratio for that specific class, as it contains only 1 instance. So you can either remove that class from your data, or duplicate that instance in your dataset (which is how some oversampling techniques work, but I wouldn't really recommend it here, as it is only a single instance and the algorithm will learn on that alone); a rough sketch of separating such classes follows after these comments.
    – Chris
    Commented Jan 27, 2022 at 13:08
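A rough sketch of the question's idea of keeping single-text authors in the training set only, which also avoids the stratification error above (assuming a dataframe df with a Label column, as in the question):

import pandas as pd
from sklearn.model_selection import train_test_split

# Authors (classes) with a single text go straight into the training set
counts = df['Label'].value_counts()
single_only = df[df['Label'].isin(counts[counts == 1].index)]
rest = df[~df['Label'].isin(counts[counts == 1].index)]

# Stratified split on the remaining authors, then add the single-text rows back to train
train_df, test_df = train_test_split(
    rest, stratify=rest['Label'], test_size=0.25, random_state=42)
train_df = pd.concat([train_df, single_only])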

With only one sample of a particular class it is effectively impossible to measure the classification performance on that class. So I recommend using one or more oversampling approaches to overcome the imbalance problem ([a hands-on article on it][1]). You should also pay attention to splitting the data in a way that preserves the prior probability of each class (for example by setting the stratify argument in train_test_split). In addition, there are some considerations about the scoring method you must take into account (for example, accuracy is not the best fit for imbalanced data).
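As a non-authoritative sketch of the oversampling suggestion, one common choice (my assumption, not something named in the answer) is the imbalanced-learn package, applied to the training split only:

from imblearn.over_sampling import RandomOverSampler

# Oversample minority classes in the training data only, never in validation/test
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)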

  • Thank you so much, I'll take that into account and I'm going to use other metrics, like the weighted F1 score, to measure the performance.
    – user18002341
    Commented Jan 27, 2022 at 12:28