3
$\begingroup$

I have a data set collected from Facebook that consists of 10 classes, each with 2,500 posts. However, when I count the number of unique words in each class, the counts differ, as shown in the figure (word count in each class).

Is this an imbalanced problem because of the word counts, or is it balanced because the number of posts per class is equal? And what is the best solution if it is imbalanced?

Update: my Python code:

import numpy as np
import pandas as pd
from keras.layers import Dense
from keras.models import Sequential
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r'E:\cluster data\One_File_nonnormalizenew2norm.txt', sep="*")

data.columns = ["text", "class1"]
data.dropna(inplace=True)
# Encode the class names as integer codes
data['class1'] = data.class1.astype('category').cat.codes
text = data['text']

y = data['class1']
num_class = len(np.unique(data.class1.values))
sentences_train, sentences_test, y_train, y_test = train_test_split(text, y, test_size=0.25, random_state=1000)

# Bag-of-words counts; the vocabulary is learned from the training split only
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)

model = Sequential()
# Input width is the vocabulary size (60874 in my run) instead of a hard-coded number
model.add(Dense(512, input_shape=(X_train.shape[1],)))
# One softmax output per class (was hard-coded to 20)
model.add(Dense(num_class, activation='softmax'))
model.summary()
model.compile(loss='sparse_categorical_crossentropy',
  optimizer='rmsprop',
  metrics=['accuracy'])

model.fit(X_train, y_train,batch_size=150,epochs=10,verbose=2,validation_data=(X_test,y_test),shuffle=True)
# Predicted class = index of the largest softmax output
predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)
accuracy_score(y_test, predicted)

0.9592031872509961
$\endgroup$
2
  • $\begingroup$ Could you please post your approach/code here? $\endgroup$
    – Sunil
    Commented Feb 6, 2019 at 14:37
  • $\begingroup$ Which code? I asked a general question based on the number of samples. $\endgroup$
    – mtesta010
    Commented Feb 6, 2019 at 15:18

3 Answers

-1
$\begingroup$

Thank you for your message, Ahmed. There are a few things to point out:

  1. "Is this an imbalanced problem?" Which problem? This is not a problem yet; it is data.
  2. What analysis is going to be done? Some analyses need the posts and some need the keywords.
  3. What method will be used for that analysis? Some methods take keywords as input and some take posts.

As for the numbers themselves: not necessarily. The smallest class has about 20% of the size of the largest one and, moreover, the overall scale is fairly large (20000 samples). So it is not necessarily an imbalanced class distribution. Again, consider what you want to do with this data; that determines the answer much more accurately.

Hope this helps. If you describe the task you want to do, I can post a solution here.

Cheers,

UPDATE

Well, then the problem is pretty straightforward. These unique-word counts are probably not very meaningful here. I recommend that you first try BoW models (TF-IDF and classic BoW) to represent your corpus. Then tune the hyperparameters; with a simple Multinomial Naive Bayes you should get an acceptable result.
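
A minimal sketch of the TF-IDF plus Multinomial Naive Bayes baseline, assuming scikit-learn is available. The posts and labels are invented for illustration and are not from the asker's data; `alpha` is the smoothing hyperparameter one would tune.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy posts and labels, invented for illustration only
posts = [
    "the match ended with a late goal",
    "the striker scored twice in the final",
    "the team won the league title",
    "the new phone has a faster processor",
    "the laptop ships with more memory",
    "the update improves battery life on the phone",
]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

# TF-IDF features (English stop words removed) feeding a Multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB(alpha=1.0))
model.fit(posts, labels)

prediction = model.predict(["the striker scored a goal"])[0]
print(prediction)
```

Swapping `TfidfVectorizer` for `CountVectorizer` in the same pipeline gives the classic BoW variant, so both can be compared with the same code.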

The data does not count as that imbalanced. I once had a problem in which some classes had 3000-4000 samples and some only 20! That is certainly imbalanced, but here you still have enough data to represent your minority class. If you also evaluate with precision and recall instead of accuracy, you will be fine. I strongly recommend having a look at this for a Python implementation and for some imbalanced data in practice.
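
To illustrate why precision and recall are more informative than accuracy on imbalanced classes, here is a small sketch with invented labels (not the asker's data): a classifier that always predicts the majority class still reaches high accuracy, while the recall for the minority class is zero.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Invented labels: 8 examples of class 0 and 2 examples of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# Per-class scores; zero_division=0 reports 0 instead of warning when a class is never predicted
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0
)

print("accuracy:", accuracy)
print("recall per class:", recall)
```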

The DL thing is answered in the comment.

$\endgroup$
4
  • $\begingroup$ Thank you Kasra for your answer. 1: the problem I mean is that this data set will be used in a text classification system. 2, 3: I will apply text preprocessing techniques, some ML algorithms (NB, SVM, ...) and deep learning (Keras). $\endgroup$
    – mtesta010
    Commented Feb 7, 2019 at 10:44
  • $\begingroup$ When I use deep learning for text classification, there is no improvement compared with traditional ML??!! $\endgroup$
    – mtesta010
    Commented Feb 7, 2019 at 10:51
  • $\begingroup$ Last one first: 1) Doing deep learning is tricky (be sure your setup is right). 2) It is not necessarily supposed to improve things; that depends on your data, the problem and the architecture you use. $\endgroup$
    Commented Feb 7, 2019 at 13:42
  • $\begingroup$ OK, I will try it. Thank you very much $\endgroup$
    – mtesta010
    Commented Feb 7, 2019 at 14:10
0
$\begingroup$

I don't know whether I understood your question correctly. If you count all words within a class, a word such as "the" is counted every time it appears. If you count only the unique words, "the" is counted once. This is why your counts differ from your plot. Each class can have a different number of unique words.
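
A small sketch of the difference, using invented posts and plain whitespace tokenization (both are assumptions for illustration): the two classes have the same number of posts, yet their total and unique word counts differ.

```python
# Invented example: two classes with two posts each
posts_by_class = {
    "sport": ["the team won the match", "the fans cheered"],
    "tech": ["the phone is new", "the new phone is fast and the screen is bright"],
}

counts = {}
for name, posts in posts_by_class.items():
    # Naive whitespace tokenization, lower-cased
    tokens = [tok for post in posts for tok in post.lower().split()]
    # (total tokens, unique tokens): "the" adds to the total every time but to the unique count once
    counts[name] = (len(tokens), len(set(tokens)))

for name, (total, unique) in counts.items():
    print(name, "total:", total, "unique:", unique)
```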

$\endgroup$
1
  • $\begingroup$ This is the count of unique words after removing stop words; the counts differ because the post lengths are different. $\endgroup$
    – mtesta010
    Commented Feb 6, 2019 at 15:55
0
$\begingroup$

Here's a simple way to deal with the word-count imbalance.

You could first convert the tokens to word embeddings. Two popular, publicly available models are Word2Vec and GloVe. Word embeddings can be very useful because they capture latent semantic and lexical information that is not usually available in vanilla BOW models.

https://nlp.stanford.edu/projects/glove/
https://radimrehurek.com/gensim/models/word2vec.html

Next, using the word embeddings, take the mean over the token vectors of each example in your data set. That balances the classes in some sense, because you are reducing every example to a single average embedding regardless of its length. Assuming the number of examples per class is balanced, your balance problem is solved. Also consider stop-word filtering to remove uninformative terms.

You can then use that representation as input to your classification model of choice (SVM, Random Forest, Logistic Regression, etc.), but that is its own problem to solve.
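
The averaging step can be sketched as follows. The 3-dimensional vectors and the `mean_embedding` helper are hypothetical and only for illustration; in practice the vectors would come from a pretrained Word2Vec or GloVe model. The point is that a short post and a long post both end up as a vector of the same size.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for illustration;
# real ones would be loaded from a pretrained Word2Vec or GloVe model.
embeddings = {
    "goal": np.array([1.0, 0.0, 0.0]),
    "match": np.array([0.0, 1.0, 0.0]),
    "phone": np.array([0.0, 0.0, 1.0]),
}

def mean_embedding(text, embeddings, dim=3):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    vectors = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

short_post = mean_embedding("goal match", embeddings)
long_post = mean_embedding("goal goal match match phone phone unknownword", embeddings)

print(short_post.shape, long_post.shape)
```

Stacking one such vector per post gives a fixed-width feature matrix that any of the usual classifiers can consume.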

There are tradeoffs to every type of feature engineering that you do. Be aware of that and do your due diligence to evaluate what type of systematic effects (if any) this type of preprocessing will have on your result.

$\endgroup$
