I have a dataset collected from Facebook that consists of 10 classes, with 2,500 posts per class. However, when I count the number of unique words in each class, the counts differ, as shown in the figure.
Is this an imbalanced problem because of the different word counts, or is it balanced because every class has the same number of posts? And if it is imbalanced, what is the best way to handle it?
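For context, this is roughly how I check the per-class counts, and how class weights could be passed to Keras if the classes turned out to be imbalanced (a sketch, assuming the same data frame and "text"/"class1" columns as in the code below; compute_class_weight is scikit-learn's helper):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# posts per class: all 2500, so the data set is balanced by number of posts
print(data['class1'].value_counts())

# unique words per class: this is the count that differs between classes in the figure
print(data.groupby('class1')['text'].apply(lambda posts: len(set(" ".join(posts).split()))))

# if the post counts were imbalanced, per-class weights could be passed to model.fit(..., class_weight=...)
weights = compute_class_weight('balanced', classes=np.unique(data['class1']), y=data['class1'])
class_weight = dict(enumerate(weights))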
Update: my Python code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras.layers import Dense

# load the posts; the file uses "*" as the column separator
data = pd.read_csv(r'E:\cluster data\One_File_nonnormalizenew2norm.txt', sep="*")
data.columns = ["text", "class1"]
data.dropna(inplace=True)

# encode the string class labels as integer codes 0..9
data['class1'] = data.class1.astype('category').cat.codes

text = data['text']
y = data['class1']
sentences_train, sentences_test, y_train, y_test = train_test_split(text, y, test_size=0.25, random_state=1000)
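# (not used in the run below, just an alternative I am considering)
# stratify=y would keep the 10 classes equally represented in both splits:
# sentences_train, sentences_test, y_train, y_test = train_test_split(
#     text, y, test_size=0.25, random_state=1000, stratify=y)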
from sklearn.feature_extraction.text import CountVectorizer

num_class = len(np.unique(data.class1.values))  # 10 classes

# bag-of-words features: one column per unique word seen in the training set
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
model = Sequential()
max_words = 5000  # currently unused
# input dimension = vocabulary size (60874 in my run); taken from X_train instead of hard-coding it
model.add(Dense(512, input_shape=(X_train.shape[1],)))
# output layer: one unit per class, softmax for multi-class probabilities
model.add(Dense(num_class, activation='softmax'))
model.summary()
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=150, epochs=10, verbose=2,
          validation_data=(X_test, y_test), shuffle=True)
# evaluate on the held-out test set
predicted = model.predict(X_test)
predicted = np.argmax(predicted, axis=1)  # pick the most probable class for each post
accuracy_score(y_test, predicted)
Output:
0.9592031872509961
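Since the per-class vocabulary sizes differ, I also look at per-class precision and recall rather than only the overall accuracy (a short sketch using scikit-learn's classification_report and confusion_matrix on the predictions above):

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision/recall/F1: shows whether any class suffers from its smaller vocabulary
print(classification_report(y_test, predicted))
print(confusion_matrix(y_test, predicted))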