
I'm building a multiclass classification system using Keras.

I am working with a dataset that includes text data and its metadata; both are sequences of words. The model outputs a probability distribution over multiple classes, indicating the likelihood of each class for the given input. Each label has a user-defined title and belongs to a parent label, which also has a user-defined title; both titles contain text relevant to the classification task.

The model currently does not use the label titles or the parent labels' titles, which contain useful text information. I'm wondering if incorporating this information into the model might yield better results. At the moment I'm achieving a val_accuracy of around 80-90%, depending on the dataset used in the experiments.

All text passed to the model is pre-processed with an encoder to generate embeddings of length 256.

Below is a simplified version of my current model structure:

from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

# Input for text data: a sequence of embedding vectors, so
# TEXT_INPUT_SHAPE must be 2D, i.e. (timesteps, embedding_dim),
# for the LSTM to receive the 3D batch it expects
text_input = Input(shape=TEXT_INPUT_SHAPE, name='text_input')
text_layer = LSTM(units=LSTM_UNITS, return_sequences=False)(text_input)

# Input for categorical data: a flat feature vector
cat_input = Input(shape=(CAT_INPUT_SHAPE,), name='cat_input')
cat_layer = Dense(DENSE_UNITS, activation='relu', kernel_regularizer='l2')(cat_input)

# Concatenate both branches
concatenated = Concatenate()([text_layer, cat_layer])

# Softmax head over the classes
output = Dense(NUM_CLASSES, activation='softmax', kernel_regularizer='l2')(concatenated)

model = Model(inputs=[text_input, cat_input], outputs=output)

Would it be a good idea to include the label titles and their parent titles in the model? If so, what is a good strategy to effectively include additional text information from label titles and their parent labels in the model?


1 Answer


Adding more data often improves a model, but whether it helps here is an empirical question: compare the evaluation metric on hold-out data with and without the extra information.
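
For example, here is a minimal sketch of that comparison. It assumes a hypothetical build_model(use_label_titles=...) factory that builds the two variants, plus already-prepared training and validation splits:

# Train both variants and compare their best validation accuracy
for use_titles in (False, True):
    model = build_model(use_label_titles=use_titles)  # hypothetical factory
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=EPOCHS, verbose=0)
    print(use_titles, max(history.history['val_accuracy']))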

One approach would be to add the additional data in the simplest way possible. It is not clear how the text is currently being encoded; if it goes through an embedding step, the label titles and their parent titles can be fed through the same embedding. The simplest way is to concatenate the titles onto the text of each training instance.
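
For concreteness, a minimal sketch of that idea. It assumes a hypothetical encode() wrapper around your existing encoder (the one producing the length-256 embeddings) and that each training instance's label title and parent title are available as strings:

# Prepend the label metadata so the encoder sees it as part of the input text
def with_label_titles(text, label_title, parent_title):
    return f"{parent_title} {label_title} {text}"

augmented_texts = [with_label_titles(t, lt, pt)
                   for t, lt, pt in zip(texts, label_titles, parent_titles)]
text_embeddings = [encode(t) for t in augmented_texts]  # same encoder as before

Because the titles pass through the same encoder as the text, the rest of the architecture stays unchanged.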

  • I've added info on how the text is encoded. In short, it is embedded before it is fed to the model. I'm not sure I understand your suggestion. How would passing all label titles and parent labels to the model improve the results, if the data is unchanged across all epochs during training? – don, Jul 12 at 14:59
