
Preface

I am trying to fine-tune transformer-based models (an LM and an LLM). The LM I used is DeBERTa, and the LLM is LLaMA 3. The task is to classify whether a text contains condescending language (binary classification).

I use AutoModelForSequenceClassification, which adds a classification head on top of the base model, for both the LM and the LLM.
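Roughly, the loading looks like the sketch below (the checkpoint names are illustrative, not necessarily the exact ones I used):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# LM: DeBERTa with a randomly initialized 2-class classification head
deberta = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",  # assumed checkpoint
    num_labels=2,
)
deberta_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# LLM: LLaMA 3 with the same kind of sequence-classification head
llama = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # assumed checkpoint
    num_labels=2,
)
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LLaMA has no pad token by default, so one must be set for batched classification
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama.config.pad_token_id = llama_tokenizer.pad_token_id
```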

Implementation

  1. Dataset:

    • Size: about 10,000 texts, each labeled 0 (not condescending) or 1 (condescending). The class ratio is roughly 1:10 (condescending : not condescending).
  2. Parameters

| Parameter | LM (DeBERTa) | LLM (LLaMA 3) |
| --- | --- | --- |
| Batch size | 32 | 16 (per_device_train_batch_size = 4, gradient_accumulation_steps = 4) |
| Epochs / steps | 2 epochs | 1000 steps (20% of the data used as validation set) |
| Learning rate | linear schedule (2e-5) | constant (2e-5) |
| Optimizer | AdamW (lr = 2e-5, eps = 1e-8) | paged_adamw_32bit |
| Fine-tuning | Full fine-tuning | LoRA (rank = 32, dropout = 0.5, alpha = 8) with 8-bit quantization |
| Precision | 0.659 | 0.836 |
| Recall | 0.47 | 0.091 |
| F1-score | 0.549 | 0.164 |
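The LLM column corresponds roughly to the following setup sketch (checkpoint name, output directory, and any detail not listed in the table are assumptions):

```python
import torch
from transformers import (AutoModelForSequenceClassification, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 8-bit quantization of the base LLM
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # assumed checkpoint
    num_labels=2,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters: rank = 32, alpha = 8, dropout = 0.5
lora_config = LoraConfig(
    r=32,
    lora_alpha=8,
    lora_dropout=0.5,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="llama3-condescension",   # assumed
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch size 16
    max_steps=1000,
    learning_rate=2e-5,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
)
```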

Question and Issue

Here is the log of one training run of the fine-tuned LLM. The validation F1-score is always above 0.6, but the validation loss is stuck at 0.24.

(screenshot of the training log)
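The validation F1 in the log is computed with a metric function along these lines (a standard sketch, not necessarily the exact code I used):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair the Trainer passes at evaluation time
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

# Passed to the Trainer so eval F1 is reported alongside eval loss, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   compute_metrics=compute_metrics, ...)
```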

  1. Why does the test-set F1-score only range from 0 to 0.2 for some of the parameter variations I tried, while the validation-set F1-score is always above 0.6? Is that reasonable, and why?
  2. Is it common for an LM to beat an LLM on a particular task? If so, what is the rationale?