Preface
I am fine-tuning transformer-based models (an LM and an LLM) to classify whether a text contains condescending language (binary classification). The LM is DeBERTa and the LLM is LLaMA 3. For both, I use `AutoModelForSequenceClassification`, which adds a classification head on top of the base model.
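A minimal sketch of loading both models this way (the checkpoint names are assumptions; substitute the ones you actually use):

```python
from transformers import AutoModelForSequenceClassification

# LM: DeBERTa with a 2-class classification head (full fine-tuning)
lm = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",  # assumed checkpoint
    num_labels=2,
)

# LLM: LLaMA 3 with the same kind of head. LLaMA has no pad token
# by default, so one must be set for batched classification.
llm = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # assumed checkpoint (gated on the Hub)
    num_labels=2,
)
llm.config.pad_token_id = llm.config.eos_token_id
```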
Implementation
Dataset:
- Amount: about 10,000 texts, each labeled `0` (not condescending) or `1` (condescending). The proportion is 1:10 (condescending : not condescending).
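Given the 1:10 imbalance, one common mitigation (not stated as used in this setup, just an option) is to weight the loss by inverse class frequency. A minimal sketch of computing such weights:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * count),
    so the rarer class receives a proportionally larger weight."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Toy labels with the 1:10 ratio described above
labels = [1] * 1 + [0] * 10
weights = inverse_frequency_weights(labels)
# Minority class (label 1) gets weight 5.5; majority class gets 0.55
```

These weights could then be passed to a weighted cross-entropy loss instead of the default unweighted one.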
Parameters

Parameter | LM | LLM |
---|---|---|
Batch size | 32 | 16 (per_device_train_batch_size = 4, gradient_accumulation_steps = 4) |
Epochs / steps | 2 epochs | 1000 steps (20% used as validation set) |
Learning rate | linear schedule (2e-5) | constant (2e-5) |
Optimizer | AdamW (lr = 2e-5, eps = 1e-8) | paged_adamw_32bit |
Fine-tuning | Full fine-tuning | LoRA (rank = 32, dropout = 0.5, alpha = 8) with 8-bit quantization |
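For the LLM, the LoRA and 8-bit quantization settings listed above might look like the following `peft`/`bitsandbytes` configuration (a sketch; the target modules are an assumption and should be adjusted to the model):

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 8-bit quantization, as in the table above
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# LoRA hyperparameters from the table: rank 32, alpha 8, dropout 0.5
lora_config = LoraConfig(
    r=32,
    lora_alpha=8,
    lora_dropout=0.5,
    target_modules=["q_proj", "v_proj"],  # assumed; model-dependent
    task_type="SEQ_CLS",
)
```

Note that LoRA scales its update by alpha / r, so r = 32 with alpha = 8 gives a scaling factor of 0.25, and a dropout of 0.5 is on the high side; both are worth keeping in mind when interpreting the results below.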
Results (test set)

Metric | LM | LLM |
---|---|---|
Precision | 0.659 | 0.836 |
Recall | 0.47 | 0.091 |
F1-score | 0.549 | 0.164 |
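As a sanity check, the F1 values in the table follow from the harmonic mean of precision and recall:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the results table above
lm_f1 = f1(0.659, 0.47)    # rounds to 0.549
llm_f1 = f1(0.836, 0.091)  # rounds to 0.164
```

The very low LLM F1 is driven almost entirely by its recall of 0.091.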
Question and Issue
Here is the training log of one sample run of the fine-tuned LLM. The validation F1-score is always > 0.6, but the validation loss is stuck at 0.24.
- Why does the test-set F1-score range only from 0 to 0.2 for some of the parameter variations I tried, when the validation-set F1-score is always above 0.6? Is that reasonable, and why?
- Is it common for an LM to beat an LLM on a particular task? If so, what is the rationale?