- Background
- Task purpose
- Metrics
- Preliminary research
- GPU platform choice
- Loss-epoch curve
- Whether fine-tuning affects the model weights
- About the loss function in NLP tasks
- Code implementation
- Final discussion
Background
This term I finished COMP9444, Neural Networks and Deep Learning. Because I found many details worth noting and recording, I wrote this blog post as my notes.
Here are the records kept by our team (although the doc is in Chinese, the detailed preliminary research and the analysis of the large NLP models are worth reading):
Google doc for the notes and coordination
Here is our submitted code; the notebook (with lots of trial and error), the report (the formal submission), and the readme (how to run the code) are in the top-level directory:
Please pay attention to the files mentioned above, which contain the detailed process and lots of trial and error.
Task purpose
Task: To create a deep-learning model for extractive or abstractive text summarisation on news articles. The model should aim to accurately convey the key ideas of a text concisely while maintaining coherence and being robust to different writing styles.
According to this description, we had to choose one direction: extractive or abstractive summarisation.
Extractive
Extractive summarisation pulls text directly from the raw article: the few most important sentences are selected and extracted verbatim to form the summary.
Abstractive
Abstractive summarisation generates new text, just as a human describes what happened in their own words rather than reusing raw sentences from the article.
Finally, because the reference summaries in the raw dataset are all human-written abstractive summaries, we focused on abstractive rather than extractive summarisation.
Metrics
ROUGE-1 (overlap of unigrams/words)
ROUGE-2 (overlap of bigrams/two-word phrases)
ROUGE-L (longest common subsequence, measuring fluency and ordering)
Proposed metric (Junadhi Metric)
BERTScore (semantic similarity)
BARTScore (semantic similarity)
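Here is a rough sketch of how the standard metrics above can be computed with the Hugging Face `evaluate` library and the `bert-score` package (this is illustrative, not our original evaluation code; the toy sentences are made up, and the Junadhi Metric and BARTScore are not shown):

```python
import evaluate
from bert_score import score as bert_score

predictions = ["the cat sat on the mat"]        # toy model summaries
references  = ["a cat was sitting on the mat"]  # toy human-written summaries

# ROUGE-1 / ROUGE-2 / ROUGE-L: n-gram and longest-common-subsequence overlap.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# BERTScore: semantic similarity from contextual BERT embeddings.
P, R, F1 = bert_score(predictions, references, lang="en")
print(f"BERTScore F1: {F1.mean():.4f}")
```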
preliminary research
RAGFlow
facebook/bart-large-cnn
google/pegasus-cnn_dailymail
facebook/bart-large
google/pegasus-large
pegasus-x-base
bert-extractive-summarizer
and so on.
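During this research, each candidate checkpoint can be tried off the shelf with the `transformers` summarisation pipeline before any fine-tuning (a minimal sketch; the article string is a placeholder):

```python
from transformers import pipeline

# Load one of the candidate checkpoints, e.g. BART fine-tuned on CNN/DailyMail.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # a news article string goes here
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```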
GPU platform choice
At first I trained on a Tesla T4, but it was really time-consuming on such a low-compute device: about 3 days for only 2 epochs.
Then I found that running on our own graphics cards is not convenient; it's better to rent GPUs on vast.ai directly. Also, I suggest you don't upload data from your local machine to the vast.ai instance, because it's slow. The cloud instance can download a multi-gigabyte Hugging Face dataset in only about 10 seconds!
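For example, pulling the dataset straight from the Hugging Face Hub on the rented instance is a single call (CNN/DailyMail is used here as an assumed example; substitute whatever dataset you actually train on):

```python
from datasets import load_dataset

# Download the dataset directly onto the cloud instance instead of uploading it.
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)  # shows the train/validation/test splits and their sizes
```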
Loss-epoch curve
The loss-epoch curve should look similar to an exponential decay curve. If the loss drops too rapidly, I would guess the tokenization or the padding has gone wrong.
Because a neural network's training loss should trace a smooth, continuous curve, a sudden spike or collapse means something is weird.
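One quick check related to padding (a rough sketch; `train_dataloader` and `tokenizer` are stand-ins for whatever objects your own pipeline provides): if padded label positions are not masked with -100, the model quickly learns to predict the pad token and the loss collapses unrealistically fast.

```python
# Inspect one batch: padded label positions should be -100, not the pad token id,
# so that padding never contributes to the loss.
batch = next(iter(train_dataloader))
labels = batch["labels"]
pad_fraction = (labels == tokenizer.pad_token_id).float().mean()
masked_fraction = (labels == -100).float().mean()
print(f"pad tokens in labels: {pad_fraction:.2%}, masked (-100): {masked_fraction:.2%}")
# Expect roughly 0% raw pad tokens; a large pad fraction is a red flag.
```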
Whether fine-tuning affects the model weights
Yes, fine-tuning changes the weights of the initial model directly. Usually most of the middle layers of the network are frozen, and only the first few and last few layers are fine-tuned.
This is because you only need to nudge the model slightly towards your target behaviour, building on the base model.
You don't need to change the initial weights massively.
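In a Hugging Face model this kind of partial freezing is just a matter of toggling `requires_grad` (a minimal sketch assuming a BART checkpoint; exactly which layers to unfreeze is a design choice, not something fixed by our project):

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# Freeze everything first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze the "first" layers (shared embeddings) and the "last" layers
# (final decoder block and the LM head) for fine-tuning.
for param in model.model.shared.parameters():
    param.requires_grad = True
for param in model.model.decoder.layers[-1].parameters():
    param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True
```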
About the loss function in NLP tasks
In vision tasks, the loss function is usually cross-entropy loss.
But in NLP summarisation it is obviously not that simple: the loss/evaluation signal is usually n-gram based, with no semantic component, and 2-grams are the usual choice in most NLP models. Hence the models don't perform well at truly understanding the raw articles.
Of course, we used our own metrics to evaluate the validation set, but that doesn't mean those metrics can be plugged in directly as loss functions.
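To make the "2-gram" point concrete, here is a toy hand-rolled ROUGE-2-style score (purely illustrative, not part of our submission): it only counts overlapping bigrams, so a semantically equivalent paraphrase can still score poorly.

```python
from collections import Counter

def bigrams(text: str) -> Counter:
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate: str, reference: str) -> float:
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())   # bigrams shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Word overlap only: "passed away" vs "died" carry the same meaning but share no bigram.
print(rouge2_f1("the president passed away on monday",
                "the president died on monday"))  # ~0.44, despite identical meaning
```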
Code implementation
We didn't spend much effort on the code itself, because it's simple: just take care of tokenization and padding, and debug the model frequently.
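For reference, a minimal end-to-end fine-tuning sketch in the Hugging Face style looks like this (an assumed reconstruction, not our exact submitted script; the model name, dataset, and hyperparameters are placeholders). The preprocessing step truncates articles and summaries, and the collator handles padding and masks padded labels with -100:

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-large-cnn"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess(batch):
    # Tokenize and truncate the articles and the reference summaries.
    inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["highlights"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

# Dynamic padding; padded label positions become -100 and are ignored by the loss.
collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)

args = Seq2SeqTrainingArguments(output_dir="out", num_train_epochs=2,
                                per_device_train_batch_size=4,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=tokenized["train"],
                         eval_dataset=tokenized["validation"],
                         data_collator=collator, tokenizer=tokenizer)
trainer.train()
```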
Final discussion
Here you can refer to the discussion of limitations in the notebook or the report.
We found that NLP models have no sense of emotion when judging which sentences are most worth reading. For example, if there are sentences about death, human beings will usually pay extra attention to those emotionally charged sentences.
However, the machine only focuses on its 2-gram loss function and may prefer the more informative sentences, simply because they cover more content.
Hence we think there should be some emotion-related weights, added as extra layers in the deep-learning network, so that the model can better simulate a real, emotional human.
You are welcome to point out any mistakes and faults!