- Background
- Task purpose
- Metrics
- Preliminary research
- GPU platform choice
- Loss-epoch curve
- Whether fine-tuning affects the model weights
- About the loss function in NLP tasks
- Code implementation
- Final discussion
Background
This term I finished COMP9444, Neural Networks and Deep Learning. Because I found many details worth noting and recording, I wrote this blog post as my notes.
Here are the records kept by our team (although the doc is in Chinese, the detailed preliminary research and the analysis of the large NLP models are worth reading):
Google doc for the notes and coordination
Here is our submitted code; the notebook (with lots of trial and error), the report (the formal submission), and the readme (how to run the code) are in the top-level directory:
Please pay attention to the files mentioned above, which contain the detailed process and lots of trial and error.
Task purpose
Task: To create a deep-learning model for extractive or abstractive text summarisation on news articles. The model should aim to accurately convey the key ideas of a text concisely while maintaining coherence and being robust to different writing styles.
According to this description, we had to choose one direction: extractive or abstractive summarisation.
Extractive
Extractive summarisation pulls text directly from the raw article: the few most important sentences are selected and extracted verbatim to form the summary.
Abstractive
Abstractive summarisation generates new text, just as a human describes what happened in their own words rather than reusing raw sentences from the article.
Finally, because the reference summaries in the raw dataset are all human-written abstractive summaries, we focused on abstractive rather than extractive summarisation.
Metrics
ROUGE-1 (overlap of unigrams/words)
ROUGE-2 (overlap of bigrams/two-word phrases)
ROUGE-L (longest common subsequence, measuring fluency and ordering)
Proposed metric (Junadhi Metric)
BERTScore (semantic similarity)
BARTScore (semantic similarity)
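Here is a rough sketch of how the standard metrics above can be computed with the Hugging Face `evaluate` library and the `bert-score` package (this is illustrative, not our original evaluation code; the toy sentences are made up, and the Junadhi Metric and BARTScore are not shown):

```python
import evaluate
from bert_score import score as bert_score

predictions = ["the cat sat on the mat"]        # toy model summaries
references  = ["a cat was sitting on the mat"]  # toy human-written summaries

# ROUGE-1 / ROUGE-2 / ROUGE-L: n-gram and longest-common-subsequence overlap.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# BERTScore: semantic similarity from contextual BERT embeddings.
P, R, F1 = bert_score(predictions, references, lang="en")
print(f"BERTScore F1: {F1.mean():.4f}")
```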
preliminary research
RAGFlow
facebook/bart-large-cnn
google/pegasus-cnn_dailymail
facebook/bart-large
google/pegasus-large
pegasus-x-base
bert-extractive-summarizer
and so on.
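During this research, each candidate checkpoint can be tried off the shelf with the `transformers` summarisation pipeline before any fine-tuning (a minimal sketch; the article string is a placeholder):

```python
from transformers import pipeline

# Load one of the candidate checkpoints, e.g. BART fine-tuned on CNN/DailyMail.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # a news article string goes here
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```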
GPU platform choice
At first I trained on a Tesla T4, but it was really time-consuming on such a low-compute device: about 3 days for only 2 epochs.
Then I found that running on our own graphics cards is not convenient; it's better to rent GPUs on vast.ai directly. Also, I suggest you don't upload data from your local machine to the vast.ai instance, because it's slow. The cloud instance can download a multi-gigabyte Hugging Face dataset in only about 10 seconds!
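For example, pulling the dataset straight from the Hugging Face Hub on the rented instance is a single call (CNN/DailyMail is used here as an assumed example; substitute whatever dataset you actually train on):

```python
from datasets import load_dataset

# Download the dataset directly onto the cloud instance instead of uploading it.
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)  # shows the train/validation/test splits and their sizes
```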
Loss-epoch curve
The loss-epoch curve should look similar to an exponential decay curve. If the loss drops too rapidly, I would guess the tokenization or the padding has gone wrong.
Because a neural network's training loss should trace a smooth, continuous curve, a sudden spike or collapse means something is weird.
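One quick check related to padding (a rough sketch; `train_dataloader` and `tokenizer` are stand-ins for whatever objects your own pipeline provides): if padded label positions are not masked with -100, the model quickly learns to predict the pad token and the loss collapses unrealistically fast.

```python
# Inspect one batch: padded label positions should be -100, not the pad token id,
# so that padding never contributes to the loss.
batch = next(iter(train_dataloader))
labels = batch["labels"]
pad_fraction = (labels == tokenizer.pad_token_id).float().mean()
masked_fraction = (labels == -100).float().mean()
print(f"pad tokens in labels: {pad_fraction:.2%}, masked (-100): {masked_fraction:.2%}")
# Expect roughly 0% raw pad tokens; a large pad fraction is a red flag.
```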
Whether fine-tuning affects the model weights
Yes, fine-tuning changes the weights of the initial model directly. Usually most of the middle layers of the network are frozen, and only the first few and last few layers are fine-tuned.
This is because you only need to nudge the model slightly towards your target behaviour, building on the base model.
You don't need to change the initial weights massively.
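In a Hugging Face model this kind of partial freezing is just a matter of toggling `requires_grad` (a minimal sketch assuming a BART checkpoint; exactly which layers to unfreeze is a design choice, not something fixed by our project):

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# Freeze everything first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze the "first" layers (shared embeddings) and the "last" layers
# (final decoder block and the LM head) for fine-tuning.
for param in model.model.shared.parameters():
    param.requires_grad = True
for param in model.model.decoder.layers[-1].parameters():
    param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True
```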
About the loss function in NLP tasks
In vision tasks, the loss function is usually cross-entropy loss.
But in NLP summarisation it is obviously not that simple: the loss/evaluation signal is usually n-gram based, with no semantic component, and 2-grams are the usual choice in most NLP models. Hence the models don't perform well at truly understanding the raw articles.
Of course, we used our own metrics to evaluate the validation set, but that doesn't mean those metrics can be plugged in directly as loss functions.
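To make the "2-gram" point concrete, here is a toy hand-rolled ROUGE-2-style score (purely illustrative, not part of our submission): it only counts overlapping bigrams, so a semantically equivalent paraphrase can still score poorly.

```python
from collections import Counter

def bigrams(text: str) -> Counter:
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate: str, reference: str) -> float:
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())   # bigrams shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Word overlap only: "passed away" vs "died" carry the same meaning but share no bigram.
print(rouge2_f1("the president passed away on monday",
                "the president died on monday"))  # ~0.44, despite identical meaning
```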
Code implementation
We didn't spend much effort on the code itself, because it's simple: just take care of tokenization and padding, and debug the model frequently.
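For reference, a minimal end-to-end fine-tuning sketch in the Hugging Face style looks like this (an assumed reconstruction, not our exact submitted script; the model name, dataset, and hyperparameters are placeholders). The preprocessing step truncates articles and summaries, and the collator handles padding and masks padded labels with -100:

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-large-cnn"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess(batch):
    # Tokenize and truncate the articles and the reference summaries.
    inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["highlights"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

# Dynamic padding; padded label positions become -100 and are ignored by the loss.
collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)

args = Seq2SeqTrainingArguments(output_dir="out", num_train_epochs=2,
                                per_device_train_batch_size=4,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=tokenized["train"],
                         eval_dataset=tokenized["validation"],
                         data_collator=collator, tokenizer=tokenizer)
trainer.train()
```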
Final discussion
Here you can refer to the discussion of limitations in the notebook or the report.
We found that NLP models have no sense of emotion when judging which sentences are most worth reading. For example, if there are sentences about death, human beings will usually pay extra attention to those emotionally charged sentences.
However, the machine only focuses on its 2-gram loss function and may prefer the more informative sentences, simply because they cover more content.
Hence we think there should be some emotion-related weights, added as extra layers in the deep-learning network, so that the model can better simulate a real, emotional human.
You are welcome to point out any mistakes and faults!