Natural language processing
These are notes from lesson 4 of Fast AI Practical Deep Learning for Coders.
Kaggle NLP pattern similarity notebook: see notebook
1. NLP Background and Transformers
NLP applications: categorising documents, translation, text generation.
We are using the Huggingface transformers library for this lesson. It is now incorporated into the fastai library.
ULMFit is an algorithm which uses fine-tuning, in this example to train a positive/negative sentiment classifier in 3 steps: 1. Train an RNN on Wikipedia to predict the next word. No labels required. 2. Fine-tune this on IMDb reviews to predict the next word of a movie review. Still no labels required. 3. Fine-tune this to classify the sentiment of a review.
Transformers have overtaken ULMFit as the state-of-the-art.
Looking “inside” a CNN, the first layer contains elementary detectors like edge detectors, blob detectors, gradient detectors etc. These get combined in non-linear ways to make increasingly complex detectors. Layer 2 might combine vertical and horizontal edge detectors into a corner detector. By the later layers, it is detecting rich features like lizard eyes, dog faces etc. See Zeiler and Fergus.
For the fine-tuning process, the earlier layers are unlikely to need changing because they are more general. So we only need to fine-tune (i.e. re-train) the later layers.
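A minimal sketch of what that looks like in code (assuming a torchvision ResNet, which is not the model used in this lesson): freeze the earlier, general-purpose layers and train only the final, task-specific layer.

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the earlier layers: they hold general detectors (edges, corners, ...)
# that rarely need changing.
for param in model.parameters():
    param.requires_grad = False

# Replace and train only the final layer for the new task (e.g. 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)
```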
2. Kaggle Competition Walkthrough
As an example of NLP in practice, we walk through this notebook for U.S. Patent Phrase to Phrase Matching.
Reshape the input to fit a standard NLP task
- We want to learn the similarity between two fields and are provided with similarity scores.
- We concatenate the fields of interest (with identifiers in between). The identifiers themselves are not important; they just need to be consistent.
- The NLP model is then a supervised regression task: predict the score given the concatenated string.
```python
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
```
2.1 Tokenization
Split the text into tokens. Tokens are, broadly speaking, words.
There are some caveats to that, as some languages like Chinese don’t fit nicely into that model. We don’t want the vocabulary to be too big. Alternative approaches use character tokenization rather than word tokenization.
In practice, we tokenize into subwords.
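A minimal sketch of subword tokenization with the Hugging Face AutoTokenizer (using the deberta-v3-small checkpoint that appears later in these notes; the example sentence is the one from the lesson):

```python
from transformers import AutoTokenizer

# Any pretrained tokenizer from the model hub works the same way;
# deberta-v3-small is the checkpoint used later in the lesson.
tokz = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")

tokz.tokenize("A platypus is an ornithorhynchus anatinus.")
# Common words stay whole, rare words get split into subword pieces, e.g.
# ['▁A', '▁platypus', '▁is', '▁an', '▁or', 'ni', 'tho', ...]
```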
2.2 Numericalization
Map each unique token to a number, i.e. its index in the vocabulary (conceptually, a one-hot encoding).
The choice of tokenization and numericalization depends on the model you use. Whoever trained the model chose a convention for tokenizing. We need to be consistent with that if we want the model to work correctly.
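Continuing the sketch above, the same tokenizer handles numericalization, which is why staying consistent with the model's own tokenizer matters:

```python
# The same tokenizer maps each token to its integer ID in the vocabulary.
toks = tokz.tokenize("A platypus is an ornithorhynchus anatinus.")
ids = tokz.convert_tokens_to_ids(toks)

# Calling the tokenizer directly does both steps and adds the special
# tokens the model expects.
tokz("A platypus is an ornithorhynchus anatinus.")["input_ids"]
```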
2.3. Models
The Huggingface model hub contains thousands of pretrained models.
For NLP tasks, it is useful to choose a model that was trained on a similar corpus, so you can search the model hub. In this case, we search for “patent”.
Some models are general purpose, e.g. deberta-v3 used in the lesson.
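A sketch of loading such a model from the hub for this task (setting num_labels=1 is an assumption that mirrors the competition setup, not a line quoted from the notebook):

```python
from transformers import AutoModelForSequenceClassification

# num_labels=1 gives a single-output head, so the model can be trained
# to predict the similarity score directly (a regression task).
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-small", num_labels=1
)
```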
ULMFit handles large documents better as it can split up the document. Transformer approaches require loading the whole document into GPU memory, so they struggle with longer documents.
2.4. Overfitting
If a model is too simple (i.e. not flexible enough) then it cannot fit the data and will be biased, leading to underfitting.
If the model fits the data points too closely, it is overfitting.
A good validation set, and monitoring validation error rather than training error as a metric, is key to avoiding overfitting. See this article for an in-depth discussion on the importance of choosing a good validation set.
Often people default to a random train/test split (this is what scikit-learn's train_test_split does), but this is very often a BAD idea.
For time-series data, it’s easier to infer gaps than it is to predict a block in the future. The latter is the more common task but a random split simulates the former, giving unrealistically good performance.
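For time-series data, a simple alternative to a random split is to hold out the most recent block (a rough sketch, assuming a pandas DataFrame with a date column; not code from the lesson):

```python
# Sort by time and hold out the most recent 20% as validation,
# instead of sampling rows at random.
df = df.sort_values("date")  # assumes a 'date' column
cut = int(len(df) * 0.8)
train_df, valid_df = df.iloc[:cut], df.iloc[cut:]
```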
For image data, there may be people, boats, etc. that appear in the training set but not the test set. If the validation set doesn't contain new people, the model can learn things about specific people/boats that it can't rely on in practice.
2.5. Metrics vs Loss Functions
Metrics are things that are human-understandable.
Loss functions should be smooth and differentiable to aid in training.
These can sometimes be the same thing, but not in general. For example, accuracy is a good metric in image classification. We could tweak the weights in a way that improves the model slightly, but not so much that it now correctly classifies a previously incorrect image. This means the metric function is bumpy, and therefore a bad loss function.
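A toy illustration of this (made-up numbers, not from the lesson): a small improvement in the predicted probability moves a smooth loss like cross-entropy, but leaves accuracy unchanged.

```python
import numpy as np

y = 1                            # true label
p_before, p_after = 0.40, 0.45   # predicted probability before/after a tiny weight tweak

def accuracy(p):
    return int((p > 0.5) == y)

def log_loss(p):
    return -np.log(p) if y == 1 else -np.log(1 - p)

print(accuracy(p_before), accuracy(p_after))   # 0 0        -> the metric doesn't move
print(log_loss(p_before), log_loss(p_after))   # 0.92 0.80  -> the loss improves smoothly
```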
This article goes into more detail on the choice of metrics.
AI can be particularly dangerous at reinforcing systematic biases: because it is so good at optimising metrics, it will conform to any biases present in the training data. Making decisions based on the model then reinforces those biases.
Goodhart’s law applies: when a measure becomes a target, it ceases to be a good measure.
2.6. Correlations
The best way to understand a metric is not to look at the mathematical formula, but to plot some data for which the metric is high, medium and low, then see what that tells you.
After that, look at the equation to see if your intuition matches the logic.
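The metric here is the Pearson correlation coefficient (the competition's evaluation metric), so one way to build that intuition is to compute it on small made-up examples (a sketch using numpy; the numbers are invented):

```python
import numpy as np

# Invented targets and predictions, just to see how the metric behaves.
targets = np.array([0.00, 0.25, 0.50, 0.75, 1.00])
preds = np.array([0.10, 0.30, 0.40, 0.80, 0.90])

# np.corrcoef returns the 2x2 correlation matrix; element [0, 1] is r.
r = np.corrcoef(targets, preds)[0, 1]
print(r)   # close to 1.0 here, since the predictions track the targets
```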
2.7 Choosing a learning rate
Fast AI has a function to find a good starting point. Otherwise, pick a small value and keep doubling it until it falls apart.
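That function is fastai's learning rate finder (a sketch, assuming you already have a fastai Learner called learn, which these notes don't build):

```python
# Assuming `learn` is an existing fastai Learner. lr_find() runs a short
# mock training sweep over increasing learning rates, plots loss vs lr,
# and suggests a sensible starting value.
suggestion = learn.lr_find()
print(suggestion)   # e.g. SuggestedLRs(valley=...)
```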