FastAI Lesson 3: How Does a Neural Network Learn?

Practical Deep Learning for Coders: Lesson 3

Categories: AI, Engineering, FastAI

Author: Gurpreet Johl

Published: September 1, 2023

How Does a Neural Network Learn?

These are notes from lesson 3 of Fast AI Practical Deep Learning for Coders.

Homework Task

Recreate the spreadsheet to train a linear model and a neural network from scratch: see spreadsheet

1. Model Selection

Some options for cloud environments: Kaggle, Colab, Paperspace.

A comparison of performance vs. training time for different image models, useful for model selection, is provided here. ResNet and ConvNeXt are generally good architectures to start with.

Best practice is to start with a "good enough" model that trains quickly, so you can iterate rapidly. Then, as your research progresses and you need a better model, you can move to a slower, more accurate one.

The criteria we generally care about are:

  1. How fast they are
  2. How much memory they use
  3. How accurate they are

A learner object contains the pre-processing steps and the model itself.
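
For instance, a minimal sketch of building a learner in fastai might look like the following (the dataset, labelling rule and architecture below are illustrative assumptions, not the lesson's exact code):

from fastai.vision.all import *

# Assumed example dataset: the Oxford-IIIT Pets images, labelled cat vs dog by filename case
path = untar_data(URLs.PETS) / "images"

def is_cat(fname):
    return fname[0].isupper()  # in this dataset, cat images have capitalised filenames

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

# The learner bundles the dataloaders (with their pre-processing) and the model architecture
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(1)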

2. Fitting a Quadratic Function

How do we fit a function to data? An ML model is just a function fitted to data. Deep learning is just fitting an infinitely flexible model to data.

In this notebook there is an interactive cell to fit a quadratic function to some noisy data: y = ax^2 + bx + c

We can vary a, b and c to get a better fit by eye. We can make this more rigorous by defining a loss function to quantify how good the fit is.

In this case, use the mean absolute error: mae = mean(abs(actual - preds)). We can then use stochastic gradient descent to automate the tweaking of a, b and c to minimise the loss.
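
A minimal sketch of that setup (the data-generating values below are assumptions; the notebook defines its own quad_mae in the same spirit):

import torch

# Noisy samples around an assumed "true" quadratic
x = torch.linspace(-2, 2, steps=20)
y = 3 * x**2 + 2 * x + 1 + torch.randn(20) * 0.5

def quad(a, b, c, x):
    return a * x**2 + b * x + c

def mae(preds, actual):
    return (preds - actual).abs().mean()

def quad_mae(params):
    a, b, c = params  # unpack the three coefficients
    return mae(quad(a, b, c, x), y)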

We can store the parameters as a tensor; PyTorch will then calculate gradients of the loss function with respect to that tensor.

abc = torch.tensor([1.1, 1.1, 1.1])
abc.requires_grad_()  # Modifies abc in place so that gradients are tracked for anything computed from abc downstream
loss = quad_mae(abc)  # `loss` is computed from abc, so PyTorch can give us the gradient of the loss too
loss.backward()  # Back-propagate to compute the gradients
abc.grad  # Returns a tensor of the loss gradients with respect to a, b and c

We can then take a "small step" in the direction that will decrease the loss. The size of the step should be proportional to the size of the gradient. We define a learning rate hyperparameter to determine how much to scale the gradients by.

learning_rate = 0.01
with torch.no_grad():  # The update step itself should not be tracked for gradients
    abc -= abc.grad * learning_rate
loss = quad_mae(abc)  # The loss should now be smaller

We can repeat this process, taking multiple steps to minimise the loss.

for step in range(10):
    loss = quad_mae(abc)
    loss.backward()
    with torch.no_grad():
        abc -= abc.grad * learning_rate
    abc.grad = None  # Reset gradients so they don't accumulate across steps
    print(f'step={step}; loss={loss:.2f}')

The learning rate should decrease as we get closer to the minimum to ensure we don’t overshoot the minimum and increase the loss. A learning rate schedule can be specified to do this.
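
For example, a minimal sketch of a schedule that simply decays the learning rate each step (the decay factor is an assumption; in practice PyTorch and fastai provide proper schedulers):

learning_rate = 0.01
decay = 0.9  # assumed decay factor

for step in range(10):
    loss = quad_mae(abc)
    loss.backward()
    with torch.no_grad():
        abc -= abc.grad * learning_rate
    abc.grad = None            # reset gradients for the next step
    learning_rate *= decay     # take smaller steps as we approach the minimum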

3. Fit a Deep Learning Model

For deep learning, the premise is the same but instead of quadratic functions, we fit ReLUs and other non-linear functions.

The universal approximation theorem states that this construction is infinitely expressive if enough ReLUs (or other non-linear units) are combined. This means we can learn any computable function.

A ReLU is essentially a linear function with the negative values clipped to 0.

import numpy as np

def relu(m, c, x):
    # A line with slope m and intercept c, with negative values clipped to 0
    y = m * x + c
    return np.clip(y, 0., None)
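
For example (input values chosen arbitrarily):

x = np.linspace(-2, 2, 5)   # array([-2., -1., 0., 1., 2.])
relu(1.0, 0.5, x)           # array([0. , 0. , 0.5, 1.5, 2.5]) - negative outputs are clipped to 0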

This is all that deep learning is! All we need is:

  1. A model - a bunch of ReLUs combined will be flexible
  2. A loss function - mean absolute error between the actual data values and the values predicted by the model
  3. An optimiser - stochastic gradient descent can start from random weights to incrementally improve the loss until we get a “good enough” fit
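
Putting these three pieces together, a minimal from-scratch sketch in PyTorch might look like this (the toy data and hyperparameters are assumptions, in the same spirit as the homework spreadsheet's sum of two ReLUs):

import torch

# Toy data: noisy samples from an assumed underlying curve
x = torch.linspace(-2, 2, steps=100)
y = x**2 + torch.randn(100) * 0.1

# 1. The model: a sum of two ReLUs, each with its own slope and intercept
params = torch.randn(2, 2, requires_grad=True)

def model(params, x):
    out = torch.zeros_like(x)
    for m, c in params:
        out = out + torch.clamp(m * x + c, min=0.)
    return out

# 2. The loss function: mean absolute error
def mae(preds, actual):
    return (preds - actual).abs().mean()

# 3. The optimiser: plain gradient descent on the parameters
learning_rate = 0.01
for step in range(200):
    loss = mae(model(params, x), y)
    loss.backward()
    with torch.no_grad():
        params -= params.grad * learning_rate
    params.grad = None  # reset gradients so they don't accumulate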

We just need enough time and data. There are a few hacks to decrease the time and data required:

  • Data augmentation
  • Running on GPUs to parallelise matrix multiplications
  • Convolutions to skip over values to reduce the number of matrix multiplications required
  • Transfer learning - initialise with parameters from another pre-trained model instead of random weights.
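
As a hedged illustration, two of these hacks appear directly in the fastai API (re-using the assumed path and is_cat from the model-selection sketch above; the arguments are illustrative):

# Data augmentation: random flips, rotations and zooms applied to each batch
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224), batch_tfms=aug_transforms())

# Transfer learning: vision_learner uses ImageNet-pretrained weights by default,
# and fine_tune adapts them to the new task rather than starting from random weights
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(1)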

This spreadsheet from the homework task is a worked example of manually training a multivariate linear model, then extending that to a neural network summing two ReLUs.
