Hands-On LLMs: Intro

Part 1: Introduction to LLMs

Categories: AI, Engineering, GenerativeAI, LLM

Author: Gurpreet Johl

Published: November 26, 2024

Introduction to Large Language Models

1. A Brief History of Language AI

1.1. Bag of Words

This approach originated in the 1950s but gained popularity in the 2000s.

It treats unstructured text as a bag of words, discarding any information about the position or ordering of the words and any semantic meaning of the text.

  1. Tokenise the text. A straightforward way to do this is to split on the spaces so we have a list of words.
  2. Create a vocabulary of length N, containing every word in our training data.
  3. We can then represent any sentence or document as a binary (or count) N-dimensional vector indicating which vocabulary words it contains.
  4. Use those vectors for downstream tasks, e.g. cosine similarity between vectors to measure the similarity of documents for recommender systems.

For example, if our vocabulary contains the words: that is a cute dog my cat

Then we can encode the sentence “that is a cute dog” as:

| that | is | a | cute | dog | my | cat |
|------|----|---|------|-----|----|-----|
| 1    | 1  | 1 | 1    | 1   | 0  | 0   |

And another sentence “my cat is cute” as:

| that | is | a | cute | dog | my | cat |
|------|----|---|------|-----|----|-----|
| 0    | 1  | 0 | 1    | 0   | 1  | 1   |

Then to compare how similar the two sentences are, we can compare those vectors, for example using cosine similarity.
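
As a concrete illustration, here is a minimal sketch of this pipeline using scikit-learn (the use of CountVectorizer and cosine_similarity is my choice for illustration, not part of the original example):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["that is a cute dog", "my cat is cute"]

# Steps 1-2: tokenise on whitespace and build the vocabulary
vectorizer = CountVectorizer(token_pattern=r"\S+")

# Step 3: represent each sentence as a count vector over the vocabulary
vectors = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())

# Step 4: compare the two sentences with cosine similarity
print(cosine_similarity(vectors[0], vectors[1]))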

1.2. Word2Vec

A limitation of bag-of-words is that it makes no attempt to capture meaning from the text, treating each word as an unrelated token. By encoding text as one-hot encoded vectors, it does not capture that the word “cute” might be similar to “adorable” or “scrum-diddly-umptious”; every word is simply an arbitrary element of the vocabulary.

Dense vector embeddings attempt to capture these differences; rather than treating words as discrete elements, we can introduce a continuous scale for each embedding dimension, and learn where each word falls on the scale. Word2Vec was an early, and successful, approach to generating these embeddings.

The approach is to:

  1. Assign every word in the vocabulary an (initially random) vector of the embedding dimension, say 50.
  2. Take pairs of words from the training data, and train a model to predict whether they are likely to be neighbours in a sentence.
  3. If two words typically share the same neighbouring words, they are likely to share similar embedding vectors, and vice versa.

Illustrated word2vec provides a deeper dive.

These embeddings then have interesting properties. The classic example is that adding/subtracting the vectors for the corresponding words gives: \[ king - man + woman \approx queen \]
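
A quick sketch of this analogy using gensim's downloader and the pretrained Google News word2vec vectors (the dataset name and the exact nearest neighbour returned are assumptions, not part of the original post):

import gensim.downloader as api

# Downloads the pretrained Google News word2vec vectors (~1.6 GB) on first use
wv = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))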

The numbers don’t lie and they spell disaster

- “Big Poppa Pump” Scott Steiner

  • N-grams are sliding windows of N words sampled from text. These can be used to train a model where the input is N words and the output is the predicted next word.
  • Continuous bag of words (CBOW) tries to predict a missing word given the N preceding and following words.
  • Skip-grams take a single word and try to predict the surrounding words. They are the “opposite” of CBOW.
| Architecture | Task                    | Inputs                   | Output                   |
|--------------|-------------------------|--------------------------|--------------------------|
| N-gram       | The numbers ___         | [The, numbers]           | don’t                    |
| CBOW         | The numbers ___ lie and | [The, numbers, lie, and] | don’t                    |
| Skip-gram    | ___ ___ don’t ___ ___   | don’t                    | [The, numbers, lie, and] |

Negative sampling is used to speed up the next-word prediction process. Instead of predicting the next token (a computationally expensive neural network), we reframe the task as “given two words, what is the probability that they are neighbours?” (a much faster logistic regression problem.)

But the issue is that our training dataset only has positive examples of neighbours, so the model could just always output 1 to get 100% accuracy. To avoid this, we introduce negative examples by taking random combinations of words in the vocabulary that aren’t neighbours. This idea is called noise-contrastive estimation.

Word2vec is then just “skip gram with negative sampling” to generate word embeddings.
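
A minimal training sketch with gensim, assuming gensim 4.x (sg=1 selects skip-gram and negative=5 enables negative sampling; the tiny toy corpus is purely illustrative):

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "numbers", "dont", "lie"],
    ["the", "numbers", "spell", "disaster"],
]

# Skip-gram (sg=1) with negative sampling (negative=5), 50-dimensional embeddings
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(model.wv["numbers"])                      # the learned 50-dimensional embedding
print(model.wv.most_similar("numbers", topn=2)) # nearest neighbours in embedding space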

Types of embeddings

There are different types of embeddings that indicate different levels of abstraction.

We can create an embedding for a sub-word, a word, a sentence or a whole document. In each case, the result is an N-dimensional vector where N is the embedding size.
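
For instance, a sentence-level embedding can be produced with the sentence-transformers library (the library and the model name all-MiniLM-L6-v2 are illustrative choices, not from the original post):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# One fixed-size vector per sentence, regardless of sentence length
embeddings = model.encode(["that is a cute dog", "my cat is cute"])
print(embeddings.shape)  # (2, 384) for this model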

1.3. RNNs

The embeddings so far have been static: the embedding for “bank” will be the same regardless of whether it’s referring to the bank of a river or a branch of Santander.

The next development notes that the embeddings should vary depending on their context, i.e. the surrounding words.

Attention mechanisms were initially used with Recurrent Neural Networks (RNNs). These encoder-decoder models would:

  1. Take pre-generated embeddings (say, from word2vec) as inputs
  2. Pass this to an encoder RNN to generate a context embedding
  3. Pass this to a decoder RNN to generate an output, such as the input text translated to another language.

flowchart LR

  A(Pre-generated embeddings) --> B[[Encoder RNN]] --> C(Context embedding) --> D[[Decoder RNN]] --> E(Output)
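
A minimal PyTorch sketch of this encoder-decoder pattern (omitting the attention mechanism for brevity; the GRU choice and the dimensions are assumptions made for illustration):

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, emb_dim=50, hidden_dim=128, vocab_size=10_000):
        super().__init__()
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_emb, tgt_emb):
        _, context = self.encoder(src_emb)           # context embedding = final encoder hidden state
        dec_out, _ = self.decoder(tgt_emb, context)  # decoder conditioned on the context
        return self.out(dec_out)                     # logits over the target vocabulary

model = EncoderDecoder()
src = torch.randn(1, 7, 50)   # e.g. 7 source-word embeddings (batch of 1)
tgt = torch.randn(1, 5, 50)   # e.g. 5 target-word embeddings (teacher forcing)
print(model(src, tgt).shape)  # torch.Size([1, 5, 10000])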

1.4. Transformers

Transformers were introduced in the 2017 paper “Attention Is All You Need”, which relied solely on the attention mechanism and removed the RNNs entirely.

The original transformer was an encoder-decoder model for machine translation.

flowchart LR

  A(Pre-generated embeddings) --> B[[Transformer Encoder]] --> C[[Transformer Decoder]] --> E(Output)
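
As an illustration of an encoder-decoder transformer used for translation, the Hugging Face pipeline API can be used (t5-small is an illustrative stand-in, not the original transformer from the paper):

from transformers import pipeline

# t5-small is an encoder-decoder transformer fine-tuned for several tasks,
# including English-to-German translation
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("That is a cute dog.")[0]["translation_text"])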

1.5. Representation Models

By splitting the encoder-decoder architecture and focusing only on the encoder, we can create models that excel at creating meaningful representations of language.

This is the premise behind models like BERT. A special classification token ([CLS]) is prepended to the input, and the encoder alone is trained.
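
For example, an encoder-only representation model fine-tuned for sentiment classification can be used via the pipeline API (the DistilBERT checkpoint named below is an illustrative choice):

from transformers import pipeline

# An encoder-only (BERT-style) model fine-tuned for sentiment classification
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("That is a cute dog."))  # e.g. [{'label': 'POSITIVE', 'score': ...}]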

1.6. Generative Models

Similarly, we can split the encoder-decoder architecture and focus only on the decoder. These models excel at text generation.

This is the premise behind models like GPT.

Generative LLMs are essentially sequence-to-sequence machines: given some input text, predict the next tokens. The primary use case these days is fine-tuning them into “instruct” or “chat” models that are trained to provide an answer when given a question.

Foundation models are pre-trained base models that can be fine-tuned for specific tasks.

1.7. Other Modern Architectures

Aside from transformers, alternative architectures such as Mamba (a state-space model) and RWKV (an RNN-based architecture) also perform well.

2. How Large is a Large Language Model?

The size of a “large” language model is a moving target as the field develops and model sizes scale.

Considerations:

  • What if a new model has the same capabilities as an existing LLM but with a fraction of the parameters? Is this new model still “large”?
  • What if we train a model the same size as GPT-4 but for text classification instead of generation? Is it still an LLM?

The creation of LLMs is typically done in two stages:

  1. Language modeling: Create a foundation model by (self-supervised) training on a vast corpus of text. This step allows the model to learn the grammar, structure and patterns of the language. It is not yet directed at a specific task. This takes the majority of the training time.
  2. Fine-tuning: Using the foundation model for (supervised) training on a specific task.
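
A hedged illustration of the difference between the two stages, using a small publicly available checkpoint as a stand-in (gpt2 as a base model; the instruction-tuned Phi-3 model appears in Section 4):

from transformers import pipeline

# Stage 1 result: a base (foundation) model that only continues text
base = pipeline("text-generation", model="gpt2", max_new_tokens=20)
print(base("The capital of France is")[0]["generated_text"])

# Stage 2 result: an instruction-tuned model (see Section 4) instead expects a
# chat-style prompt and is trained to answer questions directly.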

3. Ethical Considerations of LLMs

  • Bias and fairness: Training data may contain implicit biases, and since it is seldom shared, these are hard to audit
  • Transparency and accountability: Unintended consequences can arise when there is no “human in the loop”. Who is accountable for the outcomes of the LLM? The company that trained it? The one that deployed it? Or the end user, such as a patient in a medical application?
  • Generating harmful content
  • Intellectual property: Who owns the output of an LLM? The user? The company that trained it? Or the original creators of the training data?
  • Regulation

4. Using an LLM Locally

The following can be run in Google Colab on a free GPU.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer 
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a pipeline 
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

# The prompt (user input / query) 
messages = [{"role": "user", "content": "Create a funny joke about chickens."}]

# Generate output 
output = generator(messages)
print(output[0]["generated_text"])

>>> Why don't chickens like to go to the gym? Because they can't crack the egg-sistence of it!
