Hands-On LLMs: Tokens and Embeddings

Part 2: Tokens and Embeddings
AI | Engineering | GenerativeAI | LLM

Author: Gurpreet Johl

Published: November 28, 2024

Tokens and Embeddings

1. Tokenization

A text prompt sent to a model first needs to be broken down into tokens. The numerical token IDs are passed to the model.

The tokenization scheme is designed hand-in-hand with the model, so the two are coupled: each model has a corresponding tokenizer which transforms the prompt text into the expected numerical representation.

1.1. Tokenization in Action

We can load a model to see what tokenization looks like in practice.

We’ll load a smaller open-source model and its corresponding tokenizer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU
if torch.cuda.is_available():
    device = "cuda"
    device_map = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
    device_map = None  # device_map is left as None for MPS; the model is moved with .to(device) below
else:
    device = "cpu"
    device_map = "cpu"

1.1.1. Load the model

This can take a few minutes depending on your internet connection.

model_name = "microsoft/Phi-3-mini-4k-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
    trust_remote_code=True
)
model.to(device)
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)

1.1.2. Load the tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

1.1.3. Use the model to generate text

First we tokenize the input prompt. Then we pass this to the model. We can peek at each step to see what’s actually being passed around.

We’ll start with the following input prompt:

input_prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

The tokenizer converts this text to a list of integers. These are the input IDs that are passed to the model.

# Tokenize the input prompt 
input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to(device)
print(input_ids)
tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001]], device='mps:0')

We can “decode” these input IDs, converting them back to text, to see how the tokenizer splits the text. It uses sub-word tokens, so mishap is split into m, ish, ap. Punctuation gets its own token, and there is a special token for <|assistant|>. Spaces are encoded implicitly: tokens that start a new word carry a hidden prefix character marking the preceding space, while continuation tokens (such as ish and ap) do not.

for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))
Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>
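To see the hidden space marker directly, we can look at the raw token strings rather than the decoded text. A quick sketch using the tokenizer and input_ids from above; word-initial tokens carry the marker, continuation tokens do not.

# Look at the raw token strings; word-initial tokens start with a hidden "▁" marker
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))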

We can now pass this tokenized input to the model to generate new tokens.

# On Apple Silicon (MPS), the attention mask cannot be inferred automatically, so we pass it explicitly
if device == 'mps':
    model_kwargs = {'attention_mask': (input_ids != tokenizer.pad_token_id).long()}
else:
    model_kwargs = {}

# Generate the text 
generation_output = model.generate(input_ids=input_ids, max_new_tokens=100, **model_kwargs)

The output of the generation appends tokens to the input.

generation_output
tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001,  3323,   622, 29901, 17778, 29888,  2152,
          6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
            13,    13, 29928,   799, 19235, 29892,    13,    13,    13, 29902,
          4966,   445,  2643, 14061,   366,  1532, 29889,   306,   626,  5007,
           304,  4653,   590,  6483,   342,  3095, 11763,   363,   278,   443,
          6477,   403, 15134,   393, 10761,   297,   596, 16423, 22600, 29889,
            13,    13,    13,  2887,   366,  1073, 29892,   306,   505,  2337,
          7336,  2859,   278, 15409,   322, 22024,   339,  1793,   310,   596,
         16423, 29889,   739,   756,  1063,   263,  2752,   310,  8681, 12232,
           363,   592, 29892,   322,   306,   471,  1468, 24455,   304,   505,
           278, 15130,   304,  1371]], device='mps:0')

Again, we can decode this to see the output text:

# Print the output 
print(tokenizer.decode(generation_output[0]))
Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Heartfelt Apologies for the Gardening Mishap


Dear Sarah,


I hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in your garden yesterday.


As you know, I have always admired the beauty and tranquility of your garden. It has been a source of inspiration for me, and I was thrilled to have the opportunity to help

1.2. Tokenizer Design

There are three primary decisions that determine how the tokenizer splits a given prompt:

  1. Tokenization method: byte pair encoding (BPE), WordPiece
  2. Tokenizer parameters: vocabulary size, choice of special tokens
  3. Training data set: a tokenizer trained on English text will give different results to one trained on Punjabi text or Python code, etc.

Tokenizers are used on the input (to encode text -> numbers) and on the output (to decode numbers -> text).

1.2.1 Tokenization Methods

There are four prominent tokenization schemes:

  1. Word tokens.
    • Pros: Simple to implement and understand; can fit more text in a given context window
    • Cons: Unable to handle unseen words; vocab has lots of tokens for almost identical words (e.g. write, writing, written, wrote)
  2. Sub-word tokens.
    • Pros: Can represent new words by breaking down into other known tokens
    • Cons: Choice of partial words dictionary requires careful design
  3. Character tokens.
    • Pros: Can represent any new word
    • Cons: Modeling is more difficult; can’t fit as much text in a context window
  4. Byte tokens. Breaks text down into the individual bytes of its Unicode characters. This is also called a “tokenization-free representation”.
    • Pros: Can represent text of different alphabets, useful for multilingual models

Some tokenizers employ a hybrid approach. For example, GPT-2 uses sub-word tokenization and falls back to byte tokens for other characters.
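As a rough illustration of how the choice of method and training data changes the splits, the sketch below compares a BPE tokenizer (GPT-2) with a WordPiece tokenizer (BERT) on the same sentence. It assumes the transformers library is available and the tokenizers can be downloaded.

from transformers import AutoTokenizer

text = "Tokenizers handle unseen words like hyperparameterisation differently."

# GPT-2 uses BPE (word-initial tokens marked with "Ġ");
# BERT uses WordPiece (continuation tokens marked with "##")
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(text))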

Particular cases of interest that distinguish tokenizer (and model) performance are how the tokenizer handles the following (a short example follows this list):

  • Capitalization
  • Other languages
  • Emojis
  • Code - keywords and whitespace. Some models have different tokens for one space, two spaces, three spaces, four spaces etc.
  • Numbers and digits - does it encode each digit as a separate token or the number as a whole? E.g. 420 vs 4, 2, 0. Splitting into separate digits seems to help models perform maths better.
  • Special tokens - beginning of text, end of text, user/system/assistant tags, separator token used to separate two text inputs in similarity models.
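A quick sketch of some of these cases using the GPT-2 tokenizer (assuming it downloads as above); note how capitalisation, runs of whitespace and digits each change the split:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Capitalisation, whitespace runs and digits can all change the token split
for text in ["hello world", "HELLO WORLD", "    indented_code()", "The answer is 420"]:
    print(repr(text), "->", tok.tokenize(text))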

1.2.2. Tokenizer Parameters

The LLM designer makes decisions about the parameters of the tokenizer (see the sketch after this list):

  • Vocabulary size: \(\approx 50k\) is typical currently
  • Special tokens: Particular use cases may warrant special tokens, e.g. coding, research citations, etc
  • Capitalisation: Treat upper case and lower case as separate tokens? Or convert all to lower case?
  • Training data domain
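A quick sketch inspecting these choices for the Phi-3 tokenizer loaded in section 1.1.2 (assuming that tokenizer variable is still in scope):

# Vocabulary size and the special tokens defined by this tokenizer
print(len(tokenizer))
print(tokenizer.special_tokens_map)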

2. Embeddings

Now that we have represented language as a sequence of tokens, the next question is finding an efficient numerical representation of text to model the patterns we see.

For neural networks, it’s helpful (or even necessary) for inputs to have a consistent size, just as with tabular data or when resizing images for CNNs.

Therefore, it would be helpful to represent every word as an embedding vector of a pre-determined size.

This embedding approach allows us to apply the same ideas to different levels of text: character, sub-word, word, sentence, document.

Transformers take this a step further. Rather than a static embedding vector, the attention mechanism allows for contextualised embeddings that vary with the surrounding words.

We can explore a few examples of models that operate at different levels of abstraction.

2.1. Word Embeddings

DeBERTa is a small model that produces high-quality word embeddings.

from transformers import AutoModel, AutoTokenizer 

# Load a tokenizer 
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base") 

# Load a language model 
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall") 

# Tokenize the sentence 
tokens = tokenizer('Hello world', return_tensors='pt') 

# Process the tokens 
output = model(**tokens)[0]
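The output contains one contextualised embedding vector per token. A quick check of the shape, continuing from the cell above:

# One embedding per token: (batch_size, sequence_length, hidden_size)
print(output.shape)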

2.2. Sentence/Document Embeddings

Some models operate on sentences or entire documents.

A simple approach is to take the word embeddings for each word in the document, then average them. Some LLMs produce “text embeddings” which represent the whole text as an embedding vector directly.
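As a minimal sketch of the averaging approach, we could mean-pool the per-token DeBERTa embeddings from section 2.1 into a single vector (this reuses the output variable from that cell). The cell below it then shows the direct text-embedding approach using sentence-transformers.

# Average the per-token embeddings into one document-level vector
doc_embedding = output.mean(dim=1)
print(doc_embedding.shape)  # (batch_size, hidden_size)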

from sentence_transformers import SentenceTransformer 

# Load model 
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2") 

# Convert text to text embeddings 
vector = model.encode("Best movie ever!")
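A typical use of these text embeddings is measuring how similar two pieces of text are, for example with cosine similarity. A minimal sketch using the model loaded above (the second sentence is made up for illustration):

from sentence_transformers import util

# Encode two sentences and compare them with cosine similarity
embeddings = model.encode(["Best movie ever!", "I really enjoyed that film."])
print(util.cos_sim(embeddings[0], embeddings[1]))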

2.3. Non-LLM-based Embeddings

Embeddings are useful in NLP more generally, and some techniques, such as Word2Vec and GloVe, predate LLMs.

These can be useful to apply NLP to non-text applications, such as music recommendations.

Say we have a data set of songs belonging to playlists. This can help us learn which songs are similar, because similar songs are likely to be neighbouring on playlists, just as similar words are likely to be neighbouring in a sentence.

So we can convert each song to an ID, and treat a playlist like a sentence, i.e. it is just a sequence of tokens. Then we can train a Word2Vec model on it to get embedding vectors for each song.

Then, if we have a song we like, we can look at its embedding vector and find similar songs by finding the songs with the closest embeddings.
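A minimal sketch of this idea using gensim's Word2Vec, with a hypothetical playlists variable where each playlist is a list of song ID strings:

from gensim.models import Word2Vec

# Hypothetical data: each playlist is a "sentence" of song-ID "tokens"
playlists = [
    ["song_1", "song_7", "song_3"],
    ["song_7", "song_3", "song_9"],
    ["song_2", "song_5", "song_1"],
]

# Train song embeddings from co-occurrence within playlists
song2vec = Word2Vec(sentences=playlists, vector_size=32, window=5, min_count=1, sg=1)

# Find the songs whose embeddings are closest to a song we like
print(song2vec.wv.most_similar("song_7"))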
