Prompt Engineering
Prompt engineering is an umbrella term for techniques that can improve the quality of a response from an LLM.
It also extends beyond improving responses: the same techniques can be used to evaluate a model’s outputs and to implement guardrails on its responses.
1. How the model interprets a basic prompt
It’s helpful to start by studying how a model sees a basic prompt.
1.1. Special Tokens
We can explore the tokens used by a model in a transformers pipeline by inspecting the chat template.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
This gives a prompt in the form:
<s><|user|> Create a funny joke about chickens.<|end|> <|assistant|>
There are special tokens to indicate:
- <s> - the start of the prompt
- <|user|> - when a user (i.e. you) begins their message
- <|end|> - when that message ends
- <|assistant|> - when the assistant (the LLM) begins their message.
This gets passed to the model to complete the sequence, one token at a time, until it generates an <|end|> token.
These special tokens help the model keep track of the context.
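For reference, here is a minimal end-to-end sketch, assuming the microsoft/Phi-3-mini-4k-instruct checkpoint (which uses the special tokens shown above; other chat models define different templates):

from transformers import pipeline

# Load a chat-tuned model (assumed checkpoint; any chat model works, but its
# special tokens will differ; older transformers versions may need trust_remote_code=True)
pipe = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Render the chat template without tokenizing, so we can read the raw prompt
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # prints the templated prompt shown above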
1.2. Temperature
LLMs are fundamentally trained neural networks, so for a given input their raw output (the next-token probabilities) is deterministic.
But we don’t want to always get the same response to a given prompt. That would be boring. We want a bit of pizazz.
When the model is predicting the next token, what it actually does is create a probability distribution over all possible tokens in the vocabulary. It then samples from this distribution to choose the next word.
Temperature defines how likely it is to choose less probable tokens.
A temperature=0 will always choose the most likely token; this is greedy sampling.
Higher temperatures lead to more creative outputs.
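To make this concrete, here is a small sketch (with made-up logits over a toy vocabulary) of how temperature reshapes the distribution before sampling:

import torch

# Made-up logits for a tiny 4-token vocabulary
logits = torch.tensor([4.0, 3.0, 2.0, 1.0])

for temperature in [0.2, 1.0, 2.0]:
    # Dividing the logits by the temperature sharpens (<1) or flattens (>1) the distribution
    probs = torch.softmax(logits / temperature, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    print(temperature, probs, next_token.item())

# temperature=0 is not a division by zero in practice: it means taking the argmax (greedy sampling)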
1.3. Constraining the sample space: Top p and Top k
A related concept in sampling is to restrict the sample space, so rather than having the possibility (however small) of selecting any token in the vocab list, we constrain the possibilities.
The top_p parameter considers tokens from most to least probable until the cumulative probability reaches the given value. So top_p=1 considers all tokens, and a lower value filters out less probable tokens. This is known as nucleus sampling.
The top_k parameter restricts the sample space to the k most probable tokens.
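In a transformers pipeline these are simply generation parameters; a sketch with illustrative values, reusing the pipe and messages from section 1.1:

output = pipe(
    messages,
    do_sample=True,       # enable sampling; otherwise generation is greedy
    temperature=0.7,      # soften the distribution a little
    top_p=0.9,            # nucleus sampling: keep tokens covering 90% of the probability mass
    top_k=50,             # never consider more than the 50 most probable tokens
    max_new_tokens=100,
)
print(output[0]["generated_text"])  # the conversation, including the new assistant reply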
1.4. Example use cases
Use case | temperature | top_p | Description |
---|---|---|---|
Brainstorming | High | High | Creative and unexpected responses |
Email generation | Low | Low | Predictable and focused responses that aren’t too out there |
Creative writing | High | Low | Creative but still coherent |
Translation | Low | High | Deterministic output but with linguistic variety |
2. Basic prompt engineering
2.1. The ingredients of a prompt
The most basic prompt is simply an input, without even an instruction; the LLM will simply complete the sequence. E.g. the input “The sky is” might be completed with “blue”.
We can extend this to an instruction prompt where we now have two components:
- Instruction - Classify the text into negative or positive
- Data - “This is a great movie!”
The model may have seen similar instructions in its training data, or at least seen similar enough examples to allow it to generalise.
This is called instruction-based prompting.
2.2. Instruction-based tasks
It’s helpful to understand the common tasks LLMs are used to perform: the model has likely been trained on such examples, so if you use phrasing it is familiar with, it is more likely to give you the desired output.
- Classification - Classify the text into positive, neutral or negative.
- Search - Find the ___ in the following text
- Summarization - Summarise the following text
- Code generation - Generate python code for…
- Named entity recognition - An entity is… Extract the named entities from the following text
The common features of these prompts for these different tasks are:
- Specificity - accurately describe what you want to achieve
- Hallucination - LLMs are confident, not necessarily correct. We can ask the LLM to only give an answer if it knows the answer, and otherwise respond with “I don’t know” (see the example after this list)
- Order - The instruction should come either at the beginning or end. LLMs tend to focus at either extreme, known as the primacy effect and the recency effect respectively.
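Putting these features together, a (hypothetical) classification prompt might look like:

Classify the following customer review as positive, neutral or negative. Respond with only one of those three words. If you cannot tell, respond with "I don't know".
Text: "The delivery was quick but the packaging was damaged."
Sentiment: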
3. Modular components of prompts
We can build a complex prompt out of reusable components.
Component | Description | Example |
---|---|---|
Persona | Describe what role the LLM should take on | “You are an expert in astrophysics” |
Instruction | The specific task (see last section) | “Classify the following text…” |
Context | Additional info on the problem or task | “Your summary should extract the crucial points for executives to understand the vital information” |
Format | Useful if you need a particular structured output | “Create a bullet point summary” |
Audience | Target of the response | “Explain like I’m 5 years old” |
Tone | The tone of voice of the response | “The tone should be professional and clear.” |
Emotional stimuli! | Get creative, make the model feel | “This is very important for my career” changes the response! |
Data | The data related to the task itself | … |
Due to the modular nature, we can add and remove components and judge the effect on the output.
The order of these components matters — the primacy and recency effects again — so experimentation is key.
This is also model-dependent; some models respond better to certain prompting.
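Since the components are just text, one simple (hypothetical) way to experiment is to keep them as separate strings and concatenate only the ones you want:

# Each component is a reusable string; comment one out to judge its effect on the output
persona = "You are an expert in astrophysics."
instruction = "Summarise the following text."
context = "Your summary should extract the crucial points for executives."
format_ = "Create a bullet point summary."
audience = "Explain it like I'm 5 years old."
tone = "The tone should be professional and clear."
text = "..."  # the document to summarise
data = f"Text: {text}"

prompt = " ".join([persona, instruction, context, format_, audience, tone, data])
messages = [{"role": "user", "content": prompt}]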
4. In-context learning
This is when we provide the model with examples of the correct behaviour.
4.1. Zero-shot
Zero-shot prompts provide no examples but outline the shape of the desired response. E.g.
Classify the text into positive, neutral or negative Text: The food was great! Sentiment:
4.2. One-shot and Few-shot
Few-shot prompts provide two or more examples; one-shot prompts provide a single example.
We can use the user and assistant roles in the prompt to distinguish the user prompt from the exemplar assistant output.
one_shot_prompt = [
{"role": "user",
"content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:" },
{"role": "assistant",
"content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home." },
{"role": "user",
"content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"}
]
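This can then be passed to the chat pipeline like any other conversation (a sketch, reusing the pipe from section 1.1):

output = pipe(one_shot_prompt, max_new_tokens=50)
print(output[0]["generated_text"])  # includes the model's sentence using 'screeg'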
5. Chain Prompting
Rather than dumping everything in a single prompt, we take the output of one prompt and use it as the input to the next.
This gives the LLM more time (and context tokens) to focus on each part.
E.g. if we want to generate a product name, slogan and sales pitch for a new product, we could ask for all of those in a single prompt.
Or we could chain prompt: ask for a name, then ask for a slogan given the name, then ask for a sales pitch given the name and slogan.
This idea is at the core of LangChain and agents.
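A minimal sketch of that product chain, using a hypothetical run_prompt helper around the pipe from section 1.1 (the exact return structure can vary between transformers versions):

def run_prompt(user_content):
    # Send a single user message and return the assistant's reply text
    output = pipe([{"role": "user", "content": user_content}], max_new_tokens=100)
    return output[0]["generated_text"][-1]["content"]

product = "a solar-powered phone charger"  # hypothetical product

# Each step feeds its output into the next prompt
name = run_prompt(f"Suggest a name for this product: {product}. Reply with only the name.")
slogan = run_prompt(f"Write a short slogan for a product called '{name}'.")
pitch = run_prompt(f"Write a two-sentence sales pitch for '{name}', whose slogan is '{slogan}'.")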
Two other use cases for chain prompting are:
- Response validation - send a follow-up prompt asking the model to check its answer
- Parallel prompts - send multiple prompts and ask the LLM to merge the answers
6. Reasoning
Emergent behaviour of LLMs “resembles” reasoning, though it is generally thought to arise from memorisation and pattern-matching rather than true reasoning. (Although who says that isn’t what humans are doing too?)
6.1. Chain of thought
We can coax this behaviour which mimics reasoning out of the LLM to improve the output.
This is analogous to the System 1 and System 2 thinking of Kahneman and Tversky; by default the LLM will give a knee-jerk System 1 response, but if we ask it to reason it will give a more considered System 2 response.
We can encourage this with a one-shot chain-of-thought prompt demonstrating reasoning. E.g.
cot_prompt = [
{"role": "user",
"content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
{"role": "assistant",
"content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
{"role": "user",
"content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]
This guides the model towards providing an explanation as well as the answer. “Show your working!”
There is a zero-shot chain-of-thought approach that doesn’t require us to give an example of reasoning. A common and effective method is to use this phrase to prime reasoning:
Let’s think step-by-step
(AKA the Bobby Valentino approach: slow down!)
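In prompt form this is just an extra sentence appended to the question, e.g. mirroring the few-shot example above:

zeroshot_cot_prompt = [
    {"role": "user",
     "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]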
6.2. Self-consistency
We can use the same prompt multiple times and get different responses (with differing quality) due to the sampling nature of LLMs.
We can sample multiple responses and ask the LLM to give the majority vote as the response.
This does make it slower and more expensive though, since we’re prompting n times for each query.
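A sketch of the idea, reusing the pipe from section 1.1 (do_sample and a non-zero temperature are what make the n responses differ):

question = ("The cafeteria had 23 apples. If they used 20 to make lunch "
            "and bought 6 more, how many apples do they have?")

# Sample several answers to the same prompt
answers = []
for _ in range(5):
    output = pipe(
        [{"role": "user", "content": question}],
        do_sample=True, temperature=0.7, max_new_tokens=100,
    )
    answers.append(output[0]["generated_text"][-1]["content"])

# Ask the model to return the majority-vote answer
vote_prompt = (
    f"Here are several candidate answers to the question '{question}':\n"
    + "\n".join(answers)
    + "\nWhat is the majority answer? Reply with that answer only."
)
final = pipe([{"role": "user", "content": vote_prompt}], max_new_tokens=20)
print(final[0]["generated_text"][-1]["content"])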
6.3. Tree of thought
This is a combination of the previous ideas.
Tree of thought = Chain of thought + Self consistency
We break the problem down into multiple steps. At each step, we ask the model to explore different solutions, then vote for the best solution(s) and move on to the next step. The thoughts are rated, with the most promising kept and the least promising pruned.
The disadvantage is that it requires even more calls to the model, slowing things down.
A zero-shot tree-of-thought approach is to ask the model to emulate a “discussion between multiple experts”. E.g.
zeroshot_tot_prompt = [
{"role": "user",
"content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]
An extension of Tree-of-thought is Graph-of-thought, where each prompt is treated like a node in a graph. Rather than a tree following a single train of thought, the prompts can be reused in different orders and combinations within a graph.
7. Output verification
There are several reasons we may wish to verify the output of the model:
- Structured output - the use case may require a certain format like JSON
- Valid output - if a model is asked to pick one of two choices, it should not create a third
- Ethics - biases, personal info leakage, etc
- Accuracy - fact-check
There are generally three ways to control the output of a generative model:
- Examples - of the expected output
- Grammar - control the token selection process
- Fine-tuning - tune a model on data which contains the desired output
Example-based approaches are essentially the in-context learning approach discussed previously. The model may still disregard your examples though.
Grammar-based approaches constrain the tokens the model is allowed to sample from, so the output is forced to follow the required structure. Packages like Guidance, Guardrails and LMQL can constrain and validate the output of LLMs (some do this by passing the output through another LLM to check it meets the requirements).
Some inference libraries, such as llama-cpp-python for Llama models, allow a response_format to be specified if it is a standard format like JSON.
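As an illustration, llama-cpp-python (assuming a local GGUF model file) accepts a response_format in its chat completion call; a sketch:

from llama_cpp import Llama

llm = Llama(model_path="model.gguf")  # assumed local model file

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three facts about chickens as JSON."}],
    response_format={"type": "json_object"},  # constrain the output to valid JSON
)
print(response["choices"][0]["message"]["content"])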
References
- Chapter 6 of Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst
- https://cameronrwolfe.substack.com/p/modern-advances-in-prompt-engineering