Advanced Text Generation Techniques and Tools
Going beyond prompt engineering, there are several areas where we can improve the quality of generated text:
- Model I/O
- Memory
- Agents
- Chains
LangChain is a framework that provides useful abstractions for these components and helps connect them together. We will use LangChain here, but alternatives include LlamaIndex, DSPy and Haystack.
1. Model I/O
1.1. Quantization
We can load quantized models using the GGUF file format, a binary format optimised for fast loading and saving of models. GGUF files are typically converted from the original PyTorch weights and are designed for GGML-based runtimes such as llama.cpp.
The benefit of a quantized model is a smaller size in memory while retaining most of the original information. For example, if the model was trained using 32-bit floats for parameters, we can use 16-bit floats instead. This reduces the memory requirement but also reduces the precision. Often this trade-off is worthwhile.
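As a rough illustration of the trade-off, we can estimate the weight memory of a 3.8B-parameter model like Phi-3 mini at different precisions. This is a back-of-the-envelope sketch that ignores the KV cache and runtime overhead; the parameter count is approximate.
# Approximate weight memory for a ~3.8B-parameter model at different precisions
n_params = 3.8e9

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    size_gib = n_params * bytes_per_param / 1024**3
    print(f"{name}: ~{size_gib:.1f} GiB")

# fp16 works out to roughly 7.1 GiB, which matches the size of the GGUF file we download below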
This page goes into detail on the mechanics of quantisation.
The “best” model is constantly changing, so we can refer to the Open LLM leaderboard.
We can download a 16-bit quantized version of the Phi-3 mini model from HuggingFace:
!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
Then we can use LangChain to load the GGUF file.
Note: an alternative is to use the langchain_huggingface library.
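For reference, a minimal sketch of the langchain_huggingface route might look like the following. Note that this loads the original transformers weights rather than the GGUF file, so it needs more memory, and the exact arguments may differ between library versions.
from langchain_huggingface import HuggingFacePipeline

# Loads the full transformers model (not the quantized GGUF file)
hf_llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 500},
)

hf_llm.invoke("What is 1 + 1?")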
# This cell *should* work, but due to some funkiness with incompatible langchain vs llama versions (I think)
# it's easier to just create a custom LangChain wrapper in the following cell.
from langchain import LlamaCpp
# Make sure model_path points at the file location of the GGUF file
= "/Users/gurpreetjohl/workspace/python/ml-practice/ml-practice/models/"
MODEL_DIR = "Phi-3-mini-4k-instruct-fp16.gguf"
MODEL_NAME = MODEL_DIR+MODEL_NAME
model_path
= LlamaCpp(
llm =MODEL_DIR+MODEL_NAME,
model_path=-1,
n_gpu_layers=500,
max_tokens=2048,
n_ctx=42,
seed=False
verbose )
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/langchain_community/llms/__init__.py:312: LangChainDeprecationWarning: As of langchain-core 0.3.0, LangChain uses pydantic v2 internally. The langchain_core.pydantic_v1 module was a compatibility shim for pydantic v1, and should no longer be used. Please update the code to import from Pydantic directly.
For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. from pydantic.v1 import BaseModel
from langchain_community.llms.llamacpp import LlamaCpp
ValidationError: 1 validation error for LlamaCpp
client
Field required [type=missing, input_value={'model_path': '/Users/gu...': 42, 'verbose': False}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.10/v/missing
from typing import Any, Dict, List, Optional
from langchain_core.language_models import LLM
from llama_cpp import Llama
class CustomLlamaLLM(LLM):
    model_path: str
    n_gpu_layers: int = -1
    max_tokens: int = 500
    n_ctx: int = 2048
    seed: Optional[int] = 42
    verbose: bool = False
    client: Any = None

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Wrap the underlying llama-cpp-python client
        self.client = Llama(
            model_path=self.model_path,
            n_gpu_layers=self.n_gpu_layers,
            max_tokens=self.max_tokens,
            n_ctx=self.n_ctx,
            seed=self.seed,
            verbose=self.verbose
        )

    @property
    def _llm_type(self) -> str:
        return "CustomLlama"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # Generate a completion and return just the text of the first choice
        response = self.client(prompt, stop=stop)
        return response["choices"][0]["text"]
# Instantiate the custom LLM class
llm = CustomLlamaLLM(
    model_path=model_path,
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=True
)
llama_model_loader: loaded meta data with 23 key-value pairs and 195 tensors from /Users/gurpreetjohl/workspace/python/ml-practice/ml-practice/models/Phi-3-mini-4k-instruct-fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.name str = Phi3
llama_model_loader: - kv 2: phi3.context_length u32 = 4096
llama_model_loader: - kv 3: phi3.embedding_length u32 = 3072
llama_model_loader: - kv 4: phi3.feed_forward_length u32 = 8192
llama_model_loader: - kv 5: phi3.block_count u32 = 32
llama_model_loader: - kv 6: phi3.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi3.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi3.rope.dimension_count u32 = 96
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.pre str = default
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32064] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32064] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 130 tensors
llm_load_vocab: special tokens cache size = 323
llm_load_vocab: token to piece cache size = 0.1687 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 7.12 GiB (16.00 BPW)
llm_load_print_meta: general.name = Phi3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 7100.64 MiB, (23176.83 / 10922.67)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 187.88 MiB
llm_load_tensors: Metal buffer size = 7100.64 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
llama_kv_cache_init: Metal KV buffer size = 768.00 MiB
llama_new_context_with_model: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: Metal compute buffer size = 168.00 MiB
llama_new_context_with_model: CPU compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.padding_token_id': '32000', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'tokenizer.ggml.pre': 'default', 'general.file_type': '1', 'phi3.rope.dimension_count': '96', 'tokenizer.ggml.bos_token_id': '1', 'phi3.attention.layer_norm_rms_epsilon': '0.000010', 'phi3.attention.head_count_kv': '32', 'phi3.attention.head_count': '32', 'tokenizer.ggml.model': 'llama', 'phi3.block_count': '32', 'general.architecture': 'phi3', 'phi3.feed_forward_length': '8192', 'phi3.embedding_length': '3072', 'general.name': 'Phi3', 'phi3.context_length': '4096'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}
Using chat eos_token: <|endoftext|>
Using chat bos_token: <s>
ggml_metal_free: deallocating
2. Chains
2.1. Basic Prompt Chain
In LangChain, we use the invoke function to generate an output.
However, each model requires a specific prompt template. If we blindly call invoke on our model, we get no response:
"Hi! My name is Gurp. What is 1 + 1?") llm.invoke(
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 0.89 ms / 1 runs ( 0.89 ms per token, 1123.60 tokens per second)
llama_print_timings: prompt eval time = 840.28 ms / 18 tokens ( 46.68 ms per token, 21.42 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 841.04 ms / 19 tokens
''
This is where the LangChain abstractions come in useful.
We will create a simple chain with a single link:
flowchart LR
    subgraph PromptChain
        B(Prompt template) --> C[LLM]
    end
    A(User prompt) --> B
    C --> D(Output)
For our particular case, Phi-3 prompts require start, end, user and assistant tokens.
from langchain import PromptTemplate
# Create a prompt template with the "input_prompt" variable
= """<s><|user|> {input_prompt}<|end|> <|assistant|>"""
template = PromptTemplate(template=template, input_variables=["input_prompt"]) prompt
We can then create a chain by chaining the prompt and LLM together. Then we can call invoke
and get the intended text generation.
basic_chain = prompt | llm

basic_chain.invoke({"input_prompt": "Hi! My name is Gurp. What is 1 + 1?"})
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 5.89 ms / 16 runs ( 0.37 ms per token, 2716.93 tokens per second)
llama_print_timings: prompt eval time = 295.30 ms / 21 tokens ( 14.06 ms per token, 71.11 tokens per second)
llama_print_timings: eval time = 1349.54 ms / 15 runs ( 89.97 ms per token, 11.11 tokens per second)
llama_print_timings: total time = 1668.23 ms / 36 tokens
' Hello Gurp! The answer to 1 + 1 is 2'
Note that we just passed the entire input_prompt as a variable to the chain, but we can define other variables. For example, if we want a more specialised use case where we don’t give the user as much flexibility, we can pre-define some of the prompt.
= "Create a funny name for a business that sells {product}."
template = PromptTemplate(template=template, input_variables=["product"]) name_prompt
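We could then chain this specialised prompt with the LLM in the same way. A quick sketch of how it might be used (the product value is just an illustration):
# Chain the pre-defined prompt with the LLM; only the {product} variable is exposed
name_chain = name_prompt | llm
name_chain.invoke({"product": "socks"})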
2.2. Multiple Prompt Chain
If we have more complex prompts or use cases, we can split the task into smaller subtasks that run sequentially. Each link in the chain deals with a specific subtask.
flowchart LR
    subgraph MultiPromptChain
        B1(Prompt 1) --> B2(Prompt 2)
        B1 <--> C[LLM]
        B2 <--> C
    end
    A(User input) --> B1
    B2 --> D(Output)
As an example, we can prompt the LLM to create a story. First we ask it for a title based on the user prompt, then characters based on the title, then a story based on the characters and title. The first link is the only one that requires user input.
from langchain import LLMChain

# Create a chain for the title of our story
title_template = """<s><|user|>
Create a title for a story about {summary}. Only return the title.<|end|>
<|assistant|>"""
title_prompt = PromptTemplate(template=title_template, input_variables=["summary"])
title = LLMChain(llm=llm, prompt=title_prompt, output_key="title")

# Create a chain for the character description using the summary and title
character_template = """<s><|user|>
Describe the main character of a story about {summary} with the title {title}. Use only two sentences.<|end|>
<|assistant|>"""
character_prompt = PromptTemplate(template=character_template, input_variables=["summary", "title"])
character = LLMChain(llm=llm, prompt=character_prompt, output_key="character")

# Create a chain for the story using the summary, title, and character description
story_template = """<s><|user|>
Create a story about {summary} with the title {title}. The main character is: {character}. Only return the story and it cannot be longer than one paragraph.<|end|>
<|assistant|>"""
story_prompt = PromptTemplate(template=story_template, input_variables=["summary", "title", "character"])
story = LLMChain(llm=llm, prompt=story_prompt, output_key="story")

# Combine all three components to create the full chain
llm_chain = title | character | story
ggml_metal_free: deallocating
Now we can invoke the chain just like before:
"summary": "a dog that can smell danger"}) llm_chain.invoke({
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 1.34 ms / 16 runs ( 0.08 ms per token, 11958.15 tokens per second)
llama_print_timings: prompt eval time = 295.27 ms / 23 tokens ( 12.84 ms per token, 77.89 tokens per second)
llama_print_timings: eval time = 1424.98 ms / 15 runs ( 95.00 ms per token, 10.53 tokens per second)
llama_print_timings: total time = 1725.28 ms / 38 tokens
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 1.97 ms / 16 runs ( 0.12 ms per token, 8109.48 tokens per second)
llama_print_timings: prompt eval time = 190.41 ms / 44 tokens ( 4.33 ms per token, 231.08 tokens per second)
llama_print_timings: eval time = 1315.12 ms / 15 runs ( 87.67 ms per token, 11.41 tokens per second)
llama_print_timings: total time = 1511.43 ms / 59 tokens
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 1.48 ms / 16 runs ( 0.09 ms per token, 10803.51 tokens per second)
llama_print_timings: prompt eval time = 290.88 ms / 70 tokens ( 4.16 ms per token, 240.65 tokens per second)
llama_print_timings: eval time = 1480.88 ms / 15 runs ( 98.73 ms per token, 10.13 tokens per second)
llama_print_timings: total time = 1777.98 ms / 85 tokens
{'summary': 'a dog that can smell danger',
'title': ' "Scent of Peril: The Canine Detective\'s N',
'character': ' In the heartwarming tale, Scent of Peril follows Max,',
'story': ' In a quaint suburban neighborhood, there lived an extraordinary German Shepherd'}
3. Memory
By default, most LLMs will not remember what was said previously in the conversation. For example, if you give your name in one prompt, it will not be able to recall it later.
# Let's give the LLM our name
"input_prompt": "Hi! My name is Gurp. What is 1 + 1?"}) basic_chain.invoke({
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 2.43 ms / 16 runs ( 0.15 ms per token, 6573.54 tokens per second)
llama_print_timings: prompt eval time = 286.44 ms / 19 tokens ( 15.08 ms per token, 66.33 tokens per second)
llama_print_timings: eval time = 1401.11 ms / 15 runs ( 93.41 ms per token, 10.71 tokens per second)
llama_print_timings: total time = 1694.51 ms / 34 tokens
' Hello Gurp! The answer to 1 + 1 is 2'
# Next, we ask the LLM to reproduce the name
"input_prompt": "What is my name?"}) basic_chain.invoke({
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 3.35 ms / 16 runs ( 0.21 ms per token, 4771.85 tokens per second)
llama_print_timings: prompt eval time = 255.48 ms / 7 tokens ( 36.50 ms per token, 27.40 tokens per second)
llama_print_timings: eval time = 1341.37 ms / 15 runs ( 89.42 ms per token, 11.18 tokens per second)
llama_print_timings: total time = 1606.10 ms / 22 tokens
" I'm unable to determine your name as I don't have access to"
Models are stateless; they do not store memory. We can add specific types of memory to the chain to help it “remember” conversations:
- Conversation buffer
- Conversation summary
3.1. Conversation Buffer
The simplest way we can force the LLM to remember previous conversation is by passing the full conversation history in our prompt.
This approach is called conversation buffer memory. We update our prompt with the history of the chat.
3.1.1. Simple Conversation Buffer
# Create an updated prompt template to include a chat history
= """<s><|user|>Current conversation:{chat_history} {input_prompt}<|end|> <|assistant|>"""
template = PromptTemplate(template=template, input_variables=["input_prompt", "chat_history"]) prompt
Next, we create a ConversationBufferMemory
link in the chain that will store the conversations we have previously had with the LLM.
from langchain.memory import ConversationBufferMemory
# Define the type of memory we will use
memory = ConversationBufferMemory(memory_key="chat_history")

# Chain the LLM, prompt, and memory together
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)
We can verify if this has worked by seeing if it now remembers our name in later prompts:
# Generate a conversation and ask a basic question
"input_prompt": "Hi! My name is Gurp. What is 1 + 1?"}) llm_chain.invoke({
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 5.11 ms / 16 runs ( 0.32 ms per token, 3129.89 tokens per second)
llama_print_timings: prompt eval time = 262.37 ms / 19 tokens ( 13.81 ms per token, 72.42 tokens per second)
llama_print_timings: eval time = 1334.30 ms / 15 runs ( 88.95 ms per token, 11.24 tokens per second)
llama_print_timings: total time = 1608.21 ms / 34 tokens
{'input_prompt': 'Hi! My name is Gurp. What is 1 + 1?',
'chat_history': '',
'text': ' Hello, Gurp! The answer to 1 + 1 is '}
# Does the LLM remember my name?
response = llm_chain.invoke({"input_prompt": "What is my name?"})
print(response)
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 1.28 ms / 16 runs ( 0.08 ms per token, 12539.18 tokens per second)
llama_print_timings: prompt eval time = 344.80 ms / 47 tokens ( 7.34 ms per token, 136.31 tokens per second)
llama_print_timings: eval time = 1425.89 ms / 15 runs ( 95.06 ms per token, 10.52 tokens per second)
llama_print_timings: total time = 1775.85 ms / 62 tokens
{'input_prompt': 'What is my name?', 'chat_history': 'Human: Hi! My name is Gurp. What is 1 + 1?\nAI: Hello, Gurp! The answer to 1 + 1 is ', 'text': ' Hi Gurp! The answer to 1 + 1 is 2'}
"input_prompt": "What is my name?"}) llm_chain.invoke({
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 6.22 ms / 16 runs ( 0.39 ms per token, 2571.93 tokens per second)
llama_print_timings: prompt eval time = 346.93 ms / 36 tokens ( 9.64 ms per token, 103.77 tokens per second)
llama_print_timings: eval time = 1359.76 ms / 15 runs ( 90.65 ms per token, 11.03 tokens per second)
llama_print_timings: total time = 1721.41 ms / 51 tokens
{'input_prompt': 'What is my name?',
'chat_history': 'Human: Hi! My name is Gurp. What is 1 + 1?\nAI: Hello, Gurp! The answer to 1 + 1 is \nHuman: What is my name?\nAI: Hi Gurp! The answer to 1 + 1 is 2',
'text': " Hi Gurp! You're asking your own name; I am an"}
3.1.2 Windowed Conversation Buffer
As the conversation goes on, the size of the chat history grows until eventually it may exceed the token limit.
One approach to work around this is to only hold the last \(k\) conversations in memory rather than the entire history.
from langchain.memory import ConversationBufferWindowMemory
# Retain only the last 2 conversations in memory
memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")

# Chain the LLM, prompt, and memory together
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)
/var/folders/8k/8jqhnfbd1t99blb07r1hs5440000gn/T/ipykernel_47761/3936398832.py:4: LangChainDeprecationWarning: Please see the migration guide at: https://python.langchain.com/docs/versions/migrating_memory/
memory = ConversationBufferWindowMemory(k=2, memory_key="chat_history")
This approach is not ideal for longer conversations. An alternative is to summarise the chat history to fit in the token limit, rather than truncating it.
3.2. Conversation Summary
This approach uses an LLM to summarise the main points of the history so far to reduce the number of tokens required to pass to the main LLM.
The “summary LLM” can be a different model to our “main LLM”. We may want to use a smaller LLM for the “easier” task of summarisation to speed up computation.
There will now be two LLM calls per invocation: the user prompt and the summarisation prompt.
# Create a summary prompt template
= """
summary_prompt_template <s><|user|>Summarize the conversations and update with the new lines.
Current summary: {summary} new lines of conversation: {new_lines}
New summary:<|end|> <|assistant|>
"""
= PromptTemplate(
summary_prompt =["new_lines", "summary"],
input_variables=summary_prompt_template
template )
In this example, we’ll pass both calls to the same LLM, but in general we don’t have to.
from langchain.memory import ConversationSummaryMemory
# Define the type of memory we will use
memory = ConversationSummaryMemory(
    llm=llm,
    memory_key="chat_history",
    prompt=summary_prompt
)

# Chain the LLM, prompt, and memory together
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)
/var/folders/8k/8jqhnfbd1t99blb07r1hs5440000gn/T/ipykernel_47761/178162952.py:4: LangChainDeprecationWarning: Please see the migration guide at: https://python.langchain.com/docs/versions/migrating_memory/
memory = ConversationSummaryMemory(
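If we wanted the summariser to be a separate, smaller model, we could pass a different LLM into the memory. A minimal sketch, assuming we had downloaded a second, smaller GGUF file (the file name here is purely illustrative):
# Hypothetical smaller model used only for summarisation
summary_llm = CustomLlamaLLM(
    model_path=MODEL_DIR + "some-smaller-model.gguf",  # illustrative path
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)

memory = ConversationSummaryMemory(
    llm=summary_llm,           # the summariser LLM
    memory_key="chat_history",
    prompt=summary_prompt
)
llm_chain = LLMChain(prompt=prompt, llm=llm, memory=memory)  # main LLM unchanged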
We can try this out by having a short conversation with the LLM and checking it has retained previous information.
# Generate a conversation and ask for the name
"input_prompt": "Hi! My name is Gurp. What is 1 + 1?"})
llm_chain.invoke({"input_prompt": "What is my name?"}) llm_chain.invoke({
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 1.33 ms / 16 runs ( 0.08 ms per token, 12066.37 tokens per second)
llama_print_timings: prompt eval time = 273.75 ms / 19 tokens ( 14.41 ms per token, 69.41 tokens per second)
llama_print_timings: eval time = 1438.78 ms / 15 runs ( 95.92 ms per token, 10.43 tokens per second)
llama_print_timings: total time = 1717.89 ms / 34 tokens
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 5.37 ms / 16 runs ( 0.34 ms per token, 2977.85 tokens per second)
llama_print_timings: prompt eval time = 291.94 ms / 77 tokens ( 3.79 ms per token, 263.75 tokens per second)
llama_print_timings: eval time = 1356.44 ms / 15 runs ( 90.43 ms per token, 11.06 tokens per second)
llama_print_timings: total time = 1661.81 ms / 92 tokens
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 1.40 ms / 16 runs ( 0.09 ms per token, 11428.57 tokens per second)
llama_print_timings: prompt eval time = 133.14 ms / 28 tokens ( 4.76 ms per token, 210.30 tokens per second)
llama_print_timings: eval time = 1437.07 ms / 15 runs ( 95.80 ms per token, 10.44 tokens per second)
llama_print_timings: total time = 1575.77 ms / 43 tokens
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 4.67 ms / 16 runs ( 0.29 ms per token, 3427.59 tokens per second)
llama_print_timings: prompt eval time = 294.58 ms / 81 tokens ( 3.64 ms per token, 274.96 tokens per second)
llama_print_timings: eval time = 1346.13 ms / 15 runs ( 89.74 ms per token, 11.14 tokens per second)
llama_print_timings: total time = 1654.38 ms / 96 tokens
{'input_prompt': 'What is my name?',
'chat_history': ' Gurp initiated a conversation by introducing himself and asking for the sum',
'text': ' It seems there may have been a misunderstanding. In our current conversation, you'}
# Check whether it has summarized everything thus far
"input_prompt": "What was the first question I asked?"}) llm_chain.invoke({
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
/Users/gurpreetjohl/miniconda3/envs/thellmbook/lib/python3.10/site-packages/llama_cpp/llama.py:1031: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
inputs = [input]
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 3.82 ms / 16 runs ( 0.24 ms per token, 4187.39 tokens per second)
llama_print_timings: prompt eval time = 274.15 ms / 31 tokens ( 8.84 ms per token, 113.08 tokens per second)
llama_print_timings: eval time = 1342.18 ms / 15 runs ( 89.48 ms per token, 11.18 tokens per second)
llama_print_timings: total time = 1625.96 ms / 46 tokens
Llama.generate: prefix-match hit
llama_print_timings: load time = 840.73 ms
llama_print_timings: sample time = 1.95 ms / 16 runs ( 0.12 ms per token, 8200.92 tokens per second)
llama_print_timings: prompt eval time = 300.16 ms / 84 tokens ( 3.57 ms per token, 279.85 tokens per second)
llama_print_timings: eval time = 1455.62 ms / 15 runs ( 97.04 ms per token, 10.30 tokens per second)
llama_print_timings: total time = 1761.25 ms / 99 tokens
{'input_prompt': 'What was the first question I asked?',
'chat_history': ' Gurp introduced himself to the human and inquired about their name. The',
'text': ' The first question you asked could be, "Nice to meet you, G'}
# Check what the summary is thus far
memory.load_memory_variables({})
{'chat_history': ' Gurp introduced himself to the human and inquired about their name, while'}
The conversation summary approach reduces the tokens required, but it does risk losing information depending on the quality of the summary.
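To see the reduction concretely, we can count tokens with the underlying llama.cpp tokenizer and compare a full chat buffer against its summary. This is a rough sketch that reuses the client attribute of our custom wrapper; the example buffer string is made up for illustration.
# Compare token counts of a (made-up) full buffer vs. the stored summary
full_buffer = "Human: Hi! My name is Gurp. What is 1 + 1?\nAI: Hello Gurp! The answer to 1 + 1 is 2\nHuman: What is my name?\nAI: Your name is Gurp."
summary = memory.load_memory_variables({})["chat_history"]

# llama_cpp's Llama.tokenize expects bytes and returns a list of token ids
n_full = len(llm.client.tokenize(full_buffer.encode("utf-8")))
n_summary = len(llm.client.tokenize(summary.encode("utf-8")))
print(f"Full buffer: {n_full} tokens, summary: {n_summary} tokens")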
3.3. Comparison of Memory Approaches
Conversation buffer
Pros:
- Easiest to implement
- Ensures no loss of info (as long as the conversation fits in the context window)
Cons:
- Slower generation (more tokens needed)
- Only suitable for LLMs with large context windows
- Handles larger chat histories poorly
Windowed conversation buffer
Pros:
- Can use LLMs with smaller context windows
- Good for shorter chats; no information loss over the last k interactions
Cons:
- Only captures k interactions
- No compression, so can still require a large context window if k is large
Conversation summary
Pros:
- Captures full history
- Enables long chats
- Reduces required tokens
Cons:
- Requires an additional LLM call per interaction
- Quality of response depends on LLM’s summarisation quality
4. Agents
We can think of agents as an extension of multiple prompt chains. The difference is that, rather than requiring the user to input multiple prompts, the LLM decides which actions to take and in which order. Understanding the query and deciding which tool to use, and when, is the crux of what makes agents useful.
Agents can make use of all the components so far, including chains and memory. Two more that are vital for agents are:
- Tools that the agent uses to do things it cannot do by itself
- The agent type, which plans the actions to take (and the tools to use)
Agents show more advanced behaviour like creating and self-correcting a roadmap to achieve a goal, and they can interact with the real world through tools.
4.1. ReAct
Many agent-based systems rely on the ReAct framework, which stands for Reasoning and Acting.
We can give the LLM the ability to use tools, but it can only generate text, so it needs to generate the right text to interact with tools. For example, if we let it use a weather forecasting API, it needs to provide a request in the correct format.
ReAct merges the concepts of reasoning and acting because they are essentially two sides of the same coin: we want reasoning to affect actions and actions to affect reasoning. It does this by iteratively following three steps:
- Thought
- Action
- Observation
We incorporate this into a prompt template like so:
We ask it to create a thought about the prompt, then trigger an action based on the thought, then observe the output, i.e. whatever it retrieved from an external tool.
An example of this is getting an LLM to use a calculator.
import os
from langchain_openai import ChatOpenAI
# Load OpenAI's LLMs with LangChain
"OPENAI_API_KEY"] = "MY_KEY"
os.environ[= ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0) openai_llm
# Create the ReAct template
= """
react_template Answer the following questions as best you can.
You have access to the following tools: {tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
"""
= PromptTemplate(
prompt =react_template,
template=["tools", "tool_names", "input", "agent_scratchpad"]
input_variables )
Next we need to define the tools it can use to interact with the outside world.
from langchain.agents import load_tools, Tool
from langchain.tools import DuckDuckGoSearchResults
# You can create the tool to pass to an agent
search = DuckDuckGoSearchResults()
search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use this as a search engine for general queries.",
    func=search.run
)

# Prepare tools
tools = load_tools(["llm-math"], llm=llm)  # Calculator tool is included by default
tools.append(search_tool)
Finally we can create the ReAct agent and pass it to the AgentExecutor
which handles the execution steps.
from langchain.agents import AgentExecutor, create_react_agent

# Construct the ReAct agent
agent = create_react_agent(openai_llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True
)
Now we can invoke the LLM to find the price of an item and convert the currency. It will choose the appropriate tools to use for this.
# What is the price of a MacBook Pro?
agent_executor.invoke({"input": "What is the current price of a MacBook Pro in USD? How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD."
})
The agent generates intermediate steps during execution which we can use to follow its train of thought.
The important thing to consider when using agents is that there is no human in the loop; it will generate an answer but there is no guarantee that it is the correct answer.
We can make some tweaks to help ourselves debug this. For example, asking the agent to return the URL of the website it retrieved prices from makes manual verification easier.
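Another debugging aid (a sketch, not from the original notebook) is to ask the executor to return its intermediate steps, which exposes each tool call and its observation:
# Re-create the executor so that it also returns its intermediate steps
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    return_intermediate_steps=True
)

response = agent_executor.invoke({"input": "What is the current price of a MacBook Pro in USD?"})

# Each entry is an (AgentAction, observation) pair
for action, observation in response["intermediate_steps"]:
    print(action.tool, action.tool_input)
    print(observation)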
References
- Chapter 7 of Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst
- LangChain docs
- GGUF file format
- Visual guide to quantization
- Open LLM leaderboard
- The langchain_huggingface library