Data Structures
1. Data Structure Operations
There are common ways that we can interact with different data structures.
It is useful to frame the appropriateness of a data structure for a given task based on the speed of the operations that task requires most.
- Read: Look up the value at a particular location in the data structure
- Search: Look for a particular value in the data structure.
- Insert: Add a new value to the data structure.
- Delete: Remove a value from the data structure.
Reading is “find by key”, searching is “find by value”.
2. Arrays
An array is a list of elements.
It is stored in memory as a block of contiguous memory addresses. When the array is declared, the head of the array is stored, i.e. the memory address of the first element.
The size of an array is the number of elements in the array. The index denotes where a particular piece of data resides in that array.
Operation | Complexity (Worst) | Complexity (Best) |
---|---|---|
Read | O(1) | O(1) |
Search | O(N) | O(1) |
Insertion | O(N) | O(1) |
Deletion | O(N) | O(1) |
2.1. Reading an Array
A computer can look up a given memory address in constant time.
We know the head of the array and the index to look up. So we can read in O(1) time.
Example:
- Head of the array is memory address 1063.
- We want to look up index 5.
- Read memory address 1068 (because 1063 + 5 = 1068).
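As a toy illustration of this address arithmetic (the `memory` dict below is a stand-in for real memory, not how Python actually stores arrays):
```python
def read_element(head_address, index, memory):
    """Simulate an O(1) array read: one addition, one lookup."""
    return memory[head_address + index]

# A pretend block of contiguous memory starting at address 1063.
memory = {1063: "a", 1064: "b", 1065: "c", 1066: "d", 1067: "e", 1068: "f"}
read_element(1063, 5, memory)  # "f"
```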
If you were asked to raise your right pinky finger, you wouldn’t need to search through all of your fingers to find it.
2.2. Searching an Array
A computer has immediate access to all of its memory addresses but does not know ahead of time their contents.
So to find a particular value, we will potentially have to search through every element.
Searching an array is therefore O(N).
2.3. Insertion into an Array
The efficiency of inserting into an array depends on where in the array you are inserting.
If inserting an element at the end, we simply place the new value at that memory address (assuming the memory address is empty). This is a constant time operation O(1).
But if we insert at any other position, we need to:
- Shift each existing element to the right of the insert index one position rightwards.
- Insert the new value in the gap created.
So to insert at index \(i\), there are \(N - i\) elements to shift (Step 1), then 1 more operation to insert the new value (Step 2).
In the worst case - inserting at the start of an array - insertion is O(N).
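A minimal sketch of the shifting logic, using a plain Python list purely to visualise the steps (`list.insert` does this shifting internally):
```python
def insert_at(array, index, value):
    """Insert `value` at `index` by shifting later elements rightwards."""
    array.append(None)  # Make room for one more element at the end
    # Step 1: shift the N - i elements to the right of the insert index
    for i in range(len(array) - 1, index, -1):
        array[i] = array[i - 1]
    # Step 2: insert the new value into the gap
    array[index] = value

arr = [10, 20, 30, 40]
insert_at(arr, 1, 15)
print(arr)  # [10, 15, 20, 30, 40]
```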
2.4. Deletion from an Array
Similarly, the efficiency of deletion depends on the index being deleted.
If deleting the last element, there is simply 1 operation to clear the memory address, so this is a constant time operation O(1).
But if we delete an element in any other position, we need to:
- Delete the element. This leaves a gap in the middle of the array.
- Shift each of the elements to the right of the gap leftwards, to fill the gap.
So to delete at index \(i\), we do 1 operation to delete that element (Step 1), then shift the remaining \(N - i - 1\) elements leftwards (Step 2).
In the worst case - deleting the first element of the array - deletion is O(N).
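The mirror-image sketch for deletion:
```python
def delete_at(array, index):
    """Delete the element at `index` by shifting later elements leftwards."""
    # Shift the elements to the right of the gap leftwards
    for i in range(index, len(array) - 1):
        array[i] = array[i + 1]
    array.pop()  # Drop the now-duplicated final slot

arr = [10, 15, 20, 30, 40]
delete_at(arr, 0)
print(arr)  # [15, 20, 30, 40]
```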
3. Sets
A set is a collection of unique elements, i.e. duplicate values are not allowed.
There are different ways of implementing sets: array-based sets and hash-sets are discussed here.
Note that Python already has sets, but we’ll give outline implementations for clarity.
3.1. Array-based Sets
An array is used to store elements. As with standard arrays, elements are stored in contiguous memory locations, and each element has a unique index.
Example in Python:
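```python
class ArraySet:
    def __init__(self):
        self.elements = []

    def __repr__(self):
        return str(self.elements)

    def add(self, element):
        # Search the array for `element`, then append it if it is not a duplicate.
        if element not in self.elements:
            self.elements.append(element)

    def remove(self, element):
        # Search the array for the value, then remove it.
        if element in self.elements:
            self.elements.remove(element)

    def contains(self, element):
        return element in self.elements
```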
```python
set_arr = ArraySet()
set_arr.add(1)
set_arr.add(2)
set_arr.add(3)
print(set_arr)
```
```
[1, 2, 3]
```
If we try to add a duplicate value it does not get added to the array:
```python
set_arr.add(2)
print(set_arr)
```
```
[1, 2, 3]
```
The read, search and delete operations for an array-based set are identical to the standard array.
Insert operations are where array-based sets diverge. We always insert at the end of a set, which is constant time. But we need to search the array every time to ensure the new value is not a duplicate.
So we always need to do a search of all N elements, and then 1 insert operation at the end.
This means even in the best case, insertion into an array-based set is O(N) compared to O(1) when inserting at the end of a standard array.
The reason for using a set is that the use case requires no duplicates, not that it is inherently “quicker” than a standard array.
Operation | Complexity (Worst) | Complexity (Best) |
---|---|---|
Read | O(1) | O(1) |
Search | O(N) | O(1) |
Insertion | O(N) | O(N) |
Deletion | O(N) | O(1) |
3.2. Hash Sets
A hash-based set computes the hash of each element and uses this to store elements.
One implementation stores the set as key-value pairs, where each key is the hash of an element and the value is a placeholder (like True) or an array used to handle hash collisions.
When there is a hash collision between multiple elements, a typical approach is to insert all of these elements as an array under the same hash key.
The worst case scenario is the extreme edge case where hash collisions are so frequent that every element has the same hash, essentially reducing the hash set to an array. This is generally avoided as long as the hash algorithm is decent.
For this reason, the average complexity is more meaningful in the table below. (Note that best has been replaced with average in the table headings.)
Hash-based sets do not support reading by index, unlike array-based sets. But all other operations are typically constant time.
Operation | Complexity (Worst) | Complexity (Average) |
---|---|---|
Read | N/A | N/A |
Search | O(N) | O(1) |
Insertion | O(N) | O(1) |
Deletion | O(N) | O(1) |
Example in Python:
```python
class HashSet:
    def __init__(self):
        # Use a dict to represent the hash set.
        self.elements = {}

    def __repr__(self):
        return str(self.elements)

    def add(self, element):
        # The key is the element and the value is arbitrary.
        # There are two extensions we could add here:
        # 1. The key should really be the *hash* of the element, not just the
        #    element itself. Essentially, this is using an implicit hash function
        #    which is just a pass-through: hash_func = lambda x: x
        # 2. Handle hash collisions by making the value an array which is
        #    appended to in the case of collisions.
        self.elements[element] = True

    def remove(self, element):
        if element in self.elements:
            del self.elements[element]

    def contains(self, element):
        return element in self.elements
```
```python
set_hash = HashSet()
set_hash.add(1)
set_hash.add(2)
set_hash.add(3)
print(set_hash)
```
```
{1: True, 2: True, 3: True}
```
If we try to add a duplicate value it simply overwrites the previous value:
```python
set_hash.add(2)
print(set_hash)
```
```
{1: True, 2: True, 3: True}
```
4. Ordered Arrays
These are identical to regular arrays with the additional condition that elements are always ordered.
This obviously relies heavily on efficient sorting. This is a topic unto itself; see notes on sorting for more info.
When inserting into an ordered array, we need to:
- Search for the correct position: look at each element in turn, checking whether the value to insert is greater than it.
- Insert into the array at that position, shifting the subsequent elements rightwards.
These two steps trade off against each other depending on the insert position: the further into the array we have to search (Step 1), the fewer elements we have to shift for the insertion (Step 2). A sketch follows below.
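A minimal sketch of ordered insertion, assuming a list of comparable values:
```python
def ordered_insert(ordered_array, value):
    """Insert `value` while preserving order."""
    # Step 1: search for the first position whose element is not smaller
    idx = 0
    while idx < len(ordered_array) and ordered_array[idx] < value:
        idx += 1
    # Step 2: shift-and-insert at that position
    ordered_array.insert(idx, value)

arr = [1, 3, 7, 9]
ordered_insert(arr, 5)
print(arr)  # [1, 3, 5, 7, 9]
```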
4.1. Binary Search
In a typical (unordered) array, the only option for searching is a linear search: we loop through each element in turn until we find our target.
For an ordered array, we can improve on this using a binary search.
- Pick the middle element.
- If the target value is greater than this, search the right half, otherwise search the left half.
- Repeat this recursively until we find our target.
This approach splits the search region in half for every constant time comparison operation.
Or put another way, if we doubled the number of elements in the array, the binary search would only have to perform 1 extra step. For \(N\) elements we need \(\log_2(N)\) binary splits.
Therefore, the time complexity is O(log(N)).
```python
def binary_search(ordered_array, target):
    """Perform a binary search for the target value on the given ordered array.

    Parameters
    ----------
    ordered_array: list
        The array to search in.
    target: int
        The target value we are searching for.

    Returns
    -------
    target_index: int
        The index of the target value.
        Returns None if the value does not exist in the array.
    """
    # Establish the lower and upper bounds of our search.
    # Initially, this is the entire array.
    idx_lower = 0
    idx_upper = len(ordered_array) - 1

    while idx_lower <= idx_upper:
        # Find the midpoint between our bounds
        idx_midpoint = (idx_upper + idx_lower) // 2
        value_at_midpoint = ordered_array[idx_midpoint]

        # Compare to our target value and narrow the upper or lower bound accordingly
        if value_at_midpoint == target:
            # We have found the target!
            return idx_midpoint
        elif value_at_midpoint < target:
            # The target is bigger so must be on the right side
            idx_lower = idx_midpoint + 1
        elif value_at_midpoint > target:
            # The target is smaller so must be on the left side
            idx_upper = idx_midpoint - 1

    # If the lower and upper bounds cross, we have exhausted the whole array,
    # so the target is not in the array.
    return None
```
Let’s try this on a few examples.
```python
ordered_array = [1, 2, 4, 5, 7, 8, 9, 10, 13, 14]
```
```python
binary_search(ordered_array, 7)
```
```
4
```
```python
binary_search(ordered_array, 14)
```
```
9
```
Now a value that’s not in the array:
```python
binary_search(ordered_array, 3)  # Returns None, so there is no output
```
Compare this with a linear search:
```python
def linear_search(array, target):
    """Perform a linear search for the target value on the given array.

    Parameters
    ----------
    array: list
        The array to search in.
    target: int
        The target value we are searching for.

    Returns
    -------
    target_index: int
        The index of the target value.
        Returns None if the value does not exist in the array.
    """
    # Loop through every element in the array.
    # Note: we should really use enumerate() rather than range(len()) but I wanted
    # to keep this generic without too many python-specific helpers.
    for idx in range(len(array)):
        if array[idx] == target:
            return idx

    # If we reach the end of the array without returning a value, then the target
    # does not exist in the array.
    return None
```
Let’s compare how they perform for a reasonably big array with 1 million elements.
```python
array = [k for k in range(1000000)]
```
```python
%%timeit
binary_search(array, 987654)
```
```
1.13 µs ± 56.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
```python
%%timeit
linear_search(array, 987654)
```
```
15.9 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
The binary search is ~14000x faster than the linear search!
5. Hash Tables
Hash tables are key:value pairs. We can look up the value for a given key in \(O(1)\) time.
Also known as hash maps, dictionaries, maps, associative arrays.
5.1. Hashing
The process of taking characters and converting them to numbers is called hashing.
The code that performs this conversion is the hash function.
A hash function requires one condition:
Consistency: A hash function must convert the same string to the same number every time it’s applied.
In practice, for the hash function to be useful, it should also be collision resistant:
Collision resistant: Different inputs should hash to different outputs.
As an extreme example, the following hash function is consistent but not collision resistant:
```python
def crappy_hash(input_str: str) -> int:
    """This is the world's worst hash function."""
    return 1
```
5.2. Hash Table Lookups
We want to insert the following key:value pair into our hash table:
```python
key = "Name"
value = "Gurp"
```
Let’s say we have a hash function that is actually good, and in this particular case `hash_function(key)` returns `12`.
The hash table will then insert `value` at memory address 12 (or more specifically, the memory address of the head of the dictionary + 12).
This means if we ever want to look up the key `"Name"`, we hash it and immediately know to access memory address 12 and return the value `"Gurp"`.
So hash table lookups are \(O(1)\).
More specifically, looking up by key is \(O(1)\). Searching by value is essentially searching through an array, so is \(O(N)\).
5.3. Dealing with Collisions
A collision occurs when we try to add data to a cell that is already occupied.
One approach to handling this is called separate chaining.
Instead of placing a single value in a cell, we place a pointer to an array. This array contains length-2 subarrays where the first element is the key and the second element is the value.
If there are no collisions, a hash table look up is \(O(1)\). In the worst case, ALL keys collide and so we essentially have to search through an array which is \(O(N)\).
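A minimal sketch of separate chaining with a fixed number of cells (illustrative only; Python’s built-in `dict` handles collisions differently):
```python
class ChainedHashTable:
    def __init__(self, num_cells=16):
        self.cells = [[] for _ in range(num_cells)]

    def set(self, key, value):
        chain = self.cells[hash(key) % len(self.cells)]
        for pair in chain:
            if pair[0] == key:
                pair[1] = value  # Overwrite an existing key
                return
        chain.append([key, value])  # Chain a new [key, value] subarray

    def get(self, key):
        chain = self.cells[hash(key) % len(self.cells)]
        for k, v in chain:
            if k == key:
                return v
        return None

table = ChainedHashTable()
table.set("Name", "Gurp")
table.get("Name")  # 'Gurp'
```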
5.4. Hash Table Efficiency
A hash table’s efficiency depends on:
- How much data we’re storing in it
- How many cells are available
- Which hash function we use
A good hash function (3) should distribute the data (1) evenly across all cells (2).
This ensures the memory is used efficiently while avoiding collisions.
The load factor is the ratio of data to cells, and ideally should be around 0.7, i.e. 7 elements for every 10 cells.
Operation | Complexity (Worst) | Complexity (Average) |
---|---|---|
Read | O(N) | O(1) |
Search | O(N) | Keys: O(1) Values: O(N) |
Insertion | O(N) | O(1) |
Deletion | O(N) | O(1) |
The worst case corresponds to when all keys collide, reducing the hash table to an array effectively.
6. Stacks
A stack is stored in memory the same as an array, but it has 3 constraints:
- Data can only be inserted at the end (push)
- Data can only be deleted from the end (pop)
- Only the last element can be read (peek)
Read, insert and delete can only happen at the end.
This makes them useful as Last-In First-Out (LIFO) data stores: the last item pushed onto the stack is the first to be popped.
Example in Python:
```python
class Stack:
    def __init__(self, initial_elements=None):
        # We can pass a list to initialise the Stack.
        # Default to None rather than []: a mutable default argument would be
        # shared across all instances.
        self.data = initial_elements if initial_elements is not None else []

    def __repr__(self):
        return str(self.data)

    def push(self, element):
        self.data.append(element)

    def pop(self):
        return self.data.pop()

    def peek(self):
        if self.data:
            return self.data[-1]
```
```python
stack = Stack([1, 2, 3, 4])
stack
```
```
[1, 2, 3, 4]
```
```python
stack.push(5)
print(stack)
```
```
[1, 2, 3, 4, 5]
```
```python
stack.peek()
```
```
5
```
```python
print(stack)
```
```
[1, 2, 3, 4, 5]
```
```python
stack.pop()
```
```
5
```
```python
print(stack)
```
```
[1, 2, 3, 4]
```
The benefits of stacks, and other constrained data structures, are:
- Prevent potential bugs when using certain algorithms. For example, an algorithm that relies on stacks may break if elements are removed from the middle of the array, so using a standard array is more error-prone.
- A new mental model for tackling problems. In the case of stacks, this is the LIFO approach.
The concept of stacks is a useful precursor to recursion, as we push to and pop from the end of a stack.
7. Queues
A queue is conceptually similar to a stack - it is a constrained array. This time, it is First-In First-Out (FIFO) like a queue of people; the first person to arrive is the first to leave.
Queue restrictions:
- Data can only be inserted at the end (enqueue)
- Data can only be deleted from the front (dequeue)
- Data can only be read from the front (peek)
Points (2) and (3) are the opposite of the stack.
```python
class Queue:
    def __init__(self, initial_elements=None):
        # We can pass a list to initialise the Queue.
        self.data = initial_elements if initial_elements is not None else []

    def __repr__(self):
        return str(self.data)

    def enqueue(self, element):
        self.data.append(element)

    def dequeue(self):
        # Note: list.pop(0) is O(N) in Python because the remaining elements are
        # shifted; collections.deque offers an O(1) popleft() instead.
        return self.data.pop(0)

    def peek(self):
        if self.data:
            return self.data[0]
```
```python
q = Queue([1, 2, 3, 4])
q
```
```
[1, 2, 3, 4]
```
```python
q.enqueue(5)
print(q)
```
```
[1, 2, 3, 4, 5]
```
```python
q.dequeue()
```
```
1
```
```python
print(q)
```
```
[2, 3, 4, 5]
```
```python
q.peek()
```
```
2
```
```python
print(q)
```
```
[2, 3, 4, 5]
```
8. Linked Lists
A linked list represents a list of items, similar to an array. But where an array occupies a single contiguous block of memory, a linked list’s elements can be scattered across non-contiguous memory locations.
Operation | Complexity (Worst) | Complexity (Best)* |
---|---|---|
Read | O(N) | O(1) |
Search | O(N) | O(1) |
Insertion | O(N) | O(1) |
Deletion | O(N) | O(1) |
The best case corresponds to operating on the head node.
In a linked list, each element is contained in a node that can be in scattered positions in memory. The node contains the data element and a “link” which is a pointer to the memory address of the next element.
Benefits of a linked list over an array:
- Memory efficient: we don’t need a contiguous block of memory
- \(O(1)\) inserts and deletes from the beginning of the list
- Useful when we want to traverse through a data structure while making inserts and deletes, because we do not have to shift the entire data structure each time as we would have to with an array
A node contains two pieces of information:
Data | Link |
---|---|
“a” | 1666 |
These nodes can then be linked together in a list… a linked list!
```mermaid
flowchart LR
    A("'a'|1666") --> B("'b'|1984") --> C("'c'|1066") --> D("...") --> E("'z'|null")
```
The link of the last node is null to indicate the end of the list.
8.1. Implementing a Node
We first need a node data structure, which will hold our data and a link to the next node.
We’ll point to the next node itself, rather than its memory address. This still has the same effect as nodes are scattered throughout different memory locations.
```python
class Node:
    def __init__(self, data, link=None):
        self.data = data
        self.link = link

    def __repr__(self) -> str:
        return f"Data: {self.data}\tLink: \n{self.link}"
```
Create some nodes and link them together:
```python
node1 = Node("a")
node2 = Node("b")
node3 = Node("c")
node4 = Node("d")
```
This is what a single node looks like:
```python
print(node1)
```
```
Data: a Link: 
None
```
Now we link them:
```python
node1.link = node2
node2.link = node3
node3.link = node4
node1
```
```
Data: a Link: 
Data: b Link: 
Data: c Link: 
Data: d Link: 
None
```
8.2. Implementing a Linked List
The linked list simply keeps track of the head, i.e. the first node in the list.
When using linked lists, we only have immediate access to this first node. For any other values, we need to start at the head node and traverse the list.
```python
class LinkedList:
    def __init__(self, head):
        self.head = head

    def __repr__(self) -> str:
        return str(self.head)
```
```python
ll = LinkedList(node1)
ll
```
```
Data: a Link: 
Data: b Link: 
Data: c Link: 
Data: d Link: 
None
```
8.3. Reading from a Linked List
We start at the head and traverse the list until we reach the desired index.
This means reads are \(O(N)\) in the worst case.
```python
class LinkedList:
    def __init__(self, head):
        self.head = head

    def __repr__(self) -> str:
        return str(self.head)

    def read(self, index):
        """Read the node at the given index."""
        current_idx = 0
        current_node = self.head

        while index > current_idx:
            if current_node.link is None:
                # The index does not exist in the linked list, we have reached the end
                return None
            current_node = current_node.link
            current_idx += 1

        return current_node
```
```python
ll = LinkedList(node1)
ll.read(2)
```
```
Data: c Link: 
Data: d Link: 
None
```
Reading an index beyond the end of the list returns `None`:
```python
ll.read(10)
```
8.4. Searching a Linked List
To search for a value, again we have to traverse the whole list.
This means the worst case complexity is \(O(N)\).
The mechanics of searching are the same as reading: we traverse the list. The difference is we keep going until we find the value or reach the end of the list, rather than stopping at a predetermined index as with `read`.
```python
class LinkedList:
    def __init__(self, head):
        self.head = head

    def __repr__(self) -> str:
        return str(self.head)

    def read(self, index):
        """Read the node at the given index."""
        # Start at the head
        current_idx = 0
        current_node = self.head

        # Traverse the list until we find the desired index
        while index > current_idx:
            if current_node.link is None:
                # The index does not exist in the linked list, we have reached the end
                return None
            current_node = current_node.link
            current_idx += 1

        return current_node

    def search(self, value):
        """Find the index of the given value."""
        # Start at the head
        current_idx = 0
        current_node = self.head

        # Loop until we reach the None value which denotes the end of the list
        while current_node:
            if current_node.data == value:
                # We've found our target value
                return current_idx
            # Try the next node
            current_node = current_node.link
            current_idx += 1

        # We have traversed the whole list without finding a matching value
        return None
```
```python
ll = LinkedList(node1)
ll.search('c')
```
```
2
```
8.5. Inserting into a Linked List
Inserting a node into a linked list, where we already have a reference to the node we are inserting after, is an \(O(1)\) operation:
- Point the new node at the next node: `new_node.link = current_node.link`
- Link from the previous node: `current_node.link = new_node`
With a linked list, we only have the head node, so we can insert at the start in \(O(1)\) time.
But to insert at any other point, we have to traverse there first (an \(O(N)\) operation) and then do the insert.
This is the key point of linked lists: insertion at the beginning is \(O(1)\) but at the end is \(O(N)\). This is the opposite of arrays, meaning linked lists are useful in cases where insertions are mostly at the beginning.
```python
class LinkedList:
    def __init__(self, head):
        self.head = head

    def __repr__(self) -> str:
        return str(self.head)

    def read(self, index):
        """Read the node at the given index."""
        # Start at the head
        current_idx = 0
        current_node = self.head

        # Traverse the list until we find the desired index
        while index > current_idx:
            if current_node.link is None:
                # The index does not exist in the linked list, we have reached the end
                return None
            current_node = current_node.link
            current_idx += 1

        return current_node

    def search(self, value):
        """Find the index of the given value."""
        # Start at the head
        current_idx = 0
        current_node = self.head

        # Loop until we reach the None value which denotes the end of the list
        while current_node:
            if current_node.data == value:
                # We've found our target value
                return current_idx
            # Try the next node
            current_node = current_node.link
            current_idx += 1

        # We have traversed the whole list without finding a matching value
        return None

    def insert(self, value, index):
        """Insert the value at the given index."""
        new_node = Node(value)

        if index == 0:
            # Link to the old head and update the linked list's head
            new_node.link = self.head
            self.head = new_node
            return

        # Traverse the linked list until we find the node before the insert position
        current_node = self.head
        current_idx = 0
        while current_idx < index - 1:
            current_node = current_node.link
            current_idx += 1

        # Update the links to insert the new node
        new_node.link = current_node.link
        current_node.link = new_node
        return
```
Insert a new head of our linked list:
```python
ll = LinkedList(node1)
ll
```
```
Data: a Link: 
Data: b Link: 
Data: c Link: 
Data: d Link: 
None
```
```python
ll.insert('new_head', 0)
ll
```
```
Data: new_head Link: 
Data: a Link: 
Data: b Link: 
Data: c Link: 
Data: d Link: 
None
```
Insert in the middle:
```python
ll.insert("I'm new here", 3)
ll
```
```
Data: new_head Link: 
Data: a Link: 
Data: b Link: 
Data: I'm new here Link: 
Data: c Link: 
Data: d Link: 
None
```
8.6. Deleting from a Linked List
It is quick to delete from the beginning of a linked list for the same reasons as insertion: we simply update the head. To delete from elsewhere, we make the previous node point to the node after the one being deleted.
```python
class LinkedList:
    def __init__(self, head):
        self.head = head

    def __repr__(self) -> str:
        return str(self.head)

    def read(self, index):
        """Read the node at the given index."""
        # Start at the head
        current_idx = 0
        current_node = self.head

        # Traverse the list until we find the desired index
        while index > current_idx:
            if current_node.link is None:
                # The index does not exist in the linked list, we have reached the end
                return None
            current_node = current_node.link
            current_idx += 1

        return current_node

    def search(self, value):
        """Find the index of the given value."""
        # Start at the head
        current_idx = 0
        current_node = self.head

        # Loop until we reach the None value which denotes the end of the list
        while current_node:
            if current_node.data == value:
                # We've found our target value
                return current_idx
            # Try the next node
            current_node = current_node.link
            current_idx += 1

        # We have traversed the whole list without finding a matching value
        return None

    def insert(self, value, index):
        """Insert the value at the given index."""
        new_node = Node(value)

        if index == 0:
            # Link to the old head and update the linked list's head
            new_node.link = self.head
            self.head = new_node
            return

        # Traverse the linked list until we find the node before the insert position
        current_node = self.head
        current_idx = 0
        while current_idx < index - 1:
            current_node = current_node.link
            current_idx += 1

        # Update the links to insert the new node
        new_node.link = current_node.link
        current_node.link = new_node
        return

    def delete(self, index):
        """Delete the value at the given index."""
        if index == 0:
            # We are deleting the head node, so point at the second node instead
            self.head = self.head.link
            return

        # Traverse the linked list until we find the node before the delete position
        current_node = self.head
        current_idx = 0
        while current_idx < index - 1:
            current_node = current_node.link
            current_idx += 1

        # Skip the next node (which we are deleting) and point to its link instead
        current_node.link = current_node.link.link
        return
```
```python
ll = LinkedList(node1)
ll
```
```
Data: a Link: 
Data: b Link: 
Data: I'm new here Link: 
Data: c Link: 
Data: d Link: 
None
```
Delete the head node:
```python
ll.delete(0)
print(ll)
```
```
Data: b Link: 
Data: I'm new here Link: 
Data: c Link: 
Data: d Link: 
None
```
Delete a middle node:
```python
ll.delete(1)
print(ll)
```
```
Data: b Link: 
Data: c Link: 
Data: d Link: 
None
```
8.7. Doubly Linked Lists
A doubly linked list is a variant where each node contains pointers to the previous node and the next node.
Data | Previous | Next |
---|---|---|
“a” | null | 1666 |
The linked list tracks the head and tail.
This makes it quicker to read/insert/delete from either the beginning or end. We can also traverse backwards or forwards through the list.
```mermaid
flowchart LR
    A("'a'|null|1666") <--> B("'b'|1234|1984") <--> C("'c'|1666|1066") <--> D("...") <--> E("'z'|1993|null")
```
Doubly linked lists are a good data structure to use for queues, since we can insert/delete at either end.
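A minimal sketch of a doubly linked list used as a queue (class and method names here are illustrative):
```python
class DoublyNode:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

class DoublyLinkedList:
    def __init__(self):
        self.head = None
        self.tail = None

    def insert_at_end(self, data):
        """O(1): we always hold a reference to the tail."""
        node = DoublyNode(data)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node

    def delete_from_front(self):
        """O(1): we always hold a reference to the head."""
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None
        else:
            self.head.prev = None
        return node.data

queue = DoublyLinkedList()
queue.insert_at_end(1)
queue.insert_at_end(2)
queue.delete_from_front()  # 1
```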
9. Binary Search Trees
We can have some use cases where we want to keep our data sorted.
Sorting is expensive, \(O(N log N)\) at the best of times, so we want to avoid sorting often. Ideally we would keep our data sorted at all times. An ordered array could do the job, but insertions and deletions are slow as we have to shift a chunk of the array every time.
We want a data structure that:
- Maintains order
- Has fast inserts, deletes and search
This is where a binary search tree comes in.
Operation | Complexity (Worst)* | Complexity (Best) |
---|---|---|
Read | N/A | N/A |
Search | O(N) | O(log N) |
Insertion | O(N) | O(log N) |
Deletion | O(N) | O(log N) |
*The worst case corresponds to an imbalanced tree that is essentially a linked list (a straight line). The best case is a perfectly balanced tree.
9.1. Trees
Trees are another node-based data structure, where each node can point to multiple other nodes.
```mermaid
flowchart TD
    A(a) --> B(b)
    A(a) --> C(c)
    B(b) --> D(d)
    B(b) --> E(e)
    C(c) --> F(f)
```
- The root is the uppermost node.
- `a` is the parent of `b` and `c`; `b` and `c` are children of `a`.
- The descendants of a node are all of its children, its children’s children, and so on. The ancestors of a node are its parent, its parent’s parent, and so on.
- Each horizontal layer of the tree is a level.
- A tree is balanced if all of its subtrees have the same number of nodes.
9.2. Binary Search Tree Rules
A binary tree is one in which each node can have at most 2 children.
A binary search tree must abide by the following rules:
- Each node can have at most one “left” child and one “right” child.
- A node’s left descendants are all smaller than the node. Its right descendants are all larger.
9.3. Implementation
```python
class TreeNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

    def __repr__(self):
        return f"TreeNode with value: {self.value}"


class Tree:
    def __init__(self, root_node):
        self.root_node = root_node

    def __repr__(self) -> str:
        return f"Tree with root: {self.root_node}"
```
```python
tree_node2 = TreeNode('b')
tree_node3 = TreeNode('c')
tree_node1 = TreeNode('a', tree_node2, tree_node3)

tree = Tree(tree_node1)
```
```python
tree_node1
```
```
TreeNode with value: a
```
```python
tree_node1.left
```
```
TreeNode with value: b
```
```python
tree_node1.right
```
```
TreeNode with value: c
```
```python
tree
```
```
Tree with root: TreeNode with value: a
```
9.4. Searching a Binary Search Tree
- Designate a current node (start with the root) and inspect its value.
- If the current node is our target, success! Stop here.
- If our target is smaller than the current node, search the left subtree. If it’s larger, search the right subtree.
- Repeat until we find our target (see the sketch below). If we reach the end of a subtree without finding the target, then the target is not in the tree.
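A recursive sketch of this search, assuming the `TreeNode` class from the implementation above:
```python
def search(node, target):
    """Search the subtree rooted at `node` for `target`."""
    if node is None:
        # We have gone past a leaf: the target is not in the tree
        return None
    if target == node.value:
        return node
    elif target < node.value:
        return search(node.left, target)
    else:
        return search(node.right, target)
```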
9.5. Complexity
Searching a binary search tree is \(O(log N)\) in the best case (a perfectly balanced tree) since we narrow our search area by half on each step.
The worst case is a horribly imbalanced tree. Imagine a tree that only has left descendants. This is essentially a linked list, so searching through it means inspecting every element. Therefore the worst case complexity is \(O(N)\).
So searching a binary search tree is the same complexity as a binary search performed on an ordered array. Where trees differentiate themselves is on insertions and deletes.
9.6. Insertion
- Compare our `new_node` to each node, starting from the root. As with search, follow the path to the left if `new_node` is lower, or to the right if it is higher.
- Traverse until we find a node where the appropriate child (left or right) does not exist yet. Make `new_node` that child (see the sketch below).
The order of insertion is important to maintain balance. Binary search trees work best when seeded with randomly sorted data, as this will typically end up quite well-balanced. Seeding the tree with sorted data leads to imbalanced trees.
Insertion is \(O(log N)\) for balanced trees because we search for the correct place to insert and then perform an \(O(1)\) operation to insert it.
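A recursive sketch of insertion, again assuming the `TreeNode` class above:
```python
def insert(node, value):
    """Insert `value` into the subtree rooted at `node`."""
    if value < node.value:
        if node.left is None:
            node.left = TreeNode(value)   # Found a free left slot
        else:
            insert(node.left, value)
    else:
        if node.right is None:
            node.right = TreeNode(value)  # Found a free right slot
        else:
            insert(node.right, value)
```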
9.7. Deletion
Deleting a node is more complicated because it depends if the target node has 0, 1 or 2 children.
Number of Children | Action |
---|---|
0 | Simply delete the node |
1 | Replace the deleted node with its child |
2 | Replace the deleted node with the successor node |
The successor node is the next largest value among the deleted node’s descendants. More formally: the descendant whose value is the least of all values that are greater than the deleted node’s.
Finding the successor:
- Visit the right child of the deleted node.
- Keep visiting left children until there are no more left children. This is the successor node.
- If the successor node has a right child, that child takes the original place of the successor node. The successor node then moves to replace the deleted node. (A sketch of locating the successor follows.)
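A sketch of steps 1 and 2, assuming the `TreeNode` class above:
```python
def find_successor(deleted_node):
    """Visit the right child, then follow left children to the end."""
    current_node = deleted_node.right
    while current_node.left is not None:
        current_node = current_node.left
    return current_node
```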
Deletion is \(O(log N)\) for balanced trees because we search for the node to delete and then perform (potentially several) \(O(1)\) operations to remove it and replace it with a child / successor.
9.8. Inorder Traversal
The point of a binary search tree is to maintain order. We can traverse the elements in order with the following recursive algorithm.
- Call itself recursively on the node’s left child. This will repeat until we hit a node without a left child.
- Visit this node.
- Call itself recursively on the node’s right child.
Since we are visiting every element of the tree, this is necessarily \(O(N)\).
```python
def traverse(tree_node):
    """Inorder traversal of a binary search tree."""
    if tree_node is None:
        return
    traverse(tree_node.left)
    print(tree_node.value)
    traverse(tree_node.right)
```
10. Heaps
10.1. Priority Queues
A queue is a FIFO list that means data is inserted at the end but accessed/removed from the front.
A priority queue is a list where deletions and access are the same as a regular queue, but insertions are like an ordered array. So the priority queue is always ordered.
An example use case is a triage system in a hospital: patients are seen based on severity, not just when they arrived.
This is an abstract data type that we could implement in multiple ways using different fundamental data structures. E.g. we could implement it using an ordered array, but insertions would be O(N).
Heaps are another data structure that fit this use case well.
10.2. Binary Heaps
There are multiple kinds of heaps. As a starting point, we consider the binary max-heap, which is a special kind of binary tree.
Binary max-heap conditions:
- Heap condition: Each node is greater than its children.
- The tree must be complete.
Because the heap condition holds at every node (each node is greater than its children, and its children are greater than theirs, and so on), it follows that every node is greater than all of its descendants.
The complete condition essentially means values are filled left-to-right, top-down with no holes. There are no empty positions anywhere except the bottom row, and the bottom row is filled from left to right. The last node is the rightmost node at its bottom level.
There is also a min-heap variant. The difference is trivial: aside from the heap condition inequality being reversed, the logic is the same.
10.3. Heap Properties
- Heaps are weakly ordered.
- The root node is the maximum value.
With a binary search tree, we know precisely whether a value is a left or right descendant of a node. E.g. `3` will be a left descendant of `100`.
But in a max-heap, we don’t know whether `3` is to the left or right, only that it is a descendant rather than an ancestor.
Searching a heap would therefore require inspecting every node, which is \(O(N)\). In practice, search is not an operation heaps are typically used for.
Reading a heap typically refers to accessing the root node, which is \(O(1)\).
10.4. Inserting into a Heap
- Last node: Insert the new node as the heap’s last node
- Trickle up the new node: Compare the new node to its parent. If it’s greater than its parent, swap them. Repeat until the new node is no longer greater than its parent.
The number of steps to trickle is proportional to the depth of the tree. Therefore, insertion is an \(O(log N)\) operation.
10.5. Deleting from a Heap
We only ever delete the root node from a heap.
- New root: The last node becomes the new root node.
- Trickle down the root node: Trickle the root down to its proper place.
Trickling down is more complicated than trickling up because there are two possible directions to swap in. We compare to both children and swap with the larger of the two. This ensures the largest value ends up as the root node, which is the key property we want to preserve.
Deletion is also \(O(log N)\) since the number of steps to trickle is again proportional to the depth of the tree.
10.6. Complexity of Heap vs Ordered Array
An alternative way of implementing a priority queue would be using an ordered array rather than a heap.
Operation | Heap Complexity | Array Complexity |
---|---|---|
Read* | \(O(1)\) | \(O(1)\) |
Search^ | \(O(N)\) | \(O(N)\) |
Insertion | \(O(log N)\) | \(O(N)\) |
Deletion | \(O(log N)\) | \(O(1)\) |
* Reading the root node
^ Searching is not typically done on a heap
Comparing the complexities, heaps are fast, \(O(log N)\), for both insertions and deletions, whereas ordered arrays are faster for deletions but slower for insertions.
Since priority queues typically perform similar numbers of inserts and deletes, the ordered array’s \(O(N)\) insertions dominate, making heaps much faster on average.
10.7. The Problem of the Last Node
Many operations with heaps, such as insertion, rely on knowing where the last node is.
This is important because inserting at (or deleting) the last node keeps the heap complete, and completeness keeps the tree well-balanced.
Being well-balanced keeps the heap efficient. As with a binary search tree, a severely imbalanced tree effectively becomes a linked list, so operations that should traverse the depth of the tree in \(O(log N)\) time take \(O(N)\) in the worst case.
10.7.1. Array-based Heaps
Heaps are often implemented as arrays because the problem of the last node is so crucial, and accessing the last element of an array is \(O(1)\).
The tree below shows the values and their index in the corresponding array as `value|index`:
```mermaid
flowchart TD
    A["100|0"] ---> B["88|1"]
    A["100|0"] ---> C["25|2"]
    B["88|1"] ---> B1["87|3"]
    B["88|1"] ---> B2["16|4"]
    C["25|2"] ---> C1["8|5"]
    C["25|2"] ---> C2["12|6"]
```
The array representation is then:
0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
100 | 88 | 25 | 87 | 16 | 8 | 12 |
The indices are then deterministic. We can find the child indices of any given node as:
```python
left_child_idx = (parent_idx * 2) + 1
right_child_idx = (parent_idx * 2) + 2
```
To find a given node’s parent index:
```python
parent_idx = (child_idx - 1) // 2
```
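Putting this index arithmetic together with the trickle operations from 10.4 and 10.5, a minimal array-based max-heap might look like the following sketch:
```python
class MaxHeap:
    def __init__(self):
        self.data = []

    def insert(self, value):
        # Step 1: insert the new value as the last node
        self.data.append(value)
        # Step 2: trickle the new node up
        idx = len(self.data) - 1
        while idx > 0:
            parent_idx = (idx - 1) // 2
            if self.data[idx] <= self.data[parent_idx]:
                break
            self.data[idx], self.data[parent_idx] = self.data[parent_idx], self.data[idx]
            idx = parent_idx

    def delete(self):
        # Step 1: the last node becomes the new root
        root = self.data[0]
        last = self.data.pop()
        if self.data:
            self.data[0] = last
            # Step 2: trickle the new root down, swapping with the larger child
            idx = 0
            while True:
                left, right = idx * 2 + 1, idx * 2 + 2
                largest = idx
                if left < len(self.data) and self.data[left] > self.data[largest]:
                    largest = left
                if right < len(self.data) and self.data[right] > self.data[largest]:
                    largest = right
                if largest == idx:
                    break
                self.data[idx], self.data[largest] = self.data[largest], self.data[idx]
                idx = largest
        return root

heap = MaxHeap()
for value in [87, 100, 16, 25, 12, 8, 88]:
    heap.insert(value)
heap.delete()  # 100
```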
10.7.2. Linked Node-based Heaps
It is also possible to implement heaps as linked nodes.
In this case, a different trick is required to solve the problem of the last node using binary numbers.
- Assign Binary Numbers: Each node of the heap is assigned a unique binary number. The root is 0, then each left child concatenates its parent’s number with 0 at the end, and each right child concatenates its parent’s number with 1 at the end.
- Insertion using Binary Representation: To insert a new node into the heap, you convert the index of the node into binary form. Starting from the most significant bit (MSB), traverse the heap according to the binary digits:
- If the bit is 0, move to the left child.
- If the bit is 1, move to the right child.
- Insertion at the Last Available Position: As you traverse the heap, you’ll eventually reach a node that doesn’t exist yet. This node represents the last available position in the heap, where the new node can be inserted while maintaining the complete binary tree property. (A sketch follows the diagram below.)
```mermaid
flowchart TD
    A["100|0"] ---> B["88|00"]
    A["100|0"] ---> C["25|01"]
    B["88|00"] ---> B1["87|000"]
    B["88|00"] ---> B2["16|001"]
    C["25|01"] ---> C1["8|010"]
    C["25|01"] ---> C2["12|011"]
```
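A sketch of one common variant of this trick, assuming nodes with `left`/`right` attributes: node number `size + 1` in level order (1-indexed), written in binary, encodes the path from the root once the leading 1 is dropped.
```python
def find_insert_parent(root, size):
    """Return the parent of the next insertion slot, plus the direction to it."""
    if size == 0:
        return None, "root"   # Empty heap: the new node becomes the root
    path = bin(size + 1)[3:]  # Strip the '0b' prefix and the leading 1
    node = root
    for bit in path[:-1]:     # Stop at the parent of the new slot
        node = node.left if bit == "0" else node.right
    return node, ("left" if path[-1] == "0" else "right")
```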
11. Tries
This is a kind of tree which is particularly useful for text-based features.
Each trie node can have any number of children. Each word is split into characters which are stored as a series of nested child nodes. A terminal character is used to denote the end of a word.
```mermaid
flowchart TD
    G(G) ---> U(U) ---> R(R) ---> P(P) ---> X(*)
```
11.1. Trie Search
There are two flavours of search: (1) checking if a substring is a valid prefix, and (2) checking if a substring is a valid whole word.
The algorithm below is for the more general case of searching for a prefix. Searching for a whole word then becomes trivial because we have a terminal character, so we can search for “Implement*” to find the whole word.
Iterate through the trie one character at a time:
1. Initialise `current_node` at the root.
2. Iterate through each character of `search_string` and see if `current_node` has a child with that key.
3. Repeat Step 2 for the whole `search_string`. If at any point the character is not a valid key, the word does not exist.
The efficiency of a trie search depends on the length of the search term, \(K\), not the number of elements stored in the trie, \(N\). Therefore, it has complexity \(O(K)\).
11.2. Trie Insertion
Inserting a new word into a trie follows similar steps to searching for an existing word: we traverse the trie and insert new nodes where they do not already exist.
1. Initialise `current_node` at the root.
2. Iterate through each character of `new_word` and see if `current_node` has a child with that key.
   - If the key exists, update `current_node` to move to that next node.
   - Otherwise, create a new child node and update `current_node` to move to it.
3. Insert the terminal character of the word at the end.
This is again \(O(K)\) since it depends on the length of the input, not the data stored in the trie.
11.3. Implementation
We can implement a trie as nodes whose children are stored in dictionaries, i.e. dictionaries nested within dictionaries.
```python
class TrieNode:
    def __init__(self):
        self.children = {}


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self._TERMINAL_CHAR = "*"

    def search(self, word):
        """Search for a given word (or prefix) in the Trie."""
        current_node = self.root

        for char in word:
            if char in current_node.children:
                # Keep traversing as long as we find valid children
                current_node = current_node.children[char]
            else:
                # The word is not a valid prefix, so return None
                return None

        return current_node

    def insert(self, new_word):
        """Insert a new word into the Trie."""
        current_node = self.root

        # Traverse the word plus the terminal character
        for char in new_word + self._TERMINAL_CHAR:
            if char in current_node.children:
                # Keep traversing as long as we find valid children
                current_node = current_node.children[char]
            else:
                # Create a new child node and move to it
                current_node.children[char] = TrieNode()
                current_node = current_node.children[char]
```
11.4. Autocomplete
A good use case of tries is for autocomplete.
We use a trie to store our dictionary of possible words.
Then for a given user input, we can recursively list all of the words with that prefix.
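A sketch of that recursive collection step, building on the `Trie` implementation above (`collect_words` and `autocomplete` are illustrative helper names, not part of the original implementation):
```python
def collect_words(node, prefix=""):
    """Recursively gather every word stored below `node`."""
    words = []
    for char, child in node.children.items():
        if char == "*":
            words.append(prefix)  # Terminal character: a complete word
        else:
            words += collect_words(child, prefix + char)
    return words

def autocomplete(trie, prefix):
    node = trie.search(prefix)
    if node is None:
        return []  # The prefix is not in the trie
    return [prefix + suffix for suffix in collect_words(node)]

trie = Trie()
for word in ["can", "cat", "catnip"]:
    trie.insert(word)
autocomplete(trie, "ca")  # ['can', 'cat', 'catnip']
```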
We can improve autocomplete further by storing an integer `popularity` value as the terminal value rather than an empty node. This can be a 1-10 score of how commonly used that word is, so that we can prioritise showing more common words as autocompletion suggestions.