Introduction to Graphs
1. Graphs: What and Why
Graphs are a generic way of representing relationships (edges) between entities (nodes).
This makes them useful in a wide variety of applications, such as modelling biological pathways, social networks, or even images.
Graphs do not have a fixed structure and the relationships (edges) can be time-varying.
1.1. Graph Jargon
Entities are nodes or vertices and their relationships are edges.
Nodes that are connected are adjacent and are called neighbours.
Graphs are the general case of trees: a graph is broadly anything with nodes and edges, while a tree is a graph with no cycles in which all nodes are connected.
A directed graph, or digraph, is one where edges have a direction, e.g. followers in a social network.
Edges can be weighted to represent quantities like distance or cost.
A path is the sequence of edges used to get from one vertex to another.
A simple path does not visit any node more than once.
A graph is connected if there is a path between every pair of vertices, i.e. there are no isolated nodes or clusters.
A cycle is a path where the first and last nodes are the same.
1.2. Types of Graphs
- Tree: An undirected graph with no cycles.
- Rooted tree: A tree where one node is designated as the root.
- Directed Acyclic Graph (DAG): A directed graph with no cycles. This means edges can only be traversed in a particular direction.
- Bipartite graph: A graph whose vertices can be divided into two disjoint sets such that every edge connects a vertex in one set to a vertex in the other.
- Complete graph: A graph where every pair of vertices is connected by an edge, i.e. it is fully connected.
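Several of these types can be checked programmatically. Below is a minimal sketch using networkx (introduced properly in section 3); the small example graphs are illustrative assumptions rather than anything used later in these notes.

import networkx as nx

tree = nx.Graph([(1, 2), (1, 3), (2, 4)])   # undirected, no cycles, all nodes connected
dag = nx.DiGraph([(1, 2), (2, 3), (1, 3)])  # directed, no cycles
complete = nx.complete_graph(4)             # every pair of nodes connected

print(nx.is_tree(tree))                   # True
print(nx.is_directed_acyclic_graph(dag))  # True
print(nx.is_bipartite(tree))              # True - every tree is bipartite
print(complete.number_of_edges())         # 6 = 4 * 3 / 2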
2. Implementing a Graph From Scratch
Graphs are an abstract data structure, so there are multiple ways we can represent them in memory.
Some common options are:
- Objects and pointers
- Adjacency lists
- Edge lists
- Adjacency matrix
Each approach has its own trade-offs in terms of memory usage, ease of implementation, and efficiency based on the specific characteristics of the graph you’re dealing with.
We’ll implement a simple three-node graph, where node 1 connects to node 2 and node 2 connects to node 3, using each approach:
2.1. Objects and Pointers
In this approach, we represent each node in the graph as an object and use pointers (references) to connect these nodes. Each object typically contains a list of pointers to the other nodes it’s connected to.
For a connected graph, if we have access to one node we can traverse the graph to find all other nodes (a traversal sketch follows the example below). This is not true for graphs with unconnected nodes.
This approach will be familiar to computer scientists - it bears a resemblance to linked lists. It isn’t used much in practical ML because traversing the graph is more cumbersome.
Pros:
- Allows for flexible representation of complex relationships.
- Ideal for graphs where nodes have additional properties beyond just connectivity.
Cons:
- Memory overhead due to object instantiation and pointer storage.
- Can be complex to implement and manage.
- If the graph is not connected, an additional data structure is needed to keep track of isolated nodes.
Python Implementation:
class Node:
    def __init__(self, value):
        self.value = value
        self.neighbors = []

    def __repr__(self) -> str:
        return f"Node {self.value} with neighbors {[k.value for k in self.neighbors]}"


def add_edge(node1, node2):
    node1.neighbors.append(node2)
    node2.neighbors.append(node1)


node1 = Node(1)
node2 = Node(2)
node3 = Node(3)

add_edge(node1, node2)
add_edge(node2, node3)
print(node1)
Node 1 with neighbors [2]
print(node2)
Node 2 with neighbors [1, 3]
print(node3)
Node 3 with neighbors [2]
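As a sketch of the traversal point made above, a simple breadth-first walk starting from a single node visits every node reachable from it. The traverse helper below is illustrative and not part of the class definition above.

from collections import deque

def traverse(start):
    """Breadth-first traversal: visit every node reachable from `start`."""
    visited = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in current.neighbors:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited

print([node.value for node in traverse(node1)])  # [1, 2, 3]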
2.2. Adjacency List
In this approach, we store a list of neighbours per node.
We use a list or dictionary where the index or key represents a node, and the value is a list of its adjacent nodes.
Pros:
- Efficient memory usage for sparse graphs - \(O(V + E)\) space complexity.
- Easy to implement and understand.
- Suitable for graphs with varying degrees of connectivity.
- Easy to iterate through neighbors.
Cons:
- Slower for dense graphs.
- Checking if nodes are connected is slow - retrieving edge information may require linear search.
Python Implementation:
class Graph:
    def __init__(self):
        self.adj_list = {}

    def __repr__(self) -> str:
        return str(self.adj_list)

    def add_edge(self, node1, node2):
        if node1 not in self.adj_list:
            self.adj_list[node1] = []
        if node2 not in self.adj_list:
            self.adj_list[node2] = []
        self.adj_list[node1].append(node2)
        self.adj_list[node2].append(node1)


graph = Graph()
graph.add_edge(1, 2)
graph.add_edge(2, 3)
print(graph)
{1: [2], 2: [1, 3], 3: [2]}
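A short usage sketch of the trade-offs above: a node’s neighbours come straight from the dictionary, but checking whether a particular edge exists still means scanning that neighbour list.

print(graph.adj_list[2])       # [1, 3] - direct dictionary lookup
print(3 in graph.adj_list[2])  # True, but found via a linear scan of the neighbour list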
2.3. Edge Lists
An even more compact form is the edge list, which lists all of the edges as (source_node, target_node) tuples.
Pros:
- Even more efficient memory usage for sparse graphs - \(O(E)\) space complexity.
Cons:
- Slower for dense graphs.
- Checking if nodes are connected is slow - retrieving edge information may require linear search.
Python Implementation:
class Graph:
    def __init__(self):
        self.edge_list = []

    def __repr__(self) -> str:
        return str(self.edge_list)

    def add_edge(self, node1, node2):
        node_pair = (node1, node2)
        if node_pair not in self.edge_list:
            self.edge_list.append(node_pair)


graph = Graph()
graph.add_edge(1, 2)
graph.add_edge(2, 3)
print(graph)
[(1, 2), (2, 3)]
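To illustrate the linear-search drawback above, a hypothetical has_edge helper on this representation has to scan every stored pair:

def has_edge(graph, node1, node2):
    # The edge list stores an undirected edge as a single (source, target) tuple,
    # so we check both orderings - an O(E) scan.
    return (node1, node2) in graph.edge_list or (node2, node1) in graph.edge_list

print(has_edge(graph, 2, 3))  # True
print(has_edge(graph, 1, 3))  # False - 1 and 3 are not directly connected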
2.4. Adjacency Matrix
In this approach, we represent the graph as a 2D matrix where rows and columns represent nodes, and matrix cells indicate whether there’s an edge between the nodes.
For weighted graphs, the values in the matrix correspond to the weights of the edges.
Pros:
- Efficient for dense graphs.
- Constant-time edge lookups: checking whether two nodes are connected is \(O(1)\).
Cons:
- Not ideal for graphs with a large number of nodes due to \(O(V^2)\) space complexity. This is particularly bad for sparse graphs.
- Adding/removing nodes can be expensive as it requires resizing the matrix. Bad for dynamic graphs.
Python Implementation:
class Graph:
    def __init__(self, num_nodes):
        self.adj_matrix = [[0] * num_nodes for _ in range(num_nodes)]

    def __repr__(self) -> str:
        return str(self.adj_matrix)

    def add_edge(self, node1, node2):
        self.adj_matrix[node1][node2] = 1
        self.adj_matrix[node2][node1] = 1


graph = Graph(num_nodes=3)
graph.add_edge(0, 1)
graph.add_edge(1, 2)
print(graph)
[[0, 1, 0], [1, 0, 1], [0, 1, 0]]
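As noted above, for weighted graphs the matrix stores the edge weights instead of 1s. A minimal sketch of that change (the WeightedGraph class and the example weights are illustrative assumptions):

class WeightedGraph:
    def __init__(self, num_nodes):
        self.adj_matrix = [[0] * num_nodes for _ in range(num_nodes)]

    def add_edge(self, node1, node2, weight):
        # Store the weight itself rather than a 0/1 flag
        self.adj_matrix[node1][node2] = weight
        self.adj_matrix[node2][node1] = weight


weighted_graph = WeightedGraph(num_nodes=3)
weighted_graph.add_edge(0, 1, weight=2.5)
weighted_graph.add_edge(1, 2, weight=0.7)
print(weighted_graph.adj_matrix)  # [[0, 2.5, 0], [2.5, 0, 0.7], [0, 0.7, 0]]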
The adjacency matrix approach is useful in machine learning as it naturally fits into a tensor representation, which most ML libraries (e.g. pytorch, tensorflow) play nicely with. So we’ll stick with that going forward.
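As a quick illustration, the nested-list matrix from the previous section can be passed straight to a tensor constructor (a minimal sketch, assuming PyTorch is installed):

import torch

# The nested-list adjacency matrix converts directly into a dense tensor,
# ready to be consumed by downstream models.
adj_tensor = torch.tensor(graph.adj_matrix, dtype=torch.float32)
print(adj_tensor.shape)  # torch.Size([3, 3])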
3. Implementing a Graph with networkx
Graphs are common enough that we don’t need to implement them from scratch every time we want to use them (although doing so, as in the previous section, is instructive).
networkx is a useful third-party library for this task.
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([
    ('A', 'B'),
    ('A', 'C'),
    ('B', 'D'),
    ('B', 'E'),
    ('C', 'F'),
    ('C', 'G'),
    ('G', 'G'),
])
print(graph)
Graph with 7 nodes and 7 edges
From this Graph object we can see our familiar adjacency list:
graph.adj
AdjacencyView({'A': {'B': {}, 'C': {}}, 'B': {'A': {}, 'D': {}, 'E': {}}, 'C': {'A': {}, 'F': {}, 'G': {}}, 'D': {'B': {}}, 'E': {'B': {}}, 'F': {'C': {}}, 'G': {'C': {}, 'G': {}}})
Or the adjacency matrix:
nx.adjacency_matrix(graph).todense()
matrix([[0, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 1, 1, 0, 0],
[1, 0, 0, 0, 0, 1, 1],
[0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 1]])
4. Graph Metrics
4.1. Degree of a Node
This is the number of edges which are incident on a node.
For an undirected graph, this is the number of edges connected to the node. Self-loops add 2 to the degree.
For a directed graph, we distinguish the indegree, \(\deg^-(V)\), the number of edges that point towards the node, from the outdegree, \(\deg^+(V)\), the number of edges that start from the node. Self-loops add 1 to the indegree and 1 to the outdegree.
graph.degree
DegreeView({'A': 2, 'B': 3, 'C': 3, 'D': 1, 'E': 1, 'F': 1, 'G': 3})
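To see indegree and outdegree in action, here is a small sketch with a hypothetical directed follower graph (the digraph and its edges are made up for illustration):

digraph = nx.DiGraph()
digraph.add_edges_from([('A', 'B'), ('A', 'C'), ('C', 'A')])

print(dict(digraph.in_degree()))   # {'A': 1, 'B': 1, 'C': 1}
print(dict(digraph.out_degree()))  # {'A': 2, 'B': 0, 'C': 1}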
4.2. Centrality
Centrality measures the importance of a given node in a network.
There are different formulations of centrality:
- Degree centrality: This is simply the degree of the node. How many neighbours?
- Closeness centrality: The reciprocal of the average shortest-path length between this node and all other nodes. How quickly can I reach everyone else?
- Betweenness centrality: The number of times a node lies on the shortest path between pairs of other nodes in the graph. How often am I the bottleneck/bridge for others?
nx.degree_centrality(graph)
{'A': 0.3333333333333333,
'B': 0.5,
'C': 0.5,
'D': 0.16666666666666666,
'E': 0.16666666666666666,
'F': 0.16666666666666666,
'G': 0.5}
nx.closeness_centrality(graph)
{'A': 0.6,
'B': 0.5454545454545454,
'C': 0.5454545454545454,
'D': 0.375,
'E': 0.375,
'F': 0.375,
'G': 0.375}
nx.betweenness_centrality(graph)
{'A': 0.6, 'B': 0.6, 'C': 0.6, 'D': 0.0, 'E': 0.0, 'F': 0.0, 'G': 0.0}
4.3. Connectedness
4.3.1. Minimum Cut
This is the minimum number of edges that would need to be removed to make the graph disconnected.
How easy is it to create isolated nodes/clusters?
This is a measure of connectedness.
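networkx can compute this directly: nx.edge_connectivity gives the size of the minimum cut and nx.minimum_edge_cut returns one such set of edges. A quick sketch on the example graph from above (the exact edge returned may vary):

print(nx.edge_connectivity(graph))  # 1 - removing a single edge can disconnect this graph
print(nx.minimum_edge_cut(graph))   # one such minimum cut, e.g. {('B', 'D')}, which isolates D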
4.3.2. Density
This is the ratio of the number of edges in the graph to the maximum possible number of edges between all nodes.
For a directed graph with \(n\) nodes, the maximum number of edges is \(n(n-1)\).
For an undirected graph with \(n\) nodes, the maximum number of edges is \(\frac{n(n-1)}{2}\).
Broadly speaking, a graph is considered dense if \(density > 0.5\) and sparse if \(density < 0.1\).
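A quick sketch comparing the undirected formula above with networkx’s built-in density on the example graph:

n = graph.number_of_nodes()
e = graph.number_of_edges()

print(e / (n * (n - 1) / 2))  # manual formula for an undirected graph
print(nx.density(graph))      # the same value computed by networkx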
5. Why Use Graph Learning?
There are several categories of learning tasks we may want to perform on a graph:
- Node classification: Predict the category of each node. E.g. categorising songs by genre.
- Link prediction: Predict missing links between nodes. E.g. friend recommendations in a social network.
- Graph classification: Categorise an entire graph.
- Graph generation: Generate a new graph based on desired properties.
There are several prominent families of graph learning techniques:
- Graph signal processing: Apply traditional signal processing techniques like Fourier analysis to graphs to study their connectivity and structure.
- Matrix factorisation: Find low-dimensional representations of large matrices, such as the graph’s adjacency matrix.
- Random walk: Simulate random walks over a graph, e.g. to generate training data for downstream models (a sketch follows this list).
- Deep learning: Encode the graph as tensors and perform deep learning.
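As a flavour of the random-walk family, here is a minimal sketch that samples a single walk over the networkx graph defined earlier (the walk length and the use of random.choice for neighbour sampling are illustrative assumptions):

import random

def random_walk(graph, start, length):
    """Sample one random walk of `length` steps starting at `start`."""
    walk = [start]
    for _ in range(length):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(random.choice(neighbors))
    return walk

print(random_walk(graph, start='A', length=4))  # e.g. ['A', 'C', 'G', 'G', 'C']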
Traditional tabular datasets focus on the entities which are represented as rows. But often, the relationships between entities contain vital information. Global features, such as network-wide statistics, may also provide useful information.