A&DS S01E08. Disjoint Sets

Brief Summary

This video by Pavel Mavrin explains the disjoint sets data structure, also known as the union-find data structure. It maintains a collection of disjoint sets, where each element belongs to exactly one set. The video covers the two basic operations, union (joining two sets) and find (determining which set an element belongs to), together with several implementations and the optimisations that improve their performance. The key takeaways are the basic array implementation, an optimisation that tracks each set's elements in a list, and a tree-based implementation whose rank heuristic and path compression yield near-constant amortized time per operation.

  • Disjoint sets maintain a collection of sets where each element belongs to one set.
  • The two main operations are "union" (joining two sets) and "find" (finding the representative of a set).
  • Different implementations, including array-based, list-based, and tree-based, offer varying performance characteristics.
  • Optimisations like path compression and rank heuristics can significantly improve the efficiency of disjoint set operations.

Introduction to Disjoint Sets

The disjoint sets data structure, also known as the union-find data structure, is introduced. It's noted that while it might seem simple, it's surprisingly important in many algorithms, especially graph algorithms which will be discussed in the third semester. The primary function of this data structure is to maintain a collection of elements grouped into disjoint sets, where each element belongs to exactly one set.

Basic Operations: Union and Find

The two fundamental operations for disjoint sets are union and find. The union operation takes two elements, x and y, and merges the sets containing these elements into a single set. The find operation, given an element x, returns the representative element of the set that x belongs to. To simplify the find operation, each set has a designated representative element, and the find function returns this representative.

Simple Array-Based Implementation

A basic implementation of disjoint sets using an array p is discussed. In this implementation, p[x] stores the representative of the set containing element x. The find operation simply looks up the representative in the array, taking constant time. The union operation involves finding the representatives of the two sets and then iterating through the array to update the representatives of all elements in one set to the representative of the other set, resulting in a linear time complexity of O(n).
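The array-based scheme above can be sketched as follows (the class and method names are illustrative, not from the video):

```python
class DisjointSetsArray:
    """Array-based disjoint sets: p[x] is the representative of x's set."""

    def __init__(self, n):
        # Initially every element is its own one-element set.
        self.p = list(range(n))

    def find(self, x):
        # O(1): the representative is stored directly.
        return self.p[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # O(n): relabel every element of ry's set to point at rx.
        for i in range(len(self.p)):
            if self.p[i] == ry:
                self.p[i] = rx
```

The full scan in `union` is exactly the linear-time cost the later optimisations are designed to remove.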

Optimization with Lists

An optimisation is introduced to improve the union operation by maintaining a list of elements for each representative. Instead of iterating through all elements in the array, the union operation iterates only over the elements in the smaller set, updating their representatives and moving them to the list of the larger set. While this optimisation can still have a linear time complexity in the worst case, it improves the overall performance by reducing the number of iterations.
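A minimal sketch of this list-based variant, with illustrative names:

```python
class DisjointSetsList:
    """Array-based disjoint sets, but each representative also keeps a
    list of its set's elements, so union touches only one set."""

    def __init__(self, n):
        self.p = list(range(n))
        self.members = [[i] for i in range(n)]

    def find(self, x):
        return self.p[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Relabel only the elements of ry's set, not the whole array.
        for e in self.members[ry]:
            self.p[e] = rx
        self.members[rx].extend(self.members[ry])
        self.members[ry] = []
```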

Weighted Union Optimization

To further optimise the union operation, a weighted union approach is introduced. Before merging two sets, the algorithm compares the sizes of the sets and always moves the elements from the smaller set to the larger set. This ensures that each element is moved a maximum of log n times, where n is the total number of elements. As a result, the total time complexity for all union operations becomes O(n log n).
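The weighted union is the same list-based structure plus one comparison before merging; a self-contained sketch (names are illustrative):

```python
class WeightedDisjointSets:
    """List-based disjoint sets with weighted (union-by-size) merging."""

    def __init__(self, n):
        self.p = list(range(n))
        self.members = [[i] for i in range(n)]

    def find(self, x):
        return self.p[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Weighted union: always move the smaller set into the larger one.
        # Each element then changes its representative at most log2(n)
        # times, so all unions together cost O(n log n).
        if len(self.members[rx]) < len(self.members[ry]):
            rx, ry = ry, rx
        for e in self.members[ry]:
            self.p[e] = rx
        self.members[rx].extend(self.members[ry])
        self.members[ry] = []
```

The log n bound holds because an element is only moved when its set is merged into one at least as large, so its set's size at least doubles with every move.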

Tree-Based Implementation

A more advanced implementation of disjoint sets using trees is explained. Each set is represented as a tree, where the root of the tree is the representative of the set. An array p is used to store the parent of each element in the tree. The find operation involves traversing the tree from an element to the root, while the union operation involves attaching the root of one tree to the root of another tree.
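A bare sketch of this forest representation, before any balancing heuristics:

```python
class DisjointSetsForest:
    """Tree-based disjoint sets: p[x] is the parent of x; a root is its
    own parent and serves as the representative of its set."""

    def __init__(self, n):
        self.p = list(range(n))

    def find(self, x):
        # Walk up the tree until we reach the root.
        while self.p[x] != x:
            x = self.p[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Attach one root under the other.
            self.p[rx] = ry
```

Without further care the trees can degenerate into long chains, making `find` linear; the heuristics below fix this.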

Rank Heuristics for Balanced Trees

To prevent the trees from becoming unbalanced, a rank heuristic is introduced. The rank of an element is the height of its subtree: the maximum distance from the element down to a leaf. When merging two trees, the root with the smaller rank is attached under the root with the larger rank; if the ranks are equal, the rank of the resulting root is incremented by one. This keeps the height of every tree logarithmic, so the find operation runs in O(log n) time.
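The rank heuristic adds one array and one comparison to the forest version; a self-contained sketch with illustrative names:

```python
class RankedForest:
    """Tree-based disjoint sets with the rank heuristic."""

    def __init__(self, n):
        self.p = list(range(n))
        self.rank = [0] * n  # height of the tree rooted at each node

    def find(self, x):
        while self.p[x] != x:
            x = self.p[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Attach the lower-rank tree under the higher-rank root:
        # the merged tree's height does not grow.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.p[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            # Equal ranks: the merged tree is one level taller.
            self.rank[rx] += 1
```

A root of rank r has at least 2^r descendants, which is why the height, and hence `find`, stays O(log n).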

Path Compression Optimization

Path compression is introduced as another optimisation technique. During the find operation, as the algorithm traverses the tree from an element to the root, it updates the parent pointers of all visited nodes to point directly to the root. This flattens the tree structure, reducing the time complexity of future find operations for those nodes.
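One way to sketch a compressing find over a plain parent array (a two-pass variant; the video may compress recursively instead):

```python
def find(p, x):
    """Find with path compression on the parent array p."""
    # First pass: locate the root of x's tree.
    root = x
    while p[root] != root:
        root = p[root]
    # Second pass: repoint every node on the path directly at the root,
    # flattening the tree for future queries.
    while p[x] != root:
        p[x], x = root, p[x]
    return root
```

For example, on the chain `p = [1, 2, 3, 3]`, calling `find(p, 0)` returns 3 and flattens `p` to `[3, 3, 3, 3]`, so every later query on these nodes takes one step.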

Amortized Analysis and Inverse Ackermann Function

The video discusses the amortized time complexity of disjoint set operations with both path compression and the rank heuristic. The amortized cost per operation is O(α(m, n)), where α(m, n) is the inverse Ackermann function, m is the number of operations, and n is the number of elements. The inverse Ackermann function grows extremely slowly (it is at most 4 for any input that fits in a computer's memory), making disjoint set operations effectively constant time in practice.

Iterated Logarithm Upper Bound

To provide a more understandable upper bound on the time complexity, the iterated logarithm function is introduced. The iterated logarithm of a number n is the number of times the logarithm function must be applied to n to reach a value less than or equal to 1. The amortized time complexity of the find operation with path compression and rank heuristics can be approximated as O(log* n), where log* n is the iterated logarithm of n.
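The definition above translates directly into a small helper (illustrative, not from the video):

```python
import math

def log_star(n):
    """Iterated logarithm: how many times log2 must be applied to n
    before the value drops to 1 or below."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

`log_star(65536)` is 4 (65536 → 16 → 4 → 2 → 1), and even 2^65536, a number with almost twenty thousand digits, has an iterated logarithm of only 5, which is why O(log* n) is effectively constant.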

Proof of Iterated Logarithm Time Complexity

A proof is presented to demonstrate that the amortized time complexity of the find operation is bounded by the iterated logarithm function. The edges in the paths traversed during find operations are divided into small jumps and big jumps based on the rank difference between the nodes. By analysing the number of small and big jumps, it is shown that the total time complexity for all find operations is O(m log* n), where m is the number of operations.

© 2024 BriefRead