DSA Toolbox Cheat Sheet
BinarySearch:
How: A highly efficient search method that operates on a sorted array by repeatedly dividing the search interval in half. It begins by comparing the middle element of the array with the target value:
- If the middle element matches the target, the search is complete.
- If the middle element is greater than the target, Binary Search restricts the search to the left half of the array.
- If the middle element is less than the target, it restricts the search to the right half. This process repeats until the target is found or the interval is empty.
- Requires: Sorted data; Binary Search only works on data that is in ascending or descending order.
- Time Complexity: O(log n), as the array is halved with each step, leading to a logarithmic number of comparisons.
- Space Complexity: O(1), requiring no additional storage aside from a few variables for indexing.
- When to Use: Best suited for large, sorted datasets where the primary goal is fast lookup. Commonly used in applications like database indexing, dictionary search, and any scenario where data is pre-sorted for efficient retrieval.
```typescript
const sortedData = [1, 3, 5, 7, 9, 11];
const target = 7;
console.log("Sorted Data:", sortedData);
const index = binarySearch(sortedData, target, { compareFn: (a, b) => a - b, isSorted: true });
console.log("Binary Search Result:", index !== -1 ? `Found at index ${index}` : "Not found");
```
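To make the halving visible, here is a minimal standalone sketch of the loop (for illustration only; the toolbox's `binarySearch` additionally takes `compareFn` and `isSorted` options):

```typescript
// Minimal iterative binary search over a sorted number array.
function binarySearchSketch(arr: number[], target: number): number {
  let low = 0;
  let high = arr.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;        // middle of the current interval
    if (arr[mid] === target) return mid;  // exact match
    if (arr[mid] < target) low = mid + 1; // discard the left half
    else high = mid - 1;                  // discard the right half
  }
  return -1; // interval became empty: target is absent
}

console.log(binarySearchSketch([1, 3, 5, 7, 9, 11], 7)); // 3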
ExponentialSearch:
How: Exponential Search is a two-phase search technique designed for large, sorted datasets. It begins by finding an interval where the target might be located using exponential growth, then performs a binary search within this narrowed range:
- Exponential Range Expansion: Starting from the first element, it checks elements at progressively larger intervals (1, 2, 4, 8, 16, etc.) until it finds an element greater than or equal to the target. This approach quickly locates a possible range containing the target.
- Binary Search Within the Range: Once an interval [low, high] is identified (where `target` is likely to lie between these indices), Exponential Search performs a binary search within this range, leveraging the efficiency of binary search to find the exact position of the target.
- Requires: Sorted data, as both exponential expansion and binary search rely on order to narrow down the search efficiently.
- Time Complexity: O(log n), with the exponential range expansion taking O(log n) and the binary search within the narrowed range taking an additional O(log n) in the worst case.
- Space Complexity: O(1), as it operates in place without requiring additional data structures.
- When to Use: Exponential Search is particularly useful for very large, sorted datasets where the approximate position of the target is unknown, such as in unbounded or infinite-length arrays (e.g., data streams) or when accessing elements in a remote database or file where direct access is costly.
```typescript
const sortedData = [1, 3, 5, 7, 9, 11, 13, 15];
const target = 13;
console.log("Sorted Data:", sortedData);
const index = exponentialSearch(sortedData, target, { compareFn: (a, b) => a - b, isSorted: true });
console.log("Exponential Search Result:", index !== -1 ? `Found at index ${index}` : "Not found");
```
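The two phases can be sketched in a standalone form (an illustrative version, not the toolbox's implementation):

```typescript
// Phase 1: grow the bound exponentially; Phase 2: binary search in [bound/2, bound].
function exponentialSearchSketch(arr: number[], target: number): number {
  if (arr.length === 0) return -1;
  if (arr[0] === target) return 0;
  let bound = 1;
  while (bound < arr.length && arr[bound] < target) bound *= 2; // 1, 2, 4, 8, ...
  let low = Math.floor(bound / 2);
  let high = Math.min(bound, arr.length - 1);
  while (low <= high) { // ordinary binary search inside the located range
    const mid = (low + high) >> 1;
    if (arr[mid] === target) return mid;
    if (arr[mid] < target) low = mid + 1;
    else high = mid - 1;
  }
  return -1;
}

console.log(exponentialSearchSketch([1, 3, 5, 7, 9, 11, 13, 15], 13)); // 6
```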
HybridSearch:
- How: Combines linear search for small datasets and binary search for larger ones. The search type adapts based on dataset size.
- Requires: No pre-sorting; unsorted data is handled by the linear path, while sorted data enables the binary search path.
- Time Complexity: O(n) when the linear path is taken (small or unsorted data), O(log n) when the binary path is taken (large, sorted data)
- Space Complexity: O(1)
- When to Use: Useful when the dataset size varies, optimizing efficiency without prior sorting.
```typescript
const data = [3, 1, 4, 1, 5, 9];
const target = 4;
console.log("Data:", data);
const index = hybridSearch(data, target, { compareFn: (a, b) => a - b, isSorted: false });
console.log("Hybrid Search Result:", index !== -1 ? `Found at index ${index}` : "Not found");
```
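A plausible sketch of the dispatch logic; the threshold value and the exact fallback strategy inside the toolbox's `hybridSearch` are assumptions here:

```typescript
// Hypothetical dispatcher: small or unsorted input -> linear scan,
// large sorted input -> binary search. The threshold 32 is an assumption.
function hybridSearchSketch(arr: number[], target: number, isSorted: boolean): number {
  const THRESHOLD = 32;
  if (!isSorted || arr.length < THRESHOLD) {
    return arr.indexOf(target); // O(n) linear scan
  }
  let low = 0, high = arr.length - 1;
  while (low <= high) {         // O(log n) binary search
    const mid = (low + high) >> 1;
    if (arr[mid] === target) return mid;
    if (arr[mid] < target) low = mid + 1;
    else high = mid - 1;
  }
  return -1;
}

console.log(hybridSearchSketch([3, 1, 4, 1, 5, 9], 4, false)); // 2
```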
LinearSearch:
- How: Sequentially checks each element until the target is found, making it suitable for unsorted data.
- Requires: Unsorted or sorted data
- Time Complexity: O(n)
- Space Complexity: O(1)
- When to Use: Best for small or unsorted datasets due to simplicity.
```typescript
const array = [4, 1, 7, 2, 9];
const target = 7;
// Steps: check 4 -> no match, check 1 -> no match, check 7 -> target found!
let result = -1;
for (let i = 0; i < array.length; i++) {
  if (array[i] === target) { result = i; break; }
}
console.log(result !== -1 ? `Found at index ${result}` : "Not found"); // Found at index 2
```
TernarySearch:
- How: Divides the search range into three parts instead of two, refining the search in sorted data by comparing the target with two middle elements.
- Requires: Sorted data
- Time Complexity: O(log₃ n)
- Space Complexity: O(1)
- When to Use: For sorted arrays, potentially improving efficiency in some cases over binary search.
```typescript
function ternarySearch(arr: number[], target: number, low = 0, high = arr.length - 1): number {
  if (low > high) return -1;
  const third = Math.floor((high - low) / 3);
  const mid1 = low + third;  // first cut point
  const mid2 = high - third; // second cut point
  if (arr[mid1] === target) return mid1;
  if (arr[mid2] === target) return mid2;
  if (target < arr[mid1]) return ternarySearch(arr, target, low, mid1 - 1);  // left third
  if (target > arr[mid2]) return ternarySearch(arr, target, mid2 + 1, high); // right third
  return ternarySearch(arr, target, mid1 + 1, mid2 - 1);                     // middle third
}

const array = [1, 2, 5, 8, 9, 12, 15];
console.log(ternarySearch(array, 9)); // Found at index 4
```
HeapSort:
- How: HeapSort first transforms the array into a max-heap, a binary tree where each parent node is greater than or equal to its children. It then repeatedly extracts the maximum element (the root of the heap) and places it at the end of the array, reducing the heap size and re-heapifying the remaining elements. This process results in a sorted array.
- Requires: Unsorted data
- Time Complexity: O(n log n), with O(log n) time for re-heapifying after each extraction and n extractions in total.
- Space Complexity: O(1) for in-place sorting (assuming the heap is built in the input array).
- When to Use: Ideal when memory is a constraint, as it sorts in place without additional memory. However, it does not maintain the relative order of equal elements (not stable), making it less suitable when sorting stability is required.
```typescript
const data = [12, 3, 9, 1, 4, 7];
console.log("Unsorted:", data);
const sortedData = heapSort([...data], (a, b) => a - b);
console.log("Heap Sorted:", sortedData);
```
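The heart of the algorithm is the sift-down (heapify) step; a compact standalone sketch:

```typescript
// Restore the max-heap property for the subtree rooted at i, within arr[0..size-1].
function siftDown(arr: number[], i: number, size: number): void {
  while (true) {
    const left = 2 * i + 1, right = 2 * i + 2;
    let largest = i;
    if (left < size && arr[left] > arr[largest]) largest = left;
    if (right < size && arr[right] > arr[largest]) largest = right;
    if (largest === i) return;
    [arr[i], arr[largest]] = [arr[largest], arr[i]]; // swap parent with larger child
    i = largest;
  }
}

function heapSortSketch(arr: number[]): number[] {
  for (let i = Math.floor(arr.length / 2) - 1; i >= 0; i--) siftDown(arr, i, arr.length); // build max-heap
  for (let end = arr.length - 1; end > 0; end--) {
    [arr[0], arr[end]] = [arr[end], arr[0]]; // move current max to its final slot
    siftDown(arr, 0, end);                   // re-heapify the shrunken heap
  }
  return arr;
}

console.log(heapSortSketch([12, 3, 9, 1, 4, 7])); // [1, 3, 4, 7, 9, 12]
```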
MergeSort:
- How: MergeSort is a divide-and-conquer algorithm that recursively splits the array into halves, sorts each half, and then merges them back together in sorted order. The merging process ensures that each element is placed in the correct position, resulting in a fully sorted array.
- Requires: Unsorted data
- Time Complexity: O(n log n), as each level of recursion takes O(n) for merging, and there are O(log n) levels.
- Space Complexity: O(n), as it requires additional storage for merging the sorted halves.
- When to Use: MergeSort is stable and ideal for large datasets where sorting stability is important. It's also well-suited for linked lists, as it doesn't require random access to elements and can merge linked lists efficiently without additional space overhead.
```typescript
const data = [5, 3, 8, 6, 2, 7, 4, 1];
console.log("Unsorted:", data);
const sortedData = mergeSort([...data], (a, b) => a - b);
console.log("Merge Sorted:", sortedData);
```
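The merge step that does the real work, as a standalone sketch:

```typescript
// Merge two already-sorted arrays into one sorted array.
function merge(left: number[], right: number[]): number[] {
  const out: number[] = [];
  let i = 0, j = 0;
  while (i < left.length && j < right.length) {
    out.push(left[i] <= right[j] ? left[i++] : right[j++]); // <= keeps the sort stable
  }
  return out.concat(left.slice(i), right.slice(j)); // append the leftover tail
}

function mergeSortSketch(arr: number[]): number[] {
  if (arr.length <= 1) return arr;
  const mid = arr.length >> 1;
  return merge(mergeSortSketch(arr.slice(0, mid)), mergeSortSketch(arr.slice(mid)));
}

console.log(mergeSortSketch([5, 3, 8, 6, 2, 7, 4, 1])); // [1, 2, 3, 4, 5, 6, 7, 8]
```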
TimSort:
- How: TimSort is a hybrid sorting algorithm based on MergeSort and Insertion Sort. It starts by breaking the array into small runs (sections of partially sorted elements), which it sorts individually using Insertion Sort. It then merges these sorted runs using a process similar to MergeSort. TimSort is particularly optimized for real-world datasets, where data often contains sorted or partially sorted sequences.
- Requires: Unsorted data (but takes advantage of partially sorted sequences for efficiency)
- Time Complexity: O(n log n) on average, O(n) in best cases when the data is already mostly sorted.
- Space Complexity: O(n), as it requires additional memory to handle merging.
- When to Use: TimSort is the default sort in JavaScript engines, Python, and Java (for object arrays) because of its efficiency with real-world data. It's particularly useful when handling datasets with pre-existing order, as it can exploit partially sorted runs to improve performance.
```typescript
const data = [12, 4, 7, 1, 9, 5];
console.log("Unsorted:", data);
const sortedData = timSort([...data], (a, b) => a - b);
console.log("Tim Sorted:", sortedData);
```
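A heavily simplified illustration of the run-then-merge idea. Real TimSort detects natural runs, computes a minrun of 32-64, and uses galloping and a merge stack; this sketch only sorts fixed-size runs with insertion sort and merges them pairwise:

```typescript
const RUN = 4; // real TimSort computes a minrun of 32-64; 4 keeps the demo small

// Insertion sort a slice in place -- fast on short or nearly sorted runs.
function insertionSortRange(arr: number[], lo: number, hi: number): void {
  for (let i = lo + 1; i <= hi; i++) {
    const v = arr[i];
    let j = i - 1;
    while (j >= lo && arr[j] > v) { arr[j + 1] = arr[j]; j--; }
    arr[j + 1] = v;
  }
}

// Stable merge of two sorted arrays.
function mergeRuns(a: number[], b: number[]): number[] {
  const out: number[] = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) out.push(a[i] <= b[j] ? a[i++] : b[j++]);
  return out.concat(a.slice(i), b.slice(j));
}

function timSortSketch(arr: number[]): number[] {
  // Phase 1: sort fixed-size runs with insertion sort.
  for (let lo = 0; lo < arr.length; lo += RUN) {
    insertionSortRange(arr, lo, Math.min(lo + RUN - 1, arr.length - 1));
  }
  // Phase 2: merge runs pairwise, doubling the width each pass.
  for (let width = RUN; width < arr.length; width *= 2) {
    for (let lo = 0; lo < arr.length; lo += 2 * width) {
      const mid = Math.min(lo + width, arr.length);
      const hi = Math.min(lo + 2 * width, arr.length);
      const merged = mergeRuns(arr.slice(lo, mid), arr.slice(mid, hi));
      for (let k = 0; k < merged.length; k++) arr[lo + k] = merged[k];
    }
  }
  return arr;
}

console.log(timSortSketch([12, 4, 7, 1, 9, 5])); // [1, 4, 5, 7, 9, 12]
```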
LFU (Least Frequently Used):
- How: LFU cache removes the items that have been accessed the fewest times when the cache reaches its capacity. Each item in the cache has an associated frequency count, which increments each time the item is accessed. When the cache is full and a new item needs to be added, the item with the lowest frequency count is evicted. If multiple items have the same frequency, the one that has been in the cache the longest is usually removed (to ensure predictable behavior).
- Time Complexity: O(1) for access (using hash tables) and O(log n) for eviction (using priority queues or frequency lists).
- Space Complexity: O(n), where n is the number of items in the cache.
- When to Use: LFU is ideal for scenarios where items that are accessed most frequently are likely to be accessed again, such as in databases, content delivery networks, and recommendation systems where popular items need to stay in memory longer.
```typescript
const lfuCache = new LFUCache<number, string>(3); // Cache with capacity 3
lfuCache.set(1, "apple");
lfuCache.set(2, "banana");
lfuCache.get(1); // Access "apple"
lfuCache.set(3, "cherry");
lfuCache.set(4, "date"); // Evicts least frequently used item
console.log("LFU Cache Contents:", lfuCache);
```
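The eviction policy can be illustrated with a minimal map-based sketch (an O(n) eviction scan for clarity; the toolbox's `LFUCache` internals may differ):

```typescript
// Minimal LFU sketch: O(n) eviction scan, enough to show the policy.
class LFUSketch<K, V> {
  private store = new Map<K, { value: V; freq: number; insertedAt: number }>();
  private tick = 0;
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    entry.freq++; // every access bumps the frequency count
    return entry.value;
  }

  set(key: K, value: V): void {
    if (!this.store.has(key) && this.store.size >= this.capacity) {
      // Evict the lowest-frequency entry; break ties by insertion age.
      let victim: K | undefined;
      let best = Infinity, oldest = Infinity;
      for (const [k, e] of this.store) {
        if (e.freq < best || (e.freq === best && e.insertedAt < oldest)) {
          victim = k; best = e.freq; oldest = e.insertedAt;
        }
      }
      if (victim !== undefined) this.store.delete(victim);
    }
    // A fresh set resets the count in this simplified sketch.
    this.store.set(key, { value, freq: 1, insertedAt: this.tick++ });
  }
}
```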
LRU (Least Recently Used):
- How: LRU cache removes the least recently accessed items to make room for new entries. Each time an item is accessed, it is moved to the most recent position in the cache (typically represented by a linked list or doubly linked list). When the cache reaches its capacity, the item that hasn't been accessed the longest is evicted. LRU maintains ordering based on recency of access, ensuring that recently accessed items stay in the cache.
- Time Complexity: O(1) for access and eviction, as both can be efficiently managed with a hash table and doubly linked list.
- Space Complexity: O(n), where n is the number of items in the cache.
- When to Use: LRU is optimal when there is a high likelihood that recently accessed items will be accessed again soon. It's commonly used in operating systems, database caches, and web browsers where user access patterns follow temporal locality (i.e., recently used items are likely to be reused in the near future).
```typescript
const lruCache = new LRUCache<number, string>(3); // Cache with capacity 3
lruCache.set(1, "apple");
lruCache.set(2, "banana");
lruCache.get(1); // Access "apple"
lruCache.set(3, "cherry");
lruCache.set(4, "date"); // Evicts least recently used item
console.log("LRU Cache Contents:", lruCache);
```
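Because a JavaScript `Map` preserves insertion order, the recency ordering can be sketched in a few lines (the hash-table-plus-doubly-linked-list design described above gives strict O(1); this `Map` version is just an illustration):

```typescript
// Minimal LRU sketch using Map's insertion order as the recency order.
class LRUSketch<K, V> {
  private store = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    if (!this.store.has(key)) return undefined;
    const value = this.store.get(key)!;
    this.store.delete(key);      // re-insert to mark as most recently used
    this.store.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.store.has(key)) this.store.delete(key);
    else if (this.store.size >= this.capacity) {
      const lru = this.store.keys().next().value as K; // first key = least recent
      this.store.delete(lru);
    }
    this.store.set(key, value);
  }
}
```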
MaxHeap:
- How: A MaxHeap is a binary tree where each parent node has a value greater than or equal to its children. The largest element is always at the root. When elements are added, they "bubble up" if they are greater than their parent, ensuring the maximum element remains at the top. When the root (largest element) is removed, the last element in the heap takes its place, and it "bubbles down" to maintain the heap property.
- Requires: Unsorted data; the heap organizes data internally.
- Time Complexity: O(log n) for insertion and deletion, as elements may need to bubble up or down the tree.
- Space Complexity: O(n) for storing all elements.
- When to Use: MaxHeap is ideal for priority queues where quick access to the largest element is required. Common applications include scheduling algorithms, simulations, and implementing the HeapSort algorithm.
```typescript
const maxHeap = new MaxHeap<number>();
maxHeap.insert(10);
maxHeap.insert(5);
maxHeap.insert(20);
console.log("Max element:", maxHeap.extract()); // Output: 20
console.log("Heap after extraction:", maxHeap);
```
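The bubble-up and bubble-down moves look like this in a standalone array-backed sketch:

```typescript
// Array-backed max-heap: parent of i is (i - 1) >> 1, children are 2i + 1 and 2i + 2.
class MaxHeapSketch {
  private a: number[] = [];

  insert(v: number): void {
    this.a.push(v);
    let i = this.a.length - 1;
    while (i > 0 && this.a[i] > this.a[(i - 1) >> 1]) { // bubble up past smaller parents
      const p = (i - 1) >> 1;
      [this.a[i], this.a[p]] = [this.a[p], this.a[i]];
      i = p;
    }
  }

  extract(): number | undefined {
    if (this.a.length === 0) return undefined;
    const max = this.a[0];
    const last = this.a.pop()!;
    if (this.a.length > 0) {
      this.a[0] = last; // move last element to the root, then bubble down
      let i = 0;
      while (true) {
        const l = 2 * i + 1, r = 2 * i + 2;
        let big = i;
        if (l < this.a.length && this.a[l] > this.a[big]) big = l;
        if (r < this.a.length && this.a[r] > this.a[big]) big = r;
        if (big === i) break;
        [this.a[i], this.a[big]] = [this.a[big], this.a[i]];
        i = big;
      }
    }
    return max;
  }
}

const h = new MaxHeapSketch();
[10, 5, 20].forEach((v) => h.insert(v));
console.log(h.extract()); // 20
```

A MinHeap works identically with the comparisons reversed.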
MinHeap:
- How: A MinHeap is a binary tree where each parent node has a value less than or equal to its children, with the smallest element at the root. Similar to MaxHeap, insertion involves "bubbling up" smaller elements to maintain the order, and removing the root (smallest element) involves "bubbling down" the new root.
- Requires: Unsorted data; the heap organizes data internally.
- Time Complexity: O(log n) for insertion and deletion, as elements may need to adjust their position within the tree.
- Space Complexity: O(n) for storing all elements.
- When to Use: MinHeap is perfect for priority queues where quick access to the smallest element is important. It's often used in algorithms like Dijkstra's shortest path, Prim's minimum spanning tree, and other scenarios where the smallest element must be processed first.
```typescript
const minHeap = new MinHeap<number>();
minHeap.insert(9);
minHeap.insert(4);
minHeap.insert(7);
console.log("Extract Min:", minHeap.extract()); // Output: 4
console.log("Heap After Extraction:", minHeap);
```
LinkedList:
- How: A sequential collection of nodes where each node contains data and a reference (or pointer) to the next node in the sequence. Unlike arrays, LinkedLists do not require contiguous memory, allowing nodes to be scattered in memory. New nodes can be added at the head, tail, or any position without shifting elements.
- Requires: Unsorted or sorted data, depending on the application.
- Time Complexity:
  - O(n) for search, as it must traverse nodes sequentially.
  - O(1) for insertion or deletion at the head.
- Space Complexity: O(n), where n is the number of nodes.
- When to Use: Ideal when frequent insertions and deletions are required without shifting elements, such as in dynamic data structures, stacks, and queues. It's also useful when memory reallocation is undesirable, as in implementing real-time applications or dynamic memory systems.
```typescript
const linkedList = new LinkedList<number>();
linkedList.append(3);
linkedList.append(5);
linkedList.append(7);
console.log("Search 5:", linkedList.contains(5) ? "Found" : "Not found");
linkedList.remove(5);
console.log("Search 5 after deletion:", linkedList.contains(5) ? "Found" : "Not found");
```
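The pointer mechanics behind O(1) insertion at the head, in a standalone sketch:

```typescript
interface ListNode<T> { value: T; next: ListNode<T> | null; }

// Prepend: the new node points at the old head; no elements shift.
function prepend<T>(head: ListNode<T> | null, value: T): ListNode<T> {
  return { value, next: head };
}

let head: ListNode<number> | null = null;
for (const v of [7, 5, 3]) head = prepend(head, v); // list is now 3 -> 5 -> 7
console.log(head?.value); // 3
```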
DoublyLinkedList:
- How: Similar to a LinkedList, but with each node containing pointers to both the previous and next nodes. This bidirectional linking allows traversal in both forward and backward directions, improving flexibility. DoublyLinkedLists allow for efficient deletion and insertion at both ends, as well as the ability to delete nodes from the middle without needing access to the head.
- Requires: Unsorted or sorted data, depending on the application.
- Time Complexity:
  - O(n) for search.
  - O(1) for insertion or deletion at both the head and tail.
- Space Complexity: O(n), as each node requires additional memory to store a pointer to the previous node.
- When to Use: Best when bidirectional traversal is needed, such as in browser history (moving back and forth between pages), undo/redo functionality in text editors, and implementing complex data structures like LRU caches. It's also beneficial when frequent insertions and deletions are required at both ends.
```typescript
const doublyLinkedList = new DoublyLinkedList<number>();
doublyLinkedList.append(2);
doublyLinkedList.append(4);
doublyLinkedList.append(6);
console.log("Search 4:", doublyLinkedList.contains(4) ? "Found" : "Not found");
doublyLinkedList.remove(4);
console.log("Search 4 after deletion:", doublyLinkedList.contains(4) ? "Found" : "Not found");
```
Queue:
- How: A First-In-First-Out (FIFO) data structure where elements are added to the rear (enqueue) and removed from the front (dequeue). This sequential order of operations ensures that the first element added is also the first to be removed. Queues can be implemented using arrays or linked lists, and are often circular to maximize space efficiency in fixed-size implementations.
- Time Complexity: O(1) for both enqueue (adding an element to the rear) and dequeue (removing an element from the front), as these operations affect only the ends of the queue.
- Space Complexity: O(n), where n is the number of elements in the queue.
- When to Use: Ideal for processes that follow a first-in, first-out order, such as task scheduling, buffering data streams, handling requests in network traffic, and breadth-first search in graphs.
```typescript
const queue = new Queue<number>();
queue.enqueue(1);
queue.enqueue(2);
queue.enqueue(3);
console.log("Dequeue:", queue.dequeue()); // Output: 1
console.log("Current Queue:", queue);
```
Stack:
- How: A Last-In-First-Out (LIFO) data structure where elements are added (pushed) and removed (popped) only from the top. This makes the most recently added element the first to be removed. Stacks can be implemented using arrays or linked lists, depending on whether a fixed or dynamic size is required.
- Time Complexity: O(1) for push (adding an element to the top) and pop (removing an element from the top), as these operations only modify the top element.
- Space Complexity: O(n), where n is the number of elements in the stack.
- When to Use: Stacks are useful for reversing elements, handling backtracking tasks (like navigating back in browser history), and managing function calls in programming languages, as each function call is pushed onto the stack and removed upon completion. Stacks are also essential in algorithms for evaluating expressions and parsing.
```typescript
const stack = new Stack<number>();
stack.push(10);
stack.push(20);
stack.push(30);
console.log("Pop:", stack.pop()); // Output: 30
console.log("Current Stack:", stack);
```
BinarySearchTree:
- How: A hierarchical structure where each node has up to two children. Nodes are arranged such that all values in the left subtree are less than the root, and all values in the right subtree are greater than the root. This organization maintains a sorted order for easy access.
- Requires: Initially unsorted data for insertion; maintains ordered structure automatically after insertion.
- Time Complexity: O(log n) on average for insert, search, and delete, but can degrade to O(n) if the tree becomes unbalanced.
- Space Complexity: O(n)
- When to Use: For applications needing efficient, dynamic insertion and deletion in a sorted structure, such as database indexing.
```typescript
const bst = new BinarySearchTree<number>();
bst.insert(5);
bst.insert(3);
bst.insert(7);
console.log("Search 5:", bst.search(5) ? "Found" : "Not found");
console.log("Delete 3");
bst.delete(3);
console.log("Search 3:", bst.search(3) ? "Found" : "Not found");
```
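The ordering invariant is easiest to see in a recursive standalone sketch:

```typescript
interface BSTNode { value: number; left: BSTNode | null; right: BSTNode | null; }

// Smaller values go left, larger go right; duplicates are ignored here.
function bstInsert(node: BSTNode | null, value: number): BSTNode {
  if (node === null) return { value, left: null, right: null };
  if (value < node.value) node.left = bstInsert(node.left, value);
  else if (value > node.value) node.right = bstInsert(node.right, value);
  return node;
}

function bstSearch(node: BSTNode | null, value: number): boolean {
  if (node === null) return false;
  if (value === node.value) return true;
  return value < node.value ? bstSearch(node.left, value) : bstSearch(node.right, value);
}

let root: BSTNode | null = null;
for (const v of [5, 3, 7]) root = bstInsert(root, v);
console.log(bstSearch(root, 3)); // true
```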
AVLTree:
- How: A self-balancing binary search tree that keeps the height difference (balance factor) between the left and right subtrees of any node within {-1, 0, 1}. When an insertion or deletion pushes a balance factor outside this range, the AVL tree performs rotations (single or double) to restore balance (see the rotation sketch after the example below). This self-balancing prevents the tree from becoming skewed, maintaining efficient search times.
- Requires: Unsorted data; automatically orders and balances itself.
- Time Complexity: O(log n) for insert, delete, and search
- Space Complexity: O(n)
- When to Use: Ideal for applications with frequent inserts and deletes, where balanced height is crucial for consistent performance, like in-memory databases and priority queues.
```typescript
const avlTree = new AVLTree<number>();
avlTree.insert(15);
avlTree.insert(10);
avlTree.insert(20);
console.log("Search 15:", avlTree.search(15) ? "Found" : "Not found");
console.log("Delete 10");
avlTree.delete(10);
console.log("Search 10:", avlTree.search(10) ? "Found" : "Not found");
```
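A right rotation, the basic rebalancing move, in a standalone sketch (a full AVL insert also updates heights along the insertion path and chooses among the four rotation cases):

```typescript
interface AVLNode { value: number; left: AVLNode | null; right: AVLNode | null; height: number; }

const height = (n: AVLNode | null): number => (n ? n.height : 0);
// A node needs rebalancing when Math.abs(balanceFactor(node)) > 1.
const balanceFactor = (n: AVLNode): number => height(n.left) - height(n.right);

// Right rotation: fixes a left-heavy subtree (balance factor > 1).
//       y              x
//      / \            / \
//     x   C   ->     A   y
//    / \                / \
//   A   B              B   C
function rotateRight(y: AVLNode): AVLNode {
  const x = y.left!;        // assumes the left child exists (left-heavy case)
  y.left = x.right;         // move subtree B under y
  x.right = y;              // y becomes x's right child
  y.height = 1 + Math.max(height(y.left), height(y.right));
  x.height = 1 + Math.max(height(x.left), height(x.right));
  return x;                 // x is the new subtree root
}
```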
RedBlackTree:
- How: A self-balancing binary search tree that uses color properties and rotations to maintain balance. Each node is colored red or black, and these colors ensure that the longest path from the root to a leaf is no more than twice the length of the shortest path, keeping the tree height logarithmic. This color-coding allows it to balance itself after insertions or deletions without needing to be perfectly balanced.
- Requires: Unsorted data; automatically balances and orders itself.
- Time Complexity: O(log n) for insert, delete, and search
- Space Complexity: O(n)
- When to Use: Suitable for applications requiring frequent updates with guaranteed balance, such as associative containers (e.g., maps, sets) and memory management subsystems.
```typescript
const rbTree = new RedBlackTree<number>();
rbTree.insert(10);
rbTree.insert(20);
rbTree.insert(5);
console.log("Search 10:", rbTree.search(10) ? "Found" : "Not found");
console.log("Delete 10");
rbTree.delete(10);
console.log("Search 10:", rbTree.search(10) ? "Found" : "Not found");
```
Trie:
- How: A prefix tree where each node represents a character in a string, organizing words by common prefixes. Each path from the root to a terminal node spells out a word, and shared prefixes are stored only once, enabling efficient storage and retrieval. This makes Tries particularly effective for word-based operations.
- Requires: Unsorted data; organizes words or sequences by shared prefixes.
- Time Complexity: O(m) for insert, search, and delete, where m is the length of the word
- Space Complexity: O(n * m), where n is the number of words and m is their average length
- When to Use: For applications like autocomplete, dictionary implementations, spell-checking, and IP routing, where prefix-based searches are common and need to be efficient.
```typescript
const trie = new Trie();
trie.insert("cat");
trie.insert("can");
trie.insert("cap");
console.log("Search 'cat':", trie.search("cat") ? "Found" : "Not found");
console.log("Search 'dog':", trie.search("dog") ? "Found" : "Not found");
```
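A node-per-character standalone sketch of insert and search:

```typescript
class TrieNodeSketch {
  children = new Map<string, TrieNodeSketch>();
  isEndOfWord = false;
}

class TrieSketch {
  private root = new TrieNodeSketch();

  insert(word: string): void {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNodeSketch());
      node = node.children.get(ch)!; // walk/extend the path for this character
    }
    node.isEndOfWord = true; // mark that a full word terminates here
  }

  search(word: string): boolean {
    let node = this.root;
    for (const ch of word) {
      const next = node.children.get(ch);
      if (!next) return false;
      node = next;
    }
    return node.isEndOfWord;
  }
}

const t = new TrieSketch();
["cat", "can", "cap"].forEach((w) => t.insert(w)); // the "ca" prefix is stored once
console.log(t.search("cat"), t.search("dog")); // true false
```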
B-Tree:
- How: A self-balancing tree that maintains sorted data and is optimized for systems requiring minimal disk I/O. B-Trees allow nodes to have multiple children, supporting efficient insertion, deletion, and search for large datasets.
- Requires: Unsorted data; it organizes data into a balanced, ordered structure.
- Time Complexity: O(log n) for insertion, deletion, and search
- Space Complexity: O(n)
- When to Use: Ideal for databases, filesystems, and applications that handle large datasets on disk, as it minimizes the number of accesses needed for read and write operations.
```typescript
const btree = new BTree<number>(3, (a, b) => a - b); // Order 3 B-Tree
btree.insert(25);
btree.insert(15);
btree.insert(5);
console.log("Search 15:", btree.search(15) ? "Found" : "Not found");
console.log("Delete 15");
btree.delete(15);
console.log("Search 15:", btree.search(15) ? "Found" : "Not found");
```
| Tree Type | Strengths | Ideal Use Cases | Insertion | Deletion | Search | Space Complexity |
|---|---|---|---|---|---|---|
| Binary Search Tree | Simple, efficient for basic search and insertion | Small datasets, where perfect balance isn't critical | O(log n) avg, O(n) worst | O(log n) avg, O(n) worst | O(log n) avg, O(n) worst | O(n) |
| AVL Tree | Self-balancing, guarantees logarithmic performance | Applications with frequent updates needing predictable performance, like databases | O(log n) | O(log n) | O(log n) | O(n) |
| Red-Black Tree | Self-balancing, efficient with frequent insertions and deletions | Dynamic systems with continuous changes, like operating system schedulers | O(log n) | O(log n) | O(log n) | O(n) |
| B-Tree | Handles massive datasets, optimized for disk access | Databases, file systems, applications handling large disk-based datasets | O(log n) | O(log n) | O(log n) | O(n) |
| Trie (Prefix Tree) | Efficient for prefix-related operations | Autocomplete, spell checking, dictionary lookups, IP routing | O(m) | O(m) | O(m) | O(n * m) in worst case |
Notes:
- Binary Search Tree: Efficient for smaller datasets, but performance can degrade without balancing.
- AVL Tree: Maintains strict balance, ensuring consistent performance at the cost of more complex insertion and deletion operations.
- Red-Black Tree: Allows slight imbalance for faster insertion and deletion, offering a good trade-off for dynamic data.
- B-Tree: Optimized for systems where data is stored on disk, minimizing disk access for large datasets.
- Trie: Useful for applications with prefix-based searches; space complexity varies based on prefix sharing across entries.
BloomFilter:
- How: A probabilistic data structure for membership testing that quickly checks whether an element might be in a set. It uses multiple hash functions to map each element to specific bits in a fixed-size bit array. When all bits mapped for an element are set, the element may be in the set (with a small chance of false positives); otherwise, it is definitely not in the set.
- Real-World Applications:
- Redis: Uses Bloom filters in RedisBloom for fast membership checks across large datasets.
- Chromium: Uses Bloom filters for URL and data filtering in its optimization guide.
- Apache Spark: Integrates Bloom filters for efficient data loading, avoiding redundancies.
- RocksDB: Employs Bloom filters for optimized disk lookups.
- ScyllaDB: Utilizes Bloom filters to improve data retrieval speeds, ideal for high-performance databases.
- Requires: Unordered data
- Time Complexity: O(k) for insert and lookup, where `k` is the number of hash functions
- Space Complexity: O(m), where `m` is the size of the bit array
- When to Use: Ideal for membership testing in systems needing high-speed responses with low memory usage, such as in spell-checking, web caches, and network traffic filtering.
```typescript
const bloomFilter = new BloomFilter<string>(100); // Capacity for 100 items
bloomFilter.add("apple");
bloomFilter.add("banana");
console.log("Check 'apple':", bloomFilter.check("apple") ? "Possibly in set" : "Not in set");
console.log("Check 'grape':", bloomFilter.check("grape") ? "Possibly in set" : "Not in set");
```
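The bit-array mechanics, sketched with double hashing to derive `k` bit positions from two base hashes; this is a common construction, and the toolbox's actual hashing scheme is an assumption here:

```typescript
// Simple string hash (FNV-1a, 32-bit) used to derive bit positions.
function fnv1a(s: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class BloomSketch {
  private bits: Uint8Array;
  constructor(private m: number, private k: number) { this.bits = new Uint8Array(m); }

  private indexes(item: string): number[] {
    const h1 = fnv1a(item, 0), h2 = fnv1a(item, 0x9e3779b9);
    // Double hashing: index_i = (h1 + i * h2) mod m
    return Array.from({ length: this.k }, (_, i) => (h1 + i * h2) % this.m);
  }

  add(item: string): void { for (const i of this.indexes(item)) this.bits[i] = 1; }

  check(item: string): boolean { return this.indexes(item).every((i) => this.bits[i] === 1); }
}

const bf = new BloomSketch(128, 3);
bf.add("apple");
console.log(bf.check("apple")); // true ("possibly in set")
console.log(bf.check("grape")); // very likely false ("definitely not in set")
```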
HyperLogLog:
- How: HyperLogLog is used for cardinality estimation (counting unique elements) in large datasets. It uses hash functions to map elements into "buckets," tracking the maximum number of leading zeroes in hash values to estimate the dataset's cardinality. This method requires significantly less memory than exact counting.
- Real-World Applications:
- Amazon Redshift: Supports approximate distinct counts for large datasets, enabling efficient querying.
- ScyllaDB: Utilizes HyperLogLog for key count estimation in SSTable management.
- Redis: Commands like PFADD and PFCOUNT enable memory-efficient cardinality estimations.
- Facebook and Google BigQuery: Used for user interaction counts in analytics.
- Requires: Unordered data
- Time Complexity: O(1) for insertion and lookup
- Space Complexity: O(m), requiring fixed memory that scales minimally with dataset size
- When to Use: Best for applications needing approximate unique counts with minimal memory, such as in analytics, network monitoring, and social media user tracking.
```typescript
const hll = new HyperLogLog();
hll.add("user1");
hll.add("user2");
hll.add("user1"); // Duplicate entry
console.log("Approximate Unique Count:", hll.estimate());
```
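The bucket/leading-zero idea in miniature. This standalone sketch omits the small- and large-range corrections of full HyperLogLog (so it is crude at low cardinalities); the bias constant 0.673 is the standard value for m = 16 registers:

```typescript
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

const b = 4;                 // use the first 4 hash bits to pick one of m = 16 registers
const m = 1 << b;
const registers = new Array<number>(m).fill(0);

function hllAdd(item: string): void {
  const h = fnv1a32(item);
  const bucket = h >>> (32 - b);                       // top b bits choose the register
  const rest = (h << b) >>> 0;                         // remaining 28 bits
  const rank = rest === 0 ? 29 : Math.clz32(rest) + 1; // position of the first 1-bit
  registers[bucket] = Math.max(registers[bucket], rank);
}

function hllEstimate(): number {
  const z = registers.reduce((sum, r) => sum + 2 ** -r, 0);
  return (0.673 * m * m) / z; // raw estimate, no range corrections
}

["user1", "user2", "user1"].forEach(hllAdd);
console.log("Approximate unique count:", Math.round(hllEstimate()));
```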
CountMinSketch:
- How: Count-Min Sketch provides approximate frequency estimation by hashing elements to multiple counters. The minimum counter value across hashes is taken as the frequency estimate, allowing frequency tracking with limited memory.
- Real-World Applications:
- Redis: Used for estimating the frequency of large data items.
- Network Traffic Analysis: Detects frequent packets and heavy hitters.
- E-commerce & AdTech: Real-time product or ad popularity tracking, optimizing inventory and ad placements.
- Requires: Unordered data
- Time Complexity: O(d) for insertion and query, where `d` is the depth (number of hash functions)
- Space Complexity: O(w * d), where `w` is the width (number of counters) and `d` is the depth
- When to Use: For tracking item frequencies in data streams (e.g., popular topics in social media, network traffic analysis) where memory is constrained.
```typescript
const cms = new CountMinSketch();
cms.add("apple");
cms.add("apple");
cms.add("banana");
console.log("Frequency of 'apple':", cms.query("apple"));
console.log("Frequency of 'banana':", cms.query("banana"));
```
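The d-row, w-column counter grid in miniature (a standalone sketch; the default width and depth here are arbitrary):

```typescript
function hash(s: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class CountMinSketchDemo {
  private table: Uint32Array[];
  constructor(private w = 64, private d = 4) {
    this.table = Array.from({ length: d }, () => new Uint32Array(w));
  }

  add(item: string): void {
    for (let row = 0; row < this.d; row++) {
      this.table[row][hash(item, row) % this.w]++; // one counter bumped per row
    }
  }

  query(item: string): number {
    let min = Infinity;
    for (let row = 0; row < this.d; row++) {
      min = Math.min(min, this.table[row][hash(item, row) % this.w]);
    }
    return min; // the min across rows limits the damage from hash collisions
  }
}

const demo = new CountMinSketchDemo();
demo.add("apple"); demo.add("apple"); demo.add("banana");
console.log(demo.query("apple"), demo.query("banana")); // 2 1 (possibly overestimated)
```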
t-digest:
- How: t-digest approximates quantile calculations and is efficient for tracking high percentiles in large datasets. By clustering data points into centroids that represent ranges of values, t-digest enables accurate quantile estimation with a compact structure.
- Real-World Applications:
- Redis Stack: Enables real-time percentile and quantile queries.
- Apache Druid: Essential for percentile aggregations in large, interactive analytics.
- PostgreSQL & ElasticSearch: Supports percentile aggregations for massive datasets.
- Amazon Redshift: Facilitates quantile estimation for reports on extensive data tables.
- Requires: Unordered data
- Time Complexity: O(log k) for insertion, where `k` is the number of centroids
- Space Complexity: O(n) for storing centroids
- When to Use: For quantile and percentile estimations in streaming data, useful in monitoring systems, real-time analytics, and latency tracking.
```typescript
const tDigest = new Tdigest();
tDigest.add(100);
tDigest.add(200);
tDigest.add(300);
console.log("90th Percentile:", tDigest.percentile(0.9));
```
MinHash & SimHash:
- How: MinHash and SimHash are similarity-estimation algorithms that compare datasets by creating compact "signatures." MinHash is used for set similarity, ideal for deduplication, while SimHash is more common for text or vector similarity.
- Real-World Applications:
- MinHash: Used in document and web page deduplication, collaborative filtering, and genomic analysis.
- SimHash: Deployed in search engines for duplicate content detection, social media for grouping similar posts, and plagiarism detection.
- Time Complexity: O(n) for creating a signature, where `n` is the number of elements or features
- When to Use: Useful for approximate similarity checks in large datasets, like recommendation engines, document deduplication, and detecting near-duplicates in content feeds.
```typescript
const minHash = new MinHash();
minHash.add("apple");
minHash.add("banana");
console.log("MinHash Signature:", minHash.getSignature());
```
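The signature construction in miniature: k seeded hashes, keeping the minimum hash per seed. The fraction of matching signature slots estimates the Jaccard similarity of two sets:

```typescript
function seededHash(s: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Signature = per-seed minimum hash over all elements of the set.
function minHashSignature(items: string[], k = 16): number[] {
  return Array.from({ length: k }, (_, seed) =>
    Math.min(...items.map((item) => seededHash(item, seed)))
  );
}

function similarity(a: number[], b: number[]): number {
  const matches = a.filter((v, i) => v === b[i]).length;
  return matches / a.length; // estimates the Jaccard similarity
}

const sigA = minHashSignature(["apple", "banana", "cherry"]);
const sigB = minHashSignature(["apple", "banana", "date"]);
console.log("Estimated Jaccard similarity:", similarity(sigA, sigB));
```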
SkipList:
- How: A layered linked list that allows fast insertion, deletion, and search operations by maintaining multiple levels of linked lists. The probabilistic structure enables skipping large sections of elements, providing an efficient alternative to balanced trees.
- Requires: Unordered data; automatically organizes data across levels for efficient access.
- Time Complexity: O(log n) for insertion, deletion, and search on average
- Space Complexity: O(n), where n is the number of elements
- When to Use: Useful for in-memory databases, priority queues, and cases where probabilistic balancing is preferred over strict balancing.
```typescript
const skipList = new SkipList<number>();
skipList.insert(3);
skipList.insert(7);
skipList.insert(9);
console.log("Search 7:", skipList.search(7) ? "Found" : "Not found");
skipList.delete(7);
console.log("Search 7 after deletion:", skipList.search(7) ? "Found" : "Not found");
```
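The probabilistic "skipping" comes from random level assignment, sketched standalone below: new nodes get taller towers with geometrically decreasing probability.

```typescript
const MAX_LEVEL = 16;
const P = 0.5; // probability of promoting a node one level higher

// Flip coins: level 1 with prob 1/2, level 2 with prob 1/4, and so on.
// On average only n / 2^(L-1) nodes reach level L, which is what lets a
// search skip over large sections of the bottom-level list.
function randomLevel(): number {
  let level = 1;
  while (Math.random() < P && level < MAX_LEVEL) level++;
  return level;
}

console.log([randomLevel(), randomLevel(), randomLevel()]); // e.g., [1, 3, 1]
```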
| Data Structure | Time Complexity | Space Complexity | Use Cases |
|---|---|---|---|
| Bloom Filter | Insert/Search: O(k) | O(m) | Fast membership testing (e.g., RedisBloom, Spark data loading, RocksDB for disk lookups) |
| HyperLogLog | Insert: O(1) | O(m) | Cardinality estimation for large datasets (e.g., Redshift, ScyllaDB SSTables, Redis for distinct counts) |
| Count-Min Sketch | Insert/Query: O(d) | O(w * d) | Frequency estimation in data streams (e.g., Redis for item frequency, network traffic, e-commerce real-time trends) |
| t-digest | Insert: O(log k) | O(n) | Quantile estimation (e.g., Redis for percentile queries, Apache Druid for large-scale data, ElasticSearch for accurate percentile queries) |
| SimHash | Insert: O(n) | O(n) | Similarity detection (e.g., document deduplication, social media content grouping, plagiarism detection) |
| MinHash | Insert: O(n * k) | O(n * k) | Set similarity (e.g., document deduplication in search indexing, collaborative filtering, genomic analysis) |
| Skip List | Avg O(log n) | O(n) | Ordered set operations (e.g., in-memory databases, priority queues, version control) |