Hashing
Hashing is a technique used to uniquely identify objects from a group of similar objects.
It involves mapping large data elements to smaller, fixed-size values using a special function called a Hash Function.
The primary purpose of hashing is to enable fast data access (searching, insertion, and deletion operations) in constant average time O(1).
Hash Function:
A Hash Function is a mathematical function that converts an input (or key) into a specific value called a hash code or hash address which determines where the data will be stored in the hash table.
Hash Address = H(Key)
Example:
For integer keys:
H(Key) = Key % TableSize
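The division-method formula above can be sketched in Python; the table size of 10 is an arbitrary choice for illustration:

```python
def hash_address(key: int, table_size: int = 10) -> int:
    """Division-method hash: maps an integer key to a slot index."""
    return key % table_size

# Keys 25, 37, and 55 in a table of size 10:
print(hash_address(25))  # 5
print(hash_address(37))  # 7
print(hash_address(55))  # 5 (collides with key 25)
```

Note that 25 and 55 map to the same address, which is exactly the collision situation discussed below.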
Hash Table:
A Hash Table is a data structure that stores data in an array format, where each data value has a unique index determined by the hash function.
Collision:
A collision occurs when two different keys are mapped to the same hash address. Handling collisions is a critical part of hashing.
Collision Resolution Techniques:
- Open Addressing:
- If a collision occurs, find another open slot using techniques like:
- Linear Probing: Next slot sequentially.
- Quadratic Probing: Next slot by adding squared intervals.
- Double Hashing: Second hash function used.
- Chaining (Closed Addressing):
- Each slot contains a linked list; all elements that hash to the same slot are stored in the linked list.
Advantages of Hashing:
Fast data access — O(1) average time complexity.
Efficient storage for dynamic sets of data.
Suitable for search, insert, and delete operations.
Disadvantages of Hashing:
Collisions may reduce efficiency.
Needs a good hash function for uniform distribution.
Wastage of space if the load factor is low.
Applications of Hashing:
Database indexing (e.g., primary key searches).
Symbol tables in compilers.
Password storage and verification.
Cache implementations.
Hash maps / Hash sets in programming languages.
Introduction to Hashing
In computer science, Hashing is a technique used to efficiently store and retrieve data in a fixed-size table called a Hash Table.
Hashing transforms a given key (such as a number, string, or object) into an index using a mathematical function known as a Hash Function.
The index generated by the hash function determines the position in the hash table where the actual data or record is stored.
This allows constant-time complexity O(1) on average for fundamental operations like searching, inserting, and deleting.
Need for Hashing:
In data structures like arrays, linked lists, and trees, the time complexity for searching or inserting an element is generally O(n) or O(log n).
Hashing overcomes this limitation by providing an average O(1) access time, making it highly suitable for real-time applications where speed is critical.
Features of Hashing:
- Efficient data retrieval even from large datasets.
- Helps in performing operations like search, insert, delete very quickly.
- Relies on a carefully designed Hash Function for minimizing collisions.
Applications of Hashing:
Database indexing for fast retrieval of records.
Implementing symbol tables in compilers.
Maintaining cache memory.
Password verification systems.
Creation of associative arrays, maps, and dictionaries in programming languages.
Deletion in Hashing
In Hashing, the process of deleting an element from a hash table is slightly more complex than insertion or searching because improper deletion can break the structure and affect future searches.
In Open Addressing techniques, deletion needs careful handling to ensure the search chain remains valid.
In Chaining methods, deletion is simpler, as elements are stored in linked lists.
Deletion in Open Addressing:
- Simply removing an element (setting the slot to empty) can disconnect the probe sequence, causing search failures for other elements that have collided earlier.
- To avoid this problem, a special marker value (called “Deleted” or “Tombstone”) is used to indicate that an element was once present but has been deleted.
Example:
- Search for the key to be deleted.
- If found, mark the slot as “Deleted” instead of empty.
- During future searches or insertions, treat “Deleted” slots carefully to maintain probing integrity.
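The tombstone scheme above can be sketched in Python with linear probing; the class name, table size, and use of Python's built-in `hash` are illustrative assumptions:

```python
EMPTY, DELETED = object(), object()  # sentinel markers for slot states

class OpenAddressTable:
    """Linear-probing table that deletes via tombstone markers."""

    def __init__(self, size=11):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        start = hash(key) % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def insert(self, key):
        for idx in self._probe(key):
            # Insertion may reuse both EMPTY and DELETED slots.
            if self.slots[idx] is EMPTY or self.slots[idx] is DELETED:
                self.slots[idx] = key
                return
        raise OverflowError("table is full")

    def search(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:  # a true empty slot ends the chain
                return None
            if self.slots[idx] is not DELETED and self.slots[idx] == key:
                return idx
        return None

    def delete(self, key):
        idx = self.search(key)
        if idx is not None:
            self.slots[idx] = DELETED  # tombstone, never plain EMPTY
```

Deleting 5 and then searching for 16 (which collided with it and probed past it) still succeeds, because the tombstone keeps the probe chain connected.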
Deletion in Chaining (Separate Chaining):
- In this technique, each slot contains a linked list of records.
- To delete:
- Search the key in the linked list.
- Remove the node from the linked list using standard list deletion methods.
- No need for tombstone markers because the linked list can dynamically adjust.
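A minimal sketch of chained deletion in Python (the bucket count and the use of Python's built-in `hash` are assumptions):

```python
class ChainedBuckets:
    """Separate-chaining slots; deletion is ordinary list removal."""

    def __init__(self, size=7):
        self.buckets = [[] for _ in range(size)]

    def insert(self, key):
        self.buckets[hash(key) % len(self.buckets)].append(key)

    def delete(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        if key in chain:
            chain.remove(key)  # the list shrinks; no tombstone needed
            return True
        return False
```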
Issues in Deletion:
- Open Addressing:
- Performance degrades if many “Deleted” slots accumulate, since probe sequences must skip over them before reaching a truly empty slot.
- Chaining:
- Extra memory overhead for pointers in linked lists.
Efficiency of Hashing
The efficiency of Hashing is determined by how quickly and effectively it can perform the fundamental operations of searching, inserting, and deleting data.
Hashing is especially designed to provide constant average-case time complexity O(1), making it extremely efficient for accessing data directly without searching through an entire data structure.
Factors Affecting Efficiency:
- Hash Function Quality:
- A good hash function uniformly distributes keys across the hash table.
- Reduces the possibility of collisions, ensuring fewer comparisons.
- Load Factor (λ):
- Defined as:
λ = n / m
Where:
- n = Number of keys inserted.
- m = Size of the hash table.
- A higher load factor increases collisions and decreases efficiency.
- Efficiency is best maintained when λ is kept low; open addressing requires λ < 1, while chaining can operate at λ ≥ 1 at the cost of longer chains.
- Collision Resolution Technique:
- Open Addressing: Performance degrades if the table gets too full.
- Chaining: Can handle higher load factors better but uses extra memory.
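The load factor formula λ = n / m is straightforward to compute; a tiny Python illustration:

```python
def load_factor(n_keys: int, table_size: int) -> float:
    """Load factor λ = n / m: stored keys over available slots."""
    return n_keys / table_size

print(load_factor(7, 10))   # 0.7 -> moderate collision risk
print(load_factor(12, 10))  # 1.2 -> only chaining can operate here
```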
Time Complexity:
| Operation | Best / Average Case | Worst Case |
| --- | --- | --- |
| Search | O(1) | O(n) (if all elements collide) |
| Insertion | O(1) | O(n) |
| Deletion | O(1) | O(n) |
In practice, average-case time is O(1) for all operations if collisions are minimal and the hash function is good.
In the worst-case, all keys may hash to the same location (forming a list), causing performance to degrade to O(n).
Space Complexity:
Space complexity is generally O(n) where ‘n’ is the number of elements, plus some extra space for handling collisions (in chaining or open addressing).
Advantages of High Efficiency:
Fast searching, insertion, and deletion.
Ideal for large datasets where linear or binary searching would be too slow.
Used in high-performance applications like caches, symbol tables, and databases.
Limitations to Efficiency:
Collisions reduce efficiency.
Efficiency highly depends on:
- Quality of hash function.
- Collision resolution method.
- Load factor control.
Hash Table Reordering
Hash Table Reordering refers to the process of restructuring or rehashing the contents of a hash table to improve its performance, typically when the table becomes too full or the load factor exceeds an acceptable limit.
Reordering is essential to maintain the efficiency of search, insert, and delete operations in hashing.
Need for Reordering (Rehashing):
- Load Factor Exceeds Threshold:
When the number of stored elements significantly increases relative to the table size, causing excessive collisions.
- Performance Degradation:
Increased collisions lead to longer search times and higher insertion costs.
- Maintaining Efficiency:
Reordering ensures the hash table maintains an average time complexity of O(1) for basic operations.
Process of Hash Table Reordering (Rehashing):
- Create a New Hash Table with a larger size (typically double the old size or next prime number).
- Recompute the hash values of all existing keys using the new hash function or table size.
- Insert all elements into the new table based on their new hash addresses.
Example:
For an old hash function:
H(Key) = Key % 10
If reordering is done to double the table size:
New H(Key) = Key % 20
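A sketch of this rehashing step in Python for a chained table, using the same table sizes (10 doubled to 20); the keys are illustrative:

```python
def rehash(old_table, new_size):
    """Recompute every key's address for a larger chained table: O(n)."""
    new_table = [[] for _ in range(new_size)]
    for bucket in old_table:
        for key in bucket:
            new_table[key % new_size].append(key)  # new address: key % new_size
    return new_table

# Keys 12, 22, and 32 all collide at slot 2 under key % 10 ...
old = [[] for _ in range(10)]
for k in (12, 22, 32):
    old[k % 10].append(k)

# ... but under key % 20, key 22 moves to slot 2 while 12 and 32 share slot 12.
new = rehash(old, 20)
```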
Types of Reordering:
- Rehashing:
The process of reorganizing the table by recalculating new positions using a new table size.
- Dynamic Resizing:
Automatically resizing and reordering the table when a predefined load factor is exceeded.
- Ordered Hashing (in some systems):
Reordering the elements based on key values or insertion order — not common in typical hash tables but used in some specialized applications like Ordered Dictionaries.
Advantages of Reordering:
Reduces collisions significantly.
Improves searching, insertion, and deletion performance.
Maintains uniform distribution of keys across the hash table.
Ensures the table operates at optimal load factors.
Disadvantages of Reordering:
Reordering is time-consuming, taking O(n) time, since every existing key must be rehashed and reinserted.
Requires extra space temporarily to hold the new hash table.
Frequent reordering can reduce performance if not controlled properly.
Collision Resolution Techniques in Hashing
A collision occurs in hashing when two different keys are mapped to the same hash address by the hash function. To handle such situations, collision resolution techniques are used to store multiple elements in the hash table without losing data or affecting efficiency.
Types of Collision Resolution Techniques:
Open Addressing:
In Open Addressing, when a collision occurs, the algorithm searches for the next available slot within the table itself according to a probing sequence.
Methods of Open Addressing:
- Linear Probing:
Next slot is checked in sequential order:
H'(key, i) = (H(key) + i) % tableSize
Problem: Clustering (grouping of filled slots).
- Quadratic Probing:
The gap between slots increases quadratically:
H'(key, i) = (H(key) + c1*i + c2*i^2) % tableSize
Reduces clustering but may not use all slots.
- Double Hashing:
A second hash function determines the probe step:
H'(key, i) = (H1(key) + i * H2(key)) % tableSize
Minimizes clustering and offers better performance.
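The three probe formulas above translate directly into Python; the hash values and table size below are illustrative:

```python
def linear_probe(h, i, table_size):
    """H'(key, i) = (H(key) + i) % tableSize"""
    return (h + i) % table_size

def quadratic_probe(h, i, table_size, c1=1, c2=1):
    """H'(key, i) = (H(key) + c1*i + c2*i^2) % tableSize"""
    return (h + c1 * i + c2 * i * i) % table_size

def double_hash_probe(h1, h2, i, table_size):
    """H'(key, i) = (H1(key) + i * H2(key)) % tableSize"""
    return (h1 + i * h2) % table_size

# First three probe slots for a key with H(key) = 3 in a table of size 11:
print([linear_probe(3, i, 11) for i in range(3)])     # [3, 4, 5]
print([quadratic_probe(3, i, 11) for i in range(3)])  # [3, 5, 9]
```

Notice how linear probing visits consecutive slots (which causes clustering), while quadratic probing spreads its probes out.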
Separate Chaining (Closed Addressing):
In Separate Chaining, each slot of the hash table contains a linked list (or chain).
When multiple keys hash to the same slot, they are simply added to the corresponding linked list.
Insertion: Insert at the head or tail of the list.
Search: Traverse the list at the slot.
Deletion: Remove the node from the list.
Advantages:
- Simple to implement.
- Hash table can hold more elements than its size.
- No clustering problem.
Disadvantages:
- Requires additional space for pointers.
- Worst-case time complexity becomes O(n) if all keys hash to the same index.
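A minimal separate-chaining map in Python covering the insert, search, and delete operations described above; the class name and bucket count are assumptions, and Python's built-in `hash` stands in for the hash function:

```python
class ChainingHashMap:
    """Separate chaining: each slot holds a list of (key, value) pairs."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key exists: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))       # otherwise extend the chain

    def search(self, key):
        for k, v in self._chain(key):    # traverse the list at the slot
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]             # standard list deletion
                return True
        return False
```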
Coalesced Chaining:
Coalesced Chaining is a hybrid method of Open Addressing and Separate Chaining.
When a collision occurs:
- The new element is placed in the next available slot anywhere in the table.
- The table also maintains a linked list using pointers (or indices) inside the table to connect elements with the same hash address.
No external linked lists are required — everything is stored in the table itself.
More memory-efficient than Separate Chaining.
Reduces primary clustering compared to Open Addressing.
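A sketch of coalesced chaining in Python; taking the highest-numbered free slot is one common variant, and the names and table size are illustrative:

```python
class CoalescedTable:
    """Coalesced chaining: collision chains are built from slots in the table."""

    def __init__(self, size=8):
        self.size = size
        self.keys = [None] * size
        self.next = [-1] * size  # in-table index links replace external lists

    def insert(self, key):
        home = hash(key) % self.size
        if self.keys[home] is None:
            self.keys[home] = key
            return
        # Walk to the end of the chain rooted at the home slot.
        i = home
        while self.next[i] != -1:
            i = self.next[i]
        # Place the key in a free slot (here: the highest-numbered one).
        free = self.size - 1
        while free >= 0 and self.keys[free] is not None:
            free -= 1
        if free < 0:
            raise OverflowError("table is full")
        self.keys[free] = key
        self.next[i] = free  # link the new slot into the chain

    def search(self, key):
        i = hash(key) % self.size
        while i != -1:
            if self.keys[i] == key:
                return True
            i = self.next[i]
        return False
```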
Comparison Table:
| Technique | Space | Clustering | Complexity (Average) | Complexity (Worst) | Remarks |
| --- | --- | --- | --- | --- | --- |
| Open Addressing | Less (in-place) | Yes (in linear probing) | O(1) | O(n) | Requires careful load factor management |
| Separate Chaining | Extra (pointers) | No | O(1 + λ) | O(n) | Simple, flexible, preferred for large data |
| Coalesced Chaining | Less than separate chaining | Reduced | O(1) | O(n) | Hybrid approach, reduces space overhead |
Extendible Hashing
Extendible Hashing is a dynamic hashing technique used primarily in database systems to handle growing datasets efficiently.
Unlike static hashing, where the hash table size is fixed, extendible hashing dynamically adjusts the table size as new data is inserted.
It is particularly effective in minimizing overflows and collisions.
Need for Extendible Hashing:
In traditional hashing:
- As more elements are added, collisions increase.
- Once the table is full, resizing the entire table is costly.
Extendible hashing overcomes these limitations by using:
- A directory that grows as needed.
- Buckets that split when they overflow.
Basic Concepts:
- Directory:
- A table that stores pointers to buckets.
- Indexed using a certain number of bits from the hash value.
- Grows in size (doubles) when needed.
- Buckets:
- Fixed-size blocks of memory that store records.
- Each has a local depth indicating how many bits are used to index it.
- When full, the bucket is split using one more bit.
- Global Depth:
- Number of bits used in the directory.
- If the local depth of a bucket exceeds the global depth, the directory size is doubled.
How It Works:
- Hash the key and use the first d bits to index into the directory (where d = global depth).
- Insert the key into the corresponding bucket.
- If the bucket is not full, simply insert the key.
- If the bucket overflows:
- If local depth < global depth → Split the bucket.
- If local depth = global depth → Double the directory size, then split the bucket.
- Redistribute the keys using an extra hash bit.
Example:
Let’s say we have keys and a hash function H(key) that produces the binary representation of the key.
Start with global depth = 1 → Directory has 2 entries: 0 and 1.
If a bucket overflows:
- We split the bucket, increase local depth.
- If necessary, double the directory and redistribute pointers.
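The split-and-double procedure above can be sketched in Python; bucket capacity 2 and indexing by the key's low-order bits are simplifying assumptions:

```python
class Bucket:
    def __init__(self, depth, capacity=2):
        self.depth = depth      # local depth: bits used to index this bucket
        self.capacity = capacity
        self.keys = []

class ExtendibleHash:
    """Simplified extendible hashing: a doubling directory of bucket pointers."""

    def __init__(self, capacity=2):
        self.global_depth = 1
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # Use the low global_depth bits of the hash value.
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < bucket.capacity:
            bucket.keys.append(key)
            return
        if bucket.depth == self.global_depth:     # directory must double
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the overflowing bucket using one more bit.
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth, bucket.capacity)
        high_bit = 1 << (bucket.depth - 1)
        old_keys, bucket.keys = bucket.keys + [key], []
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = new_bucket    # repoint half the aliases
        for k in old_keys:                        # redistribute with the new bit
            target = self.directory[self._index(k)]
            if len(target.keys) < target.capacity:
                target.keys.append(k)
            else:
                self.insert(k)                    # rare cascade: split again

    def search(self, key):
        return key in self.directory[self._index(key)].keys
```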
Advantages:
Efficient for dynamic databases — handles growth without rehashing all data.
Reduces overflow chains — high performance maintained.
Suitable for disk-based storage (used in DBMS).
Provides fast access, typically O(1) time.
Disadvantages:
Directory can grow large if not well-managed.
Requires extra space for storing directory pointers.
Slightly complex implementation compared to static hashing.
Applications:
Database management systems.
Disk-based storage systems.
Useful when data grows unpredictably.
Hash Function Selection
A Hash Function is the heart of any hashing mechanism.
It is a mathematical function that maps a given key to a specific index in a hash table.
The efficiency, speed, and reliability of hashing operations (search, insert, delete) largely depend on the quality of the hash function chosen.
Properties of a Good Hash Function:
- Uniform Distribution:
- Should distribute the keys uniformly across the hash table to minimize collisions.
- Deterministic:
- The same key must always produce the same hash value.
- Efficient Computation:
- Should be simple and fast to compute.
- Minimize Clustering:
- Avoid patterns that cause groups of data to map to the same location.
- Use All Key Data:
- All parts of the key should influence the hash value to reduce the chances of collision.
Hash Function Methods:
- Division Method:
H(Key) = Key % TableSize
Simple and effective if TableSize is prime.
Poor choice if TableSize and key patterns have common factors.
- Multiplication Method:
H(Key) = ⌊m × (Key × A mod 1)⌋
Where:
- A is a constant (0 < A < 1)
- m is the table size.
Suitable for situations where TableSize is a power of 2.
Reduces clustering better than the division method.
- Folding Method:
- Key is divided into equal parts and summed.
- Example: Key = 123456 → Split as 12, 34, 56 → Sum = 102.
Works well with numeric keys.
- Mid-Square Method:
- Square the key and use the middle digits or bits as the hash value.
Simple and effective for small keys.
Not suitable for large key ranges.
- Digit Extraction Method:
- Select specific digits from the key.
Easy but may cause patterns and clustering if digits are poorly chosen.
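The methods above can be sketched in Python; the constant A ≈ 0.618 (a popular choice suggested by Knuth), the part length for folding, and the digit width for mid-square are illustrative assumptions:

```python
import math

def division_hash(key, m):
    """Division method: H(Key) = Key % m (m ideally prime)."""
    return key % m

def multiplication_hash(key, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: H(Key) = floor(m * (Key * A mod 1))."""
    return int(m * ((key * A) % 1))

def folding_hash(key, m, part_len=2):
    """Folding: split the key's digits into parts and sum them."""
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % m

def mid_square_hash(key, m, width=2):
    """Mid-square: square the key and take its middle digits."""
    sq = str(key * key)
    start = len(sq) // 2 - width // 2
    return int(sq[start:start + width]) % m

# The folding example from the text: 123456 -> 12 + 34 + 56 = 102
print(folding_hash(123456, 1000))  # 102
```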
Factors in Selecting a Hash Function:
- Nature of the Key:
- Numerical, string, or composite key — choose function accordingly.
- Size of the Hash Table:
- Should be considered (prime sizes are preferable for the division method).
- Distribution of Key Values:
- Function must ensure even spread across all slots.
- Speed Requirements:
- Must provide quick computation even for large datasets.
- Collision Tolerance:
- Must reduce the likelihood of clustering and collisions.
Perfect Hashing
Perfect Hashing is a type of hashing where a hash function maps each key to a unique slot in the hash table without any collisions.
It is particularly used when the set of keys is known in advance and static (i.e., no insertion/deletion after initialization).
In a perfect hash function, for a given set of keys:
H(key_i) ≠ H(key_j)   for all i ≠ j
Types of Perfect Hashing:
- Minimal Perfect Hashing:
- A collision-free hash function where the number of slots = number of keys.
- Memory-efficient (ideal space usage).
- No unused cells.
- Non-Minimal Perfect Hashing:
- Still collision-free, but allows extra space in the hash table.
- Easier to construct than minimal perfect functions.
When to Use Perfect Hashing:
- When the set of keys is fixed (e.g., list of reserved words in a compiler).
- When constant-time access is needed without collisions.
- In embedded systems and read-only databases.
Construction of Perfect Hash Functions:
Perfect hash functions are often built using:
- Two-level Hashing:
- First-level hashing distributes keys into buckets.
- Second-level hashing within each bucket ensures no collisions.
- This is known as the Fredman-Komlós-Szemerédi (FKS) scheme.
- Heuristic and algorithmic methods:
- Graph-based models (such as Cichelli’s method).
- Randomized algorithms that search for a perfect mapping.
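A toy FKS-style two-level construction in Python: the first level scatters keys into buckets, and each bucket gets its own collision-free second-level table of quadratic size. The prime p, the seed, and the function names are assumptions; real implementations are more careful about the hash family:

```python
import random

def build_fks(keys, seed=0):
    """Two-level (FKS-style) perfect hash membership test for a fixed key set."""
    rng = random.Random(seed)
    p = 2_147_483_647               # a prime larger than any key (assumed)
    n = len(keys)

    a, b = rng.randrange(1, p), rng.randrange(p)
    def level1(k):
        return ((a * k + b) % p) % n

    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[level1(k)].append(k)

    tables, params = [], []
    for bucket in buckets:
        m = len(bucket) ** 2 or 1   # quadratic space makes collisions unlikely
        while True:                 # retry until the level-2 function is injective
            a2, b2 = rng.randrange(1, p), rng.randrange(p)
            slots = [None] * m
            collided = False
            for k in bucket:
                idx = ((a2 * k + b2) % p) % m
                if slots[idx] is not None:
                    collided = True
                    break
                slots[idx] = k
            if not collided:
                break
        tables.append(slots)
        params.append((a2, b2, m))

    def contains(k):
        i = level1(k)
        a2, b2, m = params[i]
        return tables[i][((a2 * k + b2) % p) % m] == k

    return contains

reserved = build_fks([10, 22, 37, 59, 84])
```

Once built, every lookup is two modular hashes and one array access, so membership tests run in guaranteed O(1) time with no collisions.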
Advantages of Perfect Hashing:
No collisions — ensures fast and predictable performance.
Constant-time lookup (O(1)).
Especially useful in read-only applications where speed is critical.
Efficient use of space in minimal version.
Disadvantages of Perfect Hashing:
Difficult to construct for large or dynamic key sets.
Not suitable for dynamic data (insertion or deletion not allowed without reconstruction).
Can be computationally expensive to generate.
Applications:
Compiler design (e.g., keyword recognition).
Spell-checking tools.
Networking and protocol tables.
Static database lookups.
Embedded systems with fixed data sets.