18. Dictionaries, Hash Tables and Sets
- 1. Dictionaries, Hash Tables and Sets
Dictionaries, Hash Tables, Hashing, Collisions, Sets
SoftUni Team
Technical Trainers
Software University
http://softuni.bg
- 2. Table of Contents
1. Dictionary (Map) Abstract Data Type
2. Hash Tables, Hashing and Collision Resolution
3. Dictionary<TKey, TValue> Class
4. Sets: HashSet<T> and SortedSet<T>
- 4. 4
The abstract data type (ADT) "dictionary" maps keys to values
Also known as "map" or "associative array"
Holds a set of {key, value} pairs
Dictionary ADT operations:
Add(key, value)
FindByKey(key) → value
Delete(key)
Many implementations
Hash table, balanced tree, list, array, ...
The Dictionary (Map) ADT
- 5. 5
Sample dictionary:
ADT Dictionary – Example
Key Value
C#
Modern general-purpose object-oriented programming
language for the Microsoft .NET platform
PHP
Popular server-side scripting language for Web
development
compiler
Software that transforms a computer program to
executable machine code
… …
- 7. 7
A hash table is an array that holds a set of {key, value} pairs
The process of mapping a key to a position in a table is called
hashing
Hash Table
[Figure: hash table T of size m with slots 0, 1, 2, …, m-1; the hash function h: k → 0 … m-1 maps a key k to a slot of T]
- 8. 8
A hash table has m slots, indexed from 0 to m-1
A hash function h(k) maps the keys to positions:
h: k → 0 … m-1
For arbitrary value k in the key range and some hash function h
we have h(k) = p and 0 ≤ p < m
Hash Functions and Hashing
[Figure: the hash function h(k) pointing to one of the slots 0 … m-1 of table T]
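The mapping h(k) = p with 0 ≤ p < m can be sketched in C# as follows. This is a minimal illustration, not how any particular library does it; the names SlotIndex and Of are hypothetical. It assumes the key's own GetHashCode() is used as the raw hash and that the sign bit is masked off, because GetHashCode() may return a negative number.

```csharp
using System;

public static class SlotIndex
{
    // Maps an arbitrary key to a slot p with 0 <= p < m.
    public static int Of<TKey>(TKey key, int m)
    {
        if (m <= 0) throw new ArgumentOutOfRangeException(nameof(m));

        // Null keys get hash 0 here (an assumption made for this sketch).
        int hash = key == null ? 0 : key.GetHashCode();

        // Clear the sign bit so that the remainder is never negative.
        return (hash & 0x7FFFFFFF) % m;
    }
}
```

For example, SlotIndex.Of(42, 10) yields 2, because the hash code of an int is the int itself.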
- 9. 9
Perfect hashing function (PHF)
h(k): one-to-one mapping of each key k to an integer in the
range [0, m-1]
The PHF maps each key to a distinct integer within some
manageable range
Finding a perfect hashing function is impractical in most cases (it requires knowing the whole key set in advance)
More realistically
Hash function h(k) that maps most of the keys onto unique
integers, but not all
Hashing Functions
- 10. 10
A collision occurs when different keys have the same hash value
h(k1) = h(k2) for k1 ≠ k2
When the number of collisions is sufficiently small, the hash
tables work quite well (fast)
Several collision resolution strategies exist
Chaining collided keys (+ values) in a list
Re-hashing (using a second hash function)
Using the neighbor slots (linear probing)
Many others
Collisions in a Hash Table
- 11. 11
h("Pesho") = 4
h("Kiro") = 2
h("Mimi") = 1
h("Ivan") = 2
h("Lili") = m-1
Collision Resolution: Chaining
[Figure: table T with slots 0 … m-1. Slot 1 holds Mimi, slot 2 holds Kiro chained to Ivan (collision), slot 4 holds Pesho, slot m-1 holds Lili; all other slots are null. The elements are chained in a list in case of collision.]
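Chaining can be sketched in a few lines of C#. The class below is a hypothetical illustration (ChainedHashTable, SlotOf and TryFind are names chosen for this example): each slot holds either null or a linked list of the {key, value} pairs that hashed to it. It has a fixed size and does no resizing.

```csharp
using System;
using System.Collections.Generic;

public class ChainedHashTable<TKey, TValue>
{
    private readonly LinkedList<KeyValuePair<TKey, TValue>>[] slots;

    public ChainedHashTable(int m = 16)
    {
        slots = new LinkedList<KeyValuePair<TKey, TValue>>[m];
    }

    // h(k): maps the key to a slot in the range 0 … m-1.
    private int SlotOf(TKey key) =>
        (key.GetHashCode() & 0x7FFFFFFF) % slots.Length;

    public void Add(TKey key, TValue value)
    {
        int index = SlotOf(key);
        if (slots[index] == null)
            slots[index] = new LinkedList<KeyValuePair<TKey, TValue>>();

        foreach (var pair in slots[index])
            if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                throw new ArgumentException("Key already exists.");

        // On a collision the new pair simply joins the chain of that slot.
        slots[index].AddLast(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryFind(TKey key, out TValue value)
    {
        var chain = slots[SlotOf(key)];
        if (chain != null)
        {
            foreach (var pair in chain)
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }
        value = default(TValue);
        return false;
    }
}
```

Creating the table with m = 1 forces every key into the same slot, which is an easy way to observe that colliding keys such as "Kiro" and "Ivan" both remain findable.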
- 12. 12
Open addressing as a collision resolution strategy means taking another
slot in the hash table in case of collision, e.g.
Linear probing: take the next empty slot just after the collision
h(key, i) = (h(key) + i) % m
where i is the attempt number: 0, 1, 2, … and m is the table size
Quadratic probing: the i-th slot to try is calculated by a quadratic
polynomial (c1 and c2 are some constants)
h(key, i) = (h(key) + c1*i + c2*i^2) % m
Re-hashing: use a separate (second) hash function for collisions
h(key, i) = (h1(key) + i*h2(key)) % m
Collision Resolution: Open Addressing
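Linear probing can be sketched as follows. This is a simplified, hypothetical example (the class name LinearProbingTable and its members are assumptions): it stores string keys with int values in a fixed-size table, wraps around at the end of the array, and deliberately omits deletion, which would need special "deleted" markers.

```csharp
using System;

public class LinearProbingTable
{
    private readonly string[] keys;
    private readonly int[] values;

    public LinearProbingTable(int m)
    {
        keys = new string[m];
        values = new int[m];
    }

    // h(key, i) = (h(key) + i) % m, where i is the attempt number.
    private int Probe(string key, int i)
    {
        int m = keys.Length;
        int home = (key.GetHashCode() & 0x7FFFFFFF) % m;
        return (home + i) % m;
    }

    // Returns false when the table is full.
    public bool Add(string key, int value)
    {
        for (int i = 0; i < keys.Length; i++)
        {
            int slot = Probe(key, i);
            if (keys[slot] == null || keys[slot] == key)
            {
                keys[slot] = key;
                values[slot] = value;
                return true;
            }
        }
        return false;
    }

    public bool TryFind(string key, out int value)
    {
        for (int i = 0; i < keys.Length; i++)
        {
            int slot = Probe(key, i);
            if (keys[slot] == null) break;      // empty slot: the key is absent
            if (keys[slot] == key)
            {
                value = values[slot];
                return true;
            }
        }
        value = 0;
        return false;
    }
}
```

Because every element occupies its own slot, such a table can never hold more than m elements, which is one reason a lower fill factor is usually recommended for open addressing.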
- 13. 13
The load factor (fill factor) = used cells / all cells
How much the hash table is filled, e.g. 65%
A smaller fill factor leads to:
Fewer collisions (faster average seek time)
More memory consumption
Recommended fill factors:
When chaining is used for collision resolution: less than 75%
When open addressing is used: less than 50%
How Big Should the Hash Table Be?
- 14. 14
Adding Item to Hash Table With Chaining
[Figure: flowchart for adding "Tanio" to table T with slots 0 … m-1, where slot 1 holds Mimi, slot 2 holds Kiro chained to Ivan and slot m-1 holds Lili]
Add("Tanio")
Fill factor >= 75%? yes: resize & rehash; no: continue
hash("Tanio") % m = 3
map[3] == null? yes: initialize a linked list
Insert("Tanio")
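The add procedure with a fill-factor check can be sketched as below. This is only an illustration under stated assumptions: the class GrowingHashTable is hypothetical, it stores keys only, it doubles the capacity when resizing, it measures the fill factor as elements / slots, and it does not check for duplicate keys.

```csharp
using System;
using System.Collections.Generic;

public class GrowingHashTable
{
    private const double MaxFillFactor = 0.75;
    private LinkedList<string>[] map = new LinkedList<string>[4];

    public int Count { get; private set; }
    public int Capacity => map.Length;

    private static int SlotOf(string key, int m) =>
        (key.GetHashCode() & 0x7FFFFFFF) % m;

    public void Add(string key)
    {
        // Fill factor >= 75%? Then resize and rehash before inserting.
        if ((double)(Count + 1) / map.Length >= MaxFillFactor)
            ResizeAndRehash();

        Insert(map, key);
        Count++;
    }

    public bool Contains(string key)
    {
        var chain = map[SlotOf(key, map.Length)];
        return chain != null && chain.Contains(key);
    }

    private static void Insert(LinkedList<string>[] table, string key)
    {
        int index = SlotOf(key, table.Length);
        if (table[index] == null)
            table[index] = new LinkedList<string>();   // initialize the chain
        table[index].AddLast(key);
    }

    private void ResizeAndRehash()
    {
        // Every key must be re-inserted, because its slot depends on m.
        var bigger = new LinkedList<string>[map.Length * 2];
        foreach (var chain in map)
            if (chain != null)
                foreach (var key in chain)
                    Insert(bigger, key);
        map = bigger;
    }
}
```

Starting from 4 slots, adding a third element would reach the 75% threshold, so the table grows to 8 slots before that element is inserted.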
- 16. 16
The hash table performance depends on the probability
of collisions
Fewer collisions → faster add / find / delete operations
How to implement a good (efficient) hash function?
A good hash function should distribute the input values uniformly
The hash code calculation process should be fast
Integer n → use n as hash value (n % size as hash-table slot)
Real number r → use the bitwise representation of r
String s → use a formula over the Unicode representation of s
Implementing a Good Hash Function
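The three rules above could look like this in C#. The helper class SimpleHashes and the constants 17 and 31 are assumptions made for this sketch; they are not the formulas used by the .NET runtime.

```csharp
using System;

public static class SimpleHashes
{
    // Integer n: use n itself as the hash value.
    public static int OfInt(int n) => n;

    // Real number r: fold the 64-bit representation of r into 32 bits.
    public static int OfDouble(double r)
    {
        long bits = BitConverter.DoubleToInt64Bits(r);
        return unchecked((int)bits ^ (int)(bits >> 32));
    }

    // String s: a polynomial formula over the UTF-16 code units of s.
    public static int OfString(string s)
    {
        unchecked
        {
            int hash = 17;
            foreach (char c in s)
                hash = hash * 31 + c;
            return hash;
        }
    }

    // Converts a hash value to a slot of a table with the given size.
    public static int Slot(int hash, int size) => (hash & 0x7FFFFFFF) % size;
}
```

The unchecked blocks let the arithmetic overflow silently, which is intended here: only the resulting bit pattern matters for a hash value.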
- 17. 17
All C# / Java objects already have a GetHashCode() method (hashCode() in Java)
Primitive types like int, long, float, double, decimal, …
Built-in types like: string, DateTime and Guid
Built-In Hash Functions in C# / Java
Hash function for System.String:
int c, hash1 = (5381<<16) + 5381; int hash2 = hash1;
char *s = src;
while ((c = s[0]) != 0) {
    hash1 = ((hash1 << 5) + hash1) ^ c;
    c = s[1];
    if (c == 0)
        break;
    hash2 = ((hash2 << 5) + hash2) ^ c;
    s += 2;
}
return hash1 + (hash2 * 1566083941);
- 18. 18
What if we have a composite key
E.g. FirstName + MiddleName + LastName?
1. Convert the keys to a string and get its hash code:
var key = string.Format("{0}-{1}-{2}", FirstName, MiddleName, LastName);
2. Use a custom hash-code function:
var hashCode = (this.FirstName != null ? this.FirstName.GetHashCode() : 0);
hashCode = (hashCode * 397) ^ (this.MiddleName != null ?
    this.MiddleName.GetHashCode() : 0);
hashCode = (hashCode * 397) ^ (this.LastName != null ?
    this.LastName.GetHashCode() : 0);
return hashCode;
Hash Functions on Composite Keys
- 19. 19
Hash table efficiency depends on:
Efficient hash functions
Most implementations use the built-in hash functions in C# / Java
The number of collisions should be as low as possible
Fill factor (used buckets / all buckets)
Typically at 70% fill → resize and rehash
Avoid frequent resizing! Define the hash table capacity in advance
Collision resolution algorithm
Most implementations use chaining with linked lists
Hash Tables and Efficiency
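Defining the capacity in advance is directly supported by Dictionary<TKey,TValue>, which has a constructor taking the initial capacity. A small sketch (the class PreSized and the method Build are hypothetical names):

```csharp
using System.Collections.Generic;

public static class PreSized
{
    // Builds a dictionary that is sized once, up front, for expectedCount
    // elements, so no intermediate resize is needed while it is filled.
    public static Dictionary<int, string> Build(int expectedCount)
    {
        var map = new Dictionary<int, string>(expectedCount);
        for (int i = 0; i < expectedCount; i++)
            map[i] = "item " + i;
        return map;
    }
}
```

This only helps when a reasonable estimate of the final size is known; an estimate that is far too large wastes memory instead.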
- 20. 20
Hash tables are the most efficient dictionary implementation
Add / Find / Delete take just a few primitive operations on average
Average speed does not depend on the size of the hash table
Amortized complexity O(1) – constant time
Example:
Finding an element in a hash table holding 1 000 000 elements
takes on average just 1-2 steps
Finding an element in an unsorted array holding 1 000 000 elements
takes on average 500 000 steps
Hash Tables and Efficiency
- 23. 23
Implements the ADT dictionary as a hash table
The size is dynamically increased as needed
Contains a collection of key-value pairs
Collisions are resolved by chaining
Elements have an unspecified (seemingly random) order
The order depends on the hash codes of the keys and must not be relied upon
Dictionary<TKey,TValue> relies on:
Object.Equals() – compares the keys
Object.GetHashCode() – calculates the hash codes of the keys
Dictionary<TKey,TValue>
- 24. 24
Major operations:
Add(key, value) – adds an element by key + value
Throws an exception when the key already exists
Remove(key) – removes a value by key
Returns true / false
this[key] = value – adds / replaces an element by key
this[key] – gets an element by key
Throws an exception on a non-existing key
Clear() – removes all elements
Count – returns the number of elements
Keys – returns a collection of all keys (in unspecified order)
Values – returns a collection of all values (in unspecified order)
Dictionary<TKey,TValue> (2)
- 25. 25
Dictionary<TKey,TValue> (3)
Major operations:
ContainsKey(key) – checks if given key exists in the dictionary
ContainsValue(value) – checks whether the dictionary
contains given value
Warning: slow operation – O(n)
TryGetValue(key, out value)
If the key is found, returns true and stores its value in the out parameter
Otherwise returns false
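A short sketch of TryGetValue in use; it looks up the key only once, unlike a ContainsKey check followed by the indexer. The helper GradeLookup.Describe is a hypothetical name used only for this example.

```csharp
using System.Collections.Generic;

public static class GradeLookup
{
    // Returns a description of the student's grade without throwing
    // when the name is missing from the dictionary.
    public static string Describe(Dictionary<string, int> grades, string name)
    {
        int grade;
        if (grades.TryGetValue(name, out grade))
            return name + ": " + grade;

        return name + ": no grade";
    }
}
```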
- 26. 26
Dictionary<TKey,TValue> – Example
var studentGrades = new Dictionary<string, int>();
studentGrades.Add("Ivan", 4);
studentGrades.Add("Peter", 6);
studentGrades.Add("Maria", 6);
studentGrades.Add("George", 5);
int peterGrade = studentGrades["Peter"];
Console.WriteLine("Peter's grade: {0}", peterGrade);
Console.WriteLine("Is Peter in the hash table: {0}",
    studentGrades.ContainsKey("Peter"));
Console.WriteLine("Students and their grades:");
foreach (var pair in studentGrades)
{
    Console.WriteLine("{0} --> {1}", pair.Key, pair.Value);
}
- 28. 28
Counting the Words in a Text
string text = "a text, some text, just some text";
var wordsCount = new Dictionary<string, int>();
// Skip the empty entries produced by adjacent separators such as ", "
string[] words = text.Split(new[] { ' ', ',', '.' },
    StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
    int count = 1;
    if (wordsCount.ContainsKey(word))
        count = wordsCount[word] + 1;
    wordsCount[word] = count;
}
foreach (var pair in wordsCount)
{
    Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
}
- 30. 30
Data structures can be nested, e.g. dictionary of lists:
Dictionary<string, List<int>>
Nested Data Structures
static Dictionary<string, List<int>> studentGrades =
new Dictionary<string, List<int>>();
private static void AddGrade(string name, int grade)
{
if (! studentGrades.ContainsKey(name))
{
studentGrades[name] = new List<int>();
}
studentGrades[name].Add(grade);
}
- 31. 31
Nested Data Structures (2)
var countriesAndCities =
new Dictionary<string, Dictionary<string, int>>();
countriesAndCities["Bulgaria"] = new Dictionary<string, int>();
countriesAndCities["Bulgaria"]["Sofia"] = 1000000;
countriesAndCities["Bulgaria"]["Plovdiv"] = 400000;
countriesAndCities["Bulgaria"]["Pernik"] = 30000;
foreach (var city in countriesAndCities["Bulgaria"])
{
Console.WriteLine("{0} : {1}", city.Key, city.Value);
}
var totalPopulation = countriesAndCities["Bulgaria"]
.Sum(c => c.Value);
Console.WriteLine(totalPopulation);
- 34. 34
SortedDictionary<TKey,TValue> implements the ADT
"dictionary" as a self-balancing search tree
Elements are arranged in the tree ordered by key
Traversing the tree returns the elements in increasing order
Add / Find / Delete take O(log(n)) operations
Use SortedDictionary<TKey,TValue> when you need the
elements sorted by key
Otherwise use Dictionary<TKey,TValue> – it has better
performance
SortedDictionary<TKey,TValue>
- 35. 35
Counting Words (Again)
string text = "a text, some text, just some text";
IDictionary<string, int> wordsCount =
    new SortedDictionary<string, int>();
// Skip the empty entries produced by adjacent separators such as ", "
string[] words = text.Split(new[] { ' ', ',', '.' },
    StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
    int count = 1;
    if (wordsCount.ContainsKey(word))
        count = wordsCount[word] + 1;
    wordsCount[word] = count;
}
foreach (var pair in wordsCount)
{
    Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
}
- 38. 38
Dictionary<TKey,TValue> relies on
Object.Equals() – for comparing the keys
Object.GetHashCode() – for calculating the hash codes of the
keys
SortedDictionary<TKey,TValue> relies on IComparable<T>
for ordering the keys
Built-in types like int, long, float, string and DateTime
already implement Equals(), GetHashCode() and
IComparable<T>
Other types used as dictionary keys should provide
custom implementations
IComparable<T>
- 39. 39
Implementing Equals() and GetHashCode()
public struct Point
{
public int X { get; set; }
public int Y { get; set; }
public override bool Equals(Object obj)
{
if (!(obj is Point)) return false; // "is" is already false for null
Point p = (Point)obj;
return (X == p.X) && (Y == p.Y);
}
public override int GetHashCode()
{
return (X << 16 | X >> 16) ^ Y;
}
}
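To see why both methods matter, here is a hedged sketch of a struct used as a dictionary key. GridPoint and PointLabels are hypothetical names, and the hash formula (multiplying by 397 and XOR-ing) is just one common choice, not the only valid one.

```csharp
using System;
using System.Collections.Generic;

public struct GridPoint : IEquatable<GridPoint>
{
    public int X { get; set; }
    public int Y { get; set; }

    public bool Equals(GridPoint other) => X == other.X && Y == other.Y;

    public override bool Equals(object obj) => obj is GridPoint p && Equals(p);

    // Equal points must produce equal hash codes.
    public override int GetHashCode()
    {
        unchecked
        {
            return (X * 397) ^ Y;
        }
    }
}

public static class PointLabels
{
    public static string Lookup()
    {
        var labels = new Dictionary<GridPoint, string>();
        labels[new GridPoint { X = 1, Y = 2 }] = "start";

        // A different instance with the same coordinates finds the entry,
        // because the dictionary compares keys via GetHashCode() and Equals().
        return labels[new GridPoint { X = 1, Y = 2 }];
    }
}
```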
- 40. 40
Implementing IComparable<T>
public struct Point : IComparable<Point>
{
public int X { get; set; }
public int Y { get; set; }
public int CompareTo(Point otherPoint)
{
if (X != otherPoint.X)
{
return this.X.CompareTo(otherPoint.X);
}
else
{
return this.Y.CompareTo(otherPoint.Y);
}
}
}
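A type implementing IComparable<T> can then be stored in the tree-based collections. The sketch below uses a hypothetical OrderedPoint struct with the same comparison rule (first by X, then by Y) and puts three points into a SortedSet<T>.

```csharp
using System;
using System.Collections.Generic;

public struct OrderedPoint : IComparable<OrderedPoint>
{
    public int X { get; set; }
    public int Y { get; set; }

    // Orders the points by X first and by Y when the X values are equal.
    public int CompareTo(OrderedPoint other)
    {
        if (X != other.X) return X.CompareTo(other.X);
        return Y.CompareTo(other.Y);
    }
}

public static class SortedPointsDemo
{
    public static string Run()
    {
        var points = new SortedSet<OrderedPoint>
        {
            new OrderedPoint { X = 2, Y = 1 },
            new OrderedPoint { X = 1, Y = 5 },
            new OrderedPoint { X = 1, Y = 3 }
        };

        // Enumeration follows the order defined by CompareTo().
        var parts = new List<string>();
        foreach (var p in points)
            parts.Add("(" + p.X + "," + p.Y + ")");
        return string.Join(" ", parts);   // "(1,3) (1,5) (2,1)"
    }
}
```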
- 42. 42
The abstract data type (ADT) "set" keeps a set of elements with no
duplicates
Sets with duplicates are also known as ADT "bag"
Set operations:
Add(element)
Contains(element) → true / false
Delete(element)
Union(set) / Intersect(set)
Sets can be implemented in several ways
List, array, hash table, balanced tree, ...
Set and Bag ADTs
- 44. 44
HashSet<T> implements the ADT set with a hash table
Elements are in no particular order
All major operations are fast:
Add(element) – adds an element to the set
Does nothing if the element already exists
Remove(element) – removes given element
Count – returns the number of elements
UnionWith(set) / IntersectWith(set) – performs union /
intersection with another set
HashSet<T>
- 45. 45
HashSet<T> – Example
ISet<string> firstSet = new HashSet<string>(
new string[] { "SQL", "Java", "C#", "PHP" });
ISet<string> secondSet = new HashSet<string>(
new string[] { "Oracle", "SQL", "MySQL" });
ISet<string> union = new HashSet<string>(firstSet);
union.UnionWith(secondSet);
foreach (var element in union)
{
Console.Write("{0} ", element);
}
Console.WriteLine();
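Intersection works the same way as union. A minimal sketch, assuming a hypothetical helper SetOperations.Intersection that copies the first collection so that neither input is modified:

```csharp
using System.Collections.Generic;

public static class SetOperations
{
    // Returns a new set holding only the elements present in both inputs.
    public static HashSet<string> Intersection(
        IEnumerable<string> first, IEnumerable<string> second)
    {
        var result = new HashSet<string>(first);
        result.IntersectWith(second);   // modifies result in place
        return result;
    }
}
```

For the two sets from the example above, the intersection contains only "SQL".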
- 46. 46
SortedSet<T> implements the ADT set with a balanced search tree
(red-black tree)
Elements are sorted in increasing order
Example:
SortedSet<T>
ISet<string> firstSet = new SortedSet<string>(
    new string[] { "SQL", "Java", "C#", "PHP" });
ISet<string> secondSet = new SortedSet<string>(
    new string[] { "Oracle", "SQL", "MySQL" });
ISet<string> union = new SortedSet<string>(firstSet);
union.UnionWith(secondSet);
PrintSet(union); // C# Java MySQL Oracle PHP SQL
- 48. 48
Data Structure: Internal Structure; Time Complexity (Add / Update / Delete)
Dictionary<K,V>, HashSet<K>: hash table; O(1)
SortedDictionary<K,V>, SortedSet<K>: balanced search tree; O(log(n))
Dictionaries and Sets Comparison
- 49. 49
Dictionaries map keys to values
Can be implemented as a hash table or
a balanced search tree
Hash tables map keys to values
Rely on hash functions to distribute the keys in the table
Collisions need a resolution algorithm (e.g. chaining)
Very fast add / find / delete – O(1)
Sets hold a group of elements
Hash-table or balanced tree implementations
Summary
- 51. License
This course (slides, examples, labs, videos, homework, etc.)
is licensed under the "Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International" license
Attribution: this work may contain portions from
"Fundamentals of Computer Programming with C#" book by Svetlin Nakov & Co. under CC-BY-SA license
"Data Structures and Algorithms" course by Telerik Academy under CC-BY-NC-SA license
Editor's Notes
- (c) 2007 National Academy for Software Development - http://academy.devbg.org. All rights reserved. Unauthorized copying or re-distribution is strictly prohibited.*