Python dictionary: past, present, future
Dmitry Alimov Senior Software Engineer
Zodiac Interactive
2016
SPb Python Interest Group
Dictionary in Python
>>> d = {} # the same as d = dict()
>>> d['a'] = 123
>>> d['b'] = 345
>>> d['c'] = 678
>>> d
{'a': 123, 'c': 678, 'b': 345}
>>> d['b']
345
>>> del d['c']
>>> d
{'a': 123, 'b': 345}
Dictionary keys must be hashable.
An object is hashable if it has a hash value which never changes during its lifetime.
>>> d[list()] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d[set()] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> d[dict()] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
All of Python’s immutable built-in objects are hashable
import random

class A(object):
    def __init__(self, index):
        self.index = index

    def __eq__(self, other):
        return True

    def __hash__(self):
        return random.randint(0, 3)

    def __repr__(self):
        return 'A%d' % self.index

d = {A(0): 0, A(1): 1, A(2): 2}
print('keys: %s' % d.keys())
print('values: %s' % d.values())
for k in d:
    print('%s = %s' % (k, d.get(k, 'not found')))
Random hash is a bad idea
Run 1
keys: [A1, A2, A0]
values: [1, 2, 0]
A1 = 1
A2 = not found
A0 = 0
Run 2
keys: [A1, A0]
values: [2, 0]
A1 = not found
A0 = not found
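For contrast, here is a minimal sketch of a well-behaved key type. The class name Point and its fields are hypothetical; the only point is that __eq__ and __hash__ are derived from the same values, so equal objects always land on the same hash.

```python
class Point:
    """Hypothetical key type: equality and hash use the same fields."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal objects must produce equal hashes, and the value must not
        # change while the object is used as a key.
        return hash((self.x, self.y))

    def __repr__(self):
        return 'Point(%d, %d)' % (self.x, self.y)


d = {Point(0, 0): 'origin', Point(1, 2): 'p'}
print(d[Point(1, 2)])   # a fresh but equal object finds the entry
print(len(d))
```

The sketch assumes x and y are not modified after the object has been inserted; mutating them would change the hash and bring back the "not found" behaviour shown above.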
Past
Three kinds of slots in the table: 1) Unused 2) Active 3) Dummy
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
- Hash table
- Open addressing collision resolution strategy
- Initial size = 8
- Load factor = 2/3
- Growth rate = 2 or 4 (depending on the number of cells used)
- “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”
Dictionary in CPython >2.1
ma_fill – number of non-NULL keys (sum of Active and Dummy)
ma_used – number of Active items
ma_mask – table size - 1 (initially PyDict_MINSIZE - 1)
ma_lookup – lookup function (lookdict_string by default)
#define PyDict_MINSIZE 8

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
Good hash functions are needed
>>> map(hash, [0, 1, 2, 3, 4])
[0, 1, 2, 3, 4]
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[1540938117, 1540938118, 1540938119, 1540938112, 1540938113]
Modified FNV (Fowler–Noll–Vo) hash function for strings
“-R” option – turns on hash randomization, so that the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[-218138032, -218138029, -218138030, -218138027, -218138028]
Hash functions
Collision resolution
A collision occurs when two distinct keys have the same hash value.
Probing is a scheme for resolving collisions in a hash table: when a slot is already taken, further slots are examined in a fixed sequence until the key or a free slot is found.
CPython uses pseudo-random probing.
PERTURB_SHIFT = 5
perturb = hash(key)
j = hash(key)  # the first slot probed is j % 2**i, where 2**i is the table size
while True:
    j = (5 * j) + 1 + perturb
    perturb >>= PERTURB_SHIFT
    index = j % 2**i
See “/Objects/dictobject.c”
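The comments in dictobject.c point out that, once perturb has been shifted down to zero, the recurrence j = (5*j + 1) mod 2**i visits every slot of the table exactly once per cycle. A small sketch of that property, with perturbation deliberately left out (the function name probe_order is made up for illustration):

```python
def probe_order(start, size):
    """Slots visited by j = (5*j + 1) % size, beginning at `start`.

    `size` is assumed to be a power of two; perturbation is omitted.
    """
    order = []
    j = start
    for _ in range(size):
        order.append(j)
        j = (5 * j + 1) % size
    return order


order = probe_order(start=3, size=8)
print(order)                              # [3, 0, 1, 6, 7, 4, 5, 2]
print(sorted(order) == list(range(8)))    # True
```

This is why a lookup always reaches an unused slot eventually, as long as the table is never completely full.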
CPython <2.2 used polynomial-based index computation
>>> PyDict_MINSIZE = 8
>>> key = 123
>>> hash(key) % PyDict_MINSIZE
3
Index computing
>>> mask = PyDict_MINSIZE - 1
>>> hash(key) & mask
3
Instead of the modulo operation, use bitwise "AND" with the mask
Get the least significant bits of the hash:
2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
hash(123) = 123 = 0b1111011
mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3
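A quick check, written for Python 3, that masking and modulo agree when the table size is a power of two. The sample hash values are arbitrary, apart from 123 and the hash of '!!!' quoted elsewhere in the slides.

```python
PyDict_MINSIZE = 8
mask = PyDict_MINSIZE - 1

# For a power-of-two size, `h & mask` keeps the low bits of h, which is
# exactly what Python's `h % size` returns (also for negative h).
for h in (123, 0, 7, 8, -1297030748):
    assert h & mask == h % PyDict_MINSIZE

print(123 & mask)   # 3
```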
mask = PyDict_MINSIZE - 1
index = hash(123) & mask
Integers
Strings
Dictionary in CPython >2.1
Dictionary initialization
PyDict_New()
ma_used = 0
ma_fill = 0
ma_mask = PyDict_MINSIZE - 1
ma_table = ma_smalltable
ma_lookup = lookdict_string
Add an item
PyDict_SetItem()
insertdict()
ma_used += 1
ma_fill += 1
dictresize() if ma_fill >= 2/3 * size
Delete an item
PyDict_DelItem()
ma_used -= 1
Add item
hash('!!!') = -1297030748
i = -1297030748 & 7 = 4

perturb = -1297030748
# i = (i * 5) + 1 + perturb
i = (4 * 5) + 1 + (-1297030748) = -1297030727
index = -1297030727 & 7 = 1

# perturb = perturb >> PERTURB_SHIFT
perturb = -1297030748 >> 5 = -40532211
# i = (i * 5) + 1 + perturb
i = (-1297030727 * 5) + 1 + (-40532211) = -6525685845
index = -6525685845 & 7 = 3
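The arithmetic above can be replayed in a few lines. This is a sketch: the function name first_probes and its bits parameter are invented here, and the hash value is hard-coded from the slide because hash('!!!') differs between interpreters and runs. With bits=64 the hash is first reinterpreted as an unsigned value, which is an assumption about how an unsigned C variable would behave; for the first probes both variants give the same slots.

```python
PERTURB_SHIFT = 5


def first_probes(h, mask, count, bits=None):
    """Return the first `count` slot indexes probed for hash value `h`.

    bits=None uses Python's signed integers, as in the worked example;
    bits=64 treats the hash as an unsigned 64-bit value (an assumption).
    """
    perturb = h if bits is None else h & (2**bits - 1)
    i = perturb & mask
    probes = []
    for _ in range(count):
        probes.append(i & mask)
        i = i * 5 + 1 + perturb
        perturb >>= PERTURB_SHIFT
    return probes


h = -1297030748                              # hash('!!!') as quoted above
print(first_probes(h, mask=7, count=3))      # [4, 1, 3]
print(first_probes(h, mask=7, count=3, bits=64))
```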
>>> d
{'python': 2, 'article': 4, '!!!': 5, 'dict': 3, 'a key': 1}
>>> d.__sizeof__()
248
Add item
Hash table resize
>>> d
{'!!!': 5, 'python': 2, 'dict': 3, 'a key': 1, 'article': 4, ';)': 6}
>>> d.__sizeof__()
1016
Hash table resize
PyDict_SetItem(...) {
    ...
    dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
    ...
}

dictresize(PyDictObject *mp, Py_ssize_t minused) {
    ...
    /* Find the smallest table size > minused. */
    for (newsize = 8; newsize <= minused && newsize > 0; newsize <<= 1)
        ;
    ...
}
In the example:
ma_fill = 6 > (8 * 2 / 3)
ma_used = 6
Hence minused = 4 * 6 = 24, therefore newsize = 32
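The size computation can be sketched in Python as follows. The helper name new_table_size is hypothetical; the factor of 4 or 2 and the doubling loop are taken from the C fragments above.

```python
def new_table_size(ma_used, minsize=8):
    """Smallest power-of-two table size strictly greater than minused."""
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = minsize
    while newsize <= minused:
        newsize <<= 1
    return newsize


print(new_table_size(6))   # 32
```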
Addition order
>>> d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
>>> d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}
>>> d1 == d2
True
>>> d1.keys()
['four', 'three', 'five', 'two', 'one']
>>> d2.keys()
['four', 'one', 'five', 'three', 'two']
The slot an item lands in depends on the items already in the dictionary, so the key order depends on the order of addition
>>> 7.0 == 7 == (7+0j)
True
>>> d = {}
>>> d[7.0] = 'float'
>>> d
{7.0: 'float'}
>>> d[7] = 'int'
>>> d
{7.0: 'int'}
>>> d[7+0j] = 'complex'
>>> d
{7.0: 'complex'}
>>> type(d.keys()[0])
<type 'float'>
int, float, complex
>>> hash(7)
7
>>> hash(7.0)
7
>>> hash(7+0j)
7
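The same experiment as a Python 3 script (the session above is Python 2, where d.keys() returns a list). It shows that the three numbers compare equal and hash equally, so they occupy one entry: the first key object is kept and only the value is replaced.

```python
assert 7 == 7.0 == (7 + 0j)
assert hash(7) == hash(7.0) == hash(7 + 0j)

d = {}
d[7.0] = 'float'
d[7] = 'int'          # same entry: the value is replaced, the key is not
d[7 + 0j] = 'complex'

key = next(iter(d))
print(len(d), type(key).__name__, d[key])   # 1 float complex
```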
>>> d = {'a': 1}
>>> for i in d:
...     d['new item'] = 123
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
Adding item during iteration
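A common way around this error is to iterate over a snapshot of the keys. A short Python 3 sketch; the '_copy' suffix is only an illustration:

```python
d = {'a': 1}

try:
    for key in d:
        d['new item'] = 123        # changes the size during iteration
except RuntimeError as exc:
    print('RuntimeError:', exc)

d = {'a': 1}
for key in list(d):                # iterate over a copy of the keys
    d[key + '_copy'] = d[key]

print(sorted(d))                   # ['a', 'a_copy']
```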
Delete item
dummy = PyString_FromString("<dummy key>");
Interesting case
ma_fill = 6 > (8 * 2 / 3)
dictresize()
Interesting case
ma_fill = 6 > (8 * 2 / 3)
ma_used = 1
hence minused = 4 * 1 = 4, therefore newsize = 8
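This case can be traced with a toy model of the table. The sketch below is not CPython's code: the names ToyDict, DUMMY, fill and used are illustrative, it is written for Python 3, and it relies on small integers hashing to themselves. It only mirrors the bookkeeping described above: deleting leaves a dummy behind (fill unchanged, used decremented), and a "resize" to the same size rebuilds the table without the dummies.

```python
DUMMY = object()          # marks a slot whose item was deleted


class ToyDict:
    """Toy open-addressing table; names are illustrative, not CPython's."""

    MINSIZE = 8

    def __init__(self):
        self.size = self.MINSIZE
        self.slots = [None] * self.size   # None = unused slot
        self.fill = 0                     # active + dummy slots
        self.used = 0                     # active slots only

    def _probe(self, key):
        mask = self.size - 1
        perturb = hash(key) & (2**64 - 1)
        i = perturb & mask
        while True:
            yield i
            i = (i * 5 + 1 + perturb) & mask
            perturb >>= 5

    def _find(self, key):
        """Return (slot holding key or None, slot to use for an insert)."""
        free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                return None, (i if free is None else free)
            if slot is DUMMY:
                if free is None:
                    free = i
            elif slot[0] == key:
                return i, i

    def __setitem__(self, key, value):
        found, target = self._find(key)
        if found is not None:
            self.slots[found] = (key, value)
            return
        if self.slots[target] is None:
            self.fill += 1                # reusing a dummy leaves fill as is
        self.slots[target] = (key, value)
        self.used += 1
        if self.fill * 3 >= self.size * 2:
            self._resize((2 if self.used > 50000 else 4) * self.used)

    def __delitem__(self, key):
        found, _ = self._find(key)
        if found is None:
            raise KeyError(key)
        self.slots[found] = DUMMY         # keeps probe chains intact
        self.used -= 1                    # fill is unchanged

    def _resize(self, minused):
        newsize = self.MINSIZE
        while newsize <= minused:
            newsize <<= 1
        items = [s for s in self.slots if s is not None and s is not DUMMY]
        self.size, self.slots = newsize, [None] * newsize
        self.fill = self.used = 0
        for key, value in items:
            self[key] = value             # dummies are dropped here


d = ToyDict()
for k in range(5):
    d[k] = k
for k in range(5):
    del d[k]
print(d.size, d.fill, d.used)   # 8 5 0
d[5] = 5                        # sixth filled slot triggers the resize
print(d.size, d.fill, d.used)   # 8 1 1
```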
Cache
PyDictEntry ma_smalltable[8];
On 32-bit x86 with 64 bytes per cache line, one entry takes 4 * 3 = 12 bytes, so 64 / (4 * 3) = 5.333 entries fit in a cache line
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
Cache locality and collisions
See “/Objects/dictnotes.txt”
Source      Access time
L1 Cache    1 ns
L2 Cache    4 ns
RAM         100 ns
Open addressing vs separate chaining
Note: the figure shows linear probing rather than the pseudo-random probing used in CPython
OrderedDict
from collections import OrderedDict
- Internal dict
- Circular doubly linked list
- “/Lib/collections/__init__.py”
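The combination of an internal dict and a circular doubly linked list can be sketched as below. This is a simplified illustration, not the code from the standard library; the names Link and TinyOrderedDict are hypothetical, and only insertion, lookup, deletion and iteration are covered.

```python
class Link:
    __slots__ = ('prev', 'next', 'key')


class TinyOrderedDict:
    """Hypothetical miniature of the dict + circular linked list idea."""

    def __init__(self):
        self._data = {}                 # key -> value
        self._links = {}                # key -> Link node
        self._root = root = Link()      # sentinel closing the circle
        root.prev = root.next = root
        root.key = None

    def __setitem__(self, key, value):
        if key not in self._links:
            # New key: splice a node in just before the sentinel (the end).
            link = Link()
            last = self._root.prev
            link.prev, link.next, link.key = last, self._root, key
            last.next = link
            self._root.prev = link
            self._links[key] = link
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]
        link = self._links.pop(key)
        link.prev.next = link.next      # unlink in O(1)
        link.next.prev = link.prev

    def __iter__(self):
        node = self._root.next
        while node is not self._root:
            yield node.key
            node = node.next


od = TinyOrderedDict()
od['b'] = 1
od['a'] = 2
od['c'] = 3
del od['a']
od['a'] = 4
print(list(od))   # ['b', 'c', 'a']
```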
Present
Dictionary in CPython 3.5
- PEP 412 - Key-Sharing Dictionary
- The DictObject can be in one of two forms: combined table or split table
- Initial size = 4 (split table) or 8 (combined table)
- Maximum dictionary load = (2*n+1)/3
- Growth rate = used*2 + capacity/2
- “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
Combined table vs split table
Combined table
- For explicit dictionaries (dict() and {})
- ma_values = NULL, dk_refcnt = 1
- Never becomes a split-table dictionary
Split table
- For attribute dictionaries (the __dict__ attribute of an object)
- ma_values != NULL, dk_refcnt >= 1
- Only string (unicode) keys are allowed
- Values are stored in the ma_values array
- When a split dictionary is resized, it is converted to a combined table (but if the resize results from storing an instance attribute and there is only one instance of the class, the dictionary is re-split immediately)
- Lookup function = lookdict_split
Dictionary in CPython 3.5
A new kind of slot:
1) Unused
2) Active
3) Dummy
4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
Split table
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3, i.e. initially ma_keys->dk_usable = 3
Split table
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
setattr(a, 'd', 4)  # re-split
print(a.__dict__.__sizeof__())  # 168
print({}.__sizeof__())  # 264
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3
Growth rate = used*2 + capacity/2 = 3*2 + 4/2 = 8, hence minused = 8, therefore newsize = 16 (see dictresize)
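The numbers above can be reproduced with three small helpers. The Python function names are made up for this sketch, and the starting size of 8 in new_size is an assumption inferred from the stated result (minused = 8 giving newsize = 16).

```python
def usable_fraction(n):
    """Maximum dictionary load for a table of n slots: (2*n+1)/3."""
    return (2 * n + 1) // 3


def growth_rate(used, capacity):
    """Requested minimum after growth: used*2 + capacity/2."""
    return used * 2 + capacity // 2


def new_size(minused, minsize=8):
    """Smallest power of two, starting from minsize, that exceeds minused."""
    newsize = minsize
    while newsize <= minused:
        newsize <<= 1
    return newsize


capacity = 4
used = usable_fraction(capacity)
minused = growth_rate(used, capacity)
print(used, minused, new_size(minused))   # 3 8 16
```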
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
b = A()
setattr(a, 'd', 4)  # no re-split because of b
print(a.__dict__.__sizeof__())  # 456
Split table
Split table is converted to a combined table
Key differences between this implementation and CPython 2.x:
- The table can be split into two parts – the keys and the values
- A new kind of slot
- No more ma_smalltable embedded in the dict
- General dictionaries are slightly larger
- All object dictionaries of a single class can share a single key-table, saving about 60% memory for such cases (according to https://github.com/python/cpython/blob/3.5/Objects/dictnotes.txt)
Bugs still happen: unbounded memory growth when resizing split-table dicts (https://bugs.python.org/issue28147)
Summary
Hash functions in CPython 3.5
SipHash for strings and bytes (>= CPython 3.4)
- Resistant against hash flooding DoS attacks
- Successfully used in many other languages
- PEP 456 – Secure and interchangeable hash algorithm
Slightly modified hash function for float
hash(float("+inf")) == 314159,
hash(float("-inf")) == -314159, was -271828
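On a CPython 3.4 or later interpreter these details can be inspected through sys.hash_info. A small sketch; the algorithm name that gets printed depends on the build, so no particular value is assumed here.

```python
import sys

# Name of the string/bytes hash algorithm chosen at build time.
print(sys.hash_info.algorithm)

# Hash value used for positive infinity; negative infinity is its negation.
print(sys.hash_info.inf)
assert hash(float('inf')) == sys.hash_info.inf
assert hash(float('-inf')) == -sys.hash_info.inf
```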
OrderedDict in CPython 3.5
- Doubly-linked-list
- od_fast_nodes hash table that mirrors the od_dict table
- “/Include/odictobject.h”, “/Objects/odictobject.c”
Alternative versions
Dictionary in PyPy
- Starting from PyPy 2.5.0 – ordereddict is used by default
- Initial size = 16
- Load factor up to 2/3
- Growth rate = 4 (up to 30000 items) or 2
- If a lot of items are deleted the compaction is performed
- “/rpython/rtyper/lltypesystem/rordereddict.py”
struct dicttable {
    int num_live_items;
    int num_ever_used_items;
    int resize_counter;
    variable_int *indexes; // byte, short, int, long
    dictentry *entries;
    ...
}

struct dictentry {
    PyObject *key;
    PyObject *value;
    long hash;
    bool valid;
}
Dictionary in PyPy
struct dicttable {
    variable_int *indexes;
    dictentry *entries;
    ...
}

FREE = 0
DELETED = 1
VALID_OFFSET = 2
PyDictionary in Jython
- Based on ConcurrentHashMap
- Separate chaining collision resolution
- Initial size = 16, load factor = 0.75, growth rate = 2
- Segments and thread safety
PythonDictionary in IronPython
- Based on Dictionary (.NET)
- Separate chaining collision resolution
- Initial size = 0, load factor = 1.0
- Rehashing if the number of collisions >= 100
- Growth rate = 2 (the new size is equal to the next higher prime number) from a set of primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107,… , 4999559, 5999471, 7199369}
Future
Raymond Hettinger is happy
Dictionary in CPython 3.6
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used; /* number of items in the dictionary */
    uint64_t ma_version_tag; /* unique, changes when dict modified */
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

- ma_version_tag is added (PEP 509 – Add a private version to dict)
- Initial size = 8 (for split table too)
- Maximum dictionary load = (2*n)/3
- Contributed by INADA Naoki in https://bugs.python.org/issue27350

Four kinds of slots in the table:
1) Unused (index == DKIX_EMPTY == -1)
2) Active (index >= 0, me_key != NULL and me_value != NULL)
3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
4) Pending (index >= 0, me_key != NULL and me_value == NULL, only for split table)
Dictionary in CPython 3.6
- Added dk_nentries and dk_indices

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size; /* Size of the hash table (dk_indices) */
    dict_lookup_func dk_lookup; /* Function to lookup in dk_indices */
    Py_ssize_t dk_usable; /* Number of usable entries in dk_entries */
    Py_ssize_t dk_nentries; /* Number of used entries in dk_entries */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;
    PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
};
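The idea behind dk_indices and dk_entries can be sketched as a small Python class: a sparse table of integer indexes points into a dense, insertion-ordered list of entries. This is a toy under several assumptions, not the C implementation: the class name CompactDict is hypothetical, the table has a fixed size with no resizing, a deleted entry simply leaves a hole, and dummy index slots are never reused.

```python
DKIX_EMPTY = -1
DKIX_DUMMY = -2


class CompactDict:
    """Fixed-size toy: sparse index table plus dense entries list."""

    def __init__(self, size=8):
        self.indices = [DKIX_EMPTY] * size   # hash table of small integers
        self.entries = []                    # [hash, key, value], in order
        self.usable = (2 * size) // 3        # maximum load (2*n)/3

    def _slots(self, h):
        mask = len(self.indices) - 1
        perturb = h & (2**64 - 1)
        i = perturb & mask
        while True:
            yield i
            i = (i * 5 + 1 + perturb) & mask
            perturb >>= 5

    def _lookup(self, key):
        """Return (slot in indices, position in entries or None)."""
        h = hash(key)
        for slot in self._slots(h):
            ix = self.indices[slot]
            if ix == DKIX_EMPTY:
                return slot, None
            if ix != DKIX_DUMMY:
                entry = self.entries[ix]
                if entry[0] == h and entry[1] == key:
                    return slot, ix

    def __setitem__(self, key, value):
        slot, ix = self._lookup(key)
        if ix is not None:
            self.entries[ix][2] = value
            return
        if len(self.entries) >= self.usable:
            raise RuntimeError('toy table is full; resizing is not sketched')
        self.indices[slot] = len(self.entries)
        self.entries.append([hash(key), key, value])

    def __getitem__(self, key):
        _, ix = self._lookup(key)
        if ix is None:
            raise KeyError(key)
        return self.entries[ix][2]

    def __delitem__(self, key):
        slot, ix = self._lookup(key)
        if ix is None:
            raise KeyError(key)
        self.indices[slot] = DKIX_DUMMY
        self.entries[ix] = None              # leave a hole; order is kept

    def __iter__(self):
        return (e[1] for e in self.entries if e is not None)


d = CompactDict()
d['b'] = 1
d['a'] = 2
d['c'] = 3
print(list(d))    # ['b', 'a', 'c']: insertion order
del d['a']
d['a'] = 9
print(list(d))    # ['b', 'c', 'a']
```

Iteration walks the dense entries list only, which is where the ordering comes from; the sparse part holds nothing but small integers, which is where the memory saving comes from.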
Dictionary in CPython 3.6 (Combined table)
Key differences between this implementation and CPython 3.5:
- Compact and ordered
- Added dk_indices, whose element type depends on the size of the dictionary
- Added ma_version_tag (PEP 509)
- Initial size for split table is changed to 8
- Maximum dictionary load changed to (2*n)/3
- Deleting an item from a split-table dict converts it to a combined table
- Preserving the order of **kwargs in a function (PEP 468) is implemented
- Preserving Class Attribute Definition Order (PEP 520) is implemented
- The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5 (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes)
Summary
Q & A
@delimitry
spbpython.guru
Additional slides
Separate chaining collision resolution
Open addressing collision resolution (pseudo-random probing)