Python dictionary: past, present, future

Dmitry Alimov Senior Software Engineer

Zodiac Interactive

2016

SPb Python Interest Group

Dictionary in Python

>>> d = {} # the same as d = dict()

>>> d['a'] = 123

>>> d['b'] = 345

>>> d['c'] = 678

>>> d

{'a': 123, 'c': 678, 'b': 345}

>>> d['b']

345

>>> del d['c']

>>> d

{'a': 123, 'b': 345}

Dictionary keys must be hashable An object is hashable if it has a hash value which never changes during its lifetime

>>> d[list()] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d[set()] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> d[dict()] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

All of Python’s immutable built-in objects are hashable
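A quick way to check hashability is to call hash() and catch TypeError. The sketch below uses a hypothetical helper, is_hashable (a name introduced here for illustration, not from the slides). It also shows that an immutable container is hashable only if everything inside it is hashable:

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


results = {
    'tuple of ints': is_hashable((1, 2)),
    'frozenset': is_hashable(frozenset([1, 2])),
    'list': is_hashable([1, 2]),
    'tuple holding a list': is_hashable((1, [2])),
}
print(results)
```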

import random

class A(object):
    def __init__(self, index):
        self.index = index

    def __eq__(self, other):
        return True

    def __hash__(self):
        return random.randint(0, 3)

    def __repr__(self):
        return 'A%d' % self.index

d = {A(0): 0, A(1): 1, A(2): 2}
print('keys: %s' % d.keys())
print('values: %s' % d.values())
for k in d:
    print('%s = %s' % (k, d.get(k, 'not found')))

Random hash is a bad idea

Run 1

keys: [A1, A2, A0]

values: [1, 2, 0]

A1 = 1

A2 = not found

A0 = 0

Run 2

keys: [A1, A0]

values: [2, 0]

A1 = not found

A0 = not found
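For contrast, here is a minimal sketch of a well-behaved key class. StableKey is a hypothetical name used only for this illustration; the point is that __hash__ is derived from the same immutable data that __eq__ compares, so every lookup finds its item:

```python
class StableKey(object):
    """Key whose hash never changes and agrees with __eq__."""

    def __init__(self, index):
        self.index = index

    def __eq__(self, other):
        return isinstance(other, StableKey) and self.index == other.index

    def __hash__(self):
        # Equal objects must have equal hashes, so hash the compared field.
        return hash(self.index)

    def __repr__(self):
        return 'StableKey(%d)' % self.index


d = {StableKey(i): i for i in range(3)}
lookups = [d.get(StableKey(i), 'not found') for i in range(3)]
print(lookups)
```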

Three kinds of slots in the table:
1) Unused
2) Active
3) Dummy

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

- Hash table
- Open addressing collision resolution strategy
- Initial size = 8
- Load factor = 2/3
- Growth rate = 2 or 4 (depending on the number of cells used)
- “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”

Dictionary in CPython >2.1

ma_fill – the number of non-NULL keys (sum of Active and Dummy)
ma_used – the number of Active items
ma_mask – the table size minus one (initially PyDict_MINSIZE - 1)
ma_lookup – the lookup function (lookdict_string by default)

#define PyDict_MINSIZE 8

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Good hash functions are needed

>>> map(hash, [0, 1, 2, 3, 4])
[0, 1, 2, 3, 4]
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[1540938117, 1540938118, 1540938119, 1540938112, 1540938113]

Modified FNV (Fowler–Noll–Vo) hash function for strings

“-R” option – turns on hash randomization, so that the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value

>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[-218138032, -218138029, -218138030, -218138027, -218138028]

Hash functions

Collision resolution

A collision occurs when two distinct pieces of data have the same hash value. Probing is a scheme for resolving collisions in hash tables that maintain a collection of key–value pairs and look up the value associated with a given key. CPython uses pseudo-random probing.

PERTURB_SHIFT = 5
perturb = hash(key)
while True:
    j = (5 * j) + 1 + perturb
    perturb >>= PERTURB_SHIFT
    index = j % 2**i

See “/Objects/dictobject.c”

CPython versions before 2.2 used polynomial-based index computation

>>> PyDict_MINSIZE = 8
>>> key = 123
>>> hash(key) % PyDict_MINSIZE
3

Index computing

>>> mask = PyDict_MINSIZE - 1
>>> hash(key) & mask
3

Instead of the modulo operation, use a bitwise AND with the mask

Get the least significant bits of the hash:
2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
hash(123) = 123 = 0b1111011
mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3
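The equivalence of the two index computations can be checked directly. This is a small sketch with hypothetical helper names (index_by_modulo, index_by_mask); it relies on the table size being a power of two and on Python's % returning a non-negative result for a positive divisor:

```python
PyDict_MINSIZE = 8
mask = PyDict_MINSIZE - 1


def index_by_modulo(h, size=PyDict_MINSIZE):
    """Slot index via the modulo operation."""
    return h % size


def index_by_mask(h, mask=mask):
    """Slot index via bitwise AND; valid when size is a power of two."""
    return h & mask


# Compare both methods over a range of positive and negative hash values.
same = all(index_by_modulo(h) == index_by_mask(h) for h in range(-1000, 1000))
print(index_by_mask(123), same)
```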

mask = PyDict_MINSIZE - 1 index = hash(123) & mask

Integers

Strings

Dictionary initialization

PyDict_New()
    ma_used = 0
    ma_fill = 0
    ma_mask = PyDict_MINSIZE – 1
    ma_table = ma_smalltable
    ma_lookup = lookdict_string

Add an item

PyDict_SetItem()
    insertdict()
        ma_used += 1
        ma_fill += 1
    dictresize() if ma_fill >= 2/3 * size

Delete an item

PyDict_DelItem()
    ma_used -= 1

Add item

hash('!!!') = -1297030748
i = -1297030748 & 7 = 4

perturb = -1297030748
# i = (i * 5) + 1 + perturb
i = (4 * 5) + 1 + (-1297030748) = -1297030727
index = -1297030727 & 7 = 1

# perturb = perturb >> PERTURB_SHIFT
perturb = -1297030748 >> 5 = -40532211
# i = (i * 5) + 1 + perturb
i = (-1297030727 * 5) + 1 + (-40532211) = -6525685845
index = -6525685845 & 7 = 3
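The probing steps above can be sketched as a small Python function. probe_indices is a hypothetical name, and the sketch follows the arithmetic on the slide using Python's signed, unbounded integers; the C code works on a fixed-width machine word, so only the masked index values are meant to match, not the intermediate numbers in general.

```python
PERTURB_SHIFT = 5


def probe_indices(hash_value, mask, count):
    """Return the first `count` (count >= 1) slot indices probed for a hash."""
    i = hash_value & mask              # initial slot
    perturb = hash_value
    indices = [i]
    while len(indices) < count:
        i = (i * 5) + 1 + perturb      # mix in the remaining hash bits
        perturb >>= PERTURB_SHIFT
        indices.append(i & mask)
    return indices


# The hash value of '!!!' quoted on the slide (Python 2, no randomization).
print(probe_indices(-1297030748, 7, 3))
```

With the hash value from the slide this reproduces the slots 4, 1 and 3 computed by hand above.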

>>> d
{'python': 2, 'article': 4, '!!!': 5, 'dict': 3, 'a key': 1}
>>> d.__sizeof__()
248

Add item

Hash table resize

>>> d
{'!!!': 5, 'python': 2, 'dict': 3, 'a key': 1, 'article': 4, ';)': 6}
>>> d.__sizeof__()
1016

PyDict_SetItem(...)
{
    ...
    dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
    ...
}

dictresize(PyDictObject *mp, Py_ssize_t minused)
{
    ...
    /* Find the smallest table size > minused. */
    for (newsize = 8; newsize <= minused && newsize > 0; newsize <<= 1)
        ;
    ...
}

In the example:
ma_fill = 6 > (8 * 2 / 3)
ma_used = 6

Hence minused = 4 * 6 = 24, therefore newsize = 32
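The size computation can be replayed in Python. This is a sketch of the arithmetic only, with a hypothetical function name (new_table_size); it mirrors the growth factor and the doubling loop quoted from PyDict_SetItem() and dictresize():

```python
def new_table_size(ma_used, minsize=8):
    """Table size chosen on resize for a dict with `ma_used` active items."""
    factor = 2 if ma_used > 50000 else 4
    minused = factor * ma_used
    newsize = minsize
    # Find the smallest power-of-two size strictly greater than minused.
    while newsize <= minused:
        newsize <<= 1
    return newsize


print(new_table_size(6))   # the example above: minused = 24
```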

Addition order

>>> d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
>>> d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}
>>> d1 == d2
True
>>> d1.keys()
['four', 'three', 'five', 'two', 'one']
>>> d2.keys()
['four', 'one', 'five', 'three', 'two']

The position of an item added to the dictionary depends on the items already in it, so two equal dictionaries built in a different order can iterate in a different order

>>> 7.0 == 7 == (7+0j)
True
>>> d = {}
>>> d[7.0] = 'float'
>>> d
{7.0: 'float'}
>>> d[7] = 'int'
>>> d
{7.0: 'int'}
>>> d[7+0j] = 'complex'
>>> d
{7.0: 'complex'}
>>> type(d.keys()[0])
<type 'float'>

int, float, complex

>>> hash(7)
7
>>> hash(7.0)
7
>>> hash(7+0j)
7
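The same rule (equal numbers must hash equally) also covers the standard library's Fraction and Decimal types in Python 3. They are not mentioned on the slides and are used here only as an illustration. In this sketch the int is inserted first, so it remains the stored key while later assignments only replace the value:

```python
from decimal import Decimal
from fractions import Fraction

sevens = [7, 7.0, 7 + 0j, Fraction(7, 1), Decimal(7)]
hashes = [hash(x) for x in sevens]

d = {}
for x in sevens:
    # Each assignment finds the existing equal key and overwrites its value.
    d[x] = type(x).__name__

print(hashes)
print(d)
```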

>>> d = {'a': 1}
>>> for i in d:
...     d['new item'] = 123
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

Adding item during iteration
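A common way around this error is to iterate over a snapshot of the keys. The sketch below first reproduces the failure and then adds items while looping over list(d); the '_copy' suffix is arbitrary and only for illustration:

```python
d = {'a': 1}
error = None
try:
    for key in d:
        d['new item'] = 123        # changes the size during iteration
except RuntimeError as exc:
    error = str(exc)

d = {'a': 1}
for key in list(d):                # iterate over a copy of the keys
    d[key + '_copy'] = d[key]

print(error)
print(d)
```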

Delete item

dummy = PyString_FromString("<dummy key>");
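Why a deleted slot is marked Dummy rather than Unused can be seen in a toy table. This is a minimal sketch, not CPython's code: ToyTable and its methods are hypothetical, it uses linear probing for brevity (CPython uses pseudo-random probing), and it never resizes. The counters used and fill play the roles of ma_used and ma_fill.

```python
UNUSED = object()   # slot has never held a key
DUMMY = object()    # slot held a key that was deleted


class ToyTable(object):
    def __init__(self, size=8):
        self.size = size
        self.keys = [UNUSED] * size
        self.values = [None] * size
        self.used = 0    # Active slots
        self.fill = 0    # Active + Dummy slots

    def _slots(self, key):
        start = hash(key) % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def insert(self, key, value):
        target = None
        for i in self._slots(key):
            slot = self.keys[i]
            if slot is UNUSED:
                if target is None:
                    target = i
                    self.fill += 1       # a fresh slot becomes occupied
                break
            if slot is DUMMY:
                if target is None:
                    target = i           # reuse the slot; fill is unchanged
            elif slot == key:
                self.values[i] = value   # existing key: replace the value
                return
        if target is None:
            raise RuntimeError('table is full (the toy never resizes)')
        self.keys[target] = key
        self.values[target] = value
        self.used += 1

    def _find(self, key):
        for i in self._slots(key):
            slot = self.keys[i]
            if slot is UNUSED:
                break                    # end of the probe chain
            if slot is not DUMMY and slot == key:
                return i
        raise KeyError(key)

    def lookup(self, key):
        return self.values[self._find(key)]

    def delete(self, key):
        i = self._find(key)
        self.keys[i] = DUMMY             # keep the probe chain intact
        self.values[i] = None
        self.used -= 1                   # fill stays the same


t = ToyTable()
t.insert(1, 'a')
t.insert(9, 'b')      # 9 % 8 == 1, so it collides with key 1
t.delete(1)
print(t.lookup(9), t.used, t.fill)
```

If the deleted slot were reset to Unused, the lookup of 9 would stop at slot 1 and miss the item stored after it.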

Interesting case

ma_fill = 6 > (8 * 2 / 3), so dictresize() is called
ma_used = 1

hence minused = 4 * 1 = 4, therefore newsize = 8 (the table keeps its size, but only the Active item is carried over)

Cache

PyDictEntry ma_smalltable[8];

On 32-bit x86 with 64 bytes per cache line, each entry takes 3 * 4 = 12 bytes: 64 / (4 * 3) = 5.333 entries per cache line

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
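The cache-line arithmetic generalizes to other word sizes. The helper below is a hypothetical sketch that assumes each of the three PyDictEntry fields is one machine word and that a cache line is 64 bytes:

```python
def entries_per_cache_line(word_size, cache_line=64, fields=3):
    """How many three-word entries fit in one cache line."""
    return cache_line / float(word_size * fields)


print(round(entries_per_cache_line(4), 3))   # 32-bit build
print(round(entries_per_cache_line(8), 3))   # 64-bit build
```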

Cache locality and collisions See “/Objects/dictnotes.txt”

Source      Access time
L1 Cache    1 ns
L2 Cache    4 ns
RAM         100 ns

Open addressing vs separate chaining

Note that the illustration shows linear probing rather than the pseudo-random probing used in CPython

OrderedDict

from collections import OrderedDict

- Internal dict
- Circular doubly linked list
- “/Lib/collections/__init__.py”
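A short usage sketch of OrderedDict (Python 3 syntax; move_to_end is available from Python 3.2). It shows that the linked list tracks insertion order and that entries can be reordered or popped from either end:

```python
from collections import OrderedDict

od = OrderedDict()
od['one'] = 1
od['two'] = 2
od['three'] = 3
initial_order = list(od)

od.move_to_end('one')              # relink 'one' at the end of the list
reordered = list(od)

first = od.popitem(last=False)     # pop from the front (FIFO order)
print(initial_order, reordered, first)
```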

Present

Dictionary in CPython 3.5

- PEP 412 - Key-Sharing Dictionary
- The DictObject can be in one of two forms: combined table or split table
- Initial size = 4 (split table) or 8 (combined table)
- Maximum dictionary load = (2*n+1)/3
- Growth rate = used*2 + capacity/2
- “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”

typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

Combined table vs split table

Combined table
- For explicit dictionaries (dict() and {})
- ma_values = NULL, dk_refcnt = 1
- Never becomes a split-table dictionary

Split table
- For attribute dictionaries (the __dict__ attribute of an object)
- ma_values != NULL, dk_refcnt >= 1
- Only string (unicode) keys are allowed
- Values are stored in the ma_values array
- When resizing a split dictionary it is converted to a combined table (but if the resizing is a result of storing an instance attribute, and there is only one instance of the class, then the dictionary is re-split immediately)
- Lookup function = lookdict_split


A new kind of slot:
1) Unused
2) Active
3) Dummy
4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)

typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

Split table

Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3, i.e. initially ma_keys->dk_usable = 3

Split table

class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
setattr(a, 'd', 4)  # re-split
print(a.__dict__.__sizeof__())  # 168

print({}.__sizeof__())  # 264

Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3
Growth rate = used*2 + capacity/2 = 3*2 + 4/2 = 8, hence minused = 8, therefore newsize = 16 (see dictresize)
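The CPython 3.5 sizing formulas above can be replayed in Python. The function names are hypothetical and the sketch covers only the arithmetic; it assumes the doubling loop in dictresize starts from the combined-table minimum size of 8, as listed earlier:

```python
MINSIZE_COMBINED = 8   # initial size of a combined table


def usable_fraction(n):
    """Maximum dictionary load for a table with n slots."""
    return (2 * n + 1) // 3


def growth_rate(used, capacity):
    """Minimum number of slots requested on resize."""
    return used * 2 + capacity // 2


def new_size(minused):
    """Smallest power-of-two table size strictly greater than minused."""
    newsize = MINSIZE_COMBINED
    while newsize <= minused:
        newsize <<= 1
    return newsize


minused = growth_rate(used=3, capacity=4)
print(usable_fraction(4), minused, new_size(minused))
```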

class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
b = A()
setattr(a, 'd', 4)  # no re-split because of b
print(a.__dict__.__sizeof__())  # 456

Split table

Split table is converted to a combined table

Key differences between this implementation and CPython 2.x:
- The table can be split into two parts – the keys and the values
- A new kind of slot
- No more ma_smalltable embedded in the dict
- General dictionaries are slightly larger
- All object dictionaries of a single class can share a single key-table, saving about 60% memory for such cases (according to https://github.com/python/cpython/blob/3.5/Objects/dictnotes.txt)

Bugs still happen: unbounded memory growth resizing split-table dicts (https://bugs.python.org/issue28147)

Summary

Hash functions in CPython 3.5

SipHash for strings and bytes (>= CPython 3.4)
- PEP 456 – Secure and interchangeable hash algorithm
- Resistant against hash flooding DoS attacks
- Successfully used in many other languages

Slightly modified hash function for float
hash(float("+inf")) == 314159
hash(float("-inf")) == -314159 (was -271828)
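The hash configuration of a running interpreter can be inspected through sys.hash_info (Python 3.4 and later). A small sketch; the algorithm name printed depends on the interpreter version and build, so no specific value is assumed here:

```python
import sys

info = sys.hash_info
print(info.algorithm)            # name of the str/bytes hash algorithm
print(info.inf)                  # hash value used for positive infinity
print(hash(float('inf')), hash(float('-inf')))
```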

OrderedDict in CPython 3.5

- Doubly-linked-list
- od_fast_nodes hash table that mirrors the od_dict table
- “/Include/odictobject.h”, “/Objects/odictobject.c”

Alternative versions

Dictionary in PyPy

- Starting from PyPy 2.5.0 – ordereddict is used by default
- Initial size = 16
- Load factor up to 2/3
- Growth rate = 4 (up to 30000 items) or 2
- If a lot of items are deleted the compaction is performed
- “/rpython/rtyper/lltypesystem/rordereddict.py”

struct dicttable {
    int num_live_items;
    int num_ever_used_items;
    int resize_counter;
    variable_int *indexes; // byte, short, int, long
    dictentry *entries;
    ...
}

struct dictentry {
    PyObject *key;
    PyObject *value;
    long hash;
    bool valid;
}

Dictionary in PyPy

struct dicttable {
    variable_int *indexes;
    dictentry *entries;
    ...
}

FREE = 0
DELETED = 1
VALID_OFFSET = 2

PyDictionary in Jython

- Based on ConcurrentHashMap
- Separate chaining collision resolution
- Initial size = 16, load factor = 0.75, growth rate = 2
- Segments and thread safety

PythonDictionary in IronPython

- Based on Dictionary (.NET)
- Separate chaining collision resolution
- Initial size = 0, load factor = 1.0
- Rehashing if the number of collisions >= 100
- Growth rate = 2 (the new size is equal to the next higher prime number) from a set of primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, …, 4999559, 5999471, 7199369}

Future

Raymond Hettinger is happy


typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;       /* number of items in the dictionary */
    uint64_t ma_version_tag;  /* unique, changes when dict modified */
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

- ma_version_tag is added (PEP 509 – Add a private version to dict)
- Initial size = 8 (for split table too)
- Maximum dictionary load = (2*n)/3
- Contributed by INADA Naoki in https://bugs.python.org/issue27350

Four kinds of slots in the table:
1) Unused (index == DKIX_EMPTY == -1)
2) Active (index >= 0, me_key != NULL and me_value != NULL)
3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
4) Pending (index >= 0, me_key != NULL and me_value == NULL, only for split table)


- Added dk_nentries and dk_indices

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;          /* Size of the hash table (dk_indices) */
    dict_lookup_func dk_lookup;  /* Function to lookup in dk_indices */
    Py_ssize_t dk_usable;        /* Number of usable entries in dk_entries */
    Py_ssize_t dk_nentries;      /* Number of used entries in dk_entries */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;
    PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
};
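The idea behind dk_indices and dk_entries can be sketched in a few lines of Python. CompactDict is a hypothetical toy, not CPython's implementation: it uses linear probing for brevity, has no deletion and never resizes. The sparse indices list holds positions into a dense entries list, which is why iteration follows insertion order.

```python
DKIX_EMPTY = -1


class CompactDict(object):
    def __init__(self, size=8):
        self.indices = [DKIX_EMPTY] * size   # sparse hash table of positions
        self.entries = []                    # dense (hash, key, value) list
        self.usable = (2 * size) // 3        # maximum load, as in (2*n)/3

    def _find(self, key, h):
        """Return (slot in indices, position in entries or None)."""
        mask = len(self.indices) - 1
        i = h & mask
        while True:
            ix = self.indices[i]
            if ix == DKIX_EMPTY:
                return i, None
            entry = self.entries[ix]
            if entry[0] == h and entry[1] == key:
                return i, ix
            i = (i + 1) & mask               # linear probing (assumption)

    def __setitem__(self, key, value):
        h = hash(key)
        slot, ix = self._find(key, h)
        if ix is not None:
            self.entries[ix] = (h, key, value)   # update in place
            return
        if len(self.entries) >= self.usable:
            raise RuntimeError('toy table is full; resizing is not sketched')
        self.indices[slot] = len(self.entries)
        self.entries.append((h, key, value))

    def __getitem__(self, key):
        _, ix = self._find(key, hash(key))
        if ix is None:
            raise KeyError(key)
        return self.entries[ix][2]

    def keys(self):
        return [entry[1] for entry in self.entries]


d = CompactDict()
d['one'] = 1
d['two'] = 2
d['three'] = 3
d['one'] = 10          # updating a key keeps its original position
print(d.keys(), d['one'])
```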

Dictionary in CPython 3.6 (Combined table)

Key differences between this implementation and CPython 3.5:
- Compact and ordered
- Added dk_indices with a type depending on the size of the dictionary
- Added ma_version_tag (PEP 509)
- Initial size for split table is changed to 8
- Maximum dictionary load changed to (2*n)/3
- Deleting an item converts the dict to a combined table
- Preserving the order of **kwargs in a function (PEP 468) is implemented
- Preserving Class Attribute Definition Order (PEP 520) is implemented
- The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5 (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes)
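The two ordering guarantees (PEP 468 and PEP 520) can be observed directly on Python 3.6 or later. A small sketch; keyword_order and Point are hypothetical names used only for this illustration:

```python
def keyword_order(**kwargs):
    """Return keyword names in the order they were passed (PEP 468)."""
    return list(kwargs)


class Point(object):
    x = 0
    y = 0

    def norm(self):
        return 0


# Class attributes appear in definition order (PEP 520); skip dunder names.
attribute_order = [name for name in Point.__dict__
                   if not name.startswith('__')]

print(keyword_order(b=1, a=2, c=3))
print(attribute_order)
```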

Summary

References
1. The implementation of a dictionary in Python 2.7 https://habrahabr.ru/post/247843/
2. Python hash calculation algorithms http://delimitry.blogspot.com/2014/07/python-hash-calculation-algorithms.html
3. PEP 412 - Key-Sharing Dictionary https://www.python.org/dev/peps/pep-0412/
4. PEP 456 - Secure and interchangeable hash algorithm https://www.python.org/dev/peps/pep-0456/
5. Mirror of the CPython repository https://github.com/python/cpython/
6. Faster, more memory efficient and more ordered dictionaries on PyPy https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html
7. PyDictionary (Jython API documentation) http://www.jython.org/javadoc/org/python/core/PyDictionary.html
8. Jython repository https://bitbucket.org/jython/jython
9. Java theory and practice: Building a better HashMap http://www.ibm.com/developerworks/library/j-jtp08223/
10. Back to basics: Dictionary part 2, .NET implementation https://blog.markvincze.com/back-to-basics-dictionary-part-2-net-implementation/
11. http://referencesource.microsoft.com/mscorlib/system/collections/generic/dictionary.cs.html
12. https://github.com/IronLanguages/main/blob/ipy-2.7-maint/Languages/IronPython/IronPython/
13. https://bitbucket.org/pypy/pypy/
14. https://twitter.com/raymondh
15. PEP 509 - Add a private version to dict https://www.python.org/dev/peps/pep-0509/
16. Compact and ordered dict http://bugs.python.org/issue27350
17. What’s New In Python 3.6 https://docs.python.org/3.6/whatsnew/3.6.html
18. PEP 468 - Preserving the order of **kwargs in a function https://www.python.org/dev/peps/pep-0468/
19. PEP 520 - Preserving Class Attribute Definition Order https://www.python.org/dev/peps/pep-0520/
20. https://en.wikipedia.org/

Images from:
http://www.rcreptiles.com/blog/index.php/2008/06/28/read_the_operating_manual_first
http://kiwigamer450.deviantart.com/art/Back-to-The-Past-Logo-567858767
http://beyondplm.com/wp-content/uploads/2014/04/time-paradox-past-future-present.jpg
http://itband.ru/wp-content/uploads/2014/10/Future.jpg
https://en.wikipedia.org/wiki/Hash_table

Q & A

@delimitry

spbpython.guru

SPb Python Interest Group

Additional slides

Separate chaining collision resolution

Open addressing collision resolution (pseudo-random probing)
