A deep dive into Clojure's data structures - EuroClojure 2015
What Lies Beneath
Mohit Thatte
EUROCLOJURE 2015, Barcelona
A Deep Dive into Clojure’s data structures
@mohitthatte @pastafari
A DAY IN THE LIFE
Image: User:Joonspoon Wikimedia Commons
TOWERS OF ABSTRACTION
Programs that use Maps
Map API
Map Implementation
Primitives (JVM, et al)
“Any sufficiently advanced data structure is indistinguishable from magic”
- Me
With apologies to Arthur Clarke
IMMUTABILITY IS GOOD
PERFORMANCE IS NECESSARY
By U.S. Navy photo [Public domain], via Wikimedia Commons
IMMUTABILITY
PERF
Image: Maj. Gen. William Anders, Apollo 8
“… functional programming’s stricture against destructive updates (assignments)
is a staggering handicap, tantamount to confiscating a master chef’s knives.”
- Chris Okasaki
ABSTRACT DATA TYPE
NAME: QUEUE
INTERFACE    INVARIANTS
enqueue      add an element to the end
head         first element
tail         remaining elements
THE CHALLENGE
Correct
Performant
Immutable
CHALLENGE ACCEPTED
KEY IDEAS
Structural Sharing
Structural Bootstrapping
Hybrid Structures
STRUCTURAL SHARING
:a :b :c :d :e
(assoc v 2 :zz)
:a :b :zz
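STRUCTURAL SHARING
(assoc v 4 :zz)
[Diagram: a tree of keyword nodes before and after the update; the new version holds :zz, and the nodes :d and :f appear in both versions]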
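A minimal sketch of the idea, assuming a plain binary tree instead of Clojure's 32-way nodes. The names PathCopyDemo, build, get and assoc are made up for illustration, and the values past :e are filler. An update copies only the nodes on the path to the changed leaf; everything else is shared between the old and new versions.

```java
final class PathCopyDemo {
    /** Immutable node: a leaf carries a value, a branch carries two children. */
    static final class Node {
        final Object value;      // set only on leaves
        final Node left, right;  // set only on branches
        Node(Object value) { this.value = value; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.value = null; this.left = left; this.right = right; }
    }

    /** Builds a balanced tree over items[lo, hi); hi - lo must be a power of two. */
    static Node build(Object[] items, int lo, int hi) {
        if (hi - lo == 1) return new Node(items[lo]);
        int mid = (lo + hi) >>> 1;
        return new Node(build(items, lo, mid), build(items, mid, hi));
    }

    /** Walks from the root to the leaf at 'index' in a tree holding 'size' leaves. */
    static Object get(Node n, int index, int size) {
        while (size > 1) {
            size >>>= 1;
            if (index < size) { n = n.left; } else { n = n.right; index -= size; }
        }
        return n.value;
    }

    /** Returns a new root; only the nodes on the path to 'index' are copied. */
    static Node assoc(Node n, int index, int size, Object value) {
        if (size == 1) return new Node(value);
        int half = size >>> 1;
        return index < half
            ? new Node(assoc(n.left, index, half, value), n.right)
            : new Node(n.left, assoc(n.right, index - half, half, value));
    }

    public static void main(String[] args) {
        Object[] items = {":a", ":b", ":c", ":d", ":e", ":f", ":g", ":h"};
        Node v1 = build(items, 0, 8);
        Node v2 = assoc(v1, 4, 8, ":zz");
        System.out.println(get(v1, 4, 8));       // the old version is unchanged
        System.out.println(get(v2, 4, 8));       // the new version sees the update
        System.out.println(v1.left == v2.left);  // the untouched half is shared, not copied
    }
}
```

For 8 leaves the update allocates 4 new nodes (one per level) and reuses the rest.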
Image: Alan Levine
STRUCTURAL DECOMPOSITION
Image: Alan Chia (Lego Color Bricks)
HYBRID STRUCTURES
LET'S DIVE IN!
CLOJURE DATA STRUCTURES
'(1 2 3)            Lists: Code manipulation
[1 2 3]             Vectors: All things sequential
{:a 1 :b 2}         Maps: Structured Data
#{\a \e \i \o \u}   Sets: Ermm, Sets
MAPS
THE MAP INTERFACE
GET: get value for given key
ASSOC: add key, value to map
DISSOC: remove key, value from map
MERGE: merge two maps together
WHAT MAKES A GOOD MAP?
Constant time operations independent of number of keys
Efficient space utilization even with mutation
Objects as keys, Objects as values
IDEAS
IDEA #1
ARRAYS
KEY VALUE PAIRS
:a 1 :b 2 :c 3
NOT A GREAT MAP!
Time complexity O(n)
Space efficiency NO
Objects as keys YES
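To make the O(n) cost concrete, here is a rough sketch of a map stored as a flat array of alternating keys and values. ArrayMapDemo and its methods are hypothetical names, not Clojure's API. Lookup is a linear scan, and an immutable update copies the whole array.

```java
import java.util.Arrays;

final class ArrayMapDemo {
    /** Looks up key in a flat [k0, v0, k1, v1, ...] array; returns null when absent. */
    static Object get(Object[] kvs, Object key) {
        for (int i = 0; i < kvs.length; i += 2) {
            if (kvs[i].equals(key)) return kvs[i + 1];
        }
        return null;
    }

    /** Returns a new array with key set to value; the input array is never modified. */
    static Object[] assoc(Object[] kvs, Object key, Object value) {
        for (int i = 0; i < kvs.length; i += 2) {
            if (kvs[i].equals(key)) {
                Object[] copy = kvs.clone();   // full copy just to change one value
                copy[i + 1] = value;
                return copy;
            }
        }
        Object[] grown = Arrays.copyOf(kvs, kvs.length + 2);
        grown[kvs.length] = key;
        grown[kvs.length + 1] = value;
        return grown;
    }

    public static void main(String[] args) {
        Object[] m = {":a", 1, ":b", 2, ":c", 3};
        Object[] m2 = assoc(m, ":d", 4);
        System.out.println(get(m, ":b"));
        System.out.println(get(m, ":d"));    // absent in the old version
        System.out.println(get(m2, ":d"));
    }
}
```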
HOW DO WE DO BETTER?
Image: www.pooktre.com
TREES TO THE RESCUE
Ramon Llull, Catalunya c. 1250
Arbol de ciencia
IDEA #2
BINARY SEARCH TREE
[Diagram: binary search tree with integer keys and letter values, rooted at 13; a second frame keeps only the nodes 13, 17, 25 and 27]
NOT A GREAT MAP!
Time complexity worst case O(n)
Space efficiency POSSIBLY
Objects as keys YES
How do we keep our trees in ‘balance’?
IDEA #3
BALANCED BINARY SEARCH TREES
RED BLACK TREES
ALWAYS BALANCED, 100% MONEY BACK GUARANTEE
Guibas, Sedgewick 1978
RED BLACK TREES
Root is black
Every path from root to an empty node contains the same number of black nodes
Every node is colored red or black
No red node can have a red child
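The four rules above can be checked mechanically. The sketch below is only a validator under assumed names (RedBlackCheck, Node, isValid); it does not insert or rebalance, and it does not check key ordering. Empty nodes are represented by null.

```java
final class RedBlackCheck {
    /** Immutable tree node; every node is either red or black. */
    static final class Node {
        final boolean red;
        final int key;
        final Node left, right;
        Node(boolean red, Node left, int key, Node right) {
            this.red = red; this.left = left; this.key = key; this.right = right;
        }
    }

    static boolean isRed(Node n) { return n != null && n.red; }

    /**
     * Counts the black nodes on every path from n down to an empty node,
     * including the empty node itself. Returns -1 if a red node has a red
     * child or if two paths disagree on the count.
     */
    static int blackHeight(Node n) {
        if (n == null) return 1;
        if (n.red && (isRed(n.left) || isRed(n.right))) return -1;
        int l = blackHeight(n.left);
        int r = blackHeight(n.right);
        if (l < 0 || r < 0 || l != r) return -1;
        return l + (n.red ? 0 : 1);
    }

    /** The root must be black and the colour rules must hold everywhere below it. */
    static boolean isValid(Node root) {
        return !isRed(root) && blackHeight(root) > 0;
    }

    public static void main(String[] args) {
        Node ok = new Node(false, new Node(true, null, 1, null), 2, new Node(true, null, 3, null));
        Node redRed = new Node(false, new Node(true, new Node(true, null, 0, null), 1, null), 2, null);
        System.out.println(isValid(ok));      // true
        System.out.println(isValid(redRed));  // false: a red node has a red child
    }
}
```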
RED BLACK TREES
Okasaki ‘96
A PRETTY GOOD MAP!
Time complexity O(log₂ N)
Space efficiency YES
Objects as keys YES
Clojure’s sorted-maps are Red Black Trees
CONSTRAINTS
KEYS MUST BE COMPARABLE
KEYS ARE COMPARED AT EVERY NODE, THIS CAN BE EXPENSIVE
IDEA #4
TRIE - SEARCH BY DIGIT
[Diagram: trie over letters, LEVEL 0 to LEVEL 2]
next(node, symbol)
FINITE STATE MACHINE
Symbols #{a..z}
Nodes, Edges
TRIE IMPLEMENTATIONS
Associate each symbol with an offset, e.g. a=0, b=1, …
LOOKUP TABLES
next = lookup(node, offset)
Fast and space efficient trie searches, Bagwell 2000
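A small sketch of the lookup-table approach, assuming lowercase a..z symbols and a pre-allocated 26-slot table per node. It is mutable for brevity, unlike Clojure's persistent structures, and the names (LookupTableTrie, add, contains, nullSlots) are illustrative only. Counting the unused slots shows where the space goes.

```java
final class LookupTableTrie {
    static final int SYMBOLS = 26;  // a..z

    static final class Node {
        final Node[] next = new Node[SYMBOLS];  // one slot per symbol, mostly null
        boolean terminal;                       // true when a stored word ends here
    }

    /** Maps a symbol to its table offset: a=0, b=1, and so on. */
    static int offset(char symbol) { return symbol - 'a'; }

    static void add(Node root, String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            int i = offset(c);
            if (node.next[i] == null) node.next[i] = new Node();
            node = node.next[i];
        }
        node.terminal = true;
    }

    static boolean contains(Node root, String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next[offset(c)];  // next = lookup(node, offset)
            if (node == null) return false;
        }
        return node.terminal;
    }

    /** Counts the empty table slots in the whole trie. */
    static int nullSlots(Node node) {
        int count = 0;
        for (Node child : node.next) {
            count += (child == null) ? 1 : nullSlots(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Node root = new Node();
        add(root, "tap");
        add(root, "tar");
        add(root, "add");
        System.out.println(contains(root, "tap"));
        System.out.println(contains(root, "ta"));   // a prefix only, not a stored word
        System.out.println(nullSlots(root));        // empty slots for just three words
    }
}
```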
ADD
NOT A GREAT MAP!
Time complexity O(logₘ N)
Space efficiency NO
Objects as keys NO
How do we avoid null nodes?
IDEA #4
BST + TRIE = TST
Bentley, Sedgewick 1998
Fast and space efficient trie searches, Bagwell 2000
ADD
A DECENT MAP
Time complexity ~O(log₂ N)
Space efficiency YES
Objects as keys NO
No null nodes, but can we do better than log₂ N?
CHALLENGE ACCEPTED
IDEA #5
Array Mapped Trie
Fast and space efficient trie searches, Bagwell 2000
Use bitmaps to determine presence or absence of symbol
Let's say we have 16 symbols, 0…15
USING BITMAPS
15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
 0  1  0  0  0  1  0  0  1  1  1  0  0  0  0  0
Does the symbol with offset 6 exist?
mask = 1 << offset
 0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
bitmap & mask (bitwise AND with a mask)
There’s an array alongside that only contains entries
for the 1’s. NOT pre-allocated.
What offset in the dynamic array should I look at?
0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0
0 1 2 3 4
MapEntry | MapEntry | SubTrie Pointer | MapEntry | MapEntry
USING BITMAPS
15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
 0  1  0  0  0  1  0  0  1  1  1  0  0  0  0  0
Where in the array is the entry for ‘6’?
Count tally marks to the ‘right’ of offset
How do I create a mask to do that?
mask = (1 << 6) - 1
 0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1
Integer.bitCount(bitmap & mask)
 0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
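Both bitmap questions fit in a few lines. This is a sketch using the 16-symbol bitmap from the slides, written as a Java binary literal; BitmapDemo, contains and index are names chosen here for illustration.

```java
final class BitmapDemo {
    /** Does the symbol with this offset have an entry? */
    static boolean contains(int bitmap, int offset) {
        int mask = 1 << offset;
        return (bitmap & mask) != 0;
    }

    /** Where is that entry in the dense array? Count the 1 bits to the right of offset. */
    static int index(int bitmap, int offset) {
        int mask = (1 << offset) - 1;
        return Integer.bitCount(bitmap & mask);
    }

    public static void main(String[] args) {
        int bitmap = 0b0100_0100_1110_0000;  // symbols 14, 10, 7, 6 and 5 are present
        System.out.println(contains(bitmap, 6));  // true
        System.out.println(contains(bitmap, 4));  // false
        System.out.println(index(bitmap, 6));     // 1: only symbol 5 sits to the right
    }
}
```

Symbol 7 lands at index 2, which is the SubTrie Pointer slot in the array above.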
What happens if I insert a new map entry?
0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0
0 1 2 3 4
MapEntry MapEntry MapEntry MapEntry MapEntry
0 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0
0 1 2 3 4 5
MapEntry | MapEntry | SubTrie Pointer | MapEntry | MapEntry | MapEntry
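One way to picture the insert, as a sketch rather than Clojure's actual node code: set the new bit, compute the entry's position with bitCount, and copy the old array into one that is a slot longer. BitmapNodeDemo and with are hypothetical names, and the string entries stand in for MapEntry and SubTrie values.

```java
final class BitmapNodeDemo {
    final int bitmap;        // one bit per symbol that is present
    final Object[] entries;  // one slot per 1 bit, ordered by offset

    BitmapNodeDemo(int bitmap, Object[] entries) {
        this.bitmap = bitmap;
        this.entries = entries;
    }

    /** Array position for a symbol: the number of 1 bits to the right of its offset. */
    int index(int offset) {
        return Integer.bitCount(bitmap & ((1 << offset) - 1));
    }

    /** Returns a new node with 'entry' stored at 'offset'; this node is left untouched. */
    BitmapNodeDemo with(int offset, Object entry) {
        int bit = 1 << offset;
        int idx = index(offset);
        if ((bitmap & bit) != 0) {
            Object[] copy = entries.clone();  // symbol already present: replace in a copy
            copy[idx] = entry;
            return new BitmapNodeDemo(bitmap, copy);
        }
        Object[] grown = new Object[entries.length + 1];
        System.arraycopy(entries, 0, grown, 0, idx);
        grown[idx] = entry;
        System.arraycopy(entries, idx, grown, idx + 1, entries.length - idx);
        return new BitmapNodeDemo(bitmap | bit, grown);
    }

    public static void main(String[] args) {
        BitmapNodeDemo before = new BitmapNodeDemo(
            0b0100_0100_1110_0000, new Object[]{"e5", "e6", "sub7", "e10", "e14"});
        BitmapNodeDemo after = before.with(13, "e13");
        System.out.println(Integer.toBinaryString(after.bitmap));
        System.out.println(String.join(" ", java.util.Arrays.copyOf(after.entries, after.entries.length, String[].class)));
    }
}
```

The new entry for symbol 13 goes to index 4, and the entry for symbol 14 moves from index 4 to index 5.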
A DECENT MAP
Time complexity O(logₘ N)
Space efficiency YES
Objects as keys NO
How do we support arbitrary Objects as keys?
IDEA #6
Hashing + AMT
Ideal hash trees, Bagwell 2001
STEP 1
Use a good hash function to generate an integer key.
hasheq
0010 1101 1011 1110 1100 1111 1111 1001
STEP 2
Divide the 32 bit integer into ‘symbols’ 5 bits at a time.
[Diagram: the hash bits split into 5-bit groups, each labelled with its decimal value]
Use the ‘symbols’ to walk down an AMT
Why 5 bits?
t bits per symbol give 2^t symbols, so 5 bits give 32
Why 5 bits?
BIT JUGGLING!
Compute ‘symbols’ by shifting and masking
[Diagram: the hash bits ANDed with the mask 00 00000 00000 00000 00000 00000 11111]
How to calculate nth digit?
Shift by 5*n and mask with 0x1f
(hash >>> shift) & 0x01f
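As a worked example, the sketch below extracts every 5-bit digit of the hash shown in STEP 1, written here as the hex literal 0x2DBECFF9. HashDigits and digit are illustrative names; the shift-and-mask expression is the one from the slide.

```java
final class HashDigits {
    static final int BITS = 5;
    static final int MASK = 0x1f;

    /** The nth 5-bit digit of the hash, counting from the low-order end. */
    static int digit(int hash, int n) {
        int shift = BITS * n;
        return (hash >>> shift) & MASK;
    }

    public static void main(String[] args) {
        int hash = 0x2DBECFF9;  // 0010 1101 1011 1110 1100 1111 1111 1001
        for (int n = 0; n <= 6; n++) {
            System.out.println("digit " + n + " (shift " + (BITS * n) + ") = " + digit(hash, n));
        }
        // prints the digits 25, 31, 19, 29, 27, 22, 0 for n = 0..6
    }
}
```

Seven digits cover the 32 bits; the last one holds only the 2 leftover high bits.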
BEST COMMENT EVER.
A persistent rendition of Phil Bagwell's Hash Array Mapped Trie
Uses path copying for persistence
HashCollision leaves vs. extended hashing
Node polymorphism vs. conditionals
No sub-tree pools or root-resizing
Any errors are my own
PersistentHashMap.java:19
Hickey R., Grand C., Emerick C., Miller A., Fingerhut A.
NODE POLYMORPHISM
ArrayNode - 32 wide pointers to sub-tries
BitmapIndexedNode - bitmap + dynamic array
HashCollisionNode - array for things that collide
EXAMPLE
(let [h (zipmap (range 1e6) (range 1e6))] (get h 123456))
[Diagram: the hash of 123456 split into 5-bit symbols]
ArrayNode, shift = 0
ArrayNode, shift = 5
ArrayNode, shift = 10
BitmapIndexedNode, shift = 15
… and then follow the AMT down
A GOOD MAP
Time complexity O(log₃₂ N)
Space efficiency YES
Objects as keys YES
Key compared only once
Bit juggling for great performance!
HAMT
~6 hops to a leaf node
REGULAR HASH TABLE?
NEED ROOT RESIZING
NOT AMENABLE TO STRUCTURAL SHARING
UPDATES?
Search for the key, clone leaf nodes and path to root
VECTORS
INTUITION
ArrayNodes all the way. Break ‘index’ into digits and walk down levels.
(let [arr (vec (range 1e6))] (nth arr 123456))
123456 = 00011 11000 10010 00000 in 5-bit digits (leading zeros omitted)
digits: 3 24 18 0
ArrayNode, shift = 15
ArrayNode, shift = 10
ArrayNode, shift = 5
ArrayNode, shift = 0
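The walk can be sketched with plain 32-slot arrays standing in for ArrayNodes. This is a simplification: there is no tail, and VectorTrieDemo, build and nth are names made up for this example. For a million elements the root sits at shift 15, as on the slide.

```java
final class VectorTrieDemo {
    static final int BITS = 5;
    static final int WIDTH = 1 << BITS;  // 32 slots per node
    static final int MASK = WIDTH - 1;   // 0x1f

    /** Builds the node at 'shift' whose range begins at 'start', storing each index as its own value, up to n. */
    static Object[] build(int start, int shift, int n) {
        Object[] node = new Object[WIDTH];
        for (int i = 0; i < WIDTH; i++) {
            int childStart = start + (i << shift);
            if (childStart >= n) break;
            if (shift == 0) {
                node[i] = childStart;  // leaf level stores the element itself
            } else {
                node[i] = build(childStart, shift - BITS, n);
            }
        }
        return node;
    }

    /** Breaks 'index' into 5-bit digits and follows one digit per level. */
    static Object nth(Object[] root, int rootShift, int index) {
        Object[] node = root;
        for (int shift = rootShift; shift > 0; shift -= BITS) {
            node = (Object[]) node[(index >>> shift) & MASK];
        }
        return node[index & MASK];
    }

    public static void main(String[] args) {
        Object[] root = build(0, 15, 1_000_000);
        // 123456 follows slots 3, 24, 18 and finally 0
        System.out.println(nth(root, 15, 123456));
    }
}
```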
THE TAIL OPTIMIZATION
PersistentVector
count shift root tail
RIGHT TOOL FOR THE JOB
By Schnobby (Own work) [CC BY-SA 3.0], via Wikimedia Commons
HashMaps do not merge efficiently
MAP CATENATION
data.int-map
Okasaki & Gill’s “Fast Mergeable Integer Maps”
Zach Tellman
Vectors do not concat efficiently
Vectors do not subvec efficiently
VECTOR CATENATION
Based on Bagwell and Rompf, “RRB-Trees: Efficient Immutable Vectors”
logarithmic catenation and slicing
Michal Marczyk
core.rrb-vector
TODO: benchmarks
CTRIES
Michał Marczyk
Tomorrow at 0850
1959 de la Briandais, Fredkin Trie
1960 Windley, Booth, Colin, Hibbard Binary Search Trees
1962 Adelson-Velsky, Landis AVL Trees
1978 Guibas, Sedgewick Red Black Trees
1985 Sleator, Tarjan Splay Trees
1996 Okasaki Purely Functional Data Structures
1998 Sedgewick Ternary Search Trees
2000 Phil Bagwell AMT
2001 Phil Bagwell HAMT
2007 Rich Hickey Clojure!
Reading List
Ideal Hash Trees, Bagwell 2001
Fast and Space Efficient Trie Searches, Bagwell 2000
Fast Mergeable Integer Maps, Okasaki & Gill, 1998
The World's Fastest Scrabble Program, Appel & Jacobson, 1988
File searching using variable length keys, de la Briandais, 1959
Purely Functional Data Structures, Okasaki 1996
Polymatheia: Jean Niklas L’Orange
QUESTIONS?
Ask Michal or Zach or Jean Niklas :)
THANK YOU