Persistent Data Structures by @aradzie

Persistent Data Structures

Living in a world where nothing changes but everything evolves

- or - A complete idiot's guide to immutability

Java vs Haskell

Java:
● Warm, soft and cute
● Imperative
● Object oriented
● Just like good old Basic, but with classes

Haskell:
● Strange, unfamiliar alien
● Purely functional
● Everything is different
● Shocking news! It's not like Basic!

Haskell does not have variables!

Imagine a dialect of Java where everything is final by default:

class LinkedList {
  class Node {
    final Node next, prev;
    final Object value;
  }

  final Node head, tail;

  void add(final Object v) {
    for (final Node n = head; n != null; n = n.next) { ... }
  }
}

All fields, parameters and variables are automatically immutable, the final is implied everywhere, and there is no way to get rid of it

But it doesn't make sense!

It won't work!

It does for me!

What is a variable?

var·y /ˈve(ə)rē/ (vary, varied, varying)

● verb (used with object). Definition: to change or alter, as in form, appearance, character, or substance

● verb (used without object). Definition: to undergo change in appearance, form, substance, character, etc.

● synonyms: modify, mutate

"Variables" in Haskell

● Must be assigned when declared

  YES: int a = 1;     NO: int a;

● Cannot be reassigned

  YES: final int a = 1;     NO: a = 2;

These are mathematical variables, not imperative ones!

When everything is immutable

There is no notion of time:

● Functions take old values, produce new values, nothing is changed in-place

● It does not matter when a function was called, it only matters what arguments it was called with

There is no notion of identity:

● Everything is a value, complex data structures are values too

● There is no way to tell if a == b, only if a.equals(b)

● In other words, values are never identical to each other, but may be equal (example below)
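In Java terms, a plain JDK snippet illustrating the distinction (my example, not from the slides):

String a = new String("hello");
String b = new String("hello");
boolean identical = (a == b);  // false: two distinct objects
boolean equal = a.equals(b);   // true: the same value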

I want my linked list!

Basic terminology:

● Ephemeral data structure — everything that is not persistent. Most Java data structures (lists, sets, etc.) are ephemeral.

● Persistent data structure — immutable data structure with history. No in-place modifications. Operations on it create new versions. Older versions are always available. That. Is. Simple.

● The persistence property has nothing to do with persistent storage, like disks! This is a completely different story.

I want my linked list!

● In imperative languages, like Java, most data structures are ephemeral by default. Designing persistent data structures is somewhat awkward and not always efficient.

● In purely functional languages, like Haskell, all data structures are automatically persistent! There is just no other way to make data structures.

History of updates

Making an update to a persistent DS instance always creates a new instance that contains this update.

The current version is left unmodified.

Why should I bother?

Is it fun? Hell yeah!

But is it practical? Let's see!

The free lunch is over!

"The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency." — Herb Sutter

Commodity hardware (my laptop)

The need for writing correct multi-threaded code is constantly increasing

Concurrent data structures are hard!

Want a concurrent ephemeral linked list? Here are some implementation strategies:

● Coarse-grained synchronization
● Fine-grained synchronization
● Optimistic synchronization
● Lazy synchronization

All lock-based — no composition, deadlocks, etc.

● Non-blocking synchronization in different flavors

And if you need the size of the list, you are in trouble!

Concurrent data structures are hard!

● Making mutable concurrent data structures requires inter-thread coordination within these structures

● Locks and atomic references all over the place

● Decades of research by academia with many attempts

● Sophisticated algorithms that are hard to reason about, test and prove

● Several different ways to solve the same problems, each with its own pros and cons

Yes, but are persistent data structures actually simpler?

Just give up mutability!

● Persistent data structures are easy to reason about in a concurrent environment

● The behavior does not depend on how many threads are trying to "modify" it at once

● Therefore persistent data structures are very easy to test and debug

The whole picture

● Persistent data structures alone are not sufficient. They are an essential part of the picture, but not the whole answer to concurrency.

● Inter-thread coordination is needed. Threads still need to know what the other threads are doing to agree on a common outcome.

● But it can be added "outside", which gives us complete separation of concerns (see the sketch below).
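One common shape of that "outside" coordination in Java is a single atomic reference holding the current version; a minimal sketch (my illustration, using java.util.concurrent.atomic and the persistent Stack developed later in this deck):

import java.util.concurrent.atomic.AtomicReference;

// All coordination lives in this one mutable cell;
// the data structure itself stays immutable.
AtomicReference<Stack<Integer>> shared =
    new AtomicReference<>(new Stack<Integer>());

// Read the current version, derive a new version, and publish it
// only if no other thread swapped the reference in the meantime.
Stack<Integer> cur, next;
do {
  cur = shared.get();
  next = cur.push(42);
} while (!shared.compareAndSet(cur, next));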

The whole picture

Solving the concurrency challenge in a modern language:

● Scala Way — Persistent data structures with message passing

● Clojure Way — Persistent data structures with software transactional memory

● Will likely be mixed in the future

Last few words on concurrency

● Persistent data structures are slower than ephemeral ones in sequential use

● But not that much slower!

● We can forgive it, since they give you more functionality, and ephemeral data structures are simply less capable

● And in the multiprocessor era, it is better to make things scalable rather than fast

Efficient persistent data structures

We want persistent data structures to be space and time efficient:

● Structural sharing: we want to reuse as many fragments of the previous version as possible

● Path copying: we want to copy as few pieces as possible

● Maybe, just maybe, lazy evaluation (where available): we don't want nasty pathological cases

A case study

● Let's make some persistent data structures in Java

● All these structures consist of classes with only final fields

● With good amortized asymptotic complexity in most cases

Why are you looking at me?!

Our plan

Let's start with some trivial examples:

● Stack

● Queue

● Tree

Then proceed with more advanced structures:

● Hash Table

● Finger Tree

Trivial Example — Persistent Stack

It's just a singly linked list of nodes:

class Stack<T> {
  final T v;            // (a)
  final Stack<T> next;  // (b)

  Stack() { v = null; next = null; }

  Stack(T v, Stack<T> next) {
    this.v = v;
    this.next = next;
  }
  ...

Source Code 1/2

Trivial Example — Persistent Stack

class Stack<T> {
  ...
  Stack<T> push(T v) {
    return new Stack<T>(v, this);  // (a)
  }

  T peek() {
    if (next == null) throw new NoSuchElementException();
    return v;  // (b)
  }

  Stack<T> pop() {
    if (next == null) throw new NoSuchElementException();
    return next;  // (c)
  }

  // needed by reverse() and Queue below; the empty stack is the
  // sentinel built by Stack(), whose next is null
  boolean isEmpty() { return next == null; }

Source Code 2/2
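A quick usage sketch (my snippet, not from the slides). Every push returns a new version; old versions stay valid and share structure:

Stack<Integer> s0 = new Stack<Integer>();  // empty sentinel
Stack<Integer> s1 = s0.push(1);            // [1]
Stack<Integer> s2 = s1.push(2);            // [2, 1]
Stack<Integer> s3 = s1.push(3);            // [3, 1], branches off s1
// s2 and s3 both point at s1's node; s1 itself is untouched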

Trivial Example — Persistent Stack

Structural sharing in persistent stack

Trivial Example — Persistent Stack

Looks familiar? The versions tree!

Trivial Example — Persistent Stack

Also known as Spaghetti stack or Cactus stack

Persistent Queue

It's just two stacks combined:

● Back stack to enqueue items
● Front stack to dequeue items

When the front stack is empty, reverse the back stack and use it as the new front stack

Persistent Queue

class Queue<T> {
  // back stack - push elements here
  final Stack<T> b;  // (a)
  // front stack - pop elements from here
  final Stack<T> f;  // (b)

  Queue() { b = f = new Stack<T>(); }

  Queue(Stack<T> b, Stack<T> f) {
    this.b = b;
    this.f = f;
  }

  boolean isEmpty() { return f.isEmpty(); }  // (c)
  ...

Source Code 1/3

Persistent Queue

class Queue<T> {
  ...
  static <T> Queue<T> check(Stack<T> b, Stack<T> f) {
    if (f.isEmpty()) return new Queue<T>(f, b.reverse());  // (a)
    else return new Queue<T>(b, f);                        // (b)
  }

  Queue<T> push(T v) {
    return check(b.push(v), f);
  }

  Queue<T> pop() {
    if (isEmpty()) { throw new NoSuchElementException(); }
    return check(b, f.pop());
  }

Source Code 2/3

Persistent Queue

class Queue<T> {
  ...
  T peek() {
    if (isEmpty()) { throw new NoSuchElementException(); }
    return f.peek();
  }

class Stack<T> {
  ...
  Stack<T> reverse() {
    if (isEmpty() || next.isEmpty()) return this;
    Stack<T> r = new Stack<T>();
    for (Stack<T> s = this; !s.isEmpty(); s = s.pop()) {
      r = r.push(s.peek());
    }
    return r;
  }

Source Code 3/3
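Again, a quick usage sketch (my snippet): FIFO behavior out of two LIFO stacks, and old versions survive:

Queue<String> q0 = new Queue<String>();
Queue<String> q1 = q0.push("a").push("b").push("c");
q1.peek();                    // "a": first in, first out
Queue<String> q2 = q1.pop();  // q2's front is now "b"
q1.peek();                    // still "a"! q1 is unchanged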

Persistent Queue

Structural sharing in persistent queue

Persistent Queue

Beware pathological cases!

● What if the front stack is empty, but the back stack is full?

● And we are going to pop from the same queue version N times

● Then we get N back stack reversals!

● Lazy evaluation to the rescue — use lazy streams instead of strict stacks (sketch below)
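In Java, laziness can be approximated with a memoizing thunk; a minimal sketch (my illustration, not from the slides) of the building block a lazy stream cell would use for its tail:

import java.util.function.Supplier;

// Runs the computation at most once, no matter how many
// versions of the queue end up forcing it.
class Lazy<T> {
  private Supplier<T> thunk;
  private T value;

  Lazy(Supplier<T> thunk) { this.thunk = thunk; }

  synchronized T force() {
    if (thunk != null) {
      value = thunk.get();
      thunk = null;  // drop the code, keep the result
    }
    return value;
  }
}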

Persistent Queue

But there is a better way to design a queue!

Monoidally Annotated 2-3 Finger Tree is a versatile data structure that can be used to build efficient lists, deques, priority queues, interval trees, ropes, etc.

It is more complex, so we will take a look at it later.

Persistent Tree

● It is trivial to convert any ephemeral tree to a persistent one by means of path copying

● It works for binary trees, 2-3 trees, B-trees, etc

● The shape of the tree is not affected, only the mutating algorithms

● In a balanced binary tree at most log N nodes need to be copied — quite efficient

● The secret to all persistent data structures is that they all are trees! (Yes, lists and hash tables are trees too)

Persistent Tree

Simple Persistent Binary Tree

class SimpleBinaryTree<K extends Comparable<K>, V> {
  static class Node<K, V> {
    final K key;            // (a)
    final V value;          // (b)
    final Node<K, V> l, r;  // (c)

    Node(K key, V value, Node<K, V> l, Node<K, V> r) {
      this.key = key;
      this.value = value;
      this.l = l;
      this.r = r;
    }
  }
  ...

Source Code 1/2

Simple Persistent Binary Tree

class SimpleBinaryTree<K extends Comparable<K>, V> {
  ...
  static <K extends Comparable<K>, V> Node<K, V> insert(Node<K, V> n, K key, V value) {
    if (n == null) {
      return new Node<K, V>(key, value, null, null);  // (a)
    }
    int cmp = key.compareTo(n.key);  // (b)
    if (cmp < 0) {
      return new Node<K, V>(n.key, n.value,  // (c)
          insert(n.l, key, value), n.r);
    }
    if (cmp > 0) {
      return new Node<K, V>(n.key, n.value,  // (d)
          n.l, insert(n.r, key, value));
    }
    return new Node<K, V>(key, value, n.l, n.r);  // (e)
  }

Source Code 2/2
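The slides show only insert; a lookup follows the same shape (my sketch, under the same assumptions as above):

static <K extends Comparable<K>, V> V find(Node<K, V> n, K key) {
  if (n == null) return null;          // key not present
  int cmp = key.compareTo(n.key);
  if (cmp < 0) return find(n.l, key);  // search left subtree
  if (cmp > 0) return find(n.r, key);  // search right subtree
  return n.value;                      // exact match
}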

Persistent Tree

Multiple definitions of persistence:

● Immutable data structure with history
● Committed to persistent storage

Append-only databases and file systems:

● CouchDB uses an append-only B-Tree
● RethinkDB started as an append-only variant of MySQL
● ZFS, BTRFS implement copy-on-write transactions and snapshots

Nothing is new under the moon!

Persistent Map

interface Map<K, V> {
  // get value for a key, or null if not found
  V get(K key);
  // make key/value association
  Map<K, V> put(K key, V value);
  // remove key/value association
  Map<K, V> remove(K key);
}

Remember, no in-place updates. Mutations create new instances.

Persistent Map

Implementation Strategy

● Persistent red-black tree for ordered keys. Time complexity — O(log n)

● Persistent hash table for hashable keys. Time complexity — effectively O(1)

Persistent Hash Table

But how do we implement it? Copying the whole table would be too expensive!

Persistent Hash Table

Here's the idea: partition the hash table into smaller pieces and organize them as a persistent tree

Nice idea, but how do we navigate in such a tree?

Prefix Tree/Trie

Hash code is just a string of digits!

Search is guided by individual letters of a string key

Persistent Hash Table in Prefix Tree

Represent 32-bit hash codes as strings of 5-bit symbols:

hashCode = 0xCAFEBABE

level  |  6 |     5 |     4 |     3 |     2 |     1 |     0
bits   | 11 | 00101 | 01111 | 11101 | 01110 | 10101 | 11110
symbol |  3 |     5 |    15 |    29 |    14 |    21 |    30
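In code, the symbol at a given level is one shift and one mask; a tiny helper (my sketch, matching the find code shown later):

int symbol(int hashCode, int level) {
  // level 0 sits in the least significant 5 bits
  return (hashCode >>> (level * 5)) & 31;
}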

Persistent Hash Table

hashCode = ... xxxxx xxxxx xxxxx xxxxx

Each item is either a key/value pair or a subtree

Persistent Hash Table

class PersistentHashMap<K, V> {
  static abstract class Item<K, V> {}

  static class Node<K, V> extends Item<K, V> {
    final Item<K, V>[] children = new Item[32];  // (a)
  }

  static class Entry<K, V> extends Item<K, V> {
    final int hashCode;      // (b)
    final K key;             // (c)
    final V value;           // (d)
    final Entry<K, V> next;  // (e) chain of entries with colliding hashes
  }

Source Code 1/2

Persistent Hash Table

class PersistentHashMap<K, V> {
  V get(K key) {
    return root.find(key.hashCode(), key, 0);  // (a)
  }

  static class Node<K, V> extends Item<K, V> {
    V find(int hashCode, K key, int level) {
      int index = (hashCode >>> (level * 5)) & 31;  // (b)
      Item<K, V> item = children[index];            // (c)
      if (item instanceof Node) {                   // (d)
        return ((Node<K, V>) item)                  // (e)
            .find(hashCode, key, level + 1);
      }
      if (item instanceof Entry) {                  // (f)
        return ((Entry<K, V>) item)                 // (g)
            .find(hashCode, key);
      }
      return null;
    }
  }

Source Code 2/2

Persistent Hash Table

Do not waste space!

class PersistentHashMap<K, V> {
  static class Node<K, V> {
    final Item<K, V>[] children = new Item[32];  // (a)
  }
}

● Most of the children would be null on deeper levels

● The number of arrays grows exponentially as we go deeper

● Need to find a way to compact the tree

● Simply get rid of nulls in arrays!

Persistent Hash Table

● Mask is a 32-bit integer whose bits are set to 1 only for those array elements that are not null

● Array stores only non-null elements. Its size is the number of 1 bits in the mask. Array size varies from 2 to 32 elements.

● Overhead for a null array element is just one bit. Quite good!

class Node<K, V> {
  final int mask;               // (a)
  final Item<K, V>[] children;  // (b) array length == bitCount(mask)
}

Persistent Hash Table

● To test whether the array has an element at index i, simply test if the ith bit in the mask is 1:

  if ((mask & (1 << i)) != 0) { ...

● To get the offset of the ith element in the array, count the number of 1 bits lower than i in the mask:

  int offset = bitCount(mask & ((1 << i) - 1));
  if (children[offset] instanceof ...
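Putting both tricks together, a lookup over the compressed node might look like this (my sketch, assuming the Node and Entry classes from earlier and Integer.bitCount):

V find(int hashCode, K key, int level) {
  int i = (hashCode >>> (level * 5)) & 31;  // 5-bit symbol at this level
  if ((mask & (1 << i)) == 0) {
    return null;                            // no child at this symbol
  }
  // position in the compacted array: count the 1-bits below i
  int offset = Integer.bitCount(mask & ((1 << i) - 1));
  Item<K, V> item = children[offset];
  if (item instanceof Node) {
    return ((Node<K, V>) item).find(hashCode, key, level + 1);
  }
  return ((Entry<K, V>) item).find(hashCode, key);
}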

Persistent List

interface Seq<T> {
  T head();                    // get first element
  Seq<T> tail();               // get list without first element
  Seq<T> cons(T v);            // prepend element at head
  Seq<T> snoc(T v);            // append element at tail
  Seq<T> concat(Seq<T> that);  // join two lists
  int size();                  // get number of elements
  T get(int index);            // get Nth element
  Seq<T> set(int index, T v);  // set Nth element
}

Remember, no in-place updates. Mutations create new instances.

Persistent List

● There are quite a few ways to implement persistent lists

● But we will not be studying them

● Instead, we will turn our attention to finger trees

● Soon, it will be clear why

Finger Trees

● An incredibly elegant, simple and efficient data structure

● Oh so very versatile, functional programmer's Swiss Army knife

● Basic data structure for building random access sequences, deques, priority queues, ropes, interval trees, etc.

● Let's define it in stages

Persistent leafy 2-3 trees

Let's begin with a simple data structure — leafy 2-3 tree

● Every intermediate node has either two or three children

● All values are stored in the leaves

● Perfectly balanced — all leaves are at the same level

Persistent leafy 2-3 trees

Persistent leafy 2-3 trees

Leaves contain interesting values, but what is stored in the nodes?

Annotated leafy 2-3 trees

● There must be a way to find interesting values in a tree

● We need to guide search from the root of a tree to its leaves

● Let's add special annotations to nodes

● Use these annotations to find values

Size annotated leafy 2-3 trees

● Each intermediate node is annotated with the size of the subtree rooted at this node

● Makes it trivial to find any leaf by its index

● Starting from the root, test whether the index falls in the range of the left (middle) or right subtree, and recurse into that subtree until a leaf is found, as sketched below
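A sketch of that search (my code, written against a leafy binary variant of the Tree/Node/Leaf pseudocode that appears a few slides later, with the Size monoid):

static <V> V get(Tree<Size, V> t, int index) {
  if (t instanceof Leaf) {
    return ((Leaf<Size, V>) t).value;     // reached the leaf at `index`
  }
  Node<Size, V> n = (Node<Size, V>) t;
  int leftSize = n.left.measure().size;   // size annotation of the left subtree
  if (index < leftSize) {
    return get(n.left, index);            // index falls inside the left subtree
  }
  return get(n.right, index - leftSize);  // otherwise shift into the right subtree
}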

Size annotated leafy 2-3 trees

Looks like a random access list

Priority annotated leafy 2-3 trees

● Each intermediate node is annotated with the highest priority of an element in its subtree

● Makes it trivial to find value with the highest priority

● Starting from the root, find the subtree with the highest priority and descend recursively into it, until a leaf is found

Priority annotated leafy 2-3 trees

Looks like a priority queue

Monoids

● One interface to unify size, priority (and more!) annotations on trees

● A set of values with a "zero" element 0 and a binary associative operation ⊕

● Monoid laws:
  0 ⊕ a = a
  a ⊕ 0 = a
  a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c

Monoid examples

● Strings with the empty string and concatenation:
  "" + "a" = "a", "a" + "" = "a"
  "a" + ("b" + "c") = ("a" + "b") + "c"

● Integers with zero and addition:
  0 + 1 = 1, 1 + 0 = 1
  1 + (2 + 3) = (1 + 2) + 3

● Integers with one and multiplication:
  1 * 2 = 2, 2 * 1 = 2
  2 * (3 * 4) = (2 * 3) * 4

● And many, many more! (Monoids are everywhere)

Monoid interface

interface Monoid<T extends Monoid<T>> {
  T unit();
  T combine(T that);
}

class String implements Monoid<String> {
  ...

  String unit() {
    return "";  // (a)
  }

  String combine(String that) {
    return this + that;  // (b)
  }
}

Size monoid

class Size implements Monoid<Size> {
  final int size;  // (a)

  Size(int size) { this.size = size; }

  Size unit() {
    return new Size(0);  // (b)
  }

  Size combine(Size that) {
    return new Size(this.size + that.size);  // (c)
  }
}

Priority monoid

class Priority implements Monoid<Priority> {
  final int priority;  // (a)

  Priority(int priority) { this.priority = priority; }

  Priority unit() {
    return new Priority(Integer.MAX_VALUE);  // (b)
  }

  Priority combine(Priority that) {
    return new Priority(
        Math.min(this.priority, that.priority));  // (c)
  }
}

But where do we get monoids from?

● Monoids have the nice property of composability

● We can get more monoids by combining existing ones (example below)

● But where do we get initial monoids to begin with?

● We need a way to measure values!

● Those measures must be monoids, obviously

interface Measured<M extends Monoid<M>> {
  M measure();
}
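For example, two annotations can ride along in one composed monoid whose unit and combine work component-wise (a hypothetical SizeAndPriority; my sketch, not from the slides):

class SizeAndPriority implements Monoid<SizeAndPriority> {
  final Size size;
  final Priority priority;

  SizeAndPriority(Size size, Priority priority) {
    this.size = size;
    this.priority = priority;
  }

  SizeAndPriority unit() {
    // the unit is just the pair of units
    return new SizeAndPriority(size.unit(), priority.unit());
  }

  SizeAndPriority combine(SizeAndPriority that) {
    // combine component-wise; associativity is inherited
    return new SizeAndPriority(
        size.combine(that.size),
        priority.combine(that.priority));
  }
}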

Let's make a sketch of an annotated tree:

/** <V> is the type of values,
    <M> is the type of monoidal measures of values */
class Tree<M extends Monoid, V extends Measured<M>>
    implements Measured<M> {  // (a)

  abstract class Leaf<M, V> extends Tree<M, V> {
    final V value;                  // (b)
    override abstract M measure();  // (c)
  }

  class Node<M, V> extends Tree<M, V> {
    final Tree<M, V> left, right;  // (d)
    final M m;                     // (e)

    Node(Tree<M, V> l, Tree<M, V> r) {
      left = l;
      right = r;
      m = l.measure().combine(r.measure());  // (f)
    }

    override final M measure() {
      return m;  // (g)
    }
  }

Pseudocode!

Let's make a sketch of an annotated tree:

...
class Leaf<V> extends Tree<Size, V> {
  final V value;

  override final Size measure() {
    return new Size(1);  // (a)
  }
}

...
class Leaf<V> extends Tree<Priority, V> {
  final V value;

  override final Priority measure() {
    return new Priority(value.priority());  // (b)
  }
}

Pseudocode!

But that is not finger tree yet!

Finger Tree

... is just an annotated tree of annotated 2-3 trees!

Finger Tree

Digits, 2-3 trees, fingers and nested levels

Finger Tree

A little bit of Haskell would not hurt:

data Node v a = Node2 v a a
              | Node3 v a a a

data Digit v a = One v a
               | Two v a a
               | Three v a a a
               | Four v a a a a

data FingerTree v a
  = Empty
  | Single a
  | Deep v (Digit v a)                -- (a)
           (FingerTree v (Node v a))  -- (b)
           (Digit v a)                -- (c)

Finger Tree

class FingerTree<M extends Monoid<M>, T extends Measured<M>>
    implements Measured<M> {

class Empty<M extends Monoid<M>, T extends Measured<M>>
    extends FingerTree<M, T> {}

class Single<M extends Monoid<M>, T extends Measured<M>>
    extends FingerTree<M, T> {
  final T v;  // (a)
  final M m;  // (b)

class Deep<M extends Monoid<M>, T extends Measured<M>>
    extends FingerTree<M, T> {
  final Digit<M, T> prefix;                // (c)
  final FingerTree<M, Node<M, T>> middle;  // (d)
  final Digit<M, T> suffix;                // (e)
  final M m;                               // (f)

Source Code 1/3

Finger Tree

class Digit<M extends Monoid<M>, T extends Measured<M>>
    implements Measured<M> {
  final M m;  // (a)

class One<M extends Monoid<M>, T extends Measured<M>>
    extends Digit<M, T> {
  final T a;  // (b)

class Two<M extends Monoid<M>, T extends Measured<M>>
    extends Digit<M, T> {
  final T a, b;  // (c)

class Three<M extends Monoid<M>, T extends Measured<M>>
    extends Digit<M, T> {
  final T a, b, c;  // (d)

class Four<M extends Monoid<M>, T extends Measured<M>>
    extends Digit<M, T> {
  final T a, b, c, d;  // (e)

Source Code 2/3

Finger Tree

class Node<M extends Monoid<M>, T extends Measured<M>>
    implements Measured<M> {
  final M m;  // (a)

class Node2<M extends Monoid<M>, T extends Measured<M>>
    extends Node<M, T> {
  final T a, b;  // (b)

class Node3<M extends Monoid<M>, T extends Measured<M>>
    extends Node<M, T> {
  final T a, b, c;  // (c)

Source Code 3/3

Finger Tree Interface

Basic operations:

● cons, snoc — prepend/append an element
● concat — join two trees
● split — find prefix, element and suffix using a predicate

Beyond the scope of this presentation, sorry

Finger Tree Performance

Amortized bounds:

operation  | Finger Tree        | List      | 2-3 Tree
cons, snoc | O(1)               | O(1)/O(n) | O(log n)
head, last | O(1)               | O(1)/O(n) | O(log n)
concat     | O(log min(ℓ1, ℓ2)) | O(n)      | O(log n)
split      | O(log min(n, ℓ-n)) | O(n)      | O(log n)
index      | O(log min(n, ℓ-n)) | O(n)      | O(log n)

(ℓ is the length of the sequence, n is a position)

Thanks!

Questions?