django-ct Documentation, Release 0.1
Wolfgang Doll
Sep 27, 2017


Contents

1 Implementation Details

2 Some Research

3 Test Data Sources

4 Requirements

5 SQL Snippets

6 Queries for Digraph-shaped Hierarchies

7 Indices and tables


CHAPTER 1

Implementation Details

Note: The book “Pro Django” by Marty Alchin is a great source of knowledge, and I took a lot of ideas from it. For me, realizing an application like django-ct would not have been possible without the knowledge from this book.

This section provides an overview of the features that will be available when the application is completed, so we can start seeing these features fall into place as the code progresses.

The first step is assigning a manager to the model that needs the capabilities of a closure table. This should be as simple as possible, preferably just a single attribute assignment. Simply pick a name and assign an object, just like Django’s own model fields.

    from django.db import models
    from ct import models as ct


    class Topic(models.Model):
        name = models.TextField()
        index = ct.ClosureTable()

        def __unicode__(self):
            return u'(%s) %s' % (self.id, self.name)

That’s enough to get everything configured. From there, the framework is able to set up a model behind the scenes to store the closure table entries, and the index attribute can be used to access that information through the API methods of django-ct.

The whole registration of the manager begins by assigning a ClosureTable object to a model, so that’s a good place to start defining code. A number of things have to happen in sequence to get the closure table system initialized for a particular model. At a high level, ClosureTable manages the following tasks:

• Create a model for the requested closure table whose foreign keys reference the related objects in the original model (create_model()).

• Register signal handlers to execute when the original model is saved. These in turn add new rows to the closure table each time a new instance of the model is saved.

• Assign a descriptor to the original model, using the attribute name where the ClosureTable was assigned. This descriptor forwards the work to an InstanceManager object or a ClassManager object.

Before any of those steps can really begin, there’s a small amount of housekeeping that must be done. Since the ClosureTable object gets assigned as an attribute of a model, the first chance it gets to execute is in the contribute_to_class() method.

    def contribute_to_class(self, cls, name):
        self.name = name
        models.signals.class_prepared.connect(self.finalize, sender=cls)

So far it’s not much, but this is the only point in the process where Django tells ClosureTable what name it was given when assigned to the model. This is stored away for future reference. The method contribute_to_class() gets called on each field in turn, in the order they appear in the namespace dictionary Python created for the model’s definition. Since standard dictionaries don’t have a guaranteed order, there’s no way to predict how many fields will already have been processed by the time ClosureTable gets a chance to peek at the model.

To solve this, we turn to a signal: class_prepared. Django fires this signal once all the fields and managers have been added to the model and everything is in place to be used by external code. That’s when ClosureTable will have guaranteed access to the model, so contribute_to_class() continues by setting up a listener for class_prepared.

Django will now call ClosureTable.finalize() with the fully prepared model once everything is in place to continue processing it. That method is responsible for performing all of the remaining tasks. Most of the details are delegated to other methods, but finalize() coordinates them.

    def finalize(self, sender, **kwargs):
        self.ctModel = self.create_model(sender)

        # The ClosureTable object will be discarded,
        # so the signal handler can't use weak references.
        models.signals.post_save.connect(self.post_save, sender=sender, weak=False)

        descriptor = self.descriptor(self.ctModel)
        setattr(sender, self.name, descriptor)
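The hand-off from contribute_to_class() to finalize() can be tried without Django. The following is a minimal sketch, not Django's implementation: Signal, ModelBase and the placeholder string stored by finalize() are hypothetical stand-ins that only imitate the order of events described above.

```python
class Signal(object):
    """Hypothetical, minimal stand-in for a Django signal."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver, sender):
        self._receivers.append((receiver, sender))

    def send(self, sender):
        for receiver, wanted in self._receivers:
            if wanted is sender:
                receiver(sender=sender)


class_prepared = Signal()


class ModelBase(type):
    """Toy metaclass: tell contributors their name, then announce the class."""

    def __new__(mcs, name, bases, namespace):
        contributors = {key: value for key, value in namespace.items()
                        if hasattr(value, 'contribute_to_class')}
        plain = {key: value for key, value in namespace.items()
                 if key not in contributors}
        cls = super().__new__(mcs, name, bases, plain)
        for attr_name, value in contributors.items():
            value.contribute_to_class(cls, attr_name)
        # Fired only after every attribute has been handled.
        class_prepared.send(sender=cls)
        return cls


class ClosureTable(object):
    def contribute_to_class(self, cls, name):
        # The only moment at which the attribute name is known.
        self.name = name
        class_prepared.connect(self.finalize, sender=cls)

    def finalize(self, sender, **kwargs):
        # A real implementation would build the closure model here; the
        # string is just a visible placeholder for the descriptor.
        setattr(sender, self.name, 'closure table for %s' % sender.__name__)


class Topic(metaclass=ModelBase):
    index = ClosureTable()
    name = 'defined after index, yet present when finalize() runs'


print(Topic.index)
```

Because finalize() runs from the signal, it sees the complete class no matter where the ClosureTable attribute appears in the class body.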

There are a few different sub-steps required in creating a closure table. Adding all the logic in one method would hamper readability and maintainability, so it’s been broken up into three additional methods (create_model(), get_fields() and get_options()).

    def create_model(self, model):
        attrs = {'__module__': model.__module__}

        # Build Meta from a plain dict: a class __dict__ cannot be
        # updated in place on Python 3.
        meta_attrs = dict(attrs)
        meta_attrs.update(self.get_options(model))
        attrs['Meta'] = type('Meta', (object,), meta_attrs)

        attrs.update(self.get_fields(model))
        name = '%s_ct_%s' % (model._meta.object_name, self.name.lower())
        return type(name, (models.Model,), attrs)

Django is Python! So we can use all the hard-core Python features to create a new Django model class. The create_model() method above mimics the process of creating a model as in this code:

    from django.db import models
    from tests.models import Topic


    class Topic_ct_index(models.Model):
        ancestor = models.ForeignKey(
            Topic, related_name='+', on_delete=models.CASCADE,
            blank=False, null=False)

        descendant = models.ForeignKey(
            Topic, related_name='+', on_delete=models.CASCADE,
            blank=False, null=False)

        path_length = models.PositiveIntegerField(default=0, blank=False, null=False)

        class Meta:
            unique_together = ('ancestor', 'descendant')
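The three-argument type() call at the heart of create_model() works for any class, so it can be explored without Django. This is a minimal sketch under that assumption: Base stands in for models.Model, and make_closure_class() and the plain values used as "fields" are hypothetical.

```python
class Base(object):
    """Hypothetical stand-in for django.db.models.Model."""


def make_closure_class(owner_name, attr_name, fields, options):
    # Build the inner Meta class from a plain dict of options.
    meta = type('Meta', (object,), dict(options))
    attrs = {'__module__': __name__, 'Meta': meta}
    attrs.update(fields)
    # Same naming scheme as create_model(): <Model>_ct_<attribute>.
    name = '%s_ct_%s' % (owner_name, attr_name.lower())
    return type(name, (Base,), attrs)


Topic_ct_index = make_closure_class(
    'Topic', 'index',
    fields={'path_length': 0},
    options={'unique_together': ('ancestor', 'descendant')},
)

print(Topic_ct_index.__name__)
print(Topic_ct_index.Meta.unique_together)
```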

In this way we implement a model in the database that looks like the original proposal from Bill Karwin:

ct(c)

    CREATE TABLE ct (
        ancestor   INTEGER NOT NULL REFERENCES c (id) ON DELETE CASCADE,
        descendant INTEGER NOT NULL REFERENCES c (id) ON DELETE CASCADE,
        length     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (ancestor, descendant)
    )

The last two lines of the finalize() method assign a so-called descriptor to the original model. This results in an attribute on the model, for example the index attribute of the Topic model. This attribute is implemented as a Descriptor object.

    class Descriptor(object):

        def __init__(self, ctModel):
            self._ctModel = ctModel

        def __get__(self, instance, owner):
            if instance is None:
                return ClassManager(self._ctModel)
            return InstanceManager(self._ctModel, instance)

With this, django-ct can provide two different kinds of API. If we use index like a class property, the API is provided by a ClassManager. If we use the index attribute like an instance property, the API is provided by an InstanceManager.

    $ python manage.py shell
    >>> from tests.models import Topic
    >>> type(Topic.index)
    <class 'ct.manager.ClassManager'>
    >>> a = Topic.objects.get(pk=1)
    >>> type(a.index)
    <class 'ct.manager.InstanceManager'>
    >>> exit()


CHAPTER 2

Some Research

• Trees and Other Hierarchies in MySQL

• Managing hierarchies in SQL

• What is the most efficient/elegant way to parse a flat table into a tree?

• The simplest(?) way to do tree-based queries in SQL

• Rendering Trees with Closure Tables

• The Term TCT (transitive closure tables)

• Hierarchical Data: Persistence via Closure Table

• The Slides of Bill Karwin

• A gist with a PHP implementation.

• Optimize Hierarchy Queries with a Transitive Closure Table

• Hierarchy Queries - Creating a Transitive Closure to Optimize Rollups (Steven F. Lott)

• Moving Subtrees in Closure Table Hierarchies

• Bill Karwin: SQL Antipatterns: Avoiding the Pitfalls of Database Programming

• Robert Sedgewick, Kevin Wayne: Directed Graphs and some Slides


CHAPTER 3

Test Data Sources

• Integrated Taxonomic Information System (ITIS)


CHAPTER 4

Requirements

• Every class “C” is associated with one or more closure table classes “C.CTn”.

• It is possible to manage more than one tree “T” inside one closure table “C.CTn”.

• Every new instance “I” of “C” is automatically added to its associated closure tables “C.CTn”. (This builds a tree “T” with one node, the root node.)

• An instance “I” of “C” always has at least one reference in its closure table “C.CTn”.

• When an instance “I” of “C” is deleted, all references to “I” in “C.CTn” are deleted too. (The tree structure is preserved.)

• We need the ability to connect a tree “T” with a subtree “ST”.

• We need the ability to disconnect a subtree “ST” from its tree “T”. (This generates a new tree.)


CHAPTER 5

SQL Snippets

A closure table is a way of storing hierarchies (digraphs). It involves storing all paths through the graph, not just those with a direct parent-child relationship [1].

ct(c)

    CREATE TABLE ct (
        ancestor   INTEGER NOT NULL REFERENCES c (id) ON DELETE CASCADE,
        descendant INTEGER NOT NULL REFERENCES c (id) ON DELETE CASCADE,
        length     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (ancestor, descendant)
    )

To create a new node, we first insert the self-referencing row [2].

create(t)

INSERT INTO ct (ancestor, descendant) VALUES (t,t)
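The table definition and the create step can be tried with Python's sqlite3 module. This is a minimal sketch that assumes SQLite as the database and a hypothetical two-column table c; the helper name create() mirrors the snippet label and is not part of django-ct.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# SQLite enforces REFERENCES clauses only when this pragma is switched on.
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE c (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("""
    CREATE TABLE ct (
        ancestor   INTEGER NOT NULL REFERENCES c (id) ON DELETE CASCADE,
        descendant INTEGER NOT NULL REFERENCES c (id) ON DELETE CASCADE,
        length     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (ancestor, descendant)
    )
""")


def create(node_id, name):
    """Insert a row into c together with its self-referencing row in ct."""
    db.execute("INSERT INTO c (id, name) VALUES (?, ?)", (node_id, name))
    db.execute("INSERT INTO ct (ancestor, descendant) VALUES (?, ?)",
               (node_id, node_id))


create(1, "root")
create(2, "child")
rows = db.execute(
    "SELECT ancestor, descendant, length FROM ct ORDER BY ancestor").fetchall()

# Deleting a node from c removes its closure rows through ON DELETE CASCADE.
db.execute("DELETE FROM c WHERE id = 2")
remaining = db.execute("SELECT ancestor, descendant, length FROM ct").fetchall()
print(rows, remaining)
```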

To connect two trees, we need to insert every path that the new connection creates. We use a Cartesian join between the ancestors of “st” (going up) and the descendants of “t” (going down) [3].

connect(t, st)

    INSERT INTO ct (ancestor, descendant, length)
    SELECT supertree.ancestor, subtree.descendant,
           supertree.length + subtree.length + 1
    FROM ct AS supertree CROSS JOIN ct AS subtree
    WHERE subtree.ancestor = t
      AND supertree.descendant = st
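A worked example may make the join easier to follow. The sketch below assumes SQLite through Python's sqlite3 module, leaves out the foreign keys to c for brevity, and uses a hypothetical helper connect() that simply runs the statement above with named parameters.

```python
import sqlite3

CONNECT = """
    INSERT INTO ct (ancestor, descendant, length)
    SELECT supertree.ancestor, subtree.descendant,
           supertree.length + subtree.length + 1
    FROM ct AS supertree CROSS JOIN ct AS subtree
    WHERE subtree.ancestor = :t
      AND supertree.descendant = :st
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ct (ancestor INTEGER NOT NULL, descendant INTEGER NOT NULL, "
           "length INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (ancestor, descendant))")
# Three single-node trees: only the self-referencing rows exist.
db.executemany("INSERT INTO ct (ancestor, descendant) VALUES (?, ?)",
               [(n, n) for n in (1, 2, 3)])


def connect(t, st):
    """Attach the tree rooted at t below the node st."""
    db.execute(CONNECT, {"t": t, "st": st})


connect(3, 2)  # 2 -> 3
connect(2, 1)  # 1 -> 2 -> 3
paths = db.execute(
    "SELECT ancestor, descendant, length FROM ct "
    "WHERE length > 0 ORDER BY ancestor, descendant").fetchall()
print(paths)  # [(1, 2, 1), (1, 3, 2), (2, 3, 1)]
```

The second call adds two rows at once: the direct edge from 1 to 2 and the longer path from 1 to 3, whose length is the sum of both partial lengths plus one.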

But it is not possible to connect every tree with any other subtree. The definition of our ct does not allow duplicate edges. Therefore the intersection of the already existing edges and the new edges must be empty.

[1] Bill Karwin: SQL Antipatterns: Avoiding the Pitfalls of Database Programming, page 36
[2] Bill Karwin: SQL Antipatterns: Avoiding the Pitfalls of Database Programming, page 38
[3] Bill Karwin: Moving Subtrees in Closure Table Hierarchies


checkA(t, st)

    SELECT ancestor, descendant
    FROM ct
    INTERSECT
    SELECT supertree.ancestor, subtree.descendant
    FROM ct AS supertree CROSS JOIN ct AS subtree
    WHERE subtree.ancestor = t
      AND supertree.descendant = st

We disconnect the subtree from all nodes which are not descendants of “st” [3].

disconnect(st)

    DELETE FROM ct
    WHERE descendant IN (SELECT descendant FROM ct WHERE ancestor = st)
      AND ancestor NOT IN (SELECT descendant FROM ct WHERE ancestor = st)

A subtree is considered disconnected if the following query returns no result. The parameter “t” in “connect” should be checked against this query.

checkB(st)

    SELECT ancestor, descendant
    FROM ct
    WHERE descendant IN (SELECT descendant FROM ct WHERE ancestor = st)
      AND ancestor NOT IN (SELECT descendant FROM ct WHERE ancestor = st)

Assumption

Applying the checks A and B results in digraphs with special properties. Such a digraph has exactly one starting point. Each additional node has exactly one predecessor. Cycles are not possible. With this, the conditions for a data structure named “tree” are met.
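The interplay of connect, disconnect and the two checks can be sketched as a guarded move of a subtree. This is an illustration only, assuming SQLite through Python's sqlite3 module; the helpers can_connect() and connect() are hypothetical and not the django-ct API.

```python
import sqlite3

NEW_PATHS = """
    SELECT supertree.ancestor, subtree.descendant
    FROM ct AS supertree CROSS JOIN ct AS subtree
    WHERE subtree.ancestor = :t AND supertree.descendant = :st
"""
OUTSIDE_LINKS = """
    FROM ct
    WHERE descendant IN (SELECT descendant FROM ct WHERE ancestor = :st)
      AND ancestor NOT IN (SELECT descendant FROM ct WHERE ancestor = :st)
"""
CHECK_A = "SELECT ancestor, descendant FROM ct INTERSECT" + NEW_PATHS
CHECK_B = "SELECT ancestor, descendant" + OUTSIDE_LINKS
DISCONNECT = "DELETE" + OUTSIDE_LINKS
CONNECT = """
    INSERT INTO ct (ancestor, descendant, length)
    SELECT supertree.ancestor, subtree.descendant,
           supertree.length + subtree.length + 1
    FROM ct AS supertree CROSS JOIN ct AS subtree
    WHERE subtree.ancestor = :t AND supertree.descendant = :st
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ct (ancestor INTEGER NOT NULL, descendant INTEGER NOT NULL, "
           "length INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (ancestor, descendant))")
db.executemany("INSERT INTO ct (ancestor, descendant) VALUES (?, ?)",
               [(n, n) for n in (1, 2, 3, 4)])


def can_connect(t, st):
    """Check B: t must be a detached root. Check A: no new path may exist yet."""
    detached = db.execute(CHECK_B, {"st": t}).fetchone() is None
    fresh = db.execute(CHECK_A, {"t": t, "st": st}).fetchone() is None
    return detached and fresh


def connect(t, st):
    """Attach the tree rooted at t below the node st, refusing unsafe requests."""
    if not can_connect(t, st):
        raise ValueError("connecting %r below %r would not yield a tree" % (t, st))
    db.execute(CONNECT, {"t": t, "st": st})


connect(2, 1)
connect(3, 2)                      # chain 1 -> 2 -> 3; node 4 stands alone
blocked_move = can_connect(3, 4)   # False: 3 still hangs below 2
blocked_cycle = can_connect(1, 3)  # False: the path (1, 3) already exists
db.execute(DISCONNECT, {"st": 3})
connect(3, 4)                      # now 1 -> 2 and 4 -> 3
paths = db.execute("SELECT ancestor, descendant, length FROM ct "
                   "WHERE length > 0 ORDER BY ancestor").fetchall()
print(blocked_move, blocked_cycle, paths)
```

Check B rejects a node that still has a predecessor, and check A rejects the attempt to hang the root below one of its own descendants, so every accepted connect keeps the structure a tree.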

Many thanks to Bill Karwin for these beautiful ideas on how to implement a closure table.


CHAPTER 6

Queries for Digraph-shaped Hierarchies

To retrieve the ancestors of a node “st”, we have to match rows in “ct” where the descendant is “st”. However, the node “st” itself is still part of the result. To solve this, we filter out the self-referencing row of the node “st”.

ancestors(st)

    SELECT ancestor
    FROM ct
    WHERE descendant = st AND ancestor <> descendant

To retrieve the descendants of a node “st”, we have to match rows in “ct” where the ancestor is “st”. The same applies as before: the node “st” is still part of the result if we do not filter out the self-referencing row of the node “st”.

descendants(st)

    SELECT descendant
    FROM ct
    WHERE ancestor = st AND ancestor <> descendant

Queries for direct predecessor or successor nodes should also use the “length” attribute in “ct”. We know that the path length of an immediate successor is 1. Searching for the direct successors of “st” is now straightforward:

successors(st)

    SELECT descendant AS successor
    FROM ct
    WHERE ancestor = st AND length = 1

Adjusted accordingly, the same method finds the predecessors of the node “st”:

predecessors(st)

    SELECT ancestor AS predecessor
    FROM ct
    WHERE descendant = st AND length = 1
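The four queries so far can be compared side by side on a small tree. The sketch assumes SQLite through Python's sqlite3 module; the sample rows and the helper run() are made up for the illustration.

```python
import sqlite3

# (ancestor, descendant, length) rows for the tree 1 -> {2, 3}, 2 -> {4, 5}
ROWS = [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
    (1, 2, 1), (1, 3, 1), (2, 4, 1), (2, 5, 1),
    (1, 4, 2), (1, 5, 2),
]
QUERIES = {
    'ancestors': "SELECT ancestor FROM ct "
                 "WHERE descendant = ? AND ancestor <> descendant",
    'descendants': "SELECT descendant FROM ct "
                   "WHERE ancestor = ? AND ancestor <> descendant",
    'successors': "SELECT descendant FROM ct WHERE ancestor = ? AND length = 1",
    'predecessors': "SELECT ancestor FROM ct WHERE descendant = ? AND length = 1",
}

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE ct (ancestor INTEGER NOT NULL, descendant INTEGER NOT NULL, "
           "length INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (ancestor, descendant))")
db.executemany("INSERT INTO ct (ancestor, descendant, length) VALUES (?, ?, ?)", ROWS)


def run(name, st):
    """Run the named query for node st and return the node ids, sorted."""
    return sorted(row[0] for row in db.execute(QUERIES[name], (st,)))


print(run('ancestors', 4))     # [1, 2]
print(run('descendants', 1))   # [2, 3, 4, 5]
print(run('successors', 2))    # [4, 5]
print(run('predecessors', 5))  # [2]
```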


Children having the same parents are usually known as siblings. In our graph, we call this kind of relationship a corporation. We can search for the members of a corporation with a nested query. First we search for the predecessors, and then we find their successors.

corporation(st)

    SELECT DISTINCT descendant AS member
    FROM ct
    WHERE length = 1 AND ancestor IN (
        SELECT ancestor
        FROM ct
        WHERE descendant = st AND length = 1
    )
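As a quick check of the nested query, the following sketch runs it on a small sample tree. It assumes SQLite through Python's sqlite3 module; the rows and the helper corporation() are illustrative only. Note that the node “st” itself is a member of its own corporation.

```python
import sqlite3

# (ancestor, descendant, length) rows for the tree 1 -> {2, 3}, 2 -> {4, 5}
ROWS = [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
    (1, 2, 1), (1, 3, 1), (2, 4, 1), (2, 5, 1),
    (1, 4, 2), (1, 5, 2),
]
CORPORATION = """
    SELECT DISTINCT descendant AS member
    FROM ct
    WHERE length = 1 AND ancestor IN (
        SELECT ancestor FROM ct WHERE descendant = ? AND length = 1
    )
"""

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE ct (ancestor INTEGER NOT NULL, descendant INTEGER NOT NULL, "
           "length INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (ancestor, descendant))")
db.executemany("INSERT INTO ct (ancestor, descendant, length) VALUES (?, ?, ?)", ROWS)


def corporation(st):
    """Return all nodes that share a direct predecessor with st, sorted."""
    return sorted(row[0] for row in db.execute(CORPORATION, (st,)))


print(corporation(4))  # [4, 5]
print(corporation(3))  # [2, 3]
print(corporation(1))  # [] because the root has no predecessor
```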

The following query retrieves the starting points from which the graph leads to the node “st”.

startpoints(st)

    SELECT ancestor AS startpoint
    FROM ct
    WHERE descendant = st AND ancestor NOT IN (
        SELECT descendant
        FROM ct
        WHERE ancestor <> descendant
    )

The following query retrieves the end points at which the graph arrives when the traversal starts from the node “st”.

endpoints(st)

    SELECT descendant AS endpoint
    FROM ct
    WHERE ancestor = st AND descendant NOT IN (
        SELECT ancestor
        FROM ct
        WHERE ancestor <> descendant
    )
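Both queries can be tried on a small sample tree. The sketch assumes SQLite through Python's sqlite3 module; the rows and the helper run() are illustrative. Because the self-referencing row takes part in the match, a root is its own starting point and a leaf is its own end point.

```python
import sqlite3

# (ancestor, descendant, length) rows for the tree 1 -> {2, 3}, 2 -> {4, 5}
ROWS = [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
    (1, 2, 1), (1, 3, 1), (2, 4, 1), (2, 5, 1),
    (1, 4, 2), (1, 5, 2),
]
STARTPOINTS = """
    SELECT ancestor AS startpoint
    FROM ct
    WHERE descendant = ? AND ancestor NOT IN (
        SELECT descendant FROM ct WHERE ancestor <> descendant
    )
"""
ENDPOINTS = """
    SELECT descendant AS endpoint
    FROM ct
    WHERE ancestor = ? AND descendant NOT IN (
        SELECT ancestor FROM ct WHERE ancestor <> descendant
    )
"""

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE ct (ancestor INTEGER NOT NULL, descendant INTEGER NOT NULL, "
           "length INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (ancestor, descendant))")
db.executemany("INSERT INTO ct (ancestor, descendant, length) VALUES (?, ?, ?)", ROWS)


def run(sql, st):
    """Run a one-parameter query for node st and return the node ids, sorted."""
    return sorted(row[0] for row in db.execute(sql, (st,)))


print(run(STARTPOINTS, 5))  # [1]
print(run(STARTPOINTS, 1))  # [1]
print(run(ENDPOINTS, 1))    # [3, 4, 5]
print(run(ENDPOINTS, 3))    # [3]
```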

A node is called a producer if it is an ancestor of another node.

producer()

    SELECT DISTINCT ancestor AS producer
    FROM ct
    WHERE ancestor <> descendant

A node is called a consumer if it is a descendant of another node.

consumer()

    SELECT DISTINCT descendant AS consumer
    FROM ct
    WHERE ancestor <> descendant

A node which is a consumer but not a producer is called a sink.


sinks()

    SELECT DISTINCT descendant AS sink
    FROM ct
    WHERE ancestor NOT IN (
        SELECT ancestor
        FROM ct
        WHERE ancestor <> descendant
    )

A node which is a producer but not a consumer is called a source.

sources()

    SELECT DISTINCT ancestor AS source
    FROM ct
    WHERE ancestor NOT IN (
        SELECT descendant
        FROM ct
        WHERE ancestor <> descendant
    )
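The four classifications can be compared on one sample tree. The sketch assumes SQLite through Python's sqlite3 module; the rows and the helper run() are illustrative, and every node of the sample is connected to at least one other node.

```python
import sqlite3

# (ancestor, descendant, length) rows for the tree 1 -> {2, 3}, 2 -> {4, 5}
ROWS = [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
    (1, 2, 1), (1, 3, 1), (2, 4, 1), (2, 5, 1),
    (1, 4, 2), (1, 5, 2),
]
QUERIES = {
    'producer': "SELECT DISTINCT ancestor FROM ct WHERE ancestor <> descendant",
    'consumer': "SELECT DISTINCT descendant FROM ct WHERE ancestor <> descendant",
    'sinks': "SELECT DISTINCT descendant FROM ct WHERE ancestor NOT IN "
             "(SELECT ancestor FROM ct WHERE ancestor <> descendant)",
    'sources': "SELECT DISTINCT ancestor FROM ct WHERE ancestor NOT IN "
               "(SELECT descendant FROM ct WHERE ancestor <> descendant)",
}

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE ct (ancestor INTEGER NOT NULL, descendant INTEGER NOT NULL, "
           "length INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (ancestor, descendant))")
db.executemany("INSERT INTO ct (ancestor, descendant, length) VALUES (?, ?, ?)", ROWS)


def run(name):
    """Run the named parameterless query and return the node ids, sorted."""
    return sorted(row[0] for row in db.execute(QUERIES[name]))


print(run('producer'))  # [1, 2]
print(run('consumer'))  # [2, 3, 4, 5]
print(run('sinks'))     # [3, 4, 5]
print(run('sources'))   # [1]
```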

The number of head endpoints adjacent to a node is called the indegree of the node.

indegree(st)

    SELECT COUNT(ancestor) AS indegree
    FROM ct
    WHERE descendant = st AND length = 1

The number of tail endpoints adjacent to a node is called its outdegree.

outdegree(st)

    SELECT COUNT(descendant) AS outdegree
    FROM ct
    WHERE ancestor = st AND length = 1

Every node in “ct” is defined by its self-referencing row.

nodes()

    SELECT ancestor AS node
    FROM ct
    WHERE ancestor = descendant

We can retrieve a list of direct connections between the nodes.

arcs()

    SELECT ancestor AS tail, descendant AS head
    FROM ct
    WHERE length = 1
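The degree queries and the two listing queries can be checked against each other on a small sample. The sketch assumes SQLite through Python's sqlite3 module; the rows and the helper degree() are illustrative.

```python
import sqlite3

# (ancestor, descendant, length) rows for the tree 1 -> {2, 3}, 2 -> {4, 5}
ROWS = [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
    (1, 2, 1), (1, 3, 1), (2, 4, 1), (2, 5, 1),
    (1, 4, 2), (1, 5, 2),
]
INDEGREE = "SELECT COUNT(ancestor) FROM ct WHERE descendant = ? AND length = 1"
OUTDEGREE = "SELECT COUNT(descendant) FROM ct WHERE ancestor = ? AND length = 1"
NODES = "SELECT ancestor FROM ct WHERE ancestor = descendant"
DIRECT = "SELECT ancestor AS tail, descendant AS head FROM ct WHERE length = 1"

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE ct (ancestor INTEGER NOT NULL, descendant INTEGER NOT NULL, "
           "length INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (ancestor, descendant))")
db.executemany("INSERT INTO ct (ancestor, descendant, length) VALUES (?, ?, ?)", ROWS)


def degree(sql, st):
    """Return the single count produced by one of the degree queries."""
    return db.execute(sql, (st,)).fetchone()[0]


nodes = sorted(row[0] for row in db.execute(NODES))
direct = sorted(db.execute(DIRECT).fetchall())

print(degree(INDEGREE, 4), degree(OUTDEGREE, 2))  # 1 2
print(nodes)                                      # [1, 2, 3, 4, 5]
print(direct)                                     # [(1, 2), (1, 3), (2, 4), (2, 5)]
```

In a tree, the outdegrees of all nodes add up to the number of direct connections, which gives a simple consistency check for the table.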


CHAPTER 7

Indices and tables

• genindex

• modindex

• search
