Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink...

21
Using Left-Right Trees for Hierarchic Data Storage Dale Chant, Roland Seidel, Red Centre Software Pty Ltd SSS Conference, Bristol, 2011 Version: 20 September 2011

Transcript of Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink...

Page 1: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

UsingLeft-RightTreesforHierarchicDataStorage

DaleChant,RolandSeidel,RedCentreSoftwarePtyLtdSSSConference,Bristol,2011

Version:20September2011

Page 2: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

Abstract• Hierarchiessuchasgrids(BrandImage)orcubes(Brand/Statement/Rating)arelevelswhereno

levelsareparallel,or,alternatively,alllevelsaremutuallyorthogonalattheorigin.

• SuchN-dimensionalstructuresmustpresentlybestoredaseitherflatorasaSSSv2<hierarchy>

• Butifflat,thenmanycolumns,andifashierarchyofsurveys,thenmanyfiles.

• Forflatstorage,theproblemisacuteonlargebrandlistswithsparsecodeinstantiation.

• 1,000brands*10attributes*10ratingpoints=20,000columns,evenifmostrespondentsskiporrespondforonlyafewoutofthe1,000brands.Andif10suchquestions,then200,000columns.

• Forhierarchicstorage,multiplefilesforsimplegridsandcubesisoverkill,andconceptualisingasahierarchyofsurveyscanbecounter-intuitivewherethecaseisasinglerespondent.

• Thisproposalforthestorageofsuchdataasleft-righttrees(parsablebysimplyreadingastringfromtheleft) canhugelyreducethenumberofrequiredcolumns.

• Forfixedwidth,thenumberofcolumnsisdeterminedbythelongestresponseintherecord.Fordelimitedstorage,eachrespondentwouldrequireonlyasmanycharactersasneededtorecordandstructurejustthatrespondent’sanswerset.

• Theproposedstoragecouldalsobeusedtostoreanylevelsstructure,butattheexpenseofneedingtoduplicatetheupperpathsforparallel(non-orthogonal)levels.

Page 3: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

LeftRightTreesLeft-righttreesaresimplyawayofrepresentingdatahierarchiesasstringswhichcanbeparsedfromlefttoright.

3

2 5

3,58

6,91,74

Assignadepthdelimitertoeachlevel– ega,b,c,d

a

b

c

d

Thetop-downtreenodestructureabccbcddStorethedataateachnodeasa3b2c4c1,7b5c6,9d8d3,5(ThisisconceptuallysimilartoSurveycraftloops)

Page 4: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

TheSSSV2HouseholdDataHousehold1Terrace,East

Household2Semi-Det,South

Household3Flat,East

Person1 Person2 Person1 Person2 Person3 Person1

Fem Male Male MaleFem Fem

21-45 21-45 >65 46-65 <21 21-45

SocSoc WorkBus WorkWork Work Soc Work Work Soc Soc

Bus CarP Train CarP CarDTrain CarD CarD CarD CarP BusBus

Household,N=3

PersonN=6

Trip,N=12

Triple-SXMLversion2.0.001(December2006),pp42ff.

Page 5: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

SSSDataStorage:HierarchyofSurveysHousehold

010001230100023201000313

Person

010001012201000102120100020114010002022301000203110100030122

Trip

010001011301000101120100010224010001023201000102240100010211010002012101000201210100020111010002031201000301230100030123

1=Terrace2=Semi-Det3=Flat

2=South3=East

1=Male2=Female

1=<212=21-453=46-654=>65

1=Social2=Work3=Business

1=CarDrv2=CarPass3=Bus4=Train

Red=HouseholdLinkIDRed+Blue =PersonLinkIDBlack=Data

Page 6: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

Household#2as5LRTreesHousehold2

Semi-Det,South

Person1

Person2

Person3

Male1

Male1

Fem2

>654

46-653

<211

Work2

Work2

Soc1

Soc1

CarD1

CarD1

CarD1

CarP2

b:Gender:ab1ab2ab1

b:Age:ab4ab3ab1

c:Mode:abc1bc1bc1aabc2

a:Person:a1a2a3

b:Purpose:ab2b2b1aab1

Onetreeperlevelrequires3parallelblevels

Page 7: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

Household#2as3LRTreesHousehold2

Semi-Det,South

Person1

Person2

Person3

Male1

Male1

Fem2

>654

46-653

<211

Work2

Work2

Soc1

Soc1

CarD1

CarD1

CarD1

CarP2

Gender:a1b1a2b2a3b1

Age:a1b4a2b3a3b1

Trips:a1b2c1b2c1b1c1a2a3b1c2

• Storeupperleveldatainsteadofjustthenodes.

• 3parallelblevels,soneedatleast3trees

Page 8: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

TreevsHierarchyofSurveys

• Thethreeparallellevelsmandatethreestorageinstancesforboth– eitherthreetrees,orthreesurveyfiles

• Left-righttreesneedtoduplicatetheupperpathsforparallellevels

• Butforcircumstanceswheretherearenoparallellevels,suchasBrand/Attribute/RatingsorBrandImage,left-righttreesofferseveraladvantages.

• Theprimaryadvantageisdramaticallyreducedstoragerequirementsfortypicalbrand-orientedconsumersurveys

Page 9: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

Grids,Cubes,AsLRTrees

Left-righttreescanalsobeusedtostoregrids,cubes,oranyN-dimensionaldatastructure.

a1b5a2b3a3b7

a1b1c8b2c6b3c5a2b1c6b2c7b3c2a3b1c7b2c5b3c2

BrandXrated5

BrandYrated3

BrandZrated7BrandRating

Ratin

g

Page 10: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

Multi-responseBrandImage a1b1;2;3;4;5;6;7;8a2b5;6;7;8a3b2;3;5;6

• Notethe;delimitertoavoidconfusionwithEuropean,asdecimalplace

• Anylevel(ordimension)canbemulti-response,ega1;2b3;4c5;6;7

• For10statementscoded1to10,theflatstoragefor3brands(spreadformat)requires60columns

• Canhavemulti-responseatanylevel,ega1;2b3;4;5

Page 11: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

CurrentGrid/CubeStorageTheimplementermustchoosebetween

• traditionalflatstorage,or• SSSver2.0hierarchicstorage

Butatypicalbrandtrackerwillhavemanygrids,cubes,etc– arandomsampleof3jobsgives,15,42,and37instances.Thecostiseither

• Alargenumberofcolumns(ifflat),or• Alargenumberoffiles(ifSSShierarchic)

Andwithinternetcollectionnowdominant,thetendencytoallowresponsesforanysubsetofbrandsforwhichthereisawareness(ratherthanjustthetraditionalmainbrandlist)canresultincombinatorialexplosionswhichimposeaheavyburdenonstorage,RAMandCPU.Internationaljobsalsocanhaveverylargebrandlists.

Real-worldexamplesfollow:

Page 12: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

FMCG(1):HierarchyofSurveysSSSfixed-widthexportfromConfirmit,180respondents,12brands,10gridsand5cubesrequires15*2=30files(15XML,15ASC)

ASC Bytes Tree Bytes

Data_0 15,747 B32 15,755

Data_1 14,728 B41 1,181

Data_2 38,523 B42a 492

Data_3 12,549 KC32 11,333

Data_4 9,218 KC41 862

Data_5 55,215 KC42a 537

Data_6 17,031 M32 14,469

Data_7 11,308 M41 975

Data_8 86,031 M42a 657

Data_9 18,321 P32 17,417

Data_10 18,528 P41 1,349

Data_11 68,055 P42a 594

Data_12 11,325 SP32 12,448

Data_13 9,978 SP41 968

Data_14 32,103 SP42a 465

total 418,660 79,502

0

100

200

300

400

500

Hierarchy Tree

Kilobytes

Asmallnumberofbrands,andhighinstantiation,butstillfivetimeslessspace

Comparingstoragerequirements:

Page 13: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

FMCG(2)Flat:BrandImage323brandsby58statements(multi-response)over69,841cases

• Spreadformat:Requires323*58*2=37,468columnscolumns*cases=2,496meg

• Bitformat(divideby2):Requires323*58=18,734columnscolumns*cases=1,248meg

• TreeasFixedWidth:Longestresponse=1150characterschars*cases=76.6meg

• TreeasDelimited:Sumofresponselengths=11.33meg

0

500

1000

1500

2000

2500

3000

Spread Bit FixedTree DelimitedTree

Megabytes

Page 14: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

FMCG(3)FixedWidth:BrandStatementRating

• Bitformat:Requires204*4*5=4,080columnscolumns*cases=6,096k

• TreeasFixedWidth:Longestresponse=120characterschars*cases=179.3k

• TreeasDelimited:Sumofresponselengths=51.5k

0

1000

2000

3000

4000

5000

6000

7000

Bit Spread FixedTree DelimitedTree

Kilobytes

204brandsby4statementsby5ratingsover1,530cases

• Spreadformat:Requires204*4=816columnscolumns*cases=1,219k

Page 15: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

ProposedSSSStorage:FixedWidthSingle

BrandRating:<tree ident="BRAT">

<position start="3" finish="10"/><level ident="Brand" type="single">

<values><value code="1">AMEX</value><value code="2">Visa</value>

</values></level><level ident="Rating" type="single">

<values><value code="1">1</value><value code="2">2</value><value code="3">3</value>

</values></level>

</tree>

11Column: 12345678901Case#1: xxa1b3a2b1xCase#2: xxa2b2 xCase#3: xx xCase#4: xxa1b1a2b3x

• Newtagtype,tree• Differentcontextforthe<level>tag• Nohreforparent,sothelevelsaresubordinate

Page 16: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

ProposedSSSStorage:DelimitedSingle

BrandRating:<tree ident="BRAT">

<position start="3"/><level ident="Brand" type="single">

<values><value code="1">AMEX</value><value code="2">Visa</value>

</values></level><level ident="Rating" type="single">

<values><value code="1">1</value><value code="2">2</value><value code="3">3</value>

</values></level>

</variable>

11111Column: 12345678901234Case#1: x,x,a1b3a2b1,xCase#2: x,x,a2b2,xCase#3: x,x,,xCase#4: x,x,a1b1a2b3,x

Page 17: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

ProposedSSSStorage:FixedWidthMulti

BrandImage:<tree ident="BIM">

<position start="3" finish="12"/><level ident="Brand" type="single">

<values><value code="1">AMEX</value><value code="2">Visa</value>

</values></level><level ident="Image" type="multiple">

<values><value code="1">Cool</value><value code="2">Relevant</value><value code="3">Popular</value>

</values></level>

</variable>

1111Column: 1234567890123Case#1: xxa1b1;3a2b1xCase#2: xxa2b1;2;3 xCase#3: xx xCase#4: xxa1b1a2b1;2x

Page 18: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

ProposedSSSStorage:DelimitedMulti

BrandImage:<tree ident="BIM">

<position start="3"/><level ident="Brand" type="single">

<values><value code="1">AMEX</value><value code="2">Visa</value>

</values></level><level ident="Image" type="multiple">

<values><value code="1">Cool</value><value code="2">Relevant</value><value code="3">Popular</value>

</values></level>

</tree>

1111111Column: 1234567890123456Case#1: x,x,a1b1;3a2b1,xCase#2: x,x,a2b1;2;3,xCase#3: x,x,,xCase#4: x,x,a1b1a2b1;2,x

Page 19: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

ProsandConsPro:• 2filesonlyalways(oneXML,oneASC)• NoneedforLinkIDs• Noneedfor<parent>and<href>tags• NoneedforaDefinitionXML• Thenumberofcases(acrossalldata)remainsconstant• Thebasecountsateachlevelaresimplythenumberofa-nodes,

b-nodes,c-nodesetc• Dataisimplicitlyordered,sodonotneedorder attribute• Dramaticreductioninspacerequirementsforgrids/cubeswithlarge

codeframeswhenonlyasubsethaveresponses,especiallyunderCSV• Thecurrent<Hierarchy>tagsareunaffected• Alevelsjobcanstoregridsandcubesaskedatdifferentlevelsastrees,

avoidinglevelswithinlevelsconundrumsCon:• Couldcostmorecharactersthanfixed-widthfornode-complete(all

codesatalllevelsareinstantiated)• Positionisrecordedonlyforthestart/endofthetree

Page 20: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

HouseholdDataStorage<trees ident="HHTrips">

<level ident="Person" type="single"><position start="1"/><values>

<range from="1" to="10"/></values>

</level><level ident="Gender" type="single" parent="Person">

<position start="2"/><values>

<value code="1">Male</value><value code="2">Female</value>

</values></level><level ident="Age" type="single" parent="Person">

<position start="3"/><values>

<value code="1">Under 21</value><value code="2">21-45</value><value code="3">46-65</value><value code="4">Over 65</value>

</values></level><level ident="Purpose" type="single" parent="Person">

<position start="4"/><values>

<value code="1">Social</value><value code="2">Work</value><value code="3">Business</value>

</values></level><level ident="Method" type="single" parent="Purpose">

<position start="5"/><values>

<value code="1">Car Driver</value><value code="2">Car Passenger</value><value code="3">Bus</value><value code="4">Train</value>

</values></level>

</trees>

HH#1:a1a2,ab2ab1,ab2ab2,ab1b1,abc3bc2HH#2:a1a2a3,ab1ab2ab1,ab4ab3ab1,ab2b2b1aab1,abc1bc1bc1aabc2HH#3:a1,ab2,ab2,ab2b2,abc3bc3

• Treestagbecauseasetoftreesisdescribed

• ThetoplevelPerson hasnoparent

• Theparentattributeallowsparallelism

• Ifnoparentsassignedthensameas<tree>

XMLcouldlooklikethis:

Page 21: Using Left-Right Trees for Hierarchic Data Storage · 2=CarPass 3=Bus 4=Train Red = HouseholdLink ID Red+Blue= Person Link ID Black = Data. Household #2 as 5 LR Trees Household 2

End