Incremental Data Transformations on Wide-Column Stores with NotaQL

M. Sc. Johannes Schildgen, 2015-06-29, [email protected]

Transcript of Incremental Data Transformations on Wide-Column Stores with NotaQL

Page 1: Incremental Data Transformations on Wide-Column Stores with NotaQL

Incremental Data Transformations on Wide-Column Stores with NotaQL

M. Sc. Johannes Schildgen, 2015-06-29

[email protected]

Page 2: Incremental Data Transformations on Wide-Column Stores with NotaQL

"A DBA walks into a NoSQL bar, but turns and leaves because he couldn't find a table"

Page 3: Incremental Data Transformations on Wide-Column Stores with NotaQL

Column Families

RowId info children

Page 4: Incremental Data Transformations on Wide-Column Stores with NotaQL

RowId | info | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa | born: 1997, school: BSIT |

Peter, 1965, IBM, 70k

Lisa, 1997, BSIT

Column Families

€10 €5

€0 €7

Page 5: Incremental Data Transformations on Wide-Column Stores with NotaQL

HBase API

put 'pers', 'Carl', 'info:born', '1982'

put 'pers', 'Carl', 'info:school', 'BSIT'

put 'pers', 'Carl', 'info:school', 'BUIT'

get 'pers', 'Carl'
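The effect of the shell commands above can be sketched in a few lines of Python. `ToyTable` is a hypothetical in-memory stand-in, not the HBase client API. It assumes only that a second put to the same cell stores a newer version and that get returns the newest version of every cell in the row.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory model of one wide-column table."""

    def __init__(self):
        # row id -> "family:qualifier" -> list of (timestamp, value), newest last
        self._rows = defaultdict(lambda: defaultdict(list))
        self._clock = 0  # logical timestamp, incremented on every put

    def put(self, row_id, column, value):
        self._clock += 1
        self._rows[row_id][column].append((self._clock, value))

    def get(self, row_id):
        # Return only the newest version of each cell in the row.
        return {col: versions[-1][1]
                for col, versions in self._rows[row_id].items()}


pers = ToyTable()
pers.put("Carl", "info:born", "1982")
pers.put("Carl", "info:school", "BSIT")
pers.put("Carl", "info:school", "BUIT")  # shadows the previous value
carl = pers.get("Carl")
print(carl)
```

In this sketch the row needs no schema: each put simply names the column it writes to, which is the schema flexibility the later slides build on.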

Page 6: Incremental Data Transformations on Wide-Column Stores with NotaQL

Jaspersoft HBase QL

{
  "tableName": "pers",
  "deserializerClass": "com.jaspersoft…DefaultDeserializer",
  "filter": {
    "SingleColumnValueFilter": {
      "family": "info",
      "qualifier": "school",
      "compareOp": "EQUAL",
      "comparator": {
        "SubstringComparator": { "substr": "BSIT" }
      }
    }
  }
}

$\sigma_{school = 'BSIT'}(pers)$

http://community.jaspersoft.com/wiki/jaspersoft-hbase-query-language

Page 7: Incremental Data Transformations on Wide-Column Stores with NotaQL

Phoenix

SELECT * FROM pers WHERE school = 'BSIT'

$\sigma_{school = 'BSIT'}(pers)$

https://github.com/forcedotcom/phoenix

"Parent of each person?"

Page 8: Incremental Data Transformations on Wide-Column Stores with NotaQL
Page 9: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Table Output Table

Page 10: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Cell: RowID, Column, Value

Output Cell: RowID, Column, Value

Page 11: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Cell: _r (RowID), _c (Column), _v (Value)

Output Cell: _r (RowID), _c (Column), _v (Value)

Page 12: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

OUT._r <- IN._r,
OUT.born <- IN.born;

$\pi_{born}(pers)$

Page 13: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

OUT._r <- IN._r,
OUT.born <- IN.born,
OUT.school <- IN.school;

$\pi_{born, school}(pers)$

Page 14: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

$\sigma_{school = 'BSIT'}(pers)$

Page 15: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Cell: _r = Lisa, _c = born, _v = 1997

Output Cell: _r = Lisa, _c = born, _v = 1997

IN-FILTER: school='BSIT',
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

row predicate (the IN-FILTER clause)

$\sigma_{school = 'BSIT'}(pers)$
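The projection and selection queries above can be mimicked in plain Python. This is only a sketch of the cell-by-cell semantics as read from the slides, not NotaQL's implementation. `transform`, `row_pred` and `columns` are illustrative names, column families are ignored, and 70 stands for 70k.

```python
pers = {
    "Peter": {"born": 1965, "cmpny": "IBM", "salary": 70},
    "Lisa": {"born": 1997, "school": "BSIT"},
}


def transform(table, row_pred=lambda row: True, columns=None):
    """Copy cells to an output table, one input cell (_r, _c, _v) at a time.

    row_pred plays the role of IN-FILTER; columns=None copies every
    column, like OUT.$(IN._c) <- IN._v.
    """
    out = {}
    for row_id, row in table.items():
        if not row_pred(row):  # the row violates the row predicate
            continue
        for col, value in row.items():
            if columns is None or col in columns:
                out.setdefault(row_id, {})[col] = value
    return out


# OUT._r <- IN._r, OUT.born <- IN.born;
projection = transform(pers, columns={"born"})

# IN-FILTER: school='BSIT', OUT._r <- IN._r, OUT.$(IN._c) <- IN._v;
selection = transform(pers, row_pred=lambda row: row.get("school") == "BSIT")

print(projection)
print(selection)
```

Using `row.get` in the predicate reflects that a row without a `school` column simply fails the filter rather than raising an error.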

Page 16: Incremental Data Transformations on Wide-Column Stores with NotaQL

That was: Selection and Projection

Page 17: Incremental Data Transformations on Wide-Column Stores with NotaQL

Now: Grouping

Page 18: Incremental Data Transformations on Wide-Column Stores with NotaQL

Input Cell: _r = Peter, _c = cmpny, _v = IBM

Output Cell: _r = IBM, _c = salsum, _v = 645k

Salary sum of each company.

OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary);

Page 19: Incremental Data Transformations on Wide-Column Stores with NotaQL

OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary);

RowId | info
Eve | born: 1965, cmpny: IBM, salary: 70k
Carl | born: 1966, cmpny: IBM, job: intern
Julia | born: 1967, cmpny: IBM, salary: 80k
Lisa | born: 1997, school: BSIT, salary: 1k

Intermediate cells: IBM, salsum: 70k (from Eve) and IBM, salsum: 80k (from Julia)

Output: IBM, salsum: 150k
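The grouping example above can be reproduced with a short Python sketch. It assumes, as the slide suggests, that rows lacking either `cmpny` or `salary` contribute nothing to the sum. `salary_sum_per_company` is an illustrative name, and salaries are given in thousands.

```python
from collections import defaultdict

pers = {
    "Eve": {"born": 1965, "cmpny": "IBM", "salary": 70},
    "Carl": {"born": 1966, "cmpny": "IBM", "job": "intern"},   # no salary
    "Julia": {"born": 1967, "cmpny": "IBM", "salary": 80},
    "Lisa": {"born": 1997, "school": "BSIT", "salary": 1},     # no cmpny
}


def salary_sum_per_company(table):
    """OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary);"""
    sums = defaultdict(int)
    for row in table.values():
        # A row takes part only if both referenced columns exist.
        if "cmpny" in row and "salary" in row:
            sums[row["cmpny"]] += row["salary"]
    return {company: {"salsum": total} for company, total in sums.items()}


result = salary_sum_per_company(pers)
print(result)
```

Carl and Lisa drop out without any explicit filter, which is how the sketch models a schema-flexible table where columns may be missing per row.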

Page 20: Incremental Data Transformations on Wide-Column Stores with NotaQL

Advanced Transformations: More Filters

Page 21: Incremental Data Transformations on Wide-Column Stores with NotaQL

RowId | info | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa | born: 1997, school: BSIT |

Peter, 1965, IBM, 70k

Lisa, 1997, BSIT

€10 €5

€0 €7

Page 22: Incremental Data Transformations on Wide-Column Stores with NotaQL

RowId | info | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa | born: 1997, school: BSIT |

OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

Page 23: Incremental Data Transformations on Wide-Column Stores with NotaQL

RowId | info | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa | born: 1997, school: BSIT |

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;

Page 24: Incremental Data Transformations on Wide-Column Stores with NotaQL

RowId | info | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa | born: 1997, school: BSIT |

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c) <- IN._v;

Page 25: Incremental Data Transformations on Wide-Column Stores with NotaQL

RowId | info | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa | born: 1997, school: BSIT |

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c?(@>5)) <- IN._v;

cell predicate

Page 26: Incremental Data Transformations on Wide-Column Stores with NotaQL

RowId | info | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa | born: 1997, school: BSIT |

IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c?(!Carl)) <- IN._v;

cell predicate
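The difference between the row predicate and the two cell predicates can be sketched in Python. The sketch assumes that all four children columns belong to Peter's row and that Lisa's row has none, that `?(@>5)` tests the cell value, and that `?(!Carl)` tests the column name. `copy_children` and `cell_pred` are illustrative names, not part of NotaQL.

```python
pers = {
    "Peter": {
        "info": {"born": 1965, "cmpny": "IBM", "salary": 70},
        "children": {"Lisa": 5, "Carl": 0, "Susi": 10, "Toni": 7},
    },
    "Lisa": {
        "info": {"born": 1997, "school": "BSIT"},
        "children": {},
    },
}


def copy_children(table, cell_pred):
    """IN-FILTER: COL_COUNT(children)>0, then copy matching children cells."""
    out = {}
    for row_id, families in table.items():
        children = families.get("children", {})
        if len(children) == 0:  # row predicate: skip rows without children
            continue
        for name, amount in children.items():
            if cell_pred(name, amount):  # cell predicate on one cell
                out.setdefault(row_id, {})[name] = amount
    return out


over_five = copy_children(pers, lambda name, amount: amount > 5)    # ?(@>5)
not_carl = copy_children(pers, lambda name, amount: name != "Carl")  # ?(!Carl)

print(over_five)
print(not_carl)
```

The row predicate decides once per row whether it is read at all, while the cell predicate is evaluated again for every column inside a surviving row.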

Page 27: Incremental Data Transformations on Wide-Column Stores with NotaQL

NotaQL Transformation Platform: MapReduce

Page 28: Incremental Data Transformations on Wide-Column Stores with NotaQL

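map(rowId, row)

1. Row violates row pred.? yes: Stop. no: continue.
2. Has more columns? no: Stop. yes: take the next column.
3. Cell violates cell pred.? yes: back to step 2. no: continue.
4. Map IN.{_r,_c,_v}, fetched columns and constants to r, c and v.
5. emit((r, c), v), then back to step 2.

Example: input row Peter (info: born 1965, cmpny IBM, salary 70k); output cell IBM, salsum: 70k; emitted pair ((IBM, salsum), 70k)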

Page 29: Incremental Data Transformations on Wide-Column Stores with NotaQL

reduce((r,c), {v})

put(r, c, aggregateAll(v))

Stop

((IBM, salsum), {70k, 80k, 10k})

((IBM, salsum), 160k)
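The map and reduce steps for the salary-sum query can be sketched as ordinary Python functions. This is a single-process illustration of the idea, not the NotaQL platform or the Hadoop API. The rows other than Peter are hypothetical and were chosen only so that the reducer sees the values 70, 80 and 10 from the example above.

```python
from collections import defaultdict

pers = {
    "Peter": {"born": 1965, "cmpny": "IBM", "salary": 70},
    "Julia": {"born": 1967, "cmpny": "IBM", "salary": 80},  # hypothetical row
    "Anna": {"born": 1970, "cmpny": "IBM", "salary": 10},   # hypothetical row
}


def map_row(row_id, row):
    """Emit ((r, c), v) pairs for OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary)."""
    if "cmpny" in row and "salary" in row:
        yield (row["cmpny"], "salsum"), row["salary"]


def reduce_cell(key, values):
    """Aggregate all values of one output cell; SUM is the aggregate here."""
    r, c = key
    return r, c, sum(values)


def run_job(table):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for row_id, row in table.items():
        for key, value in map_row(row_id, row):
            groups[key].append(value)
    out = {}
    for key, values in groups.items():
        r, c, v = reduce_cell(key, values)
        out.setdefault(r, {})[c] = v  # put(r, c, aggregateAll(v))
    return out


result = run_job(pers)
print(result)
```

Keying the intermediate pairs by (r, c) means every output cell is aggregated by exactly one reduce call.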

Page 30: Incremental Data Transformations on Wide-Column Stores with NotaQL

Incremental Transformations: Self-Maintainability

Page 31: Incremental Data Transformations on Wide-Column Stores with NotaQL

Salary sum of all people born before 1980 per company.

IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

17:45 Execute job

17:47 New people are added

17:50 Execute job again

Former result + Δ = new result

Page 32: Incremental Data Transformations on Wide-Column Stores with NotaQL

Salary sum of all people born before 1980 per company.

IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

17:45 Execute job: salsum of IBM is 150k

17:47 New people are added:

RowId | born | cmpny | salary
Melissa | 1989 | IBM | 50k
Nora | 1977 | IBM | 80k

17:50 Execute job again: salsum of IBM is 150k + 80k = 230k (Melissa, born 1989, fails the filter)

Page 33: Incremental Data Transformations on Wide-Column Stores with NotaQL

Salary sum of all people born before 1980 per company.

IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

17:50 Execute job: salsum of IBM is 230k

17:52 People are deleted:

RowId | born | cmpny | salary
Melissa | 1989 | IBM | 50k
Nora | 1977 | IBM | 80k

17:55 Execute job again: salsum of IBM is 230k - 80k = 150k (Melissa never contributed to the sum)
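The insert and delete cases above amount to adding or subtracting the filtered delta from the former result. A minimal Python sketch, assuming the former result is available as a dictionary and the changed rows are known; `apply_delta` and `passes_filter` are illustrative names, and salaries are in thousands.

```python
def passes_filter(row):
    """IN-FILTER: born<1980"""
    return row["born"] < 1980


def apply_delta(former, inserted=(), deleted=()):
    """Return the new salary sums from the former result and the changed rows."""
    result = dict(former)
    for sign, rows in ((+1, inserted), (-1, deleted)):
        for row in rows:
            if passes_filter(row):  # rows outside the filter never count
                company = row["cmpny"]
                result[company] = result.get(company, 0) + sign * row["salary"]
    return result


melissa = {"born": 1989, "cmpny": "IBM", "salary": 50}
nora = {"born": 1977, "cmpny": "IBM", "salary": 80}

after_insert = apply_delta({"IBM": 150}, inserted=[melissa, nora])
after_delete = apply_delta(after_insert, deleted=[melissa, nora])

print(after_insert)
print(after_delete)
```

This only works because SUM can be undone by subtraction. An aggregate such as MAX would need more than the former result to handle deletions.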

Page 34: Incremental Data Transformations on Wide-Column Stores with NotaQL

Salary sum of all people born before 1980 per company.

IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

17:50 Execute job

17:52 People are updated:

RowId | born | cmpny | salary
Peter | 1965 → 1990 | IBM | 70k

17:55 Execute job again: salsum of IBM -= 70k

Page 35: Incremental Data Transformations on Wide-Column Stores with NotaQL

Salary sum of all people born before 1980 per company.

IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

17:50 Execute job

17:52 People are updated:

RowId | born | cmpny | salary
Peter | 1990 → 1965 | IBM | 70k

17:55 Execute job again: salsum of IBM += 70k

Page 36: Incremental Data Transformations on Wide-Column Stores with NotaQL

Salary sum of all people born before 1980 per company.

IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

17:50 Execute job

17:52 People are updated:

RowId | born | cmpny | salary
Peter | 1965 | IBM | 70k → 75k

17:55 Execute job again: salsum of IBM += 75k - 70k

Page 37: Incremental Data Transformations on Wide-Column Stores with NotaQL

Salary sum of all people born before 1980 per company.

IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);

17:50 Execute job

17:52 People are updated:

RowId | born | cmpny | salary
Peter | 1965 | IBM → SAP | 70k

17:55 Execute job again: salsum of IBM -= 70k
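The update cases above can all be handled by one rule: subtract what the old version of the row contributed and add what the new version contributes. The Python sketch below is an illustration under that assumption, not the platform's actual code. `contribution` and `apply_update` are hypothetical names, and the SAP entry in the last case is what this rule implies rather than something shown on the slide.

```python
def contribution(row):
    """What one row adds to the result under IN-FILTER: born<1980."""
    if row["born"] < 1980:
        return {row["cmpny"]: row["salary"]}
    return {}  # the row violates the row predicate


def apply_update(former, old_row, new_row):
    result = dict(former)
    for company, salary in contribution(old_row).items():
        result[company] = result.get(company, 0) - salary  # inverted old value
    for company, salary in contribution(new_row).items():
        result[company] = result.get(company, 0) + salary
    return result


peter = {"born": 1965, "cmpny": "IBM", "salary": 70}
former = {"IBM": 230}

born_later = apply_update(former, peter, {**peter, "born": 1990})
born_back = apply_update(born_later, {**peter, "born": 1990}, peter)
raise_salary = apply_update(former, peter, {**peter, "salary": 75})
new_company = apply_update(former, peter, {**peter, "cmpny": "SAP"})

print(born_later, born_back, raise_salary, new_company)
```

The rule needs the old version of each changed row, which is why the later slides ask whether the predicate was satisfied on the previous execution.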

Page 38: Incremental Data Transformations on Wide-Column Stores with NotaQL

map(rowId, row)

1. Row violates row pred.? yes: Stop. no: continue.
2. Has more columns? no: Stop. yes: take the next column.
3. Cell violates cell pred.? yes: back to step 2. no: continue.
4. Map IN.{_r,_c,_v}, fetched columns and constants to r, c and v.
5. emit((r, c), v), then back to step 2.

reduce((r,c), {v})

put(r, c, aggregateAll(v))

Stop

Just read Delta (ts > ts')

Page 39: Incremental Data Transformations on Wide-Column Stores with NotaQL

map(rowId, row)

1. Row violates row pred.? yes: Stop. no: continue.
2. Has more columns? no: Stop. yes: take the next column.
3. Cell violates cell pred.? yes: back to step 2. no: continue.
4. Map IN.{_r,_c,_v}, fetched columns and constants to r, c and v.
5. emit((r, c), v), then back to step 2.

reduce((r,c), {v})

put(r, c, aggregateAll(v))

Stop

…and the former Result

Page 40: Incremental Data Transformations on Wide-Column Stores with NotaQL

map(rowId, row)

1. Row violates row pred.? yes: Stop. no: continue.
2. Has more columns? no: Stop. yes: take the next column.
3. Cell violates cell pred.? yes: back to step 2. no: continue.
4. Map IN.{_r,_c,_v}, fetched columns and constants to r, c and v.
5. emit((r, c), v), then back to step 2.

reduce((r,c), {v})

put(r, c, aggregateAll(v))

Stop

Was it satisfied on previous execution?

Yes? Invert v

*

Page 41: Incremental Data Transformations on Wide-Column Stores with NotaQL

map(rowId, row)

1. Row violates row pred.? yes: Stop. no: continue.
2. Has more columns? no: Stop. yes: take the next column.
3. Cell violates cell pred.? yes: back to step 2. no: continue.
4. Map IN.{_r,_c,_v}, fetched columns and constants to r, c and v.
5. emit((r, c), v), then back to step 2.

reduce((r,c), {v})

put(r, c, aggregateAll(v))

Stop

Unchanged?

Page 42: Incremental Data Transformations on Wide-Column Stores with NotaQL

Evaluation: Performance

Page 43: Incremental Data Transformations on Wide-Column Stores with NotaQL

[Bar chart, y-axis from 0 to 100]

Workloads: TPC-H 1 (selection, projection, sum of prices per status code); TPC-H 6 (selection, projection, sum of all prices); Reverse Web-Link Graph

Series: Native Hadoop; NotaQL (non-inc.); NotaQL (0.1% changes, timestamp-based CDC); NotaQL (10% changes, manual CDC)

Page 44: Incremental Data Transformations on Wide-Column Stores with NotaQL

Conclusion

• Selection, Projection
• Grouping, Aggregation
• Schema-Flexible
• Horizontal Aggregation
• Metadata ↔ Data
• Graph Processing
• Text Processing

SQL

Incremental!

Page 45: Incremental Data Transformations on Wide-Column Stores with NotaQL

Thank you!