OLAP1

download OLAP1

of 31

Transcript of OLAP1

  • 8/17/2019 OLAP1

    1/31

    Data Warehousing and OLAP

    dr. Toon [email protected]

    Motivation• « Traditional » relational databases are

    geared towards online transactionprocessing: – bank terminal – flight reservations – student administration

    • Decision support systems have differentrequirements

  • 8/17/2019 OLAP1

    2/31

    Transaction Processing

    Transaction processing

    • Operational setting• Up-to-date = critical

    • Simple data

    • Simple queries; only« touch » a small partof the database

    Flight reservations

    • ticket sales• do not sell a seat

    twice• reservation, date,

    name

    • Give flight details of XList flights to Y

    Transaction Processing• Database must support

    – simple data• tables

    – simple queries• select from where …

    – consistency & integrity CRITICAL – concurrency

    • Relational databases, Object-Oriented,Object-Relational

  • 8/17/2019 OLAP1

    3/31

    Decision Support

    Decision support

    • Off-line setting• « Historical » data• Summarized data

    • Integrate different

    databases• Statistical queries

    Flight company

    • Evaluate ROI flights• Flights of last year• # passengers per

    carrier for destination X• Passengers, fuel costs,

    maintenance info• Average % of seats

    sold/month/destination

    Outline of this lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data Cubes• Query languages for supporting OLAP

    – SQL extensions

    – MDX• Database Explosion Problem

  • 8/17/2019 OLAP1

    4/31

    A decision support DB maintained separatelyfrom the operational databases.

    Why Separate Data Warehouse?• Different functions

    – DBMS— tuned for OLTP – Warehouse—tuned for OLAP

    • Different data – Decision support requires historical data

    • Integration of data from heterogeneoussources

    Data Warehouse

    Three-Tier Architecture

    DataWarehouse

    ExtractTransformLoadRefresh

    OLAP Engine

    Monitor&

    IntegratorMetadata

    Data Sources Front-End Tools

    Serve

    Data Marts

    Operational

    DBs

    othersources

    Data Storage

    OLAPServer

    Analysis

    Query/Reporting

    Data Mining

    ROLAPServer

  • 8/17/2019 OLAP1

    5/31

    Three-Tier Architecture

    • Extract-Transform-Load – Get data from different sources; data

    integration – Cleaning the data – Takes significant part of the effort (up to 80%!)

    • Refresh – Keep the data warehouse up to date when

    source data changes

    Three-Tier Architecture• Data storage

    – Optimized for OLAP – Specialized data structures – Specialized indexing structures

    • Data marts – common term to refer to “less ambitious data

    warehouses” – Task oriented, departmental level

  • 8/17/2019 OLAP1

    6/31

    OLAP

    • OLAP = OnLine Analytical Processing – Online = no waiting for answers

    • OLAP system = system that supportsanalytical queries that are dimensional in nature.

    Outline of the lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP

    – SQL extensions

    – MDX• Database Explosion Problem

  • 8/17/2019 OLAP1

    7/31

    Examples of Queries

    • Flight company: evaluate ticket sales – give total, average, minimal, maximal amount – per date: week, month, year – by destination/source port/country/continent – by ticket type – by # of connections – …

    Common Characteristics• One (or few) special attribute(s): amount

    measure

    • Other attributes: select relevant regions dimensions

    • Different levels of generality (month, year)

    hierarchies• Measurement data is summarized: sum,

    min, max, averageaggregations

  • 8/17/2019 OLAP1

    8/31

    Supermarket Example

    • Evaluate the sales of products – Product cost in $ – Customer: ID, city, state, country, – Store: chain, size, location, – Product: brand, type, … – …

    • What are the measure and dimensionalattributes, where are the hierarchies?

    measure

    Dim.hierarchies

    Why Dimensions?

    customer

    store

    product

    Cost in $

    • Multidimensional view on the data

  • 8/17/2019 OLAP1

    9/31

    Cross Tabulation

    • Cross-tabulations are highly useful – Sales of clothes June August ‘06

    138512265August

    57032967174Total

    1981202058July

    2341582551June

    TotalOrangeRedBlue

    Product: color

    Date:month,June August2006

    Data Cubes• Extension of Cross-Tables to multiple

    dimensions• Conceptual notion

    138512265August

    57032967174Total

    1981202058July

    2341582551June

    TotalOrangeRedBlue

    Dimensions

    Data Points/ 1st level of aggregation

    Aggregatedw.r.t. X-dim

    Aggregatedw.r.t. Y-dim

    Aggregatedw.r.t. X and Y

  • 8/17/2019 OLAP1

    10/31

    Data CubesDate

    P r o d

    u c t

    C o u n t r y

    sum

    sum TV

    VCRPC

    1Qtr 2Qtr 3Qtr 4Qtr

    Ireland

    France

    Germany

    sum

    Data Cubes

    1) #TV’s sold in the 2Qtr?2) Total sales in 3Qtr?3) #products sold in France

    this year4) #PC’s in Ireland in 2Qtr?

    sum 1Qtr 2Qtr 3Qtr 4Qtr

    Ireland

    France

    Germany

    sum

    sum

    TV

    VCRPC

  • 8/17/2019 OLAP1

    11/31

    Data Cubes

    sum

    sum

    4

    2

    3

    TV

    VCRPC

    1Qtr 2Qtr 3Qtr 4Qtr

    Ireland

    France

    Germany

    sum

    1) #TV’s sold in the 2Qtr?

    2) Total sales in 3Qtr?3) #products sold in France

    this year4) #PC’s in Ireland in 2Qtr?

    In the back, at thebottom, second column

    Outline of the lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP

    – SQL extensions

    – MDX• Database Explosion Problem

  • 8/17/2019 OLAP1

    12/31

    Operations with Data Cubes

    • What operations can you think of that ananalyst might find useful? (e.g., store)

    Operations with Data Cubes• What operations can you think of that an

    analyst might find useful? (e.g., store) – only look at stores in the Netherlands – look at cities instead of individual stores – look at the cross-table for product-date – restrict analysis to 2006, product O1 – go back to a finer granularity at the store level

  • 8/17/2019 OLAP1

    13/31

    Roll-Up

    • Move in one dimension from a lowergranularity to a higher one – store city – cities country – product product type – Quarter Semester

    Roll-Up

    P r o d

    u c t

    C o u n t r y

    sum

    sum TV

    VCRPC

    1Qtr 2Qtr 3Qtr 4Qtr

    Ireland

    France

    Germany

    sum

    Date

  • 8/17/2019 OLAP1

    14/31

    Roll-Up

    P r o d

    u c t

    C o u n t r y

    sum

    sum TV

    VCRPC

    1st semester

    Ireland

    France

    Germany

    sum

    Date2nd semester

    Drill-down• Inverse operation• Move in one dimension from a higher

    granularity to a lower one – city store – country cities – product type product

    • Drill-through: – go back to the original, individual data records

  • 8/17/2019 OLAP1

    15/31

    Pivoting

    • Change the dimensions that are“displayed”; select a cross-tab. – look at the cross-table for product-date – display cross-table for date-customer

    Pivoting• Change the dimensions that are

    “displayed”; select a cross-tab. – look at the cross-table for product-date – display cross-table for date-customer

    Sales Date

    1st sem 2 nd sem Total

    Ireland 20 23 43France 126 138 264

    Country Germany 56 48 104

    Total 202 209 411

  • 8/17/2019 OLAP1

    16/31

    Slice & dice

    • Select a part of the cube by restricting oneor more dimensions – restrict analysis to Ireland and VCR

    Slice & dice

    P r o d

    u c t

    C o u n t r y

    sum

    sum TV

    VCRPC

    1Qtr 2Qtr 3Qtr 4Qtr

    Ireland

    France

    Germany

    sum

  • 8/17/2019 OLAP1

    17/31

    Slice & dice

    • Select a part of the cube by restricting oneor more dimensions – restrict analysis to Ireland and VCR

    sum 1Qtr 2Qtr 3Qtr 4Qtr

    Outline of the lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP

    – SQL extensions

    – MDX• Database Explosion Problem

  • 8/17/2019 OLAP1

    18/31

    Extended Aggregation

    • SQL-92 aggregation quite limited – Many useful aggregates are either very hard

    or impossible to specify• Data cube• Complex aggregates (median, variance)• binary aggregates (correlation, regression curves)• ranking queries (“assign each student a rank

    based on the total marks”)

    • SQL:1999 OLAP extensions

    Representing the Cube• Special value « null » is used:

    Sales Date1st sem 2 nd sem Total

    Ireland 20 23 43France 126 138 264

    Country Germany 56 48 104

    Total 202 209 411

  • 8/17/2019 OLAP1

    19/31

    Representing the Cube

    • Special value « null » is used:

    411nullnull

    104Germanynull

    264Francenull

    43Irelandnull

    209null2nd semester

    48Germany2nd semester

    138France2nd semester

    23Ireland2nd semester

    202null1st semester

    56Germany1st semester

    126France1st semester

    20Ireland1st semester

    SalesCountryDate

    Group by Cube• group by cube :

    select item-name, color, size, sum (number )from sales group by cube (item-name, color, size )

    Computes the union of eight differentgroupings of the sales relation:{ (item-name, color, size ), (item-name, color ),

    (item-name, size ),(color, size ),(item-name ),(color ),(size ), ( ) }

  • 8/17/2019 OLAP1

    20/31

    Group by Cube

    • Relational representation of the date-country-sales cube can be computed asfollows:

    select semester as date , country, sum (sales)from sales group by cube (semester,country )

    – grouping() and decode() can be applied to

    replace “null” by other constant:• decode (grouping (semester), 1, ‘all’, semester )

    Group by Rollup• rollup construct generates union on every

    prefix of specified list of attributesselect item-name , color , size , sum (number )from sales group by rollup (item-name, color, size )

    Generates union of four groupings:

    { (item-name, color, size ), (item-name,color ), (item-name ), ( ) }

  • 8/17/2019 OLAP1

    21/31

    Group by Rollup

    • Rollup can be used to generateaggregates at multiple levels.

    • E.g., suppose itemcategory (item-name,category ) gives category of each item.

    select category, item-name , sum (number )from sales, itemcategory where sales.item-name = itemcategory.item-name

    group by rollup (category, item-name )gives a hierarchical summary by item- name and by category.

    Group by Cube & Rollup• Multiple rollups and cubes can be used in

    a single group by clause – Each generates set of group by lists, cross

    product of sets gives overall set of group bylists

  • 8/17/2019 OLAP1

    22/31

    Example

    select item-name, color, size , sum (number )from sales group by rollup (item-name ), rollup (color, size )

    generates the groupings{item-name, ()} X {(color, size), (color), ()}

    ={ (item-name, color, size ), (item-name, color ),(item-name ),(color, size ), (color ), ( ) }

    Ranking• Ranking is done in conjunction with an

    order by specification.• Given a relation student-marks(student-id,

    marks) find the rank of each student.select student-id , rank ( ) over (order by marks desc) as s-rank from student-marks

  • 8/17/2019 OLAP1

    23/31

    Ranking

    • An extra order by clause is needed to getthem in sorted orderselect student-id , rank ( ) over (order by marks desc )as s-rank from student-marksorder by s-rank

    • Ranking may leave gaps: e.g. 2 studentshave top mark, then both have rank 1, andthe next rank is 3 – dense_rank does not leave gaps, so next

    dense rank would be 2

    Ranking (Cont.)• Ranking can be done within partitions.• “Find the rank of students within each

    section.”select student-id, section , rank ( ) over (partition by sectionorder by marks desc ) as sec-rank from student-marks, student-section where student-marks.student-id = student-section.student-id order by section, sec-rank

    • Multiple rank clauses can occur in a singleselect clause

  • 8/17/2019 OLAP1

    24/31

    Exercises

    • Find students with top n ranks

    • Rank students by sum of their marks indifferent courses

    given relation student-course-

    marks(student-id, course, marks)

    Ranking (Cont.)• Other ranking functions:

    – percent_rank – cume_dist (cumulative distribution) – row_number

    • SQL:1999 permits the user to specifynulls first or nulls last

  • 8/17/2019 OLAP1

    25/31

  • 8/17/2019 OLAP1

    26/31

    Outline of the lecture

    Online Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP

    – SQL extensions – MDX

    • Database Explosion Problem

    Implementation• To make query answering more efficient:

    consolidate (materialize) all aggregations

    • Early implementations used amultidimensional array.

    – Fast lookup: cell(prod. p, date d, prom. pr):• look up index of p1, index of d, index of pr:index = (p x D x PR) + (d x PR) + pr

  • 8/17/2019 OLAP1

    27/31

    Implementation

    • Multidimensional array – obvious problem: sparse data

    can easily be solved, though.

    Example:binary search tree, key on indexhash table.

    Implementation• However: very quickly people were

    confronted with the Data ExplosionProblem

    Consolidating the summaries blows up the dataenormously !

    Reasons are often misunderstood and confusing.

  • 8/17/2019 OLAP1

    28/31

    Data Explosion Problem

    • Why?Suppose:

    – n dimensions, every dimension has d values – d n possible tuples.

    – Number of cells in the cube: (d+1) n

    – So, this is not the problem

    Data Explosion Problem• Why?Suppose

    – n dimensions, every dimension has d values – every dimension has a hierarchy

    – most extreme case: binary tree2d possibilities/dimension

  • 8/17/2019 OLAP1

    29/31

    Data Explosion Problem

    • Why?Suppose

    – n dimensions, every dimension has d values – every dimension has a hierarchy – most extreme case: binary tree

    2d possibilities/dimension 2n x dn cells

    Only partial explanation (factor 2 n comesfrom an extremely pathological case)

    Data Explosion Problem• Why?

    – The problem is that most data is not dense,but sparse .

    – Hence, not all d n combinations are possible.

    Example : 10 dimensions with 10 values – 10 000 000 000 possibilities

    Suppose « only » 1 000 000 are present

  • 8/17/2019 OLAP1

    30/31

    Data Explosion Problem

    Example : 10 dimensions with 10 values – 10 000 000 000 possibilities

    Suppose « only » 1 000 000 are presentEvery tuple increases count of 2 10 cells !

    With hierarchies: effect even worse!

    If every hierarchy has 5 items:510 = 9 765 625 cells!

    Data Explosion• Conclusion

    – Straightforward implementation as a multi-dimensional cube won’t work for large numberof dimensions

  • 8/17/2019 OLAP1

    31/31

    Summary

    • Datawarehouses supporting OLAP fordecision support

    • Data Cubes as a conceptual model – Measurement, dimensions, hierarchy,

    aggregation

    • Queries – Roll-up, Drill-down, Slice and dice, pivoting… – SQL:1999 extensions for supporting OLAP

    • Straightforward implementation isproblematic

    Next Lecture• How can we implement this?

    – ROLAP – MOLAP – (HOLAP)

    • How can we do this efficiently?

    – Multicubes – Indexing• bitmap, join, …