Using Wide Table to Manage Web Data: A Survey
Bin Yang


Outline

  Why use the Wide Table
  Data Model
  Physical Implementation
  Distributed Deployment
  Query Execution
  Other Issues

Why use the Wide Table? Several Scenarios
  New generation of e-commerce applications
    By Rakesh Agrawal, VLDB 2001, ICDE 2003
    By Jennifer L. Beckmann, ICDE 2006, and Eric Chu, SIGMOD 2007
  Data publishing and annotating services
    del.icio.us, Flickr
    UTab, CIDR 2007
  Novel web applications
    Google Base: allows users to structure their uploaded content
    eBay: marketplace
    Craigslist: community portal
  Search engine crawler data
    Bigtable
  Medical information systems
  Distributed workload management systems
    Condor
  Semantic Web data storage
    By Daniel J. Abadi, VLDB 2007, Best Paper

The Characteristics of Wide Tables
  Data
    Very wide schema with many attributes
    Sparsity
    Schema evolves constantly and quickly
  Query
    More flexible queries than SQL
    More structured than keyword search

Conventional Database: Horizontal Representation
  N-ary horizontal representation
    Optimized for dense and slowly evolving data
  Problems
    Large number of columns. Current database systems restrict the number of columns in one table, e.g. 1012 in DB2 and about 1000 in Oracle.
    Sparsity. Many NULLs occupy storage, increase the size of indexes, and can even affect sort results.
    Schema evolution. Frequently altering the schema is expensive.
    Query performance. If only a few columns are used in a query, the query incurs a large performance penalty.
  Debate
    Whether such a wide table is a real need or just inappropriate design
    The distribution of non-NULL values is unknown at schema design time
    Schema evolution
    Reconstruction cost

Data Model
  n-ary Horizontal Representation
  2-ary Binary Representation
  3-ary Vertical Representation
  Hybrid Representation

n-ary Horizontal Representation
  A basic example: see Figure 1.
  As many columns as the number of attributes
  When a new attribute is introduced, the schema has to be altered
  Another approach: attributes are divided into "dense" and "sparse". Dense attributes are stored in a horizontal table; sparse attributes are stored in a plain-text object.

2-ary Binary Representation
  DSM (Decomposition Storage Model), SIGMOD 1985
  Decomposes a horizontal table into as many 2-ary tables as it has columns.
  Number of columns. Just two.
  Sparsity. NULLs need not be stored. A left outer join reconstructs the horizontal table.
  Schema evolution. Equivalent to adding or deleting a DSM table. On the other hand, too many tables make the DBMS hard to manage.
  Performance of queries that involve only a few attributes improves.
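The decomposition and the left-outer-join reconstruction can be illustrated with a minimal in-memory sketch. It assumes that a row's position in the input list serves as its surrogate; the names `decompose` and `reconstruct` are hypothetical and do not come from the DSM paper.

```python
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Optional[Any]]
BinaryTable = Dict[int, Any]  # surrogate -> value; NULLs are absent


def decompose(rows: List[Row], attributes: List[str]) -> Dict[str, BinaryTable]:
    """Split a horizontal table into one 2-ary table per attribute."""
    tables: Dict[str, BinaryTable] = {attr: {} for attr in attributes}
    for surrogate, row in enumerate(rows):  # assumption: list position is the surrogate
        for attr in attributes:
            value = row.get(attr)
            if value is not None:  # NULLs are simply not stored
                tables[attr][surrogate] = value
    return tables


def reconstruct(tables: Dict[str, BinaryTable], surrogates: Iterable[int]) -> List[Row]:
    """Left outer join every binary table onto the list of surrogates."""
    return [
        {attr: table.get(s) for attr, table in tables.items()}  # missing -> None
        for s in surrogates
    ]
```

Adding an attribute in this sketch amounts to adding one more entry to the dictionary of binary tables, which mirrors the schema-evolution point above.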

3-ary Vertical Representation
  Number of columns. Just three (object identifier, attribute name, value).
  Sparsity. NULLs need not be stored.
  Schema evolution. Equivalent to adding or deleting a row.
  Queries over vertical tables are much more complicated.
  Unlike the horizontal representation, the binary and vertical representations decouple the logical and physical storage of entities.
  Because most current applications and development tools are designed for the horizontal format, reconstruction from either the binary or the vertical representation to a horizontal table is necessary.
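A small sketch of the vertical representation, under the assumption that each row carries an identifier column, may make the trade-off concrete. The function names are hypothetical. Note how a conjunctive query needs one pass per predicate, which plays the role of a self-join over the triples.

```python
from typing import Any, Dict, List, Optional, Set, Tuple

Triple = Tuple[Any, str, Any]  # (object identifier, attribute name, value)


def to_vertical(rows: List[Dict[str, Any]], id_attr: str) -> List[Triple]:
    """Store one triple per non-NULL attribute value."""
    return [
        (row[id_attr], attr, value)
        for row in rows
        for attr, value in row.items()
        if attr != id_attr and value is not None
    ]


def to_horizontal(triples: List[Triple], attributes: List[str]) -> Dict[Any, Dict[str, Optional[Any]]]:
    """Pivot the triples back into one horizontal row per object (V2H)."""
    result: Dict[Any, Dict[str, Optional[Any]]] = {}
    for oid, attr, value in triples:
        result.setdefault(oid, {a: None for a in attributes})[attr] = value
    return result


def objects_matching(triples: List[Triple], conditions: Dict[str, Any]) -> Set[Any]:
    """Conjunctive selection: one scan per predicate, intersected like a self-join."""
    result: Optional[Set[Any]] = None
    for attr, value in conditions.items():
        matches = {oid for oid, a, v in triples if a == attr and v == value}
        result = matches if result is None else result & matches
    return result if result is not None else set()
```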

Hybrid Representation
  Used in Bigtable and HBase
  Attributes are divided into several column families.
  Row key + timestamp plays the role of the surrogate in DSM.
  Fewer tables than DSM.
  The vertical representation is used in each table.

Physical Implementation
  Row-Oriented Storage (stores data row by row)
    Positional Format
    Interpreted Format
  Column-Oriented Storage (stores data column by column)

Positional Format
  Always has a header
    Relation-id: schema
    Tuple-id, tuple length
    NULL-bitmap, timestamp
  Fixed length
    8, 255, 1, 10, 4 bytes respectively
  Variable length
    A, B: fixed; C, D, E: variable
    Offset:length pairs and pointers are in the header
    Variable-length attributes always follow the fixed-length attributes.
  A fixed-length attribute is pre-allocated whether or not it is NULL.
  Schema evolution is expensive for both fixed- and variable-length records.

Dealing with NULLs in Positional Format Storage
  NULL-bitmap: the record header uses it to indicate which attributes are NULL
    One bit per attribute in the NULL-bitmap
    The full size of a fixed-length attribute is wasted
    A variable-length attribute is omitted
  Bitmap only
    Fixed-length attributes are not pre-allocated
    Locating an attribute is more complicated than with pre-allocation
    PostgreSQL
  A special value: its size should be small
  Length-data pairs: length is 0
  Interpreted format
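The difference between a pre-allocated NULL-bitmap record and a bitmap-only record can be sketched as follows. This is an illustrative layout only, assuming every attribute is a 4-byte little-endian integer and the header consists of nothing but the bitmap; it is not the on-disk format of PostgreSQL or any other system.

```python
import struct
from typing import List, Optional

WIDTH = 4  # assumption: every attribute is a fixed-length 4-byte integer


def encode(values: List[Optional[int]], preallocate: bool) -> bytes:
    """Encode a record as NULL bitmap + attribute bytes.

    Bit i of the bitmap is set when attribute i is NULL. With preallocate=True
    a NULL still occupies WIDTH bytes; with preallocate=False it occupies none.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    body = bytearray()
    for i, value in enumerate(values):
        if value is None:
            bitmap[i // 8] |= 1 << (i % 8)
            if preallocate:
                body += b"\x00" * WIDTH  # wasted space
        else:
            body += struct.pack("<i", value)
    return bytes(bitmap) + bytes(body)


def decode(record: bytes, n: int, preallocate: bool) -> List[Optional[int]]:
    """Inverse of encode; the bitmap-only layout must walk the bitmap to find offsets."""
    bitmap_len = (n + 7) // 8
    bitmap, body = record[:bitmap_len], record[bitmap_len:]
    values: List[Optional[int]] = []
    offset = 0
    for i in range(n):
        is_null = (bitmap[i // 8] >> (i % 8)) & 1
        if is_null:
            values.append(None)
            if preallocate:
                offset += WIDTH
        else:
            values.append(struct.unpack_from("<i", body, offset)[0])
            offset += WIDTH
    return values
```

With pre-allocation the offset of attribute i is always `i * WIDTH`, whereas the bitmap-only layout has to count the non-NULL attributes before it, which is the extra location cost noted above.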

Interpreted Format

Column-Oriented Storage
  Row-oriented storage is write-optimized
    Pushes all of the fields of a single record out to disk
  Column-oriented storage is read-optimized
    First applied in data warehouse systems
  Advantages
    Supports efficient queries over wide schemas; improves bandwidth utilization and cache locality
    Deals efficiently with sparse columns; improves data compression
    Uses CPU cycles to save disk bandwidth
    Queries run on the compressed representation; data is decompressed only when presented to the user
  Disadvantages
    Increased cost of inserts
    Increased record reconstruction cost
  Typical examples of column-oriented systems
    C-Store http://db.csail.mit.edu/projects/cstore/
    MonetDB http://monetdb.cwi.nl/projects/monetdb/Home/index.html


Compression Methods
  Run-length encoding (RLE)
  A list of non-NULL values
  Offset
  Bitmap
  Position ranges
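As a rough illustration of two of these ideas on a sparse column, the sketch below run-length encodes a column and, alternatively, keeps only the non-NULL values together with their positions. The function names are hypothetical, and `None` stands in for NULL.

```python
from itertools import groupby
from typing import Any, List, Optional, Tuple


def rle_encode(column: List[Optional[Any]]) -> List[Tuple[Optional[Any], int]]:
    """Collapse each run of equal values (including runs of NULLs) into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(runs: List[Tuple[Optional[Any], int]]) -> List[Optional[Any]]:
    """Expand (value, count) pairs back into the original column."""
    column: List[Optional[Any]] = []
    for value, count in runs:
        column.extend([value] * count)
    return column


def position_list_encode(column: List[Optional[Any]]) -> List[Tuple[int, Any]]:
    """Keep only the non-NULL values, each with its position in the column."""
    return [(pos, value) for pos, value in enumerate(column) if value is not None]
```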

Challenge
  A "storage wizard" to decide automatically: positional? Interpreted? Horizontal? Vertical?
  According to density and frequency of access

Distributed Deployment
  Goal: more storage capacity, high availability, and high performance
  Many nodes, each with private disk and private memory (shared-nothing architecture)
  GFS (Google File System)
    One master
      Single point of failure; shadow masters
      Similar to Napster
    Many chunkservers
      Data is transferred between chunkservers without interaction with the master
    Each chunk has 3 replicas
      Availability and performance
  Bigtable
    Based on GFS
    Partitioned horizontally by row key; each partition is called a tablet
  C-Store
    Only implements projections
    Each projection is partitioned horizontally
    Different projections may contain the same columns
    Projections may have replicas, and different replicas can have different sort orders
    K-safe: can tolerate K failures

Query Execution
  Query on Binary Representation
  Query on Vertical Representation
  Query on Interpreted Representation
  Partial and sparse index
  Keyword search
  Partitions, Hidden Schema and Virtual Relation
  Ranking

Query on Binary Representation
  Suitable for queries that involve a small number of attributes
  Transformation between binary and horizontal (B2H)
  Stores non-NULL values only
  Physical layout
    One copy is clustered on the surrogate; suitable for B2H
    The other is clustered on the attribute value; suitable for specific queries

Query on Vertical Representation
  Transformation between vertical and horizontal (V2H)
  Physical layout
    One copy is clustered on the object identifier
    The other is clustered on the attribute name; suitable for V2H

Query on Interpreted Representation
  No need to reconstruct the horizontal representation
  EXTRACTION operation
    Gets the offset of each attribute
    Expensive; executed in batch

Partial Index
  In a horizontal table, columns involved in frequent queries are always indexed.
  In a vertical table, all three columns are indexed
    The entire table is indexed, so the index is much bigger
    The large indexes adversely impact the performance of the vertical representation (each update has to modify the index)
  Partial index: only the rows of interest need to be indexed
    Proposed by Stonebraker, SIGMOD Record 1989
    Uses a predicate; only the tuples for which it evaluates to true are indexed
    Challenge: how to identify the rows of real interest
  Format:
    CREATE index-type INDEX on relationname(column name) where predicate
    CREATE B-tree INDEX on EMP(salary) where salary < 500

Sparse Index
  In an interpreted table, only the non-NULL values are indexed.
  Index size is proportional to the number of non-NULL values.
  For insertions and deletions, only the indexes on the attributes that are non-NULL need to be updated.
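Both ideas can be tried out with SQLite, which supports partial indexes through a `WHERE` clause on `CREATE INDEX`. The sketch below is only an approximation: SQLite's syntax differs from the format shown above (it has no `index-type` keyword), the `emp` table and its rows are made up for illustration, and the "sparse index" is emulated here as a partial index with an `IS NOT NULL` predicate on a horizontal table rather than on an interpreted one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER, bonus INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("ann", 300, None), ("bob", 900, 50), ("cid", 450, None)],  # hypothetical data
)

# Partial index: only rows satisfying the predicate are indexed.
conn.execute("CREATE INDEX emp_low_salary ON emp(salary) WHERE salary < 500")

# Sparse-index analogue: index only the non-NULL values of a sparse column.
conn.execute("CREATE INDEX emp_bonus ON emp(bonus) WHERE bonus IS NOT NULL")

low_paid = [name for (name,) in conn.execute(
    "SELECT name FROM emp WHERE salary < 500 ORDER BY name")]
index_names = sorted(name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"))
```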

Keyword Search
  The most suitable way
    Ordinary users don't know the exact attribute names, because there are "too many attributes"
    Ordinary users may not be able to write a SQL query
  Inverted index, as in classical IR
    One inverted index on data values (traditional IR)
    One inverted index on attribute names (UTab)
  Problems
    Too many records may contain a specific keyword
      Zipf-like distribution, accepted by most users
        Number of attributes containing the term
        Number of rows containing the term
    Imprecision
      A user may want to find the keyword in some specific attribute
      More structured keyword search
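A minimal sketch of the two inverted indexes, one over data values and one over attribute names, is given below. It assumes whitespace tokenization and lowercasing, and indexes an attribute name only for rows where that attribute is non-NULL; none of this is taken from UTab's actual implementation.

```python
from collections import defaultdict
from typing import Any, Dict, Optional, Set, Tuple

Rows = Dict[int, Dict[str, Optional[Any]]]  # row id -> {attribute: value}


def build_inverted_indexes(rows: Rows) -> Tuple[Dict[str, Set[int]], Dict[str, Set[int]]]:
    """Return (value_index, attr_index), each mapping a term to a set of row ids."""
    value_index: Dict[str, Set[int]] = defaultdict(set)
    attr_index: Dict[str, Set[int]] = defaultdict(set)
    for rid, row in rows.items():
        for attr, value in row.items():
            if value is None:  # NULL attributes contribute nothing
                continue
            attr_index[attr.lower()].add(rid)
            for term in str(value).lower().split():
                value_index[term].add(rid)
    return dict(value_index), dict(attr_index)
```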

An Example of Structured Keyword Search
  Fuzzy attribute
    Name-based schema matching techniques
    WordNet semantic dictionary
  Suppose A2 is a fuzzy attribute, and A2 is similar to C21 and C22
  Alternative
    Run keyword search on the data values to obtain a set of objects Z.
    Find A, the set of attributes in Z that contain the keyword.
    Match A against the fuzzy attributes to get B.
    Return the objects in Z that have an attribute in B.
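The four steps of the alternative can be sketched as below. Two assumptions are made loudly: name similarity is approximated with `difflib.SequenceMatcher` and a threshold of 0.8 as a stand-in for the name-based schema matching or WordNet lookup mentioned above, and the last step is read as "the keyword occurs in an attribute belonging to B". The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List, Optional, Set

Rows = Dict[int, Dict[str, Optional[Any]]]  # object id -> {attribute: value}


def similar(a: str, b: str, threshold: float) -> bool:
    """Stand-in for schema matching: plain string similarity of attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def structured_keyword_search(rows: Rows, keyword: str, fuzzy_attr: str,
                              threshold: float = 0.8) -> List[int]:
    kw = keyword.lower()
    # Step 1: keyword search on data values gives Z (with the matching attributes).
    hits: Dict[int, Set[str]] = {}
    for oid, row in rows.items():
        attrs = {a for a, v in row.items() if v is not None and kw in str(v).lower()}
        if attrs:
            hits[oid] = attrs
    # Step 2: A is the set of attributes in Z that contain the keyword.
    candidate_attrs: Set[str] = set().union(*hits.values())
    # Step 3: match A against the fuzzy attribute to get B.
    matched = {a for a in candidate_attrs if similar(a, fuzzy_attr, threshold)}
    # Step 4: keep the objects in Z whose keyword match lies in an attribute of B.
    return sorted(oid for oid, attrs in hits.items() if attrs & matched)
```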

Partitions, Hidden Schema and Virtual Relation
  Vertically partition the data set in a wide table
  Scanning a vertical partition is more efficient than scanning the base table
  How to partition
    A reasonable number of partitions
    Partitions contain minimal NULL values
    Each base-table tuple is preferably stored entirely in one partition
  A common way
    Group together co-occurring attributes
    Challenge: how to define the degree of co-occurrence
  Classification
    Disjoint partitions (Hidden Schema)
    Joint partitions (Virtual Relation)
  After obtaining the partitions
    Materialized views (positional format, because they are dense)
    Covering index (a way to find an efficient partial index)
    Provide a browsing-based interface

Hidden Schema (by Eric Chu and Jeffrey Naughton, University of Wisconsin-Madison)
  Jaccard coefficient
  Statistical information
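One way to quantify how strongly two attributes co-occur is the Jaccard coefficient over the sets of rows in which each attribute is non-NULL. The sketch below computes only that coefficient; how the hidden-schema work turns it into partitions is not shown, and the function name is hypothetical.

```python
from typing import Any, Dict, List, Optional


def jaccard(rows: List[Dict[str, Optional[Any]]], a: str, b: str) -> float:
    """|rows with both a and b non-NULL| / |rows with a or b non-NULL|."""
    has_a = {i for i, row in enumerate(rows) if row.get(a) is not None}
    has_b = {i for i, row in enumerate(rows) if row.get(b) is not None}
    union = has_a | has_b
    return len(has_a & has_b) / len(union) if union else 0.0
```

Attribute pairs with a coefficient close to 1 almost always appear together and are natural candidates for the same partition.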

Virtual Relation (applied in UTab by NUS)
  Clustered on attributes and tags
  Semantic information

Ranking
  A SQL query always returns an unordered set of qualifying records
  Flexible queries should return ordered results
  Keyword search
    The most classical method is tf-idf
    Modern search engines always involve many aspects with different weights
    Timestamp
    Classify keyword matches into attribute names and data values, with different weights
  Hidden schemas or virtual relations should also be ranked
    In decreasing order of the number of tuples each contains
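For reference, the classical tf-idf score mentioned above can be sketched in a few lines. This is the plain textbook variant (raw term frequency times log(N / document frequency)), assuming whitespace tokenization; it does not include the extra weights for timestamps or attribute-name matches.

```python
import math
from collections import Counter
from typing import List


def tf_idf_scores(docs: List[str], query_terms: List[str]) -> List[float]:
    """Score each document as the sum of tf * log(N / df) over the query terms."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(sum(tf[t] * math.log(n / df[t]) for t in query_terms if df[t]))
    return scores
```

Sorting the records by descending score then yields the ordered result that a flexible query is expected to return.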

Other Issues
  Weak data types
    Arbitrary string
    Timestamp
  Consistency
    CAP theorem
      Consistency, availability, tolerance of partitions
    Write-once-read-many; most queries are read-only

[1] R. Agrawal, A. Somani, and Y. Xu. Storage and querying of e-commerce data. In Proc. of VLDB, pages 149-158, 2001.
[2] G.P. Copeland and S.N. Khoshafian. A decomposition storage model. In SIGMOD 1985.
[3] J.L. Beckmann, A. Halverson, R. Krishnamurthy, and J.F. Naughton. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In Proc. of ICDE, 2006.
[4] B. Yu, G. Li, B.C. Ooi, and L.Z. Zhou. One Table Stores All: Enabling Painless Free-and-Easy Data Publishing and Sharing. In CIDR 2007.
[5] M. Stonebraker, D.J. Abadi, A. Batkin et al. C-Store: a Column-Oriented DBMS. In Proc. of VLDB 2005.
[6] E. Chu, J. Beckmann, J. Naughton. The Case for a Wide-Table Approach to Manage Sparse Relational Data Sets. In SIGMOD 2007.
[7] C. Cunningham, C.A. Galindo-Legaria, and G. Graefe. PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. In VLDB 2004.
[8] F. Chang, J. Dean, S. Ghemawat et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI 2006.
[9] S. Ghemawat, H. Gobioff and S.T. Leung. The Google File System. In SOSP 2003.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI 2004.
[11] D.J. Abadi. Column Stores For Wide and Sparse Data. In CIDR 2007.
[12] J. Madhavan, S.R. Jeffery, and S. Cohen. Web-scale Data Integration: You can only afford to Pay As You Go. In CIDR 2007.
[13] A.S. Hoque. Storage and Querying of High Dimensional Sparsely Populated Data in Compressed Representation. EurAsia-ICT 2002.
[14] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. In Proc. of VLDB, 2002.
[15] J. Madhavan, A. Halevy and S. Cohen et al. Structured Data Meets the Web: A Few Observations. IEEE Data Engineering Bulletin, 29(4), December 2006.
[16] Hadoop website. http://lucene.apache.org/hadoop/
[17] HBase website. http://wiki.apache.org/lucene-hadoop/Hbase
[18] Delicious website. http://del.icio.us/
[19] Flickr website. http://www.flickr.com/
[20] Google Base website. http://base.google.com/
[21] Google Co-op website. http://www.google.com/coop
[22] M. Stonebraker. The case for partial indexes. SIGMOD Record, 1989.
[23] WordNet website. http://wordnet.princeton.edu/
[24] R. Agrawal, R. Srikant and Y. Xu. Database Technologies for Electronic Commerce. In Proc. of VLDB, 2002.
[25] Database Complete Book
[26] E.A. Brewer. Combining Systems and Databases: A Search Engine Retrospective.
