Topic for Thursday? Miscellaneous Topics in Databases


PARALLEL DBMS


WHY PARALLEL ACCESS TO DATA?

At 10 MB/s: about 1.2 days to scan 1 terabyte.
With 1,000-way parallelism (1,000 times the bandwidth): about 100 seconds (roughly 1.7 minutes) to scan the same terabyte.

Parallelism: divide a big problem into many smaller ones to be solved in parallel.
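The scan-time arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and the helper name `scan_seconds` are assumptions, not part of the slide. It gives roughly 1.2 days for a serial scan and on the order of a minute or two when 1,000 disks each read an equal share.

```python
# Back-of-the-envelope scan times for the figures quoted on the slide.
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.
TERABYTE = 10**12
BANDWIDTH = 10 * 10**6  # 10 MB/s per disk


def scan_seconds(size_bytes, bytes_per_second, degree=1):
    """Seconds to scan size_bytes when `degree` disks each read an equal share."""
    return size_bytes / (bytes_per_second * degree)


serial = scan_seconds(TERABYTE, BANDWIDTH)
parallel = scan_seconds(TERABYTE, BANDWIDTH, degree=1000)

print(f"serial scan:       {serial / 86400:.2f} days")
print(f"1,000-way scan:    {parallel / 60:.2f} minutes")
```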

PARALLEL DBMS: INTRO

Parallelism is natural to DBMS processing.
Pipeline parallelism: many machines each doing one step in a multi-step process.
Partition parallelism: many machines doing the same thing to different pieces of data.
Both are natural in DBMS!

[Figure: Pipeline: any sequential program feeds its output to the next sequential program. Partition: copies of a sequential program run side by side, with outputs split N ways and inputs merged.]

SOME || TERMINOLOGY

Speed-Up: more resources means proportionally less time for a given amount of data.
Scale-Up: if resources are increased in proportion to the increase in data size, time is constant.
Why Realistic <> Ideal?

[Figure: two plots against degree of ||-ism. Speed-up plot: Xact/sec. (throughput), ideal vs. realistic curves. Scale-up plot: sec./Xact (response time), ideal vs. realistic curves.]
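Both terms reduce to simple ratios of elapsed times. The sketch below uses made-up timings and hypothetical function names; it only illustrates how the ideal and realistic cases differ numerically. Commonly cited reasons for the gap are startup cost, interference between workers, and skew.

```python
def speed_up(time_small_system, time_large_system):
    """Same workload on both systems; the ideal value equals the resource ratio."""
    return time_small_system / time_large_system


def scale_up(time_small_job_small_system, time_large_job_large_system):
    """Workload and system grown by the same factor; the ideal value is 1.0."""
    return time_small_job_small_system / time_large_job_large_system


# Hypothetical measurements (seconds) for a system grown from 1 to 8 nodes.
ideal_speed_up = speed_up(100.0, 12.5)      # 8.0: linear speed-up
real_speed_up = speed_up(100.0, 20.0)       # 5.0: sub-linear speed-up
real_scale_up = scale_up(100.0, 125.0)      # 0.8: response time grew with the data

print(ideal_speed_up, real_speed_up, real_scale_up)
```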

INTRODUCTION

Parallel machines are becoming quite common and affordable.
  Prices of microprocessors, memory and disks have dropped sharply.
  Recent desktop computers feature multiple processors, and this trend is projected to accelerate.
Databases are growing increasingly large.
  Large volumes of transaction data are collected and stored for later analysis.
  Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
  storing large volumes of data
  processing time-consuming decision-support queries
  providing high throughput for transaction processing

Google data centers around the world, as of 2008


PARALLELISM IN DATABASES

Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel.
  Data can be partitioned and each processor can work independently on its own partition.
  Results are merged when done.
Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.

PARTITIONING

Horizontal partitioning (sharding) involves putting different rows into different tables.
  Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest.
Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns.
  Partitions columns even when the table is already normalized.
  Also called "row splitting" (the row is split by its columns).
  Ex: split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static data.
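A small in-memory sketch of the two schemes, using plain Python lists in place of tables. The customer rows, the `last_login` column and the variable names are hypothetical; only the ZIP-code rule comes from the slide.

```python
# Hypothetical rows: (customer_id, name, zip_code, last_login)
customers = [
    (1, "Ada", "02139", "2024-01-05"),
    (2, "Bob", "94105", "2024-02-11"),
    (3, "Cy", "50000", "2024-03-02"),
]

# Horizontal partitioning: whole rows are routed to a table by ZIP code.
customers_east = [row for row in customers if int(row[2]) < 50000]
customers_west = [row for row in customers if int(row[2]) >= 50000]

# Vertical partitioning: columns are split across tables, and the key
# (customer_id) is repeated in each part so rows can be put back together.
customers_static = [(cid, name, zip_code) for cid, name, zip_code, _ in customers]
customers_dynamic = [(cid, last_login) for cid, _, _, last_login in customers]

# Rebuilding the original rows requires a join on the shared key.
dynamic_by_id = dict(customers_dynamic)
rebuilt = [(cid, name, z, dynamic_by_id[cid]) for cid, name, z in customers_static]
```

The trade-off is visible in the last step: vertical partitioning keeps the frequently used table narrow, at the price of a join whenever both halves are needed.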

COMPARISON OF PARTITIONING TECHNIQUES

Evaluate how well partitioning techniques support the following types of data access:

1. Scanning the entire relation.
2. Locating a tuple associatively (point queries). E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries). E.g., 10 ≤ r.A < 25.

HANDLING SKEW USING HISTOGRAMS

A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.
  Assume a uniform distribution within each range of the histogram.
The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
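One way the construction might look, as a minimal sketch: the histogram is assumed to be a list of `(low, high, count)` buckets over adjacent ranges, and the function name `partitioning_vector` is made up for illustration. Cut points are placed by linear interpolation inside a bucket, which is exactly the uniform-distribution assumption.

```python
def partitioning_vector(histogram, n_partitions):
    """Return n_partitions - 1 cut points that balance the tuple counts.

    histogram: list of (low, high, count) buckets covering adjacent ranges.
    Tuples are assumed to be uniformly distributed inside each bucket.
    """
    total = sum(count for _, _, count in histogram)
    if total == 0:
        raise ValueError("histogram contains no tuples")
    target = total / n_partitions  # tuples each partition should receive
    cuts = []
    seen = 0.0  # tuples in buckets entirely to the left of the current one
    for low, high, count in histogram:
        # Emit every cut point whose cumulative target falls in this bucket.
        while (len(cuts) < n_partitions - 1
               and seen + count >= target * (len(cuts) + 1)):
            needed = target * (len(cuts) + 1) - seen
            cuts.append(low + (high - low) * needed / count)
        seen += count
    return cuts


# Skewed hypothetical data: 80% of the tuples fall in the range [10, 20).
histogram = [(0, 10, 10), (10, 20, 80), (20, 30, 10)]
cuts = partitioning_vector(histogram, 4)
print(cuts)
```

With this input all three cut points land inside the dense middle bucket, so each of the four partitions receives about 25 tuples even though the value ranges differ widely in width.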

INTERQUERY PARALLELISM

Queries/transactions execute in parallel with one another (concurrent processing).
Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
Easiest form of parallelism to support.

INTRAQUERY PARALLELISM

Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
  Intraoperation parallelism: parallelize the execution of each individual operation in the query (each CPU runs on a subset of tuples).
  Interoperation parallelism: execute the different operations in a query expression in parallel (each CPU runs a subset of operations on the data).

PARALLEL JOIN

The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to produce the final result.
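The three steps (split, join locally, collect) can be sketched for the equi-join case with hash partitioning. This is one possible technique, not necessarily the one the lecture had in mind; threads stand in for processors, and all names (`partition`, `local_join`, `parallel_join`) and the sample rows are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, key_index, n):
    """Split rows into n buckets by hashing the join attribute."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key_index]) % n].append(row)
    return buckets


def local_join(r_part, s_part, r_key, s_key):
    """Ordinary hash join of one pair of partitions, run by one worker."""
    index = defaultdict(list)
    for r in r_part:
        index[r[r_key]].append(r)
    return [r + s for s in s_part for r in index.get(s[s_key], [])]


def parallel_join(r_rows, s_rows, r_key, s_key, n_workers=4):
    # Step 1: matching keys hash to the same bucket number in both relations.
    r_parts = partition(r_rows, r_key, n_workers)
    s_parts = partition(s_rows, s_key, n_workers)
    # Step 2: each worker joins its own pair of partitions independently.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pieces = pool.map(local_join, r_parts, s_parts,
                          [r_key] * n_workers, [s_key] * n_workers)
        # Step 3: collect the partial results into the final output.
        return [row for piece in pieces for row in piece]


employees = [(1, "Ann", 10), (2, "Raj", 20), (3, "Mei", 10)]
departments = [(10, "Sales"), (20, "R&D"), (30, "Legal")]
joined = parallel_join(employees, departments, r_key=2, s_key=0)
print(sorted(joined))
```

Hash partitioning only works when the join condition is an equality; other join conditions need a different way of splitting the pairs.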

QUERY OPTIMIZATION

Query optimization in parallel databases is more complex than in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
When scheduling an execution tree in a parallel system, we must decide:
  how to parallelize each operation and how many processors to use for it
  what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially
Determining the amount of resources to allocate for each operation is a problem.
  E.g., allocating more processors than optimal can result in high communication overhead.

DEDUCTIVE DATABASES


OVERVIEW OF DEDUCTIVE DATABASES

Declarative language
  Language to specify rules
Inference engine (deduction machine)
  Can deduce new facts by interpreting the rules
  Related to logic programming
    Prolog language (Prolog => Programming in logic)
    Uses backward chaining to evaluate
    Top-down application of the rules
Consists of:
  Facts
    Similar to relation specification without the necessity of including attribute names
  Rules
    Similar to relational views (virtual relations that are not stored)

PROLOG/DATALOG NOTATION

Facts are provided as predicates.
A predicate has
  a name
  a fixed number of arguments
Convention:
  Constants are numeric or character strings.
  Variables start with upper case letters.
E.g., SUPERVISE(Supervisor, Supervisee)
  States that Supervisor SUPERVISE(s) Supervisee


Rule
  Is of the form head :- body, where :- is read as "if and only if"
  E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
  E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)


Query
  Involves a predicate symbol followed by some variable arguments to answer the question
  E.g., SUPERIOR(james,Y)?
  E.g., SUBORDINATE(james,X)?
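To make the facts, rules and queries concrete, here is a rough Python emulation. It is a set-based evaluation rather than Prolog's backward chaining, the SUPERVISE facts are invented for illustration, and an argument left as `None` plays the role of an unbound variable.

```python
# Hypothetical facts: SUPERVISE(Supervisor, Supervisee)
SUPERVISE = {
    ("james", "franklin"),
    ("james", "jennifer"),
    ("franklin", "john"),
}


def superior(x=None, y=None):
    """SUPERIOR(X,Y) :- SUPERVISE(X,Y). None means the argument is a variable."""
    return {(a, b) for a, b in SUPERVISE
            if x in (None, a) and y in (None, b)}


def subordinate(y=None, x=None):
    """SUBORDINATE(Y,X) :- SUPERVISE(X,Y). None means the argument is a variable."""
    return {(b, a) for a, b in SUPERVISE
            if x in (None, a) and y in (None, b)}


# SUPERIOR(james,Y)?  -> every Y that james supervises
print(sorted(superior("james")))
# SUBORDINATE(james,X)?  -> empty here, since nobody supervises james
print(sorted(subordinate("james")))
# SUBORDINATE(john,X)?
print(sorted(subordinate("john")))
```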

[Figure: supervisory tree and the corresponding Prolog notation]

PROVING A NEW FACT


DATA MINING


DEFINITION

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Example pattern (Census Bureau data): If (relationship = husband), then (gender = male). 99.6%

DEFINITION (CONT.)

Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.

WHY USE DATA MINING TODAY?

Human analysis skills are inadequate:
  Volume and dimensionality of the data
  High data growth rate
Availability of:
  Data
  Storage
  Computational power
  Off-the-shelf software
  Expertise

THE KNOWLEDGE DISCOVERY PROCESS

Steps:
  Identify business problem
  Data mining
  Action
  Evaluation and measurement
  Deployment and integration into business processes

PREPROCESSING AND MINING

[Figure: Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge]

DATA MINING TECHNIQUES

Supervised learning
  Classification and regression
Unsupervised learning
  Clustering
Dependency modeling
  Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection

EXAMPLE APPLICATION: SKY SURVEY

Input data: 3 TB of image data with 2 billion sky objects; took more than six years to complete
Goal: Generate a catalog with all objects and their type
Method: Use decision trees as data mining model
Results:
  94% accuracy in predicting sky object classes
  Increased number of faint objects classified by 300%
  Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time

CLASSIFICATION EXAMPLE

Example training database
  Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
  Age is ordered, Car-type is a categorical attribute
  Class label indicates whether person bought product
  Dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No

GOALS AND REQUIREMENTS

Goals:
  To produce an accurate classifier/regression function
  To understand the structure of the problem
Requirements on the model:
  High accuracy
  Understandable by humans, interpretable
  Fast construction for very large training databases

WHAT ARE DECISION TREES?

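[Figure: decision tree. The root tests Age: if Age >= 30 the class is YES; if Age < 30 the tree tests Car Type, with Minivan giving YES and Sports or Truck giving NO. A second panel shows the same splits as regions over the Age axis (0, 30, 60) and Car Type (Minivan vs. Sports, Truck).]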
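Read from the figure labels, the tree tests Age first (Age >= 30 gives YES) and then Car Type for younger buyers (Minivan gives YES; Sports or Truck gives NO). Under that reading, which is an interpretation of the figure, the tree is just a nested conditional. The sketch below writes it as a hypothetical `classify` function and checks it against the training table from the classification example.

```python
# Training table from the classification example: (age, car_type, class_label).
# M = Minivan, S = Sport, T = Truck.
TRAINING = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]


def classify(age, car_type):
    """Decision tree as read from the figure: split on Age, then on Car Type."""
    if age >= 30:
        return "Yes"
    return "Yes" if car_type == "M" else "No"


correct = sum(classify(age, car) == label for age, car, label in TRAINING)
accuracy = correct / len(TRAINING)
print(f"{correct} of {len(TRAINING)} training rows classified correctly")
```

Each root-to-leaf path is a readable rule, which is why decision trees meet the "understandable by humans" requirement.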

DENSITY-BASED CLUSTERING

A cluster is defined as a connected dense component.
Density is defined in terms of number of neighbors of a point.
We can find clusters of arbitrary shape.
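The slide does not name an algorithm; the sketch below follows the well-known DBSCAN scheme as one plausible instance of the idea. A point is "dense" when at least `min_neighbors` points (itself included) lie within distance `eps`, and a cluster grows by repeatedly absorbing the neighbors of its dense points. The parameter names and sample points are assumptions.

```python
from math import dist


def density_clusters(points, eps, min_neighbors):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_neighbors:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_neighbors:
                queue.extend(reachable)  # dense point: keep growing the cluster
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
labels = density_clusters(points, eps=1.5, min_neighbors=3)
print(labels)
```

Because clusters grow point by point through dense neighborhoods, nothing forces them to be round or convex, which is where the "arbitrary shape" property comes from.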

MARKET BASKET ANALYSIS

Consider a shopping cart filled with several items.
Market basket analysis tries to answer the following questions:
  Who makes purchases?
  What do customers buy together?
  In what order do customers purchase items?

MARKET BASKET ANALYSIS (CONTD.)

Co-occurrences
  80% of all customers purchase items X, Y and Z together.
Association rules
  60% of all customers who purchase X and Y also buy Z.
Sequential patterns
  60% of customers who first buy X also purchase Y within three weeks.
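The percentages attached to co-occurrences and association rules are usually called support and confidence. The sketch below computes both over a small invented set of baskets; the basket contents and function names are hypothetical, chosen so that the rule "X and Y implies Z" comes out at 60% confidence.

```python
# Ten hypothetical market baskets.
baskets = [
    {"X", "Y", "Z"}, {"X", "Y", "Z"}, {"X", "Y", "Z"},
    {"X", "Y"}, {"X", "Y"},
    {"X"}, {"Y", "Z"}, {"Z"}, {"W"}, {"X", "W"},
]


def count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)


def support(itemset):
    """Fraction of all baskets containing the itemset (co-occurrence)."""
    return count(itemset) / len(baskets)


def confidence(antecedent, consequent):
    """Among baskets with the antecedent, the fraction that also hold the consequent."""
    return count(antecedent | consequent) / count(antecedent)


print("support {X, Y, Z}:", support({"X", "Y", "Z"}))
print("confidence X, Y -> Z:", confidence({"X", "Y"}, {"Z"}))
```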

SPATIAL DATA


WHAT IS A SPATIAL DATABASE?

Database that:
  Stores spatial objects
  Manipulates spatial objects just like other objects in the database

WHAT IS SPATIAL DATA?

Data which describes either location or shape, e.g.
  House or Fire Hydrant location
  Roads, Rivers, Pipelines, Power lines
  Forests, Parks, Municipalities, Lakes
In the abstract, reductionist view of the computer, these entities are represented as Points, Lines, and Polygons.

Roads are represented as Lines.
Mail Boxes are represented as Points.

Land Use Classifications are represented as Polygons.

Combination of all the previous data

SPATIAL RELATIONSHIPS

Not just interested in location, also interested in “Relationships” between objects that are very hard to model outside the spatial domain.

The most common relationships are Proximity : distance Adjacency : “touching” and “connectivity” Containment : inside/overlapping

Distance between a toxic waste dump and a piece of property you were considering buying.

Distance to various pubs

Adjacency: All the lots which share an edge

Connectivity: Tributary relationships in river networks

MOST ORGANIZATIONS HAVE SPATIAL DATA

Geocodable addresses
Customer location
Store locations
Transportation tracking
Statistical/Demographic
Cartography
Epidemiology
Crime patterns
Weather Information
Land holdings
Natural resources
City Planning
Environmental planning
Information Visualization
Hazard detection

ADVANTAGES OF SPATIAL DATABASES

Able to treat your spatial data like anything else in the DB
  transactions
  backups
  integrity checks
  less data redundancy
  fundamental organization and operations handled by the DB
  multi-user support
  security/access control
  locking


Offload complicated tasks to the DB server
  organization and indexing done for you
  do not have to re-implement operators
  do not have to re-implement functions
Significantly lowers the development time of client applications


Spatial querying using SQL
  use simple SQL expressions to determine spatial relationships
    distance
    adjacency
    containment
  use simple SQL expressions to perform spatial operations
    area
    length
    intersection
    union
    buffer

[Figure: original polygons, their union, and their intersection]

[Figure: original river network and the buffered rivers]


… WHERE distance(<me>,pub_loc) < 1000
SELECT distance(<me>,pub_loc)*$0.01 + beer_cost …
... WHERE touches(pub_loc, street)
… WHERE inside(pub_loc,city_area) and city_name = ...
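The first two fragments combine a proximity filter with a cost expression. The same logic in plain Python, as a sketch only: the pub names, coordinates and prices are invented, locations are assumed to be points in a planar coordinate system measured in metres, and `distance` here is an ordinary Euclidean helper rather than a database function.

```python
from math import hypot

# Hypothetical pubs: name -> ((x, y) location in metres, beer cost in dollars)
pubs = {
    "Anchor": ((300.0, 400.0), 5.0),
    "Bell": ((3000.0, 4000.0), 3.0),
    "Crown": ((0.0, 900.0), 6.5),
}
me = (0.0, 0.0)


def distance(a, b):
    """Euclidean distance between two points."""
    return hypot(a[0] - b[0], a[1] - b[1])


# WHERE distance(<me>, pub_loc) < 1000
# SELECT distance(<me>, pub_loc) * $0.01 + beer_cost
total_cost = {
    name: distance(me, loc) * 0.01 + beer_cost
    for name, (loc, beer_cost) in pubs.items()
    if distance(me, loc) < 1000
}

cheapest = min(total_cost, key=total_cost.get)
print(cheapest, total_cost)
```

A spatial database evaluates the same predicate inside the query engine, where a spatial index can avoid computing the distance to every pub.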


Simple value of the proposed lot

area(<my lot>) * <price per acre>
+ area(intersect(<my lot>,<forested area>)) * <wood value per acre>
- distance(<my lot>, <power lines>) * <cost of power line laying>

New Electoral Districts

• Changes in areas between 1996 and 2001 election.

• Want to predict voting in 2001 by looking at voting in 1996.

• Intersect the 2001 district polygon with the voting areas polygons.
  • Outside will have zero area
  • Inside will have 100% area
  • On the border will have partial area

• Multiply the % area by 1996 actual voting and sum

• Result is a simple prediction of 2001 voting

More advanced: also use demographic data.
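The area-weighting steps can be sketched as follows. To stay self-contained the example uses axis-aligned rectangles `(xmin, ymin, xmax, ymax)` instead of real polygons, which is a simplifying assumption; a spatial database or geometry library would supply the general polygon intersection. The district, voting areas and vote counts are invented.

```python
def area(rect):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    return (rect[2] - rect[0]) * (rect[3] - rect[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def predict_votes(district, voting_areas):
    """Sum each old voting area's votes, weighted by the share of its area
    that falls inside the new district."""
    total = 0.0
    for rect, votes in voting_areas:
        fraction = overlap_area(district, rect) / area(rect)
        total += fraction * votes
    return total


new_district = (0, 0, 10, 10)
old_voting_areas = [
    ((0, 0, 5, 10), 1000),    # inside: counts at 100%
    ((5, 0, 15, 10), 800),    # on the border: half its area is inside
    ((20, 0, 30, 10), 600),   # outside: contributes nothing
]
prediction = predict_votes(new_district, old_voting_areas)
print(prediction)
```

The weighting assumes voters are spread evenly across each old voting area, which is why bringing in demographic data is the natural refinement.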


DISADVANTAGES OF SPATIAL DATABASES

Cost to implement can be high
Some inflexibility
Incompatibilities with some GIS software
Slower than local, specialized data structures
User/managerial inexperience and caution

PICTOGRAMS - SHAPES

Types: Basic Shapes, Multi-Shapes, Derived Shapes, Alternate Shapes, Any possible Shape, User-Defined Shapes

[Figure: example pictograms for Basic Shapes, Multi-Shapes, Derived Shapes, Alternate Shapes, Any Possible Shape, and User-Defined Shape; labels appearing in the pictograms include N; 0, N; *; and !]

SPATIAL DATA ENTITY CREATION

Form an entity to hold county names, states, populations, and geographies:

CREATE TABLE County(
    Name varchar(30),
    State varchar(30),
    Pop Integer,
    Shape Polygon);

Form an entity to hold river names, sources, lengths, and geographies:

CREATE TABLE River(
    Name varchar(30),
    Source varchar(30),
    Distance Integer,
    Shape LineString);

EXTENDING THE ER DIAGRAM

[Figure: standard ER diagram compared with the spatial ER diagram]
