Topic for Thursday? Miscellaneous Topics in Databases.
Date posted: 19-Dec-2015
PARALLEL DBMS
WHY PARALLEL ACCESS TO DATA?
Scanning 1 Terabyte at 10 MB/s takes about 1.2 days.
With 1,000 x parallel bandwidth, the same 1 Terabyte scans in about 1.5 minutes.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
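The scan-time figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 Terabyte = 10^12 bytes, 10 MB/s = 10^7 bytes per second) and perfectly even division of work; the function name `scan_seconds` is illustrative only. Under these assumptions the parallel scan comes out near 100 seconds, in the same ballpark as the slide's rounded figure.

```python
def scan_seconds(data_bytes, bytes_per_second, degree=1):
    """Time to scan data_bytes when `degree` devices each deliver bytes_per_second."""
    return data_bytes / (bytes_per_second * degree)

TERABYTE = 10**12        # assumption: decimal terabyte
RATE = 10 * 10**6        # assumption: 10 MB/s in decimal megabytes

serial = scan_seconds(TERABYTE, RATE)                   # one device
parallel = scan_seconds(TERABYTE, RATE, degree=1000)    # 1,000 devices

print(f"serial scan:   {serial / 86400:.2f} days")
print(f"parallel scan: {parallel / 60:.2f} minutes")
```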
PARALLEL DBMS: INTRO
Parallelism is natural to DBMS processing
  Pipeline parallelism: many machines each doing one step in a multi-step process.
  Partition parallelism: many machines doing the same thing to different pieces of data.
  Both are natural in DBMS!
[Figure: Pipeline (a chain of "Any Sequential Program" stages) and Partition (copies of "Any Sequential Program" side by side; outputs split N ways, inputs merge)]
SOME || TERMINOLOGY
Speed-Up: more resources means proportionally less time for a given amount of data.
Scale-Up: if resources are increased in proportion to the increase in data size, time is constant.
Why Realistic <> Ideal?
[Figure: two plots against degree of ||-ism, Xact/sec. (throughput) and sec./Xact (response time), each with an Ideal and a Realistic curve]
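Both terms reduce to simple ratios of elapsed times. The sketch below uses the common textbook formulation; the function names and the timings are hypothetical and only illustrate how a realistic system falls short of the ideal.

```python
def speed_up(time_on_small_system, time_on_large_system):
    """Same amount of data on both systems.
    Ideal (linear) speed-up equals the factor by which resources grew."""
    return time_on_small_system / time_on_large_system

def scale_up(time_small_job_small_system, time_large_job_large_system):
    """Data and resources grow by the same factor.
    Ideal scale-up is 1.0, meaning elapsed time stays constant."""
    return time_small_job_small_system / time_large_job_large_system

# Hypothetical measurements (seconds) after growing resources 4x.
print(speed_up(100.0, 30.0))    # below the ideal of 4.0
print(scale_up(100.0, 125.0))   # below the ideal of 1.0
```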
INTRODUCTION
Parallel machines are becoming quite common and affordable
  Prices of microprocessors, memory and disks have dropped sharply
  Recent desktop computers feature multiple processors and this trend is projected to accelerate
Databases are growing increasingly large
  Large volumes of transaction data are collected and stored for later analysis.
  Multimedia objects like images are increasingly stored in databases
Large-scale parallel database systems are increasingly used for:
  storing large volumes of data
  processing time-consuming decision-support queries
  providing high throughput for transaction processing
Google data centers around the world, as of 2008
PARALLELISM IN DATABASES
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel
  Data can be partitioned and each processor can work independently on its own partition
  Results are merged when done
Different queries can be run in parallel with each other.
  Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
PARTITIONING
Horizontal partitioning (sharding) involves putting different rows into different tables
  Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest
Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns
  Partitions columns even when already normalized
  Called "row splitting" (the row is split by its columns)
  Ex: split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static data
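The two kinds of partitioning can be sketched on a small in-memory table. The customer rows, the column names, and the helper names `split_horizontal` and `split_vertical` are made up for illustration; only the ZIP boundary of 50000 comes from the example above.

```python
# Hypothetical customer rows; "visits" stands in for frequently changing data.
customers = [
    {"id": 1, "name": "Ann", "zip": 10001, "visits": 12},
    {"id": 2, "name": "Bob", "zip": 94105, "visits": 3},
    {"id": 3, "name": "Cy",  "zip": 30301, "visits": 7},
    {"id": 4, "name": "Dee", "zip": 60601, "visits": 1},
]

def split_horizontal(rows, boundary=50000):
    """Different rows go to different tables, chosen by ZIP code."""
    east = [r for r in rows if r["zip"] < boundary]
    west = [r for r in rows if r["zip"] >= boundary]
    return east, west

def split_vertical(rows, key, static_cols):
    """Different columns go to different tables; the key is repeated in both."""
    static = [{key: r[key], **{c: r[c] for c in static_cols}} for r in rows]
    dynamic = [{k: v for k, v in r.items() if k == key or k not in static_cols}
               for r in rows]
    return static, dynamic

customers_east, customers_west = split_horizontal(customers)
static_part, dynamic_part = split_vertical(customers, "id", ["name", "zip"])
print([r["id"] for r in customers_east], [r["id"] for r in customers_west])
print(static_part[0], dynamic_part[0])
```

Repeating the key in both vertical fragments is what allows the original row to be rebuilt with a join.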
COMPARISON OF PARTITIONING TECHNIQUES
Evaluate how well partitioning techniques support the following types of data access:
  1. Scanning the entire relation.
  2. Locating a tuple associatively – point queries. E.g., r.A = 25.
  3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.
HANDLING SKEW USING HISTOGRAMS
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion
  Assume uniform distribution within each range of the histogram
The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation
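One way to derive such a vector is to walk the cumulative histogram counts and interpolate linearly inside the bucket where each cut falls, which is exactly the uniform-within-range assumption. The sketch below is illustrative, not a definitive algorithm: the function name, the bucket boundaries, and the counts are all assumptions, and it expects a histogram with at least one tuple.

```python
def balanced_partition_vector(boundaries, counts, n_partitions):
    """boundaries has len(counts) + 1 bucket edges; counts[i] is the number of
    tuples in [boundaries[i], boundaries[i + 1]).
    Returns the n_partitions - 1 cut points of the partitioning vector."""
    total = sum(counts)
    cuts = []
    seen = 0      # tuples in buckets entirely below the current bucket
    bucket = 0
    for k in range(1, n_partitions):
        target = total * k / n_partitions   # tuples that should lie below cut k
        # Advance to the bucket containing the target-th tuple.
        while seen + counts[bucket] < target:
            seen += counts[bucket]
            bucket += 1
        lo, hi = boundaries[bucket], boundaries[bucket + 1]
        # Uniform distribution inside the bucket: interpolate linearly.
        fraction = (target - seen) / counts[bucket]
        cuts.append(lo + fraction * (hi - lo))
    return cuts

# Hypothetical skewed histogram: most tuples fall in [10, 20).
edges = [0, 10, 20, 30, 40]
tuple_counts = [10, 60, 20, 10]
vector = balanced_partition_vector(edges, tuple_counts, 4)
print(vector)
```

Because the second bucket is so heavy, two of the three cut points land inside it, which is how the vector compensates for skew.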
INTERQUERY PARALLELISM
Queries/transactions execute in parallel with one another (concurrent processing)
Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
Easiest form of parallelism to support
INTRAQUERY PARALLELISM
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries
Two complementary forms of intraquery parallelism:
  Intraoperation Parallelism – parallelize the execution of each individual operation in the query (each CPU runs on a subset of tuples)
  Interoperation Parallelism – execute the different operations in a query expression in parallel (each CPU runs a subset of operations on the data)
PARALLEL JOIN
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to produce the final result.
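The three steps (split, join locally, collect) can be sketched for an equi-join by hash-partitioning both relations on the join attribute. This is only one possible approach and an assumption of this sketch: tuples are plain Python tuples with the join key in position 0, worker threads stand in for processors, and the helper names are invented. Threads show the structure of the computation rather than a real speed gain.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    """Split rows into n parts by hashing the join key (position 0)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts

def local_join(r_part, s_part):
    """Join one pair of partitions with an in-memory hash table."""
    index = {}
    for r in r_part:
        index.setdefault(r[0], []).append(r)
    return [r + s[1:] for s in s_part for r in index.get(s[0], [])]

def parallel_join(r, s, n=4):
    r_parts, s_parts = partition(r, n), partition(s, n)
    # Matching keys always hash to the same partition number, so each
    # worker only needs its own pair of partitions.
    with ThreadPoolExecutor(max_workers=n) as pool:
        partial_results = list(pool.map(local_join, r_parts, s_parts))
    # Final step: collect the partial results.
    return [row for part in partial_results for row in part]

r = [(1, "ann"), (2, "bob"), (3, "cy")]
s = [(1, "x"), (1, "y"), (3, "z"), (4, "w")]
joined = sorted(parallel_join(r, s))
print(joined)
```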
QUERY OPTIMIZATION
Query optimization in parallel databases is more complex than in sequential databases
  Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention
When scheduling an execution tree in a parallel system, we must decide:
  How to parallelize each operation and how many processors to use for it
  What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially
Determining the amount of resources to allocate for each operation is a problem
  E.g., allocating more processors than optimal can result in high communication overhead
DEDUCTIVE DATABASES
OVERVIEW OF DEDUCTIVE DATABASES
Declarative Language
  Language to specify rules
Inference Engine (Deduction Machine)
  Can deduce new facts by interpreting the rules
  Related to logic programming
    Prolog language (Prolog => Programming in logic)
    Uses backward chaining to evaluate
    Top-down application of the rules
Consists of:
  Facts
    Similar to relation specification without the necessity of including attribute names
  Rules
    Similar to relational views (virtual relations that are not stored)
PROLOG/DATALOG NOTATION
Facts are provided as predicates
Predicate has
  a name
  a fixed number of arguments
Convention:
  Constants are numeric or character strings
  Variables start with upper case letters
E.g., SUPERVISE(Supervisor, Supervisee)
  States that Supervisor SUPERVISE(s) Supervisee
PROLOG/DATALOG NOTATION
Rule
  Is of the form head :- body, where :- is read as if and only if
  E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
  E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
PROLOG/DATALOG NOTATION
Query
  Involves a predicate symbol followed by some variable arguments to answer the question
  E.g., SUPERIOR(james,Y)?
  E.g., SUBORDINATE(james,X)?
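To make the notation concrete, the two rules and two queries can be mimicked in Python over a small set of SUPERVISE facts. This is a rough sketch, not how a Prolog engine works: the facts are hypothetical (only `james` appears in the slides), `None` plays the role of an unbound variable, and only the non-recursive rules shown above are implemented.

```python
# Hypothetical facts: SUPERVISE(supervisor, supervisee)
SUPERVISE = {
    ("james", "franklin"),
    ("james", "jennifer"),
    ("franklin", "john"),
}

def superior(x=None, y=None):
    """SUPERIOR(X,Y) :- SUPERVISE(X,Y). None means the argument is a variable."""
    return sorted((a, b) for (a, b) in SUPERVISE
                  if (x is None or a == x) and (y is None or b == y))

def subordinate(y=None, x=None):
    """SUBORDINATE(Y,X) :- SUPERVISE(X,Y). Argument order follows the rule head."""
    return sorted((b, a) for (a, b) in SUPERVISE
                  if (x is None or a == x) and (y is None or b == y))

# SUPERIOR(james,Y)?   bind X to james, leave Y free
print(superior(x="james"))
# SUBORDINATE(james,X)?   james is nobody's subordinate in these facts
print(subordinate(y="james"))
```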
[Figure: Supervisory tree; Prolog notation]
PROVING A NEW FACT
DATA MINING
DEFINITION
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%
DEFINITION (CONT.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
  Valid: The patterns hold in general.
  Novel: We did not know the pattern beforehand.
  Useful: We can devise actions from the patterns.
  Understandable: We can interpret and comprehend the patterns.
WHY USE DATA MINING TODAY?
Human analysis skills are inadequate:
  Volume and dimensionality of the data
  High data growth rate
Availability of:
  Data
  Storage
  Computational power
  Off-the-shelf software
  Expertise
THE KNOWLEDGE DISCOVERY PROCESS
Steps:
  Identify business problem
  Data mining
  Action
  Evaluation and measurement
  Deployment and integration into business processes
PREPROCESSING AND MINING
[Figure: Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge]
DATA MINING TECHNIQUES
Supervised learning
  Classification and regression
Unsupervised learning
  Clustering
Dependency modeling
  Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
EXAMPLE APPLICATION: SKY SURVEY
Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete
Goal: Generate a catalog with all objects and their type
Method: Use decision trees as data mining model
Results:
  94% accuracy in predicting sky object classes
  Increased number of faint objects classified by 300%
  Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time
CLASSIFICATION EXAMPLE
Example training database
  Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
  Age is ordered, Car-type is a categorical attribute
  Class label indicates whether person bought product
  Dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
GOALS AND REQUIREMENTS
Goals:
  To produce an accurate classifier/regression function
  To understand the structure of the problem
Requirements on the model:
  High accuracy
  Understandable by humans, interpretable
  Fast construction for very large training databases
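WHAT ARE DECISION TREES?
[Figure: decision tree. The root splits on Age: >=30 leads to YES; <30 splits on Car Type, where Minivan leads to YES and Sports, Truck leads to NO. A second panel shows the same YES/NO regions over an Age axis (0, 30, 60) and Car Type (Minivan; Sports, Truck).]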
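A decision tree of this kind is just nested tests on the predictor attributes. The sketch below encodes one reading of the figure (split on Age at 30, then on Car Type for the younger group) and checks it against the ten rows of the training database from the classification example. The tree structure is an assumption reconstructed from the figure labels; the function name `predict` is illustrative.

```python
# Training database from the classification example: (Age, Car, Class)
TRAINING = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]

def predict(age, car):
    """Assumed tree: Age >= 30 -> Yes; otherwise Minivan -> Yes, Sports/Truck -> No."""
    if age >= 30:
        return "Yes"
    return "Yes" if car == "M" else "No"

correct = sum(predict(age, car) == label for age, car, label in TRAINING)
accuracy = correct / len(TRAINING)
print(f"{correct} of {len(TRAINING)} training rows classified correctly")
```

Each root-to-leaf path reads as a rule a person can follow, which is what makes the model understandable by humans.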
DENSITY-BASED CLUSTERING
A cluster is defined as a connected dense component.
Density is defined in terms of number of neighbors of a point.
We can find clusters of arbitrary shape
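The definition can be turned into a small procedure: count each point's neighbors within a radius, call a point dense if it has enough of them, and grow clusters outward from dense points. This is a simplified sketch in the spirit of DBSCAN-style algorithms, not a specific published method; the parameter names `eps` and `min_neighbors` and the sample points are assumptions.

```python
from math import dist

def density_clusters(points, eps, min_neighbors):
    """Return one label per point: a cluster id, or -1 for noise."""
    neighbors = [[j for j, q in enumerate(points) if j != i and dist(p, q) <= eps]
                 for i, p in enumerate(points)]
    dense = [len(n) >= min_neighbors for n in neighbors]
    labels = [-1] * len(points)
    cluster = 0
    for start in range(len(points)):
        if not dense[start] or labels[start] != -1:
            continue
        labels[start] = cluster
        frontier = [start]
        while frontier:
            i = frontier.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if dense[j]:
                        frontier.append(j)   # only dense points extend the component
        cluster += 1
    return labels

# Two hypothetical dense squares and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
labels = density_clusters(pts, eps=1.5, min_neighbors=3)
print(labels)
```

Because clusters grow by following neighbor links instead of measuring distance to a center, the resulting components can take any shape.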
MARKET BASKET ANALYSIS
Consider shopping cart filled with several items
Market basket analysis tries to answer the following questions:
  Who makes purchases?
  What do customers buy together?
  In what order do customers purchase items?
MARKET BASKET ANALYSIS (CONTD.)
Co-occurrences
  80% of all customers purchase items X, Y and Z together.
Association rules
  60% of all customers who purchase X and Y also buy Z.
Sequential patterns
  60% of customers who first buy X also purchase Y within three weeks.
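The co-occurrence and association-rule percentages correspond to what is commonly called support and confidence; those two terms, and the baskets below, are assumptions introduced for this sketch. A minimal computation over a hypothetical list of baskets might look like this:

```python
def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical data: 10 baskets.
baskets = ([{"X", "Y", "Z"}] * 3) + ([{"X", "Y"}] * 2) + ([{"W"}] * 5)

print(support(baskets, {"X", "Y", "Z"}))        # co-occurrence of X, Y and Z
print(confidence(baskets, {"X", "Y"}, {"Z"}))   # rule: X and Y -> Z
```

With this data the rule "customers who purchase X and Y also buy Z" holds for 60% of the baskets that contain X and Y, although only 30% of all baskets contain the three items together.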
SPATIAL DATA
WHAT IS A SPATIAL DATABASE?
Database that:
  Stores spatial objects
  Manipulates spatial objects just like other objects in the database
WHAT IS SPATIAL DATA?
Data which describes either location or shape, e.g.
  House or Fire Hydrant location
  Roads, Rivers, Pipelines, Power lines
  Forests, Parks, Municipalities, Lakes
In the abstract, reductionist view of the computer, these entities are represented as Points, Lines, and Polygons.
Roads are represented as Lines
Mail Boxes are represented as Points
Land Use Classifications are represented as Polygons
Combination of all the previous data
SPATIAL RELATIONSHIPS
Not just interested in location, also interested in “Relationships” between objects that are very hard to model outside the spatial domain.
The most common relationships are
  Proximity: distance
  Adjacency: “touching” and “connectivity”
  Containment: inside/overlapping
Distance between a toxic waste dump and a piece of property you were considering buying.
Distance to various pubs
Adjacency: All the lots which share an edge
Connectivity: Tributary relationships in river networks
MOST ORGANIZATIONS HAVE SPATIAL DATA
  Geocodable addresses
  Customer location
  Store locations
  Transportation tracking
  Statistical/Demographic
  Cartography
  Epidemiology
  Crime patterns
  Weather Information
  Land holdings
  Natural resources
  City Planning
  Environmental planning
  Information Visualization
  Hazard detection
ADVANTAGES OF SPATIAL DATABASES
Able to treat your spatial data like anything else in the DB
  transactions
  backups
  integrity checks
  less data redundancy
  fundamental organization and operations handled by the DB
  multi-user support
  security/access control
  locking
ADVANTAGES OF SPATIAL DATABASES
Offload complicated tasks to the DB server
  organization and indexing done for you
  do not have to re-implement operators
  do not have to re-implement functions
Significantly lowers the development time of client applications
ADVANTAGES OF SPATIAL DATABASES
Spatial querying using SQL
  use simple SQL expressions to determine spatial relationships
    distance
    adjacency
    containment
  use simple SQL expressions to perform spatial operations
    area
    length
    intersection
    union
    buffer
[Figure: Original Polygons; Union; Intersection]
[Figure: Original river network; Buffered rivers]
ADVANTAGES OF SPATIAL DATABASES
… WHERE distance(<me>,pub_loc) < 1000
SELECT distance(<me>,pub_loc)*$0.01 + beer_cost …
... WHERE touches(pub_loc, street)
… WHERE inside(pub_loc,city_area) and city_name = ...
ADVANTAGES OF SPATIAL DATABASES
Simple value of the proposed lot
  Area(<my lot>) * <price per acre>
  + area(intersect(<my lot>,<forested area>)) * <wood value per acre>
  - distance(<my lot>, <power lines>) * <cost of power line laying>
New Electoral Districts
• Changes in areas between 1996 and 2001 election.
• Want to predict voting in 2001 by looking at voting in 1996.
• Intersect the 2001 district polygon with the voting areas polygons.
  • Outside will have zero area
  • Inside will have 100% area
  • On the border will have partial area
• Multiply the % area by 1996 actual voting and sum
• Result is a simple prediction of 2001 voting
More advanced: also use demographic data.
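The area-weighted prediction can be sketched without a spatial database by using axis-aligned rectangles in place of real polygons. That simplification, the coordinates, and the vote counts are all assumptions made for illustration; a real system would use the database's polygon intersection and area operations.

```python
def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)

def predict_votes(district, voting_areas):
    """Sum each 1996 voting area's votes, weighted by the share of its area
    that falls inside the 2001 district."""
    total = 0.0
    for rect, votes_1996 in voting_areas:
        share = overlap_area(district, rect) / rect_area(rect)
        total += share * votes_1996
    return total

district_2001 = (0, 0, 10, 10)
areas_1996 = [
    ((0, 0, 5, 10), 1000),    # inside: 100% of its area counts
    ((5, 0, 15, 10), 800),    # on the border: half of its area counts
    ((20, 0, 30, 10), 500),   # outside: zero area
]
prediction = predict_votes(district_2001, areas_1996)
print(prediction)
```

The weighting assumes voters are spread evenly across each voting area, which is why demographic data is mentioned as a refinement.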
DISADVANTAGES OF SPATIAL DATABASES
Cost to implement can be high
Some inflexibility
Incompatibilities with some GIS software
Slower than local, specialized data structures
User/managerial inexperience and caution
PICTOGRAMS - SHAPES
Types: Basic Shapes, Multi-Shapes, Derived Shapes, Alternate Shapes, Any possible Shape, User-Defined Shapes
[Figure: pictogram examples for Basic Shapes, Alternate Shapes, Multi-Shapes, Any Possible Shape, Derived Shapes, and User Defined Shape; figure labels: N; 0, N; *; !]
SPATIAL DATA ENTITY CREATION
Form an entity to hold county names, states, populations, and geographies
  CREATE TABLE County(
    Name varchar(30),
    State varchar(30),
    Pop Integer,
    Shape Polygon);
Form an entity to hold river names, sources, lengths, and geographies
  CREATE TABLE River(
    Name varchar(30),
    Source varchar(30),
    Distance Integer,
    Shape LineString);
EXTENDING THE ER DIAGRAM
[Figure: Standard ER Diagram; Spatial ER Diagram]