Spatial, text, and multimedia databases Erik Zeitler UDBL.
Date posted: 21-Dec-2015
![Page 1: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/1.jpg)
Spatial, text, and multimedia databases
Erik Zeitler
UDBL
![Page 2: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/2.jpg)
Why indexing?
• Speed up retrieval
  – Non-key attributes
  – Feature based
![Page 3: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/3.jpg)
Applications
• Image databases (2-D, 3-D)
  – Shapes, colors, textures
• Financial analysis
  – Sales patterns, stock market prediction, consumer behavior
• Scientific databases
  – Sensor data/simulation results:
    • Scalar/vector fields
![Page 4: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/4.jpg)
Traditional indexing methods
A record with k attributes corresponds to a point in k-dimensional space.
Name     Salary  Age  Dept
Smith    40000   45   3
Dilbert  35000   35   4
Wally    35000   37   4
Dogbert  45000   30   5
…
4 attributes: name, salary, age, dept.
![Page 5: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/5.jpg)
Spatial query complexity
• Exact match: name = ’Smith’ and salary = 40000 and age = 45
• Partial match: salary = 40000 and age = 45
• Range: 35000 ≤ salary ≤ 45000 and age = 45
• Boolean: ((not name = ’Smith’) and salary ≥ 40000) or age ≥ 50
• Nearest-neighbor (similarity): salary ≈ 40000 and age ≈ 45
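The five query types can be tried out on the sample table with a plain linear scan. This is only a sketch of what each query returns, not of how an index would answer it; the `Employee` tuple, the function names, and the `salary_scale` normalization in the nearest-neighbor query are assumptions made for illustration.

```python
from collections import namedtuple
from math import hypot

Employee = namedtuple("Employee", "name salary age dept")

# Sample records from the table on the previous slide.
EMPLOYEES = [
    Employee("Smith", 40000, 45, 3),
    Employee("Dilbert", 35000, 35, 4),
    Employee("Wally", 35000, 37, 4),
    Employee("Dogbert", 45000, 30, 5),
]

def exact_match(rows):
    # All attributes of interest are fixed.
    return [r for r in rows if r.name == "Smith" and r.salary == 40000 and r.age == 45]

def partial_match(rows):
    # Only some attributes are fixed; name is left open.
    return [r for r in rows if r.salary == 40000 and r.age == 45]

def range_query(rows):
    return [r for r in rows if 35000 <= r.salary <= 45000 and r.age == 45]

def boolean_query(rows):
    return [r for r in rows if (r.name != "Smith" and r.salary >= 40000) or r.age >= 50]

def nearest_neighbor(rows, salary, age, salary_scale=1000.0):
    # salary_scale is a hypothetical normalization so that salary
    # differences do not completely dominate age differences.
    return min(rows, key=lambda r: hypot((r.salary - salary) / salary_scale, r.age - age))
```

A linear scan answers every query in O(n); the index structures on the following slides aim to avoid touching most of the records.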
![Page 6: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/6.jpg)
Inverted files
Given an attribute (one of Name, Salary, Age, Dept in the example table):
• For each attribute value, store
  1. A list of pointers to records having this attribute value
  2. (Optionally) The length of this list
• Organize the attribute values using
  – B-trees, B+-trees, B*-trees
  – Hash tables
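A minimal sketch of an inverted file for one attribute, assuming records are Python dicts and that a record's position in the list serves as its pointer. The attribute values are organized in a hash table (a Python dict); the function name `build_inverted_file` is made up for this example.

```python
from collections import defaultdict

def build_inverted_file(records, attribute):
    """Map each attribute value to (list length, list of record pointers)."""
    postings = defaultdict(list)
    for rid, record in enumerate(records):
        postings[record[attribute]].append(rid)
    # Store the optional list length next to the pointer list.
    return {value: (len(rids), rids) for value, rids in postings.items()}

RECORDS = [
    {"name": "Smith", "salary": 40000, "age": 45, "dept": 3},
    {"name": "Dilbert", "salary": 35000, "age": 35, "dept": 4},
    {"name": "Wally", "salary": 35000, "age": 37, "dept": 4},
    {"name": "Dogbert", "salary": 45000, "age": 30, "dept": 5},
]

dept_index = build_inverted_file(RECORDS, "dept")
```

Looking up `dept_index[4]` yields the length 2 and the pointers to Dilbert and Wally without scanning the other records.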
![Page 7: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/7.jpg)
B-tree
• B = Bayer or ”Balanced”
  – Bayer: Binary B-Trees for Virtual Memory, ACM-SIGFIDET Workshop 1971
• Data structure
  – Balanced tree of order p
  – Node: <P1, <K1, Pr1>, P2, <K2, Pr2>, …, <Kq-1, Prq-1>, Pq>, where q ≤ p
  – For all search key fields X in subtree Pi: Ki-1 < X < Ki
• Algorithm
  – Guarantees logarithmic insert/delete time
  – Keeps tree balanced
![Page 8: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/8.jpg)
B-tree
[Figure: B-tree with root node (5, 8) and child nodes (1, 3), (6, 7), and (9, 12). Legend: Pr = data pointer, P = tree node pointer, null tree pointer.]
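Searching a B-tree with this node layout can be sketched as follows. The class and function names are assumptions, the data pointers are stand-in strings, and insertion with node splitting is left out; the example tree uses the keys 5, 8 in the root and (1, 3), (6, 7), (9, 12) in the children.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, data, children=None):
        self.keys = keys                # K1 < K2 < ... < Kq-1
        self.data = data                # Pr1 ... Prq-1, one data pointer per key
        self.children = children or []  # P1 ... Pq, empty list for a leaf

def btree_search(node, x):
    """Return the data pointer stored with key x, or None if x is absent."""
    while node is not None:
        i = bisect_left(node.keys, x)
        if i < len(node.keys) and node.keys[i] == x:
            return node.data[i]         # key found in this node
        if not node.children:
            return None                 # reached a leaf without finding x
        node = node.children[i]         # subtree Pi holds K(i-1) < X < Ki
    return None

def make_node(keys, children=None):
    # Hypothetical helper: data pointers are just labels like "rec-5".
    return BTreeNode(keys, [f"rec-{k}" for k in keys], children)

ROOT = make_node([5, 8], [make_node([1, 3]), make_node([6, 7]), make_node([9, 12])])
```

Each step descends one level, so the number of nodes visited is bounded by the height of the tree.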
![Page 9: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/9.jpg)
B-tree variants
• B+-tree (more commonly used than the B-tree)
  – Data pointers only at the leaf nodes
  – All leaf nodes linked together: allows ordered access
Internal node: <P1, K1, P2, K2, …, Pq-1, Kq-1, Pq>
Leaf node: <<K1,Pr1>, <K2, Pr2>, …, <Kq-1, Prq-1>, Pnext>
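The linked leaves are what make ordered access cheap: a range scan descends once to the first qualifying leaf and then follows the Pnext pointers. The sketch below assumes a hand-built two-level tree; `Leaf`, `Internal`, `find_leaf`, `range_scan`, and `build_example` are illustrative names, and insertion is not shown.

```python
from bisect import bisect_left

class Leaf:
    def __init__(self, keys, data):
        self.keys = keys    # K1 ... Kq-1
        self.data = data    # Pr1 ... Prq-1 (data pointers)
        self.next = None    # Pnext: next leaf in key order

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # K1 ... Kq-1
        self.children = children  # P1 ... Pq

def find_leaf(node, x):
    # Subtree Pi holds the keys X with K(i-1) < X <= Ki.
    while isinstance(node, Internal):
        node = node.children[bisect_left(node.keys, x)]
    return node

def range_scan(root, lo, hi):
    """Return (key, data pointer) pairs with lo <= key <= hi, in key order."""
    leaf = find_leaf(root, lo)
    result = []
    while leaf is not None:
        for key, pointer in zip(leaf.keys, leaf.data):
            if key > hi:
                return result
            if key >= lo:
                result.append((key, pointer))
        leaf = leaf.next          # follow the leaf chain
    return result

def build_example():
    leaves = [Leaf(k, [f"rec-{x}" for x in k]) for k in ([1, 3, 5], [6, 7, 8], [9, 12])]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return Internal([5, 8], leaves)
```

In this example, `range_scan(build_example(), 3, 9)` touches the internal node once and then walks all three leaves in order.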
![Page 10: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/10.jpg)
B+-tree
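[Figure: B+-tree node layouts. Internal node: keys K1 … Ki … Kq-1 with tree pointers P1 … Pi … Pq; subtree P1 holds X ≤ K1, subtree Pi holds Ki-1 < X ≤ Ki, and subtree Pq holds Kq-1 < X. Leaf node: entries <K1, Pr1> … <Ki, Pri> … <Kq-1, Prq-1>, where each Pr is a data pointer, followed by Pnext, a pointer to the next leaf node in the tree.]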
![Page 11: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/11.jpg)
B(+)-tree index SQL syntax
CREATE TABLE emp (
  ssn int(11) NOT NULL default '0',
  name text,
  PRIMARY KEY (ssn)
);
CREATE INDEX part_of_name_index ON emp (name(10));
![Page 12: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/12.jpg)
Multi dimensional index methods
• Point Access Methods
  – Grid files
  – k-D trees
• Spatial Access Methods
  – Space filling curves
  – R-trees
• Nearest-neighbor (similarity) search
![Page 13: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/13.jpg)
Applications
• GIS
• CAD
• Image analysis, computer vision
• Rule indexing
• Information Retrieval
• Multimedia databases
…
![Page 14: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/14.jpg)
Grid files
”Multi dimensional hashing”
• Partition address space:
  – Each cell corresponds to one disk page
  – Cuts allowed on predefined points only (¼, ½, ¾, …) on each axis
  – Each cut goes all the way across, so a grid is formed
[Figure: grid over the (age, name) space. Age axis ticks: 0, 25, 37.5, 50, 100; name axis ticks: A, M, Z.]
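A minimal sketch of the cell lookup in a grid file, assuming the directory is a dict from cell coordinates to a bucket standing in for a disk page. The cut points (25, 37.5, 50 on the age axis and M on the name axis) are read loosely from the figure's tick labels; the `GridFile` class is hypothetical, and bucket overflow with the resulting new cuts is not modeled.

```python
from bisect import bisect_right

class GridFile:
    def __init__(self, cuts_per_axis):
        self.cuts = cuts_per_axis  # one sorted list of cut points per axis
        self.directory = {}        # cell coordinates -> bucket ("disk page")

    def cell(self, point):
        # Along each axis, the cell number is the count of cuts <= the value.
        return tuple(bisect_right(cuts, v) for cuts, v in zip(self.cuts, point))

    def insert(self, point, record):
        self.directory.setdefault(self.cell(point), []).append((point, record))

    def lookup(self, point):
        bucket = self.directory.get(self.cell(point), [])
        return [record for p, record in bucket if p == point]

grid = GridFile([[25, 37.5, 50], ["M"]])   # axes: age, name
for age, name in [(45, "Smith"), (35, "Dilbert"), (37, "Wally"), (30, "Dogbert")]:
    grid.insert((age, name), f"rec-{name}")
```

An exact-match lookup computes one cell and reads one bucket, which is why the method is fast and simple; the directory, however, has one entry per cell and grows with the product of the number of cuts on every axis.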
![Page 15: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/15.jpg)
Grid files
• Shortcomings
  – Correlated values are handled poorly
  – Large directory is needed for high dimensionality
• On the other hand:
  – Fast
  – Simple
![Page 16: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/16.jpg)
k-D trees
• Binary search tree
  – Each level splits in one dimension
    • dimension 0 at level 0,
    • dimension 1 at level 1
    • … (round robin)
• Each internal node:
  – left pointer
  – right pointer
  – split value
  – data pointer
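The round-robin splitting can be sketched as below. In this simplified version each node stores a whole point, and the split value is that point's coordinate in the dimension of the node's level; the names `KDNode`, `kd_insert`, `kd_search`, and `kd_range` are assumptions, and no re-balancing is attempted.

```python
class KDNode:
    def __init__(self, point, data=None):
        self.point = point  # the split value is point[level % k]
        self.data = data    # data pointer
        self.left = None    # points with a smaller coordinate in the split dimension
        self.right = None   # points with a greater or equal coordinate

def kd_insert(node, point, data=None, level=0):
    if node is None:
        return KDNode(point, data)
    axis = level % len(point)  # round robin over the dimensions
    if point[axis] < node.point[axis]:
        node.left = kd_insert(node.left, point, data, level + 1)
    else:
        node.right = kd_insert(node.right, point, data, level + 1)
    return node

def kd_search(node, point):
    level = 0
    while node is not None and node.point != point:
        axis = level % len(point)
        node = node.left if point[axis] < node.point[axis] else node.right
        level += 1
    return node

def kd_range(node, lo, hi, level=0):
    """Return all stored points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if node is None:
        return []
    axis = level % len(lo)
    found = []
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        found.append(node.point)
    if lo[axis] < node.point[axis]:    # the range reaches into the left half
        found += kd_range(node.left, lo, hi, level + 1)
    if hi[axis] >= node.point[axis]:   # the range reaches into the right half
        found += kd_range(node.right, lo, hi, level + 1)
    return found

root = None
for p in [(20, 30), (10, 10), (40, 50)]:
    root = kd_insert(root, p)
```

Inserting the same points in sorted order would produce a degenerate, list-like tree, which illustrates why incremental updates can unbalance the structure.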
![Page 17: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/17.jpg)
k-D trees
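[Figure: k-D tree for the points (10,10), (20,30), and (40,50) over the attributes A1 and A2, shown both as a partition of the plane (axis ticks at 20 and 40) and as a tree whose levels alternate between split conditions on A1 and A2.]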
![Page 18: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/18.jpg)
k-D trees
• Shortcomings
  – Incremental inserts/deletes can unbalance the tree
  – Re-balancing is difficult
    • May require re-constructing the tree from scratch
![Page 19: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/19.jpg)
Space filling curves
Idea: Impose a linear ordering on multi-dimensional data
Allows for one-dimensional index and search on multi-dimensional data
• Z-ordering
![Page 20: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/20.jpg)
![Page 21: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/21.jpg)
![Page 22: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/22.jpg)
![Page 23: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/23.jpg)
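[Figure: Z-ordering on a 4×4 grid. The X and Y axes are both labelled 00, 01, 10, 11; the z-value scale has ticks at 0, 4, 8, 12, 16.]
z_O = shuffle("1,2,1,2", x_O, y_O) = shuffle("1,2,1,2", 00, 11) = 0101 = (5)_10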
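The shuffle in the formula interleaves the bits of the two coordinates, taking one bit from x, then one from y, and so on. A small sketch, assuming both coordinates have the same number of bits; the function name `z_value` is made up here.

```python
def z_value(x, y, bits):
    """Interleave the bits of x and y (x bit first), most significant bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)  # next bit of x
        z = (z << 1) | ((y >> i) & 1)  # next bit of y
    return z

# The example from the slide: x = 00, y = 11 gives 0101 = 5.
example = z_value(0b00, 0b11, bits=2)
```

Because the result is a single integer, the z-values can be stored in an ordinary one-dimensional index such as a B+-tree.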
![Page 24: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/24.jpg)
Hilbert curves
• Z-ordering has long diagonal jumps in space
  – Connected objects are split and end up far apart
  – Distances are not preserved
• Hilbert curves preserve distances better
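The difference between the two curves can be checked numerically. The sketch below uses a commonly published iterative conversion from grid coordinates to Hilbert values, which is an assumption here since the slides do not say which algorithm they have in mind, and compares the largest spatial step between consecutive curve positions with that of Z-ordering. All function names are illustrative, and n must be a power of two.

```python
def hilbert_value(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n-by-n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:          # rotate/flip the quadrant so the sub-curves connect
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def z_value(n, x, y):
    """Position of cell (x, y) along a Z-order curve on an n-by-n grid."""
    z = 0
    for i in reversed(range(n.bit_length() - 1)):
        z = (z << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return z

def max_step(curve, n):
    """Largest Manhattan distance between two consecutive cells on the curve."""
    cells = sorted((curve(n, x, y), (x, y)) for x in range(n) for y in range(n))
    return max(abs(a[0] - b[0]) + abs(a[1] - b[1])
               for (_, a), (_, b) in zip(cells, cells[1:]))
```

On an 8×8 grid the Hilbert curve always moves to a neighboring cell, while Z-ordering makes jumps spanning several cells.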
![Page 25: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/25.jpg)
![Page 26: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/26.jpg)
![Page 27: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/27.jpg)
![Page 28: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/28.jpg)
![Page 29: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/29.jpg)
![Page 30: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/30.jpg)
![Page 31: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/31.jpg)
Space filling curves
• ”Quick” algorithm:
  – O(b) for calculating values
  – b – number of bits of the z/Hilbert value
  – Typically, b = x·D
  – x – size of one dimension (in bits), D – number of dimensions
![Page 32: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/32.jpg)
R-trees
• B-trees in multiple dimensions
• Spatial object represented by its MBR (Minimum Bounding Rectangle)
![Page 33: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/33.jpg)
R-trees
– Nonleaf nodes
  • <ptr, R>
    – ptr – pointer to a child node
    – R – MBR covering all rectangles in the child node
– Leaf nodes
  • <obj-id, R>
    – obj-id – pointer to object
    – R – MBR of the object
![Page 34: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/34.jpg)
R-trees
• Algorithms
  – Insert
    • Find the most suitable leaf node
    • Possibly, extend MBRs in parent nodes to enclose the new object
    • Leaf node overflow leads to a split
  – Split
    • Heuristics based (possible propagation upwards)
![Page 35: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/35.jpg)
R-trees
• Range queries
  – Traverse the tree
    • Compare query MBR with the current node’s MBR
• Nearest neighbor
  – Branch and bound:
    • Traverse the most promising sub-tree
      – Find neighbors
      – Estimate best and worst case
    • Traverse the other sub-trees
      – Prune according to obtained thresholds
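Both query types can be sketched on a small hand-built R-tree. Rectangles are assumed to be tuples (xmin, ymin, xmax, ymax); the nearest-neighbor search uses the minimum distance from the query point to an MBR as its lower bound and, as a simplification, treats the distance to an object's MBR as the distance to the object. Class and function names are illustrative, and insertion and splitting are not shown.

```python
class RNode:
    def __init__(self, entries, leaf):
        self.entries = entries  # (R, child node) pairs, or (R, obj-id) pairs in a leaf
        self.leaf = leaf

def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(node, query):
    hits = []
    for rect, item in node.entries:
        if intersects(rect, query):   # otherwise the whole subtree is skipped
            if node.leaf:
                hits.append(item)
            else:
                hits.extend(range_query(item, query))
    return hits

def mindist(point, rect):
    dx = max(rect[0] - point[0], 0, point[0] - rect[2])
    dy = max(rect[1] - point[1], 0, point[1] - rect[3])
    return (dx * dx + dy * dy) ** 0.5

def nearest(node, point, best=(float("inf"), None)):
    """Return (distance, obj-id) of the entry closest to point."""
    # Most promising entries first; stop once the lower bound cannot beat best.
    for rect, item in sorted(node.entries, key=lambda e: mindist(point, e[0])):
        d = mindist(point, rect)
        if d >= best[0]:
            break                     # prune this entry and all later ones
        best = (d, item) if node.leaf else nearest(item, point, best)
    return best

leaf_a = RNode([((0, 0, 2, 2), "a"), ((1, 3, 3, 5), "b")], leaf=True)
leaf_b = RNode([((6, 6, 8, 8), "c"), ((7, 1, 9, 3), "d")], leaf=True)
root = RNode([(mbr([r for r, _ in leaf.entries]), leaf) for leaf in (leaf_a, leaf_b)],
             leaf=False)
```

For the query point (5, 2), the sub-tree under `leaf_b` is visited first, and the result found there is good enough to prune `leaf_a` without opening it.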
![Page 36: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/36.jpg)
R-trees
• Spatial joins: ”find intersecting objects”
  – Naïve method:
    • Build a list of pairs of intersecting MBRs
    • Examine each pair, down to leaf level
  (Faster methods exist)
![Page 37: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/37.jpg)
Variants
• R+-tree (Sellis et al. 1987)
  – Avoids overlapping rectangles in internal nodes
• R*-tree (Beckmann et al. 1990)
![Page 38: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/38.jpg)
Applications
• Spatial databases
• Text retrieval
• Multimedia retrieval
![Page 39: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/39.jpg)
Text retrieval
• Full text scanning
  – Somewhat like sequence analysis in bioinformatics
• Inversion
  – Build an index using keywords
• Signature files
  – A hash-like structure for quick filtering of non-relevant material
• Vector space model
  – Document clustering
• Performance measures
  – Precision, recall, average precision
![Page 40: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/40.jpg)
Vector space model
• Hypothesis: closely associated documents are relevant to the same requests
• Method:
  – For each document, generate a histogram vector containing word counts; each bin counts one word
  – Group documents together in clusters, based on histogram vector similarity
    • Popular metric: cosine similarity
      $$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
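A short sketch of the histogram vector and the cosine similarity, assuming whitespace tokenization and lower-casing (a real system would typically also remove stop words and apply stemming). The function names are made up for this example.

```python
from collections import Counter
from math import sqrt

def histogram(text):
    """Word-count vector: one bin per distinct word."""
    return Counter(text.lower().split())

def cosine(x, y):
    """cos(x, y) = (x . y) / (|x| |y|) for two sparse count vectors."""
    dot = sum(count * y[word] for word, count in x.items() if word in y)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

d1 = histogram("spatial index tree")
d2 = histogram("spatial database query")
d3 = histogram("music video clip")
```

Here `cosine(d1, d2)` is 1/3 because the documents share one of three words each, while `cosine(d1, d3)` is 0 because they share none.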
![Page 41: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/41.jpg)
Vector space model
• Given a query phrase q
  – Generate a histogram vector of q
  – Compute similarity between q and all document cluster centroids
  – Compute similarity between q and all documents in the relevant clusters
  – Return a list of documents in descending similarity
[Figure: query q and the resulting retrieval list]
![Page 42: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/42.jpg)
Relevance feedback
– User pinpoints the most relevant documents
– The histogram vectors of these documents are added to the original query vector q, giving q’
– Similarity computations based on q’
– A new improved retrieval list is presented to the user
[Figure: query q, refined query q’, and the resulting retrieval list]
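One round of relevance feedback might look like the sketch below, assuming that "added" means a plain unweighted sum of word-count vectors and that documents are ranked directly by cosine similarity, without the clustering step. The document collection and the names `rank` and `feedback` are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(x, y):
    dot = sum(count * y[word] for word, count in x.items() if word in y)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Document names in descending similarity to the query vector."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

def feedback(query, relevant_vectors):
    """Build q' by adding the vectors of the user-selected documents to q."""
    refined = Counter(query)
    for vector in relevant_vectors:
        refined.update(vector)  # Counter.update adds the counts
    return refined

DOCS = {
    "d1": Counter("spatial index tree".split()),
    "d2": Counter("spatial database query".split()),
    "d3": Counter("tree index tree index".split()),
    "d4": Counter("music video".split()),
}

q = Counter(["spatial"])
first_list = rank(q, DOCS)
q_refined = feedback(q, [DOCS["d1"]])  # the user marks d1 as most relevant
second_list = rank(q_refined, DOCS)
```

With the original query, d3 scores zero because it does not contain the query word; after feedback on d1 it moves up to second place because it shares the words "tree" and "index" with the selected document.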
![Page 43: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/43.jpg)
Retrieval performance
Precision p
The proportion of retrieved material that is relevant.
Given a retrieval list of n items,
$$p = \frac{g(n)}{n}$$
where g(n) is the number of items in the list relevant to the query.
![Page 44: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/44.jpg)
Retrieval performance
Average precision pavg
How the relevant items are distributed in the retrieval list.
• R – the number of relevant items in the retrieval list
• n_i – the rank of each relevant item, 1 ≤ i ≤ R
• For each n_i, calculate p_{n_i} – the precision of the partial list of the top n_i items
• The average precision is the average of all p_{n_i}:
$$p_{avg} = \frac{1}{R} \sum_{i=1}^{R} p_{n_i}$$
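Both measures are easy to compute when the retrieval list is represented as a list of booleans, one per rank, telling whether the item at that rank is relevant. That representation and the function names are assumptions of this sketch.

```python
def precision(relevance):
    """p = g(n) / n for a retrieval list given as relevance flags."""
    return sum(relevance) / len(relevance)

def average_precision(relevance):
    """Mean of the precision values at the ranks of the relevant items."""
    ranks = [n for n, is_relevant in enumerate(relevance, start=1) if is_relevant]
    if not ranks:
        return 0.0
    return sum(precision(relevance[:n]) for n in ranks) / len(ranks)

early = [True, False, True, False]   # relevant items at ranks 1 and 3
late = [False, False, True, True]    # relevant items at ranks 3 and 4
```

Both lists have precision 0.5, but the average precision is (1 + 2/3)/2 = 5/6 for `early` and (1/3 + 2/4)/2 = 5/12 for `late`, reflecting that the first list places its relevant items nearer the top.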
![Page 45: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/45.jpg)
Multimedia databases
• Data structures
  – Bitmap image: 2D (3D) array of pixels
  – Sound clip/song: Sequence of samples
  – Video: Sequence of images
• User requirements
  – Music written by a particular artist
  – Texture similarity
  – ”Fuzzy” requirements, e.g. musical preference
![Page 46: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/46.jpg)
Multimedia databases
• Meta data queries
  – Images and video described by text
    • Figure captions
    • Keywords
    • Associated paragraphs
  – Retrieval based on text
    • Keywords
    • Textual features
![Page 47: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/47.jpg)
Features
• Images
  – Color of pixels
  – Line segments and edges
  – Texture
  – Shape
• Sound
  – Spectral content
  – Rhythm (music)
• Video
  – Motion
![Page 48: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/48.jpg)
Color
• Perception-based models:
  – CIE chromaticity (X,Y,Z)
  – Opponent color model: Luv
  – Hue, saturation, value or brightness
• Hardware-oriented models: RGB, CMY
• Color histograms
  – Relative frequency distribution of each color dimension
  – Compute similarity between corresponding histograms of each color dimension
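A sketch of per-channel color histograms for RGB images, assuming an image is a flat list of (r, g, b) tuples with 8-bit values. The slides do not name a similarity measure, so histogram intersection is used here as one common choice; the bin count and function names are likewise assumptions.

```python
def channel_histogram(pixels, channel, bins=4, max_value=256):
    """Relative frequency distribution of one color dimension."""
    counts = [0] * bins
    for pixel in pixels:
        counts[pixel[channel] * bins // max_value] += 1
    return [count / len(pixels) for count in counts]

def histogram_intersection(h1, h2):
    """1.0 for identical distributions, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def color_similarity(pixels_a, pixels_b, bins=4):
    """Average similarity of the corresponding histograms of each channel."""
    scores = [
        histogram_intersection(channel_histogram(pixels_a, c, bins),
                               channel_histogram(pixels_b, c, bins))
        for c in range(3)
    ]
    return sum(scores) / len(scores)

red_image = [(255, 0, 0)] * 16
blue_image = [(0, 0, 255)] * 16
```

The all-red and all-blue images agree only in the green channel, where both are zero, so their similarity is 1/3. Note that histograms discard where in the image each color occurs.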
![Page 49: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/49.jpg)
Histogram
![Page 50: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/50.jpg)
Texture representation
• Pixel based
  – Co-occurrence matrix
  – Markov models
  – Auto-regressive models
• Pattern properties
  – Contrast
  – Orientation
  – PCA
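A gray-level co-occurrence matrix counts how often a pixel with level i has a neighbor with level j at a fixed offset. The sketch below assumes a small image given as a list of rows of integer gray levels and counts ordered pairs for one offset; the contrast formula shown is one common statistic derived from such a matrix and is an assumption here, since the slides list contrast as a separate pattern property without defining it.

```python
def cooccurrence(image, levels, dx=1, dy=0):
    """m[i][j] = number of pixels with level i whose (dx, dy) neighbor has level j."""
    matrix = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for r in range(height):
        for c in range(width):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < height and 0 <= c2 < width:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix

def contrast(matrix):
    """Average squared gray-level difference over all counted pixel pairs."""
    total = sum(map(sum, matrix))
    weighted = sum((i - j) ** 2 * count
                   for i, row in enumerate(matrix)
                   for j, count in enumerate(row))
    return weighted / total

IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

horizontal = cooccurrence(IMAGE, levels=4)
```

Most of the mass in `horizontal` lies on the diagonal, which indicates a smooth texture in the horizontal direction; a different offset such as `dx=0, dy=1` probes the vertical direction instead.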
![Page 51: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/51.jpg)
Textures
![Page 52: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/52.jpg)
Shapes, regions
• Image analysis methods
  – Description of regions
    • Moments or normalized moments
    • 2-D transforms
  – Description of boundaries
    • Chain encoding
    • Fourier descriptors
    • Skeletons
  – Regions
    • Edge detection
    • Corner detection
    • Edge linking
    • Region segmentation
    • Region description
![Page 53: Spatial, text, and multimedia databases Erik Zeitler UDBL.](https://reader036.fdocuments.in/reader036/viewer/2022062313/56649d5e5503460f94a3d79a/html5/thumbnails/53.jpg)
Video
• Segments, scenes, and basic frames
• Transitions
• Motion
  – Motion of objects
  – Camera
• Compression standards
  – MPEG-2 – Region coding and motion compensation
  – MPEG-4 – Content-based compression and synthetic data representation
  – MPEG-7 – Standardization of structures and arbitrary description schemes