Result Page Analysis (Cheng Wang)


Page 1: Result Page Analysis (Cheng Wang)

Cheng Wang

Page 2: Result Page Analysis (Cheng Wang)
Page 3: Result Page Analysis (Cheng Wang)

- A list of results decorated with:
  - Side bars
  - Branding banners
  - Advertisements
  - Merchant information
  - Search forms
  - Navigation parts

Page 4: Result Page Analysis (Cheng Wang)

- Data Area Identification
- Record Segmentation
- Data Alignment

Page 5: Result Page Analysis (Cheng Wang)
Page 6: Result Page Analysis (Cheng Wang)
Page 7: Result Page Analysis (Cheng Wang)

- Visual information
  - ViDE, ViPER
- Ontology
  - ODE
- HTML page based
  - FiVaTech
- Regular expression
  - EXALG, DeLa

Page 8: Result Page Analysis (Cheng Wang)

- Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.

1. Domain ontology construction
   - Query interface
   - Query result pages
2. Data extraction using the ontology
   - Identify the data area
   - Segment records
   - Align data values

Page 9: Result Page Analysis (Cheng Wang)
Page 10: Result Page Analysis (Cheng Wang)
Page 11: Result Page Analysis (Cheng Wang)

- Multiple query result pages
  - PADE

Page 12: Result Page Analysis (Cheng Wang)
Page 13: Result Page Analysis (Cheng Wang)

1. Match query interface elements to data values.
   - title="%orientalism%"
2. Search for voluntary labels in table headers.
3. Search for voluntary labels encoded together with data values.
   - ISBN No: 0814756654
   - ISBN No: 0789204592
4. Data value formats
   - 18/09/2008 : 20080918
   - 03/18/98 : 19980318
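Steps 3 and 4 can be illustrated with a minimal Python sketch. This is not the ODE implementation: the function names `split_voluntary_label` and `normalize_date`, the regular expression, and the list of candidate date formats are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern: a short alphabetic label, a colon, then the value.
LABEL_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z .]*?)\s*:\s*(\S.*)$")

# Assumed candidate formats, tried in order; four-digit years first.
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%d/%m/%y", "%m/%d/%y")


def split_voluntary_label(text):
    """Split 'Label: value' into (label, value); label is None if absent."""
    match = LABEL_VALUE.match(text)
    if match is None:
        return None, text.strip()
    return match.group(1), match.group(2).strip()


def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no candidate format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # try the next candidate format
    return None


print(split_voluntary_label("ISBN No: 0814756654"))
print(normalize_date("18/09/2008"))
print(normalize_date("03/18/98"))
```

Trying formats in a fixed order means an ambiguous value such as 03/04/2008 is read as day/month; a real extractor would need evidence from other values in the same column to decide.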

Page 14: Result Page Analysis (Cheng Wang)

1. Value-level matching
   - Data value similarity
2. Label-level matching
   - Label co-occurrence
3. Label-value matching
   - Check the assigned label
   - Assign a suitable label to columns
   - Matching conflict resolution
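Value-level matching can be sketched as a set-overlap comparison between columns. The slides do not say which similarity measure is used, so Jaccard similarity, the `threshold` value, and the names `jaccard` and `match_columns` are assumptions; the sample values are made up.

```python
def jaccard(values_a, values_b):
    """Jaccard similarity of two collections of data values."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_columns(columns_a, columns_b, threshold=0.5):
    """Pair up columns from two sources whose values overlap enough.

    columns_a / columns_b map a label to the list of values seen under it.
    """
    matches = []
    for label_a, values_a in columns_a.items():
        for label_b, values_b in columns_b.items():
            score = jaccard(values_a, values_b)
            if score >= threshold:
                matches.append((label_a, label_b, score))
    return matches


# Made-up sample data: two sites label the same kind of column differently.
site_a = {"Category": ["history", "art", "fiction"]}
site_b = {"Subject": ["history", "art", "science"], "Price": ["10.00"]}
print(match_columns(site_a, site_b))
```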

Page 15: Result Page Analysis (Cheng Wang)
Page 16: Result Page Analysis (Cheng Wang)
Page 17: Result Page Analysis (Cheng Wang)

1. Matching is unique → create attribute
2. Matching is 1:1 → alias
   - Category : Subject
3. Matching is 1:n → n+1 attributes
   - Author: {Last Name, First Name}
4. Matching is n:m → n:1 + 1:m

Page 18: Result Page Analysis (Cheng Wang)
Page 19: Result Page Analysis (Cheng Wang)

- One result page → one data area
- Maximum Entropy Model
  - Maximum Correlation Subtree Identification

Page 20: Result Page Analysis (Cheng Wang)

- 1 result
- Several results (CABABABAD)
  - Find continuous repeated patterns
  - Visual gap
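Finding a continuous repeated pattern in a sequence such as CABABABAD can be sketched with a brute-force search. This is an illustrative assumption, not the segmentation algorithm from the paper: the function name `find_repeated_pattern` and the tie-breaking rule (longest covered span, then shortest unit, then earliest start) are chosen here for simplicity.

```python
def find_repeated_pattern(tags):
    """Find the unit whose back-to-back repetitions cover the longest span.

    Returns (unit, start, count), or None if nothing repeats consecutively.
    """
    n = len(tags)
    best = None  # (covered_length, unit, start, count)
    for period in range(1, n // 2 + 1):
        for start in range(0, n - 2 * period + 1):
            unit = tags[start:start + period]
            count = 1
            # Extend while the next slice of the same length equals the unit.
            while tags[start + count * period:start + (count + 1) * period] == unit:
                count += 1
            covered = count * period
            if count >= 2 and (best is None or covered > best[0]):
                best = (covered, unit, start, count)
    if best is None:
        return None
    _, unit, start, count = best
    return unit, start, count


print(find_repeated_pattern("CABABABAD"))  # ('AB', 1, 3)
```

Here each letter stands for one kind of sibling node; the unit AB repeated three times would correspond to three records, with C and D left outside the data area.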

Page 21: Result Page Analysis (Cheng Wang)

- Each data value is assigned a label
  - Maximum Entropy Model
  - Match with ontology
- Label → Column
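Once every data value carries a label, mapping labels to columns reduces to a simple table build. The sketch below assumes each record is a list of (label, value) pairs; the name `align_records` and the sample records are hypothetical.

```python
def align_records(records):
    """Turn labelled records into a table with one column per label.

    Returns (columns, rows); a record without a value for a label gets None.
    """
    columns = []
    for record in records:
        for label, _ in record:
            if label not in columns:
                columns.append(label)  # keep first-seen order
    rows = []
    for record in records:
        by_label = dict.fromkeys(columns)  # every column starts as None
        for label, value in record:
            by_label[label] = value
        rows.append([by_label[label] for label in columns])
    return columns, rows


# Made-up records: the second one has no Author value.
records = [
    [("Title", "Book A"), ("Author", "Author A"), ("Price", "12.00")],
    [("Title", "Book B"), ("Price", "9.50")],
]
columns, rows = align_records(records)
print(columns)
print(rows)
```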

Page 22: Result Page Analysis (Cheng Wang)
Page 23: Result Page Analysis (Cheng Wang)

- Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.
- ViDRE: Data Record Extractor
- ViDIE: Data Item Extractor
- New measure: revision

Page 24: Result Page Analysis (Cheng Wang)

1. Build a Visual Block tree
2. Extract data records
   - Noise block filtering
   - Block clustering
   - Block regrouping
3. Partition data records into data items and align them

Page 25: Result Page Analysis (Cheng Wang)
Page 26: Result Page Analysis (Cheng Wang)
Page 27: Result Page Analysis (Cheng Wang)

- Mandatory data items
- Optional data items
- Static data items

Page 28: Result Page Analysis (Cheng Wang)

- Simple one-pass clustering algorithm
  - Take the first block from the list and use it to form a cluster.
  - For each remaining block, compute its similarity to the existing clusters.
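A minimal sketch of one-pass clustering follows. The slide does not state what happens after the similarities are computed; the sketch assumes the usual rule that a block joins the most similar cluster if the score reaches a threshold and otherwise starts a new cluster. The names `one_pass_cluster` and `field_overlap`, the average-similarity scoring, and the toy block representation are all assumptions.

```python
def one_pass_cluster(blocks, similarity, threshold):
    """Cluster blocks in a single pass over the list."""
    clusters = []
    for block in blocks:
        best_index, best_score = None, 0.0
        for index, cluster in enumerate(clusters):
            # Assumed scoring: average similarity to the cluster's members.
            score = sum(similarity(block, member) for member in cluster) / len(cluster)
            if best_index is None or score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= threshold:
            clusters[best_index].append(block)
        else:
            clusters.append([block])  # the first block always lands here
    return clusters


def field_overlap(a, b):
    """Toy similarity: fraction of positions where two tuples agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy blocks described by (font, content type); real blocks carry visual features.
blocks = [("bold", "text"), ("plain", "text"), ("bold", "text"),
          ("plain", "link"), ("plain", "text")]
print(one_pass_cluster(blocks, field_overlap, threshold=0.75))
```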

Page 29: Result Page Analysis (Cheng Wang)

- ViDE assumes:
  1. Blocks in the same cluster all come from different data records.
  2. The cluster with the maximum number n of blocks may contain the mandatory values of the data records.

Page 30: Result Page Analysis (Cheng Wang)

- Step 1: Rearrange the blocks in each cluster.
- Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
- Step 3: For every block (in all clusters), determine which group it belongs to.
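The three regrouping steps can be sketched as below. The slides do not specify the ordering used in Step 1 or the assignment rule in Step 3, so this sketch assumes blocks are ordered by vertical position and each block joins the group whose seed block is vertically nearest. The `Block` class, its `top` field, and the function `regroup` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    top: int  # assumed vertical position on the rendered page


def regroup(clusters):
    """Regroup clustered blocks into one group per data record."""
    # Step 1: rearrange the blocks in each cluster (here: top to bottom).
    ordered = [sorted(cluster, key=lambda b: b.top) for cluster in clusters]
    # Step 2: the largest cluster is the seed; one group per seed block.
    seed = max(ordered, key=len)
    groups = [[seed_block] for seed_block in seed]
    # Step 3: assign every other block to the group of the nearest seed.
    for cluster in ordered:
        if cluster is seed:
            continue
        for block in cluster:
            nearest = min(range(len(seed)),
                          key=lambda i: abs(seed[i].top - block.top))
            groups[nearest].append(block)
    return groups


titles = [Block("Title 1", 10), Block("Title 2", 110), Block("Title 3", 210)]
prices = [Block("Price 3", 240), Block("Price 1", 40)]  # record 2 has no price
for group in regroup([titles, prices]):
    print([b.text for b in group])
```

With these made-up positions the title cluster acts as the seed (a mandatory item present in every record), and the optional price blocks attach to records 1 and 3.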

Page 31: Result Page Analysis (Cheng Wang)
Page 32: Result Page Analysis (Cheng Wang)
Page 33: Result Page Analysis (Cheng Wang)

- WDBt: total number of web databases processed
- WDBc: number of web databases whose precision and recall are both 100%
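The slide defines only the two counts. Assuming the revision measure is the share of web databases that are not extracted perfectly, that is (WDBt − WDBc) / WDBt, it can be computed as in this sketch. The formula is stated here as an assumption about the measure, and the numbers in the usage line are made up.

```python
def revision(wdb_total, wdb_correct):
    """Fraction of web databases that still need manual revision.

    wdb_total:   WDBt, total number of web databases processed
    wdb_correct: WDBc, databases with both precision and recall at 100%
    """
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_correct <= wdb_total:
        raise ValueError("wdb_correct must lie between 0 and wdb_total")
    return (wdb_total - wdb_correct) / wdb_total


print(revision(100, 90))  # made-up counts
```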

Page 34: Result Page Analysis (Cheng Wang)
Page 35: Result Page Analysis (Cheng Wang)
Page 36: Result Page Analysis (Cheng Wang)

[Figure: DOM tree. Root → Data Area (LCA) → Record nodes alternating with Separator nodes.]

Page 37: Result Page Analysis (Cheng Wang)

- Real-estate domain
- 60 agents' websites
  - MRP: 95.0%
  - ERP: 90.0%

Page 38: Result Page Analysis (Cheng Wang)

[Figure: DOM tree. Root → Data Area → Record 1 Part A, Record 1 Part B, Record 2 Part A, Record 2 Part B, Record 3 Part A, Record 3 Part B; each record is split into two parts.]

Page 39: Result Page Analysis (Cheng Wang)

- DIADEM 0.1:
  - Construct real-estate result page ontology
  - Ontological record segmentation
    - (More features)
  - Data labeling and data alignment
- After:
  - Add visual information

Page 40: Result Page Analysis (Cheng Wang)