Result Page Analysis (Cheng Wang)

Cheng Wang

• A list of results decorated with
  - Side bars
  - Branding banners
  - Advertisements
  - Merchant information
  - Search forms
  - Navigation parts

• Data Area Identification
• Record Segmentation
• Data Alignment

• Visual information
  - ViDE, VIPER
• Ontology
  - ODE
• HTML page based
  - FiVaTech
• Regular expression
  - EXALG, DELA

• Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.
• 1. Domain ontology construction
  - Query interface
  - Query result pages
• 2. Data extraction using the ontology
  - Identify data area
  - Segment records
  - Align data values
• Multiple query result pages
  - PADE

• 1. Match query interface elements to data values.
  - title=“%orientalism%”
• 2. Search for voluntary labels in table headers.
• 3. Search for voluntary labels encoded together with data values.
  - ISBN No: 0814756654
  - ISBN No: 0789204592
• 4. Data value formats
  - 18/09/2008 : 20080918
  - 03/18/98 : 19980318
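Heuristics 3 and 4 above can be sketched in a few lines of Python. This is only an illustration, not the method from the cited paper: the function names `split_encoded_label` and `normalize_date`, the use of a colon as the label separator, and the list of candidate date formats are all assumptions made for the example.

```python
import os
from datetime import datetime


def split_encoded_label(values):
    """Split a label that is encoded together with every data value.

    The character-wise common prefix of the two ISBN strings on the slide
    is "ISBN No: 0" (both numbers start with 0), so the prefix is cut
    back to its last colon before it is used as the label.
    """
    prefix = os.path.commonprefix(values)
    cut = prefix.rfind(":")
    if cut == -1:
        return None, list(values)  # no shared "label:" prefix found
    return prefix[:cut].strip(), [v[cut + 1:].strip() for v in values]


# Candidate formats are an assumption; ambiguous inputs such as 03/04/2008
# are resolved by the first format that parses (day first here).
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%d/%m/%y", "%m/%d/%y")


def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no candidate format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


label, values = split_encoded_label(
    ["ISBN No: 0814756654", "ISBN No: 0789204592"]
)
print(label, values)
print(normalize_date("18/09/2008"), normalize_date("03/18/98"))
```

Normalizing every value to one canonical form lets values from different sources be compared directly, whatever format each site displays.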

• 1. Value-level matching
  - Data value similarity
• 2. Label-level matching
  - Label co-occurrence
• 3. Label-value matching
  - Check assigned label
  - Assign a suitable label for columns
  - Matching conflict resolution
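As a rough illustration of value-level matching, the sketch below compares two columns by the overlap of their values. The slides do not say which similarity measure is used, so the Jaccard coefficient, the `threshold` value and the function names are assumptions chosen for the example.

```python
def value_similarity(column_a, column_b):
    """Jaccard overlap of two columns' values, ignoring case and padding."""
    a = {v.strip().lower() for v in column_a}
    b = {v.strip().lower() for v in column_b}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_columns(source_a, source_b, threshold=0.3):
    """Return (label_a, label_b, score) for column pairs that look alike.

    Each source maps a column label to that column's data values.
    """
    matches = []
    for label_a, values_a in source_a.items():
        for label_b, values_b in source_b.items():
            score = value_similarity(values_a, values_b)
            if score >= threshold:
                matches.append((label_a, label_b, score))
    return matches


site_1 = {"Category": ["History", "Fiction", "Art"]}
site_2 = {"Subject": ["history", "art", "science"], "Price": ["$10", "$12"]}
print(match_columns(site_1, site_2))
```

In this toy input the two columns share two of four distinct values, so "Category" and "Subject" are reported as a candidate match while "Price" is not.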

• 1. Matching is unique → create attribute
• 2. Matching is 1:1 → alias
  - Category : Subject
• 3. Matching is 1:n → n+1 attributes
  - Author: {Last Name, First Name}
• 4. Matching is n:m → n:1 + 1:m
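The four cases above can be read as a small decision procedure. The sketch below is one possible reading, not the paper's implementation: the function name `resolve_matching`, its return shape, and in particular the `hub` parameter (a caller-supplied name for the shared attribute in the n:m case) are hypothetical, since the slide only states that n:m is treated as n:1 plus 1:m.

```python
def resolve_matching(left, right, hub=None):
    """Classify one matching and list the ontology attributes to create.

    left, right: labels on the two sides of the matching.
    Returns (case, attributes, aliases), where aliases maps an alias
    label to the attribute it stands for.
    """
    n, m = len(left), len(right)
    if n == 0 and m == 0:
        raise ValueError("a matching needs at least one label")
    if n == 0 or m == 0:
        # Unique: the label matches nothing else, so it becomes an attribute.
        return "unique", list(left) + list(right), {}
    if n == 1 and m == 1:
        # 1:1: keep one attribute and record the other label as its alias.
        return "1:1", [left[0]], {right[0]: left[0]}
    if n == 1 or m == 1:
        # 1:n: the composite label plus its n components, n+1 attributes.
        one, many = (left, right) if n == 1 else (right, left)
        return "1:n", list(one) + list(many), {}
    # n:m: treated as n:1 plus 1:m around a shared attribute (assumed here
    # to be named by the caller).
    if hub is None:
        raise ValueError("n:m matching needs a hub attribute name")
    return "n:m", [hub] + list(left) + list(right), {}


print(resolve_matching(["Category"], ["Subject"]))
print(resolve_matching(["Author"], ["Last Name", "First Name"]))
```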

• One result page → one data area
• Maximum Entropy Model
  - Maximum Correlation Subtree Identification

• 1 result
• Several results (CABABABAD)
  - Find continuous repeated patterns
  - Visual gap
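To make "continuous repeated patterns" concrete, the sketch below searches a tag string such as CABABABAD for the unit that repeats back to back over the longest stretch. It is a brute-force illustration, assuming each subtree under the data area has already been mapped to a single letter; the function name is made up for the example.

```python
def continuous_repeats(tags):
    """Return the shortest units whose consecutive repeats cover the most tags.

    A unit must occur at least twice in a row to count. Several units can
    tie, in which case all of them are returned in sorted order.
    """
    n = len(tags)
    best_cover, best_size, best = 0, 0, set()
    for size in range(1, n // 2 + 1):
        for start in range(0, n - 2 * size + 1):
            unit = tags[start:start + size]
            count, pos = 1, start + size
            # Count how many times the unit repeats immediately after itself.
            while tags[pos:pos + size] == unit:
                count += 1
                pos += size
            cover = count * size
            if count < 2:
                continue
            if cover > best_cover:
                best_cover, best_size, best = cover, size, {unit}
            elif cover == best_cover and size == best_size:
                best.add(unit)
    return sorted(best)


print(continuous_repeats("CABABABAD"))
```

For the slide's example both AB and BA repeat three times, so the tag sequence alone cannot say where a record starts. That ambiguity is presumably why the slide lists the visual gap between records as a second cue.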

• Each data value is assigned a label
  - Maximum Entropy Model
  - Match with ontology
• Label → Column

• Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.
• ViDRE: Data Record Extractor
• ViDIE: Data Item Extractor
• New measure: revision

• 1. Build a Visual Block tree
• 2. Extract data records
  - Noise block filtering
  - Block clustering
  - Block regrouping
• 3. Partition data records into data items and align them

• Mandatory data items
• Optional data items
• Static data items

• Simple one-pass clustering algorithm
  - Take the first block from the list and use it to form a cluster.
  - For each remaining block, compute its similarity to the existing clusters.
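A minimal sketch of such a one-pass clustering follows. The slide stops at "compute similarities", so the rest is assumed: a block joins its most similar cluster when the similarity reaches a threshold and otherwise starts a new cluster. The feature dictionaries, the `block_similarity` measure and the `threshold` value are illustrative stand-ins for ViDE's visual features, not taken from the paper.

```python
def block_similarity(a, b):
    """Fraction of shared features on which two blocks agree."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def one_pass_cluster(blocks, threshold=0.6):
    """Cluster blocks in a single pass over the list.

    The first block forms the first cluster. Every later block is compared
    with the existing clusters (average similarity to their members) and
    either joins the best one or opens a new cluster.
    """
    clusters = []
    for block in blocks:
        best_cluster, best_score = None, 0.0
        for cluster in clusters:
            score = sum(block_similarity(block, m) for m in cluster) / len(cluster)
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is not None and best_score >= threshold:
            best_cluster.append(block)
        else:
            clusters.append([block])
    return clusters


blocks = [
    {"font": "bold", "color": "blue", "kind": "link"},
    {"font": "plain", "color": "black", "kind": "text"},
    {"font": "bold", "color": "blue", "kind": "link"},
    {"font": "plain", "color": "red", "kind": "text"},
]
print([len(c) for c in one_pass_cluster(blocks)])
```

Because each block is visited once, the result depends on the order of the list, which is the usual trade-off of a one-pass scheme.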

• ViDE assumes
  1. Blocks in the same cluster all come from different data records.
  2. The cluster with the maximum number n of blocks may contain the mandatory values of the data records.

• Step 1: Rearrange the blocks in each cluster.
• Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
• Step 3: For every block (in all clusters), determine which group it belongs to.
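The three regrouping steps might be sketched as follows. The slides do not state how Step 3 decides group membership, so this example assumes that blocks are identified by their position in page order and that each record begins with its seed block, so a block goes to the nearest seed at or before it. The function name and the input shape are hypothetical.

```python
from bisect import bisect_right


def regroup(clusters):
    """Regroup clustered blocks into data records.

    clusters: dict mapping a cluster name to the page-order positions of
    its blocks. Returns one list of (cluster name, position) per record.
    """
    # Step 1: rearrange the blocks of each cluster into page order.
    ordered = {name: sorted(positions) for name, positions in clusters.items()}

    # Step 2: the largest cluster (n blocks) seeds n groups.
    seed_name = max(ordered, key=lambda name: len(ordered[name]))
    seeds = ordered[seed_name]
    groups = [[(seed_name, p)] for p in seeds]

    # Step 3: attach every other block to the nearest seed at or before it
    # (an assumption); blocks ahead of the first seed go to the first group.
    for name, positions in ordered.items():
        if name == seed_name:
            continue
        for p in positions:
            index = max(bisect_right(seeds, p) - 1, 0)
            groups[index].append((name, p))

    return [sorted(group, key=lambda item: item[1]) for group in groups]


records = regroup({"title": [0, 10, 20], "price": [2, 12, 22], "image": [15]})
for record in records:
    print(record)
```

In the toy input the optional "image" block appears only once and ends up in the second record, while every record keeps its seed "title" block.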

• WDBt: total number of web databases processed
• WDBc: number of web databases whose precision and recall are both 100%
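The slides define WDBt and WDBc but do not show the formula for the revision measure. Assuming it is the share of web databases that are not extracted perfectly, (WDBt - WDBc) / WDBt, it could be computed as below; the function and parameter names are made up for the example.

```python
def revision(wdb_total, wdb_correct):
    """Share of web databases that still need manual revision.

    Assumed definition: (WDBt - WDBc) / WDBt, where WDBc counts the
    databases extracted with 100% precision and 100% recall.
    """
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_correct <= wdb_total:
        raise ValueError("wdb_correct must lie between 0 and wdb_total")
    return (wdb_total - wdb_correct) / wdb_total


print(revision(4, 1))
```

Under this reading a lower value is better: 0 means every processed database was extracted without any error.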

[Figure: page tree. Root, ..., Data Area (LCA). Children of the data area: Record, ..., Separator, Record, ..., Separator, Record, ...]

• Real-estate domain
• 60 agents’ websites
  - MRP: 95.0%
  - ERP: 90.0%

[Figure: page tree. Root, Data Area. Children of the data area: Record 1 Part A, ..., Record 1 Part B, Record 2 Part A, ..., Record 2 Part B, Record 3 Part A, ..., Record 3 Part B]

• DIADEM 0.1:
  - Construct real-estate result page ontology
  - Ontological record segmentation
    - (More features)
  - Data labeling and data alignment
• After:
  - Add visual information
