Automatically Annotating Web Pages Using Google Rich Snippets

15
Automatically Annotating Web Pages Using Google Rich Snippets 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011) February 4, 2011 Frederik Hogenboom [email protected] Flavius Frasincar [email protected] Damir Vandic [email protected] Jeroen van der Meer [email protected] m Ferry Boon [email protected] Uzay Kaymak [email protected] Erasmus University Rotterdam PO Box 1738, NL-3000 DR Rotterdam, the Netherlands This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM.

description

Automatically Annotating Web Pages Using Google Rich Snippets. - PowerPoint PPT Presentation

Transcript of Automatically Annotating Web Pages Using Google Rich Snippets

Page 1: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Automatically Annotating Web Pages Using Google Rich Snippets

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

February 4, 2011

Frederik [email protected]

Flavius [email protected]

Damir [email protected]

Jeroen van der [email protected]

Ferry [email protected]

Uzay [email protected]

Erasmus University RotterdamPO Box 1738, NL-3000 DRRotterdam, the Netherlands

This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM.

Page 2: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Introduction (1)

• Semantically annotating Web pages enhances machine interpretation

• Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages

• The vocabulary enables interesting applications

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Page 3: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Introduction (2)

• Automating annotation for static and 3rd party Web sites is deemed necessary

• Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Page 4: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Framework (1)

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

• Four main stages:– Hotspot identification– Subjectivity analysis– Information extraction– Page annotation

• Web pages are converted to DOM trees in order to enable easy processing

Page 5: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Framework (2)

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

RDFa

Page 6: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Framework (3): Hotspots

• Reviews are characterized by large blocks of text: hotspots

• Headers, navigation elements, footers, etc., do not contain these blocks

• Text blocks have few HTML elements

• For each element in the DOM tree, we compute the text-to-content-ratio (TTCR):

, with = # textual characters, and = total # characters in

DOM

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

DOM

text

L

LTTCR textL

DOML

Page 7: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Framework (4): Hotspots

• Illustrative example:

• The h1 element contains 64/73 × 100% ≈ 88% text

• However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

<h1> Intel Core i7-975 Extreme And i7-950 Processors Reviewed</h1><div> <p> Page <span class="page-number">1</span> of <span class="num-pages">15</span> </p></div>

Page 8: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Framework (5): Subjectivity

• Hotspots are verified as reviews whenever they are subjective enough

• We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):– Original: check if document has ≥ k sentences that contain ≥

n subjectivity words each– Modification: check if document has ≥ m percent of all

sentences that contain ≥ n subjectivity words each

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Page 9: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Framework (6): IE

• Various information is extracted:– Authors:

• Named entities are detected in the vicinity of hotspots• Named Entity Recognizer (NER)

– Dates:• Many different date formats are easily parsed• Regular expressions

– Products:• Name often found in title and h1 elements• Overlapping words

– Ratings:• Many formats, e.g., images (90%), which can be numerical (80%),

descriptors (15%), or letters (5%)• We focus on numerical ratings• Regular expressions on plain text or alt text of images

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

(\w)\s(\d{1,2})(th|,)?\s(\d{2,4})

([0-9.,]+)\s?/\s?([0-9.,]+)

MM dd yyyy

4/5

Page 10: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Framework (7): Annotation

• Key elements are tagged using Google Rich Snippets

• A new annotated Web page is returned

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review"> <span property="v:itemreviewed"> Tango Hotel Taichung </span> <span property="v:reviewer">Sarah Lee</span> <span property="v:rating">4 stars</span> <span property="v:dtreviewed">18th December 2008</span> <p property="v:summary"> Boutique like hotel without the boutique price </p></div>

Page 11: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Implementation (1)

• We have implemented the ARROW framework as a Web application:– Java-based– Apache Tomcat server

• Input:– URL– Preferred output:

• Visualizer

• Annotated document

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Page 12: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Implementation (2)

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Page 13: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Evaluation

• Test set: 100 review, 100 non-review Web pages

• Sub-second performance

• Precision and specificity are good (both ± 90%), while accuracy and recall are varying (± 40% – 60%)

• Main problems related to detecting authors, likely caused by the use of nicknames

• Dependency on Web site structures

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Page 14: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Conclusions

• We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets

• Framework not bound to vocabulary

• Proof-of-concept implementation shows promising results

• Future work:– Improve heuristics– Add intelligent (semantically enabled) text parsers– Extend to other domains, e.g., recipes, videos, etc.

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

Page 15: Automatically Annotating Web Pages Using  Google  Rich  Snippets

Questions

http://www.arrow-project.com/

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)