Generating Links by Mining Quotations

OKAN KOLAK AND BILL N. SCHILIT PRESENTATION BY DUSTIN SMITH THE UNIVERSITY OF TEXAS AT AUSTIN SCHOOL OF INFORMATION Generating Links by Mining Quotations

Transcript of Generating Links by Mining Quotations

Page 1: Generating Links by Mining Quotations

OKAN KOLAK AND BILL N. SCHILIT

PRESENTATION BY DUSTIN SMITH

THE UNIVERSITY OF TEXAS AT AUSTIN

SCHOOL OF INFORMATION

Generating Links by Mining Quotations

Page 2: Generating Links by Mining Quotations

04/10/2023 INF384H

Outline

Introduction
Challenges
Algorithm
  Phase 1: Generating the Shingle Table
  Phase 2: Extracting Shared Sequences
  Phase 3: Sequence Grouping
  Filtering and Ranking
User Interface
Evaluation

Page 3: Generating Links by Mining Quotations

Introduction

What is the goal and why?
Engaging user interface in Google Books
Richer hypertext for scanned books
Achieving these goals at scale for large sets of books
  Via MapReduce

Page 4: Generating Links by Mining Quotations

Challenges

Mining quality quotations from millions of books in a scalable and efficient manner.

Filtering out misleading quotations and ranking the good quotations based on quality.

Incorporating the proposed link structure online in a clear and effective way for users.

Page 5: Generating Links by Mining Quotations

Algorithm: Phase 1

Generation of shingle tables

Pass text through shingler

Text is parsed, normalized, and output as a stream of overlapping shingles

Generate a shingle table

Page 6: Generating Links by Mining Quotations

Algorithm: Phase 1 (cont)

Each book is passed through the shingler.

A shingle is a sequence of k consecutive words (a k-shingle).

Example: the 2-shingles for the text “a lucky dog” are “a lucky” and “lucky dog”.
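
As a minimal sketch of the shingler, the function below emits every run of k consecutive tokens in order. The name `k_shingles` and the use of whitespace-split words as tokens are assumptions made for illustration, not details from the paper.

```python
def k_shingles(tokens, k):
    """Return every run of k consecutive tokens, in order of appearance."""
    # A text with fewer than k tokens yields no shingles.
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


shingles = k_shingles("a lucky dog".split(), 2)
print(shingles)  # [('a', 'lucky'), ('lucky', 'dog')]
```

Because consecutive shingles overlap by k-1 tokens, a text of n tokens produces n-k+1 shingles.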

Page 7: Generating Links by Mining Quotations

Algorithm: Phase 1 (cont)

Prior to shingling, the text is parsed and normalized.

Possible normalizations:
Lowercasing
Removing punctuation and accents
Stemming
Removing stop-words
Collapsing numbers to single tokens
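
A rough sketch of three of these normalizations (lowercasing, removing punctuation and accents, collapsing numbers) using only the Python standard library. The function name `normalize` and the `<num>` placeholder token are hypothetical; stemming and stop-word removal are left out because they need a language-specific resource.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse numbers."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse every all-digit token to one placeholder token (assumed name).
    return ["<num>" if tok.isdigit() else tok for tok in text.split()]


print(normalize("Café, in 1984!"))  # ['cafe', 'in', '<num>']
```

The more aggressive the normalization, the more near-identical quotations map to the same shingles, at the cost of some false matches.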

Page 8: Generating Links by Mining Quotations

Algorithm: Phase 1 (cont)

Shingle Tables

Shingle key: a unique shingle footprint
B: ID of the book where the shingle occurs
i: index of the shingle within book B

Key              Shingle info   Shingle info
Shingle key(1)   <B,i>          <B,i>
Shingle key(2)   <B,i>          <B,i>
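
The table can be pictured as a mapping from shingle key to a bucket of <B,i> pairs. The sketch below is an in-memory stand-in under stated assumptions: the function name and the `books` input shape are hypothetical, and the token tuple itself is used as the key where a real system would store a compact footprint such as a hash.

```python
from collections import defaultdict


def build_shingle_table(books, k):
    """Map each shingle key to the list of (book_id, index) pairs where it occurs.

    books: dict of book ID -> list of normalized tokens (assumed input shape).
    """
    table = defaultdict(list)
    for book_id, tokens in books.items():
        for i in range(len(tokens) - k + 1):
            key = tuple(tokens[i:i + k])  # stand-in for the shingle footprint
            table[key].append((book_id, i))
    return table


books = {"A": "to be or not to be".split(), "B": "not to be".split()}
table = build_shingle_table(books, 2)
print(table[("to", "be")])  # [('A', 0), ('A', 4), ('B', 1)]
```

A dictionary only works while everything fits in memory; at the scale of millions of books the same grouping would come from sorting the (key, <B,i>) records.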

Page 9: Generating Links by Mining Quotations

Algorithm: Phase 1 (cont)

Shingle Tables

Requires a single linear pass and a very large sorting phase.

They observe that quotes shorter than 8 words are not significant quotations, so they set their shingle length to 8 words.

Page 10: Generating Links by Mining Quotations

Algorithm: Phase 2

Involves extracting shingles that are shared between books

Books are processed one at a time
Current book = “Source book”
All other books = “Target books”

Page 11: Generating Links by Mining Quotations

Algorithm: Phase 2 (cont)

Process for a single book:

Generate a list of shingles in the order that they appear

Take each shingle and use the shingle table to find all other occurrences
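
One way to picture this per-book process is the sketch below. It is not the paper's pseudo-code; it is an illustrative reconstruction that walks the source book's shingles in order, looks each one up in the table, and extends a running match whenever the same target book matched at the previous position. All names are hypothetical.

```python
def shared_sequences(source_id, source_shingles, table):
    """Return (source_start, target_id, target_start, n_shingles) runs."""
    active = {}  # (target_id, last matched target index) -> (src_start, tgt_start, n)
    results = []
    for i, key in enumerate(source_shingles):
        next_active = {}
        for target_id, j in table.get(key, []):
            if target_id == source_id:
                continue  # ignore occurrences inside the source book itself
            # Extend a run that ended one position earlier in the same target.
            src_start, tgt_start, n = active.pop((target_id, j - 1), (i, j, 0))
            next_active[(target_id, j)] = (src_start, tgt_start, n + 1)
        # Runs still in `active` were not extended, so they are complete.
        results.extend((s, t_id, t, n) for (t_id, _), (s, t, n) in active.items())
        active = next_active
    results.extend((s, t_id, t, n) for (t_id, _), (s, t, n) in active.items())
    return results


table = {
    ("x", "to"): [("A", 0)],
    ("to", "be"): [("A", 1), ("B", 0)],
    ("be", "or"): [("A", 2), ("B", 1)],
    ("or", "y"): [("A", 3)],
}
source = [("x", "to"), ("to", "be"), ("be", "or"), ("or", "y")]
print(shared_sequences("A", source, table))  # [(1, 'B', 0, 2)]
```

A run of n consecutive k-shingles corresponds to a shared passage of n+k-1 words.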

Page 12: Generating Links by Mining Quotations

Algorithm: Phase 2 (cont)

Pseudo-code for Phase 2:

Page 13: Generating Links by Mining Quotations

Algorithm: Phase 2 (cont)

MapReduce adaptation:

Mapper:
  Start with the shingle table as input to the Mapper
  Use the equivalent method for looking up all shingle buckets for a given book’s shingles
  Emit (source book ID, relevant shingle bucket)

Reducer:
  Input: (source book ID, list of relevant shingle buckets)
  Use the algorithm from the previous slide (Figure 1) with a few modifications
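
The mapper/reducer split can be simulated in memory as below. This is a sketch under assumptions, not the authors' implementation: the function names are hypothetical, skipping shingles that occur in only one book is an added shortcut, and the reducer here only restores source order, after which the sequence extraction step would run.

```python
from collections import defaultdict


def mapper(shingle_key, bucket):
    """bucket is the list of (book_id, index) pairs for one shingle key."""
    book_ids = sorted({book_id for book_id, _ in bucket})
    if len(book_ids) < 2:
        return  # assumption: a shingle in one book only cannot link two books
    for source_id in book_ids:
        yield source_id, bucket


def run_map_phase(table):
    """Group mapper output by source book ID, as the shuffle step would."""
    grouped = defaultdict(list)
    for shingle_key, bucket in table.items():
        for source_id, value in mapper(shingle_key, bucket):
            grouped[source_id].append(value)
    return grouped


def reducer(source_id, buckets):
    """Order the book's shared shingles by their position in the source book."""
    positions = []
    for bucket in buckets:
        targets = [(b, i) for b, i in bucket if b != source_id]
        for b, i in bucket:
            if b == source_id:
                positions.append((i, targets))
    return sorted(positions, key=lambda p: p[0])


table = {("to", "be"): [("A", 1), ("B", 0)], ("x", "to"): [("A", 0)]}
grouped = run_map_phase(table)
print(reducer("A", grouped["A"]))  # [(1, [('B', 0)])]
```

Emitting whole buckets keyed by source book means each reducer sees everything it needs for one book without consulting the full table.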

Page 14: Generating Links by Mining Quotations

Algorithm: Phase 2 (cont)

One notable issue:
Common shingles that are shared by many books will greatly increase overhead.
These are often insignificant quotes and should be discarded.
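
A simple way to discard such shingles is to cap the number of distinct books a bucket may span. The sketch below assumes a table that maps shingle keys to lists of (book ID, index) pairs; the function name and the `max_books` threshold are hypothetical, and the slides do not state the cutoff actually used.

```python
def drop_common_shingles(table, max_books):
    """Keep only shingles that occur in at most max_books distinct books."""
    return {
        key: bucket
        for key, bucket in table.items()
        if len({book_id for book_id, _ in bucket}) <= max_books
    }


table = {
    ("all", "rights", "reserved"): [("A", 3), ("B", 5), ("C", 2), ("D", 9)],
    ("a", "lucky", "dog"): [("A", 40), ("B", 77)],
}
print(list(drop_common_shingles(table, max_books=3)))  # [('a', 'lucky', 'dog')]
```

Dropping these buckets before Phase 2 also shrinks the data that has to be shuffled to the reducers.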

Page 15: Generating Links by Mining Quotations

Algorithm: Phase 3

Sequence Grouping: Why?

Page 16: Generating Links by Mining Quotations

Algorithm: Phase 3 (cont)

Sequence Grouping: How does it work?

Page 17: Generating Links by Mining Quotations

Filtering and Ranking

They identify certain shared phrases as copyright sentences, legal boilerplate, publisher addresses, bibliography citations, or titles of other books by the author or publisher.
These are not desirable, quality quotations and need to be filtered out.

Page 18: Generating Links by Mining Quotations

Filtering and Ranking (cont)

Filtering:
• Quotations on “low content” pages
• Unusual characteristic filtering
  • Too many digits or special characters, repeated tokens, etc.
• Book edition filtering
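
The “unusual characteristic” filter can be sketched as a pair of ratio checks. The function name and both thresholds below are hypothetical values chosen for illustration; the slides do not give the actual features or cutoffs.

```python
def looks_unusual(passage, max_symbol_ratio=0.3, min_distinct_ratio=0.5):
    """Flag passages dominated by digits/special characters or repeated tokens."""
    chars = [ch for ch in passage if not ch.isspace()]
    tokens = passage.lower().split()
    if not chars or not tokens:
        return True
    # Share of non-letter characters (digits, punctuation, symbols).
    symbols = sum(1 for ch in chars if not ch.isalpha())
    if symbols / len(chars) > max_symbol_ratio:
        return True
    # Share of distinct tokens; a low value means heavy repetition.
    return len(set(tokens)) / len(tokens) < min_distinct_ratio


print(looks_unusual("Tel. 555-0100, fax 555-0199, box 12"))  # True
print(looks_unusual("All human beings are born free and equal in dignity and rights"))  # False
```

Checks like these are cheap enough to run on every candidate passage before ranking.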

Page 19: Generating Links by Mining Quotations

Filtering and Ranking (cont)

Ranking:
Some quotes are more interesting than others, e.g.:
“The unemployment rate is the percentage of the labor force that is unemployed” vs. “All human beings are born free and equal in dignity and rights…”
• This is difficult to distinguish automatically

Page 20: Generating Links by Mining Quotations

Filtering and Ranking (cont)

Scoring method for ranking

Basically:
Quotes that are too short or too long receive low scores
The optimal length is in the middle ground, and a piecewise function is used to represent this scoring.
• What defines “too short” and “too long” is determined by “experimental tuning”
• Same scoring method for frequency
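
One plausible shape for such a piecewise function is a trapezoid: zero outside an outer range, one inside an optimal range, and linear ramps in between. The linear ramps and every breakpoint below are assumptions for illustration; the slides only say that the cutoffs were tuned experimentally.

```python
def piecewise_score(value, low, best_low, best_high, high):
    """Trapezoid score: 0 outside (low, high), 1 inside [best_low, best_high].

    Assumes low < best_low <= best_high < high.
    """
    if value <= low or value >= high:
        return 0.0
    if value < best_low:
        return (value - low) / (best_low - low)  # rising ramp
    if value > best_high:
        return (high - value) / (high - best_high)  # falling ramp
    return 1.0


# Hypothetical breakpoints for quote length in words.
for length in (5, 14, 40, 130, 250):
    print(length, piecewise_score(length, 8, 20, 60, 200))
```

The same function can score frequency by passing in the number of books that share the quote along with a different set of breakpoints.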

Page 21: Generating Links by Mining Quotations

User Interface

How to present this concept of general links between books?

“Popular Passages” not “Quotations”

Display issues:
  Long quotes containing shorter, more familiar quotes
  Quote order variations

Skyline vectors are used to address these issues, and they do so effectively.
• Basically, the “best” quotes are chosen for presentation to the user

Page 22: Generating Links by Mining Quotations

User Interface (cont)

Navigation within books

Goals:
  Provide a general feel for the book
  Provide an interface in which the user can quickly navigate to important passages within the book

Page 23: Generating Links by Mining Quotations

User Interface (cont)

Navigation between books

Page 24: Generating Links by Mining Quotations

Evaluation

Manual labeling to determine accuracy
User study (passive) over a 30-day period
Analysis of the distribution of link types within Google’s scanned books

Page 25: Generating Links by Mining Quotations

Evaluation (cont)

Manual labeling:

• Sampled 120 passages from low scores and 120 from high scores (to avoid precision bias).

• Use a Likert scale of 1 to 5 with 1-2 meaning good, 3 meaning neutral, and 4-5 meaning bad.

• Inter-annotator agreement was 88.5% (± 3.5% to account for neutral labels)

• 88% marked good

Page 26: Generating Links by Mining Quotations

Evaluation (cont)

User study:

• Consisted of monitoring user activity in Google Books.

• Specifically, whether they navigated via popular passages (Quotations); to other editions of the book (Editions); to other similar books within a cluster (Related); or to books that cite the current book (Cited By)

• Results

Page 27: Generating Links by Mining Quotations

Evaluation (cont)

Page 28: Generating Links by Mining Quotations

Evaluation (cont)

Coverage:
What is the distribution of these link types in scanned books?

Page 29: Generating Links by Mining Quotations

Related Work & Future Work

Related Work
  Automatic Hypertext
  Plagiarism Detection

Future Work
  Improved Ranking
  Incremental Processing
  Primary Source Identification Attribution

Page 30: Generating Links by Mining Quotations

Questions + Discussion

The End.

….Go Rangers!