Television News Search and Analysis with Lucene/Solr

Kai Chan <[email protected]>

Social Sciences Computing

University of California, Los Angeles

Lucene Revolution, May 10, 2012

Communication Studies Archive: Background (1)

• Continuation of analog recording of TV news

– Thousands of tapes since Watergate/1970s

– Hard to look for a particular news program or topic

Communication Studies Archive: Background (2)

• Digital recording since 2005

• Capture news programs on computers

– Video: can be streamed over the Web

– Closed captioning (“subtitle text”): indexed and searchable

– Image snapshots

– Search engine and analysis tools

Communication Studies Archive: Background (3)

• Also download transcripts and web-streamed news programs

• 100 news programs and 600,000 words added each day

Communication Studies Archive: Background (4)

• January 2005 to present

– 28 networks

– 1,600 shows

– 130,000 hours

– 160,000 news programs

– 50,000,000 images

– 880,000,000 words

Why This is Important (1)

• Researchers

– Large and unique collection of communication

– Many modalities

• Speech, facial expression, body gesture, etc.

– Different conditions/settings

– Different networks and communities

– Allows study of TV news + communication in general in ways impossible before

Why This is Important (2)

• Non-researchers

– TV news is about presentation and persuasion

• Which happen in daily life also

– TV main source of news for many/most

– Greatly affects the public’s decisions

– Learn about what we watch

Application in Research

• Communication Studies

– Amount of coverage for events over time

• Linguistics

– Speech and language patterns

• Computer Science

– Object identification

– Identify news anchors, public figures

– Story segmentation

Application in Teaching (1)

• Chicano Studies: Representations of Latinos on the Television News

– May 1, 2007 immigration march

– MacArthur Park, Los Angeles, CA

– 2 days (May 1 & 2, 2007)

– Framing, stereotyping, metaphor, silencing

– Reports with screenshots and links to news stories

Application in Teaching (2)

• Communication Studies: Presidential Communication

– 2008 presidential primary

– 6 weeks (Dec 2007 to Feb 2008)

– Coverage of sound bites

• Amount of time given to candidate/party

• Types of response (positive, neutral, negative)

– Students created their own political ad.

Work flow (1): Capture/conversion machines

• 2 groups, 2 machines per group

– Keep the best recording

– 6 TV tuners per machine

• Capture video and CC to separate files in real-time

– MPEG-TS (~7 GB/hr)

– Timestamp every 2-3 seconds

• Generate image snapshots

• Convert videos

– MP4/H.264 (VGA, ~240 MB/hr)
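
As a rough check on the rates above, the following sketch works out the size reduction from MPEG-TS capture to MP4 and the storage implied by the 130,000 hours quoted earlier in the talk. Decimal units (1 GB = 1000 MB) and a uniform rate across the whole archive are assumptions; the slides give only approximate figures and do not say how much raw MPEG-TS is retained.

```python
# Approximate rates and archive size quoted on the slides.
MPEG_TS_GB_PER_HOUR = 7     # raw capture
MP4_MB_PER_HOUR = 240       # converted H.264 video
ARCHIVE_HOURS = 130_000     # hours recorded since January 2005

# Assumption: decimal units, 1 GB = 1000 MB and 1 TB = 1,000,000 MB.
size_reduction = MPEG_TS_GB_PER_HOUR * 1000 / MP4_MB_PER_HOUR
mp4_archive_tb = ARCHIVE_HOURS * MP4_MB_PER_HOUR / 1_000_000
raw_archive_tb = ARCHIVE_HOURS * MPEG_TS_GB_PER_HOUR / 1000

print(f"MP4 is roughly {size_reduction:.0f}x smaller than MPEG-TS")
print(f"All hours as MP4: about {mp4_archive_tb:.1f} TB")
print(f"All hours as MPEG-TS: about {raw_archive_tb:.0f} TB")
```

Under these assumptions the converted video fits in a few tens of terabytes, while keeping every raw capture would need close to a petabyte.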

Work flow (2): Storage/static file servers

• Control server

– Download TV schedules

– Download web-streamed news programs

– Collect and check recordings

– Push files to their destination servers

• Video streaming server

• Backup storage server

• Image server

Work flow (3): Search server

• Lucene index updated daily

– Main text field tokenized

– Separate fields for date, network, show, etc.

– Binary fields for segment and time data

• Hosts search engine
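
The field layout above can be pictured with a small in-memory sketch. This is not Lucene code: the class name, field names, show name, and the byte encoding of the segment data are all hypothetical, chosen only to illustrate one document per news program with a tokenized text field, separate metadata fields, and a binary field holding segment positions.

```python
import struct
from dataclasses import dataclass, field
from datetime import date


@dataclass
class NewsProgramDoc:
    """Hypothetical stand-in for one indexed news program."""
    text: str          # closed-caption text (the tokenized main field)
    air_date: date     # separate metadata fields
    network: str
    show: str
    segments: list = field(default_factory=list)  # [(start_pos, end_pos), ...]


def pack_segments(segments):
    """Encode (start, end) token positions as big-endian unsigned 32-bit pairs."""
    return b"".join(struct.pack(">II", start, end) for start, end in segments)


def unpack_segments(blob):
    """Decode the byte string produced by pack_segments."""
    return [struct.unpack_from(">II", blob, offset) for offset in range(0, len(blob), 8)]


doc = NewsProgramDoc(
    text="good evening here is the news",
    air_date=date(2012, 5, 10),
    network="CNN",
    show="Example Show",            # hypothetical show name
    segments=[(0, 120), (120, 480)],
)
blob = pack_segments(doc.segments)  # what a binary field might hold
```

Storing positions as fixed-width pairs keeps decoding trivial at query time; the actual encoding used in the archive is not described on the slides.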

The search process

Custom query type: Segment-enclosed query (1)

• Problem 1: search for “X near Z”

• Lucene: search for “X within Y words of Z”

– How to pick Y?

– Hard to pick a fixed number
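
To see why a fixed Y is hard to choose, here is a minimal sketch of a word-distance check over a token list. It uses plain positional distance, which is an assumption made for illustration and not Lucene's exact slop semantics; the function name and the sample sentence are made up.

```python
def within(tokens, x, z, y):
    """Return True if some occurrence of x is at most y token positions from some z."""
    x_positions = [i for i, token in enumerate(tokens) if token == x]
    z_positions = [i for i, token in enumerate(tokens) if token == z]
    return any(abs(i - j) <= y for i in x_positions for j in z_positions)


tokens = "obama said he will visit troops in afghanistan next week".split()

print(within(tokens, "obama", "afghanistan", 5))   # distance is 7, so this misses
print(within(tokens, "obama", "afghanistan", 10))  # a wider window matches
```

A small Y misses a sentence that is clearly relevant, while a large Y starts matching words from unrelated parts of the program.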

Custom query type: Segment-enclosed query (2)

• Problem 2: the matched search words might not all belong to the same story

– E.g. “Obama AND visit AND Afghanistan”

– Might match a news program about Obama’s visit to Canada + violence in Afghanistan

Custom query type: Segment-enclosed query (3)

• A news program can contain several stories

– E.g. Local, national, world, weather, sports

Custom query type: Segment-enclosed query (4)

Custom query type: Segment-enclosed query (5)

• One solution: search for “X and Z within same story segment”

– Possible with Lucene + story segment info

• Bonus: enables searching/filtering for a particular story type

– E.g. Politics

Custom query type: Segment-enclosed query (6)

• How to mark segments

– Automated

• Computer Science researchers working on them

• Word frequency

• Scene change

• Black frame and silence

– Manual segmentation

• Watch the video

• Decide where a story starts and ends

• Mark positions in semi-automated system

Custom query type: Segment-enclosed query (7)

Custom query type: Segment-enclosed query (8)

• Idea

– Get spans from SpanNearQuery

– Filter and keep those fully within segments

• In production: segment info in stored fields

– As a list of <start position, end position>

– Simple to implement

– Reasonably fast searching

• Alternative: store segment info as terms

– Possible to find segments by themselves

– Appears to run much faster
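
The filtering step in the idea above can be sketched without Lucene. In the real system the spans would come from SpanNearQuery; here they are plain (start, end) token-position tuples with an exclusive end, and the segment list is assumed to be sorted and non-overlapping. The function name is hypothetical.

```python
from bisect import bisect_right


def spans_within_segments(spans, segments):
    """Keep only the spans that lie fully inside a single story segment.

    Assumes segments are sorted, non-overlapping (start, end) pairs
    with exclusive end positions.
    """
    segment_starts = [start for start, _ in segments]
    kept = []
    for start, end in spans:
        # Index of the last segment that begins at or before the span start.
        i = bisect_right(segment_starts, start) - 1
        if i >= 0 and end <= segments[i][1]:
            kept.append((start, end))
    return kept


segments = [(0, 100), (100, 250), (250, 400)]         # three stories in one program
spans = [(10, 20), (95, 105), (240, 250), (399, 401)]  # candidate proximity matches

print(spans_within_segments(spans, segments))
```

The span (95, 105) is dropped because it straddles two stories, and (399, 401) is dropped because it runs past the last segment. The binary search keeps the check cheap even when a program has many segments.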

Custom query type: Time-enclosed query

Custom query type: Multi-term regular expression (1)

• “here is _ _ _ with the (news|story|details|report)”

• Apply RegEx to a phrase or sentence

– Not just individual words

• Lucene core has regular expression query support

– Good starting point

– Not a complete solution for us

Custom query type: Multi-term regular expression (2)

• Problems

– Some analyzers do not work with RegEx

– Lucene’s RegEx query classes only apply RegEx to individual terms

• Want to match a pattern against a phrase/sentence

• Want placeholders for whole words (not just characters)

– Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index

• very slow

• takes lots of memory

Custom query type: Multi-term regular expression (3)

• What we did

– Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)

• E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)

– Leading and trailing placeholders

• E.g. “_ _ is the _ _ _”

• Preserve for correctness

• Store word count for each document

• Expand each span on both sides

• Bounds checking
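
The handling of placeholders described above can be illustrated with an in-memory matcher over a token list. This is a sketch under assumptions, not the production translation into SpanNearQuery and RegexQuery: the function name is hypothetical, each non-placeholder word is treated as a regular expression matched against a whole term, and the token count stands in for the stored per-document word count.

```python
import re


def match_multi_term(pattern, tokens):
    """Return (start, end) token spans for a pattern in which '_' is any one word."""
    parts = pattern.split()

    # Count leading and trailing placeholders; they never need matching,
    # they only widen the span.
    lead = 0
    while lead < len(parts) and parts[lead] == "_":
        lead += 1
    trail = 0
    while trail < len(parts) - lead and parts[-1 - trail] == "_":
        trail += 1

    core = parts[lead:len(parts) - trail]
    regexes = [None if part == "_" else re.compile(part) for part in core]

    spans = []
    for i in range(len(tokens) - len(core) + 1):
        if all(r is None or r.fullmatch(tokens[i + k]) for k, r in enumerate(regexes)):
            # Expand the span on both sides, then bounds-check against the
            # document's word count.
            start, end = i - lead, i + len(core) + trail
            if start >= 0 and end <= len(tokens):
                spans.append((start, end))
    return spans


anchor = "and here is our reporter jane with the details now".split()
print(match_multi_term("here is _ _ _ with the (news|story|details|report)", anchor))

opinion = "we think this is the best news show ever".split()
print(match_multi_term("_ _ is the _ _ _", opinion))
```

Matching only the core and expanding afterwards avoids a match-everything term query for each leading or trailing placeholder, while the bounds check rejects matches that would run off either end of the document.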

Custom query type: Multi-term regular expression (4)

• Regular expression libraries differ in

– Syntax (e.g. Perl 5-compatible)

– Capabilities (e.g. back-references)

– Speed

• Memory usage

– Proportional to number of terms matched

– Increasing available memory might help

Custom result format: Occurrence count

Future work: Job queue (1)

• Research front moving towards analysis of whole database

– Want full search result set

– Queries are intensive and take a long time

• Increasing the timeout alone will not be enough

– Users might close their browsers

– We might restart the search back-end

Future work: Job queue (2)

• Features

– Query runs in background

– Notification when finished/failed

– Restart queries with recoverable errors

– Check and cancel jobs

– Downloadable result

– Schedule recurring queries

– Manage job priority and quota
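
A few of the features listed above (background execution, status checks, cancellation, and retrying recoverable errors) can be sketched with Python's standard thread pool. Everything here is hypothetical: the class names, the retry policy, and the idea of marking retryable failures with a dedicated exception are assumptions for illustration. An in-memory queue like this would not survive a restart of the search back-end, so a real system would also need to persist its jobs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RecoverableError(Exception):
    """Hypothetical marker for failures worth retrying."""


class JobQueue:
    def __init__(self, workers=2, max_retries=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._max_retries = max_retries
        self._futures = {}
        self._next_id = 0
        self._lock = threading.Lock()

    def submit(self, run_query, *args):
        """Queue a query to run in the background and return its job id."""
        with self._lock:
            job_id = self._next_id
            self._next_id += 1
            self._futures[job_id] = self._pool.submit(self._run, run_query, args)
        return job_id

    def _run(self, run_query, args):
        # Retry only recoverable errors; anything else fails the job at once.
        for attempt in range(self._max_retries + 1):
            try:
                return run_query(*args)
            except RecoverableError:
                if attempt == self._max_retries:
                    raise

    def status(self, job_id):
        future = self._futures[job_id]
        if future.cancelled():
            return "cancelled"
        if not future.done():
            return "pending or running"
        return "failed" if future.exception() else "finished"

    def cancel(self, job_id):
        """Cancel a job; this only succeeds if it has not started running."""
        return self._futures[job_id].cancel()

    def result(self, job_id, timeout=None):
        return self._futures[job_id].result(timeout)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

A caller would submit a query function, keep the returned job id, and poll `status` or fetch `result` later, so closing the browser no longer discards the work.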

Future work: Multiple sources and languages (1)

• Multilingual news programs

– E.g. some have English + Spanish CC

• Multiple text and timestamp sources

– E.g. CNN transcript available from website

– Applying speech-to-text to videos

– Manual correction of text and timestamps

• Multiple markets

– E.g. Capture TV programs in Denmark and Norway

Future work: Multiple sources and languages (2)

• Need language detection

– Libraries exist

• Search for specific channel

– Search by language more useful

– But no fixed channel -> language mapping

• What will proximity search and occurrence counting mean when there are multiple channels/languages?

Future work: Metadata

• Types of metadata

– Segment boundary, type and topic

– Headline and description (from transcripts)

– Website links

– Syntactic tags (e.g. part of speech)

– Generated annotation (e.g. object identification)

– User annotation (e.g. scene description)

– Screen text

• Eventually: want them to be searchable

Thank you for coming!

• Any questions?

• My e-mail: [email protected]

• Slides available: http://ucla.in/IDJq2u
