CS336: Intelligent Information Retrieval Why is Information Retrieval difficult?

  • date post

    21-Dec-2015

Page 1

CS336: Intelligent Information Retrieval

Why is Information Retrieval difficult?

Page 2

• What is information retrieval?

• What is a relational database?

Page 3

Relational Databases

• Can think of them as containing a bunch of tables or records.

– Records contain information pertaining to a particular data item (e.g. patient information)

• Relationships are explicit

• labels or fields (e.g. date, name, age, …)

• possible field values (e.g. 2001, Mary Smith, 29)
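A minimal sketch of what "explicit relationships" buys you, using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical, modeled on the field and value examples above:

```python
import sqlite3

# Hypothetical patient table; the fields echo the slide's examples
# (date, name, age) and the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (visit_year INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(2001, "Mary Smith", 29), (2001, "John Doe", 54), (2000, "Ann Lee", 29)],
)

# Every value sits under an explicit label, so the query can state
# exactly which field must hold which value.
rows = conn.execute(
    "SELECT name FROM patients WHERE visit_year = ? AND age = ? ORDER BY name",
    (2001, 29),
).fetchall()
names = [name for (name,) in rows]
print(names)
```

The query never has to guess whether "29" is an age, a day of the month, or part of an address; the schema says so.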

Page 4

Relational Databases vs IR

• How does the basic characteristic of data in a relational database differ from that of a text document?

– Relational database has structure!

• employee records

• store inventory

• student information: id number, name, year of graduation, etc.

• Information retrieval is hard because textual data is unstructured …

Page 5

Let's do an experiment …

• Why do crackers break into honeypots?

• What strategies did you use to answer this question?

Page 6

Honeypot and Honeynet Characteristics

• Attract: crackers

• Purpose: security research

• Arguments con: unethical; entrapment; electronic wiretapping; illegal; promotes criminal behavior

• Arguments pro: not unethical; no coercion so not entrapment; no useful info collected

• Time to 1st attack: < 1 week

• Why crackers crack: base to launch attacks on other networks; run private chat systems; because they can!

• Interested parties: SANS GIAC; CERT; Honeynet Project team

Honeypots: Bait for the Cracker by Michelle Delio

Set up a server and fill it with tempting files. Make it hard but not impossible to break into. Then sit back and wait for the crackers to show up. Observe them as they cavort around in the server. Log their conversations with each other. Study them like you'd watch insects under a magnifying glass. That's the basic concept behind honeypots and honeynets, systems that are set up specifically so that security experts can secretly observe crackers in their natural habitats.

The Honeynet Project team, an invitation-only security group, has been working with the project, a network that exists only to allow the team to watch who cracks it, in order to determine what crackers do and why they do it. The team will soon publish a paper on their research.

But some say that honeynets and honeypots, single servers used for cracker observation, are really nothing more than electronic wiretapping and entrapment and charge that the systems are unethical and possibly illegal.

London SecTech systems administrator Dan Adams, who is following the project closely, said that honeynets are ethically similar to installing electronic surveillance equipment in a nursery school.

Honeynets give crackers a large space in which to roam. They present obstacles that are challenging enough to engage them but not difficult enough to frustrate them completely, Adams said.

"They get to play with stuff, and they chatter excitedly among themselves about all the 'kewl warez' they are finding, while the security people who set it up are watching their every move with amusement," Adams said. "Frankly, I have mixed emotions about spying on people, even if they aren't nice people."

Adams also feels that honeypots and honeynets come close to entrapment. "It's like opening a fake store, loading it with cool stuff, and sitting back hoping someone will break into it," he said.

But since entrapment involves coercing someone to commit a crime they would not otherwise have committed, attorney Jason Wilson said that the typical honeynet or honeypot would not be considered entrapment under United States law.

"If you, for example, asked the team members to anonymously spread the word around the hacker corners of the Net that there was an unprotected network chock full of goodies, then there could be an argument made for entrapment," Wilson said.

Honeynet team member Saumil Shah said that nothing special is done to attract crackers to the honeynet.

"The honeynet systems got hacked within just a week of being deployed. The first attack occurred on June 4, 2000," Shah said. "There was no publicity of the honeynet being live, the systems contained absolutely no information of any value,

Page 7

Information Retrieval

• Trick is to find a means of describing an object

– We’ll focus on text, but could include

• images

• audio files

• video

– Language Complicates our Task

– What approach might you take to developing a document description?

Page 8

Manual Indexing

• Earliest method

– people read every document

– choose descriptors from “controlled vocabulary”

• like card catalog

– categorize document via descriptors

Page 9

Example “Ebola” document

Nat Med 1998 Jan;4(1):37-42

Immunization for Ebola virus infection.

Xu L, Sanchez A, Yang Z, Zaki SR, Nabel EG, Nichol ST, Nabel GJ

Department of Biological Chemistry, University of Michigan Medical Center, Ann Arbor 48109-0650, USA.

Infection by Ebola virus causes rapidly progressive, often fatal, symptoms of fever, hemorrhage and hypotension. Previous attempts to elicit protective immunity for this disease have not met with success. We report here that protection against the lethal effects of Ebola virus can be achieved in an animal model by immunizing with plasmids encoding viral proteins. We analyzed immune responses to the viral nucleoprotein (NP) and the secreted or transmembrane forms of the glycoprotein (sGP or GP) and their ability to protect against infection in a guinea pig infection model analogous to the human disease. Protection was achieved and correlated with antibody titer and antigen-specific T-cell responses to sGP or GP. Immunity to Ebola virus can therefore be developed through genetic vaccination and may facilitate efforts to limit the spread of this disease.

Page 10

MeSH Indexing of Example Document

MH - Animal
MH - Antibody Formation
MH - Disease Models, Animal
MH - Ebola Virus/*immunology
MH - Female
MH - Guinea Pigs
MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control
MH - Human
MH - Male
MH - Mice
MH - Mice, Inbred BALB C
MH - Nucleocapsid Proteins/immunology
MH - Plasmids
MH - T-Lymphocytes/immunology
MH - Transfection
MH - *Vaccines, DNA
MH - Viral Proteins/biosynthesis/immunology
MH - *Viral Vaccines
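The MH records above can be read mechanically. The sketch below assumes the usual MEDLINE display convention, in which a slash introduces a subheading (qualifier) and an asterisk marks a major topic of the article; the function name `parse_mesh` is made up for this example:

```python
def parse_mesh(line):
    """Split one 'MH - ...' line into (heading, qualifiers, is_major)."""
    # Split on the first hyphen only, so 'T-Lymphocytes' stays intact.
    body = line.split("-", 1)[1].strip()
    heading, *qualifiers = body.split("/")
    # Assumption: '*' on the heading or on any qualifier flags a major topic.
    is_major = heading.startswith("*") or any(q.startswith("*") for q in qualifiers)
    return heading.lstrip("*"), [q.lstrip("*") for q in qualifiers], is_major

records = [
    "MH - Guinea Pigs",
    "MH - Ebola Virus/*immunology",
    "MH - Hemorrhagic Fever, Ebola/*immunology/*prevention & control",
    "MH - T-Lymphocytes/immunology",
    "MH - *Vaccines, DNA",
]
parsed = [parse_mesh(r) for r in records]
major_topics = [heading for heading, _, is_major in parsed if is_major]
print(major_topics)
```

Note that descriptors such as "Guinea Pigs" and "Female" describe the study without being its main subject, a distinction plain word matching cannot make.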

Page 11

Honeypots

• Build your own controlled vocabulary for this document.

Page 12

Controlled Vocabulary

• What kinds of difficulties do you think might arise?

Page 13

Controlled Vocabulary

• What kinds of difficulties do you think might arise?

– Maintenance of the vocabulary is costly

• changes over time

• must train specialists

– Many documents = a lot of person hours reading/indexing

– Searcher’s vocabulary may not match indexer’s
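The vocabulary mismatch can be illustrated with a toy lookup. Everything here is hypothetical: the descriptors are borrowed from the MeSH example, while the entry-term table and document ids are invented for illustration:

```python
# Documents filed under controlled-vocabulary descriptors by an indexer.
index = {
    "Hemorrhagic Fever, Ebola": ["doc-ebola"],
    "Vaccines, DNA": ["doc-ebola"],
}

# Hypothetical entry terms that someone must maintain by hand so that
# everyday wording can be mapped onto the official descriptor.
ENTRY_TERMS = {
    "ebola": "Hemorrhagic Fever, Ebola",
    "dna vaccine": "Vaccines, DNA",
}

def lookup(user_term):
    """Return documents for an exact descriptor or a known entry term."""
    if user_term in index:
        return index[user_term]
    descriptor = ENTRY_TERMS.get(user_term.lower())
    return index.get(descriptor, [])

print(lookup("Ebola"))             # known entry term, so the document is found
print(lookup("genetic vaccination"))  # the abstract's own wording, but unmapped
```

The second query fails even though the phrase appears in the document's abstract, because nobody added it to the entry-term table.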

Page 14

Free Text

• Choose words from within the document text

– make two short lists

• words you think would be useful

• words you don’t believe would be useful

Page 15

What does an IR system do?

• Generate a representation of each document

– essentially pick best words and/or phrases

• Generate query representation

– if documents processed specially, queries must also be

– possibly weight query words

• Match queries and documents

– find relevant documents

• Perhaps, rank and sort documents
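The four steps above can be sketched in a few lines. This is a deliberately crude illustration, not how any particular engine works: documents and queries are reduced to sets of lowercased words, and the score is simply the number of shared words. The document texts and ids are made up:

```python
def represent(text):
    """Very crude representation: the set of lowercased words."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}

docs = {
    "d1": "Honeypots attract crackers.",
    "d2": "Crackers break into honeypots because they can!",
    "d3": "Relational databases store structured records.",
}
doc_reps = {doc_id: represent(text) for doc_id, text in docs.items()}

def search(query):
    # Queries go through the same processing as documents.
    q = represent(query)
    scored = [(len(q & rep), doc_id) for doc_id, rep in doc_reps.items()]
    # Keep only matching documents, best overlap first (ties broken by id).
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

results = search("Why do crackers break into honeypots?")
print(results)
```

Here d2 shares four words with the query and d1 shares two, so d2 ranks first and d3 is not returned at all.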

Page 16

Ambiguity Complicates the Task

• Synonyms: many ways to express concept

– lorry/truck, elevator/lift, pump/impeller, hypertension/high blood pressure

– failure to use specific words => failure to get doc

• Words have many meanings

– How many different meanings are there for “bank”?
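A small sketch of the synonym problem and one common remedy, query expansion. The thesaurus is hypothetical and built from the synonym pairs above; the documents are invented:

```python
# Hypothetical thesaurus: each set holds terms treated as equivalent.
SYNONYMS = [
    {"lorry", "truck"},
    {"elevator", "lift"},
    {"hypertension", "high blood pressure"},
]

docs = {
    "d1": "the lorry was parked outside",
    "d2": "a truck blocked the road",
    "d3": "the lift was out of order",
}

def expand(term):
    """Return the term together with any synonyms the thesaurus lists."""
    for group in SYNONYMS:
        if term in group:
            return group
    return {term}

def contains(text, term):
    # Pad with spaces so 'lift' does not match inside 'uplift'.
    return f" {term} " in f" {text} "

def exact_search(term):
    return sorted(d for d, text in docs.items() if contains(text, term))

def expanded_search(term):
    return sorted(d for d, text in docs.items()
                  if any(contains(text, t) for t in expand(term)))

print(exact_search("truck"))
print(expanded_search("truck"))
```

Expansion recovers the "lorry" document, but it does nothing for words with many meanings: expanding "lift" to "elevator" is wrong when the searcher meant lifting weights.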

Page 17

Ambiguity Complicates the Task

• Difficult to Specify Important but Vague Concepts

– e.g. will interest rates be raised in the next six months?

• Spelling variants / spelling errors
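One hedged way to cope with spelling variants and errors is approximate string matching against the index vocabulary. The sketch below uses `difflib.get_close_matches` from the Python standard library; the vocabulary list and the 0.8 cutoff are assumptions chosen for illustration:

```python
import difflib

# Hypothetical index vocabulary containing British/American variants.
vocabulary = ["color", "colour", "hemorrhage", "haemorrhage", "honeypot", "retrieval"]

def close_terms(word, cutoff=0.8):
    """Return vocabulary terms whose similarity to `word` is at least `cutoff`."""
    return difflib.get_close_matches(word, vocabulary, n=5, cutoff=cutoff)

print(close_terms("retreival"))  # a transposition typo
print(close_terms("colour"))     # a spelling variant
```

A lower cutoff catches more errors but also pulls in unrelated words, so the threshold is a precision/recall trade-off.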

Page 18

Basic Automatic Indexing

• Parse documents to recognize structure

– e.g. title, date, other fields

• Scan for word tokens

– numbers, special characters, hyphenation, capitalization, etc.

– languages like Chinese need segmentation

– record positional information for proximity operators

• Stopword removal

– based on short list of common words such as “the”, “and”, “or”

– saves storage overhead of very long indexes

– can be dangerous (e.g. “Mr. The”, “and-or gates”)
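The tokenizing, positional, and stopword steps above might look like the following sketch. The stopword list, the tokenizer rule (split on anything that is not a letter or digit), and the helper names are assumptions for illustration; real systems make different choices:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "or", "a", "of", "to"}  # short illustrative list

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs, stopwords=STOPWORDS):
    """Map term -> doc id -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Positions are counted before stopwords are dropped, so
        # distances between the remaining words stay meaningful.
        for position, token in enumerate(tokenize(text)):
            if token not in stopwords:
                index[token][doc_id].append(position)
    return index

def within(index, first, second, doc_id, distance):
    """Proximity operator: `second` occurs at most `distance` tokens after `first`."""
    firsts = index.get(first, {}).get(doc_id, [])
    seconds = index.get(second, {}).get(doc_id, [])
    return any(0 < q - p <= distance for p in firsts for q in seconds)

docs = {
    "d1": "Set up a server and fill it with tempting files.",
    "d2": "The circuit uses and-or gates.",
}
index = build_index(docs)
print(dict(index["server"]))
print(within(index, "tempting", "files", "d1", 1))
print("and" in index)
```

The last line shows the danger noted above: once "and" and "or" are dropped, a query for "and-or gates" can only match on "gates".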

Page 19

Basic Automatic Indexing

• Stem words

– group word variants such as plurals via morphological processing

• computer, computers, computing, computed, computation, computerized, computerize, computerizable

– can make mistakes but generally preferred

• Optional

– phrase indexing

– thesaurus classes
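To make stemming concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions picked so that a few of the variants above collapse to one stem:

```python
# Illustrative suffix list, longest first so 'ers' is tried before 'er' or 's'.
SUFFIXES = ["ation", "ers", "ing", "er", "ed", "s"]

def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

variants = ["computer", "computers", "computing", "computed", "computation"]
stems = {stem(w) for w in variants}
print(stems)
print(stem("caring"))
```

All five variants reduce to "comput", which is the desired grouping. The second call shows the kind of mistake mentioned above: "caring" becomes "car" and would be conflated with the vehicle.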

Page 20

How do you rank results?

• What does it mean for a document to be important/relevant?

• Word matching is imperfect; how do we decide which documents are most important?

Page 21

How do you rank results?

• How do we decide which documents are most important?

– Count words

• high frequency words indicate document “aboutness”

– Weight infrequent corpus words more strongly

• can be strong signifiers of meaning; easier to partition

– Determine meaning by analyzing text surrounding a word

– Give extra weight to title words, etc.

– Make sense of references given, citations received, etc.
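Three of the ideas above (count words, weight rare corpus words more strongly, boost title words) can be combined in a small TF-IDF style scorer. This is one common formulation offered as a sketch, not the course's prescribed method; the documents, the title boost of 3, and the function names are all assumptions:

```python
import math
from collections import Counter

docs = {
    "d1": {"title": "honeypots", "body": "crackers break into honeypots and honeynets"},
    "d2": {"title": "ebola", "body": "immunization protects against ebola virus infection"},
    "d3": {"title": "databases", "body": "records and fields and tables"},
}
TITLE_WEIGHT = 3  # assumed boost: a title word counts as three body occurrences

def term_counts(doc):
    """Term frequencies, with title words given extra weight."""
    counts = Counter(doc["body"].split())
    for word in doc["title"].split():
        counts[word] += TITLE_WEIGHT
    return counts

counts = {doc_id: term_counts(doc) for doc_id, doc in docs.items()}
n_docs = len(docs)

def idf(term):
    """Inverse document frequency: rarer terms get larger weights."""
    df = sum(1 for c in counts.values() if term in c)
    return math.log(n_docs / df) if df else 0.0

def score(query, doc_id):
    return sum(counts[doc_id][t] * idf(t) for t in query.split())

def rank(query):
    scored = [(score(query, d), d) for d in docs]
    return [d for s, d in sorted(scored, key=lambda x: (-x[0], x[1])) if s > 0]

ranking = rank("ebola and crackers")
print(ranking)
```

The common word "and" appears in two of three documents, so it contributes little, while "ebola" is rare and also boosted by the title, which puts d2 first.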

Page 22

Free Text Search Engines

• Different engines use different ranking strategies (often a trade secret)

– Word frequency

– Placement in document

– Popularity of document

– Number of links to document

– Business relationships, etc.