Course Overview: An Introduction to Information Retrieval and Applications



Course Overview: An Introduction to Information Retrieval and Applications

J. H. Wang
Feb. 23, 2011

IR, Spring 2011 NTUT CSIE 2

Instructor & TA

• Instructor
  – J. H. Wang ( 王正豪 )
  – Assistant Professor, CSIE, NTUT
  – Office: R1534, Technology Building
  – E-mail: jhwang@csie.ntut.edu.tw
  – Tel: ext. 4238
  – Office Hour: 10:00-12:00, every Wednesday and Thursday
• TA
  – Mr. Lin ( 林承翰 ): 2011.ir.ta@gmail.com
  – R1424, Technology Building

Course Description

• Course Web Page
  – http://www.ntut.edu.tw/~jhwang/IR/
• Time: 13:10-16:00, Wed.
• Classroom: R327, 6th Teaching Building
• Textbook:
  – Christopher D. Manning, Prabhakar Raghavan and Hinrich Schuetze, Introduction to Information Retrieval, Cambridge University Press, 2008.
    • Available online
    • International Student Edition, imported by Kai-Fa ( 開發 ) Publishing
• Prerequisites:
  – Basic knowledge of data structures and algorithms, linear algebra, and probability theory
  – Programming experience is necessary for projects

Additional References

• References:
  – Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley, 2011.
    • This is the second edition of their 1999 book Modern Information Retrieval. ( 華通 )
  – Stefan Buettcher, Charles L.A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010.
  – Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010. ( 全華 )

More Books on IR

• Gerard Salton, Automatic Information Organization and Retrieval, McGraw-Hill, 1968.
• Gerard Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
  – Two classics, but out-of-print.
• C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
  – The classic. More than 30 years old, but still worth reading.
• K. Sparck Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997.
  – A collection of classical IR papers. (out of print)
• I.H. Witten, A. Moffat, T.C. Bell, Managing Gigabytes, Morgan Kaufmann, 1999.
  – The authority on index construction and compression.

Grading Policy

• Homework assignments and programming exercises: 40%

• Mid-term exam: 25%
• Term project (including the proposal): 35%

Programming Exercises and Term Project

• At least two programming exercises
  – Team-based (at most 4 persons per team)
  – You can either write your own code or reuse existing open source code
  – Topics: (to be announced…)
• The term project
  – Either team-based system development (the same as programming exercises)
  – Or academic paper presentation
    • But, you should do it on your own (only 1 person), NOT team-based
  – A proposal is required around midterm (Apr. 2011)
    • Introduction, methods, experiment designs

Online Submission

• Submission instructions
  – Programs, project proposals, and project reports in electronic files must be submitted to the TA online at:
    • http://140.124.183.39/ir/
  – Before submission:
    • User name: Your student ID
    • Please change your default password at your first login

What this Course is NOT about

• This course will NOT tell you
  – The tips and tricks when using search engines, although power users might have better ideas on how to improve them
    • There are plenty of books and websites on that…
  – How to find books in libraries, although it’s somewhat related to the basic concepts of IR
  – How to make money on the Web, although the currently largest search engine did it

What’s Information Retrieval

On Wikipedia

On GeoNet

On Google Maps

On Google News

On Blogs

Or More Related Keywords

• South Island
• Christchurch
• Canterbury
• Christchurch Cathedral
• …

What if We Search in Chinese

And More…

• 南島 (South Island)
• 第二大城 (second-largest city)
• 基督城 (Christchurch)
• 大教堂 (cathedral)
• …
• And other languages…

What Is Information Retrieval?

• “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

Goal

• Information retrieval (IR): a research field that aims to search information in text and multimedia documents effectively and efficiently

• In this course, we will introduce the basic text and query models in IR, retrieval evaluation, indexing and searching, and applications for IR

A Big Picture

[Figure: IR system architecture. Components: User Interface, Text Operations, Query Expansion, Indexing, Retrieval, Ranking, Inverted Index, Document Collection. Labeled flows: user need, query, user feedback, retrieved docs, ranked docs, text, doc representation (logical view), inverted file.]
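
The indexing and retrieval components in the picture above can be illustrated with a minimal Python sketch. It is not taken from the slides: the toy documents, the whitespace tokenizer, and the function names `tokenize`, `build_inverted_index`, and `boolean_and` are assumptions made for illustration. Real text operations (stemming, stopword removal) and ranking are left out.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real text operations."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, query):
    """Return the IDs of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy collection (not from the course material).
docs = {
    1: "Christchurch is a city on the South Island",
    2: "Christchurch Cathedral is in Canterbury",
    3: "Canterbury is a region of the South Island",
}
index = build_inverted_index(docs)
print(boolean_and(index, "south island"))             # [1, 3]
print(boolean_and(index, "christchurch canterbury"))  # [2]
```

The index is built once from the document collection, and each query then touches only the posting lists of its own terms. That is the reason for the inverted file in the diagram.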

Topics

• Text IR
  – Indexing and Searching
  – Query Languages and Operations
• Retrieval Evaluation
• Modeling
  – Boolean model
  – Vector space model
  – Probabilistic model
• Applications for IR
  – Multimedia IR
  – Web Search
  – Digital Libraries
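
As a rough preview of the vector space model listed under Modeling, the sketch below ranks documents by the cosine similarity between tf-idf vectors. It is only one common variant (raw term frequency times log(N/df)), not necessarily the weighting used in the course. The toy documents and the function names `tf_idf_vectors`, `cosine`, and `rank` are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight) per document, plus the idf table."""
    tokenized = [text.lower().split() for text in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return indices of documents with a positive score, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    q_tf = Counter(query.lower().split())
    # Query terms that never occur in the collection are ignored.
    q_vec = {term: tf * idf[term] for term, tf in q_tf.items() if term in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    ordered = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ordered if score > 0]


# Hypothetical toy collection (not from the course material).
docs = [
    "christchurch cathedral",
    "south island christchurch",
    "canterbury region south island",
]
print(rank("cathedral christchurch", docs))  # [0, 1]
```

Unlike the Boolean model, which only says whether a document matches, this model gives each document a score, so partial matches such as the second document can still be returned in ranked order.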

Organization of the Textbook

• Basics in IR (focus)
  – Inverted indexes for boolean queries (Ch. 1-5)
  – Term weighting and vector space model (Ch. 6-7)
  – Evaluation in IR (Ch. 8)
• Advanced Topics
  – Relevance feedback (Ch. 9)
  – XML retrieval (Ch. 10)
  – Probabilistic IR (Ch. 11)
  – Language models (Ch. 12)
• Machine learning in IR
  – Text classification (Ch. 13-15)
  – Document clustering (Ch. 16-18)
• Web Search
  – Web crawling and indexes (Ch. 19-20)
  – Link analysis (Ch. 21)

Pointers to Other Topics

• Cross-language IR
• Image, video, and multimedia IR
• Speech retrieval
• Music retrieval
• User interfaces
• Parallel, distributed, and P2P IR
• Digital libraries
• Information science perspective
• Logic-based approaches to IR
• Natural language processing techniques

Tentative Schedule

• Before midterm
  – Boolean retrieval (1 wk)
  – Indexing (2 wks)
  – Vector space model and evaluation (2 wks)
  – Relevance feedback (1 wk)
  – Probabilistic IR (2 wks)
• After midterm
  – Text classification (1 wk)
  – Document clustering (1 wk)
  – Web search (2 wks)
  – Advanced topics: CLIR, IE, … (2 wks)
  – Term Project Presentation (3 wks)

Generic Resources

• Wikipedia page on Information Retrieval: http://en.wikipedia.org/wiki/Information_retrieval

• Information Retrieval Resources: http://www-csli.stanford.edu/~hinrich/information-retrieval.html

Academic Resources

• Journals
  – ACM TOIS: Transactions on Information Systems
  – JASIST: Journal of the American Society for Information Science and Technology
  – IP&M: Information Processing and Management
• Conferences
  – ACM SIGIR: International Conference on Information Retrieval
  – ACM CIKM: Conference on Information and Knowledge Management
  – JCDL: ACM/IEEE Joint Conference on Digital Libraries
  – TREC: Text Retrieval Conference

Thanks for Your Attention!