
Interactive Review +

a (corny) ending

12/05

Project due today (with extension)
Homework 4 due Friday
Demos (to the TA) as scheduled


Course Outcomes

• After this course, you should be able to answer:
  – How search engines work, and why some are better than others
  – Can the web be seen as a collection of (semi)structured databases?
    • If so, can we adapt database technology to the Web?
  – Can useful patterns be mined from the pages/data of the web?

What did you think these were going to be??

CSE 494/598 Information Retrieval, Mining and Integration on the Internet

about XML/XQuery/RDF

Hello, Subbarao Kambhampati. We have recommendations for you.

REVIEW


Main Topics

• Approximately three halves plus a bit:
  – Information retrieval
  – Information integration/aggregation
  – Information mining
  – Other topics as permitted by time


Adapting old disciplines for Web-age

• Information (text) retrieval
  – Scale of the web
  – Hypertext/link structure
  – Authority/hub computations
• Databases
  – Multiple databases
    • Heterogeneous, access limited, partially overlapping
  – Network (un)reliability
• Data mining [Machine Learning/Statistics/Databases]
  – Learning patterns from large-scale data


Topics Covered

1. Introduction (8/22)
2. Text retrieval; vector-space ranking
3. Indexing/Retrieval issues
4. Correlation analysis & Latent Semantic Indexing
5. Search engine technology
6. Anatomy of Google etc.
7. Clustering
8. Text Classification (m)
9. Filtering/Personalization
10. Web & Databases: Why do we even care?
11. XML and handling semi-structured data
12. Semantic web and its standards (RDF/RDF-S/OWL...)
13. Information Extraction
14. Data/Information Integration/aggregation
15. Query Processing in Data Integration: Gathering and Using Source Statistics
16. Bridging Information Retrieval and Databases
17. Social Networks

Finding “Sweet Spots” in computer-mediated cooperative work

• It is possible to get by with techniques blithely ignorant of semantics when you have humans in the loop
  – All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions”
  – …and the human very gratefully does the in-depth analysis on those few potential solutions
• Examples:
  – The incredible success of the “Bag of Words” model!
    • Bag of letters would be a disaster ;-)
    • Bag of sentences and/or NLP would be good
      – ..but only to your discriminating and irascible searchers ;-)
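To make the bag-of-words versus bag-of-letters contrast concrete, here is a minimal Python sketch. The toy documents and the helper names (`bag_of_words`, `bag_of_letters`, `cosine`) are illustrative assumptions, not material from the course. It shows that bag of words already throws away word order, while bag of letters goes further and merges texts that share no words at all.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Count whitespace-separated, lowercased tokens; word order is discarded."""
    return Counter(text.lower().split())


def bag_of_letters(text):
    """Count individual letters; word boundaries are discarded as well."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, chosen only to make the contrast visible.
same_words = ("dog bites man", "man bites dog")
same_letters = ("silent night", "listen thing")

# Bag of words cannot tell these two apart: it ignores order, hence meaning.
words_collide = bag_of_words(same_words[0]) == bag_of_words(same_words[1])

# Bag of letters also merges texts that have no word in common.
letters_collide = bag_of_letters(same_letters[0]) == bag_of_letters(same_letters[1])
word_similarity = cosine(bag_of_words(same_letters[0]), bag_of_words(same_letters[1]))

print(words_collide, letters_collide, word_similarity)
```

The human in the loop resolves the "dog bites man" ambiguity at a glance, which is why the word-level sweet spot works; at the letter level far too much is lost for the human to recover.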

Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs

• A lot of exciting research related to the web currently involves “co-opting” the masses to help with large-scale tasks
  – It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers, if ever there was one ;-)
    • Remember the mice in The Hitchhiker's Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..)
  – Collaborative knowledge compilation (Wikipedia!)
  – Collaborative curation
  – Collaborative tagging
  – Paid collaboration/contracting
• Many big open issues
  – How do you pose the problem such that it can be solved using collaborative computing?
  – How do you “incentivize” people into letting you steal their brain cycles?
    • Pay them! (Amazon mturk.com)

Tapping into the Collective Unconscious

• Another thread of exciting research is driven by the realization that the Web is not random at all!
  – It is written by humans
  – …so analyzing its structure and content allows us to tap into the collective unconscious ..
    • Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
• Examples:
  – Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
  – Analyzing the link structure of the web graph to discover communities
    • DoD and NSA are very much into this as a way of breaking terrorist cells
  – Analyzing the transaction patterns of customers (collaborative filtering)
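As a rough illustration of how meaning can emerge from co-occurrence, the sketch below counts how often two terms appear in the same document. The three-document corpus and the function name `cooccurrence_counts` are hypothetical. This is not the method of the paper mentioned above, just the simplest document-level variant of the idea.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered term pair, the number of documents containing both."""
    counts = Counter()
    for doc in docs:
        # Deduplicate and sort so each pair has one canonical (a, b) form with a < b.
        terms = sorted(set(doc.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical toy corpus; a real study would use web-scale text.
docs = [
    "jaguar car engine",
    "jaguar cat jungle",
    "car engine repair",
]
counts = cooccurrence_counts(docs)

# Terms that keep showing up together are taken to be related.
for pair, n in counts.most_common(3):
    print(pair, n)
```

Even on this tiny corpus, "car" and "engine" end up more strongly associated than any other pair, while "jaguar" is split between its car sense and its cat sense.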


Rao: I could've taught more...I could've taught more, if I'd just...I could've taught more...
T&U: Rao, there are twenty people who are mad at you because you taught too much. Look at them.
Rao: If I'd made more time...I wasted so much time, you have no idea. If I'd just...
T&U: There will be generations (of bitter people) because of what you did.
Rao: I didn't do enough.
T&U: You did so much.
Rao: This slide. We could’ve removed this slide. Why did I keep the slide? Two minutes, right there. Two minutes, two more minutes.. This music, a bit on p2p. This review. Two points on custom portals. I could easily have made two for it. At least one. I could’ve gotten one more point across. One more. One more point. A point, Sree. For this. I could've gotten one more point across and I didn't.

Adieu with an Oskar Schindler Routine.

Schindler: I could've got more...I could've got more, if I'd just...I could've got more...
Stern: Oskar, there are eleven hundred people who are alive because of you. Look at them.
Schindler: If I'd made more money...I threw away so much money, you have no idea. If I'd just...
Stern: There will be generations because of what you did.
Schindler: I didn't do enough.
Stern: You did so much.
Schindler: This car. Goeth would've bought this car. Why did I keep the car? Ten people, right there. Ten people, ten more people...(He rips the swastika pin from his lapel) This pin, two people. This is gold. Two more people. He would've given me two for it. At least one. He would've given me one. One more. One more person. A person, Stern. For this. I could've gotten one more person and I didn't.

Top few things I would have done if I had more time

• Information extraction; automated annotation
• Record/Ontology/Schema matching issues
• Customized portal generation
• P2P mediation
• Services (and service standards)
• Security issues..
• Be less demanding more often (or even once…)

“It is not what you have covered, but rather what you have uncovered”

Mr. Andersen, May I be excused? My brain is full.

A Far Side treasury…

494 students

Okay, folks, Google can be improved with LSI. We need data integration, clustering, which Google doesn’t do much, we need db/IR integration..

Blah blah Google blah blah blah blah. Blah blah blah blah blah blah, blah blah blah Google blah blah blah blah blah blah blah blah blah..

Interactive Review (Format)

• Each of you will get about 4 min to hold forth on any of the following:
  – topics covered in the course that particularly caught your fancy (and why)
  – intriguing connections *between* the various topics covered in the course that struck you
  – where you think this area ought to go
  – what topics should have been covered more
  – what topics--if any--got overplayed

Anatomy may be likened to a harvest-field.

• First come the reapers, who, entering upon untrodden ground, cut down great store of corn from all sides of them. These are the early anatomists of Europe.
• Then come the gleaners, who gather up ears enough from the bare ridges to make a few loaves of bread. Such were the anatomists of last century.
• Last of all come the geese, who still contrive to pick up a few grains scattered here and there among the stubble, and waddle home in the evening, poor things, cackling with joy because of their success.

Gentlemen, we are the geese. --John Barclay, English Anatomist

Information Integration on the Web is still rife with uncut corn

• Unlike the anatomy of Barclay’s day, the Web is still young. We are just figuring out how to tap its potential

• …You have great stores of uncut corn in front of you.

• ……