WEB MINING. Why IR? Research & Fun.

http://duilian.msra.cn

Overview of Search Engine

Flow Chart of SE

Text Processing (1) - Indexing

An index is a list of terms with relevant information:
  Frequency of terms
  Location of terms
  Etc.

Index terms represent document content and separate one document from another, e.g. “economy” vs. “computer” in a Financial Times news article.

To get the index:
  Extraction of index terms
  Computation of their weights

Text Processing (2) - Extraction

Extraction of index terms
  Word or phrase level
  Morphological analysis (stemming in English)
    “information”, “informed”, “informs”, “informative” => inform
  Removal of stop words
    “a”, “an”, “the”, “is”, “are”, “am”, …
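The extraction steps above can be sketched in a few lines of Python. This is only an illustration: the suffix list and the helper names `toy_stem` and `extract_index_terms` are assumptions made for this example, not a real stemming algorithm such as Porter's, and the stop-word set holds only the words listed on the slide.

```python
import re

# Stop words taken from the slide; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

# Toy suffix list (an assumption for illustration, not the Porter stemmer).
# Longer suffixes come first so "ative" is tried before "s".
SUFFIXES = ("ation", "ative", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_index_terms(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(extract_index_terms("The information is informative"))
# -> ['inform', 'inform']
```

A hand-written suffix list like this over-stems many words (for example "news" becomes "new"), which is why production systems rely on a tested stemmer instead.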

Text Processing (3) - Term Weight

Calculation of term weights
  Statistical weights using frequency information
  Importance of a term in a document

E.g. TF*IDF
  TF: term frequency, the total number of occurrences of a term k in a document
  IDF: inverse document frequency of a term k in a collection
  DF: the number of documents in which the term appears

High TF, low DF: a good word to represent the text
High TF, high DF: a bad word
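A minimal sketch of the TF*IDF weighting described above. The slide does not give an exact formula, so this example assumes one common variant, weight = TF * log(N / DF), where N is the number of documents; the function name `tf_idf` and the two sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document; docs is a list of token lists."""
    n_docs = len(docs)
    # DF: number of documents in which each term appears at least once.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # TF: raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical two-document collection.
docs = [
    ["economy", "market", "economy", "news"],
    ["computer", "chip", "news"],
]
weights = tf_idf(docs)
print(weights[0]["economy"])  # high TF, low DF: large weight
print(weights[0]["news"])     # appears in every document: weight 0.0
```

With this variant, a term that occurs in every document gets weight zero, which matches the slide's point that a high-DF word is a poor descriptor.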

An Example

[Figure: Document 1 and Document 2]

Text Processing (4) - Storing indexing results

[Figure: index table with columns “Index Word” and “Word Info.”; rows include “Arizona” and “University”, each with entries for Document 1 and Document 2]

Text Processing (2) - Storing indexing result

Text Processing (3) - Inverted File
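As a rough illustration of an inverted file, the sketch below maps each index term to the documents that contain it and the positions where it occurs. The function name `build_inverted_file` and the contents of the two documents are assumptions for this example; only the words “Arizona” and “University” come from the slides.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to {doc_id: [positions]} given {doc_id: list of tokens}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical documents, already tokenized and lowercased.
docs = {
    1: ["university", "of", "arizona"],
    2: ["arizona", "state", "university"],
}
index = build_inverted_file(docs)
print(index["arizona"])     # -> {1: [2], 2: [0]}
print(index["university"])  # -> {1: [0], 2: [2]}
```

Looking up a query term is then a single dictionary access, instead of a scan over every document.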

Matching & Ranking (2)

Ranking
  Retrieval Model
    Boolean (exact) => Fuzzy Set (inexact)
    Vector Space
    Probabilistic
    Inference Net ...
  Weighting Schemes
    Index terms, query terms
    Document characteristics

Vector Space Model
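In the vector space model, documents and queries are treated as vectors of term weights and ranked by how closely they point in the same direction. The sketch below uses cosine similarity, a common choice; the function names `cosine` and `rank` and the sample weights are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return doc ids sorted by decreasing cosine similarity to the query."""
    scores = {doc_id: cosine(query, vec) for doc_id, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical term-weight vectors, e.g. produced by TF*IDF.
docs = {
    "d1": {"economy": 2.0, "market": 1.0},
    "d2": {"computer": 1.0, "chip": 1.0},
}
print(rank({"economy": 1.0}, docs))  # -> ['d1', 'd2']
```

Unlike Boolean matching, every document gets a graded score, so partial matches can still be ranked.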

Techniques for efficiency
  New storage structure, esp. for new document types
  Use of accumulators for efficient generation of ranked output
  Compression/decompression of indexes

Technique for Web search engines
  Use of hyperlinks
    Inlinks & outlinks (PageRank)
    Authority vs hub pages (HITS)
  In conjunction with Directory Services (e.g. Yahoo)
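The slides mention compression and decompression of indexes without naming a method. One widely used approach, assumed here purely for illustration, stores each sorted posting list as gaps between consecutive document ids and encodes the gaps with variable-byte codes. All function names below are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers; the high bit marks the last byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80  # the lowest 7 bits are written last and carry the stop bit
        out.extend(reversed(chunk))
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:  # stop bit set: this byte ends the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


def compress_postings(doc_ids):
    """Store a sorted doc-id list as gaps, then variable-byte encode the gaps."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Decode the gaps and rebuild the original doc ids by running sum."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 300, 301, 100000]  # hypothetical sorted document ids
packed = compress_postings(postings)
print(len(packed), decompress_postings(packed) == postings)  # -> 8 True
```

Gaps are small when a term is frequent, so most of them fit in a single byte instead of a fixed-width integer.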

Matching & Ranking (2)

PageRank Algorithm

Basic idea: more links to a page implies a better page
  But all links are not created equal
  Links from a more important page should count more than links from a weaker page

Basic PageRank R(A) for page A:
  R(A) = sum, over all pages B that link to A, of R(B) / outDegree(B)
  outDegree(B) = number of edges leaving page B = hyperlinks on page B
  Page B distributes its rank boost over all the pages it points to
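The basic PageRank idea, where each page B splits its rank evenly over the pages it points to, can be sketched as a simple iteration. This is a simplified illustration under stated assumptions: every page has at least one outlink, every link target is itself a key of the graph, and no damping factor is used. The function name `basic_pagerank` and the three-page graph are made up for the example.

```python
def basic_pagerank(out_links, iterations=100):
    """Iterate R(A) = sum of R(B) / outDegree(B) over pages B linking to A.

    out_links maps each page to the list of pages it links to. This sketch
    assumes every page has at least one outlink and uses no damping factor.
    """
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for b, targets in out_links.items():
            share = rank[b] / len(targets)  # B splits its rank over its outlinks
            for a in targets:
                new_rank[a] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = basic_pagerank(graph)
print({page: round(r, 3) for page, r in ranks.items()})
```

On this graph the ranks settle at A = 0.4, B = 0.2, C = 0.4: C collects rank from both A and B, and passes all of it back to A through its single outlink. Practical versions add a damping factor so the iteration also behaves well on pages without outlinks and on disconnected graphs.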

Readings

Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.

Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.

Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.

James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.

Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.

Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html

Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.

Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”

Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag, 1997. (See http://nlp.cs.nyu.edu/publication/index.shtml)
