10/14/2015 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data
Data Mining: Concepts and Techniques
— Chapter 10 —10.3.1 Mining Text and Web Data (I)
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Acknowledgements: Based on the slides by students at CS512 (Spring 2009)
Outline
- Introduction to Information Retrieval (Rui Li)
- Text categorization (Parikshit Sondhi)
- Web link analysis (Kavita Ganesan)
- Mining and Searching Structured Data on the Web (Bo Zhao)

Information Retrieval
What is Information Retrieval?
- There exists a collection of text documents; a user gives a query to express an information need; the retrieval system returns relevant documents to the user
- Typical IR systems: online library catalogs, online document management systems, web search engines (e.g., Google)
Information Retrieval vs. Database Systems:
- unstructured/free text vs. structured data
- ambiguous vs. well-defined semantics
- incomplete vs. complete specification
- relevant documents vs. matched records
- no transaction management vs. transaction management
Typical IR System Architecture
[Architecture diagram: documents are run through a Tokenizer and an Indexer to build the Index (document representations); the user's query is turned into a query representation; a Scorer matches the query representation against the indexed document representations and returns ranked results to the user.]
Document Representation
A document can be described by a set of representative keywords called index terms. Different index terms have varying relevance when used to describe document contents.
Steps:
- Tokenize the document into words
- Remove stop words using a stop-word list (e.g., "is", "a", "or")
- Stem words: several words are small syntactic variants of each other since they share a common word stem (e.g., drug, drugs, drugged)
- Calculate each term's weight based on word frequency
Query representation is a similar process.
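The preprocessing steps above can be sketched as follows. The stop-word list and the crude suffix-stripping stemmer are toy stand-ins (a real system would use a full stop-word list and, e.g., the Porter stemmer):

```python
from collections import Counter
import re

STOP_WORDS = {"is", "a", "or", "the", "and", "of", "to", "in"}  # toy list

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer:
    # maps "drugs" and "drugged" to the common stem "drug".
    for suffix in ("ged", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def represent(doc):
    """Tokenize, remove stop words, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", doc.lower())          # 1. tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 2. remove stop words
    tokens = [stem(t) for t in tokens]                   # 3. stem
    return Counter(tokens)                               # 4. raw TF weights

rep = represent("A drug or drugs? The drugged patient is sleeping.")
```

Here `rep` collapses "drug", "drugs", and "drugged" into a single index term with frequency 3.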
Indexing: the Inverted Index
Maintains two hash- or B+-tree-indexed tables:
- document_table: a set of document records <doc_id, postings_list>
- term_table: a set of term records <term, postings_list>
Answering a query: find all docs associated with one term or a set of terms
+ easy to implement
+ effective for fetching documents with a specific term
– does not handle synonymy and polysemy well, and postings lists can be too long (storage can be very large)
Other index techniques: e.g., the signature file
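A minimal in-memory sketch of the term table and of answering a conjunctive query by intersecting postings lists (real systems keep hash- or B+-tree-indexed tables on disk; the documents here are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """term_table: term -> postings list of doc_ids containing the term."""
    term_table = defaultdict(list)
    for doc_id, text in docs.items():
        for term in sorted(set(text.lower().split())):
            term_table[term].append(doc_id)
    return term_table

def answer(term_table, terms):
    """Find all docs associated with a set of terms (intersect postings)."""
    postings = [set(term_table.get(t, [])) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "data mining algorithms", 2: "web mining", 3: "database systems"}
index = build_inverted_index(docs)
print(answer(index, ["mining"]))          # docs 1 and 2 contain "mining"
```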
Ranking Model
The basic question: given a query, how do we know whether document A is more relevant than document B?
Relevance ≈ Similarity:
- Query and document are represented similarly; a query can be regarded as a "document"
- Relevance(d, q) ∝ similarity(d, q)
Key issues: how to represent the query/document, and how to define the similarity measure
Typical models: Boolean model, Vector Space Model, Language Model
The Notion of Relevance
- Relevance ≈ Similarity(Rep(q), Rep(d)), with different representations and similarity measures:
  - Vector space model (Salton et al., 75)
  - Prob. distribution model (Wong & Yao, 89)
  - …
- Relevance ≈ P(r=1|q,d), r ∈ {0,1}: probability of relevance
  - Generative models:
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - Regression model (Fox 83)
- Relevance ≈ P(d → q) or P(q → d): probabilistic inference, with different inference systems:
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
Vector Space Model
Represent a doc/query by a term vector:
- Term: basic concept, e.g., word or phrase
- Each term defines one dimension, and N terms define a high-dimensional space
- Each element of the vector corresponds to a term weight, i.e., the "importance" of the term
[Illustration: documents D1-D11 and a query plotted in a space with axes Java, Microsoft, and Starbucks; the documents closest to the query vector are the best matches.]
04/19/23 Data Mining: Principles and Algorithms
How to Assign Weights
Two-fold heuristics based on frequency:
- TF (term frequency): more frequent within a document → more relevant to its semantics (e.g., "query" vs. "commercial")
- IDF (inverse document frequency): less frequent among documents → more discriminative (e.g., "algebra" vs. "science")
TF-IDF weighting: weight(t, d) = TF(t, d) × IDF(t)
- Frequent within a doc → high TF → high weight
- Selective among docs → high IDF → high weight
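The TF-IDF weighting can be computed directly. This sketch assumes the common variant IDF(t) = log(N / DF(t)), since the slide does not fix an exact IDF formula:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc,
    with weight(t, d) = TF(t, d) * IDF(t) and IDF(t) = log(N / DF(t))."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["data", "mining", "data"], ["web", "mining"], ["web", "search"]]
w = tf_idf(docs)
```

Note that a term occurring in every document gets IDF = log(1) = 0: it is frequent but not selective, so its weight vanishes.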
How to Measure Similarity?
Given two documents represented as term-weight vectors d1 and d2, the similarity can be defined as:
- dot product: sim(d1, d2) = Σ_i w_1i · w_2i
- normalized dot product (cosine): sim(d1, d2) = (Σ_i w_1i · w_2i) / (|d1| · |d2|)
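Both similarity definitions in one short sketch, with sparse dictionaries standing in for term vectors:

```python
import math

def dot(d1, d2):
    """Dot-product similarity of two sparse term-weight vectors."""
    return sum(w * d2.get(t, 0.0) for t, w in d1.items())

def cosine(d1, d2):
    """Normalized dot product: dot(d1, d2) / (|d1| * |d2|)."""
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot(d1, d2) / (norm1 * norm2) if norm1 and norm2 else 0.0

a = {"data": 1.0, "mining": 2.0}
b = {"data": 2.0, "mining": 4.0}   # same direction, different length
print(cosine(a, b))                # identical orientation -> 1.0
```

The normalization is what makes cosine insensitive to document length: `a` and `b` differ in magnitude but point the same way.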
Advantages and Disadvantages of the VS Model
Advantages:
- Empirically effective (top TREC performance)
- Intuitive
- Easy to implement
- Well studied / the most evaluated model
Disadvantages:
- Assumes term independence
- Assumes query and document are of the same kind
- Lacks "predictive adequacy"
- Arbitrary term weighting
- Arbitrary similarity measure
Language Models for Retrieval (Ponte & Croft 98)
Each document is associated with its own language model:
- a text mining paper: text ?, mining ?, association ?, clustering ?, …, food ?, …
- a food nutrition paper: food ?, nutrition ?, healthy ?, diet ?, …
Query = "data mining algorithms": which model would most likely have generated this query?
Text Generation with a Unigram LM
A (unigram) language model p(w|θ) generates a document by sampling words:
- Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → generates a text mining paper
- Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → generates a food nutrition paper
Estimation of a Unigram LM
Given a document, estimate p(w|θ) from its word counts. For a "text mining paper" with a total of 100 words:
text 10 → 10/100, mining 5 → 5/100, association 3 → 3/100, database 3 → 3/100, algorithm 2 → 2/100, …, query 1 → 1/100, efficient 1 → 1/100, …
Ranking Docs by Query Likelihood
Estimate a language model for each document d1, d2, …, dN, then rank the documents by the query likelihoods p(q|d1), p(q|d2), …, p(q|dN).
Retrieval as Language Model Estimation
Document ranking based on query likelihood: for query q = w1 w2 … wn,

  log p(q|d) = Σ_i log p(w_i|d)

where p(·|d) is the document language model.
- The retrieval problem reduces to the estimation of p(w_i|d)
- Smoothing is an important issue, and distinguishes different approaches
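A sketch of query-likelihood ranking with smoothing. Jelinek-Mercer interpolation with λ = 0.5 is one common smoothing choice, assumed here since the slides do not fix one; the two toy documents echo the earlier text-mining/nutrition example:

```python
import math
from collections import Counter

def smoothed_lm(doc_tokens, collection_tf, collection_len, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the document's maximum-likelihood
    estimate with the collection model so unseen query words get p > 0."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    def p(w):
        return lam * tf[w] / n + (1 - lam) * collection_tf[w] / collection_len
    return p

def query_log_likelihood(query_tokens, p):
    # log p(q|d) = sum_i log p(w_i|d)
    return sum(math.log(p(w)) for w in query_tokens)

docs = {"d1": "text mining text clustering".split(),
        "d2": "food nutrition healthy diet".split()}
collection = [w for toks in docs.values() for w in toks]
ctf, clen = Counter(collection), len(collection)

query = "text mining".split()
scores = {d: query_log_likelihood(query, smoothed_lm(toks, ctf, clen))
          for d, toks in docs.items()}
print(max(scores, key=scores.get))  # the text mining paper ranks highest
```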
Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):

  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:

  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

[Venn diagram: within the set of all documents, the Relevant and Retrieved sets overlap in the region "Relevant & Retrieved".]
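Both measures can be computed directly from the relevant and retrieved sets:

```python
def precision_recall(relevant, retrieved):
    """precision = |Relevant & Retrieved| / |Retrieved|
       recall    = |Relevant & Retrieved| / |Relevant|"""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 3 retrieved docs are relevant; 2 of the 4 relevant docs retrieved
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
```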
Acknowledgements: some slides come from Professor Jiawei Han's CS512 course slides and from Professor ChengXiang Zhai's CS410 course slides (language model part).

By: Parikshit Sondhi
Computer Science, University of Illinois at Urbana-Champaign
Some slides have been adapted from Prof. Han's presentation
Text Categorization
Document Classification: Motivation
News article classification, automatic email filtering, webpage classification, word sense disambiguation, …
Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard classification (supervised learning) problem
[Illustration: a categorization system is trained on documents labeled Sports, Business, Education, Science, …, then assigns new documents to Sports, Business, Education.]
Document Classification: Problem Definition
Assign a boolean value {0, 1} to each entry a_ij of the decision matrix, where:
- C = {c1, …, cm} is a set of pre-defined categories
- D = {d1, …, dn} is a set of documents to be categorized
- a_ij = 1 means d_j belongs to c_i; a_ij = 0 means d_j does not belong to c_i
(A Tutorial on Automated Text Categorisation, Fabrizio Sebastiani, Pisa, Italy)
Flavors of Classification
- Single-label: for a given d_i, at most one (d_i, c_i) is true; train a system that takes a d_i and C as input and outputs a single c_i
- Multi-label: for a given d_i, zero, one, or more (d_i, c_i) can be true; train a system that takes a d_i and C as input and outputs C', a subset of C
- Binary: build a separate system for each c_i that takes a d_i as input and outputs a boolean value for (d_i, c_i); the most general approach; based on the assumption that the decision on (d_i, c_i) is independent of the decision on (d_i, c_j)
Classification Methods
Manual: typically rule-based (the knowledge engineering approach)
- Does not scale up (labor-intensive, rule inconsistency)
- May be appropriate for special data in a particular domain
Automatic: typically exploiting machine learning techniques
- Vector space model based: prototype-based (Rocchio), k-nearest neighbor (kNN), decision trees (learn rules), neural networks (learn a non-linear classifier), support vector machines (SVM)
- Probabilistic or generative model based: naïve Bayes classifier
Steps in Document Classification
The classification process:
- Data preprocessing, e.g., term extraction, dimensionality reduction, feature selection, etc.
- Definition of the training and test sets
- Creation of the classification model using the selected classification algorithm
- Classification model validation
- Classification of new/unknown text documents
Taking an Example: TFIDF Classifiers
Vector Space Model
- Represent a doc by a term vector; term: basic concept, e.g., word or phrase
- Each term defines one dimension; N terms define an N-dimensional space
- Each element of the vector corresponds to a term weight, e.g., d = (x1, …, xN), where x_i is the "importance" of term i
A new document is assigned to the most likely category based on vector similarity (e.g., based on the cosine formula).
VS Model: Illustration
[Illustration: categories C1, C2, C3 plotted in a space with axes Java, Microsoft, and Starbucks; a new doc is assigned to the nearest category.]
TFIDF Classifier
The basic idea of the algorithm is to represent each document d as a vector d = (d(1), …, d(|F|)) in a vector space, so that documents with similar content have similar vectors.
- Each dimension of the vector space represents a word selected by the feature selection process.
- d(i), called the weight of word w_i in document d, is calculated as a combination of the statistics TF(w_i, d) and DF(w_i).
(A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA)
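A Rocchio-style prototype classifier along these lines can be sketched as follows: average each category's training vectors into a centroid, then assign a new document to the most cosine-similar centroid. The term weights below are hand-made toy values, not real TFIDF statistics:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average the term-weight vectors of one category's training docs."""
    c = {}
    for vec in vectors:
        for t, w in vec.items():
            c[t] = c.get(t, 0.0) + w / len(vectors)
    return c

def train(labeled):  # labeled: category -> list of term-weight vectors
    return {cat: centroid(vecs) for cat, vecs in labeled.items()}

def classify(prototypes, doc_vec):
    return max(prototypes, key=lambda cat: cosine(prototypes[cat], doc_vec))

prototypes = train({
    "sports":   [{"game": 2.0, "team": 1.0}, {"team": 2.0, "score": 1.0}],
    "business": [{"market": 2.0, "stock": 1.0}, {"stock": 2.0, "profit": 1.0}],
})
print(classify(prototypes, {"team": 1.0, "game": 1.0}))  # prints "sports"
```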
Representation
- Each distinct word is a feature, with the number of times the word occurs in the document as its value; this value is usually a function of TF(w, d) and IDF(w).
- To avoid unnecessarily large feature vectors, words are considered features only if they occur in the training data at least m times (e.g., m = 3).
Preprocessing: Feature Selection
All available features vs. a "good" subset: the problem of finding a "good" subset of features is called feature selection.
Feature selection methods:
1. Pruning of infrequent words: words are only considered as features if they occur at least a few times in the training data.
2. Pruning of high-frequency words: this technique is supposed to eliminate non-content words like "the", "and", "for".
Classification: TFIDF Classifier
Evaluations
Effectiveness measure, classic: precision & recall
- Precision: of the documents assigned to a category, the fraction that truly belong to it
- Recall: of the documents that truly belong to a category, the fraction that are assigned to it
Evaluation (cont'd)
Benchmarks, classic: the Reuters collection
- A set of newswire stories classified under categories related to economics
Effectiveness: difficulties of strict comparison
- different parameter settings
- different "splits" (or selections) between training and testing
- various optimizations
However, some results are widely recognized: best are boosting-based committee classifiers & SVM; worst is the naïve Bayes classifier.
Other factors, especially efficiency, also need to be considered.
Document Classification: Approach Comparisons
Document Clustering
Motivation: automatically group related documents based on their contents
- No predetermined training sets or taxonomies
- Generate a taxonomy at runtime
The most popular clustering methods are k-means clustering, agglomerative hierarchical clustering, EM (Gaussian mixtures), …
The Steps and Algorithms
Clustering process:
- Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
- Hierarchical clustering: compute similarities by applying clustering algorithms
- Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)
K-Means Clustering
Given a set of documents (e.g., TFIDF vectors), a distance measure (e.g., cosine), and K (the number of groups):
- For each of the K groups, initialize its centroid with a random document
- While not converged:
  - Assign each document to the nearest group (represented by its centroid)
  - For each group, calculate the new centroid (the group mass point, i.e., the average document in the group)
Slide adapted from Dr. Andrew Moore’s Presentation
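The loop above can be sketched directly. Cosine distance and the toy 2-D vectors are illustrative choices; real document vectors would be high-dimensional TFIDF vectors:

```python
import math
import random

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def kmeans(docs, k, iters=20, seed=0):
    """docs: list of dense term-weight vectors of equal length."""
    random.seed(seed)
    centroids = random.sample(docs, k)  # init centroids with random documents
    assign = []
    for _ in range(iters):
        # Assignment step: each document goes to the nearest centroid.
        assign = [min(range(k), key=lambda j: cosine_dist(d, centroids[j]))
                  for d in docs]
        # Update step: each centroid becomes the average of its documents.
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return assign, centroids

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assign, _ = kmeans(docs, k=2)   # the two pairs end up in separate groups
```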
Summary: Text Categorization
- Wide application domain
- Effectiveness comparable to that of professionals
- Manual TC is not 100% accurate and unlikely to improve substantially
- Automated TC is growing at a steady pace
- Prospects and extensions: very noisy text, such as OCR output and speech transcripts
References
- Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, March 2002.
- Yiming Yang, "An evaluation of statistical approaches to text categorization", Journal of Information Retrieval, 1:67-88, 1999.
- Yiming Yang and Xin Liu, "A re-examination of text categorization methods", Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999.
Thank You

April 19, 2023 Data Mining: Concepts and Techniques

Web Link Analysis
By Kavita Ganesan
RECAP
What is ranking in information retrieval?
[Illustration: a search performed on Google returns Doc 1, Doc 2, Doc 3, Doc 4, ranked 1st, 2nd, 3rd, and 4th.]
Why is ranking important?
- Users tend to look at only the top few results, so make sure that good matches are at the very top
- Fast access to information: savvy users want results immediately
What happens if pages are poorly ranked?
- Important matches are missed
- Poor user retention
Ranking in Text Information Retrieval
Before the web existed:
- Each document was treated as a bag of words, with minimal structure
- Ranking heuristics were based solely on the words in the documents, e.g., term frequency and inverse document frequency
After the web was born, documents:
- have structure
- contain hyperlinks
- contain components like title, author, abstract, sections, references
The question is: can we leverage this information to improve ranking?
Exploiting Inter-Document Links
What does a link tell us?
- Links indicate the utility of a doc
- Anchor text gives a description of the target page
- Pages can act as hubs and authorities
Link analysis algorithms: PageRank and HITS both use hyperlink analysis to rank documents.
PageRank
Based on the idea of a "random surfer": the likelihood that a person randomly clicking on links will arrive at any particular page.
- Pages are represented as states of a Markov chain
- The probability of moving from one page to another is modelled as a state transition probability
PageRank: Example
Pages A, B, C, D link as follows: A → B, C; B → A, D; C → A; D → A, C.
State transition matrix (rows = from, columns = to, in order A, B, C, D):

  A:  0    1/2  1/2  0
  B:  1/2  0    0    1/2
  C:  1    0    0    0
  D:  1/2  0    1/2  0

Reading column A: PR(A) = 1/2 · PR(B) + 1 · PR(C) + 1/2 · PR(D)
PageRank PageRank value for any page u can be expressed
as
PR(u) VEBu
PR(v) L(v)
L(v) = number of outbound links of page vPR(v) = PageRank of page vBu = set of pages linking to page u
April 19, 2023Data Mining: Concepts and
Techniques 55
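A power-iteration sketch of this formula on the four-page example. Note that the formula on the slide omits the damping factor that production PageRank adds:

```python
def pagerank(links, iters=100):
    """Power iteration for PR(u) = sum over v in B_u of PR(v) / L(v).
    links: page -> list of outbound pages. Real PageRank adds a damping
    factor, which the slide's formula omits."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                nxt[u] += pr[v] / len(outs)  # v passes PR(v)/L(v) to each u
        pr = nxt
    return pr

# The four-page example: A -> B, C;  B -> A, D;  C -> A;  D -> A, C
links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["A", "C"]}
pr = pagerank(links)
```

On this graph the scores converge to the stationary distribution of the Markov chain: A gets the highest score since every other page links to it.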
HITS
HITS (Hyperlink-Induced Topic Search) was developed by Jon Kleinberg at Cornell.
The algorithm identifies two types of pages:
- Authority: pages that provide important, trustworthy information on a given topic
- Hub: pages that contain links to authorities
Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs.
HITS Algorithm
1. Start with each node (page) having a hub score and an authority score of 1
2. Run the authority update rule
3. Run the hub update rule
4. Normalize the values: divide each hub score by the sum of all hub scores, and divide each authority score by the sum of all authority scores
5. Repeat from step 2 as necessary
HITS Algorithm — Authority Update
A node's authority score is the sum of the hub scores of the nodes that point to it. A page has high authority if it is linked to by pages that are recognized as hubs for information.
[Illustration: B, C, and D all point to A, so authority(A) = hub(B) + hub(C) + hub(D).]
HITS Algorithm — Hub Update
A node's hub score is the sum of the authority scores of the nodes that it points to. A page is a good hub if it links to pages that have high authority.
[Illustration: A points to E, F, and G, so hub(A) = authority(E) + authority(F) + authority(G).]
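The full update-normalize loop can be sketched as follows; the link graph is a toy example combining the two illustrations (B, C, D point to A, which in turn points to E and F):

```python
def hits(links, iters=20):
    """links: page -> list of pages it points to (a tiny sketch of HITS)."""
    pages = set(links) | {u for outs in links.values() for u in outs}
    hub = dict.fromkeys(pages, 1.0)    # step 1: all scores start at 1
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iters):
        # Authority update: sum the hub scores of nodes pointing to p.
        auth = {p: sum(hub[v] for v, outs in links.items() if p in outs)
                for p in pages}
        # Hub update: sum the authority scores of nodes p points to.
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # Normalize each score vector so it sums to 1.
        asum, hsum = sum(auth.values()), sum(hub.values())
        auth = {p: s / asum for p, s in auth.items()}
        hub = {p: s / hsum for p, s in hub.items()}
    return hub, auth

links = {"B": ["A"], "C": ["A"], "D": ["A"], "A": ["E", "F"]}
hub, auth = hits(links)
```

As expected, A emerges as the strongest authority (three hubs point to it), while B, C, and D score highest as hubs.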
PageRank vs. HITS
Both are iterative algorithms based on the linkage of documents on the web. Key differences:
- HITS is executed at query time (authority and hub scores are query-specific), which takes a performance hit; PageRank is pre-computed
- HITS computes two scores per document (hub and authority); PageRank computes a single score
End

[Illustration, example authority pages: people such as Kevin Chang, Cheng Zhai, and Marianne Winslett; sites such as ibm.com, berkeley.edu, stanford.edu.]
If a page is popular, then it must be an important page.
Mining and Searching Structured Data on the Web
Bo Zhao ([email protected])
Department of Computer Science
University of Illinois at Urbana-Champaign
Structured Data are EVERYWHERE!
- Deep Web: databases behind websites (aa.com)
- Web 2.0 content: Flickr and Del.icio.us tags
- Google Base: structured data portals
- Surface Web: emails, organizations, countries, …
Solutions
- Deep Web data integration: vertical search engines; on-the-fly meta-querying systems; pay-as-you-go integration
- Deep Web surfacing
- Entity search on the surface Web
Vertical Search Engines: the "Warehousing" Approach
- Academic search: Libra@MSRA, DBLife@WISC, ArnetMiner@Tsinghua
- Many other domains: shopping, events, apartments, …
E.g., Libra Academic Search [NieZW+05] (courtesy MSRA):
- Integrates information from multiple types of sources (web databases, PS/DOC files, journal, author, and conference homepages, …)
- Ranks papers, conferences, and authors for a given query
- Handles structured queries
On-the-fly Meta-querying Systems
- MetaQuerier@UIUC
- WISE-Integrator: http://www.data.binghamton.edu:8080/wise-integrator/
- Commercial systems: http://www.cheaptickets.com, http://pipl.com, …
E.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]: maintain a "db of dbs" to FIND sources (e.g., Amazon.com, Cars.com, 411localte.com, Apartments.com) and QUERY the selected sources through a unified query interface.
Technical Challenges
- Source modeling & selection: how to describe a source and find the right sources for query answering?
- Schema matching: how to match the schematic structures between sources?
- Source querying, crawling, and object ranking: how to query a source? how to crawl all objects and search them?
- Data extraction: how to extract result pages into relations?
Source Modeling & Selection: how to describe a source and find the right sources for query answering
- Focus: discovery of sources. Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].
- Focus: extraction of source models. Hidden grammar-based parsing [ZhangHC04]; proximity-based extraction [HeMY+04]; classification to align with a given taxonomy [HessK03, Kushmerick03].
- Focus: organization of sources and query routing. Offline clustering [HeTC04, PengMH+04]; online search for query routing [KabraLC05].
Form Extraction: the Problem
Output all the query conditions; for each:
- Group elements into query conditions
- Tag elements with their "semantic roles": attribute, operator, value
Schema Matching: how to match the schematic structures between sources
- Focus: matching a large number of interface schemas, often in a holistic way. Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05]; query probing [WangWL+04]; clustering [HeMY+03, WuYD+04]; corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06].
- Focus: constructing unified interfaces. As a global generative model [HeC03]; cluster-merge-select [HeMY+03].
E.g., WISE-Integrator: cluster-merge-represent [HeMY+03]
Source Querying: how to query a source? how to crawl all objects and search them?
1. Meta-querying model. Focus: on-the-fly querying; MetaQuerier Query Assistant [ZhangHC05].
2. Vertical-search-engine model. Focus: source crawling to collect objects, via form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06]; focus: object search and ranking [NieZW+05].
On-the-fly Querying [ZhangHC05]: Type-Locality-Based Predicate Translation
[Pipeline: a target template P and target predicate t* go through a Type Recognizer, then a domain-specific handler (text, numeric, or datetime handler), then a Predicate Mapper, which outputs the source predicate s.]
- Correspondences occur within localities
- Translation is done by the type handler
Source Crawling by Query Selection [WuWL+06]
Example database:

  Author  Title        Category
  Ullman  Compiler     System
  Ullman  Data Mining  Application
  Ullman  Automata     Theory
  Han     Data Mining  Application

Conceptually, view the DB as a graph: nodes are attribute values (Ullman, Han, Compiler, Automata, Data Mining, Application, Theory, System); edges are occurrence relationships.
Crawling is then transformed into a graph traversal problem: find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with j → i, and the total cost of the nodes in N is minimized.
77
Object Ranking - Object Relationship Graph [NieZW+05]
Popularity Propagation Factor for each type of relationship link
Popularity of an object is also affected by the popularity of the Web pages containing the object
78
Data Extraction: how to extract result pages into relations
[Architecture: a mediator sits above multiple wrappers, one per source; the wrapper-mediator architecture [Wiederhold92].]
- Focus: semi-automatic wrapper construction. Manual construction; semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].
- Focus: even more automatic approaches. Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06]; automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].
You Can Only Afford to Pay As You Go
The classic data integration solution: build data integration systems over deep web sources, reformulate user queries at search time, and build a data integration system for every domain of interest.
This is impractical for web search:
- Cannot query sources too often
- Precise content descriptions are required
- There are too many domains of interest
- Mediated schema design is infeasible
Web Search Queries and Users
- Web queries are typically keyword queries, while data integration solutions assume structured queries
- Web users typically do not care whether results are structured or unstructured
- User attention is restricted to a small number of portals (~1)
PAYGO Architecture
- There can be many, potentially ill-defined, domains: mediated schema → schema clusters
- Precise mappings cannot be created to all data sources: exact mappings → approximate mappings
- Users prefer keyword queries to structured queries: query reformulation → query routing
- Data sources are diverse and mappings approximate: exact answers → heterogeneous result ranking
Uncertainty everywhere!
Pay As You Go in PAYGO
Integration is a continuous process: a priori integration is impossible, and the understanding of mappings, sources, ranking, etc. evolves over time.
Mechanisms to facilitate evolution over time:
- Automatic schema clustering and matching
- Implicit use of user feedback, e.g., from result clicks
- Result variations to elicit disambiguating user feedback
Queries are always answered with best effort; users "pay" more by correcting or creating semantic mappings.
Query Routing Example
Query: "honda civic 2007 review"
1. Keyword analysis: make, model, year, attribute
2. Domain selection: vehicle
3. Query construction: vehicle(mk:honda, md:civic, yr:2007, review:?)
4. Source selection: car-reviews-by-year.com > car-reviews.com > car-prices.com
5. Result ranking
Surfacing the Deep Web: A More Practical Solution?
Pre-compute all interesting form submissions for each HTML form:
- Each form submission corresponds to a distinct URL; add the URLs for each form submission into the search engine index
- Enables reuse of existing search engine infrastructure: deep-web URLs are like any other URL (GET method)
- Reduced load on deep-web sites: pages are fetched only in response to user clicks on search results, and search engine performance does not depend on the deep-web source
Surfacing Challenges
1. Predicting the appropriate values for text inputs: valid input values are required for retrieving data, e.g., ingredients in recipes.com and zip codes in borderstores.com
2. Predicting the correct input combinations: generating all possible URLs is wasteful and unnecessary; Cars.com has ~500K listings but 250M possible queries
Google's Deep-Web Crawling System
- Affects more than 1000 queries per second
- Enables access to more than a million deep-web sites
- Spans 50+ languages and 100+ domains
- Serves results from 400K+ distinct forms per day
- These results validate the utility of deep-web content
Other systems: http://www.deeppeep.org/, …
Searching for Structured Data on the Surface Web: EntitySearch@UIUC
Entity extraction and indexing; ranking entities directly must be:
- Contextual: utilize entities' surrounding context
- Uncertain: extractions are non-perfect
- Holistic: combine many evidences from multiple sources
- Discriminative: web pages are of varying quality
- Associative: tell true associations from accidental ones
Other systems: NAGA (http://www.mpi-inf.mpg.de/~kasneci/naga/), Correlator (http://correlator.sandbox.yahoo.net/)
References
- Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web. K. C.-C. Chang. Tutorial, SIGMOD 2006.
- EntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. VLDB 2007.
- Web-scale Data Integration: You Can Only Afford to Pay As You Go. Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy. CIDR 2007.
- Google's Deep-Web Crawl. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy. VLDB 2008.
Thank you!