BIDM


Transcript of BIDM

  • 7/29/2019 BIDM

    1/122

  • 7/29/2019 BIDM

    2/122

    Business Intelligence and Data Mining (BI & DM)

    Text Book:

    Business Intelligence: A Managerial Approach by Efraim Turban, Ramesh Sharda, Dursun Delen and David King, 2/e, Pearson, 2012

    Reference Material:

    Decision Support and Business Intelligence Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012

  • 7/29/2019 BIDM

    3/122

  • 7/29/2019 BIDM

    4/122

    Sessions Plan

      Introduction to Business Intelligence

      Decision Support Systems Concepts, Methodologies and Technologies

      Data Warehousing

      Business Performance Management

      Data Mining for Business Intelligence

      Text and Web Mining

      Business Intelligence: Implementation and Emerging Trends

    Business Intelligence and Data Mining (BI &DM)

  • 7/29/2019 BIDM

    5/122

    Introduction to Text and Web Mining

    Business Intelligence and Data Mining (BI &DM)

  • 7/29/2019 BIDM

    6/122

  • 7/29/2019 BIDM

    7/122

    Learning Objectives

    Describe Web mining, its objectives, and its benefits

    Understand the three different branches of Web mining

      Web content mining

      Web structure mining

      Web usage mining

    Understand the applications of these three mining paradigms

  • 7/29/2019 BIDM

    8/122

    Opening Vignette

    Mining Text For Security And Counterterrorism

    What is MITRE?

    Problem description

    Proposed solution

    Results

    Answer & discuss the case questions

  • 7/29/2019 BIDM

    9/122

  • 7/29/2019 BIDM

    10/122

  • 7/29/2019 BIDM

    11/122

  • 7/29/2019 BIDM

    12/122

  • 7/29/2019 BIDM

    13/122

  • 7/29/2019 BIDM

    14/122

    What is Text-Mining?

    Finding interesting regularities in large textual datasets (adapted from Usama Fayyad)

      where "interesting" means non-trivial, hidden, previously unknown and potentially useful

    Finding semantic and abstract information from the surface form of textual data

  • 7/29/2019 BIDM

    15/122

  • 7/29/2019 BIDM

    16/122

  • 7/29/2019 BIDM

    17/122

    Semi-Structured Data

    Text databases are, in general, semi-structured

    Example:

      Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category

      Unstructured: Abstract, Content

  • 7/29/2019 BIDM

    18/122

    Text Mining Process

    Text preprocessing

      Syntactic/semantic text analysis

    Features generation

      Bag of words

    Features selection

      Simple counting

      Statistics

    Text/data mining

      Classification

      Clustering

      Associations

    Analyzing results

  • 7/29/2019 BIDM

    19/122

  • 7/29/2019 BIDM

    20/122

  • 7/29/2019 BIDM

    21/122

  • 7/29/2019 BIDM

    22/122

    Levels of text representations

    Character (character n-grams and sequences)

    Words (stop-words, stemming, lemmatization)

    Phrases (word n-grams, proximity features)

    Part-of-speech tags

    Taxonomies / thesauri

    Vector-space model

    Language models

    Full-parsing

    Cross-modality

    Collaborative tagging / Web2.0

    Templates / Frames

    Ontologies / First order theories

  • 7/29/2019 BIDM

    23/122

    Levels of text representations

    Character

    Words

    Phrases

    Part-of-speech tags

    Taxonomies / thesauri

    Vector-space model

    Language models

    Full-parsing

    Cross-modality

  • 7/29/2019 BIDM

    24/122

    Character level

    A character-level representation of a text consists of sequences of characters

      A document is represented by a frequency distribution of sequences

      Usually we deal with contiguous strings

      Each character sequence of length 1, 2, 3, ... represents a feature with its frequency
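    The character-level representation can be sketched in a few lines of Python. This is only an illustration, not code from the course: the function name `char_ngram_counts` and the sample string are assumptions made for the example.

```python
from collections import Counter


def char_ngram_counts(text, max_n=3):
    """Count every contiguous character sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


# Each distinct sequence is a feature; its count is the feature value.
counts = char_ngram_counts("banana")
print(counts["a"], counts["an"], counts["ana"])
```

    The resulting counter is the frequency distribution that stands in for the document.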

  • 7/29/2019 BIDM

    25/122

  • 7/29/2019 BIDM

    26/122

    Levels of text representations

    Character

    Words

    Phrases

    Part-of-speech tags

    Taxonomies / thesauri

    Vector-space model

    Language models

    Full-parsing

    Cross-modality

  • 7/29/2019 BIDM

    27/122

    Word level

    The most common representation of text, used by many techniques

      There are many tokenization software packages that split text into words

    Important to know:

      The word is a well-defined unit in Western languages

      Other languages differ, e.g. Chinese has a different notion of semantic unit

  • 7/29/2019 BIDM

    28/122

    Words Properties

    Relations among word surface forms and their senses:

      Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)

      Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)

      Synonymy: different form, same meaning (e.g. singer, vocalist)

      Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)

    Word frequencies in texts follow a power-law distribution:

      a small number of very frequent words

      a large number of low-frequency words
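    The skew in word frequencies is easy to see even on a toy text. The sketch below is illustrative only: the regular-expression tokenizer and the sample sentence are assumptions, and a real corpus would be needed to see an actual power law.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lower-case the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


sample = "the cat sat on the mat and the dog sat on the log"
freq = word_frequencies(sample)

# Rank words from most to least frequent: a few words dominate,
# while most words occur only once.
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)
```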

  • 7/29/2019 BIDM

    29/122

  • 7/29/2019 BIDM

    30/122

    Levels of text representations

    Character

    Words

    Phrases

    Part-of-speech tags

    Taxonomies / thesauri

    Vector-space model

    Language models

    Full-parsing

    Cross-modality

  • 7/29/2019 BIDM

    31/122

    Phrase level

    Instead of having just single words we can deal with phrases

    We use two types of phrases:

      Phrases as frequent contiguous word sequences

      Phrases as frequent non-contiguous word sequences

      Both types of phrases can be identified by a simple dynamic programming algorithm

    The main effect of using phrases is to identify sense more precisely
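    For the contiguous case, frequent phrases can be found by plain counting of word n-grams. The sketch below shows that simpler approach, not the dynamic programming algorithm mentioned on the slide; the function name, thresholds, and sample documents are assumptions.

```python
from collections import Counter


def frequent_phrases(documents, n=2, min_count=2):
    """Return contiguous word n-grams that occur at least min_count times."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


docs = [
    "data mining finds patterns",
    "text mining and data mining",
    "web mining uses data mining methods",
]
print(frequent_phrases(docs))
```

    Treating "data mining" as one feature separates it from other uses of "mining", which is the sense-disambiguating effect described above.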

  • 7/29/2019 BIDM

    32/122

    Levels of text representations

    Character

    Words

    Phrases

    Part-of-speech tags

    Taxonomies / thesauri

    Vector-space model

    Language models

    Full-parsing

    Cross-modality

  • 7/29/2019 BIDM

    33/122

    Part-of-Speech level

    By introducing part-of-speech tags we introduce word types, which let us differentiate the functions of words

      For text analysis, part-of-speech information is used mainly for information extraction, where we are interested in e.g. named entities, which are noun phrases

      Another possible use is reduction of the vocabulary (features): it is known that nouns carry most of the information in text documents

    Part-of-speech taggers are usually learned with a hidden Markov model (HMM) algorithm on manually tagged data

  • 7/29/2019 BIDM

    34/122

    Part-of-Speech Table

    http://www.englishclub.com/grammar/parts-of-speech_1.htm

  • 7/29/2019 BIDM

    35/122

  • 7/29/2019 BIDM

    36/122

    Levels of text representations

    Character

    Words

    Phrases

    Part-of-speech tags

    Taxonomies / thesauri

    Vector-space model

    Language models

    Full-parsing

    Cross-modality

  • 7/29/2019 BIDM

    37/122

  • 7/29/2019 BIDM

    38/122

    WordNet database of lexical relations

    WordNet is the most well-developed and widely used lexical database for English

      It consists of 4 databases (nouns, verbs, adjectives, and adverbs)

    Each database consists of sense entries; each sense consists of a set of synonyms, e.g.:

      musician, instrumentalist, player

      person, individual, someone

      life form, organism, being

    Category    Unique Forms    Number of Senses
    Noun        94474           116317
    Verb        10319           22066
    Adjective   20170           29881
    Adverb      4546            5677

  • 7/29/2019 BIDM

    39/122

  • 7/29/2019 BIDM

    40/122

    WordNet relations

    Each WordNet entry is connected with other entries in the graph through relations

    Relations in the database of nouns:

    Relation     Definition                       Example
    Hypernym     From lower to higher concepts    breakfast -> meal
    Hyponym      From concepts to subordinates    meal -> lunch
    Has-Member   From groups to their members     faculty -> professor
    Member-Of    From members to their groups     copilot -> crew
    Has-Part     From wholes to parts             table -> leg
    Part-Of      From parts to wholes             course -> meal
    Antonym      Opposites                        leader -> follower

  • 7/29/2019 BIDM

    41/122

    Document Representation

    A document representation aims to capture what the document is about

    One possible approach

      Each entry describes a document

      Attributes describe whether or not a term appears in the document

                  Term
                  Camera   Digital   Memory   Pixel
    Document 1    1        1         0        1
    Document 2    1        1         0        0
    ...           ...      ...       ...      ...
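    A binary term table like the one above can be built as follows. This is a minimal sketch: the two document texts are invented so that they reproduce the slide's rows, and whitespace splitting stands in for a real tokenizer.

```python
def binary_term_document_matrix(documents, terms):
    """One row per document; 1 if the term appears in it, else 0."""
    matrix = []
    for text in documents:
        words = set(text.lower().split())
        matrix.append([1 if term in words else 0 for term in terms])
    return matrix


terms = ["camera", "digital", "memory", "pixel"]
# Hypothetical document texts chosen to match the table on the slide.
documents = [
    "digital camera with high pixel count",
    "compact digital camera",
]
matrix = binary_term_document_matrix(documents, terms)
for row in matrix:
    print(row)
```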

  • 7/29/2019 BIDM

    42/122

  • 7/29/2019 BIDM

    43/122

    Levels of text representations

    Character

    Words

    Phrases

    Part-of-speech tags

    Taxonomies / thesauri

    Vector-space model

    Language models

    Full-parsing

    Cross-modality

  • 7/29/2019 BIDM

    44/122

  • 7/29/2019 BIDM

    45/122

  • 7/29/2019 BIDM

    46/122

  • 7/29/2019 BIDM

    47/122

    Distance Based Matching

    To retrieve documents similar to a given document, one needs a measure of similarity

    Euclidean distance

    The Euclidean distance between X = (x_1, x_2, x_3, ..., x_n) and Y = (y_1, y_2, y_3, ..., y_n) is defined as

    D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
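    The Euclidean distance between two document vectors translates directly into code. The vectors below are illustrative and reuse the binary Camera/Digital/Memory/Pixel rows from the document representation slide; the function name is an assumption.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


doc1 = [1, 1, 0, 1]
doc2 = [1, 1, 0, 0]
distance = euclidean_distance(doc1, doc2)
print(distance)
```

    A smaller distance means more similar documents; identical vectors have distance 0.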

  • 7/29/2019 BIDM

    48/122

    Similarity between document vectors

    Each document is represented as a vector of weights D = (x_1, ..., x_n)

    Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors

      calculates the cosine of the angle between document vectors

      efficient to calculate (sum of products of intersecting words)

      similarity value between 0 (different) and 1 (the same)

    Sim(D_1, D_2) = \frac{\sum_i x_{1i} x_{2i}}{\sqrt{\sum_j x_{1j}^2} \sqrt{\sum_k x_{2k}^2}}
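    Cosine similarity can be sketched as below, assuming documents are given as equal-length lists of non-negative term weights. The function name and the example vectors are illustrative, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention for empty documents (assumption)
    return dot / (norm1 * norm2)


doc1 = [1, 1, 0, 1]
doc2 = [1, 1, 0, 0]
sim = cosine_similarity(doc1, doc2)
print(round(sim, 4))
```

    Only terms present in both documents contribute to the numerator, which is why the measure is cheap to compute on sparse vectors.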

  • 7/29/2019 BIDM

    49/122

    Performance Measure

    The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure

    The quality of a collection can be compared by the two following measures

    Precision = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}

    Recall = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}

    [Figure: Venn diagram of Relevant Documents and Retrieved Documents, overlapping in the Relevant & Retrieved region]
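    Both measures reduce to set operations on document identifiers. The sketch below is illustrative: the identifiers `d1` to `d5` are made up, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(relevant, retrieved):
    """Precision and recall from sets of relevant and retrieved document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

    Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved.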

  • 7/29/2019 BIDM

    50/122

  • 7/29/2019 BIDM

    51/122

    Cluster Analysis for Data Mining

    Analysis methods

      Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on

      Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])

      Fuzzy logic (e.g., fuzzy c-means algorithm)

      Genetic algorithms

    Divisive versus agglomerative methods

  • 7/29/2019 BIDM

    52/122

    Text Mining for Patent Analysis

    (see Applications Case 7.2)

    What is a patent?

    exclusive rights granted by a country to an

    inventor for a limited period of time in exchange

    for a disclosure of an invention

    How do we do patent analysis (PA)?

    Why do we need to do PA?

    What are the benefits?

    What are the challenges?

    How does text mining help in PA?

  • 7/29/2019 BIDM

    53/122

    Natural Language Processing (NLP)

    Structuring a collection of text

      Old approach: bag-of-words

      New approach: natural language processing

    NLP is

      a very important concept in text mining

      a subfield of artificial intelligence and computational linguistics

      the study of "understanding" the natural human language

    Syntax versus semantics based text mining

  • 7/29/2019 BIDM

    54/122

  • 7/29/2019 BIDM

    55/122

    Natural Language Processing (NLP)

    Challenges in NLP

    Part-of-speech tagging

    Text segmentation

    Word sense disambiguation

    Syntax ambiguity

    Imperfect or irregular input

    Speech acts

    Dream of AI community

    to have algorithms that are capable of automatically

    reading and obtaining knowledge from text

  • 7/29/2019 BIDM

    56/122

    Natural Language Processing (NLP)

    WordNet

      A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets

      A major resource for NLP

      Needs automation to be completed

    Sentiment Analysis

      A technique used to detect favorable and unfavorable opinions toward specific products and services

      See Application Case 7.3 for a CRM application

  • 7/29/2019 BIDM

    57/122

    NLP Task Categories

    Information retrieval

    Information extraction

    Named-entity recognition

    Question answering

    Automatic summarization

    Natural language generation & understanding

    Machine translation

    Foreign language reading & writing

    Speech recognition

    Text proofing

    Optical character recognition

  • 7/29/2019 BIDM

    58/122

    Text Mining Applications

    Marketing applications

      Enables better CRM

    Security applications

      ECHELON, OASIS

      Deception detection (example coming up)

    Medicine and biology

      Literature-based gene identification (example coming up)

    Academic applications

      Research stream analysis (example coming up)

  • 7/29/2019 BIDM

    59/122

    Text Mining Applications (gene/protein interaction identification)

    [Figure: the sentence "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53." annotated at four levels: ontology (gene/protein and concept identifiers), word (numeric word identifiers), POS, and shallow parse]

    POS: NN IN NN IN VBZ IN JJ JJ NN NN NN CC NN IN NN

    Shallow parse: NP PP NP NP PP NP NP PP NP

  • 7/29/2019 BIDM

    60/122

  • 7/29/2019 BIDM

    61/122

    Text Mining Process

    The three-step text mining process

    Task 1: Establish the Corpus: collect and organize the domain-specific unstructured data

    Task 2: Create the Term-Document Matrix: introduce structure to the corpus

    Task 3: Extract Knowledge: discover novel patterns from the T-D matrix

    Feedback: results of later tasks feed back into earlier tasks

    The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.

    The output of Task 1 is a collection of documents in some digitized format for computer processing

    The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies

    The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations

  • 7/29/2019 BIDM

    62/122

    Text Mining Process

    Step 1: Establish the corpus

    Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings)

    Digitize, standardize the collection (e.g., all in ASCII text files)

    Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)

  • 7/29/2019 BIDM

    63/122

    Text Mining Process

    Step 2: Create the Term-by-Document Matrix

    [Figure: example term-by-document matrix. Rows are documents (Document 1 to Document 6, ...); columns are terms (investment risk, project management, software engineering, development, SAP, ...); cells hold term frequencies, mostly empty, with a few values of 1, 2, or 3]

  • 7/29/2019 BIDM

    64/122

    Text Mining Process

    Step 2: Create the Term-by-Document Matrix (TDM)

    Should all terms be included?

      Stop words, include words

      Synonyms, homonyms

      Stemming

    What is the best representation of the indices (values in cells)?

      Raw counts; binary frequencies; log frequencies

      Inverse document frequency
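    The cell representations listed above can be compared on a single term. The sketch below uses one common set of definitions, 1 + ln(count) for the log frequency and ln(N / df) for the inverse document frequency; these exact formulas are assumptions, since the slides do not spell them out.

```python
import math


def cell_weightings(raw_count):
    """Raw, binary, and log-scaled versions of one term-document count."""
    return {
        "raw": raw_count,
        "binary": 1 if raw_count > 0 else 0,
        "log": 1 + math.log(raw_count) if raw_count > 0 else 0.0,
    }


def inverse_document_frequency(term_column):
    """ln(N / df), where df is the number of documents containing the term."""
    n_docs = len(term_column)
    docs_with_term = sum(1 for count in term_column if count > 0)
    return math.log(n_docs / docs_with_term) if docs_with_term else 0.0


# Counts of one hypothetical term across four documents.
column = [1, 0, 2, 0]
print(cell_weightings(3))
print(inverse_document_frequency(column))
```

    Log scaling dampens very frequent terms within a document, while the inverse document frequency down-weights terms that appear in most documents.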

  • 7/29/2019 BIDM

    65/122

    Text Mining Process

    Step 2: Create the Term-by-Document Matrix (TDM)

    TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?

      Manual: a domain expert goes through it

      Eliminate terms with very few occurrences in very few documents (?)

      Transform the matrix using singular value decomposition (SVD)

      SVD is similar to principal component analysis

  • 7/29/2019 BIDM

    66/122

    Text Mining Process

    Step 3: Extract patterns/knowledge

      Classification (text categorization)

      Clustering (natural groupings of text)

        Improve search recall

        Improve search precision

        Scatter/gather

        Query-specific clustering

      Association

      Trend analysis

  • 7/29/2019 BIDM

    67/122

    Web Mining

    The term was coined by Oren Etzioni (1996)

    Application of data mining techniques to automatically discover and extract information from Web data

  • 7/29/2019 BIDM

    68/122

    What is Web Mining?

    Discovering useful information from the

    World-Wide Web and its usage patterns

  • 7/29/2019 BIDM

    69/122

    Web Mining v. Data Mining

    Structure (or lack of it)

    Textual information and linkage structure

    Scale

    Data generated per day is comparable to largest

    conventional data warehouses

    Speed

    Often need to react to evolving usage patterns in real time (e.g., merchandising)

  • 7/29/2019 BIDM

    70/122

    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues

  • 7/29/2019 BIDM

    71/122

    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues

  • 7/29/2019 BIDM

    72/122

    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues

  • 7/29/2019 BIDM

    73/122

    Size of the Web

    Number of pages

      Technically, infinite

      Much duplication (30-40%)

      Best estimate of unique static HTML pages comes from search engine claims

        Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion

        Google recently announced that their index contains 1 trillion pages

    How to explain the discrepancy?

  • 7/29/2019 BIDM

    74/122

    The web as a graph

    Pages = nodes, hyperlinks = edges

    Ignore content

    Directed graph

    High linkage

    10-20 links/page on average

    Power-law degree distribution

  • 7/29/2019 BIDM

    75/122

    Structure of Web graph

    Let's take a closer look at structure

      Broder et al (2000) studied a crawl of 200M pages and other smaller crawls

      Bow-tie structure

      Not a small world

  • 7/29/2019 BIDM

    76/122

    Bow-tie Structure

    Source: Broder et al, 2000

  • 7/29/2019 BIDM

    77/122

  • 7/29/2019 BIDM

    78/122

    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues


  • 7/29/2019 BIDM

    79/122

    Power-law degree distribution

    Source: Broder et al, 2000


  • 7/29/2019 BIDM

    80/122

    Power-laws galore

    Structure

    In-degrees

    Out-degrees

    Number of pages per site

    Usage patterns

    Number of visitors

    Popularity e.g., products, movies, music


  • 7/29/2019 BIDM

    81/122

    The Long Tail

    Source: Chris Anderson (2004)


  • 7/29/2019 BIDM

    82/122

    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues


  • 7/29/2019 BIDM

    83/122

    Extracting Structured Data

    http://www.simplyhired.com


  • 7/29/2019 BIDM

    84/122

    Extracting structured data

    http://www.fatlens.com


  • 7/29/2019 BIDM

    85/122

    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues


  • 7/29/2019 BIDM

    86/122

    Ads vs. search results


  • 7/29/2019 BIDM

    87/122

    Ads vs. search results

    Search advertising is the revenue model

    Multi-billion-dollar industry

    Advertisers pay for clicks on their ads

    Interesting problems

    What ads to show for a search?

    If I'm an advertiser, which search terms should I bid on, and how much should I bid?

  • 7/29/2019 BIDM

    88/122

    Two Approaches to Analyzing Data

    Machine Learning approach

    Emphasizes sophisticated algorithms e.g., Support

    Vector Machines

    Data sets tend to be small, fit in memory

    Data Mining approach

    Emphasizes big data sets (e.g., in the terabytes)

    Data cannot even fit on a single disk!

    Necessarily leads to simpler algorithms

  • 7/29/2019 BIDM

    89/122

    Web Mining topics

    Web graph analysis

    Power Laws and The Long Tail

    Structured data extraction

    Web advertising

    Systems Issues


  • 7/29/2019 BIDM

    90/122

    Systems architecture

    Memory

    Disk

    CPU

    Machine Learning, Statistics

    Classical Data Mining


  • 7/29/2019 BIDM

    91/122

    Very Large-Scale Data Mining

    [Figure: a cluster of commodity nodes, each with its own CPU, memory, and disk]

  • 7/29/2019 BIDM

    92/122

    Systems Issues

    Web data sets can be very large

    Tens to hundreds of terabytes

    Cannot mine on a single server!

    Need large farms of servers

    How to organize hardware/software to mine multi-terabyte data sets

    Without breaking the bank!

  • 7/29/2019 BIDM

    93/122

    Project

    Lots of interesting project ideas

      If you can't think of one, please come discuss with us

    Infrastructure

      Aster Data cluster on Amazon EC2

      Supports both MapReduce and SQL

    Data

      Netflix

      ShareThis

      Google WebBase

      TREC

  • 7/29/2019 BIDM

    94/122

    Data Mining vs. Web Mining

    Traditional data mining

    Data is structured and relational

    Well-defined tables, columns, rows, keys, and constraints

    Web data

    Semi-structured and unstructured

    readily available data rich in features and patterns

  • 7/29/2019 BIDM

    95/122

    Web Data

    Web Structure

    Anchor tag: "Click here to Shop Online", pointing to http://www.walmart.com/

  • 7/29/2019 BIDM

    96/122

    Web Data

    Web Usage

    Application Server logs

    Http logs

  • 7/29/2019 BIDM

    97/122

    Web Data

    Web Content


  • 7/29/2019 BIDM

    98/122

    Web Mining Categories

    Web Content Mining

    Discovering useful information from web

    contents/data/documents.

    Web Structure Mining

    Discovering the model underlying link structures (topology)

    on the Web. E.g. discovering authorities and hubs

    Web Usage Mining

    Make sense of data generated by surfers

    Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

  • 7/29/2019 BIDM

    99/122

    Web Content Data Structure

    Unstructured: free text

    Semi-structured: HTML

    More structured: table or database-generated HTML pages

    Multimedia data receive less attention than text or hypertext

  • 7/29/2019 BIDM

    100/122

    Web Content Mining

    Process of information or resource discovery from the content of millions of sources across the World Wide Web

      E.g. Web data contents: text, image, audio, video, metadata and hyperlinks

    Goes beyond keyword extraction, or some simple statistics of words and phrases in documents

  • 7/29/2019 BIDM

    101/122

    Web Content Mining

    Pre-processing data before web content mining: feature selection (Piramuthu 2003)

    Post-processing data can reduce ambiguous search results (Sigletos & Paliouras 2003)

    Web Page Content Mining

      Mines the contents of documents directly

    Search Engine Mining

      Improves on the content search of other tools like search engines

  • 7/29/2019 BIDM

    102/122

    Web Content Mining

    Web content mining is related to data mining and text mining [Bing Liu, 2005] (http://www.cs.uic.edu/~liub)

      It is related to data mining because many data mining techniques can be applied in Web content mining

      It is related to text mining because much of the Web content is text

      Web data are mainly semi-structured and/or unstructured, while data mining deals with structured data and text mining with unstructured text

  • 7/29/2019 BIDM

    103/122

    Web Content Mining: IR View

    Unstructured Documents

      Bag of words, or phrase-based feature representation

      Features can be Boolean or frequency based

      Features can be reduced using different feature selection techniques

      Word stemming, combining morphological variations into one feature

  • 7/29/2019 BIDM

    104/122

    Web Content Mining: IR View

    Semi-Structured Documents

      Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)

      Uses common data mining methods (whereas unstructured documents might use more text mining methods)

  • 7/29/2019 BIDM

    105/122

    Web Content Mining: DB View

    Tries to infer the structure of a Web site or transform a Web site to become a database

      Better information management

      Better querying on the Web

    Can be achieved by:

      Finding the schema of Web documents

      Building a Web warehouse

      Building a Web knowledge base

      Building a virtual database

  • 7/29/2019 BIDM

    106/122

    Web-Structure Mining

    Generate a structural summary about the Web site and Web page

      Depending upon the hyperlinks, categorizing the Web pages and the related information at the inter-domain level

      Discovering the Web page structure

      Discovering the nature of the hierarchy of hyperlinks in the Web site and its structure

  • 7/29/2019 BIDM

    107/122

    Web-Structure Mining (cont.)

    Finding information about web pages

    Inference on hyperlinks

      Retrieving information about the relevance and the quality of the web page

      Finding the authoritative pages on the topic and content

    The web page contains not only information but also hyperlinks, which carry a huge amount of annotation

    A hyperlink identifies the author's endorsement of the other web page

  • 7/29/2019 BIDM

    108/122

    Web-Structure Mining (cont.)

    More Information on Web Structure Mining

      Web page categorization (Chakrabarti 1998)

      Finding micro-communities on the Web

      e.g. Google (Brin and Page, 1998)

      Schema discovery in semi-structured environments

  • 7/29/2019 BIDM

    109/122

    Web Usage Mining

    Tries to predict user behavior from interaction with the Web

    Wide range of data (logs)

      Web client data

      Proxy server data

      Web server data

    Two common approaches

      Map usage data into relational tables before using adapted data mining techniques

      Use log data directly by utilizing special pre-processing techniques

  • 7/29/2019 BIDM

    110/122

    Web Usage Mining

    Typical problems: distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers

    Often Usage Mining uses some background or domain knowledge

      E.g. site topology, Web content, etc.

  • 7/29/2019 BIDM

    111/122

    Web Usage Mining

    Two main categories:

    Learning a user profile (personalized)

      Web users would be interested in techniques that learn their needs and preferences automatically

    Learning user navigation patterns (impersonalized)

      Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site

  • 7/29/2019 BIDM

    112/122

    Web Usage Mining (cont.)

    Data Mining Techniques: Navigation Patterns

    Example analysis:

      70% of users who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products and /company/product1

      80% of users who accessed the site started from /company/products

      65% of users left the site after four or fewer page references

  • 7/29/2019 BIDM

    113/122

    Web Usage Mining (cont.)

    Data Mining Techniques: Sequential Patterns

    Example: Supermarket

    Customer   Transaction Time    Purchased Items
    John       6/21/05 5:30 pm     Beer
    John       6/22/05 10:20 pm    Brandy
    Frank      6/20/05 10:15 am    Juice, Coke
    Frank      6/20/05 11:50 am    Beer
    Frank      6/20/05 12:50 pm    Wine, Cider
    Mary       6/20/05 2:30 pm     Beer
    Mary       6/21/05 6:17 pm     Wine, Cider
    Mary       6/22/05 5:05 pm     Brandy

  • 7/29/2019 BIDM

    114/122

    Web Usage Mining (cont.)

    Data Mining Techniques: Sequential Patterns

    Example: Supermarket (cont.)

    Customer   Customer Sequence
    John       (Beer) (Brandy)
    Frank      (Juice, Coke) (Beer) (Wine, Cider)
    Mary       (Beer) (Wine, Cider) (Brandy)

    Mining result

    Sequential Patterns with Support >= 40%   Supporting Customers
    (Beer) (Brandy)                           John, Mary
    (Beer) (Wine, Cider)                      Frank, Mary
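    The supermarket result can be checked with a small support computation. This is a sketch under assumptions: each customer sequence is hand-coded as an ordered list of item sets taken from the example, and the helper names are invented. It tests given candidate patterns rather than discovering them, which a full sequential-pattern miner would do.

```python
def contains_sequence(customer_sequence, pattern):
    """True if the pattern's item sets occur, in order, within the
    customer's transactions (each as a subset of one transaction)."""
    position = 0
    for itemset in pattern:
        while (position < len(customer_sequence)
               and not itemset <= customer_sequence[position]):
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1  # the next item set must come from a later transaction
    return True


sequences = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}


def supporters(pattern):
    """Customers whose sequence contains the pattern."""
    return sorted(name for name, seq in sequences.items()
                  if contains_sequence(seq, pattern))


for pattern in ([{"Beer"}, {"Brandy"}], [{"Beer"}, {"Wine", "Cider"}]):
    names = supporters(pattern)
    print(pattern, names, len(names) / len(sequences))
```

    Each pattern is supported by two of the three customers, so both clear the 40% support threshold.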

  • 7/29/2019 BIDM

    115/122

    Web Usage Mining (cont.)

    Data Mining Techniques: Sequential Patterns

    Web usage examples

      In Google search, within the past week, 30% of users who visited /company/product/ had "camera" as the search text

      60% of users who placed an online order in /company/product1 also placed an order in /company/product4 within 15 days

  • 7/29/2019 BIDM

    116/122

    Tech for Web Content Mining

    Classifications

    Clustering

    Association


  • 7/29/2019 BIDM

    117/122

    Document Classification

    Supervised Learning

      Supervised learning is a machine learning technique for creating a function from training data

      Documents are categorized

      The output can predict a class label of the input object (called classification)

    Techniques used are

      Nearest Neighbor Classifier

      Feature Selection

      Decision Tree

  • 7/29/2019 BIDM

    118/122

    Feature Selection

    Removes terms in the training documents that are statistically uncorrelated with the class labels

    Simple heuristics

      Remove stop words like "a", "an", "the", etc.

      Use empirically chosen thresholds to discard terms that are too frequent or too rare
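    The simple heuristics can be sketched as a filter on document frequency. The stop-word list, both thresholds, and the sample documents are assumptions for illustration; this does not implement the statistical test against class labels.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative list


def select_terms(documents, min_docs=2, max_doc_fraction=0.6):
    """Keep terms that are not stop words and appear in at least min_docs
    documents but in no more than max_doc_fraction of all documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.lower().split()))
    n_docs = len(documents)
    return sorted(
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS
        and df >= min_docs
        and df / n_docs <= max_doc_fraction
    )


docs = [
    "the camera has a zoom lens",
    "a camera with the memory card",
    "the memory card is full",
    "an old camera lens",
]
print(select_terms(docs))
```

    With these thresholds "camera" is dropped as too frequent (3 of 4 documents), and one-off words such as "zoom" are dropped as too rare.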

  • 7/29/2019 BIDM

    119/122

    Document Clustering

    Unsupervised learning: a data set of input objects is gathered

    Goal: evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters

    Hypothesis: given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs

    Hierarchical

      Bottom-up

      Top-down

    Partitional

    Semi-Supervised Learning

  • 7/29/2019 BIDM

    120/122

    Semi-Supervised Learning

    A collection of documents is available

    A subset of the collection has known labels

    Goal: to label the rest of the collection

    Approach

      Train a supervised learner using the labeled subset

      Apply the trained learner on the remaining documents

    Idea

      Harness information in the labeled subset to enable better learning

      Also, check the collection for emergence of new topics

  • 7/29/2019 BIDM

    121/122

    Association

    Example: Supermarket

    Transaction ID   Items Purchased
    1                butter, bread, milk
    2                bread, milk, beer, egg
    3                diaper

    An association rule can be:

    If a customer buys milk, in 50% of cases he/she also buys beer. This happens in 33% of all transactions.

    50%: confidence

    33%: support
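    The support and confidence figures for the milk and beer rule can be reproduced from the three transactions. The function name `rule_metrics` is an assumption for the example.

```python
transactions = [
    {"butter", "bread", "milk"},
    {"bread", "milk", "beer", "egg"},
    {"diaper"},
]


def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_metrics(transactions, {"milk"}, {"beer"})
print(f"support={support:.0%} confidence={confidence:.0%}")
```

    Milk appears in two transactions and beer accompanies it in one of them, giving 50% confidence; that one transaction out of three gives 33% support.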

    Can also Integrate in Hyperlinks

  • 7/29/2019 BIDM

    122/122

    Q & A