Text and Web Mining


  • 8/10/2019 Text Dan Web Mining

    1/45

    5/12/2014

    1

    Text & Web Mining
    Data Mining, Computer Science (Ilmu Komputer), IPB

    Lecture 12

    Structured data

    So far we have dealt with structured data.

    Data mining generally uses data of this kind:

    Attribute      Value
    Outlook        Sunny
    Temperature    Hot
    Windy          Yes
    Humidity       High
    Play           Yes


    Complex Data Types

    The growth of complex data:
      Spatial data: geographic data, medical & satellite images
      Multimedia data: images, audio, & video
      Time-series data: banking data & stock exchange data
      Text data: word descriptions for objects
      World-Wide-Web: highly unstructured text & multimedia data

    Text Databases

    In practice there are many text databases:
      news articles
      research papers
      books
      digital libraries
      e-mail
      web pages

    They are growing rapidly in both volume and importance (80%)


    Text Mining

    Text mining refers to data mining that uses text documents as its data.

    Almost all text mining tasks use Information Retrieval (IR) methods to pre-process text documents.

    These methods differ somewhat from the data pre-processing methods used for relational tables.

    Web search also has its roots in IR.

    CS583, Bing Liu, UIC

    Definition of Text Mining

    Discover useful and previously unknown "gems" of information in large text collections


    Definition of Text Mining

    Text Mining = Data Mining (applied to text data) + basic linguistics

    Text mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.

    Definition

    What does "previously unknown" mean?

    Strict definition
      Information that even the author does not know
      Example: discovering a new method for hair growth that is a side effect of some procedure

    Loose definition
      Rediscovering information that the author has written in the text
      Example: automatically extracting product names from a web page


    DM vs TM

    Aspect                  | Data Mining                                     | Text Mining
    Object of investigation | Numerical and categorical data                  | Texts
    Object structure        | Relational databases                            | Free-form texts
    Goal                    | Predict outcomes of future situations           | Retrieve relevant information, distill the meaning, categorize and target-deliver
    Methods                 | Machine learning: SKAT, DT, NN, GA, MBR, MBA    | Indexing, special neural network processing, linguistics, ontologies
    Current market size     | 100,000 analysts at large and midsize companies | 100,000,000 corporate workers and individual users
    Maturity                | Broad implementation since 1994                 | Broad implementation starting 2000

    Search vs Discover

                             | Search (goal-oriented) | Discover (opportunistic)
    Structured data          | Data Retrieval         | Data Mining
    Unstructured data (text) | Information Retrieval  | Text Mining


    Applications of Text Mining

    Marketing: finding groups of potential buyers based on users' text profiles, e.g. Amazon
    Industry: identifying the web sites of competitor groups, their products, and their prices
    Job search: identifying the parameters of a job search, e.g. www.flipdog.com

    Applications of Text Mining

    Search engines
    Enterprise portals
    Knowledge management systems
    e-Business systems
    Vertical applications:
      e-mail categorization and routing
      Call center notes categorization
      CRM systems


    [Diagram: IR system components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, Text Database]

    Search Subsystem

    [Diagram: query -> parse query -> query tokens -> stop list* -> non-stoplist tokens -> stemming* -> stemmed terms -> Boolean operations* (against the inverted file system) -> retrieved document set -> relevant document set -> ranking* -> ranked document set]

    * Indicates optional operation.


    Indexing Subsystem

    [Diagram: documents -> assign document IDs -> text, with document numbers and *field numbers -> break into tokens -> tokens -> stop list* -> non-stoplist tokens -> stemming* -> stemmed terms -> term weighting* -> terms with weights -> inverted file system]

    * Indicates optional operation.
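The indexing steps in this diagram can be illustrated with a very small inverted index. This is only a sketch under stated assumptions: the stop list, the sample documents, and the function names `build_inverted_index` and `boolean_and` are illustrative and do not come from the slides, and the optional stemming and term-weighting steps are left out (raw term frequencies stand in for weights).

```python
import re
from collections import defaultdict

# Illustrative stop list (an assumption, not taken from the slides).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}

def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each non-stop-list term to {document ID: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):  # assign document IDs
        for token in tokenize(text):
            if token in STOP_WORDS:
                continue
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)

def boolean_and(index, query):
    """Return sorted IDs of documents that contain every non-stop query term."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        return []
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings))

docs = [
    "Text mining uses text documents as data",
    "Web search is rooted in information retrieval",
    "Information retrieval methods pre-process text documents",
]
index = build_inverted_index(docs)
print(boolean_and(index, "text documents"))
```

The same tokenizer and stop list are applied to both documents and queries, which mirrors the way the search subsystem and the indexing subsystem share their text operations.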

    Text Mining

    [Diagram with two phases, Learning and Working. Labels: sample documents, transformed representation models, learning, domain-specific templates/models, text document, visualizations]


    Text characteristics: Outline

    Large textual database
    High dimensionality
    Several input modes
    Dependency
    Ambiguity
    Noisy data
    Not well structured text

    Text characteristics

    Large textual database
      Efficiency consideration
      over 2,000,000,000 web pages
      almost all publications are also in electronic form
    High dimensionality (sparse input)
      Consider each word/phrase as a dimension
    Several input modes
      e.g., Web mining: information about the user is generated by semantics, browse pattern and outside knowledge base.


    Text characteristics

    Dependency
      Relevant information is a complex conjunction of words/phrases
      e.g., document categorization, pronoun disambiguation
    Ambiguity
      Word ambiguity
        Pronouns (he, she, ...)
        Synonyms (buy, purchase)
      Semantic ambiguity
        The king saw the rabbit with his glasses. (8 meanings)

    Text characteristics

    Noisy data
      Example: spelling mistakes
    Not well structured text
      Chat rooms
        "r u available ?"
        "Hey whazzzzzz up"
      Speech


    Syntactic / Semantic text analysis

    Part of Speech (POS) tagging
      Find the corresponding POS for each word
      e.g., John (noun) gave (verb) the (det) ball (noun)
      ~98% accurate
    Word sense disambiguation
      Context based or proximity based
      Very accurate
    Parsing
      Generates a parse tree (graph) for each sentence
      Each sentence is a stand-alone graph

    Feature Generation: Bag of words

    A text document is represented by the words it contains (and their occurrences)
      e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
      Highly efficient
      Makes learning far simpler and easier
      Order of words is not that important for certain applications
    Stemming: identifies a word by its root
      e.g., flying, flew -> fly
      Reduces dimensionality
    Stop words: the most common words are unlikely to help text mining
      e.g., "the", "a", "an", "you"
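A bag-of-words representation with stop-word removal and stemming can be sketched in a few lines. The stop list and the suffix-stripping rule below are assumptions made for illustration: `crude_stem` is a hypothetical helper, not a real stemming algorithm, and it does not handle irregular forms such as "flew".

```python
import re
from collections import Counter

# Small illustrative stop list (an assumption, not taken from the slides).
STOP_WORDS = {"the", "a", "an", "you", "of"}
# Crude suffix stripping; a real system would use a proper stemmer.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text, remove_stop_words=True, stem=True):
    """Return a Counter mapping each (optionally stemmed) word to its count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return Counter(tokens)

raw = bag_of_words("Lord of the rings", remove_stop_words=False, stem=False)
reduced = bag_of_words("Lord of the rings")
print(sorted(raw), sorted(reduced))
```

Word order is discarded in both results; only the word counts survive, which is the trade-off the slide describes.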


    Feature Generation: D2K Example

    Original message:

    Hi,
    Here is your weekly update (that unfortunately hasn't gone out in about a month). Not much action here right now.
    1) Due to the unwavering insistence of a member of the group, the ncsa.d2k.modules.core.datatype package is now completely independent of the d2k application.
    2) Transformations are now handled differently in Tables. Previously, transformations were done using a TransformationModule. That module could then be added to a list that an ExampleTable kept. Now, there is an interface called Transformation and a sub-interface called ReversibleTransformation.

    After lowercasing and stop-word removal:

    hi, weekly update (that unfortunately gone out month). much action here right now. 1) due unwavering insistence member group, ncsa.d2k.modules.core.datatype package now completely independent d2k application. 2) transformations now handled differently tables. previously, transformations done using transformationmodule. module added list exampletable kept. now, interface called transformation sub-interface called reversibletransformation.

    After stemming and punctuation removal:

    hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules core datatype package now complete independence d2k application 2 transformation now handle different table previous transformation do use transformationmodule module add list exampletable keep now interface call transformation sub-interface call reversibletransformation

    Feature Generation: XML

    Current keyword-oriented search engines cannot handle rich queries like "Find all books authored by Scooby-Doo".

    XML: Extensible Markup Language
      XML documents have a nested structure in which each element is associated with a tag.
      Tags describe the semantics of elements.

    The making of a bad movie Scooby-Doo Cartoons
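How tags make such a query answerable can be shown with Python's standard `xml.etree.ElementTree` module. The markup below is hypothetical: the tag names `catalog`, `book`, `title`, `author`, and `genre` are assumptions, because the slide's original tags did not survive in this transcript, and the second book is invented filler so that the query has something to exclude.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; tag names and the second entry are assumptions.
CATALOG = """
<catalog>
  <book>
    <title>The making of a bad movie</title>
    <author>Scooby-Doo</author>
    <genre>Cartoons</genre>
  </book>
  <book>
    <title>Another title</title>
    <author>Someone Else</author>
    <genre>Drama</genre>
  </book>
</catalog>
"""

def books_by_author(xml_text, author):
    """Return the titles of all <book> elements whose <author> matches exactly."""
    root = ET.fromstring(xml_text.strip())
    return [
        book.findtext("title")
        for book in root.iter("book")
        if book.findtext("author") == author
    ]

titles = books_by_author(CATALOG, "Scooby-Doo")
print(titles)
```

A keyword search for "Scooby-Doo" would also match books that merely mention the name; the tag lets the query target the author role specifically.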


    Feature selection

    Reduce dimensionality
      Learners have difficulty addressing tasks with high dimensionality
    Irrelevant features
      Not all features help!
      e.g., the existence of a noun in a news article is unlikely to help classify it as "politics" or "sport"

    Feature selection: D2K Example I

    hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation

    hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface


    Feature selection: D2K Example II

    hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation

    hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface

    hi week update unfortunate month action right due insistence member group ncsa d2k modules core datatype package complete independence application transformation handle different table previous add list interface call sub-interface

    Text Mining: Classification definition

    Given: a collection of labeled records (training set)
      Each record contains a set of features (attributes), and the true class (label)
    Find: a model for the class as a function of the values of the features
    Goal: previously unseen records should be assigned a class as accurately as possible
      A test set is used to determine the accuracy of the model.
      Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it


    Text Mining: Clustering definition

    Given: a set of documents and a similarity measure among documents
    Find: clusters such that:
      Documents in one cluster are more similar to one another
      Documents in separate clusters are less similar to one another
    Goal:
      Finding a correct set of documents

    Similarity Measures:
      Euclidean distance if attributes are continuous
      Other problem-specific measures
        e.g., how many words are common in these documents
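The similarity measures named on this slide can be written down directly. The sketch below covers Euclidean distance and the count of common words; the Jaccard similarity is an addition of this example (a normalized form of the common-word count), not something the slide mentions, and whitespace tokenization is a simplifying assumption.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def common_word_count(doc_a, doc_b):
    """Number of distinct words the two documents share (case-insensitive)."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

def jaccard_similarity(doc_a, doc_b):
    """Shared words divided by all distinct words; 1.0 for two empty documents."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(euclidean_distance([0, 0], [3, 4]))
print(common_word_count("An English football fan", "A Danish football fan"))
```

A clustering algorithm would call one of these functions on every pair of documents it compares, so the choice of measure determines what "similar" means for the resulting clusters.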

    Example

    GREAT Camera., Jun 3, 2004
    Reviewer: jprice174 from Atlanta, Ga.

    I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital.

    The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out.

    Summary:

    Feature1: picture
      Positive: 12
        The pictures coming out of this camera are amazing.
        Overall this is a good camera with a really good picture clarity.
      Negative: 2
        The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
        Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.

    Feature2: battery life


    Visual Comparison

    [Figure: bar chart "Summary of reviews of Digital camera 1" showing positive (+) and negative (_) opinions per feature (Picture, Battery, Size, Weight, Zoom), and a second chart "Comparison of reviews of Digital camera 1 and Digital camera 2"]

    Information Extraction

    Posting from Newsgroup

    Telecommunications. Solaris Systems Administrator. 55-60K. Immediate need.

    3P is a leading telecommunications firm in need of an energetic individual to fill the following position in the Atlanta office:

    SOLARIS SYSTEM ADMINISTRATOR
    Salary: 50-60K with full benefits
    Location: Atlanta, Georgia no relocation assistance provided

    FILLED TEMPLATE

    job title: SOLARIS SYSTEM ADMINISTRATOR
    salary: 55-60K
    city: Atlanta
    state: Georgia
    platform: SOLARIS
    area: Telecommunications
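Filling a template like this one can be approximated with hand-written patterns. The sketch below is an assumption-heavy illustration, not the method behind the slide: the regular expressions, the rule that the job title is the first all-capitals line, and the rule that the platform is the first word of the title are all invented for this one posting.

```python
import re

POSTING = """Telecommunications. Solaris Systems Administrator. 55-60K. Immediate need.

3P is a leading telecommunications firm in need of an energetic individual to
fill the following position in the Atlanta office:

SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia no relocation assistance provided
"""

def fill_template(posting):
    """Fill a job template using simple, posting-specific patterns (assumptions)."""
    template = {}
    # Job title: first line written entirely in capital letters and spaces.
    title = re.search(r"^([A-Z][A-Z ]+)$", posting, flags=re.MULTILINE)
    if title:
        template["job title"] = title.group(1).strip()
        template["platform"] = template["job title"].split()[0]
    # Salary: first range of the form NN-NNK anywhere in the posting.
    salary = re.search(r"\b(\d+-\d+K)\b", posting)
    if salary:
        template["salary"] = salary.group(1)
    # Location line: "City, State".
    location = re.search(r"Location:\s*([A-Z][a-z]+),\s*([A-Z][a-z]+)", posting)
    if location:
        template["city"], template["state"] = location.groups()
    # Area: the first sentence of the subject line.
    template["area"] = posting.split(".")[0].strip()
    return template

filled = fill_template(POSTING)
print(filled)
```

Patterns like these break as soon as the posting layout changes, which is why extraction rules are usually learned from labeled examples rather than written by hand.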


    Classification: An Example

    Training set:

    Ex#  Country  Marital Status  Income  Hooligan
    1    England  Single          125K    Yes
    2    England  Married                 Yes
    3    England  Single          70K     Yes
    4    Italy    Married         40K     No
    5    USA      Divorced        95K     No
    6    England  Married         60K     Yes
    7    England                  20K     Yes
    8    Italy    Single          85K     Yes
    9    France   Married         75K     No
    10   Denmark  Single          50K     No

    Training set -> Learn classifier -> Model

    Test set:

    Country  Marital Status  Income  Hooligan
    England  Single          75K     ?
    Turkey   Married         50K     ?
    England  Married         150K    ?
             Divorced        90K     ?
             Single          40K     ?
    Italy    Married         80K     ?


    Text Classification: An Example

    Training set:

    Ex#  Text                                               Hooligan
    1    An English football fan                            Yes
    2    During a game in Italy                             Yes
    3    England has been beating France                    Yes
    4    Italian football fans were cheering                No
    5    An average USA salesman earns 75K                  No
    6    The game in London was horrific                    Yes
    7    Manchester city is likely to win the championship  Yes
    8    Rome is taking the lead in the football league     Yes

    Training set -> Learn classifier -> Model

    Test set:

    Text                                               Hooligan
    A Danish football fan                              ?
    Turkey is playing vs. France. The Turkish fans     ?
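One way to learn a classifier from the training texts above is a multinomial Naive Bayes model over bags of words. The slides do not say which learner is intended, so the choice of Naive Bayes, the add-one smoothing, and the decision to ignore words never seen in training are all assumptions of this sketch.

```python
import math
import re
from collections import Counter, defaultdict

# Training texts and labels as they appear on the slide.
TRAINING_SET = [
    ("An English football fan", "Yes"),
    ("During a game in Italy", "Yes"),
    ("England has been beating France", "Yes"),
    ("Italian football fans were cheering", "No"),
    ("An average USA salesman earns 75K", "No"),
    ("The game in London was horrific", "Yes"),
    ("Manchester city is likely to win the championship", "Yes"),
    ("Rome is taking the lead in the football league", "Yes"),
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def train(examples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_docs[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_docs, word_counts, vocabulary

def predict(model, text):
    """Return the class with the highest log-probability under Naive Bayes."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING_SET)
print(predict(model, "A Danish football fan"))
```

With only eight training texts the model mostly reflects the class prior and a handful of word counts, so its predictions illustrate the workflow (train on the training set, apply the model to the test set) rather than a trustworthy classifier.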


    Web Mining
    Data Mining, Computer Science (Ilmu Komputer), IPB

    [Diagram: WWW -> Web Mining -> Knowledge]


    Example: Web data extraction

    [Screenshot: a product listing page with two marked data regions (Data region 1, Data region 2), each containing data records]

    Align and extract data items (e.g., region 1)

    image1 | EN7410 17-inch LCD Monitor Black/Dark charcoal |              | $299.99 |                                       | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
    image2 | 17-inch LCD Monitor                            |              | $249.99 |                                       | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
    image3 | AL1714 17-inch LCD Monitor, Black              |              | $269.99 |                                       | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
    image4 | SyncMaster 712n 17-inch LCD Monitor, Black     | Was: $369.99 | $299.99 | Save $70 After: $70 mail-in-rebate(s) | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare


    Ads vs. search results

    Reproduced from Ullman & Rajaraman with permission

    Ads vs. search results

    Search advertising is the revenue model
      Multi-billion-dollar industry
      Advertisers pay for clicks on their ads
    Interesting problems
      How to pick the top 10 results for a search from 2,230,000 matching pages?
      What ads to show for a search?
      If I'm an advertiser, which search terms should I bid on and how much to bid?


    What's Web Mining?

    Discovering interesting and useful information from Web content and usage

    Web search: Google, Yahoo, MSN, Ask
      Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog)
    eCommerce:
      Recommendations: e.g. Netflix, Amazon
      Improving conversion rate: next best product to offer
    Advertising, e.g. Google Adsense
    Fraud detection: click fraud detection
    Improving Web site design and performance

    Web Mining

    Web mining: data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996).

    Web mining research integrates research from several research communities (Kosala and Blockeel, July 2000) such as:
      Database (DB)
      Information retrieval (IR)
      The sub-areas of machine learning (ML)
      Natural language processing (NLP)

    May 12, 2014 Web Mining


    Web Mining

    The World Wide Web may have more opportunities for data mining than any other area
    However, there are serious challenges:
      It is too huge
      Complexity of Web pages is greater than any traditional text document collection
      It is highly dynamic
      It has a broad diversity of users
      Only a tiny portion of the information is truly useful

    How big is the Web?

    Number of pages
      Technically, infinite
        Because of dynamically generated content
        Lots of duplication (30-40%)
      Best estimate of "unique" static HTML pages comes from search engine claims
        Google = 8 billion, Yahoo = 20 billion
        Lots of marketing hype


    Why Mine the Web?

    Enormous wealth of textual information on the Web
      Book/CD/Video stores (e.g., Amazon)
      Restaurant information (e.g., Zagats)
      Car prices (e.g., Carpoint)
    Lots of data on user access patterns
      Web logs contain sequence of URLs accessed by users
    Possible to retrieve "previously unknown" information
      People who ski also frequently break their leg.
      Restaurants that serve sea food in California are likely to be outside San-Francisco

    In May 2014: 975,262,468 sites, 16 million more than the previous month
    http://news.netcraft.com/archives/category/web-server-survey/


    Unique Features of the Web

    The Web is a huge collection of documents, many of which contain:
      Hyper-link information
      Access and usage information
    The Web is very dynamic
      Web pages are constantly being generated (and removed)
    Challenge: develop new Web mining algorithms to ...
      Exploit hyper-links and access patterns.
      Be adaptable to its documents source

    Web Mining vs Data Mining

    Structure
      Web is not a relation
      Textual information and linkage structure
    Scale
      Usage data is huge and growing rapidly
      Data generated per day is comparable to the largest conventional data warehouses
    Speed
      Often need to react to evolving usage patterns in real-time (e.g., merchandising)
      No human in the loop


    Web Mining Taxonomy

    Web Mining
      Web Content Mining
      Web Structure Mining
      Web Usage Mining

    Web Mining Taxonomy

    Web Mining
      Web Content Mining
        Web Page Content Mining
          Identify information within given web pages
          Distinguish personal home pages from other web pages
        Search Result Mining
          Categorizes documents using phrases in titles and snippets
      Web Structure Mining
        Uses interconnections between web pages to give weight to pages
      Web Usage Mining
        General Access Pattern Tracking
          Understand access patterns and trends to improve structure
        Customized Usage Tracking
          Analyzes access patterns of a user to improve response


    Mining the World Wide Web

    Web Mining
      Web Content Mining
        Web Page Content Mining
          Web Page Summarization
            WebOQL (Mendelzon et al. 1998): Web structuring query languages; can identify information within given web pages
            (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
            ShopBot (Etzioni et al. 1997): looks for product prices within web pages
        Search Result Mining
      Web Structure Mining
      Web Usage Mining
        General Access Pattern Tracking
        Customized Usage Tracking

    Mining the World Wide Web

    Web Mining
      Web Content Mining
        Web Page Content Mining
        Search Result Mining
          Search Engine Result Summarization
            Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets
      Web Structure Mining
      Web Usage Mining
        General Access Pattern Tracking
        Customized Usage Tracking


    Mining the World Wide Web

    Web Mining
      Web Content Mining
        Web Page Content Mining
        Search Result Mining
      Web Structure Mining
        Using Links
          PageRank (Brin et al., 1998)
          CLEVER (Chakrabarti et al., 1998)
          Use interconnections between web pages to give weight to pages.
        Using Generalization
          MLDB (1994)
          Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
      Web Usage Mining
        General Access Pattern Tracking
        Customized Usage Tracking
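The idea of using interconnections between pages to give weight to pages can be sketched as a power iteration in the style of PageRank. This is a simplified textbook formulation, not the exact algorithm of Brin et al.; the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate page weights; links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: D links to C but nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In the toy graph, page C collects links from three pages and ends up with the largest weight, while D, which nobody links to, keeps only the baseline share.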

    Mining the World Wide Web

    Web Mining
      Web Content Mining
        Web Page Content Mining
        Search Result Mining
      Web Structure Mining
      Web Usage Mining
        General Access Pattern Tracking
          Web Log Mining (Zaïane, Xin and Han, 1998)
          Uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers.
        Customized Usage Tracking


    Mining the World Wide Web

    Web Mining
      Web Content Mining
        Web Page Content Mining
        Search Result Mining
      Web Structure Mining
      Web Usage Mining
        General Access Pattern Tracking
        Customized Usage Tracking
          Adaptive Sites (Perkowitz and Etzioni, 1997)
          Analyzes access patterns of each user at a time.
          Web site restructures itself automatically by learning from user access patterns.

    Web Content Mining Approaches

    Information Retrieval Approach
      To assist or improve information finding, or to filter information for users, usually based on either inferred or solicited user profiles.
    Database Approach
      To model the data on the Web and integrate it so that queries more sophisticated than keyword-based ones can be performed.


    Web Content Mining

    Aspect         | IR View                                                                                        | DB View
    View of Data   | Unstructured; Semi-structured                                                                  | Semi-structured; Web site as DB
    Main Data      | Text documents; Hypertext documents                                                            | Hypertext documents
    Representation | Bag of words, n-grams; Terms, phrases; Concepts or ontology; Relational                        | Edge-labeled graph; Relational
    Methods        | Machine learning; Statistics                                                                   | ILP; Association rules
    Applications   | Categorization; Clustering; Finding extraction rules; Finding patterns in text; User modeling | Finding frequent substructures; Web site schema discovery

    Issues in Web Content Mining

    Developing intelligent tools for IR
      Finding keywords & key phrases
      Discovering grammatical rules & collocations
      Hypertext classification/categorization
      Extracting key phrases from HTML documents
      Learning extraction models/rules
      Hierarchical clustering
      Predicting word relationships
    Building Web query systems (WebOQL, XMLQL)
    Mining multimedia data


    Web Structure Mining

    View of Data     Link structure
    Main Data        Link structure
    Representation   Graph
    Methods          Proprietary algorithms
    Applications     Categorization; Clustering

    Web Structure Mining

    Discovers the link structure of hyperlinks at the inter-document level in order to build a structural summary of a web site
      Direction 1: based on hyperlinks, categorizing Web pages and the generated information
      Direction 2: discovering the structure of the Web document itself
      Direction 3: discovering the nature of the hierarchy/network of hyperlinks in a particular web site


    Web Structure Mining

    Finding authoritative web pages
      Retrieving pages that are not only relevant, but also of high quality, or authoritative, on the topic
    Hyperlinks can indicate authority
      The Web also contains hyperlinks pointing from one page to another
      Hyperlinks contain an enormous amount of human annotation
      A hyperlink pointing to another page can be considered the author's endorsement of that page

    Web Usage Mining

    View of Data     Interactivity
    Main Data        Server logs; Browser logs
    Representation   Relational table; Graph
    Methods          Machine learning; Statistics; Association rules
    Applications     Site construction, adaptation & management; Marketing; User modeling


    Web Usage Mining

    Web usage mining is also called Web log mining
    A mining technique for discovering interesting usage patterns from the secondary data derived from users' interactions while browsing the web

    Web Usage Mining: Applications

    Targeting potential customers for electronic products
    Enhancing the quality and delivery of Internet information services to the end user
    Improving Web server system performance
    Identifying potential advertisement locations
    Facilitating personalization/adaptive sites
    Improving site design
    Fraud/intrusion detection
    Predicting user actions


    Log Data - Simple Analysis

    Statistical analysis of users
      Length of path
      Viewing time
      Number of page views
    Statistical analysis of site
      Most common pages viewed
      Most common invalid URL


    Web Log Data Mining Applications

    Association rules
      Find pages that are often viewed together
    Clustering
      Cluster users based on browsing patterns
      Cluster pages based on content
    Classification
      Relate user attributes to patterns
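Finding pages that are often viewed together amounts to counting page pairs across user sessions. The sketch below computes support for page pairs and confidence for a one-to-one rule; the sessions, the 0.5 support threshold, and the function names are made up for illustration, and a real system would use a full frequent-itemset algorithm instead of enumerating pairs.

```python
from collections import Counter
from itertools import combinations

def co_viewed_pairs(sessions, min_support=0.5):
    """Return {page pair: support} for pairs seen in at least min_support of sessions."""
    n = len(sessions)
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing antecedent that also contain consequent."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(1 for s in with_antecedent if consequent in s) / len(with_antecedent)

# Hypothetical sessions: each list holds the pages one visitor viewed.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
frequent = co_viewed_pairs(sessions)
print(frequent)
print(confidence(sessions, "/cart", "/products"))
```

Support says how common a pair is overall, while confidence is directional: every session with the cart page also viewed the products page, but not the other way round.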

    Common Log Format

    Remotehost: browser hostname or IP #
    Remote log name of user (almost always "-" meaning "unknown")
    Authuser: authenticated username
    Date: date and time of the request
    "request": exact request line from the client
    Status: the HTTP status code returned
    Bytes: the content-length of the response


    SERVER LOGS

    Fields

    Client IP: 128.101.228.20
    Authenticated User ID: - -
    Time/Date: [10/Nov/1999:10:16:39 -0600]
    Request: "GET / HTTP/1.0"
    Status: 200
    Bytes: -
    Referrer: -
    Agent: "Mozilla/4.61 [en] (WinNT; I)"
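A log entry with these fields can be parsed with a single regular expression. The sample line below is reassembled from the field values on the slide, so its exact layout (in particular the quoted referrer and agent at the end) is an assumption, as are the pattern and the `parse_log_line` helper.

```python
import re
from datetime import datetime

# Common Log Format fields, optionally followed by quoted referrer and agent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # "-" means the size is unknown or nothing was sent.
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Line reassembled from the slide's field values (layout is an assumption).
SAMPLE = (
    '128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
    '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"'
)
entry = parse_log_line(SAMPLE)
print(entry["host"], entry["status"], entry["request"])
```

Once entries are parsed into dictionaries, the simple statistics and mining tasks described for log data can work on structured records instead of raw text.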


    Searching the Web

    [Diagram: content aggregators sit between the Web and content consumers]

    Web search basics

    [Diagram: a Web crawler fetches pages from the Web, an Indexer builds the Indexes (alongside Ad indexes), and Search answers the User's query. The example results page for the query "miele" shows results 1-10 of about 7,310,000 (0.12 seconds), with Web results from www.miele.com, www.miele.co.uk, www.miele.de and www.miele.at, and Sponsored Links from www.cgappliance.com, www.vacuums.com and www.best-vacuum.com]


    Layout Structure

    Compared to plain text, a web page is a 2D presentation
      Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
      Different parts of a page are not equally important

    5/12/2014 Data Mining: Principles and Algorithms

    Title: CNN.com International
    H1: IAEA: Iran had secret nuke agenda
    H3: EXPLOSIONS ROCK BAGHDAD
    TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN
    Hyperlink:
      URL: http://www.cnn.com/...
      Anchor Text: AI oaeda
    Image:
      URL: http://www.cnn.com/image/...
      Alt & Caption: Iran nuclear
    Anchor Text: CNN Homepage News

    Web Page Block: Better Information Unit

    [Figure: a web page divided into blocks labeled Importance = High, Importance = Med, and Importance = Low]

    Web Page Blocks


    Web Usage Mining

    Applications:
      Simple and Basic:
        Monitor performance, bandwidth usage
        Catch errors (404 errors: pages not found)
        Improve web site design (shortcuts for frequent paths, remove links not used, etc.)
      Advanced and Business Critical:
        eCommerce: improve conversion, sales, profit
        Fraud detection: click stream fraud

    Web Usage Mining: Three Phases


    Web Usage Mining Issues

    Identification of exact user not possible.
    Exact sequence of pages referenced by a user not possible due to caching.
    Session not well defined
    Security, privacy, and legal issues

    Systems Issues

    Web data sets can be very large
      Tens to hundreds of terabytes
    Cannot mine on a single server!
      Need large farms of servers
    How to organize hardware/software to mine multi-terabyte data sets
      Without breaking the bank!


    Ontology Learning

    [Mädche, Staab: ECAI 2000]

    [Diagram: is-a hierarchy. root -> furnishing, accomodation, event, area, ...; accomodation -> hotel, youth hostel, ...; area -> city, region, ...; hotel -> wellness hotel]

    Association Rule Mining
      Derived concept pairs: (wellness hotel, area), (hotel, area), (accomodation, area)
      Generalized conceptual relation: hasLocation(accomodation, area)

    Semantic Web Structure/Content Mining

    Knowledge base
      Hotel: Wellnesshotel
      GolfCourse: Seaview
      belongsTo(Seaview, Wellnesshotel)
      ...

    ILP-Based Association Rule Mining, e.g. [Dehaspe, Toivonen, J. DMKD 1998]
      Hotel(x), GolfCourse(y), belongsTo(y,x) -> hasStars(x,5)
      support = 0.4 %, confidence = 89 %

    Ontology
      [Diagram: Organization with the concepts GolfCourse and Hotel; relations belongsTo and cooperatesWith; attribute name]
      FORALL X, Y
      Y: Hotel[cooperatesWith ->> X] > Y].


    Complex Data Types Summary

    Emerging areas of mining complex data types:
      Text mining can be done quite effectively, especially if the documents are semi-structured
      Web mining is more difficult due to lack of such structure
        Data includes text documents, hypertext documents, link structure, and logs
        Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification
