Text Mining and Text Stream Mining Tutorial
Miha Grčar
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
http://kt.ijs.si
Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu


Text and text stream mining tutorial
Part I: Text mining
Part II: Text stream mining
Simple, pragmatic, focused

Lucca, Oct 2012. Miha Grčar: Text and text stream mining

Part I: Text mining

What is text mining?
Text mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from large collections of textual documents.
Text mining adopts and adapts methodologies and tools from:
- Data mining (DM)
- Machine learning (ML)
- Information retrieval (IR)
- Natural language processing (NLP)
- Visualization
- Social network analysis and graph mining
- Knowledge management

Typical text mining process
- Data acquisition: acquisition, cleaning
- Text pre-processing: transformation
- Modeling: discover, extract, organize knowledge
- Evaluation / validation: performance and utility assessment, feedback loop
- Application: presentation, interaction
[Diagram: the stages are connected by feedback loops]

What do we cover in Part I?
- Data acquisition
- Text pre-processing: vector space model (bags-of-words)
- Modeling: machine learning (classification, clustering)
- Evaluation / validation: cross validation, precision, recall
- Application: search & browse, categorization, recommendation, advertising, spam detection, summarization, visualization

Bags of words
Steps: tokenize, remove stop words, lemmatize, compute weights.
Example: "The quick brown dog jumps over the lazy dog."
- Tokenize: the | quick | brown | dog | jumps | over | the | lazy | dog
- Remove stop words: quick | brown | dog | jumps | lazy | dog
- Lemmatize: jumps -> jump
- Compute weights: quick 1, brown 1, dog 2, jump 1, lazy 1

Bags of words: tokenization & stop word removal
Original text:

After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.

Simple tokenizer (alphanumeric strings only):

After | ripping | 14 | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren | t | new | concerns | but | that | doesn | t | mean | they | aren | t | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives

Regex tokenizer ([\p{L}']+):

After | ripping | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concerns | but | that | doesn't | mean | they | aren't | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives

Bags of words: lemmatization
Lemmatized (same English text):

After | rip | high | from | June | until | the | first | week | of | October | stock | run | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concern | but | that | doesn't | mean | they | aren't | real | Investor | suddenly | care | and | are | behave | accordingly | sell | some | of | their | more | aggressive | name | and | rotate | into | defensive

Bags of words: lemmatization
Original text (Italian):

È uno dei punti più contestati della legge di Stabilità approvata da poco dal governo: il taglio alle detrazioni fiscali, ossia gli "sconti" che ogni contribuente può vantare sulla propria dichiarazione dei redditi. Secondo una bozza aggiornata del disegno di legge, il taglio si applicherebbe a decorrere dal periodo di imposta al 31 dicembre 2012. Un dettaglio che aveva creato, nei giorni scorsi, non poche polemiche.

Lemmatized:

È | uno | dei | puntare | più | contestato | della | legge | di | Stabilità | approvare | da | poco | dal | governo | il | tagliare | alle | detrazione | fiscale | ossia | gli | scontare | che | ogni | contribuire | può | vantare | sulla | proprio | dichiarazione | dei | reddito | Secondo | una | bozzare | aggiornare | del | disegnare | di | legge | il | tagliare | si | applicare | a | decorrere | dal | periodare | di | impostare | al | dicembre | Un | dettagliare | che | aveva | creare | nei | giorno | scorrere | non | poca | polemico

Computing weights
- TF (term frequency): the number of times a lemma (stem) occurs in a document
- DF (document frequency): the number of documents in which a lemma (stem) occurs at least once
- TF-IDF: higher TF means higher TF-IDF; higher DF means lower TF-IDF

Computing weights
Example: "The quick brown dog jumps over the lazy dog."

Step 1 (all DF = 1):
Term    TF   DF   IDF    TF-IDF
quick   1    1    0      0
brown   1    1    0      0
dog     2    1    0      0
jump    1    1    0      0
lazy    1    1    0      0

Step 2 (DF of "jump" = 2):
Term    TF   DF   IDF    TF-IDF
quick   1    1    0.69   0.69
brown   1    1    0.69   0.69
dog     2    1    0.69   1.39
jump    1    2    0      0
lazy    1    1    0.69   0.69

Cosine similarity
[Figure labels: d1, d2, 0; the second slide adds 1, d1', d2']
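The bag-of-words steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's implementation: the stop-word list is a made-up placeholder, no lemmatizer is included (the example passes pre-lemmatized tokens), and the weighting TF * ln(N / DF) is an assumption chosen because it reproduces the 0.69 and 1.39 values in the table above. Python's `re` module has no `\p{L}`, so the letter class is approximated with `[^\W\d_]`.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the one used in the tutorial.
STOP_WORDS = {"the", "over", "a", "of", "and"}

# Approximates the slide's [\p{L}']+ pattern: runs of letters and apostrophes.
TOKEN_RE = re.compile(r"(?:[^\W\d_]|')+")


def tokenize(text):
    """Lowercase, split into letter/apostrophe runs, drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Assumed weighting: TF * ln(N / DF), with N the number of documents.
    """
    tfs = [Counter(doc) for doc in docs]
    df = Counter(term for tf in tfs for term in tf)  # each doc counts once
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in tf.items()} for tf in tfs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Pre-lemmatized tokens of the example sentence, plus a hypothetical
# second document that contains only "jump".
docs = [["quick", "brown", "dog", "jump", "lazy", "dog"], ["jump"]]
vectors = tfidf_vectors(docs)
```

With these two documents, `vectors[0]` gives "dog" the largest weight and "jump" a weight of zero, matching the Step 2 table.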

Centroids
- Determine characteristic words in a cluster
- Nearest centroid classifier
- k-means clustering

Where are we?
Modeling: machine learning (classification, clustering)

Machine learning
- Machine learning is concerned with the development of algorithms that allow computer programs to learn from past experience [Mitchell]
- Machine learning refers to a collection of algorithms that take as input empirical data (e.g., from databases or sensors) and try to discover some characteristics (rules, constraints, patterns, features) of the process that generated the data [Wikipedia]
- Learning from past experience = learning from past examples
- Examples (instances) = document vectors (normalized sparse vectors)

Machine learning
We will look at two commonly used machine learning techniques:

Classification
- Assigning instances (documents) to two or more predefined (discrete) classes
- Supervised learning method

Clustering
- Arranging instances (documents) into groups (clusters) so that instances in the same group are more similar to each other than to those in other groups
- Unsupervised learning method

Classification
- Labeled documents
- Learn to classify
- Classify unlabeled documents

Labeled dataset:
- Mergers & Acquisitions: Ingram Wraps Up Brightpoint Buyout
- Mergers & Acquisitions: State Street completes acquisition of Goldman Sachs Administration Services
- Economy & Government: Gasoline fuels inflation, but Fed policy seen steady
- Economy & Government: Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
- ...
- Investing Picks: Smith & Wesson Holding Corp. Enters Oversold Territory
- Investing Picks: The Fresh Market: A Strong Buy

[Diagram: labeled dataset -> training algorithm -> classification model; unlabeled dataset -> classification algorithm (using the classification model) -> predictions (labels)]
Example: "Fresh Del Monte Produce Inc. Enters Oversold Territory" -> Investing Picks

Classification with k-Nearest Neighbors
[Figure: documents labeled Mergers & Acquisitions, Investing Picks, Economy & Government]
Votes: Investing Picks: 4, Mergers & Acquisitions: 1, Economy & Government: 0

Classification with Nearest Centroid Classifier
[Figure: similarities s1, s2, s3 between the document and the class centroids]
Similarity: s2 > s1 > s3
- s2: Mergers & Acquisitions
- s1: Investing Picks
- s3: Economy & Government

Classification with Support Vector Machine (SVM)
[Figure: hyperplane separating Mergers & Acquisitions from Investing Picks; labels: w, maximize, minimize, tradeoff]

Classification algorithms
                       k-NN    Nearest centroid   SVM (linear kernel)
Multiclass?            yes     yes                no
Explains decisions?    no      yes                yes
Explains model?        no      yes                yes
Number of parameters   1       0                  1
Model size             big     small              small
Training speed         0       fast               slow
Classification speed   slow    fast               fast
Accuracy (on texts)    low     medium             high
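The k-NN and nearest centroid classifiers compared above can be sketched as follows. This is an illustrative sketch under simplifying assumptions: documents are split on whitespace and weighted by raw term counts (no stop-word removal, lemmatization, or TF-IDF), and all function names are hypothetical. The training headlines and the query are the ones shown on the classification slide.

```python
import math
from collections import Counter, defaultdict


def unit(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def dot(u, v):
    """Dot product of sparse vectors (equals cosine similarity for unit vectors)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def bow(text):
    """Assumption: whitespace tokens with raw counts, normalized to unit length."""
    return unit(Counter(text.lower().split()))


def knn_classify(train, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda ex: dot(ex[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def build_centroids(train):
    """Sum the vectors of each class and normalize the sum (the class centroid)."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in train:
        for term, weight in vec.items():
            sums[label][term] += weight
    return {label: unit(total) for label, total in sums.items()}


def ncc_classify(centroids, query):
    """Pick the class whose centroid is most similar to the query."""
    return max(centroids, key=lambda label: dot(centroids[label], query))


TRAIN = [
    (bow("Ingram Wraps Up Brightpoint Buyout"), "Mergers & Acquisitions"),
    (bow("State Street completes acquisition of Goldman Sachs Administration Services"),
     "Mergers & Acquisitions"),
    (bow("Gasoline fuels inflation, but Fed policy seen steady"), "Economy & Government"),
    (bow("Smith & Wesson Holding Corp. Enters Oversold Territory"), "Investing Picks"),
    (bow("The Fresh Market: A Strong Buy"), "Investing Picks"),
]
QUERY = bow("Fresh Del Monte Produce Inc. Enters Oversold Territory")

knn_label = knn_classify(TRAIN, QUERY, k=3)
ncc_label = ncc_classify(build_centroids(TRAIN), QUERY)
```

The sketch also reflects the model-size row of the table: k-NN keeps every training vector, while the nearest centroid classifier keeps one vector per class.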

Clustering
- k-means clustering
- Agglomerative hierarchical clustering

k-means clustering
Input: k
Output: k clusters (and their centroids)
1. Randomly select k instances for initial centroids
2. Assign step: assign each instance to the nearest centroid. If the assignments did not change, end the algorithm
3. Update step: recompute (update) centroids
4. Repeat at Step 2
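The four steps above can be written out as a short sketch. It makes a few assumptions the slides do not state: instances are dense tuples compared by squared Euclidean distance (for normalized document vectors one would typically use cosine similarity instead), a fixed random seed is used for repeatability, and a centroid that loses all its members is left where it is.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (assignments, centroids) following the assign/update loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: k random instances
    assignments = None
    for _ in range(max_iter):
        # Step 2 (assign): index of the nearest centroid for every instance.
        new_assignments = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments did not change: done
        assignments = new_assignments
        # Step 3 (update): recompute each centroid from its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
        # Step 4: loop back to the assign step.
    return assignments, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
```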

Applet at http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html

Agglomerative hierarchical clustering
1. Find the two most similar instances
2. Connect them
3. Replace them with their centroid
4. Repeat
[Figure: dendrogram]

Where are we?
Evaluation / validation: cross validation, precision, recall

Evaluation
- Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)
  - 10-fold cross validation
  - Stratified
- Accuracy
- Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall | http://en.wikipedia.org/wiki/F1_Score)
- Micro and macro-averaging (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html | http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)
- Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)

Where are we?
Application: search & browse, categorization, recommendation, advertising, spam detection, summarization, visualization

Applications
- Enhanced Web search (SearchPoint)
- Social browsing (LiveNetLife)
- Content categorization
- Content-based recommender systems
- Advertising
- Blogging assistance (Zemanta)
- Spam detection
- Visualization / summarization of large corpora
- Text summarization: Leskovec et al. (2005): Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
- Sentiment analysis (demo later)
- News aggregation: http://emm.newsexplorer.eu
- Knowledge engineering: http://ontogen.ijs.si

Enhanced Web search (http://www.searchpoint.com)

Social browsing (http://www.livenetlife.com) @ http://videolectures.net
[Screenshot: chat bubbles "Hello" and "Hi!"]

Content categorization @ http://videolectures.net

Recommender system @ http://videolectures.net

Contextualized advertising

Blogging assistant (http://www.zemanta.com)

Pump & dump
Siering, Muntermann, Grčar (2012)
- Vegas77 Entertainment SE
- Spam normally sent on weekends; lines drawn at Fridays (exceptions: 28.3. and 28.4.)
- Price on Monday higher in many cases

Visualizations
- Document space visualization
- Canyon flows
- Tag clouds (http://www.jasondavies.com/wordcloud/)

Recap
Basics
- What is text mining?
- TF-IDF bag-of-words vectors
- Cosine similarity
- Centroids
Machine learning
- k-NN
- Nearest centroid classifier
- SVM
- k-means
- Agglomerative clustering
Applications
- Enhanced Web search (SearchPoint)
- Social browsing (LiveNetLife)
- Content categorization
- Content-based recommender systems
- Advertising
- Writing assistance (Zemanta)
- Spam detection
- Visualization / summarization of large corpora

Part II: Text stream mining

What is text stream mining?
Same as text mining but on streams.
Text stream mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from streams of textual documents.

Remember: typical text mining process
- Data acquisition: acquisition, cleaning
- Text pre-processing: transformation
- Modeling: discover, extract, organize knowledge
- Evaluation / validation: performance and utility assessment, feedback loop
- Application: presentation, interaction

Typical text stream mining process
- Stream data acquisition: acquisition, cleaning
- Text pre-processing: transformation
- Modeling: discover, extract, organize knowledge
- Evaluation / validation: performance and utility assessment, obtaining new labels, feedback loop
- Application: presentation, interaction

Pipelining and parallelization
- Enables concurrent processing
- Increases throughput
- Enables distributed execution (cluster)
Near-realtime online systems
- Stream cannot be paused or slowed down (e.g., newsfeeds)
- [Near-realtime] Time between reception and utilization of data should be as short as possible
- [Online] Stream is infinite and (sooner or later) outdated data needs to be deleted
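As a rough illustration of pipelining, the sketch below connects processing stages with queues so that each stage runs in its own thread and works on a different item at the same time. It is a toy under stated assumptions: the names `stage` and `run_pipeline` are hypothetical, the demo stream is finite (a sentinel marks its end, whereas a real stream never ends), and only pipelining is shown. Parallelization would add several workers per stage.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the finite demo stream


def stage(func, inbox, outbox):
    """Start a thread that applies func to each item from inbox and forwards it."""
    def worker():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # pass the end marker downstream
                break
            outbox.put(func(item))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def run_pipeline(items, funcs):
    """Feed items through the stages in order and collect the results."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [stage(f, queues[i], queues[i + 1]) for i, f in enumerate(funcs)]
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


# Two trivial stand-in stages; real stages would be, e.g., cleaning and tokenizing.
cleaned = run_pipeline(["  Hello ", "WORLD"], [str.strip, str.lower])
```

Because each stage has a single worker and the queues are first-in, first-out, items leave the pipeline in the order they entered.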

Text stream mining pipelines
[Diagram labels: stream, pipelining, parallelization]

What do we cover in Part II?
- Stream data acquisition: RSS feeds, boilerplate remover, language detection
- Text pre-processing: online BOW
- Modeling: online ML (incremental NCC, incremental k-means, incremental SVM)
- Evaluation / validation
- Application: online document space visualization, online Twitter sentiment classification

Text stream acquisition and preprocessing
[Diagram labels: RSS readers, boilerplate removers, language detectors (preprocessing pipelines, load balancing), sync, online BOW]

RSS (Really Simple Syndication)

[RSS feed example; the XML markup was lost in extraction]

NFE/1.0
Top Stories - Google News
http://news.google.com/news?pz=1&ned=us&[email protected] Google

Egypt Analysts Comment on Next Steps After Mubarak's Ouster - Bloomberg
http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNEF9B7Q8C7_TBDKPEMFjb83fcuNfQ&url=http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html
Top Stories
Fri, 11 Feb 2011 20:15:40 GMT+00:00
The ouster of Hosni Mubarak from Egypt's presidency today, after protests that started Jan. 25, prompted the following comments from analysts: The army needs to move quickly to remove obstacles to ... ...
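Reading items out of a feed like the one above takes only a few lines with Python's standard library. The sketch below assumes a plain RSS 2.0 layout (`channel` containing `item` elements with `title`, `link`, `pubDate`, and `description`). The sample feed and the `parse_rss` helper are made up for illustration and do not reproduce the markup of the Google News feed, which was lost in extraction.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed in plain RSS 2.0 form.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
      <pubDate>Fri, 11 Feb 2011 20:15:40 GMT</pubDate>
      <description>Short summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return a list of dicts, one per <item>, with missing fields as ''."""
    root = ET.fromstring(xml_text)
    fields = ("title", "link", "pubDate", "description")
    return [
        {name: item.findtext(name, default="") for name in fields}
        for item in root.iter("item")
    ]


items = parse_rss(SAMPLE_RSS)
```

In a stream setting, an RSS reader would poll the feed repeatedly and forward only items it has not seen before; that bookkeeping is left out here.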

Text stream acquisition and preprocessing
[Diagram labels: RSS readers, boilerplate removers, language detectors (preprocessing pipelines, load balancing), sync, online BOW]

Boilerplate removal
Example page: http://www.bbc.co.uk/news/world-us-canada-15051554

Boilerplate removal: URL tree
URL structure: protocol :// domain / path / file ? query
- http://kt.ijs.si/a/b/c.html?pg=0
  Tree branch: # -> si -> ijs -> kt -> a -> b (# is the root; si, ijs, kt are the domain; a, b are the path)
- http://www.bbc.co.uk/news/world-us-canada-15051554
  Tree branch: # -> uk -> co -> bbc -> www -> news

Boilerplate removal: URL tree
[Diagram: stream feeding a tree with levels root (#), domain, path]
"How many times did I see 'About Us' in this part of the tree?"
This method is:
- Unsupervised
- Online
- Incremental (consumes one document at a time)

Language detection
- Motivation: language-specific text analysis components and applications
- Solutions based on word lists and word or character sequences (n-grams)
- Character n-gram model
  - Build character n-gram histograms for many languages (language models)
  - Compare text document histogram to language models

Language detection
Ranked character n-grams:
- English (THE): E 1, T 2, O 3, A 4, N 5, I 6, H 7, S 8, R 9, D 10, E_ 11, L 12, _T 13, TH 14, HE 15, U 16, W 17, C 18, M 19, ...
- German (DER, DEN): E 1, N 2, R 3, I 4, T 5, S 6, A 7, D 8, U 9, EN 10, G 11, ER 12, H 13, L 14, N_ 15, O 16, M 17, _D 18, C 19, ...

Language detection
Example article: "Egypt rejoices at Mubarak departure"

Online BOW
[Diagram: stream -> queue of TF vectors; DF values are updated when a document is added and when an outdated document is removed; TF-IDF is computed from TF and DF]

Where are we?
Modeling: online ML (incremental NCC, incremental k-means, incremental SVM)

Batch, incremental, offline, online
- Batch learning: consuming all training examples at once
- Incremental learning: consuming one example at a time
- Mini-batch learning: consuming several examples at a time
- Offline learning (for datasets/finite streams): all data is stored and can be accessed repeatedly
- Online learning (for infinite streams): each example is discarded after being processed
Taken from http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-2

Incremental nearest centroid classifier
- Classify / predict (green)
- Obtain actual label (red)
- Update centroids
[Figure label: outdated instance]

Incremental k-means clustering
Converges in only a few iterations (warm start)

Other incremental methods
- Incremental SVM: A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers with Online and Active Learning, Journal of Machine Learning Research, vol. 6, pp. 1579-1619
- Incremental perceptron: www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
- Incremental winnow: http://en.wikipedia.org/wiki/Winnow_%28algorithm%29

Where are we?
Application: online document space visualization, online Twitter sentiment classification
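The online BOW queue and the incremental nearest centroid classifier from the slides above can be sketched together. This is one possible reading of the diagrams, not the tutorial's code: the class names are hypothetical, the window is bounded by a document count (`max_docs`, assumed to be at least 1) instead of document age, and TF-IDF uses the assumed form TF * ln(N / DF) over the documents currently in the window.

```python
import math
from collections import Counter, defaultdict, deque


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class OnlineBow:
    """Queue of TF vectors with DF values kept in sync on add and remove."""

    def __init__(self, max_docs):
        self.max_docs = max_docs  # assumption: window size >= 1
        self.window = deque()
        self.df = Counter()

    def add(self, tokens):
        tf = Counter(tokens)
        self.window.append(tf)
        self.df.update(tf.keys())  # add: each term's DF goes up by one
        if len(self.window) > self.max_docs:
            outdated = self.window.popleft()
            for term in outdated:  # remove: each term's DF goes down by one
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]
        n = len(self.window)
        return {t: c * math.log(n / self.df[t]) for t, c in tf.items()}


class IncrementalNcc:
    """Nearest centroid classifier that adds and removes single instances."""

    def __init__(self):
        self.sums = defaultdict(lambda: defaultdict(float))  # label -> summed vector

    def add(self, vec, label):
        for term, weight in vec.items():
            self.sums[label][term] += weight

    def remove(self, vec, label):
        for term, weight in vec.items():
            self.sums[label][term] -= weight

    def classify(self, vec):
        # Cosine ignores vector length, so the class sum stands in for the centroid.
        return max(self.sums, key=lambda label: cosine(self.sums[label], vec))
```

A usage loop would classify each incoming vector, call `add` once the actual label is known, and call `remove` for instances that fall out of the window.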

Document space visualization
Several 1000 dimensions -> 2D

Document space visualization
[Diagram components: document corpus, corpus preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, layout]

Document space visualization
[Diagram: the same components, annotated with: parallelization, pipelining, warm start (three components), online BOW, maintaining sorted lists]

Document space visualization

Twitter
- Platform for sending short messages (similar to SMS)
- Est. 225 million users
- 100 million accounts added in 2010
- 65 million tweets per day

Financial tweets
- Informal $ sign convention
- Some examples (March 19):
  - User#1: $AAPL is making an announcement at 9am on what it plans to do with its 97 billion in cash. We expect a dividend announcement
  - User#2: $AAPL over 600.00 a share in the pre-market on news of a dividend.
  - User#3: Will there be any other news besides $AAPL dividend?
- We acquire ~13,000 tweets per weekday, for ~1,800 NASDAQ/NYSE stocks ($GOOG, $MSFT)
- We analyze tweets to determine whether they contain positive or negative vocabulary

Sentiment classification
- Labeled documents
- Learn to classify
- Classify unlabeled documents

Labeled dataset:
- POS: Financial markets are now officially open :)
- POS: market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
- POS: $AAPL : trust me -- AAPL will soar tomorrow
- NEG: Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
- NEG: omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
- NEG: @aekins that's just too bad. . .

[Diagram: labeled dataset -> training algorithm -> classification model; unlabeled dataset -> classification algorithm (using the classification model) -> predictions (labels)]
Example: "So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be it's last." -> NEG

Sentiment classification: SVM classifier & emoticons
[Figure: hyperplane with a neutral zone; points marked + and 0]

Tweets with positive emoticons:
- Goodnight everyoneeee :) Love yall
- I have a good feeling about today ;)
- ooo the ice cream van is here... yaaaaaay :D
- in the garden in the sun! Just about to fill the pool! happy days! :D
- Finally got JSON in #processing to work. More playing around coming :)

Tweets with negative emoticons:
- @oanhLove I hate when that happens... :-/
- No jobs, no money. how in the hell is min wage here 4 f'n clams an hour? :(
- I hate when I have to call and wake people up :(
- I don't have any chalk! :-/ MY CHALKBOARD IS USELESS
- UGHHHHHHHHHHHHHHH.. life is NOT good all the time!!!!!! ;(

Preprocessing options:
- Replace usernames with a token
- Replace URLs with a token
- Remove letter repetition
- Replace negations with a token
- Replace exclamation marks with a token
- Replace question marks with a token

[Table: the X marks showing which options were enabled in each row lost their column positions in extraction; only the number of enabled options per row is recoverable]
Options enabled   Accuracy   Precision/recall   Average accuracy (10-fold cross validation)
2                 81.06%     81.32%/81.32%      76.98%
6                 80.22%     82.08%/78.02%      77.43%
3                 79.94%     77.78%/84.62%      77.10%
3                 79.94%     76.70%/86.81%      77.53%
3                 79.67%     80.79%/78.57%      76.85%
1                 78.83%     77.60%/81.87%      77.29%
2                 78.55%     75.86%/84.62%      76.91%
0                 78.55%     77.78%/80.77%      76.93%
4                 78.27%     80.23%/75.82%      76.93%
3                 78.27%     76.53%/82.42%      77.04%
5                 77.44%     75.12%/82.97%      76.86%

Sentiment classification: SVM classifier & emoticons
- Neutral zone
- Explanations
- Accuracy

Example: "Sovereign debt and unemployment are big issues in EU."
- unemployed, issues, debt, eu
- sovereign, big
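The six preprocessing options listed with the results table above can be sketched as regular-expression substitutions. The exact rules and token names used in the tutorial are not given, so everything concrete here is an assumption: the tokens `USER`, `URL`, `NEG`, `EXCL`, and `QUEST`, the small negation word list, and the choice to shorten runs of three or more identical letters to two.

```python
import re


def normalize_tweet(text):
    """Apply all six (assumed) normalization steps and collapse whitespace."""
    # URLs first, because they can contain '?' and '@'.
    text = re.sub(r"https?://\S+", " URL ", text)
    # Usernames such as @aekins.
    text = re.sub(r"@\w+", " USER ", text)
    # Letter repetition: "goooood" -> "good" (3+ identical letters become 2).
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1\1", text)
    # Negations: a small illustrative word list plus contractions ending in n't.
    text = re.sub(r"\b(?:not|no|never|\w+n't)\b", " NEG ", text, flags=re.IGNORECASE)
    # Runs of exclamation and question marks.
    text = re.sub(r"!+", " EXCL ", text)
    text = re.sub(r"\?+", " QUEST ", text)
    return " ".join(text.split())


example = normalize_tweet("@aekins that's NOT goooood!! why?? see http://example.com/a?b=1")
```

Here `example` becomes `USER that's NEG good EXCL why QUEST see URL`. In an experiment like the one in the table, each step would be switched on or off independently.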

[Chart legend]
- Red: the number of negative tweets
- Yellow: the difference between the positive and negative tweets
- Blue: the number of positive tweets
- Green dots: relevant events concerning Netflix
- Grey: Netflix stock closing price

[Chart annotations]
- First-quarter earnings release
- Plans to launch in 43 countries in Latin America and the Caribbean
- Netflix loses TV shows and films, Netflix loses the Starz deal
- Volume peaks likely represent important events

Sentiment cross-over
Sentiment cross-over happens before price plunge

Presidential elections
http://predsedniskevolitve.si

Recap
Basics
- What is text stream mining?
- Pipelining, parallelization
- Web data acquisition
- Online BOWs
Machine learning
- Batch, incremental, offline, online
- Incremental nearest centroid classifier
- Incremental k-means
- Warm start
Applications
- Online document space visualization
- Online Twitter sentiment classifier
- Stock sentiment monitoring
- Presidential elections