Text mining tutorial

Getting Started with Text Mining: STM™, CART® and TreeNet®. Dan Steinberg, Mykhaylo Golovnya, Ilya Polosukhin. May 2011

Transcript of Text mining tutorial

  1. Getting Started with Text Mining: STM, CART and TreeNet. Dan Steinberg, Mykhaylo Golovnya, Ilya Polosukhin. May 2011
  2. Text Mining and Data Mining
Text mining is an important and fascinating area of modern analytics. On the one hand, text mining can be thought of as just another application area for powerful learning machines. On the other hand, text mining is a distinct field with its own dedicated concepts, vocabulary, tools, and techniques.
In this tutorial we aim to illustrate some important analytical methods and strategies from both perspectives on data mining: introducing tools specific to the analysis of text, and deploying general machine learning technology.
The Salford Text Mining utility (STM) is a powerful text processing system that prepares data for advanced machine learning analytics. Our machine learning tools are the Salford Systems flagship CART decision tree and the stochastic gradient boosting TreeNet.
Evaluation copies of the proprietary technology in CART and TreeNet, as well as the STM, are available from http://www.salford-systems.com
Salford Systems Copyright 2011
  3. For Readers of this Tutorial
To follow along with this tutorial we recommend that you have the analytical tools we use installed on your computer. Everything you need may already be on a CD containing this tutorial and the analytical software.
Create an empty folder named stmtutor; this is the root folder where all of the work files related to this tutorial will reside.
You may also use the following link to download the Salford Systems Predictive Modeler (SPM): http://www.salford-systems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip
After downloading the package, unzip its contents into stmtutor, which will create a new folder named SPM680_Mulitple_Installs_2011_06_07. Follow the installation steps described on the next slide.
For the original DMC2006 competition website visit http://www.data-mining-cup.de/en/review/dmc-2006/
We recommend that you visit the above site for information only; data and tools for preparing that data are available at the URL below.
For the STM package, prepared data files, and other utilities developed for this tutorial please visit http://www.salford-systems.com/dist/STM.zip
After downloading the archive, unzip its contents into stmtutor.
  4. Important! Installing the SPM Software
The Salford Systems software you've just downloaded needs to be both installed and licensed. No-cost license codes for a 30-day period are available on request to visitors of this tutorial.*
Double click on the Install_a_Transform_SPM.exe file located in the SPM680_Mulitple_Installs_2011_06_07 folder (see the previous slide) to install the specific version of SPM used in this tutorial. Following the above procedure will ensure that all of the currently installed versions of SPM, if any, will remain intact!
Follow the simple installation steps on your screen.
* Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion.
  5. Important! Licensing the SPM Software
When you launch the Salford Systems Predictive Modeler (SPM) you will be greeted with a License dialog containing the information needed to secure a license via email.
Please send the necessary information to Salford Systems to secure your license by entering the Unlock Code, which will be e-mailed back to you.
The software will operate for 3 days without any licensing; however, you can secure a 30-day license on request.
  6. Installing the Salford Text Miner (STM)
In addition to the Salford Predictive Modeler (SPM) you will also work with the Salford Text Miner (STM) software.
No installation is needed; you should already have the stm.exe executable in the stmtutor\STM\bin folder as a result of unzipping the STM.zip package earlier.
STM builds upon the Python 2.6 distribution and the NLTK (Natural Language Toolkit) but makes text data processing for analytics very easy to conduct and manage. You do not need to add any other support software to use STM.
Expect to see several folders and a large number of files located under the stmtutor\STM folder. It is important to leave these files in the location to which you have installed them. Please do not MOVE or alter any of the installed files other than those explicitly listed as user-modifiable!
stm.exe will expire in the middle of 2012; contact Salford Systems to get an updated version beyond that.
  7. The Example Project
The best examples are drawn from real world data sets, and we were fortunate to locate data publicly released by eBay.
Good teaching examples also need to be simple. Unfortunately, real world text mining could easily involve hundreds of thousands if not millions of features characterizing billions of records. Professionals need to be able to tackle such problems, but to learn we need to start with simpler situations. Fortunately, there are many applications in which text is important but the dimensions of the data set are radically smaller, either because the data available is limited or because a decision has been made to work with a reduced problem.
We use our simpler example to illustrate many useful ideas for beginning text miners while pointing the way to working on larger problems.
  8. The DMC2006 Text Mining Challenge
In 2006 the DMC data mining competition (restricted to student competitors only) introduced a predictive modeling problem for which much of the predictive information was in the form of unstructured text.
The datasets for the DMC 2006 data mining competition can be downloaded from http://www.data-mining-cup.de/en/review/dmc-2006/ For your convenience we have re-packaged this data and made it somewhat easier to work with. This re-packaged data is included in the STM package described near the beginning of this tutorial.
The data summarizes 16,000 iPod auctions held at eBay in Germany from May 2005 through May 2006.
Each auction item is represented by a text description written by the seller (in German) as well as a number of flags and features available to the seller at the time of the auction.
Auction items were grouped into 15 mutually exclusive categories based on distinct iPod features: storage size, type (regular, mini, nano), and color.
The competition goal was to predict whether the closing price would be above or below the category average.
  9. Comments on the Challenge
One might think that a challenge with text in German might not be of general interest outside of Germany.
However, working with a language essentially unfamiliar to any member of the analysis team helps to illustrate one important point: text mining via tools that have no understanding of the language can be strikingly effective.
We have no doubt that dedicated tools which embed knowledge of the language being analyzed can yield predictive benefits. We also believe we could have gained further valuable insight into the data if any of the authors spoke German! But our performance without this knowledge is still impressive.
In contexts where simple methods can yield more than satisfactory results, or in contexts where the same methods must be applied uniformly across multiple languages, the methods described in this tutorial will be an excellent guide.
  10. Configuring the Work Location in SPM
The original datasets from the DMC 2006 challenge reside in the stmtutor\STM\dmc2006 folder.
To facilitate further modeling steps, we will configure SPM to use this location as the default:
- Start SPM
- Go to the Edit > Options menu
- Switch to the Directories tab
- Enter the stmtutor\STM\dmc2006 folder location in all text entry boxes except the last one
- Press the [Save as Defaults] button so that the configuration is restored the next time you start SPM
  11. Configuring the TreeNet Engine
Now switch to the TreeNet tab:
- Configure the Plot Creation section as shown on the screen shot
- Press the [Save as Defaults] button
- Press the [OK] button to exit
  12. Steps in the Analysis: Data Overview
1. Describe the data (data dictionary and dimensions of the data)
   a. What is the unit of observation? Each record of data is describing what?
   b. What is the dependent or target variable?
   c. What other variables (database fields) are available?
   d. How many records are available?
2. Statistical summary
   a. Basic summary including means, quantiles, frequency tables
   b. Dimensions of categorical predictors
   c. Number of distinct values of continuous variables
3. Outlier and anomaly assessment
   a. Detection of gross data errors such as extreme values
   b. Assessment of usability of levels of categorical predictors (rare levels)
  13. Data Fundamentals
The original dataset is called dmc2006.csv and resides in the stmtutor\STM\dmc2006 folder.
16,000 records divided into two equal sized partitions:
- Part 1: Complete data including the target, available for training during the competition
- Part 2: Data to be scored; during the competition the target was not available
25 database fields, two of which are unstructured text written by the seller.
Each line of data describes an auction of an iPod, including the final winning bid price.
An eBay seller must construct a headline and a description of the product being sold. Sellers can also pay for selling assistance, e.g. a seller can pay to list the item title in BOLD.
  14. The Data: Available Fields
The following variables describe general features of each auction event:

Variable                    Description
AUCT_ID                     ID number of the auction
ITEM_LEAF_CATEGORY_NAME     product's category
LISTING_START_DATE          start date of the auction
LISTING_END_DATE            end date of the auction
LISTING_DURTN_DAYS          duration of the auction
LISTING_TYPE_CODE           type of auction (normal auction, multi auction, etc.)
QTY_AVAILABLE_PER_LISTING   number of offered items for a multi auction
FEEDBACK_SCORE_AT_LISTIN    feedback rating of the seller of this auction listing
START_PRICE                 start price in EUR
BUY_IT_NOW_PRICE            buy it now price in EUR
BUY_IT_NOW_LISTING_FLAG     option for buy it now on this auction listing
  15. Available Data Fields
In addition, there are binary indicators of various value added features that can be turned on for each auction:

Variable                     Description
BOLD_FEE_FLAG                option for bold font on this auction listing
FEATUERD_FEE_FLAG            show this auction listing on top of the homepage
CATEGORY_FEATURED_FEE_FLAG   show this auction listing on top of the category
GALLERY_FEE_FLAG             auction listing with picture gallery
GALLERY_FEATURED_FEE_FLAG    auction listing with gallery (in gallery view)
IPIX_FEATURED_FEE_FLAG       auction listing with IPIX (additional xxl, picture show, pack)
RESERVE_FEE_FLAG             auction listing with reserve price
HIGHLIGHT_FEE_FLAG           auction listing with background color
SCHEDULE_FEE_FLAG            auction listing including the definition of the starting time
BORDER_FEE_FLAG              auction listing with frame
  16. Target Variable
Finally, the target variable is defined based on the winning bid revenue relative to the category average:

Variable           Description
GMS                scored sales revenue in EUR
CATEGORY_AVG_GMS   average sales revenue for the product category
GMS_GREATER_AVG    zero when the revenue is less than or equal to the category average sales and one otherwise

The values were only disclosed on a randomly selected set of 8,000 auctions, which we use to train a model:
- 4,199 auctions with revenue below the category average
- 3,801 auctions with revenue above the category average
During the competition the results for the remaining 8,000 auctions were kept secret and used to score competitive entries. We will only use these records at the very end of this tutorial to validate the performance of the various models that will be built.
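The target definition above is simple enough to sketch in a few lines of Python. This is an illustration only: the column names follow the tutorial's data dictionary, but the numeric values below are invented, not drawn from the DMC2006 data.

```python
# Sketch of the GMS_GREATER_AVG target rule: 1 when revenue exceeds the
# category average, 0 when it is less than or equal (per the definition above).
def gms_greater_avg(gms, category_avg_gms):
    return 1 if gms > category_avg_gms else 0

# Invented example records for illustration
auctions = [
    {"GMS": 120.0, "CATEGORY_AVG_GMS": 150.0},  # below average
    {"GMS": 180.0, "CATEGORY_AVG_GMS": 150.0},  # above average
    {"GMS": 150.0, "CATEGORY_AVG_GMS": 150.0},  # exactly equal -> 0
]
targets = [gms_greater_avg(a["GMS"], a["CATEGORY_AVG_GMS"]) for a in auctions]
print(targets)  # -> [0, 1, 0]
```

Note that a tie (revenue exactly equal to the category average) is coded 0, matching the "less than or equal" wording of the definition.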
  17. Comments on Methodology
Predictive modeling and general analytics competitions are increasingly being launched both by private companies and by professional organizations, and provide both public data sets and a wealth of illustrative examples using different analytic techniques.
When reviewing results from a competition, and especially when comparing results generated by analysts running models after the competition, it is important to keep in mind that there is an ocean of difference between being a competitor during the actual competition and an after-the-fact commentator.
Regardless of what is reported, the after-the-fact analyst does have access to what really happened, and it is nearly impossible to simulate the competitive environment once the results have been published. We all learn in both direct and indirect ways from many sources, including the outcomes of public competitions. This can affect anything that comes later in time.
In spite of this, we have tried to mimic the circumstances of the competitors by presenting analyses based only on the original training data, and using well-established guidelines we have been promoting for more than a decade to arrive at a final model.
We urge you to never take at face value an analyst's report on what would have happened had they hypothetically participated.
  18. First Round Modeling: Ignoring the TEXT Data
Even before doing any type of data preparation it is always valuable to run a few preliminary CART models:
- CART automatically handles missing values and is immune to outliers
- CART is flexible enough to adapt to any type of nonlinearity and interaction effects among predictors. The analyst does not need to do any data preparation to assist CART in this regard
- CART performs well enough out of the box that we are guaranteed to learn something of value without conducting any of the common data preparation operations
The only requirement for useful results is that we exclude any possible perfect or near perfect illegitimate predictors. Common examples of illegitimate predictors include repackaged versions of the dependent variable, ID variables, and data drawn from the future relative to the data to be predicted.
We start with a quick model using 20 of the 25 available predictors. None of these involve any of the text data we will focus on later.
  19. Quick Modeling Round with CART
We start by building a quick CART model using the original raw variables and all 8,000 complete auction records.
Assuming that you already have SPM launched:
- Go to the File > Open > Data File menu (note that we have already configured the default working folder for SPM)
- Make sure that Files of Type is set to ASCII
- Highlight the dmc2006.csv dataset
- Press the [Open] button
  20. Dataset Summary Window
The resulting window summarizes basic facts about the dataset.
Note that even though the dataset has 16,000 records, only the top 8,000 will be used for modeling, as was already pointed out.
  21. The View Data Window
Press the [View Data] button to get a quick impression of the physical contents of the dataset.
Our goal is to eventually use the unstructured information contained in the text fields right next to the auction ID.
  22. Requesting Basic Descriptive Stats
We next produce some basic stats for all available variables:
- Go to the View > Data Info menu
- Set the Sort mode to File Order
- Highlight the Include column
- Check the Select box
- Press the [OK] button
  23. Data Information Window
All basic descriptive statistics for all requested variables are now summarized in one place.
Note that the target variable GMS_GREATER_AVG is not defined for one half of the dataset (N Missing = 8,000); all those records will be automatically discarded during model building.
Press the [Full] button to see more details.
  24. Setting Up the CART Model
We are now ready to set up a basic CART run:
- Make the Classic Output window active
- Go to the Model > Construct Model menu (alternatively, you could press one of the buttons located on the bar right below the menu bar)
- In the resulting Model Setup window make sure that the Analysis Method is set to CART
- In the Model tab make sure that Sort is set to File Order and the Tree Type is set to Classification
- Check GMS_GREATER_AVG as the Target
- Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
You should see something similar to what is shown on the next slide.
  25. Model Setup Window: Model Tab
  26. Model Setup Window: Testing Tab
Switch to the Testing tab and confirm that 10-fold cross-validation is used as the optimal model selection method.
  27. Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes to 15 and 5, respectively.
These limits were chosen to avoid extremely small nodes in the resulting tree.
  28. Building the CART Model
Press the [Start] button; a progress window will appear for a while and then the Navigator window containing the model results will be displayed.
Press the little button right above the [+][-] pair of buttons along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in green.
Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree.
  29. CART Model Observations
The selected CART model contains 64 terminal nodes and is the smallest model with relative error still within one standard error of the optimal model (the model with the smallest relative error, indicated by the green bar). This approach to model selection is usually employed for easy comprehension.
We might also want to require terminal nodes to contain more than the 6-record minimum we observe in this out-of-the-box tree.
All 20 predictor variables play a role in the tree construction, but there is more to observe about this when we look at the variable importance details.
The area under the ROC curve is a respectable 0.748.
  30. CART Model Performance
Press the [Summary Reports] button in the Navigator, select the Prediction Success tab, and press the [Test] button to display the cross-validated test performance of 68.66% classification accuracy.
Now select the Variable Importance tab to review which variables entered into the model.
Interestingly enough, none of the added value paid options are important; they exhibit practically no direct influence on the sales revenue.
A detailed look at the nodes might also be instructive for understanding the model.
  31. Experimenting with TreeNet
We almost always follow initial CART models with similar TreeNet models.
We start with CART because some glaring errors, such as perfect predictors, are more quickly found and obviously displayed in CART. A perfect predictor often yields a single split tree (two terminal nodes) for classification trees.
TreeNet models have strengths similar to CART regarding flexibility and robustness, and have advantages and disadvantages relative to CART:
- TreeNet is an ensemble of small CART trees that have been linked together in special ways. Thus TreeNet shares many desirable features of CART
- TreeNet is superior to CART in the context of errors in the dependent variable (not relevant in this data)
- TreeNet yields much more complex models but generally offers substantially better predictive accuracy. TreeNet may easily generate thousands of trees to arrive at an optimal model
- TreeNet yields more reliable variable importance rankings
  32. A Few Words About TreeNet
TreeNet builds predictive models in stages. It starts with a deliberately very small first round tree (essentially a CART tree).
Then TreeNet calculates the prediction error made by this simple model and builds a second tree to model that prediction error. The second tree serves as a tool to update, refine, and improve the first stage model.
A TreeNet model produces a score which is a simple sum of all the predictions made by each tree in the model.
Typically the TreeNet score becomes progressively more accurate as the number of trees is increased, up to an optimal number of trees.
Rarely is the optimal number of trees just one! Occasionally a handful of trees is optimal. More typically, hundreds or thousands of trees are optimal.
TreeNet models are very useful for the analysis of data with large numbers of predictors, as the models are built up in layers, each of which makes use of just a few predictors.
More detail on TreeNet can be found at http://www.salford-systems.com
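The staged, error-correcting idea described above can be sketched in plain Python. This is a toy illustration of gradient boosting with single-split "stump" trees fit to residuals, not Salford's actual TreeNet implementation; the data, learn rate, and helper names are all invented for the sketch.

```python
# Toy gradient boosting sketch: each stage fits a stump to the current
# residuals, and the final score is the shrunken sum of all stage predictions.
def fit_stump(xs, residuals):
    """Find the single split on x that minimizes squared error of residuals."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x, s=split, l=lmean, r=rmean: l if x <= s else r

def boost(xs, ys, n_trees=50, learn_rate=0.1):
    """Build trees in stages; each new stump models the remaining error."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    # The model score is the sum of all (shrunken) stage predictions
    return lambda x: sum(learn_rate * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]  # step-like pattern, invented
model = boost(xs, ys)
print(round(model(1.5), 2), round(model(5.5), 2))
```

With enough stages the summed score converges toward the group means of the data (about 1.0 on the left, 3.0 on the right), mirroring the tutorial's point that accuracy improves as trees accumulate up to an optimal number.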
  33. Setting Up the TN Model
Switch to the Classic Output window and go to the Model > Construct Model menu.
Choose TreeNet as the Analysis Method.
In the Model tab make sure that the Tree Type is set to Logistic Binary.
  34. Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
- Set the Learnrate to 0.05
- Set the Number of trees to use: to 800
- Leave all of the remaining options at their default values
  35. TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear in the end.
  36. Checking TN Performance
Press the [Summary] button and switch to the Prediction Success tab.
Press the [Test] button to view the cross-validation results.
Lower the Threshold: to 0.45 to roughly equalize the classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART performance).
  37. The Performance Has Improved!
The overall classification accuracy goes up to about 71%.
Press the [ROC] button to see that the area under the ROC curve is now a solid 0.800.
This comes at the cost of added model complexity: 796 trees, each with about 6 terminal nodes.
Variable importance remains similar to CART.
  38. Understanding the TreeNet Model
TreeNet produces partial dependency plots for every predictor that appears in the model; the plots can be viewed by pressing the [Display Plots] button.
Such plots are generally 2D illustrations of how the predictor in question affects an outcome. For example, in the graph below the Y axis represents the probability that an iPod will sell at an above category average price.
We see that for a BUY_IT_NOW price between 200 and 300 the probability of an above average winning bid rises sharply with the BUY_IT_NOW_PRICE.
For prices above 300 or below 200 the curve is essentially flat, meaning that changes in the predictor do not result in changes in the probable outcome.
  39. Understanding the Partial Dependency Plot (PD Plot)
The PD Plot is not a simple description of the data. If you plotted the raw data as, say, the fraction of above average winning bids against price intervals, you might see a somewhat different curve.
The PD Plot is extracted from the TreeNet model and is generated by examining TreeNet predictions (not input data).
The PD Plot appears to relate two variables, but in fact other variables may well play a role in the graph construction.
Essentially the PD Plot shows the relationship between a predictor and the target variable taking all other predictors into account.
The important points to understand are that:
- the graph is extracted from the model and not directly from raw data
- the graph provides an honest estimate of the typical effect of a predictor
- the graph displays not absolute outcomes but typical expected changes from some baseline as the predictor varies. The graph can be thought of as floating up or down depending on the values of other predictors
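The idea of "taking all other predictors into account" can be made concrete with the standard partial-dependence recipe: fix the predictor of interest at a value, substitute that value into every training record, and average the model's predictions. The sketch below uses this generic recipe with an invented toy model and field names taken from the tutorial's data dictionary; it is not Salford's internal PD computation.

```python
# Generic partial-dependence sketch: average model predictions over the
# data with one predictor pinned to a chosen value.
def partial_dependence(model, records, predictor, value):
    total = 0.0
    for rec in records:
        modified = dict(rec)
        modified[predictor] = value  # pin the predictor of interest
        total += model(modified)
    return total / len(records)

# Invented toy model echoing the plot described above: probability rises
# between 200 and 300 and is flat outside that band, plus a small BOLD shift.
def toy_model(rec):
    p = rec["BUY_IT_NOW_PRICE"]
    base = 0.3 if p < 200 else (0.7 if p > 300 else 0.3 + 0.4 * (p - 200) / 100)
    return base + (0.05 if rec["BOLD_FEE_FLAG"] else 0.0)

records = [{"BUY_IT_NOW_PRICE": 250, "BOLD_FEE_FLAG": 1},
           {"BUY_IT_NOW_PRICE": 400, "BOLD_FEE_FLAG": 0}]
pd_at_150 = partial_dependence(toy_model, records, "BUY_IT_NOW_PRICE", 150)
print(pd_at_150)  # -> 0.325 (average of 0.35 and 0.30)
```

Notice how the other predictor (BOLD_FEE_FLAG) still influences the averaged value, which is why the PD curve can be thought of as floating up or down with the values of the remaining predictors.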
  40. More TN Partial Dependency Plots
  41. Introducing the Text Mining Dimension
To this point we have been working only with the set of traditional structured data fields: continuous and categorical variables.
Further substantial performance improvement can be achieved only if we utilize the text descriptions supplied by the seller in the following fields:

Variable           Description
LISTING_TITLE      title of auction
LISTING_SUBTITLE   subtitle of auction

Unfortunately, these two variables cannot be used as is. Sellers were free to enter free-form text including misspellings, acronyms, slang, etc. So we must address the challenge of converting unstructured text strings of the type shown here into a well structured representation.
  42. The Bag of Words Approach to Text Mining
The most straightforward strategy for dealing with free-form text is to represent each word that appears in the complete data set as a dummy (0/1) indicator variable.
For iPods on eBay we could imagine sellers wanting to use words like "new", "slightly scratched", "pink", etc. to describe their iPod. Of course the descriptions may well be complete phrases like "autographed by Angela Merkel" rather than just single term adjectives.
Nevertheless, in the simplest Bag of Words (BOW) approach we just create dummy indicators for every word.
Even though the headlines and descriptions are space limited, the number of distinct words that can appear in collections of free text can be huge. In text mining applications involving complete documents, e.g. newspaper articles, the number of distinct words can easily reach several hundred thousand or even millions.
  43. The End Goal of the Bag of Words

Record_ID   RED   USED   SCRATCHED   CASE
1001        0     1      0           1
1002        0     0      0           0
1003        1     0      0           0
1004        0     0      0           0
1005        1     1      1           0
1006        0     0      0           0

Above we see an example of a database intended to describe each auction item by indicating which words appeared in the auction announcement.
Observe that Record_ID 1005 contains the three words RED, USED and SCRATCHED.
Data in the above format looks just like the kind of numeric data used in traditional data mining and statistical modeling.
We can use data in this form, as is, feeding it into CART, TreeNet, or regression tools such as Generalized Path Seeker (GPS) or everyday regression.
Observe that we have transformed the unstructured text into structured numerical data.
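The transformation from free text to a matrix like the one above can be sketched in a few lines of Python. The vocabulary and example texts are invented for illustration; real text would first pass through the cleaning steps described later in the tutorial.

```python
# Sketch: build 0/1 bag-of-words indicators for a fixed vocabulary.
def bag_of_words(texts, vocabulary):
    rows = []
    for text in texts:
        words = set(text.upper().split())  # crude tokenization for the sketch
        rows.append([1 if term in words else 0 for term in vocabulary])
    return rows

vocabulary = ["RED", "USED", "SCRATCHED", "CASE"]
texts = ["new ipod with case", "red used scratched ipod"]
print(bag_of_words(texts, vocabulary))
# -> [[0, 0, 0, 1], [1, 1, 1, 0]]
```

Each row of the result is structured numerical data of exactly the kind shown in the table above, ready to feed into CART, TreeNet, or a regression tool.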
  44. Coding the Term Vector and TF Weighting
In the sample data matrix on the previous slide we coded all of our indicators as 0 or 1 to indicate presence or absence of a term.
An alternative coding scheme is based on the FREQUENCY COUNT of the terms, with these variations:
- 0 or 1 coding for presence/absence
- Actual term count (0, 1, 2, 3, ...)
- Three level indicator for absent, one occurrence, and more than one (0, 1, 2)
The text mining literature has established some useful weighted coding schemes. We start with term frequency weighting (tf):
- Text mining can involve blocks of text of considerably different lengths
- It is thus desirable to normalize counts based on relative frequency. Two text fields might each contain the term RED twice, but one of the fields contains 10 words while the other contains 40 words. We might want our coding to reflect the fact that 2/10 is more frequent than 2/40.
- This is nothing more than making counts relative to the total length of the unit of text (or document), and such coding yields the term frequency weighting
  45. Inverse Document Frequency (IDF) Weighting
IDF weighting is drawn from the information retrieval literature and is intended to reflect the value of a term in narrowing the search for a specific document within a larger corpus of documents.
If a given term occurs very rarely in a collection of documents then that term is very valuable as a tag to target those documents accurately.
By contrast, if a term is very common, then knowing that such a term occurs within the document you are looking for is not helpful in narrowing the search.
While text mining has somewhat different goals than information retrieval, the concept of IDF weighting has caught on. IDF weighting serves to upweight terms that occur relatively rarely:
IDF(term) = log { (Number of documents) / (Number of documents containing term) }
The IDF increases with the rarity of a term and is maximal for words that occur in only one document.
A common coding of the term vector uses the product: tf * idf
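The IDF formula above translates directly into code. The tiny corpus below is invented for illustration; whitespace splitting again stands in for real tokenization.

```python
# Sketch of the IDF formula: log of (number of documents / number of
# documents containing the term). Common terms get IDF near 0; rare terms
# get large IDF values.
import math

def idf(term, documents):
    df = sum(1 for d in documents if term.upper() in d.upper().split())
    return math.log(len(documents) / df)

docs = ["red ipod", "ipod nano", "ipod mini case", "used red ipod"]
print(round(idf("red", docs), 3))  # in 2 of 4 docs -> log(2) ~ 0.693
print(idf("ipod", docs))           # in every doc -> 0.0, useless for search
```

A tf*idf code for a term in a document is then simply `term_frequency * idf`, combining within-document frequency with across-corpus rarity.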
  46. Coding the DMC2006 Text Data
The DMC2006 text data is unusual principally because of the limit on the amount of text a seller was allowed to upload.
This has the effect of making the lengths of all the documents very similar. It also sharply limits the possibility that a term in a document would occur with a high frequency.
These factors contribute to making the TF-IDF weighting irrelevant to this challenge. In fact, for this prediction task other coding schemes allow more accurate prediction.
STM offers these options for term vector coding:
- 0: no/yes
- 1: no/yes/many (this one will be used in the remainder of this tutorial)
- 2: 0/1
- 3: 0/1/2
- 4: term frequency (relative to document)
- 5: inverse document frequency (relative to corpus)
- 6: TF-IDF (traditional IR coding)
  47. Text Mining Data Preparation
The heavy lifting in text mining technology is devoted to moving us from raw unstructured text to structured numerical data.
Once we have structured data we are free to use any of a large number of traditional data mining and statistical tools to move forward. Typical analytical tools include logistic and multiple regression, predictive modeling, and clustering tools.
But before diving into the analysis stage we need to move through the text transformation stage in detail.
The first step is to extract and identify the words or terms, which can be thought of as creating the list of all words recognized in the training data set.
This stage is essentially one of defining the dictionary, the list of officially recognized terms. Any new term encountered in the future will be unrecognizable by the dictionary and will represent an unknown item.
It is therefore very important to ensure that the training data set contains almost all terms of interest that would be relevant for future prediction.
  48. Automatic Dictionary Building
The following steps will build an active dictionary for a collection of documents (in our case, auction item description strings):
1. Read all text values into one character string
2. Tokenize this string into an array of words (tokens)
3. Remove words without any letters or digits
4. Remove stop words (words like "the", "a", "in", "und", "mit", etc.) for both the English and German languages
5. Remove words that have fewer than 2 letters and are encountered less than 10 times across the entire collection of documents (rare small words). At this point the too-common, too-rare, weird, obscure, and useless combinations of characters should have been eliminated
6. Lemmatize words using the WordNet lexical database. This step combines words present in different grammatical forms (go, went, going, etc.) into the corresponding stem word (go)
7. Remove all resulting words that appear less than MIN times (5 in the remainder of this tutorial)
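The pipeline above can be sketched in plain Python. STM itself uses NLTK's tokenizer, stop-word lists, and WordNet lemmatizer; in this sketch tiny hand-made stand-ins (an invented stop-word set and lemma table) replace those components so the flow of the steps is visible, and the short/rare-word pruning of step 5 is folded into the final frequency cutoff.

```python
# Plain-Python sketch of the dictionary-building steps: tokenize, drop
# non-words and stop words, lemmatize, then drop terms below a minimum count.
from collections import Counter

STOP_WORDS = {"the", "a", "in", "und", "mit"}            # tiny stand-in list
LEMMAS = {"went": "go", "going": "go", "ipods": "ipod"}  # stand-in lemmatizer

def build_dictionary(documents, min_count=2):
    tokens = []
    for doc in documents:
        for tok in doc.lower().split():                  # step 2: tokenize
            if not any(c.isalnum() for c in tok):        # step 3: drop non-words
                continue
            if tok in STOP_WORDS:                        # step 4: drop stop words
                continue
            tokens.append(LEMMAS.get(tok, tok))          # step 6: lemmatize
    counts = Counter(tokens)
    # step 7: keep only terms meeting the minimum count
    return sorted(t for t, n in counts.items() if n >= min_count)

docs = ["the ipod going fast", "ipods go in a case", "mit case und ipod"]
print(build_dictionary(docs))  # -> ['case', 'go', 'ipod']
```

Note how lemmatization merges "going"/"go" and "ipods"/"ipod" before counting, so the frequency cutoff operates on stem words rather than surface forms.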
  49. Build the Dictionary (or Term Vector)
For the purpose of automatic dictionary building and preprocessing data we developed the Salford Text Mining (STM) software, a stand-alone collection of tools that performs all the essential steps in preparing text documents for text mining.
STM builds on the Python Natural Language Toolkit (NLTK). From NLTK we use the following tools:
- Tokenizer (extracts items most likely to be words)
- Porter Stemmer (recognizes different simple forms of the same word, e.g. plurals)
- WordNet lemmatizer (more complex recognition of same word variations)
- stop word list (words that contribute little to no value, such as "the", "a")
Future versions of STM might use other tools to accomplish these essential tasks.
stm.exe is a command line utility that must be run from a Command Prompt window (assuming you are running Windows, go to the Start > All Programs > Accessories > Command Prompt menu).
The version provided here resides in the stmtutor\STM\bin folder.
  50. 50. STM Commands and OptionsOpen a Command Prompt window in Windows, then cd to the stmtutor\STM folder location; for example, on our system you would type in cd c:\stmtutor\STMTo obtain help type the following at the prompt: bin\stm --helpThis command will return very concise information about STM: stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE] [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE] etc.The details for each command line option are contained in the software manual appearing in the appendixYou will also notice the stm.cfg configuration file - this file controls the default behavior of the STM module and relieves you of specifying a large number of configuration options each time stm.exe is launched Note the TEXT_VARIABLES : ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE line which specifies the names of the text variables to be processed
  51. 51. Create Dictionary OptionsFor the purposes of this tutorial, we have prepackaged all of the text processing steps into individual command files (extension *.bat). You can either double-click on the referenced command file or alternatively type its contents into the Command Prompt window opened in the directory that contains the filesThe most important arguments for our purposes now are:
 - --dataset DATAFILE - name and location of your input CSV format data set
 - --dictionary DICTFILE - name and location of the dictionary to be created
These two arguments are all you need to create your dictionary. By default, STM will process every text field in your input data set to create a single omnibus dictionarySimply double-click on stm_create_dictionary.bat to create the dictionary file for the DMC 2006 dataset, which will be saved in the dmc2006_ynm.dict file in the stmtutor\STM\dmc2006 folderIn typical text mining practice the process of generating the final dictionary will be iterative. A review of the first dictionary might reveal further words you wish to exclude (stop words)
  52. 52. Internal Dictionary Format
 - The dictionary file is a simple text file with extension *.dict
 - The file contents can be viewed and edited in a standard text editor
 - The name of the text mining variable that will be created later on appears on the left of the = sign on each un-indented line
 - The default value that will be assigned to this variable appears on the right side of the = sign of the un-indented lines, and it usually means the absence of the word(s) of interest
 - Each indented line represents the value (left of the =) which will be entered for a single occurrence in a document for any of the word(s) appearing on the right of the =
 - More than one occurrence will be recorded as many when requested (always the case in this tutorial)
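The layout just described (un-indented line: variable name = default value; indented lines: coded value = trigger words) can be made concrete with a small parser. This is a hypothetical helper for illustration only, not part of STM:

```python
# Sketch of a reader for the *.dict layout described above (assumed format:
# un-indented "name=default"; indented "value=word1,word2,...").
def parse_dict(text):
    entries, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] not in " \t":
            # un-indented line starts a new term: variable name and default
            name, default = line.strip().split("=", 1)
            current = entries[name] = {"default": default, "values": {}}
        else:
            # indented line: coded value and the word(s) that trigger it
            value, words = line.strip().split("=", 1)
            current["values"][value] = words.split(",")
    return entries
```

Applied to the hand_unused entry shown on the next slide, parse_dict returns a default of "no" and maps "yes" to the two German trigger words.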
  53. 53. Hand Made DictionaryTo use a multi-level coding you need to create a hand-made dictionary, which is already supplied to you as hand.dict in the stmtutor\STM\dmc2006 folderHere is an example of an entry in this file:
hand_model=standard
    mini
    nano
    standard
The un-indented line of an entry starts with the name we wish to give to the term (HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with the default value of standardThe remaining indented entries are listed one per line and are an exhaustive list of the acceptable values which the term HAND_MODEL can receive in the term vectorAnother coding option is, for example:
hand_unused=no
    yes=unbenutzt,ungeoffnet
which sets no as the default value but substitutes yes if one of the two values listed above is encounteredYou may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your own; all of them were created manually based on common sense logic
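The effect of a hand-made entry can be sketched as a small lookup function. The function name and the value-map structure are hypothetical, chosen only to mirror the hand_unused example above:

```python
# Sketch (assumed semantics): return the coded value for one hand-dict
# entry given its default, its value -> trigger-words map, and a document.
def apply_entry(default, value_map, text):
    words = text.lower().split()
    for value, triggers in value_map.items():
        if any(t in words for t in triggers):
            return value
    return default

# Mirrors the hand_unused entry: default "no", "yes" when either
# German word for unused/unopened appears in the description.
HAND_UNUSED = ("no", {"yes": ["unbenutzt", "ungeoffnet"]})
```

So an auction description containing "unbenutzt" is coded yes, while any other description falls back to the default no.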
  54. 54. Why Create Hand Made Dictionary EntriesLet's revisit the variable HAND_MODEL which brings together the terms standard, mini, nanoWithout a hand-made dictionary entry we would have three terms created, one for each model type, with yes and no values, and possibly manyBy creating the hand-made entry we:
 - Ensure that every auction is assigned a model (default=standard)
 - Bring all three models together into one categorical variable with three possible values: standard, mini, and nano
This representation of the information is helpful when using tree-based learning machines but not helpful for regression-based learning machines
 - The best choice of representation may vary from project to project
 - Salford regression-based learning machines automatically repackage categorical predictors into 0/1 indicators, meaning that you can work with one representation
 - But if you need to use other tools you may not have this flexibility
  55. 55. Further Dictionary CustomizationThe following table summarizes some of the important fields introduced in the custom dictionary for this tutorial

Variable         Values                     Combines word variants
CAPACITY         20                         20gb, 20 gb, 20 gigabyte
                 30                         30gb, 30 gb, 30 gigabyte
                 40                         40gb, 40 gb, 40 gigabyte
                 80                         80gb, 80 gb, 80 gigabyte
STATUS           wieneu                     wie neu, super gepflegt, top gepflegt, top zustand, neuwertig
                 neu                        neu, new, brandneu, brandneues
                 unbenutzt                  unbenutzt
                 defekt                     defekt., --defekt--, defekt, -defekt-, -defekt, defekter, defektes
MODEL            mini, nano, standard       Captures presence of the corresponding word in the auction description
COLOR            black, white, green, etc.  Captures presence of the corresponding words or variants in the auction description
IPOD_GENERATION  first, second, etc.        iPod generation identified from the information available in the text description
  56. 56. Final Stage Dictionary ExtractionTo generate a final version of the dictionary in most real world applications you would also need to prepare an expanded list of stopwordsThe NLTK provides a ready-made list of stopwords for English and another 14 major languages spanning Europe, Russia, Turkey, and Scandinavia
 - These appear in the directory named stmtutor\STM\data\corpora\stopwords and should be left as they are
Additional stopwords, which might well vary from project to project, can be entered into the file named stopwords.dat in the stmtutor\STM\data folder
 - In the package distributed with this tutorial the stopwords.dat file is empty
 - You can freely add words to this file, with one stopword per line
Once the custom stopwords.dat and hand.dict files have been prepared you just run the dictionary extraction again but with the --source-dictionary argument added (see the command files introduced in the later slides)The resulting dictionary will now include all the introduced customizations
  57. 57. Creating Structured Text Mining VariablesThe resulting dictionary file dmc2006_ynm.dict contains about 600 individual stemsIn the final step of text processing the data dictionary is applied to each document entryEach stem from the dictionary is represented by a categorical variable (usually binary) with the corresponding nameThe preparation process checks whether any of the known word variants associated with each stem from the dictionary are present in the current auction description; if yes, the corresponding value is set to yes, otherwise it is set to no
 - When the --code YNM option is set, multiple instances of yes will be coded as many
 - You can also request integer codes 0, 1, 2 in place of the character yes/no/many
 - We have experimented with alternative variants of coding (see the --code help entry in the STM manual) and came to the conclusion that the YNM approach works best in this tutorial
 - Feel free to experiment with alternative coding schemas on your own
The resulting large collection of variables will be used as additional predictors in our modeling effortsEven though other, more computationally intense text processing methods exist, further investigation failed to demonstrate their utility on the current data, which is most likely related to the extremely terse nature of the auction descriptions
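The YNM coding just described can be sketched in a few lines. This is an illustration of the assumed semantics of the --code YNM option, not STM's implementation:

```python
# Sketch: code one stem for one document as no / yes / many depending on
# how many of the stem's known word variants occur in the description.
def code_ynm(variants, text):
    words = text.lower().split()
    n = sum(words.count(v) for v in variants)
    if n == 0:
        return "no"
    return "yes" if n == 1 else "many"
```

A single occurrence of a variant yields yes, repeated occurrences yield many, and absence yields no; the integer scheme 0/1/2 would simply replace those labels.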
  58. 58. Creating Additional VariablesFinally, we spent additional effort on reorganizing the original raw variables into more useful measures:
 - MONTH_OF_START - based on the recorded start date of the auction
 - MONTH_OF_SALE - based on the recorded closing date of the auction
 - HIGH_BUY_IT_NOW - set to yes if BUY_IT_NOW_PRICE exceeds the CATEGORY_AVG_GMS, as suggested by common sense and the nature of the classification problem
 - In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that option was not available - we reset all such 0s to missing
All of these operations are encoded in the preprocess.py Python file located in the stmtutor\STM\dmc2006 folder
 - This component of the STM is under active development
 - The file is automatically called by the main STM utility
 - You may add/modify the contents of this file to allow alternative transformations of the original predictors
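A minimal sketch of these derivations is shown below. The derived names come from the slide; the raw field names START_DATE and CLOSE_DATE are assumptions for illustration (the actual logic lives in preprocess.py):

```python
import math

# Sketch of the derived fields described above; START_DATE and CLOSE_DATE
# are hypothetical raw column names holding date objects.
def derive(row):
    out = dict(row)
    out["MONTH_OF_START"] = row["START_DATE"].month
    out["MONTH_OF_SALE"] = row["CLOSE_DATE"].month
    # BUY_IT_NOW_PRICE == 0 meant "option not available": recode as missing
    price = row["BUY_IT_NOW_PRICE"] or math.nan
    out["BUY_IT_NOW_PRICE"] = price
    # yes only when a real (non-missing) price exceeds the category average
    out["HIGH_BUY_IT_NOW"] = (
        "yes" if not math.isnan(price) and price > row["CATEGORY_AVG_GMS"]
        else "no"
    )
    return out
```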
  59. 59. Generation of the Analysis Data SetAt this point we are ready to move on to the next step, which is data creationThis is nothing more than appending the relevant columns of data to the original data set. Remember that the dictionary may contain tens of thousands if not hundreds of thousands of termsFor the DMC2006 dataset the dictionary is quite small by text mining standards, containing just a little over 600 wordsTo generate the processed dataset simply double-click on the stm_ynm.bat command file or explicitly type in its contents in the Command Prompt
 - The --dataset option specifies the input dataset to be processed
 - The --code YNM option requests yes/no/many style of coding
 - The --source-dictionary option specifies the hand dictionary
 - The --process option specifies the output dataset
 - Of course you may add other options as you prefer
This creates a processed dataset with the name dmc2006_res_ynm.csv which resides in the stmtutor\STM\dmc2006 folder
  60. 60. Analysis Data Set ObservationsAt this point we have a new modeling dataset with the text information represented by the extra variables
 - Note that the raw input data set is just shy of 3 MB in size in plain text format while the prepared analysis data set is about 40 MB in size, 13 times larger
Process only training data or all data?
 - For prediction purposes all data needs to be processed, both the data that will be used to train the predictive models and the holdout or future data that will receive predictions later
 - In the DMC2006 data we happen to have access to both training and holdout data and thus have the option of processing all the text data at the same time
 - Generating the term vector based only on the training data would generally be the norm because future data flows have not yet arrived
 - In this project we elected to process all the data together for convenience, knowing that the train and holdout partitions were created by random division of the data
 - It is worth pointing out, though, that the final dictionary generated from training data only might be slightly different due to the infrequent word elimination component of the text processor
  61. 61. Quick Modeling Round with CARTWe are now ready to proceed with another CART run, this time using all of the newly created text fields as additional predictorsAssuming that you already have SPM launched:
 - Go to the File > Open > Data File menu
 - Make sure that the Files of Type is set to ASCII
 - Highlight the dmc2006_res_ynm.csv dataset
 - Press the [Open] button
  62. 62. Dataset Summary WindowAgain, the resulting window summarizes basic facts about the datasetNote the dramatic increase in the number of available variables
  63. 63. The View Data WindowPress the [View Data] button to have a quick look at the physical contents of the datasetNote how the individual dictionary word entries are now coded with the yes, no, or many values for each document row
  64. 64. Setting Up CART ModelProceed with setting up a CART modeling run as before:
 - Make the Classic Output window active
 - Go to the Model > Construct Model menu (alternatively, you could use one of the buttons located on the bar right below the menu)
 - In the resulting Model Setup window make sure that the Analysis Method is set to CART
 - In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
 - Check GMS_GREATER_AVG as the Target
 - Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
You should see something similar to what is shown on the next slide
  65. 65. Model Setup Window: Model Tab
  66. 66. Model Setup Window: Testing TabSwitch to the Testing tab and confirm that 10-fold cross-validation is used as the optimal model selection method
  67. 67. Model Setup Window: Advanced TabSwitch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes to 15 and 5These limits were chosen to avoid extremely small nodes in the resulting tree
  68. 68. Building CART ModelPress the [Start] button; a building progress window will appear for a while and then the Navigator window containing the model results will be displayed (this time, the process takes a few minutes!)Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in greenUse the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree
  69. 69. CART Model PerformanceThe selected CART model contains 102 terminal nodes where nearly all available predictor variables play a role in the tree constructionArea under the ROC curve (Test) is now an impressive 0.830, especially when compared to the 0.748 reported earlier for the basic CART run or the 0.800 for the basic TN runPress the [Summary Reports] button in the Navigator window, select the Prediction Success tab, and finally press the [Test] button to see cross-validated test performance at 76.58% classification accuracy - a significant improvement!Also note the presence of the original and derived variables on the list shown in the Variable Importance tab
  70. 70. Setting Up TN ModelNow switch to the Classic Output window and go to the Model > Construct Model menuChoose TreeNet as the Analysis MethodIn the Model tab make sure that the Tree Type is set to Logistic Binary
  71. 71. Setting Up TN ParametersSwitch to the TreeNet tab and do the following:
 - Set the Learnrate: to 0.05
 - Set the Number of trees to use: to 800
 - Leave all of the remaining options at their default values
  72. 72. TN Results WindowPress the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear in the end, though you might want to take a coffee break until the modeling run completes
  73. 73. Checking TN PerformancePress the [Summary] button and switch to the Prediction Success tabPress the [Test] button to view cross-validation resultsLower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART and TN model performance)You can clearly see the improvement!
  74. 74. Requesting TN GraphsHere we present a sample collection of all 2-D contribution plots produced by TN for the resulting modelThe plots are available by pressing the [Display Plots] button in the TreeNet Results windowThe list is arranged according to the variable importance table
  75. 75. More Graphs
  76. 76. Insights Suggested by the ModelHere is a list of insights we arrived at by looking into the selection of plots:
 - There is a distinct effect of the iPod category once all the other factors have been accounted for
 - A larger start price means an above-average sale (most likely relates to the quality of an item)
 - A new and unpacked item should fetch a better price, while any defect brings the price down
 - End of the year means better sales
 - Having a good feedback score is important
 - It is best to wait 10 days or more before closing the deal
 - Interestingly, the 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th; 2G started to fall out of favor in 2005-2006
 - Black is much more popular in Germany than other colors
 - Mentioning photo, video, color display, etc. helps get a better price
 - The paid advertising features are of little or marginal importance
  77. 77. Final Validation of ModelsAt this point we are ready to check the performance of all our models using the remaining 8,000 auctions originally not available for trainingThis way each model can be positioned with respect to all of the official 173 entries originally submitted to the DMC 2006 competitionHowever, in order to proceed with the evaluation, we must first score the input data using all of the models we have generated up until nowThe following slides explain how to score the most recently constructed CART and TN models; the earlier models can be scored using similar stepsYou may choose to skip the scoring steps as we have already included the results of scoring in the stmtutor\STM\scored folder:
 - Score_cart_raw.csv - simple CART model predictions
 - Score_tn_raw.csv - simple TN model predictions
 - Score_cart_txt.csv - text mining enhanced CART model predictions
 - Score_tn_txt.csv - text mining enhanced TN model predictions
  78. 78. Scoring a CART ModelSelect the Navigator window for the model you wish to scoreSelect the tree from the tree sequence (in our runs we pick the 1SE trees as more robust)Press the [Score] button to open the Score Data windowMake sure that the Data file is set to dmc2006_res_ynm.csv; if not, press the [Select] button on the right and select the dataset to be scoredPlace a checkmark in the Save results to a file box, then press the [Select] button right next to it; this will open the Save As windowNavigate to the stmtutor\STM\scored folder under the Save in: selection box, enter Scored_cart_txt.csv in the File name: text entry box, and press the [Save] buttonYou should now see something similar to what's shown on the next slidePress the [OK] button to initiate the scoring processYou should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored folder
  79. 79. Scoring CART
  80. 80. Scoring a TN ModelSelect the TreeNet Results window for the model you wish to scoreGo to the Model > Score Data menu to open the Score Data windowMake sure that the Data file is set to dmc2006_res_ynm.csv; if not, press the [Select] button on the right and select the dataset to be scoredPlace a checkmark in the Save results to a file box, then press the [Select] button right next to it; this will open the Save As windowNavigate to the stmtutor\STM\scored folder under the Save in: selection box, enter Scored_tn_txt.csv in the File name: text entry box, and press the [Save] buttonYou should now see something similar to what's shown on the next slidePress the [OK] button to initiate the scoring processYou should now have the Scored_tn_txt.csv file in the stmtutor\STM\scored folder
  81. 81. Scoring TN
  82. 82. Using STM to Validate PerformanceWe can now use the STM machinery to do final model validationSimply double-click the stm_validate.bat command file to proceedNote the use of the following options inside the command file:
 - -score specifies the output dataset where the model predictions will be written
 - --score-column specifies the name of the variable containing the actual model predictions (these variables are produced by CART or TN during the scoring process)
 - --check specifies the name of the dataset that contains the originally withheld values of the target; this dataset was used by the organizers of the DMC 2006 competition to select the actual winners
 - STM is currently configured to validate only the bottom 8,000 of the 16,000 predictions generated by the model; the top 8,000 records (used for learning) are simply ignored
The results will be saved into text files with extension *.result appended to the original score file names in the stmtutor\STM\scored folder
  83. 83. Validation Results FormatThe following window shows the validation results of the final TN model we built:
 - 8,000 validation records were scored, of which:
 - 719 ones were misclassified as zeroes
 - 807 zeroes were misclassified as ones
 - Thus 1,526 documents were misclassified
 - This gives the final score of 8,000 - (1,526 * 2) = 4,948
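The scoring rule is simple enough to state as a one-line function; this just restates the arithmetic above (the function name is ours):

```python
# DMC 2006 scoring rule: with 8,000 validation records, the final score
# is 8,000 minus twice the total number of misclassified records.
def dmc_score(missed_ones, missed_zeroes, n_records=8000):
    misclassified = missed_ones + missed_zeroes
    return n_records - 2 * misclassified
```

Plugging in the final TN figures, dmc_score(719, 807) reproduces the 4,948 shown above.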
  84. 84. Final Validation of ModelsBased on the predicted class assignments, the final performance score is calculated as 8,000 minus twice the total number of auction items misclassifiedThe following table summarizes how these virtually out-of-the-box elementary models perform on the holdout data (the values are extracted from the four *.result files produced by the STM validator)

Model            ROC Area   Missed 0s   Missed 1s   Score
CART, raw data   75%        1123        1387        2980
TN, raw data     80%        1308         926        3532
CART, text data  83%         981         848        4342
TN, text data    89%         807         719        4948
  85. 85. Visual Validation of the ResultsThe following graph summarizes the positioning of the four basic models with respect to the 173 official competition entriesThe TN model with text mining processing is among the top 10 winners!(chart markers: TN text, CART text, TN raw, CART raw)
  86. 86. Observations on the ResultsWe used the most basic form of text mining, the Bag of Words, with minor emendations
 - None of the authors speaks German, although we did look up some of the words in an on-line dictionary. If there are any subtleties to be picked up from seller wording choices we would have missed them.
We chose the coding scheme that performed best on the training data. We have six coding options and one stands out as clearly bestWe used common settings for the controls for CART and TreeNetWe did not use any of the modeling refinement techniques we teach in our CART and TreeNet tutorialsWe thus invite you to see if you can tweak the performance of these models even higher
  87. 87. Command Line Automation in SPMSPM has a powerful command line processing component which allows you to completely reproduce any modeling activity by creating and later submitting a command fileWe have packaged the command files for the four modeling and scoring runs you have conducted in the course of this tutorial
 - SPM command files must have the extension *.cmd
 - The four command files are stored in the stmtutor\STM\dmc2006 folder
You can create, open, or edit a command file using a simple text editor, like Notepad, etc.SPM has a built-in editor; just go to the File > New Notepad menuYou may also access the command line directly from inside the SPM GUI; just make sure that the File > Command Prompt menu item is checkedJust type in help in the Command Prompt part (starts with the > mark) of the Classic Output window to get a listing of all available commandsThen you can request more detailed help for any specific command of interest; for example, help battery will produce a long list of the various batteries of automated runs available in SPMFurthermore, you may view all of the commands issued during the current session by going to the View > Open Command Log menu; this way you can quickly learn which commands correspond to the recent GUI activity you were involved with
  88. 88. Basic CART Model Command FileYou may now restart SPM to emulate a fresh new runGo to the File > Open > Command File menuSelect the cart_raw.cmd command file and press the [Open] buttonThe file is now opened in the built-in Notepad window
  89. 89. CART Command File Contents
 - OUT saves the classic output into a text file
 - USE points to the modeling dataset
 - GROVE saves the model as a binary grove file
 - MODEL specifies the target variable
 - CATEGORY indicates which variables are categorical, including the target
 - KEEP specifies the list of predictors
 - LIMIT sets the node limits
 - ERROR requests cross-validation
 - BUILD builds a CART model
 - SAVE names the file where the CART model predictions will be saved
 - HARVEST specifies which tree is to be used in scoring
 - IDVAR requests saving of the additional variables into the output dataset
 - SCORE scores the CART model
 - OUTPUT * closes the current text output file
Note the use of relative paths in the GROVE and SAVE commandsAlso note the use of the forward slash / to separate folder names
  90. 90. Submitting Command FileWith the Notepad window active, go to the File > Submit Window menu to submit the command file to SPMIn the end you will see the Navigator and the Score windows opened, which should be identical to the ones you have already seen in the beginning of this tutorialFurthermore, you should now have:
 - cart_raw.dat - text file created in the stmtutor\STM\dmc2006 folder; the file contains the classic output you normally see in the Classic Output window
 - cart_raw.grv - binary grove file created in the stmtutor\STM\models folder; the file contains the CART model itself, it can be opened in the GUI using the File > Open > Open Grove menu which reopens the Navigator window; this file will also be needed for future scoring or translation
 - Score_cart_raw.csv - data file created in the stmtutor\STM\scored folder; the file contains the selected CART model predictions on your data
You may proceed now with opening up the tn_raw.cmd file using the File > Open > Command File menu
  91. 91. TN Command File Contents
 - OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT - same as the CART command file introduced earlier
 - MART TREES sets the TN model size in trees
 - MART NODES sets the tree size in terminal nodes
 - MART MINCHILD sets the minimum individual node size in records
 - MART OPTIMAL sets the evaluation criterion that will be used for optimal model selection
 - MART BINARY requests logistic regression processing in our case
 - MART LEARNRATE sets the learnrate parameter
 - MART SUBSAMPLE sets the sampling rate
 - MART INFLUENCE sets the influence trimming value
 - The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type in help mart to get full descriptions
  92. 92. Submitting the Rest of the Command FilesAgain, with the current Notepad window active, use the File > Submit Window menu to launch the basic TN modeling run, automatically followed by scoringThis will create the output, grove, and scored data files in the corresponding locations for the chosen TN model; also note the use of the EXCLUDE command in place of the KEEP command inside the command file - this saves a lot of typingNow go back to the Classic Output window and notice that the File menu has changedGo to the File > Submit Command File menu, select the cart_txt.cmd command file, and press the [Open] buttonNotice the modeling activity in the Classic Output window, but no Results window is produced - this is how the Submit Command File menu item differs from the Submit Window menu item used previously; nonetheless, the output, grove, and score files are still created in the specified locationsUse the File > Open > Open Grove menu to open the tn_raw.grv file located in the stmtutor\STM\models folder; you will need to navigate into this folder using the Look in: selection box in the Open Grove File windowYou may now proceed with the final TN run by submitting the tn_txt.cmd command file using either the File > Open > Command File / File > Submit Window or File > Submit Command File menu routes - don't forget that it does take a long time to run!
  93. 93. Final RemarksThis completes the Salford Systems Data Mining and Text Mining tutorialIn the process of going through the tutorial you have learned how to use both the GUI and command line facilities of SPM as well as the command line text mining facility STMYou managed to build two CART models and two TN models, as well as enrich the original dataset with a variety of text mining fieldsThe final model puts you among the top winners in a major text mining competition - a proud achievementEven though we have barely scratched the surface, you are now ready to proceed with exploring the remainder of the vast data mining activities offered within SPM and STM on your ownWe wish you the best of luck on the exciting and never ending road of modern data analysis and explorationAnd don't forget that you can always reach us at www.salford-systems.com should you have further modeling questions and needs
  94. 94. References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2000). The Elements of Statistical Learning. Springer.
Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
Weiss, S.M., Indurkhya, N., Zhang, T., and Damerau, F.J. (2004). Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer.
  95. 95. STM Command ReferenceSalford Text Miner is a simple utility that should make the text mining process much easier. To this end, the application described in this manual accepts a number of parameters and can execute the Salford Predictive Miner as the data mining backendSTM workflow:
 - Automatically generate a dictionary based on the dataset
 - Process the dataset and generate a new one with additional columns based on the dictionary
 - Generate a model folder with the dataset, command file, and dictionary
 - Run Salford Predictive Miner with the generated command file
 - Run the checking process, comparing results from scoring with the real classes
All of these steps can be done in separate STM calls or in one call
  96. 96. STM Command Reference
Short Option          Long Option                 Description
-data DATAFILE        --dataset DATAFILE          Specify dataset to work with
-dict DICTFILE        --dictionary DICTFILE       Specify dictionary to work with
-source-dict SDFILE   --source-dictionary SDFILE  Dictionary used as the source for the automatic dictionary retrieval process
-score SFILE          --scoreresult SFILE         Specify file with the score result, for the checking process; default score.csv
-spm SPMAPP           --spmapplication SPMAPP     Path to the SPM application; default spm.exe
-t TARGET             --target TARGET             Target variable used when generating the command file; default GMS_GREATER_AVG
-ex EXCLUDE           --exclude EXCLUDE           List of variables to exclude from the keep list when generating the command file
-cat CATEGORY         --category CATEGORY         List of variables to select as categorical when generating the command file
  97. 97. STM Command Reference
Short Option          Long Option                 Description
-templ CMDTEMPL       --cmdtemplate CMDTEMPL      Template of the command file used for generation; default data/template.cmd
-md MODEL_DIR         --modeldir MODEL_DIR        Directory where model folders will be created; default models
-trees TREES          --trees TREES               For TreeNet command files: number of trees to build; default 500
-maxnodes MAXNODES    --maxnodes MAXNODES         For TreeNet command files: number of nodes per tree; default 6
-fixwords             --fixwords                  Enables heuristics that try to fix words (nearest match by different metrics, spell checking, etc.)
-textvars VARLIST     --text-variables VARLIST    List of variables, separated by commas, used in the dictionary retrieval process
98. STM Command Reference

    -outrmwords, --output-removed-words
        Enables writing removed stop words to the file data/removed.dat.
    -code CODE, --column-coding CODE
        Specify how to code the absence/presence of a word in a row:
            YN or 0      no/yes
            YNM or 1     no/yes/many
            01 or 2      0/1
            012 or 3     0/1/2
            TF or 4      term frequency
            IDF or 5     inverse document frequency
            TF-IDF or 6  TF-IDF
            TC or 7      term count (0, 1, 2, ...)
        Default: YN
    -mp MODELPATH, --model-path MODELPATH
        Path where model files will be created.
    -cmd-path CMDPATH, --command-file-path CMDPATH
        Path to the command file that will be executed by Salford Predictive Miner.
    -ppfile PPFILE, --preprocess-file PPFILE
        Path to Python code executed during the process step to manipulate the data.
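The coding options above map per-document word counts to feature values. A minimal sketch of what the count-based codings (YN, YNM, 012, TC) and the frequency-based ones (TF, IDF, TF-IDF) might compute: STM's exact formulas are not given in this manual, so the standard textbook definitions used below (tf = count / document length, idf = log(N / document frequency)) are an assumption.

```python
import math

def code_counts(count, scheme):
    """Map a raw word count in one document to a coded value (count-based schemes)."""
    if scheme == "YN":    # no/yes
        return "Y" if count > 0 else "N"
    if scheme == "YNM":   # no/yes/many
        return "N" if count == 0 else ("Y" if count == 1 else "M")
    if scheme == "012":   # 0/1/2, where 2 stands for "many"
        return min(count, 2)
    if scheme == "TC":    # raw term count: 0, 1, 2, ...
        return count
    raise ValueError(f"unknown scheme: {scheme}")

def tf_idf(word, doc, docs):
    """Frequency-based codings for one word in one document (a list of tokens).
    Assumed definitions: tf = count / len(doc), idf = log(N / df)."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf, idf, tf * idf
```

For example, with `docs = [["ipod", "nano", "ipod"], ["case"]]`, the word "ipod" in the first document gets tf = 2/3 and idf = log(2/1) under these assumed definitions.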
99. STM Command Reference

    -rc NAME, --realclass-column-name NAME
        Column name in the real-class dataset used in the check step. Default: GMS_GREATER_AVG
    -e, --extract
        Run the first step: automatic extraction of a dictionary from the dataset. Requires --dataset.
    -p OUTFILE, --process OUTFILE
        Run the second step: process the dataset and create a new dataset named OUTFILE with new columns derived from the dictionary. Requires --dataset and --dictionary.
    -g, --generate
        Run the third step: generate a model folder with a command file. Requires --dataset and --dictionary.
    -m, --model
        Run the fourth step: run Salford Predictive Miner with the generated command file. Works only with --generate.
    -c DATASET, --check DATASET
        Run the fifth step: check the score file against the real classes (from the specified real-class file) and output a misclassification table. Requires --scoreresult.
    -h, --help
        Show help.
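The check step cross-tabulates the score file's prediction column against the real-class column to produce the misclassification table. A minimal sketch of that comparison, assuming the default column names (PREDICTION and GMS_GREATER_AVG) and using in-memory rows standing in for the two CSV files:

```python
from collections import Counter

def misclassification_table(score_rows, real_rows,
                            pred_col="PREDICTION",
                            real_col="GMS_GREATER_AVG"):
    """Count (actual, predicted) pairs across paired rows from the
    score file and the real-class file."""
    table = Counter()
    for score, real in zip(score_rows, real_rows):
        table[(real[real_col], score[pred_col])] += 1
    return table

# Toy rows standing in for score.csv and the real-class dataset
scores = [{"PREDICTION": "1"}, {"PREDICTION": "0"}, {"PREDICTION": "1"}]
reals = [{"GMS_GREATER_AVG": "1"}, {"GMS_GREATER_AVG": "0"}, {"GMS_GREATER_AVG": "0"}]
table = misclassification_table(scores, reals)
# table[("0", "1")] counts rows whose real class is 0 but were predicted as 1
```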
100. STM Configuration File

    SPM_APPLICATION
        Path to Salford Predictive Miner. Default: spm.exe
    CMD_TREES
        Number of trees to build in TN models. Default: 500
    CMD_NODES
        Tree size for TN models. Default: 6
    CMD_TEMPLATE
        Command file template. Default: data/template.cmd
    MODELS_DIR
        Directory where model folders will be created. Default: models
    LANGUAGES
        Languages whose stop words will be used. Default: English, German
    SPELLCHECKER_DICT
        Additional spell checker dictionary with words that are allowed (like ipod). Default: data/spellchecker_dict.dat
    SPELLCHECKER_LANGUAGE
        Language for the spell checker. Default: de_DE
    ADDITIONAL_STOPWORDS
        File with additional stop words, which the user can edit. Default: data/stopwords.dat
    REMOVED_WORDS_FILE
        File where removed words are written during the extract step. Default: data/removed.dat
    WORD_FREQUENCY_THRESHOLD
        Lower threshold on word frequency; words below it are deleted during the extract step. Default: 5
    PREPROCESS_FILE
        Script included to do additional processing. Default: dmc2006/preprocess.py
101. STM Configuration File

    CHECK_RESULTS_FILE
        Default: data/score_results.csv
    LOGFILE
        Path to the log file; can be a mask (%s for the date). Default: log/stm%s.log
    TARGET
        Default value for the target argument, used to fill the command file template. Default: GMS_GREATER_AVG
    EXCLUDE
        Default value for the keep argument, used to fill the command file template. Default: AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, GMS_GREATER_AVG
    CATEGORY
        Default value for the category argument, used to fill the command file template. Default: GMS_GREATER_AVG
    SCORE_FILE
        Name of the score file to be checked. Default: Score.csv
    TEXT_VARIABLES
        Comma-separated list of text variables in the dataset. Default: ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
    DEFAULT_CODING
        Default coding for the extract and preprocess steps. Default: YN
    REALCLASS_COLUMN_NAME
        Name of the column in the real-class file used in the check step. Default: GMS_GREATER_AVG
    SCORE_COLUMN_NAME
        Name of the column in the score file used in the check step. Default: PREDICTION
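Putting the settings from these two slides together, a configuration file might look like the fragment below. The slides do not show the file's actual syntax, so the NAME = value layout is an assumption; only the names and default values are taken from the tables above:

```
SPM_APPLICATION = spm.exe
CMD_TREES = 500
CMD_NODES = 6
CMD_TEMPLATE = data/template.cmd
MODELS_DIR = models
LANGUAGES = English, German
DEFAULT_CODING = YN
TARGET = GMS_GREATER_AVG
SCORE_FILE = Score.csv
SCORE_COLUMN_NAME = PREDICTION
LOGFILE = log/stm%s.log
```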