1. Getting Started with Text Mining: STM, CART and TreeNet
Dan Steinberg, Mykhaylo Golovnya, Ilya Polosukhin
May 2011
2. Text Mining and Data Mining
Text mining is an important and fascinating area of modern analytics. On the one hand, text mining can be thought of as just another application area for powerful learning machines. On the other hand, text mining is a distinct field with its own dedicated concepts, vocabulary, tools, and techniques. In this tutorial we aim to illustrate some important analytical methods and strategies from both perspectives on data mining, introducing tools specific to the analysis of text and deploying general machine learning technology. The Salford Text Mining utility (STM) is a powerful text processing system that prepares data for advanced machine learning analytics. Our machine learning tools are the Salford Systems flagship CART decision tree and the stochastic gradient boosting engine TreeNet. Evaluation copies of the proprietary technology in CART and TreeNet, as well as the STM, are available from http://www.salford-systems.com
Salford Systems Copyright 2011
3. For Readers of this Tutorial
To follow along with this tutorial we recommend that you have the analytical tools we use installed on your computer. Everything you need may already be on a CD containing this tutorial and the analytical software. Create an empty folder named stmtutor; this is the root folder where all of the work files related to this tutorial will reside. You may also use the following link to download the Salford Systems Predictive Modeler (SPM): http://www.salford-systems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip After downloading the package, unzip its contents into stmtutor, which will create a new folder named SPM680_Mulitple_Installs_2011_06_07. Follow the installation steps described on the next slide. For the original DMC2006 competition website visit http://www.data-mining-cup.de/en/review/dmc-2006/ We recommend that you visit the above site for information only; the data and the tools for preparing that data are available at the URL below. For the STM package, prepared data files, and other utilities developed for this tutorial please visit http://www.salford-systems.com/dist/STM.zip After downloading the archive, unzip its contents into stmtutor.
4. Important! Installing the SPM Software
The Salford Systems software you've just downloaded needs to be both installed and licensed. No-cost license codes for a 30-day period are available on request to visitors of this tutorial.* Double-click on the Install_a_Transform_SPM.exe file located in the SPM680_Mulitple_Installs_2011_06_07 folder (see the previous slide) to install the specific version of SPM used in this tutorial. Following the above procedure will ensure that all of the currently installed versions of SPM, if any, will remain intact! Follow the simple installation steps on your screen.
* Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion.
5. Important! Licensing the SPM Software
When you launch the Salford Systems Predictive Modeler (SPM) you will be greeted with a License dialog containing information needed to secure a license via email. Please send the necessary information to Salford Systems to secure your license by entering the Unlock Code which will be e-mailed back to you. The software will operate for 3 days without any licensing; however, you can secure a 30-day license on request.
6. Installing the Salford Text Miner (STM)
In addition to the Salford Predictive Modeler (SPM) you will also work with the Salford Text Miner (STM) software. No installation is needed; you should already have the stm.exe executable in the stmtutor\STM\bin folder as the result of unzipping the STM.zip package earlier. STM builds upon the Python 2.6 distribution and the NLTK (Natural Language Toolkit) but makes text data processing for analytics very easy to conduct and manage. You do not need to add any other support software to use STM. Expect to see several folders and a large number of files located under the stmtutor\STM folder. It is important to leave these files in the location to which you have installed them. Please do not move or alter any of the installed files other than those explicitly listed as user-modifiable! stm.exe will expire in the middle of 2012; contact Salford Systems to get an updated version beyond that.
7. The Example Project
The best examples are drawn from real-world data sets, and we were fortunate to locate data publicly released by eBay. Good teaching examples also need to be simple. Unfortunately, real-world text mining can easily involve hundreds of thousands if not millions of features characterizing billions of records. Professionals need to be able to tackle such problems, but to learn we need to start with simpler situations. Fortunately, there are many applications in which text is important but the dimensions of the data set are radically smaller, either because the available data is limited or because a decision has been made to work with a reduced problem. We use our simpler example to illustrate many useful ideas for beginning text miners while pointing the way to working on larger problems.
8. The DMC2006 Text Mining Challenge
In 2006 the DMC data mining competition (restricted to student competitors only) introduced a predictive modeling problem for which much of the predictive information was in the form of unstructured text. The datasets for the DMC 2006 data mining competition can be downloaded from http://www.data-mining-cup.de/en/review/dmc-2006/ For your convenience we have re-packaged this data and made it somewhat easier to work with. This re-packaged data is included in the STM package described near the beginning of this tutorial. The data summarizes 16,000 iPod auctions held at eBay in Germany from May 2005 through May 2006. Each auction item is represented by a text description written by the seller (in German) as well as a number of flags and features available to the seller at the time of the auction. Auction items were grouped into 15 mutually exclusive categories based on distinct iPod features: storage size, type (regular, mini, nano), and color. The competition goal was to predict whether the closing price would be above or below the category average.
9. Comments on the Challenge
One might think that a challenge with text in German might not be of general interest outside of Germany. However, working with a language essentially unfamiliar to any member of the analysis team helps to illustrate one important point: text mining via tools that have no understanding of the language can be strikingly effective. We have no doubt that dedicated tools which embed knowledge of the language being analyzed can yield predictive benefits. We also believe we could have gained further valuable insight into the data if any of the authors spoke German! But our performance without this knowledge is still impressive. In contexts where simple methods can yield more than satisfactory results, or in contexts where the same methods must be applied uniformly across multiple languages, the methods described in this tutorial will be an excellent guide.
10. Configuring the Work Location in SPM
The original datasets from the DMC 2006 challenge reside in the stmtutor\STM\dmc2006 folder. To facilitate further modeling steps, we will configure SPM to use this location as the default location:
- Start SPM
- Go to the Edit > Options menu
- Switch to the Directories tab
- Enter the stmtutor\STM\dmc2006 folder location in all text entry boxes except the last one
- Press the [Save as Defaults] button so that the configuration is restored the next time you start SPM
11. Configuring the TreeNet Engine
Now switch to the TreeNet tab:
- Configure the Plot Creation section as shown on the screenshot
- Press the [Save as Defaults] button
- Press the [OK] button to exit
12. Steps in the Analysis: Data Overview
1. Describe the data (data dictionary and dimensions of the data)
   a. What is the unit of observation? Each record of data is describing what?
   b. What is the dependent or target variable?
   c. What other variables (database fields) are available?
   d. How many records are available?
2. Statistical summary
   a. Basic summary including means, quantiles, frequency tables
   b. Dimensions of categorical predictors
   c. Number of distinct values of continuous variables
3. Outlier and anomaly assessment
   a. Detection of gross data errors such as extreme values
   b. Assessment of usability of levels of categorical predictors (rare levels)
13. Data Fundamentals
The original dataset is called dmc2006.csv and resides in the stmtutor\STM\dmc2006 folder. It contains 16,000 records divided into two equal-sized partitions:
- Part 1: Complete data including the target, available for training during the competition
- Part 2: Data to be scored; during the competition the target was not available
There are 25 database fields, two of which are unstructured text written by the seller. Each line of data describes an auction of an iPod, including the final winning bid price. An eBay seller must construct a headline and a description of the product being sold. Sellers can also pay for selling assistance, e.g. a seller can pay to list the item title in BOLD.
14. The Data: Available Fields
The following variables describe general features of each auction event:

Variable | Description
AUCT_ID | ID number of auction
ITEM_LEAF_CATEGORY_NAME | product category
LISTING_START_DATE | start date of auction
LISTING_END_DATE | end date of auction
LISTING_DURTN_DAYS | duration of auction
LISTING_TYPE_CODE | type of auction (normal auction, multi auction, etc.)
QTY_AVAILABLE_PER_LISTING | number of offered items for a multi auction
FEEDBACK_SCORE_AT_LISTIN | feedback rating of the seller of this auction listing
START_PRICE | start price in EUR
BUY_IT_NOW_PRICE | buy-it-now price in EUR
BUY_IT_NOW_LISTING_FLAG | option for buy-it-now on this auction listing
15. Available Data Fields
In addition, there are binary indicators of various value-added features that can be turned on for each auction:

Variable | Description
BOLD_FEE_FLAG | option for bold font on this auction listing
FEATUERD_FEE_FLAG | show this auction listing on top of the homepage
CATEGORY_FEATURED_FEE_FLAG | show this auction listing on top of the category
GALLERY_FEE_FLAG | auction listing with picture gallery
GALLERY_FEATURED_FEE_FLAG | auction listing with gallery (in gallery view)
IPIX_FEATURED_FEE_FLAG | auction listing with IPIX (additional xxl, picture show, pack)
RESERVE_FEE_FLAG | auction listing with reserve price
HIGHLIGHT_FEE_FLAG | auction listing with background color
SCHEDULE_FEE_FLAG | auction listing including the definition of the starting time
BORDER_FEE_FLAG | auction listing with frame
16. Target Variable
Finally, the target variable is defined based on the winning bid price revenue relative to the category average:

Variable | Description
GMS | scored sales revenue in EUR
CATEGORY_AVG_GMS | average sales revenue for the product category
GMS_GREATER_AVG | zero when the revenue is less than or equal to the category average sales, and one otherwise

The values were only disclosed on a randomly selected set of 8,000 auctions, which we use to train a model:
- 4,199 auctions with the revenue below the category average
- 3,801 auctions with the revenue above the category average
During the competition the auction results for the remaining 8,000 auctions were kept secret and used to score competitive entries. We will only use these records at the very end of this tutorial to validate the performance of the various models that will be built.
17. Comments on Methodology
Predictive modeling and general analytics competitions are increasingly being launched both by private companies and by professional organizations, and provide both public data sets and a wealth of illustrative examples using different analytic techniques. When reviewing results from a competition, and especially when comparing results generated by analysts running models after the competition, it is important to keep in mind that there is an ocean of difference between being a competitor during the actual competition and being an after-the-fact commentator. Regardless of what is reported, the after-the-fact analyst does have access to what really happened, and it is nearly impossible to simulate the competitive environment once the results have been published. We all learn in both direct and indirect ways from many sources, including the outcomes of public competitions; this can affect anything that comes later in time. In spite of this, we have tried to mimic the circumstances of the competitors by presenting analyses based only on the original training data, and using well-established guidelines we have been promoting for more than a decade to arrive at a final model. We urge you to never take at face value an analyst's report on what would have happened if they had hypothetically participated.
18. First Round Modeling: Ignoring the TEXT Data
Even before doing any type of data preparation it is always valuable to run a few preliminary CART models:
- CART automatically handles missing values and is immune to outliers
- CART is flexible enough to adapt to any type of nonlinearity and interaction effects among predictors; the analyst does not need to do any data preparation to assist CART in this regard
- CART performs well enough out of the box that we are guaranteed to learn something of value without conducting any of the common data preparation operations
The only requirement for useful results is that we exclude any possible perfect or near-perfect illegitimate predictors. Common examples of illegitimate predictors include repackaged versions of the dependent variable, ID variables, and data drawn from the future relative to the data to be predicted. We start with a quick model using 20 of the 25 available predictors. None of these involve any of the text data we will focus on later.
19. Quick Modeling Round with CART
We start by building a quick CART model using the original raw variables and all 8,000 complete auction records. Assuming that you already have SPM launched:
- Go to the File > Open > Data File menu
- Note that we have already configured the default working folder for SPM
- Make sure that Files of Type is set to ASCII
- Highlight the dmc2006.csv dataset
- Press the [Open] button
20. Dataset Summary Window
The resulting window summarizes basic facts about the dataset. Note that even though the dataset has 16,000 records, only the top 8,000 will be used for modeling, as was already pointed out.
21. The View Data Window
Press the [View Data] button to get a quick impression of the physical contents of the dataset. Our goal is to eventually use the unstructured information contained in the text fields right next to the auction ID.
22. Requesting Basic Descriptive Stats
We next produce some basic stats for all available variables:
- Go to the View > Data Info menu
- Set the Sort mode to File Order
- Highlight the Include column
- Check the Select box
- Press the [OK] button
23. Data Information Window
All basic descriptive statistics for all requested variables are now summarized in one place. Note that the target variable GMS_GREATER_AVG is not defined for one half of the dataset (N Missing = 8,000); all those records will be automatically discarded during model building. Press the [Full] button to see more details.
24. Setting Up the CART Model
We are now ready to set up a basic CART run:
- Make the Classic Output window active
- Go to the Model > Construct Model menu (alternatively, you could press one of the buttons located on the bar right below the menu bar)
- In the resulting Model Setup window make sure that the Analysis Method is set to CART
- In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
- Check GMS_GREATER_AVG as the Target
- Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
You should see something similar to what is shown on the next slide.
25. Model Setup Window: Model Tab
26. Model Setup Window: Testing Tab
Switch to the Testing tab and confirm that 10-fold cross-validation is used as the optimal model selection method.
27. Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes to 15 and 5. These limits were chosen to avoid extremely small nodes in the resulting tree.
28. Building the CART Model
Press the [Start] button; a building progress window will appear for a while and then the Navigator window containing the model results will be displayed. Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in green. Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree.
29. CART Model Observations
The selected CART model contains 64 terminal nodes; it is the smallest model with a relative error still within one standard error of the optimal model (the model with the smallest relative error, indicated by the green bar).
- This approach to model selection is usually employed for easy comprehension
- We might also want to require terminal nodes to contain more than the 6-record minimum we observe in this out-of-the-box tree
All 20 predictor variables play a role in the tree construction, but there is more to observe about this when we look at the variable importance details. The area under the ROC curve is a respectable 0.748.
30. CART Model Performance
Press the [Summary Reports] button in the Navigator, select the Prediction Success tab, and press the [Test] button to display the cross-validated test performance of 68.66% classification accuracy. Now select the Variable Importance tab to review which variables entered into the model. Interestingly enough, none of the value-added paid options are important; they exhibit practically no direct influence on the sales revenue. A detailed look at the nodes might also be instructive for understanding the model.
31. Experimenting with TreeNet
We almost always follow initial CART models with similar TreeNet models. We start with CART because some glaring errors, such as perfect predictors, are more quickly found and obviously displayed in CART:
- A perfect predictor often yields a single-split tree (two terminal nodes) for classification trees
TreeNet models have strengths similar to CART regarding flexibility and robustness, and have advantages and disadvantages relative to CART:
- TreeNet is an ensemble of small CART trees that have been linked together in special ways; thus TreeNet shares many desirable features of CART
- TreeNet is superior to CART in the context of errors in the dependent variable (not relevant in this data)
- TreeNet yields much more complex models but generally offers substantially better predictive accuracy; TreeNet may easily generate thousands of trees to arrive at an optimal model
- TreeNet yields more reliable variable importance rankings
32. A Few Words about TreeNet
TreeNet builds predictive models in stages. It starts with a deliberately very small first-round tree (essentially a CART tree). TreeNet then calculates the prediction error made by this simple model and builds a second tree to try to model that prediction error. The second tree serves as a tool to update, refine, and improve the first-stage model. A TreeNet model produces a score which is a simple sum of all the predictions made by each tree in the model. Typically the TreeNet score becomes progressively more accurate as the number of trees is increased, up to an optimal number of trees. Rarely is the optimal number of trees just one! Occasionally, a handful of trees is optimal; more typically, hundreds or thousands of trees are optimal. TreeNet models are very useful for the analysis of data with large numbers of predictors, as the models are built up in layers each of which makes use of just a few predictors. More detail on TreeNet can be found at http://www.salford-systems.com
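The staged, residual-fitting logic described above can be sketched in miniature. The following is a hedged illustration of generic stagewise boosting with one-split regression "trees" (stumps), not TreeNet's actual implementation; all names and the toy data are ours.

```python
# Minimal sketch of stagewise boosting: each tiny "tree" (a stump that
# predicts a constant on either side of a threshold) is fit to the
# residuals of the current model, and the final score is the sum of all
# stage predictions scaled by a learn rate.

def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_trees=100, learn_rate=0.1):
    """Return a scoring function that sums the scaled stage predictions."""
    stages = []
    preds = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # current errors
        stump = fit_stump(xs, residuals)                # model the errors
        stages.append(stump)
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learn_rate * s(x) for s in stages)

# Toy data: a step function the ensemble gradually recovers.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
score = boost(xs, ys, n_trees=100, learn_rate=0.1)
```

With a small learn rate, no single stage predicts well, but the sum of a hundred stages converges on the target: the same behavior the slide describes for TreeNet's score.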
33. Setting Up the TN Model
Switch to the Classic Output window and go to the Model > Construct Model menu. Choose TreeNet as the Analysis Method. In the Model tab make sure that the Tree Type is set to Logistic Binary.
34. Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
- Set the Learnrate to 0.05
- Set the Number of trees to use: to 800 trees
- Leave all of the remaining options at their default values
35. TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear in the end.
36. Checking TN Performance
Press the [Summary] button and switch to the Prediction Success tab. Press the [Test] button to view the cross-validation results. Lower the Threshold: to 0.45 to roughly equalize the classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART performance).
37. The Performance Has Improved!
The overall classification accuracy goes up to about 71%. Press the [ROC] button to see that the area under the ROC curve is now a solid 0.800. This comes at the cost of added model complexity: 796 trees, each with about 6 terminal nodes. Variable importance remains similar to CART.
38. Understanding the TreeNet Model
TreeNet produces partial dependency plots for every predictor that appears in the model; the plots can be viewed by pressing the [Display Plots] button. Such plots are generally 2D illustrations of how the predictor in question affects the outcome. For example, in the graph below the Y axis represents the probability that an iPod will sell at an above-category-average price. We see that for a BUY_IT_NOW price between 200 and 300 the probability of an above-average winning bid rises sharply with the BUY_IT_NOW_PRICE. For prices above 300 or below 200 the curve is essentially flat, meaning that changes in the predictor do not result in changes in the probable outcome.
39. Understanding the Partial Dependency Plot (PD Plot)
The PD Plot is not a simple description of the data. If you plotted the raw data, say as the fraction of above-average winning bids against price intervals, you might see a somewhat different curve. The PD Plot is extracted from the TreeNet model: it is generated by examining TreeNet predictions (and not the input data). The PD Plot appears to relate two variables, but in fact other variables may well play a role in the graph construction. Essentially the PD Plot shows the relationship between a predictor and the target variable taking all other predictors into account. The important points to understand are:
- the graph is extracted from the model and not directly from raw data
- the graph provides an honest estimate of the typical effect of a predictor
- the graph displays not absolute outcomes but typical expected changes from some baseline as the predictor varies; the graph can be thought of as floating up or down depending on the values of other predictors
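The idea of "examining model predictions rather than raw data" can be made concrete. The sketch below computes a partial dependence curve for a hypothetical model: the model, the field names, and the records are stand-ins of ours, not TreeNet internals.

```python
# Sketch of a partial dependence computation: for each grid value v of the
# predictor of interest, force that predictor to v in EVERY record, average
# the model's predictions, and report the averages against the grid.

def partial_dependence(model, records, feature, grid):
    """Average model prediction as `feature` sweeps over `grid`."""
    curve = []
    for v in grid:
        total = 0.0
        for rec in records:
            modified = dict(rec)
            modified[feature] = v      # force the feature of interest to v
            total += model(modified)   # other features keep their real values
        curve.append(total / len(records))
    return curve

# Hypothetical model: probability rises with price and also depends on a flag.
def toy_model(rec):
    base = 0.3 if rec["BOLD_FEE_FLAG"] == 0 else 0.4
    return min(1.0, base + 0.001 * rec["BUY_IT_NOW_PRICE"])

records = [
    {"BUY_IT_NOW_PRICE": 150, "BOLD_FEE_FLAG": 0},
    {"BUY_IT_NOW_PRICE": 250, "BOLD_FEE_FLAG": 1},
]
curve = partial_dependence(toy_model, records, "BUY_IT_NOW_PRICE", [100, 200, 300])
```

Because the other predictors keep their observed values inside the averaging loop, the curve shows the typical effect of the chosen predictor with everything else taken into account, which is exactly the "floating" behavior the slide describes.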
40. More TN Partial Dependency Plots
41. Introducing the Text Mining Dimension
To this point, we have been working only with the set of traditional structured data fields: continuous and categorical variables. Further substantial performance improvement can be achieved only if we utilize the text descriptions supplied by the seller in the following fields:

Variable | Description
LISTING_TITLE | title of auction
LISTING_SUBTITLE | subtitle of auction

Unfortunately, these two variables cannot be used as-is. Sellers were free to enter free-form text including misspellings, acronyms, slang, etc. So we must address the challenge of converting unstructured text strings of the type shown here into a well-structured representation.
42. The Bag of Words Approach to Text Mining
The most straightforward strategy for dealing with free-form text is to represent each word that appears in the complete data set as a dummy (0/1) indicator variable. For iPods on eBay we could imagine sellers wanting to use words like "new", "slightly scratched", "pink", etc. to describe their iPod. Of course the descriptions may well be complete phrases like "autographed by Angela Merkel" rather than just single-term adjectives. Nevertheless, in the simplest Bag of Words (BOW) approach we just create dummy indicators for every word. Even though the headlines and descriptions are space-limited, the number of distinct words that can appear in collections of free text can be huge. In text mining applications involving complete documents, e.g. newspaper articles, the number of distinct words can easily reach several hundred thousand or even millions.
43. The End Goal of the Bag of Words

Record_ID | RED | USED | SCRATCHED | CASE
1001 | 0 | 1 | 0 | 1
1002 | 0 | 0 | 0 | 0
1003 | 1 | 0 | 0 | 0
1004 | 0 | 0 | 0 | 0
1005 | 1 | 1 | 1 | 0
1006 | 0 | 0 | 0 | 0

Above we see an example of a database intended to describe each auction item by indicating which words appeared in the auction announcement. Observe that Record_ID 1005 contains the three words RED, USED, and SCRATCHED. Data in the above format looks just like the kind of numeric data used in traditional data mining and statistical modeling. We can use data in this form, as-is, feeding it into CART, TreeNet, or regression tools such as Generalized Path Seeker (GPS) or everyday regression. Observe that we have transformed the unstructured text into structured numerical data.
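Producing such an indicator matrix from raw strings takes only a few lines. This is a minimal pure-Python sketch with hypothetical records, not the STM implementation:

```python
# Build a bag-of-words 0/1 indicator matrix: one row per record, one
# column per distinct word seen anywhere in the collection.

def bag_of_words(texts):
    # The vocabulary is every distinct lowercased word across all records.
    vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = []
    for t in texts:
        present = set(t.lower().split())
        rows.append([1 if w in present else 0 for w in vocab])
    return vocab, rows

texts = [
    "used iPod case",
    "red used scratched iPod",
]
vocab, rows = bag_of_words(texts)
# vocab -> ['case', 'ipod', 'red', 'scratched', 'used']
# rows[1] -> [0, 1, 1, 1, 1]  (the second record contains all but "case")
```

Each row is now a fixed-length numeric vector, which is exactly the structured form the slide's table illustrates.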
44. Coding the Term Vector and TF Weighting
In the sample data matrix on the previous slide we coded all of our indicators as 0 or 1 to indicate the presence or absence of a term. An alternative coding scheme is based on the FREQUENCY COUNT of the terms, with these variations:
- 0 or 1 coding for presence/absence
- Actual term count (0, 1, 2, 3, ...)
- Three-level indicator for absent, one occurrence, and more than one (0, 1, 2)
The text mining literature has established some useful weighted coding schemes. We start with term frequency weighting (tf):
- Text mining can involve blocks of text of considerably different lengths
- It is thus desirable to normalize counts based on relative frequency. Two text fields might each contain the term RED twice, but one of the fields contains 10 words while the other contains 40 words. We might want our coding to reflect the fact that 2/10 is more frequent than 2/40.
- This is nothing more than making counts relative to the total length of the unit of text (or document), and such coding yields the term frequency weighting
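The 2/10 versus 2/40 normalization above is just a count divided by document length; a minimal sketch with made-up documents:

```python
# Term frequency weighting: the count of a term divided by the total
# number of words in the document, so a repeated term in a long document
# is not over-weighted relative to the same count in a short one.

def term_frequency(text, term):
    words = text.lower().split()
    return words.count(term.lower()) / len(words)

short_doc = "red red ipod nano mini case new boxed mint rare"  # 10 words
long_doc = ("red red " + "word " * 38).strip()                 # 40 words

tf_short = term_frequency(short_doc, "RED")
tf_long = term_frequency(long_doc, "RED")
```

Both documents contain "red" twice, yet tf_short is 0.2 while tf_long is only 0.05, matching the slide's 2/10 versus 2/40 example.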
45. Inverse Document Frequency (IDF) Weighting
IDF weighting is drawn from the information retrieval literature and is intended to reflect the value of a term in narrowing the search for a specific document within a larger corpus of documents. If a given term occurs very rarely in a collection of documents, then that term is very valuable as a tag to target those documents accurately. By contrast, if a term is very common, then knowing that such a term occurs within the document you are looking for is not helpful in narrowing the search. While text mining has somewhat different goals than information retrieval, the concept of IDF weighting has caught on. IDF weighting serves to upweight terms that occur relatively rarely.

IDF(term) = log { (Number of documents) / (Number of documents containing term) }

The IDF increases with the rarity of a term and is maximal for words that occur in only one document. A common coding of the term vector uses the product: tf * idf
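The IDF formula and the tf * idf product can be computed directly. A sketch using natural log (the log base is a convention and only rescales the weights); the documents are invented for illustration:

```python
import math

# IDF(term) = log(N / n_term), where N is the number of documents in the
# corpus and n_term is the number of documents containing the term.

def idf(term, documents):
    n_containing = sum(1 for d in documents if term in d.lower().split())
    return math.log(len(documents) / n_containing)

def tf_idf(term, document, documents):
    words = document.lower().split()
    tf = words.count(term) / len(words)   # term frequency within the document
    return tf * idf(term, documents)

docs = [
    "new ipod nano pink",
    "used ipod mini",
    "ipod case only",
    "autographed pink case",
]
# "ipod" appears in 3 of 4 documents -> low idf (common term)
# "autographed" appears in 1 of 4   -> high idf (rare, highly identifying)
```

As the slide says, the rare term gets the largest weight: here "autographed" receives log(4), the maximum for this four-document corpus.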
46. Coding the DMC2006 Text Data
The DMC2006 text data is unusual principally because of the limit on the amount of text a seller was allowed to upload. This has the effect of making the lengths of all the documents very similar. It also sharply limits the possibility that a term would occur in a document with a high frequency. These factors contribute to making the TF-IDF weighting irrelevant to this challenge; in fact, for this prediction task other coding schemes allow more accurate prediction. STM offers these options for term vector coding:
- 0: no/yes
- 1: no/yes/many (this one will be used in the remainder of this tutorial)
- 2: 0/1
- 3: 0/1/2
- 4: term frequency (relative to document)
- 5: inverse document frequency (relative to corpus)
- 6: TF-IDF (traditional IR coding)
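The no/yes/many coding (option 1) used in the rest of this tutorial maps each term count to one of three levels. A hedged sketch of the idea; the function names and string labels are ours, and STM's actual output format may differ:

```python
# Three-level coding of a term count: absent, present once, or present
# more than once -- the "no/yes/many" scheme (STM coding option 1).

def code_nym(count):
    if count == 0:
        return "no"
    if count == 1:
        return "yes"
    return "many"

def code_document(text, vocabulary):
    """Code every dictionary term for one document."""
    words = text.lower().split()
    return {term: code_nym(words.count(term)) for term in vocabulary}

coded = code_document("red red ipod mini", ["red", "ipod", "case"])
# coded -> {'red': 'many', 'ipod': 'yes', 'case': 'no'}
```

This retains a little more information than a bare 0/1 indicator (repetition is captured) while staying robust to the near-uniform document lengths that make TF-IDF uninformative here.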
47. Text Mining Data Preparation
The heavy lifting in text mining technology is devoted to moving us from raw unstructured text to structured numerical data. Once we have structured data, we are free to use any of a large number of traditional data mining and statistical tools to move forward. Typical analytical tools include logistic and multiple regression, predictive modeling, and clustering tools. But before diving into the analysis stage we need to move through the text transformation stage in detail. The first step is to extract and identify the words or terms, which can be thought of as creating the list of all words recognized in the training data set. This stage is essentially one of defining the dictionary, the list of officially recognized terms. Any new term encountered in the future will be unrecognizable by the dictionary and will represent an unknown item. It is therefore very important to ensure that the training data set contains almost all terms of interest that would be relevant for future prediction.
48. Automatic Dictionary Building
The following steps will build an active dictionary for a collection of documents (in our case, auction item description strings):
- Read all text values into one character string
- Tokenize this string into an array of words (tokens)
- Remove words without any letters or digits
- Remove stop words (words like "the", "a", "in", "und", "mit", etc.) for both the English and German languages
- Remove words that have fewer than 2 letters and are encountered fewer than 10 times across the entire collection of documents (rare small words)
- At this point the too-common, too-rare, weird, obscure, and useless combinations of characters should have been eliminated
- Lemmatize words using the WordNet lexical database; this step combines words present in different grammatical forms (go, went, going, etc.) into the corresponding stem word (go)
- Remove all resulting words that appear fewer than MIN times (5 in the remainder of this tutorial)
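The steps above can be sketched end-to-end in pure Python. The stop word set and the lemma table below are tiny hypothetical stand-ins for NLTK's stop word corpus and WordNet lemmatizer, and the thresholds follow the slide; this is an illustration of the pipeline, not the STM code.

```python
import re
from collections import Counter

# Tiny stand-ins for the real NLTK resources (hypothetical, for illustration)
STOP_WORDS = {"the", "a", "in", "und", "mit"}
LEMMAS = {"went": "go", "going": "go", "goes": "go"}  # WordNet stand-in
MIN_COUNT = 5  # final rare-word cutoff used in this tutorial

def build_dictionary(documents, min_count=MIN_COUNT):
    # 1-2. Read all text values into one string, then tokenize into words
    tokens = re.findall(r"\S+", " ".join(documents).lower())
    # 3. Keep only tokens that contain at least one letter or digit
    tokens = [t for t in tokens if re.search(r"[a-z0-9]", t)]
    # 4. Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Drop rare small words: fewer than 2 letters AND seen < 10 times
    counts = Counter(tokens)
    tokens = [t for t in tokens if not (len(t) < 2 and counts[t] < 10)]
    # 6. Lemmatize: map inflected forms to their stem word
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # 7. Keep only words appearing at least min_count times
    final = Counter(tokens)
    return sorted(w for w, c in final.items() if c >= min_count)

dictionary = build_dictionary(["the ipod went fast"] * 5)
# dictionary -> ['fast', 'go', 'ipod']  ("the" removed, "went" lemmatized)
```

Note how the lemmatizer collapses "went" into "go" before the final frequency cutoff is applied, so inflected variants pool their counts, which is precisely why lemmatization precedes the MIN threshold in the step list.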
49. Build the Dictionary (or Term Vector)
For the purpose of automatic dictionary building and data preprocessing we developed the Salford Text Mining (STM) software, a stand-alone collection of tools that performs all the essential steps in preparing text documents for text mining. STM builds on the Python Natural Language Toolkit (NLTK). From NLTK we use the following tools:
- Tokenizer (extracts items most likely to be words)
- Porter Stemmer (recognizes different simple forms of the same word, e.g. plurals)
- WordNet lemmatizer (more complex recognition of same-word variations)
- Stop word list (words that contribute little to no value, such as "the", "a")
Future versions of STM might use other tools to accomplish these essential tasks. stm.exe is a command line utility that must be run from a Command Prompt window (assuming you are running Windows, go to the Start > All Programs > Accessories > Command Prompt menu). The version provided here resides in the stmtutor\STM\bin folder.
50. STM Commands and Options
Open a Command Prompt window in Windows, then CD to the stmtutor\STM folder location; for example, on our system you would type in:
cd c:\stmtutor\STM
To obtain help, type the following at the prompt: bin\stm --help
This command will return very concise information about STM:
stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE] [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE] etc.
The details for each command line option are contained in the software manual appearing in the appendix.
You will also notice the stm.cfg configuration file; this file controls the default behavior of the STM module and relieves you of specifying a large number of configuration options each time stm.exe is launched.
Note the TEXT_VARIABLES : ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE line, which specifies the names of the text variables to be processed.
51. Create Dictionary Options
For the purposes of this tutorial, we have prepackaged all of the text processing steps into individual command files (extension *.bat). You can either double-click on the referenced command file or alternatively type its contents into the Command Prompt window opened in the directory that contains the files.
The most important arguments for our purposes now are:
--dataset DATAFILE: name and location of your input CSV format data set
--dictionary DICTFILE: name and location of the dictionary to be created
These two arguments are all you need to create your dictionary. By default, STM will process every text field in your input data set to create a single omnibus dictionary.
Simply double-click on stm_create_dictionary.bat to create the dictionary file for the DMC 2006 dataset, which will be saved in the dmc2006_ynm.dict file in the stmtutor\STM\dmc2006 folder.
In typical text mining practice the process of generating the final dictionary will be iterative. A review of the first dictionary might reveal further words you wish to exclude (stop words).
52. Internal Dictionary Format
The dictionary file is a simple text file with extension *.dict.
The file contents can be viewed and edited in a standard text editor.
The name of the text mining variable that will be created later on appears on the left of the = sign on each un-indented line.
The default value that will be assigned to this variable appears on the right side of the = sign of the un-indented lines; it usually means the absence of the word(s) of interest.
Each indented line represents the value (left of the =) which will be entered for a single occurrence in a document of any of the word(s) appearing on the right of the =.
More than one occurrence will be recorded as "many" when requested (always the case in this tutorial).
53. Hand Made Dictionary
To use a multi-level coding you need to create a hand made dictionary, which is already supplied to you as hand.dict in the stmtutor\STM\dmc2006 folder.
Here is an example of an entry in this file:
hand_model=standard
    mini
    nano
    standard
The un-indented line of an entry starts with the name we wish to give to the term (HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with the default value of "standard".
The remaining indented entries are listed one per line and are an exhaustive list of the acceptable values which the term HAND_MODEL can receive in the term vector.
Another coding option is, for example:
hand_unused=no
    yes=unbenutzt,ungeoffnet
which sets "no" as the default value but substitutes "yes" if one of the two values listed above is encountered.
You may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your own; all of them were created manually based on common sense logic.
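A minimal parser makes the two entry styles concrete. This is an illustrative sketch of the layout as described above, not STM's actual reader:

```python
def parse_dict(text):
    """Parse *.dict entries: an un-indented `name=default` line starts an
    entry; each indented line lists one acceptable value, optionally as
    `value=word1,word2` mapping several word variants to one value."""
    entries = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():                 # un-indented: new term
            name, default = line.split("=", 1)
            current = entries[name.strip()] = {"default": default.strip(),
                                               "values": {}}
        else:                                     # indented: one value
            body = line.strip()
            if "=" in body:
                value, words = body.split("=", 1)
                current["values"][value.strip()] = [w.strip()
                                                    for w in words.split(",")]
            else:                                 # bare value lists itself
                current["values"][body] = [body]
    return entries
```

Running it on the hand_unused example above yields default "no" and the value "yes" mapped to its two German word variants.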
54. Why Create Hand Made Dictionary Entries
Let's revisit the variable HAND_MODEL, which brings together the terms standard, mini, and nano.
Without a hand made dictionary entry we would have three terms created, one for each model type, with yes and no values, and possibly many.
By creating the hand made entry we:
Ensure that every auction is assigned a model (default=standard)
Bring all three models together into one categorical variable with three possible values: standard, mini, and nano
This representation of the information is helpful when using tree-based learning machines but not helpful for regression-based learning machines. The best choice of representation may vary from project to project.
Salford regression-based learning machines automatically repackage categorical predictors into 0/1 indicators, meaning that you can work with one representation. But if you need to use other tools you may not have this flexibility.
55. Further Dictionary Customization
The following table summarizes some of the important fields introduced in the custom dictionary for this tutorial:

Variable | Values | Combines word variants
CAPACITY | 20 | 20gb, 20 gb, 20 gigabyte
         | 30 | 30gb, 30 gb, 30 gigabyte
         | 40 | 40gb, 40 gb, 40 gigabyte
         | 80 | 80gb, 80 gb, 80 gigabyte
STATUS | Wieneu | Wie neu, super gepflegt, top gepflegt, top zustand, neuwertig
       | Neu | neu, new, brandneu, brandneues
       | Unbenutzt | Unbenu
       | defekt | defekt., --defekt--, defekt, -defekt-, -defekt, defekter, defektes
MODEL | Mini, nano, standard | Captures presence of the corresponding word in the auction description
COLOR | Black, white, green, etc. | Captures presence of the corresponding words or variants in the auction description
IPOD_GENERATION | First, second, etc. | Identified iPod generation from the information available in the text description
56. Final Stage Dictionary Extraction
To generate a final version of the dictionary, in most real world applications you would also need to prepare an expanded list of stopwords.
The NLTK provides a ready-made list of stopwords for English and another 14 major languages spanning Europe, Russia, Turkey, and Scandinavia. These appear in the directory named stmtutor\STM\data\corpora\stopwords and should be left as they are.
Additional stopwords, which might well vary from project to project, can be entered into the file named stopwords.dat in the stmtutor\STM\data folder.
In the package distributed with this tutorial the stopwords.dat file is empty.
You can freely add words to this file, with one stopword per line.
Once the custom stopwords.dat and hand.dict files have been prepared, you just run the dictionary extraction again but with the --source-dictionary argument added (see the command files introduced in the later slides).
The resulting dictionary will now include all the introduced customizations.
57. Creating Structured Text Mining Variables
The resulting dictionary file dmc2006_ynm.dict contains about 600 individual stems.
In the final step of text processing the dictionary is applied to each document entry.
Each stem from the dictionary is represented by a categorical variable (usually binary) with the corresponding name.
The preparation process checks whether any of the known word variants associated with each stem from the dictionary are present in the current auction description; if yes, the corresponding value is set to "yes", otherwise it is set to "no".
When the --code YNM option is set, multiple instances of "yes" will be coded as "many".
You can also request integer codes 0, 1, 2 in place of the character yes/no/many values.
We have experimented with alternative variants of coding (see the --code help entry in the STM manual) and came to the conclusion that the YNM approach works best in this tutorial. Feel free to experiment with alternative coding schemas on your own.
The resulting large collection of variables will be used as additional predictors in our modeling efforts.
Even though other, more computationally intense text processing methods exist, further investigation failed to demonstrate their utility on the current data, which is most likely related to the extremely terse nature of the auction descriptions.
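The YNM coding step can be illustrated with a short sketch. The dictionary contents and variable names here are hypothetical, and STM's actual matcher also handles multi-word variants, which this toy whitespace split does not:

```python
def code_document(description, dictionary, scheme="YNM"):
    """Code one auction description against a term dictionary.
    `dictionary` maps a variable name to its known word variants;
    YNM coding yields no/yes/many by counting occurrences, while
    any other scheme yields integer counts capped at 2 (0/1/2)."""
    words = description.lower().split()
    coded = {}
    for name, variants in dictionary.items():
        hits = sum(words.count(v) for v in variants)
        if scheme == "YNM":
            coded[name] = "no" if hits == 0 else ("yes" if hits == 1 else "many")
        else:
            coded[name] = min(hits, 2)
    return coded
```

For the description "ipod nano 20gb nano neu" and variants {"NANO": ["nano"], "CAPACITY_20": ["20gb"]}, NANO codes as "many" and CAPACITY_20 as "yes".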
58. Creating Additional Variables
Finally, we spent additional effort on reorganizing the original raw variables into more useful measures:
MONTH_OF_START, based on the recorded start date of the auction
MONTH_OF_SALE, based on the recorded closing date of the auction
HIGH_BUY_IT_NOW, set to "yes" if BUY_IT_NOW_PRICE exceeds CATEGORY_AVG_GMS, as suggested by common sense and the nature of the classification problem
In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that option was not available; we reset all such 0s to missing.
All of these operations are encoded in the preprocess.py Python file located in the stmtutor\STM\dmc2006 folder.
This component of the STM is under active development.
The file is automatically called by the main STM utility.
You may add to or modify the contents of this file to allow alternative transformations of the original predictors.
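A sketch of these transformations follows. The real logic lives in preprocess.py; the field names follow the slide, but the date fields and the constant CATEGORY_AVG_GMS used here are illustrative assumptions (in the real data the category average varies by record):

```python
from datetime import date

CATEGORY_AVG_GMS = 55.0   # hypothetical category average, for illustration only

def derive(auction):
    """Derive the extra measures described above from one auction record."""
    out = dict(auction)
    out["MONTH_OF_START"] = auction["START_DATE"].month
    out["MONTH_OF_SALE"] = auction["CLOSE_DATE"].month
    price = auction["BUY_IT_NOW_PRICE"]
    out["BUY_IT_NOW_PRICE"] = None if price == 0 else price   # 0 -> missing
    out["HIGH_BUY_IT_NOW"] = ("yes" if price and price > CATEGORY_AVG_GMS
                              else "no")
    return out
```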
59. Generation of the Analysis Data Set
At this point we are ready to move on to the next step, which is data creation.
This is nothing more than appending the relevant columns of data to the original data set. Remember that the dictionary may contain tens of thousands if not hundreds of thousands of terms.
For the DMC2006 dataset the dictionary is quite small by text mining standards, containing just a little over 600 words.
To generate the processed dataset simply double-click on the stm_ynm.bat command file or explicitly type in its contents in the Command Prompt:
The --dataset option specifies the input dataset to be processed
The --code YNM option requests the yes/no/many style of coding
The --source-dictionary option specifies the hand dictionary
The --process option specifies the output dataset
Of course you may add other options as you prefer.
This creates a processed dataset with the name dmc2006_res_ynm.csv which resides in the stmtutor\STM\dmc2006 folder.
60. Analysis Data Set Observations
At this point we have a new modeling dataset with the text information represented by the extra variables.
Note that the raw input data set is just shy of 3 MB in a plain text format, while the prepared analysis data set is about 40 MB, 13 times larger.
Process only training data or all data?
For prediction purposes all data needs to be processed, both the data that will be used to train the predictive models and the holdout or future data that will receive predictions later.
In the DMC2006 data we happen to have access to both training and holdout data and thus have the option of processing all the text data at the same time.
Generating the term vector based only on the training data would generally be the norm because future data flows have not yet arrived.
In this project we elected to process all the data together for convenience, knowing that the train and holdout partitions were created by random division of the data.
It is worth pointing out, though, that a final dictionary generated from training data only might be slightly different due to the infrequent word elimination component of the text processor.
61. Quick Modeling Round with CART
We are now ready to proceed with another CART run, this time using all of the newly created text fields as additional predictors.
Assuming that you already have SPM launched:
Go to the File > Open > Data File menu
Make sure that the Files of Type is set to ASCII
Highlight the dmc2006_res_ynm.csv dataset
Press the [Open] button
62. Dataset Summary Window
Again, the resulting window summarizes basic facts about the dataset.
Note the dramatic increase in the number of available variables.
63. The View Data Window
Press the [View Data] button to have a quick look at the physical contents of the dataset.
Note how the individual dictionary word entries are now coded with the yes, no, or many values for each document row.
64. Setting Up CART Model
Proceed with setting up a CART modeling run as before:
Make the Classic Output window active
Go to the Model > Construct Model menu (alternatively, you could use one of the buttons located on the bar right below the menu)
In the resulting Model Setup window make sure that the Analysis Method is set to CART
In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
Check GMS_GREATER_AVG as the Target
Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
You should see something similar to what is shown on the next slide.
65. Model Setup Window: Model Tab
66. Model Setup Window: Testing Tab
Switch to the Testing tab and confirm that 10-fold cross-validation is used as the optimal model selection method.
67. Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records for the parent nodes at 15 and for the child nodes at 5.
These limits were chosen to avoid extremely small nodes in the resulting tree.
68. Building CART Model
Press the [Start] button; a progress window will appear for a while and then the Navigator window containing the model results will be displayed (this time, the process takes a few minutes!).
Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in green.
Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree.
69. CART Model Performance
The selected CART model contains 102 terminal nodes, and nearly all available predictor variables play a role in the tree construction.
The area under the ROC curve (Test) is now an impressive 0.830, especially when compared to the 0.748 reported earlier for the basic CART run or the 0.800 for the basic TN run.
Press the [Summary Reports] button in the Navigator window, select the Prediction Success tab, and finally press the [Test] button to see cross-validated test performance at 76.58% classification accuracy: a significant improvement!
Also note the presence of the original and derived variables on the list shown in the Variable Importance tab.
70. Setting Up TN Model
Now switch to the Classic Output window and go to the Model > Construct Model menu.
Choose TreeNet as the Analysis Method.
In the Model tab make sure that the Tree Type is set to Logistic Binary.
71. Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
Set the Learnrate: to 0.05
Set the Number of trees to use: to 800
Leave all of the remaining options at their default values
72. TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear in the end, though you might want to take a coffee break while the modeling run completes.
73. Checking TN Performance
Press the [Summary] button and switch to the Prediction Success tab.
Press the [Test] button to view cross-validation results.
Lower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART and TN model performance).
You can clearly see the improvement!
74. Requesting TN Graphs
Here we present a sample collection of all 2-D contribution plots produced by TN for the resulting model.
The plots are available by pressing the [Display Plots] button in the TreeNet Results window.
The list is arranged according to the variable importance table.
75. More Graphs
76. Insights Suggested by the Model
Here is a list of insights we arrived at by looking into the selection of plots:
There is a distinct effect of the iPod category once all the other factors have been accounted for
A larger start price means an above-average sale (most likely related to the quality of the item)
A new and unpacked item should fetch a better price, while any defect brings the price down
End of the year means better sales
Having a good feedback score is important
It is best to wait 10 days or more before closing the deal
Interestingly, the 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th; the 2G started to fall out of favor in 2005-2006
Black is much more popular in Germany than other colors
Mentioning photo, video, color display, etc. helps get a better price
The paid advertising features are of little or marginal importance
77. Final Validation of Models
At this point we are ready to check the performance of all our models using the remaining 8,000 auctions originally not available for training.
This way each model can be positioned with respect to all of the official 173 entries originally submitted to the DMC 2006 competition.
However, in order to proceed with the evaluation, we must first score the input data using all of the models we have generated up until now.
The following slides explain how to score the most recently constructed CART and TN models; the earlier models can be scored using similar steps.
You may choose to skip the scoring steps, as we have already included the results of scoring in the stmtutor\STM\scored folder:
Score_cart_raw.csv: simple CART model predictions
Score_tn_raw.csv: simple TN model predictions
Score_cart_txt.csv: text mining enhanced CART model predictions
Score_tn_txt.csv: text mining enhanced TN model predictions
78. Scoring a CART Model
Select the Navigator window for the model you wish to score.
Select the tree from the tree sequence (in our runs we pick the 1SE trees as more robust).
Press the [Score] button to open the Score Data window.
Make sure that the Data file is set to dmc2006_res_ynm.csv; if not, press the [Select] button on the right and select the dataset to be scored.
Place a checkmark in the Save results to a file box, then press the [Select] button right next to it; this will open the Save As window.
Navigate to the stmtutor\STM\scored folder under the Save in: selection box, enter Scored_cart_txt.csv in the File name: text entry box, and press the [Save] button.
You should now see something similar to what's shown on the next slide.
Press the [OK] button to initiate the scoring process.
You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored folder.
79. Scoring CART
80. Scoring a TN Model
Select the TreeNet Results window for the model you wish to score.
Go to the Model > Score Data menu to open the Score Data window.
Make sure that the Data file is set to dmc2006_res_ynm.csv; if not, press the [Select] button on the right and select the dataset to be scored.
Place a checkmark in the Save results to a file box, then press the [Select] button right next to it; this will open the Save As window.
Navigate to the stmtutor\STM\scored folder under the Save in: selection box, enter Scored_tn_txt.csv in the File name: text entry box, and press the [Save] button.
You should now see something similar to what's shown on the next slide.
Press the [OK] button to initiate the scoring process.
You should now have the Scored_tn_txt.csv file in the stmtutor\STM\scored folder.
81. Scoring TN
82. Using STM to Validate Performance
We can now use the STM machinery to do the final model validation.
Simply double-click the stm_validate.bat command file to proceed.
Note the use of the following options inside the command file:
-score specifies the output dataset where the model predictions will be written
--score-column specifies the name of the variable containing the actual model predictions (these variables are produced by CART or TN during the scoring process)
--check specifies the name of the dataset that contains the originally withheld values of the target; this dataset was used by the organizers of the DMC 2006 competition to select the actual winners
STM is currently configured to validate only the bottom 8,000 of the 16,000 predictions generated by the model; the top 8,000 records (used for learning) are simply ignored.
The results will be saved into text files with the extension *.result appended to the original score file names in the stmtutor\STM\scored folder.
83. Validation Results Format
The following window shows the validation results of the final TN model we built.
8,000 validation records were scored, of which:
719 ones were misclassified as zeroes
807 zeroes were misclassified as ones
Thus 1,526 documents were misclassified.
This gives the final score of 8,000 - (1,526 * 2) = 4,948.
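The scoring rule is easy to verify in a couple of lines:

```python
def dmc_score(n_validation, missed_ones, missed_zeroes):
    """DMC 2006 scoring rule used above: start from the number of
    validation records and subtract two points per misclassification."""
    return n_validation - 2 * (missed_ones + missed_zeroes)

# Final TN model from the slide: 719 + 807 = 1,526 misclassified
score = dmc_score(8000, 719, 807)   # -> 4948
```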
84. Final Validation of Models
Based on the predicted class assignments, the final performance score is calculated as 8,000 minus twice the total number of auction items misclassified.
The following table summarizes how these virtually out-of-the-box elementary models perform on the holdout data (the values are extracted from the four *.result files produced by the STM validator):

Model | ROC Area | Missed 0s | Missed 1s | Score
CART raw data | 75% | 1123 | 1387 | 2980
TN raw data | 80% | 1308 | 926 | 3532
CART text data | 83% | 981 | 848 | 4342
TN text data | 89% | 807 | 719 | 4948
85. Visual Validation of the Results
The following graph summarizes the positioning of the four basic models (TN text, CART text, TN raw, CART raw) with respect to the 173 official competition entries.
The TN model with text mining processing is among the top 10 winners!
86. Observations on the Results
We used the most basic form of text mining, the Bag of Words, with minor emendations.
None of the authors speaks German, although we did look up some of the words in an online dictionary. If there are any subtleties to be picked up from seller wording choices, we would have missed them.
We chose the coding scheme that performed best on the training data. We have six coding options and one stands out as clearly best.
We used common settings for the controls for CART and TreeNet.
We did not use any of the modeling refinement techniques we teach in our CART and TreeNet tutorials.
We thus invite you to see if you can push the performance of these models even higher.
87. Command Line Automation in SPM
SPM has a powerful command line processing component which allows you to completely reproduce any modeling activity by creating and later submitting a command file.
We have packaged the command files for the four modeling and scoring runs you have conducted in the course of this tutorial.
SPM command files must have the extension *.cmd.
The four command files are stored in the stmtutor\STM\dmc2006 folder.
You can create, open, or edit a command file using a simple text editor, like Notepad. SPM has a built-in editor; just go to the File > New Notepad menu.
You may also access the command line directly from inside the SPM GUI; just make sure that the File > Command Prompt menu item is checked.
Just type in "help" in the Command Prompt part (starts with the > mark) of the Classic Output window to get the listing of all available commands.
Then you can request more detailed help for any specific command of interest; for example, "help battery" will produce a long list of the various batteries of automated runs available in SPM.
Furthermore, you may view all of the commands issued during the current session by going to the View > Open Command Log menu; this way you can quickly learn which commands correspond to the recent GUI activity you were involved with.
88. Basic CART Model Command File
You may now restart SPM to emulate a fresh run.
Go to the File > Open Command File menu.
Select the cart_raw.cmd command file and press the [Open] button.
The file is now opened in the built-in Notepad window.
89. CART Command File Contents
OUT saves the classic output into a text file
USE points to the modeling dataset
GROVE saves the model as a binary grove file
MODEL specifies the target variable
CATEGORY indicates which variables are categorical, including the target
KEEP specifies the list of predictors
LIMIT sets the node limits
ERROR requests cross-validation
BUILD builds a CART model
SAVE names the file where the CART model predictions will be saved
HARVEST specifies which tree is to be used in scoring
IDVAR requests saving of the additional variables into the output dataset
SCORE scores the CART model
OUTPUT * closes the current text output file
Note the use of relative paths in the GROVE and SAVE commands. Also note the use of the forward slash / to separate folder names.
90. Submitting Command File
With the Notepad window active, go to the File > Submit Window menu to submit the command file into SPM.
In the end you will see the Navigator and the Score windows opened, which should be identical to the ones you have already seen at the beginning of this tutorial.
Furthermore, you should now have:
cart_raw.dat: a text file created in the stmtutor\STM\dmc2006 folder; it contains the classic output you normally see in the Classic Output window
cart_raw.grv: a binary grove file created in the stmtutor\STM\models folder; it contains the CART model itself and can be opened in the GUI using the File > Open > Open Grove menu, which reopens the Navigator window; this file will also be needed for future scoring or translation
Score_cart_raw.csv: a data file created in the stmtutor\STM\scored folder; it contains the selected CART model predictions on your data
You may proceed now with opening the tn_raw.cmd file using the File > Open Command File menu.
91. TN Command File Contents
OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT: same as in the CART command file introduced earlier
MART TREES sets the TN model size in trees
MART NODES sets the tree size in terminal nodes
MART MINCHILD sets the minimum individual node size in records
MART OPTIMAL sets the evaluation criterion that will be used for optimal model selection
MART BINARY requests logistic regression processing in our case
MART LEARNRATE sets the learnrate parameter
MART SUBSAMPLE sets the sampling rate
MART INFLUENCE sets the influence trimming value
The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type in "help mart" to get full descriptions.
92. Submitting the Rest of the Command Files
Again, with the current Notepad window active, use the File > Submit Window menu to launch the basic TN modeling run, automatically followed by scoring.
This will create the output, grove, and scored data files in the corresponding locations for the chosen TN model; also note the use of the EXCLUDE command in place of the KEEP command inside the command file, which saves a lot of typing.
Now go back to the Classic Output window and notice that the File menu has changed.
Go to the File > Submit Command File menu, select the cart_txt.cmd command file, and press the [Open] button.
Notice the modeling activity in the Classic Output window, but no Results window is produced; this is how the Submit Command File menu item differs from the Submit Window menu item used previously. Nonetheless, the output, grove, and score files are still created in the specified locations.
Use the File > Open > Open Grove menu to open the tn_raw.grv file located in the stmtutor\STM\models folder; you will need to navigate into this folder using the Look in: selection box in the Open Grove File window.
You may now proceed with the final TN run by submitting the tn_txt.cmd command file using either the File > Open Command File / File > Submit Window or the File > Submit Command File menu routes. Don't forget that it does take a long time to run!
93. Final Remarks
This completes the Salford Systems Data Mining and Text Mining tutorial.
In the process of going through the tutorial you have learned how to use both the GUI and command line facilities of SPM as well as the command line text mining facility STM.
You managed to build two CART models and two TN models, as well as enrich the original dataset with a variety of text mining fields.
The final model puts you among the top winners in a major text mining competition: a proud achievement.
Even though we have barely scratched the surface, you are now ready to proceed with exploring the remainder of the vast data mining activities offered within SPM and STM on your own.
We wish you the best of luck on the exciting and never-ending road of modern data analysis and exploration.
And don't forget that you can always reach us at www.salford-systems.com should you have further modeling questions and needs.
94. References
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2000). The Elements of Statistical Learning. Springer.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
Weiss, S.M., Indurkhya, N., Zhang, T., and Damerau, F.J. (2004). Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer.
95. STM Command Reference
Salford Text Miner is a simple utility intended to make the text mining process much easier. The application described in this manual takes a number of parameters and can execute Salford Predictive Miner as the data mining back end.
STM workflow:
Automatically generate a dictionary based on the dataset
Process the dataset and generate a new one with additional columns based on the dictionary
Generate a model folder with the dataset, command file, and dictionary
Run Salford Predictive Miner with the generated command file
Run the checking process, comparing results from scoring with the real classes
All of these steps can be done in separate STM calls or in one call.
96. STM Command Reference

Short Option | Long Option | Description
-data DATAFILE | --dataset DATAFILE | Specify the dataset to work with
-dict DICTFILE | --dictionary DICTFILE | Specify the dictionary to work with
-source-dict SDFILE | --source-dictionary SDFILE | Dictionary used as the source for the automatic dictionary retrieval process
-score SFILE | --scoreresult SFILE | Specify the file with the score result, for the checking process; default score.csv
-spm SPMAPP | --spmapplication SPMAPP | Path to the spm application; default spm.exe
-t TARGET | --target TARGET | Target variable used to generate the command file; default GMS_GREATER_AVG
-ex EXCLUDE | --exclude EXCLUDE | List of variables to exclude from the keep list when generating the command file
-cat CATEGORY | --category CATEGORY | List of variables to mark as categorical when generating the command file
97. STM Command Reference

Short Option | Long Option | Description
-templ CMDTEMPL | --cmdtemplate CMDTEMPL | Template of the command file used for generation; default data/template.cmd
-md MODEL_DIR | --modeldir MODEL_DIR | Directory where model folders will be created; default models
-trees TREES | --trees TREES | Parameter for TreeNet command files: number of trees to build; default 500
-maxnodes MAXNODES | --maxnodes MAXNODES | Parameter for TreeNet command files: number of nodes per tree; default 6
-fixwords | --fixwords | Enables heuristics that try to fix words (find nearest by different metrics, spell checking, etc.)
-textvars VARLIST | --text-variables VARLIST | Comma-separated list of variables used in the dictionary retrieval process
98. STM Command Reference

-outrmwords, --output-removed-words
    Enables writing removed stop words to the file data/removed.dat.
-code CODE, --column-coding CODE
    Specify how to code the absence/presence of a word in a row:
        YN or 0       no/yes
        YNM or 1      no/yes/many
        01 or 2       0/1
        012 or 3      0/1/2
        TF or 4       term frequency
        IDF or 5      inverse document frequency
        TF-IDF or 6   TF-IDF
        TC or 7       term count (0, 1, 2, ...)
    Default: YN
-mp MODELPATH, --model-path MODELPATH
    Specify the path where model files will be created.
-cmd-path CMDPATH, --command-file-path CMDPATH
    Specify the path to the command file that will be executed by Salford Predictive Miner.
-ppfile PPFILE, --preprocess-file PPFILE
    Path to Python code that will be executed on the process step to manipulate the data.
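To make the coding options concrete, here is a minimal sketch of several of the schemes applied to a toy set of rows. This is not STM's own implementation: the TF and IDF formulas below are common textbook variants (STM's exact normalization is not documented here), and the example rows are invented for illustration.

```python
import math

# Toy "dataset": each row is the tokenized text of one record.
rows = [
    ["ipod", "nano", "ipod"],
    ["nano", "case"],
    ["charger"],
]

def term_count(row, word):
    """TC coding: raw count of the word in the row (0, 1, 2, ...)."""
    return row.count(word)

def yn(row, word):
    """YN coding: absence/presence of the word."""
    return "Y" if word in row else "N"

def ynm(row, word):
    """YNM coding: absent / present once / present many times."""
    n = row.count(word)
    return "N" if n == 0 else ("Y" if n == 1 else "M")

def tf(row, word):
    """TF coding: word count normalized by row length (one common variant)."""
    return row.count(word) / len(row)

def idf(word):
    """IDF coding: log of total rows over rows containing the word."""
    df = sum(1 for row in rows if word in row)
    return math.log(len(rows) / df) if df else 0.0

def tf_idf(row, word):
    """TF-IDF coding: product of the two measures above."""
    return tf(row, word) * idf(word)
```

For the first row, "ipod" occurs twice, so YN gives "Y", YNM gives "M", TC gives 2 and TF gives 2/3; a word absent from a row always gets a TF-IDF of zero.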
99. STM Command Reference

-rc NAME, --realclass-column-name NAME
    Specify the column name in the real-class dataset for the check step. Default: GMS_GREATER_AVG
-e, --extract
    Run the first step: automatic extraction of the dictionary from the dataset. Requires --dataset.
-p OUTFILE, --process OUTFILE
    Run the second step: process the dataset and create a new dataset named OUTFILE, with new columns created according to the dictionary. Requires --dataset and --dictionary.
-g, --generate
    Run the third step: generate the model folder with the command file. Requires --dataset and --dictionary.
-m, --model
    Run the fourth step: run Salford Predictive Miner with the generated command file. Works only together with --generate.
-c DATASET, --check DATASET
    Run the fifth step: check the score file against the real classes (from the specified real-class dataset) and output a misclassification table. Requires --scoreresult.
-h, --help
    Show help.
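The five workflow steps map directly onto these flags. As an illustration, the sketch below assembles the command line for each step as a Python argv list; the executable name "stm" and the file names are assumptions for the example, not values taken from the manual.

```python
# Sketch: building the five STM workflow calls from the documented step
# flags. "stm" is a placeholder for however the STM binary is invoked
# on your system.
STM = "stm"

def extract_cmd(dataset):
    # Step 1: extract the dictionary from the dataset (-e / --extract).
    return [STM, "-e", "-data", dataset]

def process_cmd(dataset, dictionary, outfile):
    # Step 2: create a new dataset with coded word columns (-p / --process).
    return [STM, "-p", outfile, "-data", dataset, "-dict", dictionary]

def generate_cmd(dataset, dictionary):
    # Step 3: generate the model folder and command file (-g / --generate).
    return [STM, "-g", "-data", dataset, "-dict", dictionary]

def model_cmd(dataset, dictionary):
    # Step 4: run Salford Predictive Miner (-m / --model); per the
    # reference, this works only together with the generate step.
    return generate_cmd(dataset, dictionary) + ["-m"]

def check_cmd(realclass_dataset, scorefile):
    # Step 5: compare the score file with the real classes (-c / --check).
    return [STM, "-c", realclass_dataset, "-score", scorefile]
```

Each list can be handed to subprocess.run() to execute the step, or the flags can be combined in a single STM call as noted earlier.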
100. STM Configuration File

SPM_APPLICATION
    Path to Salford Predictive Miner. Default: spm.exe
CMD_TREES
    Number of trees to build in TN models. Default: 500
CMD_NODES
    Tree size for TN models. Default: 6
CMD_TEMPLATE
    Command file template. Default: data/template.cmd
MODELS_DIR
    Directory where model folders will be created. Default: models
LANGUAGES
    Languages whose stop words will be used. Default: English, German
SPELLCHECKER_DICT
    Additional spell checker dictionary, with words that are allowed (like ipod). Default: data/spellchecker_dict.dat
SPELLCHECKER_LANGUAGE
    Language for the spell checker. Default: de_DE
ADDITIONAL_STOPWORDS
    File with additional stop words, which the user can edit. Default: data/stopwords.dat
REMOVED_WORDS_FILE
    File where removed words will be written on the extract step. Default: data/removed.dat
WORD_FREQUENCY_THRESHOLD
    Lower word frequency threshold; words below it will be deleted on the extract step. Default: 5
PREPROCESS_FILE
    Script included to do additional processing. Default: dmc2006/preprocess.py
101. STM Configuration File

CHECK_RESULTS_FILE
    Default: data/score_results.csv
LOGFILE
    Path to the log file; can be a mask (%s for the date). Default: log/stm%s.log
TARGET
    Default value for the target argument, used to fill the command file template. Default: GMS_GREATER_AVG
EXCLUDE
    Default value for the keep/exclude argument, used to fill the command file template. Default: AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, GMS_GREATER_AVG
CATEGORY
    Default value for the category argument, used to fill the command file template. Default: GMS_GREATER_AVG
SCORE_FILE
    Name of the score file that needs to be checked. Default: Score.csv
TEXT_VARIABLES
    Comma-separated list of text variables in the dataset. Default: ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
DEFAULT_CODING
    Default coding for the extract and preprocess steps. Default: YN
REALCLASS_COLUMN_NAME
    Name of the column in the real-class file, used in the check step. Default: GMS_GREATER_AVG
SCORE_COLUMN_NAME
    Name of the column in the score file, used in the check step. Default: PREDICTION
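Putting the two tables together, a configuration file that overrides a few of these defaults might look like the fragment below. The KEY = VALUE syntax shown here is an assumption for illustration; check the configuration file shipped with STM for the exact format it expects.

```
SPM_APPLICATION = spm.exe
CMD_TREES = 1000
CMD_NODES = 6
LANGUAGES = English, German
TARGET = GMS_GREATER_AVG
DEFAULT_CODING = TF-IDF
TEXT_VARIABLES = ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
```

Settings not listed in the file keep the defaults from the tables above.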