TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXTMINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXTMINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXTMINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXTMINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXTMINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING TEXT MINING
Beyond Search-and-Retrieval: Enterprise Text Mining
with SAS®
White Paper
by Stephen E. ArnoldArnoldIT.com
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Table of Contents
1. Enhancing the usefulness of existing information in an enterprise .............1a. Integratediscoveryindatabaseswithunstructuredtext........................................... 1
b. Differencebetweensearchanddiscovery............................................................... 2
c. HowSAS®TextMinerworksasadiscoverysearchtool......................................... 2
d.HowSAS®TextMinerenhancesthediscoveryprocess.......................................... 2
2. SAS® Text Miner is part of the SAS® Enterprise Intelligence Platform ......4a. Overview.................................................................................................................. 4
b. SAS®createsaninformationcircuitfororganizations............................................. 5
3. SAS – The company ........................................................................................6a. History...................................................................................................................... 6
b. Customers................................................................................................................. 7
4. Data mining products available from SAS .....................................................9a. HighlightingtheSAS®TextMinersolution............................................................ 9
i. Textualdatacomesinmanyflavors..................................................................................................9ii. Competitivedifferentiators..............................................................................................................10iii.Keyfeatures....................................................................................................................................10iv. SAS®TextMinerunderthecovers...................................................................................................10v. Reducingthemanualworkoftextmining.......................................................................................12
b. HighlightingSAS®EnterpriseMiner™.................................................................. 13i. WhatisthedifferencebetweenSASTextMinerandSASEnterpriseMiner?................................13ii. Modelscoring................................................................................................................................16iii.Customizingapplications...............................................................................................................17
c. Howdotheseproductsinterrelate?........................................................................ 17
5. SAS® Text Miner Feature Snapshot ..............................................................18
6. Conclusion ......................................................................................................20
7. Appendix ........................................................................................................21
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
ii
Formoreinformationaboutsearchandtextmining,visitwww.arnoldit.com.
Noportionofthisdocumentmaybereproducedwithoutthepriorwritten
permissionofStephenE.Arnold.
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Enterprisesearchanditslaundrylistsofresultsareincreasinglyfrustratingtousersofenterprisesearchsystems.Whatcanan
organizationdotobreakthroughthelogjamofoften-irrelevantresultsand“missing”information?Howcananorganization
make“findinginformation”reducecostsinsteadofincreasingthem?Whatcanyoudotosqueezemorevaluefromthedigital
dataandelectronicdocumentsyouhave?SAShasananswer:textanalytics.
1. Enhancing the usefulness of existing information in an enterprise
a. Integrate discovery in databases with unstructured text
Enterprisesearchisundergoinganimportantchange.Systemsusingbruteforcematchingofthewordsinauser’s
queryarebecomingmorepliantanddiscovery-centric.Textminingiscomingtoanenterprisesearchsystemnearyou.
Itsarrivalislongoverdue.Asystemthatcanseamlesslyintegratediscoveryindatabasesandunstructuredtextcanpay
hugedividendsquickly.(SeeStephenArnold’sarticleintheNovember2006issueofCXOAmerica:
http:/www.cxoamerica.com/pastissue/printarticle.asp?art=25408.)
ManyCFOs,CEOs,CIOsandCTOs(hereinafter,CxOs)havelearnedthepainfulwaythatintheirorganizations,text-
centricinformationproblemsarecommon–andexpensivetoaddressusingtraditionalmanagementtechniquessuch
ashiringtemporarystaff,outsourcingortryingtogetexistinginformationtechnologytohandlecustomersupport,
marketingandlegaltasks.Thisanalysiswilloutlinetheemergingtopicoftextminingasapotentialanswertoreduce
thedissatisfactionwithexistingmanualprocesses.Textminingprovidesawaytoenhancetheusefulnessofexisting
informationinanenterprise.Byaddingvaluetoitsinformation,acompanycanhelpeveryoneintheorganization
worksmarter.Dataminingistheprocessofdataselection,explorationandbuildingmodelsusingvastdatastoresto
uncoverpreviouslyunknownpatterns.Whatdoesthismeantoyou?Youcanproducenewknowledgetobetterinform
decisionmakersbeforetheyact.Buildamodeloftherealworldbasedondatacollectedfromavarietyofsources,
includingcorporatetransactions,customerhistoriesanddemographics,evenexternalsourcessuchascreditbureaus.
Thenusethismodeltoproducepatternsintheinformationthatcansupportdecisionmakingandpredictnewbusiness
opportunities.Textminingcapabilitiesenableyoutoapplysuchanalysestotext-baseddocuments.WithSAS’rich
suiteoftextprocessingandanalysistools,youcanuncoverunderlyingthemesorconceptscontainedinlargedocument
collections,groupdocumentsintotopicalclusters,classifydocumentsintopredefinedcategoriesandintegratetextdata
withstructureddataforenrichedpredictivemodelingendeavors.
Whetherit’susedtodrivenewbusiness,reducecostsorgainthecompetitiveedge,datamining(anditsclosecousin
textmining)isalsoavaluableassetfororganizations.Especiallyintheareaofcustomerrelationshipmanagement
(CRM),applyingthetechniquesofdatamininghasbecomeanaccepted“bestpractice”forcompaniesseriousabout
business.Nolongercansuccessfulorganizationsrestontheirlaurelsandcontinuetodobusinessastheyalwayshave
inthepast.Onlythosethattakeproactivemeasureswillsurviveintoday’scompetitiveworld.
Textmining,however,isnotwell-knownbecauseitisviewedasanemergingtechnologywithuntappedpotentialby
thoseeagertoanalyzethetextual-baseddatathatisaccumulatingandfillingtheirvastdatawarehouses.Enterprise
searchvendorshaveassertedthattheirkeywordmatchingsystemsare,ineffect,textminingsystems.Becausetext
miningisnotwellunderstood,theseassertionsleadmanyorganizationstobelievethatbroadcategoriesofresults
equalstextmining.Thisisnottrueandthesituationcancostanorganizationmoneyandopportunities.Enterprise
searchthatgeneratesalonglistofpossiblematcheswastestheuser’stime.Everyminutespentbrowsingandclicking
on“hits”coststheorganizationmoneyandreducestheemployee’sabilitytorespondtoaninformationneedquickly.
Tools,likethoseavailablefromSAS,providealternativewaystofindananswer,pinpointthefactneeded,orget
‘oversight’andinsightintostructuredandunstructureddata.
1
2
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Organizationsoughttotakestepsandbegintoharnessmoreofthedatatheyareaddingtotheirgrowingdatastores–
structuredandunstructuredalike–sothattheywillbeempoweredtotakeproactivestepsandimprovedecisionsto
guidetheircompaniestoabrighter(moreprofitable)future.Textminingsoftwarecansurfacepatternsandtrendsthat
youmayneverhavethoughttolookfor.WithcomprehensivedataminingsolutionssuchasSASTextMiner,youcan
gofromrawdatatoaccurate,business-drivenanalyticalmodelsinaseamless,efficientprocess.
b. Difference between search and discovery
It’simportanttorememberthatsearchenginesaredesignedtoreturnaspecificsubsetofdocumentsbasedonaquery
performedbytheuser.Typically,usersknowwhattheyseekandconstructaquerytoaccomplishthegoal.Thesearch
enginethenreturnsasubsetofdocumentsthatmatchesthatquery.
Incontrast,textminingenginesaredesignedtohelptheuserdiscoverkeyconceptswithindocumentsorsubsetsof
documentswithouthavingtoreadtheentirecollection.Unliketheuserofasearchengine,textminersareunsureof
whattheywanttoknowandarerelyingonthecollectionasawholetospeakforitself.
Textminingsoftwareisdesignedtoprocessunstructuredinformation.SAShashadsuccessinsellingitstechnology
wherecustomersmustfindmeaningandtrendsine-mail,callcenterfeedbackfromcustomersandtextualinformation
fromindividualswhofilereportsinelectronicform.SASTextMinercanbeusedtodiscovercommonthemesin
collectionsofunstructuredinformation.
SASTextMinercanbeoptimallyintegratedwithinyourexistingITinfrastructurewithasingle,unifiedsystem.The
SASEnterpriseIntelligencePlatformtranscendsorganizationalsilos,diversecomputingplatformsandnichetools
todelivernewinsightsthatdrivevaluefororganizations.SAScustomershavereadycapabilitiesfordataintegration,
analyticsandbusinessintelligence(BI).Thisensuresdatacanbecleaned,manipulated,preparedandfedintoanalytics
foranonlineanalyticalprocessing(OLAP)engine,forecastingorotheranalyticalmodulesbeforedisplayingresults
viaWebgraphicsorinteractiveflexiblebusinessintelligence(hereinafter,BI)tools.
c. How SAS® Text Miner works as a discovery search tool
SASTextMinertechnologydeliversextraction,automaticclassificationanddiscoverytoSAScustomers.Theoutputs
oftheSASTextMinersubsystemallowyoutoviewtermstatistics,identifysimilardocuments,andviewtermand
conceptrelationshipsinagraphicdisplay.Thesystemalsoperformsautomaticclassificationandmetatagging.
Ifyouhaveanexistingtaxonomy,SASTextMinercanusethatdataanddiscovernewcategories.Thesystemcan
identifyclustersofrelatedinformationsuchascallcentercommentsorstatementswithine-mail.StephenE.Arnold,
ArnoldIT.com,callsthis“informationoversight”which,hesays,“Makesitpossibletospotakeyfactortheneeded
informationwithoutwastingtimeclickingonlinksandhuntingfortheitemofinformation.”
ThedatageneratedbySASTextMinercanbeusedtoenhanceanexistingsearchsystem.However,thisdatacanbe
usedforpredictivemodelingbyitself.Alternatively,youcanmergediscovereddatawithexistingstructureddataand
perform
d. How SAS® Text Miner enhances the discovery process
SAS’textanalyticsareanadd-ontoitspremierdataminingsoftwaresolution–SASEnterpriseMiner™.TheSAS
TextMinerproductissurfacedasanadditionalnodetofitcomfortablywithinthepredictiveanalyticsumbrella.SAS
istheleaderinbusinessintelligenceandpredictiveanalyticssoftware.SASPredictiveanalyticssoftwareprovidesa
widerangeofintegratedcapabilitiesfortimeseriesanalysisandforecasting,econometricsandsystemsmodeling,
andfinancialanalysisandreporting,withdirectaccesstointernal/externalcommercialdatawarehouses/marts.With
31yearsexperienceand43,000customersitesworldwide,SAShelpsmanageperformancebytransformingdatainto
predictiveinsights.
3
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
SAS’systemassumesthattheprimaryuserswillhavestrongunderstandingofstatisticsandbasicnavigation
familiaritywiththeJavaGUIofSASEnterpriseMiner.
ThebenefittotheSASTextMineruseristhatvirtuallyallofthenaturallanguageprocessingandstatistical
functionalityisavailablewithintheSASgraphicalenvironment.SAShasdonetheheavyliftingrequiredtomake
algorithmsavailableasdrag-and-droporpoint-and-clickfunctions.AccordingtoMaryCrissey,AnalyticsProduct
MarketingManageratSAS,“AkeybenefittotheuserofSASTextMineristhatthetimerequiredtograftlinguistic
functionsontoamodelisreducedbymorethan90percent.”
Usingintelligentalgorithmsandlexicalprocessingtoautomatethecategorization,tagging,orbuildingoftopicsand
documentindexes:
• Providesautomaticidentificationandindexingofconceptswithinthetext.
• Visuallypresentsahigh-levelviewoftheentirescopeofthedataprocessedbythesystem,withtheabilitytodrill
downtorelevantdetail.
• Enablesuserstomakenewassociationsandrelationshipsandtopresentpathsandlinksforfurtherdocument
analysis.
In2006,SASmovedquicklytobecomeaGooglepartner.GoogleOneBoxAPIallowsSAStooffertheGooglesearch
interfacetocoreSASdataandreports.Inaddition,astheGoogleSearchAppliancebecomesmorerobust,SAShas
theopportunitytooffercollaborationandaccesstootherGooglefunctionalities.It’stooearlyintheadoptioncycleto
beabletoquantifythevalueoftheGooglerelationship,butasaprivatelyheldsoftwarecompany,SASisabletomove
quicklyanddecisively.
InthesamewaythatSAShasexpandedaccesstoBIbycreatingtargeteduserinterfacesforitssoftwarethatmatch
theskilllevelsofindividualusers,SASandGooglewillprovidejointcustomerswhoactivatethenewGoogleOneBox
forEnterprisefeatureoftheGoogleSearchAppliancewithafamiliar,securewaytosearchforreal-timeinformation
deliveredbySASEnterpriseBIServersoftware.Inaddition,thecombinationoftheGoogleSearchAppliancewiththe
SASEnterpriseIntelligencePlatformwillgiveusersmoreinformationthanordinarykeywordsearchescanprovide.
SAS’contextuallyrelevantsearchcapabilitieswithGooglenotonlyexploremetadatabutalsolookatthebusiness
views(SASInformationMaps)thathavebeendefinedbySASclients.
ThevisionaheadforSASwillnotbelimitedtoitspremierdataanalysis(statisticalprocessing)capabilities.Not
onlyisSASgainingfootholdinqueryandreportingsoftware(BI),itinfactextendsacrosstheentireSASEnterprise
IntelligencePlatformasacompleteend-to-endsystemthatincludesthedatawarehousingETLwhichstandsfor
extract,transform,andload(handlingdatainputissues).ThepotentialforSAStodeliverTHEPOWERTOKNOW®
seemsunlimitedbecauseSASismeetingthechallengeforsmallandlargedatabasesstuffedwithtraditionalstructured
datathatcanbemeasuredandquantifiedaswellasthequalitativedescriptorsofdatareferredtoas“unstructured”–
text-basedcontentthatcanoccupygigabytesofinformationinorganizations.Examplesoftraditionalstructureddata
areage,jobclassificationandincomeinformation.Examplesofunstructureddataincludetext-basedcontentinWeb
pages,e-mailmessages,articles,memos,customerfeedback,warrantyclaims,patentinformation,surveys,research
studies,resumes,clientnotes,competitiveintelligenceandmore.
4
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
2. SAS® Text Miner is part of the SAS® Enterprise Intelligence Platform
a. Overview
TheSASplatformoptimallyintegratesindividualtechnologycomponentswithinyourexistingITinfrastructureinto
asingle,unifiedsystem.Theresultisaninformationflowthattranscendsorganizationalsilos,diversecomputing
platformsandnichetools–anddeliversnewinsightsthatdrivevalueforyourorganization.
ThecorenotionofSASisthatanalyticprocessesworkoffasingleversionofthetruth.Bydefiningananalytical
platformandsettingupametadata(informationaboutdatasources,includinghowitwasderived,businessrulesand
accessauthorizations)architecture,SASprovidesacentralizedrepositoryfordataandinformation.SAShascreated
avarietyofinterfacesandtoolstopermitthird-partyapplicationstoaccessinformationmanagedbySAS.SAS’
principalstrengthisthataclientnolongerhastobeartheburdenoffiguringthemostcost-effectivewaytomanagethe
complexitiesofintegratingdifferentrepositoriesofdata.
Thecompanyaddedtextminingtoitstechnologysuitesixyearsagowheninterestinunstructuredtextfirstrippled
throughtheanalyticssector.SAS’approachwastolicensemodulesfromInxightforstemming,part-of-speechtagging
andentityextractiontechnologiessotheycouldlinkwiththeclusteringandpatternrecognitionfacilitiesoftheSAS
textminingsolution.Bycontractualright,SAScontinuestheuseofInxighttechnologyasitdidbeforetheBusiness
ObjectsacquisitionofInxight.SASTextMinerwillcontinuetoofferadvancedtechnologyfortheintegrationof
structuredandunstructureddata.Additionalsolutionsfocusedtoindustryneedsarealreadyunderdevelopmentat
SAS.AnyfuturepotentialreplacementofmoduleswillbeintegratedintonewreleasesofSAStechnologywhich
currentcustomersgetatnoextracost.
SAS’approachisdesignedtoprovideasingleenvironmentfordataanalysis,textminingandknowledgediscovery.
SAS’marketingteamlikestosay,“SASgivescustomersthepowertoknow.”SASTextMinerhastightlyintegrated
text-basedinformationwithstructureddata.Fromasingleinterface,ananalystcanobtainacompleteviewof
structuredandunstructuredinformationtoenhanceanalysesanddecisionmaking.
WithSAS,nothird-partytoolsarerequired,andSAScan“playnicely”withmostthird-partysystemsbecauseit
accommodatesdatainamyriadofformatsandrunningonawidevarietyofoperatingsystems.Ifaproprietary
functionisneeded,SASpermitsunlimitedcustomization.Configurationassistanceandinstallationexpertsareonhand
toguidethecustomer’sSASadministratorintheroleofmanagingtheserversandsoftwarethatcomprisetheSAS
framework.
SAS’textminingsubsystemisavailableasanenterpriseapplication(installedonaserverandlinkedtomanyclient
computers)oraquasi-standalonesubsystem(installedonasinglelaptop).Bothtextminingsubsystemscan:
• Filtere-mailforregulatorycomplianceorsecurityapplications.
• Groupdocumentsintopredefinedcategories.
• Routenewsitems.
• Clusteranalysisofdocumentsinacollection,surveydata,reviewcustomercomplaintsandcomments.
• Predictfinancialshiftsfrombusinessinformationandcustomersatisfactionfromcustomercomments
5
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
b. SAS® creates an information circuit for organizations
SASisasystem,notasinglesoftwarepackage.ThefigurebelowshowstheoverallSASarchitecture.TheSASText
Minermodule(highlightedinyellow)sitsintheclienttieroftheSASframework.SASTextMinerisasubcomponent
oftheSASEnterpriseMinersubsystem.
SAS, now in V ersion 9, provides a comprehensive informaiton environment. It includes servers, data management tools, analytic routines, APIs, and such advanced features as the exchange of data, scipts, andmodels across the dozens of SAS modules. Featuring full support of portal accessible text mining, SAS provides a comprehensive solution to organizations that need to analyze large volumes of textual and numeric data.
Inthisinformationcircuit,informationflowsintothevariousservers.Analystsmanipulatethedataintheserversusing
arangeoftools.UsersofthedatacanaccessthereportsinstandardofficesoftwareapplicationssuchasMicrosoft
Excel.
BeforelookingatthespecificfunctionsoftheSASTextMinermodule,let’sreviewSAS’twindevelopmentpaths,
whichhaveyieldedsomeimportantinnovations.
ThefirstisthedevelopmentoftheSASIntelligenceStoragesystem.TheSAScustomercanrelyonSAS’framework
toprovidedatawarehousing,integrityanalysistoolsandadministrativeinterfaceswhenmanipulatingthegigabyte
andterabytecollectionsofdatafoundinmanyorganizations.Fortheanalyst,thedatamanagementtechniquesdonot
intrudeonthespecificstatisticalprocessesanindividualanalystuses.However,thoseaccessingandmanipulatingdata
seethatthespeedwithwhichSAShandlescertaindataintensiveprocesseshasincreased.
ToillustrateSAS’storageefficiency,consideranOracletablecontaining10terabytesofXMLdata.TheOracle
databasewillrequire10terabytesforthetabledataandanother30to40terabytesofstorageforindicesandthe
swapspacethatwillallowforefficientdatamanipulation.Thecostofstorageisdropping,butthevolumeofdatais
increasing.Itmakessensetominimizediscspacerequirementsinordertoholddowncostsandtohavetheability
toprocessnewdata.TheSASIntelligenceStoragesystemrequires10terabytesforthedata,butSASrequiresan
additional10terabytesofspaceformetadataanddatamanipulation.SAShasmostlyremovedthebottleneckswhile
implementingamultithreadedkernelcapableofscalinguptohandleclustersofcommodityservers.Atthesametime,
SAShasreducedtheburdenonthedatabaseadministrator.
6
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Thesecondinnovationhasbeentherefinementoftoolsforindividualanalystsandtheservicesavailabletoananalyst
tocreateaboiled-down,graphicalsnapshotofkeyfindings.IfthetoolsbuiltintotheSASenterpriseminingandtext
miningsubsystemsarenotadequate,ananalystcanturntothecompany’sgraphproductandcreateMadisonAvenue-
stylechartscompletewithwizardsandpreviewsofagraphorchart.
ThecomponentsforSASTextMineraretheSAS®9.1system,SASEnterpriseMiner,Java,necessaryhardware
servers.BeforeSASTextMinercanbeused,SASrequiresthatabasicstatisticalfoundationbeinstalled.SAS
EnterpriseMinerinstallsonthatbase.Finally,theSASTextMinercodeinstallsintotheSASEnterpriseMiner
module.Thereis,therefore,nowaytousetheSASTextMinerwithoutlicensingothersoftwarefromSAS.Pricingcan
varydependingonthespecificrequirementsoftheclientorganization.
Theplatformconceptisanimportantone.Theideaisthatalicenseecanbuildanenterprisewideinformation
analyticssystemonSAStechnology.SAS’descriptionsofitsplatformsimplifytoagreatdegreethelargenumberof
componentsthecompanyhasdeveloped.Forexample,alicenseecanacquiresoftwarethatfacilitatesgridcomputing.
SASalsooffersastoragemanagementsystem.Bothgridandstoragetechnologyarepivotaltothesuccessofalarge-
scaletextminingapplication.
3. SAS – The company
a. History
SAScanboastthatitisoneofthelargestprivatelyheldsoftwarecompaniesintheworld.HeadquarteredinCary,
NorthCarolina,thecompanygeneratesaboutUS$1.5billioninannualrevenue.Thecompany’sanalyticproducts
rangefromstatisticaltoolkitstoenterprisewidedataanalysissystems.Thecompany’ssoftwareisincludedaspartof
thecurriculumindataminingandBIintheUS,Europe,Australiaandelsewhere.
SASbeganatNorthCarolinaStateUniversityin1976,whenJimGoodnight,nowCEO,launchedthenewcompany
withthreeofhiscolleagues.Thecompanyhasgrowntoencompass10,000employeesand400officesworldwide.
DabblersinanalyticswillfindSAS’productsinscrutable.TraininginspecificSAStechnicalapproachesisbeing
incorporatedintomoreandmoreuniversitycampusestoday–especiallyintobusinessprograms,whichvaluethe
competitiveadvantageanalyticsprovides.SASpublictrainingcentersarelocatedaroundtheworld,staffedbycertified
instructorswhoalsotraveltocustomersitesaswellasteachviaLiveWeb.SASSelf-Pacede-Learningtutorialsare
becomingquitepopular.SASPresspublishesawidevarietyofbooksthatclarifytheterminology,technicalframework
andbestpracticescustomersareimplementingwithSASsoftware.Softwareusermanualsanddocumentationare
postedontheInternetforthepublictosee,soeventhosewhohavenotyetinstalledthesoftwarecanlookupthe
mathematicalalgorithmsincorporatedinacertainprocedureorproductofinterest.
SAShasworkeddiligentlytomakeitstoolsaccessibleanduserfriendlyinresponsetomarketdemandsandfeedback
fromitslargecustomerbaseacrosstheglobe.Today’susersattractedtothedataminingsolutionsincludetraditional
statisticians,businessmanagersanduniversityresearchersaswellasdomain-specificindustryprofessionals.
TodaySASisrecognizedasasoftwareproviderthatdoesmorethannumber-crunchingandfancydatamanipulation.
AnalystshavestatedthatSASsetsthestandardforBI.1
1http://www.sas.com/news/analysts/by_technology_bi.html.
7
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
ForthoseunfamiliarwithSAS,thecompanyprovidesseveralchoicesforkickingoffnumericalanalysisprocesses,
suchastheSASEnterpriseGuidepoint-and-clickinterfaceorJMP®software(version7,releasedinMay2007,links
toSASdatasets).Storedservicesnowprovideautomaticwaysofrerunningcommontaskssoprogrammingtools
canbesavedandquicklyautomated.Thesebuilt-inoptionsanduserinterfacemakeiteasyforbusinessmanagersto
produceandunderstandreportsgeneratedbySAS.ProgrammerstrainedtouseSAScanmakejumpsthroughflaming
hoopsbyaddingtheirowncreativetoucheswithpersonalizedmacroprogrammingortailoredinterfaces.SASwants
toenableitscustomerstodeployasbasicorascomplexatextminingsystemastheydesire.Then,whenitbecomes
necessarytoaddadditionalfunctionality–forexample,advanceddataanalytics–SAScomponentswillworksmoothly.
SAScontinuestomaintainitsleadershiproleintheanalyticsmarketasithasbeendoingformorethanthreedecades.
Newnichevendorsentertheanalyticsmarketsector,butnonepossessesthebreadthanddepthofSASanalytics
completedbystrongdataintegrationandBIcapabilitiesacrosstheplatform.
Theprogrammer-analystwithastrongfoundationinstatisticswillrevelinSAS’technologyanditspower.Aperson
whoinveststhetimetolearntheSASapproachwillfindthatSASdeliversusable,flexibleandscalableanalytictools.
b. Customers
ManagersandprofessionalswithSASexperiencearestrongadvocatesofthesystem.Inarticlesandspeeches,SAS
customersemphasize:
• Reducedtime-to-decisionsandamoreaccurateorganizationalview.Bycombiningstructureddataandunstructured
text,andautomatingtheprocessofanalyzingthedata,SASTextMinerhelpsorganizationsgainmeaningful
insightsthatsuccessfullydriveoverallbusinessdirection.
• Improvedorganizationalperformance.Whilemostsoftwarevendorsofferclassificationofonetextfieldintoa
singleclass,SASTextMinerenablestheclassificationofmultiplestructuredandunstructuredfields.Itimproves
yourorganization'sperformancebydistillinginformationfrommultiplebusinessunitsintoaformatthatiseasyto
manageandanalyze.
• Abilitytorecognizetrendsandpredictbusinessopportunities.Analysisofinformationsuchascustomerlettersandcall
centernotesmayprovidevaluableinformationaboutcustomerdissatisfactionorinsightsintoserviceandproductneeds.
SAShasimplementedanumberofinterestingtextminingsystemsforclientsworldwide.Twoexamplesbelow
illustratetheuseoftheSAStextminingsysteminbothacademiaandbusiness:
(i)UniversityofLouisvilleusestextmining,dataanalysisandgeospatialmappingtounderstandmedical
conditionsandimprovepatientcareinhospitals.
(ii)Hondausestextmininginwarrantyandqualityanalyses.
(i) Understandingmedicaltrends
TheUniversityofLouisvilleusedSAStextanddataminingtechnologytoidentifymajortrendsinseveral
cardiacandrespiratoryconditions,andtoanalyzecorrelationsuchasmedianincomeandgeographical
locationsforJeffersonCountypatientsinKentucky.SASandArcGISwereusedtocompile,author,analyze,
map,andpublishgeographicinformationandknowledge.
8
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
SAS in combination with ESRI tools generated graphic displays of data on maps of Jefferson County and kernel density comparison graphs of the combined data from structured tables and text mining.
Thisstudyultimatelydemonstratedthattextminingandclusteringallowedforefficientstatisticaldataanalysis.
Furthermore,thebridgebetweenSASandESRIallowedforastatisticalformulationofthedataaswellasavisual
interpretationofthepatientsandhealthcareproviderlocationsinJeffersonCounty.
AccordingtoauthorPatriciaCerrito:
“SAShasintegrateditsanalytictoolsbetterthananyothersoftwarevendor.TextMinerhighlightsrelevantpatternsin
documentssuchasclinicalreports,anditquantifiestext-basedinformation.WemoveseamlesslytoEnterpriseMiner
tocombineandanalyzethisunstructuredtextwithstructureddatasuchasdemographicsandlaboratoryvalues.And,
weaugmentthatanalysiswithSAS/STATThat’swhywestandardizedonSASforourresearch.Noothersoftware
deliveredthisdepthandbreadthofanalyticfunctionality.Inadditiontosuperioralgorithms,SASdeliveredthe
simplestinterfaceformanagingandimportingdatafromanysource.Duringanygivenworkday,thesebenefitsaddup
tosignificantROIattheUniversityofLouisville.”2
(ii)Improvingqualityinautomotivemanufacturing
AmericanHondalaunchedaninitiativetoextractearlywarningofpotentialproductproblemsfromcustomer
feedback.AnalertsystemwasimplementedthatusedSASTextMinerandSAS’qualitycontrolmodule.Although
analystsalreadyusedcommentfieldstounderstandautomotiveproblemsunderinvestigation,thenewsystemhas
to“read”incomingfreeformtextdatasystematicallyandgenerateoutputsthatcouldbeanalyzedbyautomated
SASAnalytics.Thesystemusedclusteringmodelstogroupsimilarwarrantyclaimstogether.Newdataaboutthat
particularautomodelwasthenprocessed,scoredandplacedintothebestcluster.AmericanHondaanalystsmonitor
clustersizesandvariancesweekly.Deviationsfromexpectedvaluesareflaggedforhumanreview.
TheoverallHondaprocessconsistedoftakingwarrantyandcustomerdata(bothstructuredandunstructured)and
creatingacommonSASrepresentation.Thedatamovedthroughamodelreview,scoringandstatisticalprocess
controlsequence.Theoutputsgeneratedthealertswhencertainwarranty-relatedvalueswereabovetheexpected
threshold.
2Dr.PatriciaCerrito,UniversityofLouisville,quotedbySASinaDecember2003newsrelease.Seehttp://www.sas.com/news/preleases/121503/news1.html
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Thetextminingprocesshighlightedinthefigureaboveconsistedofsevendistinctsteps.Theoutputsofthesestepswere
clustereditems.
4. Data mining products available from SAS
a. Highlighting the SAS® Text Miner solution
SASTextMinerisanadd-ontotheSASEnterpriseMinerapplication.SASTextMinerrunsasanodewithinthe
largerSASframework.SASTextMinerprovidesanumberoftoolsfordiscoveringandextractingknowledgefromtext
documents.Thesystemtransformsunstructuredtextualdataintoausable,structuredformat.Theunstructureddata
feedsintotheSASSystemwhichtransformsthatintodocumentsthatareclassified,displayedinrelationshipdiagrams,
clusteredintocategoriesand/orincorporatedwithotherstructureddataforfurtheranalysis.TheoutputsofSASText
Minercanbecounted,parsedandanalyzedinseveralways.
SASTextMinerprovidestoolsfordiscoveringandextractinginformationfromawidelyvariedcollectionoftext
documents.Thetoolscanuncoververbalpatternsandconceptsthatareembeddedwithinthee-documentcollection.
i. Textualdatacomesinmanyflavors SASTextMinercanprocessvolumesoftextualdataincluding:
• Webcontent.
• E-mailmessages.
• Newsarticles.
• Researchpapers.
• Surveys.
• Morethan200officefiletypes.
9
10
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
ii. Competitivedifferentiators SAS’approachtotextmininghasthreedistinctivefeatures:
• TextminingistightlyintegratedintoSAS'broadersystemforstatisticalanalysis.Thebenefitisthatan
analystcanusetoolsasopposedtohavingtofigureouthowtogetdifferenttoolsto“playwell”withother
applications.
• TheprogrammingandproceduresexpertiseofSASeliminatesmostofthelearningcurveassociatedwith
textmining.
• Datafromthetextminingmodule–whatSAScallsanode–isavailabletootherSASprocessesand
procedures.Datafromtextminingcanbemergedwithnumericdatafromotheranalyticprocesseswithout
aformattingstep.
BecauseSASTextMinerrunsasafunctionwithintheSASoperatingsystem,itsoutputscanbefurtherprocessed
withoutconversionbyanyotherSASsoftwaremodule.Thisintegrationoftextminingoutputsintousableformats
isadefiniteadvantage.SASwasoneofthefirstcompaniestodeliveratextminingsolutionabletomergeexisting
structureddatawiththestructuredoutputsofitstextminingsubsystem.Bycombiningtextandstructureddata,the
reliabilityofpredictiveanalysesimproves.
iii. Keyfeatures SASTextMinerguaranteescompatibilitywithotherSAScomponents.Furthermore,thetextminingsubsystemcanbe
integratedintotheSASplatformwithoutcustomprogramming.OtherfeaturesofSASTextMinerinclude:
• Anintegratedinterfaceforanalyzingtext(unstructureddata)inconjunctionwithmultiplerelateddatabase
(structured)fields.Thegraphicaluserinterfacesignificantlyreducestextminingtimeforbothbusiness
analystsandstatisticians.
• Textparsingtoextracttermsorphrasesfromlargetextcollections,whilestemmingtermstorootformbased
onpartsofspeechandfindingphrasesofinterestsuchasabbreviations,countrynames,organizationnames
andotherspecificentities.
• Transformationofdataintoastructuredrepresentationofthesourcecontent.Thesystemdistillskeyconcepts
containedinlargedocumentcollectionsandanalyzesrelationshipsbetweenisolatedtermsorphrasesand
documents.Thetransformedrepresentationofthesourcetextstructuresitforuseindataexploration,
clusteringandpredictivemodeling.
• Aninteractiveresultsbrowserthatenablesanalyststointeractivelyexploreconceptsandrelationshipsbetween
documentsandtomakemodificationstofurthertailoranalyses.
• FullintegrationwithSASEnterpriseMiner’sdataanalyticssoftware.Theanalyticsprovidecomprehensive
analytictoolsfortextandrelatedstructureddata.
iv. SAS®TextMinerunderthecovers SASTextMinerrunsonmostoperatingsystems,includingLinuxandWindows.Withthelatestrelease,itrunson
SolarisandAIXUNIXenvironmentsaswell.SASTextMineriscomputationallyintensiveandrequiresdedicated
hardwareforoptimumthroughput.
SASTextMinerreliesprimarilyuponpatternrecognitiontechnologyinsteadofalinguistics-centricordictionary-
basedapproach.SAS’textminingsoftwareproducesanumericalrepresentationofadocument.Ingeneral,SASText
Minerdelivershigheraccuracywhenalargenumberofdocumentsareprocessed.
11
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
SASTextMinerisbestsuitedtoapplicationswherehundredsorthousandsofmegabytesofunstructuredcontentmust
beprocessedrapidly.Thesystemcangeneratereportsthatmakeiteasyforananalystorendusertograspthekey
facts,ideasandtrendsintheprocessedcontent.
Theprocessesoftextminingusethenumericalrepresentationtoprovideinsightsintooneormoredocumentsina
collection.
Thesystemperformstextminingbymovingeachdocumentthroughfourprocessingsteps.Documentsareacquired
andpreprocessedsothatasingledatasetisformed.Next,leveragingInxight’sLinguistXPlatformandThingFinder,
SASTextMinerperformsparsingandmarkupofthedocumentsintheset.Theprocesseddocumentsarethen
transformed,objectstagged,andreducedtoastructuredform.Thefinalstageintheprocessisthemanipulationofthe
newlygenerateddatasetbyanalyticroutines.Thereisnolimittothenumberofdocumentsinacollection.Analysis
canbeperformedonasinglecollectionoracrossmultiplecollectionsaswellascombinationsofstructureddatafrom
otherapplicationsandtheoutputsofthetextminingprocess.
The SAS Text Miner performs preprocessing, text parsing, transformation, document reduction (creating a structured representation of the objecs and associated metadata), and analysis.
Theprocessflowforminingtextconsistsofaseriesofactionsthatareperformedonastaticcollectionofdata.The
SASTextMinerapproachassumesthatthecollectionorcorpustobeanalyzedisnotupdatedduringtheparsingand
othertextminingprocesses.SASTextMinerworksthroughasequenceofactivitiesinordertogeneratenumerical
snapshotsofdocuments,clustersofobjectssuchasmajorconceptsinthecollectionitself.
Thereareseveralcoreprocessesthatresultinacohesivesetofdataandreportsaboutthedocumentsprocessed.These
basicprocessesare:
• Parsing:separatingwordsandphrasesinthedocument.
• Stemming:locatingroots.
12
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
• Synonymidentification:mappingrootsinthedocumenttosynonyms.
• Stoplistanalysis:eliminationofstopwordsviathefilesashelp.stoplst.
• Termbydocumentmatrixgeneration:mathematicalrepresentationofeachdocument.
• Singularvaluedecomposition:aprocesstoenablefastprocessingofnumericalrepresentationsofdocuments.
• Clustering:aprocessthatgroupsobjectsthatareinsomewayrelated.
TheresultsoftheseprocessescanbeusedbyotherSASanalyticroutinesforadditionalanalysesandgraphic
expressionsoftherelationshipsidentifiedbythealgorithmicprocesses.
v. Reducingthemanualworkoftextmining RawtextisinputtedtoSASTextMiner.Twopre-analysisstepsarerequired.Thesearetasksusuallyperformedby
ananalyst.WitheachreleaseofSASTextMiner,SAShasincludedroutinesthatcanreducesomeofthenecessary
manualwork.Thesetwolistsareimportantbecausetheaccuracyoftheclusteringprocessultimatelydependsonthe
“knowledge”embodiedineachlist.
SASTextMinerperformsstemminginEnglish,French,GermanandSpanish.Stemmingallowsvariantstobegrouped
inonetermset.Anexampleappearsinthetablebelow.
Stem Terms
aller(French) vais,vas,va,allez,allons,vont
reach reaches,reaching,reached
big bigger,biggest
wagon wagons
SASTextMinerforSAS®9isavailablein15languages.CustomersreceiveSASTextMinerforEnglishandachoice
ofoneadditionallanguage,witheachcustomer’snativelanguagerecommended.
Theanalystcanpreparealistofwordsandphrasesforthesystemtoidentify.Useofastartlistcanreducetheamount
oftimerequiredtoprocessacorpusorrepositoryoftext.
Inanycollectionofdocuments,certaintermsmustbeidentifiedsothatSASTextMinerprocessesignorethesewords
whenprocessingthecorpus.
Eachcorpuscontainswordsthatmaybeusedasequivalents.Examplesincludeaproductname,itsmodelnumber
oranacronym.Forexample,F-150maybeasynonymfortheboundphrase“lighttruck.”SASTextMineruses
synonymstogroupconceptsintoameaningfulcluster.
13
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Equivalenttermsareillustratedinthetablebelow:
Term Parent Category
appearing appear Verb
EM SASEnterpriseMiner Product
Employees Employee Noun
AdministrativeAssistant
Employee Noun
SAShasdevelopedanumberoftoolstoassistananalystwithdiscoveringandextractinginformationfromawide
varietyoftextdocuments.SASTextMinercanidentifythemesandconceptsthatarecontainedinthecollection.
TheoutputofaSASTextMineranalysisallowsananalysttounderstandtheinformationcontainedinacollectionof
documents.AnanalystoreditordoesnothavetopreprocessorpretagthecontentprocessedbySASTextMiner.
TheSASTextMinermodulegeneratesreports.UnlikeotherSASnodes,SASTextMinerdoesnotproducecodethat
canbeusedoutsidetheminingactivity,nordoesSASTextMinerproducePMML.Nevertheless,thetightintegration
oftheSASTextMinerfunctionalityintheSASinterfaceisattractive.AnanalystfamiliarwithSAS’datamining
softwarecanbeupandrunningwithSASTextMinerofteninaslittleasoneday.
b. Highlighting SAS® Enterprise Miner™
i. WhatisthedifferencebetweenSASTextMinerandSASEnterpriseMiner? SASEnterpriseMinerprovidestheprocessflowdiagramtomanagetheanalyticalinvestigationasthedataanalysis
projectispursued.ThetextminingapplicationisturnedonbyclickingthetextminerNODElocatedontheworkspace
diagramofSASEnterpriseMiner.
SASEnterpriseMinerstreamlinesthedataminingprocess.Thesoftwareprovidesanintegratedsystemtotheanalyst.
Fromasingleinterface,theanalystcanaccessdata,testmodelsandgeneratereportsforendusers.SASEnterprise
Minersupportscollaborativeworkprocesses.
SASprovidesthemostcompletedataminingsolutionavailable.Thesystemsupportsintegrationwiththird-party
enterpriseapplicationsanddesktopsoftware.Deliveredasadistributedclient/serversystem,SASEnterpriseMiner
scalesandiswell-suitedfordatamininginlargeorganizations.
SASEnterpriseMinerisdesignedfordataminers,marketinganalysts,databasemarketers,riskanalysts,fraud
investigators,engineersandscientists.
Throughdatamining,whichSASdefinesas“theprocessofselecting,exploringandmodelingmassiveamountsof
data,”ananalystcanuncoverpreviouslyunknownpatternsandhelpimprovedecisionmakingacrosstheenterprise.
Youmayfindmoreinformationathttp://www/sas.com/technologies/analytics/datamining.
Manydataminingsolutionsarestandaloneapplicationsandoftendonotintegrateeasilyintoexistingapplications.
Furthermore,scalabilityisakeyissue,particularlywhendatasetscanrunintoterabytes.Consequently,quantitative
specialistsmustspendtimeaccessing,preparingandmanipulatingdisparatedata.SASEnterpriseMinerautomates
thesedatapreparationtasks.SASEnterpriseMinerallowsanalyststomodeldataandapplytheirexpertiseto
solvespecificbusinessproblems.Italsoensuresthattheresultscanbedeployedintoscoringenginesforactual
implementation.
14
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
SASEnterpriseMinerincludesabroadsetoftoolsthatsupportthecompletedataminingprocess.Thefunctionsare
exposedinagraphicaluserinterface.Mostaspectsofmodelbuildingcanbecompletedbypointing-and-clickingona
SASEnterpriseMineroption.SASEnterpriseMinerincludesaprocessflowdiagramsystemthateliminatestheneed
formostmanualcodingandhelpsreducethetimerequiredtoconstructmodels.SASEnterpriseMinercangenerate
diagramstosavethevarioustypesofanalyticalpattern-detectingalgorithmsexplored.Thesediagramsserveasself-
documentingtemplates.Thesetemplatescanbeupdatedasnecessaryandbeappliedtonewproblemswithoutstarting
theanalysisfromscratch.Diagramscanalsobesharedwithotheranalyststhroughouttheenterprise.
SASEnterpriseMineroffersarangeofintegratedassessmentfeatures.Thesefeaturesallowananalysttocomparethe
resultsgeneratedbydifferentmodelingtechniques.Modelresultscanbeeasilysharedthroughout.Themodelsreside
inanSASdataminingmodelrepository.SASwasoneofthefirstbusinessintelligencevendorstoimplementmodel
managementinanenterpriseintelligenceframework.
Scoringistheprocessofapplyingamodeltodata.Theoutputofadataminingprocessconsistsofrawscores.SAS
EnterpriseMinerautomatesthemodelscoringprocessandsuppliescompletescoringcodeforeachstageofmodel
development.Thescoringcodecanbedeployedinnumerousreal-timeorbatchenvironmentswithinSAS,ontheWeb
ordirectlyinrelationaldatabases.
SASEnterpriseMinerisdeliveredasadistributedclient/serversystem.Thesubsystemeliminatestheneedforan
organizationtohavenichedataminingapplications.
AnanalystcandraganddropiconsontotheprocessflowdiagramusingSASEnterpriseMiner’sgraphicalprocess
flowtools.
The SAS Interface allows most functions related to programming and data manipulation to be handled within a multi-paned interface. The SAS system generates needed code automatically reducing the time an analyst spends creating scripts so that more “what if” and alternative model investigations can be conducted.
15
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
SASEnterpriseMinerimplementstheSEMMAdataminingprocess.SEMMAisanacronymforSAS’coreapproach
todataanalysis:sample,explore,modify,modelandassess.Theideaisthatananalystcanexperimentwithdifferent
numericalrecipes,seeresults,makechangesandthengenerateresultsfromtheoptimalmodeloftendescribedbySAS
asthechampionmodel.SEMMAisalogicalorganizationofthefunctionsintheSASEnterpriseMinersubsystem.
SEMMAprovidesaflexibleframeworkforconductingthecoretasksofdatamining.
SASEnterpriseMinerusesaJavaclientandtheSASserverarchitecture.Theresultisthattheminingcomputational
serverisseparatefromtheuserinterface.Ananalystcanconfigureaninstallationthatscalesfromasingle-usersystem
toverylargeenterprisesolutionswithnochangestothemodelorrecoding.Certainprocess-intensiveservertaskssuch
asdatasorting,summarization,variableselectionandregressionmodelingaremultithreaded.SASEnterpriseMiner
canautomaticallydistributetheworkloadovermultipleCPUs.Itsprocessescanberuninparallelandasynchronously.
Largeorrepetitivetrainingorscoringprocessescanbescheduledforbatchprocessingduringoff-peakdemandhours
ontheanalyticalserverwithoutprogramming.
SASEnterpriseMinerbakesinadvancedpredictiveanddescriptivemodelingtoolsandalgorithms.Ananalyst
canimplementdecisiontrees,neuralnetworks,autoneuralnetworks,memory-basedreasoning,linearandlogistic
regression,clustering,associations,andtimeseriesamongothers,bypointingandclickingonamenuchoiceora
radiobutton.Withmultiplemodelsandalgorithmsreadytoruninasinglesubsystem,SASEnterpriseMinerallowsan
analysttoexploredataandtheoutputsofdifferentmodelsfromavisualuserinterface.
SASEnterpriseMinerincludesassessmentcapabilities.Thesehelpensurethatacommonframeworkforcomparing
differentmodelingtechniquesisavailabletotheanalyst.
SASEnterpriseMinerincludestoolstohelptheanalystperformsuchtasksassampling,datapartitioning,missing
valueimputation,clustering,mergingdatasources,droppingvariables,customizedSASlanguageprocessingvia
theSAScodenode,variabletransformationandfilteringoutliers.Extensivedescriptivesummarizationfeaturesare
included,aswellasadvancedvisualizationtoolsthatpermitexaminationoflargeamountsofdatainmultidimensional
plotsandgraphicallycomparemodelingresults.
Theplatform-independentJava-basedGUIofSASEnterpriseMinerprovidesenhancedstatisticalgraphicswith
flexibleactionsandcontrols.AJavagraphicswizardisalsoprovidedfordevelopingcustomizedgraphs.Plotsand
underlyingtablesaredynamicallylinkedbySASEnterpriseMinertoallowinteractionswiththemodelandother
componentsofthesystem.
TheSASEnterpriseMinerenvironmentfeaturesintegratedassessmentfunctions.Theseallowananalysttocompare
theresultsofdifferentmodelingtechniques.Ananalystcangaugeamodel’seffectivenessinstatisticaltermsandin
moreconcretemeasuressuchasmoneysavedornewsalesrevenue.
16
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Scoring data often appear in a series of tables. SAS’s approach is to make the scoring data for various runs accessible in a point-and-click interface.
SASEnterpriseMinerfeaturesaformofversioncontrolformodels.Analystscankeeptrackofthemodelsthathave
beenupdatedandcomparetheimprovementsofaccuracyovertime.Modelsdevelopedforotherprojectscanbe
locatedandtheircomponentsreusedinanewproject.
SASEnterpriseMinerallowstheanalysttosaveschematicsandflowdiagrams.ItexportstheseobjectsasXMLfiles.
Ananalystcancreateacompressedmodelresultspackagescontainingthedatausedinadataminingprocessflow,
includingdatapreprocessing,modelinglogic,modelresultsandscoringcode.
ThesemodelresultpackagescanberegisteredtotheSASMetadataServerforsubsequentretrievalandqueryingby
dataminers,businessmanagersanddatamanagersviaaWeb-basedmodelrepositoryviewer.Whenintroducedin
2002,SAShadtheonlyWeb-basedsystemforeffectivelymanaginganddistributinglargemodelportfoliosthroughout
theorganization.
ii. Modelscoring Insomeorganizations,theanalystmakesthemodelavailabletomanagersinoperatingunits.Scoring–sometimes
calledweightingofdifferentfactorsinamodel–requiresconsiderableexpertise.Scoringinsometextmining
systemsisatime-consumingprocess.Insomecases,theanalystwillhavetowriteoreditscoringscripts.TheSAS
EnterpriseMinerenvironmentautomatesvirtuallyallofthescoringcodeprocess.Ifamanualadjustmentisrequired,
SASprovidesscoringcodeinSASscript,C,JavaandPMML(predictivemodelmarkuplanguage).TheSAS
implementationcapturesthecodeofananalyst’smodels,supportstheimportandexportofPMMLmodelsforNaive
Bayesandassociationrulesmodels,andcodesforpreprocessingactivitiessuchasdatanormalization.
17
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Oncethescoringcodehasbeencaptured,ananalystcan:
• ScoretheproductiondatasetwithinSASEnterpriseMiner.
• Exportthescoringcodeandscoreonadifferentmachine.
• Deploythescoringformulainbatchorreal-timeontheWebordirectlyinrelationaldatabases.
iii. Customizingapplications SASEnterpriseMinercanbecustomizedandextended.ThedefaulttoollibraryofSASEnterpriseMinerallowsan
analysttoscriptadditionalfunctionsusingSAScodeandXMLlogic.SASsupportsaJavaAPIthatcanbeembedded
intocustomizedapplications.SASmakesitpossibleforananalysttobuildaproprietaryanalyticalapplicationsuchas
havinganOLAPreportcapabilitylinkedwithathird-partyapplicationwithinthesameinterface.
c. How do these products interrelate?
TheflagshipproductSASTextMinerfeaturesacomprehensiverangeoftextminingtoolstospeedsearchanddata
analysis,including:
• Noungroupextractionsupportforalllanguages.
• Additionalentitytypes.
• AnenhancedTMFILTERmacrotool.
• UNIXsupport.
• Morerobustsynonymlistprocessingandparsing,includingarevisedDOCPARSEprocedureandDOCSCORE
DATAstepscoringfunction.
18
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Feature Comment
Analyticsfunctions Acomprehensiverangeofapplicationsinadiverserangeofindustriesexists.Themostpopularinclude(1)frauddetection,(2)warrantyearlywarningand(3)supplierrelationshipmanagementstatistics.Comprehensivetextpreprocessingcapabilities
include:
• Captureanddistillthemostimportantunderlyinginformationwithina
documentcollection.
• Defaultorcustomizedstoplistsforeachlanguagetoremovetermswith
littleornoinformationalvalue.
• Automatedspellingcorrection.
• Stemmingtoidentifyrootwords.
• Part-of-speechtaggingbasedonsentencecontext.
• Noungroupextractionforidentifyingphrase-levelconceptssuchas
“competitiveintelligence.”
• User-definedmultiwordtokens,suchas“pointandclick.”
• User-customizedanddefaultsynonymlists.
Compoundwordsplittingintodistinctsubterms.
API APIsareprovidedwithcompletedocumentationandsamplecode.
ControlledVocabularySupport Thesystemcanmakeuseofexistingdictionariesaswellasdeveloponefromscratchasdocumentsareanalyzedandcommontermsaresurfaced.Insteadofsellingindustry-specificdictionaries,SASoffersspecialroutinesthatallowdomainexpertstorapidlytailortheirownstop/startlistsofsynonymswithuser-friendlyinteractivepoint-and-clickprogramminginterfaces.Forapredictivemodelingexercise,significanteffortwithindustry-specificdictionariescandegraderesults.Ifacustomerhashisorherowndictionary,SAScanincorporatethatintothemodel.
Directthird-partysupport Third-partyapplicationscanmakeuseofSASoutputs.TheAPIsareusedtolinkSASTextMinerwithothersystemsandsoftware.ScorecodecanbesavedasastoredprocessanddeployedthroughafamiliarBIclientsuchasExcel,SASWebReportStudio,SASInformationDeliveryPortalandSASEnterpriseGuide.
Documentconversionsupport SASTextMinercanreadtextstoredinavarietyofdocumentformats,suchasPDF,ASCII,HTML,MicrosoftWordandWordPerfect.
Entityextraction People,places,events,datesandotherobjectscanbeextracted.
Languagesupport Danish,Dutch,English,Finnish,French,German,Italian,Japanese,Korean,Norwegian(Bokmal),Portuguese,Spanish,Swedish,TraditionalChineseandSimplifiedChinese.
• SupportforLatin-1,DoubleByteCharacterandUTF-8encodings.
• Europeanlanguages(Latin-1encoding):Danish,Dutch,English,Finnish,French,German,Italian,Norwegian(Bokmal),Portuguese,SpanishandSwedish.
• Far-Easternlanguages(DoubleByteCharacterSupport):Japanese,Korean,SimplifiedChineseandTraditionalChinese.
EncodingsupportforUnicodeUTF-8allowsanalysisofnon-Englishinterfacesandlanguagesthatlistedabove
5. SAS® Text Miner Feature Snapshot*
19
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
Professionalservices SASoffersfreeWebtutorials,on-siteeducationandtraining,on-demandWebseminars,engineering,conferences,academicinitiativesforschoolsanduniversities,andconsultingservices.
Programminglanguages Majorprogramminglanguagesaresupported.Graphicalinterfacesareavailabletomodifycertainsystemfunctions.
Ready-to-runmodules SASsystemsrequireinstallationandsomesetuppriortooperation.Minimumprocessorspeedis1GHZ.Memoryrequirements:512MBserver.Diskspacerequired:40MBthinJavaclient,300MBSASclient.300MBserver–averageWinXPinstall.
Platform SASrunsonWindows(32-bit)AIXandSolaris(64-bit).
Crawler Thesystemincludestoolstoacquireinformation.Additionally,thesystemcanprocessinformationpushedtowatchedfolders.
Standardssupport TheSASTextMinersystemsupportsJavaandXMLstandards.SASisactivelyparticipatingintheUIMAstandardsdiscussionswithIBMandotheralliancepartners,allofwhomjoininthePMMLdiscussionsatDMG.organdtheGridComputingForum.SASismonitoringdevelopmenteffortstoidentifyandselectmeaningfulmetadatasothatwhenitdoesintegratewithaUIMA-compliantstandardofapplication,itwillbeusefultoSAScustomers,especiallyforanalysispurposes.
Taxonomymanagementtools Thesystemprovidesagraphicalinterfacetomanageclassificationsystemsandrepresenttheclass/subclassrelationshipsinahierarchicalfashiondisplayingdocumentsina“tree.”
Taxonomysupport SASallowsthetaxonomybe“discovered”withanunsupervisedmethod,hierarchicalclustering.Wealso“discover”thelabelsfortheclassesandsubclasses.OrSASTextMinercanfollowapredefinedclassificationsystem,ifavailable.
Third-partysoftwaresupport Mostthird-partyandproprietarysystemscanbesupportedbySASTextMinerandSASEnterpriseMiner.
Visualizationfunctions SASprovidesarangeofdatavisualizationtools.Theconcept-linkingdiagramshavebeentailoredspecificallyforSASTextMiner.User-directedconceptlinkingprovidesflexiblenavigationtovisualizecomplexhiddenrelationshipsbetweenterms,phrasesandentities(suchaspeopleandplacenames).Conceptlinking,nowintegratedwiththeInteractiveResultsBrowserforenhancedusabilityandgreaterinsight,helpsdetectpatternsinaclearvisualfashionthatmaynothaveotherwisebeenobserved.
*FeaturesdescribedreflecttheNovember2006releaseofSASTextMiner.
20
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
6. Conclusion
SAShasdonetheworkrequiredtoeasetheintegrationoftextwithstructureddataelementsthroughdrag-and-drop
orpoint-and-clickfunctions.SASTextMinerprovidesafullrangeofpredictivemodelingtoolsthatcanunearth
opportunitiesfortimelyexploitation.Benefitsinclude
• DatainaSASSystemcanbetraced.Thenotionofdatalineageisimportantwhentheorganizationmustoperate
withintheguidelinesofSarbanes-Oxley,BaselIII,andothercomplianceandgovernancerequirements.
• Alargeorganizationcandevelopacomprehensive,customizedtextminingsystemwithoutthecostandtime
requiredtointegratethird-partytoolsintoacohesivesystem.
• SAShascontinuedtoaddfunctionalitytotheSASTextMinernode;forexample,SAShasaddedacreditscoring
functionthatcanuseawiderangeofnumericandtextualinformation.
SASTextMiner’sapproachsetsitapartfrommanybusinessintelligenceandtextminingsystems.BecauseSASText
Minerisacomponentresidingwithinalargerdatamanagementandanalyticenvironment,anorganizationusing
SASTextMinercanrepurposemodels,streamline“whatif ”andalternativemodelanalysis,andmergeoutputsfrom
SASTextMinerwithdatafromnumericorhybridsystems.Thesebuilt-inservicesslashthetimeandprogramming
requiredtoachievesimilargoalsusingsoftwarefrommultiplevendors.
TheprogrammeranalystwithastrongfoundationinstatisticswillrevelinSAS’technologyanditspower.Aperson
whoinveststhetimetolearntheSASapproachwillfindthatSASdeliversusable,flexibleandscalableanalytictools.
Endusersbenefitfromvisualpresentationsofcomplexdataandpoint-and-clickinterfacescreatedwithSAS’built-
intools.Adequatecomputationalhorsepower,systemspecialists,andanalystswiththerequisiteprogrammingand
statisticalexpertiseareallthat’sneededtomakeSAShum.
Insummary,integratingtext-basedinformationwithstructureddataenrichesyourpredictivemodelingcapabilitiesand
providesnewstoresofinsightfulinformationfordrivingyourbusinessandresearchinitiativesforward.Textmining
supportsawidevarietyofapplications,suchascategorizinghugecollectionsofcallcenterdata,Webtextandblogs
analysis,findingpatternsincustomerfeedbackoremployeesurveys,detectingemergingproductissues,analyzing
competitiveintelligencereportsorpatentdatabases,andclassifyingreportsandcompanyinformationbytopicor
businessissue.
Achievingindustry-leadingstatusalmostuponitsintroduction,theseSASdataminingtechnologiescontinueto
receiveravereviewsfromindustryexpertsandusersalike.KeepyoureyesonSAS.
FormoreinformationonSASTextMiner,visit:
www.sas.com/technologies/analytics/datamining/textminer
21
Beyond Search-and-Retrieval: Enterprise Text Mining with SAS®
7. Appendix
Birds-eyeview
Product SASTextMiner(requiresSASEnterpriseMinerandSAS/STAT®licenses)
Platform Textminingtoolkit–runsonMicrosoftWindows,UNIX.
LicenseFee TextMiningadd-onstartsatabout$30,000.
UseCases Transformstextualdataintoausable,intelligibleformatthatfacilitatesclassifying
documents,findingexplicitrelationshipsorassociationsbetweendocuments,clustering
documentsintocategoriesandincorporatingtextwithotherstructureddatatoenrich
predictivemodelingendeavors.
KeyFeatures Avarietyofadvancedanalytics–statistical,heuristics,neuralnets,patterndecisionsand
predictiveforecastingdataminingtechnologies.
Caveats ThisSAStechnologyrunsonaclient/serverplatform.Installationmustbedonecarefully
todefinethepropermetadataarchitecture.
Similarto LikeproductsfromSPSS,TEMISandIBM,theSASTextMinerproductperforms
textualanalytics.Unlikecompetitors,SASexcelsintheintegrationofthestructureand
unstructureddataforpredictivescoringmodels.
PointstoNote SASprovidesanindustry-standard,integratedstatisticaloperatingsystemandtheSAS
TextMinermoduleto“bridgethegapbetweendataandtext.”
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies. Copyright © 2007, Stephen E. Arnold. All rights reserved. 000000_451499.0707
Top Related