Text Mining for Clementine® 12.0 User’s Guide


For more information about SPSS® software products, please visit our Web site at http://www.spss.com or contact:

SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6412
Tel: (312) 651-3000
Fax: (312) 651-3668

SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

Patent No. 7,023,453

Graphs powered by SPSS Inc.’s nViZn™ advanced visualization technology (http://www.spss.com/sm/nvizn).

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective companies.

This product includes the Java Access Bridge. Copyright © by Sun Microsystems Inc. All rights reserved. See the License for the specific language governing permissions and limitations under the License.

Microsoft and Windows are registered trademarks of Microsoft Corporation.

IBM, DB2, and Intelligent Miner are trademarks of IBM Corporation in the U.S.A. and/or other countries.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates.

UNIX is a registered trademark of The Open Group.

DataDirect and SequeLink are registered trademarks of DataDirect Technologies.

Copyright © 1994–2006 Sun Microsystems Inc. All Rights Reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistribution of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. Redistribution in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of Sun Microsystems Inc. nor the names of contributors may be used to endorse or promote products derived from this software without specific prior written permission. This software is provided “AS IS,” without a warranty of any kind. ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE HEREBY EXCLUDED. SUN MICROSYSTEMS INC. (“SUN”) AND ITS LICENSORS SHALL NOT BE LIABLE FOR ANY DAMAGES SUFFERED BY LICENSEE AS A RESULT OF USING, MODIFYING, OR DISTRIBUTING THIS SOFTWARE OR ITS DERIVATIVES. IN NO EVENT WILL SUN OR ITS LICENSORS BE LIABLE FOR ANY LOST REVENUE, PROFIT OR DATA, OR FOR DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, OR PUNITIVE DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF THE USE OF OR INABILITY TO USE THIS SOFTWARE, EVEN IF SUN HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. You acknowledge that this software is not designed, licensed, or intended for use in the design, construction, operation, or maintenance of any nuclear facility.

Portions of the Software are licensed under the Apache License, Version 2.0 (the “License”); you may not use applicable files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Apache Axis2 1.3. Portions of the Software are licensed under the Apache License, Version 2.0 (the “License”); you may not use applicable files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.


Java Service Wrapper 3.2. Copyright (c) 1999, 2006 Tanuki Software, Inc. Permission is hereby granted, freeof charge, to any person obtaining a copy of the Java Service Wrapper and associated documentation files (the“Software”), to deal in the Software without restriction, including without limitation the rights to use, copy,modify, merge, publish, distribute, sub-license, and/or sell copies of the Software, and to permit persons towhom the Software is furnished to do so, subject to the following conditions: The above copyright notice andthis permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWAREIS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDINGBUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULARPURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHTHOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTIONOF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THESOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.Portions of the Software have been derived from source code developed by Silver Egg Technology under the

following license:BEGIN Silver Egg Techology License ———————————– Copyright (c) 2001 Silver Egg Technology.Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associateddocumentation files (the “Software”), to deal in the Software without restriction, including without limitation therights to use, copy, modify, merge, publish, distribute, sub-license, and/or sell copies of the Software, and to permitpersons to whom the Software is furnished to do so, subject to the following conditions: The above copyrightnotice and this permission notice shall be included in all copies or substantial portions of the Software. THESOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FORA PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS ORCOPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHERIN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTIONWITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This product includes software known as Libtextcat 2.2 and is licensed pursuant to the following: Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the organization nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Copyright © 1995–2003 International Business Machines Corporation and others. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation. THE SOFTWARE IS PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT OF THIRD-PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use, or other dealings in this Software without prior written authorization of the copyright holder.

Boost Software License - Version 1.0 - August 17, 2003. Permission is hereby granted, free of charge, to anyperson or organization obtaining a copy of the software and accompanying documentation covered by thislicense (the “Software”) to use, reproduce, display, distribute, execute, and transmit the Software, and to preparederivative works of the Software, and to permit third parties to whom the Software is furnished to do so, allsubject to the following: The copyright notices in the Software and this entire statement, including the abovelicense grant, this restriction and the following disclaimer, must be included in all copies of the Software, inwhole or in part, and all derivative works of the Software, unless such copies or derivative works are solely inthe form of machine-executable object code generated by a source language processor. THE SOFTWARE ISPROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDINGBUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULARPURPOSE, TITLE, AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS ORANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY,WHETHER IN CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTIONWITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This software includes third-party proprietary software gSoapToolkit v.2.7.7. A source code version of such software is available for public use pursuant to the Mozilla Public License v. 1.1 (“License”), which may be found at http://www.mozilla.org/MPL/MPL-1.1.html. SPSS has not made or makes any “Modifications,” nor is SPSS a “Contributor” as those terms are defined in the License.

The CyberNeko Software License, Version 1.0 © Copyright 2002–2005, Andy Clark. All rights reserved.Redistribution and use in source and binary forms, with or without modification, are permitted provided that thefollowing conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list ofconditions, and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyrightnotice, this list of conditions, and the following disclaimer in the documentation and/or other materials providedwith the distribution. 3. The end-user documentation included with the redistribution, if any, must includethe following acknowledgment: “This product includes software developed by Andy Clark.” Alternately, thisacknowledgment may appear in the software itself, if and wherever such third-party acknowledgments normallyappear. 4. The names “CyberNeko” and “NekoHTML” must not be used to endorse or promote products derivedfrom this software without prior written permission. For written permission, please contact [email protected]. Products derived from this software may not be called “CyberNeko,” nor may “CyberNeko” appear intheir name, without prior written permission of the author. THIS SOFTWARE IS PROVIDED “AS IS”AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THEIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE AREDISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR OTHER CONTRIBUTORS BE LIABLE FORANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ONANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDINGNEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. This license is based on the ApacheSoftware License, version 1.1.

OSSP foo—Foo Library. Copyright © 2002 Ralf S. Engelschall. Copyright © 2002 The OSSP Project. Copyright © 2002 Cable & Wireless Deutschland. This file is part of OSSP foo, a foo library which can be found at http://www.ossp.org/pkg/foo/. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. THIS SOFTWARE IS PROVIDED “AS IS” AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS AND COPYRIGHT HOLDERS AND THEIR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Home: http://www.ossp.org/ Repo: http://cvs.ossp.org Dist: ftp://ftp.ossp.org/

This software includes third-party software that is copyrighted by Christian Werner. The following terms apply toall files associated with such third-party software unless explicitly disclaimed in individual files. The authorshereby grant permission to use, copy, modify, distribute, and license this software and its documentation forany purpose, provided that existing copyright notices are retained in all copies and that this notice is includedverbatim in any distributions. No written agreement, license, or royalty fee is required for any of the authorizeduses. Modifications to this software may be copyrighted by their authors and need not follow the licensingterms described here, provided that the new terms are clearly indicated on the first page of each file where theyapply. IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FORDIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THEUSE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY DERIVATIVES THEREOF, EVEN IF THEAUTHORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. THE AUTHORS ANDDISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, ANDNON-INFRINGEMENT. THIS SOFTWARE IS PROVIDED ON AN “AS IS” BASIS, AND THE AUTHORSAND DISTRIBUTORS HAVE NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,ENHANCEMENTS, OR MODIFICATIONS.

WordNet 2.1 Copyright © 2005 by Princeton University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED “AS IS” AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD-PARTY PATENTS, COPYRIGHTS, TRADEMARKS, OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database, and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.

Text Mining for Clementine® 12.0 User’s Guide
Copyright © 2007 by Integral Solutions Limited.
All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher.


Preface

Text Mining for Clementine is a fully integrated add-on for Clementine that requires a separate license. Text Mining for Clementine uses advanced linguistic technologies and Natural Language Processing (NLP) to rapidly process a large variety of unstructured text data and, from this text, extract and organize the key concepts. Furthermore, Text Mining for Clementine can group these concepts into categories.

Around 80% of data held within an organization is in the form of text documents—for example, reports, Web pages, e-mails, and call center notes. Text is a key factor in enabling an organization to gain a better understanding of their customers’ behavior. A system that incorporates NLP can intelligently extract concepts, including compound phrases. Moreover, knowledge of the underlying language allows classification of terms into related groups, such as products, organizations, or people, using meaning and context. As a result, you can quickly determine the relevance of the information to your needs. Extracted concepts and categories can be combined with existing structured data, such as demographics, and applied to modeling using Clementine’s full suite of data mining tools to yield better and more-focused decisions.

Linguistic systems are knowledge sensitive—the more information contained in their dictionaries, the higher the quality of the results. Text Mining for Clementine is delivered with a set of linguistic resources, such as dictionaries for terms and synonyms, libraries, and templates. This product further allows you to develop and refine these linguistic resources to your context. Fine-tuning of the linguistic resources is often an iterative process and is necessary for accurate concept retrieval and categorization. Custom templates, libraries, and dictionaries for specific domains, such as CRM and genomics, are also included.

In addition to concept extraction, category model building, cluster exploration, text link analysis, and access to Web feed and blog data as text mining input, this release also offers an independent text mining editor to fine-tune linguistic resource templates and libraries outside the context of a stream execution.

Serial Numbers

Your serial number is your identification number with SPSS Inc. You will need this serial number when you contact SPSS Inc. for information regarding support, payment, or an upgraded system. The serial number was provided with your Clementine system.

Customer Service

If you have any questions concerning your shipment or account, contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/. Please have your serial number ready for identification.


Training Seminars

SPSS Inc. provides both public and on-site training seminars. All seminars feature hands-on workshops. Seminars will be offered in major cities on a regular basis. For more information on these seminars, contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/.

Technical Support

The services of SPSS Technical Support are available to registered customers. Student Version customers can obtain technical support only for installation and environmental issues. Customers may contact Technical Support for assistance in using Clementine products or for installation help for one of the supported hardware environments. To reach Technical Support, see the SPSS Web site at http://www.spss.com or contact your local office, listed on the SPSS Web site at http://www.spss.com/worldwide/. Be prepared to identify yourself, your organization, and the serial number of your system.

Tell Us Your Thoughts

Your comments are important. Please let us know about your experiences with SPSS products. We especially like to hear about new and interesting applications using Clementine. Please send e-mail to [email protected] or write to SPSS Inc., Attn.: Director of Product Planning, 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

Contacting SPSS

If you would like to be on our mailing list, contact one of our offices listed on our Web site at http://www.spss.com/worldwide/.


Contents

Part I: Text Mining Nodes

1 Text Mining for Clementine . . . 1
    What’s New in Version 12.0 . . . 2
    Upgrading to Version 12.0 . . . 2
    About Text Mining . . . 3
    How Extraction Works . . . 5
    How Categorization Works . . . 7
    Text Mining for Clementine Nodes . . . 9
    Applications . . . 10

2 Reading in Source Text . . . 11
    File List Node . . . 11
    File List Node: Settings Tab . . . 12
    File List Node: Other Tabs . . . 13
    Using the File List Node in Text Mining . . . 13
    Scripting Properties: filelistnode . . . 15
    Web Feed Node . . . 15
    Web Feed Node: Input Tab . . . 16
    Web Feed Node: Records Tab . . . 17
    Using the Web Feed Node in Text Mining . . . 19
    Scripting Properties: webfeednode . . . 23

3 Mining for Concepts and Categories . . . 25
    Text Mining Modeling Node . . . 25
    What Are Concepts and Categories? . . . 26
    Sampling Upstream to Save Time . . . 28
    Text Mining Modeling Node: Fields Tab . . . 28
    Text Mining Node: Model Tab . . . 32
    Text Mining Modeling Node: Language Tab . . . 42
    Text Mining Node: Expert Tab . . . 44
    Using the Text Mining Modeling Node in a Stream . . . 45
    Scripting Properties: textminingnode . . . 50
    Text Mining Model Nugget . . . 53
    Model Nugget: Model Tab (Concept Model) . . . 54
    Model Nugget: Model Tab (Category Model) . . . 58
    Model Nugget: Settings Tab . . . 61
    Model Nugget: Fields Tab . . . 62
    Model Nugget: Language Tab . . . 64
    Model Nugget: Summary Tab . . . 65
    Using Text Mining Model Nuggets in a Stream . . . 66
    Scripting Properties: applytextminingnode . . . 70

4 Mining for Text Links . . . 73
    Text Link Analysis . . . 73
    Text Link Analysis Node: Fields Tab . . . 75
    Text Link Analysis Node: Language Tab . . . 80
    Text Link Analysis Node: Expert Tab . . . 81
    Text Link Analysis Node: Annotations Tab . . . 83
    Using the Text Link Analysis Node in a Stream . . . 83
    Scripting Properties: tlanode . . . 86

5 Categorizing Files and Records . . . 89
    LexiQuest Categorize Model Nugget . . . 89
    Importing a LexiQuest Categorize Model Nugget . . . 90
    LexiQuest Categorize Model Nugget: Model Tab . . . 91
    LexiQuest Categorize Model Nugget: Settings Tab . . . 92
    LexiQuest Categorize Model Nugget: Fields Tab . . . 94
    LexiQuest Categorize Model Nugget: Language Tab . . . 98
    Using the LexiQuest Categorize Model Nugget in a Stream . . . 99
    Scripting Properties: applycategorizenode . . . 103

6 Translating Text for Extraction . . . 105
    Translate Node . . . 105
    Translate Node: Fields Tab . . . 106
    Translate Node: Language Tab . . . 107
    Using the Translate Node . . . 108
    Scripting Properties: translatenode . . . 111

7 Browsing External Source Text . . . 113
    File Viewer Node . . . 113
    File Viewer Node Settings . . . 113
    Using the File Viewer Node . . . 114

Part II: Interactive Workbench

8 Interactive Workbench Mode . . . 119
    The Categories and Concepts View . . . 119
    The Clusters View . . . 123
    The Text Link Analysis View . . . 126
    The Resource Editor View . . . 130
    Setting Options . . . 131
    Options: Session Tab . . . 131
    Options: Colors Tab . . . 132
    Options: Sounds Tab . . . 133
    Microsoft Internet Explorer Settings for Help . . . 134
    Generating Model Nuggets and Modeling Nodes . . . 134
    Updating Modeling Nodes and Saving . . . 134
    Closing and Deleting Sessions . . . 135
    Keyboard Accessibility . . . 135
    Shortcuts for Dialog Boxes . . . 136

9 Extracting Concepts and Types . . . 139
    Extracted Results: Concepts and Types . . . 139
    Extracting Data . . . 142
    Extract Dialog Box: Settings Tab . . . 143
    Extract Dialog Box: Language Tab . . . 145
    Filtering Extracted Results . . . 146
    Refining Extraction Results . . . 148
    Adding Synonyms . . . 150
    Adding Concepts to Types . . . 152
    Excluding Concepts from Extraction . . . 154
    Forcing Words into Extraction . . . 155

10 Categorizing Text Data . . . 157
    The Categories Pane . . . 158
    Category Definitions . . . 160
    The Data Pane . . . 161
    Adding Columns to the Data Pane . . . 162
    Building Categories . . . 163
    Build Categories: Techniques Tab . . . 164
    Build Categories: Limits Tab . . . 166
    Concept Derivation . . . 168
    Concept Inclusion . . . 169
    Semantic Networks . . . 170
    Co-occurrence Rules . . . 172
    Creating New or Renaming Categories . . . 173
    Using Conditional Rules . . . 174
    Deleting Conditional Rules . . . 174
    Managing and Refining Categories . . . 174
    Adding to Category Definitions . . . 175
    Editing Category Definitions . . . 175
    Moving Categories . . . 176
    Merging or Combining Categories . . . 177
    Deleting Categories . . . 177

11 Analyzing Clusters . . . 179
    Building Clusters . . . 180
    Build Clusters: Settings Tab . . . 181
    Build Clusters: Limits Tab . . . 182
    Calculating Similarity Link Values . . . 183
    Exploring Clusters . . . 184
    Cluster Definitions . . . 184

12 Exploring Text Link Analysis . . . 187
    Extracting TLA Pattern Results . . . 188
    Type and Concept Patterns . . . 189
    Filtering TLA Results . . . 190
    Data Pane . . . 192

13 Visualizing Graphs . . . 195
    Category Graphs and Charts . . . 195
    Category Bar Chart . . . 196
    Category Web Graph . . . 197
    Category Web Table . . . 197
    Cluster Graphs . . . 198
    Concept Web Graph . . . 199
    Cluster Web Graph . . . 199
    Text Link Analysis Graphs . . . 200
    Concept Web Graph . . . 201
    Type Web Graph . . . 201
    Using Graph Toolbars . . . 202
    Editing Graphs . . . 203
    General Rules for Editing Graphs . . . 204
    Editing and Formatting Text . . . 204
    Changing Colors, Patterns, and Dashings . . . 205
    Rotating and Changing the Shape and Aspect Ratio of Point Elements . . . 206
    Changing the Size of Graphic Elements . . . 207
    Specifying Margins and Padding . . . 207
    Changing the Position of the Legend . . . 208
    Keyboard Shortcuts . . . 208

14 Session Resource Editor . . . 209
    Editing Resources in the Resource Editor . . . 209
    Making and Updating Templates . . . 210
    Switching Resources . . . 211

Part III: Templates and Resources

15 Templates and Resources . . . 215
    Template Editor vs. Resource Editor . . . 215
    Available Resource Templates . . . 216
    The Editor Interface . . . 217
    Opening Templates . . . 218
    Saving Templates . . . 219
    Updating Node Resources After Loading . . . 220
    Managing Templates . . . 221
    Importing and Exporting Templates . . . 222
    Exiting the Template Editor . . . 224
    Backing Up Resources . . . 224
    Importing Resource Files . . . 226

16 Working with Libraries . . . 229
    Shipped Libraries . . . 230
    Creating Libraries . . . 231
    Adding Public Libraries . . . 232
    Finding Terms and Types . . . 233
    Viewing Libraries . . . 234
    Managing Local Libraries . . . 234
    Renaming Local Libraries . . . 234
    Disabling Local Libraries . . . 235
    Deleting Local Libraries . . . 235
    Managing Public Libraries . . . 236
    Sharing Libraries . . . 238
    Publishing Libraries . . . 239
    Updating Libraries . . . 240
    Resolving Conflicts . . . 240

17 About Library Dictionaries . . . 243
    Type Dictionaries . . . 243
    Built-in Types . . . 244
    Creating Types . . . 245
    Adding Terms . . . 247
    Forcing Terms . . . 250
    Renaming Types . . . 251
    Moving Types . . . 252
    Disabling Types . . . 252
    Deleting Types . . . 252
    Substitution Dictionaries . . . 253
    Adding Synonyms . . . 254
    Adding Optional Elements . . . 256
    Disabling Substitutions . . . 257
    Deleting Substitutions . . . 258
    Exclude Dictionaries . . . 258
    Adding Entries . . . 259
    Disabling Entries . . . 260
    Deleting Entries . . . 260

18 About Advanced Resources . . . 261
    Editing Advanced Resources . . . 262
    Finding . . . 264
    Replacing . . . 265
    Fuzzy Grouping . . . 266
    Classification Exceptions . . . 266
    Link Exceptions . . . 267
    Excluded Types . . . 267
    Nonlinguistic Entities . . . 267
    Configuration . . . 268
    Regular Expression Definitions . . . 269
    Normalization . . . 270
    Type Dictionary Maps . . . 270
    Language Handling . . . 271
    Dynamic POS Patterns . . . 272
    Forced POS Definitions . . . 272
    Abbreviations . . . 274
    Language Identifier . . . 274
    Properties . . . 274
    Languages . . . 275
    Text Link Analysis Rules . . . 275
    Variable Syntax . . . 276
    Macro Syntax . . . 278
    Pattern Syntax . . . 280
    Multistep Processing . . . 282

Index . . . 285


Part I: Text Mining Nodes


Chapter 1

Text Mining for Clementine

Text Mining for Clementine is a fully integrated add-on for Clementine that requires a separate license. Text Mining for Clementine uses advanced linguistic technologies and Natural Language Processing (NLP) to rapidly process a large variety of unstructured text data and, from this text, extract and organize the key concepts. Furthermore, Text Mining for Clementine can group these concepts into categories.

Around 80% of data held within an organization is in the form of text documents—for example, reports, Web pages, e-mails, and call center notes. Text is a key factor in enabling an organization to gain a better understanding of their customers’ behavior. A system that incorporates NLP can intelligently extract concepts, including compound phrases. Moreover, knowledge of the underlying language allows classification of terms into related groups, such as products, organizations, or people, using meaning and context. As a result, you can quickly determine the relevance of the information to your needs. Extracted concepts and categories can be combined with existing structured data, such as demographics, and applied to modeling using Clementine’s full suite of data mining tools to yield better and more-focused decisions.

Linguistic systems are knowledge sensitive—the more information contained in their dictionaries, the higher the quality of the results. Text Mining for Clementine is delivered with a set of linguistic resources, such as dictionaries for terms and synonyms, libraries, and templates. This product further allows you to develop and refine these linguistic resources to your context. Fine-tuning of the linguistic resources is often an iterative process and is necessary for accurate concept retrieval and categorization. Custom templates, libraries, and dictionaries for specific domains, such as CRM and genomics, are also included.

In addition to concept extraction, category model building, cluster exploration, text link analysis, and access to Web feed and blog data as text mining input, this release also offers an independent text mining editor to fine-tune linguistic resource templates and libraries outside the context of a stream execution.

Deployment. You can deploy text mining streams using the Clementine Solution Publisher for real-time scoring of unstructured data. The ability to deploy these streams ensures successful, closed-loop text mining implementations. For example, your organization can now analyze scratch-pad notes from inbound or outbound callers by applying your predictive models to increase the accuracy of your marketing message in real time.

Automated translation of supported languages. Text Mining for Clementine, in conjunction with Language Weaver, enables you to translate text from a list of supported languages, including Arabic, Chinese, and Persian, into English. You can then perform your text analysis on translated text and deploy these results to people who could not have understood the contents of the source languages. Since the text mining results are automatically linked back to the corresponding foreign-language, or source, text, your organization can then focus the much-needed native speaker resources on only the most significant results of the analysis. Language Weaver offers automatic language translation using statistical translation algorithms that resulted from 20 person-years of advanced translation research.

What’s New in Version 12.0

This release of Text Mining for Clementine adds the following features:

Text Mining Template Editor available on the main Clementine toolbar. The editor is now directly accessible from the main Clementine toolbar (instead of having to go through an interactive workbench session). Use it to create and edit templates or libraries, from which you can load and copy resources into your text mining nodes and sessions. For more information, see “Templates and Resources” in Chapter 15 on p. 215.

Extraction results caching. You can choose to update the Text Mining node with extraction results during an interactive workbench session for reuse later. Use these cached extraction results to bypass upstream processing and the time it takes to reextract. So now you can start your next interactive session with the same data and extraction results that you last saved. For more information, see “Updating Modeling Nodes and Saving” in Chapter 8 on p. 134.

Linguistic resource enhancements. SPSS Inc.’s Natural Language Processing (NLP) technology has been enhanced for supported languages. Opinions templates now exist for English, Dutch, French, German, and Spanish.

Text Mining node palette in Clementine. All Text Mining for Clementine nodes are now available on their own Text Mining palette in Clementine.

Broader OS support. Now you can use Text Mining for Clementine with Microsoft® Windows Vista® Business or Home Basic (32- and 64-bit).

Upgrading to Version 12.0

Upgrading from Text Mining for Clementine 5.0

If you are upgrading to Text Mining for Clementine 12.0 from version 5.0 or later, begin by installing version 12.0 before uninstalling version 5.0 to ensure that your templates and published libraries are migrated to version 12.0. Any shipped libraries and templates from a previous release will be marked as such to differentiate them. If you no longer need the older versions, you can delete them.

Important! If you uninstall Text Mining for Clementine 5.0 before installing this new version, any template and public library work performed in version 5.0 will be lost and unable to be migrated to version 12.0.

Upgrading from Text Mining for Clementine 4.1 or earlier

If you are upgrading to Text Mining for Clementine 12.0 from version 4.1 or earlier, consider stream updates and node replacements. For this upgrade, any preexisting streams that contain the older nodes will no longer be fully executable until you update the nodes. Certain improvements in the Clementine 11.0 release required older nodes to be replaced with the newer versions, which are both more deployable and more powerful. For more information, see “Text Mining for Clementine Nodes” on p. 9. Note: The old Text Extraction node was replaced by the Text Mining modeling node.

Text Mining Builder 2.0 Library Migration

Any of the public (or published) libraries from Text Mining Builder 2.0 found on your machine are migrated during installation to Text Mining for Clementine.

Note: If you do not want these resources migrated, you can delete them before installing Text Mining for Clementine or shut down the MySQL service for Text Mining Builder so they are inaccessible. Contact your system administrator for help.

About Text Mining

Today an increasing amount of information is being held in unstructured and semistructured formats, such as customer e-mails, call center notes, open-ended survey responses, news feeds, Web forms, etc. This abundance of information poses a problem to many organizations that ask themselves, “How can we collect, explore, and leverage this information?”

Text mining is the process of analyzing collections of textual materials in order to capture key concepts and themes and uncover hidden relationships and trends without requiring that you know the precise words or terms that authors have used to express those concepts. Although they are quite different, text mining is sometimes confused with information retrieval. While the accurate retrieval and storage of information is an enormous challenge, the extraction and management of quality content, terminology, and relationships contained within the information are crucial and critical processes.

Text Mining and Data Mining

For each article of text, linguistic-based text mining returns an index of concepts, as well as information about those concepts. This distilled, structured information can be combined with other data sources to address questions such as:

Which concepts occur together?
What else are they linked to?
What higher-level categories can be made from extracted information?
What do the concepts or categories predict?
How do the concepts or categories predict behavior?

Combining text mining with data mining offers greater insight than is available from either structured or unstructured data alone. This process typically includes the following steps:

1. Identify the text to be mined. Prepare the text for mining. If the text exists in multiple files, save the files to a single location. For databases, determine the field containing the text.

2. Mine the text and extract structured data. Apply the text mining algorithms to the source text.


3. Build concept and category models. Identify the key concepts and/or create categories. The number of concepts returned from the unstructured data is typically very large. Identify the best concepts and categories for scoring.

4. Analyze the structured data. Employ traditional data mining techniques, such as clustering, classification, and predictive modeling, to discover relationships between the concepts. Merge the extracted concepts with other structured data to predict future behavior based on the concepts.

Text Analysis and Categorization

Text analysis, a form of qualitative analysis, is the extraction of useful information from text so that the key ideas or concepts contained within this text can be grouped into an appropriate number of categories. Text analysis can be performed on all types and lengths of text, although the approach to the analysis will vary somewhat.

Shorter records or documents are most easily categorized, since they are not as complex and usually contain fewer ambiguous words and responses. For example, with short, open-ended survey questions, if we ask people to name their three favorite vacation activities, we might expect to see many short answers, such as going to the beach, visiting national parks, or doing nothing. Longer, open-ended responses, on the other hand, can be quite complex and very lengthy, especially if respondents are educated, motivated, and have enough time to complete a questionnaire. If we ask people to tell us about their political beliefs in a survey or have a blog feed about politics, we might expect some lengthy comments about all sorts of issues and positions.

The ability to extract key concepts and create insightful categories from these longer text sources in a very short period of time is a key advantage of using Text Mining for Clementine. This advantage is obtained through the combination of automated linguistic and statistical techniques to yield the most reliable results for each stage of the text analysis process.

Linguistic Processing and NLP

The primary problem with the management of all of this unstructured text data is that there are no standard rules for writing text so that a computer can understand it. The language, and consequently the meaning, varies for every document and every piece of text. The only way to accurately retrieve and organize such unstructured data is to analyze the language and thus uncover its meaning. There are several different automated approaches to the extraction of concepts from unstructured information. These approaches can be broken down into two kinds, linguistic and nonlinguistic.

Some organizations have tried to employ automated nonlinguistic solutions based on statistics and neural networks. Using computer technology, these solutions can scan and categorize key concepts more quickly than human readers can. Unfortunately, the accuracy of such solutions is fairly low. Most statistics-based systems simply count the number of times words occur and calculate their statistical proximity to related concepts. They produce many irrelevant results, or noise, and miss results they should have found, referred to as silence.

To compensate for their limited accuracy, some solutions incorporate complex nonlinguistic rules that help to distinguish between relevant and irrelevant results. This is referred to as rule-based text mining.


Linguistics-based text mining, on the other hand, applies the principles of natural language processing (NLP)—the computer-assisted analysis of human languages—to the analysis of words, phrases, and syntax, or structure, of text. A system that incorporates NLP can intelligently extract concepts, including compound phrases. Moreover, knowledge of the underlying language allows classification of concepts into related groups, such as products, organizations, or people, using meaning and context.

Linguistics-based text mining finds meaning in text much as people do—by recognizing a variety of word forms as having similar meanings and by analyzing sentence structure to provide a framework for understanding the text. This approach offers the speed and cost-effectiveness of statistics-based systems, but it offers a far higher degree of accuracy while requiring far less human intervention.

To illustrate the difference between statistics-based and linguistics-based approaches during the extraction process, consider how each would respond to a query about reproduction of documents. Both statistics-based and linguistics-based solutions would have to expand the word reproduction to include synonyms, such as copy and duplication. Otherwise, relevant information will be overlooked. But if a statistics-based solution attempts to do this type of synonymy—searching for other terms with the same meaning—it is likely to include the term birth as well, generating a number of irrelevant results. The understanding of language cuts through the ambiguity of text, making linguistics-based text mining, by definition, the more reliable approach.

Linguistic systems are knowledge sensitive—the more information contained in their dictionaries, the higher the quality of the results. Modification of the dictionary content, such as synonym definitions, can simplify the resulting information. This is often an iterative process and is necessary for accurate concept retrieval. NLP is a core element of Text Mining for Clementine.

How Extraction Works

During the extraction of key concepts and ideas from your text data, Text Mining for Clementine relies on linguistics-based text analysis. Understanding how the extraction process works can help you make key decisions when fine-tuning your linguistic resources (libraries, types, synonyms, etc.). Steps in the extraction process include:

Inputting data conversion into a standard format.
Identifying candidate terms.
Identifying equivalence classes and integration of synonyms.
Assigning type.
Indexing.
Matching patterns and events extraction.

Step 1. Inputting data conversion into a standard format

In this first step, the data you import is converted to a uniform format that can be used for further analysis. This conversion is performed internally and does not change your original data.

Step 2. Identifying candidate terms


It is important to understand the role of linguistic resources in the identification of candidate terms during linguistic extraction. Linguistic resources are used every time an extraction is run. They exist in the form of shipped templates, libraries, and compiled resources. Libraries include lists of words, relationships, and other information used to specify or tune the extraction. The compiled resources cannot be viewed or edited. However, the remaining resources (templates) can be edited in the Template Editor or, if you are in an interactive workbench session, in the Resource Editor.

Compiled resources are core, internal components of the extractor engine within Text Mining for Clementine. These resources include a general dictionary containing a list of base forms with a part-of-speech code (noun, verb, adjective, adverb, participle, coordinator, determiner, or preposition). The resources also include reserved, built-in types used to assign many extracted terms to the following term types: Location, Organization, Person, or Product. For more information, see "Built-in Types" in Chapter 17 on p. 244.

In addition to the compiled resources, several shipped libraries are delivered and used in projects to complement the types and term definitions in the compiled resources, as well as to offer other types and synonyms. These libraries—and any custom ones you create—are made up of several dictionaries. These include type dictionaries, substitution dictionaries (synonyms and optional elements), and exclude dictionaries. For more information, see "Working with Libraries" in Chapter 16 on p. 229.

Once the data have been imported and converted, the extractor engine will begin identifying candidate terms for extraction. Candidate terms are words or groups of words that are used to identify concepts in the text. During the processing of the text, single words (uniterms) that are not in the compiled resources are considered as candidate term extractions, and candidate compound words (multiterms) are identified using hard-coded or dynamic part-of-speech pattern extractors. For example, the multiterm sports car, which follows the “adjective noun” part-of-speech pattern, has two components. The multiterm fast sports car, which follows the “adjective adjective noun” part-of-speech pattern, has three components. There are about 30 patterns, and the maximum pattern size is about six components.

Note: The terms in the aforementioned compiled general dictionary represent a list of all of the words that are likely to be uninteresting or linguistically ambiguous as uniterms. These words are excluded from extraction when you are identifying the uniterms. However, they are reevaluated when you are determining parts of speech or looking at longer candidate compound words (multiterms).

Finally, a special algorithm is used to handle uppercase letter strings, such as job titles, so that these special patterns can be extracted.

Step 3. Identifying equivalence classes and integration of synonyms

After candidate uniterms and multiterms are identified, the software uses a set of algorithms to compare them and identify equivalence classes. An equivalence class is a base form of a phrase or a single form of two variants of the same phrase. The purpose of assigning phrases to equivalence classes is to ensure that, for example, president of the company and company president are not treated as separate concepts. To determine which concept to use for the equivalence class—that is, whether president of the company or company president is used as the lead term—the extractor component applies the following rules in the order listed:

The user-specified form in a library.
The most frequent form in the full body of text.
The shortest form in the full body of text (which usually corresponds to the base form).

Step 4. Assigning type

Next, types are assigned to extracted concepts. A type is a semantic grouping of concepts. Both compiled resources and the libraries are used in this step. Types include such things as higher-level concepts, positive and negative words and qualifiers, contextual qualifiers, first names, places, organizations, and more. Additional types can be defined by the user. For more information, see "Type Dictionaries" in Chapter 17 on p. 243.

Step 5. Indexing

The entire set of documents or records is reindexed by establishing a pointer between a text position and the representative term for each equivalence class. This assumes that all of the inflected form instances of a candidate concept are indexed as a candidate base form. The global frequency is calculated for each base form.

Step 6. Matching patterns and events extraction

Text Mining for Clementine can discover not only types and concepts but also relationships among them. Several algorithms and libraries are available with this product and provide the ability to extract relationship patterns between types and concepts. They are particularly useful when attempting to discover specific opinions (for example, product reactions) or the relational links between two people or objects (for example, links between political groups or genomes).

How Categorization Works

When creating category models in Text Mining for Clementine, there are several different techniques you can choose to create categories. Because every dataset is unique, the number of techniques and the order in which you apply them may change. Since your interpretation of the results may be different from someone else’s, you may need to experiment with the different techniques to see which one produces the best results for your text data.

In Text Mining for Clementine, you have the option of creating a category model directly from the node or launching a workbench session in which you can explore and fine-tune your categories further. If you create a category model directly from the node, you have less control over the output; however, you can still select the classification techniques and the resource template to be used.

In this guide, classification refers to the generation of category definitions through the use of a built-in technique, and categorization refers to the scoring, or labeling, process whereby unique identifiers (name/ID/value) are assigned to the category definitions for each document or record. Both categorization and classification happen simultaneously.

During classification, the concepts and types that were extracted are used as the building blocks for your categories. When you build categories, the documents or records are automatically assigned to categories if they contain text that matches an element of a category’s definition.

Text Mining for Clementine offers you several automated classification techniques to help you categorize your documents or records quickly.


Concept Grouping Techniques. Each of the techniques is well suited to certain types of data and situations, but often it is helpful to combine techniques in the same analysis to capture the full range of documents or records. In the interactive workbench, the concepts and types that were grouped into a category are still available for classification the next time you build categories. This means that you may see a concept in multiple categories or find redundant categories. You can exclude concepts from being grouped together by any of these techniques by defining them as antilinks. For more information, see "Link Exceptions" in Chapter 18 on p. 267.

Concept derivation. This technique creates categories by taking a concept and finding other concepts that are related to it through analyzing whether any of the concept components are morphologically related. For example, the concept opportunities to advance would be grouped with the concepts opportunity for advancement and advancement opportunity. This technique is very useful for identifying synonymous compound word concepts, since the concepts in each category generated are synonyms or closely related in meaning. It works with data of varying lengths and generates a smaller number of compact categories. For more information, see "Concept Derivation" in Chapter 10 on p. 168.

Concept inclusion. This technique creates categories by taking a concept and finding other concepts that include it. This technique works best in combination with semantic networks but can be used separately. This is performed using lexical series algorithms, which identify concepts included in other concepts. A concept series based on inclusion often corresponds to a taxonomic hierarchy (a semantic ISA relationship). This technique begins by identifying single-word or compound-word concepts that are included in other compound-word concepts (and positioned as suffix, prefix, or optional elements) and then grouping them together into one category. When determining inclusion, the algorithm ignores word order and the presence of function words, such as in or of. This technique works with data of varying lengths and generates a larger number of compact categories. For example, seat would be grouped with safety seat, seat belt, and infant seat carrier. For more information, see "Concept Inclusion" in Chapter 10 on p. 169.

Semantic network. This technique creates categories by grouping concepts based on an extensive index of word relationships. This technique applies to English language text only. It begins by identifying the possible senses of each concept in the semantic network. Concept senses that are synonyms or hyponyms are grouped into a single category. This technique can produce very good results when the terms are known to the semantic network and are not too ambiguous. It is less helpful when the text contains a large amount of specialized, domain-specific terminology unknown to the network. In the early stages of creating categories, you may want to use this technique by itself to see what sort of categories it produces. To help you produce better results, you can choose from two profiles for this technique, Wider and Narrow. For more information, see "Semantic Networks" in Chapter 10 on p. 170.

Co-occurrence rules. This technique creates one category with each co-occurrence rule generated. A co-occurrence rule is a type of conditional rule that groups words that occur together often within records, since this generally signals a relationship between them. For example, if many records include the words apples and oranges, these concepts could be grouped into a co-occurrence rule. The technique looks for concepts that tend to appear together in documents. Two concepts strongly co-occur if they frequently appear together in a set of documents and rarely separately in any of the other documents. This technique can produce good results with larger datasets with at least several hundred documents or records. For more information, see "Co-occurrence Rules" in Chapter 10 on p. 172.

One category for each of the top [n] types. If you do not choose to use Concept Grouping techniques, you can create categories based on type frequency. Frequency represents the number of documents or records containing concepts from the extracted type in question. This technique allows you to get one category for each frequently occurring type. This technique works best when the data contain straightforward lists or simple, one-word concepts. Applying this technique to types allows you to obtain a quick view of the broad range of documents and records present. Note that the Unknown type is not included here and will not be used to create a category.

Text Mining for Clementine Nodes

Along with the many standard nodes delivered with Clementine, you can also work with text mining nodes to incorporate the power of text analysis into your streams. Text Mining for Clementine offers you several text mining nodes to do just that.

The File List source node generates a list of document names as input to the text mining process. This is useful when the text resides in external documents rather than in a database or other structured file. The node outputs a single field with one record for each document or folder listed, which can be selected as input in a subsequent Text Mining node. For more information, see "File List Node" in Chapter 2 on p. 11.

The Web Feed source node makes it possible to read in text from Web feeds, such as blogs or news feeds in RSS or HTML formats, and use this data in the text mining process. The node outputs one or more fields for each record found in the feeds, which can be selected as input in a subsequent Text Mining node. For more information, see "Web Feed Node" in Chapter 2 on p. 15.

The Text Mining node uses linguistic methods to extract key concepts from the text, allows you to create categories with these concepts and other data, and offers the ability to identify relationships and associations between concepts based on known patterns (called text link analysis). The node can be used to explore the text data contents or to produce either a concept model or category model. The concepts and categories can be combined with existing structured data, such as demographics, and applied to modeling. For more information, see "Text Mining Modeling Node" in Chapter 3 on p. 25.

The Text Link Analysis node extracts concepts and also identifies relationships between concepts based on known patterns within the text. Pattern extraction can be used to discover relationships between your concepts, as well as any opinions or qualifiers attached to these concepts. The Text Link Analysis node offers a more direct way to identify and extract patterns from your text and then add the pattern results to the dataset in the stream. But you can also perform TLA using an interactive workbench session in the Text Mining modeling node. For more information, see "Text Link Analysis" in Chapter 4 on p. 73.


LexiQuest Categorize models assign documents or records to a predefined set of categories according to the text they contain. These models can be created in LexiQuest Categorize version 3.2 or later and imported into Clementine for purposes of scoring. For example, a document might be assigned to a bread category based on the concepts yeast, flour, and sourdough. LexiQuest Categorize models are similar to Text Mining models except that with LexiQuest Categorize models, a prediction is returned. For more information, see "LexiQuest Categorize Model Nugget" in Chapter 5 on p. 89.

The Translate node can be used to translate text from supported languages, such as Arabic, Chinese, and Persian, into English or other languages for purposes of modeling. This makes it possible to mine documents in double-byte languages that would not otherwise be supported and allows analysts to extract concepts from these documents even if they are unable to speak the language in question. The same functionality can be invoked from any of the text modeling nodes, but use of a separate Translate node makes it possible to cache and reuse a translation in multiple nodes. For more information, see "Translate Node" in Chapter 6 on p. 105.

Applications

In general, anyone who routinely needs to review large volumes of documents to identify key elements for further exploration can benefit from Text Mining for Clementine.

Some specific applications include:

Scientific and medical research. Explore secondary research materials, such as patent reports, journal articles, and protocol publications. Identify associations that were previously unknown (such as a doctor associated with a particular product), presenting avenues for further exploration. Minimize the time spent in the drug discovery process. Use as an aid in genomics research.

Investment research. Review daily analyst reports, news articles, and company press releases to identify key strategy points or market shifts. Trend analysis of such information reveals emerging issues or opportunities for a firm or industry over a period of time.

Fraud detection. Use in banking and health-care fraud to detect anomalies and discover red flags in large amounts of text.

Market research. Use in market research endeavors to identify key topics in open-ended survey responses.

Blog and Web feed analysis. Explore and build models using the key ideas found in news feeds, blogs, etc.

CRM. Build models using data from all customer touch points, such as e-mail, transactions, and surveys.

Chapter 2

Reading in Source Text

Data for text mining may reside in any of the standard formats used by Clementine, including databases or other "rectangular" formats that represent data in rows and columns, or in document formats, such as Microsoft Word, PDF, or HTML, that do not conform to this structure.

To read in text from documents that do not conform to standard data structure, including Microsoft Word, Excel, and PowerPoint, as well as PDF, XML, HTML, and others, the File List node can be used to generate a list of documents or folders as input to the text mining process. For more information, see "File List Node" on p. 11.

To read in text from Web feeds, such as blogs or news feeds in RSS or HTML formats, the Web Feed node can be used to format Web feed data for input into the text mining process. For more information, see "Web Feed Node" on p. 15.

To read in text from any of the standard data formats used by Clementine, such as a database with one or more text fields for customer comments, any of the standard source nodes native to Clementine can be used. See the Clementine node documentation for more information.

File List Node

To read in text from unstructured documents saved in formats such as Microsoft Word, Excel, and PowerPoint, as well as PDF, XML, HTML, and others, the File List node can be used to generate a list of documents or folders as input to the text mining process. This is necessary because unstructured text documents cannot be represented by fields and records—rows and columns—in the same manner as other data used by Clementine. This node can be found on the Text Mining palette.

Note: Text mining extraction cannot process Office and PDF files under non-Windows platforms. However, XML, HTML, or text files can always be processed.

The File List node functions as a source node, except that instead of reading the actual data, the node reads the names of the documents or directories below the specified root and produces these as a list. The output is a single field, with one record for each document or folder listed, which can be selected as input for a subsequent Text Mining node.

List of files. By default, the File List node creates a list of files. This output works well for smaller sets of files, such as files less than 25K. An advantage to using List of files is that you can exclude certain supported file types by deselecting them in the Extension list.

List of directories. With a larger collection of files, we recommend that you create a list of directories. This will shorten the extraction time significantly since a prescanning step of the entire list of files is skipped. All supported file types are included.


Figure 2-1: Text Mining palette

File List Node: Settings Tab

Figure 2-2: File List node dialog box: Settings tab

Directory. Specifies the root folder containing the documents that you want to list.

Include subdirectories. Specifies that subdirectories should also be scanned.

Create List of files/directories. Specifies whether files or directories should be listed. If you expect the contents of the directory or subdirectories to change over time or when working with large collections of files, select List of directories. This will shorten the extraction time significantly since a prescanning step of the entire list of files is skipped. If you want to exclude certain file types, use List of files and deselect those file types in the Extension list.

Extension list. You can select or deselect the file types and extensions you want to use. By deselecting a file extension, the files with that extension are ignored. You can filter by the following extensions:

.rtf, .doc
.xls
.ppt
.txt, .text
.htm, .html, .shtml
.xml
.pdf
.$


Note: Text mining extraction cannot process Office and PDF files under non-Windows platforms. However, XML, HTML, or text files can always be processed.

File List Node: Other Tabs

The Types tab is a standard tab in Clementine nodes, as is the Annotations tab.

Using the File List Node in Text Mining

The File List node is used when the text data resides in external unstructured documents in formats such as Microsoft Word, Excel, and PowerPoint, as well as PDF, XML, HTML, and others. This node is used to generate a list of documents or folders as input to the text mining process (a subsequent Text Mining or Text Link Analysis node).

If you use the File List node, make sure to specify that the Text field represents pathnames to documents in the Text Mining or Text Link Analysis node to indicate that rather than containing the actual text you want to mine, the selected field contains paths to the documents where the text is located.

In the following example, we connected a File List node to a Text Mining node in order to supply text that resides in external documents.

Figure 2-3: Example stream: File List (source) node with the Text Mining (modeling) node

► File List node: Settings tab. First, we added this node to the stream to specify where the text documents are stored. We selected the directory containing all of the documents on which we want to perform text mining.


Figure 2-4: File List node dialog box: Settings tab

► Text Mining node: Fields tab. Next, we added and connected a Text Mining node to the File List node. In this node, we defined our input format, resource template, and output format. We selected the field name produced from the File List node and selected the option Text field represents pathnames to documents, as well as other settings. For more information, see "Using the Text Mining Modeling Node in a Stream" in Chapter 3 on p. 45.

Figure 2-5: Text Mining node dialog box: Fields tab


For more information on using the Text Mining node, see Chapter 3.

Scripting Properties: filelistnode

You can use the properties in the following table for scripting. The node itself is called filelistnode.

Table 2-1: File List node scripting properties

Scripting properties    Data type
Path                    string
Recurse                 flag
CreateList              Directory
                        File
WordProcessing          flag
ExcelFile               flag
PowerpointFile          flag
TextFile                flag
WebPage                 flag
XMLFile                 flag
PDFFile                 flag
LongExtension           flag
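As an illustration, these properties could be set from a Clementine stream script. The following is a minimal sketch assuming the classic create/set stream-scripting syntax; the directory path is hypothetical, and the exact property value forms may vary by release.

create filelistnode
# Root folder containing the documents to list (hypothetical path)
set :filelistnode.Path = "C:/TextData/Feedback"
# Also scan subdirectories
set :filelistnode.Recurse = true
# Build a list of individual files so that unwanted file types can be filtered out
set :filelistnode.CreateList = File
# Skip PDF files, which cannot be processed on non-Windows platforms
set :filelistnode.PDFFile = false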

Web Feed Node

The Web Feed node can be used to prepare text data from Internet Web feeds for the text mining process. This node accepts Web feeds in two formats:

RSS Format. RSS is a simple XML-based standardized format for Web content. It is commonly used for content from syndicated news sources and Weblogs, for example. For RSS formatted feeds, you can copy the URL from the address bar of your Web browser and paste it onto the Input tab of this node. Since RSS is a standardized format, no further input is required for you to be able to identify the important text data and the records from the feed.

HTML Format. For each HTML page defined on the Input tab of this node, you can use the source code to define the delimiters that distinguish each record on a page, as well as other delimiters identifying information such as the author, special dates, and main content.

The output of this node is a set of fields used to describe the records. In the text mining process, the Description field is generally the most commonly used field since it contains the bulk of the text content. However, you may also be interested in other fields, such as the short description of a record (Short Desc field) or the record’s title (Title field). Any of the output fields can be selected as input for a subsequent Text Mining node.

The Web Feed node is installed with Text Mining for Clementine and can be found on the Source Node palette.


Figure 2-6: Text Mining palette

Web Feed Node: Input Tab

The Input tab is used to specify one or more URLs to Web feeds. In the context of text mining, you could specify URLs for feeds that contain text data.

Figure 2-7: Web Feed node dialog box: Input tab

You can set the following parameters:

Enter or paste URLs. In this field, you can type or paste one or more URLs. If you are entering more than one, enter only one per line and use the Enter/Return key to separate lines. Enter the full URL path to the file from which the record content was obtained. These URLs can be for feeds in one of two formats:

RSS feeds. The URL for this format points to a page that has a set of linked articles. Each linked article can be automatically identified and treated as a separate record in the resulting data stream.

HTML feeds. The URL for this format is the path to the HTML page itself. You must define the start tag for each record in the Record Start Tag field on the Records tab. For more information, see "Web Feed Node: Records Tab" on p. 17.

Number of most recent entries to read per URL. This field specifies the maximum number of records to read for each URL listed in the field, starting with the first record found in the feed.


Save and reuse previous web feeds when possible. This option specifies that Text Mining for Clementine will scan the feeds and cache the processed results. Then, upon subsequent stream executions, the product can check if the feed contents have been updated. If the contents of a given feed have not changed or if the feed is inaccessible (an Internet outage, for example), the cached version is used to speed processing time. Any new content discovered in these feeds is also cached for the next time you execute the node.

Label. If you select Save and reuse previous web feeds when possible, you must specify a label name for the results. This label is used to identify the previously cached feeds on the server. If no label is specified, a warning will be added to the Stream Properties when you execute the stream, and no reuse will be possible.

Web Feed Node: Records Tab

The Records tab is used to define the HTML tags to be used by the node to identify where each new record begins, as well as other relevant information regarding each record. You must define these tags for each individual HTML feed. In the case where you have included an RSS formatted feed, you are not required to define any of these tags since RSS is a standardized format. You can, however, still preview the information presented in either format.

Figure 2-8: Web Feed node dialog box: Records tab


URL. This drop-down list contains a list of URLs entered on the Input tab. Both HTML and RSS formatted feeds are present. If the URL address is too long for the drop-down list, it will automatically be clipped in the middle using an ellipsis to replace the clipped text, such as http://www.spss.com/example/start-of-address...rest-of-address/path.htm.

With HTML formatted feeds, if the feed contains more than one record (or entry), you can define which HTML tags contain the data corresponding to the field shown in the table. For example, you can define the start tag that indicates a new record has started, a modified date tag, or an author name.

With RSS formatted feeds, you are not prompted to enter any tags since RSS is a standardized format. You can, however, still view sample results on the Preview tab.

Source tab. On this tab, you can view the source code for any HTML feeds. This code is not editable. You can use the Find field to locate specific tags or information on this page that you can then copy and paste into the table below. The Find field is not case sensitive and will match partial strings.

Preview tab. On this tab, you can preview how a record will be read by the Web Feed node. This is particularly useful for HTML feeds since you can change how a record will be read by defining HTML tags in the table below the Preview tab.

Record start tag. The HTML tag you define here is used to indicate the beginning of a record (such as an article or blog entry). If you do not define one for an HTML feed, the entire page is treated as one single record, the entire contents are displayed in the Description field, and the node execution date is used as both the Modified Date and the Published Date.

Field table. In this table, you can define additional tags for HTML feeds if you want to be able to identify specific types of information within a given record. A predefined set of fields is available in the table. Enter the start tag only. All matches are done by parsing the HTML and matching the table contents to the tag names and attributes found in the HTML.

When you enter a tag into the table, the feed is scanned using this tag as the minimum tag to match rather than an exact match. That is, if you entered <div>, this would match any div tag in the feed, including those with specified attributes (such as <div class="post three">), such that <div> is equal to the root tag (<div>) and any derivative that includes an attribute.

Table entry          Match 1            Match 2                                      Does not match
<div>                <div>              <div class="post">                           a non-div tag
<p class="auth">     <p class="auth">   <p color="black" class="auth" id="85643">    <p color="black">

When a tag is specified in the table along with an attribute (for example, <p class="auth">), it will produce a match only if the tag in the feed also contains that attribute (class="auth"). What is listed in the table is the root match required for the tag, although the actual HTML tag may contain additional attributes in any given order. You can identify the start tag for the following fields:

Title.
Short Desc.
Description. If left blank, this field will contain all other content in either the <body> tag (if there is a single record) or the content found inside the current record (when a record delimiter has been specified).
Author.
Contributors.
Published Date. If left blank, this field will contain the date when the node reads the data.
Modified Date. If left blank, this field will contain the date when the node reads the data.

You can use the buttons at the bottom to copy the tags you have defined and reuse them for other feeds.

Using the Web Feed Node in Text Mining

The Web Feed node can be used to prepare text data from Internet Web feeds for the text mining process. This node accepts Web feeds in either an HTML or RSS format. These feeds serve as input into the text mining process (a subsequent Text Mining or Text Link Analysis node).

If you use the Web Feed node, you must make sure to specify that the Text field represents Actual text in the Text Mining or Text Link Analysis node to indicate that these feeds link directly to each article or blog entry.

In the following example, we connect a Web Feed node to a Text Mining node in order to supply text data in the form of a Web feed into the text mining process.

Figure 2-9: Example stream: Web Feed (source) node with the Text Mining (modeling) node

► Web Feed node: Input tab. First, we added this node to the stream to specify where the feed contents are located and to verify the content structure. On the first tab, we provided the URLs (or addresses) to each feed. Please note that this URL example is fictitious.


Figure 2-10: Web Feed node dialog box: Input tab

► Web Feed node: Records tab. Since our example is for an RSS feed, the formatting is already defined, and we do not need to make any changes on the Records tab.


Figure 2-11: Web Feed node dialog box: Records tab

► Text Mining node: Fields tab. Next, we added and connected a Text Mining node to the Web Feed node. On this tab, we defined our input format and selected a text field produced by the Web Feed node. In this case, we selected the Description field. We also selected the option Text field represents actual text, as well as other settings.


Figure 2-12: Text Mining node dialog box: Fields tab

► Text Mining node: Model tab. Next, on the Model tab, we defined our modeling choices and resource template. In this example, we chose to build a concept model directly from this node.

Figure 2-13: Text Mining node: Model tab


For more information on using the Text Mining node and the next steps, see Chapter 3.

Scripting Properties: webfeednode

You can use the properties in the following table for scripting. The node itself is called webfeednode.

Table 2-2: Web Feed node scripting properties

Scripting properties    Data type                      Property description
url                     string1 string2 ... stringn    Each URL is specified in the list structure.
use_previous            flag
use_previous_label      string
limit_entries           integer
urln.title              string                         For each URL in the list, you must define one here too. The first one will be url1.title, where the number matches its position in the URL list.
urln.shortdesc          string                         Same as for urln.title.
urln.description        string                         Same as for urln.title.
urln.author             string                         Same as for urln.title.
urln.contributors       string                         Same as for urln.title.
urln.pub_date           string                         Same as for urln.title.
urln.mod_date           string                         Same as for urln.title.
record_start            string
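Similarly, a Web Feed node could be configured from a stream script. The sketch below assumes the same create/set scripting syntax; the feed URL and cache label are hypothetical, and the exact list syntax for the url property may differ in your release.

create webfeednode
# Read from one (fictitious) RSS feed; the url property takes a list of URLs
set :webfeednode.url = ["http://www.example.com/news/rss.xml"]
# Read at most the 10 most recent entries per URL
set :webfeednode.limit_entries = 10
# Cache processed feed content so unchanged feeds can be reused on later executions
set :webfeednode.use_previous = true
set :webfeednode.use_previous_label = "news_feed_cache"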

Chapter 3

Mining for Concepts and Categories

Text Mining Modeling Node

The Text Mining modeling node generates either a text mining concept model nugget or a text mining category model nugget. These text mining models uncover and extract salient concepts and/or produce categories with these concepts from your structured or unstructured text data.

Extracted concepts, patterns, and categories can be combined with existing structured data, such as demographics, and applied to modeling using the full suite of data mining tools from Clementine to yield better and more focused decisions. For example, if customers frequently list login issues as the primary impediment to completing online account management tasks, you might want to incorporate “login issues” into your models.

In addition, Text Mining modeling nodes are fully integrated within Clementine so that you can deploy text mining streams via Clementine Solution Publisher for real-time scoring of unstructured data in applications such as PredictiveCallCenter. The ability to deploy these streams ensures successful closed-loop text mining implementations. For example, your organization can now analyze scratch-pad notes from inbound or outbound callers by applying your predictive models to increase the accuracy of your marketing message in real time. Using text mining model results in streams has been shown to improve the accuracy of predictive data models.

You can also perform text link analysis with this modeling node, rather than using the Text Link Analysis node, in cases where you would want to explore the patterns and/or use them to better build your category model. You can also generate and explore clusters through the interactive workbench mode.

It is also possible to perform an automatic translation of languages. This feature allows you to mine documents in a language that you may not speak or read. If you want to use the translation feature, you must have the Language Weaver Translation Server installed and configured.

You can execute the Text Mining node automatically (Build Model option) or use a more hands-on approach in an interactive workbench mode. Once you execute this modeling node, an internal linguistic extractor engine extracts and organizes the concepts, patterns, and/or categories using natural language processing methods.

When you create concept or category model nuggets non-interactively (using the Build model option), only the concepts used by the model nugget for scoring are kept, while the rest are discarded. However, when a model is built interactively (using the Interactive workbench option), all extracted concepts are retained inside the category model nugget regardless of whether they are being used by the model nugget. This is because model nuggets created interactively may contain TLA patterns, which require that all concepts remain available to perform accurate pattern matching. Additionally, model nuggets created non-interactively tend to be created using larger datasets, in which case a more compact model nugget is preferable.


In the end, keep in mind that a model nugget created non-interactively could produce a coarser set of results or more matches than a model nugget created interactively.

Interactive Workbench Mode. If you choose to use the interactive workbench mode, you gain access to an advanced interface when the stream is executed, from which you can:

Refine your linguistic resources (resource templates, libraries, dictionaries, synonyms, etc.).
Explore extraction results, including concepts and typing.
Create categories using manual and automatic classification techniques (concept grouping and frequency).
Explore extracted text link analysis (TLA) patterns.
Generate clusters to discover new relationships.
Generate refined concept and category models.

Model (only) Mode. If you execute this node using the automatic model mode, the resulting model is built using only the settings you define explicitly in the node.

Typically, text mining nodes are used as part of an iterative process in which concepts are extracted, examined, and refined. These concepts can be used to create and refine categories. As part of an iterative process, you can make changes to the linguistic resources that are applied during extraction from within an interactive workbench session and thereby affect the content and structure of the final set of concepts and/or categories.

Requirements. Text Mining modeling nodes accept text data from a Web Feed node, File List node, or any of the standard source nodes. The Text Mining modeling node is installed with Text Mining for Clementine and can be accessed on the Text Mining palette. See the Clementine Modeling Node documentation for more information.

Figure 3-1: Text Mining palette

Important! This node replaces the Text Extraction node, which was offered in previous versions of Text Mining for Clementine. If you have streams from a previous version of Text Mining for Clementine that use the Text Extraction node or model nuggets, you must rebuild your streams using the new Text Mining node.

What Are Concepts and Categories?

In Text Mining for Clementine, we often refer to concepts and categories that are discovered, extracted, or formed during the text extraction and analysis process. It is important to understand the meaning of concepts and categories since they can help you make more informed decisions during your exploratory work and model building.


Concepts and Concept Models

During the extraction process, the text data is scanned and analyzed in order to identify interesting or relevant single words, such as election or peace, and word phrases, such as presidential election, election of the president, or peace treaties. These words and phrases are collectively referred to as terms. Using the linguistic resources, the relevant terms are extracted and similar terms are grouped together under a lead term called a concept.

In this way, a concept could represent multiple terms depending on your text and the set of linguistic resources you are using. For example, if you looked at all of the records in which the concept cost appeared, you may actually notice that the word cost cannot be found in the documents but instead something similar is present, such as the word price. In fact, the concept cost that appears in your concept list after extraction may represent many other terms, such as price, costs, fee, fees, and dues, if the extractor deemed them as similar or if it found synonyms based on processing rules or linguistic resources. In this case, any documents or records containing any of those terms would be treated as if they contained the word cost. If you want to see what terms are grouped under a concept, you can explore the concept within an interactive workbench or look at which synonyms are shown in the concept model. For more information, see "Synonyms in Concept Models" on p. 58.

A concept model contains a set of concepts that can be used to identify records or documents that also contain the concept (including any of its synonyms or grouped terms). A concept model can be used in two ways. The first would be to explore and analyze the concepts that were discovered in the original source text or to quickly identify documents of interest. The second would be to apply this model to new text records or documents to quickly identify the same key concepts in the new documents/records, such as the real-time discovery of key concepts in scratch-pad data from a call center. Please note that the text extraction model, which is no longer supported in this release, also produced concept models.

Categories and Category Model Nuggets

In Text Mining for Clementine, you can create categories that represent, in essence, higher-level concepts or topics that will capture the key ideas, knowledge, and attitudes expressed in the text. Categories are made up of a set of descriptors, such as concepts, types, and rules. Together, these descriptors are used to identify whether or not a record or document belongs to a given category. A document or record can be scanned to see whether any text it contains matches a descriptor. If a match is found, the document/record is assigned to that category. This process is called categorization.

Categories can be created automatically using the product’s robust set of automated techniques, manually using additional insight you may have regarding the data, or a combination of both. However, you can only create categories manually or fine-tune them through the interactive workbench. For more information, see "Text Mining Node: Model Tab" on p. 32.

A category model contains a set of categories along with their descriptors. The model can be used to categorize a set of documents and records based on the text they contain. Each document or record is read and then assigned to each category for which a descriptor match was found. You can use category model nuggets to see the essential ideas in open-ended survey responses or in a set of blog entries, for example.


Sampling Upstream to Save Time

When you have a large amount of data, the processing times can take minutes to hours, especially when using the interactive workbench session. The greater the size of the data, the more time the extraction and categorization processes will take. To work more efficiently, you can add one of Clementine’s Sample nodes upstream from your Text Mining node. Use this Sample node to take a random sample using a smaller subset of documents or records for the first few passes.

A smaller sample is often perfectly adequate to decide how to edit your resources and even create most if not all of your categories. Once you have run on the smaller dataset and are satisfied with the results, you can apply the same technique for creating categories to the entire set of data. Then you can look for documents or records that do not fit the categories you have created and make adjustments as needed.

Note: The Sample node is a standard Clementine node.

Text Mining Modeling Node: Fields Tab

The Fields tab is used to specify the field settings for the data from which you will be extracting concepts. Consider using a Sample node upstream from this node when working with larger datasets to speed processing times. For more information, see "Sampling Upstream to Save Time" on p. 28.

Figure 3-2: Text Mining modeling node dialog box: Fields tab


You can set the following parameters:

Text field. Select the field containing the text to be mined, the document pathname, or the directory pathname to documents. This field depends on the data source.

Text field represents. Indicate what the text field specified in the preceding setting contains. Choices are:

Actual text. Select this option if the field contains the exact text from which concepts should be extracted. When you select this option, many of the other settings are disabled.

Pathnames to documents. Select this option if the field contains one or more pathnames for the location(s) where the text documents reside.

Document type. This option is available only if you specified that the text field represents Pathnames to documents. Document type specifies the structure of the text. Select one of the following types:

Full text. Use for most documents or text sources. The entire set of text is scanned for extraction. If you select this option, you do not need to click the Settings button and define anything.

Structured text. Use for bibliographic forms, patents, and any files that contain regular structures that can be identified and analyzed. This document type is used to skip all or part of the extraction process. It allows you to define term separators, assign types, and impose a minimum frequency value. If you select this option, you must click the Settings button and enter text separators in the Structured Text Formatting area of the Document Settings dialog box.

XML text. Use to specify the XML tags that contain the text to be extracted. All other tags are ignored. If you select this option, you must click the Settings button and explicitly specify the XML elements containing the text to be read during the extraction process in the XML Text Formatting area of the Document Settings dialog box.

Textual unity. This option is available only if you specified that the text field represents Pathnames to documents and selected Full text as the document type. Select the extraction mode from the following:

Document mode. Use for documents that are short and semantically homogeneous, such as articles from news agencies.

Paragraph mode. Use for Web pages and nontagged documents. The extraction process semantically divides the documents, taking advantage of characteristics such as internal tags and syntax. If this mode is selected, scoring is applied paragraph by paragraph. Therefore, for example, the rule word1 & word2 is true only if word1 and word2 are found in the same paragraph.

Paragraph mode settings. This option is available only if you specified that the text field represents Pathnames to documents and set the textual unity option to Paragraph mode. Specify the character thresholds to be used in any extraction. The actual size is rounded up or down to the nearest period. To ensure that the word associations produced from the text of the document collection are representative, avoid specifying an extraction size that is too small.

Minimum. Specify the minimum number of characters to be used in any extraction.

Maximum. Specify the maximum number of characters to be used in any extraction.


Input encoding. This option is available only if you indicated that the text field represents Pathnames to documents. It specifies the default text encoding. For all languages except Japanese, a conversion is done from the specified or recognized encoding to ISO-8859-1. So even if you specify another encoding, the extractor will convert it to ISO-8859-1 before it is processed. Any characters that do not fit into the ISO-8859-1 encoding definition will be converted to spaces.

Partition mode. Use the partition mode to choose whether to partition based on the type node settings or to select another partition. Partitioning separates the data into training and test samples.

Document Settings for Fields Tab

Figure 3-3: Document Settings dialog box

XML Text Formatting

If you want to limit the extraction process to only the text within specific XML tags, use the XML text document type option and declare the tags containing the text in the XML Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within these tags or their child tags.

Important! If you want to skip the extraction process and impose rules on term separators, assigntypes to the extracted text, or impose a frequency count for extracted terms, use the Structured

text option described next.

Use the following rules when declaring tags for XML text formatting:Only one XML tag per line can be declared.Tag elements are case sensitive.If a tag has attributes, such as <title id="id_name">, and you want to include allvariations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket(>), such as <title


To illustrate the syntax, let's assume you have the following XML document:

<section>Rules of the Road
    <title id="01234">Traffic Signals</title>
    <p>Road signs are helpful.</p>
</section>
<p>Learning the rules is important.</p>

For this example, we will declare the following tags:

<section>
<title

In this example, since you have declared the tag <section>, the text in this tag and its nested tags, Traffic Signals and Road signs are helpful, is scanned during the extraction process. However, Learning the rules is important is ignored, since the tag <p> was not explicitly declared, nor was the tag nested within a declared tag.

Structured Text Formatting

If you want to skip all or part of the extraction process because you have structured data or want to impose rules on how to handle the text, use the Structured text document type option and declare the fields or tags containing the text in the Structured Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within the declared fields or tags (and child tags). Any undeclared field or tag will be ignored.

In certain contexts, linguistic processing is not required, and the linguistic extractor engine can be replaced by explicit declarations. In a bibliography file where keyword fields are separated by separators such as a semicolon (;) or comma (,), it is sufficient to extract the string between two separators. For this reason, you can suspend the full extraction process and instead define special handling rules to declare term separators, assign types to the extracted text, or impose a minimum frequency count for extraction.

Use the following rules when declaring structured text elements:

Only one field, tag, or element per line can be declared. They do not have to be present in the data.

Declarations are case sensitive.

If declaring a tag that has attributes, such as <title id="id_name">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>), such as <title

Add a colon after the field or tag name to indicate that this is structured text. Add this colon directly after the field or tag but before any separators, types, or frequency values, such as author: or <place>:.

To indicate that multiple terms are contained in the field or tag and that a separator is being used to designate the individual terms, declare the separator after the colon, such as author:, or <section>:;.

To assign a type to the content found in the tag, declare the type code after the colon and a separator, such as author:,P or <place>:;L. You can declare types using only a single letter (a–z). Digits are not supported. For more information, see "Type Dictionary Maps" in Chapter 18 on p. 270.


To define a minimum frequency count for a field or tag, declare a number at the end of the line, such as author:,P1 or <place>:;L5. Where n is the frequency count you defined, terms found in the field or tag must occur at least n times in the entire set of documents or records to be extracted. This also requires you to define a separator.

If you have a tag that contains a colon, you must precede the colon with a backslash character so that the declaration is not ignored. For example, if you have a field called <topic:source>, enter it as <topic\:source>.

To illustrate the syntax, let's assume you have the following recurring bibliographic fields:

author:Morel, Martens
abstract:This article describes how fields are declared.
publication:SPSS Documentation
datepub:March 2009

For this example, if we wanted the extraction process to focus on author and abstract but ignore the rest of the content, we would declare only the following fields:

author:,P1
abstract:

In this example, the author:,P1 field declaration states that linguistic processing is suspended on the field contents. Instead, it states that the author field contains more than one name, each separated from the next by a comma, that these names should be assigned to the Person type (code: P), and that a name should be extracted if it occurs at least once in the entire set of documents or records. Since the field abstract: is listed without any other declarations, the field will be scanned during extraction, and standard linguistic processing and typing will be applied.

Text Mining Node: Model Tab

The Model tab is used to specify the build method and general model settings for the node output.


Figure 3-4 Text Mining node dialog box: Model tab

You can set the following parameters:

Model name. You can generate the model name automatically based on the target or ID field (or model type in cases where no such field is specified) or specify a custom name.

Use partitioned data. If a partition field is defined, this option ensures that only data from the training partition is used to build the model.

Mode. Specifies the output that will be produced when a stream with this Text Mining node is executed.

When you create concept or category model nuggets non-interactively (using the Build model option), only the concepts used by the model nugget for scoring are kept, while the rest are discarded. However, when a model is built interactively (using the Interactive workbench option), all extracted concepts are retained inside the category model nugget regardless of whether they are being used by the model nugget. This is because model nuggets created interactively may contain TLA patterns, which require that all concepts remain available to perform accurate pattern matching. Additionally, model nuggets created non-interactively tend to be built from larger datasets, in which case a more compact model nugget is preferable. Keep in mind that a model nugget created non-interactively could produce a coarser set of results or more matches than a model nugget created interactively.

Launch interactive workbench. When the stream is executed, this option launches an advanced interface in which you can perform exploratory analyses (concepts, TLA patterns, and clusters), fine-tune the extraction results and linguistic resources (templates, synonyms, types, libraries, etc.), create categories, visualize results in graphs, and build category model nuggets. If you select this option, the settings on this tab apply only to interactive workbench sessions. For more information, see "Interactive Workbench Mode" in Chapter 8 on p. 119.

Build model. This option indicates that a model should be created and added directly to the palette using only the settings you define in this node. No additional manipulation is needed from you at execution time. If you select this option, model-specific options appear with which you can define the type of model you want to produce.

Build Model Mode

Figure 3-5 Text Mining node dialog box: Model tab, Build Model mode

Create category model. This option, which applies only when you build a model automatically (non-interactively), indicates that you want to create a category model using the techniques defined in the dialog box accessible via the Settings button. You cannot choose this option if you have selected the Launch an interactive workbench mode. For more information, see "Build Categories Dialog Box" on p. 38.

Create concept model based on global frequencies for top [n] concepts. This option, which applies only when you build a model automatically (non-interactively), indicates that you want to create a concept model. It also states that this model should contain no more than the specified number of the most frequently occurring concepts. You cannot select this option if you have selected the Launch an interactive workbench mode.


Check the most frequent [n] concepts. Specifies the number of the most frequently occurring concepts that should be selected for scoring by default (with a check box) in the final output. If you deselect this option, all concepts are selected by default in the output. This field accepts any integer greater than 0.

Uncheck concepts that occur in more than [n]% of records. Specifies that those concepts appearing in more than the specified percentage of records (or documents) should not be selected for scoring in the final output. This field accepts any integer between 1 and 100.
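
These model-building options map onto scripting properties listed later in this chapter (see "Scripting Properties: textminingnode"). The following sketch is illustrative only; it assumes the legacy set :textminingnode.<property> scripting syntax, and the numeric values are arbitrary examples:

set :textminingnode.model_output_type = Model
set :textminingnode.model_type = Concept
set :textminingnode.extract_top = 70
set :textminingnode.use_check_top = true
set :textminingnode.check_top = 20
set :textminingnode.use_uncheck_top = true
set :textminingnode.uncheck_top = 80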

Interactive Workbench Mode

Figure 3-6 Text Mining modeling node dialog box: Model tab for Interactive Workbench

Use session work (categories, TLA, resources, etc.) from last node update. When you work in an interactive workbench session, you can update the node with session data (extraction parameters, resources, category definitions, etc.). The Use session work option allows you to relaunch the interactive workbench using the saved session data. This option is disabled the first time you use this node, since no session data could have been saved. To learn how to update the node with session data so that you can use this option, see "Updating Modeling Nodes and Saving" on p. 134.

If you launch a session with this option, the extraction settings, categories, resources, and any other work from the last time you performed a node update from an interactive workbench session are available when you next launch a session. Since saved session data are used with this option, certain content, such as the resources copied from the template below, and other tabs are disabled and ignored. But if you launch a session without this option, only the contents of the node as they are defined now are used, meaning that any previous work you have performed in the workbench will not be available.


Skip extraction and reuse cached data and results. You can reuse any cached extraction results and data in the interactive workbench session. This option is particularly useful when you want to save time and reuse extraction results rather than waiting for a completely new extraction to be performed when the session is launched. In order to use this option, you must have previously updated this node from within an interactive workbench session and chosen the option to Keep the session work and cache text data with extraction results for reuse. To learn how to update the node with session data so that you can use this option, see "Updating Modeling Nodes and Saving" on p. 134.

Begin session by. Select the option indicating the view and action you want to take place first upon launching the interactive workbench session.

Using extraction results to build categories. This option launches the interactive workbench in the Categories and Concepts view and, if applicable, performs an extraction. In this view, you can create categories and generate a category model. You can also switch to another view. For more information, see "Interactive Workbench Mode" in Chapter 8 on p. 119.

Exploring text link analysis (TLA) results. This option launches the interactive workbench in the Text Link Analysis view and begins by extracting, if necessary, and identifying relationships between concepts within the text, such as opinions, qualifiers, or other links. You must select a template that contains pattern rules in order to use this option and obtain results. For more information, see "Exploring Text Link Analysis" in Chapter 12 on p. 187. If you are working with larger datasets, the TLA extraction can take some time. In this case, you may want to consider using a Sample node upstream.

Analyzing co-word clusters. This option launches in the Clusters view, performs an extraction if needed, and enables you to immediately start a co-word cluster analysis. This analysis produces a set of clusters. Co-word clustering is a process that begins by assessing the strength of the link value between two concepts based on their co-occurrence in a given record or document and ends with the grouping of strongly linked concepts into clusters. You can also switch to other views. For more information, see "Interactive Workbench Mode" in Chapter 8 on p. 119.

Resource Template. A resource template is a predefined set of libraries and advanced linguistic and nonlinguistic resources that have been fine-tuned for a particular domain or usage. These resources serve as the basis for how to handle and process data during extraction. By default, a copy of the resources from a basic template is already loaded in the node when you add the node to the stream, but you can reload a copy of a template or change templates by clicking Load. Whenever you load, a copy of the template's resources at that moment is loaded and stored in the node. For your convenience, the date and time at which the resources were copied and loaded is shown in the Text Mining modeling node.

Note that if you make changes to a template outside of this node, you must reload it here or, if you are using an interactive workbench session, switch your resources in the Resource Editor. For more information, see "Updating Node Resources After Loading" in Chapter 15 on p. 220.

If you updated this modeling node during an interactive session and selected the option to Use session work on this tab, the Load button is disabled to indicate that the resources from the interactive session are used instead.


Loading from Resource Templates

By default, a copy of the resources from a basic template is already loaded in the node when you add the node to the stream. However, if you want to copy resources from a different template or even reload from the same template to get updated resources, you can select the template in the Load Resource Template dialog box. To learn about the templates that are shipped with Text Mining for Clementine, see "Available Resource Templates" on p. 216.

Whenever you load a template, a copy of the template's resources at that moment is loaded and stored in the node. Only the contents of the template are copied; the template itself is not linked to the node. This means that if the template is later updated, these updates are not automatically available in the node. In short, the resources loaded into the node from the template are always used unless you load a different template's contents or unless you make changes to the resources in a session, update the node, and select the Use session work... option.

Important! If you intend to use the Use session work option on the Model tab, the resources loaded from a template here will be ignored, and the resources present in the session when the node was last updated are used instead. In this case, edit or switch your resources directly inside the session through the Resource Editor view rather than reloading. For more information, see "Updating Node Resources After Loading" in Chapter 15 on p. 220.

Figure 3-7 Load Resource Template dialog box


When you select a template, choose one with the same language as your text data. You can only use templates in the languages for which you are licensed. If you want to perform text link analysis, you must select a template that contains TLA patterns. If a template contains TLA patterns, an icon appears in the TLA column of the Load Resource Template dialog box.

If you do not see the template you want in the list but you have an exported copy on your machine, you can import it now. You can also export from this dialog box to share with other users. For more information, see "Importing and Exporting Templates" in Chapter 15 on p. 222.

To Load a Copy of the Template’s Resources

E Click the Load button on the Model tab. The Load Resource Template dialog box opens.

E Select the template name and click OK. The resources are loaded into the node.

Build Categories Dialog Box

Using the Build Categories dialog box, you can automatically create categories by either concept-grouping techniques or by frequency. The concept-grouping techniques include concept derivation, concept inclusion, semantic networks, and co-occurrence rules. These techniques can be used alone or in combination to create categories. The frequency techniques allow you to create categories based on types or concepts.

Each time you create categories using this dialog box, the new categories will not be merged with preexisting categories. For example, if you already have a category called MyCategory and one of the techniques creates a category with the same name, a unique name is given to the new category by adding a numeric suffix, as in MyCategory_1. The resulting categories are automatically named. If you want to change a name, you can rename your categories. For more information, see "Creating New or Renaming Categories" in Chapter 10 on p. 173.

Techniques Tab

On this tab, you can select which techniques you want to use to create your categories.


Figure 3-8 Build Categories dialog box: Techniques tab

Concept Grouping Techniques. Each of the techniques is well suited to certain types of data and situations, but often it is helpful to combine techniques in the same analysis to capture the full range of documents or records. In the interactive workbench, the concepts and types that were grouped into a category are still available for classification the next time you build categories. This means that you may see a concept in multiple categories or find redundant categories. You can exclude concepts from being grouped together by any of these techniques by defining them as antilinks. For more information, see "Link Exceptions" in Chapter 18 on p. 267.

Concept derivation. This technique creates categories by taking a concept and finding other concepts that are related to it through analyzing whether any of the concept components are morphologically related. For example, the concept opportunities to advance would be grouped with the concepts opportunity for advancement and advancement opportunity. This technique is very useful for identifying synonymous compound word concepts, since the concepts in each category generated are synonyms or closely related in meaning. It works with data of varying lengths and generates a smaller number of compact categories. For more information, see "Concept Derivation" in Chapter 10 on p. 168.

Concept inclusion. This technique creates categories by taking a concept and finding other concepts that include it. This technique works best in combination with semantic networks but can be used separately. This is performed using lexical series algorithms, which identify concepts included in other concepts. A concept series based on inclusion often corresponds to a taxonomic hierarchy (a semantic ISA relationship). This technique begins by identifying single-word or compound-word concepts that are included in other compound-word concepts (and positioned as suffix, prefix, or optional elements) and then grouping them together into one category. When determining inclusion, the algorithm ignores word order and the presence of function words, such as in or of. This technique works with data of varying lengths and generates a larger number of compact categories. For example, seat would be grouped with safety seat, seat belt, and infant seat carrier. For more information, see "Concept Inclusion" in Chapter 10 on p. 169.


Semantic network. This technique creates categories by grouping concepts based on an extensive index of word relationships. This technique applies to English language text only. It begins by identifying the possible senses of each concept in the semantic network. Concept senses that are synonyms or hyponyms are grouped into a single category. This technique can produce very good results when the terms are known to the semantic network and are not too ambiguous. It is less helpful when the text contains a large amount of specialized, domain-specific terminology unknown to the network. In the early stages of creating categories, you may want to use this technique by itself to see what sort of categories it produces. To help you produce better results, you can choose from two profiles for this technique, Wider and Narrow. For more information, see "Semantic Networks" in Chapter 10 on p. 170.

Co-occurrence rules. This technique creates one category for each co-occurrence rule generated. A co-occurrence rule is a type of conditional rule that groups words that often occur together within records, since this generally signals a relationship between them. For example, if many records include the words apples and oranges, these concepts could be grouped into a co-occurrence rule. The technique looks for concepts that tend to appear together in documents. Two concepts strongly co-occur if they frequently appear together in a set of documents and rarely separately in any of the other documents. This technique can produce good results with larger datasets of at least several hundred documents or records. For more information, see "Co-occurrence Rules" in Chapter 10 on p. 172.

One category for each of the top [n] types. If you do not choose to use the Concept Grouping techniques, you can create categories based on type frequency. Frequency represents the number of documents or records containing concepts from the extracted type in question. This technique allows you to get one category for each frequently occurring type. It works best when the data contain straightforward lists or simple, one-word concepts. Applying this technique to types allows you to obtain a quick view of the broad range of documents and records present. Note that the Unknown type is not included here and will not be used to create a category.
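
When a category model is built non-interactively, the technique choices on this tab correspond to scripting properties of the Text Mining modeling node (see "Scripting Properties: textminingnode" later in this chapter). A minimal sketch, assuming the legacy set :textminingnode.<property> scripting syntax:

set :textminingnode.model_output_type = Model
set :textminingnode.model_type = Categories
set :textminingnode.create_categories_technique_type = ConceptGrouping
set :textminingnode.concept_derivation = true
set :textminingnode.concept_inclusion = true
set :textminingnode.semantic_network = true
set :textminingnode.semantic_network_profile = Narrow
set :textminingnode.cooccurrence_rules = false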

Limits Tab

On this tab, you can set limits that affect the categories generated by the Concept Grouping techniques only. These limits do not apply to the Frequency technique, and they apply only to what is produced during this application of the techniques. They do not include concept counts from other categories, if any exist.


Figure 3-9 Build Categories dialog box: Limits tab

Maximum number of categories to create. Use to limit the maximum number of categories that can be generated.

Apply techniques to. Choose one of the following options to determine which concepts will be used:

Top concepts (based on doc. count). Use this option to apply the concept grouping techniques only to the top number of concepts specified here. The top concepts are ranked by the number of documents in which each concept appears.

Top percentage of concepts (based on doc. count). Use this option to apply the concept grouping techniques only to the top percentage of concepts specified here. The top concepts are ranked by the number of documents in which each concept appears.

All concepts. Use this option to apply the concept grouping techniques to all extracted concepts.

Maximum number of categories per concept. Use this option to limit the number of categories into which a given concept can be assigned at the time the categories are generated by this dialog box. For example, if you set the maximum number of categories in which a concept can be used to 2, then a given concept can be placed in only up to two different category definitions.

Minimum number of concepts per category. Use this option to limit smaller categories by setting the minimum number of concepts that have to be grouped in order to form a category. Categories with too few concepts could be too narrow to be of value.

Maximum number of concepts per category. Use this option to limit broader categories by setting the maximum number of concepts above which a category will not be formed. Categories with too many concepts could be too broad to be interesting.


Maximum number of concepts per co-occurrence rule. Use this option to define the maximum number of concepts that can be grouped together into a given rule by this technique. By default, the maximum is set to 3. This limit of 3 means that a concept occurring with one or two other concepts can be grouped into rules. For more information, see "Co-occurrence Rules" in Chapter 10 on p. 172.

Minimum link percentage for grouping. This option applies globally to all techniques. You can enter a percentage from 0 to 100. If you enter 0, all possible results are produced. The lower the value, the more results you will get; however, these results may be less reliable or relevant. The higher the value, the fewer results you will get; however, these results will be less noisy and are more likely to be significantly linked or associated with each other.

Maximum number of docs to use for calculating co-occurrence rules. By default, co-occurrences are calculated using the entire set of documents or records. However, in some cases, you may want to speed up the category creation process by limiting the number of documents or records used. To use this option, select the check box to its left and enter the maximum number of documents or records to use.
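
The limits on this tab likewise correspond to scripting properties that apply when a category model is built non-interactively. A minimal sketch, assuming the legacy set :textminingnode.<property> scripting syntax; all numeric values are arbitrary examples:

set :textminingnode.maximum_category_n = 25
set :textminingnode.apply_techniques = Percentage
set :textminingnode.top_percentage_n = 10
set :textminingnode.maximum_concepts_per_category = 12
set :textminingnode.maximum_number_concept_per_cooccurrence = 3
set :textminingnode.minimum_link_value_percent = 50
set :textminingnode.cooccurrence_doc_limit = true
set :textminingnode.cooccurrence_doc_limit_n = 500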

Text Mining Modeling Node: Language Tab

The Language tab is used to specify the language settings for the extraction process, including any translation settings.

Note: Select a resource template in the same language as your text data. You can use templates in only the languages for which you are licensed.

Figure 3-10 Text Mining node dialog box: Language tab


You can set the following parameters:

Language. Identifies the language of the text being mined. Most of the options in this list are straightforward, such as Dutch, English, French, German, Italian, Portuguese, or Spanish. Although these languages appear in the list, you must have a license to use them in the text mining process. Contact your sales representative if you are interested in purchasing a license for a supported language to which you do not currently have access. Here are some additional language options:

ALL. If you know that your text is in only one language, we highly recommend that you select that language. Choosing the ALL option adds time when executing your stream, since Automatic Language Recognition is used to scan all documents and records in order to identify the text language first. With this option, all records or documents that are in a supported and licensed language are read by the extractor using the language-appropriate internal dictionaries. Although you may select this option, Text Mining for Clementine will accept only those in a language for which you have a license. You can edit certain parameters affecting this option in the Automatic Language Identification section of the advanced resource editor. For more information, see "Language Identifier" in Chapter 18 on p. 274.

Translate with Language Weaver. With this option, the text will be translated for extraction. You must have Language Weaver Translation Server installed and configured. Other translation settings in this dialog box also apply.

Note: You can also use a Translate node if you want to separate the translation process from the extraction process or cache the results. If you use a Translate node, you should select English in the Language field. For more information, see "Translate Node" in Chapter 6 on p. 105.

Allow for unrecognized characters from previous translations/processing. Specifies that the text may contain some unsupported or non-English characters. This may be due to a previous translation or some kind of document preprocessing.

From. Identifies the language of the source text that will be translated.

To English. States that the text will be translated into English.

Translation accuracy. Specifies the desired accuracy level for the translation process. Choose a value from 1 to 7. A lower value produces faster translation results but with diminished accuracy; a higher value produces results with greater accuracy but increased processing time. To optimize time, we recommend beginning with a lower level and increasing it only if you feel you need more accuracy after reviewing the results.

Language Weaver Server Settings. In order to translate the language properly, you must specify both the hostname and the port number on which the Language Weaver Translation Server is installed. For Hostname, you must specify http:// preceding the URL or machine name, such as http://lwhost:4655. For more information on your Language Weaver Translation Server, contact your administrator. The text is then automatically translated into the supported language for extraction.
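
The translation settings on this tab correspond to scripting properties listed later in this chapter. The following is an illustrative sketch only; it assumes the legacy set :textminingnode.<property> scripting syntax, and the server name and port number are placeholders that must be replaced with the values supplied by your administrator:

set :textminingnode.language = Language_Weaver
set :textminingnode.translate_from = German
set :textminingnode.translation_accuracy = 3
set :textminingnode.lw_hostname = "http://lwhost"
set :textminingnode.lw_port = 4655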


Text Mining Node: Expert Tab

The Expert tab contains certain advanced parameters that impact how text is extracted and handled. The parameters in this dialog box control the basic behavior, as well as a few advanced behaviors, of the extraction process. However, they represent only a portion of the options available to you. There are also a number of linguistic resources and options that impact the extraction results, which are controlled by the resource template you select on the Model tab. For more information, see "Text Mining Node: Model Tab" on p. 32.

Note: This entire tab is disabled if you have selected the Launch an interactive workbench mode using saved interactive workbench information on the Model tab, in which case the extraction settings are taken from the last saved workbench session.

Figure 3-11 Text Mining node dialog box: Expert tab

You can set the following parameters:

Limit extraction to concepts with a global frequency of at least [n]. Specifies the minimum number of times a word or phrase must occur in the text in order for it to be extracted. For example, a value of 2 limits the extraction to those words or phrases that occur at least twice in the entire set of records or documents.

Accommodate punctuation errors. Select this option to apply a normalization technique to improve the extractability of concepts from short text data containing many punctuation errors. These errors include the improper use of punctuation, such as the period, comma, semicolon, colon, and forward slash. This option is extremely useful when text quality may be poor (as, for example, in open-ended survey responses, e-mail, and CRM data) or when the text contains many abbreviations. Normalization does not permanently alter the text but "corrects" it internally to place spaces around improper punctuation.


Accommodate spelling errors for a minimum root character limit of [n]. Select this option to apply a fuzzy grouping technique. When extracting concepts from your text data, you may want to group commonly misspelled words or closely spelled words. You can have them grouped together using a fuzzy grouping algorithm that temporarily strips vowels and double/triple consonants from extracted words and then compares them to see if they are the same.

By default, this option applies only to words with five or more root characters. To change this limit, specify that number here. The number of root characters in a term is calculated by totaling all of the characters and subtracting any characters that form inflection suffixes and, in the case of compound-word terms, determiners and prepositions. For example, the term exercises would be counted as 8 root characters in the form "exercise," since the letter s at the end of the word is an inflection (plural form). Similarly, apple sauce counts as 10 root characters ("applesauce") and manufacturing of cars counts as 16 root characters ("manufacturing car"). This method of counting is only used to check whether the fuzzy grouping should be applied but does not influence how the words are matched.

Note: If you find that using this option also groups certain words incorrectly, you can exclude word pairs from this technique by explicitly declaring them in the advanced resources editor in the Fuzzy Grouping > Exceptions section of the interactive workbench. For more information, see "Fuzzy Grouping" in Chapter 18 on p. 266.

Extract uniterms. Select this option to extract single words (uniterms) under the following conditions: the word is not part of a compound word, the word is unknown to the extractor base dictionary, or the word is identified as a noun.

Extract nonlinguistic entities. Select this option to extract nonlinguistic entities. Nonlinguistic entities include phone numbers, social security numbers, times, dates, currencies, digits, percentages, e-mail addresses, and HTTP addresses. These entities are explicitly declared for inclusion or exclusion in the linguistic resources. You can enable and disable the nonlinguistic entity types you want to extract in the Nonlinguistic Entities > Configuration section of the interactive workbench. By disabling the entities you do not need, you can decrease the processing time required. For more information, see "Configuration" in Chapter 18 on p. 268.

Uppercase algorithm. Select this option to enable the default algorithm that extracts simple words and compound words that are not in the internal dictionaries, as long as the first letter is in uppercase.

Maximum nonfunction word permutation. Specify the maximum number of nonfunction words that must be present to apply the permutation technique. This technique groups similar phrases that vary only because nonfunction words (for example, of and the) are present, regardless of inflection. For example, if you set this value to at least two words and both company officials and officials of the company were extracted, then they would be grouped together in the final concept list.
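
The Expert tab parameters also have scripting equivalents (see "Scripting Properties: textminingnode" later in this chapter). A minimal sketch, assuming the legacy set :textminingnode.<property> scripting syntax; the values are examples, with spelling_limit of 5 and permutation of 2 matching the defaults mentioned in this chapter:

set :textminingnode.fix_punctuation = true
set :textminingnode.fix_spelling = true
set :textminingnode.spelling_limit = 5
set :textminingnode.extract_uniterm = true
set :textminingnode.extract_nonlinguistic = true
set :textminingnode.upper_case = true
set :textminingnode.permutation = 2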

Using the Text Mining Modeling Node in a Stream

The Text Mining modeling node is used to access data and extract concepts in a stream. You can use any source node to access data, such as a Database node, Variable File node, Web Feed node, or Fixed File node. For text that resides in external documents, a File List node can be used.


Example 1: File List node with the Text Mining modeling node

The following example shows how to use the File List node along with the Text Mining modeling node to generate the concept model output. For more information on using the File List node, see Chapter 2.

Figure 3-12 Example stream: File List source node with the Text Mining modeling node

E File List node: Settings tab. First, we added this node to the stream to specify where the text documents are stored. We selected the directory containing all of the documents on which we want to perform text mining.

Figure 3-13 File List node dialog box: Settings tab

E Text Mining node: Fields tab. Next, we added and connected a Text Mining node to the File List node. In this node, we defined our input format, resource template, and output format. We selected the field name produced from the File List node and selected the option Text field represents pathnames to documents, as well as other settings. For more information, see "Using the Text Mining Modeling Node in a Stream" on p. 45.


Figure 3-14 Text Mining node dialog box: Fields tab

E Text Mining modeling node: Model tab. Next, on the Model tab, we selected the model building mode and chose to create a concept model directly from this node. You can select a different resource template, but for this example, we have kept the basic resources.

Figure 3-15 Text Mining modeling node dialog box: Model tab


Example 2: SPSS File Node with a Text Mining modeling node in interactive workbench mode

This example shows how the Text Mining node can also launch an interactive session. For more information on the interactive workbench, see Chapter 8.

Figure 3-16 Example stream: SPSS File node with the Text Mining node (interactive workbench)

E SPSS File node: Data tab. First, we added this node to the stream to specify where the text is stored.

Figure 3-17 SPSS File node dialog box: Data tab

E Text Mining modeling node: Fields tab. Next, we added and connected a Text Mining node. On this first tab, we defined our input format. We selected a field name from the source node and selected the option Text field represents actual text.


Figure 3-18 Text Mining modeling node dialog box: Fields tab

E Text Mining modeling node: Model tab. Next, on the Model tab, we chose to launch an interactive workbench session and to use the extraction results to build categories automatically. You can select a different resource template, but for this example, we have kept the basic resources.

Figure 3-19 Text Mining modeling node dialog box: Model tab


E Interactive Workbench Session. Next, we executed the stream, and the interactive workbench interface opened. After an extraction was performed, we began creating our categories and exploring our data.

Figure 3-20 Interactive Session

Scripting Properties: textminingnode

You can use the following parameters to define or update a node through scripting.

Important! It is not possible to specify a different resource template via scripting. If you think you need a template, you must select a template in the node dialog box. A brief scripting example follows the table below.

Table 3-1 Text Mining modeling node scripting properties

Each entry below lists the property name, its data type or permitted values, and a description where applicable.

text: field
method: ReadText, ReadPath
docType: integer. With possible values (0, 1, 2), where 0 = Full Text, 1 = Structured Text, and 2 = XML.
unity: Document, Paragraph
encoding: Automatic, "UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII", CP850. Note that values with special characters, such as "UTF-8", should be quoted to avoid confusion with a mathematical operator.
para_min: integer
para_max: integer
partition: field
custom_field: flag. Indicates whether or not a partition field will be specified.
mtag: string. Contains all the mtag settings (from the Settings dialog box for XML files).
mclef: string. Contains all the mclef settings (from the Settings dialog box for Structured Text files).
use_model_name: flag
model_name: string
use_partitioned_data: flag. If a partition field is defined, only the training data are used for model building.
model_output_type: Interactive, Model
use_interactive_info: flag
reuse_extraction_results: flag
model_type: Concept, Categories
extract_top: integer. This parameter is used when model_type = Concept.
use_check_top: flag
check_top: integer
use_uncheck_top: flag
uncheck_top: integer
interactive_view: Categories, TLA, Clusters
language: Dutch, English, French, German, Italian, Portuguese, Spanish, Language_Weaver
translate_from: Arabic, Chinese, Dutch, French, German, Hindi, Italian, Persian, Portuguese, Romanian, Russian, Somali, Spanish, Swedish
translation_accuracy: integer. Specifies the accuracy level you desire for the translation process; choose a value of 1 to 7.
lw_hostname: string
lw_port: integer
fix_punctuation: flag
fix_spelling: flag
spelling_limit: integer
extract_uniterm: flag
extract_nonlinguistic: flag
permutation: integer. Minimum nonfunction word permutation (the default is 2).
upper_case: flag

The following properties apply only when model_output_type = Model and model_type = Categories:

create_categories_technique_type: ConceptGrouping, Frequency
concept_derivation: flag
concept_inclusion: flag
semantic_network: flag
semantic_network_profile: Wider, Narrow
cooccurrence_rules: flag
type_frequency_n: integer
maximum_category_n: integer
apply_techniques: Concept, Percentage, All
top_concept_n: integer
top_percentage_n: integer
minimum_concepts_per_category: integer
maximum_concepts_per_category: integer
maximum_number_concept_per_cooccurrence: integer
minimum_link_value_percent: integer
cooccurrence_doc_limit: flag
cooccurrence_doc_limit_n: integer. Applies only if cooccurrence_doc_limit is set.
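
For example, the naming and partition properties near the top of the table could be set as follows. This is an illustrative sketch only; it assumes the legacy set :textminingnode.<property> scripting syntax, and the model name and partition field name are made-up placeholders:

set :textminingnode.use_model_name = true
set :textminingnode.model_name = "call_center_concepts"
set :textminingnode.use_partitioned_data = true
set :textminingnode.custom_field = true
set :textminingnode.partition = "Partition"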

Text Mining Model Nugget

A Text Mining model nugget is created whenever you successfully execute a Text Mining modeling node or generate a model from within the interactive workbench. The Text Mining modeling node can produce two types of models. The first, a concept model nugget, contains a list of concepts assigned to types that you can use for the real-time discovery of key concepts in other text data, such as scratch-pad data from a call center. The second, a category model nugget, contains a set of categories made up of concepts, types, patterns, and/or rules that you can use to sift through and categorize survey responses, blog entries, or other Web feeds. If you launch an interactive workbench session in the modeling node, you can explore the extraction results, refine the resources, fine-tune your categories, and produce category models. For more information, see "What Are Concepts and Categories?" on p. 26.

If the model nugget was generated using translated documents, the scoring will be performed in the translated language. Similarly, if the model nugget was generated using English as the language, you can specify a translation language in the model nugget, since the documents will then be translated into English.

Text Mining model nuggets are placed in the model nugget palette (located on the Models tab in the upper right side of the Clementine window) when they are created.

Viewing Results

To see information about the model nugget, right-click the node in the model nuggets palette and choose Browse from the context menu (or Edit for nodes in a stream).

Adding Models to Streams

To add the Text Mining model nugget to your stream, click the icon in the model nuggets palette and then click the stream canvas where you want to place the node. Or right-click the icon and choose Add to Stream from the context menu. Then connect your stream to the node, and you are ready to pass data to generate predictions.


When you execute a stream containing a Text Mining model nugget, new fields are added to the data. The number and the structure of the fields depend on the scoring mode selected on the Model tab of the Text Mining modeling node prior to building the model. For more information, see "Text Mining Node: Model Tab" on p. 32.

The data coming into the model nugget must contain the same input fields and field types as the training data used to create the model. If fields are missing or field types are mismatched, you will see an error message when you execute the stream.

Figure 3-21 Models nugget palette containing a Text Mining model nugget

Model Nugget: Model Tab (Concept Model)

In concept models, the Model tab displays the set of concepts that were extracted. The concepts are presented in a table format with one row for each concept. The objective on this tab is to select the concepts that will be used for scoring.

Note: If you generated a category model nugget instead, this tab will contain different results. For more information, see "Model Nugget: Model Tab (Category Model)" on p. 58.


Figure 3-22 Concept model nugget dialog box: Model tab

All concepts are selected for scoring by default, as shown in the check boxes in the leftmost column. A checked box means that the concept will be used for scoring. An unchecked box means that the concept will be excluded from scoring. You can check multiple rows by selecting them and clicking one of the check boxes in your selection.

To learn more about each concept, you can look at the additional information provided in each column. After the check box, you can review the following information:

Concept. This is the lead word or phrase that was extracted. In some cases, this concept represents the concept name as well as some other terms and synonyms of this concept. To see which synonyms are part of a concept, display the synonym table and select the concept to see the corresponding synonyms at the bottom of the dialog box. For more information, see "Synonyms in Concept Models" on p. 58.

Global. Here, global (frequency) refers to the number of times a concept (and all its terms and synonyms) appears in the entire set of documents/records.

Bar chart. The global frequency of this concept in the text data presented as a bar chart. The bar takes the color of the type to which the concept is assigned in order to visually distinguish the types.


%. The global frequency of this concept in the text data presented as a percentage.

N. The actual number of occurrences of this concept in the text data.

Docs. Here, docs refers to the document count, meaning the number of documents or records in which the concept (and all its terms and synonyms) appears.

Bar chart. The document count for this concept presented as a bar chart. The bar takes the color of the type to which the concept is assigned in order to visually distinguish the types.

%. The document count for this concept presented as a percentage.

N. The actual number of documents or records containing this concept.

Type. The type to which the concept is assigned. Below the table is a legend showing the colors of each possible type. For each concept, the Global and Docs columns appear in a color to denote the type to which this concept is assigned. A type is a semantic grouping of concepts. Types include such things as higher-level concepts, positive and negative words and qualifiers, contextual qualifiers, first names, places, organizations, and more. For more information, see "Type Dictionaries" in Chapter 17 on p. 243.

Below the table, you can see the total number of selected (checked) concepts and the total number of concepts.

Working with Concepts

By right-clicking a cell in the table, you can display a context menu in which you can:

Select All. All rows in the table will be selected.

Copy. The selected concept(s) are copied to the clipboard.

Copy (inc. headings). The selected concept(s) are copied to the clipboard along with the column heading.

Check Selected. Checks all check boxes for the selected rows in the table.

Uncheck Selected. Unchecks all check boxes for the selected rows in the table.

Check All. Checks all check boxes in the table. This results in all concepts being used in the final output.

Uncheck All. Unchecks all check boxes in the table. Unchecking a concept means that it will not be used in the final output.

Check Options. Displays the Check Options dialog box. For more information, see "Options for Selecting Concepts for Scoring" on p. 57.

Tab Toolbar Description

This tab also contains a toolbar that offers quick access to many of the tasks you will perform.

Table 3-2 Toolbar buttons

Button   Description

Check All. Checks all check boxes in the table. This results in all concepts being used in the final output.

Uncheck All. Unchecks all check boxes in the table. Unchecking a concept means that it will not be used in the final output.


Check Options. Opens the Check Options dialog box to allow you to select concepts based on rules. For more information, see "Options for Selecting Concepts for Scoring" on p. 57.

Sort by: Sort. The Sort menu button (the arrow button) controls the sorting of concepts. The direction of sorting (ascending or descending) can be changed using the sort direction button on the toolbar. You can also sort by any of the column headings by clicking the heading.

Display Synonyms. When the toggle button is clicked, synonym definitions are displayed at the bottom of this window. For more information, see "Synonyms in Concept Models" on p. 58.

Options for Selecting Concepts for Scoring

You can use the options in this dialog box to quickly check or uncheck concepts for inclusion in the generated model nugget. All concepts that have a check mark on the Model tab will be included for scoring.

Figure 3-23 Check Options dialog box

You can choose from the following options:

Check top [n] concepts based on frequency. Starting with the concept with the highest frequency, this is the number of concepts that will be checked. Here, frequency refers to the number of times a concept (and all its terms and synonyms) appears in the entire set of documents/records. This number could be higher than the record count, since a concept can appear multiple times in a record.

Check top [n] concepts based on document count. Starting with the concept with the highest document/record count, this is the number of concepts that will be checked. Here, document count refers to the number of documents/records in which the concept (and all its terms and synonyms) appears.

Check concepts based on [selected] type. Select a type from the drop-down list to check all concepts that are assigned to this type. Concepts are assigned to types automatically during the extraction process. A type is a semantic grouping of concepts. Types include such things as higher-level concepts, positive and negative words and qualifiers, contextual qualifiers, first names, places, organizations, and more. For more information, see "Type Dictionaries" in Chapter 17 on p. 243.

Check all concepts. All concepts in the table will be checked.


Uncheck concepts that occur in more than [n]% of records. Unchecks concepts with a record count percentage higher than the number you specified. This option is useful for excluding concepts that occur frequently in your text or in every record but have no significance in your analysis.

Uncheck concepts based on [selected] type. Unchecks concepts matching the type that you select from the drop-down list.

Uncheck all concepts. All concepts in the table will be unchecked.

Synonyms in Concept Models

You can see the synonyms that are defined for the concepts that you have selected in the table. By clicking the synonym toggle button on the toolbar, you can display the synonym table in a split pane at the bottom of the window. Synonyms are two or more words that have the same meaning. For more information, see "Substitution Dictionaries" in Chapter 17 on p. 253.

Figure 3-24 Display Synonyms toolbar button

Note: You cannot edit synonyms in this area. Synonyms are generated through substitutions, synonym definitions (in the substitution dictionary), fuzzy grouping, and more, all of which are defined in the linguistic resources. In order to make changes to the synonyms, you must make changes directly in the resources (editable in the Resource Editor in the interactive workbench or in the Template Editor and then reloaded in the node) and then reexecute the stream to get a new model nugget with the updated results.

By right-clicking a synonym, you can display a context menu in which you can:

Copy. The selected synonym is copied to the clipboard.

Copy (inc. headings). The selected synonym is copied to the clipboard along with the column headings.

Select All. All synonyms in the table will be selected.

Model Nugget: Model Tab (Category Model)

For category models, the Model tab displays the list of categories in the category model on the left and the descriptors for a selected category on the right. Each category is made up of a number of descriptors. For each category you select, the associated descriptors appear in the table. These descriptors can include concepts, rules, types, and patterns. The type of each descriptor, as well as some examples of what each descriptor represents, is also shown.


Figure 3-25 Category model nugget dialog box: Model tab

On this tab, the objective is to select the categories you want to use for scoring. For a category model, documents and records are scored into categories. If a document or record contains one or more of the descriptors in its text, then that document or record is assigned to the category to which the descriptor belongs.

Note: If you generated a concept model nugget instead, this tab will contain different results. For more information, see "Model Nugget: Model Tab (Concept Model)" on p. 54.

Category Tree

All categories are selected for scoring by default, as shown in the check boxes in the left pane. A checked box means that the category will be used for scoring. An unchecked box means that the category will be excluded from scoring. You can check multiple rows by selecting them and clicking one of the check boxes in your selection.

By right-clicking a category in the tree, you can display a context menu from which you can:

Check Selected. Checks all check boxes for the selected rows in the table.

Uncheck Selected. Unchecks all check boxes for the selected rows in the table.

Check All. Checks all check boxes in the table. This results in all concepts being used in the final output.


Uncheck All. Unchecks all check boxes in the table. Unchecking a concept means that it will not be used in the final output.

Category Contents Table

To learn more about each category, select that category and review the information that appears for the descriptors in that category. For each descriptor, you can review the following information:

Descriptor. This field contains an icon representing what kind of descriptor it is, as well as the descriptor name.

Table 3-3 Descriptors
The icon shown for a descriptor indicates its kind: Concepts, Types, TLA Patterns, or Rules.

Type. This field contains the type name for the descriptor. Types are collections of similar concepts (semantic grouping), such as organization names, products, or positive opinions. Rules are not assigned to types.

Details. This field contains a list of what is included in that descriptor. Depending on the number of matches, you may not see the entire list for each descriptor due to size limitations in the dialog box.

By right-clicking a cell in the table, you can display a context menu in which you can:

Copy. The selected concept(s) are copied to the clipboard.

Copy (inc. headings). The selected descriptor is copied to the clipboard along with the column headings.

Select All. All rows in the table will be selected.

Tab Toolbar Description

This tab also contains a toolbar that offers quick access to many of the tasks you will perform.

Table 3-4
Toolbar buttons

Check All. Checks all check boxes in the table. This results in all concepts being used in the final output.

Uncheck All. Unchecks all check boxes in the table. Unchecking a concept means that it will not be used in the final output.

Sort by / Sort. The Sort menu button (the arrow button) controls the sorting of concepts. The direction of sorting (ascending or descending) can be changed using the sort direction button on the toolbar. You can also sort by any of the column headings by clicking the heading.

Model Nugget: Settings Tab

The Settings tab is used to define the text field value for the new input data, if necessary. It is also the place where you define the data model for your output (scoring mode).

Note: This tab appears in the node dialog box only when the model nugget is placed in the stream. It does not exist when you are accessing this dialog box directly in the Models palette.

Figure 3-26
Text Mining model nugget dialog box: Settings tab in a concept model

Scoring mode: Concepts as fields. With this option, there are just as many output records as there were in the input. However, each record now contains one new field for each concept or category that was selected (using the check mark) on the Model tab.

Scoring mode: Concepts as records. With this option, a new record is created for each (concept, document) or (category, document) pair. Typically, there are more records in the output than there were in the input. In addition to the input fields, new fields are also added to the data depending on what kind of model it is.

In concept models, for each input record, a new record is created for each concept found in a given document. The value for each concept field depends on whether you select Flags or Counts as your field value on this tab. The following table shows the new fields for this model.

Table 3-5
Output fields for “Concepts as records”

Concept. Contains the extracted concept name found in the text data field.

Type. Stores the type of the concept as a full type name, such as Location or Unknown. A type is a semantic grouping of concepts. For more information, see “Type Dictionaries” in Chapter 17 on p. 243.

Count. Displays the number of occurrences for that concept (and its terms and synonyms) in the text body (given document).

In category models, for each input record, a new record is created for each category to which a given document is assigned. The value for each field depends on whether you select Flags or Counts as your field value on this tab. The following table shows the new field for this model.

Table 3-6
Output fields for “Categories as records”

Category. Contains the category name to which the text document was assigned.

Flags. This option does not exist for category models since it is always flags. Used to obtain flags with two distinct values in the output, such as Yes/No, True/False, T/F, or 1 and 2. The storage types for flags can be string, integer, real number, or date/time.

True value. Specify a flag value for the field when the condition is met.

False value. Specify a flag value for the field when the condition is not met.

Counts. This option does not exist for category models since the count is always 1. Used to obtain a count of how many times the concept occurred in a given record.

Field name extension. Specify an extension for the field name. Field names are generated by using the concept name plus this extension.

Add as. Specify where the extension should be added to the field name. Choose Prefix to add the extension to the beginning of the string. Choose Suffix to add the extension to the end of the string.

Accommodate punctuation errors. Select this option to apply a normalization technique to improve the extractability of concepts from short text data containing many punctuation errors. These errors include the improper use of punctuation, such as the period, comma, semicolon, colon, and forward slash. This option is extremely useful when text quality may be poor (as, for example, in open-ended survey responses, e-mail, and CRM data) or when the text contains many abbreviations. Normalization does not permanently alter the text but “corrects” it internally to place spaces around improper punctuation.

Model Nugget: Fields Tab

The Fields tab is used to define the text field value for the new input data, if necessary. It is also the place where you define the data model for your output (scoring mode).

Note: This tab appears in the node dialog box only when the model nugget is placed in the stream. It does not exist when you are accessing this output directly in the Models palette.

Figure 3-27
Text Mining model nugget dialog box: Fields tab

Text field. Select the field containing the text to be mined, the document pathname, or the directory pathname to documents. This field depends on the data source.

Text field represents. Indicate what the text field specified in the preceding setting contains. Choices are:

Actual text. Select this option if the field contains the exact text from which concepts should be extracted. When you select this option, many of the other settings are disabled.

Pathnames to documents. Select this option if the field contains one or more pathnames for the location(s) of where the text documents reside.

Document type. This option is available only if you specified that the text field represents Pathnames to documents. Document type specifies the structure of the text. Select one of the following types:

Full text. Use for most documents or text sources. The entire set of text is scanned for extraction. If you select this option, you do not need to click the Settings button and define anything.

Structured text. Use for bibliographic forms, patents, and any files that contain regular structures that can be identified and analyzed. This document type is used to skip all or part of the extraction process. It allows you to define term separators, assign types, and impose a minimum frequency value. If you select this option, you must click the Settings button and enter text separators in the Structured Text Formatting area of the Document Settings dialog box.

XML text. Use to specify the XML tags that contain the text to be extracted. All other tags are ignored. If you select this option, you must click the Settings button and explicitly specify the XML elements containing the text to be read during the extraction process in the XML Text Formatting area of the Document Settings dialog box.

Input encoding. This option is available only if you indicated that the text field represents Pathnames to documents. It specifies the default text encoding. For all languages except Japanese, a conversion is done from the specified or recognized encoding to ISO-8859-1. So even if you specify another encoding, the extractor will convert it to ISO-8859-1 before it is processed. Any characters that do not fit into the ISO-8859-1 encoding definition will be converted to spaces.

Model Nugget: Language Tab

The Language tab is used to specify the language settings for the extraction process, including any translation settings. You can set the following parameters:

Note: This tab appears in the node dialog box only when the model nugget is placed in the stream. It does not exist when you are accessing this output directly in the Models palette.

Figure 3-28
Text Mining model nugget dialog box: Language tab

Language. Identifies the language of the text being mined. Most of the options in this list are straightforward, such as Dutch, English, French, German, Italian, Portuguese, or Spanish. Although these languages appear in the list, you must have a license to use them in the text mining process. Contact your sales representative if you are interested in purchasing a license for a supported language for which you do not currently have access. Here are some additional language options:

ALL. If you know that your text is in only one language, we highly recommend that you select that language. Choosing the ALL option will add time when executing your stream, since Automatic Language Recognition is used to scan all documents and records in order to identify the text language first. With this option, all records or documents that are in a supported and licensed language are read by the extractor using the language-appropriate internal dictionaries. Although you may select this option, Text Mining for Clementine will accept only those in a language for which you have a license. You can edit certain parameters affecting this option in the Automatic Language Identification section of the advanced resource editor. For more information, see “Language Identifier” in Chapter 18 on p. 274.

Translate with Language Weaver. With this option, the text will be translated for extraction. You must have Language Weaver Translation Server installed and configured. Other translation settings in this dialog box also apply.

Note: You can also use a Translate node if you want to separate the translation process from the extraction process or cache the results. If you use a Translate node, you should select English in the Language field. For more information, see “Translate Node” in Chapter 6 on p. 105.

Allow for unrecognized characters from previous translations/processing. Specifies that the text may contain some unsupported or non-English characters. This may be due to a previous translation or some kind of document preprocessing.

From. Identifies the language of the source text that will be translated.

To English. States that the text will be translated into English.

Translation accuracy. Specifies the desired accuracy level for the translation process. Choose a value of 1 to 7. It takes the maximum amount of time to produce the most accurate translation results. To help save time, you can set your own accuracy level. A lower value produces faster translation results but with diminished accuracy. A higher value produces results with greater accuracy but increased processing time. To optimize time, we recommend beginning with a lower level and increasing it only if you feel you need more accuracy after reviewing the results.

Language Weaver Server Settings. In order to translate the language properly, you must specify both the hostname and the port number on which the Language Weaver Translation Server is installed and located. For Hostname, you must specify http:// preceding the URL or machine name, such as http://lwhost:4655. For more information on your Language Weaver Translation Server, contact your administrator. The text is then automatically translated into the supported language for extraction.

Model Nugget: Summary Tab

The Summary tab presents information about the model itself (Analysis folder), fields used in the model (Fields folder), settings used when building the model (Build Settings folder), and model training (Training Summary folder).

When you first browse a modeling node, the folders on the Summary tab are collapsed. To see the results of interest, use the expander control to the left of the folder to show the results, or click the Expand All button to show all results. To hide the results after viewing them, use the expander control to collapse the specific folder that you want to hide, or click the Collapse All button to collapse all folders.

Figure 3-29
Text Mining model nugget dialog box: Summary tab

Using Text Mining Model Nuggets in a Stream

The Text Mining modeling node generates either a concept or category model. These Text Mining model nuggets can be used in a stream.

Example: File List node with the concept model nugget

The following example shows how to use the File List node along with a Text Mining model nugget. For more information on using the File List node, see Chapter 2.

Figure 3-30
Example stream: File List (source) node with a Text Mining model nugget

E File List node: Settings tab. First, we added this node to the stream to specify where the text documents are stored.

Figure 3-31
File List node dialog box: Settings tab

E Text Mining model nugget: Model tab. Next, we added and connected a concept model nugget to the File List node. We selected the concepts we wanted to use to score our data.

Figure 3-32
Text Mining model nugget dialog box: Model tab

E Text Mining model nugget: Settings tab. Next, we defined the output format.

Figure 3-33
Text Mining model nugget dialog box: Settings tab

E Text Mining model nugget: Fields tab. Next, we selected Path, which is the field name coming from the File List node, and selected the option Text field represents pathnames to documents, as well as other settings.

Figure 3-34
Text Mining model nugget dialog box: Fields tab

E Table node. Next, we attached a table node to see the results and executed the stream.

Figure 3-35
Table output

Scripting Properties: applytextminingnode

You can use the properties in the following table for scripting.

Table 3-7
Text Mining Model Nugget Properties

The applytextminingnode properties are listed below with their data types and, where applicable, a property description.

scoring_mode. Data type: Fields, Records.

field_values. Data type: Flags, Counts. This option is not available in the Category model nugget.

true_value. Data type: string.

false_value. Data type: string.

extension. Data type: string.

add_as. Data type: Suffix, Prefix.

fix_punctuation. Data type: flag.

check_model. Data type: flag. Used to check or uncheck a specific concept or category (depending on which model you have). Usage: check_model.NAME, where NAME is the name of the concept or category to check or uncheck. If the name includes spaces, quotes are used. For example, to exclude (uncheck) the concept called my concept from scoring, use: set :applytextminingnode.check_model.'my concept' = false

text. Data type: field.

method. Data type: ReadText, ReadPath.

docType. Data type: integer. Possible values are 0, 1, and 2, where 0 = Full Text, 1 = Structured Text, and 2 = XML.

encoding. Data type: Automatic, UTF-8, UTF-16, ISO-8859-1, US-ASCII, CP850.

language. Data type: Dutch, English, French, German, Italian, Portuguese, Spanish, Language_Weaver.

translate_from. Data type: Arabic, Chinese, Dutch, French, German, Hindi, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Somali, Swedish.

translation_accuracy. Data type: integer. Specifies the accuracy level you desire for the translation process; choose a value of 1 to 7.

lw_hostname. Data type: string.

lw_port. Data type: integer.
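For example, a stream script might set several of these properties on the model nugget before execution. The following fragment is only a minimal sketch using assumed Clementine stream scripting syntax; the field name Path and the concept name price are hypothetical placeholders, not values taken from the documented example.

# Minimal sketch (assumed syntax); Path and price are hypothetical names.
set :applytextminingnode.scoring_mode = Records
set :applytextminingnode.field_values = Counts
set :applytextminingnode.fix_punctuation = true
set :applytextminingnode.text = Path
set :applytextminingnode.check_model.'price' = false

With settings like these, executing the stream would produce one output record per (concept, document) pair, report counts rather than flags, and exclude the unchecked concept from scoring.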


Chapter 4
Mining for Text Links

Text Link Analysis

The Text Link Analysis (TLA) node adds a pattern-matching technology to text mining’s concept extraction in order to identify relationships between the concepts in the text data based on known patterns. These relationships can describe how a customer feels about a product, which companies are doing business together, or even the relationships between genes or pharmaceutical agents. For example, extracting your competitor’s product name may not be interesting enough to you. Using this node, you could also learn how people feel about this product, if such opinions exist in the data. The relationships and associations are identified and extracted by matching known patterns to your text data.

You can use the TLA patterns inside certain resource templates shipped with Text Mining for Clementine or create/edit your own. Patterns are made up of variables, macros, word lists, and word gaps to form a Boolean query, or rule, that is compared to your input text. Whenever a TLA pattern matches text, this text can be extracted as a pattern and restructured as output data.

The Text Link Analysis node offers a more direct way to identify and extract patterns from your text and then add the pattern results to the dataset in the stream. But the Text Link Analysis node is not the only way in which you can perform text link analysis. You can also use an interactive workbench session in the Text Mining modeling node. In the interactive workbench, you can use the patterns as category descriptors and/or to learn more about the patterns using drill-down and graphs. For more information, see “Exploring Text Link Analysis” in Chapter 12 on p. 187. In fact, using the Text Mining node to extract TLA results is a great way to explore and fine-tune templates to your data for later use directly in the TLA node.

Requirements. The Text Link Analysis node accepts text data read into a field using any of the standard source nodes (Database node, Flat File node, etc.) or read into a field listing paths to external documents generated by a File List node or a Web Feed node.

Strengths. The Text Link Analysis node goes beyond basic concept extraction to provide information about the relationships between concepts, as well as related opinions or qualifiers that may be revealed in the data.

Figure 4-1
Text Mining palette


After running the Text Link Analysis node, the data are restructured. It is important to understand the way that text mining restructures your data. If you desire a different structure for data mining, you can use nodes on the Field Operations palette to accomplish this. For example, if you were working with data in which each row represented a text record, then one row is created for each pattern uncovered in the source text data. For each row in the output, there are 14 fields:

Six fields (Concept#, such as Concept1, Concept2, ..., and Concept6) represent any concepts found in the pattern match.
Six fields (Type#, such as Type1, Type2, ..., and Type6) represent the type for each concept.
A field using the name of the ID field you specified in the node represents the record or document ID as it was in the input data.
Matched Text represents the portion of the text data in the original record or document that was matched to the TLA pattern.

Figure 4-2
Output shown in Table node

Note: Any preexisting streams containing a Text Link Analysis node from a release prior to 5.0 will no longer be fully executable until you update the nodes. Certain improvements in the latest Clementine release require older nodes to be replaced with the newer versions, which are both more deployable and more powerful.

It is also possible to perform an automatic translation of certain languages. This feature allows you to mine documents in a language you may not speak or read. If you want to use the translation feature, you must have the Language Weaver Translation Server installed and configured.

Caching TLA. If you cache, the text link analysis results are in the stream. To avoid repeating the extraction of text link analysis results each time the stream is executed, select the Text Link Analysis node and from the menus choose Edit > Node > Cache > Enable. The next time the stream is executed, the output is cached in the node. The node icon displays a tiny “document” graphic that changes from white to green when the cache is filled. The cache is preserved for the duration of the session. To preserve the cache for another day (after the stream is closed and reopened), select the node and from the menus choose Edit > Node > Cache > Save Cache. The next time you open the stream, you can reload the saved cache rather than running the extraction again.

Alternatively, you can save or enable a node cache by right-clicking the node and choosing Cache from the context menu.

Text Link Analysis Node: Fields Tab

Figure 4-3
Text Link Analysis node dialog box: Fields tab

The Fields tab is used to specify the field settings for the data from which you will be extracting concepts. You can set the following parameters:

ID field. Select the field containing the identifier for the text records. Identifiers must be integers. The ID field serves as an index for the individual text records. Use an ID field if the text field represents the text to be mined. Do not use an ID field if the text field represents Pathnames to documents.

Text field. Select the field containing the text to be mined, the document pathname, or the directory pathname to documents. This field depends on the data source.

Text field represents. Indicate what the text field specified in the preceding setting contains. Choices are:

Actual text. Select this option if the field contains the exact text from which concepts should be extracted. When you select this option, many of the other settings are disabled.

Pathnames to documents. Select this option if the field contains one or more pathnames for the location(s) of where the text documents reside.

Document type. This option is available only if you specified that the text field represents Pathnames to documents. Document type specifies the structure of the text. Select one of the following types:

Full text. Use for most documents or text sources. The entire set of text is scanned for extraction. If you select this option, you do not need to click the Settings button and define anything.

Structured text. Use for bibliographic forms, patents, and any files that contain regular structures that can be identified and analyzed. This document type is used to skip all or part of the extraction process. It allows you to define term separators, assign types, and impose a minimum frequency value. If you select this option, you must click the Settings button and enter text separators in the Structured Text Formatting area of the Document Settings dialog box.

XML text. Use to specify the XML tags that contain the text to be extracted. All other tags are ignored. If you select this option, you must click the Settings button and explicitly specify the XML elements containing the text to be read during the extraction process in the XML Text Formatting area of the Document Settings dialog box.

Textual unity. This option is available only if you specified that the text field represents Pathnames to documents and selected Full text as the document type. Select the extraction mode from the following:

Document mode. Use for documents that are short and semantically homogenous, such as articles from news agencies.

Paragraph mode. Use for Web pages and nontagged documents. The extraction process semantically divides the documents, taking advantage of characteristics such as internal tags and syntax. If this mode is selected, scoring is applied paragraph by paragraph. Therefore, for example, the rule word1 & word2 is true only if word1 and word2 are found in the same paragraph.

Paragraph mode settings. This option is available only if you specified that the text field represents Pathnames to documents and set the textual unity option to Paragraph mode. Specify the character thresholds to be used in any extraction. The actual size is rounded up or down to the nearest period. To ensure that the word associations produced from the text of the document collection are representative, avoid specifying an extraction size that is too small.

Minimum. Specify the minimum number of characters to be used in any extraction.

Maximum. Specify the maximum number of characters to be used in any extraction.

Input encoding. This option is available only if you indicated that the text field represents Pathnames to documents. It specifies the default text encoding. For all languages except Japanese, a conversion is done from the specified or recognized encoding to ISO-8859-1. So even if you specify another encoding, the extractor will convert it to ISO-8859-1 before it is processed. Any characters that do not fit into the ISO-8859-1 encoding definition will be converted to spaces.

Resource Template. A resource template is a predefined set of libraries and advanced linguistic and nonlinguistic resources that have been fine-tuned for a particular domain or usage. These resources serve as the basis for how to handle and process data during extraction. By default, a copy of the resources from a basic template is already loaded in the node when you add the node to the stream, but you can reload a copy of a template or change templates by clicking Load.

Whenever you load, a copy of the template’s resources at that moment is loaded and stored in the node. For your convenience, the date and time at which the resources were copied and loaded is shown in the Text Mining modeling node. Note that if you make changes to a template outside of this node, you must reload here or, if you are using an interactive workbench session, switch your resources in the Resource Editor. For more information, see “Updating Node Resources After Loading” in Chapter 15 on p. 220.

You must choose a template that contains TLA patterns in order to extract TLA results using this node. To see which template is currently selected, click Load and look for the selected template in the table. The template that is currently selected is the one that will be used to extract TLA patterns. To change templates, select a different one.

In the dialog box, select the resource template to load. Templates are loaded when you select them and not when the stream is executed. If you make template changes in an Interactive Workbench in another stream, reload the template here to get the latest changes.

Document Settings for Fields Tab

Figure 4-4
Document Settings dialog box

XML Text Formatting

If you want to limit the extraction process to only the text within specific XML tags, use the XML text document type option and declare the tags containing the text in the XML Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within these tags or their child tags.

Important! If you want to skip the extraction process and impose rules on term separators, assign types to the extracted text, or impose a frequency count for extracted terms, use the Structured text option described next.

Use the following rules when declaring tags for XML text formatting:
Only one XML tag per line can be declared.
Tag elements are case sensitive.
If a tag has attributes, such as <title id="id_name">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>), such as <title

To illustrate the syntax, let’s assume you have the following XML document:

<section>Rules of the Road
<title id="01234">Traffic Signals</title>
<p>Road signs are helpful.</p>
</section>
<p>Learning the rules is important.</p>

For this example, we will declare the following tags:

<section>
<title

In this example, since you have declared the tag <section>, the text in this tag and its nested tags, Traffic Signals and Road signs are helpful, are scanned during the extraction process. However, Learning the rules is important is ignored since the tag <p> was not explicitly declared nor was the tag nested within a declared tag.

Structured Text Formatting

If you want to skip all or part of the extraction process because you have structured data or want to impose rules on how to handle the text, use the Structured text document type option and declare the fields or tags containing the text in the Structured Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within the declared fields or tags (and child tags). Any undeclared field or tag will be ignored.

In certain contexts, linguistic processing is not required, and the linguistic extractor engine can be replaced by explicit declarations. In a bibliography file where keyword fields are separated by separators such as a semicolon (;) or comma (,), it is sufficient to extract the string between two separators. For this reason, you can suspend the full extraction process and instead define special handling rules to declare term separators, assign types to the extracted text, or impose a minimum frequency count for extraction.

Use the following rules when declaring structured text elements:
Only one field, tag, or element per line can be declared. They do not have to be present in the data.
Declarations are case sensitive.
If declaring a tag that has attributes, such as <title id="id_name">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>), such as <title
Add a colon after the field or tag name to indicate that this is structured text. Add this colon directly after the field or tag but before any separators, types, or frequency values, such as author: or <place>:.
To indicate that multiple terms are contained in the field or tag and that a separator is being used to designate the individual terms, declare the separator after the colon, such as author:, or <section>:;.
To assign a type to the content found in the tag, declare the type code after the colon and a separator, such as author:,P or <place>:;L. You can declare types using only a single letter (a–z). Digits are not supported. For more information, see “Type Dictionary Maps” in Chapter 18 on p. 270.
To define a minimum frequency count for a field or tag, declare a number at the end of the line, such as author:,P1 or <place>:;L5. Where n is the frequency count you defined, terms found in the field or tag must occur at least n times in the entire set of documents or records to be extracted. This also requires you to define a separator.
If you have a tag that contains a colon, you must precede the colon with a backslash character so that the declaration is not ignored. For example, if you have a field called <topic:source>, enter it as <topic\:source>.

To illustrate the syntax, let’s assume you have the following recurring bibliographic fields:

author:Morel, Martens
abstract:This article describes how fields are declared.
publication:SPSS Documentation
datepub:March 2009

For this example, if we wanted the extraction process to focus on author and abstract but ignore the rest of the content, we would declare only the following fields:

author:,P1
abstract:

In this example, the author:,P1 field declaration states that linguistic processing was suspended on the field contents. Instead, it states that the author field contains more than one name, which is separated from the next by a comma separator, and these names should be assigned to the Person type (code: P) and that if the name occurs at least once in the entire set of documents or records, it should be extracted. Since the field abstract: is listed without any other declarations, the field will be scanned during extraction and standard linguistic processing and typing will be applied.

Text Link Analysis Node: Language Tab

Figure 4-5
Text Link Analysis node dialog box: Language tab

The Language tab is used to specify the language settings for the extraction process, including any translation settings. You can set the following parameters:

Language. Identifies the language of the text being mined. Most of the options in this list are straightforward, such as Dutch, English, French, German, Italian, Portuguese, or Spanish. Although these languages appear in the list, you must have a license to use them in the text mining process. Contact your sales representative if you are interested in purchasing a license for a supported language for which you do not currently have access. Here are some additional language options:

ALL. If you know that your text is in only one language, we highly recommend that you select that language. Choosing the ALL option will add time when executing your stream, since Automatic Language Recognition is used to scan all documents and records in order to identify the text language first. With this option, all records or documents that are in a supported and licensed language are read by the extractor using the language-appropriate internal dictionaries. Although you may select this option, Text Mining for Clementine will accept only those in a language for which you have a license. You can edit certain parameters affecting this option in the Automatic Language Identification section of the advanced resource editor. For more information, see “Language Identifier” in Chapter 18 on p. 274.

Translate with Language Weaver. With this option, the text will be translated for extraction. You must have Language Weaver Translation Server installed and configured. Other translation settings in this dialog box also apply.

Note: You can also use a Translate node if you want to separate the translation process from the extraction process or cache the results. If you use a Translate node, you should select English in the Language field. For more information, see “Translate Node” in Chapter 6 on p. 105.

Allow for unrecognized characters from previous translations/processing. Specifies that the text may contain some unsupported or non-English characters. This may be due to a previous translation or some kind of document preprocessing.

From. Identifies the language of the source text that will be translated.

To English. States that the text will be translated into English.

Translation accuracy. Specifies the desired accuracy level for the translation process. Choose a value of 1 to 7. It takes the maximum amount of time to produce the most accurate translation results. To help save time, you can set your own accuracy level. A lower value produces faster translation results but with diminished accuracy. A higher value produces results with greater accuracy but increased processing time. To optimize time, we recommend beginning with a lower level and increasing it only if you feel you need more accuracy after reviewing the results.

Language Weaver Server Settings. In order to translate the language properly, you must specify both the hostname and the port number on which the Language Weaver Translation Server is installed and located. For Hostname, you must specify http:// preceding the URL or machine name, such as http://lwhost:4655. For more information on your Language Weaver Translation Server, contact your administrator. The text is then automatically translated into the supported language for extraction.

Text Link Analysis Node: Expert Tab

The Expert tab contains certain advanced parameters that impact how text is extracted and handled. The parameters in this dialog box control the basic behavior, as well as a few advanced behaviors, of the extraction process. There are also a number of linguistic resources and options that also impact the extraction results, which are controlled by the resource template you select.

Figure 4-6
Text Link Analysis node dialog box: Expert tab

Accommodate punctuation errors. Select this option to apply a normalization technique to improve the extractability of concepts from short text data containing many punctuation errors. These errors include the improper use of punctuation, such as the period, comma, semicolon, colon, and forward slash. This option is extremely useful when text quality may be poor (as, for example, in open-ended survey responses, e-mail, and CRM data) or when the text contains many abbreviations. Normalization does not permanently alter the text but “corrects” it internally to place spaces around improper punctuation.

Accommodate spelling errors for a minimum root character limit of [n]. Select this option to apply a fuzzy grouping technique. When extracting concepts from your text data, you may want to group commonly misspelled words or closely spelled words. You can have them grouped together using a fuzzy grouping algorithm that temporarily strips vowels and double/triple consonants from extracted words and then compares them to see if they are the same.

By default, this option applies only to words with five or more root characters. To change this limit, specify that number here. The number of root characters in a term is calculated by totaling all of the characters and subtracting any characters that form inflection suffixes and, in the case of compound-word terms, determiners and prepositions. For example, the term exercises would be counted as 8 root characters in the form “exercise,” since the letter s at the end of the word is an inflection (plural form). Similarly, apple sauce counts as 10 root characters (“apple sauce”) and manufacturing of cars counts as 16 root characters (“manufacturing car”). This method of counting is only used to check whether the fuzzy grouping should be applied but does not influence how the words are matched.

Note: If you find that using this option also groups certain words incorrectly, you can exclude word pairs from this technique by explicitly declaring them in the advanced resources editor in the Fuzzy Grouping > Exceptions section of the interactive workbench. For more information, see “Fuzzy Grouping” in Chapter 18 on p. 266.

Extract uniterms. Select this option to extract single words (uniterms) under the following conditions: the word is not part of a compound word, the word is unknown to the extractor base dictionary, or the word is identified as a noun.

Extract nonlinguistic entities. Select this option to extract nonlinguistic entities. Nonlinguistic entities include phone numbers, social security numbers, times, dates, currencies, digits, percentages, e-mail addresses, and HTTP addresses. These entities are explicitly declared for inclusion or exclusion in the linguistic resources. You can enable and disable the nonlinguistic entity types you want to extract in the Nonlinguistic Entities > Configuration section of the interactive workbench. By disabling the entities you do not need, you can decrease the processing time required. For more information, see “Configuration” in Chapter 18 on p. 268.

Uppercase algorithm. Select this option to enable the default algorithm that extracts simple words and compound words that are not in the internal dictionaries, as long as the first letter is in uppercase.

Maximum nonfunction word permutation. Specify the maximum number of nonfunction words that must be present to apply the permutation technique. This technique groups similar phrases that vary only because nonfunction words (for example, of and the) are present, regardless of inflection. For example, if you set this value to at least two words and both company officials and officials of the company were extracted, then they would be grouped together in the final concept list.

Text Link Analysis Node: Annotations Tab

The Annotations tab is a standard node tab.

Using the Text Link Analysis Node in a Stream

The Text Link Analysis node is used to access data and extract concepts in a stream. You can use any source node to access data. Typically, a Database node, Variable File node, or Fixed File node is used. You can also use the File List node.

Example: Variable File node with the Text Link Analysis node

The following example shows how to use the Variable File source node with the Text Link Analysis node.

Figure 4-7
Example: Variable File node with the Text Link Analysis node

E Variable File node: File tab. First, we added this node to the stream to specify the input file containing all of the text to be processed by the extractor.

Figure 4-8
Variable File node dialog box: File tab

E Text Link Analysis node: Fields tab. Next, we attached this node to the stream to extract concepts for downstream modeling or viewing. We specified the ID field and the text field name containing the data, as well as other settings.

Figure 4-9
Text Link Analysis node dialog box: Fields tab

E Table node. Finally, we attached a Table node to view the concepts that were extracted from our text documents. In the table output shown, you can see the TLA pattern results found in the data after this stream was executed with a Text Link Analysis node. Some results show only one concept/type was matched. In others, the results are more complex and contain several types and concepts. Additionally, as a result of running data through the Text Link Analysis node and extracting concepts, several aspects of the data are changed. The original data in our example contained three fields and 2,066 records. After executing the Text Link Analysis node, there are now 14 fields and 4,993 records. There is now one row for each TLA pattern result found. For example, ID 3327 became three rows from the original because three pattern results were extracted. You can use a Merge node if you want to merge this output data back into your original data.

Figure 4-10
Table output node

Scripting Properties: tlanode

You can use the parameters in the following table to define or update a node through scripting.

Important! It is not possible to specify a resource template via scripting. To select a template, you must do so from within the node dialog box.

Table 4-1
Text Link Analysis (TLA) node scripting properties

The tlanode properties are listed below with their data types and, where applicable, a property description.

id_field. Data type: field.

text. Data type: field.

method. Data type: ReadText, ReadPath.

docType. Data type: integer. Possible values are 0, 1, and 2, where 0 = Full Text, 1 = Structured Text, and 2 = XML.

unity. Data type: Document, Paragraph.

encoding. Data type: Automatic, "UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII", CP850. Note that values with special characters, such as "UTF-8", should be quoted to avoid confusion with a mathematical operator.

para_min. Data type: integer.

para_max. Data type: integer.

mtag. Data type: string. Contains all the mtag settings (from the Settings dialog box for XML files).

mclef. Data type: string. Contains all the mclef settings (from the Settings dialog box for Structured Text files).

language. Data type: Dutch, English, French, German, Italian, Portuguese, Spanish, Language_Weaver.

translate_from. Data type: Arabic, Chinese, Dutch, French, German, Hindi, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Somali, Swedish.

translation_accuracy. Data type: integer. Specifies the accuracy level you desire for the translation process; choose a value of 1 to 7.

lw_hostname. Data type: string.

lw_port. Data type: integer.

extract_freq. Data type: integer.

fix_punctuation. Data type: flag.

fix_spelling. Data type: flag.

spelling_limit. Data type: integer.

extract_uniterm. Data type: flag.

extract_nonlinguistic. Data type: flag.

permutation. Data type: integer. Minimum nonfunction word permutation (the default is 2).

upper_case. Data type: flag.
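As an illustration, a stream script could create and configure a Text Link Analysis node using these properties. This is only a minimal sketch under assumed Clementine stream scripting syntax, including the assumption that the node can be created by its scripting name, tlanode; the field names id and text are hypothetical placeholders.

# Minimal sketch (assumed syntax); the field names id and text are hypothetical.
create tlanode
set :tlanode.id_field = id
set :tlanode.text = text
set :tlanode.method = ReadText
set :tlanode.language = English
set :tlanode.fix_spelling = true
set :tlanode.spelling_limit = 5

As noted above, the resource template itself still has to be selected from within the node dialog box rather than through the script.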


Chapter 5
Categorizing Files and Records

LexiQuest Categorize Model Nugget

LexiQuest Categorize model nuggets allow you to assign documents or records to a predefined set of categories according to the text they contain. These model nuggets can be created and exported as XML from LexiQuest Categorize version 3.2 or later and then imported into Clementine for purposes of scoring.

Note: You can also create a category model nugget directly in Clementine to categorize your records and documents using the Text Mining node. For more information, see “Text Mining Modeling Node” in Chapter 3 on p. 25.

A LexiQuest Categorize model nugget contains a taxonomy, which represents a set of categories. Each category in the taxonomy is defined by a set of descriptors, or concepts. These descriptors are concepts that were extracted from the text in the learning documents when the taxonomy was built. To learn more about building taxonomies and descriptors, see the Taxonomy Manager Users’ Guide for LexiQuest Categorize.

When you execute a stream containing a LexiQuest Categorize model nugget in Clementine, the source records or documents you feed to this model nugget are scanned to determine whether they contain words used to define the categories in the taxonomy. For each record or document, Clementine will attempt to match the text it contains to the descriptors in each of the model nugget’s categories. When matches are made, the record or document is considered for that category. In this way, a LexiQuest Categorize model nugget can be used to channel specific documents or records to the groups or areas within your organization where they are most likely to be of interest according to the categories to which each record or document is assigned.

Categorization versus Extraction

Unlike the Text Mining modeling node, the LexiQuest Categorize model nugget does not extract concepts from documents. It simply scans the text of each document for any matches to the descriptors defined in the model nugget. For example, if your LexiQuest Categorize model nugget contains a category called bread that is defined by the descriptors yeast, flour, rye bread, wheat bread, toast, and sourdough, documents containing any of these terms may be assigned to this category. But a document containing the related term pumpernickel would not be assigned unless this term was specifically included in the model nugget.

Scoring Results (Output Fields)

When you execute a stream containing a LexiQuest Categorize model nugget, new fields are created.

$Y-Category represents the category name into which a document or record was categorized. Several of the $Y-Category fields, each with a numeral suffix to differentiate them, may exist depending on the number of categories returned for that document or record. If two categories were returned, you would find both $Y-Category and $Y-Category1 in your data.

The suffix C, such as $YC-Category, represents the confidence score (contribution) for the categorization of this document or record into the category. There is one score for each category name field present.

The suffix -Descriptor, such as $Y-Category-Descriptor3, represents the name of the concept/descriptor that was used to categorize the record or document. Several of the -Descriptor fields may exist for each returned category depending on the value you define for Number of contributions to report and the Confidence thresholds you set on the Settings tab. Only those concept/descriptors that contribute the most to the categorization are present. They are returned in order of importance (weight).

The suffix -Weight, such as $Y-Category1-Weight2, represents the score of how much the second descriptor/concept contributes to the categorization of the document or record in Category1. This value is derived from the imported model nugget’s taxonomy.

Importing a LexiQuest Categorize Model Nugget

These model nuggets can be created in LexiQuest Categorize version 3.2 or later and then imported into Clementine for purposes of scoring.

E First, we imported a LexiQuest Categorize model nugget by choosing File > Models > Import Categorize Model from the menus.

Figure 5-1
Select file to import dialog box

E Next, we selected the desired model nugget from the Select file to import dialog box. This model nugget must have been created and exported from LexiQuest Categorize 3.2.

Once imported, the model nugget will be displayed on the Models palette in the Manager window (upper right corner of the application window).

LexiQuest Categorize Model Nugget: Model Tab

Figure 5-2
Imported LexiQuest Categorize model nugget dialog box: Model tab

The Model tab for a LexiQuest Categorize model nugget allows you to view the categories contained in the model nugget.

LexiQuest Categorize Model Nugget: Settings Tab

Figure 5-3
Imported LexiQuest Categorize model nugget dialog box: Settings tab

The Settings tab for a LexiQuest Categorize model nugget allows you to specify scoring options, including:

Predicted categories for each document. Specifies the maximum number of categories to which a document or record can be assigned. If multiple categories are assigned, they are reported in order of confidence, with the likeliest category listed first.

Report concepts contributing to prediction. Lists the descriptors, or concepts, that had the greatest role in determining each prediction. A prediction refers to the document’s or record’s assignment to a category. For example, if the predicted category is bread, the contributing descriptor concepts might be sourdough and yeast.

Number of contributions to report. Limits the number of descriptors, or concepts, that are displayed in the results. Only those that contributed the most will appear.

Calculate confidence. Specifies whether confidences are reported in the output for each prediction.

Confidence threshold. Determines the type of confidence strategy you want to apply when a category is returned for a document. Choices are:

Summation. The confidence threshold is measured based on the sum of the predicted categories for a given document/record. For each document/record, the confidence value for each predicted category is added to the confidence value for the next predicted category for this document/record until this threshold is met. See also the explanation for Minimum confidence level summation.

Single prediction. The confidence threshold is measured based on each predicted category for a given document/record, independent from all other confidence values for other categories.

Minimum confidence level summation. This setting applies only when Summation is selected. Specifies the minimum value for the sum of the confidence levels for the set of predicted categories for a document. Valid values range from 0 to 100. During the scoring process, the predicted categories for each document receive confidence scores, the sum of which equals 100 for each document. If you set this parameter to 100, the document will be matched to all candidate categories. While you are sure to get good answers with a higher value, you can also be quite certain that you will get more noisy answers. This setting is the most useful when you want to define a minimum for categorizing documents without degrading the overall precision.

As an example, let’s say that this setting is set to 80 and during the scoring process, the following categories (and confidence values) are returned for a document: Cat1 (40%), Cat2 (30%), Cat3 (20%), and Cat4 (10%). When this setting is applied, Cat1 is accepted, since its confidence is 40. Since the confidence value of Cat1 is less than 80, Cat2 is proposed and accepted. Since the confidence value of Cat1 plus Cat2 is 70 (40 + 30), which is still less than 80, Cat3 is accepted. However, since the sum of the first three categories is 90 and exceeds the Minimum confidence level summation value of 80, no other predicted categories are proposed and only the first three categories are accepted and returned.

Minimum confidence level in a single prediction. This setting applies only when Single prediction is selected. Specifies the minimum level of the confidence value for a single category to be returned for a document. Valid values range from 0 to 100. During the scoring process, each predicted category receives an individual confidence score for each document, and it must be equal to or higher than the value listed here to be accepted for that document. This parameter is most useful when you want to exclude categories for which the individual confidence is too low. This can help you create a strategy in which you want only the good answers. As an example, if this value is set to 50, any category with a confidence value of less than 50 will not be returned.

Ignore concepts found less than X times in a document. Specifies the minimum number of times a word must occur within a single document or record in order for it to be used to make a category match. For example, a value of 2 limits matching to only those words that occur at least twice in a record/document. During the process of scoring the text, each record/document is scanned. If a concept appears fewer than the number of times listed here in a given record/document, that concept will not be used to check for a match to a category descriptor in the taxonomy.

Accommodate punctuation errors. Select this option to apply a normalization technique to text found in the records or documents to improve the chances of matching this text to the descriptors in the taxonomy. This option is extremely useful when text quality may be poor (containing many punctuation errors) or when the text contains many abbreviations. These errors include the improper use of punctuation, such as the period, comma, semicolon, colon, and forward slash. Normalization does not permanently alter the text but “corrects” it internally to place spaces around improper punctuation.

LexiQuest Categorize Model Nugget: Fields Tab

Figure 5-4
Imported LexiQuest Categorize model nugget dialog box: Fields tab

The Fields tab for a LexiQuest Categorize model nugget allows you to specify scoring options, including:

Text field. Select the field containing the text to be mined, the document pathname, or the directory pathname to documents. This field depends on the data source.

Text field represents. Indicate what the text field specified in the preceding setting contains. Choices are:

Actual text. Select this option if the field contains the exact text from which concepts should be extracted. When you select this option, many of the other settings are disabled.

Pathnames to documents. Select this option if the field contains one or more pathnames for the location(s) of where the text documents reside.

Document type. This option is available only if you specified that the text field represents Pathnames to documents. Document type specifies the structure of the text. Select one of the following types:

Full text. Use for most documents or text sources. The entire set of text is scanned for extraction. If you select this option, you do not need to click the Settings button and define anything.

Structured text. Use for bibliographic forms, patents, and any files that contain regular structures that can be identified and analyzed. This document type is used to skip all or part of the extraction process. It allows you to define term separators, assign types, and impose a minimum frequency value. If you select this option, you must click the Settings button and enter text separators in the Structured Text Formatting area of the Document Settings dialog box.

XML text. Use to specify the XML tags that contain the text to be extracted. All other tags are ignored. If you select this option, you must click the Settings button and explicitly specify the XML elements containing the text to be read during the extraction process in the XML Text Formatting area of the Document Settings dialog box.

Input encoding. This option is available only if you indicated that the text field represents Pathnames to documents. It specifies the default text encoding. For all languages except Japanese, a conversion is done from the specified or recognized encoding to ISO-8859-1. So even if you specify another encoding, the extractor will convert it to ISO-8859-1 before it is processed. Any characters that do not fit into the ISO-8859-1 encoding definition will be converted to spaces.

Document Settings for Fields Tab

Figure 5-5 Document Settings dialog box

XML Text Formatting

If you want to limit the extraction process to only the text within specific XML tags, use the XML text document type option and declare the tags containing the text in the XML Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within these tags or their child tags.

Important! If you want to skip the extraction process and impose rules on term separators, assign types to the extracted text, or impose a frequency count for extracted terms, use the Structured text option described next.

Use the following rules when declaring tags for XML text formatting:

Only one XML tag per line can be declared.


Tag elements are case sensitive.

If a tag has attributes, such as <title id="id_name">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>), such as <title

To illustrate the syntax, let’s assume you have the following XML document:

<section>Rules of the Road
    <title id="01234">Traffic Signals</title>
    <p>Road signs are helpful.</p>
</section>
<p>Learning the rules is important.</p>

For this example, we will declare the following tags:

<section>
<title

In this example, since you have declared the tag <section>, the text in this tag and its nested tags, Traffic Signals and Road signs are helpful, are scanned during the extraction process. However, Learning the rules is important is ignored since the tag <p> was not explicitly declared nor was the tag nested within a declared tag.

Structured Text Formatting

If you want to skip all or part of the extraction process because you have structured data or want to impose rules on how to handle the text, use the Structured text document type option and declare the fields or tags containing the text in the Structured Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within the declared fields or tags (and child tags). Any undeclared field or tag will be ignored.

In certain contexts, linguistic processing is not required, and the linguistic extractor engine can be replaced by explicit declarations. In a bibliography file where keyword fields are separated by separators such as a semicolon (;) or comma (,), it is sufficient to extract the string between two separators. For this reason, you can suspend the full extraction process and instead define special handling rules to declare term separators, assign types to the extracted text, or impose a minimum frequency count for extraction.

Use the following rules when declaring structured text elements:

Only one field, tag, or element per line can be declared. They do not have to be present in the data.

Declarations are case sensitive.

If declaring a tag that has attributes, such as <title id="id_name">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>), such as <title

Add a colon after the field or tag name to indicate that this is structured text. Add this colon directly after the field or tag but before any separators, types, or frequency values, such as author: or <place>:.


To indicate that multiple terms are contained in the field or tag and that a separator is being used to designate the individual terms, declare the separator after the colon, such as author:, or <section>:;.

To assign a type to the content found in the tag, declare the type code after the colon and a separator, such as author:,P or <place>:;L. You can declare types using only a single letter (a–z). Digits are not supported. For more information, see “Type Dictionary Maps” in Chapter 18 on p. 270.

To define a minimum frequency count for a field or tag, declare a number n at the end of the line, such as author:,P1 or <place>:;L5. Terms found in the field or tag must then occur at least n times in the entire set of documents or records to be extracted. This also requires you to define a separator.

If you have a tag that contains a colon, you must precede the colon with a backslash character so that the declaration is not ignored. For example, if you have a field called <topic:source>, enter it as <topic\:source>.

To illustrate the syntax, let’s assume you have the following recurring bibliographic fields:

author:Morel, Martens
abstract:This article describes how fields are declared.
publication:SPSS Documentation
datepub:March 2009

For this example, if we wanted the extraction process to focus on author and abstract but ignore the rest of the content, we would declare only the following fields:

author:,P1
abstract:

In this example, the author:,P1 field declaration suspends linguistic processing on the field contents. Instead, it states that the author field contains more than one name, that each name is separated from the next by a comma separator, that these names should be assigned to the Person type (code: P), and that a name should be extracted if it occurs at least once in the entire set of documents or records. Since the field abstract: is listed without any other declarations, the field will be scanned during extraction, and standard linguistic processing and typing will be applied.
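The same declaration syntax applies to tags. As a minimal sketch only, assuming a hypothetical tag named <keywords> whose entries are separated by semicolons, the following declaration assigns the single-letter type code K to each entry and requires an entry to occur at least twice in the entire set of documents or records before it is extracted:

<keywords>:;K2

Because linguistic processing is suspended for this tag, each string found between semicolons is extracted as is and typed with the code K.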


LexiQuest Categorize Model Nugget: Language Tab

Figure 5-6 Imported LexiQuest Categorize model nugget dialog box: Language tab

The Language tab is used to specify the language settings for the extraction process, including any translation settings. You can set the following parameters:

Note: This tab appears in the node dialog box only when the model nugget is placed in the stream. It does not exist when you are accessing this dialog box directly in the Models palette.

Language. Identifies the language of the text being mined. Most of the options in this list are straightforward, such as Dutch, English, French, German, Italian, Portuguese, or Spanish. Although these languages appear in the list, you must have a license to use them in the text mining process. Contact your sales representative if you are interested in purchasing a license for a supported language to which you do not currently have access. Here are some additional language options:

ALL. If you know that your text is in only one language, we highly recommend that you select that language. Choosing the ALL option will add time when executing your stream, since Automatic Language Recognition is used to scan all documents and records in order to identify the text language first. With this option, all records or documents that are in a supported and licensed language are read by the extractor using the language-appropriate internal dictionaries. Although you may select this option, Text Mining for Clementine will accept only those in a language for which you have a license.


You can edit certain parameters affecting this option in the Automatic Language Identification section of the advanced resource editor. For more information, see “Language Identifier” in Chapter 18 on p. 274.

Translate with Language Weaver. With this option, the text will be translated for extraction. You must have Language Weaver Translation Server installed and configured. Other translation settings in this dialog box also apply.

Note: You can also use a Translate node if you want to separate the translation process from the extraction process or cache the results. If you use a Translate node, you should select English in the Language field. For more information, see “Translate Node” in Chapter 6 on p. 105.

Allow for unrecognized characters from previous translations/processing. Specifies that the text may contain some unsupported or non-English characters. This may be due to a previous translation or some kind of document preprocessing.

From. Identifies the language of the source text that will be translated.

To English. States that the text will be translated into English.

Translation accuracy. Specifies the desired accuracy level for the translation process. Choose a value from 1 to 7. Producing the most accurate translation results takes the maximum amount of time, so to help save time you can set your own accuracy level. A lower value produces faster translation results but with diminished accuracy. A higher value produces results with greater accuracy but increased processing time. To optimize time, we recommend beginning with a lower level and increasing it only if you feel you need more accuracy after reviewing the results.

Language Weaver Server Settings. In order to translate the language properly, you must specify both the hostname and the port number on which the Language Weaver Translation Server is installed and located. For Hostname, you must specify http:// preceding the URL or machine name, such as http://lwhost:4655. For more information on your Language Weaver Translation Server, contact your administrator. The text is then automatically translated into the supported language for extraction.

Using the LexiQuest Categorize Model Nugget in a Stream

The LexiQuest Categorize model nugget is used to score documents or records into a set of predefined categories. You can use any source node to access data, such as a Database node, Variable File node, or Fixed File node. For text that resides in external documents, a File List node can be used.

Example: File List node with a LexiQuest Categorize model nugget

For text that resides in external documents, a File List node can be used.

Figure 5-7 Example stream: File List node with a LexiQuest Categorize model nugget


E File List node: Settings tab. First, we added this node to the stream to specify where the text documents were stored. For more information on using the File List node, see Reading in Source Text on p. 11.

Figure 5-8 File List node dialog box: Settings tab

E LexiQuest Categorize model nugget. Next, we imported a Category model nugget by choosing File > Models > Import Categorize Model from the menus. When the Select file to import dialog box appeared, we selected the model nugget that was previously created and exported from LexiQuest Categorize 3.2 and clicked Open. Once imported, the model nugget appeared on the Models palette in the Manager window (upper right corner of the application window).


Figure 5-9 Select file to import dialog box

E LexiQuest Categorize model nugget: Fields tab. Next, we added an imported LexiQuest Categorize model nugget and attached the File List node to this node to scan the documents identified by the File List node for matches to the descriptors from the LexiQuest Categorize model nugget in order to categorize the documents. We selected the field name from the File List node—in this case, the Path variable—and selected the option Text field represents pathnames to documents.


Figure 5-10 LexiQuest Categorize model nugget dialog box: Fields tab

E Table node. Next, we added a Table node to visualize the categorization results.


Figure 5-11 Sample table output

Scripting Properties: applycategorizenode

You can use the properties in the following table for scripting.

Table 5-1 LexiQuest Categorize model nugget properties

applycategorizenode properties    Data type                  Description
method                            flag
num_categories                    integer
num_contributions                 integer
return_contributions              flag
calc_confidences                  flag
min_frequency                     integer
fix_punctuation                   flag
confidence                        Summation, Single
min_confidence_summation          integer
min_confidence_single             integer
text                              field
method                            ReadText, ReadPath
docType                           integer                    With possible values (0,1,2) where 0 = Full Text, 1 = Structured Text, and 2 = XML


encoding                          Automatic, UTF-8, UTF-16, ISO-8859-1, US-ASCII, CP850
language                          Dutch, English, French, German, Italian, Portuguese, Spanish, Language_Weaver
translate_from                    Arabic, Chinese, Dutch, French, German, Hindi, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Somali, Swedish
translation_accuracy              integer                    Specifies the accuracy level you desire for the translation process—choose a value of 1 to 7
lw_hostname                       string
lw_port                           integer
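As a minimal illustration only, the following sketch shows how a few of these properties might be set with Clementine stream scripting. It assumes the standard set :nodetype.property = value syntax and a stream containing a single imported applycategorizenode; the field name and other values shown are examples rather than recommendations:

set :applycategorizenode.text = "Path"
set :applycategorizenode.method = ReadPath
set :applycategorizenode.docType = 0
set :applycategorizenode.language = English
set :applycategorizenode.min_frequency = 2
set :applycategorizenode.fix_punctuation = true

Scoring would then run when a downstream terminal node, such as the Table node used in the example above, is executed.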


Chapter 6
Translating Text for Extraction

Translate Node

The Translate node can be used to translate text from supported languages, such as Arabic, Chinese, and Persian, into English for analysis using Text Mining for Clementine. This makes it possible to mine documents in double-byte languages that would not otherwise be supported and allows analysts to extract concepts from foreign-language documents even if they are unable to comprehend the language in question. Note that Language Weaver Translation Server must be installed and configured prior to using the Translate node. When mining text in any of these languages, simply add a Translate node prior to the Text Mining modeling node in your stream.

Figure 6-1 Text Mining palette

Alternatively, you can select a translation language in a Text Mining modeling node and any text mining model nuggets without using a separate Translate node. The same translation functionality is invoked in either case, but using a separate Translate node allows you to feed the same translation into several different modeling nodes without repeating the translation in each node. This can result in substantially improved performance. You can also enable caching in the Translate node to avoid repeating the translation each time the stream is executed.

Caching the translation. If you cache the translation, the translated text is stored in the stream rather than in external files. To avoid repeating the translation each time the stream is executed, select the Translate node and from the menus choose Edit > Node > Cache > Enable. The next time the stream is executed, the output from the translation is cached in the node. The node icon displays a tiny “document” graphic that changes from white to green when the cache is filled. The cache is preserved for the duration of the session. To preserve the cache for another day (after the stream is closed and reopened), select the node and from the menus choose Edit > Node > Cache > Save Cache. The next time you open the stream, you can reload the saved cache rather than running the translation again. Alternatively, you can save or enable a node cache by right-clicking the node and choosing Cache from the context menu.


Speeding up the translation. You will get the fastest results by making sure that your data and your stream execution are on the same machine as the Language Weaver Translation Server. Adding additional memory can also speed up translations; however, you may require a Win64 server machine or a machine with multiple processors.

Translate Node: Fields Tab

Figure 6-2 Translate node dialog box: Fields tab

Text field. Select the field containing the text to be mined, the document pathname, or the directory pathname to documents. This field depends on the data source. You can specify any string field, even those with Direction=None or Type=Typeless.

Text field represents. Indicate what the text field specified in the preceding setting contains. Choices are:

Actual text. Select this option if the field contains the exact text from which concepts should be extracted.

Pathnames to documents. Select this option if the field contains one or more pathnames to the locations of the external documents that contain the text for extraction. For example, if a File List node is used to read in a list of documents, this option should be selected. For more information, see “File List Node” in Chapter 2 on p. 11.

Input encoding. Specifies the default text encoding. For all languages except Japanese, a conversion is done from the specified or recognized encoding to ISO-8859-1. So even if you specify another encoding, the extractor will convert it to ISO-8859-1 before processing. Any characters that do not fit into the ISO-8859-1 encoding definition will be converted to spaces.


Translate Node: Language Tab

Figure 6-3 Translate node dialog box: Language tab

The Language tab is used to specify the language settings for translation. You can set the following parameters:

From. Identifies the language of the source text that will be translated.

To English. States that the text will be translated into English.

Translation accuracy. Specifies the desired accuracy level for the translation process. Choose a value from 1 to 7. Producing the most accurate translation results takes the maximum amount of time, so to help save time you can set your own accuracy level. A lower value produces faster translation results but with diminished accuracy. A higher value produces results with greater accuracy but increased processing time. To optimize time, we recommend beginning with a lower level and increasing it only if you feel you need more accuracy after reviewing the results.

Language Weaver Server Settings. In order to translate the language properly, you must specify both the hostname and the port number on which the Language Weaver Translation Server is installed and located. For Hostname, you must specify http:// preceding the URL or machine name, such as http://lwhost:4655. For more information on your Language Weaver Translation Server, contact your administrator. The text is then automatically translated into the supported language for extraction.

Save and reuse previously translated text when possible. Specifies that the translation results should be saved. If the same number of records/documents is present the next time the stream is executed, the content is assumed to be the same, and the translation results are reused to save processing time. If this option is selected at run time and the number of records does not match what was saved last time, the text is fully translated and then saved under the label name for the next execution. This option is available only if you selected a Language Weaver translation language.

Note: If the text is stored in the stream, you can achieve the same result by enabling caching in a Translate node.


Label. If you select Save and reuse previously translated text when possible, you must specify a label name for the results. This label is used to identify the previously translated text on the server. If no label is specified, a warning will be added to the Stream Properties when you execute the stream and no reuse will be possible.

Using the Translate Node

To extract concepts from supported translation languages, such as Arabic, Chinese, or Persian, simply add a Translate node prior to any Text Mining node in your stream.

Example: Translating Text in External Documents

If the text to be translated is contained in one or more external files, a File List node can be used to read in a list of names. In this case, the Translate node would be added between the File List node and any subsequent text mining nodes, and the output would be the location where the translated text resides.

Figure 6-4 Example stream: File List node with Translate node

E File List node: Settings. In the File List node, we selected the source files.

Figure 6-5 File List node dialog box: Settings tab


E Translate node: Fields tab. Next, we added and connected a Translate node. In the node, we selected the field produced by the File List node—named Path by default—which specifies the original location of the files. You can specify a translation output directory and other options as desired.

Figure 6-6 Translate node dialog box: Fields tab

E Translate node: Language tab. On this tab, we selected the original source language.

Figure 6-7 Translate node dialog box: Language tab

E Text Mining node: Fields tab. In any subsequent Text Mining nodes, we selected the field output by the Translate node—named after the text field from the File List node followed by _Translated—which specifies the location of the translated files.


Figure 6-8 Text Mining modeling node dialog box: Fields tab

E Text Mining modeling node: Language tab. On the Language tab, we selected English as the language and selected Allow for unrecognized characters from previous translations/processing to indicate that non-English characters may also appear.


Figure 6-9 Text Mining node dialog box: Language tab

Scripting Properties: translatenode

You can use the properties in the following table for scripting.

Table 6-1 Translate node properties

translatenode properties    Data type                  Property description
text                        field
method                      ReadText, ReadPath
docType                     integer                    With possible values (0,1,2) where 0 = Full Text, 1 = Structured Text, and 2 = XML
encoding                    Automatic, "UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII", CP850    Note that values with special characters, such as "UTF-8", should be quoted to avoid confusion with a mathematical operator


translate_from              Arabic, Chinese, Dutch, French, German, Hindi, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Somali, Swedish
translation_accuracy        integer                    Specifies the accuracy level you desire for the translation process—choose a value of 1 to 7
lw_hostname                 string
lw_port                     integer
use_previous_translation    flag                       Specifies that the translation results already exist from a previous execution and can be reused
translation_label           string                     Enter a label to identify the translation results for reuse
translated                  flag
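As with the LexiQuest Categorize properties above, the following is a minimal scripting sketch only; it assumes the standard set :nodetype.property = value syntax, a stream containing a single Translate node, and purely illustrative values (the hostname, port, and label shown here are hypothetical):

set :translatenode.text = "Path"
set :translatenode.method = ReadPath
set :translatenode.translate_from = Arabic
set :translatenode.translation_accuracy = 3
set :translatenode.lw_hostname = "http://lwhost"
set :translatenode.lw_port = 4655
set :translatenode.use_previous_translation = true
set :translatenode.translation_label = "arabic_batch_1"

Setting use_previous_translation and translation_label in this way corresponds to selecting Save and reuse previously translated text when possible on the Language tab and providing a label for the saved results.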


Chapter 7
Browsing External Source Text

File Viewer Node

After using a Text Mining node to mine text from external files that are not in your stream (for example, using a File List node) or to translate text, the File Viewer node can be used to provide you with direct access to your original data. This node can help you better understand the results from text extraction by providing you access to the source, or untranslated, text from which concepts were extracted since it is otherwise inaccessible in the stream. This node is added to the stream after a File List node to obtain a list of links to all the files.

Figure 7-1 Text Mining palette

The result of this node is a window showing all of the document elements that were read and used to extract concepts. From this window, you can click a toolbar icon to launch the report in an external browser listing document names as hyperlinks. You can click a link to open the corresponding document in the collection. For more information, see “Using the File Viewer Node” on p. 114.

Note: When you are working in client-server mode and File Viewer nodes are part of the stream, document collections must be stored in a Web server directory on the server. Since the Text Mining output node produces a list of documents stored in the Web server directory, the Web server’s security settings manage the permissions to these documents.

File Viewer Node Settings

The dialog box below is used to specify settings for the File Viewer node.


Figure 7-2 File Viewer node dialog box: Settings tab

Document field. Select the field from your data that contains the full name and path of the documents to be displayed.

Title for generated HTML page. Create a title to appear at the top of the page that contains the list of documents.

Using the File Viewer Node

The following example shows how to use the File Viewer node.

Example: File List node and a File Viewer node

Figure 7-3 Stream illustrating the use of a File Viewer node

E File List node: Settings tab. First, we added this node to specify where the documents are located.


Figure 7-4 File List node dialog box: Settings tab

E File Viewer node: Settings tab. Next, we attached the File Viewer node to produce an HTML list of documents.

Figure 7-5 File Viewer node dialog box: Settings tab

Executing the stream generates this list in a new window. To see the documents, we clicked the toolbar button showing a globe with a red arrow. This opened a list of document hyperlinks in our browser.


Figure 7-6 File Viewer output

Figure 7-7 Clickable document list


Part II: Interactive Workbench


Chapter 8
Interactive Workbench Mode

From Text Mining for Clementine, you can choose to execute a stream that launches an interactive workbench session. In this workbench, you can create categories, work with extracted concepts from your text data, and explore text link analysis patterns and clusters. In this chapter, we discuss the workbench interface from a high-level perspective along with the major elements with which you will work within a workbench session. At the highest level, you will be working with some of the following elements:

Extracted results. After an extraction is performed, these are the key words and phrases identified and extracted from your text data, also referred to as concepts. These concepts are grouped into types. Using these concepts and types, you can explore your data as well as create your categories. These are managed in the Categories and Concepts view.

Categories. Using descriptors (such as extracted results, patterns, and rules) as a definition, you can manually or automatically create a set of categories to which documents and records are assigned based on whether or not they contain a part of the category definition. These are managed in the Categories and Concepts view.

Clusters. You can build and explore clusters. Clusters are a grouping of concepts between which links have been discovered that indicate a relationship among them. The concepts are grouped using a complex algorithm that uses, among other factors, how often two concepts appear together compared to how often they appear separately. These are managed in the Clusters view. You can also add the concepts that make up a cluster to categories.

Text Link Analysis patterns. If you have created text link analysis (TLA) pattern rules in your linguistic resources or are using a resource template that already has some pattern rules, you can extract patterns from your text data. These patterns can help you uncover interesting relationships between concepts in your data. You can also use these patterns to create your categories. These are managed in the Text Link Analysis view.

Linguistic resources. The extraction process relies on a set of parameters and linguistic definitions to govern how text is extracted and handled. These are managed in the form of templates and libraries in the Resource Editor view.

The Categories and Concepts View

The application interface is made up of several views. The Categories and Concepts view is the window in which you can create and explore categories as well as explore and tweak the extracted results. A category is a group of closely related ideas and patterns to which documents and records are assigned through a scoring process.


Figure 8-1 Categories and Concepts view

The Categories and Concepts view is organized into four panes, each of which can be hidden or shown by selecting its name from the View menu. For more information, see “Categorizing Text Data” in Chapter 10 on p. 157.

Categories Pane

Located in the upper left corner, this area presents a table in which you can manage any categories you build. After extracting the concepts and types from your text data, you can begin building categories by using automatic techniques, such as semantic networks and concept inclusion, or by creating them manually. If you double-click a category name, the Category Definitions dialog box opens and displays all of the descriptors that make up its definition, such as concepts, types, and rules. For more information, see “Categorizing Text Data” in Chapter 10 on p. 157. When you select a row in the pane, you can then display information about corresponding documents/records or descriptors in the Data and Visualization panes.


Figure 8-2 Categories and Concepts view: Categories pane without categories and with categories

Extracted Results Pane

Located in the lower left corner, this area presents the extraction results. When you run an extraction, the extractor engine reads through the text data, identifies the relevant concepts, and assigns a type to each. Concepts are words or phrases extracted from your text data. Types are semantic groupings of concepts stored in the form of type dictionaries. When the extraction is complete, concepts and types appear in the Extracted Results pane. Concepts and types are color coded to help you identify what type they belong to. For more information, see “Extracted Results: Concepts and Types” in Chapter 9 on p. 139.

Text mining is an iterative process in which extraction results are reviewed according to the context of the text data, fine-tuned to produce new results, and then reevaluated. Extraction results can be refined by modifying the linguistic resources. This fine-tuning can be done in part directly from the Extracted Results or Data pane but also directly in the Resource Editor view. For more information, see “The Resource Editor View” on p. 130.

Figure 8-3 Categories and Concepts view: Extracted Results pane after an extraction


Visualization Pane

Located in the upper right corner, this area presents multiple perspectives on the commonalities in document/record categorization. Each graph or chart presents similar information but in a different manner or with a different level of detail. These charts and graphs can be used to analyze your categorization results and aid in fine-tuning categories or reporting. For example, in a graph you might uncover categories that are too similar (for example, they share more than 75% of their records) or too distinct. The contents in a graph or chart correspond to the selection in the other panes. For more information, see “Category Graphs and Charts” in Chapter 13 on p. 195.

Figure 8-4 Categories and Concepts view: Visualization pane

Data Pane

The Data pane is located in the lower right corner. This pane presents a table containing the documents or records corresponding to a selection in another area of the view. Depending on what is selected, only the corresponding text appears in the data pane. Once you make a selection, click the Display button to populate the Data pane with the corresponding text. If you have a selection in another pane, the corresponding documents or records show the concepts highlighted in color to help you easily identify them in the text. You can also hover your mouse over color-coded items to display the concept under which it was extracted and the type to which it was assigned. For more information, see “The Data Pane” in Chapter 10 on p. 161.


Figure 8-5 Categories and Concepts view: Data pane

The Clusters View

In the Clusters view, you can build and explore cluster results found in your text data. Clusters are a grouping of concepts generated by clustering algorithms based on how often concepts occur and how often they appear together. The goal of clusters is to group concepts that occur together, while the goal of categories is to group documents or records. In this release, you can build clusters and explore them in a set of charts and graphs that could help you uncover relationships among concepts that would otherwise be too time-consuming to find. While you cannot add entire clusters to your categories, you can add the concepts in a cluster to a category through the Cluster Definitions dialog box. For more information, see “Cluster Definitions” in Chapter 11 on p. 184. You can make changes to the settings for clustering to influence the results. For more information, see “Building Clusters” in Chapter 11 on p. 180.


Figure 8-6 Clusters view

The Clusters view is organized into three panes, each of which can be hidden or shown by selecting its name from the View menu. Typically, only the Clusters pane and the Visualization pane are visible.

Clusters Pane

Located on the left side, this pane presents the clusters that were discovered in the text data. You can create clustering results by clicking the Build button. Clusters are formed by a clustering algorithm, which attempts to identify concepts that occur together frequently. The more the concepts within a cluster occur together, coupled with the less they occur with other concepts, the better the cluster is at identifying interesting concept relationships. Two concepts co-occur when they both appear (or one of their synonyms or terms appears) in the same document or record. For more information, see “Analyzing Clusters” in Chapter 11 on p. 179. Any time the extraction is updated (a new extraction occurs), the cluster results are cleared, and you have to rebuild the clusters to get the latest results. When building the clusters, you can change some settings, such as the maximum number of clusters to create, the maximum number of concepts a cluster can contain, or the maximum number of links with external concepts it can have. For more information, see “Exploring Clusters” in Chapter 11 on p. 184.


Figure 8-7 Clusters view: Clusters pane

Visualization Pane

Located in the upper right corner, this pane presents a web graph of the selected cluster results. If not visible, you can access this pane from the View menu (View > Visualization). Depending on what is selected in the Clusters pane, you can view the corresponding interactions between or within clusters. The results are presented in multiple formats:

Concept Web. Web graph showing all of the concepts within the selected cluster(s), as well as linked concepts outside the cluster.

Cluster Web. Web graph showing the links from the selected cluster(s) to other clusters, as well as any links between those other clusters.

Note: You must build clusters and select clusters with external links to display a Cluster Web graph. For more information, see “Cluster Graphs” in Chapter 13 on p. 198.


Figure 8-8 Clusters view: Visualization pane

Data Pane

The Data pane is located in the lower right corner and is hidden by default. You cannot display any Data pane results from the Clusters pane since these clusters span multiple documents/records, making the data results uninteresting. However, you can see the data corresponding to a selection within the Cluster Definitions dialog box. Depending on what is selected in that dialog box, only the corresponding text appears in the data pane. Once you make a selection, click the Display button to populate the Data pane with the documents or records that contain all of the concepts together. The corresponding documents or records show the concepts highlighted in color to help you easily identify them in the text. You can also hover your mouse over color-coded items to display the concept under which it was extracted and the type to which it was assigned. The Data pane can contain multiple columns, but the text field column is always shown. It carries the name of the text field that was used during extraction or a document name if the text data is in many different files. Other columns are available. For more information, see “Adding Columns to the Data Pane” in Chapter 10 on p. 162.

The Text Link Analysis View

In the Text Link Analysis view, you can build and explore text link analysis pattern results found in your text data. Text link analysis (TLA) is a pattern-matching technology that enables you to define pattern rules and compare them to actual extracted concepts and relationships found in your text. Patterns are most useful when you are attempting to discover relationships between concepts or opinions about a particular subject. Some examples include wanting to extract opinions on products from survey data, genomic relationships from within medical research papers, or relationships between people or places from intelligence data.


Once you’ve extracted some TLA pattern results, you can explore them in the Data or Visualization panes and even add them to categories in the Categories and Concepts view. There must be some TLA pattern rules defined in the resource template or libraries you are using in order to extract TLA results. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275. If you chose to extract TLA pattern results, the results are presented in this view. If you have not chosen to do so, you will have to use the Extract button and choose the option to extract these pattern results.

Figure 8-9 Text Link Analysis view

The Text Link Analysis view is organized into four panes, each of which can be hidden or shown by selecting its name from the View menu. For more information, see “Exploring Text Link Analysis” in Chapter 12 on p. 187.

Type and Concept Patterns Panes

Located on the left side, the Type and Concept Pattern panes are two interconnected panes in which you can explore and select your TLA pattern results. Patterns are made up of a series of up to either six types or six concepts. The TLA pattern rule as it is defined in the linguistic resources dictates the complexity of the pattern results. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275.


Pattern results are first grouped at the type level and then divided into concept patterns. For this reason, there are two different result panes: Type Patterns (upper left) and Concept Patterns (lower left).

Type Patterns. The Type Patterns pane presents pattern results consisting of two or more related types matching a TLA pattern rule. Type patterns are shown as <Organization> + <Location> + <Positive>, which might provide positive feedback about an organization in a specific location.

Concept Patterns. The Concept Patterns pane presents the pattern results at the concept level for all of the type pattern(s) currently selected in the Type Patterns pane above it. Concept patterns follow a structure such as hotel + paris + wonderful.

Just as with the extracted results in the Categories and Concepts view, you can review the results here. If you see any refinements you would like to make to the types and concepts that make up these patterns, you make those in the Extracted Results pane in the Categories and Concepts view, or directly in the Resource Editor, and reextract your patterns.

Figure 8-10 Text Link Analysis view: Both Type and Concept Patterns panes


Visualization Pane

Located in the upper right corner, this pane presents a web graph of the patterns as either type patterns or concept patterns. If not visible, you can access this pane from the View menu (View > Visualization). Depending on what is selected in the other panes, you can view the corresponding interactions between documents/records and the patterns.

The results are presented in multiple formats:

Concept Graph. This graph presents all the concepts in the selected pattern(s). The line width and node sizes (if type icons are not shown) in a concept graph show the number of global occurrences in the selected table.

Type Graph. This graph presents all the types in the selected pattern(s). The line width and node sizes (if type icons are not shown) in the graph show the number of global occurrences in the selected table. Nodes are represented by either a type color or by an icon.

For more information, see “Text Link Analysis Graphs” in Chapter 13 on p. 200.

Figure 8-11 Text Link Analysis: Visualization pane

Data Pane

The Data pane is located in the lower right corner. This pane presents a table containing the documents or records corresponding to a selection in another area of the view. Depending on what is selected, only the corresponding text appears in the data pane. Once you make a selection, click the Display button to populate the Data pane with the corresponding text. If you have a selection in another pane, the corresponding documents or records show the concepts highlighted in color to help you easily identify them in the text. You can also hover your mouse over color-coded items to display the concept under which it was extracted and the type to which it was assigned. For more information, see “The Data Pane” in Chapter 10 on p. 161.


The Resource Editor View

Text Mining for Clementine rapidly and accurately captures key concepts from text data using a robust extraction engine. This engine relies heavily on linguistic resources to dictate how large amounts of unstructured, textual data should be analyzed and interpreted.

The Resource Editor view is where you can view and fine-tune the linguistic resources used to extract concepts, group them under types, discover patterns in the text data, and much more. Text Mining for Clementine offers many preconfigured resource templates. Since these resources may not always be perfectly adapted to the context of your data, you can create, edit, and manage your own resources for a particular context or domain in the Resource Editor. For more information, see “Working with Libraries” in Chapter 16 on p. 229.

Note: To simplify the process of fine-tuning your linguistic resources, you can perform common dictionary tasks directly from the Categories and Concepts view through context menus in the Extracted Results and Data panes. For more information, see “Refining Extraction Results” in Chapter 9 on p. 148.

Figure 8-12 Resource Editor view

The operations that you perform in the Resource Editor view revolve around the management and fine-tuning of the linguistic resources. These resources are stored in the form of templates and libraries. The Resource Editor view is organized into four parts: Library Tree pane, Type Dictionary pane, Substitution Dictionary pane, and Exclude Dictionary pane. For more information, see “The Editor Interface” in Chapter 15 on p. 217.


Setting Options

You can set general options for Text Mining for Clementine in the Options dialog box. This dialog box contains the following tabs:

Session. This tab contains general options and delimiters.

Colors. This tab contains options for the colors used in the interface.

Sounds. This tab contains options for sound cues.

To Edit Options

E From the menus, choose Tools > Options. The Options dialog box opens.

E Select the tab containing the information you want to change.

E Change any of the options.

E Click OK to save the changes.

Options: Session Tab

On this tab, you can define some of the basic settings.

Figure 8-13 Options dialog box: Session tab

Data Pane and Category Graph Display. These options affect how data are presented in the Data pane and graphs in the Categories and Concepts view.


Display limit for Data Pane and Category Web. This option sets the maximum number of documents to show or use to populate the Data panes or graphs and charts in the Categories and Concepts view.

Map documents to categories at Display time. When this option is selected, each time you click Display, the documents and records are scored so as to show the categories to which they are assigned in the Data pane and the category graphs. In some cases, especially with larger datasets, you may want to turn off this option so that data and graphs are displayed much faster.

Resource Editor Delimiter. Select the character to be used as a delimiter when entering elements, such as concepts, synonyms, and optional elements, in the Resource Editor view.

Note: If you click the Default Values button, all options in this dialog box are reset to the values they had when you first installed this product.

Options: Colors Tab

On this tab, you can edit options affecting the overall look and feel of the application and the colors used to distinguish elements.

Figure 8-14 Options dialog box: Colors tab

Standard Fonts & Colors. By default, Text Mining for Clementine uses a proprietary look and feel. This option is called Use Product Settings. To use a standard Windows look and feel, select Use Windows Settings. If you change options here, you will need to shut down the application and restart it for the changes to take effect.

Custom Colors. Edit the colors for elements appearing onscreen. For each of the elements in the table, you can change the color. To specify a custom color, click the color area to the right of the element you want to change and choose a color from the drop-down color list.

Non-extracted text. Text data that was not extracted but is still visible in the Data pane.

Highlight background. Text selection background color when selecting elements in the panes or text in the data pane.


Extraction needed background. Background color of the Extracted Results, Patterns, and Clusters panes indicating that changes have been made to the libraries and an extraction is needed.

Category feedback background. Category background color that appears after an operation.

Default type. Default color for types and concepts appearing in the Data pane and Extracted Results pane. This color will apply to any custom types that you create in the Resource Editor. You can override this default color for your custom type dictionaries by editing the properties for these type dictionaries in the Resource Editor. For more information, see “Creating Types” in Chapter 17 on p. 245.

Striped table 1. First of the two colors used in an alternating manner in the table in the Edit Forced Concepts dialog box in order to differentiate each set of lines.

Striped table 2. Second of the two colors used in an alternating manner in the table in the Edit Forced Concepts dialog box in order to differentiate each set of lines.

Note: If you click the Default Values button, all options in this dialog box are reset to the values they had when you first installed this product.

Options: Sounds Tab

On this tab, you can edit options affecting sounds. Under Sound Events, you can specify a sound to be used to notify you when an event occurs. A number of sounds are available. Use the ellipsis button (...) to browse for and select a sound. The .wav files used to create sounds for Text Mining for Clementine are stored in the media subdirectory of the installation directory. If you do not want sounds to be played, select Mute All Sounds. Sounds are muted by default.

Note: If you click the Default Values button, all options in this dialog box are reset to the values they had when you first installed this product.

Figure 8-15 Options dialog box: Sounds tab


Microsoft Internet Explorer Settings for Help

Microsoft Internet Explorer Settings

Most Help features in this application use technology based on Microsoft Internet Explorer. Some versions of Internet Explorer (including the version provided with Microsoft Windows XP, Service Pack 2) will by default block what it considers to be “active content” in Internet Explorer windows on your local computer. This default setting may result in some blocked content in Help features. To see all Help content, you can change the default behavior of Internet Explorer.

E From the Internet Explorer menus, choose Tools > Internet Options...

E Click the Advanced tab.

E Scroll down to the Security section.

E Select (check) Allow active content to run in files on My Computer.

Generating Model Nuggets and Modeling Nodes

When you are in an interactive session, you may want to use the work you have done to generate either:

A modeling node. A modeling node generated from an interactive workbench session is a Text Mining node whose settings and options reflect those stored in the open interactive session. This can be useful when you no longer have the original Text Mining node or when you want to make a new version.

A model nugget. A model nugget generated from an interactive workbench session is a category model nugget. You must have at least one category in the Categories and Concepts view in order to generate a category model nugget.

To Generate a Text Mining Modeling Node

E From the menus, choose Generate > Generate Modeling Node. A Text Mining modeling node is added to the working canvas using all of the settings currently in the workbench session. The node is named after the text field.

To Generate a Category Model Nugget

E From the menus, choose Generate > Generate Model. A model nugget is generated directly onto the Model palette with the default name.

Updating Modeling Nodes and Saving

While you are working in an interactive session, we recommend that you update the modeling node from time to time to save your changes. You should also update your modeling node whenever you are finished working in the interactive workbench session and want to save your work.


When you update the modeling node, the workbench session content is saved back to the Text Mining node that originated the interactive workbench session. This does not close the output window.

To Update a Modeling Node (and Save Your Work)

E From the menus, choose File > Update Modeling Node. The modeling node is updated with the build and extraction settings, along with any options and categories you have.

Closing and Deleting Sessions

When you are finished working in your session, you can leave the session in three different ways:

Save. This option allows you to first save your work back into the originating modeling node for future sessions, as well as to publish any libraries for reuse in other projects. For more information, see “Sharing Libraries” in Chapter 16 on p. 238. After you have saved, the session window is closed, and the session is deleted from the Output manager in the Clementine window.

Exit. This option will discard any unsaved work, close the session window, and delete the session from the Output manager in the Clementine window. To free up memory, we recommend saving any important work and exiting the session.

Close. This option will not save or discard any work. This option closes the session window, but the session will continue to run. You can open the session window again by selecting this session in the Output manager in the Clementine window.

To Close a Workbench Session

E From the menus, choose File > Close.

Figure 8-16 Close Interactive Session dialog box

Keyboard Accessibility

The interactive workbench interface offers keyboard shortcuts to make the product’s functionality more accessible. At the most basic level, you can press the Alt key plus the appropriate key to activate window menus (for example, Alt+F to access the File menu) or press the Tab key to scroll through dialog box controls. This section will cover the keyboard shortcuts for alternative navigation. There are other keyboard shortcuts for the Clementine interface.


Table 8-1 Generic keyboard shortcuts

Shortcut key              Function
Ctrl+1                    Display the first tab in a pane with tabs.
Ctrl+2                    Display the second tab in a pane with tabs.
Ctrl+A                    Select all elements for the pane that has focus.
Ctrl+C                    Copy selected text to the clipboard.
Ctrl+E                    Run a new extraction in the Categories and Concepts and Text Link Analysis views.
Ctrl+F                    Display the Find toolbar in the Resource Editor/Template Editor, if not already visible, and put focus there.
Ctrl+I                    In the Categories and Concepts view, launch the Category Definitions dialog box. In the Clusters view, launch the Cluster Definitions dialog box.
Ctrl+R                    Open the Add Terms dialog box in the Resource Editor/Template Editor.
Ctrl+T                    Open the Type Properties dialog box to create a new type in the Resource Editor/Template Editor.
Ctrl+V                    Paste clipboard contents.
Ctrl+X                    Cut selected items from the Resource Editor/Template Editor.
Ctrl+Y                    Redo the last action in the view.
Ctrl+Z                    Undo the last action in the view.
F1                        Display Help, or when in a dialog box, display context Help for an item.
F2                        Toggle in and out of edit mode in table cells.
F6                        Move the focus between the main panes in the active view.
F8                        Move the focus to pane splitter bars for resizing.
F10                       Expand the main File menu.
Up arrow, Down arrow      Resize the pane vertically when the splitter bar is selected.
Left arrow, Right arrow   Resize the pane horizontally when the splitter bar is selected.
Home, End                 Resize panes to minimum or maximum size when the splitter bar is selected.
Tab                       Move forward through items in the window or dialog box.
Shift+F10                 Display the context menu for an item.
Shift+Tab                 Move back through items in the window or dialog box.
Shift+arrow               Select characters in the edit field when in edit mode (F2).
Ctrl+Tab                  Move the focus forward to the next main area in the window.
Shift+Ctrl+Tab            Move the focus backward to the previous main area in the window.

Shortcuts for Dialog Boxes

Several shortcut and screen reader keys are helpful when you are working with dialog boxes. Upon entering a dialog box, you may need to press the Tab key to put the focus on the first control and to initiate the screen reader. A complete list of special keyboard and screen reader shortcuts is provided in the following table.


Table 8-2 Dialog box shortcuts

Shortcut key      Function
Tab               Move forward through the items in the window or dialog box.
Ctrl+Tab          Move forward from a text box to the next item.
Shift+Tab         Move back through items in the window or dialog box.
Shift+Ctrl+Tab    Move back from a text box to the previous item.
space bar         Select the control or button that has focus.
Esc               Cancel changes and close the dialog box.
Enter             Validate changes and close the dialog box (equivalent to the OK button). If you are in a text box, you must first press Ctrl+Tab to exit the text box.


Chapter 9
Extracting Concepts and Types

Whenever you execute a stream that launches the interactive workbench, an extraction is automatically performed on the text data in the stream. The end result of this extraction is a set of concepts, types, and, in the case where TLA patterns exist in the linguistic resources, patterns. You can view and work with concepts and types in the Extracted Results pane. For more information, see “How Extraction Works” in Chapter 1 on p. 5.

Figure 9-1 Extracted Results pane after an extraction

If you want to fine-tune the extraction results, you can modify the linguistic resources and reextract. For more information, see “Refining Extraction Results” on p. 148. The extraction process relies on the resources and any parameters in the Extract dialog box to dictate how to extract and organize the results. You can use the extraction results to define the better part, if not all, of your category definitions.

Extracted Results: Concepts and TypesDuring the extraction process, all of the text data is scanned and the relevant concepts areidentified, extracted, and assigned to types. When the extraction is complete, the results appear inthe Extracted Results pane located in the lower left corner of the Categories and Concepts view.The first time you launch the session, the linguistic resource template you selected in the node isused to extract and organize these concepts and types.The concepts, types, and TLA patterns that are extracted are collectively referred to as

extraction results, and they serve as the descriptors, or building blocks, for your categories.Additionally, the automatic classification techniques use concepts and types to build categories.


Text mining is an iterative process in which extraction results are reviewed according to the context of the text data, fine-tuned to produce new results, and then reevaluated. After extracting, you should review the results and make any changes that you find necessary by modifying the linguistic resources. You can fine-tune the resources, in part, directly from the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box. For more information, see "Refining Extraction Results" on p. 148. You can also do so directly in the Resource Editor view. For more information, see "The Resource Editor View" in Chapter 8 on p. 130.

After fine-tuning, you can then reextract to see the new results. By fine-tuning your extraction results from the start, you can be assured that each time you reextract, you will get identical results in your category definitions, perfectly adapted to the context of the data. In this way, documents/records will be assigned to your category definitions in a more accurate, repeatable manner.

Concepts

During the extraction process, the text data is scanned and analyzed in order to identify interesting or relevant single words (such as election or peace) and word phrases (such as presidential election, election of the president, or peace treaties) in the text. These words and phrases are collectively referred to as terms. Using the linguistic resources, the relevant terms are extracted, and then similar terms are grouped together under a lead term called a concept.

In this way, a concept can represent multiple terms, depending on your text and the set of linguistic resources you are using. For example, if you looked at all of the records in which the concept cost appeared, you might notice that the word cost itself cannot be found in a given document but that something similar is present instead, such as the word price. In fact, the concept cost that appears in your concept list after extraction may represent many other terms, such as price, costs, fee, fees, and dues, if the extractor deemed them similar or found synonyms based on processing rules or linguistic resources. In this case, any documents or records containing any of those terms would be treated as if they contained the word cost.
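The following Python sketch makes this idea concrete. It is not the product's extraction engine, and the term-to-concept mapping is invented for illustration; it simply shows how any record containing one of a concept's grouped terms can be treated as containing the lead concept:

# Hypothetical term-to-concept mapping; in the product this grouping comes
# from the linguistic resources and processing rules.
TERM_TO_CONCEPT = {
    "cost": "cost", "costs": "cost", "price": "cost",
    "fee": "cost", "fees": "cost", "dues": "cost",
}

def concepts_in(text):
    # Return the lead concepts whose terms appear in the text.
    return {TERM_TO_CONCEPT[w] for w in text.lower().split() if w in TERM_TO_CONCEPT}

for record in ["The price was too high", "No hidden fees", "Great service"]:
    print(record, "->", concepts_in(record))
# The price was too high -> {'cost'}
# No hidden fees -> {'cost'}
# Great service -> set()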

Figure 9-2. Concept view: Extracted results pane with used concepts in italics


By default, the pane displays the list of extracted concepts in lowercase, and they appear in descending order by global frequency. Global frequency represents the number of times a concept (or one of its terms or synonyms) appears in the entire set of documents or records. When concepts are extracted, they are assigned a type to help group similar concepts. They are color coded according to their type. Colors are defined in the type properties within the Resource Editor. For more information, see "Type Dictionaries" in Chapter 17 on p. 243.

Whenever a concept, type, or pattern is being used in a category definition, it appears in italics in the table. You can view only the unused concepts by clicking the right-most icon in the Extracted Results pane.

Types

Types are semantic groupings of concepts stored in the form of type dictionaries. When you select this view, the extracted types appear by default in descending order by global frequency. You can also see that types are color coded to help distinguish them. You can change these colors in the Resource Editor. For more information, see "Built-in Types" in Chapter 17 on p. 244.

When concepts are extracted, they are assigned a type to help group similar concepts. Several built-in types are delivered with Text Mining for Clementine, such as Location, Product, Person, Positive (qualifiers), and Negative (qualifiers). For more information, see "Built-in Types" in Chapter 17 on p. 244. You can also create your own types. For more information, see "Creating Types" in Chapter 17 on p. 245.

For example, the Location type groups geographical keywords and places. This type would be assigned to concepts such as chicago, paris, and tokyo.

Note: Concepts that are not found in any type dictionary but are extracted from the text are automatically typed as <Unknown>.

Figure 9-3. Type view: Extracted results pane


Patterns

Patterns can also be extracted from your text data. However, you must have a library that contains some Text Link Analysis (TLA) pattern rules in the Resource Editor. You also have to choose to extract these patterns in the Text Mining for Clementine node settings or in the Extract dialog box using the option Enable Text Link Analysis pattern extraction. For more information, see "Exploring Text Link Analysis" in Chapter 12 on p. 187.

Extracting Data

The extraction process results in a set of concepts and types, as well as Text Link Analysis (TLA) patterns, if enabled. You can view and work with these concepts and types in the Extracted Results pane in the Categories and Concepts view. If you extracted TLA patterns, you can see those in the Text Link Analysis view.

Note: Whenever an extraction is needed, the Extracted Results pane becomes yellow in color. There is a relationship between the size of your dataset and the time it takes to complete the extraction process. You can always consider inserting a Sample node upstream or optimizing your machine's configuration.

To Extract Data

E From the menus, choose Tools > Extract. Alternatively, click the Extract toolbar button.

Figure 9-4. Extract dialog box

E On the Settings tab, change any of the options you want to use. For more information, see "Extract Dialog Box: Settings Tab" on p. 143.

E On the Language tab, change any of the options you want to use. For more information, see "Extract Dialog Box: Language Tab" on p. 145.

E Click Extract to begin the extraction process.


Once the extraction begins, the progress dialog box opens. If you want to abort the extraction, click Cancel. When the extraction is complete, the dialog box closes and the extraction results appear in the pane.

Figure 9-5. Extraction progress dialog box

The list of extracted concepts is sorted by global frequency in descending order. You can review the results using the toolbar options to sort the results differently, to filter the results, or to switch to a different view (concepts or types). You can also refine your extraction results by working with the linguistic libraries used by the extractor to identify concepts and types. For more information, see "Refining Extraction Results" on p. 148.

Extract Dialog Box: Settings Tab

The Settings tab contains some basic extraction options.

Note: This dialog box contains another tab with more options. For more information, see "Extract Dialog Box: Language Tab" on p. 145.

Figure 9-6. Extract dialog box


Enable Text Link Analysis pattern extraction. Specifies that you want to extract TLA patterns from your text data. It also assumes you have TLA pattern rules in one of your libraries in the Resource Editor. This option may significantly lengthen the extraction time. For more information, see "Exploring Text Link Analysis" in Chapter 12 on p. 187.

Limit extraction to concepts with a global frequency of at least [n]. Specifies the minimum number of times a word or phrase must occur in the text in order for it to be extracted. For example, a value of 2 limits the extraction to those words or phrases that occur at least twice in the entire set of records or documents.

Accommodate punctuation errors. Select this option to apply a normalization technique to improve the extractability of concepts from short text data containing many punctuation errors. These errors include the improper use of punctuation, such as the period, comma, semicolon, colon, and forward slash. This option is extremely useful when text quality may be poor (as, for example, in open-ended survey responses, e-mail, and CRM data) or when the text contains many abbreviations. Normalization does not permanently alter the text but "corrects" it internally to place spaces around improper punctuation.
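As a rough illustration of this kind of normalization (the product's internal rules are not documented here, so this regular expression is only an approximation of the idea), a correction pass might pad improper punctuation with spaces:

import re

def normalize_punctuation(text):
    # Pad periods, commas, semicolons, colons, and slashes with spaces so that
    # words jammed against punctuation still tokenize cleanly.
    return re.sub(r"\s*([.,;:/])\s*", r" \1 ", text).strip()

print(normalize_punctuation("great product,but delivery/shipping was slow.would not buy"))
# great product , but delivery / shipping was slow . would not buy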

Accommodate spelling errors for a minimum root character limit of [n]. Select this option to apply a fuzzy grouping technique. When extracting concepts from your text data, you may want to group commonly misspelled words or closely spelled words. You can have them grouped together using a fuzzy grouping algorithm that temporarily strips vowels and double/triple consonants from extracted words and then compares them to see if they are the same.

By default, this option applies only to words with five or more root characters. To change this limit, specify that number here. The number of root characters in a term is calculated by totaling all of the characters and subtracting any characters that form inflection suffixes and, in the case of compound-word terms, determiners and prepositions. For example, the term exercises would be counted as 8 root characters in the form "exercise," since the letter s at the end of the word is an inflection (plural form). Similarly, apple sauce counts as 10 root characters ("applesauce") and manufacturing of cars counts as 16 root characters ("manufacturing car"). This method of counting is only used to check whether the fuzzy grouping should be applied but does not influence how the words are matched.
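The following Python sketch is a simplified stand-in for the fuzzy grouping comparison; the product's actual algorithm and the root-character check are more involved. It collapses repeated consonants, strips vowels, and then compares the resulting keys:

import re

def fuzzy_key(word):
    # Collapse doubled/tripled consonants, then strip vowels after the first
    # letter. Words that reduce to the same key become grouping candidates.
    collapsed = re.sub(r"([b-df-hj-np-tv-z])\1+", r"\1", word.lower())
    return collapsed[0] + re.sub(r"[aeiouy]", "", collapsed[1:])

# Common misspellings collapse to the same key...
print(fuzzy_key("committee") == fuzzy_key("comittee"))   # True
# ...and so, occasionally, do unrelated words (see "Incorrect matchings" later
# in this chapter).
print(fuzzy_key("faculty") == fuzzy_key("facility"))     # True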

Note: If you find that using this option also groups certain words incorrectly, you can exclude word pairs from this technique by explicitly declaring them in the advanced resources editor, in the Fuzzy Grouping > Exceptions section of the interactive workbench. For more information, see "Fuzzy Grouping" in Chapter 18 on p. 266.

Extract uniterms. Select this option to extract single words (uniterms) under the following conditions: the word is not part of a compound word, the word is unknown to the extractor base dictionary, or the word is identified as a noun.

Extract nonlinguistic entities. Select this option to extract nonlinguistic entities. Nonlinguistic entities include phone numbers, social security numbers, times, dates, currencies, digits, percentages, e-mail addresses, and HTTP addresses. These entities are explicitly declared for inclusion or exclusion in the linguistic resources. You can enable and disable the nonlinguistic entity types you want to extract in the Nonlinguistic Entities > Configuration section of the interactive workbench. By disabling the entities you do not need, you can decrease the processing time required. For more information, see "Configuration" in Chapter 18 on p. 268.


Uppercase algorithm. Select this option to enable the default algorithm that extracts simple words and compound words that are not in the internal dictionaries, as long as the first letter is in uppercase.

Maximum nonfunction word permutation. Specify the maximum number of nonfunction words that can be present when applying the permutation technique. This technique groups similar phrases that vary only because nonfunction words (for example, of and the) are present, regardless of inflection. For example, if you set this value to at least two words and both company officials and officials of the company were extracted, they would be grouped together in the final concept list.
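A minimal Python sketch of the grouping idea follows; the function-word list and the plural handling are simplified assumptions, not the extractor's actual resources:

FUNCTION_WORDS = {"of", "the", "a", "an", "for", "in", "to"}   # illustrative list only

def permutation_key(phrase):
    # Drop function words such as "of" and "the", crudely strip a plural "s",
    # and sort the remaining words so that word order no longer matters.
    content = [w.rstrip("s") for w in phrase.lower().split() if w not in FUNCTION_WORDS]
    return tuple(sorted(content))

print(permutation_key("company officials") == permutation_key("officials of the company"))
# True: both reduce to ('company', 'official')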

Extract Dialog Box: Language Tab

The Language tab contains some language-specific options, such as the language of the text data as well as any translation options, if needed.

Note: This dialog box contains another tab with more options. For more information, see "Extract Dialog Box: Settings Tab" on p. 143.

Figure 9-7. Extract dialog box: Language tab

Language. Identifies the language of the text being mined. Most of the options in this list are straightforward, such as Dutch, English, French, German, Italian, Portuguese, or Spanish. Although these languages appear in the list, you must have a license to use them in the text mining process. Contact your sales representative if you are interested in purchasing a license for a supported language to which you do not currently have access. Here are some additional language options:

ALL. If you know that your text is in only one language, we highly recommend that you select that language. Choosing the ALL option adds time when executing your stream, since Automatic Language Recognition is used to scan all documents and records in order to identify the text language first. With this option, all records or documents that are in a supported and licensed language are read by the extractor using the language-appropriate internal dictionaries. Although you may select this option, Text Mining for Clementine will process only those documents or records that are in a language for which you have a license. You can edit certain parameters affecting this option in the Automatic Language Identification section of the advanced resource editor. For more information, see "Language Identifier" in Chapter 18 on p. 274.

Translate with Language Weaver. With this option, the text will be translated for extraction. You must have Language Weaver Translation Server installed and configured. Other translation settings in this dialog box also apply.

Note: You can also use a Translate node if you want to separate the translation process from the extraction process or cache the results. If you use a Translate node, you should select English in the Language field. For more information, see "Translate Node" in Chapter 6 on p. 105.

Allow for unrecognized characters from previous translations/processing. Specifies that the text may contain some unsupported or non-English characters. This may be due to a previous translation or some kind of document preprocessing.

From. Identifies the language of the source text that will be translated.

To English. States that the text will be translated into English.

Translation accuracy. Specifies the desired accuracy level for the translation process. Choose a value from 1 to 7. A lower value produces faster translation results but with diminished accuracy; a higher value produces more accurate results but requires more processing time. To optimize time, we recommend beginning with a lower level and increasing it only if you feel you need more accuracy after reviewing the results.

Language Weaver Server Settings. In order to translate the language properly, you must specify both the hostname and the port number on which the Language Weaver Translation Server is installed. For Hostname, you must specify http:// preceding the URL or machine name, such as http://lwhost:4655. For more information about your Language Weaver Translation Server, contact your administrator. The text is then automatically translated into the supported language for extraction.

Filtering Extracted Results

When you are working with very large datasets, the extraction process could produce millions of results. For many users, this amount can make it more difficult to review the results effectively. You can, however, filter these results in order to zoom in on those that are most interesting. You can change the settings in the Filter dialog box to limit what is visible in the Extracted Results pane. All of these settings are used together.


Figure 9-8. Filter dialog box (from the Extracted Results pane)

Filter by Frequency. You can filter to display only those results with a certain global or document frequency value.

Global frequency is the total number of times a concept appears in the entire set of documents or records and is shown in the Global column.

Document frequency is the total number of documents or records in which a concept appears and is shown in the Docs column.

For example, if the concept nato appeared 800 times in 500 records, we would say that this concept has a global frequency of 800 and a document frequency of 500.
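In Python terms, the two counts could be computed as in the sketch below (a simplification over plain word lists; the product also counts a concept's terms and synonyms):

def frequencies(records, concept):
    # Global frequency: total occurrences across all records.
    # Document frequency: number of records containing the concept at least once.
    counts = [record.lower().split().count(concept) for record in records]
    return sum(counts), sum(1 for c in counts if c > 0)

records = ["nato expands", "nato and nato allies", "no mention here"]
print(frequencies(records, "nato"))   # (3, 2): global frequency 3, document frequency 2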

And by Type. You can filter to display only those results belonging to certain types. You can choose all types or only specific types.

And by Match Text. You can also filter to display only those results that match the rule you define here. Enter the set of characters to be matched in the Match text field and then select the condition in which to apply the match.

Table 9-1. Match text conditions

Condition      Description
Contains       The text is matched if the string occurs anywhere. (Default choice)
Starts with    Text is matched only if the concept or type starts with the specified text.
Ends with      Text is matched only if the concept or type ends with the specified text.
Exact Match    The entire string must match the concept or type name.
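The four conditions behave like ordinary string tests, as in this small illustrative sketch:

def matches(name, text, condition):
    # The conditions from Table 9-1, applied to a concept or type name.
    name, text = name.lower(), text.lower()
    return {
        "contains":    text in name,
        "starts with": name.startswith(text),
        "ends with":   name.endswith(text),
        "exact match": name == text,
    }[condition]

print(matches("seat belt", "seat", "starts with"))        # True
print(matches("seat belt", "seat belt", "exact match"))   # True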


And by Rank. You can also filter to display only a top number of concepts according to global frequency (Global) or document frequency (Docs) in either ascending or descending order.

Results Displayed in the Extracted Results Pane

Here are some examples of how the results might be displayed in the Extracted Results pane toolbar based on the filters.

Figure 9-9. Filter results example 1

In this example, the toolbar shows the number of results. Since there was no text matching filter and the maximum was not met, no additional icons are shown.

Figure 9-10. Filter results example 2

In this example, the toolbar shows that results were limited to the maximum specified in the filter, which in this case was 300. If a purple icon is present, this means that the maximum number of concepts was met. Hover over the icon for more information.

Figure 9-11. Filter results example 3

In this example, the toolbar shows that results were limited using a match text filter (see the magnifying glass icon).

To Filter the Results

E From the menus, choose Tools > Filter. The Filter dialog box opens.

E Select and refine the filters you want to use.

E Click OK to apply the filters and see the new results in the Extracted Results pane.

Refining Extraction Results

Extraction is an iterative process whereby you can extract, review the results, make changes to them, and then reextract to update the results. Since accuracy and continuity are essential to successful text mining and categorization, fine-tuning your extraction results from the start ensures that each time you reextract, you will get precisely the same results in your category definitions. In this way, documents and records will be assigned to your categories in a more accurate, repeatable manner.

The extraction results serve as the building blocks for categories. When you create categories using these extraction results, documents and records are automatically assigned to categories if they contain text that matches one or more category descriptors. Although you can begin categorizing before making any refinements to the linguistic resources, it is useful to review your extraction results at least once before beginning.


As you review your results, you may find elements that you want the extractor to handle differently. Consider the following examples:

Unrecognized synonyms. Suppose you find several concepts you consider to be synonymous, such as smart, intelligent, bright, and knowledgeable, and they all appear as individual concepts in the extracted results. You could create a synonym definition in which intelligent, bright, and knowledgeable are all grouped under the target concept smart. Doing so would group all of these together with smart, and the global frequency count would be higher as well. For more information, see "Adding Synonyms" on p. 150.

Mistyped concepts. Suppose that the concepts in your extracted results appear in one type and you would like them to be assigned to another. Or imagine that you find 15 vegetable concepts in your extracted results and you want them all to be added to a new type called Vegetable. Keep in mind that concepts that are unrecognized by any of the dictionaries are automatically assigned to the Unknown type. You can add concepts to an existing type or to a new type. For more information, see "Adding Concepts to Types" on p. 152.

Insignificant concepts. Suppose that you find a concept that was extracted and has a very high global frequency count; that is, it is found many times in your documents or records. However, you consider this concept to be insignificant to your analysis. You can exclude it from extraction. For more information, see "Excluding Concepts from Extraction" on p. 154.

Incorrect matchings. Suppose that in reviewing the records or documents that contain a certain concept, you discover that two words were incorrectly grouped together, such as faculty and facility. This match may be due to an internal algorithm, referred to as fuzzy grouping, that temporarily ignores double or triple consonants and vowels in order to group common misspellings. You can add these words to a list of word pairs that should not be grouped. For more information, see "Fuzzy Grouping" in Chapter 18 on p. 266.

Unextracted concepts. Suppose that you expect to find certain concepts extracted but notice that a few words or phrases were not extracted when you review the document or record text. Often these words are verbs or adjectives that you are not interested in. However, sometimes you do want to use a word or phrase that was not extracted as part of a category definition. To extract the concept, you can force a term into a type dictionary. For more information, see "Forcing Words into Extraction" on p. 155.

Many of these changes can be performed directly from the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box by selecting one or more elements and right-clicking your mouse to access the context menus.

After making your changes, the background color of the Extracted Results, Patterns, and Clusters panes changes to show that you need to reextract to view your changes. For more information, see "Extracting Data" on p. 142. If you are working with larger datasets, it may be more efficient to reextract after making several changes rather than after each change.

Note: You can view the entire set of editable linguistic resources used to produce the extraction results in the Resource Editor view (View > Resource Editor). These resources appear in the form of libraries and dictionaries in this view. You can customize the concepts and types directly within the libraries and dictionaries. For more information, see "Working with Libraries" in Chapter 16 on p. 229.


Adding Synonyms

Synonyms associate two or more words that have the same meaning. Synonyms are often also used to group terms with their abbreviations or to group commonly misspelled words with the correct spelling. By using synonyms, the global frequency for the target concept is greater, which makes it far easier to discover similar information that is presented in different ways in your text data.

The linguistic resource templates and libraries delivered with the product contain many predefined synonyms. However, not every possible synonym is included. If you discover unrecognized synonyms, you can define them so that they will be recognized the next time you extract.

The first step is to decide what the target, or lead, concept will be. The target concept is the word or phrase under which you want to group all synonym terms in the final results. During extraction, the synonyms are grouped under this target concept. The second step is to identify all of the synonyms for this concept. The target concept is substituted for all synonyms in the final extraction. A term must be extracted to be a synonym. However, the target concept does not need to be extracted for the substitution to occur. For example, if you want intelligent to be replaced by smart, then intelligent is the synonym and smart is the target concept.

If you create a new synonym definition, a new target concept is added to the dictionary. You must then add synonyms to that target concept. Whenever you create or edit synonyms, these changes are recorded in synonym dictionaries in the Resource Editor. If you want to view the entire contents of these synonym dictionaries or if you want to make a substantial number of changes, you may prefer to work directly in the Resource Editor. For more information, see "Substitution Dictionaries" in Chapter 17 on p. 253.

Any new synonyms will automatically be stored in the first library listed in the library tree in the Resource Editor view; by default, this is the Local Library.
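The effect of a synonym definition can be pictured with a small Python sketch. The dictionary below is a hypothetical definition, not a delivered resource; each synonym is replaced by its target concept, so the target's global frequency grows:

SYNONYMS = {"intelligent": "smart", "bright": "smart", "knowledgeable": "smart"}

def apply_synonyms(extracted_terms):
    # Substitute the target concept for each defined synonym.
    return [SYNONYMS.get(term, term) for term in extracted_terms]

extracted = ["smart", "intelligent", "bright", "helpful", "knowledgeable"]
print(apply_synonyms(extracted).count("smart"))   # 4, versus 1 before the definition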

Note: If you look for a synonym definition and cannot find it through the context menus or directly in the Resource Editor, a match may result from an internal fuzzy grouping technique. For more information, see "Fuzzy Grouping" in Chapter 18 on p. 266.

To Create a New Synonym

E In either the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box, select the concept(s) for which you want to create a new synonym.

E Right-click to open the context menu.

E Select Add to Synonym > New Synonym. The Create Synonym dialog box opens.


Figure 9-12. Create Synonym dialog box

E Enter a target concept in the Target text box. This is the concept under which all of the synonyms will be grouped.

E If you want to add more synonyms, enter them in the Synonyms list box. Use the global separator to separate each synonym term. For more information, see "Options: Session Tab" in Chapter 8 on p. 131.

E Click OK to apply your changes. The dialog box closes and the Extracted Results pane background color changes, indicating that you need to reextract to see your changes. If you have several changes, make them before you reextract.

To Add to a Synonym

E In either the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box, select the concept(s) that you want to add to an existing synonym definition.

E Right-click to open the context menu.

E Select Add to Synonym. The menu displays a set of the synonyms with the most recently created at the top of the list. Select the name of the synonym to which you want to add the selected concept(s). If you see the synonym that you are looking for, select it, and the concept(s) selected are added to that synonym definition. If you do not see it, select More to display the All Synonyms dialog box.


Figure 9-13. All Synonyms dialog box

E In the All Synonyms dialog box, you can sort the list by natural sort order (order of creation) or in ascending or descending order. Select the name of the synonym to which you want to add the selected concept(s) and click OK. The dialog box closes, and the concepts are added to the synonym definition.

Adding Concepts to Types

Whenever an extraction is run, the extracted concepts are assigned to types in an effort to group terms that have something in common. Text Mining for Clementine is delivered with many built-in types. For more information, see "Built-in Types" in Chapter 17 on p. 244. Any extracted concepts that are not recognized by any of the types are automatically assigned to the Unknown type.

When reviewing your results, you may find some concepts that appear in one type that you would like to be assigned to another. Or you may find that a group of words really belongs in a new type by itself. In these cases, you would want to reassign the concepts to another type or create a new type altogether.

For example, suppose that you are working with survey data relating to automobiles and you are interested in categorizing by focusing on different areas of the vehicles. You could create a type called Dashboard to group all of the concepts relating to gauges and knobs found on the dashboard of the vehicles. Then you could assign concepts such as gas gauge, heater, radio, and odometer to that new type.

In another example, suppose that you are working with survey data relating to universities and colleges and the extraction typed Johns Hopkins (the university) as a Person type rather than as an Organization type. In this case, you could add this concept to the Organization type.

Whenever you create a type or add concepts as terms to a type, these changes are recorded in type dictionaries within your linguistic resource libraries in the Resource Editor. If you want to view the contents of these libraries or make a substantial number of changes, you may prefer to work directly in the Resource Editor. For more information, see "Adding Terms" in Chapter 17 on p. 247.


To Create a New Type

E In either the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box, select the concepts for which you want to create a new type.

E Right-click to open the context menu.

E Select Add to Type > New Type. The Type Properties dialog box opens.

Figure 9-14. Type Properties dialog box

E Enter a new name for this type in the Name text box and make any changes to the other fields. For more information, see "Creating Types" in Chapter 17 on p. 245.

E Click OK to apply your changes. The dialog box closes and the Extracted Results pane background color changes, indicating that you need to reextract to see your changes. If you have several changes, make them before you reextract.

To Add a Concept to a Type

E In either the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box, select the concept(s) that you want to add to an existing type.

E Right-click to open the context menu.

E Select Add to Type. The menu displays a set of the types with the most recently created at the top of the list. Select the type name to which you want to add the selected concept(s). If you see the type name that you are looking for, select it, and the concept(s) selected are added to that type. If you do not see it, select More to display the All Types dialog box.


Figure 9-15. All Types dialog box

E In the All Types dialog box, you can sort the list by natural sort order (order of creation) or in ascending or descending order. Select the name of the type to which you want to add the selected concept(s) and click OK. The dialog box closes, and they are added as terms to the type.

Excluding Concepts from Extraction

When reviewing your results, you may occasionally find concepts that you did not want extracted or used by any automated classification techniques. In some cases, these concepts have a very high frequency count and are completely insignificant to your analysis. In this case, you can mark a concept as a term to be excluded from the final extraction. Typically, the terms you add to this list are fill-in words or phrases used in the text for continuity but that do not add anything important to the text and may clutter the extraction results. By adding terms to the exclude dictionary, you can make sure that they are never extracted.

By excluding terms, all variations of the excluded term disappear from your extraction results the next time that you extract. If this term already appears as a concept in a category, it will remain in the category with a zero count after reextraction.

When you exclude a term, these changes are recorded in an exclude dictionary in the Resource Editor. If you want to view all of the exclude definitions and edit them directly, you may prefer to work directly in the Resource Editor. For more information, see "Exclude Dictionaries" in Chapter 17 on p. 258.

To Exclude a Term

E In either the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box, select the concept(s) that you want to exclude from the extraction.

E Right-click to open the context menu.

E Select Exclude from Extraction. The concept is added as a term to the exclude dictionary in the Resource Editor, and the Extracted Results pane background color changes, indicating that you need to reextract to see your changes. If you have several changes, make them before you reextract.


Note: Any words that you exclude will automatically be stored in the first library listed in the library tree in the Resource Editor; by default, this is the Local Library.

Forcing Words into Extraction

When reviewing the text data in the Data pane after extraction, you may discover that some words or phrases were not extracted. Often, these words are verbs or adjectives that you are not interested in. However, sometimes you do want to use a word or phrase that was not extracted as part of a category definition. If you would like to have these words and phrases extracted, you can force a term into a type library. For more information, see "Forcing Terms" in Chapter 17 on p. 250.

Marking a term in a dictionary as forced is not foolproof. By this, we mean that even though you have explicitly added a term to a dictionary, there are times when it may not be present in the Extracted Results pane after you have reextracted, or it does appear but not exactly as you have declared it. Although this occurrence is rare, it can happen when a word or phrase was already extracted as part of a longer phrase. During extraction, words are broken down into parts of speech (nouns, verbs, adjectives, prepositions, etc.). Part of the extraction process involves comparing word sequences with the hard-coded, part-of-speech patterns.


Chapter 10
Categorizing Text Data

In the Categories and Concepts view, you can create categories that represent, in essence, higher-level concepts or topics that capture the key ideas, knowledge, and attitudes expressed in the text. Categories are made up of a set of descriptors, such as concepts, types, and rules. Together, these descriptors are used to identify whether or not a record or document belongs to a given category. The text within a document or record can be scanned to see whether any text matches a descriptor. If a match is found, the document/record is assigned to that category. This process is called categorization. To be useful, a category should also be easily described by a short phrase or label that captures its essential meaning.

Categories can be created automatically using the product's robust set of automated techniques, manually using additional insight you may have regarding the data, or a combination of both. However, you can only create categories manually or fine-tune them through the interactive workbench. For more information, see "Text Mining Node: Model Tab" in Chapter 3 on p. 32.

You can work with, build, and visually explore your categories using the data presented in the four panes, each of which can be hidden or shown by selecting its name from the View menu.

Categories pane. You can build and manage your categories in this pane. For more information, see "The Categories Pane" on p. 158.

Extracted Results pane. You can explore and work with the extracted concepts and types in this pane. For more information, see "Extracted Results: Concepts and Types" in Chapter 9 on p. 139.

Visualization pane. You can visually explore your categories and how they interact in this pane. For more information, see "Category Graphs and Charts" in Chapter 13 on p. 195.

Data pane. You can explore and review the text contained within documents and records that correspond to selections in this pane. For more information, see "The Data Pane" on p. 161.


Figure 10-1. Categories and Concepts view

In order to categorize your records or documents, you need to choose the techniques and methods with which you will create the definitions for your categories. Categories can be created in several different ways. For example, categories can be created using automated classification techniques, which use extracted concepts and types to generate categories. You can also create category definitions manually.

Each of the techniques and methods is well suited for certain types of data and situations, but often it will be helpful to combine techniques in the same analysis to capture the full range of documents or records. And in the course of categorization, you may see other changes to make to the linguistic resources.

The Categories Pane

The Categories pane is the area in which you can build and manage your categories. This pane is located in the upper left corner of the Categories and Concepts view. After extracting the concepts and types from your text data, you can begin building categories automatically using classification techniques (semantic networks, concept inclusion, etc.) or manually. For more information, see "Building Categories" on p. 163.


Figure 10-2. Categories pane without categories and with categories

Each time a category definition is created or updated, the documents or records are scanned automatically to see whether any text corresponds to a descriptor in the category definition. If a match is found, the document or record is assigned to that category. This process is called categorization. The end result is that most, if not all, of the documents or records are assigned to one or more categories based on the category definitions you created.

This pane presents each category name (Category), the number of descriptors that make up its definition (Descriptors), as well as the number of documents or records (Docs) that are categorized into that category.

When no categories exist, the table still contains two rows. The top row, called All Documents, is the total number of documents/records. A second row, called Uncategorized, shows the number of documents/records that have yet to be categorized.

For each category in the pane, a small yellow bucket icon precedes the category name. If you double-click a category or choose View > Category Definitions in the menus, the Category Definitions dialog box opens and presents all of the elements, called descriptors, that make up its definition, such as concepts, types, patterns, and rules. For more information, see "Category Definitions" on p. 160.
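The scanning step can be pictured with a minimal Python sketch. Descriptors here are plain words, and the category names are invented for illustration; in the product, descriptors can be concepts, types, patterns, or conditional rules:

CATEGORIES = {
    "Price":   {"cost", "price", "fee"},
    "Comfort": {"seat", "legroom"},
}

def categorize(record):
    # A record is assigned to every category with at least one matching descriptor.
    words = set(record.lower().split())
    matched = {name for name, descriptors in CATEGORIES.items() if words & descriptors}
    return matched or {"Uncategorized"}

for record in ["The price is fair", "Great legroom and seat", "Arrived late"]:
    print(record, "->", categorize(record))
# The price is fair -> {'Price'}
# Great legroom and seat -> {'Comfort'}
# Arrived late -> {'Uncategorized'}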

Scoring Categories

Most of the time when you are creating categories, the number of documents or records is known. However, whenever you edit a category such that some but not all of the content is added or deleted, the number of documents or records is no longer known. In this case, an icon with two arrows appears in the Docs column. In order to update this column with actual document/record counts, click Score on the pane toolbar. Scoring will recalculate the number of documents and records that are in your category. Keep in mind that the scoring process can take some time when you are working with larger datasets.

Displaying in Data and Visualization Panes

When you select a row in the table, you can click the Display button to refresh the Visualization and Data panes with information corresponding to your selection. If a pane is not visible, clicking Display will cause the pane to appear.


Refining Your Categories

Categorization may not yield perfect results for your data on the first try, and there may well be categories that you want to delete or combine with other categories. You may also find, through a review of the extraction results, that there are some categories that were not created that you would find useful. If so, you can make manual changes to the results to fine-tune them for your particular survey. For more information, see "Managing and Refining Categories" on p. 174.

Category Definitions

Each category is defined by one or more descriptors. Descriptors are concepts, types, and patterns, as well as conditional rules that have been used to define a category. If you want to see the descriptors that make up a given category, you can double-click the category name or you can select the category in the Categories pane and open the Category Definitions dialog box (View > Category Definitions). If you select multiple categories and open the Category Definitions dialog box, the last category having focus is opened.

Figure 10-3. Category Definitions dialog box

For example, when you build categories automatically using classification techniques such as concept inclusion or semantic networks, the techniques will use concepts and types as the descriptors to create your categories. If you extract TLA patterns, you can also add those patterns or parts of those patterns as category descriptors. For more information, see "Exploring Text Link Analysis" in Chapter 12 on p. 187. And if you build clusters, you can add a cluster's descriptors to new or existing categories. Lastly, you can manually create conditional rules to use as descriptors in your categories. For more information, see "Using Conditional Rules" on p. 174.

Note: There is no Cancel button in this dialog box. Any changes you make are immediately applied to your category.


Column Descriptions

Icons are shown so that you can easily identify each descriptor.

Table 10-1. Columns and descriptor icons

Column        Description
Descriptors   The name of the descriptor, preceded by an icon indicating what kind of descriptor it is: a concept, a type, a pattern, or a conditional rule.
Type          Shows the type or types to which the descriptor belongs. If the descriptor is a conditional rule, no type name is shown in this column.

Note: You can also add concepts to a type, as synonyms, or as exclude items using the context menus.

The Data Pane

As you create categories, there may be times when you might want to review some of the text data you are working with. For example, if you create a category in which 640 documents are categorized, you might want to look at some or all of those documents to see what text was actually written. You can review records or documents in the Data pane, which is located in the lower right. If it is not visible by default, choose View > Panes > Data from the menus.

The Data pane presents one row per document or record corresponding to a selection in the Categories pane, Extracted Results pane, or the Category Definitions dialog box, up to a certain display limit. By default, the number of documents or records shown in the Data pane is limited in order to allow you to see your data more quickly. However, you can adjust this in the Options dialog box. For more information, see "Options: Session Tab" in Chapter 8 on p. 131.

Displaying and Refreshing the Data Pane

The Data pane does not refresh its display automatically because, with larger datasets, automatic data refreshing could take some time to complete. Therefore, whenever you make a selection in another pane in this view or in the Category Definitions dialog box, you can click Display to refresh the contents of the Data pane.

Text Documents or Records

If your text data is in the form of records and the text is relatively short in length, the text field in the Data pane displays the text data in its entirety. However, when working with records and larger datasets, the text field column shows a short piece of the text and opens a Text Preview pane to the right to display more or all of the text of the record you have selected in the table. If your text data is in the form of individual documents, the Data pane shows the document's filename. When you select a document, the Text Preview pane opens with the selected document's text.


Figure 10-4. Data pane with Text Preview pane

Colors and Highlighting

Whenever you select a concept or category in another pane and display the data, the concepts and descriptors found in those documents or records are highlighted in color to help you easily identify them in the text. The color coding corresponds to the types to which the concepts belong. You can also hover your mouse over color-coded items to display the concept under which the item was extracted and the type to which it was assigned. Any text that was not extracted appears in black. Typically, these unextracted words are connectors (and or with), pronouns (me or they), and verbs (is, have, or take).

Data Pane Columns

You can show or hide columns in the Data pane. For more information, see "Adding Columns to the Data Pane" on p. 162.

Adding Columns to the Data Pane

The Data pane can contain multiple columns, but the text field column is always shown. The following columns may be available for display:

"Text field name" (#)/Documents. Adds a column for the text data from which concepts and types were extracted. If your data is in documents, the column is called Documents and only the document filename or full path is visible. To see the text for those documents, you must look in the Text Preview pane. The number of rows in the Data pane is shown in parentheses after this column name. There may be times when not all documents or records are shown, due to a limit in the Options dialog box used to speed loading. If the maximum is reached, the number will be followed by - Max. For more information, see "Options: Session Tab" in Chapter 8 on p. 131.


Categories. Adds a column listing the categories to which each document or record belongs. Whenever this column is shown, refreshing the Data pane may take a bit longer, so as to show the most up-to-date information.

To Add Other Columns to the Data Pane

E From the menus, choose View > Display Columns, and then select the column that you want to display in the Data pane. The new column appears in the pane.

Building Categories

You can categorize your documents or records automatically using classification techniques or manually by creating empty categories and then adding descriptors to the category. Through the Build Categories dialog box (Categories > Build Categories), you can apply the automated classification techniques. After you have applied a technique, the concepts and types that were grouped into a category are still available for classification with other techniques. This means that you may see a concept in multiple categories.

The Build Categories dialog box has two tabs on which you can define the classification techniques and limits:

Techniques tab. For more information, see "Build Categories: Techniques Tab" on p. 164.

Limits tab. For more information, see "Build Categories: Limits Tab" on p. 166.

Because every dataset is unique, the number of methods and the order in which you apply them may change over time. Since your text mining goals may be different from one set of data to the next, you may need to experiment with the different techniques to see which one produces the best results for the given text data. None of the automatic techniques will perfectly categorize your data; therefore, we recommend finding and applying one or more automatic techniques that work well with your data.

After applying these techniques, review the resulting categories. You can then use manual techniques to make minor adjustments, remove any misclassifications, or add records or words that may have been missed. Also, since using different techniques may produce redundant categories, you can also merge or delete categories. For more information, see "Managing and Refining Categories" on p. 174.

The automated classification techniques will not merge new categories with preexisting categories. For example, if you already have a category called MyCategory and one of the techniques creates a category with the same name, a unique name is given to the new category by adding a numerical suffix, as in MyCategory_1. The resulting categories are automatically named. If you want to change a name, you can rename your categories. For more information, see "Creating New or Renaming Categories" on p. 173.


Tips on Category-to-Document Ratio

The categories into which the documents and records are assigned are often not mutually exclusive in qualitative text analysis, for at least two reasons:

First, a general rule of thumb says that the longer the text document or record, the more distinct the ideas and opinions expressed. Thus, the chances that a document or record can be assigned multiple categories are greatly increased.

Second, there are often various ways to group and interpret text documents or records that are not logically separate. In the case of a survey with an open-ended question about the respondent's political beliefs, we could create categories such as liberal/conservative or Republican/Democrat, as well as more nuanced categories, such as socially liberal, fiscally conservative, and so forth. These categories do not have to be mutually exclusive and exhaustive.

Tips on Number of Categories to Create

Category creation should flow directly from the data: as you see something interesting with respect to your data, you can create a category to represent that information. In general, there is no recommended upper limit on the number of categories that you create. However, it is certainly possible to create too many categories to be manageable. Two principles apply:

Category frequency. For a category to be useful, it has to contain a minimum number of documents or records. One or two documents may include something quite intriguing, but if they are one or two out of 1,000 documents, the information they contain very likely isn't frequent enough in the population to be practically useful.

Complexity. The more categories you create, the more information you have to review and summarize after completing the analysis. However, too many categories, while adding complexity, may not add useful detail.

Unfortunately, there are no rules for determining how many categories are too many or for determining the minimum number of cases per category. You will have to make such determinations based on the demands of your particular situation.

We can, however, offer advice about where to start. Although the number of categories should not be excessive, in the early stages of the analysis it is better to have too many rather than too few categories. It is easier to group categories that are relatively similar than to split off cases into new categories, so a strategy of working from more to fewer categories is usually the best practice. Given the iterative nature of text mining and the ease with which it can be accomplished with this software program, erring on the high side is acceptable at the start.

Build Categories: Techniques Tab

Using this dialog box, you can automatically create categories by using either concept-grouping techniques or frequency. The concept-grouping techniques include concept derivation, concept inclusion, semantic networks, and co-occurrence rules. These techniques can be used alone or in combination to create categories. You can also create categories based on frequently occurring types.


On this tab, you can select which techniques you want to use to create your categories. You can also define limits on another tab. For more information, see "Build Categories: Limits Tab" on p. 166. You can access the Build Categories dialog box through the menus (Categories > Build Categories).

You can either select a set of concept grouping techniques or classify based on frequency.

Figure 10-5. Build Categories dialog box: Techniques tab

Concept Grouping Techniques

Each of the techniques is well suited to certain types of data and situations, but often it is helpful to combine techniques in the same analysis to capture the full range of documents or records. You can exclude concepts from being grouped together by any of these techniques by defining them as antilinks. For more information, see "Link Exceptions" in Chapter 18 on p. 267.

Concept derivation. This technique creates categories by taking a concept and finding other concepts that are related to it by analyzing whether any of the concept components are morphologically related. For example, the concept opportunities to advance would be grouped with the concepts opportunity for advancement and advancement opportunity. This technique is very useful for identifying synonymous multiword concepts, since the concepts in each category generated are synonyms or closely related in meaning. It also works with data of varying lengths and generates a smaller number of compact categories. For more information, see "Concept Derivation" on p. 168.

Concept inclusion. This technique creates categories by taking a concept and finding other concepts that include it. This technique works with data of varying lengths and generates a larger number of compact categories. For example, seat would be grouped with safety seat, seat belt, and infant seat carrier. This technique, when used in combination with semantic networks, can produce more interesting links. For more information, see "Concept Inclusion" on p. 169.
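The inclusion test itself can be pictured as a simple word-set check, as in this sketch (an approximation of the idea; the product's matching is more sophisticated):

def includes(longer, shorter):
    # A longer concept "includes" a shorter one if every word of the shorter
    # concept appears among the longer concept's words.
    return set(shorter.split()) <= set(longer.split()) and longer != shorter

concepts = ["seat", "safety seat", "seat belt", "infant seat carrier", "engine"]
print([c for c in concepts if includes(c, "seat")])
# ['safety seat', 'seat belt', 'infant seat carrier']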

Semantic networks. This technique creates categories by grouping concepts based on an extensive index of word relationships. This technique applies only to English-language text. However, it can be less helpful when the text contains a large amount of domain-specific terminology. In the early stages of creating categories, you may want to use this technique by itself to see what sort of categories it produces. To help you produce better results, you can choose from two profiles for this technique: Wider and Narrow. For more information, see "Semantic Networks" on p. 170.

Co-occurrence rules. This technique creates one category with each co-occurrence rule generated. A co-occurrence rule is a type of conditional rule that groups words that occur together often within records, since this generally signals a relationship between them. For example, if many records include the words apples and oranges, these concepts could be grouped into a co-occurrence rule. The technique looks for concepts that tend to appear together in documents. Two concepts strongly co-occur if they frequently appear together in a set of documents and rarely separately in any of the other documents. This technique can produce good results with larger datasets with at least several hundred documents or records. For more information, see "Co-occurrence Rules" on p. 172.
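One way to picture "frequently together, rarely apart" is a simple overlap ratio, as in the sketch below. This Jaccard-style measure is only illustrative; it is not the statistic the product uses:

def cooccurrence_strength(records, a, b):
    # Records are sets of extracted concepts. The ratio is high only when the
    # two concepts usually appear together and rarely apart.
    with_a = {i for i, r in enumerate(records) if a in r}
    with_b = {i for i, r in enumerate(records) if b in r}
    both, either = len(with_a & with_b), len(with_a | with_b)
    return both / either if either else 0.0

records = [{"apples", "oranges"}, {"apples", "oranges"}, {"apples"}, {"bananas"}]
print(round(cooccurrence_strength(records, "apples", "oranges"), 2))   # 0.67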

Create one category for each of the top [n] types. If you do not choose to use Concept Grouping techniques, you can create categories based on type frequency. Frequency represents the number of documents or records containing concepts from the extracted type in question. This technique allows you to get one category for each frequently occurring type. It works best when the data contain straightforward lists or simple, one-word concepts. Applying this technique to types allows you to obtain a quick view regarding the broad range of documents and records present. Please note that the Unknown type is not included here and will not be used to create a category.

Build Categories: Limits Tab

On this tab, you can set some limits that affect the categories generated by the Concept Grouping techniques only. These limits do not apply to the Frequency technique. These limits apply only to what is produced during this application of the techniques; they do not include concept counts from other categories, if any exist. You can also select the techniques on another tab. For more information, see “Build Categories: Techniques Tab” on p. 164. You can access the Build Categories dialog box through the menus (Categories > Build Categories).

Figure 10-6
Build Categories dialog box: Limits tab

Maximum number of categories to create. Use this option to limit the maximum number of categories that can be generated.

Apply techniques to. Choose one of the following options to determine which concepts will be used as input to the selected techniques.

Top concepts (based on doc. count). Use this option to apply the concept grouping techniques only to the top number of concepts specified here. The top concepts are ranked by the number of records or documents in which each concept appears.
Top percentage of concepts (based on doc. count). Use this option to apply the concept grouping techniques only to the top percentage of concepts specified here. The top concepts are ranked by the number of records or documents in which each concept appears.
All concepts. Use this option to apply the concept grouping techniques to all extracted concepts.

Maximum number of categories per concept. Use this option to limit the number of categories into which a given concept can be assigned when categories are generated by this dialog box. For example, if you set the maximum number of categories in which a concept can be used to 2, then a given concept can be placed in only up to two different category definitions.

Minimum number of concepts per category. Use this option to limit smaller categories by setting the minimum number of concepts that have to be grouped in order to form a category. Categories with too few concepts are often too narrow to be of value.

Maximum number of concepts per category. Use this option to limit broader categories by setting the maximum number of concepts above which a category will not be formed. Categories with too many concepts are often too broad to be interesting.


Maximum number of concepts per co-occurrence rule. Use this option to define the maximum number of concepts that can be grouped together into a given rule by this technique. By default, the maximum is set to 3. This limit of 3 means that a concept occurring with one or two other concepts can be grouped into rules. For more information, see “Co-occurrence Rules” on p. 172.

Minimum link percentage for grouping. This option applies globally to all techniques. You can enter a percentage from 0 to 100. If you enter 0, all possible results are produced. The lower the value, the more results you will get—however, these results may be less reliable or relevant. The higher the value, the fewer results you will get—however, these results will be less noisy and are more likely to be significantly linked or associated with each other.

Maximum number of docs to use for calculating co-occurrence rules. By default, co-occurrences are calculated using the entire set of documents or records. However, in some cases, you may want to speed up the category creation process by limiting the number of documents or records used. To use this option, select the check box to its left and enter the maximum number of documents or records to use.

Concept Derivation

The concept derivation algorithm attempts to group concepts by looking at the endings (suffixes) of each component in a concept and finding other concepts that could be derived from them. The idea is that when words are derived from each other, they are likely to share or be close in meaning. In order to identify the endings, internal language-specific rules are used.

You can use concept derivation on any sort of text. By itself, it produces fairly few categories, and each category tends to contain few concepts. The concepts in each category are either synonyms or situationally related. You may find it helpful to use this algorithm even if you are building categories manually; the synonyms it finds may be synonyms of those concepts you are particularly interested in.

Term Componentization and De-inflecting

When the concept derivation or the concept inclusion techniques are applied, the terms are first broken down into components (words), and then the components are de-inflected. When a technique is applied, the concepts and their associated terms are loaded and split into components based on separators, such as spaces, hyphens, and apostrophes. For example, the term system administrator is split into components such as {administrator, system}.

However, some parts of the original term may not be used; these are referred to as ignorable components. In English, some of these ignorable components might include a, and, as, by, for, from, in, of, on, or, the, to, and with. For example, the term examination of the data has the component set {data, examination}, and both of and the are considered ignorable. Additionally, component order is not taken into account in a component set. In this way, the following three terms could be equivalent: cough relief for child, child relief from a cough, and relief of child cough, since they all have the same component set {child, cough, relief}. Each time a pair of terms is identified as being equivalent, the corresponding concepts are merged to form a new concept that references all of the terms.


Additionally, since the components of a term may be inflected, language-specific rules are applied internally to identify equivalent terms regardless of inflectional variation, such as plural forms. In this way, the terms level of support and support levels can be identified as equivalent since the de-inflected singular form would be level.
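
To make the componentization and equivalence test concrete, here is a minimal sketch in Python. It is an illustration only, not the product's implementation: the ignorable-word list comes from the example above, and the plural-stripping rule is a deliberately crude stand-in for the internal, language-specific de-inflection rules.

# Minimal sketch of term componentization and equivalence testing.
IGNORABLE = {"a", "and", "as", "by", "for", "from", "in", "of", "on", "or", "the", "to", "with"}

def de_inflect(word):
    """Very rough singularization used only for this illustration."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def component_set(term):
    """Split a term on separators, drop ignorable words, de-inflect, ignore order."""
    words = term.replace("-", " ").replace("'", " ").lower().split()
    return frozenset(de_inflect(w) for w in words if w not in IGNORABLE)

terms = ["cough relief for child", "child relief from a cough", "relief of child cough"]
sets = {t: component_set(t) for t in terms}
# All three terms map to frozenset({'child', 'cough', 'relief'}), so they are
# treated as equivalent and their concepts would be merged.
print(sets)
print(len(set(sets.values())) == 1)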

How Concept Derivation Works

After terms have been componentized and de-inflected (see previous section), the concept derivation algorithm analyzes the component endings, or suffixes, to find the component root and then groups the concepts with other concepts that have the same or similar roots. The endings are identified using a set of linguistic derivation rules specific to the text language. For example, there is a derivation rule for English language text that states that a concept component ending with the suffix ical might be derived from a concept having the same root stem and ending with the suffix ic. Using this rule (and the de-inflection), the algorithm would be able to group the concepts epidemiologic study and epidemiological studies.

Since terms are already componentized and the ignorable components (for example, in and of) have been identified, the concept derivation algorithm would also be able to group the concept studies in epidemiology with epidemiological studies.

The set of component derivation rules has been chosen so that most of the concepts grouped by this algorithm are synonyms: the concepts epidemiologic studies, epidemiological studies, and studies in epidemiology are all synonyms. To increase completeness, there are some derivation rules that allow the algorithm to group concepts that are situationally related. For example, the algorithm can group concepts such as empire builder and empire building.
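
The suffix rules can be pictured with a single illustrative rule. The sketch below applies only the ical to ic rule to already componentized concepts; the shipped rule set is far larger and language specific, so this is not the actual implementation.

# Minimal sketch of suffix-based concept derivation using one illustrative rule.
def normalize_component(word):
    if word.endswith("ical"):
        return word[:-4] + "ic"   # epidemiological -> epidemiologic
    return word

def derivation_key(component_set):
    """Reduce every component to a root form so derived variants collide."""
    return frozenset(normalize_component(w) for w in component_set)

a = frozenset({"epidemiologic", "study"})      # from "epidemiologic studies" (de-inflected)
b = frozenset({"epidemiological", "study"})    # from "epidemiological studies"

print(derivation_key(a) == derivation_key(b))  # True: grouped by the ical -> ic rule
# The real rule set also relates forms such as "epidemiology" to "epidemiologic",
# which this one-rule sketch does not attempt to cover.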

Limits for Concept Derivation

When using the concept derivation technique, you can fine-tune many settings that influence the rules on the Limits tab of the Build Categories dialog box. For example, you could change the Minimum link percentage for grouping. This option influences the number and quality of the results you will get. The higher the value, the fewer results you will get—however, these results will be less noisy and are more likely to be significantly linked or associated with each other. For more information, see “Build Categories: Limits Tab” on p. 166.

Note: You can exclude concepts from being grouped together by defining them as antilinks, or you can exclude entire types of concepts. For more information, see “Classification Exceptions” in Chapter 18 on p. 266.

Concept Inclusion

The concept inclusion algorithm attempts to group concepts into categories using lexical series algorithms, which identify concepts included in other concepts. The idea is that when the words in a concept are a subset of another concept, it reflects an underlying semantic relationship. Inclusion is a powerful technique that can be used with any type of text.

Concept inclusion may give better results when the documents or records contain lots of domain-specific terminology or jargon. This is especially true if you have tuned the dictionaries beforehand so that the special terms are extracted and grouped appropriately (with synonyms).


How Concept Inclusion Works

Before the concept inclusion algorithm is applied, the terms are componentized and de-inflected. For more information, see “Concept Derivation” on p. 168. Next, the concept inclusion algorithm analyzes the component sets. For each component set, the algorithm looks for another component set that is a subset of the first component set.

For example, if you have the concept continental breakfast, which has the component set {breakfast, continental}, and you have the concept breakfast, which has the component set {breakfast}, the algorithm would conclude that continental breakfast is a kind of breakfast and group these together.

In a larger example, if you have the concept seat in the Extracted Results pane and you apply this algorithm, then concepts such as safety seat, leather seat, seat belt, seat belt buckle, infant seat carrier, and car seat laws would also be grouped in that category.

Since terms are already componentized and the ignorable components (for example, in and of) have been identified, the concept inclusion algorithm would recognize that the concept advanced spanish course includes the concept course in spanish.
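
A rough sketch of the inclusion test, under the simplifying assumption that componentization and de-inflection have already produced the sets shown: one concept is grouped under another when its component set is a subset of the other's. This is an illustration, not the product's code.

# Minimal sketch of the concept inclusion test on componentized, de-inflected terms.
concepts = {
    "seat": {"seat"},
    "safety seat": {"safety", "seat"},
    "seat belt": {"belt", "seat"},
    "seat belt buckle": {"belt", "buckle", "seat"},
    "infant seat carrier": {"carrier", "infant", "seat"},
    "car seat laws": {"car", "law", "seat"},
    "leather seat": {"leather", "seat"},
}

def included_under(head, concepts):
    """Return every concept whose component set contains all of head's components."""
    head_set = concepts[head]
    return [name for name, comps in concepts.items()
            if name != head and head_set <= comps]

print(included_under("seat", concepts))
# ['safety seat', 'seat belt', 'seat belt buckle', 'infant seat carrier',
#  'car seat laws', 'leather seat']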

Limits for Concept Inclusion

Since it tends to create a large number of large categories, you may want to modify the default values on the Limits tab of the Build Categories dialog box in order to get all of the results. For more information, see “Build Categories: Limits Tab” on p. 166. For example, you can:

Increase the Maximum number of categories, since the concept inclusion technique often generates more than 20 categories.
Increase the Maximum number of concepts per category, since categories generated by this technique often contain more than 20 concepts.
If you find that there are too many categories, consider increasing the Minimum number of concepts per category.
Change the Minimum link percentage for grouping. The higher the value, the fewer results you will get—however, these results will be less noisy and are more likely to be significantly linked or associated with each other.

Note: You can exclude concepts from being grouped together by defining them as antilinks, or you can exclude entire types of concepts. For more information, see “Classification Exceptions” in Chapter 18 on p. 266.

Semantic Networks

In this release of Text Mining for Clementine, the semantic networks technique is only available for English language text. This technique creates categories using a built-in network of word relationships, which is based on WordNet. The coverage of the WordNet data used by the semantic network technique resembles that found in a good general dictionary. For this reason, this technique can produce very good results when the terms are well-known and are not too ambiguous. However, you should not expect the technique to find many links between highly technical/specialized concepts. When dealing with such concepts, you may find the concept inclusion and derivation techniques to be more useful.


How Semantic Networks Work

The idea behind the semantic network technique is to leverage known word relationships to create categories of synonyms or hyponyms. A hyponym is a concept that is a kind of a second concept, so that there is a hierarchical relationship, also known as an ISA relationship. For example, if animal is a concept, then cat and kangaroo are hyponyms of animal since they are sorts of animals.

In addition to synonym and hyponym relationships, the semantic network technique also examines part and whole links between any concepts from the Location type. For example, the technique will group the concepts normandy, provence, and france into one category because Normandy and Provence are parts of France.

Semantic networks begin by identifying the possible senses of each concept in the semantic network. When concepts are identified as synonyms or hyponyms, they are grouped into a single category. For example, the technique would create a single category containing the three concepts eating apple, dessert apple, and granny smith, since the semantic network contains the information that: 1) dessert apple is a synonym of eating apple, and 2) granny smith is a sort of eating apple (meaning it is a hyponym of eating apple).

Taken individually, many concepts, especially uniterms, are ambiguous. For example, the concept buffet can denote a sort of meal or a piece of furniture. If the set of concepts to be classified includes meal, furniture, and buffet, then the algorithm is forced to choose between grouping buffet with meal or with furniture. Be aware that in some cases the choices made by the algorithm may not be appropriate in the context of a particular set of records or documents.

The semantic network technique can outperform concept inclusion with two types of data. First, when you expect to have concepts that are related, and you are interested in these relationships, this method is ideal. Second, when the documents or records are longer and contain more complex phrases, this method can often capture this information.

Semantic networks work in conjunction with the other techniques. For example, suppose that you have selected both the semantic network and inclusion techniques and that the semantic network has grouped the concept teacher with the concept tutor (because a tutor is a kind of teacher). The inclusion algorithm can group the concept graduate tutor with tutor and, as a result, the two algorithms collaborate to produce an output category containing all three concepts: tutor, graduate tutor, and teacher.
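
The teacher/tutor example can be sketched as follows. The tiny relation table is invented for illustration and stands in for the WordNet-based network; the inclusion test reuses the subset idea from the previous section. This is a simplification, not product code.

# Minimal sketch of combining a toy semantic relationship (tutor ISA teacher)
# with the inclusion test, then merging the overlapping groups.
ISA = {"tutor": "teacher", "granny smith": "eating apple"}   # toy hypernym table

def semantic_group(concepts):
    """Group a concept with its hypernym when both were extracted."""
    return [{c, ISA[c]} for c in concepts if ISA.get(c) in concepts]

def inclusion_group(concepts):
    """Group multiword concepts with shorter concepts they contain."""
    return [{a, b} for a in concepts for b in concepts
            if a != b and set(a.split()) < set(b.split())]

extracted = {"teacher", "tutor", "graduate tutor"}
print(semantic_group(extracted))    # [{'tutor', 'teacher'}]
print(inclusion_group(extracted))   # [{'tutor', 'graduate tutor'}]
# Merging the two overlapping groups yields one category:
# {'teacher', 'tutor', 'graduate tutor'}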

Note: The semantic network technique is based on WordNet. Information on WordNet is available at http://www.cogsci.princeton.edu/~wn/doc.shtml. Be aware that in order to improve the quality of categories produced by the algorithm, a certain number of WordNet words and senses have been excluded.


Semantic Network Profiles and Limits

When you use the semantic network technique, you can select one of two profiles in order to provide you with more control over its application. The two profiles are:

Wider. This profile handles the more ambiguous concepts. It creates more categories but may group concepts into categories that are not closely linked for the context of your data. This profile is selected by default.
Narrow. This profile excludes very ambiguous concepts and focuses on the clearest relationships between concepts. It will tend to create fewer and smaller categories. The categories created will tend to be more coherent than those created by the Wider profile.

You can exclude concepts from being grouped together by defining them as antilinks, or you can exclude entire types of concepts. For more information, see “Classification Exceptions” in Chapter 18 on p. 266. However, a number of types are permanently excluded from the semantic networks technique since those types will not produce relevant results. They include <Positive>, <Positive Qualifier>, <Negative>, <Negative Qualifier>, <IP>, and other nonlinguistic types.

Important! Additionally, we recommend that you do not apply the fuzzy grouping option Accommodate spelling errors for a minimum root character limit of (defined on the Expert tab of the node or on the Settings tab of the Extract dialog box) when using this technique, since some false groupings can have a largely negative impact on the results.

Co-occurrence Rules

Co-occurrence rules enable you to discover and group concepts that are strongly related within the set of documents or records. The idea is that when concepts are often found together in documents and records, that co-occurrence reflects an underlying relationship that is probably of value in your category definitions. Creating co-occurrence rules is useful only with datasets with at least several hundred documents or records.

How Co-occurrence Rules Work

This technique scans the documents or records looking for two or more concepts that tend to appear together. Two or more concepts strongly co-occur if they frequently appear together in a set of documents or records and if they seldom appear separately in any of the other documents or records.

When co-occurring concepts are found, a conditional rule is formed. These rules consist of two or more concepts connected using the & Boolean operator. These rules are logical statements that will automatically classify a document or record into a category if the set of concepts in the rule all co-occur in that document or record.

For example, if the concepts peanut butter and jelly appear more often together than apart, they would be grouped into a concept co-occurrence rule.
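
Conceptually, the resulting rule behaves like the small sketch below. It is a simplification rather than the product's code: a record is assigned to the category only when every concept joined by & in the rule occurs in that record.

# Minimal sketch of how an & co-occurrence rule classifies records.
def matches_rule(record_concepts, rule):
    """rule is the set of concepts joined by &, e.g. {"peanut butter", "jelly"}."""
    return rule <= record_concepts

rule = {"peanut butter", "jelly"}
records = [
    {"peanut butter", "jelly", "sandwich"},   # matches the rule
    {"peanut butter", "bread"},               # does not match
    {"jelly", "toast"},                       # does not match
]

for i, concepts in enumerate(records, start=1):
    print(f"record {i}:", "in category" if matches_rule(concepts, rule) else "not in category")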


Limits for Co-occurrence Rules

If you are using the co-occurrence rule technique, you can fine-tune several settings that influence the resulting rules:

Minimum link percentage for grouping. This option is used to influence the number and quality of the results you will get. The higher the value, the fewer results you will get—however, these results will be less noisy and are more likely to be significantly linked or associated with each other.
Maximum number of concepts per co-occurrence rule. This option limits the number of co-occurring concepts that can be grouped together into a rule. By default, the maximum is set to 3. This limit means that a concept occurring with one or two other concepts can be grouped into rules.
Maximum number of docs to use for calculating co-occurrence rules. This option is used to speed up the categorization process by limiting the number of documents or records used.

Note: You can exclude concepts from being grouped together by defining them as antilinks, or you can exclude entire types of concepts. For more information, see “Classification Exceptions” in Chapter 18 on p. 266.

Creating New or Renaming Categories

You can create empty categories in order to add concepts and types into them. You can also rename your categories.

Figure 10-7
Category Name dialog box

To Create a New Empty Category

E Go to the categories pane.

E From the menus, choose Categories > New Empty Category. The Category Name dialog box opens.

E Enter a name for this category in the Category Name field.

E Click OK to accept the name and close the dialog box. The dialog box closes, and a new category name appears in the pane.

You can now begin adding to this category. For more information, see “Adding to Category Definitions” on p. 175.

To Rename a Category

E Select a category and choose Categories > Rename Category. The Category Name dialog box opens.


E Enter a new name for this category in the Category Name field.

E Click OK to accept the name and close the dialog box. The dialog box closes, and the new category name appears in the pane.

Using Conditional Rules

You can create categories in many ways. One of these ways is to define rules that express ideas such as: include all documents or records that contain the extracted concepts dog and cat in this category.

Conditional rules are statements that you can create to automatically classify documents or records into a category based on a logical expression using extracted concepts and types as well as the & Boolean operator. The ability to create these rules enhances coding precision, efficiency, and productivity by allowing you to layer your business knowledge onto the Text Mining for Clementine extraction technology to automate preexisting categories with precision.

Deleting Conditional Rules

If you no longer want a rule, you can delete it.

To Delete a Conditional Rule

E In the Descriptors table in the Category Definitions dialog box, select the rule.

E From the menus, choose Edit > Delete. The rule is deleted from the category.

Managing and Refining Categories

Once you create some categories, you will invariably want to look at them a little closer and make some adjustments. In addition to refining the linguistic resources, you should review your categories by looking for ways to combine or clean up their definitions, as well as checking some of the categorized documents or records. You can also review the documents or records in a category and make adjustments so that categories are defined in such a way that nuances and distinctions are captured. You can use the automated classification techniques to create your categories; however, you will surely want to perform a few tweaks to these definitions. After using a technique, a number of new categories appear in the window. You can then review the data in a category and make adjustments until you are comfortable with your category definitions. For more information, see “Category Definitions” on p. 160.

Here are some options for refining your categories:
Adding descriptors to a category definition.
Editing category definitions.
Merging categories together.
Moving categories.
Deleting categories.
Visualizing how your categories work together and making adjustments. For more information, see “Category Graphs and Charts” in Chapter 13 on p. 195.
Making changes to your linguistic resources and reextracting.

Adding to Category Definitions

After using automated techniques, you will most likely still have extracted results that were not used in any of the category definitions. You should review this list in the Extracted Results pane. If you find elements that you would like to move into a category, you can add them to an existing or new category.

To Add a Concept or Type to a Category

E From within the Extracted Results and Data panes, select the elements that you want to add to a new or existing category.

E From the menus, choose Categories > Add to Category. The menu presents a set of categories with the most recently created category at the top of the list. Select the category to which you want to add the selected elements.

If you see the category you are looking for, select its name, and the selected element(s) are added to its definition.
If you want to add the elements to a new category, select New Category. A new category appears in the category pane using the name of the first selected element.
If you do not see the category in the menu, select More to display the All Categories dialog box.

Figure 10-8
All Categories dialog box

Editing Category Definitions

Once you’ve created some categories, you can open each category to see all of the descriptors that make up its definition. Inside the Category Definitions dialog box, you can make a number of edits to your category definitions.


To Edit a Category

E Select the category you want to edit in the Categories pane.

E From the menus, choose View > Category Definitions. The Category Definitions dialog box opens.

Figure 10-9
Category Definitions dialog box

E Select the descriptor you want to edit and click the corresponding toolbar button.

The following table describes each toolbar button that allows you to edit your category definitions.

Table 10-2
Toolbar buttons and descriptions

Deletes the selected descriptors from the category.
Moves the selected descriptors to a new or existing category.
Moves the selected descriptors to a category in the form of an & conditional rule. For more information, see “Using Conditional Rules” on p. 174.
Moves each of the selected descriptors as its own new category.
Display: Updates what is displayed in the Data pane and the Visualization pane according to the selected descriptors.

Moving Categories

If you want to place a category into another category, you can move it.


To Move a Category

E In the categories pane, select a category or multiple categories that you would like to move into another category.

E From the menus, choose Categories > Move to Category. The menu presents a set of categories with the most recently created category at the top of the list. Select the name of the category into which you want to move the selected categories.

If you see the name you are looking for, select it, and the selected elements are added to that category.
If you do not see it, select More to display the All Categories dialog box, and select the category from the list.

Figure 10-10
All Categories dialog box

Merging or Combining Categories

If you want to combine two or more categories, you can merge them. When you are merging categories, a new category with a generic name is created, and all of the concepts, types, and patterns used in the definitions of the categories you are merging are moved into this new category. You can later rename this category by editing the category properties.

To Merge a Category or Part of a Category

E In the categories pane, select the elements you would like to merge together.

E From the menus, choose Categories > Merge Categories. The categories are merged into one new category with a new name.

Deleting Categories

If you no longer want to keep a category, you can delete it.


To Delete a Category

E In the categories pane, select the category or categories that you would like to delete.

E From the menus, choose Edit > Delete.


Chapter 11
Analyzing Clusters

You can build and explore concept clusters in the Clusters view (View > Clusters). A cluster is a grouping of related concepts generated by clustering algorithms based on how often these concepts occur in the document/record set and how often they appear together in the same document, also known as co-occurrence. Each concept in a cluster co-occurs with at least one other concept in the cluster. The goal of clusters is to group concepts that occur together, while the goal of categories is to group documents or records.

A good cluster is one with concepts that are strongly linked and co-occur frequently and with few links to concepts in other clusters. When working with larger datasets, this technique may result in significantly longer processing times.

Note: Use the Maximum number of docs to use for calculating clusters option in the Build Clusters dialog box in order to build with only a subset of all documents or records.

Clustering is a process that begins by analyzing a set of concepts and looking for concepts that co-occur often in documents. Two concepts that co-occur in a document are considered to be a concept pair. Next, the clustering process assesses the similarity value of each concept pair by comparing the number of documents in which the pair occurs together to the number of documents in which each concept occurs. For more information, see “Calculating Similarity Link Values” on p. 183.

Lastly, the clustering process groups similar concepts into clusters by aggregation, taking into account their link values and the settings defined in the Build Clusters dialog box. By aggregation, we mean that concepts are added or smaller clusters are merged into a larger cluster until the cluster is saturated. A cluster is saturated when additional merging of concepts or smaller clusters would cause the cluster to exceed the settings in the Build Clusters dialog box (number of concepts, internal links, or external links). A cluster takes the name of the concept within the cluster that has the highest overall number of links to other concepts within the cluster.

In the end, not all concept pairs end up together in the same cluster, since there may be a stronger link in another cluster, or saturation may prevent the merging of the clusters in which they occur. For this reason, there are both internal and external links.

Internal links are links between concept pairs within a cluster. Not all concepts in a cluster are linked to each other; however, each concept is linked to at least one other concept inside the cluster.
External links are links between concept pairs in separate clusters (a concept within one cluster and a concept in another cluster).



Figure 11-1
Clusters view

The Clusters view is organized into three panes, each of which can be hidden or shown by selecting its name from the View menu:

Clusters pane. You can build and manage your clusters in this pane. For more information, see “Exploring Clusters” on p. 184.
Visualization pane. You can visually explore your clusters and how they interact in this pane. For more information, see “Cluster Graphs” in Chapter 13 on p. 198.
Data pane. You can explore and review the text contained within documents and records that correspond to selections in the Cluster Definitions dialog box. For more information, see “Cluster Definitions” on p. 184.

Building Clusters

When you first access the Clusters view, no clusters are visible. You can build the clusters through the menus (Tools > Build Clusters) or by clicking the Build... button on the toolbar. This action opens the Build Clusters dialog box, in which you can define the settings and limits for building your clusters.

Note: Whenever the extraction results no longer match the resources, this pane becomes yellow, as does the Extracted Results pane. You can reextract to get the latest extracted results, and the yellow coloring will disappear. However, each time an extraction is performed, the Clusters pane is cleared, and you will have to rebuild your clusters. Likewise, clusters are not saved from one session to another.

There are two tabs in the Build Clusters dialog box:
Settings. This tab contains the options to fine-tune the clustering settings. For more information, see “Build Clusters: Settings Tab” on p. 181.
Limits. This tab contains the options to limit the number of concepts or the number of documents/records to use to build the clusters. For more information, see “Build Clusters: Limits Tab” on p. 182.

Build Clusters: Settings Tab

Using the Build Clusters dialog box, you can define the settings and limits for building your clusters. On this tab, you can fine-tune the clustering settings. You can also define limits on another tab. For more information, see “Build Clusters: Limits Tab” on p. 182.

Figure 11-2
Build Clusters dialog box: Settings tab

The settings in the Build Clusters dialog box are:

Maximum number of clusters to create. This value is the maximum number of clusters to generate and display in the Clusters pane. During the clustering process, saturated clusters are presented before unsaturated ones, and therefore, many of the resulting clusters will be saturated. In order to see more unsaturated clusters, you can change this setting to a value greater than the number of saturated clusters.

Minimum concepts in a cluster. This value is the minimum number of concepts that must be linked in order to create a cluster.

Maximum concepts in a cluster. This value is the maximum number of concepts a cluster can contain.

Maximum number of internal links. This value is the maximum number of internal links a cluster can contain. Internal links are links between concept pairs within a cluster.

Maximum number of external links. This value is the maximum number of links to concepts outside of the cluster. External links are links between concept pairs in separate clusters.


Minimum link value. This value is the smallest link value accepted for a concept pair to be considered for clustering. Link value is calculated using a similarity formula. For more information, see “Calculating Similarity Link Values” on p. 183.

Note: You can exclude concepts from being grouped together in the same cluster by defining them as antilinks, or you can exclude entire types of concepts. For more information, see “Classification Exceptions” in Chapter 18 on p. 266.

Build Clusters: Limits Tab

Using the Build Clusters dialog box, you can define the settings and limits for building your clusters. On this tab, you can limit the number of concepts or the number of documents/records to use to build the clusters. You can also define clustering settings on the Settings tab. For more information, see “Build Clusters: Settings Tab” on p. 181.

Figure 11-3
Build Clusters dialog box: Limits tab

The limits in the Build Clusters dialog box are:

Build clusters from. Select the number of concepts you want to use for clustering. By reducing the number of concepts, you can speed up the clustering process.

Top concepts (based on doc count). With this option, you can choose the number of concepts to be considered for clustering. The concepts are chosen based on those that have the highest doc count value. Doc count is the number of documents or records in which the concept appears.
Top % of concepts (based on doc count). With this option, you can choose the percentage of concepts to be considered for clustering. The concepts are chosen based on this percentage of concepts with the highest doc count value.
All concepts. The clustering process will attempt to cluster all concepts, beginning with those with the highest doc count, until the maximum number of clusters has been built.

Maximum number of docs to use for calculating clusters. By default, link values are calculated using the entire set of documents or records. However, in some cases, you may want to speed up the clustering process by limiting the number of documents or records used to calculate the links. Limiting documents may decrease the quality of the clusters. To use this option, select the check box to its left and enter the maximum number of documents or records to use.


Note: You can exclude concepts from being grouped together in the same cluster by defining them as antilinks, or you can exclude entire types of concepts. For more information, see “Classification Exceptions” in Chapter 18 on p. 266.

Calculating Similarity Link Values

Knowing only the number of documents in which a concept pair co-occurs does not in itself tell you how similar the two concepts are. In these cases, the similarity value can be helpful. The similarity link value is measured using the co-occurrence document count compared to the individual document counts for each concept in the relationship. When calculating similarity, the unit of measurement is the number of documents (doc count) in which a concept or concept pair is found. A concept or concept pair is “found” in a document if it occurs at least once in the document. You can choose to have the line thickness in the Concept graph represent the similarity link value in the graphs.

The algorithm reveals those relationships that are strongest, meaning that the tendency for the concepts to appear together in the text data is much higher than their tendency to occur independently. Internally, the algorithm yields a similarity coefficient ranging from 0 to 1, where a value of 1 means that the two concepts always appear together and never separately. The similarity coefficient result is then multiplied by 100 and rounded to the nearest whole number. The similarity coefficient is calculated using the formula shown in the following figure.

Figure 11-4
Similarity coefficient formula

similarity coefficient = (CIJ)² / (CI × CJ)

Where:
CI is the number of documents or records in which the concept I occurs.
CJ is the number of documents or records in which the concept J occurs.
CIJ is the number of documents or records in which the concept pair I and J co-occurs in the set of documents.

For example, suppose that you have 5,000 documents. Let I and J be extracted concepts and let IJ be a concept pair co-occurrence of I and J. The following table proposes two scenarios to demonstrate how the coefficient and link value are calculated.

Table 11-1
Concept frequencies example

Concept/Pair             Scenario A              Scenario B
Concept: I               Occurs in 20 docs       Occurs in 30 docs
Concept: J               Occurs in 20 docs       Occurs in 60 docs
Concept Pair: IJ         Co-occurs in 20 docs    Co-occurs in 20 docs
Similarity coefficient   1                       0.22222
Similarity link value    100                     22


In scenario A, the concepts I and J as well as the pair IJ occur in 20 documents, yielding a similarity coefficient of 1, meaning that the concepts always occur together. The similarity link value for this pair would be 100.

In scenario B, concept I occurs in 30 documents and concept J occurs in 60 documents, but the pair IJ occurs in only 20 documents. As a result, the similarity coefficient is 0.22222. The similarity link value for this pair would be rounded down to 22.
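
The calculation can be reproduced in a few lines of Python. The function below is an illustrative sketch, not product code; it simply applies the coefficient (CIJ)² / (CI × CJ), which matches both scenarios in Table 11-1.

# Reproduces the similarity link values for the two scenarios in Table 11-1.
def similarity_link_value(ci, cj, cij):
    """ci, cj: doc counts of each concept; cij: doc count of the pair."""
    coefficient = (cij ** 2) / (ci * cj)
    return coefficient, round(coefficient * 100)

print(similarity_link_value(20, 20, 20))   # (1.0, 100)       scenario A
print(similarity_link_value(30, 60, 20))   # (0.2222..., 22)  scenario B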

Exploring Clusters

After you build clusters, you can see a set of results in the Clusters pane. For each cluster, the following information is available in the table:

Cluster. This is the name of the cluster. Clusters are named after the concept with the highest number of internal links.
Concepts. This is the number of concepts in the cluster. For more information, see “Cluster Definitions” on p. 184.
Internal. This is the number of internal links in the cluster. Internal links are links between concept pairs within a cluster.
External. This is the number of external links in the cluster. External links are links between concept pairs when one concept is in one cluster and the other concept is in another cluster.
Sat. If a symbol is present, it indicates that this cluster could have been larger but one or more limits would have been exceeded; therefore, the clustering process ended for that cluster, and the cluster is considered to be saturated. At the end of the clustering process, saturated clusters are presented before unsaturated ones, and therefore, many of the resulting clusters will be saturated. In order to see more unsaturated clusters, you can change the Maximum number of clusters to create setting to a value greater than the number of saturated clusters or decrease the Minimum link value. For more information, see “Build Clusters: Settings Tab” on p. 181.
Threshold. For all of the co-occurring concept pairs in the cluster, this is the lowest similarity link value in the cluster. For more information, see “Calculating Similarity Link Values” on p. 183. A cluster with a high threshold value signifies that the concepts in that cluster have a higher overall similarity and are more closely related than those in a cluster whose threshold value is lower.

To learn more about a given cluster, you can select it, and the visualization pane on the right will show two graphs to help you explore the cluster(s). For more information, see “Cluster Graphs” in Chapter 13 on p. 198. You can also cut and paste the contents of the table into another application.

Whenever the extraction results no longer match the resources, this pane becomes yellow, as does the Extracted Results pane. You can reextract to get the latest extracted results, and the yellow coloring will disappear. However, each time an extraction is performed, the Clusters pane is cleared and you will have to rebuild your clusters. Likewise, clusters are not saved from one session to another.

Cluster Definitions

You can see all of the concepts inside a cluster by selecting it in the Clusters pane and opening the Cluster Definitions dialog box (View > Cluster Definitions).


Figure 11-5
Cluster Definitions dialog box

All of the concepts in the selected cluster appear in the Cluster Definitions dialog box. If you select one or more concepts in the Cluster Definitions dialog box and click Display &, the Data pane will display all of the records or documents in which all of the selected concepts appear together. However, the Data pane does not display any text records or documents when you select a cluster in the Clusters pane. For general information on the Data pane, see “The Data Pane” in Chapter 10.

Selecting concepts in this dialog box also changes the concept web graph. For more information, see “Cluster Graphs” in Chapter 13 on p. 198. Similarly, when you select one or more concepts in the Cluster Definitions dialog box, the Visualization pane will show all of the external and internal links from those concepts.

Important! There is no Cancel button in this dialog box. Any changes you make are immediately applied to your category.

Column Descriptions

Icons are shown so that you can easily identify each descriptor.

Table 11-2
Columns and Descriptor Icons

Columns       Description
Descriptors   The name of the concept.
Global        Shows the number of times this descriptor appears in the entire dataset, also known as the global frequency.
Docs          Shows the number of documents or records in which this descriptor appears, also known as the document frequency.
Type          Shows the type or types to which the descriptor belongs. If the descriptor is a conditional rule, no type name is shown in this column.


Toolbar Actions

From this dialog box, you can also select one or more concepts to use in a category. There are several ways to do this, but it is most interesting to select concepts that co-occur in a cluster and add them as a conditional rule. For more information, see “Co-occurrence Rules” in Chapter 10 on p. 172. You can use the toolbar buttons to add the concepts to categories.

Table 11-3
Toolbar buttons to add concepts to categories

Add the selected concepts to a new or existing category.
Add the selected concepts in the form of an & conditional rule to a new or existing category. For more information, see “Using Conditional Rules” in Chapter 10 on p. 174.
Add each of the selected concepts as its own new category.
Display &: Updates what is displayed in the Data pane and the Visualization pane according to the selected descriptors.

Note: You can also add concepts to a type, as synonyms, or as exclude items using the context menus.


Chapter 12
Exploring Text Link Analysis

In the Text Link Analysis (TLA) view, you can build and explore text link analysis pattern results. Text link analysis is a pattern-matching technology that enables you to define pattern rules and compare these to actual extracted concepts and relationships found in your text.

For example, extracting ideas about an organization may not be interesting enough to you. Using TLA, you could also learn about the links between this organization and other organizations or the people within an organization. You can also use TLA to extract opinions on products or the relationships between genes. Once you’ve extracted some TLA pattern results, you can explore them in the Data or Visualization panes and even add them to categories.

If you extract TLA pattern results, the results are presented in this view in the Type and Concept Patterns panes. For more information, see “Type and Concept Patterns” on p. 189. If you have not chosen to do so, you can click Extract and choose Enable Text Link Analysis pattern extraction in the Extract dialog box. For more information, see “Extracting TLA Pattern Results” on p. 188.

However, there must be some TLA pattern rules defined in the resource template or libraries you are using in order to extract TLA pattern results. You can use the TLA patterns in certain resource templates shipped with Text Mining for Clementine or create/edit your own. Patterns are made up of variables, macros, word lists, and word gaps to form a Boolean query, or rule, that is compared to your input text. Whenever a TLA pattern matches text, this text can be extracted as a pattern and restructured as output data. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275.

The Text Link Analysis view is divided into panes, each of which can be hidden or shown by selecting its name from the View menu:
Type and Concept Patterns panes. You can build and explore your patterns in these two panes. For more information, see “Type and Concept Patterns” on p. 189.
Visualization pane. You can visually explore how the concepts and types in your patterns interact in this pane. For more information, see “Text Link Analysis Graphs” in Chapter 13 on p. 200.
Data pane. You can explore and review text contained within documents and records that correspond to selections in another pane. For more information, see “Data Pane” on p. 192.



Figure 12-1
Text Link Analysis view

Extracting TLA Pattern Results

The extraction process results in a set of concepts and types, as well as Text Link Analysis (TLA) patterns, if enabled. If you extracted TLA patterns, you can see those in the Text Link Analysis view. Whenever the extraction results are not in sync with the resources, the Patterns panes become yellow in color, indicating that a reextraction would produce different results.

You have to choose to extract these patterns in the Text Mining for Clementine node settings or in the Extract dialog box using the option Enable Text Link Analysis pattern extraction. For more information, see “Extract Dialog Box: Settings Tab” in Chapter 9 on p. 143.

Note: There is a relationship between the size of your dataset and the time it takes to complete the extraction process. See the installation instructions for performance statistics and recommendations. You can always consider inserting a Sample node upstream or optimizing your machine’s configuration.

To Extract Data

E From the menus, choose Tools > Extract. Alternatively, click the Extract toolbar button.

E On the Settings tab, change any of the options you want to use. Keep in mind that the option Enable Text Link Analysis pattern extraction must be selected on this tab, and there must be TLA rules in your template, in order to extract TLA pattern results. For more information, see “Extract Dialog Box: Settings Tab” in Chapter 9 on p. 143.

E On the Language tab, change any of the options you want to use. For more information, see “Extract Dialog Box: Language Tab” in Chapter 9 on p. 145.

E Click Extract to begin the extraction process.

Once the extraction begins, the progress dialog box opens. If you want to abort the extraction, click Cancel. When the extraction is complete, the dialog box closes and the results appear in the pane. For more information, see “Type and Concept Patterns” on p. 189.

Type and Concept Patterns

Patterns are made up of two parts, a combination of concepts and types. Patterns are most useful when you are attempting to discover opinions about a particular subject or relationships between concepts. Extracting your competitor’s product name may not be interesting enough to you. In this case, you can look at the extracted patterns to see if you can find examples where a document or record contains text expressing that the product is good, bad, or expensive.

Figure 12-2
Text Link Analysis view: Type and Concept Patterns panes

Patterns can consist of up to six types or six concepts. For this reason, the rows in both patterns panes contain up to six slots, or positions. Each slot corresponds to an element’s specific position in the TLA pattern rule as it is defined in the linguistic resources. In the interactive workbench, if a slot contains no values, it is not shown in the table. For example, if the longest pattern results contain no more than four slots, the last two are not shown. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275.

When you extract pattern results, they are first grouped at the type level and then divided into concept patterns. For this reason, there are two different result panes: Type Patterns (upper left) and Concept Patterns (lower left). To see all concept patterns returned, select all of the type patterns. The bottom concept patterns pane will then display all concept patterns up to the maximum rank value (as defined in the Filter dialog box).

Type Patterns. This pane presents pattern results consisting of two or more related types matching a TLA pattern rule. Type patterns are shown as, for example, <Organization> + <Location> + <Positive>, which might represent positive feedback about an organization in a specific location. The syntax is as follows:

<Type1> + <Type2> + <Type3> + <Type4> + <Type5> + <Type6>

Concept Patterns. This pane presents the pattern results at the concept level for all of the type pattern(s) currently selected in the Type Patterns pane above it. Concept patterns follow a structure such as hotel + paris + wonderful. The syntax is as follows:

concept1 + concept2 + concept3 + concept4 + concept5 + concept6

When pattern results use fewer than the six maximum slots, only the necessary number of slots (or columns) are displayed. By default, any single-slot patterns are hidden but can be displayed through the context menu in the patterns table (Show One-Slot Patterns). Any empty slots found between two filled slots are represented by a null value. Thus, a pattern that is <Type1>+<>+<Type2>+<>+<>+<> can be represented as <Type1>+<>+<Type2> (where <> represents a null type). For a concept pattern, this would be concept1+.+concept2 (where . represents a null value).

results here. If you see any refinements you would like to make to the types and concepts thatmake up these patterns, you make those in the Extracted Results pane in the Categories andConcepts view or directly in the Resource Editor and reextract your patterns.Whenever a concept, type, or pattern is being used in a category definition, it appears in

italics in the table. You can view only the unused concepts by clicking the right-most icon inthe extracted results pane.

Filtering TLA Results

When you are working with very large datasets, the extraction process could produce millions of results. For many users, this amount can make it more difficult to review the results effectively. You can, however, filter these results in order to zoom in on those that are most interesting. You can change the settings in the Filter dialog box to limit what patterns are shown. All of these settings are used together.


Figure 12-3
Filter dialog box (in the TLA view)

Filter by Frequency. You can filter to display only those results with a certain global or document frequency value.

Global frequency is the total number of times a pattern appears in the entire set of documents or records and is shown in the Global column.
Document frequency is the total number of documents or records in which a pattern appears and is shown in the Docs column.

For example, if a pattern appeared 500 times in 300 records, we would say that this pattern has a global frequency of 500 and a document frequency of 300.

And by Match Text. You can also filter to display only those results that match the rule you define here. Enter the set of characters to be matched in the Match text field, and select whether to look for this text within concept or type names by identifying the slot number or all of them. Then select the condition in which to apply the match (you do not need to use angled brackets to denote the beginning or end of a type name). Select either And or Or from the drop-down list so that the rule matches both statements or just one of them, and define the second text matching statement in the same manner as the first.

Table 12-1
Match text conditions

Condition     Description
Contains      Text is matched if the string occurs anywhere. (Default choice)
Starts with   Text is matched only if the concept or type starts with the specified text.
Ends with     Text is matched only if the concept or type ends with the specified text.
Exact Match   The entire string must match the concept or type name.

And by Rank. You can also filter to display only a top number of patterns according to global frequency (Global) or document frequency (Docs) in either ascending or descending order. This maximum rank value limits the total number of patterns returned for display.


When the filter is applied, the product adds type patterns until the maximum total number of concept patterns (rank maximum) would be exceeded. It begins by looking at the type pattern with the top rank and then takes the sum of the corresponding concept patterns. If this sum does not exceed the rank maximum, the patterns are displayed in the view. Then, the number of concept patterns for the next type pattern is summed. If that number plus the total number of concept patterns in the previous type pattern is less than the rank maximum, those patterns are also displayed in the view. This continues until as many patterns as possible, without exceeding the rank maximum, are displayed.
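
The accumulation just described can be sketched as follows. This is an illustration of the described behavior, not the product's code, and the pattern names and counts are invented for the example.

# Minimal sketch of the "And by Rank" accumulation: add whole type patterns,
# in rank order, while the running total of their concept patterns stays
# within the rank maximum.
def select_type_patterns(type_patterns, rank_maximum):
    """type_patterns: list of (name, concept_pattern_count) sorted by rank."""
    shown, total = [], 0
    for name, concept_count in type_patterns:
        if total + concept_count > rank_maximum:
            break
        shown.append(name)
        total += concept_count
    return shown, total

ranked = [("<Organization> + <Positive>", 40),
          ("<Product> + <Negative>", 35),
          ("<Location> + <Positive>", 50)]

print(select_type_patterns(ranked, rank_maximum=80))
# (['<Organization> + <Positive>', '<Product> + <Negative>'], 75)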

Important! Not all results are shown by default. The display of single-slot patterns is disabled by default (enable it using the context menu). Therefore, you may not always get the maximum number shown in the view.

Results Displayed in Patterns Pane

Here are some examples of how the results might be displayed on the Patterns pane toolbar based on the filters.

Figure 12-4
Filter results example 1

In this example, the toolbar shows that the number of patterns returned was limited because of the rank maximum specified in the filter. If a purple icon is present, this means that the maximum number of patterns was met. Hover over the icon for more information. See the preceding explanation of the And by Rank filter.

Figure 12-5
Filter results example 2

In this example, the toolbar shows that results were limited using a match text filter (see the magnifying glass icon). You can hover over the icon to see what the match text is.

To Filter the Results

E From the menus, choose Tools > Filter. The Filter dialog box opens.

E Select and refine the filters you want to use.

E Click OK to apply the filters and see the new results.

Data Pane

As you extract and explore text link analysis patterns, you may want to review some of the data you are working with. For example, you may want to see the actual records in which a group of patterns were discovered. You can review records or documents in the Data pane, which is located in the lower right. If it is not visible by default, choose View > Panes > Data from the menus.


The Data pane presents one row per document or record corresponding to a selection in the view, up to a certain display limit. By default, the number of documents or records shown in the Data pane is limited in order to make it faster for you to see your data. However, you can adjust this in the Options dialog box. For more information, see “Options: Session Tab” in Chapter 8 on p. 131.

Displaying and Refreshing the Data Pane

The Data pane does not refresh its display automatically, because with larger datasets automatic data refreshing could take some time to complete. Therefore, whenever you select type or concept patterns in this view, you can click Display to refresh the contents of the Data pane.

Text Documents or Records

If your text data is in the form of records and the text is relatively short in length, the text field in the Data pane displays the text data in its entirety. However, when working with records and larger datasets, the text field column shows a short piece of the text and opens a Text Preview pane to the right to display more or all of the text of the record you have selected in the table. If your text data is in the form of individual documents, the Data pane shows the document’s filename. When you select a document, the Text Preview pane opens with the selected document’s text.

Figure 12-6
Data pane with Text Preview pane

Colors and Highlighting

Whenever you select a concept or category in another pane and display the data, concepts and descriptors found in those documents or records are highlighted in color to help you easily identify them in the text. The color coding corresponds to the types to which the concepts belong. You can also hover your mouse over color-coded items to display the concept under which it was extracted and the type to which it was assigned. Any text that was not extracted appears in black. Typically, these unextracted words are often connectors (and or with), pronouns (me or they), and verbs (is, have, or take).

Data Pane Columns

You can show or hide columns in the data pane. For more information, see “Adding Columnsto the Data Pane” in Chapter 10 on p. 162.

Chapter 13
Visualizing Graphs

The Categories and Concepts view, Clusters view, and Text Link Analysis view all have avisualization pane in the upper right corner of the window. You can use this pane to visuallyexplore your data. The following graphs and charts are available.

Categories and Concepts view. This view has three graphs and charts: Category Bar, Category Web, and Category Web Table. In this view, the graphs are only updated when you click Display. For more information, see “Category Graphs and Charts” on p. 195.
Clusters view. This view has two web graphs: Concept Web Graph and Cluster Web Graph. For more information, see “Cluster Graphs” on p. 198.
Text Link Analysis view. This view has two web graphs: Concept Web Graph and Type Web Graph. For more information, see “Text Link Analysis Graphs” on p. 200.

Category Graphs and Charts

When building your categories, it is important to take the time to review the category definitions, the documents or records they contain, and how the categories overlap. The visualization pane offers several perspectives on your categories. The Visualization pane is located in the upper right corner of the Categories and Concepts view. If it isn't already visible, you can access this pane from the View menu (View > Visualization).

In this view, the visualization pane offers three perspectives on the commonalities in document or record categorization. The charts and graphs in this pane can be used to analyze your categorization results and aid in fine-tuning categories or reporting. When refining categories, you can use this pane to review your category definitions to uncover categories that are too similar (for example, they share more than 75% of their documents or records) or too distinct.

Depending on what is selected in the Extracted Results pane or Categories pane or in the Category Definitions dialog box, you can view the corresponding interactions between documents/records and categories on each of the tabs in this pane. Each presents similar information but in a different manner or with a different level of detail. However, in order to refresh a graph for the current selection, click Display on the toolbar of the pane or dialog box in which you have made your selection.

Note: By default, the graphs are in the interactive/selection mode, in which you can move nodes. However, you can edit your graph layouts in Edit mode, including colors and fonts, legends, and more. For more information, see “Using Graph Toolbars” on p. 202.

The Categories and Concepts view has three graphs and charts.

Category Bar Chart. A table and bar chart present the overlap between the documents or records corresponding to your selection and the associated categories. The bar chart also presents ratios of the documents or records in categories to the total number of documents or records. For more information, see “Category Bar Chart” on p. 196.
Category Web Graph. This graph presents the document/record overlap for the categories to which the documents or records belong according to the selection in the other panes. For more information, see “Category Web Graph” on p. 197.
Category Web Table. This table presents the same information as the Category Web tab but in a table format. The table contains three columns that can be sorted by clicking the column headers. For more information, see “Category Web Table” on p. 197.

For more information, see “Categorizing Text Data” in Chapter 10 on p. 157.

Category Bar Chart

This tab displays a table and bar chart showing the overlap between the documents or records corresponding to your selection and the associated categories. The bar chart also presents ratios of the documents or records in categories to the total number of documents or records. You cannot edit the layout of this chart. You can, however, sort the columns by clicking the column headers.

The chart contains the following columns (a small sketch of the Selection % calculation follows the list):
Category. This column presents the names of the categories in your selection. By default, the most common category in your selection is listed first.
Bar. This column presents, in a visual manner, the ratio of the documents or records in a given category to the total number of documents or records.
Selection %. This column presents a percentage based on the ratio of the total number of documents or records for a category to the total number of documents or records represented in the selection.
Docs. This column presents the number of documents or records in the selection for the given category.
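To make the relationship between the Docs and Selection % columns concrete, here is a small sketch with made-up selection data; the category names and document ids are hypothetical.

    # Hypothetical selection: document ids per category within the current selection.
    selection = {
        "Positive": {1, 2, 3, 5, 8},
        "Negative": {2, 4},
        "Price":    {3, 5},
    }
    all_docs = set().union(*selection.values())   # every document in the selection

    for category, docs in sorted(selection.items(), key=lambda kv: -len(kv[1])):
        pct = 100.0 * len(docs) / len(all_docs)   # Selection %
        print(f"{category:10s} Docs={len(docs)}  Selection %={pct:.0f}")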

Figure 13-1
Category Bar chart

Category Web Graph

This tab displays a category web graph. The web presents the document or record overlap for the categories to which the documents or records belong according to the selection in the other panes. If category labels exist, these labels appear in the graph. You can choose which graph layout to use (network, circle, directed, or grid) using the toolbar buttons in this pane.

Figure 13-2
Category Web graph, grid layout

In the web, each node represents a category. You can select and move the nodes within the pane. The size of the node represents the relative size based on the number of documents or records for that category in your selection. The thickness and color of the line between two categories denotes the number of common documents or records they have. If you hover your mouse over a node in the Interactive/Selection mode, a ToolTip displays the following information for the category:

Name (or label).
Selection count, which represents the number of documents or records for that category within your selection in other panes.
Total count, which represents the overall number of documents or records in the category.

Note: By default, the graphs are in the interactive/selection mode, in which you can move nodes. However, you can edit your graph layouts in Edit mode, including colors and fonts, legends, and more. For more information, see “Using Graph Toolbars” on p. 202.

Category Web Table

This tab displays the same information as the Category Web tab but in a table format. The table contains three columns that can be sorted by clicking the column headers:

Count. This column presents the number of shared, or common, documents or records between the two categories.

Category 1. This column presents the name of the first category followed by the total number of documents or records it contains, shown in parentheses.
Category 2. This column presents the name of the second category followed by the total number of documents or records it contains, shown in parentheses.
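The Count column is simply the number of documents or records that the two categories share. A minimal sketch of that calculation, using hypothetical category assignments:

    from itertools import combinations

    # Hypothetical mapping of category name -> set of document ids it contains.
    categories = {
        "Delivery": {1, 2, 3, 4},
        "Price":    {2, 3, 5},
        "Quality":  {4, 5, 6},
    }

    # One line per category pair, mirroring the Count / Category 1 / Category 2 columns.
    for (name1, docs1), (name2, docs2) in combinations(categories.items(), 2):
        shared = len(docs1 & docs2)
        print(shared, f"{name1} ({len(docs1)})", f"{name2} ({len(docs2)})")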

Figure 13-3
Category Web table

Cluster Graphs

After building your clusters, you can explore them visually in the web graphs in the Visualizationpane. The visualization pane offers two perspectives on clustering: a Concept Web graph and aCluster Web graph. The web graphs in this pane can be used to analyze your clustering resultsand aid in uncovering some concepts and rules you may want to add to your categories. TheVisualization pane is located in the upper right corner of the Clusters view. If it isn’t alreadyvisible, you can access this pane from the View menu (View > Visualization). By selecting a clusterin the Clusters pane, you can automatically display the corresponding graphs in the Visualizationpane.

Note: By default, the graphs are in the interactive/selection mode, in which you can move nodes. However, you can edit your graph layouts in Edit mode, including colors and fonts, legends, and more. For more information, see “Using Graph Toolbars” on p. 202.

The Clusters view has two web graphs.
Concept Web Graph. This graph presents all of the concepts within the selected cluster(s) as well as linked concepts outside the cluster. This graph can help you see how the concepts within a cluster are linked and any external links. For more information, see “Concept Web Graph” on p. 199.
Cluster Web Graph. This graph presents the selected cluster(s) with all of the external links between the selected clusters shown as dotted lines. For more information, see “Cluster Web Graph” on p. 199.

For more information, see “Analyzing Clusters” in Chapter 11 on p. 179.

Concept Web Graph

This tab displays a web graph showing all of the concepts within the selected cluster(s) as well as linked concepts outside the cluster. This graph can help you see how the concepts within a cluster are linked and any external links. Each concept in a cluster is represented as a node, which is color coded according to the type color. For more information, see “Creating Types” in Chapter 17 on p. 245.

The internal links between the concepts within a cluster are drawn, and the line thickness of each link is directly related to either the document count for each concept pair's co-occurrence or the similarity link value, depending on your choice on the graph toolbar. The external links between a cluster's concepts and those concepts outside the cluster are also shown.

If concepts are selected in the Cluster Definitions dialog box, the Concept Web graph will display those concepts and any associated internal and external links to those concepts. Any links between other concepts that do not include one of the selected concepts do not appear on the graph.
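The co-occurrence document count mentioned above can be thought of as the number of documents in which both concepts of a link were extracted. A rough sketch of that idea with hypothetical extraction results (this is only an illustration, not how the product computes it internally):

    from collections import Counter
    from itertools import combinations

    # Hypothetical extraction results: the set of concepts found in each document.
    docs = [
        {"engine", "noise", "vibration"},
        {"engine", "noise"},
        {"brake", "noise"},
    ]

    link_weight = Counter()
    for concepts in docs:
        for pair in combinations(sorted(concepts), 2):
            link_weight[pair] += 1   # one more document in which the pair co-occurs

    print(link_weight.most_common(3))   # ("engine", "noise") co-occurs in 2 documents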

Note: By default, the graphs are in the interactive/selection mode, in which you can move nodes. However, you can edit your graph layouts in Edit mode, including colors and fonts, legends, and more. For more information, see “Using Graph Toolbars” on p. 202.

Figure 13-4
Concept Web graph

Cluster Web Graph

This tab displays a web graph showing the selected cluster(s). The external links between theselected clusters as well as any links between other clusters are all shown as dotted lines. Ina Cluster Web graph, each node represents an entire cluster and the thickness of lines drawnbetween them represents the number of external links between two clusters.

Important! You must build clusters and select clusters with external links to display a Cluster Web graph.

For example, let's say we have two clusters. Cluster A has three concepts: A1, A2, and A3. Cluster B has two concepts: B1 and B2. The following concepts are linked: A1-A2, A1-A3, A2-B1 (external), A2-B2 (external), A1-B2 (external), and B1-B2. This means that in the Cluster Web graph, the line thickness would represent the three external links.
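The following sketch reproduces the arithmetic of this example: it lists the cluster contents and links given above and counts the links whose endpoints fall in different clusters.

    cluster_a = {"A1", "A2", "A3"}
    cluster_b = {"B1", "B2"}   # shown for completeness; the test below only needs cluster_a
    links = [("A1", "A2"), ("A1", "A3"), ("A2", "B1"),
             ("A2", "B2"), ("A1", "B2"), ("B1", "B2")]

    # A link is external when its two endpoints belong to different clusters.
    external = [(x, y) for x, y in links if (x in cluster_a) != (y in cluster_a)]
    print(len(external), external)   # 3 external links: A2-B1, A2-B2, A1-B2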

Note: By default, the graphs are in the interactive/selection mode, in which you can move nodes. However, you can edit your graph layouts in Edit mode, including colors and fonts, legends, and more. For more information, see “Using Graph Toolbars” on p. 202.

Figure 13-5
Cluster Web graph

Text Link Analysis Graphs

After extracting your Text Link Analysis (TLA) patterns, you can explore them visually in the web graphs in the Visualization pane. The visualization pane offers two perspectives on TLA patterns: a concept (pattern) web graph and a type (pattern) web graph. The web graphs in this pane can be used to visually represent patterns. The Visualization pane is located in the upper right corner of the Text Link Analysis view. If it isn't already visible, you can access this pane from the View menu (View > Visualization). If there is no selection, the graph area is empty.

Note: By default, the graphs are in the interactive/selection mode, in which you can move nodes. However, you can edit your graph layouts in Edit mode, including colors and fonts, legends, and more. For more information, see “Using Graph Toolbars” on p. 202.

The Text Link Analysis view has two web graphs.

Concept Web Graph. This graph presents all the concepts in the selected pattern(s). The line width and node sizes (if type icons are not shown) in a concept graph show the number of global occurrences in the selected table. For more information, see “Concept Web Graph” on p. 201.
Type Web Graph. This graph presents all the types in the selected pattern(s). The line width and node sizes (if type icons are not shown) in the graph show the number of global occurrences in the selected table. Nodes are represented by either a type color or by an icon. For more information, see “Type Web Graph” on p. 201.

For more information, see “Exploring Text Link Analysis” in Chapter 12 on p. 187.

Concept Web Graph

This web graph presents all of the concepts represented in the current selection. For example,if you selected a type pattern that had three matching concept patterns, this graph would showthree sets of linked concepts. The line width and node sizes in a concept graph represent theglobal frequency counts. The graph visually represents the same information as what is selectedin the patterns panes. The types of each concept are presented either by a color or by an icondepending on what you select on the graph toolbar. For more information, see “Using GraphToolbars” on p. 202.

Figure 13-6
Concept Web graph

Type Web Graph

This web graph presents each type pattern for the current selection. For example, if you selected two concept patterns, this graph would show one node per type in the selected patterns and the links between the types found in the same pattern. The line width and node sizes represent the global frequency counts for the set. The graph visually represents the same information as what is selected in the patterns panes. In addition to the type names appearing in the graph, the types are also identified either by their color or by a type icon, depending on what you select on the graph toolbar. For more information, see “Using Graph Toolbars” on p. 202.

Figure 13-7
Type Web graph

Using Graph Toolbars

For each graph, there is a toolbar that provides you with quick access to some common actionsyou might perform with your graphs. Each view (Categories and Concepts, Clusters, and TextLink Analysis) has a slightly different toolbar. To learn what each button means, refer to thefollowing table.

You can choose between the Explore view mode and the Edit view mode.
Explore mode. By default, the Explore mode is turned on, which means that you can move and drag nodes around the graph as well as hover over graph objects to reveal additional ToolTip information.
Edit mode. Switch to the Edit mode to change the look of the graph, such as enlarging the font, changing the colors to match your corporate style guide, or removing labels and legends. For more information, see “Editing Graphs” on p. 203.

Table 13-1
Toolbar buttons

Select a type of web display for the graphs in the Categories and Concepts view as well as the Text Link Analysis view:
  Circle Layout. A general layout that can be applied to any graph. It lays out a graph assuming that links are undirected and treats all nodes the same. Nodes are only placed around the perimeter of a circle.
  Network Layout. A general layout that can be applied to any graph. It lays out a graph assuming that links are undirected and treats all nodes the same. Nodes are placed freely within the layout.
  Directed Layout. A layout that should only be used for directed graphs. This layout produces treelike structures from root nodes down to leaf nodes and organizes by colors. Hierarchical data tends to display nicely with this layout.
  Grid Layout. A general layout that can be applied to any graph. It lays out a graph assuming that links are undirected and treats all nodes the same. Nodes are only placed at grid points within the space.
A toggle button that, when pressed, displays the type icons in the graph rather than type colors. This only applies to the Text Link Analysis view.
A drop-down list of link size choices. You can choose between using the co-occurrence document count and similarity link values to determine the thickness of the link lines in the Concept web. The Clusters web graph only shows the number of external links between clusters. This only applies to the Clusters view.
Copies the graph to the clipboard as an image for use in another application, such as MS Word or MS PowerPoint.
A toggle button that, when pushed, displays the legend. When the button is not pushed, the legend is not shown.
A toggle button that, when pushed, displays the Links Slider beneath the graph. You can filter the results by sliding the arrow.
Enables Edit mode.
Enables Selection/Interactive mode.

Editing Graphs

You have several options for editing a graph. You can:
Edit text and format it.
Change the fill color and pattern of frames and graphic elements.
Change the color and dashing of borders and lines.
Rotate and change the shape and aspect ratio of point elements.
Change the size of graphic elements (such as bars and points).
Adjust the space around items by using margins and padding.
Change the axis and scale settings.
Sort, exclude, and collapse categories on a categorical axis.
Set the orientation of axes and panels.
Change the position of the legend.

The following topics describe how to perform these various tasks. It is also recommended thatyou read the general rules for editing graphs.

General Rules for Editing Graphs

Selection

The options available for editing depend on selection. Different toolbar and properties paletteoptions are enabled depending on what is selected. Only the enabled items apply to the currentselection. For example, if an axis is selected, the Scale, Major Ticks, and Minor Ticks tabs areavailable in the properties palette.

Here are some tips for selecting items in the graph:
Click an item to select it.
Select a graphic element (such as points in a scatterplot or bars in a bar chart) with a single click. Double-click to drill down the selection to groups of graphic elements or a single graphic element.
Press Esc to deselect everything.

Automatic Settings

Some settings provide an -auto- option. This indicates that automatic values are applied. Whichautomatic settings are used depends on the specific graph and data values. You can enter a valueto override the automatic setting. If you want to restore the automatic setting, delete the currentvalue and press Enter. The setting will display -auto- again.

Removing/Hiding Items

You can remove/hide various items in the graph. For example, you can hide the legend or axislabel. To delete an item, select it and press Delete. If the item does not allow deletion, nothingwill happen. If you accidentally delete an item, press Ctrl+Z to undo the deletion.

State

Some toolbars reflect the state of the current selection, others don’t. The properties palette alwaysreflects state. If a toolbar does not reflect state, this is mentioned in the topic that describes thetoolbar.

Editing and Formatting Text

You can edit text in place and change the formatting of an entire text block. Note that you can’tedit text that is linked directly to data values. For example, you can’t edit a tick label because thecontent of the label is derived from the underlying data. However, you can format any text inthe graph.

How to Edit Text in Place

E Double-click the text block. This action selects all the text. All toolbars are disabled at this time,because you cannot change any other part of the graph while editing text.

E Type to replace the existing text. You can also click the text again to display a cursor. Position the cursor where you want it and enter the additional text.

How to Format Text

E Select the frame containing the text. Do not double-click the text.

E Format text using the font toolbar. If the toolbar is not enabled, make sure only the framecontaining the text is selected. If the text itself is selected, the toolbar will be disabled.

Figure 13-8
Font toolbar

You can change the font:
Color
Family (for example, Arial or Verdana)
Size (the unit is pt unless you indicate a different unit, such as pc)
Weight
Alignment relative to the text frame

Formatting applies to all the text in a frame. You can’t change the formatting of individual lettersor words in any particular block of text.

Changing Colors, Patterns, and Dashings

Many different items in a graph have a fill and border. The most obvious example is a bar in a bar chart. The color of the bars is the fill color. They may also have a solid, black border around them.

There are other, less obvious items in the graph that have fill colors. If the fill color is transparent, you may not know there is a fill. For example, consider the text in an axis label. It appears as if this text is "floating" text, but it actually appears in a frame that has a transparent fill color. You can see the frame by selecting the axis label.

Any frame in the graph can have a fill and border style, including the frame around the whole graph.

How to Change the Colors, Patterns, and Dashing

E Select the item you want to format. For example, select the bars in a bar chart or a framecontaining text. If the graph is split by a categorical variable or field, you can also select thegroup that corresponds to an individual category. This allows you to change the default aestheticassigned to that group. For example, you can change the color of one of the stacking groups ina stacked bar chart.

E To change the fill color, the border color, or the fill pattern, use the color toolbar.

Figure 13-9
Color toolbar

Note: This toolbar does not reflect the state of the current selection.

You can click the button to select the displayed option or click the drop-down arrow to chooseanother option. For colors, notice there is one color that looks like white with a red, diagonal linethrough it. This is the transparent color. You could use this, for example, to hide the borders onbars in a histogram.

The first button controls the fill color.
The second button controls the border color.
The third button controls the fill pattern. The fill pattern uses the border color. Therefore, the fill pattern is visible only if there is a visible border color.

E To change the dashing of a border or line, use the line toolbar.

Figure 13-10
Line toolbar

Note: This toolbar does not reflect the state of the current selection.

As with the other toolbar, you can click the button to select the displayed option or click thedrop-down arrow to choose another option.

Rotating and Changing the Shape and Aspect Ratio of Point Elements

You can rotate point elements, assign a different predefined shape, or change the aspect ratio(the ratio of width to height).

How to Modify Point Elements

E Select the point elements. You cannot rotate or change the shape and aspect ratio of individualpoint elements.

E Use the symbol toolbar to modify the points.

Figure 13-11
Symbol toolbar

The first button allows you to change the shape of the points. Click the drop-down arrow andselect a predefined shape.

The second button allows you to rotate the points to a specific compass position. Click the drop-down arrow and then drag the needle to the desired position.
The third button allows you to change the aspect ratio. Click the drop-down arrow and then click and drag the rectangle that appears. The shape of the rectangle represents the aspect ratio.

Changing the Size of Graphic Elements

You can change the size of the graphic elements in the graph. These include bars, lines, andpoints among others. If the graphic element is sized by a variable or field, the specified sizeis the minimum size.

How to Change the Size of the Graphic Elements

E Select the graphic elements you want to resize.

E Use the slider or enter a specific size for the option available on the symbol toolbar. The unit is pixels unless you indicate a different unit (see below for a full list of unit abbreviations). You can also specify a percentage (such as 30%), which means that a graphic element uses the specified percentage of the available space. The available space depends on the graphic element type and the specific graph.

Table 13-2
Valid unit abbreviations

Abbreviation   Unit
cm             centimeter
in             inch
mm             millimeter
pc             pica
pt             point
px             pixel

Figure 13-12
Size control on symbol toolbar

Specifying Margins and Padding

If there is too much or too little spacing around or inside a frame in the graph, you can changeits margin and padding settings. The margin is the amount of space between the frame andother items around it. The padding is the amount of space between the border of the frame andthe contents of the frame.

How to Specify Margins and Padding

E Select the frame for which you want to specify margins and padding. This can be a text frame,the frame around the legend, or even the data frame displaying the graphic elements (suchas bars and points).

E Use the Margins tab on the properties palette to specify the settings. All sizes are in pixels unlessyou indicate a different unit (such as cm or in).

Figure 13-13
Margins tab

Changing the Position of the Legend

If the graph includes a legend, the legend is typically displayed to the right of the graph. Youcan change this position if needed.

How to Change the Legend Position

E Select the legend.

E Click Legend on the properties palette.

Figure 13-14
Legend tab

E Select a position.

Keyboard Shortcuts

Table 13-3
Keyboard shortcuts

Shortcut Key   Function
Ctrl+Space     Toggle between Explore and Edit mode
Delete         Delete a graph item
Ctrl+Z         Undo
Ctrl+Y         Redo
F2             Display outline for selecting items in the graph

Chapter 14
Session Resource Editor

Text Mining for Clementine rapidly and accurately captures and extracts key concepts from text data. This extraction process relies heavily on linguistic resources to dictate how to extract information from text data. By default, these resources come from resource templates.

Text Mining for Clementine is shipped with a set of specialized resource templates that contain a set of linguistic and nonlinguistic resources, in the form of libraries and advanced resources, to help define how your data will be handled and extracted. For a list of resource templates shipped with this product, see “Available Resource Templates” on p. 216.

In the node dialog box, you can load a copy of the template's resources into the node. Once inside an interactive workbench session, you can customize these resources specifically for this node's data, if you wish. During an interactive workbench session, you can work with your resources in the Resource Editor view. Whenever an interactive session is launched, an extraction is performed using the resources loaded in the node dialog box, unless you have cached your data and extraction results in your node.

Editing Resources in the Resource Editor

The Resource Editor offers access to the set of resources used to produce the extraction results (concepts, types, and patterns) for an interactive workbench session. This editor is very similar to the Template Editor except that in the Resource Editor you are editing the resources for this session. When you are finished working on your resources and any other work you've done, you can update the modeling node to save this work so that it can be restored in a subsequent interactive workbench session. For more information, see “Updating Modeling Nodes and Saving” in Chapter 8 on p. 134.

If you want to work directly on the templates used to load resources into nodes, we recommend you use the Template Editor. Many of the tasks you can perform inside the Resource Editor are performed just like they are in the Template Editor, such as:

Working with libraries. For more information, see “Working with Libraries” in Chapter 16 on p. 229.
Creating type dictionaries. For more information, see “Creating Types” in Chapter 17 on p. 245.
Adding terms to dictionaries. For more information, see “Adding Terms” in Chapter 17 on p. 247.
Creating synonyms. For more information, see “Adding Synonyms” in Chapter 17 on p. 254.
Importing and exporting templates. For more information, see “Importing and Exporting Templates” in Chapter 15 on p. 222.
Publishing libraries. For more information, see “Publishing Libraries” in Chapter 16 on p. 239.

Figure 14-1
Resource Editor view

Making and Updating Templates

Whenever you make changes to your resources and want to reuse them in the future, you can savethe resources as a template. When doing so, you can choose to save using an existing templatename or by providing a new name. Then, whenever you load this template in the future, you’ll beable to obtain the same resources. For more information, see “Loading from Resource Templates”in Chapter 3 on p. 37.

Note: You can also publish and share your libraries. For more information, see “SharingLibraries” in Chapter 16 on p. 238.

Figure 14-2
Make Template dialog box

To Make (or Update) a Template

E From the menus in the Resource Editor view, choose File > Resource Templates > Make Template.The Make Template dialog box opens.

E Enter a new name in the Template Name field, if you want to make a new template. Select atemplate in the table, if you want to overwrite an existing template with the currently loadedresources.

E Click Save to make the template.

Important! Since templates are loaded when you select them in the node and not when the streamis executed, please make sure to reload the resource template in any other nodes in which it is usedif you want to get the latest changes. For more information, see “Updating Node Resources AfterLoading” in Chapter 15 on p. 220.

Switching Resources

If you want to replace the resources currently loaded in the session with a copy of those from another template, you can switch to those resources. Doing so will overwrite any resources currently loaded in the session. If you are switching resources in order to have some predefined Text Link Analysis (TLA) pattern rules, make sure to select a template that has them marked in the TLA column.

Switching resources is particularly useful when you want to restore the session work (categories, patterns, and resources) but want to load an updated copy of the resources from a template without losing your other session work. You can select the template whose contents you want to copy into the Resource Editor and click Select. This replaces the resources you have in this session. Make sure you update the modeling node at the end of your session if you want to keep these changes the next time you launch the interactive workbench session.

Note: If you switch to the contents of another template during an interactive session, the name ofthe template listed in the node will still be the name of the last template loaded and copied. Inorder to benefit from these resources or other session work, update your modeling node beforeexiting the session and select the Use session work option in the node. For more information, see“Updating Modeling Nodes and Saving” in Chapter 8 on p. 134.

Figure 14-3
Switch Resources dialog box

To Switch Resources

E From the menus in the Resource Editor view, choose File > Resource Templates > Switch Resources.The Switch Resources dialog box opens.

E Select the template you want to use from those shown in the table.

E Click Select to abandon those resources currently loaded and load a copy of those in the selectedtemplate in their place. If you have made changes to your resources and want to save your librariesfor a future use, you can publish, update, and share them before switching. For more information,see “Sharing Libraries” in Chapter 16 on p. 238.

Part III: Templates and Resources


Chapter 15
Templates and Resources

Text Mining for Clementine rapidly and accurately captures and extracts key concepts from text data. This extraction process relies heavily on linguistic resources to dictate how to extract information from text data. By default, the resources come from resource templates.

Text Mining for Clementine is shipped with a set of specialized resource templates that are made up of a set of libraries, compiled resources, and some advanced resources. Libraries are made up of dictionaries used to define and manage types, terms, synonyms, and exclude lists. For more information, see “Working with Libraries” in Chapter 16 on p. 229.

These shipped templates allow you to benefit from years of research and fine-tuning for specific languages or for specific applications, such as opinions/surveys, genomics, and security intelligence. During extraction, Text Mining for Clementine also refers to some internal, compiled resources, which contain a large number of definitions complementing the types in the Core library. These compiled resources cannot be edited. For more information, see “Available Resource Templates” on p. 216.

Since the shipped templates may not always be perfectly adapted to the context of your data, you can edit these templates or even create and use custom libraries uniquely fine-tuned to your organization's data.

Template Editor vs. Resource Editor

There are two main methods for working with and editing your templates, libraries, and theirresources. One is using the Template Editor, which allows you to create and edit templates and theresources they contain independent of a specific node or stream. The other method is using theResource Editor, accessible within an interactive workbench session, which allows you to workwith the resources in the context of a specific node and dataset.

Template Editor

The Template Editor can be used to create and edit templates as well as libraries directly, withoutan interactive workbench session. You can use this editor to create or edit templates before loadingthem into the Text Link Analysis node and the Text Mining modeling node.

The Template Editor is accessible through the main Clementine toolbar or the Tools > Text Mining Template Editor menu.

Resource Editor

When you add a Text Mining modeling node to a stream, you can load a copy of a resource template's content to control how text is extracted for text mining. When you launch an interactive workbench session, in addition to creating categories, extracting text link analysis patterns, and creating category models, you can also fine-tune the resources for that session's data in the integrated Resource Editor view. For more information, see “Editing Resources in the Resource Editor” in Chapter 14 on p. 209.

Whenever you work on the resources in an interactive workbench session, that work applies only to that session. If you want to save your work (resources, categories, patterns, etc.) so you can continue in a subsequent session, you must update the modeling node. For more information, see “Updating Modeling Nodes and Saving” in Chapter 8 on p. 134.

If you want to save your changes back to the original template, whose contents were copied into the modeling node, so that this updated template can be loaded into other nodes, you can make a template from the resources. For more information, see “Making and Updating Templates” in Chapter 14 on p. 210.

Available Resource Templates

Basic Resources
Description: This template extracts concepts and types without a specific domain in mind. You can also use this template to customize your own template. It contains the following libraries: Core Library and Variations Library.
Languages: Dutch, English, German, French, Italian, Portuguese, Spanish
TLA: No

Opinions
Description: This template includes thousands of words representing attitudes, qualifiers, and preferences that, when used in conjunction with other terms, indicate an opinion about a subject. It can be very useful in extracting TLA patterns from survey or scratch-pad data. It contains the following libraries: Core Library, Opinions Library, Budget Library, and Variations Library.
Languages: Dutch, English, German, French, Spanish
TLA: Yes

Genomics
Description: This template contains the advanced resources, libraries, types, terms, synonyms, and TLA pattern rules useful in extracting relationships between genes and/or proteins. It contains the Genomics Library.
Languages: English only
TLA: Yes

Gene Ontology
Description: This template is fine-tuned to extract concepts and types that are specific to gene ontology. It contains the Gene Ontology Library.
Languages: English only
TLA: No

Security Intelligence
Description: This template is fine-tuned to extract relationships and complex events that describe the activities of individuals or organizations in the context of national security and policing. It contains the Security Intelligence Library.
Languages: English, Spanish
TLA: Yes

CRM
Description: This template is fine-tuned to extract concepts and types that are specific to the customer relationship management field. It contains the CRM Library.
Languages: English, Portuguese
TLA: No

Competitive Intelligence
Description: This template is fine-tuned to extract relationships and complex events that describe the activities of individuals or organizations in the business world. It contains the Competitive Intelligence Library.
Languages: English only
TLA: Yes

MeSH
Description: This template is fine-tuned to extract concepts and types that are specific to Medical Subject Headings. It contains the MeSH Library.
Languages: English only
TLA: No

IT
Description: This template is fine-tuned to extract concepts and types that are specific to Information Technology. It contains the IT Library.
Languages: English only
TLA: No

Important! The libraries that are installed along with these resource templates are identical incontent to the libraries inside a template. However, these templates also have some advancedresources that offer even more fine-tuning to a given context.

The Editor Interface

The operations that you perform in the Template or Resource editors revolve around themanagement and fine-tuning of the linguistic resources. These resources are stored in the form oftemplates and libraries. For more information, see “Type Dictionaries” in Chapter 17 on p. 243.

Figure 15-1
Text Mining Template Editor

The interface is organized into four parts, as follows:

Library Tree pane. Located in the upper left corner, this area presents a tree of the open libraries. You can enable and disable libraries in this tree as well as filter the views in the other panes by selecting a library in the tree. You can perform many operations in this tree using the context menus. If you expand a library in the tree, you can see the set of types it contains.

Type Dictionary pane. Located to the right of the library tree, this pane displays the contents of the type dictionaries for the libraries selected in the library tree. A type dictionary is a collection of words to be grouped under one label, or type, name. When the extractor engine reads your text data, it compares words found in the text to the terms in the type dictionaries. If an extracted concept appears as a term in a type dictionary, then that type name is assigned. You can think of the type dictionary as a distinct dictionary of terms that have something in common. For example, the Locations type in the Core library contains concepts such as new orleans, great britain, paris, and new york. These terms all represent geographical locations. A library can contain one or more type dictionaries. For more information, see “Type Dictionaries” in Chapter 17 on p. 243.

Substitution Dictionary pane. Located in the lower left, this pane displays the contents of defined substitutions. A substitution dictionary is a collection of terms defined as synonyms or as optional elements used to group similar terms under one lead, or target, concept in the final extraction results. This dictionary can contain known synonyms and user-defined synonyms and elements, as well as common misspellings paired with the correct spelling. Since this pane manages both synonyms and optional elements, this information is organized into two tabs. The substitutions for all of the libraries in the tree are shown together in this pane. A library can contain only one substitution dictionary. For more information, see “Substitution Dictionaries” in Chapter 17 on p. 253.

Exclude Dictionary pane. Located on the right side, this pane displays the contents of the exclude dictionary. An exclude dictionary is a collection of terms and types that will be removed from the final extraction results. Therefore, the terms and types in the exclude dictionary do not appear in the Extracted Results pane. The excludes for all of the libraries in the tree are shown together in this pane. A library can contain only one exclude dictionary. For more information, see “Exclude Dictionaries” in Chapter 17 on p. 258.

Note: If you want to filter so that you see only the information pertaining to a single library, youcan change the library view using the drop-down list on the toolbar. It contains a top-level entrycalled All Libraries as well as an additional entry for each individual library. For more information,see “Viewing Libraries” in Chapter 16 on p. 234.

Opening Templates

When you launch the Template Editor, you are prompted to open a template. Likewise, you can open a template from the File menu. If you want to open a template with some predefined Text Link Analysis (TLA) pattern rules, make sure to select a template that has them. The presence of TLA rules is indicated in the TLA column. The language for which a template was created is shown in the Language column.

If you want to import a template that isn't shown in the table or if you want to export a template, you can use the buttons in the Open Template dialog box. For more information, see “Importing and Exporting Templates” on p. 222.

Figure 15-2
Open Template dialog box

To open a template

E From the menus in the Template Editor, choose File > Open Templates. The Open Template dialogbox opens.

E Select the template you want to use from those shown in the table.

E Click OK to open this template. If you currently have another template open in the editor, clicking OK will abandon that template and display the template you selected here. If you have made changes to your resources and want to save your libraries for future use, you can publish, update, and share them before opening another template. For more information, see “Sharing Libraries” in Chapter 16 on p. 238.

Saving Templates

In the Template Editor, you can save the changes you made to a template. When doing so, you can choose to save using an existing template name or by providing a new name. If you make changes to a template that you've already loaded into a node at a previous time, you will have to reload the template contents into the node to get the latest changes. For more information, see “Loading from Resource Templates” in Chapter 3 on p. 37. Or, if you are using the option Use saved interactive work, meaning you are using resources from a previous interactive workbench session, you'll need to switch to this template's resources from within the interactive workbench session. For more information, see “Switching Resources” in Chapter 14 on p. 211.

Note: You can also publish and share your libraries. For more information, see “SharingLibraries” in Chapter 16 on p. 238.

Figure 15-3
Save Template dialog box

To Save a Template

E From the menus in the Template Editor, choose File > Save Templates. The Save Template dialogbox opens.

E Enter a new name in the Template name field, if you want to save this template as a new template.Select a template in the table, if you want to overwrite an existing template with the currentlyloaded resources.

E Enter a description to display a comment or annotation in the table.

E Click Save to save the template.

Important! Since templates are loaded when you select them in the node and not when the streamis executed, please make sure to reload the resource template in any other nodes in which it isused if you want to get the latest changes. For more information, see “Updating Node ResourcesAfter Loading” on p. 220.

Updating Node Resources After Loading

Whenever you load a template into a node, the contents of the template are copied at that verymoment and embedded into the node. The template is not linked to the node directly. Hence, ifyou make changes to a template that you previously loaded in a node and you want to benefitfrom those updates, you would have to update the resources in that node. The resources can beupdated in one of two ways.

Method 1: Reloading Resources in Model Tab

If you want to update the resources in the node using a new or updated template, you can reload it in the Model tab of the node. By reloading, you will replace the copy of the resources in the node with a more current copy. For your convenience, the updated time and date will appear on the Model tab along with the originating template's name. For more information, see “Loading from Resource Templates” in Chapter 3 on p. 37.

However, if you are working with interactive session data in a Text Mining modeling node and you have selected the Use session work option on the Model tab, the saved session work and resources will be used and the Load button is disabled. It is disabled because, at one time during an interactive workbench session, you used the Update Modeling Node option and kept the categories, resources, and other session work. In that case, if you want to change or update those resources, you can try the next method of switching the resources in the Resource Editor.

Method 2: Switching Resources in the Resource Editor

Anytime you want to use different resources during an interactive session, you can exchange those resources using the Switch Resources dialog box. This is especially useful when you want to reuse existing category work but replace the resources. In this case, you can select the Use session work option on the Model tab of a Text Mining modeling node. Doing so will disable the ability to reload a template through the node dialog box. Then you can launch the interactive workbench session by executing the stream and switch the resources in the Resource Editor. For more information, see “Switching Resources” in Chapter 14 on p. 211.

In order to keep session work for subsequent sessions, including the resources, you need to update the modeling node from within the interactive workbench session so that the resources (and other data) are saved back to the node. For more information, see “Updating Modeling Nodes and Saving” in Chapter 8 on p. 134.

Note: If you switch to the contents of another template during an interactive session, the nameof the template listed in the node will still be the name of the last template loaded and copied.In order to benefit from these resources or other session work, update your modeling nodebefore exiting the session.

Managing Templates

There are also some basic management tasks you might want to perform from time to time on your templates, such as renaming your templates, importing and exporting templates, or deleting obsolete templates. These tasks are performed in the Manage Templates dialog box. Importing and exporting templates enables you to share your templates with other users. For more information, see “Importing and Exporting Templates” on p. 222.

Note: You cannot rename or delete the templates that are shipped with this product. If you try todelete a shipped template, it will be reset to the version you installed.

Figure 15-4
Manage Templates dialog box

To Rename a Template

E From the menus, choose File > Manage Templates. The Manage Templates dialog box opens.

E Select the template you want to rename and click Rename. The name box becomes an editablefield in the table.

E Type a new name and press the Enter key. A confirmation dialog box opens.

E If you are satisfied with the name change, click Yes. If not, click No.

To Delete a Template

E From the menus, choose File > Manage Templates. The Manage Templates dialog box opens.

E In the Manage Templates dialog box, select the template you want to delete.

E Click Delete. A confirmation dialog box opens.

E Click Yes to delete or click No to cancel the request. If you click Yes, the template is deleted.

Importing and Exporting Templates

You can share templates with other users or machines by importing and exporting them. Templates are stored in an internal database but can be exported as *.lrt files to your hard drive. Since there are circumstances under which you might want to import or export templates, there are several dialog boxes that offer those capabilities.

Open Template dialog box in the Template Editor.
Load Resources dialog box in the Text Mining modeling node and Text Link Analysis node.
Manage Templates dialog box in the Template Editor and the Resource Editor.

To Import a Template

E In the dialog box, click Import. The Import Template dialog box opens.

Figure 15-5
Import Template dialog box

E Select the resource template file (*.lrt) to import and click Import. You can save the template youare importing with another name or overwrite the existing one. The dialog box closes, and thetemplate now appears in the table.

To Export a Template

E In the dialog box, select the template you want to export and click Export. The Select Directory dialog box opens.

Figure 15-6
Select Directory dialog box

E Select the directory to which you want to export and click Export. This dialog box closes, and thetemplate is exported and carries the file extension (*.lrt)

Exiting the Template Editor

When you are finished working in the Template Editor, you can save your work and exit the editor.

To Exit the Template Editor

E From the menus, choose File > Close. The Save and Close dialog box opens.

Figure 15-7
Save and Close dialog box

E Select Save changes to template in order to save the open template before closing the editor.

E Select Publish libraries in order to publish any of the libraries in the open template before closingthe editor. If you select this option, you will be prompted to select the libraries to publish. Formore information, see “Publishing Libraries” in Chapter 16 on p. 239.

Backing Up Resources

You may need to back up your Text Mining for Clementine resources from time to time asa security measure.

Important! When you restore, the entire contents of your resources will be wiped clean and only the contents of the backup file will be accessible in the product. This includes any open work.

To Back Up the Resources

E From the menus, choose File > Backup Tools > Backup Resources. The Backup dialog box opens.

Figure 15-8
Backup Resources dialog box

E Enter a name for your backup file and click Save. The dialog box closes, and the backup fileis created.

To Restore the Resources

E From the menus, choose File > Backup Tools > Restore Resources. The Restore Resources dialogbox opens.

Figure 15-9
Restore Resources dialog box

E Select the backup file you want to restore and click Open. The dialog box closes, and resourcesare restored in Text Mining for Clementine.

Important! When you restore, the entire contents of your resources will be wiped clean and only the contents of the backup file will be accessible in the product. This includes any open work.

Importing Resource Files

If you have made changes directly in resource files outside of Text Mining for Clementine or ifyou have resource files from previous SPSS text mining products, you can import them into aselected library. When you import a directory, you can import all of the contents into a specificopen library as well. You can only import files with the following file extensions:

.txt .add .kw .ini .pos .sup

To Import a Single Resource File

E From the menus, choose File > Resource Templates > Import File. The Import File dialog box opens.

Figure 15-10
Import File dialog box

E Select the file you want to import and click Import. The file contents are transformed into aninternal format and added to your library.

To Import All of the Files in a Directory

E From the menus, choose File > Resource Templates > Import Directory. The Import Directorydialog box opens.

Figure 15-11
Import Directory dialog box

E Select the library in which you want all of the resource files imported from the Import list. If youselect the Default option, a new library will be created using the name of the directory as its name.

E Select the directory from which to import the files. Subdirectories will not be read.

E Click Import. The dialog box closes and the content from those imported resource files nowappears in the editor in the form of dictionaries and advanced resource files.


Chapter 16
Working with Libraries

The resources used by the extraction engine to extract and group terms from your text data always contain one or more libraries. You can see the set of libraries in the library tree located in the upper left part of the view window. The libraries are composed of three kinds of dictionaries:

Type dictionary. A collection of words grouped under one label, or type name. When the extractor engine reads your text data, it compares the words found in the text to the terms defined in your type dictionaries. Extracted words (concepts) are assigned to the type dictionary in which they appear as terms. You can manage your type dictionaries in the upper left and center panes of the editor (the library tree and the term pane). For more information, see “Type Dictionaries” in Chapter 17 on p. 243.
Substitution dictionary. A collection of words defined as synonyms or as optional elements used to group similar terms under one target term, called a concept, in the final extracted results. You can manage your substitution dictionaries in the lower left pane of the editor using the Synonyms tab and the Optional tab. For more information, see “Substitution Dictionaries” in Chapter 17 on p. 253.
Exclude dictionary. A collection of terms and types that will be removed from the final extracted results. You can manage your exclude dictionaries in the rightmost pane of the editor. For more information, see “Exclude Dictionaries” in Chapter 17 on p. 258.

The resource template you chose includes several libraries to enable you to immediately begin extracting concepts from your text data. However, you can create your own libraries as well. Any custom libraries that exist can be published and reused. For more information, see “Publishing Libraries” on p. 239.

For example, suppose that you frequently work with text data related to the automotive industry. After analyzing your data, you decide that you would like to create some customized resources to handle industry-specific vocabulary or jargon. Using the Template Editor, you can create a library to extract and group automotive terms. Since you will need the information in this library again, you publish your library to a central repository, accessible in the Manage Libraries dialog box, so that it can be reused independently in different stream sessions.

Suppose that you are also interested in grouping terms that are specific to different subindustries, such as electronic devices, engines, cooling systems, or even a particular manufacturer or market. You can create a library for each group and then publish the libraries so that they can be used with multiple sets of text data. In this way, you can add the libraries that best correspond to the context of your text data.

Note: Although not part of any given library, additional resources can be configured and managed. These are called advanced resources. They control or manage category antilinks, nonlinguistic entities, fuzzy grouping exceptions, language identifier settings, etc. For more information, see “About Advanced Resources” in Chapter 18 on p. 261.


Shipped Libraries

By default, several libraries are installed with Text Mining for Clementine. You can use these preformatted libraries to access thousands of predefined terms and synonyms as well as many different types. These shipped libraries are fine-tuned to several different domains and are available in several different languages.

Local library. Used to store user-defined dictionaries. It is an empty library added by default to all resources, and it contains an empty type dictionary. Any changes or refinements that you make to the resources directly (such as adding a word to a type) from the other interactive workbench views are automatically stored in the first library listed in the library tree in the Resource Editor; by default, this is the Local Library. You cannot publish this library because it is specific to the project/session data. If you want to publish its contents, you must rename the library first.

Core library. Available in all languages. Used in most cases, since it comprises the basic five built-in types representing people, locations, organizations, products, and unknown. While you may see only a few terms listed in one of its type dictionaries, the types represented in the Core library are actually complements to the robust types found in the compiled resources delivered with your text-mining product. These compiled resources contain thousands of terms for each type. For this reason, you may not see a term that was typed with one of the Core types listed in that type dictionary here. This explains how names such as George can be extracted and typed as Person when only John appears in the Person type dictionary in the Core library. Similarly, if you do not include the Core library, you may still see these types in your extraction results, since the compiled resources containing these types will still be used by the extractor.

Opinions library. Available in English only. Used most commonly to extract opinion patterns from survey or scratch-pad data. This library includes thousands of words representing attitudes, qualifiers, and preferences that, when used in conjunction with other terms, indicate an opinion about a subject. This library includes seven built-in types, as well as a large number of synonyms and excludes. It also includes a large set of pattern rules used for text link analysis. Keep in mind that you must specify this library in the Library Patterns tab of the Edit Advanced Resources dialog box in order to benefit from the text link analysis rules it contains. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275.

Budget library. Available in English only. Used to extract terms referring to the cost of something. This library includes many words and phrases that represent adjectives, qualifiers, and judgments regarding the price or quality of something.

Variations library. Available in all languages. Used to include cases where certain language variations require synonym definitions to properly group them. This library includes only synonym definitions.

Genomics library. Available in English only. Used most commonly to extract relationships between genes and/or proteins. This library includes several types, many synonyms that identify genes and proteins, and predicates relevant to protein/protein interaction. It also includes a large set of pattern rules used for text link analysis. Keep in mind that you must specify this library in the Library Patterns tab of the Edit Advanced Resources dialog box in order to benefit from the text link analysis rules it contains. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275.


Gene Ontology library. Available in English only. Used most commonly to extract words representing gene products. This library includes one type and many synonyms.

Security Intelligence library. Available in English and Spanish. Used most commonly to extract relationships and complex events that describe the activities of individuals or organizations in the context of national security and policing. It also includes a large set of pattern rules used for text link analysis. Keep in mind that you must specify this library in the Library Patterns tab of the Edit Advanced Resources dialog box in order to benefit from the text link analysis rules it contains. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275.

CRM library. Available in English and Portuguese. Used to extract words and phrases often found in the CRM industry.

Competitive Intelligence library. Available in English only. Used most commonly to extract relationships and complex events that describe the activities of individuals or organizations in the context of business or research. It also includes TLA pattern rules. Keep in mind that you must specify this library in the Library Patterns tab of the Edit Advanced Resources dialog box in order to benefit from the text link analysis rules it contains. For more information, see “Text Link Analysis Rules” in Chapter 18 on p. 275.

MeSH library. Available in English only. Used to extract words and phrases considered as Medical Subject Headings as described by the National Library of Medicine.

IT library. Available in English only. Used to extract words and phrases often found in the IT industry.

Although some of the libraries shipped outside the templates resemble the contents in some templates, the templates have been specifically tuned to particular applications and contain additional advanced resources. If you are working with opinion/survey data, genomics data, or security intelligence data, we recommend that you use the corresponding templates and make any changes there rather than just adding individual libraries to a more generic template.

Compiled resources are also delivered with all SPSS text-mining products. They are always used during the extraction process and contain a large number of complementary definitions to the built-in type dictionaries in the default libraries. Since these resources are compiled, they cannot be viewed or edited. You can, however, force a term that was typed by these compiled resources into any other dictionary. For more information, see “Forcing Terms” in Chapter 17 on p. 250.

Creating Libraries

You can create any number of libraries. After creating a new library, you can begin to create dictionaries in this library and enter terms, synonyms, and excludes.

To Create a Library

E From the menus, choose File > Libraries > New Library. The Library Properties dialog box opens.


Figure 16-1
Library Properties dialog box

E Enter a name for the library in the Name text box.

E If desired, enter a comment in the Annotation text box.

E Click Publish if you want to publish this library now before entering anything in the library. For more information, see “Sharing Libraries” on p. 238. You can also publish later at any time.

E Click OK to create the library. The dialog box closes and the library appears in the tree view. If you expand the libraries in the tree, you will see that an empty type dictionary has been automatically included in the library. In it, you can immediately begin adding terms. For more information, see “Adding Terms” in Chapter 17 on p. 247.

Adding Public Libraries

If you want to reuse a library from another project or session, you can add it to your current resources as long as it is a public library. A public library is a library that has been published. For more information, see “Publishing Libraries” on p. 239.

When you add a public library, a local copy is embedded into your project/session data. For this reason, you can only add a library once. You can make changes to this library; however, you must republish the public version of the library if you want to share the changes.

When adding a public library, a Resolve Conflicts dialog box may appear if any conflicts are discovered between the terms and types in one library and the other local libraries. You must resolve these conflicts or accept the proposed resolutions in order to complete this operation. For more information, see “Resolving Conflicts” on p. 240.

Note: If you always update your libraries when you launch an interactive workbench session or publish when you close one, you are less likely to have libraries that are out of sync. For more information, see “Sharing Libraries” on p. 238.

To Add a Library

E From the menus, choose File > Libraries > Add Library. The Add Library dialog box opens.


Figure 16-2
Add Library dialog box

E Select the library or libraries in the list.

E Click Add. If any conflicts occur between the newly added libraries and any libraries that were already there, you will be asked to verify the conflict resolutions or change them before completing the operation. For more information, see “Resolving Conflicts” on p. 240.

Finding Terms and Types

You can search for terms and types in the various panes in the editor using the Find feature. In the editor, you can choose Edit > Find from the menus and the Find toolbar appears. You can use this toolbar to find one occurrence at a time. By clicking Find again, you can find subsequent occurrences of your search term.

When searching, the editor searches only the library or libraries listed in the drop-down list on the Find toolbar. If All Libraries is selected, the program will search everything in the editor.

Figure 16-3
Find toolbar

When you start a search, it begins in the area that has the focus. The search continues through each section, looping back around until it returns to the active cell. You can reverse the order of the search using the directional arrows.

Table 16-1
Find toolbar icon descriptions

Toggle indicating if the search is case sensitive. When clicked (highlighted), the search is case sensitive. For example, if you enable this option and enter the word Vegetable, the case-sensitive search would find Vegetable but not vegetable.

Toggle indicating if the search term represents the entire term or if it is a partial search. When this option is not enabled, the search is for exact matches only. When enabled, the search extends to partial string matches as well. For example, if you enable this option and enter the word veg, the search would find Vegetable, vegetable, veggies, and vegetarian.

Toggle indicating the search direction. When clicked, the search goes backward, or up.

Toggle indicating the search direction. When clicked, the search goes forward, or down.
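Conceptually, the case-sensitivity and partial-match toggles combine as in the following Python sketch. This is an assumption made for clarity, not the product's implementation.

def term_matches(candidate, search_text, case_sensitive=False, partial=False):
    """Return True if `candidate` satisfies the current Find settings."""
    if not case_sensitive:
        candidate, search_text = candidate.lower(), search_text.lower()
    # Partial search looks for the string anywhere; otherwise only exact matches count.
    return search_text in candidate if partial else candidate == search_text

# "veg" with partial=True matches "Vegetable", "veggies", and "vegetarian";
# "vegetable" with case_sensitive=True does not match "Vegetable".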


To Find a Type Name in Other Dictionaries in a Library

E From the menus, choose Edit > Find. The Find toolbar appears.

E Enter the string for which you want to search.

E Click the Find button to begin the search. The next occurrence of the term or type is then highlighted.

E Click the button again to move from occurrence to occurrence.

Viewing Libraries

You can display the contents of one particular library or all libraries. This can be helpful when dealing with many libraries or when you want to review the contents of a specific library before publishing it. Changing the view only impacts what you see in this view but does not disable any libraries from being used during extraction. For more information, see “Disabling Local Libraries” on p. 235.

The default view is All Libraries, which shows all libraries in the tree and their contents in other panes. You can change this selection using the drop-down list on the toolbar or through a menu selection (View > Libraries). When a single library is being viewed, all items in other libraries disappear from view but are still read during the extraction.

To Change the Library View

E From the menus, choose View > Libraries. A menu with all of the local libraries opens.

E Select the library that you want to see or select the All Libraries option to see the contents of all libraries. The contents of the view are filtered according to your selection.

Managing Local Libraries

Local libraries are the libraries inside your interactive workbench session or inside a template, as opposed to public, shareable libraries. For more information, see “Managing Public Libraries” on p. 236. There are also some basic local library management tasks that you might want to perform, including:

Renaming a local library. For more information, see “Renaming Local Libraries” on p. 234.

Disabling or enabling a local library. For more information, see “Disabling Local Libraries” on p. 235.

Deleting a local library. For more information, see “Deleting Local Libraries” on p. 235.

Renaming Local Libraries

You can rename local libraries. If you rename a local library, you will disassociate it from the public version, if a public version exists. This means that subsequent changes can no longer be shared with the public version. You can republish this local library under its new name. This also means that you will not be able to update the original public version with any changes that you make to this local version.

Note: You cannot rename a public library.

To Rename a Local Library

E In the tree view, select the library that you want to rename.

E From the menus, choose Edit > Library Properties. The Library Properties dialog box opens.

Figure 16-4
Library Properties dialog box

E Enter a new name for the library in the Name text box.

E Click OK to accept the new name for the library. The dialog box closes and the library name is updated in the tree view.

Disabling Local Libraries

If you want to temporarily exclude a library from the extraction process, you can deselect the check box to the left of the library name in the tree view. This signals that you want to keep the library but want the contents ignored when checking for conflicts and during extraction.

To Disable a Library

E In the tree view, select the library that you want to disable and click the spacebar to disable the library. The check box to the left of the library name is cleared.

Deleting Local Libraries

You can remove a library without deleting the public version of the library and vice versa. Deleting a local version of a library does not remove that library from other projects/sessions or the public version. For more information, see “Managing Public Libraries” on p. 236.

To Delete a Local Library

E In the tree view, select the library you want to delete.

E From the menus, choose Edit > Delete to delete the library. The library is removed.


E If you have never published this library before, a message asking whether you would like to delete or keep this library opens. Click Delete to continue or Keep if you would like to keep this library.

Note: One library must always remain.

Managing Public Libraries

In order to reuse local libraries, you can publish them and then work with them and see them through the Manage Libraries dialog box (File > Libraries > Manage Libraries). For more information, see “Sharing Libraries” on p. 238. Some basic public library management tasks that you might want to perform include importing, exporting, or deleting a public library. You cannot rename a public library.

Figure 16-5
Manage Libraries dialog box

Importing Public Libraries

E In the Manage Libraries dialog box, click Import.... The Import Library dialog box opens.

Figure 16-6
Import Library dialog box


E Select the library file (*.lib) that you want to import and, if you also want to add this library locally, select Add library to current project.

E Click Import. The dialog box closes. If a public library with the same name already exists, you will be asked to rename the library that you are importing or to overwrite the current public library.

Exporting Public Libraries

You can export public libraries into the .lib format so that you can share them.

E In the Manage Libraries dialog box, select the library that you want to export in the list.

E Click Export.... The Select Directory dialog box opens.

Figure 16-7
Select Directory dialog box

E Select the directory to which you want to export and click Export. The dialog box closes and the library file (*.lib) is exported.

Deleting Public Libraries

You can remove a local library without deleting the public version of the library and vice versa. However, if the library is deleted from this dialog box, it can no longer be added to any session resources until a local version is published again. If you delete a library that was installed with the product, the originally installed version is restored.

E In the Manage Libraries dialog box, select the library that you want to delete. You can sort the list by clicking on the appropriate header.

E Click Delete to delete the library. Text Mining for Clementine verifies whether the local version of the library is the same as the public library. If so, the library is removed with no alert. If the library versions differ, an alert opens to ask you whether you want to keep or remove the public version.


Sharing Libraries

Libraries allow you to work with resources in a way that is easy to share among multiple interactive workbench sessions. Libraries can exist in two states, or versions. Libraries that are editable in the editor and part of an interactive workbench session are called local libraries. While working within an interactive workbench session, you may make a lot of changes in the Vegetables library, for example. If your changes could be useful with other data, you can make these resources available by creating a public library version of the Vegetables library. A public library, as the name implies, is available to any other resources in any interactive workbench session.

You can see the public libraries in the Manage Libraries dialog box. Once this public library version exists, you can add it to the resources in other contexts so that these custom linguistic resources can be shared.

The shipped libraries are initially public libraries. It is possible to edit the resources in these libraries and then create a new public version. Those new versions would then be accessible in other interactive workbench sessions.

As you continue to work with your libraries and make changes, your library versions will become desynchronized. In some cases, a local version might be more recent than the public version, and in other cases, the public version might be more recent than the local version. It is also possible for both the public and local versions to contain changes that the other does not if the public version was updated from within another interactive workbench session. If your library versions become desynchronized, you can synchronize them again. Synchronizing library versions consists of republishing and/or updating local libraries.

Whenever you launch an interactive workbench session or close one, you will be prompted to synchronize any libraries that need updating or republishing. Additionally, you can easily identify the synchronization state of your local library by the icon appearing beside the library name in the tree view or by viewing the Library Properties dialog box. You can also choose to do so at any time through menu selections. The following table describes the five possible states and their associated icons.

Table 16-2
Local library synchronization states

Unpublished. The local library has never been published.

Synchronized. The local and public library versions are identical. This also applies to the Local Library, which cannot be published because it is intended to contain only project-specific resources.

Out of date. The public library version is more recent than the local version. You can update your local version with the changes.

Newer. The local library version is more recent than the public version. You can republish your local version to the public version.

Out of sync. Both the local and public libraries contain changes that the other does not. You must decide whether to update or publish your local library. If you update, you will lose the changes that you made since the last time you updated or published. If you choose to publish, you will overwrite the changes in the public version.

Note: If you always update your libraries when you launch an interactive workbench session or publish when you close one, you are less likely to have libraries that are out of sync.
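Purely as an illustration, the five states in Table 16-2 reduce to the following decision logic. The flags used here (whether the library was ever published, and whether the local or public copy changed since the last synchronization) are assumptions made for this sketch, not the product's actual data model.

def sync_state(ever_published, local_changed, public_changed):
    """Classify a local library against its public counterpart (illustrative only)."""
    if not ever_published:
        return "Unpublished"
    if local_changed and public_changed:
        return "Out of sync"   # decide whether to update or publish
    if public_changed:
        return "Out of date"   # update the local version
    if local_changed:
        return "Newer"         # republish the local version
    return "Synchronized"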


You can republish a library any time you think that the changes in the library would benefit other streams that may also contain this library. Then, if your changes would benefit other streams, you can update the local versions in those streams. In this way, you can create streams for each context or domain that applies to your data by creating new libraries and/or adding any number of public libraries to your resources.

If a public version of a library is shared, there is a greater chance that differences between local and public versions will arise. Whenever you launch or close and publish from an interactive workbench session, or open or close a template from the Template Editor, a message appears to enable you to publish and/or update any libraries whose versions are not in sync with those in the Manage Libraries dialog box. If the public library version is more recent than the local version, a dialog box asking whether you would like to update opens. You can choose whether to keep the local version as is instead of updating with the public version, or merge the updates into the local library.

Publishing Libraries

If you have never published a particular library, publishing entails creating a public copy of your local library in the database. If you are republishing a library, the contents of the local library will replace the existing public version’s contents. After republishing, you can update this library in any other stream sessions so that their local versions are in sync with the public version. Even though you can publish a library, a local version is always stored in the session.

Important! If you make changes to your local library and, in the meantime, the public version of the library was also changed, your library is considered to be out of sync. We recommend that you begin by updating the local version with the public changes, make any changes that you want, and then publish your local version again to make both versions identical. If you make changes and publish first, you will overwrite any changes in the public version.

To Publish Local Libraries to the Database

E From the menus, choose File > Libraries > Publish Libraries. The Publish Libraries dialog box opens, with all libraries in need of publishing selected by default.

Figure 16-8
Publish Libraries dialog box

E Select the check box to the left of each library that you want to publish or republish.

E Click Publish to publish the libraries to the Manage Libraries database.


Updating Libraries

Whenever you launch or close an interactive workbench session, you can update or publish any libraries that are no longer in sync with the public versions. If the public library version is more recent than the local version, a dialog box asking whether you would like to update the library opens. You can choose whether to keep the local version instead of updating with the public version or to replace the local version with the public one. If a public version of a library is more recent than your local version, you can update the local version to synchronize its content with that of the public version. Updating means incorporating the changes found in the public version into your local version.

Note: If you always update your libraries when you launch an interactive workbench session or publish when you close one, you are less likely to have libraries that are out of sync. For more information, see “Sharing Libraries” on p. 238.

To Update Local Libraries

E From the menus, choose File > Libraries > Update Libraries. The Update Libraries dialog box opens, with all libraries in need of updating selected by default.

Figure 16-9
Update Libraries dialog box

Note: In the image above, the Core and MeSH local libraries have been updated more recently than the public versions; therefore, you probably would not update these libraries. The public version of the Variations library, however, is more recent than your local version; therefore, you may want to update your local version of this library.

E Select the check box to the left of each library that you want to update.

E Click Update to update the local libraries.

Resolving Conflicts

Local versus Public Library Conflicts

Whenever you launch a stream session, Text Mining for Clementine performs a comparison of the local libraries and those listed in the Manage Libraries dialog box. If Text Mining for Clementine detects that any local libraries in your session are not in sync with the published versions, the Library Synchronization Warning dialog box opens. You can choose from the following options to select the library versions that you want to use here:

All libraries local. This option keeps all of your local libraries as they are. You can always republish or update them later.

All published libraries on this machine. This option will replace the shown local libraries with the versions found in the database.

All more recent libraries. This option will replace any older local libraries with the more recent public versions from the database.

Other. This option allows you to manually select the versions that you want by choosing them in the table.

Forced Term Conflicts

Whenever you add a public library or update a local library, conflicts and duplicate entries may be uncovered between the terms and types in this library and the terms and types in the other libraries in your resources. If this occurs, you will be asked to verify the proposed conflict resolutions or change them before completing the operation in the Edit Forced Terms dialog box. For more information, see “Forcing Terms” in Chapter 17 on p. 250.

Figure 16-10
Edit Forced Terms dialog box

The Edit Forced Terms dialog box contains each pair of conflicting terms or types. Alternating background colors are used to visually distinguish each conflict pair. These colors can be changed in the Options dialog box. For more information, see “Options: Colors Tab” in Chapter 8 on p. 132. The Edit Forced Terms dialog box contains two tabs:

Duplicates. This tab contains the duplicated terms found in the libraries. If a pushpin icon appears after a term, it means that this occurrence of the term has been forced. If a black X icon appears, it means that this occurrence of the term will be ignored during extraction because it has been forced elsewhere.

User Defined. This tab contains a list of any terms that have been forced manually in the type dictionary term pane and not through conflicts.

Note: The Edit Forced Terms dialog box opens after you add or update a library. If you cancel out of this dialog box, you will not be canceling the update or addition of the library.


To Resolve Conflicts

E In the Edit Forced Terms dialog box, select the radio button in the Use column for the term that you want to force.

E When you have finished, click OK to apply the forced terms and close the dialog box. If you click Cancel, you will cancel the changes you made in this dialog box.


Chapter 17
About Library Dictionaries

The resources used to extract text data are stored in the form of templates and libraries. Every library is made up of three dictionaries.

Type dictionary. A collection of words grouped under one label, or type name. When the extractor engine reads your text data, it compares the words found in the text to the terms defined in your type dictionaries. Extracted words (concepts) are assigned to the type dictionary in which they appear as terms. You can manage your type dictionaries in the upper left and center panes of the editor (the library tree and the term pane). For more information, see “Type Dictionaries” on p. 243.

Substitution dictionary. A collection of words defined as synonyms or as optional elements used to group similar terms under one target term, called a concept in the final extracted results. You can manage your substitution dictionaries in the lower left pane of the editor using the Synonyms tab and the Optional tab. For more information, see “Substitution Dictionaries” on p. 253.

Exclude dictionary. A collection of terms and types that will be removed from the final extracted results. You can manage your exclude dictionaries in the rightmost pane of the editor. For more information, see “Exclude Dictionaries” on p. 258.

For more information, see “Working with Libraries” in Chapter 16 on p. 229.

Type Dictionaries

A type dictionary is made up of a type name, or label, and a list of terms. Type dictionaries are managed in the upper left and center panes of the editor. If you are in an interactive workbench session, you can access this view with View > Resource Editor in the menus. Otherwise, you can edit dictionaries for a specific template in the Template Editor.

When the extractor engine reads your text data, it compares words found in the text to the terms defined in your type dictionaries. If an extracted term appears in a type dictionary, then that type name is assigned to the term. If you want a term to be assigned to a particular type, you can add it to the corresponding type dictionary.


Figure 17-1
Library tree and term pane

The list of type dictionaries is shown in the library tree pane on the left. The content of each type dictionary appears in the center pane. Type dictionaries consist of more than just a list of terms. The manner in which words and word phrases in your text data are matched to the terms defined in the type dictionaries is determined by the match option defined. A match option specifies how a term is anchored with respect to a candidate word or phrase in the text data. For more information, see “Adding Terms” on p. 247.

Additionally, you can extend the terms in your type dictionary by specifying whether you want to automatically generate and add inflected forms of the terms to the dictionary. By generating the inflected forms, you automatically add plural forms of singular terms and singular forms of plural terms to the type dictionary. This option is particularly useful when your type contains mostly nouns, since it is unlikely you would want inflected forms of verbs or adjectives. For more information, see “Adding Terms” on p. 247.

Note: Terms that are not found in any type dictionary, built-in or editable, but are extracted from the text are automatically typed as <Unknown>.

Built-in Types

Text Mining for Clementine is delivered with a set of linguistic resources in the form of shipped libraries and compiled resources. The shipped libraries contain a set of built-in type dictionaries. The type dictionaries are used by the extractor engine to type the terms it extracts. Although a large number of terms have been defined in the built-in type dictionaries, they do not cover every conceivable term or grouping. Therefore, you can add to them or create your own. For a description of the contents of a particular shipped type dictionary, read the annotation in the Type Properties dialog box. Select the type in the tree, right-click your mouse, and choose Type Properties from the context menu. For more information, see “Shipped Libraries” in Chapter 16 on p. 230.


Note: In addition to the shipped libraries, the compiled resources (also used by the extractor engine) contain a large number of definitions complementary to the built-in type dictionaries, but their content is not visible in the product. You can, however, force a term that was typed by the compiled dictionaries into any other dictionary. For more information, see “Forcing Terms” on p. 250.

Creating Types

You can create type dictionaries to help group similar terms that are extracted. When terms appearing in this dictionary are discovered during the extraction process, they will be assigned to this type name. Whenever you create a library, an empty type dictionary is always included so that you can begin entering terms immediately.

If you are analyzing text about food and want to group terms relating to vegetables, you could create your own Vegetables type dictionary. You could then add terms such as carrot, broccoli, and spinach if you feel that they are important terms that will appear in the text. Then, during extraction, if any of these terms are found and extracted, they will be assigned to the Vegetables type.

You do not have to define every form of a word or expression, because you can choose to generate the inflected forms of terms. By choosing this option, the extractor engine will automatically recognize singular or plural forms of the terms, among other forms, as belonging to this type. This option is particularly useful when your type contains mostly nouns, since it is unlikely you would want inflected forms of verbs or adjectives.
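The heavily simplified Python sketch below suggests what generating inflected forms means for an English noun term. The real extractor uses grammatical morphology; these naive string rules are an assumption used only to illustrate the idea.

def naive_inflections(term):
    """Return the term plus a naive singular/plural counterpart (English only)."""
    forms = {term}
    if term.endswith("ies"):
        forms.add(term[:-3] + "y")      # berries -> berry
    elif term.endswith("s"):
        forms.add(term[:-1])            # carrots -> carrot
    elif term.endswith("y"):
        forms.add(term[:-1] + "ies")    # berry -> berries
    else:
        forms.add(term + "s")           # carrot -> carrots
    return forms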

Note: A project cannot contain more than 56 user-defined types.

Figure 17-2
Type Properties dialog box

Name. The name you give to the type dictionary you are creating.


Default match. The default match attribute instructs the extractor engine how to match this term to text data. For more information, see “Adding Terms” on p. 247. Whenever you add a term to this type dictionary, this is the match attribute automatically assigned to it. You can always change the match choice manually in the term list. Options include: Entire Term, Start, End, Any, Start or End, Entire and Start, Entire and End, and Entire and (Start or End).

Add to. This field indicates the library in which you will create your new type dictionary.

Generate inflected forms by default. This option tells the extractor engine to use grammatical morphology to capture similar forms of the terms that you add to this dictionary during the extraction process, such as singular or plural forms of the term. This option is particularly useful when your type contains mostly nouns, since it is unlikely you would want inflected forms of verbs or adjectives. When you select this option, all new terms added to this type will automatically have this option selected, although you can change it manually in the list.

Font color. This field allows you to distinguish the terms in this type from others in the interface. If you select Use parent color, the default type color is used for this type dictionary, as well. This default color is set in the Options dialog box. For more information, see “Options: Colors Tab” in Chapter 8 on p. 132. If you select Custom, select a color from the drop-down list.

Annotation. This field is used for any comments or descriptions.

To Create a Type Dictionary

E Select the library in which you would like to create a new type dictionary.

E From the menus, choose Tools > New Type. The Type Properties dialog box opens.

E Enter the name of your type dictionary in the Name text box.

E Select the Default match from the drop-down list.

E Select the library name in which you will create your new type dictionary from the Add to drop-down list.

E Select Generate inflected forms by default if you want the extractor engine to use grammatical morphology to capture similar forms of the terms that you add to this dictionary during the extraction process.

E Select a font color option if you want to distinguish the terms in this type from others in the interface.

E Enter a comment or description for the type in the Annotation box.

E Click OK to create the type dictionary. The new type is visible in the library tree pane and appears in the center pane. You can begin adding terms immediately. For more information, see “Adding Terms”.

Note: The previous instructions show you how to make changes within the Resource Editor view or the Template Editor. Keep in mind that you can also do this kind of fine-tuning directly from the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box in the other views. For more information, see “Refining Extraction Results” in Chapter 9 on p. 148.


Adding Terms

The library tree pane displays libraries and can be expanded to show the type dictionaries that they contain. In the center pane, a term list displays the terms in the selected library or type dictionary, according to the selection in the tree. You can add terms to this term pane directly.

Figure 17-3
Library term pane

In the Resource Editor, you can add terms to a type dictionary in two ways: directly in the term pane or through the Add Terms dialog box. The terms that you add can be single words or compound words. You will always find a blank row at the top of the list to allow you to add a new term. Keep in mind that you can also add terms directly from the Extracted Results pane, Data pane, Category Definitions dialog box, and Cluster Definitions dialog box in the other views. For more information, see “Refining Extraction Results” in Chapter 9 on p. 148.

The following columns exist in the term list:

Term. Enter single or compound words into the cell. The color in which the term appears depends on the color for the type in which the term is stored or forced. You can change type colors in the Type Properties dialog box.

Force. By putting a pushpin icon into this cell, you tell the extractor engine to ignore any other occurrences of this same term in other libraries. For more information, see “Forcing Terms” on p. 250.

Match. Select a match option to instruct the extractor engine how to match this term to text data. See the Match Options Descriptions table for more information. When you change a match choice, the drop-down list gives you the following choices: Entire Term, Start, End, Any, Start or End, Entire and Start, Entire and End, or Entire and (Start or End). You can change the default value by editing the type properties. For more information, see “Creating Types” on p. 245. From the menus, choose Edit > Change Match.


Inflect. Select whether the extractor should generate inflected forms of this term during extraction. The default value for this column is defined in the Type Properties, but you can change this option on a case-by-case basis here. From the menus, choose Edit > Change Inflection.

Type. Select a type dictionary from the drop-down list. The list of types is filtered according to your selection in the library tree pane. The first type in the list is always the default type selected in the library tree pane. From the menus, choose Edit > Change Type.

Library. Lists the library in which your term is stored. You can drag and drop a term into another library in the library tree pane to change its library.

Table 17-1
Match option descriptions

Entire term. If the entire term extracted from the text matches the exact term in the dictionary, this type is applied. For the <Person> type, Entire term will also extract entire names using a first name only (for example, entering marilyn will type Marilyn Monroe as <Person>).

Start. If the term found in the dictionary matches the beginning of a term extracted from the text, this type is applied. For example, if you enter apple, apple tart will be matched.

End. If the term found in the dictionary matches the end of a term extracted from the text, this type is applied. For example, if you enter apple, cider apple will be matched.

Any. If the term found in the dictionary matches any part of a term extracted from the text, this type is applied. For example, if you enter apple, the Any option will type apple tart, cider apple, and cider apple tart the same way.

In the following table, we assume that we have created a type dictionary called <Cleaning> and have also added the term soap to the type dictionary. Now whenever we extract from the text, if the term soap is extracted, it will be assigned to the type <Cleaning>. But how a word is typed when it is part of a longer compound word is different. This is where the default and alternate match options you define in the type dictionary properties determine how the word soap will be typed when part of a longer word.

For each extracted term in the following table, you can see whether the term would be typed as <Cleaning> for each possible match option. The first column in the table shows the extracted text. The following columns indicate whether the terms would be typed as <Cleaning>.

Table 17-2
Match examples for type dictionary <Cleaning>

Extracted term: soap
Entire Term: soap typed as <Cleaning>. Start: soap typed as <Cleaning>. End: soap typed as <Cleaning>. Any: soap typed as <Cleaning>.

Extracted term: soap powder
Entire Term: this type not assigned. Start: soap powder typed as <Cleaning>. End: this type not assigned. Any: soap powder typed as <Cleaning>.

Extracted term: dish soap
Entire Term: this type not assigned. Start: this type not assigned. End: dish soap typed as <Cleaning>. Any: dish soap typed as <Cleaning>.
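The behavior summarized in Table 17-2 can be sketched in a few lines of Python. This is an illustration under the assumption that terms are compared word by word; it is not the extractor's actual code.

def match_applies(dictionary_term, extracted_term, option):
    """Return True if the dictionary term matches the extracted term under `option`."""
    words, term_words = extracted_term.split(), dictionary_term.split()
    n = len(term_words)
    if option == "Entire term":
        return words == term_words
    if option == "Start":
        return words[:n] == term_words
    if option == "End":
        return words[-n:] == term_words
    if option == "Any":
        return any(words[i:i + n] == term_words
                   for i in range(len(words) - n + 1))
    raise ValueError(f"Unknown match option: {option}")

# match_applies("soap", "soap powder", "Start") -> True
# match_applies("soap", "soap powder", "End")   -> False
# match_applies("soap", "dish soap", "End")     -> True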


To Add a Single Term to a Type Dictionary

E In the library tree pane, select the type dictionary to which you want to add the term.

E In the term list in the center pane, type your term in the next available empty cell.

E If desired, change the match option for this term by clicking the match cell and selecting an option from the list. For more information, see “Creating Types” on p. 245.

E If desired, change the type dictionary in which this term is stored by clicking the type cell and selecting a name from the list.

To Add Multiple Terms to a Type Dictionary

E In the library tree pane, select the type dictionary to which you want to add terms.

E From the menus, choose Tools > New Terms. The Add Terms dialog box opens.

Figure 17-4
Add Terms dialog box

E Enter the terms you want to add to the selected type dictionary by typing the terms or copying and pasting a set of terms. If you enter multiple terms, you must separate them using the global delimiter, as defined in the Options dialog box, or add each term on a new line. For more information, see “Setting Options” in Chapter 8 on p. 131.

E Click OK to add the terms to the dictionary. The match option is automatically set to the default option for this type library. The dialog box closes and the new terms appear in the dictionary.

Note: The previous instructions show you how to make changes within the Resource Editor view or the Template Editor. Keep in mind that you can also do this kind of fine-tuning directly from the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box in the other views. For more information, see “Refining Extraction Results” in Chapter 9 on p. 148.


Forcing Terms

If you want a term to be assigned to a particular type, you can add it to the corresponding type dictionary. However, if there are multiple terms with the same name, the extractor engine must know which type should be used. Therefore, you will be prompted to select which type should be used. This is called forcing a term into a type.

Forcing will not remove the other occurrences of this term; rather, they will be ignored by the extractor engine. You can later change which occurrence should be used by forcing or unforcing a term. You may also need to force a term into a type dictionary when you add a public library or update a public library.

Figure 17-5
Force status icons

You can see which terms are forced or ignored in the Force column, the second column in the term pane. If a pushpin icon appears, this means that this occurrence of the term has been forced. If a black X icon appears, this means that this occurrence of the term will be ignored during extraction because it has been forced elsewhere. Additionally, when you force a term, it will appear in the color for the type in which it was forced. This means that if you forced a term that is in both Type 1 and Type 2 into Type 1, any time you see this term in the window, it will appear in the font color defined for Type 1.

You can double-click the icon in order to change the status. If the term appears elsewhere, a Resolve Conflicts dialog box opens to allow you to select which occurrence should be used.
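As a purely conceptual Python sketch, forcing amounts to choosing one occurrence of a duplicated term and ignoring the rest. The fallback rule for unforced duplicates shown here is an assumption made for the example, not documented product behavior.

def effective_type(occurrences):
    """occurrences: list of (type_name, is_forced) pairs for one term, in library order."""
    forced = [type_name for type_name, is_forced in occurrences if is_forced]
    if forced:
        return forced[0]       # the forced occurrence wins; the others are ignored
    return occurrences[0][0]   # assumption: otherwise the first listed occurrence is used

# effective_type([("Cleaning", False), ("Products", True)]) -> "Products"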


Figure 17-6
Resolve Conflicts dialog box

Renaming Types

You can rename a type dictionary or change other dictionary settings by editing the type properties.

To Rename a Type

E In the library tree pane, select the type dictionary you want to rename.

E Right-click your mouse and choose Type Properties from the context menu. The Type Properties dialog box opens.

Figure 17-7
Type Properties dialog box

E Enter the new name for your type dictionary in the Name text box.

E Click OK to accept the new name. The new type name is visible in the library tree pane.


Moving Types

You can drag a type dictionary to another location within a library or to another library in the tree.

To Reorder a Type within a Library

E In the library tree pane, select the type dictionary you want to move.

E From the menus, choose Edit > Move Up to move the type dictionary up one position in the library tree pane for a given library or Edit > Move Down to move it one position down.

To Move a Type to Another Library

E In the library tree pane, select the type dictionary you want to move.

E Right-click your mouse and choose Type Properties from the context menu. The Type Properties dialog box opens. (You can also drag and drop the type into another library.)

E In the Add To list box, select the library to which you want to move the type dictionary.

E Click OK. The dialog box closes, and the type is now in the library you selected.

Disabling Types

If you want to temporarily remove a type dictionary, you can deselect the check box to the left of the dictionary name in the library tree pane. This signals that you want to keep the dictionary in your library but want the contents ignored during conflict checking and during the extraction process.

To Disable a Type Dictionary

E In the library tree pane, select the type dictionary you want to disable and click the spacebar. The check box to the left of the type name is cleared.

Deleting Types

You can permanently delete type dictionaries from a library when you no longer need them.

To Delete a Type Dictionary from a Library

E In the library tree pane, select the type dictionary you want to delete.

E From the menus, choose Edit > Delete to delete the type dictionary.


Substitution Dictionaries

A substitution dictionary is a collection of term substitutions that help to group similar terms under one target term. Substitution dictionaries are managed in the bottom pane. If you are in an interactive workbench session, you can access this view with View > Resource Editor in the menus. Otherwise, you can edit dictionaries for a specific template in the Template Editor.

You can define two forms of substitutions in this dictionary: synonyms and optional elements. Synonyms associate two or more words that have the same meaning. Optional elements identify optional words in a compound term that can be ignored during extraction in order to keep like terms together even if they appear slightly different in the text.

After you run an extraction on your text data, you may find several terms that are synonyms or inflected forms of other terms. By identifying optional elements and synonyms, you can force Text Mining for Clementine to map these terms to one single target term. This reduces the number of terms in the final list and thus creates a more significant, representative term list with higher frequencies.

Figure 17-8
Substitution dictionary pane

You can use the tabs of the substitution dictionary pane to switch between the synonyms and optional elements.

Synonyms

On the Synonyms tab, you can define synonyms in order to associate two or more words that have the same meaning. You can also use synonyms to group terms with their abbreviations or to group commonly misspelled words with the correct spelling. The first step is to decide what the target, or lead, term will be. The target term is the term that you want to group all synonym terms under in the final list. During extraction, the synonyms are grouped under this target term. The second step is to identify all of the synonyms for this term. The target term is substituted for all synonyms in the final extraction. For example, if you want automobile to be replaced by vehicle, then automobile is the synonym and vehicle is the target term.

By grouping, the frequency results for the target term are greater, which makes it far easier to discover similar information that is presented in different ways in your text data.

Note: You can enter any words into the Synonym column, but if the word is not extracted from the text, no substitution will take place. However, the target term does not need to be extracted for the substitution to occur.


Figure 17-9
Substitution dictionary, Synonyms tab

Optional Elements

On the Optional tab, you can define optional elements for compound terms in order to group similar terms together. Optional elements are single words that, if removed from an extracted compound term, could create a match with another extracted term. These single words can appear anywhere within the compound term: at the beginning, middle, or end.

Figure 17-10
Substitution dictionary, Optional tab

Adding Synonyms

On the Synonyms tab, you can enter a synonym definition in the empty line at the top of the table. Begin by defining the target term and its synonyms. You can also select the library in which you would like to store this definition. During extraction, all occurrences of the synonyms will be grouped under the target term in the final extraction. Keep in mind that synonyms are matched using the Any attribute. For more information, see “Adding Terms” on p. 247.


Figure 17-11
Synonym entries

For example, if your text data includes a lot of telecommunications information, you may have these terms: cellular phone, wireless phone, and mobile phone. In this example, you may want to define cellular and mobile as synonyms of wireless. If you define these synonyms, then every extracted occurrence of cellular phone and mobile phone will be treated as the same term as wireless phone and will appear together in the term list.

When you are building your type dictionaries, you may enter a term and then think of three or four synonyms for that term. You can drag your target term into the substitution dictionary and then add any number of synonyms to it.

Synonym substitution is also applied to the inflected forms (such as the plural form) of the synonym. Depending on the context, you may want to impose constraints on how terms are substituted. Certain characters can be used to place limits on how far the synonym processing should go:

Exclamation mark (!). An exclamation mark directly preceding the synonym, such as !<synonym>, means that you want this term to be replaced exactly as it appears in the definition and not by any inflected forms. An exclamation mark directly preceding the target term, such as !<target-term>, means that you do not want any part of the compound target term or variants to receive any further substitutions.

Asterisk (*). An asterisk placed directly after a synonym, such as <synonym>*, means that you want this word to be replaced by the target term. For example, if you defined manage* as the synonym and management as the target, then associate managers will be replaced by the target term associate management. You can also add a space and an asterisk after the word (<synonym> *), such as internet *. If you defined the target as internet and the synonyms as internet * * and web *, then internet access card and web portal would be replaced with internet. You cannot begin a word or string with the asterisk wildcard in this dictionary.

Caret (^). A caret and a space preceding the synonym, such as ^ <synonym>, means that the synonym grouping applies only when the term begins with the synonym. For example, if you define ^ wage as the synonym and income as the target and both terms are extracted, then they will be grouped together under the term income. However, if minimum wage and income are extracted, they will not be grouped together, since minimum wage does not begin with wage. A space must be placed between this symbol and the synonym.


Dollar sign ($). A space and a dollar sign following the synonym, such as <synonym> $, means that the synonym grouping applies only when the term ends with the synonym. For example, if you define cash $ as the synonym and money as the target and both terms are extracted, then they will be grouped together under the term money. However, if cash cow and money are extracted, they will not be grouped together, since cash cow does not end with cash. A space must be placed between this symbol and the synonym.

Caret (^) and dollar sign ($). If the caret and dollar sign are used together, such as ^ <synonym> $, a term matches the synonym only if it is an exact match. This means that no words can appear before or after the synonym in the extracted term in order for the synonym grouping to take place. For example, you may want to define ^ van $ as the synonym and truck as the target so that only van is grouped with truck, while ludwig van beethoven will be left unchanged. Additionally, whenever you define a synonym using the caret and dollar signs and this word appears anywhere in the source text, the synonym is automatically considered for extraction. This can increase the likelihood of extraction.
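The caret and dollar sign constraints amount to anchoring the synonym at the start or end of the extracted term. The following Python sketch is illustrative only and assumes word-by-word comparison; it is not the product's matcher.

def synonym_applies(synonym, extracted_term, anchor_start=False, anchor_end=False):
    """Return True if the extracted term should be grouped under the synonym's target."""
    words, syn_words = extracted_term.split(), synonym.split()
    n = len(syn_words)
    if anchor_start and anchor_end:          # ^ synonym $ : exact match only
        return words == syn_words
    if anchor_start:                          # ^ synonym : term must begin with it
        return words[:n] == syn_words
    if anchor_end:                            # synonym $ : term must end with it
        return words[-n:] == syn_words
    return any(words[i:i + n] == syn_words    # unanchored: anywhere in the term
               for i in range(len(words) - n + 1))

# synonym_applies("wage", "minimum wage", anchor_start=True)  -> False
# synonym_applies("cash", "cash cow", anchor_end=True)        -> False
# synonym_applies("van", "ludwig van beethoven", anchor_start=True, anchor_end=True) -> False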

To Add a Synonym Entry

E With the substitution pane displayed, click the Synonyms tab in the lower left corner.

E In the empty line at the top of the table, type your target term in the Target column. The targetterm you entered appears in color. This color represents the type in which the term appears or isforced, if that is the case. If the term appears in black, this means that it does not appear inany type dictionaries.

E Click in the second cell to the right of the target and enter the set of synonyms. Separate each entry using the global delimiter as defined in the Options dialog box. For more information, see “Setting Options” in Chapter 8 on p. 131. The terms that you enter appear in color. This color represents the type in which the term appears or is forced, if that is the case. If the term appears in black, this means that it does not appear in any type dictionaries.

E Click in the third cell to select the library in which you want to store this synonym definition. Regardless of the library, the synonym definition will be applied to all of the extracted terms. Its position does, however, affect the order in which it is applied. Order is determined by the library’s position in the library tree pane.

Note: The previous instructions show you how to make changes within the Resource Editor view or the Template Editor. Keep in mind that you can also do this kind of fine-tuning directly from the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box in the other views. For more information, see “Refining Extraction Results” in Chapter 9 on p. 148.

Adding Optional Elements

On the Optional tab, you can define optional elements for any library you want. These entries are grouped together for each library. As soon as a library is added to the library tree pane, an empty optional element line is added to the Optional tab.


For example, to group the terms spss and spss inc together, you should declare inc to be treated as an optional element in this case. In another example, if you designate the term access to be an optional element and during extraction both internet access speed and internet speed are found, they will be grouped together under the term that occurs most frequently.

Note: All entries are transformed into lowercase words automatically. The extractor engine will match entries to both lowercase and uppercase words in the text.

Figure 17-12 Optional element entry for access

Note: Terms are delimited using the global delimiter as defined in the Options dialog box. For more information, see “Setting Options” in Chapter 8 on p. 131. If the optional element that you are entering includes the same delimiter as part of the term, a backslash must precede it.

To Add an Entry

E With the substitution pane displayed, click the Optional tab in the lower left corner of the editor.

E Click in the cell in the Optional Elements column for the library to which you want to add this entry.

E Enter the optional element. Separate each entry using the global delimiter as defined in the Options dialog box. For more information, see “Setting Options” in Chapter 8 on p. 131.

Disabling Substitutions

You can temporarily remove an entry by disabling it in your dictionary. A disabled entry is ignored during extraction.

To Disable an Entry

E In your dictionary, select the entry you want to disable and press the spacebar. The check box to the left of the entry is cleared.

Note: You can also deselect the check box to the left of the entry to disable it.


Deleting Substitutions

You can delete any obsolete entries in your substitution dictionary.

To Delete a Synonym Entry

E In your dictionary, select the entry you want to delete.

E From the menus, choose Edit > Delete. The entry is no longer in the dictionary.

To Delete an Optional Element Entry

E In your dictionary, double-click the entry you want to delete.

E Manually delete the term.

E Press Enter to apply the change.

Exclude Dictionaries

An exclude dictionary is a list of terms that are to be ignored or excluded in the final extraction. Exclude dictionaries are managed in the right pane of the editor. Typically, the terms that you add to this list are fill-in words or phrases that are used in the text for continuity but that do not really add anything important to the text and may clutter the extraction results. By adding these terms to the exclude dictionary, you can make sure that they are never extracted. If you are in an interactive workbench session, you can access this view with View > Resource Editor in the menus. Otherwise, you can edit dictionaries for a specific template in the Template Editor.

Figure 17-13 Exclude dictionary pane


Adding Entries

In the exclude dictionary, you can enter a word, phrase, or partial string in the empty line at the top of the table. You can add character strings to your exclude dictionary as one or more words or even partial words using the asterisk as a wildcard. The entries declared in the exclude dictionary will be used to bar terms from extraction. If an entry is also declared somewhere else in the interface, such as in a type dictionary, it is shown with a strike-through in the other dictionaries, indicating that it is currently excluded. This string does not have to appear in the text data or be declared as part of any type dictionary to be applied.

Note: If you add a term to the exclude dictionary that also acts as the target in a synonym entry, then the target and all of its synonyms will also be excluded, since substitutions occur before exclusions during the extraction process. For more information, see “Adding Synonyms” on p. 254.

Table 17-3 Examples of exclude entries

Kind of Entry   Exact Entry    Results
word            next           No terms will be extracted if they contain the word next.
phrase          for example    No terms will be extracted if they contain the phrase for example.
partial         copyright*     Will exclude any terms matching or containing the variations of the word copyright, such as copyrighted, copyrighting, copyrights, or copyright 2006.
partial         *ware          Will exclude any terms matching or containing the variations of the word ware, such as freeware, shareware, software, hardware, beware, or silverware.

Using Wildcards (*)

You can use the asterisk wildcard to denote that you want the exclude entry to be treated as a partial string. Any terms found by the extractor engine that contain a word that begins or ends with a string entered in the exclude dictionary will be excluded from the final extraction. However, there are two cases where the wildcard usage is not permitted:

Dash character (-) preceded by an asterisk wildcard, such as *-
Apostrophe (’) preceded by an asterisk wildcard, such as *’s

To Add an Entry

E In the empty line at the top of the table, enter a term. The term that you enter appears in color. This color represents the type in which the term appears or is forced, if that is the case. If the term appears in black, this means that it does not appear in any type dictionaries.

Note: The previous instructions show you how to make changes within the Resource Editor view or the Template Editor. Keep in mind that you can also do this kind of fine-tuning directly from the Extracted Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box in the other views. For more information, see “Refining Extraction Results” in Chapter 9 on p. 148.


Disabling Entries

You can temporarily remove an entry by disabling it in your exclude dictionary. A disabled entry is ignored during extraction.

To Disable an Entry

E In your exclude dictionary, select the entry that you want to disable and press the spacebar. The check box to the left of the entry is cleared.

Deleting Entries

You can delete any unneeded entries in your exclude dictionary.

To Delete an Entry

E In your exclude dictionary, select the entry that you want to delete.

E From the menus, choose Edit > Delete. The entry is no longer in the dictionary.


Chapter 18
About Advanced Resources

In addition to type, exclude, and substitution dictionaries, you can also work with a variety of advanced resource settings in the Edit Advanced Resources dialog box.

Figure 18-1 Edit Advanced Resources dialog box

These advanced resource files can be managed on either the Session tab or the Library Patterns tab.

Session Tab

The Session tab is the first tab to appear when you open the Edit Advanced Resources dialog box. This tab contains advanced resource files that are stored at the session level. These files contain more generic information that applies to the data as a whole.


Language. You can specify the language for your text data so that any language-specific files are available for extraction. For example, if you select English here, you will see only the English dynamic pattern file and not the German one.
Fuzzy Grouping Exceptions. Used to exclude word pairs from the fuzzy grouping (spelling error correction) algorithm. For more information, see “Fuzzy Grouping” on p. 266.
Nonlinguistic Entities. Used to enable/disable which nonlinguistic entities can be extracted, as well as the regular expressions and the normalization rules that are applied during their extraction. For more information, see “Nonlinguistic Entities” on p. 267.
Type Dictionaries. Used to force type codes for custom type dictionaries rather than using randomly generated codes. For more information, see “Type Dictionary Maps” on p. 270.
Language Handling. Used to declare the special ways of structuring sentences (dynamic POS patterns) and using abbreviations for the selected language. For more information, see “Language Handling” on p. 271.
Language Identifier. Used to configure the automatic Language Identifier called when the language is set to ALL. For more information, see “Language Identifier” on p. 274.

Library Patterns Tab

The Library Patterns tab contains advanced resource files that are stored at the library level. If you want a specific library-level file to be used rather than a session-level version, you can select that file and edit its content here. For example, if you want to use pattern rules for a text link analysis application, on this tab you could select the library containing those pattern rules.

Use POS patterns from. Used to select the local library that contains the dynamic part-of-speech patterns and forced definitions that you want to use instead of the session version, which is available on the Session tab. For more information, see “Dynamic POS Patterns” on p. 272.
Use Text Link Analysis patterns from. Used to select the local library that contains the text link analysis (TLA) rules in which you can define the variables, macros, and patterns used to extract complex relationships from your text documents. For more information, see “Text Link Analysis Rules” on p. 275.

Editing Advanced Resources

If you want to edit the advanced resource files, you must open the Edit Advanced Resources dialog box.

Note: You can use the Find/Replace feature (accessed from the Edit menu) to find information quickly or to make uniform changes to a section. For more information, see “Replacing” on p. 265.

To Edit Advanced Resource Files

E From the menus, choose Tools > Edit Advanced Resources. The Edit Advanced Resources dialog box opens and displays the contents of the Session tab.


Figure 18-2 Edit Advanced Resources dialog box

E Select a language from the list. This language affects the set of language-specific files that you can edit and use during extraction.

E Locate and select the resource file that you want to edit. The contents appear in the right pane of the editor.

E To use or change pattern rules in a specific library, click the Library Patterns tab to display those files.

E Use the menu or the toolbar buttons to cut, copy, or paste content, if necessary.

E Edit the file(s) that you want to change using the formatting rules in this section. Your changes are saved as soon as you make them. Use the undo or redo arrows on the toolbar to revert or reapply your most recent changes.

E Use the Reset to Default toolbar menu to designate the contents of a file as the default to use for all future resources or to reset a file back to its original content. The options on this toolbar menu are described in the following table.


Table 18-1 Reset to Default toolbar menu descriptions

Set as Default. Saves the current file as the default for all future resources.
Reset to Default. Replaces the current file with the user-saved default. If no user-saved default exists, replaces the file with the original.
Reset to Original. Replaces the current file with the version shipped with the product.
Set All as Default. Saves all files in the editor as the default for all future resources.
Reset All to Default. Replaces all files in the editor with the user-saved defaults. Whenever no user-saved default exists, replaces that file with the original.
Reset All to Original. Replaces all files with those originally shipped with the product.

E When finished, from the menus in the dialog box choose File > Save All and Close. The dialog box closes.

Finding

In some cases, you may need to locate information quickly in a particular section. For example, if you perform text link analysis, you may have hundreds of variables, macros, and patterns. Using the Find feature, you can find a specific rule quickly. To search for information in a section, you can use the Find toolbar.

Figure 18-3 Find toolbar

To Use the Find Feature

E Locate and select the resource section that you want to search. The contents appear in the right pane of the editor.

E From the menus, choose Edit > Find. The Find toolbar appears at the upper right of the Edit Advanced Resources dialog box.

E Enter the word string that you want to search for in the text box. You can use the toolbar buttons to control the case sensitivity, partial matching, and direction of the search.

Table 18-2 Find toolbar buttons

Case sensitive. Toggle indicating whether the search is case sensitive. When clicked (highlighted), the search is case sensitive. For example, if you enable this option and enter the word Vegetable, the case-sensitive search would find Vegetable but not vegetable.
Exact match. Toggle indicating whether the search term represents the entire term or if it is a partial search. When clicked, the search will match even a partial match. For example, if you enable this option and enter the word veg, the search would find Vegetable, vegetable, veggies, and vegetarian.


Down arrow. Toggle indicating the search direction. When clicked, the search goes forward, or down.
Up arrow. Toggle indicating the search direction. When clicked, the search goes backward, or up.

E Click Find to start the search. If a match is found, the text is highlighted in the window.

E Click Find again to look for the next match.

Replacing

In some cases, you may need to make broader updates to your advanced resources. The Replace feature can help you to make uniform updates to your content.

To Use the Replace Feature

E Locate and select the resource section in which you want to search and replace. The contents appear in the right pane of the editor.

E From the menus, choose Edit > Replace. The Replace dialog box opens.

Figure 18-4 Replace dialog box

E In the Find what text box, enter the word string that you want to search for.

E In the Replace with text box, enter the string that you want to use in place of the text that was found.

E Select Match whole word only if you want to find or replace only complete words.

E Select Match case if you want to find or replace only words that match the case exactly.

E Click Find Next to find a match. If a match is found, the text is highlighted in the window. If you do not want to replace this match, click Find Next again until you find a match that you want to replace.

E Click Replace to replace the selected match.

E Click Replace All to replace all matches in the section. A message opens with the number of replacements made.

E When you are finished making your replacements, click Close. The dialog box closes.

Note: If you made a replacement error, you can undo the replacement by closing the dialog box and choosing Edit > Undo from the menus. You must perform this once for every change that you want to undo.


Fuzzy Grouping

In the Text Mining node, if you select Accommodate spelling for a minimum root character limit of: on the Expert tab, you have enabled the fuzzy grouping algorithm.

Fuzzy grouping helps to group commonly misspelled words or closely spelled words by temporarily stripping vowels and double or triple consonants from extracted words and then comparing them to see if they are the same. During the extraction process, the fuzzy grouping feature is applied to the extracted terms and the results are compared to determine whether any matches are found. If so, the original words are grouped together in the final extraction list. They are grouped under the term that occurs most frequently in the data.

Note: If each term is assigned to a different type, excluding the <Unknown> type, the fuzzy grouping technique will not be applied.

If you enabled this feature and found that certain words that are spelled similarly were incorrectly grouped together under one term, you may want to exclude them from fuzzy grouping. You can do this by entering the incorrectly matched pairs into the Exceptions section of the Edit Advanced Resources dialog box, which can be accessed from the menus at Tools > Edit Advanced Resources. For more information, see “Editing Advanced Resources” on p. 262.

The following example demonstrates how fuzzy grouping is performed. If fuzzy grouping is enabled, these words appear to be the same and are matched in the following manner:

color -> clr
colour -> clr
mountain -> mntn
montana -> mntn
modeling -> mdlng
modelling -> mdlng
furniture -> frntr
furnature -> frntr

In the preceding example, you would most likely want to exclude mountain and montana from being grouped together. Therefore, you could enter them in the Exceptions section in the following manner:

mountain montana

Formatting Rules for Fuzzy Grouping Exceptions

Define only one exclude pair per line.
Use simple or compound words.
Use only lowercase characters for the words. Uppercase words will be ignored.
Use a <tab> character to separate each word in a pair.

Note: In previous text mining releases, this information was also stored in a file called fuzzyexclude.add. If you import this file, its content will be used in this section.

Classification Exceptions

During automated classification and clustering, the internal algorithms group words by known associations. You can use this section to fine-tune this process a little further. There are two parts in this section. The first, the Link Exceptions section, offers the ability to prevent a pair of words from being linked during the classification and clustering process. For more information, see “Link Exceptions” on p. 267. The second, the Excluded Types section, offers the ability to declare any types that you want excluded. For more information, see “Excluded Types” on p. 267.

Link Exceptions

During classification and clustering, the internal algorithms group words by known associations. To prevent a pair of concepts from being linked, you can enter them in the Link Exceptions section. When you exclude a pair of concepts, they are also called antilinks. For example, if you wanted to make sure that the concept pair luxury and cost are not grouped, and neither are plan and budget, you could add them as follows:

luxury	cost
plan	budget

Formatting Rules for Link Exceptions

Define only one exclude pair per line.
Use simple or compound words.
Use only lowercase characters for the words. Uppercase words will be ignored.
Use a <tab> character to separate each word in a pair.

Excluded Types

During classification and clustering, the internal algorithms attempt to create categories from the concepts and types extracted from your text data. Using this section, you can exclude all of the concepts in a given type. By default, this section is empty, and all types (and their concepts) are available to the classification process. In the following example, we have excluded all concepts assigned to the Unknown and Organization types from the automated classification process. For more information, see “Building Categories” in Chapter 10 on p. 163.

Unknown
Organization

Formatting Rules for Excluded Types

Define only one type per line.
Do not use brackets around the type name.
Type names are case sensitive.

Note: When building categories using the top types, the <Unknown> type is always excluded.

Nonlinguistic Entities

When working with certain types of data, you might be very interested in extracting dates, social security numbers, percentages, or other nonlinguistic entities. These entities are explicitly declared in the configuration file, in which you can enable or disable the entities you want to extract. For more information, see “Configuration” on p. 268. In order to optimize the output from the extractor engine, the input from nonlinguistic processing is normalized to group like entities according to predefined formats. For more information, see “Normalization” on p. 270.

Note: Nonlinguistic entity extraction is not performed automatically; therefore, you must enable the feature. You can enable nonlinguistic entity extraction via the interface.

The nonlinguistic entities in the following table can be extracted.

Table 18-3 Nonlinguistic entity type codes

Nonlinguistic entity            Name code               Type code
Addresses                       Address                 a
Amino acids                     Aminoacid               a
Currencies                      Currency                c
Dates                           Date                    d
Digits                          Digit                   #
E-mail addresses                email                   e
HTTP/URL addresses              url                     u
IP address                      IP                      i
Percentages                     Percent                 %
Proteins                        Protein                 G
Phone numbers                   PhoneNumber             n
Times                           Time                    t
U.S. social security numbers    SocialSecurityNumber    s
Weights and measures            Weights-Measures        w

Configuration

You can enable and disable the nonlinguistic entity types that you want to extract in the nonlinguistic entity configuration file. By disabling the entities that you do not need, you can decrease the processing time required. This is done in the Configuration section of the Edit Advanced Resources dialog box, which can be accessed from the menus at Tools > Edit Advanced Resources. For more information, see “Editing Advanced Resources” on p. 262. If nonlinguistic extraction is enabled, the extractor engine reads this configuration file during the extraction process to determine which nonlinguistic entity types should be extracted.

Note: Nonlinguistic entity extraction must be activated in the product interface or in the preference file in order for this configuration file to be read during extraction.

The syntax for this file is as follows:

<#name><tab><Language><tab><Type><tab><PoS>

Table 18-4 Syntax for configuration file

<#name>. The wording by which nonlinguistic entities will be referenced in the two other required files for nonlinguistic entity extraction. The names used here are case sensitive.


<Language>. The language of the documents. It is best to select the specific language; however, an ALL option exists. For more information, see “Language Identifier” on p. 274. Possible options are: 0 = All; 1 = French; 2 = English; 3 = Both English and French; 4 = German; 5 = Spanish; 6 = Dutch; 10 = Italian.
<Type>. The type code assigned to extracted terms that match entries in the dictionary. These type codes can be any single valid ASCII character that has not yet been used. These codes are case sensitive.
<PoS>. Part-of-speech rule. Most entities take a value of “s” except in a few cases. Possible values are: s = stopword; a = adjective; n = noun. If enabled, nonlinguistic entities are first extracted and the hard-coded or dynamic patterns are applied to identify their role in a larger context. For example, percentages are given a value of “a.” Suppose that 30% is extracted as a nonlinguistic entity. It would be identified as an adjective. Then if your text contained “30% salary increase,” the “30%” nonlinguistic entity fits the part-of-speech pattern “ann” (adjective noun noun).

Important! The order in which the entities are declared in this file is important and affects how they are extracted. They are applied in the order listed. Changing the order will change the results.

Formatting Rules for Configuration

Use a <tab> character to separate each entry in a column.
Do not delete any lines.
Respect the syntax shown in the preceding table using a unique type code.
To disable an entry, place a # symbol at the beginning of that line. To enable an entity, remove the # character before that line.
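For illustration only, a few lines in this section might look like the following (the values are taken from the tables above, but the exact entries, ordering, and defaults shipped with your installation may differ; the whitespace between columns stands for a single tab character):

Percent        0    %    a
PhoneNumber    2    n    s
#Aminoacid     0    a    s

Here, percentages and phone numbers would be extracted (the PhoneNumber entry applying only to English because its language value is 2), while the amino acid entity is disabled by the leading # symbol.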

Note: In previous text mining releases, this information was also stored in a file called NonLingEntitiesConf.txt. If you import this file, its content will be used in this section.

Regular Expression Definitions

When extracting nonlinguistic entities, you may want to edit or add to the regular expression rules that are used to identify these entities. This is done in the Regular Expression Definitions section of the Edit Advanced Resources dialog box, which can be accessed from the menus at Tools > Edit Advanced Resources. For more information, see “Editing Advanced Resources” on p. 262.

The file is broken up into distinct sections. The first section is called [macros]. In addition to that section, an additional section can exist for each nonlinguistic entity. You can add sections to this file. Within each section, rules are numbered (regexp1, regexp2, and so on). These rules must be numbered sequentially from 1–n. Any break in numbering will cause the processing of this file to be suspended altogether.

In certain cases, an entity is language dependent. An entity is considered to be language dependent if it takes a value other than 0 for the language parameter in the configuration file. For more information, see “Configuration” on p. 268. When an entity is language dependent, the language must be used to prefix the section name, such as [english/PhoneNumber]. That section would contain rules that apply only to English phone numbers when the PhoneNumber entity is given a value of 2 for the language.

Note: This file requires a certain level of familiarity with regular expressions. If you require additional assistance in this area, please contact SPSS Inc. for help.


Formatting Rules for Regular Expression Definitions

Add only one rule per line.
Within a section, place the most specific rule before the rest.
Strictly respect the sections in this file. For example, all macros must be defined in the [macros] section.
Within each section, rules are numbered (regexp1, regexp2, and so on). These rules must be numbered sequentially from 1–n. Any break in numbering will cause the processing of this file to be suspended altogether.
To disable an entry, place a # symbol at the beginning of each line used to define the regular expression. To enable an entity, remove the # character before that line.
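The exact layout of this section is best taken from the shipped content itself. Purely as a hypothetical sketch of the structure described above (a [macros] section, sequentially numbered regexp rules, and a language-prefixed section for a language-dependent entity), it might resemble the following, where the rule names and the key/value form are assumptions and the rule bodies are ordinary regular expressions:

[macros]
regexp1=[0-9]+

[english/PhoneNumber]
regexp1=\([0-9]{3}\) [0-9]{3}-[0-9]{4}
regexp2=[0-9]{3}-[0-9]{3}-[0-9]{4}

Note that the more specific rule (with parentheses around the area code) is listed before the more general one, as the formatting rules require.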

Important! If you make changes to this file or any other in the editor and the extractor engine no longer works as desired, use the Reset to Original option on the toolbar to reset the file to the original shipped content.

Note: In previous text mining releases, this information was also stored in a file called RegExp.ini. If you import this file, its content will be used in this section.

Normalization

When extracting nonlinguistic entities, the entities encountered are normalized to group like entities according to predefined formats. For example, currency symbols and their equivalent in words are treated as the same. The normalization entries are stored in the Normalization section of the Edit Advanced Resources dialog box, which can be accessed from the menus at Tools > Edit Advanced Resources. For more information, see “Editing Advanced Resources” on p. 262. The file is broken up into distinct sections.

Important! This file is for advanced users only. It is highly unlikely that you would need to change this file. If you require additional assistance in this area, please contact SPSS Inc. for help.

Formatting Rules for Normalization

Add only one normalization entry per line.
Strictly respect the sections in this file. No new sections can be added.
To disable an entry, place a # symbol at the beginning of that line. To enable an entity, remove the # character before that line.

Note: In previous text mining releases, this information was also stored in a file called NonLingNorm.ini. If you import this file, its content will be used in this section.

Type Dictionary Maps

All of the types delivered with the shipped libraries already have reserved codes that are used. However, if you create a new type dictionary in Text Mining for Clementine, the type code for this type dictionary will be randomly generated whenever necessary. In most cases, this process works very well.


However, if you have created variables for text link analysis that refer to specific type codes for these type dictionaries or if you want to be able to visualize types in the SPSS LexiQuest Mine interface (use type codes T or C), you should force those type codes in the Type Dictionary - Advanced Type Map section of the Edit Advanced Resources dialog box. This can be accessed from the menus at Tools > Edit Advanced Resources. For more information, see “Editing Advanced Resources” on p. 262. In this section, you can add a line for any of the libraries that you have created. Use the following syntax to define a type code:

<typename>=<code>,<name>,<int>

Table 18-5 Syntax description

<typename>. The name of the type as it appears in the Type Properties dialog box and in the tree view.
<code>. A single alphanumeric character representing the type code.
<name>. Repeat the value for <typename>.
<int>. Numerical value dictating the export procedure for type codes. Possible values are: 0 = the type code is used for typing and should not be written to the TermTypingConf.txt file; 1 = the type code is used for typing and is written to the TermTypingConf.txt file in order to benefit from its forcing status.

Important! We highly recommend that you do not change the type codes for the libraries shipped with this product. These are the type codes that are present in the original shipped version of this section. Instead, use this section to add or remove lines for the libraries you create.

Formatting Rules for Advanced Type Map

Define a unique single character type code per line.
The contents of this section are case sensitive.
Use a comma to separate each entry in the line.
Use a hash symbol (#) to comment lines.
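For example, if you had created a type dictionary named MyProducts in one of your own libraries and wanted to force its code, a line in this section might look like the following (the type name and the single-character code P are hypothetical and must not collide with codes already in use):

MyProducts=P,MyProducts,0

Setting the final value to 1 instead of 0 would also write the code to the TermTypingConf.txt file so that it benefits from its forcing status.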

Language Handling

Every language used today has special ways of expressing ideas, structuring sentences, and using abbreviations. In the Language Handling section, you can edit dynamic POS patterns, force definitions for those patterns, and declare abbreviations for the language that you have selected in the Language drop-down list.

Dynamic POS patterns.
Forced definitions.
Abbreviations.


Dynamic POS Patterns

When extracting information from your documents, the extractor engine applies a set of hard-coded parts-of-speech (POS) patterns to a “stack” of words in the text to identify candidate terms (words and phrases) for extraction. If you want to override the hard-coded patterns, you can add or modify the dynamic POS patterns.

Parts of speech include grammatical elements, such as nouns, adjectives, past participles, determiners, prepositions, coordinators, first names, initials, and particles. A series of these elements makes up a POS pattern. In SPSS text mining products, each part of speech is represented by a single character to make it easier to define your patterns. For instance, an adjective is represented by the lowercase letter a. The set of supported codes appears by default at the top of each default dynamic POS file along with a set of patterns and examples of each pattern to help you understand each code that is used.

In Text Mining for Clementine, dynamic POS patterns can be stored at the session and library level. Typically, users will use and declare their dynamic POS patterns in the Language Handling > Dynamic POS Patterns section of the Session tab.

However, in certain cases, you may want dynamic POS patterns associated with a particular library that was created for a special usage scenario. For example, if you are planning on using text link analysis to extract complex relationships within your text, you may have a library for this case and your own dynamic POS patterns stored within that library. If you want to create or edit dynamic POS patterns for a library, you can do so by selecting that library from the Use Dynamic POS Patterns From drop-down list on the Library Patterns tab.

Important! If you select a library for your POS patterns on the Library Patterns tab, you will not be able to edit the contents of the Dynamic POS Patterns section of the Session tab, which will now appear in gray. You must deselect the library on the Library Patterns tab if you want to edit the POS patterns on the Session tab.

Formatting Rules for Dynamic Patterns

One pattern per line.
Use # at the beginning of a line to disable a pattern.

The order in which you list the dynamic patterns is very important because a given sequence of words is read only once by the extractor engine and is assigned to the first dynamic pattern for which the engine finds a match.
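Purely as an illustration of the one-pattern-per-line format (the only codes assumed here are a for adjective and n for noun, which are mentioned elsewhere in this guide; consult the code list at the top of your default dynamic POS file for the real inventory), a few lines might look like this:

nn
ann
#an

The first two lines would match noun-noun and adjective-noun-noun sequences, while the third is disabled by the leading # symbol.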

Note: In previous text mining releases, this information was also stored in a language-specific file using the lowercase name of the language with the *.ptr file extension, such as english.ptr. If you import this file, its content will be used in this section.

Forced POS Definitions

When extracting information from your documents, the extractor engine scans the text and identifies the part of speech for every word it encounters. In some cases, a word could fit several different roles depending on the context. If you want to force a word to take a particular POS role or to exclude the word completely from POS processing, you can do so in the Forced Definition section of the Edit Advanced Resources dialog box, which can be accessed from the menus at Tools > Edit Advanced Resources. For more information, see “Editing Advanced Resources” on p. 262.

In Text Mining for Clementine, forced definitions can be stored at the session and library level. Typically, users will use and declare these in the Language Handling > Forced Definitions section of the Session tab. However, in certain cases, you may want POS definitions associated with a particular library. If you want to create or edit dynamic POS patterns and forced definitions for a library, you can do so by selecting that library from the Use Dynamic POS Patterns From drop-down list on the Library Patterns tab.

Important! If you select a library on the Library Patterns tab, you will not be able to edit the contents of the Forced Definitions section of the Session tab, which will now appear in gray. You must deselect the library on the Library Patterns tab if you want to edit the forced definitions on the Session tab.

To force a POS role for a given word, you must add a line to this section using the following syntax:

<uniterm>:<POS_codes>

Table 18-6 Syntax description

<uniterm>. A single-word term. Compound words, spaces, and colons are not supported.
<POS_codes>. A single-character code representing the POS role. You can list up to six different POS codes per uniterm. Additionally, you can stop a word from being extracted by using the lowercase code s, such as additional:s.

Formatting Rules for Forced Definitions

One line per word following the syntax <uniterm>:<POS_codes>.
Use only uniterms because compound words are not supported. Uniterms cannot contain a colon.
Use the lowercase s as a POS code to stop a word from being extracted altogether.
Use up to six POS codes per uniterm, or line. The set of supported codes appears by default at the top of each default dynamic POS file along with a set of patterns and examples of each pattern.
Use the asterisk character (*) as a wildcard at the end of a string for partial matches. For example, if you enter add*:s, words such as additional, additionally, addendum, and additive are never extracted as a term or as part of a compound word term. However, if a word match is explicitly declared as a term in a compiled dictionary or in the forced definitions, it will still be extracted. For example, if you enter both add*:s and addendum:n, addendum will still be extracted if found in the text.
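The following lines illustrate the syntax with hypothetical words (the codes used are a for adjective, n for noun, and s for stopword):

light:an
additional:s
add*:s
addendum:n

The first line forces light to be handled as an adjective or a noun, the second and third suppress additional and all words beginning with add, and the last line ensures that addendum is still extracted as a noun despite the add*:s wildcard.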

Note: In previous text mining releases, this information was also stored in a language-specific file using the lowercase name of the language with the *.add file extension, such as german.add. If you import this file, its content will be used in this section.


Abbreviations

When the extractor engine is processing text, it will generally consider any period it finds as an indication that a sentence has ended. This is typically correct; however, this handling of period characters does not apply when abbreviations are contained in the text.

If you extract terms from your text and find that certain abbreviations were mishandled, you should explicitly declare that abbreviation in this section. Just like the dynamic POS patterns, there is one set of abbreviations for each supported language. The content that can be viewed depends on the language chosen in the Language drop-down list.

Note: If the abbreviation already appears in a synonym definition or is defined as a term in a type dictionary, there is no need to add the abbreviation entry here.

Formatting Rules for Abbreviations

Define one abbreviation per line.
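For example (hypothetical entries; whether the trailing period is included depends on the abbreviation lists shipped for your language), the English section might contain lines such as:

approx.
dept.
no.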

Note: In previous text mining releases, this information was also stored in a language-specific file using the lowercase name of the language with the _abbv.txt file extension, such as english_abbv.txt. If you import this file, its content will be used in this section.

Language Identifier

While it is always best to restrict the text data that you are analyzing to one language, you can also specify the ALL option to help when you have documents in several different or unknown languages. The ALL language option uses a language autorecognition engine called the Language Identifier. The Language Identifier scans the documents to identify those that are in a supported language and automatically applies the best internal dictionaries for each file during extraction. The ALL option is governed by the parameters in these sections. For more information, see “Properties”. The supported languages are defined in the “Languages” section in the Edit Advanced Resources dialog box.

Properties

The Language Identifier is configured using the parameters in this section. The following table provides information about the parameters that you can set in the Language Identifier - Properties section of the Edit Advanced Resources dialog box, which can be accessed from the menus at Tools > Edit Advanced Resources. For more information, see “Editing Advanced Resources” on p. 262.

Table 18-7 Parameter descriptions

CONFIGURATION_FILE. Specifies the path and name for the configuration file. The default value is LangIdentifierConf.txt, which is automatically produced by the Language section. This file contains the list of languages that can be returned by the Language Identifier. Consider eliminating smaller languages from this list because they can cause false positives with larger languages and slow performance. Also, place the most probable languages at the top of the list to speed recognition times.


NUM_CHARS. Specifies the number of characters that should be read by the extractor engine in order to determine the language the document is in. The lower the number, the faster the language is identified. The higher the number, the more accurately the language is identified. If you set the value to 0, the entire text of the document will be read.
USE_FIRST_SUPPORTED_LANGUAGE. Specifies whether the extractor engine should use the first supported language found by the Language Identifier. If you set the value to 1, the first supported language is used. If you set the value to 0, the fallback language value is used.
FALLBACK_LANGUAGE. Specifies the language to use if the language returned by the identifier is not supported. Possible values are english, french, german, spanish, dutch, italian, and ignore. If you set the value to ignore, the document with no supported language will be ignored.
VERBOSE. Specifies the verbosity level. If you set the value to 0, no log file is generated. If you set the value to 1, a log file is generated.
LOGFILE. Specifies the path and name of the log file you want to create.
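As a purely hypothetical sketch (the key/value layout and the values shown are assumptions for illustration, not shipped defaults), the Properties section might contain settings such as the following:

CONFIGURATION_FILE = LangIdentifierConf.txt
NUM_CHARS = 300
USE_FIRST_SUPPORTED_LANGUAGE = 1
FALLBACK_LANGUAGE = english
VERBOSE = 0
LOGFILE = C:\temp\langid.log

With settings like these, the first 300 characters of each document would be used to determine its language, English would be used for documents whose language is not supported, and no log file would be written because VERBOSE is set to 0.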

Note: In previous text mining releases, this information was also stored in a file called LangIdentifier.ini. If you import this file, its content will be used in this section.

Languages

The Language Identifier supports many different languages. You can edit the list of languages in the Language Identifier - Languages section of the Edit Advanced Resources dialog box.

You may consider eliminating a couple of small languages from this list because larger languages can cause false positives and slow performance. You cannot add new languages to this file, however. Consider placing the most likely languages at the top of the list to help the Language Identifier find a match to your documents faster.

Note: In previous text mining releases, this information was also stored in a file called LangIdentifierConf.txt. If you import this file, its content will be used in this section.

Text Link Analysis Rules

Text link analysis is a pattern-matching technology that enables users to define relationships between extracted elements from your text. For example, extracting information about an organization may not be interesting enough to you, but by using text link analysis, you could also learn about the links between different organizations or the people associated with the organization.

If you want to be able to benefit from text link analysis, you must select the text link analysis configuration file on the Library Patterns tab and make any necessary changes to the pattern rules in the Text Link Analysis section. The following libraries are shipped with text link analysis pattern rules: Genomics, Opinions, and Security Intelligence.

A text link analysis configuration file can contain variables, macros, and pattern rules and is organized in the following order:

Variables. A variable corresponds to either a type code used during extraction or a literal list of words. Variables are used within a pattern to specify the matching of typed terms or word lists. All variables that will be used in patterns must be explicitly declared. For more information, see “Variable Syntax” on p. 276.


Macros. Macros can simplify the appearance of patterns by allowing you to group variables and word strings together with an OR operator (|). Although macros are not required in patterns, they are often used. All macros that will be used in patterns must be explicitly declared. For more information, see “Macro Syntax” on p. 278.
Patterns. A pattern is a Boolean query, or rule, that performs a match on text in a sentence. The rule itself is made up of a combination of variables, macros, word lists, and word gaps. For more information, see “Pattern Syntax” on p. 280.

When text link analysis is performed, the patterns are loaded and applied in numerical order according to their IDs. The ID determines which pattern is applied to the source text first. The first pattern that matches the source text is sent to the output. For this reason, it is imperative that you place patterns that are more specific (lower ID number) before the more generic patterns. To reorder a pattern, you must renumber it. Keep in mind that pattern IDs must be unique.

Variable Syntax

A variable definition consists of the following syntax:

[variable(<ID#>)]
name = <variable_name>
[variable(<ID#>)/input(<#>)]
type = [tag|list]
value = [<type_code>|<list_of_words>]

Table 18-8 Description of variable syntax

[variable(<ID#>)]. ID of the variable. Each variable must have a unique numerical value, such as [variable(150)].
name. User-defined name of the variable. Each name must be unique.
[variable(<ID#>)/input(<#>)]. The specific instance of the variable. Generally, there is only a single instance for each variable defined.
type. Identifies whether the variable comes from the extractor or is a literal list of words. Following are the possible values:
tag means that the variable defines a type code from the extractor. Any defined type codes can be used as a tag. This is the most common kind of variable, since the words it represents have been either extracted and/or forced.
list means that the variable defines a list of words. In some cases, the words searched for will not be in the linguistic resources and dictionaries. So, whenever a word is not (or cannot be) extracted or forced, it is recommended to declare it in a list variable.
value. The actual type code or list of words to be assigned to the variable. For more information about type codes or how to create a list of words, see the following table.


Table 18-9 Possible arguments for the value parameter for variables

Type code. Used when type = tag. For type codes, you can use any of the type codes defined in TermTypingConf.txt. There are standard type codes, domain type codes, and nonlinguistic entity type codes. Note that for text link analysis for Opinions, Genomics, and Security Intelligence, type codes are already defined in TermTypingConf.txt and SynonymConf.txt in the domain-specific resource files.
List of words. Used when type = list. For word lists, you must respect the following syntax:
Use single or compound words.
Enclose the list of words in parentheses, such as (a|an|the).
Separate each word in the list by the | character, which is equivalent to a Boolean OR.
Enter both singular and plural forms if you want to match both. Inflection is not automatically generated.
Use lower case only.
Do not reuse a word list that you have already defined in another variable. You can define a word list only once.
With the exception of commas, all punctuation marks are treated as a space. For example, to match the word a.k.a in text, enter it in a list as a k a.
Note: A variable defined as a list does not have any type code associated with it. So in the output, the field corresponding to the type code of a list variable will be empty.

Following are some examples of variables:

[variable(1)]
name = VarLocation
[variable(1)/input(1)]
type = tag
value = L

In the previous example, the variable called VarLocation was declared. The type tag means that this variable defines a type code. This type code value is L. This type code is a predefined type code for locations.

[variable(2)]
name = VarCoord
[variable(2)/input(1)]
type = list
value = (and|or|&)

In the previous example, the variable called VarCoord was declared. The type list means that the variable defines a list of words. This list includes the following words: and, or, and the ampersand character (&).

Formatting Rules for Variables

Variables are case sensitive.


You can always use the predefined variable $SEP, which corresponds to the comma (,) string, in any pattern.
To disable an element, place a comment indicator (#) before each line.

Important! When using a variable in a macro or pattern, it must be preceded by the dollar sign ($) character (for example, $VarLocation).

Macro Syntax

A macro definition consists of the following syntax:

[macro(<ID#>)]
name = <macro_name>
value = [$<variable_names>|<word_gaps>|<list_of_words>]

Table 18-10 Description of macro syntax

[macro(<ID#>)]. ID of the macro. Each macro must have a unique numerical value, such as [macro(7)].
name. User-defined name of the macro. Each name must be unique.
value. A combination of one or more variables or word lists. When combining elements, use parentheses to group the elements and the | character to indicate a Boolean OR. For more information about the values that you can use, see the following table.

Table 18-11 Possible arguments for the value parameter for macros

List of words. If you have a word list, then you must respect the following syntax:
Use single or compound words.
Separate each word in the list by a | character, which is equivalent to a Boolean OR.
Enclose the list of words in parentheses, such as (a|an|the).
Enter both singular and plural forms if you want to match both. Inflection is not automatically generated.
Use lower case only.
To reuse word lists, define them as a variable and then use that variable in your macros and patterns.
With the exception of commas, all punctuation marks are treated as a space. For example, to match the word a.k.a in text, enter it in a list as a k a.


Word gaps. A word gap defines a numeric range of tokens that may be present between two elements. Word gaps are very useful when matching very similar phrases that may differ only slightly due to the presence of additional prepositional phrases, adjectives, or other such words (for example, the phrases John Doe, the CEO of, and John Doe CEO of). The syntax for a word gap is: @{#,#}. For example, @{1,3} means that a match can be made between the two defined elements if there is at least one gap word present but no more than three gap words. For example, if you add the following elements and word gap to your macro or pattern, you are referring to the presence of a word matching the variable vSupport separated by zero or one word from the word not or a word matching the variable vAdvNeg: ($vSupport @{0,1} (not|$vAdvNeg)).
Variables and macros. Use existing variables or macros within the value for another macro or pattern by preceding the variable or macro name by a dollar sign character ($), such as $VarLocation.

Following are some examples of macros:

[macro(1)]
name = mVerb
value = ($VarPred|$VarPret|$VarSup)

In the previous example, the macro called mVerb is declared. The value for this macro is the presence of one of the three following variables: VarPred, VarPret, or VarSup.

[macro(2)]
name = mSupportNeg
value = ($vSupNeg|not|($vSup @{0,1} (not|$vAdvNeg))|($vAdvNeg $vSup))

In the previous example, the macro called mSupportNeg is declared. The value for this macro is the presence of one of the following:

A term with a type fitting the variable vSupNeg.
The word not.
A term with a type fitting the variable vSup followed by a word gap of zero or one word and then either the word not or a term with a type fitting the variable vAdvNeg.
A term with a type fitting the variable vAdvNeg immediately followed by a term with a type fitting the variable vSup.

Formatting Rules for Macros

Macros are case sensitive.
If you use a variable in a macro, it must be preceded by the $ (dollar sign) character (for example, $vVerb).
To disable an element, place a comment indicator (#) before each line.


Pattern Syntax

A pattern definition consists of the following syntax:

[pattern(<ID#>)]
name = <pattern_name>
value = [$<variable_names>|<word_gaps>|<list_of_words>]
output = $<digit>[\t]#<digit>[\t]$<digit>[\t]#<digit>[\t]$<digit>[\t]#<digit>[\t]
[outputdic = [<element>[ <another_element>],<type_code>]

Table 18-12 Description of pattern syntax

[pattern(<ID#>)]. ID of the pattern. Each pattern must have a unique numerical value, such as [pattern(25)]. Patterns are processed in numerical order. For more information, see “Multistep Processing” on p. 282.
name. User-defined name of the pattern.
value. The actual rule to be matched to input text. It can contain one or more variables, macros, word lists, and word gaps. See the next table in this section for a detailed list of valid pattern syntax for these elements.
output. The format of the output to be created when the pattern is matched. The output references any item (string, variable, macro, optional element, word gap, word list) defined in the pattern. References to the items are positional. Since the output format is tabulated, it is possible to separate the items either with a tab or \t, such as:
output = $1\t#1\t$3\t#3\t$2\t#2
If you want to separate items with spaces, use:
output = $1 #1 $3 #3 $2 #2
This indicates that the output should consist of items matched at positions 1, 3, and 2 ($1, $3, and $2) as defined in the pattern with their respective type codes (#1, #3, and #2). A value of NULL in the output definition indicates that an empty string will be used.
If an item is a word list or comes from a variable defined as a list, there will be no type code associated with it. The corresponding field will be empty.
If a term was grouped under a synonym “target” term, then the target term is displayed rather than the original term.
Note: It is possible to have more than one line of output from the same pattern using a different ID number for each line:
output(1) = $1\t#1\t$3\t#3\t$2\t#2
output(2) = $1\t#7\t$7\t#3\t$2\t#2
outputdic. An optional definition for specifying that the output of the pattern should be placed in the working dictionary with an assigned type code. The format of this command is:
<item(s)>,<type code>
To specify that the first item should be typed as an organization, use outputdic=$1,O.
To specify that a term must be created from the concatenation of the third and fourth items and typed as a gene, use outputdic=$3 $4,G.


Table 18-13Possible arguments for the value parameter for patterns

Element DescriptionList of words If you have a word list, then you must respect the following syntax:

Use single or compound words.Separate each word in the list by the | character, which is equivalentto a Boolean OR.Enclose the list of words in parentheses, such as (a|an|the).Enter both singular and plural forms if you want to match both.Inflection is not automatically generated.Use lower case only.Do not reuse a word list that you have already defined in anotherelement, since it will not be matched. To reuse word lists, define themas a variable and then use that variable in your macros and patterns.With the exception of commas, all punctuation marks are treated as aspace. For example, to match the word a.k.a in text, enter it in alist as a k a.

Word gaps A word gap defines a numeric range of tokens that may be present betweentwo elements. Word gaps are very useful when matching very similarphrases that may differ only slightly due to the presence of additionalprepositional phrases, adjectives, or other such words (for example, thephrases John Doe, the CEO of and John Doe CEO of). The syntaxfor a word gap is: @{#,#}. For example, @{1,3} means that a matchcan be made between the two defined elements if there is at least onegap word present but no more than three gap words. For example, if youadd the following elements and word gap to your macro or pattern, youare referring to the presence of a word matching the variable vSupportseparated by zero or one character from the word not or a word matchingthe variable vAdvNeg: ($vSupport @{0,1} (not|$vAdvNeg)).

Variables andmacros

Use existing variables or macros within the value for another macro orpattern by preceding the variable or macro name by a dollar sign character($), such as $VarLocation.

Optional elements    When you are declaring macros and patterns, you can also define certain elements as optional. This means that they do not have to be present in order for the pattern rule to match the text. An element is marked as optional when you append a question mark character (?) to the variable name, macro name, or word list. For example, if you add $vPerson the? $vFunction of $vOrg, the following would be matched: John Doe the CEO of ... and John Doe CEO of .... In another example, if you add the rule $vPerson ($SEP|$vDet) @{0,2} $vFunction, the following would be matched, assuming that john doe is typed as vPerson and ceo is typed as vFunction: John Doe, the CEO of ..., John Doe the CEO of ..., John Doe, CEO of ..., and John Doe CEO of ....

Use the following syntax for optional elements:
Place a question mark character with no spaces directly after the element, such as $vOrg?.
You cannot define several optional elements in a row. Instead, group them: if either one element or the other must be present, add them as ($var1|$var2); if both elements are optional, add them as ($var1|$var2)?.
You cannot begin a macro or pattern with an optional element.
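
To tie these elements together, the following value line is a sketch only; the variables vPerson, vDet, and vOrg and the contents of the word list are illustrative assumptions, and actual matching depends on how those variables are defined in your resources:

value = $vPerson ($SEP|$vDet)? (ceo|president|chairman) @{0,2} $vOrg

This single line combines an optional group, a lowercase word list, and a word gap that tolerates up to two intervening words between the function word and the organization.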


The following is an example of a pattern:

[pattern(205)]
name = 205
value = $mNeg $mTopic ($SEP|and){1,2} $mTopic
output(1) = $2\t#2\t$1\t#1\tNULL\tn/a
output(2) = $4\t#4\t$1\t#1\tNULL\tn/a

In the previous example, the pattern is called 205. This rule would match the following cases:
And only: I hate mushrooms and olives.

Comma only: I hate mushrooms, olives.

Comma + and: I hate mushrooms, and olives.

However, it would not match

I hate mushrooms olives

because either an and or a comma (,) is required by the rule.
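
If you also wanted to match the separator-less form, one possible variant is sketched below; the pattern ID 206 is arbitrary, and the rule simply makes the separator group optional:

[pattern(206)]
name = 206
value = $mNeg $mTopic ($SEP|and)? $mTopic
output(1) = $2\t#2\t$1\t#1\tNULL\tn/a
output(2) = $4\t#4\t$1\t#1\tNULL\tn/a

In this sketch the topic items are still referenced as $2 and $4, following their positions in the pattern as declared.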

Formatting Rules for Patterns

Whenever two or more elements are defined, they must be enclosed in parentheses whether or not they are optional (for example, ($VarPred|$VarPret) or ($vCoord|$SEP)?).
The first element in a pattern cannot be an optional element or a word gap. For example, you cannot begin with value = $VarGene? or value = @{0,1}.
It is possible to associate an instance count with a token. This is useful for writing only one rule that encompasses all cases instead of writing a separate rule for each case. For example, you may use the literal string ($SEP|and) if you are trying to match either , (comma) or and. If you extend this by adding an instance count so that the literal string becomes ($SEP|and){1,2}, you will now match any of the following three instances: a comma (,), and, or a comma followed by and (, and).
In the pattern value, spaces are not supported between the variable or macro name and the $ and ? characters. Use $varName or $mName?.
In the pattern output, spaces are not supported before the tab code (\t), between the dollar sign character ($) and the term item, or between the hash character (#) and the type item.
To disable an element, place a comment indicator (#) before each line.
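
For example, to take a hypothetical rule out of service without deleting it (the rule shown is illustrative only), place the comment indicator at the start of each of its lines:

# [pattern(207)]
# name = 207
# value = $mTopic ($SEP|and){1,2} $mTopic
# output(1) = $1\t#1\t$3\t#3\tNULL\tn/a

The instance count {1,2} on the separator group is what lets this single rule cover the comma, and, and comma-plus-and variants described above.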

Multistep Processing

Patterns are loaded by the pattern matcher and sorted alphanumerically by their IDs, section by section; they are loaded according to a numerical sort on their numbers. In some applications, it is almost impossible to write one rule that would cover all of the entities and links that you want to extract from the same sentence. So, instead of having different ptnmatcher.ini initialization files and applying them one after the other, it is possible to write specific subsets of rules. A specific subset of rules is defined by the keyword [set(<digit>)]. The best-matching rule in each set will be applied to the same sentence. For example,

[set(1)]
[set(2)]
[set(3)]
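
Expanded into a fuller sketch, the rules would be listed under the set they belong to. The pattern IDs, variable names, and grouping below are illustrative assumptions only:

[set(1)]
[pattern(100)]
name = 100
value = $vPerson $vFunction
output(1) = $1\t#1\t$2\t#2\tNULL\tn/a

[set(2)]
[pattern(200)]
name = 200
value = $vOrg $vLocation
output(1) = $1\t#1\t$2\t#2\tNULL\tn/a

With such a layout, the best-matching rule from set 1 and the best-matching rule from set 2 would each be applied to the same sentence.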


Note: You can add up to 512 rules per set.


Index

abbreviations, 271, 274accommodatingpunctuation errors, 44, 62, 82, 93, 144spelling errors, 45, 82, 144

activating nonlinguistic entities, 268addingconcepts to categories, 175optional elements, 256public libraries, 232sounds, 132–133synonyms, 150, 254terms to exclude list, 259terms to type dictionaries, 247types, 152

addresses (nonlinguistic entity), 267advanced resources editor, 261editing resources, 261–262find and replace, 264–265Library Patterns tab, 262Session tab, 261

advanced type map, 270all documents, 159all language option, 274amino acids (nonlinguistic entity), 267antilinks, 267applycategorizenode properties, 103applytextminingnode scripting properties, 70asterisk (*), 255

backing up resources, 224bar charts, 195Budget library, 230, 244Budget type dictionary, 244build categories settings, 38buildingcategories, 8–9, 39–40, 163, 165, 167clusters, 180–182

cachingdata and session extraction results, 36translated text, 107Web feeds, 17

calculating similarity link values, 183caret symbol (^), 255categories, 27, 157–158, 174adding to, 175building, 8–9, 39–40, 163, 165, 167commonality charts, 195creating conditional rules, 174creating new empty category, 173deleting, 177

descriptors, 160editing, 175limits tab, 167managing, 174merging, 177moving, 176refining results, 174renaming, 173scoring, 159techniques tab, 165text mining category model nuggets, 34web graph, 195

categories and concepts view, 119Categories and Concepts view, 157Categories pane, 158Data pane, 161–162

Categories pane, 158categorizing, 7, 157, 266linguistic techniques, 163link exceptions, 267using techniques, 8, 39, 165

category bar chart, 195–196category model nuggets, 27, 53building via node, 34building via workbench, 36generating, 134output, 58

category name, 159category web graph, 195, 197category web table, 195, 197changingtemplates, 211, 218type codes, 270

charts, 195classification, 5, 7, 165classification exceptions, 266co-occurrence rules, 8, 39, 165, 172concept derivation, 8, 39, 165, 168concept inclusion, 8, 39, 165, 169excluding types, 267frequency, 166link exceptions, 267semantic networks, 8, 39, 165, 170

classification link exceptions, 266Clementine Solution Publisher, 1closing the session, 135cluster web graph, 198–199clusters, 36, 123, 179about, 179building, 180–182descriptors, 184exploring, 184


similarity link values, 183viewing graphs, 198–199

clusters view, 123co-occurrence rules technique, 8, 40, 166, 172codes for types, 270colors, 162, 193exclude dictionary, 259for types and terms, 246in data pane, 162, 193setting color options, 132synonyms, 256

column wrapping, 132combining categories, 177componentization, 168concept derivation technique, 8, 39, 165, 168concept inclusion technique, 8, 39, 165, 169concept model nuggets, 27, 53building via node, 34concepts for scoring, 54synonyms, 58

concept patterns, 189concept web graph, 198–199concepts, 27, 54adding to categories, 160, 175adding to types, 152as fields or records for scoring, 62creating types, 148excluding from extraction, 154extracting, 139filtering, 146forcing into extraction, 155in categories, 160in clusters, 184

conditional rules, 174co-occurrence rules, 165–166deleting, 174from co-occurrence technique, 172from concept co-occurrence, 8, 40from synonymous words, 8, 40, 165–166

confidences in LexiQuest Categorize models, 92configuration file, 275macros, 278

Contextual Qualifier type dictionary, 244Core library, 230, 244counts, 93creatingcategories, 34category model nuggets, 134conditional rules, 174exclude dictionary entries, 259libraries, 231modeling nodes, 134optional elements, 256synonyms, 148, 150, 254template from resources, 210templates, 219type dictionaries, 245

types, 152CRM library, 230currencies (nonlinguistic entity), 267custom colors, 132

datacategorizing, 157, 163classification, 8, 39, 165clustering, 179Data pane, 161–162, 192extracting, 139, 142–143, 145, 188extracting TLA Patterns, 187filtering results, 146, 190refining results, 148restructuring, 74text link analysis, 187

Data paneCategories and Concepts view, 161–162display button, 159text link analysis view, 192

dates (nonlinguistic entity), 267deactivating nonlinguistic entities, 268definitions, 160deletingcategories, 177conditional rules, 174disabling libraries, 235excluded entries, 260libraries, 235, 237optional elements, 258resource templates, 221synonyms, 258type dictionaries, 252

delimiter, 131descriptors, 159categories, 160clusters, 184editing in categories, 175

dictionaries, 130, 243excludes, 229, 243, 258substitutions, 229, 243, 253types, 229, 243

digits (nonlinguistic entity), 267disablingexclude dictionaries, 260libraries, 235nonlinguistic entities, 268substitution dictionaries, 257synonym dictionaries, 266type dictionaries, 252

display button, 159display columns in the Data pane, 162, 192.doc files for text mining, 12docs, 159document fields, 114document mode, 29, 76document settings dialog box, 30, 77, 95


documents, 161–162, 192listing, 113

dollar sign ($), 255dynamic POS patterns, 271–272library version, 262

e-mail (nonlinguistic entity), 267edit mode, 202editingadvanced resources, 262categories, 174–175refining extraction results, 148

editing graphs, 203automatic settings, 204colors and patterns, 205dashing, 205legend position, 208margins, 207padding, 207point aspect ratio, 206point rotation, 206point shape, 206rules, 204selection, 204size of graphic elements, 207text, 204

enabling nonlinguistic entities, 268encoding, 30, 64, 76, 95, 106exclamation mark (!), 255exclude dictionary, 229, 258–260excludingconcepts from extraction, 154disabling dictionaries, 252, 257disabling exclude entries, 260disabling libraries, 235from category links, 267from fuzzy exclude, 266types from classification, 267

explore mode, 202exploring clusters, 184exportingpublic libraries, 237templates, 222

expression builder, 136extension list in file list node, 12external links, 179extracting, 1, 3, 5, 139, 142–143, 145, 229, 243extraction results, 139forcing words, 155nonlinguistic entities, 45, 83, 144patterns from data, 73refining results, 148TLA patterns, 188uniterms, 6, 45, 83, 144

extraction size, 29, 76

file list node, 9, 11–13example, 13extension list, 12other tabs, 13scripting properties, 15settings tab, 12

filelistnode scripting properties, 15filtering libraries, 234filtering results, 146, 190find and replace (advanced resources), 264–265finding terms and types, 233font color, 246forced definitions, 271–272forcingconcept extraction, 155terms, 250type codes, 270

formatting forstructured text, 31, 78, 96XML text, 30, 77, 95

frequency, 9, 40, 93, 166full text documents, 29, 63, 76, 94document mode, 29, 76paragraph mode, 29, 76

fuzzy grouping exceptions, 45, 82, 144, 261, 266

generated modelsText Mining models, 9

generatingcategory model nuggets, 134inflected forms, 243, 245–247modeling nodes, 134

Genomics library, 230global delimiter, 131graphs, 200–201category web graph, 195cluster web graph, 198–199colors and patterns, 205concept web graph, 198–199dashings, 205editing, 202–203explore mode, 202legend position, 208margins, 207padding, 207point aspect ratio, 206point rotation, 206point shape, 206refreshing, 195size of graphic elements, 207text, 204TLA concept web graph, 200–201type web graph, 200–201

.htm/ .html files for text mining, 12HTML formats for Web feeds, 15, 17HTTP/URLs (nonlinguistic), 267


ID field, 75identifying languages, 274ignoring concepts, 154importingLexiQuest Categorize model nuggets, 90public libraries, 236templates, 222

imposing type codes, 270inflected forms, 168, 243, 245–247inflection, 168initialization filepatterns, 280variables, 276

input encoding, 30, 64, 76, 95, 106interactive workbench, 119interactive workbench mode, 35interactive workbench session, 33internal links, 179IP addresses (nonlinguistic entity), 267IT library, 230

keyboard shortcuts, 135–136keyword dictionaries, 277

labelto reuse translated text, 107to reuse Web feeds, 17

language, 43, 64, 80, 98, 145for advanced resources, 261

language handling sections, 261, 271abbreviations, 271, 274dynamic POS patterns, 271–272forced definitions, 271–272

Language Identifier, 261, 274launch interactive workbench, 33LexiQuest Categorize, 10LexiQuest Categorize model, 93LexiQuest Categorize model nuggets, 89–90, 93–95, 98–99Fields tab, 94–95importing, 90Language tab, 98Model tab, 91scripting properties, 103Settings tab, 92usage example, 99

.lib, 236libraries, 130, 243adding, 232Budget library, 230, 244Core library, 230, 244creating, 231CRM library, 230deleting, 235, 237dictionaries, 229disabling, 235exporting, 237Genomics library, 230

importing, 236IT library, 230library synchronization warning, 238linking, 232local libraries, 238Local library, 230naming, 234Opinions library, 230, 244public libraries, 238publishing, 239renaming, 234sharing and publishing, 238shipped libraries, 230synchronizing, 238updating, 240Variations library, 230viewing, 234

Library Patterns tab, 262limits tab, 165linguistic resource templates, 209, 215–216linguistic resources, 36, 76, 229link exceptions, 267link values, 42, 182–183linking libraries, 232links in clusters, 179loadingresource template into node, 220

loading resource templates, 36–37, 76Local library, 230Location type dictionary, 244

macros, 278making templates from resources, 210managingcategories, 174local libraries, 234public libraries, 236

match option, 243, 245, 247–248maximum number of categories to create, 167merging categories, 177minimum confidence level summation, 93minimum link value, 167modelcategory model nuggets, 27

model nuggets, 33, 89category model nuggets, 34, 53, 58concept model nuggets, 27, 34, 53concept models, 54generating category model nuggets, 134LexiQuest Categorize, 10text mining model nuggets, 53

modeling nodes, 10, 25generating, 134updating, 134

movingcategories, 176type dictionaries, 252


multistep processing, 282muting sounds, 133

naminglibraries, 234type dictionaries, 251

narrow profile, 166, 172navigatingkeyboard shortcuts, 135

Negative Qualifier type dictionary, 244Negative type dictionary, 244new categories, 173nodesfile list, 9generated Text Mining model, 9LexiQuest Categorize, 10LexiQuest Categorize model nuggets, 89text link analysis, 9, 73text mining model nugget, 53Text Mining modeling, 9text mining modeling node, 25translate, 10web feed, 9

nonlinguistic entities, 45, 83, 144, 261addresses, 267amino acids, 267currencies, 267dates, 267digits, 267e-mail addresses, 267enabling and disabling, 268HTTP addresses/URLs, 267IP addresses, 267normalization, NonLingNorm.ini, 270percentages, 267phone numbers, 267proteins, 267regular expressions, RegExp.ini, 269times, 267U.S. Social Security number, 267weights and measures, 267

normalization, 270

opening templates, 218Opinions library, 230, 244optional elements, 253adding, 256definition of, 254deleting entries, 258target, 256

options, 131colors, 132session, 131sound, 133

Organization type dictionary, 244

paragraph mode, 29, 76parameterspattern definition, 280variable definition, 276

part-of-speechforcing definitions, 272patterns, 272

partition mode, 30partitioning datamodel building, 33

patterns, 36, 73, 139, 142, 189, 280definition parameters, 280library version, 262multistep processing, 282

.pdf files for text mining, 12percentages (nonlinguistic entity), 267permutations, 45, 145Person type dictionary, 244phone numbers (nonlinguistic), 267plural word forms, 246POS patterns, 271–272library version, 262

Positive Qualifier type dictionary, 244Positive type dictionary, 244.ppt files for text mining, 12predicted categories, 92PredictiveCallCenter, 1preferences, 131–133Product type dictionary, 244proteins (nonlinguistic entity), 267publishing, 239adding public libraries, 232libraries, 238

punctuation errors, 44, 62, 82, 93, 144

records, 161–162, 192refining resultsadding concepts to types, 152adding synonyms, 150categories, 174creating types, 152excluding concepts, 154extraction results, 148forcing concept extraction, 155

refreshing graphs, 195renamingcategories, 173libraries, 234resource templates, 221type dictionaries, 251

replacing resources with template, 211reset to default, 264resource editor, 130Resource Editor, 209–211, 215, 261making templates, 210switching resources, 211updating templates, 210


resource templates, 6, 36, 73, 76, 130, 187, 209, 215–216resourcesbacking up, 224classification exceptions, 266editing advanced resources, 261restoring, 224shipped libraries, 230switching template resources, 211

restoring resources, 224results of extractions, 139filtering results, 146, 190

reuse cached data and results, 36reusingdata and session extraction results, 36translated text, 107Web feeds, 17

RSS formats for Web feeds, 15, 17.rtf files for text mining, 12rulesco-occurrence rules technique, 172deleting, 174

Sample nodewhen mining text, 28

sampling datafor text mining, 28

savingdata and session extraction results, 36interactive workbench session, 134resources, 224resources as templates, 210templates, 219time using Sample nodes, 28translated text, 107Web feeds, 17

score button, 159scoring, 159concepts, 57

screen readers, 135–136searching, 233, 264selecting concepts for scoring, 57semantic networks technique, 8, 40, 166, 170, 172profiles, 166, 172

separators, 131session information, 35settings, 131–133sharing libraries, 238adding public libraries, 232publishing, 239updating, 240

shipped libraries, 230shortcut keys, 135–136shortcutskeyboard, 135

.shtml files for text mining, 12similarity link values, 183single prediction, 93

Social Security # (nonlinguistic), 267sound options, 133source nodesfile list, 9web feed, 9

spelling mistakes, 45, 82, 144, 266statistical techniques, 4, 7structured text documents, 29, 31, 63, 76, 78, 95–96substitution dictionary, 229colors, 256deleting, 258disabling, 257optional elements, 253synonyms, 253

summation, 93switching templates, 211synchronizing libraries, 238–240synonyms, 148, 253adding, 150, 254asterisk (*), 255caret symbol (^), 255colors, 256definition of, 253deleting entries, 258dollar sign ($), 255exclamation mark (!), 255fuzzy grouping exceptions, 45, 82, 144, 266in concept model nuggets, 58target terms, 254

tables, 136target terms, 256techniques, 165co-occurrence rules, 165, 172concept derivation, 165, 168concept inclusion, 165, 169frequency, 166semantic networks, 165, 170

Template Editor, 215–222, 224deleting templates, 221exiting the editor, 224importing and exporting, 222opening templates, 218renaming templates, 221resource libraries, 229saving templates, 219updating resources in node, 220

templates, 6, 36, 73, 76, 130, 187, 209, 215–216backing up, 224deleting, 221importing and exporting, 222load resource templates dialog box, 37making from resources, 210opening templates, 218renaming, 221restoring, 224saving, 219


switching templates, 211updating or saving as, 210

term componentization, 168termsadding to exclude dictionary, 259adding to types, 247color, 246finding in the editor, 233forcing terms, 250inflected forms, 243match options, 243

text analysis, 4text extraction, 5text field, 29, 63, 75, 94, 106.text files for text mining, 12text link analysis, 126, 187, 275Data pane, 192filtering patterns, 190in text mining modeling nodes, 36library version, 262macros, 278multistep processing, 282pattern rule syntax, 280patterns, 280variables, 276viewing graphs, 200–201Visualization pane, 200–201web graph, 200–201

text link analysis node, 9, 73, 75, 80–81, 83, 86Annotations tab, 83caching TLA, 74example, 83Expert tab, 81Fields tab, 75Language tab, 80output, 74restructuring data, 74scripting properties, 86

text mining, 3Text Mining for Clementine, 1applications, 10nodes, 9

Text Mining model nuggetscripting properties, 70

text mining model nuggets, 53concepts as fields or records, 62example, 66Fields tab, 62Language tab, 64Model tab, 58Settings tab, 61Summary tab, 65

Text Mining model nuggetsModel tab, 54

text mining modeling node, 25example, 45Fields tab, 28

Language tab, 42Model tab, 32scripting properties, 50

text mining nodegenerating new node, 134

Text Mining node, 9Expert tab, 44

text separators, 131textminingnode scripting properties, 50textual unity, 29, 76times (nonlinguistic entity), 267titles, 114TLA concept web graph, 200–201TLA patterns, 73, 187, 189tlanode properties, 86translate node, 10, 105–108, 111caching translated text, 105, 107–108Fields tab, 106Language tab, 107reusing translated files, 110scripting properties, 111usage example, 108

translatenode scripting properties, 111translation, 43, 65, 81, 99, 107, 146translation label, 107.txt files for text mining, 12type dictionary, 229, 261, 277adding terms, 247built-in types, 244creating types, 245deleting, 252disabling, 252forcing terms, 250moving, 252optional elements, 243renaming, 251synonyms, 243

type frequency in classification, 9, 40, 166type map, 270type patterns, 189type web graph, 200–201types, 243adding concepts, 148built-in types, 244codes, 270creating, 245default color, 132, 246dictionaries, 229excluding from classification, 267extracting, 139filtering, 146, 190finding in the editor, 233

uncategorized, 159Uncertain Qualifier type dictionary, 244Uncertain type dictionary, 244uniterms, 45, 83, 144


Unknown type dictionary, 244updating, 240graphs, 195libraries, 238modeling nodes, 134node resources and template, 220templates, 210, 219

URLs, 16, 18

Variations library, 230viewer node, 113–114example, 114for text mining, 113Settings tab, 113

viewingcategories, 195clusters, 198–199documents, 113libraries, 234text link analysis, 200–201

viewsCategories and Concepts, 157

views in interactive workbenchcategories and concepts, 119clusters, 123resource editor, 130text link analysis, 126

visualization pane, 195visualizing, 195category web graph, 195cluster web graph, 198–199concept web graph, 198–199Text Link Analysis view, 200–201TLA concept web graph, 200–201type web graph, 200–201updating graphs, 195

web feed node, 9, 11, 15–17, 23example, 19Input tab, 16label for caching and reuse, 17Records tab, 17scripting properties, 23

Web Feed node, 17web graphscategory web graph, 195cluster web graph, 198–199concept web graph, 198–199TLA concept web graph, 200–201type web graph, 200–201

web table, 195webfeednode properties, 23weights/measures (nonlinguistic), 267wider profile, 166, 172WordNet, 170workbench, 33, 35session information, 35

.xls files for text mining, 12

.xml files for text mining, 12XML text, 29, 64, 76, 95formatting, 30, 77, 95