Endeca® Navigation Platform Advanced Features Guide

Transcript of Endeca_07_AdvFeaturesGuide

Page 1: Endeca_07_AdvFeaturesGuide

Endeca® Navigation Platform Advanced Features Guide


Copyright and Disclaimer

Product specifications are subject to change without notice and do not represent a commitment on the part of Endeca Technologies, Inc. The software described in this document is furnished under a license agreement. The software may not be reverse assembled and may be used or copied only in accordance with the terms of the license agreement. It is against the law to copy the software on any medium except as specifically allowed in the license agreement.

No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, for any purpose without the express written permission of Endeca Technologies, Inc.

Copyright © 2003-2005 Endeca Technologies, Inc. All rights reserved. Printed in USA.

Corda PopChart® and Corda Builder™ Copyright 1996-2005 Corda Technologies, Inc.

Outside In® SearchML © 1992-2005 Stellent Chicago, Inc. All rights reserved.

Rosette® Globalization Platform Portions Copyright © Basis Technology Corp. 2003-2005. All rights reserved.

Teragram Language Identification Software Portions Copyright © 1997-2005 Teragram Corporation. All rights reserved.

Trademarks

Don't Stop At Search, Endeca, Endeca InFront, Endeca Navigation Engine, Guided Navigation, and ProFind are registered trademarks, and Endeca Data Foundry and Endeca Latitude are trademarks of Endeca Technologies, Inc.

Basis Technology and Rosette are trademarks of Basis Technology Corp.

All other trademarks or registered trademarks contained herein are the property of their respective owners.

Endeca Advanced Features Guide • August 2005


Contents

Preface

About This Guide . . . . . xvi
Who Should Use This Guide . . . . . xvi
Symbols and Conventions . . . . . xvi
Endeca Documentation Set . . . . . xvii
Contacting Endeca Standard Customer Support . . . . . xx

SECTION I DATA IMPORT FEATURES

Chapter 1 Content Acquisition System

Sections of This Chapter . . . . . 23
CAS and Security Information . . . . . 24
Components that Support CAS . . . . . 25
CAS Reference Implementation . . . . . 26
Full Crawls versus Differential Crawls . . . . . 27
URL and Record Processing . . . . . 28
Redundant URLs . . . . . 30
Source Documents and Endeca Records . . . . . 31
Property Name Syntax . . . . . 36
Viewing all Properties Generated by CAS . . . . . 38
Creating a Full Crawling Pipeline . . . . . 39
Creating a Record Adapter to Read Documents . . . . . 41
Creating a Record Manipulator . . . . . 43
Adding a RETRIEVE_URL Expression . . . . . 45
Converting Documents to Text . . . . . 47


Identifying the Language of the Documents . . . . . 50
Removing Document Body Properties . . . . . 52
Modifying Records with a Perl Manipulator . . . . . 54
Creating a Spider . . . . . 55
Specifying Root URLs to Crawl . . . . . 59
Configuring URL Extraction Settings . . . . . 60
Example Syntax of URL Filters . . . . . 64
Specifying a Record Source for the Spider . . . . . 65
Specifying Timeouts . . . . . 65
Specifying Proxy Servers . . . . . 66
Removing any Unnecessary Records after a Crawl . . . . . 68
Handling Crawling Errors . . . . . 72
Properties Generated by CAS . . . . . 73
Formats Supported by ProFind . . . . . 81

Chapter 2 Web Crawling with Authentication

Configuring Basic Authentication . . . . . 89
KEY_RING Element . . . . . 91
SITE Element . . . . . 91
HOST Attribute . . . . . 92
PORT Attribute . . . . . 92
HTTP Element . . . . . 93
REALM Element . . . . . 93
KEY Element . . . . . 94
Configuring HTTPS Authentication . . . . . 94
Boot-Strapping Server Authentication . . . . . 95
CA_DB Element . . . . . 95
Disabling Server Authentication for a Host . . . . . 96
HTTPS Element . . . . . 96
AUTHENTICATE_HOST Attribute . . . . . 96
Configuring Client Authentication . . . . . 97
CERT Element . . . . . 97
PATH Attribute . . . . . 98


PRIV_KEY_PATH Attribute . . . . . 98
Authenticating with a Microsoft Exchange Server . . . . . 98
EXCHANGE_SERVER Element . . . . . 99
Authenticating with a Proxy Server . . . . . 99
PROXY Element . . . . . 100
Using Forge to Encrypt Keys and Pass Phrases . . . . . 100
Encrypting a Username/Password Pair . . . . . 101
Encrypting a Pass Phrase . . . . . 101

SECTION II RECORD FEATURES

Chapter 3 Creating Aggregated Records

Aggregated Record Behavior . . . . . 106
Enabling Record Aggregation . . . . . 107
Generating and Displaying Aggregated Records . . . . . 109
Determining the Available Rollup Keys . . . . . 109
Creating Aggregated Record Navigation Queries . . . . . 112
Specifying the Rollup Key for the Navigation Query . . . . . 112
Setting the Maximum Number of Returned Records . . . . . 113
Creating Aggregated Record Queries . . . . . 114
Displaying Aggregated Records . . . . . 115
Retrieving an Aggregated Record from an ENEQueryResults Object . . . . . 115
Retrieving an Aggregated Record List from a Navigation Object . . . . . 116
Displaying Aggregated Record Attributes . . . . . 117
Displaying the Records in the Aggregated Record . . . . . 119

Chapter 4 Using Derived Properties

Specifying Derived Properties . . . . . 124
Displaying Derived Properties . . . . . 125
Troubleshooting Derived Properties . . . . . 128
Derived Property Performance . . . . . 129


Chapter 5 Selecting a Record Set Based on a Key

About the Select Feature . . . . . 131
Configuring the Select Feature . . . . . 132
Using URL Query Parameters for Select . . . . . 133
Selecting Keys in the Application . . . . . 133
Using the Java Selection Method . . . . . 133
Using the .NET Selection Property . . . . . 135
Using the COM/Perl Selection Methods . . . . . 135

Chapter 6 Bulk Export of Records

Configuring the Bulk Export Feature . . . . . 137
Using URL Query Parameters for Bulk Export . . . . . 138
Retrieving Bulk Records in the Application . . . . . 138
Setting the Number of Bulk Records . . . . . 138
Retrieving the Bulk-format Records . . . . . 140
Using Java Bulk Export Methods . . . . . 140
Using COM/Perl Bulk Export Methods . . . . . 142
Using .NET Bulk Export Methods . . . . . 143
Performance Impact for Bulk Export Records . . . . . 144

Chapter 7 Record Filters

Record Filter Syntax . . . . . 146
ENE Query Syntax . . . . . 147
XML Syntax for File-based Record Filter Expressions . . . . . 149
Enabling Properties for Use in Record Filters . . . . . 151
Data Configuration for File-based Filter Expressions . . . . . 151
Record Filter Result Caching . . . . . 152
ENE URL Query Parameters for Record Filters . . . . . 153
Sample Queries . . . . . 154
Record Filter Performance Implications . . . . . 154
Memory Cost . . . . . 154
Expression Evaluation . . . . . 154
Record Filters . . . . . 155


Interaction with Spelling Auto-correction and Spelling Did You Mean . . . . . 155
Memory Cost . . . . . 155
Expression Evaluation . . . . . 155

SECTION III DIMENSION FEATURES

Chapter 8 Using Inert Dimension Values

Configuring Inert Dimension Values . . . . . 162
Using Inert Dimension Values in the Application . . . . . 163
Sample Java Code for Inert Dimension Values . . . . . 164
Sample .NET Code for Inert Dimension Values . . . . . 165
Sample COM Code . . . . . 166

Chapter 9 Working with Externally Created Dimensions

XML Requirements . . . . . 170
XML Syntax to Specify Dimension Hierarchy . . . . . 171
Example of Using Nested node Elements . . . . . 172
Example of Using parent Attributes . . . . . 173
Example of Using child Elements . . . . . 173
Node ID Requirements . . . . . 173
Importing an Externally Created Dimension . . . . . 174

Chapter 10 Working with an Externally Managed Taxonomy

XSLT and XML Requirements . . . . . 180
XSLT Mapping . . . . . 181
XML Syntax to Specify Dimension Hierarchy . . . . . 181
Example of Using Nested node Elements . . . . . 183
Example of Using parent Attributes . . . . . 183
Example of Using child Elements . . . . . 183
Node ID Requirements and Identifier Management in Forge . . . . . 184
Pipeline Configuration . . . . . 185
Integrating an Externally Managed Taxonomy . . . . . 185


Transforming an Externally Managed Taxonomy . . . . . 187
Loading an Externally Managed Dimension . . . . . 188
Running a Second Baseline Update . . . . . 189
Updating an Externally Managed Taxonomy in Your Pipeline . . . . . 190

Chapter 11 Classifying Documents with Stratify

Sections of This Document . . . . . 192
Frequently Used Terms and Concepts . . . . . 193
How Endeca and Stratify Classify Unstructured Documents . . . . . 196
Overview of the Integration Process . . . . . 199
Required Stratify Tools . . . . . 200
Developing a Stratify Taxonomy . . . . . 201
Building a Taxonomy . . . . . 202
Exporting a Taxonomy . . . . . 203
Creating a Pipeline to Incorporate Stratify . . . . . 204
Creating a CAS . . . . . 206
Classifying Documents with Stratify Classification Server . . . . . 207
Adding a Property Mapper and Indexer Adapter . . . . . 212
Integrating a Stratify Taxonomy . . . . . 213
Running the First Baseline Update . . . . . 215
Loading a Dimension and its Dimension Values . . . . . 216
About Synonym Values and Dimension Values . . . . . 219
Mapping a Dimension Based on a Stratify Taxonomy . . . . . 222
Running the Second Baseline Update . . . . . 222
Updating a Taxonomy in Your Pipeline . . . . . 223

SECTION IV LOGGING AND PERFORMANCE FEATURES

Chapter 12 Forge Hierarchical Logging System

Overview . . . . . 227
Log Levels and Message Categories . . . . . 228
Log Levels . . . . . 228
Message Category . . . . . 229


Log Appenders . . . . . 232
Format of the Appenders . . . . . 235
Configuring MustMatch Messages . . . . . 236
Configuring the Dimension Server Match Count Log . . . . . 237
Reference log.ini File . . . . . 238
A Simple Reference log.ini File . . . . . 240

The Forge Logging Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

Chapter 13 Using Multithreaded Mode

Understanding Multithreaded Mode . . . . . 245
Costs of Multithreaded Mode . . . . . 246
Configuration for Multithreaded Mode . . . . . 247
Multithreaded Navigation Engine Performance . . . . . 247
Application Query Characteristics . . . . . 248
Thread Pool Size and OS Platform . . . . . 249
Hyperthreaded Intel Processors . . . . . 249
Linux . . . . . 250
Solaris . . . . . 251
Windows . . . . . 251

Chapter 14 Coremetrics Integration

Using the Integration Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

SECTION V OTHER ADVANCED FEATURES

Chapter 15 Implementing Merchandising and Content Spotlighting

Introduction to Dynamic Business Rules and Promoting Records . . . . . 258
Comparing Dynamic Business Rules to Content Management Publishing . . . . . 259
Dynamic Business Rule Constructs . . . . . 260
Query Results and Rules . . . . . 262
Two Examples of Promoting Records . . . . . 262

An Example with One Rule Promoting Records. . . . . . . . . . . 263


An Expanded Example with Three Rules . . . . . 265
Suggested Workflow Using Endeca Tools to Promote Records . . . . . 268
Incremental Implementation . . . . . 269
Building the Supporting Constructs for a Rule . . . . . 270
Ensuring Promoted Records are Always Produced . . . . . 270
Creating a Style . . . . . 272

Controlling the Number of Promoted Records . . . . . 272
Performance and the Maximum Records Setting . . . . . 273
Ensuring Consistent Property Usage with Property Templates . . . . . 273
Indicating How to Display Promoted Records . . . . . 275

Creating Rules . . . . . 277
Creating a Rule and Ordering Its Results . . . . . 278
Specifying When to Promote Records . . . . . 279
Specifying a Time Trigger to Promote Records . . . . . 283
Synchronizing Time Zone Settings . . . . . 284
Specifying Which Records to Promote . . . . . 284
Adding Custom Properties to a Rule . . . . . 286
Adding Static Records in Rule Results . . . . . 287
Order of Featured Records . . . . . 288
No Uniqueness Constraints . . . . . 288
No Maximum Record Limits . . . . . 289
Sorting Rules in the Rules View . . . . . 289
Prioritizing Rules . . . . . 290

Presenting Rule Results in a Web Application . . . . . 291
Required Navigation Engine URL Query Parameters . . . . . 292
Adding Web Application Code to Extract Rule Results . . . . . 293
Sample Java Code . . . . . 296
Sample ASP .NET Code . . . . . 297
Sample COM Code . . . . . 299
Adding Web Application Code to Render Rule Results . . . . . 301
Grouping Rules . . . . . 301

Deleting a Rule Group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303


Prioritizing Rule Groups . . . . . 304
Interaction Between Rules and Rule Groups . . . . . 305
Performance Considerations for Dynamic Business Rules . . . . . 305
Rules without Explicit Triggers . . . . . 306
Using an Agraph and Dynamic Business Rules . . . . . 306
Applying Relevance Ranking to Rule Results . . . . . 308
About Overloading Supplement Objects . . . . . 309

Chapter 16 Implementing User Profiles

Profile-Based Trigger Scenario . . . . . 312
Developer Studio Implementation . . . . . 314
User Profile Query Parameters . . . . . 315
Objects and Method Calls . . . . . 315
Java Code Example . . . . . 316
.NET C# Code Example . . . . . 317
COM Code Example . . . . . 317

Performance Impact of User Profiles . . . . . . . . . . . . . . . . . . . . . . . . 318

Chapter 17 Implementing Partial Updates

About Partial Updates . . . . . 319
Partial Update Capabilities . . . . . 321
Partial Updates Reference Implementation . . . . . 322
Baseline Pipeline Restrictions . . . . . 322
Creating a Partial Update Pipeline . . . . . 324
Creating the Record Adapter . . . . . 326
Creating the Record Manipulator . . . . . 327
IF Expression . . . . . 328
UPDATE_RECORD Expression . . . . . 328
Examples of UPDATE_RECORD Expressions . . . . . 333
UPDATE_RECORD Errors . . . . . 334
Format of Update Records . . . . . 334
Format of Records to Be Deleted . . . . . 335
Format of Records to Be Updated . . . . . 335


Format of Records to Be Added . . . . . 335
Format of Records in Your Implementation . . . . . 335
Creating the Update Adapter . . . . . 336
Dimension Components . . . . . 337
Dimension Adapters . . . . . 337
Dimension Server . . . . . 338
Naming Format of Update Source Data Files . . . . . 339
Index Configuration . . . . . 340
Record Specification Attribute . . . . . 340
Navigation Engine Configuration . . . . . 341
Dgidx Flags . . . . . 343
Control Script Development and Execution . . . . . 343
Directory Structure for Updates . . . . . 344
Running the Baseline Updates Script . . . . . 346
Step 1: Delete Old Updates . . . . . 346
Step 2: Run Forge . . . . . 347
Step 3: Run Dgidx . . . . . 347
Step 4: Stop the Navigation Engine . . . . . 347
Step 5: Move the Index Files to the Dgraph Directory . . . . . 348
Step 6: Start the Navigation Engine . . . . . 348
Running the Partial Updates Script . . . . . 349
Step 1: Run Forge on the New Source Data . . . . . 349
Step 2: Apply a Timestamp to the Record File . . . . . 350
Step 3: Update the Navigation Engine . . . . . 351
Adding Other Bricks . . . . . 352
URL Update Command Parameters . . . . . 353
Partial Updates in Agraph Implementations . . . . . 355
Choosing a Distribution Strategy . . . . . 355
How the Agraph Partitions Handle Updates . . . . . 357
Use of Record Spec . . . . . 358
Naming Convention for Source Data Files . . . . . 358
Random Distribution Format . . . . . 358
Deterministic Distribution Format . . . . . 359


Configuring the Partial Updates Pipeline . . . . . 360
Configuring the Record Adapter . . . . . 360
Configuring the Record Manipulator . . . . . 361
Configuring the Update Adapter . . . . . 365
Control Script for Agraph Updates . . . . . 367
Forge Partial Updates Brick . . . . . 368
Distributing the Forge Output to the Dgraphs . . . . . 368

Chapter 18 Using the Agraph

What You Should Know First . . . . . 371
Overview of Distributed Query Processing . . . . . 371
Agraph Query Processing . . . . . 372
Data Foundry Processing . . . . . 373
Guidance about When to Use an Agraph . . . . . 376
Implementation Overview . . . . . 376
Modifying the Project for Agraph Partitions . . . . . 376
Provisioning an Agraph Implementation . . . . . 378
Running an Agraph Implementation . . . . . 383
Agraph Presentation API Development . . . . . 384
Agraph Limitations . . . . . 385
Agraph Performance . . . . . 386
Control Script Environment Considerations . . . . . 386
Arranging Partitions and Files . . . . . 386
Agraph and Dynamic Business Rules . . . . . 387

Chapter 19 Using Internationalized Data

Installing the Supplemental Language Pack . . . . . 390
Specifying the License Key . . . . . 391
Configuring Forge Components for Languages . . . . . 391
Setting the Encoding for the Incoming Source Data . . . . . 391
Specifying the Language for Documents . . . . . 393
Forge Language Support Table . . . . . 394
Performance Considerations for Language Identification . . . . . 397


Configuring Languages for the Navigation Engine . . . . . 398
Using Language Identifiers . . . . . 398
Specifying a Global Language ID . . . . . 400
Specifying a Per-Record Language ID . . . . . 401
Specifying a Per-Dimension/Property Language ID . . . . . 401
Specifying a Per-Query Language ID . . . . . 402
Configuring Language-Specific Spelling Correction . . . . . 403
Using Encoding in the Web Application . . . . . 405
Setting the Encoding for URLs . . . . . 405
Setting the Page Encoding . . . . . 406

Viewing Navigation Platform Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . 406

Index


Preface

The Endeca® Navigation Platform is the foundation for building applications based on Endeca Navigation Engine® technology. With the Endeca Navigation Platform, you can build solutions that allow your users to quickly, precisely, and easily search and navigate through large data sets, avoiding all the traditional problems associated with information overload and finding information online. Endeca applications generate precise, relevant results with sub-second response times, even across very large data sets.

The Endeca Navigation Platform allows you to build Guided Navigation® functionality into your Web applications. The Endeca Guided Navigation solution puts the results of all search, navigation, and analytic queries in an organized context that shows users precisely how to refine and explore further. This helps solve the problems associated with information overload by guiding users as they quickly, precisely, and easily navigate through large data sets. The Endeca Navigation Platform is based on technology that makes it possible to scale to very large data sources and user loads while running on low-cost hardware.


About This Guide

The Endeca Advanced Features Guide provides procedures for implementing advanced Endeca features such as the Content Acquisition System and partial updates.

Who Should Use This Guide

This guide is intended for developers while they are implementing an Endeca system.

Symbols and Conventions

IMPORTANT: Text marked as important requires special attention.

Note: Notes provide related information, recommendations, and suggestions.

The Endeca documentation set uses the following symbols and conventions:

1. Numbered lists, when the order of the items is important.

a. Alphabetical lists, when the order of secondary items is important.

• Bulleted lists, when the order of the items is unimportant.

Advanced Features Guide Endeca Confidential


Italic text represents variables you should substitute a value for, such as:

C:\RootDirectory\MyDirectory\MyFile

Italic text may also indicate new terms that appear in the Endeca Glossary.

Courier text indicates code snippets or commands that you should enter exactly as they are written in the documentation.

Endeca Documentation Set

Note: In addition to the documentation deliverables listed below, you can find useful information, including the Endeca Performance Tuning Guide, in the knowledge base on the Endeca Customer Support site at https://customers.endeca.com.

The Endeca documentation set consists of the following:

• Endeca Installation Guide for UNIX and Endeca Installation Guide for Windows describe how to install Endeca software.

• Endeca Migration Guide provides information on migrating from previous versions of Endeca software.

• Endeca Concepts Guide introduces the critical concepts you should understand before learning how to build an Endeca application. The information in this guide is the foundation upon which all the other Endeca documentation depends.


• Endeca Developer's Guide for Java, Endeca Developer's Guide for COM, and Endeca Developer's Guide for .NET provide an overview of the Endeca development process as well as procedures and code snippets for all non-advanced Endeca development tasks.

• Endeca Advanced Features Guide provides procedures for implementing advanced Endeca features such as the Content Acquisition System and partial updates.

• Endeca Administrator's Guide for UNIX and Endeca Administrator's Guide for Windows provide information on using Endeca's administrative and logging tools to configure and manage your Endeca implementation, and create logging reports.

• Endeca Tools Guide provides information on configuring and administering Endeca tools, including the Endeca Manager, Endeca Developer Studio, and Endeca Web Studio.

• Endeca Developer Studio Help provides online information for developing data pipelines using the Endeca Developer Studio.

• Endeca Web Studio Help provides online information for the administrative tasks, as well as search and merchandising configuration, that you can do using Endeca Web Studio.

• Endeca Javadocs provide online access to class and method descriptions for the Java version of the Presentation and Logging APIs.


• Endeca API Guide for COM, Endeca API Guide for Perl, and Endeca API Guide for .NET provide class and method descriptions for the COM, Perl, and .NET versions of the Presentation and Logging APIs. The Endeca API Guide for .NET is in an online format.

• Endeca Security Guide for Java and Endeca Security Guide for .NET and COM describe how to implement user authentication and how to structure your data to limit access to only those users with the correct permissions. The Java version of this guide also provides information on using SSL certificates and encryption to secure your Endeca application.

• Endeca Performance Tuning Guide provides guidelines on monitoring and tuning the performance of the Endeca Navigation Engine. It also contains tips on resolving associated operational issues.

• Endeca Content Adapter Developer's Guide describes the Content Adapter Development Kit (CADK), a framework that provides developers with a flexible and simple mechanism to extract data from a data source and load it into Forge. The CADK is only available from Endeca customer support.

• Endeca Data Indexing API Guide provides class and method descriptions of the Data Indexing API and describes how to use the API to move source data to the Forge directory and run updates.

• Endeca Forge API Guide for Perl provides online information for the class and method descriptions of the Perl Manipulator component. You can use a Perl manipulator within a data pipeline to perform record manipulation.

• Endeca XML Reference provides detailed, online reference information for the XML files used in a Data Foundry pipeline.

• Endeca Glossary defines terms used in the Endeca Navigation Platform documentation set.

• Release Announcement describes the major new features and changes for the release.

• Release Notes detail the changes specific to the release, including bug fixes and new features.

• Endeca Third-Party Software Usage and Licenses provides copyright, license agreement, and/or disclaimer of warranty information for the third-party software packages that Endeca incorporates.

Contacting Endeca Standard Customer Support

You can contact Endeca Standard Customer Support through the online Endeca Support Center (https://customers.endeca.com).

The Endeca Support Center provides registered users with important information regarding Endeca software, implementation questions, product and solution help, training and professional services consultation as well as overall news and updates from Endeca.

SECTION I: Data Import Features


Chapter 1

Content Acquisition System

The Content Acquisition System (CAS) provides the capability to crawl file systems, HTTP hosts, and HTTPS hosts to fetch documents in a variety of formats. You use a record adapter to read in the documents to a CAS pipeline. Once read into the CAS pipeline, Forge processes the documents and converts them into Endeca records. These records can contain property values, dimension values, and meta-data based on each document’s content. You can then build an Endeca application to access the records and allow your application users to search and navigate the document contents contained in the records.

Note: You can use either Endeca InFront® or ProFind® to build a CAS pipeline; however, InFront can process only .html and .txt documents. ProFind can process over 200 document types. See “Formats Supported by ProFind” on page 81 for the full list of document types that ProFind supports.

Sections of This Chapter

This chapter is divided into the following sections:

• The introductory sections, which provide an overview of CAS and its components.


• “URL and Record Processing”, which describes the flow and processing of URLs and records in a CAS pipeline.

• “Source Documents and Endeca Records”, which describes the relationship of source documents that CAS crawls with the Endeca records that CAS produces.

• “Creating a Full Crawling Pipeline”, which provides the procedure to create a CAS pipeline.

• “Properties Generated by CAS”, which describes each property generated by the components in a CAS pipeline.

• “Formats Supported by ProFind”, which lists the source document types that ProFind can process.

CAS and Security Information

The CAS also supports accessing hosts that require basic client authentication and HTTPS authentication. For details on setting up the CAS to crawl secure resources, see the chapter “Web Crawling with Authentication” on page 89.

In addition, the CAS can be used to support the Access Control System. The CAS generates access control list (ACL) properties for each record. These properties can be used in conjunction with security login modules to limit access to records based on user login profiles. (Login modules are a part of the Access Control System.) For details on gathering security information, see the Endeca Security Guide.

Components that Support CAS

Developer Studio exposes CAS functionality using the following components. These components are the core of a CAS pipeline.

• A spider—Crawls documents starting at the root URLs you specify. In the spider component, you indicate the root URLs from which to begin a crawl, URL filters to determine which documents to crawl, as well as other configuration information that specifies how the crawl should proceed. This information may include timeout values for a crawl, proxy server values, and so on. The spider crawls the URLs and manages a URL queue that feeds the record adapter.

• A record adapter configured to read documents—Receives URLs from the spider and creates an Endeca record for each document located at a URL. Each record contains a number of properties, one of which is the record’s identifying URL. A downstream record manipulator uses the record identifier to retrieve the document and extract its data. Unlike basic pipelines, which use a record adapter to input source data from a variety of formats, a CAS pipeline uses a record adapter to input URLs provided by the spider. In a basic pipeline, the format type of a record adapter matches the source data, for example, delimited, XML, fixed-width, ODBC, and so on. In a CAS pipeline, the format type of a record adapter must be set to Document. See “URL and Record Processing” on page 28 for further explanation of the flow of records and URLs among CAS pipeline components.

• A record manipulator incorporating CAS expressions—Contains several Data Foundry expressions that support crawling and document processing tasks. At a minimum, a record manipulator contains one CAS expression to retrieve a URL based on the record's identifier and a second expression to extract and convert the document’s content to text. In addition, you can include optional expressions to identify the language of a document, remove temporary properties after processing is complete, or perform a variety of other processing tasks.

In addition to the CAS components listed above, a CAS pipeline can use other components common to basic pipelines, for example, a dimension adapter, dimension server, property mapper, indexer adapter, and so on. For the sake of simplicity, this feature document does not emphasize the common components but rather focuses on creating, configuring, and explaining the components specific to CAS.

CAS Reference Implementation

The Endeca Navigation Platform includes a sample CAS reference implementation that you can examine and run. This project is stored in ENDECA_ROOT\reference\sample_CAS_data. The CAS reference implementation crawls http://endeca.com and produces Endeca records that can be searched with user-provided search terms and navigated according to the meta-data properties for the records. In other words, the navigation controls are based on meta-data properties, such as date modified, encoding, MIME type, file size, fetch status, and so on.

Full Crawls versus Differential Crawls

There are two types of CAS crawls a spider can perform:

• Full crawl—A crawl in which the spider retrieves all the documents that it is configured to access. A full crawl in a CAS pipeline is analogous to a full update in a basic data pipeline. This feature document describes a full crawl.

• Differential crawl—A re-crawl in which the spider retrieves only the documents that have changed since the last crawl. The differential crawl URL specified on the General tab of the Spider editor indicates a file in which Forge stores URLs and metadata about URLs. By reading this file at the beginning of a crawl, the spider can detect source documents that have been modified (updated, added, and so on) since the last crawl. For more information on creating a differential crawl, contact your Endeca technical team.


URL and Record Processing

As mentioned in the introduction, Developer Studio exposes crawling and text extraction functionality in the context of a pipeline. It is important to understand how this functionality fits into the Forge processing framework.

The following figure shows a diagram of a full crawling pipeline. There are two kinds of flow in the pipeline:

• URLs flow from the spider to the record adapter (a record adapter that uses the Document format).

• Documents flow into the indexer adapter and get turned into Endeca records.

When Forge executes this pipeline using Developer Studio, the flow of URLs and records is as follows:

1. The terminating component (indexer adapter) requests the next record from its record source (property mapper).


2. At this point, the property mapper has no record, so the property mapper asks its record source (spider) for the next record.

3. The spider has no record, so the spider asks its record source (record manipulator) for the next record.

4. The record manipulator also has no record, so it passes the request for the next record upstream to the record adapter (with format type Document).

5. The record adapter then asks the spider for the next URL it is to retrieve (the first iteration through, this is the root URL configured on the Root URL tab of the Spider editor).

6. Based on the URL that the spider provides, the record adapter creates a record containing the URL and a limited set of metadata.

7. The created record then flows down to the record manipulator where the following takes place:

• The document associated with the URL is fetched (using the RETRIEVE_URL expression).

• Content (searchable text) is extracted from the document (using the CONVERTTOTEXT or PARSE_DOC expression).

• Any URLs in the text are also extracted for additional crawling.

8. The record then moves to the spider where additional URLs (those extracted in the record manipulator) are queued for crawling.


9. The property mapper maps source properties to dimensions and to Endeca properties.

10. The indexer adapter receives the record and writes it out to disk.

The process repeats until there are no URLs in the URL queue maintained by the spider.
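The pull-driven loop above can be sketched as a simple breadth-first crawl. This is an illustrative Python sketch, not Forge code: the fetch and extract_links callables stand in for the RETRIEVE_URL expression and link extraction in the record manipulator, while the dictionary keys use the property names CAS generates.

```python
from collections import deque

def crawl(root_url, fetch, extract_links):
    """Minimal sketch of the spider's pull loop: one record is produced per
    fetched URL until the URL queue drains. Illustrative only."""
    queue, seen, records = deque([root_url]), {root_url}, []
    while queue:                      # repeats until the URL queue is empty
        url = queue.popleft()         # spider hands the next URL to the record adapter
        body = fetch(url)             # stand-in for RETRIEVE_URL in the manipulator
        records.append({"Endeca.Identifier": url, "Endeca.Document.Text": body})
        for link in extract_links(body):   # extracted URLs are queued by the spider
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return records
```

The seen set plays the role of the spider's redundant-URL check described in the next section.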

Redundant URLs

To minimize the possibility of creating redundant records in an application, the spider avoids enqueuing a URL that is already queued or that has already been retrieved. The spider does this by comparing URLs to determine if they are equivalent.

Even in cases where flexible URL formatting allows two URLs to be different strings and yet point to the same resource, the spider can determine that both URLs are equivalent, and does not queue both URLs for processing. For example, http://www.endeca.com:80/about/../index.shtml and http://www.endeca.com/index.shtml point to the same resource. The spider recognizes the equivalence and does not queue both URLs.
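The kind of equivalence check described above can be approximated by canonicalizing each URL before comparison. A minimal Python sketch (not Endeca's actual algorithm) that collapses ".." segments and drops default ports:

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

def canonicalize(url):
    """Reduce a URL to a canonical string so that equivalent URLs compare
    equal: collapse "." and ".." path segments, lowercase the host, and
    drop the port when it is the scheme's default. Illustrative only;
    trailing-slash handling and other corner cases are ignored."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    default = {"http": 80, "https": 443}.get(parts.scheme)
    host = parts.hostname.lower()
    netloc = host if parts.port in (None, default) else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))
```

Under this normalization, http://www.endeca.com:80/about/../index.shtml and http://www.endeca.com/index.shtml reduce to the same string, so only one copy would be queued.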

However, there is no foolproof way to determine if two non-equivalent URLs point to the same resource. If the following configurations are present in a crawling environment, it is possible to cause the spider to queue URLs that point to the same resource:


• Symbolic links: For example, if file:///data.txt is a symbolic link to file:///machineA/directory/data.txt the spider queues both URLs and processes data.txt twice.

• Virtual hosts: For example, if a spider was set up to crawl both http://web.mit.edu/index.html and http://www.mit.edu/index.html, it would crawl the same resource twice because one URL is an alias for the other.

• IP Address/DNS name equivalence: The spider does not perform reverse DNS look ups to determine if two URLs are equivalent. For example, if a spider is set up to crawl http://www.mit.edu/index.html and http://18.181.0.31/index.html, it crawls the same resource twice because the second URL is the IP address of the first.
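As the last bullet notes, the spider does not resolve host names, which is why aliases and IP-literal URLs are crawled twice. A hypothetical check that would catch such cases is sketched below; the resolver is injectable so the sketch runs without network access. This is illustrative only, and shared IP addresses (virtual hosting) make the heuristic unreliable in practice.

```python
import socket
from urllib.parse import urlsplit

def same_host(url_a, url_b, resolve=socket.gethostbyname):
    """Resolve both host names and compare the resulting IP addresses.
    The spider does NOT perform this check; this sketch only shows what
    such a reverse-lookup heuristic would look like."""
    return resolve(urlsplit(url_a).hostname) == resolve(urlsplit(url_b).hostname)
```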

Source Documents and Endeca Records

Note: For a general overview of Endeca records, see “Understanding Endeca Records, Endeca Properties, and Dimensions” in the Endeca Concepts Guide.

In the context of a CAS application, Endeca records represent the data in the source documents that a spider crawls. CAS provides the means to get the source documents into a pipeline. The source documents themselves may reside on a file system, HTTP, or HTTPS host and be in a wide variety of file formats (common examples include .pdf, .html, .doc, and .txt). As is the case with basic (non-CAS) pipelines, the source documents themselves are not modified in any way by CAS pipeline processing.

Here is an example source document from the CAS reference implementation. The reference implementation crawls http://endeca.com; this file is the homepage for that URL. The .html file has a title, text that describes Endeca solutions, and links to other areas of the Web site.

[Figure: the endeca.com homepage, with callouts for the title, the document text, and links to more URLs]


Suppose a full crawl has been configured to crawl and process this document. In a basic CAS application, the document content on the .html page is crawled, converted to text, and stored in a single Endeca property named Endeca.Document.Text. (Forge properties generated in a CAS pipeline will be explained in later sections.) In addition, CAS crawls the links to additional URLs and queues them for further crawling and text extraction.

If only the Endeca.Document.Text and Endeca.Title properties are mapped using the property mapper, the Endeca record for this document looks like this:

[Figure: the resulting Endeca record, with callouts marking the title and the points where the “Solutions for Industry” and “Solutions for Enterprise Search” sections begin]


Notice the following correspondence between the source document and the Endeca record:

• The heading in blue is the document’s title.

• The first line in the record’s Endeca.Document.Text property also lists the source document’s title and the first line under the document’s main graphic. The text begins “Award Winning…”

• The remaining lines correspond to the two sections of the source document called “Solutions for Industry” and “Solutions for Enterprise Search.”

Note: The order in which a source document’s data appears in an Endeca record depends upon the structure of the actual source document. The ordering shown here is an example.

Although this example is useful for illustrative purposes, such a record is not very useful to application users. It shows the simplest possible relationship between a source document and an Endeca record with two properties (title and text). An application for users is not likely to have all of a document's data contained in a single property.

In a more user-oriented application, a CAS pipeline might include Perl code to parse properties from the document text and use those to build Endeca properties and dimensions. Alternatively, a CAS pipeline might build dimensions based on any of the meta-data properties that CAS generates for a record. The CAS reference implementation is designed in this way: it builds dimensions based on meta-data properties of a record.
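A manipulator along those lines might derive a title property and cleaned body text from raw HTML. This Python sketch is a toy stand-in for such Perl code; the naive regex parsing and the derived property names are illustrative, not part of the product.

```python
import re

def text_properties(html):
    """Derive a title and a whitespace-collapsed body text from raw HTML.
    Deliberately naive; real pipelines rely on CAS text extraction."""
    props = {}
    m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if m:
        props["Endeca.Title"] = m.group(1).strip()
    body = re.sub(r"<[^>]+>", " ", html)          # strip tags
    props["Endeca.Document.Text"] = " ".join(body.split())
    return props
```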


Suppose the example shown so far were modified to show a record page based on the same source document with the meta-data properties that Forge can generate. From the meta-data properties available, the pipeline is set up to expose properties such as encoding, date modified, application type, fetch status, and so on, for use as dimensions. These properties are mapped with a property mapper component to provide both record details and navigation controls in the application. (See the Developer Studio Help for information on these tasks that are common to both CAS and basic pipelines.)

Re-running a full crawl would produce an Endeca record for the source document that looks like this:


The properties Forge generates are prefixed with Endeca. A user-oriented application may not employ all of the properties that Forge generates, but it is useful to see some of them here. Notice the following changes to the revised Endeca record:

• All meta-data properties in the source document appear with the record, rather than just Endeca.Document.Text and Endeca.Document.Title as shown previously.

• There are dimension values based on meta-data properties for encoding, fetch status, and MIME type.

To build a record page that displays all properties, the property mapper must be configured to map all of the Forge-generated source properties to Endeca properties. (See the Developer Studio Help for information on these tasks that are common to both CAS and basic pipelines.)

Property Name Syntax

During a crawl, Forge produces properties according to a standardized naming scheme. The naming scheme is made up of a qualified name with a period (.) to separate qualifier terms. The first term Endeca indicates that the property was automatically created by Forge. There may be any number of additional terms after Endeca depending on the property value being described.

Simple properties require only one additional term to fully qualify the property, for example, Endeca.Identifier or Endeca.Title. Often, a second term describes a property category and a third term fully qualifies the property, for example, Endeca.Document.Body or Endeca.Fetch.Timeout. Less frequently, properties require additional terms to be fully qualified.

The following table provides an overview of naming syntax and includes several common property examples.

For a description of each property generated by CAS, see “Properties Generated by CAS” on page 73.

Common Qualifier Terms

• Document: Text of the document or metadata about the document. Examples: Endeca.Document.Text, Endeca.Document.Revision, Endeca.Document.MimeType, Endeca.Document.Language

• Fetch: Configuration information about how the document is retrieved. Examples: Endeca.Fetch.ConnectTimeout, Endeca.Fetch.Proxy

• ACL: Security information (access control list) about the document. Examples: Endeca.ACL.Allow.Read, Endeca.ACL.Allow.Write, Endeca.ACL.Allow.Execute

• Relation: Reference information to other documents. Example: Endeca.Relation.References
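Given the scheme above, a property name can be decomposed mechanically: split on the period separator, verify the leading Endeca term, and treat the remaining terms as a category followed by increasingly specific qualifiers. A small illustrative sketch:

```python
def parse_property(name):
    """Split a Forge-generated property name into its qualifier terms.
    The first term must be 'Endeca'; remaining terms narrow the meaning."""
    terms = name.split(".")
    if terms[0] != "Endeca":
        raise ValueError("not a Forge-generated property: " + name)
    return {"category": terms[1] if len(terms) > 1 else None,
            "qualifiers": terms[2:]}
```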


Viewing all Properties Generated by CAS

Depending on the type of source documents a spider crawls, Forge can generate dozens of properties for each record. Although you may or may not employ all the properties in your application, it is useful to see which properties are available. You control which properties are available by modifying the property mapper in your CAS pipeline to map all source properties to Endeca properties.

Note: The following procedure assumes you created a full crawling pipeline and can also access your Endeca records via an Endeca application.

To view all properties:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.

2. Double-click the property mapper. The Property Mapper editor displays.

3. Click the Advanced tab of the Property Mapper editor.

4. Check “If no mapping is found, map source properties to Endeca:” and then click Properties.

5. Click OK.

6. Perform a full update.

7. Start an Endeca application, for example, a generic JSP reference implementation, and view your Endeca records.


See the Endeca Developer Studio Help for more information about configuring the property mapper and running full updates.

Creating a Full Crawling Pipeline

The remaining sections of this feature document describe how to create and configure a full crawling pipeline using Developer Studio. As mentioned in the introduction, the goal of the section is to describe the pipeline components that are specific to crawling. Therefore, components that you would create in both a crawling pipeline and in a basic pipeline (dimension server, property mapper, indexer adapter, and so on) are omitted here for simplicity. The document focuses on the processing loop for a crawling pipeline that is made up of the record adapter, record manipulator, and spider components.

Note: Creating a differential crawl pipeline requires more than just providing a URL to store metadata about URLs. In addition to providing the URL, you have to create a differential pipeline, which requires a different design than that of a full crawl. For more information on creating a differential crawl pipeline, contact your Endeca technical team for assistance.

The high-level overview of a full crawling pipeline is as follows:


1. Create a record adapter to read documents (required). See “Creating a Record Adapter to Read Documents” on page 41.

2. Create a record manipulator to perform the following tasks. See “Creating a Record Manipulator” on page 43.

• Retrieve documents from a URL (required).

• Extract and convert document text for each URL (required).

• Identify the language of a document (optional).

• Remove document body properties (optional).

3. Modify records with a Perl manipulator (optional). See “Modifying Records with a Perl Manipulator” on page 54.

4. Create a spider to send URLs to the record adapter (required). See “Creating a Spider” on page 55.

• Provide root URLs from which to start a crawl (required).

• Configure URL extraction settings (required).

• Specify a record source for the spider (required).

• Specify spider settings such as timeout values and proxy servers (optional).

5. Create a record manipulator to remove any unnecessary records after processing (optional). See “Removing any Unnecessary Records after a Crawl” on page 68.


Here is an example of a CAS pipeline that calls out the core CAS components and also shows the components common to both basic and CAS pipelines (that is, dimension flow, a property mapper, an indexer adapter, and so on).

[Figure: pipeline diagram, with callouts for the record adapter of type Document, the record manipulator, and the spider]

Creating a Record Adapter to Read Documents

A record adapter reads in the documents associated with the URLs provided by the spider component, and creates a record for each document. As long as the spider has URLs queued, the record adapter creates a record for each URL until all are processed.


To create a Document record adapter:

1. Start Developer Studio.

2. From the File menu, select New Project.

3. In the Project tab of Developer Studio, double-click Pipeline Diagram.

4. In the Pipeline Diagram editor, click New.

5. Select Record > Adapter. The Record Adapter editor displays.

6. In the Name text box, type in the name of this record adapter.

7. In the Direction frame, make sure the Input option is selected.

8. From the Format drop-down list, choose Document.

9. Leave the URL text box empty, and leave Filter Empty Properties and Multi File unchecked. These settings are ignored by a record adapter configured for the Document format.

10. Enter a language encoding in the Encoding text box if you know that all of the source documents use the same encoding. If you do not provide an encoding value, CAS automatically attempts to determine the encoding of each document by either requesting that information from the Web server or by examining the document’s body.

11.Click the Pass Throughs tab of the record adapter.

Note: You may have to use the left/right arrows to scroll to the Pass Throughs tab.


12.Enter URL_SOURCE in the Name text box and enter the name of the spider component in the Value text box. You will create and configure the spider component later in “Creating a Spider” on page 55. For now, you only have to choose the name of the spider. The URL source is required and must name a spider component.

13.Click Add.

14.Click OK to add the new record adapter to the project.

15.From the File menu, choose Save.

For a description of each property generated by the record adapter, see “Properties Generated by CAS” on page 73.

Creating a Record Manipulator

Expressions in a record manipulator perform document retrieval, text extraction, language identification, record or property clean up, and other tasks related to crawling. These expressions are evaluated against each record as it flows through the pipeline, and the record is changed as necessary.

At a minimum, a CAS pipeline requires a record manipulator with two expressions, one to retrieve documents (RETRIEVE_URL) and another to convert documents to text (CONVERTTOTEXT or PARSE_DOC). In addition to these expressions, you can include other optional expressions to identify the language of documents (ID_LANGUAGE) or delete the temporary files created on disk by RETRIEVE_URL (using REMOVE_EXPORTED_PROP).

The expressions associated with these operations are described in the sections below.

To create a record manipulator:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.

2. In the Pipeline Diagram editor, click New.

3. Select Record > Manipulator. The New Record Manipulator editor displays.

4. In the Name text box, type in the name of this record manipulator.

5. From the Record source drop-down list, choose the name of the record adapter that you created in “Creating a Record Adapter to Read Documents” on page 41.

6. Click OK to add the new record manipulator to the project.

7. From the File menu, choose Save.

8. If you are ready to add the expressions described in the sections below, double-click the record manipulator in your pipeline diagram. The Expression editor displays.


Adding a RETRIEVE_URL Expression

The RETRIEVE_URL expression is required in a CAS pipeline to retrieve a document from its URL and store the document in a file on disk. A STRING DIGEST sub-expression of RETRIEVE_URL typically determines the name of the file in which the document is stored. RETRIEVE_URL places the file’s location into the Endeca.Document.Body property. Later in pipeline processing, a text extraction expression examines Endeca.Document.Body and converts the body content into text stored in Endeca.Document.Text.

Forge also places any metadata it can retrieve about the document in properties on the record.

To add RETRIEVE_URL to a record manipulator:

1. If the Expression editor is not already open, double-click the Pipeline Diagram on the Project tab of Developer Studio.

2. Double-click the record manipulator. The Expression editor displays.

3. Starting at the first line in the Expression editor, insert a RETRIEVE_URL expression using the example below as a guide. The nested sub-expressions within RETRIEVE_URL configure how it functions. Here are several important points to consider when configuring RETRIEVE_URL:

• A STRING sub-expression is required to name a file created to store the document content for a URL. Typically, you use a STRING DIGEST expression to create a shorter property identifier (a digest) of the URL indicated by PROP_NAME. This digest is necessary because URLs may contain values that are invalid for use as file names. DIGEST creates a file name based on the URL but uses only the characters a-f and the numbers 0-9, so the file name is valid.

• The VALUE expression node in the CONST expression specifies the path where the contents of each URL are stored on disk after retrieval.

• The PROP_NAME expression node in the DIGEST expression specifies the property that contains the URL to retrieve. The default name of this property is Endeca.Identifier.
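
Based on the points above, a RETRIEVE_URL expression takes roughly the following shape. This is a sketch only: the storage path in the CONST expression is illustrative, and the exact sub-expression and node syntax should be confirmed in the Endeca XML Reference.

<EXPRESSION NAME="RETRIEVE_URL" TYPE="VOID">
   <EXPRESSION NAME="CONST" TYPE="STRING">
      <EXPRNODE NAME="VALUE" VALUE="../partition0/state/docs/"/>
   </EXPRESSION>
   <EXPRESSION NAME="DIGEST" TYPE="STRING">
      <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Identifier"/>
   </EXPRESSION>
</EXPRESSION>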

4. Click Check Syntax to ensure the expressions are well formed.

5. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.

The following properties contain values that are passed as parameters to RETRIEVE_URL. The property values configure additional fetching options. The Endeca.Fetch properties exist for a record if you provided values on the Timeout tab, the Proxy tab, or in the User Agent text box of the Spider editor.

• Endeca.Fetch.Timeout


• Endeca.Fetch.ConnectTimeout

• Endeca.Fetch.TransferRateLowSpeedLimit

• Endeca.Fetch.TransferRateLowSpeedTime

• Endeca.Fetch.Proxy

• Endeca.Fetch.UserAgent

For a description of each property generated by the RETRIEVE_URL expression, see “Properties Generated by CAS” on page 73.

Converting Documents to Text

An expression such as CONVERTTOTEXT or PARSE_DOC is required in a CAS pipeline to extract document content from the file created by RETRIEVE_URL and convert the content into text.

If you are using Endeca ProFind, you can use CONVERTTOTEXT to convert over 200 document types into text. If you are using Endeca InFront, you can use PARSE_DOC to convert .html and .txt documents. See “Formats Supported by ProFind” on page 81 for the full list of document types that ProFind supports.

After a record manipulator retrieves a URL and stores a path to the file in Endeca.Document.Body, a text extraction expression examines the file indicated by Endeca.Document.Body, extracts the document body from the file, and converts the document body into text. The text is stored by default in Endeca.Document.Text.


To guide text extraction and conversion, the text extraction expression refers to the Endeca.Document.MimeType and Endeca.Document.Encoding properties. If no Endeca.Document.Encoding exists, Forge attempts to identify the encoding automatically. See “Identifying the Language of the Documents” on page 50 if you want to explicitly identify encoding.

As the document body is being extracted from the file and converted to text, the expression examines the document body for any URLs. The text extraction expression adds any URLs it finds as Endeca.Relation.References properties to the record.

For example, if a product overview document contains links to ten product detail pages, the Endeca record for the overview document will have ten Endeca.Relation.References properties – one for each product detail link. When the record for this document is passed to the downstream spider component, the spider queues the URLs in each Endeca.Relation.References property and crawls them. This process continues until CAS processes all URLs contained in a document.

One of the following text extraction expressions must be included in a CAS pipeline:

• CONVERTTOTEXT expression – Extracts documents based on content-type, converts the document body to text, and extracts any URL links contained in the document. This expression uses a document conversion library to convert files from more than 200 different document types into text. See “Formats Supported by ProFind” on page 81 for the complete list of supported document formats. Using CONVERTTOTEXT subsumes the functionality of PARSE_DOC by doing the same extraction and conversion on .html and .txt documents. CONVERTTOTEXT is only available as part of Endeca ProFind.

• PARSE_DOC expression – Extracts .html and .txt documents, converts the document body to text, and extracts any URL links contained in the document. PARSE_DOC is available as part of both Endeca InFront and Endeca ProFind.

To add a text extraction expression to a record manipulator:

1. In the Pipeline diagram of Developer Studio, double-click the record manipulator. The Expression editor displays.

2. After the RETRIEVE_URL expression, add either the CONVERTTOTEXT or the PARSE_DOC expression using the examples below as a guide. No nested expressions or expression nodes are required.

• For CONVERTTOTEXT:

<EXPRESSION NAME="CONVERTTOTEXT" TYPE="VOID"/>

• For PARSE_DOC:

<EXPRESSION NAME="PARSE_DOC" TYPE="VOID"/>

3. Click Check Syntax to ensure the expressions are well formed.


4. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.

For a description of each property generated by either of the text extracting and converting expressions, see “Properties Generated by CAS” on page 73.

Identifying the Language of the Documents

If you crawl a set of source documents where each document may be in a different language, and an aspect of your application depends on identifying that language, you can use the ID_LANGUAGE expression in your record manipulator to identify the language of each document explicitly. For example, an application might organize documents so that users can navigate them by language.

The ID_LANGUAGE expression examines a property that you specify, determines the language of the document, and tags the record with a corresponding language value in an Endeca.Document.Language property. ISO 639 lists the valid language codes. See http://www.oasis-open.org/cover/iso639a.html for a full list of the language codes.


If you do not use the ID_LANGUAGE expression in your pipeline, the RETRIEVE_URL expression attempts to determine a language value based on the Content-type header of the document that a Web server returns to CAS. If no value exists for the Content-type header, then a text extraction expression (for example, CONVERTTOTEXT or PARSE_DOC) attempts to determine the encoding value and to generate the property.

The advantage of explicitly using the ID_LANGUAGE expression is twofold: you can specify which property to examine, and you can modify the number of bytes examined in that property. Increasing the number of bytes leads to more accurate language detection; decreasing it improves processing performance.

To identify the language of a document:

1. In the Pipeline view of Developer Studio, double-click the record manipulator. The Expression Editor displays.

2. After the text extraction expression, add the ID_LANGUAGE expression using the example below as a guide. Here are several important points to consider when configuring ID_LANGUAGE:

• The PROPERTY expression node specifies the property to use for language identification. Typically, this is the Endeca.Document.Body property.

• The LANG_PROP_NAME expression node specifies the property to store a value representing the language of the document. If unspecified, the value is stored in Endeca.Document.Language.

• The LANG_ID_BYTES expression node specifies the number of bytes Forge uses to determine the language. A larger number provides a more accurate determination, but requires more processing time. The default value is 300 bytes.

3. Click Check Syntax to ensure the expressions are well formed.

4. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

Here is an example of an ID_LANGUAGE expression configured to examine 500 bytes of Endeca.Document.Body.
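
A sketch of such an expression follows; it uses the node names described above, and the exact syntax should be confirmed in the Endeca XML Reference.

<EXPRESSION NAME="ID_LANGUAGE" TYPE="VOID">
   <EXPRNODE NAME="PROPERTY" VALUE="Endeca.Document.Body"/>
   <EXPRNODE NAME="LANG_PROP_NAME" VALUE="Endeca.Document.Language"/>
   <EXPRNODE NAME="LANG_ID_BYTES" VALUE="500"/>
</EXPRESSION>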

For additional information on expression configuration, see the Endeca XML Reference.

For a description of each property generated by ID_LANGUAGE, see “Properties Generated by CAS” on page 73.

Removing Document Body Properties

As a system cleanup task, you may want to remove the files indicated by each record’s Endeca.Document.Body property. These files are no longer necessary after the text extraction expression runs. This is an optional task in a CAS pipeline that can occur after a text extraction expression evaluates each record.

As part of CAS document processing, the following two steps occur in the record manipulator:

• RETRIEVE_URL retrieves a URL and automatically exports its contents to a file indicated by Endeca.Document.Body.

• A text extraction expression, for example, CONVERTTOTEXT or PARSE_DOC, examines the file indicated by Endeca.Document.Body, converts the contents of the file to text, and stores the text in Endeca.Document.Text.

After the text extraction expression completes, you can use a REMOVE_EXPORTED_PROP expression to remove the exported file indicated by Endeca.Document.Body and also the Endeca.Document.Body property if desired.

To add REMOVE_EXPORTED_PROP to a pipeline:

1. In the Pipeline view of Developer Studio, double-click the Record Manipulator. The Expression Editor displays.

2. After the text extraction expression (either CONVERTTOTEXT or PARSE_DOC), add a REMOVE_EXPORTED_PROP expression using the example below as a guide. Here are several important points to consider when configuring REMOVE_EXPORTED_PROP:


• The PROP_NAME expression node specifies the name of the property that indicates the file to remove. Typically, this is the Endeca.Document.Body property.

• The URL expression node specifies the URL that files were written to (by RETRIEVE_URL). This value may be either an absolute path or a path relative to the location of the Pipeline.epx file.

• The PREFIX expression node specifies any prefix used in the file name to remove.

• The REMOVE_PROPS expression node specifies whether to remove the property from the record after deleting the file where the property was stored. TRUE removes the property from the record after removing the corresponding file. FALSE does not remove the property.
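
Taken together, the nodes above suggest a REMOVE_EXPORTED_PROP expression of roughly the following shape. The URL value is illustrative and should match the path given to RETRIEVE_URL; confirm the exact syntax in the Endeca XML Reference.

<EXPRESSION NAME="REMOVE_EXPORTED_PROP" TYPE="VOID">
   <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Document.Body"/>
   <EXPRNODE NAME="URL" VALUE="../partition0/state/docs/"/>
   <EXPRNODE NAME="REMOVE_PROPS" VALUE="TRUE"/>
</EXPRESSION>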

3. Click Check Syntax to ensure the expressions are well formed.

4. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.

Modifying Records with a Perl Manipulator

Although there is no requirement that a CAS pipeline use a Perl manipulator component, this component is useful for performing more extensive record modification during processing. For example, the component can be used to strip values out of a property such as Endeca.Document.Text and add the values back to a record for use in dimension mapping, to concatenate properties and add the resulting new property to a record, and so on.

For information about how to add a Perl manipulator component to a pipeline, see the Endeca Developer Studio Help. For information about how to implement Perl code in a Perl manipulator, see the Endeca Forge API Guide for Perl.

Creating a Spider

The spider component is the core of a CAS pipeline. Working in conjunction with a record adapter and a record manipulator, the spider forms a document-processing loop whose function is to get documents into a CAS pipeline. Within this loop, the spider crawls URLs, filters URLs, sends URLs to the record adapter, and manages the URL queue until all source documents are processed. For a review of the role of each component in this loop, see “URL and Record Processing” on page 28.

The Spider editor is where you indicate the URLs to crawl, create URL filters to determine which documents to crawl, as well as specify timeout, proxy, and other configuration information that controls how the crawl proceeds.


Once configured and run, the spider loops through processing documents in a CAS pipeline as described in the steps below. Note the spider’s tasks begin at step 5 in the larger process described earlier in “URL and Record Processing” on page 28. These steps focus only on the spider's document processing loop.

1. For the first loop of source document processing, the spider crawls the root URL indicated on the Root URLs tab of the Spider editor.

2. Based on the root URL that the spider crawls, the record adapter creates a record containing the URL, indicated by Endeca.Identifier, and a limited set of metadata properties.

3. The newly created record then flows down to the record manipulator where the following takes place:

• The document associated with the URL is fetched (using the RETRIEVE_URL expression) and stored in Endeca.Document.Body.

• Content (searchable text) is extracted from Endeca.Document.Body (using the CONVERTTOTEXT or PARSE_DOC expression) and stored in Endeca.Document.Text.

• Any URLs in Endeca.Document.Body are extracted for additional crawling and, by default, are stored in Endeca.Relation.References.

4. The record based on the root URL then moves downstream to the spider where additional URLs (those extracted from the root URL and stored in Endeca.Relation.References) are queued for crawling.


5. The spider crawls URLs from the record as indicated in the Endeca.Relation.References properties. This is the next loop of source document processing.

6. Based on the queued URL that the spider crawls, the record adapter creates a record containing the URL, indicated by Endeca.Identifier, and a limited set of metadata properties.

7. Steps 3 through 6 repeat until the spider processes all URLs and the record adapter creates corresponding records.

To create a spider:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.

2. In the Pipeline Diagram editor, choose New > Spider. The New Spider editor displays.

3. In the Name box, type a unique name for the spider. This should be the same name you specified as the value of URL_SOURCE when you created the record adapter.

4. If you want to limit the number of hops from the root URL (specified on the Root URLs tab), enter a value in the Maximum hops field. The Maximum hops value specifies the number of links that may be traversed beginning with the root URL before the spider reaches the document at a target URL. For example, if http://www.endeca.com is a root URL and it links to a document at http://www.endeca.com/news.html, then http://www.endeca.com/news.html is one hop away from the root.


5. If you want to limit the depth of the crawl from the root URL, enter a value in the Maximum depth field. Maximum depth is based on the number of separators in the path portion of the URL. For example, http://endeca.com has a depth of zero (no separators), whereas http://endeca.com/products/index.shtml has a depth of one. The /products/ portion of the URL constitutes one separator.

6. To specify the User-Agent HTTP header that the spider should present to Web servers, enter the desired value in the Agent name field. The Agent name identifies the name of the spider, as it will be referred to in the User-agent field of a Web server’s robots.txt file. If you provide a name, the spider adheres to the robots.txt standard. If you do not provide a name, the spider responds only to rules in a robots.txt file where the value of the User-agent field is “*”.

Note: A robots.txt file allows Web-server administrators to identify robots, like spiders, and control what URLs a robot may or may not crawl on a Web server. The file specifies a robot’s User-agent name and the rules associated with the name. These crawling rules configured in robots.txt are often known as the robots.txt standard or, more formally, as the Robots Exclusion Standard. For more information on this standard, see http://www.robotstxt.org/wc/robots.html.
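
For illustration, a robots.txt file that names a spider and restricts what robots may crawl might look like the following sketch (the agent name and path are hypothetical):

User-agent: endeca_crawler
Disallow: /private/

User-agent: *
Disallow: /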

7. To instruct the spider to ignore the robots.txt file on a Web server, check Ignore robots. By ignoring the file, the spider does not obey the robots.txt standard and proceeds with the crawl with the parameters you configure.


8. If you want the spider to reject cookies, check Disable Cookies. If you leave this unchecked, CAS adds cookie information to the records during the crawl, and CAS also stores and sends cookies to the server as it crawls.

9. For the full crawl described in this chapter, do not provide any value in the Differential Crawl URL box. For information about configuring a differential crawl, contact your Endeca technical team for assistance.

Specifying Root URLs to Crawl

On the Root URLs tab, you provide the starting points for the spider to crawl. Each root URL must have a scheme of file, http, or https. The URL must be absolute and well formed. A useful URL reference is available at http://www.w3.org/Addressing/URL.

In addition to starting a crawl from a root URL, you can also start a crawl by posting data to a URL if necessary. You can simulate a form post (the HTTP POST protocol) by specifying a root URL with post syntax and values. To construct a POST URL, postfix the URL with “?”, add name=value pairs delimited by “&”, and then add a “$” followed by the post data. For example, given this URL:

http-post://web01.qa:8080/qa/post/NavServlet?arg0=foo&arg1=bar$link=1/3

The spider executes an HTTP POST request to web01.qa:8080/qa/post/NavServlet with query data: arg0=foo&arg1=bar and post data: link=1/3.


When you run the pipeline, the spider validates each root URL and checks whether the URL passes the appropriate filters, including the site’s robots.txt exclusions (if the Ignore robots checkbox is not set). If a root URL is invalid or does not pass the filters, an appropriate message is logged.

To specify root URLs:

1. In the Spider editor, select the Root URLs tab.

2. In the URL text box, type the location from which the spider starts crawling. This value can use file, http, https, or form post URLs.

3. Click Add.

4. Repeat steps 2 and 3 for additional locations.

Configuring URL Extraction Settings

On the URL Configuration tab, you provide the name for the properties used to store queued URLs, and you provide URL filters.

• Enqueue URL – indicates the name of the property that stores links (URLs) to other documents to crawl. When a spider crawls a root URL, CAS extracts any URLs contained on the root and adds those URLs as properties to the record. The spider queues the URLs and crawls them, and CAS creates additional records until all URLs stored in this property are processed. In a simple crawler application, you only need to specify the Endeca.Relation.References property here. This is the default property name produced by either text extraction expression that holds the URLs to be queued.

• URL Filter – specifies the filters by which the spider includes or excludes URLs during a crawl. Filters are expressed as wildcards or Perl regular expressions. URL filters are mutually exclusive; that is, URL filter A does not influence URL filter B, and vice versa. At least one URL filter is required to allow the spider to make additional processing loops over the root URL.

To configure URL extraction settings:

1. In the Spider editor, click the URL Configuration tab.

2. Right-click the Enqueue URLs folder and click Add. The Enqueue URL editor displays.

3. Enter a property name in the Enqueue URL editor that designates the property of the record that contains links to queue.

4. Optionally, select Remove if you want to remove the property from the record after its value has been queued.

5. Click OK.

6. If necessary, repeat steps 2 through 5 to add additional queue URL properties.

7. Select the URL Filters folder and click Add. The URL Filter editor displays.

8. In the URL Filter text box, enter either a wildcard filter or regular expression filter according to the following guidelines and samples. There are additional samples in “Example Syntax of URL Filters” on page 64.


Filters can be specified either as wildcard filters (for example, *.endeca.com) or as Perl regular expressions (for example, /.*\.html/i). Generally, you should use wildcard patterns for Host filters and regular expression patterns for URL filters.

This example shows a host include filter. It uses a wildcard to include all hosts that are in the endeca.com domain:

This example shows a URL inclusion filter that uses a regular expression filter to include all .html files, regardless of case:
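
Based on the descriptions above, the settings for these two filters, as they would be entered in the URL Filter editor, can be summarized as follows:

Filter: *.endeca.com (Type: Host, Action: Include, Pattern: Wildcard)
Filter: /.*\.html/i (Type: URL, Action: Include, Pattern: Regular expression)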

9. In the Type frame, select either Host or URL.

• Host filters apply only to the host name portion of a URL.


• URL filters are more flexible and can filter URLs based on whether the entire URL matches the specified pattern. For example, the spider may crawl a file system in which a directory named presentations contains PowerPoint documents that, for some reason, should not be crawled. They can be excluded using a URL exclusion filter with the pattern /.*\/presentations\/.*\.ppt/.

10. In the Action frame, select either Include or Exclude. Include indicates that the spider crawls documents that match the URL filter. Exclude indicates that the spider excludes documents that match the URL filter. A URL must pass both inclusion and exclusion filters for the spider to queue it. In other words, a URL must match at least one inclusion filter and must not match any exclusion filter.

11. In the Pattern frame, select either Wildcard or Regular expression, depending on the syntax of the filter you specified in step 8.

12. Repeat steps 7 through 11 to create additional URL filters as necessary. At a minimum, the spider requires one host inclusion filter that corresponds to each root URL you specified on the Root URLs tab. For example, if you set up a spider to crawl http://endeca.com, then the spider needs a host include filter for endeca.com. The filter allows the spider to include any links found on the root for additional processing. If you omit this filter, the spider processes the root URL but not the URLs that the root contains.


Example Syntax of URL Filters

Here are additional examples of common URL filter syntax:

• To crawl only file systems (not HTTP or HTTPS hosts), use a URL inclusion filter with a regular expression pattern of: /^file/i

• To crawl only documents with an .htm or .html extension, use a URL inclusion filter with a regular expression pattern of: /\.html?$/i

• To crawl the development branch of the Example corporate Web site, use a URL inclusion filter with a regular expression pattern of: /example\.com\/dev\/.*/i

This pattern confines the crawler to URLs of the form:

example.com/dev/

• To restrict a crawler so that it does not crawl URLs on a corporate intranet (for example, those located on host intranet.example.com), use a Host exclusion filter with a regular expression pattern of:

/intranet\.example\.com/


Specifying a Record Source for the Spider

A spider requires an upstream pipeline component to act as its record source. In most cases, this record source is the record manipulator that contains the RETRIEVE_URL and the text extraction expression. The record source could also be a record adapter or another spider.

To specify a record source:

1. In the Spider editor, select the Sources tab.

2. From the Record source list, choose the name of the record manipulator that you created.

3. Optionally, specify timeouts and proxy server settings as described in the two sections that follow.

4. Click OK to finish creating the spider.

5. From the File menu, choose Save.

Specifying Timeouts

The spider may be configured with three timeout values specified in the Timeout tab. These values control connection timeouts and URL retrieval timeouts for each URL that the spider fetches. Providing values on this tab is optional.

If you do provide values, the spider sends them with each URL to the record adapter. The record adapter generates Endeca.Fetch properties for each record. The property values become parameters to the RETRIEVE_URL expression during the fetch. For a description of each Endeca.Fetch property, see “Properties Generated by CAS” on page 73.

To specify timeouts:

1. In the Pipeline Diagram editor, double-click the Spider component. The Spider editor displays.

2. Click the Timeout tab.

3. If you want to limit the time that the spider spends retrieving a URL before aborting the fetch, type a value in the “Maximum time spent fetching a URL” text box.

4. If you want to limit the time that the spider spends making a connection to a host before aborting the retrieve operation, type a value in the “Maximum time to wait for a connection to be made” text box.

5. If you want to abort a fetch based on transfer rate, type a value in the Bytes/Sec for at Least text box and the Second text box.

Specifying Proxy Servers

You can configure a spider to use a proxy server when accessing HTTP or HTTPS URLs. There are several ways to configure the spider for use with proxy servers:

• You can specify a single proxy server, through which the spider accesses both HTTP and HTTPS URLs.

• You can specify separate proxy servers for HTTP URLs and HTTPS URLs.

• You can bypass proxy server settings for a specified URL.


You specify these settings on the Proxy tab of the Spider editor.

To specify a single proxy server for HTTP and HTTPS:

1. Click the Proxy tab.

2. Select Use a Proxy Server to Fetch URLs from the list.

3. In the Host text box of the Proxy server frame, type the name of the proxy server.

4. In the Port text box, type the port number that the proxy server listens to for URL requests from the spider.

5. If you want to bypass the specified proxy server for a URL, click the Bypass URLs button. The Bypass URLs editor displays.

6. Type the name of the host you want to access without the use of a proxy server and click Add. You can use wildcards to indicate a number of Web servers within a domain. Repeat this step as necessary for additional URLs.

7. Click OK.

To specify separate proxy servers for HTTP and HTTPS:

1. On the Proxy tab of the Spider editor, select Use Separate HTTP/HTTPS Proxy Servers from the list.

2. In the Host text box of the HTTP Proxy server frame, type the name of the proxy server.


3. In the Port text box, type the port number that the proxy server listens to for HTTP URL requests from the spider.

4. In the Host text box of the HTTPS Proxy server frame, type the name of the proxy server.

5. In the Port text box, type the port number that the proxy server listens to for HTTPS URL requests from the spider.

6. If you want to bypass the specified proxy server for a URL, click the Bypass URLs button. The Bypass URLs editor displays.

7. Type the name of the host you want to access without the use of a proxy server and click Add. Repeat this step as necessary for additional URLs.

8. Click OK on the Bypass URLs editor.

9. Click OK on the Spider editor.

10. From the File menu, choose Save.

Removing any Unnecessary Records after a Crawl

After the CAS components of a pipeline have processed all the source documents, you may want to remove any records that merely reflect source data structure before Forge writes out these records with an indexer adapter. This record removal is typically necessary when CAS creates records based on directory pages, index pages, or other forms of source documents that reflect the structure of the source data but do not correspond to a source document that you need in an application.


If you do not remove these records before indexing, the records become available to users of your Endeca application. For example, suppose a spider crawls a directory list page at ..\data\incoming\red\index.html and creates a corresponding record. You are unlikely to want users to search the record for the index.html page because it primarily contains a list of links; however, the spider must crawl the index page to queue and retrieve the other pages that index.html links to, such as ..\data\incoming\red\product1.html, ..\data\incoming\red\product2.html, ..\data\incoming\red\product3.html, and so on.

You can remove records from a pipeline using a REMOVE_RECORD expression. In the pipeline, the REMOVE_RECORD expression must appear in a record manipulator that is placed after the CAS record processing loop. Specifically, the expression must appear after the spider component because the spider needs to crawl all URLs that may appear on a directory page.


For example, see the position of the RemoveRecords component in the following example CAS pipeline:

To add REMOVE_RECORD to a pipeline:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.

2. In the Pipeline Diagram editor, click New.

3. Select Record > Manipulator. The New Record Manipulator editor displays.

4. In the Name text box, type in the name of this record manipulator.



5. From the Record source drop-down list, choose the name of the spider that you created.

6. From the Dimension source drop-down list, choose the dimension source for the pipeline.

7. Click OK to add the new record manipulator to the project.

8. Open the PropertyMapper component and change its record source to the new record manipulator you just created.

9. From the File menu, choose Save.

10. In the Pipeline Diagram, double-click the record manipulator. The Expression Editor displays.

11. Starting at the first line in the Expression editor, insert a REMOVE_RECORD expression using the example below as a guide. The REMOVE_RECORD expression appears on line 14. REMOVE_RECORD is typically used within an IF expression to remove records that meet or do not meet certain criteria. There are no nested expressions within REMOVE_RECORD to configure how it functions.

12. Click Check Syntax to ensure the expressions are well formed.

13. Click Commit Changes and close the Expression editor.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.

For additional information on expression configuration, see the Endeca XML Reference.
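As a sketch of how such an expression can look, the fragment below wraps REMOVE_RECORD in an IF expression that drops records whose URL ends in index.html. The element style follows the RETRIEVE_URL sample shown later in this guide, but the IF and MATCHES expression names and the regular-expression test are illustrative assumptions; verify the exact expression names against the Endeca XML Reference before using this.

```xml
<!-- Hypothetical sketch: remove records whose URL ends in index.html.
     The IF/MATCHES expression names are assumptions to verify against
     the Endeca XML Reference. -->
<EXPRESSION TYPE="VOID" NAME="IF">
  <EXPRESSION TYPE="INTEGER" NAME="MATCHES">
    <EXPRESSION TYPE="PROPERTY" NAME="IDENTITY">
      <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Identifier"/>
    </EXPRESSION>
    <EXPRESSION TYPE="STRING" NAME="CONST">
      <EXPRNODE NAME="VALUE" VALUE=".*index\.html$"/>
    </EXPRESSION>
  </EXPRESSION>
  <!-- REMOVE_RECORD takes no nested configuration expressions: -->
  <EXPRESSION TYPE="VOID" NAME="REMOVE_RECORD"/>
</EXPRESSION>
```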

Endeca Confidential Content Acquisition System

Page 72: Endeca_07_AdvFeaturesGuide

72

Handling Crawling Errors

Processing source documents, including retrieving documents and extracting their text, can introduce problems. This section lists several common errors and workarounds, where applicable.

• .pdf content in Endeca.Document.Text displays as binary data: If CAS processes .pdf files and the content from those files appears in your application as binary data, the .pdf files may contain custom-encoded embedded fonts. CAS cannot always correctly display content that contains custom-encoded embedded fonts. To solve the issue, CAS attempts to substitute a system font for the custom-encoded font. The substitution succeeds if the encoding in the substituted system font is the same as the custom encoding in the embedded font. When the substitution is not successful, you see binary data in Endeca.Document.Text.

Here are several issues related to retrieving documents from HTTP hosts, and an explanation of how the spider handles them:

• Connection timeout: The spider retries the request five times. Each timeout is logged in an informational message. After the fifth timeout, an error message is logged, and the record for the offending URL is created with its Endeca.Document.Status property set to “Fetch Aborted.”

• URL not found: The spider logs a warning message that the URL could not be located and creates a record with its Endeca.Document.Status property set to “Fetch Failed.”

• Malformed URL: The spider logs a warning message that the URL is malformed and creates a record with its Endeca.Document.Status property set to “Fetch Failed.”

• Authentication failure: The spider logs a warning message that the URL could not be retrieved and creates a record with its Endeca.Document.Status property set to “Fetch Failed.”
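Because these failures are recorded in the Endeca.Document.Status property, they can drive the record-removal technique from the previous section. The fragment below is a hypothetical sketch that drops any record whose fetch failed; the IF and STRING_EQUALS expression names are assumptions to verify against the Endeca XML Reference.

```xml
<!-- Hypothetical sketch: remove any record whose fetch failed.
     Expression names other than REMOVE_RECORD are assumptions. -->
<EXPRESSION TYPE="VOID" NAME="IF">
  <EXPRESSION TYPE="INTEGER" NAME="STRING_EQUALS">
    <EXPRESSION TYPE="PROPERTY" NAME="IDENTITY">
      <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Document.Status"/>
    </EXPRESSION>
    <EXPRESSION TYPE="STRING" NAME="CONST">
      <EXPRNODE NAME="VALUE" VALUE="Fetch Failed"/>
    </EXPRESSION>
  </EXPRESSION>
  <EXPRESSION TYPE="VOID" NAME="REMOVE_RECORD"/>
</EXPRESSION>
```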

Properties Generated by CAS

The following table describes the properties generated by the record adapter:

Property Name Description

Endeca.Identifier The source URL of this document

Endeca.Document.NumberOfHops The minimum number of hops (link traversals) to get to this document from a root URL during the crawl.

Endeca.Fetch.Timeout The maximum time in seconds that Forge should wait to retrieve the resource indicated by Endeca.Identifier before aborting the retrieve operation.

Endeca.Fetch.ConnectTimeout The maximum time in seconds that Forge should wait to establish a connection to the server hosting the resource indicated by Endeca.Identifier before aborting the retrieve operation.


Endeca.Fetch.TransferRateLowSpeedLimit The minimum transfer rate in bytes per second below which Forge should abort the retrieve operation after the timeout specified in Endeca.Fetch.TransferRateLowSpeedTime.

Endeca.Fetch.TransferRateLowSpeedTime The maximum time in seconds that Forge should allow the transfer operation to fall below the minimum transfer rate specified in Endeca.Fetch.TransferRateLowSpeedLimit before aborting.

Endeca.Fetch.Proxy The URI (HOST:PORT) of the proxy server that Forge should use to retrieve the resource indicated by Endeca.Identifier.

Endeca.Fetch.UserAgent The User-Agent HTTP header that Forge should present to the server hosting the resource indicated by Endeca.Identifier. For more information about user agent settings, see the procedure on page 55.

Note: Records created by the record adapter may contain properties prefixed with the name Endeca.Fetch that describe how to access the resource identified by the Endeca.Identifier property. The Endeca.Fetch property values are created based on the values you provide, if any, in the Timeout tab of the Spider editor. When the spider sends a URL to the record adapter, the URL is accompanied by the timeout configuration values that determine how the URL should be retrieved. See “Specifying Timeouts” on page 65 for more information.


The following table describes the properties generated by the RETRIEVE_URL expression:

Property Name Description

Endeca.Document.Body The body of the retrieved document. The value of this property is a path to the file that stores the document body on disk. A text extraction expression, such as CONVERTTOTEXT or PARSE_DOC, uses Endeca.Document.Body to extract the text of the document (Endeca.Document.Text).

Endeca.Document.Status The status of the fetch. This property can have any of the following values:

• Fetch Succeeded - The document fetch was successful and the document body is in the file indicated by the Endeca.Document.Body property.

• Fetch Skipped - The document was not retrieved because, based on the value of Endeca.Document.Revision, the document should not yet be considered for re-fetch. For example, if the HTTP Expires header specifies a time in the future, the document is not fetched.

• Fetch Aborted - The document was not retrieved. This status indicates that a potentially transient phenomenon, for example a timeout, prevented the document from being fetched. Additional information can often be found in scheme-specific properties.


• Fetch Failed - The document was not retrieved. This status indicates that a non-transient phenomenon, for example an HTTP error like Document Not Found, caused the failure. Additional information can often be found in scheme-specific properties.

Endeca.Document.Revision The scheme-specific revision information of the document. This value is typically a timestamp of the document's last modified date and may also include revision information about references and metadata. The information contained in the revision varies from scheme to scheme (for example, file vs. http).

Endeca.Document.IsRedirection If the document is a redirection to another document (for example, an HTTP redirect or a symbolic link file), this property has the value true. The URL of the document being redirected to is stored in an Endeca.Relation.References property.

Endeca.Document.IsUnchanged If the document is unchanged, either because the fetch was skipped or because fetch determined that the document has not changed, this property has the value true. This property is generated in differential crawl pipelines, not full crawl pipelines.

Endeca.Document.MimeType The MIME Type of the document, if it can be determined. Common examples of this property value include text/html, application/pdf, image/gif, and so on.


Endeca.Document.Encoding The encoding of the body of the document, if it can be determined. This property value is an ISO code that describes the encoding, for example, ISO-8859-1. The RETRIEVE_URL expression attempts to determine this value based on the Content-type header of the document that a Web server returns to CAS.

If no value exists for the Content-type header, then a text extraction expression (for example, CONVERTTOTEXT or PARSE_DOC) attempts to determine the encoding value and to generate the property. If the value cannot be determined, Forge logs an error.

Endeca.Relation.References If a document references other documents, each reference is placed into this property. This is the default property that stores URLs to be queued by the spider.

Endeca.ACL.Allow.Read (Write, Execute, Delete, and so on)

Security information about the document. There can be several dozen access control list (ACL) properties. These properties take their names from system security attributes. For more information, see the Endeca Security Guide.

Endeca.ACL.Deny.Read (Write, Execute, Delete, and so on)

Security information about the document. For more information, see the Endeca Security Guide.


Endeca.Cookie Information about cookies associated with the fetched URL. This information may include the name of the cookie, the Web server’s domain, the path to the Web server, fetch dates, and remove dates. When RETRIEVE_URL gets a Set-Cookie header as part of its HTTP response, RETRIEVE_URL can pass this value back to the server, when appropriate, to simulate a session.

Endeca.Document.Info.<scheme>.* Scheme-specific metadata supplied by each scheme. For example, the HTTP scheme supplies HTTP protocol information in the property Endeca.Document.Info.http.protocol and the FILE scheme supplies the file type in the Endeca.Document.Info.file.Attribute property.

The following table describes properties generated by extracting document text:

Property Name Description

Endeca.Document.Encoding The document encoding, which is added if it does not already exist.

Endeca.Document.Text The extracted text of the document. The text is extracted from the file indicated by Endeca.Document.Body.

Endeca.Title The title of the document.

Advanced Features Guide Endeca ConfidentialChapter 1

Page 79: Endeca_07_AdvFeaturesGuide

79

Endeca.Relation.References This property references the documents indicated. Endeca.Relation.References is the default property name produced by either text extraction expression. The property stores any additional URLs to be queued. Property values are either absolute URLs or URLs relative to the record’s Endeca.Identifier property.

For example, crawling a directory overview page that has links to three sub-category pages produces the following Endeca.Relation.References properties:

<PROP NAME="Endeca.Relation.References"><PVAL>http://endeca.com/products/</PVAL></PROP>
<PROP NAME="Endeca.Relation.References"><PVAL>red/index.html</PVAL></PROP>
<PROP NAME="Endeca.Relation.References"><PVAL>white/index.html</PVAL></PROP>
<PROP NAME="Endeca.Relation.References"><PVAL>sparkling/index.html</PVAL></PROP>

In addition to the properties described above, a text extraction expression may create additional properties with arbitrary names that describe metadata about a document, if such metadata exists. For example, if a document contains HTML META tags, the expression creates corresponding properties. Or, if you parse MS Word documents that contain an Author attribute, the expression creates a corresponding property for author.

The following table describes properties generated by classifying documents with Stratify:

Property Name Description

Endeca.Stratify.Topic.HID<hierarchy ID>= <topic ID>

This property corresponds to the ID value of a topic in your published Stratify taxonomy. Each topic in your taxonomy has an ID value assigned by Stratify. For example, if an Eating Disorders topic has an ID of 209722 in a health care taxonomy whose hierarchy ID is 15, then the Endeca property is Endeca.Stratify.Topic.HID15="209722".

Endeca.Stratify.Topic.Name.HID<hierarchy ID>.TID<topic ID>=<topic name>

This property corresponds to a topic name from your published Stratify taxonomy for its corresponding topic ID. For example, for the Eating Disorders topic in the health care taxonomy mentioned earlier, this property is Endeca.Stratify.Topic.Name.HID15.TID2097222="Eating Disorders".

Endeca.Stratify.Topic.Score.HID <hierarchy ID>.TID<topic ID>=<score>

This property indicates the classification score between an unstructured document and the topic into which it has been classified. The value of <score> is a percentage expressed as a value between zero and one: zero indicates the lowest classification score (0%), and one indicates the highest score (100%). You can use this property to remove records from your application that have a low score for classification matching, for example, Endeca.Stratify.Topic.Score.HID15.TID2097222="0.719380021095276".
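Such score-based removal can be sketched with the REMOVE_RECORD technique described earlier in this chapter. In the hypothetical fragment below, the property name is taken from the example above, but the IF and LESS_THAN expression names and the 0.5 threshold are assumptions to verify against the Endeca XML Reference.

```xml
<!-- Hypothetical sketch: remove records whose Stratify classification
     score for topic 2097222 falls below 0.5. Expression names other
     than REMOVE_RECORD are assumptions. -->
<EXPRESSION TYPE="VOID" NAME="IF">
  <EXPRESSION TYPE="INTEGER" NAME="LESS_THAN">
    <EXPRESSION TYPE="PROPERTY" NAME="IDENTITY">
      <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Stratify.Topic.Score.HID15.TID2097222"/>
    </EXPRESSION>
    <EXPRESSION TYPE="STRING" NAME="CONST">
      <EXPRNODE NAME="VALUE" VALUE="0.5"/>
    </EXPRESSION>
  </EXPRESSION>
  <EXPRESSION TYPE="VOID" NAME="REMOVE_RECORD"/>
</EXPRESSION>
```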


Property Generated by ID_LANGUAGE

The following table describes the property generated by the ID_LANGUAGE expression:

Property Name Description

Endeca.Document.Language The document language code. ISO 639 lists the valid language codes. Common examples include the following: en for English, es for Spanish, fr for French, de for German, ja for Japanese, and ko for Korean. See http://www.oasis-open.org/cover/iso639a.html for a full list of the language codes.

Formats Supported by ProFind

Endeca ProFind supports the following source document formats.

WORD PROCESSING FORMATS

GENERIC TEXT

ANSI Text ..................................................................................7 & 8 bit

ASCII Text .................................................................................7 & 8 bit

EBCDIC

HTML .................................................... through 3.0 (some limitations)

IBM FFT ...............................................................................All versions

IBM Revisable Form Text....................................................All versions

Microsoft Rich Text Format (RTF) ......................................All versions

Text Mail (MIME).................................................... No specific version

Unicode Text .......................................................................All versions

WML ......................................................................................Version 5.2

DOS WORD PROCESSORS

DEC WPS Plus (DX)............................................................ through 4.0

DEC WPS Plus (WPL).......................................................... through 4.1



DisplayWrite 2 & 3 (TXT) ...................................................All versions

DisplayWrite 4 & 5 .................................................through Release 2.0

Enable ............................................................................3.0, 4.0 and 4.5

First Choice.......................................................................... through 3.0

Framework......................................................................................... 3.0

IBM Writing Assistant ...................................................................... 1.01

Lotus Manuscript ..................................................................Version 2.0

MASS11 ................................................................. Versions through 8.0

Microsoft Word ..................................................... Versions through 6.0

Microsoft Works.................................................... Versions through 2.0

MultiMate .............................................................. Versions through 4.0

Navy DIF..............................................................................All versions

Nota Bene.............................................................................Version 3.0

Novell WordPerfect .............................................. Versions through 6.1

Office Writer ...............................................................Versions 4.0 - 6.0

PC-File Letter ........................................................ Versions through 5.0

PC-File+ Letter ...................................................... Versions through 3.0

PFS:Write................................................................ Versions A, B and C

Professional Write................................................. Versions through 2.1

Q&A ......................................................................................Version 2.0

Samna Word................................... Versions through Samna Word IV+

SmartWare II ...................................................................... Version 1.02

Sprint..................................................................... Versions through 1.0

Total Word ............................................................................Version 1.2

Volkswriter 3 & 4.................................................. Versions through 1.0

Wang PC (IWP) .................................................... Versions through 2.6

WordMARC ....................................... Versions through Composer Plus

WordStar................................................................ Versions through 7.0

WordStar 2000....................................................... Versions through 3.0


XyWrite .......................................................... Versions through III Plus

WINDOWS WORD PROCESSORS

Adobe FrameMaker (MIF)....................................................Version 6.0

Hangul ................................................ Version 97 and 2002 (text only)

JustSystems Ichitaro......................... Versions 5.0, 6.0, 8.0 – 13.0, 2004

JustWrite ................................................................ Versions through 3.0

Legacy ................................................................... Versions through 1.1

Lotus AMI/AMI Professional ................................ Versions through 3.1

Lotus Word Pro ................. Versions 96 through Millennium Edition 9.6, text only

Microsoft Write ..................................................... Versions through 3.0

Microsoft Word .................................................. Versions through 2003

Microsoft WordPad ..............................................................All versions

Microsoft Works.................................................... Versions through 4.0

Novell Perfect Works............................................................Version 2.0

Novell/Corel WordPerfect .................................. Versions through 12.0

Professional Write Plus.........................................................Version 1.0

Q&A Write ........................................................................... Version 3.0

Star Office/Open Office Writer .. Star Office Versions 5.2, 6.x, and 7.x

......................................................Open Office version 1.1 (text only)

WordStar................................................................................Version 1.0

MACINTOSH WORD PROCESSORS

MacWrite II ...........................................................................Version 1.1

Microsoft Word (Mac) .......................................... Versions 4.0 — 2004

Microsoft Works (Mac) ......................................... Versions through 2.0

Novell WordPerfect ...................................... Versions 1.02 through 3.0

SPREADSHEET FORMATS

Enable .............................................................Versions 3.0, 4.0 and 4.5

First Choice........................................................... Versions through 3.0

Framework............................................................................Version 3.0


Lotus 1-2-3 (DOS & Windows)............................ Versions through 5.0

Lotus 1-2-3 (OS/2)................................................ Versions through 2.0

Lotus 1-2-3 Charts (DOS & Windows) ................ Versions through 5.0

Lotus 1-2-3 for SmartSuite...................... Versions 97 – Millennium 9.6

Lotus Symphony.............................................. Versions 1.0,1.1 and 2.0

Microsoft Excel Charts................................................Versions 2.x - 7.0

Microsoft Excel (Mac) ........................... Versions 3.0 – 4.0, 98 — 2004

Microsoft Excel (Windows)......................... Versions 2.2 through 2003

Microsoft Multiplan ..............................................................Version 4.0

Microsoft Works (Windows) ................................ Versions through 4.0

Microsoft Works (DOS) ........................................ Versions through 2.0

Microsoft Works (Mac) ......................................... Versions through 2.0

Mosaic Twin .........................................................................Version 2.5

Novell Perfect Works............................................................Version 2.0

PFS: Professional Plan..........................................................Version 1.0

Quattro Pro (DOS) ............................................... Versions through 5.0

Quattro Pro (Windows) ..................................... Versions through 12.0

SmartWare II .......................................................................Version 1.02

Star Office/Open Office Calc..... Star Office Versions 5.2, 6.x, and 7.x

......................................................Open Office version 1.1 (text only)

SuperCalc 5...........................................................................Version 4.0

VP Planner 3D ......................................................................Version 1.0

PRESENTATION FORMATS

Corel/Novell Presentations ................................ Versions through 12.0

Harvard Graphics for DOS ...................................... Versions 2.x & 3.x

Harvard Graphics (Windows)..................................Windows versions

Freelance (Windows) ....................... Versions through Millennium 9.6

Freelance for OS/2 ............................................... Versions through 2.0

Microsoft PowerPoint (Windows) .............. Versions 3.0 through 2003


Microsoft PowerPoint (Mac) ................. Versions 4.0, 98 through 2004

StarOffice / OpenOffice Impress............... StarOffice 5.2, 6.x, and 7.x

.................................................................... OpenOffice 1.1 (text only)

GRAPHICS FORMATS

Adobe Photoshop (PSD)......................................................Version 4.0

Adobe Illustrator............................................ Versions through 7.0, 9.0

Adobe FrameMaker graphics (FMV) ............Vector/raster through 5.0

Adobe Acrobat (PDF)........................ Versions 2.1, 3.0 – 6.0, Japanese

Ami Draw (SDW) .................................................................. Ami Draw

AutoCAD Interchange and Native Drawing formats ...DXF and DWG

AutoCAD Drawing......... Versions 2.5 - 2.6, 9.0 - 14.0, 2000i and 2002

AutoShade Rendering (RND)...............................................Version 2.0

Binary Group 3 Fax.............................................................All versions

Bitmap (BMP, RLE, ICO, CUR, OS/2 DIB & WARP) ..........All versions

CALS Raster (GP4)...................................................Type I and Type II

Corel Clipart format (CMX)..................................Versions 5 through 6

Corel Draw (CDR) .....................................................Versions 3.x – 8.x

Corel Draw (CDR with TIFF header) .......................Versions 2.x – 9.x

Computer Graphics Metafile (CGM).......ANSI, CALS NIST version 3.0

Encapsulated PostScript (EPS) ...................................TIFF header only

GEM Paint (IMG).................................................... No specific version

Graphics Environment Mgr (GEM)............................. Bitmap & vector

Graphics Interchange Format (GIF) ...................... No specific version

Hewlett Packard Graphics Language (HPGL)........................Version 2

IBM Graphics Data Format (GDF) ......................................Version 1.0

IBM Picture Interchange Format (PIF) ................................Version 1.0

Initial Graphics Exchange Spec (IGES) ...............................Version 5.1

JFIF (JPEG not in TIFF format) ...........................................All versions

JPEG (including EXIF).........................................................All versions


Kodak Flash Pix (FPX)........................................................All versions

Kodak Photo CD (PCD).......................................................Version 1.0

Lotus PIC..............................................................................All versions

Lotus Snapshot ....................................................................All versions

Macintosh PICT1 & PICT2 .................................................Bitmap only

MacPaint (PNTG).....................................................No specific version

Micrografx Draw (DRW) ...................................... Versions through 4.0

Micrografx Designer (DRW) ................................ Versions through 3.1

Micrografx Designer(DSF) ............................ Windows 95, version 6.0

Novell PerfectWorks (Draw)................................................Version 2.0

OS/2 PM Metafile (MET) ......................................................Version 3.0

Paint Shop Pro 6 (PSP) ................... Windows only, versions 5.0 – 6.0

PC Paintbrush (PCX and DCX).............................................All versions

Portable Bitmap (PBM) .......................................................All versions

Portable Graymap (PGM) .......................................No specific version

Portable Network Graphics (PNG)......................................Version 1.0

Portable Pixmap (PPM)...........................................No specific version

Postscript (PS).............................................................................Level II

Progressive JPEG.....................................................No specific version

Sun Raster (SRS) ......................................................No specific version

Star Office/Open Office Draw...........Star Office 5.2, 6.x, and 7.x and

.......................................................OpenOffice version 1.1 (text only)

TIFF .......................................................................... Versions through 6

TIFF CCITT Group 3 & 4 ........................................ Versions through 6

Truevision TGA (TARGA) .......................................................Version 2

Visio (preview) ........................................................................Version 4

Visio ............................................................... Versions 5, 2000 — 2003

WBMP ......................................................................No specific version

Windows Enhanced Metafile (EMF).......................No specific version


Windows Metafile (WMF) ...................................... No specific version

WordPerfect Graphics (WPG & WPG2)........ Versions through 2.0, 7 and 10

X-Windows Bitmap (XBM) ...........................................x10 compatible

X-Windows Dump (XWD)............................................x10 compatible

X-Windows Pixmap (XPM) ...........................................x10 compatible

COMPRESSED FORMATS

GZIP .....................................................................................All versions

LZA Self Extracting Compress.............................................All versions

LZH Compress .....................................................................All versions

Microsoft Binder ............................................................ Versions 7.0-97

...................... (conversion of Binder is supported only on Windows)

MIME Text Mail

UUEncode

UNIX Compress

UNIX TAR

ZIP..................................................... PKWARE versions through 2.04g

DATABASE FORMATS

Access ................................................................... Versions through 2.0

dBASE ................................................................... Versions through 5.0

DataEase ...............................................................................Version 4.x

dBXL......................................................................................Version 1.3

Enable .............................................................Versions 3.0, 4.0 and 4.5

First Choice........................................................... Versions through 3.0

FoxBase.................................................................................Version 2.1

Framework............................................................................Version 3.0

Microsoft Works (Windows) ................................ Versions through 4.0

Microsoft Works (DOS) ........................................ Versions through 2.0

Microsoft Works (Mac) ......................................... Versions through 2.0

Paradox (DOS) ..................................................... Versions through 4.0


Paradox (Windows) ............................................. Versions through 1.0

Personal R:BASE...................................................................Version 1.0

R:BASE 5000 ......................................................... Versions through 3.1

R:BASE System V..................................................................Version 1.0

Reflex ....................................................................................Version 2.0

Q & A.................................................................... Versions through 2.0

SmartWare II .......................................................................Version 1.02

OTHER FORMATS

Executable (EXE, DLL)

Executable (Windows) NT

Microsoft Outlook Express (EML) ..........................No specific version

Microsoft Outlook Folder (PST) ........ Versions 97, 98, 2000, and 2002

Microsoft Outlook Message (MSG) ....................................All versions

Microsoft Project.................................... Versions 98 - 2003 (text only)

vCard.....................................................................................Version 2.1


Chapter 2

Web Crawling with Authentication

This chapter describes how to configure a Forge crawler pipeline to access sites that require client authentication over HTTP using either basic authentication or HTTPS. It also describes how to set up Forge to require authentication from the server when using HTTPS.

Note: If Forge is to be used to crawl a file system, you must ensure that the Forge process is run from an account that is granted all of the appropriate permissions to access the target data.

This chapter assumes that you have already created a Forge pipeline for Web crawling, as described in “Content Acquisition System” on page 23.

Configuring Basic Authentication

When Forge connects to a Web site that requires basic authentication, it needs to provide the site with a valid username and password before the Web server will transmit a response. You can use a key ring file to supply Forge with an appropriate username/password pair to access a particular site that requires basic authentication.
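At the protocol level, basic authentication is simply an Authorization header carrying the base64 encoding of username:password. The following standalone sketch (not Forge code) shows what the Web server ultimately receives:

```java
import java.util.Base64;

public class BasicAuthSketch {
    // Build the Authorization header value a server expects for basic auth
    static String header(String username, String password) {
        String token = Base64.getEncoder()
                .encodeToString((username + ":" + password).getBytes());
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // For example, the pair sales/Endeca becomes:
        System.out.println(header("sales", "Endeca")); // Basic c2FsZXM6RW5kZWNh
    }
}
```

Note that the value stored in a KEY element is not this base64 token: as described later in this chapter, Forge stores credentials in its own encrypted form that only Forge can decode.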


The following is a sample key ring file that could be used to configure Forge for basic authentication:

<KEY_RING>
  <SITE HOST="www.endeca.com" PORT="6000">
    <HTTP>
      <REALM NAME="Sales Documents">
        <KEY>BOcxV3wFSGuoBqbhPHkFGmA=</KEY>
      </REALM>
    </HTTP>
  </SITE>
</KEY_RING>

To use this key ring file, you specify its location via the third argument of the RETRIEVE_URL expression in the Forge crawler pipeline, which is used to fetch URLs from the targeted Web server, as shown below (the relevant line is the EXPRNODE element named KEY_RING):

<EXPRESSION TYPE="VOID" NAME="RETRIEVE_URL">
  <!-- this expression generates a filename for the retrieved file: -->
  <EXPRESSION TYPE="STRING" NAME="CONCAT">
    <EXPRESSION TYPE="STRING" NAME="CONST">
      <EXPRNODE NAME="VALUE" VALUE="&cwd;"/>
    </EXPRESSION>
    <EXPRESSION TYPE="STRING" NAME="DIGEST">
      <EXPRESSION TYPE="PROPERTY" NAME="IDENTITY">
        <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Identifier"/>
      </EXPRESSION>
    </EXPRESSION>
  </EXPRESSION>
  <!-- this expression node specifies the path to the key ring file: -->
  <EXPRNODE NAME="KEY_RING" VALUE="key_ring.xml"/>
</EXPRESSION>


The path to the key ring file is expressed relative to the pipeline file or as an absolute path. In the above example, the key ring file is in the same directory as the pipeline file.

Note that the specified key ring applies only to the RETRIEVE_URL expression from which it is referenced.

The following subsections describe each element of the sample key ring file in detail.

KEY_RING Element

The KEY_RING element is the root element of the key ring file. All other components of the key ring file are contained within the KEY_RING element.

SITE Element

The SITE element is used to refer to a target Web site or server. All of the directives within a SITE element are targeted at the site or server specified by the parent SITE element. For example, the HTTP element in the sample key ring file refers to an HTTP connection to the Web site on host www.endeca.com at port 6000.

The SITE element may contain one sub-element for each URL scheme by which it can be accessed. The authentication parameters for each of these schemes are specified in the body of each scheme sub-element. The two schemes that currently support authentication are HTTP and HTTPS, represented by HTTP and HTTPS elements, respectively.

The SITE element has one required attribute, HOST, and one optional attribute, PORT.

HOST Attribute

The value of the HOST attribute should be the fully-qualified domain name of the server that hosts the target site.

PORT Attribute

If the target site is not accessed via the default port for all relevant URL schemes, the PORT attribute can be used to specify the port explicitly. If the PORT attribute is unspecified, the default port for each access scheme specified will be used.
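The defaults in question are the standard well-known ports, 80 for HTTP and 443 for HTTPS, as in this trivial sketch (not Forge code):

```java
public class DefaultPortSketch {
    // The default port used when a key ring SITE omits the PORT attribute
    static int defaultPort(String scheme) {
        return scheme.equalsIgnoreCase("https") ? 443 : 80;
    }

    public static void main(String[] args) {
        System.out.println(defaultPort("http"));  // 80
        System.out.println(defaultPort("https")); // 443
    }
}
```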

For example, the following sample key ring file would be used to specify the authentication configuration settings for accessing host www.endeca.com via port 80 for HTTP and port 443 for HTTPS:

<KEY_RING>
  <SITE HOST="www.endeca.com">
    <HTTP>
      <!-- HTTP Settings... -->
    </HTTP>
    <HTTPS>
      <!-- HTTPS Settings... -->
    </HTTPS>
  </SITE>
</KEY_RING>


HTTP Element

The HTTP element is used to encapsulate the basic authentication settings for accessing the parent host via HTTP. Some parts of a site may be password-protected, while others are not. The parts of an HTTP site that require authentication are called realms.

A realm is an arbitrary name for a directory on an HTTP server and all of its contents, including subdirectories. For instance, the realm “Sales Documents” (referenced in the sample key ring file) might refer to the directory:

http://www.endeca.com:6000/sales/

which in turn contains the “contracts” and “bookings” subdirectories, each of which may contain some Word documents or Excel spreadsheets. If a Forge crawler attempted to access any of this content, including the “sales”, “contracts”, or “bookings” directories themselves, it would be prompted for a username and password to gain access to the “Sales Documents” realm.
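As a rough illustration of the directory semantics (the paths below are hypothetical; the realm-to-directory mapping is configured on the Web server, not in Forge), a realm covers a directory and everything beneath it:

```java
public class RealmSketch {
    // A URL path falls inside a realm if it starts with the realm's directory
    static boolean inRealm(String path, String realmDir) {
        return path.startsWith(realmDir);
    }

    public static void main(String[] args) {
        System.out.println(inRealm("/sales/contracts/q1.doc", "/sales/")); // true
        System.out.println(inRealm("/public/index.html", "/sales/"));     // false
    }
}
```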

To provide Forge with a username/password pair for accessing this realm, a REALM element is used. An HTTP site may have many realms, so an HTTP element may contain any number of REALM sub-elements.

REALM Element

Each REALM element is used to set up basic authentication for a particular named realm on the target site. The REALM element has one required attribute, NAME, which specifies the name of the realm. The body of a REALM element must contain one (and only one) KEY element, which encapsulates the username and password combination that should be used by Forge to access the specified realm on the target site.

KEY Element

The body of a KEY element can contain a username/password pair or a pass phrase. For protection, Forge expects the contents of a KEY element to be encrypted. “Using Forge to Encrypt Keys and Pass Phrases” on page 100 describes how Forge can be used to encrypt username/password pairs or pass phrases for use in KEY elements in such a way that only Forge itself is capable of decoding them.

Configuring HTTPS Authentication

HTTPS configuration is similar to HTTP authentication configuration. Forge supports HTTPS authentication of the server, client authentication with certificates, and secure communication over HTTPS. The following sub-sections describe how to use a key ring file to configure an HTTPS connection to a particular site, given various security constraints.


Bootstrapping Server Authentication

To make an HTTPS connection, often all that is required is for Forge (as a client) to be able to authenticate the server. When Forge connects to a server via HTTPS, it attempts to validate the server's certificate by checking its signature.

Therefore, Forge must be supplied with the public key of the certificate authority (CA) that signed the server's certificate. This information can be provided via a key ring file that contains a CA_DB element, as in this example:

<KEY_RING>
  <CA_DB>eneCA.pem</CA_DB>
  <SITE HOST="www.endeca.com" PORT="6000">
    <HTTP>
      <REALM NAME="Sales Documents">
        <KEY>BOcxV3wFSGuoBqbhPHkFGmA=</KEY>
      </REALM>
    </HTTP>
  </SITE>
</KEY_RING>

CA_DB Element

The body of a CA_DB element specifies the path to a PEM format certificate file which contains one or more public keys that Forge should use to validate the CA signatures it encounters on server certificates when it retrieves URLs via HTTPS. The path to this certificate may be relative to the parent pipeline XML file or an absolute path.


If Forge is unable to find the public key of the CA that signed a server certificate that it receives when attempting to initiate an HTTPS transfer, it will fail to retrieve the requested document and report an error. If a certificate chain is necessary to validate the server certificate, the public key of each CA along the chain must be present in the CA_DB in order for host authentication to succeed.

Disabling Server Authentication for a Host

By default, Forge always attempts to validate CA signatures for every HTTPS host. Host authentication can be disabled for an individual host, however, by setting the AUTHENTICATE_HOST attribute of the appropriate HTTPS element in the key ring to FALSE (for more information, see the AUTHENTICATE_HOST Attribute section).

HTTPS Element

The HTTPS element is the analog of the HTTP element. It encapsulates the HTTPS configuration information that applies to a particular site, which is defined by the HTTPS element’s parent SITE element.

AUTHENTICATE_HOST Attribute

The HTTPS element has one optional attribute, AUTHENTICATE_HOST. This attribute specifies whether or not to verify the CA signature of server certificates received from the target host. By default, the value of this attribute is TRUE. To disable host authentication for HTTPS connections to the target host, set this attribute to FALSE:

<HTTPS AUTHENTICATE_HOST="FALSE"/>

Configuring Client Authentication

Some HTTPS servers may require clients to authenticate themselves. A client does this by presenting a certificate that has been signed by a CA that the server trusts. For Forge to connect to a server that requires client authentication, it must be supplied with an appropriate client certificate as well as an associated private key, as illustrated by this example:

<KEY_RING>
  <CA_DB>cacert.pem</CA_DB>
  <SITE HOST="www.endeca.com" PORT="6000">
    <HTTPS>
      <CERT PATH="clientcert.pem" PRIV_KEY_PATH="clientkey.key">
        <KEY>AqS6+A3u+ivX</KEY>
      </CERT>
    </HTTPS>
  </SITE>
</KEY_RING>

CERT Element

One CERT element can be inserted in the body of an HTTPS element to bootstrap the HTTPS connection with a certificate and corresponding private key for a site that requires client authentication. The CERT element has two required attributes, PATH and PRIV_KEY_PATH, which specify the locations of the certificate and private key.


If these files are protected by a pass phrase, the pass phrase can be provided in the body of a KEY child element of the CERT element, as in the above example.

As with HTTP username/password keys, Forge expects a CERT’s key to be stored in an encrypted form. For more information on how to put a CERT’s key in an encrypted form, see “Using Forge to Encrypt Keys and Pass Phrases” on page 100.

PATH Attribute

The PATH attribute of a CERT element specifies the location of the certificate file. The certificate must be stored in the PEM format. The path may be expressed relative to the pipeline file or as an absolute path.

PRIV_KEY_PATH Attribute

The PRIV_KEY_PATH attribute specifies the path to a PEM format file containing the private key associated with the certificate referenced in the PATH attribute. This path may also be expressed relative to the pipeline file or as an absolute path.

Authenticating with a Microsoft Exchange Server

A key ring file may also be used to specify authentication configuration for a Microsoft Exchange Server when using a record adapter with an EXCHANGE format. The Exchange server will expect a valid username and password combination, which may be specified via a KEY


element embedded in an EXCHANGE_SERVER element within a key ring, as in the following example:

<KEY_RING>
  <EXCHANGE_SERVER HOST="exchange.mycompany.com">
    <KEY>B9qtQOON6skNTFTHm9rnn04=</KEY>
  </EXCHANGE_SERVER>
</KEY_RING>

EXCHANGE_SERVER Element

This element opens a block of configuration for authenticating to an Exchange server. It has one required attribute, HOST.

The HOST attribute specifies the name of the Exchange server to which the supplied configuration information applies.

Authenticating with a Proxy Server

A key ring file may also be used to specify authentication configuration for proxy servers.

Note: Basic authentication is the only method supported by Forge for authenticating with proxy servers.

The proxy server will expect a valid username and password combination, which may be specified via a KEY element embedded in a PROXY element within a key ring, as in the following example:

<KEY_RING>
  <PROXY HOST="proxy.mycompany.com" PORT="8080">
    <KEY>J9dtQOOR6skPTFTHm5rnn08=</KEY>
  </PROXY>
</KEY_RING>

PROXY Element

The PROXY element contains configuration for proxy authentication. It has two required attributes:

• The HOST attribute specifies the host name of the proxy server to which the supplied configuration information applies.

• The PORT attribute specifies the port number, on the proxy host specified in the HOST attribute, to which the supplied configuration information applies.

Using Forge to Encrypt Keys and Pass Phrases

Forge requires the username/password pairs or pass phrases kept in KEY elements within the key ring file to be stored in an encrypted form that only Forge can decode. Forge provides a command-line flag, --encryptKey, which should be used to put the contents of KEY elements in this form. The flag has the following syntax:

forge --encryptKey [username:]passphrase


Encrypting a Username/Password Pair

The following example shows how to run Forge to encrypt a username/password pair (username=sales, password=Endeca) for use in an HTTP block of a key ring file:

forge --encryptKey sales:Endeca

As the example illustrates, the username and password must be entered together, separated by a colon, as the argument to the --encryptKey flag. Forge then outputs the encrypted key, which you then insert in the body of the applicable KEY element.

Encrypting a Pass Phrase

To encrypt the pass phrase "burning down the house", run Forge with the following command:

forge --encryptKey "burning down the house"


SECTION II: Record Features


Chapter 3

Creating Aggregated Records

The Endeca aggregated records feature allows the end user to group records by dimension or property values. By configuring aggregated records, you enable the Navigation Engine to handle a group of multiple records as though it were a single record, based on the value of the rollup key. A rollup key can be any property or dimension with the rollup attribute set to true, as described in “Enabling Record Aggregation” on page 107.

Aggregated records are typically used to eliminate duplicate display entries. For example, the same album title may exist in several formats, with different prices. Each format is represented in the Navigation Engine as a distinct Endeca record. When querying the Navigation Engine, you may want to treat these instances as a single record. This is accomplished by creating an Endeca aggregated record.

From a performance perspective, aggregated Endeca records are not an expensive feature. However, they should only be used when necessary, because they add organization and implementation complexity to the application (particularly if the rollup key is different from the display information).


Aggregated Record Behavior

Aggregated records behave differently than ordinary records, as follows:

• Representative values—Given a single record, evaluating the record’s information is straightforward. However, aggregated records consist of many records, which can have different representative values. Generally for display and other logic requiring record values, a single representative record from the aggregated record is used. The representative record is the individual record that occurs first in order of the underlying records in the aggregated record. This order is determined by either a specified sort key or a relevance ranking strategy.

• Sort—The sort feature is first applied to all records in the data set (prior to aggregating the records). The record at the top of this set is the record with the highest sort value. Given the sorted set of records, aggregated records are created by iterating over the set in descending order, aggregating records with the same rollup key. An aggregated record’s rank is equal to that of the highest ranking record in that aggregated record set. The result is the same as aggregating all records on the rollup key, taking the highest value of the sort key for these aggregated records and sorting the set based on this value.

Note that if you have a defined list of sort keys, the first key is the primary sort criterion, the second key is the secondary sort criterion, and so on.


• More control—The user may want to gain more control over representative values and/or sort. For example, the desired behavior may be to sort the aggregated records by the maximum price. This can be accomplished by configuring a derived property. In this case the property would derive from price with the MAX function applied. This derived property could be configured as a sort key, ensuring that the aggregated records are sorted by the maximum price of the records in the set.

The presentation developer has more power over retrieving the representative values. The individual records are returned with the aggregated record. Therefore, the developer has all the information necessary to correctly represent aggregated records (at the cost of increased complexity). However, to achieve the desired sort behavior, the Navigation Engine must be configured correctly, because the internals of this operation are not exposed to the presentation developer.
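The sort-then-aggregate behavior described above can be sketched outside the Navigation Engine. This toy model is not Endeca code; the record fields and the use of price as the sort key are assumptions for illustration:

```java
import java.util.*;

public class RollupSketch {
    // A toy record: a rollup key (e.g. Winery) plus a sort value (e.g. price)
    static class Rec {
        public final String key;
        public final double price;
        Rec(String key, double price) { this.key = key; this.price = price; }
    }

    // Sort all records first (descending price), then aggregate in order:
    // each group's first record is its representative, so the groups come
    // out ordered by the maximum price within each group.
    static Map<String, List<Rec>> rollup(List<Rec> recs) {
        List<Rec> sorted = new ArrayList<>(recs);
        sorted.sort((a, b) -> Double.compare(b.price, a.price));
        Map<String, List<Rec>> groups = new LinkedHashMap<>();
        for (Rec r : sorted) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<Rec>> g = rollup(List.of(
            new Rec("Ch. Margaux", 30), new Rec("Ridge", 20),
            new Rec("Ch. Margaux", 10)));
        // Ch. Margaux sorts first (max price 30); its representative is
        // the 30 record
        System.out.println(g.keySet()); // [Ch. Margaux, Ridge]
    }
}
```

This mirrors the point made above: sorting first and then aggregating gives the same ordering as sorting the aggregates by a MAX-derived property.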

Enabling Record Aggregation

In Developer Studio, you enable aggregate Endeca record creation by allowing record rollups based on properties and dimensions.

Proper configuration of this feature requires that the rollup key be a single-assign value. That is, each record should have at most one value from this dimension or property. If the value is not single-assign, the "first" (arbitrarily chosen) value is used to create the aggregated record. This can cause the results to vary arbitrarily, depending upon the navigation state of the user. In addition, features such as sort can change the grouping of aggregated records that are assigned multiple values of the rollup key.

To enable a property for record aggregation:

1. In the Project tab of Developer Studio, double-click Properties to open the Properties view.

2. Select a property and click Edit. The Property editor is displayed.

3. In the General tab, check Rollup.

4. Click OK.

5. From the File menu, choose Save.

To enable a dimension for record aggregation:

1. In the Project tab of Developer Studio, double-click Dimensions to open the Dimensions view.

2. Select a dimension and click Edit. The Dimension editor is displayed.

3. In the Dimension editor, click the Advanced tab.

4. Check Enable for Rollup.

5. Click OK.


Generating and Displaying Aggregated Records

The general procedure of generating and displaying aggregated records is as follows:

1. Determine which rollup keys are available to be used for an aggregated record navigation query.

2. Create an aggregated record navigation query by using one of the available rollup keys. This rollup key is called the active rollup key, while all the other rollup keys are inactive.

3. Retrieve the list of aggregated records from the Navigation object and display their attributes.

These steps are discussed in detail below.

Determining the Available Rollup Keys

Assuming that you have a navigation state, the following objects and method calls are used to determine the available rollup keys. These rollup keys can be used in subsequent queries to generate aggregated records.

• Navigation.getRollupKeys() - Gets the rollup keys applicable for this navigation query. The .NET version is the Navigation.RollupKeys property. The rollup keys are returned as an ERecRollupKeyList object.

• ERecRollupKeyList.size() - Gets the number of rollup keys in the ERecRollupKeyList object (Java). The .NET and COM APIs have the ERecRollupKeyList.Count property.


• ERecRollupKeyList.getKey() - Gets the rollup key from the ERecRollupKeyList object, using a zero-based index (Java). The COM version is the ERecRollupKeyList.Item() method and the .NET API has the ERecRollupKeyList.Item property. The rollup key is returned as an ERecRollupKey object.

• ERecRollupKey.getName() - Gets the name of the rollup key (Java and COM). The .NET version is the ERecRollupKey.Name property.

• ERecRollupKey.isActive() - Returns true if this rollup key was applied in the navigation query or false if it was not.

The rollup keys are retrieved from the Navigation object in an ERecRollupKeyList object. Each ERecRollupKey in this list contains the name and active status of the rollup key:

• The name is used to specify the rollup key in a subsequent navigation or aggregated record query.

• The active status indicates whether the rollup key was applied to the current query.

The following code fragments show how to retrieve a list of rollup keys, iterate over them, and display the names of keys that are active in the current navigation state.


Sample Java Code for Retrieving Rollup Keys

// Get rollup keys from the Navigation object
ERecRollupKeyList rllupKeys = nav.getRollupKeys();
// Loop through rollup keys
for (int i = 0; i < rllupKeys.size(); i++) {
    // Get each rollup key from the list
    ERecRollupKey rllupKey = rllupKeys.getKey(i);
    // If the key is active, display the key name
    if (rllupKey.isActive()) {
        %>Active rollup key: <%= rllupKey.getName() %><%
    }
}

Sample .NET Code for Retrieving Rollup Keys

// Get rollup keys from the Navigation object
ERecRollupKeyList rllupKeys = nav.RollupKeys;
// Loop through rollup keys
for (int i = 0; i < rllupKeys.Count; i++) {
    // Get each rollup key from the list
    ERecRollupKey rllupKey = (ERecRollupKey)rllupKeys[i];
    // If the key is active, display the key name
    if (rllupKey.IsActive()) {
        %>Active rollup key: <%= rllupKey.Name %><%
    }
}


Sample COM Code for Retrieving Rollup Keys

' Get rollup keys from the Navigation object
dim rllupKeys
set rllupKeys = nav.GetRollupKeys()
' Loop through rollup keys
For i = 1 to rllupKeys.Count
    ' Get rollup key
    Dim rllupKey
    set rllupKey = rllupKeys(i)
    ' If the key is active, display the key name
    if (rllupKey.isActive()) then
        %>Active rollup key: <%= rllupKey.GetName() %><%
    end if
Next

Creating Aggregated Record Navigation Queries

You can generate aggregated records with URL query parameters or with Presentation API methods.

Note: Regardless of how many properties or dimensions you have enabled as rollup keys, you can specify a maximum of one rollup key per navigation query.

Specifying the Rollup Key for the Navigation Query

To generate aggregated Endeca records, the query must be appended with an Nu parameter. The value of the Nu parameter specifies a rollup key for the returned aggregated records, using the following syntax:

Nu=<rollupkey>

For example:

controller.jsp?N=0&Nu=Winery


The records associated with the navigation query are grouped with respect to the rollup key prior to computing the subset specified by the Nao parameter (that is, if Nu is specified, Nao applies to the aggregated records rather than individual records). Aggregated records only apply to a navigation query. Therefore, the Nu query parameter is only valid with an N parameter.
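For instance (parameter values hypothetical), a query that pages past the first ten results applies Nao to the aggregated records, not to the underlying individual records:

```
controller.jsp?N=0&Nu=Winery&Nao=10
```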

The equivalent Java API method to the Nu parameter is the ENEQuery.setNavRollupKey() method; for example:

usq.setNavRollupKey(“Winery”);

The .NET version is the ENEQuery.NavRollupKey property.

When the aggregated record navigation query is made, the returned Navigation object will contain an AggrERecList object.

Setting the Maximum Number of Returned Records

You can use the Np parameter to control the maximum number of Endeca records returned in any aggregated record. Set the parameter to 0 (zero) for no records, 1 for one record, or 2 for all records. For example:

controller.jsp?Np=1&N=1&Nu=Winery

The ENEQuery.setNavERecsPerAggrERec() method is the equivalent API method. The .NET version is the ENEQuery.NavERecsPerAggrERec property.


Creating Aggregated Record Queries

An aggregated record request is similar to an ordinary record request with these exceptions:

• If you are using URL query parameters, the A parameter is specified (instead of R). The value of the A parameter is the record spec of the aggregated record.

• If you are using API methods, use the ENEQuery.setAggrERecSpec() method to specify the aggregated record to be queried for. The .NET version is the ENEQuery.AggrERecSpec property.

• The element returned is an aggregated record (not a record).

As with an ordinary record query, the An parameter (instead of N) specifies the user's navigation state. Only records that satisfy this navigation state are included in the aggregated record. In addition, the Au parameter must be used to specify the aggregated record rollup key.

The following are examples of queries using An:

controller.jsp?An=0&A=32905&Au=Winery

controller.aspx?A=7&An=123&Au=ssn

The following example, from the nav_agg_records.jsp page in the JSP reference, shows how to use the UrlGen class to construct the URL query string:

// Create aggregated record request (start from empty request)
UrlGen urlg = new UrlGen("", "UTF-8");
urlg.addParam("A", aggrec.getSpec());
urlg.addParam("An", usq.getNavDescriptors().toString());
urlg.addParam("Au", usq.getNavRollupKey());
urlg.addParam("eneHost", (String)request.getAttribute("eneHost"));
urlg.addParam("enePort", (String)request.getAttribute("enePort"));
urlg.addParam("displayKey", (String)request.getParameter("displayKey"));
urlg.addParam("sid", (String)request.getAttribute("sid"));
String url = CONTROLLER + "?" + urlg;
%><a href="<%= url %>"><%

Note that the ENEQuery.setAggrERecSpec() method provides the aggregated record specification for the A parameter, the ENEQuery.getNavDescriptors() method gets the navigation values for the An parameter, and the ENEQuery.getNavRollupKey() method gets the name of the rollup key for the Au parameter.

Displaying Aggregated Records

The following sections describe how to handle aggregated records that have been returned by the Navigation Engine and how to display them.

Retrieving an Aggregated Record from an ENEQueryResults Object

On an aggregated record request, the aggregated record is returned as an AggrERec object in the ENEQueryResults object. Use these methods:


• ENEQueryResults.containsAggrERec() returns true if the ENEQueryResults object contains an aggregated record.

• ENEQueryResults.getAggrERec() retrieves the AggrERec object from the ENEQueryResults object. The .NET version is the ENEQueryResults.AggrERec property.

For example:

// Make Navigation Engine request
ENEQueryResults qr = nec.query(usq);
// Check for an AggrERec object in ENEQueryResults
if (qr.containsAggrERec()) {
    AggrERec aggRec = qr.getAggrERec();
    ...
}

Retrieving an Aggregated Record List from a Navigation Object

On an aggregated record navigation query, a list of aggregated records (an AggrERecList object) is returned in the Navigation object. Use these methods:

• Navigation.getAggrERecs() retrieves a list of aggregated records returned by the navigation query, as an AggrERecList object. The .NET version is the Navigation.AggrERecs property.

Note: By default, the Navigation Engine returns a maximum of 10 aggregated records. To change this number, use the ENEQuery.setNavNumAggrERecs() method.

• Navigation.getTotalNumAggrERecs() returns the number of aggregated records that matched the navigation query. Typically, this number is much higher than the number of aggregated records returned in the Navigation object, unless you used the ENEQuery.setNavNumAggrERecs() method to change the default of 10 returned aggregated records.

Displaying Aggregated Record Attributes

After you retrieve an aggregated record, you can use the following AggrERec class methods:

• getERecs() gets the Endeca records (ERec objects) that are in this aggregated record. The .NET version is the ERecs property.

• getProperties() returns the properties (as a PropertyMap object) of the aggregated record. The .NET version is the Properties property.

• getRepresentative() gets the Endeca record (ERec object) that is the representative record of this aggregated record. The .NET version is the Representative property.

• getSpec() gets the specification of the aggregated record. The .NET version is the Spec property.

• getTotalNumERecs() returns the number of Endeca records (ERec objects) that are in this aggregated record. The .NET version is the TotalNumERecs property.

The following code snippets illustrate these methods.


Sample Java Code for AggrERec Methods

Navigation nav = qr.getNavigation();
// Get total number of aggregated records that matched the query
long nAggrRecs = nav.getTotalNumAggrERecs();
// Get the aggregated records from the Navigation object
AggrERecList aggrecs = nav.getAggrERecs();
// Loop over the aggregated record list
for (int i = 0; i < aggrecs.size(); i++) {
    // Get individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs.get(i);
    // Get number of records in this aggregated record
    long recCount = aggrec.getTotalNumERecs();
    // Get the aggregated record's attributes
    String aggrSpec = aggrec.getSpec();
    PropertyMap propMap = aggrec.getProperties();
    ERecList recs = aggrec.getERecs();
    ERec repRec = aggrec.getRepresentative();
}

Sample .NET Code for AggrERec Properties

Navigation nav = qr.Navigation;
// Get total number of aggregated records that matched the query
long nAggrRecs = nav.TotalNumAggrERecs;
// Get the aggregated records from the Navigation object
AggrERecList aggrecs = nav.AggrERecs;
// Loop over the aggregated record list
for (int i = 0; i < aggrecs.Count; i++) {
    // Get individual aggregated record
    AggrERec aggrec = (AggrERec)aggrecs[i];
    // Get number of records in this aggregated record
    long recCount = aggrec.TotalNumERecs;
    // Get the aggregated record's attributes
    String aggrSpec = aggrec.Spec;
    PropertyMap propMap = aggrec.Properties;
    ERecList recs = aggrec.ERecs;
    ERec repRec = aggrec.Representative;
}


Sample COM Code for AggrERec Methods

dim nav
set nav = qr.getNavigation()
' Get total number of aggregated records that matched the query
dim nAggrRecs
nAggrRecs = nav.GetTotalNumAggrERecs()
' Get the aggregated records from the Navigation object
dim aggrecs
set aggrecs = nav.GetAggrERecs()
' Loop over aggregated record list
For i = 1 to aggrecs.Count
    ' Get individual aggregated record
    dim aggrec
    set aggrec = aggrecs(i)
    ' Get number of records in this aggregated record
    Dim recCount
    recCount = aggrec.GetTotalNumERecs()
    ' Get the aggregated record's specifier, property map,
    ' list of records, and representative record
    dim aggrSpec
    aggrSpec = aggrec.GetSpec()
    dim aggPropsMap
    set aggPropsMap = aggrec.GetProperties()
    dim recs
    set recs = aggrec.GetERecs()
    dim repRec
    set repRec = aggrec.GetRepresentative()
Next

Displaying the Records in the Aggregated Record

You display the Endeca records (ERec objects) in an aggregated record with the same procedures as described in the "Working with Endeca Records" chapter in the Endeca Developer's Guide.


In the following example, a list of aggregated records is retrieved from the Navigation object and the properties of each representative record are displayed.

Sample Java Code for Displaying the Representative Record

// Get aggregated record list from the Navigation object
AggrERecList aggrecs = nav.getAggrERecs();
// Loop over aggregated record list
for (int i=0; i<aggrecs.size(); i++) {
  // Get an individual aggregated record
  AggrERec aggrec = (AggrERec)aggrecs.get(i);
  // Get representative record of this aggregated record
  ERec repRec = aggrec.getRepresentative();
  // Get property map for representative record
  PropertyMap repPropsMap = repRec.getProperties();
  // Get property iterator to loop over the property map
  Iterator repProps = repPropsMap.entrySet().iterator();
  // Display representative record properties
  while (repProps.hasNext()) {
    // Get a property
    Property prop = (Property)repProps.next();
    // Display name and value of the property
    %>
    <tr>
    <td>Property name: <%= prop.getKey() %></td>
    <td>Property value: <%= prop.getValue() %></td>
    </tr>
    <%
  }
}


Sample .NET Code for Displaying the Representative Record

// Get aggregated record list from the Navigation object
AggrERecList aggrecs = nav.AggrERecs;
// Loop over aggregated record list
for (int i=0; i<aggrecs.Count; i++) {
  // Get an individual aggregated record
  AggrERec aggrec = (AggrERec)aggrecs[i];
  // Get representative record of this aggregated record
  ERec repRec = aggrec.Representative;
  // Get property map for representative record
  PropertyMap repPropsMap = repRec.Properties;
  // Get property list for representative record
  System.Collections.IList repPropsList = repPropsMap.EntrySet;
  // Display representative record properties
  foreach (Property repProp in repPropsList) {
    %>
    <tr>
    <td>Property name: <%= repProp.Key %></td>
    <td>Property value: <%= repProp.Value %></td>
    </tr>
    <%
  }
}


Sample COM Code for Displaying the Representative Record

' Get Aggregated Records list from the Navigation object
dim aggrecs
set aggrecs = nav.GetAggrERecs()
' Loop over aggregated record list
For i = 1 to aggrecs.Count
  ' Get an individual aggregated record
  dim aggrec
  set aggrec = aggrecs(i)
  ' Get representative record of this aggregated record
  dim repRec
  set repRec = aggrec.GetRepresentative()
  ' Get property map for representative record
  dim repPropsMap
  set repPropsMap = repRec.GetProperties()
  ' Get property iterator for representative record
  dim repProps
  set repProps = repPropsMap.EntrySet()
  ' Display representative record properties
  For k = 1 to repProps.Count
    ' Get property
    set prop = repProps(k)
    ' Display property
    %>
    <tr>
    <td>Representative property name: <%= prop.GetKey() %></td>
    <td>Representative property value: <%= prop.GetValue() %></td>
    </tr>
    <%
  Next
Next


Chapter 4

Using Derived Properties

A derived property is a property that is calculated by applying a function to properties or dimension values from each member record of an aggregated record. Subsequently, the resultant derived property is assigned to the aggregated record.

Aggregated records are a prerequisite to derived properties. (If you are not already familiar with specifying a rollup key and creating aggregated records, see “Creating Aggregated Records” on page 105.)

To see how derived properties work, consider a book application for which only unique titles are to be displayed. The books are available in several formats (various covers, special editions, and so on) and the price varies by format. Specifying Title as the rollup key aggregates books of the same title, regardless of format. To control the aggregated record’s representative price (for display purposes), use a derived property.

For example, the representative price can be the price of the aggregated record’s lowest priced member record. The derived property used to obtain the price in this example would be configured to apply a minimum function to the Price property.


Specifying Derived Properties

The DERIVED_PROP element in the Derived_props.xml file specifies a derived property. The attributes of the DERIVED_PROP element are:

• DERIVE_FROM - The property or dimension from which the derived property will be calculated.

• FCN - The function to be applied to the DERIVE_FROM properties of the aggregated record. Valid functions are MIN, MAX, AVG, or SUM. Any dimension or property type can be used with the MIN or MAX functions. Only INTEGER or FLOAT properties may be used in AVG and SUM functions.

• NAME - The name of the derived property. This name can be the same as the DERIVE_FROM attribute.

Note: Developer Studio currently does not support configuring derived properties. The workaround is to hand-edit the XML file to add the DERIVED_PROP element.

Below is an example of the XML element that defines the derived property described in the book example above.

<DERIVED_PROP
  DERIVE_FROM="PRICE"
  FCN="MIN"
  NAME="LOW_PRICE"
/>

Similarly, a derived property can derive from dimension values, if the dimension name is specified in the DERIVE_FROM attribute. In addition, the function attribute (FCN) can be MAX, AVG, or SUM, depending on the desired behavior.
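To make the function semantics concrete, the sketch below computes MIN and AVG over a hypothetical aggregate's member values in plain Java. It illustrates only the arithmetic that the FCN attribute selects; the class name and values are invented for the example, and this is not Endeca code.

```java
import java.util.Arrays;
import java.util.List;

public class DerivedPropSketch {
    // FCN="MIN": smallest value among the member records
    // (any property or dimension type supports MIN and MAX)
    static double min(List<Double> memberValues) {
        return memberValues.stream().mapToDouble(Double::doubleValue).min().getAsDouble();
    }

    // FCN="AVG": arithmetic mean (valid only for INTEGER or FLOAT properties)
    static double avg(List<Double> memberValues) {
        return memberValues.stream().mapToDouble(Double::doubleValue).average().getAsDouble();
    }

    public static void main(String[] args) {
        // Prices of the member records of one aggregated title
        List<Double> prices = Arrays.asList(24.95, 12.50, 18.00);
        System.out.println("LOW_PRICE = " + min(prices)); // prints: LOW_PRICE = 12.5
    }
}
```

A DERIVED_PROP with FCN="MIN" over PRICE behaves like min() here: the aggregated record's LOW_PRICE is the smallest member price.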

Displaying Derived Properties

The Presentation API’s semantics for a derived property are similar to those of ordinary properties, though there are a few differences. Derived properties apply only to aggregated Endeca records. Therefore, the Navigation Engine query must be properly formulated to include a rollup key (see “Creating Aggregated Records” on page 105 for more information).

In the aggregated record, the derived properties are accessed via the getProperties method (Properties property for .NET) and the representative properties are obtained by calling the getProperties method on the representative record. The representative record is retrieved from the aggregated record via the getRepresentative method (Representative property for .NET).

The following code example demonstrates how to display the names and values of an aggregated record’s derived properties. (For an example of how to display the representative record’s property values, see “Creating Aggregated Records” on page 105.)


Sample Java Code

// Get aggregated record list
AggrERecList aggrecs = nav.getAggrERecs();
for (int i=0; i<aggrecs.size(); i++) {
  // Get individual aggregated record
  AggrERec aggrec = (AggrERec)aggrecs.get(i);
  // Get all derived properties.
  PropertyMap derivedProps = aggrec.getProperties();
  Iterator derivedPropIter = derivedProps.entrySet().iterator();
  // Loop over each derived property, handle as an ordinary property.
  while (derivedPropIter.hasNext()) {
    Property prop = (Property)derivedPropIter.next();
    // Display property
    %>
    <tr>
    <td>Derived property name: <%= prop.getKey() %></td>
    <td>Derived property value: <%= prop.getValue() %></td>
    </tr>
    <%
  }
}


Sample .NET Code

// Get aggregated record list
AggrERecList aggrecs = nav.AggrERecs;
// Loop over aggregated record list
for (int i=0; i<aggrecs.Count; i++) {
  // Get an individual aggregated record
  AggrERec aggrec = (AggrERec)aggrecs[i];
  // Get all derived properties.
  PropertyMap derivedPropsMap = aggrec.Properties;
  // Get property list for the aggregated record
  System.Collections.IList derivedPropsList = derivedPropsMap.EntrySet;
  // Loop over each derived property, handle as an ordinary property.
  foreach (Property derivedProp in derivedPropsList) {
    // Display property
    %>
    <tr>
    <td>Derived property name: <%= derivedProp.Key %></td>
    <td>Derived property value: <%= derivedProp.Value %></td>
    </tr>
    <%
  }
}


Sample COM Code

' Get Aggregated Records list from the Navigation object
dim aggrecs
set aggrecs = nav.GetAggrERecs()
' Loop over aggregated record list
For i = 1 to aggrecs.Count
  ' Get an individual aggregated record
  dim aggrec
  set aggrec = aggrecs(i)
  ' Get all derived properties
  dim derivedPropsMap
  set derivedPropsMap = aggrec.GetProperties()
  ' Get property iterator
  dim derivedProps
  set derivedProps = derivedPropsMap.EntrySet()
  ' Display derived properties
  For k = 1 to derivedProps.Count
    ' Get property
    set prop = derivedProps(k)
    ' Display property
    %>
    <tr>
    <td>Derived property name: <%= prop.GetKey() %></td>
    <td>Derived property value: <%= prop.GetValue() %></td>
    </tr>
    <%
  Next
Next

Troubleshooting Derived Properties

A derived property can derive from either a property or a dimension; the DERIVE_FROM attribute specifies the property name or dimension name, respectively. Avoid name collisions between properties and dimensions, as this is likely to be confusing.


Derived Property Performance

Some overhead is introduced to calculate derived properties. In most cases this should be negligible. However, large numbers of derived properties and, more importantly, aggregated records with many member records may degrade performance.


Page 131: Endeca_07_AdvFeaturesGuide

Chapter 5

Selecting a Record Set Based on a Key

A set of Endeca records is returned with every navigation query result. By default, each record includes the values from all the keys (properties and dimensions) that have record page and record list attributes. These attributes are set with the Show with Record (for record page) and Show with Record List (for record list) checkboxes, as configured in Developer Studio.

However, if you do not want all the key values, you can control the characteristics of the records returned by navigation queries by using the Select feature.

About the Select Feature

The Select feature allows you to select specific keys (Endeca properties and/or dimensions) from the data so that only a subset of values will be transferred for Endeca records in a query result set. The Select functionality allows the application developer to determine these keys dynamically, instead of at Dgraph or Agraph startup. This selection will override the default record page and record list fields.


A Web application that does not make use of all of the properties and dimension values on a record can be more efficient by requesting only the values that it will use. The ability to limit which fields are returned is useful for exporting bulk-format records and other scenarios. For example, if a record has properties that correspond to the same data in a number of languages, the application can retrieve only the properties that correspond to the current language. Or, the application may render the record list using tabs to display different sets of data columns (for example, one tab to view customer details and another to view order details) without always returning the data needed to populate both tabs.

This functionality prevents the transferring of unneeded properties and dimension values when they will not be used by the front-end Web application. It therefore makes the application more efficient because the unneeded data does not take up network bandwidth and memory on the application server.

The Select feature can also be used to specifically request fields that are not transferred by default.

Configuring the Select Feature

No system configuration is required for the Select feature. In other words, no instance configuration is required in Developer Studio, and no Dgidx or Dgraph/Agraph flags are required to enable selection of properties and dimensions. Any existing property or dimension can be selected.

Using URL Query Parameters for Select

A query for selected fields is the same as any valid navigation query. Therefore, the Navigation parameter (N) is required for the request.

Selecting Keys in the Application

With the Select feature, the Web application can specify which properties and dimensions should be returned for the result record set from the navigation query.

Using the Java Selection Method

You set the selection list on the ENEQuery object with the setSelection() method, which has this syntax:

ENEQuery.setSelection(FieldList selectFields)

where selectFields is a list of property or dimension names that should be returned with each record. You can populate the FieldList object with string names (such as "P_WineType") or with Property or Dimension objects. In the case of objects, the FieldList.addField method will automatically extract the string name from the object and add it to the FieldList object.


During development, you can check which fields are set with the ENEQuery.getSelection() method, which returns a FieldList object.

The FieldList object will contain a list of Endeca property and/or dimension names for the query. For details on the methods of the FieldList class, see the Endeca Javadocs.

The following is a simple Java example of setting an Endeca property and dimension for a navigation query:

// Create a query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Create an empty selection list
FieldList fList = new FieldList();
// Add an Endeca property to the list
fList.addField("P_WineType");
// Add an Endeca dimension to the list
fList.addField("Designation");
// Add the selection list to the query
usq.setSelection(fList);
// Make the ENE query
ENEQueryResults qr = nec.query(usq);

When the ENEQueryResults object is returned, it will have a list of records that have been tagged with the P_WineType property and the Designation dimension. You extract the records as with any record query.

Note: The setSelection() and getSelection() methods are also available in the UrlENEQuery class.


Using the .NET Selection Property

In a .NET application, the ENEQuery.Selection property is used to get and set the FieldList object. You can add properties or dimensions to the FieldList object with the FieldList.AddField method.

The following is a C# example of setting an Endeca property and dimension for a navigation query:

// Create a query
ENEQuery usq = new UrlENEQuery(queryString, "UTF-8");
// Create an empty selection list
FieldList fList = new FieldList();
// Add an Endeca property to the list
int i = fList.AddField("P_WineType");
// Add an Endeca dimension to the list
i = fList.AddField("Designation");
// Add the selection list to the query
usq.Selection = fList;
// Make the ENE query
ENEQueryResults qr = nec.Query(usq);

Using the COM/Perl Selection Methods

In a COM or Perl application, the Endeca property and/or dimension names for the record query are supplied via an Endeca FieldList collection object. You add the key names with the FieldList.Add method.

You then use the ENEQuery.SetSelection() method to set the FieldList object in the query to the Navigation Engine.


The following is a COM example of setting an Endeca property and dimension for a navigation query:

' Create a query
Dim usq
Set usq = Server.CreateObject("Endeca.UrlENEQuery")
usq.init Request.QueryString, "UTF-8"
' Create an empty selection list
Dim flist
Set flist = Server.CreateObject("Endeca.FieldList")
' Add an Endeca property to the list
flist.Add("P_WineType")
' Add an Endeca dimension to the list
flist.Add("Designation")
' Add the selection list to the query
usq.SetSelection(flist)
' Make Navigation Engine query
Dim qr
Set qr = nec.Query(usq)


Chapter 6

Bulk Export of Records

The Bulk Export feature allows your application to perform a navigation query for a large number of records. Each record in the resulting record set is returned from the Navigation Engine in a bulk export-ready, gzipped format. The records can then be exported to external formats, such as a Microsoft Excel file or a CSV (comma-separated value) file.

The number of records an application can request is typically limited by the memory available to the front-end application server. The Bulk Export feature adds a means of delaying parsing and ERec or AggrERec object instantiation, which allows front-end applications to handle requests for large numbers of records.

Configuring the Bulk Export Feature

Endeca properties and/or dimensions that will be included in a result set for bulk exporting must be configured in Developer Studio with the Show with Record List checkbox enabled. When this checkbox is set, the property or dimension will appear in the record list display.


No Dgidx or Dgraph flags are necessary to enable the bulk exporting of records. Any property or dimension that has the Show with Record List attribute is available to be exported.

Using URL Query Parameters for Bulk Export

A query for bulk export records is the same as any valid navigation query. Therefore, the Navigation parameter (N) is required for the request.

Retrieving Bulk Records in the Application

By using members from the ENEQuery and Navigation classes, you can set the number of bulk-format records to be returned by the Navigation Engine and then retrieve them from the navigation query object.

Setting the Number of Bulk Records

When creating the navigation query, the application can specify the number of Endeca records or aggregated records that should be returned in a bulk format with these Java/COM/Perl methods:

• ENEQuery.setNavNumBulkERecs() sets the maximum number of Endeca records (ERec objects) to be returned in a bulk format from a navigation query.

• ENEQuery.setNavNumBulkAggrERecs() sets the maximum number of aggregated Endeca records (AggrERec objects) to be returned in a bulk format from a navigation query.

The MAX_BULK_ERECS_AVAILABLE constant can be used with either method to specify that all of the records that match the query should be exported; for example:

usq.setNavNumBulkERecs(MAX_BULK_ERECS_AVAILABLE);

To find out how many records will be returned for a bulk-record navigation query, use these Java/COM/Perl methods:

• ENEQuery.getNavNumBulkERecs() is for Endeca records.

• ENEQuery.getNavNumBulkAggrERecs() is for aggregated Endeca records.

The ENEQuery.NavNumBulkAggrERecs and ENEQuery.NavNumBulkERecs properties are the .NET versions of the above methods.

Note: All of the above methods are also available in the UrlENEQuery class.

The following Java example sets the maximum number of bulk-format records to 5,000 for a navigation query:

// Set Navigation Engine connection
ENEConnection nec = new HttpENEConnection(eneHost, enePort);
// Create a query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Specify the maximum number of records to be returned
usq.setNavNumBulkERecs(5000);
// Make the query to the Navigation Engine
ENEQueryResults qr = nec.query(usq);


Retrieving the Bulk-format Records

The list of Endeca records is returned from the Navigation Engine inside the standard Navigation object. The records are returned compressed in a gzipped format. The format is not directly exposed to the application developer; the developer has access to the bulk data only through the methods of the API language being used.
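Although the API performs the gunzip step transparently, the underlying mechanism is ordinary gzip compression. The standalone round trip below (java.util.zip; illustrative, not Endeca code) shows what compressing and restoring a record's bytes involves:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress a byte array with gzip
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(data);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress a gzipped byte array (what next()/Current do per record)
    static byte[] gunzip(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "P_WineType=Merlot".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
        // prints: P_WineType=Merlot
    }
}
```

Decompressing one record at a time, as the bulk iterators do, is what keeps the memory footprint low.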

It is up to the front-end application developer to determine what to do with the retrieved records. For example, you can display each record’s property and/or dimension values, as documented in the Endeca Developer’s Guide. You can also write code to properly format the property and dimension values for export to an external file, such as a Microsoft Excel file or a CSV file.
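As one illustration of export formatting, the sketch below builds a CSV row from a record's property values, using a plain Map as a stand-in for the record's property map. The class is hypothetical, not part of the Endeca API, and the quoting follows common CSV conventions (RFC 4180 style).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CsvExportSketch {
    // Quote a field if it contains a comma, double quote, or newline
    static String csvField(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Join one record's property values into a CSV row
    static String csvRow(Map<String, String> props) {
        return props.values().stream()
                .map(CsvExportSketch::csvField)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Stand-in for one record's property map
        Map<String, String> props = new LinkedHashMap<>();
        props.put("P_Name", "Merlot, Reserve");
        props.put("P_Price", "24.95");
        System.out.println(csvRow(props)); // prints: "Merlot, Reserve",24.95
    }
}
```

In a real export loop, csvRow would be called once per gunzipped record and the rows streamed to the response or a file.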

Using Java Bulk Export Methods

The list of Endeca records is returned as a standard Java Iterator object. To access the bulk-record iterator, use one of these methods:

• Navigation.getBulkERecIter() returns an Iterator object containing the list of Endeca bulk-format records (ERec objects).

• Navigation.getBulkAggrERecIter() returns an Iterator object containing the list of aggregated Endeca bulk-format records (AggrERec objects).


The Iterator class provides access to the bulk-exported records with the following methods:

• Iterator.hasNext()—Returns true if the Iterator has more records.

• Iterator.next()—Extracts (using gunzip) and returns the next record in the iteration. The record is returned as either an ERec or AggrERec object, depending on which Navigation method was used to retrieve the iterator.

The following Java code fragment shows how to set the maximum number of bulk-format records to 5,000 and then obtain a record list and iterate through the list:

// Create a query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Specify the maximum number of bulk export records to be returned
usq.setNavNumBulkERecs(5000);
// Make the query to the Navigation Engine
ENEQueryResults qr = nec.query(usq);
// Verify we have a Navigation object before doing anything.
if (qr.containsNavigation()) {
  // Get the Navigation object
  Navigation nav = qr.getNavigation();
  // Get the Iterator object that has the ERecs
  Iterator bulkRecs = nav.getBulkERecIter();
  // Loop through the record list
  while (bulkRecs.hasNext()) {
    // Get a record, which will be gunzipped
    ERec record = (ERec)bulkRecs.next();
    // Display its properties or format the record for export
    // ...
  }
}


Using COM/Perl Bulk Export Methods

In a COM or Perl application, the list of Endeca records is returned as an Endeca ERecIter collection object. To access this object, use the Navigation.GetBulkERecIter() or Navigation.GetBulkAggrERecIter() method.

The ERecIter class provides HasNext(), NextERec(), and NextAggrERec() methods to access the bulk-exported records. The latter two methods will gunzip the next result record and materialize the per-record object.

This COM sample shows how to set the maximum number of bulk-format records to 5,000 and iterate through the returned ERecIter object.

' Create a query
Dim usq
Set usq = Server.CreateObject("Endeca.UrlENEQuery")
usq.init Request.QueryString, "UTF-8"
usq.SetNavNumBulkERecs(5000)
' Make the Navigation Engine request
Dim qr
Set qr = nec.Query(usq)
If qr.ContainsNavigation() then
  Dim nav
  Set nav = qr.GetNavigation()
  ' Get the ERecIter object that has the records
  Dim bulkrecs
  Set bulkrecs = nav.GetBulkERecIter()
  ' Loop over ERecIter object and get records
  Do While bulkrecs.HasNext()
    ' Get individual bulk-format record
    dim rec
    set rec = bulkrecs.NextERec()
    ' Display properties or format for export
    ' ...
  Loop
End If


Using .NET Bulk Export Methods

In a .NET application, the list of Endeca records is returned as an Endeca ERecEnumerator object. To retrieve this object, use the Navigation.BulkAggrERecEnumerator or Navigation.BulkERecEnumerator property.

The following .NET code sample shows how to set the maximum number of bulk-format records to 5,000, obtain the record list, and iterate through the collection:

// Create a query
ENEQuery usq = new UrlENEQuery(queryString, "UTF-8");
// Set max number of returned bulk-format records
usq.NavNumBulkERecs = 5000;
// Make the query to the Navigation Engine
ENEQueryResults qr = nec.Query(usq);
// First verify we have a Navigation object.
if (qr.ContainsNavigation()) {
  // Get the Navigation object
  Navigation nav = qr.Navigation;
  // Get the ERecEnumerator object that has the ERecs
  ERecEnumerator bulkRecs = nav.BulkERecEnumerator;
  // Loop through the record list
  while (bulkRecs.MoveNext()) {
    // Get a record, which will be gunzipped
    ERec record = (ERec)bulkRecs.Current;
    // Display its properties or format for export
    // ...
  }
}

After the ERecEnumerator object is created, the enumerator is positioned before the first element of the collection, and the first call to MoveNext() moves the enumerator over the first element of the collection. After the end of the collection is passed, subsequent calls to MoveNext() return false. The Current property will gunzip the current result record in the collection and materialize the per-record object.

Performance Impact for Bulk Export Records

Exporting records from a Navigation Engine without the Bulk Export feature typically incurs unneeded overhead. The front end converts the on-wire representation of all the records into objects in the API language, which is not appropriate for bulk export given the memory footprint that results from multiplying a large number of records by the relatively high overhead of the Endeca record object format. For export, converting all of the result records to API-language objects at once requires an unacceptable amount of application server memory.

Reducing the per-record memory overhead allows you to output a large number of records from existing applications. Without this feature, applications that want to export large amounts of data are required to split up the task and deal with a few records at a time to avoid running out of memory in the application server’s threads. This division of exports adds query-processing overhead to the Navigation Engine, which reduces system throughput and slows down the export process.

In addition, the compressed format of bulk-export records further reduces the application's memory usage.


Chapter 7

Record Filters

Note: This feature is available for customers who have purchased Endeca InFront’s Custom Catalogs or Endeca ProFind’s User Access Filters component.

Record filters allow an Endeca application to define arbitrary subsets of the total record set and dynamically restrict search and navigation results to these subsets.

For example, the catalog might be filtered to a subset of records appropriate to the specific end user or user role. The records might be restricted to contain only those visible to the current user based on security policies. Or, an application might allow end users to define their own custom record lists (that is, the set of parts related to a specific project) and then restrict search and navigation based on a selected list. Record filters enable these and many other application features that depend on applying Endeca search and navigation to dynamically defined and selected subsets of the data.

Record filters support Boolean syntax using property values and dimension values as base predicates and standard Boolean operators (AND, OR, and NOT) to compose complex expressions. For example, a filter can consist of a list of part number property values joined in a multi-way OR expression. Or, a filter might consist of a complex nested


expression of ANDs, ORs, and NOTs on dimension IDs and property values.

Filter expressions can be saved and loaded from XML files, or passed directly as part of a Navigation Engine query. In either case, when a filter is selected, the set of visible records is restricted to those matching the filter expression. For example, record search queries will not return records outside the selected subset, and refinement dimension values are restricted to lead only to records contained within the subset.

Finally, it is important to note that record filters are case-sensitive.

Record Filter Syntax

Record filters can be specified directly within a Navigation Engine query. For example, the complete Boolean expression representing the desired record subset can be passed directly in an application URL.

In some cases, however, filter expressions require persistence (for example, when the application allows the end user to define and save custom part lists) or may grow too large to be passed conveniently as part of the query (for example, a filter list containing thousands of part numbers). To handle cases such as these, the Navigation Engine also supports file-based filter expressions.


File-based filter expressions are simply files stored in a defined location containing XML representations of filter expressions. This section describes both the ENE Query and XML syntaxes for filter expressions.

ENE Query Syntax

The following BNF grammar describes the syntax for query-level filter expressions:

<filter> ::= <and-expr> | <or-expr> | <not-expr> | <filter-expr> | <literal>
<and-expr> ::= AND(<filter-list>)
<or-expr> ::= OR(<filter-list>)
<not-expr> ::= NOT(<filter>)
<filter-expr> ::= FILTER(<string>)
<filter-list> ::= <filter> | <filter>,<filter-list>
<literal> ::= <pval> | <dval-id> | <dval-path>
<pval> ::= <prop-key>:<prop-value>
<prop-key> ::= <string>
<prop-value> ::= <string>
<dval-path> ::= <string> | <string>:<dval-path>
<dval-id> ::= <unsigned-int>
<string> ::= Any character string.

The following five special reserved characters must be prepended with an escape character (\) for inclusion in a string: ( ) , : \
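A small helper along these lines (hypothetical, shown only to illustrate the escaping rule; not part of the Endeca API) can prepare arbitrary strings for inclusion in a filter expression:

```java
public class FilterEscape {
    // Prepend a backslash to each of the five reserved characters: ( ) , : \
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '(' || c == ')' || c == ',' || c == ':' || c == '\\') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A property value containing reserved characters
        System.out.println(escape("AC Adapter (110V, 220V)"));
        // prints: AC Adapter \(110V\, 220V\)
    }
}
```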


Basically, the syntax supports prefix-oriented Boolean functions (AND, OR, and NOT), colon-separated paths for dimension values and property values, and numeric dimension value IDs.

The following example illustrates a basic filter expression:

OR(AND(Manufacturer:Sony,1001), AND(Manufacturer:Aiwa,NOT(1002)), Manufacturer:Denon)

This expression will match the set of records satisfying any of the following statements:

• The value for the Manufacturer property is Sony and the record is assigned dimension value 1001.

• The value for Manufacturer is Aiwa and the record is not assigned dimension value 1002.

• The value for the Manufacturer property is Denon.
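These matching rules can be sketched in plain Java, with a property map and a set of assigned dimension value IDs standing in for a record. This illustrates only the Boolean logic of the example expression, not how the Navigation Engine actually evaluates filters.

```java
import java.util.Map;
import java.util.Set;

public class FilterSemantics {
    // Evaluate OR(AND(Manufacturer:Sony,1001),
    //             AND(Manufacturer:Aiwa,NOT(1002)),
    //             Manufacturer:Denon)
    // against a record stand-in: a property map plus assigned dimension value IDs
    static boolean matches(Map<String, String> props, Set<Integer> dvals) {
        boolean sony  = "Sony".equals(props.get("Manufacturer"))  && dvals.contains(1001);
        boolean aiwa  = "Aiwa".equals(props.get("Manufacturer"))  && !dvals.contains(1002);
        boolean denon = "Denon".equals(props.get("Manufacturer"));
        return sony || aiwa || denon;
    }

    public static void main(String[] args) {
        // A Sony record carrying dimension value 1001 matches the first AND branch
        System.out.println(matches(Map.of("Manufacturer", "Sony"), Set.of(1001)));  // true
        // A Sony record without 1001 matches no branch
        System.out.println(matches(Map.of("Manufacturer", "Sony"), Set.of(1002))); // false
    }
}
```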

Aside from the nested Boolean operations illustrated by the above example, a key aspect of query filter expressions is the ability to refer to file-based filter expressions using the FILTER operator. For example, if a filter is stored in a file called MyFilter, that filter can be selected as follows:

FILTER(MyFilter)

FILTER operators can be combined with normal Boolean operators to compose filter operations, as in this example:

AND(FILTER(MyFilter),NOT(Manufacturer:Sony))

Advanced Features Guide Endeca ConfidentialChapter 7


The expression selects records that satisfy the expression contained in the file MyFilter but that do not have the value Sony assigned to the Manufacturer property.

XML Syntax for File-based Record Filter Expressions

The syntax for file-based record filter expressions closely mirrors the query level syntax, with the following differences:

• In place of the AND, OR, NOT, and FILTER operators, the FILTER_AND, FILTER_OR, FILTER_NOT, and FILTER_NAME XML elements are used, respectively.

• In place of the property and dimension value syntax used for query expressions, the PROP, DVAL_ID, and DVAL_PATH elements common to other Endeca XML DTDs are used to refer to property and dimension values.

• Instead of parentheses to enclose operand lists, normal XML element nesting (implicit in the locations of element start and end tags) is used.

The full DTD for XML file-based record filter expressions is provided in the filter.dtd file packaged with the Endeca software release.

For example, the following query expression:

OR(AND(Manufacturer:Sony,1001), AND(Manufacturer:Aiwa,NOT(1002)), Manufacturer:Denon)


is represented as a file-based expression using the following XML syntax:

<FILTER>
  <FILTER_OR>
    <FILTER_AND>
      <PROP NAME="Manufacturer"><PVAL>Sony</PVAL></PROP>
      <DVAL_ID ID="1001"/>
    </FILTER_AND>
    <FILTER_AND>
      <PROP NAME="Manufacturer"><PVAL>Aiwa</PVAL></PROP>
      <FILTER_NOT>
        <DVAL_ID ID="1002"/>
      </FILTER_NOT>
    </FILTER_AND>
    <PROP NAME="Manufacturer"><PVAL>Denon</PVAL></PROP>
  </FILTER_OR>
</FILTER>

Just as file-based expressions can be composed with query expressions, file expressions can also be composed within other file expressions. For example, the following query expression:

AND(FILTER(MyFilter),NOT(Manufacturer:Sony))

can be represented as a file-based expression using the following XML:

<FILTER>
  <FILTER_AND>
    <FILTER_NAME NAME="MyFilter"/>
    <FILTER_NOT>
      <PROP NAME="Manufacturer"><PVAL>Sony</PVAL></PROP>
    </FILTER_NOT>
  </FILTER_AND>
</FILTER>


Enabling Properties for Use in Record Filters

All dimension values are automatically enabled for use in record filter expressions. Properties must be explicitly enabled for use in record filters by using Developer Studio.

To configure an existing property for use in record filters:

1. In the Project tab of Developer Studio, double-click Properties.

2. From the Properties view, select a property and click Edit. The Property editor is displayed.

3. Check Enable for Record Filters.

4. Click OK. The Properties view is redisplayed.

5. From the File menu, choose Save.

Data Configuration for File-based Filter Expressions

To use file-based filter expressions in an application, you must create a directory to contain record filter files in the same location where the Navigation Engine index data will reside. The name of this directory must be <index_prefix>.fcl.
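Creating this directory and the filter files inside it requires only ordinary file operations. The following sketch uses a hypothetical helper class and illustrative paths; it is not part of the Endeca tooling:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: create the <index_prefix>.fcl directory next to the index
// data and store a file-based record filter in it. The paths and the
// filter body used here are illustrative assumptions.
public class FclSetup {

    public static Path writeFilter(Path indexPrefix, String filterName,
                                   String filterXml) throws IOException {
        // The filter directory must be named <index_prefix>.fcl.
        Path fclDir = indexPrefix.resolveSibling(
                indexPrefix.getFileName() + ".fcl");
        Files.createDirectories(fclDir);
        // The file name is the name used in FILTER(<name>) references.
        Path filterFile = fclDir.resolve(filterName);
        Files.writeString(filterFile, filterXml);
        return filterFile;
    }

    public static void main(String[] args) {
        Path prefix = Path.of(
            "/usr/local/endeca/my_app/data/partition0/dgidx_output/index");
        // Derive the filter directory name from the index prefix.
        System.out.println(prefix.resolveSibling(prefix.getFileName() + ".fcl"));
    }
}
```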

For example, if the Navigation Engine index data resides in the directory:

/usr/local/endeca/my_app/data/partition0/dgidx_output/

and the index data prefix is:

/usr/local/endeca/my_app/data/partition0/dgidx_output/index

then the directory created to contain record filter files must be:

/usr/local/endeca/my_app/data/partition0/dgidx_output/index.fcl

Record filters that are needed by the application should be stored in this directory, which is searched automatically when record filters are selected in an ENE query. For example, if in the above case you create a filter file with the path:

/usr/local/endeca/my_app/data/partition0/dgidx_output/index.fcl/MyFilter

then the filter expression stored in this file will be used when a query refers to the filter MyFilter. For example, the URL query:

N=0&Nr=FILTER(MyFilter)

will use this filter file.

Record Filter Result Caching

The Navigation Engine caches the results of all record filter evaluations for re-use on subsequent ENE queries as part of the global dynamic cache. Both file-based record filters (that is, record filter expressions using the FILTER operator) and UrlENEQuery-based record filters are cached.


The one caveat to this general rule is that any information derived from file-based record filters is not cached. This means that navigation refinements, navigation counts, and so on are not cached if they result from file-based record filters.

Therefore, Endeca recommends that you use ENEQuery-based record filters (set via UrlENEQuery) instead of file-based record filters whenever possible.

ENE URL Query Parameters for Record Filters

Three ENE URL query parameters are available to control the use of record filters:

• Nr - Links to the ENEQuery.setNavRecordFilter method. The Nr parameter can be used to specify a record filter expression that will restrict the results of a navigation query.

• Ar - Links to the ENEQuery.setAggrERecNavRecordFilter method. The Ar parameter can be used to specify a record filter expression that will restrict the records contained in an aggregated-record result returned by the Navigation Engine.

• Dr - Links to the ENEQuery.setDimSearchNavRecordFilter method. The Dr parameter can be used to specify a record filter expression that will restrict the universe of records considered for a dimension search. Only dimension values represented on at least one record satisfying the specified filter will be returned as search results.


Sample Queries

<application>?N=0&Nr=FILTER(MyFilter)
<application>?A=2496&An=0&Ar=OR(10001,20099)
<application>?D=Hawaii&Dn=0&Dr=NOT(Subject:Travel)
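If you assemble such query strings by hand rather than through the UrlGen class, the filter expression must be URL-encoded like any other parameter value. A minimal standard-library sketch (the parameter layout mirrors the first sample query above):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build the Nr parameter for a navigation query by hand.
// The surrounding application URL is assumed to be supplied elsewhere.
public class RecordFilterUrl {

    public static String navQuery(String filterExpr) {
        // Reserved URL characters such as ( and ) become %28 / %29.
        String encoded = URLEncoder.encode(filterExpr, StandardCharsets.UTF_8);
        return "N=0&Nr=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(navQuery("FILTER(MyFilter)"));
        // prints N=0&Nr=FILTER%28MyFilter%29
    }
}
```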

Record Filter Performance Implications




Record filters can impact the following areas:

• Spelling auto-correction and spelling Did You Mean

• Memory cost

• Expression evaluation

Interaction with Spelling Auto-correction and Spelling Did You Mean

Record filters impose an extra cost on spelling auto-correction and spelling Did You Mean.

Memory Cost

The evaluation of record filter dimension value expressions is based on the same indexing technology that supports navigation queries in the Dgraph. Because of this, there is no additional memory or indexing cost associated with using navigation dimension values in record filters. When using property values in record filter expressions, additional memory and indexing cost is incurred because properties are not normally indexed for navigation.

This feature is controlled in Developer Studio by the Enable for Record Filters setting on the Property editor.

Expression Evaluation

Because expression evaluation is based on composition of indexed information, most expressions of moderate size (that is, tens of terms and operators) do not add significantly to request processing time. Furthermore, because the Dgraph caches the results of record filter operations on an LRU (least recently used) basis, the costs of expression evaluation are typically only incurred on the first use of a filter during a navigation session. However, some expected uses of record filters have known performance bounds, which are described in the following two sections.

Large OR Filters (“Part Lists”)

One common use of record filters is the specification of lists of individual records to identify data subsets (for example, custom part lists for individual customers, culled from a superset of parts for all customers).

The total cost of processing a record filter can be broken down into two main parts: the parsing cost and the evaluation cost. For large expressions such as these, which are commonly stored as file-based filters, XML parsing performance dominates the total processing cost.

XML parsing cost is linear in the size of the filter expression, but incurs a much higher unit cost than actual expression evaluation. Though lightweight, expression evaluation exhibits non-linear slowdown as the size of the expression grows.

OR expressions with a small number of operands perform linearly in the number of results, even for large result sets. While the expression evaluation cost is reasonable into the low millions of records for large OR expressions, parsing costs relative to total query execution time can become too large, even for smaller numbers of records.


Part lists beyond approximately one hundred thousand records generally result in unacceptable performance (10 seconds or more load time, depending on hardware platform). Lists with over one million records can take a minute or more to load, depending on hardware. Because results are cached, load time is generally only an issue on the first use of a filter during a session. However, long load times can cause other Dgraph requests to be delayed and should generally be avoided.
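As a sketch of how such a part list might be generated offline as a file-based filter, the following code emits the FILTER_OR XML form. The property name P_PartID is a placeholder for whatever record property identifies parts in your data set, not a property defined by Endeca:

```java
import java.util.List;

// Sketch: generate a file-based "part list" record filter as
// FILTER_OR XML. P_PartID is an assumed, application-specific
// property name enabled for record filters.
public class PartListFilter {

    public static String toXml(List<String> partIds) {
        StringBuilder xml = new StringBuilder("<FILTER>\n  <FILTER_OR>\n");
        for (String id : partIds) {
            // One PROP operand per part in the list.
            xml.append("    <PROP NAME=\"P_PartID\"><PVAL>")
               .append(id)
               .append("</PVAL></PROP>\n");
        }
        xml.append("  </FILTER_OR>\n</FILTER>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(toXml(List.of("1001", "1002")));
    }
}
```

The resulting text would be written to a file in the <index_prefix>.fcl directory and selected with FILTER(<name>), keeping the part-list size limits discussed above in mind.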

Large-scale Negation

In most common cases, where the NOT operator is used in conjunction with other positive expressions (that is, AND with a positive property value), the cost of negation does not add significantly to the cost of expression evaluation.

However, the costs associated with less typical, large-scale negation operations can be significant. For example, while still sub-second, top-level negation filtering (such as “NOT availability=FALSE”) of a record set in the millions does not allow high throughput (generally less than 10 operations per second).

If possible, attempt to rephrase expressions to avoid the top-level use of NOT in Boolean expressions. For example, if you want to list only available products, the expression “availability=TRUE” will yield better performance than “NOT availability=FALSE.”


SECTION IIIDimension Features


Chapter 8

Using Inert Dimension Values

Marking a dimension value as inert makes it non-navigable. That is, the dimension value should not be included in the navigation state.

From an end user perspective, the behavior of an inert dimension value is similar to the behavior of a dimension within a dimension group: With dimension groups, the dimension group behaves like a dimension and the dimension itself behaves like an inert child dimension value. When the user selects the dimension, the navigation state is not changed, but instead the user is presented with the child dimension values. Similarly, when a user selects an inert dimension value, the navigation state is not changed, but the children of the dimension value are displayed for selection.

Whether or not a dimension value should be inert is a subjective design decision about the navigation flow within a dimension. Two examples of when you might use inert dimension values are the following:

• You want the “More...” option to be displayed at the bottom of an otherwise long list. To do this, use Developer Studio’s Dimension editor to enable dynamic ranking for the dimension and generate a “More…” dimension value.


• You want to define other dimension values that provide additional information to users, but for which it is not meaningful to filter items.

Note that the inert dimension value feature is purely a presentation feature and has no performance impact on the system.

Configuring Inert Dimension Values

To configure an existing dimension as non-navigable:

1. In the Project tab of Developer Studio, double-click Dimensions to open the Dimensions view.

2. Select a dimension and click Edit. The Dimension editor is displayed.

3. Select a dimension and click Values. In the Dimension Values view, the Inert column indicates which dimension values have been marked as inert.

4. Select a dimension value and click Edit. The Dimension Value editor is displayed.

5. Check Inert.

6. Click OK. The Dimension Values view is redisplayed, with a Yes indicator in the Inert column for the changed dimension value.

7. From the File menu, choose Save.

There are no Dgidx or Dgraph flags necessary to mark a dimension value as inert. Once a dimension value has been marked as inert in Developer Studio, the Presentation API will be aware of its status.

Using Inert Dimension Values in the Application

When sending the new navigation state to the Navigation Engine, the Endeca application should check the value of the isNavigable() method on each DimVal object. Only dimension values that are navigable (that is, not inert) should be sent to the Navigation Engine, for example, via the ENEQuery.setNavDescriptors() method.

Setting the Inert attribute for a dimension value indicates to the Presentation API that the dimension value should be inert. However, it is up to the front-end application to check for inert dimension values and handle them in an appropriate manner.

The following code snippet shows how a DimVal object is checked to determine if it is a navigable or inert dimension value. In the example, the N parameter is added to the navigation request only if the dimension value is navigable (not inert).


Sample Java Code for Inert Dimension Values

// Get refinement list for a Dimension object
DimValList refs = dim.getRefinements();
// Loop over refinement list
for (int k = 0; k < refs.size(); k++) {
    // Get refinement dimension value
    DimVal dimref = refs.getDimValue(k);
    // Create request to select refinement value
    urlg = new UrlGen(request.getQueryString(), "UTF-8");
    // If refinement is navigable, change the Navigation parameter
    if (dimref.isNavigable()) {
        urlg.addParam("N",
            (ENEQueryToolkit.selectRefinement(nav, dimref)).toString());
        urlg.addParam("Ne", Long.toString(rootId));
    }
    // If refinement is non-navigable, change only the exposed
    // dimension parameter (leave the Navigation parameter as is)
    else {
        urlg.addParam("Ne", Long.toString(dimref.getId()));
    }
}


Sample .NET Code for Inert Dimension Values

// Get refinement list for a Dimension object
DimValList refs = dim.Refinements;
// Loop over refinement list
for (int k = 0; k < refs.Count; k++) {
    // Get refinement dimension value
    DimVal dimref = (DimVal)refs[k];
    // Create request to select refinement value
    urlg = new UrlGen(Request.Url.Query.Substring(1), "UTF-8");
    // If refinement is navigable, change the Navigation parameter
    if (dimref.IsNavigable()) {
        urlg.AddParam("N",
            (ENEQueryToolkit.SelectRefinement(nav, dimref)).ToString());
        urlg.AddParam("Ne", rootId.ToString());
    }
    // If refinement is non-navigable, change only the exposed
    // dimension parameter (leave the Navigation parameter as is)
    else {
        urlg.AddParam("Ne", dimref.Id.ToString());
    }
}


Sample COM Code

' Get refinement list for a Dimension object
dim refs
set refs = dimn.GetRefinements
' Loop over refinement list
For k = 1 to refs.Count
    ' Get refinement dimension value
    set dimref = refs(k)
    ' Create request to expose dimension values
    set urlg = Server.CreateObject("Endeca.UrlGen")
    urlg.init Request.QueryString, "UTF-8"
    ' If refinement is navigable, change the Navigation parameter
    if (dimref.isNavigable) then
        set qt = Server.CreateObject("Endeca.ENEQueryToolkit")
        set newref = qt.SelectRefinement(nav, dimref)
        urlg.addParam "N", newref.idString
        urlg.addParam "Ne", rootId
    ' If refinement is non-navigable, change only the exposed
    ' dimension parameter (leave the Navigation parameter as is)
    else
        urlg.addParam "Ne", dimref.getId()
    end if
Next


Chapter 9

Working with Externally Created Dimensions

This document describes how to include and work with an externally created dimension in a Developer Studio project. This capability allows you to build all or part of a logical hierarchy for your data set outside of Developer Studio and then import that logical hierarchy as an Endeca dimension available for use in search and Guided Navigation.

An externally created dimension describes a logical hierarchy of a data set; however, the dimension hierarchy is transformed from its source format to Endeca-compatible .xml outside of Developer Studio. The logical hierarchy of the dimension must conform to Endeca’s external interface for describing a data hierarchy (external_dimensions.dtd) before you import the dimension into your project. Once you import an externally created dimension, its ownership is wholly transferred to Developer Studio. As such, you can modify the dimension in any way necessary using Developer Studio.

It is important to clarify the difference between an externally managed taxonomy and an externally created dimension to determine which feature document is appropriate for your purposes. The two concepts are similar but have two important differences: externally managed taxonomies and externally created dimensions differ in how you include them in a Developer Studio project and in how Developer Studio treats them once they are part of a project. Use the table below to determine which one you are working with.

The following table compares an externally managed taxonomy and an externally created dimension.

How do you modify or update the hierarchy after it is in the project?

• Externally managed taxonomy: Any changes to the dimension must be made in the third-party tool. You then export the taxonomy from the tool, and Forge transforms the taxonomy and re-integrates the changes into your project.

• Externally created dimension: You generally do not update the source file for the hierarchy after you import it into your project. If you do update the file and re-import it, then any changes you made to the dimension using Developer Studio are discarded. After you import the hierarchy, you can modify a dimension just as if you had created it manually using Developer Studio.

How does Developer Studio manage the hierarchy?

• Externally managed taxonomy: The third-party tool that created the file retains ownership. The dimension is almost entirely read-only in the project. You cannot add or remove dimension values from the dimension. However, you can modify whether dimension values are inert and collapsible.

• Externally created dimension: After you import the file, Developer Studio takes full ownership of the dimension and its dimension values. You can modify any characteristics of the dimension and its dimension values.

How do you create the .xml file?

• Externally managed taxonomy: Created using a third-party tool.

• Externally created dimension: Created either directly in an .xml file or created using a third-party tool.

How do you include the file in a Developer Studio project?

• Externally managed taxonomy: Read into a pipeline using a dimension adapter with Format set to XML - Externally Managed. Forge transforms the taxonomy file into a dimension according to the .xslt file that you specify on the Transformer tab of the dimension adapter.

• Externally created dimension: By choosing Import External Dimension on the File menu. During import, Developer Studio creates internal dimensions and dimension values for each node in the file’s hierarchy. If you create the file using a third-party tool and any .xml transformation is necessary, you must transform the file outside the project before you choose Import External Dimension on the File menu. The file must conform to external_dimensions.dtd before you import it.

If you are working with externally created dimensions, use this chapter. If you are working with an externally managed taxonomy, see “Working with an Externally Managed Taxonomy” on page 177.

An overview of the process to include an externally created dimension in a Developer Studio project is as follows:

1. You create a dimension hierarchy either manually in an .xml file or you create a dimension using a third-party tool. The dimension file must conform to Endeca’s external_dimensions.dtd file (described below).

2. You import the .xml file for the dimension into Developer Studio, and modify the dimension and dimension values as necessary.

XML Requirements

When you create an external dimension, whether by creating it directly in an .xml file or by transforming it from a source file, the dimension must conform to Endeca’s external_dimensions.dtd file before you import it into your project. The external_dimensions.dtd file defines the Endeca-compatible .xml used to describe dimension hierarchies in an Endeca system. This file is located in %ENDECA_ROOT%\version\conf\dtd on Windows and $ENDECA_ROOT/version/conf/dtd on UNIX.

Also, a document type (DOCTYPE) declaration that specifies the external_dimensions.dtd file is required in an external dimensions file. If you omit specifying the DTD in this declaration, none of the DTD’s implied values or other default values, such as classification values, are applied to the external dimensions during Data Foundry processing. Here is an example of the declarations that should appear at the beginning of an external dimension file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE external_dimensions SYSTEM "external_dimensions.dtd">


Here is a very simple example of an external dimension file with the required XML declaration and two dimensions:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE external_dimensions SYSTEM "external_dimensions.dtd">
<external_dimensions>
  <node id="1" name="color" classify="true">
    <node id="2" name="red" classify="true"/>
    <node id="3" name="blue" classify="true"/>
  </node>
  <node id="10" name="size" classify="true">
    <node id="20" name="small" classify="true"/>
    <node id="30" name="med" classify="true"/>
  </node>
</external_dimensions>

You can describe a dimension hierarchy using any of the three syntax options described in the following section.

XML Syntax to Specify Dimension Hierarchy

The XML elements available to external_dimensions.dtd allow a flexible XML syntax to describe a dimension hierarchy. There are three different syntax approaches you can choose from when building the hierarchy structure of your externally created dimension. All three are supported by external_dimensions.dtd. Each approach provides a slightly different syntax structure to define a dimension and express the parent/child relationship among dimensions and dimension values. The three syntax choices are as follows:


• Use nested node elements within node elements.

• Use the parent attribute of a node to reference a parent’s node ID.

• Use the child element to reference a child’s node ID.

You can use only one of the three approaches to describe a hierarchy within a single .xml file. In other words, do not mix different syntax structures within one file. Any node element without a parent node describes a new dimension. You can describe as many dimensions as necessary in a single .xml file.

The following examples show each approach to building a dimension hierarchy. These examples are semantically equivalent: each describes the same dimension and child dimension values.

Example of Using Nested node Elements

This example shows nested dimension values red and blue within the dimension color.

<node name="color" id="1">
  <node name="red" id="2"/>
  <node name="blue" id="3"/>
</node>


Example of Using Parent Attributes

This example shows the red and blue dimension values using the parent attribute. The value of the parent attribute references the ID for the dimension color.

<node name="color" id="1"/>
<node id="2" name="red" parent="1"/>
<node id="3" name="blue" parent="1"/>

Example of Using Child Elements

This example uses child elements to indicate that red and blue are dimension values of the color dimension. The ID of each child element references the ID of the red and blue nodes.

<node name="color" id="1">
  <child id="2"/>
  <child id="3"/>
</node>
<node name="red" id="2"/>
<node name="blue" id="3"/>

Note: You can also find additional information and examples of using the elements in external_dimensions.dtd in the Endeca XML Reference.

Node ID Requirements

Each node element in your dimension hierarchy must have an id attribute. Depending on your requirements, you may choose to provide any of the following values for the id attribute:


• Name - If the name of a dimension value is what determines its identity, then provide the id attribute with the name.

• Path - If the path from the root node to the dimension value determines its identity, then provide a value representing the path in the id attribute.

• Existing identifier - If a node already has an identifier, then that identifier can be used in the id attribute.

You can provide an arbitrary id value as long as the value is unique. If you are including multiple .xml files, each identifier must be unique across all of the files.

There is one scenario where an id attribute is optional. It is optional only if you are using an externally created dimension and also defining your dimension hierarchy using nested node sub-elements (rather than using parent or child ID referencing).
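If you opt for path-based identifiers, they can be derived mechanically when the hierarchy file is generated. The sketch below joins node names from the root down; the "/" separator and naming scheme are arbitrary choices, not an Endeca requirement:

```java
import java.util.List;

// Sketch: derive a path-based id attribute for a dimension value by
// joining the node names from the root down. Any scheme that yields
// ids unique across all included files will do.
public class PathIds {

    public static String pathId(List<String> namesFromRoot) {
        return String.join("/", namesFromRoot);
    }

    public static void main(String[] args) {
        System.out.println(pathId(List.of("color", "red"))); // prints color/red
    }
}
```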

Importing an Externally Created Dimension

You add an externally created dimension to your pipeline by importing it from the File menu of Developer Studio. Once you import the .xml file into Developer Studio, the dimension appears in the Dimensions view, and Developer Studio has full read-write ownership of the dimension. You can modify any aspect of a dimension and its dimension values as if you had created it manually using Developer Studio.


To import an externally created dimension:

1. From the File menu, choose Import External Dimensions. The Import External Dimensions dialog box displays.

2. Specify the .xml file that defines the dimensions.

3. Choose a dimension adapter from the “Dimension adapter to receive imported dimensions” drop-down list.

4. Click OK. The dimensions appear in the Dimensions view for you to configure as necessary.

Note: Unlike the procedure to import an externally managed taxonomy, you do not need to run a baseline update to import an externally created dimension.


Chapter 10

Working with an Externally Managed Taxonomy

This document describes how to include and work with an externally managed taxonomy in a Developer Studio project. This capability allows you to build all or part of a logical hierarchy for your data set outside of Developer Studio and use Developer Studio to transform that logical hierarchy into Endeca dimensions and dimension values for use in search and Guided Navigation.

An externally managed taxonomy is a logical hierarchy for a data set that is built and managed using a third-party tool. Once you include an externally managed taxonomy in your project, it becomes a dimension whose hierarchy is managed by the third-party tool that created it. In Developer Studio, you cannot add or remove dimension values from it. If you want to modify a dimension or its dimension values, you have to edit the taxonomy using the third-party tool and then update the taxonomy in your project.

It is important to clarify the difference between an externally managed taxonomy and an externally created dimension to determine which feature document is appropriate for your purposes. The two concepts are similar but have two important differences: externally managed taxonomies and externally created dimensions differ in how you include them in a Developer Studio project and in how Developer Studio treats them once they are part of a project. Use the table below to determine which one you are working with.

The following table compares an externally managed taxonomy and an externally created dimension:

How do you modify or update the hierarchy after it is in the project?

• Externally managed taxonomy: Any changes to the dimension must be made in the third-party tool. You then export the taxonomy from the tool, and Forge transforms the taxonomy and re-integrates the changes into your project.

• Externally created dimension: You generally do not update the source file for the hierarchy after you import it into your project. If you do update the file and re-import it, then any changes you made to the dimension using Developer Studio are discarded. After you import the hierarchy, you can modify a dimension just as if you had created it manually using Developer Studio.

How does Developer Studio manage the hierarchy?

• Externally managed taxonomy: The third-party tool that created the file retains ownership. The dimension is almost entirely read-only in the project. You cannot add or remove dimension values from the dimension. However, you can modify whether dimension values are inert and collapsible.

• Externally created dimension: After you import the file, Developer Studio takes full ownership of the dimension and its dimension values. You can modify any characteristics of the dimension and its dimension values.

How do you create the .xml file?

• Externally managed taxonomy: Created using a third-party tool.

• Externally created dimension: Created either directly in an .xml file or created using a third-party tool.

How do you include the file in a Developer Studio project?

• Externally managed taxonomy: Read into a pipeline using a dimension adapter with Format set to XML - Externally Managed. Forge transforms the taxonomy file into a dimension according to the .xslt file that you specify on the Transformer tab of the dimension adapter.

• Externally created dimension: By choosing Import External Dimension on the File menu. During import, Developer Studio creates internal dimensions and dimension values for each node in the file’s hierarchy. If you create the file using a third-party tool and any .xml transformation is necessary, you must transform the file outside the project before you choose Import External Dimension on the File menu. The file must conform to external_dimensions.dtd before you import it.

If you are working with an externally managed taxonomy, use this chapter. If you are working with externally created dimensions, see “Working with Externally Created Dimensions” on page 167.

An overview of the process to include an externally managed taxonomy in a Developer Studio project is as follows:

1. You build an externally managed taxonomy using a third-party tool. This document does not describe any third-party tools or procedures that you might use to perform this task.

2. You create an .xslt style sheet that instructs Forge how to transform the taxonomy into Endeca .xml that conforms to external_dimensions.dtd. This requirement is described below in XSLT and XML Requirements.

3. You configure your Developer Studio pipeline to perform the following tasks:

• Describe the location of an externally managed taxonomy and an .xslt style sheet with a dimension adapter.

• Transform an externally managed taxonomy into an externally managed dimension by running a baseline update.

• Add an externally managed dimension to the Dimensions view and the Dimension Values view.

After you finish the tasks listed above, you can perform additional pipeline configuration that uses the externally managed dimension, and then run a second baseline update to process and tag your Endeca records.

Note: You must set up and use the Endeca Manager in projects that incorporate an externally managed taxonomy.

XSLT and XML Requirements

In order to transform an externally managed taxonomy into an externally managed dimension, you have to create an .xslt style sheet that instructs Forge how to map the taxonomy .xml to Endeca .xml. The mapping in your .xslt style sheet and your resulting hierarchy must conform to the Endeca external_dimensions.dtd file.


Both the .xslt and .xml requirements are further described in the sections below.

XSLT Mapping

In order for Developer Studio to process the .xml from your externally managed taxonomy, you have to create an .xslt stylesheet that instructs Forge how to map the .xml elements in an externally managed taxonomy to Endeca-compatible .xml. Later in this document, you will configure the Transformer tab of a dimension adapter with the path to the .xslt style sheet and the path to the taxonomy .xml file, and then run a baseline update to transform the external taxonomy into an Endeca dimension.

The external_dimensions.dtd defines Endeca-compatible .xml to describe dimension hierarchies. This file is located in %ENDECA_ROOT%\version\conf\dtd on Windows and $ENDECA_ROOT/version/conf/dtd on UNIX. You can describe a dimension hierarchy using any of the three syntax options described in the following section.
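As an illustration, the following minimal style sheet is a sketch only: the source format (a taxonomy of category elements with label and code attributes) is hypothetical, so you must substitute the element and attribute names that your third-party tool actually exports. The output uses the node syntax shown later in this chapter; the external_dimensions wrapper element is assumed to be the root element defined by external_dimensions.dtd.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Wrap the transformed hierarchy in a single root element
       (assumed root element of external_dimensions.dtd). -->
  <xsl:template match="/">
    <external_dimensions>
      <xsl:apply-templates select="/taxonomy/category"/>
    </external_dimensions>
  </xsl:template>

  <!-- Hypothetical source element: each category becomes a node,
       and the nesting of the source hierarchy is preserved. -->
  <xsl:template match="category">
    <node name="{@label}" id="{@code}">
      <xsl:apply-templates select="category"/>
    </node>
  </xsl:template>
</xsl:stylesheet>
```

A style sheet of this shape is what you later reference on the Transformer tab of the dimension adapter.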

Note: You can also find additional information and examples of using the elements in external_dimensions.dtd in the Endeca XML Reference.

XML Syntax to Specify Dimension Hierarchy

The .xml elements available in external_dimensions.dtd allow a flexible .xml syntax to describe dimension hierarchy. There are three different syntax approaches you can choose from when building the hierarchy structure of your externally managed dimension. All three are supported by external_dimensions.dtd. Each approach provides a slightly different syntax structure to define a dimension and express the parent/child relationship among dimensions and dimension values. The three syntax choices are as follows:

• Use nested node elements within node elements.

• Use the parent attribute of a node to reference a parent’s node ID.

• Use the child element to reference a child's node ID.

You can use only one of the three approaches to describe a hierarchy within a single .xml file. In other words, do not mix different syntax structures within one file. Any node element without a parent node describes a new dimension. You can describe as many dimensions as necessary in a single .xml file.
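For example, a single file can describe two dimensions simply by including two top-level node elements. The following sketch uses the nested syntax; the names and id values are illustrative:

```xml
<node name="color" id="1">
  <node name="red" id="2"/>
  <node name="blue" id="3"/>
</node>
<node name="size" id="4">
  <node name="small" id="5"/>
  <node name="large" id="6"/>
</node>
```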

The following examples show each approach to building a dimension hierarchy. These examples are semantically equivalent: each describes the same dimension and child dimension values.

Example of Using Nested node Elements

This example shows nested dimension values red and blue within the dimension color.

<node name="color" id="1">
  <node name="red" id="2"/>
  <node name="blue" id="3"/>
</node>

Example of Using parent Attributes

This example shows the red and blue dimension values using the parent attribute. The value of the parent attribute references the ID for the dimension color.

<node name="color" id="1"/>
<node name="red" id="2" parent="1"/>
<node name="blue" id="3" parent="1"/>

Example of Using child Elements

This example uses child elements to indicate that red and blue are dimension values of the color dimension. The ID of each child element references the ID of the red and blue nodes.

<node name="color" id="1">
  <child id="2"/>
  <child id="3"/>
</node>
<node name="red" id="2"/>
<node name="blue" id="3"/>


Node ID Requirements and Identifier Management in Forge

When you transform the hierarchy structure from an external taxonomy, each node element in your dimension hierarchy must have an id attribute. Forge ensures that each identifier is unique across an Endeca implementation by creating a mapping between a node’s ID and an internal identifier that Forge creates.

This internal mapping ensures that Forge assigns the same identifier to a node from an external taxonomy each time the taxonomy is processed. For example, if you provide updated versions of a taxonomy file, Forge determines which dimension values map to dimension values from a previous version of the file according to the internal identifier. However, there is a scenario where Forge does not preserve the mapping between the id attribute and the internal identifier that Forge creates for the dimension value. This scenario occurs if you reorganize a dimension value to become a child of a different parent dimension. Reorganizing a dimension value within the same parent dimension does not affect the id mapping when Forge reprocesses updated files.

Depending on your requirements, you may choose to provide any of the following values for the id attribute:

• Name - If the name of a dimension value is what determines its identity, then the .xslt style sheet should fill the id attribute with the name.

• Path - If the path from the root node to the dimension value determines its identity, then the .xslt style sheet should put a value representing the path in the id attribute.

• Existing identifier - If a node already has an identifier, then that identifier can be used in the id attribute.

You can provide an arbitrary ID as long as the value is unique. If you are including multiple .xml files, the identifier must be unique across all files. As described above, Forge ensures that identifiers are unique across the system.
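For instance, a style sheet that uses the path option might emit ids like the following. The id values are purely illustrative; the DTD does not require any particular format, only that each id is unique:

```xml
<node name="color" id="color">
  <node name="red" id="color/red"/>
  <node name="blue" id="color/blue"/>
</node>
```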

Pipeline Configuration

The following sections describe the pipeline configuration requirements to incorporate an externally managed taxonomy into your Developer Studio project.

Integrating an Externally Managed Taxonomy

You use a dimension adapter to read in .xml from an externally managed taxonomy and transform it to an externally managed Endeca dimension. If necessary, you can import and transform multiple taxonomies by using a different dimension adapter for each taxonomy file.

To perform the taxonomy transformation, you configure a dimension adapter with the .xml file of the taxonomy and the .xslt style sheet that Forge uses to transform the taxonomy file's .xml elements. You then build the rest of your pipeline and run a baseline update. When the update runs, Forge transforms the taxonomy into a dimension that you can load and examine in the Dimensions view.

To integrate an externally managed taxonomy:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.

2. In the Pipeline Diagram editor, choose New > Dimension > Adapter. The Dimension Adapter editor displays.

3. In the Dimension Adapter Name text box, enter a unique name for the dimension adapter.

4. In the General tab, do the following:

• In the Direction frame, select Input.

• In the Format field, select XML - Externally Managed.

• In the URL field, enter the path to the source taxonomy file. This path can be absolute or relative to the location of your project’s Pipeline.epx file.

• Check Require Data if you want Forge to generate an error if the file does not exist or is empty.

5. In the Transformer tab, do the following:

• In the Type field, enter XSLT.

• In the URL field, specify the path to the .xslt file you created.

6. Click OK.

7. From the File menu, select Save.


8. If necessary, repeat steps 2 through 6 to include additional taxonomies.

9. Create a dimension server to provide a single point of reference for other pipeline components to access dimension information. For more information about dimension servers, see the Endeca Developer Studio Help.

Note: If you want to modify an externally managed dimension, see “Updating an Externally Managed Taxonomy in Your Pipeline” on page 190.

Transforming an Externally Managed Taxonomy

In order to transform your externally managed taxonomy into an Endeca dimension, you have to run a baseline update. Running the update allows Forge to transform the taxonomy and store a temporary copy of the resulting dimension in the Endeca Manager. After you run the update, you can then create a dimension in the Dimensions view according to “Loading an Externally Managed Dimension” below.

Note: The Endeca Manager must be running before you start a baseline update. Also, you must add any required pipeline components to your pipeline for the update to run. For example, you cannot run the update without a property mapper. However, you can temporarily add a default property mapper and later configure it with property and dimension mapping.


To transform an externally managed taxonomy:

1. Ensure you have sent the latest instance configuration to the Endeca Manager.

2. In the Endeca Manager toolbar, click Start Baseline Update.

Note: To reduce processing time for large source data sets, you may want to run the baseline update using the -n flag for Forge. (The -n flag controls the number of records processed in a pipeline, for example, -n 10 processes ten records.) You can specify the flag in the Forge field of the Endeca Manager Settings dialog box.

Loading an Externally Managed Dimension

After you transform an external taxonomy into an Endeca dimension, you can then load the dimension in the Dimensions view and add its dimension values to the Dimension Values view.

Rather than click New, as you would to manually create a dimension in Developer Studio, you instead click Discover in Dimensions view to add an externally managed dimension. Developer Studio discovers the dimension by reading in the dimension’s temporary file that Forge created when you ran the first baseline update. Next, you load the dimension values in the Dimension Values editor.

Note: Because the dimension values are externally managed, you cannot add or remove dimension values. You can however modify whether dimension values are inert or collapsible.


To load a dimension and its dimension values:

1. In the Project tab of Developer Studio, double-click Dimensions. The Dimensions view displays.

2. Click the Discover button to add the externally managed dimension to the Dimensions view. The dimension appears in the Dimensions view with its Type column set to Externally Managed.

3. In Dimensions view, select the externally managed dimension and click Values. The Dimension Values view appears with the root dimension value of the externally managed dimension displayed.

4. Select the root dimension value and click Load. The remaining dimension values display.

5. Repeat steps 3 to 4 for any additional externally managed taxonomies you integrated in your project.

Note: Most characteristics of an externally managed dimension and its dimension values are not modifiable. These characteristics either appear as unavailable or Developer Studio displays a message indicating what actions are possible. If you want to modify these characteristics of a dimension or its dimension values, you have to edit the source taxonomy by performing the tasks described in “Updating an Externally Managed Taxonomy in Your Pipeline” on page 190.

Running a Second Baseline Update

After loading dimension values and building the rest of your pipeline, you must run a second baseline update to process and tag your Endeca records.


The second baseline update performs the property and dimension mapping that could not be performed in the first baseline update, because at that point the externally managed dimensions had not yet been transformed and made available for mapping.

To run a second baseline update:

1. Ensure you have sent the latest instance configuration to the Endeca Manager.

2. In the Endeca Manager toolbar, click Start Baseline Update.

Updating an Externally Managed Taxonomy in Your Pipeline

If you want to modify an externally managed taxonomy and replace it with a newer version, you have to revise the taxonomy using the third-party tool that created it, and then repeat the process of incorporating the externally managed taxonomy into your pipeline as described in “Pipeline Configuration” on page 185.


Chapter 11

Classifying Documents with Stratify

This document describes how to integrate Stratify taxonomies and Stratify classification capabilities into a Developer Studio project. Incorporating Stratify into your project allows you to classify unstructured source data, for example, Web pages, .pdf documents, and Microsoft Word documents, for use in Endeca Guided Navigation applications.

Unstructured source data requires different processing than structured source data. Structured data, such as databases, .csv files, character-delimited files, fixed-width files, and so on, has name/value pairs that Endeca can translate into dimensions and Endeca properties.

Unstructured data, on the other hand, is not composed of name/value pairs that Endeca can translate into dimensions and Endeca properties. For unstructured data, you have to use tools like the Stratify Discovery System to evaluate the content of an unstructured document and assign the document a topic based on classification logic that you configure. In an Endeca pipeline, this topic becomes a property that can be used like any other property associated with an Endeca record; for example, it can be manipulated and mapped to dimensions or Endeca properties.


In the Stratify Discovery System, you use the Stratify Taxonomy Manager to build a taxonomy to organize your source data, and you use the Stratify Classification Server to classify unstructured source data against that taxonomy. Endeca Developer Studio provides the capability to include a Stratify taxonomy and transform it to an Endeca dimension. Endeca uses the results of Stratify's document classification to tag Endeca records (that is, your unstructured documents) with classification properties. After the records contain classification properties, you can map the properties to dimension values. The interaction between Endeca and the Stratify Discovery System is covered in more detail in “How Endeca and Stratify Classify Unstructured Documents” on page 196.

You integrate Stratify into your pipeline by adding a dimension adapter to transform the Stratify taxonomy, a Content Acquisition System (CAS) to crawl unstructured documents, and a record manipulator with a STRATIFY expression to access the Stratify Classification Server. For an overview of the process, see “Overview of the Integration Process” on page 199. You will find it helpful to read the Content Acquisition System chapter beginning on page 23 before integrating Stratify into your project.

Sections of This Document

This document contains the following main sections:

• Classifying Documents with Stratify

• Frequently Used Terms and Concepts


• How Endeca and Stratify Classify Unstructured Documents

• Overview of the Integration Process

• Required Stratify Tools

• Developing a Stratify Taxonomy

• Building a Taxonomy

• Exporting a Taxonomy

• Creating a Pipeline to Incorporate Stratify

• Creating a CAS

• Classifying Documents with Stratify Classification Server

• Adding a Property Mapper and Indexer Adapter

• Integrating a Stratify Taxonomy

• Running the First Baseline Update

• Loading a Dimension and its Dimension Values

• Mapping a Dimension Based on a Stratify Taxonomy

• Running the Second Baseline Update

• Updating a Taxonomy in Your Pipeline

Frequently Used Terms and Concepts

Endeca concepts and terms for record organization generally correspond to Stratify terms for document organization. For example, a Stratify taxonomy contains topics into which source documents are organized; an Endeca dimension contains dimension values into which Endeca records are organized. (Remember that Endeca records are based on source documents.) Both sets of concepts provide a similar framework to describe a way of hierarchically organizing information.

There are several frequently used terms in this document:

• Structured data is source data based on name/value pairs. Common example formats that provide this structure include databases, .csv files, character-delimited files, fixed-width files and so on. For example, this pairing might occur in source data as Color/Red, Price/$8.00, and Size/Medium. Endeca can translate these properties into dimensions and Endeca properties for use in search and Guided Navigation.

• Unstructured data is source data that does not have name/value pairs. Common examples of unstructured data include MS Word documents, .pdf files, Web pages, and so on. If you have unstructured data, applications such as Stratify can examine the document and classify the document into topics based on logic that you configure. Endeca can use the document-to-topic classifications to create name/value properties that can be manipulated and mapped for Guided Navigation.

• Taxonomies provide a logical hierarchy to organize data, typically based on a theme. A taxonomy has a root and topics. Topics provide sub-grouping for the theme. For example, in a taxonomy whose root theme is Health Care, topics may include Physical Health and Emotional Health. Respective sub-topics may include Immunizations and Depression. Source documents are organized in a taxonomy according to a classification model (described below). After you import a taxonomy into your pipeline, it becomes an Endeca dimension.

• Taxonomy topics provide sub-grouping organization in a Stratify taxonomy. Think of topics as nodes in a taxonomy structure. Once you import a taxonomy into your pipeline, the topics in a Stratify taxonomy become the dimension values of a single Endeca dimension.

• Externally managed taxonomies are logical hierarchies to organize your data that are built and managed by a third-party tool and transformed by Developer Studio into a dimension. A Stratify taxonomy is an example of an externally managed taxonomy that you can create to assist in classifying unstructured source documents. The Stratify Classification Server manages a published taxonomy to perform classification tasks. If you want to modify the dimension or dimension values, you have to edit the taxonomy and taxonomy topics in the Taxonomy Manager, publish the taxonomy to the Stratify Classification Server, export the .xml, and integrate the .xml into Developer Studio.

• Classification model describes the classification method used to organize source documents into topics. Classification models may use any of the following methods, either individually or in combination: statistical, Boolean, keyword, and source rules. The Stratify Classification Server uses both the taxonomy and its classification model to classify unstructured source documents.
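To connect these terms to the previous chapter: once transformed into an Endeca dimension, the Health Care taxonomy described above could be represented in the node syntax of external_dimensions.dtd as follows. The id values are illustrative, and in practice Forge performs this transformation for you:

```xml
<node name="Health Care" id="1">
  <node name="Physical Health" id="2">
    <node name="Immunizations" id="3"/>
  </node>
  <node name="Emotional Health" id="4">
    <node name="Depression" id="5"/>
  </node>
</node>
```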

How Endeca and Stratify Classify Unstructured Documents

Endeca components and Stratify components communicate closely during record processing to classify each unstructured source document. In particular, a record manipulator using a STRATIFY expression interacts with a Stratify Classification Server to classify each record as it is processed through the pipeline. A simplified summary of the interaction is as follows: Forge crawls unstructured source documents, hands them off to Stratify to classify them, and then Forge appends classification properties to the record for the corresponding source document. You map the Stratify properties to a dimension created from the Stratify taxonomy.

The illustration below shows the interaction between Endeca components and Stratify components in greater detail. There are three kinds of flow in the diagram:

• URLs flow from the spider to the record adapter (a record adapter that uses the Document format).

• Documents flow to the indexer adapter and get turned into Endeca records.

• The Stratify taxonomy is published to the Stratify Classification Server, exported from the Taxonomy Manager as .xml, and transformed in the pipeline as a dimension. Strictly speaking, this step is not part of the record processing flow. This step must be performed only once before you run the pipeline.


When Forge executes this pipeline using Developer Studio, the flow of URLs and records is as follows:

1. The terminating component (indexer adapter) requests the next record from its record source (property mapper). At this point, none of the pipeline components between the indexer adapter and the record adapter has records to process yet.

2. When the request for the next record reaches the record adapter, the record adapter asks the spider for the next URL it is to retrieve. (On the first iteration through the URL processing loop, this URL is the root URL configured on the Root URL tab of the Spider editor.)

3. Based on the URL that the spider provides, the record adapter creates a record containing the URL and a limited set of metadata.

4. The created record then flows down to the first record manipulator where the following takes place:

• The document associated with the URL is fetched (using the RETRIEVE_URL expression).

• Content (searchable text) is extracted from the document using the CONVERTTOTEXT or PARSE_DOC expression. Any URLs in the document are also extracted for additional crawling.

5. The record then moves to the spider where additional URLs from the document (those extracted in the record manipulator) are queued for crawling.

6. The created record then flows down to the second record manipulator where the following takes place:

• The STRATIFY expression requests that the Stratify Classification Server classify each document. Forge sends the document as an attachment to a Stratify Classification Server.

• The Stratify Classification Server examines the document, including the document's structure, and classifies it according to the classification model you developed in the Stratify Taxonomy Manager.

• The Stratify Classification Server then replies to Forge with a classification response that indicates what properties to append to the record.

7. The property mapper performs source property to dimension and source property to Endeca property mapping.

8. The indexer adapter receives the record and writes it out to disk.

The process repeats until there are no URLs in the URL queue maintained by the spider.

Overview of the Integration Process

There are two main phases to integrating a Stratify taxonomy and Stratify classification capabilities into a Developer Studio project. You perform the first phase using Stratify tools. You perform the second phase using Endeca tools.

In phase one you develop a Stratify taxonomy by performing the following tasks:

1. Build and validate the taxonomy, including its classification model.

2. Publish the taxonomy and classification model to the Stratify Classification Server.

3. Export the taxonomy as .xml.


See “Developing a Stratify Taxonomy” on page 201 for details about each step listed above.

In phase two you create a Developer Studio pipeline to incorporate the Stratify taxonomy and access the Stratify Classification Server by performing the following tasks:

1. Create a CAS.

2. Create a record manipulator to classify documents with the Stratify Classification Server.

3. Add a property mapper and indexer adapter.

4. Integrate the Stratify taxonomy.

5. Run a baseline update to transform the taxonomy.

6. Load a dimension and its dimension values.

7. Map the dimension in a property mapper.

8. Run another baseline update.

See “Creating a Pipeline to Incorporate Stratify” on page 204 for details about each step listed above.

Required Stratify Tools

In addition to installing and configuring the Endeca Navigation Platform, Endeca Developer Studio, and the Endeca Manager, you must also install and configure a Stratify Discovery System 3.0 before you begin the integration process. At a minimum, this installation requires the following two components of the Stratify Discovery System:


• Stratify Taxonomy Manager—manages the total taxonomy lifecycle, using a combination of automated technologies, advanced analytics and human review to create, define, test, publish and refine taxonomies.

• Stratify Classification Server—stores the taxonomy and uses it to classify documents in response to requests from Forge.

You do not need to use either Stratify Analytics or Stratify Notification Server as part of the integration.

Developing a Stratify Taxonomy

Before you can build an integrated project you first need to develop your Stratify taxonomy. Development includes building the taxonomy itself, publishing it and its classification model to the Stratify Classification Server, and then exporting the taxonomy as .xml.

You perform all of the steps involved in building and exporting a taxonomy using Stratify tools. This document describes the high-level tasks of building a taxonomy and some of the detailed procedures for exporting a taxonomy. Refer to the User’s Guide for Stratify Discovery System 3.0 for additional details about how to perform each task.


Building a Taxonomy

The following steps outline the high-level process to build and publish a Stratify taxonomy using the Taxonomy Manager. For complete documentation of these steps, see the User’s Guide for Stratify Discovery System 3.0.

1. Build a taxonomy using the Taxonomy Manager. This step includes creating the taxonomy in any of the following ways: manually, generating automatically, or importing from another external source. This step may also include editing the taxonomy as necessary during development.

Note: When creating a Stratify taxonomy, the name you provide becomes the name of the root dimension value and the names of the taxonomy’s topics become the names of the dimension values beneath the root.

Note: Also, if you are building multiple Stratify taxonomies, the identifier for each Stratify topic must be unique across all taxonomy files.

2. Define the classification model for the taxonomy. You can employ any of the four Stratify classification models: statistical, keyword, source rules, or Boolean.

3. Compile the model of the taxonomy, if you created a statistical classification model. This step is not necessary for the other models.

4. Test the taxonomy’s classification model.

5. Publish the taxonomy to the Stratify Classification Server and version any other published taxonomy if necessary.


Exporting a Taxonomy

After you have built and published a taxonomy to the Stratify Classification Server, you must export the taxonomy as an .xml file. You will later include this .xml file in Developer Studio and transform it to an Endeca dimension to use during mapping.

Note: This procedure describes exporting a taxonomy from the Stratify Taxonomy Manager. If you have additional questions about the procedure, see the User’s Guide for Stratify Discovery System 3.0.

To export a taxonomy:

1. Start Stratify Taxonomy Manager.

2. From the View menu, select Taxonomies.

3. In the Select Taxonomies to View dialog box, check the taxonomy that you want to export.

4. Click OK.

5. In the taxonomy tree, right-click the published taxonomy.

6. Select Export from the submenu.

7. In the Export dialog box, perform the following steps:

a. In the Topic field, confirm that you selected the correct taxonomy.

b. In the To File field, provide a path and file name for the taxonomy you want to export.

c. Select the Export Children option.


d. Under Advanced Options, select All.

8. Click OK.

9. If necessary, repeat steps 2 through 8 to export multiple taxonomies.

Creating a Pipeline to Incorporate Stratify

A Developer Studio project uses CAS to crawl unstructured source documents. Next, a record manipulator accesses the Stratify Classification Server, which classifies the Endeca record for each source document. The following sections of this document describe how to create and configure a pipeline that includes Stratify taxonomies and classification capabilities.

There are two specific additions to a typical CAS pipeline that allow you to incorporate Stratify taxonomies and classification capabilities:

• A dimension adapter is required to read in and transform the Stratify taxonomy.

• A STRATIFY expression is required after the RETRIEVE_URL and text extraction expressions (either PARSE_DOC or CONVERTTOTEXT) to communicate with the Stratify Classification Server and classify documents.

Because of the similarity between a CAS pipeline and a CAS pipeline with Stratify, this document relies on cross-references to the Content Acquisition System chapter. In cases where the two CAS pipelines are exactly the same, this document references the CAS procedure. In cases where a typical CAS pipeline differs from a CAS pipeline that includes Stratify, this document notes those differences.

This diagram shows an example pipeline similar to the one you create in this document.


Creating a CAS

Begin your pipeline by creating a CAS that crawls your unstructured documents. Most of the steps to create a CAS pipeline that includes Stratify are common to a typical CAS pipeline. The pipeline differs in the components that follow the spider. After you finish the procedure in this section, go on to “Classifying Documents with Stratify Classification Server” on page 207.

To create a CAS:

1. Create a record adapter to read source documents (required). See “Creating a Record Adapter to Read Documents” on page 41.

2. Create a record manipulator (required). See “Creating a Record Manipulator” on page 43 for this task and also for the following bullet items:

a. Add a RETRIEVE_URL expression (required).

b. Convert documents to text (required).

c. Identify the language of the documents (optional).

d. Remove document body properties (optional but recommended).

3. Create a spider (required). See “Creating a Spider” on page 55.


Classifying Documents with Stratify Classification Server

A STRATIFY expression is required after the RETRIEVE_URL and text extraction expressions (either PARSE_DOC or CONVERTTOTEXT). The STRATIFY expression identifies a Stratify Classification Server that classifies the unstructured document associated with an Endeca record.

For the sake of pipeline clarity, we recommend that you add the STRATIFY expression in its own record manipulator that follows the spider component. The recommended position of a record manipulator containing the STRATIFY expression is after the spider component and before the property mapper.

If you have more than one Stratify Classification Server in your environment, then you need one STRATIFY expression to specify the host, port, hierarchy ID, and so on for each server. Typically, a single taxonomy is published to a single Stratify Classification Server. Note, however, that you can publish multiple taxonomies to a single Stratify Classification Server.

Record manipulator with STRATIFY expression


To add a STRATIFY expression to a record manipulator:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.

2. In the Pipeline Diagram editor, choose New > Record > Manipulator. The Record Manipulator editor displays.

3. Click OK.

4. Double-click the record manipulator.

5. Add a STRATIFY expression as shown in the example below. If necessary, add additional STRATIFY expressions for each Stratify Classification Server in your environment.

Nested expression nodes within STRATIFY configure how it functions. The following expression nodes are required:

• STRATIFY_HOST – The machine name or IP address of the Stratify Classification Server.

• STRATIFY_PORT – The port on which the Stratify Classification Server listens for requests from Forge.

• HIERARCHY_ID – The identifier of a Stratify classification model. To determine the VALUE of HIERARCHY_ID:

1. Navigate to the working directory of the Stratify Classification Server that contains your classification model and taxonomy files. This directory is typically located at <Stratify Install Directory>\ClassificationServer\ClassificationServer\ClassificationServerWorkDir\Taxonomy-N, where N is the number of the directory that contains the classification model you want to use with your Endeca project. (Your environment may have multiple \Taxonomy-N directories, each containing different classification model and taxonomy files.)

2. Note the number at the end of the \Taxonomy-N directory. This number is the value of HIERARCHY_ID. For example, if the classification model you want to use is stored in ...\Taxonomy-2, then HIERARCHY_ID should have VALUE="2". If you published more than one taxonomy to your Stratify Classification Server, include a HIERARCHY_ID node for each taxonomy.

• IDENTIFIER_PROP_NAME – The Endeca identifier for the record being processed. The default is Endeca.Identifier.

• BODY_PROP_NAME – The property that the Stratify Classification Server examines to classify the document. The default property is Endeca.Document.Body. You can provide either Endeca.Document.Body or Endeca.Document.Text. However, specifying Endeca.Document.Body provides better classification because Forge can send the document to the Stratify Classification Server as an attachment, and the Stratify Classification Server can use the attachment to determine structural information of the document that aids in classification. If you specify Endeca.Document.Text, Forge sends the converted text of the document without any of its structural information.

Note: It is not necessary to provide attribute values for the LABEL or URL attributes.
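The expression itself is configured in the pipeline XML. The following sketch shows one plausible shape for a STRATIFY expression; the host, port, and hierarchy ID values are placeholder assumptions, and the exact attributes of your expression may differ:

```xml
<!-- Sketch only: host, port, and hierarchy ID are placeholder values. -->
<EXPRESSION NAME="STRATIFY" TYPE="VOID" LABEL="" URL="">
  <EXPRNODE NAME="STRATIFY_HOST" VALUE="stratify.example.com"/>
  <EXPRNODE NAME="STRATIFY_PORT" VALUE="8200"/>
  <EXPRNODE NAME="HIERARCHY_ID" VALUE="15"/>
  <EXPRNODE NAME="IDENTIFIER_PROP_NAME" VALUE="Endeca.Identifier"/>
  <EXPRNODE NAME="BODY_PROP_NAME" VALUE="Endeca.Document.Body"/>
</EXPRESSION>
```

Note that, per the required expression nodes listed above, only the nested EXPRNODE values carry configuration; the LABEL and URL attributes may be left empty.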


For general information about how to create expressions, see the Endeca Developer Studio Help.

When you run the pipeline, here is how the classification process takes place:

1. For each record that passes through the record manipulator, the STRATIFY expression requests that the Stratify Classification Server classify the document indicated by Endeca.Document.Body.

2. Forge sends the document as an attachment to the Stratify Classification Server.

3. The Stratify Classification Server examines the document, including the document’s structure, and classifies it according to the classification model you developed in Taxonomy Manager. (You indicate the classification model in the HIERARCHY_ID expression node.)

4. The Classification Server then replies to Forge with a classification response that indicates what property values to append to the record.

The STRATIFY expression generates at least the following properties to append to each record:

• Endeca.Stratify.Topic.HID<hierarchy ID>=<topic ID>

This property corresponds to the ID value of a topic in your published Stratify taxonomy. Each topic in your taxonomy has an ID value assigned by Stratify. The value of <hierarchy ID> corresponds to your HIERARCHY_ID expression node. For example, if an Eating Disorders topic has a topic ID of 2097222 in a health care taxonomy whose hierarchy ID is 15, then the Endeca property is Endeca.Stratify.Topic.HID15="2097222".

• Endeca.Stratify.Topic.Name.HID<hierarchy ID>.TID<topic ID>=<topic name>

This property corresponds to a topic name from your published Stratify taxonomy for its corresponding topic ID. For example, for the Eating Disorders topic in the health care taxonomy mentioned earlier, this property is Endeca.Stratify.Topic.Name.HID15.TID2097222="Eating Disorders".

• Endeca.Stratify.Topic.Score.HID<hierarchy ID>.TID<topic ID>=<score>

This property indicates the classification score between an unstructured document and the topic into which it has been classified. The value of <score> is a percentage expressed as a value between zero and one. Zero indicates the lowest classification score (0%), and one indicates the highest score (100%). You can use this property to remove records from your application that have a low classification score, for example, Endeca.Stratify.Topic.Score.HID15.TID2097222="0.719380021095276".
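As an illustration of how these generated properties fit together (this is not part of Forge; the function name and score cutoff are arbitrary), here is a short Python sketch that pairs each score property with its topic name and applies a low-score cutoff:

```python
import re

# Example properties as generated by the STRATIFY expression (values taken
# from the health care taxonomy example above).
record = {
    "Endeca.Stratify.Topic.HID15": "2097222",
    "Endeca.Stratify.Topic.Name.HID15.TID2097222": "Eating Disorders",
    "Endeca.Stratify.Topic.Score.HID15.TID2097222": "0.719380021095276",
}

def classifications(props, min_score=0.5):
    """Yield (hierarchy_id, topic_id, name, score) tuples above a score cutoff."""
    for key, value in props.items():
        m = re.match(r"Endeca\.Stratify\.Topic\.Score\.HID(\d+)\.TID(\d+)$", key)
        if not m:
            continue
        hid, tid = m.groups()
        score = float(value)
        if score >= min_score:
            # Look up the matching topic-name property for the same HID/TID pair.
            name = props.get("Endeca.Stratify.Topic.Name.HID%s.TID%s" % (hid, tid))
            yield hid, tid, name, score

print(list(classifications(record)))
# → [('15', '2097222', 'Eating Disorders', 0.719380021095276)]
```

A downstream filter like this is one way to discard records whose best classification score falls below a threshold you choose.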

Adding a Property Mapper and Indexer Adapter

Although the pipeline does not have any external dimensions available for mapping yet, because you have not yet integrated the Stratify taxonomy, you need to add a property mapper to the pipeline to successfully run the first baseline update. You will come back to this property mapper later in the process to add dimension mapping for your external taxonomy.

For now, create an empty property mapper and select the upstream record manipulator as its record source. Also add an indexer adapter to your project as you would for a basic pipeline. There are no special configuration requirements for the indexer adapter in a pipeline that integrates Stratify. See the Developer Studio Help for more information about creating a property mapper and indexer adapter.

Integrating a Stratify Taxonomy

You use a dimension adapter to read in the Stratify taxonomy XML and transform it to Endeca-compatible XML. If necessary, you can include and transform multiple Stratify taxonomies by using a separate dimension adapter for each taxonomy file.

In the Dimension Adapter editor, you specify the XML file of the taxonomy that Forge transforms. To perform the taxonomy transformation, you have to run a baseline update. When the update runs, Forge transforms the taxonomy into an externally managed Endeca dimension that you can load in the Dimensions view.
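Forge performs this transformation internally during the baseline update. Purely to illustrate the structural mapping, here is a Python sketch that mirrors a topic tree as a nested tree of (id, name, children) tuples, analogous to dimension values. The XML fragment is invented for the example; a real Stratify export contains additional elements and attributes:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an exported Stratify taxonomy.
TAXONOMY = """
<topic id="100" name="Health">
  <topic id="2097222" name="Eating Disorders"/>
  <topic id="2097300" name="Nutrition"/>
</topic>
"""

def to_dimension(topic):
    """Mirror a Stratify topic tree as nested (id, name, children) tuples,
    analogous to the dimension value tree Forge builds."""
    return (topic.get("id"), topic.get("name"),
            [to_dimension(child) for child in topic.findall("topic")])

root = ET.fromstring(TAXONOMY)
print(to_dimension(root))
# → ('100', 'Health', [('2097222', 'Eating Disorders', []), ('2097300', 'Nutrition', [])])
```

The point of the sketch is only that the topic hierarchy and the resulting dimension value hierarchy have the same shape, with the Stratify topic id carried along at every node.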

After you integrate the Stratify taxonomy, the newly transformed dimension is very similar to a dimension created using Developer Studio. However, because the dimension is externally managed by Stratify, you cannot edit the dimension or its dimension values using Developer Studio. If you want to modify an externally managed dimension, you have to repeat the process of integrating Stratify into your pipeline. For more information, see “Updating a Taxonomy in Your Pipeline” on page 223.

To integrate a Stratify taxonomy:

1. In the Project tab of Developer Studio, double-click Pipeline Diagram.

2. In the Pipeline Diagram editor, choose New > Dimension > Adapter. The Dimension Adapter editor displays.

3. In the Dimension Adapter Name text box, enter a unique name for the dimension adapter.

4. In the General tab, do the following:

• In the Direction frame, select Input.

• In the Format field, select Stratify.

• In the URL field, enter the path to the source taxonomy file that you exported in “Exporting a Taxonomy” on page 203. This path must be relative to the location of your project’s Pipeline.epx file.

• Check Require data if you want Forge to generate an error if the file does not exist or is empty.

5. Click OK.

6. From the File menu, select Save.

7. If necessary, repeat steps 2 through 6 to include additional Stratify taxonomies.


8. Create a dimension server to provide a single point of reference for other pipeline components to access dimension information from multiple dimension adapters. See the Endeca Developer Studio Help for more information.

Running the First Baseline Update

In order to transform the source .xml from your Stratify taxonomy into an Endeca dimension, you have to run a baseline update. Running the update allows Forge to transform the taxonomy and store a copy of the resulting dimensions.xml in the Endeca Manager. After you run the update, you can then load the dimension as described in “Loading a Dimension and its Dimension Values” on page 216. When you load the dimension, Developer Studio reads the dimensions.xml file and displays the dimension values in the Dimension Values view.

Note: The Endeca Manager must be running before you start a baseline update.

To transform the taxonomy:

1. Place a copy of your source taxonomy file into the directory you specified in the Incoming Directory field of the Web Studio’s Provisioning System page.

2. Ensure you have sent the latest instance configuration to the Endeca Manager.

3. In the Endeca Manager toolbar, click Start Baseline Update.


Note: To reduce processing time for large source data sets, you may want to run the baseline update using the -n flag for Forge. (The -n flag controls the number of records processed in a pipeline.) You can specify the flag in the Forge field of the Endeca Manager Settings dialog box.

Loading a Dimension and its Dimension Values

After you transform a Stratify taxonomy into an Endeca dimension, you can then add the dimension to the Dimensions view and load the dimension values in the Dimension Values view.

Rather than choose New, as you would to create a standard dimension in Developer Studio, you instead click Discover in the Dimensions view to see an externally managed dimension. Developer Studio discovers the dimension by reading in the dimensions.xml file created when you ran the first baseline update. Next, you click Load in the Dimension Values view and Forge reads in dimension value information from the dimensions.xml file and presents the dimension values in the Dimension Values view.

The core characteristics of an externally managed dimension and its dimension values are not modifiable in the Dimension Values view. These features appear as unavailable, or Developer Studio displays a message indicating what actions are possible. If you want to modify these features of a dimension or its dimension values, you have to edit them in the source taxonomy by performing the tasks described in “Updating a Taxonomy in Your Pipeline” on page 223. You can, however, modify whether dimension values are inert or collapsible.

Note: You must set up and use the Endeca Manager to view dimensions and dimension values in Developer Studio. If you choose to use Developer Studio in standalone mode for pipeline development, you can run Forge by hand to transform your taxonomy and create a dimensions.xml file containing the Endeca dimensions for your taxonomy. However, you will not be able to view the dimensions in the Dimensions view without the Endeca Manager.

To load a dimension based on a Stratify taxonomy:

1. In the Project tab of Developer Studio, double-click Dimensions. The Dimensions view displays.

2. Click the Discover button to display any externally managed dimensions that you imported in “Integrating a Stratify Taxonomy” on page 213. The dimension or dimensions appear in the Dimensions view with the value of their External Type column set to External.

3. In the Dimensions view, select the dimension based on a Stratify taxonomy and click Values. The Dimension Values view appears with the root dimension displayed.

4. Select the root dimension and click Load. The dimension values appear in the Dimension Values view with the value of their External Type column set to External.

5. Repeat steps 3-4 for any additional dimensions based on a Stratify taxonomy that you integrated into your project.


Here is an example of a Dimensions view with one dimension based on a Stratify taxonomy:

Here is an example of a Dimension Values view with the loaded dimension values:


About Synonym Values and Dimension Values

After you load the dimension values, you will notice that each dimension value has two synonyms. One synonym is the name of the dimension value used for display in an application, and the second synonym is an ID required for dimension value mapping. You should not modify these synonym values.

The synonym with the name of the dimension value has default settings where Search is checked and (Display) is enabled as shown in the following example. In this example, these settings allow an application user to search and navigate on the Eating Disorders dimension value.

The ID synonym has default settings on the Synonym editor where Classify is checked and (Display) is disabled, as shown in the following example. The Classify setting instructs Forge to use the ID synonym during record classification.

The ID synonym is based on a topic’s id element as shown in the Stratify taxonomy. In the example above, here is a portion of the topic element for the Eating Disorders topic in the Ask Alice taxonomy: <topic id="2097222" name="Eating Disorders" ...</topic>.

An ID synonym, based on a Stratify id, is not intended to provide an alternative way of describing or searching a dimension value in the way that synonyms created using Developer Studio are often used. The Endeca Navigation Platform uses the id synonym for dimension value tagging because an integer based on a Stratify id is guaranteed to be unique across other external dimensions, whereas the name of the dimension value is not guaranteed to be unique across other external dimensions.


The following process describes the role of id synonyms in dimension value mapping.

1. After you transform a taxonomy and load a dimension, each dimension value has an id synonym that comes from the topic id in the Stratify taxonomy.

2. After Stratify classifies the unstructured documents, Forge tags each record with properties that include the Stratify topic id and Stratify topic name. Remember that this topic ID is the same as the ID synonym.

In this example, Forge assigns the properties Endeca.Stratify.Topic.Name.HID15.TID2097222="Eating Disorders" and Endeca.Stratify.Topic.HID15="2097222" after Stratify classifies an unstructured document into the Eating Disorders topic.

3. A property mapper maps a source property named Endeca.Stratify.Topic.HID<hierarchy id> to a target dimension for a Stratify taxonomy.

In this example, Endeca.Stratify.Topic.HID15 is mapped to the AskAlice target dimension.

4. Forge uses the ID synonyms to classify records into dimension values during mapping.

In this example, any document that Forge tags with Endeca.Stratify.Topic.HID15="2097222" is also classified, via the ID synonym 2097222, into the Eating Disorders dimension value of the AskAlice dimension.
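The steps above can be sketched in Python. The dimension data is a hypothetical fragment standing in for the AskAlice dimension; Forge's actual mapping is configured in the property mapper, not written by hand:

```python
# Sketch of the mapping step: the ID synonym ties a Stratify topic id on the
# record to a dimension value. Keys are (hierarchy id, topic id) pairs.
id_synonyms = {
    ("15", "2097222"): "AskAlice > Eating Disorders",
}

PREFIX = "Endeca.Stratify.Topic.HID"

def map_record(props):
    """Return the dimension values a record is classified into."""
    tagged = []
    for key, value in props.items():
        suffix = key[len(PREFIX):]
        # Only plain Endeca.Stratify.Topic.HID<n> properties carry topic ids;
        # the Name.* and Score.* properties are excluded by the checks below.
        if key.startswith(PREFIX) and "." not in suffix:
            dim_value = id_synonyms.get((suffix, value))
            if dim_value:
                tagged.append(dim_value)
    return tagged

print(map_record({"Endeca.Stratify.Topic.HID15": "2097222"}))
# → ['AskAlice > Eating Disorders']
```

The lookup keyed on (hierarchy id, topic id) mirrors why the integer ID synonym, rather than the display name, is used for tagging: the pair is unique across external dimensions.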


Mapping a Dimension Based on a Stratify Taxonomy

Using Discover to add the dimension makes it available for explicit mapping as a target dimension. You map the source property Endeca.Stratify.Topic.HID<hierarchy ID> to your target dimension for the taxonomy. All topics within that hierarchy are mapped as dimension values.

If you are using more than one Stratify taxonomy in your pipeline, create one property-to-dimension mapping for each taxonomy. For more information about using a property mapper, see the Endeca Developer Studio Help.

Running the Second Baseline Update

After discovering and loading the dimension and mapping the dimension to a source property, you must run a second baseline update to process and tag your Endeca records.

The second baseline update performs the property and dimension mapping that could not be performed in the first baseline update, because at that point the Stratify taxonomy had not yet been transformed into an Endeca dimension and was not available for mapping.

To run a second baseline update:

1. Ensure you have sent the latest instance configuration to the Endeca Manager.

2. In the Endeca Manager toolbar, click Start Baseline Update.


Note: If you ran the first baseline update with the Forge -n flag, you should delete the flag before running the second baseline update.

Updating a Taxonomy in Your Pipeline

If you want to modify an externally managed dimension and replace it with a newer version, you have to revise the taxonomy in Stratify Taxonomy Manager, publish it to the Stratify Classification Server, include it in Developer Studio, and make the other necessary changes in Developer Studio. In effect, you have to repeat the process of integrating Stratify into your pipeline as described in this document. This process is necessary because the Stratify Classification Server must manage all taxonomy changes in order to properly classify documents during Forge’s record processing.

SECTION IV
Logging and Performance Features


Chapter 12

Forge Hierarchical Logging System

This section provides a brief introduction to the hierarchical logging system used by Forge. It covers the features and design rationale, as well as an outline of the log category hierarchy. The Forge hierarchical logging system combines cascading log levels, hierarchical message categories, runtime configuration, and the ability to direct log messages to multiple, arbitrary destinations.

Overview

The constituent components of a Forge pipeline tend to generate large log files. When all of these log messages are directed to a single log file, key messages are less easily accessed and are often overlooked amid the sheer volume of messages.

Providing a means to create logical message groupings and the capability to place these message groups in separate logs is the key to organizing this information so that it is easy for a user to locate the data of interest. While this functionality is the primary goal, the logging system was designed with several other important considerations in mind:


• Users should be able to organize log messages hierarchically in logical message groups.

• Users should be able to specify a verbosity level in log messages.

Log Levels and Message Categories

The hierarchical logging system provides the capability to create logical message groupings by specifying two dimensions for each log message:

• Log level, which indicates what types of messages will be logged.

• Message category, which defines the component from which the message comes.

Log Levels

The value of a log level is chosen from the following ordered list:

• DEBUG—debugging messages, to be used during development or in tracking down problems.

• VERBOSE—verbose messages, which give as much information as possible on an event.

• INFO—informational messages, which indicate information that may be of interest, such as internal application state changes and events. These do not indicate errors and processing should continue with expected output.


• STAT—status messages.

• WARN—warning messages, which indicate that processing will continue, but the component output may not be as expected.

• ERR—error messages, which indicate a deviation from normal processing.

• FTL—fatal messages, which indicate a serious problem that may cause the issuing component to stop functioning.

Each log level on the above list includes messages from all lower levels.
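Under the assumption that the list above is ordered from most to least verbose, the inclusion rule can be sketched in a few lines of Python (an illustration, not Forge code):

```python
# The ordered levels from the list above; a group configured at a given level
# accepts messages at that level and at every later (more severe) level.
LEVELS = ["DEBUG", "VERBOSE", "INFO", "STAT", "WARN", "ERR", "FTL"]

def accepts(configured_level, message_level):
    """True if a group at configured_level includes a message at message_level."""
    return LEVELS.index(message_level) >= LEVELS.index(configured_level)

print(accepts("INFO", "WARN"))   # WARN is included by an INFO group
print(accepts("INFO", "DEBUG"))  # DEBUG is not
```

So a group configured at DEBUG accepts everything, while a group configured at FTL accepts only fatal messages.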

Message Category

The value of a message category consists of a category name, which adheres to a hierarchical naming convention in which the dot (.) character specifies a child category. The category names currently used in Forge are defined below in the Forge Logging Hierarchy section.

A category is an ancestor of another category if its name followed by a dot is a prefix of the descendant category name. A category is a parent of a child category if there are no ancestors between itself and the descendant category.

For example, the category named Edf.Pipeline is a parent of the categories named Edf.Pipeline.RecordPipeline and Edf.Pipeline.DimensionPipeline. Similarly, Edf is a parent of Edf.Pipeline and an ancestor of both Edf.Pipeline.RecordPipeline.RecordServer and Edf.Pipeline.DimensionPipeline.DimensionServer.

This hierarchical naming convention defines a partial ordering of message categories using an inclusion relation. In the example above, the categories Edf.Pipeline.RecordPipeline and Edf.Pipeline.DimensionPipeline are sub-categories that are included within the Edf.Pipeline parent category. The Edf.Pipeline category includes log messages generated by all of its descendant categories. That is, any message in the Edf.Pipeline.RecordPipeline category also belongs to both the Edf.Pipeline category and the Edf category. Log messages bubble up to the root of the message category hierarchy (in our case, Edf).

The log level and message category dimensions allow a user of the hierarchical logging system to specify arbitrary groupings of messages, which can then be directed to specific output destinations, such as a file or one of the standard output streams (stdout, stderr). A message grouping consists of a verbosity level and a message category. As the ordered list of log levels above shows, a message grouping with verbosity level INFO would include all log messages whose log level is INFO, STAT, WARN, ERR, or FTL.

A message group’s category value specifies that the group include all messages from the specified message category and all of its sub-categories. Thus a message group with category value Edf.Pipeline (continuing with the example above) would include all messages within the categories Edf.Pipeline, Edf.Pipeline.RecordPipeline and Edf.Pipeline.DimensionPipeline.
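The ancestor and parent relations reduce to simple string operations on the dotted category names; a Python sketch (not Forge code):

```python
def is_ancestor(ancestor, category):
    """A category is an ancestor if its name plus a dot prefixes the other name."""
    return category.startswith(ancestor + ".")

def parent(category):
    """The parent drops the last dot-separated component; the root has none."""
    return category.rsplit(".", 1)[0] if "." in category else None

print(is_ancestor("Edf", "Edf.Pipeline.RecordPipeline.RecordServer"))  # True
print(parent("Edf.Pipeline.RecordPipeline"))  # Edf.Pipeline
```

Note that requiring the trailing dot prevents false matches such as treating Edf.Pipeline as an ancestor of a hypothetical Edf.PipelineExtra category.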


Log Appenders

Log appenders determine how log messages for a specific message group are handled. Each message group can have a set of appenders associated with it, each of which may handle the same message differently. For example, one appender might pipe log messages to stdout while another appender writes the same log messages to a file.

The logging system uses a configuration file named log.ini to map log appenders to message groups. Forge looks for the log.ini file in the same directory as the pipeline input file (which means that you cannot move the log.ini file to another location).

As an example, consider the following log.ini file:

VERSION=00.02
logger=Edf.Pipeline.RecordPipeline.Spider;class=stream;file;name=spider.log;level=INFO
logger=Edf.Pipeline.RecordPipeline.Spider.mySpider;class=stream;file;name=myspider.log;level=DEBUG

The first line specifies the version of the log.ini file format in use. The next two lines map message groups to log appenders.

In the first line, the command:

logger=Edf.Pipeline.RecordPipeline.Spider


specifies that the message group shall include all messages in the Edf.Pipeline.RecordPipeline.Spider category. The command:

level=INFO

specifies that the group should include only messages with verbosity level INFO or less. The remaining commands in the first line:

class=stream;file;name=spider.log

specify the behavior of the appender to which this message group should be sent. These commands indicate that the appender is a stream appender of type file with the filename spider.log.

The stream appender implementation supports multiple loggers appending to a single file. Similarly, the second line of the file specifies that all messages in the Edf.Pipeline.RecordPipeline.Spider.mySpider category with verbosity DEBUG or less should be appended to the myspider.log file.
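A minimal Python sketch of how such a line decomposes into settings (this is an illustration, not Forge's parser; bare keywords such as file become flags):

```python
def parse_logger_line(line):
    """Split a log.ini logger line into its category and appender settings."""
    settings = {"flags": []}
    for field in line.strip().split(";"):
        if "=" in field:
            key, _, value = field.partition("=")
            settings[key] = value
        else:
            settings["flags"].append(field)  # bare keywords such as 'file'
    return settings

line = "logger=Edf.Pipeline.RecordPipeline.Spider;class=stream;file;name=spider.log;level=INFO"
print(parse_logger_line(line))
```

Running this on the first logger line of the example file yields the category, the stream class, the file flag, the spider.log file name, and the INFO level, matching the prose description above.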

The log.ini file consists of a first line followed by logger and comment lines:

<first-line>
<line>

These components are defined recursively as follows:

first-line         → VERSION=<version-num>
version-num        → <digits>.<digits>
digits             → 0|1|2|3|4|5|6|7|8|9
line               → <logger>\n<line> | <comment>\n<line>
comment            → any sequence of characters starting with #
logger             → logger=<logger-name>;<appender-list>
logger-name        → any sequence of characters excluding '.', '\n' and '\r'
appender-list      → <appender-info> | <appender-info>;<appender-list>
appender-info      → class=<appender-class>[;<compression-info>][;<level-info>][;<format-info>]
appender-class     → stream;<appender-stream> | unique-stream;<appender-stream>
appender-stream    → file;name=<filename> | cout
compression-info   → compression=<compression-level>
level-info         → level=<log-level>
format-info        → format=<format-option>
compression-level  → -1|0|1|2|3|4|5|6|7|8|9
format-option      → simple | bare | bare-counted
log-level          → DEBUG | VERBOSE | INFO | STAT | WARN | ERR | FTL

Format of the Appenders

An interpretation of the above BNF results in the following commonly used keywords.

Appender Name    Parameter Description

VERSION    The version number of the log file.

logger    A category name, as described in the Message Category section above.

class    Specifies how messages should be handled when logged. Values are:

• stream—the individual messages are logged consecutively as they come in.

• unique-stream—duplicate messages are removed, so each message logged is unique.

format    Specifies the format of the messages. Values are:

• simple—the standard log message from Forge. It prepends the level (INF:, VER:, WRN:, and so on) and possibly the file name, line number, and a timestamp, depending on other options.

• bare—indicates that timestamps, file names, and log levels should not be included.

• bare-count—similar to bare, except that a message count is appended in the format ":::count" (where count is the number of messages with the same string). This format should generally be used with the unique-stream class.

file    Takes no parameter. If this appender is present, indicates that the messages should be written to a disk file.

name    The pathname of the log file. Used only if file is specified.

compression    Can be used if file is specified. Indicates the level of data compression in the log file. The default compression level is -1, which is the same as omitting this appender. Alternatively, you can specify levels 0 through 9.

level    The level of verbosity of the log messages. The value is one of the levels described in the Log Levels section above.

Endeca Confidential Forge Hierarchical Logging System

Configuring MustMatch Messages

When configuring dimension mapping, you can set the Match mode to Must Match. In this mode, resulting records are tagged with any matching dimension values; if none match, a warning message is issued.

To capture all MustMatch messages, use the following logger appender:

logger=Edf.Pipeline.Expression.MustMatch
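The appender syntax is a semicolon-delimited list in which bare keywords (such as file or cout) act as flags and everything else is a key=value pair. As an informal illustration only (this is not Endeca code, and the function name is invented for this sketch), a logger line can be decomposed like this:

```python
def parse_appender(line):
    """Split one log.ini logger line into its appender settings.

    Bare keywords such as 'file' or 'cout' become boolean flags;
    everything else is treated as a key=value pair.
    Illustrative sketch only, not part of the Endeca toolset.
    """
    settings = {}
    for part in line.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            settings[key] = value
        elif part:                      # flag-style appender, e.g. 'file'
            settings[part] = True
    return settings

example = ("logger=Edf.Pipeline.Expression.MustMatch;class=unique-stream;"
           "file;name=match.txt;level=DEBUG;format=bare")
print(parse_appender(example)["name"])  # match.txt
```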

Advanced Features Guide Endeca Confidential Chapter 12


For example, the following lines write all MustMatch messages to the match.txt and match.txt.stats files:

logger=Edf.Pipeline.Expression.MustMatch;class=unique-stream;file;name=match.txt;level=DEBUG;format=bare

logger=Edf.Pipeline.Expression.MustMatch;class=unique-stream;file;name=match.txt.stats;level=DEBUG;format=bare-counted

To capture MustMatch messages for a specific property, append the property name to the category name, as in the following examples:

logger=Edf.Pipeline.Expression.MustMatch.Color;class=unique-stream;file;name=colors.log;level=DEBUG;format=bare

logger=Edf.Pipeline.Expression.MustMatch.Color;class=unique-stream;file;name=colors.log.stats;level=DEBUG;format=bare-counted

Configuring the Dimension Server Match Count Log

A dimension server can be configured with a Match Count log. This log keeps track of how many times certain properties were matched during the auto-generation process.

You can add the following line to the log.ini file to log all MatchCount messages:

logger=Edf.DimensionPipeline.DimensionServer.DimServerName.MatchCount;class=stream;file;name=matchCount.log;level=DEBUG;format=bare

where DimServerName is the configured name of the dimension server and matchCount.log is the name of the file to which the MatchCount messages will be logged.
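For instance, if the dimension server were named MyDimServer (a hypothetical name chosen only for illustration), the line would read:

```ini
logger=Edf.DimensionPipeline.DimensionServer.MyDimServer.MatchCount;class=stream;file;name=matchCount.log;level=DEBUG;format=bare
```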

Reference log.ini File

The following log.ini file serves as a reference for users who want to customize the default behavior of log output in Forge. This file should serve as a reasonable default and should be customized further to account for the specific tasks for each individual pipeline. For example, different pipelines will have different components and component names for the message categories.

The set of available message categories is specified in the next section titled The Forge Logging Hierarchy. Categories starting with “my” refer to the component names specified in the pipeline. These must always be customized to match the specific name chosen within the pipeline.


VERSION=00.02

# log edf messages of level INFO or less to stdout
logger=Edf;class=stream;cout;level=INFO
# log all edf messages to a file named Edf.log
logger=Edf;class=stream;file;name=Edf.log;level=DEBUG
# log pipeline messages of level INFO or less to pipeline.log
logger=Edf.Pipeline;class=stream;file;name=pipeline.log;level=INFO

# log dimension pipeline messages of level INFO or less to dimensionpipeline.log
logger=Edf.Pipeline.DimensionPipeline;class=stream;file;name=dimensionpipeline.log;level=INFO
# log messages from ANY dimension adapter of level INFO or less to dimensionadapter.log
logger=Edf.Pipeline.DimensionPipeline.DimensionAdapter;class=stream;file;name=dimensionadapter.log;level=INFO
# log messages from the dimension adapter named myDimensionAdapter1 of level INFO or less to mydimensionadapter1.log
logger=Edf.Pipeline.DimensionPipeline.DimensionAdapter.myDimensionAdapter1;class=stream;file;name=mydimensionadapter1.log;level=INFO
# log messages from the dimension adapter named myDimensionAdapter2 of level VERBOSE or less to mydimensionadapter2.log, and compress the file
logger=Edf.Pipeline.DimensionPipeline.DimensionAdapter.myDimensionAdapter2;class=stream;file;name=mydimensionadapter2.log;compression=5;level=VERBOSE
# log record pipeline messages of level INFO or less to recordpipeline.log
logger=Edf.Pipeline.RecordPipeline;class=stream;file;name=recordpipeline.log;level=INFO
# log expression messages of level INFO or less to expression.log
logger=Edf.Pipeline.Expression;class=stream;file;name=expression.log;level=INFO

# log DimensionServer messages of level INFO or less to servers.log
logger=Edf.Pipeline.DimensionPipeline.DimensionServer;class=stream;file;name=servers.log;level=INFO


A Simple Reference log.ini file

The following simple log.ini file can be used as-is and requires no customization.

VERSION=00.02

# log edf messages of level INFO or less to stdout
logger=Edf;class=stream;cout;level=INFO
# log edf messages of level VERBOSE or less to a file named logs/edf_verbose.log
logger=Edf;class=stream;file;name=logs/edf_verbose.log;level=VERBOSE
# log edf messages of level WARN or less to a file named logs/edf_warn.log
logger=Edf;class=stream;file;name=logs/edf_warn.log;level=WARN

# log dimension pipeline messages of level INFO or less to logs/dimensionpipeline.log
logger=Edf.Pipeline.DimensionPipeline;class=stream;file;name=logs/dimensionpipeline.log;level=INFO

# log record pipeline messages of level INFO or less to logs/recordpipeline.log
logger=Edf.Pipeline.RecordPipeline;class=stream;file;name=logs/recordpipeline.log;level=INFO


The Forge Logging Hierarchy

The following outline shows the log category hierarchy currently used in Forge.

Note: Instance name refers to the name of a component as it is defined in a pipeline.

Edf
  Pipeline
    DimensionPipeline
      DimensionServer (instance name)
      DimensionAdapter (instance name)
    NavConfigPipeline
      NavConfigServer (instance name)
      NavConfigAssembler (instance name)
      NavConfigAdapter (instance name)
    RecordPipeline
      Assembler (instance name)
      RecordManipulator (instance name)
      RecordAdapter (instance name)
      RecordCache (instance name)
      IndexerAdapter (instance name)
      Spider (instance name)
        RobotExclusion
        URLFilter
      UpdateAdapter (instance name)
      RecordJoin
    Expression (instance name)
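A logger configured for a category also receives messages logged under any category beneath it in this hierarchy; that is why a logger on Edf captures everything, while a logger on Edf.Pipeline.Expression captures only expression messages. As a rough sketch of that prefix-matching rule (an illustration, not Endeca's implementation):

```python
def logger_receives(logger_category, message_category):
    """True if a logger bound to logger_category should receive a message
    logged under message_category, per hierarchical (prefix) matching.
    Illustrative sketch only."""
    return (message_category == logger_category or
            message_category.startswith(logger_category + "."))

# A logger on Edf.Pipeline sees messages from any component below it:
print(logger_receives("Edf.Pipeline",
                      "Edf.Pipeline.RecordPipeline.Spider.mySpider"))  # True
```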



Chapter 13

Using Multithreaded Mode

By default, and in most Endeca applications, the Navigation Engine runs in single-threaded mode. In this normal mode of operation, the Navigation Engine processes queries one at a time. Once the Navigation Engine starts processing a new query, it continues working on the query until it is completed.

While working on this query, the Navigation Engine does not work on or respond to other queries. In many cases, this simple method of execution is sufficient. Because most queries complete in a tiny fraction of a second, queries never have to wait long to be serviced, and the Navigation Engine appears immediately responsive at all times.

However, in some cases, this simple “one request at a time” single-threaded execution model does not adequately meet the hardware utilization and/or responsiveness requirements of the application. To address such situations, the Navigation Engine supports a multithreaded execution mode.


Applications that are candidates for multithreaded execution mode generally exhibit one or both of the following characteristics:

• Large memory footprint with high throughput—The Navigation Engine relies on in-memory index structures to provide sub-second responses to complex queries. As the scale of application data increases, so does the memory required to host a single instance of the Navigation Engine. Throughput scalability of the Navigation Engine is generally achieved by running multiple independent instances. For example, if a single Navigation Engine can service 20 operations per second on a given data set and query load but site traffic is 60 operations per second, then three Navigation Engine instances should be run (each on its own CPU) to ensure sufficient application throughput.

For applications with small-to-medium data scale, the cost of hardware to service additional load is reasonable, as each additional CPU need only be coupled with a moderate amount of RAM. But for applications with large-data scale, each additional CPU would need to be configured with up to 4GB of RAM in single-threaded mode, which can play a significant role in hardware cost.

Multithreaded execution mode enables more efficient utilization of RAM via SMP configurations. For example, if the data scale requires 4GB of RAM, and query throughput requires four CPUs, multithreaded execution allows the site to be hosted on a single quad-processor machine with 4GB of RAM, rather than


a more costly option such as four single-processor machines, each with 4GB of RAM. In addition to reduced hardware costs, this approach simplifies system management and network architecture, and reduces the hardware hosting space required.

• Long-running queries—For applications that rely on commonly-used Navigation Engine features, the vast majority of queries complete in a fraction of a second. This allows the Navigation Engine to remain responsive at all times. However, some applications that make use of more advanced features (such as computing complex aggregate analytics) will encounter longer running queries. For such applications, multithreaded mode allows the Navigation Engine to remain responsive while working on long-running queries.

Understanding Multithreaded Mode

In multithreaded mode, the Navigation Engine is started with a pool of worker threads. These threads represent sequences of program execution managed and scheduled by the operating system and hosted within the Navigation Engine process.

Each worker thread acts like an independent Navigation Engine, processing queries one at a time. But the threads share data, memory, and the server network port. Essentially, this allows a multithreaded Navigation Engine with N threads to appear as a single Navigation Engine process that can work on N queries at a time. Each of the


independent worker threads can run on independent CPUs, allowing a single multithreaded Navigation Engine to make use of multi-processor hardware.

Multiple threads can also share a CPU, allowing a multithreaded Navigation Engine running on a single-CPU host to remain responsive while long-running queries are handled.

Costs of Multithreaded Mode

Although the benefits of threaded mode are critical in certain applications, multithreaded mode is not without costs, and should not be engaged in situations where it is unnecessary. Because the worker threads in multithreaded mode share data, memory, and other resources, they must synchronize their execution using mechanisms such as OS-supported locks.

These synchronization operations introduce runtime overhead that is avoided in the single-threaded case. Because of this, a multithreaded Navigation Engine typically provides slightly lower performance than multiple independent single-threaded Navigation Engine processes. Performance is discussed in more detail below; as a general rule, if an application does not exhibit one or both of the characteristics described at the start of this chapter, single-threaded mode is the recommended approach.


Configuration for Multithreaded Mode

By default, the Navigation Engine runs in single-threaded mode. To enable multithreaded mode, specify the --threads option, along with the desired number of worker threads, when starting your Navigation Engine (Dgraph).

For example:

--threads 4

will start the Navigation Engine in multithreaded mode with 4 worker threads.

Multithreaded Navigation Engine Performance

The performance possible with a multithreaded Navigation Engine process is a function of a number of factors, such as:

• Base, single-threaded performance, given the application data and query profile

• Number of CPUs on the host system

• Query characteristics

• Host operating system

Generally, on a host system with N CPUs, where a single non-threaded Navigation Engine process can serve K operations/second of query load, N or more


independent Navigation Engine processes will serve somewhat less than N times K, commonly in the 80-90% utilization range. In other words, given the base single-instance performance K, the expected N-processor performance is U × K × N, where 0.8 ≤ U ≤ 0.9.

The expected performance for one multithreaded Navigation Engine on an N-processor machine is similar, but generally somewhat lower. In this case, the expected performance is given by the same formula, with utilization in the 60% to 80% range (0.6 ≤ U ≤ 0.8).

For example, if a single Navigation Engine provides 20 ops/sec on a given load, running two Navigation Engines on a dual-processor machine may provide around 36 ops/sec (U=90%, K=20, N=2). Running the same application with a single multithreaded Navigation Engine may provide 32 ops/sec (U=80%, K=20, N=2).
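The arithmetic in this example can be sketched as follows (a back-of-the-envelope helper written for this guide, not part of any Endeca tooling):

```python
def expected_throughput(k, n, u):
    """Expected ops/sec for N engine instances (or worker threads) at
    utilization U, given single-instance throughput K: U * K * N.
    Illustrative sketch only."""
    return u * k * n

# Independent single-threaded processes: utilization around 0.9
print(expected_throughput(k=20, n=2, u=0.9))  # 36.0
# One multithreaded engine: utilization around 0.8
print(expected_throughput(k=20, n=2, u=0.8))  # 32.0
```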

Application Query Characteristics

Actual utilization depends on a number of factors. One important factor is the types of queries commonly executed in the application. While most query operations (such as basic data navigation, merchandising, or sorting) provide good concurrency, and hence high utilization, other operations require costly thread synchronization and reduce performance. For example, spelling correction relies on a non-thread-safe spelling library, which requires worker-thread synchronization. Thus, if most queries in an application require spelling correction,


processor utilization is likely to be on the lower end of the expected range.

Thread Pool Size and OS Platform

Other factors that impact performance and processor utilization are the size of the thread pool and the host operating system type. For example, on the Solaris operating system, which provides an efficient hierarchical threads system, the Navigation Engine will exhibit little decrease in performance at higher thread pool sizes. On other systems, such as Windows and Linux, performance will degrade at larger thread pool sizes.

In general, we recommend using one to four threads per processor for good performance in most cases. The actual optimal number of threads for a given application depends on many factors, and is best determined through experimental performance measurements using expected query load on production data.

The following section describes detailed issues for specific platforms.

Hyperthreaded Intel Processors

Hyperthreaded Intel processors appear at the application level as two CPUs each. For example, a host with two hyperthreaded physical processors appears to applications as a machine with four normal processors. Despite this virtual view, these apparent processors share resources, and do not provide the full performance of true


independent processors. A single hyperthreaded processor generally provides a 20% to 40% performance boost to multiprocess or multithreaded applications.

The expected performance of the Navigation Engine on a hyperthreaded system is U × K × N × H, where H is the hyperthreading boost factor (1.2 ≤ H ≤ 1.4), and where U, K, and N are as described above in “Multithreaded Navigation Engine Performance.”

For example, if a single Navigation Engine provides 20 ops/sec on given load, running four Navigation Engines on a dual hyperthreaded processor (at least one Navigation Engine must run on each logical processor to achieve the full benefits of hyperthreading) may provide around 46.8 ops/sec (U=90%, K=20, N=2, H=1.3). Running the same application with a single multithreaded Navigation Engine may provide 41.6 ops/sec (U=80%, K=20, N=2, H=1.3).
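The hyperthreaded variant of this calculation simply folds in the extra factor H (again, an illustrative helper written for this guide, not Endeca tooling):

```python
def expected_ht_throughput(k, n, u, h):
    """Expected ops/sec on N hyperthreaded physical CPUs: U * K * N * H,
    where H (roughly 1.2 to 1.4) is the hyperthreading boost factor.
    Illustrative sketch only."""
    return u * k * n * h

# Four independent engines on two hyperthreaded processors:
print(round(expected_ht_throughput(k=20, n=2, u=0.9, h=1.3), 1))  # 46.8
# One multithreaded engine on the same hardware:
print(round(expected_ht_throughput(k=20, n=2, u=0.8, h=1.3), 1))  # 41.6
```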

Linux

On Linux, the Navigation Engine makes use of the LinuxThreads implementation of POSIX threads (pthreads) for low-level thread services. In this implementation, threads appear in the system as independent processes.

For example, if a Navigation Engine is started with a worker thread pool size of four (--threads 4), it will cause six processes to appear in the process table for the



system (which can be examined in tools such as ps and top). These six processes correspond to: four for worker threads, one for the main startup thread of the Navigation Engine, and one for a control process used by the LinuxThreads implementation itself.

This is normal behavior and does not indicate that these processes incur the normal costs of independent processes. For example, these processes share memory space with one another, allowing them to provide the normal benefits of the multithreaded Navigation Engine.
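The process arithmetic described above (one process per worker thread, plus the main startup thread, plus the LinuxThreads manager process) can be summarized in a trivial sketch (illustrative only):

```python
def linuxthreads_process_count(worker_threads):
    """Processes visible in ps/top for a multithreaded Navigation Engine
    under LinuxThreads: one per worker thread, one for the main startup
    thread, and one for the LinuxThreads manager process.
    Illustrative sketch only."""
    return worker_threads + 1 + 1

print(linuxthreads_process_count(4))  # 6, matching the --threads 4 example
```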

Solaris

On Solaris (Sparc and Intel hardware platforms), the Navigation Engine uses the native Solaris threads interface (as opposed to the pthreads compatibility interface), and is linked against the alternate Solaris threads implementation (described in the threads(3THR) Solaris man page). This threads implementation provides a light-weight, single-level, higher performance alternative to the standard two-level Solaris threads implementation.

Windows

On Windows, the Navigation Engine uses native Win32 threads. The thread count for a Navigation Engine can be examined in the Windows Task Manager in the Threads column (the number of threads listed will be one greater than the value specified for the --threads option; the additional thread is the main thread of the Navigation Engine).



Chapter 14

Coremetrics Integration

Endeca offers integration with the Coremetrics Online Analytics product through an integration module that is packaged with the Endeca reference library. The integration module contains the code required to capture search terms information and enable the Coremetrics On-Site Search Intelligence report. Coremetrics integration is offered for the JSP, ASP, and ASP .NET versions of the UI reference implementation.

All of the reference implementations assume that the code supplied by Coremetrics is located in the /coremetrics directory at the root of your application server. If you have installed Coremetrics in another directory, or are using a different version of Coremetrics, you will have to modify the Coremetrics include statement in the integration module. In addition, the reference implementations are set up to point to the Coremetrics test server. To enable Coremetrics integration for production, you must add a cmSetProduction() call above the cmCreatePageviewTag() call in the integration module.


Using the Integration Module

Each reference implementation has a new module that contains the logic for when to include the Coremetrics tags:

• For the JSP, the integration code is in coremetrics.jsp.

• For the ASP, the integration code is in coremetrics.asp.

• For the ASP .NET, the integration code is in coremetrics.aspx.

Each reference implementation also has a commented-out include statement. Uncomment this statement to enable the Coremetrics code.

• For the JSP, the include statement is in nav.jsp.

• For the ASP, the include statement is in controller.asp.

• For the ASP .NET, the include statement is in controller.aspx.


SECTION V
Other Advanced Features



Chapter 15

Implementing Merchandising and Content Spotlighting

This chapter describes how to implement merchandising in Endeca InFront and content spotlighting in Endeca ProFind using dynamic business rules. The chapter includes the following sections:

• Introduction to Dynamic Business Rules and Promoting Records

• Suggested Workflow Using Endeca Tools to Promote Records

• Building the Supporting Constructs for a Rule

• Creating Rules

• Presenting Rule Results in a Web Application

• Grouping Rules

• Performance Considerations for Dynamic Business Rules

• Using an Agraph and Dynamic Business Rules

• Applying Relevance Ranking to Rule Results

• About Overloading Supplement Objects


Introduction to Dynamic Business Rules and Promoting Records

Endeca provides the functionality to promote contextually relevant records to application users as they search and navigate within a data set. In Endeca InFront, this activity is called merchandising because the Endeca records you promote often represent product data. In Endeca ProFind, this activity is called content spotlighting because the Endeca records you promote often represent some type of document (HTML, DOC, TXT, XLS, and so on). For the sake of simplicity, this document uses “promoting records” to generically describe both merchandising and content spotlighting.

You implement merchandising and content spotlighting using dynamic business rules. The rules and their supporting constructs define when to promote records, which records may be promoted, and also indicate how to display the records to application users.

Here is a simple merchandising example using a wine data set. An application user enters a query with the search term Bordeaux. This search term triggers a rule that is set up to promote wines tagged as Best Buys. In addition to returning standard query results for Bordeaux, the rule instructs the Navigation Engine to dynamically generate a subset of records tagged with both the Best Buy and Bordeaux properties. The Web application displays the standard query results that match Bordeaux and also displays some number of


the rule results in an area of the screen set aside for “Best Buy” records. These are the promoted records.

Comparing Dynamic Business Rules to Content Management Publishing

Endeca’s record promotion works differently from traditional content management systems (CMS), where you select an individual record for promotion, place it on a template or page, and then publish it to a Web site. Endeca’s merchandising is dynamic, or rule based. In rule-based merchandising, a dynamic business rule specifies how to query for records to promote, and not necessarily what the specific records are.

This means that, as your users navigate or search, they continue to see relevant results, because appropriate rules are in place. Also, as records in your data set change, new and relevant records are returned by the same dynamic business rule. The rule remains the same, even though the promoted records may change.

In a traditional CMS scenario, if Wine A is “Recommended,” it is identified as such and published onto a static page. If you need to update the list of recommended wines to remove Wine A and add Wine B to the static page, you must manually remove Wine A, add Wine B, and publish the changes.

With Endeca’s dynamic record promotion, the effect is much broader and requires much less maintenance. A rule is created to promote wines tagged as


“Recommended,” and the search results page is designed to render promoted wines.

In this scenario, a rule promotes recommended Wine A on any number of pages in the result set. In addition, removing Wine A and adding Wine B is simply a matter of updating the source data to reflect that Wine B is now included and tagged as “Recommended.” After making this change, the same rule can promote Wine B on any number of pages in the result set, without adjusting or modifying the rule or the pages.

Dynamic Business Rule Constructs

Two constructs make up a dynamic business rule:

• Trigger—a set of conditions that must exist in a query for a rule to fire. A trigger may include dimension values, a keyword or phrase, time values, and user-profiles. When a user’s query contains a condition that triggers a rule, the Navigation Engine evaluates the rule and returns a set of records that are candidates for promotion to application users.

• Target—specifies which records are eligible for promotion to application users. A target may include dimension values, custom properties, and featured records. For example, dimension values in a target are used to identify a set of records that are candidates for promotion to application users.


Three additional constructs support rules:

• Zone—specifies a collection of rules to ensure that rule results are produced in case a single rule does not provide a result.

• Style—specifies the minimum and maximum number of records a rule can return. A style also specifies any property templates associated with a rule. Rule properties are key/value pairs that are typically used to return supplementary information with promoted record pages. For example, a property key might be set to “SpecialOffer” and its value set to “BannerAd.gif”.

A rule’s style is passed back along with the rule’s results, to the Web application. The Web application uses the style as an indicator for how to render the rule’s results.

Note: The code to render the rule’s results is part of the Web application, not the style itself.

• Rule Group (optional)—provides a means to logically organize large numbers of rules into categories. This organization facilitates editing by multiple business users.

The core of a dynamic business rule is its trigger and target values. The target identifies a set of records that are candidates for promotion to application users. The zone and style settings associated with a rule work together to restrict the candidates to a smaller subset of records that the Web application then promotes.


Query Results and Rules

Once you implement dynamic business rules in your application, each query a user makes is compared to each rule to determine if the query triggers a rule. If a user's query triggers a rule, the Navigation Engine returns several types of results:

• Standard record results for the query.

• Promoted records specified by the triggered rule’s target.

• Any rule properties specified for the rule.
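To make the trigger/target/style interaction concrete, here is an informal sketch of how a rule might be evaluated against a query. This is not Endeca's API; the function, field names, and the assumption that results below the style's minimum are suppressed are all simplifications made for illustration:

```python
def evaluate_rule(rule, query_dims, records):
    """Return records promoted by `rule` for a query, or [] if the rule
    does not fire. Illustrative sketch of trigger/target/style semantics;
    real rules are evaluated inside the Navigation Engine."""
    # Trigger: every trigger dimension value must be present in the query.
    if not set(rule["trigger"]).issubset(query_dims):
        return []
    # Target: candidate records are those tagged with all target values.
    candidates = [r for r in records
                  if set(rule["target"]).issubset(r["dims"])]
    # Style: suppress results below the minimum, cap at the maximum.
    if len(candidates) < rule["style"]["min"]:
        return []
    return candidates[:rule["style"]["max"]]

rule = {"trigger": {"Wine Type > White"},
        "target": {"Wine Type > White > Chardonnay",
                   "Designation > Highly Recommended"},
        "style": {"min": 1, "max": 1}}
records = [{"name": "Chardonnay A",
            "dims": {"Wine Type > White > Chardonnay",
                     "Designation > Highly Recommended"}},
           {"name": "Riesling B", "dims": {"Wine Type > White"}}]
print([r["name"] for r in evaluate_rule(rule, {"Wine Type > White"}, records)])
# ['Chardonnay A']
```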

Two Examples of Promoting Records

The following sections explain two examples of using dynamic business rules to promote Endeca records. The first example shows how a single rule provides merchandising results when an application user navigates to a dimension value in a data set. The scope of the merchandising coverage is somewhat limited by using just one rule.

The second example builds on the first by providing more broad merchandising coverage. In this example, an application user triggers two additional dynamic business rules by navigating to the root dimension value for the application. These two additional rules ensure that merchandising results are always presented to application users.


An Example with One Rule Promoting Records

This simple example demonstrates a basic merchandising scenario where an application user navigates to Wine Type > White, and a dynamic business rule called “Recommended Chardonnays” promotes chardonnays that have been tagged as Highly Recommended. From a merchandising perspective, the marketing assumption is that users who are interested in white wines are also likely to be interested in highly recommended chardonnays.

The “Recommended Chardonnays” rule is set up as follows:

• The rule’s trigger, which specifies when to promote records, is the dimension value Wine Type > White.

• The rule’s target, which specifies which records to promote, is a combination of two dimension values, Wine Type > White > Chardonnay and Designation > Highly Recommended.

• The style associated with this rule is configured to provide a minimum of one promoted record and a maximum of one record; that is, exactly one record is promoted.

• The zone associated with this rule is configured to allow only one rule to produce rule results.

When an application user navigates to Wine Type > White in the application, the rule is triggered. The Navigation Engine evaluates the rule and returns promoted records from the combination of the Chardonnay and Highly Recommended dimension


values. There may be a number of records that match these two dimension values, so zone and style settings restrict the number of records actually promoted to one.

The promoted record, along with the user’s query and standard query results, are called out in the accompanying graphic. The callouts identify:

• The user’s query, which is also the trigger value

• Standard results for the query Wine Type > White

• Rule results for Recommended Chardonnays


An Expanded Example with Three Rules

The previous example used just one rule to merchandise highly recommended chardonnays. The following example expands on the previous one by adding two more rules called “Best Buys” and “Highly Recommended.” These rules merchandise wines tagged with a Best Buy property and a Highly Recommended property, respectively. Together, the three rules merchandise records to expose a broader set of potential wine purchases.

The “Best Buys” rule is set up as follows:

• The rule’s trigger is set to the Web application’s root dimension value. In other words, the trigger always applies.

• The rule’s target is the dimension value named Best Buy.

• The style associated with this rule is configured to provide a minimum of four promoted records and a maximum of eight records.

• The zone associated with this rule is configured to allow only one rule to produce rule results.

The “Highly Recommended” rule is set up as follows:

• The rule’s trigger is set to the Web application’s root dimension value. In other words, the trigger always applies.

• The rule’s target is the dimension value named Highly Recommended.


• The style associated with this rule is configured to provide a minimum of one promoted record and a maximum of three records.

• This is the only rule associated with its zone, so no other rules are available to produce results. For details on how zones can be used when more rules are available, see “Ensuring Promoted Records are Always Produced.”

When an application user navigates to Wine Type > White, the “Recommended Chardonnays” rule fires and provides rule results as described in “An Example with One Rule Promoting Records”. In addition, the Highly Recommended and Best Buys rules also fire and provide results because their triggers always apply to any navigation query.

The promoted records for each of the three rules, along with the user’s query, and standard query results are called out in the following graphic:


• The user’s query, which is also the trigger value

• Standard results for the query Wine Type > White

• Rule results for Recommended Chardonnays

• Rule results for Highly Recommended

• Rule results for Best Buys


Suggested Workflow Using Endeca Tools to Promote Records

You can build dynamic business rules and their constructs in Developer Studio. In addition, business users can use Web Studio to perform any of the following rule-related tasks:

• Create a new dynamic business rule.

• Modify an existing rule.

• Deploy a rule to a preview application and test or audit its results.

Because either tool can modify a project, the tasks involved in promoting records require coordination between the pipeline developer and the business user. The recommended workflow is as follows:

1. A pipeline developer uses Developer Studio in a development environment to create the supporting constructs for rules (zones, styles, rule groups, and so on), and perhaps a small number of dynamic business rules as placeholders or test rules.

2. An application developer creates the Web application including rendering code for each style.

3. The pipeline developer makes the project available to business users via the Endeca Manager.

4. A business user starts Endeca Web Studio to access the project, create new rules, modify rules, and test the rules as necessary.


For general information about using Endeca tools and sharing projects, see the Endeca Tools Guide. Web Studio tasks are described in the Endeca Web Studio Help.

Note: Any changes to the constructs that support rules, such as zones, styles, rule groups, and property templates, must be performed in Endeca Developer Studio.

Incremental Implementation

Merchandising and content spotlighting are complex features to implement; the best way to develop your dynamic business rules is incrementally, as you and the business users of Web Studio coordinate tasks. It is also helpful to define the purpose of each dynamic business rule in the abstract (before implementing it in Developer Studio or Web Studio) so that everyone knows what to expect when the rule is implemented. If rules are only loosely defined when implemented, they may have unexpected side effects.

Begin with a single, simple business rule to become familiar with the core functionality. Later, you can add more advanced elements, along with additional rules, rule groups, zones, and styles. As you build the complexity of how you promote records, you will have to coordinate the tasks you do in Developer Studio (for example, zone and style definitions) with the work that is done in Web Studio.


Building the Supporting Constructs for a Rule

As discussed in “Dynamic Business Rule Constructs” on page 260, the records identified by a rule’s target are candidates for promotion and may or may not all be promoted in a Web application. Zone and style settings work together to restrict which rule results are actually promoted to application users.

• A zone identifies a collection of rules to ensure at least one rule always produces records to promote.

• A style controls the minimum and maximum number of results to display, defines any property templates, and indicates how to display the rule results to the Web application.

The following sections describe zone and style usage in detail.

Ensuring Promoted Records are Always Produced

You ensure promoted records are always produced by creating a zone in Developer Studio to associate with a number of dynamic business rules. A zone is a logical collection of rules that allows you to have multiple rules available, in case a single rule does not produce a result. The rules in a zone ensure that the screen space dedicated to displaying promoted records is always populated.


A zone has a rule limit that dictates how many rules may successfully return rule results. For example, if three rules are assigned to a certain zone but the “Rule limit” is set to one, only the first rule that successfully provides rule results returns them. Any remaining rules in the zone are ignored.
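The rule-limit behavior can be modeled with a short sketch (illustrative Python, not the Endeca API; the function and the representation of rules are assumptions):

```python
# Illustrative sketch of a zone's "Rule limit" (not the Endeca API).
# Each rule is modeled as a function returning a (possibly empty) record list.

def evaluate_zone(rules, rule_limit):
    produced = []
    for rule in rules:
        if len(produced) >= rule_limit:
            break                      # remaining rules in the zone are ignored
        results = rule()
        if results:                    # only rules that return records count
            produced.append(results)
    return produced
```

With a rule limit of one, if the first rule produces no records, the zone falls through to the next rule, so the screen space dedicated to promoted records is still populated.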

To create a zone in Developer Studio:

1. In the Project Explorer, expand Dynamic Business Rules.

2. Double-click Zones to open the Zones view.

3. Click New to open the Zone editor.

4. In the Name field, provide a unique name for the zone.

5. (optional) If you want to limit the number of rules that can provide rule results within a zone, type a number in “Rule limit.”

6. (optional) If you want to randomly order the rules in the zone, select the “Shuffle rules” check box. When checked, this indicates that the Navigation Engine randomly shuffles the order of the rules within this zone before evaluating them.

7. (optional) Select “Valid for search” to indicate whether a zone (and all of the rules associated with that zone) is valid for navigation queries that include a record search parameter. Rules that include a keyword trigger require the “Valid for search” setting.

8. (optional) If you want to prevent the same record from appearing multiple times for multiple rules, check "Unique by this dimension/property" and specify a unique record criterion. Selecting a dimension or property allows the Navigation Engine to identify individual records and prevent the same record from appearing multiple times. If you check "Unique by this dimension/property" and do not select a dimension or property, the same record may appear multiple times for multiple rules within a single zone.

9. Click OK.

Creating a Style

You create a style in the Styles view of Endeca Developer Studio. A style serves three functions:

• It controls the minimum and maximum number of records that may be promoted by a rule.

• It defines property templates, which facilitate consistent property usage between pipeline developers and business users of Web Studio.

• It indicates to a Web application which rendering code should be used to display a rule’s results.

Controlling the Number of Promoted Records

Styles can be used to affect the number of promoted records in two scenarios:

• A rule produces fewer than the minimum number of records. For example, if the “Best Buys” rule produces only two records to promote and that rule is assigned a style that has Minimum Records set to three, the rule does not return any results.


• A rule produces more than the maximum. For example, if the “Best Buys” rule produces 20 records, and the Maximum Records value for that rule’s style is five, only the first five records are returned.

If a rule produces a set of records that falls between the minimum and maximum settings, the style has no effect on the rule’s results.
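The two scenarios above can be sketched as follows (illustrative Python with assumed names, not the Endeca API):

```python
# Illustrative sketch of a style's Minimum/Maximum Records settings
# (not the Endeca API; the function name is an assumption).

def apply_style(candidates, min_records, max_records):
    if len(candidates) < min_records:
        return []                      # below the minimum: no results at all
    return candidates[:max_records]    # above the maximum: first N only
```

For example, two candidates against a minimum of three yield no results, while twenty candidates against a maximum of five yield only the first five. A set falling between the limits passes through unchanged.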

Performance and the Maximum Records Setting

The Maximum Records setting for a style prevents dynamic business rules from returning a large set of matching records, potentially overloading the network, memory, and page size limits for a query. For example, if Maximum Records is set to 1000, then 1000 records could potentially be returned with each query, causing significant performance degradation.

Ensuring Consistent Property Usage with Property Templates

As discussed in the “Dynamic Business Rule Constructs” section, rule properties are key/value pairs typically used to return supplementary information with promoted record pages. For example, a property key might be set to “SpecialOffer” and its value set to “BannerAd.gif.”

As Web Studio users and Developer Studio users share a project with rule properties, it is easy for a key to be mistyped. If this happens, the supplementary information represented by a property does not get promoted correctly in a Web application. To address this, you can optionally create property templates for a style.


Property templates ensure that property keys are used consistently when pipeline developers and Web Studio users share project development tasks.

If you add a property template to a style in Endeca Developer Studio, that template is visible in Web Studio in the form of a pre-defined property key with an empty value. Web Studio users are allowed to add a value for the key when editing any rule that uses the template’s associated style. Web Studio users are not allowed to edit the key itself.

Furthermore, pipeline developers can restrict Web Studio users to creating new properties based only on property templates, thereby minimizing potential mistakes or conflicts with property keys.

For example, a pipeline developer can add a property template called “WeeklyBannerAd” and then make the project available to Web Studio users. Once the project is loaded in Web Studio, a property template is available with a populated key called “WeeklyBannerAd” and an empty value. The Web Studio user provides the property value. In this way, property templates reduce simple project-sharing mistakes such as creating a similar, but not identical property called “weeklybannerad”.

Note: Property templates are associated with styles in Developer Studio, not rules. Therefore, they are not available for use on the Properties tab of the Rule editor.


Indicating How to Display Promoted Records

You indicate how to display promoted records to users by creating a style to associate with each rule and by creating application-level rendering code for the style. You create a style in Developer Studio; you create rendering code in your Web application. This section describes how to create styles. Rendering code is described later in “Adding Web Application Code to Render Rule Results” on page 301.

A style has a name and an optional title; either can be displayed in the Web application. When the Navigation Engine returns rule results to your application, it also passes the name and title values. The name uniquely identifies the style. The title does not need to be unique, so it is often more flexible to display the title; for example, you might use the same title, such as “On Sale,” for many dimension value targets.

Without application-level rendering code that uses them, the name and title values have no effect; both require rendering code in your application.

To create a style in Developer Studio:

1. In the Project Explorer, double-click Styles to open the Styles view.

2. Click New to open the Style editor.

3. In the Name field, provide a unique name for the style.


4. If desired, specify a title in the Title field. The title does not need to be unique.

5. In the Minimum Records field, specify the minimum number of records that must be returned by a rule’s target in order for that rule to return results. (The default Minimum Records value, if not specified, is zero.)

6. In the Maximum Records field, specify the maximum number of records that can be returned for a rule. (The default Maximum Records value, if not specified, is ten. If Maximum Records is set to zero, the rule returns zero records.)

7. If you want to create a property template, click Add in the Property Templates frame. The Property Template dialog box displays.

a. Provide the key for the property template.

b. Click OK.

c. Repeat for as many new property templates as necessary.

8. If you need to remove a property template, select it in the Property Templates frame and click Remove.

9. If you want to give Web Studio users the ability to create new properties for a rule, check “Allow additional, custom properties.” Unchecking this option prevents Web Studio users from creating new properties to associate with a rule. In other words, Web Studio users will be restricted to creating properties based only on the property templates you have created in Developer Studio.

10. Click OK.

Creating Rules

After you have created your zones and styles, you can start creating the rules themselves.

Note: It is not necessary to create rule groups unless your application requires it. If you want to create rule groups before you create a rule, see “Grouping Rules” on page 301.

As mentioned in “Suggested Workflow Using Endeca Tools to Promote Records”, a developer usually creates the preliminary rules and the other constructs in Developer Studio, and then hands off the project to a business user to fine-tune the rules and create additional rules in Web Studio. However, the business user can use Web Studio to perform any of the tasks described in the following sections that are related to creating a rule. For details, see Endeca Web Studio Help.

The following sections guide you through creating a rule using Developer Studio, including the rule’s trigger, targets, featured records, properties, and result ordering.


Creating a Rule and Ordering Its Results

This section describes how to create a dynamic business rule and sort its results.

To create a dynamic business rule in Developer Studio:

1. In the Project Explorer, expand Dynamic Business Rules.

2. Make sure you have already created at least one zone and one style.

3. Double-click Rules, which opens the Rules view and also activates the Rule menu on the menu bar.

4. From the Rule menu, select New. The Rule editor displays.

5. In the Name text box, enter a unique name for the new rule.

6. From the Zone list, choose a zone to associate with the rule.

7. From the Style list, choose a style to associate with the rule.

8. (optional) From the “Members of this rule group” list, select a rule group to which this rule belongs. Rule groups are optional. This field is unavailable if no rule groups have been defined. For more information, see “Grouping Rules” on page 301.

9. If you want to sort the rule’s promoted records by a property or dimension value, select the property or dimension value from the Sort key list. Select [None] to accept the default sort order specified for the project.


10. If you chose a Sort key, choose Ascending or Descending to define the sort order.

11. If you want to randomly order the promoted records for this rule, select Shuffle. Selecting Shuffle overrides any Sort key and Order options you specified above.

12. If you want to exclude promoted results that resemble the standard results of a query, select Self-pivot. This control ensures that the Navigation Engine does not return the same records twice for a query: once as the standard navigation results and again as the promoted results.

13. Continue with the following section to define the rule’s trigger.

Specifying When to Promote Records

You indicate when to promote records on the Triggers tab and on the Time Triggers tab of the Rule editor. If a user’s query matches a trigger, the Navigation Engine evaluates the rule. A rule may have any of the following triggers:

• Dimension triggers

• Time triggers

• Keyword triggers

• User-profile triggers

• Any combination of the above.

A dimension trigger is a collection of one or more dimension values that identify when a dynamic business rule is evaluated (or fired) by the Navigation Engine. If a user’s query contains the dimension values identified in a rule’s trigger, the Navigation Engine evaluates that rule. For example, in a wine data set, you could set up a rule that is triggered when a user clicks Red. If the user clicks White, the Navigation Engine does not evaluate the rule. If the user clicks Red, the Navigation Engine evaluates the rule and returns any promoted records.

In addition to containing explicit dimension values, a dimension value trigger can also be left empty (unspecified) on the Triggers tab. When there is no explicit dimension value trigger for a rule, any navigation query may trigger the rule merely by navigating to the root dimension value for an application (N=0). Such a rule effectively has a global trigger: any query from the root dimension value triggers the rule.
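Dimension-trigger evaluation can be sketched as follows (illustrative Python, not the Endeca API; the navigation state is modeled as the set of selected dimension values, a non-empty trigger is treated as firing when all of its values are selected, roughly the “Inherit” behavior described later, and an empty trigger fires only at the root):

```python
# Illustrative sketch of dimension-trigger evaluation (not the Endeca API).

def dimension_trigger_fires(trigger_values, navigation_state):
    if not trigger_values:
        # global (empty) trigger: fires at the root navigation state (N=0)
        return len(navigation_state) == 0
    # explicit trigger: fires when every trigger value is in the query
    return set(trigger_values) <= set(navigation_state)
```

For example, a trigger on Red fires when the user navigates to Red (alone or with further refinements), but not when the user navigates to White.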

A time trigger is a date/time value. If a user makes a query after a specified start time and before a specified expiration time, then the Navigation Engine fires the associated rule.

A keyword trigger is a single word or phrase. If a user’s query includes that word or phrase, then the Navigation Engine fires the associated rule. Keyword triggers require that the zone associated with the rule have “Valid for search” enabled on the Zone editor in Developer Studio. Keyword triggers also require a match mode that specifies how the query keyword should match in order to trigger the rule. There are three match modes:


• Phrase—A user’s query must match all of the words of the keyword trigger, in the same order, for the rule to fire. Phrase mode also allows the rule to fire if the spelling and stemming corrections of a user’s query match the keyword trigger.

• All—A user’s query must match all of the words of the keyword trigger, without regard for order, for the rule to fire. All mode also allows the rule to fire if the spelling and stemming corrections of a user’s query match the keyword trigger.

• Exact—A user’s query must exactly match the keyword trigger for the rule to fire. Unlike the other two modes, the query must match the keyword trigger in the number of words and cannot be a superset of it. Exact mode does not allow the rule to be triggered by spelling or stemming corrections.
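The three match modes can be sketched as follows (illustrative Python, not the Endeca API; spelling and stemming corrections are omitted for brevity):

```python
# Illustrative sketch of the three keyword-trigger match modes
# (not the Endeca API; corrections are not modeled).

def keyword_trigger_fires(query, trigger, mode):
    q_words, t_words = query.lower().split(), trigger.lower().split()
    if mode == "Exact":
        return q_words == t_words            # same words, same order, no extras
    if mode == "Phrase":
        # trigger words must appear as a contiguous, ordered run in the query
        n = len(t_words)
        return any(q_words[i:i + n] == t_words
                   for i in range(len(q_words) - n + 1))
    if mode == "All":
        return set(t_words) <= set(q_words)  # all words present, any order
    raise ValueError("unknown match mode: " + mode)
```

For a trigger of “red wine”: the query “cheap red wine” fires in Phrase and All modes but not Exact; “red cheap wine” fires only in All mode; only the query “red wine” itself fires in Exact mode.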

A user-profile trigger is a label, such as premium_subscriber, that identifies an application user. If a user who has such a profile makes a query, the query triggers the associated rule. For more information, see “Implementing User Profiles” on page 311.

To specify when to promote records:

1. In the Rule editor for the rule you want to configure, click the Triggers tab.

2. If you want to add a dimension value trigger, click Add. The Select Dimension Value dialog box displays.

a. Choose a dimension value from the list.


b. Repeat step 2 if you want to add multiple dimension values to the rule’s trigger. Note that only one dimension value within each dimension may be selected.

3. Check Inherit to allow child dimension values of the specified dimension value to also trigger the rule. If unchecked, a query can trigger the rule only when a user navigates to the exact dimension value you specify.

Note: If a rule has an empty dimension value trigger (a global trigger), checking Inherit triggers the rule at any navigation state because the rule is inheriting from the root. This scenario has performance implications because the Navigation Engine must evaluate the rule at every navigation state. For details, see “Performance Considerations for Dynamic Business Rules” on page 305. Unchecking Inherit, with an empty trigger, triggers the rule at the root dimension value but does not trigger the rule at any other navigation state.

4. To add a keyword trigger, type a keyword or phrase in Keyword trigger. You can provide only one term or phrase per rule.

5. If you provided a keyword, select a match mode from the Match mode drop-down list. You can choose Phrase, All, or Exact as explained above.

6. If you want to add a user profile trigger, select a pre-defined user profile from the drop-down list.

7. Go on to the following section to specify a time trigger or see “Specifying Which Records to Promote” on page 284 to specify the rule’s target.


Specifying a Time Trigger to Promote Records

You specify a time trigger on the Time Trigger tab of the Rule editor. A time trigger consists of a date/time value at which the trigger starts and a date/time value at which it ends. Any matching query that occurs between these two values triggers the rule.

A time trigger is useful if you want to promote records for a particular period of time. For example, you might create a rule called “This Weekend Only Sale” whose time trigger starts Friday at midnight and expires on Sunday at 6 p.m.

Only a start time value is required for a time trigger. If you do not specify an expiration time, the rule can be triggered indefinitely.
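The check can be sketched as follows (illustrative Python, not the Endeca API; the function name is an assumption):

```python
# Illustrative sketch of a time-trigger check (not the Endeca API).
from datetime import datetime

def time_trigger_fires(now, start, expiration=None):
    if now < start:
        return False                   # trigger has not started yet
    if expiration is not None and now >= expiration:
        return False                   # trigger has expired
    return True                        # no expiration: fires indefinitely
```

For the “This Weekend Only Sale” example, a start of Friday at midnight and an expiration of Sunday at 6 p.m. fire the rule for a Saturday-noon query but not for one on Sunday evening.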

To specify a time trigger:

1. In the Rule editor for the rule you want to configure, click the Time Trigger tab.

2. Select “Give this rule a time trigger” to enable the Start time options.

3. From the “Start time” drop-down list, select the start time for the time trigger.

4. If desired, select “Give this rule an expiration date” and, from the “Expiration time” drop-down list, choose the end time for the time trigger. If you do not specify an expiration time, the trigger does not expire.

5. Go on to the following section to specify the rule’s target.

Synchronizing Time Zone Settings

The start time and expiration time values do not specify time zones. The clock of the server that runs your Web application determines the time zone for the start and expiration times. If your application is distributed across multiple servers, you must synchronize the server clocks to ensure the time triggers are coordinated.

Specifying Which Records to Promote

You indicate which records to promote by specifying a target on the Target tab of the Rule editor. A target is a collection of one or more dimension values. These dimension values identify a set of records that are all candidates for promotion. Zone and style settings further control the specific records that are actually promoted to a user.

To specify which records to promote:

1. In the Rule editor for the rule you want to configure, click the Targets tab.

2. Click Add. The Select Dimension Value dialog box displays.

3. Select a dimension value from the list and click OK.


4. If necessary, repeat the above steps to add multiple dimension values to the target.

Note: You cannot add auto-generated dimension values until you first load and promote them. See the Endeca Developer Studio Help for more information.

5. Check “Augment navigation state” to add the target dimension values to the existing navigation state when evaluating the rule. If it is not checked, the Navigation Engine ignores all current navigation state filters when evaluating the rule. Navigation state filters include dimension value selections, record search, range filters, and so on. The one exception is custom catalog filters, which always apply regardless of this setting. For example, if “Augment navigation state” is checked and a user navigates to Wine Type > Red, triggering a rule that promotes wines under $10, the rule results include only red wines that are under $10; the rule results are always a subset of the standard query results. Conversely, if “Augment navigation state” is not checked and the same navigation triggers the rule, the rule results include wines of any type (red, white, sparkling) under $10; the rule results are not a subset of the standard query results.

6. Click OK if you are finished configuring the rule, or proceed with the following sections to promote custom properties or featured records.
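The effect of “Augment navigation state” (step 5 above) can be sketched as follows (illustrative Python, not the Endeca API; records and the navigation state are modeled as sets of dimension values):

```python
# Illustrative sketch of "Augment navigation state" (not the Endeca API).

def rule_results(records, navigation_state, target, augment):
    filters = set(target)
    if augment:
        # rule results stay a subset of the standard query results
        filters |= set(navigation_state)
    return [r for r in records if filters <= r]

wines = [
    {"Red", "Under $10"},
    {"White", "Under $10"},
    {"Red", "Over $50"},
]
```

With the navigation state at Red and a target of Under $10, augmenting yields only the red wine under $10, while not augmenting yields every wine under $10 regardless of type.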


Adding Custom Properties to a Rule

You can optionally promote custom properties by creating key/value pairs on the Properties tab of the Rule editor. Rule properties are typically used to return supplementary information with promoted record pages. Properties could specify editorial copy, point to rule-specific images, and so on. For example, a property name might be set to “SpecialOffer” and its value set to “BannerAd.gif.”

You can add multiple properties to a dynamic business rule. These properties are accessed with the same method calls used to access system-defined properties that are included in a rule’s results, such as a rule’s zone and style. For details, see “Adding Web Application Code to Extract Rule Results” on page 293.

To add a custom rule property:

1. In the Rule editor for the rule you want to configure, click the Properties tab.

2. Type the property name in the Property field and its corresponding value in the Value field.

3. Click Add.

4. Repeat steps 2 and 3 if you want to add additional properties.

5. Click OK.

Note: You can also create templates to facilitate the creation of rule properties in Web Studio. See “Ensuring Consistent Property Usage with Property Templates” on page 273 for details.

Advanced Features Guide Endeca ConfidentialChapter 15

Page 287: Endeca_07_AdvFeaturesGuide

287

Adding Static Records in Rule Results

In addition to defining a rule’s dimension value targets and custom properties, you can optionally specify any number of static records to promote. These static records are called featured records, and you specify them on the Featured Records tab of the Rule editor. You access featured records in your Web application using the same methods you use to access dynamically generated records. For details, see “Adding Web Application Code to Extract Rule Results” on page 293.

To add featured records to a rule:

1. In the Rule editor for the rule you want to configure, click the Featured Records tab.

2. From the Record spec list, choose an Endeca property to uniquely identify featured records.

3. In the Value text box, type a value for the selected Endeca property. This value identifies a featured record you want to promote.

4. Click Add.

5. If desired, repeat steps 3 and 4 to add additional featured records to the list.

6. To change the order in which the featured records appear, select a record and click Up or Down.

7. To change a record spec value, select it from the Record spec values list, modify its value in the Value text box, and click Update.


8. Click OK when you are done adding featured records to a rule.

The Navigation Engine treats featured records differently than dynamically generated records. In particular, featured records are not subject to any of the following:

• Record order sorting by sort key

• Uniqueness constraints

• Maximum record limits

Order of Featured Records

The General tab of the Rule editor allows you to specify a sort order for dynamically generated records that the Navigation Engine returns. This sort order does not apply to featured records. Featured records are returned in a Supplement object in the same order that you specified them on the Featured Records tab. The featured records occur at the beginning of the record list for the rule’s results and are followed by any dynamically generated records. The dynamically generated records are sorted according to your specified sort options.

No Uniqueness Constraints

The zone associated with a rule allows you to indicate whether rule results are unique by a specified property or dimension value. This uniqueness constraint does not apply to featured records, even if uniqueness is enabled for dynamically generated rule results. For example, if you enabled “Color” to be the unique property for record results and you have two dynamically generated records with “Blue” as the property value, the Navigation Engine excludes the second record as a duplicate. On the other hand, if the two records in the same scenario are featured results rather than dynamically generated results, the Navigation Engine returns both records.

No Maximum Record Limits

The style associated with a rule allows you to set a maximum number of records that the Navigation Engine may return as rule results. This Maximum Records value does not apply to featured records. For example, if the Maximum Records value is set to three and you specify five featured records, the Navigation Engine returns all five records. Also, the Navigation Engine returns featured records before dynamically generated records, and the featured records count toward the maximum limit. Consequently, the number of featured records could restrict the number of dynamically generated rule results.
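The way featured records interact with these limits can be sketched as follows (illustrative Python with assumed names, not the Endeca API; the handling of uniqueness between featured and dynamic records is simplified):

```python
# Illustrative sketch (not the Endeca API): featured records come first,
# bypass the uniqueness check and the Maximum Records cap, but still count
# toward the cap applied to the dynamically generated records that follow.

def assemble_results(featured, dynamic, max_records, unique_key=None):
    results = list(featured)                     # all featured records pass
    slots = max(0, max_records - len(featured))  # remaining room for dynamic
    seen = set()
    for rec in dynamic:
        if slots == 0:
            break
        key = rec.get(unique_key) if unique_key else id(rec)
        if key in seen:
            continue          # uniqueness applies only to dynamic records
        seen.add(key)
        results.append(rec)
        slots -= 1
    return results
```

For example, with Maximum Records set to three and five featured records specified, all five featured records are returned and no dynamically generated records fit within the cap.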

Sorting Rules in the Rules View

The dynamic business rules you create in Developer Studio appear in the Rules view. To make rules easier to find and work with, they can be sorted by name (in alphabetical ascending or descending order) or by priority.

The procedure described below changes the way rules are sorted in the Rules view only. Sorting does not affect the priority used when processing the rules. Prioritizing rules in Developer Studio is described in “Prioritizing Rules” on page 290.

To sort the rules in Rules view:

• In the Rules view, click the Name column header to cycle the sort order through Sorted by name (ascending), Sorted by name (descending), and Sorted by priority.

Prioritizing Rules

In addition to sorting rules by name or priority, you can also modify a rule’s priority in the Rules view of Developer Studio. Priority is indicated by a rule’s position in the Rules view, relative to the position of other rules when you have sorted the rules by priority. You modify the relative priority of a rule by moving it up or down in the Rules view.

A rule’s priority affects the order in which the Navigation Engine evaluates the rule. The Navigation Engine evaluates rules that are higher in the Rules view before those that are positioned lower. By increasing the priority of a rule, you increase the likelihood that the rule is triggered before another, and in turn, increase the likelihood that the rule promotes records before others.

It is important to consider rule priority in conjunction with the settings you specify in the Zone editor. For example, suppose a zone has “Rule limit” set to three. If you have ten rules available for the zone, the Navigation Engine evaluates the rules in the order they appear in the Rules view, and returns results from only the first three that have valid results. In addition, the “Shuffle rules” check box on the Zone editor overrides the priority order you specify in the Rules view. When you check “Shuffle rules”, the Navigation Engine evaluates the rules associated with a zone in random order.

If you set up rule groups, you can modify the priority of a rule within a group and modify the priority of a group with respect to other groups. For details, see “Prioritizing Rule Groups” on page 304.

To prioritize rules:

1. In the Rules view, click the Name column header to cycle the order of the rule sort so that the rules are displayed in priority order (you’ll see “Sorted by priority” in the lower left corner of the Rules view).

2. Select the rule whose priority you want to change and click the Up or Down buttons to move the rule to the desired position.

Presenting Rule Results in a Web Application

The Navigation Engine returns rule results to a Web application in a Supplement object. To display rule results to Web application users, an application developer writes code that extracts the rule results from the Supplement object and displays the results in the application.

Before explaining how these two tasks are accomplished, it is helpful to briefly describe the process from the point at which a user makes a query to the point when an application displays the rule results:

1. A user submits a query that triggers a dynamic business rule.

2. When a query triggers a rule, the Navigation Engine evaluates the rule and returns rule results in a single Supplement object per rule. The rule results are derived from the rule’s target values refined by zone and style settings.

3. Web application code extracts the rule results and the style for the rule from the Supplement object.

4. Custom rendering code defines how to display the rule results in your application according to the style supplied with the results.

The following sections describe query parameter requirements and application and rendering code requirements.

Required Navigation Engine URL Query Parameters

The Navigation Engine evaluates dynamic business rules only for navigation queries. This evaluation also occurs with variations of navigation queries, such as record search, range filters, and so on. Dynamic business rules are not evaluated for record, aggregated record, or dimension search queries. Therefore, a query must include a navigation parameter (N) in order to potentially trigger a rule. No other specific query parameters are required.
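As a rough sketch of this constraint, the helper below tests whether a query string carries the navigation parameter and therefore can trigger rules. The helper name and the sample query strings are illustrative, not part of the Endeca API.

```java
import java.util.Arrays;

public class RuleTriggerCheck {

    // Dynamic business rules fire only for navigation queries, so the
    // query string must carry the navigation parameter (N). Record,
    // aggregated record, and dimension search queries cannot trigger rules.
    static boolean canTriggerRules(String queryString) {
        return Arrays.stream(queryString.split("&"))
                .anyMatch(p -> p.equals("N") || p.startsWith("N="));
    }

    public static void main(String[] args) {
        // Navigation query with record search: can trigger rules.
        System.out.println(canTriggerRules("N=0&Ntt=mondavi"));
        // Record query: never triggers rules.
        System.out.println(canTriggerRules("R=123"));
    }
}
```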


Adding Web Application Code to Extract Rule Results

You must add code to your Web application that extracts rule results from the Supplement objects that the Navigation Engine returns. Supplement objects are children of the Navigation object and are accessed via the Navigation object’s getSupplements() method. The getSupplements() method returns a SupplementList object that contains some number of Supplement objects. For example, the following pseudo code gets all Supplement objects from the Navigation object:

<SupplementList> = <Navigation>.getSupplements()
<Supplement> = <SupplementList>.get(i)

Each Supplement object may contain three types of data: records, navigation references, and properties.

• Records—Each dynamic business rule’s Supplement object has one or more records attached to it. These records are structurally identical to the records found in navigation record results. The following pseudo code gets all records from a Supplement object (see the sample code sections below for more detail):

<ERecList> = <Supplement>.getERecs()
<ERec> = <ERecList>.get(i)

• Navigation reference—Each business rule’s Supplement object also contains a single reference to a navigation query. This navigation reference is a collection of dimension values. These dimension values create a navigation query that may be used to direct a user to a new location (usually the full result set that the promoted records were sampled from). This is useful if you want to create a link from the rule’s title that displays the full result set of promoted records. The following pseudo code gets the navigation reference from a Supplement object (see the sample code sections below for more detail):

<NavigationRefsList> = <Supplement>.getNavigationRefs()
<DimValList> = <NavigationRefsList>.get(i)
<DimVal> = <DimValList>.get(j)

• Properties—Each business rule’s Supplement object contains multiple properties, and each property consists of a key/value pair. Properties are rule-specific, and are used to specify the style, zone, title, and so on for each rule. The following pseudo code gets a property from a Supplement object (see the sample code sections below for more detail):

<PropertyMap> = <Supplement>.getProperties()
<Property> = <PropertyMap>.get(string)

There are a number of important properties for each business rule’s Supplement object. They include the following:

• Title—The title of a rule, as specified in the Name field of the Rule editor.

• Style—The name of the style associated with the rule, as specified in the Style drop-down list of the Rule editor’s General tab.


• Style Title—The title of the style (different from the name of the style) associated with the rule, as specified in the Title field on the Style editor.

• Zone—The name of the zone the rule is associated with, as specified by the Zone drop-down list of the Rule editor’s General tab.

• DGraph.SeeAlsoMerchId—The rule ID. This ID is system-defined, not user-defined.

• DGraph.SeeAlsoPivotCount—This count specifies the total number of matching records that were available when evaluating the target for this rule. This count is likely to be greater than the actual number of records returned with the Supplement object, since only the top N records are returned for a given business rule style.

• DGraph.SeeAlsoMerchSort—If a sort order has been specified for a rule, the property or dimension name of the sort key is listed in this property.

• DGraph.SeeAlsoMerchSortOrder—If a sort key is specified, the sort direction applied for the key is also listed.

In addition to the properties listed above, you can create custom properties on the Properties tab of the Rule editor. Custom properties also appear in the Supplement object. For details, see “Adding Custom Properties to a Rule” on page 286.


Sample Java Code

You can use the following sample Java code to assist you in extracting rule results from Supplement objects.

SupplementList sl = nav.getSupplements();
boolean hasMerch = false;
for (int i = 0; i < sl.size(); i++) {

  // Get Supplement object
  Supplement sup = (Supplement)sl.get(i);

  // Get properties
  PropertyMap supPropMap = sup.getProperties();

  // Check if object is a merchandising or
  // content spotlighting result
  if ((supPropMap.get("DGraph.SeeAlsoMerchId") != null) &&
      (supPropMap.get("Style") != null) &&
      (supPropMap.get("Zone") != null)) {
    hasMerch = true;

    // Get record list
    ERecList recs = sup.getERecs();
    for (int j = 0; j < recs.size(); j++) {

      // Get record
      ERec rec = (ERec)recs.get(j);

      // Get record properties
      PropertyMap recPropsMap = rec.getProperties();

      // Get value of the Name prop from the current record
      String name = (String)recPropsMap.get("Name");
    }

    // Set target link using the first navigation reference
    NavigationRefsList nrl = sup.getNavigationRefs();
    DimValList dvl = (DimValList)nrl.get(0);

    // Loop over dimension values to build a new target query
    String newNavParam = "";
    for (int j = 0; j < dvl.size(); j++) {
      DimVal dv = (DimVal)dvl.get(j);

      // Add delimiter and id
      newNavParam += " " + dv.getId();
    }

    // Get specific rule properties
    String style = (String)supPropMap.get("Style");
    String title = (String)supPropMap.get("Title");
    String zone = (String)supPropMap.get("Zone");
    String customText = (String)supPropMap.get("CustomText");
  }
}

Sample ASP.NET Code

You can use the following sample C# code to assist you in extracting rule results from Supplement objects.


...
// Check if Supplement object is merchandising
// or content spotlighting
if ((supPropMap["DGraph.SeeAlsoMerchId"] != null) &&
    (supPropMap["Style"] != null) &&
    (supPropMap["Zone"] != null) &&
    (Request.QueryString["hideMerch"] == null)) {

  // Get record list
  ERecList supRecs = sup.ERecs;

  // Loop over records
  for (int j = 0; j < supRecs.Count; j++) {

    // Get record
    ERec rec = (ERec)supRecs[j];

    // Get property map for record
    PropertyMap propsMap = rec.Properties;

    // Get value of the Name prop from the current record
    String name = (String)propsMap["Name"];
  }

  // Set target link using the first navigation reference
  NavigationRefsList nrl = sup.NavigationRefs;
  DimValList dvl = (DimValList)nrl[0];

  // Loop over dimension values to build a new target query
  String newNavParam = "";
  for (int k = 0; k < dvl.Count; k++) {
    DimVal dv = (DimVal)dvl[k];

    // Add delimiter and id
    newNavParam += " " + dv.Id;
  }
  ...


Sample COM Code

You can use the following sample COM code to assist you in extracting rule results from Supplement objects.

' Get Supplement list
dim sups
set sups = nav.GetSupplements()

' Loop over Supplement objects
For i = 1 to sups.Count

  ' Get Supplement object
  dim sup
  set sup = sups(i)

  ' Get properties
  dim supPropMap
  set supPropMap = sup.GetProperties()

  ' Check if Supplement object is merchandising or
  ' content spotlighting
  if ((supPropMap.Get("DGraph.SeeAlsoMerchId") <> "") and _
      (supPropMap.Get("Style") <> "") and _
      (supPropMap.Get("Zone") <> "") and _
      (Request.QueryString("hideMerch") = "")) then


    ' Get record list
    dim supRecs
    set supRecs = sup.GetERecs()

    ' Loop over records
    For j = 1 to supRecs.Count

      ' Get record
      set rec = supRecs(j)

      ' Get property map for record
      set propsMap = rec.GetProperties()

      ' Get value of the Name prop from the current record
      name = propsMap.Get("Name")

    next

    ' Set target link using the first navigation reference
    set nrl = sup.GetNavigationRefs()
    dim dvl
    set dvl = nrl(0)

    ' Loop over dimension values to build a new target query
    newNavParam = ""
    For k = 1 to dvl.Count
      dim dv
      set dv = dvl(k)

      ' Add delimiter if necessary
      if (newNavParam <> "") then
        newNavParam = newNavParam & " "
      end if

      ' Add id
      newNavParam = newNavParam & dv.GetId()

    next

    ' Get specific rule properties
    style = supPropMap.Get("Style")
    title = supPropMap.Get("Title")
    zone = supPropMap.Get("Zone")
    customText = supPropMap.Get("CustomText")

  end if
next

Adding Web Application Code to Render Rule Results

In addition to the Web application code that extracts rule results from Supplement objects, you must also add application code to render the rule results on screen. (Rendering is the process of converting the rule results into displayable elements in your Web application pages.)

Rendering rule results is a Web application-specific development task. The reference implementations come with three arbitrary styles of rendering business rule results, but most applications require their own custom development, which is typically keyed on the Title, Style, Zone, and other custom properties. For details, see “Adding Web Application Code to Extract Rule Results” on page 293.

Grouping Rules

Rule groups are a third, optional construct that complements zones and styles in supporting dynamic business rules. Rule groups serve two functions:


• They provide a means to logically organize rules into categories to facilitate creating and editing rules.

• They allow multiple business users to access Web Studio simultaneously.

A rule group provides a means to organize a large number of rules into smaller logical categories, which usually affect distinct (non-overlapping) parts of a Web site. For example, a retail application might organize rules that affect the electronics and jewelry portions of a Web site into a group for Electronics Rules and another group for Jewelry Rules.

A rule group also enables multiple business users to access Web Studio simultaneously. Each Web Studio user can access a single rule group at a time. Once a user selects a rule group, Web Studio prevents other users from editing that group until the user returns to the selection list or closes the browser window.

To create a rule group:

1. In the Project Explorer window, expand Dynamic Business Rules.

2. Double-click Rules. This opens the Rules view and also activates the Rule Group menu on the menu bar.

3. From the Rule Group menu, select New.

4. Type a unique name in the Group name field. Use only alphanumeric, dash, or underscore characters.

5. To select a rule for this group, highlight a rule in the “All rules” list and click Add. The rule appears in the “Rules in group” list. (If this is the first group you created, all the rules are moved to the “Rules in group” list and the Remove button is inactive.)

Note: A rule can belong to only one rule group. Adding a rule to a group removes it from any group to which it previously belonged.

6. Click OK. The new rule group appears in the Rules view.

7. Repeat steps 3 through 6 if you want multiple rule groups in your project.

8. To change the priority of a rule within a group, select the rule in the “Rules in Group: Name” column and click either the Up or Down arrow buttons.

Deleting a Rule Group

You can delete rule groups from your project as necessary.

To delete a rule group:

1. In the Rules view’s “Rule groups: Name” column, select a rule group and then click Delete.

2. When the confirmation message appears, click Yes.

3. If your project contains at least one other rule group, the Select Rule Group dialog box appears. In the drop down list, select the rule group you want to move the rules in this group to, and click OK.

This dialog box appears when the rule group you delete is not the last one in the project. If the rule group you are deleting is the only rule group, Developer Studio lists the rules under the All rules heading.

Prioritizing Rule Groups

In the same way that you can modify the priority of a rule within a group, you can also modify the priority of a rule group with respect to other rule groups.

The Navigation Engine evaluates rules first by group order, as shown in the Rules view of Developer Studio or Web Studio, and then by their order within a given group. For example, if Group_B is ordered before Group_A, the rules in Group_B will be evaluated first, followed by the rules in Group_A. Rule evaluation proceeds in this way until a zone’s Rule Limit value is satisfied.

This relationship is shown below. Suppose zone 1 has a Rule Limit setting of 2. Because group B is ordered before group A, rules 1 and 2 satisfy the Rule Limit rather than rules 4 and 5.

Group B
  Rule 1, Zone 1
  Rule 2, Zone 1
  Rule 3, Zone 2

Group A
  Rule 4, Zone 1
  Rule 5, Zone 1
  Rule 6, Zone 2
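The selection described above can be sketched as follows. This is a simplified stand-in model, not the Endeca API: rules are scanned in group-first priority order, and only the first rules that fit the zone's limit are kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RuleLimitDemo {

    // Scan rules in their (group-first) priority order and keep at
    // most 'limit' rules for the given zone.
    static List<String> pickForZone(List<String[]> rulesInOrder,
                                    String zone, int limit) {
        List<String> picked = new ArrayList<>();
        for (String[] rule : rulesInOrder) {   // rule = {name, zone}
            if (picked.size() == limit) break;
            if (rule[1].equals(zone)) picked.add(rule[0]);
        }
        return picked;
    }

    // The ordering from the example: group B's rules precede group A's.
    static List<String[]> exampleRules() {
        return Arrays.asList(
            new String[]{"Rule 1", "Zone 1"}, new String[]{"Rule 2", "Zone 1"},
            new String[]{"Rule 3", "Zone 2"}, new String[]{"Rule 4", "Zone 1"},
            new String[]{"Rule 5", "Zone 1"}, new String[]{"Rule 6", "Zone 2"});
    }

    public static void main(String[] args) {
        // With a Rule Limit of 2, zone 1 is filled by rules 1 and 2,
        // not rules 4 and 5.
        System.out.println(pickForZone(exampleRules(), "Zone 1", 2));
    }
}
```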


To prioritize rule groups:

1. In the Rules view, select a group whose priority you want to change in the “Rule groups: Name” column.

2. Click the Up or Down buttons to move the group to the desired position.

If you want to further prioritize the rules within a particular rule group, see “Prioritizing Rules” on page 290.

Interaction Between Rules and Rule Groups

When creating or editing rule groups, keep in mind the following interactions between rules and rule groups:

• Rules may be moved from one rule group to another. However, a rule can appear in only one group.

• A rule group may be empty (that is, it does not have to contain rules).

• The order of rule groups with respect to other rule groups may be changed.

Performance Considerations for Dynamic Business Rules

Dynamic business rules require very little data processing or indexing, so they do not impact Forge performance, Dgidx performance, or the Navigation Engine memory footprint.


However, because the Navigation Engine evaluates dynamic business rules at query time, rules affect the response-time performance of the Navigation Engine. The larger the number of rules, the longer the evaluation and response time. Evaluating more than twenty rules per query can have a noticeable effect on response time. For this reason, you should monitor and limit the number of rules that the Navigation Engine evaluates for each query.

In addition to large numbers of rules slowing performance, query response time is also slower if the Navigation Engine returns a large number of records. You can minimize this issue by setting a low value for the Maximum Records setting in the Style editor for a rule.

Rules without Explicit Triggers

Dynamic business rules without explicit triggers also affect response time performance because the Navigation Engine evaluates the rules for every navigation query.

Using an Agraph and Dynamic Business Rules

To implement dynamic business rules when you are using the Agraph, keep in mind the following points:

• Using dynamic business rules with the Agraph affects performance if you are using zones configured with “Unique by this dimension/property” combined with a high setting for the maximum number of records or a large number of rules. To avoid response time problems, you may need to reduce the number of rules, reduce the maximum number of records that can be returned, or abandon uniqueness.

• All Dgraphs serving one Agraph must share the same set of dynamic business rules. To ensure this, it is necessary to update their configurations synchronously in the Endeca Manager, by running Dgidx with the --keepcats flag.

• If you update your Dgraphs with dynamic business rule changes using Developer Studio or Web Studio, and a request comes to the Agraph while the update is in progress, the Agraph issues a fatal error similar to the following:

[Thu Jun 24 16:26:29 2004] [Fatal] (merchbinsorter.cpp::276) - Dgraph 1 has fewer rules fired.

As long as the Agraph is running under the Endeca JCD, the JCD automatically restarts the Agraph. No data is lost. However, end-users will not receive a response to requests made during this short time. This problem has little overall impact on the system, because business rule updates are quick and infrequent. Nevertheless, Endeca recommends that you shut down the Agraph during business rule updates. To shut down the Agraph, go to a CMD prompt on Windows or a shell prompt on UNIX and type:

GET 'http://HOST:PORT/admin?op=exit'

where HOST is the machine running the Agraph and PORT is the port number of the Agraph. GET is a Perl utility, so be sure the Perl binaries are in your system path variable.
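If Perl's GET is unavailable, any HTTP client can issue the same request. The following is a minimal Java sketch; the host and port values are placeholders you must substitute, and the actual request is commented out so the sketch runs without a live Agraph.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class AgraphShutdown {

    // Build the admin URL that tells an Agraph to exit.
    static String exitUrl(String host, int port) {
        return "http://" + host + ":" + port + "/admin?op=exit";
    }

    public static void main(String[] args) throws Exception {
        String url = exitUrl("localhost", 8000);
        System.out.println(url);

        // To actually issue the request against a running Agraph:
        // HttpURLConnection conn =
        //     (HttpURLConnection) new URL(url).openConnection();
        // conn.getResponseCode();
    }
}
```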


Applying Relevance Ranking to Rule Results

In some cases, it is a good idea to apply relevance ranking to a rule’s results. For example, if a user performs a record search for Mondavi, the results in the Highly Rated rule can be ordered according to their relevance ranking score for the term Mondavi. In order to create this effect, there are three requirements:

• The navigation query that is triggering the rule must contain record search parameters (Ntt and Ntk). Likewise, the zone that the rule is assigned to must be identified as Valid for search. (Otherwise, the rule will not be triggered.)

• The rule’s target must be marked to Augment Navigation State.

• The rule must not have any sort parameters specified. If the rule has an explicit sort parameter, that parameter overrides relevance ranking. Sort parameters for a rule are set on the General tab of the Rule editor.

If these three requirements are met, then the relevance ranking rules specified with Navigation Engine startup options are used to rank specific business rules when triggered with a record search request (a keyword trigger).


About Overloading Supplement Objects

Recall that dynamic business rule results are returned to an application in Supplement objects. Each rule that returns results does so via a single Supplement object for that rule. However, not all Supplement objects contain rule results.

Supplement objects are also used to support “Did You Mean” suggestions, record search reports, and so on. In other words, a Supplement object can act as a container for a variety of features in an application. One Supplement object instance cannot contain results for two features; for example, one Supplement object cannot contain both rule results and “Did You Mean” suggestions. For that reason, if you combine dynamic business rules with these additional features, you should check each Supplement object for specific properties, such as DGraph.SeeAlsoMerchId, to identify which Supplement objects contain rule results.
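The property check can be sketched as follows, using a plain java.util.Map to stand in for the Supplement's PropertyMap. The property values shown are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class SupplementCheck {

    // A Supplement carries rule results only if the rule-specific
    // properties are present; Supplements used for other features
    // (such as "Did You Mean" suggestions) lack them.
    static boolean isRuleResult(Map<String, Object> props) {
        return props.get("DGraph.SeeAlsoMerchId") != null
                && props.get("Style") != null
                && props.get("Zone") != null;
    }

    public static void main(String[] args) {
        Map<String, Object> ruleProps = new HashMap<>();
        ruleProps.put("DGraph.SeeAlsoMerchId", "42");
        ruleProps.put("Style", "SidebarStyle");
        ruleProps.put("Zone", "RightZone");

        System.out.println(isRuleResult(ruleProps));       // rule results
        System.out.println(isRuleResult(new HashMap<>())); // another feature
    }
}
```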


Chapter 16

Implementing User Profiles

A user profile is a character-string-typed name that identifies a class of end users. User profiles enable applications built on the Endeca Navigation Platform to tailor the content displayed to an end user based on that user’s identity.

User profiles can be used to trigger dynamic business rules, where such rules are optionally constructed with an additional trigger attribute corresponding to a user profile. The Endeca Navigation Platform can accept information about the end user, and use that information to trigger pre-configured rules and behaviors.

This chapter discusses how to create user profiles and then implement them as dynamic business rule triggers. Before reading further, make sure you are comfortable with the information in the Dynamic Business Rules chapter.

Note: Each business rule is allowed to have at most one user profile trigger.


Profile-Based Trigger Scenario

In the following scenario, an online clothing retailer wants to set up a dynamic business rule that says: “For young women, who are browsing stretch t-shirts, also recommend cropped pants.” We follow the shopping experience of a customer named Jane.

In order to set up this rule, a few configuration steps are necessary:

1. In Endeca Developer Studio, the retailer creates a user profile called young_woman, which corresponds to the set of customers who are female and are between the ages of 16 and 25.

2. In Endeca Web Studio, a dynamic business rule that uses the profile as a trigger is created:

young_woman X DVAL(stretch t-shirt) => DVAL(cropped pants)

No complex Boolean logic programming is necessary here. The business user simply selects a user profile from a set of available profiles to create the business rule.

3. In the Web application that’s driving the customer’s experience, there needs to be logic that identifies the user and tests to see if he or she meets the requirements to be classified as a young_woman. Alternatively, the profile young_woman may already be stored along with Jane’s information (such as age, address, and income) in a database or LDAP server.


The user’s experience would go something like this:

1. Jane accesses the clothing retailer’s Web site and is identified by a cookie on her computer. By looking up a few database tables, the application knows that it has interacted with her before. The database indicates that she is 19 years old and female.

At this point, the database may also indicate the user profiles that she belongs to: young_woman, r_and_b_music_fan, college_student. Alternatively, the application logic may test against her information to see which profiles she belongs to, as follows: “Jane is between 16 and 25 years old and she is female, so she belongs in the young_woman profile.”
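The profile-assignment test described above might look like the following sketch. The attribute names and thresholds are illustrative; the one firm requirement is that the profile strings exactly match those configured in Developer Studio.

```java
import java.util.HashSet;
import java.util.Set;

public class ProfileAssigner {

    // Derive Endeca user profile names from a user's attributes.
    // The names returned must exactly match profiles configured
    // in Developer Studio.
    static Set<String> profilesFor(int age, String gender) {
        Set<String> profiles = new HashSet<>();
        if (gender.equals("female") && age >= 16 && age <= 25) {
            profiles.add("young_woman");
        }
        return profiles;
    }

    public static void main(String[] args) {
        // Jane is 19 years old and female, so she belongs in
        // the young_woman profile.
        System.out.println(profilesFor(19, "female"));
    }
}
```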

2. As Jane is browsing the site, the Endeca Navigation Engine is driving her catalog experience. As each query is being sent to the Endeca Navigation Engine, it is augmented with user profile information. Here is some sample Java code:

profileSet.add("young_woman");
eneQuery.setProfiles(profileSet);

3. As Jane clicks on a stretch t-shirt link, the Endeca Navigation Engine realizes that a dynamic business rule has been triggered: young_woman X DVAL(stretch t-shirt). Therefore, it returns a cropped pants record in one of the dynamic business rule zones.

4. Jane sees a picture of cropped pants in a box labeled, “You also might like...”


Developer Studio Implementation

You set up user profiles in Developer Studio. Both Developer Studio and Web Studio allow a user profile to be associated with a business rule’s trigger.

To set up a user profile in Developer Studio:

1. In the Project Explorer tab, double-click User Profiles. The User Profiles view appears.

2. Click New. The New User Profile editor appears.

3. In the Name text box, type a unique name for this user profile and click OK. The new user profile is added to the User Profiles view.

To assign a user profile as a business rule trigger in Developer Studio:

1. In the Project Explorer, click Dynamic Business Rules to expand it, and then double-click Rules to open the Rules view.

2. Select the rule you want to apply the trigger to, and then click Edit. The Rule editor appears.

3. Click the Triggers tab. In the User Profile list, select a profile. (You may also specify a dimension trigger and/or a keyword trigger in the Triggers tab.)

4. Click OK. The user profile information that you added to the rule now appears in the Rules view.


To assign a user profile as a business rule trigger in Web Studio:

1. Log on to Web Studio and click Rule Manager to display the Rule Manager page.

2. In the Rule List, click the rule you want to apply the user profile to.

3. Click the Edit Where and What tab.

4. In the User Profile list, select a user profile to use as a business rule trigger. There can only be one user profile trigger per rule.

User Profile Query Parameters

There are no URL ENE query parameters associated with user profiles. In many live application scenarios, the URL query is exposed to the end user, and it is usually not appropriate for end users to see or change the user profiles with which they have been tagged.

Objects and Method Calls

In the following code samples, the application recognizes the end user as Jane Smith, looks up some database tables and determines that she is 19 years old, female, a college student and likes R&B music. These characteristics map to the following Endeca user profiles created in Endeca Developer Studio:

• young_woman

• r_and_b_music_fan


• college_student

User profiles can be any string. The user profiles supplied to ENEQuery must exactly match those configured in Endeca Developer Studio.

Java Code Example

// User profiles can be any string. The user profiles
// supplied to ENEQuery must exactly match those
// configured in Endeca Developer Studio.
// Make sure you have the following import statement at
// the top of your file:
// import java.util.*;

Set profiles = new HashSet();

// Collect all the profiles into a single Set object.
profiles.add("young_woman");
profiles.add("r_and_b_music_fan");
profiles.add("college_student");

// Augment the query with the profile information.
eneQuery.setProfiles(profiles);


.NET C# Code Example

// Make sure you have the following statement at the top
// of your file:
// using System.Collections.Specialized;

StringCollection profiles = new StringCollection();

// Collect all the profiles into a single
// StringCollection object.
profiles.Add("young_woman");
profiles.Add("r_and_b_music_fan");
profiles.Add("college_student");

// Augment the query with the profile information.
eneQuery.Profiles = profiles;

COM Code Example

' Create a zero-based string array that will hold the
' profiles
Dim profiles(3)

' Add each profile as a string
profiles(0) = "young_woman"
profiles(1) = "r_and_b_music_fan"
profiles(2) = "college_student"

' Augment the query with the profile information
eneQuery.SetProfiles(profiles)


Performance Impact of User Profiles

An application using this feature may experience additional memory costs due to user profiles being set in an ENEQuery object. In addition, the application may require additional ENEConnection.query() response time, because the Navigation Engine must do additional work to receive profile information and check if business rules fire. However, in typical application scenarios that set one to five user profile strings of at most 20 characters in the ENEQuery object, the performance impact is insignificant.


Chapter 17

Implementing Partial Updates

This chapter describes how to implement partial updates in your deployment.

About Partial Updates

The Endeca Navigation Engine processes two types of source data transformations:

• Baseline update

• Partial update

A baseline update (also called a full update) is a complete re-index of the entire data set. Baseline updates occur infrequently, usually once per day or once per week. They typically involve the customer generating an extract from their database system and making the files accessible either on an FTP server or on the indexing server. This data is processed by Forge and Dgidx, and is then finally made available through the Navigation Engine.

A partial update is a much smaller change in the overall data set. Partial updates affect a small percentage of the total records in the system, and therefore occur much more frequently. They consist of a much smaller extract from the customer’s database and contain volatile information. For example, the price and availability of products on a retail store site are usually volatile.

The partial update capability of the Endeca Navigation Engine allows it to receive and process changes to its data without reprocessing the baseline data, thus allowing it to process the updates in a short amount of time and continue serving requests.

Implementing partial updates requires a separate pipeline to process the partial updates and an additional command-line flag when starting the Navigation Engine. These requirements are explained in detail in this chapter.

You can implement partial updates by using one of three methods:

• Using a control script, as documented below.

• Using the emgr_update utility to programmatically run a partial update via the Endeca Manager. For details, see the Endeca Tools Guide.

• Writing a client application program, using the Endeca Data Indexing API. For details, see the Endeca Data Indexing API Guide.

Note: Partial updates cannot be run from the Developer Studio or Web Studio.


Partial Update Capabilities

The Endeca software supports updates that allow you to:

• Add auto-generated dimension values to existing dimensions.

• Add an entirely new record with a new set of property values and dimension values.

• Delete a record. The dimension values with which the record is tagged are not removed from the system, even if there are no other records tagged with the same dimension values.

• Replace an existing record with an entirely new set of property values and dimension values. Again, dimension values no longer associated with any records remain in the system.

• Update an existing record, selectively adding and removing dimension and property values. Specifically it is possible to:

• Add property values to a record.

• Remove all property values of a property from a record.

• Add dimension values to a record.

• Remove specific dimension values from a record.

• Remove all dimension values of a dimension from a record.


Partial Updates Reference Implementation

A data reference implementation for partial updates is included with the installation in these default locations:

• On UNIX:

$ENDECA_REFERENCE_DIR/sample_updates_data

• On Windows:

%ENDECA_REFERENCE_DIR%\sample_updates_data

This implementation contains the components for partial updates, including a baseline update pipeline, a partial update pipeline, a control script, and source data files. The descriptions in the rest of this chapter assume the use of this reference implementation.

Note: The reference implementation is for a single-Dgraph deployment. If you have an Agraph deployment, see the section “Partial Updates in Agraph Implementations” on page 355.

Baseline Pipeline Restrictions

Forge processing for baseline updates and partial updates is done in separate pipelines, but is coordinated and synchronized. Because baseline updates have loose restrictions on both time and computational resources, they can execute complex business logic and large joins. The resources available to process partial updates are much more tightly restricted, so resource-intensive tasks (such as large joins and complex business logic) should be avoided if possible.


The required coordination between the baseline and partial update pipelines, coupled with the resource restrictions on the partial update pipeline, impose constraints on the baseline update pipeline:

• Properties that are the subject of partial updates, and properties that are the basis for classifications that are the subject of partial updates, should not be assembled onto the records in large joins. Loading the join table takes time, and the partial update pipeline may not be able to load the table in the time required to perform the partial update.

• Records produced by the baseline update pipeline must be identified in the partial update pipeline by a single property that is unique for each record. “Record Specification Attribute” on page 340 describes this requirement in detail.

Dimensions affected by partial updates must be loaded in the baseline update pipeline. Loading large dimensions increases the startup time for the partial update pipeline. Therefore, the size of dimensions affected by partial updates should be kept small.

Baseline updates must be comprehensive: all of the information in all the partial updates since the previous baseline update must be incorporated in the new baseline update. If there are multiple types of updates being performed, all of the updates of all the different types must be included.

Baseline updates must not overlap. A new baseline cannot be started until processing of the prior baseline has been completed (completed means that the baseline update has been loaded into a Navigation Engine and updates have started being processed against it).

To avoid ambiguous behavior and spurious errors, we suggest that partial updates not be extracted while a baseline data extraction is in progress.

Creating a Partial Update Pipeline

A partial update requires its own pipeline (separate from the baseline update pipeline) that only deals with partial updates. Use Developer Studio to create the partial update pipeline, because Developer Studio can open both pipeline files at the same time.

Each input record in a partial update pipeline describes a transformation to be performed on a single record in the running application. This means, for example, that a single update cannot change the spelling of a property on many records; instead, a separate update must be generated to change the spelling on each record in the application.
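Because each input record maps to exactly one record transformation, a bulk change must be fanned out into one update row per affected record. A minimal Python sketch of that fan-out, assuming the pipe-delimited Update format described later in this chapter (the record IDs and property names here are invented examples, not part of the Endeca API):

```python
def fan_out_updates(record_ids, prop_name, new_value):
    """Expand one logical bulk change into one Update row per record.

    Assumes the pipe-delimited Update format used later in this
    chapter: an Update flag column, a record-spec column, then the
    property value being changed.
    """
    header = f"Update|P_WineID|{prop_name}|"
    rows = [f"1|{rid}|{new_value}|" for rid in record_ids]
    return [header] + rows

# One spelling fix across three records becomes three separate updates.
lines = fan_out_updates([34701, 34702, 34703], "P_Wine", "Albarino Rias Baixas")
```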


The reference implementation’s partial update pipeline components are shown in the Pipeline Diagram in Developer Studio.


The partial update pipeline is executed at frequent intervals. Between runs, updates are queued. When the partial update process starts, all the queued updates are processed and written to a staging area. When the Forge run completes, the updates are read from the staging area into the running application.

The partial update pipeline in the sample_updates_data reference implementation works as follows:

1. The partial update pipeline reads its input, using a record adapter (named LoadUpdateData) with the Multi File field checked.

2. The input records are transformed into record updates by a record manipulator (named UpdateManipulator) using IF and UPDATE_RECORD expressions.

3. The record updates are written out using an update adapter.

The following sections describe the creation of the individual components of the pipeline.

Creating the Record Adapter

When you create the record adapter in Developer Studio, the General tab of the Record Adapter editor must have these basic settings:

• Direction – Must be Input.

• URL – Enter an input URL as a path, with the filename being a pattern. For example, a URL pattern of ../incoming/updates/*.txt.gz means that Forge will read any file that has the txt.gz suffix in the sample_updates_data/data/incoming/updates directory. Each file that matches the pattern will be read in sequence.

• Multi File – Check this box to specify that Forge can read data from more than one input file and that the input URL is to be interpreted as a pattern.

You can leave the other tabs (Sources, Record Index, and so on) in their default state.
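To preview which files a Multi File URL pattern such as *.txt.gz would pick up, you can mimic the matching locally. A sketch using Python's fnmatch, with the result sorted to reflect Forge's lexicographic reading order (the filenames are invented):

```python
import fnmatch

def matching_files(filenames, pattern):
    """Return the files a URL pattern would match, in the
    lexicographic order in which Forge reads them."""
    return sorted(f for f in filenames if fnmatch.fnmatch(f, pattern))

files = ["adds.txt.gz", "readme.txt", "updates.txt.gz", "deletes.txt.gz"]
# Only the .txt.gz files match, read in sorted (lexicographic) order.
result = matching_files(files, "*.txt.gz")
```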

Creating the Record Manipulator

When creating the record manipulator, the Sources tab of the Record Manipulator editor must have the following settings:

• Record source – Select the name of the property mapper.

• Dimension source – Select None.

• You can leave the Record Index tab empty.

The Expression editor is where you add the expressions described below. You open the Expression editor by double-clicking the record manipulator component in the Pipeline Diagram. You can add expressions after the record manipulator is created.


IF Expression

The record manipulator is essentially an IF expression that calls one of three UPDATE_RECORD expressions based on a conditional evaluation of the incoming record. The logic of the IF expression is as follows:

IF the incoming record has a "Remove" field equal to "1"
THEN delete the record
(i.e., call UPDATE_RECORD with an ACTION of "DELETE_OR_IGNORE")
ELSE_IF the incoming record has an "Update" field equal to "1"
THEN update the record
(i.e., call UPDATE_RECORD with an ACTION of "UPDATE")
ELSE add the record
(i.e., call UPDATE_RECORD with an ACTION of "ADD_OR_REPLACE")

Other expressions (such as MATH and PROPERTY) are used in evaluating the incoming record for the value of the record's update property.

You can see the entire IF expression (with its three UPDATE_RECORD expressions) by opening the reference implementation's record manipulator in the Expression editor.

If you want to modify the UPDATE_RECORD expressions, the next section provides more details on their usage.

UPDATE_RECORD Expression

The UPDATE_RECORD expression updates existing records by adding, removing, or replacing dimensions, dimension values, or property values. The expression can also delete existing records and add new ones.


If different types of partial updates are processed using separate pipelines, the UPDATE_RECORD expression can be written to perform the same action on all of the input.

For example, a partial update pipeline written to handle only price and availability changes would always generate UPDATE-type record updates. If the same partial update pipeline needs to handle REPLACE updates (that is, reclassification of a record), the input data must contain some indication of what type of update to perform. Most commonly, this will simply be a property on the input record, which is checked inside an IF expression.

The UPDATE_RECORD expression takes a snapshot of the record at the time it is evaluated and generates a corresponding record update. Thus, the update contains the property names and values, as well as classifications, that are in effect at the time of evaluation. If properties are renamed, have their values changed, or classifications are added or deleted after the record update expression has been evaluated, the changes have no impact on the record update that will be generated. Only one record update can be generated per record.
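The snapshot behavior can be pictured as a deep copy taken at evaluation time; later changes to the working record have no effect on the update that was already generated. A language-neutral illustration in Python (a model of the semantics, not Endeca code):

```python
import copy

def generate_update(record):
    """Model UPDATE_RECORD's snapshot semantics: the generated update
    captures the record's properties exactly as they are at the
    moment the expression is evaluated."""
    return copy.deepcopy(record)

record = {"P_Wine": "Albarino", "P_PriceStr": ["18.00"]}
update = generate_update(record)      # snapshot taken here
record["P_PriceStr"] = ["1000.00"]    # a later change...
# ...does not alter the update that was already generated.
```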

Note the following:

• For ADD record updates, a complete record must be set up before the expression is evaluated.

• For REPLACE record updates, all the necessary property values and dimension values (as well as the property specifying the RECORD_SPEC) must be on the record.


• For ADD_OR_REPLACE record updates, if no record exists with the specified property value for the property that has been designated as the RECORD_SPEC, the system adds a new record; if the record exists, it is replaced.

• For DELETE record updates, the RECORD_SPEC property must be on the record. This property is used to identify the record to be deleted. All other properties and dimension values are ignored.

• For DELETE_OR_IGNORE record updates, if a record exists with the specified property value for the property that has been designated as the RECORD_SPEC, the system removes the record; if the record does not exist, the action is ignored and no error message is generated.

• For UPDATE record updates, further specification is necessary to describe how to handle the property values and dimension values on the record. UPDATE-type record updates must also include the RECORD_SPEC property with each record. Each property or dimension can have only one type of update performed, but a single record update may impact any or all of the properties and dimensions on a record.


The following table lists the expression nodes that are supported by the UPDATE_RECORD expression.

EXPRNODE Name Description

ACTION The type of action to perform on the record, as indicated by the VALUE attribute. Valid values for this attribute are:

• ADD – Adds a new record if it does not exist, or generates an error message if it already exists.

• ADD_OR_REPLACE – Adds a new record if it does not exist, or replaces it if it already exists.

• REPLACE – Replaces a record if it exists, or generates an error message if it does not exist.

• DELETE – Removes a record if it exists, or generates an error message if it does not exist.

• DELETE_OR_IGNORE – Removes a record if it exists, or does not generate an error message if it does not exist.

• UPDATE – Updates a record if it exists, or generates an error message if it does not exist.

Examples:

<EXPRNODE NAME="ACTION" VALUE="UPDATE"/>

<EXPRNODE NAME="ACTION" VALUE="ADD_OR_REPLACE"/>


PROP_ACTION If ACTION=UPDATE, the VALUE attribute specifies the type of update to perform on all the values of the named property. Valid values for this attribute are as follows:

• ADD – All values for the property on the update record are added to the current record.

• DELETE – All values for the property on the update record are removed from the current record.

• REPLACE – All values for the property are removed from the current record, then all values for the property on the update record are added to the current record.

This node must be followed by a PROP_NAME expression node that names the property to be modified. Example:

<EXPRNODE NAME="PROP_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="PROP_NAME" VALUE="P_WineType"/>

DIM_ACTION If ACTION=UPDATE, the VALUE attribute specifies the type of update to perform on all the values of the named dimension. Valid values for this attribute are as follows:

• ADD – All dimension values in the dimension on the update record are added to the current record.

• DELETE – All dimension values in the dimension on the update record are removed from the current record.

• REPLACE – All dimension values in the dimension are removed from the current record, then all dimension values in the dimension on the update record are added to the current record.

This node must be followed by a DIMENSION_ID expression node that names the dimension to be modified. Example:

<EXPRNODE NAME="DIM_ACTION" VALUE="ADD"/>
<EXPRNODE NAME="DIMENSION_ID" VALUE="8000"/>


DVAL_ACTION If ACTION=UPDATE, removes the dimension value from the record. Note that the VALUE attribute only supports DELETE. This node must be followed by a DVAL_ID expression node that names the dimension value to be removed. Example:

<EXPRNODE NAME="DVAL_ACTION" VALUE="DELETE"/>
<EXPRNODE NAME="DVAL_ID" VALUE="P_PriceStr"/>

Examples of UPDATE_RECORD Expressions

Example 1: An expression configured to convert input records to ADD_OR_REPLACE record updates:

<EXPRESSION TYPE="VOID" NAME="UPDATE_RECORD">
<EXPRNODE NAME="ACTION" VALUE="ADD_OR_REPLACE"/>
</EXPRESSION>

Example 2: An expression configured to convert input records to replace the Price property, and the price range and availability classifications:

<EXPRESSION TYPE="VOID" NAME="UPDATE_RECORD">
<EXPRNODE NAME="ACTION" VALUE="UPDATE"/>
<EXPRNODE NAME="PROP_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="PROP_NAME" VALUE="Price"/>
<EXPRNODE NAME="DIM_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="DIMENSION_ID" VALUE="100"/><!--100=PriceRange-->
<EXPRNODE NAME="DIM_ACTION" VALUE="REPLACE"/>
<EXPRNODE NAME="DIMENSION_ID" VALUE="200"/><!--200=Availability-->
</EXPRESSION>


UPDATE_RECORD Errors

• If ACTION is not one of ADD, ADD_OR_REPLACE, REPLACE, DELETE, DELETE_OR_IGNORE, or UPDATE.

• If ACTION is ADD and a record with that specification already exists. In this case, the record to be added is skipped instead of replacing the existing record. Use an ACTION of ADD_OR_REPLACE to add a record if it does not exist or replace it if it does.

• If ACTION is UPDATE and a record with that specification does not exist. In this case, the record to be updated is skipped.

• If ACTION is UPDATE and a sub-action is not specified.

• If ACTION is not UPDATE and a sub-action is specified.

• If ACTION is DELETE and a record with that specification does not exist. In this case, the record to be deleted is skipped and an error message is generated. Use an ACTION of DELETE_OR_IGNORE to suppress the error message if the record does not exist.

• If more than one sub-ACTION (such as DVAL_ACTION) is specified for a given property, dimension, or dimension value.

Format of Update Records

The UPDATE_RECORD expression, as used in the partial updates reference implementation, requires that each incoming record have one of the Delimited formats described below.


Format of Records to Be Deleted

Remove|P_WineID|P_Year|P_Wine|P_Winery|...|
1|34699|1992|A Red Blend Alexander Valley|Lyeth|...|

The first column in the header row must be a Remove column. The first column in each record must have a value of 1 to delete the record.

Format of Records to Be Updated

Update|P_WineID|P_Wine|P_PriceStr|
1|34701|Albarino Rias Baixas|1000.00|

The first column in the header row must be an Update column. The first column in each record must have a value of 1 to update the record properties.

Format of Records to Be Added

P_WineID|P_Year|P_Wine|P_Winery|P_PriceStr|...|
99000|1992|First New Wine Added|Lyeth|18.00|...|

The header row of records to be added does not begin with a Remove or Update column. Instead, it uses the normal set of header row columns (P_WineID, P_Year, and so on). The first column in each record has a normal property value.
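The three layouts above can be told apart by the first header column. A hedged sketch of such a classifier for pipe-delimited update files (the function is invented for illustration; in a real pipeline this decision is made by the IF/UPDATE_RECORD expressions shown earlier):

```python
def classify_update_file(lines):
    """Inspect the header row of a pipe-delimited update file and
    report which UPDATE_RECORD action its records call for."""
    header = lines[0].split("|")
    if header[0] == "Remove":
        return "DELETE_OR_IGNORE"
    if header[0] == "Update":
        return "UPDATE"
    # A normal header row (plain property columns) means new records.
    return "ADD_OR_REPLACE"

action = classify_update_file(["Update|P_WineID|P_PriceStr|",
                               "1|34701|1000.00|"])
```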

Format of Records in Your Implementation

If your implementation uses Delimited format records, you can use the above formats to specify how the records are handled. If you use another format, you must use a record manipulator with the appropriate expressions to handle your source records.

Creating the Update Adapter

The update adapter is the component that writes out the record file (or files) that define the new, deleted, or modified records. The Update Adapter editor must have at least these settings:

• Output URL (General tab) – Enter the directory to which Forge writes the partial update files and processed records. The path is either an absolute path or a path relative to the location of the partial update pipeline file. With an absolute path, the protocol must be specified in RFC 2396 syntax (typically, this means the prefix file:/// precedes the path to the data file). Relative URLs must not specify the protocol.

• Output prefix (General tab) – Enter the filename prefix (such as wine) for the Forge output files. Use the same prefix as in the indexer adapter for the baseline update pipeline.

• Filter unknown properties (General tab) – Set this so it matches the Filter Unknown Properties setting in the indexer adapter of the baseline update pipeline.

• Record source (Sources tab) – Select the name of the record manipulator.

• Dimension sources (Sources tab) – Select the name of the dimension server. You need a dimension source if you are updating dimensions.


• Enable Agraph support (Agraph tab) – Set this so it matches the Agraph tab settings in the indexer adapter of the baseline update pipeline.

Dimension Components

The partial updates pipeline in the sample_updates_data reference implementation contains two dimension adapters and one dimension server.

Dimension Adapters

To support classification, the same dimensions that are loaded in the baseline update pipeline must be loaded in the partial update pipeline. To cut down on startup time, the dimensions can be split into multiple files, and only the dimensions actually used by the partial update pipeline need to be loaded. In the baseline update pipeline, multiple dimension adapters can feed into the same dimension server to consolidate the separate dimension files.

The sample_updates_data reference implementation uses two dimension adapters, one for the dimensions.xml file and the other for the winetype_dimension.xml file. For both dimension adapters, the Dimension Source field (on the Sources tab) is set to None.


Dimension Server

The dimension server uses the two dimension adapters as sources.

The URL field (General tab) specifies the location to which the autogen_dimensions.xml.gz file is written. This file contains persistent dimension data produced by auto-generation.

There are some special considerations when using AutoGen classification with partial updates. When new dimension values are generated in the partial updates pipeline, the dimension changes are included in the updates sent to the Navigation Engine.

Because the baseline and partial update pipelines share the same autogen file, changes to AutoGen dimensions are also shared between the two. However, at any given time, only one of the two update processes can modify the Autogen_dimensions file.

Rather than suspend partial updates during baseline updates, Forge supports the --noAutoGen command-line option, which turns off the creation of new dimension values. Classification with existing dimension values continues normally, but classification failures result in no matching dimension values, rather than in the creation of new ones.

For more details on this process, see “Control Script Development and Execution” on page 343.


Naming Format of Update Source Data Files

When Forge processes update source data files, it is important to keep two things in mind concerning the names of the data files:

• The update files should be processed by Forge in order of their creation. The reason is that if a specific record appears in more than one update file, you want the latest update to be processed last, so that it will override earlier versions when the Dgraph loads the update record files.

• Forge reads the files in strict lexicographic order of their filenames. Therefore, you should use a naming scheme that ensures the processing of the update files in chronological order of their creation (i.e., last created, last processed).

For these reasons, it is strongly recommended that you use a timestamp format as the naming scheme for the filenames. If necessary, use leading zeros to force the desired numeric order. For example, if you have two files named 9.xml and 10.xml, Forge will process 10.xml before 9.xml; therefore, you must rename 9.xml to 09.xml so that it is processed before 10.xml.
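A timestamp naming scheme with fixed-width fields keeps lexicographic order and chronological order in agreement. A minimal sketch (the updates- prefix and .txt.gz suffix are arbitrary choices for illustration, not requirements):

```python
import datetime

def update_filename(created, prefix="updates-"):
    """Build a fixed-width timestamped filename so that sorting the
    names lexicographically also sorts them chronologically."""
    return f"{prefix}{created:%Y%m%d%H%M%S}.txt.gz"

earlier = update_filename(datetime.datetime(2005, 8, 1, 9, 5, 0))
later = update_filename(datetime.datetime(2005, 8, 1, 10, 0, 0))
# Fixed-width fields avoid the 9.xml-before-10.xml problem entirely.
names_sorted = sorted([later, earlier])
```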

Note: For the sake of simplicity, the reference implementation uses source files that do not use a timestamp naming scheme.


Index Configuration

The index configuration files (such as the thesaurus and stop words files) cannot be updated using the partial update mechanism; only records and dimensions can be updated.

Record Specification Attribute

Developer Studio lets you configure how you want to refer to records from your application and during partial updates. The RECORD_SPEC property attribute controls this behavior.

The RECORD_SPEC attribute allows you to specify the property that you wish to use to identify specific records in partial updates. For example, you may wish to use a field such as UPC, SKU, or part number to identify a record. You may set the RECORD_SPEC attribute’s value to TRUE in any property where the values for the property meet the following requirements:

• The value for this property on each record must be unique.

• Each record should be assigned exactly one value for this property.

Only one property in the project may have the RECORD_SPEC attribute set to TRUE.
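Before designating a property as the record spec, it is worth checking the extract against the two requirements above. A rough validation sketch (the function name and record layout are invented; records are modeled as dicts mapping property names to lists of values):

```python
def valid_record_spec(records, prop):
    """Check the two record-spec requirements: every record carries
    exactly one value for `prop`, and no two records share a value."""
    seen = set()
    for rec in records:
        values = rec.get(prop, [])
        if len(values) != 1:      # exactly one value per record
            return False
        if values[0] in seen:     # unique across all records
            return False
        seen.add(values[0])
    return True

records = [{"P_WineID": ["34699"]}, {"P_WineID": ["34701"]}]
ok = valid_record_spec(records, "P_WineID")
```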


All updates that add new records must include a valid value (that is, a value that fulfills the above criteria) for the RECORD_SPEC property.

For a partial updates deployment, you must have the RECORD_SPEC attribute of at least one property set to TRUE. If no property is marked as the RECORD_SPEC property, then the Navigation Engine will not process partial updates. If you are not doing partial updates, then you do not need to set the RECORD_SPEC to TRUE for any property.

To configure a RECORD_SPEC attribute for an existing property:

1. In the Project tab of Developer Studio, double-click Properties.

2. From the Properties view, select a property and click Edit. The Property editor is displayed.

3. In the General tab, check Use for Record Spec.

4. Click OK. The Properties view is redisplayed.

5. From the File menu, choose Save.

Navigation Engine Configuration

You must start the Navigation Engine with the Dgraph --updatedir flag to enable it to process partial updates. The flag takes as an argument the path of the directory into which completed partial update files (from Forge) are placed. The Navigation Engine does not automatically load update files placed into this directory. The control script must be configured to notify the running Navigation Engine to check for new updates.

Update files are read at startup as well as when the Navigation Engine receives the update signal. Because the Navigation Engine looks for update files automatically at startup, recovery from server failure can be achieved easily by ensuring that the Navigation Engine is provided the same --updatedir directory on recovery as it had prior to failure. The Navigation Engine then reads the existing files in the directory, restoring the Navigation Engine to its pre-failure state.

The Navigation Engine reads update files in numeric-lexicographic order of their filenames (lexicographic order unless the filename contains leading zeros, which are ignored). Therefore, the control scripts should be configured to name update files in ascending numeric-lexicographic order over time to ensure that updates are processed in the order they are produced. For further details, see “Step 2: Apply a Timestamp to the Record File” on page 350.

Note: While the Dgraph reads files in numeric-lexicographic order, Forge reads them in strict lexicographic order. Keep this difference in mind when naming files.
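The difference between the two orderings can be shown directly. A sketch under one plausible reading of "numeric-lexicographic" (digit runs compare by numeric value, so leading zeros are ignored); this models the behavior described above and is not Endeca code:

```python
import re

def numeric_lex_key(name):
    """A plausible model of numeric-lexicographic order: digit runs
    compare by their numeric value (leading zeros ignored), and the
    remaining text compares lexicographically."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

# Strict lexicographic order (Forge) puts "10.xml" before "9.xml".
forge_order = sorted(["9.xml", "10.xml"])

# The modeled numeric-lexicographic order treats digits numerically.
dgraph_order = sorted(["9.xml", "10.xml"], key=numeric_lex_key)
```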

The Navigation Engine processes updates on a record-by-record basis. Updates fail or succeed entirely at the record level. This means that a record update that fails (for example, because it attempts to assign an unknown dimension value to the record) leaves the value of the record unchanged. Property value changes or dimension value changes in the failed record update have no effect. Previous and future record updates and dimension updates are not affected.

During development, you can use the --updateverbose flag to specify that the Navigation Engine should output verbose messages while processing updates. However, the flag should not be used on production systems because it will negatively impact update performance.

Dgidx Flags

In most cases, the Dgidx --keepcats flag should be used to indicate that unused dimension values (that is, dimension values that have been assigned to no actual records) should be passed through to the Navigation Engine. By default, Dgidx strips out such dimension values. Without such values in the Navigation Engine, record updates that assign a previously unused dimension value to a record will fail with an error.

Control Script Development and Execution

A reference template control script (named update_index.script) is included in the Endeca sample_updates_data reference implementation. The control script uses two high-level Script bricks, baseline_update and partial_update, to implement the baseline and partial update processes. These processes are described in sections below.


Note: The sample control script assumes a deployment using only one Dgraph. If you have an Agraph deployment, see “Partial Updates in Agraph Implementations” on page 355.

Directory Structure for Updates

The reference control script uses the following directory structure for handling data flow through the system:

data
  forge_input
  incoming
    updates
  partition0
    dgidx_output
    dgraph_input
      updates
    forge_output
    state

The purposes of these directories are as follows:

• data - Base directory for all other subdirectories. All files and processes related to the data exist and work in or under this directory.

• data\forge_input - Contains the Developer Studio project file (sample_updates.esp), the baseline update pipeline file (pipeline.epx), the partial update pipeline file (partial_pipeline.epx), and the index configuration files (*.xml).

• data\incoming - Contains source data (in the wine_data.txt.gz file) for a baseline update. On a production site, the files in this directory may have been created by a data extraction process on the customer's database or may be picked up from another FTP server.

• data\incoming\updates - Contains source data for a partial update. The reference implementation ships with three gzipped files: adds.txt.gz (records to be added), deletes.txt.gz (records to be deleted), and updates.txt.gz (records to be updated).

• data\partition0 - Contains files generated by the Forge, Dgidx, and Dgraph programs.

• data\partition0\dgidx_output - Contains indices that have been processed by Dgidx and output in a format that can be read by the Navigation Engine.

• data\partition0\dgraph_input - Contains data that is read by the Navigation Engine on startup. The data includes the Dgidx output indices, spelling correction dictionaries, thesaurus files, and language-encoding files.

• data\partition0\dgraph_input\updates - Contains partial updates that have been processed by Forge. The Navigation Engine reads these updates when it is restarted with the Dgraph --updatedir flag pointing to this directory.

• data\partition0\forge_output - Contains data that has been processed by Forge and is ready for indexing.

• data\partition0\state - Contains any state information (such as auto-generated dimension IDs) that must be saved between Forge runs.


All references to directory names in the following text are relative to the data directory. All references to directory names in example or default brick definitions are relative to the parent of the data directory.

Running the Baseline Updates Script

In the control script, the high-level Script brick, baseline_update, implements the baseline update procedure by making calls to other script bricks:

baseline_update : Script
    clear_updates
    baseline_forge
    baseline_dgidx
    if dgraph.running
        dgraph.stop
    baseline_fetch
    dgraph.start

You run a baseline update with a command line similar to this Windows example (assuming you are in the sample_updates_data\etc directory):

runcommand update_index.script baseline_update

The baseline update process is as follows:

Step 1: Delete Old Updates

All files in the data\partition0\dgraph_input\updates directory are deleted by the clear_updates brick.

Advanced Features Guide • Endeca Confidential • Chapter 17


Step 2: Run Forge

The baseline_forge brick runs Forge on the source data, using these default settings:

pipeline = ..\data\forge_input\pipeline.epx
forge_options = -vw

You will want to modify the forge_options setting so that it is better suited for your application.

Step 3: Run Dgidx

The baseline_dgidx brick runs Dgidx with these settings:

input = ..\data\partition0\forge_output\wine
output = ..\data\partition0\dgidx_output\wine
dgidx_options = --keepcats

The --keepcats flag specifies that Dgidx should pass unused dimension values to the Navigation Engine instead of stripping them out.

You will probably want to modify the options passed to Dgidx. You will also want to change:

• The input setting so that it points to the location where your pipeline writes out the Forge output data.

• The output setting so that it points to the location where Dgidx should write out data for the Navigation Engine (make sure that the location ends with the prefix that you want to use for the Dgidx output).

Step 4: Stop the Navigation Engine

The dgraph.stop command stops the Navigation Engine.


Step 5: Move the Index Files to the Dgraph Directory

The baseline_fetch brick moves index files from the data\partition0\dgidx_output directory to the data\partition0\dgraph_input directory, where they are used by the Navigation Engine on startup.

Be sure to change the paths in the source and dest settings for your implementation.

Step 6: Start the Navigation Engine

The Navigation Engine is started with the dgraph brick, using these settings:

working_dir = $(sample_updates_data_dir)\logs
input = ..\data\partition0\dgraph_input\wine
port = $(dgraph_port)
dgraph_options = --updatedir ..\data\partition0\dgraph_input\updates

You may want to use the --updateverbose flag during development, but make sure you remove it for production. You may want to add other options relevant for your application. See the Endeca Administrator’s Guide for information about the available options.

At this point, the Navigation Engine should be running correctly with the latest baseline and partial update data.


Running the Partial Updates Script

In the control script, the high-level Script brick, partial_update, implements the partial update procedure as follows:

partial_update : Script
  update_forge
  apply_timestamp
  if dgraph.running
    dgraph.update

You run a partial update with a command line similar to this Windows example (assuming you are in the sample_updates_data\etc directory):

runcommand update_index.script partial_update

The three major steps of the partial_update Script brick are described in the following sections.

Step 1: Run Forge on the New Source Data

The update_forge brick runs Forge with the partial update pipeline and new source data, using these default settings:

pipeline = ..\data\forge_input\partial_pipeline.epx
forge_options = -vw

Because the record adapter uses the Multi Files setting, Forge can read data from multiple input files. (The reference implementation uses three input files.)


You will want to modify the forge_options setting so that it is better suited for your application. Modify the relative paths above as appropriate for your implementation.

When Forge finishes, it produces one or more update record files and stores them in the location specified by the pipeline's update adapter. These files contain XML definitions of how the updated records should be treated by the Navigation Engine (for example, which records to delete or add).

The record files use this naming format:

db_prefix-sgmtn.records.xml

For example, the update_forge brick in the reference implementation produces the wine-sgmt0.records.xml file in the data\partition0\dgraph_input\updates directory.

The -sgmt0 portion of the filename is generated when you roll over by size (i.e., the update adapter contains the ROLLOVER element, as in the reference partial updates pipeline). Forge splits the output into segment files, each of which is no larger than 2 GB.

It is important that you know the names of the record files, because they will have to be timestamped, as described in the next section.

Step 2: Apply a Timestamp to the Record File

It is possible to generate multiple partial updates before the next baseline update, at which time all the partial update files are deleted. Therefore, each record file must be timestamped to ensure that the Navigation Engine does not upload a partial update more than once.

The apply_timestamp brick renames the db_prefix-sgmtn.records.xml files by appending a timestamp string to the filename. The resulting filename will use this format:

originalfilename_YYYY.MM.DD.HH.NN.SS

where YYYY is the four-digit year, MM is the two-digit month, DD is the two-digit day, HH is the two-digit hour, NN is the two-digit minute, and SS is the two-digit second. For example:

wine-sgmt0.records.xml_2005.06.07.16.14.08

A running Navigation Engine keeps track of the last timestamped file it uploaded. When it next checks the updates directory, it will only upload partial update files that carry a timestamp later than the last uploaded file.

Note that the apply_timestamp brick in the reference script assumes that only one record file will be renamed. If your implementation generates multiple record files, you will need to modify this brick to add the additional renaming statements.
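The renaming scheme above can be sketched in a few lines. This is an illustrative Python sketch only (the helper name is hypothetical; the reference brick itself is a control-script component, not Python):

```python
import time

def apply_timestamp(filename, when=None):
    # Append a _YYYY.MM.DD.HH.NN.SS suffix to a record file name,
    # mirroring the apply_timestamp brick described above.
    # (Sketch only; not the Endeca brick implementation.)
    stamp = time.strftime("%Y.%m.%d.%H.%M.%S", time.localtime(when))
    return filename + "_" + stamp

print(apply_timestamp("wine-sgmt0.records.xml"))
# e.g. wine-sgmt0.records.xml_2005.06.07.16.14.08
```

If your pipeline produces several record files, the same renaming would simply be applied to each one in turn.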

Step 3: Update the Navigation Engine

The dgraph.update command causes the running Navigation Engine to perform the following actions:


1. Go offline while it processes the updates (that is, it stops accepting user queries and temporarily closes its listening port).

2. Check the updates directory (whose path is specified with the --updatedir flag).

3. Upload any partial update with a timestamp later than the last currently-loaded partial update.

4. Go back online after it has processed all updates.

At this point, the Navigation Engine should be running correctly with the latest baseline and partial update data.
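The file-selection rule in step 3 can be sketched as follows. This is a hypothetical helper, assuming the timestamped naming format from the previous section; because the stamps are zero-padded, comparing them as strings is equivalent to comparing them as times:

```python
def files_to_upload(update_files, last_loaded):
    # Return the partial-update files the Navigation Engine still
    # needs: those whose _YYYY.MM.DD.HH.NN.SS suffix sorts after the
    # suffix of the last file it loaded (step 3 above).
    def stamp(name):
        return name.rsplit("_", 1)[-1]
    newer = [f for f in update_files
             if last_loaded is None or stamp(f) > stamp(last_loaded)]
    return sorted(newer, key=stamp)

files = [
    "wine-sgmt0.records.xml_2005.06.07.16.14.08",
    "wine-sgmt0.records.xml_2005.06.08.09.00.00",
    "wine-sgmt0.records.xml_2005.06.09.12.30.00",
]
print(files_to_upload(files, files[1]))
# → ['wine-sgmt0.records.xml_2005.06.09.12.30.00']
```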

Adding Other Bricks

You can modify the update_index.script and add other bricks that are necessary for your implementation. For example, you can add a brick that fetches partial updates from an FTP server. In other installations, the partial updates may be dropped onto the indexing server, directly into the incoming\updates directory.

The following is an example of a Fetch brick:

fetch_updates : Shell
  perl bin/fetch.pl \
    --ftp_ip ftp.somecompany.com \
    --ftp_user anonymous \
    --ftp_pass somecompany.com \
    --fetch_dir incoming/ \
    --fetch_file_regexp "endeca_update_200407(\d+)\.txt" \
    --exclude_file etc/exclude_files \
    --dest_dir data/incoming/updates


The flags in the example are:

• --ftp_ip is the IP address of the FTP server.

• --ftp_user is the username for logging into the FTP server.

• --ftp_pass is the password for the username.

• --fetch_dir is the directory on the FTP server that contains the update files to retrieve.

• --fetch_file_regexp is the regular expression that should be matched for a file to be considered a partial update file.

• --exclude_file points to a file that will be maintained automatically by fetch.pl. It is a list of all the files that have already been retrieved from the FTP server and should not be retrieved again.

• --dest_dir is the directory into which the fetched files will be dropped.

See the Endeca Administrator’s Guide for details on writing and modifying bricks.

URL Update Command Parameters

The dgraph.update command, when issued from a command line, causes the Navigation Engine to check the updates directory and upload all partial updates that have not yet been uploaded.


You can also issue the same command to the Navigation Engine by using the following URL command syntax in your Web browser:

http://hostname:dgraphport/admin?op=update

For example:

http://localhost:8000/admin?op=update

If you are using HTTPS mode, use https in the URL.

On receiving the update command, the Navigation Engine immediately goes offline (that is, closes its port and stops accepting requests) and checks for and loads new update files in its update directory. After the Navigation Engine has processed all updates, it returns to its normal online mode of operation.

If you want the Navigation Engine to stay online during the update process, use the offline=false option of the command, as in this example:

http://localhost:8000/admin?op=update&offline=false

In this online mode, the Navigation Engine will continue to accept queries.

Note: offline=true is the default and is the same as using the update command by itself.

To see the update history, use the updatehistory URL command, similar to the following example:

http://localhost:8000/admin?op=updatehistory


This command will show a list of the update files that the Navigation Engine has recently processed.
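For scripted administration, the same admin URLs can be built and issued programmatically. The sketch below only constructs the URL (a helper with an illustrative name, not an Endeca API); issue it with any HTTP client, and switch the scheme to https for HTTPS mode:

```python
from urllib.parse import urlencode

def admin_url(host, port, op="update", offline=True):
    # Build a Navigation Engine admin URL such as
    # http://localhost:8000/admin?op=update&offline=false
    query = {"op": op}
    if not offline:
        query["offline"] = "false"  # stay online while updating
    return "http://%s:%d/admin?%s" % (host, port, urlencode(query))

print(admin_url("localhost", 8000))
# → http://localhost:8000/admin?op=update
print(admin_url("localhost", 8000, offline=False))
# → http://localhost:8000/admin?op=update&offline=false
print(admin_url("localhost", 8000, op="updatehistory"))
# → http://localhost:8000/admin?op=updatehistory
```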

Partial Updates in Agraph Implementations

Implementing partial updates in Agraph implementations is similar to implementing them in single-Dgraph deployments, with the important differences listed below.

Choosing a Distribution Strategy

The update record files produced by Forge contain XML definitions of the updated records, including information on how the records should be treated by the Dgraphs. For example, records to be deleted are flagged with a RECORD_DELETE element in the file.

New records (i.e., ones that use an ADD or ADD_OR_REPLACE action for the UPDATE_RECORD expression) are defined with a RECORD_ADD element that contains the partition number (in a PARTITION attribute) to which the record is assigned. Note that only ADD records are assigned partition numbers.


For example, this XML snippet shows a RECORD_ADD element that assigns a new record to Agraph partition 1:

<UPDATE>
  <UPD_UNIT>
    <RECORD_ADD PARTITION="1">
      <PROP NAME="P_WineID">
        <PVAL>99005</PVAL>
      </PROP>
      <PROP NAME="P_Year">
        <PVAL>1992</PVAL>
      </PROP>
      ...
    </RECORD_ADD>
  </UPD_UNIT>
  ...
</UPDATE>

How the partition number is assigned to an ADD record depends on which distribution strategy you have chosen to implement:

• Random distribution, where you let Forge decide which partition gets the new record. That is, Forge uses the configured partition property (typically the record spec or rollup property) as the basis for assigning the partition number to the PARTITION attribute.

• Deterministic distribution, where you control the assignment of records to specific partitions. That is, you tell Forge which partition number it should assign to the PARTITION attribute.

The main advantage of random distribution is that you do not need to know exactly where the records should go in order for updates to be processed correctly. This scheme also simplifies operations because the same update record file is sent to all partitions, so there is less conditional logic in the control script.

Which distribution strategy you choose depends on the needs of your implementation. In general, Endeca recommends that the distribution strategy for partial updates be the same as for baseline updates.

How the Agraph Partitions Handle Updates

Regardless of which distribution strategy you are using, the Agraph partitions (i.e., the individual Dgraphs) handle the record update requests as follows:

• If a DELETE, REPLACE, or UPDATE action request is sent to all partitions of the Agraph, only the partition that contains the record will actually delete, replace, or update the record. The other partitions will issue a warning message, but continue to function as before.

• If an ADD action request is sent to all partitions, only the designated partition (as specified in the PARTITION attribute of the record file) will add the record. The other partitions will ignore the request.

Because any partition knows how to deal with any update request, this architecture allows you to send the record files to all partitions without having to worry about which partition is the correct one.


Use of Record Spec

An Agraph deployment requires that the record spec property be unique across all records in all Navigation Engines.

Naming Convention for Source Data Files

Whether you are using a random or deterministic distribution strategy, it is strongly recommended that you use a timestamp format as the naming scheme for the update source data files. This format, which is explained in “Naming Format of Update Source Data Files” on page 339, ensures that Forge processes the files in the proper order of their creation.

For both strategies, a Perl expression in the record manipulator (described in a later section) can use the timestamp part of the filename for the name of the output record file.

Random Distribution Format

For a random distribution strategy, a suggested format is:

YYYYMMDDHHNNSS.ext

where YYYY is the four-digit year, MM is the two-digit month, DD is the two-digit day, HH is the two-digit hour, NN is the two-digit minute, and SS is the two-digit second, as in this example:

20051023161408.txt


These files may contain ADD records that will be distributed randomly to the Agraph partitions.
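Because every piece of the timestamp is zero-padded, a plain lexicographic sort of these filenames matches their order of creation, which is what lets Forge process them in the proper order. For example, in Python:

```python
names = ["20051023161408.txt", "20050717151408.txt", "20051023090000.txt"]
# Zero-padded YYYYMMDDHHNNSS stamps sort lexicographically in
# creation order, so no date parsing is needed.
print(sorted(names))
# → ['20050717151408.txt', '20051023090000.txt', '20051023161408.txt']
```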

Deterministic Distribution Format

For a deterministic distribution strategy, a suggested format is:

YYYYMMDDHHNNSS-partX.ext

where X is the number of the Agraph partition for which these records are intended. For example, records in this source data file are intended for partition 3:

20050717151408-part3.txt

The Perl expression in the record manipulator parses the filename for the partition number and uses it to assign ADD records to that partition.

The expression also uses the timestamp and -partX information for the name of the output record file. For example, the above input filename will generate this output record file:

20050717151408-part3.records.xml

Keep in mind that if you pre-partition your baseline source files, you should also pre-partition the records to be added. That is, all ADD records for the partition 0 Dgraph should be in one file, ADD records for the partition 1 Dgraph should be in a second file, and so on.


Configuring the Partial Updates Pipeline

This section describes how to configure the partial updates pipeline for either distribution strategy.

IMPORTANT: The procedures described below require that you hand-edit the pipeline files with a text editor. After you edit these files, do not open the project in Developer Studio, because it will overwrite the settings of the update adapter.

Configuring the Record Adapter

The record adapter should have the following settings:

• Set the MULTI attribute to True so that Forge can read multiple input data files.

• Set the URL attribute to the path of the incoming directory, with the filename being a pattern (such as ../incoming/updates/*.txt).

• In order to use the naming format of the input file for the records file name, set the MULTI_PROP_NAME attribute to a value of FILENAME.

These settings apply to both random and deterministic record adapters.


The following is an example of a record adapter for the partial updates pipeline:

<RECORD_ADAPTER NAME="LoadUpdateData"
  URL="../incoming/updates/*.txt"
  FORMAT="DELIMITED"
  COL_DELIMITER="|"
  ROW_DELIMITER="|\n"
  DIRECTION="INPUT"
  FILTER_EMPTY_PROPS="TRUE"
  FRC_PVAL_IDX="TRUE"
  MULTI="TRUE"
  MULTI_PROP_NAME="FILENAME"
  REQUIRE_DATA="FALSE">
</RECORD_ADAPTER>

Note that the FILENAME setting for the MULTI_PROP_NAME attribute will be processed by both the update adapter and the Perl expression in the record manipulator.

Configuring the Record Manipulator

For both random and deterministic pipelines, the record manipulator should contain the same IF and UPDATE_RECORD expressions that are documented for the single-Dgraph implementation (see “Creating the Record Manipulator” on page 327).

In addition, you can add a Perl expression that parses the name of each input file (up to the file extension) and uses it to name the output record file (which has a records.xml extension). The exact Perl code in the expression depends on the distribution strategy.


Perl Expression for Random Distribution

For a random distribution pipeline, the following example of a Perl expression can be inserted into the record manipulator:

<EXPRESSION TYPE="VOID" NAME="PERL">
  <COMMENT>This Perl expression handles taking the source input filename and
  outputting a record file with the same naming format.
  It assumes filenames of the format: timestamp.ext</COMMENT>
  <EXPRBODY><![CDATA[
    # Translate filename of input to filename of output.
    my @props = get_props_by_name("FILENAME");
    # Filename is everything after the last slash
    if ($props[0]->value() =~ /[\/\\](\w+)\.[^\.]+$/) {
      my $filename = $1;
      $props[0]->value($filename);
      replace_prop("FILENAME", 0, $props[0]);
    } else {
      die("Could not parse filename: " . $props[0]->value());
    }
  ]]></EXPRBODY>
</EXPRESSION>

The expression will generate output record files with names similar to this example:

20050717180812.records.xml

Note that this sample expression is for use on Windows machines, where “\” is the directory separator. For UNIX, change the regex from:

/[\/\\](\w+)\.[^\.]+$/

to:

/\/(\w+)\.[^\.]+$/
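Both regex forms can be exercised quickly outside the pipeline, for example in Python's re module, whose syntax for these particular patterns matches Perl's:

```python
import re

# Windows form: accept "/" or "\" as the separator before the filename.
win = re.compile(r"[/\\](\w+)\.[^.]+$")
# UNIX form: only "/" is needed.
nix = re.compile(r"/(\w+)\.[^.]+$")

assert win.search(r"..\incoming\updates\20050717180812.txt").group(1) == "20050717180812"
assert win.search("../incoming/updates/20050717180812.txt").group(1) == "20050717180812"
assert nix.search("../incoming/updates/20050717180812.txt").group(1) == "20050717180812"
```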


Keep in mind that you will have to change the Perl regex code if you use another naming convention for the source input files.

Perl Expression for Deterministic Distribution

The Perl expression for the record manipulator in a deterministic distribution pipeline is similar to the random distribution example, with the addition of code that extracts the partition ID (the partX piece) from the input filename and stores it in the X_PartitionNum property. The partition ID will be assigned by Forge to that record in the record file (via the PARTITION attribute of the RECORD_ADD element).


The Perl expression is as follows:

<EXPRESSION TYPE="VOID" NAME="PERL">
  <COMMENT>This Perl expression handles taking the source input filename and
  determining the appropriate partition.
  It assumes filenames of the format: timestamp-partN.ext
  The expression extracts the N in the "partN" piece.</COMMENT>
  <EXPRBODY><![CDATA[
    # Translate filename of input to filename of output.
    my @props = get_props_by_name("FILENAME");
    # Filename is everything after the last slash
    if ($props[0]->value() =~ /[\/\\](\w+\-part\d+)\.[^\.]+$/) {
      my $filename = $1;
      $props[0]->value($filename);
      replace_prop("FILENAME", 0, $props[0]);
      # Extract the partition ID from the filename to determine
      # the partition number for the record.
      $filename =~ /part(\d+)$/;
      my $part_num = $1;
      # X_PartitionNum specifies the target partition for
      # this particular record.
      my $part_prop = new Zinc::PropVal("X_PartitionNum", $part_num);
      add_props($part_prop);
    } else {
      die("Could not parse filename: " . $props[0]->value());
    }
  ]]></EXPRBODY>
</EXPRESSION>

As in the previous example, this sample code is for Windows machines, and the regex code must be changed for UNIX.
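The filename-parsing logic of the expression can be checked in isolation. This Python sketch reproduces the two regex steps (the helper name is hypothetical; the pipeline itself uses the Perl expression above):

```python
import re

def parse_update_filename(path):
    # Strip the directory and extension (first regex), then extract
    # N from the trailing "partN" piece (second regex), as the Perl
    # expression does.
    m = re.search(r"[/\\](\w+-part\d+)\.[^.]+$", path)
    if m is None:
        raise ValueError("Could not parse filename: " + path)
    filename = m.group(1)
    part_num = int(re.search(r"part(\d+)$", filename).group(1))
    return filename, part_num

print(parse_update_filename(r"..\incoming\updates\20050717151408-part3.txt"))
# → ('20050717151408-part3', 3)
```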


Configuring the Update Adapter

The configuration of the update adapter is similar to that in single-Dgraph implementations. For both random and deterministic distribution, the update adapter should have the following attribute settings:

• OUTPUT_URL – Set to the path of the directory where Forge should write the update record files (in the examples below, ../partition0/dgraph_input/updates).

• OUTPUT_PREFIX – Set to an empty string, because the output filename will begin with a timestamp format.

• MULTI – Set to True so that Forge can read multiple input data files.

• MULTI_PROP_NAME – Set to a value of FILENAME.

The recommended settings for the ROLLOVER element depend on the type of distribution strategy.

ROLLOVER Element for Random Distribution

In a random distribution pipeline, the following settings are recommended for the ROLLOVER element:

• NUM_IDX – Although this attribute normally sets the number of Agraph partitions, it is recommended that you use the Forge --numPartitions flag in the control script to actually set the number of partitions. Therefore, leave the field blank or use any number.

• PROP_NAME – Set to the partition property, which is the record spec or rollup property by which records are assigned to each partition. An empty field means that Forge will use a round-robin strategy to assign partitions to records.

• PROP_TYPE – Set to the partition property’s type (typically, ALPHA).

• REMOVE_PROP – Typically set to FALSE.

• CUTOFF – Set to the default value of 2000000000.

The following is an example of an update adapter using the above settings:

<UPDATE_ADAPTER NAME="UpdateAdapter"
  OUTPUT_URL="../partition0/dgraph_input/updates"
  OUTPUT_PREFIX=""
  MULTI="TRUE"
  MULTI_PROP_NAME="FILENAME">
  <RECORD_SOURCE>UpdateManipulator</RECORD_SOURCE>
  <ROLLOVER NAME="RECORD"
    NUM_IDX=""
    PROP_NAME="P_WineID"
    PROP_TYPE="ALPHA"
    REMOVE_PROP="FALSE"
    CUTOFF="2000000000"/>
</UPDATE_ADAPTER>

ROLLOVER Element for Deterministic Distribution

In a deterministic distribution pipeline, the following settings are recommended for the ROLLOVER element:

• NUM_IDX – Leave blank or use any number, as it is recommended you use the Forge --numPartitions flag in the control script to actually set the number of Agraph partitions.


• PROP_NAME – Set to the property (X_PartitionNum, for example) created by the Perl expression in the record manipulator.

• PROP_TYPE – Set to INTEGER (because the property will hold the partition number of the ADD record).

• REMOVE_PROP – Set to TRUE (because the PROP_NAME property should not be in the output).

• CUTOFF – Set to the default value of 2000000000.

The following is an example of an update adapter using the above settings:

<UPDATE_ADAPTER NAME="UpdateAdapter"
  OUTPUT_URL="../partition0/dgraph_input/updates"
  OUTPUT_PREFIX=""
  MULTI="TRUE"
  MULTI_PROP_NAME="FILENAME">
  <RECORD_SOURCE>UpdateManipulator</RECORD_SOURCE>
  <ROLLOVER NAME="RECORD"
    NUM_IDX=""
    PROP_NAME="X_PartitionNum"
    PROP_TYPE="INTEGER"
    REMOVE_PROP="TRUE"
    CUTOFF="2000000000"/>
</UPDATE_ADAPTER>

Control Script for Agraph Updates

The reference control script implements partial updates for a single-machine, single-Dgraph deployment only. For an Agraph deployment, you can modify the control script to run Forge on a single machine and distribute the Forge output to all the other machines. Then, you notify each Dgraph in your deployment to check for new updates.


Forge Partial Updates Brick

The Forge brick that processes the partial update source data is similar to the brick described in “Step 1: Run Forge on the New Source Data” on page 349. The difference is that we recommend use of the Forge --numPartitions flag to specify the number of Agraph partitions, as in this brick example:

# Runs Forge on the update source data.
update_forge : Forge
  working_machine = indexer
  pipeline = ..\data\forge_input\partial_pipeline.epx
  forge_options = -vw --numPartitions $(numPartitions)

Using the --numPartitions flag (which overrides the NUM_IDX setting in the update adapter) lets you easily add or subtract Agraph partitions from within the control script. You will have to set up a global variable (named numPartitions in the example above) that stores the number of partitions.

Distributing the Forge Output to the Dgraphs

For a random distribution strategy, partial updates in Agraph implementations do not have any special distribution requirements. Both dimension modifications (i.e., dimension value additions) and record modifications (updates, deletes, replaces, and adds) should be sent to all Dgraphs in the deployment. Each Dgraph should then be notified to check for new updates. If a Dgraph cannot handle data that is associated with another Dgraph, it will simply log a warning but will otherwise continue working. Note that the Agraph process itself does not process updates.

For a deterministic distribution strategy, the distribution of the record files depends on the use of auto-generated dimensions:

• If you are using auto-generated dimensions, distribute all the record files to all the Dgraphs.

• If you are not using auto-generated dimensions, you can distribute each record file to its specific Dgraph.

To make sure that there is no interruption in servicing navigation requests, you can configure your Dgraphs to check for new updates at different times. Alternatively, you can have smaller subgroups read in updates simultaneously (for example, three machines at a time in a six-machine implementation).
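The subgrouping idea can be sketched as a simple chunking of the Dgraph list; each subgroup would then be told to check for updates in turn (host names here are placeholders):

```python
def update_subgroups(dgraphs, group_size):
    # Partition the deployment's Dgraphs into subgroups that read in
    # updates one group at a time, so the rest keep serving queries.
    return [dgraphs[i:i + group_size]
            for i in range(0, len(dgraphs), group_size)]

hosts = ["dgraph1", "dgraph2", "dgraph3", "dgraph4", "dgraph5", "dgraph6"]
print(update_subgroups(hosts, 3))
# → [['dgraph1', 'dgraph2', 'dgraph3'], ['dgraph4', 'dgraph5', 'dgraph6']]
```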


Chapter 18

Using the Agraph

Implementing the Agraph allows application users to search and navigate very large data sets. An Agraph implementation enables scalable search and navigation by partitioning a very large data set into multiple Dgraphs running in parallel. The Agraph sends an application user’s query to each Dgraph, then coordinates the results from each, and sends a single reply back to the application user.

What You Should Know First

This document assumes you are familiar with the basics of the Endeca Navigation Engine as described in the Endeca Developer’s Guide and that you can create, provision, and run an Endeca implementation using one Dgraph.

Overview of Distributed Query Processing

You can scale the Navigation Engine to accommodate a large data set by distributing the Navigation Engine across multiple processors.


In this type of distributed environment, you configure a Developer Studio project to partition your Endeca records into subsets of records—as many partitioned subsets as you need to process all your source data. Each subset of Endeca records is typically referred to as a partition. Each processor runs an instance of the Dgraph program by loading one partition and maintaining a portion of the total Navigation Engine indices in its main memory.

Such a distributed configuration requires an additional program called the Agraph (Aggregated Navigation Engine). The Agraph program receives requests from clients, forwards the requests to the distributed Navigation Engines, and coordinates the results. An Agraph can coordinate as many child Dgraphs as are necessary for your data set.

Agraph Query Processing

From the perspective of the Endeca Presentation API, the Agraph program behaves identically to a Dgraph program. When an Aggregated Navigation Engine receives a request, it sends the request to all of the distributed Navigation Engines. Each Navigation Engine processes the request and returns its results to the Aggregated Navigation Engine, which aggregates the results into a single response and returns that response to the client via the Endeca Presentation API.
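The coordination role described here is a classic scatter-gather pattern. As a toy sketch (the merge step in the real Agraph is far richer, combining refinements, record counts, and so on):

```python
def agraph_query(dgraphs, query):
    # Scatter the request to every child Dgraph, gather the partial
    # results, and merge them into one response. The merge here is a
    # toy (concatenation of record lists).
    merged = []
    for dgraph in dgraphs:
        merged.extend(dgraph(query))
    return merged

# Toy "Dgraphs": each partition holds a slice of the records.
dgraph1 = lambda q: [r for r in ["merlot", "malbec"] if q in r]
dgraph2 = lambda q: [r for r in ["margaux", "chianti"] if q in r]
print(agraph_query([dgraph1, dgraph2], "ma"))
# → ['malbec', 'margaux']
```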


In the following illustration, one Agraph coordinates three Dgraphs.

Data Foundry Processing

The previous section described how an Agraph functions from the perspective of client requests and Agraph responses that are passed through the Endeca API. This section describes the offline processing that Data Foundry components perform to create Agraph partitions.

If you want a full explanation about how Data Foundry processing works for a single Dgraph implementation, see “Data Foundry Components” in the Endeca Developer’s Guide. To summarize a portion of that section, the Data Foundry architecture to process source

[Figure: Agraph query processing. A client request passes through the Endeca Presentation API to the Aggregated Endeca Navigation Engine (Agraph), which forwards it to the Endeca Navigation Engines Dgraph 1, Dgraph 2, and Dgraph 3 and returns the aggregated client response.]


data for a single partition, running in a single Navigation Engine, looks like this:

[Figure: Single-partition Data Foundry processing. Forge (imports, standardizes, tags, exports) reads the Source Data and produces Endeca Records; Dgidx (creates indices) builds the ENE Indices; the Dgraph receives and replies to queries.]

In an Agraph implementation, the Data Foundry processing is very similar; however, multiple Data Foundry components, namely Dgidx, Agidx, and the Dgraph, run in parallel to process each partition’s data. The architecture to process an Agraph implementation with three partitions looks like this:

[Figure: Three-partition Agraph processing. Forge splits the Source Data into Partition 1, 2, and 3 Endeca Records; Dgidx 1, 2, and 3 build the per-partition ENE Indices; Agidx builds the Combined ENE Indices; Dgraph 1, 2, and 3 load the partitions and are coordinated by the Agraph.]


When you use either Developer Studio or Web Studio to run a project with three partitions, as shown above, the following occurs:

1. Forge reads in the source data. (Assume Forge has access to source data as shown in the diagram with a single Navigation Engine.)

2. Forge enables parallel processing by producing Endeca records in any number of partitions. You specify the number of partitions in the Agraph tab of the Indexer adapter or the Update adapter. This is described later in “Modifying a Project for Agraph Partitions”.

3. The Data Foundry starts a Dgidx process for each partition that Forge created. The Dgidx processes can run on one or multiple machines, depending on the desired allocation of computation resources.

4. Each Dgidx process creates a set of Navigation Engine indices for its corresponding partition.

5. After all the Dgidx processes complete, the Agidx program runs to create an index specific to the Agraph. This index contains information about each partition’s indices.

6. Each Navigation Engine (Dgraph) starts and loads the index for its corresponding partition.

7. After all Dgraphs start, the Agraph starts and loads its index, which contains information about each child Dgraph’s index.


Guidance about When to Use an Agraph

An Agraph implementation is necessary when you have a set of Endeca records large enough that a single Dgraph process would exceed the maximum process size limits of the machine’s operating system. For details about the Dgraph and process size, see “Endeca Navigation Engine Basics” in the Endeca Developer’s Guide.

One approximate test to gauge whether an Agraph may be necessary is to check how many records of source data you have before Data Foundry processing. One million or more records of source data suggests that you may want to use an Agraph in your Endeca implementation.

Implementation Overview

The rest of this chapter describes the following tasks to implement an Agraph:

• Modify the project for Agraph partitions.

• Provision the Agraph implementation.

• Run the Agraph implementation.

Modifying the Project for Agraph Partitions

The first step in implementing an Agraph is to configure the Agraph tab of the project’s Indexer adapter or, if you are working on a partial update pipeline, the Agraph tab of the Update adapter. The Agraph tab serves the following functions:

• Enables Agraph support

• Specifies the number of Agraph partitions (Dgraphs) in your implementation

• Identifies an optional partition property

The partition property field identifies the property by which records are assigned to each partition. This field is read-only. The partition property field can display one of three possibilities:

• A rollup property—If you have a rollup property enabled in your project, the rollup property also functions as the partition property. Forge assigns all records that can be aggregated by the rollup property to the same partition.

For example, suppose “Year” is the rollup property and “Year” can have any number of rollup values such as 2002, 2003, 2004, and so on. Forge assigns all records tagged with a particular year’s value to the same partition. This means that all records tagged with 2002 are in the same partition; all records tagged with 2003 are in the same partition, and so on.

• A record spec property—If you do not have a rollup property but do have a record spec property enabled in your project, the record spec property functions as the partition property. Records are assigned evenly across all partitions according to the record spec property. This allocation provides equally sized partitions.


• An empty field (no property displays)—If you have not enabled a rollup property or record spec property, the partition property field is empty. With no partition property, Forge assigns records to each partition according to a round-robin strategy. This strategy also provides equally sized partitions.
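These three strategies can be sketched as follows. The sketch is illustrative only: Forge’s actual partitioning algorithm is internal, and the hash-based assignment shown here is an assumption. What it demonstrates is the invariant from the text: records sharing a rollup value always land in the same partition, while record-spec and round-robin assignment spread records evenly.

```java
public class PartitionAssignment {
    // Rollup strategy: all records with the same rollup value (e.g. Year=2002)
    // map to the same partition, so they can be aggregated within one Dgraph.
    static int byRollupValue(String rollupValue, int partitions) {
        return Math.floorMod(rollupValue.hashCode(), partitions);
    }

    // Record-spec / round-robin strategies: spread records evenly, producing
    // equally sized partitions.
    static int roundRobin(int recordIndex, int partitions) {
        return recordIndex % partitions;
    }

    public static void main(String[] args) {
        int partitions = 3;
        // Every "2002" record gets the same partition number.
        System.out.println(byRollupValue("2002", partitions));
        // Six records spread evenly across partitions 0, 1, 2, 0, 1, 2.
        for (int rec = 0; rec < 6; rec++) {
            System.out.print(roundRobin(rec, partitions) + " ");
        }
        System.out.println();
    }
}
```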

To modify the project for Agraph partitions:

1. In the Project tab of Developer Studio, double-click the Pipeline Diagram.

2. Double-click the Indexer adapter or, if you are modifying a partial update pipeline, double-click the Update adapter.

3. Select the Agraph tab.

4. Check “Enable Agraph support”.

5. In “Number of Agraph partitions”, specify the number of child Dgraphs that the Agraph controls. In an Agraph implementation, this must be a value of 2 or more. You provision each of the partitions in “Provisioning an Agraph Implementation”.

Note: If you want to change the partition property, open the Properties view and modify which properties are enabled for rollup and record spec.

Provisioning an Agraph Implementation

In addition to modifying your project to support Dgraph partitioning, you must also provision your Endeca implementation using Web Studio. Provisioning informs the Endeca Manager about the systems allocated to run the Forge, Dgidx, Agidx, Dgraph, and Agraph programs.

In a production environment, the Agraph and each Dgraph should run on its own processor. Your servers may have one or more processors. For example, you can set up a three Dgraph/one Agraph environment on a quad processor server. In a development environment, where optimal performance is less critical, the Agraph can run on one of the processors running a Dgraph.

An Agraph implementation requires a minimum of two replicas (mirrors) to provide full application uptime during partial updates. The second replica is necessary because one replica’s Agraph goes offline during a partial update. The second replica can continue to receive and reply to user requests during the downtime of the first replica.

To provision an Agraph implementation:

1. Open Internet Explorer, start Web Studio, and log in. If you have any questions about how to use Web Studio, see the Endeca Tools Guide. (This procedure assumes you know how to provision an Endeca implementation and focuses on the issues specific to provisioning an Agraph implementation.)

2. Select the Provisioning page from the Administration section of the navigation menu.

3. In the Hosts section, add each host that runs a Dgraph or Agraph, including host machines that run Dgraph or Agraph replicas.


For example, the set of hosts shown in the following graphic provisions a system similar to the one shown in “Agraph Query Processing” on page 372. Namely, there are hosts to support a total of six Dgraphs and two Agraphs. (There are two replicas in the implementation, and each replica runs one Agraph and three Dgraphs.)

4. In the Dgidx section, add as many Dgidx entries as you have Dgraph partitions. In other words, the number of Dgidx entries must correspond to the value of “Number of Agraph partitions” in your Indexer adapter.

To continue the previous example, this illustration shows three Dgidx entries that correspond to three Dgraph partitions.


5. In the Navigation Engines section, add as many Navigation Engines as you have Agraph partitions and replicas for those partitions. (You specified the number of Agraph partitions in “Modifying the Project for Agraph Partitions.”) For example, if you have three Dgraphs and two replicas, you need a total of six Navigation Engines.

To continue the previous example, this illustration shows six total Navigation Engines (three per replica).


6. In the Aggregated Navigation Engines section, add as many Agraphs as you need to support your desired number of Dgraphs and the required number of replicas. An Agraph can support any number of Dgraphs.

To continue the previous example, this illustration shows two Aggregated Navigation Engines, one for each replica.

7. In the Options section, specify the number of replicas and Dgraph partitions.

To continue the previous example, this illustration shows two replicas with three Dgraphs per replica.

8. Click Save Changes and go on to “Running an Agraph Implementation”, next.
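The component counts used throughout this procedure follow directly from two numbers: the partition count and the replica count. The arithmetic can be sketched as follows (an illustration of the counting rules, not an Endeca API):

```java
public class ProvisioningCounts {
    final int dgidxEntries;       // one Dgidx entry per partition
    final int navigationEngines;  // one Dgraph per partition, per replica
    final int aggregatedEngines;  // one Agraph per replica

    ProvisioningCounts(int partitions, int replicas) {
        this.dgidxEntries = partitions;
        this.navigationEngines = partitions * replicas;
        this.aggregatedEngines = replicas;
    }

    public static void main(String[] args) {
        // The running example: three partitions, two replicas.
        ProvisioningCounts c = new ProvisioningCounts(3, 2);
        System.out.println(c.dgidxEntries + " Dgidx entries, "
                + c.navigationEngines + " Navigation Engines, "
                + c.aggregatedEngines + " Aggregated Navigation Engines");
    }
}
```

With three partitions and two replicas this yields three Dgidx entries, six Navigation Engines, and two Aggregated Navigation Engines, matching the example above.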


Running an Agraph Implementation

After saving your provisioning changes, you can view the components on the Administration page and start a baseline update. The baseline update processes your source data and runs all the components shown on the Administration page, including starting all Dgraphs and the Agraph for each replica.

To run the Agraph:

1. Select the Administration page from the navigation menu of Web Studio.

2. From the System Operation section, click the Start button for Baseline Update.

To continue the previous example, this illustration shows a running Agraph implementation. Each of the two replicas contains three Dgraphs managed by one Agraph.


Agraph Presentation API Development

No additional development is needed in the Presentation API to support the Agraph. The Agraph can be treated just like a Dgraph.

Note, however, that when you set a connection to the Navigation Engine, your application should connect to the Agraph, not to one of its child Dgraphs. For example, in Java, this connection might look like the following:

// Set connection to Agraph
ENEConnection nec = new HttpENEConnection("engine.endeca.com", "9001");

where engine.endeca.com is the Agraph host and 9001 is the Agraph port.

Agraph Limitations

The following features cannot be used with the Agraph:

• Enabling the “More…” option for dimension value ranking.

• Relevance ranking for dimension search is not supported in an Agraph. In addition, the Static relevance ranking module is not supported in an Agraph. See “Using Relevance Ranking” in the Endeca Developer’s Guide for information on configuring Dgraphs in an Agraph deployment to support relevance ranking for record search.

• If you are aggregating records in your application, you must specify a single property by which to aggregate the records. Specify the property by enabling the Rollup check box on the General tab of the Property editor.


Agraph Performance

Ideally, the Agraph speeds up both Dgidx indexing and Navigation Engine request processing by a factor of the number of partitions. The indexing speed-up is close to this ideal, assuming that the Dgidx processes do not have to compete for computation or disk resources.

Assuming each Dgraph is running on its own processor as recommended, the Navigation Engine achieves close to the ideal speed-up for handling expensive requests, especially analytics requests. For smaller requests, the overhead of the Agraph tends to nullify the benefits of processing a query in parallel.

Control Script Environment Considerations

The following sections provide detail about using the Agraph in a control script environment.

Arranging Partitions and Files

When running your implementation with a control script, you have to arrange data files so that they are available to the various Dgidx processes. In particular, each Dgidx process needs access to its corresponding partition of the records, as well as to the configuration files that are common to all of the processes. If the Dgidx processes are to be executed on different machines, then the control script must distribute the files across those machines.


Agraph and Dynamic Business Rules

In a control script environment, Endeca recommends that you shut down the Agraph during dynamic business rule updates. (In a tools environment, the Manager automatically shuts down the Agraph during any type of update process and then brings the Agraph back online after the update completes.)

If you do not shut down the Agraph during the update, end users will not receive a response to requests made during this short update time, and the Agraph issues a fatal error similar to the following:

[Thu Jun 24 16:26:29 2004] [Fatal] (merchbinsorter.cpp::276) - Dgraph 1 has fewer rules fired.

If you are using dynamic business rules with the Agraph, the -keepcats flag must be used with Dgidx. For more information, see “Implementing Merchandising and Content Spotlighting” on page 257.


Chapter 19

Using Internationalized Data

The Endeca suite of products supports the Unicode Standard, Version 4.0, which allows the Endeca Navigation Engine to process and serve data in virtually any of the world’s languages and scripts. The Endeca components can be configured to allow processing of such data when provided in a native encoding system.

This chapter provides a single source of information for implementation details that you need to know about when building a solution that includes internationalized data. The chapter makes the following assumptions:

• If working with Chinese, you are familiar with which native encodings (Big5, GBK, etc.) correspond to which character sets (Traditional versus Simplified).

• If working with Chinese or Japanese, you know that these languages do not use white space to delimit words.

• If working with Japanese, you are familiar with the shift_jis variants and how the same character can be used for either the Yen symbol or the backslash character.

For more information on the Unicode Standard and character encodings, see http://unicode.org.


Installing the Supplemental Language Pack

If you have purchased Japanese, Chinese, or Korean language functionality, you must install the Endeca Supplemental Language Pack, which contains Japanese, Chinese, and Korean dictionary files. Follow the instructions below to install this software.

To install the Endeca Supplemental Language Pack on UNIX:

1. Download the Endeca Supplemental Language Pack tar file from the Endeca Customer Support Web site:

https://customers.endeca.com

2. Change directories to INSTALL_BASE, which typically is the /usr/local directory:

cd INSTALL_BASE

3. Use this command to decompress and unpack the Endeca Supplemental Language Pack tar file (GNU tar is recommended):

gzip -dc LANG_TAR_FILE | tar -xvpf -

To install the Endeca Supplemental Language Pack on Windows:

1. Download the Endeca Supplemental Language Pack executable file from the Endeca Customer Support Web site:

https://customers.endeca.com

The name of the file will be similar to lang460w2k.exe.

2. Run the file by clicking on it. You are not prompted for any information; instead, the installer automatically adds the appropriate dictionary files to the %ENDECA_ROOT%\conf\dicts directory.


Specifying the License Key

On its own, installing the Supplemental Language Pack does not provide access to Japanese, Chinese, and Korean language support. These languages also require you to specify a license key when you run the Dgidx program and start your Navigation Engine. Contact your Endeca representative to purchase a license key.

After you acquire the license key, use the --lang_license flag to the Dgidx and Dgraph programs; for example:

--lang_license 174923185

Configuring Forge Components for Languages

The following sections discuss how to use Forge components to identify the language of the incoming source data.

Setting the Encoding for the Incoming Source Data

Forge needs to know the encoding of the data in order to process it correctly. For a list of valid encodings, see the ICU Converter Explorer at:

http://oss.software.ibm.com/cgi-bin/icu/convexp


The encoding can be specified in the following ways, depending on the format:

• If the format is Delimited, Vertical, Fixed-width, Exchange, ODBC, JDBC Adapter, or Custom Adapter, you must specify the encoding via the --input-encoding command-line flag when running Forge or in the Encoding field of the Record Adapter editor in Developer Studio.

• If the format is Document, documents are fetched via the Content Acquisition System. Because each document may have a different encoding, the command-line argument and the Encoding attribute are ignored. Instead, the encoding is determined automatically for each document during the parsing phase, as follows:

− If the Web server provides the encoding when sending the document, Forge uses that information.

− If the Web server does not provide the encoding (or if the file system is being crawled), Forge attempts to detect the encoding automatically (for example, by looking for a META tag identifying the encoding or by examining the actual bytes).

− If both methods fail, Forge emits a warning and defaults to LATIN-1.

• If the format is XML, the encoding must be specified in the XML declaration of the XML document, as required by the XML standard. Both the command-line argument and the Encoding attribute are ignored.


• If the format is Binary, the command-line argument and the Encoding attribute are ignored because encoding only applies to text files.
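For the Document format, the three-step fallback described above (server-provided encoding, then automatic detection, then the LATIN-1 default) can be sketched as follows. This is a minimal illustration of the documented order, not Forge’s actual code:

```java
import java.util.Optional;

public class DocumentEncodingFallback {
    // 1) Use the encoding the Web server provided when sending the document.
    // 2) Otherwise use the encoding detected from the document itself
    //    (for example, a META tag or the actual bytes).
    // 3) Otherwise emit a warning and default to LATIN-1.
    static String chooseEncoding(Optional<String> serverProvided, Optional<String> detected) {
        if (serverProvided.isPresent()) {
            return serverProvided.get();
        }
        if (detected.isPresent()) {
            return detected.get();
        }
        System.err.println("warning: encoding unknown, defaulting to LATIN-1");
        return "LATIN-1";
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding(Optional.of("Shift_JIS"), Optional.empty()));
        System.out.println(chooseEncoding(Optional.empty(), Optional.of("UTF-8")));
        System.out.println(chooseEncoding(Optional.empty(), Optional.empty()));
    }
}
```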

Specifying the Language for Documents

For the Document format, Forge can automatically deduce the language used in a document. Forge’s primary means for doing so is the ID_LANGUAGE expression, as used in a record manipulator.

This example identifies the language of the value stored in the Endeca.Document.Text property and then stores a corresponding language identifier in the Endeca.Document.Language property:

<EXPRESSION TYPE="VOID" NAME="ID_LANGUAGE">
  <EXPRNODE NAME="PROPERTY" VALUE="Endeca.Document.Text"/>
  <EXPRNODE NAME="LANG_PROP_NAME" VALUE="Endeca.Document.Language"/>
  <EXPRNODE NAME="LANG_ID_BYTES" VALUE="500"/>
</EXPRESSION>

The EXPRNODE attributes are:

• PROPERTY - Specifies the name of the property on which to perform language identification.

• LANG_PROP_NAME - Specifies the name of the property in which to store the language identifier. Endeca.Document.Language is the default value.

• LANG_ID_BYTES - Specifies the number of bytes Forge uses to determine the language. A larger number provides a more accurate determination, but requires more processing time. The default value is 300 bytes.


For full details on the ID_LANGUAGE expression, see the Endeca XML Reference.

Forge Language Support Table

With the ID_LANGUAGE expression, Forge can identify the following language and encoding pairs:

• ARABIC: CP1256 (Microsoft Code Page 1256); UTF-8 (Unicode UTF-8)

• CATALAN: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• CHINESE: ASCII (CNS-Roman); ASCII (GB-Roman); BIG5 (Big Five); BIG5-CP950 (Microsoft Code Page 950); CNS (CNS 11643-1986); GB (GB2312-80); EUC-CN; EUC (DEC Hanzi); Unicode (UCS-2); Unicode (UTF-8)

• CZECH: Latin2 (ISO-8859-2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)

• DANISH: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• DUTCH: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• ENGLISH: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1)

• ESTONIAN: Latin4 (ISO-8859-4); Latin4 (Microsoft Code Page 1257); UTF-8 (Unicode UTF-8)

• FINNISH: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• FRENCH: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• GERMAN: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• GREEK: Greek (ISO-8859-7); Greek (Microsoft Code Page 1253); UTF-8 (Unicode UTF-8)

• HEBREW: Hebrew (ISO-8859-8); Hebrew (Microsoft Code Page 1255); UTF-8 (Unicode UTF-8)

• HUNGARIAN: Latin2 (ISO-8859-2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)

• ICELANDIC: ASCII; ISO-8859-1 (Latin 1); CP1252 (Microsoft Code Page 1252); UTF-8 (Unicode UTF-8)

• ITALIAN: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• JAPANESE: ASCII (JIS-Roman); CP932 (Microsoft Code Page 932); EUC-JP; JIS (DEC Kanji); JIS (ISO-2022-JP); JIS (JIS X 0201-1976); JIS (JIS X 0201-1997); JIS (JIS X 0208-1983); JIS (JIS X 0208-1990); JIS (JIS X 0212-1983); JIS (JIS X 0212-1990); SJS (Shift-JIS); Unicode (UCS-2); Unicode (UTF-8)

• KOREAN: ASCII (KS-Roman); KSC (EUC-KR); KSC (KS C 5861-1992); Unicode (UCS-2); Unicode (UTF-8)

• LATVIAN: Latin4 (ISO-8859-4); Latin4 (Microsoft Code Page 1257)

• LITHUANIAN: Latin4 (ISO-8859-4); Latin4 (Microsoft Code Page 1257); UTF-8 (Unicode UTF-8)

• NORWEGIAN: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• POLISH: Latin2 (ISO-8859-2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)

• PORTUGUESE: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• ROMANIAN: Latin2 (ISO-8859-2); Latin2 (Microsoft Code Page 1250); UTF-8 (Unicode UTF-8)

• RUSSIAN: CP1251 (Microsoft Code Page 1251); ISO-8859-5; KOI8R (KOI 8-R); UTF-8 (Unicode UTF-8)

• SLOVAK: Latin2 (ISO-8859-2); UTF-8 (Unicode UTF-8)

• SPANISH: ASCII; CP1252 (Microsoft Code Page 1252); ISO-8859-1 (Latin 1); UTF-8 (Unicode UTF-8)

• SWEDISH: ASCII; ISO-8859-1 (Latin 1); CP1252 (Microsoft Code Page 1252); UTF-8 (Unicode UTF-8)

• THAI: CP874 (Microsoft Code Page 874); UTF-8 (Unicode UTF-8)

• TURKISH: CP1254 (Microsoft Code Page 1254); UTF-8 (Unicode UTF-8)

Performance Considerations for Language Identification

Language identification requires a balance between accuracy and performance, and the exact balance depends both on the requirements and the data:

• To increase accuracy, raise the number of bytes in the LANG_ID_BYTES attribute in the ID_LANGUAGE expression.

• To increase performance, either reduce the number of bytes or, if possible, use different criteria to determine the language. For example, if the languages are already segmented by folder, then a conditional ADD_PROP expression can be used to create the language property on each record, avoiding the ID_LANGUAGE expression altogether.

Note that if the Web server being crawled by the CAS provides incorrect encoding information, you can remove the encoding property (which typically is the Endeca.Document.Encoding property) before the parse phase. In this case, the PARSE_DOC expression attempts to detect the encoding automatically. If the encoding for all documents being crawled is known in advance, an expression could add the correct encoding to each record before the parse expression.


Configuring Languages for the Navigation Engine

The following sections discuss how to configure language identifiers for the Navigation Engine, as well as language-specific spelling correction.

When using internationalized data, keep in mind that the Navigation Engine does not support language-specific sort orders (for example, Spanish speakers expect ch and ll to be sorted as distinct characters), separation of compound words in German, or bi-directional languages like Arabic and Hebrew.

Using Language Identifiers

American English (“en”) is the default language of the Navigation Engine. If your application contains text in other languages, you should tell the Navigation Engine the language of that text so that it can perform language-specific operations correctly.

You use a language ID to identify a language. Language IDs must be specified as a valid RFC-3066 or ISO-639 code, such as the following examples:

• da – Danish

• de – German

• el – Greek

• en – English (United States)

• en-GB – English (United Kingdom)


• es – Spanish

• fr – French

• it – Italian

• ja – Japanese

• ko – Korean

• nl – Dutch

• pt – Portuguese

• zh – Chinese

• zh-CN – Chinese (simplified)

• zh-TW – Chinese (traditional)

A list of the ISO-639 codes is available at:

http://www.w3.org/WAI/ER/IG/ert/iso639.htm

You can supply the language ID for source data using one of these methods:

• A global language ID can be used if all or most of your text is in a single language.

• A per-record language ID should be used if the language varies on a per-record basis.

• A per-dimension/property language ID should be used if the language varies on a per-dimension basis.

• A per-query language ID should be used in your front-end application if the language varies on a per-query basis.
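One way to picture how these IDs interact is as a fallback chain: text not tagged with a more specific language ID falls back to the global language ID. The sketch below assumes the most specific ID wins (per-query, then per-dimension/property, then per-record); the guide states only that untagged text uses the global language, so the exact ordering among the specific IDs here is an assumption for illustration:

```java
import java.util.Optional;

public class LanguageIdFallback {
    // The most specific language ID available wins; text with no ID at all
    // uses the global language, which itself defaults to "en" (US English).
    static String resolve(Optional<String> perQuery, Optional<String> perKey,
                          Optional<String> perRecord, Optional<String> global) {
        return perQuery
                .or(() -> perKey)
                .or(() -> perRecord)
                .or(() -> global)
                .orElse("en");
    }

    public static void main(String[] args) {
        // Nothing specified anywhere: US English.
        System.out.println(resolve(Optional.empty(), Optional.empty(),
                Optional.empty(), Optional.empty()));
        // A per-query ID overrides a per-record ID and the global setting.
        System.out.println(resolve(Optional.of("fr"), Optional.empty(),
                Optional.of("es"), Optional.of("de")));
    }
}
```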


The following sections describe these methods of specifying the language ID for your data.

Specifying a Global Language ID

If most of your text is in a single language, you can use the global language ID by specifying the --lang flag to the Dgidx and Dgraph programs. The Navigation Engine assumes that text not tagged with a more specific language ID (the per-record, per-dimension, or per-query language IDs) is in the global language. The global language ID defaults to en (US English) if left unspecified.

The following example shows the Endeca Manager Settings dialog (in Developer Studio) configured to use the --lang flag set to zh-CN for Simplified Chinese:


Specifying a Per-Record Language ID

If your application data is organized so that all the data in a record is in a single language but different records are in different languages, you should use a per-record language ID. This scenario is common in applications that use the Content Acquisition System, because in those applications each record represents an entire document, which is usually all in a single language, while different documents may be in different languages.

To specify a per-record language ID, add a property or dimension named Endeca.Document.Language to your records. This is the default name of the property created by the ID_LANGUAGE expression in Forge, so use of that expression automatically creates a per-record language ID. The value of the property or dimension should be a valid RFC-3066 or ISO-639 language ID.

Specifying a Per-Dimension/Property Language ID

If your application tends to have mixed-language records, and the languages are segregated into different dimensions or properties, use per-dimension/property language IDs. For example, your data may have an English property called Description and a Spanish property called Descripción. In this case, because an individual record can have both English and Spanish text, a per-property language ID would be more appropriate than a per-record language ID.

To specify per-dimension/property language IDs, create a file called db_prefix.languages.xml whose contents list the dimensions and/or properties for which you want to specify a language. Use one KEY_LANGUAGE element for each dimension or property. For details on this element, see the Endeca XML Reference.

Note: You cannot create this file in Developer Studio. You must create it with a text editor. Place the file in the directory where the project XML files reside.

The following example illustrates a languages.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE LANGUAGES SYSTEM "languages.dtd">
<LANGUAGES>
  <KEY_LANGUAGE NAME="Description" LANGUAGE="en"/>
  <KEY_LANGUAGE NAME="Descripción" LANGUAGE="es"/>
  <KEY_LANGUAGE NAME="Beschreibung" LANGUAGE="de"/>
</LANGUAGES>

In the example, three dimensions are configured for English, Spanish, and German, respectively.

Note: This feature is not supported when using the Endeca Manager.

Specifying a Per-Query Language ID

The ENEQuery and UrlENEQuery classes in the Endeca Presentation API have a setLanguageId() method, which you use to tell the Navigation Engine what language full text queries are in. Note that in the .NET version of the API, the member is called the LanguageId property.

If you have enabled the language-specific spelling correction feature, a per-query language ID enables the Navigation Engine to select the appropriate dictionary for a given query.

For details on the ENEQuery and UrlENEQuery class members, see the Endeca Javadocs or the appropriate Endeca API Guide.

The following Java code snippet shows how to set French (using its language code of “fr”) as the language of any text portion of the query (such as search terms):

// Create a Navigation Engine query
ENEQuery usq = new UrlENEQuery(request.getQueryString(), "UTF-8");
// Set French as the language for the query
usq.setLanguageId("fr");
// Set other query attributes
...
// Make the request to the Navigation Engine
ENEQueryResults qr = nec.query(usq);

If no per-query language ID is specified, the Navigation Engine uses the global language ID, which defaults to en (US English) if not set specifically.

Configuring Language-Specific Spelling Correction

You can enable language-specific spelling correction to prevent queries in one language from being spell-corrected to words in a different language.

This feature works by creating separate dictionaries for each language. The dictionaries are generated from the source data and therefore require that the source data be tagged with a language ID. You should also use a per-query language ID, so that the Navigation Engine can select the appropriate dictionary for a given query.

Note: This feature is not supported when using the Endeca Manager.

To enable the language-specific spelling correction feature, create a db_prefix.spell_config.xml file with the following text:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SPELL_CONFIG SYSTEM "spell_config.dtd">
<SPELL_CONFIG>
  <SPELL_ENGINE>
    <DICT_PER_LANGUAGE>
      <ESPELL/>
    </DICT_PER_LANGUAGE>
  </SPELL_ENGINE>
</SPELL_CONFIG>

See the Endeca XML Reference for more information about the spell_config.xml file and its elements.

Note: You cannot create this file in Developer Studio. You must create it by hand using a text editor.

If a spell_config.xml file exists, it overrides the use of these parameters to the Dgidx --spellmode option:

• espell

• aspell

• aspell_OR_espell

• aspell_AND_espell

The language-specific spelling correction feature uses the Espell language engine, which is part of the base product. The Aspell language engine supports only English, and so it is not supported for this feature.

Using Encoding in the Web Application

If you are using internationalized data in your Web application, you should be aware of the encoding (character set) requirements described in the following two sections.

Setting the Encoding for URLs

The UrlENEQuery and UrlGen classes require that you specify a character encoding so that they can properly decode URLs. For example, a URL containing %E5%8D%83 refers to the Chinese character for “thousand” if using the UTF-8 encoding, but refers to three accented European letters if using the windows-1252 encoding.
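The effect of choosing the wrong encoding can be reproduced with standard JDK classes alone. The following standalone sketch (it uses only `java.net.URLDecoder`, not the Endeca API) decodes the same escape sequence under both encodings:

```java
import java.net.URLDecoder;

public class UrlEncodingDemo {
    public static void main(String[] args) throws Exception {
        String encoded = "%E5%8D%83";
        // Interpreted as UTF-8, the three bytes form the single
        // character U+5343, the Chinese character for "thousand"
        String asUtf8 = URLDecoder.decode(encoded, "UTF-8");
        // Interpreted as windows-1252, the same three bytes decode
        // to three separate single-byte characters
        String asCp1252 = URLDecoder.decode(encoded, "windows-1252");
        System.out.println(asUtf8.length() + " " + asCp1252.length()); // prints "1 3"
    }
}
```

This is why UrlENEQuery and UrlGen must be told which encoding the URL was produced with: the bytes alone are ambiguous.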

The following Java code snippet shows how to instantiate a UrlGen object using the UTF-8 character encoding:

// Create request to select refinement value
urlg = new UrlGen(request.getQueryString(), "UTF-8");

For details on these classes, see the Endeca Javadocs or the appropriate Endeca Presentation API Guide.


Setting the Page Encoding

Your application should choose a suitable output encoding for the pages it produces. For example, a multi-lingual European site might choose the windows-1252 encoding, while a Chinese site might choose GB2312 or Big5. If you need to support all languages, we recommend using the UTF-8 encoding.
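One quick way to check whether a candidate page encoding can represent your content is the JDK's CharsetEncoder. This is a standalone illustration, not part of the Endeca API:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) {
        String sample = "\u5343"; // the Chinese character for "thousand"
        // windows-1252 covers Western European languages only,
        // so it cannot encode a Chinese character
        boolean inCp1252 = Charset.forName("windows-1252").newEncoder().canEncode(sample);
        // UTF-8 can encode every Unicode character
        boolean inUtf8 = StandardCharsets.UTF_8.newEncoder().canEncode(sample);
        System.out.println(inCp1252 + " " + inUtf8); // prints "false true"
    }
}
```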

Viewing Navigation Platform Logs

Log messages output by the Navigation Platform binaries (Forge, Dgidx, Dgraph, and so forth) are in the UTF-8 encoding. Most common UNIX/Linux shells and terminal programs are not set up to display UTF-8 by default and will therefore display some valid characters as question marks (?).

If you find unexpected question marks in the data, first verify that it is not simply a display issue: inspect the raw bytes with the od command on Linux, or view the output with a UTF-8-compatible display.
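For example, the following dumps the raw bytes of a known UTF-8 sequence so you can tell real data from a display problem. This assumes a POSIX shell with the standard od utility; a literal question mark in the log would appear as the single byte 3f instead:

```shell
# The character "thousand" (U+5343) is three bytes in UTF-8: e5 8d 83
# (octal escapes \345 \215 \203 keep this portable across shells)
printf '\345\215\203' | od -An -tx1
```

If od shows the multi-byte sequence, the data is intact and only the terminal's display is at fault.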


Index

A
aggregated records
  creating record queries 114
  displaying records and properties 115
  enabling a dimension for 108
  enabling a property for 108
  ENE URL query parameters 112
  introduced 105
  methods for rollup keys 109
Agraph
  control script 386
  introduced 371
  partial updates 355
  partitions 372
  performance 386
  provisioning 378
  query processing 372
  replicas 379
  running 383
  troubleshooting 385
audience for this guide xvi
authentication
  configuring basic 89
  disabling for a host 96
  example key ring file 90

B
baseline updates
  defined 319
  pipeline details 322
Boolean syntax for record filters 145
boot-strapping server authentication 95
bulk export of records
  Developer Studio configuration 137
  ENE URL query parameters 138
  introduced 137
  objects and method calls 138
  performance 144

C
CA_DB element 95
CAS
  converting documents to text 47
  crawl types 27
  creating a full crawling pipeline 39–41


  creating a record adapter to read documents 41–43
  creating a record manipulator 43–50
  handling crawler errors 72–81
  identifying the language of documents 50–52
  introduced 23
  Perl manipulator 54–55
  property name syntax 36–37
  redundant URLs 30
  reference implementation 26–27
  removing document body properties 52–54
  removing unnecessary records after a crawl 68–71
  RETRIEVE_URL expression 45
  root URL extraction settings 60–63
  security information 24
  source documents and Endeca records 31–36
  source formats supported by ProFind 81
  specifying root URLs to crawl 59–60
  spiders 55–59
  supporting components 25–26
  URL and record processing 28–30
  viewing all properties generated by 38–39
client authentication 97
components that support CAS 25
configuring HTTPS 94
Content Acquisition System. See CAS
content spotlighting 257
converting documents to text in CAS 47
CONVERTTOTEXT expression 48
Coremetrics integration 253
crawl types in CAS 27
crawler errors 72

D
derived properties
  introduced 123
  performance 129
  presentation API development 125
  sample .NET code 127
  sample COM code 128
  sample Java code 126
DERIVED_PROP element 124
deterministic distribution strategy for partial updates 356
Dgraph configuration for partial updates 341
Dgraph partition 372
dimension adapter for partial updates 337
dimension server for partial updates 338
dimension server match count log, configuring 237
directory structure for partial updates 344
dynamic business rule
  CMS comparisons 259
  creating 278
  properties 286
  rendering results 301
  rule groups 301
  style 261
  target 260
  trigger 260
  types of results 262
  zone 261


E
encrypting keys with Forge 100
Endeca Navigation Engine distribution across multiple processors 371
ENE URL query parameters
  bulk export of records 138
  for aggregated records 112
  key-based record sets 133
  record filters 153
ERecEnumerator class 143
ERecIter class 142
example syntax of URL filters 64
Exchange Server authentication 98
EXCHANGE_SERVER element 99
expression evaluation of record filters 155
externally created dimensions
  compared to externally managed taxonomies 168
  Developer Studio configuration 169
  importing 174
  introduced 167
  XML requirements 170
externally managed taxonomies
  compared to an externally created dimension 178
  Developer Studio configuration 179–180
  introduced 177
  loading 188–189
  node ID requirements 184
  pipeline configuration 185–187
  transforming 187–188
  XML syntax 181–183
  XSLT mapping 181

F
featured records 287
Forge
  encoding for internationalized data 391
  encrypting keys with 100
Forge hierarchical logging
  configuring MustMatch messages 236–237
  introduced 227
  log appenders 231–234
  log levels 228, 228–229
  log.ini file 238–240
  message categories 229–231
formats
  Forge log appenders 235
  partial update records 334
full updates
  See baseline updates

G
global language ID 400

H
HTTP element 93
HTTPS element 96

I
ID_LANGUAGE expression 50
IF expression for partial updates 328
implementing "More..." dimension values 161
importing externally created dimensions 174
inert dimension values
  .NET example 165


  COM example 166
  Developer Studio configuration 162
  introduced 161
  Java example 164
  Presentation API development 163–166
internationalized data
  ENE URLs 405
  Forge encoding 391
  global language ID for 400
  ID_LANGUAGE expression 393
  introduced 389
  language identification 398
  language support table 394
  language-specific spelling corrections 403
  license keys for Asian language support 391
  page encoding 406
  per-dimension or property language IDs 401
  performance 397
  per-query language ID 402
  per-record language ID 401
  supplemental language pack 390
Iterator class for bulk export of records 141

K
KEY element 94
key-based record sets
  ENE URL query parameters 133
  introduced 131
  objects and method calls 133
  performance 132

L
language IDs
  global 400
  per-dimension 401
  per-property 401
  per-query 402
  per-record 401
large OR filter performance 156
large scale negation 157
license key for Asian language support 391
logging hierarchy, Forge 241

M
memory costs of record filters 155
merchandising 257
  See also dynamic business rule
  tool workflow 259, 268
multithreaded mode
  associated costs 246
  Dgraph configuration 247
  Intel processors 249
  introduced 243
  Linux 250
  performance 247–248
  Solaris 251
  Windows 251
MustMatch messages 236

N
node ID requirements for externally managed taxonomies 184
non-navigable dimension values
  See inert dimension values


O
objects and method calls
  aggregated record rollup keys 109
  bulk export of records 138
  key-based record sets 133

P
PARSE_DOC expression 49
part list performance 156
partial updates
  adding other control script bricks 352
  capabilities 321
  control script development 343
  control script development for Agraph 367
  deterministic distribution strategy 356
  Dgidx configuration 343
  Dgraph configuration 341
  difference from baseline updates 319
  dimension adapter 337
  dimension server 338
  directory structure 344
  format of update records 334
  IF expression for record manipulator 328
  in Agraph deployments 355
  introduced 319
  naming format of data files 339, 358
  Perl expression for record manipulator 362
  pipeline details 324
  random distribution strategy 356
  record adapter component 326
  record adapter for Agraph deployment 360
  record manipulator component 327
  record manipulator for Agraph deployment 361
  record specification attribute 340
  reference implementation 322
  update adapter 336
  update adapter for Agraph deployment 365
  UPDATE_RECORD expression 328
  URL update command parameters 353
passing phrases with Forge 100
per-dimension language ID 401
performance
  Agraph 386
  bulk export of records 144
  derived properties 129
  internationalized data 397
  key-based record sets 132
  multithreaded mode 247
  user profiles 318
Perl expression for partial updates 362
Perl manipulator in CAS 54
per-property language ID 401
per-query language ID 402
per-record language ID 401
per-record memory overhead 144
promoting records with dynamic business rules 258
properties in CAS
  name syntax 36
  viewing generated 38
provisioning the Agraph in Web Studio 378
PROXY element 100


proxy server authentication 99

R
random distribution strategy for partial updates 356
REALM element 93
record adapter
  creating for Agraph partial updates 360
  creating for CAS 41
  creating for partial updates 326
record filters
  data configuration 151–152
  Developer Studio configuration 151
  ENE query syntax 147
  ENE URL query parameters 153
  expression evaluation 155
  introduced 145
  large scale negation 157
  memory costs 155
  Navigation Engine configuration 152
  performance 154
  syntax 146
  XML file syntax 149
record manipulator
  creating for Agraph partial updates 361
  creating for CAS 43
  creating for partial updates 327
redundant URLs in CAS 30–31
reference implementation, CAS 26
REMOVE_RECORD expression 70
removing document body properties in CAS 52
RETRIEVE_URL expression 45
root URL extraction settings for CAS 60
root URLs for the spider to crawl 59
rule groups 301

S
SITE element 91
specifying root URLs to crawl in CAS 59
spelling correction, language-specific 403
spiders
  specifying proxy servers 66
  specifying record sources 65
  specifying timeouts 65
Stratify document classification
  building a taxonomy 202
  creating a pipeline 204–213
  dimension value synonyms 219
  Endeca integration with 196–199
  exporting a taxonomy 203–204
  integrating the taxonomy 213–215
  introduced 191
  loading the dimensions 216–218
  mapping dimensions 222
  required tools 200–201
  terms and concepts 193–196
styles for dynamic business rules, creating 275
supplemental language pack, installing 390
syntax of record filters 146

T
target for promoting records, creating 284
taxonomy, developing a Stratify 201–223
trigger
  creating 281


  dimension value 279
  keyword 280
  time 280
  user profile 281

U
update adapter for Agraph partial updates 365
update adapter for partial updates 336
UPDATE_RECORD expression for partial updates 328
URL and record processing in CAS 28
URL filters for CAS 64
user profiles
  .NET example 317
  COM example 317
  Developer Studio configuration 314–315
  introduced 311
  Java example 316
  objects and method calls 315–316
  performance 318
  scenario 312–313

W
Web crawling with authentication 89

X
XML syntax for dimension hierarchy 171

Z
zones for dynamic business rules, creating 271
