Post on 02-Feb-2016
April 22, 2023
FDsys
GPO FDsys – Search Engine Configuration
Data Analysis and Parsing
Agenda:
• Data Management Definition
• Parsing
• fdsys.xml
Data Management Definition (DMD)
• Purpose of the Data Management Definition (DMD)
  – Define collection-specific metadata elements
  – Specify roles for the granules, if applicable
  – Collection-specific schema definition for FDsys.xsd
  – Define mappings of metadata elements for Documentum and FAST
  – Define mappings to metadata standards
• One DMD for each collection
• PMO & dev team collaborative effort for CDM documentation development
• Is both a document and a process
The DMD Defines how Data Flows Through FDsys
Business Process Overview (diagram):
• Submission: Ingest Process; Congressional Submission Workflow (folder); Congressional Submission Workflow (interactive); Migration Application; Bulk Submission Process; Submission Process
• Processing: Package Updating Workflow; Access Processing Workflow; Publishing Process; ILS Integration Application
• Preservation: Archival Processing Workflow; Archival Updating Workflow
• Access: Public User Access & Delivery Application; Authorized User Access & Delivery Application
Questions the DMD answers:
• What renditions are available?
• How will metadata be extracted and merged?
• What manual edits may be required?
• How are PDF files processed?
• How will the HTML rendition be created?
• How will parser data and input files be validated?
• What’s on the search form?
• How will the content and metadata be indexed?
• What are the navigators?
• How will the MODS be created?
• How are search results formatted?
• What do content URLs look like?
DMD – Table of Contents
1. General Description
2. fdsys.xml Schema Elements
3. Renditions, Plant Processing and Interactions
4. Parser Definition – Extraction patterns and algorithms
5. Content Management
6. Content Publishing and Index
7. Search and Browse
   • Search results, navigators, and collection browsing
8. Content Delivery
   • URLs, content-detail, Front page, actions
9. mods.xml mappings
Metadata Flow Diagram
[Diagram: Submission → Original Content → Parse → fdsys.xml → Documentum (validate, clean up, normalize, and extend metadata and renditions) → publish → ACP Cache (fdsys.xml and content files), with a FAST Notification. From the cache, .xslt transforms produce index.xml (FAST XML, sent by Index Push to the FAST indexes), mods.xml (MODS XML), and PREMIS. Per-collection field and navigator maps connect the index to the user-facing views: Search Form (normalize data and map to FQL), Search Results, Content Detail, Package TOC, and Collection Browsing.]
Metadata Flow Diagram
[Same diagram, annotated with what the DMD defines at each step: parsing rules; CMS metadata mapping; fdsys.xml structure; search index field mapping; mods mapping; content-detail mapping; search-form mapping; search results mapping; browse algorithm.]
Federal Register Granules
• Each article is a granule
• Each Part is a single granule
• There are no higher-level granules
• Sections are not preserved as independent granules
Federal Register Example Metadata
• agencies
• title
• action
• summary
• dates
• contact
• FR Doc Number
• Billing Code
Content Files
[Diagram: input files (SGML, locator, pdf-submitted) are processed into renditions (SGML, locator, text, pdf (public)). Labeled steps: CDTP; extract granules; OCR embedded images; create “FrontMatter”, “ReaderAids”, and “Issue” PDF files.]
Content Files – Creating the HTML Rendition
[Diagram: the text rendition and the pdf-submitted files are combined into the html (public) rendition. Labeled steps: add HTML headers and header metadata; add URL and e-mail links; extract images as JPEG; OCR images (longdesc text); embed image tags.]
Extracting Metadata
[Diagram: the SGML content, the CDTP output, and the SGML TOC (TOC headings) are each parsed, and the results are combined into the Merged Metadata through “overwrite” and “add” operations.]
• Metadata is merged based on the FR Doc Number
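The merge keyed on the FR Doc Number can be sketched as below. This is a minimal illustration, not the FDsys parser (which is written in Java): the field names, the sample values, and the choice of which source overwrites and which only adds are all assumptions made for the example.

```python
def merge_by_doc_number(base, overwrite_source, add_source):
    """Merge per-article metadata dicts keyed by FR Doc Number.

    Assumed semantics: "overwrite" replaces existing values,
    "add" only fills in fields that are still missing.
    """
    merged = {doc: dict(fields) for doc, fields in base.items()}
    for doc, fields in overwrite_source.items():
        merged.setdefault(doc, {}).update(fields)
    for doc, fields in add_source.items():
        target = merged.setdefault(doc, {})
        for name, value in fields.items():
            target.setdefault(name, value)
    return merged


# Hypothetical parse results from three sources for one article
sgml = {"E6-1423": {"title": "Draft title", "action": "Notice"}}
cdtp = {"E6-1423": {"title": "Final title"}}
toc = {"E6-1423": {"agency": "Education Department", "action": "Proposed rule"}}

merged = merge_by_doc_number(sgml, cdtp, toc)
print(merged["E6-1423"])
```

Because every source is keyed by the same document number, the merge needs no positional alignment between the files.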
Search Results
73 FR 22020 - Title I-Improving the Academic Achievement of the Disadvantaged [PDF 123 KB] Federal Register. Proposed Rules. Notice of proposed rulemaking. RIN 0324-AJ10. Wednesday, April 23, 2008. ...The Secretary proposes to amend the regulations governing programs administered under Part A of Title I of the Elementary and Secondary Education Act of 1965, as amended (ESEA)... More Information...
Fields shown in the example result: volume; firstpage; title; collection; section; action (first 20 chars); rin; publishdate; teaser; link to content-detail
FR Navigators
• Section
• Agency
• CFRs
  – Hierarchical
    + 15 CFR
      - Part 12
      - Part 13
      - Part 14
    + 16 CFR
      - Part 412
      - Part 413
Collection Browsing
daynav
monthnav
yearnav
agencynav
Advanced Search Form
Package-Level URLs
• Package Content Detail
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/content-detail.html
• Package Metadata Standards
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/mods.xml
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/premis.xml
• Package Table of Contents
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/toc.html
• Today’s Table of Contents
  – http://www.gpo.gov/fdsys/html/FR/todays_toc.html
Granule-Level URLs
• HTML and PDF Files
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/html/E6-1423.html
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/pdf/E6-1423.pdf
• Granule Content Detail
  – http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/content-detail.html
• Granule Metadata Standards
  – http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/mods.xml
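The package-level and granule-level URL patterns can be captured in a few helper functions. The function names are hypothetical and the patterns are inferred only from the example URLs on these two slides, so treat this as a sketch of the scheme rather than the application's actual routing code.

```python
BASE = "http://www.gpo.gov/fdsys"


def package_url(package_id, resource):
    """Package-level resource, e.g. content-detail.html, mods.xml, toc.html."""
    return f"{BASE}/pkg/{package_id}/{resource}"


def granule_file_url(package_id, granule_id, fmt):
    """Granule content file; fmt is 'html' or 'pdf' in the slide examples."""
    return f"{BASE}/pkg/{package_id}/{fmt}/{granule_id}.{fmt}"


def granule_url(package_id, granule_id, resource):
    """Granule-level resource, e.g. content-detail.html or mods.xml."""
    return f"{BASE}/granule/{package_id}/{granule_id}/{resource}"


print(package_url("FR-2006-01-01", "mods.xml"))
print(granule_file_url("FR-2006-01-01", "E6-1423", "pdf"))
print(granule_url("FR-2006-01-01", "E6-1423", "content-detail.html"))
```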
Content Detail
Sample UI
Parsing
Parsing Overview
• Runs regular expressions to extract metadata
  Regular Expression: (Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+)
  Example: Pub. L. 109-130
  Produces: <law congress="109" number="130"/>
• Written in Java
• Called from Documentum when a package needs to be parsed
• Produces an instance of fdsys.xml
  – Parsing has an internal XML format (called the “raw” XML) which is transformed to produce the fdsys.xml
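The regular-expression step can be tried out in a few lines. The actual parser is written in Java; this sketch uses Python purely for illustration, copies the pattern from the slide unchanged, and uses a hypothetical function name and sample sentence.

```python
import re
import xml.etree.ElementTree as ET

# Pattern copied verbatim from the slide
PUBLIC_LAW = re.compile(r"(Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+)")


def extract_public_laws(text):
    """Return one <law> element per public-law citation found in text."""
    laws = []
    for match in PUBLIC_LAW.finditer(text):
        attributes = {"congress": match.group(2), "number": match.group(3)}
        laws.append(ET.Element("law", attributes))
    return laws


laws = extract_public_laws("as amended by Pub. L. 109-130 and Public Law 110-5")
for law in laws:
    print(ET.tostring(law, encoding="unicode"))
```

Note that the dots in the slide's pattern are unescaped, so they match any character; a stricter variant would write them as `\.`.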
Parser Foundation Classes
[Diagram: foundation classes PParser, PPackage, PRendition, PFile, and PGranule, together with PContainer. Collection-specific derived classes shown: FRRendition and FRFile; USCODEParser, USCODEPackage, USCODERendition, USCODEFile, and USCODEGranule.]
• Foundation classes handle 95% of parsing needs
• Derived classes handle all special cases
PContainer
• Takes patterns and produces elements
• Holds XML at each level of the parsing process
[Diagram: a PPattern, "(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)", is used by PContainer and produces an XML fragment, <publicLaw><congressNum>109</congressNum><lawNum>123</lawNum></publicLaw>, which is stored in the container's XML DOM.]
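The relationship between a pattern and its container can be sketched as follows. The class names are borrowed from the slide, but the constructors, the `apply` method, and the group-to-element mapping are assumptions for illustration; the real classes are Java and their API is not shown in this deck.

```python
import re
import xml.etree.ElementTree as ET


class PPattern:
    """A regex plus the element it produces (hypothetical shape)."""

    def __init__(self, regex, element, children):
        self.regex = re.compile(regex)
        self.element = element
        self.children = children  # child element name -> regex group number


class PContainer:
    """Holds the XML produced so far at one level of parsing."""

    def __init__(self, name):
        self.xml = ET.Element(name)

    def apply(self, pattern, text):
        # Each match becomes one fragment appended to this container's DOM
        for match in pattern.regex.finditer(text):
            node = ET.SubElement(self.xml, pattern.element)
            for child, group in pattern.children.items():
                ET.SubElement(node, child).text = match.group(group)


public_law = PPattern(
    r"(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)",
    "publicLaw",
    {"congressNum": 2, "lawNum": 3},
)
granule = PContainer("granule")
granule.apply(public_law, "Authority: Pub. L. 109-123.")
print(ET.tostring(granule.xml, encoding="unicode"))
```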
Parser Foundation Classes
[Diagram: PParser, PPackage, PRendition, PFile, and PGranule are PContainers, each holding an XML DOM. The XML from the lower levels is combined upward (labels: append, append, priority merge), and an XSLT transform produces the output xml.]
Parsing XML Documents
[Diagram: PParser, PPackage, PRendition, and PFile are PContainers, each holding an XML DOM. Labels: append, priority merge, an XSLT step associated with bills.xml, and an XSLT step associated with fdsys.xml.]
Other Parsing Considerations
• Heuristics testing is integrated into the parsing
  – PEHelper: Checks for heuristics and adds “quality=” attributes
• Output can be automatically Schema-Validated
  – Schema-Validation is run on all fdsys.xml formats produced by the parser
• Parser Validation Tool
  – Used by GPO to validate that parsers meet the 90% Service Level Agreement for accuracy
  – Randomly selects 100 documents or granules
  – Displays metadata & original text for manual review
  – Produces Validation Report
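The sampling and pass/fail logic of a validation tool like this can be sketched as below. The function names, the report fields, and the simulated review outcome are assumptions; only the sample size of 100 and the 90% threshold come from the slide.

```python
import random


def select_sample(item_ids, sample_size=100, seed=None):
    """Randomly pick up to sample_size documents or granules for review."""
    rng = random.Random(seed)
    ids = list(item_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def validation_report(review_results, required_accuracy=0.90):
    """Summarize manual review results (item id -> True if metadata was correct)."""
    reviewed = len(review_results)
    correct = sum(1 for ok in review_results.values() if ok)
    accuracy = correct / reviewed if reviewed else 0.0
    return {
        "reviewed": reviewed,
        "correct": correct,
        "accuracy": accuracy,
        "meets_sla": accuracy >= required_accuracy,
    }


sample = select_sample(range(5000), seed=1)
# Simulated reviewer verdicts: the first 8 sampled items are marked incorrect
reviews = {item: position >= 8 for position, item in enumerate(sample)}
report = validation_report(reviews)
print(report)
```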
fdsys.xml
June 13, 2008
GPO FDsys – Data Model
FDsys.xml Purpose
• Internal container of metadata related to package
• Is a detailed representation/model of the data structure across all of FDsys
• Reduces duplication of data across metadata formats
• Reduces number of required transformations
• Can be transformed into standard schemas including:
  – METS
  – MODS
  – PREMIS
FDsys.xml General Structure
• Header
• Content
• Metadata
FDsys Publish and Search
Publish and Search
Agenda:
• FDsys Publish
• Search Engine Configuration
• Search Engine Application Services
FDsys Publish
High-Level SW Components
• Submission Component
  – Submission Workflows
  – WebTop Submission User interfaces
  – Content Parsers
  – Migration Tool
• Ingest Component
• Content Processing
  – Processing Workflows
  – WebTop User Interfaces
  – Package Management
  – ILS Integration
• Archive Preservation
  – Archival Workflows
  – WebTop User Interfaces
  – Preservation Process
• Access Component
  – Full-Fledged Search Application
  – Full Text Search Engine
  – Public Content Access and Delivery
• Infrastructure Component
  – COTS-based LDAP Integration
Content Publishing - Overview
• Communicates from Documentum to Access
• From: Documentum
  – Extract fdsys.xml & premis.xml
  – Extract renditions and content files
  – Uses Documentum native DFC calls
• To: ACP Cache
  – Stores metadata and content files
• To: FAST ESP Search Engine
  – Converts fdsys.xml to FAST.xml -> to indexer
    • Includes the mods.xml (indexed into ESP)
    • ESP pulls in content files automatically
  – Uses FAST ESP content_api & search_api calls
Component Interfaces
[Diagram, marked “UPDATE THIS” in the source: Package Updating Workflow, Access Processing Workflow, Content Publishing, FAST, ACP Cache, CMS, Access Web Application.]
Component Interfaces
[Diagram: Content Management System, Content Publishing, FAST, and ACP Cache. Content Publishing pulls from the Content Management System and pushes to FAST and the ACP Cache; the connections are labeled HTTP Commands.]
Major Architectural Decisions
• Pull from Documentum, not Push
  – Maintenance of Access Subsystem databases becomes the responsibility of the Access Subsystem
  – Data is pulled from Documentum only as needed
    • Avoids overflow/queuing problems
  – Allows multiple access systems to be fielded
• Search for Deletes in FAST
  – Packages can contain many granules
  – When updating the FAST indexes, use search to find the list of all nested granules in the indexes
  – Guaranteed to avoid any “orphan” granule problems
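The “search for deletes” idea can be illustrated with a toy in-memory index. This does not use the FAST ESP API; the `InMemoryIndex` class, the `packageid` field name, and the document-ID format are stand-ins invented for the sketch.

```python
class InMemoryIndex:
    """Toy stand-in for a search index (not the FAST API)."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, fields):
        self.docs[doc_id] = dict(fields)

    def search(self, field, value):
        return [d for d, f in self.docs.items() if f.get(field) == value]

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


def republish_package(index, package_id, granules):
    """Replace a package's granules, using search to find what to delete."""
    # Ask the index itself which granules it currently holds for the package,
    # so granules dropped from the new version cannot linger as orphans.
    for doc_id in index.search("packageid", package_id):
        index.delete(doc_id)
    for granule_id, fields in granules.items():
        index.add(f"{package_id}/{granule_id}", {**fields, "packageid": package_id})


index = InMemoryIndex()
republish_package(index, "FR-2006-01-01", {"E6-1423": {"title": "A"}, "E6-1424": {"title": "B"}})
republish_package(index, "FR-2006-01-02", {"E6-1500": {"title": "C"}})
# The updated package no longer contains E6-1424
republish_package(index, "FR-2006-01-01", {"E6-1423": {"title": "A, revised"}})
print(sorted(index.docs))
```

The publisher never has to remember what it sent previously; the index is the source of truth for what must be removed.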
ACP Cache Directory Structure
Proposed ACP Cache Directory (limits entries per directory to 256):
/ACP/hh/hh/hh/pkgXXXXXXXXXX/<package-contents>
• hh/hh/hh: Hexadecimal representation of the lower 24 bits of the MD5 hash of the package ID
• pkgXXXXXXXXXX: Package ID
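The proposed layout can be sketched with the standard `hashlib` module. Three details are assumptions not settled by the slide: the package ID is hashed as UTF-8 text, “lower 24 bits” is read as the last three bytes of the MD5 digest, and the directory name is formed as `pkg` followed by the package ID.

```python
import hashlib


def acp_cache_path(package_id):
    """Build /ACP/hh/hh/hh/pkg<id>/ from the low 24 bits of the MD5 hash."""
    digest = hashlib.md5(package_id.encode("utf-8")).digest()
    low_three_bytes = digest[-3:]  # lower 24 bits of the 128-bit hash
    levels = "/".join(f"{byte:02x}" for byte in low_three_bytes)
    return f"/ACP/{levels}/pkg{package_id}/"


print(acp_cache_path("FR-2006-01-01"))
```

Each `hh` level is one byte, so a directory at any of the three levels can hold at most 256 subdirectories, which is where the 256-entry limit comes from.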
Metadata Flow Diagram
[Diagram repeated for the publish step. In this version the Documentum step reads “Validate Values & Normalize”, and the labels read granule file(s) and search.xml in place of content file(s) and mods.xml.]
Implementation Detail
[Diagram: FDsys Publish sits behind a servlet wrapper and receives processing requests via URL. It reads fdsys.xml through the Documentum APIs, keeps fdsys.xml and content files for each package in the ACP Cache, sends individual granule files to FAST Content Processing in the FAST Search Engine, and uses FAST search for deletes.]
Search Engine Configuration Design
FAST System – Hardware & Network
[Diagram: five “index & search” nodes, five “search” nodes, a “publish & admin” node, document processors, and the Web Application.]
FAST System – Indexing Flow
[Diagram: the same five “index & search” nodes and five “search” nodes, with the indexing path through the “publish & admin” node and the document processors.]
FAST System – Search
[Diagram: the same five “index & search” nodes and five “search” nodes, with five QR servers between them and the Web Application.]
Search Engine Sizing: Columns
• Total Number of Documents
  – Estimated 10 million records
    • Each granule = 1 Search Engine document
  – Allow 2x expansion for estimation errors and growth
  – Estimated 20 million records
• Sizing Recommendation:
  – FAST recommends: 5 million records per column
    • For public facing web sites
  – 5 columns: to account for the large number of navigators
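The column arithmetic above can be checked directly. The variable names are illustrative; the figures are the ones on the slide.

```python
import math

estimated_records = 10_000_000      # one search engine document per granule
expansion_factor = 2                # allowance for estimation errors and growth
records_per_column = 5_000_000      # FAST guidance for public facing web sites

planned_records = estimated_records * expansion_factor
baseline_columns = math.ceil(planned_records / records_per_column)

# The slide recommends one column more than the baseline,
# citing the large number of navigators.
recommended_columns = 5

print(planned_records, baseline_columns, recommended_columns)
```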
Search Engine Sizing: Disk
• Year 2006 FR – Index Sizing Test

  Documents   Text     Fixml    Index    Total FAST
  31,500      500 MB   230 MB   604 MB   834 MB

• Scale to 20 million documents
  – Fixml: ~150 GB
  – Index: ~420 GB
• Total index space required:
  – 150 GB + (420 GB × 2) = ~1 TB
  – Add 50% for estimation error, total = 1.5 TB
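The disk estimate can be reproduced step by step. A plain linear scale-up of the test figures gives roughly 146 GB of fixml and 383 GB of index; the slide quotes ~150 GB and ~420 GB, and how the index figure was rounded up is not stated, so the sketch below uses the slide's quoted values for the totals. Decimal units (1 GB = 1000 MB) are an assumption.

```python
TEST_DOCS = 31_500
TEST_FIXML_MB = 230
TEST_INDEX_MB = 604
TARGET_DOCS = 20_000_000

# Straight linear scaling from the 2006 FR test, for comparison
scale = TARGET_DOCS / TEST_DOCS
linear_fixml_gb = TEST_FIXML_MB * scale / 1000
linear_index_gb = TEST_INDEX_MB * scale / 1000

# Rounded estimates as quoted on the slide
FIXML_GB = 150
INDEX_GB = 420

total_gb = FIXML_GB + 2 * INDEX_GB   # slide formula: fixml + index * 2
with_margin_gb = total_gb * 1.5      # add 50% for estimation error

print(round(linear_fixml_gb), round(linear_index_gb), total_gb, with_margin_gb)
```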
Search Engine Sizing: QPS
• Queries per second – Estimated from GPO Access
  – 0.8 QPS (across the whole day)
  – Estimated peak: 2.4 QPS (1/2 of queries in 4 hours)
• Estimated Peak QPS for FDsys:
  – Factor for improved search interface: 3x
  – Factor for growth: 2x
  – Estimated: 2.4 x 2 x 3 = ~15 QPS
  – Correlates with other websites known to ST
• Each row: 20-30 QPS
  – Therefore: 1 row for query performance
• Recommend: 2 rows
  – 2nd row for redundancy, failover, and maintenance
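The QPS estimate follows from a short chain of multiplications, sketched here with the slide's figures. Using the low end of the 20-30 QPS per-row range is a conservative assumption made for the example.

```python
import math

avg_qps = 0.8          # GPO Access average across the whole day
peak_share = 0.5       # half of the day's queries...
peak_hours = 4         # ...arrive within four hours

# Half of 24 hours of traffic compressed into 4 hours
peak_qps = avg_qps * 24 * peak_share / peak_hours

interface_factor = 3   # improved search interface
growth_factor = 2
fdsys_peak_qps = peak_qps * growth_factor * interface_factor

row_capacity_qps = 20  # low end of the 20-30 QPS per row
rows_for_load = math.ceil(fdsys_peak_qps / row_capacity_qps)
recommended_rows = rows_for_load + 1  # second row for redundancy and maintenance

print(round(peak_qps, 1), round(fdsys_peak_qps, 1), rows_for_load, recommended_rows)
```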
Metadata Flow Diagram
[Diagram repeated, highlighting the search index field mapping step.]
Mapping to the Index Profile
[Diagram: fdsys.xml and the ACP Cache content files are mapped through index.xslt, mods.xslt, and the FAST extractors, then merged into the index profile fields: results bundle, xml scope field, grank1-6, body, publishdate, title, collection-specific fields, and standard navigators. The mapped metadata serves search results, simple search, and navigation.]
Collections and Codes
• Three types of collections:
  – Processing Collection = “collectionCode” or “processingCode”
    • Parsing, submission, processing, workflow
    • Chooses which index.xslt and search.xslt to apply
  – FAST Index Collection
    • One for each processing collection
    • Allows easily deleting all documents in a collection
  – “Access Collection” = “accode”
    • Re-group documents into collections for public users
    • 98% the same as the “Processing Collection”
      – Reports in the Congressional Record, FR Unified Agenda
    • Mapping is done in index.xslt
Simple Maintenance #1
• New Collections
  – Just add the collection (admin GUI)
  – Start feeding data
• Add new fields
  – Add the field to the index profile
  – Reload profile with a hot-update
• Backups
  – Turn off feeding
  – Wait for documents in process to finish up
  – Make index backups
Simple Maintenance #2
• Archiving Log Files
  – Simple file copy, can happen any time
• Correct Field Mapping Errors
  – Remove all documents in the FAST ESP collection
  – Re-index collection from ACP Cache
    • Get list of packages in the collection from Documentum
    • Does not require re-exporting (or re-publishing) the packages from CMS
• Reorganize Access Collections
  – Remove all documents from affected collections
  – Re-index affected collections from ACP Cache
Extensive Maintenance
• Examples:
  – New FAST Version
  – New FDsys Version
  – Complex index-profile changes (remove fields, major restructuring)
  – Re-organizing collections or field mapping while maintaining searches on the old snapshot
• Process:
  – Switch servers to “stand-alone” mode
  – Make changes
  – Restore normal server operations
Monitoring
• FAST standard monitoring tool (“Clarity”)
• Monitors query and indexing performance
• Built-in alerting mechanism
Backups
• data_fixml
  – Holds processed copy of the indexes
  – Can be used to reconstruct the indexes in about 4 hours (will need to be benchmarked)
• data_index
  – The complete indexes actually used for searching
• Configuration backup
• When restoring a backup:
  – Will need to re-push all content updates which occurred since the last backup
Disaster Recovery Scenarios
• Servers crash
  – FAST restarts them automatically
• Hanging server processes
  – Shut them down manually and restart them
• Incremental indexing overloads the system
  – Should not happen in FDsys
  – Can “slow down” incremental indexing until situation is corrected
• Severe incremental indexing problems
  – Revert to periodic batch index updates
Search Services API
Component Interfaces
[Diagram: FAST (FAST indexes), ACP Cache, Access Search Web App, Access Search API, and Content Delivery Web App. Labeled outputs: search results, collection browsing, browsing PDF, content-detail.]
Responsibilities: Search Services vs Web Application
Search Services API
• All communication to FAST
  – All FAST API calls
  – All FAST parameters
• All FAST FQL
  – User query strings and parameters to FAST FQL
• Raw data values
  – Allowed values, navigator values, search results field values, etc.
• Choosing Navigators
Web Application
• Choosing which fields when
  – Advanced Search Form
  – Search Results
• User-interface oriented field data
  – Display names, help text, display widgets
• Display value translation
  – Translating from raw data values to/from user-friendly values
Responsibilities: Search Services vs Web Application – Browse Trees
Search Services API
• Browse tree computation
  – Selecting nodes to return
  – Returning an ordered list of nodes
  – Caching search results
• Embargo Dates
Web Application
• Browse tree presentation
  – The definition of the levels
  – How many levels to display when
  – Presentation of tree
  – Translating raw data values to user-friendly values
• Content Detail Pages
Component Interfaces
[Diagram: Search Web Application, Search Services API (parsing, processing & caching), FAST Search API, and FAST Search Engine. Connections are labeled HTTP and Java method calls. The Search Services API reads XML configuration files: a Master file plus Collection Specific files (CongBills, CFR, CR, FR).]
Configuration File Contents
• Fields
  – for the Advanced Search Form
  – for field: searches
  – Allowed values
    • Fixed enumerated list
    • Enumerated list built from navigator
    • Numeric or Date Range
• Navigators per collection
• FAST ESP Search Engine Connection Info
• Templates to reformat data for display
Query Parsing: Syntax
• atom atom
  – defaults to AND
• atom and atom
• atom or atom
• atom before/# atom
• atom near/# atom
• atom adj atom
• atom
  – Atoms are space separated lists of characters, double-quoted strings, or parenthetical expressions
• not atom
• +atom
• -atom
• field:atom
• range(#,#)
• range(<date>,<date>)
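The definition of an atom above suggests a first parsing step: split the query into atoms before interpreting operators. The function below is a hypothetical sketch of that step only; it is not the Search Services implementation, it uses straight double quotes, and it does not translate anything to FQL.

```python
def split_atoms(query):
    """Split a query into atoms: words, double-quoted strings, or (...) groups."""
    atoms = []
    i, n = 0, len(query)
    while i < n:
        ch = query[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            # Quoted string: runs to the closing quote (or end of input)
            end = query.find('"', i + 1)
            end = n - 1 if end == -1 else end
            atoms.append(query[i:end + 1])
            i = end + 1
        elif ch == "(":
            # Parenthetical expression: track nesting depth to find its end
            depth, j = 0, i
            while j < n:
                if query[j] == "(":
                    depth += 1
                elif query[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            j = min(j, n - 1)
            atoms.append(query[i:j + 1])
            i = j + 1
        else:
            # Plain run of non-space characters (includes field:atom and +atom)
            j = i
            while j < n and not query[j].isspace():
                j += 1
            atoms.append(query[i:j])
            i = j
    return atoms


print(split_atoms('congress and (report or meeting or notice)'))
print(split_atoms('ways "and" means'))
```

Keeping a quoted `"and"` as a single atom is what lets a later stage tell the literal word apart from the `and` operator.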
Query Parsing: Examples
• hearing
• “congressional hearing”
• congressional adj hearing
• congressional hearing
• ways and means
• “ways and means”
• ways “and” means
• congressional or congress
• congress and (report or meeting or notice)
• congnum:range(103,110)
• congressional not report
• congressional -report
• +cardin congressional committee
• congressional not (committee report)
• congressional not (committee or meeting)
• representative near/10 cardin
• representative before/10 cardin
Derived Hierarchy: Example
• 110th Congress
  – House Bills
    • H.R. 1-200
    • H.R. 201-400
      – H.R. 201
      – H.R. 202
      – H.R. 203
        • Engrossed in House
        • Introduced in House
          . Condemning the persecution of labor rights advocates in Iran [PDF] [Text]
        • Referred in Senate
      – H.R. 204
    • H.R. 401-600
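The range level in this example (H.R. 1-200, H.R. 201-400, ...) is not stored in the data; it can be derived from the bill number. The sketch below assumes fixed buckets of 200, as in the example, and uses hypothetical function names.

```python
def bill_range_label(prefix, number, bucket_size=200):
    """Return the derived range node for a bill number, e.g. 'H.R. 201-400'."""
    start = ((number - 1) // bucket_size) * bucket_size + 1
    end = start + bucket_size - 1
    return f"{prefix} {start}-{end}"


def derive_hierarchy(prefix, numbers, bucket_size=200):
    """Group bill numbers under their derived range nodes, in numeric order."""
    tree = {}
    for number in sorted(numbers):
        label = bill_range_label(prefix, number, bucket_size)
        tree.setdefault(label, []).append(f"{prefix} {number}")
    return tree


tree = derive_hierarchy("H.R.", [202, 1, 201, 450])
for label, bills in tree.items():
    print(label, bills)
```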
Specified Hierarchy: Example
Parser Foundation Classes
[Diagram: PParser, PPackage, PFile, and PGranule are PContainers, each holding an XML DOM. Labels: append, append, XSLT, xml.]