October 2, 20150 FDsys GPO Fdsys – Search Engine Configuration FDsys Data Analysis and Parsing.

Page 1: October 2, 20150 FDsys GPO Fdsys – Search Engine Configuration FDsys Data Analysis and Parsing.

April 19, 2023

GPO FDsys – Search Engine Configuration

Data Analysis and Parsing

Page 2:

Data Analysis and Parsing

Agenda:

• Data Management Definition

• Parsing

• fdsys.xml

Page 3:

Data Management Definition (DMD)

Page 4:

Data Management Definition (DMD)

• Purpose of the Data Management Definition (DMD)
  – Define collection-specific metadata elements
  – Specify roles for the granules, if applicable
  – Collection-specific schema definition for FDsys.xsd
  – Define mappings of metadata elements for Documentum and FAST
  – Define mappings to metadata standards

• One DMD for each collection

• PMO & dev team collaborative effort for CDM documentation development

• Is both a document and a process

Page 5:

The DMD Defines how Data Flows Through FDsys

[Figure: Business Process Overview. Process areas shown: Submission, Processing, Preservation, Access. Components shown: Ingest Process, Congressional Submission Workflow (folder), Congressional Submission Workflow (interactive), Migration Application, Bulk Submission Process, Submission Process, Archival Processing Workflow, Archival Updating Workflow, Public User Access & Delivery Application, Authorized User Access & Delivery Application, Package Updating Workflow, Access Processing Workflow, Publishing Process, ILS Integration Application.]

Questions called out on the figure:

• What renditions are available?
• How will metadata be extracted and merged?
• What manual edits may be required?
• How are PDF files processed?
• How will the HTML rendition be created?
• How will parser data and input files be validated?
• What’s on the search form?
• How will the content and metadata be indexed?
• What are the navigators?
• How will the MODS be created?
• How are search results formatted?
• What do content URLs look like?

Page 6:

DMD – Table of Contents

1. General Description

2. fdsys.xml Schema Elements

3. Renditions, Plant Processing and Interactions

4. Parser Definition – Extraction patterns and algorithms

5. Content Management

6. Content Publishing and Index

7. Search and Browse
   • Search results, navigators, and collection browsing

8. Content Delivery
   • URLs, content-detail, Front page, actions

9. mods.xml mappings

Page 7:

Metadata Flow Diagram

[Figure: Metadata Flow Diagram. Recoverable labels: Submission, Original Content, Parse, fdsys.xml, Documentum, “Validate, cleanup, normalize, and extend metadata and renditions”, publish, content file(s), ACP Cache, .xslt transforms, MODS XML, PREMIS, index.xml, mods.xml, FAST XML, Index Push, FAST Notification, FAST indexes, Index, Search, “map fields [per collection]”, “map navigators [per collection]”, “normalize data and map to FQL”, and mock-ups of the Search Form, Search Results, Content Detail, Package TOC, and Collection Browsing displays.]

Page 8:

Metadata Flow Diagram

[Figure: the same Metadata Flow Diagram as on Page 7, with these additional annotations: parsing rules, CMS metadata mapping, fdsys.xml structure, search index field mapping, mods mapping, content-detail mapping, search-form mapping, search results mapping, browse algorithm.]

Page 9:

Federal Register Granules

• Each article is a granule

• Each Part is a single granule

• There are no higher-level granules

• Sections are not preserved as independent granules

Page 10:

Federal Register Example Metadata

[Figure: sample Federal Register document with these metadata fields called out:]

• agencies
• title
• action
• summary
• dates
• contact
• FR Doc Number
• Billing Code

Page 11:

Content Files

[Figure: input files and the renditions derived from them. Input files shown: SGML, locator, PDF. Renditions shown: SGML, locator, text, pdf-submitted, pdf (public). Processing steps shown: CDTP; extract granules; OCR embedded images; create “FrontMatter”, “ReaderAids”, and “Issue” PDF files.]

Page 12:

Content Files – Creating the HTML Rendition

[Figure: creation of the html (public) rendition. Inputs shown: text, pdf-submitted. Other labels: images, longdesc text, html. Steps shown:]

• Add HTML headers and header metadata
• Add URL and E-mail links
• Extract images as JPEG
• OCR images
• Embed image tags

Page 13:

Extracting Metadata

[Figure: metadata extraction. Labels: SGML content, CDTP, SGML TOC, parse (three times), overwrite, add, Merged Metadata, (TOC headings).]

• Metadata is merged based on the FR Doc Number
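A merge keyed on the FR Doc Number can be sketched as follows. This is only an illustration: the slide does not say which source wins when both carry the same field, so the sketch assumes TOC values overwrite content values and TOC-only fields are added. The field names (`frDocNumber`, `agency`, `tocHeading`) and the sample values are hypothetical.

```python
def merge_by_fr_doc_number(content_records, toc_records):
    """Merge two lists of metadata dicts keyed on 'frDocNumber'.

    Assumption (not stated on the slide): values parsed from the TOC
    overwrite values parsed from the content, and TOC-only fields are added.
    """
    merged = {rec["frDocNumber"]: dict(rec) for rec in content_records}
    for rec in toc_records:
        # setdefault covers a document that appears only in the TOC.
        merged.setdefault(rec["frDocNumber"], {}).update(rec)
    return merged


# Hypothetical sample records for one granule.
content = [{"frDocNumber": "E6-1423", "title": "Sample notice", "agency": "unknown"}]
toc = [{"frDocNumber": "E6-1423", "agency": "Education Department", "tocHeading": "Notices"}]

merged = merge_by_fr_doc_number(content, toc)
print(merged["E6-1423"])
```

Keying the merged result by document number keeps the "overwrite" and "add" cases in a single `update` call per record.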

Page 14:

Search Results

73 FR 22020 - Title I-Improving the Academic Achievement of the Disadvantaged [PDF 123 KB] Federal Register. Proposed Rules. Notice of proposed rulemaking. RIN 0324-AJ10. Wednesday, April 23, 2008. ...The Secretary proposes to amend the regulations governing programs administered under Part A of Title I of the Elementary and Secondary Education Act of 1965, as amended (ESEA)... More Information...

Fields called out in the sample result:

• volume
• firstpage
• title
• collection
• section
• action (first 20 chars)
• rin
• publishdate
• teaser
• link to content-detail

Page 15:

FR Navigators

• Section

• Agency

• CFRs
  – Hierarchical
      + 15 CFR
          - Part 12
          - Part 13
          - Part 14
      + 16 CFR
          - Part 412
          - Part 413

Page 16:

Collection Browsing

daynav

monthnav

yearnav

agencynav

Page 17:

Advanced Search Form

Page 18:

Package-Level URLs

• Package Content Detail
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/content-detail.html

• Package Metadata Standards
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/mods.xml
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/premis.xml

• Package Table of Contents
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/toc.html

• Today’s Table of Contents
  – http://www.gpo.gov/fdsys/html/FR/todays_toc.html

Page 19:

Granule-Level URLs

• HTML and PDF Files
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/html/E6-1423.html
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/pdf/E6-1423.pdf

• Granule Content Detail
  – http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/content-detail.html

• Granule Metadata Standards
  – http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/mods.xml
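The package-level and granule-level URL patterns on these two slides are regular enough to generate. The helper names below are hypothetical; only the URL shapes come from the slides.

```python
BASE = "http://www.gpo.gov/fdsys"


def package_url(package_id, resource="content-detail.html"):
    """Package-level resource, e.g. content-detail.html, mods.xml, premis.xml, toc.html."""
    return "{}/pkg/{}/{}".format(BASE, package_id, resource)


def granule_file_url(package_id, granule_id, fmt):
    """Granule rendition file; fmt is 'html' or 'pdf'."""
    return "{}/pkg/{}/{}/{}.{}".format(BASE, package_id, fmt, granule_id, fmt)


def granule_url(package_id, granule_id, resource="content-detail.html"):
    """Granule-level resource, e.g. content-detail.html or mods.xml."""
    return "{}/granule/{}/{}/{}".format(BASE, package_id, granule_id, resource)


print(package_url("FR-2006-01-01", "mods.xml"))
print(granule_file_url("FR-2006-01-01", "E6-1423", "pdf"))
print(granule_url("FR-2006-01-01", "E6-1423"))
```

Note that rendition files live under `/pkg/` while granule metadata pages live under `/granule/`, which is why the sketch uses two separate granule helpers.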

Page 20:

Content Detail

Sample UI

Page 21:

Parsing

Page 22:

Parsing Overview

• Runs regular expressions to extract metadata
  – Regular Expression: (Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+)
  – Example: Pub. L. 109-130
  – Produces: <law congress="109" number="130"/>

• Written in Java

• Called from Documentum when a package needs to be parsed

• Produces an instance of fdsys.xml
  – Parsing has an internal XML format (called the “raw” XML) which is transformed to produce the fdsys.xml
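The regular-expression step can be tried out in a few lines. The FDsys parser itself is written in Java; this Python sketch only illustrates the pattern-to-element idea. The function name is hypothetical, and the dots in the pattern are escaped here (an assumption) so that "Pub. L." matches literal periods.

```python
import re

# Pattern from the slide, with the dots escaped (an assumption) so that
# "." matches a literal period rather than any character.
PUBLIC_LAW = re.compile(r"(Public Law|Pub\. L\.|PL|P\. L\.) (1[0-9][0-9])-([0-9]+)")


def extract_public_laws(text):
    """Return one <law .../> element string per public-law citation found."""
    return [
        '<law congress="{}" number="{}"/>'.format(m.group(2), m.group(3))
        for m in PUBLIC_LAW.finditer(text)
    ]


print(extract_public_laws("as amended by Pub. L. 109-130 and Public Law 110-5"))
```

The `1[0-9][0-9]` group restricts matches to three-digit Congress numbers from 100 to 199, so earlier Congresses would need a wider pattern.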

Page 23:

Parser Foundation Classes

[Figure: class hierarchy. Foundation classes: PContainer, PParser, PPackage, PRendition, PFile, PGranule. Derived classes shown: FRRendition, FRFile, USCODEParser, USCODEPackage, USCODERendition, USCODEFile, USCODEGranule.]

• Foundation classes handle 95% of parsing needs

• Derived classes handle all special cases

Page 24:

PContainer

• Takes patterns and produces elements

• Holds XML at each level of the parsing process

[Figure: a PPattern, "(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)", is used by a PContainer and produces an XML fragment, <publicLaw><congressNum>109</congressNum><lawNum>123</lawNum></publicLaw>, which is stored in the container's XML DOM. Relationship labels: used_by, produces, stored_in.]
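The pattern-to-DOM relationship can be sketched with Python's standard `xml.etree.ElementTree`. `PatternContainer` is a hypothetical stand-in for PContainer, not the actual Java class; the root tag `granule` and the method names are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET


class PatternContainer:
    """Hypothetical stand-in for PContainer: applies patterns, holds an XML tree."""

    def __init__(self, root_tag):
        self.root = ET.Element(root_tag)

    def apply(self, pattern, element_tag, child_tags, text):
        """Add one element per match; each regex group becomes a child element."""
        for match in re.finditer(pattern, text):
            element = ET.SubElement(self.root, element_tag)
            for tag, value in zip(child_tags, match.groups()):
                ET.SubElement(element, tag).text = value
        return self

    def to_xml(self):
        return ET.tostring(self.root, encoding="unicode")


container = PatternContainer("granule")
container.apply(
    r"(?:Public Law|Pub\. L\.|P\. L\.) (1[0-9][0-9])-([0-9]+)",
    "publicLaw",
    ["congressNum", "lawNum"],
    "See Pub. L. 109-123 for details.",
)
print(container.to_xml())
```

Because the container owns the tree, several patterns can be applied in turn and their elements accumulate in one place, which is the property the slide attributes to PContainer.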

Page 25:

Parser Foundation Classes

[Figure: PParser, PPackage, PRendition, PFile, and PGranule are shown as PContainers, each holding an XML DOM. Two PFile and two PGranule instances each carry XML. Operations labelled between levels: append, append, priority merge. An XSLT step produces the output xml.]

Page 26:

Parsing XML Documents

[Figure: PParser, PPackage, PRendition, and PFile are shown as PContainers, each holding an XML DOM. Operations labelled between levels: append, priority merge. XSLT steps are shown next to bills.xml and fdsys.xml.]

Page 27:

Other Parsing Considerations

• Heuristics testing is integrated into the parsing
  – PEHelper: Checks for heuristics and adds “quality=” attributes

• Output can be automatically Schema-Validated
  – Schema-Validation is run on all fdsys.xml formats produced by the parser

• Parser Validation Tool
  – Used by GPO to validate that parsers meet the 90% Service Level Agreement for accuracy
  – Randomly selects 100 documents or granules
  – Displays metadata & original text for manual review
  – Produces Validation Report
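The sampling and threshold check behind such a validation tool might look like the sketch below. The function names are hypothetical, and the slide does not say how accuracy is computed; this assumes one pass/fail judgement per sampled document.

```python
import random


def sample_for_review(document_ids, sample_size=100, seed=None):
    """Pick up to sample_size documents or granules at random for manual review."""
    rng = random.Random(seed)
    ids = list(document_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def meets_sla(review_results, threshold=0.90):
    """review_results: booleans, True when the reviewer judged the metadata correct.

    Assumption: accuracy is the share of sampled items judged correct.
    """
    return sum(review_results) / len(review_results) >= threshold


sample = sample_for_review(range(5000), seed=1)
print(len(sample))
print(meets_sla([True] * 93 + [False] * 7))
```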

Page 28:

fdsys.xml

Page 29:

June 13, 2008

GPO FDsys – Data Model

FDsys.xml Purpose

• Internal container of metadata related to package

• Is a detailed representation/model of the data structure across all of FDsys

• Reduces duplication of data across metadata formats

• Reduces number of required transformations

• Can be transformed into standard schemas including:
  – METS
  – MODS
  – PREMIS

Page 30:

FDsys.xml General Structure

Header

Content

Metadata

Page 31:

FDsys Publish and Search

Page 32:

Publish and Search

Agenda:

• FDsys Publish

• Search Engine Configuration

• Search Engine Application Services

Page 33:

FDsys Publish

Page 34:

High-Level SW Components

• Submission Component
  – Submission Workflows
  – WebTop Submission User interfaces
  – Content Parsers
  – Migration Tool

• Ingest Component

• Content Processing
  – Processing Workflows
  – WebTop User Interfaces
  – Package Management
  – ILS Integration

• Archive Preservation
  – Archival Workflows
  – WebTop User Interfaces
  – Preservation Process

• Access Component
  – Full-Fledged Search Application
  – Full Text Search Engine
  – Public Content Access and Delivery

• Infrastructure Component
  – COTS-based LDAP Integration

Page 35:

Content Publishing - Overview

• Communicates from Documentum to Access

• From: Documentum
  – Extract fdsys.xml & premis.xml
  – Extract renditions and content files
  – Uses Documentum native DFC calls

• To: ACP Cache
  – Stores metadata and content files

• To: FAST ESP Search Engine
  – Converts fdsys.xml to FAST.xml -> to indexer
    • Includes the mods.xml (indexed into ESP)
    • ESP pulls in content files automatically
  – Uses FAST ESP content_api & search_api calls

Page 36:

Component Interfaces

[Figure: component interfaces. Labels: Package Updating Workflow, Access Processing Workflow, Content Publishing, FAST, ACP Cache, CMS, Access Web Application. The slide carries the author note “UPDATE THIS”.]

Page 37:

Component Interfaces

[Figure: component interfaces. Labels: Content Management System, Content Publishing, FAST, ACP Cache. Connection labels: pull, push, HTTP Commands.]

Page 38:

Major Architectural Decisions

• Pull from Documentum, not Push
  – Maintenance of Access Subsystem databases becomes the responsibility of the Access Subsystem
  – Data is pulled from Documentum only as needed
    • Avoids overflow/queuing problems
  – Allows multiple access systems to be fielded

• Search for Deletes in FAST
  – Packages can contain many granules
  – When updating the FAST indexes, use search to find the list of all nested granules in the indexes
  – Guaranteed to avoid any “orphan” granule problems
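The "search for deletes" idea can be illustrated with an in-memory dictionary standing in for the FAST index. Everything here is a sketch under that assumption: `republish_package` and the record layout are hypothetical, and no FAST API calls are shown.

```python
def republish_package(index, package_id, new_granules):
    """Update one package in the index and remove granules that no longer exist.

    index: dict mapping granule_id -> {"package": ..., "body": ...}
           (an in-memory stand-in for the search index).
    new_granules: dict mapping granule_id -> body for the new package version.
    Returns the sorted list of orphaned granule IDs that were deleted.
    """
    # "Search" the index for every granule currently filed under this package.
    indexed = {gid for gid, doc in index.items() if doc["package"] == package_id}
    # Anything indexed but absent from the new version is an orphan: delete it.
    orphans = indexed - set(new_granules)
    for gid in orphans:
        del index[gid]
    # Add or overwrite the granules of the new version.
    for gid, body in new_granules.items():
        index[gid] = {"package": package_id, "body": body}
    return sorted(orphans)


# Hypothetical sample data: two granules in one package, one in another.
index = {
    "E6-1": {"package": "FR-2006-01-01", "body": "a"},
    "E6-2": {"package": "FR-2006-01-01", "body": "b"},
    "X-9": {"package": "FR-2006-01-02", "body": "c"},
}
removed = republish_package(index, "FR-2006-01-01", {"E6-1": "a2"})
print(removed, sorted(index))
```

Asking the index itself which granules it holds means the publisher does not need to remember what it sent last time, which is what rules out orphans.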

Page 39:

ACP Cache Directory Structure

Proposed ACP Cache Directory (limits entries per directory to 256):

/ACP/hh/hh/hh/pkgXXXXXXXXXX/<package-contents>

• hh/hh/hh: Hexadecimal representation of the lower 24 bits of the MD5 hash of the package ID
• pkgXXXXXXXXXX: Package ID
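A path of this shape can be computed with the standard `hashlib` module. Several details are assumptions, since the slide does not specify them: the package ID is hashed as UTF-8 text, "lower 24 bits" is taken from the digest read as a big-endian integer, and the directory name is `pkg` followed by the ID. The function name and sample ID are hypothetical.

```python
import hashlib


def acp_cache_path(package_id, root="/ACP"):
    """Build /ACP/hh/hh/hh/pkg<ID> from the lower 24 bits of the MD5 of the ID."""
    digest = hashlib.md5(package_id.encode("utf-8")).digest()
    # Assumption: interpret the 128-bit digest as a big-endian integer.
    low24 = int.from_bytes(digest, "big") & 0xFFFFFF
    hex6 = "{:06x}".format(low24)
    # Each two-hex-digit level has at most 256 possible entries.
    return "{}/{}/{}/{}/pkg{}".format(root, hex6[0:2], hex6[2:4], hex6[4:6], package_id)


print(acp_cache_path("0000000123"))
```

Three levels of 256 entries give roughly 16.7 million leaf directories, so packages spread evenly without any one directory growing large.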

Page 40:

Metadata Flow Diagram

[Figure: variant of the Metadata Flow Diagram on Page 7. Labels that differ: “Validate Values & Normalize”, granule file(s) in place of content file(s), and search.xml in place of mods.xml. No Submission box is shown.]

Page 41:

Implementation Detail

[Figure: implementation detail of FDsys Publish. Recoverable labels: FAST Search Engine, Documentum, “ACP Cache: fdsys.xml and content files for each Package”, FAST Search for Deletes, servlet wrapper, FDsys Publish, individual granule files, Documentum APIs, fdsys.xml, Processing Requests via URL, FAST Content Processing.]

Page 42:

Search Engine Configuration Design

Page 43:

Component Interfaces

[Figure: component interfaces, repeated from Page 36. Labels: Package Updating Workflow, Access Processing Workflow, Content Publishing, FAST, ACP Cache, CMS, Access Web Application. The slide carries the author note “Update This”.]

Page 44:

FAST System – Hardware & Network

[Figure: five “index & search” nodes, five “search” nodes, a “publish & admin” node, document processors, and the Web Application.]

Page 45:

FAST System – Indexing Flow

[Figure: five “index & search” nodes, five “search” nodes, a “publish & admin” node, and document processors.]

Page 46:

FAST System – Search

[Figure: five “index & search” nodes, five “search” nodes, five QR servers, and the Web Application.]

Page 47:

Search Engine Sizing: Columns

• Total Number of Documents
  – Estimated 10 million records
    • Each granule = 1 Search Engine document
  – Allow 2x expansion for estimation errors and growth
  – Estimated 20 million records

• Sizing Recommendation:
  – FAST recommends: 5 million records per column
    • For public facing web sites
  – 5 columns: to account for the large number of navigators
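The column arithmetic can be checked directly. The function name is hypothetical; the inputs are the slide's figures.

```python
import math


def columns_needed(estimated_records, growth_factor, records_per_column):
    """Minimum number of index columns for the padded record estimate."""
    return math.ceil(estimated_records * growth_factor / records_per_column)


print(columns_needed(10_000_000, 2, 5_000_000))  # 4
```

The record counts alone give a minimum of 4 columns; the slide recommends 5, attributing the extra column to the large number of navigators.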

Page 48:

Search Engine Sizing: Disk

• Year 2006 FR – Index Sizing Test

  Documents   Text     Fixml    Index    Total FAST
  31,500      500 MB   230 MB   604 MB   834 MB

• Scale to 20 million documents
  – Fixml: ~150 GB
  – Index: ~420 GB

• Total index space required:
  – 150 GB + (420 GB)*2 = 1 TB
  – Add 50% for estimation error, total = 1.5 TB
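The disk estimate can be reproduced from the slide's rounded figures. The sketch also shows a straight linear extrapolation from the 31,500-document test for comparison; that helper, the function names, and the 1,000 MB per GB conversion are assumptions, and the slide does not say how its ~150 GB and ~420 GB figures were derived.

```python
def linear_scale_gb(measured_mb, measured_docs, target_docs):
    """Straight-line extrapolation of a measured size, in GB (1 GB = 1000 MB assumed)."""
    return measured_mb * target_docs / measured_docs / 1000


def index_space_gb(fixml_gb, index_gb, index_copies=2, margin=0.5):
    """Return (base, padded) space: fixml plus index copies, then the error margin."""
    base = fixml_gb + index_gb * index_copies
    return base, base * (1 + margin)


# Linear extrapolation from the 2006 FR test to 20 million documents.
print(round(linear_scale_gb(230, 31_500, 20_000_000)))  # fixml, about 146 GB
print(round(linear_scale_gb(604, 31_500, 20_000_000)))  # index, about 383 GB

# Totals using the slide's own rounded estimates.
base, padded = index_space_gb(150, 420)
print(base, padded)  # 990 1485.0
```

The fixml extrapolation lands close to the slide's ~150 GB. The index extrapolation comes out below the slide's ~420 GB, so the slide's figure presumably includes some allowance beyond linear scaling.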

Page 49:

Search Engine Sizing: QPS

• Queries per second – Estimated from GPO Access
  – 0.8 QPS (across the whole day)
  – Estimated peak: 2.4 QPS (1/2 of queries in 4 hours)

• Estimated Peak QPS for FDsys:
  – Factor for improved search interface: 3x
  – Factor for growth: 2x
  – Estimated: 2.4 x 2 x 3 = ~15 QPS
  – Correlates with other websites known to ST

• Each row: 20-30 QPS
  – Therefore: 1 row for query performance

• Recommend: 2 rows
  – 2nd row for redundancy, failover, and maintenance
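The QPS estimate follows from simple arithmetic on the slide's figures. The function name is hypothetical, and the row calculation uses the lower bound of the 20-30 QPS range.

```python
import math


def peak_qps(avg_qps, share_of_queries, peak_hours):
    """Average rate inside the peak window, given the share of daily queries in it."""
    daily_queries = avg_qps * 24 * 3600
    return daily_queries * share_of_queries / (peak_hours * 3600)


gpo_access_peak = peak_qps(0.8, 0.5, 4)      # half the day's queries in 4 hours
fdsys_peak = gpo_access_peak * 3 * 2         # interface factor 3x, growth factor 2x
rows_for_load = math.ceil(fdsys_peak / 20)   # each row handles at least 20 QPS
rows_total = rows_for_load + 1               # second row for redundancy

print("{:.1f} {:.1f} {} {}".format(gpo_access_peak, fdsys_peak, rows_for_load, rows_total))
```

The product is 14.4 QPS, which the slide rounds up to ~15.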

Page 50:

Metadata Flow Diagram

[Figure: the Metadata Flow Diagram from Page 7, repeated with a single annotation: search index field mapping.]

Page 51:

Mapping to the Index Profile

[Figure: mapping to the index profile. Sources shown: fdsys.xml, ACP Cache content files. Transforms shown: index.xslt, mods.xslt, FAST Extractors, merge. Index profile fields: results bundle, xml scope field, grank1-6, body, publishdate, title, collection specific, standard navigators. Outputs labelled: metadata for search results, metadata for simple search, metadata for navigation.]

Page 52:

Collections and Codes

• Three types of collections:
  – Processing Collection = “collectionCode” or “processingCode”
    • Parsing, submission, processing, workflow
    • Chooses which index.xslt and search.xslt to apply
  – FAST Index Collection
    • One for each processing collection
    • Allows easily deleting all documents in a collection
  – “Access Collection” = “accode”
    • Re-group documents into collections for public users
    • 98% the same as the “Processing Collection”
      – Reports in the Congressional Record, FR Unified Agenda
    • Mapping is done in index.xslt

Page 53:

Simple Maintenance #1

• New Collections
  – Just add the collection (admin GUI)
  – Start feeding data

• Add new fields
  – Add the field to the index profile
  – Reload profile with a hot-update

• Backups
  – Turn off feeding
  – Wait for documents in process to finish up
  – Make index backups

Page 54:

Simple Maintenance #2

• Archiving Log Files
  – Simple file copy, can happen any time

• Correct Field Mapping Errors
  – Remove all documents in the FAST ESP collection
  – Re-index collection from ACP Cache
    • Get list of packages in the collection from Documentum
    • Does not require re-exporting (or re-publishing) the packages from the CMS

• Reorganize Access Collections
  – Remove all documents from affected collections
  – Re-index affected collections from ACP Cache

Page 55:

Extensive Maintenance

• Examples:
  – New FAST Version
  – New FDsys Version
  – Complex index-profile changes (remove fields, major restructuring)
  – Re-organizing collections or field mapping while maintaining searches on the old snapshot

• Process:
  – Servers to “stand-alone” mode
  – Make changes
  – Restore normal server operations

Page 56:

Monitoring

• FAST standard monitoring tool (“Clarity”)

• Monitors query and indexing performance

• Built-in alerting mechanism

Page 57:

Backups

• data_fixml
  – Holds processed copy of the indexes
  – Can be used to reconstruct the indexes in about 4 hours (will need to be benchmarked)

• data_index
  – The complete indexes actually used for searching

• Configuration backup

• When restoring a backup:
  – Will need to re-push all content updates which occurred since the last backup

Page 58:

Disaster Recovery Scenarios

• Servers crash
  – FAST restarts them automatically

• Hanging server processes
  – Shut it down manually and restart it

• Incremental indexing overloads the system
  – Should not happen in FDsys
  – Can “slow down” incremental indexing until situation is corrected

• Severe incremental indexing problems
  – Revert to periodic batch index updates

Page 59:

Search Services API

Page 60:

Component Interfaces

[Figure: component interfaces for search. Labels: FAST, FAST indexes, ACP Cache, Access Search Web App, Access Search API, Content Delivery Web App. Functions shown: search results, collection browsing, browsing PDF, content-detail.]

Page 61:

Responsibilities: Search Services vs Web Application

Search Services API

• All communication to FAST
  – All FAST API calls
  – All FAST parameters

• All FAST FQL
  – User query strings and parameters to FAST FQL

• Raw data values
  – Allowed values, navigator values, search results field values, etc.

• Choosing Navigators

Web Application

• Choosing which fields when
  – Advanced Search Form
  – Search Results

• User-interface oriented field data
  – Display names, help text, display widgets

• Display value translation
  – Translating from raw data values to/from user-friendly values

Page 62:

Responsibilities: Search Services vs Web Application – Browse Trees

Search Services API

• Browse tree computation
  – Selecting nodes to return
  – Returning an ordered list of nodes
  – Caching search results

• Embargo Dates

Web Application

• Browse tree presentation
  – The definition of the levels
  – How many levels to display when
  – Presentation of tree
  – Translating raw data values to user friendly values

• Content Detail Pages

Page 63:

Component Interfaces

[Figure: component interfaces. Labels: Search Web Application, Search Services API, FAST Search API, FAST Search Engine, “Parsing, Processing & Caching”. Connection labels: HTTP, Java method calls. Configuration files (XML): Master, plus Collection Specific files for CongBills, CFR, CR, and FR.]

Page 64:

Configuration File Contents

• Fields
  – for the Advanced Search Form
  – for field: searches
  – Allowed values
    • Fixed enumerated list
    • Enumerated list built from navigator
    • Numeric or Date Range

• Navigators per collection

• FAST ESP Search Engine Connection Info

• Templates to reformat data for display

Page 65:

Query Parsing: Syntax

• atom atom
  – defaults to AND
• atom and atom
• atom or atom
• atom before/# atom
• atom near/# atom
• atom adj atom
• atom
  – Atoms are space separated lists of characters, double-quoted strings, or parenthetical expressions
• not atom
• +atom
• -atom
• field:atom
• range(#,#)
• range(<date>,<date>)

Page 66:

Query Parsing: Examples

• hearing
• "congressional hearing"
• congressional adj hearing
• congressional hearing
• ways and means
• "ways and means"
• ways "and" means
• congressional or congress
• congress and (report or meeting or notice)
• congnum:range(103,110)
• congressional not report
• congressional -report
• +cardin congressional committee
• congressional not (committee report)
• congressional not (committee or meeting)
• representative near/10 cardin
• representative before/10 cardin
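The first step in handling such queries is splitting them into atoms: runs of non-space characters, double-quoted strings, or parenthetical expressions. The scanner below is a minimal sketch of that step only, not the FDsys query parser; `split_atoms` is a hypothetical name, and operators such as `and` or `near/10` are returned as plain atoms for a later stage to interpret.

```python
def split_atoms(query):
    """Split a query into atoms: words, "quoted strings", or (parenthesised groups)."""
    atoms, i, n = [], 0, len(query)
    while i < n:
        ch = query[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            # Quoted string: runs to the next double quote.
            j = query.find('"', i + 1)
            if j == -1:
                raise ValueError("unterminated quoted string")
            atoms.append(query[i:j + 1])
            i = j + 1
        elif ch == "(":
            # Parenthetical expression: track nesting depth to find the match.
            depth, j = 0, i
            while j < n:
                if query[j] == "(":
                    depth += 1
                elif query[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j == n:
                raise ValueError("unbalanced parenthesis")
            atoms.append(query[i:j + 1])
            i = j + 1
        else:
            # Plain word: runs to the next whitespace, so field:range(1,2) stays whole.
            j = i
            while j < n and not query[j].isspace():
                j += 1
            atoms.append(query[i:j])
            i = j
    return atoms


print(split_atoms('congress and (report or meeting or notice)'))
print(split_atoms('ways "and" means'))
```

Keeping a quoted `"and"` as its own atom is what lets a later stage tell the literal word apart from the operator.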

Page 67:

Derived Hierarchy: Example

• 110th Congress
  – House Bills
    • H.R. 1-200
    • H.R. 201-400
      – H.R. 201
      – H.R. 202
      – H.R. 203
        • Engrossed in House
        • Introduced in House
          . Condemning the persecution of labor rights advocates in Iran [PDF] [Text]
        • Referred in Senate
      – H.R. 204
    • H.R. 401-600
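The range level of this browse tree (H.R. 1-200, H.R. 201-400, ...) can be derived from the bill numbers themselves. The sketch below assumes fixed buckets of 200, as in the example; the function names are hypothetical.

```python
def bill_range_label(bill_number, prefix="H.R.", bucket_size=200):
    """Label of the range node a bill falls under, e.g. 203 -> 'H.R. 201-400'."""
    start = (bill_number - 1) // bucket_size * bucket_size + 1
    return "{} {}-{}".format(prefix, start, start + bucket_size - 1)


def group_bills(bill_numbers, prefix="H.R.", bucket_size=200):
    """Group bill numbers under their derived range labels, in numeric order."""
    tree = {}
    for number in sorted(bill_numbers):
        tree.setdefault(bill_range_label(number, prefix, bucket_size), []).append(number)
    return tree


print(group_bills([204, 201, 7, 203, 202, 401]))
```

Because the range nodes are computed rather than stored, only ranges that actually contain bills appear in the result.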

Page 68:

Specified Hierarchy: Example

Page 69:

Parser Foundation Classes

[Figure: variant of the Parser Foundation Classes diagram on Page 25 without the PRendition level. PParser, PPackage, PFile, and PGranule are shown as PContainers, each holding an XML DOM. Two PFile and two PGranule instances each carry XML. Operations labelled between levels: append, append. An XSLT step produces the output xml.]