Data Process Methodology for Search (DPMS) Paul Nelson pnelson@searchtechnologies
2 DPMS: Example 1
Additional Government Publications
Select Publications
• Budget of the United States Government: Fiscal Years 2010 and 2011
• Code of Federal Regulations | XML Bulk Data: 2000 to Present
• Commerce Business Daily Bulk Data: 1996 to 2001
• Compilation of Presidential Documents: 1993 to Present
• Congressional Bills: 103rd Congress to Present
• Congressional Calendars: 104th Congress to Present
• Congressional Committee Prints including Ways and Means Committee Prints: 105th Congress to Present
• Congressional Directory: 104th Congress to Present
• Congressional Documents: 104th Congress to Present
• Congressional Hearings including House and Senate Appropriations Hearings: 105th Congress to Present
• Congressional Pictorial Directory including New Member Pictorial Directory: 105th Congress to Present
• Congressional Record (Bound): 1999 to 2001
• Congressional Record (Daily): 1994 to Present
• Congressional Record Index (Daily): 1983 to Present
• Congressional Reports including Conference Reports: 104th Congress to Present
• Constitution of the United States of America: Analysis and Interpretation: 1992 to 2008
• Economic Indicators: 1995 to Present
• Economic Report of the President: 1995 to Present
• Education Reports from ERIC: 1995 to 2004
• Federal Register | XML Bulk Data | FR 2.0: 1994 to Present
• GAO Reports and Comptroller General Decisions: 1994 to 2008
• House Practice: 104th and 108th Congresses
• House Rules and Manual: 104th Congress to Present
• History of Bills: 1983 to Present
• Independent Counsel Investigations: 1998 to 2002
• Journal of the House of Representatives: 1992 to 1999
• List of CFR Sections Affected: 1997 to Present
• Precedents of the U.S. House of Representatives: Cannon, Deschler, and Hinds
• Privacy Act Issuances: 1995 to 2005
• Public and Private Laws: 104th Congress to Present
• Public Papers of the Presidents of the United States: 1991 to 2005
• Riddick's Senate Procedure: 101st Congress
• Senate Manual: 104th, 106th, 107th, and 110th Congresses
• Supreme Court Decisions (FLITE) Bulk Data: 1937 to 1975
• Unified Agenda: 1994 to Present
• United States Code: 1994 to Present
• United States Government Manual: 1995 to Present
• United States Government Policy and Supporting Positions (Plum Book): 1996 to 2008
• United States Statutes at Large: 2003 to 2006
3 DPMS: Example 2
4 DPMS: Example 3 (FY 2011)
5 DPMS: Example 4 (FY 201?)
6 DPMS: Example 5
CPA (sorry, I don’t have a screenshot)
7 What is Aspire???
• ASPIRE: The Automated Assembly Line for Documents
8 What is DPMS???
• A Methodology for Creating Good Assembly Lines For Search
• Document Process Methodology for Search
9 What is DPMS?
[Diagram: The Customer + Search Technologies]
• Technologies: MS-Windows, Java, Aspire, .NET, Linux, Search Engine
• Management Processes: ISO-9000, 6-Sigma, Agile, Waterfall, DPMS
10 What is DPMS?
DPMS is the 6-Sigma of Document Preparation for Search
11 Are you DPMS Compliant? Certified?
1. Inputs: Identified & Documented
   A. Validated
   B. Virus Checked
2. Metadata: Identified & Documented
   A. Fields named
   B. Structure and arity known
   C. Schema V
3. File Processing: Identified & Documented
   A. File names & formats specified
4. Index Fields: Identified & Documented
   A. Fields mapped from metadata
5. Search Fields: Identified & Documented
   A. Fields mapped from search engine
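To make item 2 of the checklist concrete, here is a minimal Python sketch of what "fields named" and "structure and arity known" could look like when written down as a checkable schema. The `FieldSpec` class, the `check_record` function, and the field names in `SCHEMA` are all hypothetical illustrations; they are not part of DPMS or Aspire.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One metadata field in a hypothetical schema: its name and arity."""
    name: str
    required: bool = True       # False means the field is optional
    multi_valued: bool = False  # True means a list of values is expected


def check_record(record: dict, schema: list) -> list:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    known = {spec.name for spec in schema}
    for spec in schema:
        if spec.name not in record:
            if spec.required:
                problems.append(f"missing required field: {spec.name}")
            continue
        is_list = isinstance(record[spec.name], list)
        if spec.multi_valued and not is_list:
            problems.append(f"{spec.name}: expected a list of values")
        if not spec.multi_valued and is_list:
            problems.append(f"{spec.name}: expected a single value")
    # Anything not named in the schema is, by definition, undocumented.
    for name in record:
        if name not in known:
            problems.append(f"undocumented field: {name}")
    return problems


# Hypothetical schema for one collection (field names are illustrative only).
SCHEMA = [
    FieldSpec("title"),
    FieldSpec("congress"),
    FieldSpec("sponsors", required=False, multi_valued=True),
]
```

Calling `check_record({"title": "H.R. 1", "congress": 111}, SCHEMA)` returns an empty list, while a record with a list-valued `title` or an unknown field is reported rather than silently indexed.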
12 DPMS and Aspire Work Together
• DPMS:
  • A methodology for creating awesome assembly lines for documents
  • Is 100% software independent
  • Produces Design and Architecture Documents
• Aspire:
  • A software framework and toolset that can be used to implement DPMS and enhance search
  • Search engine independent
  • Architected to handle systems with very many & very complex collections
13 GPO: Before DPMS
Everyone is working on their own function (no one is looking out for the data)
14 GPO: After DPMS
Data flow is documented through the system (this is done for each and every collection)
15 The DMD Defines How Data Flows Through The System
[Diagram: Business Process Overview]
• Areas: Submission, Preservation, Access, Processing
• Boxes: Ingest Process, Congressional Submission Workflow (folder), Congressional Submission Workflow (interactive), Migration Application, Bulk Submission Process, Submission Process, Archival Processing Workflow, Archival Updating Workflow, Public User Access & Delivery Application, Authorized User Access & Delivery Application, Package Updating Workflow, Access Processing Workflow, Publishing Process, ILS Integration Application
Questions shown on the diagram:
• What renditions are available?
• How will metadata be extracted and merged?
• What manual edits may be required?
• How are PDF files processed?
• How will the HTML rendition be created?
• How will parser data and input files be validated?
• What’s on the search form?
• How will the content and metadata be indexed?
• What are the navigators?
• How will the MODS be created?
• How are search results formatted?
• What do content URLs look like?
16 The Challenge at GPO
[Diagram: data flow through the GPO system]
Diagram labels: FAST Notification, ACP Cache, Documentum, Original Content, Parse, fdsys xml, Validate Values & Normalize, FAST indexes, Index, Search, map fields [per collection], map navigators [per collection], normalize data and map to FQL, MODS XML, PREMIS, FAST XML, .xslt, granule file(s), index.xml, search.xml, Index Push, publish
Search Form:
field1: [_________]
field2: [________v]
field3: from [___] to [___]
field4: [_________]
Search Results:
1. [title] [ [type] [size] ]
   [line 2]
   [teaser...] [more...]
2. [title] [ [type] [size] ]
   [line 2]
   [teaser...] [more...]
Content Detail:
[field1]: [data1]
[field 2]: [data2]
[field 3]: [data3]
...
Package TOC:
[collection]
[congress num]
[document type]
[chapter]
[chapter]
[section]
[article **]
[chapter]
...
Collection Browsing:
[collection]
[congress number]
[document type]
[document version]
17 DPMS High-Level View
[Diagram: three phases]
1. Assessment
   • Assessment (Search Technologies Architect and Business Analyst)
   • Assessment Report: expert assessment and recommendations
2. Detailed Analysis
   • DPMS Analysis (Knowledge Engineer, Business Analyst, etc.)
   • DMDs
   • Review (Architect, Domain Experts, Peers)
   • Validate DMDs
3. Execution
   • Implementation (Developer)
   • Aspire
   • Search Engine
   • Validation
18 Aspects of the DMD
• Describes a “Horizontal Slice” through the application
• One per collection of data
• Documents all metadata mappings throughout
  • Parsing
  • Storage representation & fields
  • Data value representations & mappings
• Documents all file processing throughout
• Documents search methods and presentations
19 The DMD = Data Model Design
The DMD Drives the Whole Process
1. Introduction
2. Metadata Schema
3. Input Files
4. File Parsing: Metadata Extraction
5. File Processing: Renditions, formats
6. Metadata to Index Mapping
7. Index to Search & Browse Mapping
8. Metadata to Detail Display Mapping
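Because the DMD has a fixed set of sections, a draft can be checked for completeness mechanically. The sketch below is one hypothetical way to do that in Python; it assumes a DMD draft is held as a plain dict keyed by section title, which is an illustration only and not how Search Technologies stores DMDs.

```python
# The eight DMD sections as listed on the slide.
DMD_SECTIONS = [
    "Introduction",
    "Metadata Schema",
    "Input Files",
    "File Parsing",
    "File Processing",
    "Metadata to Index Mapping",
    "Index to Search & Browse Mapping",
    "Metadata to Detail Display Mapping",
]


def missing_sections(dmd: dict) -> list:
    """List the sections that are absent or blank in a DMD draft.

    Assumption: the draft is a dict mapping section title to its text.
    """
    return [s for s in DMD_SECTIONS if not str(dmd.get(s) or "").strip()]
```

A draft that only has an introduction, for example `missing_sections({"Introduction": "Scope of the collection"})`, reports the other seven sections as still to be written.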
20 Aspire Pipeline
[Diagram: Files or DBs feed a chain of stages that ends at the Search Engine]
• Stages: Packaging, Document Pre Processing, Parsing & Extraction, Enrichment, Document Post Processing, Transform and Load
• Quarantine: six Quarantine boxes, matching the six stages
• Other inputs: Outside DBs, Outside Sources
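The quarantine idea in the pipeline diagram can be sketched in a few lines of Python: a document that fails at any stage is set aside with the stage name and the error, and the rest of the batch carries on. This is a generic illustration, not Aspire's API; `run_pipeline`, `parse`, `enrich`, and the document fields are hypothetical names.

```python
def run_pipeline(docs, stages):
    """Run each document through the stages in order.

    A document that raises in any stage is quarantined with the stage
    name and error message instead of stopping the whole run.
    """
    loaded, quarantined = [], []
    for doc in docs:
        current = dict(doc)  # work on a copy; the input is left untouched
        stage_name = None
        try:
            for stage_name, stage in stages:
                current = stage(current)
        except Exception as err:
            quarantined.append(
                {"doc": doc, "stage": stage_name, "error": str(err)}
            )
        else:
            loaded.append(current)
    return loaded, quarantined


def parse(doc):
    """Toy parsing stage: split the body into tokens."""
    if "body" not in doc:
        raise ValueError("no body to parse")
    doc["tokens"] = doc["body"].split()
    return doc


def enrich(doc):
    """Toy enrichment stage: record the token count."""
    doc["length"] = len(doc["tokens"])
    return doc


# Two of the stage names from the slide, wired to the toy functions above.
STAGES = [("Parsing & Extraction", parse), ("Enrichment", enrich)]
```

Running `run_pipeline([{"id": 1, "body": "a b c"}, {"id": 2}], STAGES)` loads the first document and quarantines the second at "Parsing & Extraction", so a bad file never blocks the good ones.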
21 Just a few things in the DMD
• Field Names
• Mapping formulas
  • 111 = 111th Congress (2009-2010)
• Navigators (names, where from, how displayed)
• Format Translations (.doc, .txt, .html, .pdf)
• Data structure
  • Single value, multi-value, optional, grouped
• Document Structure
  • Hierarchies, granules, sections, chapters
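The mapping formula on this slide, 111 = 111th Congress (2009-2010), is the kind of rule a DMD can pin down precisely enough to implement directly. A minimal Python sketch follows; the function names are hypothetical, and the year arithmetic assumes each Congress is labelled with two calendar years counting from the 1st Congress in 1789, which reproduces the slide's example.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 103 -> '103rd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def congress_label(number: int) -> str:
    """Map a Congress number to a display label for search and browse."""
    # Assumption: two calendar years per Congress, starting from 1789.
    first_year = 1789 + 2 * (number - 1)
    return f"{ordinal(number)} Congress ({first_year}-{first_year + 1})"


print(congress_label(111))  # 111th Congress (2009-2010)
```

Writing the formula once in the DMD means the index, the navigators, and the detail display all show the same label for the same stored value.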
22 DPMS: From Document to Object
• Data inside organizations are very messy
  • Multiple databases / sources, data types, etc.
  • Fragmented or incomplete data
• An “object” can be: project, person, customer, transaction, product item, etc.
• Moving From Documents to Objects
  • Combining data from multiple databases into larger, “virtual” documents, OR
  • Tagging documents so they can be grouped by object ID
  • Decomposing large documents so sections can be retrieved as manageable units but re-assembled if needed
23 From Documents to Objects - Merging
[Diagram: Résumés, Certifications, Time Cards, Skills, and Web Site are merged into one Combined Document]
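The merge shown here can be sketched as grouping records from several sources under a shared object ID. The Python below is a hypothetical illustration: `merge_by_object`, the `employee_id` key, and the source names are assumptions chosen to match the slide's example, not the implementation used in any Search Technologies project.

```python
from collections import defaultdict


def merge_by_object(records, key="employee_id"):
    """Combine records from several sources into one document per object ID.

    `records` is an iterable of (source_name, record_dict) pairs.
    """
    combined = defaultdict(dict)
    for source, record in records:
        object_id = record[key]
        # Keep each source's fields under the source name to avoid collisions.
        fields = {k: v for k, v in record.items() if k != key}
        combined[object_id].setdefault(source, []).append(fields)
    return dict(combined)


records = [
    ("resumes", {"employee_id": 7, "text": "Search architect"}),
    ("skills", {"employee_id": 7, "skill": "Java"}),
    ("skills", {"employee_id": 7, "skill": "Linux"}),
    ("time_cards", {"employee_id": 9, "hours": 40}),
]
combined = merge_by_object(records)
```

Here `combined[7]` holds the résumé and both skills for one person as a single "virtual" document, which is what gets indexed and returned as one search hit.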
24 From Documents to Objects - Splitting
[Diagram: the Federal Register split into Granules]
25 Lots of other examples of splitting
• Zip Files
• Spread Sheets
• RDBMS Tables
• XML Data Records
• Newspapers
• Blog Entries
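Splitting works in the opposite direction from merging: one large document becomes many granules, each of which keeps enough information to be put back together. A minimal Python sketch, with hypothetical function and field names (`parent_id`, `position`) that are not taken from any real system:

```python
def split_into_granules(doc_id, sections):
    """Turn one large document into granules that remember parent and order."""
    return [
        {
            "granule_id": f"{doc_id}-{i}",
            "parent_id": doc_id,
            "position": i,
            "text": text,
        }
        for i, text in enumerate(sections)
    ]


def reassemble(granules, doc_id):
    """Rebuild the original section list for one parent document."""
    own = [g for g in granules if g["parent_id"] == doc_id]
    return [g["text"] for g in sorted(own, key=lambda g: g["position"])]


sections = ["Notice A", "Rule B", "Proposed Rule C"]
granules = split_into_granules("FR-1", sections)
```

Each granule can be indexed and retrieved on its own, and `reassemble(granules, "FR-1")` returns the sections in their original order even if the granules come back from the index shuffled.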
26 Samples
• [SHOW] DMD for GPO
• [SHOW] DMD for OLRC
• [SHOW] DMD Template
• [SHOW] Mini-DMD
27 Other Advantages to the DMD & DPMS
• Scalable
• Data Analyst ≠ Programmer
  • Two different jobs with two different skill sets
  • Much easier to fill these roles if they are separate
• The programmer’s job is more enjoyable
  • Doesn’t have to worry about data issues
  • Can just implement what’s in the DMD
28 The Problems Are…
• DPMS is hard to sell
• DPMS is hard to describe to customers
• We are “inventing” a methodology from scratch
  • This is hard
  • Giving it a name is step 1
  • Next steps: solidify methods, determine what “certified DPMS” means
• Need case studies
• Needs work to define and communicate
29 Hosting Possibilities?
• “DPMS Level 3 Compliant Hosting Center”
• Take customer through the process as we load their data into our hosting center
• Provide all of the documentation back to them
• Certified DPMS Level Three Search Systems
30 But the upside is enormous
• Multi-million dollar customers
  • GPO, CPA
• Ideal for customers for whom “data is their product”
• We become mission-critical to these customers
• We can more easily justify the expense
• Customers will see bottom-line value
• We become much more valuable to the customer
• Customers will want low-risk “tried and true” methodologies for these very complex and difficult tasks