Data Process Methodology for Search (DPMS) Paul Nelson pnelson@searchtechnologies


Transcript of Data Process Methodology for Search (DPMS) Paul Nelson pnelson@searchtechnologies

Page 1: Data  Process Methodology for Search (DPMS) Paul Nelson pnelson@searchtechnologies

Data Process Methodology for Search (DPMS)

Paul Nelson
pnelson@searchtechnologies

Page 2: DPMS: Example 1

• Additional Government Publications - Select Publications
• Budget of the United States Government - Fiscal Years 2010 and 2011
• Code of Federal Regulations | XML Bulk Data - 2000 to Present
• Commerce Business Daily Bulk Data - 1996 to 2001
• Compilation of Presidential Documents - 1993 to Present
• Congressional Bills - 103rd Congress to Present
• Congressional Calendars - 104th Congress to Present
• Congressional Committee Prints including Ways and Means Committee Prints - 105th Congress to Present
• Congressional Directory - 104th Congress to Present
• Congressional Documents - 104th Congress to Present
• Congressional Hearings including House and Senate Appropriations Hearings - 105th Congress to Present
• Congressional Pictorial Directory including New Member Pictorial Directory - 105th Congress to Present
• Congressional Record (Bound) - 1999 to 2001
• Congressional Record (Daily) - 1994 to Present
• Congressional Record Index (Daily) - 1983 to Present
• Congressional Reports including Conference Reports - 104th Congress to Present
• Constitution of the United States of America: Analysis and Interpretation - 1992 to 2008
• Economic Indicators - 1995 to Present
• Economic Report of the President - 1995 to Present
• Education Reports from Eric - 1995 to 2004
• Federal Register | XML Bulk Data | FR 2.0 - 1994 to Present
• GAO Reports and Comptroller General Decisions - 1994 to 2008
• House Practice - 104th and 108th Congresses
• House Rules and Manual - 104th Congress to Present
• History of Bills - 1983 to Present
• Independent Counsel Investigations - 1998 to 2002
• Journal of the House of Representatives - 1992 to 1999
• List of CFR Sections Affected - 1997 to Present
• Precedents of the U.S. House of Representatives - Cannon, Deschler, and Hinds
• Privacy Act Issuances - 1995 to 2005
• Public and Private Laws - 104th Congress to Present
• Public Papers of the Presidents of the United States - 1991 to 2005
• Riddick's Senate Procedure - 101st Congress
• Senate Manual - 104th, 106th, 107th, and 110th Congresses
• Supreme Court Decisions (FLITE) Bulk Data - 1937 to 1975
• Unified Agenda - 1994 to Present
• United States Code - 1994 to Present
• United States Government Manual - 1995 to Present
• United States Government Policy and Supporting Positions (Plum Book) - 1996 to 2008
• United States Statutes at Large - 2003 to 2006

Page 3: DPMS: Example 2

Page 4: DPMS: Example 3 (FY 2011)

Page 5: DPMS: Example 4 (FY 201?)

Page 6: DPMS: Example 5

CPA (sorry, I don’t have a screenshot)

Page 7: What is Aspire???

• ASPIRE: The Automated Assembly Line for Documents

Page 8: What is DPMS???

• A Methodology for Creating Good Assembly Lines For Search
• Document Process Methodology for Search

Page 9: What is DPMS?

[Diagram with three headings: Technologies, Management, Processes. Labels shown in the diagram: MS-Windows, Java, Aspire, .NET, Linux, Search Engine, The Customer, Search Technologies, ISO-9000, 6-Sigma, Agile, Waterfall, DPMS.]

Page 10: What is DPMS?

DPMS is the 6-Sigma of Document Preparation for Search

Page 11: Are you DPMS Compliant? Certified?

1. Inputs: Identified & Documented
  A. Validated
  B. Virus Checked
2. Metadata: Identified & Documented
  A. Fields named
  B. Structure and arity known
  C. Schema V
3. File Processing: Identified & Documented
  A. File names & formats specified
4. Index Fields: Identified & Documented
  A. Fields mapped from metadata
5. Search Fields: Identified & Documented
  A. Fields mapped from search engine
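A checklist like this one can be tracked per collection as plain data. The sketch below is only an illustration: the key names, the `missing_items` and `is_compliant` helpers, and the idea of a pass/fail check are assumptions, not part of DPMS as presented. The truncated item 2C is left out.

```python
# Hypothetical encoding of the slide's checklist; all key names are illustrative.
CHECKLIST = {
    "inputs": ["validated", "virus_checked"],
    "metadata": ["fields_named", "structure_and_arity_known"],
    "file_processing": ["file_names_and_formats_specified"],
    "index_fields": ["mapped_from_metadata"],
    "search_fields": ["mapped_from_search_engine"],
}


def missing_items(completed):
    """Return {area: [items still open]} for one collection.

    `completed` maps an area name to the set of items already documented.
    """
    gaps = {}
    for area, items in CHECKLIST.items():
        done = completed.get(area, set())
        todo = [item for item in items if item not in done]
        if todo:
            gaps[area] = todo
    return gaps


def is_compliant(completed):
    """A collection passes only when no checklist item is open."""
    return not missing_items(completed)
```

Calling `missing_items` on a half-documented collection gives a to-do list per area, which is closer to how a checklist is used in practice than a single yes/no answer.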

Page 12: DPMS and Aspire Work Together

• DPMS:
  • A methodology for creating awesome assembly lines for documents
  • Is 100% software independent
  • Produces Design and Architecture Documents
• Aspire:
  • A software framework and toolset that can be used to implement DPMS and enhance search
  • Search engine independent
  • Architected to handle systems with very many & very complex collections

Page 13: GPO: Before DPMS

Everyone is working on their own function (no one is looking out for the data)

Page 14: GPO: After DPMS

Data flow is documented through the system (this is done for each and every collection)

Page 15: The DMD Defines How Data Flows Through The System

[Diagram: Business Process Overview. Boxes shown in the diagram:]

• Submission
• Ingest Process
• Congressional Submission Workflow (folder)
• Migration Application
• Bulk Submission Process
• Preservation
• Archival Processing Workflow
• Archival Updating Workflow
• Access
• Public User Access & Delivery Application
• Authorized User Access & Delivery Application
• Processing
• Package Updating Workflow
• Access Processing Workflow
• Publishing Process
• ILS Integration Application
• Submission Process
• Congressional Submission Workflow (interactive)

[Questions attached to the diagram:]

• what renditions are available?
• how will metadata be extracted and merged?
• what manual edits may be required?
• how are PDF files processed?
• how will the HTML rendition be created?
• how will parser data and input files be validated?
• what’s on the search form?
• how will the content and metadata be indexed?
• what are the navigators?
• how will the MODS be created?
• how are search results formatted?
• what do content URLs look like?

Page 16: The Challenge at GPO

[Diagram: data flow at GPO. Labels shown in the diagram: Original Content, Documentum, ACP Cache, Parse, fdsys xml, Validate Values & Normalize, MODS XML, PREMIS, .xslt, granule file(s), FAST XML, index.xml, search.xml, map fields [per collection], map navigators [per collection], normalize data and map to FQL, Index, Index Push, FAST Notification, FAST indexes, Search, publish.]

[Screen mock-ups shown in the diagram:]

Search Form:
field1: [_________]
field2: [________v]
field3: from [___] to [___]
field4: [_________]

Search Results:
1. [title] [ [type] [size] ] [line 2] [teaser...] [more...]
2. [title] [ [type] [size] ] [line 2] [teaser...] [more...]

Content Detail:
[field1]: [data1]
[field 2]: [data2]
[field 3]: [data3]
...

Package TOC:
[collection]
[congress num]
[document type]
[chapter]
[chapter]
[section]
[article **]
[chapter]
...

Collection Browsing:
[collection]
[congress number]
[document type]
[document version]

Page 17: DPMS High-Level View

[Diagram: three numbered phases]

1. Assessment
  • Assessment (Search Technologies Architect and Business Analyst)
  • Assessment Report: expert assessment and recommendations
2. Detailed Analysis
  • DPMS Analysis (Knowledge Engineer, Business Analyst, etc.)
  • DMDs
  • Review (Architect, Domain Experts, Peers)
3. Execution
  • Implementation (Developer)
  • Validation
  • Validate DMDs
  • Aspire
  • Search Engine

Page 18: Aspects of the DMD

• Describes a “Horizontal Slice” through the application
• One per collection of data
• Documents all metadata mappings throughout
  • Parsing
  • Storage representation & fields
  • Data value representations & mappings
• Documents all file processing throughout
• Documents search methods and presentations

Page 19: The DMD = Data Model Design

The DMD Drives the Whole Process

1. Introduction
2. Metadata Schema
3. Input Files
4. File Parsing: Metadata Extraction
5. File Processing: Renditions, formats
6. Metadata to Index Mapping
7. Index to Search & Browse Mapping
8. Metadata to Detail Display Mapping

Page 20: Aspire Pipeline

[Diagram: documents move from Files or DBs through a series of processing stages to the Search Engine. Stage labels shown: Packaging, Document Pre Processing, Parsing & Extraction, Enrichment, Document Post Processing, Transform and Load. A "Quarantine" output is attached to the stages. Additional inputs shown: Outside DBs, Outside Sources.]
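The assembly-line idea with a quarantine at each stage can be sketched in a few lines. This is not Aspire's API: the `Pipeline` class, the stage functions, and the document fields below are all hypothetical, and the sketch assumes that a document failing a stage is set aside with the stage name and the error, rather than stopping the whole run.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Hypothetical assembly line: ordered (name, function) stages plus a quarantine."""
    stages: list
    quarantine: list = field(default_factory=list)

    def process(self, doc):
        """Run doc through every stage; return the result, or None if quarantined."""
        for name, stage in self.stages:
            try:
                doc = stage(doc)
            except Exception as err:
                # Set the document aside with enough context to diagnose it later.
                self.quarantine.append({"stage": name, "doc": doc, "error": str(err)})
                return None
        return doc


def parse(doc):
    """Illustrative parsing stage: rejects empty input, otherwise extracts text."""
    if not doc.get("raw"):
        raise ValueError("empty input")
    return {**doc, "text": doc["raw"].strip()}


def enrich(doc):
    """Illustrative enrichment stage: adds a derived field."""
    return {**doc, "length": len(doc["text"])}


pipeline = Pipeline(stages=[("parsing", parse), ("enrichment", enrich)])
incoming = [{"id": 1, "raw": " hello "}, {"id": 2, "raw": ""}]
loaded = [d for d in map(pipeline.process, incoming) if d is not None]
```

After this run, `loaded` holds the one document that passed every stage, and `pipeline.quarantine` holds the empty document together with the name of the stage that rejected it.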

Page 21: Just a few things in the DMD

• Field Names
• Mapping formulas
  • 111 = 111th Congress (2009-2010)
• Navigators (names, where from, how displayed)
• Format Translations (.doc .txt .html .pdf)
• Data structure
  • Single value, multi-value, optional, grouped
• Document Structure
  • Hierarchies, granules, sections, chapters
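The mapping formula "111 = 111th Congress (2009-2010)" can be written down as a small function. This is a sketch under two assumptions: the display label uses the two calendar years in which a Congress convenes, as on the slide, and numbering starts with the 1st Congress in 1789. The function names are made up for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 103 -> '103rd', 111 -> '111th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def congress_label(number):
    """Map a stored congress number to the display value used in navigators."""
    start_year = 1789 + 2 * (number - 1)  # each Congress spans two years
    return f"{ordinal(number)} Congress ({start_year}-{start_year + 1})"


print(congress_label(111))  # 111th Congress (2009-2010)
```

Writing the formula once in the DMD, rather than in each screen that shows the value, keeps the index value (`111`) and the displayed value consistent across search results, navigators, and detail pages.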

Page 22: DPMS: From Document to Object

• Data inside organizations are very messy
  • Multiple databases / sources, data types, etc.
  • Fragmented or incomplete data
• An “object” can be: project, person, customer, transaction, product item, etc.
• Moving From Documents to Objects
  • Combining data from multiple databases into larger, “virtual” documents, OR
  • Tagging documents so they can be grouped by object ID
  • Decomposing large documents so sections can be retrieved as manageable units but re-assembled if needed

Page 23: From Documents to Objects - Merging

[Diagram: Résumés, Certifications, Time Cards, Skills, and Web Site sources are merged into one Combined Document.]
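Merging of this kind amounts to grouping records from several sources under a shared object ID. The sketch below assumes every record already carries such an ID in an `object_id` field; that field name, the source names, and the sample records are invented for illustration.

```python
from collections import defaultdict


def merge_by_object(sources):
    """Combine records from several sources into one document per object.

    `sources` maps a source name to a list of records; each record is a dict
    with an 'object_id' key (an assumed convention) plus its own fields.
    """
    combined = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            fields = {k: v for k, v in record.items() if k != "object_id"}
            combined[record["object_id"]].setdefault(source_name, []).append(fields)
    return dict(combined)


sources = {
    "resumes": [{"object_id": "emp-1", "summary": "Search engineer"}],
    "skills": [
        {"object_id": "emp-1", "skill": "Java"},
        {"object_id": "emp-2", "skill": "XML"},
    ],
}
merged = merge_by_object(sources)
```

Each value in `merged` is one combined document that keeps its fields grouped by the source they came from, so the origin of every field stays traceable.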

Page 24: From Documents to Objects - Splitting

[Diagram: the Federal Register is split into Granules.]

Page 25: Lots of other examples of splitting

• Zip Files
• Spreadsheets
• RDBMS Tables
• XML Data Records
• Newspapers
• Blog Entries
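Whatever the container, splitting follows the same pattern: each part becomes its own searchable unit but remembers its parent and its position, so the whole can be put back together. The sketch below assumes the container has already been parsed into ordered (title, text) sections; the field names and the `fr-001` ID are hypothetical.

```python
def split_into_granules(doc_id, sections):
    """Turn an ordered list of (title, text) sections into standalone granules."""
    return [
        {
            "granule_id": f"{doc_id}-{seq}",
            "parent_id": doc_id,  # lets granules be grouped by their source document
            "seq": seq,           # preserves the original order
            "title": title,
            "text": text,
        }
        for seq, (title, text) in enumerate(sections, start=1)
    ]


def reassemble(granules):
    """Rebuild the full text from granules, in original order."""
    ordered = sorted(granules, key=lambda g: g["seq"])
    return "\n\n".join(f"{g['title']}\n{g['text']}" for g in ordered)


granules = split_into_granules("fr-001", [("Notice A", "Text A"), ("Rule B", "Text B")])
full_text = reassemble(reversed(granules))  # order of arrival does not matter
```

Keeping `parent_id` and `seq` on every granule is the design choice that makes re-assembly possible without going back to the original file.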

Page 26: Samples

• [SHOW] DMD for GPO
• [SHOW] DMD for OLRC
• [SHOW] DMD Template
• [SHOW] Mini-DMD

Page 27: Other Advantages to the DMD & DPMS

• Scalable
• Data Analyst ≠ Programmer
  • Two different jobs with two different skill sets
  • Much easier to fill these roles if they are separate
• The programmer’s job is more enjoyable
  • Doesn’t have to worry about data issues
  • Can just implement what’s in the DMD

Page 28: The Problems Are…

• DPMS is hard to sell
• DPMS is hard to describe to customers
• We are “inventing” a methodology from scratch
  • This is hard
  • Giving it a name is step 1
  • Next steps: solidify methods, determine what “certified DPMS” means
• Need case studies
• Needs work to define and communicate

Page 29: Hosting Possibilities?

• “DPMS Level 3 Compliant Hosting Center”
• Take customer through the process as we load their data into our hosting center
• Provide all of the documentation back to them
• Certified DPMS Level Three Search Systems

Page 30: But the upside is enormous

• Multi-million dollar customers
  • GPO, CPA
• Ideal for customers for whom “data is their product”
• We become mission-critical to these customers
• We can more easily justify the expense
• Customers will see bottom-line value
• We become much more valuable to the customer
• Customers will want low-risk “tried and true” methodologies for these very complex and difficult tasks