Pre-Command Course, October 2008 LTG William B. Caldwell IV, Commanding General ATIA 6.2 United...

ATIA 6.2

United States Army Combined Arms Center

Preliminary Design Review, 1 July 2010

Agenda
• Requirements
• Function Points
• Natural Language Processor
• Relevancy Algorithm
• Search Personalization
• Document Population
• Search Web Service
• Deployment / Maintenance Process
• Milestones
• Q&A

Requirements

Current Requirements documentation is maintained under CM in project forge

ATIA1451 The system shall maintain data and functional access control through a user profile.

ATIA2444 The system shall request the selection of all appropriate access levels. The system computes this based on account type.

This is an extension of the existing ATIA requirements

ATIA998 The system shall be required to track the changes by the use of a day/date/time stamp that indicates when the change was made, no matter why a change is made.

Function Points

FP | Ver / Date | Comment
FP 2.3 My Training Home Page | 12 Apr 2010, Ver 1.16 | Complete
FP 16.1 Master Catalog Registration Manager | xx Jun 2010, Ver 1.xx | Updating FP 16.1 for PDR to incorporate latest changes for “minimal user data input”; modify to include multiple document input capability; modify to include spidering function for semantic extraction
FP 16.8 ATIA Master Catalog Security | 17 Mar 2009, Ver 1.0 | Reviewing 16.8 for minor changes
FP 38.1 Web Services | 25 Jun 2010, Ver 1.0 | Complete
FP 38.2 Master Catalog | 25 Jun 2010, Ver 1.0 | Complete

ATIA Service Oriented Architecture

[Figure: ATIA Web Services/SOA “Cloud” architecture diagram. Labels recovered from the diagram:]
• Connected systems: BlackBoard LMS, Atlas Pro LMS, ALMS, Interface Engine, ATRRS, ILMS, RECBASS, ARISS, RITMS, DTMS, PDM, RITM/IE
• Security Services – SSO and B2B
• Web services: LMS, Register, Security, Repository, Logging, Profile, Mint, Content, SIS, Generate, Performance, CATS, Publication, Product List, Search
• GUIs: ATIA Admin, ATIA Catalog, MCRM (Doc Mgr)
• Data stores: ATIA Oracle DB, Semantic Triple Store, Magnolia Content Mgmt Sys, Jackrabbit Content Mgmt Sys
• Legend: Complete, TDC Related, ATIA 6.2, Future Dev, Data Source, Other Sys

ATIA 6.2 - Integrated Index and Search
• ATIA 6.2 will provide a replacement for Generate-WS and a search algorithm module
• These changes provide several added features over ATIA 6.1
– Centralized datastore for registration and indexing
– Consistent search and relevance
– Control over semantic term space
– Customization of search/relevance algorithms
– Government will own Generate-WS source code

ATIA 6.1 – Eduworks ACE
• Eduworks ACE performs
– Metadata extraction (Generate-WS)
– Star-tree relevancy (SearchEdu-WS)
– Relevancy between search terms and documents (SearchEdu-WS)
• Limitations
– Unable to add/subtract key words/phrases
– I/O intensive requests for relevancy
– Duplication of data across 2 data stores
– Relevancy inconsistencies between triple store and Eduworks
– Proprietary and reaching end-of-life

ATIA 6.2 – Key Tasks
• Building a new search algorithm module
• Catalog Search
– Calculate relevancy between search terms and documents inside catalog using the new search algorithm module
• Generate-WS
– Support text extraction of common file formats
– Utilize Natural Language Processor
– Algorithm to identify most relevant terms from extracted data
– Store metadata in triplestore
– Pre-computation of values used in relevancy algorithm
– Asynchronous call from Publication-WS and Generate-WS Client

Generate-WS Architecture
Phased implementation – Phase 1: Wrap eduWorks Generate-WS with our implementation
• Asynchronous call to Generate-WS from Publication-WS, directly posting messages to the JMS queue.
[Figure: Phase 1 architecture. Components shown: Generate-WS, Generate-WS (eduWorks), JMS Queue, Triple Store, Generate-WS Client (optional asynchronous), Publication-WS (asynchronous call), Register-WS, Content-WS]

Generate-WS Architecture
Phased implementation – Phase 2: Eliminate eduWorks
• Provide our own relevancy and key words, keeping the interface the same.
[Figure: Phase 2 architecture. Components shown: Generate-WS (new implementation), JMS Queue, Triple Store, Generate-WS Client (optional asynchronous), Publication-WS (asynchronous call), Register-WS, Content-WS]
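The asynchronous hand-off described on the two architecture slides can be sketched as follows. This is a minimal illustration only: Python's in-process `queue.Queue` stands in for the JMS queue, a dictionary stands in for the triple store, and the names `publish`, `worker`, and `extract_keywords` are hypothetical, not part of ATIA.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for the JMS queue (assumption)
triple_store = {}      # stand-in for the triple store: doc id -> keywords

def extract_keywords(text):
    # Placeholder extraction; the implementation behind this name could be
    # swapped (Phase 1 to Phase 2) without changing the callers.
    return sorted(set(text.lower().split()))

def worker():
    # Consumer side: pull messages off the queue and store the results.
    while True:
        job = jobs.get()
        if job is None:        # sentinel value: stop the worker
            jobs.task_done()
            break
        doc_id, text = job
        triple_store[doc_id] = extract_keywords(text)
        jobs.task_done()

def publish(doc_id, text):
    """Producer side: enqueue the message and return immediately."""
    jobs.put((doc_id, text))

thread = threading.Thread(target=worker, daemon=True)
thread.start()
publish("doc-1", "Ranger Handbook ranger")
jobs.join()            # block here only so the example can inspect the result
jobs.put(None)
thread.join()
```

The publisher never waits on extraction; only the example itself blocks, so that the stored keywords can be inspected afterwards.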

Natural Language Processing Framework

• ATIA 6.2 will implement a text processing framework with the following key features

– Ability to integrate new natural language processors (e.g., processors for non-English documents)

– Flexibility to process new file formats besides common file formats if desired

– Multithreaded processing pipelines
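The three framework features above (pluggable language processors, pluggable file formats, multithreaded pipelines) could fit together roughly as in the sketch below. All names here (`register_extractor`, `register_processor`, `process_batch`) are hypothetical, and the whitespace tokenizer is a placeholder rather than a real natural language processor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical registries: file extension -> text extractor,
# language code -> language processor.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {}
PROCESSORS: Dict[str, Callable[[str], List[str]]] = {}

def register_extractor(ext, fn):
    EXTRACTORS[ext.lower()] = fn

def register_processor(lang, fn):
    PROCESSORS[lang.lower()] = fn

def process_document(name, data, lang="en"):
    # Pick the extractor by file extension, then the processor by language.
    ext = name.rsplit(".", 1)[-1].lower()
    text = EXTRACTORS[ext](data)
    return PROCESSORS[lang.lower()](text)

def process_batch(docs, workers=4):
    """docs: list of (name, data, lang). Returns token lists in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: process_document(*d), docs))

# Plug in a plain-text extractor and a naive whitespace tokenizer for English.
register_extractor("txt", lambda data: data.decode("utf-8"))
register_processor("en", lambda text: text.lower().split())

tokens = process_batch([
    ("a.txt", b"Ranger Handbook", "en"),
    ("b.txt", b"Field Manual", "en"),
])
```

Supporting a new file format or a non-English processor then means registering one more function, with no change to the pipeline itself.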

Comparison of Natural Language Processors

Name | Language | Algorithm | API | Trainable | Open Source? | Tokenizer Speed (words/second)
Open NLP | Java | Maximum entropy | Y | Y | Y | ~400
LingPipe | C++ | Maximum entropy | Y | Y | N | N/A
NLTK | Python | Simple Splitter | Y | N | Y | ~1500
Mallet | Java | Simple Splitter | N | N | Y | ~2000
Stanford NLP | Java | Maximum entropy | Y | Y | Y | ~50

Selection of Natural Language Processors
• ATIA 6.2 will use OpenNLP as an open source library for text data extraction
• Java & Open Source
– Flexibility to modify for our needs
• Easy-to-use Java API
• Decent size user base
• High accuracy on sentence segmentation
• Ability to train with customized models
– Less effort to conduct training

Increasing Relevancy
• Increasing search relevancy will require a new implementation of the search algorithm
• The relevancy algorithm will be used by the catalog search and Generate-WS to give consistent results
• Relevance will be calculated inside the catalog by communicating directly with the AllegroGraph triplestore.
• The relevance algorithm will apply cosine-similarity methods to our RDF ontology
• New Generate-WS will integrate directly with the relevance algorithm to use the AllegroGraph triplestore as a backend data store.

Cosine Similarity
• User enters “ranger handbook” into the search box and the search returns documents A and B.
• The documents A and B and the query Q are plotted as vectors in the semantic space.
• The term “ranger” has more weight in the query because “handbook” is so common in the catalog. This is shown by the query vector, whose angle is more than 45°.
• Documents A and B both emphasize “ranger”, but Document A has a higher relative emphasis on “ranger” than Document B.
[Figure: vectors Doc A (more “ranger” emphasis than Doc B), Doc B, and Query Q plotted against the axes “handbook” and “ranger”]

Relevance to the query produces different angles for each document. In cosine similarity, a smaller angle between a document and the query indicates higher relevance.

(Computation is performed for every search result (docs C, D, E, F, ……) to sort all by relevance.)

Increasing Relevancy with User Profile
• Research increasing user relevancy
• Utilize CAC, MOS, job series
• Log searches and utilize ‘hit’ counts
• Integrate closely with the new relevancy algorithm
• Index PDM to create a job series ‘document’
• Based on the user's AKO-supplied MOS/AOC/job series
• Include this ‘document’ in the cosine similarity calculations
[Figure: vectors Doc A (more “ranger” emphasis than Doc B), Doc B, and Query Q plotted against the axes “handbook” and “ranger”]

User Feedback

• How can we add tagging by users? Rate me 1-5 stars, higher<->lower, ... to increase relevancy

• Need this slide expanded

Document Manager Improvements
• Provide form for uploading multiple related documents
– Album style upload
– Required entry: Title
– User can set matching metadata across all documents
– User can set individual metadata for each document
• Provide form for editing multiple related documents
– User can set matching metadata across all documents
• Spidering issues
– Password protected repositories (AKO, SharePoint)
– Depth

Data Collection Improvements

Embed RDF in Results
• Resource Description Framework (RDF) is a standard model for data interchange on the Web

• RDF metadata provides a scalable way to present catalog item data

• Catalog HTML pages should contain RDF metadata

• Catalog XML data lists should be provided in RDF format

• Allows other triple stores to interpret our data
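To illustrate what catalog item data looks like once expressed as RDF, the sketch below serializes one item as N-Triples, one of the standard RDF serializations. This is an assumption for illustration: the slides do not say which serialization ATIA would embed, the subject URI is hypothetical, and the predicates are borrowed from the Dublin Core terms vocabulary rather than from the ATIA ontology.

```python
def to_ntriples(subject_uri, properties):
    """Serialize {predicate_uri: literal} for one resource as N-Triples lines."""
    lines = []
    for predicate, value in properties.items():
        # Escape the characters that N-Triples string literals require.
        literal = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n")
                        .replace("\r", "\\r"))
        lines.append(f'<{subject_uri}> <{predicate}> "{literal}" .')
    return "\n".join(lines)

DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

item = to_ntriples(
    "http://example.org/catalog/item/123",  # hypothetical catalog item URI
    {DCTERMS + "title": "Ranger Handbook", DCTERMS + "type": "Handbook"},
)
print(item)
```

Each output line is a complete subject-predicate-object statement, which is what lets another triple store load the data without knowing anything about the catalog's page layout.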

Search WS
• Allows developers to add the catalog search to their website
• Provides access to catalog search results in various formats
• Provides search customization to limit search results
• Will enable the creation of multiple catalog gadgets
– News feed style gadget that displays new or obsolete documents
– Popular documents
– Documents that could be useful to the user

Rich Site Summary
• Provides RSS feed of recent catalog activity
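A feed of recent catalog activity could be produced along the lines of the sketch below, which builds a minimal RSS 2.0 document with Python's standard library. The channel title, URLs, and item entries are hypothetical placeholders; the slides do not define the feed's actual contents.

```python
import xml.etree.ElementTree as ET

def build_rss(title, link, description, items):
    """items: list of dicts with 'title' and 'link'. Returns RSS 2.0 XML text."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0.
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
    return ET.tostring(rss, encoding="unicode")

feed = build_rss(
    "Recent catalog activity",          # hypothetical channel title
    "http://example.org/catalog",       # hypothetical catalog URL
    "New and obsolete documents",
    [
        {"title": "Ranger Handbook (new)", "link": "http://example.org/catalog/item/123"},
        {"title": "Field Manual (obsolete)", "link": "http://example.org/catalog/item/456"},
    ],
)
```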

Deployment Process

• High Availability (PLC)
• Backup (will CommVault work with AllegroGraph?)
– Clean Stop/Stop
– Re-aiming links
– Fix links/data (rollback)

Maintenance Process

• ATIA 6.2 is an extension of the ATIA 6.1 clusters

Replace Glassfish

• Sun fees on Glassfish usage

Questions

Backup

Weighting Term Relevance

Relevancy weight, w, between a document, d, and a term, t, will be determined with

$$w_{t,d} = tf \times idf$$

tf, term frequency, is a function of the number of occurrences of the phrase in the document.

idf, inverse document frequency, is a function of the number of documents that the term appears in. idf is used to reduce the relevance weight of terms which occur across many documents.

Note that the term, t, may not be the same as the search input query. Search queries can be long and are treated as documents themselves, with a weight calculated for terms appearing within them.

Cosine Similarity is the formula for determining relevance between two documents.
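A small sketch of the weighting may help. The slide only says that tf and idf are "a function of" their counts, so the exact functions below are assumptions: tf is taken as the raw occurrence count and idf as the natural log of (number of documents / number of documents containing the term). The four-document catalog is made up for illustration.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: dict of doc id -> list of terms. Returns doc id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: how many documents each term appears in.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # assumption: tf is the raw occurrence count
        # assumption: idf = ln(N / df)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical catalog: "handbook" appears everywhere, "ranger" in half.
catalog = {
    "A": ["ranger", "ranger", "handbook"],
    "B": ["ranger", "handbook", "handbook"],
    "C": ["handbook", "maintenance"],
    "D": ["supply", "handbook"],
}
w = tf_idf_weights(catalog)
```

With these assumptions, "handbook" gets a weight of zero in every document because it appears in all four, while "ranger" keeps a positive weight. A search query can be passed through the same function as one more document.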

Cosine Similarity of Documents
• Cosine Similarity is the formula for determining relevance between two documents, A and B.
• By treating queries as documents this formula is used to determine relevance between
– two documents
– document and search query
– search query and related terms

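$$Sim(A,B) = \cos(\theta) = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{N} w_{(term_i,A)} \, w_{(term_i,B)}}{\sqrt{\sum_{i=1}^{N} w_{(term_i,A)}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{(term_i,B)}^{2}}}$$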
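The cosine similarity formula translates directly into code. The sketch below assumes each document (or query) is represented as a dictionary mapping terms to weights; the function name and the zero-vector handling are illustrative choices, not taken from the slides.

```python
import math

def cosine_similarity(a, b):
    """a, b: dicts mapping term -> weight. Returns the cosine of their angle."""
    # Numerator: sum of weight products over terms present in both vectors.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    # Denominator: product of the two vector lengths.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is treated as unrelated
    return dot / (norm_a * norm_b)
```

Because only shared terms contribute to the numerator, two documents with no terms in common score 0, and two documents with proportional weights score 1 regardless of their lengths.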

Cosine Similarity Example

[Figure: two-dimensional chart with vectors A, B, and Q; horizontal axis “handbook”, vertical axis “ranger”]
• Cosine similarity can be demonstrated on a two-dimensional chart when there are two search terms.
• In this case, the query Q = “ranger handbook” and documents A and B are relevant.
• The weight of “handbook” is plotted on the horizontal axis and the weight of “ranger” is on the vertical.
• Although “ranger” and “handbook” each appear in Q exactly once, the terms may not have equal weighting on the query because of the idf.
• The cosine similarity is a function of the angle between a document and the query Q.
• Although the endpoints of vectors of Q and B are closer than the endpoints of vectors of Q and A, the cosine similarity of Q and A is stronger.
• This is because Q has a stronger emphasis on “ranger” and so does document A.
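The point about endpoints versus angles can be checked numerically. The weights below are hypothetical values chosen to mimic the chart, written as (handbook, ranger) pairs; they are not read off the original figure.

```python
import math

# Hypothetical weights as (handbook, ranger) pairs.
Q = (1.0, 2.0)
A = (2.0, 5.0)
B = (2.0, 2.5)

def angle_deg(u, v):
    """Angle between two 2-D vectors, in degrees."""
    cos = (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(min(1.0, cos)))  # clamp rounding overshoot

def endpoint_distance(u, v):
    """Straight-line distance between the endpoints of two vectors."""
    return math.hypot(u[0] - v[0], u[1] - v[1])

# Rank documents by angle to the query: smaller angle means more relevant.
ranked = sorted({"A": A, "B": B}.items(), key=lambda kv: angle_deg(Q, kv[1]))
```

With these values the endpoint of B is closer to Q than the endpoint of A, yet A makes the smaller angle with Q and is ranked first.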

Applying Cosine Similarity to RDF Ontologies

• The cosine similarity measure is a common method for determining text-based relevance

• In our RDF triplestore we will use cosine similarity to determine relevance between RDF resources

• This is accomplished by using definitions of tf and idf based on the RDF predicates that match our search query

Performance considerations
• The performance of this relevance algorithm applied to a triplestore is hindered by the lack of scalar functions in SPARQL.
• The mitigation is
– Pre-computation of partial values
– Refactoring of the algorithm code to reduce roundtrips to the triplestore
• Pre-computation during registration will require additional code in the register-WS
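One partial value that lends itself to pre-computation is each document's vector length, the per-document square root in the denominator of the cosine similarity formula. The sketch below computes it once at registration time; the `RelevanceIndex` class is a hypothetical in-memory stand-in for the triplestore, and which values ATIA would actually pre-compute is not stated in the slides.

```python
import math

class RelevanceIndex:
    """In-memory stand-in for the triplestore (assumption for illustration)."""

    def __init__(self):
        self.weights = {}  # doc id -> {term: weight}
        self.norms = {}    # doc id -> pre-computed vector length

    def register(self, doc_id, term_weights):
        # Computed once at registration, so searches never recompute it.
        self.weights[doc_id] = dict(term_weights)
        self.norms[doc_id] = math.sqrt(sum(w * w for w in term_weights.values()))

    def search(self, query_weights):
        """Return doc ids sorted from most to least relevant."""
        q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
        scores = {}
        for doc_id, doc in self.weights.items():
            dot = sum(w * doc.get(t, 0.0) for t, w in query_weights.items())
            denom = q_norm * self.norms[doc_id]
            scores[doc_id] = dot / denom if denom else 0.0
        return sorted(scores, key=scores.get, reverse=True)

index = RelevanceIndex()
index.register("A", {"ranger": 5.0, "handbook": 2.0})
index.register("B", {"ranger": 2.5, "handbook": 2.0})
results = index.search({"ranger": 2.0, "handbook": 1.0})
```

At query time only the dot product and the query's own length remain to be computed, which is the kind of saving the mitigation above is after.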

Current Document Population Methods

– Document Manager
• One document at a time
– Batch
• Spreadsheet
• Ziptool
