Pre-Command Course, October 2008
LTG William B. Caldwell IV, Commanding General
ATIA 6.2
United States Army Combined Arms Center
Preliminary Design Review
1 July 2010
Agenda
• Requirements
• Function Points
• Natural Language Processor
• Relevancy Algorithm
• Search Personalization
• Document Population
• Search Web Service
• Deployment / Maintenance Process
• Milestones
• Q&A
Requirements
Current Requirements documentation is maintained under CM in project forge
ATIA1451 The system shall maintain data and functional access control through a user profile.
ATIA2444 The system shall request the selection of all appropriate access levels. (The system computes this based on account type.)
This is an extension of the existing ATIA requirements
ATIA998 The system shall be required to track the changes by the use of a day/date/time stamp that indicates when the change was made, no matter why a change is made.
Function Points
FP | Ver / Date | Comment
FP 2.3 My Training Home Page | 12 Apr 2010, Ver 1.16 | Complete
FP 16.1 Master Catalog Registration Manager | xx Jun 2010, Ver 1.xx | Updating FP 16.1 for PDR to incorporate latest changes for “minimal user data input”; modify to include multiple document input capability; modify to include spidering function for semantic extraction
FP 16.8 ATIA Master Catalog Security | 17 Mar 2009, Ver 1.0 | Reviewing 16.8 for minor changes
FP 38.1 Web Services | 25 Jun 2010, Ver 1.0 | Complete
FP 38.2 Master Catalog | 25 Jun 2010, Ver 1.0 | Complete
ATIA Service Oriented Architecture
[Diagram: ATIA Web Services/SOA “Cloud”, with Security Services (SSO and B2B)]
• Web services: LMS, Register, Security, Repository, Logging, Profile, Mint, Content, SIS, Generate, Performance, CATS, Publication, Product List, Search
• GUIs: ATIA Admin, ATIA Catalog, MCRM (Doc Mgr)
• Data stores: ATIA Oracle DB, Semantic Triple Store, Magnolia Content Mgmt Sys, Jackrabbit Content Mgmt Sys
• Other labeled systems: BlackBoard LMS, Atlas Pro LMS, ALMS, Interface Engine, ATRRS, ILMS, RECBASS, ARISS, RITM/IE, RITMS, DTMS, PDM
• Legend: Complete, TDC Related, ATIA 6.2, Future Dev, Data Source, Other Sys
ATIA 6.2 - Integrated Index and Search
• ATIA 6.2 will provide a replacement for Generate-WS and a search algorithm module
• These changes provide several added features over ATIA 6.1
– Centralized datastore for registration and indexing
– Consistent search and relevance
– Control over semantic term space
– Customization of search/relevance algorithms
– Government will own Generate-WS source code
ATIA 6.1 – Eduworks ACE
• Eduworks ACE performs
– Extract meta-data (Generate-WS)
– Star-tree relevancy (SearchEdu-WS)
– Relevancy between search terms and documents (SearchEdu-WS)
• Limitations
– Unable to add/subtract key word/phrases
– I/O intensive requests for relevancy
– Duplication of data across 2 data stores
– Relevancy inconsistencies between triple store and Eduworks
– Proprietary and reaching end-of-life
ATIA 6.2 – Key Tasks
• Building a new search algorithm module
• Catalog Search
– Calculate relevancy between search terms and documents inside catalog using the new search algorithm module
• Generate-WS
– Support text extraction of common file formats
– Utilize Natural Language Processor
– Algorithm to identify most relevant terms from extracted data
– Store metadata in triplestore
– Pre-computation of values used in relevancy algorithm
– Asynchronous call from Publication-WS and Generate-WS Client
Generate-WS Architecture
Phased implementation – Phase 1: Wrap eduWorks Generate-WS with our implementation
• Asynchronous call to Generate-WS from Publication-WS, directly posting messages to the JMS Queue.
[Diagram components: Generate-WS, Generate-WS (eduWorks), JMS Queue, Triple Store, Generate-WS Client (Optional Asynchronous), Publication-WS (Asynchronous Call), Register-WS, Content-WS]
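The asynchronous hand-off in Phase 1 can be sketched roughly as follows. This is an in-process illustration only, under stated assumptions: Python's `queue.Queue` stands in for the JMS Queue, a dict stands in for the Triple Store, and the names `publish`, `generate_worker`, and `extract_keywords` are hypothetical and do not come from ATIA.

```python
import queue
import threading

jobs = queue.Queue()      # stand-in for the JMS Queue (assumption)
triple_store = {}         # stand-in for the Triple Store (assumption)


def extract_keywords(text):
    """Hypothetical placeholder for Generate-WS term extraction."""
    return sorted(set(text.lower().split()))


def publish(doc_id, text):
    """Publication-WS side: post a message and return without waiting."""
    jobs.put((doc_id, text))


def generate_worker():
    """Generate-WS side: consume messages until a None sentinel arrives."""
    while True:
        message = jobs.get()
        if message is None:
            jobs.task_done()
            break
        doc_id, text = message
        triple_store[doc_id] = extract_keywords(text)
        jobs.task_done()


worker = threading.Thread(target=generate_worker)
worker.start()
publish("doc-1", "Ranger Handbook")
publish("doc-2", "Training Handbook")
jobs.put(None)   # tell the worker to stop
jobs.join()      # wait until every message has been processed
worker.join()
```

The caller never blocks on extraction; it only posts a message. That is the property the slide attributes to posting directly to the queue.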
Generate-WS Architecture
Phased implementation – Phase 2: Eliminate eduWorks
• Provide our own relevancy and key words, keeping the interface the same.
[Diagram components: Generate-WS (New Implementation), JMS Queue, Triple Store, Generate-WS Client (Optional Asynchronous), Publication-WS (Asynchronous Call), Register-WS, Content-WS]
Natural Language Processing Framework
• ATIA 6.2 will implement a text processing framework with the following key features
– Ability to integrate new natural language processors, e.g. processors for non-English documents
– Flexibility to process new file formats besides common file formats if desired
– Multithreaded processing pipelines
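One way to picture the framework features listed above is a registry of pluggable processors fed by a thread pool. The sketch below is an assumption-laden illustration, not the ATIA design: `register_processor`, `run_pipeline`, and the whitespace tokenizer are hypothetical names, and a real deployment would plug in an actual NLP library where the stand-in tokenizer is.

```python
from concurrent.futures import ThreadPoolExecutor

# Registry of language processors keyed by language code (hypothetical design).
PROCESSORS = {}


def register_processor(language, tokenizer):
    """Plug a new natural language processor into the framework."""
    PROCESSORS[language] = tokenizer


def whitespace_tokenizer(text):
    """Trivial stand-in for a real processor."""
    return text.split()


def process(document):
    """Run one document through the processor registered for its language."""
    tokenizer = PROCESSORS[document["language"]]
    return document["id"], tokenizer(document["text"])


def run_pipeline(documents, workers=4):
    """Process documents in parallel; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process, documents))


register_processor("en", whitespace_tokenizer)
register_processor("es", whitespace_tokenizer)  # a non-English processor plugs in the same way

results = run_pipeline([
    {"id": "a", "language": "en", "text": "Ranger handbook"},
    {"id": "b", "language": "es", "text": "Manual de entrenamiento"},
])
```

Supporting a new file format would follow the same pattern: a second registry keyed by file type, feeding extracted text into `process`.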
Comparison of Natural Language Processors
Name | Language | Algorithm | API | Trainable | Open Source? | Tokenizer Speed (words/second)
OpenNLP | Java | Maximum entropy | Y | Y | Y | ~400
LingPipe | C++ | Maximum entropy | Y | Y | N | N/A
NLTK | Python | Simple Splitter | Y | N | Y | ~1500
Mallet | Java | Simple Splitter | N | N | Y | ~2000
Stanford NLP | Java | Maximum entropy | Y | Y | Y | ~50
Selection of Natural Language Processors
• ATIA 6.2 will use OpenNLP as an open source library for text data extraction
• Java & Open Source
– Flexibility to modify for our needs
• Easy-to-use Java API
• Decent size user base
• High accuracy on sentence segmentation
• Ability to train with customized models
– Less effort to conduct training
Increasing Relevancy
• Increasing search relevancy will require a new implementation of the search algorithm
• The relevancy algorithm will be used by the catalog search and Generate-WS to give consistent results
• Relevance will be calculated inside the catalog by communicating directly with the AllegroGraph triplestore.
• The relevance algorithm will apply cosine-similarity methods to our RDF ontology
• New Generate-WS will integrate directly with the relevance algorithm to use the AllegroGraph triplestore as a backend data store.
Cosine Similarity
• User enters “ranger handbook” into the search box and the search returns documents A and B.
• The documents A and B and the query Q are plotted as vectors in the semantic space.
• The term “ranger” has more weight in the query because “handbook” is so common in the catalog. This is shown by the query vector, whose angle from the “handbook” axis is more than 45°.
• Documents A and B both emphasize “ranger”, but Document A has a higher relative emphasis on “ranger” than Document B.
[Figure: vectors Doc A, Doc B, and Query Q plotted against the “handbook” and “ranger” axes; Doc A has more “ranger” emphasis than Doc B]
• Relevance to the query produces different angles for each document. In cosine similarity, a smaller angle between a document and the query indicates higher relevance.
• Computation is performed for every search result (docs C, D, E, F, ...) to sort all by relevance.
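The sort-by-angle step described above can be sketched as follows. The term weights are invented for illustration, and `cosine_similarity` is a generic textbook implementation over dicts, not code from ATIA.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative weights only; "ranger" outweighs "handbook" in the query.
query = {"ranger": 2.0, "handbook": 1.0}
documents = {
    "Doc A": {"ranger": 6.0, "handbook": 2.0},
    "Doc B": {"ranger": 1.5, "handbook": 1.5},
    "Doc C": {"handbook": 3.0},
}

# Highest cosine (smallest angle to the query) first.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
```

With these weights the order is Doc A, Doc B, Doc C: the document that shares the query's emphasis on “ranger” ranks first, and the one that mentions only “handbook” ranks last.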
Increasing Relevancy with User Profile
• Research Increasing User Relevancy
– Utilize CAC, MOS, Job Series
– Log Search & utilize ‘hit’ counts
• Integrate Closely with new Relevancy Algorithm
– Index PDM to create job series ‘document’
– Based on user AKO supplied MOS/AOC/job series
– Include this ‘document’ in the cosine similarity calculations
[Figure: vectors Doc A, Doc B, and Query Q plotted against the “handbook” and “ranger” axes; Doc A has more “Ranger” emphasis than Doc B]
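The slides do not say how the job series ‘document’ enters the cosine similarity calculation. One possible reading, sketched below purely as an assumption, is a weighted blend of query relevance and profile relevance. The function `personalized_score`, the `profile_weight` default of 0.3, and all term weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def personalized_score(query, profile, doc, profile_weight=0.3):
    """Blend query relevance with relevance to the user's job series 'document'.

    The weighted sum and the 0.3 default are assumptions for illustration.
    """
    return (1 - profile_weight) * cosine(query, doc) + profile_weight * cosine(profile, doc)


query = {"handbook": 1.0}
profile = {"infantry": 1.0}     # hypothetical terms indexed for a job series
doc_a = {"handbook": 1.0, "infantry": 1.0}
doc_b = {"handbook": 1.0, "aviation": 1.0}

score_a = personalized_score(query, profile, doc_a)
score_b = personalized_score(query, profile, doc_b)
```

Both documents match the query equally, so the profile term breaks the tie in favor of `doc_a`.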
User Feedback
• How can we add tagging by users? Rate me 1-5 stars, higher<->lower, ... to increase relevancy
• Need this slide expanded
Document Manager Improvements
• Provide form for uploading multiple related documents
– Album style upload
– Required Entry Title
– User can set matching metadata across all documents
– User can set individual metadata for each document
• Provide form for editing multiple related documents
– User can set matching metadata across all documents
• Spidering Issues
– Password Protected Repositories
• AKO
• SharePoint
– Depth
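The multi-document form above implies a merge rule: values set across all documents apply unless a document supplies its own. A minimal sketch of that rule follows, assuming Title is a required field on every record. The function name `apply_metadata` and the field names and values are hypothetical.

```python
def apply_metadata(shared, documents):
    """Return one metadata record per document.

    `shared` holds values set across all documents; each document's own
    `metadata` entries override the shared ones. Title is treated as required.
    """
    records = []
    for doc in documents:
        record = {**shared, **doc.get("metadata", {})}
        if not record.get("title"):
            raise ValueError(f"{doc['file']}: title is required")
        record["file"] = doc["file"]
        records.append(record)
    return records


batch = apply_metadata(
    shared={"proponent": "Example School"},   # hypothetical field and value
    documents=[
        {"file": "lesson1.pdf", "metadata": {"title": "Lesson 1"}},
        {"file": "lesson2.pdf", "metadata": {"title": "Lesson 2", "proponent": "Other School"}},
    ],
)
```

Here `lesson1.pdf` inherits the shared proponent while `lesson2.pdf` keeps its individual value.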
Data Collection Improvements
Embed RDF in Results
• Resource Description Framework (RDF) is a standard model for data interchange on the Web
• RDF metadata provides a scalable way to present catalog item data
• Catalog HTML pages should contain RDF metadata
• Catalog XML data lists should be provided in RDF format
• Allows other triple stores to interpret our data
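As an illustration of providing catalog item data in RDF form, the sketch below serializes one item as N-Triples using only string formatting. The choice of Dublin Core terms as the vocabulary, the `example.org` item URI, and the helper names are assumptions; the slides do not name a vocabulary or a serialization.

```python
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape characters that N-Triples does not allow raw in a literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(item_uri, metadata):
    """Serialize a catalog item's metadata as N-Triples, one triple per line."""
    lines = []
    for field, value in metadata.items():
        lines.append(f'<{item_uri}> <{DCTERMS}{field}> "{escape_literal(value)}" .')
    return "\n".join(lines)


rdf = to_ntriples(
    "http://example.org/catalog/item/1",   # hypothetical item URI
    {"title": 'Ranger "Handbook"', "language": "en"},
)
print(rdf)
```

Because each line is a complete subject-predicate-object statement, another triple store can load the output without knowing anything about the catalog's internal schema.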
Search WS
• Allows developers to add the Catalog Search to their website
• Provides access to catalog search results in various formats
• Provides search customization to limit search results
• Will enable the creation of multiple catalog gadgets
– News feed style gadget that displays New or Obsolete Documents
– Popular Documents
– Documents that could be useful to the user
Rich Site Summary
• Provides RSS feed of recent catalog activity
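A feed of recent catalog activity could look roughly like the RSS 2.0 document built below. The function `build_rss`, the channel title, and the `example.org` links are placeholders for illustration, not ATIA endpoints.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, entries):
    """Build a minimal RSS 2.0 document from recent catalog activity."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Recent catalog activity"
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
    return ET.tostring(rss, encoding="unicode")


feed = build_rss(
    "Catalog updates",
    "http://example.org/catalog",
    [{"title": "New: Ranger Handbook", "link": "http://example.org/catalog/item/1"}],
)
```

The three channel elements used here (`title`, `link`, `description`) are the ones RSS 2.0 requires; everything else is optional.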
Deployment Process
• High Availability (PLC)
• Backup (will CommVault work with AllegroGraph?)
– Clean Stop/Stop
– Re-aiming links
– Fix links/data (rollback)
Maintenance Process
• ATIA 6.2 is an extension of the ATIA 6.1 clusters
Replace Glassfish
• Sun fees on Glassfish usage
Questions
Backup
Weighting Term Relevance
• Relevancy weight, w, between a document, d, and a term, t, will be determined with
$$ w_{t,d} = tf \times idf $$
• tf, term frequency, is a function of the number of occurrences of the phrase in the document
• idf, inverse document frequency, is a function of the number of documents that the term appears in.
• idf is used to reduce the relevance weight of terms which occur across many documents
• Note that the term, t, may not be the same as the search input query. Search queries can be long and are treated as documents themselves, with a weight calculated for terms appearing within them.
• Cosine Similarity is the formula for determining relevance between two documents
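The slide leaves the exact tf and idf functions open ("a function of"). The sketch below assumes one common choice, relative term frequency and a logarithmic idf, to show how a term that appears in many documents ends up with a lower weight. The corpus and function names are illustrative only.

```python
import math


def tf(term, doc_terms):
    """Term frequency: share of the document's terms that equal `term`."""
    return doc_terms.count(term) / len(doc_terms)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def weight(term, doc_terms, corpus):
    """w = tf * idf for one term in one document."""
    return tf(term, doc_terms) * idf(term, corpus)


corpus = [
    ["ranger", "handbook"],
    ["training", "handbook"],
    ["aviation", "handbook"],
    ["ranger", "training", "course"],
]
w_ranger = weight("ranger", corpus[0], corpus)
w_handbook = weight("handbook", corpus[0], corpus)
```

Both terms occur once in the first document, yet `w_ranger` exceeds `w_handbook` because “handbook” appears in three of the four documents and “ranger” in only two.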
Cosine Similarity of Documents
• Cosine Similarity is the formula for determining relevance between two documents, A and B.
$$ Sim(A,B) = \cos\theta = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{N} w_{(term_i,A)} \, w_{(term_i,B)}}{\sqrt{\sum_{i=1}^{N} w_{(term_i,A)}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{(term_i,B)}^{2}}} $$
• By treating queries as documents this formula is used to determine relevance between
– two documents
– document and search query
– search query and related terms
Cosine Similarity Example
[Figure: vectors A, B, and Q plotted with “handbook” on the horizontal axis and “ranger” on the vertical axis]
• Cosine similarity can be demonstrated on a two-dimensional chart when there are two search terms
• In this case, the query Q = “ranger handbook” and documents A and B are relevant.
• The weight of “handbook” is plotted on the horizontal axis and the weight of “ranger” is on the vertical
• Although “ranger” and “handbook” each appear in Q exactly once, the terms may not have equal weighting in the query because of the idf
• The cosine similarity is a function of the angle between a document and the query Q
• Although the endpoints of the vectors of Q and B are closer than the endpoints of the vectors of Q and A, the cosine similarity of Q and A is stronger
• This is because Q has a stronger emphasis on “ranger” and so does document A
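The endpoint-versus-angle point can be checked numerically. The coordinates below are invented to mimic the chart, as (handbook weight, ranger weight) pairs, and are not taken from the slide.

```python
import math

# (handbook weight, ranger weight); values are illustrative only.
Q = (1.0, 2.0)
A = (2.0, 6.0)
B = (1.5, 1.5)


def distance(u, v):
    """Straight-line distance between the endpoints of two vectors."""
    return math.hypot(u[0] - v[0], u[1] - v[1])


def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


print(distance(Q, B) < distance(Q, A))   # True: B's endpoint is closer to Q
print(cosine(Q, A) > cosine(Q, B))       # True: A is more similar by angle
```

Cosine similarity ignores vector length, so a long document that keeps the query's proportions between terms scores higher than a short one that does not.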
Applying Cosine Similarity to RDF Ontologies
• The cosine similarity measure is the common method for determining text-based relevance
• In our RDF triplestore we will use cosine similarity to determine relevance between RDF resources
• This is accomplished by using definitions of tf and idf based on the RDF predicates that match our search query
Performance considerations
• The performance of this relevance algorithm applied to a triplestore is hindered by the lack of scalar functions in SPARQL.
• The mitigation is
– Pre-computation of partial values
– Refactoring of the algorithm code to reduce roundtrips to the triplestore
• Pre-computation during registration will require additional code in the register-WS
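One partial value that lends itself to pre-computation is each document's vector length, the square-root term in the cosine denominator. The sketch below assumes that choice; the slides do not say which values are pre-computed. A dict stands in for the triplestore, and `register` and `search` are hypothetical names.

```python
import math

# doc_id -> (term weights, precomputed vector length); stand-in for the triplestore
index = {}


def register(doc_id, weights):
    """Registration time: store the weights together with the vector length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    index[doc_id] = (weights, norm)


def search(query):
    """Query time: only the dot product is computed for each document."""
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    scores = {}
    for doc_id, (weights, norm) in index.items():
        dot = sum(w * weights.get(t, 0.0) for t, w in query.items())
        scores[doc_id] = dot / (q_norm * norm) if q_norm and norm else 0.0
    return sorted(scores, key=scores.get, reverse=True)


register("A", {"ranger": 6.0, "handbook": 2.0})
register("B", {"ranger": 1.5, "handbook": 1.5})
order = search({"ranger": 2.0, "handbook": 1.0})
```

The trade-off is extra work and storage at registration in exchange for less arithmetic per search, which matches the note that register-WS needs additional code.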
Current Document Population Methods
– Document Manager
• One document at a time
– Batch
• Spreadsheet
• Ziptool