Software Design Specifications
For
Worden
Prepared By:
Marc Barrowclift
Travis Dutko
Corey Everitt
Jiakang Jin
Eric Nordstrom
Ivan "Frankie" Orrego
[--------------]
Kyle Musal
Aaron Chapin
Andrew Orner
Matthew Tornetta

Advised By:
Rachel Greenstadt
Jeff Salvage
Drexel University
Table of Contents
3.1. Technologies Used
  3.1.1 JStylo
  3.1.2. Java 1.8
  3.1.3. WEKA
  3.1.4. Stanford Part-of-Speech Tagger
  3.1.5. Microsoft Azure
  3.1.6. Apache Spark
  3.1.7. Java Graphical Authorship Attribution Program (JGAAP)
  3.1.8 Maven
  3.1.9 3-Clause BSD License
3.2. Application Overview
3.3. Design Patterns
  3.3.1 Machine Learning Integration
  3.3.2 Experiment Construction via the FullAPI
4. Detailed Design
  4.1. Database Design
  4.2. User Interaction / Dataflow Design
  4.3. JStylo Design
    4.3.1 Analyzers Package
      4.3.1.1 WekaAnalyzer
      4.3.1.2 SparkAnalyzer
    4.3.2 Canonicizers Package
      4.3.2.1. Canonicizer
      4.3.2.2 Apply Special Keys
      4.3.2.3 BrownExtractPOSTags
      4.3.2.4 BrownExtractText
      4.3.2.5 RemoveFirstNLines
      4.3.2.6 RemoveSpecialKeys
      4.3.2.7 StripEdgesPunctuation
      4.3.2.8 StripSpaces
      4.3.2.9 WordEndsPunctSeparator
    4.3.3 Event Cullers Package
      4.3.3.1 EventCuller
      4.3.3.2 FrequencyEventsExtended
      4.3.3.3 LeastCommonEventsExtended
      4.3.3.4 MaxAppearances
      4.3.3.5 MinAppearances
      4.3.3.6 MostCommonEventsExtended
    4.3.4 Event Drivers Package
      4.3.4.1 EventDriver
      4.3.4.2 CharCounterEventDriver
      4.3.4.3 EventsCounterEventDriver
      4.3.4.4 FastTagPOSNGramsEventDriver
      4.3.4.5 FastTagPOSTagsEventDriver
      4.3.4.6 FleschReadingEaseScoreEventDriver
      4.3.4.7 GunningFogIndexEventDriver
      4.3.4.8 LetterCounterEventDriver
      4.3.4.9 LetterNGramEventDriver
      4.3.4.10 ListEventDriver
      4.3.4.11 ListRegexpEventDriver
      4.3.4.12 MaxentPOSNGramsEventDriver
      4.3.4.13 MaxentPOSNGramsEventDriverGeneric
      4.3.4.14 MaxentPOSTagsEventDriver
      4.3.4.15 MaxentPOSTagsEventDriverGeneric
      4.3.4.16 RegexpCounterEventDriver
      4.3.4.17 RegexpEventDriver
      4.3.4.18 SentenceCounterEventDriver
      4.3.4.19 SingleNumericEventDriver
      4.3.4.20 SyllableCounterEventDriver
      4.3.4.21 TreeTaggerEventDriver
      4.3.4.22 TreeTaggerNGramsEventDriver
      4.3.4.23 UniqueWordsCounterEventDriver
      4.3.4.24 WordCounterEventDriver
      4.3.4.25 WordLengthEventDriver
    4.3.5 Generics Package
      4.3.5.1 Analyzer
      4.3.5.2 CumulativeEventCuller
      4.3.5.3 CumulativeFeatureDriver
      4.3.5.4 FeatureDriver
      4.3.5.5 FeatureExtractionAPI
      4.3.5.6 FullAPI
      4.3.5.7 LocalParallelFeatureExtractionAPI
      4.3.5.8 ProblemSet
      4.3.5.9 StringDocument
5. Appendix
  5.1 Traceability Matrix
1. Document History
Version | Date | Editor | Description
1.0 | 1/28/2010 | MT, AC, KM, AO | Initial Version
2.0 | 1/29/2016 | TD | New iteration of the software
2. Introduction
2.1. Purpose
The purpose of this document is to describe, in detail, the design and architecture of Worden (henceforth referred to as "the software") in its entirety. These design decisions pertain to the functionality, performance, and external interfaces of the software. This document will be used to implement all features as described in the requirements document for the software.
2.2. Scope
This document pertains to the design and architecture of the release of Worden and includes all functional requirements as listed in the requirements document for the software. This document is intended for the developers and testers of the software.
2.3. Design Goals
The software is designed to improve upon the original Stylo software in the following ways:
● Ease of Use: The software interface is an improvement over previous iterations in terms of usability by users not versed in the field of stylometry.
● Modularity: The system will preserve the modularity of certain aspects of the software (Feature Sets) and provide additional modularity to further separate the data extraction, classification, and display components.
● Open Source: The software is open source and licensed under the 3-Clause BSD License (see section 3.1.9).
● Portability: The software can run on any platform which is capable of running Java 1.8.
● Security: Due to the possible sensitivity of works being analyzed, the software transmits information in an encrypted manner and holds onto it for a minimal amount of time, as defined in requirements 5.3.1, 5.3.2, and 5.3.3.
● Accuracy: The software's analysis accuracy is preserved from previous versions of the software.
2.4. Glossary
Classifier: The machine learning algorithm used to process data extracted from a problem set.
Corpus (pl. Corpora): A large, structured set of documents from various authors.
Domain: The context in which a piece of writing was written (essays, blogs, tweets).
Feature: An element of a person's writing style which can be extracted from text they have written.
Feature Set: A group of one or more similar features.
Function/Stop Words: A set of words which is typically ignored when analyzing features of a document (for example: a, an, the). These are typically selected due to their commonality.
Graphical Toolkit: A set of tools for creating a graphical interface for a piece of software.
Problem Set: A set of documents which defines both which authors a classifier should learn from to establish the features that define each author, and which documents should have their authorship analyzed.
Machine Learning: The process of a computer extracting categories, in the form of patterns, from data (in our case, numeric data). These patterns can be used to assign categories to new data brought into the system.
N-Gram: A sub-sequence of n items from a given sequence. For example, the sentence "I painted the fence red." has the 2-grams "I painted", "painted the", "the fence", and "fence red". (Unigrams = 1-grams, Bigrams = 2-grams, Trigrams = 3-grams)
Testing: The process in which a classifier labels unlabeled data with one of the categories it learned during the training process.
Training: The process in which a classifier creates categories of data (in our case, defining features of specific authors) from labeled data.
Stylometry: The study of author identification through analysis of linguistic style.
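The n-gram definition above can be sketched in code. This word-level helper is illustrative only and is not part of JStylo:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGrams {
    // Word-level n-grams: strip punctuation, split on whitespace, then
    // slide a window of size n across the word list.
    static List<String> wordNGrams(String text, int n) {
        String[] words = text.replaceAll("[^\\w\\s]", "").trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }
}
```

Calling wordNGrams("I painted the fence red.", 2) yields the four 2-grams listed in the glossary entry.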
2.5. References
JStylo: <https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth>
WEKA: <http://www.cs.waikato.ac.nz/ml/weka/>
Stanford POS Tagger: <http://nlp.stanford.edu/software/tagger.shtml>
Apache Spark: <http://spark.apache.org/>
Microsoft Azure: <https://azure.microsoft.com/en-us/>
JGAAP: <http://evllabs.com/jgaap/w/index.php/Main_Page>
Apache Maven: <https://maven.apache.org/>
3-Clause BSD: <https://opensource.org/licenses/BSD-3-Clause>
2.6. Context Diagram
Figure 1.
3. Architecture
3.1. Technologies Used
3.1.1 JStylo
JStylo is a stylometry tool which provides authorship attribution and classifier testing capabilities. It is an experimental tool used by researchers to create and determine the effectiveness of feature sets, identify the authors of documents, and assess the effectiveness of classifiers. This project uses JStylo as an engine, providing a streamlined UI for several common, non-technical use cases as well as educational commentary to help non-technical users understand stylometry.
3.1.2. Java 1.8
Java 1.8 was selected as the platform for the system's backend for two primary reasons. The first is portability: because the backend is written in Java, any system with a Java 1.8 JRE can potentially host Worden. The second is that JStylo, the second and current iteration of the Stylo project, was written in Java; as the purpose of the project is to improve usability and performance, a complete rewrite was not desired.
3.1.3. WEKA
The Waikato Environment for Knowledge Analysis (WEKA) is a machine learning tool. It was selected for JStylo because it is easy to use and provides a wide library of classifiers for machine learning. WEKA has been embedded since JStylo, but its interaction with JStylo has been improved to increase performance and make the system more modular.
3.1.4. Stanford Part-of-Speech Tagger
The Stanford Part-of-Speech Tagger is a natural language processing tool which parses text and labels each word in a sentence with its part of speech. It is utilized for the purpose of text analysis.
3.1.5. Microsoft Azure
Microsoft Azure is a cloud computing platform used to host the Worden application. It provides scalable power and pricing, which makes it ideal for this project. While Worden is initially a small project that does not need this extra power, having the capability built in allows the inheritors of the project to take advantage of it in the future.
3.1.6. Apache Spark
Apache Spark is a tool for parallel/distributed processing. It offers local parallelization as well as distributed computing capabilities, and it provides a machine learning library in the form of Spark MLlib. Spark is utilized to increase processing speed over WEKA. It also improves JStylo's scalability by providing an easy avenue toward distributed processing.
3.1.7. Java Graphical Authorship Attribution Program (JGAAP)
JGAAP provides a low-level framework for extracting feature information from a document. While Worden does not use JGAAP directly, JStylo builds upon it and extends many of its classes.
3.1.8 Maven
Apache Maven is a project and dependency management tool. It provides tools through which a project can be versioned, and it streamlines the dependency update and build-path construction process.
3.1.9 3-Clause BSD License
The 3-Clause BSD license is an open source license. It was selected for Worden by virtue of being the license under which JStylo is currently released.
3.2. Application Overview
The software analyzes features from selected texts. The extracted features are sent through a machine learning tool and compared against a previously trained corpus in an attempt to attribute the texts to an author contained in the selected corpus. The software presents this information clearly to the user and provides tools for understanding both how stylometry functions and how the authors were identified. It also provides advanced options for users who already know the field and want finer control over their experiments. At its highest level, the application operates as shown in Figure 1; Figure 2 provides a more detailed breakdown and module size estimations.
Worden is a three-part system. First is the backend, which is composed of JStylo, WEKA, Apache Spark, and various feature extraction tools; this provides all of the heavy lifting of the processing. Second is the server layer, which makes API calls to JStylo and provides endpoints for sending and retrieving information; it also contains a database for temporary data storage. Third is the UI, which sends information to and retrieves information from the server, providing interpretive services that aid the user in understanding the stylometry process.
Figure 2.
3.3. Design Patterns
Updating JStylo's inner workings involved utilizing and working with the following design patterns.
3.3.1 Machine Learning Integration
Figure 3.
The machine learning integration is done via the Adapter pattern. As shown in Figure 3 (and expanded upon in section 4.3.5.1), the superclass Analyzer provides a generic implementation of a few required methods, along with abstract method signatures for the inputs and return values that JStylo/Worden expects. With this implementation, any machine learning library should be usable so long as an Adapter is programmed to interface between JStylo and that library.
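A minimal sketch of this adapter arrangement, with simplified stand-in types (the real Analyzer works with WEKA Instances and JGAAP Documents; all names and signatures here are illustrative only):

```java
import java.util.Collections;
import java.util.Map;

// Simplified stand-in for the Analyzer superclass described above.
abstract class AnalyzerSketch {
    // JStylo/Worden expects every adapter to accept feature data in a known
    // form and return author-ownership probabilities in a known form.
    abstract Map<String, Double> classify(double[][] trainingData, double[][] unknownData);
}

// A hypothetical adapter for some third-party machine learning library.
class ThirdPartyAnalyzer extends AnalyzerSketch {
    @Override
    Map<String, Double> classify(double[][] trainingData, double[][] unknownData) {
        // 1. Convert JStylo's feature data into the library's native format.
        // 2. Train and classify using the library's own API.
        // 3. Convert the library's results back into the Map JStylo expects.
        return Collections.singletonMap("authorA", 1.0); // placeholder result
    }
}
```

The adapter isolates all library-specific conversion code in one class, so swapping machine learning backends does not touch the rest of the pipeline.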
3.3.2 Experiment Construction via the FullAPI
As shown in section 4.3.5.6, the FullAPI class provides a realization of the Builder pattern. This pattern is used to provide a number of conveniences, as follows:
● Allows for multiple experimental setups, meaning that both JStylo and Worden can use the same API by providing different information/making different calls.
● Exposes everything that could be needed for external use, while making most of it optional.
● Allows for either high level calls/construction, or low level, depending on application needs.
The FullAPI also provides a Facade over the FeatureExtractionAPI and machine learning capabilities of the JStylo system.
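Builder-style construction of an experiment could look roughly like the following sketch; the chained method names are hypothetical stand-ins, not the actual FullAPI calls:

```java
// Illustrative Builder: required settings are chained, optional settings
// (like the fold count) carry defaults and may be omitted.
class ExperimentBuilder {
    private String problemSet;
    private String featureSet;
    private int folds = 10; // optional setting with a default

    ExperimentBuilder problemSet(String path) { this.problemSet = path; return this; }
    ExperimentBuilder featureSet(String path) { this.featureSet = path; return this; }
    ExperimentBuilder crossValidationFolds(int n) { this.folds = n; return this; }

    // In the real FullAPI this step would assemble and run the experiment;
    // here it just reports the assembled configuration.
    String build() {
        return problemSet + " + " + featureSet + " (" + folds + "-fold)";
    }
}
```

A caller chains only the options it needs, e.g. new ExperimentBuilder().problemSet("ps.xml").featureSet("fs.xml").build(), which is how the same API can serve both JStylo's and Worden's setups.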
4. Detailed Design
4.1. Database Design
As the database is temporary storage, performance and search are not design concerns. The database merely holds the serialized data from a user's experiments until the user accesses it or a day has passed, at which point the data is removed.
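The retention rule above can be expressed as a simple predicate; the class and method names here are hypothetical, since this document does not specify the actual cleanup mechanism:

```java
import java.time.Duration;
import java.time.Instant;

// Hedged sketch of the retention rule: a record becomes removable once the
// user has retrieved it or it is more than a day old.
class RetentionSketch {
    static boolean shouldRemove(boolean accessed, Instant storedAt, Instant now) {
        return accessed || Duration.between(storedAt, now).toDays() >= 1;
    }
}
```

A periodic cleanup job would sweep the database and delete every record for which this predicate holds.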
4.2. User Interaction / Dataflow Design
One of the primary purposes of Worden is to simplify the user interaction and dataflow process for non-technical users. To that end, the Worden workflow contains very little variance in execution due to user interaction. To illustrate this, we have included Flow Diagram 1 and Flow Diagram 2, which outline the dataflow in two common use cases.
Flow Diagram 1.
Flow Diagram 2.
4.3. JStylo Design
This section documents JStylo's internal classes and packages. Note that the Canonicizers (4.3.2), EventCullers (4.3.3), and EventDrivers (4.3.4) packages were not modified for Worden aside from bugfixes. Also note that the JStylo GUI package is not described in this document, as Worden neither modifies nor utilizes it.
4.3.1 Analyzers Package
The Analyzers package contains all of the implementations of the Analyzer interface (4.3.5.1). These implementations are the bridge between a machine learning library and JStylo, allowing the library's machine learning algorithms to be used with JStylo's feature extraction capabilities.
4.3.1.1 WekaAnalyzer
WEKA implementation of 4.3.5.1. See that entry for method details.
4.3.1.2 SparkAnalyzer
Apache Spark implementation of 4.3.5.1. See that entry for method details. Not yet implemented.
4.3.2 Canonicizers Package
The Canonicizers package is used by JStylo's feature extraction engine. Each canonicizer pre-processes a document's text in preparation for a feature to be extracted (for example, a document being tested for character n-grams may have capitalization normalized so that 'A' and 'a' are both considered the same character).
All classes in this package extend the JGAAP class Canonicizer, the methods of which are documented in 4.3.2.1. For the sake of concision, these methods will not be described in each canonicizer implementation (although they will be shown in the diagrams).
4.3.2.1. Canonicizer
Method | Input | Output | Description
displayName(): String | - | the String name of the canonicizer | the human-readable name of the canonicizer
tooltipText(): String | - | a descriptor of the canonicizer's purpose | -
showInGUI(): boolean | - | whether or not the class should display in JGAAP's GUI | unused for JStylo/Worden
process(char[]): char[] | text to be processed | text that has been processed | performs the actual text modification
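As an illustration of the process(char[]) contract above, here is a minimal hypothetical canonicizer that normalizes capitalization (the example from the package description); it is a sketch, not one of the actual classes:

```java
// Illustrative canonicizer: lowercases text so that 'A' and 'a' are
// considered the same character during feature extraction. The real
// canonicizers also implement displayName(), tooltipText(), etc.
class LowercaseCanonicizer {
    char[] process(char[] text) {
        char[] out = new char[text.length];
        for (int i = 0; i < text.length; i++) {
            out[i] = Character.toLowerCase(text[i]);
        }
        return out;
    }
}
```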
4.3.2.2 Apply Special Keys
This canonicizer replaces special characters with a particular alternative character. For example, it normalizes the document so that all tabs entering the system are the same type of tab, even though different document types encode that formatting character differently. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.3 BrownExtractPOSTags
No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.4 BrownExtractText
No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.5 RemoveFirstNLines
Removes the first N lines from a document. Useful for scenarios with headers or other non-author writing of a fixed length. Rarely used, however, because it is seldom applicable to an entire dataset. No methods aside from the ones implemented from interface 4.3.2.1.
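A minimal sketch of this idea (illustrative only; the real class implements the JGAAP Canonicizer interface and operates on char[]):

```java
// Drop a fixed-length header (e.g. non-author boilerplate) before
// feature extraction by removing the first n lines.
class RemoveFirstNLinesSketch {
    static String process(String text, int n) {
        String[] lines = text.split("\n", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = n; i < lines.length; i++) {
            if (i > n) sb.append('\n');
            sb.append(lines[i]);
        }
        return sb.toString();
    }
}
```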
4.3.2.6 RemoveSpecialKeys
Removes special keys/characters from a document entirely. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.7 StripEdgesPunctuation
Removes punctuation that is at the beginning or end of a word (such as quotes, commas, periods, etc.). No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.8 StripSpaces
Removes whitespace. Useful for letter-adjacency features. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.9 WordEndsPunctSeparator
Separates the ends of words from any punctuation which follows them, preventing the scenario where "end" and "end." are counted as distinct words. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.3 Event Cullers Package
The Event Cullers package is used by JStylo's feature extraction engine to determine which features are relevant for analysis. For example, more than 1,000 character bi-gram features may not add much to the accuracy but can greatly impact runtime, so an EventCuller determines which 1,000 to keep and which extras to discard. Most EventCullers in this package extend JGAAP's EventCuller, the methods of which are documented in 4.3.3.1.
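The culling idea can be sketched with plain counts; this mirrors MaxAppearances (4.3.3.4) in spirit, though the real cull() operates on JGAAP EventSet lists rather than a Map:

```java
import java.util.Map;
import java.util.stream.Collectors;

// Keep only events whose occurrence count stays at or below a maximum
// threshold, discarding the rest before classification.
class MaxAppearancesSketch {
    static Map<String, Integer> cull(Map<String, Integer> eventCounts, int max) {
        return eventCounts.entrySet().stream()
                .filter(e -> e.getValue() <= max)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

The other cullers in this package vary only in the selection rule (minimum threshold, most common, least common); the filter-and-keep structure is the same.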
4.3.3.1 EventCuller
Method | Input | Output | Description
displayName(): String | - | the String name of the EventCuller | the human-readable name of the EventCuller
tooltipText(): String | - | a descriptor of the EventCuller's purpose | -
showInGUI(): boolean | - | whether or not the class should display in JGAAP's GUI | unused for JStylo/Worden
cull(List<EventSet>): List<EventSet> | the list of extracted features to cull | a list containing only the desired features | performs the actual culling
4.3.3.2 FrequencyEventsExtended
Method | Input | Output | Description
getFrequency(EventSet[]): Map | list of EventSets to process | a mapping of each EventSet to how often it occurs | used for preprocessing for certain cullers
compare(String, String): int | two strings to compare | <0 if string 1 comes before string 2, >0 for the opposite, 0 if they are equivalent | used to compare two EventSets so they can be ordered properly
Abstract class which provides common functionality for a couple of EventCullers. Extends 4.3.3.1.
4.3.3.3 LeastCommonEventsExtended
Finds the rarest events from a set and keeps them over the others. Extends 4.3.3.2 and implements 4.3.3.1.
4.3.3.4 MaxAppearances
An EventCuller which disposes of events which have occurred over a specified maximum threshold. Extends 4.3.3.1.
4.3.3.5 MinAppearances
An EventCuller which disposes of events which have occurred under a specified minimum threshold. Extends 4.3.3.1.
4.3.3.6 MostCommonEventsExtended
Finds the most common events and prioritizes keeping them over the rarer events. Extends 4.3.3.2 and implements 4.3.3.1.
4.3.4 Event Drivers Package
The Event Drivers package is used by JStylo's feature extraction engine to pull feature data from text. Each EventDriver provides a method for extracting numeric data from a specific feature of a document's text.
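As an illustration of single-value extraction (cf. getValue(Document) in 4.3.4.1 and WordCounterEventDriver in 4.3.4.24), here is a simplified sketch operating on a plain String instead of a JGAAP Document:

```java
// Illustrative event driver: extract one numeric feature (the word count)
// from raw text. Real drivers implement displayName(), tooltipText(), etc.
// and receive a Document object rather than a String.
class WordCountSketch {
    static double getValue(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

Each of the drivers below follows this shape, differing only in what it counts or measures.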
4.3.4.1 EventDriver
Method | Input | Output | Description
displayName(): String | - | the String name of the EventDriver | the human-readable name of the EventDriver
tooltipText(): String | - | a descriptor of the EventDriver's purpose | -
showInGUI(): boolean | - | whether or not the class should display in JGAAP's GUI | unused for JStylo/Worden
getValue(Document doc): double | the document to extract features from | the non-normalized, non-weighted value of the feature | extracts the feature from the document
4.3.4.2 CharCounterEventDriver
An EventDriver which counts the number of characters in a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.3 EventsCounterEventDriver
An EventDriver which counts something within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1. Methods getUnderlyingEvents() and setUnderlyingEvents() are unused and currently flagged for removal.
4.3.4.4 FastTagPOSNGramsEventDriver
An EventDriver which uses the FastTagger to tag parts of speech within a document and report how many times each POS n-gram occurs. Here, a POS n-gram means something like "adjective-noun" or "noun-verb". Extends and implements 4.3.4.1.
4.3.4.5 FastTagPOSTagsEventDriver
An EventDriver which uses the FastTagger to tag and report how many instances of each part of speech are within the document (how many nouns, how many verbs, etc.). Extends and implements 4.3.4.1.
4.3.4.6 FleschReadingEaseScoreEventDriver
An EventDriver which calculates the Flesch Reading Ease score of the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.7 GunningFogIndexEventDriver
An Event Driver which calculates the Gunning-Fog Index Score of the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.8 LetterCounterEventDriver
An EventDriver which counts the number of letters in the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.9 LetterNGramEventDriver
An EventDriver which reports on the number of times a given letter-n-gram (“ae”, “th”, etc) was used within the document. Extends and implements 4.3.4.1.
4.3.4.10 ListEventDriver
A generic EventDriver which counts matches of each phrase or word on a list. Used, for example, to find how many words in the document are incorrectly spelled and are on the list of most commonly misspelled words. Extends and implements 4.3.4.1.
4.3.4.11 ListRegexpEventDriver
A generic EventDriver which reports on the number of matches for each regexp on a list of regexps. Extends and implements 4.3.4.1.
4.3.4.12 MaxentPOSNGramsEventDriver
An EventDriver which uses the Maxent tagger to tag and report counts on POS N-grams. Extends and implements 4.3.4.1.
4.3.4.13 MaxentPOSNGramsEventDriverGeneric
An EventDriver which uses the Maxent tagger to tag and report counts on POS N-grams. Extends and implements 4.3.4.1.
4.3.4.14 MaxentPOSTagsEventDriver
An EventDriver which uses the Maxent tagger to tag and report counts on POS tag counts. Extends and implements 4.3.4.1.
4.3.4.15 MaxentPOSTagsEventDriverGeneric
An EventDriver which uses the Maxent tagger to tag and report counts on POS tag counts. Extends and implements 4.3.4.1.
4.3.4.16 RegexpCounterEventDriver
An EventDriver which counts how many times a specific regular expression occurs within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.17 RegexpEventDriver
An EventDriver which counts how many regular expression matches occur within a document. Extends and implements 4.3.4.1.
4.3.4.18 SentenceCounterEventDriver
An EventDriver which counts how many sentences are within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.19 SingleNumericEventDriver
Abstract class which extends 4.3.4.1, for features which resolve into a single double value rather than a set of events. Provides a method common to all such features: createEventSet(Document): EventSet.
Method | Input | Output | Description
getValue(Document): double | - | - | inherited from 4.3.4.1
createEventSet(Document): EventSet | Document file to be processed | an EventSet containing the event data | stores feature information in set form so that it can be used interchangeably with non-single-number events that have also been extracted
4.3.4.20 SyllableCounterEventDriver
An EventDriver which counts the number of syllables in a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.21 TreeTaggerEventDriver
An EventDriver which uses TreeTagger, a more language-independent POS tagger, to count POS tags. Extends and implements 4.3.4.1.
4.3.4.22 TreeTaggerNGramsEventDriver
An EventDriver which uses TreeTagger, a more language-independent POS tagger, to count POS n-grams. Extends and implements 4.3.4.1.
4.3.4.23 UniqueWordsCounterEventDriver
An EventDriver which counts how many unique words were used within the document. Unique here means used only a single time throughout the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.24 WordCounterEventDriver
An EventDriver which counts the total number of words within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.25 WordLengthEventDriver
An EventDriver which counts how many words of each length occur within the document. Extends and implements 4.3.4.1.
4.3.5 Generics Package
The Generics package contains JStylo's high-level abstract classes and interfaces, as well as the various APIs, each of which has its own degree of granularity. A few miscellaneous utility classes are also located here.
4.3.5.1 Analyzer
Method | Input | Output | Description
isType | - | enumeration of what type of analyzer it is | needed due to incomplete generalization
classify(Instances, Instances, Document[]): Map | training instances, testing instances, unknown documents | map of documents to probabilities for each author's ownership | runs an experiment with unknown authors
runCrossValidation(Instances, int, long): Evaluation (WEKA object) | feature data, number of calculation folds, random seed | experiment results | runs an experiment to validate the classifier/feature set
runCrossValidation(Instances, int, long, int): Evaluation (WEKA object) | as above, but with a 4th parameter as a relaxation factor | experiment results | runs an experiment to validate the classifier/feature set
getLastStringResults(): String | - | - | contains experiment result data
getTrainTestEval(Instances, Instances): Evaluation (WEKA object) | training data, testing data | the Experiment object of the last run unknown-author analysis | contains experiment result data
getLastTrainingSet(): Instances (WEKA object) | - | feature data of the training documents | experiment metadata
getLastTestSet(): Instances (WEKA object) | - | feature data of the testing documents | experiment metadata
getAllInstances(): Instances (WEKA object) | - | all feature data | experiment metadata
getLastAuthors(): String[] | - | list of the String names of the authors in the last experiment | experiment metadata
getLastResults(): Map | - | a map of the likelihood of each author writing a given document | contains experiment result data
getOptions() | - | the exact options string used | experiment metadata
setOptions(String[]) | options to set | - | experiment metadata
optionsDescription(): String | - | description of the options used | experiment metadata
analyzerDescription(): String | - | description of the analyzer | experiment metadata
getClassifier(): Classifier (WEKA object) | - | the underlying classifier object for analysis | experiment metadata; here due to poor initial generalization
getName(): String | - | String name of the analyzer | -
The Analyzer interface ensures that any integrated machine learning library exposes its internals via a series of required get/set methods. It also ensures that each form of analysis is implemented via the runCrossValidation() and classify() methods.
4.3.5.2 CumulativeEventCuller
Method | Input | Output | Description
cull(List<EventSet>[], CumulativeFeatureDriver): List<EventSet> | all events from the documents, the feature driver used to extract the events | a shortened list of events | iterates over the list of lists of event sets, culling according to the CFD's configuration
cullWithRespectToKnown(CumulativeFeatureDriver, List<EventSet>[], List<EventSet>[]): List<EventSet> | the cumulative feature driver used to extract the events, the known events, the unknown events | a shortened list of events | culls the events with respect to the CFD, but only keeps events if they are already known
A class which collects and applies EventCullers to the document data.
4.3.5.3 CumulativeFeatureDriver
Method | Input | Output | Description
CumulativeFeatureDriver() | - | - | empty constructor
CumulativeFeatureDriver(CumulativeFeatureDriver) | CFD to clone | - | clone constructor
CumulativeFeatureDriver(FeatureDriver[]) | collection of FeatureDrivers to build the CFD from | - | constructor to build programmatically
CumulativeFeatureDriver(String) | path to an XML file which defines a CFD | - | constructor to load from file
createEventSets(Document, boolean) | document to have its data extracted; boolean is true if running over a distributed platform, false if local | List<EventSet> of the features extracted from the document | extracts a single document's events
clean() | - | - | cleans up any unneeded data. Note: shouldn't be needed in the final version
featureDriverAt(int): FeatureDriver | index of the desired driver | FeatureDriver at that index | fetches a FeatureDriver object
displayName(): String | - | the CFD's name | name of the FeatureSet
numOfFeatureDrivers(): int | - | number of feature drivers in the CFD | -
setName(String) | new name | - | changes the CFD's name
setDescription(String) | new description | - | changes the CFD's description
addFeatureDriver(FeatureDriver): int | new feature driver | index of the feature driver on success, -1 on failure | adds a new FeatureDriver to the CFD
replaceFeatureDriverAt(int, FeatureDriver): FeatureDriver | index to replace, new feature driver | old FeatureDriver | replaces one feature driver with another
removeFeatureDriverAt(int): FeatureDriver | index to remove | old FeatureDriver | pops out the FeatureDriver at that index
getName() | - | - | redundant; needs to be removed
getDescription(): String | - | String description of the CFD | -
writeToXML(String): boolean | path of the file to save the CFD to | true on success, false on failure | -
toXMLString(): String | - | XML String of the CFD | -
Primary configuration object for the JStylo feature extraction process. Contains all of the information on which features to extract and how they should be pre-processed and post-processed. Used throughout the feature extraction process.
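A simplified model of the CFD's role as described above: a named container of per-feature drivers, built programmatically as with the FeatureDriver[] constructor. This sketch borrows a few method names from the table but is not the actual class (the real CFD also carries canonicizers, cullers, and normalization settings):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in: a feature set is a named list of feature drivers,
// represented here by plain driver names for brevity.
class CfdSketch {
    private String name;
    private final List<String> featureDrivers = new ArrayList<>();

    void setName(String n) { this.name = n; }
    String displayName() { return name; }
    // Mirrors the table's contract: returns the driver's index on success.
    int addFeatureDriver(String driverName) {
        featureDrivers.add(driverName);
        return featureDrivers.size() - 1;
    }
    int numOfFeatureDrivers() { return featureDrivers.size(); }
}
```

In the real system the assembled CFD would then be handed to the extraction engine, or serialized out via writeToXML(String).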
4.3.5.4 FeatureDriver
Method | Input | Output | Description
FeatureDriver() | - | - | empty constructor
FeatureDriver(String, boolean, EventDriver) | String name, true if it is a histogram, the EventDriver for the feature | - | basic constructor
FeatureDriver(String, boolean, EventDriver, Canonicizer[], EventCuller[]) | as above, but with a list of Canonicizers and a list of EventCullers | - | constructor with pre- and post-processing for the feature
FeatureDriver(String, boolean, EventDriver, Canonicizer[], EventCuller[], Pair) | as above, but with a pair containing the normalization type and value | - | constructor with pre-processing, post-processing, and normalization
setDisplayName(String) | new feature name | - | changes its name
setDescription(String) | new description | - | changes its description
setUnderlyingEventDriver(EventDriver) | new EventDriver | - | changes the feature which is being extracted
setCalcHist(boolean) | true if histogram, false if not | - | sets whether this is a histogram
addCanonicizer(Canonicizer) | new canonicizer | - | adds a canonicizer
setCanonicizers(List<Canonicizer>) | list of canonicizers to use | - | replaces the canonicizers with the new list
addEventCuller(EventCuller) | new culler | - | adds an EventCuller
setEventCullers(List<EventCuller>) | list of cullers to use | - | replaces the EventCullers with the new list
setNormalizationBaseline(Norm) | enum value for the type of normalization | - | normalization configuration
setNormalizationFactor(Factor) | how much to normalize by | - | normalization configuration
setNormalization(Pair) | a pair of enum to factor | - | normalization configuration
displayName(): String | - | FeatureDriver name | String name of the FD
getName(): String | - | FeatureDriver name | duplicate; to be removed
getDescription(): String | - | FeatureDriver description | description of the FD
getUnderlyingEventDriver(): EventDriver | - | JGAAP EventDriver | for debugging
isCalcHist(): boolean | - | true if histogram, false if not | getter
canonicizerAt(int): Canonicizer | index to retrieve | the canonicizer at that index | getter
getCanonicizers(): List<Canonicizer> | - | list of canonicizers for the FeatureDriver | getter
cullerAt(int): EventCuller | index to retrieve | the EventCuller at that index | getter
getCullers(): List<EventCuller> | - | list of EventCullers for the FD | getter
getNormalizationBaseline(): Norm | - | type of normalization to perform | getter
getNormalizationFactor(): Double | - | how much to normalize by | getter
getNormalization(): Pair | - | pair of normalization type to factor | getter
getClassParams(String): String | name of the class | valid parameters for that class | used to learn more about the feature driver
Class which contains all of the configuration and algorithmic information for extracting a single feature from a document. A collection of these is used to create the CFD.
4.3.5.5 FeatureExtractionAPI
Method | Input | Output | Description
extractEventSets(Document, CumulativeFeatureDriver, boolean) : List<EventSet> | the document to have its data extracted, the CFD defining what to extract, true if using cloud processing / false if local | List<EventSet> of all extracted features | extracts the features from a single document
cull(List<EventSet>, CFD) : List<EventSet> | events from a document, CFD | List<EventSet> of events that survive the document's culling | reduces the number of features for a document
getRelevantEvents(List<EventSet>, CFD) | events that are most valuable, CFD | List<EventSet> of events that survived training's culling | pulls out the events that are most valuable
getAttributeList(List<EventSet>, List<EventSet>, CFD, List<EventSet>, boolean, boolean) | all events, relevant events, CFD, document events, whether it is a sparse representation, whether the documents are labeled | List<Attribute> for creation of a WEKA Instance | WEKA conversion method; generates the objects needed to build a WEKA instance
createInstance(List<Attribute>, List<EventSet>, CFD, List<EventSet>, boolean, boolean) | the attribute list from the previous method, the relevant events, CFD, document events, whether it is a sparse representation, whether the documents are labeled | Instance (WEKA object) that represents the document | WEKA conversion method; builds an instance
normalizeInstance(CFD, Instance, List<EventSet>, boolean, List<Attribute>) | CFD, WEKA data to be normalized, document data, whether the data is labeled, attribute data | | WEKA conversion method; normalizes data
calcInfoGain(Instances) : double[][] | WEKA data for all documents | mapping of each feature to its value | determines which features are most valuable
applyInfoGain(double[][], Instances, int) : double[][] | feature data from calcInfoGain, WEKA data, number of features to keep | mapping of each feature to its value | removes non-valuable features
cullWithRespectToTraining(List<EventSet>, List<EventSet>, CFD) : List<EventSet> | relevant events (from the method above), document events, CFD | culled document events | culls a testing document and keeps only data that was present in training
This class provides access to methods for extracting a single document's features, so that feature extraction can be parallelized at higher levels of computation.
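Since the table rows are terse, here is a small, self-contained sketch of the kind of computation calcInfoGain performs: ranking features by information gain against class labels. The real method operates on WEKA Instances; the class name, the binary-feature encoding, and the toy data below are illustrative assumptions, not the actual implementation.

```java
import java.util.Arrays;

// Sketch: rank binary features by information gain against binary class labels.
public class InfoGainSketch {

    // Shannon entropy (bits) of a class-count distribution.
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // features[doc][f] is 0/1; labels[doc] is the class index (0 or 1).
    // IG(f) = H(class) - sum over feature values v of p(v) * H(class | v).
    static double infoGain(int[][] features, int[] labels, int f) {
        int n = labels.length;
        int[] classCounts = new int[2];
        for (int y : labels) classCounts[y]++;
        double hClass = entropy(classCounts);

        int[][] cond = new int[2][2]; // cond[featureValue][classLabel]
        for (int d = 0; d < n; d++) cond[features[d][f]][labels[d]]++;

        double hCond = 0.0;
        for (int v = 0; v < 2; v++) {
            int nv = cond[v][0] + cond[v][1];
            if (nv > 0) hCond += ((double) nv / n) * entropy(cond[v]);
        }
        return hClass - hCond;
    }

    public static void main(String[] args) {
        int[][] x = { {1, 0}, {1, 1}, {0, 0}, {0, 1} };
        int[] y =   {  1,      1,      0,      0 };
        // Feature 0 matches the label exactly; feature 1 is uninformative.
        System.out.println(infoGain(x, y, 0)); // 1.0
        System.out.println(infoGain(x, y, 1)); // 0.0
    }
}
```

applyInfoGain would then simply drop the features whose scores fall below the top-N cutoff.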
4.3.5.6 FullAPI
Method | Input | Output | Description
prepareInstances() | | | extracts data from the documents
prepareAnalyzer() | | | sets up the classifier learning process
calcInfoGain() | | | performs the infoGain calculation
applyInfoGain(int) | number of features to keep | | culls features with respect to infoGain
run() | | | performs the evaluation
writeArff(String, Instances) | path to output file, document data | | writes the document data to a file
The remaining methods are setters and getters. Not yet pictured is the Builder for the API, which is still under construction and will be added to the document once it is finalized.
This is the high-level API for JStylo. It allows for the creation and configuration of a JStylo experiment, for running it, and for access to deeper JStylo data structures and methods should the experimenter need them.
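The call order implied by the table can be sketched with stand-in stubs. FullAPIStub and its logging bodies are hypothetical; only the method names and their order come from the table above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stub that records the FullAPI experiment lifecycle;
// the real class drives JStylo rather than logging strings.
class FullAPIStub {
    final List<String> log = new ArrayList<>();
    void prepareInstances()      { log.add("prepareInstances"); }      // extract data from documents
    void prepareAnalyzer()       { log.add("prepareAnalyzer"); }       // set up classifier learning
    void calcInfoGain()          { log.add("calcInfoGain"); }
    void applyInfoGain(int keep) { log.add("applyInfoGain(" + keep + ")"); }
    void run()                   { log.add("run"); }                   // perform the evaluation
}

public class ExperimentSketch {
    public static void main(String[] args) {
        FullAPIStub api = new FullAPIStub();
        api.prepareInstances();
        api.prepareAnalyzer();
        api.calcInfoGain();
        api.applyInfoGain(500); // keep the 500 most informative features (arbitrary choice)
        api.run();
        System.out.println(api.log);
    }
}
```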
4.3.5.7 LocalParallelFeatureExtractionAPI
Getters and setters are not described in the table below.
Method | Input | Output | Description
LocalParallelFeatureExtractionAPI() | | | empty constructor
LocalParallelFeatureExtractionAPI(Preferences) | local user preferences | | not used in Worden; used to initialize from local JStylo
LocalParallelFeatureExtractionAPI(Builder) | a Builder | | constructor that builds the API from a Builder
LocalParallelFeatureExtractionAPI(LocalParallelFeatureExtractionAPI) | API instance to clone | | clone constructor
cleanAttributes() | | | disposes of unneeded objects
clean() | | | disposes of unneeded objects
extractEventSetsThreaded() | | | extracts all of the features from the documents via Java parallelization
initializeRelevantEvents() | | | initializes events from training in order to map them to testing
initializeAttributes() | | | creates Attribute objects for WEKA conversion
createTrainingInstancesThreaded() | | | creates the WEKA object for the training data
createTestingInstancesThreaded() | | | creates the WEKA object for the testing data
applyInfoGain(int) | number of features to keep | | uses infoGain to cull features
calcInfoGain() : double[][] | | mapping of each feature to how valuable it is | determines which features are most valuable
writeToArff(String, Instances) | path to file, data | | writes the data to an ARFF file
writeToCSV(String, Instances) | path to file, data | | writes the data to a CSV file
reset() | | | clears all data from the API
kill() | | | ends all ongoing threads
This class provides a parallelized feature extraction process for local processing, and it also parallelizes the creation of the WEKA data from those features. However, it is not generic and will be broken up during Worden development.
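The per-document independence behind extractEventSetsThreaded() can be sketched with Java 8 parallel streams. This is an illustrative stand-in, not the real method: the "feature" here is just a word-frequency map, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of one-task-per-document parallel feature extraction.
public class ParallelExtractionSketch {

    // Stand-in extractor: a case-folded word-frequency map.
    // The real class extracts full JStylo event sets per the CFD.
    static Map<String, Long> extractFeatures(String doc) {
        return Arrays.stream(doc.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the quick brown fox",
                "the lazy dog and the fox");

        // parallelStream() farms each document out to the common fork-join pool;
        // because documents are independent, no shared state is needed.
        List<Map<String, Long>> featureSets = docs.parallelStream()
                .map(ParallelExtractionSketch::extractFeatures)
                .collect(Collectors.toList());

        System.out.println(featureSets.get(1).get("the")); // 2
    }
}
```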
4.3.5.8 ProblemSet
This class contains all information about which documents to use and which documents were written by which authors. It has many extraneous/deprecated methods and requires revision.
4.3.5.9 StringDocument
Method | Input | Output | Description
StringDocument() | | | builds an empty StringDocument
StringDocument(Document) | the document to copy | | builds a StringDocument by loading the data from an existing Document
StringDocument(String, String, String) | the document body, the document author, the document title | | builds a StringDocument from the strings passed in
load() | | | loads the document into memory; only for use with the Document constructor
getFilePath() : String | | the path to the file | getter
An adapter that allows a document to be read without a backing file, making it easier to modify, create, and cut up documents in memory. A typical JGAAP Document requires the file on disk; this class circumvents that requirement.
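The adapter idea can be sketched as two implementations of one tiny text-source interface, one file-backed and one in-memory. All names here (TextSource, FileDocumentSource, StringDocumentSource's shape) are illustrative, not the real JGAAP or JStylo signatures.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// The common contract both sources satisfy.
interface TextSource {
    String getText() throws IOException;
}

// File-backed source: the usual JGAAP-style document that must exist on disk.
class FileDocumentSource implements TextSource {
    private final Path path;
    FileDocumentSource(Path path) { this.path = path; }
    public String getText() throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}

// In-memory source: lets tests and preprocessing build or cut up documents
// without touching the filesystem.
class StringDocumentSource implements TextSource {
    private final String body;
    StringDocumentSource(String body) { this.body = body; }
    public String getText() { return body; }
}

public class AdapterSketch {
    public static void main(String[] args) throws IOException {
        TextSource inMemory = new StringDocumentSource("Call me Ishmael.");
        System.out.println(inMemory.getText().split("\\s+").length); // 3
    }
}
```

Code that consumes a TextSource never needs to know whether the text came from disk or memory, which is exactly what StringDocument buys the rest of the pipeline.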
5. Appendix
5.1 Traceability Matrix
Note: this matrix still needs to be gone through and completed.
Open question for Prof. Salvage: how do you recommend documenting that UI requirements were satisfied when they are handled dynamically on the frontend/GUI side rather than through an API call?
Requirement Number | Design Component
4.2.1.1 | 4.3.7, 4.3.9
4.2.1.2
4.2.1.3
4.2.1.4
4.3.1
4.3.2
4.3.3
4.4.1.1 | 4.3.5.3
4.4.1.2 | 4.3.5.3 (needs UI support)
4.5.1.1
4.5.1.1.1
4.5.1.1.2
4.5.2.1
4.5.2.2
4.5.2.2.1
4.5.2.2.2
4.5.2.2.3
4.5.2.2.4
4.5.2.2.5
4.5.2.3
4.5.2.3.1
4.5.2.3.2
4.5.2.3.3
4.5.2.3.4
4.5.2.4
4.5.2.4.1
4.5.2.4.2
4.5.2.5
4.5.2.5.1
4.5.2.5.2
4.5.3.1
4.5.3.2
4.5.3.2.1
4.5.3.2.1.1
4.5.3.2.1.2
4.5.3.2.2
4.5.3.2.2.1
4.5.3.2.2.1.1
4.5.3.2.2.2
4.5.3.2.2.2.1
4.5.3.2.2.2.2
4.5.3.3
4.5.3.3.1
4.5.3.3.2
4.5.3.4
4.5.3.4.1
4.5.3.4.2
4.5.3.5
4.5.3.5.1
4.5.3.5.2
4.5.3.6
4.5.3.6.1
4.5.3.6.2
4.5.3.6.3
4.5.3.7
4.5.3.7.1
4.5.3.7.2
4.5.3.8
4.5.4.1
4.5.4.2
4.5.4.3
4.5.5.1
4.5.5.1.1
4.5.5.1.1.1
4.5.5.1.2
4.5.5.1.2.1
4.5.5.2
4.5.5.2.1
4.5.5.2.1.1
4.5.5.2.1.2
4.5.5.2.2
4.5.5.2.3
4.5.5.3
4.5.5.3.1
4.5.5.3.2
4.5.5.3.2.1
4.5.5.3.2.2
4.5.5.4
4.5.5.4.1
4.5.6.1
4.5.6.2
4.5.6.3
4.5.6.4
4.5.6.4.1
4.5.6.4.2
4.5.6.4.3
4.5.6.5
4.5.7.1
4.5.7.2
4.5.7.2.1
4.5.7.2.2
4.5.7.3
4.5.7.3.1
4.5.7.3.2
4.5.7.3.2.1
4.5.7.3.2.2
4.5.7.4
4.5.7.4.1
4.6.1
4.6.2 | 4.3.1.1
4.6.3
4.7.1
4.7.2
4.7.3 | 3.1.8
5.1
5.2
5.3.1
5.3.2
5.3.3
5.4.1
5.4.2
5.5 | 3.1.9
5.6.1
5.6.2
Color Key:
Not yet implemented / verified
In-Progress
Complete