Software Design Specifications
For
Worden
Prepared By:
Marc Barrowclift
Travis Dutko
Corey Everitt
Jiakang Jin
Eric Nordstrom
Ivan "Frankie" Orrego
[--------------]
Kyle Musal
Aaron Chapin
Andrew Orner
Matthew Tornetta

Advised By:
Rachel Greenstadt
Jeff Salvage
Drexel University
Table of Contents
3.1. Technologies Used
  3.1.1 JStylo
  3.1.2. Java 1.8
  3.1.3. WEKA
  3.1.4. Stanford Part-of-Speech Tagger
  3.1.5. Microsoft Azure
  3.1.6. Apache Spark
  3.1.7. Java Graphical Authorship Attribution Program (JGAAP)
  3.1.8 Maven
  3.1.9 3-Clause BSD License
3.2. Application Overview
3.3. Design Patterns
  3.3.1 Machine Learning Integration
  3.3.2 Experiment Construction via the FullAPI
4. Detailed Design
  4.1. Database Design
  4.2. User Interaction / Dataflow Design
  4.3. JStylo Design
    4.3.1 Analyzers Package
      4.3.1.1 WekaAnalyzer
      4.3.1.2 SparkAnalyzer
    4.3.2 Canonicizers Package
      4.3.2.1. Canonicizer
      4.3.2.2 Apply Special Keys
      4.3.2.3 BrownExtractPOSTags
      4.3.2.4 BrownExtractText
      4.3.2.5 RemoveFirstNLines
      4.3.2.6 RemoveSpecialKeys
      4.3.2.7 StripEdgesPunctuation
      4.3.2.8 StripSpaces
      4.3.2.9 WordEndsPunctSeparator
    4.3.3 Event Cullers Package
      4.3.3.1 EventCuller
      4.3.3.2 FrequencyEventsExtended
      4.3.3.3 LeastCommonEventsExtended
      4.3.3.4 MaxAppearances
      4.3.3.5 MinAppearances
      4.3.3.6 MostCommonEventsExtended
    4.3.4 Event Drivers Package
      4.3.4.1 EventDriver
      4.3.4.2 CharCounterEventDriver
      4.3.4.3 EventsCounterEventDriver
      4.3.4.4 FastTagPOSNGramsEventDriver
      4.3.4.5 FastTagPOSTagsEventDriver
      4.3.4.6 FleschReadingEaseScoreEventDriver
      4.3.4.7 GunningFogIndexEventDriver
      4.3.4.8 LetterCounterEventDriver
      4.3.4.9 LetterNGramEventDriver
      4.3.4.10 ListEventDriver
      4.3.4.11 ListRegexpEventDriver
      4.3.4.12 MaxentPOSNGramsEventDriver
      4.3.4.13 MaxentPOSNGramsEventDriverGeneric
      4.3.4.14 MaxentPOSTagsEventDriver
      4.3.4.15 MaxentPOSTagsEventDriverGeneric
      4.3.4.16 RegexpCounterEventDriver
      4.3.4.17 RegexpEventDriver
      4.3.4.18 SentenceCounterEventDriver
      4.3.4.19 SingleNumericEventDriver
      4.3.4.20 SyllableCounterEventDriver
      4.3.4.21 TreeTaggerEventDriver
      4.3.4.22 TreeTaggerNGramsEventDriver
      4.3.4.23 UniqueWordsCounterEventDriver
      4.3.4.24 WordCounterEventDriver
      4.3.4.25 WordLengthEventDriver
    4.3.5 Generics Package
      4.3.5.1 Analyzer
      4.3.5.2 CumulativeEventCuller
      4.3.5.3 CumulativeFeatureDriver
      4.3.5.4 FeatureDriver
      4.3.5.5 FeatureExtractionAPI
      4.3.5.6 FullAPI
      4.3.5.7 LocalParallelFeatureExtractionAPI
      4.3.5.8 ProblemSet
      4.3.5.9 StringDocument
5. Appendix
  5.1 Traceability Matrix
1. Document History
Version | Date | Editor | Description
1.0 | 1/28/2010 | MT, AC, KM, AO | Initial Version
2.0 | 1/29/2016 | TD | New iteration of the software
2. Introduction
2.1. Purpose
The purpose of this document is to describe, in detail, the design and architecture of Worden (henceforth referred to as "the software") in its entirety. These design decisions pertain to the functionality, performance, and external interfaces of the software. This document will be used to implement all features as described in the requirements document for the software.
2.2. Scope
This document pertains to the design and architecture of the release of Worden and includes all functional requirements as listed in the requirements document for the software. This document is intended for the developers and testers of the software.
2.3. Design Goals
The software is designed to improve upon the original Stylo software in the following ways:
● Ease of Use: The software interface is an improvement over previous iterations in terms of usability by users not versed in the field of stylometry.
● Modularity: The system will preserve the modularity of certain aspects of the software (Feature Sets) and provide additional modularity to further separate the data extraction, classification, and display components.
● Open Source: The software is open source and licensed under the 3-Clause BSD License (see section 3.1.9).
● Portability: The software can run on any platform which is capable of running Java 1.8.
● Security: Due to the possible sensitivity of works being analyzed, the software transmits information in an encrypted manner and holds onto it for a minimal amount of time, as defined in requirements 5.3.1, 5.3.2, and 5.3.3.
● Accuracy: The software's analysis accuracy is preserved from previous versions of the software.
2.4. Glossary
Classifier: The machine learning algorithm used to process data extracted from a problem set.
Corpus (pl. Corpora): A large, structured set of documents from various authors.
Domain: The context in which a piece of writing was written (essays, blogs, tweets).
Feature: An element of a person's writing style which can be extracted from text they have written.
Feature Set: A group of one or more similar features.
Function/Stop Words: A set of words which is typically ignored when analyzing features of a document (for example: a, an, the). These are typically selected due to their commonality.
Graphical Toolkit: A set of tools for creating a graphical interface for a piece of software.
Problem Set: A set of documents which defines both which authors a classifier should learn from to establish the features that define each author, and which documents should have their authorship analyzed.
Machine Learning: The process of a computer extracting categories, in the form of patterns, from data (in our case, numeric data). These patterns can be used to assign categories to new data brought into the system.
N-Gram: A sub-sequence of n items from a given sequence. For example, the sentence "I painted the fence red." has the 2-grams "I painted", "painted the", "the fence", and "fence red". (Unigrams = 1-grams, Bigrams = 2-grams, Trigrams = 3-grams)
Testing: The process in which a classifier labels unlabeled data with one of the categories it learned during the training process.
Training: The process in which a classifier creates categories of data (in our case, defining features of specific authors) from labeled data.
Stylometry: The study of author identification through analysis of linguistic style.
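The n-gram definition above can be sketched in code. This word-level helper is illustrative only and is not part of JStylo:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGrams {
    // Word-level n-grams: strip punctuation, split on whitespace, then
    // slide a window of size n across the word list.
    static List<String> wordNGrams(String text, int n) {
        String[] words = text.replaceAll("[^\\w\\s]", "").trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }
}
```

Calling wordNGrams("I painted the fence red.", 2) yields the four 2-grams listed in the glossary entry.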
2.5. References
JStylo: <https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth>
WEKA: <http://www.cs.waikato.ac.nz/ml/weka/>
Stanford POS Tagger: <http://nlp.stanford.edu/software/tagger.shtml>
Apache Spark: <http://spark.apache.org/>
Microsoft Azure: <https://azure.microsoft.com/en-us/>
JGAAP: <http://evllabs.com/jgaap/w/index.php/Main_Page>
Apache Maven: <https://maven.apache.org/>
3-Clause BSD: <https://opensource.org/licenses/BSD-3-Clause>
2.6. Context Diagram
Figure 1.
3. Architecture
3.1. Technologies Used
3.1.1 JStylo
JStylo is a stylometry tool which provides authorship attribution and classifier testing capabilities. It is an experimental tool used by researchers to create and determine the effectiveness of feature sets, identify the authors of documents, and assess the effectiveness of classifiers. This project uses JStylo as an engine, providing a streamlined UI for several common, non-technical use cases as well as educational commentary to help non-technical users understand stylometry.
3.1.2. Java 1.8
Java 1.8 was selected as the platform for the system's backend for two primary reasons. The first is portability: because the backend is written in Java, any system with a Java 1.8 JRE can potentially host Worden. The second is that JStylo, the second and current iteration of the Stylo project, was written in Java; as the purpose of the project is to improve usability and performance, a complete rewrite was not desired.
3.1.3. WEKA
The Waikato Environment for Knowledge Analysis (WEKA) is a machine learning tool. It was selected for JStylo because it is easy to use and provides a wide library of classifiers for machine learning. WEKA has been embedded since JStylo, but its interaction with JStylo has been improved to increase performance and make the system more modular.
3.1.4. Stanford Part-of-Speech Tagger
The Stanford Part-of-Speech Tagger is a natural language processing tool which parses text and labels each word in a sentence with its part of speech. It is utilized for the purpose of text analysis.
3.1.5. Microsoft Azure
Microsoft Azure is a cloud computing platform used to host the Worden application. It provides scalable power and pricing, which makes it ideal for this project. While Worden is initially a small project that does not need this extra power, having the capability built in allows the inheritors of the project to take advantage of it in the future.
3.1.6. Apache Spark
Apache Spark is a tool for parallel/distributed processing. It offers local parallelization as well as distributed computing capabilities, and it provides a machine learning library in the form of Spark MLlib. Spark is utilized to increase processing speed over WEKA. It also improves JStylo's scalability by providing an easy avenue toward distributed processing.
3.1.7. Java Graphical Authorship Attribution Program (JGAAP)
JGAAP provides a low-level framework for extracting feature information from a document. While Worden does not use JGAAP directly, JStylo builds upon it and extends many of its classes.
3.1.8 Maven
Apache Maven is a project and dependency management tool. It provides tools through which a project can be versioned, and it streamlines the dependency update and build-path construction process.
3.1.9 3-Clause BSD License
The 3-Clause BSD license is an open source license. It was selected for Worden by virtue of being the license under which JStylo is currently released.
3.2. Application Overview
The software analyzes features from selected texts. The extracted features are sent through a machine learning tool and compared against a previously trained corpus in an attempt to attribute the texts to an author contained in the selected corpus. The software presents this information clearly to the user and provides tools for understanding both how stylometry functions and how the authors were identified. It also provides advanced options for users who already know the field and want finer control over their experiments. At its highest level, the application operates as shown in Figure 1; Figure 2 provides a more detailed breakdown and module size estimations.
Worden is a three-part system. First is the backend, which is composed of JStylo, WEKA, Apache Spark, and various feature extraction tools; this provides all of the heavy lifting of the processing. Second is the server layer, which makes API calls to JStylo and provides endpoints for sending and retrieving information; it also contains a database for temporary data storage. Third is the UI, which sends information to and retrieves information from the server, providing interpretive services that aid the user in understanding the stylometry process.
Figure 2.
3.3. Design Patterns
Updating JStylo's inner workings involved utilizing and working with the following design patterns.
3.3.1 Machine Learning Integration
Figure 3.
The machine learning integration is done via the Adapter pattern. As shown in Figure 3 (and expanded upon in section 4.3.5.1), the superclass Analyzer provides a generic implementation of a few required methods, along with abstract method signatures for the inputs and return values that JStylo/Worden expects. With this implementation, any machine learning library should be usable so long as an Adapter is programmed to interface between JStylo and that library.
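A minimal sketch of this adapter arrangement, with simplified stand-in types (the real Analyzer works with WEKA Instances and JGAAP Documents; all names and signatures here are illustrative only):

```java
import java.util.Collections;
import java.util.Map;

// Simplified stand-in for the Analyzer superclass described above.
abstract class AnalyzerSketch {
    // JStylo/Worden expects every adapter to accept feature data in a known
    // form and return author-ownership probabilities in a known form.
    abstract Map<String, Double> classify(double[][] trainingData, double[][] unknownData);
}

// A hypothetical adapter for some third-party machine learning library.
class ThirdPartyAnalyzer extends AnalyzerSketch {
    @Override
    Map<String, Double> classify(double[][] trainingData, double[][] unknownData) {
        // 1. Convert JStylo's feature data into the library's native format.
        // 2. Train and classify using the library's own API.
        // 3. Convert the library's results back into the Map JStylo expects.
        return Collections.singletonMap("authorA", 1.0); // placeholder result
    }
}
```

The adapter isolates all library-specific conversion code in one class, so swapping machine learning backends does not touch the rest of the pipeline.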
3.3.2 Experiment Construction via the FullAPI
As shown in section 4.3.5.6, the FullAPI class provides a realization of the Builder pattern. This pattern is used to provide a number of conveniences, as follows:
● Allows for multiple experimental setups, meaning that both JStylo and Worden can use the same API by providing different information/making different calls.
● Exposes everything that could be needed for external use, while making most of it optional.
● Allows for either high level calls/construction, or low level, depending on application needs.
The FullAPI also provides a Facade over the FeatureExtractionAPI and machine learning capabilities of the JStylo system.
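Builder-style construction of an experiment could look roughly like the following sketch; the chained method names are hypothetical stand-ins, not the actual FullAPI calls:

```java
// Illustrative Builder: required settings are chained, optional settings
// (like the fold count) carry defaults and may be omitted.
class ExperimentBuilder {
    private String problemSet;
    private String featureSet;
    private int folds = 10; // optional setting with a default

    ExperimentBuilder problemSet(String path) { this.problemSet = path; return this; }
    ExperimentBuilder featureSet(String path) { this.featureSet = path; return this; }
    ExperimentBuilder crossValidationFolds(int n) { this.folds = n; return this; }

    // In the real FullAPI this step would assemble and run the experiment;
    // here it just reports the assembled configuration.
    String build() {
        return problemSet + " + " + featureSet + " (" + folds + "-fold)";
    }
}
```

A caller chains only the options it needs, e.g. new ExperimentBuilder().problemSet("ps.xml").featureSet("fs.xml").build(), which is how the same API can serve both JStylo's and Worden's setups.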
4. Detailed Design
4.1. Database Design
As the database is temporary storage, performance and search are not design concerns. The database merely holds the serialized data from a user's experiments until the user accesses it or a day has passed, at which point the data is removed.
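The retention rule above can be expressed as a simple predicate; the class and method names here are hypothetical, since this document does not specify the actual cleanup mechanism:

```java
import java.time.Duration;
import java.time.Instant;

// Hedged sketch of the retention rule: a record becomes removable once the
// user has retrieved it or it is more than a day old.
class RetentionSketch {
    static boolean shouldRemove(boolean accessed, Instant storedAt, Instant now) {
        return accessed || Duration.between(storedAt, now).toDays() >= 1;
    }
}
```

A periodic cleanup job would sweep the database and delete every record for which this predicate holds.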
4.2. User Interaction / Dataflow Design
One of the primary purposes of Worden is to simplify the user interaction and dataflow process for non-technical users. To that end, the Worden workflow contains very little variance in execution due to user interaction. To illustrate this, we have included Flow Diagram 1 and Flow Diagram 2, which outline the dataflow in two common use cases.
Flow Diagram 1.
Flow Diagram 2.
4.3. JStylo Design
This section documents JStylo's internal classes and packages. Note that the Canonicizers (4.3.2), EventCullers (4.3.3), and EventDrivers (4.3.4) packages were not modified for Worden aside from bugfixes. Also note that the JStylo GUI package is not described in this document, as Worden neither modifies nor utilizes it.
4.3.1 Analyzers Package
The Analyzers package contains all of the implementations of the Analyzer interface (4.3.5.1). These implementations are the bridge between a machine learning library and JStylo, allowing the library's machine learning algorithms to be used with JStylo's feature extraction capabilities.
4.3.1.1 WekaAnalyzer
WEKA implementation of 4.3.5.1. See that entry for method details.
4.3.1.2 SparkAnalyzer
Apache Spark implementation of 4.3.5.1. See that entry for method details. Not yet implemented.
4.3.2 Canonicizers Package
The Canonicizers package is used by JStylo's feature extraction engine. Each canonicizer pre-processes a document's text in preparation for a feature to be extracted (for example, a document being tested for character n-grams may have capitalization normalized so that 'A' and 'a' are both considered the same character).
All classes in this package extend the JGAAP class Canonicizer, the methods of which are documented in 4.3.2.1. For the sake of concision, these methods will not be described in each canonicizer implementation (although they will be shown in the diagrams).
4.3.2.1. Canonicizer
Method | Input | Output | Description
displayName(): String | - | the String name of the canonicizer | the human-readable name of the canonicizer
tooltipText(): String | - | a descriptor of the canonicizer's purpose | -
showInGUI(): boolean | - | whether or not the class should display in JGAAP's GUI | unused for JStylo/Worden
process(char[]): char[] | text to be processed | text that has been processed | performs the actual text modification
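As an illustration of the process(char[]) contract above, here is a minimal hypothetical canonicizer that normalizes capitalization (the example from the package description); it is a sketch, not one of the actual classes:

```java
// Illustrative canonicizer: lowercases text so that 'A' and 'a' are
// considered the same character during feature extraction. The real
// canonicizers also implement displayName(), tooltipText(), etc.
class LowercaseCanonicizer {
    char[] process(char[] text) {
        char[] out = new char[text.length];
        for (int i = 0; i < text.length; i++) {
            out[i] = Character.toLowerCase(text[i]);
        }
        return out;
    }
}
```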
4.3.2.2 Apply Special Keys
This canonicizer replaces special characters with a particular alternative character. For example, it normalizes the document so that all tabs entering the system are the same type of tab, even though different document types encode that formatting character differently. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.3 BrownExtractPOSTags
No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.4 BrownExtractText
No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.5 RemoveFirstNLines
Removes the first N lines from a document. Useful for scenarios with headers or other non-author writing of a fixed length. Rarely used, however, because it is seldom applicable to an entire dataset. No methods aside from the ones implemented from interface 4.3.2.1.
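A minimal sketch of this idea (illustrative only; the real class implements the JGAAP Canonicizer interface and operates on char[]):

```java
// Drop a fixed-length header (e.g. non-author boilerplate) before
// feature extraction by removing the first n lines.
class RemoveFirstNLinesSketch {
    static String process(String text, int n) {
        String[] lines = text.split("\n", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = n; i < lines.length; i++) {
            if (i > n) sb.append('\n');
            sb.append(lines[i]);
        }
        return sb.toString();
    }
}
```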
4.3.2.6 RemoveSpecialKeys
Removes special keys/characters from a document entirely. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.7 StripEdgesPunctuation
Removes punctuation that is at the beginning or end of a word (such as quotes, commas, periods, etc.). No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.8 StripSpaces
Removes whitespace. Useful for letter-adjacency features. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.2.9 WordEndsPunctSeparator
Separates the ends of words from any punctuation which follows them, preventing the scenario where "end" and "end." are counted as distinct words. No methods aside from the ones implemented from interface 4.3.2.1.
4.3.3 Event Cullers Package
The Event Cullers package is used by JStylo's feature extraction engine to determine which features are relevant for analysis. For example, more than 1,000 character bi-gram features may not add much to the accuracy but can greatly impact runtime, so an EventCuller determines which 1,000 to keep and which extras to discard. Most EventCullers in this package extend JGAAP's EventCuller, the methods of which are documented in 4.3.3.1.
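The culling idea can be sketched with plain counts; this mirrors MaxAppearances (4.3.3.4) in spirit, though the real cull() operates on JGAAP EventSet lists rather than a Map:

```java
import java.util.Map;
import java.util.stream.Collectors;

// Keep only events whose occurrence count stays at or below a maximum
// threshold, discarding the rest before classification.
class MaxAppearancesSketch {
    static Map<String, Integer> cull(Map<String, Integer> eventCounts, int max) {
        return eventCounts.entrySet().stream()
                .filter(e -> e.getValue() <= max)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

The other cullers in this package vary only in the selection rule (minimum threshold, most common, least common); the filter-and-keep structure is the same.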
4.3.3.1 EventCuller
Method | Input | Output | Description
displayName(): String | - | the String name of the EventCuller | the human-readable name of the EventCuller
tooltipText(): String | - | a descriptor of the EventCuller's purpose | -
showInGUI(): boolean | - | whether or not the class should display in JGAAP's GUI | unused for JStylo/Worden
cull(List<EventSet>): List<EventSet> | the list of extracted features to cull | a list containing only the desired features | performs the actual culling
4.3.3.2 FrequencyEventsExtended
Method | Input | Output | Description
getFrequency(EventSet[]): Map | list of EventSets to process | a mapping of each EventSet to how often it occurs | used for preprocessing for certain cullers
compare(String, String): int | two strings to compare | <0 if string 1 comes before string 2, >0 for the opposite, 0 if they are equivalent | used to compare two EventSets so they can be ordered properly
Abstract class which provides common functionality for a couple of EventCullers. Extends 4.3.3.1.
4.3.3.3 LeastCommonEventsExtended
Finds the rarest events from a set and keeps them over the others. Extends 4.3.3.2 and implements 4.3.3.1.
4.3.3.4 MaxAppearances
An EventCuller which disposes of events which have occurred over a specified maximum threshold. Extends 4.3.3.1.
4.3.3.5 MinAppearances
An EventCuller which disposes of events which have occurred under a specified minimum threshold. Extends 4.3.3.1.
4.3.3.6 MostCommonEventsExtended
Finds the most common events and prioritizes keeping them over the rarer events. Extends 4.3.3.2 and implements 4.3.3.1.
4.3.4 Event Drivers Package
The Event Drivers package is used by JStylo's feature extraction engine to pull feature data from text. Each EventDriver provides a method for extracting numeric data from a specific feature of a document's text.
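As an illustration of single-value extraction (cf. getValue(Document) in 4.3.4.1 and WordCounterEventDriver in 4.3.4.24), here is a simplified sketch operating on a plain String instead of a JGAAP Document:

```java
// Illustrative event driver: extract one numeric feature (the word count)
// from raw text. Real drivers implement displayName(), tooltipText(), etc.
// and receive a Document object rather than a String.
class WordCountSketch {
    static double getValue(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

Each of the drivers below follows this shape, differing only in what it counts or measures.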
4.3.4.1 EventDriver
Method | Input | Output | Description
displayName(): String | - | the String name of the EventDriver | the human-readable name of the EventDriver
tooltipText(): String | - | a descriptor of the EventDriver's purpose | -
showInGUI(): boolean | - | whether or not the class should display in JGAAP's GUI | unused for JStylo/Worden
getValue(Document doc): double | the document to extract features from | the non-normalized, non-weighted value of the feature | extracts the feature from the document
4.3.4.2 CharCounterEventDriver
An EventDriver which counts the number of characters in a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.3 EventsCounterEventDriver
An EventDriver which counts something within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1. Methods getUnderlyingEvents() and setUnderlyingEvents() are unused and currently flagged for removal.
4.3.4.4 FastTagPOSNGramsEventDriver
An EventDriver which uses the FastTagger to tag parts of speech within a document and report how many times each POS n-gram occurs. Here, a POS n-gram means something like "adjective-noun" or "noun-verb". Extends and implements 4.3.4.1.
4.3.4.5 FastTagPOSTagsEventDriver
An EventDriver which uses the FastTagger to tag and report how many instances of each part of speech are within the document (how many nouns, how many verbs, etc.). Extends and implements 4.3.4.1.
4.3.4.6 FleschReadingEaseScoreEventDriver
An EventDriver which calculates the Flesch Reading Ease score of the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.7 GunningFogIndexEventDriver
An Event Driver which calculates the Gunning-Fog Index Score of the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.8 LetterCounterEventDriver
An EventDriver which counts the number of letters in the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.9 LetterNGramEventDriver
An EventDriver which reports on the number of times a given letter-n-gram (“ae”, “th”, etc) was used within the document. Extends and implements 4.3.4.1.
4.3.4.10 ListEventDriver
A generic EventDriver which counts matches of each phrase or word on a list. Used, for example, to find how many words in the document are incorrectly spelled and are on the list of most commonly misspelled words. Extends and implements 4.3.4.1.
4.3.4.11 ListRegexpEventDriver
A generic EventDriver which reports on the number of matches for each regexp on a list of regexps. Extends and implements 4.3.4.1.
4.3.4.12 MaxentPOSNGramsEventDriver
An EventDriver which uses the Maxent tagger to tag and report counts on POS N-grams. Extends and implements 4.3.4.1.
4.3.4.13 MaxentPOSNGramsEventDriverGeneric
An EventDriver which uses the Maxent tagger to tag and report counts on POS N-grams. Extends and implements 4.3.4.1.
4.3.4.14 MaxentPOSTagsEventDriver
An EventDriver which uses the Maxent tagger to tag and report counts on POS tag counts. Extends and implements 4.3.4.1.
4.3.4.15 MaxentPOSTagsEventDriverGeneric
An EventDriver which uses the Maxent tagger to tag and report counts on POS tag counts. Extends and implements 4.3.4.1.
4.3.4.16 RegexpCounterEventDriver
An EventDriver which counts how many times a specific regular expression occurs within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.17 RegexpEventDriver
An EventDriver which counts how many regular expression matches occur within a document. Extends and implements 4.3.4.1.
4.3.4.18 SentenceCounterEventDriver
An EventDriver which counts how many sentences are within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.19 SingleNumericEventDriver
Abstract class which extends 4.3.4.1, for features which resolve into a single double value rather than a set of events. Provides a method common to all such features: createEventSet(Document): EventSet.
Method | Input | Output | Description
getValue(Document): double | - | - | inherited from 4.3.4.1
createEventSet(Document): EventSet | Document file to be processed | an EventSet containing the event data | stores feature information in set form so that it can be used interchangeably with non-single-number events that have also been extracted
4.3.4.20 SyllableCounterEventDriver
An EventDriver which counts the number of syllables in a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.21 TreeTaggerEventDriver
An EventDriver which uses TreeTagger, a more language-independent POS tagger, to count POS tags. Extends and implements 4.3.4.1.
4.3.4.22 TreeTaggerNGramsEventDriver
An EventDriver which uses TreeTagger, a more language-independent POS tagger, to count POS n-grams. Extends and implements 4.3.4.1.
4.3.4.23 UniqueWordsCounterEventDriver
An EventDriver which counts how many unique words were used within the document. Unique here means used only a single time throughout the document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.24 WordCounterEventDriver
An EventDriver which counts the total number of words within a document. Extends and implements 4.3.4.19 and, by extension, 4.3.4.1.
4.3.4.25 WordLengthEventDriver
An EventDriver which counts how many words of each length occur within the document. Extends and implements 4.3.4.1.
4.3.5 Generics Package
The Generics package contains JStylo's high-level abstract classes and interfaces, as well as the various APIs, each of which has its own degree of granularity. A few miscellaneous utility classes are also located here.
4.3.5.1 Analyzer
Method | Input | Output | Description
isType | - | enumeration of what type of analyzer it is | needed due to incomplete generalization
classify(Instances, Instances, Document[]): Map | training instances, testing instances, unknown documents | map of documents to probabilities for each author's ownership | runs an experiment with unknown authors
runCrossValidation(Instances, int, long): Evaluation (WEKA object) | feature data, number of calculation folds, random seed | experiment results | runs an experiment to validate the classifier/feature set
runCrossValidation(Instances, int, long, int): Evaluation (WEKA object) | as above, but with a 4th parameter as a relaxation factor | experiment results | runs an experiment to validate the classifier/feature set
getLastStringResults(): String | - | - | contains experiment result data
getTrainTestEval(Instances, Instances): Evaluation (WEKA object) | training data, testing data | the Experiment object of the last run unknown-author analysis | contains experiment result data
getLastTrainingSet(): Instances (WEKA object) | - | feature data of the training documents | experiment metadata
getLastTestSet(): Instances (WEKA object) | - | feature data of the testing documents | experiment metadata
getAllInstances(): Instances (WEKA object) | - | all feature data | experiment metadata
getLastAuthors(): String[] | - | list of the String names of the authors in the last experiment | experiment metadata
getLastResults(): Map | - | a map of the likelihood of each author writing a given document | contains experiment result data
getOptions() | - | the exact options string used | experiment metadata
setOptions(String[]) | options to set | - | experiment metadata
optionsDescription(): String | - | description of the options used | experiment metadata
analyzerDescription(): String | - | description of the analyzer | experiment metadata
getClassifier(): Classifier (WEKA object) | - | the underlying classifier object for analysis | experiment metadata; here due to poor initial generalization
getName(): String | - | String name of the analyzer | -
The Analyzer interface ensures that any integrated machine learning library exposes its internals via a series of required get/set methods. It also ensures that each form of analysis is implemented via the runCrossValidation() and classify() methods.
4.3.5.2 CumulativeEventCuller
Method | Input | Output | Description
cull(List<EventSet>[], CumulativeFeatureDriver): List<EventSet> | all events from the documents, the feature driver used to extract the events | a shortened list of events | iterates over the list of lists of event sets, culling according to the CFD's configuration
cullWithRespectToKnown(CumulativeFeatureDriver, List<EventSet>[], List<EventSet>[]): List<EventSet> | the cumulative feature driver used to extract the events, the known events, the unknown events | a shortened list of events | culls the events with respect to the CFD, but only keeps events if they are already known
A class which collects and applies EventCullers to the document data.
4.3.5.3 CumulativeFeatureDriver
Method | Input | Output | Description
CumulativeFeatureDriver() | - | - | empty constructor
CumulativeFeatureDriver(CumulativeFeatureDriver) | CFD to clone | - | clone constructor
CumulativeFeatureDriver(FeatureDriver[]) | collection of FeatureDrivers to build the CFD from | - | constructor to build programmatically
CumulativeFeatureDriver(String) | path to an XML file which defines a CFD | - | constructor to load from file
createEventSets(Document, boolean) | document to have its data extracted; boolean is true if running over a distributed platform, false if local | List<EventSet> of the features extracted from the document | extracts a single document's events
clean() | - | - | cleans up any unneeded data. Note: shouldn't be needed in the final version
featureDriverAt(int): FeatureDriver | index of the desired driver | FeatureDriver at that index | fetches a FeatureDriver object
displayName(): String | - | the CFD's name | name of the FeatureSet
numOfFeatureDrivers(): int | - | number of feature drivers in the CFD | -
setName(String) | new name | - | changes the CFD's name
setDescription(String) | new description | - | changes the CFD's description
addFeatureDriver(FeatureDriver): int | new feature driver | index of the feature driver on success, -1 on failure | adds a new FeatureDriver to the CFD
replaceFeatureDriverAt(int, FeatureDriver): FeatureDriver | index to replace, new feature driver | old FeatureDriver | replaces one feature driver with another
removeFeatureDriverAt(int): FeatureDriver | index to remove | old FeatureDriver | pops out the FeatureDriver at that index
getName() | - | - | redundant; needs to be removed
getDescription(): String | - | String description of the CFD | -
writeToXML(String): boolean | path of the file to save the CFD to | true on success, false on failure | -
toXMLString(): String | - | XML String of the CFD | -
Primary configuration object for the JStylo feature extraction process. Contains all of the information on which features to extract and how they should be pre-processed and post-processed. Used throughout the feature extraction process.
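A simplified model of the CFD's role as described above: a named container of per-feature drivers, built programmatically as with the FeatureDriver[] constructor. This sketch borrows a few method names from the table but is not the actual class (the real CFD also carries canonicizers, cullers, and normalization settings):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in: a feature set is a named list of feature drivers,
// represented here by plain driver names for brevity.
class CfdSketch {
    private String name;
    private final List<String> featureDrivers = new ArrayList<>();

    void setName(String n) { this.name = n; }
    String displayName() { return name; }
    // Mirrors the table's contract: returns the driver's index on success.
    int addFeatureDriver(String driverName) {
        featureDrivers.add(driverName);
        return featureDrivers.size() - 1;
    }
    int numOfFeatureDrivers() { return featureDrivers.size(); }
}
```

In the real system the assembled CFD would then be handed to the extraction engine, or serialized out via writeToXML(String).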
4.3.5.4 FeatureDriver
Method | Input | Output | Description
FeatureDriver() | - | - | empty constructor
FeatureDriver(String, boolean, EventDriver) | String name, true if it is a histogram, the EventDriver for the feature | - | basic constructor
FeatureDriver(String, boolean, EventDriver, Canonicizer[], EventCuller[]) | as above, but with a list of Canonicizers and a list of EventCullers | - | constructor with pre- and post-processing for the feature
FeatureDriver(String, boolean, EventDriver, Canonicizer[], EventCuller[], Pair) | as above, but with a pair containing the normalization type and value | - | constructor with pre-processing, post-processing, and normalization
setDisplayName(String) | new feature name | - | changes its name
setDescription(String) | new description | - | changes its description
setUnderlyingEventDriver(EventDriver) | new EventDriver | - | changes the feature which is being extracted
setCalcHist(boolean) | true if histogram, false if not | - | sets whether this is a histogram
addCanonicizer(Canonicizer) | new canonicizer | - | adds a canonicizer
setCanonicizers(List<Canonicizer>) | list of canonicizers to use | - | replaces the canonicizers with the new list
addEventCuller(EventCuller) | new culler | - | adds an EventCuller
setEventCullers(List<EventCuller>) | list of cullers to use | - | replaces the EventCullers with the new list
setNormalizationBaseline(Norm) | enum value for the type of normalization | - | normalization configuration
setNormalizationFactor(Factor) | how much to normalize by | - | normalization configuration
setNormalization(Pair) | a pair of enum to factor | - | normalization configuration
displayName(): String | - | FeatureDriver name | String name of the FD
getName(): String | - | FeatureDriver name | duplicate; to be removed
getDescription(): String | - | FeatureDriver description | description of the FD
getUnderlyingEventDriver(): EventDriver | - | JGAAP EventDriver | for debugging
isCalcHist(): boolean | - | true if histogram, false if not | getter
canonicizerAt(int): Canonicizer | index to retrieve | the canonicizer at that index | getter
getCanonicizers(): List<Canonicizer> | - | list of canonicizers for the FeatureDriver | getter
cullerAt(int): EventCuller | index to retrieve | the EventCuller at that index | getter
getCullers(): List<EventCuller> | - | list of EventCullers for the FD | getter
getNormalizationBaseline(): Norm | - | type of normalization to perform | getter
getNormalizationFactor(): Double | - | how much to normalize by | getter
getNormalization(): Pair | - | pair of normalization type to factor | getter
getClassParams(String): String | name of the class | valid parameters for that class | used to learn more about the feature driver
Class which contains all of the configuration and algorithmic information for extracting a single feature from a document. A collection of these is used to create the CFD.
4.3.5.5 FeatureExtractionAPI
Method | Input | Output | Description
extractEventSets(Document, CumulativeFeatureDriver, boolean) : List<EventSet> | the document to have its data extracted, the CFD defining what to extract, true if using cloud processing / false if local | List<EventSet> of all extracted features | extracts the features from a single document
cull(List<EventSet>, CFD) : List<EventSet> | events from a document, CFD | List<EventSet> of events that survive the document's culling | reduces the number of features for a document
getRelevantEvents(List<EventSet>, CFD) | events that are most valuable, CFD | List<EventSet> of events that survived training's culling | pulls out the events that are most valuable
getAttributeList(List<EventSet>, List<EventSet>, CFD, List<EventSet>, boolean, boolean) | all events, relevant events, CFD, document events, whether it is a sparse representation, whether the documents are labeled | List<Attribute> for creation of a WEKA Instance | WEKA conversion method; generates the objects needed to build a WEKA instance
createInstance(List<Attribute>, List<EventSet>, CFD, List<EventSet>, boolean, boolean) | the attribute list from the previous method, the relevant events, CFD, document events, whether it is a sparse representation, whether the documents are labeled | Instance (WEKA object) that represents the document | WEKA conversion method; builds an instance
normalizeInstance(CFD, Instance, List<EventSet>, boolean, List<Attribute>) | CFD, WEKA data to be normalized, document data, whether the data is labeled, attribute data | | WEKA conversion method; normalizes data
calcInfoGain(Instances) : double[][] | WEKA data for all documents | mapping of each feature to its value | determines which features are most valuable
applyInfoGain(double[][], Instances, int) : double[][] | feature data from calcInfoGain, WEKA data, number of features to keep | mapping of each feature to its value | removes non-valuable features
cullWithRespectToTraining(List<EventSet>, List<EventSet>, CFD) : List<EventSet> | relevant events (from the method above), document events, CFD | culled document events | culls a testing document and keeps only data that was present in training
This class provides access to methods for extracting a single document's features, so that feature extraction can be parallelized at higher levels of computation.
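Since the table rows are terse, here is a small, self-contained sketch of the kind of computation calcInfoGain performs: ranking features by information gain against class labels. The real method operates on WEKA Instances; the class name, the binary-feature encoding, and the toy data below are illustrative assumptions, not the actual implementation.

```java
import java.util.Arrays;

// Sketch: rank binary features by information gain against binary class labels.
public class InfoGainSketch {

    // Shannon entropy (bits) of a class-count distribution.
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // features[doc][f] is 0/1; labels[doc] is the class index (0 or 1).
    // IG(f) = H(class) - sum over feature values v of p(v) * H(class | v).
    static double infoGain(int[][] features, int[] labels, int f) {
        int n = labels.length;
        int[] classCounts = new int[2];
        for (int y : labels) classCounts[y]++;
        double hClass = entropy(classCounts);

        int[][] cond = new int[2][2]; // cond[featureValue][classLabel]
        for (int d = 0; d < n; d++) cond[features[d][f]][labels[d]]++;

        double hCond = 0.0;
        for (int v = 0; v < 2; v++) {
            int nv = cond[v][0] + cond[v][1];
            if (nv > 0) hCond += ((double) nv / n) * entropy(cond[v]);
        }
        return hClass - hCond;
    }

    public static void main(String[] args) {
        int[][] x = { {1, 0}, {1, 1}, {0, 0}, {0, 1} };
        int[] y =   {  1,      1,      0,      0 };
        // Feature 0 matches the label exactly; feature 1 is uninformative.
        System.out.println(infoGain(x, y, 0)); // 1.0
        System.out.println(infoGain(x, y, 1)); // 0.0
    }
}
```

applyInfoGain would then simply drop the features whose scores fall below the top-N cutoff.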
4.3.5.6 FullAPI
Method | Input | Output | Description
prepareInstances() | | | extracts data from the documents
prepareAnalyzer() | | | sets up the classifier learning process
calcInfoGain() | | | performs the infoGain calculation
applyInfoGain(int) | number of features to keep | | culls features with respect to infoGain
run() | | | performs the evaluation
writeArff(String, Instances) | path to output file, document data | | writes the document data to a file
The remaining methods are setters and getters. Not yet pictured is the Builder for the API, which is still under construction and will be added to the document once it is finalized.
This is the high-level API for JStylo. It allows for the creation and configuration of a JStylo experiment, for running it, and for access to deeper JStylo data structures and methods should the experimenter need them.
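The call order implied by the table can be sketched with stand-in stubs. FullAPIStub and its logging bodies are hypothetical; only the method names and their order come from the table above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stub that records the FullAPI experiment lifecycle;
// the real class drives JStylo rather than logging strings.
class FullAPIStub {
    final List<String> log = new ArrayList<>();
    void prepareInstances()      { log.add("prepareInstances"); }      // extract data from documents
    void prepareAnalyzer()       { log.add("prepareAnalyzer"); }       // set up classifier learning
    void calcInfoGain()          { log.add("calcInfoGain"); }
    void applyInfoGain(int keep) { log.add("applyInfoGain(" + keep + ")"); }
    void run()                   { log.add("run"); }                   // perform the evaluation
}

public class ExperimentSketch {
    public static void main(String[] args) {
        FullAPIStub api = new FullAPIStub();
        api.prepareInstances();
        api.prepareAnalyzer();
        api.calcInfoGain();
        api.applyInfoGain(500); // keep the 500 most informative features (arbitrary choice)
        api.run();
        System.out.println(api.log);
    }
}
```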
4.3.5.7 LocalParallelFeatureExtractionAPI
Getters and setters are not described in the table below.
Method | Input | Output | Description
LocalParallelFeatureExtractionAPI() | | | empty constructor
LocalParallelFeatureExtractionAPI(Preferences) | local user preferences | | not used in Worden; used to initialize from local JStylo
LocalParallelFeatureExtractionAPI(Builder) | a Builder | | constructor that builds the API from a Builder
LocalParallelFeatureExtractionAPI(LocalParallelFeatureExtractionAPI) | API instance to clone | | clone constructor
cleanAttributes() | | | disposes of unneeded objects
clean() | | | disposes of unneeded objects
extractEventSetsThreaded() | | | extracts all of the features from the documents via Java parallelization
initializeRelevantEvents() | | | initializes events from training in order to map them to testing
initializeAttributes() | | | creates Attribute objects for WEKA conversion
createTrainingInstancesThreaded() | | | creates the WEKA object for the training data
createTestingInstancesThreaded() | | | creates the WEKA object for the testing data
applyInfoGain(int) | number of features to keep | | uses infoGain to cull features
calcInfoGain() : double[][] | | mapping of each feature to how valuable it is | determines which features are most valuable
writeToArff(String, Instances) | path to file, data | | writes the data to an ARFF file
writeToCSV(String, Instances) | path to file, data | | writes the data to a CSV file
reset() | | | clears all data from the API
kill() | | | ends all ongoing threads
This class provides a parallelized feature extraction process for local processing, and it also parallelizes the creation of the WEKA data from those features. However, it is not generic and will be broken up during Worden development.
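The per-document independence behind extractEventSetsThreaded() can be sketched with Java 8 parallel streams. This is an illustrative stand-in, not the real method: the "feature" here is just a word-frequency map, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of one-task-per-document parallel feature extraction.
public class ParallelExtractionSketch {

    // Stand-in extractor: a case-folded word-frequency map.
    // The real class extracts full JStylo event sets per the CFD.
    static Map<String, Long> extractFeatures(String doc) {
        return Arrays.stream(doc.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the quick brown fox",
                "the lazy dog and the fox");

        // parallelStream() farms each document out to the common fork-join pool;
        // because documents are independent, no shared state is needed.
        List<Map<String, Long>> featureSets = docs.parallelStream()
                .map(ParallelExtractionSketch::extractFeatures)
                .collect(Collectors.toList());

        System.out.println(featureSets.get(1).get("the")); // 2
    }
}
```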
4.3.5.8 ProblemSet
This class contains all information about which documents to use and which documents were written by which authors. It has many extraneous/deprecated methods and requires revision.
4.3.5.9 StringDocument
Method | Input | Output | Description
StringDocument() | | | builds an empty StringDocument
StringDocument(Document) | the document to copy | | builds a StringDocument by loading the data from an existing Document
StringDocument(String, String, String) | the document body, the document author, the document title | | builds a StringDocument from the strings passed in
load() | | | loads the document into memory; only for use with the Document constructor
getFilePath() : String | | the path to the file | getter
An adapter that allows a document to be read without a backing file, making it easier to modify, create, and cut up documents in memory. A typical JGAAP Document requires the file on disk; this class circumvents that requirement.
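The adapter idea can be sketched as two implementations of one tiny text-source interface, one file-backed and one in-memory. All names here (TextSource, FileDocumentSource, StringDocumentSource's shape) are illustrative, not the real JGAAP or JStylo signatures.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// The common contract both sources satisfy.
interface TextSource {
    String getText() throws IOException;
}

// File-backed source: the usual JGAAP-style document that must exist on disk.
class FileDocumentSource implements TextSource {
    private final Path path;
    FileDocumentSource(Path path) { this.path = path; }
    public String getText() throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}

// In-memory source: lets tests and preprocessing build or cut up documents
// without touching the filesystem.
class StringDocumentSource implements TextSource {
    private final String body;
    StringDocumentSource(String body) { this.body = body; }
    public String getText() { return body; }
}

public class AdapterSketch {
    public static void main(String[] args) throws IOException {
        TextSource inMemory = new StringDocumentSource("Call me Ishmael.");
        System.out.println(inMemory.getText().split("\\s+").length); // 3
    }
}
```

Code that consumes a TextSource never needs to know whether the text came from disk or memory, which is exactly what StringDocument buys the rest of the pipeline.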
5. Appendix
5.1 Traceability Matrix
Note: this matrix still needs to be gone through and completed.
Open question for Prof. Salvage: how do you recommend documenting that UI requirements were satisfied when they are handled dynamically on the frontend/GUI side rather than through an API call?
Requirement Number | Design Component
4.2.1.1 | 4.3.7, 4.3.9
4.2.1.2
4.2.1.3
4.2.1.4
4.3.1
4.3.2
4.3.3
4.4.1.1 | 4.3.5.3
4.4.1.2 | 4.3.5.3 (needs UI support)
4.5.1.1
4.5.1.1.1
4.5.1.1.2
4.5.2.1
4.5.2.2
4.5.2.2.1
4.5.2.2.2
4.5.2.2.3
4.5.2.2.4
4.5.2.2.5
4.5.2.3
4.5.2.3.1
4.5.2.3.2
4.5.2.3.3
4.5.2.3.4
4.5.2.4
4.5.2.4.1
4.5.2.4.2
4.5.2.5
4.5.2.5.1
4.5.2.5.2
4.5.3.1
4.5.3.2
4.5.3.2.1
4.5.3.2.1.1
4.5.3.2.1.2
4.5.3.2.2
4.5.3.2.2.1
4.5.3.2.2.1.1
4.5.3.2.2.2
4.5.3.2.2.2.1
4.5.3.2.2.2.2
4.5.3.3
4.5.3.3.1
4.5.3.3.2
4.5.3.4
4.5.3.4.1
4.5.3.4.2
4.5.3.5
4.5.3.5.1
4.5.3.5.2
4.5.3.6
4.5.3.6.1
4.5.3.6.2
4.5.3.6.3
4.5.3.7
4.5.3.7.1
4.5.3.7.2
4.5.3.8
4.5.4.1
4.5.4.2
4.5.4.3
4.5.5.1
4.5.5.1.1
4.5.5.1.1.1
4.5.5.1.2
4.5.5.1.2.1
4.5.5.2
4.5.5.2.1
4.5.5.2.1.1
4.5.5.2.1.2
4.5.5.2.2
4.5.5.2.3
4.5.5.3
4.5.5.3.1
4.5.5.3.2
4.5.5.3.2.1
4.5.5.3.2.2
4.5.5.4
4.5.5.4.1
4.5.6.1
4.5.6.2
4.5.6.3
4.5.6.4
4.5.6.4.1
4.5.6.4.2
4.5.6.4.3
4.5.6.5
4.5.7.1
4.5.7.2
4.5.7.2.1
4.5.7.2.2
4.5.7.3
4.5.7.3.1
4.5.7.3.2
4.5.7.3.2.1
4.5.7.3.2.2
4.5.7.4
4.5.7.4.1
4.6.1
4.6.2 | 4.3.1.1
4.6.3
4.7.1
4.7.2
4.7.3 | 3.1.8
5.1
5.2
5.3.1
5.3.2
5.3.3
5.4.1
5.4.2
5.5 | 3.1.9
5.6.1
5.6.2
Color Key:
Not yet implemented / verified
In-Progress
Complete