Page 1

Dear Mr. Silverman: The Maryland Procurement Office requests that your company submit a Firm, Fixed-Price proposal for the effort described below in accordance with the Statement of Work entitled "Exploratory Analysis for Predictive Analytics and Anomaly Detection."

The Offeror shall submit a brief summary of its response to each requirement of the SOW... The Offeror shall submit a plan on the intended methods for organizing the performance of the contract... The Offeror shall provide a total price proposal for meeting the requirements... The Offeror shall submit the entire proposal no later than 3:00 PM on 6 September 2013....

Exploratory Analysis for Predictive Analytics and Anomaly Detection, Statement of Work, July 12, 2013

1.0 Intro: As the amount of data available for collection grows, many of the tools and analytics used to analyze the data and detect anomalies become less effective or simply stop working at large data volumes. This imposes performance and quality limitations on large-scale data analysis that can adversely impact the mission. Alternative approaches and new ways of doing analytics are needed that operate quickly and effectively over large volumes of data.

The primary objective of this work is to develop new approaches and tools to support large-scale data analysis to predict likely outcomes and detect relevant anomalies in large amounts of data. These predictive analytics will need to operate on multiple types of disparate data and perform rapidly to provide actionable information; they need to run in a Cloud environment and be able to operate on data stored in Cloud data stores. Additionally, these new analytics will need to work effectively in multiple workflows, each designed to answer a different question.

2.0 Scope: Under the FY12 Applied Research Prototypes Broad Agency Announcement, Treeminer developed and demonstrated a functional research prototype illustrating novel predictive analytics and anomaly detection through the use of specialized exploratory analysis techniques in response to NSA's requirements. Given the success of the initial prototype, SSG is beginning to develop a new capability to support advanced predictive analytics and anomaly detection. These new analytics will focus on performing at very large data volumes and on multiple types of disparate data.

The purpose of the Exploratory Analysis for Predictive Analytics and Anomaly Detection project and contract is to develop these new approaches for large scale data analytics and anomaly detection and demonstrate their effectiveness by showing they are able to more effectively identify and promote understanding of the events and anomalies embedded in the very large amounts of data faced by NSA.

Page 2

4.0 Requirements 4.1 Project Tasks: The Contractor shall perform the following tasks for the contract:

4.1.1 Planning: The Contractor shall develop a comprehensive implementation and testing plan that addresses the provided goals for the implementation and evaluation of the exploratory analysis techniques detailed in 4.1.2 through 4.1.8. The Contractor shall work in conjunction with the Government Contracting Officer's Representative (COR) and the Technical Lead (Tech Lead) to identify the milestones and criteria for implementation and testing. At a minimum, this plan will include: the identification and installation of the necessary hardware and software; integration with other systems; resolving data flow; risk identification and mitigation; and evaluation criteria. The Contractor shall also include a comprehensive schedule based on the stated goals and developed plans.

4.1.2 Establish a development and testing environment: The Contractor shall establish a development and testing environment able to support unclassified development and testing in the Contractor's facility of the work detailed in 4.1.1 through 4.1.8. The Contractor shall work in conjunction with the Government COR and the Tech Lead to identify the technical and configuration information needed to establish a realistic and relevant test environment as well as identify sources of suitable test data. The Government shall share with the Contractor sufficient information about its network configurations and data to ensure the Contractor is able to effectively simulate the Government's network environment within the Contractor's unclassified development and testing environment. The Contractor shall also install and configure in the Contractor's unclassified development and test environment the current NSA standard Cloud data framework, which at this time is expected to be the open source Accumulo software, and the current NSA standard Streaming data framework, which at this time is expected to be IBM InfoSphere Streams.

4.1.3 Develop specialized exploratory analysis techniques to run in a Cloud environment: The Contractor shall develop specialized exploratory analysis techniques using novel approaches for vertical data indexing able to run in a Cloud environment on Red Hat Enterprise Linux. The specialized exploratory analysis techniques shall be able to run as MapReduce jobs distributed among the nodes in the Cloud as well as read from and write to Cloud data stores.

4.1.4 Install and configure specialized exploratory analysis techniques in a Government Cloud: The Contractor shall assist Government personnel in installing the specialized exploratory analysis techniques developed in 4.1.3 on a classified Government Cloud and configure the techniques to operate on at least two (2) different types of Government data. The Contractor shall coordinate with the TechLead to identify and prioritize the specific types of Government data that will be used. The goal of this task is to investigate and determine the effectiveness of the specialized exploratory analysis techniques operating on Government hardware in a Government Cloud and on Government data.

Page 3

4.1.5 Develop additional types of analytics using specialized exploratory analysis techniques: The Contractor shall develop a minimum of three (3) additional analytics utilizing specialized exploratory analysis and vertical data indexing techniques that can operate effectively in a Cloud environment. Each analytic shall produce results and outcomes that provide actionable information in as close to real time as possible. The Contractor shall coordinate with the TechLead to identify and prioritize the specific types of analytics to be developed. The goal of this task is to develop new complex, predictive and anomaly detection analytics able to detect items of interest and generate actionable information in a timely manner that is measurably faster than current techniques.

4.1.6 Incorporate additional Government data types: The Contractor shall expand upon the specialized exploratory analysis techniques developed for the Cloud in 4.1.4 and 4.1.5 to allow the analysis techniques to process at least three (3) additional Government data types. The Contractor shall coordinate with the TechLead to identify and prioritize the specific types of data to be incorporated. The goal of this task is to show the flexibility and performance benefits of the exploratory analysis techniques as applied over a wider range of disparate data sets.

4.1.7 Perform scalability and performance testing: The Contractor shall coordinate with the TechLead to identify and conduct appropriate scalability and performance testing to measure the performance of the specialized exploratory analysis techniques under different operating conditions and to determine the characteristics of the analytics that improve or reduce overall performance of these techniques. This testing shall include testing overall performance, data throughput and speed under multiple conditions, and accuracy of the analytic results.

4.1.8 Research/Develop Streaming Capability Prototype: The Contractor shall coordinate with the TechLead to identify and conduct appropriate research and develop a prototype Streams application that matches the high performance infrastructure offered by the IBM InfoSphere Streams platform with the vertical data indexing techniques to demonstrate complex data classification operating on a stream of data. The Contractor shall adapt the vertical data indexing techniques and algorithms to be fully integrated within a Streams infrastructure, and demonstrate the operation of the vertical data indexing techniques in a streaming data environment.

Page 4

Classification / Prediction:
1-Entity TrainingSet (e.g., IRIS(PL,PW,SL,SW)): FAUST (max STD of dot products), pCkNN
2-Entity TrainingSet (e.g., Rating(user, movie), Buys(cust, item), tfidf(doc, term))
3-Entity TrainingSet (e.g., ????(doc, term, user))

Recommender Taxonomy:
2-Entity Recommenders (e.g., Buys(cust, item)): pSVD (min sse using gradient descent and line search)
3-Entity Recommenders (e.g., DTU(Doc, Term, User)) (But what should the cell value measure?)

[Figure: the DTU cube, a 4x4x4 binary data cube with document axis D = 1..4, term axis T = 2..5 and user axis U = 2..5.]

Use a 3-hop rolodex model instead of the 3D data cube model? TD: Term-Document matrix (cells contain tf-idf?).

[Figure: small binary examples of the pairwise feature matrices FU, DF, FD, TF, FT and UF.]

Train UF, FU. (Are we training for the best user feature twice? Do we get the same SV feature vector? If not, does the combination do better than either pair?)

[Figure: example binary relationship matrices DU (Doc x User), TD (Term x Doc) and UT (User x Term), with D = 1..4, T = 2..5 and U = 2..5.]

What would happen if we use one feature vector space (concatenate D, T, U)? Training for the best feature amounts to starting with a DTU feature vector, back-propagating (and line searching) to minimize sse. If sse is driven very low, the feature F = (DF, TF, UF) should capture most of the information in DU, TD and UT combined?
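As a concrete sketch of that idea (toy 4x4 binary matrices, a fixed learning rate standing in for the back-propagation plus line search described above; all names and values are illustrative), one shared rank-1 feature F = (DF, TF, UF) can be trained against DU, TD and UT simultaneously:

```cpp
// Minimal sketch: jointly fit one shared feature F = (DF, TF, UF) to DU, TD and UT
// by gradient descent on the combined sse. A line search could replace the fixed step.
#include <cstdio>
#include <vector>

int main() {
    const int nD = 4, nT = 4, nU = 4;
    double DU[4][4] = {{0,0,0,1},{1,0,1,0},{0,0,0,1},{0,1,0,1}};   // Doc x User
    double TD[4][4] = {{1,0,0,1},{0,1,1,1},{1,0,0,0},{1,1,0,0}};   // Term x Doc
    double UT[4][4] = {{0,0,0,1},{0,0,1,0},{0,0,0,1},{0,1,0,0}};   // User x Term

    std::vector<double> DF(nD, 0.1), TF(nT, 0.1), UF(nU, 0.1);     // the shared feature F
    const double lrate = 0.05;

    for (int epoch = 0; epoch < 500; ++epoch) {
        double sse = 0.0;
        for (int d = 0; d < nD; ++d)            // DU ~ DF * UF^T
            for (int u = 0; u < nU; ++u) {
                double err = DU[d][u] - DF[d] * UF[u];
                sse += err * err;
                double df = DF[d];
                DF[d] += lrate * err * UF[u];
                UF[u] += lrate * err * df;
            }
        for (int t = 0; t < nT; ++t)            // TD ~ TF * DF^T
            for (int d = 0; d < nD; ++d) {
                double err = TD[t][d] - TF[t] * DF[d];
                sse += err * err;
                double tf = TF[t];
                TF[t] += lrate * err * DF[d];
                DF[d] += lrate * err * tf;
            }
        for (int u = 0; u < nU; ++u)            // UT ~ UF * TF^T
            for (int t = 0; t < nT; ++t) {
                double err = UT[u][t] - UF[u] * TF[t];
                sse += err * err;
                double uf = UF[u];
                UF[u] += lrate * err * TF[t];
                TF[t] += lrate * err * uf;
            }
        if (epoch % 100 == 0) std::printf("epoch %d  sse = %.4f\n", epoch, sse);
    }
    return 0;
}
```

Driving sse low here plays the role described above: the single concatenated feature F absorbs as much of DU, TD and UT as a rank-1 model can.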

(BackProp + LineSearch)-train F = (FD | FT | FU). [Figure: the trained concatenated feature vector F, with its FD, FT and FU segments, shown against the DU, TD and UT matrices.]

Use DU (Recommender), UT (Anomaly Detection), TD (Anomaly Detection?)

Do the conversion to pTrees and train in the CLOUD. Then download F to an agent's (or soldier's) personal device for Classification / Anomaly Detection.

DU: Doc-User matrix (Docs are bought/accessed/rated by Users; Docs = web pages?).

Use a [cyclic] 3-hop rolodex model, TD-DU-UT, instead of the 3D DataCube DTU (what should a DTU cell measure?). Do ARM or pSVD on TD (train TF and FD using TD): 1. use DF to cluster D; 2. use FT to cluster T into TCL and then use TCL to cluster D into D-TCL.

UT: User-Term matrix (cell contains level of user's preference characterized by term.)

Train one single Feature Vector?

Page 5

3-hop: [Figure: the 3-hop setup. Relationships R(E,F), S(F,G) and T(G,H) shown as small binary matrices, with an antecedent A ⊆ E and a consequent C ⊆ H; axes E = 1..4, F = 2..5, G = 1..4 and H = 2..5.]

Collapse T: TC ≡ {g ∈ G | T(g,h) ∀h ∈ C}. That's just the 2-hop case with TC ⊆ G replacing C. (∀ can be replaced by ∃ or any other quantifier; the choice of quantifier should match that intended for C.) Collapse T and S: STC ≡ {f ∈ F | S(f,g) ∀g ∈ TC}. Then it's the 1-hop case with STC replacing C.

Focus on F:
confidence: ct( (&e∈A Re) & (&g∈(&h∈C Th) Sg) ) / ct( &e∈A Re ) ≥ mncnf
support: ct( &e∈A Re ) ≥ mnsup
Example: ct( 1001 & &g=1,3,4 Sg ) / ct(1001) = ct( 1001 & 1001 & 1000 & 1100 ) / 2 = ct(1000) / 2 = 1/2

Focus on G:
confidence: ct( (&f∈(&e∈A Re) Sf) & (&h∈C Th) ) / ct( &f∈(&e∈A Re) Sf ) ≥ mncnf
support: ct( &f∈(&e∈A Re) Sf ) ≥ mnsup
Example: ct( (&f=2,5 Sf) & 1101 ) / ct( &f=2,5 Sf ) = ct( 1101 & 0011 & 1101 ) / ct( 1101 & 0011 ) = ct(0001) / ct(0001) = 1/1 = 1

Are they different? Yes, because the confidences can be different numbers.

Focus on F: antecedent downward closure: A infrequent implies its supersets are infrequent; A is 1 hop from F (down). Consequent upward closure: A⇒C non-confident implies A⇒D non-confident for D ⊇ C; C is 2 hops from F (up).

Focus on G: antecedent upward closure: A infrequent implies all its subsets are infrequent; A is 2 hops from G (up). Consequent downward closure: A⇒C non-confident implies A⇒D non-confident for D ⊆ C; C is 1 hop from G (down).

Focus on E:
confidence: ct( PA & (&f∈(&g∈(&h∈C Th) Sg) Rf) ) / ct( PA ) ≥ mncnf
support: ct( PA ) ≥ mnsup
Antecedent upward closure: A infrequent implies its subsets are infrequent; A is 0 hops from E (up). Consequent downward closure: A⇒C non-confident implies A⇒D non-confident for D ⊆ C; C is 3 hops from E (down).

Focus on H:
confidence: ct( (&g∈(&f∈(&e∈A Re) Sf) Tg) & PC ) / ct( &g∈(&f∈(&e∈A Re) Sf) Tg ) ≥ mncnf
support: ct( &g∈(&f∈(&e∈A Re) Sf) Tg ) ≥ mnsup
Antecedent downward closure: A infrequent implies all its subsets are infrequent; A is 3 hops from H (down). Consequent upward closure: A⇒C non-confident implies A⇒D non-confident for D ⊇ C; C is 0 hops from H (up).
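A hedged sketch of how one of these counts is computed with pTree-style bit vectors (the bit patterns reproduce the focus-on-F example above; the names and the mnsup value are illustrative, not an existing pTree API):

```cpp
// AND the relevant bit vectors and count 1-bits, as in the focus-on-F example above.
#include <bitset>
#include <cstdio>

int main() {
    // &e in A Re : F-bit-vector of the f's related to every e in the antecedent A.
    std::bitset<4> andRe("1001");

    // Columns Sg (over F) for each g in TC = &h in C Th = {1, 3, 4}.
    std::bitset<4> Sg1("1001"), Sg3("1000"), Sg4("1100");

    std::bitset<4> numerator = andRe & Sg1 & Sg3 & Sg4;              // 1000
    double confidence = double(numerator.count()) / andRe.count();   // 1 / 2
    bool frequent = andRe.count() >= 2;                              // illustrative mnsup = 2

    std::printf("confidence = %.2f, frequent = %s\n", confidence, frequent ? "yes" : "no");
    return 0;
}
```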

Page 6

Simon Funk: Netflix provided a database of 100M ratings (1 to 5) of 17K movies by 500K users, each given as a triplet of numbers: (User, Movie, Rating). The challenge: for (User, Movie, ?) not in the database, predict how the given User would rate the given Movie.

Think of the data as a big sparsely filled matrix, with userIDs across the top and movieIDs down the side (or vice versa then transpose everything), and each cell contains an observed rating (1-5) for that movie (row) by that user (column), or is blank meaning you don't know.
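As a hedged sketch (illustrative field names and widths, not Netflix's or Funk's actual code), the data can be held as a flat list of triplets rather than the dense matrix:

```cpp
// The 100M observed ratings as (user, movie, rating) triplets; the 8.5B-cell
// dense matrix is never materialized.
#include <cstdint>
#include <vector>

struct Rating {
    uint32_t user;    // ~500K distinct users
    uint16_t movie;   // ~17K distinct movies
    uint8_t  value;   // observed rating, 1..5
};

int main() {
    std::vector<Rating> train;
    // train.reserve(100000000);  // roughly 100M triplets in the full dataset
    // Quiz cells are (user, movie) pairs absent from train; the task is to predict
    // a best-guess rating for each of them.
    return 0;
}
```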

This matrix would have 8.5B entries, but you are only given values for 1/85th of those 8.5B cells (or 100M of them). The rest are all blank. Netflix posed a "quiz" of a bunch of question marks plopped into previously blank slots, and your job is to fill in best-guess ratings in their place. Squared error (se) measures accuracy: if you guess 1.5 and the actual is 2, you get docked (2 − 1.5)² = 0.25. They use root mean squared error (rmse), but if we minimize mse, we minimize rmse. There is a date for ratings and question marks (so a cell can potentially have more than one rating in it).

Any movie can be described in terms of some features (or aspects) such as quality, action, comedy, stars (e.g., Pitt), producer, etc. A user's preferences can be described in terms of how they rate the same features (quality/action/comedy/star/producer/etc.). Then ratings ought to be explainable by a lot less than 8.5 billion numbers (e.g., a single number specifying how much action a particular movie has may help explain why a few million action-buffs like that movie).

SVD: Assume 40 features. A movie, m, is described by mF[40] = how much that movie exemplifies each aspect. A user, u, is described by uF[40] = how much he likes each aspect. P(u,m) = uF ∘ mF (dot product) and err(u,m) = P(u,m) − r(u,m).

u_a += lrate * (err(u,i) * i_a^T − λ * u_a), where err(u,i) = p(u,i) − r(u,i) and r(u,i) = actual rating.

SVD is a trick which finds UT, M which minimize mse(k) (one k at a time). So, the rank=40 SVD of the 8.5B Training matrix, is the best (least error) approx we can get within limits of our user-movie-rating model. I.e., the SVD has found the "best" feature generalizations.

To get the SVD matrices we take the gradient of mse(k) and follow it. This has a bonus: we can ignore the unknown error on the 8.4B empty slots.

Take the gradient of mse(k) (just the given values, not empties), one k at a time:

userValue[user] += lrate*err*movieValue[movie];
movieValue[movie] += lrate*err*userValue[user];

More correctly: uv = userValue[user] += err * movieValue[movie]; movieValue[movie] += err * uv; This finds the most prominent feature remaining (the one that most reduces the error). When it's good, shift it onto the done features and start a new one (cache the residuals of the 100M ratings; "What does that mean for us???").

This Gradient descent has no local minima, which means it doesn't really matter how it's initialized.

With Horizontal data, the code is evaluated for each rating. So, to train for one sample:

real *userValue = userFeature[featureBeingTrained];
real *movieValue = movieFeature[featureBeingTrained];
real lrate = 0.001;
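Pulling those fragments together, here is a hedged, self-contained sketch of training a single feature over the observed ratings (the data layout, sizes and toy data are assumptions for illustration, not Funk's original code):

```cpp
// Sketch of per-feature incremental training as described above.
#include <vector>

typedef double real;

struct Obs { int user, movie; real rating; };

const int NUM_FEATURES = 40;
const int NUM_USERS    = 500000;
const int NUM_MOVIES   = 17770;

std::vector<std::vector<real>> userFeature(NUM_FEATURES, std::vector<real>(NUM_USERS, 0.1));
std::vector<std::vector<real>> movieFeature(NUM_FEATURES, std::vector<real>(NUM_MOVIES, 0.1));

// Dot product of the user's and movie's feature vectors (features trained so far
// plus the initial values of the untrained ones).
real predictRating(int movie, int user) {
    real sum = 0;
    for (int f = 0; f < NUM_FEATURES; ++f)
        sum += userFeature[f][user] * movieFeature[f][movie];
    return sum;
}

// One pass over the observed ratings for the feature currently being trained.
void trainFeature(int featureBeingTrained, const std::vector<Obs>& ratings) {
    real* userValue  = userFeature[featureBeingTrained].data();
    real* movieValue = movieFeature[featureBeingTrained].data();
    const real lrate = 0.001;

    for (const Obs& o : ratings) {
        real err = o.rating - predictRating(o.movie, o.user);
        // Mirrors the "more correctly" form above: update the user value, then reuse it.
        real uv = (userValue[o.user] += lrate * err * movieValue[o.movie]);
        movieValue[o.movie] += lrate * err * uv;
    }
}

int main() {
    std::vector<Obs> ratings = {{0, 0, 4}, {0, 1, 1}, {1, 0, 5}};   // toy data
    for (int epoch = 0; epoch < 10; ++epoch)
        trainFeature(0, ratings);
    return 0;
}
```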

[Figure: matrix picture of the factorization. UT holds a 40-aspect row uF for each user u1..u500K, M holds a 40-aspect column mF for each movie m1..m17K, and their product P contains P(u,m) = uF ∘ mF for every (user, movie) cell.]

err(u,m) = Σk=1..40 uF_k*mF_k − r(u,m)

mse = (1/8.5B) Σm=1..17K; u=1..500K ( Σk=1..40 uF_k*mF_k − r(u,m) )²

∂mse/∂uF_h = (2/8.5B) Σm=1..17K; u=1..500K err(u,m) [ ∂( Σk=1..40 uF_k*mF_k − r(u,m) ) / ∂uF_h ] = (2/8.5B) Σm,u err(u,m) [ mF_h ]

∂mse/∂mF_h = (2/8.5B) Σm,u err(u,m) [ uF_h ]

So, we increment each uF_h by 2 err(u,m) * mF_h and each mF_h by 2 err(u,m) * uF_h. This is a big move and may overshoot the minimum, so the 2 is replaced by a smaller learning rate, lrate (e.g., Funk takes lrate = 0.001).

Page 7

Moving on: 20M free params is a lot for a 100M TrainSet. Seems neat to just ignore all blanks, but we have expectations about them. As-is, this modified SVD algorithm tends to make a mess of sparsely observed movies or users. If you have a user who has only rated 1 movie, say American Beauty=2 while the avg is 4.5, and further that their offset is only -1, we'd, prior to SVD, expect them to rate it 3.5. So the error given to the SVD is -1.5 (the true rating is 1.5 less than we expect).

m(Action) is training up to measure the amount of Action, say, .01 for American Beauty (just slightly more than avg). SVD optimizes predictions, which it can do by eventually setting our user's preference for Action to a huge -150. I.e., the alg naively looks at the only example it has of this user's preferences and, in the context of only the one feature it knows about so far (Action), determines that our user so hates action movies that even the tiniest bit of action in American Beauty makes it suck a lot more than it otherwise might. This is not a problem for users we have lots of observations for, because those random apparent correlations average out and the true trends dominate.

We need to account for priors. As with the average movie ratings, blend our sparse observations in with some sort of prior, but it's a little less clear how to do that with this incremental algorithm. But if you look at where the incremental algorithm theoretically converges, you get:

userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2)]

The numerator there will fall in a roughly zero-mean Gaussian distribution when charted over all users, which through various gyrations:

userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2 + K)]

And finally back to:

userValue[user] += lrate * (err * movieValue[movie] - K * userValue[user]);
movieValue[movie] += lrate * (err * userValue[user] - K * movieValue[movie]);

This is equivalent to penalizing the magnitude of the features, which cuts over fitting and allows the use of more features.
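As a small sketch, the regularized ("decay") update above as a function (names follow the text; err is assumed to be rating minus prediction, and the surrounding data layout is as in the earlier sketch):

```cpp
// Regularized incremental update: K penalizes feature magnitude to cut over fitting.
typedef double real;

void trainOneRating(real* userValue, real* movieValue, int user, int movie,
                    real err, real lrate, real K) {
    userValue[user]   += lrate * (err * movieValue[movie] - K * userValue[user]);
    movieValue[movie] += lrate * (err * userValue[user]   - K * movieValue[movie]);
}

int main() {
    real userValue[2]  = {0.1, 0.1};
    real movieValue[2] = {0.1, 0.1};
    trainOneRating(userValue, movieValue, 0, 1, /*err=*/1.5, /*lrate=*/0.001, /*K=*/0.02);
    return 0;
}
```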

If m only appears once with r(m,u)=1, say, is AvgRating(m)=1? Probably not! View r(m,u)=1 as a draw from a true probability distribution whose average you want... View that true average itself as a draw from a probability distribution of averages--the histogram of average movie ratings. Assume both distributions are Gaussian; then the best-guess mean should be a linear combination of the observed mean and the apriori mean, with a blending ratio equal to the ratio of the variances.

If Ra and Va are the mean and variance (squared standard deviation) of all of the movies' average ratings (which defines your prior expectation for a new movie's average rating before you've observed any actual ratings) and Vb is the average variance of individual movie ratings (which tells you how indicative each new observation is of the true mean--e.g., if the average variance is low, then ratings tend to be near the movie's true mean, whereas if the avg variance is high, ratings tend to be more random and less indicative) then:

BogusMean = sum(ObservedRatings) / count(ObservedRatings)
K = Vb/Va
BetterMean = [GlobalAverage*K + sum(ObservedRatings)] / [K + count(ObservedRatings)]

The point here is simply that any time you're averaging a small number of examples, the true average is most likely nearer the apriori average than the sparsely observed average. Note if the number of observed ratings for a particular movie is zero, the BetterMean (best guess) above defaults to the global average movie rating as one would expect.
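A hedged numeric sketch of that blending (the sample numbers for Ra, Va, Vb and the single observed rating are made up for illustration):

```cpp
// Blend a sparsely observed movie average toward the global (apriori) average,
// with K = Vb / Va controlling how strongly the prior pulls.
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> observed = {1.0};   // a movie seen only once, rated 1
    double globalAverage = 3.6;             // mean of all movies' average ratings (illustrative)
    double Va = 0.25, Vb = 1.0;             // variance of averages vs. avg rating variance (illustrative)

    double K = Vb / Va;                     // blending ratio
    double sum = std::accumulate(observed.begin(), observed.end(), 0.0);
    double bogusMean  = sum / observed.size();                        // 1.00
    double betterMean = (globalAverage * K + sum) / (K + observed.size());  // 3.08

    std::printf("BogusMean = %.2f  BetterMean = %.2f\n", bogusMean, betterMean);
    return 0;
}
```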

Refinements: Prior to starting SVD, note AvgRating(movie) for every movie and AvgOffset(UserRating, MovieAvgRating) for every user. I.e.:

static inline real predictRating_Baseline(int movie, int user) { return averageRating[movie] + averageOffset[user]; }

So, that's the return value of predictRating before the first SVD feature even starts training. You'd think avg rating for a movie would just be... its average rating! Alas, Occam's razor was a little rusty that day.

Page 8

Two choices for G proved useful. 1. Clip the prediction to 1-5 after each component is added. I.e., each feature is limited to only swaying the rating within the valid range, and any excess beyond that is lost rather than carried over. So, if the first feature suggests +10 on a scale of 1-5, and the second feature suggests -1, then instead of getting a 5 for the final clipped score, it gets a 4 because the score was clipped after each stage. The intuitive rationale here is that we tend to reserve the top of our scale for the perfect movie, and the bottom for one with no redeeming qualities whatsoever, and so there's a sort of measuring back from the edges that we do with each aspect independently. More pragmatically, since the target range has a known limit, clipping is guaranteed to improve our performance, and having trained a stage with clipping on, use it with clipping on. I did not really play with this extensively enough to determine there wasn't a better strategy.
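A minimal sketch of choice 1 (clipping after each stage rather than once at the end; the baseline value and names are illustrative):

```cpp
// Clip the running prediction to the valid 1-5 range after each feature is added,
// so any excess from one stage is lost rather than carried into the next.
#include <cstdio>

typedef double real;

real clip(real v) { return v < 1 ? 1 : (v > 5 ? 5 : v); }

real predictClipped(const real* userF, const real* movieF, int numFeatures, real baseline) {
    real sum = clip(baseline);
    for (int f = 0; f < numFeatures; ++f)
        sum = clip(sum + userF[f] * movieF[f]);   // clip after each stage
    return sum;
}

int main() {
    // The text's example: the first stage suggests +10, the second -1.
    real userF[2]  = {10, -1};
    real movieF[2] = {1, 1};
    std::printf("%.1f\n", predictClipped(userF, movieF, 2, 3.0));  // prints 4.0, not 5.0
    return 0;
}
```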

A second choice for G is to introduce some functional non-linearity such as a sigmoid. I.e., G(x) = sigmoid(x). Even if G is fixed, this requires modifying the learning rule slightly to include the slope of G, but that's straightforward. The next question is how to adapt G to the data. I tried a couple of options, including an adaptive sigmoid, but the most general and the one that worked the best was to simply fit a piecewise linear approximation to the true output/output curve. That is, if you plot the true output of a given stage vs the average target output, the linear model assumes this is a nice 45 degree line. But in truth, for the first feature for instance, you end up with a kink around the origin such that the impact of negative values is greater than the impact of positive ones. That is, for two groups of users with opposite preferences, each side tends to penalize more strongly than the other side rewards for the same quality. Or put another way, below-average quality (subjective) hurts more than above-average quality helps. There is also a bit of a sigmoid to the natural data beyond just what is accounted for by the clipping. The linear model can't account for these, so it just finds a middle compromise; but even at this compromise, the inherent non-linearity shows through in an actual-output vs. average-target-output plot, and if G is then simply set to fit this, the model can further adapt with this new performance edge, which leads to potentially more beneficial non-linearity and so on... This introduces new free parameters and encourages over fitting especially for the later features which tend to represent small groups. We found it beneficial to use this non-linearity only for the first twenty or so features and to disable it after that.

Moving on: Despite the regularization term in the final incremental law above, over fitting remains a problem. Plotting the progress over time, the probe rmse eventually turns upward and starts getting worse (even though the training error is still inching down). We found that simply choosing a fixed number of training epochs appropriate to the learning rate and regularization constant resulted in the best overall performance. I think for the numbers mentioned above it was about 120 epochs per feature, at which point the feature was considered done and we moved on to the next before it started over fitting. Note that now it does matter how you initialize the vectors: Since we're stopping the path before it gets to the (common) end, where we started will affect where we are at that point. I wonder if a better regularization couldn't eliminate overfitting altogether, something like Dirichlet priors in an EM approach--but I tried that and a few others and none worked as well as the above.
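A hedged sketch of that schedule, reusing the trainFeature() pass from the earlier sketch (the epoch count follows the text; everything else is illustrative):

```cpp
// Train each feature for a fixed number of epochs, then freeze it and move on,
// stopping before that feature starts over fitting. Builds on the earlier sketch.
#include <vector>

typedef double real;
struct Obs { int user, movie; real rating; };
void trainFeature(int featureBeingTrained, const std::vector<Obs>& ratings);  // from the earlier sketch

const int EPOCHS_PER_FEATURE = 120;   // roughly what the text reports working well

void trainAll(int numFeatures, const std::vector<Obs>& ratings) {
    for (int f = 0; f < numFeatures; ++f)
        for (int epoch = 0; epoch < EPOCHS_PER_FEATURE; ++epoch)
            trainFeature(f, ratings);
}
```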

Here is the probe and training rmse for the first few features with and without the regularization term "decay" enabled. Same thing, just the probe set rmse, further along, where you can see the regularized version pulling ahead. This time showing probe rmse (vertical) against train rmse (horizontal); note how the regularized version has better probe performance relative to the training performance. [Plots omitted from the transcript.]

Anyway, that's about it. I've tried a few other ideas over the last couple of weeks, including a couple of ways of using the date information, and while many of them have worked well up front, none held their advantage long enough to actually improve the final result.

If you notice any obvious errors or have reasonably quick suggestions for better notation or whatnot to make this explanation more clear, let me know. And of course, I'd love to hear what y'all are doing and how well it's working, whether it's improvements to the above or something completely different. Whatever you're willing to share,

Moving on: Linear models are limiting. We've bastardized the whole matrix analogy so much that we aren't really restricted to linear models: We can add non-linear outputs such that instead of predicting with: sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40. We can use: sum G(userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40.