Dm sei-tutorial-v7

Data Science for Software Engineering (short version) Tim Menzies, West Virginia University Fayola Peters, West Virginia University SEI, August, 2013 SEI http://goo.gl/w4Acsi ICSE’13 http://goo.gl/29YTMu 1

Transcript of Dm sei-tutorial-v7

Page 1: Dm sei-tutorial-v7

Data Science for Software Engineering (short version)

Tim Menzies, West Virginia University
Fayola Peters, West Virginia University

SEI, August 2013
SEI: http://goo.gl/w4Acsi
ICSE’13: http://goo.gl/29YTMu

Page 2: Dm sei-tutorial-v7

This talk: reflections on data science and software analytics

Two recent special issues of IEEE Software: July’13; Sept’13. Editors: Menzies & Zimmermann

Page 3: Dm sei-tutorial-v7

Insert buzzword here

• Statistics
• Operations research
• Machine Learning
• Data mining
• Predictive Analytics
• Business Intelligence
• Data Science
• Smart data
• Big Data

Page 4: Dm sei-tutorial-v7

Big data: not-so-successful stories

• Community medicine
  – Additional manual collection required for their queries
• Software engineering
  – Much product data
    • examples of source code
  – Little process data
    • costs, quality measures

We go mining with the data we have, not the data we want. Get used to it.

Page 5: Dm sei-tutorial-v7

But what isn’t being said in all of the above about data mining + SE?

1. It’s not all about algorithms (people matter)
2. Data mining is a technical and a sociological problem
   – No point in talking about how to learn lessons from many organizations…
   – …unless those organizations let you access their data
   – The problem of privacy
3. When we learn from each other
   – There is more to sharing than just “you give me your model”
     • Local learning, ensembles, filtering…

Page 6: Dm sei-tutorial-v7

OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Know your domain
  – Data science is cyclic
• PART 2: Data Issues
  – How to prune data, simpler & smarter
  – How to keep your data private
• PART 3: Models
  – Envy-based learning
  – Ensembles


Page 8: Dm sei-tutorial-v7

What can we share?

• Two software project managers meet
  – What can they learn from each other?
• They can share
  1. Data
  2. Models
  3. Methods
     • techniques for turning data into models
  4. Insight into the domain
• The standard mistake
  – Generally assumed that models can be shared, without modification.
  – Yeah, right…

Page 9: Dm sei-tutorial-v7

SE research = sparse sample of a very diverse set of activities

Microsoft Research, Redmond, Building 99

Other studios, many other projects

And they are all different.

Page 10: Dm sei-tutorial-v7

Models may not move (effort estimation)

• 20 * 66% samples of data from NASA
• Linear regression on each sample to learn $\mathit{effort} = a \cdot \mathit{LOC}^{b} \cdot \sum_i \beta_i x_i$
• Back select to remove useless $x_i$
• Result?
  – Wide $\beta_i$ variance

* T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, T. Zimmermann, "Local vs. Global Lessons for Defect Prediction and Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
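The resampling idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's code: the data is synthetic (not the NASA projects), only the a and b terms of the model are fitted (via a log transform), the β terms and the back-select step are left out, and the names `fit_log_linear` and `exponent_spread` are hypothetical.

```python
import math
import random

def fit_log_linear(locs, efforts):
    """Least-squares fit of log(effort) = log(a) + b*log(LOC); returns (a, b)."""
    xs = [math.log(v) for v in locs]
    ys = [math.log(v) for v in efforts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

def exponent_spread(locs, efforts, repeats=20, fraction=0.66, seed=1):
    """Refit on `repeats` random 66% subsamples; return (min b, max b)."""
    rng = random.Random(seed)
    n = len(locs)
    k = max(2, int(fraction * n))
    bs = []
    for _ in range(repeats):
        rows = rng.sample(range(n), k)
        _, b = fit_log_linear([locs[i] for i in rows], [efforts[i] for i in rows])
        bs.append(b)
    return min(bs), max(bs)

# Synthetic stand-in projects (an assumption; not the NASA data).
rng = random.Random(0)
locs = [rng.uniform(5, 200) for _ in range(30)]
efforts = [3.0 * loc ** 1.1 * math.exp(rng.gauss(0, 0.5)) for loc in locs]
lo, hi = exponent_spread(locs, efforts)
print(f"b ranges from {lo:.2f} to {hi:.2f} across 20 samples")
```

A wide gap between the smallest and largest fitted value is the kind of instability the slide reports for the β terms.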

Page 11: Dm sei-tutorial-v7

Models may not move (defect prediction)

* T. Menzies, A.Butcher, D.Cok, A.Marcus, L.Layman, F.Shull, B.Turhan, T.Zimmermann, "Local vs. Global Lessons for Defect Prediction and Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf

Page 12: Dm sei-tutorial-v7

Oh woe is me

• No generality in SE?
• Nothing we can learn from each other?
• Forever doomed to never make a conclusion?
  – Always, laboriously, tediously, slowly, learning specific lessons that hold only for specific projects?
• No: 3 things we might want to share
  – Models, methods, data
• If no general models, then
  – Share methods
    • general methods for quickly turning local data into local models.
  – Share data
    • Find and transfer relevant data from other projects to us

Page 13: Dm sei-tutorial-v7

The rest of this tutorial

• Data science
  – How to share data
  – How to share methods
• Maybe one day, in the future,
  – after we’ve shared enough data and methods
  – we’ll be able to report general models
• But first,
  – Some general notes on data mining


Page 15: Dm sei-tutorial-v7

OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Know your domain (this section)
  – Data science is cyclic
• PART 2: Data Issues
  – How to prune data, simpler & smarter
  – How to keep your data private
• PART 3: Models
  – Envy-based learning
  – Ensembles

Page 16: Dm sei-tutorial-v7

Case Study: NASA

• NASA’s Software Engineering Lab, 1990s
  – Gave free access to all comers to their data
  – But you had to come to get it (to learn the domain)
  – Otherwise: mistakes
• E.g. one class of software module had far more errors than anything else.
  – Dumb data mining algorithms might learn that this kind of module is inherently more error prone
• Smart data scientists might ask “what kind of programmers work on that module?”
  – A: we always give that stuff to our beginners as a learning exercise

* F. Shull, M. Mendonça, V. Basili, J. Carver, J. Maldonado, S. Fabbri, G. Travassos, and M. Ferreira, "Knowledge-Sharing Issues in Experimental Software Engineering", EMSE 9(1): 111-137, March 2004.

Page 17: Dm sei-tutorial-v7

So algorithms are only part of the story

• Dumb data miners miss important domain semantics
• An ounce of domain knowledge is worth a ton of algorithms.
• Math and statistics only get you machine learning.
• Science is about discovery and building knowledge, which requires some motivating questions about the world
• The culture of academia does not reward researchers for understanding domains.

• Drew Conway, The Data Science Venn Diagram, 2009, http://www.dataists.com/2010/09/the-data-science-venn-diagram/

Page 18: Dm sei-tutorial-v7


Source: Manuel Sevilla, http://goo.gl/cBKIh

Page 19: Dm sei-tutorial-v7

Management misconceptions of Big Data

• All our data analysis problems will be solved
  – Once we boot a CPU farm
  – Once we bring up Hadoop and Map-reduce
• If your first question is “what tools to buy?”
  – Then you are asking the wrong question

Page 20: Dm sei-tutorial-v7


• Deploy data scientists before deploying tools

Tools can augment, but not replace, human insight

Source: http://goo.gl/CCMZo

Page 21: Dm sei-tutorial-v7

The great myth

• Wouldn’t it be wonderful if we did not have to listen to them
  – The dream of olde worlde machine learning
    • Circa 1980s
  – Dispense with live experts and resurrect dead ones.
• But any successful learner needs biases
  – Ways to know what’s important
    • What’s dull
    • What can be ignored
  – No bias? Can’t ignore anything
    • No summarization
    • No generalization
    • No way to predict the future

Lesson: TALK TO THE USERS!

Page 22: Dm sei-tutorial-v7

The Inductive Engineering Manifesto

• Users before algorithms:
  – Mining algorithms are only useful in industry if users fund their use in real-world applications.
• Data science
  – Understanding user goals to inductively generate the models that most matter to the user.

• T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaganeli. The inductive software engineering manifesto. (MALETS '11).

Page 23: Dm sei-tutorial-v7

OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Know your domain
  – Data science is cyclic (this section)
• PART 2: Data Issues
  – How to prune data, simpler & smarter
  – How to keep your data private
• PART 3: Models
  – Envy-based learning
  – Ensembles

Page 24: Dm sei-tutorial-v7

Do it again, and again, and again, and …

In any industrial application, data science is repeated multiple times: to answer an extra user question, to enhance or fix a bug in the method, or to deploy it to a different set of users.

Page 25: Dm sei-tutorial-v7

Thou shall not click

• For serious data science studies,
  – to ensure repeatability,
  – the entire analysis should be automated
  – using some high level scripting language;
    • e.g. R-script, Matlab, Bash, ….
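As one possible illustration of a click-free analysis, the sketch below puts the whole run behind a single function with a fixed random seed, so rerunning the script reproduces the same report. Python is used here only for consistency with the other examples; the function name `run_analysis` and the generated numbers are hypothetical placeholders for loading and analyzing real project data.

```python
import random
import statistics

def run_analysis(seed=42, n=100):
    """One scripted run: get data, analyze it, return a small report."""
    rng = random.Random(seed)  # fixed seed, so every rerun gives identical results
    data = [rng.gauss(50, 10) for _ in range(n)]  # placeholder for reading real data
    return {
        "n": n,
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
    }

if __name__ == "__main__":
    print(run_analysis())
```

Because nothing depends on manual steps, the same script can be rerun when a user asks a new question or the method changes.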

Page 26: Dm sei-tutorial-v7

The feedback process



Page 29: Dm sei-tutorial-v7

OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Know your domain
  – Data science is cyclic
• PART 2: Data Issues
  – How to prune data, simpler & smarter (this section)
  – How to keep your data private
• PART 3: Models
  – Envy-based learning
  – Ensembles

Page 30: Dm sei-tutorial-v7


How to Prune Data, Simpler and Smarter

Data is the new oil

Page 31: Dm sei-tutorial-v7


And it has a cost too

Page 32: Dm sei-tutorial-v7

Data for Industry / Active Learning

Picking random training instances is not a good idea

More popular instances in the active pool decrease error

One of the stopping point conditions fires

[Figure. X-axis: instances sorted in decreasing popularity numbers. Y-axis: median MRE.]

Page 33: Dm sei-tutorial-v7


Data for Industry / Active Learning

At most 31% of all the cells

On median 10%

Intrinsic dimensionality: There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data is not truly high-dimensional*

* E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.

Page 34: Dm sei-tutorial-v7

OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Know your domain
  – Data science is cyclic
• PART 2: Data Issues
  – How to prune data, simpler & smarter
  – How to keep your data private (this section)
• PART 3: Models
  – Envy-based learning
  – Ensembles

Page 35: Dm sei-tutorial-v7

Is Data Sharing Worth the Risk to Individual Privacy?

What would William Weld say?

• Former Governor of Massachusetts.
• Victim of a re-identification privacy breach.
• Led to sensitive attribute disclosure of his medical records.

Page 36: Dm sei-tutorial-v7

Is Data Sharing Worth the Risk to Individual Privacy?

What about NASA contractors?

Subject to competitive bidding every 2 years.

Unwilling to share data that would lead to sensitive attribute disclosure.

e.g. actual software development times

Page 37: Dm sei-tutorial-v7

When To Share – How To Share

So far we cannot guarantee 100% privacy. What we have is a directive as to whether data is private and useful enough to share...

We have a lot of privacy algorithms geared toward minimizing risk.

Old School
• K-anonymity
• L-diversity
• T-closeness

But What About Maximizing Benefits (Utility)?

The degree of risk to the data sharing entity must not exceed the benefits of sharing.


Page 39: Dm sei-tutorial-v7

Balancing Privacy and Utility, or...

Minimize risk of privacy disclosure while maximizing utility.

Instance Selection with CLIFF
Small random moves with MORPH

= CLIFF + MORPH

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 40: Dm sei-tutorial-v7

CLIFF: Don't share all the data.

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 41: Dm sei-tutorial-v7

CLIFF: Don't share all the data.

CLIFF step 1: for each class find ranks of all values

"a=r1" powerful for selection for class=yes

more common in "yes" than "no"

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 42: Dm sei-tutorial-v7

CLIFF: Don't share all the data.

CLIFF step 2: multiply ranks of each row

"a=r1" powerful for selection for class=yes

more common in "yes" than "no"

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 43: Dm sei-tutorial-v7

CLIFF: Don't share all the data.

CLIFF step 3: select the most powerful rows of each class

Scalability

• Note linear time
• Can reduce N rows to 0.1N
• So an $O(N^2)$ NUN (nearest unlike neighbor) algorithm now takes time $O(0.01 N^2)$, since $(0.1N)^2 = 0.01N^2$

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
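The three CLIFF steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete attribute values, and the scoring inside `rank` (frequent in this class, rare in the others) is an assumed stand-in for the paper's ranking criterion. The names `cliff`, `rank`, and the `keep` parameter are hypothetical.

```python
from collections import Counter
from math import prod

def cliff(rows, labels, keep=0.1):
    """Return the indices of the most powerful rows of each class."""
    classes = sorted(set(labels))
    n_cols = len(rows[0])
    sizes = Counter(labels)
    # Step 1 (bookkeeping): per class and column, count each value.
    counts = {c: [Counter() for _ in range(n_cols)] for c in classes}
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[c][j][v] += 1

    def rank(c, j, v):
        # Assumed score: high when v is common inside class c and rare outside it.
        inside = counts[c][j][v] / sizes[c]
        n_rest = len(labels) - sizes[c]
        seen_rest = sum(counts[o][j][v] for o in classes if o != c)
        outside = seen_rest / n_rest if n_rest else 0.0
        return inside ** 2 / (inside + outside)

    selected = []
    for c in classes:
        members = [i for i, lab in enumerate(labels) if lab == c]
        # Step 2: a row's power is the product of the ranks of its values.
        power = {i: prod(rank(c, j, v) for j, v in enumerate(rows[i])) for i in members}
        # Step 3: keep only the most powerful rows of this class.
        n_keep = max(1, round(keep * len(members)))
        selected += sorted(members, key=lambda i: -power[i])[:n_keep]
    return sorted(selected)

rows = [("r1", "s1"), ("r1", "s2"), ("r2", "s1"),
        ("r2", "s2"), ("r2", "s2"), ("r1", "s2")]
labels = ["yes", "yes", "yes", "no", "no", "no"]
print(cliff(rows, labels, keep=0.34))
```

Each row is visited a fixed number of times per column, which matches the slide's note that selection is linear in the number of rows.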

Page 44: Dm sei-tutorial-v7

MORPH: Push the CLIFF data from their original position.

$y = x \pm (x - z) \cdot r$

• $x \in D$: the original instance
• $z \in D$: the NUN of $x$
• $r$: a small random number
• $y$: the resulting MORPHed instance

F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Software Engineering (ICSE), 2012 34th International Conference on, June 2012, pp. 189–199.
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE Computer Society Digital Library. IEEE Computer Society
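A minimal sketch of the MORPH move, under stated assumptions: numeric attributes, Euclidean distance for the nearest unlike neighbor (NUN), a brute-force O(N²) neighbor search, one random r per row, and an assumed range of 0.15 to 0.35 for r. The function names are hypothetical.

```python
import random

def nearest_unlike_neighbor(i, rows, labels):
    """Index of the closest row whose class differs from that of row i."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    others = [j for j, lab in enumerate(labels) if lab != labels[i]]
    return min(others, key=lambda j: dist2(rows[i], rows[j]))

def morph(rows, labels, lo=0.15, hi=0.35, seed=1):
    """Apply y = x ± (x − z)·r to every row, where z is the NUN of x."""
    rng = random.Random(seed)
    out = []
    for i, x in enumerate(rows):
        z = rows[nearest_unlike_neighbor(i, rows, labels)]
        sign = rng.choice((-1, 1))   # the ± in the formula
        r = rng.uniform(lo, hi)      # small random step (range is an assumption)
        out.append([xv + sign * (xv - zv) * r for xv, zv in zip(x, z)])
    return out

rows = [[0.0, 0.0], [1.0, 1.0]]
labels = ["yes", "no"]
print(morph(rows, labels))
```

The sketch needs at least two classes in the data; with a single class there is no unlike neighbor to move relative to.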

Page 45: Dm sei-tutorial-v7

Case Study: Cross-Company Defect Prediction (CCDP)

Sharing Required.

Zimmermann et al.: local data not always available
• companies too small
• product in first release, so no past data.

Kitchenham et al.
• no time for collection
• new technology can make all data irrelevant

T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process,” in ESEC/SIGSOFT FSE’09, 2009.
B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE Transactions on Software Engineering, vol. 33, pp. 316–329, 2007.

- Company B has little or no data to build a defect model;
- Company B uses data from Company A to build defect models;

Page 46: Dm sei-tutorial-v7

Measuring the Risk
IPR = Increased Privacy Ratio

Queries   Original   Privatized   Privacy Breach
Q1        0          0            yes
Q2        0          1            no
Q3        1          1            yes

yes = 2/3
IPR = 1 − 2/3 = 0.33

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
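The IPR arithmetic in the table can be sketched as below. It reads the table as "a breach occurs when a query returns the same answer on the original and the privatized data"; that reading, and the function name, are assumptions made for illustration.

```python
def increased_privacy_ratio(original, privatized):
    """IPR = 1 − (fraction of queries answered identically before and after privatizing)."""
    breaches = sum(1 for o, p in zip(original, privatized) if o == p)
    return 1 - breaches / len(original)

# Q1..Q3 from the table: two of three queries are breaches.
print(round(increased_privacy_ratio([0, 0, 1], [0, 1, 1]), 2))  # 0.33
```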

Page 47: Dm sei-tutorial-v7

Measuring the Utility
The g-measure

Probability of detection (pd), probability of false alarm (pf)

                Actual yes   Actual no
Predicted yes   TP           FP
Predicted no    FN           TN

pd = TP/(TP+FN)
pf = FP/(FP+TN)
g-measure = 2*pd*(1-pf)/(pd+(1-pf))

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
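The formulas above translate directly into code. The confusion-matrix counts in the example are made up for illustration, and the function name is hypothetical.

```python
def g_measure(tp, fp, fn, tn):
    """Harmonic mean of pd (recall) and 1 − pf (specificity)."""
    pd = tp / (tp + fn)   # probability of detection
    pf = fp / (fp + tn)   # probability of false alarm
    return 2 * pd * (1 - pf) / (pd + (1 - pf))

# Illustrative counts only: pd = 0.8, pf = 0.1.
print(round(g_measure(tp=8, fp=1, fn=2, tn=9), 3))
```

The sketch does not guard against empty classes (for example TP+FN = 0), which would divide by zero.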

Page 48: Dm sei-tutorial-v7

Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity

Data Swapping (s10, s20, s40)

A standard perturbation technique used for privacy. To implement...
• For each NSA a certain percent of the values are swapped with any other value in that NSA.
• For our experiments, these percentages are 10, 20 and 40.

k-anonymity (k2, k4)

The Datafly Algorithm. To implement...
• Make a generalization hierarchy.
• Replace values in the NSA according to the hierarchy.
• Continue until there are k or fewer distinct instances and suppress them.

K. Taneja, M. Grechanik, R. Ghani, and T. Xie, “Testing software in age of data privacy: a balancing act,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 201–211.

L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 571–588, Oct. 2002.
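A minimal sketch of data swapping on one attribute column, assuming that "swapped with any other value" means exchanging an entry with another randomly chosen position in the same column. The function name `swap_column` and the example values are hypothetical.

```python
import random

def swap_column(values, percent, seed=1):
    """Swap `percent`% of the entries with another position in the same column."""
    rng = random.Random(seed)
    out = list(values)
    if len(out) < 2:
        return out
    n_swap = round(len(out) * percent / 100)
    for i in rng.sample(range(len(out)), n_swap):
        j = rng.randrange(len(out) - 1)
        if j >= i:          # skip position i so the partner is always a different row
            j += 1
        out[i], out[j] = out[j], out[i]
    return out

column = [12, 7, 33, 4, 19, 25, 8, 41, 3, 16]
for percent in (10, 20, 40):   # the s10, s20, s40 settings
    print(percent, swap_column(column, percent))
```

Swapping keeps the column's overall distribution unchanged while breaking the link between a value and its row.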

Page 49: Dm sei-tutorial-v7

Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society


Page 51: Dm sei-tutorial-v7

Making Data Private for CCDP

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society


Page 53: Dm sei-tutorial-v7

OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Know your domain
  – Data science is cyclic
• PART 2: Data Issues
  – How to prune data, simpler & smarter
  – How to keep your data private
• PART 3: Models
  – Envy-based learning (this section)
  – Ensembles

Page 54: Dm sei-tutorial-v7

Envy = The WisDOM Of the COWs

• Seek the fence where the grass is greener on the other side.
• Learn from there
• Test on here
• Cluster to find “here” and “there”

Page 55: Dm sei-tutorial-v7

DATA = MULTI-DIMENSIONAL VECTORS

@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh}
…
@attribute equivphyskloc real
@attribute act_effort real

@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
….

Page 56: Dm sei-tutorial-v7

CAUTION: data may not divide neatly on raw dimensions

• The best description for SE projects may be synthesized dimensions extracted from the raw dimensions

Page 57: Dm sei-tutorial-v7

Fastmap

Fastmap: Faloutsos [1995]

O(2N) generation of axis of large variability
• Pick any point W;
• Find X furthest from W,
• Find Y furthest from X.

c = dist(X,Y)
All points have distance a,b to (X,Y)

• $x = (a^2 + c^2 - b^2)/(2c)$
• $y = \sqrt{a^2 - x^2}$

Find median(x), median(y)

Recurse on four quadrants
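The projection step above can be sketched as follows, assuming numeric points and Euclidean distance. Only the mapping of each point to (x, y) is shown; the median split and the recursion on four quadrants are left out, and the function names are hypothetical.

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fastmap_project(points, seed=1):
    """Map each point to (x, y) using two distant pivots X and Y."""
    rng = random.Random(seed)
    w = rng.choice(points)                           # pick any point W
    pivot_x = max(points, key=lambda p: dist(p, w))        # X: furthest from W
    pivot_y = max(points, key=lambda p: dist(p, pivot_x))  # Y: furthest from X
    c = dist(pivot_x, pivot_y)
    projected = []
    for p in points:
        a, b = dist(p, pivot_x), dist(p, pivot_y)
        x = (a * a + c * c - b * b) / (2 * c)        # cosine rule
        y = math.sqrt(max(0.0, a * a - x * x))       # clamp tiny negative rounding errors
        projected.append((x, y))
    return projected

points = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
for point, xy in zip(points, fastmap_project(points)):
    print(point, "->", xy)
```

Finding the two pivots takes two passes over the data, which is where the O(2N) figure comes from. The sketch assumes at least two distinct points; otherwise c would be zero.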

Page 58: Dm sei-tutorial-v7

Hierarchical partitioning

Grow
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants

Prune
• Combine quadtree leaves with similar densities
• Score each cluster by median score of class variable

Page 59: Dm sei-tutorial-v7

Q: why cluster via FASTMAP?

• A1: Circular methods (e.g. k-means) assume round clusters.
  • But density-based clustering allows clusters to be any shape
• A2: No need to pre-set the number of clusters
• A3: Because other methods (e.g. PCA) are much slower
  • Fastmap is O(2N)
  • Unoptimized Python:

Page 60: Dm sei-tutorial-v7


Learning via “envy”


Page 62: Dm sei-tutorial-v7

Hierarchical partitioning

Grow
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants

Prune
• Combine quadtree leaves with similar densities
• Score each cluster by median score of class variable

Where is grass greenest?
• This cluster envies its neighbor with better score and max abs(score(this) - score(neighbor))
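The envy rule can be sketched as a small selection function. It assumes each cluster already has a score (for example its median effort or defect count) and a known list of neighbors, and that lower scores are better; the cluster names, scores, and function name are hypothetical.

```python
def envied_neighbor(this, neighbors, scores, lower_is_better=True):
    """Return the neighbor with a better score and the largest score gap, or None."""
    def better(n):
        if lower_is_better:
            return scores[n] < scores[this]
        return scores[n] > scores[this]
    candidates = [n for n in neighbors if better(n)]
    if not candidates:
        return None  # nothing to envy: this cluster is already the local best
    return max(candidates, key=lambda n: abs(scores[this] - scores[n]))

# Illustrative median scores per cluster (made-up numbers).
scores = {"A": 50, "B": 30, "C": 10, "D": 70}
print(envied_neighbor("A", ["B", "C", "D"], scores))  # C
```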

Page 63: Dm sei-tutorial-v7

Q: How to learn rules from neighboring clusters

• A: it doesn’t really matter
  – Many competent rule learners
• But to evaluate global vs local rules:
  – Use the same rule learner for local vs global rule learning
• This study uses WHICH (Menzies [2010])
  – Customizable scoring operator
  – Faster termination
  – Generates very small rules (good for explanation)

Page 64: Dm sei-tutorial-v7

Data from http://promisedata.org/data

• Effort reduction = { NasaCoc, China } : COCOMO or function points
• Defect reduction = { lucene, xalan, jedit, synapse, etc. } : CK metrics (OO)
• Clusters have untreated class distribution.
• Rules select a subset of the examples:
  – generate a treated class distribution

Distributions have percentiles: 25th, 50th, 75th, 100th

[Figure: percentile chart, scale 0 to 90, comparing three distributions]
• untreated
• global: treated with rules learned from all data
• local: treated with rules learned from neighboring cluster

Page 65: Dm sei-tutorial-v7

By any measure, Local BETTER THAN GLOBAL

• Lower median efforts/defects (50th percentile)
• Greater stability (75th – 25th percentile)
• Decreased worst case (100th percentile)
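The three measures listed above can be computed from a class distribution as sketched below. The input numbers are made up for illustration and are not results from the study; the function name `summarize` is hypothetical.

```python
import statistics

def summarize(values):
    """Median (50th), stability (75th − 25th), and worst case (100th percentile)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": q2, "spread": q3 - q1, "worst": max(values)}

# Toy distribution only.
print(summarize([10, 20, 30, 40, 50]))
```

Applying the same function to the untreated, global, and local distributions gives the three-way comparison the slide describes.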

Page 66: Dm sei-tutorial-v7

Rules learned in each cluster

• What works best “here” does not work “there”
  – Misguided to try and tame conclusion instability
  – Inherent in the data
• Can’t tame conclusion instability.
• Instead, you can exploit it
• Learn local lessons that do better than overly generalized global theories

Page 67: Dm sei-tutorial-v7

OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Know your domain
  – Data science is cyclic
• PART 2: Data Issues
  – How to prune data, simpler & smarter
  – How to keep your data private
• PART 3: Models
  – Envy-based learning
  – Ensembles (this section)

Page 68: Dm sei-tutorial-v7

Managing Dataset Shift

[Figure labels]
• Outlier ‘Detection’
• Relevancy Filtering
• Instance Weighting
• Stratification
• Cost Curves
• Mixture Models
• Covariate Shift
• Prior Probability Shift
• Sampling Imbalanced Data
• Domain Shift
• Source Component Shift

B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.

Page 69: Dm sei-tutorial-v7

Solutions to SE Model Problems / Ensembles of Learning Machines*

Sets of learning machines grouped together. Aim: to improve predictive performance.

Base learners B1, B2, ..., BN produce estimation1, estimation2, ..., estimationN.

E.g.: ensemble estimation $= \sum_i w_i \cdot \mathit{estimation}_i$

* T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in Multiple Classifier Systems. 2000.
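The weighted-sum example above can be sketched as below. Equal weights are assumed when none are given; how weights are actually chosen is not specified on the slide, and the function name and numbers are hypothetical.

```python
def ensemble_estimate(estimates, weights=None):
    """Combine base-learner estimates as the weighted sum Σ w_i · estimation_i."""
    if weights is None:
        weights = [1 / len(estimates)] * len(estimates)  # assumed default: simple average
    if len(weights) != len(estimates):
        raise ValueError("need one weight per estimate")
    return sum(w * e for w, e in zip(weights, estimates))

# Three base learners' effort estimates (made-up numbers).
print(ensemble_estimate([100, 200, 300]))
print(ensemble_estimate([100, 200], weights=[0.25, 0.75]))
```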

Page 70: Dm sei-tutorial-v7

Solutions to SE Model Problems / Ensembles of Learning Machines

One of the keys: a diverse* ensemble, in which “base learners” make different errors on the same instances.

* G. Brown, J. Wyatt, R. Harris, X. Yao. Diversity Creation Methods: A Survey and Categorisation. Journal of Information Fusion 6(1): 5-20, 2005.

Page 71: Dm sei-tutorial-v7

Solutions to SE Model Problems / Dynamic Adaptive Ensembles

Dynamic Cross-company Learning (DCL)*
• DCL uses new completed projects that arrive with time.
• DCL determines when CC data is useful.
• DCL adapts to changes by using CC data.

Predicting effort for a single company from ISBSG based on its projects and other companies' projects.

* L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? Proceedings of the 8th International Conference on Predictive Models in Software Engineering, p. 69-78, 2012. http://dx.doi.org/10.1145/2365324.2365334.