IPA Spring Days 2012

Linguistic diversity in open-source development

Bogdan Vasilescu, Alexander Serebrenik, Mark van den Brand

I used these slides during my talk at the IPA Spring Days in Garderen, The Netherlands (2012).

Transcript of IPA Spring Days 2012

Page 1: IPA Spring Days 2012

Linguistic diversity in open-source development

Bogdan Vasilescu

Alexander Serebrenik

Mark van den Brand

Page 2: IPA Spring Days 2012

Motivation

/ Mathematics and Computer Science PAGE 1 23-4-2012

Lisp

C

C++

Java

Python

Unix shell

HTML

XML

I “speak” Java

I “speak” Python

I “speak” Java and Python

…

Page 3: IPA Spring Days 2012

Motivation

If a developer leaves the project, what is the risk of not finding replacement developers that speak Python?

No risk, plenty of other Python developers to choose from

What about now?

Page 4: IPA Spring Days 2012

Linguistic diversity

• Greenberg (1956)

• compare geographic regions

• probability that two random individuals do not speak the same language

Page 5: IPA Spring Days 2012

Linguistic diversity

• Simple model

• everyone speaks exactly one language

• languages are independent

$$A = 1 - \sum_{\ell \in L} p_\ell^2, \qquad p_\ell = \frac{|S_\ell|}{|P|}$$

Probability that two random individuals do not speak the same language
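The simple model above can be sketched in a few lines of Python. This is only an illustration, assuming the index is one minus the sum of squared language proportions; the function name `diversity_a` and the toy community are hypothetical and not taken from the slides.

```python
from collections import Counter


def diversity_a(speakers):
    """Simple model: each entry of `speakers` is the single language of one person.

    Returns the probability that two randomly drawn individuals
    (with replacement) do not speak the same language.
    """
    counts = Counter(speakers)
    total = len(speakers)
    # Sum of squared proportions = probability that both draws share a language.
    same_language = sum((n / total) ** 2 for n in counts.values())
    return 1.0 - same_language


# Hypothetical community of four monolingual developers.
community = ["Java", "Java", "Python", "C"]
print(diversity_a(community))
```

A community in which everyone speaks the same language yields 0; the value grows as speakers spread over more languages.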

Page 6: IPA Spring Days 2012

Linguistic diversity

• Related-languages model

• everyone speaks exactly one language

• languages are similar

Probability that two random individuals do not speak the same language

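$$B = 1 - \sum_{m,\ell \in L} p_m \, p_\ell \, \mathrm{sim}(m,\ell)$$

$$0 \le \mathrm{sim}(m,\ell) \le 1, \qquad \mathrm{sim}(\ell,\ell) = 1, \qquad p_\ell = \frac{|S_\ell|}{|P|}$$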
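A minimal sketch of the related-languages model, assuming the index weights every pair of languages by a similarity between 0 and 1 (with each language fully similar to itself). The names `diversity_b` and `toy_sim`, the proportions, and the similarity values are hypothetical illustrations, not data from the talk.

```python
def diversity_b(proportions, sim):
    """Related-languages model.

    proportions: dict mapping language -> share of the population speaking it.
    sim: function (m, l) -> similarity in [0, 1], with sim(l, l) == 1.
    """
    langs = list(proportions)
    # Expected similarity between the languages of two random individuals.
    expected_similarity = sum(
        proportions[m] * proportions[l] * sim(m, l)
        for m in langs
        for l in langs
    )
    return 1.0 - expected_similarity


def toy_sim(m, l):
    """Hypothetical similarity: C and C++ are half similar, all else unrelated."""
    if m == l:
        return 1.0
    return 0.5 if {m, l} == {"C", "C++"} else 0.0


proportions = {"C": 0.5, "C++": 0.25, "Python": 0.25}
print(diversity_b(proportions, toy_sim))
```

With a similarity that is 1 only for identical languages, the result falls back to the simple model; any similarity between distinct languages lowers the diversity.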

Page 7: IPA Spring Days 2012

Linguistic diversity

• Polyglot related-languages model

• everyone speaks at least one language

• languages are similar

Probability that two random individuals do not speak the same language

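$$F = 1 - \sum_{s,t \in \mathcal{P}(L)} p_s \, p_t \, \frac{\sum_{m \in s,\, \ell \in t} \mathrm{sim}(m,\ell)}{|s| \, |t|}, \qquad p_s = \frac{|X_s|}{|P|}$$

$$L = \{A, B, C\}, \qquad \mathcal{P}(L) = \{A, B, C, AB, AC, BC, ABC\}$$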
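The polyglot model can be sketched as follows. The sketch assumes that two individuals with language sets s and t are compared by the average similarity over all pairs of their languages; that averaging is one reading of the slide's formula, and `diversity_f`, `identity_sim`, and the toy developers are hypothetical.

```python
from collections import Counter
from itertools import product


def diversity_f(developers, sim):
    """Polyglot related-languages model.

    developers: list of sets, each the languages one person speaks (non-empty).
    sim: function (m, l) -> similarity in [0, 1].
    """
    counts = Counter(frozenset(d) for d in developers)
    total = len(developers)
    expected_similarity = 0.0
    for (s, n_s), (t, n_t) in product(counts.items(), repeat=2):
        # Assumption: average similarity over all language pairs of s and t.
        avg = sum(sim(m, l) for m in s for l in t) / (len(s) * len(t))
        expected_similarity += (n_s / total) * (n_t / total) * avg
    return 1.0 - expected_similarity


def identity_sim(m, l):
    """Hypothetical similarity: languages are similar only to themselves."""
    return 1.0 if m == l else 0.0


developers = [{"Java"}, {"Python"}, {"Java", "Python"}, {"Java", "Python"}]
print(diversity_f(developers, identity_sim))
```

When every developer speaks exactly one language, this reduces to the related-languages model, and with `identity_sim` to the simple model.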

Page 8: IPA Spring Days 2012

Our risk measure

• Probability that two random individuals do not speak the same language

• Risk of not finding developers that “speak” a given language k

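$$F = 1 - \sum_{s,t \in \mathcal{P}(L)} p_s \, p_t \, \frac{\sum_{m \in s,\, \ell \in t} \mathrm{sim}(m,\ell)}{|s| \, |t|}$$

$$\mathrm{risk}(k) = 1 - \sum_{s \in \mathcal{P}(L)} p_s \, \max_{\ell \in s} \mathrm{sim}(\ell, k)$$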
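A small sketch of the risk measure, assuming each developer contributes the best similarity between any language they speak and the target language k, weighted by the share of developers with that language set. The argument order `sim(l, k)`, the function names, and all values below are hypothetical choices made for illustration.

```python
from collections import Counter


def risk(k, developers, sim):
    """Risk of not finding developers that 'speak' language k.

    developers: list of sets, each the languages one person speaks (non-empty).
    sim: function (l, k) -> how well a speaker of l can stand in for k, in [0, 1].
    """
    counts = Counter(frozenset(d) for d in developers)
    total = len(developers)
    # For each language set, take the language closest to k.
    coverage = sum(
        (n / total) * max(sim(l, k) for l in s)
        for s, n in counts.items()
    )
    return 1.0 - coverage


def toy_sim(l, k):
    """Hypothetical similarity: C speakers are half-way to Java, nothing else."""
    if l == k:
        return 1.0
    return 0.5 if (l, k) == ("C", "Java") else 0.0


developers = [{"C"}, {"C"}, {"Java"}, {"Python", "C"}]
print(risk("Python", developers, toy_sim))
print(risk("Java", developers, toy_sim))
```

In this toy community only one developer in four speaks Python and nothing is similar to it, so Python carries the higher risk; Java benefits from the C speakers.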

Page 9: IPA Spring Days 2012

StackOverflow.com

Page 10: IPA Spring Days 2012

User tags

Page 11: IPA Spring Days 2012

Similarity measure

• Reverend Gonzo: Java, C, C++, C#, Python,…

• Alexander Serebrenik: Prolog, SQL, C++,…

• Bogdan Vasilescu: Python,…

• Jon Skeet: C#, Java, ASP.net, XML,…

• … > 400,000

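• Association rule mining:

• “C => Java”

$$\mathrm{sim}(\ell, k) = \mathrm{conf}(\ell \Rightarrow k) = \frac{nBoth}{nLeft}$$

For “C => Java”: nLeft counts the users tagged C, nBoth the users tagged both C and Java.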
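The confidence of a rule such as “C => Java” can be computed directly from per-user tag sets. The sketch below assumes each user is represented by the set of language tags they are active in; the user ids and tag sets are made up for illustration and are not StackOverflow data.

```python
def confidence(user_tags, left, right):
    """Confidence of the association rule `left => right`.

    user_tags: dict mapping a user id to the set of tags for that user.
    Returns nBoth / nLeft, or 0.0 when no user has the `left` tag.
    """
    n_left = sum(1 for tags in user_tags.values() if left in tags)
    n_both = sum(
        1 for tags in user_tags.values() if left in tags and right in tags
    )
    return n_both / n_left if n_left else 0.0


# Hypothetical users and tags.
users = {
    "u1": {"C", "Java", "Python"},
    "u2": {"C", "Java"},
    "u3": {"C"},
    "u4": {"Java", "XML"},
    "u5": {"Java"},
}
print(confidence(users, "C", "Java"))   # share of C users who also use Java
print(confidence(users, "Java", "C"))   # share of Java users who also use C
```

The two calls generally give different values, so a similarity defined this way is not symmetric.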

Page 12: IPA Spring Days 2012

Similarity measure - results

• Assembly posts: 44

• Assembly + Java developers: > 1000

When in need of Java developers, ask the Assembly guys

Page 13: IPA Spring Days 2012

Case study - Emacs

• 1985-2012: C, Emacs Lisp, C++, Java, Lisp, Python, M4, … (26)

Exotic languages

High/low risk

Page 14: IPA Spring Days 2012

Case study - Emacs

C: spoken by half of the community

+ similar to other languages

low risk

Python: spoken very sporadically

+ similar to other languages

low risk

Page 15: IPA Spring Days 2012

Conclusions

What is the risk of not finding developers that speak Python?

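• Risk measure

$$\mathrm{risk}(k) = 1 - \sum_{s \in \mathcal{P}(L)} p_s \, \max_{\ell \in s} \mathrm{sim}(\ell, k)$$

• Similarity measure (StackOverflow)

• “C => Java”

$$\mathrm{sim}(\ell, k) = \mathrm{conf}(\ell \Rightarrow k) = \frac{nBoth}{nLeft}$$

Low risk

Depends on similarity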