SeCold - A Linked Data Platform for Mining Software Repositories


Page 1: SeCold - A Linked Data Platform for Mining Software Repositories

Iman Keivanloo, Christopher Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, Juergen Rilling

MSR 2012 June 2

A Linked Data Platform for Mining Software Repositories

Page 2

SeCold is a “Wikipedia of source code related facts” produced from over 1,000,000 open source projects.

SeCold main objectives:

(1) establish the fundamental framework
(2) perform data analysis

SeCold 2.0 is an ongoing research project (currently in its second year)

Page 3

Software Analysis Story

[Diagram: Issue Tracker, Source Code, Mailing List, Versioning Control, … → Some analysis → Some output]

Page 4

Software Analysis Story

[Diagram: Raw Data (Issue Tracker, Source Code, Mailing List, Versioning Control, …) → Extraction Process → Structured Internal Data Representation → Analysis Process → Structured Output (the "Some output" of the previous slide)]

[Source Code Analysis: A Roadmap, FOSE’07]

Page 5

Sharing

Issue Tracker, Source Code, Mailing List, Versioning Control, …

[Source code analysis: a roadmap, FOSE’07] [Fostering synergies: how … ICSE-SUITE’10]

Page 6

Integration

[Diagram: four separate pipelines over the repositories (Issue Tracker, Source Code, Mailing List, Versioning Control, …), each with its own Internal Data, Analysis Process, and Output. An Alignment step across them leads to Inter-dataset Analysis.]

Page 7

How to align?

The Challenge

[Diagram: an entity in Dataset A and an entity in Dataset B, connected by "Same as!"]

Page 8

History of Data Sharing

TXT

CSV

DATABASES

XML

LINKED DATA

Page 9

Linked Data is about being …

Online: a URL for each fact!

Standard: uses HTTP, XML, HTML and …

Open: usable by both humans and machines

NOT Static: data and schema are editable

Graph-based: a graph of triples vs. XML (a tree)

Integrating: integrated/linked on the fly
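The "graph of triples" and "linked on the fly" points can be illustrated with a small sketch. It is only an illustration under stated assumptions: every example.org URL and the `hasLicense` / `hasClone` predicates are hypothetical placeholders, not SeCold identifiers; only the `owl:sameAs` URL is a real W3C term.

```python
# Minimal sketch: facts as (subject, predicate, object) triples.
# All example.org URLs below are hypothetical placeholders.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

dataset_a = {
    ("http://example.org/a/file/Foo.java",
     "http://example.org/vocab/hasLicense", "GPL 2"),
}
dataset_b = {
    ("http://example.org/b/src/Foo.java",
     "http://example.org/vocab/hasClone",
     "http://example.org/b/src/Bar.java"),
}
# A third party can publish the link between the two datasets separately.
links = {
    ("http://example.org/a/file/Foo.java", SAME_AS,
     "http://example.org/b/src/Foo.java"),
}


def merge(*graphs):
    """Integrate graphs by plain set union; no shared tree layout is required."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def facts_about(graph, url):
    """Return (predicate, object) pairs for a URL and all of its sameAs aliases."""
    aliases = {url}
    changed = True
    while changed:  # follow sameAs links in both directions until stable
        changed = False
        for s, p, o in graph:
            if p != SAME_AS:
                continue
            if s in aliases and o not in aliases:
                aliases.add(o)
                changed = True
            elif o in aliases and s not in aliases:
                aliases.add(s)
                changed = True
    return {(p, o) for s, p, o in graph if s in aliases and p != SAME_AS}


graph = merge(dataset_a, dataset_b, links)
facts = facts_about(graph, "http://example.org/a/file/Foo.java")
```

Here `facts` holds both the license fact from the first dataset and the clone fact from the second, even though neither dataset knew about the other.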


Page 10

SeCold Project: A Linked Data Platform for Mining Software Repositories

1- Vocabulary Set (aka Schema, Data Model, Ontology)

Source Code Ecosystem Ontology Family (SECON): SOCON, VERON, METON, ISSUEON, LICENSON, CLON

Page 11

SeCold Project: A Linked Data Platform for Mining Software Repositories

2- URL/ID Generation Schema

A URL for each piece of fact (e.g. var. def. stmt)

http://aseg.cs.concordia.ca/secold/page/type/java/DatasetChangeInfo

Integration Challenge

Several ways to generate URLs (e.g. random)

REPRODUCIBLE IDENTIFIERS
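The difference between random and reproducible identifiers can be sketched as follows. The slides do not say how SeCold actually derives its URLs, so this is an assumption-laden illustration: the base URL, the `reproducible_url` helper, and the choice of hashing a fact's descriptive parts with SHA-1 are all hypothetical.

```python
import hashlib
import uuid

# Hypothetical base URL, used only for illustration.
BASE = "http://example.org/secold/resource/"


def random_url():
    """A random identifier: two extractions of the same fact get different URLs."""
    return BASE + uuid.uuid4().hex


def reproducible_url(kind, *parts):
    """Derive the URL from the fact itself, so independent runs agree on it.

    Assumption: the parts (e.g. language, qualified name) identify the fact.
    """
    key = "\x1f".join((kind,) + parts)  # unit separator avoids ambiguous joins
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return f"{BASE}{kind}/{digest}"


# Two independent "extractions" of the same type definition.
url_a = reproducible_url("type", "java", "org.example.DatasetChangeInfo")
url_b = reproducible_url("type", "java", "org.example.DatasetChangeInfo")
```

Because `url_a` and `url_b` are equal, datasets produced by different tools or at different times can be aligned without a separate matching step, which is the integration benefit the slide points to.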

Page 12

SeCold Project: A Linked Data Platform for Mining Software Repositories

3- Baseline Data Publication

General Information (~2,000,000 triples)

Source Code (~2,000,000,000 triples)

Issue Tracker (~30,000,000 triples)

Version Control (~700,000,000 triples)

~1 MILLION PROJECTS

Page 13

Linked Data Cloud (LOD)

[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]

Diagram labels: Publication, Life Science, Government, Media

Circle size   Triple count
Very large    >1B
Large         1B-10M
Medium        10M-500k
Small         500k-10k
Very small    <10k

SeCold: Among the 9 largest datasets in the cloud
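As a rough check of where SeCold falls in the legend, the approximate triple counts from the Baseline Data Publication slide can be summed and mapped to a circle size. This is a sketch: the legend does not say how values exactly on a boundary are treated, so the inclusive lower bounds in `circle_size` are an assumption.

```python
def circle_size(triples):
    """Map a triple count to the legend's circle size.

    Assumption: each tier includes its lower bound (the legend is silent on this).
    """
    if triples > 1_000_000_000:
        return "Very large"
    if triples >= 10_000_000:
        return "Large"
    if triples >= 500_000:
        return "Medium"
    if triples >= 10_000:
        return "Small"
    return "Very small"


# Approximate figures as given on the Baseline Data Publication slide.
baseline = {
    "General Information": 2_000_000,
    "Source Code": 2_000_000_000,
    "Issue Tracker": 30_000_000,
    "Version Control": 700_000_000,
}

total = sum(baseline.values())  # 2,732,000,000 triples in total
size = circle_size(total)       # "Very large"
```

The source code facts alone already exceed the 1B threshold of the largest tier.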

SeCold

Page 14

secold.org

Page 15

Showcase #1 (Similar Code Search)

Page 16

Showcase #2 – Part 1 (Copyright violation detection)

Input: Source Code of 25K projects

Ninka [A sentence-matching …, ASE’10]: Internal Data, Analysis Process, Output: License per file

SeClone [SeClone … ICPC’11 & WCRE’11]: Internal Data, Analysis Process, Output: Line-level fingerprints, Clone (Type 1, 2 and 3)

Upload

Page 17

Showcase #2 – Part 2 (Copyright violation detection)

Ninka [A sentence-matching …, ASE’10]: Analysis Process, Output: License per file

SeClone [SeClone … ICPC’11 & WCRE’11]: Analysis Process, Output: Line-level fingerprints, Clone (Type 1, 2 and 3)

Upload

Copyright violation detection:

SELECT ?fileA ?fileB WHERE {
  ?fileA testxi ?fingerprint .
  ?fileB testxi ?fingerprint .
  ?fileA hasLicense ?la .
  ?fileB hasLicense ?lb .
  FILTER (?la != ?lb)
}
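The join that this query performs can be sketched in plain Python over an in-memory list of triples. This is an illustration, not the SeCold implementation: the predicate names `testxi` and `hasLicense` come from the slide, while the file names, fingerprints, and the `license_conflicts` helper are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data; only the predicate names come from the slide.
triples = [
    ("fileA.java", "testxi", "fp1"),
    ("fileB.java", "testxi", "fp1"),
    ("fileC.java", "testxi", "fp2"),
    ("fileA.java", "hasLicense", "GPL 2"),
    ("fileB.java", "hasLicense", "Apache 2"),
    ("fileC.java", "hasLicense", "GPL 2"),
]


def license_conflicts(triples):
    """Return (fileA, fileB) pairs that share a fingerprint but differ in license.

    Unlike the SPARQL query, which returns one row per matching binding,
    the result is a set, so duplicate pairs are collapsed.
    """
    files_by_fingerprint = defaultdict(set)
    licenses_by_file = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "testxi":
            files_by_fingerprint[obj].add(subject)
        elif predicate == "hasLicense":
            licenses_by_file[subject].add(obj)

    pairs = set()
    for files in files_by_fingerprint.values():
        # Both variables range over all files with this fingerprint.
        for a, b in product(files, repeat=2):
            # FILTER (?la != ?lb): some license of a differs from some license of b.
            if any(la != lb
                   for la in licenses_by_file[a]
                   for lb in licenses_by_file[b]):
                pairs.add((a, b))
    return pairs


conflicts = license_conflicts(triples)
```

As in the query, each conflict appears in both orders, (fileA, fileB) and (fileB, fileA), because the pattern is symmetric in its two file variables.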

Page 18

Showcase #3 (Statistical Analysis)

[Two pie charts of license distribution, listed here in the order they appear in the transcript]

License               First chart    Second chart
No License            42%            46%
All Rights Reserved   14%            13%
GPL 2                 17%            12%
Apache 2              9%             10%
LGPL 2.1              12%            9%
BSD                   3%             3%
Mozilla PL 1.0        1%             3%
MIT                   0%             1%
Apache 1              0%             1%
Nokos                 0%             0%
Mozilla PL 1.1        0%             0%
PHP                   0%             0%
Sleepycat             0%             0%
Artistic              0%             0%
Shareware             0%             0%
Patented              0%             (not listed)
BSD (second entry)    (not listed)   0%

Chart labels: 2009, 2012
