Presented by Asheq Hamid
![Page 1: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/1.jpg)
Sourcerer: An Internet-Scale Software Repository
Sushil Bajracharya, Joel Ossher, Cristina Lopes
Donald Bren School of Information and Computer Sciences
University of California, Irvine
Presented by Asheq Hamid
![Page 2: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/2.jpg)
An Example
Find a code snippet where:
class “B” inherits from class “A”,
class “A” has a method named “C”,
and class “B” overrides that method “C”.
class B extends A {
    @Override
    void C() {
        // Body of the overridden method
    }
}
class A {
    void C() {
        // Body of method C
    }
}
![Page 3: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/3.jpg)
How can this type of search be supported?
Google Code Search, Koders, Krugle?
Textual search: ignores the rich structural information in the code.
Sourcerer provides an infrastructure on which this type of search can be implemented.
![Page 4: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/4.jpg)
Outline of the presentation
Sourcerer Infrastructure: How the repository has been created.
Sourcerer Web Services: What services the Sourcerer developers provide on top of this infrastructure.
Application to Existing Tools: How existing tools can benefit from the Sourcerer repository.
Future Work/Extension: Useful additions that could be made to the project.
![Page 5: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/5.jpg)
Key References
E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, and P. Baldi. Sourcerer: Mining and Searching Internet-Scale Software Repositories. Data Mining and Knowledge Discovery, 2008.
S. Bajracharya, J. Ossher, and C. Lopes. Sourcerer – An Infrastructure for Large-scale Collection and Analysis of Open-source Code. In Proceedings of the Third International Workshop on Academic Software Development Tools and Techniques (WASDeTT-3), 2010.
Sourcerer Project Website: http://sourcerer.ics.uci.edu/
![Page 6: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/6.jpg)
Infrastructure
The Sourcerer infrastructure comprises five major subsystems:
1. A system to crawl and manage software repositories.
2. A system to parse and extract features from the code.
3. A relational database to store the information.
4. Various tools to mine and search the database.
5. A Web-based graphical interface.
![Page 7: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/7.jpg)
1. Crawl and manage software repositories
External code repository (on the web)
• The current prototype supports Java projects.
• Sourceforge
• Apache
• Colt and Weka
Code crawler
• Crawlers for downloading from Sourceforge.
• Web spiders to get code from academic repositories and researchers’ web sites.
Local storage
• Keeps a local copy of the projects on the local hard disk.
![Page 8: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/8.jpg)
2. Parse and extract features from the code
A parser has been written on top of Eclipse’s AST parser.
Mainly the following information is extracted after parsing:
• Entity
• Relation
• Keyword
• Fingerprint
Several passes over the source code are required to extract all of this information.
![Page 9: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/9.jpg)
Entity
PACKAGE, CLASS, INTERFACE, ENUM, ANNOTATION, INITIALIZER, FIELD, CONSTRUCTOR, METHOD, PARAMETER, LOCAL VARIABLE, ARRAY
![Page 10: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/10.jpg)
Relation

| Relation | Description | Example |
|---|---|---|
| Extend | Class inheritance | classA extends classB |
| Implement | Interface implementation | classB implements interfaceB |
| Returns | Method return value | java.lang.String.toCharArray() returns char[] |
| Calls | Method invocation | void foo() { bar(); } |
| Inside | Physical containment | java.lang.String inside java.lang |
| … | … | … |
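As a rough illustration of why these relations matter, the opening example (class B extends A and overrides A's method C) can be answered from relation data alone. The sketch below is not Sourcerer's schema or API: the `Relation` record, the relation type strings, and the matching by short name only (ignoring parameter lists) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: one extracted relation as a (source, type, target) triple.
record Relation(String source, String type, String target) {}

class OverrideQuery {
    // For every "sub EXTENDS sup" relation, reports members of sub that share a
    // short name with a member of sup. Simplification: signatures are ignored.
    static List<String> findOverrides(List<Relation> relations) {
        List<String> hits = new ArrayList<>();
        for (Relation ext : relations) {
            if (!ext.type().equals("EXTENDS")) continue;
            String sub = ext.source();
            String sup = ext.target();
            for (Relation inSup : relations) {
                if (!inSup.type().equals("INSIDE") || !inSup.target().equals(sup)) continue;
                String method = shortName(inSup.source());
                for (Relation inSub : relations) {
                    if (inSub.type().equals("INSIDE") && inSub.target().equals(sub)
                            && shortName(inSub.source()).equals(method)) {
                        hits.add(sub + " overrides " + sup + "." + method);
                    }
                }
            }
        }
        return hits;
    }

    // Last dot-separated segment of a qualified name.
    static String shortName(String fqn) {
        return fqn.substring(fqn.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        List<Relation> relations = List.of(
            new Relation("B", "EXTENDS", "A"),
            new Relation("A.C", "INSIDE", "A"),
            new Relation("B.C", "INSIDE", "B"),
            new Relation("B.D", "INSIDE", "B"));
        System.out.println(findOverrides(relations)); // [B overrides A.C]
    }
}
```

A purely textual search cannot express this query, because the match depends on how three separate facts connect rather than on any single string.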
![Page 11: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/11.jpg)
Relation….
![Page 12: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/12.jpg)
Keywords
Why keywords?
They allow faster retrieval of search results.
How are keywords extracted?
Fully qualified names are broken up according to Java naming conventions. For example, “quickSort” is broken into two keywords: “quick” and “sort”.
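A minimal sketch of this kind of splitting, assuming that names are cut at dots, underscores, dollar signs, and lower-to-upper camel-case boundaries, and that keywords are lower-cased. The exact rules Sourcerer applies are not given in the slides, so the class name `KeywordSplitter` and the regular expression are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class KeywordSplitter {
    // Splits on '.', '_' and '$' separators and on the zero-width boundary
    // between a lower-case letter or digit and an upper-case letter.
    static List<String> split(String fqn) {
        List<String> keywords = new ArrayList<>();
        for (String part : fqn.split("[._$]+|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                keywords.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(split("quickSort"));        // [quick, sort]
        System.out.println(split("java.lang.String")); // [java, lang, string]
    }
}
```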
![Page 13: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/13.jpg)
Fingerprints
What is a fingerprint?
A description of code with particular syntactic signatures.
Examples:
• Find a code snippet with three nested loops.
• Find a switch statement with seven cases.
Fingerprints are useful for structural searches of source code.
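To make the two example searches concrete, the sketch below computes two fingerprint-style numbers over a toy syntax tree: the deepest loop nesting and the largest case count of any switch. The `Node` record and the kind strings are hypothetical stand-ins; the real extractor works on Eclipse's AST, and the attributes it actually records are not listed in the slides.

```java
import java.util.List;

// Hypothetical, simplified syntax-tree node used only for this illustration.
record Node(String kind, List<Node> children) {
    static Node of(String kind, Node... children) {
        return new Node(kind, List.of(children));
    }
}

class Fingerprint {
    // Length of the longest chain of loops nested inside one another.
    static int maxLoopDepth(Node n) {
        int deepestChild = 0;
        for (Node c : n.children()) {
            deepestChild = Math.max(deepestChild, maxLoopDepth(c));
        }
        boolean isLoop = n.kind().equals("FOR") || n.kind().equals("WHILE");
        return deepestChild + (isLoop ? 1 : 0);
    }

    // Largest number of direct CASE children on any single SWITCH node.
    static int maxSwitchCases(Node n) {
        int best = 0;
        if (n.kind().equals("SWITCH")) {
            for (Node c : n.children()) {
                if (c.kind().equals("CASE")) best++;
            }
        }
        for (Node c : n.children()) {
            best = Math.max(best, maxSwitchCases(c));
        }
        return best;
    }

    public static void main(String[] args) {
        Node method = Node.of("METHOD",
            Node.of("FOR", Node.of("WHILE", Node.of("FOR", Node.of("CALL")))),
            Node.of("SWITCH", Node.of("CASE"), Node.of("CASE")));
        System.out.println(maxLoopDepth(method));   // 3
        System.out.println(maxSwitchCases(method)); // 2
    }
}
```

With numbers like these stored per entity, "three nested loops" becomes a simple numeric filter rather than a text match.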
![Page 14: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/14.jpg)
Cross project dependency
What is a cross-project dependency?
Every project has some external dependencies. These dependencies are typically packaged in jar files and included along with the source code.
Sourcerer keeps track of these dependency files. If a dependency file is missing, Sourcerer tries to locate the jar file based on the missing dependency information.
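The slides do not say how the missing jar is located. One plausible approach, shown here purely as an assumption, is to keep an index from each known jar to the types it provides and pick the jar that covers the most unresolved type names. `JarResolver`, the index layout, and the jar and type names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class JarResolver {
    // Returns the jar whose provided types cover the most missing type names,
    // or an empty Optional if no jar provides any of them. On a tie, the jar
    // that comes first in the index's iteration order wins.
    static Optional<String> resolve(Set<String> missingTypes,
                                    Map<String, Set<String>> jarIndex) {
        String best = null;
        int bestCovered = 0;
        for (Map.Entry<String, Set<String>> entry : jarIndex.entrySet()) {
            int covered = 0;
            for (String type : missingTypes) {
                if (entry.getValue().contains(type)) covered++;
            }
            if (covered > bestCovered) {
                best = entry.getKey();
                bestCovered = covered;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new LinkedHashMap<>();
        index.put("lib-a.jar", Set.of("org.example.Foo"));
        index.put("lib-b.jar", Set.of("org.example.Foo", "org.example.Bar"));
        Set<String> missing = Set.of("org.example.Foo", "org.example.Bar");
        System.out.println(resolve(missing, index)); // Optional[lib-b.jar]
    }
}
```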
![Page 15: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/15.jpg)
3. Store information in the database.
Database
SourcererDB: Stores relational information extracted from source code.
ArtifactDB: Stores information about jar files for automated resolution of missing dependencies.
![Page 16: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/16.jpg)
4. Tools created
Code Crawler: Takes a set of root URLs as input and produces a list of download URLs and version control links, along with other project-specific metadata.
Repository Creator: Removes duplicate links and downloads the source code.
Feature Extractor: Extracts detailed structural information from source code.
Database Importer: Imports the extracted information into SourcererDB and ArtifactDB.
Code Indexer: Produces a semi-structured full-text index.
![Page 17: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/17.jpg)
5. Web-based graphical interface
Can be found at:
http://sourcerer.ics.uci.edu/sourcerer/search/index.jsp
![Page 18: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/18.jpg)
Sourcerer Web services:
• Code Search
• Repository Access
• Dependency Slicing
• Similarity Calculation
Code Search:
Input: A combination of terms and fields. The query language is based on Lucene’s implementation.
Output: A result set with detailed information on the entities that matched the query.
![Page 19: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/19.jpg)
An example query
Find a method with the terms “week” and “date” in its short name that returns a “String” type and takes an argument with the term “Date” in its name.
Corresponding query:
short_name:(week date) AND entity_type:METHOD AND
m_ret_type_sname_contents:String AND m_sig_args_fqn_contents:Date
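A query like the one above is a conjunction of field:term clauses. The helper below is a small sketch of assembling such a string; `QueryBuilder` is a hypothetical name, not part of Sourcerer, and writing the first two field names with underscores (`short_name`, `entity_type`) is an assumption about how the slide's fields are spelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryBuilder {
    // Joins field:term clauses with AND, keeping the map's insertion order.
    static String build(Map<String, String> clauses) {
        StringJoiner query = new StringJoiner(" AND ");
        for (Map.Entry<String, String> clause : clauses.entrySet()) {
            query.add(clause.getKey() + ":" + clause.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> clauses = new LinkedHashMap<>();
        clauses.put("short_name", "(week date)");          // assumed field spelling
        clauses.put("entity_type", "METHOD");              // assumed field spelling
        clauses.put("m_ret_type_sname_contents", "String");
        clauses.put("m_sig_args_fqn_contents", "Date");
        System.out.println(build(clauses));
    }
}
```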
![Page 20: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/20.jpg)
Repository Access
Input: ID of a (file | entity | relation | comment).
Output: The file that contains the ID.
Dependency Slicing:
What is a dependency slice?
A dependency slice of an entity is a program (a collection of Java source files) that includes that entity as well as all the entities on which it depends. A dependency slice should be immediately compilable.
![Page 21: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/21.jpg)
Dependency Slicing…
Input: One or more entity ids.
Output: A zip file containing the collection of sliced/synthesized Java files that the given set of entities depend on.
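At its core, a dependency slice is the set of entities reachable from the requested ones by following "depends on" edges. The sketch below computes that set with a breadth-first traversal; it does not synthesize source files or build a zip, and the `DependencySlicer` class, the map-based graph, and the entity names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DependencySlicer {
    // Returns the root entities plus everything they depend on, directly or
    // transitively. Entities already in the slice are skipped, so cycles are safe.
    static Set<String> slice(Collection<String> roots,
                             Map<String, List<String>> dependsOn) {
        Set<String> slice = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            String entity = work.poll();
            if (!slice.add(entity)) continue; // already visited
            work.addAll(dependsOn.getOrDefault(entity, List.of()));
        }
        return slice;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = Map.of(
            "Report.print", List.of("Formatter.format", "Report.title"),
            "Formatter.format", List.of("Units.symbol"),
            "Unused.helper", List.of("Formatter.format"));
        // Unused.helper is left out because nothing in the slice depends on it.
        System.out.println(slice(List.of("Report.print"), dependsOn));
        // [Report.print, Formatter.format, Report.title, Units.symbol]
    }
}
```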
Similarity Calculation
Input: An entity id.
Output: A list of other entities that are similar to the input entity.
*How similarity is calculated is outside the scope of the paper.
![Page 22: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/22.jpg)
Application to Existing Tools
1. Finding better code snippets
Strathcona:
Strathcona is a tool that also uses structural information to find code examples.
Its code repository structure is very similar to that of Sourcerer.
Sourcerer's large repository would let Strathcona search for code in a much bigger repository.
![Page 23: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/23.jpg)
1. Finding better code snippets…
Parseweb:
Parseweb is a tool that provides examples of object instantiation.
It downloads code from Google Code Search to find these examples.
It has no way to resolve missing dependencies automatically, so it relies on heuristics.
Sourcerer can benefit Parseweb a great deal, since external dependencies are automatically resolved before the source code is downloaded.
![Page 24: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/24.jpg)
2. Information mining
SpotWeb and CodeWeb:
Look for hotspots: across different APIs, they find the entities that have the most associations with other entities.
Using Sourcerer, hotspots could be detected directly, simply by ordering the entities in a jar by their number of incoming relations.
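The hotspot ranking described for SpotWeb and CodeWeb amounts to counting incoming relations per entity and sorting by that count. A minimal sketch follows, assuming relations are available as (source, target) pairs; the `Edge` record, the `Hotspots` class, and the entity names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical relation edge: source refers to (calls, uses, extends) target.
record Edge(String source, String target) {}

class Hotspots {
    // Orders target entities by their number of incoming relations,
    // most referenced first; ties are broken alphabetically.
    static List<String> rank(List<Edge> edges) {
        Map<String, Integer> incoming = new HashMap<>();
        for (Edge e : edges) {
            incoming.merge(e.target(), 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(incoming.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(incoming.get(b), incoming.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(
            new Edge("Client1.run", "Api.open"),
            new Edge("Client2.run", "Api.open"),
            new Edge("Client3.run", "Api.open"),
            new Edge("Client1.run", "Api.read"),
            new Edge("Client2.run", "Api.read"),
            new Edge("Client3.run", "Api.close"));
        System.out.println(rank(edges)); // [Api.open, Api.read, Api.close]
    }
}
```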
3. Test-driven code search
CodeGenie and Code Conjurer both use the context provided by a test case to formulate queries. Code Conjurer’s dependency resolution could be strengthened by Sourcerer’s automatic dependency resolution.
![Page 25: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/25.jpg)
Future Extensions
1. Multiple Language Support
Currently only Java is supported. Add support for other languages.
2. Addressing Evolution
Currently a separate project must be created for each new version of a project.
Add automated support for adopting new versions without creating a new project each time.
![Page 26: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/26.jpg)
Future Extensions….
3. Considering non-code artifacts
Many code projects contain non-code artifacts such as data on issues, bugs, documentation, authorship, and developers’ activities and history.
Find an approach to connect Sourcerer’s models and services with these non-code artifacts.
4. Integrating with other open-source platforms
There are some open-source quality-monitoring platforms.
The FLOSSmole project is a collaborative effort to collect and analyze large amounts of open-source project data.
Its database contains more project-specific metadata, while Sourcerer contains more structure-specific data.
Therefore, integrating Sourcerer with FLOSSmole could widen the scope and impact of both projects.
![Page 27: Presented by Asheq Hamid](https://reader036.fdocuments.in/reader036/viewer/2022062410/56816206550346895dd22c34/html5/thumbnails/27.jpg)
Questions?