Software Analytics: Towards Software Mining that Matters
Tao Xie
Department of Computer Science, University of Illinois at Urbana-Champaign, USA
In Collaboration with Microsoft Research
Machine Learning that Matters
“The basic argument in her paper is that machine learning might be in danger of losing its impact because the community as a whole has become quite self-referential. People are probably solving real-world problems using ML methods, but there is little sharing of these results within the community. Instead, people focus on existing benchmarks which might have originally had some connection to real-world problems which has been long forgotten, however.”
“She proposes a number of tasks like $100M solved through ML based decision making or a human life saved through a diagnosis or an intervention recommended by an ML system to get ML back on track.”
ICML’12
http://icml.cc/2012/papers/298.pdf
http://blog.mikiobraun.de/2012/06/is-machine-learning-losing-impact.html
2012 NSF Workshop on Formal Methods
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice.
• Success examples mentioned by the attendees
– SLAM/SDV
– ASTREE
– SMT-based tools
– …
http://goto.ucsd.edu/~rjhala/NSFWorkshop/
“What Happened to the Promise of Software Tools?” – Jim Larus
http://www.srl.inf.ethz.ch/workshop2014/eth-larus.pdf
https://www.youtube.com/watch?v=kO9OYnkeRTM
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
http://research.microsoft.com/en-us/groups/sa/ http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
StackMine
Performance debugging in the large via mining millions of stack traces
http://research.microsoft.com/en-us/groups/sa/stackmine_icse2012.pdf http://research.microsoft.com/en-us/groups/sa/ieeesoft13-softanalytics.pdf
Performance debugging in the large
[Figure: workflow diagram, built up over four slides. Components: Trace collection, Network, Trace Storage, Trace analysis, Pattern Matching, Problematic Pattern Repository, Bug Database, Bug filing, Bug update. Annotations: "Key to issue discovery"; "Bottleneck of scalability".]
• How many issues are still unknown?
• Which trace file should I investigate first?
Technical highlights
• Data mining for software domain
– Discovery of problematic execution patterns formulated as callstack mining & clustering
– Domain knowledge incorporated systematically
• Interactive performance analysis system
– Parallel mining infrastructure based on HPC + MPI
– Visualization aided interactive exploration
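The "callstack mining & clustering" formulation can be illustrated with a minimal sketch. This is not StackMine's actual algorithm, which these slides do not spell out. It assumes a greedy single-pass clustering of call stacks by Jaccard similarity over their frame sets; the names `jaccard` and `cluster_stacks`, the threshold value, and the sample frames are all hypothetical.

```python
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity between the sets of frames of two call stacks."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cluster_stacks(stacks: List[Sequence[str]], threshold: float = 0.6) -> List[List[int]]:
    """Greedy clustering (an assumption, not the StackMine method).

    Each stack joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `stacks`.
    """
    clusters: List[List[int]] = []
    for i, stack in enumerate(stacks):
        for cluster in clusters:
            if jaccard(stacks[cluster[0]], stack) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical traces; frame names are invented for illustration.
stacks = [
    ["main", "OpenFile", "ReadDisk", "WaitForLock"],
    ["main", "OpenFile", "ReadDisk", "WaitForLock", "Sleep"],
    ["main", "Render", "DrawText"],
]
clusters = cluster_stacks(stacks)
# Largest clusters first: a simple proxy for deciding what to look at first.
ranked = sorted(clusters, key=len, reverse=True)
```

Ranking clusters by size is only one possible heuristic for prioritizing which traces to investigate; a real system would likely weight by cost or wait time as well.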
Impact: Debugging Productivity Boost
“We believe that the MSRA tool is highly valuable and much more efficient for mass trace (100+ traces) analysis. For 1000 traces, we believe the tool saves us 4-6 weeks of time to create new signatures, which is quite a significant productivity boost.”
Highly effective new issue discovery on Windows mini-hang
Continuous impact on future Windows versions
XIAO
Scalable code clone analysis
2012
http://research.microsoft.com/en-us/groups/sa/xiao_acsac12_camerareadyfinal.pdf
XIAO: Code Clone Analysis
• Motivation
– Copy-and-paste is a common developer behavior
– A real tool widely adopted internally and externally
• XIAO enables code clone analysis in the following way
– High tunability
– High scalability
– High compatibility
– High explorability
High tunability – what you tune is what you get
• Intuitive similarity metric
– Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
– Statement similarity
– % of inserted/deleted/modified statements
– Balance between code structure and disordered statements
for (i = 0; i < n; i ++) { a ++; b ++; c = foo(a, b); d = bar(a, b, c); e = a + c; }
for (i = 0; i < n; i ++) { c = foo(a, b); a ++; b ++; d = bar(a, b, c); e = a + d; e ++; }
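To make the tunable knobs concrete, here is a minimal sketch of a statement-level similarity score applied to the two loop bodies above. It is not XIAO's actual metric. It assumes token-level `difflib.SequenceMatcher` ratios for statement similarity and an order-insensitive greedy matching of statements; the function names and the default threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def statement_similarity(a: str, b: str) -> float:
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def snippet_similarity(stmts_a: List[str], stmts_b: List[str],
                       stmt_threshold: float = 0.7) -> float:
    """Share of statements that find a close-enough partner.

    Matching ignores statement order, so reordered statements still
    count; `stmt_threshold` is the tunable statement-similarity knob.
    """
    if not stmts_a and not stmts_b:
        return 1.0
    unmatched_b = list(stmts_b)
    matched = 0
    for s in stmts_a:
        best = max(unmatched_b, key=lambda t: statement_similarity(s, t), default=None)
        if best is not None and statement_similarity(s, best) >= stmt_threshold:
            matched += 1
            unmatched_b.remove(best)
    return 2 * matched / (len(stmts_a) + len(stmts_b))


# Loop bodies of the two snippets from the slide.
body_a = ["a ++;", "b ++;", "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
body_b = ["c = foo(a, b);", "a ++;", "b ++;", "d = bar(a, b, c);", "e = a + d;", "e ++;"]

loose = snippet_similarity(body_a, body_b, stmt_threshold=0.7)
strict = snippet_similarity(body_a, body_b, stmt_threshold=0.9)
```

Raising the statement threshold stops the modified statement (`e = a + c;` versus `e = a + d;`) from counting as a match, so the strict score is lower than the loose one.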
High explorability
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Scenarios & Solutions
Quality gates at milestones
• Architecture refactoring
• Code clone clean up
• Bug fixing
Post-release maintenance
• Security bug investigation
• Bug investigation for sustained engineering
Development and testing
• Checking for similar issues before check-in
• Reference info for code review
• Supporting tool for bug triage
Online code clone search
Offline code clone analysis
Impact: Benefiting developer community
Available in Visual Studio 2012 RC
Searching for similar snippets to fix a bug once
Finding refactoring opportunity
Impact: More secure Microsoft products
Code Clone Search service integrated into workflow of Microsoft Security Response Center
Over 590 million lines of code indexed across multiple products
Real security issues proactively identified and addressed
Example – MS Security Bulletin MS12-034
Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published: Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and 7 privately reported vulnerabilities were involved. Specifically, 1 was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, Windows Journal viewer
Microsoft Technet Blog about this bulletin
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
SAS
Incident management of online services
http://research.microsoft.com/apps/pubs/?id=202451
Motivation
Incident Management (IcM) is a critical task to assure service quality
• Online services are increasingly popular & important
• High service quality is the key
Incident Management: Workflow
1. Detect a service issue
2. Alert On-Call Engineers (OCEs)
3. Investigate the problem
4. Restore the service
5. Fix root cause via postmortem analysis
SAS: Incident management of online services
SAS was developed and deployed to effectively reduce MTTR (Mean Time To Restore) by automatically analyzing monitoring data
Design Principles of SAS
• Automating Analysis
• Handling Heterogeneity
• Accumulating Knowledge
• Supporting human-in-the-loop (HITL)
Techniques Overview
• System metrics
– Identifying Incident Beacons
• Transaction logs
– Mining Suspicious Execution Patterns
• Historical incidents
– Mining Historical Workaround Solutions
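As a rough illustration of the last technique, mining historical workaround solutions can be viewed as a retrieval problem: given the symptoms of a new incident, find the most similar past incident and suggest its workaround. The sketch below is an assumption, not the SAS implementation. It uses Jaccard similarity over symptom sets, and the `Incident` type, the `suggest_workaround` helper, the threshold, and all sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Incident:
    """A past incident: observed symptom signals plus the workaround used."""
    signature: FrozenSet[str]
    workaround: str


def similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between two symptom sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_workaround(history: List[Incident], symptoms: FrozenSet[str],
                       min_similarity: float = 0.5) -> Optional[str]:
    """Return the workaround of the most similar past incident, or None
    when nothing in the history is similar enough to be trusted."""
    best = max(history, key=lambda inc: similarity(inc.signature, symptoms), default=None)
    if best is None or similarity(best.signature, symptoms) < min_similarity:
        return None
    return best.workaround


# Hypothetical history; symptom names and workarounds are invented.
history = [
    Incident(frozenset({"high_cpu", "queue_backlog"}), "restart worker role"),
    Incident(frozenset({"login_failures", "db_timeout"}), "fail over database"),
]
suggestion = suggest_workaround(
    history, frozenset({"db_timeout", "login_failures", "slow_pages"}))
no_match = suggest_workaround(history, frozenset({"disk_full"}))
```

Returning `None` below the threshold keeps the engineer in the loop rather than proposing a weakly supported workaround, in the spirit of the human-in-the-loop design principle.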
Industry Impact of SAS
Deployment
• SAS deployed to worldwide datacenters for Service X (serving hundreds of millions of users) since June 2011
• OCEs now heavily depend on SAS
Usage
• SAS helped successfully diagnose ~76% of the service incidents assisted with SAS
Coding Duels (Code Hunt/Pex4Fun)
Teaching/Learning Programming/Software Engineering via Interactive Gaming
http://web.engr.illinois.edu/~taoxie/publications/icse13see-pex4fun.pdf
Code Hunt Competition for Students https://www.codehunt.com/
Precursor: http://www.pex4fun.com/
A Fun and Engaging Game – Win by Writing Code
Supports Java and C#
Adapts to competitions as well as individual play
Users: 1,181,152
User Programs: 7,079,497
WWW.CODEHUNT.COM
Behind the Scene of Coding Duel
Secret Implementation
class Secret {
  public static int Puzzle(int x) { if (x <= 0) return 1; return x * Puzzle(x-1); }
}
Player Implementation
class Player {
  public static int Puzzle(int x) { return x; }
}
class Test {
  public static void Driver(int x) {
    if (Secret.Puzzle(x) != Player.Puzzle(x)) throw new Exception("Mismatch");
  }
}
behavior: Secret Impl == Player Impl
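The duel above boils down to searching for an input on which the two implementations disagree. The following Python sketch mirrors the slide's `Secret`, `Player`, and `Driver` roles. It swaps in brute-force enumeration over a small input range as a stand-in for the automated test generation the real system relies on, and the helper name `find_mismatch` is hypothetical.

```python
from typing import Callable, Iterable, Optional


def secret(x: int) -> int:
    """Mirror of the slide's Secret.Puzzle: factorial, with 1 for x <= 0."""
    if x <= 0:
        return 1
    return x * secret(x - 1)


def player(x: int) -> int:
    """Mirror of the slide's Player.Puzzle: the player's first guess."""
    return x


def find_mismatch(secret_impl: Callable[[int], int],
                  player_impl: Callable[[int], int],
                  inputs: Iterable[int]) -> Optional[int]:
    """Return the first input where the two implementations disagree,
    or None if they agree on every input tried (brute force; an
    assumption standing in for generated test inputs)."""
    for x in inputs:
        if secret_impl(x) != player_impl(x):
            return x
    return None


# The player would be shown this input as a hint about the secret behavior.
counterexample = find_mismatch(secret, player, range(-3, 10))
# A player implementation identical to the secret wins the duel.
solved = find_mismatch(secret, secret, range(-3, 10)) is None
```

Agreement on a finite range is only evidence, not proof, that the behaviors match; enumeration is used here purely to keep the sketch self-contained.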
Experience Reports on Successful Tool Transfer
• Nikolai Tillmann, Jonathan de Halleux, and Tao Xie. Transferring an Automated Test Generation Tool to Practice: From Pex to Fakes and Code Digger. In Proceedings of ASE 2014, Experience Papers. http://web.engr.illinois.edu/~taoxie/publications/ase14-pexexperiences.pdf
• Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang, and Tao Xie. Software Analytics for Incident Management of Online Services: An Experience Report. In Proceedings of ASE 2013, Experience Papers. http://web.engr.illinois.edu/~taoxie/publications/ase13-sas.pdf
• Dongmei Zhang, Shi Han, Yingnong Dang, Jian-Guang Lou, Haidong Zhang, and Tao Xie. Software Analytics in Practice. IEEE Software, Special Issue on the Many Faces of Software Analytics, 2013. http://web.engr.illinois.edu/~taoxie/publications/ieeesoft13-softanalytics.pdf
• Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proceedings of ACSAC 2012. http://web.engr.illinois.edu/~taoxie/publications/acsac12-xiao.pdf
Ex: Human Consumption of Tool Outputs
• Developer: Your tool generated “\0”
• Pex team: What did you expect?
• Developer: Marc
Invariant candidates:
this.getPrice() > 0
this.getPrice() >= 0
http://www.agitar.com/
http://research.microsoft.com/projects/pex/
Q & A
http://research.microsoft.com/en-us/groups/sa/
http://www.cs.illinois.edu/homes/taoxie/
Contact: [email protected]
Supported in part by a Microsoft Research Award, NSF grants CCF-1349666, CNS-1434582, CCF-1434596, CCF-1434590, CNS-1439481, and the USA National Security Agency (NSA) Science of Security Lablet.