Repeatable and Reproducible Evaluation

Fraida Fund
NYU Polytechnic School of Engineering
[email protected]


“In industry, we ignore the evaluation in academic papers. It is often wrong and always irrelevant.”

- Head of a major industrial lab, 2011

Source of quote: Vitek, Jan, and Tomas Kalibera. "R3: Repeatability, reproducibility and rigor." ACM SIGPLAN Notices 47, no. 4a (2012): 30-36. http://janvitek.github.io/pubs/r3.pdf

Common problems in evaluation

● Unclear goals
● Meaningless measurements
● No baseline (or wrong baseline)
● Not representative
● Implicit assumptions
● Weak statistics
● Ineffective or misleading graphics
● Proprietary code and data
● Results are not reproducible

Repetition

The ability to re-run the exact same experiment with the same method on the same or similar system and obtain the same or very similar result.

Reproducibility

Independent confirmation of qualitative results by a third party, using the description of experiment design in the report/paper.

Six degrees of reproducibility

5: The results can be easily reproduced by an independent researcher with at most 15 minutes of user effort, requiring only standard, freely available tools (C compiler, etc.).

4: The results can be easily reproduced by an independent researcher with at most 15 minutes of user effort, requiring some proprietary source packages (MATLAB, etc.).

3: The results can be reproduced by an independent researcher, requiring considerable effort.

2: The results could be reproduced by an independent researcher, requiring extreme effort.

1: The results cannot seem to be reproduced by an independent researcher.

0: The results cannot be reproduced by an independent researcher.

Source: P. Vandewalle, J. Kovacevic, and M. Vetterli. "Reproducible research in signal processing - what, why, and how." IEEE Signal Processing Magazine, 26(3):37–47, May 2009. http://infoscience.epfl.ch/record/136640/files/VandewalleKV09.pdf

How reproducible is CS systems research?

● Versioning problems
● We’ll give you code… soon
● No plans to release the code
● The only student who knew how to use the code has left
● Proprietary code
● Depends on proprietary/obsolete systems
● Poor design
● Build errors

How to create a reproducible experiment

Experiment design

❏ Is there a clear mapping between your experiment goal and experiment design?

❏ Does your experiment achieve your goal with the minimum amount of work possible?

❏ Is it clear what the “result” of your evaluation is?

❏ Are there as few manual steps in your experiment as possible?

❏ Are the tools used in your experiment open and widely available?
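
One way to cut down on manual steps is to drive the whole experiment from a single script. The sketch below is only an illustration, not the course's required method: `run_trial` is a hypothetical stand-in for a real measurement (for example, a call to a tool from the labs), and the factor names, factor levels, and output path are assumptions made up for the example. The point is that every trial is generated from the design, seeded, and written to a raw data file without human intervention.

```python
import csv
import itertools
import random
from pathlib import Path


def run_trial(rate_mbps, delay_ms, seed):
    """Hypothetical stand-in for one measurement.

    A real experiment would invoke the system under test here.
    """
    rng = random.Random(seed)  # fixed seed, so the trial repeats exactly
    return rate_mbps * (1 - delay_ms / 1000) + rng.gauss(0, 0.1)


def run_experiment(out_path, rates=(10, 100), delays=(0, 50), trials=3):
    """Run every combination of factor levels and write one raw row per trial."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(
            ["rate_mbps", "delay_ms", "trial", "seed", "throughput_mbps"]
        )
        for rate, delay in itertools.product(rates, delays):
            for trial in range(trials):
                # Derive the seed from the design point so it is recorded
                # in the data and never chosen by hand.
                seed = rate * 10_000 + delay * 10 + trial
                value = run_trial(rate, delay, seed)
                writer.writerow([rate, delay, trial, seed, f"{value:.3f}"])
    return out_path
```

Calling `run_experiment("raw/throughput.csv")` twice should produce identical files, which is a quick check that the experiment is repeatable.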

Data analysis and visualization

❏ Did you separate raw and processed data?

❏ Do you have a data analysis and visualization script? (No manual calculations or interactive image generation!)

❏ Did you share the raw and processed data and script used to generate any images in your report?

❏ Are you using version control?

❏ Do you follow good statistics and data integrity practices?
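
As a rough illustration of the checklist above, an analysis script can read the raw file, leave it untouched, and write its summary to a separate processed file. This is a minimal sketch under assumptions: the column names `delay_ms` and `throughput_mbps` and the file layout are hypothetical, and it reports only the count, mean, and sample standard deviation per group. A plotting step would belong in the same script so that figures are never produced by hand.

```python
import csv
import statistics
from collections import defaultdict
from pathlib import Path


def summarize(raw_csv, processed_csv,
              group_col="delay_ms", value_col="throughput_mbps"):
    """Read raw measurements (never modified) and write per-group statistics."""
    groups = defaultdict(list)
    with Path(raw_csv).open(newline="") as f:
        for row in csv.DictReader(f):
            groups[row[group_col]].append(float(row[value_col]))

    processed_csv = Path(processed_csv)
    processed_csv.parent.mkdir(parents=True, exist_ok=True)
    with processed_csv.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([group_col, "n", "mean", "stdev"])
        for key in sorted(groups, key=float):
            values = groups[key]
            # The sample standard deviation needs at least two points.
            stdev = statistics.stdev(values) if len(values) > 1 else 0.0
            writer.writerow([
                key,
                len(values),
                f"{statistics.mean(values):.3f}",
                f"{stdev:.3f}",
            ])
    return processed_csv
```

For example, `summarize("raw/throughput.csv", "processed/summary.csv")` regenerates the processed file from the raw one, so both the data and the script can be shared alongside the report.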

Documentation

❏ Is it clear where to begin? (e.g., can someone picking a project up see where to start running it)

❏ Are there instructions for setting up the experiment and executing it?

❏ Do you explain non-obvious steps in the instructions?

❏ Have you noted the exact version of every external application used in the process?

❏ Are you using version control?
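
Noting exact versions is easy to forget, so it can help to capture them automatically at the start of each run. The sketch below is one possible approach, not a prescribed one: it assumes each tool prints its version when called with `--version` (true for many command-line tools, but worth checking for yours), and the function names and the shape of the resulting record are made up for this example.

```python
import json
import platform
import shutil
import subprocess


def tool_version(command):
    """Return the first line printed by `<command> --version`.

    Returns None if the tool is not installed. Assumes the tool
    understands a --version flag.
    """
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(
        command + ["--version"], capture_output=True, text=True
    )
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def record_environment(tools):
    """Collect platform, Python, and external tool versions in one record."""
    return {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "tools": {name: tool_version(cmd) for name, cmd in tools.items()},
    }


def save_environment(path, tools):
    """Write the environment record as JSON next to the experiment data."""
    record = record_environment(tools)
    with open(path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return record
```

A call such as `save_environment("environment.json", {"git": ["git"], "iperf": ["iperf"]})` stores the record in a file that can be committed to version control together with the experiment scripts.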

Lab exercises

Final lab exercises

Routing (repeatable and reproducible):
● Dijkstra’s algorithm
● OSPF

Software defined networks
● Just to give you another tool to use in potential projects

Projects

● Form groups of 3 or 4
● Project will run on GENI
○ Lab exercises give you some software tools to use: iperf, netem, tinyhttpd, OSPF setup, SDN, others
○ May use these or other software
● Must use good experiment design practices
● Must use good practices for communicating quantitative results
● Must use good practices for creating reproducible experiments

Projects

The labs are meant to help you, so you can use them as a jumping-off point for projects

Topics can include:
● Data center networks
● Congestion and flow control
● Routing and resiliency
● SDN
● Other topics related to HSN

Projects

Start thinking about your project
● Work in groups of 3-4
● Must have reasonable division of labor (every student takes responsibility for a part of the project)
● Must apply lessons from the lab lectures
● Will give you specific instructions for the proposal before spring break
● Project proposals are due at the midterm

Lab coverage on midterm

Lab topics are included on midterm:
● Using networking testbeds
● Experiment design
● Communicating results
● Reproducible experiments

Will give some example problems for you to work on.

Getting help

● Office hours on lab website
● Asking for help on the Internet
○ For Git Bash or R usage, for example, there’s a lot of information online
○ GENI Users Group: https://groups.google.com/forum/#!forum/geni-users
○ If you ask a question, cite it in your report

References

1. Raj Jain. "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling." Wiley-Interscience, New York, NY, April 1991. ISBN: 0471503361.

2. Moraila, G., Shankaran, A., Shi, Z., & Warren, A. M. "Measuring Reproducibility in Computer Systems Research." Tech Report (2014). http://reproducibility.cs.arizona.edu/tr.pdf

3. Vitek, Jan, and Tomas Kalibera. "R3: Repeatability, reproducibility and rigor." ACM SIGPLAN Notices 47, no. 4a (2012): 30-36. http://janvitek.github.io/pubs/r3.pdf

4. P. Vandewalle, J. Kovacevic, and M. Vetterli. "Reproducible research in signal processing - what, why, and how." IEEE Signal Processing Magazine, 26(3):37–47, May 2009. http://infoscience.epfl.ch/record/136640/files/VandewalleKV09.pdf

5. Edwards, Sarah, Xuan Liu, and Niky Riga. "Creating Repeatable Computer Science and Networking Experiments on Shared, Public Testbeds." ACM SIGOPS Operating Systems Review 49, no. 1 (2015): 90-99. http://mescal.imag.fr/membres/arnaud.legrand/research/readings/acm_sigops_si_rsea/p90-edwards.pdf and http://groups.geni.net/geni/wiki/PaperOSRMethodology

6. Leek, Jeff. The elements of data analytic style. 2015.

7. Handigol, Nikhil, Brandon Heller, Vimalkumar Jeyakumar, Bob Lantz, and Nick McKeown. "Reproducible network experiments using container-based emulation." In Proceedings of the 8th international conference on Emerging networking experiments and technologies, pp. 253-264. ACM, 2012. http://tiny-tera.stanford.edu/~nickm/papers/p253.pdf and https://reproducingnetworkresearch.wordpress.com/