Evaluation: personal evaluation, software validation, software evaluation
Date posted: 21-Dec-2015
3
Personal evaluation
• What have I achieved?
• Have I achieved what I set out to achieve?
• Where have I fallen short?
• Why?
• What could I have done better?
• Assumes an a priori statement of what you hope/expect/intend to achieve
4
Self evaluation in your dissertation
Dissertation plan
• Introduction
• Background
• Success criteria
• Design
• Realisation
• Evaluation/Testing
• Conclusions & Further Work
• Ch 3 lays out the success criteria by which the project is to be judged
• Ch 6 will review work done in Ch 5 with respect to these criteria, including reflection on overall validity of the approach
• But this is not “software evaluation”
5
Program validation
• Systematically check all functions in your program/application
• Systematically check all sequences of inputs etc.
• Does your program/application do what you think it is supposed to do?
• This is important, but ...
• This is not “software evaluation”
6
Software evaluation
Note: We are using the term “software” in a very vague sense: it could include a program, a web application, any sort of implementation that does something
• Evaluate the appropriateness of the software with respect to its intended use
• Large range of aspects of software that can be evaluated
7
Evaluation evaluation
• In your dissertation you are asked to evaluate what you have achieved
• Your research could (should?) include an evaluation element
• So you will need to evaluate your evaluation
• Your evaluation might have negative results, but still be an informative experiment which you can evaluate positively
• Your research could even be to compare evaluation schemes!
8
A case study
• Last year a student of mine did a project which was a comparative evaluation of a number of speech synthesis devices
• His dissertation discussed
  – Factors in setting up a comparative evaluation
  – A description of the actual evaluation
  – A discussion of the results
• His personal evaluation then considered how well the experiment (i.e. the evaluation) had been conducted
9
Software evaluation
• Functionality – does it do what it is supposed to do?
• Reliability – does it do the same thing under the same conditions?
• Usability – is it user-friendly?
• Efficiency – cost, speed, etc.
• Maintainability – can you modify it? Is it robust?
• Portability – can it be transferred from one environment/platform to another?
10
Software evaluation
• Evaluating commercial software is different from evaluating something you have constructed
  – Even if you have constructed it from commercially available components
• Again, note the difference between validation and evaluation
  – Especially concerning “functionality”
• Also, evaluation is not the same as a software review, as found e.g. in a magazine
11
Stakeholders
• Developers
  – Researchers
  – Commercial developers
• End-users
  – Actual end-users (is this a single type?)
  – Their managers (buyers)
• Vendors
• Investors
12
Evaluation types
• Feasibility / Suitability
  – For any of the above stakeholders
• Internal evaluation
  – For development
  – Iterative testing, to evaluate progress
  – Adequacy evaluation
  – Diagnostic evaluation (debugging)
  – Black box vs. glass box evaluation
13
Evaluation types
• Declarative evaluation
  – How well does it perform?
  – Comparison with a “gold standard” ideal performance
  – Comparison with a baseline “wooden block”
• Usability evaluation
  – How long does each step take?
  – Is it “natural”, intuitive?
  – Is it easy to learn to use?
  – Is it well documented?
14
Evaluation types
• Operational evaluation
  – ROI
  – Compatibility with other software
  – Consistency of interfaces
    • Internal
    • With respect to “standards” (e.g. Microsoft)
  – Failsofts
  – Role of humans
  – Preparation, throughput, correction, output
  – Backup
    • Documentation
    • Support
    • Corporate situation of provider
15
Framework for evaluation
• Definition of the relevant quality characteristics – what is it you want to evaluate? Be specific
• Definition of attributes pertinent to this quality
• Definition of a measure able to provide values for these attributes
• Definition of a method whereby the measure can be made
16
Framework for evaluation
Important to be sure that
• The quality to be evaluated is genuinely a quality that is claimed of the software
• The attribute to be measured does reflect the quality in question
• The measure does genuinely measure that attribute (and not some other one)
• The method is sufficient to deliver a meaningful measure
17
Example: spell checker
• Function:
  – (a) identify wrongly-spelled word
  – (b) suggest an appropriate correction
  – (among other features)
• Quality: ability to do (a)
• Attribute: success rate in performance of that task
• Measure: “Precision”: percentage of wrongly-spelled words correctly identified in a document
• Method: give it a text with some wrongly-spelled words and count how many it spots
18
Example: spell checker
• Good evaluation, but not A*
• Success means
  – Identifying misspelled words (true positives)
  – Ignoring correctly spelled words (true negatives)
• So is the measure really appropriate? We are only counting true positives and false negatives: we are not giving credit for the true negatives, nor penalising false positives
• The method is underspecified:
  – How much text?
  – What sort of text?
  – Should we take into account what we know about spell checking (a certain class of error is very hard to detect)?
  – Should we classify misspellings and measure different classes separately?
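The critique on this slide can be made concrete by counting all four outcomes rather than only the misspellings that were spotted. The following is a minimal sketch, not a prescribed method: the function names and the word-level data are invented for illustration, and each word is assumed to come with a hand-made judgement of whether it really is misspelled. The share of misspelled words spotted (the quantity the previous slide labels “Precision”) is more usually called recall, so the sketch names it `detection_rate` to avoid confusion.

```python
from collections import Counter


def confusion_counts(words):
    """Count outcomes for (is_misspelled, was_flagged) boolean pairs, one per word."""
    counts = Counter()
    for is_misspelled, was_flagged in words:
        if is_misspelled and was_flagged:
            counts["tp"] += 1   # misspelling correctly flagged
        elif is_misspelled:
            counts["fn"] += 1   # misspelling missed
        elif was_flagged:
            counts["fp"] += 1   # correct word wrongly flagged
        else:
            counts["tn"] += 1   # correct word correctly ignored
    return counts


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def summarise(counts):
    tp, fn, fp, tn = (counts[k] for k in ("tp", "fn", "fp", "tn"))
    return {
        # Share of misspellings spotted: ignores false positives and true negatives.
        "detection_rate": ratio(tp, tp + fn),
        # Share of flagged words that really were misspelled: penalises false positives.
        "flag_precision": ratio(tp, tp + fp),
        # Share of correct words wrongly flagged.
        "false_alarm_rate": ratio(fp, fp + tn),
        # Share of all words handled correctly: credits true negatives.
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
    }


# Invented 100-word test document: 10 misspellings, 90 correctly spelled words.
words = (
    [(True, True)] * 8 + [(True, False)] * 2
    + [(False, True)] * 5 + [(False, False)] * 85
)
counts = confusion_counts(words)
report = summarise(counts)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

On this invented data the checker spots 8 of 10 misspellings, but 5 of its 13 flags are false alarms, which the detection rate alone would hide. Running the same counts separately per class of misspelling would address the last question on the slide.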
19
Attributes
• Different types imply different measures/methods
• Example: dish-washers
Name  Racks  Options*  Water consumption  Noise level  Cleanliness
ABC   2      a,b       10                 noisy        ***
EFG   3      b         6                  quiet        *
PQR   2      a         5                  very noisy   **
* a = pre-wash rinse cycle; b = independent rinse cycle
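The point that different attribute types call for different measures can be illustrated with the dish-washer table itself. This is a rough sketch under stated assumptions: the `Dishwasher` class and the `NOISE_ORDER` ranking are invented for illustration, the stars are treated as an ordinal score, and the water figures are compared as plain numbers because the slide gives no unit.

```python
from dataclasses import dataclass

# Assumed ordering of the noise labels, quietest first (ordinal attribute).
NOISE_ORDER = ["quiet", "noisy", "very noisy"]


@dataclass(frozen=True)
class Dishwasher:
    name: str
    racks: int           # count
    options: frozenset   # nominal: a feature is either present or absent
    water: float         # numeric; unit not stated on the slide
    noise: str           # ordinal label, ranked via NOISE_ORDER
    stars: int           # ordinal rating, number of stars


MACHINES = [
    Dishwasher("ABC", 2, frozenset("ab"), 10, "noisy", 3),
    Dishwasher("EFG", 3, frozenset("b"), 6, "quiet", 1),
    Dishwasher("PQR", 2, frozenset("a"), 5, "very noisy", 2),
]

# Numeric attribute: values can be compared (and averaged) directly.
least_water = min(MACHINES, key=lambda m: m.water).name

# Ordinal attributes: only the rank order is meaningful, not the distance.
quietest = min(MACHINES, key=lambda m: NOISE_ORDER.index(m.noise)).name
cleanest = max(MACHINES, key=lambda m: m.stars).name

# Nominal attribute: the only sensible test is membership.
with_prewash = sorted(m.name for m in MACHINES if "a" in m.options)

print(least_water, quietest, cleanest, with_prewash)
```

No single machine wins on every attribute here, which is why the evaluation has to say in advance which qualities matter.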
20
Methods and measures
• Objective measures
  – Measuring, counting, timing
  – Doing a specific task
  – In case of usability issues, need to evaluate with a number of subjects (not just do it yourself)
  – Comparison against a gold standard
    • Precision: $P = \frac{\text{correct}}{\text{total}}$ (correct answers over all answers given)
    • Recall: $R = \frac{\text{correct}}{\text{possible}}$ (correct answers over all answers in the gold standard)
    • Other measures also considering false positives and negatives
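Comparison against a gold standard can be sketched as set arithmetic, taking precision as correct answers over the total number of answers the system gave, and recall as correct answers over the number of answers possible according to the gold standard. The function names and the word lists below are invented for illustration. The F1 score is included as one common way of combining the two; the slides do not say which “other measures” are intended.

```python
def precision_recall(system_output, gold_standard):
    """Compare a system's answers with a gold standard, both given as collections."""
    system = set(system_output)
    gold = set(gold_standard)
    correct = len(system & gold)   # answers the system got right
    total = len(system)            # everything the system proposed
    possible = len(gold)           # everything it could have found
    precision = correct / total if total else 0.0
    recall = correct / possible if possible else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: words a spell checker flagged vs. the actual misspellings.
gold = {"recieve", "teh", "adress", "occured"}
flagged = {"recieve", "teh", "colour"}

p, r = precision_recall(flagged, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_score(p, r):.2f}")
```

Reporting both numbers matters: a system that flags every word gets perfect recall, and one that flags a single obvious error gets perfect precision.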
21
Methods and measures
• Subjective measures
  – Interview after use
  – Feedback questionnaire
    • Rating scales (usually 5 or 7 points, + DK, N/A)
    • Open-ended questions?
    • Questions should relate to some specific point
    • Repeat (some) questions in a disguised way
  – Performance analysis
    • Video the session, analyse afterwards
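Two of the questionnaire points above lend themselves to a small sketch: keeping DK and N/A answers out of the rating summary, and using a disguised repeat of a question as a consistency check. Everything here is illustrative. The function names and responses are invented, a 5-point scale is assumed, and the disguised repeat is assumed to be a negatively worded version of the original, so that it is reverse-scored before comparison.

```python
from statistics import mean, median

NON_ANSWERS = {"DK", "N/A"}   # "don't know" and "not applicable"


def summarise_item(responses):
    """Summarise one rating-scale question, counting non-answers separately."""
    ratings = [r for r in responses if r not in NON_ANSWERS]
    return {
        "n": len(ratings),
        "non_answers": len(responses) - len(ratings),
        "median": median(ratings) if ratings else None,
        "mean": mean(ratings) if ratings else None,
    }


def inconsistent_subjects(direct, disguised, scale_max=5, tolerance=1):
    """Return indices of subjects whose two answers disagree by more than tolerance.

    The disguised question is assumed to be reverse-worded, so a rating r on it
    corresponds to (scale_max + 1 - r) on the direct question.
    """
    flagged = []
    for subject, (a, b) in enumerate(zip(direct, disguised)):
        if a in NON_ANSWERS or b in NON_ANSWERS:
            continue   # nothing to compare
        if abs(a - (scale_max + 1 - b)) > tolerance:
            flagged.append(subject)
    return flagged


# Invented responses from five subjects.
easy_to_use = [5, 4, "DK", 2, 4]        # "The device was easy to use"
hard_to_use = [1, 2, 3, "N/A", 5]       # "I struggled to use the device"

summary = summarise_item(easy_to_use)
suspect = inconsistent_subjects(easy_to_use, hard_to_use)
print(summary)
print("inconsistent subjects:", suspect)
```

A flagged subject is not necessarily careless; the disagreement may mean the two wordings were not understood as equivalent, which is itself worth knowing.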
22
Methods and measures
• Don’t try to measure too many different things with the same instrument
• Though this can be possible to some extent
• But extraneous factors need to be controlled carefully
• Problem of statistical significance:
  – Do you have enough subjects to know that the differences (and similarities) are not just random fluctuations?
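One way to see the “enough subjects” problem is an exact permutation test, which asks how often a difference at least as large as the observed one would arise if the group labels were assigned at random. This is a sketch of one possible test, chosen because it needs only the standard library; the slides do not prescribe any particular test, and the timings below are invented.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    pooled_sum = sum(pooled)

    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits


# Invented task-completion times in seconds for two devices.
laptop = [41, 45, 39, 50]
tablet = [36, 33, 38, 35]

p_four = exact_permutation_test(laptop, tablet)
p_three = exact_permutation_test(laptop[:3], tablet[:3])
print(f"4 subjects per group: p = {p_four:.4f}")
print(f"3 subjects per group: p = {p_three:.4f}")
```

In both runs every laptop time is slower than every tablet time, yet with three subjects per group there are only 20 possible splits, so this two-sided test can never report a p-value below 2/20 = 0.1, however clear the difference looks.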
23
Example
• Simulated doctor-patient interviews with patients with limited English, using computer-based communication device with symbols and digitised speech
  – two devices (laptop+mousepad, tablet+stylus)
  – doctors and nurses
  – literate and illiterate patients
25
Example
• General question: could they get to the end of the consultation? (How did we “measure” this?)
• Objective measures
  – How long did it take?
  – How many questions did they ask?
  – How many answers were (apparently) correctly understood?
• Subjective measures
  – Feedback questionnaire with satisfaction ratings
  – Open-ended questions about specific issues
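The objective measures listed above could be tabulated per device roughly as follows. This is only a sketch of the bookkeeping: the record layout, field names and all the figures are invented, and nothing here reflects the actual results of the study described.

```python
from collections import defaultdict
from statistics import mean

# Invented session records, one per simulated consultation.
sessions = [
    {"device": "laptop", "completed": True, "minutes": 12.0, "questions": 10, "understood": 8},
    {"device": "laptop", "completed": False, "minutes": 15.0, "questions": 6, "understood": 3},
    {"device": "tablet", "completed": True, "minutes": 9.0, "questions": 12, "understood": 11},
    {"device": "tablet", "completed": True, "minutes": 11.0, "questions": 8, "understood": 6},
]


def summarise_by_device(records):
    """Aggregate completion, duration and understanding per device."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["device"]].append(record)

    summary = {}
    for device, group in grouped.items():
        questions = sum(r["questions"] for r in group)
        understood = sum(r["understood"] for r in group)
        summary[device] = {
            "sessions": len(group),
            # Did they get to the end of the consultation?
            "completion_rate": mean([1 if r["completed"] else 0 for r in group]),
            # How long did it take?
            "mean_minutes": mean([r["minutes"] for r in group]),
            # How many answers were (apparently) correctly understood?
            "understood_rate": understood / questions if questions else None,
        }
    return summary


summary = summarise_by_device(sessions)
for device, stats in summary.items():
    print(device, stats)
```

The same grouping could be repeated by clinician type or by patient literacy, the other two factors in the study design.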
26
Subjects
• Many types of evaluation require volunteers
  – How many do you need?
  – Where will you get them from?
  – Are they suitable?
    • Exclusion factors: e.g. prior familiarity with your topic
    • Need to control for irrelevant differences in their profile
  – How will you guarantee their cooperation?
  – Ethical issues
    • Officially, you need ethics clearance for any experiments involving living beings!
    • In any case, important that volunteers know what they are letting themselves in for
    • Also important that you don’t waste people’s time, e.g. evaluating a useless task (for example as a baseline)
27
Summary
• What are you trying to evaluate?
  – Be specific, not general e.g. “What do you think of this interface?”
• What is the best way to measure what you are interested in?
• How feasible is it to do what you want?
• [After Easter]: How to write it all up!