Measuring and Comparing the Reliability of the Structured Walkthrough Evaluation Method with Novices...

Chris Bailey, Elaine Pearson, Voula Gkatzidou. Teesside University; AbilityNet.

Measuring and Comparing the Reliability of the Structured Walkthrough Evaluation Method with Novices and Experts

Email: [email protected] Twitter: @chrisbailey000


The Overall Problem

“We haven’t solved the problem of web accessibility”

- Jeff Bigham, W4A 2013.

The Evaluation Problem

The evaluator effect: multiple evaluators detect different sets of problems when examining the same interface (Hertzum & Jacobsen, 2001).

Manual evaluation and human judgement are a significant requirement when using automated tools (Vigo et al, 2013).

Evaluator expertise is particularly significant in WCAG 2.0 (Brajnik, 2010) and Barrier Walkthrough (BW) (Yesilada et al, 2009).

Neither WCAG set has reliability definitely over the W3C threshold (Brajnik, 2009); only 8 of 25 success criteria (SC) can be reliably tested (Alonso et al, 2010).

With experts, half of the WCAG 2.0 SC fail to meet the threshold (Brajnik, 2010).

Towards A Solution

With novices, comprehension, knowledge and effort are key factors (Alonso et al, 2010).

Evaluation reports (audits) have motivational and educational value (Sloan, 2006).

Heuristic evaluation constrains the evaluator (Brajnik, 2005) and potentially reduces the evaluator effect.

BW finds more severe issues and fewer false positives (Brajnik, 2006), and more issues than conformance review (CR) (2008).

Accessibility Evaluation Assistant (AEA)

• Developed as an educational, evaluation support tool for novices.

• 3 Evaluation Functions; 48 Checks; 5 Categories.

• The Structured Walkthrough Method (SWM) guides the novice through the process.

• Checks are based on the potential barriers they test, supported by guidance and tutorials.

• W4A 2010 has full information.

Structured Walkthrough Method

Each check is broken into a number of components:

1. The title of the accessibility principle (heuristic).

2. A short summary.

3. General description of the check’s importance in terms of the user group(s) affected and the nature of the barrier or problem caused.

4. Description of the method to perform the check, with step-by-step instructions if using the Web Accessibility Toolbar.

5. Steps to verify and record a result for the check.

6. A video demonstration of the check being performed in a live context.

Aims of the Experiment

Define and measure quality attributes of SWM: Reliability; Validity (Correctness, Sensitivity); Usefulness; Efficiency.

Measure reliability and validity of novice evaluations:

The extent to which the participants agree on the result of a check.

How ‘correct’ the novice evaluators were: did they reach the same judgement as the majority of experts?

Measure reliability of expert evaluations.

Gain qualitative feedback on the potential usefulness and viability of SWM.

Experiment Methodology (Part 1)

26 final-year Computing students on a 12-week elective module, Accessibility and Adaptive Technology.

Conducted as assessment within curriculum constraints.

4 tasks over 3 weeks:

2 Evaluations: Fitness First and Pure Gym Homepages.

2 Reflective Pieces: Personas/User Group, Experience of Evaluation.

Evaluate pages for conformance to 15 AEA heuristics; each is relevant to both pages, though the result may differ.

Each check is recorded as Met, Not Met or Partly Met; evaluators explain and justify their decision.

6 Experienced Accessibility Practitioners

Example of Measuring Reliability and Validity – Fitness First

| Check | Met | Part Met | Not Met | Reliability (R) | Validity (V) |
| --- | --- | --- | --- | --- | --- |
| Colour Contrast | 15 | 13* | 0 | 15/28 (54%) | 13/28 (46%) |
| Text Size | 23* | 5 | 0 | 23/28 (82%) | 23/28 (82%) |
| Text Alternatives | 2 | 16 | 10* | 16/28 (57%) | 10/28 (36%) |
| Link Titles | 2 | 4 | 21* | 21/28 (75%) | 21/28 (75%) |
| Language of Text | 23* | 3 | 2 | 23/28 (82%) | 23/28 (82%) |

Calculating Overall Reliability and Validity

In the example, 28 evaluators each perform 5 checks, giving a total of 140 decisions.

R is the extent to which evaluators reached the same decision, expressed as a proportion of the maximum value.

R = (15 + 23 + 16 + 21 + 23) / 140 = 98/140 = 70%

V is the extent to which the decision matches that of the majority of experts (Yesilada et al, 2010).

V = (13 + 23 + 10 + 21 + 23) / 140 = 90/140 = 64%
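The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' own tooling: the names `CheckTally`, `overall_reliability` and `overall_validity` are hypothetical, and the sketch assumes that the starred count in the Fitness First example table marks the decision reached by the majority of experts, which is consistent with the V figures shown there.

```python
from dataclasses import dataclass

N_EVALUATORS = 28  # evaluators per check in the worked example


@dataclass
class CheckTally:
    """Decision counts for one check (hypothetical structure for illustration)."""
    name: str
    met: int
    part_met: int
    not_met: int
    expert_majority: str  # assumption: the starred decision in the example table

    def counts(self):
        return {"met": self.met, "part_met": self.part_met, "not_met": self.not_met}


# Counts transcribed from the Fitness First example table.
TALLIES = [
    CheckTally("Colour Contrast", 15, 13, 0, "part_met"),
    CheckTally("Text Size", 23, 5, 0, "met"),
    CheckTally("Text Alternatives", 2, 16, 10, "not_met"),
    CheckTally("Link Titles", 2, 4, 21, "not_met"),
    CheckTally("Language of Text", 23, 3, 2, "met"),
]


def overall_reliability(tallies, n_evaluators=N_EVALUATORS):
    """Sum of the most common decision per check over all decisions made."""
    agreeing = sum(max(t.counts().values()) for t in tallies)
    return agreeing / (n_evaluators * len(tallies))


def overall_validity(tallies, n_evaluators=N_EVALUATORS):
    """Sum of decisions matching the expert majority over all decisions made."""
    correct = sum(t.counts()[t.expert_majority] for t in tallies)
    return correct / (n_evaluators * len(tallies))


print(f"R = {overall_reliability(TALLIES):.0%}")  # prints R = 70%
print(f"V = {overall_validity(TALLIES):.0%}")     # prints V = 64%
```

The denominator is fixed at 28 evaluators per check, as on the slide, rather than derived from the row totals.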

Results: Reliability and Validity

Summary of Reliability

| Website | Novice Evaluations (R) | Expert Evaluations (R) |
| --- | --- | --- |
| Fitness First | 62% | 76% |
| Pure Gym | 67% | - |
| Overall | 65% | 76% |

Summary of Validity

| Website | Novice Evaluations (V) |
| --- | --- |
| Fitness First | 48% |
| Pure Gym | - |
| Overall | 48% |

2011: 66% to 73%; Overall 69%

2012: 63% to 78%; Overall 71%

2011: 56% to 65%; Overall 60%

2012: 62% to 73%; Overall 68%

Results: Comparison of Reliability

| Check | Reliability (R): Novices | Reliability (R): Experts |
| --- | --- | --- |
| Images of Text | 60% | 66% |
| Colour Contrast | 54% | 83% |
| Moving Elements | 57% | 83% |
| Text Size | 82% | 100% |
| Keyboard Navigation | 75% | - |
| Link Names | 57% | 66% |
| Skip Navigation Link | 68% | 66% |
| Text Alternatives | 57% | - |
| Link Titles | 75% | 66% |
| Headings and Sub-Headings | 39% | 66% |
| Form Labels | 50% | 50% |
| Identify Language of Text | 82% | 100% |
| Validate (X)HTML Code | 68% | 100% |
| Site Map | 57% | 66% |
| Accessibility Information | 50% | 83% |

Checks performed by experts generally had higher level of reliability.
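The claim can be checked directly against the per-check figures. The sketch below is illustrative only: the `RELIABILITY` mapping and `compare` function are hypothetical names, the percentages are transcribed from the slide, and it assumes that the two checks with no expert figure are Keyboard Navigation and Text Alternatives (inferred from their position in the list).

```python
# Per-check reliability (%) as (novices, experts), transcribed from the slide.
# None marks a check with no expert figure; which checks these are is an
# assumption based on the gaps in the expert list.
RELIABILITY = {
    "Images of Text": (60, 66),
    "Colour Contrast": (54, 83),
    "Moving Elements": (57, 83),
    "Text Size": (82, 100),
    "Keyboard Navigation": (75, None),
    "Link Names": (57, 66),
    "Skip Navigation Link": (68, 66),
    "Text Alternatives": (57, None),
    "Link Titles": (75, 66),
    "Headings and Sub-Headings": (39, 66),
    "Form Labels": (50, 50),
    "Identify Language of Text": (82, 100),
    "Validate (X)HTML Code": (68, 100),
    "Site Map": (57, 66),
    "Accessibility Information": (50, 83),
}


def compare(reliability):
    """Group checks by whether expert reliability is higher, lower or equal."""
    paired = {c: (n, e) for c, (n, e) in reliability.items() if e is not None}
    higher = [c for c, (n, e) in paired.items() if e > n]
    lower = [c for c, (n, e) in paired.items() if e < n]
    equal = [c for c, (n, e) in paired.items() if e == n]
    return higher, lower, equal


higher, lower, equal = compare(RELIABILITY)
print(f"Experts higher on {len(higher)} of {len(higher) + len(lower) + len(equal)} checks")
print("Experts lower on:", ", ".join(lower))
print("Equal on:", ", ".join(equal))
```

On these figures, experts are more reliable on 10 of the 13 checks that have both values, less reliable on Skip Navigation Link and Link Titles, and level on Form Labels.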

Results: Validity of Novice Evaluations

Validity of some novice checks was particularly low; reasons include:

Lack of thoroughness (Alonso et al, 2010)

Incomplete instructional information.

| Check | Validity (V) |
| --- | --- |
| 1. Images of Text | 14% |
| Colour Contrast | 46% |
| Moving Elements | 57% |
| Text Size | 82% |
| 2. Link Names | 0% |
| 3. Skip Navigation Link | 1% |
| Link Titles | 75% |
| Headings and Sub-Headings | 39% |
| Form Labels | 50% |
| Identify Language of Text | 92% |
| Validate (X)HTML Code | 68% |
| Site Map | 39% |
| Accessibility Information | 43% |
| Overall | 48% |

Expert Feedback: Viability and Usefulness

“Simple to understand and well structured. Could easily follow the steps based on the instructions provided.”

“It was easy and succinct. Found it pretty useful.”

“Much simpler (than WCAG 2.0) and more directed.”

“The information about why it (a check) is important and how to check it.”


“The evaluation tool would be very useful for someone with little accessibility experience. They would be able to evaluate a web page using the instructions and video provided.”

“I don’t think it could replace a WCAG 2.0 audit but it does have the benefit of being a quick way to evaluate a number of pages to provide indicators as to where problem areas are before conducting a more in-depth WCAG 2.0 audit once the top level issues have been fixed.”

“Works well if only Internet Explorer is used however in my testing I will use Firefox inspectors and assistive technology to verify issues.”


“Ability to grade issues as partially met\not met felt useful. Checkpoints seemed quite broad, allowing for some degree of flexibility when interpreting.”

“….judgement was still required as to how to classify a check. If one of the points was not met does that mean ‘part met’ or ‘not met’? How much common sense and judgement should be applied? However this is still much better than WCAG 2.0 where guidance at this level is a really big issue.”

Expert Feedback: Appropriateness and Specificity

“Some aspects not covered (colour\sensory reliance), heading interpretation is too strict. Not sure on coverage of link title attributes and how much of an impact adhering to this checkpoint would have in practical terms.”

“Requirement for a sitemap explicitly was good rather than the vaguer, WCAG 2.0 equivalent.”

Conclusion

Levels of reliability of novice evaluations have been consistent.

Reliability of expert evaluations was high; overall figure approaching 80%.

AEA is not an appropriate means of delivering the method to experts.

Improved coverage (notification of dynamic content).

Current approach useful for top level evaluations.

Tool needs redevelopment.

Trialled with different cohorts of novices.

Further trials of method with experts.

WCAG 2.0 integration.
