Department of Information and Computing

Measuring Usability
Datagathering & usability metrics
Universiteit Utrecht
dr. dr. Egon L. van den Broek
(with thanks to dr. H. Prüst)
“On Sunday afternoon, Matthieu is hooked up to the electrodes”
TROUW, 8 March 2010
Today
 Data gathering & analysis
 Usability metrics
• Performance metrics
• Self-reported metrics (user perception)
  • Scales for UX
• Behavioural and physiological metrics

Accompanying literature:
- Tullis & Albert (2008), chapters 4 and 6
- Sharp et al. (2011), chapters 7 and 8
Key issues that require attention for any data-gathering session to be successful:
 goal setting
 identifying participants (sampling)
 relationship with participants (clean & professional)
 pilot studies
 triangulation
triangulation
 triangulation: the investigation of a phenomenon
from (at least) two different perspectives (Jupp,
2009). Four types:
- Triangulation of data: data is drawn from different sources
at different times, in different places or from different
people (possibly by using a different sampling technique).
- Investigator triangulation: different researchers (observers,
interviewers etc) have been used to collect and interpret
the data.
- Triangulation of theories: the use of different theoretical frameworks through which to view the data or findings.
- Methodological triangulation means to employ different
data gathering techniques.
User centered design
Main data gathering techniques
General:
• Interviews
  - structured, unstructured, semi-structured
• Focus groups
• Questionnaires
• Observation
• Data recording
Usability specific:
• Inspections (expert evaluator)
• Predictive models (theoretical)
Observation
Direct observation in controlled environments
 Think aloud
Indirect observation: (tracking users’ activities)
 Diaries
 Interaction logging
 eye tracking, (web) analytics
Direct observation in the field
 Structuring frameworks
 Degree of participation (insider or outsider)
 Ethnography
Structuring frameworks to guide
observation
 - The person. Who?
- The place. Where?
- The thing. What?
 The Goetz and LeCompte (1984) framework:
- Who is present?
- What is their role?
- What is happening?
- When does the activity occur?
- Where is it happening?
- Why is it happening?
- How is the activity organized?
Ethnography
• Ethnography is a philosophy with a set of techniques that include participant observation and interviews
• Ethnographers immerse themselves in the culture that they study
• Co-operation of the people being observed is required
• A researcher’s degree of participation can vary along a scale from ‘outside’ to ‘inside’
• Analyzing video and data logs can be time-consuming; data analysis is continuous
• Interpretivist technique
• Collections of comments, incidents, and artifacts are made
Data recording
 notes, audio, video, photographs
 data logging
 time on task, efficiency
 measuring physiological
data
 heart rate, temperature,
eye tracking, facial expressions
Choosing between techniques
Usability criteria
Specific criteria with which the usability of a product can be determined, by measuring the performance of the user.
For example:
 time needed to carry out a task (efficiency)
 time needed to learn a task (learnability)
 number of errors made after a period has elapsed (memorability)
Testable, measurable, quantifiable: usability metrics
(Tullis & Albert, chapter 4 onwards)
Criteria for User experience
 “User experience refers to all aspects of someone’s
interaction with a product, application, or system”
(Tullis & Albert 2008)
 How many errors do users make in trying to log onto a
library system?
 How many users get frustrated trying to read the tiny serial number on the back of their new MP3 player when trying to register it?
 Behaviours and attitudes that can be measured to give insight into user experience
Usability Metrics
 Ways of measuring or evaluating the user’s experience
 Reveal something about the user’s experience
 What to measure?
 How to measure? When to measure? Data gathering
 Analysis and interpretation of data
 Tullis & Albert (2008) review three main types of
usability metrics:
 Performance metrics
 Self-reported metrics (user perception)
 Behavioural and physiological metrics
Measurements for Usability / User Experience
Usability metrics reveal something about the user’s
experience with the interaction between the user and the
product
 Performance
e.g. criteria that explicitly measure effectiveness,
efficiency, learnability
 Perceived experience
e.g. satisfaction, expectation, perceived ease of use, perceived usefulness, awareness, pleasure
 Physiology and behaviour
e.g. eye movement, neural activity, facial expressions, stress
Validity
 whether an instrument actually measures what it sets
out to measure
 Construct validity: the degree to which a measure relates to
other variables as expected within a system of theoretical
relationships
 Content validity: the degree to which a measure corresponds to
the content of the construct it was designed to cover
 Criterion validity: evidence that scores from an instrument
correspond with concurrent external measures conceptually
related to the measured construct
 Ecological validity: evidence that the results of a study can be
applied to real-world conditions
Reliability
 whether an instrument can be interpreted consistently across different situations
(see the course Wetenschappelijke Onderzoeksmethoden)
Performance metrics
 user behaviour in relation to the use of scenarios or tasks
 useful for estimating the magnitude of a specific usability issue
Performance metrics
 task success (measures effectiveness, efficiency, ...)
 time-on-task (measures efficiency, learnability, ...)
 steps-to-completion (measures efficiency, ...)
 efficiency (measures efficiency, ...)
 lostness (measures efficiency, ...)
 errors
Performance metrics: task success
 “How effectively are users able to complete a given set of tasks?”
 clear end-state?
 Find the price of the book used in the module Usability Engineering
versus
 Investigate how you could fill in the elective space of the bachelor programme Information Science
Task success
Levels of success:
• Complete success
  • without assistance
  • with assistance
• Partial success
  • without assistance
  • with assistance
• Failure
  • participant thought it was complete, but it wasn’t
  • participant gave up
Binary success (0-1)
Binary success rates in Excel or SPSS
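The spreadsheet computation the slide refers to can also be sketched directly in Python; the task names and 0/1 outcomes below are hypothetical:

```python
# Hypothetical binary task outcomes: 1 = success, 0 = failure
results = {
    "Task 1": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    "Task 2": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1],
}

# Success rate per task = proportion of participants who completed it
success_rates = {task: sum(outcomes) / len(outcomes)
                 for task, outcomes in results.items()}
```

Each rate is simply the mean of the 0/1 column, which is what Excel or SPSS computes as well.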
Confidence intervals
 A confidence interval is a range that estimates the true
population value for a statistic
 extremely valuable for any usability professional (Tullis
2008)
 For a given statistic calculated from a sample (e.g. the
mean), the confidence interval is a range of values
around that statistic that are believed to contain, with a
certain probability (e.g. 95%), the population value.
 http://www.amstat.org
NB: each individual binary success outcome follows a Bernoulli
distribution; hence the number of successes follows a binomial
distribution. This should be taken into account when computing the
confidence interval (use e.g. the Wald interval, or the Adjusted
Wald interval for n < 20).
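A minimal Python sketch of the Adjusted Wald (Agresti-Coull) interval; the 8-of-12 sample below is hypothetical:

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted Wald (Agresti-Coull) confidence interval for a proportion.

    Adds z^2/2 pseudo-successes and z^2 pseudo-trials before applying the
    ordinary Wald formula; recommended for small samples (e.g. n < 20).
    z = 1.96 gives a 95% interval."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical: 8 of 12 participants completed the task
low, high = adjusted_wald_ci(8, 12)
```

The observed proportion (8/12 ≈ 0.67) falls inside the interval, but the interval is wide, reflecting the small sample.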
Binary success rates and confidence intervals
12 participants versus 36 participants:
 given the same distribution of successes, more participants yield narrower confidence intervals
Binary success: difference between tasks
 To determine statistically significant differences
between the various tasks: perform t-test or
Analysis of Variance (ANOVA)
 Zie Wetenschappelijke Onderzoeksmethoden
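As a sketch of the kind of comparison the slide refers to, here is a pure-Python pooled-variance independent-samples t statistic; the binary outcomes are hypothetical, and in practice you would use a statistics package and look up the p-value (df = n1 + n2 − 2):

```python
import math
from statistics import mean, variance

def two_sample_t(a, b):
    """Independent-samples t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical binary success data (1 = success) for two tasks
task_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
task_b = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1]

t = two_sample_t(task_a, task_b)  # compare against a t distribution with df = 22
```

With more than two tasks, a one-way ANOVA plays the same role as the t-test.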
Analysis
Measuring efficiency
 “ the amount of effort that a user expends to
complete a task”
 two types of effort
 Cognitive effort – involves finding the right place to
perform an action (e.g. finding a link on a web page),
deciding what action is necessary (should I click on this
link?), and interpreting the results of the action
 Physical effort – involves the physical effort required to
take an action
 simple and compound measures
Measuring efficiency
 Simple metrics
 time-on-task
 steps-to-completion
(number of steps or actions to complete a task)
 Compound metrics
 efficiency
 lostness
Performance metrics: time-on-task
 How much time is required to complete a task?
 How to measure?
 All tasks? Only successful tasks?
• the faster a participant can complete a task, the better the user experience?
  • hotel / airplane ticket reservation: yes (?)
  • games?
Performance metrics: efficiency
 Compound efficiency metric:
- combination of task success and time-on-task
- typically measured per task; alternative = per participant

efficiency = task completion rate / mean time per task

or, alternatively,

efficiency = number of successfully completed tasks / total time spent
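Both definitions can be computed in a few lines; the per-participant results below are hypothetical:

```python
# Hypothetical per-participant results for one task: (success 0/1, time in seconds)
results = [(1, 42.0), (1, 55.0), (0, 90.0), (1, 38.0)]

completion_rate = sum(s for s, _ in results) / len(results)   # proportion of successes
mean_time = sum(t for _, t in results) / len(results)         # mean time per attempt

efficiency = completion_rate / mean_time          # first definition

successes = sum(s for s, _ in results)
total_time = sum(t for _, t in results)
efficiency_alt = successes / total_time           # alternative definition
```

Note that over the same raw data the two ratios reduce to the same number; they differ in practice when, for example, times are first averaged per task across tasks, or only successful attempts are timed.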
Performance metrics: lostness
 Compound efficiency metric (Smith, 1996)
 example
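The slides show Smith's formula as an image; as given in Tullis & Albert, lostness combines three counts from a navigation log. A sketch (the 6/10/4 example values are hypothetical):

```python
import math

def lostness(unique_pages, total_pages, minimum_pages):
    """Lostness L (Smith, 1996): 0 = perfectly efficient navigation;
    values above roughly 0.5 suggest the user was lost.

    unique_pages (N):  number of different pages visited
    total_pages (S):   total pages visited, counting revisits
    minimum_pages (R): minimum number of pages needed to complete the task
    """
    return math.sqrt((unique_pages / total_pages - 1) ** 2
                     + (minimum_pages / unique_pages - 1) ** 2)

# Hypothetical: 6 distinct pages, 10 visits in total, 4 pages strictly needed
L = lostness(6, 10, 4)
```

A user who visits exactly the minimum pages, each once, scores 0.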
Measuring learnability
 measure how performance changes over time
 (how any efficiency metric changes over time)
 how much time and effort is required to become
proficient using the product or application
 collecting data multiple times (trials)
 within-subjects design (zie Wetenschappelijke
Onderzoeksmethoden)
Performance metrics: time-on-task
• multiple trials for a single subject on the same task give a ‘learning curve’
Measuring errors & usability issues
 Usability issue: underlying cause of a problem
 Error: the possible outcome, i.e. the mistakes
made during a task
 Errors may be useful in pointing out particularly
confusing or misleading parts of an interface
 It is not evident what constitutes an error
 Therefore measuring errors is not always easy
Severity ratings of usability issues
A combination of:
• frequency
• impact
• persistence

0 = I don't agree that this is a usability problem at all
1 = Cosmetic problem only: need not be fixed unless extra time is available on the project
2 = Minor usability problem: fixing this should be given low priority
3 = Major usability problem: important to fix, so should be given high priority
4 = Usability catastrophe: imperative to fix this before the product can be released
(Jakob Nielsen)
Self-reported metrics
Self-reported metrics
 What to measure?
 How to measure?
 Gathering self-reported data
  • Single-item formats
  • Multiple-item formats: indexes and scales, general and usability-specific
  • Pre/post-task
  • Pre/post-test
 Analysing self-reported data
What to measure?
 Characteristics: age, date of birth, gender (male / female), occupation
 Attitudes: “Nuclear energy in the Netherlands should be abolished”
 Beliefs: “The number of accidents in nuclear power plants has increased in recent years”
 Behaviours: “Are you currently a member of Greenpeace?”
The information you are looking for
 Attitude: what people say they want
Should nuclear energy in the Netherlands be abolished?
 Yes
 No
What is your attitude towards the abolition of nuclear energy in the Netherlands?
 Strongly against
 Against
 Neither for nor against
 For
 Strongly in favour
Do you agree or disagree with the following statement:
‘Nuclear energy should be banned in the Netherlands’
 Agree
 Disagree
The information you are looking for
 Attitude
 Beliefs: what people think is true
The number of accidents with nuclear power plants has increased over the past ten years.
 True
 False
To what extent does nuclear energy contribute to the energy supply in the Netherlands?
 To a very limited extent
 To a limited extent
 To a considerable extent
 To a very considerable extent
Do you think the abolition of nuclear energy will lead to problems in the energy supply?
 Yes
 No
The information you are looking for
 Attitude
 Beliefs
 Behaviour: what people say they do
Have you ever taken part in a demonstration against nuclear energy?
 Yes
 No
Are you currently a member of Greenpeace?
 Yes
 No
Do you think you will ever take part in demonstrations against nuclear energy in the future?
 No
 Probably not
 Probably
 Yes
The information you are looking for
 Attitude
 Beliefs
 Behaviour
 Characteristics: who people say they are
Are you male or female?
 Male
 Female
What is your current age?
_________ years
What is your year of birth?
19________
What is your highest completed education?
 Primary education
 Secondary education (HAVO/VWO)
 Secondary vocational education (VMBO/MBO)
 Higher education (HBO/WO)

Single-item formats measure a single property, characteristic, etc.: one dimension
Single-item formats
 e.g. the well-known ‘Likert scale’*:
I think that I would like to use this system frequently:
___ Strongly disagree
___ Disagree
___ Neither agree nor disagree
___ Agree
___ Strongly agree
[photo: Rensis Likert]
* in fact a misnomer: it’s not a scale but a well-known question format
Guidelines single-items formats
 Avoid "acquiescence bias": people are more likely to agree with a
statement than to disagree with it (Cronbach, 1946)
 You need to balance positively-phrased statements (such as "I found this
interface easy to use") with negative ones (such as "I found this interface
difficult to navigate").
 Use 5-9 levels in a rating
 You gain no additional information by having more than 10 levels
 Include a neutral point in the middle of the scale
 Otherwise you lose information by forcing some participants to take sides
 Avoid numbers; if you do use them, use positive integers
 1-7 instead of -3 to +3
(participants are less likely to go below 0 than they are to use 1-3)
 Use word labels for at least the end points
 It is hard to create labels for every point beyond 5 levels
 Having labels on the end points only also makes the data more “interval-like”
Measuring concepts / Complex constructs
Multidimensional indexes / scales
Suppose you want to measure …
 innovativeness of an organisation
 complexity of a task
 willingness of people to participate in social relations
 presence in a virtual environment
 engagement in a game
 credibility of a company
 maturity of different aspects of a company
  - strategic
  - e-business maturity
  - information technology
 perceived ease of use of a system
Multidimensional measurement: indexes & scales
 Both are composite or multidimensional measures
 An index simply accumulates the scores assigned to individual indicators
 A scale is composed of several items that have a logical or empirical structure among them
 Both indexes and scales are ordinal measures
[diagram: the construct ‘engagement’]
Multidimensional measurement: indexes & scales
 a set of items to measure a construct
 some items don’t ‘fit’ in the construct (check reliability: Cronbach’s alpha)
[diagram: items 1-6 mapping onto a construct]
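The reliability check mentioned above can be sketched in a few lines of pure Python; the item scores are hypothetical, and in practice you would use a statistics package:

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).

    items: one list of scores per questionnaire item, all of the same length."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

# Hypothetical 1-5 ratings from five respondents on three items
alpha = cronbach_alpha([[1, 2, 3, 4, 5],
                        [2, 2, 3, 4, 4],
                        [1, 3, 3, 4, 5]])
```

A common rule of thumb treats alpha ≥ 0.7 as acceptable internal consistency; items that lower alpha noticeably are candidates for removal.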
Index construction
 an index of political activism
(yes/no answers; summarised in a single score)
• Wrote a letter to a public official
• Gave money to a political candidate
• Signed a political petition
• Gave money to a political cause
• Wrote a political letter to the editor
• Persuaded someone to change her or his plans
Bogardus Social Distance Scale
 determines willingness of people to participate in social relations – of various degrees of closeness – with other kinds of people
• Are you willing to let sex offenders live in your country?
• Are you willing to let sex offenders live in your community?
• Are you willing to let sex offenders live in your neighborhood?
• Are you willing to let a sex offender live next door to you?
• Would you let your child marry a sex offender?
Is a cumulative (Guttman) scale
Guttman scale
 Do you feel a woman should have the right to an abortion if she is not married?
 Do you feel that a woman should have the right to an abortion when her pregnancy was the result of a rape?
 Do you feel a woman should have the right to an abortion if continuing her pregnancy would seriously threaten her life?

Woman’s health is seriously endangered: 89%
Pregnant as a result of rape: 81%
Woman is not married: 39%
Guttman scale
(source: The Basics of Social Research – Earl Babbie)
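One common way to check how well a set of items forms a cumulative Guttman scale is the coefficient of reproducibility. The sketch below counts, for each respondent, the minimum number of answers that would have to change to make the pattern perfectly cumulative; the response patterns are hypothetical:

```python
def reproducibility(patterns):
    """Coefficient of reproducibility for a Guttman scale.

    patterns: one list of 0/1 answers per respondent, ordered from the
    easiest item to the hardest. An 'error' is the minimum number of
    answers that must change to make a pattern perfectly cumulative.
    CR >= 0.9 is conventionally taken to indicate a scalable item set."""
    k = len(patterns[0])
    ideal = lambda s: [1] * s + [0] * (k - s)  # perfect pattern for scale score s
    errors = sum(min(sum(a != b for a, b in zip(p, ideal(s)))
                     for s in range(k + 1))
                 for p in patterns)
    return 1 - errors / (len(patterns) * k)

# Hypothetical respondents; the last pattern deviates from the cumulative ideal
patterns = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 0]]
cr = reproducibility(patterns)
```

A perfectly cumulative data set yields CR = 1.0.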
Models with validated usability scales: TAM
 Technology Acceptance Model (Davis, 1989)
 Perceived Usefulness: the degree to which a system is believed to enhance a person’s job
 Perceived Ease of Use: the degree to which the use of a system is believed to be free from effort

Validated usability scales: perceived usefulness (TAM)

Validated usability scales: perceived ease of use (TAM)
Models with validated usability scales: UTAUT
 Unified Theory of Acceptance and Use of Technology (Venkatesh et al., 2003)

Validated usability scales: performance expectancy, effort expectancy (UTAUT)
Other validated usability scales: presence
 user’s subjective sensation of “being there”
 “a perceptual illusion of non-mediation” (Lombard
& Ditton, 1997)
A Cross-Media Presence Questionnaire: The ITC-Sense of Presence
Inventory (Lessiter et al. 2001)
Game-based learning
Disaster training for ambulance staff
Code Red Triage (van der Spek, 2010)
presence / engagement
(Code Red: Triage Or COgnition-based DEsign Rules Enhancing Decision-making TRaining In A Game Environment, Erik D. van der Spek, Pieter Wouters and Herre van Oostendorp, British Journal of Educational Technology, 2010)
Validated usability scales: System Usability Scale (SUS)
 John Brooke – Digital Equipment Corporation, 1986
 “A quick and dirty usability scale”

SUS
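SUS scoring follows Brooke's published rules; the ten ratings below are hypothetical:

```python
def sus_score(responses):
    """SUS score (0-100) from ten 1-5 ratings, per Brooke's scoring rules:
    odd-numbered (positively worded) items contribute (rating - 1),
    even-numbered (negatively worded) items contribute (5 - rating);
    the sum of contributions is multiplied by 2.5."""
    assert len(responses) == 10
    total = 0
    for i, rating in enumerate(responses, start=1):
        total += (rating - 1) if i % 2 == 1 else (5 - rating)
    return total * 2.5

# Hypothetical ratings from one participant
score = sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1])  # -> 85.0
```

The alternation of positive and negative items is exactly the acquiescence-bias countermeasure from the single-item guidelines earlier.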
Self-reported data:
• pre/post-task
• pre/post-test session
Measuring expectations: pre- and post-task ratings
 Before the task:
“How easy or difficult do you expect this task to be?”
Very easy  O O O O O O O  Very difficult
 After the task:
“How easy or difficult was this task to do?”
Very easy  O O O O O O O  Very difficult
Ratings Can Help Prioritize Work
[scatterplot: average expectation rating per task (x-axis, 1 = difficult … 7 = easy) against average experience rating (y-axis), divided into four quadrants: “Don’t Touch It” (expected easy, experienced easy), “Promote It” (expected difficult, experienced easy), “Fix It Fast” (expected easy, experienced difficult), “Big Opportunity” (expected difficult, experienced difficult)]
Pre/post-task ratings versus pre/post-session ratings
 task-level data: helps identify areas that need improvement
(quick ratings immediately after each task help pinpoint tasks and interface parts that are particularly problematic)
 session-level data: helps to get a sense of overall usability
(effective overall evaluation after each participant has had a chance to interact with the product more fully)
Post-session ratings: examples
• System Usability Scale (SUS) – 10 ratings
• Usefulness, Satisfaction, and Ease of use (USE)
• Questionnaire for User Interface Satisfaction (QUIS)* – 71 (long form) or 26 (short form) ratings
• Software Usability Measurement Inventory (SUMI)* – 50 ratings
• After Scenario Questionnaire (ASQ) – three ratings
• Post-Study System Usability Questionnaire (PSSUQ) – 19 ratings; the electronic version is called the Computer System Usability Questionnaire (CSUQ)
• Website Analysis and MeasureMent Inventory (WAMMI)* – 20 ratings of website usability
* requires a license
Physiological metrics
Physiological and behavioural metrics
 Verbal behaviours
 Comments
 Questions
 Utterance of confusion / frustration
 Nonverbal behaviours
 Facial expressions
 Eye behaviour
 Skin conductance
 Heart rate
 Blood flow
 Temperature
 Sleep / wake
Measuring physiological signals: observation
Usability Test Observation Coding Form
Date: ____  Start time: ____  End time: ____  Participant ID: ____  Task #: ____

Verbal behaviors (tally, with notes):
- Strongly positive comment
- Other positive comment
- Strongly negative comment
- Other negative comment
- Suggestion for improvement
- Question
- Variation from expectation
- Stated confusion
- Stated frustration
- Other

Non-verbal behaviors (tally, with notes):
- Frowning / grimacing / unhappy
- Smiling / laughing / happy
- Surprised / unexpected
- Furrowed brow / concentration
- Evidence of impatience
- Leaning in close to screen
- Variation from expectation
- Fidgeting in chair
- Random mouse movement
- Groaning / deep sigh
- Rubbing head / eyes / neck
- Other

Task completion status:
- Complete: fully complete / complete with assistance / partial completion
- Incomplete: participant gave up / task “called” by moderator / thought complete, but not
Notes: ____
Measuring physiological signals: equipment
Facial expressions
 Video-based systems
 Electromyogram sensors
Eye tracking (measuring attention)
Are People Drawn to Faces on Webpages? – T. Tullis, M. Siegel & M. Sun, CHI 2009, Boston, Massachusetts, USA.
Faces draw attention to them on webpages.
Study 1: users are clearly drawn to faces when asked to look at pages and report what they remember.
Eye tracking and task performance
Are People Drawn to Faces on Webpages? – T. Tullis, M. Siegel & M. Sun, CHI 2009, Boston, Massachusetts, USA.
Study 2:
• a Portfolio Summary page was modified to contain either a photo of a woman’s face or no image
• tasks had answers that could be found by reading information on the page
Eye tracking and task performance
Study 2: contrary to expectation, a picture of a face in this context actually caused users to do worse on a task involving information adjacent to the face.
Thermal imaging (measuring stress)
Thermal imaging of the face
 StressCam: a small thermal imaging camera
(StressCam: Non-contact Measurement of Users’ Emotional States through Thermal Imaging, C. Puri et al., CHI 2005)

• User stress is correlated with increased blood flow in the frontal vessel of the forehead; this increased blood flow dissipates convective heat.
Polysomnography & Actigraphy (measuring sleep)
 Polysomnography
 Actigraphy
 Sleep diary

Comparing ‘subjective’ and ‘objective’ data
[matrix: subjective measure (sleep diary, rated 1 = very bad … 7 = very well) against objective measure (polysomnography); agreeing cells marked “in accordance”, disagreeing cells marked “???”]
Combining metrics: an example
 Emily B. Falk, Elliot T. Berkman, and Matthew D. Lieberman, “From Neural Responses to Population Behavior: Neural Focus Group Predicts Population-Level Media Effects”, Psychological Science (2012); see UBU
 heavy smokers with the intention to quit
 brain activations were recorded while smokers viewed three different television campaigns (A, B, C) promoting the National Cancer Institute’s telephone hotline to help smokers quit (1-800-QUIT-NOW)
 self-report predictions of the campaigns’ relative effectiveness
 population measures of the success of each campaign
Self-report scale of ad effectiveness
(Falk, E. B. et al., Psychological Science, 2012; 0956797611434964. Copyright © by Association for Psychological Science.)
Fig. 1. Illustration of the medial prefrontal cortex (MPFC) region of interest (ROI) and three measures of the effectiveness of the antismoking ad campaigns promoting the National Cancer Institute’s Smoking Quitline.
In summary
 Usability metrics reveal something about the user experience with the interaction between the user and the product:
• Performance metrics
  e.g. task success, time on task, errors, efficiency → statistics
• Self-reported metrics / user perception metrics
  e.g. satisfaction, expectation, perceived ease of use, perceived usefulness, awareness, pleasure → scales
• Behavioural and physiological metrics
  e.g. eye movement, facial expressions, stress