Measuring Usability
Data gathering & usability metrics
Universiteit Utrecht
dr. dr. Egon L. van den Broek (with thanks to dr. H. Prüst)

“On Sunday afternoon Matthieu is hooked up to the electrodes” (TROUW, 8/3 2010)

Today
• Data gathering & analysis
• Usability metrics
  - Performance metrics
  - Self-reported metrics (user perception)
  - Scales for UX
  - Behavioral and physiological metrics
Accompanying literature:
• Tullis & Albert (2008), chapters 4 and 6
• Sharp et al. (2011), chapters 7 and 8

Key issues that require attention for any data gathering session to be successful:
• goal setting
• identifying participants (sampling)
• relationship with participants (clean & professional)
• pilot studies
• triangulation

Triangulation
Triangulation: the investigation of a phenomenon from (at least) two different perspectives (Jupp, 2009). Four types:
• Triangulation of data: data is drawn from different sources at different times, in different places or from different people (possibly by using a different sampling technique).
• Investigator triangulation: different researchers (observers, interviewers, etc.) are used to collect and interpret the data.
• Triangulation of theories: different theoretical frameworks are used to view the data or findings.
• Methodological triangulation: different data gathering techniques are employed.

User centered design

Main data gathering techniques
General:
• Interviews
  - Structured, unstructured, semi-structured
  - Focus groups
• Questionnaires
• Observation
• Data recording
Usability specific:
• Inspections (expert evaluator)
• Predictive models (theoretical)

Observation
• Direct observation in controlled environments
  - Think aloud
• Indirect observation (tracking users’ activities)
  - Diaries
  - Interaction logging: eye tracking, (web) analytics
• Direct observation in the field
  - Structuring frameworks
  - Degree of participation (insider or outsider)
  - Ethnography

Structuring frameworks to guide observation
• The person. Who?
• The place. Where?
• The thing. What?
The Goetz and LeCompte (1984) framework:
• Who is present?
• What is their role?
• What is happening?
• When does the activity occur?
• Where is it happening?
• Why is it happening?
• How is the activity organized?

Ethnography
• Ethnography is a philosophy with a set of techniques that include participant observation and interviews
• Ethnographers immerse themselves in the culture that they study
• Co-operation of the people being observed is required
• A researcher’s degree of participation can vary along a scale from ‘outside’ to ‘inside’
• Analyzing video and data logs can be time-consuming
• Data analysis is continuous
• Interpretivist technique
• Collections of comments, incidents, and artifacts are made

Data recording
• Notes, audio, video, photographs
• Data logging: time on task, efficiency
• Measuring physiological data: heart rate, temperature, eye tracking, facial expressions

Choosing between techniques

Usability criteria
Specific criteria by which the usability of a product can be determined, by measuring the performance of the user. For example:
• time needed to carry out a task (efficiency)
• time needed to learn a task (learnability)
• number of errors made after a period of time has elapsed (memorability)
Testable, measurable, quantifiable: usability metrics (Tullis & Albert, chapter 4 onwards)

Criteria for User experience
“User experience refers to all aspects of someone’s interaction with a product, application, or system” (Tullis & Albert 2008)
• How many errors do users make in trying to log onto a library system?
• How many users get frustrated trying to read the tiny serial number on the back of their new MP3 player while trying to register it?
Behaviours and attitudes that can be measured to give insight into user experience

Usability Metrics
Ways of measuring or evaluating the user’s experience; they reveal something about the user’s experience
• What to measure?
• How to measure?
• When to measure?
• Data gathering
• Analysis and interpretation of data
Tullis & Albert (2008) review three main types of usability metrics:
• Performance metrics
• Self-reported metrics (user perception)
• Behavioural and physiological metrics

Measurements for Usability / User Experience
Usability metrics reveal something about the user’s experience with the interaction between the user and the product
• Performance, e.g. criteria that explicitly measure effectiveness, efficiency, learnability
• Perceived experience, e.g. satisfaction, expectation, perceived ease of use, perceived usefulness, awareness, pleasure
• Physiology and behaviour, e.g. eye movement, neural activity, facial expressions, stress

Validity
Whether an instrument actually measures what it sets out to measure
• Construct validity: the degree to which a measure relates to other variables as expected within a system of theoretical relationships
• Content validity: the degree to which a measure corresponds to the content of the construct it was designed to cover
• Criterion validity: evidence that scores from an instrument correspond with concurrent external measures conceptually related to the measured construct
• Ecological validity: evidence that the results of a study can be applied to real-world conditions

Reliability
Whether an instrument can be interpreted consistently across different situations (see Wetenschappelijke Onderzoeksmethoden)

Performance metrics
• User behaviour in relation to the use of scenarios or tasks
• Useful to estimate the magnitude of a specific usability issue

Performance metrics
• task-success (measures effectiveness, efficiency, ...)
• time-on-task (measures efficiency, learnability, ...)
• steps-to-completion (measures efficiency, ...)
• efficiency (measures efficiency, ...)
• lostness (measures efficiency, ...)
• errors

Performance metrics: task-success
“How effectively are users able to complete a given set of tasks?”
• Clear end-state?
  - “Find the price of the book used in the module Usability Engineering” versus “Work out how you can fill in the elective space (profileringsruimte) of the bachelor's programme Informatiekunde”

Task success
Levels of success:
• Complete success
  - Without assistance
  - With assistance
• Partial success
  - Without assistance
  - With assistance
• Failure
  - Participant thought it was complete, but it wasn’t
  - Participant gave up
Binary success (0-1)

Binary success rates in Excel or SPSS

Confidence intervals
• A confidence interval is a range that estimates the true population value for a statistic
• Extremely valuable for any usability professional (Tullis 2008)
• For a given statistic calculated from a sample (e.g. the mean), the confidence interval is a range of values around that statistic that is believed to contain, with a certain probability (e.g. 95%), the population value. http://www.amstat.org
NB: each binary success outcome follows a Bernoulli distribution; hence the number of successes follows a binomial distribution. This should be taken into account when computing the confidence interval for the success proportion (use e.g. Wald, or Adjusted Wald for n < 20).

Binary success rates and confidence intervals
• 12 participants versus 36 participants: given an equal distribution, more participants yield smaller confidence intervals

Binary success: difference between tasks
To determine statistically significant differences between the various tasks: perform a t-test or Analysis of Variance (ANOVA) (see Wetenschappelijke Onderzoeksmethoden)

Analysis

Measuring efficiency
“The amount of effort that a user expends to complete a task”
Two types of effort:
• Cognitive effort: involves finding the right place to perform an action (e.g.
finding a link on a web page), deciding what action is necessary (should I click on this link?), and interpreting the results of the action
• Physical effort: involves the physical effort required to take an action
Simple and compound measures

Measuring efficiency
Simple metrics:
• time-on-task
• steps-to-completion (number of steps or actions to complete a task)
Compound metrics:
• efficiency
• lostness

Performance metrics: time-on-task
How much time is required to complete a task?
• How to measure? All tasks? Only successful tasks?
• The faster a participant can complete a task, the better the user experience?
  - Hotel / airplane ticket reservation: yes (?)
  - Games?

Performance metrics: efficiency
Compound efficiency metric:
• combination of task success and time-on-task
• typically measured per task; alternative: per participant
\[ \text{efficiency} = \frac{\text{task completion rate}}{\text{mean time per task}} \]
or, alternatively,
\[ \text{efficiency} = \frac{\text{number of successfully completed tasks}}{\text{total time spent}} \]

Performance metrics: lostness
Compound efficiency metric (Smith 1996)

Performance metrics: lostness example

Measuring learnability
• Measure how performance changes over time (how any efficiency metric changes over time)
• How much time and effort is required to become proficient in using the product or application
• Collecting data multiple times (trials)
• Within-subjects design (see Wetenschappelijke Onderzoeksmethoden)

Performance metrics: time-on-task
• Multiple trials for a single subject (same task) give a ‘learning curve’

Measuring errors & usability issues
• Usability issue: underlying cause of a problem
• Error: the possible outcome, i.e.
the mistakes made during a task
• Errors may be useful in pointing out particularly confusing or misleading parts of an interface
• It is not evident what constitutes an error; therefore measuring errors is not always easy

Severity ratings of usability issues
A combination of:
• frequency
• impact
• persistence
0 = I don't agree that this is a usability problem at all
1 = Cosmetic problem only: need not be fixed unless extra time is available on the project
2 = Minor usability problem: fixing this should be given low priority
3 = Major usability problem: important to fix, so should be given high priority
4 = Usability catastrophe: imperative to fix this before the product can be released
(Jakob Nielsen)

Self-reported metrics
• What to measure?
• How to measure? Gathering self-reported data
  - Single-item formats
  - Multiple-item formats: indexes and scales (general and usability-specific)
  - Pre/Post-task
  - Pre/Post-test
• Analysing self-reported data

What to measure?
• Characteristics: age, date of birth, gender (male / female), occupation
• Attitudes: “Nuclear energy in the Netherlands must be abolished”
• Beliefs: “The number of accidents in nuclear power plants has increased in recent years”
• Behaviours: “Are you currently a member of Greenpeace?”

The information you are looking for: attitude
What people say they want
• Should nuclear energy in the Netherlands be abolished?
  - Yes / No
• What is your attitude towards the abolition of nuclear energy in the Netherlands?
  - Strongly against / Against / Neither for nor against / For / Strongly for
• Do you agree or disagree with the following statement: ‘Nuclear energy should be banned in the Netherlands’
  - Agree / Disagree

The information you are looking for: beliefs
What people think is true
• The number of accidents involving nuclear power plants has increased over the past ten years.
  - True / False
• To what extent does nuclear energy contribute to the energy supply in the Netherlands?
  - To a very limited extent / To a limited extent / To a significant extent / To a very significant extent
• Do you think that abolishing nuclear energy will lead to problems in the energy supply?
  - Yes / No

The information you are looking for: behaviour
What people say they do
• Have you ever taken part in a demonstration against nuclear energy?
  - Yes / No
• Are you currently a member of Greenpeace?
  - Yes / No
• Do you think you will ever take part in demonstrations against nuclear energy in the future?
  - No / Probably not / Probably / Yes

The information you are looking for: characteristics
Who people say they are
• Are you a man or a woman?
  - Man / Woman
• What is your current age? _________ years
• What is your year of birth? 19________
• What is your highest level of completed education?
  - Primary education / Secondary education (HAVO/VWO) / Secondary vocational education (VMBO/MBO) / Higher education (HBO/WO)

Single-item formats
Measure a single property, characteristic, etc. (1 dimension)
E.g. the well-known ‘Likert scale’ * (named after Rensis Likert):
I think that I would like to use this system frequently:
___ Strongly Disagree
___ Disagree
___ Neither agree nor disagree
___ Agree
___ Strongly Agree
* in fact a misnomer: it’s not a scale but a well-known question format

Guidelines for single-item formats
• Avoid "acquiescence bias": people are more likely to agree with a statement than to disagree with it (Cronbach, 1946)
  - You need to balance positively-phrased statements (such as "I found this interface easy to use") with negative ones (such as "I found this interface difficult to navigate").
• Use 5-9 levels in a rating
  - You gain no additional information by having more than 10 levels
• Include a neutral point in the middle of the scale
  - Otherwise you lose information by forcing some participants to take sides
• Don’t use numbers, but if you do: use positive integers 1-7 instead of -3 to +3
  - (Participants are less likely to go below 0 than they are to use 1-3)
• Use word labels for at least the end points.
  - Hard to create labels for every point beyond 5 levels
  - Having labels on the end points only also makes the data more “interval-like”

Measuring concepts / Complex constructs
Multidimensional indexes / scales

Suppose you want to measure …
• innovativeness of an organisation
• complexity of a task
• willingness of people to participate in social relations
• presence in a virtual environment
• engagement in a game
• credibility of a company
• maturity of different aspects of a company (strategic, e-business maturity, information technology)
• perceived ease of use of a system

Multidimensional measurement: indexes & scales
• Both are composite or multidimensional measures
• An index simply accumulates scores assigned to individual indicators
• A scale is composed of several items that have a logical or empirical structure among them
• Both indexes and scales are ordinal measures
[Figure: a construct, e.g. engagement]

Multidimensional measurement: indexes & scales
• A set of items to measure a construct
• Some items don’t ‘fit’ in the construct (check reliability; Cronbach’s alpha)
[Figure: Items 1 to 6 arranged around a construct]

Index construction
An index of political activism (yes/no answers; summarised in a single score)
• Wrote a letter to a public official
• Gave money to a political candidate
• Signed a political petition
• Gave money to a political cause
• Wrote a political letter to the editor
• Persuaded someone to change her or his plans

Bogardus Social Distance Scale
Determines the willingness of people to participate in social relations, of various degrees of closeness, with other kinds of people
• Are you willing to let sex offenders live in your country?
• Are you willing to let sex offenders live in your community?
• Are you willing to let sex offenders live in your neighborhood?
• Are you willing to let a sex offender live next door to you?
• Would you let your child marry a sex offender?
• Is a cumulative (Guttman) scale

Guttman scale
• Do you feel a woman should have the right to an abortion if she is not married?
• Do you feel that a woman should have the right to an abortion when her pregnancy was the result of a rape?
• Do you feel a woman should have the right to an abortion if continuing her pregnancy would seriously threaten her life?

Situation                                Support
Woman’s health is seriously endangered   89%
Pregnant as a result of rape             81%
Woman is not married                     39%

Guttman scale
Source: The basics of Social Research, Earl Babbie

Models with validated usability scales: TAM
Technology Acceptance Model (Davis, 1989)
• Perceived Usefulness: the degree to which a system is believed to enhance a person’s job
• Perceived Ease of Use: the degree to which the use of a system is believed to be free from effort

Validated usability scales: perceived usefulness (TAM)

Validated usability scales: perceived ease of use (TAM)

Models with validated usability scales: UTAUT
Unified Theory of Acceptance and Use of Technology (Venkatesh et al., 2003)

Validated usability scales: performance expectancy, effort expectancy (UTAUT)

Other validated usability scales: presence
• The user’s subjective sensation of “being there”
• “A perceptual illusion of non-mediation” (Lombard & Ditton, 1997)
• A Cross-Media Presence Questionnaire: The ITC-Sense of Presence Inventory (Lessiter et al. 2001)

Game-based learning
• Disaster training for ambulance personnel
• Code Red Triage (van der Spek, 2010)
• Presence / engagement
Code Red: Triage Or COgnition-based DEsign Rules Enhancing Decisionmaking TRaining In A Game Environment, Erik D.
van der Spek, Pieter Wouters and Herre van Oostendorp, in British Journal of Educational Technology (2010)

Validated usability scales: System Usability Scale (SUS)
John Brooke, Digital Equipment Corporation, 1986
“A quick and dirty usability scale”

SUS

Self-reported data:
• Pre/Post-task
• Pre/Post-test session

Measuring expectations: Pre- and Post-Task Ratings
Before the task: How easy or difficult do you expect this task to be?
Very easy  o  o  o  o  o  o  o  Very difficult
After the task: How easy or difficult was this task to do?
Very easy  o  o  o  o  o  o  o  Very difficult

Ratings Can Help Prioritize Work
[Figure: Average Expectation and Experience Ratings by Task. Average experience rating (y-axis) against average expectation rating (x-axis), both from 1 = difficult to 7 = easy, with four regions labelled “Promote It”, “Big Opportunity”, “Don’t Touch It” and “Fix it Fast”]

Pre/post-task ratings versus pre/post-session ratings
• Task-level data: help identify areas that need improvement (quick ratings immediately after each task help pinpoint tasks and interface parts that are particularly problematic)
• Session-level data: help to get a sense of overall usability (effective overall evaluation after each participant has had a chance to interact with the product more fully)

Post-session ratings: examples
• System Usability Scale (SUS): 10 ratings
• Usefulness, Satisfaction, and Ease of use (USE)
• Questionnaire for User-Interface Satisfaction (QUIS) *: 71 (long form), 26 (short form) ratings
• Software Usability Measurement Inventory (SUMI) *: 50 ratings
• After Scenario Questionnaire (ASQ): three ratings
• Post Study System Usability Questionnaire (PSSUQ): 19 ratings.
  - Electronic version called the Computer System Usability Questionnaire (CSUQ)
• Website Analysis and MeasureMent Inventory (WAMMI) *: 20 ratings of website usability
* requires a license

Physiological metrics

Physiological and behavioural metrics
Verbal behaviours:
• Comments
• Questions
• Utterance of confusion / frustration
Nonverbal behaviours:
• Facial expressions
• Eye behaviour
• Skin conductance
• Heart rate
• Blood flow
• Temperature
• Sleep / wake

Measuring physiological signals: observation
Usability Test Observation Coding Form
Date:             Participant ID:
Start Time:       End Time:
Task #:

Verbal Behaviors (with notes):
• Strongly positive comment
• Other positive comment
• Strongly negative comment
• Other negative comment
• Suggestion for improvement
• Question
• Variation from expectation
• Stated confusion
• Stated frustration
• Other:

Non-verbal Behaviors (with notes):
• Frowning/Grimacing/Unhappy
• Smiling/Laughing/Happy
• Surprised/Unexpected
• Furrowed brow/Concentration
• Evidence of Impatience
• Leaning in close to screen
• Variation from expectation
• Fidgeting in chair
• Random mouse movement
• Groaning/Deep sigh
• Rubbing head/eyes/neck
• Other:

Task Completion Status (with notes):
• Incomplete:
  - Participant gave up
  - Task “called” by moderator
  - Thought complete, but not
• Complete:
  - Fully complete
  - Complete with assistance
  - Partial completion

Measuring physiological signals: equipment

Facial expressions
• Video-based systems
• Electromyogram sensors

Eye tracking (measuring attention)
Are People Drawn to Faces on Webpages? T. Tullis, M. Siegel & M. Sun, in: CHI 2009, Boston, Massachusetts, USA.
• Faces draw attention to them on webpages
• Study 1: users are clearly drawn to faces when asked to look at pages and report what they remember

Eye tracking and task-performance
Study 2:
• A Portfolio Summary page was modified to contain either a photo of a woman’s face or no image
• Tasks had answers that could be found by reading information on the page

Eye tracking and task-performance
Study 2: Contrary to expectation, a picture of a face in this context actually caused users to do worse on a task involving information adjacent to the face.

Thermal Imaging (measuring stress)
• Thermal imaging of the face
• StressCam: a small thermal imaging camera
StressCam: Non-contact Measurement of Users’ Emotional States through Thermal Imaging, C. Puri et al., CHI 2005

Thermal Imaging (measuring stress)
• User stress is correlated with increased blood flow in the frontal vessel of the forehead. This increased blood flow dissipates convective heat.

Polysomnography & Actigraphy (measuring sleep)
• Polysomnography
• Actigraphy
• Sleeping diary

Comparing ‘subjective’ and ‘objective’ data
[Figure: 2×2 grid of the objective measure (polysomnography) against the subjective measure (sleeping diary; 1 = very bad, …, 7 = very well); two cells are labelled “In accordance” and two are labelled “???”]

Combining metrics: an example
Emily B. Falk, Elliot T. Berkman, and Matthew D. Lieberman. From Neural Responses to Population Behavior: Neural Focus Group Predicts Population-Level Media Effects. In: Psychological Science 2012; 0(2012), see UBU
• Heavy smokers with the intention to quit
• Brain activations were recorded while smokers viewed three different television campaigns (A, B, C) promoting the National Cancer Institute’s telephone hotline to help smokers quit (1-800-QUIT-NOW)
• Self-report predictions of the campaigns’ relative effectiveness
• Population measures of the success of each campaign

Self-report scale of Ad effectiveness
Falk E B et al. Psychological Science 2012;0956797611434964
Copyright © by Association for Psychological Science
Fig. 1.
Illustration of the medial prefrontal cortex (MPFC) region of interest (ROI) and three measures of the effectiveness of the antismoking ad campaigns promoting the National Cancer Institute’s Smoking Quitline.

In summary
Usability metrics reveal something about the user experience with the interaction between the user and the product
• Performance metrics, e.g. task success, time on task, errors, efficiency
  - -> statistics
• Self-reported metrics / user perception metrics, e.g. satisfaction, expectation, perceived ease of use, perceived usefulness, awareness, pleasure
  - -> scales
• Behavioural and physiological metrics, e.g. eye movement, facial expressions, stress
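
The performance metrics summarised above lend themselves to a small worked example. The Python sketch below is illustrative only and is not part of the original slides. It computes an adjusted Wald confidence interval for a binary success rate, the second efficiency ratio from the efficiency slide (successfully completed tasks divided by total time spent), and lostness. The function names and the example numbers are hypothetical. The lostness formula is the one commonly attributed to Smith (1996); the slides name the metric but the formula itself did not survive, so treat it as an assumption.

```python
import math


def adjusted_wald_interval(successes, n, z=1.96):
    """Adjusted Wald interval for a binary success rate.

    Assumption: z = 1.96 for a 95% interval. The adjustment adds z^2/2
    successes and z^2 trials before applying the ordinary Wald formula.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


def efficiency(completed_tasks, total_time):
    """Number of successfully completed tasks per unit of time spent."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return completed_tasks / total_time


def lostness(unique_pages, total_pages, minimum_pages):
    """Lostness L = sqrt((N/S - 1)^2 + (R/N - 1)^2).

    Assumed definitions: N = different pages visited, S = total pages
    visited (revisits counted), R = minimum pages needed for the task.
    L is 0 for a perfectly direct path and grows as the user wanders.
    """
    if unique_pages <= 0 or total_pages <= 0:
        raise ValueError("page counts must be positive")
    return math.sqrt((unique_pages / total_pages - 1) ** 2
                     + (minimum_pages / unique_pages - 1) ** 2)


if __name__ == "__main__":
    # Hypothetical data: the same 75% success rate at two sample sizes.
    for successes, n in [(9, 12), (27, 36)]:
        low, high = adjusted_wald_interval(successes, n)
        print(f"{successes}/{n} successes: 95% CI {low:.2f} to {high:.2f}")
    print("efficiency:", efficiency(8, 40.0), "tasks per minute")
    print("lostness:", round(lostness(6, 10, 4), 2))
```

Running the example with the same success proportion for 12 and 36 participants shows the narrower interval for the larger sample, which is the point made on the confidence interval slides.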