
Hi, Habr! This is Natalia Sprogis from the Mail.Ru Group UX lab. Today I will talk about planning and preparing one particular kind of research: usability testing. The article is intended primarily for inexperienced researchers and for those who are about to conduct usability testing for the first time.
A test plan is, on the one hand, the set of tasks, questions and questionnaires you give each respondent, and on the other, the methodological basis of the research: the metrics and hypotheses you test and record, and the toolkit you choose. The first part of the article is devoted to the methodological issues underlying any plan.
Do you really need testing?

To begin with, you must be sure that the project needs usability testing at this stage. So clarify the real reason the project team is coming to you. Usability testing is not all-powerful, and at the start you need to understand what such a study can actually bring to the product. Prepare the project team right away for which questions you can answer and which you cannot. We have had cases where we either offered the customer a different method (for example, in-depth interviews or a diary study would work better) or even recommended abandoning the study altogether and running a split test instead.
For example, in qualitative research we never undertake to check the "attractiveness" of a function or design variant. We can collect user feedback, but the risk is too great that social desirability will bias the responses: people tend to say they would use things they never actually will. And the small sample size does not let you trust such answers. We once had a bad experience testing gaming landing pages: the landing page chosen as the most attractive in the test performed much worse in A/B testing.
There are also a number of limitations when testing prototypes and concepts. When planning, you should understand what you can really "squeeze" out of such a test. It is great when the project has the opportunity to test prototypes or designs before implementation. However, the less detailed and functional the prototype, the higher the level of abstraction for the respondent, and the less data the test can potentially yield. Prototype testing works best for problems of naming and icon metaphors, that is, all questions of clarity. Whether you can check anything beyond that depends heavily on the nature of the project and how detailed the prototype is.
Basis for a usability test script
Planning a test does not begin with writing the task texts, but with a detailed study of the goals and research questions together with the project team. The main inputs for preparation are:
- Important scenarios. These are the user scenarios (or tasks, or use cases) that affect the business or relate to the purpose of the testing. Even if the team suspects problems in specific places, it is often worth checking the main cases anyway. The following scenarios may be considered important for the test:
- the most frequent (for example, sending a message in the messenger);
- those influencing business goals (for example, working with a payment form);
- those related to an update (areas affected by a redesign or by newly introduced functionality).
- Known issues. Often the research must explain the causes of the service's business problems. For example, the producer is worried about a large outflow of players after the first hour of play. Sometimes the problem areas of the interface are already known to the team, and you need to collect the details and specifics. For example, the support service often gets questions about the payment form.
- Questions. The team may also have its own research questions. For example, do users notice the banner advertising additional services, or is the name of a specific section clear.
- Hypotheses. This is what known problems and team questions translate into. It is good when the customer comes to you with ready-made hypotheses. For example: "Our clients pay only from their phone, with a commission. Perhaps users do not see the option of a more profitable payment method." If there are no hypotheses, only an abstract desire to check the project "for usability", then formulating these hypotheses is your task.
Think together with the project team about the places where users behave differently than expected (if such information is available). Find out whether there are design elements the team argued about that may turn out to be problematic. Also conduct your own audit of the product, looking for potential user difficulties that are important to check in the test. All this will help you compile the list of elements (tasks, questions, checks) that should go into the final scenario.
Data collection method

It is important for you to think about how you will collect data on what is happening during the test for later analysis. The following options are traditionally used:
- Observation. During the tasks, the respondent is left alone with the product and behaves as he sees fit. The respondent's comments are collected through questionnaires and a conversation with the moderator after the test. This is the "purest" method: it preserves the most natural behavior and allows a number of metrics to be measured correctly (for example, task completion time). However, a lot of useful qualitative data stays behind the scenes. Seeing the respondent behave in a certain way, you cannot tell why he acts like that. You can, of course, ask at the end of the test, but the respondent will most likely only remember the last task well. Moreover, his opinion of the system may change while he works through the tasks, and you will only capture the final picture, not the first impressions.
- Think aloud. For a long time this was the most common method in usability testing; Jakob Nielsen once called it the number one tool for assessing usability. The essence is that you ask the respondent to voice all the thoughts he has while working with the interface and to comment on all his actions. It sounds like this: "Now I'm going to add this product to the cart. And where is the button? Oh, here it is. Oh, I forgot to check what color it was." The method helps you understand why the user behaves one way or another and what emotions the current interaction evokes. It is cheap and simple; even an inexperienced researcher can handle it. However, it has drawbacks. First, it is not natural for people to "think aloud" all the time: they will often fall silent, and you will have to keep reminding them to continue talking. Second, tasks take somewhat longer with this method than in real life. In addition, some respondents start using the product more deliberately. Voicing the reasons for their actions, they try to act more rationally, simply because they do not want to look like idiots, so you may fail to catch some intuitive behavior.
- Active moderator intervention. This method is ideal for testing concepts and prototypes. Here the moderator actively interacts with the user during the tasks, probing the reasons for his behavior at the right moments and asking clarifying questions. In some cases the moderator may even give unplanned tasks arising from the dialogue. This method lets you collect the maximum amount of qualitative data. However, it can only be used if you trust the professionalism of your moderator: poorly worded or badly timed questions can strongly affect the respondent's behavior and impressions, and can even invalidate the test results. Also, with this method practically no metrics can be measured.
- Retrospective think aloud (RTA). This is a combination of the first two methods. The user first performs all the tasks without intervention; then a video of his session is played back to him, and he comments on his behavior and answers the moderator's questions. The main disadvantage of the method is a large increase in testing time. However, there are situations where it is optimal. For example, we once faced the task of testing several types of mobs (game monsters) in an RPG. Naturally, we could neither distract respondents with questions nor make them comment on their actions during battle: that would make it impossible to play a game where concentration is needed to win. On the other hand, after a series of fights the user could hardly recall whether he had noticed that the first rat's axe had caught fire. So in this test we used RTA: with each user we reviewed their battles and discussed which monster effects they noticed and how they understood them.
It is up to you to decide which method suits you best. My advice, though: think about how to get enough data while keeping the respondent's behavior as natural as possible. Despite the simplicity and versatility of think aloud, which was long the most popular method in usability testing, we increasingly try to replace it with observation. If the moderator sees interesting behavior, he waits for the respondent to complete the task and asks the question afterwards: right after the task, the respondent is more likely to remember why he acted that way. An eye tracker helps a lot here. Seeing the current focus of the respondent's attention, you can understand his behavior much better without asking unnecessary questions. In general, an eye tracker significantly improves the quality of moderation, and this role, in my opinion, is no less important than the ability to build heatmaps.
Metrics

Metrics are quantitative usability indicators. Testing always yields a set of problems found in the interface; metrics let you understand how good or bad things are overall, and let you compare against another project or a previous design version.
What metrics exist
We all remember, of course, that according to ISO 9241-11 the main characteristics of usability are effectiveness, efficiency and satisfaction. Different metrics may be relevant for different projects, but all of them are tied, one way or another, to these three characteristics. Here are the most commonly used indicators:
- Task success. You can use a binary score: completed or did not complete the task. We more often follow Nielsen's approach and distinguish three levels of success:
- coped with the task with virtually no problems - 100%;
- faced problems, but completed the task on his own - 50%;
- did not cope with the task - 0%.
That is, if out of 12 respondents 4 coped with the task easily, 6 with problems, and 2 failed, the average success rate for this task is 58% (the first sketch after this list shows the arithmetic). Sometimes respondents with very different degrees of "problematicness" fall into the middle group: one respondent may have struggled with every field of the form while another only made a small mistake at the very end. You can assign a score at your own discretion, depending on what happened in the test, say 25% if the respondent barely got started on the task, or 80% if he made only a minor error. However, to avoid excessive subjectivity, think the rating scale through in advance rather than deciding for each respondent after the test. You should also decide what to do with errors. Suppose you set the task of buying movie tickets on the Kino Mail.Ru project. One respondent accidentally bought a ticket for today instead of tomorrow and did not notice. He is sure he completed the task and really does have a ticket in hand. But his error is so critical (he will not get into the movie) that I would score it 0% even though the ticket was purchased. The success rate is a very simple and clear metric, and I recommend it whenever your tasks have clear goals. A glance at the chart of success rates by task quickly shows where the most problematic parts of the interface are.
- Task completion time. This metric is meaningful only in comparison. How do you know whether 30 seconds for a task is good or bad? But the fact that the time has decreased compared to the previous design version is already good, as is the fact that registration on your project takes less time than on a competitor's. There are interfaces where reducing task time is critical, for example, the working interface of a call-center employee. However, this metric does not apply to every task. Take selecting goods in an online store: users must quickly find the filters and other interface elements related to product search, but the selection process itself will take each of them a different amount of time, and that is completely normal. Women choosing shoes are ready to look through 20 pages of results, and that does not necessarily mean there were no suitable products on the first pages or that they did not see the filters. Often they simply want to see all the options.
- Problem frequency. Any usability testing report contains a list of problems respondents encountered. How many respondents ran into a problem is an indicator of that problem's frequency in the test. This metric can only be used if your users performed exactly the same tasks. If the test included variations, or tasks that were not formulated in advance but built from an interview, frequency becomes harder to calculate: you have to count not only those who encountered the problem but also estimate how many respondents could have encountered it (performed a similar task, visited the same section). Nevertheless, it is a very useful figure for the team, helping to decide which problems to fix first (the logging sketch in the data-recording section below shows one way to count it).
- Subjective satisfaction. This is the user's subjective assessment of how convenient and comfortable the system is to work with. It is measured with questionnaires that respondents fill in during or after testing. There are standard questionnaires, for example the System Usability Scale (SUS), the Post-Study System Usability Questionnaire (PSSUQ) or the Game Experience Questionnaire (GEQ) for games, or you can create your own. More on assessing satisfaction in the second part of the article; the second sketch after this list shows how SUS is scored.
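To make the success arithmetic concrete, here is a minimal sketch of the three-level scoring. The 12-respondent breakdown comes from the example above; the function name and everything else is just illustration.

```python
# A minimal sketch of three-level task-success scoring.
# Scores: 1.0 = completed with virtually no problems,
#         0.5 = completed with problems,
#         0.0 = failed (including unnoticed critical errors).

def average_success(scores):
    """Mean task-success rate across respondents, in percent."""
    return 100 * sum(scores) / len(scores)

# The example above: of 12 respondents, 4 coped easily,
# 6 coped with problems, and 2 failed.
scores = [1.0] * 4 + [0.5] * 6 + [0.0] * 2
print(f"Task success: {average_success(scores):.0f}%")  # Task success: 58%
```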
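And here is a short sketch of standard SUS scoring, assuming the usual ten statements rated on a 1 to 5 scale; the example answers are made up.

```python
# Standard SUS scoring: ten statements rated 1-5, where odd-numbered
# items are positively worded and even-numbered items negatively worded.

def sus_score(answers):
    """Convert ten 1-5 answers into a 0-100 SUS score."""
    if len(answers) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = sum((a - 1) if i % 2 == 1 else (5 - a)
                for i, a in enumerate(answers, start=1))
    return total * 2.5

# Hypothetical respondent with fairly positive answers.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 3]))  # 80.0
```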
These are far from the only possible metrics. Here, for example, is a list of 10 UX metrics that Jeff Sauro highlights. And your product may call for metrics of its own: from what level respondents understand the rules of the game, how many mistakes they make when filling out long forms, and so on.
Remember also that deciding to use many metrics imposes a number of constraints on the test. Respondents must act as naturally as possible and be placed in identical conditions. Therefore it is good to ensure:
- Identical starting points. The same task must start from the same point in the interface for every respondent. For example, you can ask respondents to return to the main page after each task.
- No interventions. Any communication with the moderator can distort performance metrics (the moderator may involuntarily prompt the respondent) and increases task completion time.
- Task order. To compensate for the learning effect in comparative testing, be sure to vary the order in which respondents encounter the compared products: let half start with your project and half with the competitor's (see the sketch after this list).
- Success criteria. Decide in advance exactly what behavior counts as success for each task. Is it acceptable, for example, if the respondent did not use the filters when selecting goods in the online store?
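A trivial sketch of the counterbalancing mentioned in the task-order item, with made-up respondent IDs and product labels:

```python
import itertools

# Alternate which product each respondent starts with, so that learning
# on the first product does not systematically favor the second one.
respondents = ["R1", "R2", "R3", "R4", "R5", "R6"]  # made-up IDs
orders = itertools.cycle([("our product", "competitor"),
                          ("competitor", "our product")])
for r, order in zip(respondents, orders):
    print(f"{r}: {' -> '.join(order)}")
```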
Working with metrics
When using metrics, remember that classic usability testing is a qualitative study, and the metrics you collect are primarily illustrative. They give an overall view of the different scenarios in the product and reveal the pain points, for example, that account settings cause more difficulty than registration. They can show the dynamics of change if you measure them regularly. That is, metrics make it possible to see that a task has become faster in the new design. Such relative comparisons are far more indicative and reliable than the absolute values of the metrics.
Jeff Sauro, an expert on statistics in UX research, advises not to settle for mean values but to always compute confidence intervals. This is much more correct, especially when respondents' results vary widely. You can use his free online calculators for task success and for task completion time (a sketch of the idea follows below). Statistical processing is also indispensable when comparing results.
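For illustration, a minimal sketch of the adjusted Wald (Agresti-Coull) interval, which Sauro recommends for small-sample completion rates; the 10-of-12 figures are invented for the example.

```python
import math

def adjusted_wald(successes, n, z=1.96):
    """95% confidence interval for a task-completion rate
    (adjusted Wald / Agresti-Coull)."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical result: 10 of 12 respondents completed the task.
low, high = adjusted_wald(10, 12)
print(f"{low:.0%} to {high:.0%}")  # roughly 54% to 97%
```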
When metrics are needed
Not every usability testing report contains metrics: collecting and analyzing them takes time and imposes a number of restrictions on the test methodology. Here is when they are really needed:
- To prove. Often, especially in large companies, the need for product changes must be proven. Numbers are clear, understandable and familiar to decision makers. So when you show that 10 out of 12 respondents could not pay for the goods, or that registration takes on average twice as long as on competing systems, the research results carry more weight.
- To compare. If you compare your product with others on the market, you cannot do without metrics either. Otherwise you will see the advantages and disadvantages of the different projects but will not be able to assess where your product stands among them.
- To track changes. Metrics are good for regular tests of the same product after changes are made. They let you see progress after a redesign and draw attention to the places left without improvement. Again, you can use these figures as evidence for management, showing the payoff of investing in redesign, or simply to confirm that you have achieved results and are moving in the right direction.
- To illustrate and focus. Numbers help illustrate important points. Sometimes, even when we do not use metrics in every task, we calculate them for the most vivid and important moments of the test.
Still, we do not use metrics in every test. You can do without them when the researcher works closely with the project team, there is mutual trust, and the team is mature enough to prioritize problem fixes correctly.
How to record the data

It would seem, what is wrong with a notebook and a pen, or just an open Word document? In the modern Agile world of development, UX researchers should deliver their observations to the team as quickly as possible. To cut analysis time, prepare a template in advance for entering your notes during the test. We tried specialized software (for example, Noldus Observer or Morae Manager), but in practice plain tables turned out to be the most flexible and versatile. Lay out in the table, in advance, the questions you plan to ask, places to record problems found in each task, and the hypotheses (for each respondent you will mark whether they were confirmed). Our tables look like this:
| | Respondent 1 | Respondent 2 | Respondent 3 | Respondent 4 |
|---|---|---|---|---|
| Task 1 | | | | |
| Did you notice feature A? | | | | |
| Where did you look for option B? | | | | |
| Problems and observations on the task | | | | |
| ... | | | | |
You can also use:
- Usability Test Data Logger by Userfocus. A customizable Excel template for logging observations per respondent, with a built-in timer for measuring task time and automatically generated time and success charts.
- Rainbow Spreadsheet by Tomer Sharon of Google. A visual table for collaboration between the researcher and the team. The link leads to an article describing the method, which in turn links to a Google Sheets template.
With experience, most notes can be made right during the test. If you do not have time, it is better to write down everything you remember immediately after the test: if you come back to the analysis a few days later, you will most likely have to re-watch the video and spend far more time.
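If you additionally keep the log in a flat, machine-readable form, the problem-frequency metric from the metrics section falls out almost for free. A hypothetical sketch, with made-up respondents and problems:

```python
from collections import defaultdict

# One row per observed problem: (respondent, task, problem).
log = [
    ("R1", "Task 1", "did not notice the filters"),
    ("R2", "Task 1", "did not notice the filters"),
    ("R2", "Task 2", "confused by the price field"),
    ("R3", "Task 1", "did not notice the filters"),
]

# Count distinct respondents per problem (the frequency metric).
seen = defaultdict(set)
for respondent, task, problem in log:
    seen[problem].add(respondent)

for problem, who in sorted(seen.items(), key=lambda kv: -len(kv[1])):
    print(f"{problem}: {len(who)} respondent(s)")
```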
Preparation for testing
In addition to the method, metrics and the test protocol itself, you need to decide on the following:
- Format of communication with the moderator. The moderator can sit in the same room as the participant, in which case it is easy for him to ask questions at the right moment. However, the moderator's presence may affect the respondent, who may start asking the moderator questions, provoking explicit or implicit hints. When possible, we try to leave the respondent alone with the product for at least part of the test; his behavior becomes more relaxed and natural. And to avoid running back and forth if something goes wrong, you can leave a messenger with an audio connection open so the moderator can contact the respondent from the observation room.
- How tasks are given. Tasks can be read aloud by the moderator, but then, despite a single test protocol, the task text may come out slightly different each time, especially if several moderators run the test. Even small differences in wording can put respondents in different starting conditions. To avoid this, either train the moderators to always read the task texts verbatim, or hand the tasks to respondents on paper or on screen. Differences in wording are not a problem if you use a flexible scenario in which tasks are formulated during the test based on the moderator's interview with the respondent.
It can also be interesting to deliver tasks through the product itself. For example, when testing ICQ, respondents received tasks through a chat window with the moderator, and when testing Mail.Ru, tasks arrived as emails. This way of giving tasks was as natural as possible for these products, and it also let us repeatedly exercise the basic messaging scenarios.
- Creating a natural context. Even in laboratory research, think about how to bring product use in the test closer to real conditions. For example, if you are testing mobile devices, how will respondents hold them? For a good video image it is better when the phone or tablet is fixed on a stand or lying on the table, but then you cannot tell whether all zones are within reach and comfortable to tap: after all, phones are often held in one hand, and people use tablets lying on the sofa. Also think about the environment in which the product will be used: are there distractions, is it noisy, is the internet connection good? All of this can be imitated in the lab.
- Test plan for the customer. This is also an important stage of preparation, since it involves the project team. You do not have to share every methodological detail of the test with the customer (how you will communicate with the respondent, record the data, and so on), but be sure to show what the tasks will be and what you are going to check with each of them. Perhaps you missed some feature of the project, or perhaps the project team will come up with additional ideas and hypotheses. We usually end up with a table like this:
| Task text | What we check |
|---|---|
| "Remember what home appliances you recently bought. What were the selection criteria? Let's try to use this site to find something for you by the same criteria." | Finding the right category; visibility of the filters; any difficulties with the price filter; whether the filters are sufficient; etc. |
- Report plan. Naturally, the report is written from the study results. But it is very good practice to draft a report plan before the tests, based on the goals and objectives of the study. With such a plan in front of you, you can check your scenario for completeness and prepare the most convenient forms for recording data for later analysis. Or perhaps you will decide you do not need a formal report at all, just a shared file of observations for you and the team. And if you motivate the team to fill it in together with you, that is really great.
Conclusion
Of course, you can simply give the product to a friend to use and watch what difficulties he runs into. But a well-written scenario will keep you from missing important problems and from accidentally pushing the respondent toward the answers you want. After all, usability testing is a simplified experiment, and preliminary preparation matters for any experiment. In the next part of the article, I will talk about drawing up the test protocol: where to start the test, what questions to ask the respondent, how to formulate tasks, and how to collect final impressions.