Re-call and Re-cognition in Episode Re-retrieval: A User Study on News Re-finding a Fortnight Later

Shuya Ochiai, Makoto P. Kato, and Katsumi Tanaka
Kyoto University, Yoshida Honmachi, Sakyo, Kyoto, Japan 6068501
{ochiai,kato,tanaka}@dl.kuis.kyoto-u.ac.jp

ABSTRACT
This study investigates recall and recognition in a news refinding task where participants were asked to read news articles and then to search for the same articles a fortnight later. Recall, which is a task to express what a person remembers, corresponds to query formulation, while recognition, which is a task to judge whether a presented item has been shown before, corresponds to a user’s relevance judgment on search results in a refinding task. Our four main contributions can be summarized as follows: (i) we developed a method to investigate the effects of memory loss on episode refinding tasks on a large scale; (ii) our user study revealed a large drop in search performance in the refinding task after a fortnight and several differences between search queries input immediately after news browsing and ones input at a later time; (iii) we found that asking questions and expanding input queries on the basis of the answers significantly improved the search performance in the news refinding task; and (iv) the users’ recognition abilities were different from their recall abilities, e.g. object names in a news story could be correctly recognized even though they were rarely recalled. Our findings support several findings in cognitive psychology from the viewpoint of information refinding and also have several implications for search algorithms for assisting user refinding.

Categories and Subject Descriptors
H.3.3. [Information Search and Retrieval]: Information Search and Retrieval

Keywords
refinding; news search; recall and recognition

1. INTRODUCTION
People browse various types of episodes on the Web, including news articles, stories, and the experiences of others as written in blog posts.
At a later time, they try to refind these episodes for a wide range of purposes such as information sharing and citations. However, refinding is made difficult by oblivescence: if users forget the details of the episode for which they are looking, they may fail to formulate appropriate Web search queries and be unable to identify documents that contain the episode in search engine results pages.

CIKM’14, November 3–7, 2014, Shanghai, China. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2598-1/14/11. http://dx.doi.org/10.1145/2661829.2661920

For example, imagine Connie incidentally read a news story about Takeru Kobayashi, a Japanese competitive eater. The news said that he is active in Nathan’s hot dog eating contest, an annual American competitive eating event, and that he had gotten into trouble with the organizers and had been arrested once. A fortnight later, she wanted to share this news with her friend, but she could not remember enough details to create effective Web search queries. Connie’s search queries contained ambiguous and/or incorrect keywords such as “asian”, “false accusation”, and “purged”. Thus, her search results were noisy and it was difficult to locate the correct news story as she browsed.
Indeed, even though she actually scanned a snippet of the news she was looking for, which was ranked fifth in her third search session, she ended up being unaware of it because she lacked confidence in her memory of the details. As seen in Connie’s example, the success of refinding is highly reliant on the individual’s powers of memory. Her failures can be characterized by two processes to retrieve a memory: recall and recognition. Recall is a task to express what a person remembers. In Connie’s search, recall is remembering the episode about Takeru Kobayashi and inputting the keywords that she thought were included in the episode. Recognition is a task to judge whether a presented item has been shown before. Connie’s recognition failed in the example above, since she could not correctly judge which search result was the one she had browsed before. The memory of episodes such as Connie’s is usually distinguished from other types of memories and is called episodic memory [39]. Episodic memory, which is our main focus in this paper, has been recently studied in detail, and several characteristics have been clarified in cognitive psychology research. Tulving distinguished episodic memory from semantic memory in terms of information, operations, and applications and stated that episodic memory is a spatiotemporal-oriented memory of experiences [39]. We investigated recall and recognition in the context of a news refinding task after a fortnight. The three main research questions we addressed are: 1) How do users formulate a search query with an ambiguous, unconfident memory? 2) Is asking a certain kind of question helpful for user recall? and 3) What kind of information can users recognize the most accurately? A task-based user study was administered to answer these questions. We recruited 381 participants online, asked them to read news articles, and then to search for the same articles a fortnight later. 
They were also asked to recall the episodes with the assistance of some questions and to recognize facts included in the episode they had seen.

Our four main contributions can be summarized as follows: (i) we developed a method to investigate the effects of memory loss on episode refinding tasks on a large scale; (ii) our user study revealed a large drop in search performance in the refinding task after a fortnight and several differences between search queries input immediately after news browsing and ones input at a later time; (iii) we found that asking questions and expanding input queries on the basis of the answers significantly improved the search performance in the news refinding task; and (iv) the users’ recognition abilities were different from their recall abilities, e.g. object names in a news story could be correctly recognized even though they were rarely recalled. Our findings support several findings in cognitive psychology from the viewpoint of information refinding and also have several implications for search algorithms for assisting user refinding.

The rest of this paper is organized as follows. Section 2 summarizes the previous work on memory and refinding. Section 3 discusses problems caused by oblivescence in refinding and proposes methods to help users with episode refinding. Section 4 describes the task-based user study we used to investigate the trend of recall and recognition in episode refinding, and Section 5 reports the findings from this study. Section 6 discusses implications of this study, and Section 7 concludes the paper.

2. RELATED WORK
We survey the previous work on memory in cognitive psychology and on information access technologies that aim to support users’ refinding.

2.1 Principles of Memory
We introduce several principles of memory, following Surprenant and Neath’s work [34]; these principles motivate our study and are referred to in our discussions.
The encoding specificity principle proposed by Tulving and Thomson emphasizes the dependency between encoded content and the cues that are used to retrieve memories, based on their finding that strong associates of an item (e.g. bloom to flower) are not always effective for retrieving the memory of the item [40]. The principle implies that no single type of cue is effective for retrieving every kind of memory in every situation. On the basis of this principle, some information access research has been conducted to explore effective cues for various situations (e.g. [17]).

The reconstruction principle claims that people use whatever information is available to reconstruct or build up a coherent memory of an episode. While this process can help users accurately remember the past in some circumstances, the reconstruction can cause distortion or loss of memories in others [13]. False memory has been studied by using the Deese-Roediger-McDermott (DRM) paradigm [10, 32], where participants are shown a list of words composed of the strongest associates of a nonpresented word, and are later asked to recall or recognize words in the list. On the subsequent recall and recognition tests, participants tend to recall the nonpresented word. Some studies support the assumption that false memory follows some rules and shows a bias toward the most likely event. Loftus and Palmer found, through recall tests on a film in response to different questions, that questions asked subsequent to an episode changed the memory of the episode [25]. Cann, McRae, and Katz conducted a series of experiments to explore the most important word features that cause false memory by using a knowledge-type taxonomy [5]. We observed many kinds of false memory in participants’ queries and answers to given questions in our study, some of which can be explained by the previous findings mentioned above.
The relative distinctiveness principle enables us to estimate the performance on memory tasks based on the distinctiveness of an item relative to its alternatives. One of the well-known effects is the Von Restorff effect [20], an effect of visual distinctiveness on recall and recognition performance. This effect has been further investigated and observed for conceptually distinctive items [33], and in recall tests for young [18] and older people [3]. We utilize this principle to predict what users remember several days later, and try to elicit it from users, especially in the question-assisted recall tests.

The characteristics of recall and recognition abilities as well as their relationship have been studied from several aspects. The word frequency effect refers to the trend in which high-frequency words are likely to be recalled, while low-frequency words are likely to be recognized [15]. Although people are generally able to recognize information better than they recall it, it was demonstrated that recognizable information is not always recallable and vice versa [41]. Tulving and Wiseman reported that, in many cases, even an item recalled by participants with a cue was not recognized without any cue. Thus, we simultaneously assess the recall and recognition abilities to understand the two different processes of information refinding, i.e. query formulation and relevance judgment, which might show different characteristics according to Tulving and Wiseman’s work.

The effects of aging on memory ability have been widely studied in the literature. A series of experiments conducted by Park et al. revealed that short-term and long-term memory abilities gradually decrease as a function of age [28]. People in their late middle age demonstrated a better ability in recognition tests than in recall tests, and their recognition performance was not low when compared to that of young people [8].
Names or proper nouns are thought to be one of the most difficult information types to recall, especially for people in their late middle age [6]. These effects of aging were also observed in our user study, where older people failed to refind information more often than the other age groups.

2.2 Memory and Information Access
One of the most relevant works to this paper was conducted by Teevan [35], in which recall and recognition of search results were investigated through questionnaire-based user studies. Participants were first exposed to a search engine result page (SERP) in response to a self-generated query, and were later asked to recall the title, snippet, and URL of each search result in the list. Teevan reported that the top-ranked search result and the last search result clicked were recalled with relatively high accuracy, which was attributed to the primacy effect and the recency effect [26]. She further tested the recognition ability of participants by asking them to judge whether a new SERP was the same as or different from the SERP that they had browsed. The result showed that no less than 31% of the participants answered “different” to a presented SERP that was identical to the one that they had seen. Although our study also addresses the problem of recall and recognition in a refinding task, our focus includes query formulation for refinding, and the recognition ability for different information types rather than for search results, with the aim of generating recognizable search results. Our recall tests are also different from Teevan’s in that we asked participants to recall a news story with the assistance of questions.

Teevan et al. also investigated search engine queries and found that 33% of the queries were repeated by the same user [36]. Although the repeated queries were probably used for refinding, information refinding is not always initiated by the same query used before.
One of our research questions asks how queries for refinding change after several days have passed.

Elsweiler, Baillie, and Ruthven investigated what kinds of e-mail attributes people remember in e-mail refinding [12]. In their study, participants performed several tasks and were asked if they believed the information needed to solve the task was stored within their collection, and what they were able to remember about the information. They reported that several factors affected what attributes were remembered: the elapsed time since the e-mail was accessed, the experience, the number of e-mails in participants’ collection, and the strategy of filing e-mails. While their research focused mainly on whether participants recalled a certain type of e-mail attribute, we focus more on the content of news articles that participants remember and that is effective for refinding.

Obendorf et al. [27] conducted a log-based study on revisitation behaviors, and found that strategies used by participants were highly reliant on how long ago they had browsed the information for which they looked. The most frequent strategies were the use of the back button for short-term (less than an hour), direct access via URL-entry or bookmark selection for middle-term (between one hour and a day), and use of hyperlinks for long-term.

Horvitz, Dumais, and Koch proposed statistical models that estimate the probability that users consider events to be memory landmarks [17]. They constructed models based on Bayesian networks utilizing event properties such as the event duration, subject, and location.

Many tools for refinding have been proposed, especially for personal information management. Ringel et al. proposed timeline visualization with temporal landmarks for guiding search over personal collections [31]. Dumais et al. [11] developed a system called Stuff I’ve Seen, which enables users to search for their personal information with the support of contextual information such as time, author, thumbnails and previews. Recent work in this line proposed an associative browsing tool, which in response to a given query presents a ranked list of associated items based on several features including textual and temporal similarity [23].

3. CHALLENGES IN EPISODE REFINDING WITH MEMORY LOSS
In this section, we discuss several problems in episode refinding tasks with an ambiguous, unconfident memory. We also propose methods to help users find the episode they read a long time ago, and explain the concrete research questions we address in this paper. The three subsections that follow correspond to research questions 1)-3) in Section 1.

3.1 Query Processing for Fighting Amnesia
The first problem users may encounter when lacking a detailed memory is formulating an appropriate query for refinding. As the previous work has suggested, the queries of such users are likely to include more general words rather than specific ones [15], which might generate far more search results than expected and prevent users from identifying what they are looking for. Moreover, queries input with an ambiguous memory may include keywords that are not included in the episode the users are trying to retrieve. For example, a coordinate term, which shares a hypernym with another term, can be used in place of a noun in an episode. Examples we observed in our study include cases where “Syria” was replaced with “Ukraine” and where “uncle” was replaced with “father”. Another possible lapse of memory is that a proper noun can be replaced with its hypernym: e.g. “Michael Jackson” can be replaced with “musician” in a query.
These phenomena are suggested by the DRM paradigm [10, 32], where people tend to recall a word that is not presented when they are shown a list of words composed of the strongest associates of the recalled word. In this paper, we focus on investigating the following characteristics of queries: the part-of-speech distribution (i.e. the number of common nouns, proper nouns, verbs, and adjectives in a query), the frequency of word substitution types (i.e. the relation between a term and its substitute), and the relation between these query characteristics and search performance. We then propose some query processing techniques based on findings regarding these characteristics.

Table 1: Recall questions.
Type        No.  Question
Character   I    Who/What is the main character of this news?
Character   II   Please explain the characteristics of the person/object.
Character   III  Please describe something else you can remember.
Action      I    What happened in this news?
Action      II   Why did the event happen?
Action      III  Please describe something else you can remember.
Impression  I    How did you feel by reading this news?
Impression  II   Why did you feel so?
Impression  III  Please describe something else you can remember.
Baseline         Please describe something else you can remember.

3.2 Reviving Memory of Users
It would be a challenging task for search engines to utilize only ambiguous, incorrect keywords and to rank the desired search result at the top without any assistance from the user. Thus, it is necessary to interact with users to elicit more information about the episode they are looking for, or to have them input additional keywords about the episode by reviving their memory. While general solutions to this problem include relevance feedback and query suggestion, here we employ a simple alternative to elicit and revive user memory that is specialized for episode refinding tasks: we ask the users free-answer questions and expand the original query by adding keywords in their answers.
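To make the query expansion idea concrete, the following is a minimal sketch under stated assumptions, not the system used in this study. It assumes English text split on non-alphanumeric characters (the queries and answers in the study were written in Japanese and would require a morphological analyzer), and the names `STOPWORDS`, `tokenize`, and `expand_query` are hypothetical and introduced only for illustration.

```python
import re

# Hypothetical stopword list for illustration; a real system would use a
# language-appropriate list (the study itself was conducted in Japanese).
STOPWORDS = {"a", "an", "and", "at", "he", "i", "in", "is", "it",
             "of", "on", "or", "she", "the", "to", "was"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def expand_query(query, answers):
    """Append terms from free-text answers to the original query terms.

    Original query terms come first; answer terms follow in first-seen
    order, and duplicates are dropped.
    """
    expanded = tokenize(query)
    seen = set(expanded)
    for answer in answers:
        for term in tokenize(answer):
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded


if __name__ == "__main__":
    answers = ["He was a competitive eater",
               "He was arrested at a hot dog contest"]
    print(expand_query("asian eating contest", answers))
```

In this sketch the answer terms are simply appended with equal weight; how expanded terms should be weighted relative to the original query is left open here.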
This idea was inspired by Kelly et al.’s work on a document-independent term source for query expansion [22]. Their feedback form included a few simple questions such as “Can you describe what you already know about the topic?” and “Why do you want to know about this topic?”. Their experiments demonstrated significant improvement to search performance when the input query and answers to the questions were used together as a query. Unlike their experiments, where the feedback form was filled out right after a search topic had been shown, we address the question of whether it is possible to improve the search performance by asking a few questions and adding answers to the input query even at a later time.

We also try to clarify what types of questions are the most effective for reviving user memory in episode refinding tasks. To this end, we devised three free-answer question types: character, action, and impression. These types are called recall question types in this paper. Each type of recall question employed in our user study, shown in Table 1, was designed to be applicable to any type of episode. We also used a baseline recall question, “Please describe something else you can remember”, for comparison with the recall question types we designed. The character recall question type includes questions about the name and characteristics of main characters in the episode, while the action recall question type includes questions about the events happening in the episode. The former was used because it is said that conceptually distinctive objects (i.e. the main person/object in an episode) are easy to remember in recall tasks (known as the Von Restorff effect) [20, 33], while the latter was devised because it has been shown that the memory for the textbase (semantic) lasts longer than that for the surface (lexical and syntactic) [24]. The impression recall question type includes questions that try to elicit users’ impressions when they read the episode.
This type of recall question is probably effective for reviving user memories because mood-state-dependent retention implies that people recall an episode better if they reinstate the original emotion they experienced during episode acquisition [4].

Table 2: Recognition types.
Type          Definition
Attribute     Attributes of the episode or of objects in the episode, e.g. time, address, and characteristics.
Object name   Names of objects in the episode, e.g. person, country, and company names.
Major action  Actions mainly described in the episode, e.g. perpetration and incident.
Minor action  Actions complementally described in the episode, e.g. background and experts’ views.

3.3 Generating Recognizable Search Results
The last problem in refinding tasks is to identify the search result a user is looking for. In Teevan’s experiment, 31% of participants reported that the presented list of search results was different from what they actually saw on the same day, even though the two lists were identical [35]. This suggests that it would be hard to identify which search result is the one users intended to retrieve, especially more than a week later. Although there have been several studies on generating query-biased snippets to help users quickly assess each search result and judge whether the presented one is relevant [16, 19, 38], it remains unclear what kind of snippets are effective in the situation of refinding. Users may be unable to judge, or may even be unaware of, the desired search result simply because they do not remember the sentences shown as a snippet.

Therefore, we clarify what kind of information would be the most recognizable even at a later time. As we explained earlier, recognition is a task to judge whether a presented item has been shown before. It corresponds to relevance judgments on search results. Users should be able to easily judge each search result if its snippet includes information they can accurately recognize.
In this paper, we classify the information in episodes into four types (summarized in Table 2) and measure the recognition accuracy of each type to estimate the recognizability of search result snippets. They are called recognition types in this paper. The attribute and object name recognition types are similar to the character recall question type in Table 1, except that the character recall question type focuses on the main character, while the major action and minor action recognition types correspond to the action recall question type. We opt to use these four types in order to compare the recall and recognition results, as it is known that recognizable information is not always recallable and vice versa [41].

4. METHODOLOGY
In this section, we first describe participants and experimental procedure in our user study. We then explain post-processing, including participant filtering and labeling users’ queries and answers. Finally, we discuss the limitations of our user study.

4.1 Participants
We recruited participants through a Japanese Internet research company. All experiments were carried out on the website, and all directions and questions were written in Japanese.

Table 3: Participant demographics.
Age  Male  Female
20s  37    34
30s  39    40
40s  40    39
50s  39    35
60s  39    39

Before we started the main experiments, we excluded unreliable users and those who were not familiar with search on the Web by asking two simple questions that can be easily solved by a Web search. Participants were instructed to perform their search using a Web search engine and to choose from among five choices. We also asked the question “How often do you read news on the newspaper or Internet for more than 15 minutes a day?” to select only individuals who read the news on a daily basis.
We excluded those who rarely read the news since we want to study users with an ambiguous memory of the episodes they read, yet those who do not usually read news articles can easily retain their memory even several days later. Thus, we excluded individuals who answered “a couple of days a week” or less. We used only those individuals who answered these three questions as we expected. As we could not ensure that all of the individuals participated in all tasks due to a limitation of the recruiting method, we first hired 799 individuals, 381 of whom passed through our filtering (explained later) and finished all the tasks. Table 3 shows the demographics of the 381 participants; their sex and age distributions are almost even.

4.2 Procedure
Our user study is composed of two parts, as shown in Figure 1. The former part consists of episode acquisition (reading news articles), query formulation, and summary generation (writing a summary for each article), while the latter part consists of query formulation, a recall test (answering recall questions), and a recognition test (choosing which information was included in the news articles). We conducted the former part from February 18 to February 20, 2014. A fortnight later, we sent e-mails to the participants and asked them to participate in the latter part from March 5 to March 8, 2014.

The participants were first asked to carefully read the user study instructions, which stated that a) the user study consisted of the former and latter parts; b) only participants who sincerely answered questions in the former part could move on to the latter part; c) the latter part contained different tasks from the former part; and d) participants were not allowed to use a Web search engine or to ask others during either the former or the latter part. Instruction b) was presented to give the participants an incentive.
We informed participants that this user study was designed to survey the depth of understanding of information on the Web, and did not inform them that the latter part functions as a refinding task. This was to prevent participants from intentionally remembering the news content between the former and latter parts. Instruction c) was shown for the same reason. Below, we explain tasks in the former and latter parts.

4.2.1 Former Part
We randomly assigned two different news categories to each participant with counterbalancing and asked them to read a news article from each category from top to bottom. Five news categories were used: crime, international, social problems, entertainment, and local, each of which contained two Japanese news articles. The category name was shown at the top-left of the Web page in red so that the participants were aware of the category while reading an article. We recorded their scrolling behaviors and reading time by using JavaScript embedded in the Web page to filter those who finished news browsing without scrolling or within a very short time. By this means, we had the participants acquire episodes by reading news articles.

[Figure 1: Flowchart of user study. Former part: episode acquisition, query formulation, summary generation. Latter part: query formulation, recall test, recognition test. Either recall or recognition was tested for each subject-episode pair.]

We then asked the participants to move on to another Web page and input several queries to retrieve the news article they had read. The query box we provided resembled those used in commercial search engines such as Yahoo!, Google, and Bing, and could show 35 two-byte characters without horizontal scrolling. The participants could submit queries by clicking on a search button or by pressing the Enter key. After query submission, the query box was cleared to accept another query. The search results for each query were not displayed to the participants.
Participants who entered at least one query could proceed to the summary generation task page. We asked participants to summarize each news article they had read using a minimum of 50 Japanese characters. This summary generation was designed to encourage participants to retain a memory of the news by rehearsal, which refers to repetition for retaining a memory for a long time, and to filter out those who did not read the news articles.

Note that participants first read an article and created queries as well as a summary for the first article. After finishing the tasks in the former part for the first article, the participants started reading a different article from another news category, and began the same process again for this second article. Participants were not able to use the “back” button of their browsers or to read any article more than once.

4.2.2 Latter Part
A two-week interval was set between the former and latter parts. We chose the length of the interval for the following reasons. First, the interval should be long enough to create a situation where participants felt it difficult to refind news articles shown in the former part. Second, the interval should not be so long that participants forget most of the cues about the news articles. To estimate an appropriate interval, we conducted a preliminary user study with 14 students in our university, where the participants were asked to read novels with their titles hidden and to refind them four weeks later. We found that more than half of them could not refind the novels with Web search engines, and decided to use a shorter interval in this study, i.e. a fortnight.

We sent e-mails to 714 participants a fortnight after the former part, and asked them to participate in the latter part of our user study. We excluded 85 individuals on the basis of how long it took them to read the news articles as well as the summaries they generated.
Specifically, we excluded those who did not spend more than one minute reading two news articles or who created a meaningless summary. The user study site was closed on March 8, 2014 in the morning. As a result, 381 out of 714 participants finished all the tasks in the latter part. Note that all the participants finished the latter part at least 14 days after the former part, and all the participants but two finished the latter part within 16 days.

First, we asked the participants to create several search queries to retrieve the news article they had read in the former part by explicitly showing the category name in red. We expected the participants would be able to identify one of the two news articles they had read without confusion, as two different categories were assigned and clearly presented to the participants in the former part. The query box used in the latter part was exactly the same as that used in the former part. In this way, we obtained two types of queries: ones input right after news browsing and ones input a fortnight later.

The participants were then asked to take a recall test and a recognition test. Only one of the two test types was assigned to each episode to avoid learning effects. In the recall test, we presented in order all the questions of one of the types shown in Table 1. Participants were instructed to answer these questions to the best of their ability by filling out a text form. In the recognition test, we used forced-choice recognition and asked the participants to choose from among four choices as well as a “no idea” choice. Only one of the four choices was a fact mentioned in the news article that had been read, while the others were not mentioned but were similar to each other.
As an example, the five choices we presented for one news story were: “A country mentioned in the news article was Iraq”, “A country mentioned in the news article was Syria”, “A country mentioned in the news article was Pakistan”, “A country mentioned in the news article was Israel”, and “no idea”. We prepared two recognition tests for each pair of a recognition type (Table 2) and a news article; thus, the number of recognition tests was 80 in total. One of the two recognition tests for each recognition type was taken by the participants in random order, i.e. they took four recognition tests for each news article. In summary, we obtained two types of query sets for two news articles (i.e. one formulated right after the news browsing and one formulated a fortnight later), answers to the recall test for one news article, and user choices in four recognition tests for the other news article.
4.3 Post-processing
In this subsection, we describe our participant filtering, annotation of queries and answers in recall tests, quantification of the difficulty of each recognition test, and development of a news search system to measure the search performance of submitted queries. After all participants had finished both the former and latter parts, we excluded outliers as they might affect the resultant data. Specifically, we filtered out those who spent less than 30 seconds or more than 600 seconds reading the two news articles. In the latter part, we obtained queries and answers to each recall question from the participants. Since some of the queries and answers included personal messages from the participants, such as “I cannot remember the news at all” and “Please give me a hint”, we decided to label each query/answer and remove any such messages by using a crowd-sourcing service.
Five crowd workers were assigned to each query/answer, shown the news story the participant had read as well as his/her query or answer, and asked to choose from three choices: forget (the participant explicitly states that s/he cannot remember the news), confused (the participant obviously talks about a different news story from what s/he read), and neither. In total, 1.31% of the queries and 35.4% of the answers were judged as forget by three or more workers. The inter-rater agreements as measured by the free-marginal multirater kappa [30] were 0.71 for the queries, and 0.81 for the answers to the recall questions. We regarded queries/answers that were judged as forget by three or more workers as empty. Note that we did not exclude queries/answers judged as forget in our analysis and did not conduct any operation for confused or neither queries/answers.
4.4 Limitations
In this section, we discuss three limitations of the methodology described above. Although these limitations may reduce the value of our reported findings, readers can still interpret and use the results by duly taking into consideration the possible biases discussed here. First, our user study was conducted online, which is different from ordinary psychological experiments that are typically held in a laboratory setting. Although this enabled us to utilize a larger sample size than laboratory experiments, the drawback is that we have less control over participants’ behaviors: for example, they might pretend to read news articles and reluctantly type queries and answers to our recall questions. However, it is difficult to perfectly control behaviors even in a laboratory setting. Moreover, as we compared results obtained from a large number of participants, we can expect that any unusual behaviors are evenly distributed over different groups and do not seriously affect our conclusions.
Second, participants were instructed to read news articles and were externally motivated to refind the news article they had seen. As it is not common for ordinary users to be asked to refind information, our experiments were somewhat unrealistic. Forcing the participants to rely on search alone to find the desired information may have further reduced the realism, since it has been reported that users prefer navigating to what they look for by taking a few known steps [37]. Third, our samples were biased in terms of search expertise. Since we recruited individuals through an Internet research company, the participants might include more search experts than those
[Footnote 1: In the case where no crowd worker answered correctly, the score was set to 10 (the maximum value).]
In the recognition tests, we tested the recognition ability by asking participants to choose from among five choices. One of the problems in measuring the recognizability of each recognition type is that the accuracy of a recognition test, which is defined as the number of participants who selected the correct choice divided by the total number of participants, would be highly biased by the difficulty of the recognition test. The difficulty of a recognition test is different from the accuracy of identifying, at a later time, an item shown before, and should be measured by excluding the effects of memory decay as much as possible. Thus, we recruited ten crowd workers for each recognition test, presented them with a news article, and asked them to take the associated recognition tests right after the news browsing. We expected the accuracy of the recognition test without memory decay to approximate the difficulty of the recognition test. In our analysis, we assume that a participant gets a score if s/he selects the correct choice, where the score is the inverted accuracy of the recognition test right after the news browsing, i.e.
the total number of crowd workers (10 in our case) divided by the number of crowd workers who selected the correct choice (see footnote 1). Intuitively, the more difficult a recognition test, the higher its score. In the following analyses, we refer to the score as the recognition score. The search performance of the participants’ queries was measured with a proprietary news search system we developed based on Apache Solr. We crawled news articles from the Web sites of four national Japanese newspaper companies from October 26, 2013 to February 15, 2014 (three days before the first day of the user study), and indexed them using a default Japanese tokenizer in Solr. The total number of news articles was 77,353. Each query submitted by the participants was evaluated with a ranking by the Okapi BM25 algorithm, where the default Solr parameters were used, i.e. k1 = 1.2 and b = 0.75.
[Figure 2: Mean RRs of former and latter part queries as a function of the age of participants (±SEM).]
in a random sampling. It is of course well known that search expertise affects search behaviors [1], so it might be inappropriate to generalize our findings to all search engine users including novices.
5. FINDINGS
In this section, we address three research questions by reporting the findings from our user study: 1) How do users formulate a search query with an ambiguous, uncertain memory? 2) Is asking a certain kind of question helpful for user recall? and 3) What kind of information can users recognize the most accurately? For question 1), we analyze the terms in queries and their search performance measured by reciprocal rank (RR), comparing the terms of queries formulated in the former and latter parts. For question 2), we analyze the answers in the recall tests in terms of both their content and the improvement in search performance when the answers were used for query expansion.
For question 3), we report the recognition scores for different recognition types and compare them with our analysis of recalled information. In our analysis, we mainly used non-parametric significance tests (Kruskal-Wallis and Wilcoxon signed-rank tests) because the data we obtained do not satisfy the normality assumption in many cases. When we conduct a post-hoc test or a multiple comparison, the Holm-Bonferroni method is employed to adjust p values. Significant effects are reported at the significance level α = 0.05.
5.1 Queries in Episode Refinding
Figure 2 shows the mean RRs of former and latter part queries as a function of the age of participants, where the RR of a query is defined as the multiplicative inverse of the rank of the news article shown in the former part in the search results. Overall, there is a big performance drop from the former part queries to the latter part ones. This drop demonstrates the difficulty of refinding episodes after just fourteen days. The Kruskal-Wallis test revealed significant effects of age on the mean RR for both the former (χ²(4) = 11.0, p < 0.05) and latter part queries (χ²(4) = 58.3, p < 0.01). A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm correction showed significant differences between 20s/60s for the former part queries (p < 0.05) and between 20s/50s, 30s/50s, 20s/60s, 30s/60s, 40s/60s, and 50s/60s for the latter part queries (p < 0.01). This demonstrates that it is difficult, especially for users in their late middle age, to refind episodes they had seen before. Although this trend coincides with the trend of long-term memory decay (e.g. [9, 28]), it is possible that users in their late middle age could not formulate appropriate queries due to their comparatively lower searching skills.
However, this is not conclusive, since the significant difference for the former part queries can be explained by the effect of both search skill and short-term memory decay [28]. Next, we analyze differences in the terms used in the former and latter part queries. The average number of terms per part-of-speech in the former and latter part queries is shown in Figure 3. Note that participants were asked to input multiple queries in both the former and latter parts and that we took the average number of terms in multiple queries input by each participant. As shown, the fraction of common nouns increases while that of proper nouns decreases in the latter part. This follows the same trend observed in psychological experiments, where the vulnerability of proper names to memory errors has been demonstrated in learning new names and in retrieving familiar names [7]. We further investigate the effect of the number of common and proper nouns on the RR. Figures 4 and 5 show the mean RRs of the latter part queries as a function of the number of common nouns and proper nouns, respectively. There is an RR drop when more than two common nouns were used by participants, while the RR increases as the number of proper nouns increases. A Kruskal-Wallis test revealed significant effects of the number of common (χ²(3) = 16.3, p < 0.01) and proper nouns (χ²(2) = 47.7, p < 0.01) on the mean RR. A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm correction showed significant differences between (2, inf) and each of (0, 1] and (1, 2] common nouns (p < 0.05), and between all pairs of the number of proper nouns (p < 0.01). The significant effect of the number of proper nouns may imply the effectiveness of proper nouns in episode refinding tasks and/or a correlation between the number of proper nouns and the extent to which people remember the news content they have read. A question is why many common nouns resulted in lower search performance.
Although there are several possible explanations for this result, one might be that many common nouns in a search query are a sign that the participant was struggling to remember the episode but failed. Figure 6 shows the average number of unique-term overlaps between the former and latter part queries, where former only and latter only indicate the average number of unique terms that appear only in the former and latter part queries, respectively, and overlap indicates the number of unique terms that appear in both. This figure provides another insight into query formulations a few weeks after news browsing: there are few overlaps between the former and latter part queries. More interestingly, there are some proper nouns that were not used in the former part queries but were used in the latter part ones. To drill down into these findings, we manually labeled term pairs in the former and latter queries on the basis of their word relation, i.e. hypernym, hyponym, synonym, coordinate (two terms share a hypernym), and misspell.
[Figure 3: Average number of unique terms per part-of-speech in the former and latter part queries (+SEM).]
[Figure 4: Mean RRs of latter part queries as a function of the number of common nouns (+SEM).]
[Figure 5: Mean RRs of latter part queries as a function of the number of proper nouns (+SEM).]
Since this labeling is relatively subjective and requires knowledge of lexical ontologies, two of the authors independently labeled term pairs, and a computer science student not involved in the writing of this paper mediated when our labels were different. Since the coordinate relation can be applied to any term pair according to its definition, we labeled a term pair as coordinate only if the most commonly used hypernyms of the two terms are identical.
For example, an “uncle” and “father” pair was labeled as coordinate while an “uncle” and “Barack Obama” pair was not, since the most commonly used hypernym of both terms in the former pair is “relative” while those of the latter pair are “relative” and “politician”, respectively. Table 4 shows the fraction of relations between terms in the former and latter part queries. There are two points of interest in this table: the term substitutions with coordinate terms were most frequent at a later time, and the number of hyponym substitutions was not negligible. These two findings are relevant to false recall and reminiscence. A recent psychological study also investigated word relations that are likely to cause false recall in the DRM paradigm [10, 32], and reported that the most important features to predict false recall were situation features (functions of a word), synonyms, and taxonomic relations including hypernym and coordinate [5]. Reminiscence is a phenomenon of recalling items that could not be recalled before [29]. Inputting a hyponym of a term in an episode is not false recall but rather reminiscence, as it indicates that the participants recalled further details of the episode. In summary, we analyzed terms in queries formulated in the former and latter parts and reported the following findings: a) age had a significant effect on the mean RR, especially for queries formulated in the latter part; b) the fraction of common nouns increased,
[Figure 6: Average number of term overlaps between former and latter part queries (+SEM).]
Table 4: Fraction of relations between terms in former and latter part queries.
Hypernym   Hyponym   Synonym   Coordinate   Misspell
22.7%      7.55%     17.3%     48.9%        3.60%
[Figure 7: Mean RR gains by query expansion with answers to each type of recall questions (+SEM).]
while that of proper nouns decreased in the latter part; c) there was a correlation between the number of proper nouns and the mean RR; d) there were few overlaps between the former and latter part queries; and e) coordinate term substitutions were the most frequent in the latter part queries, and hyponym substitutions were observed in some queries.
5.2 Recall in Episode Refinding
Figure 7 shows the mean RR gain by query expansion with answers to each type of recall questions, where we used for query expansion all the words, only common nouns, and only proper nouns in the answers. We added words in participants’ answers to their queries by using OR operators, and measured the gain of RRs by submitting the expanded queries to our news search system. A Wilcoxon signed-rank test revealed a significant effect of query expansion on the RR, i.e. a significant difference in the RR between unexpanded and expanded queries (all words) (Z = 4.27, p < 0.01, ES = 0.155). However, there was no significant difference among the recall question types when we performed a Kruskal-Wallis test. Thus, we can conclude that asking questions can significantly improve the search performance even two weeks later, as Kelly et al. showed in their experiments [22]. We then evaluate the contribution of terms in the recall tests to RR gains. A Friedman rank sum test revealed a significant effect of the query expansion method (i.e. all words, common nouns, and proper nouns) on the RR gain (χ²(2) = 30.7, p < 0.01).
A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm correction showed significant differences between proper nouns and all words as well as between proper nouns and common nouns (p < 0.01). Through this significance test, we intended to emphasize the difference in the effectiveness of proper nouns between the queries and the recall tests, i.e. proper nouns in the recall tests did not contribute to the RR gain, even though those in queries were strong indicators of high RRs, as was seen in Figure 5. Although this comparison is not strictly appropriate, as Figure 5 shows RRs with all the words in the latter queries, we could at least argue that proper nouns in the recall tests alone did not improve the search performance but common nouns did. Figure 8 shows the average number of unique terms used in answers to each type of recall question. The part-of-speech distribution in the recall tests is different from that in the latter queries in that more common nouns and verbs were input, and fewer proper nouns were used in the recall tests.
[Figure 8: Average number of unique terms used in answers to each type of recall questions (+SEM).]
On average, there is about one unique proper noun, while there are more than three unique common nouns used in answers in the recall tests. It also seems difficult to explain the small contribution of proper nouns to RR gains, as the difference between the number of proper and common nouns is not commensurate with the RR gain difference. In summary, we demonstrated the significant effect of query expansion by recalled results, and showed the low performance improvement of proper nouns in the recall tests. Although we analyzed several other effects of the age, news category, and individual recall question, significant effects were not observed in our user study.
5.3 Recognition in Episode Refinding
Figure 9 shows the mean recognition score for each recognition type.
Recall that a recognition score was defined as the inverted accuracy of a recognition test right after the news browsing if a participant could select the correct answer; otherwise 0. A Kruskal-Wallis test revealed significant effects of the recognition type on the recognition score (χ²(3) = 43.8, p < 0.01). A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm correction showed significant differences between all pairs of recognition types except the pair of object name and minor action types (p < 0.05). Thus, the major action type was the easiest, while the attribute type was the hardest to recognize in the episode refinding task. As we expected, this could be explained by a finding that the memory for the semantic information lasts longer than that for the lexical and syntactic information [24]. In other words, people can more accurately recognize what is described than what is literally written in news articles.
[Figure 9: Mean recognition score for each recognition type (+SEM).]
Although the mean recognition score of object names is lower than that of major actions, the score is not as low as the contribution of proper nouns in the recall tests. While the difficulty of remembering proper nouns such as person names has been demonstrated in the literature, it was also reported that recognition of names is less difficult than free recall [2]. The word frequency effect, which indicates a high recall ability for high-frequency words and a high recognition ability for low-frequency words [15], can also explain the advantage of proper nouns in the recognition tests. Therefore, people can recognize object names to some extent, despite the low frequency and low performance improvement of proper nouns in the recall tests. Note that we also compared the recognition accuracy of each recognition type, i.e.
the number of participants who selected the correct choice divided by the number of participants assigned. The result is similar to the recognition scores: 0.239 (±0.0225), 0.426 (±0.0241), 0.566 (±0.0268), and 0.354 (±0.0240) for attribute, object name, major action, and minor action, respectively (SEM is parenthesized). A Chi-square test revealed that the recognition accuracy significantly differed by the recognition type (χ²(3) = 82.9, p < 0.01). A post-hoc test using Chi-square tests with Holm correction showed significant differences between all pairs of recognition types (p < 0.01). The conclusion is almost the same as in the discussion above: people can recognize major actions the most accurately, and can also recognize object names with reasonable accuracy. Figure 10 shows the mean recognition score for each recognition type and age pair. A Kruskal-Wallis test revealed significant effects of the age of participants on the recognition score (χ²(4) = 25.3, p < 0.01). A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm correction showed significant differences between 60s and the other ages (p < 0.01). Thus, participants in their late middle age have a significantly lower recognition ability than the other ages. Even though it is not conclusive due to lack of significance, there exist different trends of recognition ability decay for recognition types and ages. There is little difference in recognizability across ages for major and minor action recognition, while recognizability for attributes and object names sharply decreases with advancing years. Differences in the vulnerability of memory across ages have been reported in several studies. For example, people in their late middle age tend to have a better ability in recognition tests than recall tests, and their performance of recognition is not low when compared to young people [8].
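As a concrete reading of the two measures used in this subsection, the following minimal Python sketch computes a recognition score and a recognition accuracy. It assumes the definitions given in the text (score equals the number of crowd workers divided by the number of crowd workers who were correct right after reading, capped at the number of workers when none were correct, and 0 for an incorrect participant choice). The function names and the example numbers are hypothetical and are not taken from the study's data or code.

```python
from typing import Sequence


def difficulty_weight(n_workers: int, n_correct_workers: int) -> float:
    """Inverted immediate accuracy of one recognition test.

    If no crowd worker was correct, the weight is set to n_workers
    (10 in the study), following the footnote in the paper.
    """
    if n_correct_workers == 0:
        return float(n_workers)
    return n_workers / n_correct_workers


def recognition_score(chose_correct: bool, n_workers: int,
                      n_correct_workers: int) -> float:
    """Score one participant's answer: the test's weight if correct, else 0."""
    if not chose_correct:
        return 0.0
    return difficulty_weight(n_workers, n_correct_workers)


def recognition_accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of assigned participants who selected the correct choice."""
    return sum(outcomes) / len(outcomes)


# Hypothetical example: 5 of 10 crowd workers were correct right after reading.
print(recognition_score(True, 10, 5))
print(recognition_accuracy([True, False, False, True]))
```

Under this reading, a correct answer on a test that only half of the immediate-condition workers passed counts twice as much as a correct answer on a test that all of them passed.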
In summary, we revealed significant effects of the recognition type and age on the recognition ability, and found that major actions in episodes are the most recognizable. In spite of their low frequency in the recall tests, object names are also recognizable to some extent. A significant effect of the age of participants was also found in the recognition tests.
[Figure 10: Mean recognition score for each recognition type and age pair (+SEM).]
6. DISCUSSIONS AND IMPLICATIONS
Our user study uncovered that user queries often fail to retrieve episodes read a fortnight earlier, to what extent asking questions and eliciting information from users is effective for improving the search performance, and what kind of information users can recognize accurately. This section discusses approaches to the problems raised in our analysis for better episode refinding. We found that, after several days, queries contain fewer proper nouns, and terms are replaced with their coordinate terms or hypernyms. Moreover, as suggested by the result in Figure 5, we could hypothesize that the search performance would be improved if users could correctly remember proper nouns. Therefore, search engine functionalities could be augmented in the following ways. Query expansion can expand a proper noun in a query by adding its coordinate terms with OR operators, since we found that terms in a query are likely to be replaced with their coordinate terms. When a user inputs a hypernym instead of a certain proper noun, query suggestion can help users narrow down the search results by suggesting queries that include hyponyms of the input term. These functionalities should be triggered when we find users facing difficulties in refinding. This could be detected on the basis of the fraction of general terms, or trial and error in query formulations with different coordinate terms, as they might be caused by loss of memories.
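The coordinate-term expansion suggested above could be sketched as follows. This is only an illustration under stated assumptions: the coordinate-term lookup is a hypothetical in-memory dictionary (a real system would need a lexical resource to supply it), each term is assumed to be a single token, and the output is a generic Boolean query string whose exact interpretation depends on the search engine and its default operator.

```python
from typing import Dict, List


def expand_with_coordinates(query_terms: List[str],
                            coordinate_terms: Dict[str, List[str]]) -> str:
    """Build a Boolean query in which each term is OR-ed with its coordinate terms.

    `coordinate_terms` is a hypothetical lookup from a term to terms that
    share its most common hypernym; terms without an entry are kept as-is.
    """
    clauses = []
    for term in query_terms:
        # Keep the original term first, then add coordinate terms without duplicates.
        alternatives = [term]
        for candidate in coordinate_terms.get(term, []):
            if candidate not in alternatives:
                alternatives.append(candidate)
        if len(alternatives) == 1:
            clauses.append(term)
        else:
            clauses.append("(" + " OR ".join(alternatives) + ")")
    return " ".join(clauses)


# Hypothetical example using the "uncle"/"father" coordinate pair from the text.
print(expand_with_coordinates(["uncle", "arrested"],
                              {"uncle": ["father", "aunt"]}))
```

Grouping the alternatives per term keeps the rest of the query unchanged, so a misremembered term can be matched through one of its coordinate terms without loosening the other query terms.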
Studies on searcher frustration may also help search engines detect user difficulties in refinding [14]. A straightforward approach to the low search performance several days later is to ask users questions and to expand their queries with their answers, as suggested by the result in Figure 7. Although our study could not identify what kinds of questions are the most effective for eliciting information for refinding, our finding suggests that proper nouns in user recall are not useful for improving the search performance. Thus, effective questions are ones that elicit not proper nouns but common nouns. For example, we should not ask questions such as “What is the name of the main characters?” or “In which city did the accident happen?”. More general questions are expected to work better, e.g. “What sort of people played a main role?” or “What happened in the news?”. On the other hand, general questions tend to be rarely answered well, as previous work on question clarification reported [21]. Thus, it is also necessary to generate questions customized for the user by utilizing the content of the input query. The results of the recognition tests imply that the snippets of search results should include object names and actions rather than attributes such as time, address, and characteristics of characters/objects in the episode. The difference between recall and recognition ability supports the use of query suggestions that encourage users to input correct proper nouns, as object names are less likely to be recalled but likely to be recognized accurately.
7. CONCLUSIONS
This study investigated recall and recognition in a news refinding task a fortnight later. Our user study revealed that there is a big drop in search performance, that asking questions and expanding input queries on the basis of the answers significantly improved the search performance, and that the users’ recognition abilities were different from their recall abilities.
Our findings supported several findings in cognitive psychology from the viewpoint of information refinding and also had several implications for search algorithms for assisting user refinding. Future work includes further studies on resolving ambiguous expressions, and development of a system that generates questions that effectively and efficiently elicit information from users.
8. ACKNOWLEDGMENTS
We would like to thank Professor Takashi Kusumi (Kyoto University) for his advice on our study. This work was supported in part by the following projects: Grants-in-Aid for Scientific Research (Nos. 24240013 and 26700009) from MEXT of Japan, and Microsoft Research CORE Project.
9. REFERENCES
[1] A. Aula. Query formulation in web information search. In ICWI, pages 403–410, 2003.
[2] H. P. Bahrick, P. O. Bahrick, and R. P. Wittlinger. Fifty years of memory for names and faces: A cross-sectional approach. Journal of experimental psychology: General, 104(1):54–75, 1975.
[3] T. J. Bireta, A. M. Surprenant, and I. Neath. Age-related differences in the von Restorff isolation effect. The Quarterly Journal of Experimental Psychology, 61(3):345–352, 2008.
[4] G. H. Bower. Mood and memory. American psychologist, 36(2):129, 1981.
[5] D. R. Cann, K. McRae, and A. N. Katz. False recall in the Deese–Roediger–McDermott paradigm: The roles of gist and associative strength. The Quarterly Journal of Experimental Psychology, 64(8):1515–1542, 2011.
[6] G. Cohen. Why is it difficult to put names to faces? British Journal of Psychology, 81(3):287–297, 1990.
[7] G. Cohen and D. M. Burke. Memory for proper names: A review. Memory, 1(4):249–263, 1993.
[8] F. I. Craik. Age-related changes in human memory. Cognitive aging: A primer, 5:75–92, 2000.
[9] F. I. Craik. Human memory and aging. In Psychology at the turn of the millennium, volume 1, pages 261–280, 2002.
[10] J. Deese. On the prediction of occurrence of particular verbal intrusions in immediate recall.
Journal of experimental psychology, 58(1):17–22, 1959.
[11] S. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff I’ve seen: a system for personal information retrieval and re-use. In SIGIR, pages 72–79, 2003.
[12] D. Elsweiler, M. Baillie, and I. Ruthven. Exploring memory in email refinding. ACM Transactions on Information Systems (TOIS), 26(4):21, 2008.
[13] W. K. Estes. Processes of memory loss, recovery, and distortion. Psychological Review, 104(1):148–169, 1997.
[14] H. A. Feild, J. Allan, and R. Jones. Predicting searcher frustration. In SIGIR, pages 34–41, 2010.
[15] V. Gregg. Word frequency, recognition, and recall. John Wiley & Sons, 1976.
[16] K. Haas, P. Mika, P. Tarjan, and R. Blanco. Enhanced results for web search. In Proc. of SIGIR, pages 725–734. ACM, 2011.
[17] E. Horvitz, S. Dumais, and P. Koch. Learning predictive models of memory landmarks. In CogSci, pages 583–588, 2004.
[18] I.-N. Huang and C. Wille. The von Restorff isolation effect in free recall. The Journal of General Psychology, 101(1):27–34, 1979.
[19] Y. Huang, Z. Liu, and Y. Chen. Query biased snippet generation in XML search. In Proc. of SIGMOD, pages 315–326. ACM, 2008.
[20] R. R. Hunt. The subtlety of distinctiveness: What von Restorff really did. Psychonomic Bulletin & Review, 2(1):105–112, 1995.
[21] M. P. Kato, R. W. White, J. Teevan, and S. T. Dumais. Clarifications and question specificity in synchronous social Q&A. In CHI Extended Abstracts, pages 913–918, 2013.
[22] D. Kelly, V. D. Dollu, and X. Fu. The loquacious user: a document-independent source of terms for query expansion. In Proc. of SIGIR, pages 457–464. ACM, 2005.
[23] J. Kim, W. B. Croft, D. Smith, and A. Bakalov. Evaluating an associative browsing model for personal information. In CIKM, pages 647–652. ACM, 2011.
[24] W. Kintsch, D. Welsch, F. Schmalhofer, and S. Zimny. Sentence memory: A theoretical analysis. Journal of Memory and language, 29(2):133–159, 1990.
[25] E. F. Loftus and J.
C. Palmer. Reconstruction of automobile destruction: An example of the interaction between language and memory. Journal of verbal learning and verbal behavior, 13(5):585–589, 1974.
[26] B. B. Murdock Jr. The serial position effect of free recall. Journal of experimental psychology, 64(5):482–488, 1962.
[27] H. Obendorf, H. Weinreich, E. Herder, and M. Mayer. Web page revisitation revisited: implications of a long-term click-stream study of browser usage. In CHI, pages 597–606, 2007.
[28] D. C. Park, G. Lautenschlager, T. Hedden, N. S. Davidson, A. D. Smith, and P. K. Smith. Models of visuospatial and verbal memory across the adult life span. Psychology and aging, 17(2):299–320, 2002.
[29] D. G. Payne. Hypermnesia and reminiscence in recall: A historical and empirical review. Psychological Bulletin, 101(1):5–27, 1987.
[30] J. J. Randolph, A. Thanks, R. Bednarik, and N. Myller. Free-marginal multirater kappa (multirater κfree): an alternative to Fleiss’ fixed-marginal multirater kappa. In Joensuu learning and instruction symposium, 2005.
[31] M. Ringel, E. Cutrell, S. Dumais, and E. Horvitz. Milestones in time: The value of landmarks in retrieving information from personal stores. In INTERACT, pages 184–191, 2003.
[32] H. L. Roediger and K. B. McDermott. Creating false memories: Remembering words not presented in lists. Journal of experimental psychology: Learning, Memory, and Cognition, 21(4):803–814, 1995.
[33] S. R. Schmidt. Encoding and retrieval processes in the memory for conceptually distinctive events. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(3):565–578, 1985.
[34] A. M. Surprenant and I. Neath. Principles of memory. Taylor & Francis, 2009.
[35] J. Teevan. How people recall, recognize, and reuse search results. ACM TOIS, 26(4):19, 2008.
[36] J. Teevan, E. Adar, R. Jones, and M. A. Potts. Information re-retrieval: repeat queries in Yahoo’s logs. In Proc. of SIGIR, pages 151–158, 2007.
[37] J. Teevan, C. Alvarado, M. S.
Ackerman, and D. R. Karger. The perfect search engine is not enough: a study of orienteering behavior in directed search. In CHI, pages 415–422, 2004.
[38] A. Tombros and M. Sanderson. Advantages of query biased summaries in information retrieval. In Proc. of SIGIR, pages 2–10. ACM, 1998.
[39] E. Tulving. Elements of Episodic Memory. OUP Oxford, 1985.
[40] E. Tulving and D. M. Thomson. Encoding specificity and retrieval processes in episodic memory. Psychological review, 80(5):352, 1973.
[41] E. Tulving and S. Wiseman. Relation between recognition and recognition failure of recallable words. Bulletin of the Psychonomic Society, 6(1):79–82, 1975.