Re-call and Re-cognition in Episode Re-retrieval:
A User Study on News Re-finding a Fortnight Later
Shuya Ochiai, Makoto P. Kato, and Katsumi Tanaka
Kyoto University, Yoshida Honmachi, Sakyo, Kyoto, Japan 6068501
{ochiai,kato,tanaka}@dl.kuis.kyoto-u.ac.jp
ABSTRACT
This study investigates recall and recognition in a news refinding
task where participants were asked to read news articles and then
to search for the same articles a fortnight later. Recall, which is
a task to express what a person remembers, corresponds to query
formulations, while recognition, which is a task to judge whether
a presented item has been shown before, corresponds to a user’s
relevance judgment on search results in a refinding task. Our four
main contributions can be summarized as follows: (i) we developed
a method to investigate the effects of memory loss on episode refinding tasks on a large scale; (ii) our user study revealed a large drop
in search performance in the refinding task after a fortnight and
several differences between search queries input immediately after
news browsing and ones at a later time; (iii) we found that asking
questions and expanding input queries on the basis of the answers
significantly improved the search performance in the news refinding task; and (iv) the users’ recognition abilities differed from
their recall abilities, e.g. object names in a news story could be
correctly recognized even though they were rarely recalled. Our
findings support several findings in cognitive psychology from the
viewpoint of information refinding and also have several implications for search algorithms for assisting user refinding.
Categories and Subject Descriptors
H.3.3. [Information Search and Retrieval]: Information Search
and Retrieval
Keywords
refinding; news search; recall and recognition
1. INTRODUCTION
People browse various types of episodes on the Web, including
news articles, stories, and the experiences of others as written in
blog posts. At a later time, they try to refind these episodes for a
wide range of purposes such as information sharing and citations.
However, refinding is made difficult by oblivescence: if users
forget the details of the episode for which they are looking, they
may fail to formulate appropriate Web search queries and be unable
to identify documents that contain the episode in search engine results pages. For example, imagine that Connie happened to read a news
story about Takeru Kobayashi, a Japanese competitive eater. The
news said that he is active in Nathan’s hot dog eating contest, an
annual American competitive eating event, and that he has gotten
into trouble with the organizers and been arrested once. A
fortnight later, she wanted to share this news with her friend, but she
could not remember enough details to create effective Web search
queries. Connie’s search queries contained ambiguous and/or incorrect keywords such as “asian”, “false accusation”, and “purged”.
Thus, her search results were noisy and it was difficult to locate the
correct news story as she browsed. Indeed, even though she actually scanned a snippet of the news she was looking for, which was
ranked fifth in her third search session, she failed to notice
it because she lacked confidence in her memory of the details.
CIKM’14, November 3–7, 2014, Shanghai, China.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2598-1/14/11 ...$15.00.
http://dx.doi.org/10.1145/2661829.2661920.
As seen in Connie’s example, the success of refinding is highly
reliant on the individual’s powers of memory. Her failures can be
characterized by two processes to retrieve a memory: recall and
recognition. Recall is a task to express what a person remembers.
In Connie’s search, recall is remembering the episode about Takeru
Kobayashi and inputting the keywords that she thought were included in the episode. Recognition is a task to judge whether a
presented item has been shown before. Connie’s recognition failed
in the example above, since she could not correctly judge which
search result was the one she had browsed before. The memory of
episodes such as Connie’s is usually distinguished from other types
of memories and is called episodic memory [39]. Episodic memory, which is our main focus in this paper, has been recently studied
in detail, and several characteristics have been clarified in cognitive
psychology research. Tulving distinguished episodic memory from
semantic memory in terms of information, operations, and applications and stated that episodic memory is a spatiotemporal-oriented
memory of experiences [39].
We investigated recall and recognition in the context of a news
refinding task after a fortnight. The three main research questions
we addressed are: 1) How do users formulate a search query with
an ambiguous, unconfident memory? 2) Is asking a certain kind
of question helpful for user recall? and 3) What kind of information can users recognize the most accurately? A task-based user
study was administered to answer these questions. We recruited
381 participants online, asked them to read news articles, and then
to search for the same articles a fortnight later. They were also
asked to recall the episodes with the assistance of some questions
and to recognize facts included in the episode they had seen.
Our four main contributions can be summarized as follows: (i)
we developed a method to investigate the effects of memory loss
on episode refinding tasks on a large scale; (ii) our user study revealed
a large drop in search performance in the refinding task after a fortnight and several differences between search queries input
immediately after news browsing and ones at a later time; (iii) we
found that asking questions and expanding input queries on the basis of the answers significantly improved the search performance
in the news refinding task; and (iv) the users’ recognition abilities differed from their recall abilities, e.g. object names in
a news story could be correctly recognized even though they were
rarely recalled. Our findings support several findings in cognitive
psychology from the viewpoint of information refinding and also
have several implications for search algorithms for assisting user
refinding.
The rest of this paper is organized as follows. Section 2 summarizes the previous work on memory and refinding. Section 3 discusses problems caused by oblivescence in refinding and proposes
methods to help users with episode refinding. Section 4 describes
the task-based user study we used to investigate the trend of recall
and recognition in episode refinding, and Section 5 reports the findings from this study. Section 6 discusses implications of this study,
and Section 7 concludes the paper.
2. RELATED WORK
We survey previous work on memory in cognitive psychology and on information access technologies that aim to support users’
refinding.
2.1 Principles of Memory
We introduce several principles of memory, following Surprenant and Neath’s work [34]; these principles motivate our study and
serve as reference points in our discussions.
The encoding specificity principle proposed by Tulving and Thomson emphasizes the dependency between encoded content and the cues
that are used to retrieve memories, based on their finding that strong
associates of an item (e.g. bloom to flower) are not always effective for retrieving the memory of the item [40]. The principle implies
that no single type of cue is effective for retrieving every kind of
memory in every situation. On the basis of this principle, some information access research has been conducted to explore effective
cues for various situations (e.g. [17]).
The reconstruction principle claims that people use whatever information is available to reconstruct or build up a coherent memory of an episode. While this process can help users accurately
remember the past in some circumstances, the reconstruction can
cause distortion or loss of memories in others [13]. False memory
has been studied by using the Deese-Roediger-McDermott (DRM)
paradigm [10, 32], where participants are shown a list of words
composed of the strongest associates of a nonpresented word, and
are later asked to recall or recognize words in the list. On the subsequent recall and recognition tests, participants tend to recall the
nonpresented word. Some studies support an assumption that false
memory follows some rules and shows a bias toward the most likely
event. Loftus and Palmer found, through recall tests on a film in response to different questions, that questions asked subsequent
to an episode changed the memory of the episode [25]. Cann,
McRae, and Katz conducted a series of experiments to explore the
most important word features that cause false memory by using a
knowledge-type taxonomy [5]. We observed many kinds of false
memory in participants’ queries and answers to given questions in
our study, some of which can be explained by the previous findings
mentioned above.
The relative distinctiveness principle enables us to estimate the
performance on memory tasks based on the distinctiveness of an
item relative to its alternatives. One well-known effect is
the Von Restorff effect [20], an effect of visual distinctiveness on recall and recognition performance. This effect has been further investigated and observed in conceptually distinctive items [33], and
in recall tests for young [18] and older people [3]. We utilize this principle to predict what users remember several days later, and try to
elicit it from users, especially in the question-assisted recall tests.
The characteristics of recall and recognition abilities as well as
their relationship have been studied from several aspects. The word
frequency effect refers to the trend in word recognition tests, in
which high-frequency words are likely to be recalled, while low-frequency words are likely to be recognized [15]. Although people are generally able to recognize information better than they can recall it, it
was demonstrated that recognizable information is not always recallable and vice versa [41]. Tulving and Wiseman reported that
in many cases even an item that participants recalled with a cue was not recognized
without any cue. Thus, we simultaneously assess the
recall and recognition abilities to understand the two different processes of information refinding, i.e. query formulation and relevance
judgment, which might show different characteristics according to
Tulving and Wiseman’s work.
The effects of aging on the memory ability have been widely
studied in the literature. A series of experiments conducted by Park
et al. revealed that short-term and long-term memory abilities gradually decrease as a function of age [28]. People in their late
middle age demonstrated a better ability in recognition tests than in
recall tests, and their recognition performance was not low
compared with that of young people [8]. Names or proper nouns are thought
to be one of the most difficult information types to recall, especially
for people in their late middle age [6]. These effects of aging were
also observed in our user study, where older people failed to refind
information more often than the other age groups.
2.2 Memory and Information Access
The work most relevant to this paper was conducted by
Teevan [35], in which recall and recognition of search results were
investigated through questionnaire-based user studies. Participants
were first exposed to a search engine result page (SERP) in response to a self-generated query, and were later asked to recall
the title, snippet, and URL of each search result in the list. Teevan reported that the top-ranked search result and the last search result clicked were recalled with relatively high accuracy, which was
attributed to the primacy effect and the recency effect [26]. She
further tested the recognition ability of participants by asking them to
judge whether a new SERP was the same as or different from the
SERP that they had browsed. The result showed that no less than
31% of the participants answered “different” to a presented SERP
that was identical to the one that they had seen. Although our
study also addresses the problem of recall and recognition in a refinding
task, our focus includes query formulation for refinding and the
recognition ability for different information types, rather than for search
results, with the aim of generating recognizable search results. Our recall tests
also differ from Teevan’s in that we asked participants to
recall a news story with the assistance of questions.
Teevan et al. also investigated search engine queries and found
that 33% of the queries were repeated by the same user [36]. Although the repeated queries were probably used for refinding, information refinding is not always initiated by the same query used
before. One of our research questions aims to clarify
how queries for refinding change after several days have passed.
Elsweiler, Baillie, and Ruthven investigated what kind of e-mail
attributes people remember in e-mail refinding [12]. In their study,
participants performed several tasks and were asked if they believed
the information needed to solve the task was stored within their
collection, and what they were able to remember about the information. They reported that several factors affected what attributes
were remembered: the elapsed time since the e-mail was accessed,
the experience, the number of e-mails in participants’ collection,
and the strategy for filing e-mails. While their research focused mainly
on whether participants recalled a certain type of e-mail attribute,
we focus more on the content of news articles that participants remember and that is effective for refinding.
Obendorf et al. [27] conducted a log-based study on revisitation behaviors, and found that strategies used by participants were
highly reliant on how long ago they had browsed the information
for which they looked. The most frequent strategies were the use of
the back button for short-term (less than an hour), direct access via
URL-entry or bookmark selection for middle-term (between one
hour and a day), and use of hyperlinks for long-term.
Horvitz, Dumais, and Koch proposed statistical models that estimate the probability that users consider events to be memory landmarks [17]. They constructed models based on Bayesian networks
utilizing event properties such as the event duration, subject, and
location.
Many tools for refinding have been proposed, especially for personal information management. Ringel et al. proposed timeline
visualization with temporal landmarks for guiding search over personal collections [31]. Dumais et al. [11] developed a system
called Stuff I’ve Seen, which enables users to search for their personal information with the support of contextual information such
as time, author, thumbnails, and previews. Recent work in this line
proposed an associative browsing tool, which in response
to a given query presents a ranked list of associated items based on
several features including textual and temporal similarity [23].
3. CHALLENGES IN EPISODE REFINDING WITH MEMORY LOSS
In this section, we discuss several problems in episode refinding
tasks with an ambiguous, unconfident memory. We also propose
methods to help users find the episode they read a long time ago,
and explain the concrete research questions we address in this paper. The three subsections that follow correspond to research questions 1)-3) in Section 1.
3.1 Query Processing for Fighting Amnesia
The first problem users may encounter when lacking a detailed
memory is formulating an appropriate query for refinding. As the
previous work has suggested, the queries of such users are likely to
include more general words rather than specific ones [15], which
might generate far more search results than expected and prevent
users from identifying what they are looking for. Moreover, queries
input with an ambiguous memory may include keywords that are
not included in the episode the users are trying to retrieve. For example, a coordinate term, which shares a hypernym with another,
can be used for a noun in an episode. Examples we observed in our
study include cases where “Syria” was replaced with “Ukraine”
and where “uncle” was replaced with “father”. Another possible
lapse of memory is that a proper noun can be replaced with its hypernym: e.g. “Michael Jackson” can be replaced with “musician”
in a query. These phenomena have been suggested by the DRM
paradigm [10, 32], where people tend to recall a word that is not
presented when they are shown a list of words composed of the
strongest associates of the recalled word.
In this paper, we focus on investigating the following characteristics of queries: the part-of-speech distribution (i.e. the number
of common nouns, proper nouns, verbs, and adjectives in a query),
the frequency of word substitution types (i.e. the relation between a
term and its substitute), and the relation between these query characteristics and search performance. We then propose some query
processing techniques based on findings regarding these characteristics.
Table 1: Recall questions.
Type        No.  Question
Character   I    Who/What is the main character of this news?
Character   II   Please explain the characteristics of the person/object.
Character   III  Please describe something else you can remember.
Action      I    What happened in this news?
Action      II   Why did the event happen?
Action      III  Please describe something else you can remember.
Impression  I    How did you feel by reading this news?
Impression  II   Why did you feel so?
Impression  III  Please describe something else you can remember.
Baseline         Please describe something else you can remember.
3.2 Reviving Memory of Users
It would be a challenging task for search engines to utilize only
ambiguous, incorrect keywords and to rank the desired search result
at the top without any assistance from the user. Thus, it is necessary
to interact with users to elicit more information about the episode
they are looking for, or to have them input additional keywords
about the episode by reviving their memory.
While general solutions to this problem include relevance feedback and query suggestion, here we employ a simple alternative to
elicit and revive user memory that is specialized for episode refinding tasks: we ask the users free-answer questions and expand the
original query by adding keywords in their answers. This idea was
inspired by Kelly et al.’s work on a document-independent term
source for query expansion [22]. Their feedback form included a
few simple questions such as “Can you describe what you already
know about the topic?” and “Why do you want to know about this
topic?”. Their experiments demonstrated a significant improvement
in search performance when the input query and answers to the
questions were used together as a query. Unlike their experiments
where the feedback form was filled out right after a search topic
had been shown, we address the question of whether it is possible
to improve the search performance by asking a few questions and
adding answers to the input query even at a later time.
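As a rough illustration of this kind of answer-based query expansion, the following Python sketch appends answer terms that are not already in the query. The function name `expand_query`, the whitespace tokenization, and the stopword set are assumptions made for illustration only; the study was conducted in Japanese, where a morphological analyzer would be needed in place of `split()`, and the exact expansion procedure is not described at this point in the paper.

```python
def expand_query(query, answers, stopwords=frozenset()):
    """Return the query with unseen answer terms appended (hypothetical helper).

    Tokens are split on whitespace, lowercased, and stripped of surrounding
    punctuation. This is an assumption; it is not the authors' procedure.
    """
    terms = query.split()
    seen = {t.lower() for t in terms}
    for answer in answers:
        for token in answer.split():
            key = token.lower().strip(".,!?\"'")
            # Skip empty tokens, stopwords, and terms already in the query.
            if key and key not in seen and key not in stopwords:
                seen.add(key)
                terms.append(key)
    return " ".join(terms)


if __name__ == "__main__":
    answers = ["Kobayashi was arrested.", "the contest organizers"]
    print(expand_query("hot dog contest", answers, frozenset({"was", "the"})))
```

Terms from the answers are only appended, never substituted, so the original query terms keep their positions even when the answers are noisy.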
We also try to clarify what types of questions are the most effective to revive user memory in episode refinding tasks. To this
end, we devised three free-answer question types: character, action, and impression. These types are called recall question types
in this paper. Each type of recall question employed in our user
study, shown in Table 1, was designed to be applicable to any type
of episode. We also used a baseline recall question, “Please describe something else you can remember”, for comparison with the recall question types we designed. The character recall question type includes questions about the name and characteristics of the main characters in the episode, while the action recall question type includes questions about the events happening in the episode. The former was
used because it is said that conceptually distinctive objects (i.e. the
main person/object in an episode) are easy to remember in recall
tasks (known as the Von Restorff effect) [20, 33], while the latter
was devised because it has been shown that the memory for the
textbase (semantic) lasts longer than that for the surface (lexical and
syntactic) [24]. The impression recall question type includes questions
that try to elicit users’ impressions when they read the episode. This
type of recall question is probably effective for reviving user memories because mood-state-dependent retention implies that people
recall an episode better if they reinstate the original emotion they
experienced during episode acquisition [4].
Table 2: Recognition types.
Type          Definition
Attribute     Attributes of the episode or of objects in the episode, e.g. time, address, and characteristics.
Object name   Names of objects in the episode, e.g. person, country, and company names.
Major action  Actions mainly described in the episode, e.g. perpetration and incident.
Minor action  Actions described as complementary material in the episode, e.g. background and experts’ views.
3.3 Generating Recognizable Search Results
The last problem in refinding tasks is to identify the search result
a user is looking for. In Teevan’s experiment, 31% of participants
reported that the presented list of search results was different from
what they actually saw on the same day, even though the two lists
were identical [35]. This indicates that it should be hard for users to identify
which search result is the one they intended to retrieve, especially
more than a week later. Although there have been several studies on
generating query-biased snippets to help users quickly assess each
search result and judge whether the presented one is relevant [16,
19, 38], it remains unclear what kind of snippets are effective for
refinding. Users may be unable to judge the desired search result, or may even be
unaware that it has been presented, simply because they
do not remember the sentences shown as a snippet.
Therefore, we clarify what kind of information would be the
most recognizable even at a later time. As we explained earlier,
recognition is a task to judge whether a presented item has been
shown before. It corresponds to relevance judgments on search results. Users should be able to easily judge each search result if
their snippets include information they could accurately recognize.
In this paper, we classify the information in episodes into four types
(summarized in Table 2) and measure the recognition accuracy of
each type to estimate the recognizability of search result snippets.
They are called recognition types in this paper. The attribute and object
name recognition types are similar to the character recall question
type in Table 1, except that the character recall question type focuses on the main character, while the major action and minor action recognition types correspond to the action recall question type.
We opt to use these four types in order to compare the recall and
recognition results, as it is known that recognizable information is
not always recallable and vice versa [41].
4. METHODOLOGY
In this section, we first describe the participants and experimental
procedure in our user study. We then explain post-processing, including participant filtering and the labeling of users’ queries and answers.
Finally, we discuss the limitations of our user study.
4.1 Participants
We recruited participants through a Japanese Internet research
company. All experiments were carried out on the website, and
all directions and questions were written in Japanese. Before we
started the main experiments, we excluded unreliable users and
those who were not familiar with Web search by asking two
simple questions that can be easily answered with a Web search. Participants were instructed to perform their search using a Web search
engine and to choose from among five choices.
Table 3: Participant demographics.
        20s  30s  40s  50s  60s
Male     37   39   40   39   39
Female   34   40   39   35   39
We also asked the question “How often do you read news in a
newspaper or on the Internet for more than 15 minutes a day?” to select
only individuals who read the news on a daily basis. We excluded
those who rarely read the news since we wanted to study users with
an ambiguous memory of the episodes they read, yet those who do
not usually read news articles can easily retain their memory even
several days later. Thus, we excluded individuals who answered a
couple of days a week or less.
We used only those individuals who answered these three questions as we expected. As we could not ensure that all of the individuals participated in all tasks due to a limitation of the recruiting method, we first hired 799 individuals, 381 of whom passed
our filtering (explained later) and finished all the tasks. Table 3 shows the demographics of the 381 participants, whose sex
and age distributions are almost even.
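The counts in Table 3 and the recruitment figures can be cross-checked with a few lines of Python. This is only a consistency check on numbers reported in the paper (799 individuals hired, 85 excluded before the latter part as described in Section 4.2.2, and 381 finishing all tasks); the variable names are ours and carry no meaning beyond this sketch.

```python
# Participant counts per age group, copied from Table 3.
AGE_GROUPS = ["20s", "30s", "40s", "50s", "60s"]
MALE = [37, 39, 40, 39, 39]
FEMALE = [34, 40, 39, 35, 39]

total_male = sum(MALE)
total_female = sum(FEMALE)
total_participants = total_male + total_female
per_age_group = {age: m + f for age, m, f in zip(AGE_GROUPS, MALE, FEMALE)}

# Recruitment figures reported in Sections 4.1 and 4.2.2.
HIRED = 799
EXCLUDED_BEFORE_LATTER_PART = 85
invited = HIRED - EXCLUDED_BEFORE_LATTER_PART
completion_rate = total_participants / invited

if __name__ == "__main__":
    print(total_male, total_female, total_participants)
    print(per_age_group)
    print(invited, round(completion_rate, 3))
```

The table totals sum to the 381 participants reported in the text, and 799 minus 85 gives the 714 individuals who were invited to the latter part.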
4.2 Procedure
Our user study is composed of two parts, as shown in Figure
1. The former part consists of episode acquisition (reading news
articles), query formulation, and summary generation (writing a
summary for each article), while the latter part consists of query
formulation, a recall test (answering recall questions), and a recognition test (choosing which information was included in the news
articles). We conducted the former part from February 18 to February 20, 2014. A fortnight later, we sent e-mails to the participants
and asked them to participate in the latter part from March 5 to
March 8, 2014.
The participants were first asked to carefully read the user study
instructions, which stated that a) the user study consisted of the
former and latter parts; b) only participants who sincerely answered
questions in the former part could move on to the latter part; c) the
latter part contained different tasks from the former part; and d)
participants were not allowed to use a Web search engine or to ask
others during either the former or the latter part.
Instruction b) was presented to the participants to give them an incentive. We informed participants that this user study was designed
to survey the depth of understanding of information on the Web,
and did not inform them that the latter part functions as a refinding task. This was to prevent participants from intentionally remembering the news content between the former and latter parts.
Instruction c) was shown for the same reason.
Below, we explain tasks in the former and latter parts.
4.2.1 Former Part
We randomly assigned two different news categories to each participant, with counterbalancing taken into account, and asked them to
read a news article from each category from top to bottom. Five
news categories were used: crime, international, social problems,
entertainment, and local, each of which contained two Japanese
news articles. The category name was shown at the top-left of the
Web page in red so that the participants were certainly aware of the
category while reading an article. We recorded their scrolling behaviors and reading time by using JavaScript embedded in the Web
page to filter out those who finished news browsing without scrolling
or within a very short time. By this means, we had the participants
acquire episodes by reading news articles.
Figure 1: Flowchart of user study. Former part: episode acquisition, query formulation, and summary generation. Latter part: query formulation, followed by a recall test or a recognition test (either recall or recognition was tested for each subject-episode pair).
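The paper does not specify how the counterbalanced category assignment was implemented. One simple scheme, sketched below under that caveat, cycles through all ordered pairs of distinct categories so that every category appears equally often as the first and as the second article. The function `assign_categories`, the helper `position_counts`, and the use of a sequential participant index are hypothetical.

```python
from collections import Counter
from itertools import permutations

# The five news categories used in the study.
CATEGORIES = ["crime", "international", "social problems", "entertainment", "local"]

# All ordered pairs of two distinct categories: 5 * 4 = 20 pairs.
ORDERED_PAIRS = list(permutations(CATEGORIES, 2))


def assign_categories(participant_index):
    """Return (first_category, second_category) for a participant (assumed scheme)."""
    return ORDERED_PAIRS[participant_index % len(ORDERED_PAIRS)]


def position_counts(num_participants, position):
    """Count how often each category is assigned at the given position (0 or 1)."""
    return Counter(assign_categories(i)[position] for i in range(num_participants))


if __name__ == "__main__":
    print(assign_categories(0))
    print(position_counts(20, 0))
    print(position_counts(20, 1))
```

With this rotation, any block of 20 consecutive participants is fully balanced across categories and reading order; a real deployment would also need to pick one of the two articles within each category.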
We then asked the participants to move on to another Web page,
and input several queries to retrieve the news article they read. The
query box we provided resembled those used in commercial
search engines such as Yahoo!, Google, and Bing, and could show 35
two-byte characters without horizontal scrolling. The participants
could submit queries by clicking on a search button or by pressing
the Enter key. After query submission, the query box was cleared
to accept another query. The search results for each query were not
displayed to the participants.
Participants who entered at least one query could proceed to
the summary generation task page. We asked participants to summarize each news article they had read using a minimum of 50
Japanese characters. This summary generation was designed to encourage participants to retain a memory of the news by rehearsal,
which refers to repetition for retaining a memory for a long time,
and to filter out those who did not read the news articles.
Note that participants first read an article and created queries as
well as a summary for the first article. After finishing the tasks in the
former part for the first article, the participants started reading a
different article from another news category, and began the same
process again for this second article. Participants were not able to
use the “back” button of their browsers or to read any article more
than once.
4.2.2 Latter Part
A two-week interval was set between the former and latter parts.
We chose the length of the interval for the following reasons.
First, the interval should be long enough to create a situation where
participants found it difficult to refind news articles shown in the former part. Second, the interval should not be so long that participants forget most of the cues about the news articles. To estimate an
appropriate interval, we conducted a preliminary user study with
14 students in our university, where the participants were asked to
read novels with their titles hidden and to refind them four weeks
later. We found that more than half of them could not refind the
novels with Web search engines, and decided to use a shorter interval in this study, i.e. a fortnight.
We sent e-mails to 714 participants a fortnight after the former
part, and asked them to participate in the latter part of our user
study. We excluded 85 individuals on the basis of how long it
took them to read the news articles and of the summaries they generated.
Specifically, we excluded those who did not spend more than one
minute reading the two news articles or who created a meaningless summary. The user study site was closed on the morning of March 8, 2014. As a result, 381 out of 714 participants finished all the tasks in
the latter part. Note that all the participants finished the latter part
at least 14 days after the former part, and all the participants
but two finished the latter part before 16 days had passed.
First, we asked the participants to create several search queries to
retrieve the news article they had read in the former part, explicitly showing them the category name in red. We expected that the participants
would be able to identify one of the two news articles they had read
without confusion, as two different categories were assigned and
clearly presented to the participants in the former part. The query
box used in the latter part was exactly the same as that used in the
former part. In this way, we obtained two types of queries: ones
input right after news browsing and ones input a fortnight later.
The participants were then asked to take a recall test and a recognition test. Only one of the two test types was assigned to each
episode to avoid learning effects. In the recall test, we presented in order all the questions of one of the types shown in Table
1. Participants were instructed to answer these questions to the best
of their ability by filling out a text form. In the recognition test, we
used forced-choice recognition and asked the participants to choose
from among four choices as well as a “no idea” choice. Only one
of the four choices was the fact mentioned in news articles that had
been read, while the others were not mentioned but were similar
to each other. As an example, the five choices we presented
for one news story were: “A country mentioned in the news article
was Iraq”, “A country mentioned in the news article was Syria”,
“A country mentioned in the news article was Pakistan”, “A country mentioned in the news article was Israel”, and “no idea”. We
prepared two recognition tests for each pair of a recognition type
(Table 2) and a news article; thus, the number of recognition tests
was 80 in total. One of the two recognition tests for each recognition
type was taken by the participants in random order, i.e. they took
four recognition tests for each news article.
In summary, we obtained two types of query sets for two news
articles (i.e. one formulated right after the news browsing and one
formulated a fortnight later), answers to the recall test for a news
article, and user choices in four recognition tests for another news
article.
4.3 Post-processing
In this subsection, we describe our participant filtering, annotation of queries and answers in recall tests, quantification of the difficulty of each recognition test, and development of a news search
system to measure the search performance of submitted queries.
After all participants had finished both the former and latter parts,
we excluded outliers as they might affect the resultant data. Specifically, we filtered out those who spent less than 30 seconds or more
than 600 seconds to read two news articles.
In the latter part, we obtained queries and answers to each recall
question from the participants. Since some of the queries and answers included personal messages from the participants, such as “I
cannot remember the news at all” and “Please give me a hint”, we
decided to label each query/answer and remove any such messages
by using a crowd-sourcing service. Five crowd workers were assigned to each query/answer, shown the news story the participant
had read, as well as his/her query or answer, and asked to choose
from three choices: forget (the participant explicitly states that s/he
cannot remember the news), confused (the participant obviously
talks about a different news story from what s/he read), and neither. In total, 1.31% of the queries and 35.4% of the answers were
judged as forget by three or more workers. The inter-rater agreements as measured by free-marginal multirater Fleiss’ Kappa [30]
were 0.71 for the queries, and 0.81 for the answers to the recall
questions. We treated queries/answers that were judged
as forget by three or more workers as empty. Note that we did not exclude
queries/answers judged as forget from our analysis and did not apply any operation to confused or neither queries/answers.
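The labeling step above can be illustrated with a minimal sketch, assuming five labels per item drawn from the three choices forget, confused, and neither. The function names and the `LABELS` constant are hypothetical; the agreement measure follows the usual definition of free-marginal multirater kappa, in which chance agreement is fixed at 1/k for k categories.

```python
from collections import Counter

# The three choices offered to the crowd workers.
LABELS = ["forget", "confused", "neither"]

def majority_label(labels, target="forget", threshold=3):
    """Return True if at least `threshold` workers chose `target`."""
    return Counter(labels)[target] >= threshold

def free_marginal_kappa(ratings, categories):
    """Free-marginal multirater kappa.

    ratings: list of per-item label lists, each with the same number of raters.
    categories: all labels a rater could choose from.
    """
    n_raters = len(ratings[0])
    k = len(categories)
    per_item = []
    for labels in ratings:
        counts = Counter(labels)
        # Number of ordered rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_observed = sum(per_item) / len(per_item)
    # With free marginals, chance agreement is 1/k.
    p_expected = 1.0 / k
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For instance, an item labeled forget by three workers and neither by two would be treated as empty under the three-or-more rule.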
4.4 Limitations
In this section, we discuss the three limitations of the methodology described above. Although these limitations may reduce the
value of our reported findings, readers can still interpret and use the
results by duly taking into consideration possible biases discussed
here.
First, our user study was conducted online, which is different
from ordinary psychological experiments that are typically held in
a laboratory setting. Although this enabled us to utilize a larger
sample size than laboratory experiments, the drawback is that we
have less control over participants’ behaviors: for example, they
might pretend to read news articles and reluctantly type queries and
answers to our recall questions. However, it is difficult to perfectly
control behaviors even in a laboratory setting. Moreover, as we
compared results obtained from a large number of participants, we
can expect that any unusual behaviors are evenly distributed over
different groups and do not seriously affect our conclusions.
Second, participants were instructed to read news articles and
were externally motivated to refind the news article they had seen.
As it is not common for ordinary users to be asked to refind information, our experiments were somewhat unrealistic. Forcing
the participants to use only search to find the desired information might further reduce the realism, since it has been reported that
users prefer navigating to what they look for by taking a few known
steps [37].
Third, our samples were biased in terms of search expertise.
Since we recruited individuals through an Internet research company, the participants might include more search experts than those
Footnote 1: If no crowd worker answered correctly, the score
was set to 10 (the maximum value).
In the recognition tests, we tested the recognition ability by asking participants to choose from among five choices. One of the
problems in measuring the recognizability of each recognition type
is that the accuracy of a recognition test, which is defined as the
number of participants who selected the correct choice divided by
the total number of participants, would be highly biased by the difficulty of the recognition test. The difficulty of a recognition test
is different from the accuracy of identifying an item shown before
at a later time and should be measured by excluding the effects of
memory decay as much as possible. Thus, we recruited ten crowd
workers for each recognition test, presented them with a news article, and asked them to take the associated recognition tests right
after the news browsing. We expected the accuracy of the recognition test without memory decay to approximate the difficulty of
the recognition test. In our analysis, we assume that a participant
can get a score if s/he selects the correct choice, where the score
is the inverted accuracy of the recognition test right after the news
browsing, i.e. the total number of crowd workers (10 in our case)
divided by the number of crowd workers who selected the correct
choice (see footnote 1). Intuitively, the more difficult a recognition test, the higher
its score. In the following analyses, we refer to the score as recognition score.
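As a minimal sketch of the recognition score, assuming ten crowd workers per test and a score of zero for a participant who does not select the correct choice (the function names are hypothetical):

```python
def inverted_accuracy(n_workers, n_correct, max_score=10.0):
    """Inverted accuracy of a recognition test taken right after reading.

    If no crowd worker answered correctly, the maximum score is used.
    """
    if n_correct == 0:
        return max_score
    return n_workers / n_correct

def recognition_score(participant_correct, n_workers, n_correct):
    """Score earned by one participant on one recognition test."""
    if not participant_correct:
        return 0.0
    return inverted_accuracy(n_workers, n_correct)
```

A test that only five of ten crowd workers answered correctly is thus worth 2.0 to a participant who answers it correctly a fortnight later, while a test that all ten answered correctly is worth 1.0.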
The search performances of the participants’ queries were measured with a proprietary news search system we developed based
on Apache Solr. We crawled news articles from the Web sites
of four national Japanese newspaper companies from October 26,
2013 to February 15, 2014 (three days before the first day of the
user study), and indexed them using a default Japanese tokenizer
in Solr. The total number of news articles was 77,353. Each query
submitted by the participants was evaluated with a ranking by the
Okapi BM25 algorithm, where the default Solr parameters were
used, i.e. k1 = 1.2 and b = 0.75.
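For illustration only, the following sketch scores tokenized documents with a textbook form of Okapi BM25 using k1 = 1.2 and b = 0.75. It is not the Solr implementation: the idf formula shown is one common non-negative variant, and the function name and the toy token lists are assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` (non-empty) against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[t] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Sorting documents by descending score yields the ranking from which the rank of the target news article can be read off.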
Figure 2: Mean RRs of former and latter part queries as a function of the age of participants (±SEM).
in a random sample. It is of course well known that search expertise affects search behaviors [1], so it might be inappropriate to
generalize our findings to all search engine users including novices.
5. FINDINGS
In this section, we address three research questions by reporting the findings from our user study: 1) How do users formulate a
search query with an ambiguous, unconfident memory? 2) Is asking a certain kind of question helpful for user recall? and 3) What
kind of information can users recognize the most accurately? For
question 1), we analyze the terms in queries and their search performances measured by reciprocal rank (RR), comparing the terms
of queries formulated in the former and latter parts. For question
2), we analyze the answers in the recall tests in terms of both their
content and improved search performances when the answers were
used for query expansion. For question 3), we report the recognition scores for different recognition types and compare them with
our analysis of recalled information.
In our analysis, we mainly used non-parametric significance tests
(Kruskal-Wallis and Wilcoxon signed-rank tests) because the data
we obtained do not satisfy normality in many cases. When
we conduct a post-hoc test or a multiple comparison, the Holm-Bonferroni method is employed to adjust p values. Significant effects are reported at the significance level α = 0.05.
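The Holm-Bonferroni adjustment can be sketched as follows. This is a generic step-down implementation written for illustration (the function name is hypothetical), not the code used in the study: the smallest p value is multiplied by the number of comparisons m, the next by m - 1, and so on, while enforcing monotonicity and capping at 1.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

A comparison is then reported as significant when its adjusted p value is below α = 0.05.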
5.1 Queries in Episode Refinding
Figure 2 shows the mean RRs of former and latter part queries
as a function of the age of participants, where the RR of a query
is defined as the multiplicative inverse of the rank, in the search result, of the news article shown in the
former part. Overall, there is a big performance
drop from the former part queries to the latter part ones. This drop
demonstrates the difficulty of refinding episodes after just fourteen
days. A Kruskal-Wallis test revealed significant effects of
age on the mean RR for both the former (χ²(4) = 11.0, p < 0.05)
and latter part queries (χ²(4) = 58.3, p < 0.01). A post-hoc test
using Wilcoxon Mann-Whitney rank sum tests with Holm correction showed significant differences between 20s/60s for the former
part queries (p < 0.05) and between 20s/50s, 30s/50s, 20s/60s,
30s/60s, 40s/60s, and 50s/60s for the latter part queries (p < 0.01).
This demonstrates that it is difficult, especially for users in their
late middle age, to refind episodes they had seen before. Although
this trend coincides with the trend of long-term memory decay
(e.g. [9, 28]), it is possible that users in their late middle age could
not formulate appropriate queries due to their comparatively lower
searching skills. However, this is not conclusive, since the significant difference for the former part queries can be explained by the
effect of both search skill and short-term memory decay [28].
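The RR measure used throughout this section can be sketched as below. Assigning an RR of 0 when the target article does not appear in the ranking is an assumption of this sketch, and the function names are hypothetical.

```python
def reciprocal_rank(ranked_ids, target_id):
    """Return 1/rank of `target_id` in `ranked_ids`, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    # Assumption: a query that fails to retrieve the target contributes 0.
    return 0.0

def mean_reciprocal_rank(runs):
    """Average RR over (ranked_ids, target_id) pairs."""
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)
```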
Next, we analyze differences in the terms used in the former and
latter part queries. The average number of terms per part-of-speech
in the former and latter part queries is shown in Figure 3. Note that
participants were asked to input multiple queries in both the former
and latter parts and that we took the average number of terms in
multiple queries input by each participant. As shown, the fraction
of common nouns increases while that of proper nouns decreases
in the latter part. This follows the same trend observed in psychological experiments, where the vulnerability of proper names to
memory errors has been demonstrated in learning new names and
in retrieving familiar names [7].
We further investigate the effect of the number of common and
proper nouns on the RR. Figures 4 and 5 show the mean RRs of the
latter part queries as a function of the number of common nouns
and proper nouns, respectively. There is an RR drop when more
than two common nouns were used by participants, while the RR
increases as the number of proper nouns increases. A Kruskal-Wallis test revealed significant effects of the number of common
(χ²(3) = 16.3, p < 0.01) and proper nouns (χ²(2) = 47.7,
p < 0.01) on the mean RR. A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm correction showed significant
differences between (2, inf) and each of (0, 1] and (1, 2] common
nouns (p < 0.05) and between all pairs of the number of
proper nouns (p < 0.01). The significant effect of the number
of proper nouns may imply the effectiveness of proper nouns in
episode refinding tasks and/or a correlation between the number of
proper nouns and the extent to which people remember the news content
they have read. A remaining question is why many common nouns resulted in
lower search performance. Although there are several possible explanations for this result, one might be that many
common nouns in a search query are a sign of a participant who
was struggling to remember the episode but failed.
Figure 6 shows the average number of unique-term overlaps between the former and latter part queries, where former only and
latter only indicate the average number of unique terms that appear
only in the former and latter part queries, respectively, and overlap indicates the number of unique terms that appear in both. This
figure provides another insight on query formulations a few weeks
after news browsing: there are few overlaps between the former
and latter part queries. More interestingly, there are some proper
nouns that were not used in the former part queries but were used
in the latter part ones.
Figure 3: Average number of unique terms per part-of-speech
in the former and latter part queries (+SEM).
Figure 4: Mean RRs of latter part queries as a function of the
number of common nouns (+SEM).
Figure 5: Mean RRs of latter part queries as a function of the
number of proper nouns (+SEM).
To drill down into these findings, we manually
labeled term pairs in the former and latter queries on the basis of
their word relation, i.e. hypernym, hyponym, synonym, coordinate
(two terms share a hypernym), and misspell. Since this labeling
is relatively subjective and requires knowledge of lexical ontologies, two of the authors independently labeled term pairs, and a
computer science student not involved in the writing of this paper
adjudicated when our labels were different. Since the coordinate
relation can be applied to any term pair according to its definition,
we labeled a term pair as coordinate only if the most commonly
used hypernyms of the two terms are identical. For example, the pair
“uncle” and “father” was labeled as coordinate, while the pair “uncle” and “Barack Obama” was not, since
the most commonly used hypernym of both terms in the former pair is “relative”, while those of the terms in the latter pair are “relative” and “politician”, respectively. Table 4 shows the fraction of relations between terms
in the former and latter part queries. There are two points of interest in this table: term substitutions with coordinate terms were
the most frequent at a later time, and the number of hyponym substitutions was not negligible. These two findings are relevant to false
recall and reminiscence. A recent psychological study also investigated word relations that are likely to cause false recall in the DRM
paradigm [10, 32], and reported that the most important features for
predicting false recall were situation features (functions of a word),
synonyms, and taxonomic relations including hypernym and coordinate [5]. Reminiscence is the phenomenon of recalling items that
could not be recalled before [29]. Inputting a hyponym of a term in an
episode is not false recall but rather reminiscence, as it indicates that
the participants recalled further details of the episode.
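The rule used for the coordinate label can be made concrete with a toy sketch. The labeling in the study was done manually; the lookup table below is purely hypothetical and only encodes the example given above.

```python
# Hypothetical lookup: term -> its most commonly used hypernym.
MOST_COMMON_HYPERNYM = {
    "uncle": "relative",
    "father": "relative",
    "Barack Obama": "politician",
}

def is_coordinate(term_a, term_b, hypernyms=MOST_COMMON_HYPERNYM):
    """True if two distinct terms share the same most commonly used hypernym."""
    hyper_a = hypernyms.get(term_a)
    hyper_b = hypernyms.get(term_b)
    # Terms missing from the lookup are never labeled as coordinate.
    return term_a != term_b and hyper_a is not None and hyper_a == hyper_b
```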
In summary, we analyzed terms in queries formulated in the former and latter parts and reported the following findings: a) age had
a significant effect on the mean RR, especially for queries formulated in the latter part; b) the fraction of common nouns increased,
while that of proper nouns decreased in the latter part; c) there was
a correlation between the number of proper nouns and the mean
RR; d) there were few overlaps between the former and latter part
queries; and e) coordinate term substitutions were the most frequent in the latter part queries, and hyponym substitutions were
observed in some queries.
Figure 6: Average number of term overlaps between former
and latter part queries (+SEM).
Table 4: Fraction of relations between terms in former and latter part queries.
Hypernym   Hyponym   Synonym   Coordinate   Misspell
22.7%      7.55%     17.3%     48.9%        3.60%
Figure 7: Mean RR gains by query expansion with answers to
each type of recall questions (+SEM).
5.2 Recall in Episode Refinding
Figure 7 shows the mean RR gain by query expansion with answers to each type of recall question, where we used for query
expansion all the words, only common nouns, and only proper
nouns in the answers. We added words in participants’ answers
to their queries by using OR operators, and measured the gain of
RRs by submitting the expanded queries to our news search system. A Wilcoxon signed-rank test revealed a significant effect of
query expansion on the RR, or a significant difference in the RR
between unexpanded and expanded queries (all words) (Z = 4.27,
p < 0.01, ES = 0.155). However, there was no significant
difference among the recall question types when we performed
a Kruskal-Wallis test. Thus, we can conclude that asking questions can significantly improve search performance even
two weeks later, in line with the experiments of Kelly et al. [22].
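The expansion step can be sketched as follows. The exact query syntax submitted to the search system is not specified here, so the grouping of the original query in parentheses and the function names are assumptions made for illustration.

```python
def expand_query(query_terms, answer_terms):
    """Append unseen answer terms to the original query with OR operators."""
    seen = set(query_terms)
    extra = []
    for term in answer_terms:
        if term not in seen:
            seen.add(term)
            extra.append(term)
    # Assumption: the original query is kept as one group, answers are OR-ed on.
    return " OR ".join(["(" + " ".join(query_terms) + ")"] + extra)

def rr_gain(rr_original, rr_expanded):
    """RR gain of the expanded query over the unexpanded one."""
    return rr_expanded - rr_original
```

Restricting `answer_terms` to common nouns or to proper nouns gives the other two expansion conditions compared in Figure 7.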
We then evaluate the contribution of terms in the recall tests to
RR gains. A Friedman rank sum test revealed a significant effect
of the query expansion method (i.e. all words, common nouns, and
proper nouns) on the RR gain (χ²(2) = 30.7, p < 0.01). A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm
correction showed significant differences between proper nouns
and all words as well as common nouns (p < 0.01). Through this
significance test, we intended to emphasize the difference in the
effectiveness of proper nouns in the queries and in the recall tests, i.e.
proper nouns in the recall tests did not contribute to the RR gain, even
though those in queries were strong indicators of high RRs, as was
seen in Figure 5. Although this comparison is not strictly appropriate, as
Figure 5 shows RRs with all the words in the latter queries, we
can at least argue that proper nouns in the recall tests alone did
not improve the search performance, whereas common nouns did.
Figure 8: Average number of unique terms used in answers to
each type of recall question (+SEM).
Figure 8 shows the average number of unique terms used in answers to each type of recall question. The part-of-speech distribution in the recall tests is different from that in the latter queries
in that more common nouns and verbs were input, and fewer proper
nouns were used in the recall tests. On average, there is about one
unique proper noun, while there are more than three unique common nouns used in answers in the recall tests. It therefore seems difficult
to explain the small contribution of proper nouns to RR gains by their number alone, as
the difference between the numbers of proper and
common nouns is not large enough to account for the difference in RR gain.
In summary, we demonstrated the significant effect of query expansion by recalled results, and showed the low performance improvement of proper nouns in the recall tests. Although we analyzed several other effects of the age, news category, and individual recall question, significant effects were not observed in our user
study.
5.3 Recognition in Episode Refinding
Figure 9 shows the mean recognition score for each recognition type. Recall that a recognition score was defined as the inverted accuracy of a recognition test right after the news browsing
if a participant could select the correct answer, and as 0 otherwise. A
Kruskal-Wallis test revealed significant effects of the recognition
type on the recognition score (χ²(3) = 43.8, p < 0.01). A post-hoc test using Wilcoxon Mann-Whitney rank sum tests with Holm
correction showed significant differences between all pairs
of recognition types except the pair of object name and minor action
types (p < 0.05). Thus, the major action type was the easiest,
while the attribute type was the hardest to recognize in the episode
refinding task.
Figure 9: Mean recognition score for each recognition type
(+SEM).
As we expected, this could be explained by the finding that memory for semantic information lasts longer than
that for lexical and syntactic information [24]. In other words,
people can more accurately recognize what is described than what
is literally written in news articles. Although the mean recognition score of object names is lower than that of major actions, the
score is not as low as the contribution of proper nouns in the recall tests would suggest. While the difficulty of remembering proper nouns such
as person names has been demonstrated in the literature, it was
also reported that recognition of names is less difficult than free recall [2]. The word frequency effect, whereby high-frequency words are recalled better
and low-frequency words are recognized better [15], can also explain the advantage
of proper nouns in the recognition tests. Therefore, people can recognize object names to some extent, despite the low frequency and small performance
improvement of proper nouns in the recall tests.
Note that we also compared the recognition accuracy of each
recognition type, i.e. the number of participants who selected the
correct choice divided by the number of participants assigned. The
result is similar to the recognition scores: 0.239 (±0.0225), 0.426
(±0.0241), 0.566 (±0.0268), and 0.354 (±0.0240) for attribute,
object name, major action, and minor action, respectively (SEM
in parentheses). A Chi-square test revealed that the recognition
accuracy significantly differed by the recognition type (χ²(3) =
82.9, p < 0.01). A post-hoc test using Chi-square tests with Holm
correction showed significant differences between all pairs of
recognition types (p < 0.01). The conclusion is almost the same as
in the discussion above: people can recognize major actions the most
accurately, and can also recognize object names with reasonable
accuracy.
Figure 10 shows the mean recognition score for each recognition
type and age pair. A Kruskal-Wallis test revealed significant effects
of the age of participants on the recognition score (χ2 (4) = 25.3,
p < 0.01). A post-hoc test using Wilcoxon Mann-Whitney rank
sum tests with Holm correction showed the significant differences
between 60s and the other ages (p < 0.01). Thus, participants in
their late middle age have a significantly lower recognition ability
than the other ages. Even though it is not conclusive due to lack of
significance, there exist different trends of recognition ability decay for recognition types and ages. There is little recognizability
difference in ages for major and minor action recognition, while
recognizability for attributes and object names sharply decreases
with advancing years. Differences in the vulnerability of memory
across ages have been reported in several studies. For example, people in their late middle age tend to perform better in
recognition tests than in recall tests, and their recognition performance is not low compared to that of young people [8].
In summary, we revealed significant effects of the recognition
type and age on the recognition ability, and found that major actions
in episodes are the most recognizable. In spite of their low frequency in
the recall tests, object names are also recognizable to some extent.
A significant effect of the age of participants was also found in the
recognition tests.
Figure 10: Mean recognition score for each recognition type
and age pair (+SEM).
6. DISCUSSIONS AND IMPLICATIONS
Our user study revealed that user queries fail to retrieve
episodes read a fortnight earlier, to what extent asking questions and eliciting information from users is effective for improving the search performance, and what kind of information users
can recognize accurately. This section discusses approaches to the
problems raised in our analysis for better episode refinding.
We found that, after several days have passed, queries contain
fewer proper nouns, which are replaced with their coordinate terms or
hypernyms. Moreover, as suggested by the result in Figure 5, we
could hypothesize that the search performance would be improved
if users could correctly remember proper nouns. Therefore, search
engine functionalities could be augmented in the following ways.
Query expansion can expand a proper noun in a query by adding
its coordinate terms with OR operators, since we found that terms
in a query are likely to be replaced with their coordinate terms.
When a user inputs a hypernym instead of a certain proper noun,
query suggestion can help users narrow down the search results by
suggesting queries that include hyponyms of the input term.
These functionalities should be triggered when we find users facing
difficulties in refinding. This could be detected on the basis of the
fraction of general terms, or trial and error in query formulations
with different coordinate terms, as they might be caused by loss
of memories. Studies on searcher frustration may also help search
engines detect user difficulties in refinding [14].
A straightforward approach to the low search performance several days later is to ask users questions and to expand their
queries with their answers, as suggested by the result in Figure
7. Although our study could not identify what kinds of questions
are the most effective in eliciting information for refinding, our finding suggests that proper nouns in user recall are not useful for improving the search performance. Thus, effective questions are ones
that elicit not proper nouns but common nouns. For example, we
should not ask questions such as “What is the name of the main character?”
or “In which city did the accident happen?”. More general questions are expected to work better, e.g. “What sort of people played
a main role?” or “What happened in the news?”. On the other hand,
general questions tend to be rarely answered well, as previous work
on question clarification reported [21]. Thus, it is also necessary to
generate questions customized for the user by utilizing the content
of the input query.
The results of the recognition tests imply that snippets of search
results should include object names and actions rather than attributes
such as time, address, and characteristics of characters/objects in
the episode. The difference between recall and recognition ability supports query suggestions that encourage users to input correct proper nouns, as object names are less likely to be recalled but
likely to be recognized accurately.
7. CONCLUSIONS
This study investigated recall and recognition in a news refinding
task a fortnight later. Our user study revealed that there is a big
drop in search performance, that asking questions and expanding
input queries on the basis of the answers significantly improved the
search performance, and that the users’ recognition abilities were
different from their recall abilities. Our findings supported several
findings in cognitive psychology from the viewpoint of information
refinding and also had several implications for search algorithms
for assisting user refinding.
Future work includes further studies on resolving ambiguous expressions, and development of a system that generates questions
that effectively and efficiently elicit information from users.
8. ACKNOWLEDGMENTS
We would like to thank Professor Takashi Kusumi (Kyoto University) for his advice on our study. This work was supported in
part by the following projects: Grants-in-Aid for Scientific Research (Nos. 24240013 and 26700009) from MEXT of Japan, and
Microsoft Research CORE Project.
9. REFERENCES
[1] A. Aula. Query formulation in web information search. In ICWI,
pages 403–410, 2003.
[2] H. P. Bahrick, P. O. Bahrick, and R. P. Wittlinger. Fifty years of
memory for names and faces: A cross-sectional approach. Journal of
experimental psychology: General, 104(1):54–75, 1975.
[3] T. J. Bireta, A. M. Surprenant, and I. Neath. Age-related differences
in the von restorff isolation effect. The Quarterly Journal of
Experimental Psychology, 61(3):345–352, 2008.
[4] G. H. Bower. Mood and memory. American psychologist, 36(2):129,
1981.
[5] D. R. Cann, K. McRae, and A. N. Katz. False recall in the
deese–roediger–mcdermott paradigm: The roles of gist and
associative strength. The Quarterly Journal of Experimental
Psychology, 64(8):1515–1542, 2011.
[6] G. Cohen. Why is it difficult to put names to faces? British Journal
of Psychology, 81(3):287–297, 1990.
[7] G. Cohen and D. M. Burke. Memory for proper names: A review.
Memory, 1(4):249–263, 1993.
[8] F. I. Craik. Age-related changes in human memory. Cognitive aging:
A primer, 5:75–92, 2000.
[9] F. I. Craik. Human memory and aging. In Psychology at the turn of
the millennium, volume 1, pages 261–280, 2002.
[10] J. Deese. On the prediction of occurrence of particular verbal
intrusions in immediate recall. Journal of experimental psychology,
58(1):17–22, 1959.
[11] S. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C.
Robbins. Stuff i’ve seen: a system for personal information retrieval
and re-use. In SIGIR, pages 72–79, 2003.
[12] D. Elsweiler, M. Baillie, and I. Ruthven. Exploring memory in email
refinding. ACM Transactions on Information Systems (TOIS),
26(4):21, 2008.
[13] W. K. Estes. Processes of memory loss, recovery, and distortion.
Psychological Review, 104(1):148–169, 1997.
[14] H. A. Feild, J. Allan, and R. Jones. Predicting searcher frustration. In
SIGIR, pages 34–41, 2010.
[15] V. Gregg. Word frequency, recognition, and recall. John Wiley &
Sons, 1976.
[16] K. Haas, P. Mika, P. Tarjan, and R. Blanco. Enhanced results for web
search. In Proc. of SIGIR, pages 725–734. ACM, 2011.
[17] E. Horvitz, S. Dumais, and P. Koch. Learning predictive models of
memory landmarks. In CogSci, pages 583–588, 2004.
[18] I.-N. Huang and C. Wille. The von restorff isolation effect in free
recall. The Journal of General Psychology, 101(1):27–34, 1979.
[19] Y. Huang, Z. Liu, and Y. Chen. Query biased snippet generation in
xml search. In Proc. of SIGMOD, pages 315–326. ACM, 2008.
[20] R. R. Hunt. The subtlety of distinctiveness: What von restorff really
did. Psychonomic Bulletin & Review, 2(1):105–112, 1995.
[21] M. P. Kato, R. W. White, J. Teevan, and S. T. Dumais. Clarifications
and question specificity in synchronous social q&a. In CHI Extended
Abstracts, pages 913–918, 2013.
[22] D. Kelly, V. D. Dollu, and X. Fu. The loquacious user: a
document-independent source of terms for query expansion. In Proc.
of SIGIR, pages 457–464. ACM, 2005.
[23] J. Kim, W. B. Croft, D. Smith, and A. Bakalov. Evaluating an
associative browsing model for personal information. In CIKM,
pages 647–652. ACM, 2011.
[24] W. Kintsch, D. Welsch, F. Schmalhofer, and S. Zimny. Sentence
memory: A theoretical analysis. Journal of Memory and language,
29(2):133–159, 1990.
[25] E. F. Loftus and J. C. Palmer. Reconstruction of automobile
destruction: An example of the interaction between language and
memory. Journal of verbal learning and verbal behavior,
13(5):585–589, 1974.
[26] B. B. Murdock Jr. The serial position effect of free recall. Journal of
experimental psychology, 64(5):482–488, 1962.
[27] H. Obendorf, H. Weinreich, E. Herder, and M. Mayer. Web page
revisitation revisited: implications of a long-term click-stream study
of browser usage. In CHI, pages 597–606, 2007.
[28] D. C. Park, G. Lautenschlager, T. Hedden, N. S. Davidson, A. D.
Smith, and P. K. Smith. Models of visuospatial and verbal memory
across the adult life span. Psychology and aging, 17(2):299–320,
2002.
[29] D. G. Payne. Hypermnesia and reminiscence in recall: A historical
and empirical review. Psychological Bulletin, 101(1):5–27, 1987.
[30] J. J. Randolph.
Free-marginal multirater kappa (multirater κfree): an alternative to
Fleiss’ fixed-marginal multirater kappa. In Joensuu learning and
instruction symposium, 2005.
[31] M. Ringel, E. Cutrell, S. Dumais, and E. Horvitz. Milestones in time:
The value of landmarks in retrieving information from personal
stores. In INTERACT,
pages 184–191, 2003.
[32] H. L. Roediger and K. B. McDermott. Creating false memories:
Remembering words not presented in lists. Journal of experimental
psychology: Learning, Memory, and Cognition, 21(4):803–814,
1995.
[33] S. R. Schmidt. Encoding and retrieval processes in the memory for
conceptually distinctive events. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 11(3):565–578, 1985.
[34] A. M. Surprenant and I. Neath. Principles of memory. Taylor &
Francis, 2009.
[35] J. Teevan. How people recall, recognize, and reuse search results.
ACM TOIS, 26(4):19, 2008.
[36] J. Teevan, E. Adar, R. Jones, and M. A. Potts. Information
re-retrieval: repeat queries in yahoo’s logs. In Proc. of SIGIR, pages
151–158, 2007.
[37] J. Teevan, C. Alvarado, M. S. Ackerman, and D. R. Karger. The
perfect search engine is not enough: a study of orienteering behavior
in directed search. In CHI, pages 415–422, 2004.
[38] A. Tombros and M. Sanderson. Advantages of query biased
summaries in information retrieval. In Proc. of SIGIR, pages 2–10.
ACM, 1998.
[39] E. Tulving. Elements of Episodic Memory. OUP Oxford, 1985.
[40] E. Tulving and D. M. Thomson. Encoding specificity and retrieval
processes in episodic memory. Psychological review, 80(5):352,
1973.
[41] E. Tulving and S. Wiseman. Relation between recognition and
recognition failure of recallable words. Bulletin of the Psychonomic
Society, 6(1):79–82, 1975.