Time-Aware Rank Aggregation for Microblog Search

Shangsong Liang†  Zhaochun Ren†  Edgar Meij§  Wouter Weerkamp‡  Maarten de Rijke†
†University of Amsterdam, Amsterdam, The Netherlands
‡904Labs, Amsterdam, The Netherlands
§Yahoo Labs, Barcelona, Spain
ABSTRACT
We tackle the problem of searching microblog posts and frame it as
a rank aggregation problem where we merge result lists generated
by separate rankers so as to produce a final ranking to be returned to
the user. We propose a rank aggregation method, TimeRA, that is
able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near
a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing
Keywords
Rank aggregation; data fusion; ad hoc retrieval; microblog search
1.
INTRODUCTION
The task of searching microblog posts has been studied widely.
Many effective algorithms have been proposed, including, for example, those based on language modeling [9], query expansion [21],
and time-sensitive features [3]. Different microblog search algorithms have different merits; hence, combining these alternative algorithms into a single model may boost retrieval performance. In
this paper we follow this idea and tackle the challenge of microblog
search by following a rank aggregation (also called data fusion) approach. Our input consists of a number of ranked lists generated by
different microblog search algorithms and the output is an aggregated and, hopefully, better list.
Rank aggregation approaches can inherit the merits of the individual search algorithms whose outputs are being aggregated. Also,
they may boost recall [31] and because they aggregate the results
of multiple strategies, they mitigate risks associated with opting for
a single strategy [31]. In rank aggregation, documents in the fused
list are ranked in decreasing order of their fusion scores. The fusion score of a document is usually a function of the rank scores
from the individual input lists, such as the weighted sum. Previous work on rank aggregation often assumes—either implicitly or
explicitly—that the rank score of a document is set to zero for an
input list if that document does not appear in the list. We challenge
this assumption; our intuition is that documents that are similar to
a document d that does occur in an input list, should get similar
rank scores. Let us call a document that does not appear in list L,
but does appear in other lists we aim to fuse, a missing document
for L (see Figure 1). We propose a rank aggregation algorithm
where we apply a latent factor model to predict the rank scores of
missing documents. In our rank aggregation algorithm, we define
a list-document rank score matrix, factorize this matrix, and utilize
the factorized list-specific and document-specific matrices to assign
scores to missing documents.
State-of-the-art rank aggregation methods often work on the assumption that only documents that are ranked highly in many of the
result lists being aggregated are likely to be relevant [1, 4, 5, 11, 13,
22]. As a result, a relevant document might be ranked low in an aggregated list if it appears in only few of the input result lists and
is ranked low within these. The characteristics of microblog environments suggest a different perspective. In such environments
people tend to talk about a topic within specific short time intervals [10, 20, 23]. E.g., people talked about the “2014 Eastern Synchronized Skating Sectional Championship” mainly between January 30 and February 1 2014, which is when the championship
took place. Posts generated before or after the event are less likely
to talk about the championship and, hence, less likely to be relevant
to a query about the championship.
Inspired by these characteristics of microblog search scenarios,
we propose a rank aggregation algorithm, TimeRA, that is time-aware. TimeRA detects windows (intervals) of timestamps of posts
ranked high in many of the lists to be fused. These windows give
rise to bursts of posts, and subsequently TimeRA rewards posts that
are published in the vicinity of a burst that contains highly scored
posts. Specifically, if a post di and posts d1 , . . . , dk were created
within the same narrow time window, and the posts d1 , . . . , dk are
ranked highly in many of the lists being merged, then post di will be
“rewarded,” even if, in the extreme case, it appears in only one list
where it is ranked low. Fig. 1 illustrates this: post d2 is ranked low in list L1 but will be rewarded as it was published in a time window in which a large number of posts are published that are ranked high in many lists. But d8, while ranked high in Lm, receives no such bonus, as it was published outside the time window.

Figure 1: Rewarding posts that are published in the same narrow time frame as a large number of (supposedly) relevant posts. The vertical axis indicates the summed scores of posts with the same timestamps. There are m lists in response to a query. Post d2 only occurs in list L1 and is ranked low; d8 also occurs in a single list, Lm, but is ranked very high. Surrounding d2 there are many documents that are ranked high in many lists; therefore, d2 should be "rewarded," while d8 should not. Documents marked in blue are "missing" documents; e.g., d6 is a missing document for L1, as it is not observed in L1 during the aggregation process.
Our results show that TimeRA is able to effectively aggregate
result lists, subject to real-time search requirements, running almost as fast as standard fusion methods such as CombSUM [5],
and outperforming both time-sensitive microblog search methods and
competing fusion algorithms.
Our contributions can be summarized as follows:
• We propose a rank aggregation-based algorithm for microblog search, TimeRA, that outperforms traditional and state-of-the-art unsupervised and supervised rank aggregation methods as well as the best single result list that it aggregates.
• We examine the relative contributions of the main ingredients of TimeRA (burst detection and score inference) and
find that burst detection is effective in helping rank aggregation and that inferring scores for missing documents during
aggregation boosts performance as well.
• We provide a detailed analysis of the performance of TimeRA
and offer a number of examples where we observe the effect
hypothesized in Fig. 1, i.e., of posts in bursts having their
rank score boosted.
• We show that the effect of TimeRA on ranking microblog
posts is different from the effect of existing time-sensitive
microblog search models.
• And, finally, we study the efficiency of TimeRA and show
that it meets real-time search requirements.
2.
RELATED WORK
Three major types of research are closely related to our work:
microblog search, rank aggregation and latent factor modeling.
2.1  Microblog search
Searching posts that are short and creatively spelled in microblogging platforms, like Twitter, is challenging. Previous work on microblog search has focused on employing traditional IR techniques, e.g., learning to rank [6], document expansion [19], and pseudo-relevance feedback retrieval [21], to improve performance. Some previous work explores time information to boost performance. For instance, Li and Croft [12] propose a time-sensitive language model to explore the time and relevance of queries; Massoudi et al. [19] integrate temporal information into a dynamic expansion microblog retrieval model; and Dakka et al. [3] consider the publication time of documents in conjunction with topic similarity to derive a time-sensitive ranking. More recently, Miyanishi et al. [21] integrate temporal and lexical evidence into their query expansion method for microblog search.
We propose to develop a rank aggregation-based approach to microblog search. To the best of our knowledge, this is the first attempt to tackle the problem of microblog search as a rank aggregation problem.
2.2
Rank aggregation
Rank aggregation approaches attempt to inherit the merits of different search algorithms, and to combine their result lists in order to
produce a new and hopefully better ranking [14, 16, 26]. In recent
years, rank aggregation has featured in many applications, including federated search [28], vertical search [2], and more recently,
search result diversification [14, 15]. Many rank aggregation methods have been proposed, including CombSUM and CombMNZ [26],
an outranking approach [14, 31], semantic fusion [16], cluster-based fusion [7] and λ-Merge [27]. CombSUM and CombMNZ are the oldest of these methods, but they remain fast and effective. The outranking approach only utilizes the rank of documents in the result
lists during aggregation. Both semantic fusion and cluster-based
fusion use the assumption that content-similar documents should
be clustered to boost the performance of aggregation. Finally, λ-Merge tries to learn a model from labeled data for aggregation.
Previous rank aggregation algorithms ignore the temporal information of the documents to be merged. We propose a new rank
aggregation method in which we fully utilize temporal information and reward posts that are published in the vicinity of a burst, so that they are ranked
higher in the fused list. To the best of our knowledge, this is the first
attempt to integrate temporal information into rank aggregation.
2.3
Latent factor modeling
Latent factor models are often used in collaborative filtering (CF)
and recommender systems [8, 25]. Matrix factorization methods
form a group of well-known latent factor models that include, for instance, singular value decomposition [8], probabilistic matrix factorization [25] and social regularization [18]. These methods
first model users with latent interests and the products with latent
features by matrix factorization, and then try to predict the rating
of products for the given users with the observations of the existing
users’ rating data [8, 18, 25]. Motivated by latent factor models,
rather than simply setting no rank scores for missing documents as
in previous rank aggregation methods [5, 7, 27, 31], our proposed
rank aggregation algorithm applies matrix factorization to model
both result lists (called “users” in CF) and documents (“products”
in CF) as a mixture of latent factors (“interests” in CF), such that
the rank scores of missing documents (“the rating of unobserved
products” in CF) in a result list for aggregation can be inferred.
Reasons why a latent factor model can infer scores can be found
in [8, 18, 25].
To the best of our knowledge, this is the first attempt to predict
rank scores for missing documents in rank aggregation.
3.
PRELIMINARIES
We detail the task that we address and recall two standard rank
aggregation algorithms, CombSUM and CombMNZ.
3.1
Problem formulation
The task we address is this: given an index of microblog posts, a query, and a set of ranked lists of posts with timestamps produced in response to the query, aggregate the lists into a final ranked list of posts to be returned in response to the query. Hence, the rank aggregation problem consists of finding a ranking function f_X such that:

$$f_X:\; L = \{L_1, L_2, \ldots, L_m\},\; q,\; C \;\longrightarrow\; L_f,$$

where L is the set of lists to be fused, Li is the i-th result list, m = |L| is the number of input result lists, C is the corpus of microblog posts, q is a query, and Lf is the final fused list.

Table 1: Main notation used in TimeRA.

Notation      Gloss
R             m × |CL| list-post rank score matrix
A             number of latent factors
S             matrix used for inferring latent topics for L
V             matrix used for inferring latent topics for CL
Si            a column vector in S used for aspects of Li
Vj            a column vector in V used for aspects of dj
Rij           rank score of dj in list Li
Rij           inferred rank score of post dj ∈ CL \ Li
dk ≻Li dj     dk ranks higher than dj in Li
ti            timestamp
dti           post with timestamp ti
T             number of different timestamps of posts in CL
b             a burst of posts
B             set of all bursts in CL (given query q)
Iij           indicates whether dj ∈ Li
σb            timestamp standard deviation of a burst
r(dj|dk)      "reward" score for dj if dj is within or near the burst b to which dk belongs
w(dj|Li)      rank punishment weighting function for dj ∈ Li
β             relative weight of burst information in TimeRA
3.2  Standard rank aggregation
Before we move on to the details of our rank aggregation method,
we set the stage regarding rank aggregation by briefly recalling two
standard fusion methods: CombSUM and CombMNZ. Let Rij denote the rank score of document dj in list Li based on the rank of
dj in the list Li ; in the literature on rank aggregation [5, 7, 11, 31],
one often finds Rij = 0 if dj ∉ Li (dj still in the combined set of posts CL := ∪_{i=1}^m Li). In both CombSUM and CombMNZ, Rij is often defined as:

$$R_{ij} = \begin{cases} \dfrac{(1+|L_i|)-\mathit{rank}(d_j|L_i)}{|L_i|} & d_j \in L_i \\[4pt] 0 & d_j \notin L_i, \end{cases} \qquad (1)$$

where |Li| is the length of Li and rank(dj|Li) ∈ {1, ..., |Li|} is the rank of dj in Li. The well-known CombSUM fusion method [5, 31], for instance, scores dj by the sum of its rank scores in the lists:

$$f_{\mathrm{CombSUM}}(d_j|q) := \sum_{i=1}^{m} R_{ij},$$

while CombMNZ [5, 31] rewards dj that ranks high in many lists:

$$f_{\mathrm{CombMNZ}}(d_j|q) := |\{L_i : d_j \in L_i\}| \cdot f_{\mathrm{CombSUM}}(d_j|q),$$

where |{Li : dj ∈ Li}| is the number of lists in which dj appears.
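To make the two standard baselines concrete, the following is a minimal Python sketch of rank-based CombSUM and CombMNZ over lists of post IDs, using the rank score of Eq. 1; the helper names and the toy input are ours, not part of any of the systems discussed here.

```python
from collections import defaultdict

def rank_scores(ranked_list):
    """Eq. 1: rank score of each post in one list; posts not in the list get 0."""
    n = len(ranked_list)
    return {doc: (1 + n - rank) / n for rank, doc in enumerate(ranked_list, start=1)}

def comb_sum(lists):
    """CombSUM: sum of per-list rank scores over the combined post set."""
    fused = defaultdict(float)
    for ranked_list in lists:
        for doc, score in rank_scores(ranked_list).items():
            fused[doc] += score
    return dict(fused)

def comb_mnz(lists):
    """CombMNZ: CombSUM multiplied by the number of lists containing the post."""
    hits = defaultdict(int)
    for ranked_list in lists:
        for doc in ranked_list:
            hits[doc] += 1
    return {doc: hits[doc] * score for doc, score in comb_sum(lists).items()}

# Posts sorted by decreasing fused score form the aggregated list.
lists = [["d4", "d1", "d2", "d3"], ["d7", "d6", "d5", "d2"]]
fused = sorted(comb_mnz(lists).items(), key=lambda kv: -kv[1])
```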
4.  TIME-AWARE RANK AGGREGATION
We detail our time-aware rank aggregation algorithm in this section; §4.1 describes the way we detect bursts, §4.2 details our aggregation method, TimeRA, and §4.3 contains a brief complexity analysis of TimeRA.
To be able to present rank aggregation methods in a uniform way, we first introduce and summarize our notation in Table 1. Although introduced some time ago, CombSUM and CombMNZ are still effective and very efficient rank aggregation methods. More recent fusion methods have been proposed, but most of them may be inappropriate for searching posts in microblogging scenarios, which are dynamic and need rapid updates. For instance, cluster-based rank aggregation [7] and λ-Merge [27] have been shown to be more effective than CombSUM and CombMNZ in several cases. However, a drawback of cluster-based fusion is that it has to access the content of documents to compute inter-document similarities so as to generate a set of clusters and determine probabilities based on the clusters. A drawback of λ-Merge is that it also has to access the content to extract features, and it often overfits [27].

4.1  Burst detection
Our proposed TimeRA method utilizes burst information. To
detect bursts, we proceed as follows. Let t be a timestamp. Let dt
∈ CL denote a post with timestamp t. Here, CL is the set of posts
appearing in the result lists. We regard posts published during the
same hour as having the same timestamp. Although it is possible
to define “the same timestamp” in many different ways, we found
that this is a suitable level of granularity for the effectiveness of
rank aggregation. Before we detect bursts in posts CL , we need
to define Ht , the burst-time score at time t. Let Rkt (obtained by
Eq. 1) be the rank score of dt given Lk . Then:
$$H_t = \frac{\sum_{d_t \in C_t}\sum_{k=1}^{m} R_{kt}}{\sum_{t'=1}^{T}\sum_{d_{t'} \in C_{t'}}\sum_{k=1}^{m} R_{kt'}} - \frac{1}{T}, \qquad (2)$$
where T is the total number of different timestamps belonging to posts in CL, Ct is the set of posts having the same timestamp t, $\sum_{d_t \in C_t}\sum_{k=1}^{m} R_{kt}$ is the sum of the rank scores of posts having the same timestamp t, and $\sum_{t'=1}^{T}\sum_{d_{t'} \in C_{t'}}\sum_{k=1}^{m} R_{kt'}$ is the sum of the rank scores of all posts in CL. According to Eq. 2, the burst-time score Ht > 0 if it is above the average score (i.e., 1/T), and Ht ≤ 0 otherwise.
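As an illustration, here is a minimal sketch of the burst-time score of Eq. 2, assuming that hourly timestamps and the per-post sums of the rank scores of Eq. 1 are already available; the function and variable names are ours.

```python
def burst_time_scores(post_hours, post_rank_sums):
    """Eq. 2: H_t = (rank-score mass at hour t) / (total mass) - 1/T.

    post_hours: post id -> hourly timestamp t
    post_rank_sums: post id -> sum over lists k of R_{kt} for that post
    """
    per_hour = {}
    for doc, hour in post_hours.items():
        per_hour[hour] = per_hour.get(hour, 0.0) + post_rank_sums.get(doc, 0.0)
    total = sum(per_hour.values())
    T = len(per_hour)                     # number of distinct timestamps
    return {hour: mass / total - 1.0 / T for hour, mass in per_hour.items()}
```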
Using Eq. 2, we compute a burst-time score Ht for each time point t ∈ {t1, t2, ..., tT}. In this manner we generate a burst-time score sequence $\vec{H} = \{H_{t_1}, H_{t_2}, \ldots, H_{t_T}\}$. Following [24], a segment $\vec{H}[t_i : t_j] = \{H_{t_i}, H_{t_{i+1}}, \ldots, H_{t_j}\}$, where 1 ≤ i ≤ j ≤ T, is a maximal segment in $\vec{H}$ if:
i. All proper subsequences of $\vec{H}[t_i : t_j]$ have a lower score (the burst-time score of a sequence is the sum of the burst-time scores of the elements in the sequence).
ii. No proper super-segment of $\vec{H}[t_i : t_j]$ in $\vec{H}$ satisfies item i.
As an example, consider the input sequence $\vec{H}$ = {2, −2, 4, 3, −3, −4, −1, −3, 5, −1, 3, −2}. The maximal segments in this sequence are {2}, {4, 3} and {5, −1, 3}. The segment {2, −2, 4, 3} is not maximal, since it has a nonempty zero-scoring prefix {2, −2} appended to the left of {4, 3}; {5} is not a maximal segment, since {5, −1, 3} has a higher total score of 7.
Each maximal segment $\vec{H}[t_i : t_j]$ gives rise to a burst of posts b[ti : tj] with start timestamp ti and end timestamp tj: it contains any post d ∈ CL whose timestamp is between ti and tj. We write B to denote the set of all bursts, i.e., $B = \bigcup b[t_i : t_j]$. We let b be short for b[ti : tj] in the following.
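The maximal segments can be extracted in linear time with the algorithm of Ruzzo and Tompa [24]. The sketch below is one possible implementation (not the authors' code) and reproduces the worked example above, returning {2}, {4, 3} and {5, −1, 3}.

```python
def maximal_segments(scores):
    """All maximal scoring subsequences of `scores` (Ruzzo & Tompa [24]).
    Returns (start, end) index pairs with `end` exclusive."""
    segments = []   # candidates as [start, end, L, R]: L is the cumulative score
                    # just before `start`, R the cumulative score at `end`
    total = 0.0
    for idx, s in enumerate(scores):
        if s <= 0:
            total += s
            continue
        cand = [idx, idx + 1, total, total + s]
        total += s
        while True:
            # rightmost previous segment j whose left cumulative score is lower
            j = next((k for k in range(len(segments) - 1, -1, -1)
                      if segments[k][2] < cand[2]), None)
            if j is None or segments[j][3] >= cand[3]:
                segments.append(cand)
                break
            # otherwise merge segment j with the candidate and retry
            cand = [segments[j][0], cand[1], segments[j][2], cand[3]]
            del segments[j:]
    return [(start, end) for start, end, _, _ in segments]

H = [2, -2, 4, 3, -3, -4, -1, -3, 5, -1, 3, -2]
assert maximal_segments(H) == [(0, 1), (2, 4), (8, 11)]   # {2}, {4, 3}, {5, -1, 3}
```

Each returned index range corresponds to a burst b[ti : tj]: all posts whose timestamp falls between the start and end timestamps of the segment.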
4.2  The aggregation method
Next we detail our time-aware rank aggregation method, TimeRA. TimeRA utilizes matrix factorization techniques. The input of the matrix factorization in TimeRA is an m × n matrix R ∈ R^{m×n} which we call a list-document matrix; here, again, m is the number of result lists and n is the number of posts to be fused, i.e., n = |CL|. R is initialized by Eq. 1, viz., its elements Rij are defined by Eq. 1. The output of the matrix factorization consists of two new matrices S ∈ R^{A×m} and V ∈ R^{A×n}, obtained by factorizing R, which we call the factor-list matrix and the factor-post matrix, respectively. Here, A is the number of latent factors. After obtaining S and V, we can infer the normalized scores of missing posts, based on which we arrive at the aggregation scores of all the posts.
Algorithm 1: TimeRA: Time-aware rank aggregation.
Input: A query q, a number of ranked lists of posts to be fused, L = {L1, L2, ..., Lm}, the combined set of posts CL := ∪_{i=1}^m Li, and a learning rate δ.
Output: A final rank aggregation list of posts Lf.
1:  Lf = ∅; initialize R by Eq. 1; initialize S and V with random values; let m = |L| and t = 0;
2:  for j = 1, 2, ..., n do
3:      fTimeRA(dj|q) = 0;
4:  Generate a set of bursts B;
5:  repeat
6:      for i = 1, 2, ..., m do
7:          S_i^{(t+1)} = S_i^{(t)} − δ ∂C2/∂S_i^{(t)} based on Eq. 15;
8:      for j = 1, 2, ..., n do
9:          V_j^{(t+1)} = V_j^{(t)} − δ ∂C2/∂V_j^{(t)} based on Eq. 15;
10:     t = t + 1;
11: until C2 given by Eq. 12 converges;
12: S = S^{(t)}, V = V^{(t)};
13: for j = 1, 2, ..., n do
14:     Obtain the fTimeRA(dj|q) score for dj via S and V by Eq. 14;
15: Construct the final fused list Lf by ranking posts in decreasing order of fTimeRA(d|q).
We present TimeRA in Algorithm 1. We first provide a high-level walkthrough. To begin, it sets matrices S ∈ R^{A×m} and V ∈ R^{A×n} with random values and fTimeRA(d|q) to the value 0 (lines
1–3 in Algorithm 1). Let Si be the i-th column of S, meant to get
the latent factors of the list Li after matrix factorization, and let
Vj be the j-th column of V, meant to get latent factors of post dj
after matrix factorization. After detecting bursts (line 4), TimeRA
utilizes burst information and tries to get the final S and V by performing a gradient descent process on a cost function C2 (lines 5–
12). To this end, aggregation scores fTimeRA (d|q) of d ∈ CL can
be obtained by S and V (lines 13–14). The defined cost function
plays an important role: (1) it tries to keep the original rank scores
of posts that rank high in many of the lists to be fused, (2) it rewards
posts that rank low in a few lists but in the vicinity of bursts, and
(3) it gets the latent factors of both the lists and the posts such that
missing posts’ rank scores can be inferred.
Our matrix factorization approach in TimeRA seeks to approximate the list-document matrix R by a multiplication of A latent factors,

$$R \approx S^{\top} V. \qquad (3)$$

To obtain S and V, traditionally the Singular Value Decomposition (SVD) method [8] is utilized to approximate the list-document rank matrix R by minimizing

$$S, V = \operatorname*{arg\,min}_{S,V}\; \frac{1}{2}\,\|R - S^{\top} V\|_F^2, \qquad (4)$$
where ‖·‖F denotes the Frobenius norm. Due to the definition that Rij = 0 if dj is a missing document, i.e., dj ∈ CL \ Li, we only need to factorize the observed scores in R, so that (4) changes to

$$S, V = \operatorname*{arg\,min}_{S,V}\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{ij}\,\bigl(R_{ij} - S_i^{\top} V_j\bigr)^2, \qquad (5)$$

where Iij is an indicator function that is equal to 1 if dj appears in Li and equal to 0 otherwise, Si ∈ R^{A×1} is a column vector in S that serves to get latent factors of Li, and, similarly, Vj ∈ R^{A×1} is a column vector in V that serves to get latent factors of dj. As Rij ∈ [0, 1] according to Eq. 1, (5) can be rewritten as

$$S, V = \operatorname*{arg\,min}_{S,V}\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{ij}\,\bigl(R_{ij} - g(S_i^{\top} V_j)\bigr)^2, \qquad (6)$$

where g(x) is the logistic function defined by g(x) = 1/(1 + exp(−x)), which makes it possible to bound Si⊤Vj within the same range [0, 1] as Rij.
In order to avoid overfitting, two regularization terms are added to Eq. 6:

$$S, V = \operatorname*{arg\,min}_{S,V}\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{ij}\,\bigl(R_{ij} - g(S_i^{\top} V_j)\bigr)^2 + \frac{\lambda_1}{2}\|S\|_F^2 + \frac{\lambda_2}{2}\|V\|_F^2, \qquad (7)$$

where λ1, λ2 > 0. For a probabilistic interpretation with Gaussian observation noise for Eq. 7 we refer to Salakhutdinov and Mnih [25]. To reduce the model's complexity we set λ1 = λ2 in all experiments below [8, 18, 25].
The cost function defined by Eq. 7 punishes posts equally when they shift from the original rank scores Rij. However, a post ranked higher should be punished more than posts ranked lower if they shift from the original rank scores: higher ranked posts are more likely to be relevant, so that keeping their original rank scores will be of more value. Thus, we modify Eq. 7 to obtain the following cost function C1:

$$S, V = \operatorname*{arg\,min}_{S,V}\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{ij}\, w(d_j|L_i)\,\bigl(R_{ij} - g(S_i^{\top} V_j)\bigr)^2 + \frac{\lambda_1}{2}\|S\|_F^2 + \frac{\lambda_2}{2}\|V\|_F^2, \qquad (8)$$
where w(dj|Li) is a rank punishment weighting function defined for dj given Li. Specifically, we define w(dj|Li) as

$$w(d_j|L_i) = \begin{cases} \dfrac{1}{2^{\,\mathit{rank}(d_j|L_i)-1}} & d_j \in L_i \\[4pt] 0 & d_j \notin L_i. \end{cases} \qquad (9)$$
Our next step is to bring in burst information. After detecting a set of bursts, we introduce the following term into the cost function C1 to boost the relevance of posts by burst information, such that

$$S, V = \operatorname*{arg\,min}_{S,V}\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{b \in B} \frac{1}{|\{d_k \in b : d_k \succ_{L_i} d_j\}|} \sum_{\substack{d_k \in b \\ d_k \succ_{L_i} d_j}} I_{ij} I_{ik}\, r(d_j|d_k)\, w(d_k|L_i)\,\bigl(R_{ik} - g(S_i^{\top} V_j)\bigr)^2, \qquad (10)$$

where B is the set of detected bursts, b is a burst in B, dk ≻Li dj means that dk is ranked higher than dj in list Li, |{dk ∈ b : dk ≻Li dj}| is the total number of posts in the burst b that are ranked higher than the post dj in Li, and r(dj|dk) is the "reward" score for dj from dk if dj is within or near the burst b to which dk belongs. In Eq. 10, dk ≻Li dj indicates that only the posts ranked higher than the post dj to be rewarded are capable of boosting the relevance of dj.
If dj, the post to be rewarded, is generated at (almost) the same time as the highly relevant post dk, it will be rewarded more than posts that are further away in time from dk; this is measured by r(dj|dk) in Eq. 10. Accordingly, we define r(dj|dk) based on a normal distribution as

$$r(d_j|d_k) = \exp\!\left\{-\frac{(t_{d_j} - t_{d_k})^2}{2\sigma_b^2}\right\}, \qquad (11)$$

where tdj and tdk are the timestamps of posts dj and dk, respectively, and σb is the standard deviation of the timestamps in burst b:

$$\sigma_b^2 = \frac{1}{n_b}\sum_{i=1}^{n_b}\left(i - \frac{n_b+1}{2}\right)^2 = \frac{n_b^2 - 1}{12},$$

where nb is the number of different timestamps of posts in b. According to Eq. 11, the more likely dj is to have been generated at the same time as dk, the larger the score r(dj|dk), and hence the bigger the reward for dj from dk.
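A minimal sketch of the burst reward of Eq. 11, with the closed-form variance σb² = (nb² − 1)/12; timestamps are assumed to be hourly integers and the function name is illustrative.

```python
import math

def burst_reward(t_j, t_k, n_b):
    """Eq. 11: Gaussian reward for post d_j from post d_k in burst b,
    where n_b is the number of distinct timestamps in b."""
    if n_b <= 1:
        # degenerate burst with zero variance (our own handling, not in the paper)
        return 1.0 if t_j == t_k else 0.0
    sigma_sq = (n_b ** 2 - 1) / 12.0       # variance of the timestamps in the burst
    return math.exp(-((t_j - t_k) ** 2) / (2.0 * sigma_sq))
```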
We are almost ready now to define the cost function C2 that we use in TimeRA. We integrate the original rank scores of the posts in the result lists (Eq. 8) with rewards for posts generated in the vicinity of bursts (Eq. 10). To this end, we substitute Eq. 10 into Eq. 8. Our cost function C2 for TimeRA is defined as

$$\begin{aligned} S, V = \operatorname*{arg\,min}_{S,V}\; & \frac{1-\beta}{2}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{ij}\, w(d_j|L_i)\,\bigl(R_{ij} - g(S_i^{\top} V_j)\bigr)^2 \\ & + \frac{\beta}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{b \in B} \frac{1}{|\{d_k \in b : d_k \succ_{L_i} d_j\}|} \sum_{\substack{d_k \in b \\ d_k \succ_{L_i} d_j}} I_{ij} I_{ik}\, r(d_j|d_k)\, w(d_k|L_i)\,\bigl(R_{ik} - g(S_i^{\top} V_j)\bigr)^2 \\ & + \frac{\lambda_1}{2}\|S\|_F^2 + \frac{\lambda_2}{2}\|V\|_F^2, \end{aligned} \qquad (12)$$
where β ∈ [0, 1] is a free parameter that governs the linear mixture.
In words, in response to a query q, TimeRA uses a two-component mixture model to score d ∈ CL . The first component (the
first term of Eq. 12) tries to maintain the original rank scores Rij .
The second component (the second term of Eq. 12), which uses
bursts for posts, “rewards” d if it is strongly associated with the
posts in the bursts. Clearly, if β = 0 in Eq. 12, TimeRA will only
try to maintain the original rank scores, i.e., Eq. 8.
A local minimum of the objective function given by Eq. 12 can
be found by performing gradient descent in both Si and Vj . The
same can apply to Eqs. 4, 5, 6, 7, and 8. The algorithm for obtaining S and V is straightforward: we first randomly initialize S and
V, then iteratively update these two matrices based on their gradients until the value of the cost (objective) function converges. The
derivations of these equations are included in Appendix A.
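For concreteness, the sketch below performs the gradient descent for the burst-free special case β = 0, i.e., the cost C1 of Eq. 8; the full cost C2 of Eq. 12 adds the burst-reward term to the same update loop using the gradients of Eq. 15. This is a simplified NumPy illustration under those assumptions, not the authors' C++ implementation, and the hyper-parameter defaults are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def factorize(R, I, W, A=10, delta=0.05, lam=0.001, epochs=200, seed=0):
    """Gradient descent for Eq. 8 (the beta = 0 case of Eq. 12).

    R: m x n rank-score matrix (Eq. 1); I: m x n 0/1 indicator of observed entries;
    W: m x n rank punishment weights (Eq. 9). Returns S (A x m) and V (A x n)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    S = 0.1 * rng.standard_normal((A, m))
    V = 0.1 * rng.standard_normal((A, n))
    for _ in range(epochs):
        P = sigmoid(S.T @ V)                  # m x n predictions g(S_i^T V_j)
        E = I * W * (P - R) * P * (1.0 - P)   # weighted error times g'(S_i^T V_j)
        grad_S = V @ E.T + lam * S            # dC/dS_i, summed over posts j
        grad_V = S @ E + lam * V              # dC/dV_j, summed over lists i
        S -= delta * grad_S
        V -= delta * grad_V
    return S, V
```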
After optimizing Eq. 12, posts dj ∈ Li will end up with a score g(Si⊤Vj) = Rij + "some reward." Unlike previous work that assigns 0 to missing documents dj ∈ CL \ Li [5, 7, 27, 31], we infer a rank score Rij for missing posts dj as:

$$R_{ij} = \begin{cases} g(S_i^{\top} V_j) & \text{if } g(S_i^{\top} V_j) \le \min(R_{id}) \\ \min(R_{id}) & \text{if } g(S_i^{\top} V_j) > \min(R_{id}), \end{cases} \qquad (13)$$

where min(Rid) is the rank score of the lowest ranked post that appears in Li, as computed by Eq. 1. As shown in Eq. 13, if the inferred rank score g(Si⊤Vj) is smaller than the minimum rank score, we maintain the inferred score for that post. However, if the inferred rank score is greater than the minimum rank score, we give up the inferred score and let Rij = min(Rid), as dj ∉ Li means that it should at least be ranked lower than any post d ∈ Li.
The final rank aggregation score for dj ∈ CL is obtained by

$$f_{\mathrm{TimeRA}}(d_j|q) = \sum_{\substack{i=1 \\ d_j \in L_i}}^{m} g(S_i^{\top} V_j) \;+\; \sum_{\substack{i=1 \\ d_j \in C_L \setminus L_i}}^{m} R_{ij}. \qquad (14)$$
The final rank aggregation score for a post consists of two parts,
i.e., scores for lists Li it appears in (the first term in Eq. 14) and
scores for lists it does not appear in (the second term in Eq. 14),
as inferred by Eq. 13. Finally, the final fused list Lf produced in
response to q is obtained by ranking posts in decreasing order of
fTimeRA(dj|q). We call the model defined by Eq. 14 TimeRA; the variant that ignores the inferred rank information Rij in Eq. 14 is called TimeRA-Infer ("TimeRA minus Infer").
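A minimal sketch of the score-inference step (Eq. 13) and the final TimeRA score (Eq. 14), given factor matrices S and V such as those produced by the factorization sketch above; variable names are illustrative.

```python
import numpy as np

def timera_scores(R, I, S, V):
    """Eq. 13 + Eq. 14: infer capped scores for missing posts, then sum per post.

    R: m x n observed rank scores (Eq. 1); I: m x n 0/1 indicator; S: A x m; V: A x n."""
    P = 1.0 / (1.0 + np.exp(-(S.T @ V)))               # g(S_i^T V_j), shape m x n
    # minimum observed rank score per list, i.e. min(R_id) in Eq. 13
    min_per_list = np.where(I > 0, R, np.inf).min(axis=1, keepdims=True)
    inferred = np.minimum(P, min_per_list)              # cap inferred scores (Eq. 13)
    # Eq. 14: observed lists contribute g(S_i^T V_j), missing lists the inferred score
    return (I * P + (1 - I) * inferred).sum(axis=0)
```

Posts are then returned in decreasing order of this score to form the fused list Lf.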
4.3  Analysis of time-aware rank aggregation
The main computations in TimeRA are detecting bursts and computing the gradients of C2 with respect to the matrices S and V. Our burst detection has computational complexity O(n) because the detection algorithm runs in
linear time [24], and the gradients of C2 have computational complexity of O(e × m × n × A). In practice, the number of result
lists to be fused, m, the number of posts to be aggregated, n, the
number of latent factors, A, and the number of epochs, e, are all
quite small. Therefore, the rank aggregation procedure can run in
real-time, as we will see in the experiments presented in §6.6.
In the limiting case, TimeRA reverts back to CombSUM. That
is, if we set β = 0 (ignoring burst information), set the weight
function w(d|Li ) = 1 for all documents d and lists Li , and, moreover, set λ1 = λ2 = 0, and do not try to infer the rank score of
posts, then g(Si⊤Vj) = Rij after performing gradient descent in
Eq. 12. Our experimental results below show that the performance
of TimeRA is almost the same as CombSUM when β = 0. Note
that λ1 , λ2 6= 0 in the experiments.
5.
EXPERIMENTAL SETUP
In this section, we list our research questions, describe the data
set, specify our baselines, detail the metrics as well as our training
and optimization setup, and describe the experiments.
5.1
Research questions
The research questions guiding the remainder of the paper are:
RQ1 Does TimeRA outperform traditional and state-of-the-art unsupervised or supervised rank aggregation methods and the
best single result list? (§6.1)
RQ2 What are the relative contributions of the main ingredients of
TimeRA (burst detection and score inference)? (§6.2)
RQ3 What is the effect of using burst information in TimeRA (i.e.,
what is the impact of the parameter β in Eq. 12)? (§6.3)
RQ4 What is the effect of the number of lists to be aggregated in
TimeRA? (§6.4)
RQ5 Can we observe the hypothesized effect sketched in Fig. 1,
i.e., posts in bursts being rewarded? (§6.5)
RQ6 Does TimeRA meet real-time search requirements? (§6.6)
RQ7 Does TimeRA beat existing time-sensitive microblog search
models? (§6.7)
5.2
Data set
In order to answer our research questions we use the Tweets 2011
corpus [17, 29] provided by the TREC 2011 and 2012 Microblog
tracks. The collection is comprised of around 16 million tweets
collected over a period of 2 weeks (January 23, 2011 until February 8, 2011). Different types of tweets are present in this data set,
including replies and retweets, each with their own timestamp.
The task at the TREC 2011 and 2012 Microblog tracks was:
given a query with a timestamp, return relevant and interesting
tweets in reverse chronological order. In response to a query, a
run was required to retrieve 30 posts. In total, for the 2011 track
49 test topics were created and 2965 tweets were deemed relevant;
some topics have just two relevant tweets, while others have more
than 100 relevant tweets.
A total of 59 groups participated in the TREC 2011 Microblog
track, with each team submitting at most four runs, which resulted
in 184 runs2 [17]. The official evaluation metric was precision at
30 (p@30) [17, 29]. The p@30 scores of these 184 runs varied
dramatically, with the best run achieving a p@30 score of 0.4551
and the worst runs achieving scores below 0.10. Details about the
implementation of each run can be found in [17, 29]. We also run
experiments on the TREC 2012 Microblog track runs, and find that
they yield the same overall results and trends. Due to space limitations, we only report results for the TREC 2011 edition runs below.
5.3
Baselines and evaluation
We compare TimeRA to 5 aggregation baselines: 2 traditional
unsupervised methods, i.e., CombSUM and CombMNZ, 2 state-of-the-art cluster-based fusion methods, ClustFuseCombSUM and ClustFuseCombMNZ [7], and a state-of-the-art supervised method, λ-Merge [27], in which we integrate temporal features. As TimeRA utilizes temporal information, we also compare TimeRA to 4 state-of-the-art time-sensitive microblog search algorithms: time-based
language model (TBLM) [12], textual quality factor model with
temporal query expansion (LM-T(qe)) [19], direct time-sensitive
BM25 retrieval model (DIRECT-BM25 (mean)) [3] and temporal
tweet selection feedback method (TSF+QDRM) [21]. To build the
index of the dataset that some of our baselines require, we apply
Porter stemming, tokenization, and stopword removal (using INQUERY lists) to posts using the Lemur toolkit.3 Features used in
λ-Merge, including time-sensitive features, and the settings of λMerge are briefly described in Appendix B.
For performance evaluation we use the official TREC Microblog
2011 metric, p@30. We also report on p@5, p@10, p@15 and
MAP scores. MAP scores are of special interest to us: we hypothesize that TimeRA has both a precision and recall-enhancing effect
and we use MAP to measure this. Statistical significance of observed differences between two runs’ performances is tested using
a two-tailed paired t-test and is denoted using N (or H ) for significant differences for α = .01, or M (and O ) for α = .05.
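As an illustration of the significance testing used throughout, here is a short sketch with SciPy's paired two-tailed t-test over per-query scores; the query-aligned score arrays are placeholders.

```python
from scipy.stats import ttest_rel

# per-query metric values (e.g., AP or p@30) for two runs, aligned by query id
scores_timera  = [0.41, 0.55, 0.32, 0.61, 0.47]
scores_combsum = [0.38, 0.51, 0.33, 0.58, 0.40]

t_stat, p_value = ttest_rel(scores_timera, scores_combsum)  # two-sided by default
significant_at_05 = p_value < 0.05
significant_at_01 = p_value < 0.01
```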
A single free parameter in Eq. 12, β (∈ {0, 0.1, . . . , 1}), is incorporated in TimeRA, which is set using leave-one-out cross validation performed over the entire set of 49 queries. The performance
of MAP is optimized in the learning phase. In other words, the performance for a query is attained using a value of β that maximizes
MAP performance over all other queries. The same optimization
strategy is used for one of our baselines, cluster-based fusion. Other
baselines do not incorporate free parameters. Following [18], we
set the parameters λ1 = λ2 = 0.001.
5.4
Experiments
We report on 7 main experiments in this paper aimed at understanding (1) the performance of TimeRA in general via sampling
lists and fusing them; (2) the contribution of the main ingredients in
TimeRA; (3) the performance of TimeRA with increasing numbers
of runs to be fused; (4) query level performance; (5) TimeRA’s efficiency; (6) the effect of inferring rank scores of posts by TimeRA;
(7) the performance of TimeRA against temporal retrieval models.
2 Available from http://trec.nist.gov.
3 http://www.lemurproject.org
Table 2: Sampled runs used in some of the experiments.
Class     Sampled runs                                                       Performance
Class 1   clarity1, waterlooa3, FASILKOM02, isiFDL, DFReeKLIM30, PRISrun1    0.40 ≤ p@30
Class 2   KAUSTRerank, ciirRun1, gut, dutirMixFb, normal, UDMicroIDF         0.30 ≤ p@30 < 0.40
Class 3   WESTfilext, LThresh, qRefLThresh, run3a, Nestor, uogTrLqea         0.20 ≤ p@30 < 0.30
Class 4   FASILKOM02, waterlooa3, gut, UDMicroIDF, run3a, qRefLThresh        0.20 ≤ p@30
To understand the overall performance of TimeRA, we sample
∼10% from the ranked lists produced by participants in the TREC
2011 Microblog track based on the lists’ p@30 distribution: 18 out
of the runs submitted to the TREC 2011 Microblog track, 6 with
p@30 scores between 0.20 and 0.30 (Class 3), 6 between 0.30 and
0.40 (Class 2), and 6 over 0.40 (Class 1). We also randomly choose
two runs from each class to construct Class 4; see Table 2. The runs
in Class 1 are the 6 best runs in the TREC 2011 Microblog track.
In every class, we use run1, run2, run3, run4, run5 and run6 to refer
to the runs in descending order of p@30.
Next, to understand the contributions of the two main ingredients of TimeRA, viz., burst detection and inferring scores, we make
comparisons among TimeRA, TimeRA-Infer and CombSUM. We
also gradually increase the parameter β in Eq. 12 from 0.0 to 1.0 to
see if burst information is helpful to boost fusion performance.
To understand the effect of the number of lists being merged, we
randomly choose k = 2, 4, 6, . . . , 36 lists from the 184 lists and
aggregate them. We repeat the experiments 20 times and report the
average results and standard deviation. In order to understand the
query-level performance of TimeRA, we provide a detailed comparison of its performance against the baseline methods. To determine whether TimeRA can respond to a given query in (near) real
time, we again randomly fuse k = 2, 6, 12, 18, 30 lists for all 49
test queries and report the average time required. Finally, we compare TimeRA against state-of-the-art time-sensitive retrieval models that utilize time/burst information.
6.
RESULTS AND ANALYSIS
§6.1 and §6.2 show the results of fusing the sample lists, the
contributions of burst detection and score inference in TimeRA, respectively; §6.3 analyzes the effect of using burst information; §6.4
shows the effect of the number of lists on the overall performance;
§6.5 provides a topic-level analysis; §6.6 examines the run times.
Finally, §6.7 compares TimeRA against time-sensitive models.
6.1
Fusing the sample lists
The performance of TimeRA and the 5 baselines is presented
in Table 3, with numbers based on the ∼10% sample mentioned
in §5.4. The performance of all the fusion methods is better than
that of the best performing result list that is used in the merging
process (run1) for all classes and on almost all metrics. Many of
these improvements are statistically significant. More importantly,
when fusing the top 6 result lists (Class 1), all of the p@30 scores
generated by any rank aggregation method are higher than that of
the best run in TREC 2011 Microblog track (0.4551), especially
for TimeRA, which achieves 0.5531. These findings attest to the
merits of using rank aggregation methods for microblog search.
It is also worth noting in Table 3 that in almost all cases, the
cluster-based method does not beat the standard fusion method that
it integrates, and the performance differences between the two are
usually not significant. The reason behind this may be that it is
challenging to do clustering in a microblog environment, with limited amounts of text and very creative language usage.
The performance of TimeRA is better than that of the baseline
methods, and all of the differences are substantial and statistically
significant. The performance of λ-Merge is almost the same as that
of CombSUM, CombMNZ and the cluster-based methods when
fusing the lists in Class 2, but in the other classes the performance
tends to be a bit below that of the other methods, on all metrics.
This may be due to overfitting.
Interestingly, the higher the quality of the result lists that are being aggregated, the bigger the improvements that can be observed
in Table 3. For instance, the p@30 scores after aggregation are
highest in Class 1, followed by those in Class 2 and Class 3, and the
quality of Class 1 is best followed by Class 2 and Class 3, respectively. The p@30 aggregation scores in Class 4 are almost the same
as those in Class 2, as some of the lists’ scores in Class 4 are better
than those in Class 2.
6.2
Contributions of the main ingredients
Next, we compare the relative contributions of the main ingredients of TimeRA, viz. burst detection and score inference, against the second best baseline, CombSUM. The effect of burst detection in
TimeRA can be seen through comparisons between TimeRA-Infer
and CombSUM; the effect of score inference can be seen through
comparisons between TimeRA and TimeRA-Infer in Fig. 2.
Interestingly, in Fig. 2 there are large gaps in performance between TimeRA-Infer and CombSUM in terms of all of metrics;
all improvements are statistically significant. This illustrates that
burst detection makes an important contribution to the performance
of rank aggregation. When we compare TimeRA and TimeRA-Infer in Fig. 2, we see that the performance of TimeRA-Infer in
terms of p@5 is almost the same as that of TimeRA, while in
terms of p@10 and p@30, TimeRA has some small advantages
over TimeRA-Infer—some of these improvements are statistically
significant. This observation confirms that inferring scores for posts
during aggregation can boost performance as well. It also shows,
however, that enhancing the performance of p@k becomes easier
for larger values of k. This is because the cost of boosting the
performance (i.e., changing the original rank score to be higher)
is smaller when the posts are ranked lower. TimeRA is unable to
beat TimeRA-Infer in terms of p@5 (.6939 for both), but TimeRA
does boost the p@30 performance (.5531M for TimeRA vs .5405
for TimeRA-Infer).
As burst information is such an important contributor to the performance of TimeRA, we analyze it further in §6.3.
6.3
The use of burst information
Next we examine the effect of using different amounts of burst
information or cluster information in our time-aware aggregation
or cluster-based methods, respectively. What is the impact of the
free parameter β in Eq. 12 and in the cluster-based methods? Fig. 3
depicts the MAP performance curves for all rank aggregation methods when fusing the lists in Class 1, Class 2, Class 3 and Class 4,
respectively. For β = 0, TimeRA almost amounts to CombSUM,
while the cluster-based methods are the same as the standard fusion methods they incorporate, e.g., ClustFuseCombSUM is no different from CombSUM in this case; more weight is put on
burst information and cluster information with higher values of
β in TimeRA and the cluster-based methods, respectively. For
0 < β < 1, both the CombSUM scores of posts and burst information are utilized for aggregating lists in TimeRA.
In each of our four classes of runs, when aggregating lists, the
MAP scores of TimeRA where burst information is used (β > 0)
are always higher than that of any other fusion method. In Class 1
and Class 4 the gain increases gradually as the weight of burst information increases. These findings attest to the merits of using
burst information to boost the performance in fusing ranked lists
for microblog search. Putting more weight on cluster information
in the cluster-based methods hurts performance in many cases.
6.4
Effect of the number of lists being merged
We explore the effect of varying the number of lists to be merged
on the performance of TimeRA. Fig. 4 shows the fusion results of
randomly sampling k ∈ {2, 4, 6, . . . , 36} lists from the 184 lists.
For each k, we repeat the experiment 20 times and report on the
average scores. We use CombSUM as a representative example for
comparisons with TimeRA; the results of other baseline methods
are worse or qualitatively similar to those of CombSUM.
From Fig. 4 we can see that TimeRA performs better than CombSUM over all performance evaluation metrics no matter how many
lists are fused. For both precision metrics (p@5 and p@30) we find
that as long as the number of lists ≤ 10, the performance of both
TimeRA and CombSUM gradually increases as the number of lists
to be merged increases. The increases level off when the number of
lists exceeds 12. For MAP we find that performance keeps increasing until we fuse 26 lists; then, the performance increase levels off.
Interestingly, in Fig. 4 the improvement of TimeRA over CombSUM on p@5 becomes smaller when more lists are merged. For
example, the increase in p@5 of TimeRA over CombSUM is .1063
(.5861 for TimeRA vs .4798 for CombSUM) when two lists are
fused. The performance increase, however, drops to only .0281
(.6712 for TimeRA vs .6431 for CombSUM) for 36 lists. Looking
at the other metrics, which take a larger part of the merged result
into account (p@30 and especially MAP), the gap remains.
6.5
Query-level analysis
We take a closer look at per test query improvements of TimeRA
over other runs. For brevity, we only consider CombSUM as a representative and we only consider runs in Class 1. The results of
TimeRA against CombSUM for other classes of runs and for other
baseline methods are qualitatively similar. Fig. 5 shows per query
performance differences in terms of AP, p@5, p@10 and p@30,
respectively, between TimeRA and CombSUM. TimeRA displays
both a precision and recall enhancing effect (with increases in precision oriented metrics as well as in MAP). As the metric at hand
considers a larger chunk of the result list, there are more instances
where TimeRA outperforms CombSUM. This is due mainly to topics that are discussed only in very specific time intervals. Examples include queries MB010 (Egyptian protesters attack museum),
MB011 (Kubica crash), and MB015 (William and Kate fax save-the-date). For such queries we found evidence of the intuition depicted in Fig. 1: posts that are ranked low in a small number of lists but that TimeRA pushes up because they are central to a burst.
E.g., in response to query MB010, post #30354903104749568 is
ranked near the bottom in just two lists (at ranks 26 and 27 in the
runs clarity1 and DFReekLIM30, respectively). Many posts for the
query were generated around the same time interval (Jan. 26–29)
and are ranked high in many lists; post #30354903104749568 was
also published around this time and ranked 6th in the merged list
because of this.
Queries for which TimeRA cannot beat CombSUM tend to be
quite general and unrelated to any specific time window. Examples include queries MB023 (Amtrak train service), MB027 (reduce energy consumption), and MB029 (global warming and weather). For a very small number of queries, TimeRA's performance
is slightly worse than that of CombSUM. One reason that we observed for this phenomenon is that not all posts that are ranked low
in a small number of lists but central to a burst need to be rewarded.
An example here is query MB024 (Super Bowl, seats).
Table 3: Retrieval performance on the ∼10% sample lists. Boldface marks the best result per metric; a statistically significant difference between TimeRA and the best baseline method is marked in the upper right hand corner of the TimeRA score. A significant
difference with run1 for each method is marked in the upper left hand corner using the same symbols. None of the differences
between the cluster-based method and the standard method it incorporates are statistically significant.
[Table 3 data: MAP, p@5, p@10, p@15 and p@30 scores for run1–run6 and for CombSUM, ClustFuseCombSUM, CombMNZ, ClustFuseCombMNZ, λ-Merge and TimeRA, reported separately for Classes 1–4.]
Figure 2: Retrieval performance of CombSUM, TimeRA-Infer (without using score inference for posts) and TimeRA in terms of
MAP, p@5, p@10, and p@30 when fusing the runs in the 4 classes. Note that figures should be viewed in color.
6.6
Run time comparisons
We now explore how fast TimeRA can merge result lists in response to a query. TimeRA is developed in C++ and the experiments are run on a MacBook Pro (OS X 10.6.8) with 4GB memory and a 2.3 GHz Intel Core i5 processor. In Table 4, we randomly
choose k ∈ {2, 6, 12, 18, 30} lists from the 184 lists. For each k,
we repeat the experiment 20 times and report the average run time
per query (in seconds) that the fusion methods require. ClustSUM
and ClustMNZ are short for ClustFuseCombSUM and ClustFuseCombMNZ, respectively in Table 4.
TimeRA does not run as fast as CombSUM or CombMNZ, but
it manages to merge lists near real-time. TimeRA merges the lists
within 0.1s when given 30 result lists and within 0.01s when fusing two lists. As the number of lists to be fused increases, the time
spent on fusing is linear for CombSUM, CombMNZ, TimeRA, and
λ-Merge; for ClustSUM and ClustMNZ, the time increases dramatically with larger numbers of lists. When fusing 30 lists, ClustMNZ
needs to spend 235.91s, although it only spends 1.03s on fusing two
lists. The times clocked for CombSUM and CombMNZ are similar,
and likewise for those of ClustSUM and ClustMNZ.
6.7
Effect of fusing time-sensitive result lists
TimeRA uses temporal information in an essential way. How
does it compare against retrieval models that explore temporal information? To answer this question, we conduct 4 additional experiments for generating 4 time-sensitive result lists, and explore
the performance of the baseline fusion methods and of TimeRA, each of which fuses these 4 lists. These lists are generated by TBLM [12], LM-T(qe) [19], DIRECT-BM25 (mean) [3]
and TSF+QDRM [21].
Table 5 shows the fusion results. Except for ClustFuseCombMNZ and λ-Merge, the fusion baselines and TimeRA outperform the best time-sensitive component run (TSF+QDRM) on all metrics, and most of the improvements are statistically significant.
This illustrates that exploring time information in data fusion has
a different effect than utilizing time information in an individual
ranking function, an effect that can lead to performance increases.
Figure 3: Effect of varying the value of β on the MAP performance of six aggregation methods, for lists in (a) Class 1, (b) Class 2,
(c) Class 3 and (d) Class 4. More weight is put on burst information and on cluster information with higher values of β in TimeRA
and the cluster-based methods, respectively. Note: the figures are not to the same scale.
Figure 4: Effect on performance (in terms of MAP, p@5, and p@30) of the number of lists being merged. We plot averages and
standard deviations. Note: the figures are not to the same scale.
Table 4: Time spent on fusing lists by different aggregation
methods. Recorded in seconds with standard deviations (std).
Method \ Number of lists     2          6          12         18         30
CombSUM                      3.06e–4    5.76e–4    1.03e–3    1.98e–3    3.37e–3
  (std)                      1.13e–5    2.57e–5    6.93e–5    6.49e–5    7.50e–5
CombMNZ                      3.06e–4    5.76e–4    1.03e–3    1.99e–3    3.38e–3
  (std)                      1.13e–5    2.57e–5    6.93e–5    6.52e–5    7.01e–5
TimeRA                       4.77e–3    1.69e–2    2.05e–2    4.56e–2    9.60e–2
  (std)                      7.60e–5    7.41e–5    2.20e–4    3.53e–4    3.60e–4
λ-Merge                      1.15       3.82       7.78       12.03      20.74
  (std)                      1.22e–1    5.82e–1    8.03e–1    1.09       1.18
ClustSUM                     1.03       4.12       37.56      88.21      235.91
  (std)                      9.87e–2    4.24e–1    1.26       3.54       13.08
ClustMNZ                     1.03       4.12       37.56      88.21      235.91
  (std)                      9.87e–2    4.24e–1    1.26       3.54       13.08
One reason behind this is that posts within intervals in which many
relevant posts appear can only be confirmed to be relevant by gathering data from multiple lists, time-sensitive or not.
7.
CONCLUSION
The special nature of microblog posts, e.g., their limited length
and their creative language usage, raises challenges for searching
them. However, this special nature also provides unique algorithmic opportunities. In this paper, we focus on utilizing time information to boost the performance of searching microblog posts.
Specifically, we proposed a novel rank aggregation approach, TimeRA, that utilizes bursts and only rank information to aggregate result lists. TimeRA first detects bursts of posts across the lists utilizing original rank information of the posts, and then rewards posts
that are ranked low in few lists but in the vicinity of a burst that contains higher ranked posts. It also infers the rank scores of missing
posts by modeling lists and posts as a mixture of latent factors.
Our experimental results show that both utilizing burst information and score inference for rank aggregation can significantly en-
Table 5: Performance on 4 time-sensitive result lists. Boldface
marks the better result per metric; a statistically significant difference between TimeRA and the best baseline is marked in the
upper right hand corner of TimeRA score; a statistically significant difference with TSF+QDRM for each fusion method is
marked in the upper left hand corner of the score.
Method                  MAP      p@5      p@10     p@15     p@30
TSF+QDRM                .2834    .6220    .6856    .6279    .5368
DIRECT-BM25 (mean)      .2798    .6187    .6725    .6320    .5133
LM-T (qe)               .2346    .5836    .5648    .5178    .4471
TBLM                    .2231    .5742    .5433    .5017    .4395
CombSUM                 .2962    .6395    .6973    .6482    .5513
ClustFuseCombSUM        .2951    .6385    .6927    .6407    .5476
CombMNZ                 .2948    .6350    .6918    .6347    .5420
ClustFuseCombMNZ        .2886    .6312    .6873    .6283    .5416
λ-Merge                 .2847    .6275    .6876    .6279    .5383
TimeRA                  .3117N   .6518M   .7023M   .6545N   .5733N
hance retrieval performance when compared against traditional and
state-of-the-art, supervised and unsupervised rank aggregation approaches for microblog post search. Additional analyses show that
TimeRA is a robust and efficient rank aggregation method that outperforms state-of-the-art temporal retrieval algorithms.
As future work, we plan to look into combining social information, such as user relationships, into rank aggregation and to further analyze our model in scenarios where the documents being
searched are published in bursts.
Acknowledgements. This research was supported by the China Scholarship Council, the European Community’s Seventh Framework Programme
(FP7/2007-2013) under grant agreements nr 288024 (LiMoSINe) and nr
312827 (VOX-Pol), the Netherlands Organisation for Scientific Research
(NWO) under project nrs 727.011.005, 612.001.116, HOR-11-10, 640.006.013, the Center for Creation, Content and Technology (CCCT), the
Dutch national program COMMIT, the ESF Research Network Program
ELIAS, the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences (KNAW), the Netherlands eScience Center under project
number 027.012.105, the Yahoo! Faculty Research and Engagement Program, the Microsoft Research PhD program, and the HPC Fund.

Figure 5: Per query performance differences, TimeRA vs. CombSUM. The figures shown are for the runs in Class 1, for MAP, p@5, p@10 and p@30 (left to right). A bar extending above the center indicates that TimeRA outperforms CombSUM, and vice versa.
8.
REFERENCES
[1] J. A. Aslam and M. Montague. Models for metasearch. In SIGIR,
pages 276–284, 2001.
[2] H. Bota, K. Zhou, J. M. Jose, and M. Lalmas. Composite retrieval of
heterogeneous web search. In WWW, 2014.
[3] W. Dakka, L. Gravano, and P. Ipeirotis. Answering general
time-sensitive queries. IEEE Trans. Knowledge and Data Engin., 24
(2):220–235, 2012.
[4] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation
methods for the web. In WWW, pages 613–622, 2001.
[5] E. A. Fox and J. A. Shaw. Combination of multiple searches. In
TREC-2, 1994.
[6] Z. Han, X. Li, M. Yang, H. Qi, S. Li, and T. Zhao. Hit at trec 2012
microblog track. In TREC ’12 Working Notes, 2012.
[7] A. K. Kozorovitsky and O. Kurland. Cluster-based fusion of
retrieved lists. In SIGIR, pages 893–902, 2011.
[8] M. Kurucz, A. A. Benczúr, and K. Csalogány. Methods for large
scale SVD with missing values. In KDD Cup and Workshop, 2007.
[9] J. Lafferty and C. Zhai. Document language models, query models,
and risk minimization for information retrieval. In SIGIR, 2001.
[10] T. Lappas, B. Arai, M. Platakis, D. Kotsakos, and D. Gunopulos. On
burstiness-aware search for document sequences. In KDD, pages
477–486, 2009.
[11] J. H. Lee. Combining multiple evidence from different properties of
weighting schemes. In SIGIR, pages 180–188, 1995.
[12] X. Li and W. B. Croft. Time-based language models. In CIKM,
pages 469–475, 2003.
[13] S. Liang, M. de Rijke, and M. Tsagkias. Late data fusion for
microblog search. In ECIR’13, pages 743–746, 2013.
[14] S. Liang, Z. Ren, and M. de Rijke. Fusion helps diversification. In
SIGIR, pages 303–312, 2014.
[15] S. Liang, Z. Ren, and M. de Rijke. Personalized search result
diversification via structured learning. In KDD ’14, 2014.
[16] S. Liang, Z. Ren, and M. de Rijke. The impact of semantic document
expansion on cluster-based fusion for microblog search. In ECIR,
pages 493–499, 2014.
[17] J. Lin, C. Macdonald, I. Ounis, and I. Soboroff. Overview of the
TREC 2011 Microblog track. In TREC 2011. NIST, 2011.
[18] H. Ma, D. Zhou, and C. Liu. Recommender systems with social
regularization. In WSDM, 2011.
[19] K. Massoudi, M. Tsagkias, M. de Rijke, and W. Weerkamp.
Incorporating query expansion and quality indicators in searching
microblog posts. In ECIR, pages 362–367, 2011.
[20] M. Mathioudakis, N. Bansal, and N. Koudas. Identifying, attributing
and describing spatial bursts. In VLDB, pages 1091–1102, 2010.
[21] T. Miyanishi, K. Seki, and K. Uehara. Improving pseudo-relevance
feedback via tweet selection. In CIKM, pages 439–448, 2013.
[22] M. Montague and J. A. Aslam. Condorcet fusion for improved
retrieval. In CIKM, pages 538–548, 2002.
[23] M.-H. Peetz, E. Meij, M. de Rijke, and W. Weerkamp. Adaptive
temporal query modeling. In ECIR ’12, 2012.
[24] W. L. Ruzzo and M. Tompa. A linear time algorithm for finding all
maximal scoring subsequences. In Int. Conf. Int. Syst. Molec. Biol.,
pages 234–241, 1999.
[25] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In
Advances in Neural Information Processing Systems, 2008.
[26] J. A. Shaw and E. A. Fox. Combination of multiple searches. In
TREC 1992, pages 243–252. NIST, 1993.
[27] D. Sheldon, M. Shokouhi, M. Szummer, and N. Craswell.
LambdaMerge: merging the results of query reformulations. In
WSDM, pages 795–804, 2011.
[28] M. Shokouhi and L. Si. Federated search. Foundations and Trends in
Information Retrieval, 5(1):1–102, 2011.
[29] I. Soboroff, I. Ounis, and J. Lin. Overview of the TREC 2012
Microblog track (notebook version). In TREC 2012. NIST, 2012.
[30] W. Weerkamp and M. de Rijke. Credibility-inspired ranking for blog
post retrieval. Information Retrieval Journal, 15(3–4):243–277,
2012.
[31] S. Wu. Data fusion in information retrieval, volume 13 of
Adaptation, Learning and Optimization. Springer, 2012.
APPENDIX
A. DERIVATION OF THE MODELS
The following is the derivation of the gradients (Eq. 15) for the cost function in Eq. 12. We leave out the derivations for Eqs. 4, 5, 6, 7 and 8 as they may be obtained in a similar way.

$$\begin{aligned}
\frac{\partial C_2}{\partial S_i} = {} & (1-\beta)\sum_{j=1}^{n} I_{ij}\, w(d_j|L_i)\,\bigl(g(S_i^{\top} V_j) - R_{ij}\bigr)\, g'(S_i^{\top} V_j)\, V_j \\
& + \sum_{j=1}^{n}\sum_{b \in B} \frac{\beta}{|\{d_k \in b : d_k \succ_{L_i} d_j\}|} \sum_{\substack{d_k \in b \\ d_k \succ_{L_i} d_j}} I_{ij} I_{ik}\, r(d_j|d_k)\, w(d_k|L_i)\,\bigl(g(S_i^{\top} V_j) - R_{ik}\bigr)\, g'(S_i^{\top} V_j)\, V_j + \lambda_1 S_i \\
\frac{\partial C_2}{\partial V_j} = {} & (1-\beta)\sum_{i=1}^{m} I_{ij}\, w(d_j|L_i)\,\bigl(g(S_i^{\top} V_j) - R_{ij}\bigr)\, g'(S_i^{\top} V_j)\, S_i \\
& + \sum_{i=1}^{m}\sum_{b \in B} \frac{\beta}{|\{d_k \in b : d_k \succ_{L_i} d_j\}|} \sum_{\substack{d_k \in b \\ d_k \succ_{L_i} d_j}} I_{ij} I_{ik}\, r(d_j|d_k)\, w(d_k|L_i)\,\bigl(g(S_i^{\top} V_j) - R_{ik}\bigr)\, g'(S_i^{\top} V_j)\, S_i + \lambda_2 V_j, \qquad (15)
\end{aligned}$$

where g′(x) = exp(x)/(1 + exp(x))² is the derivative of the logistic function g(x) = 1/(1 + exp(−x)).
B. λ-MERGE AND ITS FEATURES
Originally λ-Merge is an algorithm that fuses the result lists of query
reformulations. The method we call λ-Merge is the same as the original one except that we consider the weights of query reformulations in the
original λ-Merge as those of the result lists in response to the same query
in our experiments. In total, we use 13 features, extracted at three levels: time-sensitive level, query-document level, document level; all features
are represented by either binary or real numbers. At time-sensitive level,
we obtain the retrieval scores from our 4 time-sensitive retrieval baselines,
i.e., TBLM [12], LM-T(qe) [19], DIRECT-BM25 (mean) [3] and TSF+QDRM [21], as features. At the query-document level, we use 3 features:
Rank, Rankers and IsTop-N . Rank is defined in Eq. 1. Rankers is the number of ranked lists in which the post appears over the total number of lists
to be merged. IsTop-N is a binary feature to indicate if this document is
within the top-N results in the list Li , where N ∈ {5, 10, 15, 20}. At the
document level, we extract features capable of indicating the quality of a
post [17, 29]. The post features include Link, Hashtag, Retweet; the values
of these three features are set to 1 if the document contains links, hashtags,
and retweets. The document-level features also consist of content quality
indicators of a post (Density, Capitalization and Length) [30].