Time-Aware Rank Aggregation for Microblog Search

Shangsong Liang† [email protected]   Zhaochun Ren† [email protected]   Edgar Meij§ [email protected]   Wouter Weerkamp‡ [email protected]   Maarten de Rijke† [email protected]
†University of Amsterdam, Amsterdam, The Netherlands   ‡904Labs, Amsterdam, The Netherlands   §Yahoo Labs, Barcelona, Spain

ABSTRACT
We tackle the problem of searching microblog posts and frame it as a rank aggregation problem where we merge result lists generated by separate rankers so as to produce a final ranking to be returned to the user. We propose a rank aggregation method, TimeRA, that is able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

Keywords
Rank aggregation; data fusion; ad hoc retrieval; microblog search

1. INTRODUCTION
The task of searching microblog posts has been studied widely. Many effective algorithms have been proposed, including, for example, those based on language modeling [9], query expansion [21], and time-sensitive features [3]. Different microblog search algorithms have different merits; hence, combining these alternative algorithms into a single model may boost retrieval performance. In this paper we follow this idea and tackle the challenge of microblog search by taking a rank aggregation (also called data fusion) approach. Our input consists of a number of ranked lists generated by different microblog search algorithms and the output is an aggregated and, hopefully, better list.
Rank aggregation approaches can inherit the merits of the individual search algorithms whose outputs are being aggregated. Also, they may boost recall [31] and, because they aggregate the results of multiple strategies, they mitigate the risks associated with opting for a single strategy [31]. In rank aggregation, documents in the fused list are ranked in decreasing order of their fusion scores. The fusion score of a document is usually a function of the rank scores from the individual input lists, such as the weighted sum. Previous work on rank aggregation often assumes, either implicitly or explicitly, that the rank score of a document is set to zero for an input list if that document does not appear in the list. We challenge this assumption; our intuition is that documents that are similar to a document d that does occur in an input list should get similar rank scores.
Let us call a document that does not appear in list L, but does appear in other lists we aim to fuse, a missing document for L (see Figure 1). We propose a rank aggregation algorithm where we apply a latent factor model to predict the rank scores of missing documents. In our rank aggregation algorithm, we define a list-document rank score matrix, factorize this matrix, and utilize the factorized list-specific and document-specific matrices to assign scores to missing documents.
State-of-the-art rank aggregation methods often work on the assumption that only documents that are ranked highly in many of the result lists being aggregated are likely to be relevant [1, 4, 5, 11, 13, 22]. As a result, a relevant document might be ranked low in an aggregated list if it appears in only few of the input result lists and is ranked low within these. The characteristics of microblog environments suggest a different perspective. In such environments people tend to talk about a topic within specific short time intervals [10, 20, 23]. E.g., people talked about the "2014 Eastern Synchronized Skating Sectional Championship" mainly between January 30 and February 1, 2014, which is when the championship took place. Posts generated before or after the event are less likely to talk about the championship and, hence, less likely to be relevant to a query about the championship.
Inspired by these characteristics of microblog search scenarios, we propose a rank aggregation algorithm, TimeRA, that is time-aware. TimeRA detects windows (intervals) of timestamps of posts ranked high in many of the lists to be fused. These windows give rise to bursts of posts, and subsequently TimeRA rewards posts that are published in the vicinity of a burst that contains highly scored posts. Specifically, if a post di and posts d1, . . . , dk were created within the same narrow time window, and the posts d1, . . . , dk are ranked highly in many of the lists being merged, then post di will be "rewarded," even if, in the extreme case, it appears in only one list where it is ranked low. Fig. 1 illustrates this: post d2 is ranked low in list L1 but will be rewarded as it was published in a time window in which a large number of posts are published that are ranked high in many lists. But d8, while ranked high in Lm, receives no such bonus as it was published outside the time window.

Figure 1: Rewarding posts that are published in the same narrow time frame as a large number of (supposedly) relevant posts. The vertical axis indicates the summed scores of posts with the same timestamps. There are m lists in response to a query. Post d2 only occurs in list L1 and is ranked low; d8 also occurs in a single list, Lm, but is ranked very high. Surrounding d2 there are many documents that are ranked high in many lists; therefore, d2 should be "rewarded," while d8 should not. Documents marked in blue are "missing" documents; e.g., d6 is a missing document for L1, as it is not observed in L1 during the aggregation process.

Our results show that TimeRA is able to effectively aggregate result lists, subject to real-time search requirements, running almost as fast as standard fusion methods such as CombSUM [5], and outperform both time-sensitive microblog search methods and competing fusion algorithms.
Our contributions can be summarized as follows:
• We propose a rank aggregation-based algorithm for microblog search, TimeRA, that outperforms traditional and state-of-the-art unsupervised and supervised rank aggregation methods as well as the best single result list that it aggregates.
• We examine the relative contributions of the main ingredients of TimeRA (burst detection and score inference) and find that burst detection is effective in helping rank aggregation and that inferring scores for missing documents during aggregation boosts performance as well.
• We provide a detailed analysis of the performance of TimeRA and offer a number of examples where we observe the effect hypothesized in Fig. 1, i.e., of posts in bursts having their rank score boosted.
• We show that the effect of TimeRA on ranking microblog posts is different from the effect of existing time-sensitive microblog search models.
• And, finally, we study the efficiency of TimeRA and show that it meets real-time search requirements.

2. RELATED WORK
Three major types of research are closely related to our work: microblog search, rank aggregation and latent factor modeling.

2.1 Microblog search
Searching posts that are short and creatively spelled in microblogging platforms, like Twitter, is challenging. Previous work on microblog search has focused on employing traditional IR techniques, e.g., learning to rank [6], document expansion [19], and pseudo-relevance feedback retrieval [21], to improve performance. Some previous work explores time information to boost performance. For instance, Li and Croft [12] propose a time-sensitive language model to explore the time and relevance of queries; Massoudi et al. [19] integrate temporal information into a dynamic expansion microblog retrieval model. And Dakka et al. [3] consider the publication time of documents in conjunction with the topic similarity to derive a time-sensitive ranking. More recently, Miyanishi et al. [21] integrate temporal and lexical evidence into their query expansion method for microblog search.
We propose to develop a rank aggregation-based approach to microblog search. To the best of our knowledge, this is the first attempt to tackle the problem of microblog search as a rank aggregation problem.

2.2 Rank aggregation
Rank aggregation approaches attempt to inherit the merits of different search algorithms, and to combine their result lists in order to produce a new and hopefully better ranking [14, 16, 26]. In recent years, rank aggregation has featured in many applications, including federated search [28], vertical search [2], and more recently, search result diversification [14, 15]. Many rank aggregation methods have been proposed, including CombSUM and CombMNZ [26], an outranking approach [14, 31], semantic fusion [16], cluster-based fusion [7] and λ-Merge [27]. CombSUM and CombMNZ are the oldest but fast and effective rank aggregation methods. The outranking approach only utilizes the rank of documents in the result lists during aggregation. Both semantic fusion and cluster-based fusion use the assumption that content-similar documents should be clustered to boost the performance of aggregation. Finally, λ-Merge tries to learn a model from labeled data for aggregation.
Previous rank aggregation algorithms ignore the temporal information of the documents to be merged. We propose a new rank aggregation method in which we fully utilize the temporal information and reward posts that are published in the vicinity of a burst so that they are ranked higher in the fused list. To the best of our knowledge, this is the first attempt to integrate temporal information into rank aggregation.

2.3 Latent factor modeling
Latent factor models are often used in collaborative filtering (CF) and recommender systems [8, 25].
Matrix factorization methods form a group of well-known latent factor models that include, for instance, singular value decomposition [8], probabilistic matrix factorization [25] and social regularization [18]. These methods first model users with latent interests and products with latent features by matrix factorization, and then try to predict the rating of products for the given users from the observations of the existing users' rating data [8, 18, 25]. Motivated by latent factor models, rather than simply setting no rank scores for missing documents as in previous rank aggregation methods [5, 7, 27, 31], our proposed rank aggregation algorithm applies matrix factorization to model both result lists (called "users" in CF) and documents ("products" in CF) as a mixture of latent factors ("interests" in CF), such that the rank scores of missing documents ("the rating of unobserved products" in CF) in a result list for aggregation can be inferred. Reasons why a latent factor model can infer scores can be found in [8, 18, 25]. To the best of our knowledge, this is the first attempt to predict rank scores for missing documents in rank aggregation.

3. PRELIMINARIES
We detail the task that we address and recall two standard rank aggregation algorithms, CombSUM and CombMNZ.

3.1 Problem formulation
The task we address is this: given an index of microblog posts, a query, and a set of ranked lists of posts with timestamps produced in response to the query, aggregate the lists into a final ranked list of posts to be returned in response to the query. Hence, the rank aggregation problem consists of finding a ranking function f such that:

    f : L = {L1, L2, . . . , Lm}, q, C −→ Lf,

where L is a set of lists to be fused, Li is the i-th result list, m = |L| is the number of input result lists, C is the corpus of microblog posts, q is a query, and Lf is the final fused list.

Table 1: Main notation used in TimeRA.
  R           m × |CL| list-post rank score matrix
  A           number of latent factors
  S           matrix used for inferring latent topics for L
  V           matrix used for inferring latent topics for CL
  Si          a column vector in S used for aspects of Li
  Vj          a column vector in V used for aspects of dj
  Rij         rank score of dj in list Li
  Rij         inferred rank score of post dj ∈ CL \ Li
  dk ≻Li dj   dk ranks higher than dj in Li
  ti          timestamp
  dti         post with timestamp ti
  T           number of different timestamps of posts in CL
  b           a burst of posts
  B           set of all bursts in CL (given query q)
  Iij         indicates whether dj ∈ Li
  σb          timestamp standard deviation of a burst
  r(dj|dk)    "reward" score for dj if dj is within or near the burst b to which dk belongs
  w(dj|Li)    rank punishment weighting function for dj ∈ Li
  β           relative weight of burst information in TimeRA
3.2 Standard rank aggregation
Before we move on to the details of our rank aggregation method, we set the stage by briefly recalling two standard fusion methods: CombSUM and CombMNZ. Let Rij denote the rank score of document dj in list Li, based on the rank of dj in the list Li; in the literature on rank aggregation [5, 7, 11, 31], one often finds Rij = 0 if dj ∉ Li (with dj still in the combined set of posts CL := ∪_{i=1}^m Li). In both CombSUM and CombMNZ, Rij is often defined as:

    Rij = ((1 + |Li|) − rank(dj|Li)) / |Li|   if dj ∈ Li,   and   Rij = 0   if dj ∉ Li,    (1)

where |Li| is the length of Li and rank(dj|Li) ∈ {1, . . . , |Li|} is the rank of dj in Li. The well-known CombSUM fusion method [5, 31], for instance, scores dj by the sum of its rank scores in the lists: fCombSUM(dj|q) := Σ_{i=1}^m Rij, while CombMNZ [5, 31] rewards dj that ranks high in many lists: fCombMNZ(dj|q) := |{Li : dj ∈ Li}| · fCombSUM(dj|q), where |{Li : dj ∈ Li}| is the number of lists in which dj appears.
Although introduced some time ago, CombSUM and CombMNZ are still effective and very efficient rank aggregation methods. More recent fusion methods have been proposed, but most of them may be inappropriate for searching posts in microblogging scenarios that are dynamic and need rapid updates. For instance, cluster-based rank aggregation [7] and λ-Merge [27] have been shown to be more effective than CombSUM and CombMNZ in several cases. However, a drawback of cluster-based fusion is that it has to access the content of documents to compute inter-document similarities so as to generate a set of clusters and determine probabilities based on the clusters. A drawback of λ-Merge is that it also has to access the content to extract features and it often overfits [27].

4. TIME-AWARE RANK AGGREGATION
We detail our time-aware rank aggregation algorithm in this section; §4.1 describes the way we detect bursts, §4.2 details our aggregation method, TimeRA, and §4.3 contains a brief complexity analysis of TimeRA. To be able to present rank aggregation methods in a uniform way, we first introduce and summarize our notation in Table 1.

4.1 Burst detection
Our proposed TimeRA method utilizes burst information. To detect bursts, we proceed as follows. Let t be a timestamp. Let dt ∈ CL denote a post with timestamp t. Here, CL is the set of posts appearing in the result lists. We regard posts published during the same hour as having the same timestamp. Although it is possible to define "the same timestamp" in many different ways, we found that this is a suitable level of granularity for the effectiveness of rank aggregation. Before we detect bursts in the posts CL, we need to define Ht, the burst-time score at time t. Let Rkt (obtained by Eq. 1) be the rank score of dt given Lk. Then:

    Ht = ( Σ_{dt ∈ Ct} Σ_{k=1}^m Rkt ) / ( Σ_{t′=1}^T Σ_{dt′ ∈ Ct′} Σ_{k=1}^m Rkt′ ) − 1/T,    (2)

where T is the total number of different timestamps belonging to posts in CL, Ct is the set of posts having the same timestamp t, Σ_{dt ∈ Ct} Σ_{k=1}^m Rkt is the sum of the rank scores of posts having the same timestamp t, and Σ_{t′=1}^T Σ_{dt′ ∈ Ct′} Σ_{k=1}^m Rkt′ is the sum of the rank scores of all posts in CL. According to Eq. 2, the burst-time score Ht > 0 if it is above the average score (i.e., 1/T), otherwise Ht ≤ 0.
Using Eq. 2, we compute a burst-time score Ht for each time point t ∈ {t1, t2, . . . , tT}. In this manner we generate a burst-time score sequence H = {Ht1, Ht2, . . . , HtT}.
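To make these definitions concrete, the following is a minimal sketch of the Eq. 1 rank scores, CombSUM, CombMNZ and the burst-time scores of Eq. 2. It assumes each input list is already a ranked Python list of (post_id, timestamp) pairs with timestamps in epoch seconds; the function and variable names are illustrative and not taken from the paper's implementation.

```python
from collections import defaultdict

def rank_scores(lists):
    """Eq. 1: R[i][d] = ((1 + |L_i|) - rank(d|L_i)) / |L_i| for d in L_i; missing posts get no entry."""
    R = []
    for L in lists:                      # L is a ranked list of (post_id, timestamp) pairs
        n = len(L)
        R.append({d: (1 + n - rank) / n for rank, (d, _) in enumerate(L, start=1)})
    return R

def comb_sum(R):
    """CombSUM: sum a post's rank scores over all lists (missing posts contribute 0)."""
    f = defaultdict(float)
    for Ri in R:
        for d, s in Ri.items():
            f[d] += s
    return dict(f)

def comb_mnz(R):
    """CombMNZ: CombSUM score times the number of lists the post appears in."""
    hits = defaultdict(int)
    for Ri in R:
        for d in Ri:
            hits[d] += 1
    return {d: hits[d] * s for d, s in comb_sum(R).items()}

def burst_time_scores(lists, R, to_hour=lambda ts: ts // 3600):
    """Eq. 2: per-timestamp share of the total rank score, minus the average 1/T."""
    per_t = defaultdict(float)
    for L, Ri in zip(lists, R):
        for d, ts in L:
            per_t[to_hour(ts)] += Ri[d]   # posts within the same hour share a timestamp
    total, T = sum(per_t.values()), len(per_t)
    return {t: s / total - 1.0 / T for t, s in per_t.items()}
```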
Following [24], a segment H[ti : tj] = {Hti, Hti+1, . . . , Htj}, where 1 ≤ i ≤ j ≤ T, is a maximal segment in H if:
i. All proper subsequences of H[ti : tj] have a lower score (the burst-time score of a sequence is the sum of the burst-time scores of the elements in the sequence).
ii. No proper super-segment of H[ti : tj] in H satisfies item i.
As an example, consider the input sequence H = {2, −2, 4, 3, −3, −4, −1, −3, 5, −1, 3, −2}. The maximal segments in this sequence are {2}, {4, 3} and {5, −1, 3}. The segment {2, −2, 4, 3} is not maximal, since it has a nonempty zero-scoring prefix {2, −2} appended to the left of {4, 3}; {5} is not a maximal segment, since {5, −1, 3} has a higher total score of 7.
Each maximal segment H[ti : tj] gives rise to a burst of posts b[ti : tj] with start timestamp ti and end timestamp tj: it contains any post d ∈ CL whose timestamp is between ti and tj. We write B to denote the set of all bursts, i.e., B = ∪ b[ti : tj]. We let b be short for b[ti : tj] in the following.

4.2 The aggregation method
Next we detail our time-aware rank aggregation method, TimeRA. TimeRA utilizes matrix factorization techniques. The input of the matrix factorization in TimeRA is an m × n matrix R ∈ R^{m×n} which we call a list-document matrix; here, again, m is the number of result lists and n is the number of posts to be fused, i.e., n = |CL|. R is initialized by Eq. 1, viz., its elements Rij are defined by Eq. 1. The output of the matrix factorization consists of two new matrices S ∈ R^{A×m} and V ∈ R^{A×n}, obtained by factorizing R, which we call the factor-list matrix and the factor-post matrix, respectively. Here, A is the number of latent factors. After obtaining S and V, we can infer the normalized scores of missing posts, based on which we arrive at the aggregation scores of all the posts.

Algorithm 1: TimeRA: Time-aware rank aggregation.
Input: A query q, a number of ranked lists of posts to be fused, L = {L1, L2, . . . , Lm}, the combined set of posts CL := ∪_{i=1}^m Li, learning rate δ.
Output: A final rank aggregation list of posts Lf.
1:  Lf = ∅, initialize R by Eq. 1, initialize S and V with random values, let m = |L| and t = 0;
2:  for j = 1, 2, · · · , n do
3:      fTimeRA(dj|q) = 0;
4:  Generate a set of bursts B;
5:  repeat
6:      for i = 1, 2, · · · , m do
7:          Si(t+1) = Si(t) − δ ∂C2/∂Si(t), based on Eq. 15;
8:      for j = 1, 2, · · · , n do
9:          Vj(t+1) = Vj(t) − δ ∂C2/∂Vj(t), based on Eq. 15;
10:     t = t + 1;
11: until C2 given by Eq. 12 converges;
12: S = S(t), V = V(t);
13: for j = 1, 2, · · · , n do
14:     Obtain the fTimeRA(dj|q) score for dj via S and V by Eq. 14;
15: Construct the final fused list Lf by decreasing scores fTimeRA(d|q).

We present TimeRA in Algorithm 1. We first provide a high-level walkthrough. To begin, it sets matrices S ∈ R^{A×m}, V ∈ R^{A×n} with random values and fTimeRA(d|q) with the value 0 (lines 1–3 in Algorithm 1). Let Si be the i-th column of S, meant to get the latent factors of the list Li after matrix factorization, and let Vj be the j-th column of V, meant to get latent factors of post dj after matrix factorization. After detecting bursts (line 4), TimeRA utilizes burst information and tries to get the final S and V by performing a gradient descent process on a cost function C2 (lines 5–12). To this end, aggregation scores fTimeRA(d|q) of d ∈ CL can be obtained from S and V (lines 13–14).
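The burst detection in line 4 of Algorithm 1 relies on the linear-time maximal-segment algorithm of Ruzzo and Tompa [24]. The following is a small sketch of it, assuming the burst-time scores are given as a plain list of floats; the code and its names are ours, not the authors' implementation.

```python
def maximal_segments(H):
    """All maximal scoring segments of H (Ruzzo & Tompa [24]) as (start, end) index pairs.

    On the example sequence [2, -2, 4, 3, -3, -4, -1, -3, 5, -1, 3, -2] this returns
    [(0, 0), (2, 3), (8, 10)], i.e., the segments {2}, {4, 3} and {5, -1, 3}.
    """
    segs = []        # each entry: [start, end, L, R], cumulative scores just before / at the end
    cum = 0.0
    for idx, score in enumerate(H):
        if score <= 0:                   # non-positive scores never start or extend a candidate
            cum += score
            continue
        cand = [idx, idx, cum, cum + score]
        cum += score
        while True:
            # rightmost stored segment whose left cumulative score is smaller than the candidate's
            k = next((k for k in range(len(segs) - 1, -1, -1) if segs[k][2] < cand[2]), None)
            if k is None or segs[k][3] >= cand[3]:
                segs.append(cand)
                break
            # otherwise merge: the candidate absorbs segments k..end and is re-examined
            cand = [segs[k][0], cand[1], segs[k][2], cand[3]]
            del segs[k:]
    return [(s, e) for s, e, _, _ in segs]
```

Each returned (start, end) pair delimits one burst b[t_start : t_end]; the posts in CL whose (hourly) timestamps fall inside that interval form the burst.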
The defined cost function plays an important role: (1) it tries to keep the original rank scores of posts that rank high in many of the lists to be fused, (2) it rewards posts that rank low in a few lists but are in the vicinity of bursts, and (3) it gets the latent factors of both the lists and the posts such that missing posts' rank scores can be inferred.
Our matrix factorization approach in TimeRA seeks to approximate the list-document matrix R by a multiplication of A latent factors,

    R ≈ S⊤V.    (3)

To obtain S and V, traditionally the Singular Value Decomposition (SVD) method [8] is utilized to approximate the list-document rank matrix R by minimizing

    S, V = arg min_{S,V} (1/2) ||R − S⊤V||²_F,    (4)

where ||·||²_F denotes the Frobenius norm. Due to the definition that Rij = 0 if dj is a missing document, i.e., dj ∈ CL \ Li, we only need to factorize the observed scores in R, so that (4) changes to

    S, V = arg min_{S,V} (1/2) Σ_{i=1}^m Σ_{j=1}^n Iij (Rij − Si⊤Vj)²,    (5)

where Iij is an indicator function that is equal to 1 if dj appears in Li and equal to 0 otherwise, Si ∈ R^{A×1} is a column vector in S that serves to get latent factors of Li, and, similarly, Vj ∈ R^{A×1} is a column vector in V that serves to get latent factors of dj. As Rij ∈ [0, 1] according to Eq. 1, (5) can be rewritten as

    S, V = arg min_{S,V} (1/2) Σ_{i=1}^m Σ_{j=1}^n Iij (Rij − g(Si⊤Vj))²,    (6)

where g(x) is the logistic function defined by g(x) = 1/(1 + exp(−x)), which bounds Si⊤Vj within the same range [0, 1] as Rij. In order to avoid overfitting, two regularization terms are added to Eq. 6:

    S, V = arg min_{S,V} (1/2) Σ_{i=1}^m Σ_{j=1}^n Iij (Rij − g(Si⊤Vj))² + (λ1/2) ||S||²_F + (λ2/2) ||V||²_F,    (7)

where λ1, λ2 > 0. For a probabilistic interpretation with Gaussian observation noise for Eq. 7 we refer to Salakhutdinov and Mnih [25]. To reduce the model's complexity we set λ1 = λ2 in all experiments below [8, 18, 25].
The cost function defined by Eq. 7 punishes posts equally when they shift from the original rank scores Rij. However, a post ranked higher should be punished more than posts ranked lower if they shift from the original rank scores: higher ranked posts are more likely to be relevant, so that keeping their original rank scores will be of more value. Thus, we modify Eq. 7 to obtain the following cost function C1:

    S, V = arg min_{S,V} (1/2) Σ_{i=1}^m Σ_{j=1}^n Iij · w(dj|Li) (Rij − g(Si⊤Vj))² + (λ1/2) ||S||²_F + (λ2/2) ||V||²_F,    (8)

where w(dj|Li) is a rank punishment weighting function defined for dj given Li. Specifically, we define w(dj|Li) as

    w(dj|Li) = 1 / 2^{rank(dj|Li)−1}   if dj ∈ Li,   and   w(dj|Li) = 0   if dj ∉ Li.    (9)
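Leaving the burst term aside for a moment, Eqs. 6–9 describe a weighted, regularized logistic matrix factorization that is easy to prototype. The sketch below, in numpy, performs one gradient step on the cost C1 of Eq. 8; it is our own illustration under those equations, not the paper's C++ implementation.

```python
import numpy as np

def g(x):
    """Logistic function; keeps g(S_i^T V_j) in the same [0, 1] range as R_ij."""
    return 1.0 / (1.0 + np.exp(-x))

def c1_step(R, I, W, S, V, lam1=0.001, lam2=0.001, delta=0.05):
    """One gradient-descent step on the burst-free cost C1 of Eq. 8.

    R, I, W are m x n arrays holding the Eq. 1 rank scores, the indicator I_ij and the
    rank-punishment weights w(d_j|L_i) of Eq. 9; S is A x m and V is A x n.
    Returns the updated (S, V) and the current value of the cost.
    """
    P = g(S.T @ V)                                  # m x n matrix of g(S_i^T V_j)
    cost = 0.5 * np.sum(I * W * (R - P) ** 2) \
         + 0.5 * lam1 * np.sum(S ** 2) + 0.5 * lam2 * np.sum(V ** 2)
    G = I * W * (P - R) * P * (1.0 - P)             # residuals pushed through g'(x) = g(x)(1 - g(x))
    grad_S = V @ G.T + lam1 * S                     # A x m
    grad_V = S @ G + lam2 * V                       # A x n
    return S - delta * grad_S, V - delta * grad_V, cost
```

Iterating such steps until the cost stops decreasing mirrors the structure of lines 5–12 of Algorithm 1; TimeRA additionally includes the burst-reward term introduced next (Eq. 10).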
Our next step is to bring in burst information. After detecting a set of bursts, we introduce the following term into the cost function C1 to boost the relevance of posts by burst information:

    S, V = arg min_{S,V} (1/2) Σ_{i=1}^m Σ_{j=1}^n Σ_{b∈B} ( 1 / |{dk ∈ b : dk ≻Li dj}| ) Σ_{dk ∈ b, dk ≻Li dj} Iij Iik r(dj|dk) w(dk|Li) (Rik − g(Si⊤Vj))²,    (10)

where B is the set of detected bursts, b is a burst in B, dk ≻Li dj means dk is ranked higher than dj in list Li, |{dk ∈ b : dk ≻Li dj}| is the total number of posts in the burst b that are ranked higher than the post dj in Li, and r(dj|dk) is the "reward" score for dj from dk if dj is within or near the burst b to which dk belongs. In Eq. 10, dk ≻Li dj indicates that only the posts ranked higher than the post dj to be rewarded are capable of boosting the relevance of dj. If dj, the post to be rewarded, is generated at (almost) the same time as the highly relevant post dk, it will be rewarded more than posts that are further away in time from dk, which is measured by r(dj|dk) in Eq. 10. Accordingly, we define r(dj|dk) based on a normal distribution as

    r(dj|dk) = exp{ −(tdj − tdk)² / (2σb²) },    (11)

where tdj and tdk are the timestamps of posts dj and dk, respectively, and σb is the standard deviation of the timestamps in burst b:

    σb² = (1/nb) Σ_{i=1}^{nb} (i − (nb + 1)/2)² = (nb² − 1) / 12,

where nb is the number of different timestamps of posts in b. According to Eq. 11, the more likely dj is generated at the same time as dk, the larger the score r(dj|dk) is, resulting in a bigger reward for dj from dk.
We are almost ready now to define the cost function C2 that we use in TimeRA. We integrate the original rank scores of the posts in the result lists (Eq. 8) with rewards for posts generated in the vicinity of bursts (Eq. 10). To this end, we substitute Eq. 10 into Eq. 8. Our cost function C2 for TimeRA is defined as

    S, V = arg min_{S,V} ((1 − β)/2) Σ_{i=1}^m Σ_{j=1}^n Iij w(dj|Li) (Rij − g(Si⊤Vj))²
           + (β/2) Σ_{i=1}^m Σ_{j=1}^n Σ_{b∈B} ( 1 / |{dk ∈ b : dk ≻Li dj}| ) Σ_{dk ∈ b, dk ≻Li dj} Iij Iik r(dj|dk) w(dk|Li) (Rik − g(Si⊤Vj))²
           + (λ1/2) ||S||²_F + (λ2/2) ||V||²_F,    (12)

where β ∈ [0, 1] is a free parameter that governs the linear mixture. In words, in response to a query q, TimeRA uses a two-component mixture model to score d ∈ CL. The first component (the first term of Eq. 12) tries to maintain the original rank scores Rij. The second component (the second term of Eq. 12), which uses bursts of posts, "rewards" d if it is strongly associated with the posts in the bursts. Clearly, if β = 0 in Eq. 12, TimeRA will only try to maintain the original rank scores, i.e., Eq. 8.
A local minimum of the objective function given by Eq. 12 can be found by performing gradient descent in both Si and Vj. The same applies to Eqs. 4, 5, 6, 7, and 8. The algorithm for obtaining S and V is straightforward: we first randomly initialize S and V, and then iteratively update these two matrices based on their gradients until the value of the cost (objective) function converges. The derivation of these equations is included in Appendix A.
After optimizing Eq. 12, posts dj ∈ Li end up with a score g(Si⊤Vj) = Rij + "some reward." Unlike previous work that assigns 0 to missing documents dj ∈ CL \ Li [5, 7, 27, 31], we infer a rank score Rij for missing posts dj as:

    Rij = g(Si⊤Vj)   if g(Si⊤Vj) ≤ min(Rid),   and   Rij = min(Rid)   if g(Si⊤Vj) > min(Rid),    (13)

where min(Rid) is the minimum rank score, i.e., the rank score of the lowest ranked post that appears in Li as computed by Eq. 1. As shown in Eq. 13, if the inferred rank score g(Si⊤Vj) is smaller than the minimum rank score, we maintain the inferred score for that post. However, if the inferred rank score is greater than the minimum rank score, we give up the inferred score and let Rij = min(Rid), as dj ∉ Li means that it should at least be ranked lower than any post d ∈ Li.
The final rank aggregation score for dj ∈ CL is obtained by

    fTimeRA(dj|q) = Σ_{i=1, dj∈Li}^m g(Si⊤Vj) + Σ_{i=1, dj∈CL\Li}^m Rij.    (14)

The final rank aggregation score for a post thus consists of two parts, i.e., scores for the lists Li it appears in (the first term in Eq. 14) and scores for the lists it does not appear in (the second term in Eq. 14), as inferred by Eq. 13. Finally, the final fused list Lf produced in response to q is obtained by ranking posts in decreasing order of fTimeRA(dj|q). We call the model defined by Eq. 14 TimeRA, and the variant that ignores the inferred rank information Rij in Eq. 14 is called TimeRA-Infer ("TimeRA minus Infer").
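A compact sketch of how Eqs. 11, 13 and 14 turn the factorized matrices into final scores is given below. It is our own numpy illustration under the stated equations (names such as time_ra_scores are ours), and it assumes a burst has at least two distinct timestamps so that σb > 0.

```python
import numpy as np

def sigma_burst(n_b):
    """Timestamp standard deviation of a burst with n_b distinct timestamps: sqrt((n_b^2 - 1)/12)."""
    return np.sqrt((n_b ** 2 - 1) / 12.0)       # assumes n_b >= 2, otherwise the reward is undefined

def reward(t_j, t_k, sigma_b):
    """Eq. 11: Gaussian-shaped reward, largest when d_j and d_k share a timestamp."""
    return np.exp(-((t_j - t_k) ** 2) / (2.0 * sigma_b ** 2))

def time_ra_scores(R, I, S, V):
    """Eqs. 13-14: final TimeRA scores from the factorized matrices.

    R and I are m x n arrays with the Eq. 1 rank scores and the indicator I_ij;
    S is A x m and V is A x n. For lists containing a post we use g(S_i^T V_j);
    for missing posts the inferred score is capped at that list's minimum observed score.
    """
    P = 1.0 / (1.0 + np.exp(-(S.T @ V)))                               # g(S_i^T V_j), m x n
    min_obs = np.where(I > 0, R, np.inf).min(axis=1, keepdims=True)    # lowest observed score per list
    inferred = np.minimum(P, min_obs)                                  # Eq. 13
    contrib = np.where(I > 0, P, inferred)                             # observed vs. missing posts
    return contrib.sum(axis=0)                                         # Eq. 14: one score per post d_j
```

Ranking the posts by these scores in decreasing order yields the fused list Lf; dropping the inferred branch (using 0 for missing posts instead) corresponds to the TimeRA-Infer variant.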
4.3 Analysis of time-aware rank aggregation
The main computational costs of TimeRA are detecting bursts and computing the gradients with respect to the matrices S and V. Our burst detection has computational complexity O(n) because the detection algorithm runs in linear time [24], and the gradients of C2 have computational complexity of O(e × m × n × A). In practice, the number of result lists to be fused, m, the number of posts to be aggregated, n, the number of latent factors, A, and the number of epochs, e, are all quite small. Therefore, the rank aggregation procedure can run in real time, as we will see in the experiments presented in §6.6.
In the limiting case, TimeRA reverts back to CombSUM. That is, if we set β = 0 (ignoring burst information), set the weight function w(d|Li) = 1 for all documents d and lists Li, and, moreover, set λ1 = λ2 = 0, and do not try to infer the rank score of posts, then g(Si⊤Vj) = Rij after performing gradient descent in Eq. 12. Our experimental results below show that the performance of TimeRA is almost the same as CombSUM when β = 0. Note that λ1, λ2 ≠ 0 in the experiments.

5. EXPERIMENTAL SETUP
In this section, we list our research questions, describe the data set, specify our baselines, detail the metrics as well as our training and optimization setup, and describe the experiments.

5.1 Research questions
The research questions guiding the remainder of the paper are:
RQ1 Does TimeRA outperform traditional and state-of-the-art unsupervised or supervised rank aggregation methods and the best single result list? (§6.1)
RQ2 What are the relative contributions of the main ingredients of TimeRA (burst detection and score inference)? (§6.2)
RQ3 What is the effect of using burst information in TimeRA (i.e., what is the impact of the parameter β in Eq. 12)? (§6.3)
RQ4 What is the effect of the number of lists to be aggregated in TimeRA? (§6.4)
RQ5 Can we observe the hypothesized effect sketched in Fig. 1, i.e., posts in bursts being rewarded? (§6.5)
RQ6 Does TimeRA meet real-time search requirements? (§6.6)
RQ7 Does TimeRA beat existing time-sensitive microblog search models? (§6.7)

5.2 Data set
In order to answer our research questions we use the Tweets 2011 corpus [17, 29] provided by the TREC 2011 and 2012 Microblog tracks. The collection is comprised of around 16 million tweets collected over a period of 2 weeks (January 23, 2011 until February 8, 2011). Different types of tweets are present in this data set, including replies and retweets, each with their own timestamp. The task at the TREC 2011 and 2012 Microblog tracks was: given a query with a timestamp, return relevant and interesting tweets in reverse chronological order. In response to a query, a run was required to retrieve 30 posts. In total, for the 2011 track 49 test topics were created and 2965 tweets were deemed relevant; some topics have just two relevant tweets, while others have more than 100 relevant tweets. A total of 59 groups participated in the TREC 2011 Microblog track, with each team submitting at most four runs, which resulted in 184 runs (available from http://trec.nist.gov) [17]. The official evaluation metric was precision at 30 (p@30) [17, 29].
The p@30 scores of these 184 runs varied dramatically, with the best run achieving a p@30 score of 0.4551 and the worst runs achieving scores below 0.10. Details about the implementation of each run can be found in [17, 29]. We also run experiments on the TREC 2012 Microblog track runs, and find that they yield the same overall results and trends. Due to space limitations, we only report results for the TREC 2011 edition runs below.

5.3 Baselines and evaluation
We compare TimeRA to 5 aggregation baselines: 2 traditional unsupervised methods, i.e., CombSUM and CombMNZ, 2 state-of-the-art cluster-based fusion methods, ClustFuseCombSUM and ClustFuseCombMNZ [7], and a state-of-the-art supervised method, λ-Merge [27], in which we integrate temporal features. As TimeRA utilizes temporal information, we also compare TimeRA to 4 state-of-the-art time-sensitive microblog search algorithms: the time-based language model (TBLM) [12], the textual quality factor model with temporal query expansion (LM-T(qe)) [19], the direct time-sensitive BM25 retrieval model (DIRECT-BM25 (mean)) [3] and the temporal tweet selection feedback method (TSF+QDRM) [21]. To build the index of the dataset that some of our baselines require, we apply Porter stemming, tokenization, and stopword removal (using INQUERY lists) to posts using the Lemur toolkit (http://www.lemurproject.org). Features used in λ-Merge, including time-sensitive features, and the settings of λ-Merge are briefly described in Appendix B.
For performance evaluation we use the official TREC Microblog 2011 metric, p@30. We also report p@5, p@10, p@15 and MAP scores. MAP scores are of special interest to us: we hypothesize that TimeRA has both a precision- and a recall-enhancing effect and we use MAP to measure this. Statistical significance of observed differences between two runs' performances is tested using a two-tailed paired t-test and is denoted using N (or H) for significant differences for α = .01, or M (and O) for α = .05.
TimeRA incorporates a single free parameter in Eq. 12, β (∈ {0, 0.1, . . . , 1}), which is set using leave-one-out cross validation performed over the entire set of 49 queries. MAP is optimized in the learning phase. In other words, the performance for a query is attained using a value of β that maximizes MAP performance over all other queries. The same optimization strategy is used for one of our baselines, cluster-based fusion. The other baselines do not incorporate free parameters. Following [18], we set the parameters λ1 = λ2 = 0.001.

5.4 Experiments
We report on 7 main experiments in this paper aimed at understanding (1) the performance of TimeRA in general via sampling lists and fusing them; (2) the contribution of the main ingredients in TimeRA; (3) the performance of TimeRA with increasing numbers of runs to be fused; (4) query level performance; (5) TimeRA's efficiency; (6) the effect of inferring rank scores of posts by TimeRA; (7) the performance of TimeRA against temporal retrieval models.
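Returning to the evaluation setup of §5.3, the per-query significance testing can be reproduced with a standard two-tailed paired t-test; below is a minimal sketch using scipy.stats (our choice of tooling, not something the paper specifies), returning the marker letters used throughout the result tables.

```python
from scipy import stats

def significance_marker(per_query_a, per_query_b):
    """Two-tailed paired t-test between two runs' per-query scores (e.g., AP or p@30).

    Returns the paper's markers: N / H for a significant increase / decrease at alpha = .01,
    M / O at alpha = .05, and '' when the difference is not significant.
    """
    t, p = stats.ttest_rel(per_query_a, per_query_b)
    if p < 0.01:
        return "N" if t > 0 else "H"
    if p < 0.05:
        return "M" if t > 0 else "O"
    return ""
```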
Table 2: Sampled runs used in some of the experiments.
Class   | Sampled runs                                                      | Performance
Class 1 | clarity1, waterlooa3, FASILKOM02, isiFDL, DFReeKLIM30, PRISrun1   | 0.40 ≤ p@30
Class 2 | KAUSTRerank, ciirRun1, gut, dutirMixFb, normal, UDMicroIDF        | 0.30 ≤ p@30 < 0.40
Class 3 | WESTfilext, LThresh, qRefLThresh, run3a, Nestor, uogTrLqea        | 0.20 ≤ p@30 < 0.30
Class 4 | FASILKOM02, waterlooa3, gut, UDMicroIDF, run3a, qRefLThresh       | 0.20 ≤ p@30

To understand the overall performance of TimeRA, we sample ∼10% from the ranked lists produced by participants in the TREC 2011 Microblog track based on the lists' p@30 distribution: 18 of the runs submitted to the TREC 2011 Microblog track, 6 with p@30 scores between 0.20 and 0.30 (Class 3), 6 between 0.30 and 0.40 (Class 2), and 6 over 0.40 (Class 1). We also randomly choose two runs from each class to construct Class 4; see Table 2. The runs in Class 1 are the 6 best runs in the TREC 2011 Microblog track. In every class, we use run1, run2, run3, run4, run5 and run6 to refer to the runs in descending order of p@30.
Next, to understand the contributions of the two main ingredients of TimeRA, viz., burst detection and inferring scores, we make comparisons among TimeRA, TimeRA-Infer and CombSUM. We also gradually increase the parameter β in Eq. 12 from 0.0 to 1.0 to see if burst information is helpful to boost fusion performance. To understand the effect of the number of lists being merged, we randomly choose k = 2, 4, 6, . . . , 36 lists from the 184 lists and aggregate them. We repeat the experiments 20 times and report the average results and standard deviation. In order to understand the query-level performance of TimeRA, we provide a detailed comparison of its performance against the baseline methods. To determine whether TimeRA can respond to a given query in (near) real time, we again randomly fuse k = 2, 6, 12, 18, 30 lists for all 49 test queries and report the average time required. Finally, we compare TimeRA against state-of-the-art time-sensitive retrieval models that utilize time/burst information.

6. RESULTS AND ANALYSIS
§6.1 and §6.2 show the results of fusing the sample lists and the contributions of burst detection and score inference in TimeRA, respectively; §6.3 analyzes the effect of using burst information; §6.4 shows the effect of the number of lists on the overall performance; §6.5 provides a topic-level analysis; §6.6 examines the run times. Finally, §6.7 compares TimeRA against time-sensitive models.

6.1 Fusing the sample lists
The performance of TimeRA and the 5 baselines is presented in Table 3, with numbers based on the ∼10% sample mentioned in §5.4. The performance of all the fusion methods is better than that of the best performing result list that is used in the merging process (run1) for all classes and on almost all metrics. Many of these improvements are statistically significant. More importantly, when fusing the top 6 result lists (Class 1), all of the p@30 scores generated by any rank aggregation method are higher than that of the best run in the TREC 2011 Microblog track (0.4551), especially for TimeRA, which achieves 0.5531. These findings attest to the merits of using rank aggregation methods for microblog search.
It is also worth noting in Table 3 that in almost all cases, the cluster-based method does not beat the standard fusion method that it integrates, and the performance differences between the two are usually not significant. The reason behind this may be that it is challenging to do clustering in a microblog environment, with limited amounts of text and very creative language usage.
The performance of TimeRA is better than that of the baseline methods, and all of the differences are substantial and statistically significant. The performance of λ-Merge is almost the same as that of CombSUM, CombMNZ and the cluster-based methods when fusing the lists in Class 2, but in the other classes the performance tends to be a bit below that of the other methods, on all metrics. This may be due to overfitting. Interestingly, the higher the quality of the result lists that are being aggregated, the bigger the improvements that can be observed in Table 3. For instance, the p@30 scores after aggregation are highest in Class 1 followed by those in Class 2 and Class 3, and the quality of Class 1 is best followed by Class 2 and Class 3, respectively. The p@30 aggregation scores in Class 4 are almost the same as those in Class 2, as some of the lists' scores in Class 4 are better than those in Class 2.

6.2 Contributions of the main ingredients
Next, we compare the relative contributions of the two main ingredients of TimeRA, burst detection and score inference, against the second best baseline, viz., CombSUM. The effect of burst detection in TimeRA can be seen through comparisons between TimeRA-Infer and CombSUM; the effect of score inference can be seen through comparisons between TimeRA and TimeRA-Infer in Fig. 2.
Interestingly, in Fig. 2 there are large gaps in performance between TimeRA-Infer and CombSUM on all metrics; all improvements are statistically significant. This illustrates that burst detection makes an important contribution to the performance of rank aggregation. When we compare TimeRA and TimeRA-Infer in Fig. 2, we see that the performance of TimeRA-Infer in terms of p@5 is almost the same as that of TimeRA, while in terms of p@10 and p@30, TimeRA has some small advantages over TimeRA-Infer; some of these improvements are statistically significant. This observation confirms that inferring scores for posts during aggregation can boost performance as well. It also shows, however, that enhancing the performance of p@k becomes easier for larger values of k. This is because the cost of boosting the performance (i.e., changing the original rank score to be higher) is smaller when the posts are ranked lower. TimeRA is unable to beat TimeRA-Infer in terms of p@5 (.6939 for both), but TimeRA does boost the p@30 performance (.5531M for TimeRA vs .5405 for TimeRA-Infer). As burst information is such an important contributor to the performance of TimeRA, we analyze it further in §6.3.

6.3 The use of burst information
Next we examine the effect of using different amounts of burst information or cluster information in our time-aware aggregation or cluster-based methods, respectively. What is the impact of the free parameter β in Eq. 12 and in the cluster-based methods? Fig. 3 depicts the MAP performance curves for all rank aggregation methods when fusing the lists in Class 1, Class 2, Class 3 and Class 4, respectively. For β = 0, TimeRA almost amounts to CombSUM, while the cluster-based methods are the same as the standard fusion methods they incorporate, e.g., ClustFuseCombSUM does not differ from CombSUM in this case; more weight is put on burst information and cluster information with higher values of β in TimeRA and the cluster-based methods, respectively. For 0 < β < 1, both the CombSUM scores of posts and burst information are utilized for aggregating lists in TimeRA.
In each of our four classes of runs, when aggregating lists, the MAP scores of TimeRA where burst information is used (β > 0) are always higher than those of any other fusion method. In Class 1 and Class 4 the gain increases gradually as the weight of burst information increases. These findings attest to the merits of using burst information to boost the performance in fusing ranked lists for microblog search. Putting more weight on cluster information in the cluster-based methods hurts performance in many cases.

6.4 Effect of the number of lists being merged
We explore the effect of varying the number of lists to be merged on the performance of TimeRA. Fig. 4 shows the fusion results of randomly sampling k ∈ {2, 4, 6, . . . , 36} lists from the 184 lists. For each k, we repeat the experiment 20 times and report on the average scores. We use CombSUM as a representative example for comparisons with TimeRA; the results of other baseline methods are worse or qualitatively similar to those of CombSUM.
From Fig. 4 we can see that TimeRA performs better than CombSUM over all performance evaluation metrics no matter how many lists are fused. For both precision metrics (p@5 and p@30) we find that as long as the number of lists is at most 10, the performance of both TimeRA and CombSUM gradually increases as the number of lists to be merged increases. The increases level off when the number of lists exceeds 12. For MAP we find that performance keeps increasing until we fuse 26 lists; then, the performance increase levels off. Interestingly, in Fig. 4 the improvement of TimeRA over CombSUM on p@5 becomes smaller when more lists are merged. For example, the increase in p@5 of TimeRA over CombSUM is .1063 (.5861 for TimeRA vs .4798 for CombSUM) when two lists are fused. The performance increase, however, drops to only .0281 (.6712 for TimeRA vs .6431 for CombSUM) for 36 lists. Looking at the other metrics, which take a larger part of the merged result into account (p@30 and especially MAP), the gap remains.

6.5 Query-level analysis
We take a closer look at per test query improvements of TimeRA over other runs. For brevity, we only consider CombSUM as a representative and we only consider runs in Class 1. The results of TimeRA against CombSUM for other classes of runs and for other baseline methods are qualitatively similar. Fig. 5 shows per query performance differences in terms of AP, p@5, p@10 and p@30, respectively, between TimeRA and CombSUM. TimeRA displays both a precision- and recall-enhancing effect (with increases in precision oriented metrics as well as in MAP). As the metric at hand considers a larger chunk of the result list, there are more instances where TimeRA outperforms CombSUM. This is due mainly to topics that are discussed only in very specific time intervals. Examples include queries MB010 (Egyptian protesters attack museum), MB011 (Kubica crash) and MB015 (William and Kate fax save-the-date). For such queries we found evidence of the intuition depicted in Fig. 1: posts that are ranked low in a small number of lists are pushed up by TimeRA because they are central to a burst. E.g., in response to query MB010, post #30354903104749568 is ranked near the bottom in just two lists (at ranks 26 and 27 in the runs clarity1 and DFReeKLIM30, respectively). Many posts for the query were generated around the same time interval (Jan. 26–29) and are ranked high in many lists; post #30354903104749568 was also published around this time and ranked 6th in the merged list because of this.
Queries for which TimeRA cannot beat CombSUM tend to be quite general and unrelated to any specific time windows. Examples include queries MB023 (Amtrak train service), MB027 (reduce energy consumption) and MB029 (global warming and weather) etc. For a very small number of queries, TimeRA’s performance is slightly worse than that of CombSUM. One reason that we observed for this phenomenon is that not all posts that are ranked low in a small number of lists but central to a burst need to be rewarded. An example here is query MB024 (Super Bowl, seats). Table 3: Retrieval performance on the ∼10% sample lists. Boldface marks the best result per metric; a statistically significant difference between TimeRA and the best baseline method is marked in the upper right hand corner of the TimeRA score. A significant difference with run1 for each method is marked in the upper left hand corner using the same symbols. None of the differences between the cluster-based method and the standard method it incorporates are statistically significant. Class 1 run1 run2 run3 run4 run5 run6 CombSUM ClustFuseCombSUM N CombMNZ ClustFuseCombMNZ N λ-Merge TimeRA N N N N Class 2 MAP p@5 p@10 p@15 p@30 MAP p@5 p@10 p@15 p@30 .2210 .2690 .2318 .2058 .2575 .2098 .5918 .5959 .5755 .5714 .5673 .5469 .5673 .5796 .5367 .5367 .4980 .5102 .5347 .5442 .5034 .4939 .4721 .4694 .4551 .4537 .4401 .4211 .4211 .4095 .1457 .1886 .1525 .1376 .1688 .1820 .4612 .4776 .4041 .3959 .3878 .4122 .4143 .4347 .4143 .3939 .3633 .3796 .3714 .3878 .3878 .3796 .3605 .3619 .3571 .3463 .3408 .3218 .3136 .3027 .6245 .6240 .5816 .5802M .5524 .5503 .6245 .6231 .5755 .5748 .5524 .5502M N .5456 .6259N N .3404 .3398 N .3385 .3355 N .3245 .3834N N N N N .6213 .6939N N .5734 .6510N N N N N M .4966 .4899 N .5020 .4987 N .4833 .5531N N N N N .2625 .2612 N .2581 .2560 N .2573 .3037N N N N N .5306 .5287 N .5347 .5330 M .5634 .6327N N M N N Class 3 run1 run2 run3 run4 run5 run6 CombSUM ClustFuseCombSUM N CombMNZ ClustFuseCombMNZ N λ-Merge TimeRA N N N N 0.4 .4592 .4523 N .4952 .5551N N N N N .4286 .4213 N .4354 .4311 N .4463 .5211N N N N N .3735 .3686 .3789 .3731 .3901 .4320N Class 4 p@5 p@10 p@15 p@30 MAP p@5 p@10 p@15 p@30 .1661 .0997 .1636 .0753 .0571 .0994 .4041 .3429 .3959 .3265 .2980 .3510 .3408 .3000 .3122 .2735 .2551 .2735 .2898 .2653 .2571 .2585 .2408 .2408 .2122 .2095 .2041 .2034 .2020 .2016 .2058 .2098 .1376 .1820 .1636 .0753 .5714 .5469 .3959 .4122 .3959 .3265 .5367 .5102 .3939 .3796 .3122 .2735 .4939 .4694 .3796 .3619 .2571 .2585 .4211 .4095 .3218 .3027 .2041 .2034 .2150 .2142 N .2187 .2151 N .2125 .2491N N N N N .4857 .4804 N .4898 .4872 N .5057 .5347N N N N N .4327 .4267 N .4327 .4265 N .4373 .4694N N N N N .3837 .3806 N .3932 .3886 N .3827 .4122N N N N N .2952 .2882 N .2973 .2894 N .2933 .3293N N M N N .2795 .2741 N .2794 .2744 N .2753 .3210N N N N N .6122 .6088 N .6000 .5975 N .6046 .6735N N M N N .5327 .5297 N .5449 .5407 N .5387 .5918N M N N .4721 .4674 O H .4830 .4782 .4918 .5429N .3918 .3865 .4048 .3958 N .4047 .4347N 0.7 CombSUM TimeRA−Infer TimeRA 0.36 CombSUM TimeRA−Infer TimeRA 0.65 CombSUM TimeRA−Infer TimeRA 0.55 0.65 0.5 0.6 0.34 0.6 0.3 0.28 p@30 p@10 0.45 p@5 0.32 MAP N MAP CombSUM TimeRA−Infer TimeRA 0.38 0.55 0.4 0.55 0.26 0.5 0.24 0.35 0.5 0.22 0.2 .4531 .4500 0.45 Class 1 Class 2 Class 3 Class 4 0.45 Class 1 Class 2 Class 3 Class 4 0.3 Class 1 Class 2 Class 3 Class 4 Class 1 Class 2 Class 3 Class 4 Figure 2: Retrieval performance of CombSUM, TimeRA-Infer (without using score inference for posts) and TimeRA in terms of MAP, p@5, 
p@10, and p@30 when fusing the runs in the 4 classes. Note that figures should be viewed in color.

6.6 Run time comparisons
We now explore how fast TimeRA can merge result lists in response to a query. TimeRA is developed in C++ and the experiments are run on a MacBook Pro running OS X 10.6.8 with 4GB memory and a 2.3 GHz Intel Core i5 processor. In Table 4, we randomly choose k ∈ {2, 6, 12, 18, 30} lists from the 184 lists. For each k, we repeat the experiment 20 times and report the average run time per query (in seconds) that the fusion methods require. ClustSUM and ClustMNZ are short for ClustFuseCombSUM and ClustFuseCombMNZ, respectively, in Table 4.
TimeRA does not run as fast as CombSUM or CombMNZ, but it manages to merge lists in near real time. TimeRA merges the lists within 0.1s when given 30 result lists and within 0.01s when fusing two lists. As the number of lists to be fused increases, the time spent on fusing grows linearly for CombSUM, CombMNZ, TimeRA, and λ-Merge; for ClustSUM and ClustMNZ, the time increases dramatically with larger numbers of lists. When fusing 30 lists, ClustMNZ needs 235.91s, although it only spends 1.03s on fusing two lists. The times clocked for CombSUM and CombMNZ are similar, and likewise for those of ClustSUM and ClustMNZ.

6.7 Effect of fusing time-sensitive result lists
TimeRA uses temporal information in an essential way. How does it compare against retrieval models that explore temporal information? To answer this question, we conduct 4 additional experiments to generate 4 time-sensitive result lists, and explore the performance of the baseline fusion methods and of TimeRA when each of them fuses these 4 lists. These lists are generated by TBLM [12], LM-T(qe) [19], DIRECT-BM25 (mean) [3] and TSF+QDRM [21]. Table 5 shows the fusion results. Except for ClustFuseCombMNZ and λ-Merge, the fusion baselines and TimeRA outperform the best time-sensitive component run (TSF+QDRM) on all metrics, and most of the improvements are statistically significant. This illustrates that exploring time information in data fusion has a different effect than utilizing time information in an individual ranking function, an effect that can lead to performance increases. One reason behind this is that posts within intervals in which many relevant posts appear can only be confirmed to be relevant by gathering data from multiple lists, time-sensitive or not.

Figure 3: Effect of varying the value of β on the MAP performance of six aggregation methods, for lists in (a) Class 1, (b) Class 2, (c) Class 3 and (d) Class 4. More weight is put on burst information and on cluster information with higher values of β in TimeRA and the cluster-based methods, respectively. Note: the figures are not to the same scale.

Figure 4: Effect on performance (in terms of MAP, p@5, and p@30) of the number of lists being merged. We plot averages and standard deviations. Note: the figures are not to the same scale.
Table 4: Time spent on fusing lists by different aggregation methods. Recorded in seconds with standard deviations (std).
Number of lists      2         6         12        18        30
CombSUM          3.06e-4   5.76e-4   1.03e-3   1.98e-3   3.37e-3
  std            1.13e-5   2.57e-5   6.93e-5   6.49e-5   7.50e-5
CombMNZ          3.06e-4   5.76e-4   1.03e-3   1.99e-3   3.38e-3
  std            1.13e-5   2.57e-5   6.93e-5   6.52e-5   7.01e-5
TimeRA           4.77e-3   1.69e-2   2.05e-2   4.56e-2   9.60e-2
  std            7.60e-5   7.41e-5   2.20e-4   3.53e-4   3.60e-4
λ-Merge          1.15      3.82      7.78      12.03     20.74
  std            1.22e-1   5.82e-1   8.03e-1   1.09      1.18
ClustSUM         1.03      4.12      37.56     88.21     235.91
  std            9.87e-2   4.24e-1   1.26      3.54      13.08
ClustMNZ         1.03      4.12      37.56     88.21     235.91
  std            9.87e-2   4.24e-1   1.26      3.54      13.08

Table 5: Performance on 4 time-sensitive result lists. Boldface marks the better result per metric; a statistically significant difference between TimeRA and the best baseline is marked in the upper right hand corner of the TimeRA score; a statistically significant difference with TSF+QDRM for each fusion method is marked in the upper left hand corner of the score.
                      MAP     p@5     p@10    p@15    p@30
TSF+QDRM             .2834   .6220   .6856   .6279   .5368
DIRECT-BM25 (mean)   .2798   .6187   .6725   .6320   .5133
LM-T (qe)            .2346   .5836   .5648   .5178   .4471
TBLM                 .2231   .5742   .5433   .5017   .4395
CombSUM              .2962   .6395   .6973   .6482   .5513
ClustFuseCombSUM     .2951   .6385   .6927   .6407   .5476
CombMNZ              .2948   .6350   .6918   .6347   .5420
ClustFuseCombMNZ     .2886   .6312   .6873   .6283   .5416
λ-Merge              .2847   .6275   .6876   .6279   .5383
TimeRA               .3117   .6518   .7023   .6545   .5733

7. CONCLUSION
The special nature of microblog posts, e.g., their limited length and their creative language usage, raises challenges for searching them. However, this special nature also provides unique algorithmic opportunities. In this paper, we focus on utilizing time information to boost the performance of searching microblog posts. Specifically, we proposed a novel rank aggregation approach, TimeRA, that utilizes bursts and only rank information to aggregate result lists. TimeRA first detects bursts of posts across the lists utilizing the original rank information of the posts, and then rewards posts that are ranked low in few lists but are in the vicinity of a burst that contains higher ranked posts. It also infers the rank scores of missing posts by modeling lists and posts as a mixture of latent factors. Our experimental results show that both utilizing burst information and score inference for rank aggregation can significantly enhance retrieval performance when compared against traditional and state-of-the-art, supervised and unsupervised rank aggregation approaches for microblog post search. Additional analyses show that TimeRA is a robust and efficient rank aggregation method that outperforms state-of-the-art temporal retrieval algorithms.
As to future work, we plan to look into combining social information, such as user relationships, into rank aggregation and to further analyze our model in scenarios where the documents being searched are published in bursts.

Acknowledgements.
This research was supported by the China Scholarship Council, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements nr 288024 (LiMoSINe) and nr 312827 (VOX-Pol), the Netherlands Organisation for Scientific Research (NWO) under project nrs 727.011.005, 612.001.116, HOR-11-10, 640.006.013, the Center for Creation, Content and Technology (CCCT), the Dutch national program COMMIT, the ESF Research Network Program ELIAS, the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences (KNAW), the Netherlands eScience Center under project number 027.012.105, the Yahoo! Faculty Research and Engagement Program, the Microsoft Research PhD program, and the HPC Fund.

Figure 5: Per query performance differences, TimeRA vs. CombSUM. The figures shown are for the runs in Class 1, for MAP, p@5, p@10 and p@30 (left to right). A bar extending above the center indicates that TimeRA outperforms CombSUM, and vice versa.

8. REFERENCES
[1] J. A. Aslam and M. Montague. Models for metasearch. In SIGIR, pages 276–284, 2001.
[2] H. Bota, K. Zhou, J. M. Jose, and M. Lalmas. Composite retrieval of heterogeneous web search. In WWW, 2014.
[3] W. Dakka, L. Gravano, and P. Ipeirotis. Answering general time-sensitive queries. IEEE Trans. Knowledge and Data Engin., 24(2):220–235, 2012.
[4] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW, pages 613–622, 2001.
[5] E. A. Fox and J. A. Shaw. Combination of multiple searches. In TREC-2, 1994.
[6] Z. Han, X. Li, M. Yang, H. Qi, S. Li, and T. Zhao. HIT at TREC 2012 Microblog track. In TREC '12 Working Notes, 2012.
[7] A. K. Kozorovitsky and O. Kurland. Cluster-based fusion of retrieved lists. In SIGIR, pages 893–902, 2011.
[8] M. Kurucz, A. A. Benczúr, and K. Csalogány. Methods for large scale SVD with missing values. In KDD Cup and Workshop, 2007.
[9] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In SIGIR, 2001.
[10] T. Lappas, B. Arai, M. Platakis, D. Kotsakos, and D. Gunopulos. On burstiness-aware search for document sequences. In KDD, pages 477–486, 2009.
[11] J. H. Lee. Combining multiple evidence from different properties of weighting schemes. In SIGIR, pages 180–188, 1995.
[12] X. Li and W. B. Croft. Time-based language models. In CIKM, pages 469–475, 2003.
[13] S. Liang, M. de Rijke, and M. Tsagkias. Late data fusion for microblog search. In ECIR '13, pages 743–746, 2013.
[14] S. Liang, Z. Ren, and M. de Rijke. Fusion helps diversification. In SIGIR, pages 303–312, 2014.
[15] S. Liang, Z. Ren, and M. de Rijke. Personalized search result diversification via structured learning. In KDD '14, 2014.
[16] S. Liang, Z. Ren, and M. de Rijke. The impact of semantic document expansion on cluster-based fusion for microblog search. In ECIR, pages 493–499, 2014.
[17] J. Lin, C. Macdonald, I. Ounis, and I. Soboroff. Overview of the TREC 2011 Microblog track. In TREC 2011. NIST, 2011.
[18] H. Ma, D. Zhou, and C. Liu. Recommender systems with social regularization. In WSDM, 2011.
[19] K. Massoudi, M. Tsagkias, M. de Rijke, and W. Weerkamp. Incorporating query expansion and quality indicators in searching microblog posts. In ECIR, pages 362–367, 2011.
[20] M. Mathioudakis, N. Bansal, and N. Koudas. Identifying, attributing and describing spatial bursts. In VLDB, pages 1091–1102, 2010.
[21] T. Miyanishi, K. Seki, and K. Uehara.
Improving pseudo-relevance feedback via tweet selection. In CIKM, pages 439–448, 2013.
[22] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In CIKM, pages 538–548, 2002.
[23] M.-H. Peetz, E. Meij, M. de Rijke, and W. Weerkamp. Adaptive temporal query modeling. In ECIR '12, 2012.
[24] W. L. Ruzzo and M. Tompa. A linear time algorithm for finding all maximal scoring subsequences. In Int. Conf. Int. Syst. Molec. Biol., pages 234–241, 1999.
[25] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2008.
[26] J. A. Shaw and E. A. Fox. Combination of multiple searches. In TREC 1992, pages 243–252. NIST, 1993.
[27] D. Sheldon, M. Shokouhi, M. Szummer, and N. Craswell. LambdaMerge: merging the results of query reformulations. In WSDM, pages 795–804, 2011.
[28] M. Shokouhi and L. Si. Federated search. Foundations and Trends in Information Retrieval, 5(1):1–102, 2011.
[29] I. Soboroff, I. Ounis, and J. Lin. Overview of the TREC 2012 Microblog track (notebook version). In TREC 2012. NIST, 2012.
[30] W. Weerkamp and M. de Rijke. Credibility-inspired ranking for blog post retrieval. Information Retrieval Journal, 15(3–4):243–277, 2012.
[31] S. Wu. Data fusion in information retrieval, volume 13 of Adaptation, Learning and Optimization. Springer, 2012.

APPENDIX
A. DERIVATION OF THE MODELS
The following is the derivation of Eq. 12. We leave out the derivations of Eqs. 4, 5, 6, 7 and 8 as they may be obtained in a similar way.

    ∂C2/∂Si = (1 − β) Σ_{j=1}^n Iij w(dj|Li) (g(Si⊤Vj) − Rij) g′(Si⊤Vj) Vj
              + β Σ_{j=1}^n Σ_{b∈B} ( 1 / |{dk : dk ≻Li dj | b}| ) Σ_{dk ∈ b, dk ≻Li dj} Iij Iik r(dj|dk) w(dk|Li) (g(Si⊤Vj) − Rik) g′(Si⊤Vj) Vj + λ1 Si

    ∂C2/∂Vj = (1 − β) Σ_{i=1}^m Iij w(dj|Li) (g(Si⊤Vj) − Rij) g′(Si⊤Vj) Si
              + β Σ_{i=1}^m Σ_{b∈B} ( 1 / |{dk : dk ≻Li dj | b}| ) Σ_{dk ∈ b, dk ≻Li dj} Iij Iik r(dj|dk) w(dk|Li) (g(Si⊤Vj) − Rik) g′(Si⊤Vj) Si + λ2 Vj,    (15)

where g′(x) = exp(x)/(1 + exp(x))² is the derivative of the logistic function g(x) = 1/(1 + exp(−x)).

B. λ-MERGE AND ITS FEATURES
Originally, λ-Merge is an algorithm that fuses the result lists of query reformulations. The method we call λ-Merge is the same as the original one, except that we consider the weights of query reformulations in the original λ-Merge as the weights of the result lists retrieved in response to the same query in our experiments. In total, we use 13 features, extracted at three levels: the time-sensitive level, the query-document level, and the document level; all features are represented by either binary or real numbers. At the time-sensitive level, we obtain the retrieval scores from our 4 time-sensitive retrieval baselines, i.e., TBLM [12], LM-T(qe) [19], DIRECT-BM25 (mean) [3] and TSF+QDRM [21], as features. At the query-document level, we use 3 features: Rank, Rankers and IsTop-N. Rank is defined in Eq. 1. Rankers is the number of ranked lists in which the post appears over the total number of lists to be merged. IsTop-N is a binary feature indicating whether the document is within the top-N results in the list Li, where N ∈ {5, 10, 15, 20}. At the document level, we extract features capable of indicating the quality of a post [17, 29]. The post features include Link, Hashtag and Retweet; the values of these three features are set to 1 if the document contains links, hashtags, or retweets, respectively. The document-level features also include content quality indicators of a post (Density, Capitalization and Length) [30].
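For readers who want to check the gradients of Appendix A numerically, the sketch below verifies the two equivalent forms of g′(x) used in Eq. 15 and compares the analytic gradient of the burst-free case (β = 0, w = 1, i.e., Eq. 7) against a finite difference on a tiny random instance. It is our own numpy illustration and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

# The two forms of the derivative in Eq. 15 agree: g'(x) = exp(x)/(1+exp(x))^2 = g(x)(1-g(x)).
x = rng.normal(size=5)
assert np.allclose(np.exp(x) / (1.0 + np.exp(x)) ** 2, g(x) * (1.0 - g(x)))

# Finite-difference check of the Eq. 7 gradient w.r.t. one entry of S (burst-free case).
m, n, A, lam = 3, 4, 2, 0.001
R, I = rng.random((m, n)), (rng.random((m, n)) > 0.3).astype(float)
S, V = rng.normal(size=(A, m)), rng.normal(size=(A, n))

def cost(S):
    P = g(S.T @ V)
    return 0.5 * np.sum(I * (R - P) ** 2) + 0.5 * lam * (np.sum(S ** 2) + np.sum(V ** 2))

P = g(S.T @ V)
grad_S = V @ (I * (P - R) * P * (1 - P)).T + lam * S   # analytic gradient, cf. Eq. 15 with w = 1, beta = 0
eps, (a, b) = 1e-6, (0, 1)
S_plus, S_minus = S.copy(), S.copy()
S_plus[a, b] += eps
S_minus[a, b] -= eps
numeric = (cost(S_plus) - cost(S_minus)) / (2 * eps)
assert np.isclose(grad_S[a, b], numeric, atol=1e-5)
```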