Hashtag Clustering to Summarize the Topics Discussed by Dutch Members of Parliament

Jan Kalmeijer

February 27, 2014

Abstract

Politicians are adopting social media, creating an avenue for social science research. However, social media data is largely unstructured. This work analyzes tweets sent by 144 members of Dutch parliament over a period of three months. We cluster hashtags based on co-occurrence to identify discussed topics. This approach produces clusters of reasonable quality, but the identified topics are perhaps too specific to be used in political science research.

1 Introduction

In the past, socio-political researchers have analyzed the political agenda by content coding¹ official documents, such as the Queen's speech and the European Council Conclusions. As social media grow, data about political agendas becomes available in new forms. Politicians can now express their views and agendas in real time through communication channels such as Twitter.

This new form of data has properties that make it interesting to use alongside the more formal documents currently used in political science research: the data is real-time and allows politicians to broadcast a personal view. These digital communication channels are also less troubled by size constraints, which leaves room for more personal issues. This could allow for a more fine-grained analysis than is possible with formal documents. The downsides of social media data are its volume and its quality.

In this study we analyze tweets sent by Dutch members of parliament. As all members combined send only roughly 300 tweets² per day on average, manual examination of these tweets is doable but expensive and tedious. Instead we try to summarize the tweets by clustering them based on their hashtags. Hashtag-based clustering gives us an overview of the topics discussed in the tweets without the need for manual annotation.

A challenge when using clustering is the evaluation of the produced clusters. We propose a number of properties, specific to the problem of clustering hashtags, that we hypothesize can aid in assessing the quality of the clusters. We try to verify this claim by examining the relationship between the cluster properties and our subjective notion of cluster quality.

We briefly discuss the Twitter platform in Section 2, and related work in Section 3. In Section 4 we contribute recent statistics on how Dutch members of parliament use Twitter. We explain our approach to clustering hashtags in Section 5, discuss the experimental results in Section 6, and summarize and conclude in Section 7.

¹ Content coding refers to determining which category (e.g., macroeconomics and taxes, environment, education and culture, etc.) best characterizes a piece of content (e.g., a sentence or paragraph) in an official document.
² See Section 2 for a brief description of the Twitter platform and terminology.

2 Twitter platform

Twitter is a microblogging platform that allows users (also called twitterers or tweeters) to post messages (tweets) of up to 140 characters. In addition to text and links, a tweet can contain hashtags and mentions. The hashtag symbol '#' is used as a prefix of a sequence of characters to mark keywords or topics in a tweet. The mention symbol '@' is used as a prefix of user names (also known as Twitter handles) to automatically notify users that they are mentioned in a tweet.

Users can personalize their Twitter experience by following other users. If user A follows user B, all tweets and updates of user B are shown on user A's Twitter homepage. Following is not the only method of aggregating tweets. Another method is to create a list: a user-created group of twitterers. When the list is viewed, only tweets from users on that list are shown.
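As a concrete illustration of the hashtag and mention syntax described above, a minimal sketch in Python. The function name and the simplified `\w+` patterns are our own; Twitter's real tokenization rules are more involved.

```python
import re

# Hypothetical helper: a tag or handle is taken to be '#' or '@' followed by
# word characters. Twitter's actual parsing rules are more elaborate.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")

def extract_entities(text):
    """Return the hashtags and mentions occurring in a tweet's text."""
    return HASHTAG_RE.findall(text), MENTION_RE.findall(text)

tags, handles = extract_entities("Vanavond debat over #zorg en #huur, kijk mee @tweedekamer")
# tags -> ['zorg', 'huur'], handles -> ['tweedekamer']
```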
Some users actively try to get other users to follow them, because more followers means more people will read their tweets, presumably making them more influential. Another method of reaching a large audience is retweeting. When a tweet is posted it is broadcast to all the followers of the tweet's author. These followers can then retweet (repost) the tweet, which shares it with their own followers. A retweet quotes the original tweet's text, and can contain additional information.

Twitter algorithmically determines topics that are currently popular on the platform. These topics are called trends and are personalized for each user based on their location and on who they follow. Tweeting about trending topics is another method of getting noticed, as the trending topics are shown in a separate column on the Twitter page.

The Twitter API enables automated collection of tweets under two restrictions: the number of tweets that can be collected is rate limited, and tweets older than a few days are not available. In addition to the tweet's text, the Twitter API provides metadata such as the tweet's author, the tweet's creation time, and the number of times the tweet has been retweeted. If a tweet is a retweet, the same information is provided for the original tweet.

3 Related Work

Schäfer, Overheul, and Boeschoten examined the use of Twitter by Dutch political figures in 2011, and found four different categories of Twitter users within this demographic [10]. They also found that participation was between 60 and 70 percent. Members of Dutch parliament have since embraced Twitter, and participation is now up to 85 percent.

Tweets have also been used by Tumasjan, Sprenger, Sandner, and Welpe to predict the German federal election results [11]. They found that the number of mentions of a political party (or of a prominent member of such a party) was a reasonable predictor of election results.
The approach of counting Twitter mentions produced a mean absolute error of 1.65% when compared to the actual election outcome. In comparison, the worst and best poll errors were 1.48% and 0.80% respectively.

Approaches to the problem of cluster tracking are suggested in [8, 1]. Allan, Papka, and Lavrenko state the problem of first story detection in news stories [1]. They represent old stories as clusters, and use a threshold on the distance between the old story clusters and a new story to decide whether the new story is actually novel or has been seen before. Petrović, Osborne, and Lavrenko take a similar approach, but with application to Twitter [8].

4 Data

The dataset is a collection of tweets tweeted by members of the Dutch parliament spanning from the end of September to the end of December, 2013. All tweets were collected from a Twitter list³ of all Twitter accounts of members of parliament. This list contains 144 members, of which 133 sent at least one tweet during the mentioned period. This means more than 85 percent of the members of parliament participate on Twitter. In April 2011 this was only between 60 and 70 percent [10]. Eight of these 133 members are not actually members of parliament. They either resigned (e.g., Henk Krol), or are on leave (e.g., Sadet Karabulut). See Table 1 for a summary of how well every party is represented on Twitter.

³ https://twitter.com/RTLNieuwsnl/tweede-kamer

party  | seats | twitterers | tweets (total) | tweets (average)
VVD    |  41   |  38        |  5423          |  142
PVDA   |  38   |  34        |  5516          |  162
PVV    |  15   |   6        |   471          |   78
SP     |  15   |  11        |  3170          |  288
CDA    |  13   |  14        |  4006          |  286
D66    |  12   |  12        |  3611          |  300
CU     |   5   |   5        |  1318          |  263
GL     |   4   |   5        |  1743          |  348
SGP    |   3   |   3        |   372          |  124
PVDD   |   2   |   2        |   704          |  352
50PLUS |   2   |   2        |   255          |  127

Table 1: Participation on Twitter for each party. The number of seats is the number of seats in parliament. The number of twitterers is the number of party members that sent at least one tweet; the counts in this column include extras.
A twitterer is considered an extra if they are no longer an official member of parliament, but were at some point during the collection of the data.

Using the timestamp of each tweet, we can look at the weekly (Figure 1) and daily (Figure 2) tweet frequency patterns. The most notable effect is that parties generally tweet less during the weekends, and parties with a religious background tweet less or not at all on Sundays.

The members of parliament send a total average of 298 tweets per day, but not everyone participates equally: GL, PVDD, D66, SP, CDA and CU send around 3 tweets per member per day on average. Other parties are less active, sending roughly 1 tweet per member per day on average.

There are a number of different methods to quantify a twitterer's influence. Kwak, Lee, Park, and Moon propose measures such as the number of followers, PageRank, and the number of retweets [6]. Cha, Haddadi, Benevenuto, and Gummadi use the number of mentions instead of PageRank [4], and Bakshy, Hofman, Mason, and Watts measure influence in terms of the size of the diffusion tree [3]. Unfortunately our data contains only tweets by politicians, limiting the possible measures to the number of followers and the number of retweets. The PageRank measure has been found to be very similar to the number of followers [6], and the number of retweets has been shown to correlate strongly with the number of mentions [4].

We examined the influence of both individual politicians (Table 2) and their parties as a whole. To make party influence comparable across parties, we normalized by the number of active twitterers in each party. We found that the number of followers is not a good indicator of the number of retweets. The only politician that occurs in both the top 5 followed and the top 5 retweeted is Geert Wilders (in first position in both). When aggregating over parties (Table 3), the differences between the number of retweets and the number of followers are smaller.
D66 has many followers, but is not retweeted that often, possibly because they interact on a more individual level with other twitterers during their D66 question hours (#vraaghetD66). Other notable shifts are CDA (ranked 10th by followers, 5th by retweets) and GL (ranked 8th by followers, 4th by retweets).

Figure 1: Normalized tweet frequency per party for each day of the week, starting with Monday.

Figure 2: Normalized tweet frequency per party for each hour of the day, starting at midnight.

rank | by retweets                | count | by followers             | count
1    | Geert Wilders (PVV)        | 15864 | Geert Wilders (PVV)      | 282923
2    | Pieter Omtzigt (CDA)       |  6425 | Alexander Pechtold (D66) | 211559
3    | Paul Ulenbelt (SP)         |  5356 | Diederik Samsom (PVDA)   | 117439
4    | Marianne Thieme (PVDD)     |  4639 | Emile Roemer (SP)        |  60256
5    | Liesbeth van Tongeren (GL) |  3546 | Marianne Thieme (PVDD)   |  35129

Table 2: The top 5 most influential politicians according to the absolute number of retweets and the number of followers.

4.1 Hashtag Usage

For the hashtag analysis we divided the dataset into three periods of four weeks each, in order to study temporal behavior. The tweets used in this analysis were sent between the 30th of September and the 22nd of December, 2013. We found that approximately 2200 tags are used within each period, and roughly 1800 of these were not used in the period before, while roughly 320 tags are used consistently across all periods. This common vocabulary consists of party tags, tags of media sources (radio stations, newspapers, etc.), and names of cities, but also some generic topics such as #zorg, #huur, and #begroting.
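The period comparison just described can be sketched as follows. The data layout (`period_tweets` mapping a period label to that period's list of tag sets) and the tag values are our own placeholders, not data from the thesis:

```python
# A sketch of comparing tag vocabularies across periods.
def vocabulary(tweets):
    """All distinct tags used in a list of per-tweet tag sets."""
    return set().union(*tweets) if tweets else set()

# Hypothetical miniature dataset: one list of tag sets per four-week period.
period_tweets = {
    1: [{"vvd", "zorg"}, {"begroting"}],
    2: [{"zorg", "huur"}, {"begroting", "d66"}],
    3: [{"begroting"}, {"zorg"}],
}

vocab = {p: vocabulary(t) for p, t in period_tweets.items()}
new_in_2 = vocab[2] - vocab[1]           # tags not used in the period before
common = vocab[1] & vocab[2] & vocab[3]  # the common vocabulary
# new_in_2 -> {'huur', 'd66'}; common -> {'zorg', 'begroting'}
```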
rank | by retweets | count | by followers | count
1    | PVDD        | 3483  | PVV          | 44194
2    | PVV         | 2602  | D66          | 23374
3    | SP          | 1814  | PVDD         | 22605
4    | GL          | 1377  | SP           | 12136
5    | CDA         | 1281  | SGP          | 11617
6    | D66         | 1228  | CU           |  9693
7    | CU          | 1046  | PVDA         |  9051
8    | SGP         |  692  | GL           |  8513
9    | PVDA        |  325  | 50PLUS       |  6503
10   | 50PLUS      |  296  | CDA          |  6424
11   | VVD         |  249  | VVD          |  3823

Table 3: Parties ranked according to the number of times one of their tweets has been retweeted, and according to their number of followers. Values are normalized by dividing by the number of twitterers per party.

Tags that are part of the common vocabulary are also used the most. When we look at the ten most frequent tags during the first period, we find that the list is dominated by party tags (Table 4). The majority of tags occurs only once (72%), while only a small fraction of the tags is used more than ten times (Figure 3). Note that we ignored casing, i.e., the tags #VVD, #Vvd, and #vvd are considered identical.

number of tweets | 1    | 2   | 3   | 4-10 | >10
number of tags   | 1542 | 308 | 117 | 114  | 48
percentage       | 72%  | 14% | 5%  | 5%   | 2%

Figure 3: Distribution showing how often a tag is (re-)used. Most tags are used only once. Only a small fraction of tags (2%) is used more than ten times.

tag         | frequency
d66         | 157
pvda        | 111
jeugdwet    |  96
cda         |  78
vvd         |  73
tkjeugd     |  66
tweedekamer |  45
penw        |  39
jeugdzorg   |  37
50plus      |  34

Table 4: The ten most frequent tags during the first period.

Roughly 13% of the tweets use more than one tag (Figure 4). Tweets with more than five tags typically include an enumeration of the parties involved:

“Wat ’n ratatouille-#akkoord #VVD #PvdA #D66 #CU #SGP incl. #GL -en #CDA -snippertjes? 5x niks en geeft ouderen geen perspectief. #50PLUS wel” — Norbert Klein, 50PLUS

Tweets with four to five tags include a variety of tags:

“minister #StefBlok : hoge #huur is een keus! http://t.co/MVybDzuThK #huurbeleid #woonlasten #armoede” — Paulus Jansen, SP

“Gesprek met Actal over vermindering #regeldruk. Club verdient massief steun in strijd tegen #controlisme.
In #onderwijs, #bouw, van #EU bijv” — Roelof Bisschop, SGP

number of tags per tweet | 0    | 1    | 2   | 3   | 4  | 5  | 6  | 7  | 8  | 9
number of tweets         | 4493 | 2089 | 768 | 275 | 81 | 23 | 4  | 2  | 1  | 2
percentage               | 58%  | 26%  | 9%  | 3%  | 1% | 0% | 0% | 0% | 0% | 0%

Figure 4: Distribution of the number of tags per tweet.

5 Clustering Hashtags

To get a broad overview of the topics discussed by the Dutch members of parliament we decided to look at hashtags. Since roughly 2200 different tags are used in a four-week period, we need some method to relate tags to one another. Tags that co-occur (that is, are used within the same tweet) have been found to be more semantically similar than tags that do not co-occur [2]. For this reason we grouped hashtags based on their co-occurrence relation.

5.1 Preprocessing

We first converted all tags to lowercase. In theory this could lead to ambiguities, but we did not find it an issue in practice. A significant portion of tags occurs only once (72%, Figure 3); we decided to ignore these tags, as they are not likely to be significant for our purposes, and discarding them greatly reduces computational effort.

The remaining tags were used to produce a co-occurrence graph: two tags are connected if they co-occur at least once, where two tags are said to co-occur when they are used within the same tweet. We define the number of co-occurrences between the tags a and b as:

    co-occ(a, b) = |{t ∈ Tweets : a ∈ tags(t) and b ∈ tags(t)}|

Here tags(t) is the set of tags associated with the tweet t, and Tweets is the analyzed set of tweets.

The resulting co-occurrence graph is disconnected. Across all three periods we found that the graph consists of some small (fewer than 5 tags) connected components and one larger (roughly 380 tags) connected component. Components consisting of a single tag are discarded. We judged that the smaller components already form reasonable clusters by themselves, and that they require no further processing (Table 6 in Appendix A shows the components of the first period).
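The graph construction just described can be sketched as follows. The data layout (a list of tag sets, one per tweet) and the function names are our own, not from the thesis:

```python
from collections import Counter
from itertools import combinations

def co_occurrence(tweets_tags):
    """Count co-occurrences; tweets_tags is a list of tag sets, one per tweet."""
    co_occ = Counter()
    for tags in tweets_tags:
        for a, b in combinations(sorted(tags), 2):
            co_occ[(a, b)] += 1
            co_occ[(b, a)] += 1  # keep the relation symmetric
    return co_occ

def connected_components(co_occ):
    """Group tags into connected components of the co-occurrence graph.

    Tags with no co-occurrences never appear in co_occ, so single-tag
    components are implicitly discarded, matching the preprocessing above.
    """
    neighbours = {}
    for (a, b) in co_occ:
        neighbours.setdefault(a, set()).add(b)
    components, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # iterative depth-first search
            t = stack.pop()
            if t in comp:
                continue
            comp.add(t)
            stack.extend(neighbours[t] - comp)
        seen |= comp
        components.append(comp)
    return components

tweets = [{"vvd", "zorg"}, {"zorg", "huur"}, {"d66", "onderwijs"}]
comps = connected_components(co_occurrence(tweets))
# two components: {'vvd', 'zorg', 'huur'} and {'d66', 'onderwijs'}
```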
5.2 Clustering

To make the larger component interpretable, we clustered the tags in this component. We decided to use spectral clustering [7], because this method focuses on connectivity rather than compactness (in contrast to, for example, k-means). This property seems desirable, since we might want to group together tags that are not directly connected, but share a common connection (that is, both co-occur with some third tag). Each cluster output by the algorithm is a set of tags, which we shall also refer to as a tagset. The input of the algorithm is a similarity matrix and the number of clusters. We use the following notion of similarity between two hashtags a and b (as per [2], but rephrased for notational consistency):

    S(a, b) = (1/2) · ( co-occ(a, b) / Σ_{j ∈ Tags} co-occ(a, j) + co-occ(b, a) / Σ_{j ∈ Tags} co-occ(b, j) )    (1)

We tried to determine a good trade-off between the number of clusters and the cluster quality through the silhouette width [9]. An advantage of the silhouette width is that it does not require a labeling of the data indicating the desired outcome of the clustering algorithm. The silhouette width s of a sample i is defined as follows:

    s(i) = (b(i) − a(i)) / max{a(i), b(i)}

Here a(i) is the average dissimilarity of i to all other data assigned to the same cluster as i, and b(i) is the lowest average dissimilarity of i to some cluster different from the cluster of i. It follows that the silhouette width is a value between 1 and −1, where a value of 1 is the most desirable. A sample i in this case is a hashtag, and the dissimilarity measure used is the complement of the co-occurrence similarity shown in Equation 1.

The silhouette width can also be averaged over a tagset, in which case we call it the average silhouette width. We call the average over all the tagsets that form a clustering the overall average silhouette width.

5.3 Tagset Properties

In this section we define a number of tagset properties.
We examine how these properties relate to a subjective notion of tagset quality in Section 6.

The support of a tagset T is the number of tweets tagged with a (non-strict) superset of the tagset:

    support(T) = |{t ∈ Tweets : T ⊆ tags(t)}|

The weak-support of a tagset T is the number of tweets tagged with at least n tags in T:

    weak-support(T, n) = |{t ∈ Tweets : |T ∩ tags(t)| ≥ n}|

The co-occurrence support of a tagset T is the number of tweets tagged with at least 2 tags in T:

    co-occurrence-support(T) = weak-support(T, 2)

The popularity of a tagset T is the number of tweets in which at least one member of the tagset is used as a tag:

    popularity(T) = weak-support(T, 1)

The normalized co-occurrence support of a tagset T is indicative of the relative strength of the co-occurrence relation between the tags in T:

    normalized-co-occurrence-support(T) = co-occurrence-support(T) / popularity(T)

An alternative measure for the relative strength of the co-occurrence relation between the tags in a tagset T is the average similarity between the tags in the tagset:

    average-inter-tagset-similarity(T) = ( Σ_{(a,b) ∈ T×T} S(a, b) ) / |T × T|

Antenucci, Handy, Modi, and Tinkerhess identified that there is a certain degree of noise when it comes to co-occurrence, and suggest two metrics that help reduce the noise [2]. We adapted and rephrased these below.

If a tag t accounts for less than 5% of the co-occurrences of a tag t′, we might consider the link between t and t′ insignificant. We define:

    accounts-for(t, t′) = co-occ(t, t′) / Σ_{t̄ ∈ Tags} co-occ(t, t̄)

Within a tagset T we can then look at the strongest, weakest, and average link strength:

    max-accounts-for(T) = max{accounts-for(t, t′) : (t, t′) ∈ T × T}
    min-accounts-for(T) = min{accounts-for(t, t′) : (t, t′) ∈ T × T}
    mean-accounts-for(T) = ( Σ_{(t,t′) ∈ T×T} accounts-for(t, t′) ) / |T × T|

Another form of noise can occur when a tag co-occurs with several different and unrelated tags.
For example, #zzp co-occurs with #vvd, and #vvd also co-occurs with #jeugdzorg, while #zzp and #jeugdzorg are unrelated. We could thus say that #zzp and #jeugdzorg do not overlap very much (only via #vvd). To formalize the notion of overlap, we first define CO(t, t′) as the set of tags that co-occur with both t and t′. We then define the overlap between t and t′ as follows:

    overlap(t, t′) = ( Σ_{t̄ ∈ CO(t, t′)} co-occ(t, t̄) ) / ( Σ_{t̂ ∈ Tags} co-occ(t, t̂) )

This relationship is not symmetric, and is a measure between individual tags. We can extend the measure to tagsets in the same way as we did for accounts-for.

6 Experiments

We clustered the larger connected component of each period using spectral clustering. To decide on a good number of clusters we ran the algorithm multiple times, each time with a different number of clusters. For each result we calculated the overall average silhouette width, and chose the number of clusters that led to the maximum overall average silhouette width. We found the optimal number of clusters to be within the 150 to 180 range (Figure 5).

6.1 Assessing and Predicting Quality

To evaluate the quality of the resulting tagsets, we manually annotated each tagset produced by the clusterings that maximized the silhouette width for the first two periods with a subjective quality score between 0 and 5 (inclusive). The distributions of the ratings are shown in Figure 6. The large difference in the shapes of the distributions might be due to a difference in the produced tagsets, or to the subjectivity of the rating method.

Figure 5: Trade-off between the number of clusters and cluster quality. The overall average silhouette width is preferred to the average inter-cluster similarity, as the latter fails to take into account the number of clusters.
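Since the thesis cites scikit-learn [7], the model-selection loop just described presumably looked roughly like the sketch below. The function name and candidate range are our own, and the similarity matrix S is assumed to be already built from Equation 1:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def best_clustering(S, candidate_ks, random_state=0):
    """Pick the number of clusters maximizing the overall silhouette width.

    S is a symmetric similarity matrix; the dissimilarity used for the
    silhouette is its complement, as in the thesis.
    """
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)  # silhouette expects zero self-distance
    best = (None, -1.0, None)
    for k in candidate_ks:
        labels = SpectralClustering(
            n_clusters=k, affinity="precomputed", random_state=random_state
        ).fit_predict(S)
        width = silhouette_score(D, labels, metric="precomputed")
        if width > best[1]:
            best = (k, width, labels)
    return best  # (best k, its silhouette width, its labels)
```

On a toy block-structured similarity matrix this recovers the two obvious groups; on real data one would sweep `candidate_ks` over a range such as 10 to 350, as in Figure 5.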
We first examined the relationship between the subjective rating on the one hand and the tagset properties listed in Section 5.3, the average silhouette width, and the cluster size on the other, but found no direct relationship: the maximum Pearson correlation coefficient is 0.3, which is the correlation between the rating and the normalized co-occurrence support. The scatter plots of the rating against the properties also show no obvious pattern (Figure 7 in Appendix A).

We attempted to automate the rating of tagsets by using the previously mentioned tagset properties as input features for a machine learning model. We had the most success with a regularized linear regression model. This model produced a root mean squared error (RMSE) of 1.95 over 5 stratified folds, which is only slightly better than predicting the mean, which gives an RMSE of 2.06. The most contributing features were the average silhouette width (coefficient of 3.4), the maximum overlap (1.7), the normalized support (1.7), and the minimum accounts-for (−1.2). All other features contributed significantly less. When we applied the model trained on the first period to the tagsets produced during the second period, the RMSE was 3.83.

Since we were unable to confidently predict the subjective rating of a tagset, we instead tried to filter the results: all tagsets rated between 3 and 5 (inclusive) should be kept (positive examples), and all tagsets rated between 0 and 2 should be filtered out (negative examples). We then trained both a support vector machine and a random forest, and tuned the parameters using stratified 5-fold cross-validation. We found that the random forest was unable to outperform the naive strategy of always predicting positive. The support vector machine either matched this performance on the first period (by always predicting positive), or slightly outperformed the naive strategy on the second period (Table 5).
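The tagset properties that serve as input features above can be transcribed directly from their definitions in Section 5.3. This is our own transcription; `tweets` is assumed to be a list of tag sets and `co_occ` a symmetric pair-count mapping, both hypothetical layouts:

```python
from collections import Counter

def weak_support(T, tweets, n):
    return sum(1 for tags in tweets if len(T & tags) >= n)

def support(T, tweets):
    return sum(1 for tags in tweets if T <= tags)

def popularity(T, tweets):
    return weak_support(T, tweets, 1)

def co_occurrence_support(T, tweets):
    return weak_support(T, tweets, 2)

def normalized_co_occurrence_support(T, tweets):
    return co_occurrence_support(T, tweets) / popularity(T, tweets)

def total_co_occ(t, co_occ):
    """All co-occurrences of tag t, i.e. the sum over co-occ(t, ·)."""
    return sum(c for (a, _), c in co_occ.items() if a == t)

def accounts_for(t, t2, co_occ):
    return co_occ.get((t, t2), 0) / total_co_occ(t, co_occ)

def overlap(t, t2, co_occ):
    neighbours = lambda x: {b for (a, b) in co_occ if a == x}
    shared = neighbours(t) & neighbours(t2)  # CO(t, t2)
    return sum(co_occ[(t, s)] for s in shared) / total_co_occ(t, co_occ)

# Hypothetical miniature data for illustration:
tweets = [{"akkoord", "pensioen"}, {"pensioen", "zzp"}, {"zzp"},
          {"akkoord", "pensioen", "zzp"}]
T = {"akkoord", "pensioen", "zzp"}
# support(T, tweets) = 1; popularity(T, tweets) = 4;
# co_occurrence_support(T, tweets) = 3

co_occ = Counter({("zzp", "vvd"): 2, ("vvd", "zzp"): 2,
                  ("zzp", "akkoord"): 1, ("akkoord", "zzp"): 1,
                  ("vvd", "jeugdzorg"): 3, ("jeugdzorg", "vvd"): 3})
# accounts_for("zzp", "vvd", co_occ) = 2/3
# overlap("zzp", "jeugdzorg", co_occ) = 2/3 (via the shared neighbour #vvd)
```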
Figure 6: Subjective rating distributions of the tagsets formed by spectral clustering for (a) the first period and (b) the second period. It is unclear whether the large difference in the shapes of the distributions is due to a difference in the produced tagsets, or to the subjectivity of the rating method.

                   | actual negative | actual positive
predicted negative | 35              | 28
predicted positive | 38              | 73

Table 5: Predictions of tagset quality by a support vector machine, aggregated over 5 stratified folds. The predictions are slightly better than naively predicting the dominant category.

6.2 Resulting Tagsets

In this section we show some tagsets of varying quality:

subjective quality | tagset
5 | #lampedusa, #vluchtelingen
4 | #assen, #jwfkazerne, #veldzicht
3 | #akkoord, #pensioen, #zzp
2 | #borssele, #deventer, #kernenergie
1 | #kunst, #nieuwsuur, #sp
0 | #doneerd66, #luchtvaart, #mkb, #progressivealliance, #route66, #vkopinie, #vnva

Visit http://www.liacs.nl/~jkalmeij/twitter-project/ for a more elaborate overview, including the tweets corresponding to each tagset.

7 Conclusion

This work analyzes tweets sent by 144 members of Dutch parliament during a timespan of roughly three months. We showed that not all parties are equally active, and that the number of followers and the number of retweets are not good indicators of one another. The only politician that consistently appeared in the top 5 for both metrics was Geert Wilders. We broke our collection of tweets into three periods of four weeks, and found that most (72%) tags are used only once, and that the majority of tweets (84%) contained fewer than two tags. The tags used varied considerably between these periods: only 18% of the tags were constantly used. This constant collection is primarily made up of party tags, tags of media sources, names of cities, and some generic topics.
We assumed that a tweet's hashtags approximate the tweet's topic, and used hashtag clustering based on hashtag co-occurrence to identify topics. The clustering method used is spectral clustering, which takes as input a similarity matrix and the number of clusters, and produces sets of tags, which we also refer to as tagsets. To determine the number of tagsets, we tried to find a good trade-off between the number of tagsets and tagset quality by using the overall average silhouette width.

We manually inspected and rated the quality of the tagsets produced by the spectral clustering method on a scale between 0 and 5 (inclusive). We found that at least 50% of the tagsets were satisfactory (rated 4 or higher). However, the subjective rating showed great variation in distribution across different periods, and we were unable to train statistical models that predict the subjective rating significantly better than naive methods. From this we conclude that we either have too little data, or the subjective rating is too noisy.

During manual inspection we noticed that some tagsets included tags a and b that were not directly connected (that is, a and b were never used within the same tweet). Instead, a and b both co-occurred with some other tag c (which sometimes was not even present in the tagset). Since tags such as party tags or news media tags connect many unrelated topics, this sometimes led to undesirable tagsets. This effect might be reduced by using a different clustering method, or by filtering out such hub tags.

We contributed some insight into how Dutch politicians use Twitter. The produced clusters are of reasonable quality. Ideally we would like to link the clusters to broader topics, to truly summarize the political discussions on Twitter. We suggest this can only be done by using other data sources, since co-occurring tags are too rare and the Dutch members of parliament form a small population.
7.1 Future Work

This work could be expanded to perform automated content coding of the parliamentary Twitter stream. Hillard, Purpura, and Wilkerson have shown that supervised learning approaches can achieve automated classification that is just as reliable as classification by humans [5]. A high-quality automated coding of the Twitter stream opens up many avenues for political science research. One could, for example, look at the relationship between the topics discussed in the parliamentary minutes and those in the parliamentary Twitter stream. The development of topics through time could also be examined. This would give insight into the relationship between certain classes of topics: does one class suppress another, or do they often trend together? This would require the topics to be examined on a more abstract level, since we have shown that there is little overlap in the hashtags used per period (and thus in the resulting topics).

8 Acknowledgments

I am grateful to Walter Kosters and Christoph Stettina for their supervision and discussions. I thank Gerard Breeman and Arco Timmermans for their feedback and suggestions.

References

[1] J. Allan, R. Papka, and V. Lavrenko. "On-line new event detection and tracking". In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). 1998, pp. 37–45. doi: 10.1145/290941.290954.
[2] D. Antenucci, G. Handy, A. Modi, and M. Tinkerhess. "Classification of tweets via clustering of hashtags". 2011.
[3] E. Bakshy, J.M. Hofman, W.A. Mason, and D.J. Watts. "Everyone's an influencer: Quantifying influence on Twitter". In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11). 2011, pp. 65–74. doi: 10.1145/1935826.1935845.
[4] M. Cha, H. Haddadi, F. Benevenuto, and K.P. Gummadi. "Measuring user influence in Twitter: The million follower fallacy". In: Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM). 2010.
[5] D. Hillard, S. Purpura, and J. Wilkerson. "Computer assisted topic classification for mixed methods social science research". In: Journal of Information Technology & Politics 4.4 (2008), pp. 31–46. doi: 10.1080/19331680801975367.
[6] H. Kwak, C. Lee, H. Park, and S. Moon. "What is Twitter, a social network or a news media?" In: Proceedings of the 19th International Conference on World Wide Web (WWW '10). 2010, pp. 591–600. doi: 10.1145/1772690.1772751.
[7] F. Pedregosa et al. "Scikit-learn: Machine learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[8] S. Petrović, M. Osborne, and V. Lavrenko. "Streaming first story detection with application to Twitter". In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). 2010, pp. 181–189.
[9] P.J. Rousseeuw. "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65. doi: 10.1016/0377-0427(87)90125-7.
[10] M.T. Schäfer, N. Overheul, and T. Boeschoten. Politieke communicatie in 140 tekens [Political communication in 140 characters; in Dutch]. June 2011.
[11] A. Tumasjan, T.O. Sprenger, P.G. Sandner, and I.M. Welpe. "Predicting elections with Twitter: What 140 characters reveal about political sentiment". In: Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM). 2010, pp. 178–185.
A Results

Figure 7: Relationship between the subjective rating and the different tagset properties introduced in Section 5.3, shown as one scatter plot per property with the Pearson correlation coefficient on top. The properties shown include max-overlap, popularity, normalized weak-support, min-accounts-for, mean-overlap, mean within-tagset similarity, mean silhouette width, weak-support, size, mean-accounts-for, min-overlap, and max-accounts-for.
tagset | sup | wsup
a2tunnel, rimec | 1 | 1
boek, lek | 2 | 2
award, toegankelijkheid | 2 | 2
bollen5, teylingen | 1 | 1
arjoklamer, demoed | 2 | 2
ddw, brabant | 1 | 1
kindermishandeling, cinekid | 1 | 1
cobouw, tilburg | 1 | 1
dutchcareercup, adotwe | 2 | 2
zembla, vechtscheidingen | 1 | 1
duurzametop100, nr21 | 2 | 2
vaola, stemmingen | 2 | 2
feuten, film | 2 | 2
mtcs, aoduurzaamhout | 1 | 1
geheim, ohohdenhaag | 2 | 2
nunl, fnvmetaal | 1 | 1
sneek, veemarkthallen | 2 | 2
pietitie, premiejagen | 1 | 1
politieacademie, oliezwendel | 2 | 2
nobelprijs, opcw | 1 | 1
stella, wsc13 | 2 | 2
lienden, kulturhus | 1 | 1
jeugddebat, pizzasessie | 2 | 2
idigf, igf2013 | 2 | 2
maatwerk, hetkanwel | 1 | 1
ldon, prodemos | 1 | 1
gameon, ehl, hockey | 0 | 2
russian, dutch, lavrov | 0 | 3
ddw13, ddw2013, eindhoven | 0 | 3
ovchipcard, happylines, metro | 0 | 2
hoppenbrouwers, techniekpact, pnb | 1 | 2
sergio, kijktip, vn | 1 | 1
nijmegen, zeeland, tbs | 1 | 1
telegraaf, padua, apollonetwerk | 0 | 2
eye, jaofnee, donorweek | 1 | 2
vreemdelingenbeleid, borgen, nyborg | 0 | 3
werkgelegenheid, ao, integratie, economie | 0 | 2
wereldvoedseldag2013, discosoep, geenboergeenv... | 0 | 4

Table 6: When building a graph based on the co-occurrence between tags used during the first period, we found that the graph consists of a number of connected components. Each row depicts the tags of one component, together with its support (sup) and weak-support (wsup). In addition to these smaller components, the graph contains a single, much larger component.