Hashtag Clustering to Summarize the Topics
Discussed by Dutch Members of Parliament
Jan Kalmeijer
February 27, 2014
Abstract
Politicians are adopting social media, creating an avenue for social
science research. However, social media data is largely unstructured. This
work analyzes tweets sent by 144 members of Dutch parliament over a
period of three months. We cluster hashtags based on co-occurrence to
identify discussed topics. This approach produces clusters of reasonable
quality, but the identified topics are perhaps too specific to be used in
political science research.
1 Introduction
In the past, socio-political researchers have analyzed the political agenda by
content coding1 official documents, such as the Queen’s speech and the European
Council Conclusions. As social media grow, data about the political agendas
becomes available in new forms. Politicians can now express their views and
agendas in real-time through communication channels such as Twitter.
This new form of data has properties that make it interesting to use alongside
the more formal documents that are currently used in political science research:
the data is real-time and allows politicians to broadcast a personal view. These
digital communication channels are also less troubled by size constraints, which
means there is room for more personal issues. This could allow for a more
fine-grained analysis than is possible with formal documents.
The downsides of social media data are the volume and the quality. In this
study we analyze tweets sent by Dutch members of parliament. As all members
combined send a total of only roughly 300 tweets2 per day on average, manual
examination of these tweets is doable but expensive and tedious. Instead we
1 Content coding refers to determining which category (e.g., macroeconomics and taxes,
environment, education and culture, etc.) best characterizes a piece of content (e.g., a sentence
or paragraph) in an official document.
2 See Section 2 for a brief description of the Twitter platform and terminology.
try to summarize the tweets by clustering them based on their hashtags. Using
hashtag based clustering to summarize the tweets allows us to get an overview
of the topics discussed in the tweets, without the need for manual annotation.
A challenge when using clustering is the evaluation of the produced clusters. We
propose a number of properties, specific to the problem of clustering hashtags,
that we hypothesize can aid in assessing the quality of the clusters. We try to
verify this claim by examining the relationship between the cluster properties
and our subjective notion of cluster quality.
We briefly discuss the Twitter platform in Section 2, and related work in Section 3. In Section 4 we contribute recent statistics on how Dutch members
of parliament use Twitter. We explain our approach to clustering hashtags in
Section 5, discuss the experimental results in Section 6, and summarize and
conclude in Section 7.
2 Twitter platform
Twitter is a microblogging platform that allows users (also called twitterers or
tweeters) to post messages (tweets) of up to 140 characters. In addition to text
and links, a tweet can contain hashtags and mentions. The hashtag symbol ‘#’
is used as a prefix of a sequence of characters to mark keywords or topics in a
tweet. The mention symbol ‘@’ is used as a prefix of user names (also known
as twitter handles) to automatically notify users that they are mentioned in a
tweet.
Users can personalize their Twitter experience by following other users. If user A
follows user B, all tweets and updates of user B are shown on user A’s Twitter
homepage. Following is not the only method of aggregating tweets. Another
method is to create a list: a user-created group of twitterers. When the list
is viewed, only tweets from users on that list are shown. Some users
actively try to get other users to follow them, because more people will read
their tweets, presumably making them more influential.
Another method of reaching a large audience is through retweeting. When a
tweet is posted it is broadcast to all the followers of the tweet’s author. These
followers can then retweet (repost) this tweet, which shares it with their followers. A retweet quotes the original tweet’s text, and can contain additional
information.
Twitter tries to algorithmically determine topics that are popular on Twitter.
These topics are called trends and are personalized for each user based on their
location and on who they follow. Tweeting about trending topics is another
method of getting noticed, as the trending topics are shown in a separate column
on the Twitter page.
The Twitter API enables automated collection of tweets under two restrictions:
the number of tweets that can be collected is rate limited, and tweets that are
older than a few days are not available. In addition to the tweet’s text the
Twitter API provides more information such as the tweet’s author, the tweet’s
creation time, and the number of times a tweet has been retweeted. If a tweet
is a retweet the same information is provided for the original tweet.
3 Related Work
Schäfer, Overheul, and Boeschoten examined the use of Twitter by Dutch political
figures in 2011, and found four different categories of Twitter users within
this demographic [10]. They also found that participation was between 60 and
70 percent. Members of Dutch parliament have since embraced Twitter, and
participation is now up to 85 percent.
Tweets have also been used by Tumasjan, Sprenger, Sandner, and Welpe to
predict the German federal election results [11]. They found that the number
of mentions of a political party (or of a prominent member of said party) was
a reasonable predictor of election results. The approach of counting Twitter
mentions produced a mean average error of 1.65% when compared to the actual
election outcome. In comparison, the worst and best poll errors were 1.48% and
0.80% respectively.
Approaches to the problem of cluster tracking are suggested in [8, 1]. Allan,
Papka, and Lavrenko state the problem of first story detection in news stories [1].
They represent old stories as clusters, and use a threshold on the distance between the old story clusters and the new story to decide whether the new story
is actually novel or has been seen before. Petrović, Osborne, and Lavrenko take
a similar approach but with application to Twitter [8].
4 Data
The dataset is a collection of tweets tweeted by members of the Dutch parliament spanning from the end of September to the end of December, 2013. All
tweets were collected from a Twitter list3 of all Twitter accounts of members of
parliament. This list contains 144 members, of which 133 sent at least one tweet
during the mentioned period. This means more than 85 percent of the members
of parliament participate on Twitter. In April 2011 this was only between 60
and 70 percent [10]. Eight of these 133 members are not actually members of
parliament. They either resigned (e.g., Henk Krol), or are on leave (e.g., Sadet
Karabulut). See Table 1 for a summary on how well every party is represented
on Twitter.
3 https://twitter.com/RTLNieuwsnl/tweede-kamer
                          number of tweets
party    seats  twitterers   total   average
VVD        41       38        5423     142
PVDA       38       34        5516     162
PVV        15        6         471      78
SP         15       11        3170     288
CDA        13       14        4006     286
D66        12       12        3611     300
CU          5        5        1318     263
GL          4        5        1743     348
SGP         3        3         372     124
PVDD        2        2         704     352
50PLUS      2        2         255     127

Table 1: Participation on Twitter for each party. The number of seats is the
number of seats in parliament. The number of twitterers is the number of party
members that sent at least one tweet. A twitterer is considered an extra if they
are no longer an official member of parliament, but were at some point during
the collection of the data.
Using the timestamp of each tweet, we can look at the weekly (Figure 1) and
daily (Figure 2) tweet frequency patterns. The most notable effect is that parties
generally tweet less during the weekends, and parties with a religious background
tweet less or not at all on Sundays.
The members of parliament send a total average of 298 tweets per day, but not
everyone participates equally; GL, PVDD, CDA, and SP send around
3 tweets per member per day on average. Other parties are less active, sending
roughly 1 tweet per member per day on average.
There are a number of different methods to quantify a twitterer’s influence.
Kwak, Lee, Park, and Moon propose measures such as the number of followers, PageRank, and the number of retweets [6]. Cha, Haddadi, Benevenuto, and
Gummadi use the number of mentions instead of PageRank [4], and Bakshy,
Hofman, Mason, and Watts measure influence in terms of the size of the diffusion tree [3]. Unfortunately our data contains only tweets by politicians, limiting
the possible measures to the number of followers and the number of retweets.
The PageRank measure has been found to be very similar to the number of
followers [6], and the number of retweets has been shown to correlate strongly
with the number of mentions [4].
We examined the influence of both individual politicians (Table 2) and their
parties as a whole. To make party influence comparable across parties we normalized for the number of active twitterers in each party. We found that the
number of followers is not a good indicator for the number of retweets. The only
politician that occurs in both the top 5 followed and top 5 retweeted is Geert
Wilders (both in first position).
When aggregating over parties (Table 3), the differences between the rankings by
retweets and by followers are smaller. D66 has many followers, but is not
retweeted that often. Possibly because they interact on a more individual level
with other twitterers during their D66 question hours (#vraaghetD66). Other
notable shifts are CDA (ranked 10th by followers, 5th by retweets), and GL
(ranked 8th by followers, 4th by retweets).
Figure 1: Normalized tweet frequency per party for each day of the week,
starting with Monday.

Figure 2: Normalized tweet frequency per party for each hour of the day,
starting at midnight.
       by retweets                               by followers
rank   name and party              count         name and party              count
1      Geert Wilders (PVV)         15864         Geert Wilders (PVV)        282923
2      Pieter Omtzigt (CDA)         6425         Alexander Pechtold (D66)   211559
3      Paul Ulenbelt (SP)           5356         Diederik Samsom (PVDA)     117439
4      Marianne Thieme (PVDD)       4639         Emile Roemer (SP)           60256
5      Liesbeth van Tongeren (GL)   3546         Marianne Thieme (PVDD)      35129
Table 2: The top 5 most influential politicians according to the absolute number
of retweets and number of followers.
4.1 Hashtag Usage
During hashtag analysis we divided the dataset into three periods of four weeks,
in order to study temporal behavior. The tweets used in this analysis were sent
between the 30th of September and the 22nd of December, 2013.
We found that approximately 2200 tags are used within each period, and roughly
1800 of these were not used in the period before, while roughly 320 tags are used
consistently across all periods. This common vocabulary consists of party tags,
tags of media sources (radio stations, newspapers, etc.), and names of cities, but
also some generic topics such as #zorg, #huur, and #begroting.
       by retweets          by followers
rank   party      count     party      count
1      PVDD        3483     PVV        44194
2      PVV         2602     D66        23374
3      SP          1814     PVDD       22605
4      GL          1377     SP         12136
5      CDA         1281     SGP        11617
6      D66         1228     CU          9693
7      CU          1046     PVDA        9051
8      SGP          692     GL          8513
9      PVDA         325     50PLUS      6503
10     50PLUS       296     CDA         6424
11     VVD          249     VVD         3823
Table 3: Parties ranked according to the number of times one of their tweets
has been retweeted, and their number of followers. Values are normalized by
dividing by the number of twitterers per party.
Tags that are part of the common vocabulary are also used the most. When
we look at the ten most frequent tags during the first period, we find that
this is mainly dominated by party tags (Table 4). The majority of tags occurs
only once (72%), while only a fraction of the tags is used more than ten times
(Figure 3). Note that we ignored casing, i.e., the tags #VVD, #Vvd, and #vvd are
considered identical.
Figure 3: Distribution showing how often a tag is (re-)used. Most tags (1542,
72%) are used only once; 308 tags (14%) are used twice, 117 (5%) three times,
and 114 (5%) four to ten times. Only a small fraction of tags (48, 2%) is used
more than ten times.

tag           frequency
d66               157
pvda              111
jeugdwet           96
cda                78
vvd                73
tkjeugd            66
tweedekamer        45
penw               39
jeugdzorg          37
50plus             34

Table 4: The ten most frequent tags during the first period.
Roughly 13% of the tweets use more than one tag (Figure 4). Tweets with more
than five tags typically include an enumeration of the parties involved:
“Wat ’n ratatouille-#akkoord #VVD #PvdA #D66 #CU #SGP
incl. #GL -en #CDA -snippertjes? 5x niks en geeft ouderen geen
perspectief. #50PLUS wel” — Norbert Klein, 50PLUS
Tweets with four to five tags include a variety of tags:
“minister #StefBlok : hoge #huur is een keus! http://t.co/MVybDzuThK
#huurbeleid #woonlasten #armoede” — Paulus Jansen, SP
“Gesprek met Actal over vermindering #regeldruk. Club verdient
massief steun in strijd tegen #controlisme. In #onderwijs, #bouw,
van #EU bijv” – Roelof Bisschop, SGP
Figure 4: Distribution of the number of tags per tweet. Of all tweets, 4493
(58%) contain no tags, 2089 (26%) contain one tag, 768 (9%) two, 275 (3%)
three, 81 (1%) four, and 23, 4, 2, 1, and 2 tweets contain five, six, seven,
eight, and nine tags, respectively.
5 Clustering Hashtags
To get a broad overview of the topics discussed by the Dutch members of
parliament we decided to look at hashtags. Since roughly 2200 different tags are
used in a 4 week period, we need some method to relate tags to one another. Tags
that co-occur (that is, are used within the same tweet) have been found to be
more semantically similar than tags that do not co-occur [2]. For this reason we
tried to group hashtags based on their co-occurrence relation.
5.1 Preprocessing
We first converted all tags to lowercase. In theory this could lead to ambiguities,
but we did not find it an issue in practice. A significant portion of tags occurs
only once (72%, Figure 3) and we decided to ignore these tags, as they are not
likely to be significant for our purposes, and this greatly reduces computational
effort.
7
The remaining tags were used to produce a co-occurrence graph: two tags are
connected if they co-occur at least once. Two tags are said to co-occur when
they are used within the same tweet. We define the number of co-occurrences
between the tags a and b as:
co-occ(a, b) = |{t ∈ Tweets such that a, b ∈ tags(t)}|
Here tags(t) is the set of tags associated with the tweet t, and Tweets is the
analyzed set of tweets.
The resulting co-occurrence graph is disconnected. Across all three periods we
found that the graph consists of some small (<5 tags) connected components,
and a larger (roughly 380 tags) connected component. Components consisting
of a single tag are discarded. We judged that the smaller components themselves
already form reasonable clusters, and that they require no further processing
(Table 6 in Appendix A shows the components of the first period).
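As a concrete sketch, the preprocessing and graph construction described above could be implemented as follows in Python (the language is our choice, following the scikit-learn reference [7]; the function names and the tiny example tweets are ours, not part of the text):

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_graph(tweets):
    """Build tag co-occurrence counts from a list of per-tweet tag sets."""
    tweets = [{t.lower() for t in tags} for tags in tweets]  # ignore casing
    freq = Counter(t for tags in tweets for t in tags)
    kept = {t for t, c in freq.items() if c > 1}             # drop single-use tags

    cooc = Counter()  # co-occ(a, b): number of tweets containing both a and b
    for tags in tweets:
        for a, b in combinations(sorted(tags & kept), 2):
            cooc[a, b] += 1

    adj = defaultdict(set)  # two tags are connected if they co-occur at least once
    for a, b in cooc:
        adj[a].add(b)
        adj[b].add(a)
    return cooc, adj

def connected_components(adj):
    """Connected components of the co-occurrence graph (depth-first search).

    Tags without any co-occurrence never enter `adj`, so single-tag
    components are discarded implicitly, as in the text."""
    seen, components = set(), []
    for start in adj:
        if start not in seen:
            comp, stack = set(), [start]
            while stack:
                t = stack.pop()
                if t not in comp:
                    comp.add(t)
                    stack.extend(adj[t] - comp)
            seen |= comp
            components.append(comp)
    return components

cooc, adj = cooccurrence_graph(
    [{"VVD", "zzp"}, {"vvd", "jeugdzorg"}, {"vvd", "zzp"}, {"zorg"}])
```

In this toy example jeugdzorg and zorg occur only once and are dropped, leaving a single component {vvd, zzp} with co-occ(vvd, zzp) = 2.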
5.2 Clustering
To make the larger component interpretable, we tried to cluster the tags in
this component. We decided to use spectral clustering [7], because this method
focuses on connectivity rather than compactness (in contrast to, for example,
k-means). This property seems desirable, since we might want to group together
tags that are not directly connected, but share a common connection (that is,
both co-occur with some other tag).
Each cluster produced by the algorithm is a set of tags, which we shall also refer
to as a tagset. The input of the algorithm is a similarity matrix and the number
of clusters. We use the following notion of similarity between two hashtags a
and b (as per [2], but rephrased for notational consistency):
S(a, b) = (1/2) ( co-occ(a, b) / Σ_{j ∈ Tags} co-occ(a, j)
                + co-occ(b, a) / Σ_{j ∈ Tags} co-occ(b, j) )        (1)
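A sketch of Equation (1) in Python, assuming pairwise co-occurrence counts keyed by unordered tag pairs (the representation and the function name are ours):

```python
from collections import Counter, defaultdict

def similarity(cooc):
    """Symmetrized co-occurrence similarity S(a, b) of Equation (1).

    `cooc` maps unordered tag pairs (a, b) to the number of tweets in
    which the pair co-occurs."""
    total = Counter()  # sum over j of co-occ(t, j), for each tag t
    for (a, b), n in cooc.items():
        total[a] += n
        total[b] += n

    S = defaultdict(float)  # defaults to 0 for pairs that never co-occur
    for (a, b), n in cooc.items():
        S[a, b] = S[b, a] = 0.5 * (n / total[a] + n / total[b])
    return S
```

For instance, with two pairs co-occ(a, b) = co-occ(b, c) = 2, each pair gets S = 0.5 · (2/2 + 2/4) = 0.75.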
We tried to determine a good trade-off between the number of clusters and the
cluster quality through the silhouette width [9]. An advantage of the silhouette
width is that it does not require a labeling of the data indicating the desired
outcome of the clustering algorithm. The silhouette width s of a sample i is
defined as follows:
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
Here a(i) is the average dissimilarity of i to all other data assigned to the
same cluster as i, and b(i) is the lowest average dissimilarity of i to any
cluster different from the cluster of i (that is, the dissimilarity to the
nearest neighboring cluster). It follows that the silhouette width is a
value between −1 and 1, where a value of 1 is the most desirable. A sample i
in this case is a hashtag, and the used dissimilarity measure is the complement
of the co-occurrence similarity shown in Equation (1). The silhouette width can
also be averaged over a tagset, in which case we call it the average silhouette
width. We call the average over the tagsets that form a clustering the overall
average silhouette width.
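The clustering step and the silhouette-based model selection can be sketched with scikit-learn [7]; the helper names and the toy similarity matrix below are ours, and in the actual pipeline S would be the matrix of Equation (1):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def cluster_tags(S, n_clusters, seed=0):
    """Spectral clustering on a precomputed similarity matrix S."""
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(S)

def best_clustering(S, candidates):
    """Choose the number of clusters that maximizes the overall average
    silhouette width, computed on the dissimilarity 1 - S."""
    D = 1.0 - S                # complement of the co-occurrence similarity
    np.fill_diagonal(D, 0.0)   # a tag is not dissimilar to itself
    best = None
    for k in candidates:
        labels = cluster_tags(S, k)
        width = silhouette_score(D, labels, metric="precomputed")
        if best is None or width > best[0]:
            best = (width, k, labels)
    return best

# Toy example: two obvious groups of three tags each.
S = np.full((6, 6), 0.05)
S[:3, :3] = S[3:, 3:] = 0.9
np.fill_diagonal(S, 1.0)
width, k, labels = best_clustering(S, candidates=[2, 3])
```

Spectral clustering accepts the similarity matrix directly via `affinity="precomputed"`, while the silhouette width is computed on the complementary dissimilarities, mirroring the setup described in the text.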
5.3 Tagset Properties
In this section we define a number of tagset properties. We examine how these
properties relate to a subjective notion of tagset quality in Section 6.
The support of a tagset T is the number of tweets tagged with a non-strict
superset of the tagset:
support(T ) = |{t ∈ Tweets such that T ⊆ tags(t)}|
The weak-support of a tagset T is the number of tweets tagged with at least n
tags in T :
weak-support(T, n) = |{t ∈ Tweets such that |T ∩ tags(t)| ≥ n}|
The co-occurrence support of a tagset T is the number of tweets tagged with at
least 2 tags in T :
co-occurrence support(T ) = weak-support(T, 2)
The popularity of a tagset T is the number of tweets in which a member of the
tagset is used as a tag:
popularity(T ) = weak-support(T, 1)
The normalized co-occurrence support of a tagset T is indicative of the relative
strength of the co-occurrence relation between the tags in T:

normalized co-occurrence support(T) = co-occurrence support(T) / popularity(T)
An alternative measure for the relative strength of the co-occurrence relation
between the tags in a tagset T is the average similarity between the tags in the
tagset:
average inter tagset similarity(T) = ( Σ_{(a,b) ∈ T×T} S(a, b) ) / |T × T|
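With tweets represented as sets of tags, these definitions translate almost literally into Python (a sketch; the function names are ours, and S(a, a) is taken to be 0 since a tag does not co-occur with itself):

```python
def support(tagset, tweets):
    """Tweets whose tags form a non-strict superset of the tagset."""
    return sum(tagset <= tags for tags in tweets)

def weak_support(tagset, tweets, n):
    """Tweets that use at least n tags from the tagset."""
    return sum(len(tagset & tags) >= n for tags in tweets)

def cooccurrence_support(tagset, tweets):
    return weak_support(tagset, tweets, 2)

def popularity(tagset, tweets):
    return weak_support(tagset, tweets, 1)

def normalized_cooccurrence_support(tagset, tweets):
    return cooccurrence_support(tagset, tweets) / popularity(tagset, tweets)

def average_inter_tagset_similarity(tagset, S):
    """Average similarity over T x T, with S read from a dict of pairwise
    similarities (0.0 for pairs without an entry, including a = b)."""
    pairs = [(a, b) for a in tagset for b in tagset]
    return sum(S.get(p, 0.0) for p in pairs) / len(pairs)
```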
Antenucci, Handy, Modi, and Tinkerhess identified that there is a certain degree
of noise when it comes to co-occurrence, and suggest two metrics that help
reduce the noise [2]. We adapted and rephrased these below.
If a tag t accounts for less than 5% of the co-occurrences of a tag t′, we might
consider a link between t and t′ insignificant. Within a tagset T we can look at
the strongest, weakest, and average link strength:

accounts-for(t, t′) = co-occ(t, t′) / Σ_{t̄ ∈ Tags} co-occ(t, t̄)

max-accounts-for(T) = max{ accounts-for(t, t′) such that (t, t′) ∈ T × T }

min-accounts-for(T) = min{ accounts-for(t, t′) such that (t, t′) ∈ T × T }

mean-accounts-for(T) = ( Σ_{(t,t′) ∈ T×T} accounts-for(t, t′) ) / |T × T|
Another form of noise can occur when a tag co-occurs with several different and
unrelated tags. For example, #zzp co-occurs with #vvd, and #vvd also co-occurs
with #jeugdzorg, while #zzp and #jeugdzorg are unrelated. We could thus say
that #zzp and #jeugdzorg do not overlap very much (only via #vvd).
To formalize the notion of overlap, we first define CO(t, t′) as the set of
tags that co-occur with both t and t′. We then define the overlap between t and
t′ as follows:

overlap(t, t′) = ( Σ_{t̄ ∈ CO(t,t′)} co-occ(t, t̄) ) / ( Σ_{t̂ ∈ Tags} co-occ(t, t̂) )
This relationship is not symmetric, and is a measure between tags. We can
extend the measure to tagsets in the same way as we did for accounts-for.
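Both families of measures can be computed from the pairwise co-occurrence counts; a sketch (names ours), following the accounts-for formula above, in which co-occ(t, t′) is divided by the total co-occurrences of t:

```python
from collections import Counter, defaultdict

def link_measures(cooc):
    """Build accounts-for and overlap from pairwise co-occurrence counts."""
    total = Counter()             # total co-occurrences per tag
    neigh = defaultdict(Counter)  # co-occ(t, .) for each tag t
    for (a, b), n in cooc.items():
        total[a] += n
        total[b] += n
        neigh[a][b] = neigh[b][a] = n

    def accounts_for(t, u):
        # fraction of t's co-occurrences that involve u
        return neigh[t][u] / total[t]

    def overlap(t, u):
        # fraction of t's co-occurrence mass spent on tags that also
        # co-occur with u, i.e. on the set CO(t, u)
        common = set(neigh[t]) & set(neigh[u])
        return sum(neigh[t][c] for c in common) / total[t]

    return accounts_for, overlap
```

On the #zzp/#vvd/#jeugdzorg example from the text, extended with a hypothetical strong #zzp co-occurrence, the overlap of #zzp and #jeugdzorg is small because it runs only via #vvd.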
6 Experiments
We clustered the larger connected components of each period using spectral
clustering. To decide on a good number of clusters we ran the algorithm multiple
times, but with a different number of clusters. For each result we calculated the
overall average silhouette width, and chose the number of clusters that led to
the maximum overall average silhouette width. We found the optimal number
of clusters to be within the 150 to 180 range (Figure 5).
6.1 Assessing and Predicting Quality
To evaluate the quality of the resulting tagsets, we manually annotated each
tagset produced by the clusterings that maximized the silhouette width for the
first two periods with a subjective quality score between 0 and 5 (inclusive).
The distributions of the ratings are shown in Figure 6. The large difference in
the shapes of the distributions might be due to a difference in the produced
tagsets, or to the subjectivity of the rating method.
Figure 5: Trade-off between the number of clusters and cluster quality, showing
the overall average silhouette width and the average inter cluster similarity
for 10 to 350 clusters. The overall average silhouette width is preferred to the
average inter cluster similarity, as the latter fails to take into account the
number of clusters.
We first examined the relationship between the subjective rating and the tagset
properties listed in Section 5.3, the average silhouette width, and the cluster size,
but found no direct relationship: the maximum Pearson correlation coefficient is
0.3, which is the correlation between the rating and the normalized co-occurrence
support. The scatter plots between the rating and the properties also show no
obvious pattern (Figure 7, page 16).
We attempted to automate the rating of tagsets by using the previously mentioned tagset properties as input features for a machine learning model. We had
most success with a regularized linear regression model. This model produced
a root mean squared error (RMSE) of 1.95 over 5 stratified folds, which is only
slightly better than predicting the mean, which gives an RMSE of 2.06. The
most contributing features were the average silhouette width (coefficient of 3.4),
the maximum overlap (1.7), the normalized support (1.7), and the minimum
account-for (−1.2). All other features contributed significantly less. When we
applied the model trained on the first period to the tagsets produced during the
second period, the RMSE was 3.83.
Since we were unable to confidently predict the subjective rating of a tagset, we
tried to filter the results. All tagsets rated between 3 and 5 (inclusive) should
be kept (positive examples), and all tagsets rated between 0 and 2 should be
filtered (negative examples). We then trained both a support vector machine
and a random forest, and tuned the parameters using stratified 5-fold cross
validation.
We found that the random forest was unable to outperform the naive strategy of always predicting positive. The support vector machine either matched
this performance on the first period (by always predicting positive), or slightly
outperformed the naive strategy on the second period (Table 5).
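This filtering setup can be sketched with scikit-learn [7]; since the original tagset features and ratings are not reproduced in the text, the example below uses synthetic placeholder data in their place:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X holds one row of tagset properties per tagset; y is 1 for positive
# examples (rated 3-5) and 0 for negative ones (rated 0-2).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))                     # placeholder features
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)        # stratified 5-fold accuracy
baseline = max(y.mean(), 1.0 - y.mean())          # always predict dominant class
```

The cross-validated accuracy is then compared against the naive baseline of always predicting the dominant class, which is the comparison made in Table 5.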
Figure 6: Subjective rating distributions of tagsets formed by spectral
clustering of (a) the first period and (b) the second period. It is unclear if
the large difference in the shapes of the distributions is due to a difference
in the produced tagsets, or the subjectivity of the rating method.
                    predicted
                negative   positive
actual negative    35         38
       positive    28         73

Table 5: Predictions of tagset quality by a support vector machine, aggregated
over 5 stratified folds. The predictions are slightly better than naively
predicting the dominant category.
6.2 Resulting Tagsets
In this section we show some tagsets of varying quality:
subjective quality   tagset
5                    #lampedusa, #vluchtelingen
4                    #assen, #jwfkazerne, #veldzicht
3                    #akkoord, #pensioen, #zzp
2                    #borssele, #deventer, #kernenergie
1                    #kunst, #nieuwsuur, #sp
0                    #doneerd66, #luchtvaart, #mkb, #progressivealliance,
                     #route66, #vkopinie, #vnva
Visit http://www.liacs.nl/~jkalmeij/twitter-project/ for a more elaborate overview, including the tweets corresponding to a tagset.
7 Conclusion
This work analyzes tweets sent by 144 members of Dutch parliament during a
timespan of roughly three months. We showed that not all parties are equally
active, and that the number of followers and the number of retweets are not a
good indicator for one another. The only politician that consistently appeared
in the top 5 for both metrics was Geert Wilders.
We broke our collection of tweets into three periods of four weeks, and found
that most (72%) tags are used only once, and that the majority of tweets (84%)
contained fewer than two tags. The tags used varied considerably between
periods: only 18% of the tags were used consistently across all periods. This
common vocabulary is primarily made up of party tags, tags of media sources,
names of cities, and some generic topics.
We assumed a tweet's hashtags approximate a tweet's topic, and used hashtag
clustering based on hashtag co-occurrence to identify topics. The clustering
method used is spectral clustering, which takes as input a similarity matrix and
the number of clusters, and produces sets of tags, which we also refer to as
tagsets.
To determine the number of tagsets, we tried to find a good trade-off between
the number of tagsets and tagset quality by using the overall average silhouette
width.
We manually inspected and rated the quality of the tagsets produced by the
spectral clustering method on a scale from 0 to 5 (inclusive). We found that at
least 50% of the tagsets were satisfactory (rated 4 or higher). However, the
subjective rating showed great variation in distribution across different
periods, and we were unable to train statistical models to predict the
subjective rating significantly better than naive methods. From this we conclude
that we either have too little data, or the subjective rating is too noisy.
During manual inspection we noticed some tagsets included tags a and b that
were not directly connected (that is, a and b were never used within the same
tweet). Instead a and b both co-occurred with some other tag c (which sometimes
was not present in this tagset). Since tags such as party tags or news media
tags connect many unrelated topics, this sometimes led to undesirable tagsets.
This effect could perhaps be reduced by using a different clustering method, or
by filtering hub tags.
We contributed some insight into how Dutch politicians use Twitter. The produced
clusters are of reasonable quality. Ideally we would want to link the clusters
to broader topics, to really summarize the political discussions on Twitter. We
suggest this can only be done by using other data sources, since co-occurring
tags are too rare and the Dutch members of parliament form a small population.
7.1 Future Work
This work could be expanded to perform automated content coding of the parliamentary Twitter stream. Hillard, Purpura, and Wilkerson have shown that supervised learning approaches can achieve automated classification that is just
as reliable as classification by humans [5]. A high quality automated coding of
the Twitter stream opens up many avenues for political science research. One
could, for example, look at the relationship between the topics discussed in the
parliamentary minutes and the parliamentary Twitter stream.
The development of topics through time could be examined. This would give
insight into the relationship between certain classes of topics: does one class
suppress another, or do they often trend together? This would require the topics to
be examined on a more abstract level, since we have shown that there is little
overlap in the hashtags used per period (and thus the resulting topics).
8 Acknowledgments
I am grateful to Walter Kosters and Christoph Stettina for their supervision and
discussions. I thank Gerard Breeman and Arco Timmermans for their feedback
and suggestions.
References
[1] J. Allan, R. Papka, and V. Lavrenko. “On-line new event detection and tracking”. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98). 1998, pp. 37–45. doi: 10.1145/290941.290954.

[2] D. Antenucci, G. Handy, A. Modi, and M. Tinkerhess. “Classification of tweets via clustering of hashtags”. 2011.

[3] E. Bakshy, J.M. Hofman, W.A. Mason, and D.J. Watts. “Everyone’s an influencer: Quantifying influence on Twitter”. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM ’11). 2011, pp. 65–74. doi: 10.1145/1935826.1935845.

[4] M. Cha, H. Haddadi, F. Benevenuto, and K.P. Gummadi. “Measuring user influence in Twitter: The million follower fallacy”. In: 4th International AAAI Conference on Weblogs and Social Media (ICWSM). 2010.

[5] D. Hillard, S. Purpura, and J. Wilkerson. “Computer assisted topic classification for mixed methods social science research”. In: Journal of Information Technology & Politics 4 (4 2008), pp. 31–46. doi: 10.1080/19331680801975367.

[6] H. Kwak, C. Lee, H. Park, and S. Moon. “What is Twitter, a social network or a news media?” In: Proceedings of the 19th International Conference on World Wide Web (WWW ’10). 2010, pp. 591–600. doi: 10.1145/1772690.1772751.

[7] F. Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[8] S. Petrović, M. Osborne, and V. Lavrenko. “Streaming first story detection with application to Twitter”. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT ’10). 2010, pp. 181–189.

[9] P.J. Rousseeuw. “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis”. In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65. doi: 10.1016/0377-0427(87)90125-7.

[10] M.T. Schäfer, N. Overheul, and T. Boeschoten. Politieke communicatie in 140 tekens [in Dutch]. June 2011.

[11] A. Tumasjan, T.O. Sprenger, P.G. Sandner, and I.M. Welpe. “Predicting elections with Twitter: What 140 characters reveal about political sentiment”. In: Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM). 2010, pp. 178–185.
A Results
Figure 7: Relationship between the subjective rating and different tagset
properties introduced in Section 5.3 (max-overlap, popularity,
min-accounts-for, mean-overlap, mean within-tagset-similarity, and normalized
weak-support). Shown on top of each panel is the Pearson correlation
coefficient.
Figure 7: (continued) Relationship between the subjective rating and different
tagset properties introduced in Section 5.3 (mean silhouette width,
weak-support, size, mean-accounts-for, min-overlap, and max-accounts-for).
Shown on top of each panel is the Pearson correlation coefficient.
tagset                                              sup   wsup
a2tunnel, rimec                                      1     1
boek, lek                                            2     2
award, toegankelijkheid                              2     2
bollen5, teylingen                                   1     1
arjoklamer, demoed                                   2     2
ddw, brabant                                         1     1
kindermishandeling, cinekid                          1     1
cobouw, tilburg                                      1     1
dutchcareercup, adotwe                               2     2
zembla, vechtscheidingen                             1     1
duurzametop100, nr21                                 2     2
vaola, stemmingen                                    2     2
feuten, film                                         2     2
mtcs, aoduurzaamhout                                 1     1
geheim, ohohdenhaag                                  2     2
nunl, fnvmetaal                                      1     1
sneek, veemarkthallen                                2     2
pietitie, premiejagen                                1     1
politieacademie, oliezwendel                         2     2
nobelprijs, opcw                                     1     1
stella, wsc13                                        2     2
lienden, kulturhus                                   1     1
jeugddebat, pizzasessie                              2     2
idigf, igf2013                                       2     2
maatwerk, hetkanwel                                  1     1
ldon, prodemos                                       1     1
gameon, ehl, hockey                                  0     2
russian, dutch, lavrov                               0     3
ddw13, ddw2013, eindhoven                            0     3
ovchipcard, happylines, metro                        0     2
hoppenbrouwers, techniekpact, pnb                    1     2
sergio, kijktip, vn                                  1     1
nijmegen, zeeland, tbs                               1     1
telegraaf, padua, apollonetwerk                      0     2
eye, jaofnee, donorweek                              1     2
vreemdelingenbeleid, borgen, nyborg                  0     3
werkgelegenheid, ao, integratie, economie            0     2
wereldvoedseldag2013, discosoep, geenboergeenv...    0     4

Table 6: When building a graph based on the co-occurrence between tags used
during the first period, we found the graph consisted of a number of connected
components. Each line depicts the tags of a component. In addition to these
smaller components the graph consists of a single, much larger component.