Behaviour Analysis using Mutual Influences 1 Introduction

Journal of Computational Information Systems 10: 21 (2014) 9143–9151
Available at http://www.Jofcis.com
Behaviour Analysis using Mutual Influences
Mei YU 1,2 , Yu DU 1,2 , Dejun HOU 3,∗,
1 School
2 Tianjin
3 Office
Jing ZHOU 1,2 , Ting LEI 1,2
of Computer Science and Technology, Tianjin University, Tianjin 300072, China
Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin 300072,
China
of State-owned Assets and Equipment Management, Tianjin University, Tianjin 300072, China
Abstract
Social network analysis has received enormous attention in recent years, owing to the success of online
social networking sites. This trend leads to the generation of a wealth of social network data. Therefore,
the potential research impact of these techniques is still largely unexplored. In this article we address the
problem of behavior analysis of huge amounts of data produced in social networks. Such a problem arises
naturally in data analysis industry where one aims to understand users’ tastes with multiple traces from
his history of surfing the net as correctly as possible. In each phase we present a brief overview of the
problem, describe state-of-the art approaches, transform the model to deal with massive data examples,
and map each of the topics to a behavior analysis framework. Furthermore, two probability analysis
methods are compared to handle the situations what are really the users’ interest and to what extent
that users’ privacy via online social network will be disclosed. We then investigate into applications
of our algorithm to community user tastes analysis. In addition, experimental results on challenging
real-world datasets show that the risk assessment capability of our proposed algorithm is effective. The
main contribution of the article is to propose a state-of-the-art conversion of current techniques while
providing a critical perspective on behavior analysis applications of social network analysis and data
mining.
Keywords: Social Network Analysis; Data Mining; PageRank; D-S Evidence Theory
1
Introduction
With the ability to provide reliable behavior records and improve the precision and value of
commercials, social network analysis using big data [1] has attracted considerable attention.
Sufficient data may influence people’s decisions in life and business activities. Qmee [2] has
reported exactly what happened online in an average minute of August 1, 2013. The analysis of
online social networks based on big data has recently received wide attention from the aspects
∗
Corresponding author.
Email address: [email protected] (Dejun HOU).
1553–9105 / Copyright © 2014 Binary Information Press
DOI: 10.12733/jcis12234
November 1, 2014
9144
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
[3–6]. The authors in [7] propose a new metric to automatically evaluate the confidence that a
student knows a certain concept included in his or her conceptual model, which develops a new
perspective. Based on matrix and graph theory, Social Network Analysis methods (SNA) [8]
have been gradually developed. SNA can provide quantitative analysis, which is widely used in
international trade, the area of crime, economic and cultural fields and so on.
In this paper, a massive data processing method is developed for social network analysis to
effectively evaluate user behaviors. Firstly, based on the popularity of films analyzed by PageRank, we get ranked community movies. Then, we apply the SNA method. In this work, we
evaluate the impact of community that is provided by SNA by computing the closeness through
intimacy metrics. By quantifying the popularity of movies and user’s influence and utilizing D-S
evidence theory, we realize the measurement of user’s behaviors and the assessments of user’s privacy disclosure. Finally, using the simulation experiment we test the effectiveness of the proposed
algorithm.
2
Related Work
In [9], Newman et al. studied a hybrid coauthorship/citation network that combines the two,
which they analyze to gain insight into the correlations and interactions between authorship and
citation. However, they did not take scientists’ social information into account. Our method
adopts to start with the importance of users and movies.
An overview of community detection in graphs, was proposed in [10]. They put forward an
integrated illustration of the topic, which contains the definition of the fundamental parts of the
problem and the general methods developed. Google’s founder Page applied random walk model
to the network of web pages and proposed PageRank algorithm [11] to calculate the influence of
web pages. They showed how to efficiently compute PageRank for large numbers of pages and
utilized PageRank to search and to user navigation. Authors in [12] focused on the problem of
identifying influential users of micro-blogging services. TwitterRank, an extension of PageRank
algorithm, was raised to measure the influence of users in Twitter. They had not classified
different categories of twitterers by studying their “following” behaviors closely. What is more,
PageRank was used in [13] to deal with trust in social network. A novel trust model for P2P
system based on advanced D-S theory of evidence was proposed in [14]. Compared to some
current trust models, the introduced model was more robust on trust security problems and more
advanced in successful transaction rate.
The major distinction between our work and aforementioned work lies in the formulation of
the calculation method of importance. Specifically, we propose a PageRank-like method to model
the popularity of each movie that appears among the mass of the films. We solve the problem
faced with big data. Moreover, we design an algorithm to judge the influence by making use of
measuring intimacy of users in the community.
The remainder of this paper is organized as follows. In Section 3 we present the importance
model with PageRank-like method and intimacy measurement. The details of assessment along
with the final evaluation rule are stated in Section 4. The algorithm of our method is provided
in Section 5. The applications of our method to behavior analysis and privacy disclosure risk
assessment are studied in Section 6. Finally, we conclude our work in Section 7.
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
3
3.1
9145
Importance Model
Movie-specific pageRank
PageRank describes a method for rating Web pages objectively and mechanically, effectively
measuring the human interest and attention devoted to them. The advantages of PageRank can
be revealed when dealing with the greatest for underspecified queries. PageRank is essentially an
idea of citation analysis.
We apply PageRank to settle the problem in movie rating social network. The unidirectional
relationship rating for a user to a movie would be counted as a citation. The similarity between
two users is compared through the coincidence rate of citation. That is to say, the more common
rated movies two users have the higher similarity between them. Besides, the frequency that
a movie is rated can be used to measure the movie’s popularity. When a specific movie gets
numerous ratings from users, it can be inferred that the movie is very popular.
The definition of PageRank aforementioned has another basis in random walks. That is, it
can be thought of as modeling the behavior of a “random surfer”. The “random surfer” simply
ensures clicking on successive links randomly. Finally, the user surf to a page that has no links.
Then the user enters a URL aimlessly to jump to a new page to continue previous procedure.
The key of PageRank is the calculation method of transition probability among nodes in random
walk model. The growth of PageRank has been fueled by the desire to bring social network
analysis to big data while retaining access to traditional rank calculation. The algorithm we
proposed exploits the social behavior of rating in the movie community to redesign the calculation
strategy in transition probability between nodes of PageRank.
Firstly, we define the similarity between users in the movie community by the social behavior of
rating. The main idea is as follows: the user rates a specific movie. We do not consider how many
stars the user evaluates the movie or whether the user likes the movie. What we can conclude is
that the user is interested in the movie. When several users rate the same movie, these reviewers
have the same tastes in this specific movie.
Based on the idea presented above, we define a movie interest eigenvector for each community
user to express the interestingness that the user to a specific movie. In definition 1, id stands for
the movie ID, cn (user, id) represents the rating to a movie from the user. Movie (user) means
the set of the users who have rated the movies.
Definition 1 The movie interests’ eigenvector of users can be calculated as follows:
V ector(user) = [id = cn(user, id)] , t = |M ovies(user)| , id ∈ M ovies(user)
(1)
Definition 2 The definition of interest similarity.
Each dimension of user interest eigenvector is actually a tuple (movie id, the rate of user to
the movie). From the tuple, we can gain the level of interest of the user on a specific movie. So,
cosine similarity is employed to compute interest similarity between two users in the community.
9146
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
The formula is as Eq. (2):
Sim(i, j) =
V ector(i) · V ector(j)
,
|V ector(i)| |V ector(j)|
i, j ∈ V
(2)
From the calculated results from the aforementioned definitions, this paper expands PageRank
to redesign transition probability of random walk as Eq. (3):
Pij =
|M ovies(j)| × Sim(i, j)
∑
, i, j ∈ V
|M ovies(n)| × Sim(i, n)
(3)
n∈F riends(i)
Eq. (3) is actually based on the idea of the amount of information: fractions explains the sum
of information gained from user i to j that they rate. Numerator represents the information user
i have from all users in the movie community. The ratio shows the transition probability between
user i to j.
From the redesigned transition probability, we can calculate the PR value of the movie community user from below:
P R(i) =
∑
1−q
+q
P R(j) × Pij ,
|V |
j
i, j ∈ V
(4)
In academic co-networks, the number of papers and citation frequency are viewed as a measure
to evaluate the importance of the author [15]. In sociology, if two people have more common
friends, the relationship between them are closer. Therefore, we obtain these definitions from 3
to 5:
Definition 3 Define CM (node1, node2) the common movies that they saw in the movie community:
CM (node1, node2) = {node|node ∈ M ovies(node1)∧ node ∈ M ovies(node2)}, node1, node2 ∈ V
(5)
Definition 4 Define Closeness (node1, node2) the closeness of the two users in the movie community:
Closeness(node1, node2) = |CF (node1, node2)|
(6)
Definition 5 Define D (circle, node) the distance between a specific user to movie interest circle:
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
∑
D(circle, node) =
|CM (node, i) ∧ CM (node, center)|
, i ∈ circle,s = |circle|
s × |CM (node, center)|
9147
(7)
From definition 5, D (circle, node) shows the intimacy distance from a user in movie community
of specific center user to an existing movie interest circle. |CM (node, i) ∧ CM (node, center)|
represents the common movie viewing number of node i and center movie node. The fraction
computes the sum of common movie viewing number from the specific node to every node in the
circle and the center node. That is to say, D (circle, node) explains the average intimacy between
node and circle.
4
Behavior Analysis Model
Bayesian methods are usually applied to various aspects [16-18]. Its basic principle is: if given
a priori assumptions likelihood estimation, with the arrival of the new evidence (observed data),
the Bayesian method can update the assumed likelihood function.
The Dempster-Shafer (D-S) theory of evidence (Shafer, 1976) is a powerful tool for analyzing
uncertain knowledge. Let Θ= {θ1 , · · · , θk } be a finite set of possible hypotheses. This set Θ is
referred to as the frame of discernment, and its powerset denoted by 2Θ . It stands for all subsets
of Θ.
A basic belief assignment (BPA) m is a function that m (A) is the part of belief that supports
A, but does not support anything more specific. The combination rule can be easily extended
to several belief functions by repeating the rule for new belief functions. Thus the pairwise
orthogonal sum of n belief functions Beli (i = 1, 2, ..., n) can be formed as
Bel = Bel1 ⊕ Bel2 ⊕ · · · ⊕ Beln
(8)
Consider two BPAs m1 (.) and m2 (.) for belief functions bel1 (.) and bel2 (.) respectively.
Let Aj and Bk be focal elements of bel1 and bel2 respectively. Then m1 (.) and m2 (.) can be
combined to obtain the belief mass committed to C ∈ Θ according to the following combination
or orthogonal sum formula (Shafer, 1976),


0,
C=ϕ
∑
∑
m1 (Ai )m2 (Bj )
m(C) =
,k = 1 −
m1 (Ai )m2 (Bj )
(9)
 Ai ∩Bj =A
Ai ∩Bj =A
(C ̸= ϕ)
k
Lemma 1 Bayesian methods and evidence theory are commutative and associative
It is obvious that Bayes’ theorem and evidence theory’s fusion rules are commutative. Below
we prove Bayes’ theorem and evidence theory in information fusion are associative.
For Bayes’ theorem, when there are two variables, its fusion probability can be described as
follow:
P (A |B , C) =
P (A, B, C)
P (A, B, C)
P (C |A, B )P (A, B)
P (A)P (B |A )P (C |A, B )
=
=
=
(10)
P (B, C)
P (B)P (C |B )
P (B)P (C |B )
P (B)P (C |B )
9148
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
Repeatedly applying two variables for Bayesian information fusion and use induction, it is easy
to prove that the Bayes’ theorem in data fusion is associative.
Based on the aforementioned associative law, it can be induced easily that:
m1···n−1 (A)=m1 (A) ⊕ m2 (A) ⊕ . . . mn (A)
(11)
These two features ensure that the order does not affect the synthesis of synthetic results. The
strong point may be described as overcoming the reality of ‘preconceived policy decision’.
Theorem 1 Bayesian approach is a special case of evidence theory.
¯ = 1.
In Bayesian methods, proposition A and its complement form a complete set P (A) + P (A)
However, in practical problems, belief degree on proposition A does not reflect the belief degree
¯ ≤ 1. For proposition A, the upper and lower range of evidence
to its complement m(A) + m(A)
interval are support function Spt(A) and plausibility function Pls(A), which can be expressed as
[Spt(A), Pls(A)].
We can conclude that Bayesian method is a special case of evidence theory. For the convenience
of depicting information fusion in evidence theory and its inclusion to Bayes’ theorem, evidence
theory is chose as the measure of analyzing user behaviors and privacy disclosure.
5
5.1
PageRank-like Algorithm
Main idea of the algorithm
This algorithm process includes several steps: 1) Order of importance: mainly divides different
importance degree for the objects in data. Through a PageRank-like algorithm and intimacy
calculation, the sorting for importance of the data will be realized. 2) User behavior analysis
and assessment of privacy disclosure based on evidence theory: Firstly, converse the sorted data
of importance to integrated reliability calculation. Then evaluate user behaviors and privacy
disclosure via evidence theory.
5.2
Description of the algorithm
Based on the above view, algorithm description is in Fig. 1:
6
Simulation Results
To evaluate the effectiveness of our proposed method for social behaviors’ analysis, we systematically apply it to MovieLens data sets (http://www. MovieLens.org). The data set we choose
contains 943 users’ rating for 1682 movies from one to five stars. They delete those users who
rate less than 20 movies. The data also consists of some information for users like age, gender,
occupation, zip and so on. In order to obtain reliable results, we test several stages of analysis.
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
9149
Fig. 1: Pseudo code of analysis algorithm
6.1
Movies sorts
The third part of the paper describes PageRank-like algorithm that is utilized to sort movie
popularity. However, damping was to be 0.85. We vary q from 0.05 to 0.95 for step of 0.05 to
evaluate the effect of damping.
We start with data reprocessing. The outcome is a 1682*1682 symmetric matrix. Diagonal
elements are set to be 0. Each row represents the popularity of a film relative to others.
Then we use q from 0-1 to calculate the influence of movie popularity with the method of
PageRank-like method. We calculate the top 10 movies that are computed with varied q. The
damping is set to be 0.05. From top to bottom, damping values increase 0.05 for each time.
The experiment tells that as the damping increases, the popular movie has experienced steady–
some fluctuation - stable process. We use the results to count the frequency of the most popular
movies,the outcome go as Table 1. The first row in Table 1 represents the id of the most popular
movie. The second row shows the occurrences of the popular movies when changing the damping
coefficient.
Fig. 2: Trends of the variance of popular movies Fig. 3: Users’ influence on the popularity of movies
9150
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
Table 1: Frequency of popular movies
Movie ID
1
50
56
98
100
121
172
174
181
204
258
288
Frequency
19
19
19
11
19
19
9
19
19
19
10
8
6.2
Sorting based on user influence
The third part of the paper describes the sort of user influence using intimacy-related algorithm.
We determine the users by sorted movies using PageRank-like algorithm.We measure user authority by intimacy distance. Then find out the importance of the small group who are curious about
the studied movies in the rating community. Finally, we mine the ranked influence of community
users via the bridge of interest.
6.3
User behavior analysis and privacy disclosure assessment
We select evidence theory as the assessment method. We conduct the information fusion by
evidence theory to the users who have rated for the most popular movie. Fig. 3 explains the
assessments of the first 6 users and other excess users. 1-6 describes the most influential users
who have seen the movie in the community. 7 stand for the other 577 users. The experiment
shows key users have vital influence on community viewing habits. And with the same habits,
vital users have a higher probability that they know each other in real world. As to whose
personal information is disclosed to lead the trend, there is considerable variance. Therefore,
the protection of the key users’ privacy information will contribute to the overall small group’s
information greatly.
7
Conclusion
In this paper we have analyzed a large data set from the MovieLens web set, taking a network
perspective. Rather than focus solely on user behavior analysis, as most previous studies have
done, we have also extended to assess privacy disclosure risks. The ratings of the data set is
unusually large, covering more than 100,000, which allows us to study users’ tastes and privacy
risks that are not accessible with smaller data sets.
We investigated a social network user behavior analysis issue faced by the privacy disclosure
risks. A three-level analysis scheme was proposed to jointly analyze user behaviors and privacy
disclosure risks with the social networks in the background of big data. We explore the key parts
for each sequence along two methods: (1) use the PageRank-like algorithm to select the best
influential movies for the movie rating community; (2) sort the community users for each movie
and its importance to the other user’s choice making. To obtain the results of the user behavior
analysis, evidence theory was employed to assess the user behaviors and privacy disclosure risk
problems, which can be optimally solved. The fraction of influential movies selection is more or
less constant over varying the damping q. And community users tend to rating movies that have
been rated by important guys. We observe a strong tendency towards key users, who have greater
risks to be attacked. Simulation results demonstrated after 3 sections, shows the performance of
the proposed strategies can obtain near-optimal performance.
M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151
9151
There are many other questions that could be addressed with this data set, the time stamps
and a further discuss about rating scores opening up a variety of possibilities. We could also
study personalized commercial recommendation by making use of the data on users’ interests and
personal situations. And finally, any of our analyses could be extended to data sets that cover
other social communities, if and when such data become available.
References
[1]
Big Data. Nature (http://www.nature.com/news/specials/bigdata/index.html), Sep 2008.
[2]
What happens online in 60 seconds? Qmee (http://blog.qmee.com/qmee-online-in-60-seconds/).
[3]
Arxiden, Ablimit. Analysis of users’ influence based on activity and quality of tweets for specific
topics in Micro-blog. [J]. Journal of Computational Information Systems, 2013, 9 (20): 8127-8137.
[4]
Yu Mei; Xu Tianyi; Yu Jian, et al. A social-aware incentive mechanism for Ad Hoc networks [J].
Journal of Computational Information Systems, 2013, 9 (20): 8251-8259.
[5]
Manovich L (2012) Trending: the promises and the challenges of big social data. In: Debates in
the digital humanities. Gold MK, editor Minneapolis: University of Minnesota Press. 532.
[6]
Di Loreto Ines G A. Facebook Games: Between Social and Personal Aspects [J]. International
Journal of Computer Information Systems and.
[7]
Industrial Management Applications. ISSN 2150-7988 Vol. 3 (2011) pp. 713-723.
[8]
Perez-Marin, D., Alfonseca, E., Rodriguez, P., Pascual-Neito, I. (2007) A Study on the Possibility of Automatically Estimating the Confidence Value of Students’ Knowledge in Generated
Conceptual Models, Journal of Computers 2 (5), 17-26.
[9]
Lin jurenSocial Network Analysis: Theory, Methods and Applications [M]BeiJingBeijing Normal
University Press. 2009.
[10] Martin T, Ball B, Karrer B, et al. Coauthorship and citation in scientific publishing [J]. arXiv
preprint arXiv: 1304.0473, 2013.
[11] Fortunato, S. Community detection in graphs. Phys. Rep. 486, 75-174 (2010).
[12] Page L, Brin S, Motwani R, et al. The PageRank citation ranking: Bringing Order to the web [R].
1998.
[13] Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He. TwitterRank: finding topic-sensitive influential
twitters. Proceedings of the third ACM international conference on web search and data mining,
2010: 261-270.
[14] Matsuo Y, Tomobe H, Hasida K, et al. Finding social network for trust calculation [C]. ECAI.
2004, 16: 510.
[15] Tian C, Zou S, Wang W, et al. A new trust model based on advanced DS evidence theory for P2P
networks [M]. Formal Aspects in Security and Trust. Springer Berlin Heidelberg, 2007: 270-284.
[16] B. Turhan, T. Menzies, A. B. Bener, and J. D. Stefano, On the relative value of cross-company and
within-company data for defect prediction. Empirical Software Engineering, DOI: 10.1007/s10664008-9103-7, 2009.
[17] Varguez-Moo M, Moo-Mena F, Uc-Cetina V. Use of Classification Algorithms for Semantic Web
Services Discovery [J]. Journal of Computers, 2013, 8 (7): 1810-1814.
[18] Carlin BP, Louis TA. Bayesian Methods for Data Analysis (3rd edn). CRC Press: Boca Raton,
2008.