Journal of Computational Information Systems 10: 21 (2014) 9143–9151 Available at http://www.Jofcis.com Behaviour Analysis using Mutual Influences Mei YU 1,2 , Yu DU 1,2 , Dejun HOU 3,∗, 1 School 2 Tianjin 3 Office Jing ZHOU 1,2 , Ting LEI 1,2 of Computer Science and Technology, Tianjin University, Tianjin 300072, China Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin 300072, China of State-owned Assets and Equipment Management, Tianjin University, Tianjin 300072, China Abstract Social network analysis has received enormous attention in recent years, owing to the success of online social networking sites. This trend leads to the generation of a wealth of social network data. Therefore, the potential research impact of these techniques is still largely unexplored. In this article we address the problem of behavior analysis of huge amounts of data produced in social networks. Such a problem arises naturally in data analysis industry where one aims to understand users’ tastes with multiple traces from his history of surfing the net as correctly as possible. In each phase we present a brief overview of the problem, describe state-of-the art approaches, transform the model to deal with massive data examples, and map each of the topics to a behavior analysis framework. Furthermore, two probability analysis methods are compared to handle the situations what are really the users’ interest and to what extent that users’ privacy via online social network will be disclosed. We then investigate into applications of our algorithm to community user tastes analysis. In addition, experimental results on challenging real-world datasets show that the risk assessment capability of our proposed algorithm is effective. The main contribution of the article is to propose a state-of-the-art conversion of current techniques while providing a critical perspective on behavior analysis applications of social network analysis and data mining. Keywords: Social Network Analysis; Data Mining; PageRank; D-S Evidence Theory 1 Introduction With the ability to provide reliable behavior records and improve the precision and value of commercials, social network analysis using big data [1] has attracted considerable attention. Sufficient data may influence people’s decisions in life and business activities. Qmee [2] has reported exactly what happened online in an average minute of August 1, 2013. The analysis of online social networks based on big data has recently received wide attention from the aspects ∗ Corresponding author. Email address: [email protected] (Dejun HOU). 1553–9105 / Copyright © 2014 Binary Information Press DOI: 10.12733/jcis12234 November 1, 2014 9144 M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 [3–6]. The authors in [7] propose a new metric to automatically evaluate the confidence that a student knows a certain concept included in his or her conceptual model, which develops a new perspective. Based on matrix and graph theory, Social Network Analysis methods (SNA) [8] have been gradually developed. SNA can provide quantitative analysis, which is widely used in international trade, the area of crime, economic and cultural fields and so on. In this paper, a massive data processing method is developed for social network analysis to effectively evaluate user behaviors. Firstly, based on the popularity of films analyzed by PageRank, we get ranked community movies. Then, we apply the SNA method. In this work, we evaluate the impact of community that is provided by SNA by computing the closeness through intimacy metrics. By quantifying the popularity of movies and user’s influence and utilizing D-S evidence theory, we realize the measurement of user’s behaviors and the assessments of user’s privacy disclosure. Finally, using the simulation experiment we test the effectiveness of the proposed algorithm. 2 Related Work In [9], Newman et al. studied a hybrid coauthorship/citation network that combines the two, which they analyze to gain insight into the correlations and interactions between authorship and citation. However, they did not take scientists’ social information into account. Our method adopts to start with the importance of users and movies. An overview of community detection in graphs, was proposed in [10]. They put forward an integrated illustration of the topic, which contains the definition of the fundamental parts of the problem and the general methods developed. Google’s founder Page applied random walk model to the network of web pages and proposed PageRank algorithm [11] to calculate the influence of web pages. They showed how to efficiently compute PageRank for large numbers of pages and utilized PageRank to search and to user navigation. Authors in [12] focused on the problem of identifying influential users of micro-blogging services. TwitterRank, an extension of PageRank algorithm, was raised to measure the influence of users in Twitter. They had not classified different categories of twitterers by studying their “following” behaviors closely. What is more, PageRank was used in [13] to deal with trust in social network. A novel trust model for P2P system based on advanced D-S theory of evidence was proposed in [14]. Compared to some current trust models, the introduced model was more robust on trust security problems and more advanced in successful transaction rate. The major distinction between our work and aforementioned work lies in the formulation of the calculation method of importance. Specifically, we propose a PageRank-like method to model the popularity of each movie that appears among the mass of the films. We solve the problem faced with big data. Moreover, we design an algorithm to judge the influence by making use of measuring intimacy of users in the community. The remainder of this paper is organized as follows. In Section 3 we present the importance model with PageRank-like method and intimacy measurement. The details of assessment along with the final evaluation rule are stated in Section 4. The algorithm of our method is provided in Section 5. The applications of our method to behavior analysis and privacy disclosure risk assessment are studied in Section 6. Finally, we conclude our work in Section 7. M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 3 3.1 9145 Importance Model Movie-specific pageRank PageRank describes a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. The advantages of PageRank can be revealed when dealing with the greatest for underspecified queries. PageRank is essentially an idea of citation analysis. We apply PageRank to settle the problem in movie rating social network. The unidirectional relationship rating for a user to a movie would be counted as a citation. The similarity between two users is compared through the coincidence rate of citation. That is to say, the more common rated movies two users have the higher similarity between them. Besides, the frequency that a movie is rated can be used to measure the movie’s popularity. When a specific movie gets numerous ratings from users, it can be inferred that the movie is very popular. The definition of PageRank aforementioned has another basis in random walks. That is, it can be thought of as modeling the behavior of a “random surfer”. The “random surfer” simply ensures clicking on successive links randomly. Finally, the user surf to a page that has no links. Then the user enters a URL aimlessly to jump to a new page to continue previous procedure. The key of PageRank is the calculation method of transition probability among nodes in random walk model. The growth of PageRank has been fueled by the desire to bring social network analysis to big data while retaining access to traditional rank calculation. The algorithm we proposed exploits the social behavior of rating in the movie community to redesign the calculation strategy in transition probability between nodes of PageRank. Firstly, we define the similarity between users in the movie community by the social behavior of rating. The main idea is as follows: the user rates a specific movie. We do not consider how many stars the user evaluates the movie or whether the user likes the movie. What we can conclude is that the user is interested in the movie. When several users rate the same movie, these reviewers have the same tastes in this specific movie. Based on the idea presented above, we define a movie interest eigenvector for each community user to express the interestingness that the user to a specific movie. In definition 1, id stands for the movie ID, cn (user, id) represents the rating to a movie from the user. Movie (user) means the set of the users who have rated the movies. Definition 1 The movie interests’ eigenvector of users can be calculated as follows: V ector(user) = [id = cn(user, id)] , t = |M ovies(user)| , id ∈ M ovies(user) (1) Definition 2 The definition of interest similarity. Each dimension of user interest eigenvector is actually a tuple (movie id, the rate of user to the movie). From the tuple, we can gain the level of interest of the user on a specific movie. So, cosine similarity is employed to compute interest similarity between two users in the community. 9146 M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 The formula is as Eq. (2): Sim(i, j) = V ector(i) · V ector(j) , |V ector(i)| |V ector(j)| i, j ∈ V (2) From the calculated results from the aforementioned definitions, this paper expands PageRank to redesign transition probability of random walk as Eq. (3): Pij = |M ovies(j)| × Sim(i, j) ∑ , i, j ∈ V |M ovies(n)| × Sim(i, n) (3) n∈F riends(i) Eq. (3) is actually based on the idea of the amount of information: fractions explains the sum of information gained from user i to j that they rate. Numerator represents the information user i have from all users in the movie community. The ratio shows the transition probability between user i to j. From the redesigned transition probability, we can calculate the PR value of the movie community user from below: P R(i) = ∑ 1−q +q P R(j) × Pij , |V | j i, j ∈ V (4) In academic co-networks, the number of papers and citation frequency are viewed as a measure to evaluate the importance of the author [15]. In sociology, if two people have more common friends, the relationship between them are closer. Therefore, we obtain these definitions from 3 to 5: Definition 3 Define CM (node1, node2) the common movies that they saw in the movie community: CM (node1, node2) = {node|node ∈ M ovies(node1)∧ node ∈ M ovies(node2)}, node1, node2 ∈ V (5) Definition 4 Define Closeness (node1, node2) the closeness of the two users in the movie community: Closeness(node1, node2) = |CF (node1, node2)| (6) Definition 5 Define D (circle, node) the distance between a specific user to movie interest circle: M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 ∑ D(circle, node) = |CM (node, i) ∧ CM (node, center)| , i ∈ circle,s = |circle| s × |CM (node, center)| 9147 (7) From definition 5, D (circle, node) shows the intimacy distance from a user in movie community of specific center user to an existing movie interest circle. |CM (node, i) ∧ CM (node, center)| represents the common movie viewing number of node i and center movie node. The fraction computes the sum of common movie viewing number from the specific node to every node in the circle and the center node. That is to say, D (circle, node) explains the average intimacy between node and circle. 4 Behavior Analysis Model Bayesian methods are usually applied to various aspects [16-18]. Its basic principle is: if given a priori assumptions likelihood estimation, with the arrival of the new evidence (observed data), the Bayesian method can update the assumed likelihood function. The Dempster-Shafer (D-S) theory of evidence (Shafer, 1976) is a powerful tool for analyzing uncertain knowledge. Let Θ= {θ1 , · · · , θk } be a finite set of possible hypotheses. This set Θ is referred to as the frame of discernment, and its powerset denoted by 2Θ . It stands for all subsets of Θ. A basic belief assignment (BPA) m is a function that m (A) is the part of belief that supports A, but does not support anything more specific. The combination rule can be easily extended to several belief functions by repeating the rule for new belief functions. Thus the pairwise orthogonal sum of n belief functions Beli (i = 1, 2, ..., n) can be formed as Bel = Bel1 ⊕ Bel2 ⊕ · · · ⊕ Beln (8) Consider two BPAs m1 (.) and m2 (.) for belief functions bel1 (.) and bel2 (.) respectively. Let Aj and Bk be focal elements of bel1 and bel2 respectively. Then m1 (.) and m2 (.) can be combined to obtain the belief mass committed to C ∈ Θ according to the following combination or orthogonal sum formula (Shafer, 1976), 0, C=ϕ ∑ ∑ m1 (Ai )m2 (Bj ) m(C) = ,k = 1 − m1 (Ai )m2 (Bj ) (9) Ai ∩Bj =A Ai ∩Bj =A (C ̸= ϕ) k Lemma 1 Bayesian methods and evidence theory are commutative and associative It is obvious that Bayes’ theorem and evidence theory’s fusion rules are commutative. Below we prove Bayes’ theorem and evidence theory in information fusion are associative. For Bayes’ theorem, when there are two variables, its fusion probability can be described as follow: P (A |B , C) = P (A, B, C) P (A, B, C) P (C |A, B )P (A, B) P (A)P (B |A )P (C |A, B ) = = = (10) P (B, C) P (B)P (C |B ) P (B)P (C |B ) P (B)P (C |B ) 9148 M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 Repeatedly applying two variables for Bayesian information fusion and use induction, it is easy to prove that the Bayes’ theorem in data fusion is associative. Based on the aforementioned associative law, it can be induced easily that: m1···n−1 (A)=m1 (A) ⊕ m2 (A) ⊕ . . . mn (A) (11) These two features ensure that the order does not affect the synthesis of synthetic results. The strong point may be described as overcoming the reality of ‘preconceived policy decision’. Theorem 1 Bayesian approach is a special case of evidence theory. ¯ = 1. In Bayesian methods, proposition A and its complement form a complete set P (A) + P (A) However, in practical problems, belief degree on proposition A does not reflect the belief degree ¯ ≤ 1. For proposition A, the upper and lower range of evidence to its complement m(A) + m(A) interval are support function Spt(A) and plausibility function Pls(A), which can be expressed as [Spt(A), Pls(A)]. We can conclude that Bayesian method is a special case of evidence theory. For the convenience of depicting information fusion in evidence theory and its inclusion to Bayes’ theorem, evidence theory is chose as the measure of analyzing user behaviors and privacy disclosure. 5 5.1 PageRank-like Algorithm Main idea of the algorithm This algorithm process includes several steps: 1) Order of importance: mainly divides different importance degree for the objects in data. Through a PageRank-like algorithm and intimacy calculation, the sorting for importance of the data will be realized. 2) User behavior analysis and assessment of privacy disclosure based on evidence theory: Firstly, converse the sorted data of importance to integrated reliability calculation. Then evaluate user behaviors and privacy disclosure via evidence theory. 5.2 Description of the algorithm Based on the above view, algorithm description is in Fig. 1: 6 Simulation Results To evaluate the effectiveness of our proposed method for social behaviors’ analysis, we systematically apply it to MovieLens data sets (http://www. MovieLens.org). The data set we choose contains 943 users’ rating for 1682 movies from one to five stars. They delete those users who rate less than 20 movies. The data also consists of some information for users like age, gender, occupation, zip and so on. In order to obtain reliable results, we test several stages of analysis. M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 9149 Fig. 1: Pseudo code of analysis algorithm 6.1 Movies sorts The third part of the paper describes PageRank-like algorithm that is utilized to sort movie popularity. However, damping was to be 0.85. We vary q from 0.05 to 0.95 for step of 0.05 to evaluate the effect of damping. We start with data reprocessing. The outcome is a 1682*1682 symmetric matrix. Diagonal elements are set to be 0. Each row represents the popularity of a film relative to others. Then we use q from 0-1 to calculate the influence of movie popularity with the method of PageRank-like method. We calculate the top 10 movies that are computed with varied q. The damping is set to be 0.05. From top to bottom, damping values increase 0.05 for each time. The experiment tells that as the damping increases, the popular movie has experienced steady– some fluctuation - stable process. We use the results to count the frequency of the most popular movies,the outcome go as Table 1. The first row in Table 1 represents the id of the most popular movie. The second row shows the occurrences of the popular movies when changing the damping coefficient. Fig. 2: Trends of the variance of popular movies Fig. 3: Users’ influence on the popularity of movies 9150 M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 Table 1: Frequency of popular movies Movie ID 1 50 56 98 100 121 172 174 181 204 258 288 Frequency 19 19 19 11 19 19 9 19 19 19 10 8 6.2 Sorting based on user influence The third part of the paper describes the sort of user influence using intimacy-related algorithm. We determine the users by sorted movies using PageRank-like algorithm.We measure user authority by intimacy distance. Then find out the importance of the small group who are curious about the studied movies in the rating community. Finally, we mine the ranked influence of community users via the bridge of interest. 6.3 User behavior analysis and privacy disclosure assessment We select evidence theory as the assessment method. We conduct the information fusion by evidence theory to the users who have rated for the most popular movie. Fig. 3 explains the assessments of the first 6 users and other excess users. 1-6 describes the most influential users who have seen the movie in the community. 7 stand for the other 577 users. The experiment shows key users have vital influence on community viewing habits. And with the same habits, vital users have a higher probability that they know each other in real world. As to whose personal information is disclosed to lead the trend, there is considerable variance. Therefore, the protection of the key users’ privacy information will contribute to the overall small group’s information greatly. 7 Conclusion In this paper we have analyzed a large data set from the MovieLens web set, taking a network perspective. Rather than focus solely on user behavior analysis, as most previous studies have done, we have also extended to assess privacy disclosure risks. The ratings of the data set is unusually large, covering more than 100,000, which allows us to study users’ tastes and privacy risks that are not accessible with smaller data sets. We investigated a social network user behavior analysis issue faced by the privacy disclosure risks. A three-level analysis scheme was proposed to jointly analyze user behaviors and privacy disclosure risks with the social networks in the background of big data. We explore the key parts for each sequence along two methods: (1) use the PageRank-like algorithm to select the best influential movies for the movie rating community; (2) sort the community users for each movie and its importance to the other user’s choice making. To obtain the results of the user behavior analysis, evidence theory was employed to assess the user behaviors and privacy disclosure risk problems, which can be optimally solved. The fraction of influential movies selection is more or less constant over varying the damping q. And community users tend to rating movies that have been rated by important guys. We observe a strong tendency towards key users, who have greater risks to be attacked. Simulation results demonstrated after 3 sections, shows the performance of the proposed strategies can obtain near-optimal performance. M. Yu et al. /Journal of Computational Information Systems 10: 21 (2014) 9143–9151 9151 There are many other questions that could be addressed with this data set, the time stamps and a further discuss about rating scores opening up a variety of possibilities. We could also study personalized commercial recommendation by making use of the data on users’ interests and personal situations. And finally, any of our analyses could be extended to data sets that cover other social communities, if and when such data become available. References [1] Big Data. Nature (http://www.nature.com/news/specials/bigdata/index.html), Sep 2008. [2] What happens online in 60 seconds? Qmee (http://blog.qmee.com/qmee-online-in-60-seconds/). [3] Arxiden, Ablimit. Analysis of users’ influence based on activity and quality of tweets for specific topics in Micro-blog. [J]. Journal of Computational Information Systems, 2013, 9 (20): 8127-8137. [4] Yu Mei; Xu Tianyi; Yu Jian, et al. A social-aware incentive mechanism for Ad Hoc networks [J]. Journal of Computational Information Systems, 2013, 9 (20): 8251-8259. [5] Manovich L (2012) Trending: the promises and the challenges of big social data. In: Debates in the digital humanities. Gold MK, editor Minneapolis: University of Minnesota Press. 532. [6] Di Loreto Ines G A. Facebook Games: Between Social and Personal Aspects [J]. International Journal of Computer Information Systems and. [7] Industrial Management Applications. ISSN 2150-7988 Vol. 3 (2011) pp. 713-723. [8] Perez-Marin, D., Alfonseca, E., Rodriguez, P., Pascual-Neito, I. (2007) A Study on the Possibility of Automatically Estimating the Confidence Value of Students’ Knowledge in Generated Conceptual Models, Journal of Computers 2 (5), 17-26. [9] Lin jurenSocial Network Analysis: Theory, Methods and Applications [M]BeiJingBeijing Normal University Press. 2009. [10] Martin T, Ball B, Karrer B, et al. Coauthorship and citation in scientific publishing [J]. arXiv preprint arXiv: 1304.0473, 2013. [11] Fortunato, S. Community detection in graphs. Phys. Rep. 486, 75-174 (2010). [12] Page L, Brin S, Motwani R, et al. The PageRank citation ranking: Bringing Order to the web [R]. 1998. [13] Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He. TwitterRank: finding topic-sensitive influential twitters. Proceedings of the third ACM international conference on web search and data mining, 2010: 261-270. [14] Matsuo Y, Tomobe H, Hasida K, et al. Finding social network for trust calculation [C]. ECAI. 2004, 16: 510. [15] Tian C, Zou S, Wang W, et al. A new trust model based on advanced DS evidence theory for P2P networks [M]. Formal Aspects in Security and Trust. Springer Berlin Heidelberg, 2007: 270-284. [16] B. Turhan, T. Menzies, A. B. Bener, and J. D. Stefano, On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, DOI: 10.1007/s10664008-9103-7, 2009. [17] Varguez-Moo M, Moo-Mena F, Uc-Cetina V. Use of Classification Algorithms for Semantic Web Services Discovery [J]. Journal of Computers, 2013, 8 (7): 1810-1814. [18] Carlin BP, Louis TA. Bayesian Methods for Data Analysis (3rd edn). CRC Press: Boca Raton, 2008.
© Copyright 2024 ExpyDoc