School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China
Copyright © 2013 Tieying Zhu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the rapid development of social networks and their applications, the demand for publishing and sharing social network data for commercial or research purposes is increasing. However, the risk of disclosing sensitive information about social network users is also rising. This paper proposes an effective structural attack to deanonymize social graph data. The attack uses the cumulative degree of the $n$-hop neighbors of a node as a regional feature and combines it with a simulated annealing-based graph matching method to reidentify nodes in anonymous social graphs. Simulation results on two social network datasets show that the attack is feasible for node reidentification in anonymous graphs, including the simply anonymous graph, the randomized graph, and the $k$-isomorphism graph.
1. Introduction
A social network is a social relation structure made up of a set of social entities and their social ties or interactions. Research on social network analysis can be traced back to the contributions of Moreno, who discussed the dynamics of social interactions within groups of people. Nowadays, with the emergence of online social networks and services, real-world social relationships have been extended to the virtual network world. Billions of users use online social websites for making friends, sharing pictures, microblogging, and so on. The social structure hidden in social network data is valuable for social analysis for commercial or academic purposes. For example, the user behavior and interests derived from social data are important for commercial recommendation systems [2, 3]. At the same time, more and more attention is being paid to privacy preservation in the use of social networks and the sharing of social data, since data publishing and exchange increase the risk of disclosure and leakage of the personal information of social network users [4, 5].
Social networks are usually modeled as graphs, in which the vertices represent social entities and the edges represent social links or social ties. The properties of entities, such as age, gender, and SIN, can be represented as attributes of vertices, and the properties of links between entities, such as the tightness of social ties, can be shown as edge labels or weights. Therefore, the natural and simple way to prevent the disclosure of users' personal information is to remove identifying attributes, such as names and SINs, or to replace them with random identifiers. But this simple method cannot prevent the disclosure of sensitive personal information. The earliest privacy event to draw public attention was the publication of the Enron email data set, the Enron Corpus. Although the original purpose was legal investigation, the regularity of email communications among employees within the company, and even the organizational structure of Enron, can be inferred from the email data. Other personal information disclosure events include AOL publishing anonymized user search data for research on search engines, and Netflix publishing user movie rating data for improving movie recommendation systems. None of these data releases was intended to leak users' information, but each resulted in privacy risks.
On the other hand, many privacy-preserving methods have been put forward and examined, including $k$-anonymity-based privacy preservation via edge modification, probabilistic privacy preservation via edge randomization, and privacy preservation via clustering and generalization (see the recent review papers [9–11]). Besides these methods, the differential privacy method, which depends on specific privacy guarantees and aims to make users in released data computationally indistinguishable from most of the other users in that data set, has received more and more attention recently [12, 13].
In this paper, we present a structural attack method to deanonymize social graph data, called $n$-hop neighbor Feature for Node Reidentification ($n$-hop neighFNR). The method relies only on the network structure. It uses the cumulative degree of $n$-hop neighbors as a regional feature and combines it with a simulated annealing-based graph matching method. With the aid of an auxiliary graph, it can reidentify the nodes in anonymized social graphs. Simulations on two data sets, the karate club network and the email network of URV, show that it is feasible for de-anonymizing social graphs, including the simple anonymous graph, the randomized anonymous graph, and the $k$-isomorphism graph.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 describes the definition of the $n$-hop neighbor feature and the node reidentification algorithm, followed by the experimental results on the data sets in Section 4. Finally, we conclude the paper in Section 5.
2. Related Work
In the graph data of social networks, nodes usually correspond to users and edges correspond to the relationships between users. Privacy attacks on the graph data of social networks aim to obtain sensitive information hidden in the network, including identity, friendship, and other personal information.
Backstrom et al. first proposed active and passive attacks on simple anonymous social graphs. Both methods try to identify a target in the released social graph. The difference between them is whether the attackers change the graph data before it is published. In active attacks, the adversary creates a certain number of Sybil nodes and edges linked to the target, embeds these nodes and edges into the graph before the data is published, and then finds these Sybil nodes together with the targeted users in the anonymized network. In passive attacks, attackers try to discover a target using their knowledge of the local structure of the network around the target.
Different privacy attacks depend on different background knowledge. Zhou et al. summarize the background knowledge that can be used in privacy attacks. It includes node degree, node attributes, special links with the target node, neighbors, embedded subgraphs, and other graph properties such as betweenness and closeness centrality. Several studies [17–22] discuss different background knowledge and the corresponding privacy protection methods. One proposes the known degree attack and the corresponding $k$-degree anonymous graph as a countermeasure. Another presents a degree trace attack, which traces the change of a node's degree across evolving graphs to reidentify the target node. Other privacy attacks are based on the structure of 1-hop neighbors or on neighbor subgraphs, and the corresponding privacy-preserving methods are usually structural $k$-anonymity methods, such as the $k$-automorphism, $k$-isomorphism, and $k$-symmetry models. Compared with these $k$-anonymity methods, edge randomization is a general privacy preservation method that is not specific to a particular privacy attack.
De-anonymizing a social network with the aid of an auxiliary graph is a feasible attack that can recognize nodes in large-scale social networks. The auxiliary graph, which can be obtained by crawling, is matched against the anonymized graph to reidentify the targets from the viewpoint of graph structure. The work that proposed this method used it to reidentify a third of the users who have accounts on both Twitter and Flickr with a small error rate. Such an attack usually has two phases: the recognition of a certain number of seed nodes in the anonymous graph, followed by a propagation process that matches the remaining nodes in the auxiliary graph with the targets in the anonymous graph on the basis of the known seeds. Rattigan employs crawled Yahoo! Music data as the auxiliary graph and attempts to recognize the artists in the KDD Cup 2011 data set. Although there is no ground-truth graph to establish the accuracy of this work, it still shows the feasibility of such an attack. More recently, Narayanan et al. used a crawled Flickr graph to deanonymize much of the competition test set of a Kaggle machine learning contest. They use the neighbor similarity between node pairs as a structural feature and combine it with a simulated annealing-based graph matching method to reidentify a small number of nodes with the largest degree. These recognized nodes can be used as the seed nodes in the first phase of the de-anonymizing attack.
In the privacy attacks mentioned above, the degree, the neighbor structure, and the neighbor similarity of a node pair are all metrics used to match or recognize the target nodes in anonymized graphs. In graph mining, Henderson et al. present the concept of recursive features, which combine node-based local features with egonet-based neighborhood features to capture the regional information of an individual node in large graphs. It can be applied in de-anonymization tasks on evolving graphs or partially anonymized graphs. Influenced by these ideas, this paper proposes a structural attack that combines the cumulative degree feature of $n$-hop neighbors with the simulated annealing algorithm to reidentify the nodes in anonymized graphs.
Many computational intelligence algorithms for optimization problems have been proposed, such as those in [27–31]. In this paper, we use the simulated annealing method to match the auxiliary graph with the anonymized graph, although other intelligent algorithms could be used in its place.
3. $n$-Hop Neighbor Feature for Node Reidentification
3.1. $n$-Hop Neighbor Feature
A social network can be modeled as a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of connections. In this paper, undirected graphs are used, although social networks can be directed graphs if the direction of connections is considered. In a graph structure, the node degree is the basic local feature of a node. When considering the relation of a node and its 1-hop neighbors, the related metrics are computed in the range of the egonet. For example, in prior work, the cosine similarity between a pair of nodes $u$ and $v$ is defined as $\mathrm{sim}(u, v) = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}$, where $N(u)$ and $N(v)$ are the neighbor sets of the two nodes, respectively.
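As an illustration of the egonet-range metric mentioned above, the following minimal sketch computes the cosine similarity of two neighbor sets, assuming the standard set-based form (size of the intersection divided by the geometric mean of the set sizes). The function name `cosine_similarity` and the toy adjacency dictionary are hypothetical and chosen only for this example.

```python
from math import sqrt


def cosine_similarity(neighbors_u, neighbors_v):
    """Set-based cosine similarity: |A & B| / sqrt(|A| * |B|)."""
    if not neighbors_u or not neighbors_v:
        return 0.0  # convention assumed here for isolated nodes
    shared = len(neighbors_u & neighbors_v)
    return shared / sqrt(len(neighbors_u) * len(neighbors_v))


# Toy undirected graph stored as node -> set of neighbors (hypothetical data).
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}

sim_23 = cosine_similarity(adj[2], adj[3])  # nodes 2 and 3 share neighbor 1
sim_14 = cosine_similarity(adj[1], adj[4])  # nodes 1 and 4 share no neighbor
```

Returning 0 for an empty neighbor set is a design choice of this sketch; it avoids a division by zero for isolated nodes.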
In this paper, we consider the cumulative degree feature of a node in the range of its $n$-hop neighbors. For a node $v$, its cumulative degree is defined as the sum of the degrees of its $n$-hop neighbors. The $n$-hop neighbor feature is a regional feature and captures a node's properties better than the node degree, since nodes with the same degree can have different cumulative $n$-hop neighbor degrees. It quantifies the connections of a node with other nodes and shows the importance of a node in the $n$-hop range and even in the whole network.
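A minimal sketch of one way such a feature could be computed is given below. It rests on an assumption that should be read carefully: the $n$-hop neighbors of a node are taken to be all nodes reachable within $n$ hops, excluding the node itself, and each is counted once. The exact formula may accumulate degrees differently (for example, recursively over hops), so this is an illustration of the idea rather than the authors' definition. The function name `n_hop_feature` and the toy graph are hypothetical.

```python
from collections import deque


def n_hop_feature(adj, source, n):
    """Sum of the degrees of all nodes within n hops of source (source excluded)."""
    dist = {source: 0}
    queue = deque([source])
    total = 0
    while queue:
        u = queue.popleft()
        if dist[u] == n:
            continue  # do not expand beyond n hops
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                total += len(adj[w])  # degree of the newly reached neighbor
                queue.append(w)
    return total


# Toy graph: path 0-1-2 with two leaves (3 and 4) attached to node 2.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}

# Nodes 0 and 3 both have degree 1 but differ in their regional feature.
f0 = n_hop_feature(adj, 0, 2)
f3 = n_hop_feature(adj, 3, 2)
```

In this toy graph, nodes 0 and 3 are indistinguishable by degree but separated by the feature, while the symmetric leaves 3 and 4 remain tied, which mirrors the partial discrimination described in the text.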
To show the discriminating power of the $n$-hop cumulative feature, we analyze two data sets, the karate club data and the email network of URV, in terms of degree and the $n$-hop cumulative feature; here, $n = 4$. The karate data set has 34 nodes and 156 edges, while the email data set has 1133 nodes and 5451 edges. Figures 1(a) and 1(b) show that the $n$-hop neighbor feature discriminates nodes much better than degree on both data sets. In these figures, the degrees of the nodes are presented in decreasing order, together with the corresponding 4-hop neighbor feature values. In a graph, some nodes may have the same degree, and nodes with the same degree may have the same or different 4-hop neighbor feature values; therefore, we use the number beside each circle to denote how many nodes share the same $n$-hop neighbor feature value. For example, in Figure 1(a), among the 5 nodes with degree 4, 3 nodes have distinct $n$-hop neighbor feature values and only 2 nodes share the same value. Furthermore, among the 11 nodes with degree 2, 5 nodes share the $n$-hop neighbor feature value 6376 and another 2 nodes share the value 5977. These figures show that most nodes with the same degree have different 4-hop neighbor feature values on the two data sets. In particular, Figure 1(b) shows that most of the nodes with large degree in the email network data can be discriminated by the 4-hop neighbor feature value. Although the discrimination becomes worse for some lower degrees, such as 3 or 2, it is still better than that of the degree feature.
Figure 1: Discrimination of $n$-hop neighbor feature; $n = 4$.
3.2. $n$-Hop Neighbor Feature for Node Reidentification
The $n$-hop neighbor feature can capture the regional information of a node in the graph. This paper combines the $n$-hop neighbor feature with the simulated annealing algorithm and proposes the $n$-hop neighbor feature for node reidentification algorithm ($n$-hop neighFNR) to deanonymize social networks with the aid of an auxiliary graph.
Consider two graphs $G_a$ and $G_t$: $G_a$ is the auxiliary graph, in which the identities of nodes are already known, and $G_t$ is the anonymous target graph. $G_a$ and $G_t$ can be thought of as graphs induced from the same graph $G$. Usually the auxiliary graph $G_a$ can be obtained by crawling, while $G_t$ is the anonymous social data released for publishing. The privacy attack can be thought of as a matching process between the nodes of $G_a$ and $G_t$. We use the original data sets as $G_a$ and three kinds of anonymous graphs as $G_t$: the simple anonymous graph, which removes the identification of nodes; the randomized anonymous graph, which is obtained by randomly adding one edge and then deleting another edge, repeated a fixed number of times; and the $k$-isomorphism graph.
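The randomized anonymous graph described above can be sketched as follows. This is a minimal illustration under stated assumptions: edges are undirected, each round first picks an edge that is absent from the graph and then deletes one edge that was present before the addition, and both choices are uniform. The function name `randomize_graph`, its `times` and `seed` parameters, and the toy cycle are hypothetical.

```python
import random


def randomize_graph(nodes, edges, times, seed=0):
    """Repeat `times` rounds of: add one absent edge, delete one existing edge."""
    rng = random.Random(seed)
    nodes = list(nodes)
    edge_set = {frozenset(e) for e in edges}
    max_edges = len(nodes) * (len(nodes) - 1) // 2
    if not edge_set or len(edge_set) >= max_edges:
        raise ValueError("need at least one edge and at least one absent edge")
    for _ in range(times):
        # Pick a node pair that is not yet connected.
        while True:
            new_edge = frozenset(rng.sample(nodes, 2))
            if new_edge not in edge_set:
                break
        # Pick an existing edge to delete (sorted so runs are reproducible).
        old_edge = rng.choice(sorted(edge_set, key=sorted))
        edge_set.remove(old_edge)
        edge_set.add(new_edge)
    return edge_set


# Toy example: perturb a 5-node cycle with three add/delete rounds.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
perturbed = randomize_graph(range(5), cycle, times=3)
```

Because every round adds exactly one edge and removes exactly one, the number of edges is preserved while the structure drifts away from the original.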
We combine the $n$-hop neighbor feature with the simulated annealing method to match the auxiliary graph $G_a$ and the anonymous target graph $G_t$. The Euclidean distance between the two sets of $n$-hop neighbor feature values of the matched node pairs is used to measure the quality of a candidate mapping, so that we can optimize the matching over all possible mappings. The Euclidean distance is defined as $d = \sqrt{\sum_{i} \left( f_a(i) - f_t(i) \right)^2}$, in which $f_a(i)$ and $f_t(i)$ are the cumulative degree values of node $i$'s $n$-hop neighbors in $G_a$ and $G_t$, respectively.
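To make the quality measure concrete, the following minimal sketch evaluates the Euclidean distance of a candidate mapping, assuming the feature values of both graphs have already been computed and stored in dictionaries. The names `mapping_distance`, `feat_aux`, and `feat_anon`, as well as the numbers, are hypothetical.

```python
from math import sqrt


def mapping_distance(mapping, feat_aux, feat_anon):
    """Euclidean distance between the feature values of matched node pairs.

    mapping maps each auxiliary-graph node to one anonymous-graph node.
    """
    return sqrt(sum((feat_aux[a] - feat_anon[t]) ** 2 for a, t in mapping.items()))


# Hypothetical feature values for three nodes in each graph.
feat_aux = {"a": 5, "b": 6, "c": 3}
feat_anon = {0: 5, 1: 3, 2: 6}

good = mapping_distance({"a": 0, "b": 2, "c": 1}, feat_aux, feat_anon)
bad = mapping_distance({"a": 0, "b": 1, "c": 2}, feat_aux, feat_anon)
```

A mapping that pairs nodes with identical feature values yields distance 0, while swapping two of the pairs increases the distance, which is what the optimization exploits.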
Algorithm 1 shows the $n$-hop neighbor feature for node reidentification method. In the algorithm, $\Delta d$ denotes the change in Euclidean distance between candidate matchings, and $T$ denotes the temperature, which is cooled at a fixed rate. The initial value of the temperature depends on the nodes in the graph. The simulated annealing algorithm ends when a threshold is reached. The algorithm also uses a constant, which likewise depends on the nodes.
Algorithm 1: $n$-hop neighbor feature for node reidentification.
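The following is a minimal sketch of how an annealing-based matching of this kind could look. It is not the authors' Algorithm 1. It assumes precomputed feature values, equally sized node sets, a move that swaps the targets of two auxiliary nodes, Metropolis acceptance, and geometric cooling. The function names and all parameter values (`t_start`, `t_end`, `cooling`, `steps_per_temp`) are hypothetical placeholders, not the settings used in the paper.

```python
import math
import random


def mapping_cost(mapping, feat_aux, feat_anon):
    """Euclidean distance between feature values of matched node pairs."""
    return math.sqrt(sum((feat_aux[a] - feat_anon[t]) ** 2 for a, t in mapping.items()))


def anneal_matching(feat_aux, feat_anon, t_start=10.0, t_end=1e-3,
                    cooling=0.95, steps_per_temp=50, seed=0):
    """Search for a low-cost one-to-one mapping by simulated annealing."""
    rng = random.Random(seed)
    aux_nodes = list(feat_aux)
    anon_nodes = list(feat_anon)
    rng.shuffle(anon_nodes)                      # random initial mapping
    mapping = dict(zip(aux_nodes, anon_nodes))
    cost = mapping_cost(mapping, feat_aux, feat_anon)
    best, best_cost = dict(mapping), cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(steps_per_temp):
            a, b = rng.sample(aux_nodes, 2)
            mapping[a], mapping[b] = mapping[b], mapping[a]      # propose a swap
            new_cost = mapping_cost(mapping, feat_aux, feat_anon)
            delta = new_cost - cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temperature).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(mapping), cost
            else:
                mapping[a], mapping[b] = mapping[b], mapping[a]  # undo the swap
        temperature *= cooling                   # geometric cooling
    return best, best_cost


# Hypothetical feature values; the anonymous graph uses integer labels.
feat_aux = {"a": 5, "b": 6, "c": 3, "d": 10}
feat_anon = {0: 10, 1: 3, 2: 6, 3: 5}
best_mapping, best_cost = anneal_matching(feat_aux, feat_anon)
```

Tracking the best mapping seen so far, rather than returning the final state, is a design choice of this sketch that guards against a late uphill move.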
4. Experiment Results
To show the feasibility and effectiveness of the $n$-hop neighbor feature, we compare it with the neighbor similarity feature used in prior work, which we call neighSNR, in deanonymizing simply anonymous graphs, randomized graphs, and $k$-isomorphism graphs. In the e-mail data set, we select the 30 nodes with the largest degree as the targets. The randomized graphs are obtained by adding or deleting edges randomly and repeatedly. The perturbation degree is defined as the percentage of edges that are added or deleted, and we use perturbation degrees of 10%, 20%, and 50%. The $k$-isomorphism graph is obtained by adding or deleting edges to satisfy the definition of $k$-isomorphism: a graph $G$ is $k$-isomorphic if $G$ consists of $k$ disjoint subgraphs $g_1, \ldots, g_k$, where $g_i$ and $g_j$ are isomorphic for $i \neq j$. In the experiment, we use a 2-isomorphic graph as the anonymous target graph.
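The $k$-isomorphism definition can be checked directly on small examples. The sketch below assumes that the partition of the nodes into $k$ groups is already given, and it tests isomorphism by brute force over all relabelings, which is only practical for very small subgraphs. The function names and the toy graphs are hypothetical.

```python
from itertools import permutations


def induced_edges(edges, group):
    """Edges with both endpoints inside the given node group."""
    return {frozenset(e) for e in edges if set(e) <= group}


def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test; feasible only for tiny graphs."""
    nodes_a, nodes_b = sorted(nodes_a), sorted(nodes_b)
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    for perm in permutations(nodes_b):
        relabel = dict(zip(nodes_a, perm))
        if {frozenset(relabel[x] for x in e) for e in edges_a} == edges_b:
            return True
    return False


def is_k_isomorphic(edges, groups):
    """True if the groups are disjoint, no edge crosses groups,
    and all induced subgraphs are pairwise isomorphic."""
    groups = [set(g) for g in groups]
    if sum(len(g) for g in groups) != len(set().union(*groups)):
        return False  # groups overlap
    edge_sets = [induced_edges(edges, g) for g in groups]
    if sum(len(es) for es in edge_sets) != len({frozenset(e) for e in edges}):
        return False  # some edge connects two different groups
    # Isomorphism is transitive, so comparing against the first group suffices.
    return all(are_isomorphic(groups[0], edge_sets[0], g, es)
               for g, es in zip(groups[1:], edge_sets[1:]))


two_triangles = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
triangle_and_path = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6)]
groups = [{1, 2, 3}, {4, 5, 6}]
```

With the groups fixed, two disjoint triangles form a 2-isomorphic graph, whereas a triangle next to a path does not, and neither does any graph with an edge running between the groups.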
Figures 2(a) and 2(b) show the recognition results on the simple anonymous graph. Our method, $n$-hop neighFNR, performs much better than neighSNR. In the best case, it recognizes 32 out of 34 nodes for the karate data set and all 30 of the largest-degree nodes for the email network data.
Figure 2: The results on simple anonymous graph for karate data set and e-mail network data.
Figures 3(a) and 3(b) show the number of reidentified nodes in the randomized anonymous graph at different perturbation degrees for the karate data and the email network data. Although the number of reidentified nodes decreases as the perturbation degree increases, our algorithm $n$-hop neighFNR outperforms neighSNR in general. For the karate data set, 12 node pairs are matched in the best case when our method is employed and the perturbation degree is 10%. When the perturbation degree increases to 50%, 8 node pairs are reidentified using our algorithm. For the neighSNR method, 4 node pairs are recognized in the best case when the perturbation degree is 50%. For the email network data, 10 node pairs are matched when the perturbation degree is 10%. When the perturbation degree increases to 50%, both $n$-hop neighFNR and neighSNR recognize 4 node pairs in the best case. In the average and worst cases, $n$-hop neighFNR also outperforms neighSNR on both data sets.
Figure 3: The results on randomized anonymous graph for karate data set and e-mail network data.
Figures 4(a) and 4(b) show the de-anonymizing results on $k$-isomorphism graphs, in which $k = 2$. In the graph of the karate data, 16 edges are added and 36 edges are deleted to generate the 2-isomorphism anonymous graph. In the graph of the email data, 2428 edges are added and 2454 edges are deleted. Neither neighSNR nor $n$-hop neighFNR works well on the 2-isomorphism graphs, since these two anonymous graphs are obtained by perturbing about 50% of the edges of the corresponding original graphs, and the $k$-isomorphism method enforces $k$-security to protect the nodes and links in the anonymous graph.
Figure 4: The results on 2-isomorphism anonymous graph for karate data set and e-mail network data.
5. Conclusion
This paper presents the $n$-hop neighbor feature to capture node characteristics in a graph. It uses the sum of the degrees of the $n$-hop neighbors of a node as a regional feature. Combined with the simulated annealing algorithm, it can be used as a structural attack to de-anonymize social networks. The experiments on two data sets show that it is very effective for de-anonymizing the simple anonymous graph and feasible for the randomized graph. The research provides insights into the privacy-preserving problem of social networks and the design of privacy-preserving algorithms. In future work, we will evaluate the effectiveness of our algorithm on large-scale real networks.
This work was supported in part by the Special Fund for Fast Sharing of Science Paper in Net Era by CSTD (FSSP 2012) and the Technical Development Plan of Jilin Province of China (No. 201101003).