
1 Introduction

In a social network, each user has a set of labels, called user attributes, that describe their characteristics. For certain types of attributes, however, the structure is not flat but hierarchical. Most existing methods [4, 5] focus on single-level attribute inference, which causes several problems for hierarchical structures, as shown in Fig. 1: even if the same method is applied to every level separately, the attributes inferred for the same user at different levels may conflict, attributes at the same level may be indeterminate, and the results for a certain layer may be missing.

Fig. 1. Problems of labeling in real social networks

In this paper, we propose a multi-level inference model named IWM to solve the problems mentioned above. The model infers hierarchical attributes for unknown users by collecting attributes from nearby users under a maximum entropy random walk. Meanwhile, we propose a correction method based on the predefined hierarchy of attributes to revise the results. Finally, we conduct experiments on real datasets to validate the effectiveness of our method.

The rest of the paper is organized as follows. Section 2 defines the problem. Section 3 proposes the multilevel inference model. The algorithm is given in Sect. 4. The experimental results and analysis are presented in Sect. 5. Related work is introduced in Sect. 6. Finally, we conclude the paper in Sect. 7.

2 Problem Definition

2.1 Semantic Tree

The semantic tree T is a predefined structure that describes the semantic hierarchy among user attributes. We use \(T_g\) to denote the set of user attributes at the gth layer of T.

2.2 Labeled Graph

A labeled graph is a simple undirected graph, denoted as \(G=(V,E,T,L)\), where V is the set of vertices and E is the set of edges. T is the semantic tree of the attributes in G. L is a function mapping V to the Cartesian product of the attribute layers of T, defined as \(L:V \rightarrow T_1 \times T_2 \times \cdots \times T_m\), where m is the depth of T.

Problem Statement: Given a labeled graph \(G=(V,E,T,L)\) and a labeled vertex set \(V_s \subset V\), where \(V_s\) is the set of vertices with complete attributes, every vertex \(v_s \in V_s\) satisfies \(L(v_s) = \{l_1,l_2,\cdots ,l_m\}\) with \(l_1 \in T_1, l_2 \in T_2, \cdots , l_m \in T_m \). The input of the problem is \(L(v_s)\) for every vertex \(v_s \in V_s\), and the output is \(L(v_u)\) for every vertex \(v_u \in V_u\), where \(V_u=V-V_s\).
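To make these definitions concrete, the following is a minimal sketch of how the semantic tree and the labeled graph could be represented in Python (the language of our implementation); the class and field names are illustrative assumptions, not part of the formal model.

```python
from collections import defaultdict

class SemanticTree:
    """Predefined attribute hierarchy T (Sect. 2.1)."""
    def __init__(self):
        self.parent = {}                    # attribute -> parent attribute
        self.children = defaultdict(list)   # attribute -> child attributes
        self.layers = defaultdict(list)     # g -> attribute set T_g

    def add(self, attr, parent=None, g=1):
        self.parent[attr] = parent
        if parent is not None:
            self.children[parent].append(attr)
        self.layers[g].append(attr)

class LabeledGraph:
    """Simple undirected labeled graph G = (V, E, T, L) (Sect. 2.2)."""
    def __init__(self, tree):
        self.tree = tree
        self.neighbors = defaultdict(set)   # adjacency lists encoding E
        # L: vertex -> {attribute l_x: weight w_x(v)}
        self.L = defaultdict(dict)

    def add_edge(self, u, v):
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)
```

Under this layout, a fully labeled vertex \(v_s \in V_s\) would carry weight 1.0 on its true attribute at each layer and 0.0 elsewhere, which is the input state the inference starts from.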

3 Attribute Inference Model

Our attribute inference model consists of two parts. The first part is the information propagation model: based on maximum entropy theory and one-step random walks, vertices in \(V_s\) spread their own attributes to other vertices layer by layer. The second part is a correction model based on the semantic tree, which realizes mutual correction between the attributes of different layers. Both models are described in detail below.

3.1 Information Propagation Model

The information propagation model is an extension of the model proposed in [7]. The main idea is that the higher the entropy of a vertex, the greater the uncertainty about its own user attributes, and thus the more information it should collect. The attributes at the gth layer of a vertex \(v_j\) can be represented by \(L_g(v_j)=\{(l_x,w_x(v_j)) \mid l_x\in T_g\}\), where \(w_x(v_j)\) is the weight of attribute \(l_x\) at \(v_j\). The entropy \(H_g(v_j)\) of \(v_j\)'s gth layer can then be calculated as below.

$$\begin{aligned} H_g(v_j)=-\sum _{l_x\in T_g}w_x(v_j) \times \ln w_x(v_j) \end{aligned}$$
(1)

If \(v_i\) is a neighbor of \(v_j\), then the transition probability \(P_g(v_i,v_j)\) from \(v_i\) to \(v_j\) at gth layer is computed as follows.

$$\begin{aligned} P_g(v_i,v_j)=\frac{H_g(v_j)}{\sum _{v_j \in N(v_i)}H_g (v_j)} \end{aligned}$$
(2)

where \(N(v_i)\) is the set of neighbors of \(v_i\).

Next, we use the following equation to normalize the attribute probabilities collected from different vertices.

$$\begin{aligned} w_x(v_j)=\frac{\sum _{v_i\in N(v_j)}P_g(v_i,v_j)\times w_x(v_i)}{\sum _{l_y\in T_g}\sum _{v_i\in N(v_j)}P_g(v_i,v_j)\times w_y (v_i)} \end{aligned}$$
(3)

\(L_g(v_j)\) is then updated through the new \(w_x(v_j)\). In this way, attribute information spreads hierarchically through the graph.
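As an illustration, here is a minimal sketch of one propagation round over a single layer, assuming the `LabeledGraph` layout sketched in Sect. 2; the helper names, the guards against empty neighborhoods, and the synchronous update after a full sweep are our own reading, not a definitive implementation.

```python
import math

def entropy(weights, layer_attrs):
    """Eq. (1): entropy of a vertex's attribute distribution at one layer."""
    return -sum(w * math.log(w)
                for w in (weights.get(x, 0.0) for x in layer_attrs) if w > 0)

def propagate_layer(graph, layer_attrs, V_u):
    """One round of Eqs. (2)-(3) for every unknown vertex at one layer."""
    H = {v: entropy(graph.L[v], layer_attrs) for v in graph.neighbors}
    updated = {}
    for v_j in V_u:
        collected = {}
        for v_i in graph.neighbors[v_j]:
            # Eq. (2): P_g(v_i, v_j) = H_g(v_j) / sum of H_g over v_i's neighbors
            denom = sum(H.get(u, 0.0) for u in graph.neighbors[v_i]) or 1.0
            p = H.get(v_j, 0.0) / denom
            for x in layer_attrs:
                collected[x] = collected.get(x, 0.0) + p * graph.L[v_i].get(x, 0.0)
        # Eq. (3): normalize the collected mass over all attributes of the layer
        total = sum(collected.values()) or 1.0
        updated[v_j] = {x: s / total for x, s in collected.items()}
    for v_j, w in updated.items():          # synchronous write-back after the sweep
        graph.L[v_j].update(w)
```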

3.2 Attribute Correction Model

The formal definitions of the concepts involved in this section are given below.

Definition 1

Define the following relationships in the semantic tree:

(1) If \(x_2\) is a child node of \(x_1\), then \(x_1\) and \(x_2\) have a relationship called \(Child(x_1,x_2)\).

(2) \(x_1\) and \(x_2\) have a descendant relationship called \(Descendant(x_1,x_2)\) if \(Child(x_1,x_2)\vee \exists x_3 (Child(x_1,x_3)\wedge Descendant(x_3,x_2))\).

(3) If \(x_2\) is a brother node of \(x_1\), then \(x_1\) and \(x_2\) have a relationship called \(Brother(x_1,x_2)\).

Definition 2

(Descendant node set). For a node \(x_1\), its descendant node set is defined as \(DesSet(x_1)=\{x|Descendant(x_1,x)\}\).

Definition 3

(Brother node set). For a node \(x_1\), its brother node set is defined as \(BroSet(x_1)=\{x|Brother(x_1,x)\}\).

For an attribute \(l_x\) in a middle layer of the semantic tree, its existence depends on both its parent Parent(x) and its descendant set DesSet(x), so \(w_x(v_j)\) can be corrected by Eq. (4).

$$\begin{aligned} w_x(v_j)=w_{Parent(x)}(v_j)\times \frac{(1-\alpha )\times w_x(v_j)+\alpha \times \sum _{y\in DesSet(x)}w_y(v_j)}{\sum _{z}(1-\alpha )\times w_z(v_j)+\alpha \times \sum _{y\in DesSet(z)}w_y(v_j)} \end{aligned}$$
(4)

where \(z\in BroSet(x)\) and \(\alpha \) represents the correction strength. When \(\alpha \) is large, the result leans toward the hierarchy of the semantic tree; otherwise, it leans toward the information collected by propagation.

The remaining case is that the attributes at the last layer (leaf nodes) have no children, so they are corrected as follows.

$$\begin{aligned} w_x(v_j)=w_{Parent(x)}(v_j)\times \frac{w_x(v_j)}{\sum _{z\in BroSet(x)}w_z(v_j)} \end{aligned}$$
(5)
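The following is a sketch of this cross-level correction under the same data layout. Following Definition 3, \(BroSet(x)\) here excludes x itself; the guard for an empty denominator and the treatment of top-layer attributes (parent weight 1.0) are our own assumptions.

```python
def des_set(tree, x):
    """DesSet(x) from Definition 2: all descendants of x in the semantic tree."""
    out, stack = [], list(tree.children.get(x, []))
    while stack:
        y = stack.pop()
        out.append(y)
        stack.extend(tree.children.get(y, []))
    return out

def correct(graph, v_j, x, alpha):
    """Eqs. (4)-(5): revise w_x(v_j) with parent, descendant, and brother weights."""
    tree, w = graph.tree, graph.L[v_j]
    parent = tree.parent.get(x)
    w_parent = w.get(parent, 1.0) if parent is not None else 1.0
    bro_set = [z for z in tree.children.get(parent, []) if z != x]

    if tree.children.get(x):                       # middle layer: Eq. (4)
        def blend(a):  # (1 - alpha) * own weight + alpha * descendant mass
            return ((1 - alpha) * w.get(a, 0.0)
                    + alpha * sum(w.get(y, 0.0) for y in des_set(tree, a)))
        denom = sum(blend(z) for z in bro_set) or 1.0
        return w_parent * blend(x) / denom
    else:                                          # leaf layer: Eq. (5)
        denom = sum(w.get(z, 0.0) for z in bro_set) or 1.0
        return w_parent * w.get(x, 0.0) / denom
```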

4 Attribute Inference Algorithm

4.1 Algorithm Description

The detailed steps of the algorithm are shown in Algorithm 1. First, we use Eq. (1) to calculate the entropy \(H_g(v_u)\) for all \(v_u\in V_u\) layer by layer (lines 1 to 3). Lines 4 to 9 perform the hierarchical inference. After the information of all layers has been collected, the correction is performed by Eq. (4) or Eq. (5) (lines 10 to 11).

Algorithm 1

The algorithm terminates when convergence is reached. The convergence condition is given by the following equation.

$$\begin{aligned} \sum _{v_u \in V_u} \sum _{l_x \in T} |diff(w_x(v_u))| \le |V_u| \times |T| \times \sigma \end{aligned}$$
(6)

where \(diff(w_x (v_u))\) is the change in \(w_x(v_u)\) after one execution of the inference algorithm, and \(\sigma \) is a threshold that controls the number of iterations.
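Putting the pieces together, the following is a compact sketch paraphrasing the outer loop of Algorithm 1 with the stopping rule of Eq. (6); `max_iter` is our own safety cap, not part of the algorithm.

```python
def iwm_infer(graph, V_u, alpha=0.5, sigma=1e-4, max_iter=100):
    """Propagate layer by layer, correct across layers, stop by Eq. (6)."""
    attrs = [x for g in sorted(graph.tree.layers) for x in graph.tree.layers[g]]
    budget = len(V_u) * len(attrs) * sigma          # |V_u| * |T| * sigma
    for _ in range(max_iter):
        old = {v: dict(graph.L[v]) for v in V_u}
        for g in sorted(graph.tree.layers):         # hierarchical inference (lines 4-9)
            propagate_layer(graph, graph.tree.layers[g], V_u)
        for v in V_u:                               # cross-level correction (lines 10-11)
            graph.L[v] = {x: correct(graph, v, x, alpha) for x in attrs}
        # Eq. (6): total absolute change over all unknown vertices and attributes
        diff = sum(abs(graph.L[v].get(x, 0.0) - old[v].get(x, 0.0))
                   for v in V_u for x in attrs)
        if diff <= budget:
            break
    return {v: graph.L[v] for v in V_u}
```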

4.2 Time Complexity

We assume that the labeled graph G has n vertices and p attributes, and that the semantic tree has m layers. The time complexity of information propagation is \(O(m|V_u|+mnd+pnd)=O(mnd+pnd)\), where d is the average degree of the vertices in G. After that, every attribute of every user must be corrected, which takes O(pn). To sum up, the total time complexity of our algorithm for one iteration is \(O(mnd+pnd)\), since the O(pn) correction cost is dominated by the O(pnd) propagation cost.

5 Experiment

The experiments are performed on a Windows 10 PC with an Intel Core i5 CPU and 8 GB of memory. Our algorithms are implemented in Python 3.7. The default parameter values in the experiments are \(\alpha =0.5\) and \(\sigma =0.0001\).

5.1 Experimental Settings

Dataset. We study the performance of all methods on the DBLP dataset. DBLP is a computer science literature database; each author is a vertex, and the author's research fields serve as the attributes to be inferred. We extract 63 representative attributes and predefine a 4-layer semantic tree.

Baselines and Evaluation Metrics. We compare our method IWM with three classic attribute inference baselines which are SVM, Community Detection (CD) [6] and Traditional Random Walk (TRW) [7].

We use five commonly used metrics to make a comprehensive evaluation of the inference results. These metrics are calculated as shown below.

$$\begin{aligned} Precision = \frac{\sum _{l \in T}|\{v_u|v_u\in V_u\wedge l\in Predict(v_u)\cap Real(v_u)\}|}{\sum _{l \in T}|\{v_u|v_u\in V_u\wedge l\in Predict(v_u)\}|}\end{aligned}$$
(7)
$$\begin{aligned} Recall = \frac{\sum _{l \in T}|\{v_u|v_u\in V_u\wedge l\in Predict(v_u)\cap Real(v_u)\}|}{\sum _{l \in T}|\{v_u|v_u\in V_u\wedge l\in Real(v_u)\}|} \end{aligned}$$
(8)
$$\begin{aligned} F_1 = \frac{2\times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(9)
$$\begin{aligned} Accuracy = \frac{1}{|V_u|} \times |\{v_u|v_u\in V_u \wedge Predict(v_u)=Real(v_u)\}|\end{aligned}$$
(10)
$$\begin{aligned} Jaccard = \frac{1}{|V_u|} \times \sum _{v_u \in V_u} \frac{|Predict(v_u) \cap Real(v_u)|}{|Predict(v_u) \cup Real(v_u)|} \end{aligned}$$
(11)

where \(Predict(v_u)\) and \(Real(v_u)\) respectively represent the inferred attribute set and the real attribute set of \(v_u\). For all metrics, larger values mean better performance.
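For completeness, the following is a direct transcription of Eqs. (7)-(11), assuming `predict` and `real` are hypothetical dicts mapping each unknown vertex to its attribute set.

```python
def evaluate(predict, real):
    """Micro precision/recall/F1 (Eqs. 7-9), exact-match accuracy (Eq. 10),
    and mean Jaccard similarity (Eq. 11) over the unknown vertices."""
    tp = sum(len(predict[v] & real[v]) for v in real)   # correctly inferred labels
    p_den = sum(len(predict[v]) for v in real)          # all inferred labels
    r_den = sum(len(real[v]) for v in real)             # all true labels
    precision = tp / p_den if p_den else 0.0
    recall = tp / r_den if r_den else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(predict[v] == real[v] for v in real) / len(real)
    jaccard = (sum(len(predict[v] & real[v]) / (len(predict[v] | real[v]) or 1)
                   for v in real) / len(real))
    return precision, recall, f1, accuracy, jaccard
```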

5.2 Results and Analysis

Exp1-Impact of Vertex Size. We conduct the first experiment on coauthor networks with 5,000, 10,000, 20,000, and 40,000 vertices. The proportion of unknown vertices is \(30\%\) (Table 1).

Table 1. Inference performance on different vertex size.

Our method clearly shows the best performance on the different evaluation metrics. For example, in the network with 20,000 vertices, our model improves over the strongest baseline by \(22.2\%\), \(35.1\%\), \(16.3\%\), and \(6.3\%\) on Precision, F1, Accuracy, and the Jaccard index, respectively. In terms of Recall, our method has no obvious advantage over TRW.

Exp2-Impact of the Proportion of Unknown Vertices. In Exp2, the vertex scale of the network is 20,000, and we set the proportion of unlabeled vertices to \(10\%\), \(20\%\), \(30\%\), and \(50\%\), respectively.

Fig. 2. Inference performance on different proportions of unknown vertices

The results show that, as the proportion of unknown vertices increases, our method degrades much more slowly than the other methods. Notably, the five evaluation metrics of our method are \(71.77\%\), \(72.17\%\), \(71.96\%\), \(64.21\%\), and \(65.43\%\) even when \(50\%\) of the vertices lack attributes, which shows its great value in practical applications.

Exp3-Real Case Study. In Table 2 we present partial results of the experiment, which give a clear comparison between our method and TRW. We use these examples to demonstrate the effectiveness of our method.

Table 2. Comparison of inference results by TRW and IWM.

For Chris Stolte, IWM complements the missing information that TRW cannot infer. For Marcel Kyas, our method corrects the erroneous information on Layer 3 and obtains the correct result. For William Deitrick, TRW causes an indeterminacy problem on Layer 2, while IWM selects the more relevant attributes. However, for V. Dhanalakshmi, whose neighborhood has a special structure, most of the collected information is interference, so even IWM cannot make the correct inference.

6 Related Work

There has been increasing interest in the inference of single-level user attributes over the last several years.

First, [1, 11] are based on resource content and utilize the user's text content for inference. [3] constructs a social-behavior-attribute network and designs a vote distribution algorithm to perform inference. There are also methods based on the analysis of graph structure, such as Local Community Detection [6] and Label Propagation [12]. [10] discovers the correlation between item recommendation and attribute inference and uses an Adaptive Graph Convolutional Network to join these two tasks. However, none of these methods explores the relationships in the attribute hierarchy, which greatly reduces their effectiveness on our multilevel problem.

Another approach builds classifiers and treats the inference problem as a multilevel classification problem. [2] trains a binary classifier for each attribute. [8] trains a multi-class classifier for each parent node in the hierarchy. [9] trains a classifier for each layer of the hierarchical structure and combines it with [8] to resolve inconsistencies. However, classifier-based approaches place high requirements on data quality, which makes the construction of the classifiers complicated, and the amount of computation for training is huge.

7 Conclusion

In this paper, we study the multilevel user attribute inference problem. We first define the problem and propose the concepts of the semantic tree and the labeled graph. We then present a new method to solve the problem: an information propagation model that collects attributes for preliminary inference, and an attribute correction model that conducts cross-level correction. Experimental results on real-world datasets demonstrate the superior performance of our new method. In future work, we will extend our method to multi-category attributes and further optimize the algorithm to reduce its running time.