1 Introduction

Mobile applications (i.e., apps) are extensively embedded with paid advertisements (i.e., ads) as a means of product promotion. It has been reported that over 57% of all apps in Google Play contain ad libraries, and this percentage reaches two-thirds among popular apps [24]. Among these ads, app-promotion ads are widely used by Android app developers to promote other mobile apps. Such ads play a crucial role in helping users discover new apps: research shows that 33% of users discover new apps through ads in other apps [19]. However, concerns arise regarding the trustworthiness of the apps promoted through these ads, given the competitive nature of the industry and the privacy risks posed by the underlying web technologies [13, 18].

Fig. 1. App promotion chain examples in the app-promoting ecosystem.

To prevent users from being steered toward malicious apps, researchers have started to mine promotion patterns in the app-promoting ecosystem in order to enable early intervention against suspicious app installations. Some previous studies focus on analyzing the behaviors of ad libraries within the app promotion ecosystem [8, 12], while others analyze app promotion behaviors based solely on how the apps are presented (web view or image view) [6, 11]. These studies therefore pay little attention to app propagation, i.e., how large numbers of individual apps exploit the app promotion ecosystem, and have limited ability to capture an app's interaction patterns with other apps in the promotion network. For instance, Fig. 1(a) demonstrates an app promotion chain in which a popular benign app, "Passport Photo Maker", promotes a greyware app, "Photo Editor", which in turn promotes the malware "Flood-It!", a strategy game capable of scanning the local network and stealing sensitive phone information. Promotion chains among apps, as exemplified above, thus cannot be detected and exploited by previous studies of app promotion behavior.

Furthermore, prior studies lack a comprehensive understanding of the app-promoting ecosystem, which involves multiple heterogeneous actors beyond apps, such as app markets, security vendors, and developers. These actors collectively contribute to the promotion of specific apps. For example, Fig. 1(b) provides a promotion chain path that could explain why an online messaging app, "Polish English Translation", promotes "CallApp": "Polish English Translation" shares a developer with another translation app, "Thai Chinese Translation", which has been observed to promote "CallApp". Hence, a more holistic approach that learns the intrinsic connections among these various entities is necessary to deeply understand the complexities of the app-promoting ecosystem, as well as its implications for society and online commerce. Such an in-depth understanding has the potential to enable various positive extensions, such as improving trust in app recommender systems and detecting malicious apps.

A natural way to holistically exploit the complex connections among the heterogeneous actors is to model the collected app-promoting data as a heterogeneous graph and directly analyze the graph for app promotion pattern mining. Current graph representation learning methods [7, 9, 14, 26, 33, 36, 37] have shown their effectiveness in various domains such as recommender systems [16], node/graph classification [5, 17, 22, 32], knowledge engineering [10, 20, 27, 28, 34], and graph completion [15, 29]. They are designed to capture high-order connectivity relationships between multiple entities. However, a major obstacle to exploiting the app-promoting data with heterogeneous actors is data scarcity. Around 15% of our collected app-promoting data from Google Play contains missing entries such as app developers and app categories. Moreover, data scarcity is one of the major challenges in training deep and complex models with strong performance, including graph learning methods [35]. Therefore, addressing the data scarcity problem is key to advancing studies in app promotion pattern mining.

To address the above limitations, we first model the app-promoting data as a relational heterogeneous graph. Specifically, each heterogeneous actor (e.g., an app developer, a visited URL) is abstracted into an entity in the graph, and each interaction between actors (e.g., a developer develops an app, an app belongs to a category) is abstracted into a relation. We then center on the heterogeneous graph completion task: given the observed graph and a query containing an entity and a relation type (e.g., the entity developer\(_1\) and the relation type developer-develop-app), we aim to predict unobserved true positive entities that answer the query (e.g., the apps developed by the developer in the query).

Nevertheless, learning to complete a graph collected from the wild in our focused app promotion network is non-trivial. Existing methods for graph completion are either too simplistic to model highly complex networks with scarce relations and entities [1, 21, 30], or heavily rely on rich semantic information to train a heavy model with massive parameters [15, 25, 31], which conflicts with the data scarcity of app-promoting data collected from the wild. To address these limitations, we propose a framework that leverages a pre-trained language model (e.g., BERT) to model the complex relations, and incorporates two different sets of symbolic prompts to compensate for the lack of semantic information closely related to the query. We first derive symbolic prompts from existing embedding-based methods such as DistMult [30] to provide contextual information from the geometric embeddings, and then generate a second set of symbolic prompts based on the correlation between the queried relation and metapaths. The two sets of symbolic prompts are concatenated with the query tokens to construct the input tokens for our framework. Our contributions are:

  • We propose a novel framework that addresses the challenge of modeling complex connectivity patterns in the app promotion graph by leveraging the pre-trained language model. Our approach additionally incorporates two sets of symbolic prompts for further learning guidance.

  • We collect a real-world dataset from the app promotion ecosystem and demonstrate the effectiveness of our approach through extensive experiments. The results show that our approach outperforms existing techniques in terms of both accuracy and generality.

  • We provide a deeper understanding of the app promotion ecosystem, its complexities, and its implications for societal trust. Our work sheds light on the potential applications of heterogeneous graph completion methods and pre-trained language models in detecting malicious app promotions.

2 Background and Related Work

2.1 Definitions

Definition 1

(Heterogeneous Graph). A heterogeneous graph (HG) \(\mathcal {G}=(\mathcal {V}, \mathcal {E}, \mathcal {X})\) consists of an entity set \(\mathcal {V}\), a relation set \(\mathcal {E}\), and optional entity and relation features \(\mathcal {X} = (\mathcal {X}_{\mathcal {V}}, \mathcal {X}_{\mathcal {E}})\). The types of entities and relations are mapped through the type mapping functions \(\phi : \mathcal {V} \rightarrow \mathcal {A}\) and \(\psi : \mathcal {E}\rightarrow \mathcal {R}\), where \(\mathcal {A}\) and \(\mathcal {R}\) denote the entity and relation type sets respectively. Each relation is directional and is represented as a triple \((h, r, t) \in \mathcal {E}\), where \(h, t \in \mathcal {V}, r \in \mathcal {R}\). A heterogeneous graph satisfies the constraint \(|\mathcal {A}|+|\mathcal {R}|>2\).

Definition 2

(Metapath). In a heterogeneous graph, a metapath is a predefined sequence of entity types and relation types that captures the semantic relationships between entities. Formally, a metapath \(\mathcal {P}\) is denoted as \(e_1 \xrightarrow {r_1} e_2 \xrightarrow {r_2} ... e_L \xrightarrow {r_{L}} e_{L+1}\), where \(r_i \in \mathcal {R}\), \(e_i \in \mathcal {A}\), \(r = r_1 \cdot r_2 \cdot ... \cdot r_{L}\) is the composite relation between entity types \(e_1\) and \(e_{L+1}\), and L is the length of the metapath.

Definition 3

(Heterogeneous Graph Completion). For a query \((h, r)\) where \(h \in \mathcal {V}\) and \(r \in \mathcal {R}\), the heterogeneous graph completion (HGC) task refers to discovering answers \(\mathcal {T} \subset \mathcal {V}\) such that for all \(t \in \mathcal {T}\), \((h, r, t) \in \mathcal {E}\) in reality.
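To make these definitions concrete, the minimal sketch below (ours, with illustrative entity and relation names) stores a toy APHG as directed typed triples and resolves a completion query by lookup; a trained model must instead predict the answer set from an incomplete observed graph.

```python
from collections import defaultdict

# Toy APHG stored as directed triples (h, r, t); all names are illustrative.
triples = {
    ("developer_1", "developer-develop-app", "app_1"),
    ("developer_1", "developer-develop-app", "app_2"),
    ("app_1", "app-belong-category", "tools"),
    ("app_1", "app-access-URL", "url_1"),
}

# Index the triples by (head, relation) so a query (h, r) maps to its answers.
index = defaultdict(set)
for h, r, t in triples:
    index[(h, r)].add(t)

def complete(h, r):
    """Ground-truth answer set T for the query (h, r) (Definition 3).
    In practice the graph is incomplete, so a model must *predict* T."""
    return index[(h, r)]

print(complete("developer_1", "developer-develop-app"))  # {'app_1', 'app_2'}
```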

Table 1. Entity types and their counts in the APHG.

2.2 Collected App Promotion Dataset

To exploit the complex relational patterns in the app promotion ecosystem, we collect data from Google Play and construct the app promotion heterogeneous graph (APHG) for the completion task.

Entities. The APHG encapsulates entities derived from the following attributes: application, developer, application category, manifest, VirusTotal engine, digital signature, and URL. Given the distinct promotional behaviors of benign, grey, and malicious apps, we further divide the application entity into three discrete classes (benign, grey, and malicious), extending the total number of entity types to nine. We provide the statistics of the entities in the APHG in Table 1.

Fig. 2. The schema of the APHG.

Relations. The relations of interest are shown in Fig. 2. In total, we consider twenty-nine classes of relations after further categorizing applications into benign, greyware, and malware. Note that all relations are directional. Although additional information could be gathered, neither the entities nor the relations are associated with any features; our APHG is therefore denoted as \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\). We provide detailed descriptions of the relations in the supplementary materials.

2.3 APHG Completion Task

The APHG completion task refers to answering a query that contains an entity and a relation type based on the observed APHG. For example, for the query \(q =\) (developer\(_1\), developer-develop-app), the model is expected to output the apps that developer\(_1\) likely develops. We note that each queried relation is associated with only one of the constructed directional relations, excluding the reverse relations. For example, we query (developer\(_1\), developer-develop-app) rather than (app\(_1\), app-developed-developer).

2.4 Related Work

Embedding-Based Methods. Knowledge graph embedding (KGE) methods employ geometric operations in the vector space to capture the underlying semantics of the graph, such as translation [1], bilinear transformation [30], and rotation [21]. Other methods design embeddings from different perspectives. For instance, ComplEx [23] leverages complex-valued embeddings to model asymmetric relations between entities. ConvE [3] applies multi-layer convolutional networks to a 2D grid abstracted from the knowledge graph to encode local dependencies. Although conceptually straightforward, these methods encode each entity's and relation's information in a single vector. The inherent simplicity of embedding-based methods can present challenges in scenarios involving complex reasoning and information scarcity.

Fig. 3. Overall framework: the metapath-based and embedding-based symbolic prompts are pre-computed and concatenated with the queried entity and relation to construct the symbolic inputs.

Transformer-Based Methods. To account for the relatively weak expressive power of embedding-based methods, several recent works utilize transformers for enhanced contextual information encoding. Some works take the triple as input and perform tasks such as triple classification and link prediction. For example, KG-BERT [31] treats triples as textual sequences to inject semantic information and exploits pre-trained BERT to learn context-aware embeddings. PKGC [15] leverages the entities' semantic information and converts them into natural prompt sentences to address the closed-world assumption (CWA) and the incoherence issue. However, these methods require scoring all possible triples at inference time, introducing unnecessary computational overhead. Other works are designed to directly output the candidate entities. For example, StAR [25] designs a structure-aware and structure-augmented framework for efficient KGC inference. HittER [2] extracts context neighbors for the source entity and introduces an additional masked entity prediction task for balanced contextualization. GenKGC [29] introduces relation-aware demonstrations and entity-aware hierarchical decoding for better representation learning. Despite this progress, we notice gaps in applying these knowledge graph methods to a heterogeneous graph: first, entities in a knowledge graph naturally carry semantic information, while this is not always true for a heterogeneous graph; second, the above methods leave out the entity/node type information provided in a heterogeneous graph, leaving considerable room for performance improvement. In contrast, our model is designed not only to directly output the candidate entities, which eliminates the scoring overhead, but also to fully utilize the entity and relation type information for better prompting.

3 SymPrompt

Our SymPrompt approach leverages pre-trained BERT [4] as the language model to process the tokenized inputs. The reasons are threefold: (i) BERT reduces computational overhead by directly outputting a probability distribution over the entities, where the probabilities serve as ranking scores; in comparison, geometric KGE methods require further similarity calculations to obtain the ranking scores. (ii) The bidirectional attention mechanism in BERT allows for complex input sequence processing, which is essential for aligning the query tokens with the symbolic prompt tokens for the completion task. (iii) BERT accepts input sequences of flexible length, comprising the query tokens and the symbolic prompt tokens; it thus supports varying numbers of symbolic prompts, which further enhances the practical applicability of our framework. The encoded tokens are then aggregated and decoded by a two-layer MLP to output the final result. The overall framework is depicted in Fig. 3. The input tokens are composed of three parts, as shown in Fig. 3: (i) the query tokens, including the queried entity and relation tokens; (ii) symbolic prompts generated by embedding-based methods such as DistMult; and (iii) symbolic prompts generated from the correlation between metapaths and the queried relation. Below, we detail these symbolic prompts.
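The following sketch outlines this backbone under our own assumptions: KGE-dimension inputs are linearly projected to BERT's hidden size and passed via inputs_embeds, the encoded tokens are mean-pooled, and a two-layer MLP produces logits over all entities. The exact dimensions and pooling in the paper may differ.

```python
# A minimal sketch of the SymPrompt backbone, assuming the symbolic-prompt
# and query tokens are already embedded by a pre-trained KGE model (e.g.,
# DistMult). Dimensions, pooling, and layer sizes here are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel

class SymPromptBackbone(nn.Module):
    def __init__(self, num_entities, kge_dim=200, hidden=768):
        super().__init__()
        self.proj = nn.Linear(kge_dim, hidden)       # KGE space -> BERT space
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.decoder = nn.Sequential(                # two-layer MLP decoder
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_entities),
        )

    def forward(self, token_embs):                   # (batch, seq_len, kge_dim)
        out = self.bert(inputs_embeds=self.proj(token_embs))
        pooled = out.last_hidden_state.mean(dim=1)   # aggregate encoded tokens
        return self.decoder(pooled)                  # logits over all entities

model = SymPromptBackbone(num_entities=10_000)
logits = model(torch.randn(2, 42, 200))              # 42 = query + prompt tokens
probs = torch.sigmoid(logits)                        # ranking scores for BCE training
```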

3.1 Embedding-Based Symbolic Prompts

Prior embedding-based models have demonstrated remarkable performance on various benchmark datasets. Their inherent simplicity renders them proficient tokenizers, effectively mapping entity and relation tokens to a shared semantic space. In this paper, we select DistMult [30] as the pre-trained embedding-based method to tokenize the entities and relations. Note that this is a design choice and can be substituted with any other method that fits our framework. Let n be the size of the embedding-based symbolic prompts, defined as the top-n entities predicted by the pre-trained embedding-based method according to its scores. For example, for the query (app\(_1\), app-access-URL), the embedding-based symbolic prompts are represented as \(S_e\) = [url\(_1\), url\(_2\), ..., url\(_n\)], where \(S_e\) includes the top-n URLs predicted by DistMult to be accessed by app\(_1\).
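As a minimal illustration, the sketch below scores candidate tails with the DistMult bilinear form \(s(h, r, t) = \langle e_h, w_r, e_t \rangle\) and keeps the top-n as \(S_e\); the embedding tables here are random stand-ins for a pre-trained model, and the entity IDs are illustrative.

```python
# A sketch of building embedding-based symbolic prompts from DistMult scores.
import numpy as np

rng = np.random.default_rng(0)
ent_emb = rng.normal(size=(1000, 200))   # stand-in for pre-trained entity embeddings
rel_emb = rng.normal(size=(29, 200))     # stand-in for pre-trained relation embeddings

def embedding_prompts(h_id, r_id, candidate_ids, n=20):
    """Top-n candidate tails under the DistMult score, used as S_e."""
    scores = (ent_emb[h_id] * rel_emb[r_id]) @ ent_emb[candidate_ids].T
    order = np.argsort(-scores)[:n]
    return [candidate_ids[i] for i in order]

# e.g., restrict candidates to URL-typed entities for (app_1, app-access-URL)
url_ids = list(range(500, 1000))
S_e = embedding_prompts(h_id=3, r_id=7, candidate_ids=url_ids, n=20)
```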

3.2 Metapath-Based Symbolic Prompts

Previous embedding-based methods rely on geometric operations to model relations among entities, so the resulting symbolic prompts share similarities from a geometric perspective. Meanwhile, semantic relational information among entities can also be extracted from the metapaths of the HG. We therefore additionally provide the model with metapath-based symbolic prompts from the semantic perspective, based on the assumption that relations are semantically correlated with metapaths to different extents. We first introduce a measure of the semantic correlation between a metapath and a relation, and then illustrate how to utilize this correlation to create the metapath-based symbolic prompts.

Relation-Metapath Correlation. We first define \(src(\cdot )\) and \(dst(\cdot )\) as the source and destination entity type mapping functions for a relation r, respectively. We make the following definitions:

Definition 4

(r-valid Metapath). A metapath \(p = e_1 \xrightarrow {r_1} e_2 \xrightarrow {r_2} ... e_L \xrightarrow {r_{L}} e_{L+1}\) is r-valid if and only if \(src(r) = e_1\) and \(dst(r) = e_{L+1}\).

Definition 5

(p-Hit). For a triple \((h, r, t)\), where r is the relation and h and t are the source and destination entities respectively, we say the triple is p-Hit if and only if there exists at least one path from h to t that is an instance of the metapath p.

Definition 6

(p-Hit Ratio). For a p-Hit triple \((h, r, t)\), the corresponding p-Hit ratio is defined as the ratio of t among all the entities reached by following the metapath p starting from entity h; if t cannot be reached, the p-Hit ratio is zero.

Definition 7

(r-p Ratio). For a relation r, an r-valid metapath p, and the triples \(\mathcal {E}_r = \{(h, r, t) \in \mathcal {E}\}\), the corresponding r-p ratio is defined as the average p-Hit Ratio over \(\mathcal {E}_r\).

Example. For simplicity, we abbreviate the relation benign-access-URL as access, developer-develop-app as develop, and developer-use-URL as use. For the relation \(r=\) access, the metapath \(p=\textit{benign} \xleftarrow {\texttt {develop}} \textit{developer} \xrightarrow {\texttt {use}} \textit{URL}\) is an r-valid metapath. For the triple (benign\(_1\), access, url\(_1\)), if there exists a path "\(\textit{benign}_1 \xleftarrow {\texttt {develop}} \textit{developer}_1 \xrightarrow {\texttt {use}} \textit{url}_1\)", then the triple is p-Hit. Furthermore, if starting from \(\textit{benign}_1\) and following p reaches a set of nodes \(\mathcal {T}_h^p\) with \(\textit{url}_1 \in \mathcal {T}_h^p\), and \(\mathcal {T}_h^r \subset \mathcal {T}_h^p\) denotes the entities connected to \(\textit{benign}_1\) by relation r, then the p-Hit Ratio is \(1 / (|\mathcal {T}_h^p \setminus \mathcal {T}_h^r| + 1)\). Since the r-p Ratio reflects the probability of finding the answer t by following metapath p starting from entity h, we utilize it as the correlation indicator and select the top-k metapaths most correlated with relation r, denoted \(\mathcal {P}_r\).
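The sketch below implements these quantities under our own storage assumptions: per-relation adjacency maps, with reverse edges (e.g., the hypothetical relation name develop_inv) stored as separate relations so metapaths can traverse edges in either direction, and the discounted p-Hit Ratio form from the worked example above.

```python
from collections import defaultdict

adj = defaultdict(set)  # adj[(entity, relation)] -> set of neighbors

def follow(h, metapath):
    """Entities reached from h by following a metapath (a relation sequence)."""
    frontier = {h}
    for rel in metapath:
        nxt = set()
        for e in frontier:
            nxt |= adj[(e, rel)]
        frontier = nxt
    return frontier

def p_hit_ratio(h, r, t, metapath):
    """p-Hit Ratio of (h, r, t), using the discounted form from the example:
    1 / (|T_h^p minus T_h^r| + 1) when t is reached, and 0 otherwise."""
    reached = follow(h, metapath)
    if t not in reached:
        return 0.0
    return 1.0 / (len(reached - adj[(h, r)]) + 1)

def rp_ratio(triples_r, metapath):
    """r-p Ratio (Definition 7): average p-Hit Ratio over all triples of r."""
    return sum(p_hit_ratio(h, r, t, metapath) for h, r, t in triples_r) / len(triples_r)

# The example from the text: benign_1 <-develop- developer_1 -use-> url_1
adj[("benign_1", "develop_inv")].add("developer_1")
adj[("developer_1", "use")].add("url_1")
adj[("benign_1", "access")].add("url_1")
print(p_hit_ratio("benign_1", "access", "url_1", ["develop_inv", "use"]))  # 1.0
```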

Entity Candidate Refining. Even though the metapaths in \(\mathcal {P}_r\) are selected based on their correlation with the relation, the reachable entities may still contain noise with respect to the queried relation, especially when a path passes through a high-degree entity, which yields a large candidate set with little or even noisy information. To refine the entity candidate set, we first categorize each metapath \(p \in \mathcal {P}_r\) as leading to either a large or a small candidate set. We union the small candidate sets and intersect the large ones. The rationale is as follows: if a relation is weakly related to a metapath, the candidate set is large, and we rely on the intersection operation to filter out noise; if a relation is broadly related to more than one metapath, the candidate sets are small, and the union operation gathers all possible candidates. To further reduce the candidate set size, we finally utilize an embedding-based method to select the top-n entities among the candidates as the refined metapath-based symbolic prompts.
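A minimal sketch of this refining step, under our own assumptions: a size threshold separates small from large candidate sets, and a hypothetical score function stands in for the embedding-based reranker.

```python
def refine_candidates(candidate_sets, score, size_threshold=100, n=20):
    """Union small candidate sets, intersect large ones, then keep the
    top-n candidates under an embedding-based scorer (e.g., ComplEx)."""
    small = [s for s in candidate_sets if len(s) <= size_threshold]
    large = [s for s in candidate_sets if len(s) > size_threshold]
    pooled = set().union(*small)                # union the small sets
    if large:
        pooled |= set.intersection(*large)      # intersect the large sets
    return sorted(pooled, key=score, reverse=True)[:n]

# Toy usage: two small metapath candidate sets and a dummy scorer.
sets = [{"url_1", "url_2"}, {"url_2", "url_3"}]
print(refine_candidates(sets, score=lambda e: 1.0))
```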

3.3 Combined Input Tokens

The final input for a query \((h, r)\) is defined as the concatenation of the embedding-based symbolic prompt tokens, the metapath-based symbolic prompt tokens, and the query tokens. Before feeding them to the language model, we randomly permute the input tokens. This step is essential for forcing the language model to learn the intrinsic connection between the query and the answer rather than relying on positional information as a shortcut; we validate its necessity in the experiments below. We then utilize an embedding-based method as the tokenizer (rather than BERT's original tokenizer) to project the tokens into the embedding space for encoding. Finally, we adopt the binary cross-entropy loss to train the language model for the HGC task.
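A compact sketch of the input construction follows, with random tensors standing in for the pre-trained DistMult embedding tables and the training objective noted as a comment; all sizes and IDs are illustrative.

```python
import random
import torch

EMB = torch.randn(1000, 200)      # stand-in for pre-trained DistMult entities
REL = torch.randn(29, 200)        # stand-in for pre-trained DistMult relations

def embed(kind, idx):
    return EMB[idx] if kind == "ent" else REL[idx]

def build_input(h_id, r_id, emb_prompts, mtp_prompts):
    tokens = ([("ent", h_id), ("rel", r_id)]
              + [("ent", e) for e in emb_prompts]
              + [("ent", e) for e in mtp_prompts])
    random.shuffle(tokens)        # random permutation: removes the positional shortcut
    return torch.stack([embed(kind, idx) for kind, idx in tokens])

x = build_input(3, 7, emb_prompts=[11, 12], mtp_prompts=[13])
print(x.shape)                    # torch.Size([5, 200]) -> fed to the encoder
# Training: binary cross entropy over the decoder's entity logits, e.g.
# torch.nn.functional.binary_cross_entropy_with_logits(logits, multi_hot_targets)
```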

4 Experiment

4.1 Setup

For baselines, we select DistMult [30], ComplEx [23], ConvE [3], HittER [2], and LTE [38], as they can be easily adapted to our HGC task. We evaluate the models with two key metrics, mean reciprocal rank (MRR) and Hits@K: MRR measures ranking performance via the average reciprocal rank of the correct candidate for each test example, and Hits@K measures the proportion of test examples for which the correct candidate is ranked within the top-K predictions. Higher values of MRR and Hits@K indicate better performance in accurately ranking the correct candidates for the graph completion task. We use a pre-trained DistMult [30] as the tokenizer for the entity and relation tokens and a pre-trained ComplEx [23] to refine the symbolic prompts as described in Sect. 3.2. Note that these choices are a matter of preference and can be substituted with other embedding-based methods such as TransE [1].
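For reference, both metrics reduce to simple statistics over the 1-based rank of the correct candidate per test query, as in this sketch (the sample ranks are illustrative):

```python
def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks of the correct candidates."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct candidate is ranked in the top K."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 133, 2]
print(round(mrr(ranks), 3), hits_at_k(ranks, 10))  # 0.46 0.75
```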

Table 2. Performance comparison with the baseline models on the app promotion dataset. Best results are in bold; runners-up are underlined.

4.2 Performance on App Promotion HGC

The performance comparison with the baselines is shown in Table 2. Based on the three types of input token components (the query tokens, the embedding-based symbolic prompt tokens, and the metapath-based symbolic prompt tokens) and the adoption of random token permutation (shuffle), we consider five settings under our framework, as listed in Table 3. The results in Table 2 demonstrate that our model outperforms the baselines by a significant margin. While our model uses DistMult as the tokenizer to project the tokens into the embedding space, it does not merely rely on DistMult's simple multiplication operation for query answering; instead, the attention mechanism in the deep layers of the language model captures the complex interactions between the query and the symbolic prompt tokens. We also observe that as K in Hits@K increases, the performance gap between our model and the baselines gradually diminishes. This phenomenon can be explained by the two-step inference process followed by our model: (i) the model processes the symbolic prompts and attempts to identify the answer among the input tokens; if the correct answer exists within the symbolic prompts, the model confidently outputs it with high probability, leading to higher hit ratios for small K. This answer identification step is relatively straightforward. (ii) If the answer is not included in the symbolic prompts, the model instead attempts to generate the answer token. We term this second step the answer generation step; it requires the model to deduce the interaction patterns among the input tokens with respect to the query and is therefore more challenging than the first. The two-step inference process reveals that our model excels at both answer identification and generation. Additionally, Hits@1 in all settings except Mtp. Only is significantly improved over DistMult, suggesting that our framework effectively combines the answer identification and generation steps for query answering. The degraded Hits@1 of Mtp. Only results from the focus shifting toward the more challenging answer generation step, owing to the massive noise in its symbolic prompts. We analyze the two-step inference process in detail in Sect. 4.3.

Fig. 4. Results of all model variants with different input components. The variants in the first row randomly permute the input tokens; those in the second row do not. The input tokens are constructed as indicated by the legend.

4.3 Component Analysis

Table 3. Detailed settings of the model variants in Table 2. Emb., Mtp., and Query denote embedding-based prompt tokens, metapath-based prompt tokens, and query tokens respectively. Shuffle denotes the random permutation layer.

We analyze the effectiveness of each component in our framework to confirm the necessity of our design and to provide supporting evidence for the two-step inference process. Additionally, we add a model variant named Random-Prompt, in which the symbolic prompts are replaced with randomly sampled entity tokens. The input component settings of the model variants are shown in Table 4.

Performance Comparison. We combine the adoption of the random permutation layer with the components of the input tokens to expand the model variants, and compare their performance with the best baseline in Fig. 4:

The Base model shares the same input tokens and embedding space as DistMult but relies on the language model (BERT) and the two-layer MLP decoder for query answering, substituting for the simple matrix multiplication employed in DistMult. This substitution, while yielding better Hits@1 performance than the baselines, fails to achieve consistent improvement across all evaluation metrics, because it is difficult to force a complex language model encoder and decoder to replicate the multiplication operation. We therefore redirect our model's functionality from replicating matrix multiplication to mining the complex interaction patterns between the query tokens and the symbolic prompt tokens.

Table 4. The token components discussed in Sect. 4.3. Emb. denotes embedding-based prompt tokens, Mtp. denotes metapath-based prompt tokens, Query denotes query tokens, and Rand. denotes random prompt tokens.

For the Rand. Prompt model, the addition of randomly generated symbolic prompts completely collapses the model, regardless of whether random permutation is employed. The model is overwhelmed by the massive noise introduced by the random tokens and cannot identify the query tokens for the downstream task. This suggests that the symbolic prompt tokens must be crafted with very limited noise to avoid model collapse.

The Embedding-based Only variant yields superior performance when combined with the random permutation layer, especially for hit ratios with small K. This suggests that the embedding-based symbolic prompts are effective in providing both plausible candidates and additional query-related information. Its inferior performance without the random permutation layer results from the positional shortcut taken by the model, which weakens its answer generation ability.

The Metapath-based Only model underperforms the Base model due to the massive noise introduced in the prompts. Although the Embedding-based Only model also accepts extra symbolic prompt tokens as input, those tokens are geometrically similar in the embedding space and therefore introduce less noise while providing query-related information. In contrast, the noise introduced by following the correlated metapaths is intractable, which inevitably results in inferior performance.

Our model outperforms all other variants under both settings, confirming the necessity of combining the two sets of symbolic prompts. Furthermore, we find that as K increases, the gap between our variants and the baselines decreases, and it decreases faster under the w/o Rand. Perm. setting. This is because as K increases, our model relies more on the more challenging answer generation step to improve the metrics, while the baselines consistently generate answers (i.e., conduct the answer generation step) through their designed geometric operations. In addition, the random permutation layer mingles the input tokens, forcing our model to focus on the intrinsic connections between tokens rather than on connections hooked to token positions. Removing the positional shortcut enhances our model's ability to generate more possible answers, resulting in a slower decrease of the performance gap as K increases compared with the baselines.

Fig. 5. The training dynamics of models with full symbolic prompt tokens (ours), metapath-based symbolic prompts only, and embedding-based symbolic prompts only.

Training Dynamics Comparison. We analyze the training dynamics of the Embedding-based Only model, the Metapath-based Only model, and our model under two settings (with and without random permutation), shown in Fig. 5. We observe that models converge more slowly under w/ Rand. Perm. than under w/o Rand. Perm. Models with consistently positioned input tokens tend to opt for shortcut solutions such as memorizing positions rather than learning the intrinsic relations between tokens; identifying the shortcut tokens' positions is less challenging than generating the answer from learned interaction patterns, leading to faster convergence. This phenomenon aligns with the two-step inference process, where the model first tackles the easy task of identifying existing answers in the symbolic prompt tokens and then engages in the more complex task of generating answers. Additionally, the performance gaps in Hits@K across different K are larger under w/ Rand. Perm., indicating better generality for the HGC task. Our model neither saturates as quickly as Metapath-based Only, thanks to less noise in its symbolic prompts, nor takes excessively long to improve like Embedding-based Only, which relies heavily on accessible shortcuts that hinder generality.

4.4 Random Permutation on Model Learning

To analyze how random permutation affects model learning, we show the normalized attention score heatmaps in Fig. 6 under three conditions: (a) train and test the model under w/o Rand. Perm.; (b) train the model under w/o Rand. Perm. but test it under w/ Rand. Perm.; (c) train and test the model under w/ Rand. Perm. (our model). The correct answer entity ranks 1, 133, and 1 under the three conditions respectively. Under condition (a), the model consistently pays heavy attention to the 1\(^{\text {st}}\) and 21\(^{\text {st}}\) tokens; because we set the number of embedding-based and metapath-based symbolic prompts to 20, these two positions correspond to the most probable answers. Without random permutation, the model easily identifies the positional shortcut and pays less attention to the query tokens (indexed by the red and blue dotted lines). Under condition (b), the model fails to assign a high rank to the correct answer (rank 133): it rigidly assigns heavy attention to the 1\(^{\text {st}}\) and 21\(^{\text {st}}\) tokens even though, after random permutation, they are no longer the most probable answers. Under condition (c), the model trained and tested with random permutation, as we designed, distributes its attention much more broadly across the input sequence: it learns to dynamically attend to the potential answers (red and purple line intersections in Layer 1, Head 1), the source entity token (red vertical dotted line in Layer 2, Head 2), and any other information it deems important (tokens indexed 7, 31, etc.). This comparison shows that random permutation stimulates the model to learn the interaction patterns between input tokens, increasing its power in generalized query answering.

Fig. 6. The attention heatmaps in the layers of the model under three settings.

5 Conclusion

In this work, we focus on completing missing data in the context of app promotion to combat the information scarcity problem, and we propose a language model-based approach named SymPrompt that leverages symbolic prompts to provide valuable hints for answering queries. This research advances the understanding of app promotion networks, and we direct future work toward explainable information completion in the app promotion ecosystem.