1 Introduction

It has become difficult for users to find information on the web that satisfies their needs, since information resources continue to grow and have far exceeded human processing capabilities. The sheer abundance of information often prevents users from finding the desired information, services, and products, or hinders them from making informed choices.

For these reasons, users need intelligent personalized web applications that simplify information access and content discovery based on user preferences, and deliver services in the most valuable and convenient way. One example of personalized web applications that have recently become tremendously popular is the personalized web search system. These systems offer users personalized answers about services, products, and information.

Two main challenges face personalized information retrieval (PIR) systems in general. The first one is building a user profile that accurately represents the user's real interests, which we addressed in our previous work El-Ansari et al. (2020a). The second challenge is the privacy protection problem, since giving the user a personalized browsing experience comes at the cost of his privacy. Thus, many users are reluctant to use such applications. According to the General Data Protection Regulation (GDPR) Voigt and Von dem Bussche (2017), data controllers must design information systems with privacy in mind, and data subjects should have full access to and control over the collected data. Yet most former research works in the field focused on improving the personalization quality while disregarding user privacy.

In this paper, we propose a new model that aims at improving privacy protection on various levels. The purposes of our research, in general, are first to enhance the personalization quality by using accurate profiles capable of reflecting the user's changing interests, and second to ensure the user's privacy by protecting his sensitive data on the client-side, through the Internet channel, and more importantly on the server-side, where most privacy risks originate (data misuse, leakage, etc.).

PAPIR is a model that aims at improving both personalization quality and privacy protection. The first is ensured by building accurate user profiles used in query expansion and re-ranking algorithms. The second is guaranteed by design, taking into consideration each user's privacy requirements, and enforced with a fully homomorphic encryption (FHE) scheme.

This model can be implemented in numerous information retrieval (IR) applications, including search engines, natural language question answering systems, chatbots, library management systems, etc.

The paper is organized as follows: the next section discusses related works. Section 3 presents a comparative study of the existing personalization systems' structures with a focus on user privacy, along with the privacy threat modeling. Our model is detailed in Sect. 4. The proof-of-concept implementation measures are discussed in Sect. 5, with experimental evaluation results in Sect. 6. The paper ends with the conclusion and perspectives.

2 Related work

This research relates to the field of Personalized Information Retrieval (PIR) systems: the process of delivering information or items to the user, considering his specific needs and interests, in the most adequate way and at the right time. These systems continuously collect different types of user data, which raises many privacy protection concerns. We discuss former works in this section as follows: PIR applications, personalization methods, and privacy protection solutions.

  • PIR applications

Some personalized systems were developed to help users find desired products (Amazon Smith and Linden (2017), eBay Greenstein-Messica and Rokach (2018)), improve search results (Yu et al. (2018)), browse news articles (Wu et al. (2019), Bountouridis et al. (2019)), find scientific and research papers (Mohseni et al. (2019)), or perform a combination of the above tasks.

  • Personalization methods

Most personalized systems build user profiles by collecting and analyzing user’s browsing history (visited Web pages as in Dennis et al. (2016) and ElShaweesh et al. (2017)). Other data sources have also been used such as images in Lully et al. (2018), or a combination of the visited Web pages, bookmarks, queries, and search results as in El-Ansari et al. (2020a). This combination boosts user profiling performance, especially when combined with Big Data technologies.

To construct the user profiles, a variety of learning techniques are used, including the probabilistic model Chaney et al. (2015), genetic algorithms Lv et al. (2016), clustering Liao and Lee (2016), deep learning techniques Singhal et al. (2017), and the well-known vector space model (Wu et al. 2019; El-Ansari et al. 2020b). Since the user profile might comprise irrelevant topics, some of the above systems use concept filtering or rating algorithms to improve the profile's accuracy. A survey on user profiling techniques by Eke et al. (2019) provides more details on this subject.

  • Privacy protection solutions

Most studies focus on improving the personalization quality while disregarding the user's privacy issues. Zhu et al. (2010) tried to address the privacy protection problems in personalized systems by protecting user identification using techniques such as pseudo-identity, group identity, no identity, and no personal information. Most efforts focus on the second of these levels (group identity). For example, Zhu et al. (2010) provided online anonymity for users by generating a group profile of k users. De-identification techniques are mostly used in personalized systems that collect users' personally identifiable information (PII) and work on aggregate data. However, such techniques may be vulnerable to attacks like unsorted matching attacks, temporal attacks, and complementary release attacks. Tomashchuk et al. (2019) provided a detailed survey of de-identification techniques for privacy protection.

Since our work does not use PII, we focus on techniques protecting the user's sensitive data (the user profile). In this type of privacy-preserving system, three main lines of techniques are used: differential privacy, randomized perturbation, and cryptographic methods.

Differential privacy (DP) has become widely accepted as a model for privacy protection during the past years. This solution preserves privacy by making it difficult for an adversary to infer the presence or absence of any individual in the data set. The technique is designed to work on aggregate data and is most suitable for big data.

For instance, Zhu et al. (2013) proposed a DP scheme for neighborhood-based CF that can select neighbors privately; however, it fails to maintain a good trade-off between privacy and accuracy. The authors in Shen and Jin (2016) designed a privacy built-in client that perturbs data on the user's device. However, the utility of perturbed data may decrease due to the inherent volatility of the whole process. Another work, Zhang et al. (2020), proposes a probabilistic mechanism for releasing mobility datasets based on differential privacy, to give users control over the privacy level in location-based services.

The main limitations of differential privacy are, first, the amount of noise added to the data: more noise is required for higher privacy levels, which reduces data utility. It is also vulnerable to insider threats that come from within the personalization server. Desfontaines and Pejó (2020) presented a survey of DP techniques for more details on the subject.

Randomized perturbation (RP) is another noise-based technique, proposed in Zhu et al. (2015). The authors claim that they can obtain accurate recommendations while adding randomness from a specific distribution to the original user data to counter information exposure. However, the range of randomness is chosen empirically and does not come with a provable privacy protection guarantee. Another work, Polatidis et al. (2017), proposed a multi-level privacy-preserving method for collaborative filtering systems that perturbs each rating before it is submitted to the server. Yet the results showed a decrease in utility. The authors in Liu et al. (2017b) presented a hybrid approach for privacy-preserving recommender systems that combines DP with RP to offer more privacy protection. However, the recommendation accuracy loss with this approach is significant. Most noise-based techniques share the problem of utility loss Siraj et al. (2019).

Both DP and RP techniques are designed for aggregate data privacy, ignoring each user's individual privacy requirements, and both decrease the personalization accuracy. Lately, privacy concerns among users have increased due to the unethical data aggregation practices of many recommendation systems. For this reason, our work focuses on individual data privacy.

As an individual privacy (IP) solution, Shou et al. (2014) claim that better results can be achieved with a privacy guarantee if the personalization is performed only on less sensitive user data. The idea is to expose only the insensitive part of the profile to the search engine. However, with this approach, a significant part of the user profile can still be collected by an attacker or the server. Wu et al. (2020) proposed a framework for privacy protection in book search based on constructing a group of plausible fake queries for each user query to cover up the sensitive subjects behind users' queries. This approach focused on a limited set of queries, with no support for general ones and no practical implementation.

Cryptographic techniques such as Homomorphic Encryption (HE), Searchable Symmetric Encryption (SSE), or garbled circuits are used as mechanisms to protect private data confidentiality. The authors in Erkin et al. (2012) present a solution to generate recommendations in a privacy-preserving manner using HE and data packing, while Anjali and Reeshma (2015) proposed a search framework based on Shou et al. (2014)'s model, enhanced with HE to protect user queries but without a proper explanation of how HE is implemented. Also, the authors in Liu et al. (2017a) used partially HE to design two protocols for privacy-preserving trust-oriented POI recommendation based on off-line encryption and parallel computing. A recent work, Wang et al. (2020), proposed a fully HE scheme for private IR in the cloud; the paper contains a theoretical analysis, but no proof-of-concept implementation is discussed. A survey by Zhou et al. (2020) on keyword search with public-key encryption can help understand the use of encryption in private IR.

Table 1 presents a comparative analysis of our model with the closely related works mentioned in this section.

Table 1 Comparative analysis with closely related works

Our model focuses on countering privacy risks while preserving the personalization utility in PIR systems. The main contributions of our paper can be summarized as follows:

  • Presents an overall study on privacy protection in PIR, detailing the structures of existing systems and the possible privacy threats in a threat modeling approach.

  • Proposes a new model (based on the vector space model) that enhances the personalization quality by using accurate profiles capable of reflecting the user's changing interests, and ensures user privacy by protecting his sensitive data on the client-side, through the Internet channel, and on the server-side.

  • Creates and stores user profiles on the client-side, which are used in query expansion and re-ranking algorithms without data perturbation or accuracy loss.

  • Guarantees privacy protection by design, taking into consideration each user’s privacy requirements, and uses a fast FHE scheme to protect user data in different states: at-rest, in-transit, and in-use with no assumption of trust in the server.

  • Implements an FHE scheme with extended operations over ciphertexts, such as comparison, sum, etc.

  • Includes a proof-of-concept implementation with scalability support.

  • Provides theoretical and experimental analysis.

PAPIR can be easily adapted and implemented in numerous personalized IR applications, such as search engines, question answering systems, etc.

3 Privacy in personalized IR systems

User privacy is the user's ability to insulate himself, or some of his information, and thereby express himself selectively. Privacy is becoming a big concern in many fields such as social networks, cloud computing, personalized systems, etc. To protect user privacy in profile-based personalized IR systems, we must consider two contradicting effects. First, such systems improve search quality by collecting more user data. Second, they must hide the sensitive data in the user profile to keep the privacy risk under control. Where the user profile is stored depends on the system structure.

3.1 Personalized IR systems’ structures

In this section, we classify personalized search systems into three distinct structures, based on where the user profile is stored (on the client-side or the server-side) and how it is used for personalization.

3.1.1 Server-side personalization

The first type of structure is Server-side personalization (Fig. 1), where the user profile is stored on the server. This structure requires the user to create an account to identify himself. The server creates and updates the user profile either from the user’s explicit input (e.g., asking the user to specify his interests) or by implicitly collecting the user’s search history (e.g., query and click-through history). The latter approach requires no additional effort from the user and contains a richer description of his interests.

Fig. 1 Server-side personalization structure

Some search engines, like Google Personalized, adopted this architecture. Most systems with such a structure ask users to provide consent before collecting and using their data. If the user gives his permission, the search system will hold all of his personally identifiable data on the server. Thus, from the user's perspective, this architecture does not offer a minimum level of privacy protection.

The lack of protection for such data raises privacy issues. A notable instance is the AOL query logs scandal, in which AOL Research released a file on its website containing twenty million search keywords for over 650,000 users, intended for research purposes Wikipedia (2019). In the released data, AOL did not identify the users. However, many users' queries contained personally identifiable information attributed by AOL to particular identifiable user accounts. The New York Times located an individual from the released search records by cross-referencing them with phone-book listings. AOL admitted it was a mistake and removed the file, yet others redistributed it on mirror sites. This example not only raised panic among users but also dampened data publishers' enthusiasm for offering improved personalized services.

Advantages:

  • The search engine can use all of its resources (search patterns, document index) in the personalization algorithm.

  • The client software generally requires no changes.

  • The server's high performance saves considerable processing time.

Drawbacks:

  • From the user's perspective, it does not offer a minimum level of privacy protection.

  • Most users are afraid of using systems that can compromise their private data.

To address the drawbacks of this structure, especially the user privacy problems, the client-side structure can be a solution.

3.1.2 Client-side personalization

The second type of structure, client-side personalization (Fig. 2), stores the user profile on the client-side. The client agent sends queries to the search engine and receives results, the same way as in the ordinary web search scenario. The client agent also performs a query expansion to generate a new personalized query before sending it to the search engine. Furthermore, as in Hawalah and Fasli (2015), the client agent re-ranks the search results to match user preferences.

Advantages:

  • Offers a richer user profile by combining the user's search history with his contextual activities (visited web pages) and personal data (emails, bookmarks).

  • Reduces privacy concerns, since the user profile stays on the client-side.

  • Distributes the computation and storage overhead of personalization among the clients.

Drawbacks:

  • The client usually receives many results from the search engine, which increases the re-ranking time and reduces its efficiency.

  • The personalization algorithm cannot use the server's knowledge (e.g., the PageRank score of a result document).

Fig. 2 Client-side personalization structure

To address the above drawbacks and improve the personalization quality without compromising user privacy, recent studies use a client–server collaborative structure.

3.1.3 Client-server collaborative personalization

The client–server collaborative structure (Fig. 3) is a compromise between the previous two structures. The user profile is still on the client-side, but the server also participates in search personalization.

At query time, the client agent extracts a sub-profile from the user profile to send it to the search engine along with the query. The search engine then uses the received context to personalize the results.

Fig. 3 Client–server collaborative structure

Personalization research in this category is minimal, probably due to the relatively complex architecture. In Shou et al. (2014), the contextual information sent to the server is a generalized profile that specifies the user's search preferences without exposing the sensitive data in the user profile. The client agent extracts a sub-profile relevant only to a particular query. This sub-profile is a condensed version of the original user profile (generally a few terms or a weight vector derived from the user's search history). Thus, such a structure can reduce the personal data obtained by the search engine.

Advantages:

  • Offers better privacy protection than the server-side structure, because less user data can be collected on the server-side.

  • Allows the use of the server's internal resources (common search patterns, document index) in the personalization algorithm.

  • Presents a more personalized set of results.

Drawbacks:

  • The condensed contextual information is not as effective as the original user profile.

  • It still presents a privacy risk; although the original user profile is not exposed, the generalized profiles can be collected on the server-side or through an eavesdropping attack (Fig. 4).

3.2 Privacy threats in personalized IR systems

Personalized information retrieval (PIR) systems pose several risks to user privacy. In this section, we discuss potential privacy threats based on where the breach of the user's sensitive data occurs.

3.2.1 Threats on the client-side

Using the user's device to store sensitive data (the user profile), as personalized IR applications do, introduces risks. Some of these threats may lead to serious problems for both users and service providers.

Cross-Site Scripting (XSS) is one of the most prevalent vulnerabilities in recent web applications. An attacker can execute scripts within the context of the website under attack. Different types of XSS exist, all with the same result: the execution of malicious code in the user's browser. This may allow the attacker to access sensitive user data.

Client-side SQL injection (csSQLi) is a form of the well-known SQL injection attack that has emerged recently with the introduction of database support on clients (Google Gears and HTML5 Web SQL databases). A mechanism often used in conjunction with SQL injection is stacked queries. A stacked query allows an attacker to execute his own query, fully irrespective of the original query the application executes. Stacked queries are appended to an original query through the use of a semicolon (e.g., appending "; DROP TABLE profile; --" to a value that is concatenated into a query). SQL injection with stacked queries can be very powerful, as it allows the attacker to execute arbitrary SQL commands on the database, especially on older browser versions.

Client-side data corruption or leakage is when the user, or an attacker controlling his device, changes or corrupts the stored data, or simply retrieves sensitive information. Data leakage/corruption can be caused by numerous attacks including malware, XSS, csSQLi, and threats exploiting vulnerabilities in the user’s browser or device.

To ensure data security and lower the risk of exploiting client-side data storage vulnerabilities, both users and service providers are advised to implement preventive measures (e.g., encryption, digital signatures, and access control mechanisms). Furthermore, output encoding mechanisms and parameterized queries can be used to prevent XSS and csSQLi, respectively. Users also need to install anti-malware solutions and keep their device's system and browser updated to avoid new vulnerabilities.

3.2.2 Threats on the server-side

Reports of privacy breaches on the server-side affecting personalized systems dominate the news with increasing frequency. Most personalized systems, including personalized search engines and social media, store the collected massive user data on their servers, making them the first target for attackers. We can classify server-side privacy threats into two categories: insider threats and outsider threats.

An insider privacy threat comes from within the service-providing organization (malicious or negligent insiders, infiltrators, or even the organization's own intentions). The threat may involve data leakage (the AOL scandal Wikipedia (2019)), data misuse (the Facebook–Cambridge Analytica scandal Tuttle (2018)), where the user data are used for other purposes, data brokerage (sourcing and aggregating data, and reselling the most valuable categories of users to third parties), the theft of confidential or commercially valuable information, etc.

Outsider threats are ever-present and constant, and they do pose a danger: any system connected to the internet is at risk. Historically, the data breaches that make the news are typically carried out by outsiders. In 2016, the search engine and email giant Yahoo had its systems compromised by a data breach, resulting in the theft of information of about 500 million users Trautman and Ormerod (2016). eBay, too, reported that an attack exposed its entire account list of 145 million users in May 2014 Minkus and Ross (2014). Many more companies on the web have suffered from data breaches (Adobe, Canva, LinkedIn, Zynga, etc.). While these breaches can cost millions of dollars, outsider threats are generally the ones addressed with traditional security measures (firewalls, passwords, encryption, etc.) used to prevent potential attacks (malware attacks, phishing, XSS, SQL injections, password attacks, etc.).

However, securing a server is a difficult and challenging task that can never be fully accomplished. Failure to implement proper security controls such as patches and updates, secure configurations, changing default accounts, or disabling unnecessary back-end services can compromise data confidentiality and integrity. Moreover, introducing an additional solution to enhance a server's security can itself increase vulnerability and exposure to further threats. One answer to the problem is to understand server vulnerabilities and implement a risk-mitigation approach that takes into consideration both insider and outsider threats.

3.2.3 Communication channel threats

The internet serves as the electronic chain linking a client to a server. Messages on the internet travel a random path from a source node to a destination node, passing through several intermediate nodes before reaching the final destination. It is impossible to guarantee that every computer on the internet through which messages pass is secure and non-hostile.

Web applications often use the HTTP protocol for client–server communication, which transmits all information in plain text. Even when they provide transport-layer security through the HTTPS protocol, if they ignore certificate validation errors or revert to plain-text communication after a failure, they can jeopardize security by revealing data or facilitating data tampering.

Compression side-channel attacks such as CRIME and BREACH presented concrete, real-world examples of HTTPS vulnerabilities in 2012 and 2013. Since then, security experts have confronted new attacks on TLS/SSL every year, especially against servers using version 1.2 of the TLS protocol, which supports both encryption and compression. The combination of encryption and compression algorithms presented security flaws that allowed attackers to read the content of the encrypted HTTP header and use the authentication token within the cookie to impersonate a user Yang and Gong (2019).

These attacks, among others, can enable a privacy threat called eavesdropping (also known as a sniffing or man-in-the-middle attack), which is the theft of information as it is transmitted over a network by a computer, smartphone, or another connected device. The attacker takes advantage of unsecured network communications to access data as it is being sent or received by its user.

Moreover, the advancement towards future 5G mobile networks is rapid and expected to offer high data speed and low latency, which will eventually result in huge data flows in the communication channels between users and servers. This fact increases the user’s concerns about privacy protection in these networks.

3.3 Privacy threat model

Implementing effective privacy protection models requires understanding the range of potential privacy threats in this field, so it is important to study and define a privacy threat model. The threat model described in this section is based on the aforementioned privacy threats in PIR systems and structures. The idea is to model how an attacker could threaten the user's privacy and conduct an attack. As described in Fig. 4, we investigate three different threat scenarios:

  1. The attacker controls the user's device and has access to the whole user profile.

  2. The communication channel is insecure; the attacker can eavesdrop and collect the user's queries along with portions of his profile, then guess the user's profile using auxiliary information (an online ontology) and social engineering skills.

  3. The server is compromised, and the attacker can collect the data sent by the user, then guess the original user profile as in the second scenario.

Fig. 4 A privacy threat model in PIR

Section 3.2 summarized potential methods that an attacker can use to gain access to the client’s device, the server, or the communication channel. Since the user profile is stored on the client-side, encryption is mandatory to secure the user’s data and reduce the risk in the first scenario.

During a browsing session in the personalized IR system, a user sends many queries to the server, together with short portions of his profile. As mentioned in the second and third scenarios, an attacker can obtain a significant part of the original profile by collecting the sub-profiles and using the online ontology to infer the rest.

Considering the user profile P, each time the user enters a query q, the system sends a part of P. If the attacker captures each generalized profile \(G_i\), it is possible, after a number of queries n, to guess a significant portion of the user profile P using the online taxonomy. Even if each generalized profile \(G_i\) contains no private data, the attacker can still obtain the user profile by comparing the accumulated \(G_n\) to the ontology.

$$\begin{aligned} \sum \limits _{i=1}^n G_i = G_n \rightarrow P \end{aligned}$$
(1)

where n is the number of queries (which depends on the user's activity and time).
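To make the accumulation in Eq. (1) concrete, the following toy sketch (our illustration, not part of the model) represents profiles as simple sets of concept labels and measures how much of P an eavesdropper recovers from the captured sub-profiles; the profile contents are hypothetical.

```python
# Illustrative only: profiles modeled as sets of concept labels to show how the
# captured sub-profiles G_1, ..., G_n accumulate toward the full profile P.

def accumulated_exposure(generalized_profiles, full_profile):
    """Fraction of P recoverable from the union of all captured G_i."""
    exposed = set()
    for g in generalized_profiles:
        exposed |= g                       # G_1 + G_2 + ... + G_n
    return len(exposed & full_profile) / len(full_profile)

# Hypothetical example: after three queries, two thirds of P is already exposed.
P = {"security", "privacy", "sports", "football", "programming", "python"}
captured = [{"security"}, {"sports", "football"}, {"programming"}]
print(accumulated_exposure(captured, P))   # 0.666...
```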

Fig. 5 Ontology-based user profile

To illustrate how an attacker can breach user privacy, Fig. 5 shows an example of a user profile (a) with two generalized profiles (\(G_a\) and \(G_b\)). The gray concepts in this figure reflect the user's private data, and the generalized profiles contain no sensitive data because the system stops at the parent nodes. However, in \(G_a\) for example, the attacker can retrieve the sub-tree of Security by relying on the taxonomy (b) in the same Fig. 5, where Security is the parent of two nodes, including a private one (Privacy). Therefore, if the probability of touching either branch is equal, the attacker has 50 percent confidence on Privacy, leading to a high privacy risk.

To reduce the aforementioned risks, our PAPIR model combines two techniques discussed in Sect. 2: an individual privacy solution and encryption techniques, to protect the user profile and limit the possibility of the three scenarios in the threat model.

4 The proposed PAPIR model

This section presents the structure of our privacy-aware model for personalized IR systems. We use the vector space model (VSM) in the user profile construction process as well as in the information retrieval. In the VSM, each document, query, or user interest is an n-dimensional vector, where n is the number of distinct terms over all the documents and queries. Vectorized data processing helps build faster IR systems by making efficient use of the CPU cache. Figure 6 describes the PAPIR model structure, which combines four main components:

  • C1. Profiling agent

  • C2. Query processor

  • C3. Encryption agent

  • C4. Re-ranking agent

Fig. 6 PAPIR model structure

PAPIR enhances the Client-Server collaborative structure (Fig. 3) to offer more security and better privacy protection while keeping a good personalization quality. It also addresses the drawbacks of the previous models:

  • It uses a rich profile of three layers for enhanced personalization.

  • It eliminates the privacy risk of the generalized user profiles that can still be collected on the server-side or with an eavesdropping attack.

  • The client receives a set of personalized and re-ranked results.

  • The personalization algorithm on the server-side can use all the knowledge that is only available there (e.g., the PageRank score of a result document).

  • Our model also preserves the user’s privacy on different levels using encryption.

4.1 Profiling agent

Building user profiles consists of learning from the user's browsing behavior. Hence this process is activated after each browsing session, as described in our previous work El-Ansari et al. (2020a). In our model, the profiling agent creates and stores the user profile on the client-side (the user's device) in the form of a tree of concepts or topics. Each topic of interest is represented with a word vector. For reasons we discuss later in this section, instead of a single user profile, we segregate the user's interests into three parts:

  • Short-term profile (STP).

  • Long-term profile (LTP).

  • The archive.

After each browsing session, the profiling agent prepares a list of visited concepts and classifies them into the short-term and long-term layers. A user might show interest in a topic and abandon it after a while. For example, when the football World Cup starts, sports fans focus on this event. Once the contest ends, they would likely turn their interest to other sports.

The short-term profile includes the user's recent interests. One way to discover these interests is to define a threshold; the system classifies all new concepts with weights above it into the short-term profile.

The long-term interests are more stable than the short-term ones. For example, programming languages are stable interests for a user who works as a programmer. Therefore, the short-term profile contains the changing user interests, while the long-term contains the stable ones.

The archive contains interests that are no longer important to the user. Concepts that gradually lose their weight values (importance) eventually move to the archive. We use separate profiles in our model for the following benefits (a minimal sketch of the profile layout follows the list):

  • The three profiles will help the system adapt to the user’s interest change.

  • With the short and long-term profiles, we separate stable interests from the occasional ones.

  • They help preserve user privacy by minimizing the risk of guessing the user's whole profile from the query vector sent to the server.
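Below is a minimal sketch of this three-layer profile and the session-based classification described above, assuming a simple in-memory layout; the threshold values and the promotion rule are illustrative assumptions rather than the implementation of El-Ansari et al. (2020a).

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    vector: list        # n-dimensional term-weight vector
    weight: float       # current interest weight
    sensitivity: float  # user-specified sensitivity value in [0, 1]

@dataclass
class UserProfile:
    stp: dict = field(default_factory=dict)      # short-term profile
    ltp: dict = field(default_factory=dict)      # long-term profile
    archive: dict = field(default_factory=dict)  # abandoned interests

STP_THRESHOLD = 0.4       # assumed value: new concepts above it enter the STP
ARCHIVE_THRESHOLD = 0.05  # assumed value: fading concepts below it are archived

def classify_session(profile, session_concepts):
    """Route the concepts of a finished browsing session to the STP,
    and archive STP concepts whose weight has decayed."""
    for c in session_concepts:
        if c.weight >= STP_THRESHOLD:
            profile.stp[c.name] = c
        # recurring concepts would be promoted to the LTP over time
        # (promotion and decay rules follow El-Ansari et al. (2020a))
    for name, c in list(profile.stp.items()):
        if c.weight < ARCHIVE_THRESHOLD:
            profile.archive[name] = profile.stp.pop(name)
```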

In most previous works, the same level of privacy protection is afforded to all individuals. However, data subjects commonly have quite different expectations regarding the acceptable level of privacy for their data. In our model, once the profiling agent creates the initial profile, the user can specify his privacy requirements by assigning a sensitivity value to each topic in his profile through a graphical user interface.

4.2 Query processor

This component is a key element in our model, as it is responsible for two main tasks: query preparation and query expansion.

4.2.1 Query preparation

Once a user enters a query q, the first step is to prepare a query vector Q. As mentioned before, each document, user query, or user interest is an n-dimensional vector, where n is the number of distinct terms over all the documents. Users normally express queries in natural language as a set of words. Each query is tokenized, processed to remove stop words, and stemmed with the Porter stemming algorithm to remove common suffixes Swain and Nayak (2018).

Let \(Q_0\) = \((0_1,0_2,\ldots ,0_n)\) be a query vector with 0 as the weight value for all n terms. For each query term \(t_i\) we change the corresponding weight value from 0 to 1 to mark its presence in the user query. The same operation is performed on the term \(t_i\)'s synonyms to improve the retrieved results' accuracy. For example, suppose a query q contains two key terms \(t_5\) and \(t_7\), and \(t_5\) is a synonym of the term \(t_9\). The resulting initial query vector would be as follows:

$$\begin{aligned} Q_i = (0_1,0_2,0_3,0_4,1_5,0_6,1_7,0_8,1_9,0_{10},\ldots ,0_n) \end{aligned}$$
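A minimal sketch of this preparation step is shown below; it assumes NLTK's tokenizer, stop-word list, and Porter stemmer as stand-ins for the preprocessing described above, and treats the vocabulary and the synonym table as given placeholders rather than the paper's actual implementation.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def prepare_query(q, vocabulary, synonyms):
    """Build the initial binary query vector Q_i over the n vocabulary terms."""
    terms = [stemmer.stem(t.lower()) for t in word_tokenize(q)
             if t.isalpha() and t.lower() not in stop_words]
    index = {term: i for i, term in enumerate(vocabulary)}
    Q = [0.0] * len(vocabulary)
    for t in terms:
        for candidate in [t] + synonyms.get(t, []):  # mark the term and its synonyms
            if candidate in index:
                Q[index[candidate]] = 1.0
    return Q
```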

4.2.2 Query expansion

Information retrieval is concerned with identifying documents in the collection that are relevant to the user's information needs. Queries formed by users are generally short and vague, making it difficult to estimate the exact user need. Information retrieval systems can improve their effectiveness by using query expansion, which automatically adds new terms to the original query posed by the user.

In our query expansion algorithm, the terms added to the user query are extracted from the user profile, reflecting his interests and needs. Terms are selected by computing the similarity between the query vector and each topic in the user profile (STP) taxonomy to extract related topics of interest. The following algorithm describes the query expansion process:

Algorithm 1 Query expansion

We compute the cosine similarity metric between the query vector Q and each concept’s vector C in the STP as follows:

$$\begin{aligned} CosSim (C, Q) = {C \cdot Q \over \Vert C\Vert \times \Vert Q\Vert } = \frac{ \sum _{i=1}^{n}{C_i \times Q_i} }{ \sqrt{\sum _{i=1}^{n}{C_i^2}} \times \sqrt{\sum _{i=1}^{n}{Q_i^2}}} \end{aligned}$$
(2)

A high cosine value indicates that the query is closely related to the topic of interest in the user profile. To achieve a balance between personalization quality and individual user privacy, we take into consideration two important metric thresholds: utility \(\mu\) and privacy risk \(\rho\).

The utility metric (\(\mu\)) is used to enhance the personalization quality and increase the system's performance by avoiding unnecessary computations. For some queries, called distinct queries, the whole personalization process contributes little or even reduces the search quality.

A distinct query is a clear one that needs no personalization and is used by most users to look for the same result. For example, with the question (who is the director of the Titanic movie), users are looking for the same answer (James Cameron). The query expansion and randomized perturbation tasks, in this case, are unnecessary. We use a predefined utility threshold (\(\mu\)) in the query expansion algorithm to avoid such unnecessary computations: a concept C from the STP is considered relevant to the query q only if \(CosSim (C, Q) > \mu\).

The privacy risk (\(\rho\)) metric is based on the user's privacy requirements. It helps decide which topics in the user profile can be added to the query vector. The user specifies a sensitivity value S for each concept in his profile, and the privacy risk threshold \(\rho\) is the average sensitivity value in the user profile. We calculate the similarity between the query vector and a concept \(C_i\)'s vector only if \(C_i\)'s sensitivity is under the threshold: \(S_i < \rho\).

To generate the expanded query vector \(Q_e\), we calculate the new weight (\(nw_i\)) for each term i as follows:

$$\begin{aligned} nw_i = {w_i + \sum _{j=1}^{x}{W_{ij}} \over x+1} \end{aligned}$$
(3)

where:

\(w_i\) : is the weight of term i in the initial query vector.

x : is the number of selected concepts.

\(W_{ij}\) : is the weight of term i in the concept j’s vector.
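A minimal sketch of Algorithm 1 under these definitions follows; the STP is assumed to be available as a list of (concept vector, sensitivity) pairs, and the threshold values for \(\mu\) and \(\rho\) are illustrative, not those used in our evaluation.

```python
import math

def cos_sim(c, q):
    """Cosine similarity of formula (2)."""
    dot = sum(ci * qi for ci, qi in zip(c, q))
    norm = math.sqrt(sum(ci * ci for ci in c)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0

def expand_query(Q, stp_concepts, mu=0.3, rho=0.5):
    """stp_concepts: list of (concept_vector, sensitivity) pairs from the STP."""
    # keep concepts that respect the privacy threshold (S_i < rho) and are
    # similar enough to the query (CosSim(C, Q) > mu)
    selected = [c for c, s in stp_concepts if s < rho and cos_sim(c, Q) > mu]
    if not selected:          # distinct query: no personalization needed
        return Q
    x = len(selected)
    # formula (3): nw_i = (w_i + sum_j W_ij) / (x + 1)
    return [(w + sum(c[i] for c in selected)) / (x + 1) for i, w in enumerate(Q)]
```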

4.3 Encryption agent

The encryption agent is responsible for the protection of the user data in its different states; at-rest, in-transit, and in-use.

In our model, the user data that require protection at rest are mainly the user profiles (STP, LTP, and the archive). For data-at-rest encryption, we use the AES algorithm with a 256-bit key. The computational requirements of this approach are low, and it offers strong protection, making it the algorithm of choice for governments, financial institutions, and security-conscious enterprises around the world.

The query vector and search results, on the other hand, require in-transit and in-use protection. Our approach to protecting user data in these two states is based on homomorphic encryption, which enables cryptographically secure computations in untrusted environments.

We use a fully homomorphic encryption (FHE) scheme called CKKS (proposed by Cheon, Kim, Kim, and Song in Cheon et al. (2017)), also known as HEAAN (Homomorphic Encryption for Arithmetic of Approximate Numbers). CKKS allows additions and multiplications on encrypted real or complex numbers, and it is implemented in a widely used library called SEAL (2020), actively developed by Microsoft. The CKKS plaintext space is \({\mathbb {C}}^{n/2}\) for some power-of-two integer n, allowing us to compute on encrypted rational weights in the query and document vectors.

Cheon et al. (2017) proposed plaintext encoding/decoding methods to deal with the complex plaintext vector efficiently, which exploits a ring isomorphism:

$$\begin{aligned} \phi :{\mathbb {R}} [X]/(X^{n}+1)\rightarrow {\mathbb {C}}^{n/2} \end{aligned}$$

Encoding Let \(V=(v_{1},v_{2},\ldots ,v_{n/2})\in {\mathbb {C}}^{n/2}\) be a plaintext vector and \(\varDelta >1\) a scaling factor, the plaintext vector is encoded as a polynomial \(m(X)\in R:={\mathbb {Z}} [X]/(X^{n}+1)\) by computing \(m(X)=\lfloor \varDelta \cdot \phi ^{-1}(V)\rceil \in R\) where \(\lfloor \cdot \rceil\) denotes the coefficient-wise rounding function.

Decoding Given the message polynomial \(m(X)\in R\) and a scaling factor \(\varDelta >1\), the message polynomial is decoded to a complex vector \(V=(v_{1},v_{2},\ldots ,v_{n/2})\in {\mathbb {C}}^{n/2}\) by computing \(V= \varDelta ^{-1} \cdot \phi (m(X)) \in {\mathbb {C}}^{n/2}\).

The scaling factor \(\varDelta\) controls the error of encoding/decoding caused by the rounding process.
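The following toy sketch illustrates only the scale-and-round part of the encoding and the role of \(\varDelta\); a real CKKS encoder also applies the isomorphism \(\phi\), which is omitted here, and the chosen scaling factor is arbitrary.

```python
# Toy illustration of the scale-and-round idea behind CKKS encoding only; a real
# encoder also maps C^{n/2} into R[X]/(X^n + 1) through the ring isomorphism phi.

DELTA = 2 ** 40                                   # scaling factor Delta

def encode(values):
    return [round(v * DELTA) for v in values]     # m = round(Delta * v)

def decode(coeffs):
    return [c / DELTA for c in coeffs]            # v is approximately m / Delta

v = [0.12345678, 3.14159265]
print(decode(encode(v)))  # close to v; rounding error bounded by 1 / (2 * Delta)
```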

SEAL (2020) implements the CKKS scheme algorithms: KeyGen, Encrypt, Decrypt, Add, and Multiply, among others. For a positive integer q, let \(R_{q}:=R/qR\) be the quotient ring of R modulo q. Let \(\chi _{s}\), \(\chi _{r}\) and \(\chi _{e}\) be distributions over R which output polynomials with small coefficients. These distributions, the initial modulus Q, and the ring dimension n are predetermined as encryption parameters before the key generation phase.

  • KeyGen:

  • Input:

  • - Sample a secret polynomial \(s \leftarrow \chi _{s}\)

  • - Sample a (resp. \(a'\)) uniformly at random from \(R_{Q}\) (resp. \(R_{PQ}\)), and \(e,e'\leftarrow \chi _{e}\)

  • Output: secret key, public key, and evaluation key

  • $$\begin{aligned}&sk\leftarrow (1,s)\in R_{Q}^{2} \\&pk\leftarrow (b=-a\cdot s+e,a)\in R_{Q}^{2} \\&evk\leftarrow (b'=-a'\cdot s+e'+P\cdot s^{2},a')\in R_{PQ}^{2} \end{aligned}$$
  • Encrypt:

  • Input:

  • - an ephemeral secret polynomial \(r\leftarrow \chi _{r}\)

    - a message polynomial \(m\in R\),

  • Output: a ciphertext (ct)

  • \(ct\leftarrow (c_{0}=r\cdot b+e_{0}+m,c_{1}=r\cdot a+e_{1})\in R_{Q}^{2}\)

  • Decrypt:

  • Input: a ciphertext \(ct\in R_{q}^{2}\),

  • Output: a message (\(m'\))

  • \(m'\leftarrow \langle ct,sk\rangle ({\text {mod }}q)\).

  • Add:

  • Input: 2 ciphertexts \(ct_a\) and \(ct_b \in R_{q}^{2}\),

  • Output: a ciphertext (\(ct_{add}\))

  • \(ct_{add} \leftarrow ct_a + ct_b \in R_{q}^{2}\).

  • where \(Dec(sk,ct_{add})\approx Dec(sk,ct_a)+Dec(sk,ct_b)\).

  • Multiply:

  • Input: 2 ciphertexts \(ct_a=(ca_0,ca_1)\), \(ct_b=(cb_0,cb_1) \in R_{q}^{2}\),

  • Output: a ciphertext (\(ct_{mult}\))

  • \((d_0,d_1,d_2) = (ca_0 cb_0,\ ca_0 cb_1 + ca_1 cb_0,\ ca_1 cb_1) \pmod q\)

  • \(ct_{mult} \leftarrow (d_0,d_1)+\lfloor P^{-1}\cdot d_{2}\cdot evk\rceil \in R_{q}^{2}\).

  • where \(Dec(sk,ct_{mult})\approx Dec(sk,ct_a) \cdot Dec(sk,ct_b)\).

Other functions are also available in SEAL to reduce the ciphertext noise after operations (Add, Multiply), such as Rescaling and Bootstrapping, as well as MultiplyMany and AddMany, which multiply together or add together several ciphertexts in an optimal way. SEAL also provides functions such as \(AddPlain(ct, m_{add})\) and \(MultiplyPlain(ct, m_{mult})\) that, given a ciphertext ct encrypting a plaintext polynomial m and unencrypted plaintext polynomials \(m_{add},m_{mult}\), output encryptions of \(m+m_{add}\) and \(m \cdot m_{mult}\), respectively. For more details, we refer the reader to the documentation available at the SEAL (2020) library link.

We also use a comparison function \(Comp(ct_1, ct_2)\) to compare two encrypted numbers. This function is a new approximate representation of the comparison function with a rational function, proposed by Cheon et al. (2019b) and improved in Cheon et al. (2019a) with optimal asymptotic complexity and an acceleration method. The improved comparison algorithm requires only \(O(log(1/\epsilon ))+O(log(\alpha ))\) computational complexity to obtain an approximate comparison result of \(a,b \in [0,1]\) satisfying \(|a-b| \ge \epsilon\) within \(2^{-\alpha }\) error.

We implement this comparison function in a sorting algorithm Sort(EIF, N), executed on the server-side to sort documents by their interest score \(I_{score}\) and return the Top-N relevant ones. The sorting algorithm takes as input the encrypted index file EIF, containing in each row the following encrypted document data: \(D_{Title},D_{url},D_{vec},I_{score}\) (Sect. 5.1). The \(I_{score}\) reflects the dot product of the document \(d_i\)'s vector with a given query vector Q. The following algorithm describes the encrypted search process using SEAL and the above-mentioned algorithms and functions:

Algorithm 2 Encrypted search

Setting the encryption parameters \(Enc_{param}\) and running KeyGen are operations executed in the preparation phase on the client-side. The encryption agent then communicates \(Enc_{param}\), pk, and evk to the server to establish secure communication. On the other side, the server prepares an index file IF in the dataset indexing process. Next, the IF is encoded and encrypted using the public key pk. These operations are executed in the preparation phase. In the encrypted search process, once the user query is prepared and expanded as a vector, the encryption agent encodes and encrypts this query vector using the public key pk. The encrypted query vector is then sent to the server. The search program on the server-side loads a copy of the index file IF, encrypted with the same public key. In this file, each document in the dataset is represented by an encrypted document vector \(D_{vec}\). The search program then computes the dot product of the query vector and each document vector. This encrypted value is stored in the \(I_{score}\) column for each document and used next by the Sort(EIF, N) function to sort documents based on the interest score and return the top N rows. The returned encrypted results are then sent back to the encryption agent for decryption with the secret key sk and decoding. These results are then processed by the re-ranking agent as described next (Sect. 4.4).
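The sketch below outlines this flow of Algorithm 2; the fhe helper object is a hypothetical placeholder standing in for the SEAL/CKKS primitives named above (Encrypt, Multiply, AddMany, the Comp-based sort, Decrypt), not real SEAL calls, and the index-file layout follows Sect. 5.1.

```python
# End-to-end sketch of Algorithm 2 (encrypted search). The fhe.* operations are
# hypothetical placeholders for the corresponding homomorphic primitives.

def encrypted_search(Q_expanded, enc_index_file, N, fhe):
    # client-side: encode and encrypt the expanded query vector with pk
    enc_query = fhe.encrypt(Q_expanded)

    # server-side: homomorphically compute the dot product of the encrypted query
    # with each encrypted document vector D_vec and store it in I_score
    for row in enc_index_file:            # rows hold D_Title, D_url, D_vec, I_score
        products = fhe.multiply(enc_query, row["D_vec"])  # slot-wise products
        row["I_score"] = fhe.sum_slots(products)          # encrypted dot product

    # server-side: Sort(EIF, N) using the encrypted comparison function Comp,
    # then keep only the Top-N rows
    top_rows = fhe.sort_by_encrypted_score(enc_index_file)[:N]

    # client-side: decrypt and decode the returned rows with the secret key sk
    return [{"title": fhe.decrypt(r["D_Title"]),
             "url": fhe.decrypt(r["D_url"]),
             "score": fhe.decrypt(r["I_score"])} for r in top_rows]
```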

Security proof The CKKS scheme's indistinguishability under chosen-plaintext attack (IND-CPA) rests on the hardness assumption of the ring learning with errors (RLWE) problem, the ring variant of the promising lattice-based hard problem learning with errors (LWE). For more details, the bit security of the CKKS scheme against known attacks was estimated with Albrecht (2017)'s LWE estimator.

4.4 Re-ranking agent

The PIR server returns a set of N results relevant to the query vector, as shown in Fig. 6. After the decryption step, the returned set of results \(R_s\) enters a re-ranking process. We use both the STP and the LTP to compute an interest score for each document.

Before computing the new \(I_{Score}\) for each result, we use a function Merge(STP, LTP) to merge the concepts in the STP and LTP. In the end, we sort the results based on the new \(I_{Score}\) with the Sort() function.

Algorithm 3 Results re-ranking
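A minimal sketch of Algorithm 3 is given below, reusing the cos_sim function from the query-expansion sketch; modeling Merge(STP, LTP) as a concatenation and taking the new \(I_{Score}\) as the average similarity to the merged profile are our assumptions, since the text above only states that formula (2) is used.

```python
def merge(stp_vectors, ltp_vectors):
    # Merge(STP, LTP) modeled as a simple concatenation of concept vectors
    return stp_vectors + ltp_vectors

def rerank(results, stp_vectors, ltp_vectors):
    """results: list of (document_vector, metadata) pairs decrypted by the client."""
    profile = merge(stp_vectors, ltp_vectors)
    scored = []
    for doc_vec, meta in results:
        # assumed scoring: average cosine similarity to the merged profile concepts
        i_score = sum(cos_sim(c, doc_vec) for c in profile) / len(profile)
        scored.append((i_score, meta))
    # Python's built-in sort replaces the paper's Quicksort; both are O(n log n)
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored
```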

After presenting the personalized results to the user, the profiling agent collects the chosen documents along with the user query and updates the user profile.

5 Proof of concept implementation

We implement our model as a web application based on the client–server collaborative structure, operating with a browser extension installed on the client-side. In the implementation phase, different measures are taken on both the server-side and the client-side.

5.1 Server-side measures

As a local server, we used a computer with an i5 processor (4 CPUs) with a clock speed of 2.5 GHz and 12 GB of RAM, running a Linux server environment.

Dataset preparation We chose DBpedia (a structured and semantic version of Wikipedia) as the data source for our system due to its great richness of information. The version we used (DBpedia (2015)) contains descriptions of over 1.9 M concepts within the English version of Wikipedia, including titles, abstracts, and links to the corresponding documents in Wikipedia.

Indexing process The goal of the indexing process is to reduce the time needed to perform a search operation. It consists of five basic steps: tokenization, dropping stop words, lemmatization, term weighting, and index creation. Each document in the dataset is processed to generate a document vector \(D_{vec}=(w_1,w_2,\ldots ,w_n)\), where n is the number of distinct terms in the dataset and \(w_i\) is the weight of term i calculated using the well-known \(TF\times IDF\) formula. The result of this process is an index file IF containing in each row the following document data (a minimal sketch of the indexing process follows the list):

  • \(D_{Title}\): contains the document title.

  • \(D_{url}\): is the link to the Wikipedia page.

  • \(D_{vec}\): is the vector representation of the document.

  • \(I_{score}\): Interest score of the document for a given query.
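A minimal sketch of this process, assuming the preprocessing has already produced each document's term list and using the standard \(TF\times IDF\) weighting, is shown below; the column names match the index file IF described above.

```python
import math
from collections import Counter

def build_index(documents, vocabulary):
    """documents: list of dicts with 'title', 'url', and preprocessed 'terms'."""
    n_docs = len(documents)
    df = Counter()                          # document frequency of each term
    for d in documents:
        df.update(set(d["terms"]))
    index_file = []
    for d in documents:
        tf = Counter(d["terms"])
        d_vec = [(tf[t] / len(d["terms"])) * math.log(n_docs / df[t]) if df[t] else 0.0
                 for t in vocabulary]       # w_i = TF x IDF
        index_file.append({"D_Title": d["title"], "D_url": d["url"],
                           "D_vec": d_vec, "I_score": None})
    return index_file
```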

The encryption program implemented on the server-side is responsible for encrypting each element of the index file using the public key pk.

Homomorphically encrypted search is also handled by the encryption program implemented on the server-side, based on the SEAL (2020) library. As described in Algorithm 2, this program returns the top N documents based on the dot product of the query vector and the document vectors. We tested different values of N to evaluate its effect on the results' relevance and computation time.

Scalability A recent project called SparkFHE (2020) integrates Apache Spark with fully homomorphic encryption to enable efficient computation on large encrypted datasets. This project is being developed by the SpiRITlab, in collaboration with the Microsoft research team that develops SEAL (2020) and is working on a similar project called SparkSEAL. Although neither project (SparkFHE (2020) nor SparkSEAL) is officially released, the available version of SparkFHE (2020) can be tested and offers some examples. To test the effect of distributed computation on the performance of our PAPIR model, we implemented a second version of the previously mentioned encryption program using the SparkFHE project on our local server. The results are discussed in Sect. 6.

5.2 Client-side measures

The main components of our model are implemented on the client-side (as shown in Fig. 6). We reuse some features developed in our previous works:

Query processor In El-Ansari et al. (2017) we presented a natural language question-answering system over multiple knowledge bases (including DBpedia (2015)). We use a modified version of that system's question processing feature in the current query processor to process and transform natural language questions into a query vector. This component includes a query expansion feature (Algorithm 1) to personalize and expand the user query based on his profile.

Profiling agent This component uses the profiling method presented in El-Ansari et al. (2020a) to create dynamic user profiles reflecting each user's interests and preferences. It is developed as a browser extension to track user browsing activities, along with a graphical user interface (GUI) giving the user access to and full control over his profile to provide feedback, change, or update his privacy requirements regarding sensitive topics.

Encryption agent This component is responsible for protecting user data confidentiality in different states: at-rest, in-transit, and in-use. We implement the at-rest protection of the user profile using the AES algorithm with a 256-bit key, offering strong protection with low computational requirements. To protect the personalized user query vector and the search results both in-transit and in-use, we use homomorphic encryption, which showed promising results in one of our previous works Makkaoui et al. (2017). We use the CKKS scheme implemented in the open-source SEAL (2020) library, a recent, widely used library supporting efficient FHE and actively developed by the Microsoft research team. The key to a successful implementation with this library is choosing the right encryption parameters (\(Enc_{param}\) in Algorithm 2), which is an iterative task. These parameters are:

  • \(poly\_modulus\) : a polynomial \(x^n + 1\); n a power of 2;

  • \(coeff\_modulus\) : an integer modulus q which is constructed as a product of multiple distinct primes;

  • \(plain\_modulus\) : an integer modulus t;

  • \(noise\_standard\_deviation\) : a standard deviation \(\sigma\);

  • \(random\_generator\) : a source of randomness.

In most cases, developers only need to set the \(poly\_modulus\) degree n (1024, 2048, 4096, ..., 32768), the \(coeff\_modulus\) (total bit-length: 128-bit, 192-bit, or 256-bit), and the \(plain\_modulus\) (a positive integer \(t\ge 2\) of at most 60 bits).

Both \(random\_generator\) and \(noise\_standard\_deviation\) have good default values and in most cases do not need to be specified explicitly. The choice of encryption parameters affects the performance, capabilities, and security of the encryption scheme.
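For illustration only, a parameter set in this spirit could look as follows; the concrete values are assumptions for discussion, not the exact configuration used in our experiments.

```python
# Illustrative parameter set only (Sect. 6.1 reports raising the poly_modulus
# degree from 4096 to 8192 once SparkFHE was used); all other values are assumed.

ENC_PARAM = {
    "poly_modulus": 8192,              # n, a power of two
    "coeff_modulus_bit_length": 192,   # q built as a product of distinct primes
    "plain_modulus": 1 << 20,          # t, a positive integer of at most 60 bits
    "noise_standard_deviation": None,  # sigma: keep the library default
    "random_generator": None,          # keep the library default randomness source
}
```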

Re-ranking agent As described in Algorithm 3, this agent performs two operations: computing the new \(I_{score}\) based on the user profile (STP + LTP), and sorting the results based on the new \(I_{score}\). For the first operation, we use the cosine similarity measure in formula 2. The second operation is based on the Quicksort algorithm Hossain et al. (2020), also known as partition-exchange sort, with a time complexity of O(n log n).

6 Experiments and results discussion

This section describes the experiments and the corresponding results of evaluating our PAPIR system. In particular, we conduct a Cost/Efficiency analysis to compare the costs of implementing our model to the outputs it can achieve.

6.1 Cost analysis

We analyze the communication and computation cost of a single search operation on both the client-side and the server-side. Algorithms 1 and 3 both run separately on the client-side. Algorithm 1 computes the similarity between the query vector and the user profile concepts' vectors of the same dimension. The vectors' size is the number of distinct terms in the dataset, which can be considered constant, while the number of concepts in the STP is not. Therefore, the time complexity of the first algorithm is O(n) in the number of STP concepts, meaning that it is a linear-time algorithm.

Algorithm 3, as mentioned before, is based on the Quicksort algorithm with a time complexity of O(n log n), making it a linearithmic-time algorithm.

Estimating the run time of Algorithm 2 is different, since it is executed on both the client-side and the server-side. The computation cost of this algorithm depends on several variables, including the encryption parameters. From our experiments, the computation time is mostly affected by the \(poly\_modulus\) degree.

Fig. 7 Effect of the PolyModulus degree on the computation time of the four main algorithms in the CKKS scheme

We investigated the computation time of the main CKKS scheme algorithms used in Algorithm 2 (Encrypt, Decrypt, Add, and Multiply) under different \(poly\_modulus\) degree values. The results are shown in Fig. 7. While using a larger \(poly\_modulus\) degree n decreases performance, it also increases the security level. Therefore, finding a balanced value is important for an efficient and secure implementation of the CKKS scheme.

We noticed a significant improvement in the computation time of these algorithms when using SparkFHE (2020), as shown in Fig. 8. This result allowed us to increase the \(poly\_modulus\) degree value in our implementation from 4096 to 8192, reinforcing the security level of our algorithm.

Fig. 8 Effect of the PolyModulus degree on the computation time of the four main algorithms in the CKKS scheme using SparkFHE

Considering the implementation settings and the hardware limitations, the overall computation cost is reasonable. The average response time for a single user query is 1.36 s, which is acceptable given that the implementation ran on a local machine with limited hardware capabilities.

6.2 Efficiency analysis

Our model aims to enhance the search accuracy and results' relevance for each user based on his profile while preserving the user's privacy. To evaluate the system's efficiency, we conducted a series of A/B testing experiments with heuristic evaluation. We recruited 35 participants (mainly colleagues and students), who were asked to browse freely for several sessions to construct user profiles and specify their privacy requirements, and then to use different system versions to perform search operations and provide feedback on the search results and the user profile. The collected feedback data are used to analyze the system's accuracy and privacy protection.

6.2.1 Utility evaluation

Another important parameter in Algorithm 2 is N, representing the Top-N documents retrieved from the server. This number affects both performance and accuracy, since a higher value increases the computation cost and a lower one decreases the accuracy. To evaluate the search accuracy and relevance, we use three well-known metrics in the IR field: precision, recall, and F-measure.

Precision reflects the fraction of retrieved documents that are relevant to the user query:

$$\begin{aligned} Precision = \frac{{Relevant\_docs} \ \cap \ {Retrieved\_docs}}{Retrieved\_docs} \end{aligned}$$

Recall is the fraction of the relevant documents that are successfully retrieved:

$$\begin{aligned} Recall = \frac{{Relevant\_docs} \ \cap \ {Retrieved\_docs}}{Relevant\_docs} \end{aligned}$$

F-measure is the harmonic mean of precision and recall:

$$\begin{aligned} F = 2 \times \frac{{Precision} \times {Recall}}{Precision + Recall} \end{aligned}$$
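These metrics translate directly into code; the sketch below assumes the relevant and retrieved documents are given as sets of document identifiers.

```python
def precision(relevant, retrieved):
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant, retrieved):
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def f_measure(relevant, retrieved):
    p, r = precision(relevant, retrieved), recall(relevant, retrieved)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```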

In a first experiment, we evaluated the effect of N on the F-measure value, as shown in Fig. 9.

Fig. 9 Effect of N (in Top-N) on the F-measure

Based on the first experiment's results, we conducted a second A/B testing experiment in which we proposed two system versions to the subjects:

A: the PAPIR-based system with Top-N = 50.

B: a personalization-free system in which Algorithm 1 is disabled.

Based on the subjects’ feedback, we computed the average precision, recall, and F-measure value for both versions, as shown in Table 2.

Table 2 Accuracy A/B testing results

These initial experiments show that our model significantly improves the accuracy and relevance of the search results, as reported in the above table.

6.2.2 Privacy loss evaluation

To test the privacy protection of our model, we conduct an experiment based on an implausible, worst-case assumption: suppose that the homomorphic encryption security is somehow broken and an attacker can retrieve the secret key and decrypt the user queries. How much sensitive data would the attacker (or the server) be able to collect?

To answer this question, we disabled the encryption agent and asked the subjects to perform search operations. We then mapped the accumulated search queries of every user against the dataset and the user profile, aiming to identify the average number of sensitive, relevant, and irrelevant concepts an attacker could collect after numerous browsing sessions (Fig. 10).

Fig. 10 Privacy loss evaluation results

Even without the homomorphic encryption layer, our model reduces the amount of sensitive data that an outsider can collect, because the personalization process in our model is performed in two separate steps: query expansion and re-ranking. The first one only uses the STP and transforms the user query into an expanded vector. This form also complicates the task of an attacker trying to eavesdrop and collect user data.

Overall, the initial experiments demonstrate the feasibility and efficiency of our model for privacy protection in personalized information retrieval systems.

7 Conclusion and future work

The work presented in this paper focuses on protecting the user's privacy by protecting the sensitive data in his profile. The model we propose counters various privacy threats on different levels. The use of homomorphic encryption as an extra protection layer reinforces the user's privacy protection, which remains one of the major issues in the field of personalized IR systems. Experimental results demonstrate the feasibility and efficiency of our model.

A personalized IR system must ensure privacy protection to earn users' trust. Otherwise, only a minority of users will use it: those to whom the personalized experience is more important than their privacy.

In the future, we plan to improve the implementation of our model by taking into consideration large-scale datasets. We are also interested in building a fully HE library adapted to Big Data and Cloud technologies.