1 Introduction

Ubiquitous computing has given rise to vast quantities of data, with some sources suggesting that 2.5 quintillion bytes of data are created daily (see Footnote 1). These enormous volumes of data have been used to fuel a growing data economy, with many companies taking advantage of the opportunities that can flow from leveraging fine-grained consumer personalisation. In concert with these opportunities there are challenges, most obviously in the guise of threats to privacy in the context of a market where personal data is traded as a commodity. The acknowledgment of such threats has led to an evolution of privacy guidelines and requirements in the online space, driven by manifestations of data management legislation such as the EU’s General Data Protection Regulation (GDPR; see Footnote 2).

Despite increasing concerns over data management with reference to privacy and human rights, the topic of metadata has remained relatively unexplored. (In the context of this paper we consider metadata to be any data serving to provide contextual or otherwise additional information about other data.) While the definition of online metadata has remained relatively constant, the information to which it pertains has become far more complex, its volume has become greater, and its specificity has, for many, become more worrisome. Online metadata is routinely overlooked since it is often generated without users’ explicit consent and awareness due, in part, to the unfavourable online conditions of dark patterns [1, 2], privacy-averse designs [3], and general privacy fatigue [4]. As a result, there is often unregulated collection of such metadata, which can have troubling, and potentially disastrous, implications for an individual’s privacy.

A single instance of metadata is often trivialised as being harmless, since, in isolation, it may not compromise an individual’s privacy. However, it is well understood that, when aggregated with other instances of metadata, sensitive details may be inferred or reidentification may occur. This is akin to the problem of jigsaw re-identification [5] in the context of publicly available data, though, in this case, the reidentification may be undertaken by data custodians, data brokers, or third-party organisations. Notably, the issues of concern extend far beyond reidentification due to the magnitude of metadata collection and consequent analysis in the online realm. To this end, we find that the breadth, pervasiveness and implicitness of metadata has led to online browsing habits shaping an extremely detailed trail of online activities via the passive digital footprints [6] that are unknowingly left behind. Such digital footprints that are accumulated by means of dataveillance have led to inferences on identity elements that expose user identity while rendering these users powerless in the management of their personal privacy.

At present, there is no single unifying approach to identity management as it relates to this problem. As such, we leverage solutions to related problems to give rise to formal models of access control with a view to applying them to metadata and identity management. In doing so, we build upon the work of the Solid data decentralisation project (see Footnote 3), which aims to preserve personal privacy on a decentralised Web via Access Control Lists. We hope that, by building upon the Solid project’s foundation, we may be able to formalise our specific problem.

It is worth noting that, while the access control ‘problem’ is well defined and well understood [7], things become more complicated when we start to consider broader privacy concerns: considerations of beneficiary and ownership can cause complex issues [8], not least because it is often the case that, despite (for example) legal obligations, the data-holder will not typically have a vested interest in the protection of privacy.

Deviation from ‘traditional’ policy requirements has seen the introduction of a variety of novel approaches to access control. For example, in [9] Fernandez et al. report upon a framework for secure data collection whereby users are able to specify data management rules through the extension of the Category-Based Access Control metamodel [10]. Category-Based Access Control (CBAC) adopts a foundational notion of access control, by considering notions of categories in order to provide flexible access control concepts, which can be specialised for particular needs. In particular, the logical foundations of this approach enable the analysis and verification of policy properties. We argue that utilising such an approach should enable one to reason about the problem at hand. Thus, we present a model of CBAC in terms of the Z notation [11, 12], the mathematical language of which is based upon typed set theory and first-order predicate logic. In addition, the schema language of Z allows us to combine aspects of a model in a way that makes reuse and composition of aspects relatively straightforward and also allows us to reason about evolution of states.

The choice of Z was influenced, in part, by the fact that its notation is relatively straightforward to understand and that its underlying logical structures have much in common with those of relational databases. In addition, it has been used previously in describing models of access control (see, for example, [13]).

Given the foundations provided by the CBAC framework and the Solid decentralisation project, we frame our research problem thus:

How can we leverage a Solid-style approach to data ownership together with formal models of data access to aid in the protection of data subjects’ privacy, so as to mitigate inference-driven identity exposure from metadata collection?

2 Background and Motivation

2.1 Metadata and Dataveillance

Metadata, in this era of pervasive computing, has become infrastructural [14] in the manner in which it invisibly supports almost all interactions in the online realm, leaving a distinct digital trail for any given user. This digital trail is made up of digital footprints [6] from user interaction. Digital footprints can be thought of as identifying features that are used to infer elements or attributes of an identity, or are themselves identity elements or attributes, where identity elements are defined to be single pieces of information that are indicative of an identity [15].

These digital footprints are of two types: those of a passive nature, such as metadata, and those which might be described as active, such as a Facebook post or a tweet. Passive digital footprints consist primarily of metadata created unintentionally by user activities without the user’s explicit consent or knowledge. Collection of this passive data is particularly disconcerting due to the nature of online surveillance, or dataveillance [16].

The phenomenon of dataveillance has, in part, motivated the race for storage, analysis and creation of large data volumes whilst simultaneously degrading user privacy and exerting control in intimate settings. Moreover, dataveillance has played a pivotal role in the inference of identity or digital personas, where, as early as the 1990s, there were indications of organisations’ ability to create digital personas [17].

Today, dataveillance’s role in big data has become drastically more pervasive, and, hence, so too have these inferred digital personas. In fact, in this surveilled online environment, with continuous monitoring in the form of metadata, many truthful identity elements can be inferred without the user needing to explicitly express them online [18], which exposes users’ identity information.

With regard to identity management in the offline space, the multiplicity of identity is naturally explored through context, where a single individual’s identity is made up of numerous aspects that they may choose to reveal. This allows individuals to refrain from sharing certain details or identifiers outside of settings that they deem appropriate. Similarly, the contexts in the online space reflect one’s self-presentation, whereby individuals are free to choose how they wish to express themselves to others across a multitude of platforms. However, as it pertains to metadata in the online space, this multiplicity struggles to adapt to a persistent, non-forgetting terrain that often lacks context despite the aggressive aggregation and sharing. This is epitomised by revelations that a unique persona can always be inferred for individuals, regardless of the differing personas maintained by these individuals across the online space.

This problem is only set to worsen as technology becomes further intertwined with individuals’ lives: with, for example, the increased ubiquity of the IoT and ‘smart’ devices, the agglomeration of generated metadata will only expand further. Therefore, regulation and access control mechanisms are needed in this area to avert privacy violations that threaten human rights and to ensure the metadata collected is not easily exploitable.

2.2 Data Ownership

Privacy is an all-encompassing, ever-evolving, fluid notion that has enjoyed (and continues to enjoy) many definitions [19]. The varying definitions of privacy can, in part, be seen as a result of discrepancies between individuals’ own conceptions, the varying cultures of belonging, and economic and legal practices. Despite these variances, privacy is viewed by many as an important factor in human life (to some, a human right) and “[over] 130 countries have constitutional statements regarding the protection of privacy” [20].

As far as we are concerned in this paper, privacy can be viewed as

A limitation on access to self

(This is reminiscent of earlier philosophical definitions, such as those of [21] and [22].)

Defining privacy in this fashion allows us to move towards formulating the problem in terms of access control: if we consider self as a collection of data objects, we can begin to see how access control may be applied to this problem. However, while it may be a nice intellectual exercise to reason about access to data objects as a method of protecting individuals’ privacy, there is a clear roadblock in the form and nature of today’s Internet, whereby users typically have no control over their own data and, consequently, no control over managing access to it.

Of course, data ownership has long been a source of debate [23, 24]. Initial hopes for the Web narrated a vision of radical decentralisation and freedom [25], yet its incredible growth has been plagued with centralisation, so that, decades later, we see a significant proportion of traffic on the Web flowing through a comparatively minuscule set of corporations.

The disadvantages of such centralisation follow a similar path to the pitfalls of corrupt central entities, such as corrupt governments and companies, whereby we find the Web today rife with surveillance, data breaches, privacy loss and manipulation. Indeed, a “few large companies now own important junctures of the Web, and consequently a lot of the data created on [it]” [26]. Thus, while individuals largely remain the source of their data, this does not translate into automatic data ownership. Of course, data ownership becomes even more complex when one considers, for example, the transfer of data to third parties, the analysis of data aggregates from various sources, etc.

2.3 An Alternative Approach

A number of projects (including those described by [27] and [26]) have begun to investigate methods for users to retain (or regain) ownership, as well as consequent control, of their data, through the introduction of decentralised architectures that can be implemented on top of the current Internet.

The Web, as originally formulated, was intended to facilitate easy data sharing between researchers across the Internet in a largely decentralised manner. However, the Web today, as we have argued, tends towards centralisation, with data being controlled and processed by a small number of large companies. To counter the resultant privacy dilemmas, the inventor of the Web, Tim Berners-Lee, and colleagues began work on Solid [28]—a distributed data decentralisation project that seeks to enable information sharing in a privacy-preserving manner.

In the Solid approach, data generated by a user is written to their own individual ‘pods’ (with a pod being a ‘personal online data store’). These pods may be hosted wherever the user chooses, and the user can then authorise granular access to the pod for other parties as they please. This means that authenticated applications are allowed to request data, assuming that the user has given the particular application permission.

At this point, it is worth noting that other solutions (such as those described in [29] and [30]) offer approaches to decentralised access control via blockchain technology. However, our motivations—including the consideration of aggregation of metadata and the automatic evolution of access control policies—have led us to conclude that developing a solution based on the Solid approach is appropriate for our needs.

For managing control over data, Solid utilises Web Access Control (WAC) Lists. These Web Access Control Lists have much in common with the Access Control Lists generally used in Discretionary Access Control [31] policies.

Users and groups are attached to URIs, or WebIDs; resources are identified by URLs that may refer to web documents or resources. To handle permissions, these latter resources are accompanied by a set of Authorisation statements that describe:

  • which agents have access to the resource, and

  • what type, or mode, of access the given agent has.

These Authorisations are placed into separate WAC documents called Access Control List Resources (ACLs) with the permissions of the ACL resource stored in a Linked Data [32] format. Linked Data can be described as typed links that enable explicit connections to be made when necessary. Thus, we may have links between different users’ data pods, as depicted in Fig. 1.

Fig. 1. Links between users’ data
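The Authorisation statements described above (which agents, which mode of access, on which resource) can be sketched as a small data structure. This is only an illustrative Python sketch: the four mode names follow the WAC vocabulary, but the class name, the helper function and the example WebIDs and URLs are hypothetical and are not part of Solid.

```python
from dataclasses import dataclass

# Access modes named in the WAC vocabulary; everything else in this
# sketch (class name, helper, URIs) is hypothetical.
MODES = {"Read", "Write", "Append", "Control"}

@dataclass(frozen=True)
class Authorisation:
    agents: frozenset   # WebIDs of the agents granted access
    resource: str       # URL of the protected resource
    modes: frozenset    # subset of MODES

def is_allowed(acl, agent, resource, mode):
    """True if some Authorisation in the ACL grants `mode` on `resource` to `agent`."""
    return any(
        agent in auth.agents and auth.resource == resource and mode in auth.modes
        for auth in acl
    )

alice = "https://alice.example/profile/card#me"
bob = "https://bob.example/profile/card#me"
photo = "https://alice.example/photos/1.jpg"

# Alice controls her own resource; Bob may only read it.
acl = [
    Authorisation(frozenset({alice}), photo, frozenset({"Read", "Write", "Control"})),
    Authorisation(frozenset({bob}), photo, frozenset({"Read"})),
]
```

In this sketch access is denied by default: an agent with no matching Authorisation is refused.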

By adapting the Solid philosophy and principles, there is scope for reasoning about the ‘metadata problem’ in terms of access control—due to the fact that a fundamental tenet of Solid is that users control access to their own data.

Crucially, though, Solid by itself will be insufficient. With regard to metadata, it is almost inevitable that the pods will not be a sufficient method of protection (as it is usual for metadata to appear ‘harmless’). Metadata typically becomes useful as a result of inferences from aggregations. As such, Solid provides the intellectual foundations upon which we build, but we need to go further in terms of reasoning about access. It is the notion of Category-Based Access Control (CBAC) [10] that allows us to do that.

3 CBAC

There is undeniably a trend with regard to the development of novel access control models to handle use cases that are perceived to be new and/or unique. Often these novel approaches have much in common with each other or with previous approaches. An alternative to re-inventing the wheel on a regular basis is to consider a more general, ‘primitive’ notion upon which new models can be built. The meta-model for Category-Based Access Control (CBAC) [33] is such a notion.

Category-Based Access Control (CBAC) was developed to provide flexible access control concepts that can be specialised for particular needs [33]. With regard to our problem of interest, CBAC has the potential to provide the foundations for a model that allows one to reason about autonomous changes to permissions.

In CBAC, permissions are assigned to categories of users (as opposed to individual users), and users acquire permissions through their membership of categories. Categories can be defined on the basis of, for example, roles and resources, as well as attributes and geographical constraints, so that permissions can change when a user attribute or geographical location changes, without intervention from an administrator. This would be decidedly beneficial in our case regarding thresholds for data aggregation and inferences, whereupon a permission can be withdrawn.
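The idea that permissions follow category membership, and that membership can follow attributes, can be illustrated with a minimal Python sketch. The attribute names, category names and permissions below are invented for the example; only the mechanism (permissions reached via categories) comes from the text.

```python
def categories_of(principal):
    """Derive category membership from a principal's current attributes."""
    cats = set()
    if principal["role"] == "analyst":
        cats.add("analyst")
    if principal["location"] == "EU":
        cats.add("eu_based")
    return cats

# Hypothetical permission-category assignment: category -> {(action, resource)}
ARCA = {
    "analyst": {("read", "usage_stats")},
    "eu_based": {("read", "consent_log")},
}

def permissions_of(principal):
    """A principal holds exactly the permissions of the categories it belongs to."""
    perms = set()
    for c in categories_of(principal):
        perms |= ARCA.get(c, set())
    return perms

p = {"role": "analyst", "location": "EU"}
before = permissions_of(p)
p["location"] = "US"          # an attribute changes; no administrator acts
after = permissions_of(p)     # the location-based permission is gone
```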

The CBAC meta-model works on interactions between the following sets [10]:

  • A countable set C of categories, where \(c_0\), \(c_1\), etc. denote arbitrary category identifiers.

  • A countable set P of principals, where \(p_0\), \(p_1\), etc. denote principals.

  • A countable set A of named atomic actions, where \(a_0\), \(a_1\), etc. denote arbitrary action identifiers.

  • A countable set R of resource identifiers, where \(r_0\), \(r_1\), etc. denote arbitrary resources.

  • A countable set S of situational identifiers, where \(s_0\), \(s_1\), etc. denote possible situations that may occur in the system.

  • A countable set E of event identifiers, where \(e_0\), \(e_1\), etc. denote possible events that may happen in the system.

With respect to permissions and authorisations:

  • A permission is an ordered pair \((a, r) \in A \times R\), consisting of an action, \(a \in A\), and a resource, \(r \in R\).

  • An authorisation is a triple, \((p, a, r) \in P \times A \times R\), which associates a permission with a principal, \(p \in P\).

Continuing, the meta-model of [10] details the following relations:

  • Principal–category assignment: \(PCA \subseteq P \times C\), where \((p,c) \in PCA\) if, and only if, the principal \(p \in P\) is assigned to the category \(c \in C\).

  • Permission–category assignment: \(ARCA \subseteq A \times R \times C\), where \((a,r,c) \in ARCA\) if, and only if, action \(a \in A\) on resource \(r \in R\) can be performed by principals associated with the category \(c \in C\).

  • Authorisations: \(PAR \subseteq P \times A \times R\), where \((p,a,r) \in PAR\) if, and only if, the principal \(p \in P\) can perform the action \(a \in A\) on the resource \(r \in R\).

In the following, we shall use ‘syntactic sugar’ of the form \(pca(p, c)\), \(arca(a, r, c)\) and \(par(p, a, r)\) to capture these concepts.

Subsequently, Barker [10] determines that the set of \(par(p, a, r)\) facts that hold with respect to the specification of a particular access control policy may be expressed in first-order terms thus:

figure a

A further relationship, \(\rho \), captures the notion of the existence of a relationship, such as inclusion, holding between categories:

figure b
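One way to read the two definitions together is that a principal is authorised for a permission when it belongs to some category that is related, via \(\rho \), to a category holding that permission. The Python sketch below computes par under that reading. It is an assumption about how the pieces combine (in particular, \(\rho \) is taken to be reflexive), and the example principals and categories are invented.

```python
def derive_par(pca, arca, rho):
    """Assumed reading: (p, a, r) is in par iff there exist c, c2 with
    (p, c) in pca, (c, c2) in rho and (a, r, c2) in arca."""
    return {
        (p, a, r)
        for (p, c) in pca
        for (c1, c2) in rho
        if c1 == c
        for (a, r, c3) in arca
        if c3 == c2
    }

pca = {("alice", "manager"), ("bob", "employee")}
arca = {("read", "report", "employee"), ("approve", "report", "manager")}
# rho relates each category to itself, and "manager" to "employee".
rho = {
    ("manager", "manager"),
    ("employee", "employee"),
    ("manager", "employee"),
}
par = derive_par(pca, arca, rho)
```

Here alice obtains the read permission indirectly, through the relationship between the manager and employee categories.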

Although the model does not consider aspects such as sessions, delegations, denial of permissions or conflict resolution strategies, it is noted that, from this initial basis, these notions can naturally be accommodated [10]. This has been illustrated in practice by the aforementioned contribution of Fernandez et al. [9].

4 A Formal Model

As a first step, we have developed a formal model of the Category-Based Access Control meta-model, \(\mathcal {M}\) [10], in terms of the Z schema language [11, 12]. The intention is that we should be in a position to leverage and build upon this model as we move forward. As well as providing the necessary formal foundations, Z schemas provide a degree of flexibility—allowing us to add or remove optional elements (such as situational and event identifiers) in a relatively straightforward fashion. In this section we present key aspects of the formal model.

4.1 Types and Relations

Our interpretation of the meta-model \(\mathcal {M}\) captures the key components of the original presentation as faithfully as possible. For example, the original characterisation gives rise to six basic types: C, the set of categories; P, the set of principals/users; A, the set of actions; R, the set of resources; S, the situational identifier set; and E, the event identifier set. (These last two are optional for any particular instantiation.)

Thus, we have the following declaration:

figure c

As already discussed, we may consider permissions and authorisations thus:

  • A permission is an ordered pair \((a, r)\) consisting of an action \(a \in A\) and a resource \(r \in R\).

  • An authorisation is a triple \((p, a, r)\) that associates the constituent parts of a permission with a principal \(p \in P\).

We define the sets Perm (for permissions) and Auth (for authorisations) thus:

figure d

In the previous section, we discussed three relations: principal–category assignment, permission–category assignment, and authorisations. We capture these in our Z model thus.

figure e

Here, \(P \leftrightarrow C\) captures the set of all possible relations between the set P and the set C (i.e. the set of all sets of pairs appearing in the Cartesian product \(P \times C\)), with PCA being an abbreviation for this collection. The sets of relations, ARCA and PAR, are defined similarly: the former captures relations of type \(C \leftrightarrow Perm\); the latter captures relations of type \(P \leftrightarrow Perm\).
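To make "the set of all sets of pairs" concrete, the following Python sketch enumerates every relation between two small finite sets. The function name and the example sets are illustrative only; the basic types of the model are not required to be finite.

```python
from itertools import chain, combinations, product

def all_relations(xs, ys):
    """Every subset of the Cartesian product xs × ys, i.e. every relation."""
    pairs = list(product(sorted(xs), sorted(ys)))
    subsets = chain.from_iterable(
        combinations(pairs, k) for k in range(len(pairs) + 1)
    )
    return [frozenset(s) for s in subsets]

P = {"p0", "p1"}
C = {"c0"}
PCA = all_relations(P, C)   # 2 possible pairs, hence 2**2 = 4 relations
```

A particular principal–category assignment, such as `{("p0", "c0")}`, is then one element of this collection.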

4.2 The MModel Schema

The schema Instance captures the ‘current’ categories, principals and permissions of the system under consideration and is defined thus.

figure f

Here, permset is a finite set of elements of type Perm; principalset is a finite set of elements of type P; and catset is a finite set of elements of type C.

The MModel schema is then defined as follows.

figure g

The inclusion of permset, principalset and catset via the Instance schema allows different instances of the metamodel to operate on different combinations of permissions, principals and categories. The sets pca, arca and par capture principal–category assignment, permission–category assignment and authorisations respectively.

The constraints are direct derivatives of the rules placed upon the sets as defined in the original model. The first six constraints simply restrict the domains and ranges of the pca, arca and par relations (via constraints that leverage the \(\mathop {\mathrm {dom}}\) and \(\mathop {\mathrm {ran}}\) operations). The final constraint captures the fact that the par relation can be viewed as the relational composition of pca and arca: a pair \((p_1,p_2)\), for some \(p_1 \in P\) and some \(p_2 \in Perm\), appears in par if, and only if, there is some \(c \in C\), such that \((p_1,c) \in pca\) and \((c,p_2) \in arca\).
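The seven constraints can be checked mechanically on a concrete state. The Python sketch below models relations as sets of pairs and assumes the six domain and range constraints take the obvious form (each relation draws only on the current sets of principals, categories and permissions). The helper names and the example state are illustrative.

```python
def dom(rel):
    return {x for (x, _) in rel}

def ran(rel):
    return {y for (_, y) in rel}

def compose(r, s):
    """Relational composition: (x, z) whenever (x, y) in r and (y, z) in s."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

def mmodel_invariant(permset, principalset, catset, pca, arca, par):
    return (
        dom(pca) <= principalset and ran(pca) <= catset
        and dom(arca) <= catset and ran(arca) <= permset
        and dom(par) <= principalset and ran(par) <= permset
        and par == compose(pca, arca)   # final constraint
    )

read_rota = ("read", "rota")
state = dict(
    permset={read_rota},
    principalset={"alice"},
    catset={"staff"},
    pca={("alice", "staff")},
    arca={("staff", read_rota)},
    par={("alice", read_rota)},
)
ok = mmodel_invariant(**state)
broken = mmodel_invariant(**{**state, "par": set()})  # par no longer pca ; arca
```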

4.3 Support for Modularity

To support modularity, we utilise two schemas. The first, which is concerned with the attributes of the Instance schema, is defined thus:

figure h

This schema is used in operations that are concerned only with the relations arca, par and pca.

The second such schema, ModelRetained, is used in operations that are concerned only with the sets of categories, principals and permissions.

figure i

4.4 Example Operations: Permissions

In category-based access control (CBAC), permissions are not assigned to individual users, but, instead, to categories of users. These categories can then be changed as necessary.

As a means of illustrating operations on our model, we present operations that are specifically concerned with explicit changes to the set of permissions (as opposed to operations concerned with changes made by category or principal assignments).

We start by defining a schema, UpdatePerms:

figure j

For any given instance of the model we want to be able to define a starting set of permissions on the set of actions and resources. As such, we utilise a schema, AllocatePerms, to allocate a chosen set of permissions to the particular instance:

figure k

The operation AddPerms allows the set of permissions to be augmented:

figure l

The operation RemovePerms captures the notion of permissions being removed via this route:

figure m

Here, ⩥ is the range co-restriction operator: in this case, all elements of the set allocation? are removed from the ranges of both par and arca. These updates are necessary as the removal of permissions from permset can impact upon both par and arca.
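The effect of removing permissions in this way can be sketched in Python, with relations as sets of pairs and the state as a dictionary. The function names and the example state are illustrative and are not taken from the schema itself.

```python
def ran_corestrict(rel, s):
    """Range co-restriction: drop every pair whose second element is in s."""
    return {(x, y) for (x, y) in rel if y not in s}

def remove_perms(state, allocation):
    """Remove the permissions in `allocation` from permset, arca and par."""
    return {
        **state,                                   # pca is unchanged
        "permset": state["permset"] - allocation,
        "arca": ran_corestrict(state["arca"], allocation),
        "par": ran_corestrict(state["par"], allocation),
    }

read_rota, edit_rota = ("read", "rota"), ("edit", "rota")
state = {
    "permset": {read_rota, edit_rota},
    "pca": {("alice", "staff")},
    "arca": {("staff", read_rota), ("staff", edit_rota)},
    "par": {("alice", read_rota), ("alice", edit_rota)},
}
new_state = remove_perms(state, {edit_rota})
```

Because the same permissions are removed from the ranges of both relations, par remains the composition of pca and arca after the update.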

4.5 Example Operations: Principals

In CBAC, for users to be granted permissions, they must first be assigned to categories. This results in operations on principals typically involving the set pca and, at times, par. As such, the schema UpdatePrincCat ensures the set arca remains unchanged.

figure n

In order to assign a principal to a particular category, we consider two cases. In the first, a principal is assigned to a category that does not have any assigned permissions (i.e. for a category c, it is the case that \(c \notin \mathrm {dom}\, arca\)). In the second, a principal is assigned to a category that already has assigned permissions (i.e. for a category c, it is the case that \(c \in \mathrm {dom}\, arca\)). As such, the schema AllocatePrincCat involves an \(\mathbf{if}\) clause with respect to the set of authorisations added to par.

figure o
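The two cases can be sketched in Python as follows. This is an illustration of the behaviour described in the text, with relations as sets of pairs; the function name and the example data are hypothetical.

```python
def allocate_princ_cat(pca, arca, par, p, c):
    """Assign principal p to category c; arca is left unchanged."""
    new_pca = pca | {(p, c)}
    if c in {cat for (cat, _) in arca}:      # c ∈ dom arca
        # p gains every permission currently assigned to c
        new_par = par | {(p, perm) for (cat, perm) in arca if cat == c}
    else:                                    # c ∉ dom arca
        new_par = par
    return new_pca, new_par

read_rota = ("read", "rota")
arca = {("staff", read_rota)}

# Case 2: "staff" already has permissions, so par grows.
pca1, par1 = allocate_princ_cat(set(), arca, set(), "alice", "staff")
# Case 1: "guests" has no permissions, so only pca changes.
pca2, par2 = allocate_princ_cat(pca1, arca, par1, "alice", "guests")
```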

To handle the complementary case—removal of a principal from a given category—we define RemovePrincCat. In order to remove a principal from a category, we must consider whether the category has assigned permissions: if the category has no assigned permissions, the only change made is to the set pca; otherwise, the set par may also change (as reflected by the IF clause).

figure p
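A Python sketch of the removal follows. It assumes that an authorisation is withdrawn only when no other category of the principal still grants the same permission, which is what keeps par equal to the composition of pca and arca. The names and data are illustrative.

```python
def remove_princ_cat(pca, arca, par, p, c):
    """Remove principal p from category c; arca is left unchanged."""
    new_pca = pca - {(p, c)}
    if c not in {cat for (cat, _) in arca}:   # c has no permissions
        return new_pca, par
    # Permissions p still holds through its remaining categories.
    still_granted = {
        perm
        for (q, c2) in new_pca if q == p
        for (cat, perm) in arca if cat == c2
    }
    lost = {perm for (cat, perm) in arca if cat == c} - still_granted
    return new_pca, par - {(p, perm) for perm in lost}

read_rota, edit_rota = ("read", "rota"), ("edit", "rota")
arca = {("staff", read_rota), ("managers", read_rota), ("managers", edit_rota)}
pca = {("alice", "staff"), ("alice", "managers")}
par = {("alice", read_rota), ("alice", edit_rota)}

new_pca, new_par = remove_princ_cat(pca, arca, par, "alice", "managers")
```

In the example, alice keeps the read permission because it is also granted through the staff category.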

4.6 Example Operations: Categories

Categories are, by definition, at the heart of CBAC. Categories can be characterised as a class of entities that share some property.

In our model, the set of categories, catset, captures the categories that exist in the current system. The schema AddCat allows us to add categories:

figure q

The schema RemCat is concerned with removing categories. Category removal needs to handle the cases where: the category has principals previously assigned; the category has no assigned principals; and the category has permissions assigned to it and, as such, can be found in arca.

figure r
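The three cases can be handled uniformly by removing the category from every relation in which it appears and then recomputing par. The Python sketch below takes that approach; it is one possible realisation of the behaviour described, not a transcription of the schema, and the names and data are illustrative.

```python
def rem_cat(catset, pca, arca, c):
    """Remove category c and every assignment that mentions it."""
    new_catset = catset - {c}
    new_pca = {(p, cat) for (p, cat) in pca if cat != c}          # range side
    new_arca = {(cat, perm) for (cat, perm) in arca if cat != c}  # domain side
    # Recompute authorisations as the composition of the updated relations.
    new_par = {
        (p, perm)
        for (p, c1) in new_pca
        for (c2, perm) in new_arca
        if c1 == c2
    }
    return new_catset, new_pca, new_arca, new_par

read_rota, view_menu = ("read", "rota"), ("view", "menu")
catset = {"staff", "guests"}
pca = {("alice", "staff"), ("bob", "guests")}
arca = {("staff", read_rota), ("guests", view_menu)}

new_catset, new_pca, new_arca, new_par = rem_cat(catset, pca, arca, "guests")
```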

Within a given category, members are associated with permissions that have been assigned to the category. As such, operations on categories often involve the sets arca and par. Thus, we define a schema, UpdateCatPerms, which ensures that pca remains unchanged:

figure s

We may consider the assignment of permissions to categories (as found in the set arca), where we consider how assignment of permissions applies to a category that does not have any assigned principals, i.e. for a category c, it is the case that \(c \notin \mathop {\mathrm {ran}}pca\). Here, we reason that the only set to be updated is arca and so \(v = \emptyset \). We may also consider the handling of category–permission assignment when the category already has assigned principals, i.e. for a category c, it is the case that \(c \in \mathop {\mathrm {ran}}pca\). Here, arca is updated, as is the set of authorisations: par must change to reflect an assignment of permissions to principals found in the given category.

figure t
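Both cases can be covered by a single computation of the set of new authorisations, which is empty when the category has no principals. In the Python sketch below, the variable v plays the role of the set v mentioned in the text; the function name and the data are illustrative.

```python
def add_cat_perms(pca, arca, par, c, perms):
    """Assign the permissions in `perms` to category c; pca is unchanged."""
    new_arca = arca | {(c, perm) for perm in perms}
    # v: authorisations gained by the principals already assigned to c.
    v = {(p, perm) for (p, cat) in pca if cat == c for perm in perms}
    return new_arca, par | v, v

read_rota, view_menu = ("read", "rota"), ("view", "menu")
pca = {("alice", "staff")}

# "guests" has no principals (c ∉ ran pca), so v is empty.
arca1, par1, v1 = add_cat_perms(pca, set(), set(), "guests", {view_menu})
# "staff" has a principal (c ∈ ran pca), so par is updated as well.
arca2, par2, v2 = add_cat_perms(pca, arca1, par1, "staff", {read_rota})
```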

The next schema provides a method to remove permissions from a category and update the authorisation set, par. In the first case, we deal with the removal of permissions from a category that does not have any principals assigned to it, and as a result, the only set to change is arca. This is in contrast to the case where the category has principals assigned to it. This difference is reflected in the need to update the authorisation set, par, if these permissions are the sole result of membership of the category in question, i.e. for a category \(c_1\), it is the case that :.

figure u
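The ‘sole result of membership’ condition can be illustrated in Python: an authorisation is revoked only if, after the update to arca, none of the principal’s categories still grants the permission. This is a sketch of the behaviour described in the text under that interpretation; the names and data are illustrative.

```python
def remove_cat_perms(pca, arca, par, c, perms):
    """Remove the permissions in `perms` from category c; pca is unchanged."""
    new_arca = arca - {(c, perm) for perm in perms}
    members = {p for (p, cat) in pca if cat == c}
    # Authorisations still justified by some category after the removal.
    still = {
        (p, perm)
        for (p, cat) in pca
        for (cat2, perm) in new_arca
        if cat == cat2
    }
    revoked = {(p, perm) for p in members for perm in perms} - still
    return new_arca, par - revoked

read_rota = ("read", "rota")
pca = {("alice", "staff"), ("alice", "managers"), ("bob", "staff")}
arca = {("staff", read_rota), ("managers", read_rota)}
par = {("alice", read_rota), ("bob", read_rota)}

new_arca, new_par = remove_cat_perms(pca, arca, par, "staff", {read_rota})
```

In the example, bob loses the permission because staff was his only source of it, whereas alice retains it through the managers category.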

The following schema encapsulates an update or ‘swap’ on the permissions found within a category. Here, for a category, \(c_1\), with \(c_1 \mapsto perm_1 \in arca\), if we wish to ‘swap’ the permission \(perm_1\) with \(perm_2\) so that we have \(c_1 \mapsto perm_2 \in arca\), we can define this for a category with or without principals. In the case of a category without principals, we see changes to arca only; for the other case, we see changes to both arca and par.

figure v
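A swap can be sketched in Python as a replacement in arca followed by a recomputation of par. Recomputing par as the composition of pca and the updated arca is a simplification chosen for this sketch; it covers both the case with principals and the case without. The names and data are illustrative.

```python
def swap_cat_perm(pca, arca, c, old, new):
    """Replace permission `old` with `new` for category c."""
    if (c, old) not in arca:
        raise ValueError("category does not hold the permission to be swapped")
    new_arca = (arca - {(c, old)}) | {(c, new)}
    new_par = {
        (p, perm)
        for (p, c1) in pca
        for (c2, perm) in new_arca
        if c1 == c2
    }
    return new_arca, new_par

perm1, perm2 = ("read", "rota"), ("edit", "rota")

# Category with a principal: both arca and par change.
arca_a, par_a = swap_cat_perm({("alice", "c1")}, {("c1", perm1)}, "c1", perm1, perm2)
# Category without principals: only arca changes (par stays empty).
arca_b, par_b = swap_cat_perm(set(), {("c1", perm1)}, "c1", perm1, perm2)
```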

5 Conclusion

We have presented a preliminary model of the Category-Based Access Control (CBAC) meta-model in terms of the schema language of Z. The model enables one to reason about potential extensions that account, for example, for the dynamic, contextual nature of privacy as it relates to metadata.

Our problem of interest relates specifically to the issue of metadata collection as it applies to personal data, consent and dataveillance. As we have discussed, metadata collection informs the identity profiles companies create for their users [18].

This threat to users’ privacy is entwined with the issue of data ownership on the Internet, where we see that users rarely have explicit ownership of their own data. To address this, we have identified a possible stepping stone in the form of the Solid proposal. The Solid proposal, as has been outlined, focuses on access control for users’ data pods, whereby individuals may authorise granular access of their data as they please.

Our next task is to build upon our initial model to capture the Solid proposal’s Web Access Control Lists. Subsequently, we will be in a position to build upon it, focusing on the specific problems faced due to collation of metadata. By reasoning about user data pods, we will be able to implement our own privacy notions that handle the complexities of inferences from metadata—allowing us to make further progress with respect to our main research question.