This article contends that the effective regulation of Big Data requires a combination of legal tools and other instruments of a semantic and algorithmic nature. It commences with a brief discussion of the concept of Big Data and views expressed by Australian and UK participants in a study of Big Data use from a law enforcement and national security perspective. The second part of the article highlights the UN Special Rapporteur on the Right to Privacy's interest in these themes and the focus of his new program on Big Data. UK law reforms regarding the authorisation of warrants for the exercise of bulk data powers are discussed in the third part. Reflecting on these developments, the paper closes with an exploration of the complex relationship between law and Big Data and the implications for the regulation and governance of Big Data.

1 Perspectives on Big Data

This article focuses on the regulation of Big Data.Footnote 1 To frame the topic we will commence with a few brief remarks about the term ‘Big Data’ before discussing views expressed by participants in a recently concluded study undertaken by the Law and Policy Program of the Data to Decisions Cooperative Research Centre.

1.1 “Big Data”

‘Big Data’ is a concept used to label key advances, opportunities and risks in data sciences. Despite its take-up in major policy documents [1,2,3,4], the term lacks precise content. It has been in use since the 1990s [5], and was initially coined to refer to the rapidly increasing volume of data that presented new opportunities and also needed to be managed [6]. In 2001, Douglas Laney, currently at Gartner,Footnote 2 introduced two further elements: the velocity at which data is generated, and the increasing variety of structured and unstructured data [7].

Big Data definitions abound.Footnote 3 In their systematic mapping study of Big Data definitions, Ossi Ylijoki and Jari PorrasFootnote 4 found that the three definitional Vs (volume, velocity and variety) were common to many [8]. Some definitions go further by including technical aspects relating to usage of the data, for example analysis or decision-making. Ylijoki and Porras argue that the inclusion of such elements often creates logical tensions, and adds to definitional vagueness.Footnote 5 Enhanced analytical capabilities are, however, such a major driver of the Big Data phenomenon that their inclusion in some definitions is understandable.

1.2 “Big Data” in the context of national security and law enforcement

A recent project of the Law and Policy Program of the Data to Decisions Cooperative Research Centre, Big Data Technology and National Security: Comparative International Perspectives on Strategy, Policy and Law in Australia, the United Kingdom and Canada, examined the policies, regulatory approaches, processes and strategies used by Australia, the UK and Canada to balance the management and exploitation of Big Data for national security and law enforcement purposes. The study combined empirical research and doctrinal analysis. The remarks that follow focus on a few high-level perspectives drawn from the empirical research led by Professor Janet Chan and Associate Professor Lyria Bennett Moses of UNSW Law.Footnote 6

The design of the empirical inquiry acknowledged that technologists, stakeholders and users might see Big Data technology through different ‘technological frames’ [9]. These frames are assumptions, expectations and knowledge about a technology [10,11,12]. Such frames are informed by a range of factors, including skills, knowledge, demographics and personal experiences. Technological frames are also relevant to perceptions of legal and policy responses to Big Data.

The empirical research comprised interviews conducted with key stakeholders, technologists, and users in each country. The interviews explored the interviewees' understanding of the capabilities and uses of Big Data, their perceptions of issues and challenges in relation to Big Data, and their perceptions of existing, proposed and recommended strategies, policies, laws and practices. In total, 63 research participants took part in the research project.Footnote 7

A comprehensive discussion of the findings lies beyond the scope of this article. To frame the topic and illustrate some of the perceptions, this discussion is limited to a selection of high level findings relating to the concept of Big Data, the barriers and challenges to using it and some of the risks relating to Big Data.

1.3 “Big Data” – The conceptFootnote 8

Research participants' definitions of "Big Data" focused largely on technical and user requirements [14]. Volume was the attribute most frequently mentioned by Australian participants (22/38). Other attributes mentioned included its analytical or predictive capacity (13/38) and that it comprises aggregated or integrated data from different sources (9/38). Interestingly, velocity (5/38) and variety (4/38) were mentioned, but not as prominently as Laney's concept of Big Data, discussed in 1.1, would suggest [15]. Importantly, a number of participants from technical organisations (5/38) viewed "Big Data" largely as a marketing term that captures the current trend of generating and using large volumes of data.

Views of "Big Data" were largely consistent among participants from the UK, Canada and Australia. However, there was much scepticism about the term among operational and policy participants in the UK. Some would prefer the more precise terminology used in legislation, such as "bulk personal dataset" in the UK's Investigatory Powers Act 2016 [14, 15], as we will discuss later (section 3).

1.4 Barriers and challenges to using Big Data

The most important barriers and challenges to the use of Big Data listed by Australian participants were legal and privacy issues (13/37), public acceptance/trust in agencies wishing to use Big Data (11/37) and access to/sharing of data/data silos (9/37). A number of participants also raised the challenge of obtaining and maintaining technical, human and other resources, for example the lack of technical capacity to manage large volumes of data and competition for the small pool of good analysts [15].

Research participants in the UK reported similar types of barriers to the use of Big Data; for example, they operated under similar resource constraints. These included working with outdated systems, limited processing power and insufficient human resources. UK participants were, however, on average more sensitive to technical and resource concerns than Australian participants (9/14 vs 8/37). UK participants also appeared more aware than their Australian counterparts of the potential for data to be incomplete or biased.

1.5 Risks of using Big Data

Australian research participants identified privacy (12/38), misuse of data (10/38), misplaced trust in technology and the assumptions behind analytics (10/38) and data security (9/38) as the most significant risks of using Big Data. Research participants from operational organisations seemed particularly sensitive to harm to their own organisations through political and reputational risks, negative public perceptions and information overload (for example, holding data that could be used to identify a criminal or prevent a terrorist attack but failing to do so). Those in operational and technical organisations appeared more conscious of misplaced trust in technology than those in policy positions. The variability of identified risks between individuals and organisations suggests that broader awareness of the diversity of risks across sectors would be beneficial.

1.6 Appropriate regulations and policies

A more comprehensive discussion of the findings falls outside the scope of this article. However, the views expressed by the research participants and reflected above provide some insight into the range of views held in relation to Big Data. An improved understanding of the views held by different sectors of society in relation to Big Data will improve the ability to formulate appropriate policies, especially in relation to a matter as sensitive as the use of Big Data to support national security and law enforcement.

We will address this issue in the sections below. We contend that the appropriate regulation of Big Data in the private and public spheres lies beyond the capacity of such traditional legal instruments as constitutional principles, statutes, regulations, and case law.

To be effective in the Web of Data, these instruments increasingly need to be complemented by other tools of a semantic and algorithmic nature.

2 Regulating Big Data

2.1 Some doubts

In 2015, the United Nations (UN) appointed a Special Rapporteur on the Right to Privacy (SRP). The appointment was in response to the Snowden allegations, and work that was undertaken in their aftermath by the Human Rights Council.

The SRP produced his first report in March 2016 [16].Footnote 9 The report identified a number of key themes that require investigatory work under the SRP’s mandate. One of these is Big Data and Open Data.

In July 2016, at the SRP's Conference on Privacy, Personality and Information Flows at the New York University Law School, one of the authors of the present article, David Watts, was appointed to lead this part of the SRP's mandate. His key task is to oversee and coordinate the production of a paper on the privacy implications of Big Data and Open Data, for presentation to the UN General Assembly and the UN Human Rights Council in late 2017.

As discussed in 1 above, there is no accepted single definition of Big Data: there are many descriptions and these focus mainly on the large and complex datasets that require new architectures to efficiently manage them [19]. The lack of a definition poses a number of conceptual problems for the Big Data theme — how do you go about determining risk when you don’t really know what you are measuring or assessing?

To illustrate the problem, it is worthwhile taking an historical perspective. The Domesday Book, compiled in 1086 as a survey of land and chattels over the whole of England, must count, within an eleventh-century experience, as the Big Data exercise of its day. So too must the inventory of English abbeys begun by Thomas Cromwell in 1536, and the 1933 Prussian census, which used 'Hollerith punch cards' and computing machines supplied and maintained by IBM to produce the records of religious affiliation that underpinned the Holocaust.Footnote 10 The Big Data of today can easily become the little data of tomorrow.

Open Data can be seen as one dimension of Big Data, as an input or data source. According to the Open Knowledge International Handbook,Footnote 11 it is defined as "data that can be freely used, re-used and redistributed by anyone — subject only, at most, to the requirement to attribute and sharealike" [20].

Open Data has become a public sector article of faith over the last few years. The asserted policy basis for this is that “governments have a significant amount of data that can be published publicly. Where this data is made available in a machine-readable way, digital services can leverage it to support improved information and service delivery for users”.

The SRP has expressed reservations about Open Data:

At first sight Open Data sounds fine as a concept, a noble and altruistic approach to dealing with data as a common good, if not quite “common heritage of mankind”. Who could object to data sets being used and re-used in order to benefit various parts of society and eventually hopefully all of humanity? It is what you can do with Open Data that is of concern, especially when you deploy the power of Big Data analytical methods on the data sets which may have been made publicly available thanks to Open Data policies. [21]

There are now a significant number of data sets that have been released by government in Australia under Open Data policies. One of the most recent and significant was the release of more than 1 billion lines of what is claimed to be de-identified historical health data by the Department of Health [22]. The Department stated that:

To ensure that personal details cannot be derived from this data, a suite of confidentiality measures including encryption, perturbation and exclusion of rare events has been applied. This will safeguard personal health information and ensure that patients and providers cannot be re-identified.

There is no doubt that better research for the common good, for example population health research, carries community benefits. But what are Open Data's privacy risks, and can they be mitigated appropriately? It is puzzling that the Department's announcement did not include the specific details of the de-identification process it used.

2.2 The SRP’s Big Data and Open Data theme

The Big Data/Open Data theme has been divided into a number of areas of inquiry. These include:

  • The benefits of Big Data and Open Data

  • The associated data protection risks

  • The ways that the risks can be managed/mitigated

The focus in this section is on one aspect of risk mitigation: privacy-enhancing technology [PET].

The term PET usually refers to the use of technology to help achieve compliance with data protection legislation. The rationale for using PETs does not end with privacy: PETs can protect corporate confidential information and intellectual property, as well as other categories of valuable information.

2.3 De-identification

One of the main PETs is de-identification: the removal or transformation of identifying elements so that information no longer relates to an identifiable individual. Privacy law applies only to personal information; if information no longer falls within that definition, privacy law no longer applies.

De-identification is one of the most contentious international privacy issues. Its supporters acknowledge that even though no de-identification approach can be guaranteed to succeed in every case and for all time, robust, risk-based de-identification processes can provide sufficient protection to comply with privacy laws. They argue that there are no guarantees of anything. Ann CavoukianFootnote 12 and Daniel CastroFootnote 13 are two of the most prominent supporters of de-identification [23]. Opponents of de-identification argue that (i) there is no evidence that de-identification works either in theory or in practice and (ii) attempts to quantify its efficacy are unscientific, and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do [24].

At this stage it is difficult to know who is right and who is not, or whether a binary answer to the de-identification debate is either helpful or useful. Perhaps we need to look at the debate in a more nuanced way, accepting that in some, but not all cases, de-identification might provide acceptable answers. But even so, it is difficult to see where the boundaries lie. It is becoming easier to combine de-identified data with other data sources in ways that increase the risk of re-identification.
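To make the linkage risk concrete, the following is a minimal sketch of a linkage attack: a hypothetical "de-identified" health dataset is joined to an equally hypothetical public register on three quasi-identifiers (postcode, birth year, sex). All names, fields and records are invented for illustration; real attacks apply the same principle at far larger scale.

```python
# Minimal sketch of a linkage attack on "de-identified" records.
# All datasets, field names and values here are hypothetical.

deidentified_health = [
    {"postcode": "3000", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "3121", "birth_year": 1988, "sex": "M", "diagnosis": "diabetes"},
]

public_register = [  # e.g. an electoral roll or another open dataset
    {"name": "Jane Citizen", "postcode": "3000", "birth_year": 1975, "sex": "F"},
]

def link(records, auxiliary, keys=("postcode", "birth_year", "sex")):
    """Re-attach identities to records whose quasi-identifiers match one person."""
    for record in records:
        matches = [p for p in auxiliary if all(p[k] == record[k] for k in keys)]
        if len(matches) == 1:  # a unique match re-identifies the record
            yield matches[0]["name"], record["diagnosis"]

print(list(link(deidentified_health, public_register)))
# [('Jane Citizen', 'asthma')] -- the sensitive attribute is re-attached to a name
```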

2.4 Distributed ledgers

High on the ‘hype cycle’ is distributed ledger technology, of which blockchain technology is a component. A distributed ledger is a consensus of replicated, shared, and synchronized digital data geographically spread across multiple sites, countries, and/or institutions.Footnote 14 Data is stored in a continuous ledger but can be added only when the participants reach a validation consensus, a process discussed further below.

A blockchain takes data or records and stores them in a block; a simple analogy is the recording of a transaction on a piece of paper. Each block is then ‘chained’ to the subsequent block using cryptography, and the chain of blocks becomes the digital version of a ledger. The ledger can be shared and examined by anyone permitted to do so. The key difference between this process and a conventional database is that rules can be set at the transactional level in the blockchain, whereas this does not occur with conventional databases.
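The chaining idea can be shown in a few lines. The sketch below is a generic illustration, not any particular blockchain implementation: each block stores a cryptographic hash of its predecessor, so altering an earlier record invalidates every later link.

```python
# Minimal sketch of hash-chaining blocks; consensus and permissions are omitted.
import hashlib
import json

def make_block(data, prev_hash):
    block = {"data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps({"data": data, "prev_hash": prev_hash}, sort_keys=True).encode()
    ).hexdigest()
    return block

genesis = make_block("ledger opened", prev_hash="0" * 64)
block_1 = make_block("Alice pays Bob 10", prev_hash=genesis["hash"])
block_2 = make_block("Bob pays Carol 4", prev_hash=block_1["hash"])

# Tampering with an earlier block no longer matches the hash stored downstream,
# so any participant re-validating the chain detects the alteration.
genesis["data"] = "ledger opened (forged)"
recomputed = hashlib.sha256(
    json.dumps({"data": genesis["data"], "prev_hash": genesis["prev_hash"]},
               sort_keys=True).encode()
).hexdigest()
print(recomputed == block_1["prev_hash"])  # False: the chain is broken
```

A real distributed ledger adds the further requirement that participants reach consensus on each new block before it is appended; the sketch omits that step entirely.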

In this brief discussion it is not possible to canvass all of the viewpoints about distributed ledgers and blockchains. We are in the midst of an incredible explosion of information about the uses of these technologies: experience suggests that we need to carefully examine and understand their strengths and weaknesses before reaching firm conclusions about their effectiveness.

But we need to note a few issues about privacy impacts and risks at the outset. Blockchains offer new opportunities for individuals to collaborate and to create datasets in a peer network, without a central intermediary. But what if the ledger is controlled by a single entity, or a group of affiliated interests, who control the validation and permissions process? What happens if they exercise their majority powers? Further, what if the data in each block contains personal information – such as your health records, or your recent bankruptcy, or your change of gender? In an open blockchain, chances are this information is available to anyone, and forever. In a closed blockchain, the information will be available to those with the relevant permissions. Although encryption can be used to protect personal information embodied in each block, at some point the permissions and validation processes require that the information be decrypted.

The distributed ledger technology model operates in a way that challenges one of the main information privacy assumptions – that an organisation, whether public or private, collects, uses and discloses personal information, and thus is both accountable and responsible for those activities. Implicit in this is the assumption that a hierarchy exists: that stewardship of personal information can be traced to a source. In a distributed system that relies on a community of membership permissions and validation processes, such an assumption breaks down.

The technologies outlined might have the potential to deliver privacy benefits; however, they need to be better understood and tested before being declared the answer to our privacy dreams.

2.5 Consent technologies

Finally, another body of work is exploring ways in which technology can be used to underpin the core privacy concept of consent, and to tie the collection, use and disclosure of personal information to the purpose for which it was collected. This body of work relies on applying forms of digital rights management, a technology that fell into disrepute after the way it was used by the entertainment industry to prop up its decaying business model, to personal information. Within some implementations it can also use semantic web principles to the same end. Essentially, these approaches attach permissions to personal information, and enable automated negotiations between information subjects and information recipients about the collection and subsequent use and disclosure of the subjects' personal information. This type of approach has been advocated by Professor Alex Pentland of MIT, who has supported placing "the individual much more in charge of data that's about them. This is a major step in making Big Data safer and more transparent, as well as more liquid and available, because people can now choose to share data [26]."
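A minimal sketch of this idea follows, using an invented policy format rather than any deployed standard: the data subject attaches permissions to a data item, and a recipient's request succeeds only if recipient, action and purpose all match an attached permission.

```python
# Hypothetical permission format; recipients, actions and purposes are illustrative.
subject_policy = {
    "item": "step_count",
    "permitted": [
        {"recipient": "my_gp", "action": "read", "purpose": "treatment"},
        {"recipient": "research_org", "action": "aggregate", "purpose": "public_health"},
    ],
}

def negotiate(policy, recipient, action, purpose):
    """Grant a request only if the subject's attached permissions allow it."""
    return any(
        p["recipient"] == recipient and p["action"] == action and p["purpose"] == purpose
        for p in policy["permitted"]
    )

print(negotiate(subject_policy, "my_gp", "read", "treatment"))       # True
print(negotiate(subject_policy, "advertiser", "read", "marketing"))  # False by default
```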

Pentland also believes that personal information can be owned by data subjects, that is, that there should be a proprietary right in personal information. This view is expressed fairly often in US privacy discourse, but not elsewhere. While an interesting ideal, there is perhaps little chance of such a proposition becoming a reality. That said, the idea of giving individuals technological tools to negotiate personal information transactions needs to be explored and considered carefully.

The challenge for readers is to understand and scrutinize the technologies and their implications realistically, and from multiple viewpoints. The conduct of the recent Australian census serves as a useful case study.

The Australian Bureau of Statistics (ABS) is responsible for undertaking a periodic census in Australia. In the past, the ABS has not retained, for any significant period of time, names and addresses. For the 2016 census, it decided that it would do so. There are various accounts of the duration of the retention period. They appear to have shifted as public concern grew.

Another decision was that the census should be undertaken primarily online. The ABS gave assurances that the security measures would be the best in the world.

On the day that the census began, the online platform was hacked and taken down. Various, and sometimes conflicting, explanations were provided. Assurances were offered that no personal information was compromised. But the damage had been done: the trust in, and the reputation of, a hitherto highly regarded Australian institution had been shattered, perhaps irreparably. A variety of inquiries are currently being undertaken to understand what happened and why.

The Australian census debacle has shown that Australians care very much about the privacy of their personal information and are unlikely to trust solutions that do not strike the right balance between functionality and protecting their rights.

They are also sceptical of government. Although the Commonwealth Minister responsible for the census, Michael McCormack, commented that the census was just like Facebook and dismissed concerns about the census enabling government to track the population as "much ado about nothing" [27], the intensity of the public debate indicates that this viewpoint is a contentious one.

The same argument was used in the recent Australian controversy about the retention of bulk telecommunications metadata for law enforcement and national security purposes, when the (then) Head of the Australian Security Intelligence Organisation said:

Are you arguing that it is OK for Microsoft or Google to profile you in order to sell you a new BMW, or some beauty product, that is alright for them, but it’s not alright for the government on a very selective basis to access telecommunications metadata in order to save lives? That to me is a very distorted and worrying argument. [28]

One of the key issues in the data retention debate was whether individuals' web browsing history would be retained for two years. Public trust and confidence were undermined when Australia's first law officer, the Attorney-General, George Brandis, was unable to answer this question clearly:

Brandis: "The web address, um, is part of the metadata."

Journalist: "The website?"

Brandis: "The well, the web address, the electronic address of the website. What the security agencies want to know, to be retained is the, is the electronic address of the website that the web user is ...”

Journalist: "So it does tell you the website?"

Brandis: "Well, it, it tells you the address of the website." [29]

There is a body of opinion in Australian government to the effect that 'if you let the private sector do it, then government should be able to do it as well.' Frequently, this opinion is expressed by those responsible for 'innovation' or 'disruption' agendas, who see no boundaries to government information sharing.

Our scepticism grows in proportion to the claims about personal information made by government that prove to be inaccurate, such as the 'opt-in' promise for the Personally Controlled Electronic Health Records system [30] or the claim that the census had the best security features.

2.6 The path ahead

David Watts was tasked by the UN Special Rapporteur to be critical but also open-minded: the focus will be on evidence-based critical analysis, and the approach is to be inclusive and collaborative.

False trade-offs pervade privacy discourse, such as the supposed trade-off between privacy and security. It is important, when considering privacy issues as they pertain to Big Data and Open Data, to avoid simplistic analysis and to take account of all of the risks and benefits of each in formulating a response that respects privacy rights. Let us now examine in more detail the application of legal instruments, and how they may change in the future. The two sections that follow are devoted to this objective. We will introduce them through an examination of the Investigatory Powers Act 2016 (UK).

3 An improved legal framework

Technical developments, including Big Data analytics, have profoundly changed the way, and the extent to which, law enforcement and security agencies access and examine information. However, the legislation governing their functions and practices is still grounded in, and focused upon, twentieth-century responses to the then-existing technologies of access to, and automated aggregation, processing and distribution of, data, including personal data. This is why governments and legislatures are re-conceptualising the role the law can play in controlling and regulating the use and misuse of data. The task is not easy.

What follows describes an improved legal framework for law enforcement and security agencies in the UK: the Investigatory Powers Act 2016 (UK) and the proportionate controls incorporated into its system of warrants.

3.1 The Investigatory Powers Act 2016 (UK)

In the United Kingdom, the much-debated Investigatory Powers Act 2016 (UK) ch 25 consolidates and streamlines statutory controls and safeguards that were previously contained in statutes governing nine law enforcement and national security and intelligence agencies into a single statute.Footnote 15 It systemically embeds a system of warrants based on the tests of proportionality and necessity into provisions about the interception, interference with, acquisition, examination, and management of communications and other digitized data. It also establishes an oversight scheme involving the Investigatory Powers Commissioner and Judicial Commissioners.

The Act, as compared with, for example, Australian and Canadian legislationFootnote 16 in this field, provides the most advanced and comprehensive legislative framework of controls on access/interception, examination/analysis of the material, and distribution of information from Big Data. The Investigatory Powers Act 2016 (UK) attempts to deal with the unprecedented volume of accessible data, referred to as "bulk".Footnote 17 This "bulk" data, as available to national security and law enforcement agencies, has been conceptualised in terms of a hierarchical order of categories that can be regulated by way of warrants, authorisations and notices. In turn, these legal instruments enabling access, collection and management include criteria for certain general privacy protections based on consideration of:

data access/collection minimisation: "(a) whether what is sought to be achieved by the warrant, authorisation or notice could reasonably be achieved by other less intrusive means";

differentiation of protective privacy levels: "(b) whether the level of [privacy] protection to be applied in relation to any obtaining of information by virtue of the warrant, authorisation or notice is higher because of the particular sensitivity of that information";

and the tension between two public interests:

“(c) the public interest in the integrity and security of telecommunication systems and postal services, and (d) any other aspects of the public interest in the protection of privacy.”Footnote 18

Statutory duties to consider limits on access, collection and distribution of bulk data are context-driven, and not exhaustive. In particular, the statute (s 2(4)) identifies the following additional factors that, depending on context, must be considered:

“(a) the interests of national security or of the economic well-being of the United Kingdom,

(b) the public interest in preventing or detecting serious crime,

(c) other considerations which are relevant to:

(i) whether the conduct authorised or required by the warrant, authorisation or notice is proportionate, or

(ii) whether it [conduct] is necessary to act for a purpose provided for by this Act,

(d) the requirements of the Human Rights Act 1998 (UK), and

(e) other requirements of public law.”

3.2 The “double-lock” mechanism

It is in this specific statutory context that the tests of proportionality and necessity serve as controls for all operational provisions of the Act. They are also central to the special “double-lock” mechanism for the issuance and approval of warrants relating to “bulk” data. The rule is that before the issued warrant becomes operative, it must be approved.

The double-lock mechanism applies to the approval of all warrants involving targeted interception, targeted equipment interference, and all "bulk" surveillance measures. The relevant warrant categories are set out below:

  1. Equipment interference warrants: Part 5 and Part 6, Chapter 3,Footnote 19 (their issuance must be necessary in the interests of national security)Footnote 20;

  2. Communication interception warrants: Part 2 and Part 6, Chapter 1;

  3. Obtaining communications warrants: Part 3 and Part 6, Chapter 2;

  4. Bulk Personal Dataset Warrants (large datasets containing personal information about a wide range of people): Part 7.

Decisions to issue warrants for bulk interception,Footnote 21 bulk acquisition,Footnote 22 bulk equipment interference,Footnote 23 and bulk personal datasetsFootnote 24 are to be taken personally by the Secretary of State.

The double-lock mechanism works as follows:

  1. In order to obtain a relevant warrant, the investigative agency has to provide the Secretary of State with materials (including evidence of necessity and proportionality) in support of the proposed warrant [31]Footnote 25;

  2. The Secretary of State must then apply the specific statutory tests of "necessity" and "proportionality" before deciding whether to issue the warrant.Footnote 26

  3. The Secretary of State (or, in the case of a warrant to be issued by the Scottish Ministers, a member of the Scottish Government) is to personally make decisions to issueFootnote 27 three specific kinds of warrants on behalf of an "intercepting authority",Footnote 28 namely targeted interception warrants, targeted examination warrants, and mutual assistance warrants.

The statutory test of necessity lists grounds that are applicable to each specific category of warrant. For example, to fulfil the necessity precondition, a targeted interception warrant or targeted examination warrant must be "necessary —

  (a) in the interests of national security,

  (b) for the purpose of preventing or detecting serious crime, or

  (c) in the interests of the economic well-being of the United Kingdom so far as those interests are also relevant to the interests of national security."

A mutual assistance warrantFootnote 29 under sections 15(4) and 15(5) will be deemed necessary if —

  (a) it is necessary for the purpose of giving effect to the provisions of an EU mutual assistance instrument, or an international mutual assistance agreement, and

  (b) the circumstances appear to the Secretary of State to be equivalent to those in which the Secretary of State would issue a warrant for the purpose of preventing or detecting serious crime.

However, subsection 20(4) specifies that “information which it is considered necessary to obtain is information relating to the acts or intentions of persons outside the British Islands” [i.e., it cannot pertain to people in Britain].Footnote 30

The second part of the “double-lock system” brings in the judiciary. Judicial Commissioners must be judges who hold, or have held, high judicial office.Footnote 31 They have statutory power to approve warrants issued by the Secretary of State.Footnote 32 While Judicial Commissioners must apply the (s 19) statutory necessity test already applied by the Secretary of State, their determination regarding the proportionality criterion is different. Section 23(2)(a) of the Act mandates that in considering whether the conduct to be authorised by the warrant is proportionate: “the Judicial Commissioner must apply the same principles as would be applied by a court on an application for judicial review”.Footnote 33 In other words, with the exception of instances implicating EU treaties, Judicial Commissioners must apply the common law test of proportionality.Footnote 34

3.3 Discussion: Controls and protections

The double-lock mechanism is based on a system of checks and balances involving executive decision-making directed to the issuance of the warrant, and judicial oversight directed to approval of the warrant.

In Chapter 1 of Part 8, the legislation creates the Office of the Investigatory Powers Commissioner,Footnote 35 whose duties include reviewing (by way of audit, inspection and investigation) the agencies' exercise of functions relating to the interception of communications; the acquisition or retention of communications data; the acquisition of secondary data [i.e. metadata] or related systems data; and equipment interference (whether under warrants and authorisations or otherwise). The effectiveness of the Investigatory Powers Commissioner's oversight will depend on the allocation of resources to this Office by successive governments.

Finally, ensuring the security of the accessed and managed datasets is placed under the personal responsibility of the designated Cabinet Minister. Does the statutory notion of Ministerial personal responsibility have practical significance? The answer will depend on judicial construction of the relevant provisions. Section 1(5) of the Act refers to further statutory protections for privacy,Footnote 36 including "the common law offence of misconduct in public office", and protections "elsewhere in the law". It is unclear whether the imposition of personal responsibility on Cabinet Ministers means that they have no immunity from a charge of misconduct in public office, or from a suit for damages at common law by individuals whose reputation or other legal interests have been harmed by a privacy breach.

The control scheme of the Investigatory Powers Act increases legal controls over national security and law enforcement access and usage of bulk data. It is doubtful, however, whether they will prove sufficient to adequately control and monitor data processing flows. The legal mechanisms embedded in the Act are based on public law principles, e.g. necessity and proportionality. The “double-lock” mechanism relies on a system of checks and balances between executive decisions and judicial control, reinforced by the “personal responsibility” of the Cabinet Minister.

However, the Act does not directly regulate the nature and use of automated algorithms that aggregate, link, examine, and process (filter, classify, draw inferences from) the “bulk” data.

As detailed as the Act may be, this means that there is a fundamental dimension of the Web of Data that has not been directly addressed. As argued in the next section, the nature of knowledge and information processes does not fit well into the mould of existing legal solutions. The latter are built on the basis of the rule of law and are nationally, culturally, and linguistically bounded. The former, on the contrary, are transnational in nature, and are usually represented within formal languages and methods.

4 Defining digital regulations

4.1 Privacy and Big DataFootnote 37

Quite recently in contemporary societies, Big Data encountered the Semantic Web and the Internet of Things. In 2015, eight zettabytes (zetta = 10²¹) of data were generated, consisting mostly of unstructured data (e-mails, blogs, tweets, Facebook posts, images, and videos) [34]. Twitter users generate more than half a billion tweets daily. eBay's Online Dispute Resolution system alone resolves about 80 million disputes annually. Collecting data and producing metadata — i.e., machine-understandable information — constitutes the new stage. But metadata can be collected and structured as well. Big Data entails a belief that combining data from multiple sources may lead to better decisions (or not!). The quality of decisions made with Big Data depends on the structure of the context, and on the organisation, principles and criteria applied.

Furthermore, these decisions are the product of their context: Big Data actually builds a new individual and collective identity, and a new kind of hybrid political culture. With the many sensors within the Internet of Things, and the organization of Smart Cities, our world may, in the short run, be managed and possibly ruled almost automatically. Information processing interacts with programs that not only simulate human intelligence but virtually act like humans.

The term "hype cycle" refers to a "graphic representation of the maturity and adoption of technologies and applications" amid industry noise, indicating where things are moving.Footnote 38 The Gartner Hype Cycle for Emerging Technologies (GHENT) of August 2014 located Big Data on the edge of already known, but not yet mature, technologies, while the Internet of Things took Big Data's former place at the peak of the inflated expectations curve. Huge amounts of data are produced daily through smartphones' sensors, which automatically send information regardless of the will of their owners. The use of mobile technology surpassed that of personal computers in 2008 [35].

All of this has fostered new applied research. The GHENT for 2015 dropped Big Data from its peak, signalling that autonomous vehicles and the Internet of Things were now at the peak of inflated expectations. New related technologies with social and economic applications are emerging, among them digital dexterity (the employee cognitive ability and social practice that digital business success requires) and citizen data science.Footnote 39 These are social areas; society and security are the highlighted emergent fields. The recently delivered GHENT for 2016 confirms these trends, stressing their cognitive side (e.g. perceptual smart machines).Footnote 40 Cognitive business [36] and strategic management studies [37] support the same idea, as industry aims at reducing latency and data transfer costs.

Mimicking the Gartner Hype Cycle, Daniel CastroFootnote 41 and Alan McQuinn,Footnote 42 from the influential US Information Technology and Innovation Foundation, published an essay in September 2015 entitled "The Privacy Panic Cycle: A Guide to Public Fears about New Technologies" [38]. It shows the growing tensions between human rights lawyers (and the so-called whistle-blowers) on the one side, and governments and administrators on the other. While we are not in complete agreement with these authors, we agree that they plot the battlefield, at least from the US point of view. Still, the conversation is better served by focusing not so much on the tension between privacy and innovation as on the necessary balance between personal needs, decisions, fears and wisdoms (liberty) and collective and public risks (security).

Citizens should be informed and, more than that, should incorporate self-regulation into practice for collective purposes, because rules and regulations will become increasingly embedded into programs and devices through interactive workflows between humans and machines [39].

4.2 Citizens’ experiences

How do we acknowledge and express that these issues matter? How do we, the citizens, experience privacy and security in our everyday lives? Currently, it is by surprise, by confusion, and by exhaustion.

Let’s consider the following situation. Imagine that you discover that your family tree (which by extension includes your children and yourself) is available on an online family tree service. It is provided by a company that created an online platform, where users can build family trees based on data collected from their relatives. The platform has aggregated data from 1.5 billion family tree profiles. What can you do?

This is one side of so-called Open Source Intelligence (OSINT) [40]. The collected data could be kept private by design and by default, but they are not: data about deceased people are placed in the public domain, since otherwise the information could not be legally structured, properly linked, and sold. Registration is required only when you click on the names of living people. This presents a kind of paradox: the public domain at the service of market agents. The company apparently offers a solution if you are unhappy about your inclusion in its database. However, reversing the situation is time-consuming, painstaking, and costly, and few people pursue the consumer opt-out option.

With the presence of Big Data in everyday life, other examples come to mind. The three Vs (volume, velocity, and variety) foster health applications that may collect a continuous flow of personal physical information, opening new possibilities. For example, Barrett et al. predict:

“In addition to simple monitoring, a more sophisticated programs would include algorithms that provide personalized feedback to assist with behaviour modification at key moments of decision making (e.g., suggesting healthy recipes while the patient is shopping; encouraging exercise at the end of the workday, or giving a personalized warning about location based environmental triggers for asthma). The real-time velocity sets this application of big data apart from traditional public health uses of behavioural or health data.” [41].

Who will control the access and reuse of this personal data flow?

4.3 Linked DataFootnote 43

From a semantic point of view, Big Data can in fact be Linked Data, and Open Linked Data when made accessible. Consider the "giant global graph" envisaged by Tim Berners-Lee. Since 2007, the DBpedia project has been linking databases according to the best practices and guidelines of the World Wide Web Consortium [W3C], building a large-scale, multilingual knowledge base by extracting structured data from Wikipedia editions, which now exist in 125 languages.Footnote 44 As stated by the organisation: "The English version of the DBpedia knowledge base describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species and 6,000 diseases."Footnote 45

Resources are linked using Semantic Web languages, especially the Resource Description Framework [RDF]. Queries are expressed in the SPARQL Protocol and RDF Query Language [SPARQL], which currently draws on some 3000 million triples — subject/predicate/object statements in all natural languages — describing some four and a half million objects [42].
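As an illustration of how such triples are consumed, the sketch below queries DBpedia's public SPARQL endpoint using the Python SPARQLWrapper package. The query itself (people born in Melbourne) is purely illustrative, and the results depend on the live endpoint.

```python
# Minimal sketch of consuming Linked Data from DBpedia's SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?name WHERE {
        ?person dbo:birthPlace dbr:Melbourne ;   # subject / predicate / object
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], "-", row["name"]["value"])
```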

In 2011, a sister project, Wikidata, was launched to be "a free linked database that can be read and edited by both humans and machines". It contains more than 24,263,995 data items that anyone can edit (as of October 2016), in all Wikimedia languages.Footnote 46 Wikidata aims to provide statements given in a particular context. But few databases contain explicit licenses and rule-based provisions that allow data and metadata to be regulated through a rights-driven workflow; nearly 40% do not yet have licenses [43].

How could personal information workflows be controlled? How can data and metadata be personalised in a safe manner? Such a possibility can exist only by balancing liberty and security, and by embedding regulations in semi-automated systems. Some authors envisage a "ubiquitous pragmatic web" encompassing humans and multi-agent systems, i.e. emotions and processing languages [33, 44].

For this purpose, we should distinguish semantic metadata (human or automated annotations added to the content) from structural metadata. The latter adds information about creation, purpose, origin, time, author, location, network, and language and data standards. Metadata is data that refers to and describes other data. As defined by the W3C, it has the feature of being automatable: for the Web, metadata is machine-understandable information, expressible in a programming language.

We should also distinguish at least three types of languages for expressing knowledge: (i) natural language, (ii) technical (expert) language, and (iii) formal language. Expert language is essential, as rules and norms are usually formulated in natural languages (English, Spanish, French, etc.); formal language is the only one that machines can understand. All three kinds of languages can be integrated to create a controlled regulatory ecosystem. For example, Creative Commons licenses incorporate a "three layer design" (legal code, human-readable, machine-readable) to make them more comprehensible and to facilitate their greater usage.Footnote 47

Contextually, it is also important to distinguish law, the rule of law, and the meta-rule of law, as analytical dimensions that are always pivotal to human-machine-human or machine-human-machine communication. Law refers to rules laid down by official bodies with regulatory and binding powers (such as parliaments and courts). The rule of law protects citizens' rights, restricting the arbitrary exercise of power. When such protections are embedded into formal systems, and represented through the languages of the Web of Data, we face the layer of the meta-rule of law. That is, with the development of the web, the rule of law needs to evolve into a meta-rule of law, based on legal knowledge, and incorporating tools to regulate and monitor the semantic and algorithmic layer of the web [32].

Semantic patterns (and ontology design patterns)Footnote 48 can be used and reused to create new types of (interactive, hybrid) regulations by design.Footnote 49 The W3C Open Digital Rights Language (ODRL) supports open publishing, distribution, and consumption of content, applications and services on the Web. This model respects, and puts into users' hands, the management of their rights: it requires their explicit permission, and otherwise actions are forbidden by default.Footnote 50
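A minimal sketch of what such a machine-readable policy can look like: an ODRL-style permission built as a Python dictionary and serialised as JSON-LD. The identifiers, parties and target below are hypothetical, and a production policy would follow the full ODRL Information Model.

```python
# Hypothetical ODRL-style policy: only the listed action, assignee and purpose
# are permitted; anything not expressly permitted is forbidden by default.
import json

policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Agreement",
    "uid": "http://example.com/policy/1",
    "permission": [{
        "target": "http://example.com/data/my-profile",
        "assigner": "http://example.com/user/alice",
        "assignee": "http://example.com/org/research-institute",
        "action": "read",
        "constraint": [{
            "leftOperand": "purpose",
            "operator": "eq",
            "rightOperand": "public-health-research",
        }],
    }],
}
print(json.dumps(policy, indent=2))
```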

4.4 Regulatory algorithms

Focusing on categories and content is not the only strategy. Cryptography and privacy-enhancing technologies (PETs) have been developed nearly in parallel with Semantic Web approaches. Statistical disclosure control, inference control, privacy-preserving data mining, private data analysis, and differential privacy (privacy-preserving data analysis)Footnote 51 are algorithmic techniques applied to large databases using statistical methods [47].Footnote 52 However, they are usually designed to protect privacy in a slightly different situation from those described above: differential privacy "addresses the paradox of learning nothing about an individual while learning useful information about a population" [48, 49]. Thus, these techniques try to neutralise the risks of Linked Data, protecting private information against linkage attacks, because aggregate statistical information about the data may reveal some information about individuals. But even if accurate, they are not infallible.Footnote 53
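A minimal sketch of one such technique, the Laplace mechanism of differential privacy: noise calibrated to a query's sensitivity and a privacy budget epsilon is added to an aggregate statistic before release, so the output conveys useful population information while revealing little about any single record. The data and parameters below are illustrative.

```python
# Laplace mechanism sketch: release a noisy count under epsilon-differential privacy.
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon=0.1):
    """A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

ages = [34, 29, 41, 58, 63, 22, 47]           # hypothetical individual records
print(private_count(ages, lambda a: a > 40))  # noisy count of people over 40
```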

Industry has been developing algorithmic solutions based on machine learning and Artificial Intelligence as well. The general idea is that it would be better if data never reached corporate servers in the first place. Privacy-sensitive intelligence is being introduced in mobile operating systems for iPhones and iPads [50, 51]. For instance, Apple is implementing "local" differential privacy, so that the company never gets hold of the raw data.
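The local variant can be illustrated with the classic randomised response technique; the sketch below is generic and is not Apple's actual mechanism. Each device flips coins before reporting, so no truthful raw value ever leaves it, yet the population rate can still be estimated from the perturbed reports.

```python
# Local differential privacy sketch via randomised response.
import random

def randomized_response(truth: bool) -> bool:
    """With probability 1/2 report the truth; otherwise report a fair coin flip."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_rate(reports):
    """Invert the perturbation: E[reported rate] = 0.25 + 0.5 * true rate."""
    observed = sum(reports) / len(reports)
    return (observed - 0.25) / 0.5

# 10,000 hypothetical users each hold one sensitive bit (true rate about 30%).
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]
print(round(estimate_rate(reports), 3))  # close to 0.3, without trusted raw data
```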

4.5 Legal approaches

As described above (section 3), there are legal steps that regulate possible privacy risks in the UK. Yet it is worth noting that the USA and Europe handle Linked Data and Big Data in quite different ways. It has been suggested that a perceived loss of governmental control over societal developments inhibits the effective protection of essential values in democratic societies [52].

Europe embraced the General Data Protection Reform five years ago.Footnote 54 Legal scholars have pointed out the Copernican legal turn of the resulting General Data Protection Regulation (GDPR) compared with the previous situation [53]. This means that privacy and data protection are being aligned with the construction of the so-called European digital single market.Footnote 55 The set of principles contained in the recent EU Regulation 2016/679 and in EU Directive 2016/680 extends the citizens' protections covered by the EU Charter of Fundamental Rights (2000). Personal data protection has emerged in Europe as a specific fundamental right [54]. Transparency, data minimisation, proportionality, purpose limitation, consent, accountability, data security, rights of access, rights of correction, third-country transfers, and rights of erasure can now be enforced through economic sanctions and instruments of monitoring and control by EU agencies [55]. This holds for citizens, but it is clear that the new Regulation is mainly aimed at companies handling large amounts of data. Breaches of privacy and data protection provisions will be severely fined.

This is not the case in the USA, where privacy is under the general protection of the courts and three constitutional amendments (the First, the Fourteenth and especially the Fourth Amendment). Constitutional law is a way of enhancing citizens' rights in which technology and privacy issues at the county, state and federal levels are dealt with through court rulings and the fulfilment of federal conditions (such as compliance with Fair Information Practices). Personal data is aligned with the "right to be let alone" in the tradition of Justice Louis Brandeis and Thomas Cooley's A Treatise on the Law of Torts (1888),Footnote 56 and with its dimension of data protection based on Alan Westin's work on databases [57, 58].

Indeed, it has been noticed that the reaction of Warren and Brandeis [59] was a kind of legal transplant, an early attempt to incorporate the European tradition into American culture [60].Footnote 57 Thus,

“Why do these sensibilities differ? Why is it that French people won't talk about their salaries, but will take off their bikini tops? Why is it that Americans comply with court discovery orders that open essentially all of their documents for inspection, but refuse to carry identity cards? Why is it that Europeans tolerate state meddling in their choice of baby names? Why is it that Americans submit to extensive credit reporting without rebelling?”

In twenty-first-century Big Data, the starting point is the value of datasets and how much added value metadata brings to data through data analytics. From this point of view, there is more room for mass surveillance and for data commodification, because markets are allowed to set the rules of the game. Online services use clickstream data to optimise operations, and the state collects and uses citizens' data under security laws and Open Source Intelligence (OSINT). These are two sides of the same coin.

Languages and semantic tools are the subject of technical protocols and standards (e.g. W3C Recommendations, OASIS standards, best practices), which are not mandatory. Legal scholars, computer scientists and social scientists have identified the main elements of this new regulatory framework, including the results of fifteen years of research on legal XML, LegalRuleML and legal ontologies; linked data publication and consumption; copyright-related terms; licensing; patents; privacy risks; and the emergence of data and metadata markets [61, 62]. But their legal function, and how they will contribute to creating sustainable social ecosystems, remain uncertain at a general level.

The relationship between the rule of law and its semantic and algorithmic counterpart, the meta-rule of law, should be explored further. The intermediate level between macro- and micro-regulations opens a space for intermediate anchoring institutions, compliance by design and regulatory modelling, to minimise risks and to implement individual and collective rights. Mobile health applications, websites (such as PatientsLikeMe and the Health Tracking Network), crowdsourcing platforms, programs (such as Flu Near You), and Electronic Health Records [EHR] databases could all benefit from this kind of intermediate regulatory institution.

It is worth noticing that underlying ethics have acquired a new, positive regulatory value in this field, both for broad applications of Artificial Intelligence [63] and in the digital space for biomedical Big Data [64]. Traditional legal tools are not enough, and there is an urgent need for us to understand this.

5 Conclusions

Law is facing significant new challenges that should be discussed. Personalisation (of the use of services, information, and knowledge) in the Web of Data creates new unregulated contexts and scenarios. Some boundaries arise within emerging data markets; others unfold under non-harmonised jurisdictions and rules; still others relate to safety and collective security. There are at least these pending topics on the list:

  1. Digital legal pluralism (how to deal with different legal cultures, jurisdictions, and the binding power of national states).

  2. Legal forum-shopping (how to deal with markets, companies and corporate power).

  3. Balancing citizens' rights (privacy) and security (how to deal with individual rights and collective needs).

  4. Creating a global, digital, legal culture (how to deal with the general human-machine framework in which agency, risk scenarios and social ecosystems emerge).

  5. Improving national and international legal frameworks to include new digital terminologies and concepts, harmonising them and allowing legal interoperability across jurisdictions and legal cultures.

  6. Consciousness. We should acknowledge the change, and accept that privacy is a public and collective issue. Building privacy means accepting that there is no absolute privacy; it deals rather with building communities and generating trust in citizens and consumers, within national and transnational markets.

  7. Aligning civil, legal and technological knowledge. Identity means building, controlling and monitoring knowledge about safety, security, and personhood (data and metadata).

  8. Solving the algorithmic-semantic puzzle. From the technical point of view, differential privacy and (semantic) privacy and data protection by design should be developed alike. Encryption, de-identification, and self-enforcing protocols should be complemented by ethical and legal protections.

  9. Ethics matter. Principles and values should be fleshed out, and dynamically anchored into normative and institutional systems (not only treated as preliminary requirements for engineering).

  10. We should acknowledge that there is a political dimension too. Whistle-blowers and activists (including ethical hackers) perform significant socio-political functions. Having an open mind and accepting innovation means adding linked democracy as a new dimension of liberal, deliberative, and epistemic democracy.