1 Introduction

“Privacy is a contested notion” used to be a stock phrase in presentations and papers throughout the nineties and noughties, and a vast number of classifications of different notions and aspects of privacy have been proposed, see [14, 16, 30] for just three examples. Data protection is a similarly multi-faceted concept – not only a counterpart right to the right to privacy [8], but also a means for protecting “fundamental rights and freedoms” in general (Article 1 GDPR) [Footnote 1]. Finally, computer technology in general and Artificial Intelligence applications in particular play many roles, ranging from the quintessential threat of profiling (an activity that even entered the title of an article in the GDPR, Article 22) to the mandate, in the GDPR, to use state-of-the-art (privacy-enhancing) technologies: Articles 25(1) and 32(1) require controllers and processors to give due regard to the state of the art when choosing the technologies.

Yet, in spite of this richness of meaning in the three concepts of “privacy”, “data protection”, and “AI”, there is a prototype: (1) Privacy is operationalised via data privacy and therefore obtains when information is hidden from all or at least specific others; it is therefore centered around confidentiality. (2) Data protection obtains when this information is removed in appropriate ways (e.g. through anonymisation or pseudonymisation) or at least restricted to its intended recipients (e.g. through access control), cf. for example [13] or, specifically for trajectory data, [29]. (3) AI and – particularly relevant when it comes to the processing of personal data – data science [Footnote 2] are conceptualised as dangers to individuals and their autonomy, by combining unwanted data collection with opaque inferences and manipulation (as exemplified in the Cambridge Analytica media narrative).

The present paper starts from reconceptualising privacy as more than confidentiality. It goes back to the alternative to “privacy as the right to be let alone” formulated by Agre and Rotenberg: “the freedom from unreasonable constraints on the construction of one’s own identity” [1], also expanded on by Hildebrandt [17]. Crucially, as feminist scholars and others have pointed out, keeping something confidential or invisible can serve to perpetuate oppression and therefore counteract the very liberatory effects that a private sphere is supposed to have, cf. [27]. On the contrary, it may be necessary to make certain information visible in order to fight and overcome oppression and oppressive structures. The standard example used to be the treatment of domestic violence as a “private matter” vs. its publication and the legal and regulatory successes this has enabled; a current example is the #metoo movement [Footnote 3].

This reconceptualisation will be done by contrasting two case studies that on the surface share many commonalities: human trajectories that can be derived from observed vehicle movement data and data science studies that reconstruct trajectories from these data. The case studies are (1) the New York City taxi rides dataset, and (2) the use of data from the maritime Automatic Identification System (AIS) for mapping refugee movements on the Mediterranean Sea. In both cases, the data amount to “holistic trajectories”, spatiotemporal data enriched with semantic information about the vehicles, the space, the voyage, and enrichable (through data-science inferences) with information about the people on that trajectory.

The first contribution of the paper is to investigate claims that have been made with regard to privacy protection in the second case study, and to argue that, unlike in the first case study, invisibility is often not what the affected individuals want. In their case, rather, visibility becomes a precondition for having rights and often life at all. Data science projects that support this goal, and that support a counter-narrative to politically prevalent narratives, can then become tools that may further fundamental rights (rather than threaten them, as in the default narrative). Data protection law, in turn, may or may not be applicable, and in any case is probably not conceptualised as in standard GDPR-related discussions. The second contribution is to highlight some possible questions that can be asked of the data and their presentation.

The paper is a position paper, a question and a proposal. It asks the question whether and how the APF community wants to engage with the highly politically charged topics around migration, data, and fundamental rights. It proposes a number of (technical and social) questions as a starting point to such an engagement. Lastly, it is (obviously) the opinion of the current author that this is a discussion worth having at APF.

2 Case Study 1: New York City Taxi Rides Dataset

In 2014, the City of New York released, in response to a Freedom of Information request, data about all 173 million taxi rides in New York in 2013, with the taxi identifiers pseudonymised, and exact spatiotemporal data about start- and endpoints, as well as fares, given. This dataset provided a rich real-life dataset for a wide range of data mining studies, such as “optimization of the revenue of NYC Taxi Service using Markov Decision Processes” [21]. At the same time, the publication of the dataset was soon criticized on privacy grounds. For example, the taxi pseudonyms could easily be re-identified to their actual medallion numbers [26]. It was also argued that the data allowed inferences towards sensitive attributes of the taxi drivers, such as the patterns of breaks during the day indicating that someone is a devout Muslim [35]. Finally, with some background knowledge, inferences can be made towards the identity of taxi customers, and based on that, details about their whereabouts learned [3]. The futility even of better pseudonymisation/anonymisation approaches was demonstrated by Douriez et al. [12]. Medallion and driver license IDs were removed from NYC’s taxi datasets released in subsequent years [Footnote 4].
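Why such pseudonyms can be easy to reverse may be illustrated with a minimal sketch. It assumes that identifiers were pseudonymised with an unsalted cryptographic hash (MD5 is used here as an assumption) and it uses a simplified, hypothetical medallion format (digit, letter, digit, digit); the function names are illustrative only. Because the space of valid identifiers is tiny, every candidate can be hashed once and a pseudonym inverted by table lookup.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def pseudonymise(medallion: str) -> str:
    """Unsalted MD5 of the identifier (assumed pseudonymisation scheme)."""
    return hashlib.md5(medallion.encode("ascii")).hexdigest()


def build_lookup_table() -> dict[str, str]:
    """Hash every identifier of the assumed form digit-letter-digit-digit."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        medallion = f"{d1}{letter}{d2}{d3}"
        table[pseudonymise(medallion)] = medallion
    return table


# The whole identifier space has only 10 * 26 * 10 * 10 = 26,000 members.
lookup = build_lookup_table()

# "5X55" is an invented identifier, not taken from the dataset.
released_pseudonym = pseudonymise("5X55")  # what a published record would contain
recovered = lookup[released_pseudonym]     # re-identification by table lookup
```

The sketch covers only the mapping from pseudonym back to identifier; the further inferences about drivers and passengers discussed above additionally require background knowledge.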

The taxi rides represent a typical case of personal data in the sense of the GDPR. Personal data are “any information relating to an identified or identifiable natural person (‘data subject’)” (Article 4). Since at least some, and likely many, taxi drivers and taxi customers are easily identifiable, the dataset contains personal data. Taxi customers (and conceivably also taxi drivers) had not been asked to give their consent to these data being published online for unspecified purposes, nor are other grounds for such processing (Article 6 GDPR) present. This is textbook privacy violation by data [Footnote 5] (more accurately in the EU context: a violation of data protection law). While the GDPR defines a number of exemptions for research, it does so under conditions [19, 22], such that currently ethics boards in EU universities are cautious and therefore discourage the use of this dataset for any kind of data mining [Footnote 6].

This perception of a dataset assumes that the population of data subjects consists of informed individuals, who exercise their autonomy among other things by travelling in vehicle passages they pay for, and who have a reasonable expectation of privacy in doing so that requires that the data about their movements remain confidential. The main question for the responsible data scientist appears to follow from the observation that the removal of taxi identifiers “would adversely impact certain types of analysis on the data” [12, p. 148] and the need to find different analysis types.

3 Case Study 2: AIS Data for Describing Migrant Rescue Operations

The second case study is based on the Hoffmann et al. study published in a 2017 report by the IOM [18], the UN International Organization for Migration, and illustrated in an interactive and multimedia online presentation [Footnote 7]. As in case study 1, the base data are in principle publicly accessible. They are data from the Automatic Identification System (AIS), a maritime communications system through which vessels regularly broadcast information, including their identifier, vessel type, latitude and longitude, speed, course and destination. The information is used by maritime authorities and ships to locate nearby vessels and avoid collisions. Based on these spatiotemporal data and enriched with textual and pictorial data from other sources [Footnote 8], the authors generate a type of holistic trajectories, manually label them as representing (or not) a rescue operation, and use clustering and machine learning with a view to classification and prediction. Other researchers have investigated how to model and detect such trajectories. Based on AIS data, complex events, including but not limited to SAR (search and rescue) missions, and involving one or several vessels, can be modelled and detected efficiently and in real time using combinations of exploratory, machine learning, and logics-based (event calculus) techniques [28, 36].
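To make the notion of a holistic trajectory more concrete, the following is a minimal sketch of how such a record might be represented. The class and field names are hypothetical and are not taken from Hoffmann et al.; only the list of broadcast fields follows the description above, and the example values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AISPoint:
    """One AIS broadcast, with the fields listed in the text."""
    vessel_id: str
    vessel_type: str
    timestamp: datetime
    lat: float
    lon: float
    speed_knots: float
    course_deg: float
    destination: str


@dataclass
class HolisticTrajectory:
    """Spatiotemporal points plus semantic enrichment (hypothetical layout)."""
    points: list[AISPoint]
    annotations: list[str] = field(default_factory=list)  # text from other sources
    is_rescue: Optional[bool] = None  # manual label; None means not yet labelled

    def duration_hours(self) -> float:
        """Time between the first and the last broadcast, in hours."""
        times = [p.timestamp for p in self.points]
        return (max(times) - min(times)).total_seconds() / 3600.0


# Invented example values, for illustration only.
trajectory = HolisticTrajectory(
    points=[
        AISPoint("VESSEL-001", "cargo", datetime(2018, 6, 22, 8, 0),
                 34.0, 13.0, 12.0, 10.0, "UNKNOWN"),
        AISPoint("VESSEL-001", "cargo", datetime(2018, 6, 22, 10, 0),
                 34.3, 13.1, 11.5, 12.0, "UNKNOWN"),
    ],
    annotations=["hypothetical note taken from a public statement"],
)
trajectory.is_rescue = True  # the manual labelling step described above
```

Keeping the label optional reflects that trajectories exist before they are manually labelled; a classifier would be trained on the labelled subset only.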

Hoffmann et al. mention several limitations of their method, mainly with regard to data quality, including the fact that as circumstances change, so do the data and patterns (thus, the analysis of timely data is crucial).

In a section on “privacy”, the authors raise several points. The first is a reference to concerns over port security as a consequence of the public availability of AIS data. The second is the possibility that rescue organisations may not want the full details of their operations to be publicly known, because they are facing opposition and threats (a European far-right group threatening to attack rescue vessels is mentioned). Neither concern is a privacy concern in the sense of European law (in particular because the agent requesting the confidentiality is not a natural person). As a third reason, the authors mention that “adversarial users could take advantage of the data to track the location of individual refugees [identified by record linkage with data such as photos or statements, or other background knowledge], attack rescue boats or guide piracy operations” (p. 40). Presumably, the attacks and piracy operations are security/safety concerns for the rescue vessels, their crews, and the rescued persons, and these concerns could arise from the public availability of the data as well as from possible predictors learned from them, i.e. the data scientists’ work.

It also appears, from the sentence, that the possible tracking of individuals is considered a security/safety risk (because it could lead to attacks) rather than a typical privacy risk (by which an individual migrant would want to keep their identity or properties hidden). It is difficult to say what role such expectations of, or wishes for, privacy in our usual sense, play in this extreme situation. Also, it has been observed increasingly over the past years that rather than trying to hide their voyage, “migrants from Libya facilitated their traceability by national authorities and monitoring systems, anticipating in space and time border patrols by sending an SOS as soon as they entered the international waters” [33, p. 576]. In other words, along their journey, migrants deal strategically with visibility and invisibility, with information disclosure and hiding/confidentiality. This is quite probably a very rational strategy given the fact that a successful and invisible journey to Europe is by now nearly impossible for many reasons, including that traffickers severely overload and under-equip their vessels, and that due to the high-resolution sensors employed in the European Border Surveillance System EUROSUR [9], even very small vessels are likely to be spotted and monitored. Strategic information disclosure (in addition to strategic information hiding) by individuals can also be observed in many other contexts that are less dramatic than the life-or-death situations faced by migrants on the Mediterranean, and it has been pointed out that strategic information disclosures too can be privacy-related behaviour [15].

A second privacy-related question concerns the referent of the data. Technically, a ship’s trajectory could be considered personal data in the same sense as a taxi’s trajectory. (This concerns the ships provided by the traffickers as well as rescuing ships once they have been boarded by migrants.) As for taxis, the trajectory is a trajectory both of the “driver” and of the “passenger(s)”.

In the NGO vessel case, the “driver” individuals are the captain and crew members. To the extent that they can be re-identified using public (or otherwise procured) records, the AIS-based trajectory data form personal data. However, their personal and professional mission is to carry out rescue tasks, and to do so in a transparent manner, and they in fact often seek visibility and publicity (for their funders as well as a political statement). It thus appears less likely that these individuals would regard the publication of the information that they were at some location at some time as a violation of their privacy (even if for security/safety reasons, they may prefer some degree of invisibility, see above). “Drivers” of non-NGO vessels such as cargo ships are likely to have other motivations, since their original task is not related to sea rescue, which may make them regard their location data differently.

As regards the “passengers”, with appropriate background knowledge, similar re-identification attacks could in principle be mounted to identify individuals. These could for example be based on photos taken of individuals while on-board or disembarking, matched with named photos as background knowledge [18]. It is also conceivable that data regarding the captain or crew members and data regarding migrants are combined, and that this may result in undesired consequences. It is an open question whether such attacks are likely [Footnote 9]. If such a re-identification link is not made, or is very unlikely to be made, AIS-based trajectories of rescuing ships may not count as personal data.

However, even if individuals may not be exposed in a traditional privacy-violation sense, there is a much more likely sense in which migrants are exposed by AIS data: as a group. In fact, as has been argued in this context [33] as well as in connection with other applications of big data analyses to humanitarian causes [31], there is a temptation to focus on migrants as a group defined only by one feature (here: to be in need of rescue).

The fact that big data constitute new risks in the profiling of groups has been lamented often in connection with data protection laws such as the GDPR (which focus on the protection of individuals’ rights and freedoms); in the humanitarian realm, it creates additional and different challenges [32].

For the data scientist, this means that also the response to these risks and threats may need to be very different, because traditional approaches to (for example) anonymisation are focussed on the protection of individuals from threats against these persons as individuals. It is an open research question what could constitute effective measures of group protection.

Data privacy, viewed technically, does not need to make a clear distinction between protecting information and control over it related to individuals (a concept rooted in human rights) and protecting information and control over it related to other entities (such as organisations, the NGOs in the current example and in the argument made by Hoffmann et al.) [11]. In data privacy, a different and independent dimension becomes relevant when one asks “whose privacy” should be protected. A useful distinction is that between data owner, data respondent (the data subject, although not always in its legal sense of an individual person), and data user; and this distinction has implications for the choice of data-privacy protection methods [11]. In the present example, one assignment of these roles that follows the argument about risks above could be: the NGO as the data owner, the migrant (or migrant group) as the data respondent, and various (potential) data users: the public, politicians, pirates, ...

Moving beyond privacy and data privacy, many other questions, technical as well as ethical, arise about information disclosure and hiding. The study and visualisation of “rescue patterns” can have different objectives. Hoffmann et al. mention operational objectives (e.g., supporting coordination of rescue operations), analytic objectives (e.g., determining conditions under which rescues are most effective), and reporting objectives. The latter are described as follows: “supplement the large amount of qualitative, descriptive coverage already produced by NGOs and the news media”, “help external observers ... obtain a high-level picture of what is happening in the region over time. An overview of these patterns is critical for coordination and advocacy purposes; it enables stakeholders to see the true magnitude of rescue operations, and to quantify costs, shortcomings and future needs.” [18, p. 30].

Concentrating on the reporting objective, it can be argued that rescue patterns constitute a counter-mapping practice: in the EUROSUR monitoring system, selected migratory events are produced from the sensed data and mapped in time and space [33]. The website watchthemed.net, initiated and run by a network of NGOs, activists and researchers, maps events to monitor deaths and violations of migrants’ rights. In the SAR-centric applications described here, rescue events are produced and mapped. EUROSUR is run by Frontex, and its data and analytics are not available to NGOs and other external partners, whereas the rescue patterns are mined from data available publicly (AIS data) or available to partners of the research (the broadcast warnings), and enriched with further aspects from public data (such as tweets).

Mapping practices generate a narrative around their real-world phenomenon. The current data models and visualizations of rescue patterns, maybe for technical reasons (because the EUROSUR data are not available), maybe to avoid visual clutter, display these patterns in an otherwise “empty” space. Is it possible, and is it advisable, to at least represent that far more data exist (even if one does not have access to them)? In other words, should the “known unknown” data be modelled and represented too, and if so, how? These data are important for technical reasons as much as for narrative reasons – how can and should these two motivations be addressed, and how can the choices made be made in a transparent and accountable way? In the following paragraphs, I will illustrate three examples of these considerations.

Fig. 1. The Alexander Maersk’s June 2018 trajectory (in red, via Valletta). (Color figure online)

First, sometimes trajectory data illustrate very directly the influences of context and the uncertainty and the “unknowns” of vessel operators. As an example, consider the recent case of a commercial cargo ship that took on 113 people saved by an NGO rescue ship and then spent four days in a political stand-off on a zig-zag trajectory between ports before being allowed to dock in Sicily [2, 5], see Fig. 1 [Footnote 10]. Can and should holistic trajectories measure and visualize the enormous costs caused by such decisions, as well as the incentives and influences this may have on further behaviour by vessel operators? What about similar odysseys that have since taken place in a politically more and more charged climate, such as those involving a coastguard and a sea-rescue NGO ship respectively [20, 38]? What about, conversely, the trajectories of ships that could not and were not ‘doing anything anymore’ under these circumstances, with trajectories (enforced by the political context) so dis-incentivizing that they contributed to Germany’s withdrawal from Sophia, the EU naval mission targeting human trafficking in the Mediterranean [10]? Could and (how) should a visualisation illustrate a progressive emptying of the knowable in the space, caused by the reduction of official and NGO sea-rescue vessels active in the area?
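As one possible starting point for the measurement question raised above, the following sketch computes a simple detour ratio: the length of the path actually sailed divided by the great-circle distance between its first and last position. This is an illustrative assumption about how a zig-zag trajectory could be quantified, not a method from the cited studies; the coordinates are invented and do not reproduce any actual voyage.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def path_length_km(track):
    """Sum of the distances between consecutive positions."""
    return sum(haversine_km(p, q) for p, q in zip(track, track[1:]))


def detour_ratio(track):
    """Path length divided by direct distance; 1.0 means a direct course."""
    direct = haversine_km(track[0], track[-1])
    if direct == 0:
        raise ValueError("start and end position coincide")
    return path_length_km(track) / direct


# Invented coordinates for illustration.
straight = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
zigzag = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]

straight_ratio = detour_ratio(straight)
zigzag_ratio = detour_ratio(zigzag)
```

A ratio well above 1 only signals that a vessel did not take a direct course. It says nothing about why, so the contextual information discussed in this section would still have to be modelled separately.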

Second, many of the existing, but not accessible data have strong effects on the rescue events modelled. For example, the Libyan coastguard now has indirect access to EUROSUR data [25]; thus, their rescue actions, including those in cooperation or competition with European actors, may be planned based on data that are not modelled in the rescue patterns system, and which therefore can co-determine the “coordination” and “effectiveness” of a rescue. Can and should these data (or at least the fact of their existence and possible influence) be modelled?

Third, further questions concern which aspects are important to judge the legal and ethical dimensions of a rescue operation. An example is provided by [23] and [6]: in a case in which a commercial towboat under Italian flag rescued migrants and then handed them over to the Libyan coastguard, a key legal question revolved not around the spatiotemporal data of the rescue operation, but around whether it was instructed by the Italian or the Libyan authorities [37]. Can these aspects be modelled as part of holistic trajectories, and how could this be done if the datum itself is still being contested?

4 Towards a Comparative Analysis

The preceding sections have shown that vehicle trajectory data are often rich sources of personal data, of individuals as well as of groups. However, even if very similar in technical aspects, such data can present very different challenges in different contexts. In both examples analysed in the present paper, concerns of different stakeholders need to be weighed. Even if we only regard stakeholder groups’ interests with regard to invisibility (confidentiality of the data) or visibility, further differentiation becomes apparent. For reasons of space, I cannot present a worked-out comparative analysis of the two case studies, or provide a weighing of the different interests in a GDPR sense. Instead, I will sketch some further subdivisions that arise within stakeholder groups, and argue that a weighing of these interests is a more far-reaching political decision.

As regards the stakeholder group “drivers”, it appears that those in the taxi case study probably have an interest in invisibility, whereas those in the vessel case study may seek invisibility, be indifferent, have an interest in visibility, or regard being located as a security/safety risk more than as a privacy violation. Their views may also depend on whether they are in the subgroup of “NGO vessel driver” or “other vessel driver”.

For the “passengers”, invisibility appears a strong interest in the first case study, whereas visibility may be strongly preferred as a prerequisite for surviving by the vulnerable people in the second case study, and different protection needs of individuals and groups become apparent.

“The public” also consists of subgroups with different interests. In the taxi case study, these include citizens interested in the visibility of public city data (as the motivation of the FOI request), scientists interested in publicly available datasets, celebrity spotters interested in disclosures, and privacy activists and data scientists interested in highlighting and preventing data-privacy attacks. In the AIS data case study, different subgroups of the public are even interested in creating different overall narratives, including (a) “there is an invasion of migrants”, (b) “the migration crisis is over”, and (c) “people keep dying”. It may be argued that these narratives [34] induce preferences for visibility (a, c) and invisibility (b), for different reasons, with different finalities, and therefore with foci on different data.

5 Conclusion

In sum, a consideration of the modelling and reporting of vehicle data and patterns, even if restricted to what data are to be included and how, what information is to be kept confidential or disclosed, can reach far beyond the traditional questions discussed under data protection and data privacy, and data science can assume the importance and responsibility normally associated with PETs. This use of modelling and AI requires a critical examination of the sociopolitical background of the mobility that these vehicles afford, support, or impede, and of the goals of the data-science project undertaken. And although “data can help citizens demand accountability”, “ultimately, the inferences that can be drawn from the data are only as valuable as the actions they induce. There is a need for political momentum to address the situation in the Mediterranean, and this problem will not be solved with data alone.” [18, p. 42].