Introduction

This article introduces digital provenance. In the Background section I compare data provenance and digital provenance, in the context of several related literatures, focusing on archival studies, digital preservation and media archeology. The remainder of the article is divided into two parts, first examining three dimensions of digital provenance (technical, social-technical and social) and then considering how digital provenance might be of use in the four main archival functions (appraisal, preservation, representation and access). An understanding of digital provenance is necessary for archivists to process born digital records; but more than this, it is necessary for archivists and archival users to understand the context and content of born digital records.

Background

The timing of this special issue could not be better. While archivists have long debated the nature of provenance (Douglas 2017), concern over machine learning (ML), artificial intelligence (AI) and large language models (LLMs) has brought more attention than ever before to provenance outside of archival studies. These discussions explore provenance in ways that align with archival ideas, sometimes drawing on archival literature.

US President Joseph Biden’s “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence” recognizes the importance of provenance, promising:

My Administration will help develop effective labeling and content provenance mechanisms, so that Americans are able to determine when content is generated using AI and when it is not (United States. Executive Order No. 14110, 2023).

In “Lessons from Archives” Jo and Gebru (2020) observe:

As disciplines primarily concerned with documentation collection and information categorization, archival studies have come across many of the issues related to consent, privacy, power imbalance, and representation among other concerns that the ML community is now starting to discuss … Our results show that archives have institutional and procedural structures in place that regulate data collection, annotation, and preservation that ML can draw from (pp. 307-308).

In non-academic publications, archival concepts of provenance are becoming better known as well. The lead editorial of a 2024 issue of The Economist magazine opines:

Online content will no longer verify itself, so who posted something will become as important as what was posted. Assuming trustworthy sources can continue to identify themselves securely—via urls, email addresses and social media platforms—reputation and provenance will become more important than ever (Economist 20 Jan 2024, p. 12).

Despite rising concerns for provenance, and even though archival studies scholars have for many years brought needed complexity to notions of provenance (Douglas 2017), archival practice has been slow to build on these insights. In 1993 Cook complained that a simplistic, “mono-hierarchical” view of provenance, in which the complexities of record creation and keeping are minimized in favor of a clear-cut designation of a single creator, predominates in archival representation. Cook argues that this misrepresents the true nature of provenance, and notes that a “straightforward relationship between a single creator and a few closed, complete series of records simply does not exist” (Cook 1993, p. 33). Drawing on Scott (1966) and Barr (1987), Cook suggests that provenance is often multiple and extended in time. In Australia Hurley (1994), also drawing on Scott, contrasts serial and simultaneous multiple provenance, and describes how descriptive systems can document a range of multiple provenances, as in the series system (or Australian system). Nonetheless nearly a decade later, Nesmith (2002), similarly to Cook (1993), observed “Archivists have typically viewed provenance narrowly, as the single individual or family (for personal archives) or the particular office (for institutional archives) that inscribed, accumulated, and used a body of records” (p. 35). In place of this narrow view, Nesmith offers a definition of provenance that includes “societal and intellectual contexts ..., the functions the records perform, the capacities of information technologies to capture and preserve information at a given time, and the custodial history of the records” (p. 35). Nesmith concludes that the history of the records is, effectively, their provenance; and that all who have a hand in that history should be considered creators or co-creators of the records, even if their contributions are relatively distant from the specific content of the records at hand.

Despite such rich conceptualizations of provenance, archivists in Canada, from which Cook and Nesmith wrote then and I write today, continue to follow, with only minor updates, the descriptive standard that Cook found so inadequate in 1993: the Rules for Archival Description (RAD) (Dancy 2012). Digital preservationists have developed metadata standards to document provenance in the context of their work, such as PREMIS and PROV. Depending on their implementation, these can allow for multiple provenances (Bettivia et al. 2022), but they are designed to support preservation activities rather than records access and the representation of records. ICA’s Records in Contexts (RiC), a descriptive standard, might substantially change how archivists represent provenance to their users, but work remains to be done on how it can be implemented, and there are not yet archival content management systems that can accommodate its broader vision of multiple provenances (EGAD 2023). In short, three decades after Cook’s critique, which itself built on a critique Scott (1966) had made almost three decades earlier, archivists’ representations of provenance in finding aids and catalogs often remain simple and narrow.
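
To make concrete the kind of provenance such metadata standards can record, the fragment below uses the W3C PROV-N notation to describe a hypothetical record, its creator, and a later format migration performed by an archives. All names here are invented for illustration, and a real PREMIS or PROV implementation would carry far more detail:

```
document
  prefix ex <http://example.org/>

  entity(ex:report-v1)                          // the born digital record
  agent(ex:author, [ prov:type='prov:Person' ])
  agent(ex:archives, [ prov:type='prov:Organization' ])
  wasAttributedTo(ex:report-v1, ex:author)

  activity(ex:migration)                        // e.g., spreadsheet to PDF
  used(ex:migration, ex:report-v1, -)
  wasAssociatedWith(ex:migration, ex:archives, -)

  entity(ex:report-v1-pdf)                      // the migrated copy
  wasGeneratedBy(ex:report-v1-pdf, ex:migration, -)
  wasDerivedFrom(ex:report-v1-pdf, ex:report-v1)
endDocument
```

Note how the model can attribute the same entity to several agents, and can chain derivations across successive migrations, which is one way multiple and serial provenances, in Hurley’s sense, can be expressed in metadata even while a finding aid names only a single creator.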

In this article I agree with Nesmith’s understanding of provenance as the entire history of the record, a definition in which provenance is complex, multiple and broad, encompassing many actors including technology designers and creators, record creators and record keepers and archivists. How provenance is defined, however, is separate from how archivists represent provenance. Going forward, as standards for archival representation such as RiC continue to develop, and in time may replace older standards such as RAD, and as archivists embrace the possibilities of linked data, representation of provenance could and should become more complex. Even so, the breadth of provenance, when understood as the entire history of the records, means that any representation of it can only ever be partial. Archivists must make choices as to which aspects of provenance they describe; such choices can only be situational and contingent, based on the mandate and resources of an archives, the particular records at issue, and the communities for whom the records matter and by whom they will be used. In this article, I suggest that digital provenance be incorporated into archival representations of the provenance of born digital records, and into other archival functions.

Digital provenance aligns with but is distinct from much of the literature on provenance, which has tended to focus, with some exceptions, on the content of the records, rather than their material manifestation. Outside of archival studies, this kind of provenance is known as data provenance (Buneman et al. 2001; Mordell 2019) or data lineage (Bose 2002). Data lineage is of general interest due to the need for those making use of archival and other data sets to identify the origins, including the subjectivities and biases, of a particular data set, as well as other “issues related to consent, privacy, power imbalance, and representation” in a data set (Jo and Gebru 2020), through “all the processes and transformations of data from original measurements to its current form” (quoted in Bose 2002). A good deal of the archival literature on provenance tends to focus on this kind of content-oriented provenance. Digital provenance, on the other hand, considers the specific systems involved in creating and managing the records or data, and engages with debates around digital materiality; this complements a larger (though still small) literature on the materiality of non-digital archival records. Digital provenance is a necessary addition to our understanding of provenance because, to date, archivists have not incorporated into their representations of provenance an understanding of how the limitations and characteristics of digital technologies affect the records that are created and kept on them.

Digital provenance deepens our understanding of data provenance, or data lineage, by adding technical, social and cultural dimensions to how we understand the materiality and meanings of technologies of inscription, access and preservation. It addresses a persistent misunderstanding in the digital preservation literature that archival preservation is primarily informational in its focus. For example, Owens (2018) describes “three different frames for thinking about preservation—artefactual, informational and folkloric” (p. 26). Of these three frames, artefactual preservation aligns most closely with what I am calling digital provenance, since it values the specific, material qualities of objects (including documents). Owens finds that archives practice artefactual preservation (for example, by keeping original, non-digital records in climate-controlled vaults) and informational preservation (for example, by reformatting non-digital records through programs of publishing, microfilming or digitization), but concludes that archivists’ interests in artefactual preservation are “rather pragmatic” in that “considerations for the material object are less about its artefactual qualities and more about trying to inexpensively ensure the longevity of the information stored on the object” (Owens 2018, p. 29). This is wrong.

In asserting that archivists are concerned primarily with informational preservation, Owens sits with the larger part of the digital preservation literature, in which format migration is presented as a reasonable and cost-effective means of preserving information despite sometimes massive losses in digital functionality (for example, when spreadsheets are preserved as PDFs) or in appearance (for example, when images and video are preserved separately from text (e.g., Brown 2014)). Yeo (2010), Webster (2020) and Skødt (2024) offer contrasting perspectives, finding information in digital records that is not limited to the legibility of text or image. Yeo, for example, discusses how page layouts, font choices, text color and many other factors may also be meaningful. Webster (2020) and Skødt (2024) explore multiple losses of information through format migration; both discuss the practice of “fixing” spreadsheets by transforming them into static formats such as PDFs (e.g., Levi 2011), which not only destroys the formulae used to calculate values but incurs further losses by eliminating sortability, searchability and “playability” (see also Open Preservation Foundation. Archives Interest Group [n.d.]).
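
The loss Webster and Skødt describe can be sketched in a few lines of Python. This is a toy model, not a preservation tool, and the cell contents are invented: a “live” cell pairs a formula with its computed value, while a static export keeps only the rendered value, as happens when a spreadsheet is fixed as a PDF.

```python
# Toy model of format-migration loss: a live spreadsheet cell stores both
# the formula that produced its value and the value itself; a static export
# (e.g., to PDF) keeps only the rendered number, discarding the record of
# how it was derived.

def live_cell(formula, inputs):
    """A 'cell' as a formula plus the value computed from its inputs."""
    return {"formula": formula, "value": sum(inputs)}

def export_static(cell):
    """Migration to a static format: only the displayed value survives."""
    return {"value": cell["value"]}

budget = live_cell("=SUM(B2:B4)", [1200, 850, 300])
frozen = export_static(budget)

print(budget)   # formula and value are both present
print(frozen)   # the formula is gone; the derivation cannot be recovered
```

Sortability, searchability and “playability” disappear in the same way: they are behaviors of the live format, not properties of the rendered surface that the static copy retains.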

Elsewhere in his book Owens (2018) proves sensitive to these concerns, as when he discusses “screen essentialism,” but Owens associates this with “new media scholars” rather than archivists (Owens 2018, p. 46). In fact, Owens could have found the same argument within archival studies, albeit with different terminology and examples. Yeo, Webster and Skødt, straddling archival studies and digital preservation, frame their arguments in relation to significant properties, the idea that some aspects or characteristics of digital records can be identified for retention, while the loss of others can be considered acceptable, if unfortunate (Hedstrom and Lee 2002; Wilson 2007). Others base their arguments on the materiality of non-digital records. Rekrut (2005) finds that the materiality of non-digital records reflects the preferences and available choices of a record creator, and can shape the form and content of the record that results. To illustrate she considers the final letter that Métis resistance leader Louis Riel wrote to his wife in 1885 on low-quality, inexpensive paper, while awaiting execution in a Regina, Saskatchewan prison. Nesmith (2006) similarly finds meaning—information—in the materials that a record creator inscribes upon, illustrating his argument with reference to an official journal kept by a North West Company fur trader in 1802–1803, who “apparently running out of paper,” wrote upon birch bark (Nesmith 2006 p. 353). Nesmith sees in this choice the contours of a larger intercultural encounter between the trader, his Indigenous wife and relations, and the cultural and physical environment, far north on Turtle Island. Lester (2018) sees in the materiality of international treaties intellectual, sensory and emotional effects that influence encounters with, and the meaning of, the content of the record.

While this focus on materiality is found in only a minority of the archival literature, a broader acceptance of the importance of materiality and “originality” in record valuation can be seen in a range of archival practices—and nowhere more so than in the reformatting processes which, as we have seen, Owens cites as evidence of a lack of concern for artefactual preservation. In fact, archives rarely discard the original record once it has been reformatted, instead preserving both original and copy, and tend to see reformatting more as a means of promoting access than as a means of preservation—at least for non-digital records.

For digital records, archival concerns for the materiality of “the original” have been muddied by a lack of clarity as to what, exactly, is the digital original (e.g., Cook 1994; Sundqvist 2021), and over the conceptual challenges of digital materiality. Digital materiality encompasses the materiality of the bits that make up a digital file; the infrastructure necessary to keep, access and render the bits; and the internal characteristics of the digital representation itself (Blanchette 2011; Sundqvist 2021; Manoff 2013). Perhaps these conceptual challenges have allowed archivists to be influenced by much of the digital preservation literature, which arguably better reflects Owens’s (2018) observation of greater concern for informational than artefactual preservation. Nonetheless, despite the evident differences between digital and non-digital media, I believe that archivists should be as concerned for digital materiality as for non-digital materiality. As with the low-quality paper and birch bark examined by Rekrut and Nesmith, the choices made by digital record creators are constrained by what is available to them in their circumstances. These constrained choices nonetheless offer an interested reader key information about the act, result and context of record creation. Archival studies scholars such as Rekrut, Nesmith and Lester have considered these constraints and choices in relation to non-digital records, and others including Yeo, Webster and Skødt have seen similar value in keeping digital records in their original formats; but the study of the materiality of digital originals has proceeded furthest in the field of media archeology. Media archeology can be understood as a subfield of media studies, with contributions from scholars such as Kittler (1999) and Ernst (2013), as well as literary studies, with contributions from scholars such as Galey and Kirschenbaum.

Galey’s 2012 study of ebooks is a tour-de-force examination of digital and non-digital materialities, an excellent starting point for those seeking to understand how and why they matter. Kirschenbaum’s Mechanisms (2008) and Track Changes (2016) offer detailed and nuanced readings of how digital systems impose procedures and conditions on their users, ultimately impacting the ways that the systems are used and the records created using them.

What I am calling digital provenance incorporates the concerns for materiality and digital materiality articulated in archival studies and media archeology, in addition to drawing more broadly on media studies, digital preservation, digital forensics and the history of computing. While my concept of digital provenance is original to this paper, it is primarily a work of synthesis, in which I provide, first, a concise articulation of digital provenance through a discussion of its technical, social-technical and social dimensions; and then explore how it might be integrated into the daily work of archivists through the four principal archival functions of appraisal, preservation, representation and access.

Part I: Dimensions of digital provenance

Technical dimension

The technical dimension of digital provenance represents the histories and ideas that inform the design and construction of the hardware and software that make up digital systems.

A certain level of technical knowledge is required to work with digital systems and records. Some technical requirements are built into the systems themselves, which can only function with compatible hardware, software, electrical supply, cooling and so on. The most basic of these requirements are spelled out through the technical standards that enable electricity and data to flow, or not, over heterogeneous infrastructures that in some cases have been built over the last hundred years.

Inscribing requirements into protocols and standards, such as TCP/IP, the suite of protocols that allow data to move across the Internet, does not make them any less of a cultural expression than a poem or a painting. Turner (2010), Wu (2011) and Markoff (2005) are among those who see in TCP/IP an expression of the culture and values of the 1960s. TCP/IP underwrites a distributed, non-hierarchical architecture that reflects not only the fear that a nuclear bomb might wipe out part of the network, but other cultural impulses toward decentralization, antiauthoritarianism and radical openness as well. The idealism behind the creation of open protocols and architectures, developed in common through the RFC (Request for Comments) system, carried forward to the similarly open standards and architectures of the World Wide Web in the late 1980s. It was notably absent, however, in the rush to monetize the Internet during the 1990s and 2000s, out of which emerged our current digital environment of paywalled data silos that use the Internet and World Wide Web only for carriage or connection, an environment that is itself an expression of the heady dot-com capitalism of that era.

This understanding of technologies as expressions of culture carries through to all aspects of our computing devices, including their most basic operations. Computer historian Mahoney (2005) and digital culture visionary Nelson (1979/2016) are among those who have ascribed the “command and control” architecture of the earliest computers to the military culture in which they were conceived. Kidder’s 1981 account of the development of a new Data General minicomputer in the 1970s, also quoted by Mahoney, includes his description of Data General’s lead designer, Tom West, poking through the guts of a minicomputer created by industry-leading rival DEC:

Looking into the VAX, West had imagined he saw a diagram of DEC’s corporate organization. He felt that VAX was too complicated. He did not like, for instance, the system by which various parts of the machine communicated with each other; for his taste, there was too much protocol involved. He decided that VAX embodied flaws in DEC’s corporate organization. The machine expressed that phenomenally successful company’s cautious, bureaucratic style (Kidder 1981, p. 32).

Isaacson (2011) has similarly noted that the culture of the Santa Clara Valley was essential to the differing and ultimately diverging philosophies of the two founders of Apple Computer, Steve Jobs and Steve Wozniak. Isaacson characterizes the creation of the Apple II as the moment at which their philosophies combined, and then parted:

The Apple II would be marketed, in various models, for the next sixteen years, with close to six million sold. More than any other machine, it launched the personal computer industry. Wozniak deserves the historic credit for the design of its awe-inspiring circuit board and related operating software, which was one of the era’s great feats of solo invention. But Jobs was the one who integrated Wozniak’s boards into a friendly package, from the power supply to the sleek case. He also created the company that sprang up around Wozniak’s machines. As Regis McKenna later said, “Woz designed a great machine, but it would be sitting in hobby shops today were it not for Steve Jobs” (Isaacson 2011, pp. 84-5).

Wozniak designed the Apple II to be opened up and customized by users who shared his enthusiasm not only to use computers but to build computers:

Wozniak was just trying to make a great computer for himself and impress his friends at the Homebrew Computer Club. His design somehow projected an audacious sense of infinite horizons, as if the Apple II could do anything, if you were just clever enough (Hertzfeld 2005, p. xvii).

If the Apple II was Wozniak’s machine, the Macintosh was all Jobs. Isaacson observes that it was designed so.

You wouldn’t even be able to open the case and get at the motherboard. For a hobbyist or hacker, that was uncool. But for Jobs, the Macintosh was for the masses. He wanted to give them a controlled experience. “It reflects his personality, which is to want control,” said Barry Cash “Steve would talk about the Apple II and complain, ‘We don’t have control, and look at all these crazy things people are trying to do to it. That is a mistake I’ll never make again.’ He went so far as to design special tools so that the Macintosh case could not be opened with a regular screwdriver” (Isaacson 2011, p. 138).

Hardware, in short, is as much an expression of culture as a novel. Similarly, many have explicitly compared writing code to writing literature. Wagner’s 2017 blogpost “Writing Good Code is like Writing a Novel” offers a series of comparisons between creative writing and code writing; Winata (2020) maintains “Writing code is like writing a love letter.” Another tech blogger, Khmelyuk (2017), draws advice for coders from On Writing by Stephen King, organized under headings such as “Write about something you like,” “Simplify and remove clutter,” “Read continuously” and “Avoid passive verbs.”

All of this is to say that, even at their most technical, digital technologies are cultural. Nonetheless, technical compatibility remains a non-negotiable, foundational requirement for all else. Thus, without denying the cultural status of technical development, it is necessary to understand the specific configurations of hardware and software, coding languages and structures, and manufacturing and maintenance techniques that impose technical dependencies on the implementation and use of digital technologies. These technical dependencies and cultural contingencies are part of the technical dimension of digital provenance.

Social-technical dimension

The social-technical dimension of digital provenance represents histories of use and users, paying particular attention to necessary tacit knowledges. The social-technical dimension speaks to the influence that users have on shaping how technologies spread, are taken up, and are used within what might be called a community of users, or a community of computing. Writing about the earliest era of commercial computing, the mainframe era of the 1950s, Mahoney notes:

the history of computing is the history of what people wanted computers to do and how people designed computers to do it. Different groups of people saw different possibilities in computing, and they had different experiences as they sought to realize those possibilities. One may speak of them as ‘communities of computing’, or perhaps as communities of practitioners that took up the computer, adapting to it while they adapted it to their purposes. (Mahoney 2005, pp. 123-4)

The ways that we adopt and adapt our digital systems are inherently limited by the systems themselves. The notion of an all-powerful computer with limitless capacities exists only in fictions like Star Trek: The Next Generation, in which the holodeck can exactly simulate any place in the galaxy and any period in human history; or 2001: A Space Odyssey, in which the mainframe computer, HAL, attains self-awareness and uses its complete surveillance and control over the spacecraft to eliminate problem-causing humans. The reality is much more mundane. To return to Mahoney:

To ‘use’ a ‘personal’ computer today is, despite its much hyped origins in the counterculture, to work in a variety of environments created by a host of anonymous people who have made decisions about the tasks to be done and the ways they should be done. As most of us use a computer, it is no more personal than a restaurant: you can have anything you want on the menu, cooked the way the kitchen has prepared it. (Mahoney 2005, p. 132)

This does not mean that the technology completely or inevitably determines its use: as we habituate ourselves to digital systems, we use them in ways that could not have been foreseen by their developers, despite being constrained by the choices those developers made. To extend Mahoney’s restaurant metaphor: we might order from the breakfast menu for dinner, order an appetizer as an entrée, or order the Caesar salad and eat only the bacon bits and croutons. Mahoney illustrates this point with reference to PowerPoint, developed to present hard data in the corporate boardroom and now used throughout society (Mahoney 2005, p. 132). There is a longstanding critique of PowerPoint as imposing authoritarian linearity, in contrast to the supposed free-ranging, looping orality of non-PowerPoint speakers (Tufte 2003). In fact, PowerPoint is used in a variety of ways by a variety of users, including as counterpoint to oral presentations; as a series of nonlinear, evocative images; or unaccompanied by oral presentation, as a freestanding form of communication. This, too, is part of the culture of PowerPoint use. Whether creating a PowerPoint slide show, delivering a talk accompanied by one, or sitting in the audience for one, our own range of experience with PowerPoint informs how we create and receive the slides.

To take another example, the pre-eminence of the MS DOS operating system in the 1980s hinged upon users’ existing familiarity with command-line interfaces, gained from other and earlier operating systems and from running specific applications. For these users, the vocabularies and processes of the command line could be experienced as natural and intuitive. As desktop computing power increased in the 1990s, operating systems based on point-and-click icons and application windows, known as graphical user interfaces (GUIs), enabled broader use of digital technologies and edged out the command-line interface, rendering obsolete not only MS DOS and its extensive library of applications software, but also the tacit knowledge necessary to use these systems. Using the command line today requires the tacit knowledge that all command-line users possessed in the 1980s. I recall my own reluctance to let go of my last MS DOS computer, already mourning the key combination shortcuts that made the command line so efficient for the experienced user.

This kind of tacit knowledge goes well beyond key combination shortcuts or an intuitive understanding of the range of possibilities for using an application such as PowerPoint. Galloway (2011), describing her efforts to restart and operate her long-unused Kaypro II computer, purchased in 1983, shows how tacit knowledge informs every aspect of using the machine, from turning it on to loading applications software and saving files to disk. This kind of tacit knowledge is known and shared within Mahoney’s communities of computing. When Galloway’s Kaypro II was a current system, this community was larger and allowed easy movement among and within other communities, such as those using different systems that ran the same operating system (CP/M), and therefore had access to the same applications software; that used the same external memory media; or that had accustomed themselves to similar routines, such as loading applications after boot-up and before loading data. Galloway describes how, with the passage of time, her own tacit knowledge had receded, still accessed through a sort of muscle memory that reawakened as she began to use the computer once more, but no longer sure or complete. At the same time, the community of computing around the Kaypro II had shrunk and moved online: Galloway found a new community of computing devoted to retrocomputing that shared knowledge through online videos, websites and social media.

Digital forensics has given us the ability to study how people have used their digital systems in ways that are impossible with other media (Kirschenbaum 2008). Digital forensics tools and processes include external audits performed for security or other reasons, network traffic tracking and analysis, as well as native functionalities within digital systems. For example, Reside (2011) describes his work with the 180 floppies that Jonathan Larson, who created the hit musical Rent, left behind after his untimely passing. By studying not only what Larson saved on his floppies but the metadata saved as “last modified,” Reside was able to identify the last edits that Larson made on Rent, only days before his death, noting that “the fact that Larson made the change digitally two hours after he had saved the previous draft and saved the draft as a new file suggests how he used his computer in his creative process” (Reside 2011, pp. 337–338).
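
The “last modified” metadata that Reside read from Larson’s floppies is kept by modern file systems as well. The sketch below, in standard-library Python with invented file names, shows the basic move: read each file’s modification timestamp and sort, reconstructing the order in which drafts were saved.

```python
# Minimal sketch of timestamp forensics: recover each file's last-modified
# time and order the files chronologically, as one might do to reconstruct
# the sequence of drafts on a recovered disk.
import os
import tempfile
import time
from datetime import datetime, timezone

def modification_history(paths):
    """Return (path, last-modified datetime) pairs, oldest first."""
    stamped = [(p, datetime.fromtimestamp(os.stat(p).st_mtime, tz=timezone.utc))
               for p in paths]
    return sorted(stamped, key=lambda pair: pair[1])

# Demonstration with two temporary "drafts" saved moments apart.
with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "draft1.txt")
    second = os.path.join(d, "draft2.txt")
    with open(first, "w") as f:
        f.write("first draft")
    time.sleep(0.05)  # ensure distinct timestamps
    with open(second, "w") as f:
        f.write("revised draft")
    for path, when in modification_history([second, first]):
        print(os.path.basename(path), when.isoformat())
```

Real forensic work, of course, must also ask whether such timestamps can be trusted, since copying or migrating files can silently rewrite them; this is one reason digital preservationists image disks rather than simply copying their contents.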

One of the perceptive peer reviewers of this essay encouraged me to note that tacit knowledges continue to be important today, citing the example of contemporary smartphones, “whose user interface seems to be based on the premise of having no explicit instructions” but rather learned “through trial and error and word of mouth” (Reviewer 3, 2024). This point is well taken: the tacit knowledge needed to operate a smartphone today would be easier to pick up through trial and error than that required to operate Galloway’s Kaypro II from the early 1980s. This reflects forty years of evolution in computer system and software design, as well as forty years of growth in computing speed and power. Similarly, it is often said that any smartphone today has more computing power than all of NASA had for the Apollo program's moonshot in 1969—but it is worth noting that much of the smartphone’s computing power is devoted to making its interface intuitive and seemingly transparent. The comparison is apples to oranges. The community of computing for smartphones is incomparably larger than that for the Apollo computing infrastructure, and the tasks that most members of the smartphone “community of computing” accomplish are commensurately smaller. To glance back at Isaacson’s summary of Jobs’ design philosophy for the Apple Macintosh, smartphones are “for the masses” while the Apollo computing infrastructure was for a small, elite and deeply thoughtful group of computational brainiacs. I would not want to put my smartphone computational skills up against Creola Katherine Johnson, with or without her mainframe.

The social-technical dimension of digital provenance, then, is shared among members of a community of computing, and between related communities of computing. While it may seem natural to use our computing systems in a certain way, there is nothing natural about these actions—they are learned. Even the habits of tapping or clicking on icons and sorting through applications windows must be learned, as any public services librarian from the 2000s could attest. Together these actions and knowledges make up the social-technical provenance of the system, and records created with the system.

Social dimension

The social dimension of digital provenance encompasses the meanings that digital systems, and records created on them, have within a culture. These associations may be to digital systems in general, to entire classes of devices or software, or to specific instances of systems.

All archival records have meanings beyond the direct meaning conveyed through their inscription, and the functions that the records document or enable. O’Toole (1993) discussed the “symbolic value” of archives, which he aligned with “nonpractical” reasons for creating and keeping records. Lester (2018) discussed the sensory and emotional value of records which, he noted, affect the meaning and value of the records themselves. While the symbolic and emotional value of records may go beyond their specific material manifestations, these values are often bound up with materiality. Rekrut (2005) and Dever (2019) agree on the need for archivists to have material literacy to discern informational, emotional and other meanings that are conveyed through the materiality of records. All of these concepts contribute to what might be called the social value of records. When the social meaning and value of a record connects with larger systems of social meaning and value within a culture, it is possible to talk about a social dimension of provenance.

As noted above, digital materiality is conceptually more challenging than other forms of materiality; it involves the materiality of the bits that make up a digital file, the infrastructure necessary to access and render the bits, and the configuration of the digital representation itself (Blanchette 2011; Sundqvist 2021; Manoff 2013). The social value of digital records can be bound up with digital materiality, as when records have particular social value due to the digital systems used to produce them or, as with other forms of record, it can be bound up with the functional, nonpractical or symbolic meanings of the records themselves.

Some meanings of computer systems are symbolic and social. Haigh (2001, p. 87) describes the “symbolic modernity” of 1950s mainframes, which was strategically demonstrated and enhanced by making them visible to clients, visitors and staff through plate glass windows, “a Potemkin village of clattering printers, spinning tape drives, and flashing lights.” Mainframes were promoted by IBM and Univac with promises to “revolutionize” corporate operations and management decision making. Since sales of mainframes outstripped production, corporations committed themselves to multi-million-dollar purchases (a typical installation cost around USD $2 million in the mid 1950s) months or years before they had practical experience with the capabilities and limitations of the new technology (p. 78). In the end, many early mainframes performed the kind of data processing work that was already handled efficiently by electromechanical tabulators, a mature technology that was seventy years old by the 1950s. Haigh characterizes these early mainframes as “chromium-plated tabulators:” “Visitors and reporters could not judge the usefulness of what was being produced, still less its cost savings,” notes Haigh, “so it was more important that the computer be seen to operate than that it improve managerial effectiveness” (p. 87).

Revolution has been a selling feature of digital technologies from their inception. Edmund Berkeley’s Giant Brains, or Machines That Think, published in 1949, set the tone of revolutionary hype. Mahoney succinctly notes that “hype hides history” (Mahoney 2005, p. 120): when every new generation of computers is billed as a new revolution, history is declared irrelevant since this new technology rewrites all of the rules. Inevitably, digital technologies cannot fulfill this hype; nonetheless, forward-looking symbolic modernity endures. By the early 2000s, when the BlackBerry handheld had become visual shorthand for managerial hierarchies, the “hype cycle” had been identified, labeled and dissected (Linden and Fenn 2003; Gartner, n.d.). Nobody would be surprised to learn that, despite (and because of) the hype, BlackBerrys were a status symbol and a very convenient way of messaging on the go rather than, according to the hype, the key to lightning-fast decision making by top executives. Hedman and Gimpel (2010) identify five distinct values attached to the adoption of “hyped technologies:” not only functional value, but social, epistemic, emotional and conditional value as well.

The symbolic value of digital technologies was put to use in geopolitical policy, public relations and espionage during the Cold War, as allies and clients of the United States were provided access to American digital technologies. Tinn (2010; 2018) has discussed how Taiwan’s stand against communist China in the 1950s and 1960s attracted American financial and technological aid. In fact, the one was a prerequisite for the other, and to get the financial aid it was necessary to accept and use American computers, which came with American personnel to train and assist Taiwanese operators. As a result, digitally processed economic information flowed from Taiwan to the United States (Tinn 2010; 2018). “That economic knowledge was to be used for ‘development,’” writes Tinn, “which was a prevailing catchword, working in tandem with the idea of ‘containment,’ during the Cold War” (Tinn 2010, p. 91).

The social value of digital technologies need not be tied to newness or promises of revolution. The “dumb phone” trend today may well be at (or past) its peak, with Vice magazine’s guide to “The Best Dumb Phones (For Getting Back in Touch With Reality).” Among the half-screens, hard keys and flip phones is the Punkt, recommended “If your goal is having people say, ‘Wow, cool—what is that?’ every time you take your phone out” (Rothbarth 2024). Retro technologies can be functional and keyed into social systems of valuation at the same time. Game of Thrones author George R. R. Martin famously uses the WordStar word processor to type his novels, citing both his familiarity (he has used WordStar 4.0 for decades) and its lack of autocorrect and spell check, saving much aggravation when writing about his fantasy world and languages (Team Coco 2014). Additionally, much like a 1950s corporation promoting the symbolic modernity of its “Potemkin village” mainframe, Martin clearly revels in the symbolic archaism of WordStar, saying “You know, I'm partly still the kind of guy that would like to tie messages to legs of ravens to get the word out. But I do use a computer. I have since 1982, so I'm adapting as best as I can to this new world” (Bishop 2013).

Drake (2016) offers another view onto the social dimensions of digital provenance. “It bears mentioning,” he writes, “that provenance emerged as a concept in the West at a time when most people were structurally if not legally excluded from ownership; ownership of their own bodies, minds, labor, property, and records.” Today, descendants of those formerly excluded from records creation are now able to create and share records through digital technologies:

From a social view, this expansion in digital technologies disrupts theoretical and practical applications of provenance for two separate but related reasons. First, we occupy a moment in history in which the largest percentage of the world’s population ever possesses the power and potential to author and create documentation about their lived experiences. Emanating from the first, the second disruption is that this increasing agency gives people and communities a chance to name themselves, a process essential to establishing provenance that was previously reserved for the archive and the bureaucratic and corporate entities that rely on it (Drake 2016).

For Drake, this makes consumer digital technologies radical—the “RadTech” of his title—in their transformation and expansion of record creation and keeping.

In the end, any aspect of a record has the potential to be imbued with social significance. This can be part of a digital record, such as the digital stamp “sent from my iPhone,” or it can be part of a non-digital output (or record) of a digital system. For example, in non-digital collections from the 1980s and 1990s, when many but not all records were created digitally, it is possible to determine which documents were created on typewriters and which were printed from computers, by looking for traces of perforation from the continuous stationery that would be fed into computer printers; or by looking at the distinctive letter shapes created by dot matrix printers. As with other evidence of materiality, such as the examples discussed by Rekrut (2005) or Dever (2019), the meaning of such traces can vary based on the context and content of the record itself. Digitally created records can have an “aura,” just as other records can. For instance, in my own research on early computing in Canada, I can attest to the emotional impact of encountering records that were created by the first digital computer in Canada, the Ferranti Mark 1 installed at the University of Toronto in 1952 and named Ferut (e.g., University of Manitoba Archives, Robert Bury Ferguson Fonds). Similarly, archivists at Library and Archives Canada, Winnipeg Office have on several occasions responded to film crews who want to shoot footage of original records of Mincome, Canada’s ground-breaking experiment with providing a minimum income to Manitobans living in Winnipeg and Dauphin, a project that relied on minicomputers in Winnipeg and Ottawa. The trailer for one documentary pans down an aisle crowded with archival boxes that presumably hold the raw data and printouts of the Mincome minicomputers (Big Sky 2019).

The social dimension of digital provenance, then, like the technical and social-technical dimensions, is bound up with the histories, social and cultural value, and the uses of particular computer systems as much as with the records themselves.

Part II: Digital provenance and archival work

As I noted at the outset, digital provenance is not a new kind of provenance; it is not a reinvention of the concept. It is rather a subset: it is the application of provenance to a particular technology of records creation and keeping. Any medium of inscription and record keeping could be analyzed for its technical, social-technical and social dimensions. Nonetheless, archival work with digital records requires this kind of detailed and precise understanding of digital provenance at this point in time.

Cloonan (2010) offers a brief history of paper making and keeping, which is useful here as a kind of parable. Cloonan describes paper as a robust but expensive medium in the eighteenth century and earlier; as inexpensive but chemically volatile in the nineteenth century; and finally, by the end of the twentieth century, as inexpensive, robust and chemically stable. She points out that “durable/permanent papers began to be manufactured in the 1960s” but the key to universally inexpensive and chemically stable papers was when “conservators, librarians, archivists, publishers, and paper manufacturers” got together in the 1980s to create international standards (p. 79). Nineteenth century paper was available in a range of formulations that varied by cost, strength and chemical composition. Keeping these papers over the long-term requires a range of interventions, such as careful environmental and light controls, segregation from other papers, backing, encapsulation and chemical deacidification. Keeping twenty-first century papers, on the other hand, is relatively simple: environmental controls that keep humans happy are likely to be fine for even cheap papers.

Digital formats and media of the past fifty years—the era of personal computing—are comparable to nineteenth century papers. Keeping these born digital records requires a range of interventions that vary based on the peculiarities of each format. This work is complex and technical, and requires that archivists think through the specifics of digital provenance to capture those aspects of context, content and interactivity that express what makes the various formats and records understandable and valuable. And of course, just as with the archivist or conservator who chemically deacidifies or encapsulates nineteenth century paper documents to control their ongoing degradation, every intervention by a digital archivist today, as when a record is copied from portable media or failing hard drives, or converted into a new format, is itself part of the history of the record, and therefore part of the ongoing, unfolding story of its provenance. Next, I explore how digital provenance affects the four principal archival functions.
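To make this concrete, the chain of custody for a single migration event might be documented as in the following sketch. This is a minimal illustration, not a standard: the field names, the converter name and the sample bitstreams are all hypothetical, though the underlying idea—fixing the record's state before and after an intervention with checksums—is common digital preservation practice.

```python
import hashlib
from datetime import datetime, timezone

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of a bitstream."""
    return hashlib.sha256(data).hexdigest()

def record_migration_event(original: bytes, migrated: bytes,
                           source_format: str, target_format: str,
                           tool: str) -> dict:
    """Log one preservation intervention as part of the record's provenance.

    Checksums of both bitstreams fix the record's state before and after
    migration, so the intervention becomes documented history rather than
    a silent alteration.
    """
    return {
        "event_type": "format migration",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_format": source_format,
        "target_format": target_format,
        "tool": tool,
        "source_checksum": sha256_of(original),
        "outcome_checksum": sha256_of(migrated),
    }

# Hypothetical example: migrating a WordStar file to PDF/A.
event = record_migration_event(
    original=b"WordStar 4.0 bitstream ...",
    migrated=b"PDF/A rendition ...",
    source_format="WordStar 4.0 document",
    target_format="PDF/A-2b",
    tool="hypothetical-converter 1.0",
)
```

Each such event record would accumulate alongside the record itself, so that the "ongoing, unfolding story" of its provenance remains legible to future archivists and users.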

Appraisal

Cook noted many times that appraisal and acquisition are separate and distinct processes. In his first major essay on appraisal in 1992, in which he laid out the principles of macroappraisal, he wrote “Practical and preservation concerns may also change an initially positive appraisal decision. Such issues include data readability, fragility of originals, the possibility and expense of media conversion, space availability, and storage, conservation, and processing costs” (Cook 1992, p. 132). In this he closely followed Naugler’s 1984 UNESCO report The Archival Appraisal of Machine-Readable Records. While it may be tempting to assess the value of records “regardless of format” (Bailey 2007), Naugler describes a range of factors affecting the long-term costs and stability of digital formats and media. He recommends two rounds of appraisal: first a traditional appraisal process (which for Naugler, at the Public Archives of Canada in the 1970s and ‘80s, was Schellenberg’s structural–functional, taxonomy-driven approach), followed by a technical appraisal that considered the specifics of data structure, software and system dependencies, metadata requirements and so on. Naugler, and Cook after him, noted that the costs and complexities of digital preservation do and should affect appraisal decisions, since both acknowledged that archival value is not absolute, but must be judged relative to the mandate and resources of a particular archives. Naugler’s process of double appraisal remains essential in appraising digital records today (Mumma et al. 2011).

The technical and social-technical dimensions of digital provenance are particularly important in making these sorts of judgements. Understanding system dependencies, requisite tacit knowledges and exploring the communities of computing that used the original systems to create the records can make visible some of the costs and complexities of acquiring specific digital records, and either keeping them in their original formats or migrating them into new formats.

Nor is the social dimension irrelevant. Among the earliest digital records at Library and Archives Canada are the first space data sets derived from the Alouette I satellite. Launched in 1962, Alouette I “marked Canada's entry into the space age,” according to the Canadian Space Agency (2018), making Canada only the third nation, after the US and USSR, to build and launch a satellite. The transfer of Alouette I data to the Public Archives was similarly noteworthy, being the “first formal accession of magnetic tape, comprising the trackings of Alouette I from its launching to 1965” (Public Archives of Canada 1971). This data, as content, may well have been superseded since by more sensitive and advanced instruments than those on the Alouette I—but the symbolic value of the original data transferred from the Space Agency to the Public Archives remains. The social dimension of digital provenance is not limited to the nation-building aspirations of national archives. Drake’s musings on the expansion of records creation through widespread access to computing (Drake 2016), discussed above, point toward the symbolic, historical and archival value of records, for example, from #IdleNoMore (Kino-nda-niimi Collective 2014) and Black Twitter (Parham 2021a,b).

Preservation

Digital preservation requires a deep understanding of digital provenance, in all of its dimensions. Yeo (2010) notes that there is not, and never can be, a singular detailing of significant properties for any digital format. One person may find only the content of a record significant; others may see significance in the layouts, fonts and colors of the record as rendered in its original software; still others may see significance in understanding the exact processes of booting up the original system and opening the original file. Not every archives has the mandate or resources to meet these different understandings of significance, but Yeo’s point is simply that while perhaps we must identify specific significant properties to preserve and make available, we should not deceive ourselves that these are “significant” in any absolute or final way, or that the attributes and functionalities that are not deemed “significant” have no value simply because they are not preserved. In this way, Yeo aligns his discussion of significant properties with earlier discussions of archival value. Like Cook (1992), who maintained that archival value is not a binary value but an infinite gradation in which every record ever created holds archival value for some user, Yeo maintains that all properties are significant to someone. Their designation is, in effect, an act of appraisal. Therefore, as in the function of appraisal, all dimensions of digital provenance must be taken into account to determine not just whether digital records are preserved, but why, and what aspects of them are key to the value that has been found in them. For example, if the Alouette I space data was preserved simply as space data, without reference to its place in the history of Canadian modernity and nation-building, then the true value of the data would be lost.

Meanwhile, in the practical work of accessing digital records from obsolete media and formats while moving them through archival processing and into storage, the technical and social-technical dimensions of digital provenance come particularly to the fore. As discussed above, the technical dimension identifies factors like software, hardware or system dependencies, while the social-technical dimension includes various tacit knowledges and community-derived information about how systems were in fact used, which is not always exactly as their designers and manufacturers intended. All of this information is key to accessing digital records in the first place, quite apart from how they are then preserved. At present, the digital preservation orthodoxy is to preserve records through format management, by migrating content from one format to another. To do so, however, the records first must be accessed in their original formats, which may require a mixture of original systems, hard-won tacit knowledge, and retrocomputing.

As a result, preservation is the archival function in which digital provenance has been taken the most seriously up until now. In fact, the inclusion of “environments” in PREMIS 3.0 already points toward the creation of a shared, comprehensive repository of information about specific hardware, software and systems, simply to allow for the work of digital preservation to proceed (Dappert et al. 2013). Such a repository, crowdsourced from and shared among digital archivists, would represent an essential resource in documenting the most basic aspects of digital provenance.
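What might one entry in such a crowdsourced repository look like? The following sketch is entirely hypothetical—the field names are my own invention, not the PREMIS schema—but it suggests how documenting an environment (here, the Kaypro II discussed earlier) could capture technical dependencies and social-technical tacit knowledges side by side, and support basic lookups:

```python
# Hypothetical, simplified entry for a shared registry of computing
# environments. Field names are illustrative only, not drawn from PREMIS.
kaypro_entry = {
    "designation": "Kaypro II",
    "category": "hardware environment",
    "introduced": 1982,
    "operating_system": "CP/M 2.2",
    "software": ["WordStar 3.3"],
    "dependencies": ["Z80 processor", "5.25-inch floppy drive"],
    "tacit_knowledge": [
        "command-line navigation",
        "WordStar control-key sequences",
    ],
}

def can_render(record_requirements: set, entry: dict) -> bool:
    """Check whether a registry environment covers a record's stated
    software dependencies -- the kind of lookup a crowdsourced
    repository of environments could make possible."""
    available = {entry["operating_system"], *entry["software"]}
    return record_requirements <= available
```

Crowdsourcing entries like this from working digital archivists, as Dappert et al. (2013) suggest, would turn scattered, hard-won knowledge into a shared resource for documenting the most basic aspects of digital provenance.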

Representation

Niu (2015) notes that the “original order” of digital information can be understood in a few ways: as conceptual order, order in the user interface, or order in storage. Zhang (2012) sought to eliminate confusion by focusing on the user-created order of directories and filing systems. Regardless, the original order that users create on their digital systems cannot be understood without first understanding how systems constrain or enable orderings by users. Niu explains:

When files are copied across different operating systems and file systems, certain metadata attributes may be modified and thus disturb the original order. For example, Unix/Linux and Windows have different file-naming conventions. When copying files between these two systems, un-allowed characters may be deleted during the transfer (Niu 2015, p. 67).

These kinds of system-specific restrictions or attributes are the result of the design and construction of the original system, and therefore part of the technical dimension of digital provenance. How communities of users respond to them is part of the social-technical dimension.
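Niu’s example can be made concrete. Windows reserves a small set of characters that are legal in Unix/Linux filenames, and a transfer tool that silently strips them alters the very names on which a user-created order may depend. The following is an illustrative sketch of that behavior (real transfer tools handle reserved characters in various ways—stripping, substitution, or refusal):

```python
# Characters legal in Unix/Linux filenames but reserved on Windows.
# ('/' is the path separator on both systems, included for completeness.)
WINDOWS_RESERVED = set('<>:"/\\|?*')

def to_windows_name(unix_name: str) -> str:
    """Mimic a transfer tool that silently drops reserved characters,
    as Niu describes. The resulting renamings -- and possible
    collisions -- can disturb a creator's original order."""
    return "".join(ch for ch in unix_name if ch not in WINDOWS_RESERVED)

# Two distinct Unix filenames that collide after transfer:
a = to_windows_name("report:draft.txt")
b = to_windows_name("reportdraft.txt")
```

Two files that sorted apart, and were distinguishable, on the creator’s system thus become a single ambiguous name after transfer—exactly the kind of disturbance of original order that archivists must be able to recognize and document.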

Representation offers archivists the opportunity to document provenance and relevant aspects of context. It offers a space for archivists to flag for users aspects of format or media of creation and keeping that may be peculiar or specific to particular records. Current archival standards offer very little guidance on describing born digital records and do not address questions of digital provenance. The emerging ICA standard Records in Contexts (EGAD 2023) offers more flexibility than older standards like Canada’s RAD, but there has not yet emerged a body of interpretation and practice to guide archivists in describing born digital records specifically. It would appear to be necessary, at minimum, to document aspects of digital provenance that restrict the record creator’s choices in terms of record creation, ordering and keeping, including native system functionalities, policy-based system-level restrictions and digital rights management. How much of this would end up in archival descriptions, specifically, could vary based on how relevant it is deemed to the value of the records being described (always subjective, and relative to the mandate and resources of the archival institution) and the way that access is provided. If records are retained primarily for their symbolic value, or other aspects of their social provenance, then this would likely be identified in their description. Similarly, if special systems, emulations or equipment are required to render the records, this should be identified in their description along with the tacit knowledges (such as navigating via arrow keys and the command line) required to operate archaic hardware and software.

Access

Access has been a driving force in digital archiving, a central principle, with digital preservation often defined as long-term access. As archivists continue to explore the significant properties of various formats and even specific record sets, we are moving toward a consensus that values access to records in both migrated and original formats—migrated to allow for ease of access on contemporary systems, and original formats to allow insight into the original presentation of the records and their context of creation. This is, in fact, the solution proposed by Carroll et al. (2011) in preserving and making available the very high-value records of Salman Rushdie. Whether following format management or keeping records in their original formats, digital provenance provides necessary context. For format-migrated records, digital provenance information is essential for users to understand the original context in which the records were created, circulated, managed and used. For records in original formats, digital provenance offers information that may be necessary to render the records and navigate them.

Providing both options is becoming more necessary with each passing year. Skødt (2024) fears that the Danish National Archives’ decision to preserve born digital records only in migrated formats threatens the archives’ claims to hold original and authentic records, picking up on concerns also raised by Rothenberg (1995) and Yeo (2011). Moreover, as emulation-based solutions continue to develop—including the Internet Archive’s practical demonstration of emulation-on-the-fly and an increasing variety of cloud-based providers of emulation-as-a-service—user access to original records rendered through emulation is an increasingly practical and economical possibility (Roy 2019).

Nonetheless, access to original records through emulation does not address the challenge that contemporary users would have working with original systems. Contemporary users are unlikely to have the tacit knowledge necessary to use the command-line interface and keystroke commands of MS DOS applications, and may find themselves frustrated and disoriented when applications do not behave in ways that we have become accustomed to today. Access to original versions must be possible, but for most users it is sufficient and even preferable to access format-migrated versions, ideally accompanied by metadata, screencasts, narrative descriptions and other resources documenting the original computing infrastructure. This documentation, ideally, would be shared among archivists and users through an open access repository modeled after Wikipedia or GitHub.

Conclusions

Technology is part of culture. This was true of the calendar wheels of the Mayans and the be-finned Cadillacs of the “Greatest Generation”—and IBM mainframes of the 1960s, MS DOS desktop clones of the 1990s, and iPhones of the 2010s. Technologies of record creation are part of the provenance of records. This holds for both the technical and the cultural aspects of records creation. Nesmith (2006) observed this in relation to birch bark journals and residential school photographs; it is also true of any email, report or text message. Just as technology is part of culture so, then, is the act of record creation culturally contingent.

Digital provenance supplements other forms of provenance. Digital records, as much as non-digital, form many bonds of provenance at creation due to a range of functional and non-functional factors (Bak 2012). Digital provenance is not a distinct type of provenance. Rather, digital provenance is a way of talking about how technologies of records creation and keeping inform or “infuse” provenance (Nesmith 2006, p. 352). Moreover, digital records and recordkeeping are not a special case: digital provenance is an example of technological provenance, which can be traced to any technology of record creation, digital or non-digital. All record-creating and keeping technologies infuse all other aspects of provenance, since technology is an aspect of culture, and since it is through the records that we come to know the culture. Through a focus on digital technologies, then, this article offers a working out of how we might think about and respond to technologies of records creation and keeping within provenance, in archival theory and practice.

In the past, archivists have relied on users to educate themselves on, say, cultures (and technologies) of letter writing in the eighteenth century, or scrapbooking in the nineteenth, so that they could appropriately interpret archival records. This worked when our users were predominantly academics. As the population of archival users has grown, and its center of gravity has shifted from historians and history students to society in general, archives must provide more information about technological (and other) contexts to make records understandable to and useable by a greater range of users. Sweeney (2014), an archivist, experienced this firsthand, interacting with online commenters responding to her presentation of digitized séance photographs from the early twentieth century:

they did not realize what an archives is and what one might expect from an archives, that is not Photoshopped photos. I was astonished at the visual illiteracy of some of the commenters. There were people who were not even aware of the relative age of the photos. They could not draw any conclusions from either what the people were wearing or the look of the photographs (Sweeney 2014, p. 29).

This is part and parcel of what Yakel (2011) calls the “Second Great Opening” of archives, a digitization- and genealogy-driven bonanza of new users, most of whom lack academic training in history and archives, and who require a new generation of online resources and tools to help them understand not only “what an archives is” but past eras and technologies of records creation and keeping as well.

Arguably, this is even more urgent for digital technologies of records creation, which have had a very fast cycle of currency and obsolescence, than it has been for more stable forms of records creation, such as pen and paper. Faced with the command-line interface of an operating system like MS DOS or CP/M, it makes little difference if the would-be user is a “digital native” or a senior citizen. Without appropriate guidance, a system of emulation that requires users to interact with obsolete operating systems and applications will be useless, while a system of format migration that makes no note of how these records were created obscures an important part of their context of creation. Moreover, as digital technologies continue to evolve in ways that incorporate ML and AI, it will become increasingly important to establish benchmarks for the difference (for example) between the meaning of authorship relative to a word-processed file from the 1990s versus the 2020s versus the 2050s. The suggestion by Dappert et al. (2013) for an online, crowdsourced registry of past systems, operating systems and applications, to support the Environment entity in PREMIS, would be a good starting point.

Digital archivists must operate obsolete computer hardware, software and systems to access content. More than this, they need to understand the social and cultural meanings of technology, as well as the technical capacities and possibilities of different technologies, in the archiving-present, in the record-creating, -managing and -using past, and into the record-accessing future, in order to adequately assess, represent and make known the archival value of records, relative to the mandate and resources of their institutions. The meaning of the records, literally and symbolically, is wrapped up in their systems of creation and their digital provenance. These meanings must be expressed in descriptions or other metadata to ensure that future users of the records are informed of the ways that specific digital technologies have affected the content, meanings and uses of specific records.

Digital provenance provides a framework for archivists to ponder, explore and document various aspects of how the capacities, limits, operations and larger symbolic meanings of digital systems shape our perceptions of the world, and our activities in the world. The model presented here recognizes and values several influences on digital records, including those of system designers and constructors, communities of computing and society at large. It reaffirms the suspicion of Nietzsche, who began to see his modes and habits of writing change after he started using the stunningly beautiful Malling-Hansen Writing Ball typewriter in 1882, in response to his encroaching blindness (Berry and Ribicki 2012). “Our writing tools,” he wrote to a friend, “are also working on our thoughts.”