1 Introduction

In this paper we consider a decentralized AI pipeline with multiple independent organizations, wherein one set of organizations specializes in curating high-quality datasets from independent data sources, another set specializes in training models from the curated datasets, and yet another set deploys the trained models and provides them as a service to model consumers. A typical decentralized AI pipeline is shown in Fig. 1. The core assets, such as datasets and models, represent significant intellectual property for their respective owners. Therefore, it is essential for the asset owners to ensure the confidentiality of their assets beyond the intended usage. On the other hand, since the model consumers are likely to use the models to drive major decisions, they would like to ensure auditability and integrity of the models by (i) verifying the provenance and performance of the models on benchmark datasets and (ii) ensuring that the predictions from the deployed service match those of the verified model. In summary, decentralized AI pipelines need to provide end-to-end provenance while ensuring the confidentiality of the different assets.

Fig. 1. Typical decentralized AI pipeline.

Consider an example of deciding on mortgage applications using an AI service. A data service provider, SP, provides high-quality training and benchmark datasets by curating historical mortgage data from reputationally trusted financial institutions. A specialized fintech company, FC, trains and deploys an AI model as a service for the given task. Further, it makes a public claim about the model's performance on a benchmark dataset. Note that establishing provenance of the model training carried out by FC is not addressed in this work. A financial institution, CONS, wanting to use AI in its mortgage approval process would want to independently verify the claim made by FC before deciding to subscribe to the service. If CONS is satisfied after the verification process, it might use the deployed service to make decisions on mortgage applications. At this point, CONS and individual mortgage applicants should be able to independently verify that the predictions from the deployed service match those of the verified model. The reputationally trusted data owners and FC would like to protect the confidentiality of their assets except from those actors who are entitled to access them. We highlight a special and important requirement of FC: to prevent model re-engineering attacks, FC would like to ensure that the model verifier does not learn the predictions of the model on individual instances during the verification process.

We present significant progress towards an efficient and scalable approach to providing public verifiability for common operations in an AI pipeline, while preserving the confidentiality of the data and model assets involved. In this paper we highlight a few primitive operations, but more operations on both data and models can be added as the state of the art improves. While it is difficult to match the expressiveness of plain-text computation, our methods can nevertheless provide provenance over simpler pipelines.

1.1 Related Work

While there is no prior work that addresses all the aspects of a verifiable decentralized AI pipeline as introduced in this paper, there are past works that address different aspects of the overall requirements. The provenance requirement is addressed in [19, 21], the model verification or certification requirement is addressed in [15, 22], and verifiable inference from a private model is addressed in [11, 14, 18, 23, 28]. Our work is of independent interest to the field of Verifiable Computation (VC), as it provides more efficient methods for useful computational primitives such as read-only memory (ROM) access and operations on datasets. We briefly review and contrast the relevant literature with our work.

Provenance Models for AI: There has recently been considerable interest in the provenance of AI assets. For instance, [19, 21] provide good motivation and a DLT-based architecture for establishing provenance of AI assets. The provenance is enabled by recording the cryptographic hash of each asset on a tamper-proof ledger and recording any operations on them as transactions. While this provides auditability and lineage of an asset, its verification necessarily involves revealing the assets, thereby violating the confidentiality requirements in our setting. We build on tools from verifiable computation to enable verifiability of assets and operations on them while supporting all the stated confidentiality requirements.

Model Certification for AI: Training and testing AI models for fairness and bias is an area of active research. Recently, efforts have been made to leverage methods from secure multiparty computation (MPC) to enable fair training and certification of AI models while ensuring privacy of sensitive data of the participants [15, 22]. These methods require a trusted party (e.g. a regulator) to certify the claims on the models and therefore do not support the public verifiability requirement in our setting.

Verifiable Model Inference: The problem of verifying the predictions from private AI models, under different privacy requirements, has been considered in the literature. For instance, verifiable execution of neural networks has been considered in [14, 18, 23, 27] and verification of predictions from decision trees has been considered in [28]. These works cannot be extended to end-to-end pipeline verification as they cannot handle verification of operations on datasets. In our work, apart from providing verification for the entire AI pipeline, we improve upon the work of [28] by making the verification of decision tree inference more scalable, as described in Sect. 1.2.

Reusable Gadgets for VC: On the technical front, our work complements persistent efforts such as [16, 25] to enable more computations efficiently in the VC setting. The problem of efficiently supporting addressable memory inside VC circuits has received considerable attention [3, 5, 16, 25, 31], as many computations are best expressed using the abstraction of memory. The methods in the aforementioned efforts support arbitrary zero-knowledge Succinct Arguments of Knowledge (zkSNARKs). We provide a more efficient variant of prior methods by leveraging a zkSNARK with commit and prove capability (see Sect. 3). This restriction is not a major hindrance, as many efficient zkSNARKs can be modified to be commit and prove with negligible overhead (see [8]). Our efficient abstractions for read-only memory (ROM) and datasets can be incorporated into zkSNARK circuit compilers such as ZoKrates [10], when suitably targeted for a commit and prove backend. In particular, supporting datasets as first-class primitives in zkSNARK compilers will make them more attractive for privacy preserving data science applications. Finally, we mention that the work on Verifiable Outsourced Databases (e.g. [29, 30]) is not directly applicable here as (i) current implementations do not address data confidentiality and (ii) they do not support reusable representation of datasets across computations.

1.2 Our Contributions

We present the first efficient and scalable system for decentralized AI pipelines with support for the confidentiality concerns of the asset owners (as described in Table 2) and public verifiability. Our work represents major system-level innovations over prior work in model certification ([15] lacks public verifiability and provenance), provenance architectures for AI artifacts ([19, 21] lack privacy), and confidentiality preserving model inference ([14, 23, 28] lack provenance). A number of technical contributions enable this system-level novelty; they are summarized as follows.

  • Improved method for read-only memory access in arithmetic circuits with an order of magnitude gain in efficiency over the existing methods (see Table 3). The improved memory access protocol is crucially used in realizing efficient circuits for data operations (inner-join) and decision tree inference.

  • A method for consistent modeling of datasets in arithmetic circuits with complete privacy. In addition, we design efficient circuits to prove common operations on datasets. We make several optimizations over the basic approach of using zkSNARKs, resulting in at least an order of magnitude gain in efficiency (see Table 4). On commodity hardware, our implementation scales well, proving operations on datasets with up to 1 million rows in a few minutes. Verification takes a few hundred milliseconds.

  • We present an improved protocol for privacy preserving verifiable inference from decision trees. Our method yields up to ten times smaller verification circuits by avoiding the expensive one-time hashing of the tree used in [28]. Further, by leveraging our method for read-only memory access, we also incur fewer multiplication gates per prediction (see Sect. 5 for more details). Comparative performance under different settings is summarized in Table 5.

  • We implement our scheme using Adaptive-Pinocchio [24] and experimentally evaluate its efficacy. We report the results in Sect. 6. Our scheme can also be instantiated with other CP-SNARKs.

Our implementation uses pre-processing zkSNARKs [5, 9, 13, 20] which pre-process a circuit description to make subsequent proving and verification more efficient. Our circuits can also be used with generic zkSNARKs such as those in [2, 4, 7], suitably augmented with commit and prove capability.

Table 1. Performance of our dataset operations. Concrete numbers are reported for \(N=100K\) rows and element bit-width \(b=32\).

2 Verifiable Provenance in Decentralized AI Pipelines

A typical AI pipeline consists of different steps, such as accessing raw datasets from multiple sources, performing aggregation and transformations to curate training and testing datasets for the AI task at hand, developing the AI model, and deploying it in production. We are interested in settings in which the AI pipeline is decentralized, i.e., different steps of the pipeline are carried out by different independent actors. We assume five different types of actors: data owners (DO), data curators (DC), model owners (MO), model certifiers (MCERT), and model consumers (MCONS). For brevity of exposition, we assume that the number of data curators, model owners, model certifiers, and model consumers is just one. However, all the concepts and results extend in a straightforward manner to the general setting involving multiple entities of each type.

We assume that there is a task \(\texttt{T}\) for which the process of building an AI pipeline is undertaken in a decentralized setting. The salient features of our provenance and certification framework are summarized as follows.

There are m data owners \(\texttt{DO}_1, \texttt{DO}_2, \ldots , \texttt{DO}_m\) who share their respective raw datasets \(D_1, D_2, \ldots , D_m\) privately with the data curator \(\texttt{DC}\) and also make a public commitment to the datasets. The data curator curates a dataset \(D_{b} = f(D_1, D_2, \ldots , D_m)\) for the purpose of benchmarking the performance of an AI model for the task \(\texttt{T}\) and makes a public commitment to \(D_{b}\). We assume the model owner, \(\texttt{MO}\), has a pre-trained AI model M and wants to offer it as a service. \(\texttt{MO}\) makes a public commitment to the model. \(\texttt{MO}\) buys the benchmark dataset \(D_{b}\) from \(\texttt{DC}\). \(\texttt{MO}\) wishes to convince potential consumers of the utility of the model M by making a performance claim \(accuracy = score(M, D_b)\) when M is used for getting predictions on the dataset \(D_b\). The model certifier, \(\texttt{MCERT}\), should be able to independently verify the provenance of all the steps and the claimed performance of the model M. \(\texttt{MCERT}\) also checks that the timestamp of the public commitment of the model M precedes the timestamp of the public commitment of \(D_b\), which ensures that the model M cannot have been overfitted to the dataset \(D_b\). \(\texttt{MCERT}\) certifies the model M only after verifying the correctness of the claim. The model consumer, \(\texttt{MCONS}\), subscribes to the model M only upon its successful certification. Suppose \(\texttt{MCONS}\) supplies a valid input \(D'\) to the service provided by \(\texttt{MO}\) and gets a prediction \(Y'\). We require that \(\texttt{MCONS}\) should be able to independently verify that the prediction \(Y'\) matches the prediction of the committed model M on the instance \(D'\).

We observe that the outlined requirements ensure that the decentralized AI pipeline is transparent. The key question we address in this paper is that of providing such transparency while satisfying the confidentiality requirements of all the actors. We assume that none of the actors in the setup have any incentive to collude with the others, but they can act maliciously. The privacy requirements and security model of the different actors are summarized in Table 2.

Table 2. Summary of privacy requirements and trust assumptions in our setting.

We present a provenance framework which ensures trust in the AI pipeline by proving each computation step using zero-knowledge proofs, thus meeting all the confidentiality requirements captured in Table 2. Below, we present a concrete example of an AI pipeline for establishing fairness of an AI model, where we clearly highlight the involvement of the various actors.

2.1 Decentralized Model Fairness

Increasingly, AI models are required to be fair (i.e. non-discriminating) with respect to protected attributes (e.g. Gender). There are several metrics used to evaluate a model for fairness. For the sake of illustration, we choose the popular metric called predictive parity, which requires a model to have similar accuracy for different values of the protected attribute. In our specific example, the goal is to show that for a binary classification model M we have:

$$\begin{aligned} \big \vert \textrm{Pr}[M(\boldsymbol{x})=y\,|\, \textsf{Gender}(\boldsymbol{x})=\texttt{M}] - \textrm{Pr}[M(\boldsymbol{x})=y\,|\, \textsf{Gender}(\boldsymbol{x})=\texttt{F}] \big \vert \le \varepsilon \end{aligned}$$

where \((\boldsymbol{x},y)\sim \mathcal {D}\) for a representative distribution \(\mathcal {D}\). We may estimate the above metric empirically on a test dataset T consisting of samples \(\{(\boldsymbol{x}_i,y_i)\}_{i=1}^n\). For concreteness, let M be a decision tree model developed by model owner MO to be used by financial institutions for approving home mortgage loan applications. Let \(D_1\) and \(D_2\) be two private datasets consisting of loan applications, owned by financial institutions \(\texttt{DO}_1\) and \(\texttt{DO}_2\) respectively. A data curator DC curates the dataset T by concatenating (row-wise) the datasets \(D_1\) and \(D_2\), and further generates datasets \(T_M\), \(T_F\) consisting of the applications with male and female applicants respectively. Finally, the model owner MO obtains the datasets \(T_M\) and \(T_F\) and computes the accuracy of its model on the respective datasets. In Fig. 2, the top left code block shows the operations executed by different actors in the pipeline without verifiability. The remaining code blocks show operations performed by actors in a verifiable pipeline. The asset owners publicly commit their private assets (bottom left) and generate proofs to attest correctness of their operations on assets (top right). Finally, a verifier (e.g. an auditor) uses the published commitments and proofs to establish the correctness of the steps performed by the respective actors in the pipeline (bottom right).
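For intuition, the plaintext (unverified) predictive-parity estimate that the pipeline ultimately certifies can be sketched as follows. This is only an illustrative sketch; the attribute names and the toy "model" are ours and are not part of the pipeline of Fig. 2.

```python
def predictive_parity_gap(predict, rows, label_key="approved", protected_key="gender"):
    """Empirical predictive-parity gap |acc(M) - acc(F)| on labelled rows.

    predict: callable mapping a feature dict to a predicted label
    rows:    list of dicts containing features, the label and the protected attribute
    """
    acc = {}
    for group in ("M", "F"):
        subset = [r for r in rows if r[protected_key] == group]
        correct = sum(predict(r) == r[label_key] for r in subset)
        acc[group] = correct / len(subset)
    return abs(acc["M"] - acc["F"])

# Toy usage with a trivial rule that approves incomes above 50.
rows = [
    {"income": 60, "gender": "M", "approved": 1},
    {"income": 40, "gender": "M", "approved": 0},
    {"income": 70, "gender": "F", "approved": 1},
    {"income": 30, "gender": "F", "approved": 1},
]
gap = predictive_parity_gap(lambda r: int(r["income"] > 50), rows)
assert abs(gap - 0.5) < 1e-9   # male accuracy 1.0, female accuracy 0.5
```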

Fig. 2. Example pipeline for certifying financial model for fairness.

3 Overview

This section provides an overview of the technical challenges in instantiating our solution. More detailed technical contributions appear in Sects. 4 and 5.

3.1 Building Blocks

Cryptographic Primitives: We use zkSNARKs as the main cryptographic tool to verify correctness of data operations and model inference while maintaining confidentiality of the respective assets. A zkSNARK consists of a triple of algorithms \((\textsf{G},\textsf{P},\textsf{V})\) where (i) \(\textsf{G}\) takes the description of a computation as an arithmetic circuit C and outputs public parameters \(\textsf{pp}\leftarrow \textsf{G}(1^\lambda ,C)\), (ii) \(\textsf{P}\) takes \(\textsf{pp}\) and a satisfying instance \((\boldsymbol{x},\boldsymbol{w})\) for C and outputs a proof \(\pi \leftarrow \textsf{P}(\textsf{pp},\boldsymbol{x},\boldsymbol{w})\), while (iii) \(\textsf{V}\) takes \(\textsf{pp}\), a statement \(\boldsymbol{x}\) and a proof \(\pi \) and outputs \(b\leftarrow \textsf{V}(\textsf{pp},\boldsymbol{x},\pi )\). The proof \(\pi \) reveals no information about the witness \(\boldsymbol{w}\), while an accepting proof \(\pi \) implies that the prover knows a satisfying assignment \((\boldsymbol{x},\boldsymbol{w})\) with overwhelming probability. A commit and prove zkSNARK (CP-SNARK) allows proving knowledge of a witness \(\boldsymbol{w}\) as before, where part of \(\boldsymbol{w}\) additionally opens a public commitment c, i.e. \(\boldsymbol{w}=(\boldsymbol{u},\boldsymbol{z})\) and \(\textsf{Open}(c)=\boldsymbol{u}\). A CP-SNARK specifies a commitment scheme \(\textsf{Com}\) and, like a zkSNARK, it provides algorithms \(\textsf{G},\textsf{P}\) and \(\textsf{V}\) for generating public parameters, generating proofs and verifying proofs respectively. Additionally, a CP-SNARK allows one to generate proofs over data committed using \(\textsf{Com}\) with negligible overhead in proof generation and verification.
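As a mental model, the interface we rely on can be sketched as follows; this is a minimal Python sketch with names of our own choosing, not the API of any particular library.

```python
from typing import Any, Protocol

class CPSnark(Protocol):
    """Abstract commit-and-prove zkSNARK interface: (G, P, V) plus Com."""

    def setup(self, circuit: Any) -> Any:
        """G: pre-process the arithmetic circuit C into public parameters pp."""

    def commit(self, u: list, randomness: int) -> Any:
        """Com: commit to a vector u of field elements."""

    def prove(self, pp: Any, x: list, c: Any, u: list, z: list) -> bytes:
        """P: prove knowledge of w = (u, z) satisfying C, where Open(c) = u."""

    def verify(self, pp: Any, x: list, c: Any, proof: bytes) -> bool:
        """V: accept iff the proof attests to a satisfying assignment."""
```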

Notation: We use the notation [n] to denote the set of natural numbers \(\{1,\ldots ,n\}\). We often use the array notation \(\boldsymbol{x}[\,i\,]\) to denote the \(i^{th}\) component of the vector \(\boldsymbol{x}\), with 1 as the starting index. We will denote the concatenation of vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\) as \(\llbracket \boldsymbol{x}, \boldsymbol{y}\rrbracket \). All our arithmetic circuits, vectors and matrices are over a finite field \(\mathbb {F}\) of prime order.

Circuits for Dataset Operations: To use zkSNARKs, we express operations on datasets as arithmetic circuits. At a high level, arithmetic circuits representing data operations accept datasets as their inputs and outputs. Since establishing provenance of an asset in an AI pipeline requires verifying operations over several related assets, we require a uniform representation of datasets across arithmetic circuits, which allows a dataset to be used as an input/output in different circuits. The second design constraint we enforce is that the arithmetic circuits be universal, i.e., the same circuit can be used to verify operations on all datasets within a known size bound. We need universal circuits for two primary reasons: (i) the sizes of datasets are considered confidential and must not be inferable from the circuits being used, and (ii) the circuits can be pre-processed to yield efficient verification, which is a frequent operation in our applications.

Dataset Representation in Circuits: As we use the same circuit to represent operations over datasets of varying sizes, we first describe a uniform representation of datasets which can be used within the arithmetic circuits. Let N denote a known upper bound on the size of input/output datasets. We view a dataset as a collection of its column vectors (of size at most N). We encode a vector of size at most N as an \(N+1\) size vector \(\llbracket s, \boldsymbol{X}\rrbracket \) where \(\boldsymbol{X}=(\boldsymbol{X}[1],\ldots ,\boldsymbol{X}[N])\). In this encoding s denotes the size of the vector, \(\boldsymbol{X}[1],\ldots ,\boldsymbol{X}[s]\) contain the s entries of the vector, while \(\boldsymbol{X}[i]\) for \(i>s\) are set to 0. Similarly, a dataset is encoded by encoding each of its columns separately.
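A minimal plaintext sketch of this padding-based encoding (field arithmetic omitted for brevity):

```python
def encode_vector(x, N):
    """Encode a vector of size at most N as the (N+1)-length vector [s, X]."""
    s = len(x)
    assert s <= N, "vector exceeds the size bound N"
    return [s] + list(x) + [0] * (N - s)   # unused slots are padded with 0

def encode_dataset(columns, N):
    """Encode a dataset column by column."""
    return [encode_vector(col, N) for col in columns]

# Example: a 3-element column encoded with bound N = 6.
assert encode_vector([7, 9, 4], 6) == [3, 7, 9, 4, 0, 0, 0]
```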

Dataset Commitment: Let \(\textsf{Com}\) be a vector commitment scheme associated with a CP-SNARK \(\textsf{CP}\). We additionally assume that \(\textsf{Com}\) is homomorphic. To commit a vector \(\boldsymbol{x}\), we first compute its encoding \(\overline{\boldsymbol{x}}\) as a vector of size \(N+1\), and then compute \(c=\textsf{Com}(\overline{\boldsymbol{x}},r)\) as its commitment. Here r denotes the commitment randomness. To commit a dataset \(\boldsymbol{D}\) with columns \(\boldsymbol{x}_1,\ldots ,\boldsymbol{x}_M\), we commit each of its columns and take \(\boldsymbol{c}=(c_1,\ldots ,c_M)\), where \(c_i=\textsf{Com}(\overline{\boldsymbol{x}}_i,r_i)\), as the commitment. Using our circuits with the CP-SNARK \(\textsf{CP}\) allows us to efficiently prove operations over committed datasets.
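For concreteness, a homomorphic vector commitment in the Pedersen style can be sketched over a toy multiplicative group as below. This is an assumption-laden illustration: real instantiations use a cryptographically sized group (typically an elliptic curve) tied to the CP-SNARK, and the tiny modulus and generators here are for demonstration only.

```python
import random

# Toy parameters: safe prime p = 2q + 1 with a subgroup of prime order q.
P = 2039          # illustrative only; real systems use ~256-bit groups
Q = 1019

def setup(n, seed=42):
    """Sample n + 1 generators of the order-q subgroup (squares mod p)."""
    rng = random.Random(seed)
    gens = []
    while len(gens) < n + 1:
        g = pow(rng.randrange(2, P), 2, P)
        if g != 1:
            gens.append(g)
    return gens   # gens[0] hides the randomness, gens[1:] bind the entries

def commit(gens, vec, r):
    """c = h^r * prod_i g_i^{x_i} mod p, homomorphic in (vec, r)."""
    c = pow(gens[0], r % Q, P)
    for g, x in zip(gens[1:], vec):
        c = c * pow(g, x % Q, P) % P
    return c

# Homomorphism: Com(v1, r1) * Com(v2, r2) = Com(v1 + v2, r1 + r2).
gens = setup(4)
v1, v2 = [1, 2, 3, 4], [5, 6, 7, 8]
lhs = commit(gens, v1, 11) * commit(gens, v2, 13) % P
rhs = commit(gens, [a + b for a, b in zip(v1, v2)], 24)
assert lhs == rhs
```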

3.2 Optimizations

We now highlight optimizations that are pivotal to the scalability of our system:

Mitigating Commitment Overhead: To prove statements over committed values using general zkSNARKs, one typically needs to compute the commitment as part of the arithmetic circuit expressing the computation. This introduces substantial overhead when the amount of data to be committed is large. To avoid this, we use a CP-SNARK and its associated commitment scheme. We instantiate our system using Adaptive-Pinocchio [24] as the CP-SNARK. Adaptive-Pinocchio augments the popular Pinocchio [20] zkSNARK with commit and prove capability. The resulting scheme incurs \(\le 5\%\) overhead in proof generation time over Pinocchio, while verification continues to be efficient (\(\le \mathrm {400\,ms}\)) in practice. We expect similar savings with other CP-SNARK schemes, and thus our constructs are agnostic to the choice of CP-SNARK.

Circuit Decomposition: For some operations, verification is more efficient when decomposed into two or more circuits than when encoded as a monolithic circuit. Let \(C(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w})\) be an arithmetic circuit which checks some property of \((\boldsymbol{x},\boldsymbol{u})\), where \(\boldsymbol{u}\) additionally opens the commitment c. Our decomposition takes the form \(C(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w})\equiv C_1(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_0,\boldsymbol{w}_1)\wedge C_2(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_0,\boldsymbol{w}_2)\) where \(\boldsymbol{w} = (\boldsymbol{w}_0,\boldsymbol{w}_1,\boldsymbol{w}_2)\) denotes a suitable partition of witness wires. Using a CP-SNARK, we let the prover provide an additional commitment \(c_0\) to the witness wires \(\boldsymbol{w}_0\) which are common to both sub-circuits. In our decompositions, we let \(C_1\) encode a relation that is easily verified by an arithmetic circuit and let \(C_2\) encode a relation which has a substantially cheaper probabilistic verification circuit, i.e., there exists a circuit \(\widetilde{C}_2(\alpha ,\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_0,\boldsymbol{w}_2)\) which takes an additional random challenge \(\alpha \) and has output identical to \(C_2\) with overwhelming probability (over the random choice of \(\alpha \)). In our constructions, the latter circuit verifies either the simultaneous permutation property or the consistent memory access property, which we introduce below. These are inefficient to check deterministically using arithmetic circuits but admit efficient probabilistic circuits.

3.3 Simultaneous Permutation

We say that tuples \((\boldsymbol{u}_1,\ldots , \boldsymbol{u}_k)\) and \((\boldsymbol{v}_1,\ldots ,\boldsymbol{v}_k)\) of vectors in \(\mathbb {F}^N\) satisfy the simultaneous permutation relation if there exists a permutation \(\sigma \) of [N] such that \(\boldsymbol{v}_i=\sigma (\boldsymbol{u}_i)\) for all \(i\in [k]\). We now describe a protocol to check the relation over committed vectors: i.e., given commitments \(\textsf{cu}_1,\ldots ,\textsf{cu}_k,\textsf{cv}_1,\ldots ,\textsf{cv}_k\), the prover shows knowledge of vectors \(\boldsymbol{u}_1,\ldots ,\boldsymbol{u}_k\) and \(\boldsymbol{v}_1,\ldots ,\boldsymbol{v}_k\) corresponding to the commitments which satisfy the relation. To achieve this, the verifier first sends a challenge \(\beta _1,\ldots ,\beta _k\) and challenges the prover to show that the \(\beta \)-linear combinations of the vectors, \(\boldsymbol{u}=\sum _{i=1}^k\beta _i\boldsymbol{u}_i\) and \(\boldsymbol{v}=\sum _{i=1}^k\beta _i\boldsymbol{v}_i\), corresponding to the commitments \(\textsf{cu}=\sum _{i=1}^k\beta _i\textsf{cu}_i\), \(\textsf{cv}=\sum _{i=1}^k\beta _i\textsf{cv}_i\), are permutations of each other. This is accomplished via a further challenge \(\alpha \leftarrow \mathbb {F}\) and subsequently checking \(\prod _{i=1}^N (\alpha -\boldsymbol{u}[\,i\,])=\prod _{i=1}^N (\alpha - \boldsymbol{v}[\,i\,])\). We describe the formal protocol and its analysis in Appendix C.1. The last computation can be expressed in an arithmetic circuit using O(N) multiplication gates, which is concretely more efficient than deterministic circuits for checking the permutation relation using routing networks [6, 26].
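Stripped of commitments, the probabilistic core of this check can be sketched in plaintext Python over a toy prime field (the modulus and the verifier's randomness below are illustrative):

```python
import random

P = 2**61 - 1   # toy prime field modulus

def grand_product(vec, alpha):
    """prod_i (alpha - vec[i]) mod P."""
    acc = 1
    for v in vec:
        acc = acc * ((alpha - v) % P) % P
    return acc

def simultaneous_permutation_check(us, vs, rng=random):
    """Probabilistic check that one permutation sigma maps every u_i to v_i."""
    k, n = len(us), len(us[0])
    betas = [rng.randrange(P) for _ in range(k)]
    # beta-linear combination collapses the k vector pairs into a single pair
    u = [sum(b * uj[i] for b, uj in zip(betas, us)) % P for i in range(n)]
    v = [sum(b * vj[i] for b, vj in zip(betas, vs)) % P for i in range(n)]
    alpha = rng.randrange(P)
    return grand_product(u, alpha) == grand_product(v, alpha)

# Example: the same permutation applied to two vectors.
u1, u2 = [3, 1, 4, 1], [5, 9, 2, 6]
sigma = [2, 0, 3, 1]
v1, v2 = [u1[i] for i in sigma], [u2[i] for i in sigma]
assert simultaneous_permutation_check([u1, u2], [v1, v2])
```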

Table 3. Comparison of Circuit Complexity for different ROM approaches. ZK and CP denote zkSNARK and CP-SNARK protocols. m and n denote number of reads and memory size respectively.

3.4 Consistent Memory Access

We define the consistent memory access relation for a triple of vectors \(\boldsymbol{L},\boldsymbol{U}\) and \(\boldsymbol{V}\) where \(\boldsymbol{L}\in \mathbb {F}^n\) and \(\boldsymbol{U},\boldsymbol{V}\in \mathbb {F}^m\) for some integers m, n. We say that \((\boldsymbol{L},\boldsymbol{U},\boldsymbol{V})\) satisfy the relation if \(\boldsymbol{V}[\,i\,] = \boldsymbol{L}[\,\boldsymbol{U}[i]\,]\) for all \(i\in [m]\). We think of \(\boldsymbol{L}\) as a read only memory (ROM) which is accessed at locations given by \(\boldsymbol{U}\), with \(\boldsymbol{V}\) being the corresponding values. We adapt the techniques in [3, 5, 25, 31] to take advantage of CP-SNARKs in our construction. Next, we present a protocol to check the relation given commitments to \(\boldsymbol{L},\boldsymbol{U}\) and \(\boldsymbol{V}\). The verification proceeds as:

  1. First, \(m+n\) sized vectors \(\boldsymbol{u}\) and \(\boldsymbol{v}\) are computed as follows. For the vector \(\boldsymbol{u}\) we require \(\boldsymbol{u}[\,i\,]=i\) for \(i\in [n]\) and \(\boldsymbol{u}[\,i+n\,]=\boldsymbol{U}[\,i\,]\) for \(i\in [m]\). For the vector \(\boldsymbol{v}\) we require \(\boldsymbol{v}[\,i\,]=\boldsymbol{L}[\,i\,]\) for \(i\in [n]\) and \(\boldsymbol{v}[\,i+n\,]=\boldsymbol{V}[\,i\,]\) for \(i\in [m]\) (see Fig. 3).

  2. The prover also supplies auxiliary vectors \(\tilde{\boldsymbol{u}}\) and \(\tilde{\boldsymbol{v}}\) of size \(m+n\), where \(\tilde{\boldsymbol{u}}\) and \(\tilde{\boldsymbol{v}}\) are purportedly obtained from \(\boldsymbol{u}\) and \(\boldsymbol{v}\) via the same permutation.

  3. Finally, we ensure that the vector \(\tilde{\boldsymbol{u}}\) is sorted and that the vector \(\tilde{\boldsymbol{v}}\) differs in adjacent positions only if the same is true for those positions in the vector \(\tilde{\boldsymbol{u}}\).

The constraints on the first n entries of vectors \(\boldsymbol{u}\) and \(\boldsymbol{v}\) in step (1) can be thought of as “loading” constraints that load the entries of \(\boldsymbol{L}\) against the corresponding addresses in memory, while the constraints on the last m entries can be thought of as “fetching” constraints that fetch the appropriate value against the specified memory location. Steps (2) and (3) ensure that the value fetched for a given location is the same as the value loaded against it during the initial loading steps. We decompose the above checks across two circuits. The first arithmetic circuit \(\textsf{C}_{\textsf{ROM}, {m},{n}}\) ensures steps (1) and (3), while the second circuit checks that the vectors \(\tilde{\boldsymbol{u}},\tilde{\boldsymbol{v}}\) are obtained by applying the same permutation to the vectors \(\boldsymbol{u},\boldsymbol{v}\) respectively. The circuit \(\textsf{C}_{\textsf{ROM}, {m},{n}}\) can be realized using \(O(m+n)\) multiplication gates. Generally, verifying that a vector such as \(\tilde{\boldsymbol{u}}\) is sorted in step (3) incurs a logarithmic overhead due to the need for bit decomposition of each element. However, we can leverage the fact that \(\tilde{\boldsymbol{u}}\) is a (sorted) rearrangement of \(\boldsymbol{u}\), which includes all elements of [n] by construction. Thus, monotonicity of \(\tilde{\boldsymbol{u}}\) is established provided (i) \(\tilde{\boldsymbol{u}}[\,1\,]=1\), (ii) \(\tilde{\boldsymbol{u}}[\,m+n\,] = n\) and (iii) \(\tilde{\boldsymbol{u}}[\,i+1\,]-\tilde{\boldsymbol{u}}[\,i\,]\in \{0,1\}\) for all \(1\le i\le m+n-1\), which together require \(O(m+n)\) gates to verify. Finally, we invoke the protocol for the “Simultaneous Permutation” property in Sect. 3.3 to check compliance of step (2). We illustrate the verification circuit and the decomposition in Fig. 3. The formal protocol and analysis appear in Appendix C.2. Overall we incur \(O(m+n)\) gates, which is more efficient than encoding the entire relation in one circuit; in that case one uses routing networks, which incur \(O((m+n)\log (m+n))\) gates and are concretely much more expensive. We can optimize further when the same access pattern is used for accessing different ROMs, as described below.
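A plaintext analogue of this check, with the prover's permutation replaced by an explicit sort and with commitments omitted, can be sketched as follows (addresses are 1-indexed as in the protocol):

```python
def consistent_memory_access_check(L, U, V):
    """Plaintext analogue of the ROM-consistency check.

    L: read-only memory of size n (address a holds L[a-1])
    U: access pattern of size m, with addresses in [n]
    V: claimed values, purportedly V[i] = L[U[i]-1]
    """
    n, m = len(L), len(U)
    # Step 1: "loading" pairs (address, value) followed by "fetching" pairs.
    u = list(range(1, n + 1)) + list(U)
    v = list(L) + list(V)
    # Step 2: the prover supplies (u~, v~), the same permutation of (u, v);
    # here we obtain it by a stable sort on the address component.
    pairs = sorted(zip(u, v), key=lambda p: p[0])
    u_t = [a for a, _ in pairs]
    v_t = [b for _, b in pairs]
    # Step 3: addresses increase in unit steps, and values change only
    # where the address changes.
    if u_t[0] != 1 or u_t[-1] != n:
        return False
    for i in range(m + n - 1):
        if u_t[i + 1] - u_t[i] not in (0, 1):
            return False
        if u_t[i + 1] == u_t[i] and v_t[i + 1] != v_t[i]:
            return False
    return True

# Example: memory of size 4 accessed three times.
L_mem = [10, 20, 30, 40]
U_acc = [3, 1, 3]
V_val = [L_mem[a - 1] for a in U_acc]   # [30, 10, 30]
assert consistent_memory_access_check(L_mem, U_acc, V_val)
```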

Multiplexed Memory Access. For an access pattern \(\boldsymbol{U}\in \mathbb {F}^m\) and ROMs \(\boldsymbol{L}_j\in \mathbb {F}^n\) for \(j\in [k]\), we can show the correctness of the lookup values \(\boldsymbol{V}_j[\,i\,] = \boldsymbol{L}_j[\,\boldsymbol{U}[i]\,]\), \(i\in [m], j\in [k]\), using just one instance of the protocol discussed in this section. To achieve this, the verifier sends a random challenge \(\alpha _1,\ldots ,\alpha _k\) to the prover. The prover then shows that \((\boldsymbol{L},\boldsymbol{U},\boldsymbol{V})\) satisfy consistent memory access where \(\boldsymbol{L}=\alpha _1\boldsymbol{L}_1+\cdots +\alpha _k\boldsymbol{L}_k\) and \(\boldsymbol{V}=\alpha _1\boldsymbol{V}_1+\cdots +\alpha _k\boldsymbol{V}_k\) for the uniformly sampled \(\alpha _1,\ldots ,\alpha _k\). Note that due to the homomorphism of the commitment scheme, both the prover and the verifier can compute the commitments for \(\boldsymbol{L},\boldsymbol{U}\) and \(\boldsymbol{V}\).
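Reusing the consistent_memory_access_check sketch above, the random-combination reduction can be illustrated as follows (again a plaintext sketch with an illustrative field modulus):

```python
import random

P = 2**61 - 1   # toy prime field modulus

def multiplexed_rom_check(Ls, Vs, U, rng=random):
    """Check V_j[i] = L_j[U[i]] for all j via one combined consistency check."""
    alphas = [rng.randrange(P) for _ in Ls]
    n, m = len(Ls[0]), len(U)
    L = [sum(a * Lj[i] for a, Lj in zip(alphas, Ls)) % P for i in range(n)]
    V = [sum(a * Vj[i] for a, Vj in zip(alphas, Vs)) % P for i in range(m)]
    return consistent_memory_access_check(L, U, V)

# Example: two ROMs of size 3 with a shared access pattern.
L1, L2 = [10, 20, 30], [7, 8, 9]
U = [2, 2, 3]
V1, V2 = [L1[a - 1] for a in U], [L2[a - 1] for a in U]
assert multiplexed_rom_check([L1, L2], [V1, V2], U)
```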

3.5 Our Techniques in Perspective

Commit and prove functionality in conjunction with zero knowledge proofs has been used in recent works addressing privacy in machine learning, most notably in [18, 27, 28]. In [18] and [28], CP-SNARKs are used to “link” proofs of correctness for different parts of the circuit (similar to the circuit decomposition in our setting) to prove inference from a private neural network and a decision tree respectively. In [27], public commitments are linked to a set of authenticated inputs between a prover and a verifier in a two party protocol. Subsequently, the prover produces a ZK proof showing correctness of neural network inference over the authenticated inputs. In contrast, our usage of CP-SNARKs is more pervasive. We first optimize key relations (simultaneous permutation, consistent memory access) for CP-SNARKs and then design our dataset representation in a way that allows us to express operations on datasets in terms of the aforementioned relations.

Fig. 3. Consistent memory access.

4 Privacy Preserving Dataset Operations

We now describe protocols for common dataset operations such as aggregation, filter, order-by and inner-join. These operations serve to illustrate our key techniques, which can be further applied to yield protocols for a much more comprehensive list of dataset operations. We use the fact that most of the operations distribute nicely as identical computations over different pairs of columns. Throughout this section, N denotes the upper bound on the sizes of input/output datasets.

Aggregation: The aggregation operation takes two datasets as inputs and outputs their row-wise concatenation. We first describe an arithmetic circuit to verify the concatenation of vectors. The circuit accepts three vectors in the uniform representation discussed in Sect. 3.1. Let \(\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}\) be three vectors of size at most N represented as \(\llbracket s, \boldsymbol{X}\rrbracket \), \(\llbracket t, \boldsymbol{Y}\rrbracket \) and \(\llbracket w, \boldsymbol{Z}\rrbracket \) respectively, where \(\boldsymbol{X},\boldsymbol{Y},\boldsymbol{Z}\) are vectors of size N. The verification involves ensuring that the first w entries of \(\boldsymbol{Z}\) contain the first s entries of \(\boldsymbol{X}\) and the first t entries of \(\boldsymbol{Y}\). Figure 4 illustrates the setting for \(s=3\), \(t=4\), \(w=7\) and \(N=9\). To aid the verification, the prover provides N-length binary vectors \(\boldsymbol{\rho }_s,\boldsymbol{\rho }_t\) and \(\boldsymbol{\rho }_w\) as auxiliary inputs. The vector \(\boldsymbol{\rho }_s\) is 1 in its first s entries and 0 elsewhere; \(\boldsymbol{\rho }_t\) and \(\boldsymbol{\rho }_w\) satisfy the analogous relations. The correctness of aggregation now reduces to showing that there is a permutation that simultaneously maps \(\llbracket \boldsymbol{\rho }_s, \boldsymbol{\rho }_t\rrbracket \) to \(\llbracket \boldsymbol{\rho }_w, \boldsymbol{0}\rrbracket \) and \(\llbracket \boldsymbol{X}, \boldsymbol{Y}\rrbracket \) to \(\llbracket \boldsymbol{Z}, \boldsymbol{0}\rrbracket \). Figure 4 also shows how the verification is decomposed: the first circuit checks that (i) \(w=s+t\), (ii) the vectors \(\boldsymbol{\rho }_s,\boldsymbol{\rho }_t,\boldsymbol{\rho }_w\) are correctly provided and (iii) \(\boldsymbol{u}_1 = \llbracket \boldsymbol{\rho }_s, \boldsymbol{\rho }_t\rrbracket \), \(\boldsymbol{v}_1= \llbracket \boldsymbol{X}, \boldsymbol{Y}\rrbracket \), \(\boldsymbol{u}_2=\llbracket \boldsymbol{\rho }_w, \boldsymbol{0}\rrbracket \) and \(\boldsymbol{v}_2= \llbracket \boldsymbol{Z}, \boldsymbol{0}\rrbracket \). The second circuit checks the “simultaneous permutation” property on the pairs \((\boldsymbol{u}_1,\boldsymbol{v}_1)\) and \((\boldsymbol{u}_2,\boldsymbol{v}_2)\). Both circuits can be realized using O(N) multiplication gates. Using a CP-SNARK we can verify the correctness of aggregation of vectors over commitments.
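In plaintext, the witness construction and the multiset condition that the simultaneous-permutation check enforces can be sketched as follows. This is a sketch only; the actual protocol works over commitments and uses the probabilistic circuit of Sect. 3.3 instead of an explicit multiset comparison.

```python
from collections import Counter

def indicator(s, N):
    """rho vector: 1 in the first s entries, 0 elsewhere."""
    return [1] * s + [0] * (N - s)

def check_vector_concatenation(X, s, Y, t, Z, w, N):
    """Plaintext analogue of the aggregation check on padded vectors."""
    if w != s + t:
        return False
    rho_s, rho_t, rho_w = indicator(s, N), indicator(t, N), indicator(w, N)
    u1, v1 = rho_s + rho_t, X + Y            # [[rho_s, rho_t]], [[X, Y]]
    u2, v2 = rho_w + [0] * N, Z + [0] * N    # [[rho_w, 0]],     [[Z, 0]]
    # A simultaneous permutation mapping (u1, v1) to (u2, v2) exists iff the
    # multisets of (u, v) pairs agree; the protocol certifies this via Sect. 3.3.
    return Counter(zip(u1, v1)) == Counter(zip(u2, v2))

# Fig. 4 setting: s = 3, t = 4, w = 7, N = 9.
N = 9
X = [1, 2, 3] + [0] * 6
Y = [4, 5, 6, 7] + [0] * 5
Z = [1, 2, 3, 4, 5, 6, 7] + [0] * 2
assert check_vector_concatenation(X, 3, Y, 4, Z, 7, N)
```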

We now leverage the above construction to verify the aggregation operation over datasets. Let \(D_x, D_y\) and \(D_z\) be datasets, each with k columns given by \((\boldsymbol{x}_i)_{i=1}^k, (\boldsymbol{y}_i)_{i=1}^k\) and \((\boldsymbol{z}_i)_{i=1}^k\) respectively. The reduction involves the verifier sampling random \(\alpha _1,\ldots ,\alpha _k\) satisfying \(\alpha _1+\cdots +\alpha _k=1\). Next, we use the above circuit construction with a CP-SNARK to prove that the vectors \(\boldsymbol{x}=\sum _{i=1}^k\alpha _i\boldsymbol{x}_i\), \(\boldsymbol{y}=\sum _{i=1}^k\alpha _i\boldsymbol{y}_i\) and \(\boldsymbol{z}=\sum _{i=1}^k\alpha _i\boldsymbol{z}_i\) satisfy the concatenation property. We give the complete protocol and proof of the reduction in Appendix C.3.

Filter: The filter operation takes a dataset and a selection predicate as inputs and outputs the dataset consisting of the subset of rows satisfying the predicate. We divide the computation into two parts: (i) applying the selection predicate to the rows of the dataset to obtain a binary vector \(\boldsymbol{f}\), which we call the selection vector, and (ii) applying the selection vector to the source dataset to obtain the target dataset. The latter computation can be verified with techniques similar to those used in the aggregation operation. For the first computation, we describe an efficient circuit for predicates of the form \(\wedge _{i=1}^k (\boldsymbol{x}_i==v_i)\) where \(\boldsymbol{x}_1,\ldots ,\boldsymbol{x}_k\) are the columns of the dataset. Once again the verifier chooses random \(\alpha _1,\ldots ,\alpha _k\) with \(\sum _{i=1}^k \alpha _i=1\) and challenges the prover to show that the selection vector \(\boldsymbol{f}\) satisfies \(\boldsymbol{f}=(\boldsymbol{x}==v)\) where \(\boldsymbol{x}=\sum _{i=1}^k \alpha _i\boldsymbol{x}_i\) and \(v=\sum _{i=1}^k \alpha _iv_i\). The relation \(\boldsymbol{f}=(\boldsymbol{x}==v)\) can be verified using a circuit with O(N) gates. Due to the homomorphism of the commitment scheme, the verifier can compute the commitment for the vector \(\boldsymbol{x}\) given the commitments to the columns of the dataset. For more general range queries of the form \(\wedge _{i=1}^k (\ell _i < \boldsymbol{x}_i \le r_i)\), we can compute a selection vector \(\boldsymbol{f}_i\) for each column, and then compute the final selection vector \(\boldsymbol{f}=\wedge _{i=1}^k \boldsymbol{f}_i\).
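The random-combination trick for equality predicates admits a short plaintext sketch (the field modulus and column values below are illustrative); with overwhelming probability over the verifier's randomness the combined check agrees with the conjunction of per-column checks:

```python
import random

P = 2**61 - 1   # toy prime field modulus

def selection_vector_check(columns, values, f, rng=random):
    """Check f[i] = 1 iff row i satisfies AND_j (columns[j][i] == values[j]),
    using one random linear combination with coefficients summing to 1."""
    k, N = len(columns), len(columns[0])
    alphas = [rng.randrange(P) for _ in range(k - 1)]
    alphas.append((1 - sum(alphas)) % P)          # enforce sum(alphas) = 1
    x = [sum(a * col[i] for a, col in zip(alphas, columns)) % P for i in range(N)]
    v = sum(a * val for a, val in zip(alphas, values)) % P
    return all(f[i] == int(x[i] == v) for i in range(N))

# Example: keep rows where col0 == 2 and col1 == 7.
col0, col1 = [2, 5, 2, 2], [7, 7, 1, 7]
assert selection_vector_check([col0, col1], [2, 7], f=[1, 0, 0, 1])
```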

Fig. 4. Circuit for verifying vector concatenation.

Order By: The order-by operation permutes the rows of the dataset so that a specified column is in sorted order. The verification can be naturally expressed as the columns of the source and target datasets satisfying the simultaneous permutation relation, where additionally the specified column of the target is sorted. We can check the monotonicity of a column using a circuit with O(bN) gates, where b is the bit-width of the range of values in the column. We skip the details.

Inner-Join: The inner-join operation concatenates pairs of rows of the input datasets which have identical values in the designated columns (joining columns). We consider the inner-join operation under the restriction that the joining columns have distinct values. As a first step, we order both input datasets so that the joining columns are sorted. We can use the verification protocol for the order-by operation to ensure correctness of this step. We therefore assume that the joining columns are sorted and take distinct values. Let \(D_1\) and \(D_2\) be two datasets which are joined on columns \(\boldsymbol{x}\) and \(\boldsymbol{y}\) to yield the dataset D. We write D as the juxtaposition of columns \([D^{'}_1,\boldsymbol{z},D^{'}_2]\) where \(D^{'}_i\) denotes the columns coming from \(D_i\) while \(\boldsymbol{z}\) denotes the column obtained as the intersection of \(\boldsymbol{x}\) and \(\boldsymbol{y}\). We first design a sub-circuit for private set intersection (PSI) to compute the size w of the resulting dataset. We then let the prover provide auxiliary selection vectors \(\boldsymbol{f}_1\) and \(\boldsymbol{f}_2\) of size w. Finally, using the circuit for the filter relation, we verify that \(\boldsymbol{f}_1\) applied to \(D_1\) yields the dataset \(D_L = [D^{'}_1,\boldsymbol{z}]\) and \(\boldsymbol{f}_2\) applied to \(D_2\) yields the dataset \(D_R=[D^{'}_2,\boldsymbol{z}]\). The overall circuit complexity is O(bN), where b is the bit-width of the range of values in \(\boldsymbol{x}\) and \(\boldsymbol{y}\), with the set-intersection computation dominating the overall cost.

5 Privacy Preserving Model Inference: Decision Trees

In this section we present a zero knowledge protocol for verifiable inference from decision trees (and random forests). Decision trees are popular models in machine learning due to their interpretability. A decision tree recursively partitions the feature space (arranged as a tree), and finally assigns a label to each leaf segment. The problem of proving correct inference from a decision tree was considered recently in [28], where the authors present a privacy preserving method for an adversary to commit to a decision tree and later prove inference from the tree on public test data. We present a new construction based on consistent memory access, which improves upon the prior construction by reducing the number of multiplication gates in the inference circuit. We also provide a zero knowledge protocol for establishing the accuracy of a decision tree on test data. We consider variants with the test data being public or private. The latter scenario is helpful when verifying the performance of a private model on a reputationally trusted private dataset.

Decision Tree Representation: We parameterize a binary decision tree with the following parameters: the maximum number of nodes (N), the maximum length of a decision path (h) and the maximum number of features used as predictors (d). We assume that the nodes in the decision tree have unique identifiers from the set [N], while features are identified using indices in the set [d]. We naturally represent a decision tree \(\mathcal {T}\) as a lookup table with five columns, i.e., \(\mathcal {T}=(\boldsymbol{V},\boldsymbol{T},\boldsymbol{L},\boldsymbol{R},\boldsymbol{C})\), where each column vector is of size N. For a decision tree with \(t\le N\) nodes, we encode it as follows: for \(i\in [t]\):

  • \(\boldsymbol{V}[\,i\,]\) denotes the identifier for the splitting feature for \(i^{th}\) node.

  • \(\boldsymbol{T}[\,i\,]\) denotes the threshold value for the splitting feature for \(i^{th}\) node.

  • \(\boldsymbol{L}[\,i\,]\) and \(\boldsymbol{R}[\,i\,]\) denote the identifiers for the left and right child of \(i^{th}\) node. In case of a leaf node, this value is set to i itself.

  • \(\boldsymbol{C}[\,i\,]\) denotes the label associated with the \(i^{th}\) node, when it is a leaf node. For non-leaf nodes this may be set arbitrarily.

We commit to a decision tree, by committing to each of the vectors. We define \(\textsf{cm}_{\mathcal {T}} = (\textsf{cm}_V,\textsf{cm}_T,\textsf{cm}_L,\textsf{cm}_R, \textsf{cm}_C)\) as the commitment to \(\mathcal {T}\).

Decision Tree Inference: We model the test data D as an \(n\times d\) matrix, consisting of n d-dimensional samples. Let \(\boldsymbol{D}\) be the vector of size dn obtained by flattening D in row major order. The algorithm below computes decision paths \(\boldsymbol{p}_i = (\boldsymbol{p}_i[\,1\,],\ldots ,\boldsymbol{p}_i[\,h\,])\) for each sample \(i\in [n]\); a plaintext sketch follows the listing. The prediction vector \(\boldsymbol{q}\) contains the class labels corresponding to the leaf nodes \(\boldsymbol{p}_i[\,h\,]\) for \(i\in [n]\).

  1. For \(i=1,\ldots ,n\) do:

    • Set \(\boldsymbol{p}_i[\,1\,]=1\): the root is the first node on every decision path.

    • For \(j=1,\ldots ,h\) determine the next node as follows:

      (a) Compute splitting feature: \(\boldsymbol{f}_i[\,j\,]=\boldsymbol{V}[\,\boldsymbol{p}_i[j]\,]\).

      (b) Compute threshold value: \(\boldsymbol{t}_i[\,j\,]=\boldsymbol{T}[\,\boldsymbol{p}_i[j]\,]\).

      (c) Compute left and right child ids: \(\boldsymbol{l}_i[\,j\,]=\boldsymbol{L}[\,\boldsymbol{p}_i[j]\,]\), \(\boldsymbol{r}_i[\,j\,]=\boldsymbol{R}[\,\boldsymbol{p}_i[j]\,]\).

      (d) Compute label: \(\boldsymbol{c}_i[\,j\,]=\boldsymbol{C}[\,\boldsymbol{p}_i[j]\,]\).

      (e) Compute \(\hat{\boldsymbol{f}}_i[\,j\,] = d*i + \boldsymbol{f}_i[\,j\,]\).

      (f) Compute value of splitting feature: \(\boldsymbol{v}_i[\,j\,]=D[\,i,\boldsymbol{f}_i[j]\,]=\boldsymbol{D}[\,\hat{\boldsymbol{f}}_i[j]\,]\).

      (g) Compute next node: \(\boldsymbol{p}_i[\,j+1\,] = \boldsymbol{l}_i[\,j\,]\) if \(\boldsymbol{v}_i[\,j\,]\le \boldsymbol{t}_i[\,j\,]\), and \(\boldsymbol{r}_i[\,j\,]\) otherwise.

    • Compute label for the sample: \(\boldsymbol{q}[\,i\,]=\boldsymbol{c}_i[\,h\,]\).
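As a plaintext reference (no commitments or proofs), the table-based inference loop can be sketched as follows; indices here are 0-based, unlike the 1-based convention of the listing above, and the tiny example tree is ours.

```python
from dataclasses import dataclass

@dataclass
class TreeTable:
    """Lookup-table view of a decision tree (0-indexed node ids)."""
    V: list   # splitting feature id per node
    T: list   # threshold per node
    L: list   # left child id (set to the node itself for leaves)
    R: list   # right child id (set to the node itself for leaves)
    C: list   # class label (meaningful for leaf nodes)

def infer(tree: TreeTable, D: list, h: int) -> list:
    """Predict a label for each row of D by following h table lookups."""
    predictions = []
    for row in D:
        node = 0                                   # root starts every path
        for _ in range(h):                         # steps (a)-(g) of the listing
            feat, thr = tree.V[node], tree.T[node]
            node = tree.L[node] if row[feat] <= thr else tree.R[node]
        predictions.append(tree.C[node])           # label at the reached leaf
    return predictions

# Tiny example: "x[0] <= 5 -> label 0, else label 1"; leaves self-loop.
tree = TreeTable(V=[0, 0, 0], T=[5, 0, 0], L=[1, 1, 2], R=[2, 1, 2], C=[None, 0, 1])
assert infer(tree, [[3], [9]], h=2) == [0, 1]
```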

Verification of the above algorithm involves verifying (i) hn memory accesses on the tables of \(\mathcal {T}\) in steps (a)-(d), which share the access pattern \(\boldsymbol{p}_i[\,j\,]\), (ii) hn memory accesses on \(\boldsymbol{D}\) (of size dn) in step (f), and (iii) hn comparisons as part of step (g). Using the optimization in Sect. 3.4, the first verification incurs \(O(N+hn)\) multiplication gates, while the second verification incurs \(O(dn + hn)\) multiplication gates. Using standard techniques, verification of (iii) can be performed using O(whn) multiplication gates, where w is the bit-width of feature values. Thus, the overall circuit complexity of our solution is \(O(N + n(d + h + wh))\). We compare our solution with the method for zero-knowledge decision tree (zkDT) inference presented in [28]. Broadly, the method in [28] establishes the correctness of inference via three checks:

  • Consistency of input decision tree with public commitment: This involves O(N) evaluations of the hash function \(\mathcal {H}\) used for commitment and thus incurs \(c(\mathcal {H})\cdot N\) multiplication gates. Here \(c(\mathcal {H})\) denotes the size of circuit required to evaluate \(\mathcal {H}\).

  • Consistency of feature vector with decision path: The verification of this step leverages a “Multiset Check” ([28, Section 4.1]) which costs \(O(d\log h)\) multiplication gates per sample.

  • Correct evaluation of decision tree function: This involves h comparisons for each sample, which incur hw multiplication gates, where w is the bit-width of feature values.

The above steps result in an overall circuit complexity of \(c(\mathcal {H})N + n(3d\log h + hw)\) for zkDT. Our solution improves upon the approach in [28] by reducing the cost of the first two checks. Using a CP-SNARK, we avoid the cost of computing the commitment within the verification circuit, while our optimized protocols for memory access allow us to accomplish the second check with an average cost of \(O(h + d)\) gates per sample (\(O(dn+hn)\) overall), which compares favorably with the per-sample cost of \(O(d\log h)\) incurred by zkDT for \(h=\varTheta (d)\). The concrete improvement obtained using our approach depends on which of the three checks dominates the cost for specific parameter settings. We compare the cost of the two approaches for some representative parameter settings in Table 5.

Decision Tree Accuracy: The above circuit for decision tree inference can be easily modified to yield a circuit for proving the accuracy of a decision tree on test data. In this case, the prediction vector is kept private and tallied against the ground truth to compute the accuracy. Since our system also includes verifiability of model performance (accuracy) on private benchmark datasets, we briefly describe the modifications required to achieve the same. Let D be a private dataset with columns \((\boldsymbol{x}_1,\ldots ,\boldsymbol{x}_d)\) whose column commitments are public. Since we can no longer compute the flattened vector \(\boldsymbol{D}\) as before, we cannot verify the lookup \(\boldsymbol{v}_i[\,j\,]=\boldsymbol{D}[\,\hat{\boldsymbol{f}}_i[j]\,]\). Instead we use polynomial interpolation to pre-process D. For the \(i^{th}\) row \(D[i,\cdot ]\) of the original data (a vector of size d), we interpolate a polynomial \(p_i\) of degree \(d-1\) such that \(p_i(j)=D[i,j]\). We obtain the pre-processed dataset \(D'\) whose \(i^{th}\) row consists of the coefficients of \(p_i\). The data owner makes a commitment to \(D'\) instead of D. The lookup \(\boldsymbol{v}_i[\,j\,]=D'[i,j]=p_i(j)\) now involves evaluating a degree \(d-1\) polynomial, which incurs d multiplication gates. The overall circuit complexity for accuracy over private datasets is therefore \(O(N + hn + hnw + hnd)\).
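A minimal sketch of this interpolation-based pre-processing over a toy prime field, using Lagrange interpolation (the modulus is illustrative; a production system would use the CP-SNARK's field):

```python
P = 2**61 - 1   # toy prime field modulus

def poly_mul_linear(coeffs, k):
    """Multiply a polynomial (low-to-high coeffs) by (x - k) mod P."""
    out = [0] * (len(coeffs) + 1)
    for i, c in enumerate(coeffs):
        out[i] = (out[i] - k * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out

def interpolate_row(row):
    """Coefficients (low-to-high) of p with p(j) = row[j-1] for j = 1..d, mod P."""
    d = len(row)
    result = [0] * d
    for j, y in enumerate(row, start=1):
        basis, denom = [1], 1
        for k in range(1, d + 1):
            if k != j:
                basis = poly_mul_linear(basis, k)
                denom = denom * (j - k) % P
        scale = y * pow(denom, P - 2, P) % P      # field division by denom
        result = [(r + scale * b) % P for r, b in zip(result, basis)]
    return result

def evaluate(coeffs, x):
    """Horner evaluation of p(x) mod P, costing d multiplications."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

# Each private lookup D[i, f] becomes the polynomial evaluation p_i(f).
row = [17, 4, 42, 8]                     # i-th row of the private dataset
coeffs = interpolate_row(row)
assert all(evaluate(coeffs, j) == row[j - 1] for j in range(1, 5))
```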

Table 4. Measuring the efficacy of our optimizations on 100K\(\times \) 10 datasets.
Table 5. Comparison of Circuit Complexity for decision tree inference.

6 Experimental Evaluation

In this section we report the concrete performance of our system primitives. For our implementation, we used Adaptive Pinocchio [24] as the underlying CP-SNARK, which we implemented using the libsnark [17] library. We also used the libsnark library for our circuit descriptions. Our experiments were performed on Ubuntu Linux 18.04 cloud instances with 8 Intel Xeon 2.10 GHz virtual CPUs and 32 GB of RAM. The experiments were run with finite field arithmetic and FFT libraries compiled to exploit multiple cores. We often use circuit complexity (the number of multiplication gates in the circuit) as the “environment-neutral” metric for comparing different approaches (proving times scale quasi-linearly with circuit complexity).

Performance of Dataset Operations: Table 1 contains a summary of the asymptotic as well as concrete efficiency of our dataset operations. All the operations scale linearly with the number of rows (with a marginal additive dependence on the number of columns). The numbers for proof generation and verification were generated for a representative dataset size of \(100K\times 10\). While proof generation is an expensive operation by general standards, it is practical enough for infrequent usage. We also tabulate the efficacy of our optimizations in Table 4. For the unoptimized case, we do not use CP-SNARKs and instead compute commitments using the circuit-friendly MiMC hash [1]. For the partially optimized case, we use the native commitment scheme of the CP-SNARK, but use monolithic circuits to encode the operations. To express permutations in monolithic circuits, we use the gadgets for routing networks [6, 26] available in [17]. The fully optimized version delegates permutation checking and memory access checking to probabilistic circuits, as discussed in Sect. 3.2. In the first case, hashing dominates the circuit complexity, resulting in 50–100 times larger circuits. Decomposing the circuits, instead of using monolithic circuits, also yields an order of magnitude savings.

Table 6. Concrete proving and verification time for decision tree inference.
Table 7. Circuit Complexity for decision tree accuracy for public and private benchmark datasets.

Performance of Decision Tree Inference: We use two decision trees T1 and T2 to benchmark the performance of our decision tree inference implementation. We also use the same trees to compare our method with the one presented in [28]. We synthetically generate the tree T1 with 1000 nodes, 50 features and depth 20, which roughly corresponds to the largest tree used in [28]. The tree T2 is trained on a curated version of the dataset [12] for Home Mortgage Approval. We identify 35 features from the dataset to train a binary decision tree. We train T2 with 10000 nodes and depth 25. We verify the inference from the two trees for batch sizes of 100 (small), 1000 (medium) and 10000 (large). Using our method, generating a proof of predictions takes from a few seconds (on small data) to a few minutes (on large data), as seen in Table 6. The circuit complexity and the proving time scale almost linearly for our method. We also compare the number of multiplication gates incurred by the arithmetic circuits in our method with that in [28] in Table 5. Our efficiency is an order of magnitude better for smaller data sizes, as we do not incur the one-time cost of hashing the tree. For larger batch sizes, our method is still about 1.5–\(4\times \) more efficient. As the batch sizes get large, comparisons dominate the circuit complexity in both approaches. We report the circuit complexity for proving the accuracy of decision trees on private and public datasets. Table 7 shows that the overhead for proving accuracy on private datasets ranges from 50–80%.

Performance of Memory Access: We also independently benchmark the performance of our memory abstraction technique and compare it to existing methods in Table 3. Leveraging CP-SNARKs and probabilistic reductions, we essentially incur a constant number of gates per access. We compare the different approaches both in terms of asymptotic complexity and concrete complexity for parameter settings representative of their usage in our work. Our concrete efficiency is an order of magnitude better than the alternatives considered.