1 Introduction

In this paper we consider a decentralized AI pipeline with multiple independent organizations, wherein one set of organizations specializes in curating high-quality datasets from independent data sources, another set specializes in training models from the curated datasets, and yet another set deploys the trained models and provides them as a service to model consumers. A typical decentralized AI pipeline is shown in Fig. 1. The core assets, such as datasets and models, represent significant intellectual property for their respective owners. Therefore, it is essential for the asset owners to ensure the confidentiality of their assets beyond the intended usage. On the other hand, since the model consumers are likely to use the models to drive major decisions, they would like to ensure auditability and integrity of the models by (i) verifying the provenance and performance of the models on benchmark datasets and (ii) ensuring that the predictions from the deployed service match those of the verified model. In summary, decentralized AI pipelines need to provide end-to-end provenance while ensuring the confidentiality of the different assets.

Fig. 1. Typical decentralized AI pipeline.

Consider an example of deciding on mortgage applications using an AI service. A data service provider, SP, provides high-quality training and benchmark datasets by curating historical mortgage data from reputationally trusted financial institutions. A specialized fintech company, FC, trains and deploys an AI model as a service for the given task. Further, it makes a public claim about the model's performance on a benchmark dataset. Note that establishing provenance of the model training carried out by FC is not addressed in this work. A financial institution, CONS, wanting to use AI in its mortgage approval process would want to independently verify the claim made by FC before deciding to subscribe to the service. If CONS is satisfied after the verification process, it might use the deployed service to make decisions on mortgage applications. At this point, CONS and individual mortgage applicants should be able to independently verify that the predictions from the deployed service match those of the verified model. The reputationally trusted data owners and FC would like to protect the confidentiality of their assets except from those actors who are entitled to access them. We highlight a special and important requirement of FC: to prevent model re-engineering attacks, FC would like to ensure that the model verifier does not learn the predictions of the model on individual instances during the verification process.

We present significant progress towards an efficient and scalable approach to providing public verifiability for common operations in an AI pipeline, while preserving the confidentiality of the data and model assets involved. In this paper we highlight a few primitive operations, but more operations on both data and models can be added as the state of the art improves. While it is difficult to match the expressiveness of plain-text computation, our methods can nevertheless provide provenance over simpler pipelines.

1.1 Related Work

While there is no prior work that addresses all the aspects of a verifiable decentralized AI pipeline as introduced in this paper, there are past works that address different aspects of the overall requirements. The provenance requirement is addressed in [19, 21], the model verification or certification requirement is addressed in [15, 22], and verifiable inference from a private model is addressed in [11, 14, 18, 23, 28]. Our work is of independent interest to the field of Verifiable Computation (VC), as it provides more efficient methods for useful computational primitives such as read-only memory (ROM) access and operations on datasets. We briefly review and contrast the relevant literature with our work.

Provenance Models for AI: There has recently been considerable interest in the provenance of AI assets. For instance, [19, 21] provide good motivation and a DLT-based architecture for establishing provenance of AI assets. The provenance is enabled by recording the cryptographic hash of each asset on a tamper-proof ledger and recording any operations on them as transactions. While this provides auditability and lineage of an asset, its verification necessarily involves revealing the assets, thereby violating the confidentiality requirements in our setting. We build on tools from verifiable computation to enable verifiability of assets and operations on them while supporting all the stated confidentiality requirements.

Model Certification for AI: Training and testing AI models for fairness and bias is an area of active research. Recently, efforts have been made to leverage methods from secure multiparty computation (MPC) to enable fair training and certification of AI models while ensuring privacy of sensitive data of the participants [15, 22]. These methods require a trusted party (e.g. a regulator) to certify the claims on the models and therefore do not support the public verifiability requirement in our setting.

Verifiable Model Inference: The problem of verifying the predictions from private AI models, under different privacy requirements, has been considered in the literature. For instance, verifiable execution of neural networks has been considered in [14, 18, 23, 27] and verification of predictions from decision trees has been considered in [28]. These works cannot be extended to end-to-end pipeline verification as they cannot handle verification of operations on datasets. In our work, apart from providing verification for the entire AI pipeline, we improve upon the work of [28] by making the verification of decision tree inference more scalable, as described in Sect. 1.2.

Reusable Gadgets for VC: On the technical front, our work complements persistent efforts such as [16, 25] to enable more computations efficiently in the VC setting. The problem of efficiently supporting addressable memory inside VC circuits has received considerable attention [3, 5, 16, 25, 31], as many computations are best expressed using the abstraction of memory. The methods in the aforementioned efforts support arbitrary zero-knowledge Succinct Arguments of Knowledge (zkSNARKs). We provide a more efficient variant of prior methods by leveraging a zkSNARK with commit and prove capability (see Sect. 3). This restriction is not a major hindrance, as many efficient zkSNARKs can be modified to be commit and prove with negligible overhead (see [8]). Our efficient abstractions for read-only memory (ROM) and datasets can be incorporated into zkSNARK circuit compilers such as ZoKrates [10], when suitably targeted for a commit and prove backend. In particular, supporting datasets as first-class primitives in zkSNARK compilers will make them more attractive for privacy preserving data science applications. Finally, we mention that the work on Verifiable Outsourced Databases (e.g. [29, 30]) is not directly applicable here as (i) current implementations do not address data confidentiality and (ii) they do not support reusable representation of datasets across computations.

1.2 Our Contributions

We present the first efficient and scalable system for decentralized AI pipelines with support for the confidentiality concerns of the asset owners (as described in Table 2) and public verifiability. Our work represents major system-level innovations over prior work in model certification ([15] lacks public verifiability and provenance), provenance architectures for AI artifacts ([19, 21] lack privacy), and confidentiality preserving model inference ([14, 23, 28] lack provenance). A number of technical contributions enable this system-level novelty; they are summarized as follows.

  • Improved method for read-only memory access in arithmetic circuits with an order of magnitude gain in efficiency over the existing methods (see Table 3). The improved memory access protocol is crucially used in realizing efficient circuits for data operations (inner-join) and decision tree inference.

  • A method for consistent modeling of datasets in arithmetic circuits with complete privacy. In addition, we design efficient circuits to prove common operations on datasets. We make several optimizations over the basic approach of using zkSNARKs, resulting in at least an order of magnitude gain in efficiency (see Table 4). On commodity hardware, our implementation scales well, proving operations on datasets with up to 1 million rows in a few minutes. Verification takes a few hundred milliseconds.

  • We present an improved protocol for privacy preserving verifiable inference from decision trees. Our method yields up to ten times smaller verification circuits by avoiding the expensive one-time hashing of the tree used in [28]. Further, by leveraging our method for read-only memory access, we also incur fewer multiplication gates per prediction (see Sect. 5 for more details). Comparative performance under different settings is summarized in Table 5.

  • We implement our scheme using Adaptive-Pinocchio [24] and experimentally evaluate its efficacy. We report the results in Sect. 6. Our scheme can also be instantiated with other CP-SNARKs.

Our implementation uses pre-processing zkSNARKs [5, 9, 13, 20] which pre-process a circuit description to make subsequent proving and verification more efficient. Our circuits can also be used with generic zkSNARKs such as those in [2, 4, 7], suitably augmented with commit and prove capability.

Table 1. Performance of our dataset operations. Concrete numbers are reported for \(N=100K\) rows and element bit-width \(b=32\).

2 Verifiable Provenance in Decentralized AI Pipelines

A typical AI pipeline consists of different steps, such as accessing raw datasets from multiple sources, performing aggregation and transformations to curate training and testing datasets for the AI task at hand, developing the AI model, and deploying it in production. We are interested in settings in which the AI pipeline is decentralized, i.e., different steps of the pipeline are carried out by different independent actors. We assume five different types of actors: data owners (DO), data curators (DC), model owners (MO), model certifiers (MCERT), and model consumers (MCONS). For brevity of exposition, we assume that the number of data curators, model owners, model certifiers, and model consumers is just one. However, all the concepts and results extend in a straightforward manner to the general setting involving multiple entities of each type.

We assume that there is a task \(\texttt{T}\) for which the process of building an AI pipeline is undertaken in a decentralized setting. The salient features of our provenance and certification framework are summarized as follows.

There are m data owners \(\texttt{DO}_1, \texttt{DO}_2, \ldots , \texttt{DO}_m\) who share their respective raw datasets \(D_1, D_2, \ldots , D_m\) privately with the data curator \(\texttt{DC}\) and also make a public commitment to the datasets. The data curator curates a dataset \(D_{b} = f(D_1, D_2, \ldots , D_m)\) for the purpose of benchmarking the performance of an AI model for the task \(\texttt{T}\) and makes a public commitment to \(D_{b}\). We assume the model owner, \(\texttt{MO}\), has a pre-trained AI model M and wants to offer it as a service. \(\texttt{MO}\) makes a public commitment to the model. \(\texttt{MO}\) buys the benchmark dataset \(D_{b}\) from \(\texttt{DC}\). \(\texttt{MO}\) wishes to convince potential consumers of the utility of the model M by making a performance claim \(accuracy = score(M, D_b)\) when M is used for getting predictions on the dataset \(D_b\). The model certifier, \(\texttt{MCERT}\), should be able to independently verify the provenance of all the steps and the claimed performance of the model M. \(\texttt{MCERT}\) also checks that the timestamp of the public commitment of the model M precedes the timestamp of the public commitment of \(D_b\), which ensures that the model M cannot have been overfitted to the dataset \(D_b\). \(\texttt{MCERT}\) certifies the model M only after verifying the correctness of the claim. The model consumer, \(\texttt{MCONS}\), subscribes to the model M only upon its successful certification. Suppose \(\texttt{MCONS}\) supplies a valid input \(D'\) to the service provided by \(\texttt{MO}\) and gets a prediction \(Y'\). We require that \(\texttt{MCONS}\) should be able to independently verify that the prediction \(Y'\) matches the prediction of the committed model M on the instance \(D'\).

We observe that the outlined requirements ensure that the decentralized AI pipeline is transparent. The key question we address in this paper is that of providing such transparency while satisfying the confidentiality requirements of all the actors. We assume that none of the actors in the setup have any incentive to collude with the others, but they can act maliciously. The privacy requirements and security model of the different actors are summarized in Table 2.

Table 2. Summary of privacy requirements and trust assumptions in our setting.

We present a provenance framework which ensures trust in the AI pipeline by proving each computation step using zero-knowledge proofs, thus meeting all the confidentiality requirements captured in Table 2. Below, we present a concrete example of an AI pipeline for establishing fairness of an AI model, where we clearly highlight the involvement of the various actors.

2.1 Decentralized Model Fairness

Increasingly, AI models are required to be fair (i.e. non-discriminating) with respect to protected attributes (e.g. Gender). There are several metrics used to evaluate a model for fairness. For the sake of illustration, we choose the popular metric called predictive parity, which requires a model to have similar accuracy for different values of the protected attribute. In our specific example, the goal is to show that for a binary classification model M we have:

$$\begin{aligned} \big \vert \textrm{Pr}[M(\boldsymbol{x})=y\,|\, \textsf{Gender}(\boldsymbol{x})=\texttt{M}] - \textrm{Pr}[M(\boldsymbol{x})=y\,|\, \textsf{Gender}(\boldsymbol{x})=\texttt{F}] \big \vert \le \varepsilon \end{aligned}$$

where \((\boldsymbol{x},y)\sim \mathcal {D}\) for a representative distribution \(\mathcal {D}\). We may estimate the above metric empirically on a test dataset T consisting of samples \(\{(\boldsymbol{x}_i,y_i)\}_{i=1}^n\). For concreteness, let M be a decision tree model developed by model owner MO to be used by financial institutions for approving home mortgage loan applications. Let \(D_1\) and \(D_2\) be two private datasets consisting of loan applications, owned by financial institutions \(\texttt{DO}_1\) and \(\texttt{DO}_2\) respectively. A data curator DC curates the dataset T by concatenating (row-wise) the datasets \(D_1\) and \(D_2\), and further generates datasets \(T_M\), \(T_F\) consisting of the applications with male and female applicants respectively. Finally, the model owner MO obtains the datasets \(T_M\) and \(T_F\) and computes the accuracy of its model on the respective datasets. In Fig. 2, the top left code block shows the operations executed by different actors in the pipeline without verifiability. The remaining code blocks show operations performed by actors in a verifiable pipeline. The asset owners publicly commit their private assets (bottom left) and generate proofs to attest correctness of their operations on assets (top right). Finally, a verifier (e.g. an auditor) uses the published commitments and proofs to establish the correctness of the steps performed by the respective actors in the pipeline (bottom right).
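For intuition, the plaintext (unverified) predictive-parity estimate that the pipeline ultimately certifies can be sketched as follows. This is only an illustrative sketch; the attribute names and the toy "model" are ours and are not part of the pipeline of Fig. 2.

```python
def predictive_parity_gap(predict, rows, label_key="approved", protected_key="gender"):
    """Empirical predictive-parity gap |acc(M) - acc(F)| on labelled rows.

    predict: callable mapping a feature dict to a predicted label
    rows:    list of dicts containing features, the label and the protected attribute
    """
    acc = {}
    for group in ("M", "F"):
        subset = [r for r in rows if r[protected_key] == group]
        correct = sum(predict(r) == r[label_key] for r in subset)
        acc[group] = correct / len(subset)
    return abs(acc["M"] - acc["F"])

# Toy usage with a trivial rule that approves incomes above 50.
rows = [
    {"income": 60, "gender": "M", "approved": 1},
    {"income": 40, "gender": "M", "approved": 0},
    {"income": 70, "gender": "F", "approved": 1},
    {"income": 30, "gender": "F", "approved": 1},
]
gap = predictive_parity_gap(lambda r: int(r["income"] > 50), rows)
assert abs(gap - 0.5) < 1e-9   # male accuracy 1.0, female accuracy 0.5
```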

Fig. 2. Example pipeline for certifying financial model for fairness.

3 Overview

This section provides an overview of the technical challenges in instantiating our solution. More detailed technical contributions appear in Sects. 4 and 5.

3.1 Building Blocks

Cryptographic Primitives: We use zkSNARKs as the main cryptographic tool to verify correctness of data operations and model inference while maintaining confidentiality of the respective assets. A zkSNARK consists of a triple of algorithms \((\textsf{G},\textsf{P},\textsf{V})\) where (i) \(\textsf{G}\) takes the description of a computation as an arithmetic circuit C and outputs public parameters \(\textsf{pp}\leftarrow \textsf{G}(1^\lambda ,C)\), (ii) \(\textsf{P}\) takes \(\textsf{pp}\) and a satisfying instance \((\boldsymbol{x},\boldsymbol{w})\) for C and outputs a proof \(\pi \leftarrow \textsf{P}(\textsf{pp},\boldsymbol{x},\boldsymbol{w})\), while (iii) \(\textsf{V}\) takes \(\textsf{pp}\), a statement \(\boldsymbol{x}\) and a proof \(\pi \) and outputs \(b\leftarrow \textsf{V}(\textsf{pp},\boldsymbol{x},\pi )\). The proof \(\pi \) reveals no information about the witness \(\boldsymbol{w}\), while an accepting proof \(\pi \) implies that the prover knows a satisfying assignment \((\boldsymbol{x},\boldsymbol{w})\) with overwhelming probability. A commit and prove zkSNARK (CP-SNARK) allows proving knowledge of a witness \(\boldsymbol{w}\) as before, where part of \(\boldsymbol{w}\) additionally opens a public commitment c, i.e. \(\boldsymbol{w}=(\boldsymbol{u},\boldsymbol{z})\) and \(\textsf{Open}(c)=\boldsymbol{u}\). A CP-SNARK specifies a commitment scheme \(\textsf{Com}\) and, like a zkSNARK, it provides algorithms \(\textsf{G},\textsf{P}\) and \(\textsf{V}\) for generating public parameters, generating proofs and verifying proofs respectively. Additionally, a CP-SNARK allows one to generate proofs over data committed using \(\textsf{Com}\) with negligible overhead in proof generation and verification.
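As a mental model, the interface we rely on can be sketched as follows; this is a minimal Python sketch with names of our own choosing, not the API of any particular library.

```python
from typing import Any, Protocol

class CPSnark(Protocol):
    """Abstract commit-and-prove zkSNARK interface: (G, P, V) plus Com."""

    def setup(self, circuit: Any) -> Any:
        """G: pre-process the arithmetic circuit C into public parameters pp."""

    def commit(self, u: list, randomness: int) -> Any:
        """Com: commit to a vector u of field elements."""

    def prove(self, pp: Any, x: list, c: Any, u: list, z: list) -> bytes:
        """P: prove knowledge of w = (u, z) satisfying C, where Open(c) = u."""

    def verify(self, pp: Any, x: list, c: Any, proof: bytes) -> bool:
        """V: accept iff the proof attests to a satisfying assignment."""
```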

Notation: We use the notation [n] to denote the set of natural numbers \(\{1,\ldots ,n\}\). We often use the array notation \(\boldsymbol{x}[\,i\,]\) to denote the \(i^{th}\) component of the vector \(\boldsymbol{x}\), with 1 as the starting index. We will denote the concatenation of vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\) as \(\llbracket \boldsymbol{x}, \boldsymbol{y}\rrbracket \). All our arithmetic circuits, vectors and matrices are over a finite field \(\mathbb {F}\) of prime order.

Circuits for Dataset Operations: To use zkSNARKs, we express operations on datasets as arithmetic circuits. At a high level, arithmetic circuits representing data operations accept datasets as their inputs and outputs. Since establishing provenance of an asset in an AI pipeline requires verifying operations over several related assets, we require a uniform representation of datasets across arithmetic circuits, which allows a dataset to be used as an input/output in different circuits. The second design constraint we enforce is that the arithmetic circuits be universal, i.e., the same circuit can be used to verify operations on all datasets within a known size bound. We need universal circuits for two primary reasons: (i) the sizes of datasets are considered confidential and must not be inferable from the circuits being used, and (ii) the circuits can be pre-processed to yield efficient verification, which is a frequent operation in our applications.

Dataset Representation in Circuits: As we use the same circuit to represent operations over datasets of varying sizes, we first describe a uniform representation of datasets which can be used within the arithmetic circuits. Let N denote a known upper bound on the size of input/output datasets. We view a dataset as a collection of its column vectors (of size at most N). We encode a vector of size at most N as an \(N+1\) size vector \(\llbracket s, \boldsymbol{X}\rrbracket \) where \(\boldsymbol{X}=(\boldsymbol{X}[1],\ldots ,\boldsymbol{X}[N])\). In this encoding s denotes the size of the vector, \(\boldsymbol{X}[1],\ldots ,\boldsymbol{X}[s]\) contain the s entries of the vector, while \(\boldsymbol{X}[i]\) for \(i>s\) are set to 0. Similarly, a dataset is encoded by encoding each of its columns separately.
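A minimal plaintext sketch of this padding-based encoding (field arithmetic omitted for brevity):

```python
def encode_vector(x, N):
    """Encode a vector of size at most N as the (N+1)-length vector [s, X]."""
    s = len(x)
    assert s <= N, "vector exceeds the size bound N"
    return [s] + list(x) + [0] * (N - s)   # unused slots are padded with 0

def encode_dataset(columns, N):
    """Encode a dataset column by column."""
    return [encode_vector(col, N) for col in columns]

# Example: a 3-element column encoded with bound N = 6.
assert encode_vector([7, 9, 4], 6) == [3, 7, 9, 4, 0, 0, 0]
```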

Dataset Commitment: Let \(\textsf{Com}\) be a vector commitment scheme associated with a CP-SNARK \(\textsf{CP}\). We additionally assume that \(\textsf{Com}\) is homomorphic. To commit a vector \(\boldsymbol{x}\), we first compute its encoding \(\overline{\boldsymbol{x}}\) as a vector of size \(N+1\), and then compute \(c=\textsf{Com}(\overline{\boldsymbol{x}},r)\) as its commitment. Here r denotes the commitment randomness. To commit a dataset \(\boldsymbol{D}\) with columns \(\boldsymbol{x}_1,\ldots ,\boldsymbol{x}_M\), we commit each of its columns and take \(\boldsymbol{c}=(c_1,\ldots ,c_M)\), where \(c_i=\textsf{Com}(\overline{\boldsymbol{x}}_i,r_i)\), as the commitment. Using our circuits with the CP-SNARK \(\textsf{CP}\) allows us to efficiently prove operations over committed datasets.
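For concreteness, a homomorphic vector commitment in the Pedersen style can be sketched over a toy multiplicative group as below. This is an assumption-laden illustration: real instantiations use a cryptographically sized group (typically an elliptic curve) tied to the CP-SNARK, and the tiny modulus and generators here are for demonstration only.

```python
import random

# Toy parameters: safe prime p = 2q + 1 with a subgroup of prime order q.
P = 2039          # illustrative only; real systems use ~256-bit groups
Q = 1019

def setup(n, seed=42):
    """Sample n + 1 generators of the order-q subgroup (squares mod p)."""
    rng = random.Random(seed)
    gens = []
    while len(gens) < n + 1:
        g = pow(rng.randrange(2, P), 2, P)
        if g != 1:
            gens.append(g)
    return gens   # gens[0] hides the randomness, gens[1:] bind the entries

def commit(gens, vec, r):
    """c = h^r * prod_i g_i^{x_i} mod p, homomorphic in (vec, r)."""
    c = pow(gens[0], r % Q, P)
    for g, x in zip(gens[1:], vec):
        c = c * pow(g, x % Q, P) % P
    return c

# Homomorphism: Com(v1, r1) * Com(v2, r2) = Com(v1 + v2, r1 + r2).
gens = setup(4)
v1, v2 = [1, 2, 3, 4], [5, 6, 7, 8]
lhs = commit(gens, v1, 11) * commit(gens, v2, 13) % P
rhs = commit(gens, [a + b for a, b in zip(v1, v2)], 24)
assert lhs == rhs
```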

3.2 Optimizations

We now highlight optimizations that are pivotal to the scalability of our system:

Mitigating Commitment Overhead: To prove statements over committed values using general zkSNARKs, one typically needs to compute the commitment as part of the arithmetic circuit expressing the computation. This introduces substantial overhead when the amount of data to be committed is large. To avoid this, we use a CP-SNARK and its associated commitment scheme. We instantiate our system using Adaptive-Pinocchio [24] as the CP-SNARK. Adaptive-Pinocchio augments the popular Pinocchio [20] zkSNARK with commit and prove capability. The resulting scheme incurs \(\le 5\%\) overhead in proof generation time over Pinocchio, while verification continues to be efficient (\(\le \mathrm {400\,ms}\)) in practice. We expect similar savings with other CP-SNARK schemes, and thus our constructs are agnostic to the choice of CP-SNARK.

Circuit Decomposition: For some operations, verification is more efficient when decomposed into two or more circuits than when encoded as a monolithic circuit. Let \(C(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w})\) be an arithmetic circuit which checks some property of \((\boldsymbol{x},\boldsymbol{u})\), where \(\boldsymbol{u}\) additionally opens the commitment c. Our decomposition takes the form \(C(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w})\equiv C_1(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_0,\boldsymbol{w}_1)\wedge C_2(\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_0,\boldsymbol{w}_2)\) where \(\boldsymbol{w} = (\boldsymbol{w}_0,\boldsymbol{w}_1,\boldsymbol{w}_2)\) denotes a suitable partition of witness wires. Using a CP-SNARK, we let the prover provide an additional commitment \(c_0\) to the witness wires \(\boldsymbol{w}_0\) which are common to both sub-circuits. In our decompositions, we let \(C_1\) encode a relation that is easily verified by an arithmetic circuit and let \(C_2\) encode a relation which has a substantially cheaper probabilistic verification circuit, i.e., there exists a circuit \(\widetilde{C}_2(\alpha ,\boldsymbol{x},\boldsymbol{u},\boldsymbol{w}_0,\boldsymbol{w}_2)\) which takes an additional random challenge \(\alpha \) and has output identical to \(C_2\) with overwhelming probability (over the random choice of \(\alpha \)). In our constructions, the latter circuit verifies either the simultaneous permutation property or the consistent memory access property, which we introduce below. These are inefficient to check deterministically using arithmetic circuits but admit efficient probabilistic circuits.

3.3 Simultaneous Permutation

We say that tuples \((\boldsymbol{u}_1,\ldots , \boldsymbol{u}_k)\) and \((\boldsymbol{v}_1,\ldots ,\boldsymbol{v}_k)\) of vectors in \(\mathbb {F}^N\) satisfy the simultaneous permutation relation if there exists a permutation \(\sigma \) of [N] such that \(\boldsymbol{v}_i=\sigma (\boldsymbol{u}_i)\) for all \(i\in [k]\). We now describe a protocol to check the relation over committed vectors: i.e., given commitments \(\textsf{cu}_1,\ldots ,\textsf{cu}_k,\textsf{cv}_1,\ldots ,\textsf{cv}_k\), the prover shows knowledge of vectors \(\boldsymbol{u}_1,\ldots ,\boldsymbol{u}_k\) and \(\boldsymbol{v}_1,\ldots ,\boldsymbol{v}_k\) corresponding to the commitments which satisfy the relation. To achieve this, the verifier first sends a challenge \(\beta _1,\ldots ,\beta _k\) and challenges the prover to show that the \(\beta \)-linear combinations of the vectors, \(\boldsymbol{u}=\sum _{i=1}^k\beta _i\boldsymbol{u}_i\) and \(\boldsymbol{v}=\sum _{i=1}^k\beta _i\boldsymbol{v}_i\), corresponding to the commitments \(\textsf{cu}=\sum _{i=1}^k\beta _i\textsf{cu}_i\), \(\textsf{cv}=\sum _{i=1}^k\beta _i\textsf{cv}_i\), are permutations of each other. This is accomplished via a further challenge \(\alpha \leftarrow \mathbb {F}\) and subsequently checking \(\prod _{i=1}^N (\alpha -\boldsymbol{u}[\,i\,])=\prod _{i=1}^N (\alpha - \boldsymbol{v}[\,i\,])\). We describe the formal protocol and its analysis in Appendix C.1. The last computation can be expressed in an arithmetic circuit using O(N) multiplication gates, which is concretely more efficient than deterministic circuits for checking the permutation relation using routing networks [6, 26].
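Stripped of commitments, the probabilistic core of this check can be sketched in plaintext Python over a toy prime field (the modulus and the verifier's randomness below are illustrative):

```python
import random

P = 2**61 - 1   # toy prime field modulus

def grand_product(vec, alpha):
    """prod_i (alpha - vec[i]) mod P."""
    acc = 1
    for v in vec:
        acc = acc * ((alpha - v) % P) % P
    return acc

def simultaneous_permutation_check(us, vs, rng=random):
    """Probabilistic check that one permutation sigma maps every u_i to v_i."""
    k, n = len(us), len(us[0])
    betas = [rng.randrange(P) for _ in range(k)]
    # beta-linear combination collapses the k vector pairs into a single pair
    u = [sum(b * uj[i] for b, uj in zip(betas, us)) % P for i in range(n)]
    v = [sum(b * vj[i] for b, vj in zip(betas, vs)) % P for i in range(n)]
    alpha = rng.randrange(P)
    return grand_product(u, alpha) == grand_product(v, alpha)

# Example: the same permutation applied to two vectors.
u1, u2 = [3, 1, 4, 1], [5, 9, 2, 6]
sigma = [2, 0, 3, 1]
v1, v2 = [u1[i] for i in sigma], [u2[i] for i in sigma]
assert simultaneous_permutation_check([u1, u2], [v1, v2])
```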

Table 3. Comparison of Circuit Complexity for different ROM approaches. ZK and CP denote zkSNARK and CP-SNARK protocols. m and n denote number of reads and memory size respectively.

3.4 Consistent Memory Access

We define the consistent memory access relation for a triple of vectors \(\boldsymbol{L},\boldsymbol{U}\) and \(\boldsymbol{V}\) where \(\boldsymbol{L}\in \mathbb {F}^n\) and \(\boldsymbol{U},\boldsymbol{V}\in \mathbb {F}^m\) for some integers m, n. We say that \((\boldsymbol{L},\boldsymbol{U},\boldsymbol{V})\) satisfy the relation if \(\boldsymbol{V}[\,i\,] = \boldsymbol{L}[\,\boldsymbol{U}[i]\,]\) for all \(i\in [m]\). We think of \(\boldsymbol{L}\) as a read only memory (ROM) which is accessed at locations given by \(\boldsymbol{U}\), with \(\boldsymbol{V}\) being the corresponding values. We adapt the techniques in [3, 5, 25, 31] to take advantage of CP-SNARKs in our construction. Next, we present a protocol to check the relation given commitments to \(\boldsymbol{L},\boldsymbol{U}\) and \(\boldsymbol{V}\). The verification proceeds as:

  1. First, \(m+n\) sized vectors \(\boldsymbol{u}\) and \(\boldsymbol{v}\) are computed as follows. For the vector \(\boldsymbol{u}\) we require \(\boldsymbol{u}[\,i\,]=i\) for \(i\in [n]\) and \(\boldsymbol{u}[\,i+n\,]=\boldsymbol{U}[\,i\,]\) for \(i\in [m]\). For the vector \(\boldsymbol{v}\) we require \(\boldsymbol{v}[\,i\,]=\boldsymbol{L}[\,i\,]\) for \(i\in [n]\) and \(\boldsymbol{v}[\,i+n\,]=\boldsymbol{V}[\,i\,]\) for \(i\in [m]\) (see Fig. 3).

  2. The prover also supplies auxiliary vectors \(\tilde{\boldsymbol{u}}\) and \(\tilde{\boldsymbol{v}}\) of size \(m+n\), where \(\tilde{\boldsymbol{u}}\) and \(\tilde{\boldsymbol{v}}\) are purportedly obtained from \(\boldsymbol{u}\) and \(\boldsymbol{v}\) via the same permutation.

  3. Finally, we ensure that the vector \(\tilde{\boldsymbol{u}}\) is sorted and that the vector \(\tilde{\boldsymbol{v}}\) differs in adjacent positions only if the same is true for those positions in the vector \(\tilde{\boldsymbol{u}}\).

The constraints on the first n entries of vectors \(\boldsymbol{u}\) and \(\boldsymbol{v}\) in step (1) can be thought of as “loading” constraints that load the entries of \(\boldsymbol{L}\) against the corresponding addresses in memory, while the constraints on the last m entries can be thought of as “fetching” constraints that fetch the appropriate value against the specified memory location. Steps (2) and (3) ensure that the value fetched for a given location is the same as the value loaded against it during the initial loading steps. We decompose the above checks across two circuits. The first arithmetic circuit \(\textsf{C}_{\textsf{ROM}, {m},{n}}\) ensures steps (1) and (3), while the second circuit checks that the vectors \(\tilde{\boldsymbol{u}},\tilde{\boldsymbol{v}}\) are obtained by applying the same permutation to the vectors \(\boldsymbol{u},\boldsymbol{v}\) respectively. The circuit \(\textsf{C}_{\textsf{ROM}, {m},{n}}\) can be realized using \(O(m+n)\) multiplication gates. Generally, verifying that a vector such as \(\tilde{\boldsymbol{u}}\) is sorted in step (3) incurs a logarithmic overhead due to the need for bit decomposition of each element. However, we can leverage the fact that \(\tilde{\boldsymbol{u}}\) is a (sorted) rearrangement of \(\boldsymbol{u}\), which includes all elements of [n] by construction. Thus, monotonicity of \(\tilde{\boldsymbol{u}}\) is established provided (i) \(\tilde{\boldsymbol{u}}[\,1\,]=1\), (ii) \(\tilde{\boldsymbol{u}}[\,m+n\,] = n\) and (iii) \(\tilde{\boldsymbol{u}}[\,i+1\,]-\tilde{\boldsymbol{u}}[\,i\,]\in \{0,1\}\) for all \(1\le i\le m+n-1\), which together require \(O(m+n)\) gates to verify. Finally, we invoke the protocol for the “Simultaneous Permutation” property in Sect. 3.3 to check compliance of step (2). We illustrate the verification circuit and the decomposition in Fig. 3. The formal protocol and analysis appear in Appendix C.2. Overall we incur \(O(m+n)\) gates, which is more efficient than encoding the entire relation in one circuit; in that case one uses routing networks, which incur \(O((m+n)\log (m+n))\) gates and are concretely much more expensive. We can optimize further when the same access pattern is used for accessing different ROMs, as described below.
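A plaintext analogue of this check, with the prover's permutation replaced by an explicit sort and with commitments omitted, can be sketched as follows (addresses are 1-indexed as in the protocol):

```python
def consistent_memory_access_check(L, U, V):
    """Plaintext analogue of the ROM-consistency check.

    L: read-only memory of size n (address a holds L[a-1])
    U: access pattern of size m, with addresses in [n]
    V: claimed values, purportedly V[i] = L[U[i]-1]
    """
    n, m = len(L), len(U)
    # Step 1: "loading" pairs (address, value) followed by "fetching" pairs.
    u = list(range(1, n + 1)) + list(U)
    v = list(L) + list(V)
    # Step 2: the prover supplies (u~, v~), the same permutation of (u, v);
    # here we obtain it by a stable sort on the address component.
    pairs = sorted(zip(u, v), key=lambda p: p[0])
    u_t = [a for a, _ in pairs]
    v_t = [b for _, b in pairs]
    # Step 3: addresses increase in unit steps, and values change only
    # where the address changes.
    if u_t[0] != 1 or u_t[-1] != n:
        return False
    for i in range(m + n - 1):
        if u_t[i + 1] - u_t[i] not in (0, 1):
            return False
        if u_t[i + 1] == u_t[i] and v_t[i + 1] != v_t[i]:
            return False
    return True

# Example: memory of size 4 accessed three times.
L_mem = [10, 20, 30, 40]
U_acc = [3, 1, 3]
V_val = [L_mem[a - 1] for a in U_acc]   # [30, 10, 30]
assert consistent_memory_access_check(L_mem, U_acc, V_val)
```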

Multiplexed Memory Access. For an access pattern \(\boldsymbol{U}\in \mathbb {F}^m\) and ROMs \(\boldsymbol{L}_j\in \mathbb {F}^n\) for \(j\in [k]\), we can show the correctness of the lookup values \(\boldsymbol{V}_j[\,i\,] = \boldsymbol{L}_j[\,\boldsymbol{U}[i]\,]\), \(i\in [m], j\in [k]\), using just one instance of the protocol discussed in this section. To achieve this, the verifier sends a random challenge \(\alpha _1,\ldots ,\alpha _k\) to the prover. The prover then shows that \((\boldsymbol{L},\boldsymbol{U},\boldsymbol{V})\) satisfy consistent memory access where \(\boldsymbol{L}=\alpha _1\boldsymbol{L}_1+\cdots +\alpha _k\boldsymbol{L}_k\) and \(\boldsymbol{V}=\alpha _1\boldsymbol{V}_1+\cdots +\alpha _k\boldsymbol{V}_k\) for the uniformly sampled \(\alpha _1,\ldots ,\alpha _k\). Note that due to the homomorphism of the commitment scheme, both the prover and the verifier can compute the commitments for \(\boldsymbol{L},\boldsymbol{U}\) and \(\boldsymbol{V}\).
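Reusing the consistent_memory_access_check sketch above, the random-combination reduction can be illustrated as follows (again a plaintext sketch with an illustrative field modulus):

```python
import random

P = 2**61 - 1   # toy prime field modulus

def multiplexed_rom_check(Ls, Vs, U, rng=random):
    """Check V_j[i] = L_j[U[i]] for all j via one combined consistency check."""
    alphas = [rng.randrange(P) for _ in Ls]
    n, m = len(Ls[0]), len(U)
    L = [sum(a * Lj[i] for a, Lj in zip(alphas, Ls)) % P for i in range(n)]
    V = [sum(a * Vj[i] for a, Vj in zip(alphas, Vs)) % P for i in range(m)]
    return consistent_memory_access_check(L, U, V)

# Example: two ROMs of size 3 with a shared access pattern.
L1, L2 = [10, 20, 30], [7, 8, 9]
U = [2, 2, 3]
V1, V2 = [L1[a - 1] for a in U], [L2[a - 1] for a in U]
assert multiplexed_rom_check([L1, L2], [V1, V2], U)
```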

3.5 Our Techniques in Perspective

Commit and prove functionality in conjunction with zero knowledge proofs has been used in recent works addressing privacy in machine learning, most notably in [18, 27, 28]. In [18] and [28], CP-SNARKs are used to “link” proofs of correctness for different parts of the circuit (similar to the circuit decomposition in our setting) to prove inference from a private neural network and a decision tree respectively. In [27], public commitments are linked to a set of authenticated inputs between a prover and a verifier in a two party protocol. Subsequently, the prover produces a ZK proof showing correctness of neural network inference over the authenticated inputs. In contrast, our usage of CP-SNARKs is more pervasive. We first optimize key relations (simultaneous permutation, consistent memory access) for CP-SNARKs and then design our dataset representation in a way that allows us to express operations on datasets in terms of the aforementioned relations.

Fig. 3. Consistent memory access.

4 Privacy Preserving Dataset Operations

We now describe protocols for common dataset operations such as aggregation, filter, order-by and inner-join. These operations serve to illustrate our key techniques, which can be further applied to yield protocols for a much more comprehensive list of dataset operations. We use the fact that most of the operations distribute nicely as identical computations over different pairs of columns. Throughout this section, N denotes the upper bound on the sizes of input/output datasets.

Aggregation: The aggregation operation takes two datasets as inputs and outputs their row-wise concatenation. We first describe an arithmetic circuit to verify the concatenation of vectors. The circuit accepts three vectors in the uniform representation discussed in Sect. 3.1. Let \(\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}\) be three vectors of size at most N represented as \(\llbracket s, \boldsymbol{X}\rrbracket \), \(\llbracket t, \boldsymbol{Y}\rrbracket \) and \(\llbracket w, \boldsymbol{Z}\rrbracket \) respectively, where \(\boldsymbol{X},\boldsymbol{Y},\boldsymbol{Z}\) are vectors of size N. The verification involves ensuring that the first w entries of \(\boldsymbol{Z}\) contain the first s entries of \(\boldsymbol{X}\) and the first t entries of \(\boldsymbol{Y}\). Figure 4 illustrates the setting for \(s=3\), \(t=4\), \(w=7\) and \(N=9\). To aid the verification, the prover provides N-length binary vectors \(\boldsymbol{\rho }_s,\boldsymbol{\rho }_t\) and \(\boldsymbol{\rho }_w\) as auxiliary inputs. The vector \(\boldsymbol{\rho }_s\) is 1 in its first s entries and 0 elsewhere; \(\boldsymbol{\rho }_t\) and \(\boldsymbol{\rho }_w\) satisfy the analogous relations. The correctness of aggregation now reduces to showing that there is a permutation that simultaneously maps \(\llbracket \boldsymbol{\rho }_s, \boldsymbol{\rho }_t\rrbracket \) to \(\llbracket \boldsymbol{\rho }_w, \boldsymbol{0}\rrbracket \) and \(\llbracket \boldsymbol{X}, \boldsymbol{Y}\rrbracket \) to \(\llbracket \boldsymbol{Z}, \boldsymbol{0}\rrbracket \). Figure 4 also shows how the verification is decomposed: the first circuit checks that (i) \(w=s+t\), (ii) the vectors \(\boldsymbol{\rho }_s,\boldsymbol{\rho }_t,\boldsymbol{\rho }_w\) are correctly provided and (iii) \(\boldsymbol{u}_1 = \llbracket \boldsymbol{\rho }_s, \boldsymbol{\rho }_t\rrbracket \), \(\boldsymbol{v}_1= \llbracket \boldsymbol{X}, \boldsymbol{Y}\rrbracket \), \(\boldsymbol{u}_2=\llbracket \boldsymbol{\rho }_w, \boldsymbol{0}\rrbracket \) and \(\boldsymbol{v}_2= \llbracket \boldsymbol{Z}, \boldsymbol{0}\rrbracket \). The second circuit checks the “simultaneous permutation” property on the pairs \((\boldsymbol{u}_1,\boldsymbol{v}_1)\) and \((\boldsymbol{u}_2,\boldsymbol{v}_2)\). Both circuits can be realized using O(N) multiplication gates. Using a CP-SNARK we can verify the correctness of aggregation of vectors over commitments.
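In plaintext, the witness construction and the multiset condition that the simultaneous-permutation check enforces can be sketched as follows. This is a sketch only; the actual protocol works over commitments and uses the probabilistic circuit of Sect. 3.3 instead of an explicit multiset comparison.

```python
from collections import Counter

def indicator(s, N):
    """rho vector: 1 in the first s entries, 0 elsewhere."""
    return [1] * s + [0] * (N - s)

def check_vector_concatenation(X, s, Y, t, Z, w, N):
    """Plaintext analogue of the aggregation check on padded vectors."""
    if w != s + t:
        return False
    rho_s, rho_t, rho_w = indicator(s, N), indicator(t, N), indicator(w, N)
    u1, v1 = rho_s + rho_t, X + Y            # [[rho_s, rho_t]], [[X, Y]]
    u2, v2 = rho_w + [0] * N, Z + [0] * N    # [[rho_w, 0]],     [[Z, 0]]
    # A simultaneous permutation mapping (u1, v1) to (u2, v2) exists iff the
    # multisets of (u, v) pairs agree; the protocol certifies this via Sect. 3.3.
    return Counter(zip(u1, v1)) == Counter(zip(u2, v2))

# Fig. 4 setting: s = 3, t = 4, w = 7, N = 9.
N = 9
X = [1, 2, 3] + [0] * 6
Y = [4, 5, 6, 7] + [0] * 5
Z = [1, 2, 3, 4, 5, 6, 7] + [0] * 2
assert check_vector_concatenation(X, 3, Y, 4, Z, 7, N)
```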

We now leverage the above construction to verify the aggregation operation over datasets. Let \(D_x, D_y\) and \(D_z\) be datasets, each with k columns given by \((\boldsymbol{x}_i)_{i=1}^k, (\boldsymbol{y}_i)_{i=1}^k\) and \((\boldsymbol{z}_i)_{i=1}^k\) respectively. The reduction involves the verifier sampling random \(\alpha _1,\ldots ,\alpha _k\) satisfying \(\alpha _1+\cdots +\alpha _k=1\). Next, we use the above circuit construction with a CP-SNARK to prove that the vectors \(\boldsymbol{x}=\sum _{i=1}^k\alpha _i\boldsymbol{x}_i\), \(\boldsymbol{y}=\sum _{i=1}^k\alpha _i\boldsymbol{y}_i\) and \(\boldsymbol{z}=\sum _{i=1}^k\alpha _i\boldsymbol{z}_i\) satisfy the concatenation property. We give the complete protocol and proof of the reduction in Appendix C.3.

Filter: The filter operation takes a dataset and a selection predicate as inputs and outputs the dataset consisting of the subset of rows satisfying the predicate. We divide the computation into two parts: (i) applying the selection predicate to the rows of the dataset to obtain a binary vector \(\boldsymbol{f}\), which we call the selection vector, and (ii) applying the selection vector to the source dataset to obtain the target dataset. The latter computation can be verified with techniques similar to those used in the aggregation operation. For the first computation, we describe an efficient circuit for predicates of the form \(\wedge _{i=1}^k (\boldsymbol{x}_i==v_i)\) where \(\boldsymbol{x}_1,\ldots ,\boldsymbol{x}_k\) are the columns of the dataset. Once again the verifier chooses random \(\alpha _1,\ldots ,\alpha _k\) with \(\sum _{i=1}^k \alpha _i=1\) and challenges the prover to show that the selection vector \(\boldsymbol{f}\) satisfies \(\boldsymbol{f}=(\boldsymbol{x}==v)\) where \(\boldsymbol{x}=\sum _{i=1}^k \alpha _i\boldsymbol{x}_i\) and \(v=\sum _{i=1}^k \alpha _iv_i\). The relation \(\boldsymbol{f}=(\boldsymbol{x}==v)\) can be verified using a circuit with O(N) gates. Due to the homomorphism of the commitment scheme, the verifier can compute the commitment for the vector \(\boldsymbol{x}\) given the commitments to the columns of the dataset. For more general range queries of the form \(\wedge _{i=1}^k (\ell _i < \boldsymbol{x}_i \le r_i)\), we can compute a selection vector \(\boldsymbol{f}_i\) for each column, and then compute the final selection vector \(\boldsymbol{f}=\wedge _{i=1}^k \boldsymbol{f}_i\).
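The random-combination trick for equality predicates admits a short plaintext sketch (the field modulus and column values below are illustrative); with overwhelming probability over the verifier's randomness the combined check agrees with the conjunction of per-column checks:

```python
import random

P = 2**61 - 1   # toy prime field modulus

def selection_vector_check(columns, values, f, rng=random):
    """Check f[i] = 1 iff row i satisfies AND_j (columns[j][i] == values[j]),
    using one random linear combination with coefficients summing to 1."""
    k, N = len(columns), len(columns[0])
    alphas = [rng.randrange(P) for _ in range(k - 1)]
    alphas.append((1 - sum(alphas)) % P)          # enforce sum(alphas) = 1
    x = [sum(a * col[i] for a, col in zip(alphas, columns)) % P for i in range(N)]
    v = sum(a * val for a, val in zip(alphas, values)) % P
    return all(f[i] == int(x[i] == v) for i in range(N))

# Example: keep rows where col0 == 2 and col1 == 7.
col0, col1 = [2, 5, 2, 2], [7, 7, 1, 7]
assert selection_vector_check([col0, col1], [2, 7], f=[1, 0, 0, 1])
```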

Fig. 4. Circuit for verifying vector concatenation.

Order By: The order-by operation permutes the rows of the dataset so that a specified column is in sorted order. The verification can be naturally expressed as the columns of the source and target datasets satisfying the simultaneous permutation relation, where additionally the specified column of the target is sorted. We can check the monotonicity of a column using a circuit with O(bN) gates, where b is the bit-width of the range of values in the column. We skip the details.

Inner-Join: The inner-join operation concatenates pairs of rows of the input datasets which have identical values in the designated columns (joining columns). We consider the inner-join operation under the restriction that the joining columns have distinct values. As a first step, we order both input datasets so that the joining columns are sorted. We can use the verification protocol for the order-by operation to ensure correctness of this step. We therefore assume that the joining columns are sorted and take distinct values. Let \(D_1\) and \(D_2\) be two datasets which are joined on columns \(\boldsymbol{x}\) and \(\boldsymbol{y}\) to yield the dataset D. We write D as the juxtaposition of columns \([D^{'}_1,\boldsymbol{z},D^{'}_2]\) where \(D^{'}_i\) denotes the columns coming from \(D_i\) while \(\boldsymbol{z}\) denotes the column obtained as the intersection of \(\boldsymbol{x}\) and \(\boldsymbol{y}\). We first design a sub-circuit for private set intersection (PSI) to compute the size w of the resulting dataset. We then let the prover provide auxiliary selection vectors \(\boldsymbol{f}_1\) and \(\boldsymbol{f}_2\) of size w. Finally, using the circuit for the filter relation, we verify that \(\boldsymbol{f}_1\) applied to \(D_1\) yields the dataset \(D_L = [D^{'}_1,\boldsymbol{z}]\) and \(\boldsymbol{f}_2\) applied to \(D_2\) yields the dataset \(D_R=[D^{'}_2,\boldsymbol{z}]\). The overall circuit complexity is O(bN), where b is the bit-width of the range of values in \(\boldsymbol{x}\) and \(\boldsymbol{y}\), with the set-intersection computation dominating the overall cost.

5 Privacy Preserving Model Inference: Decision Trees

In this section we present a zero knowledge protocol for verifiable inference from decision trees (and random forests). Decision trees are popular models in machine learning due to their interpretability. A decision tree recursively partitions the feature space (arranged as a tree), and finally assigns a label to each leaf segment. The problem of proving correct inference from a decision tree was considered recently in [28], where the authors present a privacy preserving method for an adversary to commit to a decision tree and later prove inference from the tree on public test data. We present a new construction based on consistent memory access, which improves upon the prior construction by reducing the number of multiplication gates in the inference circuit. We also provide a zero knowledge protocol for establishing the accuracy of a decision tree on test data. We consider variants with the test data being public or private. The latter scenario is helpful when verifying the performance of a private model on a reputationally trusted private dataset.

Decision Tree Representation: We parameterize a binary decision tree with the following parameters: the maximum number of nodes (N), the maximum length of a decision path (h) and the maximum number of features used as predictors (d). We assume that the nodes in the decision tree have unique identifiers from the set [N], while features are identified using indices in the set [d]. We naturally represent a decision tree \(\mathcal {T}\) as a lookup table with five columns, i.e., \(\mathcal {T}=(\boldsymbol{V},\boldsymbol{T},\boldsymbol{L},\boldsymbol{R},\boldsymbol{C})\), where each column vector is of size N. For a decision tree with \(t\le N\) nodes, we encode it as follows: for \(i\in [t]\):

  • \(\boldsymbol{V}[\,i\,]\) denotes the identifier for the splitting feature for \(i^{th}\) node.

  • \(\boldsymbol{T}[\,i\,]\) denotes the threshold value for the splitting feature for \(i^{th}\) node.

  • \(\boldsymbol{L}[\,i\,]\) and \(\boldsymbol{R}[\,i\,]\) denote the identifiers for the left and right child of \(i^{th}\) node. In case of a leaf node, this value is set to i itself.

  • \(\boldsymbol{C}[\,i\,]\) denotes the label associated with the \(i^{th}\) node, when it is a leaf node. For non-leaf nodes this may be set arbitrarily.

We commit to a decision tree, by committing to each of the vectors. We define \(\textsf{cm}_{\mathcal {T}} = (\textsf{cm}_V,\textsf{cm}_T,\textsf{cm}_L,\textsf{cm}_R, \textsf{cm}_C)\) as the commitment to \(\mathcal {T}\).

Decision Tree Inference: We model the test data D as an \(n\times d\) matrix, consisting of n d-dimensional samples. Let \(\boldsymbol{D}\) be the vector of size dn obtained by flattening D in row major order. The algorithm below computes decision paths \(\boldsymbol{p}_i = (\boldsymbol{p}_i[\,1\,],\ldots ,\boldsymbol{p}_i[\,h\,])\) for each sample \(i\in [n]\); a plaintext sketch follows the listing. The prediction vector \(\boldsymbol{q}\) contains the class labels corresponding to the leaf nodes \(\boldsymbol{p}_i[\,h\,]\) for \(i\in [n]\).

  1. For \(i=1,\ldots ,n\) do:

    • Set \(\boldsymbol{p}_i[\,1\,]=1\): the root is the first node on every decision path.

    • For \(j=1,\ldots ,h\) determine the next node as follows:

      (a) Compute splitting feature: \(\boldsymbol{f}_i[\,j\,]=\boldsymbol{V}[\,\boldsymbol{p}_i[j]\,]\).

      (b) Compute threshold value: \(\boldsymbol{t}_i[\,j\,]=\boldsymbol{T}[\,\boldsymbol{p}_i[j]\,]\).

      (c) Compute left and right child ids: \(\boldsymbol{l}_i[\,j\,]=\boldsymbol{L}[\,\boldsymbol{p}_i[j]\,]\), \(\boldsymbol{r}_i[\,j\,]=\boldsymbol{R}[\,\boldsymbol{p}_i[j]\,]\).

      (d) Compute label: \(\boldsymbol{c}_i[\,j\,]=\boldsymbol{C}[\,\boldsymbol{p}_i[j]\,]\).

      (e) Compute \(\hat{\boldsymbol{f}}_i[\,j\,] = d*i + \boldsymbol{f}_i[\,j\,]\).

      (f) Compute value of splitting feature: \(\boldsymbol{v}_i[\,j\,]=D[\,i,\boldsymbol{f}_i[j]\,]=\boldsymbol{D}[\,\hat{\boldsymbol{f}}_i[j]\,]\).

      (g) Compute next node: \(\boldsymbol{p}_i[\,j+1\,] = \boldsymbol{l}_i[\,j\,]\) if \(\boldsymbol{v}_i[\,j\,]\le \boldsymbol{t}_i[\,j\,]\), and \(\boldsymbol{r}_i[\,j\,]\) otherwise.

    • Compute label for the sample: \(\boldsymbol{q}[\,i\,]=\boldsymbol{c}_i[\,h\,]\).
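As a plaintext reference (no commitments or proofs), the table-based inference loop can be sketched as follows; indices here are 0-based, unlike the 1-based convention of the listing above, and the tiny example tree is ours.

```python
from dataclasses import dataclass

@dataclass
class TreeTable:
    """Lookup-table view of a decision tree (0-indexed node ids)."""
    V: list   # splitting feature id per node
    T: list   # threshold per node
    L: list   # left child id (set to the node itself for leaves)
    R: list   # right child id (set to the node itself for leaves)
    C: list   # class label (meaningful for leaf nodes)

def infer(tree: TreeTable, D: list, h: int) -> list:
    """Predict a label for each row of D by following h table lookups."""
    predictions = []
    for row in D:
        node = 0                                   # root starts every path
        for _ in range(h):                         # steps (a)-(g) of the listing
            feat, thr = tree.V[node], tree.T[node]
            node = tree.L[node] if row[feat] <= thr else tree.R[node]
        predictions.append(tree.C[node])           # label at the reached leaf
    return predictions

# Tiny example: "x[0] <= 5 -> label 0, else label 1"; leaves self-loop.
tree = TreeTable(V=[0, 0, 0], T=[5, 0, 0], L=[1, 1, 2], R=[2, 1, 2], C=[None, 0, 1])
assert infer(tree, [[3], [9]], h=2) == [0, 1]
```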

Verification of the above algorithm involves verifying (i) hn memory accesses on the tables of \(\mathcal {T}\) in steps (a)-(d), which share the access pattern \(\boldsymbol{p}_i[\,j\,]\), (ii) hn memory accesses on \(\boldsymbol{D}\) (of size dn) in step (f), and (iii) hn comparisons as part of step (g). Using the optimization in Sect. 3.4, the first verification incurs \(O(N+hn)\) multiplication gates, while the second verification incurs \(O(dn + hn)\) multiplication gates. Using standard techniques, verification of (iii) can be performed using O(whn) multiplication gates, where w is the bit-width of feature values. Thus, the overall circuit complexity of our solution is \(O(N + n(d + h + wh))\). We compare our solution with the method for zero-knowledge decision tree (zkDT) inference presented in [28]. Broadly, the method in [28] establishes the correctness of inference via three checks:

  • Consistency of input decision tree with public commitment: This involves O(N) evaluations of the hash function \(\mathcal {H}\) used for commitment and thus incurs \(c(\mathcal {H})\cdot N\) multiplication gates. Here \(c(\mathcal {H})\) denotes the size of circuit required to evaluate \(\mathcal {H}\).

  • Consistency of feature vector with decision path: The verification of this step leverages a “Multiset Check” ([28, Section 4.1]) which costs \(O(d\log h)\) multiplication gates per sample.

  • Correct evaluation of decision tree function: This involves h comparisons for each sample, which incur hw multiplication gates, where w is the bit-width of feature values.

The above steps result in an overall circuit complexity of \(c(\mathcal {H})N + n(3d\log h + hw)\) for zkDT. Our solution improves upon the approach in [28] by reducing the cost of the first two checks. Using a CP-SNARK, we avoid the cost of computing the commitment within the verification circuit, while our optimized protocols for memory access allow us to accomplish the second check with an average cost of \(O(h + d)\) gates per sample (\(O(dn+hn)\) overall), which compares favorably with the per-sample cost of \(O(d\log h)\) incurred by zkDT for \(h=\varTheta (d)\). The concrete improvement obtained using our approach depends on which of the three checks dominates the cost for specific parameter settings. We compare the cost of the two approaches for some representative parameter settings in Table 5.

Decision Tree Accuracy: The above circuit for decision tree inference can be easily modified to yield a circuit for proving the accuracy of a decision tree on test data. In this case, the prediction vector is kept private and tallied against the ground truth to compute the accuracy. Since our system also includes verifiability of model performance (accuracy) on private benchmark datasets, we briefly describe the modifications required to achieve the same. Let D be a private dataset with columns \((\boldsymbol{x}_1,\ldots ,\boldsymbol{x}_d)\) whose column commitments are public. Since we can no longer compute the flattened vector \(\boldsymbol{D}\) as before, we cannot verify the lookup \(\boldsymbol{v}_i[\,j\,]=\boldsymbol{D}[\,\hat{\boldsymbol{f}}_i[j]\,]\). Instead we use polynomial interpolation to pre-process D. For the \(i^{th}\) row \(D[i,\cdot ]\) of the original data (a vector of size d), we interpolate a polynomial \(p_i\) of degree \(d-1\) such that \(p_i(j)=D[i,j]\). We obtain the pre-processed dataset \(D'\) whose \(i^{th}\) row consists of the coefficients of \(p_i\). The data owner makes a commitment to \(D'\) instead of D. The lookup \(\boldsymbol{v}_i[\,j\,]=D'[i,j]=p_i(j)\) now involves evaluating a degree \(d-1\) polynomial, which incurs d multiplication gates. The overall circuit complexity for accuracy over private datasets is therefore \(O(N + hn + hnw + hnd)\).
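A minimal sketch of this interpolation-based pre-processing over a toy prime field, using Lagrange interpolation (the modulus is illustrative; a production system would use the CP-SNARK's field):

```python
P = 2**61 - 1   # toy prime field modulus

def poly_mul_linear(coeffs, k):
    """Multiply a polynomial (low-to-high coeffs) by (x - k) mod P."""
    out = [0] * (len(coeffs) + 1)
    for i, c in enumerate(coeffs):
        out[i] = (out[i] - k * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out

def interpolate_row(row):
    """Coefficients (low-to-high) of p with p(j) = row[j-1] for j = 1..d, mod P."""
    d = len(row)
    result = [0] * d
    for j, y in enumerate(row, start=1):
        basis, denom = [1], 1
        for k in range(1, d + 1):
            if k != j:
                basis = poly_mul_linear(basis, k)
                denom = denom * (j - k) % P
        scale = y * pow(denom, P - 2, P) % P      # field division by denom
        result = [(r + scale * b) % P for r, b in zip(result, basis)]
    return result

def evaluate(coeffs, x):
    """Horner evaluation of p(x) mod P, costing d multiplications."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

# Each private lookup D[i, f] becomes the polynomial evaluation p_i(f).
row = [17, 4, 42, 8]                     # i-th row of the private dataset
coeffs = interpolate_row(row)
assert all(evaluate(coeffs, j) == row[j - 1] for j in range(1, 5))
```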

Table 4. Measuring the efficacy of our optimizations on 100K\(\times \) 10 datasets.
Table 5. Comparison of Circuit Complexity for decision tree inference.

6 Experimental Evaluation

In this section we report the concrete performance of our system primitives. For our implementation, we used Adaptive Pinocchio [24] as the underlying CP-SNARK, which we implemented using the libsnark [17] library. We also used the libsnark library for our circuit descriptions. Our experiments were performed on Ubuntu Linux 18.04 cloud instances with 8 Intel Xeon 2.10 GHz virtual CPUs and 32 GB of RAM. The experiments were run with finite field arithmetic and FFT libraries compiled to exploit multiple cores. We often use circuit complexity (the number of multiplication gates in the circuit) as the “environment-neutral” metric for comparing different approaches (proving times scale quasi-linearly with circuit complexity).

Performance of Dataset Operations: Table 1 contains a summary of the asymptotic as well as concrete efficiency of our dataset operations. All the operations scale linearly with the number of rows (with a marginal additive dependence on the number of columns). The numbers for proof generation and verification were generated for a representative dataset size of \(100K\times 10\). While proof generation is an expensive operation by general standards, it is practical enough for infrequent usage. We also tabulate the efficacy of our optimizations in Table 4. For the unoptimized case, we do not use CP-SNARKs and instead compute commitments using the circuit-friendly MiMC hash [1]. For the partially optimized case, we use the native commitment scheme of the CP-SNARK, but use monolithic circuits to encode the operations. To express permutations in monolithic circuits, we use the gadgets for routing networks [6, 26] available in [17]. The fully optimized version delegates permutation checking and memory access checking to probabilistic circuits, as discussed in Sect. 3.2. In the first case, hashing dominates the circuit complexity, resulting in 50–100 times larger circuits. Decomposing the circuits, instead of using monolithic circuits, also yields an order of magnitude savings.

Table 6. Concrete proving and verification time for decision tree inference.
Table 7. Circuit Complexity for decision tree accuracy for public and private benchmark datasets.

Performance of Decision Tree Inference: We use two decision trees T1 and T2 to benchmark the performance of our decision tree inference implementation. We also use the same trees to compare our method with the one presented in [28]. We synthetically generate the tree T1 with 1000 nodes, 50 features and depth 20, which roughly corresponds to the largest tree used in [28]. The tree T2 is trained on a curated version of the dataset [12] for Home Mortgage Approval. We identify 35 features from the dataset to train a binary decision tree. We train T2 with 10000 nodes and depth 25. We verify the inference from the two trees for batch sizes of 100 (small), 1000 (medium) and 10000 (large). Using our method, generating a proof of predictions takes from a few seconds (on small data) to a few minutes (on large data), as seen in Table 6. The circuit complexity and the proving time scale almost linearly for our method. We also compare the number of multiplication gates incurred by the arithmetic circuits in our method with that in [28] in Table 5. Our efficiency is an order of magnitude better for smaller data sizes, as we do not incur the one-time cost of hashing the tree. For larger batch sizes, our method is still about 1.5–\(4\times \) more efficient. As the batch sizes get large, comparisons dominate the circuit complexity in both approaches. We report the circuit complexity for proving the accuracy of decision trees on private and public datasets. Table 7 shows that the overhead for proving accuracy on private datasets ranges from 50–80%.

Performance of Memory Access: We also independently benchmark the performance of our memory abstraction technique and compare it to existing methods in Table 3. Leveraging CP-SNARKs and probabilistic reductions, we essentially incur a constant number of gates per access. We compare the different approaches both in terms of asymptotic complexity and concrete complexity for parameter settings representative of their usage in our work. Our concrete efficiency is an order of magnitude better than the alternatives considered.