1 Introduction

In our digital age, credit cards have become a popular payment instrument. With the increasing popularity of business over the Internet, every business needs to maintain credit card information of its clients in some form. Credit card data theft is considered to be one of the most serious threats to any business. Such a breach amounts not only to a serious financial loss for the business but also to critical damage to the “brand image” of the company in question.

The Payment Card Industry Security Standards Council (PCI SSC), which was founded by the major payment card brands, is an organization responsible for the development and deployment of various best practices for ensuring the security of credit card data. In particular, PCI SSC has developed a standard called the PCI Data Security Standard (PCI DSS) [12], which specifies security mechanisms required to secure payment card data. PCI DSS dictates that organizations that process card payments must protect cardholder data when they store, transmit and process it. The actual requirements specified by PCI DSS are elaborate and complex. To obtain PCI compliance, a merchant needs to provide documentation on the usage and security policies regarding all sensitive information stored in its environment. PCI compliance is considered to be necessary for any business to acquire the confidence of its customers. Moreover, a business that has suffered theft of sensitive information while not being compliant can be subject to hefty fines from the government in some countries.

Traditionally credit card numbers have been used as a primary identifier in many business processes in the merchant sites. We quote from a document by Securosis [19]:

As the standard reference key, credit card numbers are stored in billing, order management, shipping, customer care, business intelligence, and even fraud detection systems. Large retail organizations typically store credit card data in every critical business processing system.

Thus, in merchant sites, credit card numbers are scattered across their environment. This makes it very difficult for a merchant to formulate security policies and provide necessary documentation to obtain PCI compliance.

However, in most systems where credit card numbers are stored, the data itself is not required, and the system would function as well as before if the credit card numbers were replaced by some other information which “looks like” credit card numbers. This observation has led to a paradigm shift in the way the security of credit card numbers is viewed: instead of securing sensitive data wherever it is present, it is easier to remove sensitive data from where it is not required. This basic paradigm has been implemented using tokens. Tokens are numbers that represent credit cards but are unrelated to the real credit card numbers.

There have been numerous industry white papers and similar documents which try to popularize tokenization and discuss possible solutions to the tokenization problem [7, 17–19, 21]. PCI SSC has also formulated its guidelines regarding tokenization [13]. But to our knowledge, a formal cryptographic analysis of the problem has not been done. It is not even clear which basic cryptographic objects should be used, and in what way, to achieve the goals of tokenization.

Small domain encryption One obvious solution for securing credit card numbers in a merchant site is to encrypt them. But as we stated, a typical merchant site depends heavily on the credit card numbers for its functioning. In some cases, it even uses them as the primary customer identifier in its databases. Hence, a strict requirement for applying encryption is that the ciphertext should look like a credit card number, so that using encryption does not require changing the database fields where these numbers are stored. This necessity opened up an interesting problem. A typical credit card number consists of sixteen (or fewer) decimal digits; if this is treated as a binary string, it is about 53 bits long. This is much less than the block size of a typical block cipher (say AES, with 128-bit blocks). Thus, direct application of a block cipher would result in a considerable length expansion, and it would not be possible to encode the ciphertext into sixteen decimal digits.
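The size mismatch described above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the constant names are ours, and the 128-bit block size is that of AES.

```python
PAN_DIGITS = 16        # typical PAN length, as stated in the text
AES_BLOCK_BITS = 128   # block size of AES

# Bits needed to represent every 16-digit decimal string as an integer
# (log2(10^16) is about 53.15, so 54 whole bits are required).
pan_bits = (10 ** PAN_DIGITS - 1).bit_length()

# Decimal digits needed to write down an arbitrary 128-bit ciphertext block.
block_digits = len(str(2 ** AES_BLOCK_BITS - 1))

print(pan_bits)       # 54
print(block_digits)   # 39
```

An AES ciphertext block therefore needs up to 39 decimal digits, more than twice the 16 digits available in a PAN field.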

The general problem was named format-preserving encryption (FPE) by Voltage Security; the term refers to an encryption algorithm which preserves the format of the message. Formally, let \({\mathcal {X}}\) be a message space which contains strings from an arbitrary alphabet satisfying a certain format, and let \({\mathcal {D}}\) and \({\mathcal {K}}\) be finite sets called the tweak space and key space, respectively. Then a format-preserving encryption scheme is defined to be a function \(\mathsf{FP}:{\mathcal {K}} \times {\mathcal {D}} \times {\mathcal {X}} \rightarrow {\mathcal {X}}\), such that for every \(d\in {\mathcal {D}}\) and \(K\in {\mathcal {K}}\), \(\mathsf{FP}_K(d,\cdot ):{\mathcal {X}}\rightarrow {\mathcal {X}}\) is a permutation. In addition, \(\mathsf{FP}\) should provide the security of a tweakable strong pseudorandom permutation (SPRP). Designing such schemes for arbitrary \({\mathcal {X}}\) is a challenging and interesting problem. In particular, given an SPRP on \(\{0,1 \}^n\), designing an SPRP for a message space \(\{0,1\}^t\), where \(t< n\), is difficult. There have been some interesting solutions to this problem, but none of them can be considered to be efficient [1, 2, 5, 8, 11, 20].

A credit card number encrypted by an FPE scheme can act as a token. Such a solution is also provided by Voltage Security [21]. To the best of our knowledge, this is the only solution to the tokenization problem with known cryptographic guarantees. But again, there does not exist a formal security model for tokenization, and it has been contested that a token which is an encryption of the credit card data may not be considered as a safe token as there exists a possibility that the token can be inverted to get the original data [19].

Our contribution We study the problem of tokenization from a cryptographic viewpoint; the main contributions of this work can be summarized as follows. We point out the basic needs for a tokenization system and develop a syntax for the problem. The syntax follows closely the recommendation of the PCI SSC, and it is general enough to accommodate various implementation options. Further, we develop a security model for the problem in line with concrete provable security. We propose three different security notions IND-TKR, IND-TKR-CV, and IND-TKR-KEY, which depend on three different threat models. We amply discuss the adequacy of these new notions of security in practical scenarios.

Finally, we propose some constructions of tokenization systems and prove their security in the proposed security models. We propose three generic constructions, namely TKR1, TKR2 and TKR2a, and discuss how these constructions can be instantiated with existing cryptographic primitives. TKR1 is a construction which just uses a format-preserving encryption to generate tokens. TKR2 and TKR2a are similar, but both are very different from TKR1. Both schemes use a lookup table for tokenization/detokenization operations. In the constructions TKR2 and TKR2a, we demonstrate how the problem of tokenization can be solved both securely and efficiently without using FPE. Both TKR2 and TKR2a use off the shelf cryptographic primitives; in particular, we show how to instantiate them using ordinary block ciphers, stream ciphers supporting initialization vectors (IV) and physical random number generators. We also prove security of our constructions in the proposed security models.

2 Tokenization systems: requirements and PCI DSS guidelines

The basic architecture of a tokenization system is described in Fig. 1. In the diagram, we show three separate environments: the merchant site, the tokenization system and the card issuer. The basic data objects of interest are the primary account number (PAN), which is basically the credit card number, and the token, which represents the PAN. A customer communicates with the merchant environment through the “point of sale,” where the customer provides its PAN. The merchant sends the PAN to the tokenizer and gets back the corresponding token. The merchant may store the token in its environment. At the request of the merchant, the tokenizer can detokenize a token and send the corresponding PAN to the card issuer for payments.

We show the tokenization system to be separated from the merchant environment; this is true in most situations today, as the merchants receive the service of tokenization from a third party. But it is also possible that the merchant itself implements its tokenization solution, and in that case, the tokenization system is a part of the merchant environment.

Fig. 1 Architecture of the tokenization system

As described in [13], a tokenization system has the following components:

  1. A method for token generation A process to create a token corresponding to a primary account number (PAN). In [13], there is no specific recommendation regarding how this process should be implemented. Some of the mentioned options are encryption functions, cryptographic hash functions and random number generators.

  2. A token mapping procedure It refers to the method used to associate a token with a PAN. Such a method would allow the system to recover a PAN, given a token.

  3. Card-vault It is a repository which usually will store pairs of PANs and tokens and maybe some other information required for the token mapping. Since it may contain PANs, it must be specially protected according to the PCI DSS requirements.

  4. Cryptographic key management This module is a set of mechanisms to create, use, manage, store and protect keys used for the protection of PAN data and also data involved in token generation.

The PCI guidelines for tokenization are quite vague (this has been pointed out before in many places including [18]), and it is difficult to make out what properties tokens and tokenization systems should possess for functionality and security. We state two basic requirements for tokens and tokenization systems. We assume that tokenization is provided as a service; thus, multiple merchants utilize the same system for their tokenization needs.

  1. Format preserving The token should have the same format as the PANs, so that the stored PANs can be easily replaced by the tokens in the merchant environment. It has been said that in some scenarios it may be important that the tokens can be easily distinguished from the PANs. For example, most credit card numbers have a Luhn checksum [9] of zero. One can generate tokens that contain the same number of digits as the PAN but have a Luhn checksum of 1. Such a distinguishing criterion may make audits easier.

  2. Uniqueness The token generation method should be deterministic. As stated before, the application software on the merchant side uses the PAN for indexing; thus, the token for a specific PAN should be unique, i.e., if the same PAN is tokenized twice by the same merchant, then the same token should be obtained. Moreover, in a specific merchant environment two different PANs should be represented by different tokens.
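The Luhn-based distinguishing criterion mentioned in the first requirement can be illustrated as follows. This is a minimal sketch using the standard Luhn computation; the helper `with_luhn_checksum` and the sample number (a widely used test card number, not a real account) are our own illustrative choices.

```python
def luhn_checksum(number: str) -> int:
    """Luhn sum modulo 10; valid card numbers yield 0."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10


def with_luhn_checksum(body: str, target: int) -> str:
    """Append the single final digit that makes the Luhn checksum equal target."""
    for last in "0123456789":
        if luhn_checksum(body + last) == target:
            return body + last
    raise ValueError("target must be in 0..9")


pan = with_luhn_checksum("411111111111111", 0)    # looks like a valid PAN
token = with_luhn_checksum("411111111111111", 1)  # same length, checksum 1
print(pan, luhn_checksum(pan))
print(token, luhn_checksum(token))
```

Because the final digit is never doubled, exactly one choice of it yields each checksum value, so forcing the checksum to 1 costs one digit of the token's freedom.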

In addition to the above basic syntactic requirements, a token and the tokenization system should satisfy certain security properties. In order to analyze the security of a tokenization system, we consider three possible threat scenarios. First, we consider a scenario where an adversary only has access to the tokens. Perhaps this is a self-evident situation, given that tokens are designed to be public. In this threat model, we want to guarantee that an adversary is unable to retrieve any information regarding the PAN if he only sees the token. Another possible scenario considers a more powerful adversary who has access not only to the tokens but also to the card-vault. As the card-vault can (possibly) reveal the relation between PANs and tokens, it can be a target for attackers. Clearly if a scheme is secure even if an adversary has access to the card-vault, then it is stronger than the previous one. Finally, we consider an adversary who has access to the tokens and the keys, but not the card-vault. Again, keys can also be an attractive target for attackers, because keys may be involved in token generation. We formally describe in detail these three scenarios in Sect. 5.

3 Cryptographic preliminaries and notations

General notations For a finite alphabet \(\mathsf{AL}\), we denote the set of all strings over \(\mathsf{AL}\) as \(\mathsf{AL}^*\), and the set of all strings of length n over \(\mathsf{AL}\) (i.e., containing n elements of the alphabet \(\mathsf{AL}\)) by \(\mathsf{AL}^n\). Specifically, the set of all n bit strings would be denoted by \(\{0,1\}^n\). For \(Y\in \mathsf{AL}^*\), by \(|Y|_\mathsf{AL}\) we will denote the number of characters in the string Y. If \(\mathsf{AL}=\{0,1\}\), and X is a string over \(\mathsf{AL}\), then we will use |X| to denote the length of X in bits. If A is a finite set, then \(\#A\) will denote the cardinality of A. If X, Y are strings, X||Y will denote the concatenation of X and Y. For \(X\in \{0,1\}^*\), \(\mathsf{format}_n(X) = X_1||X_2||\ldots ||X_m\), where \(|X_i|= n\), for \(1\le i \le m-1\) and \(0 < |X_m| \le n\). If \(X \in \{0,1\}^*\) is such that \(|X|\ge \ell \), then \(\mathsf{take}_{\ell }(X)\) will denote the \({\ell }\) most significant bits of X. For a nonnegative integer \(i\le 2^{n}-1\), \(\mathsf{bin}_n(i)\) will denote the n bit binary representation of i, and for any n-bit string X, \(\mathsf{toInt}(X)\) will denote the integer represented by the string X.
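For concreteness, the string operations above can be mirrored on Python strings of the characters '0' and '1'. The function names below are our own transliterations of \(\mathsf{format}_n\), \(\mathsf{take}_{\ell }\), \(\mathsf{bin}_n\) and \(\mathsf{toInt}\); this is a sketch for illustration only.

```python
def format_n(x: str, n: int) -> list[str]:
    """Split bit string x into n-bit blocks; only the last block may be shorter."""
    return [x[i:i + n] for i in range(0, len(x), n)]


def take(x: str, ell: int) -> str:
    """Return the ell most significant (leftmost) bits of x."""
    if len(x) < ell:
        raise ValueError("x is shorter than ell bits")
    return x[:ell]


def bin_n(i: int, n: int) -> str:
    """n-bit binary representation of i, for 0 <= i <= 2^n - 1."""
    if not 0 <= i <= 2 ** n - 1:
        raise ValueError("i does not fit in n bits")
    return format(i, f"0{n}b")


def to_int(x: str) -> int:
    """Integer represented by the bit string x."""
    return int(x, 2)


print(format_n("1011001", 3))   # ['101', '100', '1']
print(take("1011001", 4))       # 1011
print(bin_n(5, 4), to_int("0101"))
```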

For a finite set \({\mathcal {S}}\), \(x\mathop {\leftarrow }\limits ^{\$}{\mathcal {S}}\) will denote x to be an element chosen uniformly at random from \({\mathcal {S}}\). We consider an adversary as a probabilistic algorithm that outputs a bit b. \({\mathcal {A}}^{\mathcal {O}} \Rightarrow b\) will denote the fact that an adversary \({\mathcal {A}}\) has access to an oracle \(\mathcal {O}\) and outputs b. In general, an adversary may have other sorts of interactions, possibly with other adversaries and/or algorithms, before it outputs; these will be clear from the context. Unless mentioned otherwise, whenever we refer to the resources of an adversary we mean the number of oracle queries made by it and its running time.

Pseudorandom functions and permutations For finite sets A and B, by \(\mathsf{Func}(A,B)\) we would mean the set of all functions mapping A to B, and \(\mathsf{Perm}(A)\) would denote the set of all permutations on A (i.e., all bijective functions mapping A to A). If \(A = \{0,1\}^n\) and \(B =\{0,1\}^\ell \), then we would denote \(\mathsf{Func}(A,B)\) by \(\mathsf{Func}(n,\ell )\) and \(\mathsf{Perm}(A)\) by \(\mathsf{Perm}(n)\).

Consider the map \(F:{\mathbb K} \times {\mathbb D} \rightarrow {\mathbb R}\) where \({\mathbb K}, {\mathbb D},{\mathbb R}\) (commonly called keys, domain and range, respectively) are all non-empty and \({\mathbb K}\) and \({\mathbb R}\) are finite. We view this map as representing a family of functions \(F=\{F_K\}_{K\in {\mathbb K}}\), i.e., for each \(K\in {\mathbb K}\), \(F_K\) is a function from \({\mathbb D}\) to \({\mathbb R}\) defined as \(F_K(X) = F(K,X)\). For every \(K\in {\mathbb K}\), we call \(F_K\) to be an instance of the family F.

Let \(F:{\mathbb K} \times {\mathbb D} \rightarrow {\mathbb R}\) be a family of functions. We define the PRF advantage of an adversary \({\mathcal {A}}\) in breaking F as

\[\mathbf{Adv}^{\mathrm{prf}}_{F}({\mathcal {A}}) = \left| \Pr \left[ K\mathop {\leftarrow }\limits ^{\$}{\mathbb K}: {\mathcal {A}}^{F_K(\cdot )} \Rightarrow 1\right] - \Pr \left[ \rho \mathop {\leftarrow }\limits ^{\$}\mathsf {Func}({\mathbb D},{\mathbb R}): {\mathcal {A}}^{\rho (\cdot )} \Rightarrow 1\right] \right| .\]

Hence, the PRF advantage of the adversary \({\mathcal {A}}\) is computed as a difference between two probabilities: the adversary \({\mathcal {A}}\) is required to distinguish between two situations. In the first situation, \({\mathcal {A}}\) is given a uniformly chosen member of the family F (i.e., \({\mathcal {A}}\) has oracle access to the procedure \(F_K\), where \(K\mathop {\leftarrow }\limits ^{\$}{\mathbb K}\)), and in the other, \({\mathcal {A}}\) is given oracle access to a uniformly chosen element of \(\mathsf {Func}({\mathbb D},{\mathbb R})\). The adversary specifies its choice by outputting a bit. If the adversary cannot tell these two situations apart, then we consider F to be a pseudorandom family. In other words, F is considered to be pseudorandom if, for all efficient adversaries \({\mathcal {A}}\), \(\mathbf{Adv}^{\mathrm{prf}}_{F}({\mathcal {A}})\) is small.

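Similarly, if \(E:{\mathbb K} \times {\mathbb D} \rightarrow {\mathbb D}\) is a family of permutations, we define the PRP advantage of an adversary \({\mathcal {A}}\) in breaking E as

\[\mathbf{Adv}^{\mathrm{prp}}_{E}({\mathcal {A}}) = \left| \Pr \left[ K\mathop {\leftarrow }\limits ^{\$}{\mathbb K}: {\mathcal {A}}^{E_K(\cdot )} \Rightarrow 1\right] - \Pr \left[ \pi \mathop {\leftarrow }\limits ^{\$}\mathsf {Perm}({\mathbb D}): {\mathcal {A}}^{\pi (\cdot )} \Rightarrow 1\right] \right| .\]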

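A tweakable enciphering scheme (TES) is a function \({\mathcal {E}}: {\mathbb K} \times {\mathbb T} \times {\mathcal {M}} \rightarrow {\mathcal {M}}\) where \({\mathbb K}\) is the key space, \({\mathbb T}\) is the tweak set, and \({\mathcal {M}}\) is the message space, and for every \(K \in {\mathbb K}\) and \(T \in {\mathbb T}\) we have that \({\mathcal {E}}(K,T, \cdot )={\mathcal {E}}^T_K(\cdot )\) is a length-preserving permutation. We define the \(\widetilde{\text{ prp }}\) advantage of an adversary \({\mathcal {A}}\) as

\[\mathbf{Adv}^{\widetilde{\mathrm{prp}}}_{{\mathcal {E}}}({\mathcal {A}}) = \left| \Pr \left[ K\mathop {\leftarrow }\limits ^{\$}{\mathbb K}: {\mathcal {A}}^{{\mathcal {E}}_K(\cdot ,\cdot )} \Rightarrow 1\right] - \Pr \left[ \varvec{\pi }\mathop {\leftarrow }\limits ^{\$}\mathsf{Perm}^{\mathbb T}({\mathcal {M}}): {\mathcal {A}}^{\varvec{\pi }(\cdot ,\cdot )} \Rightarrow 1\right] \right| ,\]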

where \(\mathsf{Perm}^{\mathbb T}({\mathcal {M}})\) is the set of length-preserving tweak indexed permutations on \({\mathcal {M}}\). If the message space \({\mathcal {M}} = \{0,1\}^n\), then \({\mathcal {E}}\) is called a tweakable block cipher.

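Deterministic CPA secure encryption Let \(\mathbf{E}:{\mathbb K} \times {\mathbb T} \times {\mathcal {M}} \rightarrow {\mathbb C}\) be a deterministic encryption scheme with key space \({\mathbb K}\), tweak space \({\mathbb T}\), message space \({\mathcal {M}}\) and cipher space \({\mathbb C}\). We define the DET-CPA advantage of any adversary \({\mathcal {A}}\) which does not repeat any query as

\[\mathbf{Adv}^{\mathrm{det\text{-}cpa}}_{\mathbf{E}}({\mathcal {A}}) = \left| \Pr \left[ K\mathop {\leftarrow }\limits ^{\$}{\mathbb K}: {\mathcal {A}}^{\mathbf{E}_K(\cdot ,\cdot )} \Rightarrow 1\right] - \Pr \left[ {\mathcal {A}}^{\$(\cdot ,\cdot )} \Rightarrow 1\right] \right| ,\]

where \(\$(.,.)\) is an oracle which, on input \((d,x) \in {\mathbb T} \times {\mathcal {M}}\), returns a random string of the size of the ciphertext of x.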

4 A generic syntax

A tokenization system has the following components:

  1. \({\mathcal {X}}\), a finite set of primary account numbers or PANs. \({\mathcal {X}}\) contains strings from a suitable alphabet with a specific format.

  2. \({\mathcal {T}}\), a finite set of tokens. \({\mathcal {T}}\) also contains strings from a suitable alphabet with a specific format. It may be the case that \({\mathcal {T}} = {\mathcal {X}}\).

  3. \({\mathcal {D}}\), a finite set of associated data. The associated data can be any data related to the business process.

  4. CV, the card-vault. The card-vault is a repository where PANs and tokens are stored, which may have a special structure for the ease of implementation of the token mapping procedure. In our syntax, we shall use the \(\mathsf{CV}\) to represent a state of the tokenization system. Whenever a new PAN is tokenized, possibly both the PAN and the generated token are stored in the CV, along with some additional data. Disregarding the structure of the CV, we consider that the “basic” elements of \(\mathsf{CV}\) come from a set \({\mathbb Y}\).

  5. \({\mathcal {K}}\), a key generation algorithm. A tokenization system may require multiple keys, and all these keys are generated through the key generation algorithm.

  6. \(\mathsf{TKR}\), the tokenizer. It is the procedure responsible for generating tokens from the PANs. We consider that the tokenizer receives as input the \(\mathsf{CV}\) as a state, a key K generated by \({\mathcal {K}}\), some associated data d which comes from a set \({\mathcal {D}}\), and a PAN \(x \in {\mathcal {X}}\). An invocation of \(\mathsf{TKR}\) outputs a token t and also changes the \(\mathsf{CV}\). Thus, other than t, \(\mathsf{TKR}\) also produces an element from \({\mathbb Y}\) which is used to update the \(\mathsf{CV}\). We use the square brackets to denote this interaction. We formally see \(\mathsf{TKR}\) as a function \(\mathsf{TKR}[\mathsf{CV}]:{\mathcal {K}} \times {\mathcal {X}} \times {\mathcal {D}} \rightarrow {\mathcal {T}} \times {\mathbb Y}\). For convenience, we shall implicitly assume the interaction of \(\mathsf{TKR}\) with \(\mathsf{CV}\), and we will use \(\mathsf{TKR}_K^{(1)}(x,d)\) and \(\mathsf{TKR}_K^{(2)}(x,d)\) to denote the two outputs (in \({\mathcal {T}}\) and \({\mathbb Y}\), respectively) of \(\mathsf{TKR}\).

  7. \(\mathsf{DTKR}\), the detokenizer, which inverts a token to a PAN. As in the case of the tokenizer, we denote a detokenizer as a function \(\mathsf{DTKR}[\mathsf{CV}]:{\mathcal {K}} \times {\mathcal {T}} \times {\mathcal {D}} \rightarrow {\mathcal {X}} \cup \{\bot \}\). For detokenization also, we shall implicitly assume its interaction with \(\mathsf{CV}\), and for \(K\in {\mathcal {K}}\), \(d \in {\mathcal {D}}\) and \(t \in {\mathcal {T}}\), we shall write \(\mathsf{DTKR}_K(t,d)\) instead of \(\mathsf{DTKR}[\mathsf{CV}](K,t,d)\).

A tokenization procedure \(\mathsf{TKR}_K\) should satisfy the following:

  • For every \(x\in {\mathcal {X}}\), \(d\in {\mathcal {D}}\) and \(K\in {\mathcal {K}}\), \(\mathsf{DTKR}_K(\mathsf{TKR}_K^{(1)}(x,d),d) =x \).

  • For every \(d \in {\mathcal {D}}\), and \(x,x'\in {\mathcal {X}}\), such that \(x\ne x'\), \(\mathsf{TKR}_K^{(1)}(x,d) \ne \mathsf{TKR}_K^{(1)}(x',d)\).

The second criterion focuses on a weak form of uniqueness. We want two different PANs with the same associated data to produce different tokens; we do not disallow the case where two different PANs with different associated data have the same token. This requirement is clear if we consider the associated data d to be an identifier for a merchant. We do not want a single merchant to obtain the same token for two different PANs, but we do not care if two different merchants obtain the same token for two different PANs.
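The syntax and the two correctness conditions can be expressed as a small interface plus an exhaustive check over a toy domain. The sketch below is our own simplification: the key and the card-vault are hidden inside the object, only the token output \(\mathsf{TKR}_K^{(1)}\) is modeled, and `ShiftTokenizer` is a deliberately insecure toy over two-digit "PANs" that serves only to exercise the check.

```python
from abc import ABC, abstractmethod
from typing import Optional


class TokenizationSystem(ABC):
    """Simplified syntax: key and card-vault are internal state of the object."""

    @abstractmethod
    def tkr(self, x: str, d: str) -> str:
        """Return the token for PAN x under associated data d."""

    @abstractmethod
    def dtkr(self, t: str, d: str) -> Optional[str]:
        """Return the PAN for token t under d, or None (the symbol ⊥)."""


def satisfies_syntax(system: TokenizationSystem, pans, assoc) -> bool:
    """Check both conditions for distinct PANs and every associated datum."""
    for d in assoc:
        seen = set()
        for x in pans:
            t = system.tkr(x, d)
            if system.dtkr(t, d) != x:   # condition 1: detokenization inverts
                return False
            if t in seen:                # condition 2: weak uniqueness per d
                return False
            seen.add(t)
    return True


class ShiftTokenizer(TokenizationSystem):
    """Toy, insecure instance over 2-digit strings (illustration only)."""

    def __init__(self, key: int):
        self.key = key

    def _offset(self, d: str) -> int:
        return (self.key + len(d)) % 100

    def tkr(self, x: str, d: str) -> str:
        return f"{(int(x) + self._offset(d)) % 100:02d}"

    def dtkr(self, t: str, d: str) -> Optional[str]:
        return f"{(int(t) - self._offset(d)) % 100:02d}"


pans = [f"{i:02d}" for i in range(100)]
print(satisfies_syntax(ShiftTokenizer(7), pans, ["m1", "shop-2"]))
```

Note that the check deliberately allows tokens to coincide across different associated data, matching the weak form of uniqueness discussed above.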

Fig. 2 Experiments used in the security definitions: IND-TKR, IND-TKR-CV and IND-TKR-KEY

5 Security notions

We define three different security notions, which consider three different attack scenarios:

  1. IND-TKR: Only the tokens are public. This represents the most realistic scenario where an adversary has access to the tokens only, and the card-vault data remain inaccessible.

  2. IND-TKR-CV: The tokens and the contents of the card-vault are public. This represents an extreme scenario where the adversary gets access to the card-vault data also.

  3. IND-TKR-KEY: This represents another extreme scenario where the tokens and the keys are public.

We formally define the above three security notions based on the notion of indistinguishability, as is usually done for encryption schemes. Three experiments corresponding to the three attack scenarios discussed above are described in Fig. 2. Each experiment represents an interaction between a challenger and an adversary \({\mathcal {A}}\). The challenger can be seen as the tokenization system, which in the beginning selects a random key from the key space and instantiates the tokenizer with the selected key. Then (in lines 3–6 of the experiments), the challenger responds to the queries of the adversary \({\mathcal {A}}\). The adversary \({\mathcal {A}}\) in each case queries with \((x,d) \in {\mathcal {X}} \times {\mathcal {D}}\), i.e., it asks for the outputs of the tokenizer for pairs of PAN and associated data of its choice. Finally, \({\mathcal {A}}\) submits two pairs of PANs and associated data to the challenger. The challenger selects one of the pairs uniformly at random and provides \({\mathcal {A}}\) with the tokenizer output for the selected pair. The task of \({\mathcal {A}}\) is to tell which pair was selected by the challenger. If \({\mathcal {A}}\) correctly guesses the selection of the challenger, then the experiment outputs 1; otherwise it outputs 0. This setting is very similar to the way in which the security of encryption schemes is defined for a chosen plaintext adversary.

The three experiments differ in what the adversary gets to see. In experiment Exp-IND-TKR\(^{\mathcal {A}}\), \({\mathcal {A}}\), in response to its queries gets only the tokens and in Exp-IND-TKR-CV\(^{\mathcal {A}}\) it gets both the tokens and the data that is stored in the card-vault. In Exp-IND-TKR-KEY\(^{\mathcal {A}}\), \({\mathcal {A}}\) gets the tokens corresponding to its queries, and the challenger reveals the key to \({\mathcal {A}}\) after the query phase.
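The flow of the IND-TKR experiment, as described in the prose above, can be sketched as follows. Since the exact pseudocode of Fig. 2 is not reproduced here, the structure, the function `exp_ind_tkr` and the two-method adversary interface (`choose`, `guess`) are our assumptions; the identity "tokenizer" is used only to show an adversary that always wins.

```python
import secrets


def exp_ind_tkr(make_tokenizer, adversary) -> int:
    """Return 1 if the adversary guesses the challenger's selection, else 0."""
    tokenizer = make_tokenizer()        # challenger picks a fresh random key

    def oracle(x, d):                   # query phase: only tokens are returned
        return tokenizer(x, d)

    pair0, pair1, state = adversary.choose(oracle)
    b = secrets.randbelow(2)            # uniform selection of one pair
    challenge = tokenizer(*(pair0, pair1)[b])
    return 1 if adversary.guess(state, challenge) == b else 0


class CompareAdversary:
    """Wins against any tokenizer that leaks the PAN, e.g. the identity map."""

    def choose(self, oracle):
        return ("0000", "m"), ("1111", "m"), None

    def guess(self, state, challenge):
        return 0 if challenge == "0000" else 1


def identity_tokenizer():
    return lambda x, d: x               # deliberately broken: token equals PAN


wins = sum(exp_ind_tkr(identity_tokenizer, CompareAdversary()) for _ in range(50))
print(wins)   # 50: this adversary never guesses wrongly
```

In the same spirit, the IND-TKR-CV variant would additionally hand the card-vault updates to the adversary, and the IND-TKR-KEY variant would reveal the key after the query phase.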

Definition 1

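Let \(\mathsf{TKR} [\mathsf{CV}]:{\mathcal {K}} \times {\mathcal {X}} \times {\mathcal {D}}\rightarrow {\mathcal {T}} \times {\mathbb Y}\) be a tokenizer. Then, the advantage of an adversary \({\mathcal {A}}\) in the sense of IND-TKR, IND-TKR-CV and IND-TKR-KEY is defined as

\[\mathbf{Adv}^{\text{ind-tkr}}_{\mathsf{TKR}}({\mathcal {A}}) = \left| \Pr [\text{Exp-IND-TKR}^{\mathcal {A}} \Rightarrow 1] - \frac{1}{2}\right| ,\]

\[\mathbf{Adv}^{\text{ind-tkr-cv}}_{\mathsf{TKR}}({\mathcal {A}}) = \left| \Pr [\text{Exp-IND-TKR-CV}^{\mathcal {A}} \Rightarrow 1] - \frac{1}{2}\right| ,\]

\[\mathbf{Adv}^{\text{ind-tkr-key}}_{\mathsf{TKR}}({\mathcal {A}}) = \left| \Pr [\text{Exp-IND-TKR-KEY}^{\mathcal {A}} \Rightarrow 1] - \frac{1}{2}\right| ,\]

respectively.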

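From the definitions, it is obvious that IND-TKR-CV \(\Rightarrow \) IND-TKR and IND-TKR-KEY \(\Rightarrow \) IND-TKR, but IND-TKR \(\not \Rightarrow \) IND-TKR-CV and IND-TKR \(\not \Rightarrow \) IND-TKR-KEY. Thus, IND-TKR-CV and IND-TKR-KEY are strictly stronger than IND-TKR.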

Adequacy of the notions We discuss some of the characteristics and limitations of the proposed definitions next.

  1. IND-TKR refers to the basic security requirement for tokens. It adheres to the informal security notion for tokens as stated in the PCI DSS guideline for tokenization. It models the fact that tokens and PANs are unlinkable in a computational sense, if the key and card-vault are kept secret. Thus, if a merchant adopts a tokenization scheme provided by a third party, which is secure in the IND-TKR sense, then this will probably relieve it from PCI compliance, since in this case the merchant does not own the card-vault or the keys, and the burden of security involved with the keys and the card-vault lies with the provider who offers the tokenization service.

  2. The IND-TKR-CV is a stronger notion. If a tokenization system achieves this security, then it implies that tokens and PANs are unlinkable even with the knowledge of the card-vault. This in turn implies that the contents of the card-vault are not useful (in a computational sense) to derive a relation between PANs and tokens. Thus, it provides security both to the tokenization service provider and the merchant who uses this service.

  3. IND-TKR-KEY is a stronger form of the IND-TKR notion. Some public documents like [19] have stressed that encryption is not a good option for tokenization, as in theory there exists the possibility that a token can be inverted to obtain the PAN. If tokens are generated using a “secure” encryption scheme, then it is infeasible for any “reasonably efficient” adversary to invert the token without the knowledge of the key. But this computational guarantee does not seem to be enough for users. The IND-TKR-KEY definition aims to model this paranoid situation, where linking the PANs with tokens becomes infeasible even with the knowledge of the key. Note that in IND-TKR-KEY we still assume that the card-vault is inaccessible to an adversary.

  4. All the definitions follow the style of a chosen plaintext attack. The definitions may be made stronger by giving the adversary the additional power of obtaining PANs corresponding to tokens of its choice. Although a stronger definition is always better, in the current context we think that such strong definitions may not be required. According to the specifications given by the PCI DSS [13], detokenization, i.e., to retrieve a PAN given a token, is an operation which must be performed only in special situations. It also specifies that this operation should be restricted to authorized individuals or applications. Thus, detokenization can be restricted with the suitable use of authentication mechanisms, which falls outside the scope of our abstraction of tokenization systems. We discuss this a bit more in Sect. 10.

In the following two sections, we discuss two classes of constructions for tokenizers. The first construction \(\mathsf{TKR1}\) is the trivial way to do tokenization using FPE. The other constructions (TKR2 and a variant TKR2a) presented in Sect. 7 are very different. For the latter constructions, our main aim is to bypass the use of FPE schemes and use standard cryptographic schemes along with some encoding mechanism to achieve both security and the format requirements for arbitrarily formatted PANs/tokens.

6 Construction TKR1: tokenization using FPE

The construction TKR1 is described in Fig. 3. TKR1 uses an FPE scheme \(\mathsf{FP}:{\mathcal {K}} \times {\mathcal {D}} \times {\mathcal {X}} \rightarrow {\mathcal {T}}\) in an obvious way to generate tokens, assuming that \({\mathcal {T}} ={\mathcal {X}}\).

Fig. 3 The TKR1 tokenization scheme using a format-preserving encryption scheme \(\mathsf{FP}\)

For security, we assume that \(\mathsf{FP}_k()\) is a tweakable pseudorandom permutation with a tweak space \({\mathcal {D}}\) and message space \({\mathcal {T}}\). Note that this scheme does not utilize a card-vault and thus is stateless. The scheme is secure both in terms of IND-TKR and in terms of IND-TKR-CV. We formally state the security in the following theorem.

Theorem 1

  1. Let \(\Psi = \mathsf{TKR1}\) be defined as in Fig. 3, and let \({\mathcal {A}}\) be an adversary attacking \(\Psi \) in the IND-TKR sense. Then, there exists a \(\widetilde{prp}\) adversary \({\mathcal {B}}\) such that

    where \({\mathcal {B}}\) uses almost the same resources as of \({\mathcal {A}}\).

  2. Let \(\Psi = \mathsf{TKR1}\) be defined as in Fig. 3, and let \({\mathcal {A}}\) be an adversary attacking \(\Psi \) in the IND-TKR-CV sense. Then, there exists a \(\widetilde{prp}\) adversary \({\mathcal {B}}\) (which uses almost the same resources as of \({\mathcal {A}}\)) such that

The first claim of the Theorem is an easy reduction in which we design a \(\widetilde{prp}\) adversary \({\mathcal {B}}\) which runs \({\mathcal {A}}\), and then relate the advantages of the adversaries \({\mathcal {A}}\) and \({\mathcal {B}}\). The second claim follows directly from the first. In the construction TKR1 there is no card-vault, which we can also view as a card-vault that stores no information at all; thus, an IND-TKR-CV adversary for \(\mathsf{TKR1}\) does not have any additional information compared to an IND-TKR adversary. The proof is provided in the “Appendix”.

This scheme can be instantiated using any format-preserving encryption scheme as described in [1, 2, 5, 8, 11, 20]. According to [14], this method to generate tokens produces a reversible cryptographic token, i.e., we can recover the PAN from the token. Clearly the security of this method relies on the security of the FPE scheme.

We discuss more on the impact of security and efficiency for specific instantiations in Sect. 8.
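To illustrate the shape of TKR1, the sketch below tokenizes 16-digit PANs with a toy tweakable permutation: a balanced Feistel network over two 8-digit halves whose round function is HMAC-SHA256. This toy cipher is our own stand-in for \(\mathsf{FP}\), not one of the FPE schemes cited above, and it carries no security claim; the function names, the round count and the message encoding are all assumptions made for illustration.

```python
import hashlib
import hmac

HALF = 10 ** 8   # each Feistel half holds 8 decimal digits
ROUNDS = 10      # arbitrary choice for this sketch


def _round(key: bytes, tweak: bytes, i: int, half: int) -> int:
    """Round function: HMAC over (round index, tweak, half), reduced mod 10^8."""
    msg = bytes([i]) + len(tweak).to_bytes(4, "big") + tweak + half.to_bytes(4, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big") % HALF


def tkr1_tokenize(key: bytes, pan: str, d: bytes) -> str:
    """Token = FP_K(d, x); stateless, no card-vault involved."""
    left, right = int(pan[:8]), int(pan[8:])
    for i in range(ROUNDS):
        left, right = right, (left + _round(key, d, i, right)) % HALF
    return f"{left:08d}{right:08d}"


def tkr1_detokenize(key: bytes, token: str, d: bytes) -> str:
    """Invert the Feistel rounds in reverse order."""
    left, right = int(token[:8]), int(token[8:])
    for i in reversed(range(ROUNDS)):
        left, right = (right - _round(key, d, i, left)) % HALF, left
    return f"{left:08d}{right:08d}"


key = b"\x01" * 16                      # fixed demo key; use a random key in practice
token = tkr1_tokenize(key, "4111111111111111", b"merchant-1")
print(token, tkr1_detokenize(key, token, b"merchant-1"))
```

Because each round is invertible for any round function, the map is a permutation on 16-digit strings for every key and tweak, which is exactly the structural property the TKR1 syntax needs.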

Fig. 4 The TKR2 tokenization scheme using a random number generator \(\mathsf{RN}^{\mathcal {T}}()\)

7 Construction TKR2: tokenization without using FPE

Here, we propose a class of constructions which avoids the use of format-preserving encryption. Instead of a permutation on \({\mathcal {T}}\), which we use for the previous construction, we assume a primitive \(\mathsf{RN}^{\mathcal {T}}()\), which when invoked (ideally) outputs a uniform random element in \({\mathcal {T}}\). This primitive can be keyed, which we will denote by \(\mathsf{RN}^{\mathcal {T}}[k]()\), where k is a uniform random element of a pre-defined finite key space \({\mathcal {K}}\). \(\mathsf{RN}^{\mathcal {T}}()\) can also be realized by using a keyed cryptographic primitive \(f_k\); such instantiations will be more specifically denoted by \(\mathsf{RN}^{\mathcal {T}}[f_k]()\). We define the RND advantage of an adversary \({\mathcal {A}}\) attacking \(\mathsf{RN}^{\mathcal {T}}()\) as

\[\mathbf{Adv}^{\mathrm{rnd}}_{\mathsf{RN}}({\mathcal {A}}) = \left| \Pr \left[ k\mathop {\leftarrow }\limits ^{\$}{\mathcal {K}}: {\mathcal {A}}^{\mathsf{RN}^{\mathcal {T}}[k]()} \Rightarrow 1\right] - \Pr \left[ {\mathcal {A}}^{\$^{\mathcal {T}}()} \Rightarrow 1\right] \right| \qquad (1)\]

where \(\$^{\mathcal {T}}()\) is an oracle which returns uniform random strings from \({\mathcal {T}}\). The task of an RND adversary \({\mathcal {A}}\) is to distinguish between \(\mathsf{RN}^{\mathcal {T}}[k]()\) and its ideal counterpart when oracle access to these schemes is given to \({\mathcal {A}}\).

We describe a generic scheme for tokenization in Fig. 4, which we call \(\mathsf{TKR2}\) and which uses \(\mathsf{RN}^{\mathcal {T}}()\). For the description, we consider the card-vault \(\mathsf{CV}\) to be a collection of tuples, where each tuple has three components \((x_1,x_2,x_3)\): the token, the PAN and the associated data, respectively. For a tuple \(\mathsf {tup}=(x_1,x_2,x_3)\), we use \(\mathsf {tup}^{(i)}\) to denote \(x_i\). Given a card-vault \(\mathsf{CV}\), we also assume procedures to search for tuples in \(\mathsf{CV}\): \(\mathsf{SrchCV}(i,x)\) returns those tuples \(\mathsf {tup}\) in \(\mathsf{CV}\) such that \(\mathsf {tup}^{(i)}=x\). If S is a set of tuples, then \(S^{(i)}\) denotes the set of the i-th components of the tuples in S.
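The card-vault notation can be illustrated with a small sketch. It assumes an in-memory list stands in for the card-vault; the names `srch_cv` and `components` and the sample values are hypothetical and only mirror \(\mathsf{SrchCV}(i,x)\) and \(S^{(i)}\) from the text.

```python
from typing import List, Set, Tuple

# A card-vault tuple (x1, x2, x3) = (token, PAN, associated data).
VaultTuple = Tuple[str, str, str]


def srch_cv(cv: List[VaultTuple], i: int, x: str) -> List[VaultTuple]:
    """Return every tuple tup in cv with tup^(i) == x (i is 1-based, as in the text)."""
    return [tup for tup in cv if tup[i - 1] == x]


def components(s: List[VaultTuple], i: int) -> Set[str]:
    """Return S^(i): the set of i-th components of the tuples in s."""
    return {tup[i - 1] for tup in s}


# Illustrative (made-up) vault contents: the same PAN tokenized for two merchants.
cv = [
    ("7301", "4111", "merchantA"),
    ("5522", "4111", "merchantB"),
    ("9040", "5500", "merchantA"),
]
same_pan = srch_cv(cv, 2, "4111")        # all tuples whose PAN is "4111"
tokens_for_pan = components(same_pan, 1)  # the tokens issued for that PAN
```

A linear scan is used here only for readability; a real card-vault would rely on an indexed search.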

Fig. 5 Modified TKR2 to ensure uniqueness

As is evident from the description in Fig. 4, detokenization is made possible by the data stored in the card-vault, and it amounts to a search procedure. Determinism is also assured by the search.

Correctness A limitation of the TKR2 scheme is that it may violate the property of uniqueness. It is not guaranteed that \(\mathsf{TKR2}_k(x,d) \ne \mathsf{TKR2}_k(x',d')\) when \((x,d) \ne (x',d')\). As discussed before, for practical purposes a weak form of uniqueness is required, i.e., for \(x\ne x'\), for any \(d\in {\mathcal {D}}\), \(\mathsf{TKR2}(x,d) \ne \mathsf{TKR2}(x',d)\). This requirement stems from the fact that a specific merchant with associated data d may use the tokens as a primary key in its databases. Thus if \(d \ne d'\), it can be tolerated that \(\mathsf{TKR2}(x,d) = \mathsf{TKR2}(x',d') \), for any \(x,x' \in {\mathcal {X}}\).

Let us assume that \(\mathsf{RN}^{\mathcal {T}}()\) behaves ideally. If q unique tokens have already been generated with a specific associated data d, the probability that the \((q+1)^{th}\) token (generated with associated data d) is equal to any of the q previously generated tokens is \(q/\#{\mathcal {T}}\). Thus, the probability of collision increases with the number of tokens already generated. If the total number of tokens generated by the tokenizer for a specific associated data is much smaller than the size of the token space (which will be the case in a practical scenario), this probability of collision is insignificant (see footnote 2). Uniqueness can nevertheless be guaranteed by an additional search, as shown in Fig. 5, where \(\mathsf{RN}^{\mathcal {T}}()\) is invoked repeatedly until a token different from those already produced is obtained. Following the previous discussion, if q is small compared to \(\#{\mathcal {T}}\), the expected number of repetitions required until a unique token is obtained is small.
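To make the collision estimate concrete, the following sketch evaluates it with exact rational arithmetic. It assumes an ideal \(\mathsf{RN}^{\mathcal {T}}()\), a token space of sixteen decimal digits and one million existing tokens; the function names are hypothetical. The expected number of invocations follows from the geometric distribution with success probability \(1-q/\#{\mathcal {T}}\), and the last function is a union bound over the first Q draws.

```python
from fractions import Fraction


def collision_probability(q: int, token_space: int) -> Fraction:
    """Probability that a fresh uniform token equals one of q existing tokens."""
    return Fraction(q, token_space)


def expected_invocations(q: int, token_space: int) -> Fraction:
    """Expected calls to RN until a new token appears (geometric distribution)."""
    return 1 / (1 - collision_probability(q, token_space))


def any_collision_bound(total: int, token_space: int) -> Fraction:
    """Union bound on a collision among `total` tokens: sum of q/#T for q < total."""
    return Fraction(total * (total - 1), 2 * token_space)


TOKEN_SPACE = 10**16   # sixteen decimal digits (assumed for illustration)
Q = 10**6              # tokens already issued for one associated data value

p_next = collision_probability(Q, TOKEN_SPACE)   # 1 / 10**10
e_calls = expected_invocations(Q, TOKEN_SPACE)   # 10**10 / (10**10 - 1)
p_any = any_collision_bound(Q, TOKEN_SPACE)      # just under 5 / 10**5
```

Under these assumptions the repeat loop of Fig. 5 almost never iterates more than once.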

The detokenization corresponding to the modified tokenization scheme described in Fig. 5 remains the same as described in Fig. 4.

Fig. 6 The TKR2a tokenization scheme

We formally specify the security of TKR2 later in this section, but it is easy to see that TKR2 is not secure in the IND-TKR-CV sense: the PANs are stored in the clear in the card-vault, so if the card-vault is revealed, no security remains. This can be fixed by encrypting the PANs stored in the card-vault. To achieve security in the IND-TKR-CV sense, any CPA secure encryption scheme can be used for this purpose. Note that the encrypted PAN stored in the card-vault does not need to be format preserving. We modify \(\mathsf{TKR2}\) to \(\mathsf{TKR2a}\) to achieve this. We discuss the details of \(\mathsf{TKR2a}\) next.

Modifying TKR2 to TKR2a For this modification, the structure of the card-vault is slightly different from that of TKR2. In this case, each tuple contains two components: the first is the encryption of the token and the second is the encryption of the PAN. We additionally use a deterministic CPA secure encryption scheme (supporting associated data) \(\mathbf{E}:{\mathbb K}\times {\mathcal {D}} \times {\mathcal {M}} \rightarrow {\mathbb C}\), with key space \({\mathbb K}\), tweak (associated data) space \({\mathcal {D}}\) and message space \({\mathcal {M}}\). We assume that \({\mathcal {T}} = {\mathcal {X}} \subseteq \mathsf{AL}^*\), where \(\mathsf{AL}\) is an arbitrary alphabet such that \(\#\mathsf{AL}\ge 2\). We fix distinct \(\mathsf{a},\mathsf{b}\in \mathsf{AL}\) and define the message space \({\mathcal {M}}\) of \(\mathbf{E}\) to be

$$\begin{aligned} {\mathcal {M}} = \left\{ \mathsf{a}||x: x \in {\mathcal {X}} \right\} \bigcup \left\{ \mathsf{b}||t: t \in {\mathcal {T}}\right\} . \end{aligned}$$

Note that \(\mathsf{a}\) and \(\mathsf{b}\) are public quantities. The cipher space \({\mathbb C}\) can be arbitrary, i.e., it is not required that \({\mathbb C} = {\mathcal {X}}\), as the ciphers here would not be tokens but would be stored in the card-vault. We assume that \({\mathcal {D}},{\mathbb C} \subseteq \{0,1\}^*\).

The tokenization scheme \(\mathsf{TKR2a}\) described in Fig. 6 uses the objects described above. The main difference from TKR2 is that the token and PAN pairs are stored in the card-vault in encrypted form. An important characteristic of the way the encryption is applied is that the inputs are encoded differently for a token and for a PAN. This ensures that even if a PAN and a token are the same, they produce different ciphertexts.

7.1 Realizing \(\mathsf{RN}^{\mathcal {T}}[k]\)

The heart of the procedures TKR2 and TKR2a is the keyed primitive \(\mathsf{RN}^{\mathcal {T}}[k]\), which can be realized by standard cryptographic objects. We discuss here a specific realization which uses a pseudorandom function \(f:{\mathcal {K}}\times {\mathbb Z}_N \rightarrow \{0,1\}^L\), where L and N are sufficiently “large”; the exact requirements for N and L will become clear later. We call the construction \(\mathsf{RN}[f_k]()\), and it is shown in Fig. 7.

For the construction shown in Fig. 7, we assume that \({\mathcal {T}}\) contains strings of fixed length \(\mu \) from an arbitrary alphabet \(\mathsf{AL}\). Let \(\#\mathsf{AL} = \ell \), and \(\lambda = \lceil \lg \ell \rceil \). Let \(\sigma :\mathsf{AL} \rightarrow \{0,1,2,\ldots ,\ell -1\}\) be a fixed bijection. The variable cnt can be considered as a state of the algorithm, and it maintains its values across invocations. The basic idea behind the algorithm is to generate a “long” binary string using \(f_k(cnt)\) and divide the string into blocks of \(\lambda \) bits. If the integer corresponding to a block is less than \(\ell \), then it is accepted otherwise it is discarded. The accepted blocks are encoded as elements in \(\mathsf{AL}\).

Fig. 7 Construction of \(\mathsf{RN}()\) using a pseudorandom function \(f_k()\)
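The block-and-reject idea described above can be sketched as follows. This is a minimal illustration, not a transcription of Fig. 7: HMAC-SHA256 with a block index is used as a stand-in for the PRF \(f_k\) (an assumption made because it is available in the Python standard library), the alphabet is taken to be the decimal digits, and the choice to move on to the next counter value when an output runs out of accepted blocks is also an assumption.

```python
import hashlib
import hmac
import math

ALPHABET = "0123456789"            # AL; position in the string plays the role of sigma
MU = 16                            # token length mu
ELL = len(ALPHABET)                # ell = #AL
LAM = math.ceil(math.log2(ELL))    # lambda = ceil(lg ell)
L_BITS = 3 * MU * LAM              # L = 3 * mu * lambda


def f_k(key: bytes, cnt: int, nbits: int = L_BITS) -> str:
    """Stand-in PRF: expand HMAC-SHA256(key, cnt || block) into an nbits-long bit string."""
    out = b""
    block = 0
    while len(out) * 8 < nbits:
        msg = cnt.to_bytes(16, "big") + block.to_bytes(4, "big")
        out += hmac.new(key, msg, hashlib.sha256).digest()
        block += 1
    bits = bin(int.from_bytes(out, "big"))[2:].zfill(len(out) * 8)
    return bits[:nbits]


class RN:
    """Keyed generator of length-MU strings over ALPHABET; cnt persists across calls."""

    def __init__(self, key: bytes) -> None:
        self.key = key
        self.cnt = 0

    def __call__(self) -> str:
        while True:
            bits = f_k(self.key, self.cnt)
            self.cnt += 1
            symbols = []
            # Split the PRF output into lambda-bit blocks and keep those below ell.
            for i in range(0, len(bits) - LAM + 1, LAM):
                value = int(bits[i:i + LAM], 2)
                if value < ELL:
                    symbols.append(ALPHABET[value])
                    if len(symbols) == MU:
                        return "".join(symbols)
            # Assumed fallback: too few accepted blocks, so try the next counter.


rn = RN(b"\x00" * 16)   # all-zero key, for demonstration only
token = rn()
```

Rejecting out-of-range blocks, rather than reducing them modulo \(\ell \), keeps each accepted symbol uniform over \(\mathsf{AL}\) whenever the underlying bits are uniform.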

Choosing \({\mathbf {L}}\) and \({\mathbf {N}}\): Let us define, \(p=\Pr [y\mathop {\leftarrow }\limits ^{\$}\{0,1\}^\lambda :\mathsf{toInt}(y)<\ell ]=\frac{\ell }{2^\lambda }>\frac{1}{2}\). Thus, if we assume that the output of \(f_k()\) is uniformly distributed then an \(X_i\) passes the test in line 6 (of Fig. 7) with probability p. Thus, the expected number of times the while loop will run is at most \(2\mu \). Thus, \(L = 3\mu \lambda \) will be sufficient for all practical purposes, recall that \(\mu \) is the length of each token if the tokens are treated as strings in \(\mathsf{AL}\) and \(\lambda = \lceil \lg \#\mathsf{AL} \rceil \).
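The choice of L can be checked numerically. The sketch below assumes decimal tokens of length 16 and independent, uniform \(\lambda \)-bit blocks; the function names are hypothetical. It computes the acceptance probability p, the expected number of blocks \(\mu /p\), and the probability that a given number of blocks yields fewer than \(\mu \) accepted symbols, which is a binomial tail.

```python
from math import ceil, comb, log2


def acceptance_probability(ell: int) -> float:
    """p = ell / 2**lambda with lambda = ceil(lg ell)."""
    lam = ceil(log2(ell))
    return ell / 2**lam


def shortfall_probability(blocks: int, mu: int, p: float) -> float:
    """P[fewer than mu of `blocks` independent blocks are accepted]."""
    return sum(comb(blocks, j) * p**j * (1 - p) ** (blocks - j) for j in range(mu))


mu, ell = 16, 10
lam = ceil(log2(ell))                  # 4 bits per block
p = acceptance_probability(ell)        # 10/16 = 0.625
expected_blocks = mu / p               # 25.6 blocks on average, below 2*mu = 32

# L = 3*mu*lambda bits gives 3*mu = 48 blocks.
fail_with_full_L = shortfall_probability(3 * mu, mu, p)

# A single 128-bit block cipher output gives 128/lambda = 32 blocks.
fail_with_one_block = shortfall_probability(128 // lam, mu, p)
```

Under these assumptions the shortfall probability for \(L = 3\mu \lambda \) is negligible, while a single 128-bit output falls short in roughly 5% of the cases.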

Note that each invocation of \(\mathsf{RN}[f_k]()\) increases the value of cnt by 1. Thus, the value of N should be a conservative upper bound on the number of times \(\mathsf{RN}[f_k]()\) needs to be invoked. \(N = 2^{80}-1\) should be sufficient for all practical purposes.

If \(f_k\) is a PRF, then \(\mathsf{RN}^{\mathcal {T}}[f_k]\) is secure in the RND sense. We formally state this (easy to verify) security property in the following theorem.

Theorem 2

Let \({\mathcal {A}}\) be an arbitrary adversary attacking \(\mathsf{RN}[f_k]\) (as described in Fig. 7) in the RND sense. Then, there exists a PRF adversary \({\mathcal {B}}\) (which uses almost the same resources as \({\mathcal {A}}\)) such that

(2)

This theorem asserts that as long as \(f_k()\) is a PRF, the construction achieves the desired security in the RND sense.

7.2 Candidates for \(f_k()\)

\(f_k()\) can be instantiated through standard symmetric key primitives. We discuss three options below:

  1. Stream cipher Modern stream ciphers, such as those in the eStream [16] portfolio, take as input a short secret key K and a short initialization vector (IV) and produce a “long” and random looking string of bits. Let \(\mathsf{SC}_{K} : \{0, 1\}^\ell \rightarrow \{0, 1\}^L\) be a stream cipher with IV, i.e., for every choice of K from a certain pre-defined key space \({\mathcal {K}}\), \(\mathsf{SC}_K\) maps an \(\ell \)-bit IV to an output string of length L bits. The basic idea of security is that for a uniform random K and for distinct inputs \(IV_1,\ldots ,IV_q\), the strings \(\mathsf{SC}_K(IV_1),\ldots ,\mathsf{SC}_K(IV_q)\) should appear to be independent and uniform random to an adversary. This is formalized by requiring a stream cipher to be a PRF. See [3] for further discussion on this issue. Thus, a stream cipher with the above security guarantees can be used to instantiate \(f_k\).

  2. Block cipher A block cipher \(E:{\mathcal {K}} \times \{0,1\}^n\rightarrow \{0,1\}^n\) can also be used to construct \(f_k\) as follows.

    figure a

    The above construction is the same as the counter mode of operation, and if \(E_k\) is assumed to be a PRF, then \(f_k\) as constructed above is also a PRF. In particular, it is easy to verify that the following holds:

Proposition 1

Let \({\mathcal {B}}\) be an arbitrary PRF adversary attacking \(f_k()\) who asks at most q queries, then one can construct a PRF adversary \({\mathcal {B}}'\) for \(E_K()\) such that, \({\mathcal {B}}'\) asks at most mq queries and

  3. True random number generator We end this discussion with another interesting possible instantiation of \(\mathsf{RN}()\). The specific construction depicted in Fig. 7 essentially uses a stream of random bits generated through a pseudorandom function. Recently there has been a lot of interest in designing physical random number generators. Such generators harvest entropy from their “environment” and produce random streams after some post-processing. It has been claimed that such generators are “true random number generators” (TRNG). Such a generator can be used to design \(\mathsf{RN}()\) as in Fig. 7 by replacing \(f_k()\) with a TRNG, and by selecting suitable blocks from the generated stream according to the format requirements of \({\mathcal {T}}\). As a TRNG is keyless, this leads to a keyless construction of \(\mathsf{RN}\); we call such an instantiation \(\mathsf{RN}[\mathsf{TR}]\). Since such a generator provides “true randomness,” the RND advantage of any adversary \({\mathcal {A}}\) against \(\mathsf{RN}[\mathsf{TR}]\) is zero.

From now onwards, where it is necessary, we will denote \(\mathsf{TKR2}\) instantiated with \(\mathsf{RN}[f_k]\) and \(\mathsf{RN}[\mathsf{TR}]\) by \(\mathsf{TKR2}[f_k]\) and \(\mathsf{TKR2}[\mathsf{TR}]\), respectively. Similar convention would be followed for \(\mathsf{TKR2a}\).

7.3 Realizing \(\mathbf{E}_k(d,x)\)

As discussed previously, \(\mathbf {E}_k(\cdot ,\cdot )\) is used to encrypt the PAN, and the encryption is stored in the card-vault within the tokenization system. We do not require this encryption to be format preserving. Here, we discuss two instantiations of \(\mathbf{E}\) using a secure block cipher E. If the block length of E is n, then both the proposed constructions have \(\{0,1\}^n\) as their cipher space, and \({\mathcal {X}}\) and \({\mathcal {D}}\) as their message space and tweak space, respectively. For the constructions, we assume some restrictions on \({\mathcal {X}}\) and \({\mathcal {D}}\), but these restrictions would be satisfied in most practical scenarios.

Let \(E:{\mathcal {K}} \times \{0,1\}^n \rightarrow \{0,1\}^n\) be a block cipher. As we defined before, let \({\mathcal {X}}\) contain strings of fixed length \(\mu \) from an arbitrary alphabet \(\mathsf{AL}\) where \(\#\mathsf{AL} =\ell \) and \(\lambda = \lceil \lg \ell \rceil \). Let \(\#{\mathcal {D}}=\ell _1\) and \(\lambda _1= \lceil \lg \ell _1 \rceil \). Let \(n_1\) and \(n_2\) be positive integers such that \(n_1 \ge \mu \lambda \), \(n_2 \ge \lambda _1\) and \(n_1 + n_2 =n\). Note that for practical choice of \(\mathsf{AL}\), \({\mathcal {D}}\), \(\mu \) and n, such \(n_1,n_2\) can be selected. Let \(\mathsf{pad}_{\mathcal {X}}: {\mathcal {X}} \rightarrow \{0,1\}^n\), \(\mathsf{pad}_{\mathcal {D}}:{\mathcal {D}} \rightarrow \{0,1\}^n\), \(\mathsf{pad}_1:\mathsf{AL}^\mu \rightarrow \{0,1\}^{n_1}\) and \(\mathsf{pad}_2:{\mathcal {D}} \rightarrow \{0,1\}^{n_2}\) be injective functions.

Fig. 8 The two instantiations of \(\mathbf{E}_K\)

The two proposed instantiations of \(\mathbf{E}\) are shown in Fig. 8. Both constructions use a block cipher with a block length of n, and the padding functions defined above. In \(\mathbf{E1}\), the message x and the associated data d are suitably formatted into an n-bit string, and this formatted string is encrypted using the block cipher. \(\mathbf{E2}\) is the same as the construction of a tweakable block cipher proposed in [10]. If \(E_K\) is a secure block cipher in the PRF sense, then both \(\mathbf{E1}_K\) and \(\mathbf{E2}_K\) are DET-CPA secure; we state this formally next.

Proposition 2

Let \({\mathcal {A}}\) be an arbitrary DET-CPA adversary attacking \(\mathbf{E1}\), who asks at most q queries, never repeats a query, and runs for time at most T; then, there exists a PRF adversary \({\mathcal {B}}\) such that

and \({\mathcal {B}}\) also asks exactly q queries and runs for time O(T).

Proposition 3

Let \({\mathcal {A}}\) be an arbitrary DET-CPA adversary attacking \(\mathbf{E2}\), who asks at most q queries, never repeats a query, and runs for time at most T; then, there exists a PRF adversary \({\mathcal {B}}\) such that

and \({\mathcal {B}}\) also asks exactly q queries and runs for time O(T).

The proofs of the above propositions are presented in the “Appendix.” The propositions suggest that \(\mathbf{E1}\) has a better security bound than \(\mathbf{E2}\). Moreover, \(\mathbf{E2}\) requires two block cipher calls for each encryption, whereas \(\mathbf{E1}\) requires only one. On the other hand, the formatting requirements are more stringent for \(\mathbf{E1}\), whereas \(\mathbf{E2}\) can be applied to any message space \({\mathcal {X}}\) and tweak space \({\mathcal {D}}\) with \(\#{\mathcal {X}} \le 2^n\) and \(\#{\mathcal {D}} \le 2^n\).

7.4 Security of TKR2 and TKR2a

The following three theorems specify the security of \(\mathsf{TKR2}\) and \(\mathsf{TKR2a}\).

Theorem 3

Let \(\Psi \in \{\mathsf{TKR2},\mathsf{TKR2a} \}\) and let \({\mathcal {A}}\) be an adversary attacking \(\Psi \) in the IND-TKR sense. Then there exists a RND adversary \({\mathcal {B}}\) (which uses almost the same resources as \({\mathcal {A}}\)) such that

Theorem 4

Let \(\Psi =\mathsf{TKR2a}\) and let \({\mathcal {A}}\) be an adversary attacking \(\Psi \) in the IND-TKR-CV sense. Then, there exist adversaries \({\mathcal {B}}\) and \({\mathcal {B}}'\) (which use almost the same resources as \({\mathcal {A}}\)) such that

where s is the size of the shortest element in the cipher space of \(\mathbf {E}\).

Theorem 5

Let \(\Psi \in \{\mathsf{TKR2}[\mathsf{TR}],\mathsf{TKR2a}[\mathsf{TR}] \}\) and \({\mathcal {A}}\) be an arbitrary adversary attacking \(\Psi \) in the IND-TKR-KEY sense. Then,

The proofs of Theorems 3 and 4 use standard reductionist arguments; we present them in the “Appendix.” Note that when TKR2 and TKR2a are instantiated with a true random number generator, they are keyless schemes; thus, Theorem 5 is immediate.

8 Discussions

Security The security properties of the various schemes as stated in the previous security theorems are summarized in Table 1. The security theorems in all cases are to be interpreted carefully. We note down some relevant issues below.

Table 1 Summary of security

In TKR1, the security derives from the security of the format-preserving encryption. The scheme \(\mathsf{FP}\) used in TKR1 is required to be a tweakable pseudorandom permutation with the message/cipher space \({\mathcal {T}}\) and the tweak space \({\mathcal {D}}\). It is important to note that various instantiations of \(\mathsf{FP}\) can give different security guarantees. Most of the known FPE schemes can only ensure security (in provable terms) when the number of queries made by an adversary is highly restricted. For example, the security claim of the scheme based on Feistel networks discussed in [4] becomes vacuous when the number of queries exceeds \({\#{\mathcal {T}}^{1/4}}\), whereas the scheme in [11] can tolerate up to \({\#{\mathcal {T}}^{1-\epsilon }}\) queries, where \(\epsilon \) is inversely related to the number of rounds in the construction. Some recent constructions in [8, 15] achieve much better bounds; in particular, in [15] almost \(\#{\mathcal {T}}\) queries can be tolerated for the bound to be meaningful. As \(\#{\mathcal {T}}\) can be much smaller than the typical domain of a block cipher (\(2^n\), for \(n=128\)), the exact security guarantees are important in this context. Note that in a typical scenario, where credit card numbers consist of sixteen decimal digits, \(\#{\mathcal {T}} \approx 2^{53}\).

In TKR1, there is no card-vault; hence, trivially it satisfies the IND-TKR-CV notion, but in practical terms, the IND-TKR-CV notion is not applicable in case of TKR1. Also, among the schemes proposed in this paper, TKR1 is the only construction where the tokens bear a relationship with the PAN, i.e., the tokens are encrypted PANs. Thus, TKR1 does not satisfy the ideal notion of tokens being independent of the PANs. But, the security Theorem guarantees that to a computationally bounded adversary (who does not have any knowledge of the key), the tokens would look like random strings. Such computational guarantees for cryptographic schemes are generally enough in most practical applications.

In \(\mathsf{TKR2}\) and \(\mathsf{TKR2a}\), the security bounds are better than those of \(\mathsf{TKR1}\).

If \(\mathsf{RN}[f_k]\) is instantiated as in Fig. 7, and in turn \(f_k\) is constructed using a block cipher, then using Proposition 1 and Theorem 3, for any IND-TKR adversary \({\mathcal {A}}\) who asks at most q queries, we have

where \(\Psi \in \{\mathsf{TKR2},\mathsf{TKR2a}\}\) and \(\epsilon _q\) is the maximum PRF advantage of any adversary (who asks at most q queries) in attacking the block cipher E. Note that n is the block length of the block cipher used to construct \(f_k\), and m depends on \(\#{\mathcal {T}}\): as per the description of the block cipher-based construction in Sect. 7.2, \(m = L/n\), and we discussed that it is enough to take \(L=3\mu \lambda \), where \(\mu \) is the length of each token when the tokens are treated as strings in \(\mathsf{AL}\) and \(\lambda = \lceil \lg \#\mathsf{AL} \rceil \). Thus, the security bound is less sensitive to \(\#{\mathcal {T}}\). The bound only becomes vacuous when mq is of the order of \(2^{n/2}\). A similar bound holds for the IND-TKR-CV advantage against \(\mathsf{TKR2a}\) when a block cipher-based construction for \(f_k\) is used.

The IND-TKR-KEY definition is meant to model the property of independence of the tokens with the keys, and this represents a quite strong notion of security. The constructions \(\mathsf{TKR2}[f_k]\) and \(\mathsf{TKR2a}[f_k]\) do not achieve this security. But \(\mathsf{TKR2}[\mathsf{TR}]\) and \(\mathsf{TKR2a}[\mathsf{TR}]\) achieve security in the IND-TKR-KEY sense as here we are assuming an instantiation by a “true” random number generator.

Efficiency The efficiency of TKR1 depends on the efficiency of the \(\mathsf{FP}\) scheme. As discussed, there are various ways to instantiate \(\mathsf{FP}\) with varying degrees of security and efficiency. Also, most schemes with provable guarantees are far less efficient than standard block ciphers.

The efficiency of \(\mathsf{TKR2}\) and \(\mathsf{TKR2a}\) would be dominated by the search procedure. Asymptotically, if \(\#{\mathcal {T}}= N\), then tokenization and detokenization would take \(O(\lg N)\) time. But the hidden constant would depend on how efficiently the search has been implemented and how powerful the machine is. We discuss more about this in Sect. 9.

9 Experimental results

We performed some preliminary experiments to determine the efficiency and functionalities of the proposed constructions in a practical environment. All experiments reported used the following computing resources:

  • CPU Four-core i5-2400 Intel processor (3.1 GHz).

  • OS Ubuntu 12.04.4 LTS.

  • DataBase PostgreSQL 9.2.6

  • Compiler gcc 4.7.3

We implemented both \(\mathsf{TKR2}[f_k]\) and \(\mathsf{TKR2a}[f_k]\), instantiated with \(\mathsf{RN}[f_k]\) (described in Fig. 7), where \(f_k\) was instantiated with block cipher-based construction described in Sect. 7.2.

We implemented the card-vault in a PostgreSQL database. For \(\mathsf{TKR2}\), we considered the card-vault to be a relation with three attributes: the token (TKN), the associated data (ASD) and the PAN. For this construction, the primary key is composed of the token and the associated data. For \(\mathsf{TKR2a}\), we considered the card-vault to be a relation with two attributes EPAN and ETKN, representing the encrypted PAN and token, respectively. We encrypt these data using the construction \(\mathbf{E1}\) described in Sect. 7.3. In this case, ETKN was considered as the primary key.

Table 2 Summary of the experimental results: the descriptions of Run1, Run2a,\(\cdots \) Run2e are provided in the text

For implementation of \(f_k()\) we used AES with 128 bit key, and the implementation was done by using the new Intel AES-NI instruction set, which provides a very efficient and secure implementation of the AES. We assumed that \({\mathcal {X}}\) contains strings of 16 characters where each character is a decimal digit, and \({\mathcal {T}}= {\mathcal {X}}\). Thus, in accordance with our notations introduced before, we had \(\mu = 16\), \(\mathsf{AL} =\{0,1\ldots , 9\}\), thus \(\lambda = \lceil \lg (\#\mathsf{AL}) \rceil = 4\), and \({\mathcal {X}} ={\mathcal {T}} = \mathsf{AL}^\mu \).

The reported times are based on \(\mathtt{-\!O3}\) optimized code. The time was measured by first counting the number of cycles necessary for a specific operation using the \(\mathtt{rdtsc}\) instruction. These cycle counts were then converted to real time using the processor frequency.

We summarize our experiments and the results below:

  1. The first experiment was to verify how many block cipher calls are necessary for each call of \(f_k()\) and to measure the efficiency of \(\mathsf{RN}^{\mathcal {T}}[f_k]\). In Sect. 7, we discussed that if the range of \(f_k\) is \(\{0,1\}^L\), then \(L = 3\lambda \mu \) would be sufficient. Note that the number of block cipher calls required for each invocation of \(f_k\) is at most \(m = \lceil L/n \rceil \), where n is the block length of the block cipher. We made 1,000,000 independent calls to \(f_k\), and each call required at most two block cipher calls. In fact, two calls were necessary in only \(5\%\) of the cases; in all others, one call was sufficient. The average time required for each invocation of \(\mathsf{RN}^{\mathcal {T}}[f_k]\) was 0.1 microseconds.

  2. The second experiment was to see whether \(\mathsf{TKR2}\) implemented without the uniqueness test (as described in Fig. 4) would be sufficient. Again, we generated 1,000,000 tokens using \(\mathsf{TKR2}\) and they were all unique. Thus, in a practical scenario, where the card-vault would be stored in a database, the uniqueness test (as included in the description in Fig. 5) is not required to be explicitly included. Once a token is generated and the system tries to insert it in the database, if the uniqueness condition is violated then the database would generate an error message, and the process may be repeated until a unique token is generated.

  3. Finally, we measured the efficiency of the tokenization procedures \(\mathsf{TKR2}\) and \(\mathsf{TKR2a}\). In Table 2, we summarize the results, which are described below:

    • Run1 denotes the average time required to generate one token, including the insertion in the card-vault. But here primary keys in the card-vault relations are not specified, i.e., this run does not do any uniqueness test. The average is computed over 1,000,000 tokens.

    • Run2 denotes the scenario where the primary keys are specified, i.e., the database checks for uniqueness. In this case, the time required to tokenize (including the insertions in the card-vault) increases with the current size of the card-vault. To measure this difference, we divided this run into five different runs which we call Run2a, Run2b,\(\cdots \), Run2e. For \(\mathbf{Run2a}\), we started with an empty card-vault, and we generated 1,000,000 tokens. In \(\mathbf{Run2b}\), we started with a card-vault already containing 1,000,000 tokens, and we generated 1,000,000 more tokens. Similarly, in runs Run2c, Run2d and Run2e, we started with a card-vault containing 2,000,000, 3,000,000 and 4,000,000 tokens, respectively. In each run, we generated 1,000,000 more tokens. Table 2 shows the average time required for generating one token for each scenario.
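The database-enforced uniqueness check described in the second experiment can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database from the Python standard library stands in for PostgreSQL, the table and column names are hypothetical, tokens are drawn from the operating system's random source rather than from \(\mathsf{RN}[f_k]\), and the initial lookup that returns an existing token for a known (PAN, associated data) pair is assumed from the remark that determinism is assured by search.

```python
import secrets
import sqlite3
from typing import Callable, Optional


def random_token(mu: int = 16) -> str:
    """Draw mu decimal digits uniformly at random (keyless stand-in for RN)."""
    return "".join(str(secrets.randbelow(10)) for _ in range(mu))


def open_vault() -> sqlite3.Connection:
    """Create a TKR2-style card-vault with primary key (token, associated data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE card_vault ("
        " tkn TEXT NOT NULL, asd TEXT NOT NULL, pan TEXT NOT NULL,"
        " PRIMARY KEY (tkn, asd))"
    )
    return conn


def tokenize(conn: sqlite3.Connection, pan: str, asd: str,
             gen: Callable[[], str] = random_token) -> str:
    """Return the token for (pan, asd), creating one if none exists yet."""
    row = conn.execute(
        "SELECT tkn FROM card_vault WHERE pan = ? AND asd = ?", (pan, asd)
    ).fetchone()
    if row is not None:
        return row[0]
    while True:
        tkn = gen()
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO card_vault (tkn, asd, pan) VALUES (?, ?, ?)",
                    (tkn, asd, pan),
                )
            return tkn
        except sqlite3.IntegrityError:
            continue  # primary key violated: draw another token


def detokenize(conn: sqlite3.Connection, tkn: str, asd: str) -> Optional[str]:
    """Look up the PAN for (tkn, asd); None if the token is unknown."""
    row = conn.execute(
        "SELECT pan FROM card_vault WHERE tkn = ? AND asd = ?", (tkn, asd)
    ).fetchone()
    return None if row is None else row[0]
```

Leaving the uniqueness check to the primary key constraint avoids a separate search before each insertion, which matches the way Run2 is set up.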

The basic component of both \(\mathsf{TKR2}\) and \(\mathsf{TKR2a}\) is the procedure \(\mathsf{RN}^{\mathcal {T}}\). As mentioned, a call to \(\mathsf{RN}^{\mathcal {T}}\) costs only 0.1 microseconds. The times reported in Table 2 (which are in milliseconds) are more realistic, and they show that the database insertions dominate the cost of tokenization. Thus, further optimization in this regard may be possible. Still, our experimental results confirm that the schemes proposed in this work can be implemented and used in a real tokenization environment.

10 Conclusion

We studied the problem of tokenization from a cryptographic viewpoint. We proposed a syntax for the problem and also formulated three different security definitions. These new definitions may help in analyzing existing tokenization systems. We also proposed three constructions for tokenization: TKR1, TKR2 and TKR2a. The constructions TKR2 and TKR2a are particularly interesting, as they demonstrate that tokenization can be achieved without the use of format-preserving encryption. We analyzed all the constructions in light of our security definitions and also provided some preliminary experimental results.

The security definitions formulated in this study consider chosen plaintext adversaries (IND-CPA). Definitions secure against stronger adversaries may be given. A recent document from PCI DSS [14], which was made public after our submission, describes a new categorization of tokens as reversible and irreversible. In the case of reversible tokens (as is the case for all the schemes proposed in this work), detokenization is a sensitive operation and should only be permitted for authorized entities. This brings in the important concern of authentication. Our model does not explicitly consider authentication. The basic structure of tokenization systems depicted in Fig. 1 assumes that the connections of the tokenizer/detokenizer with the point of sale and the card issuer are secure, i.e., they are encrypted and authenticated channels. In such a structure, an IND-CPA style definition should provide adequate security. But it may be possible to incorporate authentication as an inherent component of the security definitions. We plan to explore this possibility.