
17.1 Introduction (by JW)

In 2002, I attended a colloquium in Lisbon to celebrate the 10th anniversary of the United Nations University’s International Institute for Software Technology [1]. Tony Hoare gave a talk on the use of assertions in current Microsoft practice, where they are used to instrument programs as testing probes [40]. He went on to describe the greater benefits that would follow from using a tool to check a program’s adherence to its assertions: the Verifying Compiler. He concluded by saying that building a verifying compiler would be a “splendid opportunity for academic research” and “one of the major challenges of Computing Science in the twenty-first century,” likening it to the Human Genome project.

Tony had invited me to lecture at that year’s Marktoberdorf Summer School. During one of the excursions, he repeated to me his idea for the verifying compiler as a way of galvanising the computer science community into a productive long-term research programme: a Grand Challenge. This, Tony said, was the dream of Jim King’s doctoral thesis [59], a dream abandoned in the 1970s as being well beyond current theorem-proving capabilities. But much progress had been made in the following 30 years, both in hardware capacity and in the software technologies for verification. Tony suggested that the renewed challenge of an automatic verifying compiler could provide a focus for interaction, cross-fertilisation, advancement, and experimental evaluation of all relevant verification technologies. Perhaps by a concerted international effort, we might be able to meet this challenge within 15–25 years. I was now recruited to the cause.

In November that year, there was a UK Grand Challenge Workshop in Edinburgh, where more than 100 proposals were submitted [43]. These proposals were distilled into just seven grand challenges, one of which included the verifying compiler: GC6, Dependable Systems Evolution. As its name suggests, this challenge was based on a very broad understanding of software correctness, and tried to include as wide a community of researchers as possible, spanning the range of interests from full functional correctness through to issues of dependability, where formalisation is difficult, if possible at all. Three threads of activity were launched to progress GC6: software verification, dependability, and evolution. Tony Hoare, Cliff Jones, and Brian Randell tried to maintain the breadth of the grand challenge by emphasising the importance of the work on dependability and evolution [44], but proposals like this are shaped by the availability and enthusiasm of the individuals involved, and the only thread that has so far really taken off has been the one inspired by Tony’s original idea of the verifying compiler. The main activity within this thread has been experimental work on pilot verification projects, as reported in this paper.

The term “verifying compiler” is often misunderstood by researchers, who sometimes hear “verified compiler.” It is also often thought of as just a single tool for a single programming language, probably an idealised academic one at that. But this was never Tony’s intention. The verifying compiler was a cipher for an integrated set of tools checking correctness, in a very broad sense, of a wide range of programming artefacts. In promoting the grand challenge, Tony talked about things that might have surprised his colleagues only a few years before. He talked not just about full functional correctness, but about checking isolated properties and about the subtler notions of robustness and dependability. He talked about tools that were neither sound nor complete, about inter-operability of tools, and about the practical programming languages used in industry. He even talked about testing.

GC6 has now transformed into an international grand challenge: the Verified Software Initiative, led by Tony Hoare and Jay Misra, and its manifesto [42] represents a consensus position that has emerged from a series of national and international meetings, workshops, and conferences. Overviews of the background and objectives may be found in [52, 77, 78]. Surveys of the state of the art are available [41], covering practice and experience in formal methods [8, 80], automated deduction for verification [68], and software model checking [51].

Interest in the VSI’s research agenda has grown from just a few dedicated individuals in 2002 to a distinct community today. There are 55 different international research groups working in the experimental strand alone. Many members of this community are young researchers, making important contributions at early stages of their career. They have their own conference series, Verified Software: Theories, Tools, and Experiments (Zurich 2005 [64], Toronto 2008 [69], and Edinburgh 2010). They have published a series of special journal issues (some are still in press): ACM Computing Surveys, Formal Aspects of Computing [18, 54], Science of Computer Programming [28], Journal of Object Technology, Journal of Universal Computer Science [4], and Software Tools for Technology Transfer. They organise working meetings at leading and specialist conferences: FM Symposium, FLoC, SBMF, ICTAC, ICFEM, ICECCS, and SEFM. They represent six continents: North and South America, Europe, Asia, Australia and Africa.

17.2 The Verified Software Repository

The main focus of the UK’s contribution to the VSI is on building a Verified Software Repository [9, 75], which will eventually contain hundreds of programs and components, amounting to several million lines of code. This will be accompanied by full or partial specifications, designs, test cases, assertions, evolution histories, and other formal and informal documentation. Each program will be mechanically checked by at least one tool, although most will be analysed by a series of tools in a comparative study. The Repository’s programs are selected by the research community as realistically representing the wide diversity of computer applications, including smartcards, embedded software, device drivers, a standard class library, an embedded operating system, a compiler for a useful language (possibly Java Card), a program generator, a communications protocol (possibly TCP/IP), a desktop application, parts of a web service (perhaps Apache), and parts of the verifier itself. The main purpose of the Repository is to advance science, but reusable verified components may well be taken up in real-life application domains.

Verification of repository components already spans a wide spectrum of program properties, from avoidance of specific exceptions such as buffer overflow, through general structural integrity (crash-proofing), continuity of service, security against intrusion, safety, and partial functional correctness, to (at the highest level) total functional correctness [45]. The techniques used are similarly wide ranging: from unit testing, through bounded model checking and partial verification, to fully formal proof. To understand exactly what has been achieved, each claim for a specific level of correctness is accompanied by a clear informal statement of the assumptions and limitations of the proof, and the contribution that it makes to system dependability. The progress of the project can be measured by the degree of automation involved in reaching each level of verification for each module in the Repository. Since the goal of the project is scientific, the ultimate aim is complete automation for every property, a higher standard than a normal engineer or customer would expect.

In the remainder of this introductory section, we describe the status of some early pilot projects that are being used to populate the Repository. Mondex is a smartcard for electronic finance. The Verified Filestore is inspired by a real space-flight application. FreeRTOS is a real-time scheduler that is very widely used in embedded systems. The Cardiac Pacemaker is a real system, and is representative of an important class of medical devices. Microsoft’s Hypervisor is based on one of their future products. Finally, Tokeneer is a security application involving biometrics. These six pilot projects encompass a wide variety of application areas and each poses some important challenges for verification.

17.2.1 Mondex

The following description is based on [81]. In the early 1990s, the National Westminster Bank and Platform Seven (a UK software house) developed a smartcard-based electronic cash system, Mondex, suitable for low-value cash-like transactions, with no third-party involvement, and no cost per transaction. A discussion of the security requirements can be found in [73, 81]; a description of some wider requirements can be found in [2]. It was crucial that the card was secure, otherwise money could be electronically counterfeited, so Platform Seven decided to certify Mondex to one of the very highest standards available at the time: ITSEC Level E6 [46], which approximates to Common Criteria Level EAL7 [14] (see the discussion in Section 17.3). This mandates stringent requirements on software design, development, testing, and documentation procedures. It also mandates the use of formal methods to specify the high-level abstract security policy model and the lower-level concrete architectural design. It requires a formal proof of correspondence between the two, in order to show that the concrete design obeys the abstract security properties. The evaluation was carried out by the Logica Commercial Licensed Evaluation Facility, with key parts subcontracted to the University of York to ensure independence.

The target platform smartcard had an 8-bit microprocessor, a low clock speed, limited memory (256 bytes of dynamic RAM, and a few kilobytes of slower EEPROM), and no built-in operating system support for tasks such as memory management. Power could be withdrawn at any point during the processing of a transaction. Logica was contracted to deliver the specification and proof using Z [71, 79]. They had little difficulty in formalising the concrete architectural design from the existing semi-formal design documents, but the task of producing an abstract security policy model that both captured the desired security properties (in particular, that “no value is created” and that “all value is accounted for”) and provably corresponded to the lower-level specification was much harder. A very small change in the design would have made the abstraction much easier, but was thought to be too expensive to implement, as the parallel implementation work was already well beyond that point. The 200-page proof was carried out entirely by hand, using some novel techniques [72], and revealed a small flaw in one of the minor protocols; this was presented to Platform Seven in the form of a security-compromising scenario. Since this constituted a real security problem, the design was changed to rectify it. The decision not to use mechanical theorem proving was intended to keep costs under control. Recent work (reported below) has shown that this was overly cautious, and that Moore’s Law has swung the balance further in favour of cost-effective mechanical verification.
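The two properties can be read as conservation statements over the abstract world of purses. The following is a sketch only, in our own notation rather than the Z of the project: writing balance(p) for the value held by purse p, lost(p) for the value that purse p has recorded as lost in failed transfers, and primes for the state after an operation,

\[
\textit{no value created:}\qquad \sum_{p} \mathit{balance}'(p) \;\le\; \sum_{p} \mathit{balance}(p)
\]
\[
\textit{all value accounted for:}\qquad \sum_{p} \bigl(\mathit{balance}'(p) + \mathit{lost}'(p)\bigr) \;=\; \sum_{p} \bigl(\mathit{balance}(p) + \mathit{lost}(p)\bigr)
\]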

In 1999, Mondex achieved its ITSEC Level E6 certificate: the very first product ever to do so. As a part of the ITSEC E6 process, the entire Mondex development was additionally subjected to rigorous testing, which was itself evaluated. No errors were found in any part of the system subjected to the use of formal methods.

Mondex was revived in 2006 as a pilot project for the Grand Challenge in Verified Software. The main objective was to test how the state of the art in mechanical verification had moved on in 10 years. Eight groups took up the challenge using the following formal methods (with references to a full discussion of the kinds of analysis that were performed in each case): Alloy [67], ASM [39], Event-B [10], OCL [60], PerfectDeveloper, π-calculus [53], Raise [36], and Z [32]. The cost of mechanising the Z proofs of the original project was 10% of the original development cost, and so did not dominate costs as initially believed. Interestingly, almost all techniques used in the Mondex pilot achieved the same level of automation, producing similar numbers of verification conditions and requiring similar effort (see [54] for a discussion of these similarities).

17.2.2 Verified Filestore

At an early workshop on the Verifying Compiler, Amir Pnueli suggested that we should choose the verification of the Linux kernel as a pilot project. It would be a significant challenge, and would have a lasting impact. Joshi and Holzmann suggested a more modest aim: the verification of the implementation of a subset of the POSIX filestore interface suitable for flash-memory hardware with strict fault-tolerance requirements to be used by forthcoming NASA missions [56]. They required that the system prevent corruption in the presence of unexpected power loss, and that it be able to recover from faults specific to flash hardware (e.g., bad blocks, read errors, bit corruption, and wear-levelling) [35]. The POSIX file-system interface [55] was chosen for four reasons: (i) it is a clean, well-defined, and standard interface that has been stable for many years; (ii) the data structures and algorithms required are well understood; (iii) although only a small part of an operating system, it is complex enough to raise interesting reliability issues, such as unexpected power loss, concurrent access, and data corruption; and (iv) modern information technology is massively dependent on reliable and secure information availability. An initial subset of the POSIX standard has been chosen for the pilot project. There is no support for: (i) file permissions; (ii) hard or symbolic links; or (iii) entities other than files and directories (e.g., pipes and sockets). Adding support for (i) is not difficult and may be done later, whereas support for (ii) and (iii) is more difficult and might be beyond the scope of the challenge. Existing flash-memory file-systems, such as YAFFS2 [50], do not support these features, since they are not usually needed for the functionality of an embedded system.

Freitas and Woodcock have mechanically verified existing Z models of the POSIX API [33] and a higher-level transaction processing API [31, 34]. Freitas has shown how to verify datatypes for the design of operating system kernels [30]. Butterfield, Freitas, and Woodcock have modelled the behaviour of flash memory devices [11, 12, 13]. Butler has specified a tree-structured file system in Event-B [26], and has specified some of the details of the flash file system itself [25]. Mühlberg and Lüttgen have used model checking to verify compiled file-system code [65], and Jackson has used Alloy to produce a relational model of aspects of a flash file system [57]. Ferreira and Oliveira have integrated Alloy, VDM++, and HOL into a tool chain to verify parts of the Intel flash file system [29]. Finally, Kim has explored the flash multi-sector read operation using concolic testing [58].

17.2.3 FreeRTOS

Richard Barry (Wittenstein High Integrity Systems) has proposed the verification of FreeRTOS, their open-source real-time mini-kernel, as a pilot project. It runs on a wide range of different architectures and is used in many commercial embedded systems. There are over 5,000 downloads per month from SourceForge, putting it in the top 250 of SourceForge’s 170,000 codes. It is less than 2,500 lines of pointer-rich code, which makes it small, but very interesting. The first challenge is to analyse the program for structural integrity properties, for example, to prove that its elaborate use of pointers is safe. The second challenge is to make a rational reconstruction of the development of the program, starting from an abstract specification and refining down to working code, with all verification conditions discharged with a high level of automation. These challenges push the current state of the art in both program analysis and refinement of pointer programs.

Déharbe has produced an abstract specification of FreeRTOS [27] and Machado has shown how to generate tests automatically from the code [63]. Craig is working on the formal specification and refinement in Z of more general operating system kernels [22, 23]. Some of his models have been verified by Freitas and Woodcock.

17.2.4 Cardiac Pacemaker

Boston Scientific has released into the public domain the system specification for a previous generation pacemaker, and is offering it as a challenge problem. They have released a specification that defines functions and operating characteristics, identifies system environmental performance parameters, and characterises anticipated uses. This challenge has multiple dimensions and levels. Participants may choose to submit a complete version of the pacemaker software designed to run on specified hardware, just a formal requirements document, or anything in between. McMaster University’s Software Quality Research Laboratory is putting in place a certification framework to simulate the concept of licensing. This will enable the Challenge community to explore the concept of licensing evidence and the role of standards in the production of such software. Furthermore, it will provide a more objective basis for comparison between putative solutions to the Challenge.

Lawson and his colleagues at McMaster University maintain a web page describing the state of the Pacemaker pilot project [61]. It gives details of the pacemaker hardware reference platform, developed by students at the University of Minnesota, based on an 8-bit PIC18F4520 microcontroller. Macedo, Fitzgerald, and Larsen have carried out an incremental development of a distributed real-time model of a cardiac pacing system using VDM [62]. Gomes and Oliveira have specified the Pacemaker in Z, and have carried out proofs of the consistency of their specification using ProofPowerZ [37].

17.2.5 Microsoft Hypervisor

Schulte and Paul initiated work within Microsoft on a hypervisor (a kind of separation kernel), and it has been proposed by Thomas Santen as a challenge project. The European Microsoft Innovation Center is collaborating with German academic partners and the Microsoft Research group for Programming Languages and Methods on the formal verification of the new Microsoft Hypervisor, to be released as part of a new Windows Server product. The Hypervisor will allow multiple guest operating systems to run concurrently on a single hardware platform. By proving the mathematical correctness of the Hypervisor, they aim to control the risks of malicious attack. Cohen has briefly described the Microsoft Hypervisor project [17].

17.3 Pilot project: Tokeneer ID Station

In this section, we describe one of the pilot projects in much more detail: the Tokeneer ID Station (TIS), a project conducted by Praxis High Integrity Systems and SPRE for the US National Security Agency. See [15] for an overview of the system, and [6] for an account of how it was engineered. Tony Hoare has already recorded his opinion of the work carried out: “The Tokeneer project is a milestone in the transfer of program verification technology into industrial application” [74]. A report from the US National Academies [48] refers to several Praxis projects as examples of best practice in software engineering, particularly in the areas of formal methods and programming language design and verification.

The Tokeneer project was originally conceived to supply evidence about whether it is economically feasible to develop systems that can be assured to the higher levels of the Common Criteria Security Evaluation, the ISO/IEC 15408 standard for computer security certification [14]. The standard defines seven levels for evaluating information technology security:

  • EAL7: formally verified design and tested

  • EAL6: semi-formally verified design and tested

  • EAL5: semi-formally designed and tested

  • EAL4: methodically designed, tested, and reviewed

  • EAL3: methodically tested and checked

  • EAL2: structurally tested

  • EAL1: functionally tested

Barnes et al. report that an evaluation in 1998 to what is now understood as EAL4 cost about US$2.5 million [6].

Numerous smartcard devices have been evaluated at EAL5, as have multilevel secure devices such as the Tenix Interactive Link. XTS-400 (STOP 6) is a general-purpose operating system that has been evaluated at an augmented EAL5 level. The Green Hills Software INTEGRITY-178B operating system is the only operating system so far to have been certified at EAL6. The Tenix Interactive Link Data Diode Device has been evaluated at EAL7 augmented, the only product to achieve this.

The problem with these higher levels of the Common Criteria is that industry believes that it is simply too expensive to develop systems to this standard. The argument is a familiar one. In 1997, the UK Government Communications Headquarters held a workshop to discuss the view held in industry that it was too expensive to use formal methods to achieve ITSEC Level E6 [76], approximately EAL7. The Mondex project (see Section 17.2.1) provided evidence to the contrary.

So the objective of the Tokeneer project was to explore the feasibility of developing cost-effective, high-quality, low-defect EAL5 systems, and to provide evidence for both EAL6 and EAL7. It was a rare and valuable opportunity to undertake the controlled measurement of productivity and defect rates. Remarkably, the entire project archive is openly available and may be downloaded from [74].

Praxis have a well-developed software engineering method that addresses not only assurance, but also cost requirements. Their method starts from requirements analysis using their REVEAL technique, and continues with specification and development, using formal methods where appropriate, until an implementation is reached in SPARK, a high-level programming language and toolset designed for writing software for high-integrity applications [7]. They have a successful record of using this method to develop commercial applications with formal methods, with costs reportedly lower than those of traditional manual object-oriented methods.

Tokeneer was the subject of an earlier NSA research project investigating the use of biometrics for physical access control to a secure room containing user workstations (the enclave). The Tokeneer ID Station project was a further development of part of the original system. The key idea is that users have smartcard security tokens that must be used both to gain access to the enclave and to use the workstations once the user is inside. There are smartcard and biometric readers outside the enclave; if a user passes their identity tests, then the door opens for entry. Authorisation information is written onto the card for subsequent workstation access. This information describes the privileges the user enjoys for this visit, including working hours, security clearance, and user roles.

In what follows, it is important to understand the Tokeneer ID Station security target, in order to answer the question, “Is Tokeneer really secure?” The requirements assume that the enclave is situated in a high-security area, and so all the users will have passed a stringent security clearance procedure, either as NSA employees or as accredited visitors. As a consequence, it may be safely assumed that no user will ever attempt a malicious attack on the enclave. Instead, the security measures are intended to prevent accidents: unintentional, unauthorised access to the enclave and the data provided by its workstations.

The overall functionality of the Tokeneer ID Station was formalised in a 100-page Z specification. The code was developed in two parts. The core security-related functionality was implemented in SPARK, and amounts to 9,939 lines of code, with 6,036 lines of flow annotations, 1,999 lines of proof annotations, and 8,529 lines of comments. The remainder of the system was not security critical, and so was developed in Ada95, comprising 3,697 lines of code, no flow or proof annotations, and 2,240 lines of comments. The entire development required 260 man-days, provided by three people working part-time over 9 months.

The task set by NSA was to conform to EAL5. The development actually exceeded EAL5 requirements in several areas, including configuration control, fault management, and testing. The main body of the core development work was carried out to EAL5. But the specification, design, implementation and proof of correspondence were conducted to EAL6 and 7. So why would Praxis do more than they were asked to do? Because they were told that, if they could produce evidence at these higher levels within budget, then they should.

The Tokeneer project archive has been downloaded many hundreds of times. Knight [38] has verified Tokeneer properties using the PVS theorem prover [66]. Jackson is working to broaden the range of properties of SPARK programs that are automatically verifiable, thus speeding up verification and supporting use of richer assertions [49]. Work is underway to re-implement Tokeneer using the PerfectDeveloper system (see [24] for details of PerfectDeveloper). Aydal and Woodcock have been analysing the system to search for attacks [3], work reported below.

But how good was the original development of Tokeneer? In fact, only five defects have been found in Tokeneer since it was deployed within NSA in 2004. We describe each of the defects, reflecting on their causes and significance.

17.3.1 Defect 1

This account is based on that in [15, Section 17.3]. When the Tokeneer code was re-analysed in August 2008, in preparation for the public release of the entire archive, the tool that summarised the proof obligations (the POGS tool) revealed a single undischarged verification condition. Further investigation showed this to be in the subprogram ConfigData.ValidateFile.ReadDuration. The code in question concerns validation of an integer value that is read from a file and is expected to be in the range 0–200 seconds before it is converted into a number of tenths of a second in the range 0–2000. The offending undischarged VC is essentially:

The code is from line 222 of configdata.adb:

This VC clearly has a counterexample. For instance, when RawDuration = 10⁹, H1 and H2 are true, but C1 is false. This reflects the possibility of an integer overflow when multiplying by 10 before the range of RawDuration is checked. The correction to the code is trivial. If replaced by:

then all VCs discharge successfully.
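The archived VC and code are not reproduced above. As a rough illustration only, the following fragment sketches the shape of the problem and of the fix; the subtype declaration and identifiers are guesses based on the description, not the archived configdata.adb source.

   --  Illustrative sketch of Defect 1, not the archived source.
   --  DurationT holds tenths of seconds; RawDuration is read in seconds.
   subtype DurationT is Integer range 0 .. 2000;

   --  Problematic shape: the multiplication by 10 happens before
   --  RawDuration's own range is checked, so a very large input such as
   --  10**9 overflows Integer (10**10 > Integer'Last on a 32-bit target).
   if RawDuration * 10 >= Integer (DurationT'First) and then
      RawDuration * 10 <= Integer (DurationT'Last)
   then
      Value := DurationT (RawDuration * 10);
   end if;

   --  Repaired shape: check RawDuration against 0 .. 200 first, so that
   --  the multiplication is only applied to values that cannot overflow.
   if RawDuration >= Integer (DurationT'First) / 10 and then
      RawDuration <= Integer (DurationT'Last) / 10
   then
      Value := DurationT (RawDuration * 10);
   end if;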

Why was this defect not discovered and reported during the original development? The original project used the SPARK Examiner’s “rtc” switch to generate VCs, which it does for partial correctness and run-time errors, but omits the side-conditions relating to Ada’s Overflow_Check. At the time, the SPARK toolset was limited in its capability to discharge such VCs, so they were omitted from the original project. Subsequently, the SPARK toolset has become far more capable with regard to overflow conditions, through the use of the compiler-dependent configuration file and the base-type assertion for integer types. Users can now generate VCs using the “vcg” switch, which does include VCs for overflow checks. It is interesting to note that this defect was not discovered by any testing during the original project, or by any use of, or attempt to analyse, the system since its initial delivery.

What about the security impact? First, there is a potential denial-of-service attack resulting from this defect: a malicious user holding the “security officer” role can deliberately terminate the TIS core software by supplying a malformed configuration data file, rendering the system unusable. More seriously, the software can be terminated in this fashion with the enclave door open.

17.3.2 Defect 2

The second defect was discovered by Spinellis in October 2008; see his blog [70], where he reports the following. The Tokeneer function SystemFaultOccurred is required to return true exactly when a critical system fault has occurred while attempting to maintain the audit log. The code that implements this uses a global variable, AuditSystemFault, which is set to true whenever a fault is detected. Spinellis lists all the assignments to AuditSystemFault that he found in the code (OK is a variable set by various system functions).

But there is an anomaly in the last assignment, which is used when a log file is deleted. The conjunction, used instead of a disjunction, has the effect of clearing the AuditSystemFault flag if the deletion is successful, and of failing to set it if the deletion fails when no fault had previously been detected. Spinellis found this bug by inspection in less than an hour of browsing. In fact, it was in the second file he looked at, the first being very short.
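The pattern Spinellis describes is easily sketched (an illustration of the pattern only, not the verbatim assignments from the archive):

   --  Most assignments accumulate faults with a disjunction:
   AuditSystemFault := AuditSystemFault or not OK;

   --  The anomalous assignment on the log-file deletion path uses a
   --  conjunction: if the deletion succeeds (OK = True) it clears any
   --  previously recorded fault, and if the deletion fails (OK = False)
   --  it records a fault only when one had already been recorded.
   AuditSystemFault := AuditSystemFault and not OK;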

17.3.3 Defect 3

The third defect was found by the CodePeer tool on or about 24 August 2009. CodePeer (developed jointly by AdaCore and SofCheck) statically analyses Ada programs for a wide range of flaws, including pointer misuse, buffer overflows, numeric overflow or wraparound, division by zero, dead code, unused variables, and race conditions. The tool detected the following error.

The procedure KeyStore.DoFind contains the following code sequence:

The test in the second conditional statement is wrong: it should be

if RetValDo = Interface.Ok then

of course. This is almost certainly a cut-and-paste error from the enclosing conditional statements, but why was it not detected during the original verification process? The procedure call above the offending test assigns to RetValDo, but unfortunately, there is a meaningful reference to RetValDo later in the subprogram, so the SPARK flow-analyser fails to spot this as an ineffective assignment, as might have been expected. The offending code also gives rise to several dead paths through the code, since there are paths with traversal condition

(RetValIni = Interface.Ok) and    -- then branch of outer if statement
(RetValIni /= Interface.Ok)       -- erroneous else branch of inner if statement

But the verification conditions arising from this code are all trivially true, since this procedure has an implicitly true postcondition. Therefore, the SPARK Simplifier does not bother with proof-by-contradiction and so fails to spot the dead paths.
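Putting these observations together, the shape of the defect is roughly as follows. This is a sketch only: the identifiers come from the discussion above, but the surrounding structure of KeyStore.DoFind is guessed.

   if RetValIni = Interface.Ok then
      --  ... a procedure call here assigns to RetValDo ...
      if RetValIni = Interface.Ok then   --  BUG: should test RetValDo
         null;  --  normal processing
      else
         null;  --  dead path: needs RetValIni = Interface.Ok (outer test)
                --  and RetValIni /= Interface.Ok (inner test) at once
      end if;
   end if;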

The discovery of this error has led to the development of a new tool, ZombieScope, to detect dead code. It is similar to the SPARK Simplifier, except that it looks only for contradictory hypotheses, ignoring the conclusions of all VCs. Any VCs that are found to have contradictory hypotheses are flagged as indicating dead paths. The security impact of this bug is not known, as it has yet to be analysed closely enough.

17.3.4 Defect 4

The fourth defect was found using the most recent GNAT compiler, a free, high-quality, complete compiler for Ada95, integrated into the GCC compiler system. The defect was found in AuditLog.AddElementToFile.NameOfType, using the -gnatwa flag (all warnings mode). The offending code is:

GNAT reports that the formal parameter E is not referenced, which is of course quite right: there is no reference to E in the body of that function. The reference to ElementID, which is a global variable that is visible from that scope, should be E here, so it should read:

Why was this defect not found before? It would have been difficult to detect the error during development using the tools available at the time: the much earlier version of GNAT then in use did not implement this warning. The code in question is not even SPARK, but full Ada, since it uses a feature that is not part of SPARK (an array-slice in the assignment), and so the code has to be hidden from the analyser, which would otherwise reject it. Consequently, no flow analysis was conducted at all. The original decision to hide this code was almost certainly a mistake. Bugs like this can creep in without the rigour of the SPARK tools. The bug was also missed in code review, suggesting that the reviewing of hidden units should have been given more attention. If the code were SPARK, the tools would certainly have spotted it: an unreferenced formal parameter is always reported by the SPARK Examiner.

What is the impact on correctness and security? Curiously, none. There is exactly one call to this function, which reads:

So, in this single call, E (the formal parameter) is synonymous with ElementID (the global variable), and there is no foul. But it is a bug waiting to happen should this code ever be called with a different parameter, and the developers would fix it given the chance, so it still qualifies as a defect [16].
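The pattern just described can be sketched as follows. The names E and ElementID come from the discussion above; the types, the body, and the call site are invented for illustration and differ from the archived code, which uses an array-slice assignment.

   function NameOfType (E : ElementT) return String is
   begin
      --  BUG: the body refers to the global variable ElementID; it should
      --  refer to the formal parameter E, which is never referenced.
      return ElementT'Image (ElementID);
   end NameOfType;

   --  The single existing call passes the global as the actual parameter,
   --  so the wrong reference happens to give the right answer:
   Description := NameOfType (E => ElementID);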

17.3.5 Defect 5

The fifth defect was also found by SofCheck’s CodePeer tool, in September 2009, in TokenReader.Poll.CheckCardState. CodePeer reports that the final branch of the case statement

is dead—telling us that CardState can never have the value

InvalidCardState

CodePeer (through some surprisingly clever inter-procedural value propagation) is able to determine that the RawCardState value returned from

TokenReader.Interface.Status

is always in the range 1–6, not 0–6. This can be verified by inspection by following the sequence of calls down the call tree into the support software (in support/tokenapi.adb). The SPARK tools did not detect this, since the dead-path analysis is intra-procedural and based on the contracts of the called units alone.

Is this a bug? Well, not really, but it is an interesting observation that the analysis of the use of subtypes in the code could be improved.
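The closing observation about subtypes can be illustrated with a sketch (assumed names and ranges, not the archived declarations; we assume here that the value 0 is the one that maps to InvalidCardState):

   --  The support software is documented as returning 0 .. 6, but in
   --  fact only ever returns 1 .. 6.
   subtype RawCardStateT is Integer range 0 .. 6;
   subtype ReturnedCardStateT is RawCardStateT range 1 .. 6;

   --  Had TokenReader.Interface.Status been declared to return the
   --  tighter subtype, a purely intra-procedural analysis could have
   --  seen that the InvalidCardState branch of the case statement is
   --  unreachable.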

17.4 A Token Experiment

In this section, we report on an experiment that we performed on Tokeneer. Our motivation was to take a system that has been developed using best practice and to see if there is anything more that we can say about it. In particular, since it is a security-critical system, can we break it? One way of proceeding in our experiment would have been to search for undischarged verification conditions and proof obligations, hoping to find that at least one of them would turn out not to be true.

But a second motivation for the experiment was to revisit the original goals of GC6, and to make a small contribution not just towards understanding the functional correctness of Tokeneer (Does it correctly implement its specification?), but also towards saying something about its dependability (Is it really secure?). For this reason, we decided to try to validate the system against its requirements, and in particular, the security requirement of no accidental access to the enclave or to its workstations.

To do this, we used a novel model-based testing technique that exploits formal methods and tools: assertion-guided model-based robustness testing. The main hypothesis of this study was:

Applying robustness testing in a model-based manner with the use of a separate test model may reveal requirements-related faults that may not necessarily be detected by formal verification techniques.

In addition to this hypothesis, we assumed two test hypotheses that helped us define the test oracle and the test selection algorithm. These hypotheses are valid for very specific kinds of system, and their usefulness must be judged by the quality of the results: have they uncovered any genuine failures? We revisit this point at the end of the paper. We consider first the Redundant Models Hypothesis:

If there are two different specifications of the same System under Test (SuT) that conform to the same set of requirements, then their fault domains must match with respect to some test set T.

By using two models of the same system, independently produced from the requirements of the SuT, and a test suite generated from one of these models, we tried to reveal faults within these models by finding inconsistent behaviours in their fault domains. Having one model for code generation and a separate model for test-case generation not only introduces the redundancy required for the testing process, but also separates concerns whilst producing these models. Additionally, there may be situations where the design model of the system is not available for confidentiality reasons (e.g., in security-critical systems), or where the testing of the system is completely outsourced. For such cases, being able to generate test cases from a separate test model brings flexibility to the testing process.

We used a test selection algorithm based on the satisfaction of assertions characterising the operations of the SuT. In order to make the satisfaction of these assertions more concrete, we introduced the Alternative Scenarios Hypothesis:

For any operation of the SuT, if a pre-state s satisfies the operation’s precondition but a pre-state t does not, then their corresponding post-states must be different.
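Read formally, this is a sketch of the following claim, where Op ranges over the operations of the SuT, s and t over pre-states, pre_Op is the precondition of Op, and post_Op(x) is the post-state that Op produces from x (our notation, introduced here for illustration):

\[
s \models \mathit{pre}_{Op} \;\wedge\; t \not\models \mathit{pre}_{Op} \;\Longrightarrow\; \mathit{post}_{Op}(s) \neq \mathit{post}_{Op}(t)
\]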

This hypothesis is relevant to robust systems, where we want to find situations where the Operation under Test (OuT) has incorrect fault-handling. In general, robustness testing checks that a system can handle unexpected user input or software failures by testing the software outside its expected input range. Thus, by feeding the software with classified unexpected input that fails one precondition at a time, the faults for different situations are uncovered. For the Tokeneer Experiment, a separate test model of the high-integrity variant of Tokeneer was produced in the Alloy modelling language [47]. The test case specifications were produced by falsifying the assertions of the operations modelled, in some order. The test case specifications were then fed into the Alloy Analyzer, and the test cases were generated automatically as counterexamples using SAT-solving technology.

Bearing in mind the high quality of the system under test, including the fact that only five errors have previously been found in Tokeneer since its release, we wish to report a small success in our experiment. We detected nine anomalous behaviours, and we describe four representative scenarios in the rest of this section (see [3] for a more detailed account of how we found the anomalies and for a description of the other scenarios). In Section 17.6, we consider whether these anomalous scenarios really compromise the security of the enclave and its workstations.

17.4.1 Scenario 1

Our first scenario concerns tailgating. The Tokeneer system should be seen as more than the sum of its software and hardware peripherals. It is a socio-technical system, and there are interactions between people and their procedures and the hardware and software, and these have to be considered to get a picture of the entire system. We understand that users must undertake not to “tailgate,” that is, to follow an authorised user into the enclave without being separately authorised themselves. So we can rule out the possibility of deliberate tailgating, since users agree not to do it, and no one has any malicious intent. But we found a scenario that can be explained by accidental tailgating. Consider the following sequence of events.

  1. Miss Moneypenny, an NSA employee, inserts her smartcard into the reader, which checks its validity.

  2. She places her thumb on the fingerprint reader, which checks her identity.

  3. The system authorises Miss Moneypenny, unlocking the door.

  4. Miss Moneypenny enters the enclave, and the door closes behind her.

  5. I am a new NSA employee. Since I have not started to work on a project that needs access to the enclave area, my card does not have access to the enclave, but I am not told which areas my card has access to. I assume that my card has access to all areas.

  6. I have eyes only for Miss Moneypenny, and as a result I look anxiously at the door to get a glimpse of her.

  7. I insert my card into the reader.

  8. The screen says

     ENTRY DENIED

     but I am not paying attention. There is no provision for audible alarms.

  9. I place my thumb on the fingerprint reader, but again I miss the error message.

  10. Anyone watching my actions would be satisfied that I appear to be following authorised procedures, and so would have no reason to be suspicious.

  11. The door is still unlocked following Miss Moneypenny’s entry, and it stays unlocked for a period known as the latch-unlock duration. I enter the enclave during this period, and there is no way of detecting my unauthorised entry.

The scenario may be thought rather far-fetched and scarcely credible. It relies on two mistakes: my not noticing either of the error messages. But it does show how I could accidentally gain access to the enclave, even though this appears to be an unlikely occurrence. Perhaps more interestingly, it reveals something about the implementation of the system.

  1. Audible alarms could profitably be used to draw attention to the authorisation failures of the card and fingerprint readers.

  2. The no-tailgating rule could be enforced with hardware (a physical turnstile or a pair of “airlock” doors), or checked with a video entry-detection system.

  3. There is a vulnerability offered by the door being unlocked during the latch-unlock duration. There are two issues here.

     (a) The duration must be long enough to allow Miss Moneypenny to enter, but not so long as to give others the possibility to tailgate; this seems impossible to get right.

     (b) Errors triggered during attempted authorisations do not prematurely end the latch-unlock duration. This is a missed opportunity.

17.4.2 Scenario 2

Suppose now that the security officer decides to shorten the latch-unlock duration, perhaps as a result of discovering my antics in Scenario 1. The security officer needs to update a configuration file on one of the workstations within the enclave, and suppose that he wants to decrease the duration from 30s to 15s.

  1. The security officer modifies the configuration file, prepares the configuration data, and writes it to a floppy disk.

  2. The security officer successfully authorises his entry and then enters the enclave.

  3. He successfully authorises his use of a workstation.

  4. He logs in, inserts the floppy disk, and enters the update command, but then he sees the following:

     read/write error

  5. The security officer is uncertain what this means, so he then checks the screen showing the new configuration file. It says very clearly:

     Latch-Unlock Duration = 15s

     and he is satisfied that the update really has taken place.

  6. Working hours currently stop at 17:00, and entry to the enclave should be forbidden after this time.

  7. But I accidentally misread the time, thinking it to be 16:45, when it is really 17:45. This is probably due to my having returned from a foreign trip, and having made a mistake adjusting between time zones.

  8. I successfully authorise my entry at 17:45 and enter the enclave.

  9. I work until 18:00, an hour later than permitted.

This scenario is rather puzzling at first sight: how could I have been authorised to enter the enclave in Step 8, clearly outside working hours? To answer this, we need to know a little bit more about how Tokeneer works. There is a default configuration file on the system that is used in case of an update failure. When the system detects a read/write error while trying to read the floppy disk or when the file on the floppy disk is incorrectly formatted, it is forced to use this default configuration file. The default file is inaccessible, presumably to prevent accidental interference. It just so happens in our scenario that the default value for the latch-unlock duration is the same as that required by the security officer (15s), and this coincidence persuades him in Step 5 that the update has taken place correctly. But of course, all the other settings now assume their default values. In our scenario, this gives the later working time of 18:00, and so explains my entry in Step 8.

So I have gained accidental entry to the enclave and accidental access to a workstation. Is this an unlikely scenario? It depends on how carefully the configuration data file is prepared and whether or not the floppy disk drive is working properly, and it may well be an ageing device with reliability problems. But if a failure does occur, then it is made worse by a human–computer interface problem: the consequences of the error are not made clear to the security officer.

17.4.3 Scenario 3

The third scenario also concerns configuration data. The system logs information about workstation usage in an audit file, and an alarm is signalled when the file reaches the minimum log-size for sounding the alarm.

  1. The number of users and tasks increases, with the audit log filling up rapidly.

  2. The audit manager visits the enclave several times a day to archive the log.

  3. The security officer agrees to raise the threshold to reduce the number of alarms.

  4. He copies new configuration data to a floppy disk, authorises his entry, and enters the enclave.

  5. He logs in to a workstation and updates the system.

  6. To check that the update has occurred properly, he logs out, logs back in again, and checks the minimum audit-log size. It has the value he requires, so he logs out and leaves the enclave.

  7. But the alarm is triggered once more by the old threshold.

  8. He discovers later that the update seems to take effect only after the system is rebooted, following a public holiday.

The triggering of the alarm in Step 7 is puzzling. The security officer assured himself in Step 6 that the update had taken place satisfactorily, but later discovers that this doesn’t seem to be the case. Again, we need to know more about how Tokeneer works. Although the new file is introduced to the system and can be viewed on the screen when requested, the actual configuration data used for alerting stays the same until the next system start-up. The same thing also happens when a disk is inserted into the system drive for an admin operation: if the disk is not inserted at start-up time, then the system does not recognise it.

This scenario affects the usability of the system, as too many alarms deny legitimate users the service they require from the system. It is a rather surprising behaviour, and may constitute an accidental denial-of-service attack.

17.4.4 Scenario 4

Our final scenario also concerns configuration files.

  1. The security officer wants to change the closing time from 17:00 to 16:00.

  2. He prepares a configuration file and writes it to a floppy disk.

  3. But he accidentally forgets to erase an old configuration file.

  4. He enters the enclave and updates the system.

  5. I enter the enclave at 16:30, half an hour after the new closing time.

Again the scenario is puzzling, and to explain it we need to know how Tokeneer deals with configuration files. The update function checks the validity of the configuration files in the order in which it finds them on the floppy disk. As there was an older configuration file on the disk, that is the one that was used. In our scenario, it was the older file that had the later closing time, creating the problem. Once more, I have gained accidental access to both the enclave and its workstations.

17.5 Analysis

All these stories were revealed by testing 12 operations. The operations realise their intended functionality when all of their preconditions are satisfied, at least for one pre-state. However, it is only when the test cases exercise system behaviour beyond the anticipated operational envelope that stories such as the ones given in the previous sections are uncovered. The next subsection discusses these scenarios in the context of the documentation produced by Praxis-HIS and SPRE during the development of the high-integrity variant of the Tokeneer IDS.

To understand the relevance and validity of these stories, we must analyse them with respect to the documentation provided, which explains expected system behaviour and expected security properties. Section 17.5.1 provides information about the security model of the TIS, and evaluates the findings accordingly.

17.5.1 Comparison with Security Model and Requirements

This section compares the scenarios with the system requirements document [19, 21] and the security model of the re-developed version of TIS [20]. In [20], the security model of the TIS is identified with the following six security properties.

  1. If the latch is unlocked by the TIS, then the TIS must be in possession of either a User Token or an Admin Token. The User Token must have valid ID, Privilege, and I&A Certificates, and either have a valid Authorisation Certificate or have a template that allowed the TIS to successfully validate the user’s fingerprint. Or, if the User Token does not meet this, the Admin Token must have a valid Authorisation Certificate, with role of guard.

  2. If the latch is unlocked automatically by the TIS, then the current time must be close to being within the allowed entry period defined for the User requesting access. The term close needs to be defined, but is intended to allow a period of grace between checking that access is allowed and actually unlocking the latch. Automatically refers to the latch being unlocked by the system in response to a user token insertion, rather than being manually unlocked by the guard.

  3. An alarm will be raised whenever the door/latch is insecure. Insecure is defined to mean the latch is locked, the door is open, and too much time has passed since the last explicit request to lock the latch.

  4. No audit data is lost without an audit alarm being raised.

  5. The presence of an audit record of one type (e.g., recording the unlocking of the latch) will always be preceded by certain other audit records (e.g., recording the successful checking of certificates, fingerprints, etc.). Such a property would need to be defined in detail, explaining the data relationship rules exactly for each case.

  6. The configuration data will be changed, or information written to the floppy, only if there is an administrator logged on to the TIS.

The first and second properties do not protect the TIS against situations such as the one given in Scenario 1. Here are some of the root causes of these stories:

  • None of these properties imposes the condition that the owner of the card shall be the person entering the enclave.

  • There is no way of checking whether the person who is authorised outside the enclave actually enters the enclave after removing his/her card from the card reader.

  • It is difficult to keep track of who is inside the enclave without an exit point, which was included in the actual Tokeneer system specification, but excluded from the re-developed version of the TIS.

  • No action is taken after an access denial, even if the door is open.

One might argue that some of these statements are not stated as requirements of the high-integrity variant of the TIS, and that some of them are even explicitly excluded in the documentation. However, this does not affect the validity of the stories mentioned above. Therefore, if any of them is investigated in the future, the root causes listed above must be considered whilst looking for a solution.

Another discussion item relevant to one of the stories is Security Property 6. The property states that the configuration data will be changed only by an administrator (more specifically, by a Security Officer), and that information (log files) will be written to the floppy only by an administrator (more specifically, by the Audit Manager). As shown in Scenario 2, in the case of an invalid configuration file, the system replaces the current configuration file with a default one. In other words, if the security officer attempts to update the configuration file with an invalid file by mistake, it is the system that updates the file with some default file without the control of the security officer. By taking such an action without the approval of the security officer, the system actually breaks one of the security properties stated earlier. The Security Properties Document of the TIS [19] declares that the proof of this security property is missing, and therefore we do not discuss this item any further. However, it is open to discussion whether more attention should be given to the correctness of a file that determines how long the door to a secure enclave may be kept open, how long its latch may be kept unlocked, when the enclave is available to users, and so on.

Regarding system faults and the alarms raised, [5] states that system faults are warnings, with the exception of the critical faults, listed as failure to control the latch, failure to monitor the door, and failure to write to the Audit Log. It is also stated as a requirement that the system shall continue to function following a system fault categorised as a warning, and shall raise an alarm following a critical fault. During the test-case execution process, it was not possible to concretise the test cases that require a critical fault, so we are not in a position to state the behaviour of the system under these circumstances. However, there were test cases that required the alarm to be raised for other causes, such as the door being kept open longer than allowed or the audit file size exceeding the limit specified in the configuration file. The system continues to perform normally in such cases, even though an alarm is raised. Our concern is that the system may behave similarly for the critical faults mentioned above, since the only requirement specified following a critical fault is to raise an alarm. The security of the system may not be compromised if the log file is too large, but a failure to control the latch would certainly create a risk, and therefore we believe that the measures taken for the latter case should be more than just raising an alarm.

The next section relates our findings to the system-level testing previously carried out by SPRE Inc., and draws some conclusions.

17.6 Conclusions

Is Tokeneer really secure? Of course it is! It seems very unlikely that any of the accidents described by our scenarios could really happen and compromise the security of the enclave and its workstations, particularly as an administrator is needed for three of the four scenarios. But these are interesting scenarios nonetheless, and they were overlooked both by the formal development and by system testing. (In fact, the system testing performed by SPRE did detect Scenario 2, and the case was closed as solved; in our tests, however, which were generated by a completely different technique, the behaviour persisted.)

Additionally, these scenarios may be useful in designing similar systems in the future, as they raise questions about configuration files, audit logs, and alarms that may have been overlooked this time. They might also be useful if the system were to evolve to offer stronger security guarantees, including protection against malicious intent.

The lesson of our experiment is that there is value in diversity: an alternative approach gives us an opportunity to think laterally. Lateral thinking, in de Bono’s sense, means approaching a problem from a different direction: here, instead of arguing for the correctness of Tokeneer, we sought errors that would contradict that claim of correctness. Even when a considerable amount of effort has been invested in one approach, in this case the application of formal methods, lateral thinking, provided by a completely different but principled set of techniques and tools, can still challenge the claim of correctness.