3.1 Introduction

CRISPR/Cas is an adaptive immune system of archaea and bacteria, providing a defense against invading plasmids and viruses (Garneau et al. 2010). Natural CRISPR/Cas systems consist of three core components:

  • An array of repeats encompassing unique sequences called spacers

  • A promoter sequence upstream of CRISPR arrays

  • An operon encoding a set of effector Cas proteins, essential for processing information coded within arrays

Native CRISPR/Cas defense proceeds in three stages: adaptation (acquisition), expression (biogenesis), and interference. During acquisition, a fragment of a foreign genetic element (a “protospacer”) is cleaved out and incorporated into the CRISPR locus as a new spacer. In biogenesis, the array is expressed as precursor CRISPR RNA (pre-crRNA) and subsequently processed into mature crRNA. Finally, in the interference stage, Cas endonucleases cleave the invading double-stranded DNA using crRNA as a guide sequence (Brazelton et al. 2015). Multiple studies have confirmed that the adaptation and interference stages also require a protospacer adjacent motif (PAM) in the immediate vicinity of the protospacer (Fig. 3.1).

Fig. 3.1 Mechanism of a natural CRISPR/Cas9 system: (a) acquisition; (b) expression; and (c) interference

Based on effector Cas protein organization and non-coding RNA species architecture, CRISPR/Cas systems have been classified into two main classes and six types (Lino et al. 2018). Class 1 systems are defined by multiple Cas proteins acting together as an effector complex, e.g., Cascade or Cas module-RAMP (repeat-associated mysterious proteins), i.e., Cmr, complexes. In contrast, class 2 systems are compact and utilize a single effector Cas protein. For detailed classification of CRISPR/Cas systems, see Chap. 2. Due to their compact architecture and single effector Cas protein, class 2 systems have been adopted for genome editing applications in eukaryotes (Jinek et al. 2013; Makarova and Koonin 2015; Mali et al. 2013). Cas9 from Streptococcus pyogenes (SpCas9) requires a non-coding RNA known as transactivating crRNA (tracrRNA) in addition to crRNA. In today’s genome editing applications, these two non-coding RNAs are synthetically fused into a single guide RNA (sgRNA) (Alkhnbashi et al. 2020). An sgRNA in an engineered CRISPR/Cas9 system therefore consists of a constant part and a programmable part. The programmable part can be tailored to direct Cas9 to almost any PAM-adjacent site in the genome. The target site in DNA consists of a 20-nucleotide (nt)-long region complementary to the sgRNA plus a PAM sequence (NGG for SpCas9 and TTTV for Cpf1) (Table 3.1). If there is no PAM adjacent to the target site, the Cas endonuclease will not cleave it. If the sgRNA pairs with a DNA target sequence that is adjacent to a PAM, the Cas endonuclease can create a double-stranded break (DSB) at the target site. The DSB is then repaired by either non-homologous end joining (NHEJ) or homology-directed repair (HDR) (Tian et al. 2017) (Fig. 3.2). Target sites are therefore not selected randomly; each must be associated with a PAM that is present in the target DNA but not included in the sgRNA. Bacteria use the PAM to differentiate between self and non-self: the PAM flanks protospacers in phage DNA but is absent from the spacers in the bacterium's own CRISPR array, which protects the array from cleavage (Fig. 3.3).
With this simple and straightforward design, CRISPR/Cas can be programmed to target virtually any PAM-adjacent sequence in the genome. However, this simple, two-component (sgRNA and PAM) process also has disadvantages: sequences identical or closely similar to the sgRNA target may occur at multiple locations, and some of them may be tolerated by the Cas endonuclease, leading to so-called off-targets (Cui et al. 2018). The Cas endonuclease may also tolerate specific sequence changes in the PAM. For example, while SpCas9 specifically recognizes NGG (where N is any nucleotide and G is guanine), it may also recognize NAG (where A is adenine), albeit less efficiently (Thomas et al. 2019). Reducing the number of potential off-target sites is critical for improving CRISPR/Cas specificity, especially in human therapeutic applications, germline modifications, and genome editing for important agricultural purposes.

Table 3.1 PAM sequence, cutting site, and sgRNA length requirement for different Cas proteins

Fig. 3.2 Repair mechanisms for CRISPR/Cas-induced DSB. NHEJ and HDR are the two main DSB repair mechanisms

Fig. 3.3 CRISPR/Cas9 components. The seed region of the sgRNA (the 12-nt region proximal to the PAM, which is sensitive to mismatches) and the PAM sequence for Cas9

The rapid rise in CRISPR/Cas applications has prompted researchers to devise bioinformatic tools using different algorithms and design rules for effective sgRNA design, specific targeted modification, and low off-targets. Such tools facilitate gRNA design with maximum on-target efficiency in available genomes with user-defined PAM sequence and Cas endonuclease (Cui et al. 2018). Many design tools exist, but all have their own individual strengths and limitations. Most vary in terms of design parameters, specifications, available genomes, on-target efficiency score, off-target predictions, and so on. For example, design tools such as CRISPR-P (Li and Durbin 2009), E-CRISPR (Heigwer et al. 2014), CasOT (Xiao et al. 2014), and Cas-OFFinder (Bae et al. 2014) were mainly developed to predict off-targets in CRISPR/Cas experiments. However, in CRISPR/Cas applications such as CRISPR screening, cleavage efficiency is also important (Ma et al. 2016). Therefore, design tools such as sgRNA Designer, CRISPR-ERA (Liu et al. 2015), and Benchling predict on-targets as well as off-targets. Other genomic features such as sgRNA guanine-cytosine (GC) content, PAM flanking sequences, chromatin structure, methylation status, regulatory potential, and evolutionary conservation are also important in sgRNA design (Shi et al. 2015). Another critical factor in designing an efficient sgRNA is the application-specific (knockout (KO), knock-in (KI), CRISPR interference (CRISPRi), CRISPR activation (CRISPRa), and base editing) location of sgRNA in the genome. “WeReview: CRISPR Tools” is an online, live repository which helps researchers choose the best and latest tools for CRISPR/Cas applications (Torres-Perez et al. 2019). The current chapter aims to help researchers select the most useful tools for sgRNA design with maximum specificity and limited off-targets. This chapter also seeks to help users who are designing sgRNA with application-specific parameters in CRISPR/Cas.

3.2 Fundamentals of CRISPR/Cas Experiment and sgRNA Design

An engineered CRISPR/Cas system relies on the sgRNA and the PAM to modify a target site in the genome. The prerequisites for designing an efficient sgRNA are:

  1. Target gene and target region

  2. Specific Cas endonuclease (e.g., Cas9, Cas9 nickase (nCas9), nuclease-dead Cas9 (dCas9), Cpf1) and an appropriate PAM for the Cas endonuclease

  3. Promoter selection for in vivo or in vitro expression of sgRNA

  4. Cloning strategy for sgRNA, e.g., sgRNA cloned in expression vector or used as template for RNA production

  5. For multiple gRNA, whether expressed from a single promoter or individual promoters

Also important for sgRNA design are application-specific parameters (e.g., for KO, KI, CRISPRi, CRISPRa, and base editing) coupled with the intended DSB repair system. For example, in KO applications, off-targets on other chromosomes may be cleared by backcrossing. Moreover, the sgRNA position for CRISPRi and CRISPRa applications would be different to that for KO and KI applications. In addition, two or more sgRNAs are required in some applications, such as two sgRNAs with nCas9, a pair of sgRNAs in CRISPRa, and a pair of distal sgRNAs in KI applications (Mohr et al. 2016). Here we summarize the essentials of an effective sgRNA for different CRISPR/Cas systems.

3.2.1 Good Gene Annotation: An Essential Requirement

From a genome editing perspective, good gene annotation is a prerequisite for designing an appropriate sgRNA. Online databases and tools are available to help designers view sgRNA in a relevant genome browser, as successful editing in most CRISPR/Cas applications depends upon gRNA positioning relative to specific features of the gene. For example, in CRISPRa, the sgRNA must be located within 50–500 bp of the transcription start site (TSS), whereas in CRISPRi, the gRNA should lie close to the TSS. For KO applications using NHEJ, appropriate target regions may include a common coding exon, while in KI, a specific coding exon, intron, or a region coding for a protein domain could be appropriate (Gilbert et al. 2014; Shalem et al. 2014; Wang et al. 2014; Shi et al. 2015). High-quality genome databases with regularly updated gene annotations based on experimental data are available for model organisms such as Drosophila, zebrafish, mouse, rat, and Arabidopsis. These databases support informed design of gRNA relative to the position of gene features. However, in non-model species, the lack of genome databases with appropriate gene annotations limits the design of specific gRNA (Mohr et al. 2016).

3.2.2 Different Guidelines for Different Applications

With rapid development in CRISPR/Cas systems has come the development of bioinformatic tools and algorithms to predict on-target efficiency, as well as off-targets. Off-target tools mostly focus on sequence similarity with on-target sites and use a defined cut-off for possible number of mismatches that can be tolerated. However, even for off-target sites with mismatches, creating a bulge or gap sometimes leads to a valid target site for a DSB. Although several tools can predict off-targets, it is not feasible to apply those rules for every gRNA and every application. Some rules for gRNA effectiveness are not relevant to all CRISPR/Cas applications or even the same application in different species (Mohr et al. 2016). For example, a CRISPRi application in Escherichia coli showed that gRNA must target the non-template strand (also called the coding strand or sense strand) (Qi et al. 2013), but similar studies in eukaryotes showed that gRNA binding to either strand is effective. Moreover, as compared with KO applications, off-target effects will be of less concern in CRISPRi and CRISPRa applications, because binding may not be within effective range of the promoter sequence (Mohr et al. 2016). A recent study showed that sgRNA effectiveness parameters for cleavage efficiency in CRISPRi were not valid for CRISPRa applications (Doench et al. 2016). This suggests that different applications require different design principles. However, it is not yet clear to what extent general design rules are relevant to various applications or to what extent optional parameters will be required for a particular species, tissue, or cell.

3.2.3 Best Design Linked with Availability of More Data

Improvements in CRISPR/Cas design require more data to be available. When designing sgRNA, researchers must be aware of the design tool’s criteria for maximizing specificity and limiting off-targets. Researchers must also know the background of the design criteria: the study, species, delivery method, and specific applications from which a particular parameter was derived (Mohr et al. 2016). Sharing results and data from good designs and poor ones, along with species information and specific applications, will help researchers to continue improving the design and efficiency of CRISPR/Cas systems. In addition, information and data sharing will help researchers better understand the universal and application-specific factors that influence the effective design of sgRNA.

3.3 sgRNA Design Process: An Overview

The key aspect of sgRNA design is to define the target site in the genome. This can easily be done by locating the PAM sequence (NGG for SpCas9 and TTTV for Cpf1) in the target region or gene. The PAM sequences recognized by different Cas endonucleases are listed in Table 3.1. Theoretically, if the 20 nt at the 5′ end of the sgRNA pair with a complementary, PAM-adjacent target site in the genome, the sgRNA/Cas9 complex will create a DSB. However, several practical studies have suggested that cleavage efficiency varies significantly among different gRNAs. Predictive models and algorithms are therefore essential for selecting high-efficiency gRNAs with limited off-targets. An additional challenge in CRISPR applications is off-target activity caused by both sgRNA and Cas9. Several studies have confirmed that CRISPR/Cas9 can tolerate several mismatches and cleave the DNA at sites other than the intended site of modification (target site), leading to off-target mutations. Although SpCas9 recognizes 5′-NGG-3′ as its PAM, it can also recognize 5′-NAG-3′ and 5′-NGA-3′, albeit with low efficiency. Many models and computational tools are available to help researchers design an effective gRNA with high efficiency and specificity (Cui et al. 2018). In the following section, we present an overview of the design process in CRISPR/Cas applications.
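The PAM search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm of any particular design tool: the function name find_spcas9_sites and its return format are assumptions introduced here, only the canonical NGG PAM of SpCas9 is considered, and no efficiency or off-target scoring is attempted.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_spcas9_sites(seq, guide_len=20):
    """List candidate SpCas9 sites (protospacer followed by NGG) on both strands.

    Returns tuples (strand, start, protospacer, pam), where start is the
    forward-strand index of the leftmost base of the protospacer + PAM site.
    The function name and output layout are hypothetical.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    sites = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(template):
            start = match.start()
            if strand == "-":
                # Convert the reverse-complement coordinate to the forward strand.
                start = len(seq) - (match.start() + guide_len + 3)
            sites.append((strand, start, match.group(1), match.group(2)))
    return sites


example = "GATTACAGATTACAGATTACAGGTTT"
for site in find_spcas9_sites(example):
    print(site)
```

Each reported protospacer is only a candidate; the sections below describe the additional criteria used to rank such candidates.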

3.3.1 Selection of Desired Genetic Modification

The first step in the design process is to define the desired genetic modification, e.g., KO, point mutation, transcriptional control, or KI. Because different modifications require different CRISPR/Cas reagents, a clear understanding of the desired genetic manipulation will narrow down the selection of appropriate CRISPR/Cas components (Thomas et al. 2019). Although a broad range of CRISPR reagents and components exists, it is better to customize these components when no ideal reagent is available for the chosen application.

3.3.2 Choice of Appropriate Expression System

To achieve the desired objective in a CRISPR/Cas experiment, Cas9 and gRNA must be expressed in the target cells or organism. Factors that can affect the desired modification, off-target numbers, and efficiency include the selected expression system (transient or stable), promoter choice (constitutive or tissue specific), reagents (plasmid, mRNA or RNPs), and delivery systems (viral, non-viral, or physical) (Graham and Root 2015). Standard protocols and reagents may suffice for CRISPR/Cas applications in easy-to-transfect cell lines, e.g., HEK293 (Banan 2020).

3.3.3 Selection of Appropriate Cas Endonuclease

Of the two classes of CRISPR/Cas systems described above, Class 1 systems use multiple Cas proteins, while Class 2 systems use a single effector Cas protein to create a DSB in the target DNA. Choosing the right Cas endonuclease is essential. Cas9 and Cpf1 (Cas12a), the two most widely used Cas endonucleases, both belong to Class 2. Cas9 is a type II endonuclease that recognizes NGG as its PAM and creates a blunt-ended DSB 3 bp upstream of the PAM. Multiple engineered Cas9 variants have been generated: for example, nCas9 produces single-stranded breaks (SSB), while dCas9 is used for site-specific binding of DNA. In contrast, Cpf1 is a type V endonuclease that recognizes a TTTV PAM (Table 3.1). Cpf1 cleaves 18–23 bp away from the PAM and produces staggered ends with 5′ overhangs. Because Cpf1 is smaller than SpCas9, it is easier to package into viral vectors for delivery. The choice of Cas endonuclease therefore depends upon the desired modification (Luo 2019).

3.3.4 Selection of Gene or Genetic Element

To manipulate a gene with a particular CRISPR application, a researcher must first identify the target gene’s genomic sequence. The choice of target region (promoter, exon, or intron) within the gene depends upon the desired genetic modification. For example, for KO applications, a constitutively expressed exon near the 5′ end of the gene is the best target. Alternatively, the gRNA can be targeted to an exon that encodes an essential protein domain. For HDR applications, the target sequence should be in close proximity (within 10 bp) to the desired edit site.
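The 10-bp proximity rule for HDR can be checked with simple coordinate arithmetic once the Cas9 cut position is known. The sketch below assumes SpCas9 with a 20-nt protospacer and a blunt cut 3 bp from the PAM; the function names and the coordinate convention (0-based forward-strand indices) are assumptions made for illustration only.

```python
def cas9_cut_position(site_start, strand, guide_len=20):
    """Forward-strand index of the first base to the right of the blunt cut.

    site_start is the 0-based forward-strand index of the leftmost base of the
    (guide_len + 3)-nt site, i.e., protospacer plus NGG PAM (hypothetical
    convention chosen for this sketch).
    """
    if strand == "+":
        # Protospacer comes first, PAM last: the cut lies 3 bp before the PAM.
        return site_start + guide_len - 3
    if strand == "-":
        # The PAM appears first (as CCN) on the forward strand:
        # the cut lies 3 bp beyond the 3-nt PAM.
        return site_start + 3 + 3
    raise ValueError("strand must be '+' or '-'")


def within_hdr_window(cut_position, edit_position, max_distance=10):
    """True if the intended edit lies within max_distance bp of the cut."""
    return abs(cut_position - edit_position) <= max_distance


cut = cas9_cut_position(100, "+")
print(cut, within_hdr_window(cut, 125), within_hdr_window(cut, 130))
```

The 10-bp default mirrors the guideline stated above; other thresholds can be passed through max_distance.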

3.3.5 Searching of Target Site for Intended Gene Modification

Most CRISPR/Cas design tools search target regions using either a sequence-based or a genome-based approach. In sequence-based searching, the user must input the sequence to define the target site for gRNA design. The CRISPOR design tool searches on sequence and requires an input of <2000 nt for gRNA design and display. In a genome-based approach, the user must provide a gene name, ID, or similar input to display gRNA relative to the gene features. For example, the WGE (Wellcome Sanger Institute genome editing) tool requires a gene symbol in order to display sgRNA relative to the gene features (Thomas et al. 2019).

3.3.6 Sequencing of Target Site and Design of sgRNA

Once the desired manipulation, expression system, Cas endonuclease, and CRISPR reagents have been decided, the next step is to confirm the target site and design the sgRNA. sgRNA design is a prime concern in CRISPR applications. Because features of the target DNA site affect sgRNA efficiency, it is better to sequence the target region before designing the gRNA: sequence variation in the target region can create mismatches with the gRNA and reduce cleavage efficiency. Most CRISPR/Cas applications require an efficient and specific sgRNA, but this is challenging because many criteria must be satisfied, so clear design criteria are very important for identifying the most suitable gRNA. Various sequence features influence the efficiency of gRNA. For example, a guanine (G) at the 5′ end of the sgRNA (GX19NGG) is crucial for expression from the U6 promoter. G is also required at the first or second position adjacent to the PAM, probably for loading of Cas9, whereas cytosine (C) at this position is not favored. Thymine (T) at the fourth position from the PAM is also undesirable, because the presence of multiple uracils (U) decreases sgRNA expression. Adenine (A) is suitable in the middle region of the gRNA, and G is preferred in the distal region of the sgRNA. Overall, A and G make the sgRNA more stable and more efficient. In addition to gRNA sequence features, features of the PAM also affect sgRNA efficiency. For example, at the variable nucleotide N of NGG for SpCas9, C is preferred, while T is not favored. Moreover, the sgRNA sequence preferences for Cas9-mediated cleavage are quite different from those in dCas9-mediated applications. In dCas9-mediated CRISPRi and CRISPRa, a 19-nt sgRNA showed the highest efficiency, in contrast to the 20-nt or truncated 17–18-nt sgRNAs used with Cas9. Finally, the seed region of the sgRNA is of key importance in CRISPR/Cas9, while all sgRNA nucleotides contribute to gRNA efficiency in CRISPR/dCas9.
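Some of the sequence preferences listed above can be expressed as simple checks. The following sketch flags a 20-nt SpCas9 protospacer against those rules of thumb; it is an illustration only, not a validated efficiency model, and the function names, warning texts, and the choice of which rules to encode are assumptions made here.

```python
def check_guide(protospacer, pam):
    """Return warnings for a 20-nt protospacer (5'->3') and its 3-nt NGG PAM.

    The checks mirror the rules of thumb in the text; they are not a
    published scoring scheme.
    """
    guide = protospacer.upper()
    pam = pam.upper()
    if len(guide) != 20 or len(pam) != 3:
        raise ValueError("expected a 20-nt protospacer and a 3-nt PAM")
    warnings = []
    if guide[0] != "G":
        warnings.append("no G at the 5' end (U6 promoter expression)")
    if "G" not in guide[-2:]:
        warnings.append("no G in the two PAM-proximal positions")
    if guide[-1] == "C":
        warnings.append("C adjacent to the PAM")
    if guide[-4] == "T":
        warnings.append("T at the fourth position from the PAM")
    if pam[0] == "T":
        warnings.append("T at the variable PAM position")
    return warnings


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


for guide, pam in (("GATTACAGATTACAGATTAC", "AGG"), ("GACGATCAGCATCAGCAAGG", "CGG")):
    print(guide, round(gc_content(guide), 2), check_guide(guide, pam))
```

A guide that raises no warnings is not guaranteed to be efficient; the checks only screen out candidates that conflict with the preferences described in this section.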

3.3.7 Selection of Suitable gRNA

A given target sequence or gene may have many potential gRNAs. It is important to select the most suitable gRNA with the highest efficiency for the intended modification. Suitability is assessed in terms of position relative to target site, high on-target activity, and low off-target activity. This can be achieved with tools such as WGE and CRISPOR using custom filters. Filtering for gRNAs with low off-targets will identify candidates with minimum off-targets. However, a gRNA with high on-target activity may have significantly low specificity leading to high off-targets. A gRNA with a high on-target score and high specificity would be an ideal sgRNA candidate for the desired CRISPR application (Thomas et al. 2019).
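The filtering step can be pictured as a threshold on both scores followed by a sort. The sketch below is hypothetical: the Guide record, the 0–100 score scales, the default thresholds, and the example values are all assumptions for illustration and do not reproduce the filters of WGE or CRISPOR.

```python
from dataclasses import dataclass


@dataclass
class Guide:
    name: str
    on_target: float    # predicted efficiency, assumed 0-100 scale
    specificity: float  # predicted specificity, assumed 0-100 scale


def shortlist(guides, min_on_target=50.0, min_specificity=50.0):
    """Keep guides that pass both thresholds, most specific first."""
    kept = [
        g for g in guides
        if g.on_target >= min_on_target and g.specificity >= min_specificity
    ]
    # Rank by specificity, then by on-target score, both descending.
    return sorted(kept, key=lambda g: (g.specificity, g.on_target), reverse=True)


candidates = [
    Guide("g1", on_target=80, specificity=30),  # efficient but unspecific
    Guide("g2", on_target=60, specificity=90),
    Guide("g3", on_target=75, specificity=85),
    Guide("g4", on_target=40, specificity=95),  # specific but inefficient
]
print([g.name for g in shortlist(candidates)])
```

Ranking on specificity first is one possible design choice; an application that tolerates off-targets (for example, when they can later be removed by backcrossing) might instead rank on the on-target score.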

3.3.8 Design Criteria for Genome-Wide CRISPR Libraries

In contrast to individual gRNA design, CRISPR libraries are designed to screen mutations (or desired modifications) in many genes or across an entire genome. As a result, sgRNA design for genome-wide CRISPR libraries is entirely computer-based because it is impossible to evaluate each gRNA. Instead, multiple sgRNAs are designed for each gene in the genome at different locations. Users can design their own custom libraries or use libraries according to their chosen application (Thomas et al. 2019). Selected libraries and their applications are listed in Table 3.2.

Table 3.2 Selected CRISPR libraries and their purposes

3.4 Specificity in CRISPR/Cas

After selecting the PAM and potential target sites, the next step is to identify the site most likely to result in efficient genome editing. In addition to choosing an sgRNA to match the target site, researchers try to select one with no additional binding sites in the genome. While the ideal sgRNA would have no homologous sites elsewhere in the genome, in practice an sgRNA will have partial homology to many additional sites, i.e., off-targets (Duan et al. 2014). Off-target sites with mismatches near the PAM are not cleaved efficiently, so an sgRNA whose off-target sites carry PAM-proximal mismatches has fewer off-target effects and higher specificity than one whose off-target mismatches lie far from the PAM. Off-target sites may be effectively minimized by predicting CRISPR/Cas specificity and designing a specific and optimal sgRNA. The two main approaches for predicting sgRNA specificity are based on either (1) alignment or (2) scoring. In the first method, sgRNA sequences are aligned to a given genome using conventional or specialized tools to discover all off-targets, and only the number of mismatches in the gRNA sequence is considered. In a scoring-based approach, sgRNAs are scored and ranked after the initial alignment in order to select the most specific sgRNA for a given experiment. In addition to the number of mismatches, this approach calculates a positional weighting for each mismatch. Two scoring-based approaches are commonly used: (1) a learning-based method and (2) a hypothesis-driven method. Below we discuss alignment- and scoring-based methods in detail (Liu et al. 2020).

3.4.1 Alignment-Based Approach to Predict Specificity

Alignment-based methods for assessing sgRNA sequences involve aligning the sgRNA with a reference genome and identifying potential off-targets based on sequence homology. The Bowtie (Langmead et al. 2009) and Burrows-Wheeler aligner (BWA) mapping tools are used to predict off-targets, but neither identifies short PAM sequences. Because these tools allow only a limited number of mismatches in the sgRNA seed region, they cannot identify all off-targets. The CHOPCHOP and CCTop design tools use Bowtie to find off-targets for a candidate sgRNA, while CRISPOR uses BWA. The alignment-based tools Cas-OFFinder and CasOT also predict off-targets (Liu et al. 2020). Cas-OFFinder is popular because it places no limit on the number of mismatches and can even predict off-targets with a 1-bp insertion or deletion (Thomas et al. 2019). CasOT can identify off-targets with up to six mismatches in the seed region and predict off-targets in coding exons of genes. The alignment-based tools CRISFlash and FlashFry use tree-based algorithms and user-defined data to optimize sgRNA. As well as off-target predictions, FlashFry provides additional information such as GC content and on-target score for sgRNA (Liu et al. 2020).
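The principle behind alignment-based prediction, counting mismatches between the guide and every PAM-adjacent window, can be illustrated with a naive scan. This sketch is not how Bowtie, BWA, or Cas-OFFinder work internally (they use indexed or accelerated searches); it assumes an NGG PAM, scans the forward strand only, ignores bulges, and uses hypothetical function names.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_off_targets(guide, genome, max_mismatches=3):
    """Naively list NGG-adjacent sites within max_mismatches of the guide.

    Returns tuples (start, site, pam, mismatches). Forward strand only.
    """
    guide = guide.upper()
    genome = genome.upper()
    n = len(guide)
    hits = []
    for i in range(len(genome) - n - 2):
        site = genome[i:i + n]
        pam = genome[i + n:i + n + 3]
        if pam[1:] != "GG":
            continue  # no NGG PAM: not considered cleavable in this sketch
        mismatches = hamming(guide, site)
        if mismatches <= max_mismatches:
            hits.append((i, site, pam, mismatches))
    return hits


guide = "GATTACAGATTACAGATTAC"
genome = (guide + "AGG" + "CCCC"
          + "GATTACAGATTACAGAAAAC" + "TGG" + "CCCC"
          + guide + "TTT")  # last copy lacks a PAM
for hit in find_off_targets(guide, genome):
    print(hit)
```

Note that every hit is weighted equally regardless of where its mismatches fall, which is the limitation that scoring-based methods address.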

3.4.2 Specificity Prediction Through Scoring-Based Tools

3.4.2.1 Hypothesis-Driven Methods

Alignment-based methods can readily predict off-targets from sequence similarity. However, mismatches at different nucleotide positions in the sgRNA do not contribute equally to off-target cleavage. In addition, alignment-based predictions of off-targets sometimes include false positives. One study found that only a few of the off-targets predicted by Cas-OFFinder and CCTop were valid, and the tools also failed to predict some valid off-targets. There was therefore a need to narrow down the features that contribute to non-specific off-target cleavage in CRISPR/Cas (Liu et al. 2020). These issues can be addressed by using the MIT specificity score (named after the institution) to evaluate off-targets (Hsu et al. 2013). Hsu et al. studied more than 700 sgRNAs and evaluated sgRNA/Cas9 sequence features such as the position and number of mismatched nucleotides in the target site (Hsu et al. 2013). The MIT score has been adopted to predict off-targets in design tools such as CHOPCHOP and CRISPOR (Haeussler et al. 2016; Labun et al. 2016). The cutting frequency determination (CFD) score is also popular for evaluating off-targets in CRISPR/Cas (Liu et al. 2020). In addition to the NGG PAM, Cas9 recognizes non-canonical PAM sites such as NAG, NGA, and NCG, which leads to additional off-targets. Doench et al. (2016) included PAM sequence features in their scoring matrix to predict off-targets. The CFD score is considered to perform better than the MIT score and has been adopted by many design tools, such as GuideScan (Perez et al. 2017) and CRISPRscan (Moreno-Mateos et al. 2015). Other design tools use sgRNA/Cas9 structural features to predict off-targets. For example, CRISPR-OFF (Alkan et al. 2018) and uCRISPR (Zhang et al. 2019) use structural features because these predict off-targets more accurately than sequence features alone.
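The idea of weighting mismatches by position can be made concrete with a toy score. The weights below are a linear ramp invented for illustration, under the assumption that mismatches closer to the PAM are penalized more heavily; they are not the published MIT or CFD weights, and the function name is hypothetical.

```python
def toy_cleavage_likelihood(guide, site):
    """Toy position-weighted score in [0, 1]; 1.0 means a perfect match.

    Index 0 is the PAM-distal (5') end and the last index is PAM-proximal.
    Each mismatch multiplies the score by (1 - weight), where the weight
    grows linearly toward the PAM. ASSUMED weights, not MIT or CFD values.
    """
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    n = len(guide)
    score = 1.0
    for i, (g, s) in enumerate(zip(guide.upper(), site.upper())):
        if g != s:
            weight = (i + 1) / (n + 1)  # assumed linear ramp toward the PAM
            score *= 1.0 - weight
    return score


guide = "GATTACAGATTACAGATTAC"
distal_mismatch = "CATTACAGATTACAGATTAC"    # mismatch at the 5' end
proximal_mismatch = "GATTACAGATTACAGATTAG"  # mismatch next to the PAM
print(toy_cleavage_likelihood(guide, distal_mismatch))
print(toy_cleavage_likelihood(guide, proximal_mismatch))
```

Both sites have one mismatch and would be indistinguishable to a pure mismatch count, yet the PAM-proximal mismatch receives a much lower score, reflecting the sensitivity of the seed region.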

3.4.2.2 Learning-Based Methods

Compared to empirical algorithms, learning-based methods use multiple features (including PAM, GC content, methylation state, and chromatin structure) to improve their off-target predictions. Most recent tools use machine learning with multiple features for predicting CRISPR/Cas system specificity and off-targets. For example, CRISPR target assessment (CRISTA), which uses machine learning to predict efficiency, was found to perform better than other tools (Liu et al. 2020). The computational platform DeepCRISPR, which incorporates sgRNA on-target and off-target prediction into a single framework, has been found to perform better than other tools for predicting efficiency and off-targets (Chuai et al. 2018).

3.5 Factors Affecting Specificity

Numerous studies have revealed different factors that may affect CRISPR/Cas specificity. These factors can be classified into two categories: (a) the intrinsic specificity of Cas9, which reflects the importance of each sgRNA nucleotide position for creating a DSB, and (b) the relative abundance of the sgRNA/Cas9 complex available for effective target cleavage. Factors that may contribute to CRISPR/Cas system specificity are discussed below.

3.5.1 Importance of PAM in CRISPR/Cas Specificity

To be recognized by an individual Cas9 domain, the PAM must lie next to the 3′ end of the genomic target sequence (Wu et al. 2014b). Because PAM sequences vary across Cas endonucleases, users can select a different Cas endonuclease if a particular PAM (e.g., NGG for Cas9) does not exist in the target sequence. The most commonly used Cas endonuclease, Cas9, recognizes NGG for cleavage but can also recognize the non-canonical PAM sites NGA and NAG, thus increasing the number of off-targets. Some Cas proteins require a longer PAM sequence; for example, SaCas9, derived from Staphylococcus aureus, requires an “NNGRRT” PAM. Cas9 proteins that recognize a longer PAM are assumed to have fewer targetable sites in the genome and, therefore, fewer off-target sites in a given target DNA. PAM sequences and their corresponding Cas endonucleases are listed in Table 3.1.
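The argument that a longer PAM yields fewer targetable sites can be quantified under a deliberately simple assumption: that the four bases occur independently and with equal frequency, which real genomes do not satisfy. The sketch below uses the standard IUPAC nucleotide codes; the function name is hypothetical.

```python
from fractions import Fraction

# Standard IUPAC nucleotide codes and the bases each one matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pam_match_probability(pam):
    """Chance that a random position matches the PAM on one strand.

    ASSUMES independent, equally frequent bases, so the result is only a
    rough guide to relative PAM frequency.
    """
    probability = Fraction(1)
    for code in pam.upper():
        probability *= Fraction(len(IUPAC[code]), 4)
    return probability


for pam in ("NGG", "NNGRRT", "TTTV"):
    print(pam, pam_match_probability(pam))
```

Under this assumption, NGG is expected once in every 16 positions per strand and NNGRRT once in every 64, i.e., a fourfold reduction in candidate sites for SaCas9 relative to SpCas9.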

3.5.2 Seed Sequence of sgRNA

Recruiting Cas9 to the genome target site requires sgRNA. In vitro studies have shown that Cas9 can tolerate mismatches in the first seven nucleotides in the region distal to PAM. However, studies with bacteria and mammals have confirmed that mismatches in 10–12 bp PAM proximal region (also called seed region) of the gRNA will result in reduced cleavage or complete abolishment. Other studies suggest there is no clearly defined seed region, but have confirmed that mismatches in the PAM proximal region stop Cas9 cleavage of DNA (Cong et al. 2013). In contrast, genome-wide binding datasets have shown a clearly defined seed region, limited to five nucleotides proximal to PAM (Wu et al. 2014b). The differences in seed region might arise from factors such as concentration and time required for Cas9 binding and cleavage.

3.5.3 Effective Concentration of Cas9/sgRNA Complex

The effective concentration of Cas9/sgRNA influences the specificity of CRISPR/Cas systems. Studies have confirmed that cleavage becomes less specific at higher effective concentrations of Cas9/sgRNA. For example, an in vitro study found that higher concentrations of the Cas9/sgRNA complex resulted in greater tolerance of mismatches, leading to cleavage of non-specific sites. Hsu and co-authors suggested that decreasing the amount of plasmid in transfected cells led to increased Cas9 specificity (Wu et al. 2014b; Hsu et al. 2013). Another study showed that a 2.6-fold increase in Cas9 concentration led to a similar increase in off-targets. When the Cas9 level remained constant, the amount of sgRNA influenced the number of off-targets (Wu et al. 2014a).

3.5.4 Importance of sgRNA Sequence

SgRNA sequence is the key to Cas9 specificity because it contributes to Cas9 loading and Cas9/sgRNA binding to the target site. Differences in sgRNA sequence influence Cas9 tolerance of mismatches at every position in 20 nucleotides. A possible underlying mechanism for this change in specificity is that different sgRNA sequences may influence effective concentration of sgRNA. For example, it has been reported that seed sequence mutations in sgRNA increase its transcription by U6 promoter. Changes in sgRNA sequence may also contribute to chromatin state, off-targets, and thermodynamic stability of sgRNA-DNA duplex (Wu et al. 2014b). We describe these effects in detail below.

3.5.4.1 Chromatin Accessibility and Epigenetic Features Affecting Binding of Cas

Chromatin state, i.e., whether packed or open, may influence Cas9’s ability to access the target site. DNase I hypersensitivity (DHS) is a strong predictor of chromatin accessibility. The number of accessible seed sequences followed by a PAM within DHS peaks has been found to accurately predict the number of chromatin immunoprecipitation (ChIP) peaks in vivo. Wu and colleagues have suggested that chromatin accessibility has little impact on the on-target activity of an sgRNA, in contrast to its effect on off-target binding (Wu et al. 2014a, b).

Methylation of CpG sites (where cytosine and guanine are adjacent, with guanine closer to the 3′ end) is an epigenetic mechanism that has been linked with chromatin silencing. One study confirmed that CpG methylation may restrict Cas9 binding to the target site. Target site methylation showed a strong correlation with ChIP signal, and less binding was observed at highly methylated sites (Wu et al. 2014a, b). Hsu et al. showed that Cas9 can mutate highly methylated promoters in vivo. However, an in vitro study found that CpG methylation had no significant effect on Cas9 cleavage (Hsu et al. 2013). Taken together, these studies suggest that CpG methylation may affect only off-target sites.

3.5.4.2 Numbers of Seed Sequence in the Genome

Depending on the length of the sgRNA seed sequence (5–12 nt), a mammalian genome may contain hundreds of thousands of seed match sites followed by a PAM. However, the abundance of a specific seed match sequence depends strongly on its nucleotide composition and can be dramatically lower. For example, for Cas9, the mouse genome contains about one million AAGGA + NGG seed sites but fewer than 10,000 CGTCG + NGG sites (Wu et al. 2014a, b). The relative abundance of seed sites is an important factor in designing specific sgRNA, especially in dCas9 applications.
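Counting seed + PAM sites of this kind amounts to a pattern search. The sketch below counts occurrences of a seed immediately followed by NGG on the forward strand of a sequence; the function name is hypothetical, the example sequence is invented, and a genome-scale count would need both strands and a more efficient index.

```python
import re


def count_seed_sites(genome, seed):
    """Count forward-strand occurrences of seed immediately followed by NGG.

    Overlapping occurrences are counted, hence the lookahead.
    """
    pattern = re.compile(r"(?=%s[ACGT]GG)" % re.escape(seed.upper()))
    return sum(1 for _ in pattern.finditer(genome.upper()))


# Invented toy sequence: two AAGGA+NGG sites and one AAGGA without a PAM.
toy_genome = "AAGGATGG" + "CC" + "AAGGACGG" + "AAGGATTT"
print(count_seed_sites(toy_genome, "AAGGA"))
print(count_seed_sites(toy_genome, "CGTCG"))
```

Comparing such counts for candidate seeds gives a first impression of how many potential binding sites a guide may encounter, which is particularly relevant for dCas9 applications.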

3.5.4.3 Length of Target Sequence Influences Specificity

The length of the sgRNA is important for Cas9 specificity. A 20-nt guide sequence is optimal for guiding Cas9 to a target site. Although one might speculate that specificity would increase with sgRNA length, Ran et al. found that when the sgRNA was lengthened by extending the 5′ end, the extended sequence was degraded in vivo (Ran et al. 2013). In contrast, truncated sgRNAs of 17–18 nt increased Cas9 specificity. While the underlying mechanism is not clear, it may be that the first two nucleotides at the 5′ end do not contribute to Cas9 stability, but instead contribute to off-targets (Fu et al. 2014).

3.5.5 sgRNA Scaffold

The impact of modifications in the sgRNA scaffold region has not been studied in detail. However, it is known that truncation or extension at the 3′ end may contribute to Cas9 stability and specificity by changing sgRNA expression, in similar fashion to 5′-end modifications in sgRNA. Increasing the length of the hairpin bound by Cas9 has been found to increase sgRNA efficiency for imaging and transcriptional regulation, probably due to efficient loading of sgRNA, but the exact mechanism remains unclear (Hsu et al. 2013; Wu et al. 2014b).

3.5.6 Repair Outcomes of DSBs

In addition to the above factors, DNA repair outcomes and sequence variations are likely to influence the selection of specific sgRNA. Several studies have identified a bias in repair outcomes for KO applications. These studies have shown that the nucleotide composition of the target site adjacent to the cleavage site is important for single-nucleotide insertion or deletion in the NHEJ repair pathway (Mao et al. 2013). The presence of thymine (T) adjacent to the cleavage site was associated with precise insertion of a single homologous nucleotide at the cleavage site (T to TT). However, having a dinucleotide repeat adjacent to the cleavage site led to single-nucleotide deletion with removal of the homologous base (CC to C). Moreover, microhomologies in sequences flanking the cleavage site resulted in deletion of 30 nucleotides through microhomology-mediated end joining (MMEJ) repair. These findings highlight a bias in repair outcomes linked to the presence of specific sequences in target sites and the competing roles of NHEJ and MMEJ. Based on these studies, computational tools such as Favored Outcomes of Repair Events at Cas9 Targets (FORECasT) and inDelphi have been developed to predict the most likely mutational outcomes of CRISPR/Cas experiments.
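The role of flanking microhomologies can be made concrete with a simple search for short identical motifs on either side of the cut. This is only an illustrative sketch and is not how FORECasT or inDelphi work; the function name, the flank size, and the motif length limits are assumptions chosen for the example.

```python
def find_microhomologies(seq, cut, min_len=2, max_len=6, flank=15):
    """List motifs that occur on both sides of a cut site.

    `cut` is the index of the first base to the right of the break, so the
    break lies between seq[cut - 1] and seq[cut]. Each hit is a tuple
    (motif, left_start, right_start, deletion_length), where
    deletion_length is the number of bases lost if MMEJ joins the two
    copies and keeps only one of them.
    """
    seq = seq.upper()
    left_start = max(0, cut - flank)
    right_end = min(len(seq), cut + flank)
    hits = []
    for k in range(max_len, min_len - 1, -1):
        # Motif copies must lie entirely to the left or right of the break.
        for i in range(left_start, cut - k + 1):
            motif = seq[i:i + k]
            for j in range(cut, right_end - k + 1):
                if seq[j:j + k] == motif:
                    hits.append((motif, i, j, j - i))
    # Longer motifs first, then shorter deletions.
    return sorted(hits, key=lambda h: (-len(h[0]), h[3]))
```

Shorter motifs contained within a longer hit are reported as well; a real predictor would weight candidates by motif length, GC content, and deletion size rather than simply listing them.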

3.6 Efficiency of sgRNA

Initially it was believed that CRISPR/Cas9 could target any genome sequence that was followed by PAM (NGG). As a result, most of the early bioinformatic tools were constructed based on simple methods to locate target site and PAM to design sgRNAs. Some of these tools predicted sgRNA position relative to gene features. However, several later studies demonstrated that Cas9 cleavage efficiency varies significantly between different sites, i.e., not all sites are cleaved with the same efficiency (Cong et al. 2013; Jinek et al. 2012; Mali et al. 2013; Wang et al. 2014). For example, two sgRNAs can be 100% complementary to their target sites yet differ in cleavage efficiency, indicating that cleavage efficiency may also be affected by specific nucleotides and nucleotide composition. Subsequent studies identified additional factors such as sequence features (GC content, specific nucleotide positions, and sequence composition), genetic and epigenetic factors (methylation and chromatin arrangement), and thermodynamic properties (sgRNA secondary structure, melting temperature (Tm), and free energy) that influence on-target cleavage efficiency.

Nucleotide position and composition in the target sequence are critical for CRISPR/Cas on-target efficiency (Wilson et al. 2018; Wong et al. 2015). CRISPR/Cas-based screening in mammals has shown that G is highly preferred at positions 1 and 2 upstream of PAM, while T is not favored at position 4 in close proximity to PAM. The GC content of positions 4–13 proximal to PAM is also important for Cas9 cleavage efficiency. Using sequence features such as GC content, preferred nucleotide position, and sgRNA position relative to gene features, predictive models have been developed to design efficient sgRNA for CRISPR/Cas applications. Several laboratories have used these models to develop individual design platforms such as E-CRISP, CHOPCHOP, CRISPR-FOCUS, and CCTOP for predicting sgRNA efficiency (Table 3.3).
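The sequence features described above can be extracted with a few lines of code. The sketch below only computes the features; it does not reproduce the scoring model of any tool in Table 3.3. The function name and dictionary keys are hypothetical, and positions are assumed to be counted from the PAM, with position 1 being the base immediately adjacent to it.

```python
def sequence_features(spacer: str) -> dict:
    """Compute simple on-target features for a 20-nt spacer.

    Assumption: `spacer` is written 5'->3' with the PAM following its 3'
    end, so position p (counted from the PAM) is spacer[-p].
    """
    spacer = spacer.upper()
    if len(spacer) != 20 or set(spacer) - set("ACGT"):
        raise ValueError("expected a 20-nt spacer containing only A/C/G/T")
    window = spacer[-13:-3]  # positions 13 down to 4 from the PAM
    return {
        "gc_fraction": sum(base in "GC" for base in spacer) / len(spacer),
        "gc_fraction_pos4_13": sum(base in "GC" for base in window) / len(window),
        "g_at_pos1": spacer[-1] == "G",  # favored according to the text
        "g_at_pos2": spacer[-2] == "G",  # favored according to the text
        "t_at_pos4": spacer[-4] == "T",  # disfavored according to the text
    }
```

A predictive model would combine such features with weights learned from screening data; here they are returned unweighted so that the feature definitions stay explicit.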

Table 3.3 Bioinformatic tools for sgRNA activity

Genetic and epigenetic features also contribute to target-site cleavage efficiency. Studies have shown that nucleosomes (sections of chromatin) may reduce Cas9 cleavage efficiency, and DNase I hypersensitivity (DHS) and epigenetic signatures may influence on-target efficiency. Predict-SGRNA is an R package (R is a free software environment) that uses epigenetic features to predict sgRNA cleavage efficiency (Liu et al. 2020). CRISPRpred and uCRISPR predict sgRNA efficiency using the energy properties of sgRNA, DNA, and Cas9 complex and sgRNA secondary structure. Because not all sgRNAs are effective, even when using the best design tools, multiple sgRNAs are used for each target gene. Multiple sgRNAs are also required to distinguish on-target perturbation from any off-target effect of an individual sgRNA.

3.7 Off-Targeting in CRISPR/Cas

Off-targets are a major challenge for the CRISPR/Cas community because Cas9 can bind and create DSBs even when there is only partial complementarity between sgRNA and target site. Numerous studies have reported that CRISPR/Cas may produce substantial numbers of off-targets. For example, studies in human cells found that Cas9 can tolerate up to five mismatches between sgRNA and target site, in some cases leading to DNA cleavage frequencies even higher than at the intended target site (Carroll 2013; Hsu et al. 2013; Xie et al. 2014). Off-targets are not random changes but are determined by the PAM and target-site sequence. Natural off-targets in a bacterial defense system may degrade hypervariable nucleic acids (i.e., those that vary much more than their counterparts in other similar regions) or plasmids beneficial for archaea and bacteria. However, from a genome editing perspective, off-targets may lead to undesirable changes at unintended sites in the genome, thus compromising the benefits of genome modifications. Predicting and minimizing off-targets in advance is essential for safe use of CRISPR/Cas, especially in therapeutic applications and translational research. It is also important to identify all off-targets and confirm that a desired phenotype has arisen from on-target modification instead of off-targets.

Several sgRNA design tools have a special focus on limiting off-targets in CRISPR/Cas (Table 3.4). Most of these produce sgRNA with minimal off-targets and show predicted off-targets for a given sgRNA. Different tools use different scoring methods to predict off-targets. Most of these tools score off-targets either by using data from systematic mutation studies or by applying user-provided penalties for factors such as mismatch number and position. Others use binary criteria, e.g., a defined proximal or distal region, or sites with fewer than a defined number of mismatches. sgRNA candidates are then ranked by off-target number or by the weighted sum of all off-target scores (Wu et al. 2014b). Some tools give the option of using alternative PAM sites to predict off-targets, e.g., NAG or NGA for Cas9.
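A user-defined penalty scheme of the kind described above might look like the following sketch. It is not the scoring method of any tool in Table 3.4: the function names, the penalty values, the 12-nt seed length, and the choice of NGG and NAG as accepted PAMs are all illustrative assumptions, and only the forward strand is scanned to keep the example short.

```python
def mismatch_positions(guide: str, site: str) -> list:
    """Return 1-based positions, counted from the PAM, where the two differ."""
    n = len(guide)
    return [n - i for i in range(n) if guide[i] != site[i]]


def find_off_targets(guide, genome, pam_suffixes=("GG", "AG"),
                     max_mismatches=3, seed_len=12,
                     seed_penalty=2.0, distal_penalty=1.0):
    """Scan the forward strand for sites resembling `guide` next to a PAM.

    Hypothetical scoring: every mismatch adds a penalty, and mismatches in
    the PAM-proximal seed count more because they are tolerated less well.
    A LOW total penalty therefore marks a MORE likely off-target site.
    """
    guide, genome = guide.upper(), genome.upper()
    n = len(guide)
    hits = []
    for start in range(len(genome) - n - 3 + 1):
        site = genome[start:start + n]
        pam = genome[start + n:start + n + 3]
        if pam[1:] not in pam_suffixes:  # N + GG or N + AG
            continue
        mismatches = mismatch_positions(guide, site)
        if len(mismatches) > max_mismatches:
            continue
        penalty = sum(seed_penalty if p <= seed_len else distal_penalty
                      for p in mismatches)
        hits.append({"start": start, "site": site, "pam": pam,
                     "mismatches": len(mismatches), "penalty": penalty})
    return sorted(hits, key=lambda h: h["penalty"])
```

The perfect match (penalty 0) is the intended target; candidate sgRNAs could then be ranked by how many additional low-penalty sites each one returns.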

Table 3.4 Tools for evaluating off-targets in CRISPR/Cas system

As with on-target prediction tools, most design tools for off-target prediction initially focused on Cas9 and predicted off-targets through alignment-based methods using seed sequence followed by NGG. However, the discovery that Cas9 also binds NAG or NGA PAM made it apparent that many off-targets were being missed. The early tools were superseded by tools that used sequence similarity or dCas9-mediated binding to confirm off-target sites, but these later approaches were biased and not comprehensive. Unbiased approaches were then developed based on high-throughput, next-generation sequencing (NGS). For example, IDLV capture used integrase-deficient lentiviral vectors (IDLV) and sequencing, while DSBCapture, digested genome sequencing (Digenome-seq), ChIP-NGS (whole genome binding), and direct in situ breaks labeling, enrichment on streptavidin, and next-generation sequencing (BLESS) were developed to detect off-targets in CRISPR/Cas applications. However, each of these approaches has its own advantages and disadvantages. IDLV and BLESS could detect genome-wide off-targets, but they were less efficient because most off-target DSBs are transient. In addition, both approaches could generate false-positive off-targets because DSBs may arise from endogenous processes. Although whole genome sequencing is ideal and unbiased, it can miss perfectly repaired off-targets and binding sites without cleavage. Moreover, ChIP-NGS could be biased towards open chromatin and highly expressed genes. GUIDE-seq has good efficiency but does not work for DNA nicks (single-stranded cuts). Digenome-seq does not consider other factors that affect cleavage. All things considered, the above approaches are all useful but need refinement because in vitro results can differ from in vivo results (Peng et al. 2016).

Over the last few years, considerable effort has gone into limiting off-targets and improving specificity. Approaches have included lowering GC content, employing paired nickase enzymes, and using truncated sgRNA (17–18 nt). Lower GC content may reduce off-targets because higher GC content improves RNA/DNA duplex stability, thereby increasing the chance that mismatches are tolerated. sgRNA and target site mismatches that produce bulges at the 5′ end, the 3′ end, or 7–12 nucleotides proximal to PAM must be avoided. The combined use of paired nickases and paired sgRNAs generates two closely spaced single-stranded breaks that together make a DSB. Because an isolated nick is usually repaired faithfully, a DSB forms only where both sgRNAs bind close together, which reduces off-target cleavage.

3.8 Application-Specific Design of sgRNA

Although all CRISPR/Cas applications rely on sgRNA to guide Cas9 to the target sequence, DSBs are not always required. KO and KI applications always require DSB creation to delete or insert DNA at a precise location, respectively. Large-scale deletions and insertions require more than one DSB. In KO applications, the NHEJ repair pathway typically introduces a small indel into the coding sequence, which can cause a frameshift mutation and thus disrupt protein formation. However, when a repair template with suitable homology arms is supplied, DSBs can be repaired by the HDR pathway, leading to site-specific insertion of the repair template. Because NHEJ is the preferred pathway in cells, HDR efficiency must be improved for KI applications. In contrast to KO and KI applications, CRISPRi and CRISPRa use dCas9, which does not create a DSB, but instead recruits a transcriptional activator (VP64) or repressor (Krüppel-associated box (KRAB) domain proteins) to the promoter region of a gene (Graham and Root 2015). Accordingly, sgRNA position in CRISPRi and CRISPRa differs significantly from that in KO and KI applications. However, despite differences in sgRNA position relative to gene features, the same basic principle underlies sgRNA design in all applications. Here we summarize application-specific sgRNA design in CRISPR/Cas.

3.8.1 sgRNA for KO Applications

Being able to knock out (KO) an individual gene is a powerful tool for functional genomics. KO of single or multiple genes is often used to evaluate phenotypic changes in cells, tissues, or organisms and subsequently to characterize those genes for their potential roles in different functions. CRISPR/Cas has become the gold standard for producing KO models for functional characterization of genes (Graham and Root 2015). The KO of a gene or genetic element may be achieved by creating a DSB that is repaired through the NHEJ pathway. Exon size and relative position are important for generating KO alleles. For example, larger exons offer more candidate sgRNAs, making it easier to select an efficient one, whereas small exons are easy to delete entirely with two DSBs. In addition, sgRNA position relative to the gene features may affect the outcomes of KO applications. Targeting sgRNA too close to the translation initiation codon (ATG) may allow translation to reinitiate at a downstream ATG, leading to an N-terminally truncated protein. Similarly, targeting sgRNA close to the 3′-end of a gene may result in insufficient disruption of protein function. When designing sgRNA for KO applications, selecting an optimal target such as a functional domain, active site, or transmembrane helical domain of a protein (Fig. 3.4) can increase the likelihood of completely disrupting protein function (Thomas et al. 2019). Using multiple sgRNAs can help ensure that the observed phenotype in a KO experiment has resulted from disrupting the respective gene instead of off-targets. For large-scale design, multiple sgRNAs per gene are also recommended for increased screen efficiency. In addition to generating a KO for a single gene, multiplex genome editing using CRISPR/Cas can be used to simultaneously disrupt multiple genes in order to study their interactions and discover pathways.
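The positional advice above (avoid sites too close to the start codon or to the 3′-end) can be expressed as a simple filter. This is a sketch only: the function name is hypothetical, and the default thresholds of 5% and 65% of the coding sequence are illustrative assumptions, not values given in the text.

```python
def filter_ko_cut_sites(cut_positions, cds_length,
                        min_fraction=0.05, max_fraction=0.65):
    """Keep cut sites that avoid both ends of the coding sequence.

    `cut_positions` are 0-based offsets from the A of the start codon,
    measured along the coding sequence. The fraction thresholds are
    hypothetical defaults and should be tuned per gene.
    """
    if cds_length <= 0:
        raise ValueError("cds_length must be positive")
    if not 0 <= min_fraction < max_fraction <= 1:
        raise ValueError("fractions must satisfy 0 <= min < max <= 1")
    lower = min_fraction * cds_length
    upper = max_fraction * cds_length
    return [pos for pos in cut_positions if lower <= pos <= upper]
```

In practice this filter would be combined with domain annotation, so that surviving sites falling in a functional domain or active site are preferred.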

Fig. 3.4 Application-specific positioning of sgRNA in CRISPR/Cas systems

3.8.2 Position of sgRNA for KI Applications

While the NHEJ pathway may lead to disruption of a gene, KI approaches using repair templates can use the HDR pathway to precisely insert a single nucleotide change or add a large template such as green fluorescent protein (GFP) (Wu et al. 2018), a tag (Chen et al. 2018), or a fluorophore. For the HDR-based repair pathway, the desired repair template must be introduced along with sgRNA and Cas9 or nCas9. The length and nature of the repair template depend on the size of the intended modification. For example, for a single-base replacement, an ssDNA repair template with 50 bp homology arms on both sides of the DSB could work efficiently. However, for larger insertions such as a GFP, tag, or fluorophore, a repair template with long homology arms (400–1000 bp) is desirable (Fig. 3.4). It is also advisable to exclude the PAM site from the repair template. Moreover, mutating the PAM site and sgRNA binding site with silent mutations would prevent subsequent binding and cleavage of the target site after insertion of the repair template. These silent mutations may also assist genotyping following insertion of the desired repair template (Graham and Root 2015).
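For the single-base replacement case, assembling a donor with short homology arms is straightforward. The sketch below makes simplifying assumptions that go beyond the text: the function name is hypothetical, the arms are taken on either side of the edited base rather than of the DSB itself, and no silent PAM mutation is introduced.

```python
def build_ssdna_donor(reference: str, edit_pos: int, new_base: str,
                      arm_len: int = 50) -> str:
    """Return left arm + replacement base + right arm for a point edit.

    `edit_pos` is the 0-based index of the base to replace in `reference`.
    Assumption: the edit lies close enough to the cut site that arms
    centred on the edit still span the DSB.
    """
    reference = reference.upper()
    new_base = new_base.upper()
    if len(new_base) != 1 or new_base not in "ACGT":
        raise ValueError("new_base must be a single A/C/G/T")
    if not arm_len <= edit_pos < len(reference) - arm_len:
        raise ValueError("not enough flanking sequence for the arms")
    left_arm = reference[edit_pos - arm_len:edit_pos]
    right_arm = reference[edit_pos + 1:edit_pos + 1 + arm_len]
    return left_arm + new_base + right_arm
```

With the default 50-nt arms the donor is 101 nt long. For large insertions such as GFP, the same idea applies with longer arms and a double-stranded template.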

3.8.3 Designing sgRNA for CRISPRi and CRISPRa

In contrast to KO and KI applications, which use Cas9 or nCas9, transcriptional regulation through CRISPR/Cas relies on dCas9, which does not create a DSB but simply binds at a precise location in the genome. Targeting dCas9 fused to an appropriate activator or repressor to a gene’s promoter region may activate the gene by recruiting the transcriptional machinery or repress it by blocking binding of RNA polymerase or transcription factors. sgRNA position relative to the transcription start site (TSS) may affect the efficiency of activation or repression. Accurate TSS identification is highly desirable for transcriptional regulation through CRISPR/Cas. Generally, the target site for sgRNA design in CRISPRi should be downstream (within a 300 bp window) of TSS, while for CRISPRa it should be upstream (within a 400 bp window) (Fig. 3.4). Designing multiple sgRNAs for a target region should assist in achieving the best results (Davis et al. 2018; Noguchi et al. 2017; Thomas et al. 2019).
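The TSS windows given above translate directly into a strand-aware classification. This is a minimal sketch: the function names are hypothetical, coordinates are assumed to be on the same chromosome, and the window sizes default to the 300 bp and 400 bp values quoted in the text.

```python
def offset_from_tss(position: int, tss: int, strand: str) -> int:
    """Signed distance from the TSS; positive means downstream.

    "Downstream" follows the direction of transcription, so the sign is
    flipped for genes on the minus strand.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return position - tss if strand == "+" else tss - position


def classify_site(position: int, tss: int, strand: str,
                  crispri_window: int = 300, crispra_window: int = 400):
    """Return 'CRISPRi', 'CRISPRa', or None for a candidate target site."""
    offset = offset_from_tss(position, tss, strand)
    if 0 <= offset <= crispri_window:
        return "CRISPRi"  # within the window downstream of the TSS
    if -crispra_window <= offset < 0:
        return "CRISPRa"  # within the window upstream of the TSS
    return None
```

Because both windows are anchored on the TSS, an inaccurate TSS annotation shifts every classification, which is why accurate TSS identification matters for these applications.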

3.8.4 SgRNA in Epigenetic Regulation

dCas9 can be used to alter gene expression by recruiting epigenetic modifiers such as lysine-specific demethylase 1 (LSD1), ten-eleven translocation gene protein 1 (TET1), DNA methyltransferase MQ1, and histone acetyltransferase p300, which modify the methylation state of cytosines or the modification state of histones in the promoter region, for example by inducing demethylation or histone acetylation (Brocken et al. 2017). In some cases, epigenetic modifiers work better than CRISPRi or CRISPRa.

3.8.5 Design Criteria for Base Editing

In CRISPR/Cas systems, base editing was initially achieved by providing a repair template using the HDR pathway, which has low efficiency. To overcome this, researchers developed two CRISPR-mediated base editing platforms for DNA (cytosine base editor (CBE) and adenine base editor (ABE)) and an RNA base-editing platform. CBE and ABE were developed by fusing either cytosine deaminase or adenine deaminase with an appropriate Cas protein (dCas9 or nCas9) (Liang and Huang 2019). The RNA base editor was developed by fusing a type VI CRISPR/Cas effector (dCas13b) with a hyperactive adenosine deaminase acting on RNA 2 (ADAR2) to create a programmable RNA base editor known as REPAIR (RNA editing for programmable A to I (G) replacement). In base editing, sgRNA position depends on the targeted nucleotide’s location in the protospacer region. The targeted nucleotide must be present within the active base editing window on the non-targeted strand, which determines the position and orientation of the sgRNA. The size of the active base editing window (usually four to eight nucleotides) depends on which base editor is used (Thomas et al. 2019). Base-editing efficiency can sometimes be very low at certain positions because these are inaccessible due to nucleosomes.
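The constraint that the target nucleotide must fall inside the editing window can be checked by enumerating the protospacers that place it there. The sketch below rests on assumptions not stated in the text: the function name is hypothetical, the window is taken as protospacer positions 4 to 8 counted from the PAM-distal 5′ end, the PAM is NGG, and only protospacers on the given strand are considered.

```python
def guides_for_base_edit(seq: str, target_index: int,
                         window=(4, 8), spacer_len: int = 20):
    """List protospacers that put seq[target_index] in the editing window.

    Positions are counted from the 5' (PAM-distal) end of the protospacer,
    with position 1 being its first base. The default window of 4-8 is an
    assumption and differs between base editors. Each result is a tuple
    (protospacer, pam, position_in_protospacer).
    """
    seq = seq.upper()
    guides = []
    for pos in range(window[0], window[1] + 1):
        start = target_index - (pos - 1)
        pam_start = start + spacer_len
        if start < 0 or pam_start + 3 > len(seq):
            continue  # protospacer or PAM would run off the sequence
        if seq[pam_start + 1:pam_start + 3] == "GG":  # NGG PAM
            guides.append((seq[start:pam_start],
                           seq[pam_start:pam_start + 3], pos))
    return guides
```

An empty result means that no NGG PAM is suitably placed on this strand; the reverse complement, or a Cas variant with a different PAM, would then have to be considered.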

3.8.6 Designing sgRNA for RNA Editing

An alternative CRISPR/Cas system for regulation at the transcript level uses CRISPR/Cas13, which specifically targets single-stranded RNA (ssRNA). CRISPR/Cas13 uses CRISPR RNA (crRNA) to recognize and cleave ssRNA (Freije et al. 2019). In bacteria, non-specific cleavage of RNA has been observed after initial target cleavage by Cas13. Cas13 is used in a very sensitive diagnostic platform known as the specific high-sensitivity enzymatic reporter unlocking (SHERLOCK) assay, which has been applied to differentiating Zika virus strains (Kellner et al. 2019) and to human genotyping; Cas13 has also been used for RNA imaging (Yang et al. 2019). SHERLOCK could also be useful for detecting SARS-CoV-2, the RNA virus that causes coronavirus disease 2019 (COVID-19) (Joung et al. 2020).

3.9 Design Tools for sgRNA

Design tools available to the CRISPR/Cas community have been developed by both academic and commercial institutes. Although the basic objective of these tools is to design and select an optimal sgRNA and provide information about the target site, each tool has its own particular features and benefits. Similarly, these tools all aim to provide sgRNA with minimal off-targets in the genome, but they employ various methods to score these off-targets. For example, off-target scoring in CHOPCHOP is based on empirical data from multiple studies, while Cas-Finder and E-CRISP evaluate off-targets using user-defined values for mismatch number and position.

Some design tools are application- or species-specific. For example, CRISPR-ERA and BE-Designer specifically design sgRNA for transcriptional regulation (CRISPRi/CRISPRa) and base editing, respectively. FlyCRISPR and CRISPR-PLANT are specialized for sgRNA design in Drosophila and plants, respectively (Liu et al. 2020). Some sgRNA design tools provide users with additional options for selecting alternative PAM sites, as well as the Cas effector. Some useful sgRNA design tools are listed in Table 3.5, and a few of them are discussed below.

Table 3.5 Selected sgRNA design tools

3.9.1 CHOPCHOP

More than 200 genomes are available on the CHOPCHOP website; users can input a gene name or target sequence. This tool supports sgRNA design for multiple applications (KO, KI, CRISPRi, and CRISPRa), and users can choose an application-specific Cas effector endonuclease. CHOPCHOP ranks potential sgRNAs on position, GC content, mismatch number, and efficiency scores (Liu et al. 2020).

3.9.2 Base Editing (BE)-Analyzer and BE-Designer

These are publicly available design tools for base editing. Together, they help researchers select sgRNA for a desired region (BE-Designer) and analyze outcomes of base editing from NGS data (BE-Analyzer). BE-Designer also lists all potential sgRNAs for a given DNA sequence and provides off-targets for a given sgRNA against a large number of species (Hwang et al. 2018).

3.9.3 CRISPOR

One of the best tools for designing efficient sgRNA, CRISPOR supports 19 different PAMs and 417 different genomes. It can accept genome coordinates or user-provided sequences. Each sgRNA is ranked for off-targets, specificity, and efficiency. Outcome predictions, GC content, and poly-T stretches are also reported for each sgRNA (Liu et al. 2020).

3.9.4 CRISFlash

Like CHOPCHOP, CRISFlash can use sequenced genome or genome sequence to design sgRNA. In addition, it accepts user-defined values to optimize sgRNA and off-targets. CRISFlash is considered a faster tool for sgRNA design and scoring off-targets (Jacquin et al. 2019).

3.10 Prospects

CRISPR/Cas technology is a revolutionary tool for functional genomics, human therapeutics, and agricultural advances. Because sgRNA plays an indispensable role in CRISPR/Cas-mediated genome editing, numerous tools have been developed for designing efficient and specific sgRNA with minimal off-targets. However, off-targets continue to represent a major challenge for CRISPR/Cas-mediated genome manipulation. Systematic studies show that predictive models for efficient sgRNA design are not always effective for all applications and all species. This makes it imperative that scientists know the weaknesses and strengths of each model for sgRNA design. As new knowledge about CRISPR continues to emerge, it is clear that sgRNA and PAM are not the only influences on CRISPR/Cas cleavage; additional factors include GC content and chromatin accessibility. The ongoing discovery of new features that contribute to CRISPR/Cas specificity and efficiency will also help minimize off-targets. Moreover, it has become clear that CRISPR/Cas repair outcomes are specific rather than random. Such findings will facilitate more precise editing with CRISPR/Cas.

In summary, recent advances in our understanding of CRISPR mechanisms and factors affecting specificity and efficiency, combined with the further development of bioinformatics tools, will enable more precision in achieving desired on-target modifications while minimizing off-targets. Directed evolution using EvolvR may also help scientists to engineer new Cas proteins with improved specificity.