Homing endonucleases: basic mechanisms

Homing is a gene conversion process, whereby a mobile sequence is copied and inserted into a cognate allele that lacks this sequence. This relies on site-specific double-strand cleavage in the target allele, catalyzed by a nuclease encoded within the mobile sequence. Homing was discovered in the early 1970s upon the identification of the intron-encoded “homing endonuclease (HE)” responsible for the significant polarity of recombination observed for rRNA gene markers in genetic crosses between yeast mitochondria [1]. HEs are usually encoded within protein-encoding genes, by self-splicing introns within the RNA (group I, group II and Archaeal introns) or by inteins that self-splice at the protein level (Fig. 1). HEs may also be present as freestanding elements between individual genes. Homing is widespread and HEs or their homologs have been found in all phyla. HEs recognize very long DNA sequences (12–40 bp); hence they are also known as meganucleases. They tolerate certain substitutions within the target site, resulting in the effective recognition and cleavage of families of sequences. However, the long target sites occur only rarely in a whole genome. For instance, I-SceI recognizes only one such site in the ~13,000-kbp yeast genome. The relative rarity of HE target sites means that HEs pose a smaller threat to the integrity of the host DNA than restriction endonucleases (REs). Additionally, the corresponding mobile DNA elements do not require reversible methylation by methylases to protect the host DNA from multiple rounds of cleavage. Instead, intron- and intein-encoded HEs irreversibly disrupt the target site by inserting the DNA of their mobile element. The insertion sites of freestanding HEs encoded by T-even phages are, however, spatially separated (up to hundreds of base pairs) from the cleavage sites [2]. Thus, potential cleavage sites for the freestanding HEs exist in both the host (self) and target (non-self) DNA.
In this context, the exact mechanism underlying protection of the self-DNA from cleavage remains unknown. Nucleotide polymorphism may provide a means of distinguishing between the self and non-self DNA. The cleavage-proficient self DNA variants would be counter-selected and eliminated, and the resistant variants would be preferentially inherited in the progeny phage [3].
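The rarity of long target sites noted above can be illustrated with a back-of-the-envelope estimate. The sketch below is illustrative only: it assumes a random genome with independent, equiprobable bases and exact matching on both strands, which neither real genomes nor substitution-tolerant HEs satisfy. The function name `expected_site_count` and the choice of 6 bp and 18 bp site lengths are assumptions made for the comparison; only the ~13,000-kbp genome size comes from the text.

```python
def expected_site_count(site_length_bp, genome_length_bp, both_strands=True):
    """Expected number of exact occurrences of one fixed site in a random genome.

    Assumes independent, equiprobable bases (probability 1/4 each), a crude
    simplification used here only to compare orders of magnitude.
    """
    if site_length_bp <= 0 or genome_length_bp < site_length_bp:
        return 0.0
    positions = genome_length_bp - site_length_bp + 1  # possible start positions
    strands = 2 if both_strands else 1                 # a site can lie on either strand
    return strands * positions / 4 ** site_length_bp


YEAST_GENOME_BP = 13_000_000  # ~13,000 kbp, the figure quoted in the text

# 6 bp: a short, RE-like site (assumed); 18 bp: a site within the 12-40 bp HE range
for length in (6, 18):
    print(length, expected_site_count(length, YEAST_GENOME_BP))
```

Under these assumptions a 6 bp site is expected thousands of times in such a genome, whereas an 18 bp site is expected far less than once, which is consistent with the smaller threat that HEs pose to host DNA.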

Fig. 1

Homing mechanisms for group I introns (a), inteins (b) and group II introns (c). The homing endonuclease coding sequence of gene A (green) is duplicated in its cognate allele, gene A′. Mobile ORFs and encoded products are green; host gene exons and products are yellow; the other nucleic acid sequence is black, DNA is coloured with a 3D effect and RNA is flat. The ORF position is indicated with a black arrow. HE Homing endonuclease, M maturase, RT reverse-transcriptase, RNP ribonucleoprotein. The different stages of each mechanism are numbered in the scheme. The diagram showing the mechanism of action for group II introns is highly simplified. More detailed figures of their homing mechanisms can be found in specific reviews [158]

HEs can be considered as selfish nucleases that promote their own proliferation by cleaving the foreign DNA, thereby inducing DNA repair by recombination, which in turn increases the chances of duplication of their genes. HEs often modify the target sites to minimize the destruction of their own DNA. This is achieved through either irreversible disruption of the target DNA during the process of their own proliferation or reversible modification catalyzed by another enzyme [4, 5].

The nomenclature of HEs resembles that of REs, with additional prefixes indicating different classes: I, for intron-encoded enzymes (e.g. I-CreI); PI, for intein-encoded enzymes (usually with an additional protein splicing activity, e.g. PI-SceI); and F, for freestanding enzymes [6]. The homing mechanisms used by the three types of HEs are summarized in Fig. 1. HEs encoded by group I and Archaeal introns or inteins exhibit similar mechanisms of homing (Fig. 1a, b). Their mechanism of action relies solely on the endonuclease, which cleaves the target site in the intron-less allele to generate recombinogenic ends. This leads to strictly DNA-dependent homologous recombination, resulting in the duplication of the intron (or the intein coding sequence) at the cleaved site. The only difference lies in how the enzyme is produced: in the case of group I introns, the RNA self-splices, whereas in the case of inteins, the translated polypeptide self-processes to generate the enzyme (Fig. 1a, b). A similar mechanism is exhibited by freestanding HEs, which cause the duplication of the corresponding gene at a site located at some distance from the cleavage site [3]. The homing of group II introns may or may not involve DNA homologous recombination, but is always initiated by a complex formed by the protein encoded by the intron ORF and the excised intron (Fig. 1c). This ribonucleoprotein (RNP) displays endonuclease, reverse transcriptase and maturase activities (the latter consisting of binding the host intron and facilitating RNA-catalyzed splicing). The RNP initiates homing by reverse splicing of the intron RNA directly into a DNA target site; the inserted RNA is then reverse transcribed by the intron-encoded protein. After DNA insertion, the introns remove themselves by protein-assisted, autocatalytic RNA splicing, thereby minimizing host damage [7].

Homing endonucleases are highly specific and can induce homologous recombination in different types of cells, including mammalian cells. These enzymes are therefore useful templates for the development of custom tools for gene targeting [8]. Such tools may be used for a number of potential biomedical and biotechnological applications, ranging from genome editing to gene therapy. However, the enzyme selected as a template must first be engineered to recognize the new DNA target sequence with high specificity. Prior understanding of the structure–function relationship of these enzymes with their DNA target is therefore essential for the design and development of custom enzymes that target a DNA sequence of interest, at the same time avoiding deleterious side effects when produced in cells.

Homing endonuclease families

The different mechanisms of recognition and cleavage exhibited by group I and group II intron HEs and the existence of distinct families of HEs suggest that different homing endonucleases evolved independently. HEs have been grouped into five families (LAGLIDADG, HNH, GIY-YIG, His-Cys box and PD-(D/E)XK; see Table 1), classified by their conserved active site core motifs [9].

Table 1 Summary of crystallographic structures of the different homing endonuclease families

A new HE family was recently predicted from an analysis of microbial environmental metagenomic sequence data [10]. The authors found a novel genomic arrangement of certain genes, whereby the coding region of a conserved enzyme was divided into two by a split-intein, with an endonuclease gene in between the two coding regions. These freestanding endonuclease genes were found to belong to different families, the GIY-YIG family or the novel very-short patch repair (Vsr)-like family of endonucleases. Sequence-to-sequence or sequence-to-multiple sequence alignment comparisons showed similarities between these endonuclease-encoding regions and Vsr DNA G:T-mismatch endonucleases [11]. Broader sequence similarity searches showed that this family of HEs is also present in group I introns and as freestanding inserts in the absence of surrounding intervening sequences [10].

As well as their conserved sequence motifs, the HE families also display differences in their catalytic mechanism, biological and genomic distribution, and in their similarities with non-homing endonucleases. The structure–function relationships of the different families, with some representative examples, are described below and summarized in Table 1. The families described are all encoded within group I introns or inteins, and some of them are also associated with maturase activity. No structural data are currently available for group II intron HEs, which function as part of an RNP complex with their own excised intron RNA.

The LAGLIDADG family

These endonucleases contain the LAGLIDADG sequence motif. They are the best characterized of the HE families. Several hundred intron-encoded nucleases have been identified and their endonuclease activity demonstrated experimentally in many cases. They have a widespread distribution, being present in the genomes of plant and algal chloroplasts, fungal and protozoan mitochondria, and Archaea [12]. On the basis of primary sequence homology and additional information from structural studies of some of its members, the LAGLIDADG family can be divided into two subfamilies. Homing endonucleases belonging to the first subfamily function as homodimers and bind consensus DNA targets displaying palindromic or near-palindromic symmetry. Each monomer contains a single copy of the LAGLIDADG motif that forms the core of the dimer interface, with the conserved acidic residues forming part of the active site. Examples from this subfamily are I-CreI, I-CeuI and I-MsoI (Fig. 2) [13–16]. The second subfamily consists of proteins with two LAGLIDADG motifs, one in each of the two similar domains of the monomeric enzyme. These monomeric pseudosymmetric enzymes recognize and cleave non-palindromic DNA sites. Examples from this subfamily are I-AniI, I-DmoI and I-SceI (Fig. 2) [17–19].
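The distinction between palindromic, near-palindromic and non-palindromic targets can be made explicit by comparing each position of a site with the complement of its mirror position. The following is a minimal sketch under stated assumptions: the function names, the threshold separating "near-palindromic" from "non-palindromic" and the example sequences are all hypothetical and are not taken from any real HE target.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def asymmetric_pairs(site):
    """Count position pairs that break the two-fold (palindromic) symmetry.

    Position i is compared with the complement of its mirror position; for an
    odd-length site the central base is ignored.
    """
    site = site.upper()
    n = len(site)
    return sum(
        1 for i in range(n // 2)
        if site[i] != _COMPLEMENT[site[n - 1 - i]]
    )


def classify_symmetry(site, max_asymmetric_pairs=2):
    """Label a site; the cut-off for 'near-palindromic' is an arbitrary assumption."""
    broken = asymmetric_pairs(site)
    if broken == 0:
        return "palindromic"
    if broken <= max_asymmetric_pairs:
        return "near-palindromic"
    return "non-palindromic"


# Hypothetical 10 bp sequences, not real HE targets
for seq in ("GAAACGTTTC", "GAAACGTTTA", "GATTACAGGC"):
    print(seq, asymmetric_pairs(seq), classify_symmetry(seq))
```

A homodimeric scaffold presents two identical DNA-binding surfaces, so targets with few asymmetric pairs are its natural substrates, whereas a monomeric enzyme with two different domains can accommodate sites in which most pairs are asymmetric.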

Fig. 2

Crystallographic structures of representative members of the five HE families. a LAGLIDADG family: monomeric I-DmoI and homodimeric I-CreI. b PD-D/E-XK family: the tetrameric I-SspI. c His-Cys Box family: the homodimeric I-PpoI. d GIY-YIG family: I-TevI catalytic and DNA binding domains. e HNH family: the monomeric I-HmuI. f Engineered heterodimeric variants Amel and Ini based on the I-CreI template. a–e The enzymes are shown in cartoon representation with the bound target site in stick representation. Catalytic ions are shown as yellow spheres and structural Zn ions are shown as orange spheres

LAGLIDADG HEs recognize long DNA sites (22 bp and longer) and cleave across the minor groove to generate cohesive 4 bp 3′-OH overhangs. X-ray structures obtained for several LAGLIDADG proteins from the two subfamilies (I-CreI, I-MsoI and I-CeuI, and I-AniI, I-SceI, I-DmoI and PI-SceI) bound to their DNA targets have provided insight into their mechanisms of DNA recognition, binding and catalysis [13, 18–24].
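How two staggered cuts give rise to a cohesive overhang of a given length and polarity can be sketched as follows. This is an illustrative helper, not a description of any published tool: the function name, the coordinate convention and the 22 bp example sequence with its cut positions are assumptions. The EcoRI example (G^AATTC, leaving 5′ AATT overhangs) is included only as a familiar check of the convention.

```python
def describe_overhang(site, top_cut, bottom_cut):
    """Classify the ends left by one cut on each strand of a duplex.

    `site` is the top strand written 5'->3'. Both cut positions are given in
    top-strand coordinates: a cut at k falls between bases k-1 and k.
    Returns (end type, overhang length, top-strand sequence of the staggered region).
    """
    if not (0 <= top_cut <= len(site) and 0 <= bottom_cut <= len(site)):
        raise ValueError("cut positions must lie within the site")
    lo, hi = sorted((top_cut, bottom_cut))
    if top_cut > bottom_cut:
        kind = "3' overhang"   # top strand is cut further towards its 3' end
    elif top_cut < bottom_cut:
        kind = "5' overhang"   # top strand is cut further towards its 5' end
    else:
        kind = "blunt"
    return kind, hi - lo, site[lo:hi]


# Hypothetical 22 bp site cut 4 bases apart, giving 3' overhangs of 4 bases
print(describe_overhang("ACGTACGTACGTACGTACGTAC", top_cut=13, bottom_cut=9))

# EcoRI (G^AATTC on each strand) for comparison: 5' overhangs of 4 bases
print(describe_overhang("GAATTC", top_cut=1, bottom_cut=5))
```

With this convention, the 2, 3, 4 and 5 base 3′ overhangs reported for the different HE families differ only in the distance between the two cut positions.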

The first two artificially engineered meganuclease chimeras, H-DreI and DmoCre [25, 26], were generated using templates from the LAGLIDADG family, by fusing the N-terminal domain of I-DmoI to I-CreI. Different studies have shown that single-chain variants of I-CreI (whose native form is homodimeric) retain their catalytic activity and can be further modified to recognize non-palindromic DNA sequences [26, 27].

The LAGLIDADG enzymes are quite flexible in terms of target DNA recognition and can tolerate individual polymorphisms without considerable loss of binding or cleavage efficiency [13, 19, 23]. The extended substrate-binding cleft allows direct protein-DNA contact, facilitating docking before and during catalysis. This family of enzymes forces an acute twist of the DNA to bring the scissile phosphodiester bonds of both strands together at the narrow catalytic site. The catalytic mechanism resembles the canonical two-metal-ion mechanism, previously described for phosphoryl transferases and REs [28–30]. In-depth studies of the mechanisms of catalysis for both subfamilies have been based on structural data obtained for I-CreI and I-DmoI bound with their DNA targets [20, 22, 31]. Catalysis requires the two conserved acidic residues at the carboxyl-termini of the LAGLIDADG helices in the active site and coordination of the divalent metal ions. The structures of uncleaved substrate complexes (determined in the presence of non-activating Ca2+ ions) for I-CreI and I-SceI show the presence of three Ca2+ ions at the active site. Similarly, the corresponding cleaved substrate structures had three Mg2+/Mn2+ ions in the active sites [31]. However, the structures of I-CeuI and I-DmoI show the unambiguous presence of a single Ca2+ ion [22, 24], although in non-equivalent positions within the active site. The cleaved substrate structure of I-DmoI shows the presence of only two Mn2+ cations, with a water molecule modelled at the central position. No anomalous signal was observed in the Mn2+ anomalous difference Fourier electron density map at this central position, which corresponds to the third metal site in other enzymes [22]. This is also seen in the structures of I-AniI and PI-SceI [18, 19, 23], which do not show any ion or water molecules in the shared central position. It is possible that this third metal site is transiently occupied during catalysis.
The presence of the single Ca2+ ion in the I-DmoI substrate bound structure is indicative of asymmetry during catalysis, suggesting a preference in the cleavage of one of the DNA strands [17, 22], as previously suggested for I-SceI [22, 32].

A subset of HEs act as maturases: these enzymes help in the splicing of their host intron by promoting a conformation that favors RNA-catalyzed splicing. The majority of known group I maturases are members of the LAGLIDADG protein family, acting as both homing endonucleases and RNA-splicing maturases. I-AniI is a good example of an enzyme that acts both as a highly specific endonuclease and as an RNA maturase [18, 33]. RNA maturase activity is required for normal expression of the mitochondrial apocytochrome b oxidase gene [33, 34]. It thus confers a selective advantage to the host and the LAGLIDADG protein-fold is thereby maintained.

HEs from the LAGLIDADG family have been engineered to specifically target new DNA sequences and are emerging as a powerful tool for gene targeting (see “Gene therapy applications”). Some have been designed using a combinatorial approach for binding and cleaving of specific DNA sequences [35–38] (Fig. 2).

The HNH family

The distinguishing signature motif of this family of HEs is their HNH nuclease motif, in which the last histidine can sometimes be substituted by asparagine. HNH motifs appear to have diverged from a common ancestor, adopting a variety of biological roles that require the action of a nuclease. Indeed, this motif is found in a wide range of enzymes including HEs coded by group II introns, intein-associated HEs, non-specific bacterial and fungal nucleases, such as the colicin bacterial enzyme family [39, 40], DNA-acting enzymes such as transposases, restriction endonucleases (such as KpnI, HphI, Hin4II, Eco31I), DNA packaging factors, and bacterial factors involved in developmentally controlled DNA rearrangements [12, 41–45]. The HNH family has been divided into at least eight subfamilies based on the sequence of the nuclease core containing the HNH motif and the presence or absence of distinctive conserved flanking residues, such as cysteine pairs that may bind to additional metal ions [41]. Unlike members of the other families, some of the HNH HEs, such as I-HmuI, generate single-strand breaks in the target DNA [46].

Well characterized members of the HNH family encoded within group I introns include I-HmuI and I-HmuII from the B. subtilis bacteriophage SPO1 and SP82 introns [47–50], I-BasI from Bastille bacteriophage [51], phage T4 endonuclease I-TevIII [52], the I-CmoeI endonuclease from the psbA gene of the C. moewusii chloroplast [53] and the related ORF from the same gene in C. reinhardtii [54]. So far, the only solved structure in this family is that of I-HmuI bound to its double-stranded DNA target (Fig. 2). The structural data obtained showed the cleavage of a single strand of its target, forming a nick in the DNA [49, 55]. The only bound metal ion binds the free 5′-phosphate, the 3′-OH, and a pair of enzyme side-chains. The elongated structure of I-HmuI shows a series of continuous different structural domains and motifs that perfectly intercalate and distort the DNA chain. The number of specific (protein-base) contacts within its binding site is small (14 out of 25 bp, 2 flanking the site of the strand break) [55]. For this reason, although I-HmuI, and probably its analogous HNH endonucleases, can accommodate a much longer target than known members of the LAGLIDADG and His-Cys box families, its overall specificity is low.

His-Cys box family

The host genes for all known His-Cys box family members have been identified as nuclear ribosomal DNA loci in a variety of eukaryotic organisms. This family has a conserved region of around 100 amino acids that contains conserved His and Cys residues [56, 57]. Amino-acid sequence beyond this motif is not significantly conserved across the family. Several family members, such as I-PpoI, I-DirI and I-DirII, have been studied. I-PpoI, the most fully characterized member, is a 163 amino acid protein encoded within the mobile intron PpLSU3, found in the extrachromosomal rDNA of the myxomycete P. polycephalum [58–60]. I-DirI and I-DirII are around 260 amino acids long. They were identified within group I introns at position 956 in the SSU rDNA of two closely related isolates of the plasmodial slime mould D. iridis [61–63]. Recent reports (reviewed in [64]) shed some light on the expression of these rDNA-encoded proteins, highlighting the important role of catalytic RNAs.

Members of the His-Cys box family of HEs recognize and cleave intronless homing sites of around 15–19 bp. The homing sites are palindromic or pseudopalindromic. Similarly to other HEs, their DNA target sites may carry substitutions at many positions and still be recognized and cleaved. However, insertions and deletions are not tolerated and prevent cleavage [60, 65, 66]. The regions flanking the homing site come into contact with the enzyme (the footprint of I-PpoI extends to 23 bp, whereas its homing site is only 15 bp) and can affect catalysis, independently of the reaction conditions. Most members of this family generate 4 bp 3′-OH overhangs [67]. However, the Naegleria sp. nucleases I-NanI, I-NitI and I-NgrI were found to generate 5 bp 3′-OH staggered ends, differing from all other known HEs [65–67]. The oligomeric state of the His-Cys box family members has only been directly observed for I-PpoI, which forms a dimer [68] (Fig. 2). The sequence outside the His-Cys motif is not sufficiently conserved across family members to allow their oligomeric state to be predicted. However, the palindromic nature of their homing sites is indicative of a homodimeric scaffold, although not necessarily resembling the I-PpoI dimer interface.

The structure of I-PpoI in its apo form and bound to its homing site has been studied extensively by X-ray crystallography [68–72]. Many divalent cations can be used by I-PpoI, although there is a preference for those that can adopt an octahedral geometry [59]. The I-PpoI dimer displays a mixed α/β topology, resulting in an elongated molecule. In contrast to LAGLIDADG HEs, the I-PpoI dimer interface is small, highly solvated and more loosely packed. The C-terminal end residues are domain-swapped and contribute to the dimer interface. Cys or His residues in the His-Cys box motif form two novel Zn binding folds, which have a structural and stabilizing role. I-PpoI significantly distorts the homing-site DNA, rendering the scissile phosphodiester bonds accessible to the two separate active sites. Structure-guided protein engineering has been used to alter the DNA recognition specificity of I-PpoI [73]. These studies found little correlation between the binding, selectivity and cleavage properties of individual variant proteins.

GIY-YIG family

Most GIY-YIG family enzymes are complex multidomain proteins that perform various functions by combining the characteristic GIY-YIG catalytic domain with other non-catalytic domains responsible for DNA binding specificity [74]. Therefore, due to their non-catalytic function, the regions outside this domain are not conserved across family members, and vary from enzyme to enzyme. Multiple sequence alignments have been used to further examine the catalytic domain [75]. This domain ranges between 70 and 100 amino acids in length and contains five conserved sequence motifs. Not all the motifs are necessarily present in all the proteins in the family. One of these is the GIY-YIG motif which gives the family its name. The catalytic residues are not present in this characteristic sequence but are found in the other four conserved motifs within the catalytic domain. HEs belonging to this family have been characterized by the presence of the conserved sequence, “GIY-10/11 residues-YIG”. However, recent observations have shown that at least one family member, I-BathII, has an unconventional GIY-(X)8-YIG motif [76]. The GIY-YIG domain is found in proteins encoded by group I introns in the mitochondria and chloroplast genomes of fungi, algae and liverworts and in proteins encoded within the genomes of several bacteriophages (T4 and T4-related phages), in intergenic regions of bacteria, phage and viruses, and in the UvrC subunits of bacterial and archaeal (A)BC excinucleases [75].

The most extensively studied member of this family, I-TevI, consists of an N-terminal GIY-YIG catalytic domain and a C-terminal DNA-binding domain, which are joined by a 75 amino acid linker containing a zinc finger [77–79] (Fig. 2). The catalytic domain encompasses the five motifs conserved by this family and has a novel α/β fold with a central three-stranded anti-parallel beta sheet flanked by three helices [80]. The C-terminal domain itself is composed of a minor groove-binding α-helix (‘NUMOD3’ motif) and a helix–turn–helix (HTH) motif also referred to as ‘IENR1’ domain [81]. It resembles the C-terminal DNA-binding domain of the HNH HE I-HmuI. As mentioned above, the DNA binding domain is an unusually extended structure that wraps around the DNA, lining the minor groove of the DNA-duplex. The interdomain segment of I-TevI has a complex function and has been studied in depth [75, 82, 83]. It has essential roles in determining distance along the sequence when selecting the cleavage site; indeed, the catalytic domain on its own makes non-specific cuts. Mutations in the Zn-finger and in an unstructured segment show that the linker is essential for normal I-TevI function.

These HEs generate double-strand breaks (DSB) in a divalent ion-dependent manner, producing 2 bp 3′-OH overhangs. Mg2+ is most commonly used for functional experiments but these enzymes can use other cations [84]. It remains unclear how a monomeric enzyme with one active site cuts both strands of the DNA. These enzymes may act as monomers, although there are no structural data available for the full-length protein bound to DNA. They cleave the two strands of the DNA by independent nicking reactions; thus, either there is a conformational change in the DNA to accommodate both strands in one active site, or two catalytic domains oligomerize to form a transient complex with the DNA [85]. This issue has been addressed in biochemical studies of double-strand break formation by I-TevI and I-BmoI [85–87]. Additionally, a role for monomer–dimer transition has been demonstrated in the cleavage of double-stranded DNA by the Eco29kI endonuclease, another GIY-YIG superfamily member [88].

The GIY-YIG HEs recognize long DNA targets (around 40 bp). The C-terminal DNA binding domain recognizes a region of approximately 20 bp around the intron insertion site and the N-terminal catalytic domain cleaves approximately 25 bp upstream [89–92]. In general, GIY-YIG HEs cleave their intronless homing sites with relaxed specificity. I-TevI, for example, tolerates several mutations, as well as insertions of up to 5 bp and deletions of up to 16 bp between the intron insertion site and the cleavage site [93, 94].

The PD-(D/E)XK family

This family of HE belongs to the PD-(D/E)XK superfamily of nucleases. The first member of this family to be identified was I-Ssp6803I, or I-SspI for simplicity. It was the first example of a chromosomally encoded group I intron endonuclease in bacteria [95]. I-SspI is encoded by a self-splicing intron in the tRNAfMet gene of Synechocystis PCC6803 [95, 96].

I-SspI is a very small protein, only 150 amino acids long, making it the smallest HE characterized to date. Sequence comparison of I-SspI with the Hjc resolvase family revealed several conserved residues indicative of a canonical PD-(D/E)-XK active site within this HE [97, 98]. The PD-(D/E)-XK nuclease fold locates the catalytic residues at the concave surface of a curved β-sheet, facing an α-helix. This fold is typically found in most known RE and in a wide variety of DNA repair enzymes, all of which act on short DNA targets with strict fidelity. I-SspI was the first HE to be identified with this fold, which had previously only been found in REs. Similarly, the LAGLIDADG fold has only been found in HEs. I-SspI is unique in that it supports the PD-(D/E)-XK fold, but recognizes a long site: 23 bp, of which the central 17 bp show particularly high specificity [95, 97, 99]. This target site is pseudopalindromic and encompasses the entire anticodon stem and loop of the tRNAfMet gene.

I-SspI forms a tetramer that binds one DNA molecule (Fig. 2), consistent with solution studies [97]. This tetrameric arrangement allows the gene encoding I-SspI to be small. This minimizes its interference with the folding and splicing of its host intron and creates an elongated DNA-binding surface that is able to recognize a long target site. At the same time, two of the subunits provide separate active centers, each of which cleaves one strand of the double helix, while the two other subunits in the tetramer stabilize the whole assembly.

This enzyme cuts its target DNA on both strands and, in contrast to other HEs, cleaves one DNA strand precisely at the position of intron insertion. The DNA is cleaved across the minor groove producing 3′-OH single-strand overhangs. These overhangs are the result of a metal-dependent in-line displacement mechanism, characteristic of the PD-(D/E)XK fold. However, I-SspI is unique in that the extensions are three bases long rather than the two or four bp overhangs generated by most other HEs. The enzyme generates transient nicked intermediates during DNA cleavage under certain conditions [99]. DNA base-pair mutations at several positions show a strong correlation between cleavage and binding specificities; some mutations however affect cleavage but not binding [99].

Mechanisms of double-stranded break repair

Homologous recombination (HR) results in the exchange of genetic information between two endogenous DNA molecules or between an endogenous and exogenous DNA molecule. It may be useful in gene therapy for the exchange between an endogenous defective chromosomal sequence and an exogenous repair DNA construct. HR requires a few hundred base pairs of homology between both the targeting and the targeted DNA. The presence of free DNA ends in the exogenous construct is sufficient to stimulate this process [100]. These free DNA ends are resected to generate 3′ single-stranded overhangs and to form a nucleofilament with proteins from the RecA/Rad51 family [101, 102]. This nucleofilament then invades the homologous duplex, resulting in strand transfer [101, 103] (Fig. 3).

Fig. 3

Double-strand breaks (DSBs) can be repaired by several homologous recombination (HR)-mediated pathways, including double-strand break repair (DSBR) and synthesis-dependent strand annealing (SDSA). Upper In both pathways, repair is initiated by resection of a DSB to provide 3′ single-stranded DNA (ssDNA) overhangs. Strand invasion at these 3′ ssDNA overhangs into a homologous sequence is followed by DNA synthesis at the invading end. Lower left After strand invasion and synthesis, the second DSB end can be captured to form an intermediate with two Holliday junctions (HJs). After gap-repair DNA synthesis and ligation, the resolved structure at the HJs may be a non-crossover (black arrowheads at both HJs) or crossover product (green and black arrowheads). Lower right Alternatively, the reaction may proceed to SDSA by strand displacement and the annealing of the extended single-strand end to the ssDNA at the other break end, followed by gap-filling DNA synthesis and ligation. The repair product from SDSA is always non-crossover

Non-homologous end joining (NHEJ) and HR are two competing DSB repair pathways. NHEJ is used for DSB repair in the absence of a homologous repair sequence [102–104]. The most frequent outcome of NHEJ is the perfect joining of the broken ends; however, imperfect rejoining may result in the addition or deletion of base pairs, inactivating the targeted open reading frame.
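The link between imperfect rejoining and inactivation of an open reading frame follows from codon arithmetic: a net gain or loss of bases that is not a multiple of three shifts the reading frame downstream of the junction. The sketch below illustrates this under simplifying assumptions; the function name and labels are hypothetical, and an in-frame change may of course still inactivate a protein if it removes or adds essential residues.

```python
def indel_consequence(inserted_bp, deleted_bp):
    """Classify the effect of a repair junction on the downstream reading frame.

    Only the net length change is considered; the identity of the affected
    codons is ignored, which is a deliberate simplification.
    """
    if inserted_bp < 0 or deleted_bp < 0:
        raise ValueError("base-pair counts must be non-negative")
    if inserted_bp == 0 and deleted_bp == 0:
        return "perfect rejoining"
    net_change = inserted_bp - deleted_bp
    if net_change % 3 == 0:
        return "in-frame change"
    return "frameshift"


# Hypothetical repair outcomes: (inserted bp, deleted bp)
for outcome in ((0, 0), (1, 0), (0, 2), (0, 3), (2, 5), (4, 0)):
    print(outcome, indel_consequence(*outcome))
```

Since two out of every three possible net length changes are not multiples of three, most imperfect junctions within a coding sequence would be expected to disrupt the reading frame.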

Strategies for gene repair using meganucleases

Depending on the strategy chosen to repair or edit a given gene, different options are available (Table 2). In many gene repair applications, mainly for therapeutic purposes, endonuclease-induced breaks must be repaired in a conservative manner to maintain the information contained in the DNA sequence. DSBs can be repaired by either HR or NHEJ. HR occurs without loss of sequence information. However, NHEJ usually results in the loss of sequence at the repair junction and can promote chromosome translocations at DSBs, leading to genomic instability [105]. Recent studies have shown that the use of a nicking enzyme rather than a DSB nuclease might stimulate HR while reducing genomic instability associated with the alternative NHEJ repair of DSBs [106, 107]. Alternatively, these single-strand breaks can be repaired by proteins involved in downstream steps of the base excision repair pathway [108, 109]. However, when single-strand breaks are not rapidly or properly repaired, they can collapse replication forks, block transcription or promote excessive activation of the single-strand break sensor protein poly(ADP-ribose) polymerase 1 (PARP1) with dangerous consequences for the cell [110]. Taking this into account, together with the structure–function relationships described above, members of the HNH and GIY-YIG families seem to be appropriate candidates for use as templates to engineer a nicking enzyme that promotes DSB repair using the HR pathway. Additionally, they bind non-palindromic DNA targets, widening the range of potential sequences that can be targeted. However, they lack specific protein-DNA contacts, consequently lowering their natural specificity and increasing the risk of deleterious off-site cleavage. Similarly, although members of the PD-D/E-XK family display high specificity for their targets, the tetrameric oligomerization state observed in I-SspI presents an extra challenge in redesigning the DNA recognition properties of members of this family.
Members of the His-Cys and LAGLIDADG families may therefore be considered as more suitable templates. Both of these types of HE are highly specific. Attempts at the structure-based protein engineering of I-PpoI have yielded ambiguous results [73]. However, newly designed variants of I-CreI, a representative homodimeric member of the LAGLIDADG family, have been able to target DNA sequences that are completely different from its wild-type target sequences [38]. The LAGLIDADG family also has the advantage of providing two scaffolds, monomeric and homodimeric, widening the range of sequences, palindromic and non-palindromic, that can be targeted (Table 2).

Table 2 Schematic summary of the strategies employed to generate HE variants with novel target specificities

Gene therapy applications

Cell and gene therapy are emerging as fundamental tools for fighting genetic diseases at their origin, through the repair of or compensation for the defective gene. However, most strategies tested so far have relied on retroviral or lentiviral vectors to deliver the corrected gene. Due to the semi-random viral integration throughout the genome, these strategies have several potential limitations: transgene silencing, disruption of endogenous genes and the transcriptional activation of neighboring genes. An alternative approach involves use of the correct DNA copy to specifically target the mutated gene for correction of the defect. Gene repair does not simply restore a physiological pathway or function, but directly erases the deleterious mutation (Fig. 4). This brings significant advantages, eliminating the risk of oncogene activation, and alleviating the recurring problem of transgene silencing.

Fig. 4

Diagram showing the ex vivo approach. The location and characterization of chromosomal damage is followed by the introduction of the engineered HE into the isolated cell population together with the correcting template DNA matrix. The mutated gene is repaired by DSB-induced HR and, after selection, cells with a normal gene are recovered

Engineering meganucleases for therapeutic purposes

The use of HEs in therapeutic applications is dependent on the capacity of a DSB to actively induce HR [100, 111]. The potential repair of a particular mutated gene requires the generation of an engineered meganuclease that forms a DSB as close as possible to the mutation and triggers HR (Figs. 5, 6). Unfortunately, difficulties in modifying the specificity of DNA recognition have limited the engineering of DNA-binding proteins [112, 113]. However, recent progress has allowed the production of nucleases that initiate DSB-induced recombination at a chosen locus. A canonical zinc finger (ZF) DNA binding motif has been engineered and fused to the catalytic domain of FokI to produce zinc finger nucleases (ZFNs) composed of two subunits, which assemble to form dimers at the cleavage site [114, 115]. Engineered ZFNs, together with a chromosomal repair matrix, can induce gene correction in the Drosophila yellow gene, in the germ line as well as in somatic tissues [116]. Several studies in HEK-293 cells have reported maximal ZFN-induced gene correction efficiencies of 0.4, 0.5 and 1% [117–119]. A similar experimental design was also successfully applied to tobacco plants [120]. However, high levels of cytotoxicity, potentially hampering their therapeutic use, were observed in Drosophila and mammalian cells, presumably due to off-target cleavage. In another study, ZFNs were redesigned to correct mutations in the human IL2RG gene. Gene correction was observed with up to 18 and 5% efficiency in K562 cells and in human CD4+ T cells, respectively, with reduced toxicity [121]. In silico studies, using protein modelling and energy calculations, have allowed further development of ZFNs to reduce non-specific cleavage [122].

Fig. 5

Strategy for making redesigned HEs. a General strategy. I-CreI variant libraries with locally altered specificity are generated. These mutants are assembled into homodimeric and heterodimeric proteins using a combinatorial approach, generating meganucleases with fully redesigned specificity. b The RAG1 series of targets. Two intermediary palindromic targets were derived from the non-palindromic human gene target. These were used to select for the homodimeric I-CreI variants that served as scaffolds for the specific heterodimer. The two 3-bp segments used in the library screening are boxed

Fig. 6

Genes targeted by modified endonucleases. a Human XPC (left panel) and RAG1 (right panel) genes showing the sequences recognized and cleaved by the I-CreI amel [38] and I-CreI v2-v3 [27] variants, which differ at 15 bp and 16 bp, respectively, from the original I-CreI 22-bp target (black box) [31]. The gene information was retrieved from http://www.ncbi.nlm.nih.gov/. b Tobacco acetolactate synthase gene showing the ZFN DSB site and the maximum endogenous repaired distance obtained by HR at a frequency of 2% [145]. c Human RAG1 gene showing the I-CreI v2-v3 DSB site and the donor DNA construct used to verify the 6% frequency of exogenous DNA integration, as a marker, in the genomic locus [27]. Red stars show clusters of mutations in the active core of the RAG1 protein that cause T-B-severe combined immune deficiency or Omenn syndrome [159]

HEs have emerged as very promising tools for generating DNA DSBs for gene targeting, due to their low cleavage frequency in eukaryotic genomes. Although several hundred HEs have been identified, the repertoire of cleavable sequences available to target any gene of interest is still very limited. Known structures can thus be used as a basis to modify the specificity of known HEs and create new custom variants (Table 2). I-CreI is a well-characterised homodimeric member of the LAGLIDADG family of HEs (LHE). Mutations in the DNA-interacting regions of this enzyme allow the cleavage of novel targets or lead to loss of cleavage activity at the I-CreI natural site [123, 124]. The DNA-binding residues of I-SceI have also been modified to obtain further variants with altered binding specificity [125].

The crystal structure of I-CreI with its natural DNA reveals two groups of residues that make specific contacts with two 3-bp segments of one half of the pseudo-palindromic target sequence [31]. Residues in each group were randomly mutated to generate different libraries of I-CreI variants (Fig. 5). These libraries were tested for cleavage against all 64 possible targets derived by changing each of the three bases in the corresponding segment to each of the four nucleotides. A high-throughput screening (HTS) method, based on the induction of HR by a DSB in yeast, was used to test a large number of mutant-target combinations [36, 126]. The I-CreI library variants were cloned into a replicative yeast expression vector and transformed into a selective S. cerevisiae strain. The 64 combinatorial target DNAs were cloned into the LacZ-based yeast reporter vector, disrupting the LacZ gene, and transformed into a complementary yeast strain. After mating of the two yeast strains, positive events were detected by selection of blue colonies, which result from the reconstitution of the complete LacZ gene after cleavage by the I-CreI variant and recombination. Millions of combinations were screened, leading to the identification of several hundred mutants with independently altered local specificity. These mutants were then used in a combinatorial screen, which was expanded to include homodimeric and heterodimeric variants of I-CreI (Fig. 5a). In vitro studies confirmed that many of the active variants maintained the essential properties of the wild-type I-CreI molecules, including structure, stability, cleavage efficiency and narrow specificity. Computational analysis, based on energy calculations, helped to improve the active variants by highlighting key residues [36, 127, 128].
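The figure of 64 targets per segment follows from the four possible nucleotides at each of three positions (4 × 4 × 4 = 64). A minimal Python sketch of this enumeration is given here for illustration only; the function name, the segment offset and the 22-bp placeholder sequence are assumptions and do not correspond to the real I-CreI site or to the screening software used in the cited studies.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def triplet_targets(target, start):
    """Return the 64 targets obtained by replacing the 3-bp segment
    beginning at index `start` with every possible nucleotide triplet."""
    if start < 0 or start + 3 > len(target):
        raise ValueError("the 3-bp segment must lie within the target")
    return [
        target[:start] + "".join(triplet) + target[start + 3:]
        for triplet in product(NUCLEOTIDES, repeat=3)
    ]

# Hypothetical 22-bp placeholder, NOT the natural I-CreI target.
example_target = "ACGTACGTACGTACGTACGTAC"
variants = triplet_targets(example_target, 3)
print(len(variants))  # 4**3 targets, one of which is the unchanged sequence
```

Each library variant would then be scored against every entry of such a list, which is what makes a high-throughput readout necessary.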

The general redesign approach, based on the methods described above, can be summarised in two main steps: firstly, semi-rational mutagenesis followed by high-throughput screening generates I-CreI variants with changes in specific regions to alter their ability to recognize and cleave target DNA; secondly, a combinatorial strategy is used to join the altered domains to generate new HEs with changes in the DNA-binding domain, allowing recognition of different DNA targets. This approach also offers the considerable advantage over computational design of being able to select HEs to target a sequence that differs greatly from the natural target sequence (Figs. 5b, 6).
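The derivation of two intermediary palindromic targets from a non-palindromic target (Fig. 5b) can be illustrated with a short Python sketch. It assumes the simplest possible construction, in which each half of the target is joined to its own reverse complement; any special handling of the central bases is ignored, and the example sequence is a hypothetical placeholder rather than the RAG1 target.

```python
# Translation table for complementing a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def palindromic_derivatives(target):
    """Split an even-length target into halves and return the two
    palindromic targets built from the left and the right half."""
    if len(target) % 2:
        raise ValueError("target length must be even")
    half = len(target) // 2
    left, right = target[:half], target[half:]
    left_palindrome = left + reverse_complement(left)
    right_palindrome = reverse_complement(right) + right
    return left_palindrome, right_palindrome

# Hypothetical 22-bp non-palindromic placeholder sequence.
target = "AAACCCGGGTTACGTACGATCG"
left_pal, right_pal = palindromic_derivatives(target)
print(left_pal)
print(right_pal)
```

Under this assumption, a homodimeric variant selected on each palindromic target contributes one monomer to the heterodimer that recognizes the original non-palindromic sequence.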

This general approach was used for the successful design of new I-CreI variants that cleave the human XPC (Xeroderma pigmentosum group C) gene [35]. Xeroderma pigmentosum (XP) is a monogenic disease associated with an extreme sensitivity to sunlight (Table 3). XP patients are at an increased risk of skin cancer and, in some cases, develop neurological defects [129, 130]. There is no efficient treatment available to XP patients, apart from repeated surgery to remove affected skin areas. The majority of patients die due to metastasis before reaching adulthood. Attempts have been made to replace affected areas of skin with skin from another site on the same patient that has not previously been exposed to the sun [131, 132]. However, the skin grafts are also sun-sensitive and the benefit to the patients is limited to a few years. XP diseases (Table 3) are good candidates for cell therapy with HEs, as they are monogenic and amenable to ex vivo treatment (Fig. 4). Cells from the skin lineage can be easily manipulated ex vivo, and then used to reconstruct functional skin [133, 134]. I-CreI has been successfully engineered to recognize two sequences from the XPC gene [35]. Two derivatives of the modified I-CreI, recognizing a specific sequence in the XPC1 gene, have been found to induce high levels of specific gene targeting in mammalian cells with no obvious signs of genotoxicity [38] (Table 2).

Table 3 Summary of monogenic diseases that may be amenable to treatment based on homing endonuclease-induced double-strand break homologous recombination

A similar study was carried out with the human RAG1 gene [37], mutation of which is responsible for a severe combined immunodeficiency disorder (Table 3; Fig. 5). The sequence of the RAG1 gene was scanned for any 22-bp sequence that might be cleaved by an I-CreI variant obtained through the combinatorial approach. Eighteen potential sequences were identified, the most promising of which was chosen to screen combinatorial libraries of variant I-CreI. A set of active heterodimeric I-CreI variants that cleave this sequence and display high levels of activity in yeast and CHO cells was identified. This was the first LHE to be entirely redesigned to cleave a naturally occurring sequence [37] (Figs. 5 and 6). Furthermore, heterodimeric and single chain I-CreI variants that cleave the RAG1 gene are as active as I-SceI in mammalian cells, with no detectable cell toxicity [27]. These findings show that modified HEs from the LAGLIDADG family may be used as effective and safe tools in gene therapy. The methods described can thus be used as a proof of principle for future therapeutic applications.
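The scan of a gene for candidate 22-bp sequences can be pictured as a sliding-window search. The Python sketch below is purely illustrative: the acceptance rule (a window is kept when the triplets at given offsets belong to sets of triplets assumed to be cleavable by library variants), the offsets, the triplet sets and the sequences are all hypothetical and are not taken from the cited study, which may also have considered both strands and further criteria.

```python
def scan_for_candidates(gene, allowed_by_offset, width=22):
    """Slide a window of `width` bp along `gene` and keep the windows
    whose 3-bp segments at each given offset are in the allowed set.

    `allowed_by_offset` maps an offset within the window to the set of
    triplets assumed (hypothetically) to be cleavable at that position.
    Returns a list of (position, window) tuples; only one strand is scanned.
    """
    hits = []
    for pos in range(len(gene) - width + 1):
        window = gene[pos:pos + width]
        if all(window[off:off + 3] in allowed
               for off, allowed in allowed_by_offset.items()):
            hits.append((pos, window))
    return hits

# Hypothetical data: one acceptable site embedded in a placeholder sequence.
site = "G" + "AAA" + "C" * 14 + "TTT" + "G"   # 22 bp
gene = "CCCCC" + site + "CCCC"
allowed = {1: {"AAA"}, 18: {"TTT"}}            # offsets and triplets are assumptions
hits = scan_for_candidates(gene, allowed)
print(hits)
```

In practice the candidate list produced by such a scan would still need to be ranked, for example by proximity to the mutations of interest, before library screening.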

The efficacy of gene correction is an important and complex consideration in gene therapy. It is affected by several factors, such as the transfection rate for the HEs and repair sequences, nuclear localisation, expression levels, the HR machinery, chromatin state and the cell cycle. Several studies have found the frequency of successful induced gene targeting (GT) events to range from 10⁻⁵ to 10⁻¹ when using a chromosomal reporter system and either I-SceI or ZFN target sites in embryonic stem or HEK-293 cells, respectively [118, 135]. Given the number of parameters involved, and the fact that they vary from one study to another (including cell type, locus, reporter system, correction or insertion, repair matrix, protocols and manipulation), it is difficult to explain these differences. Redesigned HEs have also been shown to induce GT with an efficiency similar to that obtained with I-SceI [38]. All these studies were carried out with artificial DNA constructs, and most of them with immortalised cell lines. Recombination efficiency may vary for endogenous genes in primary cells. Studies showing that HR machinery levels are increased in the late S and G2 phases of the cell cycle suggest that GT may be more efficient in rapidly dividing cells [136, 137]. Cell types differ in their ability to undergo HR, in part because most cells in differentiated tissues are no longer dividing. Another aspect to take into account is chromatin accessibility, which may affect the efficiency of DSB formation in particular DNA regions. One report shows a surprisingly high efficiency of DSB-induced recombination for the endogenous human IL2RG gene in primary CD4+ T cells [121].

There is little point in enhancing the DSB activity of HEs if the lesion is repaired by NHEJ, which occurs frequently in mammalian cells. Several laboratories have tried to enhance HR or inactivate NHEJ to improve gene targeting in mammalian cells [138–140]. However, no clear solutions have emerged. One study has reported a 60-fold enhancement of GT in plants by over-expression of Rad54 [141]. Another study has shown that Rad18 enhances HR-dependent DSB repair in chicken and human cells [142]. A recent study has shown that the use of an enzyme that generates nicks rather than DSBs may promote HR, while minimising the genomic instability associated with DSB-induced NHEJ [107]. A deeper understanding of recombination mechanisms will be needed for further advances in this field, with a view to achieving the ultimate goal: a gene targeting event in every transfected cell.

It is important to consider that the frequency of correction decreases as a function of increasing distance from the cleavage site, with a four-fold drop in frequency after the first 100 bp, and no detectable recombination events occurring beyond 500 bp from the DSB [143]. However, in one study, 13% of conversion events were observed up to about 4 kbp away from the initial DSB site [144]. Similarly, a recent study using ZFNs to cleave endogenous plant genes reported over 2% of transformed cells displaying mutations up to 1.3 kbp from the ZFN cleavage site [145] (Fig. 6b). Again, the efficacy depends on various factors such as the locus, the sequence and the design of the repaired sequence. Overall, these findings show that, for optimal gene correction efficiency, the DSB must be generated as close as possible to the target sequence. The application of modified HEs in human gene therapy would thus require the sequencing of the damaged gene to localise the mutated nucleotides. This strategy would allow the selection of the closest target sequence that best matches the specificity of modified HEs from an existing library. These HEs could then serve as a template for further modification to improve binding to and cleavage of the target site of interest (Fig. 6c).
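The selection of the cleavable sequence closest to a mutation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: positions are plain integer coordinates on the gene, the optional distance cut-off is a hypothetical parameter (the studies cited above disagree on how far from the DSB correction remains detectable), and ties are resolved in favour of the first site listed.

```python
def closest_site(mutation_pos, site_positions, max_distance=None):
    """Return the candidate cleavage position nearest to `mutation_pos`,
    or None if there is no candidate within `max_distance` bp."""
    if not site_positions:
        return None
    best = min(site_positions, key=lambda pos: abs(pos - mutation_pos))
    if max_distance is not None and abs(best - mutation_pos) > max_distance:
        return None
    return best

# Hypothetical coordinates (bp) of a mutation and of candidate target sites.
candidates = [200, 950, 1600]
print(closest_site(1000, candidates))                   # nearest candidate
print(closest_site(1000, candidates, max_distance=10))  # none within 10 bp
```

A real selection procedure would weigh this distance against how well each site matches the specificity of the available HE variants.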

Current limitations are associated with low efficacy, possible cytotoxic effects and vectorisation. The ex vivo treatment of patient cells bearing the damaged gene with engineered HEs would allow corrected clones carrying the repaired gene to be selected for therapeutic use. Only a low gene correction rate has been obtained from in vivo viral-based treatments due to low infection rates and low expression levels. Ex vivo treatment using recombinant proteins, as previously shown for stem cell generation [146], would eliminate the need for plasmids and viruses to correct target genes (Fig. 4).

Monogenic diseases and ex vivo treatment

Monogenic diseases result from a change in a single gene, occurring in all cells of the body. Though relatively rare, they affect millions of people worldwide. It is currently estimated that over 10,000 human diseases are monogenic. The nature of the disease depends on the function of the gene that has been modified. Single-gene or monogenic diseases can be divided into several categories: autosomal dominant disorders, in which only one allele of the gene is mutated in affected patients; autosomal recessive disorders, in which both alleles of the gene are mutated; X-linked disorders, which are caused by mutations in genes on the X chromosome; Y-linked disorders, caused by mutations in genes on the Y chromosome; and mitochondrial diseases, caused by modification of genes in mitochondrial DNA. For some of these diseases, gene therapy using modified HEs will require the correct gene to be supplied as a template for HR-mediated repair of the mutated gene. Ex vivo treatment with modified HEs is limited to monogenic hematopoietic diseases, monogenic skin diseases, some monogenic metabolic disorders and potentially some kinds of cancer (Table 3; Fig. 4).

Only a few cell types are able to support ex vivo manipulation and re-implantation. The most promising of these, due to their plasticity, are hematopoietic stem cells, which can differentiate to form all the blood cell types. They are found in the adult bone marrow [147] and blood [148]. Hematopoietic stem-cell transplantation is used primarily for hematological and lymphoid cancers, but also for many other disorders [149]. Mesenchymal stem cells (MSC) can be expanded ex vivo, are easily extracted, and differentiate into multiple cell lineages. These cells thus provide a new clinical tool for cell and gene therapy. Several groups have isolated MSCs from bone marrow, umbilical cord blood and adipose tissue. MSCs seem to have different properties, depending on their origin [150].

Mutations in the TP53 gene

As well as their use in the treatment of monogenic disorders, the DSB activity of HEs could be used for the correction of somatic mutations. Mutation of the TP53 gene is the most frequent genetic change in human cancer, with one in two tumours bearing a p53 mutation. Most mutations are clustered in the DNA binding domain [151]. The importance of the role of p53 in tumour suppression is reflected by the fact that this protein is defective in virtually all human cancers. In about 50% of cases, it is inactivated by mutations; in other cases, p53 activity is suppressed, leading to disruption of its associated pathways. Studies on transgenic mice, in which p53 expression is reversibly switched on and off, show that restoration of p53 function can lead to tumour regression in vivo; thus restoration of p53 activity may be a promising therapeutic strategy [152, 153]. The activity of p53 could potentially be restored by the repair of the mutations that disrupt its normal function, through specifically induced DSB. However, the state of the art currently only allows for the treatment of certain types of cancer that are amenable to ex vivo treatment.

Concluding remarks

This review discusses the growing interest in the use of highly specific engineered HEs as a possible therapeutic tool to correct defects in different types of diseases. The redesign of these enzymes requires in-depth understanding of their physicochemical properties to elucidate the mechanisms underlying their protein-DNA recognition and their mode of action. Depending on the molecular mechanisms mediating DNA repair, different outcomes of the action of these meganucleases may be foreseen. Whether these tools can provide the basis for a general approach to gene repair in monogenic human diseases remains unclear. Aspects such as repair efficiency and the selection of clones carrying the repaired gene must be addressed to determine whether this technology can be used in patients. However, it is beyond doubt that the use of this technology has considerable advantages over alternative methods. If successful, the use of these tools would circumvent the use of viral vectors for gene therapy, diminishing the risk of deleterious effects caused by DNA-based approaches [154–157]. The use of this technology in a multigenic disease such as cancer remains a matter of debate. However, key genes, such as TP53, have been found to play major roles in tumour development. The further study of potential therapeutic approaches involving the repair of such genes is thus needed and should be of particular interest.