Introduction

Robust, integrated, comprehensive maps were crucial to assembling the sequence of the human (Venter et al. 2001; Lander et al. 2001; Bentley et al. 2001), mouse (Gregory et al. 2002), and rat genomes (Kwitek et al. 2001). In addition, radiation hybrid (RH) maps form the basis of comparative genomics by identifying regions of orthology between genomes. The power of comparative mapping increases with marker density in the species of interest (Marra et al. 1998) and integration with its genetic map (Breen et al. 2001) because genes can be assigned to a genetic interval (Milan et al. 2002). Successful integration of information between well- and lesser-characterized genomes requires that major breaks in synteny and internal rearrangements be identified, because chromosomal breaks result in fragment inversions, inter- and intrachromosomal translocations, or even loss of genome fragments (Rink et al. 2002a). We reported a first-generation expressed sequence tag (EST) map (Rink et al. 2001a, 2002a) of the porcine genome that integrated 1058 ESTs into our earlier microsatellite (MS)-based, whole-genome radiation hybrid (WG-RH) map for swine (Hawken et al. 1999). This map refined corresponding regions between the human and porcine genomes, identified 60 potential breakpoints in synteny, and improved resolution over the entire genome (Rink et al. 2002a). It also, for the first time, confirmed that synteny of “gene-rich” and “gene-desert” regions of the human genome was conserved in a closely related genome other than the mouse (Hudson et al. 2001). The IMpRH7000-rad panel and an abundant supply of porcine ESTs primarily from immune tissues (Rink et al. 2002b) provide a continuing opportunity to identify genomic intervals that contain gene sequences in a species of interest, improve the comparative map, and hasten the assembly of the porcine genome sequence.
We have hastened that assembly by constructing a second-generation EST map of the porcine genome and used it to improve the resolution around putative synteny breaks.

Materials and methods

Marker development and amplification

All ESTs were cloned from a panel of normalized cDNA libraries (Rink et al. 2002b). Sequences were annotated using gapped Advanced BLAST (Altschul et al. 1997) against dbEST, the nonredundant part of NCBI and international protein databases. Primers for the original 1058 ESTs and the 977 new ESTs analyzed were designed with PRIMER3 (http://www.frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi) to amplify an optimal product size of 140 bp (range = 100–200 bp), with an optimal primer length of 20 bp and a GC content of 45%–60%. All primers were optimized by determining the highest annealing temperature at which successful amplification of porcine genomic DNA took place. Primers were then tested with porcine and Chinese hamster genomic DNA at that temperature for species specificity. During the course of this study, we avoided designing redundant primer pairs for a given gene or EST.

Annotation of ESTs

Sequences of the 2035 porcine ESTs were subjected to a BLASTN search, run at the University of Minnesota Center for Computational Genomics and Bioinformatics (CCGB) (http://www.ccgb.umn.edu/), against build 35.1 (May 2004) of the human genome sequence (HGS) and all other sequences deposited in GenBank. All sequences of the porcine ESTs were also BLAT-searched against the HGS (build 35.1) (http://www.genome.ucsc.edu/cgi-bin/hgBlat?command=start), which allowed us to quickly identify sequences of 95% and greater similarity of length 40 bases or more. The most significant match to a human sequence in the results was documented with the human chromosome number, gene (ortholog) symbol, score, e value, and start position (bp) in the human sequence. An e value ≤ 1 × 10⁻⁵ was set as a threshold based on the large number of sequences in the data set and the need to return nearly exact matches of greater than 50 bp. An annotation threshold was set at a score of greater than 125 and e ≤ 10⁻²⁵ to ensure the integrity of the final comparative maps. Data can be accessed online at http://www.cabnr.unr.edu/beattie/docs/complete_info.pdf
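The two cutoffs described above can be read as a two-tier filter on the top hit for each EST. The following Python sketch illustrates that reading only; the `Hit` record, its field names, and the tier labels are hypothetical and are not taken from the pipeline used in this study.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
SEARCH_E_MAX = 1e-5      # e value cutoff for accepting a match at all
ANNOT_SCORE_MIN = 125.0  # annotation requires a score greater than 125 ...
ANNOT_E_MAX = 1e-25      # ... and an e value of at most 1e-25


@dataclass
class Hit:
    """Top human hit for one EST (hypothetical record layout)."""
    est_id: str
    human_chr: str
    score: float
    e_value: float
    start_bp: int


def classify_hit(hit: Hit) -> str:
    """Return 'annotated', 'match_only', or 'rejected' for a top hit."""
    if hit.e_value > SEARCH_E_MAX:
        return "rejected"
    if hit.score > ANNOT_SCORE_MIN and hit.e_value <= ANNOT_E_MAX:
        return "annotated"
    # Passes the search cutoff but not the stricter annotation cutoff.
    return "match_only"
```

Under this reading, a hit with a score of 300 and e = 10⁻⁴⁰ would be used for the comparative maps, whereas a hit with a score of 100 and e = 10⁻¹⁰ would be recorded as a match only.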

Genotyping

The IMpRH7000-rad panel (Yerle et al. 1998; Hawken et al. 1999; Rink et al. 2002a) was used in this study. All PCR typing reactions were performed in 96-well Techne® Touchgene thermocyclers (Techne Inc., Princeton, NJ). Each PCR reaction contained 25 ng of hybrid DNA, 0.4 μM of each primer, 1.5 mM MgCl2, 66 μM of each dNTP, 1 × cresol loading dye (Hodges et al. 1997), and 0.3 U Immolase™ DNA Polymerase and 1 × PCR buffer (Bioline USA Inc., Randolph, MA) in a total volume of 15 μl. The amplification cycle included an initial 95°C, 7-min denaturation step, followed by cycling between 94°C for 15 sec, 57–69°C for 30 sec, 72°C for 30 sec, and a final extension step at 72°C for 5 min. Controls consisted of porcine and hamster genomic DNA and a reaction containing no DNA. The first PCR was routinely run for 40 cycles; the second PCR was run for between 33 and 40 cycles depending on the level of background amplification and intensity of pig-specific bands. If more than six discrepancies were observed between the first and second PCR, a third PCR was performed. The PCR products were electrophoresed on 2% agarose gels, visualized, and photographed with an AlphaImager™ 2200 (Alpha Innotech Corp, San Leandro, CA). At least two gel images per marker were independently scored by two individuals using GelScore (http://www.wesbarris.com/GelScore/). Consensus vectors were established by two individuals using GelScore data as a baseline. Markers were scored as present (1), absent (0), or ambiguous (2), and those with unusually high or low retention frequencies or with more than six discrepancies were eliminated.
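The replicate comparison and scoring scheme above can be illustrated with a short Python sketch. It assumes that each typing is stored as a string of 0/1/2 characters, one per hybrid, that only positions scored 0 or 1 in both replicates count as discrepancies, and that retention frequency is the fraction of present scores among unambiguous ones. None of these conventions is stated in the text, and the function names are hypothetical; in the study itself, consensus vectors were established manually by two individuals.

```python
MAX_DISCREPANCIES = 6  # from the text: more than six triggers a third PCR


def count_discrepancies(first: str, second: str) -> int:
    """Count hybrids typed 0/1 in both replicates but with different results."""
    if len(first) != len(second):
        raise ValueError("replicate vectors must have the same length")
    return sum(
        1 for a, b in zip(first, second)
        if a in "01" and b in "01" and a != b
    )


def needs_third_pcr(first: str, second: str) -> bool:
    """True when the two replicates disagree at more than six hybrids."""
    return count_discrepancies(first, second) > MAX_DISCREPANCIES


def consensus_vector(first: str, second: str) -> str:
    """Keep agreeing scores; mark every disagreement as ambiguous (2)."""
    return "".join(a if a == b else "2" for a, b in zip(first, second))


def retention_frequency(vector: str) -> float:
    """Fraction of hybrids scored present among those scored 0 or 1."""
    typed = [c for c in vector if c in "01"]
    if not typed:
        raise ValueError("vector has no unambiguous scores")
    return typed.count("1") / len(typed)
```

For example, the replicates `0101` and `0110` differ at two hybrids, which gives the consensus `0122` and a retention frequency of 0.5 over the two unambiguous positions.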

Construction of the IMpRH7000-rad map

All EST vectors were initially subjected to a two-point analysis against framework microsatellites using the online IMpRH mapping tool (http://www.imprh.toulouse.inra.fr/Action=Menu?Do=Map+one+marker+on+IMpRH&USER=&PASS=) that assigned each EST to 1 of 19 individual input files (one for each autosome, plus X) based on the closest-linked MS framework marker. Input files contained all the EST and MS vectors for a given chromosome. The CarthaGene software (Schiex and Gaspin 1997; deGivery et al. 2005) was then used to build and analyze the RH map in the following manner: Markers on each chromosome were grouped into distinct linkage groups using a two-point threshold set at LOD 4 and a distance threshold of less than 100 cR between any two markers. Any marker not assigned to a linkage group was labeled as a singleton. Linkage groups were analyzed individually using simulated annealing to improve CarthaGene’s default marker order. Any improvements in order were put into resident memory as the new best map. After the simulated annealing was completed, the best map for that linkage group was subjected to a second improving method where marker orders were flipped internally. A window size of at least four markers was applied, with the largest linkage groups using a window size of either six or eight markers. The MS locations on the final best map were compared to the genetic map (http://www.marc.usda.gov/genome/swine/swine.html) to verify MS order. If a discrepancy occurred between MS order within the linkage groups and MS order on the genetic map, chromosomes were recalculated at LOD 6 and 8. Any EST or MS vectors potentially causing the rearrangements were removed from the associated linkage group and the order of the entire linkage group recalculated. If the removal of a marker corrected a discrepancy in the linkage group, it was reclassified as a singleton and not analyzed further.
One hundred ninety-one ESTs were removed from a linkage group because they were reclassified as singletons. All putative breaks in synteny fell within linkage groups in the RH maps. At these LOD levels we are confident that any marker identified as a break in synteny is in the correct relative position within the swine genome. Final maps were drawn using MapCreator (http://www.wesbarris.com/mapcreator/).
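The grouping step can be pictured as single-linkage clustering over marker pairs that pass both thresholds. The Python sketch below illustrates that idea and is not CarthaGene's implementation: it assumes the pairwise two-point LOD scores and distances have already been computed elsewhere, and the function name, the input layout, and the use of LOD ≥ threshold are assumptions made for illustration.

```python
def linkage_groups(markers, pairwise, lod_min=4.0, dist_max=100.0):
    """Group markers by single linkage.

    markers  : list of marker names
    pairwise : dict mapping (marker_a, marker_b) -> (two_point_lod, distance_cR)
    Returns (groups, singletons); groups have two or more markers.
    """
    parent = {m: m for m in markers}

    def find(m):
        # Union-find root lookup with path halving.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), (lod, dist) in pairwise.items():
        # Link the pair only when both thresholds are satisfied.
        if lod >= lod_min and dist < dist_max:
            parent[find(a)] = find(b)

    clusters = {}
    for m in markers:
        clusters.setdefault(find(m), []).append(m)

    groups = sorted(sorted(c) for c in clusters.values() if len(c) > 1)
    singletons = sorted(c[0] for c in clusters.values() if len(c) == 1)
    return groups, singletons
```

Raising `lod_min` to 6 or 8, as was done for chromosomes with order discrepancies, would split weakly supported groups and increase the number of singletons.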

Comparative mapping

Chromosomal homologies between the swine and human genomes were determined essentially following an idea described by Ehrlich et al. (1997) and Bourque et al. (2004). First, chromosomal homologies between the two species were determined based on previous comparative gene mapping and bidirectional chromosome painting results (http://www.toulouse.inra.fr/lgc/pig/compare/). Bidirectional painting indicates that individual swine chromosomes exhibit synteny with between 1 and 5 human chromosomes, with a total of 39 identifiable chromosomal breaks (Goureau et al. 1996). Second, synteny breaks refer to adjacent porcine EST markers that do not match their predicted human correspondence. In the present study, any marker whose top search result did not match the predicted human chromosome region was initially labeled as a putative synteny break and subjected to further analysis. Third, each marker identifying a putative synteny break was subjected to a manual BLAT search against HGS (build 35.1) on the UCSC website (http://www.genome.ucsc.edu/) with the same parameters used for CCGB searching, and the search results were checked for any annotation anywhere in the list that matched the predicted human segment. If any annotation to the predicted human segment was found, the putative synteny break marker was then checked with its adjacent markers in the pig RH linkage group to see if that human annotation was in the corresponding location (bp). If this marker and the adjacent markers were in synteny in the human genome and in the correct location, this marker was no longer considered a synteny break. Finally, the number of synteny blocks between the pig and the human genome was counted based on the method of Bourque et al. (2004), and the data were integrated with the above analysis and our RH mapping results.
If all porcine markers in a given RH linkage group were in agreement with established human synteny, it was called one synteny block (conserved segment between the two species).
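The first two screening steps described above can be sketched in Python as follows. This is an illustration under simplifying assumptions: matching is done at the level of whole human chromosomes rather than chromosome regions, the positional check against adjacent markers is left out, and all names and data structures are hypothetical.

```python
def putative_breaks(markers, expected_hsa):
    """Flag markers whose top human hit is not on an expected chromosome.

    markers      : list of (name, [human chromosomes of hits, best hit first])
    expected_hsa : set of human chromosomes predicted for this swine chromosome
    """
    flagged = []
    for name, hit_chromosomes in markers:
        if not hit_chromosomes:
            continue  # markers without a human hit cannot be assessed
        if hit_chromosomes[0] not in expected_hsa:
            flagged.append(name)
    return flagged


def recheck_candidates(markers, expected_hsa, flagged):
    """Return flagged markers with any lower-ranked hit on an expected chromosome.

    These are only candidates for removal from the break list; the text also
    requires agreement in position (bp) with the adjacent markers.
    """
    hits_by_name = dict(markers)
    return [
        name for name in flagged
        if any(chrom in expected_hsa for chrom in hits_by_name[name][1:])
    ]
```

In this sketch, a marker whose best hit is on an unexpected chromosome but whose second hit falls on the predicted chromosome would be passed on to the manual position check rather than counted as a break straight away.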

Conservation of synteny and genome coverage

The resolution an RH map is able to provide is dependent on the resolution of the RH panel and on the number and distribution of markers in the data set. The swine genome is estimated to span approximately 2.7 Gb (Schmitz et al. 1992), with varying marker densities across the individual swine chromosomes due to various known “gene-rich” regions and “gene-deserts” (Rink et al. 2002b). Published predicted averages for RH resolution of the IMpRH7000-rad panel range from 37 kb/cR7000 (Yerle et al. 2002) to 75 kb/cR7000 (Hawken et al. 1999). A comparative analysis between the IMpRH7000-rad and 12,000-rad IMNpRH2 panels found an average resolution of 37 kb/cR7000 (Yerle et al. 2002). We calculated the coverage of the swine genome as:

$$ C = \frac{R \times L}{S} = \frac{37 \times 57191.8}{2.7 \times 10^{6}} \approx 78\% $$

where C = coverage, R = RH map resolution (kb/cR), L = length of RH maps (cR), S = size of swine genome (kb).
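As a quick numerical check of the formula, the following Python sketch reproduces the calculation with the values given above. The function name is hypothetical; the only conversion assumed is 2.7 Gb = 2.7 × 10⁶ kb.

```python
def rh_coverage(resolution_kb_per_cr, map_length_cr, genome_size_kb):
    """Coverage C = R * L / S, returned as a fraction of the genome."""
    if genome_size_kb <= 0:
        raise ValueError("genome size must be positive")
    return resolution_kb_per_cr * map_length_cr / genome_size_kb


# R = 37 kb/cR, L = 57191.8 cR, S = 2.7 Gb expressed in kb.
swine_coverage = rh_coverage(37.0, 57191.8, 2.7e6)
print(f"{swine_coverage:.1%}")  # prints 78.4%
```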

Chromosomal locations and start positions of porcine orthologs of genes in the human genome were also established for all porcine genes analyzed using the NCBI human Map Viewer (build 35.1) (http://www.ncbi.nlm.nih.gov/mapview/). Map distances (cR) were the accumulated sum of all linkage groups. Coverage of the human genome, currently thought to span approximately 3.011 Gb (NCBI), was estimated using the start position of ESTs that returned a BLAST search result with e ≤ 10⁻⁵. All markers with a human genome position annotation were ordered by their human chromosome number and then by their position. The distance between adjacent markers was calculated by subtracting the position of each marker from that of the next marker on the same chromosome. The resulting value is the size of the gap between the orthologs of those two markers. If the gap value was greater than or equal to a user-defined cutoff (the gap threshold), the actual gap distance was recorded. All gap values at or above the threshold for an individual chromosome were then subtracted from the length of that human chromosome. The remaining value was the approximate comparative coverage of the chromosome that the annotated data represented, discounting gaps smaller than the threshold. As more annotated markers are added to the data set, the actual coverage and comparative coverage continuously improve. Two gap thresholds were used for our computations: 10 Mb (<0.332% of the human genome) and 5 Mb (<0.166%). To avoid having to manually recompute the coverage whenever new data became available, a Perl script was developed and will be available for download at http://www.ag.unr.edu/beattie/research.htm. The script requires a tab-delimited input file containing three fields: marker name (or any unique identifier), human chromosome number (1–22 or X), and human genome position (bp). The gap threshold is chosen by the user at runtime. The output file is a tab-delimited text file.
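The gap-threshold procedure described above can be sketched in Python as follows. This is not the Perl script referred to in the text; it is an illustrative reimplementation of the stated steps, and the function names and the treatment of chromosomes without any annotated marker (reported here as zero coverage) are assumptions. As in the description, only gaps between consecutive markers are considered, so the distance from a chromosome end to the nearest marker is not counted.

```python
import csv
from collections import defaultdict


def read_markers(path):
    """Read a tab-delimited file: marker name, human chromosome, position (bp)."""
    with open(path, newline="") as handle:
        return [
            (name, chrom, int(pos))
            for name, chrom, pos in csv.reader(handle, delimiter="\t")
        ]


def comparative_coverage(markers, chrom_lengths, gap_threshold):
    """Return {chromosome: covered fraction} for a given gap threshold (bp).

    markers       : iterable of (name, chromosome, position_bp)
    chrom_lengths : dict mapping chromosome -> length in bp
    """
    positions = defaultdict(list)
    for _name, chrom, pos in markers:
        positions[chrom].append(pos)

    coverage = {}
    for chrom, length in chrom_lengths.items():
        ordered = sorted(positions.get(chrom, []))
        if not ordered:
            coverage[chrom] = 0.0  # assumption: no markers means no coverage
            continue
        # Sum every gap between consecutive markers that reaches the threshold.
        large_gaps = sum(
            nxt - cur
            for cur, nxt in zip(ordered, ordered[1:])
            if nxt - cur >= gap_threshold
        )
        coverage[chrom] = (length - large_gaps) / length
    return coverage
```

With markers at 10, 20, 60, and 65 Mb on a hypothetical 100-Mb chromosome, a 30-Mb threshold removes only the 40-Mb gap (60% coverage), whereas a 5-Mb threshold removes all three gaps (45% coverage), which shows why the 5-Mb figures reported for each chromosome are lower than the 10-Mb figures.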

Results and Discussion

Genome coverage

Table 1 illustrates the number and characteristics of markers mapped on the IMpRH7000-rad panel. Genome-wide marker distribution was not stochastic in the first-generation RH map (Rink et al. 2001a, b, c) or in this second, more detailed version. The DNA/marker ratio ranged from 0.53 for SSC 11 to 2.02 for SSC 12. Average marker retention frequency (RF) was 30.3%. We mapped a total of 977 new ESTs onto the first-generation map of 1058 ESTs and 743 MS (Rink et al. 2002a). The genome-wide RF for the new markers was 30.2%, with a minimum of 9.3% (EST-AR078F01 on SSC 13 and EST-UNR6191H10 on SSC 14) and a maximum of 87.3% (EST-AR095G02 on SSC X). High RF in the region directly surrounding the centromere, called the “centromeric effect,” has been observed in several species including human (Gyapay et al. 1996; Stewart et al. 1997), cattle (Kurar et al. 2003), and chicken (Pitel et al. 2004; Rabie et al. 2004). It was also observed in pig at the centromere of SSC 12 when the 12,000-rad IMNpRH2 panel was used (Liu et al. 2005). To analyze whether this “centromeric effect” is present on other porcine chromosomes, the RFs of all 2035 ESTs were plotted in histograms for each respective chromosome. Because we did not have a single complete linkage group for each chromosome, the ESTs were plotted in the order in which they were placed on the RH maps. By examining which MS (on both the genetic and the RH maps) was nearest the centromere of any given chromosome, the ESTs flanking that marker were easily identified and visualized on the histogram. After all chromosomes were analyzed, we found that all submetacentric (SSC 1–7) and metacentric (SSC 8–12, X) chromosomes show higher RF around the centromere (Table 1).

Table 1. Number and characteristics of markers on the IMpRH7000-rad panel by chromosome

Markers were assigned at LOD 4 on SSC 11, 16, and 17; LOD 6 on SSC 1–10, 13, 15, 18, and X; and LOD 8 on SSC 12 and 14. The 191 ESTs not assigned to a linkage group (singletons) were not included in the graphical representation of the final maps (http://www.ag.unr.edu/beattie/research/second_generation.htm) to conserve space. Two-point LOD scores to the closest framework MS for these singletons, as well as complete information on all ESTs, including annotation, can be found online at http://www.cabnr.unr.edu/beattie/docs/complete_info.pdf. Map resolution ranged from 32.4 kb/cR7000 on SSC 12 to 71.4 kb/cR7000 on SSC 10 (Table 2). The average resolution across the entire genome was calculated at 47.5 kb/cR7000, which falls between the earlier estimates of 37 kb/cR7000 (Yerle et al. 2002) and 75 kb/cR7000 (Hawken et al. 1999).

Table 2. Chromosomal coverage by the current IMpRH7000-rad map

A total of 134 linkage groups covered 57,192 cR, or 78% of the predicted size of the porcine genome (Table 2), and 71% of the human genome (Table 3). At least 1649 (81%) ESTs had an annotation in GenBank, with 1422 ESTs (70%) also matching the human genome data set housed at the University of California, Santa Cruz (http://www.genome.ucsc.edu/). Of those 1422 ESTs, 1158 had an annotation to a specific human gene, while the remaining 264 ESTs returned a definite location but did not match a specific gene in build 35.1 HGS. With a 10-Mb threshold, our coverage ranged from 41% on HSA 21 to 100% on HSA 17 and 20. The average coverage across the entire human genome is 71% (Table 3). At a threshold of 5 Mb, our coverage ranged from 12% on HSA 21 to 90% on HSA 17 (Table 3). A scatterplot representation of coverage of individual human chromosomes, including X, is presented in Fig. 1. Individual chromosomal scatterplots positioned next to their human ideograms are available at http://www.cabnr.unr.edu/beattie/research/hsa_coverage.htm

Table 3. Coverage of human genome by porcine linkage groups
Fig. 1.

Linear scatterplot of human chromosomes 1-22 and X. Each data point represents the physical base pair location of each of the 1422 porcine ESTs with strong annotation in build 35.1 (NCBI). Centromeres are indicated by white diamonds. This set of data represents approximately 70% of the total number of ESTs in the study. The remaining 30% at this time do not have any annotation to a position within the human genome.

Conservation of synteny

Porcine chromosomes 2, 5, 6, 12, and 14 were confirmed to be “gene-rich” relative to HSA 17, 19, and 22 (Venter et al. 2001, and extrapolated from NCBI build 35.1 HGS), suggesting significant conservation of synteny on these chromosomes. We refer to “rich” using an average of 10.0 genes/Mb in the human genome (Venter et al. 2001, and NCBI build 35.1 HGS). Autosomal chromosomes in the human genome have an average of 9.7 genes/Mb, 9.4 when HSA X and HSA Y are included. We also observed that SSC 3 (HSA 2, 7, 16), SSC 4 (HSA 1 and 8), SSC 7 (HSA 4, 6, 14–16), and SSC 14 (HSA 1, 8–10, 12, 22) are “gene-rich” using this convention, while SSC 1, 8, 11, and X continue to correspond to the “gene-deserts” on HSA 18, 4, 13, and X (Fig. 2). The known “gene-deserts” in the swine genome, SSC 1, 8, 11, and X, that correspond to HSA 18, 4, 13, and X had 72%, 58%, 66%, 53%, and 66% of the markers they statistically should have received, respectively (Fig. 2). In contrast to the first-generation porcine EST RH map (Rink et al. 2002a), SSC 10 received fewer “genes” than it should have based on the amount of DNA present in the chromosome (Schmitz et al. 1992). Interestingly, SSC 10 corresponds to regions on HSA 1, 9, and 10, which are not listed as chromosome deserts in the HGS. SSC 2, 5, 6, 7, and 12 clearly correspond to the gene-rich HSA 17, 19, and 22. SSC 6, a new addition to the current set of chromosomes that appears “gene-rich,” received 124% of the genes it statistically should have. SSC 6 exhibits synteny with a total of four human chromosomes: HSA 1, 16, 18, and 19, including the “gene-desert” associated with HSA 18. Swine autosomes 3, 4, and 14 show an increase in gene density compared with the first-generation map (Fig. 2) and present the porcine equivalent of “gene-rich” and “gene-desert” human chromosomes, averaging 9.3, 8.5, and 9.3/Mb, respectively.
However, the current porcine RH maps should be considered preliminary until a higher resolution is reached and full integration of linkage, cytogenetic, and RH maps has been achieved.

Fig. 2.

Identification of “gene-deserts” and “gene-pools” in the porcine genome. SSC 1, 8, 11, and X correspond with HSA 18, 4, 13, and X, respectively. These represent “gene-deserts” both in the human and the porcine genomes. A total of six porcine chromosomes (SSC 2, 5, 6, 7, 12, and 14) are syntenic with the three most “gene-rich” human chromosomes (HSA 17, 19, and 22).

We identified 95 ESTs that mark putative breakpoints in synteny after the initial BLASTN search at CCGB (http://www.ccgb.umn.edu/). However, 21 of the 95 putative breaks in synteny were eliminated because these markers were subsequently found to be adjacent to ESTs in the correct location (bp) in the human genome when rechecked using a manual BLAT search against build 35 on the UCSC website (http://www.genome.ucsc.edu/). The remaining 74 markers were subsequently considered possible breaks in synteny, without setting any significant e-value threshold. The scores of all remaining putative breaks in synteny ranged widely from 21 to 650 (average of 222.81). E values ranged from 4.16 × 10⁻⁶ to 0.00 (with 14 markers having e values <10⁻¹⁰⁰). The length, read quality, score, and e value of all putative breaks were then examined. Shorter ESTs had a strong tendency toward low scores and e values approaching the cutoff value when annotated. An annotation threshold was set at a score of greater than 125 and e ≤ 10⁻²⁵ to ensure the integrity of the final comparative maps. At this level of stringency, 34 of the 74 breaks did not meet the threshold, leaving 40 markers considered truly significant breaks in synteny, where each marker represents one syntenic block. We also identified at least 90 synteny blocks (123 if all 74 possible breaks were considered) between the pig and human genomes. This number is considerably smaller than the reported 394 between human and mouse, and the 417 between human and rat (Bourque et al. 2004). The map revealed 40 breaks in synteny (e ≤ 10⁻²⁵) with the human genome, extremely close to the 39 breaks identified by reverse chromosomal painting (Goureau et al. 1996). Marker numbers on this version of the physical map of the swine genome are still considerably lower than on the mouse or rat genome maps.
Therefore, the current number of syntenic breaks between the human and the porcine genome map should be considered preliminary.