Introduction

Cilia are built on a scaffold of nine peripheral microtubule doublets. In motile cilia, the nine peripheral microtubule doublets are accompanied by two central microtubules (the 9+2 structure). Radial spokes connect the peripheral and central microtubules, while peripheral doublets are connected with each other by nexin links. Outer and inner dynein arms are anchored to the peripheral microtubule doublet and produce the force needed for the movement of cilia. Synchronized beating of the cilia generates the flow of mucus and cerebrospinal fluid in the respiratory tract and in the brain, respectively, whereas in the fallopian tubes cilia help in moving the ovum toward the uterus. Flagella, built like motile cilia with a similar scheme of 9+2 microtubules, provide motility to unicellular organisms or cells, such as spermatozoa. Primary cilia—which lack the central pair of microtubules (the 9+0 structure), dynein arms and radial spokes—act as sensory organelles, displaying receptor molecules and sensing chemical and mechanical stimuli (Satir and Christensen 2008). However, it is important to note that motile cilia perform a variety of mechanosensory or chemosensory functions and primary cilia located, for example, on the embryonic node are motile (Bloodgood 2010).

Nevertheless, our understanding of the molecular composition of cilia is far from complete. Several approaches have been taken to characterize the cilia genome and proteome. A database search for expression patterns restricted to ciliated tissues was used to predict 99 cilia-related murine genes (McClintock et al. 2008). Liquid chromatography-mass spectrometry of human epithelial cilia identified 164 axonemal proteins (Ostrowski et al. 2002). A mass spectrometry study of flagella of the unicellular alga Chlamydomonas reinhardtii identified 360 proteins very likely to be involved in cilia formation, and 292 that are probably involved (Pazour et al. 2005). In addition, comparison of the genomes of non-ciliated and ciliated organisms predicted 688 and 183 cilia-related genes in Chlamydomonas reinhardtii (Li et al. 2004) and Drosophila melanogaster (Avidor-Reiss et al. 2004), respectively. The presence of X-box promoter elements, which are targets of cilia-related transcription factors, has led to the identification of additional ciliary genes in Caenorhabditis elegans (Blacque et al. 2005; Efimenko et al. 2005). The results of these different studies have been assembled in online databases: the Ciliome Database (Inglis et al. 2006) and the Cilia Proteome database (Gherman et al. 2006), which also contains basal body proteins. As far as human motile cilia are concerned, these databases are probably far from complete, since only a single high-throughput proteomics study has been performed on mammalian cilia (Ostrowski et al. 2002). Most of what we know about cilia comes from work on the unicellular biflagellate alga Chlamydomonas reinhardtii. Although the evolutionary conservation of ciliary proteins is remarkably high, it can be assumed that mammalian cilia are more complex. Moreover, the genes related to the presence of multiple cilia in one epithelial cell, the genes associated with the coordinated beating of multiple cilia, and the genes coding for receptor molecules responsible for communication with the environment of the human body are likely to be missing from the databases.

Defects in both primary and motile cilia have been associated with a group of pleiotropic, clinically overlapping human diseases called ciliopathies, such as primary ciliary dyskinesia (PCD), polycystic liver and kidney disease, retinitis pigmentosa, nephronophthisis, Bardet–Biedl syndrome, Alström syndrome, and Meckel–Gruber syndrome (Badano et al. 2006; Valente et al. 2006). PCD is a rare, genetically heterogeneous disorder caused by mutations in genes encoding proteins important for cilia motility (Geremek and Witt 2004; Pennarun et al. 1999; Olbrich et al. 2002; Bartoloni et al. 2002; Loges et al. 2002; Omran et al. 2008; Castleman et al. 2009; Duquesnoy et al. 2009; Loges et al. 2009). The disease is characterized by recurrent respiratory tract infections, bronchiectasis and infertility. Pulmonary symptoms arise from the lack of efficient mucociliary clearance caused by kinetic dysfunction of respiratory cilia. Male infertility is caused by immotility of the sperm flagella. Situs inversus, a mirror-image reversal of the position of the thoracic organs, is present in approximately half of PCD patients because of immotility of the primary cilia of the embryonic node. Nodal cilia are motile and generate a leftward flow of extraembryonic fluid in the nodal pit. This flow has been identified as the initial left–right symmetry-breaking event during embryogenesis. According to one hypothesis, the flow moves nodal vesicular parcels to the left edge of the node. The parcels contain signaling molecules that activate signaling pathways. When nodal cilia are immotile, as in PCD, left–right determination is randomized, causing situs inversus in approximately half of the patients (Nobutaka et al. 2009).

We observed down-regulation of dynein genes and other ciliary genes in PCD patients compared with controls (unpublished observations). More importantly, we noticed that additional genes followed the dynein expression pattern in PCD patients, suggesting a functional relationship between the co-regulated genes. We postulated that gene expression patterns in PCD patients could be used to identify mammalian cilia genes. For this study, we performed gene expression analysis on bronchial biopsies from PCD patients to identify groups of highly correlated cilia-related genes. In contrast to most previous studies, our predictions are based directly on human material and not on model organisms. We report a cluster of 164 genes previously linked to cilia and 208 new genes that we predict to be involved in cilia-related processes.

Materials and methods

Bronchial biopsies

We collected bronchial biopsies from six clinically diagnosed PCD patients and nine control subjects who were referred to the hospital for unrelated reasons. In all patients, the saccharine test and light microscopy imaging of cilia motility gave results characteristic of PCD. An electron microscopy evaluation of ciliary defects was performed; however, in two specimens an insufficient number of epithelial cells was recovered for inspection (Supplementary table 1). The concentration of nitric oxide (NO) in the nasal cavity was measured with a chemiluminescence analyzer, with a threshold value of 200 ppb for diagnosing PCD (Karadag et al. 1999). Patient #1 had a nasal nitric oxide concentration slightly below the threshold and no electron microscopy imaging of cilia. Therefore, a second nitric oxide measurement was performed, which confirmed the low concentration. He had a typical disease course, including resection of the middle lobe due to necrosis, and bronchoscopic and nasal cavity examinations showed a picture typical of PCD. The specimens from non-PCD controls were obtained through the same protocol; these individuals were referred to the Institute of Tuberculosis and Lung Diseases in Rabka, Poland for regular check-ups, with no symptoms of acute disease, no bronchoscopic signs of disturbed mucociliary transport, and normal ciliary beating under light microscopy. The study was approved by the institutional review board. Informed consent was obtained from all subjects.

Gene expression

Anti-sense RNA was synthesized, amplified and purified using the Ambion Illumina TotalPrep Amplification Kit (Ambion, USA) following the manufacturer’s protocol. Complementary RNA was hybridized to Illumina HumanRef-12 Whole Genome BeadChips and scanned on the Illumina BeadArray Reader. The gene expression data has been submitted to GEO under accession number GSE11501.

Data preprocessing

The initial steps of data preprocessing and quantile normalization were performed in the Illumina BeadStudio Gene Expression module v3.2. Expression values below 5 were set to 5, and all values were then log2-transformed. We limited our data to coding genes in the Ensembl database v52. Probes whose expression variance fell in the lowest 25% of the data were removed. To further limit the computations and to filter out genes not stably expressed in bronchial tissue, we removed probes not detected as present in 4 of the 9 control individuals, who had a diagnostic bronchoscopy but did not display PCD symptoms. The remaining 13,811 probes were subjected to clustering.
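The flooring, log transformation and variance filter described above can be illustrated with a minimal sketch. It is not the BeadStudio pipeline: the function names, the probe-to-values dictionary layout, the toy numbers, and the reading of "lower 25%" as the bottom quartile of probe variances are all assumptions made for illustration.

```python
import math
from statistics import pvariance

def log2_floor(values, floor=5.0):
    """Raise values below `floor` to `floor`, then take the base 2 logarithm."""
    return [math.log2(max(v, floor)) for v in values]

def drop_low_variance(expr, fraction=0.25):
    """Remove the `fraction` of probes with the lowest variance across samples.

    `expr` maps a probe id to its list of log2 values (hypothetical layout).
    """
    ranked = sorted(expr, key=lambda probe: pvariance(expr[probe]))
    dropped = set(ranked[: int(len(ranked) * fraction)])
    return {probe: v for probe, v in expr.items() if probe not in dropped}

# Toy data, invented for illustration only.
raw = {
    "probe_a": [3.0, 4.0, 2.0, 5.0],      # never above the floor, so zero variance
    "probe_b": [8.0, 64.0, 16.0, 512.0],
    "probe_c": [10.0, 12.0, 11.0, 13.0],
    "probe_d": [5.0, 40.0, 320.0, 20.0],
}
logged = {probe: log2_floor(v) for probe, v in raw.items()}
kept = drop_low_variance(logged)
```

In this toy input the probe that never rises above the floor has zero variance after transformation and is the one removed.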

Clustering

The quality threshold (QT) clustering algorithm was implemented in C as described previously. The QT clustering algorithm performs a computationally intensive search for groups of correlated genes. It looks for the largest cluster by iteratively placing every gene at the center of a potential cluster and adding to it the genes with the highest jackknife correlation coefficients in a way that minimizes the cluster diameter d, until no further genes can be added without exceeding a predefined value of d (Heyer et al. 1999; Coppe et al. 2009). Cluster quality was ensured by keeping the cluster diameter below 0.3.
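The search described above can be sketched in a few lines of Python. This is a simplified illustration, not the C implementation used in the study. It assumes that the jackknife correlation is the minimum Pearson correlation over the full data and every leave-one-sample-out subset, and that the cluster diameter is the largest pairwise value of 1 minus that correlation; the function names and toy profiles are hypothetical.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def jackknife_corr(x, y):
    """Minimum correlation over the full data and each leave-one-out subset."""
    values = [pearson(x, y)]
    for i in range(len(x)):
        values.append(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return min(values)

def qt_cluster(profiles, max_diameter=0.3, min_size=2):
    """Greedy QT clustering; `profiles` maps gene -> list of expression values."""
    dist = {}
    for a, b in combinations(profiles, 2):
        dist[a, b] = dist[b, a] = 1.0 - jackknife_corr(profiles[a], profiles[b])
    remaining, clusters = set(profiles), []
    while remaining:
        best = []
        for seed in sorted(remaining):
            cluster, candidates = [seed], remaining - {seed}
            while candidates:
                # Pick the gene that increases the cluster diameter the least.
                gene = min(sorted(candidates),
                           key=lambda g: max(dist[g, m] for m in cluster))
                if max(dist[gene, m] for m in cluster) >= max_diameter:
                    break
                cluster.append(gene)
                candidates.remove(gene)
            if len(cluster) > len(best):
                best = cluster
        if len(best) < min_size:
            break  # only singletons are left
        clusters.append(best)
        remaining -= set(best)
    return clusters

# Toy profiles, invented for illustration only.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "g3": [1.2, 1.9, 3.1, 4.2, 4.8],
    "g4": [5.0, 1.0, 4.0, 2.0, 3.0],
    "g5": [3.0, 5.0, 1.0, 4.0, 2.0],
}
clusters = qt_cluster(profiles)
```

Each pass tries every remaining gene as a seed, which is why the procedure becomes expensive for thousands of probes.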

Database search

For gene annotation enrichment analysis, we used DAVID (Database for Annotation, Visualization, and Integrated Discovery: http://www.david.abcc.ncifcrf.gov/home.jsp) (Huang da et al. 2007; Kouwenhoven et al. 2010). The tool calculates over-representation of specific gene ontology terms with respect to the total number of genes assayed and annotated. DAVID applies a modified Fisher exact test to establish whether the proportion of genes falling into an annotation category differs significantly between a particular group of genes and the background group of genes. Ensembl Gene IDs of the cluster A genes were used as queries and the whole set of genes on the Illumina 12HT chip was used as the background group. The Ciliome Database (http://www.sfu.ca/~leroux/ciliome_home.htm) was queried with Ensembl Gene IDs of cluster A genes and Mouse Ensembl Gene IDs of their orthologues to find known ciliary genes (Inglis et al. 2006). The Ciliary Proteome Database (Gherman et al. 2006) (http://www.ciliaproteom.com) was queried with UniProt/Swiss-Prot accessions of cluster A genes. The BioMart module of the Ensembl database was used for ID/accession conversions. In addition, PubMed was searched for publications linking individual genes to cilia or flagella.
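The idea behind the enrichment test can be sketched as follows. This is not DAVID's code. It assumes that the modification is the EASE adjustment as we understand it from DAVID's documentation, in which one annotated gene is removed from the query list before a one-sided Fisher exact test is computed; the function names and the toy counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / comb(N, n)

def fisher_one_sided(k, n, K, N):
    """Standard one-sided Fisher exact p value for over-representation."""
    return hypergeom_upper_tail(k, n, K, N)

def ease_score(k, n, K, N):
    """Assumed EASE-style variant: drop one annotated gene from the list first."""
    return hypergeom_upper_tail(k - 1, n - 1, K - 1, N - 1)

# Toy counts: 2 of 3 query genes annotated, 3 of 10 background genes annotated.
fisher_p = fisher_one_sided(2, 3, 3, 10)
ease_p = ease_score(2, 3, 3, 10)
```

Removing one hit makes the adjusted value more conservative than the plain Fisher p value, which mainly penalizes categories supported by very few genes.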

Tissue-specific expression

The number of EST transcripts for each analyzed gene in 30 different tissue types or organ pools was retrieved from the UniGene database. Tissue expression enrichment scores were calculated as described previously (Yu et al. 2006; Colecchia et al. 2009). In brief, the enrichment score $ES_i(g) = o_i(g)/e_i(g)$ is the ratio of the observed to the expected number of ESTs for gene $g$ in tissue $i$. The total number of ESTs in UniGene for gene $g$ is $T(g) = \sum_i o_i(g)$. Given the total size $s_i$ of the EST libraries in tissue $i$, the expected number of ESTs in tissue $i$ for each gene is proportional to $p_i = s_i / \sum_i s_i$. If gene $g$ is expressed equally across all tissues, the expected number of ESTs in tissue $i$ is $e_i(g) = T(g)\,p_i$. The mean enrichment scores of the analyzed groups of genes in each of the tissues were calculated and plotted to visualize the tissue-specific expression patterns.
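The enrichment score defined above can be computed directly from EST counts. The sketch below is a minimal illustration; the function name, the dictionary layout and the toy counts are assumptions, not the code or data used in the study.

```python
def enrichment_scores(observed, library_sizes):
    """Return ES_i(g) = o_i(g) / e_i(g) for one gene.

    observed:      tissue -> number of ESTs for the gene, o_i(g)
    library_sizes: tissue -> total number of ESTs in that tissue, s_i
    """
    total_gene = sum(observed.values())        # T(g)
    total_lib = sum(library_sizes.values())    # sum of s_i over tissues
    scores = {}
    for tissue, size in library_sizes.items():
        expected = total_gene * size / total_lib   # e_i(g) = T(g) * p_i
        scores[tissue] = observed.get(tissue, 0) / expected if expected else 0.0
    return scores

# Toy counts, invented for illustration only.
sizes = {"testis": 1000, "lung": 2000, "liver": 1000}
obs = {"testis": 6, "lung": 2, "liver": 0}
scores = enrichment_scores(obs, sizes)
```

A score above 1 means the gene has more ESTs in that tissue than uniform expression would predict; averaging such scores over a gene group gives the per-tissue means that are plotted.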

Transcription factor-binding sites

Over-representation of transcription factor binding sites was evaluated in PAINT v3.9 (Vadigepalli et al. 2003; Riehle et al. 2008) interfaced with the TRANSFAC Professional v 2009.2 database (Wingender et al. 1996). The 500 base pairs upstream of the transcription start sites were extracted from the Ensembl database v52. A search for transcription factor binding sites was performed with the Match program, using matrices deposited in the TRANSFAC database and the filter option set to minimize false positives. p values were calculated using all human upstream sequences in the PAINT database as the control group, with the false-discovery rate as the multiple testing correction.
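As an illustration of a false-discovery rate correction, the sketch below implements the Benjamini-Hochberg step-up adjustment. Whether PAINT uses exactly this procedure is an assumption; the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p values for four hypothetical binding-site matrices.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Each raw p value is scaled by the number of tests divided by its rank, so strongly significant sites are penalized less than under a Bonferroni correction.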

Results

Clustering of genes based on the expression profiling of PCD tissue

We performed whole-genome gene expression profiling in bronchial tissue of six PCD patients using Illumina HT-12 bead arrays. From the 48,803 probes on the arrays, 13,811 probes passed the filtering steps and were used for further analysis. We first used the QT (quality threshold) clustering algorithm to identify groups of correlated genes in PCD patients. We identified 12 clusters with at least 100 genes; the clusters ranged in size from 100 (panel L, Fig. 1) to 372 genes (panel A, Fig. 1). We used DAVID (Database for Annotation, Visualization, and Integrated Discovery) (Huang da et al. 2007; Kouwenhoven et al. 2010) to perform functional annotation enrichment analysis on all cluster members. DAVID applies a modified Fisher exact test to establish whether the proportion of genes falling into an annotation category differs significantly between a particular group of genes and the background group of genes. The whole set of genes on the Illumina 12HT chip was used as the background group. The terms significantly enriched in cluster A were related to cilia, flagella and microtubules (p < 0.05 after Bonferroni’s correction) (Table 1). The remaining 11 clusters were not significantly enriched for terms that would have indicated a relation to a specific functional process. We also used DAVID to investigate the tissue expression patterns of the 372 genes in cluster A and found that they are mainly expressed in tissues known to have ciliated epithelium, such as testis, lung and trachea (Table 2).

Fig. 1

The 12 largest clusters of correlated genes obtained from bronchial tissue of 6 PCD patients. The expression was mean centered and divided by the standard deviation

Table 1 Distribution of gene ontology (GO) terms in cluster A analyzed by DAVID
Table 2 Tissue-specific expression enrichment analysis

Characterization of cluster A genes

We queried the publicly available Ciliome Database (Inglis et al. 2006) to check how many members of our cluster A had been previously linked to cilia and found 121 genes to be present in this database.

A similar search in the Ciliary Proteome Database (Gherman et al. 2006) resulted in classification of 115 genes as ciliary, 14 of which were not present in the Ciliome Database. A search in PubMed for publications linking cilia or flagella to any of the individual genes from cluster A led to classification of an additional 29 ciliary genes, resulting in a total of 164 known ciliary genes (i.e. 44% of cluster A; Supplementary table 2). These results strongly suggest that cluster A is a ciliary gene cluster (binomial $p < 1 \times 10^{-8}$ if we assume 10% of human genes to be related to cilia). This would also suggest that the remaining 208 genes are related to cilia and may represent new cilia genes not previously reported (Supplementary table 3). We observed the highest percentage of shared genes with the part of the Ciliome Database that was built on the results of experimental studies of motile cilia (Table 3). Cluster F (Fig. 1) has an expression curve shape very similar to that of cluster A, and 30% of the genes in cluster F also proved to be linked to cilia. However, we limited our analysis to cluster A genes only.
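The binomial argument can be made concrete with a short calculation: if 10% of human genes were cilia-related, how likely is it that at least 164 of 372 randomly chosen genes are ciliary? The sketch below computes this upper-tail probability; the function name is hypothetical, and the 10% background rate is the assumption stated in the text.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 164 known ciliary genes among 372 cluster members, assumed background rate 10%.
p_value = binomial_upper_tail(164, 372, 0.10)
```

Under this background rate only about 37 ciliary genes would be expected in a random set of 372, so the computed tail probability falls far below the reported bound.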

Table 3 The intersection of cluster A members and the Ciliome Database

To further investigate the tissue expression patterns of cluster A genes, we calculated expression enrichment scores for the 30 tissues from the UniGene dbEST database (Yu et al. 2006; Colecchia et al. 2009). We performed the calculations both on the full set of cluster A genes and on two subsets of cluster A members: genes previously linked to cilia and the new genes identified in this study as potential ciliary genes. Each gene was given an enrichment score based on the ratio between the number of ESTs representing it in a given tissue and the expected number of ESTs for that gene in the tissue. The expected number of ESTs for a given gene was calculated assuming equal expression across all tissues. The mean enrichment scores for all cluster A genes and the subsets were calculated for each tissue to present the tissue-specific expression pattern of the genes (Fig. 2).

Fig. 2

Mean expression enrichment scores of known ciliary genes in cluster A, new cilia-related genes in cluster A and all cluster A genes in 30 different tissues. The enrichment score for a gene represents its relative expression in the tissue compared with other tissues. The highest relative expression in ciliated tissues can be observed for the three groups of genes

The mean enrichment score for cluster A genes and the two subsets was highest for testis, trachea and lung, which are tissues known to have motile cilia. Brain tissue (where motile cilia are present in the ependyma), kidney and connective tissue (which have primary cilia), and the eye (where rods and cones have a small connecting cilium) also ranked high; 8 of the 10 highest-ranking tissues were the same in the two subgroups of known ciliary genes and genes not previously linked to cilia.

We screened the 500 base pairs (bp) upstream of each gene from cluster A for over-representation of transcription factor binding sites using PAINT v 3.9 (Vadigepalli et al. 2003; Riehle et al. 2008) and the TRANSFAC Professional v 2009.2 database (Wingender et al. 1996) (Table 3). The binding site for the regulatory factor X (RFX) family of transcription factors was significantly over-represented (false-discovery rate (FDR)-adjusted $p < 10^{-6}$) in the upstream sequences of cluster A genes, in both the known and new ciliary genes in cluster A, and in the “motile part” of the Ciliome Database (Table 4; Supplementary table 4).

Table 4 Over-representation of transcription factor-binding sites in the 500-bp upstream sequences of cluster A genes and motile cilia genes from the Ciliome Database

The expression of the cluster A genes in the control samples was stable, but not highly correlated (Fig. 3a). QT clustering in 6 controls returned 13 clusters of more than 100 genes. None of the clusters was significantly enriched in annotation terms related to cilia. The biggest control cluster contained 383 genes (Fig. 3b) and was not significantly enriched in any annotation term. Thirteen percent of the genes in this cluster have been previously related to cilia, which was not statistically significant. The tissue expression pattern was different from that of cluster A; in particular, the control cluster genes had much lower expression in testis. A similar pattern was also observed for the other control clusters (Fig. 3d). Clustering performed in 6 patients together with 6 controls returned a cluster of 205 genes whose expression pattern in the patients was similar to that of cluster A; 171 (83%) of these genes were also present in cluster A (Fig. 3c). Forty-four percent of the genes in this cluster have been previously related to cilia.

Fig. 3

The expression of 372 cluster A genes in 9 non-PCD control individuals (a). The biggest cluster of correlated gene expression in 6 control individuals (b). The biggest cluster in six control individuals and six patients showing a similar curve as cluster A (c). Tissue-specific expression pattern of the five biggest control clusters (grey) and cluster A (black) (d). Cluster A is the only one with high relative expression in testis

Discussion

We have identified a group of 208 new genes that are likely to be involved in cilia-related processes. We used gene expression profiling of bronchial tissue from PCD patients to obtain a cluster of 372 genes with highly correlated expression. Forty-four percent of the cluster members have been previously related to cilia based on high-throughput studies or individual experimental studies. The remaining cluster members are very likely to be cilia-related genes, as they showed tissue-specific expression patterns in accordance with the presence of cilia in the given tissues and also showed over-representation of RFX-binding sites in their upstream sequences. The tissue-specific expression pattern consistent with the presence of cilia was seen for all 372 cluster genes, as well as for the 164 known ciliary genes and the 208 cluster genes not previously linked to cilia. The expression libraries that we used for analysis of the tissue-specific expression pattern were obtained from organs and tissues that contain more than just ciliated epithelium. Gene expression in different tissues containing ciliated cells has been successfully used to identify cilia-related genes (McClintock et al. 2008; Kubo et al. 2008). High expression in axoneme-containing tissues, especially in testis, was unique to cluster A and was not observed in control clusters. Bronchial biopsies are a compound source of RNA and do not contain ciliated epithelial cells exclusively (Regamey et al. 2007). However, the gene expression pattern of cluster A was extracted from patients and is therefore related to a disease that affects the function of ciliated cells.

We found a statistically significant over-representation of RFX transcription factor binding sites in the 500-bp upstream sequences of all cluster A genes, of the known ciliary genes in cluster A, and of those not previously linked to cilia. In addition, we found RFX-binding sites in the upstream sequences of genes related to motile cilia and deposited in the Ciliome Database. The X-box-binding RFX transcription factors contain a highly homologous DNA-binding domain and share the same DNA-binding sites (Iwama et al. 1999). RFX3, present among the 372 genes in cluster A, is known to play a crucial role in the biogenesis of motile cilia (El Zein et al. 2009).

Based on sequence homology with Chlamydomonas reinhardtii proteins, several human orthologs have been assigned to outer and inner dynein arms (Pazour et al. 2005). The largest cluster in our analysis (cluster A) included genes encoding proteins predicted to be part of outer dynein arms, such as DNAH9, DNAI1 and DNAI2. Inner dynein arms were represented in this cluster by DNAH7, DNAH3, DNAH2, and WDR78. A number of known outer and inner dynein arm genes showed extremely low levels of expression and did not pass the filtering steps (DNAH14, DNAH17, TXNDC3 and DNAH8). DNAH6 and DNAH1 were not classified in our cluster A although they followed very similar expression patterns. DNAH5 and DNAH11 followed a different pattern of expression. Cluster A contained two genes known to be involved in building radial spoke complexes (RSHL3 and RSPH10B), four genes encoding intraflagellar transport proteins (IFT140, IFT172, IFT57, IFT74) and three genes coding for cytoplasmic dyneins (DYNC2LI1, DYNLL1, DYNLRB2), which play a role in retrograde transport in the flagella of Chlamydomonas. Thirty-eight genes from cluster A have been shown to be mutated in human diseases and have entries in the OMIM database, including genes mutated in ciliopathies, such as Bardet–Biedl syndrome (BBS4, BBS5, BBS11), Meckel–Gruber syndrome (MKS1), nephronophthisis (NPHP1, NPHP2, NPHP3) and PCD (DNAI1, DNAI2, RSPH4A, RSPH9, LRRC50). Cluster A also contains two limb–girdle muscular dystrophy genes (FUKUTIN, TRIM32), one of which is also mutated in Bardet–Biedl syndrome (TRIM32), and three genes associated with retinitis pigmentosa (PDE6B, Pronin and RPGR). Mutations in RPGR are known to cause a complex phenotype with both retinitis pigmentosa and PCD in some patients (Moore et al. 2006).

The ciliary proteome has been studied extensively using different high-throughput methods, including mass spectrometry proteomic studies and comparative genomics. All studies but one (Ostrowski et al. 2002) were performed in lower organisms and required a BLAST search to identify possible human orthologs. Our ciliary set is based on human transcriptome data, and this is one of our study’s strengths. Our dataset, however, might contain false-positive genes that matched the ciliary pattern by chance or that are related uniquely to a pathology present in PCD. We based our prediction exclusively on PCD patients’ cells, which also limits our dataset to that part of the ciliary puzzle altered by this particular disease process.

PCD is a highly heterogeneous disorder, for which nine genes are known to be mutated. Yet, in the majority of patients, the causative genes remain undiscovered. Our gene set could be considered a list of PCD candidate genes. Because one of the PCD loci is on chromosome 19q (Meeks et al. 2000; Blouin et al. 2000) and we found three genes from our cluster (CACNG6, C19ORF51 and AURKC) that map to this region, they might well be considered disease candidate genes. A 1-bp deletion in the aurora kinase C gene (AURKC) has been associated with male infertility characterized by large-headed, multi-flagellar, polyploid spermatozoa, which makes this gene an interesting candidate for further investigation in PCD patients (Dieterich et al. 2007).

The expression of cluster A genes was not highly correlated in the control samples. We observed a disease-specific pattern of expression for 372 functionally related genes. This phenomenon occurs in response to the mutation of a single gene, but most probably not the same gene in each patient studied, since PCD is a genetically heterogeneous disease. It indicates that information from the cytoplasm of the PCD epithelium, where cilia are misassembled (Fliegauf et al. 2005), is transferred to the nucleus and transformed on the genomic level into a regulatory signal altering expression of ciliary genes. Our results show how a monogenic disease affecting multi-protein complexes can be used to study more complex genomic networks.

In conclusion, we identified a group of 208 new cilia-related genes. This list of genes provides candidate genes for PCD and other ciliopathies.