Decoding gene regulation in the fly brain

Janssens, Jasper; Aibar, Sara; Taskiran, Ibrahim Ihsan; Ismail, Joy N.; Gomez, Alicia Estacio; Aughey, Gabriel; Spanier, Katina I.; De Rop, Florian V.; González-Blas, Carmen Bravo; Dionne, Marc; Grimes, Krista; Quan, Xiao Jiang; Papasokrati, Dafni; Hulselmans, Gert; Makhzami, Samira; De Waegeneer, Maxime; Christiaens, Valerie; Southall, Tony; Aerts, Stein

doi:10.1038/s41586-021-04262-z

Article
Published: 05 January 2022

Volume 601, pages 630–636 (2022)

Abstract

The Drosophila brain is a frequently used model in neuroscience. Single-cell transcriptome analysis^1,2,3,4,5,6, three-dimensional morphological classification⁷ and electron microscopy mapping of the connectome^8,9 have revealed an immense diversity of neuronal and glial cell types that underlie an array of functional and behavioural traits in the fly. The identities of these cell types are controlled by gene regulatory networks (GRNs), involving combinations of transcription factors that bind to genomic enhancers to regulate their target genes. Here, to characterize GRNs at the cell-type level in the fly brain, we profiled the chromatin accessibility of 240,919 single cells spanning 9 developmental timepoints and integrated these data with single-cell transcriptomes. We identify more than 95,000 regulatory regions that are used in different neuronal cell types, of which 70,000 are linked to developmental trajectories involving neurogenesis, reprogramming and maturation. For 40 cell types, uniquely accessible regions were associated with their expressed transcription factors and downstream target genes through a combination of motif discovery, network inference and deep learning, creating enhancer GRNs. The enhancer architectures revealed by DeepFlyBrain lead to a better understanding of neuronal regulatory diversity and can be used to design genetic driver lines for cell types at specific timepoints, facilitating their characterization and manipulation.

Main

The Drosophila brain contains around 220,000 cells and is an excellent model for investigating the diversity of cell types. Advances in electron microscopy have yielded connectome maps across the fly brain^8,9, while driver lines¹⁰ provide genetic access to many cell types¹¹. This diversity of cell types has been bolstered by single-cell transcriptomics analyses of the brain^{1,2,3,4,5,6,12,13} and the ventral nerve cord¹⁴.

The role that transcription factors (TFs) have in the determination of neuronal fate was previously highlighted by imputing unique TF combinations for each cell type that activate or repress target genes². These combinations arise in neural progenitors through TF-guided temporal and spatial axes of differentiation¹⁵. Furthermore, TFs govern key neuronal features such as dendritic targeting and neurotransmitter determination^{3,16,17,18,19}, and altering a single TF can change neuronal fate²⁰. Inferring TFs and their putative target genes is crucial, but transcriptomics analysis leads to high false-positive rates²¹, as TF activity often cannot be predicted from TF expression levels because it depends on many variables, such as protein activity and localization as well as the presence of co-binding TFs and co-factors.

The recent development of the single-cell assay for transposase accessible chromatin by sequencing (scATAC-seq)²² provides additional understanding of the mechanisms that underlie neuronal identity, by enabling the analysis of which genomic regions encode regulatory information for creating and maintaining each cell type. Integrating genomic enhancers with gene expression would yield precise regulatory programs.

Here we built a single-cell multi-omics atlas across fly brain development, covering neurogenesis, maturation and maintenance. We identify key regulators of neuronal and glial cell identity, decipher the enhancer code for specific neuronal subtypes, generate informed enhancer driver lines and map enhancer GRNs (eGRNs), all of which are available for exploration online (http://flybrain.aertslab.org).

Unique chromatin landscapes of neurons

To study regulatory programs of neuronal diversity, we profiled chromatin accessibility of 240,919 cells from whole brains at 9 timepoints from larvae to adult, covering crucial stages of development^1,18 (Fig. 1a and Supplementary Table 1). This atlas is accompanied by a single-cell RNA sequencing (scRNA-seq) atlas of the adult brain, containing 118,687 high-quality cells divided into 204 clusters, of which 66 are annotated² (Methods).

**Fig. 1: Chromatin landscape of adult brain cell types.**

We first analysed the chromatin landscape of 60,624 cells from adult flies and late-stage pupa (72 h after puparium formation (APF)) and identified 79 stable cell states using cisTopic²³ (Methods and Extended Data Fig. 1). We next aggregated the accessibility of regions upstream and within the gene body, and linked scATAC clusters to cell types in the scRNA-seq atlas using co-clustering, marker gene enrichment and non-negative least squares (NNLS) regression (Fig. 1b, c, Extended Data Fig. 2 and Supplementary Discussion). After manual curation, 43 of the 79 ATAC clusters were one-to-one linked to RNA clusters.
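
As an illustration of the NNLS step, the sketch below expresses the aggregated gene activity of one scATAC cluster as a non-negative mixture of scRNA-seq cluster profiles and takes the largest weight as the candidate match. It is a minimal sketch, not the authors' pipeline: the function name, the toy matrices and the use of a simple coordinate-descent solver (rather than a library NNLS routine) are all assumptions made for this example.

```python
def nnls_coordinate_descent(A, b, n_iter=200):
    """Approximately solve min ||A x - b||^2 subject to x >= 0.

    A: list of rows (genes x RNA clusters); b: gene activity of one ATAC cluster.
    Simple cyclic coordinate descent, chosen here only for illustration.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = [float(v) for v in b]  # equals b - A x for x = 0
    col_sq = [sum(A[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_iter):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue
            # Exact minimiser along coordinate j, clipped at zero.
            grad = sum(A[i][j] * residual[i] for i in range(n_rows))
            new_xj = max(0.0, x[j] + grad / col_sq[j])
            delta = new_xj - x[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= delta * A[i][j]
                x[j] = new_xj
    return x


# Toy data (hypothetical): 3 genes x 2 RNA clusters, one ATAC cluster.
rna_profiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
atac_gene_activity = [3.0, 0.0, 1.0]

weights = nnls_coordinate_descent(rna_profiles, atac_gene_activity)
best_match = max(range(len(weights)), key=lambda j: weights[j])
```

In this toy case the unconstrained least-squares solution would give the second RNA cluster a negative weight; the non-negativity constraint sets it to zero, so the ATAC cluster is assigned to the first RNA cluster.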

The annotated clusters include six glial subtypes (approximately 10–15%), non-brain cells (1%, plasmatocytes and photoreceptors) and neurons (85–90%). Notably, optic lobe (OL) neurons form distinct clusters, whereas central brain (CB) neurons form a continuum. In the CB, three Kenyon cell (KC) subtypes and two smaller cell types of the central complex (ellipsoid body ring neurons, protocerebral bridge neurons) were identified. CB clusters can be split into Imp⁺ or pros⁺ cells on the basis of scATAC-seq, recapitulating the differences that were found in scRNA-seq analyses of the brain^2,12 and ventral nerve cord¹⁴. To validate cell type annotation, we used driver lines that label the three KC subtypes and two OL cell types (Tm1 and T4/T5), and performed bulk ATAC-seq after fluorescence-activated cell sorting (FACS), matching the scATAC aggregates with high concordance (Fig. 1d and Extended Data Fig. 3a–e).

Every cluster has a unique chromatin accessibility profile, with a range of 105 to 4,732 differentially accessible regions (DARs) out of a total of 24,543 median accessible regions per cluster, many of which are located close to validated marker genes (Fig. 1e). The transcription start site (TSS) is often ubiquitously accessible, meaning specificity is more distally controlled. Although T4/T5 neurons are inseparable in adult scRNA-seq, 110 DARs were identified between them, and subclustering also separated the a/b and c/d subtypes (Extended Data Fig. 3f, g). Given this high resolution in scATAC-seq, we investigated the missing CB cell types by examining olfactory projection neurons (OPNs). OPNs are identified in a 57,000-entry scRNA-seq dataset², but not in the 60,000-entry scATAC-seq dataset nor in an expanded dataset with additional timepoints nor using different clustering approaches (Extended Data Fig. 1j–l). However, when we FACS-sorted OPNs for scATAC-seq, 876 peaks were revealed near OPN marker genes (Extended Data Fig. 3h–l), showing that, despite unique chromatin profiles, OPNs, and probably other CB cell types, are more difficult to identify by scATAC-seq compared with scRNA-seq.

Dynamic changes during brain development

To investigate how neuronal diversity is generated, we studied chromatin accessibility changes during development by analysing 135,275 cells from third instar larvae to 12 h APF using cisTopic, obtaining 54 clusters (Fig. 2a). We trained a support vector machine (SVM) classifier⁵ on the adult cell types to transfer labels to earlier stages, enabling the detection of core sets of specific regions per cell type that remain continuously accessible, as well as DARs that vary over time (Extended Data Fig. 4a–e and Supplementary Table 2). Similar to RNA-seq analyses in which a maximum number of differentially expressed genes was detected at 48 h APF and a minimum in adults^1,5, we found a decrease in DARs over time, with a relative spike at 48 h APF during synaptogenesis (Extended Data Fig. 4f).

**Fig. 2: Chromatin changes through neuronal development.**

Progenitor cell types, which are characterized by accessible regions near the neuroblast markers dpn and ase (Supplementary Discussion and Extended Data Fig. 4g, h), form the roots of two main branches in the uniform manifold approximation and projection (UMAP) analysis (Extended Data Fig. 4m)—a continuous CB branch and a tree-like OL branch, indicating different neurogenesis modes. We detected a spike in TF motifs from the neuronal remodelling factors EcR and Sox14 (ref. ²⁴) in Imp⁺ neurons, but not in pros⁺ neurons, consistent with their respective pruning roles and with a potential embryonic origin of Imp⁺ neurons (Fig. 2b). In the OL, six branches emerge, each enriched for the motif of one major class of TFs (such as POU, bHLH and ETS), linked to synaptic partner recognition (Supplementary Table 3) and neurotransmitter determination¹⁶ (Fig. 2c).

Cell-type-specific TF-binding sites

To identify cell-type-specific key regulators, we defined a ‘cistrome’ as the combination of a TF with its target enhancers. We developed a dual approach of conventional motif discovery and deep learning (DL)^25,26, integrating information from the adult scRNA-seq and scATAC-seq data, to identify TFs per cell type that are both expressed and of which the motif is enriched in the accessible regions (Fig. 3a).

**Fig. 3: Identification of regulators through multi-omic data integration.**

First, using the conventional approach, we calculated the correlation between TF expression and motif enrichment in DARs per cell type (Fig. 3b and Extended Data Fig. 5). In total, 116 TFs show a strong positive correlation, suggesting that they open chromatin as activators. These cover pan-neuronal, pan-glial and cell-type-specific TFs (Fig. 3c), including known regulators for glia: Repo and Kay²⁷; for KCs: Ey and Mef2 (ref. ²⁸); for ellipsoid body neurons: Grn and D²⁹; for T1 neurons: Ets65A³⁰ and Oc³¹; and combinations of Acj6, Fkh, TfAP-2 and SoxN/Sox102F in the other T neurons^3,5,19,30. Moreover, 131 TFs display a negative correlation between expression and motif enrichment (such as Mamo and Lola-N), suggesting a repressive role.
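
The correlation logic can be sketched as follows: for one TF, compare its expression across cell types with the enrichment of its motif in the DARs of the same cell types, and read the sign of the correlation as a hint towards an activator or repressor role. This is a hedged sketch; the use of Pearson correlation, the threshold of 0.5 and the function names are assumptions for this example and are not taken from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # constant input: correlation undefined, treated as none
    return cov / (sd_x * sd_y)


def classify_tf(expression, motif_enrichment, threshold=0.5):
    """Label a TF from the sign of its expression/motif correlation.

    threshold is a hypothetical cut-off chosen for this example.
    """
    r = pearson(expression, motif_enrichment)
    if r >= threshold:
        return "candidate activator", r
    if r <= -threshold:
        return "candidate repressor", r
    return "unclassified", r


# Toy values across four cell types (hypothetical numbers).
label_pos, r_pos = classify_tf([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
label_neg, r_neg = classify_tf([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```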

Second, we trained a convolutional neural network, called DeepFlyBrain, using sequences of co-accessible regions (topics)²⁶ from KCs, T neurons and glia as input, and topic accessibility as output (Methods, Supplementary Discussion, Extended Data Fig. 6a–e and Supplementary Tables 4 and 5). We then used DeepExplainer³² to calculate the contribution of each nucleotide in the prediction of region accessibility, and TF-MoDISco³³ to identify motifs from recurring patterns in the contribution scores (Methods and Supplementary Table 6). This revealed that KC enhancers are characterized by Ey, Onecut, Mef2, Mamo and Dati motifs, matching their expression (Fig. 3d (top)). Mamo and Dati have negative nucleotide importance, suggesting that they correlate with closing chromatin. For T neurons, the most important motifs include Fkh, TfAP-2 and Acj6; and, for glia, we found Ct, Repo, Zfh2 and Klu (Fig. 3d (bottom) and Extended Data Fig. 6f). Scanning accessible regions with TF-MoDISco patterns, DeepFlyBrain provides high-confidence, cell-type-specific genome-wide binding-site predictions that show increased sequence conservation compared with the flanking sequences, supporting their functionality (Extended Data Fig. 6g–k).

To validate the predicted binding sites, we performed Eyeless and Repo CUT&Tag³⁴ analysis on whole-brain samples, and found a significant overlap (adjusted P (P_adj) < 10⁻³⁰, hypergeometric test; Extended Data Fig. 7a–e). We next applied targeted DamID (TaDa³⁵) analysis, finding 8,543 Mef2 peaks in γ-KCs and 10,900 Acj6 peaks in T4/T5 neurons that overlap significantly with predicted cell-type-specific DL and cistrome regions (Fig. 3e; P_adj < 10⁻³⁰ and P_adj < 10⁻³⁰, respectively, hypergeometric test). Interestingly, TaDa analysis of Mef2 in Tm1 neurons detects different sites from those in γ-KCs, whereas the Acj6 sites from T4/T5 neurons contain all those found in Acj6 TaDa analysis of all Acj6-expressing neurons¹⁶, suggesting a stronger pioneering role for Acj6 (Extended Data Fig. 7f, g). As a third validation experiment, we performed cell-type-specific TF knockdowns followed by FACS and ATAC-seq analysis (Fig. 3f). Mef2, Acj6, Onecut and TfAP-2 knockdowns all resulted in a decreased accessibility of regions with their respective motifs (Fig. 3f and Extended Data Fig. 7h–k), whereas knocking down the predicted repressor Mamo in γ-KCs increased the accessibility of Mamo sites, and led to a partial switch from the γ-KC-type chromatin landscape to the Mamo-negative α/β-KC-type chromatin landscape (Extended Data Fig. 7l–m; P_adj < 10⁻³⁰, hypergeometric test).
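
For readers unfamiliar with the overlap statistic used here, the following sketch computes a one-sided hypergeometric P value for the overlap between two region sets drawn from a common universe. The function name, its arguments and the toy numbers are assumptions for illustration; the choice of universe and the multiple-testing adjustment used in the study are not reproduced.

```python
from fractions import Fraction
from math import comb


def hypergeom_overlap_pvalue(universe, set_a, set_b, overlap):
    """P(X >= overlap) when set_b regions are drawn without replacement
    from a universe that contains set_a 'marked' regions."""
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    tail = sum(
        comb(set_a, i) * comb(universe - set_a, set_b - i)
        for i in range(overlap, upper + 1)
    )
    # Exact rational arithmetic avoids overflow before converting to float.
    return float(Fraction(tail, total))


# Toy example (hypothetical): 10 regions, 5 predicted sites, 5 peaks, all 5 overlap.
p_value = hypergeom_overlap_pvalue(universe=10, set_a=5, set_b=5, overlap=5)
```

With these toy numbers only one of the 252 possible draws gives a complete overlap, so the P value is 1/252.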

Finally, we performed bulk ATAC-seq analysis of the adult brain across 44 homozygous fly lines³⁶, identifying 4,063 single-nucleotide polymorphisms (SNPs) that correlate with chromatin accessibility³⁷ (P_adj < 0.05; Extended Data Fig. 6l). DeepFlyBrain identified affected TF motifs for 20% of the SNPs, consistent with previous studies²⁵, with SNPs destroying Mamo or Lola-N repressor binding sites, leading to increased accessibility (Extended Data Fig. 6m–p, s). When we overexpressed Lola-N in glia, the Lola-N GATC sites decreased in accessibility, confirming its repressive role in neurons³⁸ (Extended Data Fig. 6q, r).

Decoding enhancer architecture

The atlas of enhancers, linked to cell types and regulators, enables the design of reporter lines to target cell populations throughout development. Previous efforts to create driver lines for the fly brain (FlyLight¹⁰ and the Vienna Drosophila Resource Center) used random regions of 2–3 kb around neuronal genes, causing many lines to be non-specific: 2,551 out of 3,456 FlyLight lines contain more than one ATAC peak, of which 1,796 are DARs. These lines can be made specific by subcloning the individual ATAC-seq peaks (Extended Data Fig. 8a–c). Furthermore, split-GAL4 lines that combine two enhancers through AND logic are recapitulated as the intersection of their ATAC-seq signal (Extended Data Fig. 8d).

Using a more systematic approach, we selected an additional 60 regions for a total of 63 enhancers and tested their enhancer activity in vivo using transgenic fly lines (53 GFP, 10 GAL4; Methods, Extended Data Fig. 8e, f and Supplementary Table 7). The selected regions are accessible in either KCs (24), OL neurons (17), glia (5) or mixed (14) with a size range of 300–1,732 bp, and three negative controls were added that are either ubiquitously accessible or inaccessible. Overall, all of the enhancers show reporter activity in the brain, with 73% showing high activity at any developmental stage; whereas, for 65%, the adult reporter activity is specific to the predicted cell type (Extended Data Fig. 8g–i).

We next examined the relationship between motif architecture and reporter activity using DeepFlyBrain. An enhancer near the sNPF gene (Fig. 4a (column 1)) is predicted to be accessible in γ-KCs and α/β-KCs (Fig. 4a (column 2)); contains candidate Ey-, Mef2-, Onecut- and Sr-binding sites (Fig. 4a (column 3)); and has specific GFP reporter activity in KCs (Fig. 4a (column 4)). In silico mutagenesis identified nucleotides with a high impact on accessibility in the Mef2- or Ey-binding site (Fig. 4a (column 3)), and mutating these nucleotides abolished activity (Fig. 4a (column 4)). A second enhancer near Eip93F is active in T4 neurons, with binding sites for Fkh, Acj6 and TfAP-2 (Fig. 4b). Mutating either the Fkh- or the TfAP-2-binding sites abolishes enhancer activity, confirming their predicted activator roles. Similar analyses were performed on enhancers near Bx, gish, Pkc53E, Appl and CG15117, highlighting TF activator binding sites; GFP reporter activity is lost after mutation of these sites (Extended Data Fig. 9a–e). In the enhancers near sNPF, Bx and Appl, the model predicted that changes in Mamo sites would increase enhancer activity (Extended Data Fig. 9c). Indeed, when these sites were mutated, the enhancer activity was not lost, but expanded to additional KCs (one-sided Mann-Whitney U-test, combined P = 0.031) (Fig. 4c and Extended Data Fig. 9f, g).

**Fig. 4: DL analysis unravels enhancer make-up.**

We scored every enhancer based on its KC, OL, glia or CB activity (Methods), showing examples of positive KC enhancers in Fig. 4d. Receiver operating characteristic (ROC) curves show that accessibility alone is able to distinguish positive from negative enhancers with an accuracy (area under the ROC curve (AUC)) of 0.89 for KC, 0.87 for OL and 0.79 for glia (Fig. 4e and Extended Data Fig. 8h, i). Taking KC activator motif content into account, the accuracy increases to an AUC of 0.935 (Fig. 4e), and 90.5% (19/21) of enhancers that contain at least two activator motifs are active in KCs (Supplementary Table 8).
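
The AUC reported here can be read as the probability that a randomly chosen active enhancer receives a higher score than a randomly chosen inactive one. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions for illustration and do not correspond to the enhancers tested in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: predicted values (for example, accessibility); labels: truthy for
    active enhancers. Ties between a positive and a negative count as half.
    """
    positives = [s for s, l in zip(scores, labels) if l]
    negatives = [s for s, l in zip(scores, labels) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores): two active and two inactive enhancers.
auc = roc_auc([0.9, 0.8, 0.85, 0.1], [1, 1, 0, 0])
```

Three of the four positive-negative pairs are ranked correctly in this toy case, giving an AUC of 0.75.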

Building a resource of eGRNs

Current descriptions of GRNs have mostly focused on co-expression²¹, but the availability of transcriptome and chromatin accessibility profiles of cell types enables their regulatory code to be scrutinized. In particular, we aimed to map cell-type specific eGRNs, including key TFs, as well as their enhancers and target genes (Fig. 5a).

**Fig. 5: eGRNs identify cell-type-specific activators and repressors.**

To link cistrome regions to target genes, we calculated a co-variability score of gene expression and region accessibility for a window of 100 kb around each gene (50 kb upstream and downstream, plus introns³⁹), leading to an average of 6 positively linked regions per gene (Methods and Supplementary Discussion). Enhancer–gene links within BEAF-32 domains (average size, 57.7 kb) have higher correlation scores and a lower proportion of negative links, so links crossing these domains (45%) were pruned (Extended Data Fig. 10a–f). The strength of the region–gene links correlates with enhancer activity (Extended Data Fig. 10g), and intronic and distal intergenic regions correlated better with gene expression compared with promoter/TSS accessibility, confirming previous observations⁴⁰ (Extended Data Fig. 10h, i). The target genes were pruned using gene set enrichment analysis, retaining those of which the expression co-varies with the cistrome TF (Methods).
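
The search space for region–gene links can be sketched as a simple interval query: extend the gene by 50 kb on each side and keep every region that overlaps the resulting window (introns fall inside the gene body and are therefore included). This is a minimal sketch under stated assumptions: coordinates are treated as half-open intervals, any overlap counts as a hit, and the function names and toy coordinates are hypothetical. The co-variability score itself is not implemented.

```python
def search_window(gene_start, gene_end, flank=50_000):
    """Gene body plus 50 kb upstream and downstream, clipped at position 0."""
    return max(0, gene_start - flank), gene_end + flank


def regions_in_window(regions, gene_start, gene_end, flank=50_000):
    """Return regions (start, end) that overlap the search window of a gene."""
    lo, hi = search_window(gene_start, gene_end, flank)
    return [(start, end) for start, end in regions if end > lo and start < hi]


# Toy example (hypothetical coordinates on one chromosome).
candidate_regions = [
    (40_000, 40_500),    # too far upstream
    (60_000, 60_500),    # upstream flank
    (110_000, 110_400),  # intronic
    (169_900, 170_400),  # overlaps the downstream edge
    (170_000, 170_500),  # just outside the window
]
linked = regions_in_window(candidate_regions, gene_start=100_000, gene_end=120_000)
```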

This procedure resulted in 171 cistromes forming eGRNs for 45 cell types (Extended Data Figs. 10j, k, 11a–d and Supplementary Table 9), including 87 activator TFs, with 4,995 enhancers linked to 2,025 genes, covering 17% of the adult DARs (13% promoters, 43% intronic, 44% distal), and 39% of the variable genes in the brain. In particular, cell-type eGRNs have on average 5 activator TFs (range, 1–15) that regulate 67 target genes through 81 enhancers. Indeed, 62% of the genes are regulated by multiple regions within the same cell type and 93% of enhancers have multiple TF inputs. The overlap of predicted binding sites for a TF is only high between similar cell types, suggesting a dependency on the presence of co-factors (Extended Data Fig. 11e). The network for γ-KCs (Fig. 5a) reveals that 2/3 of genes are co-regulated by at least two TFs, with the auto-regulatory factors Mef2 and Ey/Toy regulating 97 to 122 genes, alongside Onecut and Sr regulating an average of 38 genes.

Next, we looked into different modes of repression. The first type of repressor represses target genes by reducing chromatin accessibility (Fig. 5b). Mamo is involved in regulating the typical two-out-of-three pattern in KCs by repressing α/β-KC marker genes in α′/β′- and γ-KCs (Fig. 5c). Similarly, Lola-N represses glial genes in neurons (Extended Data Fig. 11f). In the second type, TFs cause nucleosome displacement and chromatin opening, similar to activator TFs, but would recruit co-repressors to repress target genes. This would be manifested by a negative correlation between accessibility and target gene expression. However, these relationships are less common compared with chromatin repression, and were detected only for TFs that also have positive correlation targets, such as Acj6 (Extended Data Fig. 10k), suggesting more complex mechanisms.

Finally, we investigated the cistrome regions throughout development and found that 45% (14,051) become more accessible at late timepoints, with 28.8% increasing after the ecdysone pulse. We also found 458 regions that undergo an enhancer switch, as their accessibility increases in one cell type while decreasing in another (Extended Data Fig. 12a, b). One of these regions is a T1 enhancer that drives CG15117 expression that is accessible in early glia and switches to T1 at 24 h APF (Fig. 5d). Using a scRNA-seq atlas of OL development⁵, we confirmed that an analogous switch in CG15117 expression occurs, from a developmental glial marker to an adult T1 marker, with a small delay between accessibility and expression changes, as previously observed⁴¹. When co-staining the enhancer GFP with the glial marker Repo, the overlapping signal in development disappears in the adult, coinciding with the closing of the enhancer (Fig. 5e, f). To study this phenomenon at the sequence level, a DL model was trained using developmental topics as input (Extended Data Fig. 12c, d). Inspecting the CG15117 enhancer, the model detects the same TAATTA motif in glia and T1 neurons (Extended Data Fig. 12g–j). Given that only a few TFs overlap, this suggests the binding of different factors to the same motif in different cell types. The reuse of the same enhancer in different cell types at different timepoints can also be noticed by differences in expression between larval and adult brains for 47 out of 54 tested enhancers (Extended Data Fig. 8).
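
One simple way to formalize an enhancer switch is to flag a region whose accessibility rises in at least one cell type while falling in at least one other between the first and last timepoint. The sketch below does exactly that; the function name, the comparison of only the first and last timepoints, the threshold of 0.5 and the toy values are all assumptions made for illustration, and the criteria used in the study may differ.

```python
def find_switch(accessibility, min_change=0.5):
    """Detect an enhancer switch for one region.

    accessibility: dict mapping cell type -> list of accessibility values
    ordered by developmental time. min_change is a hypothetical threshold.
    Returns (is_switch, gaining_cell_types, losing_cell_types).
    """
    changes = {ct: series[-1] - series[0] for ct, series in accessibility.items()}
    gaining = [ct for ct, d in changes.items() if d >= min_change]
    losing = [ct for ct, d in changes.items() if d <= -min_change]
    return bool(gaining and losing), gaining, losing


# Toy example (hypothetical values): a region closing in glia and opening in T1.
region = {
    "glia": [1.0, 0.6, 0.1],
    "T1": [0.0, 0.5, 1.0],
    "KC": [0.1, 0.1, 0.1],
}
is_switch, gaining, losing = find_switch(region)
```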

Discussion

We generated the first single-cell chromatin accessibility atlas of the whole fly brain throughout development, tracing neuronal and glial cell types from birth to maturity. Using an integrated multi-omics approach, we introduced the concept of eGRNs in which TFs are linked to high-confidence binding sites that are linked to target genes. eGRNs can soon be derived for other datasets, given the pioneering work in scATAC-seq in mouse and human^{40,42,43,44,45} and developments in single-cell multi-omics^41,46,47.

Our atlas showed that all cell types in the brain have unique chromatin profiles, often combinatorial, with tens of thousands of accessible regions and hundreds to thousands specific, identifying over 95,000 candidate enhancers, covering 34.4% of the genome. To accurately predict enhancer activity on the basis of the DNA sequence, we integrated DL models with omics data. This ‘smarter’ motif discovery has been shown to reveal motifs that are missed by conventional algorithms, and leads to prediction of TF binding at the base-pair resolution⁴⁸. We validated these annotations and TF roles using CUT&Tag, TaDa and knockdown experiments, confirming the high quality of the cistromes and DL-based annotations. The library of annotated enhancers was then used as a starting point to clean up or design new driver lines for cell-type-specific genetic access. Developmental dynamics open up the possibility of creating spatiotemporal driver lines with enhancers corresponding to different maturation modules.

By linking TF cistromes, enhancer accessibility and target gene expression, we generated 45 eGRNs covering 87 activator TFs of which 90% have lethal mutations, 62% are linked to known brain phenotypes and 64% are linked to human diseases, providing a foundation for follow-up studies. Many enhancers were regulated by multiple TFs, as highlighted by Ey and Mef2 in KCs. Furthermore, mutating either binding site led to the abolishment of GFP activity in in vivo assays, suggesting cooperativity. The switching enhancers described here present a considerable fraction of enhancers (~500) encoding multilayered motif architectures, reminiscent of the phenotypic convergence phenomenon of different TFs in different cell types regulating the same enhancer³.

Our regulatory atlas of the brain covers cell types, TFs and enhancers, together with their joint representation as eGRNs and transitions through development. To be of further value to the community, we have made all data publicly available online (http://flybrain.aertslab.org; Extended Data Fig. 11g), enabling users to explore eGRNs with links to SCOPE (http://scope.aertslab.org/#/Fly_Brain/) and UCSC (http://ucsctracks.aertslab.org/papers/FlyBrain/hub.txt). Finally, our adult DL model DeepFlyBrain is available at Kipoi (http://kipoi.org/models/DeepFlyBrain).

Methods

Data reporting

No statistical methods were used to predetermine sample size. Animals that fit the age criteria were selected randomly and the investigators were not blinded to allocation during sequencing experiments and outcome assessment. Cloned enhancers and mutations were blinded using enhancer IDs.

Statistics and reproducibility

At least two technical replicates were performed per timepoint and condition for 10x Chromium (exact details are provided in Supplementary Table 1), aiming for 20,000 cells per timepoint. FACS experiments (cell types, knockdowns) were replicated once, except for OPNs for which two technical replicates were performed. For TaDa, two replicates were performed for controls and Mef2, and one for Acj6. For CUT&Tag analysis, one experiment was performed for Repo, and five for Ey (four of which led to Ey motif enrichment). Ten brains were visualized for cloned enhancers per condition, and representative images were chosen. Statistics were calculated using SciPy⁴⁹ unless mentioned otherwise. Seaborn was used for visualization.

Genetics

Flies were raised on a yeast-based medium at 25 °C under a 12 h–12 h day–night light cycle. All RNAi experiments were performed at 29 °C. All Drosophila lines that were used in the scATAC-seq experiments were derived from the DGRP collection. One hybrid was created by crossing different DGRP lines, generating genetic diversity. A list of all of the fly lines that were used is provided in Supplementary Table 10.

10x Genomics scATAC-seq

Sample preparation

The experiments were carried out with four different WT polymorphic strains³⁶, enabling a higher number of nuclei per run while detecting and removing doublets (Supplementary Table 1). Furthermore, to enrich for CB cell types that are often hard to detect, we performed two additional runs on adult brains without the OLs, as these contain more than two thirds of all brain cells but have a lower diversity than the CB.

Drosophila melanogaster brains were dissected at nine timepoints (third instar wandering larvae, 0 h, 3 h, 6 h, 12 h, 24 h, 48 h and 72 h after puparium formation and adult) of both males and females and transferred to a tube containing 100 µl ice cold DPBS solution. After centrifugation at 800g for 5 min, the supernatant was replaced with 500 µl nuclei lysis buffer comprising 10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl₂, 0.1% Tween-20, 0.1% Nonidet P40, 0.01% Digitonin and 1% BSA, in nuclease-free water. The following procedure was followed to extract the nuclei from the brain tissue: incubation in nuclei lysis buffer on ice for 5 min, transfer to a dounce tissue grinder tube (Merck), 25 strokes with pestle A, incubation on ice for 10 min, 25 strokes with pestle B. The lysis was stopped by adding 1 ml of wash buffer composed of 10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl₂, 0.1% Tween-20 and 1% BSA in nuclease-free water. Nuclei were pelleted by centrifugation at 800g for 5 min at 4 °C and resuspended in a 1× nuclei buffer (10x Genomics). Nuclei suspensions were passed through a 40 µm Flowmi filter (VWR Bel-Art SP Scienceware). Nuclei concentration was assessed using the LUNA-FL Dual Fluorescence Cell Counter.

Library preparation

Single-cell libraries were generated using the GemCode Single-Cell Instrument and Single Cell ATAC Library & Gel Bead Kit v1 and ChIP Kit (10x Genomics). In brief, fly brain single nuclei were suspended in 1× nuclei buffer. The single nuclei were incubated for 60 min at 37 °C with a transposase that fragments the DNA in open regions of the chromatin and adds adapter sequences to the ends of the DNA fragments. After generation of nanolitre-scale gel bead-in-emulsions (GEMs), GEMs were incubated in a C1000 Touch Thermal Cycler (Bio-Rad) under the following programme: 72 °C for 5 min; 98 °C for 30 s; 12 cycles of 98 °C for 10 s, 59 °C for 30 s, 72 °C for 1 min; and held at 15 °C. After incubation, single-cell droplets were broken and the single-strand DNA was isolated and cleaned using Cleanup Mix containing Silane Dynabeads. Illumina P7 sequence and a sample index were added to the single-strand DNA during library construction via PCR: 98 °C for 45 s; 11–13 cycles of 98 °C for 20 s, 67 °C for 30 s, 72 °C for 20 s; 72 °C for 1 min; and hold at 4 °C. The sequencing-ready library was cleaned up with SPRIselect beads.

Sequencing

Before sequencing, the fragment size of every library was analysed using the Bioanalyzer high-sensitivity chip. All 10x scATAC libraries were sequenced on NextSeq500 and NovaSeq6000 instruments (Illumina) with the following sequencing parameters: 50 bp read 1, 8 bp index 1 (i7), 16 bp index 2 (i5) and 49 bp read 2.

10x data processing

The 10x fly brain samples were each processed (alignment, barcode assignment and UMI counting) with the CellRangerATAC v.1.2.0 count pipeline. The Cell Ranger reference index was built on the third 2017 FlyBase release (D. melanogaster r6.16)⁵⁰. Sequencing saturations were calculated on the basis of Michaelis–Menten kinetics. Early pupal timepoints were also sequenced, and CellRanger aggr was used to aggregate the sequencing results.
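
A saturation estimate based on Michaelis–Menten kinetics can be sketched by fitting y = Vmax · x / (Km + x), where x is the number of sequenced reads and y the number of unique fragments, and reporting the fraction of Vmax reached at the current depth. The text does not describe the fitting procedure, so everything below is an assumption for illustration: the function names, the Lineweaver–Burk linearization (a simple stand-in for a non-linear fit) and the toy data points.

```python
def fit_michaelis_menten(reads, unique_fragments):
    """Fit y = v_max * x / (k_m + x) by linear regression of 1/y on 1/x.

    Lineweaver-Burk form: 1/y = (k_m / v_max) * (1/x) + 1/v_max.
    """
    inv_x = [1.0 / x for x in reads]
    inv_y = [1.0 / y for y in unique_fragments]
    n = len(inv_x)
    mean_x, mean_y = sum(inv_x) / n, sum(inv_y) / n
    sxx = sum((u - mean_x) ** 2 for u in inv_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(inv_x, inv_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    v_max = 1.0 / intercept
    k_m = slope * v_max
    return v_max, k_m


def saturation(reads, k_m):
    """Fraction of the estimated maximum complexity reached at this depth."""
    return reads / (k_m + reads)


# Toy downsampling curve (hypothetical), generated with v_max=1000 and k_m=500.
depths = [100.0, 500.0, 1000.0, 5000.0]
observed = [1000.0 * x / (500.0 + x) for x in depths]
v_max, k_m = fit_michaelis_menten(depths, observed)
current_saturation = saturation(5000.0, k_m)
```

Because the linearized fit weights low-depth points heavily, a non-linear least-squares fit would usually be preferred on noisy data.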

Demuxlet

We used Demuxlet⁵¹ to demultiplex the different genotypes that were used in the DGRP-mixed samples, enabling us to remove doublets of two different genetic backgrounds. The vcf file of the DGRP project (available at http://dgrp2.gnets.ncsu.edu/) was lifted over to dm6 genome and SNPs for DGRP-409 and DGRP-502 were extracted. For DGRP-639 and the DGRP-551-based hybrid, we performed bulk ATAC-seq to generate updated SNP profiles. After combining all SNPs, we retained only SNPs that were unique for one line. This vcf file was then used in Demuxlet with the default parameters leading to the identification and removal of 43,489 doublets (Supplementary Table 1).
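
The step of retaining only SNPs that are unique to one line can be sketched with a simple count over the per-line SNP sets. The function name and the toy SNP coordinates are hypothetical; real input would come from the lifted-over vcf file and the bulk ATAC-seq SNP calls described above.

```python
from collections import Counter


def line_specific_snps(snps_by_line):
    """Keep, for every line, only the SNPs that occur in no other line.

    snps_by_line: dict mapping line name -> iterable of SNP identifiers,
    here (chromosome, position, alternative allele) tuples.
    """
    counts = Counter(snp for snps in snps_by_line.values() for snp in set(snps))
    return {
        line: {snp for snp in snps if counts[snp] == 1}
        for line, snps in snps_by_line.items()
    }


# Toy example (hypothetical SNPs).
snps = {
    "DGRP-409": [("2L", 1000, "A"), ("2L", 2000, "T")],
    "DGRP-502": [("2L", 2000, "T"), ("3R", 500, "G")],
    "DGRP-639": [("X", 42, "C")],
}
unique = line_specific_snps(snps)
```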

scATAC topic modelling and clustering

After removing doublets, we applied additional quality control filters to select the 240,919 cells used in subsequent analyses (Signac’s nucleosome_signal ≤ 10, global blacklist_ratio < 0.05 and non-outlier blacklist ratio within its own run, number of fragments between 100 and 50,000).
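
The fixed thresholds listed above can be written as a single predicate per cell. This sketch assumes that the per-cell metrics have already been computed and stored under the hypothetical keys shown, and that the fragment range is inclusive. The per-run outlier filter on the blacklist ratio is not implemented, because its definition is not given in this section.

```python
def passes_qc(cell):
    """Apply the fixed QC thresholds to one cell's precomputed metrics.

    cell: dict with hypothetical keys 'nucleosome_signal', 'blacklist_ratio'
    and 'n_fragments'.
    """
    return (
        cell["nucleosome_signal"] <= 10
        and cell["blacklist_ratio"] < 0.05
        and 100 <= cell["n_fragments"] <= 50_000
    )


# Toy cells (hypothetical values).
cells = [
    {"id": "c1", "nucleosome_signal": 4.0, "blacklist_ratio": 0.01, "n_fragments": 2_500},
    {"id": "c2", "nucleosome_signal": 12.0, "blacklist_ratio": 0.01, "n_fragments": 2_500},
    {"id": "c3", "nucleosome_signal": 4.0, "blacklist_ratio": 0.05, "n_fragments": 2_500},
    {"id": "c4", "nucleosome_signal": 4.0, "blacklist_ratio": 0.01, "n_fragments": 60_000},
]
kept = [c["id"] for c in cells if passes_qc(c)]
```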

During the initial steps of analyses of the dataset, we tested different clustering algorithms (including cisTopic, Seurat, ArchR and snapATAC). Although most of the methods identified the main cell types, we finally chose cisTopic as our primary clustering method as it provided a slightly better resolution for some of the clusters and subanalyses. Moreover, cisTopic provides a fuzzy clustering in the form of ‘topics’, which is useful for downstream analyses (for example, each region can belong to one or several topics, and each topic can be accessible in specific or multiple cell types).

To run cisTopic²³, we created the cell-counts matrix using 129,109 predefined regulatory regions (ctx) based on conservation⁵² (that is, counted fragments within these regions). Given the large size of our dataset, we implemented WarpLDA⁵³ within the cisTopic package as a faster and more efficient alternative to Collapsed Gibbs Sampling (CGS). WarpLDA uses the delayed update approach, meaning that topic–region and cell–topic distributions are updated after a number of assignments rather than after each assignment, reducing the number of calculations and memory access. This new faster algorithm is available in cisTopic version 3 (https://github.com/aertslab/cisTopic).

We performed topic modelling on the whole matrix (with between 2 and 500 topics, with 500 iterations and finally selecting the model with 500 topics). This analysis was used to obtain an overview of the whole dataset, and to perform the analyses across development. However, we noticed that we obtained slightly better region accessibility predictions, as well as higher clustering resolution, when analysing subsets of the dataset (for example, the T4/T5 split is not detected in this global analysis, the TfAP-2 enhancers are not predicted as differential). We therefore used independent cisTopic runs to perform the analysis of the adult cell types (including adult and 72 h APF, using 200 topics), and developmental stages (larva and 0–12 h APF, 200 topics). This split of stages was chosen on the basis of their similarity (Extended Data Fig. 1c, d).

Adult cell clusters were defined on the basis of a two-level analysis. (1) First, on the adult + 72 h APF cisTopic run, we clustered the adult cells with more than 900 fragments in peaks (FIP) using Louvain clustering on the cell–topic probability matrix (igraph::cluster_louvain, parameters: k = 10, eps = 0.1, treetype="bd"). This led to 55 clusters, including the main cell types identified in scRNA-seq. Note that we chose this strategy on the basis of several alternative analyses, in which we observed that cisTopic benefits from higher numbers of cells, even if some of them have few reads, whereas clustering only the high-count cells (FIP > 900) provided clusters that were more stable and more concordant with the scRNA-seq. (2) The same process was then applied to each of the major groups of cells (OL, CB and glia) using separate cisTopic runs and consensus peaks instead of the ctx predefined regions (see the section below). These subclustering analyses provided 130 clusters, which might be over-clustered, as many of them were not matched to scRNA-seq clusters, but they enabled us to identify some extra cell types (for example, the ab-cd split of T4/T5 cells; Extended Data Fig. 3f, g). From these analyses, after the scRNA-seq label transfer (see below), we finally selected 79 clusters as the main annotation.

The clusters for the developmental stages were determined following the equivalent approach on the larva to 12 h APF cisTopic analysis (in this case, only one level of clustering was required).

Gene accessibility matrix

Gene accessibilities were calculated using the cisTopic probabilities of region accessibility per cell. Next, ctx regions inside the gene body and up to 5 kb of its transcription start site (TSS) were selected. An exponentially decaying function was used to assign distance weights (w_d) to these regions, whereby regions further away from the gene have lower weights (d as distance in bp from the TSS), similar to ArchR⁵⁴. To give higher weights to variable regions, we calculated Gini scores per region, where highly variable regions have a high Gini score. Gini scores were then z-standardized and used as an exponent for the variability weight (w_v). Final weights were defined as the product of the distance and variability weights, and a weighted sum was calculated to acquire a gene accessibility matrix.

$${w}_{d}={e}^{\frac{-d}{5000}}+{e}^{-1}$$

$${w}_{v}={e}^{{Z}_{{\rm{GINI}}}}$$
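The weighting scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the code used for the analysis: the function names are hypothetical, the Gini scores are z-standardized only across the regions passed in (using the population standard deviation), and the input is a plain list of per-region accessibility probabilities.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly uniform)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted data: sum_i (2i - n - 1) * x_i / (n * sum(x)).
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs)) / (n * total)


def distance_weight(d_bp):
    """w_d = exp(-d / 5000) + exp(-1), with d the distance to the TSS in bp."""
    return math.exp(-abs(d_bp) / 5000) + math.exp(-1)


def zscores(xs):
    """z-standardize a list (population standard deviation; an assumption)."""
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / sd if sd > 0 else 0.0 for x in xs]


def gene_accessibility(region_probs, distances):
    """Weighted sum over a gene's regions, returning one value per cell.

    region_probs: list of regions, each a list of per-cell probabilities.
    distances: distance of each region to the gene's TSS, in bp.
    """
    z = zscores([gini(r) for r in region_probs])
    # Final weight = distance weight * variability weight (w_v = exp(Z_GINI)).
    weights = [distance_weight(d) * math.exp(zi) for d, zi in zip(distances, z)]
    n_cells = len(region_probs[0])
    return [
        sum(w * r[c] for w, r in zip(weights, region_probs))
        for c in range(n_cells)
    ]
```

With these definitions, a region at the TSS receives a distance weight of 1 + e⁻¹, a region 5 kb away receives 2e⁻¹, and a region with an above-average Gini score is up-weighted by a factor greater than 1.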

scRNA-seq clustering

We used the scRNA-seq data from the whole ageing fly brain described previously², this time using all data from all protocols with updated analysis methods (mostly batch effect correction). Mapping, filtering, normalization, batch effect correction, clustering, marker gene detection and gene regulatory network inference were performed using the VSN pipeline⁵⁵ (https://github.com/vib-singlecell-nf/vsn-pipelines), which is a Nextflow DSL2 pipeline using CellRanger (10x), Scanpy⁵⁶, Harmony⁵⁷ and pySCENIC⁵⁸. We used the command nextflow -C nextflow.config run vib-singlecell-nf/vsn-pipelines -entry harmony; a description of the nextflow.config file is provided in Supplementary Data 2. Finally, annotations were transferred from the published² dataset by calculating the adjusted Rand index between annotations and the different calculated clusterings. The best-matching clustering was Leiden resolution 10 (224 clusters) and the clusters were annotated if at least 25% of cells in the cluster had the same annotation. If there was no match, the cluster was retained, ending up with 203 clusters. One modification was made to cluster 15, for which a higher resolution (Leiden 12) was chosen in which it split in two (a and b), matching the split detected in the RNA-ATAC co-embedding (see below), leading to the final annotation of 66 clusters out of 204. Subsequently, marker genes were calculated with Seurat’s FindAllMarkers using the Wilcoxon method (min.pct = 0.1 and logfc.threshold = 0.2).
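The 25% rule for transferring annotations to clusters can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, cells without a published annotation are represented by None, and the fraction is computed over all cells in the cluster.

```python
from collections import Counter


def annotate_clusters(cluster_ids, prior_labels, min_fraction=0.25):
    """Give each cluster its most frequent prior label if it covers enough cells.

    cluster_ids: cluster assignment per cell.
    prior_labels: published annotation per cell, or None when unannotated.
    Returns a dict cluster -> label (None when no label reaches min_fraction).
    """
    per_cluster = {}
    for cluster, label in zip(cluster_ids, prior_labels):
        per_cluster.setdefault(cluster, []).append(label)

    annotation = {}
    for cluster, labels in per_cluster.items():
        counts = Counter(lab for lab in labels if lab is not None)
        if counts:
            label, n = counts.most_common(1)[0]
            # Fraction over all cells of the cluster (assumption of this sketch).
            annotation[cluster] = label if n / len(labels) >= min_fraction else None
        else:
            annotation[cluster] = None
    return annotation


# Toy example: cluster 0 reaches the threshold (1/4), cluster 1 does not (1/5).
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1]
labels = ["T4", None, None, None, "Mi1", None, None, None, None]
result = annotate_clusters(clusters, labels)
```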

Label transfer using NNLS, AUCell and Seurat

To assign cell-type identities to adult scATAC-seq clusters (130) we followed three approaches:

First, we used the NNLS method to compare clusters across modalities⁵⁹, similar to what was previously described⁴³. We calculated average RNA expression profiles per cell type from the annotated scRNA-seq data and averaged gene accessibility profiles for the scATAC-seq clusters using the top 10 marker genes per cell type as features (sorted by Bonferroni-corrected P value). These were then used as input for the algorithm in which an optimal weighted sum is calculated and the weights resemble cluster similarities.

Second, we used AUCell²¹ to score gene signatures per cell type on the basis of the top marker genes on the gene accessibility matrix. Gene signatures were then averaged per cluster and clusters were assigned to cell types on the basis of their score.

Third, Seurat (v.3)⁶⁰ was used to integrate the gene accessibility and gene expression data. Separate objects were first created for the scATAC-seq and scRNA-seq data, with the gene accessibility matrix used as the ‘RNA’ assay and the region accessibility as the ‘peaks’ assay in the ATAC-seq object. The dimensions of the ATAC-seq object were then reduced using RunTFIDF, FindTopFeatures and RunSVD using latent semantic analysis (LSI) on the peaks assay; the number of components used was 50, 70 and 100. Next, the RNA-seq object was log-normalized using NormalizeData with the median of expressed UMIs as the scale factor. FindVariableFeatures was used to find 2,500 variable features in the RNA-seq data to be used as features for integration. Anchors for integration were identified using FindTransferAnchors with the RNA-seq data as reference and the ATAC-seq object as query using canonical correlation analysis with 50, 70 and 100 components. TransferData was then used to transfer annotations from scRNA-seq to scATAC-seq using the LSI weights for weight reduction and dimensions ranging from 2 to 100. To calculate a co-embedding, we used GetAssayData on the variable genes to get the RNA counts and used this as reference data in a second run of TransferData, whereby we imputed RNA counts for the ATAC-seq data, again using the LSI weights for weight reduction. The two objects were subsequently merged, followed by scaling of the data, principal component analysis (PCA) and UMAP and t-SNE calculation (ScaleData, RunPCA, RunUMAP, RunTSNE). In particular, for the adult cells, we chose the t-SNE as the primary representation for the figures because it looked less cluttered, which enabled better visualization of the intracluster heterogeneity.

We next collapsed annotations across all methods and merged non-annotated low-confidence clusters. Tm1/TmY8 and Mi1 matched to the same cluster, but we could separate the subclusters on the basis of gene accessibility scores of bsh and hth, two markers of Mi1 neurons.

DARs

For each of the adult cell clusters (including both clustering resolutions, plus a super-clustering of glia, OL/CB neurons and KCs), we calculated the DARs on the basis of the predictive distribution from cisTopic (using the Wilcoxon rank sum test, run through the FindMarkers function in Seurat with the default settings, except logfc.threshold, which was lowered to 0.20, and max.cells.per.ident, which was adjusted to balance the contrasts in some of the analyses). For each of the clusters, the DARs were calculated versus the closest cluster in the tree, and versus all of the other clusters in each of the two analyses (that is, each cluster was compared with the rest of the brain, and with the other cells in their same glia/OL/CB/KC category). The DARs were used as a starting point for the follow-up analyses for enhancer–gene links and eGRNs. However, note that 14% of DARs are promoter regions and these are included in all following analyses.

Cell-type-specific bams and bigwigs

We extracted cells per cell type per timepoint (details of the SVM are provided below). Next, we subset the bam files from the runs to contain only reads belonging to the selected cells and created a cell-type-specific bam file. Then we used SAMtools⁶¹ to remove duplicates (view -F 0x400) and to remove reads mapping to blacklisted regions⁶². The remaining reads were then used as input for the bamCoverage function from deepTools⁶³ to create a depth-normalized bigwig file (reads per genome coverage, RPGC) with the following parameters: -bs 1 -p 8 --normalizeUsing RPGC --effectiveGenomeSize 142573017.

Consensus peaks

MACS2 (ref. ⁶⁴) was used to call peaks on cell-type-specific bam files using the callpeak function with the following parameters: macs2 callpeak -q 0.05 -g dm --keep-dup all --nolambda --call-summits --nomodel --shift -75 --extsize 150. This was repeated for all of the timepoints and for the grouped analyses (Adult + P72, L3–P12, P24, P48). Next, the summits were extended to 500 bp (or 150 bp) using slopBed from BEDTools (-l 149, -r 150 (or -l 74, -r 75)). The extended summits were then merged according to ENCODE standards, with first a normalization of the summit score (CPM) followed by iterative peak merging until non-overlapping peaks across all timepoints and cell types were retained. This led to a final number of 95,921 (500 bp) and 207,325 (150 bp) peaks.
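The summit extension and iterative merging can be sketched as follows. This is one common reading of the procedure, not necessarily the exact implementation used: scores are assumed to be already normalized, peaks are processed from the highest to the lowest score, and any peak overlapping an already retained peak is discarded. Function names and coordinates are hypothetical.

```python
def extend_summit(chrom, summit, left, right):
    """Mimic slopBed on a 1-bp summit interval [summit, summit + 1)."""
    return chrom, max(0, summit - left), summit + 1 + right


def iterative_merge(peaks):
    """Greedy overlap removal: keep the best-scoring peak, drop peaks overlapping it.

    peaks: list of (chrom, start, end, score) with half-open coordinates.
    Returns the retained, mutually non-overlapping peaks sorted by position.
    """
    kept = []
    for peak in sorted(peaks, key=lambda p: p[3], reverse=True):
        chrom, start, end, _ = peak
        overlaps = any(
            k[0] == chrom and start < k[2] and k[1] < end for k in kept
        )
        if not overlaps:
            kept.append(peak)
    return sorted(kept, key=lambda p: (p[0], p[1]))


# Toy peaks with made-up coordinates and scores.
toy_peaks = [
    ("2L", 100, 250, 5.0),
    ("2L", 200, 350, 9.0),
    ("2L", 340, 490, 3.0),
    ("3R", 100, 250, 1.0),
]
merged = iterative_merge(toy_peaks)
```

In the toy example, the peak with score 9.0 is retained and both of its overlapping neighbours are removed, whereas the peak on 3R is unaffected.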

For the adult cell types, we also ran a more stringent peak calling (-q 0.01), which provided 60,210 disjoint peaks of an average width of 455 bp, and covered 19% of the genome (provided in the UCSC session).

ArchR clustering

ArchR⁵⁴ was run on the fragments files from CellRanger using the createArrowFiles function with minTSS set to 4 and minFrags to 1000. A first shallow clustering was then performed on the tile matrix using the binned genome. These clusters would next be used to acquire consensus peaks to use in a final clustering. Thus, first an iterativeLSI was calculated on the tilematrix with 30 dimensions. Clusters were derived using the Seurat FindClusters implementation of the Louvain algorithm with resolution 4, leading to 25 clusters. The addGroupCoverages function was then used to calculate the coverage of every cluster, which was used as an input for the addReproduciblePeakSet to obtain consensus peaks. The fragment files were then quantified over the consensus peaks to acquire the final peak matrix. This matrix was then used as input for iterativeLSI, this time using 130 dimensions and 20,000 features, sampling 10,000 cells at a resolution of 2. Harmony was then used on the 130 LSI dimensions to correct for batch effects (variable = sample). The corrected Harmony features were then used to create the final UMAP embedding and clustering in Seurat (Louvain, resolution 4) to find 90 clusters.

Omni-ATAC-seq analysis of FACS-sorted samples

FACS

One-hundred GFP-expressing (MB371B, MB418B, MB419B crossed with UAS-nls.GFP⁶⁵) and 15 GFP-negative (WT) fly brains were dissected in PBS on ice. The brains were then centrifuged at 800g for 5 min, after which the supernatant was replaced with 50 μl of dispase (3 mg ml⁻¹, Sigma-Aldrich, D4818, 2 mg), 75 μl collagenase I (100 mg ml⁻¹, Invitrogen, 17100-017) and 125 μl trypsin-EDTA (0.05%, Invitrogen 25300054). Brains were dissociated in a Thermoshaker (Grant Bio PCMT) for 15 min at 25 °C at 1,000 rpm and the solution was mixed by pipetting every 5 min. Next, cell suspensions were passed through a 10 μm pluriStrainer (ImTec Diagnostics 435001050) and viability was assessed by the LUNA-FL Dual Fluorescence Cell Counter. Next, four aliquots were made containing GFP-negative brain cells with/without PI (10%) and GFP-positive brain cells with/without PI (10%). FACS was performed on the FACS Aria III (BD Biosciences). The GFP-negative brains were used to set the gates on the machine for cell size and viability (PI), the GFP-positive brains for the GFP fluorescence, after which the GFP-positive cells with PI were sorted (Supplementary Data 1). Between 2,600 and 11,000 GFP-positive cells were sorted and 50,000 cells per negative control. For RNAi experiments, a similar procedure was performed with GFP-labelled T4 cells and mCherry-labelled KCs. After sorting, regular omni-ATAC-seq was performed as described by Corces et al.⁶⁶.

Analysis of WT experiments

Bulk ATAC-seq was performed on five samples (three GFP-positive cells from driver lines each targeting one subtype of KCs (MB371B, MB418B and MB419B) and two negative controls (GFP-negative cells from MB371B and MB419B)). ATAC-seq reads were trimmed using fastq-mcf⁶⁷ and a list of sequencing primers. The cleaned reads were then used as input for fastqc for quality control. Next, the reads were mapped to the third 2017 FlyBase release (D. melanogaster r6.16) genome using STAR, and SAMtools was used to sort the bam file. Macs2 was then used to call differential peaks between the positive samples and their negative controls using macs2 callpeak -t pos_sample -c neg_sample -g dm --nomodel; both negative controls were used for the positive sample without its own control.

Analysis of knockdown experiments

Bulk ATAC-seq was performed on one RNAi knockdown and one WT control sample. ATAC-seq reads were trimmed using fastq-mcf⁶⁷ and a list of sequencing primers. The cleaned reads were then used as input for fastqc for quality control. Next, the reads were mapped to the third 2017 FlyBase release (D. melanogaster r6.16) genome using Bowtie2, and SAMtools was used to sort the bam file. Bam files were deduplicated and reads in blacklisted regions were removed. Macs2 was then used to call differential peaks between the knockdown and the control sample using macs2 callpeak -t pos_sample -c neg_sample -g dm --nomodel. The .narrowPeak output file was then used in i-cisTarget for motif enrichment⁶⁸ and for overlap with predicted DL binding sites and cistrome regions using pybedtools. Significant overlaps were determined using a hypergeometric test from Scipy⁴⁹. Bam files were converted to bigwig using DeepTools with RPGC normalization and coverage plots were drawn using pyBigwig in Python.
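The overlap test can be illustrated with a dependency-free sketch of the hypergeometric tail probability. The choice of universe (the total number of candidate regions) is not specified in the text and is an assumption here, as are the function and argument names; the value returned corresponds to the survival function of scipy.stats.hypergeom evaluated at the observed overlap minus one.

```python
from math import comb


def hypergeom_enrichment(universe, set_a, set_b, overlap):
    """P(X >= overlap) when set_b regions are drawn from a universe with set_a hits.

    universe: total number of candidate regions (assumed background).
    set_a: number of regions in the first set (for example, differential peaks).
    set_b: number of regions in the second set (for example, cistrome regions).
    overlap: observed number of regions shared by both sets.
    """
    upper = min(set_a, set_b)
    total = comb(universe, set_b)
    return sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    ) / total


# Toy numbers: 10 regions in total, sets of 4 and 3 regions sharing 2.
p_value = hypergeom_enrichment(10, 4, 3, 2)
```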

CUT&Tag analysis

Library preparation

Nuclei were isolated from dissected adult brains as indicated previously. After centrifugation, the supernatant was removed and isolated nuclei were resuspended in nuclear extraction (NE) buffer (EpiCypher CUT&Tag Protocol v.1.5). Nuclei concentration was assessed using the LUNA-FL Dual Fluorescence Cell Counter. 100,000 nuclei in 100 µl NE buffer were used for each CUT&Tag reaction. All of the subsequent steps were followed according to the EpiCypher CUT&Tag Protocol v.1.5. For each reaction, 11 µl of BioMag Plus Concanavalin A beads (ConA beads, Gentaur, 86057-3) were washed twice with 100 µl cold bead activation buffer and then resuspended in 11 µl of cold bead activation buffer. To bind nuclei to the activated ConA beads, 100 µl of nuclei and 10 µl of activated ConA beads were incubated for 10 min at room temperature. After supernatant removal, nuclei–bead slurry was resuspended in 50 µl Antibody150 buffer. Primary rabbit anti-GFP antibodies (0.5 µg) (Abcam, ab290) were added per sample and incubated overnight at 4 °C. The beads were cleared and resuspended in 50 µl cold Digitonin150 buffer. Anti-rabbit secondary antibodies (0.5 µg; EpiCypher, Cat 13-0047) were added per sample and incubated for 30 min at room temperature. While on a magnet, beads were washed twice with 200 µl cold Digitonin150 buffer. The beads were resuspended in cold Digitonin300 buffer and incubated for 1 h at room temperature with 2.5 µl CUTANA pAG-Tn5 (EpiCypher, Cat 15-1017). To initiate tagmentation reaction, the beads were resuspended in 50 µl cold Tagmentation buffer and incubated for 1 h at 37 °C. Beads were resuspended in 50 µl TAPS buffer and 5 µl SDS Release buffer was added to quench the tagmentation reaction. To release tagmented chromatin fragments into solution, samples were incubated for 1 h at 58 °C. 15 µl SDS quench buffer was added per sample. 
To amplify tagmented chromatin fragments, 2 µl each of individual barcoded P5 and P7 sequencing adapters (10 µM stock) and 25 µl non-hot start CUTANA High Fidelity 2× PCR Master Mix (EpiCypher, Cat 15-1018) were added, and 18 cycles of CUT&Tag-specific PCR parameters were used (58 °C for 5 min, 72 °C for 5 min, 98 °C for 45 s, 98 °C for 15 s, 60 °C for 10 s, 72 °C for 1 min). DNA clean-up was performed using 1.3× AMPure XP beads (Analis, A63880) and DNA was eluted in 15 µl 0.1× TE buffer. The CUT&Tag libraries were analysed using the Bioanalyzer (Agilent High Sensitivity DNA Chip, 5067-4626) and sequenced on the Illumina NextSeq2000 instrument.

Analysis

CUT&Tag reads were processed similarly to ATAC-seq reads from the knockdown experiments. Macs2 was then used to call differential peaks between the targeted TF and the input control sample (IgG) using macs2 callpeak -t pos_sample -c neg_sample -g dm --nomodel. We obtained cleaner results by comparing two different factors against each other, so we contrasted the results for Ey with those of Repo. The .narrowPeak output file from these contrasts was then used in i-cisTarget for motif enrichment⁶⁸ and for overlap with predicted DL binding sites and cistrome regions using pybedtools. Significant overlaps were determined using a hypergeometric test from Scipy⁴⁹. Bam files were converted to bigwig using DeepTools with RPGC normalization and coverage plots were drawn using pyBigwig in Python. For the Ey sample, we combined all of the bam files for working Ey experiments (Ey motif highly enriched) and created a total bam and bigwig file. A differential bigwig file was derived using DeepTools’ bigwigCompare on the Ey and Repo bigwig files.

TaDa analysis

Library preparation

UAS-LT3-NDam-acj6 was previously described¹⁶ and UAS-LT3-NDam-Mef2 was generated by cloning the Mef2 coding sequence into pUAST-LT3-NDam^35,69. A cherry-stop-MEF2-DAM construct (Life Technologies) was subcloned into pUAST-AttB as an EcoRI-XbaI fragment. The resulting plasmid, pUAST-AttB-cherryMEF2DAM, was injected into AttP2 embryos by the Cambridge Fly Facility.

The sequence of the cherry-stop-MEF2-DAM construct is as follows: GAATTCATGGCAACTAGCGGCATGGTTAGTAAAGGAGAAGAAAATAACATGGCAATCATTAAGGAGTTCATGAGATTCAAAGTTCACATGGAAGGTTCTGTAAATGGACATGAATTTGAAATAGAAGGTGAAGGAGAAGGAAGGCCTTATGAAGGAACCCAAACCGCGAAGCTAAAAGTTACTAAGGGTGGCCCATTACCATTTGCATGGGATATCCTTAGCCCTCAATTCATGTATGGGTCAAAGGCTTATGTCAAGCACCCCGCCGACATTCCAGACTATCTAAAGTTATCTTTTCCCGAAGGGTTTAAGTGGGAGCGTGTGATGAACTTCGAAGACGGTGGCGTGGTAACAGTGACTCAGGATTCGTCCCTGCAAGATGGTGAATTTATCTACAAAGTCAAATTAAGAGGAACTAACTTTCCATCTGACGGCCCGGTTATGCAAAAAAAGACAATGGGCTGGGAGGCCTCCTCAGAACGAATGTACCCTGAAGATGGTGCCTTGAAGGGTGAGATTAAACAAAGATTGAAATTGAAAGATGGTGGACATTATGACGCTGAGGTTAAAACGACATACAAAGCTAAGAAACCTGTCCAGCTCCCAGGTGCTTACAATGTAAATATAAAACTTGATATTACATCACATAATGAAGATTATACGATAGTTGAACAATACGAAAGGGCTGAGGGGAGACATAGTACTGGTGGCATGGATGAACTATACAAAGGTTCTGGTACCGCATAATAACATGGGCCGCAAAAAAATTCAAATATCACGCATCACCGATGAACGCAATCGGCAGGTGACCTTCAACAAGCGCAAGTTCGGCGTGATGAAGAAGGCCTACGAGCTGTCCGTGCTCTGCGACTGCGAGATCGCCCTGATCATCTTCTCGTCGAGCAACAAGCTGTACCAGTACGCCAGCACCGACATGGATCGCGTCCTGCTCAAGTACACCGAGTACAACGAGCCCCACGAGTCCCTCACCAACAAGAACATCATCGAGAAGGAGAACAAGAACGGCGTGATGTCGCCGGACTCGCCCGAAGCCGAAACGGACTACACACTCACTCCGCGAACGGAGGCCAAGTACAACAAGATCGACGAGGAGTTCCAGAACATGATGCAGCGCAACCAGATGGCCATCGGCGGTGCGGGTGCCCCTCGCCAGCTTCCAAACAGCAGCTACACGCTGCCCGTTTCTGTTCCGGTGCCGGGATCTTACGGCGACAACCTGCTGCAGGCCAGTCCACAGATGTCCCACACCAACATCAGCCCCCGTCCATCGAGTTCGGAGACGGATTCAGTTTATCCATCGGGTTCCATGCTGGAGATGTCGAACGGCTATCCGCATTCACACTCGCCGCTTGTGGGATCACCGAGTCCGGGTCCCAGTCCTGGCATAGCCCACCATTTGTCCATTAAGCAGCAGTCGCCGGGCAGCCAGAACGGACGAGCTTCCAATCTAAGGGTCGTCATACCGCCCACAATTGCCCCCATACCGCCCAATATGTCAGCGCCGGATGATGTGGGATATGCAGATCAACGACAGAGCCAGACATCGCTTAACACGCCAGTGGTCACGCTGCAGACGCCGATTCCCGCCCTCACGAGCTATTCCTTTGGGGCGCAGGACTTCTCCTCCTCCGGCGTAATGAACAGCGCGGATATCATGAGCCTCAACACCTGGCATCAGGGCCTGGTGCCGCACTCTAGTCTCTCGCACCTGGCTGTCTCGAATAGCACGCCGCCGCCCGCCACCTCCCCCGTCTCCATAAAGGTCAAGGCTGAGCCGCAGTCGCCGCCGAGAGATCTTTCCGCCAGCGGTCATCAGCAGAATAGCAATGGTTCCACGGGCAGCGGCGGATCCAGCAGCAGCACCAGTAGCAACGCCAGCGGAGGAGCAGGAGGCGGTGGAGCCGTCAGCGCAG
CCAATGTCATCACGCACTTGAACAACGTCAGTGTCCTGGCGGGAGGTCCTTCGGGGCAGGGAGGAGGAGGCGGAGGCGGCGGCAGCAACGGAAATGTCGAACAGGCCACCAATCTTAGCGTACTGAGCCACGCGCAGCAACATCACCTGGGCATGCCCAACTCGCGTCCCTCGTCCACGGGCCACATCACACCCACTCCAGGTGCGCCGAGCAGCGACCAGGATGTGCGTCTGGCAGCCGTCGCCGTGCAGCAGCAACAGCAGCAGCCACATCAGCAACAGCAACTAGGCGACTACGATGCCCCCAACCACAAACGGCCGAGAATATCGGGCGGATGGGGCACAGAACAGAAACTCATCTCTGAAGAGGATCTGATGAAGAAAAATCGCGCTTTTTTGAAGTGGGCAGGGGGCAAGTATCCCCTGCTTGATGATATTAAACGGCATTTGCCCAAGGGCGAATGTCTGGTTGAGCCTTTTGTAGGTGCCGGGTCGGTGTTTCTCAACACCGACTTTTCTCGTTACATCCTTGCCGATATCAATAGCGACCTGATCAGTCTCTATAACATTGTGAAGATGCGTACTGATGAGTACGTACAGGCCGCACGCGAGCTGTTTGTTCCCGAAACAAATTGCGCCGAGGTTTACTATCAGTTCCGCGAAGAGTTCAACAAAAGCCAGGATCCGTTCCGTCGGGCGGTACTGTTTTTATATTTGAACCGCTACGGTTACAACGGCCTGTGTCGTTACAATCTGCGCGGTGAGTTTAACGTGCCGTTCGGCCGCTACAAAAAACCCTATTTCCCGGAAGCAGAGTTGTATCACTTCGCTGAAAAAGCGCAGAATGCCTTTTTCTATTGTGAGTCTTACGCCGATAGCATGGCGCGCGCAGATGATGCATCCGTCGTCTATTGCGATCCGCCTTATGCACCGCTGTCTGCGACCGCCAACTTTACGGCGTATCACACAAACAGTTTTACGCTTGAACAACAAGCGCATCTGGCGGAGATCGCCGAAGGTCTGGTTGAGCGCCATATTCCAGTGCTGATCTCCAATCACGATACGATGTTAACGCGTGAGTGGTATCAGCGCGCAAAATTGCATGTCGTCAAAGTTCGACGCAGTATAAGCAGCAACGGCGGCACACGTAAAAAGGTGGACGAACTGCTGGCTTTGTACAAACCAGGAGTCGTTTCACCCGCGAAAAAAGCCGGTTAGTCTAGA.

Parent lines were allowed to lay eggs over a minimum of two days at 25 °C before timed collections were performed to produce the following genotypes: tub-GAL80ts/+;UAS-LT3-NDam/GMR16A06-GAL4 (KCs); tub-GAL80ts/+;UAS-LT3-NDam-Mef2/GMR16A06-GAL4; tub-GAL80ts/+;UAS-LT3-NDam/GMR74G01-GAL4 (T1 and Tm1 neurons); tub-GAL80ts/+;UAS-LT3-NDam-Mef2/GMR74G01-GAL4; tub-GAL80ts/+;UAS-LT3-NDam/atonal-GAL4; and tub-GAL80ts/+;UAS-LT3-NDam-acj6/atonal-GAL4.

Flies were allowed to lay eggs for 2 days at 18 °C in fly food vials. Vials containing those eggs were kept at 18 °C (restrictive temperature) until adult flies eclosed. They were then kept at 18 °C for 3–7 days before being transferred to 29 °C (permissive temperature) for 24 h. For the Dam-repo experiments, adult flies were flash-frozen in dry ice, and stored at −80 °C. A minimum of 50 heads were removed for processing. For all of the other experiments, brains were dissected. For the Dam-Mef2 experiments in KCs, a minimum of 90 brains were dissected per replicate, for the Dam-Mef2 experiments in T1/Tm1 cells, 40 brains were dissected, and for the Dam-acj6 in atonal cells, a minimum of 30 brains were dissected. Two biological replicates were performed for each experiment.

The DamID protocol is as previously described⁷⁰ with the following modifications: after the overnight DpnI digestion, 0.5 µl of DpnI was added for an extra 1 h incubation and MyTaq HS DNA Polymerase was used for the PCR amplification (instead of Advantage 2 cDNA Polymerase).

Analysis

Sequencing data were mapped back to release 6.03 of the Drosophila genome using a previously described pipeline⁷¹. Peaks were called and mapped to genes using a custom Perl program (https://github.com/tonysouthall/Peak_calling_DamID). In brief, a false-discovery rate (FDR) was calculated for the peaks (formed of two or more consecutive GATC fragments) for the individual replicates. Then, each potential peak in the data was assigned an FDR. Any peaks with less than a 0.01% FDR were classified as significant. Significant peaks that were present in all of the replicates were used to form a final peak file. Motif enrichment was then performed using i-cisTarget and direct hits were acquired from the leading-edge regions.

Enhancer assays

Selection of cloned enhancers

The selected regions are accessible in either KCs (24, average size = 621 bp), OL neurons (17, average size = 550 bp), glia (5, average size = 362 bp), or mixed (14, average size = 662 bp), with the size of the cloned regions ranging from 300 to 1,732 bp. We also included negative controls that are either ubiquitously accessible or inaccessible (3, average size = 901 bp). The 53 selected regions for the direct construct differ in their ATAC-seq peak height, specificity, presence of TF binding sites and nearby expressed genes (Supplementary Table 8). By contrast, the 10 GAL4 enhancers were selected on the basis of multiple criteria (KC DL score > 0.35, KC accessibility > 5.6, KC accessibility fold-change > 2.4 and KC gene fold change > 0.3).

Cloning and visualization of enhancers

Selected enhancers were scored for the presence of homopolymers (>10) and GC content and small modifications to the sequence were made if needed. Sequences were synthesized by Twist Biosciences and inserted into the pTwist ENTR vector. Gateway cloning was then used to insert the sequence into the pH-Stinger vector containing nuclear GFP, Hsp70 promoter and gypsy insulators⁷². Next, the plasmids were sent to FlyORF (CH) and divided into six pools that were injected in Drosophila embryos (21F site on chromosome 2L). Positive transformants were selected and PCR was used to determine the identity of the enhancer in each line. This pipeline of pooled injections recovered a transgenic line for 54 of the 59 enhancers. Larval, pupal (15 h and 24 h) and adult flies were then dissected and stained using the immunofluorescence protocol for GFP, brp, repo and DAPI. Enhancers were scored using the following system: no expression (0), low on-target expression (1) and high on-target expression (2). The results are provided in Supplementary Table 8. Tests on the success rate were performed using two-sided Fisher’s exact tests. An additional set of ten enhancers was selected for producing KC Gal4 drivers. Gateway cloning was used to insert the sequences into the pBPGUw plasmid (gift from G. Rubin; Addgene plasmid, 17575, http://n2t.net/addgene:17575m, RRID: Addgene_17575). Plasmids were injected into Drosophila embryos individually. All of the other steps were repeated in the same manner. These Gal4 lines have been deposited in the Bloomington Drosophila Stock Center.

Immunohistochemistry analysis of split-Gal4 lines and larval brains

For immunofluorescence, brains were dissected and transferred to a tube containing 100 μl ice cold DPBS solution. After centrifugation at 800g for 5 min, the supernatant was replaced with 4% formaldehyde in PBT 0.3% (DPBS + 0.3% Triton X-100 (Sigma-Aldrich)) and incubated at room temperature with rotation for 15 min. Brains were washed three times with PBT 0.3%, rotating for 10 min at room temperature each time and then blocked in Pax-DG (10 g BSA (Sigma-Aldrich), 3 g Deoxycholate Acid (Sigma-Aldrich), 3 ml Triton X-100 (Sigma-Aldrich), 50 ml Normal Goat Serum (MP Biomedicals), 100 ml 10× PBS, 850 ml H₂O) for 2 h at room temperature with rotation. Primary antibody mixes were created in Pax-DG (dilutions detailed in Supplementary Table 11) and brains were incubated in these mixes overnight at 4 °C with rotation. The next day, brains were washed three times with PBT 0.3%, rotating for 10 min at room temperature each time and then stained with secondary antibody mixes in Pax-DG (dilutions detailed in Supplementary Table 11) for 2 h at room temperature with rotation. Brains were washed three times with PBT 0.3%, rotating for 10 min at room temperature each time and mounted in Mowiol mounting medium (Sigma-Aldrich). Imaging was performed using the Nikon C2 and Nikon A1 confocal microscopes. A list of the antibodies used is provided in Supplementary Table 11.

Immunohistochemistry analysis of adult brains

Brains were dissected in PBS and transferred to a tube for fixation in 4% formaldehyde in PBS for 20 min at room temperature. Brains were washed in PBS with 0.3% Triton-X (PBST) three times for 20 min each. Next, brains were placed in blocking solution (5% normal goat serum (Sigma-Aldrich) in PBST) overnight at 4 °C. The samples were then incubated in primary antibodies diluted in blocking solution overnight at 4 °C. The following antibodies were used: rabbit anti-GFP (Abcam, 1:1,000). Brains were then washed three times in PBST for 20 min each and incubated with a fluorochrome-conjugated secondary antibody (AlexaFluor-488 anti-rabbit (Abcam, 1:500)) for 2 h at room temperature. The brains were washed three times in PBST for 20 min each. Finally, the samples were mounted onto microscope slides with Prolong Glass Anti-fade solution (Thermo Fisher Scientific) for subsequent analysis using the Nikon TiE A1R confocal microscope. All of the images were acquired and analysed using the Nikon NIS Elements. A list of used antibodies can be found in Supplementary Table 11.

Enhancer ROC curves

We used the scikit-learn⁷³ framework to compute ROC curves for separating adult high-quality enhancers (score = 2) from the other cloned enhancers. As features, we used peak height, peak specificity (Z score, log-transformed fold change, P value of DAR), motif content from DL, DAR and/or eGRN membership and correlation of peak accessibility with gene expression.
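To make the evaluation concrete, the area under the ROC curve for a single feature can be computed with the rank-based definition below. This is a dependency-free sketch, not the scikit-learn code used for the analysis (scikit-learn provides roc_auc_score for the same quantity); the labels and scores are made up, with 1 standing for an enhancer scored 2 and 0 for any other cloned enhancer.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical feature values (for example, peak height) for five enhancers.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.4]
auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 indicates a feature that does not separate the two groups, whereas a value of 1 indicates perfect separation.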

eGRN creation

Motif analysis

Conventional motif enrichment was performed using either i-cisTarget or RcisTarget⁶⁸. These methods are based on a combination of hidden Markov models, cross-species comparisons, and a ranking-and-recovery statistic that together provide an optimal balance between precision and recall. By default, we used as background a genome-wide collection of 134,000 regions (non-overlapping bins that are optimally cut on the basis of conservation) that cover the entire non-coding genome, removing the coding regions. These regions were scored with version 9 of our motif collection of curated PWMs (25,000 PWMs). CG content is controlled because of the genome-wide background, but overall has less of a role in the Drosophila genome in which CpG islands do not occur.

For DARs, different parameters were used in RcisTarget (aucMaxRank = 0.01 and 0.05, motif collection version 9 and the modERN TF ChIP–seq database⁷⁴) for each of the DAR sets with at least 10 regions. Here we also used all the regions in topics as additional background (re-ranking the database).

The results from these analyses are available online (http://flybrain.aertslab.org).

Cistromes

For building cistromes, we focused on cell types linked to a scRNA-seq cluster, regrouping the CB clusters into CB-Pros and CB-Imp to be able to establish the link to their transcriptome (T4 and T5 cells were analysed as independent clusters from ATAC, but both mapping to the same T4/T5 RNA cluster).

For each cell type, the cistromes were built on the basis of the motif enrichment analysis of upregulated DAR sets with at least 10 regions. Each significantly enriched motif (NES ≥ 3) was annotated to expressed TFs on the basis of cisTarget’s ‘direct’ and ‘inferred by orthology’ annotations (considering TFs expressed when expression > 0 in at least 10% of the cells of the given type/cluster). Note that, as cisTarget’s annotation includes some non-TF DNA-binding proteins, we retained only the 459 TFs listed as such on FlyBase and in the GO MF annotation.

For the significant motifs (NES ≥ 3.0), we split the TF–motif pairs by the correlation of the TF expression and motif enrichment score across the cell types, which resulted in ‘opening chromatin’ cistromes (positive correlation > 0.40 or 0.20) or ‘closing chromatin’ cistromes (negative correlation < –0.40 or –0.20). We also retained all motifs merged under an ‘unclear direction’ set to be able to detect TFs whose activity might be regulated at post-transcriptional levels (those can be explored on the website). For each of the motifs that were significantly enriched in a DAR set for a cell type in which the TF is expressed (note that, for the ‘closing’ cistromes, we did not require the TF to be expressed), we retrieved the DARs in which the motif had a significantly high score (that is, leading edge⁵² using RcisTarget::getSignificantRegions).
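The correlation-based split described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names are hypothetical, only the 0.40 threshold is shown, and an undefined correlation (constant input) is treated as 0 by assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: constant input counts as no correlation
    return cov / (sd_x * sd_y)


def classify_pair(tf_expression, motif_nes, threshold=0.40):
    """Label a TF-motif pair from values measured across the same cell types."""
    r = pearson(tf_expression, motif_nes)
    if r > threshold:
        return "opening"
    if r < -threshold:
        return "closing"
    return "unclear"
```

Both input lists are expected to hold one value per cell type, in the same order.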

The dot heat map in Fig. 3 shows the average TF expression by cell type (that is, the average of all the cells in the cluster, after normalizing each cell on the basis of its total counts), normalized to its maximum (each gene divided by its maximum value), and the NES of the highest scoring motif (NES capped at 9).

Gene–enhancer links

Previous research showed that regulatory interactions can occur over large distances but are mostly confined within chromatin domains, in ‘genomic regulatory blocks’ (127 kb median size), a HiC-derived topological associated domain (TAD, 13 kb median size) or between two BEAF-32 boundary elements (57 kb median distance). On the basis of the comparison of these domains (Extended Data Fig. 10a–d), we decided to set a default search space of 50 kb around each gene for enhancer–gene links, which we then pruned using the BEAF-32 peaks.

We calculated the enhancer-to-gene links using the 43 matched clusters between RNA and ATAC plus CB-Pros and CB-Imp (ATAC T4 and T5 clusters were merged to match T4/T5 in RNA). For each cluster and data modality, 200 pseudocells were created as a bootstrap of five cells of the cell type. Each transcriptome pseudocell was then matched to a chromatin pseudocell of the same cell type to calculate the Pearson correlation and Random Forest regression (GENIE3) between each gene’s expression and the predicted accessibility (cisTopic cell-region probability) of the regions within 50 kb of its longest transcript (50 kb upstream of the TSS and 50 kb downstream of the end, plus the introns). The GENIE3 scores were filtered using the Binarize::binarize.BASC function in R. We then created a score, based on the aggregated ranking of these two measures plus the region accessibility, to enable us to select the top regulators. The maximum value of this score was scaled to 1,000 for compatibility with the UCSC Genome browser (in which we suggest a minimum threshold of 600 for link visualization). The links were then split into positive or negative links on the basis of the correlation of region accessibility and gene expression.
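As an illustration of the pseudocell and score-scaling steps, the following Python sketch builds pseudocells from a list of per-cell feature vectors and rescales a list of scores so that the maximum equals 1,000. It is a sketch under stated assumptions, not the study's implementation: sampling with replacement and averaging the five sampled cells are assumptions, and all names are hypothetical.

```python
import random


def make_pseudocells(cells, n_pseudocells=200, cells_per_pseudocell=5, seed=0):
    """Build pseudocells from `cells`, a list of equal-length feature vectors.

    Assumption: each pseudocell is the mean of five cells drawn with
    replacement from the same cell type.
    """
    rng = random.Random(seed)
    pseudocells = []
    for _ in range(n_pseudocells):
        sample = rng.choices(cells, k=cells_per_pseudocell)
        pseudocells.append(
            [sum(column) / cells_per_pseudocell for column in zip(*sample)]
        )
    return pseudocells


def scale_to_max(scores, new_max=1000.0):
    """Rescale positive scores so that the largest one equals `new_max`."""
    top = max(scores)
    return [s * new_max / top for s in scores]
```

With scores scaled this way, the suggested visualization threshold of 600 corresponds to 60% of the best-scoring link.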

For identifying the links within BEAF-32 domains, we used ChIP–seq on whole Drosophila embryos (mixed sex embryo of 0–14 h; ENCODE dataset, https://www.encodeproject.org/files/ENCFF704WGH/)⁷⁵. The peaks were filtered on the basis of the enrichment of the BEAF-32 motif (i-cisTarget analysis with the default settings), and their accessibility in the adult fly brain (most of the peaks are ubiquitous across cell types; Extended Data Fig. 10a–d). We then defined the BEAF-32-based search space for each gene, taking the longest transcript, and extending (upstream and downstream) until the first BEAF-32 peak within 200 kb (skipping the 500 bp around the TSS). In case there were no peaks within 200 kb, 50 kb was kept as search space. In 82% of the eGRNs, there is a slightly higher GSEA enrichment score with the TF co-expression module (see below) when using only links within BEAF-32 peaks.
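The search-space rule can be sketched in Python as follows. This is an illustrative reading of the text, not the study's code: peaks are reduced to single positions, 'skipping the 500 bp around the TSS' is interpreted as ignoring peaks within 500 bp of the TSS, coordinates are handled on the forward strand only, and the function name is hypothetical.

```python
def beaf32_search_space(tx_start, tx_end, tss, peak_positions,
                        max_dist=200_000, fallback=50_000, tss_skip=500):
    """Return (lower, upper) bounds of the search space for one gene.

    The space extends from the transcript to the nearest BEAF-32 peak on each
    side within `max_dist`; if no such peak exists on a side, `fallback` bp
    are used on that side instead.
    """
    # Assumption: peaks closer than `tss_skip` to the TSS are not boundaries.
    usable = [p for p in peak_positions if abs(p - tss) > tss_skip]
    left = [p for p in usable if tx_start - max_dist <= p < tx_start]
    right = [p for p in usable if tx_end < p <= tx_end + max_dist]
    lower = max(left) if left else tx_start - fallback
    upper = min(right) if right else tx_end + fallback
    return max(lower, 0), upper
```

For a reverse-strand gene, the same logic would apply after swapping the roles of upstream and downstream.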

Finally, we also checked whether using these links is better than just using all of the genes within a certain distance of the cistrome region. We converted the cistrome regions to genes using three approaches: (1) all proximal genes (5 kb upstream, plus introns); (2) all distal genes (such as regions within 50 kb of the gene); and (3) the newly calculated enhancer–gene link based on expression and accessibility. We then used GSEA to check the enrichment of the resulting gene sets on the cell type markers, which confirmed that using the links is clearly better (for example, more enrichment) than just using all genes near cistrome regions.

eGRN integration

The regions in each of the cell-type-specific cistromes were converted to genes on the basis of the enhancer–gene links with a score of ≥600, splitting the resulting gene sets according to positive or negative links. We then used GSEA to check whether each of these gene sets (with at least five target genes) is enriched in each of the TF co-expression modules (using 5,000 permutations of GSEA). For each TF co-expression module (that is, ranking) we kept the significant cistromes for the same TF (P < 0.01), and selected the genes in the leading edge to build the eGRNs. To finalize the eGRNs per cell type, for each of those genes, we next retrieved the linked regions in the cistrome within the BEAF-32-based search space. Thus, we obtained the connections TF–region–gene.

eGRN plots

To display the eGRNs as networks in Cytoscape (v.3.8.0;⁷⁶), we focused on the positive region–target gene links, and the genes expressed in at least 15% of the cells of the specific cell type (except T4/T5 neurons, for which we used 5% instead).

In addition, the Cytoscape networks also display differential expression and accessibility for each node (gene or region, respectively). The differential expression was calculated by contrasting the cell type versus all other cells (avg_logFC calculated using the Seurat function FindMarkers); the accessibility in the cell type was calculated by taking the mean over the interval with subsequent RPGC normalization.

DeepFlyBrain

cisTopic run

KCs (3), T-neurons (6) and glia (6) were selected from the adult and 72 h APF datasets leading to a total of 17,554 cells covering 15 cell types. The selected cells were rescored on a set of 207,000 150 bp peaks (see the consensus peaks), which were extended to 500 bp for optimal resolution in DL. We ran cisTopic on this subset of the data and with the new set of peaks. Given the smaller number of cells, we used the conventional Collapsed Gibbs Sampler method (runCGSModels) from 1 to 100 topics, with 500 iterations using 250 as burn-in. Using selectModel, we selected the model with the highest log-transformed likelihood leading to 81 topics. Using runtSNE without PCA on the probability matrix with the cells as target, we acquired the 2D embeddings. We then calculated scores for the topics per region using getRegionsScores with method=‘NormTop’ and scale=TRUE. Finally, we used binarizecisTopics with thrP = 0.975 to get 81 sets of peaks. These region sets were annotated to the different cell types on the basis of accessibility per cell type and region features (such as promoters, BEAF-32) on the basis of motif enrichment and the annotateRegions function using the Drosophila datasets.

Model training

These sets of regions were then used as input for a DL model, in which 500 bp DNA sequences were used to predict the topic set to which the region belongs. The model architecture was adopted from earlier studies that likewise used cisTopic clusters as input for a DL model (DeepMEL²⁶, DeepMEL2²⁵). The model is a hybrid CNN–RNN multiclass classifier⁷⁷; details of the model architecture are provided in Supplementary Table 4. Relative to the architecture proposed earlier, we increased the number of filters from 128 to 1,024, of which 747 were initialized as known PWMs representing 212 TFs. To be able to initialize the filters with the long PWMs, we also increased the filter size to 24.

Model performance

To assess the performance of the model, we performed ninefold cross validation whereby we split the regions into ten groups (10% of the input regions for each group). One of the groups is left out as a test set while the other nine groups are used for the ninefold cross validation. For each fold, one of the nine groups was used as the validation set (10% of the input regions) and the rest (80% of the input regions) was used as the training set. After splitting the regions into ten groups and before training the model, to increase the sample size for the DL model, we augmented the regions by extending them to 700 bp and used a sliding window of 500 bp with a 50 bp stride, increasing the sample size five times. During the training, the validation set was used for early stopping and the 83rd epoch (best in the main model) was chosen to evaluate the performance of cross-validation models. After training, we assessed the performance of the nine models on the non-augmented test set by scoring the test set regions with the models. Then, using the prediction scores and the topic labels, we calculated the area under the precision-recall (auPR) and receiver operating characteristic (auROC) curves using the average_precision_score and roc_auc_score functions from the scikit-learn package. Performance metrics for each topic are provided in Supplementary Table 5.
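The data split and the sliding-window augmentation can be sketched in Python as below. This is a minimal illustration rather than the notebooks in Supplementary Data 3–5: the random shuffling, the choice of which group is held out, and the centred extension to 700 bp are assumptions, and the function names are hypothetical.

```python
import random


def split_ten_groups(region_ids, seed=0):
    """Shuffle regions and deal them into ten groups of (near-)equal size."""
    rng = random.Random(seed)
    ids = list(region_ids)
    rng.shuffle(ids)
    return [ids[i::10] for i in range(10)]


def cross_validation_folds(groups):
    """Hold out the first group as test set; rotate validation over the rest."""
    test, rest = groups[0], groups[1:]
    folds = []
    for i, validation in enumerate(rest):
        training = [r for j, g in enumerate(rest) if j != i for r in g]
        folds.append((training, validation))
    return test, folds


def augmentation_windows(start, end, extended=700, window=500, stride=50):
    """Window coordinates after extending a region around its centre.

    With the defaults, (700 - 500) / 50 + 1 = 5 windows are produced, which
    matches the fivefold increase in sample size.
    """
    centre = (start + end) // 2
    ext_start = centre - extended // 2
    return [(ext_start + offset, ext_start + offset + window)
            for offset in range(0, extended - window + 1, stride)]
```

Each of the nine folds then trains on 80% of the regions, validates on 10% and leaves the same 10% test group untouched.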

Here we notice a discordance between topics, with some topics achieving high validation scores, while others receive low ones. We noticed that cell-type specific topics in general have a higher score (Extended Data Fig. 6b–d). Indeed, in our analysis, the number of topics (81) is much higher than the number of cell types used as input for our case study (15). Thus, not all topics correspond to cell types and, therefore, not all topics need to have high validation scores. Some topics represent only noise, promoters or generally accessible elements. These background topics are useful for the training and are therefore retained, but present less interesting biological insights.

Nucleotide contributions

To find the nucleotides that contribute the most to the topic prediction, we used a network explaining tool called DeepExplainer from the SHAP package³². The tool was initialized with 500 random sequences and the default parameters were used. The importance score obtained from the DeepExplainer analysis was multiplied by the one-hot encoded DNA sequence and visualized as the height of the nucleotide letters as in earlier work⁷⁸. In addition to the nucleotide importance plots, we performed in silico saturation mutagenesis in which we calculated the effect of each variant of a region on its model prediction score. The sequences with all possible single mutations were generated and the delta prediction score for each topic was calculated. The code that was used to train the model, to measure the performance, and to calculate the prediction scores and the nucleotide importances is provided in Supplementary Data 3–5.
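The in silico saturation mutagenesis step can be illustrated with the short Python sketch below. The `predict` argument is a hypothetical callable that maps a DNA sequence to a list of per-topic scores and stands in for the trained model; the sign convention (mutated minus reference) is an assumption.

```python
def single_mutants(sequence, alphabet="ACGT"):
    """Yield (position, ref, alt, mutated_sequence) for every single mutation."""
    for i, ref in enumerate(sequence):
        for alt in alphabet:
            if alt != ref:
                yield i, ref, alt, sequence[:i] + alt + sequence[i + 1:]


def delta_scores(sequence, predict):
    """Per-topic change in prediction score for each single-nucleotide variant.

    `predict` is a hypothetical stand-in for the model. The delta is computed
    as mutated minus reference (assumed convention).
    """
    reference = predict(sequence)
    deltas = {}
    for position, _ref, alt, mutant in single_mutants(sequence):
        mutated = predict(mutant)
        deltas[(position, alt)] = [m - r for m, r in zip(mutated, reference)]
    return deltas
```

A sequence of length n yields 3n variants, so a 500 bp region requires 1,500 additional model evaluations.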

TF-binding site predictions

High nucleotide importances on DeepExplainer plots represent potential binding sites for TFs. We used TF-MoDISco (v.0.5.5.4)³³ to identify the most common patterns for KCs (topics 21, 35, 77), T neurons (topics 23, 20, 44, 10, 18, 32) and glia (topics 68, 25, 56, 34, 36). The default parameters were used to run for each group. After finding the patterns, to identify motif instances on the given sequences, we followed the TF-MoDISco manual. First, the patterns were trimmed using trim_by_ic (th = 0.25), then the sum score was calculated by using compute_sum_scores on the nucleotide importance scores. However, instead of using contribution weight matrix and calculating cosine-similarity using compute_masked_cosine_sim as shown in the TF-MoDISco manual, we converted the identified patterns to convolutional filters and calculated pattern activation scores using tf.nn.conv1d function from the TensorFlow package. It resulted in better motif instances for the noisy nucleotide importance scores and for the shorter patterns. Global motif instances were calculated on 500 bp consensus regions for each group (KCs, T-neurons, and glia) that have high prediction score (>0.25) for the corresponding class. The selected threshold for each pattern can be found in Supplementary Table 6.
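The idea of scoring a pattern as a convolutional filter can be sketched in plain Python. This is a stand-in for the tf.nn.conv1d call mentioned above, not a reproduction of it: the sketch slides the pattern over a one-hot encoded sequence with no padding, whereas the exact input used in the study is not specified here, and all names are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """One-hot encode a DNA string as a list of [A, C, G, T] rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def pattern_activations(sequence, pattern):
    """Slide `pattern` (per-position [A, C, G, T] weights) over the sequence.

    Returns one activation per start position (no padding), that is, the sum
    of element-wise products between the pattern and the encoded window.
    """
    encoded = one_hot(sequence)
    width = len(pattern)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(encoded[start + i][j] * pattern[i][j]
                    for i in range(width) for j in range(4))
        scores.append(score)
    return scores


def motif_instances(sequence, pattern, threshold):
    """Return (start, score) for positions whose activation meets the threshold."""
    return [(i, s) for i, s in enumerate(pattern_activations(sequence, pattern))
            if s >= threshold]
```

In practice, a separate threshold would be chosen for each pattern, as listed in Supplementary Table 6.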

The DL-based KC cistromes were built for the TFs whose contribution scores match gene expression one-to-one across the different KC types (ey, toy, Mef2, onecut, sr and mamo). For these TFs, the calculated motif instances were used to build cistromes, then the eGRNs were constructed following the same approach as with the motif-based eGRNs (described above).

Developmental model

To create a DL model on development we followed the same approach as outlined above, but changed the input topics. We performed cisTopic analysis of KCs, glia and T neurons from larval to 12 h APF brains with the addition of neuroblasts and neuroepithelium cell types (in total 27,853 cells). We ran cisTopic using the conventional Collapsed Gibbs Sampler method (runCGSModels) from 1 to 230 topics, with 500 iterations using 250 as burn-in. While 220 topics was the model with the highest log-transformed likelihood, we chose to work with 160 topics as the additional increase was not significant and we preferred to keep the number of topics as low as possible.

caQTLs analysis

Data preprocessing of the bulk ATAC-seq data across DGRP lines was performed as described in Bravo et al⁷⁹, which is based on the analysis previously performed by Jacobs et al³⁷. First, adapter sequences were trimmed from the raw reads using fastq-mcf (ea-utils v1.1.2, default parameters and using a list containing the common Illumina adapters), and the quality of the cleaned reads was checked with FastQC (v0.1). Next, Bowtie2 (v2.2.5) was used to map each experiment to its personalized version of the third 2017 FlyBase release genome (D. melanogaster r6.16). Called variants in this genome assembly were retrieved from ftp://ftp.hgsc.bcm.edu/DGRP/freeze2_Feb_2013/liftover_data_for_D.mel6.0_from_William_Gilks_Oct_2015/, and for each of the DGRP lines, the consensus genome (r6.16) was modified using seqtk mutfa (seqtk v1.0), each time including their SNPs (previously called from whole-genome sequencing). After the first mapping round, additional SNPs were called on the ATAC reads using SAMtools (v1.2, samtools mpileup -B -f r6.16.fasta DGRP_lineX.bam | varscan.sh mpileup2snp --output-vcf 1), retrieving several thousands per line that were added to the existing vcf files using VCFtools (v0.1.14). The vcf files were then used to update the genome, creating the final personalized genome for every DGRP line, strongly reducing mapping errors and increasing the sensitivity of subsequent analyses. Bowtie2 (v2.2.5) was then used to map the cleaned reads onto the final genomes, and SAMtools (v1.2) was used for sorting and indexing. Peaks were called on the mapped reads using MACS2 (v2.1.2.1), with the command macs2 callpeak -g dm --nomodel --keep-dup all --call-summits. The narrow peak files (bed format) for all the DGRP lines were merged, leading to a total of 33,595 regions accessible in at least one DGRP line. After filtering out chrU, chrUextra, chrHet and chrM regions and removing regions enriched in repeats (>25% of the sequence) using bedtools (v2.28.0) with the command intersectBed -v -f 0.25, we obtained 32,668 accessible regions across this DGRP panel. For every ATAC-seq sample, we quantified the coverage per accessible region using featureCounts (Subread v2.0.0). Next, we filtered out 11,711 regions with low coverage (coverage of the region < 0.2 reads per base pair for every DGRP line), ending up with 20,957 accessible regions. Finally, the DESeq2 package in R was used to normalize the final peak-counts matrix based on size-factors.
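The low-coverage filter can be illustrated with the following Python sketch. It assumes that a region is discarded only when its coverage is below 0.2 reads per base pair in every DGRP line; the data layout and function names are hypothetical.

```python
def is_low_coverage(region_length, reads_per_line, min_reads_per_bp=0.2):
    """True if coverage is below the threshold in every DGRP line (assumed rule)."""
    return all(reads / region_length < min_reads_per_bp
               for reads in reads_per_line)


def filter_regions(regions, min_reads_per_bp=0.2):
    """Keep regions that reach the coverage threshold in at least one line.

    `regions` maps a region name to (length_in_bp, [read count per line]).
    """
    return {name: (length, counts)
            for name, (length, counts) in regions.items()
            if not is_low_coverage(length, counts, min_reads_per_bp)}
```

The region counts reported above are consistent with this step: 32,668 − 11,711 = 20,957.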

The chromatin accessibility quantitative trait loci (caQTLs) were also identified as described in Jacobs et al³⁷. In brief, we searched for correlation between the counts in each peak and the overlapping SNPs (treated as a vector across the 44 DGRP lines with values 0/1/NA) using the generalized linear model function in R. This provided 4,063 caQTLs, that is, highly correlated SNP–region pairs (Benjamini–Hochberg adjusted P < 0.05).

The motifs that were significantly affected by the caQTLs were identified by calculating the change in score produced by each of the SNPs in the caQTLs to each of the 24,454 motifs in our collection³⁷ (that is, we scored every sequence twice, once with the reference and another with the ALT allele, using Cluster-Buster⁸⁰, with the options -m 0 -c 0, and subtracted the score of the less accessible sequence from the most accessible one). We then used the Fisher’s exact test to compare the significance of the number of caQTL SNPs affecting the motif (with abs(Delta) > 3), versus what is expected by chance (for example, random SNPs). This returned the motifs shown in Extended Data Fig. 6l.

For the analysis of caQTL explainability by the DL model, the caQTLs and random SNPs were scored by the DL model using both reference and mutated alleles. For each caQTL/SNP, the maximum absolute delta (reference − mutated) prediction score was calculated using all 81 topics. A threshold was calculated on different false-positive rates based on random SNPs and the same thresholds were applied to caQTLs to identify the fraction of explained caQTLs at different false-positive rates.
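The thresholding on random SNPs can be sketched in Python as follows. This is an illustration under stated assumptions: the cutoff is taken as the value that at most a fraction `fpr` of random SNPs strictly exceed, and the function names are hypothetical.

```python
def max_abs_delta(reference_scores, mutated_scores):
    """Largest absolute (reference - mutated) score difference over all topics."""
    return max(abs(r - m) for r, m in zip(reference_scores, mutated_scores))


def threshold_at_fpr(random_deltas, fpr):
    """Cutoff that at most a fraction `fpr` of random SNPs strictly exceed."""
    ranked = sorted(random_deltas, reverse=True)
    allowed = int(fpr * len(ranked))  # random SNPs permitted above the cutoff
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def explained_fraction(caqtl_deltas, cutoff):
    """Fraction of caQTLs whose maximum absolute delta exceeds the cutoff."""
    return sum(d > cutoff for d in caqtl_deltas) / len(caqtl_deltas)
```

Repeating the last two steps over a range of `fpr` values gives the fraction of explained caQTLs at each false-positive rate.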

Development ATAC analysis

Annotation of cell types

The annotation of cell types through development was performed following two complementary approaches: (1) annotate progenitor cell types based on marker genes near ase, dpn, grh, dac, cas and scro¹² (Extended Data Fig. 4g) and the ventral nerve cord based on para and abd-A¹², and (2) tracking back the annotated adult cell types.

To track back the adult cell types through development, we used an SVM classifier (Supplementary Table 2). We trained the SVM classifier on the annotated adult cell types, and we used it to iteratively transfer the labels to earlier stages.

(1) In the first step, we used the SVM classifier to transfer the labels from the 79 adult clusters (adult cells with more than 900 FIP), to the remaining cells on the adult dataset (adult + 72 h APF, the classifiers are trained on the cell–topic matrix). Using cross validation within the adult cells, we estimated that the global accuracy of the classifier is 0.86, with a call rate of 0.97 (it is not forced to assign a class to every cell); having a specificity of over 0.99 for all cell types, and a sensitivity ranging from over 0.90, for many glial, OL and KC types, to 0.25–0.50 for the least confident CB-Pros clusters.
(2) We then used the adult + 72 h APF cells to classify the 48 h pupa cells using the common cisTopic analysis with these three stages (Extended Data Fig. 1f–l), and Harmony (on the cell–topic matrix) to reduce the effect of differences intrinsic to the developmental stage.
(3) Finally, we classified the cells in the remaining developmental stages (from larva to 24 h APF). For this we used the global cisTopic analysis (158,116/240,919 cells with more than 900 FIP), with Harmony to correct for developmental stage (Extended Data Fig. 1e). In this last training, we noticed that cells on the progenitor clusters remained largely unassigned, so we finally trained a classifier also including the progenitors as training labels (OL Developing neuron 2, CB Developing neuron 1, OL Neuroepithelium, NB Generation, OL Developing neuron 1, OL Type I NB, CB Type I NB, and LPC), and discarding from the training set the few cells from the "new 48h cluster" that had been assigned to a cell type (they seem to be younger cells, and could distort the classification). This method obtained a likely fate for the developmental cells.

Core-set identification

The peaks called per cell type per timepoint for the consensus peaks were used as the basis to identify core regions per cell type. ctx regions (see scATAC topic modelling and clustering) that overlapped with the called peaks for that timepoint were defined as open and the DARs of the timepoint for the cell type were used to get differential accessible regions. All of the ctx regions that passed the filtering were then taken together as one set of total accessible regions of the cell type. Regions that were accessible in every timepoint were defined as the core-set of regions; regions that were differentially accessible in every timepoint were defined as core-DARs.
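The set logic behind the core regions can be illustrated with a few lines of Python. The sketch assumes that the accessible (or differentially accessible) regions of one cell type are already available as one collection of region names per timepoint; the function names are hypothetical.

```python
def total_accessible(regions_by_timepoint):
    """Union of the regions that are accessible at any timepoint."""
    result = set()
    for regions in regions_by_timepoint.values():
        result |= set(regions)
    return result


def core_set(regions_by_timepoint):
    """Regions present at every timepoint (intersection across timepoints).

    Applied to accessible regions this gives the core-set; applied to the
    DARs of each timepoint it gives the core-DARs.
    """
    sets = [set(regions) for regions in regions_by_timepoint.values()]
    return set.intersection(*sets) if sets else set()
```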

Trajectory of OL branches

We used Monocle3 (refs. ^81,82,83) to fit a trajectory in the 3D UMAP of the OL branch from the larval to 12 h APF analysis and assign pseudotimes to the cells. First, we created a cell_data_set with the region probabilities per cell and the 3D UMAP from cisTopic as embeddings. We then used cluster_cells followed by a partition to separate the OL and CB branches and subset the object to only contain the OL. We performed another cluster_cells, and selected and merged clusters in the same branch. The branch IDs were then used in Seurat’s FindAllMarkers (Wilcoxon, min.pct = 0.1, logfc.threshold = 0.25) on the predictive distribution matrix to find DARs. Subsequently, motif enrichment on the branch DARs was performed using i-cisTarget⁶⁸. Regions were linked to genes up to 5 kb upstream or downstream and GO was performed using FlyMine (Supplementary Table 3).

Trajectory of ONE scATAC-seq

The Monocle3 object that was created for OL branches was also used to calculate pseudotime using learn_graph to fit a principal graph and order_cells to assign pseudotimes. Next, OL neuroepithelium cells were selected together with the tips of the lamina precursor cells and OL neuroblasts (NB generation), focusing on the trajectory between these cell types. The trajectory was split into 15 equal parts that were used in Seurat’s FindAllMarkers (Wilcoxon, min.pct = 0.1, logfc.threshold = 0.1) to find DARs using a two-sided Wilcoxon test. Next, the predictive distribution matrix was subset for DARs and CPM normalized, followed by region-based z-normalization. DARs were grouped into modules using hierarchical clustering with the SciPy cluster.hierarchy module⁴⁹ using distance.pdist (Euclidean), linkage (complete) and fcluster (0.85 × max distance), leading to nine modules. RcisTarget was used to identify motifs per module.
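Two of the preprocessing steps above, splitting the trajectory into 15 parts and region-based z-normalization, can be sketched in Python. The sketch covers only these two steps, not the clustering; equal-width pseudotime bins and the use of the population standard deviation are assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def equal_bins(pseudotimes, n_bins=15):
    """Assign each cell to one of `n_bins` equal-width pseudotime intervals."""
    lowest, highest = min(pseudotimes), max(pseudotimes)
    width = (highest - lowest) / n_bins
    if width == 0:
        return [0 for _ in pseudotimes]
    # The maximum pseudotime falls into the last bin rather than a new one.
    return [min(int((t - lowest) / width), n_bins - 1) for t in pseudotimes]


def z_normalize_rows(matrix):
    """Z-normalize each region (row) across cells or bins (columns)."""
    normalized = []
    for row in matrix:
        mu, sigma = mean(row), pstdev(row)
        if sigma == 0:
            normalized.append([0.0 for _ in row])  # constant row: no signal
        else:
            normalized.append([(value - mu) / sigma for value in row])
    return normalized
```

The z-normalized matrix would then be the input for distance calculation and hierarchical clustering.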

Trajectory of ONE scRNA-seq

Lamina precursor cells, neuroepithelium cells and OL neuroblasts were selected from a scRNA-seq dataset of the larval brain¹². Monocle3 was used to create a trajectory through the cells and assign pseudotimes. First, the data were processed using preprocess_cds with PCA as the method, selecting 20 components. Next, a batch effect correction was performed to align the two different runs with align_cds. The aligned data were then used for reduce_dimension, followed by learn_graph. Once the principal graph was learned, cells were ordered along it and pseudotimes were assigned. To plot gene expression trajectories over pseudotime, a rolling mean was calculated for the log-normalized CPM counts with a window of 10. Next, a 10th degree polynomial was fit through the rolling mean with polyfit using NumPy⁸⁴ and plotted.

CB pros versus Imp

CB clusters in the adult + 72 h APF dataset were selected on the basis of enrichment of CB-only runs (Extended Data Fig. 1f–l), with the exception of KCs. These clusters were assigned to either pros or Imp groups based on their maximal mean gene accessibility. We then used Seurat FindAllMarkers on the predictive distribution matrix (Wilcoxon, min.pct = 0.1, logfc.threshold = 0.2) to identify 166 regions for pros⁺ cells and 128 regions for Imp⁺ cells. Motif enrichment was performed using i-cisTarget⁶⁸.

scATAC-seq embryo

We used scATAC-seq from the whole Drosophila embryo⁸⁵ to map the different CB cell types. After data download from GEO, we used cisTopic to map the reads on ctx regions, leading to a matrix of 128,510 regions by 20,594 cells. Given the smaller number of cells, we used the conventional Collapsed Gibbs Sampler method in cisTopic using runCGSModels from 1 to 100 topics, with 500 iterations using 250 as burn-in. We selected the model with the highest log-transformed likelihood leading to 50 topics. Using runtSNE without PCA on the probability matrix with the cells as the target, we acquired the 2D embeddings. Annotations were transferred from the dataset, identifying the CNS. We then plotted the mean accessibility of the CB regions on the t-SNE.

Enhancer-switch identification

Region accessibilities per cell type were calculated per timepoint using RPGC-normalized bigwig files. Next, a linear curve was fit using statsmodels⁸⁶ in Python for every region using time as an independent variable and region accessibility as a dependent variable, with 95% confidence intervals calculated for the parameters. Regions with a positive coefficient were assigned to be upregulated and regions with negative coefficients were assigned to be downregulated. Finally, we selected the regions that were upregulated in one cell type while being downregulated in another one and that had a maximum accessibility exceeding 4, leading to 458 switching regions.
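The selection of switching regions can be illustrated with a plain-Python sketch. It replaces the statsmodels fit with a closed-form least-squares slope and does not compute the 95% confidence intervals; taking the maximum accessibility over all cell types and timepoints is an assumption, and the names are hypothetical.

```python
def slope(times, values):
    """Ordinary least-squares slope of `values` regressed on `times`."""
    n = len(times)
    mean_t, mean_v = sum(times) / n, sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


def switching_regions(times, accessibility, min_peak=4.0):
    """Regions going up in one cell type and down in another.

    `accessibility` maps region -> {cell type -> [value per timepoint]}.
    Assumption: the accessibility filter uses the maximum over all cell
    types and timepoints of the region.
    """
    switches = []
    for region, by_type in accessibility.items():
        slopes = [slope(times, values) for values in by_type.values()]
        peak = max(max(values) for values in by_type.values())
        goes_up = any(s > 0 for s in slopes)
        goes_down = any(s < 0 for s in slopes)
        if peak > min_peak and goes_up and goes_down:
            switches.append(region)
    return switches
```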

Hydrop scATAC-seq and analysis

HyDrop experiments were performed on sorted GH146 cells (OPNs) according to the standard HyDrop ATAC protocol as described by De Rop et al⁸⁷. As OPNs are a rare cell type in the CB, 330 brains were used, followed by a stringent gating strategy. The FACS gating strategy is provided in Supplementary Data 1. The sorted cells were split in two batches that served as technical replicates. Next, barcode reads were trimmed to exclude the intersub-barcode PCR adapters using a mawk script. Then, the VSN scATAC-seq preprocessing pipeline⁵⁰ was used to map the reads to the reference genome and generate a fragments file for downstream analysis. Here, barcode reads were compared to a whitelist (of 884,736 valid barcodes) and corrected, allowing for a maximum 1 bp mismatch. Uncorrected and corrected barcodes were appended to the fastq sequence identifier of the paired-end ATAC-seq reads. Reads were mapped to the reference genome using bwa mem with the default settings, and the barcode information was added as tags to each read in the bam file. Duplicate-marking was performed using samtools markdup. In the final step of the pipeline, fragments files were generated using Sinto (https://github.com/timoast/sinto).
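The barcode correction step can be sketched in Python as below. This is an illustration of whitelist matching with at most one mismatch, not the VSN pipeline code: leaving ambiguous barcodes uncorrected is an assumption, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(barcode, whitelist, max_mismatch=1):
    """Return the whitelisted barcode matching `barcode`, or None.

    Exact matches are returned as-is. Otherwise the barcode is corrected only
    if exactly one whitelist entry lies within `max_mismatch` mismatches
    (assumption: ambiguous barcodes are left uncorrected).
    """
    if barcode in whitelist:
        return barcode
    candidates = [w for w in whitelist
                  if len(w) == len(barcode) and hamming(w, barcode) <= max_mismatch]
    return candidates[0] if len(candidates) == 1 else None
```

A linear scan like this is only practical for small whitelists; for 884,736 barcodes, enumerating the single-mismatch neighbours of each read barcode and looking them up in a set would be the more efficient design.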

The generated fragments files were then used as input for ArchR (and cisTopic, which gave similar results; data not shown). The createArrowFiles function was used with minTSS set to 4 and minFrags to 1000, leaving 309 cells. Ten components were then used in IterativeLSI and for the UMAP calculation; clusters were calculated using Louvain clustering with a resolution of 1. The arrow files were also used for co-clustering with the 10x scATAC data of the adult brain. Clustering was performed with 70 components of IterativeLSI and for UMAP.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability

The data generated for this study have been deposited in NCBI’s Gene Expression Omnibus and are accessible through GEO Series accession numbers GSE163697 and GSE181494 (DGRP lines). We also provide a dedicated website to browse the results of the analyses and processed data (https://flybrain.aertslab.org), which provides link-outs to the SCope session (http://scope.aertslab.org/#/Fly_Brain/), UCSC hub (http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm6&hubUrl=http://ucsctracks.aertslab.org/papers/FlyBrain/hub.txt), the eGRNs in NDEx, the DeepExplainer plots of enhancers and other information. The following online databases were used: FlyBase (https://flybase.org/), FlyMine (https://www.flymine.org/flymine), i-cisTarget (https://gbiomed.kuleuven.be/apps/lcb/i-cisTarget/), FlyLight (https://flweb.janelia.org/cgi-bin/flew.cgi), CIS-BP (http://cisbp.ccbr.utoronto.ca/), ENCODE (https://www.encodeproject.org/, ENCFF704WGH). The following publicly accessible datasets were also used: GSE107451 (scRNA-seq adult brain), GSE157202 (scRNA-seq larval brain), GSE101581 (scATAC-seq embryo). The neural network is from Özel et al⁵.

Code availability

The updated version of cisTopic for scATAC-seq clustering and topic identification including warpLDA are available at GitHub (https://github.com/aertslab/cisTopic) with set-up instructions and a tutorial. The Nextflow pipeline for scRNA-seq analysis is available at GitHub (https://github.com/vib-singlecell-nf/vsn-pipelines) together with example config files and instructions. DeepFlyBrain is deposited in Kipoi (https://kipoi.org/models/DeepFlyBrain), and the Jupyter notebooks that can be used to train the model are provided in Supplementary Data 3–5. Enhancer gene links can be calculated using ScoMAP (https://github.com/aertslab/ScoMAP) and GENIE3 (https://github.com/aertslab/GENIE3). Trajectory analysis was performed using Monocle3 according to the package tutorials (http://cole-trapnell-lab.github.io/monocle-release/monocle3). Differential expression, accessibility and integration of RNA-seq and ATAC-seq was performed using Seurat v.3 (with vignettes and install instructions at https://satijalab.org/seurat/). TaDa analysis was performed using Perl scripts available at GitHub (https://github.com/tonysouthall/Peak_calling_DamID). Code for the website is available at GitHub (https://github.com/aertslab/FBD_App/) and notebooks are available at GitHub (https://github.com/aertslab/fly_brain).

References

Li, H. et al. Classifying Drosophila olfactory projection neuron subtypes by single-cell RNA sequencing. Cell 171, 1206–1220 (2017).
Davie, K. et al. A single-cell transcriptome atlas of the aging Drosophila brain. Cell 174, 982–998 (2018).
Konstantinides, N. et al. Phenotypic convergence: distinct transcription factors regulate common terminal features. Cell 174, 622–635 (2018).
Croset, V., Treiber, C. D. & Waddell, S. Cellular diversity in the Drosophila midbrain revealed by single-cell transcriptomics. eLife 7, e34550 (2018).
Özel, M. N. et al. Neuronal diversity and convergence in a visual system developmental atlas. Nature 589, 88–95 (2020).
Kurmangaliyev, Y. Z., Yoo, J., Valdes-Aleman, J., Sanfilippo, P. & Zipursky, S. L. Transcriptional programs of circuit assembly in the Drosophila visual system. Neuron 108, 1045–1057 (2020).
Costa, M., Manton, J. D., Ostrovsky, A. D., Prohaska, S. & Jefferis, G. S. X. E. NBLAST: rapid, sensitive comparison of neuronal structure and construction of neuron family databases. Neuron 91, 293–311 (2016).
Zheng, Z. et al. A complete electron microscopy volume of the brain of adult Drosophila melanogaster. Cell 174, 730–743 (2018).
Scheffer, L. K. et al. A connectome and analysis of the adult Drosophila central brain. eLife 9, e57443 (2020).
Jenett, A. et al. A GAL4-driver line resource for Drosophila neurobiology. Cell Rep. 2, 991–1001 (2012).
Robie, A. A. et al. Mapping the neural substrates of behavior. Cell 170, 393–406 (2017).
Ravenscroft, T. A. et al. Drosophila voltage-gated sodium channels are only expressed in active neurons and are localized to distal axonal initial segment-like domains. J. Neurosci. 40, 7999–8024 (2020).
Konstantinides, N. et al. A comprehensive series of temporal transcription factors in the fly visual system. Preprint at https://doi.org/10.1101/2021.06.13.448242 (2021).
Allen, A. M. et al. A single-cell transcriptomic atlas of the adult Drosophila ventral nerve cord. eLife 9, e54074 (2020).
Article PubMed PubMed Central Google Scholar
Doe, C. Q. Temporal patterning in the Drosophila CNS. Annu. Rev. Cell Dev. Biol. 33, 219–240 (2017).
Article CAS PubMed Google Scholar
Estacio-Gómez, A., Hassan, A., Walmsley, E., Le, L. W. & Southall, T. D. Dynamic neurotransmitter specific transcription factor expression profiles during Drosophila development. Biol. Open 9, bio052928 (2020).
Article PubMed PubMed Central Google Scholar
Komiyama, T., Johnson, W. A., Luo, L. & Jefferis, G. S. X. E. From lineage to wiring specificity. POU domain transcription factors control precise connections of Drosophila olfactory projection neurons. Cell 112, 157–167 (2003).
Article CAS PubMed Google Scholar
Kurmangaliyev, Y. Z., Yoo, J., LoCascio, S. A. & Zipursky, S. L. Modular transcriptional programs separately define axon and dendrite connectivity. eLife 8, e50822 (2019).
Article CAS PubMed PubMed Central Google Scholar
Schilling, T., Ali, A. H., Leonhardt, A., Borst, A. & Pujol-Martí, J. Transcriptional control of morphological properties of direction-selective T4/T5 neurons in Drosophila. Development 146, dev169763 (2019).
Article PubMed PubMed Central Google Scholar
Masserdotti, G., Gascón, S. & Götz, M. Direct neuronal reprogramming: learning from and for development. Development 143, 2494–2510 (2016).
Article CAS PubMed Google Scholar
Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
Article CAS PubMed PubMed Central Google Scholar
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Article ADS CAS PubMed PubMed Central Google Scholar
Bravo González-Blas, C. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).
Article PubMed Google Scholar
Kirilly, D. et al. A genetic pathway composed of Sox14 and Mical governs severing of dendrites during pruning. Nat. Neurosci. 12, 1497–1505 (2009).
Article CAS PubMed PubMed Central Google Scholar
Atak, Z. K. et al. Interpretation of allele-specific chromatin accessibility using cell state–aware deep learning. Genome Res. 31, 1082–1096 (2021).
Article PubMed PubMed Central Google Scholar
Minnoye, L. et al. Cross-species analysis of enhancer logic using deep learning. Genome Res. 30, 1815–1834 (2020).
Article CAS PubMed PubMed Central Google Scholar
Avet-Rochex, A., Maierbrugger, K. T. & Bateman, J. M. Glial enriched gene expression profiling identifies novel factors regulating the proliferation of specific glial subtypes in the Drosophila brain. Gene Expr. Patterns 16, 61–68 (2014).
Article CAS PubMed PubMed Central Google Scholar
Crittenden, J. R., Skoulakis, E. M. C., Goldstein, E. S. & Davis, R. L. Drosophila mef2 is essential for normal mushroom body and wing development. Biol. Open 7, bio035618 (2018).
Article PubMed PubMed Central Google Scholar
Minocha, S., Boll, W. & Noll, M. Crucial roles of Pox neuro in the developing ellipsoid body and antennal lobes of the Drosophila brain. PLoS ONE 12, e0176002 (2017).
Article PubMed PubMed Central Google Scholar
Davis, F. P. et al. A genetic, genomic, and computational resource for exploring neural circuit function. eLife 9, e50901 (2020).
Article CAS PubMed PubMed Central Google Scholar
Naidu, V. G. et al. Temporal progression of Drosophila medulla neuroblasts generates the transcription factor combination to control T1 neuron morphogenesis. Dev. Biol. 464, 35–44 (2020).
Article CAS PubMed PubMed Central Google Scholar
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30, 4765–4774 (2017).
Google Scholar
Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. Preprint at https://arxiv.org/abs/1811.00416 (2020).
Kaya-Okur, H. S. et al. CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nat. Commun. 10, 1930 (2019).
Article ADS PubMed PubMed Central Google Scholar
Southall, T. D. et al. Cell-type-specific profiling of gene expression and chromatin binding without cell isolation: assaying RNA Pol II occupancy in neural stem cells. Dev. Cell 26, 101–112 (2013).
Article CAS PubMed PubMed Central Google Scholar
Mackay, T. F. C. et al. The Drosophila melanogaster Genetic Reference Panel. Nature 482, 173–178 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Jacobs, J. et al. The transcription factor Grainy head primes epithelial enhancers for spatiotemporal activation by displacing nucleosomes. Nat. Genet. 50, 1011–1020 (2018).
Article CAS PubMed PubMed Central Google Scholar
Southall, T. D., Davidson, C. M., Miller, C., Carr, A. & Brand, A. H. Dedifferentiation of neurons precedes tumor formation in lola mutants. Dev. Cell 28, 685–696 (2014).
Article CAS PubMed PubMed Central Google Scholar
Yang, J., Ramos, E. & Corces, V. G. The BEAF-32 insulator coordinates genome organization and function during the evolution of Drosophila species. Genome Res. 22, 2199–2207 (2012).
Article CAS PubMed PubMed Central Google Scholar
Trevino, A. E. et al. Chromatin accessibility dynamics in a model of human forebrain development. Science 367, eaay1645 (2020).
Article CAS PubMed PubMed Central Google Scholar
Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).
Article CAS PubMed PubMed Central Google Scholar
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324 (2018).
Article CAS PubMed PubMed Central Google Scholar
Domcke, S. et al. A human cell atlas of fetal chromatin accessibility. Science 370, eaba7612 (2020).
Article CAS PubMed PubMed Central Google Scholar
Lake, B. B. et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat. Biotechnol. 36, 70–80 (2018).
Article CAS PubMed Google Scholar
Preissl, S. et al. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat. Neurosci. 21, 432–439 (2018).
Article CAS PubMed PubMed Central Google Scholar
Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Article CAS PubMed PubMed Central Google Scholar
Zhu, C. et al. An ultra high-throughput method for single-cell joint analysis of open chromatin and transcriptome. Nat. Struct. Mol. Biol. 26, 1063–1070 (2019).
Article CAS PubMed PubMed Central Google Scholar
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Article CAS PubMed PubMed Central Google Scholar
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Article CAS PubMed PubMed Central Google Scholar
Gramates, L. S. et al. FlyBase at 25: looking to the future. Nucleic Acids Res. 45, D663–D671 (2017).
Article CAS PubMed Google Scholar
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Article CAS PubMed Google Scholar
Herrmann, C., Van de Sande, B., Potier, D. & Aerts, S. i-cisTarget: an integrative genomics method for the prediction of regulatory features and cis-regulatory modules. Nucleic Acids Res. 40, e114 (2012).
Article CAS PubMed PubMed Central Google Scholar
Chen, J., Li, K., Zhu, J. & Chen, W. WarpLDA: a cache efficient O(1) algorithm for latent dirichlet allocation. Proc. VLDB Endow. 9, 744–755 (2016).
Article Google Scholar
Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
Article CAS PubMed PubMed Central Google Scholar
De Waegeneer, M., Flerin, C. C., Davie, K. & Hulselmans, G. vib-singlecell-nf/vsn-pipelines: v0.26.1. Zenodo https://doi.org/10.5281/ZENODO.3703108 (2021).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Article PubMed PubMed Central Google Scholar
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Article CAS PubMed PubMed Central Google Scholar
Van de Sande, B. et al. A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nat. Protoc. 15, 2247–2276 (2020).
Article PubMed Google Scholar
Stanescu, D. E., Yu, R., Won, K.-J. & Stoffers, D. A. Single cell transcriptomic profiling of mouse pancreatic progenitors. Physiol. Genom. 49, 105–114 (2017).
Article CAS Google Scholar
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Article CAS PubMed PubMed Central Google Scholar
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Amemiya, H. M., Kundaje, A. & Boyle, A. P. The ENCODE blacklist: identification of problematic regions of the genome. Sci. Rep. 9, 9354 (2019).
Article ADS PubMed PubMed Central Google Scholar
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Article PubMed PubMed Central Google Scholar
Zhang, Y. et al. Model-based analysis of ChIP-seq (MACS). Genome Biol. 9, R137 (2008).
Article PubMed PubMed Central Google Scholar
Shih, M.-F. M., Davis, F. P., Henry, G. L. & Dubnau, J. Nuclear transcriptomes of the seven neuronal cell types that constitute the Drosophila mushroom bodies. G3 9, 81–94 (2019).
Article CAS PubMed Google Scholar
Corces, M. R. et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat. Methods 14, 959–962 (2017).
Article CAS PubMed PubMed Central Google Scholar
Aronesty et al. ea-utils: ‘Command-line tools for processing biological sequencing data’. https://github.com/ExpressionAnalysis/ea-utils (2011).
Imrichová, H., Hulselmans, G., Kalender Atak, Z., Potier, D. & Aerts, S. i-cisTarget 2015 update: generalized cis-regulatory enrichment analysis in human, mouse and fly. Nucleic Acids Res. 43, W57–W64 (2015).
Article PubMed PubMed Central Google Scholar
Aughey, G. N., Delandre, C., McMullen, J. P. D., Southall, T. D. & Marshall, O. J. FlyORF-TaDa allows rapid generation of new lines for in vivo cell-type-specific profiling of protein-DNA interactions in Drosophila melanogaster. G3 11, jkaa005 (2021).
Article PubMed Google Scholar
Marshall, O. J., Southall, T. D., Cheetham, S. W. & Brand, A. H. Cell-type-specific profiling of protein-DNA interactions without cell isolation using targeted DamID with next-generation sequencing. Nat. Protoc. 11, 1586–1598 (2016).
Article CAS PubMed PubMed Central Google Scholar
Marshall, O. J. & Brand, A. H. damidseq_pipeline: an automated pipeline for processing DamID sequencing datasets. Bioinformatics 31, 3371–3373 (2015).
Article PubMed PubMed Central Google Scholar
Aerts, S. et al. Robust target gene discovery through transcriptome perturbations and genome-wide enhancer predictions in Drosophila uncovers a regulatory basis for sensory specification. PLoS Biol. 8, e1000435 (2010).
Article PubMed PubMed Central Google Scholar
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
MathSciNet MATH Google Scholar
Kudron, M. M. et al. The ModERN resource: genome-wide binding profiles for hundreds of Drosophila and Caenorhabditis elegans transcription factors. Genetics 208, 937–949 (2018).
Article CAS PubMed Google Scholar
Davis, C. A. et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
Article CAS PubMed Google Scholar
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Article CAS PubMed PubMed Central Google Scholar
Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).
Article PubMed PubMed Central Google Scholar
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Preprint at arXiv (2019).
Bravo González-Blas, C. et al. Identification of genomic enhancers through spatial integration of single-cell transcriptomics and epigenomics. Mol. Syst. Biol. 16, e9438 (2020).
Article PubMed PubMed Central Google Scholar
Frith, M. C., Li, M. C. & Weng, Z. Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res. 31, 3666–3668 (2003).
Article CAS PubMed PubMed Central Google Scholar
Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
Article CAS PubMed PubMed Central Google Scholar
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Article CAS PubMed PubMed Central Google Scholar
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Cusanovich, D. A. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. In Proc. 9th Python Science Conf. 92–96 (2010).
De Rop, F. V. et al. HyDrop: droplet-based scATAC-seq and scRNA-seq using dissolvable hydrogel beads. Preprint at https://doi.org/10.1101/2021.06.04.447104 (2021).

Download references

Acknowledgements

We thank the staff at the Janelia FlyLight Project for publicly providing images and reporter lines to assess enhancer activity in the Drosophila CNS; F. Pinto-Teixeira for providing the ato-Gal4 and acj6/TfAP-2 RNAi lines; and the members of the Aerts laboratory for discussions and for reading the manuscript. This work is funded by the following grants to S. Aerts: ERC Consolidator Grant (724226_cis-CONTROL), the Special Research Fund (BOF) KU Leuven (grant C14/18/092) and the FWO (grants G0C0417N, G094121N). J.J., C.B.G.-B., F.V.D.R. and D.P. are supported by a PhD fellowship of The Research Foundation, Flanders (FWO, 1199518N; 11F1519N; 1S80920N; 1S75219N). 10x Chromium was partially made available through VIB Tech Watch Funding. Imaging, FACS and single-cell analyses were supported by the light microscopy, FACS and single-cell expertise units at the VIB-KU Leuven Center for Brain and Disease Research. Computing was performed at the Vlaams Supercomputer Center (VSC). Stocks obtained from the Bloomington Drosophila Stock Center were used in this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

These authors contributed equally: Jasper Janssens, Sara Aibar, Ibrahim Ihsan Taskiran

Authors and Affiliations

VIB Center for Brain & Disease Research, Leuven, Belgium
Jasper Janssens, Sara Aibar, Ibrahim Ihsan Taskiran, Joy N. Ismail, Katina I. Spanier, Florian V. De Rop, Carmen Bravo González-Blas, Xiao Jiang Quan, Dafni Papasokrati, Gert Hulselmans, Samira Makhzami, Maxime De Waegeneer, Valerie Christiaens & Stein Aerts
Department of Human Genetics, KU Leuven, Leuven, Belgium
Jasper Janssens, Sara Aibar, Ibrahim Ihsan Taskiran, Joy N. Ismail, Katina I. Spanier, Florian V. De Rop, Carmen Bravo González-Blas, Xiao Jiang Quan, Dafni Papasokrati, Gert Hulselmans, Samira Makhzami, Maxime De Waegeneer, Valerie Christiaens & Stein Aerts
Department of Life Sciences, Imperial College London, London, UK
Alicia Estacio Gomez, Gabriel Aughey, Marc Dionne, Krista Grimes & Tony Southall

Contributions

S. Aerts, J.J., S. Aibar and I.I.T. conceived the study. J.J., S. Aibar, I.I.T. and D.P. performed computational analyses with assistance from K.I.S., C.B.G.-B., G.H. and M.D.W.; S.M. and V.C. performed scATAC-seq experiments. J.N.I., J.J. and S.M. performed FACS and omniATAC-seq. J.N.I., X.J.Q., J.J. and S.M. performed antibody staining and visualization. X.J.Q., V.C. and J.N.I. performed the cloning of selected enhancers. S.M. performed omniATAC on DGRP lines. J.N.I., J.J. and F.V.D.R. performed HyDrop-ATAC experiments. V.C. and J.N.I. performed CUT&Tag experiments. A.E.G., G.A. and T.S. performed TaDa experiments. M.D. and K.G. generated the Mef2-Dam line. S. Aibar created the website with assistance from G.H., D.P. and K.I.S.; S. Aerts, J.J., S. Aibar and I.I.T. wrote the manuscript.

Corresponding author

Correspondence to Stein Aerts.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review information

Nature thanks Andrew Adey and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Global analysis and adult clustering approaches.

a. UMAP of global cisTopic analysis (150k cells shown), coloured by region accessibilities near elav (red, neurons), repo (green, glia) and dpn (blue, neuroblasts). b. Overview of regions shown in a for representative cell types (Kenyon cells for neurons, astrocytes for glia, optic lobe neuroblasts for neuroblasts). c. Distribution of cells per timepoint in the global UMAP. Timepoints jointly analysed in the upcoming sections (green: early timepoints, blue: late timepoints) are grouped by borders. d. Spearman correlation of the top 1000 variable regions across timepoints, separating early timepoints (ET), middle timepoints (MT) and late timepoints (LT). e. UMAP after timepoint correction with Harmony (coloured by timepoint: ET in green, LT in blue). f. t-SNE of the 60k cells in the adult cell types analysis (LT). Central-brain-only runs allow clusters to be annotated according to their location (central brain (CB) and optic lobe (OL)). Subclustering was performed by splitting cells into CB, OL and glia; note that Kenyon cells, plasmatocytes and photoreceptors were not included. g. Subclustering of OL neurons leads to 58 subclusters, including a further split of T4/T5 neurons. h. Subclustering of CB neurons reveals 51 subtypes. Notice how the S-shaped separation of pros⁺ cells and Imp⁺ cells is retained. i. Subclustering of glia reveals 16 subtypes. j. Clustering 88k cells from 48h APF to adult provides three extra clusters enriched for younger cells (young only: circle, young enriched: arrows), but does not increase the resolution of the adult cell types. k. Clustering using the ArchR pipeline on 56k cells from 48h APF to adult, leading to 90 clusters. UMAPs are shown after Harmony batch effect correction. l. Heatmap showing the correspondence between clusters from cisTopic and ArchR. Note that cisTopic clusters are merged across age, in contrast to ArchR clusters, leading to multiple ArchR clusters (different ages) mapping to one cisTopic cluster.

Extended Data Fig. 2 Integration of scRNA-seq and snATAC-seq.

a. Calculation of gene-accessibility scores using a weighted sum of regions in the gene body and up to 5 kb upstream of the TSS. Weights decrease exponentially with distance from the TSS (constant in the gene body), and increase with higher Gini (variability) coefficients. b. Gene expression and gene accessibility display a similar pattern for many genes (6 examples shown), which can be used to transfer cell type annotations across modalities (black lines). c. Overview of the annotation methods used. Main cell types are consistently detected with each method, while low-confidence matches are method specific. d. Annotated t-SNE of the transcriptomes of 118k adult cells. e. Integrated t-SNE of scRNA-seq and snATAC-seq using Seurat’s co-clustering. f. Gene set enrichment of marker genes using AUCell on the gene-accessibility matrix, revealing matches per cell type and per major cell type group (glia, optic lobe neurons, central brain neurons, Kenyon cells, photoreceptors and plasmatocytes). g. Scatterplot showing the number of marker genes against the number of cells in the scRNA-seq dataset. Matched cell types between RNA and ATAC are coloured, unmatched are shown in grey. h. Scatterplot showing the number of marker genes in scRNA-seq against the number of DARs for the matched cell types, with glia having the highest number for both. i. Heatmap of DARs per cluster. j. Bar plot showing the number of DARs per cluster.
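
The weighting scheme described for panel a can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the 1 kb decay constant and the (1 + Gini) scaling are assumptions chosen for readability. Only the 5 kb upstream window, the constant gene-body weight, the exponential decay with distance and the increase with the Gini coefficient are taken from the caption.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (0 = even, towards 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    ranked = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * ranked / (n * total) - (n + 1.0) / n


def region_weight(dist_upstream_bp, gini_coef, decay_bp=1000.0, window_bp=5000):
    """Weight of one region.

    dist_upstream_bp: 0 for gene-body regions, else distance upstream of the TSS.
    decay_bp and the (1 + gini_coef) factor are hypothetical choices.
    """
    if dist_upstream_bp > window_bp:
        return 0.0  # outside the 5 kb upstream window
    if dist_upstream_bp <= 0:
        distance_term = 1.0  # constant inside the gene body
    else:
        distance_term = math.exp(-dist_upstream_bp / decay_bp)
    return distance_term * (1.0 + gini_coef)


def gene_accessibility(regions):
    """Weighted sum over (accessibility, dist_upstream_bp, gini_coef) tuples."""
    return sum(acc * region_weight(dist, g) for acc, dist, g in regions)
```

In this sketch, `gene_accessibility([(2.0, 0, 0.5), (1.0, 6000, 0.9)])` counts only the gene-body region, because the second region lies beyond the 5 kb window.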

Extended Data Fig. 3 FAC-sorted cell types match single-cell aggregates.

a. Overview of bulk ATAC-seq on three sorted Kenyon cell (KC) populations: (top) Confocal images of KC subtypes targeted with split-GAL4 lines (p: posterior, m: medial, d: dorsal); (middle) Average accessibility of the top 100 differential peaks from sorted cell types projected on the single-cell ATAC t-SNE; (bottom) Locus of three marker genes, showing similarity between bulk ATAC-seq (black) and the aggregated scATAC-seq profiles. b–e. Heatmap showing the Spearman’s correlation of the FAC-sorted samples and single-cell aggregates over different sets of regions. Matching samples and aggregates are shown in bold. f. Subclustering of T4 and T5 neurons identifies the a/b and c/d subtypes, with differential regions near marker genes TfAP-2 and bi. g. Locus of marker genes showing differential peaks between T4 and T5 neurons (top, TfAP-2, range: 0.8-252) and between a/b and c/d subtypes (bottom, bi, range: 0.2-120). h. Experimental overview: GH146 was used to drive GFP expression in OPNs, followed by FACS and scATAC-seq. i. UMAP showing 309 cells kept after filtering, forming 6 clusters. j. Visualization of gene accessibility near OPN markers, showing heterogeneity between clusters. k. 17k peaks were identified in the OPNs, of which 4.5k are unique, and of which 876 are near OPN marker genes (184 near highly expressed positive markers). These peaks were not found in the consensus peaks (CP) as shown in the tracks. l. Co-clustering UMAP of sorted OPNs and 10x scATAC data, with sorted OPNs shown in yellow (top), corresponding to cluster 37 which contains 10 cells of the 10x data, scattered from multiple clusters (bottom).

Extended Data Fig. 4 Chromatin landscapes of progenitors and developing neurons.

a. An SVM classifier was used to propagate the adult cell type labels to earlier stages in development. The classifier also included the progenitor cell types from the developmental analysis (purple and dark green colours). b. Proportion of cell types at each timepoint. c. Chromatin landscape for T1 neurons, showing highly dynamic opening and closing of peaks during development. A core set remains accessible at all times, of which a subset is specific to T1 neurons. d. Examples of regions with different developmental dynamics for T1 neurons. e. Bar plot showing the number of core regions identified per cell type. Dark colours show specific core regions (core-DARs). f. The number of DARs calculated per cell type (downsampled to 75 cells) for every timepoint shows a decline over time. The arrow marks a small increase at 48h APF during synaptogenesis. The box plot marks the median (red line), upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually; n = 74, 77, 78, 77, 78, 77, 78, 75, 76 cell types. g. Progenitor cell types show specific marker accessibility, while neurons show accessibility in adult-specific regions (Awh, ey). h. Number of DARs per cell type in the early development dataset, revealing a lower number for progenitors (purple shades). i–j. Trajectory from optic lobe neuroepithelium (ONE) to lamina progenitor cells (LPC) and optic lobe neuroblasts (OL NB) using (i) scATAC-seq and (j) scRNA-seq. Heatmap shows dynamic chromatin accessibility modules with enriched motifs (NES shown) and line plot shows expression profiles for predicted master regulators. k. Specific comparison of different progenitor cell types detects thousands of differential regions, with motif enrichment of key TFs. l. In vivo reporter assay of a cloned ONE enhancer driving GFP. m. Optic lobe and central brain branches in 3D-UMAP. n. Central brain and VNC duality between Imp and pros traced through development. Standardized mean accessibility of Imp regions (n=128) and pros regions (n=166) is plotted for different developmental stages. Dati (AAAAAA) motifs and Pros ChIP-seq peaks (embryo, ModERN) are enriched in Imp regions where they are not expressed (grey) and vice versa, suggesting a chromatin closing role. o. AUCell enrichment scores of branch-specific regions for adult OL clusters (box plot marks the median (red line), upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually; number of cells between brackets). Candidate TFs expressed per branch are shown (TFs with matching motifs in bold).

Extended Data Fig. 5 Cistromes overview.

a. TF expression vs motif enrichment for a selection of TFs. b. Heatmap of number of regions per cell type which present motif enrichment for a TF, coloured according to the TF expression-motif enrichment correlation (red: positive correlation, blue: negative correlation). Note that in contrast to c this heatmap does not require the TF to be expressed in the given cell type. c. Dot-heatmap including all available "chromatin-opening" cistromes (full version of Fig. 3c).

Extended Data Fig. 6 Deep learning predicts de novo key transcriptional activators and repressors.

a. t-SNE from the cisTopic analysis on the subset of 15 cell types used for the deep learning (DL) analysis. b. Accessibility of topic regions near marker genes. Calculated as the average region probability for topic-regions linked to each set of marker genes (markers from the transcriptome atlas). c. Comparison of topic coherence and DL classification performance (area under ROC-curve (auROC) for the classification of the left-out test regions). The topic coherence represents how likely the regions of the topic will co-occur (higher values are better). d. Box plot of TF motif enrichment in the topics (average enrichment score) split by the topic annotation (i.e., to one cell type, to multiple cell types, or marked as low contribution). The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. Number of topics per category shown above plot. e. Topic heatmap showing cell-type specific topics. Bar plots show the number of regions per topic (cutoff p=0.995) and area under the precision recall curve (auPR) of the DL model. f. Contributions of the patterns identified by DeepFlyBrain to classify glial regions reveal activators and repressors (negative nucleotide importance). These motifs can be matched to known factors with concordant expression. g–h. Conservation of the regions centred by the motif (blue) or ATAC peak (orange) for (g) KC and (h) T neuron motif instances. The location of the motifs is shown with dashed lines. i. Heatmap showing Jaccard index between TF binding site predictions from DL and regions from conventional motif discovery. j. Box plots showing higher conservation for overlapping regions compared to deep-learning only regions. The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. Number of enhancers per category shown above plot. k. 
Box plots showing higher enhancer-gene link scores for overlapping regions compared to deep-learning only regions. The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. Number of enhancers per category shown above plot. l. Bulk ATAC-seq was performed on brains of 44 different genotypes leading to the identification of caQTLs. caQTLs affecting each of 28k motifs (adjusted p-value of Fisher test versus difference of number of motifs increasing/decreasing accessibility, see Methods). Dots in the same colour affect similar motifs (black: not-significant). m. The fraction of caQTLs predicted to affect chromatin accessibility at different false positive rates (random SNPs). The 5% false positive rate is shown as a dashed grey line. n–p. Effect of SNPs in Mamo (n), Lola-PF (o) and Lola-N (p) motifs on chromatin accessibility. Top-left: DeepExplainer plot for the reference (G) and alternative allele (T), showing a loss of a repressor site for Mamo and Lola-N, and a gain of a repressor site for Lola-PF. Top-right: Candle plots showing predicted accessibility change caused by the SNP for different cell types (increase shown in blue, decrease in red). Bottom-left: Box plot showing bulk accessibility of 44 DGRP lines, split by genotype at this SNP, highlighting an increase in accessibility for the alternative allele. The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); all data points are shown. Number of genotypes associated with either reference (Ref) or alternative (Alt) alleles shown. Bottom-right: Single-cell aggregates over the SNP. q. Overexpression of lola isoform N (lola-N) in glia (repo driver) versus neurons (elav driver, control) leads to the closing of 250 regions with the GATC motif. r. Example of a region in perineurial glia (PNG) and subperineurial glia (SUB), that closes upon overexpression of lola-N in glia. 
The region is also part of the PNG eGRN (see Fig. 5). s. caQTLs affecting DL motifs. Column nUP/nDw: number of SNPs overlapping the motif that increase/decrease accessibility. The FDR is estimated from 1,000 random draws of caQTLs with the same number of SNPs (for example, for ey: draw 6 random caQTLs 1,000 times and count how many of the 1,000 repetitions have at least 5 SNPs increasing accessibility).
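The resampling check described for panel s can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `empirical_fdr`, the encoding of SNP effects as +1/-1, and the balanced pool of 1,000 caQTLs are all assumptions made for the example.

```python
import random

def empirical_fdr(pool_directions, n_snps, n_up_observed, n_repeats=1000, seed=0):
    """Fraction of random draws with at least `n_up_observed` SNPs that
    increase accessibility, when `n_snps` caQTLs are drawn from the pool."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_repeats):
        # Draw without replacement within one repetition (an assumption).
        draw = rng.sample(pool_directions, n_snps)
        n_up = sum(1 for d in draw if d > 0)
        if n_up >= n_up_observed:
            hits += 1
    return hits / n_repeats

# Hypothetical pool of caQTL effects: +1 increases, -1 decreases accessibility.
pool = [1] * 500 + [-1] * 500

# The ey example from the legend: 6 SNPs, at least 5 increasing accessibility.
fdr_ey = empirical_fdr(pool, n_snps=6, n_up_observed=5)
print(fdr_ey)
```

A small returned fraction means that random caQTL sets rarely show a directional bias as strong as the one observed for the motif.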

Extended Data Fig. 7 Overview of TF binding and perturbation experiments.

a. Table showing a summary of all TF binding/perturbation experiments indicating number of affected regions, top motif enrichment and overlap with deep learning binding sites. b. CUT&Tag signal of Repo is enriched over predicted Repo binding sites. c. CUT&Tag signal of Ey is enriched over predicted Ey binding sites. d. Optimization of CUT&Tag for Ey, finding a combination of Pitstop and higher Tn5 concentration to increase the number of regions detected and improve motif enrichment. e. The genomic region near trio contains peaks for both glia and Kenyon cells, with predicted binding sites for Ey and Repo. Ey and Repo CUT&Tag data are normalized against each other, showing biases to the Ey side over Ey binding sites and to the Repo side over Repo binding sites. f. TaDa coverage of Mef2 in optic lobe (Tm1) and Kenyon cells (left); and of Mef2 in γ-KC for predicted Mef2 regions in the optic lobe and the γ-KC. g. Venn diagrams showing overlap of TaDa experiments for Mef2 (left) and Acj6 (right). Motif enrichment is shown with Mef2 motifs enriched in all overlaps of Mef2, but no Acj6 enrichment in unique regions for acj6-TaDa. Strongest enrichments are found for common regions. h. Summary of RNAi results of Fig. 3f (knockdown in γ-KC and T4/T5 neurons). Bar plots show affected regions in both directions upon knock-down, with expected direction accentuated. Enriched motifs for the unexpected direction are shown. i–k. Results of hypergeometric tests of the overlap of TF knockdown affected regions for cistromes (i), cell-type specific cistromes (j) and deep learning binding sites (k). l. Mamo RNAi ATAC peaks from γ-KC are enriched for α/β Kenyon cell regions compared to WT or other knockdowns. m. Examples of three loci where Mamo RNAi has led to increased accessibility of α/β Kenyon cell regions in γ-KC.
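The overlap tests in panels i–k are hypergeometric tests. A minimal standard-library sketch of the upper-tail probability is given below; the function name `hypergeom_sf` and all counts in the usage lines are hypothetical and are not taken from the study.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): probability of seeing at least k overlapping regions when
    n affected regions are drawn from N regions, K of which are in the cistrome."""
    upper = min(K, n)
    total = comb(N, n)
    # Sum exact integer counts first, divide once at the end.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical counts: 1,000 regions in total, 100 in the cistrome,
# 50 affected by the knockdown, 15 of them shared.
p_value = hypergeom_sf(15, N=1000, K=100, n=50)
print(p_value)
```

Exact integer arithmetic with `math.comb` avoids floating-point error in the individual terms, at the cost of speed for very large region sets.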

Extended Data Fig. 8 Enhancers selected by accessible regions generate novel driver lines.

a. Peak in the overlap of two existing Kenyon Cell driver lines recapitulates KC expression. b. DeepExplainer view of the selected element from (a) showing Ey and Mef2 binding sites. c. Existing non-specific driver lines can be broken up into separate, more specific drivers for KC and glia using cell-type specific ATAC-peak signals. d. In-silico overlap of ATAC-peak signals resembles that of in-vitro split-GAL4 lines for T4/T5 neurons. Images courtesy of the Janelia FlyLight Project. e. 63 adult enhancers were selected and cloned into a construct, flanked by gypsy insulators (GI), driving either direct GFP or GAL4 expression from the Hsp70 promoter. Selected peaks have a median size of 580 bp (direct) or 626 bp (GAL4). f. Overview of GFP expression in different cloned enhancers (GAL4 enhancers were crossed with UAS-nlsGFP). Red numbers point to enhancer-IDs (Supplementary Table 8). Green numbers are scores of enhancer activity in the predicted cell type (0: no activity, 1: low, 2: high). g. Bar plots showing validation rate for GFP expression within Kenyon cells (KC), optic lobe (OL), glia (G). Mixed (M) and Negative (N) bar plots are shown as controls. Dark colours mean high expression (2), light colours mean faint expression (1). h. ROC curve showing the performance of different metrics to predict OL activity of the 64 cloned enhancers (including developmental enhancer). i. ROC curve showing the performance of different metrics to predict glial activity of the 64 cloned enhancers (including developmental enhancer).
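The ROC comparison in panels h and i can be summarized by the area under the ROC curve. The sketch below computes it as the probability that an active enhancer outscores an inactive one, counting ties as one half. The function name `auroc`, the example scores, and the choice to treat a GFP score of 1 or 2 as active are assumptions for illustration only.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of active (label 1)
    and inactive (label 0) enhancers; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive enhancer")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical metric values (for example, a prediction score) and GFP scores.
metric = [0.91, 0.75, 0.40, 0.62, 0.15]
gfp_score = [2, 1, 0, 0, 0]
active = [1 if g >= 1 else 0 for g in gfp_score]  # assumed activity threshold
print(auroc(metric, active))
```

The pairwise form is quadratic in the number of enhancers, which is not a concern for a set of 64 cloned enhancers.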

Extended Data Fig. 9 DeepFlyBrain accurately predicts effects of mutations.

a–e. Analysis of cloned enhancers near (a) gish, (b) Appl, (c) Bx, (d) Pkc53e, and (e) CG15117. Accessibility profiles of the loci, DL prediction scores for the WT and mutated (mut) sequences, nucleotide importance scores and in-silico saturation mutagenesis assays, and in vivo enhancer activity of the cloned sequences are shown as in Fig. 4a, b. f, g. In vivo enhancer activity and nuclei count of the WT region and the region with mutated repressors near (f) sNPF and (g) Appl. The expected nuclei count after destroying the repressors is shown as a dashed grey line (20% increase). The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); all data points are shown (one-sided Mann-Whitney U test). Number of measured brains shown.

Extended Data Fig. 10 Gene expression and region accessibility correlation can be exploited to build eGRNs, and classify TF roles according to their network activity.

a. Proportion of BEAF-32 ChIP-seq peaks that have a high-scoring BEAF-32 motif and are accessible in the fly brain. Although the ChIP-seq was performed on whole embryos (0-14 h, mixed sex), most of the motif-containing peaks are ubiquitously accessible across the brain. b. Distance to the closest BEAF-32 peak with motif upstream and downstream of each gene. Most of the genes (86%) are within 50kb of a BEAF-32 peak (46% are between two peaks within 50kb, and 88% within 200kb). For genes further than 50kb, expanding the search space from 50kb to 200kb adds a median of 2 extra links. c. View of genes, topologically associating domains (TADs), genomic regulatory blocks (GRBs), BEAF-32 ChIP-seq peaks and BEAF-32 defined search spaces near the locus of Imp (green), ras (red), Ant2 (yellow) and feo (pink). The lowest track shows the search space for each gene (i.e., the region between the first two BEAF-32 peaks within 200kb of the transcript, skipping 500bp around the TSS. In case there are no peaks within 200kb, 50kb is kept as search space). d. Box plots showing higher correlation for enhancer-gene links contained within BEAF-32 domains compared to links outside. Links that cross boundaries have a higher anticorrelation. The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. Number of links per category is shown with one-sided Mann-Whitney U test. P-values below numerical precision are reported as 0. e. Overview of selected tracks in the Pkc53E locus; only the enhancer-gene links between two BEAF-32 peaks (green bar) are kept. The grey/blue regions on top are the pre-defined regulatory regions used for the analysis; dark blue indicates a region with a link to the gene, grey regions are not differentially accessible. f. 
Regulatory region selection for Pkc53E; the inset shows accessibility of the top region versus gene expression (input to the random-forest), regions with a weight above the threshold are linked to the gene. On the right: t-SNEs showing the gene expression and accessibility, and the resulting network. g. Scatterplot showing the correlation (Pearson’s r=0.4, two-sided test, p=0.001, n=64) between in vivo enhancer reporter activity (GFP score) and the strength of the enhancer-gene link (correlation). Linear regression and 95% confidence interval are shown. h–i. Correlation of gene expression (h.) and TF gene expression (i.) with aggregated accessibility profiles at TSS, averaged gene-accessibility, and averaged accessibility of regions with positive links. Red line shows linear fit, with orange boundaries as the 95% confidence interval. Note the regions near the TSS that have high accessibility but do not lead to gene expression (highlighted in blue) and the increase in performance for the gene-accessibility score in the TF expression, while overall the highest correlation is reached with links. j. Overview of eGRN expression across different cell types. For each TF, the first row is TF expression, with below a heatmap value of normalized enrichment score of the eGRN (NES, gene-set enrichment analysis of eGRN target genes on genes ranked by FC in each cell type). Note that chromatin-repressing eGRNs are less validated, and represent lower confidence. k. Heatmap of number of genes in the TF-eGRNs (in all cell types) split by cistrome type and gene-link correlation (i.e., indicating potential activator/repressor roles). Canonical activators have most of their targets in opening-cistrome with positive-links. Most of the potential repressors are repressors of chromatin (e.g., closing enhancers), rather than opening repressive regions (i.e., regions with negative links). 
Only 4 TFs have a higher number of targets with the negative links: Fkh, Acj6, Oli and Ftz-f1 (cell-type dependent). TFs with an asterisk use 0.20 TF expression-motif correlation threshold. Bold highlights TFs with confirmed roles (in this study or previously known).
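The BEAF-32 defined search space from panel c can be illustrated with a short sketch. It reflects one reading of the legend and is not the authors' implementation: the function name `search_space`, the per-side application of the 50kb fallback, and the coordinates are assumptions, and the 500bp exclusion around the TSS is left out for brevity.

```python
from bisect import bisect_left, bisect_right

def search_space(gene_start, gene_end, peaks, max_dist=200_000, fallback=50_000):
    """Return (left, right) bounds of the enhancer search space for one gene.
    `peaks` is a sorted list of BEAF-32 peak positions on the same chromosome."""
    # Closest peak upstream of the gene, if it lies within max_dist.
    i = bisect_left(peaks, gene_start)
    up = peaks[i - 1] if i > 0 and gene_start - peaks[i - 1] <= max_dist else None
    # Closest peak downstream of the gene, if it lies within max_dist.
    j = bisect_right(peaks, gene_end)
    down = peaks[j] if j < len(peaks) and peaks[j] - gene_end <= max_dist else None
    # Assumed fallback: a fixed window on any side without a nearby peak.
    left = up if up is not None else max(0, gene_start - fallback)
    right = down if down is not None else gene_end + fallback
    return left, right

# Hypothetical gene at 1,000,000-1,010,000 with two flanking peaks.
bounds = search_space(1_000_000, 1_010_000, [900_000, 1_100_000])
print(bounds)
```

Only enhancer-gene links whose region falls inside the returned bounds would then be kept, which mirrors the filtering shown for the Pkc53E locus in panel e.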

Extended Data Fig. 11 A resource of cell-type specific eGRNs.

a. γ-KC eGRN (motif-based) with key TFs in the middle. Genes marked as squares are also present in the DL-filtered eGRN (Fig. 5a). b. Heatmap of Jaccard index between TF target regions in the γ-KC eGRN. c. eGRN of T1 neurons (regions are coloured in blue shades, genes in red; regulatory TFs are in the center). d. Heatmap of Jaccard index between TF target regions in the T1 eGRN. e. Heatmap of eGRN overlap (region based, Jaccard index) of all cell types and all TFs. Examples of eGRNs (scro, TfAP-2, ey and Mef2) are highlighted, showing co-clustering based on TF, and on cell type (genomic context). f. Regulatory network for KCs and perineurial glia, with colour showing the status of the network (average expression and accessibility). Presence of Mamo leads to repression of α/β-KC marker regions and genes, while Lola-N leads to repression of glial marker regions and genes. g. eGRNs for different subtypes of Kenyon cells, T-neuron subclasses and glia are available for exploration on NDEx through https://flybrain.aertslab.org/. Link-outs from each gene to FlyBase and UCSC allow exploration of gene function and chromatin profiles, with all nearby predicted enhancers coloured; link-outs from each region allow inspection of the region with DeepFlyBrain to visualize nucleotide importances, and also link to UCSC to view the genomic context with the selected region highlighted.
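The Jaccard index used in panels b, d and e is the size of the intersection of two region sets divided by the size of their union. A minimal sketch follows; the function names and the region identifiers are hypothetical, and only the TF names ey and Mef2 come from the legend.

```python
def jaccard(a, b):
    """Jaccard index of two collections of region identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jaccard_matrix(targets):
    """Pairwise Jaccard indices for a dict mapping TF name -> target regions.
    Returns the sorted TF names and a square matrix in that order."""
    names = sorted(targets)
    matrix = [[jaccard(targets[r], targets[c]) for c in names] for r in names]
    return names, matrix

# Hypothetical target regions for two TFs.
tf_targets = {
    "ey": {"region_1", "region_2", "region_3"},
    "Mef2": {"region_2", "region_3", "region_4"},
}
tf_names, overlap = jaccard_matrix(tf_targets)
print(tf_names)
print(overlap)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be passed directly to a hierarchical clustering or heatmap routine.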

Extended Data Fig. 12 Enhancer switching is a prominent feature of development.

a. Heatmap showing 458 enhancers that undergo a switch from one cell type to another. Enhancers are grouped based on whether the switch is from (non)neuronal to (non)neuronal. Heatmap shows standardized average accessibility (RPGC): ET: early timepoints (larva-12h APF), MT: middle timepoints (24-48h APF), LT: late timepoints (72h APF-adult). b. Examples of enhancers that switch between cell types for different categories. Given that one region can contain multiple enhancers, it is hard to separate enhancer switches from shifters: the glia enhancer (right) shows a shift, where one peak goes down and an adjacent one becomes accessible. c. t-SNE from the cisTopic analysis on the subset of 18 cell types used for the deep learning (DL) analysis. d. Performance of the DL model for the different topics. e. Examples of topics linked to progenitor cell types (left) and to Kenyon cell subtypes (right). f. TF-MoDISco results for topics linked to progenitors (left) and to Kenyon cells (right), highlighting motifs of TFs expressed in those cell types. The motif for Ase shows negative nucleotide importances, suggesting a chromatin repressing role. g. CG15117 enhancer switches from ensheathing glia (ENS) to T1 neurons. h. Bar plots showing predicted scores of the region for the developmental and adult DL model. i-j. DeepExplainer plot and in-silico mutagenesis plots of the CG15117 enhancer calculated with (i) adult DL model and (j) development DL model. According to the models, the enhancer is repressed in adult ensheathing glia and developing T1 neurons by the same binding site (highlighted with orange box).

Supplementary information

Supplementary Information

Reporting Summary

Supplementary Data 1

FACS gating strategy: gating strategies used in the FACS runs performed on split-Gal4 lines (MB371B, MB418B and MB419B), and for the normal Gal4 lines (knockdown experiments: R16A06 and ato; TaDa: R74G01; sorted OPNs: GH146, sorted cell types: R16A06, R74G01 and ato) together with detailed results.

Supplementary Data 2–5

Supplementary Data 2: VSN config file to run the VSN Nextflow pipeline on the adult scRNA-seq data from Davie et al.². Supplementary Data 3: DeepFlyBrain training data containing a Jupyter notebook to train a DL model. Supplementary Data 4: DeepFlyBrain performance data containing a Jupyter notebook to determine the performance of the DL model. Supplementary Data 5: DeepFlyBrain scoring data and DeepExplainer plots containing a Jupyter notebook to score new regions and view nucleotide importance in the region.

Supplementary Tables

Supplementary Tables 1–11 and a guide of the Supplementary Tables.

Peer Review File

About this article

Cite this article

Janssens, J., Aibar, S., Taskiran, I.I. et al. Decoding gene regulation in the fly brain. Nature 601, 630–636 (2022). https://doi.org/10.1038/s41586-021-04262-z

Received: 30 December 2020
Accepted: 17 November 2021
Published: 05 January 2022
Issue Date: 27 January 2022
DOI: https://doi.org/10.1038/s41586-021-04262-z
Springer Nature Limited
