Main

The Drosophila brain contains around 220,000 cells and is an excellent model for investigating the diversity of cell types. Advances in electron microscopy have yielded connectome maps across the fly brain8,9, while driver lines10 provide genetic access to many cell types11. This diversity of cell types has been bolstered by single-cell transcriptomics analyses of the brain1,2,3,4,5,6,12,13 and the ventral nerve cord14.

The role that transcription factors (TFs) have in the determination of neuronal fate was previously highlighted by imputing unique TF combinations for each cell type that activate or repress target genes2. These combinations arise in neural progenitors through TF-guided temporal and spatial axes of differentiation15. Furthermore, TFs govern key neuronal features such as dendritic targeting and neurotransmitter determination3,16,17,18,19, and altering a single TF can change neuronal fate20. Inferring TFs and their putative target genes is crucial, but transcriptomics analysis leads to high false-positive rates21, as TF activity often cannot be predicted from TF expression levels because it depends on many variables, such as protein activity and localization as well as the presence of co-binding TFs and co-factors.

The recent development of the single-cell assay for transposase accessible chromatin by sequencing (scATAC-seq)22 provides additional understanding of the mechanisms that underlie neuronal identity, by enabling the analysis of which genomic regions encode regulatory information for creating and maintaining each cell type. Integrating genomic enhancers with gene expression would yield precise regulatory programs.

Here we built a single-cell multi-omics atlas across fly brain development, covering neurogenesis, maturation and maintenance. We identify key regulators of neuronal and glial cell identity, decipher the enhancer code for specific neuronal subtypes, generate informed enhancer driver lines and map enhancer gene regulatory networks (eGRNs), all of which are available for exploration online (http://flybrain.aertslab.org).

Unique chromatin landscapes of neurons

To study regulatory programs of neuronal diversity, we profiled chromatin accessibility of 240,919 cells from whole brains at 9 timepoints from larvae to adult, covering crucial stages of development1,18 (Fig. 1a and Supplementary Table 1). This atlas is accompanied by a single-cell RNA sequencing (scRNA-seq) atlas of the adult brain, containing 118,687 high-quality cells divided into 204 clusters, of which 66 are annotated2 (Methods).

Fig. 1: Chromatin landscape of adult brain cell types.

a, Experimental overview showing the number of cells per timepoint (×1,000). Colour switches demarcate runs; CB-only runs are accentuated. b, t-Distributed stochastic neighbour embedding (t-SNE) projections of mature cell types in scATAC-seq and scRNA-seq. Inset: z-normalized NNLS coefficients between RNA and ATAC clusters. c, Gene accessibility (left) and gene expression (right) for pros and Imp. The black lines connect matching clusters. d, Sorted ATAC-seq and aggregated scATAC-seq profiles of α/β-KCs. Scale bar, 100 μm. e, Aggregate profiles of DARs near marker genes across cell types (maximum value = 50 (30 for rgn and ab)). The arrows mark constitutively accessible regions. Inset: the number of accessible regions per cell type; cell-type-specific regions are shown in stronger colour. AST, astrocyte-like glia; C, centrifugal neurons; CSM, chiasm glia; CTX, cortex glia; ENG, ensheathing glia; GMC, ganglion mother cells; L, lamina monopolar neurons; LAM, lamina neurons; LPC, lamina precursor cells; NB, neuroblasts; ONE, OL neuroepithelium; PB, protocerebral bridge neurons; PLM, plasmatocytes; PNG, perineurial glia; PR, photoreceptors; PXN, poxn-neurons of ellipsoid body; SUB, sub-perineurial glia; SUR, surface glia; Tm, transmedullary neurons; VNC, ventral nerve cord. The numbers in b are unidentified clusters.

We first analysed the chromatin landscape of 60,624 cells from adult flies and late-stage pupa (72 h after puparium formation (APF)) and identified 79 stable cell states using cisTopic23 (Methods and Extended Data Fig. 1). We next aggregated the accessibility of regions upstream and within the gene body, and linked scATAC clusters to cell types in the scRNA-seq atlas using co-clustering, marker gene enrichment and non-negative least squares (NNLS) regression (Fig. 1b, c, Extended Data Fig. 2 and Supplementary Discussion). After manual curation, 43 of the 79 ATAC clusters were one-to-one linked to RNA clusters.

The annotated clusters include six glial subtypes (approximately 10–15%), non-brain cells (1%, plasmatocytes and photoreceptors) and neurons (85–90%). Notably, optic lobe (OL) neurons form distinct clusters, whereas central brain (CB) neurons form a continuum. In the CB, three Kenyon cell (KC) subtypes and two smaller cell types of the central complex (ellipsoid body ring neurons, protocerebral bridge neurons) were identified. CB clusters can be split into Imp+ or pros+ cells on the basis of scATAC-seq, recapitulating the differences that were found in scRNA-seq analyses of the brain2,12 and ventral nerve cord14. To validate cell type annotation, we used driver lines that label the three KC subtypes and two OL cell types (Tm1 and T4/T5), and performed bulk ATAC-seq after fluorescence-activated cell sorting (FACS), matching the scATAC aggregates with high concordance (Fig. 1d and Extended Data Fig. 3a–e).

Every cluster has a unique chromatin accessibility profile, with a range of 105 to 4,732 differentially accessible regions (DARs) out of a median of 24,543 accessible regions per cluster, many of which are located close to validated marker genes (Fig. 1e). The transcription start site (TSS) is often ubiquitously accessible, meaning specificity is more distally controlled. Although T4/T5 neurons are inseparable in adult scRNA-seq, 110 DARs were identified between them, and subclustering also separated the a/b and c/d subtypes (Extended Data Fig. 3f, g). Given this high resolution in scATAC-seq, we investigated the missing CB cell types by examining olfactory projection neurons (OPNs). OPNs are identified in a 57,000-entry scRNA-seq dataset2, but not in the 60,000-entry scATAC-seq dataset, nor in an expanded dataset with additional timepoints, nor when using different clustering approaches (Extended Data Fig. 1j–l). However, when we FACS-sorted OPNs for scATAC-seq, 876 peaks were revealed near OPN marker genes (Extended Data Fig. 3h–l), showing that, despite unique chromatin profiles, OPNs, and probably other CB cell types, are more difficult to identify by scATAC-seq compared with scRNA-seq.

Dynamic changes during brain development

To investigate how neuronal diversity is generated, we studied chromatin accessibility changes during development by analysing 135,275 cells from third instar larvae to 12 h APF using cisTopic, obtaining 54 clusters (Fig. 2a). We trained a support vector machine (SVM) classifier5 on the adult cell types to transfer labels to earlier stages, enabling the detection of core sets of specific regions per cell type that remain continuously accessible, as well as DARs that vary over time (Extended Data Fig. 4a–e and Supplementary Table 2). Similar to RNA-seq analyses in which a maximum number of differentially expressed genes was detected at 48 h APF and a minimum in adults1,5, we found a decrease in DARs over time, with a relative spike at 48 h APF during synaptogenesis (Extended Data Fig. 4f).

Fig. 2: Chromatin changes through neuronal development.

a, cisTopic UMAP of 92,954 cells (fragments-in-peaks > 900) from third instar larva to 12 h APF. The grey dashed line connects CB neuroblasts to the CB branch. b, Motif enrichment through development for Imp+ and pros+ cell types. c, 3D UMAP of six OL branches, with number of DARs and motif enrichment.

Progenitor cell types, which are characterized by accessible regions near the neuroblast markers dpn and ase (Supplementary Discussion and Extended Data Fig. 4g, h), form the roots of two main branches in the uniform manifold approximation and projection (UMAP) analysis (Extended Data Fig. 4m)—a continuous CB branch and a tree-like OL branch, indicating different neurogenesis modes. We detected a spike in TF motifs from the neuronal remodelling factors EcR and Sox14 (ref. 24) in Imp+ neurons, but not in pros+ neurons, consistent with their respective pruning roles and with a potential embryonic origin of Imp+ neurons (Fig. 2b). In the OL, six branches emerge, each enriched for the motif of one major class of TFs (such as POU, bHLH and ETS), linked to synaptic partner recognition (Supplementary Table 3) and neurotransmitter determination16 (Fig. 2c).

Cell-type-specific TF-binding sites

To identify cell-type-specific key regulators, we defined a ‘cistrome’ as the combination of a TF with its target enhancers. We developed a dual approach of conventional motif discovery and deep learning (DL)25,26, integrating information from the adult scRNA-seq and scATAC-seq data, to identify TFs per cell type that are both expressed and whose motif is enriched in the accessible regions (Fig. 3a).

Fig. 3: Identification of regulators through multi-omic data integration.

a, Multi-omic approach to identify key TFs. b, c, Conventional motif enrichment. b, TF motif enrichment (normalized enrichment score (NES)) versus mean expression (minimum to maximum) for each cell type. c, The expression (minimum to maximum, colour) and motif enrichment (NES, size) for 39 TFs (a full list is provided in Extended Data Fig. 5). The border indicates the availability of target genes (eGRN; Fig. 5). d, Using DL analysis, we identified nucleotide patterns for KCs (top) and T neurons (bottom) that are linked to TFs with concordant expression. e, Coverage plot of TaDa peaks over regions with and without TF motifs for Mef2 in γ-KCs and Acj6 in T4/T5 neurons. Pred., predicted. f, RNA interference (RNAi) knockdowns in γ-KCs and T4/T5 neurons. The bar plot shows the number of affected regions in highlighted direction; the tracks show example loci (the red box indicates predicted binding site). TF motifs are shown with NES values; the core Mamo motif is underlined.

First, using the conventional approach, we calculated the correlation between TF expression and motif enrichment in DARs per cell type (Fig. 3b and Extended Data Fig. 5). In total, 116 TFs show a strong positive correlation, suggesting that they open chromatin as activators. These cover pan-neuronal, pan-glial and cell-type-specific TFs (Fig. 3c), including known regulators for glia: Repo and Kay27; for KCs: Ey and Mef2 (ref. 28); for ellipsoid body neurons: Grn and D29; for T1 neurons: Ets65A30 and Oc31; and combinations of Acj6, Fkh, TfAP-2 and SoxN/Sox102F in the other T neurons3,5,19,30. Moreover, 131 TFs display a negative correlation between expression and motif enrichment (such as Mamo and Lola-N), suggesting a repressive role.

Second, we trained a convolutional neural network, called DeepFlyBrain, using sequences of co-accessible regions (topics)26 from KCs, T neurons and glia as input, and topic accessibility as output (Methods, Supplementary Discussion, Extended Data Fig. 6a–e and Supplementary Tables 4 and 5). We then used DeepExplainer32 to calculate the contribution of each nucleotide in the prediction of region accessibility, and TF-MoDISco33 to identify motifs from recurring patterns in the contribution scores (Methods and Supplementary Table 6). This revealed that KC enhancers are characterized by Ey, Onecut, Mef2, Mamo and Dati motifs, matching their expression (Fig. 3d (top)). Mamo and Dati have negative nucleotide importance, suggesting that they correlate with closing chromatin. For T neurons, the most important motifs include Fkh, TfAP-2 and Acj6; and, for glia, we found Ct, Repo, Zfh2 and Klu (Fig. 3d (bottom) and Extended Data Fig. 6f). Scanning accessible regions with TF-MoDISco patterns, DeepFlyBrain provides high-confidence, cell-type-specific genome-wide binding-site predictions that show increased sequence conservation compared with the flanking sequences, supporting their functionality (Extended Data Fig. 6g–k).

To validate the predicted binding sites, we performed Eyeless and Repo CUT&Tag34 analysis on whole-brain samples, and found a significant overlap (adjusted P (Padj) < 10^−30, hypergeometric test; Extended Data Fig. 7a–e). We next applied targeted DamID (TaDa35) analysis, finding 8,543 Mef2 peaks in γ-KCs and 10,900 Acj6 peaks in T4/T5 neurons that overlap significantly with predicted cell-type-specific DL and cistrome regions (Fig. 3e; Padj < 10^−30 and Padj < 10^−30, respectively, hypergeometric test). Interestingly, TaDa analysis of Mef2 in Tm1 neurons detects different sites from those in γ-KCs, whereas the Acj6 sites from T4/T5 neurons contain all those found in Acj6 TaDa analysis of all Acj6-expressing neurons16, suggesting a stronger pioneering role for Acj6 (Extended Data Fig. 7f, g). As a third validation experiment, we performed cell-type-specific TF knockdowns followed by FACS and ATAC-seq analysis (Fig. 3f). Mef2, Acj6, Onecut and TfAP-2 knockdowns all resulted in a decreased accessibility of regions with their respective motifs (Fig. 3f and Extended Data Fig. 7h–k), whereas knocking down the predicted repressor Mamo in γ-KCs increased the accessibility of Mamo sites, and led to a partial switch from the γ-KC-type chromatin landscape to the Mamo-negative α/β-KC-type chromatin landscape (Extended Data Fig. 7l, m; Padj < 10^−30, hypergeometric test).
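
The overlap statistics used in these validations can be illustrated with a small sketch. It assumes a universe of candidate regions, some of which carry a predicted binding site and some of which overlap an experimental peak, and computes the one-sided hypergeometric P value for seeing at least the observed overlap by chance. The function name, argument names and toy counts are hypothetical; the region universe and multiple-testing correction used in the study are not reproduced here.

```python
from fractions import Fraction
from math import comb


def hypergeom_overlap_pvalue(universe, predicted, peaks, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    universe:  total number of candidate regions
    predicted: regions carrying a predicted binding site
    peaks:     regions overlapping an experimental peak (the draw size)
    overlap:   regions that are both predicted and under a peak
    """
    total = comb(universe, peaks)
    upper = min(predicted, peaks)
    tail = sum(
        comb(predicted, k) * comb(universe - predicted, peaks - k)
        for k in range(overlap, upper + 1)
    )
    # Exact integer arithmetic first, conversion to float only at the end
    return float(Fraction(tail, total))


# Toy counts, invented for illustration only
p_value = hypergeom_overlap_pvalue(universe=1000, predicted=100, peaks=50, overlap=20)
print(f"{p_value:.3e}")
```

With these toy counts roughly 5 overlapping regions are expected by chance, so an overlap of 20 yields a very small P value.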

Finally, we performed bulk ATAC-seq analysis of the adult brain across 44 homozygous fly lines36, identifying 4,063 single-nucleotide polymorphisms (SNPs) that correlate with chromatin accessibility37 (Padj < 0.05; Extended Data Fig. 6l). DeepFlyBrain identified affected TF motifs for 20% of the SNPs, consistent with previous studies25, with SNPs destroying Mamo or Lola-N repressor binding sites, leading to increased accessibility (Extended Data Fig. 6m–p, s). When we overexpressed Lola-N in glia, the Lola-N GATC sites decreased in accessibility, confirming its repressive role in neurons38 (Extended Data Fig. 6q, r).

Decoding enhancer architecture

The atlas of enhancers, linked to cell types and regulators, enables the design of reporter lines to target cell populations throughout development. Previous efforts to create driver lines for the fly brain (FlyLight10 and the Vienna Drosophila Resource Center) used random regions of 2–3 kb around neuronal genes, causing many lines to be non-specific: 2,551 out of 3,456 FlyLight lines contain more than one ATAC peak, of which 1,796 are DARs. These lines can be made specific by subcloning the individual ATAC-seq peaks (Extended Data Fig. 8a–c). Furthermore, split-GAL4 lines that combine two enhancers through AND logic are recapitulated as the intersection of their ATAC-seq signal (Extended Data Fig. 8d).

Using a more systematic approach, we selected an additional 60 regions for a total of 63 enhancers and tested their enhancer activity in vivo using transgenic fly lines (53 GFP, 10 GAL4; Methods, Extended Data Fig. 8e, f and Supplementary Table 7). The selected regions are accessible in either KCs (24), OL neurons (17), glia (5) or mixed (14) with a size range of 300–1,732 bp, and three negative controls were added that are either ubiquitously accessible or inaccessible. Overall, all of the enhancers show reporter activity in the brain, with 73% showing high activity at any developmental stage; whereas, for 65%, the adult reporter activity is specific to the predicted cell type (Extended Data Fig. 8g–i).

We next examined the relationship between motif architecture and reporter activity using DeepFlyBrain. An enhancer near the sNPF gene (Fig. 4a (column 1)) is predicted to be accessible in γ-KCs and α/β-KCs (Fig. 4a (column 2)); contains candidate Ey-, Mef2-, Onecut- and Sr-binding sites (Fig. 4a (column 3)); and has specific GFP reporter activity in KCs (Fig. 4a (column 4)). In silico mutagenesis identified nucleotides with a high impact on accessibility in the Mef2- or Ey-binding site (Fig. 4a (column 3)), and mutating these nucleotides abolished activity (Fig. 4a (column 4)). A second enhancer near Eip93F is active in T4 neurons, with binding sites for Fkh, Acj6 and TfAP-2 (Fig. 4b). Mutating either the Fkh- or the TfAP-2-binding sites abolishes enhancer activity, confirming their predicted activator roles. Similar analyses were performed on enhancers near Bx, gish, Pkc53E, Appl and CG15117, highlighting TF activator binding sites; GFP reporter activity is lost after mutation of these sites (Extended Data Fig. 9a–e). In the enhancers near sNPF, Bx and Appl, the model predicted that changes in Mamo sites would increase enhancer activity (Extended Data Fig. 9c). Indeed, when these sites were mutated, enhancer activity was not lost but expanded to additional KCs (one-sided Mann–Whitney U-test, combined P = 0.031) (Fig. 4c and Extended Data Fig. 9f, g).

Fig. 4: DL analysis unravels enhancer make-up.

a, b, Analysis of cloned enhancers near sNPF (a) and Eip93F (b). Loci accessibility profiles (column 1). The grey box highlights the cloned region. DL prediction scores (top) and accessibility (bottom) for wild-type (WT) and mutated (Mut) sequences (column 2). Nucleotide importance scores (top) and in silico saturation mutagenesis assays (bottom) of cloned enhancers for different mutations (column 3). The predicted TF-binding sites are indicated by boxes, and the performed mutations are indicated by black circles. In vivo enhancer activity of cloned sequences (column 4). The balls represent the binding sites (Ey (red), Mef2 (green), Onecut (orange), Sr (red), Mamo (grey), Lola-PF (brown), Fkh (purple), TfAP-2 (yellow), Acj6 (magenta)) and the crosses indicate mutated sites. The expected patterns are shown. Dashed circles, KCs; white arrows, T4. c, In vivo enhancer activity of mutated enhancer near Bx (left). Right, Nuclei count of WT (n = 24 brains) and mutated repressors (Mamo and Lola-PF; n = 21 brains). The expected count after mutation is shown by the grey line (20% increase). The box plots show the median (centre line), upper and lower quartiles (box limits) and 1.5× interquartile range (whiskers). All data points are shown. Statistical analysis was performed using a one-sided Mann–Whitney U-test; P = 0.025. d, Examples of positive KC enhancers. e, Performance of different metrics to predict KC activity of 64 cloned enhancers. Scale bars, 100 μm.

We scored every enhancer based on its KC, OL, glia or CB activity (Methods); examples of positive KC enhancers are shown in Fig. 4d. Receiver operating characteristic (ROC) curves show that accessibility alone distinguishes positive from negative enhancers with an area under the ROC curve (AUC) of 0.89 for KC, 0.87 for OL and 0.79 for glia (Fig. 4e and Extended Data Fig. 8h, i). Taking KC activator motif content into account, the AUC increases to 0.935 (Fig. 4e), and 90.5% (19/21) of enhancers that contain at least two activator motifs are active in KCs (Supplementary Table 8).
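
As an illustration of how an AUC of this kind is obtained, the sketch below uses the rank formulation: the AUC equals the fraction of (active, inactive) enhancer pairs in which the active enhancer receives the higher score, with ties counted as one half. The scores and activity labels are invented, and the function name is hypothetical; this is not the evaluation code used for Fig. 4e.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: one predictor value per enhancer (for example accessibility)
    labels: truthy for enhancers active in vivo, falsy otherwise
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one active and one inactive enhancer")
    # A pair counts 1 if the active enhancer scores higher, 0.5 on a tie
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical accessibility scores and KC activity calls
accessibility = [0.9, 0.8, 0.7, 0.4, 0.35, 0.1]
active_in_kc = [1, 1, 0, 1, 0, 0]
auc = roc_auc(accessibility, active_in_kc)
print(round(auc, 3))
```

In this toy example 8 of the 9 active/inactive pairs are ranked correctly, giving an AUC of 8/9.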

Building a resource of eGRNs

Current descriptions of gene regulatory networks (GRNs) have mostly focused on co-expression21, but the availability of transcriptome and chromatin accessibility profiles of cell types enables their regulatory code to be scrutinized. In particular, we aimed to map cell-type-specific eGRNs, including key TFs, as well as their enhancers and target genes (Fig. 5a).

Fig. 5: eGRNs identify cell-type-specific activators and repressors.

a, Linking cistrome regions to genes based on co-accessibility with expression creates eGRNs. A DL-filtered γ-KC eGRN is shown (the full network is shown in Extended Data Fig. 11a). AC, accessibility; Covar., covariance; Enh., enhancer; EX, expression. b, Repression through inhibition of chromatin accessibility. c, Combinatorial gene expression in KCs through chromatin repression. Unknown repressors are shown by motif, activators are shown as balls; the colour code is the same as in Fig. 4. d–f, The enhancer switch of the CG15117 enhancer from glia to T1. d, Dynamics in enhancer accessibility (solid line) and CG15117 expression (dotted line). L3, third instar larvae. e, Enhancer accessibility through development. A, adult. f, Staining of DAPI (blue), Repo (red) and CG15117 enhancer activity (GFP, green) at 15 h APF (P15, top) and adult (A, bottom). Scale bars, 60 μm.

To link cistrome regions to target genes, we calculated a co-variability score of gene expression and region accessibility for a window of 100 kb around each gene (50 kb upstream and downstream, plus introns39), leading to an average of 6 positively linked regions per gene (Methods and Supplementary Discussion). Enhancer–gene links within BEAF-32 domains (average size, 57.7 kb) have higher correlation scores and a lower proportion of negative links, so links crossing these domains (45%) were pruned (Extended Data Fig. 10a–f). The strength of the region–gene links correlates with enhancer activity (Extended Data Fig. 10g), and intronic and distal intergenic regions correlated better with gene expression compared with promoter/TSS accessibility, confirming previous observations40 (Extended Data Fig. 10h, i). The target genes were pruned using gene set enrichment analysis, retaining those of which the expression co-varies with the cistrome TF (Methods).
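
The region-to-gene linking step can be sketched as follows. The example assumes that each gene and region comes with a vector of per-cell-type values, takes the gene body plus 50 kb on either side as the search space, and uses a plain Pearson correlation as a stand-in for the co-variability score, whose exact definition is given in the Methods and is not reproduced here. All names, coordinates and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def link_regions(gene, regions, flank=50_000):
    """Score every region inside the search space of one gene.

    gene:    dict with 'start', 'end' and 'expr' (expression per cell type)
    regions: dicts with 'name', 'pos' (midpoint) and 'acc' (accessibility
             per cell type, same cell-type order as gene['expr'])
    """
    lo, hi = gene["start"] - flank, gene["end"] + flank
    return {
        r["name"]: pearson(r["acc"], gene["expr"])
        for r in regions
        if lo <= r["pos"] <= hi
    }


# Toy gene and regions, invented for illustration
gene = {"start": 100_000, "end": 110_000, "expr": [1, 2, 3, 4]}
regions = [
    {"name": "upstream", "pos": 60_000, "acc": [1, 2, 3, 4]},
    {"name": "downstream", "pos": 120_000, "acc": [4, 3, 2, 1]},
    {"name": "too_far", "pos": 200_000, "acc": [1, 2, 3, 4]},
]
links = link_regions(gene, regions)
```

Positive scores correspond to candidate activating links and negative scores to candidate repressive links; pruning by BEAF-32 domain boundaries would be an additional filter on the returned dictionary.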

This procedure resulted in 171 cistromes forming eGRNs for 45 cell types (Extended Data Figs. 10j, k, 11a–d and Supplementary Table 9), including 87 activator TFs, with 4,995 enhancers linked to 2,025 genes, covering 17% of the adult DARs (13% promoters, 43% intronic, 44% distal), and 39% of the variable genes in the brain. In particular, cell-type eGRNs have on average 5 activator TFs (range, 1–15) that regulate 67 target genes through 81 enhancers. Indeed, 62% of the genes are regulated by multiple regions within the same cell type and 93% of enhancers have multiple TF inputs. The overlap of predicted binding sites for a TF is only high between similar cell types, suggesting a dependency on the presence of co-factors (Extended Data Fig. 11e). The network for γ-KCs (Fig. 5a) reveals that 2/3 of genes are co-regulated by at least two TFs, with the auto-regulatory factors Mef2 and Ey/Toy regulating 97 to 122 genes, alongside Onecut and Sr regulating an average of 38 genes.

Next, we looked into different modes of repression. The first type of repressors represses target genes by reducing chromatin accessibility (Fig. 5b). Mamo is involved in regulating the typical two-out-of-three pattern in KCs by repressing α/β-KC marker genes in α′/β′- and γ-KCs (Fig. 5c). Similarly, Lola-N represses glial genes in neurons (Extended Data Fig. 11f). In the second type, TFs cause nucleosome displacement and chromatin opening, similar to activator TFs, but would recruit co-repressors to repress target genes. This would be manifested by a negative correlation between accessibility and target gene expression. However, these relationships are less common compared with chromatin repression, and were detected only for TFs that also have positive correlation targets, such as Acj6 (Extended Data Fig. 10k), suggesting more complex mechanisms.

Finally, we investigated the cistrome regions throughout development and found that 45% (14,051) become more accessible at late timepoints, with 28.8% increasing after the ecdysone pulse. We also found 458 regions that undergo an enhancer switch, as their accessibility increases in one cell type while decreasing in another (Extended Data Fig. 12a, b). One of these regions, a T1 enhancer that drives CG15117 expression, is accessible in early glia and switches to T1 at 24 h APF (Fig. 5d). Using a scRNA-seq atlas of OL development5, we confirmed that an analogous switch in CG15117 expression occurs, from a developmental glial marker to an adult T1 marker, with a small delay between accessibility and expression changes, as previously observed41. When co-staining the enhancer GFP with the glial marker Repo, the overlapping signal in development disappears in the adult, coinciding with the closing of the enhancer (Fig. 5e, f). To study this phenomenon at the sequence level, a DL model was trained using developmental topics as input (Extended Data Fig. 12c, d). Inspecting the CG15117 enhancer, the model detects the same TAATTA motif in glia and T1 neurons (Extended Data Fig. 12g–j). Given that only a few TFs overlap, this suggests the binding of different factors to the same motif in different cell types. The reuse of the same enhancer in different cell types at different timepoints can also be noticed by differences in expression between larval and adult brains for 47 out of 54 tested enhancers (Extended Data Fig. 8).
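
The definition of an enhancer switch given above (accessibility rising in one cell type while falling in another) can be expressed as a simple rule. The sketch below compares an early and a late accessibility value per cell type and flags regions where the largest gain and the largest loss both exceed a threshold. The threshold, data layout and values are assumptions made for illustration and do not reflect the criteria actually used to call the 458 switching regions.

```python
def find_enhancer_switches(early, late, min_change=1.0):
    """Flag regions that gain accessibility in one cell type and lose it in another.

    early, late: {region: {cell_type: accessibility}} at two developmental stages
    min_change:  hypothetical minimum absolute change for both gain and loss
    Returns {region: (gaining_cell_type, losing_cell_type)}.
    """
    switches = {}
    for region, early_acc in early.items():
        late_acc = late[region]
        delta = {ct: late_acc[ct] - early_acc[ct] for ct in early_acc}
        gain = max(delta, key=delta.get)
        loss = min(delta, key=delta.get)
        if delta[gain] >= min_change and delta[loss] <= -min_change:
            switches[region] = (gain, loss)
    return switches


# Toy accessibility values, invented for illustration
early = {
    "CG15117_enh": {"glia": 5.0, "T1": 0.2},
    "stable_enh": {"glia": 1.0, "T1": 1.0},
    "opening_enh": {"glia": 0.5, "T1": 0.5},
}
late = {
    "CG15117_enh": {"glia": 0.5, "T1": 6.0},
    "stable_enh": {"glia": 1.1, "T1": 0.9},
    "opening_enh": {"glia": 3.0, "T1": 4.0},
}
switches = find_enhancer_switches(early, late)
```

Only the first toy region is reported, because the second barely changes and the third opens in both cell types.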

Discussion

We generated the first single-cell chromatin accessibility atlas of the whole fly brain throughout development, tracing neuronal and glial cell types from birth to maturity. Using an integrated multi-omics approach, we introduced the concept of eGRNs in which TFs are linked to high-confidence binding sites that are linked to target genes. eGRNs can soon be derived for other datasets, given the pioneering work in scATAC-seq in mouse and human40,42,43,44,45 and developments in single-cell multi-omics41,46,47.

Our atlas showed that all cell types in the brain have unique chromatin profiles, often combinatorial, with tens of thousands of accessible regions and hundreds to thousands specific, identifying over 95,000 candidate enhancers, covering 34.4% of the genome. To accurately predict enhancer activity on the basis of the DNA sequence, we integrated DL models with omics data. This ‘smarter’ motif discovery has been shown to reveal motifs that are missed by conventional algorithms, and leads to prediction of TF binding at the base-pair resolution48. We validated these annotations and TF roles using CUT&Tag, TaDa and knockdown experiments, confirming the high quality of the cistromes and DL-based annotations. The library of annotated enhancers was then used as a starting point to clean up or design new driver lines for cell-type-specific genetic access. Developmental dynamics open up the possibility of creating spatiotemporal driver lines with enhancers corresponding to different maturation modules.

By linking TF cistromes, enhancer accessibility and target gene expression, we generated 45 eGRNs covering 87 activator TFs of which 90% have lethal mutations, 62% are linked to known brain phenotypes and 64% are linked to human diseases, providing a foundation for follow-up studies. Many enhancers were regulated by multiple TFs, as highlighted by Ey and Mef2 in KCs. Furthermore, mutating either binding site led to the abolishment of GFP activity in in vivo assays, suggesting cooperativity. The switching enhancers described here represent a considerable fraction of enhancers (~500) encoding multilayered motif architectures, reminiscent of the phenotypic convergence phenomenon of different TFs in different cell types regulating the same enhancer3.

Our regulatory atlas of the brain covers cell types, TFs and enhancers, together with their joint representation as eGRNs and transitions through development. To be of further value to the community, we have made all data publicly available online (http://flybrain.aertslab.org; Extended Data Fig. 11g), enabling users to explore eGRNs with links to SCOPE (http://scope.aertslab.org/#/Fly_Brain/) and UCSC (http://ucsctracks.aertslab.org/papers/FlyBrain/hub.txt). Finally, our adult DL model DeepFlyBrain is available at Kipoi (http://kipoi.org/models/DeepFlyBrain).

Methods

Data reporting

No statistical methods were used to predetermine sample size. Animals that fit the age criteria were selected randomly and the investigators were not blinded to allocation during sequencing experiments and outcome assessment. Cloned enhancers and mutations were blinded using enhancer IDs.

Statistics and reproducibility

At least two technical replicates were performed per timepoint and condition for 10x Chromium (exact details are provided in Supplementary Table 1), aiming for 20,000 cells per timepoint. FACS experiments (cell types, knockdowns) were replicated once, except for OPNs for which two technical replicates were performed. For TaDa, two replicates were performed for controls and Mef2, and one for Acj6. For CUT&Tag analysis, one experiment was performed for Repo, and five for Ey (four of which led to Ey motif enrichment). Ten brains were visualized for cloned enhancers per condition, and representative images were chosen. Statistics were calculated using SciPy49 unless mentioned otherwise. Seaborn was used for visualization.

Genetics

Flies were raised on a yeast-based medium at 25 °C under a 12 h–12 h day–night light cycle. All RNAi experiments were performed at 29 °C. All Drosophila lines that were used in the scATAC-seq experiments were derived from the DGRP collection. One hybrid was created by crossing different DGRP lines, generating genetic diversity. A list of all of the fly lines that were used is provided in Supplementary Table 10.

10x Genomics scATAC-seq

Sample preparation

The experiments were carried out with four different WT polymorphic strains36, enabling a higher number of nuclei per run while detecting and removing doublets (Supplementary Table 1). Furthermore, to enrich for CB cell types that are often hard to detect, we performed two additional runs on adult brains without the OLs, as these contain more than two thirds of all brain cells in numbers, but with a lower diversity than the CB.

Drosophila melanogaster brains were dissected at nine timepoints (third instar wandering larvae, 0 h, 3 h, 6 h, 12 h, 24 h, 48 h and 72 h after puparium formation and adult) of both males and females and transferred to a tube containing 100 µl ice cold DPBS solution. After centrifugation at 800g for 5 min, the supernatant was replaced with 500 µl nuclei lysis buffer comprising 10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 0.1% Nonidet P40, 0.01% Digitonin and 1% BSA, in nuclease-free water. The following procedure was followed to extract the nuclei from the brain tissue: incubation in nuclei lysis buffer on ice for 5 min, transfer to a dounce tissue grinder tube (Merck), 25 strokes with pestle A, incubation on ice for 10 min, 25 strokes with pestle B. The lysis was stopped by adding 1 ml of wash buffer composed of 10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20 and 1% BSA in nuclease-free water. Nuclei were pelleted by centrifugation at 800g for 5 min at 4 °C and resuspended in a 1× nuclei buffer (10x Genomics). Nuclei suspensions were passed through a 40 µm Flowmi filter (VWR Bel-Art SP Scienceware). Nuclei concentration was assessed using the LUNA-FL Dual Fluorescence Cell Counter.

Library preparation

Single-cell libraries were generated using the GemCode Single-Cell Instrument and Single Cell ATAC Library & Gel Bead Kit v1 and Chip Kit (10x Genomics). In brief, fly brain single nuclei were suspended in 1× nuclei buffer. The single nuclei were incubated for 60 min at 37 °C with a transposase that fragments the DNA in open regions of the chromatin and adds adapter sequences to the ends of the DNA fragments. After generation of nanolitre-scale gel bead-in-emulsions (GEMs), GEMs were incubated in a C1000 Touch Thermal Cycler (Bio-Rad) under the following programme: 72 °C for 5 min; 98 °C for 30 s; 12 cycles of 98 °C for 10 s, 59 °C for 30 s, 72 °C for 1 min; and held at 15 °C. After incubation, single-cell droplets were broken and the single-strand DNA was isolated and cleaned using Cleanup Mix containing Silane Dynabeads. Illumina P7 sequence and a sample index were added to the single-strand DNA during library construction via PCR: 98 °C for 45 s; 11–13 cycles of 98 °C for 20 s, 67 °C for 30 s, 72 °C for 20 s; 72 °C for 1 min; and hold at 4 °C. The sequencing-ready library was cleaned up with SPRIselect beads.

Sequencing

Before sequencing, the fragment size of every library was analysed using the Bioanalyzer high-sensitivity chip. All 10x scATAC libraries were sequenced on NextSeq500 and NovaSeq6000 instruments (Illumina) with the following sequencing parameters: 50 bp read 1, 8 bp index 1 (i7), 16 bp index 2 (i5) and 49 bp read 2.

10x data processing

The 10x fly brain samples were each processed (alignment, barcode assignment and UMI counting) with the CellRangerATAC v.1.2.0 count pipeline. The Cell Ranger reference index was built on the third 2017 FlyBase release (D. melanogaster r6.16)50. Sequencing saturations were calculated on the basis of Michaelis–Menten kinetics. Early pupal timepoints were also sequenced, and CellRanger aggr was used to aggregate the sequencing results.

Demuxlet

We used Demuxlet51 to demultiplex the different genotypes that were used in the DGRP-mixed samples, enabling us to remove doublets of two different genetic backgrounds. The vcf file of the DGRP project (available at http://dgrp2.gnets.ncsu.edu/) was lifted over to the dm6 genome and SNPs for DGRP-409 and DGRP-502 were extracted. For DGRP-639 and the DGRP-551-based hybrid, we performed bulk ATAC-seq to generate updated SNP profiles. After combining all SNPs, we retained only SNPs that were unique to one line. This vcf file was then used in Demuxlet with the default parameters, leading to the identification and removal of 43,489 doublets (Supplementary Table 1).
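The retention of line-specific SNPs can be illustrated with a minimal Python sketch. It is not the code used for the analysis: the function name, the tuple representation of a variant and the toy coordinates are all assumptions made for illustration, and real input would be parsed from the combined vcf file.

```python
from collections import defaultdict


def line_specific_snps(snps_by_line):
    """Keep only variants carried by exactly one line.

    snps_by_line maps a line name to a set of (chrom, pos, alt) tuples
    (a hypothetical, simplified stand-in for vcf records).
    """
    carriers = defaultdict(set)
    for line, snps in snps_by_line.items():
        for snp in snps:
            carriers[snp].add(line)
    return {
        line: {snp for snp in snps if len(carriers[snp]) == 1}
        for line, snps in snps_by_line.items()
    }


# Toy input: positions and alleles are invented for illustration only.
toy_snps = {
    "DGRP-409": {("2L", 100, "A"), ("2L", 250, "T")},
    "DGRP-502": {("2L", 100, "A"), ("3R", 42, "G")},
    "DGRP-639": {("X", 7, "C")},
}
unique_snps = line_specific_snps(toy_snps)
```

In this toy example the variant shared by DGRP-409 and DGRP-502 is discarded, so that every remaining SNP identifies a single genetic background.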

scATAC topic modelling and clustering

After removing doublets, we applied additional quality-control filters to select the 240,919 cells used in the downstream analyses (Signac’s nucleosome_signal ≤ 10, global blacklist_ratio < 0.05 and non-outlier blacklist ratio within its own run, number of fragments between 100 and 50,000).
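As a rough illustration, the fixed thresholds listed above can be written as a single predicate. This is a sketch under stated assumptions rather than the original filtering code: the dictionary keys are hypothetical, the fragment bounds are assumed to be inclusive, and the per-run outlier check on the blacklist ratio is deliberately omitted because its exact definition is not given here.

```python
def passes_qc(cell, max_nucleosome_signal=10.0, max_blacklist_ratio=0.05,
              min_fragments=100, max_fragments=50_000):
    """Apply the fixed QC thresholds to one cell (per-run outlier check omitted)."""
    return (
        cell["nucleosome_signal"] <= max_nucleosome_signal
        and cell["blacklist_ratio"] < max_blacklist_ratio
        # Assumption: both fragment bounds are inclusive.
        and min_fragments <= cell["n_fragments"] <= max_fragments
    )


# Toy cells with invented values.
cells = [
    {"barcode": "cell_a", "nucleosome_signal": 2.1, "blacklist_ratio": 0.01, "n_fragments": 4200},
    {"barcode": "cell_b", "nucleosome_signal": 12.0, "blacklist_ratio": 0.01, "n_fragments": 4200},
    {"barcode": "cell_c", "nucleosome_signal": 2.1, "blacklist_ratio": 0.08, "n_fragments": 4200},
    {"barcode": "cell_d", "nucleosome_signal": 2.1, "blacklist_ratio": 0.01, "n_fragments": 60},
]
kept_barcodes = [c["barcode"] for c in cells if passes_qc(c)]
```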

During the initial steps of analyses of the dataset, we tested different clustering algorithms (including cisTopic, Seurat, ArchR and snapATAC). Although most of the methods identified the main cell types, we finally chose cisTopic as our primary clustering method as it provided a slightly better resolution for some of the clusters and subanalyses. Moreover, cisTopic provides a fuzzy clustering in the form of ‘topics’, which is useful for downstream analyses (for example, each region can belong to one or several topics, and each topic can be accessible in specific or multiple cell types).

To run cisTopic23, we created the cell-counts matrix using 129,109 predefined regulatory regions (ctx) based on conservation52 (that is, counted fragments within these regions). Given the large size of our dataset, we implemented WarpLDA53 within the cisTopic package as a faster and more efficient alternative to Collapsed Gibbs Sampling (CGS). WarpLDA uses the delayed update approach, meaning that topic–region and cell–topic distributions are updated after a number of assignments rather than after each assignment, reducing the number of calculations and memory access. This new faster algorithm is available in cisTopic version 3 (https://github.com/aertslab/cisTopic).

We performed topic modelling on the whole matrix (with between 2 and 500 topics, with 500 iterations and finally selecting the model with 500 topics). This analysis was used to obtain an overview of the whole dataset, and to perform the analyses across development. However, we noticed that we obtained slightly better region accessibility predictions, as well as higher clustering resolution, when analysing subsets of the dataset (for example, the T4/T5 split is not detected in this global analysis and the TfAP-2 enhancers are not predicted as differential). We therefore used independent cisTopic runs to perform the analysis of the adult cell types (including adult and 72 h APF, using 200 topics), and developmental stages (larva and 0–12 h APF, 200 topics). This split of stages was chosen on the basis of their similarity (Extended Data Fig. 1c, d).

Adult cell clusters were defined on the basis of a two-level analysis. (1) First, on the adult + 72 h APF cisTopic run, we clustered the adult cells with more than 900 fragments in peaks (FIP) using Louvain clustering on the cell–topic probability matrix (igraph::cluster_louvain, parameters: k = 10, eps = 0.1, treetype="bd"). This led to 55 clusters, including the main cell types identified in scRNA-seq. Note that we chose this strategy based on several alternative analyses, in which we observed that cisTopic benefits from higher numbers of cells, even if some of them have few reads, while the clustering of only high-count cells (FIP > 900) provided clusters that were more stable and more concordant with the scRNA-seq data. (2) The same process was then applied to each of the major groups of cells: OL, CB and glia, using separate cisTopic runs, and consensus peaks instead of the ctx predefined regions (see the section below). These subclustering analyses provided 130 clusters, which might be over-clustered as many of them were not matched to scRNA-seq clusters, but they enabled us to identify some extra cell types (for example, ab-cd split of T4/T5 cells; Extended Data Fig. 3f, g). From these analyses, after the scRNA-seq label transfer (see below), we finally selected 79 clusters as the main annotation.

The clusters for the developmental stages were determined following the equivalent approach on the larva to 12 h APF cisTopic analysis (in this case, only one level of clustering was required).

Gene accessibility matrix

Gene accessibilities were calculated using the cisTopic probabilities of region accessibility per cell. Next, ctx regions inside the gene body and up to 5 kb of its transcription start site (TSS) were selected. An exponentially decaying function was used to assign distance weights (wd) to these regions, whereby regions further away from the gene have lower weights (d as distance in bp from the TSS), similar to ArchR54. To give higher weights to variable regions, we calculated Gini scores per region, where highly variable regions have a high Gini score. Gini scores were then z-standardized and used as an exponent for the variability weight (wv). Final weights were defined as the product of the distance and variability weights, and a weighted sum was calculated to acquire a gene accessibility matrix.

$${w}_{d}={e}^{\frac{-d}{5000}}+{e}^{-1}$$
$${w}_{v}={e}^{{Z}_{{\rm{GINI}}}}$$
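The weighting scheme can be sketched in a few lines of Python. This is an illustrative reimplementation, not the original code: the function names are hypothetical, the Gini scores are assumed to have been computed beforehand, and the z-standardization is applied over whatever set of regions is passed in (using the population standard deviation), which may differ from the set used in the actual analysis.

```python
import math
from statistics import mean, pstdev


def distance_weight(d_bp):
    """w_d = exp(-d / 5000) + exp(-1), with d the distance to the TSS in bp."""
    return math.exp(-abs(d_bp) / 5000) + math.exp(-1)


def variability_weights(gini_scores):
    """w_v = exp(Z_GINI), where Z_GINI is the z-standardized Gini score."""
    mu, sd = mean(gini_scores), pstdev(gini_scores)
    if sd == 0:
        # All regions equally variable: z-scores are 0, so every weight is 1.
        return [1.0 for _ in gini_scores]
    return [math.exp((g - mu) / sd) for g in gini_scores]


def gene_accessibility(region_probs, distances_bp, gini_scores):
    """Weighted sum of region accessibility probabilities for one gene in one cell."""
    w_v = variability_weights(gini_scores)
    return sum(
        p * distance_weight(d) * wv
        for p, d, wv in zip(region_probs, distances_bp, w_v)
    )


# Toy example: two regions with equal Gini scores, at the TSS and 5 kb away.
toy_score = gene_accessibility([0.5, 0.5], [0, 5000], [0.3, 0.3])
```

With these definitions a region at the TSS receives a distance weight of 1 + e^-1, a region 5 kb away receives 2e^-1, and a region with an average Gini score receives a variability weight of 1.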

scRNA-seq clustering

We used the scRNA-seq data from the whole ageing fly brain described previously2, this time using all data from all protocols with updated analysis methods (mostly batch effect correction). Mapping, filtering, normalization, batch effect correction, clustering, marker gene detection and gene regulatory network inference were performed using the VSN pipeline55 (https://github.com/vib-singlecell-nf/vsn-pipelines), which is a Nextflow DSL2 pipeline using CellRanger (10x), Scanpy56, Harmony57 and pySCENIC58. We used the command nextflow -C nextflow.config run vib-singlecell-nf/vsn-pipelines -entry harmony and a description of the nextflow.config file is provided in Supplementary Data 2. Finally, annotations were transferred from the published2 dataset by calculating the adjusted rand index between annotations and the different calculated clusterings. The best-matching clustering was Leiden resolution 10 (224 clusters) and the clusters were annotated if at least 25% of cells in the cluster had the same annotation. If there was no match, the cluster was retained, ending up with 203 clusters. One modification was made to cluster 15 where a higher resolution (Leiden 12) was chosen in which it split in two (a and b), matching the split detected in the RNA-ATAC co-embedding (see below), leading to the final annotation of 66 clusters out of 204. Subsequently, marker genes were calculated in Seurat’s FindAllMarkers using the Wilcoxon method with min.pct = 0.1 and logfc.threshold = 0.2.
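The 25% rule for transferring annotations to clusters can be sketched as follows. The function name, the use of None for cells without a published annotation and the tie-breaking behaviour are assumptions made for illustration; they do not describe the VSN pipeline itself.

```python
from collections import Counter


def annotate_clusters(cluster_ids, reference_labels, min_fraction=0.25):
    """Assign each cluster its most common reference label if that label
    covers at least min_fraction of the cells in the cluster.

    cluster_ids[i] and reference_labels[i] describe cell i; a label of None
    means the cell has no annotation in the reference dataset.
    """
    labels_per_cluster = {}
    for cid, label in zip(cluster_ids, reference_labels):
        labels_per_cluster.setdefault(cid, []).append(label)

    annotation = {}
    for cid, labels in labels_per_cluster.items():
        counts = Counter(label for label in labels if label is not None)
        annotation[cid] = None  # clusters without a match are kept unannotated
        if counts:
            label, n = counts.most_common(1)[0]
            if n / len(labels) >= min_fraction:
                annotation[cid] = label
    return annotation


# Toy example with invented labels.
toy_clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1]
toy_labels = ["T1", "T1", "T1", None, "a", "b", "c", "d", "e"]
toy_annotation = annotate_clusters(toy_clusters, toy_labels)
```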

Label transfer using NNLS, AUCell and Seurat

To assign cell-type identities to adult scATAC-seq clusters (130) we followed three approaches:

First, we used the NNLS method to compare clusters across modalities59, similar to what was described previously43. We calculated average RNA expression profiles per cell type from the annotated scRNA-seq data and averaged gene accessibility profiles for the scATAC-seq clusters using the top 10 marker genes per cell type as features (sorted by Bonferroni-corrected P value). These were then used as input for the algorithm in which an optimal weighted sum is calculated and the weights resemble cluster similarities.

Second, we used AUCell21 to score gene signatures per cell type on the basis of the top marker genes on the gene accessibility matrix. Gene signatures were then averaged per cluster and clusters were assigned to cell types on the basis of their score.

Third, Seurat (v.3)60 was used to integrate the gene accessibility and gene expression data. Separate objects were first created for scATAC-seq and scRNA-seq data, with the gene accessibility matrix used as ‘RNA’ assay and the region-accessibility as ‘peaks’ assay in the ATAC-seq object. The dimensions of the ATAC-seq object were then reduced by latent semantic indexing (LSI) on the peaks assay using RunTFIDF, FindTopFeatures and RunSVD; the number of components used was 50, 70 and 100. Next, the RNA-seq object was log-normalized using NormalizeData with the median of expressed UMIs as the scale factor. FindVariableFeatures was used to find 2,500 variable features in the RNA-seq data to be used as features for integration. Anchors for integration were identified using FindTransferAnchors with the RNA-seq data as reference and the ATAC-seq object as query using canonical correlation analysis with 50, 70 and 100 components. TransferData was then used to transfer annotations from scRNA-seq to scATAC-seq using the LSI weights for weight reduction and dimensions ranging from 2 to 100. To calculate a co-embedding, we used GetAssayData on the variable genes to get the RNA counts and used this as reference data in a second run of TransferData, whereby we impute RNA counts for the ATAC-seq data, again using the LSI weights for weight reduction. The two objects were subsequently merged, followed by scaling of the data, principal component analysis (PCA) and UMAP and t-SNE calculation (ScaleData, RunPCA, RunUMAP, RunTSNE). In particular, for the adult cells, we chose the t-SNE as primary representation for the figures since it looked less cluttered, which enabled better visualization of the intracluster heterogeneity.

We next collapsed annotations across all methods and merged non-annotated low-confidence clusters. Tm1/TmY8 and Mi1 matched to the same cluster, but we could separate the subclusters on the basis of gene accessibility scores of bsh and hth, two markers of Mi1 neurons.

DARs

For each of the adult cell clusters (including both clustering resolutions, plus a super-clustering of glia, OL/CB neurons and KCs), we calculated the DARs on the basis of the predictive distribution from cisTopic (using the Wilcoxon rank sum test, run through the FindMarkers function in Seurat with the default settings, except logfc.threshold, which was lowered to 0.20, and max.cells.per.ident, which was adjusted to balance the contrasts in some of the analyses). For each of the clusters, the DARs were calculated versus the closest cluster in the tree, and versus all of the other clusters in each of the two analyses (that is, each cluster was compared with the rest of the brain, and with the other cells in their same glia/OL/CB/KC category). The DARs were used as a starting point for the follow-up analyses for enhancer–gene links and eGRNs. However, note that 14% of DARs are promoter regions and these are included in all following analyses.

Cell-type-specific bams and bigwigs

We extracted cells per cell type per timepoint (details of the SVM are provided below). Next, we subset the bam files from the runs to contain only reads belonging to the selected cells and created a cell-type-specific bam file. Then we used SAMtools61 to remove duplicates (view -F 0x400) and reads mapping to blacklisted regions62. The remaining reads were then used as input for the bamCoverage function from deepTools63 to create a depth-normalized bigwig file (reads per genomic content, RPGC) with the following parameters: -bs 1 -p 8 --normalizeUsing RPGC --effectiveGenomeSize 142573017.

Consensus peaks

MACS2 (ref. 64) was used to call peaks on cell-type-specific bam files using the callpeak function with the following parameters: macs2 callpeak -q 0.05 -g dm --keep-dup all --nolambda --call-summits --nomodel --shift -75 --extsize 150. This was repeated for all of the timepoints and for the grouped analyses (Adult + P72, L3–P12, P24, P48). Next, the summits were extended to 500 bp (or 150 bp) using slopBed from BEDTools (-l 249, -r 250 (or -l 74, -r 75)). The extended summits were then merged according to ENCODE standards, with first a normalization of the summit score (CPM) followed by iterative peak merging until non-overlapping peaks across all timepoints and cell types were retained. This led to a final number of 95,921 (500 bp) and 207,325 (150 bp) peaks.
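The summit extension and iterative merging can be sketched in Python as follows. This is a simplified illustration and not the pipeline that was run: the function names are hypothetical, the score normalization is assumed to be a per-sample rescaling to a total of one million, and the merging is assumed to keep the highest-scoring peak and discard any lower-scoring peak that overlaps a peak already kept. A 1-bp summit extended by l bases to the left and r bases to the right yields a region of l + 1 + r bp, so -l 74 -r 75 gives 150 bp.

```python
def extend_summit(chrom, summit_start, left, right):
    """Extend a 1-bp summit [summit_start, summit_start + 1) as slopBed -l/-r would."""
    return chrom, max(0, summit_start - left), summit_start + 1 + right


def normalize_scores(scores):
    """Rescale the summit scores of one sample so that they sum to one million."""
    total = sum(scores)
    return [s / total * 1_000_000 for s in scores]


def iterative_merge(peaks):
    """Keep the best-scoring peak, drop peaks overlapping it, and repeat.

    peaks is a list of (chrom, start, end, score) tuples in BED coordinates.
    """
    kept = []
    for peak in sorted(peaks, key=lambda p: p[3], reverse=True):
        chrom, start, end, _ = peak
        overlaps_kept = any(
            k[0] == chrom and start < k[2] and end > k[1] for k in kept
        )
        if not overlaps_kept:
            kept.append(peak)
    return sorted(kept, key=lambda p: (p[0], p[1]))


# Toy example with invented coordinates and scores.
region = extend_summit("2L", 1000, 74, 75)
toy_peaks = [
    ("2L", 100, 250, 5.0),
    ("2L", 200, 350, 9.0),
    ("2L", 350, 500, 1.0),
]
consensus = iterative_merge(toy_peaks)
```

In the toy example the first peak is discarded because it overlaps the higher-scoring second peak, whereas the third peak only abuts it and is retained.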

For the adult cell types, we also ran a more stringent peak calling (-q 0.01), which provided 60,210 disjoint peaks of an average width of 455 bp, and covered 19% of the genome (provided in the UCSC session).

ArchR clustering

ArchR54 was run on the fragments files from CellRanger using the createArrowFiles function with minTSS set to 4 and minFrags to 1000. A first shallow clustering was then performed on the tile matrix using the binned genome. These clusters would next be used to acquire consensus peaks to use in a final clustering. Thus, first an iterativeLSI was calculated on the tilematrix with 30 dimensions. Clusters were derived using the Seurat FindClusters implementation of the Louvain algorithm with resolution 4, leading to 25 clusters. The addGroupCoverages function was then used to calculate the coverage of every cluster, which was used as an input for the addReproduciblePeakSet to obtain consensus peaks. The fragment files were then quantified over the consensus peaks to acquire the final peak matrix. This matrix was then used as input for iterativeLSI, this time using 130 dimensions and 20,000 features, sampling 10,000 cells at a resolution of 2. Harmony was then used on the 130 LSI dimensions to correct for batch effects (variable = sample). The corrected Harmony features were then used to create the final UMAP embedding and clustering in Seurat (Louvain, resolution 4) to find 90 clusters.

Omni-ATAC-seq analysis of FACS-sorted samples

FACS

One-hundred GFP-expressing (MB371B, MB418B, MB419B crossed with UAS-nls.GFP65) and 15 GFP-negative (WT) fly brains were dissected in PBS on ice. The brains were then centrifuged at 800g for 5 min, after which the supernatant was replaced with 50 μl of dispase (3 mg ml−1, Sigma-Aldrich, D4818, 2 mg), 75 μl collagenase I (100 mg ml−1, Invitrogen, 17100-017) and 125 μl trypsin-EDTA (0.05%, Invitrogen 25300054). Brains were dissociated in a Thermoshaker (Grant Bio PCMT) for 15 min at 25 °C at 1,000 rpm and the solution was mixed by pipetting every 5 min. Next, cell suspensions were passed through a 10 μm pluriStrainer (ImTec Diagnostics 435001050) and viability was assessed by the LUNA-FL Dual Fluorescence Cell Counter. Next, four aliquots were made containing GFP-negative brain cells with/without PI (10%) and GFP-positive brain cells with/without PI (10%). FACS was performed on the FACS Aria III (BD Biosciences). The GFP-negative brains were used to set the gates on the machine for cell size and viability (PI), the GFP-positive brains for the GFP fluorescence, after which the GFP-positive cells with PI were sorted (Supplementary Data 1). Between 2,600 and 11,000 GFP-positive cells were sorted and 50,000 cells per negative control. For RNAi experiments, a similar procedure was performed with GFP-labelled T4 cells and mCherry-labelled KCs. After sorting, regular omni-ATAC-seq was performed as described by Corces et al.66.

Analysis of WT experiments

Bulk ATAC-seq was performed on five samples (three GFP-positive cells from driver lines each targeting one subtype of KCs (MB371B, MB418B and MB419B) and two negative controls (GFP-negative cells from MB371B and MB419B)). ATAC-seq reads were trimmed using fastq-mcf67 and a list of sequencing primers. The cleaned reads were then used as input for fastqc for quality control. Next, the reads were mapped to the third 2017 FlyBase release (D. melanogaster r6.16) genome using STAR, and SAMtools was used to sort the bam file. MACS2 was then used to call differential peaks between the positive samples and their negative controls using macs2 callpeak -t pos_sample -c neg_sample -g dm --nomodel; for the positive sample without its own control, both negative controls were used.

Analysis of knockdown experiments

Bulk ATAC-seq was performed on one RNAi knockdown and one WT control sample. ATAC-seq reads were trimmed using fastq-mcf67 and a list of sequencing primers. The cleaned reads were then used as input for fastqc for quality control. Next, the reads were mapped to the third 2017 FlyBase release (D. melanogaster r6.16) genome using Bowtie2, and SAMtools was used to sort the bam file. Bam files were deduplicated and blacklisted regions were removed. MACS2 was then used to call differential peaks between the knockdown and the control sample using macs2 callpeak -t pos_sample -c neg_sample -g dm --nomodel. The .narrowPeak output file was then used in i-cisTarget for motif enrichment68 and for overlap with predicted DL binding sites and cistrome regions using pybedtools. Significant overlaps were determined using a hypergeometric test from Scipy49. Bam files were converted to bigwig using DeepTools with RPGC normalization and coverage plots were drawn using pyBigwig in Python.

CUT&Tag analysis

Library preparation

Nuclei were isolated from dissected adult brains as indicated previously. After centrifugation, the supernatant was removed and isolated nuclei were resuspended in nuclear extraction (NE) buffer (EpiCypher CUT&Tag Protocol v.1.5). Nuclei concentration was assessed using the LUNA-FL Dual Fluorescence Cell Counter. 100,000 nuclei in 100 µl NE buffer were used for each CUT&Tag reaction. All of the subsequent steps were followed according to the EpiCypher CUT&Tag Protocol v.1.5. For each reaction, 11 µl of BioMag Plus Concanavalin A beads (ConA beads, Gentaur, 86057-3) were washed twice with 100 µl cold bead activation buffer and then resuspended in 11 µl of cold bead activation buffer. To bind nuclei to the activated ConA beads, 100 µl of nuclei and 10 µl of activated ConA beads were incubated for 10 min at room temperature. After supernatant removal, nuclei–bead slurry was resuspended in 50 µl Antibody150 buffer. Primary rabbit anti-GFP antibodies (0.5 µg) (Abcam, ab290) were added per sample and incubated overnight at 4 °C. The beads were cleared and resuspended in 50 µl cold Digitonin150 buffer. Anti-rabbit secondary antibodies (0.5 µg; EpiCypher, Cat 13-0047) were added per sample and incubated for 30 min at room temperature. While on a magnet, beads were washed twice with 200 µl cold Digitonin150 buffer. The beads were resuspended in cold Digitonin300 buffer and incubated for 1 h at room temperature with 2.5 µl CUTANA pAG-Tn5 (EpiCypher, Cat 15-1017). To initiate tagmentation reaction, the beads were resuspended in 50 µl cold Tagmentation buffer and incubated for 1 h at 37 °C. Beads were resuspended in 50 µl TAPS buffer and 5 µl SDS Release buffer was added to quench the tagmentation reaction. To release tagmented chromatin fragments into solution, samples were incubated for 1 h at 58 °C. 15 µl SDS quench buffer was added per sample. 
To amplify tagmented chromatin fragments, 2 µl each of individual barcoded P5 and P7 sequencing adapters (10 µM stock) and 25 µl non-hot start CUTANA High Fidelity 2× PCR Master Mix (EpiCypher, Cat 15-1018) were added, and 18 cycles of CUT&Tag-specific PCR parameters were used (58 °C for 5 min, 72 °C for 5 min, 98 °C for 45 s, 98 °C for 15 s, 60 °C for 10 s, 72 °C for 1 min). DNA clean-up was performed using 1.3× AMPure XP beads (Analis, A63880) and DNA was eluted in 15 µl 0.1× TE buffer. The CUT&Tag libraries were analysed using the Bioanalyzer (Agilent High Sensitivity DNA Chip, 5067-4626) and sequenced on the Illumina NextSeq2000 instrument.

Analysis

CUT&Tag reads were processed similarly to ATAC-seq reads from the knockdown experiments. MACS2 was then used to call differential peaks between the targeted TF and the input control sample (IgG) using macs2 callpeak -t pos_sample -c neg_sample -g dm --nomodel. We obtained cleaner results by comparing two different factors against each other, so we contrasted the results for Ey with those of Repo. The .narrowPeak output file from these contrasts was then used in i-cisTarget for motif enrichment68 and for overlap with predicted DL binding sites and cistrome regions using pybedtools. Significant overlaps were determined using a hypergeometric test from Scipy49. Bam files were converted to bigwig using DeepTools with RPGC normalization and coverage plots were drawn using pyBigwig in Python. For the Ey sample, we combined all of the bam files for working Ey experiments (Ey motif highly enriched) and created a total bam and bigwig file. A differential bigwig file was derived using DeepTools’ bigwigCompare on the Ey and Repo bigwig files.

TaDa analysis

Library preparation

UAS-LT3-NDam-acj6 was previously described16 and UAS-LT3-NDam-Mef2 was generated by cloning the Mef2 coding sequence into pUAST-LT3-NDam35,69. A cherry-stop-MEF2-DAM construct (Life Technologies) was subcloned into pUAST-AttB as an EcoRI-XbaI fragment. The resulting plasmid, pUAST-AttB-cherryMEF2DAM, was injected into AttP2 embryos by the Cambridge Fly Facility.

The sequence of the cherry-stop-MEF2-DAM construct is as follows: GAATTCATGGCAACTAGCGGCATGGTTAGTAAAGGAGAAGAAAATAACATGGCAATCATTAAGGAGTTCATGAGATTCAAAGTTCACATGGAAGGTTCTGTAAATGGACATGAATTTGAAATAGAAGGTGAAGGAGAAGGAAGGCCTTATGAAGGAACCCAAACCGCGAAGCTAAAAGTTACTAAGGGTGGCCCATTACCATTTGCATGGGATATCCTTAGCCCTCAATTCATGTATGGGTCAAAGGCTTATGTCAAGCACCCCGCCGACATTCCAGACTATCTAAAGTTATCTTTTCCCGAAGGGTTTAAGTGGGAGCGTGTGATGAACTTCGAAGACGGTGGCGTGGTAACAGTGACTCAGGATTCGTCCCTGCAAGATGGTGAATTTATCTACAAAGTCAAATTAAGAGGAACTAACTTTCCATCTGACGGCCCGGTTATGCAAAAAAAGACAATGGGCTGGGAGGCCTCCTCAGAACGAATGTACCCTGAAGATGGTGCCTTGAAGGGTGAGATTAAACAAAGATTGAAATTGAAAGATGGTGGACATTATGACGCTGAGGTTAAAACGACATACAAAGCTAAGAAACCTGTCCAGCTCCCAGGTGCTTACAATGTAAATATAAAACTTGATATTACATCACATAATGAAGATTATACGATAGTTGAACAATACGAAAGGGCTGAGGGGAGACATAGTACTGGTGGCATGGATGAACTATACAAAGGTTCTGGTACCGCATAATAACATGGGCCGCAAAAAAATTCAAATATCACGCATCACCGATGAACGCAATCGGCAGGTGACCTTCAACAAGCGCAAGTTCGGCGTGATGAAGAAGGCCTACGAGCTGTCCGTGCTCTGCGACTGCGAGATCGCCCTGATCATCTTCTCGTCGAGCAACAAGCTGTACCAGTACGCCAGCACCGACATGGATCGCGTCCTGCTCAAGTACACCGAGTACAACGAGCCCCACGAGTCCCTCACCAACAAGAACATCATCGAGAAGGAGAACAAGAACGGCGTGATGTCGCCGGACTCGCCCGAAGCCGAAACGGACTACACACTCACTCCGCGAACGGAGGCCAAGTACAACAAGATCGACGAGGAGTTCCAGAACATGATGCAGCGCAACCAGATGGCCATCGGCGGTGCGGGTGCCCCTCGCCAGCTTCCAAACAGCAGCTACACGCTGCCCGTTTCTGTTCCGGTGCCGGGATCTTACGGCGACAACCTGCTGCAGGCCAGTCCACAGATGTCCCACACCAACATCAGCCCCCGTCCATCGAGTTCGGAGACGGATTCAGTTTATCCATCGGGTTCCATGCTGGAGATGTCGAACGGCTATCCGCATTCACACTCGCCGCTTGTGGGATCACCGAGTCCGGGTCCCAGTCCTGGCATAGCCCACCATTTGTCCATTAAGCAGCAGTCGCCGGGCAGCCAGAACGGACGAGCTTCCAATCTAAGGGTCGTCATACCGCCCACAATTGCCCCCATACCGCCCAATATGTCAGCGCCGGATGATGTGGGATATGCAGATCAACGACAGAGCCAGACATCGCTTAACACGCCAGTGGTCACGCTGCAGACGCCGATTCCCGCCCTCACGAGCTATTCCTTTGGGGCGCAGGACTTCTCCTCCTCCGGCGTAATGAACAGCGCGGATATCATGAGCCTCAACACCTGGCATCAGGGCCTGGTGCCGCACTCTAGTCTCTCGCACCTGGCTGTCTCGAATAGCACGCCGCCGCCCGCCACCTCCCCCGTCTCCATAAAGGTCAAGGCTGAGCCGCAGTCGCCGCCGAGAGATCTTTCCGCCAGCGGTCATCAGCAGAATAGCAATGGTTCCACGGGCAGCGGCGGATCCAGCAGCAGCACCAGTAGCAACGCCAGCGGAGGAGCAGGAGGCGGTGGAGCCGTCAGCGCAG
CCAATGTCATCACGCACTTGAACAACGTCAGTGTCCTGGCGGGAGGTCCTTCGGGGCAGGGAGGAGGAGGCGGAGGCGGCGGCAGCAACGGAAATGTCGAACAGGCCACCAATCTTAGCGTACTGAGCCACGCGCAGCAACATCACCTGGGCATGCCCAACTCGCGTCCCTCGTCCACGGGCCACATCACACCCACTCCAGGTGCGCCGAGCAGCGACCAGGATGTGCGTCTGGCAGCCGTCGCCGTGCAGCAGCAACAGCAGCAGCCACATCAGCAACAGCAACTAGGCGACTACGATGCCCCCAACCACAAACGGCCGAGAATATCGGGCGGATGGGGCACAGAACAGAAACTCATCTCTGAAGAGGATCTGATGAAGAAAAATCGCGCTTTTTTGAAGTGGGCAGGGGGCAAGTATCCCCTGCTTGATGATATTAAACGGCATTTGCCCAAGGGCGAATGTCTGGTTGAGCCTTTTGTAGGTGCCGGGTCGGTGTTTCTCAACACCGACTTTTCTCGTTACATCCTTGCCGATATCAATAGCGACCTGATCAGTCTCTATAACATTGTGAAGATGCGTACTGATGAGTACGTACAGGCCGCACGCGAGCTGTTTGTTCCCGAAACAAATTGCGCCGAGGTTTACTATCAGTTCCGCGAAGAGTTCAACAAAAGCCAGGATCCGTTCCGTCGGGCGGTACTGTTTTTATATTTGAACCGCTACGGTTACAACGGCCTGTGTCGTTACAATCTGCGCGGTGAGTTTAACGTGCCGTTCGGCCGCTACAAAAAACCCTATTTCCCGGAAGCAGAGTTGTATCACTTCGCTGAAAAAGCGCAGAATGCCTTTTTCTATTGTGAGTCTTACGCCGATAGCATGGCGCGCGCAGATGATGCATCCGTCGTCTATTGCGATCCGCCTTATGCACCGCTGTCTGCGACCGCCAACTTTACGGCGTATCACACAAACAGTTTTACGCTTGAACAACAAGCGCATCTGGCGGAGATCGCCGAAGGTCTGGTTGAGCGCCATATTCCAGTGCTGATCTCCAATCACGATACGATGTTAACGCGTGAGTGGTATCAGCGCGCAAAATTGCATGTCGTCAAAGTTCGACGCAGTATAAGCAGCAACGGCGGCACACGTAAAAAGGTGGACGAACTGCTGGCTTTGTACAAACCAGGAGTCGTTTCACCCGCGAAAAAAGCCGGTTAGTCTAGA.

Parent lines were allowed to lay eggs over a minimum of two days at 25 °C before timed collections were performed to produce the following genotypes: tub-GAL80ts/+;UAS-LT3-NDam/GMR16A06-GAL4 (KCs); tub-GAL80ts/+;UAS-LT3-NDam-Mef2/GMR16A06-GAL4; tub-GAL80ts/+;UAS-LT3-NDam/GMR74G01-GAL4 (T1 and Tm1 neurons); tub-GAL80ts/+;UAS-LT3-NDam-Mef2/GMR74G01-GAL4; tub-GAL80ts/+;UAS-LT3-NDam/atonal-GAL4; and tub-GAL80ts/+;UAS-LT3-NDam-acj6/atonal-GAL4.

Flies were allowed to lay eggs for 2 days at 18 °C in fly food vials. Vials containing those eggs were kept at 18 °C (restrictive temperature) until adult flies eclosed. They were then kept at 18 °C for 3–7 days before being transferred to 29 °C (permissive temperature) for 24 h. For the Dam-repo experiments, adult flies were flash-frozen in dry ice, and stored at −80 °C. A minimum of 50 heads were removed for processing. For all of the other experiments, brains were dissected. For the Dam-Mef2 experiments in KCs, a minimum of 90 brains were dissected per replicate, for the Dam-Mef2 experiments in T1/Tm1 cells, 40 brains were dissected, and for the Dam-acj6 in atonal cells, a minimum of 30 brains were dissected. Two biological replicates were performed for each experiment.

The DamID protocol is as previously described70 with the following modifications: after the overnight DpnI digestion, 0.5 µl of DpnI was added for an extra 1 h incubation and MyTaq HS DNA Polymerase was used for the PCR amplification (instead of Advantage 2 cDNA Polymerase).

Analysis

Sequencing data were mapped back to release 6.03 of the Drosophila genome using a previously described pipeline71. Peaks were called and mapped to genes using a custom Perl program (https://github.com/tonysouthall/Peak_calling_DamID). In brief, a false-discovery rate (FDR) was calculated for the peaks (formed of two or more consecutive GATC fragments) for the individual replicates. Then, each potential peak in the data was assigned an FDR. Any peaks with less than a 0.01% FDR were classified as significant. Significant peaks that were present in all of the replicates were used to form a final peak file. Motif enrichment was then performed using i-cisTarget and direct hits were acquired from the leading-edge regions.

Enhancer assays

Selection of cloned enhancers

The selected regions are accessible in either KCs (24, average size = 621 bp), OL neurons (17, average size = 550 bp), glia (5, average size = 362 bp), or mixed (14, average size = 662 bp), with the size of the cloned regions ranging from 300 to 1,732 bp. We also included negative controls that are either ubiquitously accessible or inaccessible (3, average size = 901 bp). The 53 selected regions for the direct construct differ in their ATAC-seq peak height, specificity, presence of TF binding sites and nearby expressed genes (Supplementary Table 8). However, the 10 GAL4 enhancers were selected on the basis of multiple criteria (KC DL score > 0.35, KC accessibility > 5.6, KC accessibility fold-change > 2.4 and KC gene fold change > 0.3).
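The combined criteria for the GAL4 enhancers amount to a conjunction of four thresholds, which can be sketched as follows. The field names, the region identifiers and the toy values are hypothetical, and the units and scales of the scores (for example, whether fold changes are log-transformed) are not specified here and are simply taken as given.

```python
# Thresholds as listed in the text; the dictionary keys are hypothetical names.
GAL4_THRESHOLDS = {
    "kc_dl_score": 0.35,
    "kc_accessibility": 5.6,
    "kc_accessibility_fc": 2.4,
    "kc_gene_fc": 0.3,
}


def select_gal4_candidates(regions, thresholds=GAL4_THRESHOLDS):
    """Return the regions that exceed every threshold."""
    return [
        region
        for region in regions
        if all(region[key] > cutoff for key, cutoff in thresholds.items())
    ]


# Toy candidate regions with invented identifiers and values.
toy_regions = [
    {"id": "region_1", "kc_dl_score": 0.80, "kc_accessibility": 7.2,
     "kc_accessibility_fc": 3.1, "kc_gene_fc": 0.9},
    {"id": "region_2", "kc_dl_score": 0.30, "kc_accessibility": 7.2,
     "kc_accessibility_fc": 3.1, "kc_gene_fc": 0.9},
    {"id": "region_3", "kc_dl_score": 0.80, "kc_accessibility": 7.2,
     "kc_accessibility_fc": 2.0, "kc_gene_fc": 0.9},
]
selected_ids = [r["id"] for r in select_gal4_candidates(toy_regions)]
```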

Cloning and visualization of enhancers

Selected enhancers were scored for the presence of homopolymers (>10) and GC content and small modifications to the sequence were made if needed. Sequences were synthesized by Twist Biosciences and inserted into the pTwist ENTR vector. Gateway cloning was then used to insert the sequence into the pH-Stinger vector containing nuclear GFP, Hsp70 promoter and gypsy insulators72. Next, the plasmids were sent to FlyORF (CH) and divided into six pools that were injected in Drosophila embryos (21F site on chromosome 2L). Positive transformants were selected and PCR was used to determine the identity of the enhancer in each line. This pipeline of pooled injections recovered a transgenic line for 54 of the 59 enhancers. Larval, pupal (15 h and 24 h) and adult flies were then dissected and stained using the immunofluorescence protocol for GFP, brp, repo and DAPI. Enhancers were scored using the following system: no expression (0), low on-target expression (1) and high on-target expression (2). The results are provided in Supplementary Table 8. Tests on the success rate were performed using two-sided Fisher’s exact tests. An additional set of ten enhancers was selected for producing KC Gal4 drivers. Gateway cloning was used to insert the sequences into the pBPGUw plasmid (gift from G. Rubin; Addgene plasmid, 17575, http://n2t.net/addgene:17575, RRID: Addgene_17575). Plasmids were injected into Drosophila embryos individually. All of the other steps were repeated in the same manner. These Gal4 lines have been deposited in the Bloomington Drosophila Stock Center.

Immunohistochemistry analysis of split-Gal4 lines and larval brains

For immunofluorescence, brains were dissected and transferred to a tube containing 100 μl ice cold DPBS solution. After centrifugation at 800g for 5 min, the supernatant was replaced with 4% formaldehyde in PBT 0.3% (DPBS + 0.3% Triton X-100 (Sigma-Aldrich)) and incubated at room temperature with rotation for 15 min. Brains were washed three times with PBT 0.3%, rotating for 10 min at room temperature each time and then blocked in Pax-DG (10 g BSA (Sigma-Aldrich), 3 g Deoxycholate Acid (Sigma-Aldrich), 3 ml Triton X-100 (Sigma-Aldrich), 50 ml Normal Goat Serum (MP Biomedicals), 100 ml 10× PBS, 850 ml H2O) for 2 h at room temperature with rotation. Primary antibody mixes were created in Pax-DG (dilutions detailed in Supplementary Table 11) and brains were incubated in these mixes overnight at 4 °C with rotation. The next day, brains were washed three times with PBT 0.3%, rotating for 10 min at room temperature each time and then stained with secondary antibody mixes in Pax-DG (dilutions detailed in Supplementary Table 11) for 2 h at room temperature with rotation. Brains were washed three times with PBT 0.3%, rotating for 10 min at room temperature each time and mounted in Mowiol mounting medium (Sigma-Aldrich). Imaging was performed using the Nikon C2 and Nikon A1 confocal microscopes. A list of the antibodies used is provided in Supplementary Table 11.

Immunohistochemistry analysis of adult brains

Brains were dissected in PBS and transferred to a tube for fixation in 4% formaldehyde in PBS for 20 min at room temperature. Brains were washed in PBS with 0.3% Triton-X (PBST) three times for 20 min each. Next, brains were placed in blocking solution (5% normal goat serum (Sigma-Aldrich) in PBST) overnight at 4 °C. The samples were then incubated in primary antibodies diluted in blocking solution overnight at 4 °C. The following antibodies were used: rabbit anti-GFP (Abcam, 1:1,000). Brains were then washed three times in PBST for 20 min each and incubated with a fluorochrome-conjugated secondary antibody (AlexaFluor-488 anti-rabbit (Abcam, 1:500)) for 2 h at room temperature. The brains were washed three times in PBST for 20 min each. Finally, the samples were mounted onto microscope slides with Prolong Glass Anti-fade solution (Thermo Fisher Scientific) for subsequent analysis using the Nikon TiE A1R confocal microscope. All of the images were acquired and analysed using the Nikon NIS Elements. A list of used antibodies can be found in Supplementary Table 11.

Enhancer ROC curves

We used the scikit-learn73 framework to fit a ROC curve separating adult high-quality enhancers (score = 2) from the other cloned enhancers. As features, we used peak height, peak specificity (Z score, log-transformed fold change, P value of DAR), motif content from DL, DAR and/or eGRN membership and correlation of peak accessibility with gene expression.

eGRN creation

Motif analysis

Conventional motif enrichment was performed using either i-cisTarget or RcisTarget68. These methods are based on a combination of hidden Markov models, cross-species comparisons, and a ranking-and-recovery statistic that together provide an optimal balance between precision and recall. By default, we used as background a genome-wide collection of 134,000 regions (non-overlapping bins that are optimally cut on the basis of conservation) that cover the entire non-coding genome, removing the coding regions. These regions were scored with version 9 of our motif collection of curated PWMs (25,000 PWMs). CG content is controlled because of the genome-wide background, but overall has less of a role in the Drosophila genome in which CpG islands do not occur.

For DARs, different parameters were used in RcisTarget (aucMaxRank = 0.01 and 0.05, motif collection version 9 and the modERN TF ChIP–seq database74) for each of the DAR sets with at least 10 regions. Here we also used all the regions in topics as additional background (re-ranking the database).

The results from these analyses are available online (http://flybrain.aertslab.org).

Cistromes

For building cistromes, we focused on cell types linked to a scRNA-seq cluster, regrouping the CB clusters into CB-Pros and CB-Imp to be able to establish the link to their transcriptome (T4 and T5 cells were analysed as independent clusters from ATAC, but both mapping to the same T4/T5 RNA cluster).

For each cell type, the cistromes were built on the basis of the motif enrichment analysis of upregulated DAR sets with at least 10 regions. Each significantly enriched motif (NES ≥ 3) was annotated to expressed TFs on the basis of cisTarget’s ‘direct’ and ‘inferred by orthology’ annotations (considering TFs expressed when expression > 0 in at least 10% of the cells of the given type/cluster). Note that, as cisTarget’s annotation includes some non-TF DNA-binding proteins, we retained only the 459 TFs listed as such in FlyBase and the GO MF annotation.

For the significant motifs (NES ≥ 3.0), we split the TF–motif pairs by the correlation of the TF expression and motif enrichment score across the cell types, which resulted in ‘opening chromatin’ cistromes (positive correlation > 0.40 or 0.20) or ‘closing chromatin’ cistromes (negative correlation < –0.40 or –0.20). We also retained all motifs merged under an ‘unclear direction’ set to be able to detect TFs of which the activity might be regulated at post-transcriptional levels (those can be explored on the website). For each of the motifs that were significantly enriched in a DAR set for a cell type in which the TF is expressed (note that, for the ‘closing’ cistromes, we did not require the TF to be expressed), we retrieved the DARs in which the motif had a significantly high score (that is, leading edge52 using RcisTarget::getSignificantRegions).
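The direction split can be illustrated with a minimal sketch, assuming the stricter ±0.40 cut-off and a plain Pearson correlation across cell types. The function names and the toy values are hypothetical and are not taken from the analysis code; pairs that receive no direction here would still be part of the merged ‘unclear direction’ set.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector carries no direction information
    return cov / (sx * sy)

def cistrome_direction(tf_expression, motif_nes, threshold=0.40):
    """Return 'opening', 'closing' or None for one TF-motif pair.

    Both inputs hold one value per cell type, in the same order.
    The threshold is an assumption (0.40; the text also mentions 0.20).
    """
    r = pearson(tf_expression, motif_nes)
    if r > threshold:
        return "opening"
    if r < -threshold:
        return "closing"
    return None

# Toy example: expression rises together with motif enrichment.
print(cistrome_direction([0.1, 0.5, 2.0, 3.0], [0.0, 1.0, 4.0, 6.0]))
```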

The dot heat map in Fig. 3 shows the average TF expression by cell type (that is, the average of all the cells in the cluster, after normalizing each cell on the basis of its total counts), normalized to its maximum (each gene divided by its maximum value), and the NES of the highest scoring motif (NES capped at 9).

Gene–enhancer links

Previous research showed that regulatory interactions can occur over large distances but are mostly confined within chromatin domains, in ‘genomic regulatory blocks’ (127 kb median size), a HiC-derived topological associated domain (TAD, 13 kb median size) or between two BEAF-32 boundary elements (57 kb median distance). On the basis of the comparison of these domains (Extended Data Fig. 10a–d), we decided to set a default search space of 50 kb around each gene for enhancer–gene links, which we then pruned using the BEAF-32 peaks.

We calculated the enhancer-to-gene links using the 43 matched clusters between RNA and ATAC plus CB-Pros and CB-Imp (ATAC T4 and T5 clusters were merged to match T4/T5 in RNA). For each cluster and data modality, 200 pseudocells were created as a bootstrap of five cells of the cell type. Each transcriptome pseudocell was then matched to a chromatin pseudocell of the same cell type to calculate the Pearson correlation and Random Forest regression (GENIE3) between each gene’s expression and the predicted accessibility (cisTopic cell-region probability) of the regions within 50 kb of its longest transcript (50 kb upstream of the TSS and 50 kb downstream of the end, plus the introns). The GENIE3 scores were filtered using the Binarize::binarize.BASC function in R. We then created a score, based on the aggregated ranking of these two measures plus the region accessibility, to enable us to select the top regulators. The maximum value of this score was scaled to 1,000 for compatibility with the UCSC Genome browser (in which we suggest a minimum threshold of 600 for link visualization). The links were then split into positive or negative links on the basis of the correlation of region accessibility and gene expression.
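The pseudocell step can be sketched as follows. This is a minimal illustration only: it assumes that the five cells are drawn with replacement and averaged, neither of which is specified above, and the function name and toy matrix are hypothetical.

```python
import random

def make_pseudocells(cells, n_pseudocells=200, cells_per_pseudocell=5, seed=0):
    """Build pseudocells for one cell type and one data modality.

    `cells` is a list of equal-length feature vectors (genes or regions).
    Assumption: cells are sampled with replacement and averaged.
    """
    rng = random.Random(seed)
    n_features = len(cells[0])
    pseudocells = []
    for _ in range(n_pseudocells):
        draw = [rng.choice(cells) for _ in range(cells_per_pseudocell)]
        pseudocells.append(
            [sum(cell[j] for cell in draw) / cells_per_pseudocell
             for j in range(n_features)]
        )
    return pseudocells

# Toy cell type with three cells and two features.
toy_cells = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
pseudo = make_pseudocells(toy_cells)
print(len(pseudo), len(pseudo[0]))
```

Under this sketch, pseudocell i of the transcriptome would be paired with pseudocell i of the chromatin data for the same cell type before computing correlations across all cell types.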

To identify the links within BEAF-32 domains, we used ChIP–seq on whole Drosophila embryos (mixed sex embryo of 0–14 h; ENCODE dataset, https://www.encodeproject.org/files/ENCFF704WGH/)75. The peaks were filtered on the basis of the enrichment of the BEAF-32 motif (i-cisTarget analysis with the default settings), and their accessibility in the adult fly brain (most of the peaks are ubiquitous across cell types; Extended Data Fig. 10a–d). We then defined the BEAF-32-based search space for each gene, taking the longest transcript and extending (upstream and downstream) until the first BEAF-32 peak within 200 kb (skipping the 500 bp around the TSS). If there were no peaks within 200 kb, 50 kb was kept as search space. In 82% of the eGRNs, there is a slightly higher GSEA enrichment score with the TF co-expression module (see below) when using only links within BEAF-32 peaks.
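The search-space rule can be sketched as below. The sketch makes several assumptions that are not stated above: peaks are reduced to single coordinates, ‘the 500 bp around the TSS’ is read as ±500 bp, and the 50 kb fallback is applied to each side independently. The function name and coordinates are hypothetical.

```python
def beaf32_search_space(gene_start, gene_end, tss, peak_positions,
                        max_dist=200_000, fallback=50_000, tss_skip=500):
    """Return (left, right) bounds of the search space for one gene.

    gene_start/gene_end delimit the longest transcript; peak_positions
    are BEAF-32 peak coordinates on the same chromosome.
    """
    # Ignore peaks that sit directly on the promoter (assumed +/- tss_skip).
    usable = [p for p in peak_positions if abs(p - tss) > tss_skip]
    left_hits = [p for p in usable if gene_start - max_dist <= p < gene_start]
    right_hits = [p for p in usable if gene_end < p <= gene_end + max_dist]
    # Nearest boundary on each side, otherwise fall back to a fixed window.
    left = max(left_hits) if left_hits else gene_start - fallback
    right = min(right_hits) if right_hits else gene_end + fallback
    return max(left, 0), right

print(beaf32_search_space(100_000, 120_000, 100_000,
                          [99_800, 90_000, 150_000, 400_000]))
```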

Finally, we also checked whether using these links is better than just using all of the genes within a certain distance of the cistrome region. We converted the cistrome regions to genes using three approaches: (1) all proximal genes (5 kb upstream, plus introns); (2) all distal genes (such as regions within 50 kb of the gene); and (3) the newly calculated enhancer–gene link based on expression and accessibility. We then used GSEA to check the enrichment of the resulting gene sets on the cell type markers, which confirmed that using the links is clearly better (that is, gives more enrichment) than just using all genes near cistrome regions.

eGRN integration

The regions in each of the cell-type-specific cistromes were converted to genes on the basis of the enhancer–gene links with a score of ≥600, splitting the resulting gene sets according to positive or negative links. We then used GSEA to check whether each of these gene sets (with at least five target genes) is enriched in each of the TF co-expression modules (using 5,000 permutations of GSEA). For each TF co-expression module (that is, ranking) we kept the significant cistromes for the same TF (P < 0.01), and selected the genes in the leading edge to build the eGRNs. To finalize the eGRNs per cell type, for each of those genes, we next retrieved the linked regions in the cistrome within the BEAF-32-based search space. Thus, we obtained the connections TF–region–gene.

eGRN plots

To display the eGRNs as networks in Cytoscape (v.3.8.0)76, we focused on the positive region–target gene links, and the genes expressed in at least 15% of the cells of the specific cell type (except T4/T5 neurons, for which we used 5% instead).

In addition, the Cytoscape networks also display differential expression and accessibility for each node (gene or region, respectively). The differential expression was calculated by contrasting the cell type versus all other cells (avg_logFC calculated using the Seurat function FindMarkers); the accessibility in the cell type was calculated by taking the mean over the interval with subsequent RPGC normalization.

DeepFlyBrain

cisTopic run

KCs (3), T-neurons (6) and glia (6) were selected from the adult and 72 h APF datasets leading to a total of 17,554 cells covering 15 cell types. The selected cells were rescored on a set of 207,000 150 bp peaks (see the consensus peaks), which were extended to 300 bp for optimal resolution in DL. We ran cisTopic on this subset of the data and with the new set of peaks. Given the smaller number of cells, we used the conventional Collapsed Gibbs Sampler method (runCGSModels) from 1 to 100 topics, with 500 iterations using 250 as burn-in. Using selectModel, we selected the model with the highest log-transformed likelihood leading to 81 topics. Using runtSNE without PCA on the probability matrix with the cells as target, we acquired the 2D embeddings. We then calculated scores for the topics per region using getRegionsScores with method=‘NormTop’ and scale=TRUE. Finally, we used binarizecisTopics with thrP = 0.975 to get 81 sets of peaks. These region sets were annotated to the different cell types on the basis of accessibility per cell type and region features (such as promoters, BEAF-32) on the basis of motif enrichment and the annotateRegions function using the Drosophila datasets.

Model training

These sets of regions were then used as input for a DL model, where 500 bp DNA sequences were used to predict the topic set to which the region belongs. The architecture of the model was adopted from earlier studies in which cisTopic clusters were likewise used as input for the DL model (DeepMEL26, DeepMEL225). The model is a hybrid CNN–RNN multiclass classifier77; details of the model architecture are provided in Supplementary Table 4. Relative to the architecture proposed earlier, we increased the number of filters from 128 to 1,024, of which 747 were initialized as known PWMs representing 212 TFs. To be able to initialize the filters with the long PWMs, we also increased the filter size to 24.

Model performance

To assess the performance of the model, we performed ninefold cross validation whereby we split the regions into ten groups (10% of the input regions for each group). One of the groups is left out as a test set while the other nine groups are used for the ninefold cross validation. For each fold, one of the nine groups was used as the validation set (10% of the input regions) and the rest (80% of the input regions) was used as the training set. After splitting the regions into ten groups and before training the model, to increase the sample size for the DL model, we augmented the regions by extending them to 700 bp and used a sliding window of 500 bp with a 50 bp stride, increasing the sample size five times. During the training, the validation set was used for early stopping and the 83rd epoch (best in the main model) was chosen to evaluate the performance of cross-validation models. After training, we assessed the performance of the nine models on the non-augmented test set by scoring the test set regions with the models. Then, using the prediction scores and the topic labels, we calculated the area under the precision-recall (auPR) and receiver operating characteristic (auROC) curves using the average_precision_score and roc_auc_score functions from the scikit-learn package. Performance metrics for each topic are provided in Supplementary Table 5.
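The splitting and augmentation scheme can be sketched as follows. This is an illustration under stated assumptions: the held-out test group is arbitrarily taken to be the first one, regions are extended symmetrically around their centre, and all function names are hypothetical. With a 700 bp extension, a 500 bp window and a 50 bp stride, the offsets 0, 50, 100, 150 and 200 give the five windows per region mentioned above.

```python
import random

def split_into_groups(region_ids, n_groups=10, seed=0):
    """Shuffle regions and deal them into n_groups roughly equal groups."""
    ids = list(region_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_groups] for i in range(n_groups)]

def ninefold_splits(groups):
    """Hold out groups[0] as the test set (assumption) and rotate the
    validation set over the remaining nine groups."""
    test, rest = groups[0], groups[1:]
    for i, validation in enumerate(rest):
        train = [r for j, g in enumerate(rest) if j != i for r in g]
        yield train, validation, test

def sliding_windows(start, end, extended=700, window=500, stride=50):
    """Extend a region to `extended` bp around its centre and tile it."""
    centre = (start + end) // 2
    ext_start = centre - extended // 2
    offsets = range(0, extended - window + 1, stride)
    return [(ext_start + o, ext_start + o + window) for o in offsets]

groups = split_into_groups(range(100))
for train, validation, test in ninefold_splits(groups):
    pass  # each fold: 80 training, 10 validation and 10 test regions
print(sliding_windows(1000, 1300))
```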

Here we notice a discordance between topics, with some topics achieving high validation scores, while others receive low ones. We noticed that cell-type specific topics in general have a higher score (Extended Data Fig. 6b–d). Indeed, in our analysis, the number of topics (81) is much higher than the number of cell types used as input for our case study (15). Thus, not all topics correspond to cell types and, therefore, not all topics need to have high validation scores. Some topics represent only noise, promoters or generally accessible elements. These background topics are useful for the training and are therefore retained, but present less interesting biological insights.

Nucleotide contributions

To find the nucleotides that contribute the most to the topic prediction, we used a network explaining tool called DeepExplainer from the SHAP package32. The tool was initialized with 500 random sequences and the default parameters were used. The importance score obtained from the DeepExplainer analysis was multiplied by the one-hot encoded DNA sequence and visualized as the height of the nucleotide letters as in earlier work78. In addition to the nucleotide importance plots, we performed in silico saturation mutagenesis in which we calculated the effect of each variant of a region on its model prediction score. The sequences with all possible single mutations were generated and the delta prediction score for each topic was calculated. The code that was used to train the model, to measure the performance, and to calculate the prediction scores and the nucleotide importances is provided in Supplementary Data 35.
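In silico saturation mutagenesis can be sketched as below. The scoring function stands in for the trained model and is purely hypothetical (here a toy motif counter), and the delta is computed as mutated minus reference, which is an assumption about the sign convention.

```python
def single_nucleotide_variants(seq):
    """Yield (position, ref, alt, mutated_sequence) for every single
    substitution of a DNA sequence."""
    for i, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                yield i, ref, alt, seq[:i] + alt + seq[i + 1:]

def saturation_mutagenesis(seq, score_fn):
    """Map (position, alt) to the change in score caused by the variant.

    score_fn is a placeholder for a model prediction for one topic.
    """
    ref_score = score_fn(seq)
    return {
        (i, alt): score_fn(mutated) - ref_score
        for i, _ref, alt, mutated in single_nucleotide_variants(seq)
    }

def toy_score(seq):
    """Toy stand-in for the model: number of GATA occurrences."""
    return seq.count("GATA")

deltas = saturation_mutagenesis("TGATAC", toy_score)
print(len(deltas), deltas[(2, "C")], deltas[(0, "A")])
```

A sequence of length L yields 3L variants, so a 500 bp region would require 1,500 model evaluations per topic in addition to the reference.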

TF-binding site predictions

High nucleotide importances on DeepExplainer plots represent potential binding sites for TFs. We used TF-MoDISco (v.0.5.5.4)33 to identify the most common patterns for KCs (topics 21, 35, 77), T neurons (topics 23, 20, 44, 10, 18, 32) and glia (topics 68, 25, 56, 34, 36). The default parameters were used to run for each group. After finding the patterns, to identify motif instances on the given sequences, we followed the TF-MoDISco manual. First, the patterns were trimmed using trim_by_ic (th = 0.25), then the sum score was calculated by using compute_sum_scores on the nucleotide importance scores. However, instead of using contribution weight matrix and calculating cosine-similarity using compute_masked_cosine_sim as shown in the TF-MoDISco manual, we converted the identified patterns to convolutional filters and calculated pattern activation scores using tf.nn.conv1d function from the TensorFlow package. It resulted in better motif instances for the noisy nucleotide importance scores and for the shorter patterns. Global motif instances were calculated on 500 bp consensus regions for each group (KCs, T-neurons, and glia) that have high prediction score (>0.25) for the corresponding class. The selected threshold for each pattern can be found in Supplementary Table 6.

The DL-based KC cistromes were built for the TFs in which contribution score matches one-to-one to gene expression on different KC types (ey, toy, Mef2, onecut, sr and mamo). For these TFs, the calculated motif instances were used to build cistromes, then the eGRNs were constructed following the same approach as with the motif-based eGRNs (described above).

Developmental model

To create a DL model on development we followed the same approach as outlined above, but changed the input topics. We performed cisTopic analysis of KCs, glia and T neurons from larval to 12 h APF brains with the addition of neuroblasts and neuroepithelium cell types (in total 27,853 cells). We ran cisTopic using the conventional Collapsed Gibbs Sampler method (runCGSModels) from 1 to 230 topics, with 500 iterations using 250 as burn-in. While 220 topics was the model with the highest log-transformed likelihood, we chose to work with 160 topics as the additional increase was not significant and we preferred to keep the number of topics as low as possible.

caQTLs analysis

Data preprocessing of the bulk ATAC-seq data across DGRP lines was performed as described in Bravo et al79, which is based on the analysis previously performed by Jacobs et al37. First, adapter sequences were trimmed from the raw reads using fastq-mcf (ea-utils v1.1.2, default parameters and using a list containing the common Illumina adapters), and the quality of the cleaned reads was checked with FastQC (v0.1). Next, Bowtie2 (v2.2.5) was used to map the reads of each experiment to its personalized version of the third 2017 FlyBase release genome (D. melanogaster r6.16). Called variants in this genome assembly were retrieved from ftp://ftp.hgsc.bcm.edu/DGRP/freeze2_Feb_2013/liftover_data_for_D.mel6.0_from_William_Gilks_Oct_2015/, and for each of the DGRP lines, the consensus genome (r6.16) was modified using seqtk mutfa (seqtk (v1.0)), each time including their SNPs (previously called from whole genome sequencing). After the first mapping round, additional SNPs were called on the ATAC reads using SAMtools (v1.2, samtools mpileup -B -f r6.16.fasta DGRP_lineX.bam | varscan.sh mpileup2snp --output-vcf 1), retrieving several thousand per line, which were added to the existing vcf files using VCFtools (v0.1.14). The vcf files were then used to update the genome, creating the final personalized genome for every DGRP line, strongly reducing mapping errors and increasing the sensitivity of subsequent analyses. Bowtie2 (v2.2.5) was then used to map the cleaned reads onto the final genomes, and SAMtools (v1.2) was used for sorting and indexing. Peaks were called on the mapped reads using MACS2 (v2.1.2.1), with the command macs2 callpeak -g dm --nomodel --keep-dup all --call-summits. The narrow peak files (bed format) for all the DGRP lines were merged, leading to a total of 33,595 regions accessible in at least one DGRP line. After filtering out chrU, chrUextra, chrHet, and chrM regions and removing regions enriched in repeats (>25% of the sequence) using bedtools (v2.28.0) with the command intersectBed -v -f 0.25, we obtained 32,668 accessible regions across this DGRP panel. For every ATAC-seq sample, we quantified the coverage per accessible region using featureCounts (Subread v2.0.0). Next, we filtered out 11,711 regions with low coverage (coverage of the region < 0.2 reads per base pair for every DGRP line), ending up with 20,957 accessible regions. Finally, the DESeq2 package in R was used to normalize the final peak-counts matrix based on size-factors.

The chromatin accessibility quantitative trait loci (caQTLs) were also identified as described in Jacobs et al37. In brief, we searched for correlation between the counts in each peak and the overlapping SNPs (treated as a vector across the 44 DGRP lines with values 0/1/NA) using the generalized linear model function in R. This provided 4,063 caQTLs, that is, highly correlated SNP–region pairs (Benjamini–Hochberg adjusted P < 0.05).

The motifs that were significantly affected by the caQTLs were identified by calculating the change in score produced by each of the SNPs in the caQTLs to each of the 24,454 motifs in our collection37 (that is, we scored every sequence twice, once with the reference and another with the ALT allele, using Cluster-Buster80, with the options -m 0 -c 0, and subtracted the score of the less accessible sequence from the most accessible one). We then used the Fisher’s exact test to compare the significance of the number of caQTL SNPs affecting the motif (with abs(Delta) > 3), versus what is expected by chance (for example, random SNPs). This returned the motifs shown in Extended Data Fig. 6l.

For the analysis of caQTL explainability by the DL model, the caQTLs and random SNPs were scored by the DL model using both reference and mutated alleles. For each caQTL/SNP, the maximum absolute delta (reference − mutated) prediction score was calculated using all 81 topics. A threshold was calculated on different false-positive rates based on random SNPs and the same thresholds were applied to caQTLs to identify the fraction of explained caQTLs at different false-positive rates.
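The thresholding logic can be sketched as follows, assuming that a SNP counts as explained when its maximum absolute delta is strictly above the cut-off derived from the random SNPs. The function names and the toy numbers are hypothetical.

```python
import math

def max_abs_delta(ref_scores, alt_scores):
    """Largest absolute change in prediction score over all topics."""
    return max(abs(r - a) for r, a in zip(ref_scores, alt_scores))

def threshold_at_fpr(random_deltas, fpr):
    """Cut-off such that at most a fraction `fpr` of the random SNPs
    have a delta strictly above it."""
    ranked = sorted(random_deltas, reverse=True)
    n_allowed = math.floor(fpr * len(ranked) + 1e-9)
    if n_allowed >= len(ranked):
        return float("-inf")
    return ranked[n_allowed]

def fraction_explained(caqtl_deltas, random_deltas, fpr):
    """Fraction of caQTLs whose delta exceeds the random-SNP cut-off."""
    cutoff = threshold_at_fpr(random_deltas, fpr)
    return sum(d > cutoff for d in caqtl_deltas) / len(caqtl_deltas)

random_deltas = list(range(100))   # toy background distribution
caqtl_deltas = [50, 95, 99, 20]    # toy caQTL deltas
print(fraction_explained(caqtl_deltas, random_deltas, fpr=0.05))
```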

Development ATAC analysis

Annotation of cell types

The annotation of cell types through development was performed following two complementary approaches: (1) annotating progenitor cell types based on marker genes near ase, dpn, grh, dac, cas and scro12 (Extended Data Fig. 4g) and the ventral nerve cord based on para and abd-A12, and (2) tracking back the annotated adult cell types.

To track back the adult cell types through development, we used an SVM classifier (Supplementary Table 2). We trained the SVM classifier on the annotated adult cell types, and we used it to iteratively transfer the labels to earlier stages.

  1. In the first step, we used the SVM classifier to transfer the labels from the 79 adult clusters (adult cells with more than 900 FIP) to the remaining cells in the adult dataset (adult + 72 h APF; the classifiers were trained on the cell–topic matrix). Using cross validation within the adult cells, we estimated that the global accuracy of the classifier is 0.86, with a call rate of 0.97 (it is not forced to assign a class to every cell), a specificity of over 0.99 for all cell types, and a sensitivity ranging from over 0.90 for many glial, OL and KC types to 0.25–0.50 for the least confident CB-Pros clusters.

  2. We then used the adult + 72 h APF cells to classify the 48 h pupa cells using the common cisTopic analysis with these three stages (Extended Data Fig. 1f–l), and Harmony (on the cell-topic matrix) to reduce the effect of differences intrinsic to the developmental stage.

  3. Finally, we classified the cells in the remaining developmental stages (from larva to 24 h APF). For this we used the global cisTopic analysis (158,116/240,919 cells with more than 900 FIP), with Harmony to correct for developmental stage (Extended Data Fig. 1e). In this last training, we noticed that cells in the progenitor clusters remained largely unassigned, so we finally trained a classifier that also included the progenitors as training labels (OL Developing neuron 2, CB Developing neuron 1, OL Neuroepithelium, NB Generation, OL Developing neuron 1, OL Type I NB, CB Type I NB, and LPC), discarding from the training set the few cells from the "new 48h cluster" that had been assigned to a cell type (they seem to be younger cells, and could distort the classification). This method yielded a likely fate for the developmental cells.

Core-set identification

The peaks called per cell type per timepoint for the consensus peaks were used as the basis to identify core regions per cell type. ctx regions (see scATAC topic modelling and clustering) that overlapped with the called peaks for that timepoint were defined as open and the DARs of the timepoint for the cell type were used to get differential accessible regions. All of the ctx regions that passed the filtering were then taken together as one set of total accessible regions of the cell type. Regions that were accessible in every timepoint were defined as the core-set of regions; regions that were differentially accessible in every timepoint were defined as core-DARs.
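The core-set definition amounts to an intersection across timepoints, which can be sketched as follows. The data layout (a mapping from timepoint to region identifiers for one cell type), the function name and the region names are hypothetical.

```python
def core_set(regions_by_timepoint):
    """Regions present at every timepoint for one cell type.

    regions_by_timepoint maps a timepoint label to an iterable of
    region identifiers (accessible regions, or DARs for core-DARs).
    """
    sets = [set(regions) for regions in regions_by_timepoint.values()]
    return set.intersection(*sets) if sets else set()

accessible = {
    "48h": {"ctx_1", "ctx_2", "ctx_3"},
    "72h": {"ctx_1", "ctx_3", "ctx_4"},
    "adult": {"ctx_1", "ctx_3"},
}
dars = {
    "48h": {"ctx_3"},
    "72h": {"ctx_3", "ctx_4"},
    "adult": {"ctx_3"},
}
print(sorted(core_set(accessible)), sorted(core_set(dars)))
```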

Trajectory of OL branches

We used Monocle3 (refs. 81,82,83) to fit a trajectory in the 3D UMAP of the OL branch from the larval to 12 h APF analysis and assign pseudotimes to the cells. First, we created a cell_data_set with the region probabilities per cell and the 3D UMAP from cisTopic as embeddings. We then used cluster_cells followed by a partition to separate the OL and CB branches and subset the object to only contain the OL. We ran cluster_cells again, and selected and merged clusters in the same branch. The branch IDs were then used in Seurat’s FindAllMarkers (Wilcoxon, min.pct = 0.1, logfc.threshold = 0.25) on the predictive distribution matrix to find DARs. Subsequently, motif enrichment on the branch DARs was performed using i-cisTarget68. Regions were linked to genes up to 5 kb upstream or downstream and GO was performed using FlyMine (Supplementary Table 3).

Trajectory of ONE scATAC-seq

The Monocle3 object that was created for OL branches was also used to calculate pseudotime using learn_graph to fit a principal graph and order_cells to assign pseudotimes. Next, OL neuroepithelium cells were selected together with the tips of the lamina precursor cells and OL neuroblasts (NB generation), focusing on the trajectory between these cell types. The trajectory was split into 15 equal parts that were used in Seurat’s FindAllMarkers (Wilcoxon, min.pct = 0.1, logfc.threshold = 0.1) to find DARs using a two-sided Wilcoxon test. Next, the predictive distribution matrix was subset for DARs and CPM normalized, followed by region-based z-normalization. DARs were grouped into modules using hierarchical clustering with the SciPy cluster.hierarchy module49 using distance.pdist (Euclidean), linkage (complete) and fcluster (0.85 × max distance), leading to nine modules. RcisTarget was used to identify motifs per module.

Trajectory of ONE scRNA-seq

Lamina precursor cells, neuroepithelium cells and OL neuroblasts were selected from a scRNA-seq dataset of the larval brain12. Monocle3 was used to create a trajectory through the cells and assign pseudotimes. First, the data were processed using preprocess_cds with PCA as the method, selecting 20 components. Next, a batch effect correction was performed to align the two different runs with align_cds. The aligned data were then used for reduce_dimension, followed by learn_graph. Once the principal graph was learned, cells were ordered along it and pseudotimes were assigned. To plot gene expression trajectories over pseudotime, a rolling mean was calculated for the log-normalized CPM counts with a window of 10. Next, a 10th degree polynomial was fit through the rolling mean with polyfit using NumPy84 and plotted.

CB pros versus Imp

CB clusters in the adult + 72 h APF dataset were selected on the basis of enrichment of CB-only runs (Extended Data Fig. 1f–l), with the exception of KCs. These clusters were assigned to either pros or Imp groups based on their maximal mean gene accessibility. We then used Seurat FindAllMarkers on the predictive distribution matrix (Wilcoxon, min.pct = 0.1, logfc.threshold = 0.2) to identify 166 regions for pros+ cells and 128 regions for Imp+ cells. Motif enrichment was performed using i-cisTarget68.

scATAC-seq embryo

We used scATAC-seq from the whole Drosophila embryo85 to map the different CB cell types. After data download from GEO, we used cisTopic to map the reads on ctx regions, leading to a matrix of 128,510 regions by 20,594 cells. Given the smaller number of cells, we used the conventional Collapsed Gibbs Sampler method in cisTopic using runCGSModels from 1 to 100 topics, with 500 iterations using 250 as burn-in. We selected the model with the highest log-transformed likelihood leading to 50 topics. Using runtSNE without PCA on the probability matrix with the cells as the target, we acquired the 2D embeddings. Annotations were transferred from the dataset, identifying the CNS. We then plotted the mean accessibility of the CB regions on the t-SNE.

Enhancer-switch identification

Region accessibilities per cell type were calculated per timepoint using RPGC-normalized bigwig files. Next, a linear curve was fit using statsmodels86 in Python for every region using time as an independent variable and region accessibility as a dependent variable, with 95% confidence intervals calculated for the parameters. Regions with a positive coefficient were assigned to be upregulated and regions with negative coefficients were assigned to be downregulated. Finally, we selected the regions that were upregulated in one cell type while being downregulated in another one and that had a maximum accessibility exceeding 4, leading to 458 switching regions.
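The selection of switching regions can be sketched as below. This sketch swaps statsmodels for a plain least-squares slope, uses only the sign of the slope and omits the confidence intervals, so it should be read as an illustration of the selection rule rather than the analysis itself. The data layout, function names and timepoints are hypothetical.

```python
def slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t, mean_v = sum(times) / n, sum(values) / n
    denom = sum((t - mean_t) ** 2 for t in times)
    numer = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return numer / denom

def switching_regions(accessibility, times, min_max_accessibility=4):
    """Regions going up in one cell type and down in another.

    accessibility: {region: {cell_type: [value per timepoint]}}
    """
    selected = []
    for region, by_cell_type in accessibility.items():
        slopes = [slope(times, v) for v in by_cell_type.values()]
        peak = max(max(v) for v in by_cell_type.values())
        goes_up = any(s > 0 for s in slopes)
        goes_down = any(s < 0 for s in slopes)
        if goes_up and goes_down and peak > min_max_accessibility:
            selected.append(region)
    return selected

times = [0, 24, 48, 72]  # hypothetical timepoints in hours
toy = {
    "region_A": {"T4": [1, 2, 4, 6], "KC": [5, 3, 2, 1]},  # switch
    "region_B": {"T4": [1, 2, 4, 6], "KC": [1, 3, 5, 7]},  # both up
    "region_C": {"T4": [1, 2, 2, 3], "KC": [3, 2, 2, 1]},  # too low
}
print(switching_regions(toy, times))
```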

HyDrop scATAC-seq and analysis

HyDrop experiments were performed on sorted GH146 cells (OPNs) according to the standard HyDrop ATAC protocol as described by De Rop et al87. As OPNs are a rare cell type in the CB, 330 brains were used, followed by a stringent gating strategy. The FACS gating strategy is provided in Supplementary Data 1. The sorted cells were split in two batches that served as technical replicates. Next, barcode reads were trimmed to exclude the intersub-barcode PCR adapters using a mawk script. Then, the VSN scATAC-seq preprocessing pipeline50 was used to map the reads to the reference genome and generate a fragments file for downstream analysis. Here, barcode reads were compared to a whitelist (of 884,736 valid barcodes) and corrected, allowing for a maximum 1 bp mismatch. Uncorrected and corrected barcodes were appended to the fastq sequence identifier of the paired-end ATAC-seq reads. Reads were mapped to the reference genome using bwa mem with the default settings, and the barcode information was added as tags to each read in the bam file. Duplicate-marking was performed using samtools markdup. In the final step of the pipeline, fragments files were generated using Sinto (https://github.com/timoast/sinto).
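Whitelist-based barcode correction with at most one mismatch can be sketched as follows. This is not the VSN pipeline code: the function name is hypothetical, and returning no correction when several whitelist barcodes are equally close is an assumption.

```python
def correct_barcode(barcode, whitelist):
    """Return the whitelist barcode matching `barcode` with at most one
    substitution, or None if there is no match or the match is ambiguous.

    `whitelist` should be a set for fast membership tests.
    """
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt == base:
                continue
            neighbour = barcode[:i] + alt + barcode[i + 1:]
            if neighbour in whitelist:
                candidates.add(neighbour)
    if len(candidates) == 1:
        return candidates.pop()
    return None  # no match within 1 bp, or several equally close matches

whitelist = {"AAAA", "CCCC"}
print(correct_barcode("AAAT", whitelist), correct_barcode("AATT", whitelist))
```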

The generated fragments files were then used as input for ArchR (and cisTopic, which gave similar results; data not shown). The createArrowFiles function was used with minTSS set to 4 and minFrags to 1000, leaving 309 cells. Ten components were then used in IterativeLSI and for the UMAP calculation, and clusters were calculated using a Louvain resolution of 1. The arrow files were also used for co-clustering with the 10x scATAC data of the adult brain. Clustering was performed with 70 components of IterativeLSI and for UMAP.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.