High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations

Haghighi, Marzieh; Caicedo, Juan C.; Cimini, Beth A.; Carpenter, Anne E.; Singh, Shantanu

doi:10.1038/s41592-022-01667-0

Resource
Published: 07 November 2022

Volume 19, pages 1550–1557, (2022)

Abstract

Cells can be perturbed by various chemical and genetic treatments and the impact on gene expression and morphology can be measured via transcriptomic profiling and image-based assays, respectively. The patterns observed in these high-dimensional profile data can power a dozen applications in drug discovery and basic biology research, but both types of profiles are rarely available for large-scale experiments. Here, we provide a collection of four datasets with both gene expression and morphological profile data useful for developing and testing multimodal methodologies. Roughly a thousand features are measured for each of the two data types, across more than 28,000 chemical and genetic perturbations. We define biological problems that use the shared and complementary information in these two data modalities, provide baseline analysis and evaluation metrics for multi-omic applications, and make the data resource publicly available (https://broad.io/rosetta/).

**Fig. 1: Multimodal datasets overview.**

**Fig. 2: An application using the shared subspace: cross-modality predictions from Cell Painting to gene expression.**

**Fig. 3: Cross-modality predictions from gene expression to Cell Painting.**

**Fig. 4: Using complementary information: data integration for MoA cluster retrieval and class prediction in compound datasets.**

Data availability

Preprocessed profiles that are augmented with gene and compound annotation are freely available through the ‘Registry of Open Data on AWS’ on a public S3 bucket. Documentation on the folder structure, dataset details and instructions for accessing the data are available at https://broad.io/rosetta/. Datasets are described and referenced in Supplementary Data 1. Source data are provided with this paper.

Code availability

Source code to reproduce and build upon the presented results is available at https://broad.io/rosetta/. We licensed the source code as BSD 3-Clause, and licensed the data, results and figures as CC0 1.0.

References

Moshkov, N. et al. Predicting compound activity from phenotypic profiles and chemical structures. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2020.12.15.422887v4 (2022).
Breinig, M., Klein, F. A., Huber, W. & Boutros, M. A chemical–genetic interaction map of small molecules using high-throughput imaging in cancer cells. Mol. Syst. Biol. 11, 846 (2015).
Rohban, M. H. et al. Systematic morphological profiling of human gene and allele function via Cell Painting. Elife 6, e24060 (2017).
Meng, T., Jing, X., Yan, Z. & Pedrycz, W. A survey on machine learning for data fusion. Inf. Fusion 57, 115–129 (2020).
Baldwin, E. et al. On fusion methods for knowledge discovery from multi-omics datasets. Comput. Struct. Biotechnol. J. 18, 509–517 (2020).
Patel-Murray, N. L. et al. A multi-omics interpretable machine learning model reveals modes of action of small molecules. Sci. Rep. 10, 954 (2020).
Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
Bray, M.-A. et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat. Protoc. 11, 1757–1774 (2016).
McQuin, C. et al. CellProfiler 3.0: next-generation image processing for biology. PLoS Biol. 16, e2005970 (2018).
Wawer, M. J. et al. Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling. Proc. Natl Acad. Sci. USA 111, 10911–10916 (2014).
Berger, A. H. et al. High-throughput phenotyping of lung cancer somatic mutations. Cancer Cell 30, 214–228 (2016).
Haftbaradaran Esfahani, P. & Knöll, R. Cell shape: effects on gene expression and signaling. Biophys. Rev. 12, 895–901 (2020).
Drareni, K., Gautier, J.-F., Venteclef, N. & Alzaid, F. Transcriptional control of macrophage polarisation in type 2 diabetes. Semin. Immunopathol. 41, 515–529 (2019).
Mota de Sá, P., Richard, A. J., Hang, H. & Stephens, J. M. Transcriptional regulation of adipogenesis. Compr. Physiol. 7, 635–674 (2017).
Way, G. P. et al. Morphology and gene expression profiling provide complementary information for mapping cell state. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.10.21.465335 (2022).
Nassiri, I. & McCall, M. N. Systematic exploration of cell morphological phenotypes associated with a transcriptomic query. Nucleic Acids Res. 46, e116 (2018).
Spiering, D. & Hodgson, L. Dynamics of the Rho-family small GTPases in actin regulation and motility. Cell Adh. Migr. 5, 170–180 (2011).
Dai Yang, K. et al. Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nat. Commun. 12, 31 (2021).
Gundersen, G., Dumitrascu, B. & Ash, J. T. End-to-end training of deep probabilistic CCA on paired biomedical observations. In Proceedings of PMLR pp.945–955 (2019).
He, B. et al. Integrating spatial gene expression and breast tumour morphology via deep learning. Nat. Biomed. Eng. 4, 827–834 (2020).
Pasquer, Q. T. L., Tsakoumagkos, I. A. & Hoogendoorn, S. From phenotypic hit to chemical probe: Chemical biology approaches to elucidate small molecule action in complex biological systems. Molecules 25, 5702 (2020).
Rohban, M. H. et al. Virtual screening for small-molecule pathway regulators by image-profile matching. Cell Syst. 13, 724–736 (2022).
Ljosa, V. et al. Comparison of methods for image-based profiling of cellular morphological responses to small-molecule treatment. J. Biomol. Screen. 18, 1321–1329 (2013).
Warchal, S. J., Dawson, J. C. & Carragher, N. O. Evaluation of machine learning classifiers to predict compound mechanism of action when transferred across distinct cell lines. SLAS Discov. 24, 224–233 (2019).
Aliper, A. et al. Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Mol. Pharm. 13, 2524–2530 (2016).
Lapins, M. & Spjuth, O. Evaluation of gene expression and phenotypic profiling data as quantitative descriptors for predicting drug targets and mechanisms of action. Preprint at bioRxiv https://doi.org/10.1101/580654 (2019).
Cantini, L. et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat. Commun. 12, 124 (2021).
Tenenhaus, M., Tenenhaus, A. & Groenen, P. J. F. Regularized generalized canonical correlation analysis: a framework for sequential multiblock component methods. Psychometrika https://doi.org/10.1007/s11336-017-9573-x (2017).
Roselle, C., Verch, T. & Shank-Retzlaff, M. Mitigation of microtiter plate-positioning effects using a block randomization scheme. Anal. Bioanal. Chem. 408, 3969–3979 (2016).
Lill, J. R., Mathews, W. R., Rose, C. M. & Schirle, M. Proteomics in the pharmaceutical and biotechnology industry: a look to the next decade. Expert Rev. Proteom. 18, 503–526 (2021).
Chandrasekaran, S. N., Ceulemans, H., Boyd, J. D. & Carpenter, A. E. Image-based profiling for drug discovery: due for a machine learning upgrade? Nat. Rev. Drug Discov. 20, 145–159 (2021).
Tandon, G., Chan, P. & Mitra, D. MORPHEUS: motif oriented representations to purge hostile events from unlabeled sequences. in Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security https://doi.org/10.1145/1029208.1029212 (2004).
Broad Institute. Guide to LINCS data release into NCBI GEO—L1000. Connectopedia. https://clue.io/connectopedia/guide_to_geo_l1000_data
Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).

Acknowledgements

We thank all the researchers who created and shared the data, who are mentioned in their respective publications cited in the paper. Funding was provided by grants (2018-183451 to A.E.C. and 2020-225720 to B.A.C.) from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation and the National Institutes of Health NIGMS (R35 GM122547, to A.E.C.).

Author information

Authors and Affiliations

Broad Institute of MIT and Harvard, Cambridge, MA, USA
Marzieh Haghighi, Juan C. Caicedo, Beth A. Cimini, Anne E. Carpenter & Shantanu Singh

Contributions

M.H., S.S., B.A.C. and A.E.C. contributed to drafting the manuscript and designing the research. J.C.C. initiated the project and performed early explorations of the LUAD dataset. M.H. analyzed and explored the data with input from the other co-authors.

Corresponding authors

Correspondence to Marzieh Haghighi or Shantanu Singh.

Ethics declarations

Competing interests

S.S. and A.E.C. serve as scientific advisors for companies that use image-based profiling and Cell Painting (A.E.C.: Recursion, S.S.: Waypoint Bio, Dewpoint Therapeutics), and receive honoraria for occasional talks at pharmaceutical and biotechnology companies.

Peer review

Peer review information

Nature Methods thanks Haiquan Li, Matthew McCall, and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editor: Rita Strack, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Generalizability of the prediction model across datasets.

Prediction of each L1000 mRNA level from Cell Painting features in dataset A, using a model trained on dataset B. We trained Lasso and MLP models on each of the LUAD and LINCS datasets and evaluated the predictions on the other dataset, which was not used in model training. Distributions of R² prediction scores across all landmark genes are shown. Comparison of these results with Fig. 2 indicates that the prediction model generalizes poorly across datasets. This points to dataset-specific technical variation (batch effects) and to the need for experimental alignment techniques (batch-effect correction), which is an active area of research. We also observe that prediction is stronger when the model is trained on the LINCS dataset and tested on the LUAD dataset. This is expected because the LUAD dataset is limited to a narrow set of genes associated with lung adenocarcinoma, whereas the LINCS dataset contains a wide variety of compounds with different mechanisms and known phenotypes. The y-axis is trimmed at −1 for clarity. Distributions are presented as boxplots, with the center line being the median, box limits being the upper and lower quartiles and whiskers being 1.5× the interquartile range; n = 978 landmark genes for each boxplot.
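
As an illustration of the quantities in this legend, the following minimal sketch computes an R² score for one gene and the boxplot summary statistics named above. It assumes the standard coefficient of determination and Tukey-style whiskers; the function names and the toy values are hypothetical and are not taken from the authors' code. It also shows why scores can fall below zero, which is why the y-axis needs trimming.

```python
from statistics import mean, quantiles


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Negative values mean the predictions are worse than always
    predicting the mean of y_true.
    """
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def boxplot_summary(scores):
    """Median, quartiles and whisker ends (most extreme points within 1.5x IQR)."""
    q1, med, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(s for s in scores if s >= low_fence),
        "whisker_high": max(s for s in scores if s <= high_fence),
    }


# Toy example (hypothetical values): anti-correlated predictions give R² < 0.
toy_r2 = r2_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
toy_box = boxplot_summary([0.0, 1.0, 2.0, 3.0, 100.0])
```

In a cross-dataset setting, `y_pred` would come from a model fitted on one dataset and applied to the other, with one R² value per landmark gene feeding the boxplot.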

Extended Data Fig. 2 Gene group names for top 100 predictable landmark genes in LUAD dataset.

The top 100 predictable landmark genes according to the MLP model are shown along with their gene group names (based on the HGNC Database, ref. 41) for the LUAD dataset. A diverse array of gene groups is represented, although we note that the perturbations in this experiment included only genes found mutated in lung cancers.

Extended Data Fig. 3 Over-Representation Analysis (ORA) of highly predictable (top 100) landmark genes in LUAD dataset.

Over-representation analysis of the top 100 highly predictable landmark genes according to the MLP model applied to the LUAD dataset. ORA was performed with the WebGestalt analysis toolkit (ref. 42). Nineteen enriched categories (FDR < 0.2) are labeled in the volcano plot.
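
For readers unfamiliar with ORA, the sketch below shows one common formulation: a one-sided hypergeometric test per category, an enrichment ratio (observed over expected overlap), and Benjamini-Hochberg adjustment to obtain FDR values. This is an assumption about the general technique, not a description of the settings used in WebGestalt for this figure; the function names, the choice of reference gene set and all numbers in the test values are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes from `population`,
    of which `successes` belong to the category."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def enrichment_ratio(k, population, successes, draws):
    """Observed overlap divided by the overlap expected by chance."""
    expected = draws * successes / population
    return k / expected


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A volcano plot like the one described would then place the enrichment ratio (often on a log scale) on one axis and the adjusted significance on the other, labeling categories below the chosen FDR cutoff.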

Extended Data Fig. 4 Visualization of cells in a cluster of landmark genes that are tightly correlated with RNA texture category of morphological features.

For the cluster of landmark genes shown in the top heatmap, which is a partial snapshot of Fig. 2d, we show example cell images for perturbations that have high and low predicted values for each gene in that cluster. Before making that selection, we filtered perturbations to those that have low prediction errors. Cells that are predicted to have (and actually do have) high levels of these five genes’ mRNA are all associated with visible changes in the staining for mitochondria, even though only half of these genes already have functional annotations related to the mitochondria.

Extended Data Fig. 5 Validation of the observed GE-CP relationship by GO-terms search analysis.

Landmark genes that are highly predictable from morphological features in a specific Cell Painting channel are more likely to have GO annotations related to that channel than to the rest of the CP channels. For each channel in the rows of the table, the first column shows the odds ratio (OR) derived from Fisher’s exact test for the association between landmark genes being highly predictable (R² > 0.6) by CP features in a channel and having GO annotations for that channel. The second column shows the association between the same set of highly predictable genes and having GO annotations for any channel other than the target row channel. Higher values in the first column than in the second column show that genes highly predictable from features in a CP channel are more likely to have GO annotations for that channel than for the rest of the channels. This pattern holds for the DNA and ER channels but not for the rest of the CP channels. The third and fourth columns show the same associations for low-predictability genes (R² < 0). Lower values in the third column than in the fourth column show that genes not predictable from features in a CP channel are less likely to have GO annotations for that channel than for the rest of the channels. This pattern holds for all CP channels except RNA. The CP channel-specific predictability map used for this analysis was derived from the experiment whose results are presented partially in Fig. 2d. As the map shows, multiple categories of morphological features usually contribute to the predictability of a gene, which explains the lack of a simple relationship between a given channel’s predictability and the GO term associations presented in this table.
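
The odds ratios in this table come from 2×2 contingency tables (predictable or not, annotated or not). As a minimal sketch of that calculation, assuming the sample odds ratio and a one-sided Fisher's exact test, the code below works through one hypothetical table; the counts, the function names and the choice of a one-sided alternative are illustrative assumptions, not values or settings from the paper.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]].

    a: predictable and annotated      b: predictable, not annotated
    c: not predictable, annotated     d: not predictable, not annotated
    """
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: probability of an overlap of at
    least `a` given fixed row and column totals."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    upper = min(row1, col1)
    tail = sum(
        comb(col1, i) * comb(n - col1, row1 - i) for i in range(a, upper + 1)
    )
    return tail / total


# Hypothetical counts for one channel.
example_or = odds_ratio(8, 2, 1, 5)
example_p = fisher_exact_greater(8, 2, 1, 5)
```

An odds ratio above 1 indicates that predictable genes are annotated for the channel more often than non-predictable genes; comparing the ratio for the target channel with the ratio for all other channels gives the contrast between columns described above.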

Extended Data Fig. 6 Association between landmark gene predictability and having gene ontology annotations related to Cell-Painting stains.

Landmark genes that are predictable according to at least three of the four datasets (59 genes shown in Fig. 2c) are more likely to have GO annotations related to any of the stains in the Cell Painting assay compared to a random subset of landmark genes.
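
The selection rule "predictable according to at least three of the four datasets" can be sketched as a simple count over per-dataset gene sets. The dataset keys and gene names below are placeholders, and how each per-dataset set of predictable genes is obtained (the threshold on the prediction score) is outside this sketch.

```python
from collections import Counter


def predictable_in_at_least(predictable_by_dataset, min_datasets=3):
    """Return genes that appear in the predictable set of at least
    `min_datasets` datasets, sorted by name."""
    counts = Counter(
        gene
        for genes in predictable_by_dataset.values()
        for gene in set(genes)
    )
    return sorted(g for g, c in counts.items() if c >= min_datasets)


# Placeholder inputs: one set of predictable genes per dataset.
example_sets = {
    "dataset_1": {"GENE_A", "GENE_B"},
    "dataset_2": {"GENE_A", "GENE_B", "GENE_C"},
    "dataset_3": {"GENE_A", "GENE_C"},
    "dataset_4": {"GENE_A", "GENE_B"},
}
shared_genes = predictable_in_at_least(example_sets, min_datasets=3)
```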

Supplementary information

Supplementary Information

Supplementary Data 1–3 and 5, Supplementary Table 1 and Supplementary Figs. 1–3

Reporting Summary

Peer Review File

Supplementary Data 4

Median prediction scores for each landmark gene across each dataset and model.

Source data

Source Data Fig. 2

Fig. 2a: Prediction score values per gene in each dataset and model. Single scores in rows of the table indicate the median of the predictability scores across all cross-validation folds. Fig. 2b: Extracted predictability threshold for each dataset and number of predictable genes. Fig. 2c: List of the genes for each area of the Venn diagram. Fig. 2d: Source numbers for predictability score, shown as a heat map.

Source Data Fig. 3

Fig. 3a: Prediction score values for each single morphological CP feature in each dataset and for MLP models. Each score in each row is the median of the predictability scores across all cross-validation folds in its corresponding experiment. Fig. 3b: Ratio of predictable features to the total number of features in each Cell Painting feature category. Fig. 3d: Values for relative importance of each feature, shown as bar plots.

Source Data Fig. 4

Fig. 4a: Jaccard index values for each MoA cluster, for each model, dataset and data modality. Fig. 4b: F1-score values for MoA classification experiments for each modality, integration method, model, dataset and cross-validation fold. Fig. 4c: MoA class-specific classification F1-scores, shown partially as a bar plot.
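
The two metrics named here have standard definitions, sketched below for reference. How clusters were matched to MoA classes for the Jaccard index, and how F1-scores were averaged, is not specified in this section, so the inputs in this sketch (a set of compounds sharing an MoA versus a retrieved cluster, and flat label lists) are assumptions for illustration only.

```python
def jaccard_index(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def f1_per_class(y_true, y_pred):
    """Class-specific F1 = 2*TP / (2*TP + FP + FN) for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical example: compounds annotated with one MoA vs. a retrieved cluster.
overlap = jaccard_index({"cpd_1", "cpd_2", "cpd_3"}, {"cpd_2", "cpd_3", "cpd_4"})
class_f1 = f1_per_class(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```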

Source Data Extended Data Fig. 1

Prediction scores for each individual gene, resulting from a model trained on one dataset and used to predict landmark gene expression in another dataset. The LUAD and LINCS datasets were used.

Source Data Extended Data Fig. 2

A table as shown in Extended Data Fig. 2.

Source Data Extended Data Fig. 3

ORA analysis results (set size, enrichment ratio, P value and false discovery rate), which are partially shown in Extended Data Fig. 3.

Source Data Extended Data Fig. 5

Numbers as shown in Extended Data Fig. 5.

Source Data Extended Data Fig. 6

Numbers as shown in Extended Data Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Haghighi, M., Caicedo, J.C., Cimini, B.A. et al. High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations. Nat Methods 19, 1550–1557 (2022). https://doi.org/10.1038/s41592-022-01667-0

Received: 07 September 2021
Accepted: 28 September 2022
Published: 07 November 2022
Issue Date: December 2022
DOI: https://doi.org/10.1038/s41592-022-01667-0
Springer Nature America, Inc.
