Gene expression model inference from snapshot RNA data using Bayesian non-parametrics

A preprint version of the article is available at bioRxiv.


Gene expression models, which are key to understanding cellular regulatory response, underlie observations of single-cell transcriptional dynamics. Although RNA expression data encode information on gene expression models, existing computational frameworks do not perform simultaneous Bayesian inference of gene expression models and parameters from such data. Rather, gene expression models (composed of gene states, their connectivities and associated parameters) are currently deduced by pre-specifying the number of gene states and their connectivity before learning the associated rate parameters. Here we propose a method to learn full distributions over gene states, state connectivities and associated rate parameters, simultaneously and self-consistently from single-molecule RNA counts. We propagate noise from fluctuating RNA counts over models by treating the models themselves as random variables, which we achieve within a Bayesian non-parametric paradigm. We demonstrate our method on the Escherichia coli lacZ pathway and the Saccharomyces cerevisiae STL1 pathway, and verify its robustness on synthetic data.

Fig. 1: Schematic of gene expression models.
Fig. 2: Accurate inference for a variety of gene expression models derived from synthetic data.
Fig. 3: Robustness analysis with respect to the quantity of data.
Fig. 4: Inference on data from E. coli in slow-growth media.
Fig. 5: Non-parametric inference on fast-growth E. coli data.
Fig. 6: Inference on S. cerevisiae data.

Data availability

Source Data for Figs. 1–6 and Extended Data Figs. 1 and 2 is available with this paper, as well as online at ref. 114.

Code availability

Our custom MATLAB code is available at ref. 114.


We thank I. Golding for providing the experimental data analyzed herein. We thank I. Sgouralis, Z. Fox and B. Munsky for interesting discussions and insights. D.S. acknowledges support from the NIH NHLBI (R01HL068702) and NIH BRAIN (RF1MH128867), and S.P. acknowledges support from NIH NIGMS (R01GM130745) and NIH NIGMS (R01GM134426).

Z.K. and C.M. developed the Bayesian non-parametric inference algorithm and software with input from D.S. and S.P. M.S. further developed the existing analysis software and analyzed the data. M.S. created all the figures in the paper with input from all authors. All authors wrote the manuscript. Z.K., M.S., C.M. and S.P. conceived the research, and D.S. and S.P. oversaw all aspects of the project.

Extended data

Extended Data Fig. 1 Robustness analysis: transcription rate.

Shown are posterior distributions over: production rates \(\beta_l\), and transition rates \(k_{\sigma_l \to \sigma_{l'}}\). Across columns, the breadth of the distributions is comparable for a model containing two gene states, under various maximum ground-truth production rates. Again, the posterior maximum closely matches the ground truth, demonstrating the method’s robustness under quantitative changes in RNA count distribution. As before, each data point was generated using the Gillespie stochastic simulation algorithm, with weak limit set to L = 8 (as per Fig. 2). Rates in each column are inferred for 600 cells observed per time point with 20 collection times at [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 120, 180, 240, 360, 480, 600, 1200, 3600] (s).
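For readers unfamiliar with how such synthetic snapshot data can be produced, the following is a minimal Python sketch of a Gillespie simulation for a two-state gene model. It is not the authors' MATLAB code. The rate values, the degradation rate `gamma`, the initial condition (gene in state 0 with no RNA), the reduced cell count and the function name `simulate_cell` are all illustrative assumptions; only the list of collection times is taken from the caption above. Each cell is simulated independently up to its collection time, mirroring snapshot measurements in which a cell is observed only once.

```python
import random

def simulate_cell(t_end, k, beta, gamma, rng):
    """Gillespie simulation of one cell; returns its RNA count at time t_end."""
    t, state, rna = 0.0, 0, 0  # assumed start: gene in state 0, no RNA
    while True:
        other = 1 - state
        # Propensities: gene-state switch, RNA production, RNA degradation.
        a_switch, a_make, a_decay = k[state][other], beta[state], gamma * rna
        total = a_switch + a_make + a_decay
        if total == 0.0:
            return rna  # no reaction can fire any more
        t += rng.expovariate(total)  # exponential waiting time to next event
        if t >= t_end:
            return rna  # snapshot is taken before the next event fires
        u = rng.random() * total  # pick which reaction fires
        if u < a_switch:
            state = other
        elif u < a_switch + a_make:
            rna += 1
        else:
            rna -= 1

rng = random.Random(0)          # fixed seed for reproducibility
k = [[0.0, 0.01], [0.01, 0.0]]  # hypothetical switching rates k[l][l'] (1/s)
beta = [0.0, 0.2]               # hypothetical production rate per gene state (1/s)
gamma = 0.01                    # hypothetical RNA degradation rate (1/s)
times = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90,
         120, 180, 240, 360, 480, 600, 1200, 3600]  # collection times (s)
n_cells = 100                   # reduced from 600 to keep the demo fast

# One independent population of cells per collection time.
snapshots = {
    t: [simulate_cell(t, k, beta, gamma, rng) for _ in range(n_cells)]
    for t in times
}

for t in times:
    print(t, sum(snapshots[t]) / n_cells)  # mean RNA count per snapshot
```

The resulting `snapshots` dictionary maps each collection time to a list of per-cell RNA counts, which is the kind of input an inference scheme over gene states and rates would consume.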

Supplementary information

Supplementary information and Figs. 1–18.

Reporting Summary

