Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models.

In Nature Biomedical Engineering, by Francisco Carrillo-Perez, Marija Pizurica, Yuanning Zheng, Tarak Nath Nandi, Ravi Madduri, Jeanne Shen, Olivier Gevaert

TLDR

  • The study uses cascaded diffusion models to generate realistic synthetic image tiles of human tumors from RNA-sequencing data. The synthetic tiles preserve the distribution of cell types seen in real tumors, and machine-learning models pretrained on them perform better than models trained from scratch. Synthetic images of this kind could therefore support model development when real data are scarce and help fill in missing data modalities.

Abstract

Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.

Overview

  • The study uses cascaded diffusion models to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumors. It tests two questions: whether alterations in gene expression change the composition of cell types in the generated tiles, and whether machine-learning models pretrained with the synthetic data perform better than models trained from scratch. The method first encodes RNA-sequencing data into a latent representation and then conditions the cascaded diffusion models on that representation to generate image tiles. The primary objective is to show that synthetically generated data can support the training of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.
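
The data flow described above (expression profile, latent representation, low-resolution tile, super-resolved tile) can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the encoder, the noise predictor, the tile sizes, the step count and every function name below are hypothetical, and the untrained placeholder networks do not produce meaningful histology. The sampling loop follows the standard DDPM ancestral-sampling update, and the second stage is conditioned on the upsampled output of the first, which is the general idea behind a cascade.

```python
import numpy as np


def make_schedule(num_steps):
    # Linear beta schedule, as in the standard DDPM formulation.
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, cond):
    # HYPOTHETICAL placeholder for a trained denoising network:
    # it treats the deviation from the conditioning signal as noise.
    return x_t - cond


def ddpm_sample(shape, cond, num_steps, rng):
    # Standard DDPM ancestral sampling, starting from pure Gaussian noise.
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)
    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, t, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def encode_expression(expression, weights):
    # HYPOTHETICAL stand-in for a learned encoder that maps an
    # RNA-sequencing profile to a low-dimensional latent vector.
    return np.tanh(np.log1p(expression) @ weights)


def generate_tile(expression, rng, latent_dim=8, base_size=16, scale=4, num_steps=50):
    n_genes = expression.shape[0]
    # Random, untrained weights (assumption: a real system would learn these).
    enc_w = rng.standard_normal((n_genes, latent_dim)) / np.sqrt(n_genes)
    proj_w = rng.standard_normal((latent_dim, 3)) / np.sqrt(latent_dim)

    # Stage 0: expression profile -> latent representation.
    latent = encode_expression(expression, enc_w)

    # Stage 1: low-resolution tile conditioned on the latent
    # (here reduced to a per-channel bias that broadcasts over pixels).
    base_cond = (latent @ proj_w).reshape(1, 1, 3)
    low_res = ddpm_sample((base_size, base_size, 3), base_cond, num_steps, rng)

    # Stage 2: super-resolution conditioned on the upsampled low-res tile.
    upsampled = np.repeat(np.repeat(low_res, scale, axis=0), scale, axis=1)
    high_res = ddpm_sample(upsampled.shape, upsampled, num_steps, rng)
    return low_res, high_res
```

Calling `generate_tile` with a non-negative expression vector and a `np.random.default_rng` generator returns a 16×16×3 array and a 64×64×3 array. In a real system, each placeholder would be replaced by a trained network.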

Comparative Analysis & Findings

  • The study makes two comparisons. First, it compares the generated synthetic tiles with the underlying biology: alterations in gene expression affected the composition of cell types in the synthetic tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data. Second, it compares machine-learning models pretrained with the synthetic data against models trained from scratch, and the pretrained models performed better. These findings suggest that synthetically generated data can be used to train machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.
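
One simple way to quantify how well a set of synthetic tiles "preserves the distribution of cell types" is to turn per-type cell counts into fractions and measure the distance between the two fraction vectors. The sketch below uses total variation distance. This metric, the function names and the cell-type labels are assumptions chosen for illustration; the summary does not state which measure the authors used.

```python
def cell_fractions(counts):
    # Convert per-cell-type counts into fractions that sum to 1.
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one cell")
    return {cell_type: n / total for cell_type, n in counts.items()}


def total_variation(p, q):
    # Total variation distance between two discrete distributions:
    # 0 means identical fractions, 1 means completely disjoint.
    cell_types = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cell_types)


# Illustrative, made-up counts (not results from the paper).
reference = cell_fractions({"tumor": 60, "lymphocyte": 30, "stroma": 10})
synthetic = cell_fractions({"tumor": 50, "lymphocyte": 30, "stroma": 20})
distance = total_variation(reference, synthetic)
```

A distance near 0 would indicate that the synthetic tiles reproduce the reference cell-type fractions closely. Cell types missing from one side are treated as having a fraction of 0.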

Implications and Future Directions

  • The findings matter for both research and clinical applications of machine learning: they indicate that synthetically generated data can help train machine-learning models in scarce-data settings and can allow for the imputation of missing data modalities. The study also has limitations, such as the need for more data to validate the results and the potential for bias in the generated data. Future research could explore synthetically generated data in other fields, such as drug discovery, and develop approaches that generate more realistic and diverse data.