Key Foundation Models

The number of papers describing or using a foundation model exploded in 2024. Searching PubMed for the term “foundation model” returns fewer than 10 papers a year before 2023, while 2024 alone returns 150 (including preprints), with more still trickling in before the end of the year. Below is a selection of foundation models developed in the last decade for different biomedical research areas:
Geneformer
Training data: scRNA (human)
Tasks: predicting gene network dynamics, predicting chromatin dynamics, gene prioritization for perturbation studies
Resources: v1: Theodoris et al. 2023, v2: Chen et al. 2024, GitHub: Link
scGPT
Training data: scRNA (human)
Tasks: cell type annotation, clustering, inferring gene networks, integrating datasets
Resources: v1: Cui et al. 2024, GitHub: Link
scVI
Training data: scRNA
Tasks: creating latent space, integrating datasets, imputation, differential expression
Resources: v1: Lopez et al. 2018, Website: Link, GitHub: Link
DeepSEA
Training data: DNA
Tasks: predicting effects of noncoding genomic variants, predicting chromatin profiles
Resources: v1: Zhou & Troyanskaya 2015, Website: Link
Enformer
Training data: DNA
Tasks: predicting effects of noncoding genomic variants, modeling long-range (up to 100 kb) interactions, predicting gene expression
Resources: v1: Avsec et al. 2021
DNABERT
Training data: DNA (human)
Tasks: predicting splice sites, predicting promoter regions, predicting transcription factor binding sites
Resources: v1: Ji et al. 2021, GitHub: Link
AlphaFold
Training data: amino acid sequences
Tasks: predicting 3D protein structures, modeling protein-protein interactions
Resources: v2: Jumper et al. 2021, v3: Abramson et al. 2024, Website: Link, Database: Link
Nicheformer
Training data: scRNA, both dissociated and spatially resolved
Tasks: predicting spatial context of dissociated cells, modeling spatial niches
Resources: v1: Schaar et al. 2024
Novae
Training data: spatially resolved scRNA
Tasks: spatial clustering, analyzing spatially variable genes and pathways, correcting for batch effects in multi-panel studies
Resources: v1: Blampey et al. 2024
BioBERT
Training data: biomedical text
Tasks: text mining, relationship extraction
Resources: v1: Lee et al. 2019
BioELECTRA
Training data: biomedical text
Tasks: text mining
Resources: v1: Kanakarajan et al. 2021
CONCH
Training data: histopathology images and captions
Tasks: image classification, captioning, image segmentation
Resources: v1: Lu et al. 2024
TxGNN
Training data: medical knowledge graph
Tasks: predicting drug indications, drug repurposing
Resources: v1: Huang et al. 2024
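The PubMed counts quoted in the introduction can be checked with a per-year query against NCBI's E-utilities `esearch` endpoint. The sketch below is only an illustration: the helper names (`build_count_query`, `parse_count`, `fetch_count`) are hypothetical, and the exact-phrase search with a publication-date filter is an assumption about how the original search was run, so the numbers it returns may not match the figures above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_query(term: str, year: int) -> str:
    """Build an esearch URL that asks only for the number of PubMed hits.

    Assumption: the term is searched as an exact phrase and restricted
    to a single publication year.
    """
    params = {
        "db": "pubmed",
        "term": f'"{term}"',   # quotes request an exact-phrase match
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "rettype": "count",    # return the hit count, not the ID list
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(response_text: str) -> int:
    """Extract the hit count from an esearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])


def fetch_count(term: str, year: int) -> int:
    """Run the query over the network and return the hit count."""
    with urlopen(build_count_query(term, year)) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling `fetch_count("foundation model", year)` for each year of interest gives the trend; NCBI asks clients to keep request rates low, so a short pause between calls is sensible.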
Want to see how these foundation models are being used in biomedical research? Read our full foundation models blog here: https://watershed.bio/resources/foundation-models-for-biomedical-research
Resources are now available that provide GPU-accelerated versions of analytical tools, such as RAPIDS for single-cell analysis, Parabricks for variant calling, and RELION for 3D modeling. The Watershed operating system gives you the tools you need to access these workflows and implement foundation models with GPU acceleration, including:
- Install-and-go integration of GPU-accelerated pipelines
- Easy access to multiple types of GPUs for various applications
- Seamless switching between GPUs during testing and production
- GPU parallelization to further improve model training speed
Ready to get started?
Learn more about how Watershed can empower your entire team.