Key Foundation Models

Watershed Team

Jan 31, 2025

The number of papers describing or using a foundation model exploded in 2024. Searching PubMed for the term “foundation model” returns fewer than 10 papers a year before 2023, while 2024 alone returns 150 (including preprints), with more still trickling in before the end of the year. Below is a selection of foundation models developed over the last decade for different biomedical research areas:

Geneformer

Training data: scRNA (human)

Tasks: predicting gene network dynamics, predicting chromatin dynamics, gene prioritization for perturbation studies

Resources: v1: Theodoris et al. 2023, v2: Chen et al. 2024, GitHub: Link

scGPT

Training data: scRNA (human)

Tasks: cell type annotation, clustering, inferring gene networks, integrating datasets

Resources: v1: Cui et al. 2024, GitHub: Link

scVI

Training data: scRNA

Tasks: creating latent space, integrating datasets, imputation, differential expression

Resources: v1: Lopez et al. 2018, Website: Link, GitHub: Link

DeepSEA

Training data: DNA

Tasks: predicting effects of noncoding genomic variants, predicting chromatin profiles

Resources: v1: Zhou & Troyanskaya 2015, Website: Link

Enformer

Training data: DNA

Tasks: predicting effects of noncoding genomic variants, modeling long-range (>100 kb) interactions, predicting gene expression

Resources: v1: Avsec et al. 2021

DNABERT

Training data: DNA (human)

Tasks: predicting splice sites, predicting promoter regions, predicting transcription factor binding sites

Resources: v1: Ji et al. 2021, GitHub: Link

AlphaFold

Training data: amino acid sequences and experimentally determined protein structures

Tasks: predicting 3D protein structures, modeling protein-protein interactions

Resources: v2: Jumper et al. 2021, v3: Abramson et al. 2024, Website: Link, Database: Link

Nicheformer

Training data: scRNA, both dissociated and spatially resolved

Tasks: predicting spatial context of dissociated cells, modeling spatial niches

Resources: v1: Schaar et al. 2024

Novae

Training data: spatially resolved scRNA

Tasks: spatial clustering, analyzing spatially variable genes and pathways, correcting for batch effects in multi-panel studies

Resources: v1: Blampey et al. 2024

BioBERT

Training data: biomedical text

Tasks: text mining, relationship extraction

Resources: v1: Lee et al. 2019

BioELECTRA

Training data: biomedical text

Tasks: text mining

Resources: v1: Kanakarajan et al. 2021

CONCH

Training data: histopathology images and captions

Tasks: image classification, captioning, image segmentation

Resources: v1: Lu et al. 2024

TxGNN

Training data: medical knowledge graph

Tasks: predicting drug indications, drug repurposing

Resources: v1: Huang et al. 2024
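For readers who want to check the publication trend quoted in the introduction, the per-year PubMed counts can be approximated with NCBI's public E-utilities ESearch endpoint. The sketch below is a minimal illustration, not the method used for the figures above: the helper names (`build_count_url`, `parse_count`, `papers_per_year`) are hypothetical, and the exact query (a quoted phrase restricted by publication date) is an assumption, so the numbers it returns may differ from those quoted.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint (public, no API key needed for light use).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term: str, year: int) -> str:
    """Build an ESearch URL for an exact phrase within one publication year."""
    params = {
        "db": "pubmed",
        "term": f'"{term}"',   # quoted so PubMed treats it as a phrase (assumption)
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmax": "0",         # only the total count is needed, not the IDs
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def parse_count(payload: str) -> int:
    """Extract the total hit count from an ESearch JSON response."""
    return int(json.loads(payload)["esearchresult"]["count"])


def papers_per_year(term: str, years) -> dict:
    """Query PubMed once per year and return {year: count}."""
    counts = {}
    for year in years:
        with urlopen(build_count_url(term, year), timeout=30) as response:
            counts[year] = parse_count(response.read().decode("utf-8"))
        time.sleep(0.4)  # stay under NCBI's limit of 3 requests/second without a key
    return counts


# Example (requires network access):
# for year, n in papers_per_year("foundation model", range(2019, 2025)).items():
#     print(year, n)
```

Because PubMed indexing continues after a year ends, rerunning the query later will generally return higher counts for recent years.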

Want to see how these foundation models are being used in biomedical research? Read our full foundation models blog here: https://watershed.bio/resources/foundation-models-for-biomedical-research

Many analytical tools now have GPU-accelerated versions, such as RAPIDS for single-cell analysis, Parabricks for variant calling, and RELION for 3D modeling. The Watershed operating system gives you the tools you need to access these workflows and implement foundation models with GPU acceleration, including:

Install-and-go integration of GPU-accelerated pipelines
Easy access to multiple types of GPUs for various applications
Seamless switching between GPUs during testing and production
GPU parallelization to further improve model training speed
