Daily Digest | March 31, 2024

A single-cell atlas enables mapping of homeostatic cellular shifts in the adult human breast | Nature Genetics

Here researchers use single-cell RNA sequencing to compile a human breast cell atlas assembled from 55 donors that had undergone reduction mammoplasties or risk reduction mastectomies. From more than 800,000 cells they identified 41 cell subclusters across the epithelial, immune and stromal compartments. The contribution of these different clusters varied according to the natural history of the tissue. Age, parity and germline mutations, known to modulate the risk of developing breast cancer, affected the homeostatic cellular state of the breast in different ways. They found that immune cells from BRCA1 or BRCA2 carriers had a distinct gene expression signature indicative of potential immune exhaustion, which was validated by immunohistochemistry.

Research paper

 

Gene-expression memory-based prediction of cell lineages from scRNA-seq datasets | Nature Communications

Assigning single-cell transcriptomes to cellular lineage trees by lineage tracing has transformed our understanding of differentiation during development, regeneration, and disease. However, lineage tracing is technically demanding and often restricted in time resolution, and most scRNA-seq datasets are devoid of lineage information. Here researchers introduce Gene Expression Memory-based Lineage Inference (GEMLI), a computational tool that robustly identifies small to medium-sized cell lineages solely from scRNA-seq datasets. GEMLI makes it possible to study heritable gene expression, to discriminate symmetric and asymmetric cell fate decisions and to reconstruct individual multicellular structures from pooled scRNA-seq datasets.

Research paper

 

Genetic variation across and within individuals | Nature Reviews Genetics

Germline variation and somatic mutation are intricately connected and together shape human traits and disease risks. Germline variants are present from conception, but they vary between individuals and accumulate over generations. By contrast, somatic mutations accumulate throughout life in a mosaic manner within an individual due to intrinsic and extrinsic sources of mutations and selection pressures acting on cells. Recent advancements, such as improved detection methods and increased resources for association studies, have drastically expanded our ability to investigate germline and somatic genetic variation and compare underlying mutational processes. A better understanding of the similarities and differences in the types, rates and patterns of germline and somatic variants, as well as their interplay, will help elucidate the mechanisms underlying their distinct yet interlinked roles in human health and biology.

Research paper

 

Daily Digest | March 30, 2024

scGPT: toward building a foundation model for single-cell multi-omics using generative AI | Nature Methods

Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), this study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, researchers have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells.

Research paper

 

SHARE-Topic: Bayesian interpretable modeling of single-cell multi-omic data | Genome Biology

Multi-omic single-cell technologies, which simultaneously measure the transcriptional and epigenomic state of the same cell, enable the study of epigenetic mechanisms of gene regulation. However, noisy and sparse data pose fundamental statistical challenges for extracting biological knowledge from complex datasets. SHARE-Topic, a Bayesian generative model of multi-omic single-cell data using topic models, aims to address these challenges. SHARE-Topic identifies common patterns of co-variation between different omic layers, providing interpretable explanations for the data complexity.

Research paper

 

Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters | Nature Communications

Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP’s 2D embeddings might not reliably inform the similarities among cell clusters. Motivated by this challenge, researchers present a statistical method, scDEED, for detecting dubious cell embeddings output by a 2D-embedding method. By calculating a reliability score for every cell embedding based on the similarity between the cell’s 2D-embedding neighbors and pre-embedding neighbors, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method.
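The idea of comparing a cell's neighbors before and after embedding can be illustrated with a minimal sketch. This is not scDEED's actual reliability score or its statistical test, which the summary above does not specify; it is a simplified neighbor-overlap proxy, and the function names, the toy coordinates and the choice of k are all assumptions made for illustration.

```python
import math


def nearest_neighbors(points, index, k):
    """Return the indices of the k points closest (Euclidean) to points[index]."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    return {j for _, j in sorted(distances)[:k]}


def neighbor_overlap_scores(pre_embedding, embedding, k=2):
    """Fraction of each cell's k pre-embedding neighbors kept in the 2D embedding.

    Simplified proxy (assumption): 1.0 means all neighbors are preserved,
    0.0 means none are, which would flag the cell's embedding as dubious.
    """
    scores = []
    for i in range(len(pre_embedding)):
        before = nearest_neighbors(pre_embedding, i, k)
        after = nearest_neighbors(embedding, i, k)
        scores.append(len(before & after) / k)
    return scores


# Hypothetical toy data: two clusters of three cells in the original space.
pre = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 0), (11, 10, 0), (10, 11, 0)]
# Hypothetical 2D embedding in which the last cell lands in the wrong cluster.
emb = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (0.5, 0.5)]

scores = neighbor_overlap_scores(pre, emb, k=2)
print(scores[5])  # the misplaced cell keeps none of its original neighbors: 0.0
```

In this toy case the misplaced cell scores 0.0, while its neighbors in both clusters also lose part of their score, which is the kind of signal a hyperparameter search could try to minimize.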

Research paper

 

Daily Digest | March 29, 2024

The multimodality cell segmentation challenge: toward universal solutions | Nature Methods

Cell segmentation is a critical step for quantitative single-cell analysis in microscopy images. Existing cell segmentation methods are often tailored to specific modalities or require manual interventions to specify hyper-parameters in different experimental settings. Here, researchers present a multimodality cell segmentation benchmark, comprising more than 1,500 labeled images derived from more than 50 diverse biological experiments. The top participants developed a Transformer-based deep-learning algorithm that not only exceeds existing methods but can also be applied to diverse microscopy images across imaging platforms and tissue types without manual parameter adjustments.

Research paper

 

PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration | PLOS Computational Biology

As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here researchers propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies.

Research paper

 

Digital twins in medicine | Nature Computational Science

Medical digital twins, which are potentially vital for personalized medicine, have become a recent focus in medical research. Here the authors present an overview of the state of the art in medical digital twin development, especially in oncology and cardiology, where it is most advanced. They discuss major challenges, such as data integration and privacy, and provide an outlook on future advancements. Emphasizing the importance of this technology in healthcare, they highlight the potential for substantial improvements in patient-specific treatments and diagnostics.

Research paper

 

Daily Digest | March 28, 2024

Tapioca: a platform for predicting de novo protein–protein interactions in dynamic contexts | Nature Methods

Protein–protein interactions (PPIs) drive cellular processes and responses to environmental cues, reflecting the cellular state. Here researchers develop Tapioca, an ensemble machine learning framework for studying global PPIs in dynamic contexts. Tapioca predicts de novo interactions by integrating mass spectrometry interactome data from thermal/ion denaturation or cofractionation workflows with protein properties and tissue-specific functional networks.

Research paper

 

Codon language embeddings provide strong signals for use in protein engineering | Nature Machine Intelligence

Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models’ capacities surpassing the size of the very datasets they were trained on. Here researchers propose an alternative direction. They show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks.

Research paper

 

Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer | Nature Communications

Gastric cancer (GC) represents a significant burden of cancer-related mortality worldwide, underscoring an urgent need for the development of early detection strategies and precise postoperative interventions. However, the identification of non-invasive biomarkers for early diagnosis and patient risk stratification remains underexplored. Here, researchers conduct a targeted metabolomics analysis of 702 plasma samples from multi-center participants to elucidate metabolic reprogramming in GC. Their machine learning analysis yields a 10-metabolite GC diagnostic model, which is validated in an external test set with a sensitivity of 0.905, outperforming conventional methods leveraging cancer protein markers (sensitivity < 0.40).

Research paper

 

Daily Digest | March 27, 2024

All-atom RNA structure determination from cryo-EM maps | Nature Biotechnology

Many methods exist for determining protein structures from cryogenic electron microscopy maps, but this remains challenging for RNA structures. Here researchers developed EMRNA, a method for accurate, automated determination of full-length all-atom RNA structures from cryogenic electron microscopy maps. EMRNA integrates deep learning-based detection of nucleotides, three-dimensional backbone tracing and scoring with consideration of sequence and secondary structure information, and full-atom construction of the RNA structure.

Research paper

 

Flexiplex: a versatile demultiplexer and search tool for omics data | Bioinformatics

The process of analyzing high throughput sequencing data often requires the identification and extraction of specific target sequences. This could include tasks such as identifying cellular barcodes and UMIs in single cell data, and specific genetic variants for genotyping. However, existing tools which perform these functions are often task-specific, such as only demultiplexing barcodes for a dedicated type of experiment, or are not tolerant to noise in the sequencing data. To overcome these limitations, researchers developed Flexiplex, a versatile and fast sequence searching and demultiplexing tool for omics data, which is based on the Levenshtein distance and thus allows imperfect matches.
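To illustrate why a Levenshtein-based search tolerates sequencing noise, here is a minimal sketch of edit-distance barcode assignment. It is not Flexiplex's implementation or command-line interface; the function names, the `max_distance` threshold and the tie-handling rule are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def assign_barcode(observed, known_barcodes, max_distance=1):
    """Return the single closest known barcode within max_distance, else None.

    Assumption: ties between equally close barcodes are treated as ambiguous.
    """
    scored = sorted((levenshtein(observed, bc), bc) for bc in known_barcodes)
    if not scored or scored[0][0] > max_distance:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]


# Hypothetical barcodes; the observed read carries one substitution error.
known = ["ACGTACGT", "TTTTGGGG"]
print(assign_barcode("ACGTACGA", known))  # ACGTACGT
print(assign_barcode("GGGGGGGG", known))  # None (too far from both)
```

An exact-match lookup would discard the first read, whereas the edit-distance threshold recovers it at the cost of comparing against every candidate barcode.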

Research paper

 

Chromosome evolution screens recapitulate tissue-specific tumor aneuploidy patterns | Nature Genetics

Whole chromosome and arm-level copy number alterations occur at high frequencies in tumors, but their selective advantages, if any, are poorly understood. Here, utilizing unbiased whole chromosome genetic screens combined with in vitro evolution to generate arm- and subarm-level events, researchers iteratively selected the fittest karyotypes from aneuploidized human renal and mammary epithelial cells.

Research paper

 

Daily Digest | March 26, 2024

De novo and somatic structural variant discovery with SVision-pro | Nature Biotechnology

Long-read-based de novo and somatic structural variant (SV) discovery remains challenging, necessitating genomic comparison between samples. Researchers developed SVision-pro, a neural-network-based instance segmentation framework that represents genome-to-genome-level sequencing differences visually and discovers SVs comparatively between genomes without any prerequisite for inference models.

Research paper

 

Multi-omic integration of microbiome data for identifying disease-associated modules | Nature Communications

Multi-omic studies of the human gut microbiome are crucial for understanding its role in disease across multiple functional layers. Nevertheless, integrating and analyzing such complex datasets poses significant challenges. Most notably, current analysis methods often yield extensive lists of disease-associated features (e.g., species, pathways, or metabolites), without capturing the multi-layered structure of the data. Here, researchers address this challenge by introducing “MintTea”, an intermediate integration-based approach combining canonical correlation analysis extensions, consensus analysis, and an evaluation protocol. MintTea identifies “disease-associated multi-omic modules”, comprising features from multiple omics that shift in concord and that collectively associate with the disease. Applied to diverse cohorts, MintTea captures modules with high predictive power, significant cross-omic correlations, and alignment with known microbiome-disease associations.

Research paper

 

Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis | Nature Genetics

Researchers present a gene-level regulatory model, single-cell ATAC + RNA linking (SCARlink), which predicts single-cell gene expression and links enhancers to target genes using multi-ome (scRNA-seq and scATAC–seq co-assay) sequencing data. The approach uses regularized Poisson regression on tile-level accessibility data to jointly model all regulatory effects at a gene locus, avoiding the limitations of pairwise gene–peak correlations and dependence on peak calling.
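A regularized Poisson regression of expression counts on tile-level accessibility can be sketched in a few lines. This is a generic L2-penalized Poisson model fitted by plain gradient descent, not SCARlink's code; the penalty form, learning rate, toy data and function name are assumptions, since the summary does not give those details.

```python
import math


def fit_poisson_ridge(X, y, alpha=0.1, learning_rate=0.05, n_iter=2000):
    """Fit log(mu) = bias + w . x with an L2 penalty on w by gradient descent.

    X: per-cell accessibility of each tile (list of rows); y: expression counts.
    """
    n_cells, n_tiles = len(X), len(X[0])
    weights = [0.0] * n_tiles
    bias = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * n_tiles
        grad_b = 0.0
        for row, count in zip(X, y):
            mu = math.exp(bias + sum(w * x for w, x in zip(weights, row)))
            residual = mu - count  # gradient of the Poisson loss w.r.t. log(mu)
            grad_b += residual
            for t, x in enumerate(row):
                grad_w[t] += residual * x
        bias -= learning_rate * grad_b / n_cells
        weights = [
            w - learning_rate * (g / n_cells + alpha * w)
            for w, g in zip(weights, grad_w)
        ]
    return weights, bias


# Hypothetical toy locus: tile 0 tracks expression, tile 1 does not.
X = [[0, 1], [1, 0], [0, 0], [1, 1], [1, 0], [0, 1]]
y = [1, 5, 1, 6, 4, 0]

weights, bias = fit_poisson_ridge(X, y)
print(weights)  # tile 0 receives the larger coefficient
```

Because all tiles at the locus enter one model, each coefficient is estimated jointly with the others rather than from a separate gene-peak correlation.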

Research paper

 

Daily Digest | March 25, 2024

Accurate and sensitive mutational signature analysis with MuSiCal | Nature Genetics

Mutational signature analysis is a recent computational approach for interpreting somatic mutations in the genome. Its application to cancer data has enhanced our understanding of mutational forces driving tumorigenesis and demonstrated its potential to inform prognosis and treatment decisions. However, methodological challenges remain for discovering new signatures and assigning proper weights to existing signatures, thereby hindering broader clinical applications. Here researchers present Mutational Signature Calculator (MuSiCal), a rigorous analytical framework with algorithms that solve major problems in the standard workflow.

Research paper

 

Dividing out quantification uncertainty allows efficient assessment of differential transcript expression with edgeR | Nucleic Acids Research

Differential expression analysis of RNA-seq is one of the most commonly performed bioinformatics analyses. Transcript-level quantifications are inherently more uncertain than gene-level read counts because of ambiguous assignment of sequence reads to transcripts. While sequence reads can usually be assigned unambiguously to a gene, reads are very often compatible with multiple transcripts for that gene, particularly for genes with many isoforms. Software tools designed for gene-level differential expression do not perform optimally on transcript counts because the read-to-transcript ambiguity (RTA) disrupts the mean-variance relationship normally observed for gene level RNA-seq data and interferes with the efficiency of the empirical Bayes dispersion estimation procedures. The pseudoaligners kallisto and Salmon provide bootstrap samples from which quantification uncertainty can be assessed. Researchers show that the overdispersion arising from RTA can be elegantly estimated by fitting a quasi-Poisson model to the bootstrap counts for each transcript. The technical overdispersion arising from RTA can then be divided out of the transcript counts, leading to scaled counts that can be input for analysis by established gene-level software tools with full statistical efficiency.
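The "dividing out" step can be illustrated with a minimal sketch. Under a Poisson model the bootstrap variance of a transcript's count would equal its mean, so the variance-to-mean ratio serves as a simple overdispersion estimate. This is not edgeR's estimator, which the summary does not detail; the per-transcript estimate, the floor at 1 and the function names are assumptions made for illustration.

```python
from statistics import mean, variance


def rta_overdispersion(bootstrap_counts):
    """Variance-to-mean ratio of bootstrap counts, floored at 1 (assumption)."""
    m = mean(bootstrap_counts)
    if m == 0:
        return 1.0
    return max(1.0, variance(bootstrap_counts) / m)


def scale_counts(counts, bootstraps):
    """Divide each transcript count by its estimated overdispersion."""
    return {
        transcript: count / rta_overdispersion(bootstraps[transcript])
        for transcript, count in counts.items()
    }


# Hypothetical bootstrap resamples for two transcripts with the same count.
bootstraps = {
    "tx_unique": [100, 98, 102, 101, 99],      # reads map unambiguously
    "tx_ambiguous": [100, 40, 160, 70, 130],   # reads shared with other isoforms
}
counts = {"tx_unique": 100, "tx_ambiguous": 100}

scaled = scale_counts(counts, bootstraps)
print(scaled["tx_unique"])     # 100.0, no down-weighting
print(scaled["tx_ambiguous"])  # about 4.44, heavily down-weighted
```

The ambiguous transcript's count shrinks to reflect how little independent information it carries, which is what lets count-based gene-level tools treat the scaled values as ordinary counts.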

Research paper

 

A causal perspective on dataset bias in machine learning for medical imaging | Nature Machine Intelligence

As machine learning methods gain prominence within clinical decision-making, the need to address fairness concerns becomes increasingly urgent. Despite considerable work dedicated to detecting and ameliorating algorithmic bias, today’s methods are deficient, with potentially harmful consequences. This Perspective sheds new light on algorithmic bias, highlighting how different sources of dataset bias may seem indistinguishable yet require substantially different mitigation strategies.

Research paper

 

Daily Digest | March 24, 2024

Quality assessment of gene repertoire annotations with OMArk | Nature Biotechnology

In the era of biodiversity genomics, it is crucial to ensure that annotations of protein-coding gene repertoires are accurate. State-of-the-art tools to assess genome annotations measure the completeness of a gene repertoire but are blind to other errors, such as gene overprediction or contamination. Researchers introduce OMArk, a software package that relies on fast, alignment-free sequence comparisons between a query proteome and precomputed gene families across the tree of life. OMArk assesses not only the completeness but also the consistency of the gene repertoire as a whole relative to closely related species and reports likely contamination events.

Research paper

 

A novel batch-effect correction method for scRNA-seq data based on Adversarial Information Factorization | PLOS Computational Biology

Single-cell RNA sequencing captures the signal of individual cells, allowing a finer resolution than bulk sequencing, which is particularly important for studies involving rare cell populations, such as tumor heterogeneity or lineage tracing studies. However, it is sensitive to the experimental conditions, which induce a bias in the data known as batch effects. These technical variations hinder any aggregated analysis, limiting scRNA-seq to individual trials. To address this issue, researchers developed a novel deep learning method called Adversarial Information Factorization, which aims to factorize the batch effects out of the biological signal and align the individual trials for downstream aggregated analysis.

Research paper

 

scCASE: accurate and interpretable enhancement for single-cell chromatin accessibility sequencing data | Nature Communications

Single-cell chromatin accessibility sequencing (scCAS) has emerged as a valuable tool for interrogating and elucidating epigenomic heterogeneity and gene regulation. However, scCAS data inherently suffer from limitations such as high sparsity and dimensionality, which pose significant challenges for downstream analyses. Although several methods have been proposed to enhance scCAS data, challenges and limitations still hinder their effectiveness. Here, researchers propose scCASE, an scCAS data enhancement method based on non-negative matrix factorization that incorporates an iteratively updated cell-to-cell similarity matrix.

Research paper

 

Daily Digest | March 23, 2024

SpatialData: an open and universal data framework for spatial omics | Nature Methods

Spatially resolved omics technologies are transforming our understanding of biological tissues. However, the handling of uni- and multimodal spatial omics datasets remains a challenge owing to large data volumes, heterogeneity of data types and the lack of flexible, spatially aware data structures. Here researchers introduce SpatialData, a framework that establishes a unified and extensible multiplatform file-format, lazy representation of larger-than-memory data, transformations and alignment to common coordinate systems.

Research paper

 

HeAR — Health Acoustic Representations | arXiv

Health acoustic sounds such as coughs and breaths are known to contain useful health signals with significant potential for monitoring health and disease, yet are underexplored in the medical machine learning community. The existing deep learning systems for health acoustics are often narrowly trained and evaluated on a single task, which is limited by data and may hinder generalization to other tasks. To mitigate these gaps, researchers develop HeAR, a scalable self-supervised learning-based deep learning system using masked autoencoders trained on a large dataset of 313 million two-second long audio clips. Through linear probes, they establish HeAR as a state-of-the-art health audio embedding model on a benchmark of 33 health acoustic tasks across 6 datasets.

Research paper

 

Genomic data in the All of Us Research Program | Nature

Comprehensively mapping the genetic basis of human disease across diverse individuals is a long-standing goal for the field of human genetics. The All of Us Research Program is a longitudinal cohort study aiming to enrol a diverse group of at least one million individuals across the USA to accelerate biomedical research and improve human health. Here the authors describe the programme’s genomics data release of 245,388 clinical-grade genome sequences. This resource is unique in its diversity as 77% of participants are from communities that are historically under-represented in biomedical research and 46% are individuals from under-represented racial and ethnic minorities. All of Us identified more than 1 billion genetic variants, including more than 275 million previously unreported genetic variants, more than 3.9 million of which had coding consequences.

Research paper

 

Daily Digest | March 22, 2024

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms | Nature Methods

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.

Research paper

 

DANCE: a deep learning library and benchmark platform for single-cell analysis | Genome Biology

DANCE is the first standard, generic, and extensible benchmark platform for accessing and evaluating computational methods across the spectrum of benchmark datasets for numerous single-cell analysis tasks. Currently, DANCE supports 3 modules and 8 popular tasks with 32 state-of-the-art methods on 21 benchmark datasets.

Research paper

 

Towards a general-purpose foundation model for computational pathology | Nature Medicine

Quantitative evaluation of tissue images is crucial for computational pathology (CPath) tasks, requiring the objective characterization of histopathological entities from whole-slide images (WSIs). The high resolution of WSIs and the variability of morphological features present significant challenges, complicating the large-scale annotation of data for high-performance applications. To address this challenge, current efforts have proposed the use of pretrained image encoders through transfer learning from natural image datasets or self-supervised learning on publicly available histopathology datasets, but have not been extensively developed and evaluated across diverse tissue types at scale. Researchers introduce UNI, a general-purpose self-supervised model for pathology, pretrained using more than 100 million images from over 100,000 diagnostic H&E-stained WSIs (>77 TB of data) across 20 major tissue types.

Research paper