daily digest – BioDecoded

Daily Digest | April 28, 2024

Analysis and benchmarking of small and large genomic variants across tandem repeats | Nature Biotechnology

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, researchers created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. They curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods.

Development and validation of AI/ML derived splice-switching oligonucleotides | Molecular Systems Biology

Splice-switching oligonucleotides (SSOs) are antisense compounds that act directly on pre-mRNA to modulate alternative splicing (AS). This study demonstrates the value that artificial intelligence/machine learning (AI/ML) provides for the identification of functional, verifiable, and therapeutic SSOs. Researchers trained XGBoost tree models using splicing factor (SF) pre-mRNA binding profiles and spliceosome assembly information to identify modulatory SSO binding sites on pre-mRNA. Using Shapley and out-of-bag analyses, they also predicted the identity of specific SFs whose binding to pre-mRNA is blocked by SSOs.

Transparent medical image AI via an image–text foundation model grounded in medical literature | Nature Medicine

Building trustworthy and transparent image-based medical artificial intelligence (AI) systems requires the ability to interrogate data and models at all stages of the development pipeline, from training models to post-deployment monitoring. Ideally, the data and associated AI systems could be described using terms already familiar to physicians, but this requires medical datasets densely annotated with semantically meaningful concepts. In the present study, researchers present a foundation model approach, named MONET (medical concept retriever), which learns how to connect medical images with text and densely scores images on concept presence to enable important tasks in medical AI development and deployment such as data auditing, model auditing and model interpretation. Dermatology provides a demanding use case for the versatility of MONET, due to the heterogeneity in diseases, skin tones and imaging modalities. They trained MONET based on 105,550 dermatological images paired with natural language descriptions from a large collection of medical literature. MONET can accurately annotate concepts across dermatology images as verified by board-certified dermatologists, competitively with supervised models built on previously concept-annotated dermatology datasets of clinical images.
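The idea of scoring an image on concept presence can be sketched as follows, assuming a CLIP-style setup in which images and concept phrases are embedded in one shared vector space and scored by cosine similarity. This is an illustrative sketch only, not MONET's code: the function names, the concept names and the three-dimensional toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def score_concepts(image_vec, concept_vecs):
    """Score one image embedding against every concept text embedding."""
    return {name: cosine(image_vec, vec) for name, vec in concept_vecs.items()}

# Hypothetical toy embeddings; a real model would produce these with its
# image and text encoders.
image_vec = [0.9, 0.1, 0.2]
concept_vecs = {
    "erythema": [1.0, 0.0, 0.1],
    "ulcer": [0.0, 1.0, 0.0],
    "scale": [0.1, 0.2, 1.0],
}

scores = score_concepts(image_vec, concept_vecs)
best = max(scores, key=scores.get)  # concept with the highest similarity
```

Scoring every image against every concept in this way yields a dense image-by-concept matrix, which is the kind of annotation that data and model auditing can then build on.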

Daily Digest | April 27, 2024

Large language models for preventing medication direction errors in online pharmacies | Nature Medicine

Errors in pharmacy medication directions, such as incorrect instructions for dosage or frequency, can increase patient safety risk substantially by raising the chances of adverse drug events. This study explores how integrating domain knowledge with large language models (LLMs)—capable of sophisticated text interpretation and generation—can reduce these errors. Researchers introduce MEDIC (medication direction copilot), a system that emulates the reasoning of pharmacists by prioritizing precise communication of core clinical components of a prescription, such as dosage and frequency.

CASCC: a co-expression assisted single-cell RNA-seq data clustering method | Bioinformatics

Existing clustering methods for characterizing cell populations from single-cell RNA sequencing are constrained by several limitations stemming from the fact that clusters often cannot be homogeneous, particularly for transitioning populations. On the other hand, dominant cell populations within samples can be identified independently by their strong gene co-expression signatures using methods unrelated to partitioning. Here, researchers introduce a clustering method, CASCC, designed to improve biological accuracy using gene co-expression features identified using an unsupervised adaptive attractor algorithm.

Virtual reality-empowered deep-learning analysis of brain cells | Nature Methods

Researchers created DELiVR, a deep-learning pipeline for 3D brain-cell mapping that is trained with virtual reality-generated reference annotations. It can be deployed via the user-friendly interface of the open-source software Fiji, which makes the analysis of large-scale 3D brain images widely accessible to scientists without computational expertise.

Daily Digest | April 26, 2024

Computational scoring and experimental evaluation of enzymes generated by neural networks | Nature Biotechnology

In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. Researchers evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model.

Measuring, visualizing, and diagnosing reference bias with biastools | Genome Biology

Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias.

Causal machine learning for predicting treatment outcomes | Nature Medicine

Causal machine learning (ML) offers flexible, data-driven methods for predicting treatment outcomes including efficacy and toxicity, thereby supporting the assessment and safety of drugs. A key benefit of causal ML is that it allows for estimating individualized treatment effects, so that clinical decision-making can be personalized to individual patient profiles. In this Perspective, the authors discuss the benefits of causal ML (relative to traditional statistical or ML approaches) and outline the key components and steps.
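As a minimal illustration of estimating individualized (conditional) treatment effects, the sketch below uses a "T-learner": fit one outcome model on treated patients, another on controls, and take the difference of their predictions for a given patient profile. This is one common approach, not necessarily the one the Perspective emphasizes. The outcome model here is just a per-stratum mean, the data are invented, and the estimate is only meaningful under the assumption of no unmeasured confounding within each stratum.

```python
from collections import defaultdict

def fit_group_means(rows):
    """Toy outcome model: mean outcome per covariate stratum.
    Stands in for any regressor; hypothetical helper for illustration."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in rows:
        sums[x][0] += y
        sums[x][1] += 1
    return {x: total / count for x, (total, count) in sums.items()}

def t_learner_cate(data):
    """Estimate the treatment effect per stratum from (covariate, treated, outcome) rows."""
    treated = [(x, y) for x, t, y in data if t == 1]
    control = [(x, y) for x, t, y in data if t == 0]
    mu1 = fit_group_means(treated)  # expected outcome under treatment
    mu0 = fit_group_means(control)  # expected outcome under control
    return {x: mu1[x] - mu0[x] for x in mu1 if x in mu0}

# Invented toy data: (age group, treatment indicator, outcome).
data = [
    ("young", 1, 8.0), ("young", 1, 10.0), ("young", 0, 5.0), ("young", 0, 7.0),
    ("old", 1, 4.0), ("old", 1, 6.0), ("old", 0, 5.0), ("old", 0, 3.0),
]
cate = t_learner_cate(data)
```

In this toy example the estimated effect differs between the two age groups, which is the kind of heterogeneity that personalized decision-making relies on.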

Daily Digest | April 25, 2024

De novo and somatic structural variant discovery with SVision-pro | Nature Biotechnology

Long-read-based de novo and somatic structural variant (SV) discovery remains challenging, necessitating genomic comparison between samples. Researchers developed SVision-pro, a neural-network-based instance segmentation framework that represents genome-to-genome-level sequencing differences visually and discovers SVs comparatively between genomes without any prerequisite for inference models.

Multi-omic integration of microbiome data for identifying disease-associated modules | Nature Communications

Multi-omic studies of the human gut microbiome are crucial for understanding its role in disease across multiple functional layers. Nevertheless, integrating and analyzing such complex datasets poses significant challenges. Most notably, current analysis methods often yield extensive lists of disease-associated features (e.g., species, pathways, or metabolites), without capturing the multi-layered structure of the data. Here, researchers address this challenge by introducing “MintTea”, an intermediate integration-based approach combining canonical correlation analysis extensions, consensus analysis, and an evaluation protocol. MintTea identifies “disease-associated multi-omic modules”, comprising features from multiple omics that shift in concord and that collectively associate with the disease. Applied to diverse cohorts, MintTea captures modules with high predictive power, significant cross-omic correlations, and alignment with known microbiome-disease associations.

Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis | Nature Genetics

Researchers present a gene-level regulatory model, single-cell ATAC + RNA linking (SCARlink), which predicts single-cell gene expression and links enhancers to target genes using multi-ome (scRNA-seq and scATAC–seq co-assay) sequencing data. The approach uses regularized Poisson regression on tile-level accessibility data to jointly model all regulatory effects at a gene locus, avoiding the limitations of pairwise gene–peak correlations and dependence on peak calling.
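To make the regression idea concrete, here is a rough sketch of an L2-regularized Poisson regression that models a gene's counts from tile-level accessibility, fitted by plain gradient descent. It is not SCARlink's implementation: the function name `fit_poisson_ridge`, the hyperparameters and the toy data are all hypothetical, and a real analysis would use many more cells and tiles along with a proper optimizer.

```python
import math

def fit_poisson_ridge(X, y, l2=0.1, lr=0.01, n_iter=2000):
    """Fit log E[y] = b + X.w by gradient descent on the mean Poisson
    negative log-likelihood plus an L2 penalty on w (intercept unpenalized).
    Hypothetical helper for illustration only."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            mu = math.exp(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = mu - yi  # derivative of the Poisson NLL w.r.t. the linear predictor
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data: rows are cells, columns are accessibility values for two tiles.
# Counts were made up so that tile 0 tracks expression and tile 1 does not.
X = [[0, 1], [1, 0], [2, 1], [3, 0], [0, 0], [1, 1], [2, 0], [3, 1]]
y = [1, 2, 4, 9, 1, 2, 5, 8]

w, b = fit_poisson_ridge(X, y)
```

Because all tiles at a locus enter one model jointly, each fitted weight reflects a tile's contribution given the others, which is the contrast with testing gene-peak pairs one at a time.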

Daily Digest | April 24, 2024

Generative models improve fairness of medical classifiers under distribution shifts | Nature Medicine

Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and ‘labeling’ by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. Researchers hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching the training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. They show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner.

Single Cell Atlas: a single-cell multi-omics human cell encyclopedia | Genome Biology

Single-cell sequencing datasets are key in biology and medicine for unraveling insights into heterogeneous cell populations with unprecedented resolution. Here, researchers construct a single-cell multi-omics map of human tissues through in-depth characterizations of datasets from five single-cell omics, spatial transcriptomics, and two bulk omics across 125 healthy adult and fetal tissues. They also build a complementary web-based platform, the Single Cell Atlas (SCA, www.singlecellatlas.org), to enable extensive interactive exploration of deep multi-omics signatures across human fetal and adult tissues.

brainlife.io: a decentralized and open-source cloud platform to support neuroscience research | Nature Methods

Neuroscience is advancing standardization and tool development to support rigor and transparency. Consequently, data pipeline complexity has increased, hindering FAIR (findable, accessible, interoperable and reusable) access. brainlife.io was developed to democratize neuroimaging research. The platform provides data standardization, management, visualization and processing and automatically tracks the provenance history of thousands of data objects.

Daily Digest | April 23, 2024

Pretraining a foundation model for generalizable fluorescence microscopy-based image restoration | Nature Methods

Fluorescence microscopy-based image restoration has received widespread attention in the life sciences and has led to significant progress, benefiting from deep learning technology. However, most current task-specific methods have limited generalizability to different fluorescence microscopy-based image restoration problems. Here, researchers seek to improve generalizability and explore the potential of applying a pretrained foundation model to fluorescence microscopy-based image restoration. They provide a universal fluorescence microscopy-based image restoration (UniFMIR) model to address different restoration problems, and show that UniFMIR offers higher image restoration precision, better generalization and increased versatility.

Prediction of protein-RNA interactions from single-cell transcriptomic data | Nucleic Acids Research

Proteins are crucial in regulating every aspect of RNA life, yet understanding their interactions with coding and noncoding RNAs remains limited. Experimental studies are typically restricted to a small number of cell lines and a limited set of RNA-binding proteins (RBPs). Although computational methods based on physico-chemical principles can predict protein-RNA interactions accurately, they often lack the ability to consider cell-type-specific gene expression and the broader context of gene regulatory networks (GRNs). Here, researchers assess the performance of several GRN inference algorithms in predicting protein-RNA interactions from single-cell transcriptomic data, and propose a pipeline, called scRAPID (single-cell transcriptomic-based RnA Protein Interaction Detection), that integrates these methods with the catRAPID algorithm, which can identify direct physical interactions between RBPs and RNA molecules.

Demographic bias in misdiagnosis by computational pathology models | Nature Medicine

Despite increasing numbers of regulatory approvals, deep learning-based computational pathology systems often overlook the impact of demographic factors on performance, potentially leading to biases. This concern is all the more important as computational pathology has leveraged large public datasets that underrepresent certain demographic groups. Using publicly available data from The Cancer Genome Atlas and the EBRAINS brain tumor atlas, as well as internal patient data, the authors show that whole-slide image classification models display marked performance disparities across different demographic groups when used to subtype breast and lung carcinomas and to predict IDH1 mutations in gliomas.

Daily Digest | April 22, 2024

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms | Nature Methods

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.

DANCE: a deep learning library and benchmark platform for single-cell analysis | Genome Biology

DANCE is the first standard, generic, and extensible benchmark platform for accessing and evaluating computational methods across the spectrum of benchmark datasets for numerous single-cell analysis tasks. Currently, DANCE supports 3 modules and 8 popular tasks with 32 state-of-the-art methods on 21 benchmark datasets.

Towards a general-purpose foundation model for computational pathology | Nature Medicine

Quantitative evaluation of tissue images is crucial for computational pathology (CPath) tasks, requiring the objective characterization of histopathological entities from whole-slide images (WSIs). The high resolution of WSIs and the variability of morphological features present significant challenges, complicating the large-scale annotation of data for high-performance applications. To address this challenge, current efforts have proposed the use of pretrained image encoders through transfer learning from natural image datasets or self-supervised learning on publicly available histopathology datasets, but these encoders have not been extensively developed and evaluated across diverse tissue types at scale. Researchers introduce UNI, a general-purpose self-supervised model for pathology, pretrained using more than 100 million images from over 100,000 diagnostic H&E-stained WSIs (>77 TB of data) across 20 major tissue types.

Daily Digest | April 21, 2024

A visual-language foundation model for computational pathology | Nature Medicine

The accelerated adoption of digital pathology and advances in deep learning have enabled the development of robust models for various pathology tasks across a diverse array of diseases and patient cohorts. Researchers introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text and, notably, over 1.17 million image–caption pairs through task-agnostic pretraining. Evaluated on a suite of 14 diverse benchmarks, CONCH can be transferred to a wide range of downstream tasks involving histopathology images and/or text, achieving state-of-the-art performance on histology image classification, segmentation, captioning, and text-to-image and image-to-text retrieval.

Domain-specific optimization and diverse evaluation of self-supervised models for histopathology | arXiv

Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, researchers describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL).

Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data | Nature Communications

Long-read sequencing offers long contiguous DNA fragments, facilitating diploid genome assembly and structural variant (SV) detection. Efficient and robust algorithms for SV identification are crucial with increasing data availability. Here researchers systematically compare 14 read alignment-based SV calling methods (including 4 deep learning-based methods and 1 hybrid method), and 4 assembly-based SV calling methods, alongside 4 upstream aligners and 7 assemblers.

Daily Digest | April 20, 2024

Development and validation of a new algorithm for improved cardiovascular risk prediction | Nature Medicine

QRISK algorithms use data from millions of people to help clinicians identify individuals at high risk of cardiovascular disease (CVD). Here, researchers derive and externally validate a new algorithm, QR4, that incorporates novel risk factors to estimate 10-year CVD risk separately for men and women. Health data from 9.98 million and 6.79 million adults from the United Kingdom were used for derivation and validation of the algorithm, respectively.

Benchmarking bioinformatic virus identification tools using real-world metagenomic data across biomes | Genome Biology

As most viruses remain uncultivated, metagenomics is currently the main method for virus discovery. Detecting viruses in metagenomic data is not trivial. In the past few years, many bioinformatic virus identification tools have been developed for this task, making it challenging to choose the right tools, parameters, and cutoffs. As all these tools measure different biological signals, and use different algorithms and training and reference databases, it is imperative to conduct an independent benchmarking to give users objective guidance. Researchers compare the performance of nine state-of-the-art virus identification tools in thirteen modes on eight paired viral and microbial datasets from three distinct biomes, including a new complex dataset from Antarctic coastal waters.

spVC for the detection and interpretation of spatial gene expression variation | Genome Biology

Spatially resolved transcriptomics technologies have opened new avenues for understanding gene expression heterogeneity in spatial contexts. However, existing methods for identifying spatially variable genes often focus solely on statistical significance, limiting their ability to capture continuous expression patterns and integrate spot-level covariates. To address these challenges, researchers introduce spVC, a statistical method based on a generalized Poisson model. spVC seamlessly integrates constant and spatially varying effects of covariates, facilitating comprehensive exploration of gene expression variability and enhancing interpretability.

Daily Digest | April 19, 2024

Benchmarking spatial clustering methods with spatially resolved transcriptomics data | Nature Methods

Spatial clustering, which shares an analogy with single-cell clustering, has expanded the scope of tissue physiology studies from cell-centroid to structure-centroid with spatially resolved transcriptomics (SRT) data. Computational methods have undergone remarkable development in recent years, but a comprehensive benchmark study is still lacking. Here, researchers present a benchmark study of 13 computational methods on 34 SRT data (7 datasets). Performance was evaluated on the basis of accuracy, spatial continuity, marker gene detection, scalability, and robustness.

Prediction of metabolites associated with somatic mutations in cancers by using genome-scale metabolic models and mutation data | Genome Biology

Oncometabolites, often generated as a result of a gene mutation, show pro-oncogenic function when abnormally accumulated in cancer cells. Here researchers report the development of a computational workflow that predicts metabolite-gene-pathway sets. Metabolite-gene-pathway sets present metabolites and metabolic pathways significantly associated with specific somatic mutations in cancers. The computational workflow uses both cancer patient-specific genome-scale metabolic models (GEMs) and mutation data to generate metabolite-gene-pathway sets.

Interrogations of single-cell RNA splicing landscapes with SCASL define new cell identities with physiological relevance | Nature Communications

RNA splicing shapes the gene regulatory programs that underlie various physiological and disease processes. Here, researchers present the SCASL (single-cell clustering based on alternative splicing landscapes) method for interrogating the heterogeneity of RNA splicing with single-cell RNA-seq data. SCASL resolves the issue of biased and sparse data coverage on single-cell RNA splicing and provides a new scheme for classifications of cell identities.