BioDecoded – At the intersection of biomedicine and computing

Daily Digest | April 20, 2024

Development and validation of a new algorithm for improved cardiovascular risk prediction | Nature Medicine

QRISK algorithms use data from millions of people to help clinicians identify individuals at high risk of cardiovascular disease (CVD). Here, researchers derive and externally validate a new algorithm, QR4, that incorporates novel risk factors to estimate 10-year CVD risk separately for men and women. Health data from 9.98 million and 6.79 million adults from the United Kingdom were used for derivation and validation of the algorithm, respectively.

Benchmarking bioinformatic virus identification tools using real-world metagenomic data across biomes | Genome Biology

As most viruses remain uncultivated, metagenomics is currently the main method for virus discovery. Detecting viruses in metagenomic data is not trivial. In the past few years, many bioinformatic virus identification tools have been developed for this task, making it challenging to choose the right tools, parameters, and cutoffs. As all these tools measure different biological signals, and use different algorithms and training and reference databases, it is imperative to conduct an independent benchmarking to give users objective guidance. Researchers compare the performance of nine state-of-the-art virus identification tools in thirteen modes on eight paired viral and microbial datasets from three distinct biomes, including a new complex dataset from Antarctic coastal waters.

spVC for the detection and interpretation of spatial gene expression variation | Genome Biology

Spatially resolved transcriptomics technologies have opened new avenues for understanding gene expression heterogeneity in spatial contexts. However, existing methods for identifying spatially variable genes often focus solely on statistical significance, limiting their ability to capture continuous expression patterns and integrate spot-level covariates. To address these challenges, researchers introduce spVC, a statistical method based on a generalized Poisson model. spVC seamlessly integrates constant and spatially varying effects of covariates, facilitating comprehensive exploration of gene expression variability and enhancing interpretability.

Daily Digest | April 19, 2024

Benchmarking spatial clustering methods with spatially resolved transcriptomics data | Nature Methods

Spatial clustering, which is analogous to single-cell clustering, has expanded the scope of tissue physiology studies from cell-centroid to structure-centroid with spatially resolved transcriptomics (SRT) data. Computational methods have undergone remarkable development in recent years, but a comprehensive benchmark study is still lacking. Here researchers present a benchmark study of 13 computational methods on 34 SRT data (7 datasets). Performance was evaluated on the basis of accuracy, spatial continuity, marker gene detection, scalability, and robustness.

Prediction of metabolites associated with somatic mutations in cancers by using genome-scale metabolic models and mutation data | Genome Biology

Oncometabolites, often generated as a result of a gene mutation, show pro-oncogenic function when abnormally accumulated in cancer cells. Here researchers report the development of a computational workflow that predicts metabolite-gene-pathway sets. Metabolite-gene-pathway sets present metabolites and metabolic pathways significantly associated with specific somatic mutations in cancers. The computational workflow uses both cancer patient-specific genome-scale metabolic models (GEMs) and mutation data to generate metabolite-gene-pathway sets.

Interrogations of single-cell RNA splicing landscapes with SCASL define new cell identities with physiological relevance | Nature Communications

RNA splicing shapes the gene regulatory programs that underlie various physiological and disease processes. Here, researchers present the SCASL (single-cell clustering based on alternative splicing landscapes) method for interrogating the heterogeneity of RNA splicing with single-cell RNA-seq data. SCASL resolves the issue of biased and sparse data coverage on single-cell RNA splicing and provides a new scheme for classifications of cell identities.

Daily Digest | April 18, 2024

Improving microbial phylogeny with citizen science within a mass-market video game | Nature Biotechnology

Citizen science video games are designed primarily for users already inclined to contribute to science, which severely limits their accessibility for an estimated community of 3 billion gamers worldwide. Researchers created Borderlands Science (BLS), a citizen science activity that is seamlessly integrated within a popular commercial video game played by tens of millions of gamers. This integration is facilitated by a novel game-first design of citizen science games, in which the game design aspect has the highest priority, and a suitable task is then mapped to the game design. BLS crowdsources a multiple alignment task of 1 million 16S ribosomal RNA sequences obtained from human microbiome studies. Since its initial release on 7 April 2020, over 4 million players have solved more than 135 million science puzzles, a task unsolvable by a single individual. Leveraging these results, they show that their multiple sequence alignment simultaneously improves microbial phylogeny estimations and UniFrac effect sizes compared to state-of-the-art computational methods.

Topological benchmarking of algorithms to infer Gene Regulatory Networks from Single-Cell RNA-seq Data | Bioinformatics

In recent years, many algorithms for inferring gene regulatory networks from single-cell transcriptomic data have been published. Several studies have evaluated their accuracy in estimating the presence of an interaction between pairs of genes. However, these benchmarking analyses do not quantify the algorithms’ ability to capture structural properties of networks, which are fundamental, for example, for studying the robustness of a gene network to external perturbations. Here, researchers devise a three-step benchmarking pipeline called STREAMLINE that quantifies the ability of algorithms to capture topological properties of networks and identify hubs.
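To make "identifying hubs" concrete, here is a minimal sketch of one way to compare the highest-degree genes of an inferred network against a reference network. It is not the STREAMLINE pipeline: the function names, the toy edge lists, the treatment of edges as undirected, and the use of plain degree as the hub criterion are all assumptions made for illustration.

```python
from collections import Counter


def degree_counts(edges):
    """Count how many edges touch each gene (edges treated as undirected here)."""
    degrees = Counter()
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


def top_hubs(edges, n):
    """Return the n highest-degree genes; ties are broken alphabetically."""
    degrees = degree_counts(edges)
    ordered = sorted(degrees.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ordered[:n]]


def hub_overlap(reference_edges, inferred_edges, n):
    """Fraction of the top-n reference hubs that are also top-n inferred hubs."""
    reference_hubs = set(top_hubs(reference_edges, n))
    inferred_hubs = set(top_hubs(inferred_edges, n))
    return len(reference_hubs & inferred_hubs) / n


# Hypothetical toy networks, not taken from the paper.
reference = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "D")]
inferred = [("A", "B"), ("A", "C"), ("D", "E"), ("D", "C"), ("D", "B")]

reference_hubs = top_hubs(reference, 2)
inferred_hubs = top_hubs(inferred, 2)
overlap = hub_overlap(reference, inferred, 2)
```

An edge-level accuracy score would not show this kind of disagreement directly: the two toy networks share some edges, yet only one of the two top reference hubs is recovered.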

Prediction of tumor origin in cancers of unknown primary origin with cytology-based deep learning | Nature Medicine

Cancer of unknown primary (CUP) site poses diagnostic challenges due to its elusive nature. Many cases of CUP manifest as pleural and peritoneal serous effusions. Leveraging cytological images from 57,220 cases at four tertiary hospitals, researchers developed a deep-learning method for tumor origin differentiation using cytological histology (TORCH) that can identify malignancy and predict tumor origin in both hydrothorax and ascites. They examined its performance on three internal (n = 12,799) and two external (n = 14,538) testing sets. In both internal and external testing sets, TORCH achieved area under the receiver operating characteristic curve values ranging from 0.953 to 0.991 for cancer diagnosis and 0.953 to 0.979 for tumor origin localization. TORCH accurately predicted primary tumor origins, with a top-1 accuracy of 82.6% and top-3 accuracy of 98.9%.
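For readers unfamiliar with the top-1 and top-3 metrics quoted above, the following is a minimal sketch of how top-k accuracy is typically computed. The function name and the toy cases are hypothetical and have no connection to the TORCH code or data.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of cases whose true label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Made-up example: each inner list is a model's ranking of candidate origins,
# most likely first.
ranked = [
    ["lung", "breast", "ovary"],
    ["stomach", "colon", "lung"],
    ["ovary", "breast", "colon"],
    ["breast", "lung", "stomach"],
]
truth = ["lung", "colon", "colon", "ovary"]

top1 = top_k_accuracy(ranked, truth, 1)
top3 = top_k_accuracy(ranked, truth, 3)
```

Top-3 accuracy can never be lower than top-1 accuracy, since a case counted as a top-1 hit is also a top-3 hit.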

Daily Digest | April 17, 2024

Foundation model for cancer imaging biomarkers | Nature Machine Intelligence

Foundation models in deep learning are characterized by a single large-scale model trained on vast amounts of data serving as the foundation for various downstream tasks. Foundation models are generally trained using self-supervised learning and excel in reducing the demand for training samples in downstream applications. This is especially important in medicine, where large labelled datasets are often scarce. Here, researchers developed a foundation model for cancer imaging biomarker discovery by training a convolutional encoder through self-supervised learning using a comprehensive dataset of 11,467 radiographic lesions. The foundation model was evaluated in distinct and clinically relevant applications of cancer imaging-based biomarkers.

PIFiA: self-supervised approach for protein functional annotation from single-cell imaging data | Molecular Systems Biology

Fluorescence microscopy data describe protein localization patterns at single-cell resolution and have the potential to reveal whole-proteome functional information with remarkable precision. Yet, extracting biologically meaningful representations from cell micrographs remains a major challenge. Existing approaches often fail to learn robust and noise-invariant features or rely on supervised labels for accurate annotations. Researchers developed PIFiA (Protein Image-based Functional Annotation), a self-supervised approach for protein functional annotation from single-cell imaging data.

Cell type signatures in cell-free DNA fragmentation profiles reveal disease biology | Nature Communications

Circulating cell-free DNA (cfDNA) fragments have characteristics that are specific to the cell types that release them. Current methods for cfDNA deconvolution typically use disease-tailored marker selection in a limited number of bulk tissues or cell lines. Here, researchers utilize single-cell transcriptome data as a comprehensive cellular reference set for disease-agnostic cfDNA cell-of-origin analysis. They correlate cfDNA-inferred nucleosome spacing with gene expression to rank the relative contribution of over 490 cell types to plasma cfDNA. In 744 healthy individuals and patients, they uncover cell type signatures in support of emerging disease paradigms in oncology and prenatal care. They train predictive models that can differentiate patients with colorectal cancer (84.7%), early-stage breast cancer (90.1%), multiple myeloma (AUC 95.0%), and preeclampsia (88.3%) from matched controls.
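The correlate-then-rank idea can be illustrated with a small sketch. This is not the authors' pipeline: the per-gene values, the cell type names, the choice of Pearson correlation, and the convention that a more negative correlation means a higher rank are all assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def rank_cell_types(spacing, expression_by_cell_type):
    """Order cell types from most negative to most positive correlation.

    Assumption of this sketch: a stronger negative correlation between
    per-gene nucleosome spacing and expression ranks a cell type higher.
    """
    scores = {
        cell_type: pearson(spacing, expression)
        for cell_type, expression in expression_by_cell_type.items()
    }
    return sorted(scores, key=scores.get), scores


# Hypothetical per-gene values for four genes.
spacing = [190, 195, 200, 205]
expression = {
    "cell_type_A": [8, 6, 4, 2],
    "cell_type_B": [1, 5, 2, 4],
    "cell_type_C": [2, 4, 6, 8],
}

ranking, scores = rank_cell_types(spacing, expression)
```

In this toy input, cell_type_A is perfectly anti-correlated with spacing and therefore ranks first under the stated convention.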

Daily Digest | April 16, 2024

scGHOST: identifying single-cell 3D genome subcompartments | Nature Methods

Single-cell Hi-C (scHi-C) technologies allow for probing of genome-wide cell-to-cell variability in three-dimensional (3D) genome organization from individual cells. Here researchers present scGHOST, a single-cell subcompartment annotation method using graph embedding with constrained random walk sampling. Applications of scGHOST to scHi-C data and contact maps derived from single-cell 3D genome imaging demonstrate reliable identification of single-cell subcompartments, offering insights into cell-to-cell variability of nuclear subcompartments.

BISCUIT: an efficient, standards-compliant tool suite for simultaneous genetic and epigenetic inference in bulk and single-cell studies | Nucleic Acids Research

Data from both bulk and single-cell whole-genome DNA methylation experiments are under-utilized in many ways. This is attributable to inefficient mapping of methylation sequencing reads, routinely discarded genetic information, and neglected read-level epigenetic and genetic linkage information. Researchers introduce the BISulfite-seq Command line User Interface Toolkit (BISCUIT) and its companion R/Bioconductor package, biscuiteer, for simultaneous extraction of genetic and epigenetic information from bulk and single-cell DNA methylation sequencing.

PLMSearch: Protein language model powers accurate and fast sequence search for remote homology | Nature Communications

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here researchers propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains the similarity prediction model with a large number of real structure similarities. This enables PLMSearch to capture the remote homology information concealed behind the sequences.

Daily Digest | April 15, 2024

Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data | Nature Biotechnology

Existing methods for gene regulatory network (GRN) inference rely on gene expression data alone or on lower resolution bulk data. Despite the recent integration of chromatin accessibility and RNA sequencing data, learning complex mechanisms from limited independent data points still presents a daunting challenge. Here researchers present LINGER (Lifelong neural network for gene regulation), a machine-learning method to infer GRNs from single-cell paired gene expression and chromatin accessibility data. LINGER incorporates atlas-scale external bulk data across diverse cellular contexts and prior knowledge of transcription factor motifs as a manifold regularization.

Comprehensive transcriptome analysis reveals altered mRNA splicing and post-transcriptional changes in the aged mouse brain | Nucleic Acids Research

A comprehensive understanding of molecular changes during brain aging is essential to mitigate cognitive decline and delay neurodegenerative diseases. The interpretation of mRNA alterations during brain aging is influenced by the health and age of the animal cohorts studied. Here, researchers carefully consider these factors and provide an in-depth investigation of mRNA splicing and dynamics in the aging mouse brain, combining short- and long-read sequencing technologies with extensive bioinformatic analyses.

Accurately clustering biological sequences in linear time by relatedness sorting | Nature Communications

Clustering biological sequences into similar groups is an increasingly important task as the number of available sequences continues to grow exponentially. Search-based approaches to clustering scale super-linearly with the number of input sequences, making it impractical to cluster very large sets of sequences. Approaches to clustering sequences in linear time currently lack the accuracy of super-linear approaches. Here, the author sets out to develop and characterize a strategy for clustering with linear time complexity that retains the accuracy of less scalable approaches. The resulting algorithm, named Clusterize, sorts sequences by relatedness to linearize the clustering problem.

Daily Digest | April 14, 2024

Identification of clinical disease trajectories in neurodegenerative disorders with natural language processing | Nature Medicine

Neurodegenerative disorders exhibit considerable clinical heterogeneity and are frequently misdiagnosed. This heterogeneity is often neglected and difficult to study. Therefore, innovative data-driven approaches utilizing substantial autopsy cohorts are needed to address this complexity and improve diagnosis, prognosis and fundamental research. Researchers present clinical disease trajectories from 3,042 Netherlands Brain Bank donors, encompassing 84 neuropsychiatric signs and symptoms identified through natural language processing. This unique resource provides valuable new insights into neurodegenerative disorder symptomatology.

AI-guided pipeline for protein–protein interaction drug discovery identifies a SARS-CoV-2 inhibitor | Molecular Systems Biology

Protein–protein interactions (PPIs) offer great opportunities to expand the druggable proteome and therapeutically tackle various diseases, but remain challenging targets for drug discovery. Here, researchers provide a comprehensive pipeline that combines experimental and computational tools to identify and validate PPI targets and perform early-stage drug discovery.

Leveraging large language models for predictive chemistry | Nature Machine Intelligence

Machine learning has transformed many fields and has recently found applications in chemistry and materials science. The small datasets commonly found in chemistry sparked the development of sophisticated machine learning approaches that incorporate chemical knowledge for each application and, therefore, require specialized expertise to develop. Here researchers show that GPT-3, a large language model trained on vast amounts of text extracted from the Internet, can easily be adapted to solve various tasks in chemistry and materials science by fine-tuning it to answer chemical questions in natural language with the correct answer.

Daily Digest | April 13, 2024

Unsupervised ensemble-based phenotyping enhances discoverability of genes related to left-ventricular morphology | Nature Machine Intelligence

Recent genome-wide association studies have successfully identified associations between genetic variants and simple cardiac morphological parameters derived from cardiac magnetic resonance images. However, the emergence of large databases that include genetic data linked to cardiac magnetic resonance facilitates the investigation of more nuanced patterns of cardiac shape variability than those studied so far. Here researchers propose a framework for gene discovery coined unsupervised phenotype ensembles. The unsupervised phenotype ensemble builds a redundant yet highly expressive representation by pooling a set of phenotypes learnt in an unsupervised manner, using deep learning models trained with different hyperparameters. These phenotypes are then analysed via genome-wide association studies, retaining only highly confident and stable associations across the ensemble. They applied this approach to the UK Biobank database to extract geometric features of the left ventricle from image-derived three-dimensional meshes.

A comparison of methods for detecting DNA methylation from long-read sequencing of human genomes | Genome Biology

Long-read sequencing can enable the detection of base modifications, such as CpG methylation, in single molecules of DNA. In this study, researchers systematically compare the performance of CpG methylation detection from long-read sequencing. They demonstrate that CpG methylation detection from 7179 nanopore-sequenced DNA samples is highly accurate and consistent with 132 oxidative bisulfite-sequenced (oxBS) samples, isolated from the same blood draws. They introduce quality filters for CpGs that further enhance the accuracy of CpG methylation detection from nanopore-sequenced DNA, while removing at most 30% of CpGs. This study provides the first systematic comparison of CpG methylation detection tools for long-read sequencing methods.

AlphaPept: a modern and open framework for MS-based proteomics | Nature Communications

In common with other omics technologies, mass spectrometry (MS)-based proteomics produces ever-increasing amounts of raw data, making efficient analysis a principal challenge. A plethora of different computational tools can process the MS data to derive peptide and protein identification and quantification. However, in recent years there has been dramatic progress in computer science, including collaboration tools that have transformed research and industry. To leverage these advances, researchers develop AlphaPept, a Python-based open-source framework for efficient processing of large high-resolution MS data sets. Using Numba for just-in-time compilation on CPU and GPU, it achieves hundred-fold speed improvements. AlphaPept uses the Python scientific stack of highly optimized packages, reducing the code base to domain-specific tasks while accessing the latest advances.

Daily Digest | April 12, 2024

Deciphering cell types by integrating scATAC-seq data with genome sequences | Nature Computational Science

The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell peak matrix without fully utilizing the underlying genomic sequence. Here researchers propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC data.

Genomic language model predicts protein co-regulation and function | Nature Communications

Deciphering the relationship between a gene and its genomic context is fundamental to understanding and engineering biological systems. Machine learning has shown promise in learning latent relationships underlying the sequence-structure-function paradigm from massive protein sequence datasets. However, to date, limited attempts have been made in extending this continuum to include higher order genomic context information. Evolutionary processes dictate the specificity of genomic contexts in which a gene is found across phylogenetic distances, and these emergent genomic patterns can be leveraged to uncover functional relationships between gene products. Here, researchers train a genomic language model (gLM) on millions of metagenomic scaffolds to learn the latent functional and regulatory relationships between genes. gLM learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and encode biologically meaningful and functionally relevant information (e.g. enzymatic function, taxonomy). Their analysis of the attention patterns demonstrates that gLM is learning co-regulated functional modules (i.e. operons).

Spike sorting with Kilosort4 | Nature Methods

Spike sorting is the computational process of extracting the firing times of single neurons from recordings of local electrical fields. This is an important but hard problem in neuroscience, made complicated by the nonstationarity of the recordings and the dense overlap in electrical fields between nearby neurons. To address the spike-sorting problem, researchers have been openly developing the Kilosort framework. Here they describe the various algorithmic steps introduced in different versions of Kilosort. They also report the development of Kilosort4, a version with substantially improved performance due to clustering algorithms inspired by graph-based approaches.

Daily Digest | April 11, 2024

Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles | Nature Genetics

Translating genome-wide association study (GWAS) loci into causal variants and genes requires accurate cell-type-specific enhancer–gene maps from disease-relevant tissues. Building enhancer–gene maps is essential but challenging with current experimental methods in primary human tissues. Here researchers developed a nonparametric statistical method, SCENT (single-cell enhancer target gene mapping), that models association between enhancer chromatin accessibility and gene expression in single-cell or nucleus multimodal RNA sequencing and ATAC sequencing data.

PMF-GRN: a variational inference approach to single-cell gene regulatory network inference using probabilistic matrix factorization | Genome Biology

Inferring gene regulatory networks (GRNs) from single-cell data is challenging due to heuristic limitations. Existing methods also lack estimates of uncertainty. Here researchers present Probabilistic Matrix Factorization for Gene Regulatory Network Inference (PMF-GRN). Using single-cell expression data, PMF-GRN infers latent factors capturing transcription factor activity and regulatory relationships. Using variational inference allows hyperparameter search for principled model selection and direct comparison to other generative models.

SimuCell3D: three-dimensional simulation of tissue mechanics with cell polarization | Nature Computational Science

The three-dimensional (3D) organization of cells determines tissue function and integrity, and changes markedly in development and disease. Cell-based simulations have long been used to define the underlying mechanical principles. However, high computational costs have so far limited simulations to either simplified cell geometries or small tissue patches. Here, researchers present SimuCell3D, an efficient open-source program to simulate large tissues in three dimensions with subcellular resolution, growth, proliferation, extracellular matrix, fluid cavities, nuclei and non-uniform mechanical properties, as found in polarized epithelia. Spheroids, vesicles, sheets, tubes and other tissue geometries can readily be imported from microscopy images and simulated to infer biomechanical parameters.