Daily Digest | April 9, 2024

A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions | Nature Machine Intelligence

The 5′ untranslated region (UTR), a regulatory region at the beginning of a messenger RNA (mRNA) molecule, plays a crucial role in regulating the translation process and affects the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, researchers introduce a language model for 5′ UTR, which they refer to as the UTR-LM. The UTR-LM is pretrained on endogenous 5′ UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy. They fine-tuned the UTR-LM in a variety of downstream tasks.

Research paper

 

Pianno: a probabilistic framework automating semantic annotation for spatial transcriptomics | Nature Communications

Spatial transcriptomics has revolutionized the study of gene expression within tissues, while preserving spatial context. However, annotating spatial spots’ biological identity remains a challenge. To tackle this, researchers introduce Pianno, a Bayesian framework automating structural semantics annotation based on marker genes.

Research paper

 

Pleiotropy, epistasis and the genetic architecture of quantitative traits | Nature Reviews Genetics

Pleiotropy (whereby one genetic polymorphism affects multiple traits) and epistasis (whereby non-linear interactions between genetic polymorphisms affect the same trait) are fundamental aspects of the genetic architecture of quantitative traits. Recent advances in the ability to characterize the effects of polymorphic variants on molecular and organismal phenotypes in human and model organism populations have revealed the prevalence of pleiotropy and unexpected shared molecular genetic bases among quantitative traits, including diseases. By contrast, epistasis is common between polymorphic loci associated with quantitative traits in model organisms, such that alleles at one locus have different effects in different genetic backgrounds, but is rarely observed for human quantitative traits and common diseases. Here, the authors review the concepts and recent inferences about pleiotropy and epistasis, and discuss factors that contribute to similarities and differences between the genetic architecture of quantitative traits in model organisms and humans.

Research paper
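
The definition of epistasis in the summary above (a non-linear interaction between polymorphisms affecting the same trait) can be made concrete with a two-locus worked example. This is a minimal sketch and is not taken from the review: the function name and the trait values are hypothetical. It measures how far the double-variant phenotype departs from the sum of the two single-variant effects.

```python
def epistasis(y00, y10, y01, y11):
    """Interaction term for two biallelic loci.

    y00: mean trait value with neither variant
    y10: mean trait value with the variant at locus 1 only
    y01: mean trait value with the variant at locus 2 only
    y11: mean trait value with both variants
    Returns 0 when the two effects are purely additive.
    """
    effect_1 = y10 - y00          # effect of locus 1 on the reference background
    effect_2 = y01 - y00          # effect of locus 2 on the reference background
    additive_expectation = y00 + effect_1 + effect_2
    return y11 - additive_expectation


# Hypothetical trait means, chosen only for illustration.
additive_case = epistasis(10, 12, 13, 15)     # effects simply add up
interacting_case = epistasis(10, 12, 13, 19)  # double variant overshoots
print(additive_case, interacting_case)
```

A non-zero value means the effect of an allele at one locus depends on the genotype at the other, which is the background dependence the summary describes for model organisms.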

 

Daily Digest | April 8, 2024

Gene trajectory inference for single-cell data by optimal transport metrics | Nature Biotechnology

Single-cell RNA sequencing has been widely used to investigate cell state transitions and gene dynamics of biological processes. Current strategies to infer the sequential dynamics of genes in a process typically rely on constructing cell pseudotime through cell trajectory inference. However, the presence of concurrent gene processes in the same group of cells and technical noise can obscure the true progression of the processes studied. To address this challenge, researchers present GeneTrajectory, an approach that identifies trajectories of genes rather than trajectories of cells. Specifically, optimal transport distances are calculated between gene distributions across the cell–cell graph to extract gene programs and define their gene pseudotemporal order.

Research paper
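
The optimal transport idea in the GeneTrajectory summary above can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not the GeneTrajectory implementation: the cell-cell graph is assumed to be a simple path (cells in a line with unit spacing), the gene names and expression values are hypothetical, and genes are ordered by their distance from a chosen starting gene.

```python
from itertools import accumulate


def normalize(expr):
    """Turn non-negative expression values into a distribution over cells."""
    total = sum(expr)
    if total <= 0:
        raise ValueError("gene has no expression in any cell")
    return [x / total for x in expr]


def wasserstein_on_path(expr_a, expr_b):
    """Wasserstein-1 distance between two gene distributions.

    Assumption: cells lie on a path graph with unit edge lengths, so the
    distance equals the summed absolute difference of the two CDFs.
    """
    if len(expr_a) != len(expr_b):
        raise ValueError("both genes must be measured on the same cells")
    cdf_a = accumulate(normalize(expr_a))
    cdf_b = accumulate(normalize(expr_b))
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


def order_genes(genes, start):
    """Order gene names by transport distance from the starting gene."""
    return sorted(genes, key=lambda g: wasserstein_on_path(genes[start], genes[g]))


# Hypothetical expression of three genes across five ordered cells.
genes = {
    "late": [0, 0, 0, 3, 5],
    "early": [5, 3, 0, 0, 0],
    "mid": [0, 2, 6, 2, 0],
}
print(order_genes(genes, "early"))
```

On a general cell-cell graph the cost would come from graph distances between cells and the transport problem would need a proper solver; the path-graph shortcut is used here only to keep the example self-contained.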

 

Species-aware DNA language models capture regulatory elements and their evolution | Genome Biology

The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. Here, researchers introduce species-aware DNA language models, which they trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, they show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow.

Research paper

 

Tandem mass spectrum prediction for small molecules using graph transformers | Nature Machine Intelligence

Tandem mass spectra capture fragmentation patterns that provide key structural information about molecules. Although mass spectrometry is applied in many areas, the vast majority of small molecules lack experimental reference spectra. For over 70 years, spectrum prediction has remained a key challenge in the field. Existing deep learning methods do not leverage global structure in the molecule, potentially resulting in difficulties when generalizing to new data. In this work researchers propose the MassFormer model for accurately predicting tandem mass spectra.

Research paper

 

Daily Digest | June 24, 2023

The airway microbiome mediates the interaction between environmental exposure and respiratory health in humans | Nature Medicine

Exposure to environmental pollution influences respiratory health. The role of the airway microbial ecosystem underlying the interaction of exposure and respiratory health remains unclear. Here, through a province-wide chronic obstructive pulmonary disease surveillance program, researchers conducted a population-based survey of bacterial (n = 1,651) and fungal (n = 719) taxa and metagenomes (n = 1,128) from induced sputum of 1,651 household members in Guangdong, China. They found that cigarette smoking and higher PM2.5 concentration were associated with lung function impairment through the mediation of bacterial and fungal communities, respectively, and that exposure was associated with an enhanced inter-kingdom microbial interaction resembling the pattern seen in chronic obstructive pulmonary disease.

Research paper

 

MetaNovo: An open-source pipeline for probabilistic peptide discovery in complex metaproteomic datasets | PLOS Computational Biology

Microbiome research is providing important new insights into the metabolic interactions of complex microbial ecosystems involved in fields as diverse as the pathogenesis of human diseases, agriculture and climate change. Here researchers describe a novel approach, MetaNovo, that combines existing open-source software tools to perform scalable de novo sequence tag matching with a novel algorithm for probabilistic optimization of the entire UniProt knowledgebase. The result is a tailored sequence database for target-decoy searches directly at the proteome level, which enables metaproteomic analyses without prior expectation of sample composition or metagenomic data generation and is compatible with standard downstream analysis pipelines.

Research paper

 

A super-resolution strategy for mass spectrometry imaging via transfer learning | Nature Machine Intelligence

High-spatial-resolution mass spectrometry imaging (HSR-MSI) provides precise spatial information on thousands of biomolecules without labelling across a tissue section. Here researchers develop a deep learning framework based on transfer learning called MSI from optical super-resolution (MOSR) that substantially reduces the requirement for sample size. Needing only ten HSR-MSI images, the method transfers knowledge learned from abundant optical images (~15,000) to MSI tasks.

Research paper

 

Daily Digest | August 11, 2021

Remote smartphone monitoring of Parkinson’s disease and individual response to therapy | Nature Biotechnology

Remote health assessments that gather real-world data (RWD) outside clinic settings require a clear understanding of appropriate methods for data collection, quality assessment, analysis and interpretation. Here researchers examine the performance and limitations of smartphones in collecting RWD in the remote mPower observational study of Parkinson’s disease (PD). Although remote assessment requires careful consideration for accurate interpretation of RWD, their results support the use of smartphones and wearables in objective and personalized disease assessments.

Research paper

 

Exploring tissue architecture using spatial transcriptomics | Nature

Deciphering the principles and mechanisms by which gene activity orchestrates complex cellular arrangements in multicellular organisms has far-reaching implications for research in the life sciences. Here the authors review spatial transcriptomic technologies and describe the repertoire of operations available for paths of analysis of the resulting data. Spatial transcriptomics can also be deployed for hypothesis testing using experimental designs that compare time points or conditions—including genetic or environmental perturbations. Finally, spatial transcriptomic data are naturally amenable to integration with other data modalities, providing an expandable framework for insight into tissue organization.

Research paper

 

Radiological tumour classification across imaging modality and histology | Nature Machine Intelligence

Radiomics refers to the high-throughput extraction of quantitative features from radiological scans and is widely used to search for imaging biomarkers for the prediction of clinical outcomes. Here, researchers propose novel radiological features that are specially designed to ensure compatibility across diverse tissues and imaging contrast. These features provide systematic characterization of tumour morphology and spatial heterogeneity.

Research paper

 

Daily Digest | July 26, 2021

Advancing the use of genome-wide association studies for drug repurposing | Nature Reviews Genetics

Genome-wide association studies (GWAS) have revealed important biological insights into complex diseases, which are broadly expected to lead to the identification of new drug targets and opportunities for treatment. In this Review, the authors explore approaches that leverage common variant genetics to identify opportunities for repurposing existing drugs, also known as drug repositioning.

Research paper

 

AnnotSV and knotAnnotSV: a web server for human structural variations annotations, ranking and analysis | Nucleic Acids Research

With the dramatic increase of pangenomic analysis, human geneticists have generated large amounts of genomic data, including millions of small variants (SNV/indel) but also thousands of structural variations (SV), mainly from next-generation sequencing and array-based techniques. To help identify human pathogenic SV, researchers have developed a web server dedicated to their annotation and ranking (AnnotSV) as well as their visualization and interpretation (knotAnnotSV).

Research paper

 

Modeling drug combination effects via latent tensor reconstruction | ISMB/ECCB 2021

Combination therapies have emerged as a powerful treatment modality to overcome drug resistance and improve treatment efficacy. Researchers introduce comboLTR, a highly time-efficient method for learning complex, non-linear target functions that describe the responses of therapeutic agent combinations at various doses and in various cancer cell contexts. The method is based on polynomial regression via latent tensor reconstruction.

Research paper

 

Daily Digest | December 9, 2020

Insights into the genetic architecture of the human face | Nature Genetics

The human face is complex and multipartite, and characterization of its genetic architecture remains challenging. Using a multivariate genome-wide association study meta-analysis of 8,246 European individuals, researchers identified 203 genome-wide-significant signals (120 also study-wide significant) associated with normal-range facial variation.

Research paper

 

Inference of mutability landscapes of tumors from single cell sequencing data | PLOS Computational Biology

One of the hallmarks of cancer is the extremely high mutability and genetic instability of tumor cells. Inherent heterogeneity of intra-tumor populations manifests itself in high variability of clone instability rates. MULAN (MUtability LANdscape inference) is a maximum-likelihood computational framework for inference of mutation rates of individual cancer subclones using single-cell sequencing data.

Research paper

 

nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation | Nature Methods

Biomedical imaging is a driver of scientific discovery and a core component of medical care and is being stimulated by the field of deep learning. Researchers developed nnU-Net, a deep learning-based segmentation method that automatically configures itself, including preprocessing, network architecture, training and post-processing for any new task. Without manual intervention, nnU-Net surpasses most existing approaches, including highly specialized solutions on 23 public datasets used in international biomedical segmentation competitions.

Research paper

 

Daily Digest | October 12, 2019

A compendium of promoter-centered long-range chromatin interactions in the human genome | Nature Genetics

A large number of putative cis-regulatory sequences have been annotated in the human genome, but the genes they control remain poorly defined. To bridge this gap, researchers generate maps of long-range chromatin interactions centered on 18,943 well-annotated promoters for protein-coding genes in 27 human cell/tissue types. They use this information to infer the target genes of 70,329 candidate regulatory elements and suggest potential regulatory function for 27,325 noncoding sequence variants associated with 2,117 physiological traits and diseases.

Research paper

 

Reproducible biomedical benchmarking in the cloud: lessons from crowd-sourced data challenges | Genome Biology

The authors review recent data challenges with innovative approaches to model reproducibility and data sharing, and outline key lessons for improving quantitative biomedical data analysis through crowd-sourced benchmarking challenges.

Research paper

 

Many cancer drugs aim at the wrong molecular targets | Nature

Many experimental cancer drugs might be succeeding in unintended ways, finds a study that used CRISPR–Cas9 gene editing to investigate how such drugs interact with malignant cells. An analysis of ten drugs — including seven now in clinical trials — found that the proteins they target are not crucial for the survival of cancer cells. The results could help to explain why many cancer drugs fail in clinical trials.

News article

 

Daily Digest | October 1, 2019

Protein interaction networks revealed by proteome coevolution | Science

Biological function is driven by interaction between proteins. Researchers predict protein interfaces by identifying coevolving residues in aligned protein sequences. The approach predicts 1,618 protein interactions in Escherichia coli, 682 of which were unanticipated, and 911 interacting pairs in Mycobacterium tuberculosis, most of which had not been previously described.

Research paper
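
The idea of detecting coevolving residues in aligned sequences, mentioned in the summary above, can be sketched with mutual information between alignment columns. This is an illustrative stand-in chosen for simplicity: the summary does not say which statistic the paper uses, and the toy alignment and function names below are hypothetical.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Extract column i (one residue per sequence) from an alignment."""
    return [seq[i] for seq in alignment]


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must come from the same set of sequences")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_indep = (count_x[x] / n) * (count_y[y] / n)
        mi += p_joint * log2(p_joint / p_indep)
    return mi


# Hypothetical alignment: one row per species. Columns 0 and 2 change
# together (A with D, K with E); column 1 never changes.
alignment = ["AGD", "AGD", "KGE", "KGE"]

covarying = mutual_information(column(alignment, 0), column(alignment, 2))
conserved = mutual_information(column(alignment, 0), column(alignment, 1))
print(covarying, conserved)
```

High mutual information between a column from one protein and a column from another is a hint that the two positions change together across species. Raw mutual information is also inflated by shared ancestry and indirect correlations, which is one reason real analyses rely on more elaborate statistics.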

 

Phosphoproteome Analysis Reveals Estrogen-ER pathway as a modulator of mTOR activity via DEPTOR | Molecular & Cellular Proteomics

ER-positive breast tumors represent approximately 70% of all breast cancer cases. Researchers performed a phosphoproteome analysis of ER-positive MCF7 breast cancer cells treated with estrogen or estrogen and the mTORC1 inhibitor rapamycin. They demonstrated that DEPTOR accumulation is the result of estrogen-ERα-mediated transcriptional upregulation of DEPTOR expression. Consequently, the elevated levels of DEPTOR partially counterbalance the estrogen-induced activation of mTORC1 and mTORC2.

Research paper

 

CNEr: A toolkit for exploring extreme noncoding conservation | PLOS Computational Biology

Conserved Noncoding Elements (CNEs) are elements exhibiting extreme noncoding conservation in Metazoan genomes. They cluster around developmental genes and act as long-range enhancers, yet nothing that we know about their function explains the observed conservation levels. Researchers developed CNEr, an R/Bioconductor toolkit for large-scale identification of CNEs and for studying their genomic properties.

Research paper

 

Daily Digest | September 16, 2019

A comparison of automatic cell identification methods for single-cell RNA sequencing data | Genome Biology

Researchers benchmarked 22 classification methods that automatically assign cell identities, including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. The general-purpose support vector machine classifier has the best overall performance across the different experiments.

Research paper

 

Predicting the genetic ancestry of 2.6 million New York City patients using clinical data | bioRxiv

Researchers present a novel algorithm for predicting genetic ancestry using only variables that are routinely captured in electronic health records (EHRs), such as self-reported race and ethnicity, and condition billing codes. Using patients who have both genetic and clinical information at Columbia University / New York-Presbyterian Irving Medical Center, they developed a pipeline that uses only clinical data to predict the genetic ancestry of all patients, more than 80% of whom identify as other or unknown.

Research paper

 

The elements of algorithms | Chemistry World

Since Dmitri Mendeleev (and others) first sketched out the periodic relationships between the elements in the 1860s, it has been estimated that around a thousand different tables have appeared in print. But two recent papers have shown that it is now possible to use machine learning to rediscover the table empirically, from the way it is implicitly embedded within the milieu of chemistry.

Original article

 

Daily Digest | September 15, 2019

The Perfect Milk Machine: How Big Data Transformed the Dairy Industry | The Atlantic

Dairy scientists are the Gregor Mendels of the genomics age, developing new methods for understanding the link between genes and living things, all while quadrupling the average cow’s milk production since your parents were born.

Original article

 

Genome architecture and stability in the Saccharomyces cerevisiae knockout collection | Nature

The completion of the yeast Saccharomyces cerevisiae gene-knockout collection (YKOC) has enabled high-throughput reverse genetics, phenotypic screenings and analyses of synthetic-genetic interactions. Ensuing experimental work has also highlighted some inconsistencies and mistakes in the YKOC, as well as genome instability events that rebalance the effects of specific knockouts, but a complete overview of these is lacking. Researchers sequenced the whole genomes of nearly all of the 4,732 strains comprising the homozygous diploid YKOC. By extracting information on copy-number variation of tandem and interspersed repetitive DNA elements, they describe the genomic alterations that are induced by the loss of each gene.

Research paper

 

Recursive Sketches for Modular Deep Learning | Google AI Blog

In “Recursive Sketches for Modular Deep Learning”, recently presented at ICML 2019, researchers explore how to succinctly summarize how a machine learning model understands its input. They do this by augmenting an existing (already trained) machine learning model with “sketches” of its computation, using them to efficiently answer memory-based questions—for example, image-to-image-similarity and summary statistics—despite the fact that they take up much less memory than storing the entire original computation.

Blog post | Research paper