NHGRI/DIR Bioinformatics and Scientific Programming Core

$5,402,925 | ZIC | FY2022 | HG | NIH

National Human Genome Research Institute

Investigators

Abstract

The Bioinformatics and Scientific Programming Core actively supports the research being performed by NHGRI/DIR investigators by providing expertise and assistance in scientific programming and computational analysis. The Core facilitates access to specialized software and hardware, develops generalized software solutions that can address a variety of questions in genomic research, develops database solutions for the efficient archiving and retrieval of experimental and clinical data, disseminates new software and database solutions to the genome community at large, collaborates with DIR researchers on computationally intensive projects, and provides educational opportunities in bioinformatics to trainees. Support for projects includes not only data analysis but also related efforts focused on data collection through the public DIR research web site, located at https://research.nhgri.nih.gov. Additional information can be found on the Core's web site, at https://dir.nhgri.nih.gov/nhgri_cores/BSPC. Responding to the continued demand for variant calling on human genome and exome data, the Core has maintained and updated a GATK-based pipeline that builds upon best practices published by the Broad Institute. This standardized and validated pipeline is currently being used in The Genome Ascertainment Consortium (TGAC) effort led by Dr. Leslie Biesecker; the goals of this effort are to improve our overall understanding of the phenotypic consequences of genetic variation and to predict phenotypes from genotypes. To that end, the pipeline has facilitated the creation of a uniformly processed and formatted genotype callset across multiple cohorts, based on data from multiple sources. The dataset currently includes 2,000 exomes from NIAID and the ClinSeq cohort, as well as 4,600 genomes from the INOVA Translational Medicine Institute.
We have been processing 350 genomes from INOVA, 4,800 genomes from NIEHS, and 2,300 exomes from NIAID; we plan to have these data incorporated into the dataset this fall. The GATK pipeline continues to be optimized to take advantage of the Biowulf high-performance computing environment, parallelizing the per-sample processing steps and making use of local SSD storage on nodes (as available) to increase speed and reduce network overhead. Going forward, this efficient pipeline will allow for the re-calling of data from this growing cohort of individuals who have agreed to be re-contacted for secondary phenotyping studies, with the increased sample sizes affording greater power to discover important phenotype/genotype associations. Alongside this effort, the Core has developed an interactive browser for visualizing aggregate exome and genome data from the aforementioned TGAC cohorts, using the gnomAD codebase as its foundation. The Core has also focused on developing an in-house pipeline for calling mosaic and cancer somatic variants. This pipeline reuses the initial alignment stages of the existing germline pipeline, but its variant calling step uses SAMtools mpileup and VarScan to achieve a highly sensitive variant caller capable of detecting alleles present in only 5% of the reads in a sample. Key to the utility of this pipeline are detailed quality and read count statistics broken down by strand direction, allowing the scientific end-user to create a custom filtering strategy. For the FUSION Project, a long-term, international, and collaborative effort to identify genomic variants that predispose to type 2 diabetes, we have been investigating the genetic basis of disease risk through the use of single-cell and/or single-nucleus RNA-seq technology, with the goal of interrogating the transcriptomes of the individual cells that comprise the pancreatic islet.
Differences in the single-cell transcriptomic signatures and cell type composition of islets obtained from diabetic and non-diabetic patients are being assessed. Additional projects include: implementing a genome data portal for the preliminary annotation and analysis of the Hydra vulgaris AEP genome; annotating immunoglobulin superfamily genes in the histocompatibility complex of Hydractinia; variant calling aimed at mapping sex determination loci and producing fine-scale mapping data for the genomic region controlling histocompatibility in Hydractinia; implementing a gene prediction pipeline and genome data portal for the preliminary annotation and analysis of the Hydractinia genome; testing a pipeline to generate complete proteomes for 100 species with RNA-seq data in SRA and creating AniProtDB, a web resource providing access to the resulting proteomes and protein domains; analyzing single-cell RNA-seq and ATAC-seq data obtained from zebrafish sensory hair cells; implementing software to analyze data generated from the scSPRITE protocol, which allows for investigation of 3D genome arrangement at the single-cell level; developing a browser for the epaulette shark (Hemiscyllium ocellatum) genome; developing a web site for the Cohort Analytics Core and Reverse Phenotyping Core; updating the Skippy web server to comply with security regulations and to include additional complementary tools for splicing prediction; performing isoform expression profiling of pan-cancer datasets in TCGA; deducing methylation marks in parents caring for children with inherited metabolic disorders; performing single-cell RNA-seq analyses in peripheral blood from patients with mitochondrial disease; analyzing ATAC-seq and RNA-seq data from wild-type and pyruvate dehydrogenase-deficient effector and memory T cells to examine the chromatin and transcriptional landscape; analyzing ATAC-seq and RNA-seq data from iPSC-derived neurons to identify changes in chromatin accessibility and gene expression due to mitochondrial dysfunction; maintaining a customized database and web interface for storing and computing on genomic data from dogs; designing and implementing surveys that assess the health of pet dogs whose DNA samples have been submitted to scientific studies; performing RNA-seq analyses of post-mortem brain tissue to compare neuronal gene expression in youths with a history of ADHD against matched controls, in order to establish a neuronal transcriptome and determine the genes and neural gene networks that influence the development of ADHD; developing a web site to facilitate the collection of sensitive medical data about genetic conditions for classification and analysis using AI methodologies; identifying integration sites of AAV in mouse and human genomes and developing methods to characterize the clustering and locations of the integration sites; and analyzing RNA-seq data comparing mouse models of MMA with healthy controls.
The Core also supports Labmatrix, NHGRI's clinical research database. During the current reporting period, there were 147 active Labmatrix user accounts, with 22 new accounts added and associated users trained. In addition to its general use by NHGRI clinical investigators, Labmatrix is utilized for large-scale data and/or sample management for the Inherited Diseases and Caregiving Study, the Insights Microbiome/Sickle Cell Study, the ClinSeq Study, and GENE-FORECAST. NHGRI Labmatrix Support services include user training and help desk support, legacy data mapping, data validation and import, and barcoding implementation. Support staff routinely handle large datasets, import data from CRIS, develop complex queries, and generate custom data reports on behalf of database users.
Finally, recognizing the importance of having a degree of facility with computational approaches, the Core continues to offer a number of courses that cover various areas of the bioinformatic landscape, prov
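
As an illustration of the custom filtering strategy that the somatic pipeline's strand-level read counts make possible, the following is a minimal sketch and not the Core's actual implementation. Only the 5% allele-fraction figure comes from the abstract; the class and function names, the depth cutoff of 20 reads, and the requirement of at least one supporting read per strand are hypothetical choices made for the example. The sketch operates on counts already extracted from a caller's output and does not parse VarScan files.

```python
from dataclasses import dataclass


@dataclass
class StrandCounts:
    """Per-site read counts split by allele and strand (hypothetical layout)."""
    ref_fwd: int
    ref_rev: int
    alt_fwd: int
    alt_rev: int

    @property
    def depth(self) -> int:
        # Total reads covering the site, both alleles and both strands.
        return self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev

    @property
    def alt_fraction(self) -> float:
        # Fraction of reads supporting the variant allele.
        alt = self.alt_fwd + self.alt_rev
        return alt / self.depth if self.depth else 0.0


def passes_filter(
    counts: StrandCounts,
    min_alt_fraction: float = 0.05,  # 5% figure from the abstract
    min_depth: int = 20,             # assumed cutoff, for illustration only
    min_alt_per_strand: int = 1,     # assumed cutoff, for illustration only
) -> bool:
    """Keep a candidate variant only if it has enough coverage, enough
    variant-supporting reads, and support on both strands."""
    if counts.depth < min_depth:
        return False
    if counts.alt_fraction < min_alt_fraction:
        return False
    # Variant reads seen on only one strand are a common sign of artifacts.
    return min(counts.alt_fwd, counts.alt_rev) >= min_alt_per_strand


if __name__ == "__main__":
    candidates = {
        "balanced_low_fraction": StrandCounts(90, 90, 5, 5),
        "single_strand_only": StrandCounts(95, 95, 10, 0),
        "shallow_site": StrandCounts(5, 5, 1, 1),
    }
    for name, counts in candidates.items():
        print(name, passes_filter(counts))
```

An end-user would tune the three thresholds to the sequencing depth and error profile of their own data; stricter per-strand requirements trade sensitivity for fewer strand-biased false positives.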
