NHGRI/DIR Bioinformatics and Scientific Programming Core
National Human Genome Research Institute
Investigators
Abstract
The NHGRI Bioinformatics and Scientific Programming Core actively supports the research being performed by NHGRI/DIR investigators by providing expertise and assistance in bioinformatics and computational analysis. The Core facilitates access to specialized software and hardware, develops generalized software solutions that can address a variety of questions in genomic research, develops database solutions for the efficient archiving and retrieval of experimental and clinical data, disseminates new software and database solutions to the genome community at large, collaborates with NHGRI researchers on computationally intensive projects, and provides educational opportunities in bioinformatics to NHGRI investigators and trainees. The majority of engagements between the Bioinformatics and Scientific Programming Core and DIR investigators are focused on collaborative interactions intended to advance specific research projects. The support provided for these projects includes not only data analysis but also related efforts focused on data collection and dissemination through the public NHGRI/DIR Web site (http://research.nhgri.nih.gov). Responding to the continued demand for variant calling on human genome and exome data, the Core has maintained and updated a GATK-based pipeline that builds upon best practices published by the Broad Institute. This standardized and validated pipeline is currently being used in the context of The Genome Ascertainment Consortium (TGAC) effort, whose goals are to improve our overall understanding of the phenotypic consequences of genetic variation and to predict phenotypes from genotypes. To that end, this pipeline has facilitated the creation of a uniformly processed and formatted genotype callset across multiple cohorts, based on data from multiple sources. The dataset currently includes 2,000 exomes from NIAID and the ClinSeq cohort, as well as 4,600 genomes from the INOVA Translational Medicine Institute.
The GATK pipeline continues to be optimized to take advantage of the Biowulf high-performance computing environment, parallelizing the per-sample processing steps and making use of local SSD storage on nodes (as available) to increase speed and reduce network overhead. Going forward, this efficient pipeline will allow for the re-calling of data from this growing cohort of individuals who have agreed to be re-contacted for secondary phenotyping studies, with the increased sample sizes affording greater power to discover important phenotype/genotype associations. Alongside this effort, the Core has developed an interactive browser for visualizing aggregate exome and genome data from the aforementioned TGAC cohorts, using the gnomAD codebase as its foundation. The Core has also focused on developing an in-house pipeline for calling mosaic and cancer somatic variants. This pipeline leverages the initial alignment stages of the existing germline pipeline, but its variant calling step uses SAMtools mpileup and VarScan, yielding a highly sensitive caller capable of detecting alleles present in only 5% of reads in a sample. Key to the utility of this pipeline are detailed quality and read count statistics broken down by strand direction, allowing the scientific end-user to create a custom filtering strategy.
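As an illustration of how strand-level read counts can support a custom filtering strategy, the following minimal Python sketch keeps a candidate variant only if its alternate allele reaches a minimum fraction of reads and is seen on both strands. It is not the Core's pipeline code. The class and function names, the depth cutoff, and the per-strand cutoff are hypothetical; only the 5% allele fraction is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class StrandCounts:
    """Read counts for one candidate variant, split by allele and strand.

    Field names are hypothetical and do not mirror any specific caller's output.
    """
    ref_fwd: int
    ref_rev: int
    alt_fwd: int
    alt_rev: int

    @property
    def depth(self) -> int:
        # Total number of reads covering the site.
        return self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev

    @property
    def alt_fraction(self) -> float:
        # Fraction of reads supporting the alternate allele (0.0 if no coverage).
        if self.depth == 0:
            return 0.0
        return (self.alt_fwd + self.alt_rev) / self.depth


def passes_filter(
    counts: StrandCounts,
    min_alt_fraction: float = 0.05,  # 5% threshold mentioned in the text
    min_alt_per_strand: int = 2,     # assumed value, for illustration only
    min_depth: int = 40,             # assumed value, for illustration only
) -> bool:
    """Return True if the variant meets depth, allele-fraction, and strand criteria."""
    return (
        counts.depth >= min_depth
        and counts.alt_fraction >= min_alt_fraction
        and counts.alt_fwd >= min_alt_per_strand
        and counts.alt_rev >= min_alt_per_strand
    )


if __name__ == "__main__":
    candidates = {
        "both_strands": StrandCounts(90, 95, 6, 5),
        "one_strand_only": StrandCounts(95, 95, 10, 0),
        "too_rare": StrandCounts(100, 100, 2, 2),
    }
    for name, counts in candidates.items():
        print(name, round(counts.alt_fraction, 3), passes_filter(counts))
```

Requiring alternate-allele support on both strands is one common way to guard against strand-specific sequencing artifacts when calling low-fraction variants. An end-user could relax or tighten each threshold independently to suit a given study.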
Additional projects include annotation of samples from the TGAC cohorts with HLA genotypes and integration of results into the gnomAD browser; using genotype data to assess familial relationships among TGAC samples to identify relevant individuals for phenotypic follow-up; updating the TGAC variant browser to make it compliant with NIH security regulations and introduce new features of use to the research community; development of a website to return negative secondary findings to participants from the A2 ClinSeq cohort; analysis of single-cell RNA-seq and ATAC-seq data obtained from zebrafish sensory hair cells; assessing the feasibility of using single-cell RNA-seq to interrogate the transcriptomes of pancreatic islet cells obtained from post-autopsy tissue; investigating the genetic basis of type 2 diabetes disease risk through the use of single-cell and/or single-nuclei RNA-seq technology, with the goal of interrogating the transcriptomes of the individual cells that comprise the pancreatic islet; deducing differences in the single-cell transcriptomic signatures and cell type composition of islets obtained from diabetic and non-diabetic patients; updating the Skippy web server to comply with security regulations, as well as including additional complementary tools for splicing prediction; performing isoform expression profiling of pan-cancer datasets in TCGA; determining methylation marks in parents caring for children with inherited metabolic disorders; performing RNA-seq analyses in peripheral blood from patients with mitochondrial disease; analyzing ATAC-seq and RNA-seq data from effector and memory T-cells (from both wild type and pyruvate dehydrogenase deficient T-cells); analyzing ATAC-seq and RNA-seq data from iPSC-derived neurons to identify changes in chromatin accessibility and gene expression due to mitochondrial dysfunction; implementation of a gene prediction pipeline and genome data portal for the preliminary annotation and analysis of the Hydractinia genome; significant expansion of the Mnemiopsis Genome Project Portal to include in situ images, temporal developmental expression profiles, and single-cell expression data; developing a pipeline to generate complete proteomes for 100 species from unannotated RNA-seq data found in the NCBI Sequence Read Archive and developing AniProtDB, a web resource that makes these newly derived proteomes accessible to the scientific community; continuing maintenance of a customized database and web interface for storing and computing on genomic data from various canine species that can inform questions in human health; designing and implementing surveys that assess the health of pet dogs whose DNA samples have been submitted to scientific studies; performing RNA-seq analyses of post-mortem brain tissue to compare neuronal gene expression in youths with a history of ADHD against matched controls in order to establish a neuronal transcriptome and determine the genes and neural gene networks that influence the development of ADHD; and identifying integration sites of AAV in mouse and human genomes, developing new methods to characterize the clustering and locations of the integration sites.
The Core also supports Labmatrix, NHGRI's clinical research database. During the current reporting period, there were 125 active Labmatrix user accounts, with 19 new accounts added and associated users trained. In addition to its general use by NHGRI clinical investigators, Labmatrix is utilized for large-scale data and/or sample management for the Inherited Diseases and Caregiving Study, the Insights Microbiome/Sickle Cell Study, the ClinSeq Study, and GENE-FORECAST. NHGRI Labmatrix Support services include user training and help desk support, legacy data mapping, data validation and import, and barcoding implementation. Support staff routinely handle large datasets, import data from CRIS, develop complex queries, and generate custom data reports on behalf of database users.
Finally, recognizing the importance of having a degree of facility with computational approaches, the Core continues to offer a number of courses that cover various areas of the bioinformatic landscape, providing DIR scientists with hands-on experience in analyzing the genomic data being generated in our laboratories.