BioGenomics2017 - Global Biodiversity Genomics Conference
February 21-23, 2017
Smithsonian National Museum of Natural History | Washington, D.C.

Program - Single Session



8
Bioinformatics I

Room: Salon 2, Marriott Hotel

10:50 - 12:40

Moderator: Keith Crandall, George Washington University



8.1  11:00  Supporting genomic analysis of diverse organisms using the Galaxy framework. Taylor J*, Johns Hopkins University

Genome-scale analysis requires combining a wide variety of complex analysis tools with significant compute and data analytic resources. This complexity makes performing these analyses in a collaborative context and communicating them transparently and reproducibly even more challenging. The Galaxy Project (http://galaxyproject.org), established in 2005, is mature and well suited to address these challenges. Galaxy is used worldwide and has enabled thousands of biomedical researchers to analyze high-throughput data in a transparent and reproducible manner. Designed from the outset as a flexible and extensible platform that supports the needs of all researchers, Galaxy has developed a vibrant user and developer community that has expanded its use into many areas of science (making available dozens of Galaxy instances and thousands of analysis tools across many domains). Galaxy serves as a collaboration platform enabling data analysis for numerous research groups and communities. This flexibility makes Galaxy particularly suited for building analysis and coordination platforms in new areas and for organisms that have not previously been subject to intensive genomic research. Here we will discuss the use of Galaxy for the analysis of diverse and non-model organisms, including genome assembly and annotation, comparative genomics and phylogenomics, and genome visualization.


8.2  11:20  Bioinformatic Pipelines for Mining Genomes and Transcriptomes to Understand Evolution and Ecology of Fishes. Hughes LC*, George Washington University

In just a few years, the genomic and transcriptomic resources for ray-finned fishes, the largest group of vertebrates, have exploded. We mine a database of 300 genomes and transcriptomes to gain insight into multiple questions in fish evolution and ecology. Here, I present a bioinformatics pipeline for optimizing loci for reconstruction of the Tree of Life for ray-finned fishes, a group affected by several rounds of genome duplication. We use the pipeline to filter for loci affected by two rounds of genome duplication at the base of the vertebrates, as well as a third round of genome duplication at the base of the Teleosts, which form the majority of ray-finned fishes. Ecological questions can also be addressed with this dataset: by bioinformatically mining reads that belong to microbes associated with fish gills, we can explore how these microbial communities change across salinity environments, which strongly structure both fish and microbial communities.


8.3  11:40  Using hybrid assembly approach with MaSuRCA to assemble challenging genomes. Zimin AV*, Johns Hopkins University/University of Maryland

Third-generation (PacBio) genome sequencing data have opened a large new realm of possibilities in de novo genome assembly thanks to long (10 kb+) read lengths and the absence of sequencing bias. However, the inherently high error rate of about 15% presents a challenge to using these data to assemble highly repetitive or heterozygous plant genomes. I will describe a hybrid technique that overcomes these assembly challenges by effectively combining the long third-generation PacBio reads with the short but accurate second-generation Illumina reads. We have successfully applied this technique to create complete and accurate assemblies of several challenging plant genomes, such as the ancestral wheat A. tauschii and the hexaploid wheat T. aestivum. The technique is implemented in the publicly available MaSuRCA assembler. You can learn about the assembler and download the code from http://masurca.blogspot.com.


8.4  12:00  Improving genome annotation strategies for biodiverse species using cloud technologies. Dikow RB*, Smithsonian Institution; Frandsen PB, Smithsonian Institution; Cruley D, Amazon Web Services; Davis D, Smithsonian Institution; Gupta S, Intel Corporation; Speirs S, Amazon Web Services; Stern BA, Smithsonian Institution; Taylor M, Intel Corporation; Burba D, Smithsonian Institution

With the dramatic increase in the number and taxonomic diversity of organisms for which genomes are available, the ability to efficiently perform downstream analyses to understand genome biology and pursue potential applications is becoming the barrier to advancing biodiversity science. Genome annotation is a problem ripe for improvement, particularly given recent advances in cloud technologies and infrastructure and the fact that annotation can be highly parallelized. Our work is based upon the premise that genome annotation should be a fast, repeatable process with a low barrier to entry for biodiversity researchers. We are motivated by the fact that, given decreasing sequencing costs, many researchers now have the ability to sequence genomes for their taxon of interest. However, they may not have access to high-performance computing systems or to experienced systems administrators who can install complicated packages with many dependencies. Here, we highlight our work with the Intel Corporation and Amazon Web Services to provide easily reusable and scalable cloud-enabled software solutions for genome annotation for organisms across the tree of life. We compare two strategies: (1) high-throughput implementation of existing annotation software in the cloud, and (2) the use of workflow engines, purpose-built to take advantage of cloud strengths in parallelization, to wrap around annotation component tools. The exploration of these two strategies will lead to faster and higher-quality annotations, which in turn will lead to essential biological insights about species.


8.5  12:20  Accelerating next-generation sequencing data analysis via large scale cloud-based system. Jin X*, BGI

Recent years have seen an explosion of next-generation sequencing data, driven by both population-level research projects (e.g., the 1000 Genomes Project) and clinical applications (e.g., non-invasive prenatal testing). Although the price of sequencing a human genome has dropped to hundreds of dollars, analyzing tens of thousands of genomes still requires more efficient solutions. We have built a large-scale cloud-based system named BGI Online to help researchers analyze, store, and share genomics data securely. Taking advantage of cloud computing and advanced bioinformatic technology, we have shown the potential to analyze 1000 exomes within 22 hours and 100 whole genomes within 17 hours. The system will accelerate next-generation sequencing data analysis and benefit both the research community and precision medicine practice.



