BioGenomics2017 - Global Biodiversity Genomics Conference
February 21-23, 2017
Smithsonian National Museum of Natural History | Washington, D.C.

Program - Single Session


Genome Sequencing Technologies II & Bioinformatics II

Room: Salon 2, Marriott Hotel

16:00 - 17:30

Moderator: Richard Durbin, Wellcome Sanger Institute

14.1  16:10  A high-quality genome of the Hawaiian monk seal assembled with synthetic long reads and optical mapping. Scott AF*, Johns Hopkins; Mohr DW, Johns Hopkins

The Hawaiian monk seal is an endangered species with a current population of ~1100 individuals and is the subject of a species recovery effort by NOAA. We have created a genome assembly using the Chromium synthetic long read technology of 10X Genomics (10XG) and optical mapping from BioNano Genomics (BNG). The 10XG Chromium method incorporates unique molecular identifiers (UMIs) into long DNA molecules, which are sheared into standard Illumina libraries, sequenced, and reassembled based on the shared UMIs using 10XG SuperNova software. For optical mapping, lymphocytes were embedded in agarose and processed in situ for high molecular weight DNA. The DNA was labeled at BspQI sites, imaged and assembled into consensus molecular maps. BNG RefAligner software aligned the optical maps and SuperNova scaffolds. The N50 of the 10XG data alone was 22.2 Mb, with the longest scaffold at 84.0 Mb. A total of 170 hybrid BNG/10XG scaffolds increased the N50 to 29.6 Mb, with the longest scaffold at 84.7 Mb. The quality of the resulting genome was assessed by several metrics. First, we used the BUSCO tool to measure the number of expected conserved protein-coding genes. Second, we translated all of the 10XG DNA sequences and matched them against the human protein database to identify blocks of synteny. Lastly, we compared the seal DNA scaffolds to the human genome using nucmer. We estimate that ~98% of the seal genome is accounted for in the 170 hybrid scaffolds. BUSCO identified 123 missing protein-coding genes out of 3023 searched, but a manual inspection of the translated proteins found all but 12 of these. The materials cost for this genome was less than $15K, and we believe these methods are likely to significantly improve the quality and speed with which genomes can be assembled, given the availability of high molecular weight DNA.
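The N50 figures above follow the standard definition: the scaffold length at which scaffolds of that length or longer cover at least half of the total assembly. A minimal illustrative sketch (not the authors' code):

```python
# Illustrative sketch of the standard N50 metric, not the authors' pipeline.
# N50 is the length L such that scaffolds of length >= L account for at
# least half of the total assembly length.

def n50(scaffold_lengths):
    """Return the N50 of a collection of scaffold lengths."""
    lengths = sorted(scaffold_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0

# Toy example: five scaffolds totalling 100 units; the running sum first
# reaches 50 at the second-longest scaffold.
print(n50([40, 25, 15, 12, 8]))  # -> 25
```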

14.2  16:30  easyMirror and easyImport: tools for creation of multispecies genome databases for neglected organisms. Blaxter M*, University of Edinburgh; Challis R, University of Edinburgh; Kumar S, University of Edinburgh; Dasmahapatra K, University of York; Jiggins C, University of Edinburgh

The coming wave of non-model, "neglected" organism genomes poses pressing questions of data access and interoperability. We present a working system for collating and presenting genome data from large sets of neglected organisms for browsing and exploration. We have developed a robust, versatile and simple way to install and populate an Ensembl database with genome data. Ensembl is a richly-featured, open access system developed for the model organism communities. Ensembl seamlessly integrates comparative genomics and other datatypes (such as structural variants and functional annotation) in a genome browser with a mature application programming interface that permits programmatic access. Ensembl has generally been considered difficult to install, and adding new genomes requires considerable effort. Our tools make this process simpler. EasyMirror allows you to set up a new site with remote and local data within a few minutes, while EasyImport allows you to import genomic sequence, gene models and annotations for any taxon in around an hour. Once live, databases can be federated to provide further linkage between diverse taxon groups. Our flagship product is LepBase, serving the Lepidoptera research community with ~40 integrated genomes of varying levels of contiguity and completeness. It will be relatively easy to replicate this kind of resource for the many other communities represented at this meeting.
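Gene models of the kind EasyImport ingests are typically distributed as GFF3. As a hedged sketch (assumed for illustration, not EasyImport's actual code), the core of such an import is parsing gene records from the tab-separated GFF3 columns:

```python
# Minimal sketch (illustrative, not EasyImport's implementation) of reading
# gene models from GFF3 lines. GFF3 columns are: seqid, source, type,
# start, end, score, strand, phase, attributes.

def parse_gff3_genes(lines):
    """Yield (seqid, start, end, strand, gene_id) for each 'gene' record."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # keep only well-formed gene records
        # Attributes are semicolon-separated key=value pairs.
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        yield fields[0], int(fields[3]), int(fields[4]), fields[6], attrs.get("ID", "")

# Toy annotation with hypothetical identifiers.
gff = [
    "##gff-version 3",
    "scaffold_1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene0001;Name=toy",
    "scaffold_1\texample\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna0001;Parent=gene0001",
]
print(list(parse_gff3_genes(gff)))
```

A real importer would additionally resolve mRNA and exon features to their parent genes before loading them into the database.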

14.3  16:50  gVolante: for more standardized completeness assessment of genome/transcriptome assemblies. Nishimura O, RIKEN Center for Life Science Technologies; Hara Y, RIKEN Center for Life Science Technologies; Kuraku S*, RIKEN Center for Life Science Technologies

Along with increasing access to comprehensive sequence information, such as whole genomes and transcriptomes, the demand for assessing their quality has multiplied. Products of genome and transcriptome assemblies are often not thoroughly assessed, due to the time-consuming nature of assembly program executions, and it is not straightforward to assess them on a uniform criterion. Metrics based on sequence lengths, such as N50, have become a standard, but they evaluate only one aspect of assembly quality. The program pipeline CEGMA was developed for completeness assessment of genome assemblies based on the coverage of pre-selected reference protein-coding genes (Parra et al., 2009. Nucleic Acids Res. 37: 289-97), and this function was taken over by its successor BUSCO (Simao et al., 2015. Bioinformatics 31: 3210-2). However, running CEGMA demands some prerequisites, including multiple programs that are not easy to install, and neither CEGMA nor BUSCO has a user-friendly interface. Here we introduce a new web server, gVolante, which provides an online tool for 1) on-demand completeness assessment of sequence sets and 2) browsing pre-computed completeness scores for publicly available data in its database section. Completeness assessment based on pre-selected reference protein-coding genes should be performed with careful consideration of the compatibility of the reference gene set with the taxonomic position of the species from which the evaluated sequences derive. gVolante provides a choice between the reference gene set 'CEG' associated with CEGMA (Parra et al., 2009. Nucleic Acids Res. 37: 289-97), our original gene set for vertebrates, 'CVG' (Hara et al., 2015. BMC Genomics 16:977), and some of the gene sets provided with BUSCO (Simao et al., 2015. Bioinformatics 31: 3210-2).
Completeness assessment on gVolante produces not only scores based on coverage of reference genes but also scores based on sequence lengths (e.g., N50 scaffold length), allowing quality control in multiple respects. Using gVolante, one can compare the quality of multiple versions of an assembly (obtained through program choice and parameter tweaking, for example) and evaluate them against the scores of standard public resources found in the database section. To our knowledge, gVolante is the only online tool that evaluates the completeness of genome and transcriptome sequences and produces ready-to-view results via a graphical interface.
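At its core, the coverage-based score is the fraction of a pre-selected reference gene set recovered in the assembly. A hedged sketch of that idea (the gene names are illustrative, not gVolante's internal data):

```python
# Illustrative sketch of a coverage-based completeness score: the percentage
# of a pre-selected reference gene set detected in an assembly. The gene
# names below are hypothetical placeholders, not gVolante's reference sets.

def completeness(reference_genes, detected_genes):
    """Percentage of reference genes recovered in the assembly."""
    found = sum(1 for gene in reference_genes if gene in detected_genes)
    return 100.0 * found / len(reference_genes)

reference = {"rpl3", "eef1a", "actb", "gapdh"}   # hypothetical reference set
detected = {"rpl3", "actb", "gapdh"}             # genes found in the assembly
print(completeness(reference, detected))  # 75.0
```

Reporting this alongside length-based statistics such as N50 is what lets the two aspects of assembly quality be judged together.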

14.4  17:10  Comparison of single genome and allele frequency data reveals discordant demographic histories. Beichman AC*, University of California, Los Angeles; Lohmueller KE, University of California, Los Angeles

Inference of the history of population size changes from genetic data is a primary goal of population genetics in both model and non-model organisms. Sequentially Markovian Coalescent (SMC)-based methods (including PSMC, PSMC' and MSMC) can infer the demographic history of a population from one to four whole genomes, while site frequency spectrum (SFS)-based methods (such as dadi and fastsimcoal) use the distribution of allele frequencies in a sample of ten or more individuals to reconstruct the same historical events. Although both classes of method are extensively used in empirical studies and perform well on data simulated under simple models, they have not been compared to each other in more complex and realistic settings. Often there is no means of comparison, as the SMC whole-genome methods are frequently applied to de novo genomes of non-model organisms for which no allele frequency data are available to validate the resulting demographic trajectory. Here we use humans, a species for which both methods have been used extensively, as an empirical test case to study the behavior of the two inference procedures. We find that the demographic histories inferred using SMC-based methods produce SFSs that do not match the observed SFS, indicating that SMC curves do not reflect literal population size changes. However, using simulated data, we also find that SMC-based methods can reconstruct the complex demographic models inferred by SFS-based methods, suggesting that the discordance between models is not attributable to a lack of statistical power. The differences between SMC-based and SFS-based demographic histories may indicate that neither method captures the full complexity of a population's demographic history, and therefore more complex models and inference tools are needed.
More generally, our findings indicate that demographic inference from a small number of genomes using SMC methods, though routine in genomic studies of non-model organisms, should be regarded with caution, as these models cannot recapitulate other summaries of the data and should therefore not be taken as a record of literal population size changes through time.
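The SFS used by methods such as dadi and fastsimcoal is a simple tally: for each site, count how many of the sampled chromosomes carry the derived allele. A minimal sketch (illustrative, not the authors' pipeline):

```python
# Illustrative sketch of building an unfolded site frequency spectrum (SFS)
# from per-site derived-allele counts, the summary statistic consumed by
# SFS-based inference tools. Not the authors' code.

def sfs(derived_counts, n_chromosomes):
    """Return the unfolded SFS: spectrum[i] = number of sites where the
    derived allele appears on exactly i of the n sampled chromosomes."""
    spectrum = [0] * (n_chromosomes + 1)
    for count in derived_counts:
        spectrum[count] += 1
    return spectrum

# Toy data: derived-allele counts at 8 polymorphic sites in a sample of
# 4 chromosomes (hypothetical values).
print(sfs([1, 1, 2, 1, 3, 2, 1, 4], 4))  # [0, 4, 2, 1, 1]
```

Comparing a spectrum like this, computed from observed data, against the spectrum expected under an SMC-inferred history is the kind of cross-check the abstract describes.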
