Significant advances in genome sequencing over the past decade have resulted in a flood of genomic data that poses enormous computational challenges and requires new bioinformatics approaches. As the cost of sequencing has decreased and genomics has become an increasingly important tool for health and the environment, genomic data has grown exponentially, often requiring parallel computing on high-performance computing (HPC) systems. However, genomic applications are often characterized by irregular and unstructured computation and data layout, making them a problematic target for distributed-memory parallelism.
In this work, we show that it is possible to productively write highly parallel code for irregular genomic computation using the appropriate abstraction. Genomic algorithms are often based on graph analysis and processing. For individual graph algorithms, it has previously been shown that graphs can be viewed as sparse matrices, so that the computations become a series of matrix operations. Here, we take this idea further by demonstrating its applicability, and the challenges it raises, for a data- and compute-intensive end-to-end application in genomics: de novo long-read genome assembly, where an unknown genome is reconstructed from short, redundant, and erroneous DNA sequences. Our main contribution is the design and development of a set of scalable distributed-memory parallel algorithms for de novo long-read genome assembly that can run on hundreds of nodes of an HPC system, reducing the runtime for mammalian genomes from days on a single processor to less than 20 minutes on a supercomputer. Our algorithms are presented as the Extreme-Scale Long-Read Berkeley Assembler (ELBA) pipeline, which encompasses the major phases of the overlap-layout-consensus paradigm, the most popular approach for long-read sequencing data. In ELBA, we view assembly through the lens of sparse linear algebra, where the core data structure is a sparse matrix. This work paves the way for a highly productive paradigm for writing massively parallel codes for irregular and unstructured real-world computation.
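To make the graphs-as-sparse-matrices idea concrete, here is a minimal single-node sketch of one way the overlap phase can be expressed in linear algebra: build a sparse reads-by-k-mers occurrence matrix A, then compute A·Aᵀ, whose off-diagonal nonzeros count k-mers shared between read pairs and thus flag candidate overlaps. This uses SciPy for illustration only; it is not ELBA's distributed implementation, and the toy reads and the choice of k are made up for the example.

```python
# Illustrative sketch only (not ELBA's distributed code): candidate read
# overlaps via sparse matrix multiplication, the core idea behind viewing
# assembly through sparse linear algebra.
import numpy as np
from scipy.sparse import csr_matrix

reads = ["ACGTAC", "GTACGG", "TTTTTT"]  # toy reads, invented for the example
k = 3  # k-mer length, chosen arbitrarily here

# Index every distinct k-mer appearing in any read.
kmers = sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})
col = {km: j for j, km in enumerate(kmers)}

# Build the sparse reads-by-kmers matrix A; duplicate entries are summed,
# so A[i, j] counts occurrences of k-mer j in read i.
rows, cols = [], []
for i, r in enumerate(reads):
    for j in range(len(r) - k + 1):
        rows.append(i)
        cols.append(col[r[j:j + k]])
A = csr_matrix((np.ones(len(rows)), (rows, cols)),
               shape=(len(reads), len(kmers)))

# C = A @ A.T: entry (i, j) counts k-mers shared by reads i and j.
# Off-diagonal nonzeros mark candidate overlapping read pairs.
C = (A @ A.T).toarray()
print(C)
```

In a distributed setting this product becomes a sparse general matrix multiply (SpGEMM) over a matrix partitioned across nodes, which is what lets an irregular, data-dependent computation be driven by well-understood parallel linear-algebra primitives.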
Speaker Bio:
Guidi works in the field of high-performance computing for large-scale computational sciences (especially computational biology). Her research involves the development of algorithms, software infrastructures, and systems for parallel machines to accelerate data processing without sacrificing programming productivity, and to make high-performance computing more accessible. Guidi received her Ph.D. in computer science from the University of California, Berkeley. Before joining Cornell Bowers CIS, she worked as a project scientist in the Performance and Algorithms Research Group in the Applied Math and Computational Sciences Division at Lawrence Berkeley National Laboratory in Berkeley, California, where she is currently affiliate faculty.