SDAFT: A novel scalable data access framework for parallel BLAST

    Authors

    J. L. Yin; J. Y. Zhang; J. Wang; W. C. Feng

    Abbreviated Journal Title

    Parallel Comput.

    Keywords

    MPI/POSIX I/O; HDFS; Parallel sequence search; mpiBLAST; SEARCH; IMPLEMENTATION; PERFORMANCE; SEQUENCE; GENBANK; SYSTEM; Computer Science, Theory & Methods

    Abstract

    To run tasks in a parallel and load-balanced fashion, existing scientific parallel applications such as mpiBLAST introduce a data-initializing stage that moves database fragments from shared storage to local cluster nodes. Unfortunately, with the exponentially increasing size of sequence databases in today's big data era, such an approach is inefficient. In this paper, we develop a scalable data access framework, SDAFT, to solve the data movement problem for scientific applications whose data analysis is dominated by read operations. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. SDAFT consists of two interlocked components: (1) a data-centric load-balanced scheduler (DC-scheduler) to enforce data-process locality and (2) a translation layer to translate conventional parallel I/O operations into HDFS I/O. By experimenting with our SDAFT prototype system on real-world databases and queries across a wide variety of computing platforms, we found that SDAFT can reduce I/O cost by a factor of 4-10 and double the overall execution performance as compared with existing schemes. (C) 2014 Elsevier B.V. All rights reserved.

    Journal Title

    Parallel Computing

    Volume

    40

    Issue/Number

    10

    Publication Date

    1-1-2015

    Document Type

    Article

    Language

    English

    First Page

    697

    Last Page

    709

    WOS Identifier

    WOS:000347018800010

    ISSN

    0167-8191
