Summary
Background: Advances in DNA microarray devices and next-generation massively parallel DNA sequencing
platforms have led to an exponential growth in data availability, but exploiting the resulting
opportunities requires adequate computing resources. High-performance computing (HPC) in the cloud
offers an affordable way of meeting this need.
Objectives: Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed
as add-on modules for the R statistical programming language, but R has no native capabilities
for exploiting multiprocessor architectures. SPRINT is an R package that enables easy
access to HPC for genomics researchers. This paper investigates setting up and running
SPRINT-enabled genomic analyses on Amazon’s Elastic Compute Cloud (EC2), the advantages
of submitting applications to EC2 from different parts of the world, and whether resource
underutilization can improve application performance.
Methods: The SPRINT parallel implementations of correlation, permutation testing, partitioning
around medoids and the multi-purpose papply were benchmarked on data sets of
various sizes on Amazon EC2. Jobs were submitted from both the UK and Thailand
to investigate differences in monetary cost.
Results: It is possible to obtain good, scalable performance, but the level of improvement
depends on the nature of the algorithm. Resource underutilization can further
improve the time to result. The end user’s location affects costs due to factors such
as local taxation.
Conclusions: Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing
in general provide an interesting alternative and open up new possibilities for
smaller organisations with limited funds.
Keywords
Genomics - computing methodologies