IBM Systems and Technology
Thought Leadership White Paper
June 2013
Accelerating life sciences research
IBM Platform Symphony helps deliver improved performance for life sciences workloads using Contrail software
Contents
Addressing the challenges of genome assembly with Contrail
Accelerating results with IBM Platform Symphony
The benchmark environment
Selecting the E.coli model
Test methodology
Results
Interpreting the results
The additional benefits of Platform Symphony
Limitations and additional work
Conclusion
Appendix: Shell script for benchmark testing
Actual benchmark results captured over three successive comparative runs
Hadoop configuration files
New approaches to genomic analysis, such as next-generation sequencing, will play key roles in advancing scientific knowledge, facilitating the development of targeted drugs and delivering personalized healthcare. To capitalize on these new approaches, life sciences organizations need computing environments that can process tremendous amounts of data rapidly. Speed of analysis is critical in life sciences since it relates directly to the rate of discovery and the cost-efficiency of employing genomic sequencing for personalized medicine on a large scale.
Contrail, a bioinformatics application, leverages the Hadoop MapReduce framework to deliver gains in performance and cost-efficiency in genome sequencing. By combining Contrail with IBM® Platform™ Symphony, a commercial workload scheduler and grid manager, researchers can see even greater
advantages. This paper presents the results of recent benchmark testing that demonstrate the advantages of using Platform Symphony in conjunction with Contrail.
Addressing the challenges of genome assembly with Contrail
Contrail is open-source software that was developed to solve key challenges associated with large-scale genome assembly. It enables de novo assembly of large genomes from short reads, bridging research in computational biology with advances in the Hadoop MapReduce framework.
The first step in analyzing a previously un-sequenced organism is to assemble reads by merging similar reads into progressively longer sequences. Assemblers such as Velvet and Euler attempt to solve the assembly problem by constructing, simplifying and traversing a de Bruijn graph of the read sequences.1 These assemblers primarily focus on correcting errors, reconstructing unambiguous regions and resolving short repeats.
While these assemblers can manage small genomes, scaling to larger, mammalian-sized genomes is challenging. The assemblers require constructing and manipulating graphs that are too large to fit in the memory of most computer systems. Larger models can require computing environments with terabytes of memory, and building those environments would be too expensive for most institutions.
Contrail addresses the memory limitation by re-representing the algorithm to run on a distributed MapReduce framework that avoids the need for massive amounts of memory on any individual system. Contrail relies on Hadoop to iteratively transform an on-disk representation of the assembly graph, allowing an in-depth analysis even for large genomes on clusters of commodity computer systems running a Linux operating system.
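The core idea can be illustrated with a toy sketch. In the map step of de Bruijn graph construction, each read is decomposed into overlapping k-mers, and an edge is emitted from each k-mer to the k-mer that follows it. The pipeline below is only a conceptual illustration over a made-up read; Contrail's real map step is Java code running inside Hadoop, not this shell fragment.

```shell
# Toy illustration of de Bruijn edge emission for one read (k=5);
# each line pairs a k-mer node with its overlapping successor
K=5
echo "ACGTACGTA" | awk -v k=$K '{
    for (i = 1; i <= length($0) - k; i++)
        print substr($0, i, k) "\t" substr($0, i + 1, k)
}'
```

For the nine-base read above this emits four node-to-successor pairs, the edges Hadoop would shuffle and reduce into the on-disk graph representation.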
Accelerating results with IBM Platform Symphony
Platform Symphony software offers enterprise-class management of distributed compute and big data applications on a scalable, shared grid. By providing a low-latency scheduling environment for heterogeneous workloads, Platform Symphony can help accelerate application workloads and enable IT groups to enhance the efficiency of how resources are used.
Platform Symphony, available on its own and as a limited-use license as part of the IBM InfoSphere® BigInsights® software distribution, also makes it easy for organizations to run applications specifically designed for big data and achieve higher levels of performance to facilitate rapid decision making. With Platform Symphony Advanced Edition augmenting a supported Hadoop distribution, organizations can run their existing Hadoop MapReduce applications without modification.
Platform Symphony does not replace Hadoop; it replaces only the standard batch scheduler included with the open-source Hadoop MapReduce distribution. Platform Symphony enhances Hadoop by providing a faster, low-latency MapReduce runtime layer and more reliable and flexible workload management.
In other industries, Platform Symphony has been shown to substantially accelerate Hadoop MapReduce workloads. The goal of this benchmark was to demonstrate how Platform Symphony could deliver similar advantages for a life sciences workload.
The benchmark environment
This benchmark measured the relative performance of a Contrail model with and without Platform Symphony. Relatively little performance optimization was done for either the "Hadoop-only" case or the "Hadoop-plus-Platform Symphony" case. Existing lab hardware was used to conduct the tests, so the hardware environment may not have been optimal, but it was sufficient for this kind of simple comparative test.
Hardware
A Hadoop MapReduce cluster comprising multiple IBM rack-mount servers (see Figure 1) was used to support the benchmark. The cluster had a single head node and seven data nodes. The head node was a 2.6 GHz IBM System x® 3650 M4 server with 32 GB of memory. Six of the compute nodes were IBM System x dx360 M4 servers configured with 64 GB of memory per server and 40 Gbps InfiniBand interconnects. The seventh server was an IBM iDataPlex® M3 server.
All nodes were connected through a 40 Gbps InfiniBand switch. The test ran IP over InfiniBand (IPoIB). A separate 1 Gb Ethernet network was used for node configuration and management.
Figure 1. IBM System x test environment for Contrail performance comparisons.
(Diagram: a Mellanox InfiniBand switch connecting one IBM 3650 M4 head node, six IBM dx360 M4 servers and one IBM dx360 M3 server.)
Software
The cluster nodes all ran Red Hat Enterprise Linux 6.2. Hadoop 1.1.1 was downloaded from apache.org and configured in accordance with instructions provided in the Platform Symphony release notes (see Figure 2). The Contrail software tested was the latest version available from http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail as of March 2013. The Contrail code was installed based on instructions in the Contrail Quickstart Guide (available on the Contrail wiki). For the comparative test, Platform Symphony Version 6.1.0.1 was used in conjunction with the Hadoop software above.
Tests were initially conducted with both Hadoop 1.0.1 and 1.1.1, but it was judged to be more valid to focus on 1.1.1 since this was the more recent version. A significant difference between the two versions is the heartbeat interval. Hadoop 1.1.1 employs a more aggressive 0.3-second heartbeat interval, while Hadoop 1.0.1 has a 3-second interval. For this reason, Hadoop 1.1.1 generally outperforms Hadoop 1.0.1 on small clusters such as the test environment, where a fast heartbeat interval is reasonable.
Figure 2. The Platform Symphony management console cluster view.
Selecting the E.coli model
For this comparative benchmark testing, the Ecoli.10k file included in the data directory of the Contrail distribution was chosen as the basis for the test.2 The benchmark team treated the E.coli model provided with the Contrail distribution as a "black box" and ran Contrail in accordance with the provided directions.
Test methodology
To simplify the benchmark testing, and to facilitate repeated runs with different data models and parameter settings, a shell script was developed (see the Appendix) to run the benchmark. Much of the script's logic parses the output of the Contrail runs for both the Hadoop-only and Hadoop-plus-Platform Symphony cases to capture run-time details from repeated benchmark runs. Without this automation, manually gathering comparable statistics from repeated job runs would have been tedious.
The two test case configurations employed mostly the default settings. The benchmark team did, however, change three variables in the Platform Symphony application profile for the Symphony MapReduce tenant, under which the Contrail jobs ran. The application profiles were configured with these settings:
preStartApplication="true"
taskLowWaterMark="0.0"
taskHighWaterMark="1.0"
These settings are known to deliver better performance for Hadoop MapReduce workloads and would likely be the same settings used by organizations deploying such an application in production. These are standard settings explained in the Platform Symphony product documentation.
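As a quick sanity check, the three attribute settings can be verified in an exported copy of the application profile. The fragment below is only a stand-in: the `<Profile>` wrapper element and the file path are assumptions for illustration, not the actual Platform Symphony profile schema; only the three attribute values are taken from the benchmark setup.

```shell
# Stand-in profile fragment; the <Profile> element is hypothetical,
# only the three attribute settings come from the benchmark
cat > /tmp/profile_fragment.xml <<'EOF'
<Profile preStartApplication="true"
         taskLowWaterMark="0.0"
         taskHighWaterMark="1.0"/>
EOF

# Count the lines carrying the three tuned attributes
grep -c -E 'preStartApplication|taskLowWaterMark|taskHighWaterMark' /tmp/profile_fragment.xml
```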
Figure 3. View of running Contrail jobs in the Platform Symphony management console.
The benchmark execution script:
● Sets up the environment for both Hadoop and Platform Symphony
● Cleans up the Hadoop Distributed File System (HDFS) environment to make sure there is no data from prior runs
● Copies the E.coli model files into HDFS
● Runs the identical model twice: once using the Hadoop-only environment and once using the Hadoop-plus-Platform Symphony environment
Following these runs, the output files contrail.out.hadoop and contrail.out.symphony generated by the script were parsed to show comparative runtime statistics.
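The parsing hinges on a `Duration:` line in each captured output file. The snippet below sketches how the script's `get_duration` helper extracts the total; the log-line format shown is a stand-in for illustration, since the exact Contrail output format may differ.

```shell
# Stand-in output file; get_duration() in the Appendix prints the
# third whitespace field of the line containing "Duration:"
printf 'Total Duration: 873\n' > /tmp/contrail.out.example
grep Duration: /tmp/contrail.out.example | awk '{ print $3 }'   # prints 873
```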
For the Platform Symphony portion of the test, the running jobs could be monitored through the Platform Symphony management console (see Figure 3).
Results
Using standard Hadoop 1.1.1, the average duration of each Hadoop MapReduce job was found to be 16.17 seconds, with a total runtime of 873 seconds. Using Hadoop in conjunction with Platform Symphony accelerated the calculation of the Contrail model, reducing the average job runtime to just 4.68 seconds and compressing the total runtime to just 258 seconds, nearly a 3.4 times performance boost.
The captured script output is shown below. Figure 4 shows the relative total runtimes in a bar chart form.
===========================================
Hadoop + Platform Symphony
Total jobs: 53
Maximum job length: 124 seconds
Average job length: 4.6792 seconds
Total duration: 258 seconds
===========================================
Hadoop only
Total jobs: 53
Maximum job length: 18 seconds
Average job length: 16.1698 seconds
Total duration: 873 seconds
===========================================
Interpreting the results
While not all models will show similar performance gains, the observations in this test are consistent with a social media benchmark3 in which Platform Symphony was shown to accelerate workloads by an average 7.3 times. Generally, for latency-sensitive applications that involve multiple short-running jobs, Platform Symphony will help improve performance because of its low-latency scheduling architecture. As a result, organizations can either complete work faster or realize cost savings by deploying a smaller cluster environment to attain performance objectives.
(Bar chart: "Contrail runtime to subset of E.coli bacteria (10K reads)"; total runtime for 53 jobs in seconds, without Platform Symphony versus with Platform Symphony.)
Figure 4. Using Platform Symphony with Hadoop helped significantly reduce the workload’s runtime.
The additional benefits of Platform Symphony
Even though this effort focuses on comparing performance, Platform Symphony includes capabilities that can provide several additional advantages to life sciences organizations. For example:
● Proportional resource allocation: Organizations can run multiple MapReduce workloads concurrently, dynamically changing priorities and associated resource allocations in real time.
● Fast job pre-emption: Organizations can make sure critical workloads start and finish quickly while longer-running workloads continue to run in the background.
● Job recoverability: JobTracker execution is journaled so that jobs can resume where they left off in the event of failure.
● Optional IBM General Parallel File System (IBM GPFS™): Organizations running both MapReduce and non-MapReduce workloads can benefit from GPFS since it is a POSIX4 file system that can support both Hadoop MapReduce and non-MapReduce workloads concurrently accessing file system data, without the need to copy data in and out of the file system.
● Multi-mode clusters: Organizations running Hadoop MapReduce as well as traditional non-MapReduce workloads can configure individual clusters to support both Platform LSF® and Platform Symphony. Platform LSF is a powerful workload management solution for running large, batch-oriented workloads. Running both Platform LSF and Platform Symphony on the same cluster can deliver additional flexibility and increase the number of life sciences applications that can efficiently share cluster resources.
Limitations and additional work
This test involved a single model; organizations could experience different results with different models or different numbers of reads. Results may also vary with the size of the cluster. Furthermore, the disk subsystem as configured was suboptimal for both test cases. Organizations might see different results with a more optimized file system configuration.
It is debatable whether this specific Contrail test should be described as a "big data" workload since the actual files involved are relatively small by big data standards. The business advantage of using Hadoop MapReduce for this kind of workload, however, is undeniable. The MapReduce framework helps reduce the costs of performing de novo genome assembly, avoiding the need for costly systems with massive amounts of physical memory. Based on these tests, Platform Symphony builds on the inherent advantages associated with the use of Contrail by providing an additional incremental performance advantage.
Conclusion
As this testing demonstrates, life sciences organizations using Contrail can expect to see a significant performance advantage by using the Platform Symphony scheduler in place of the standard scheduler included with the Hadoop MapReduce distribution. In the sample model comprising 10,000 reads, Platform Symphony accelerated the calculation of the Contrail result by 3.4 times.
Because InfoSphere BigInsights 2.1 incorporates the IBM Platform Symphony scheduler, life sciences organizations considering deploying Hadoop MapReduce workloads along with other existing workloads should consider BigInsights as a platform for their big data applications.
Appendix: Shell script for benchmark testing
contrail-test.sh
This is the script used to control the execution of the benchmark.
#!/bin/bash
usage()
{
cat << EOF
usage: $0 -i <path> -o <path> [-k <int> -l <prefix>]

This script runs Contrail on the input data.

OPTIONS
  -i <path>    Path to the HDFS input directory
  -o <path>    Path to the HDFS output directory
  -k <int>     Value of K (default 25)
  -l <prefix>  Local outfile prefix (default contrail.out)
EOF
}
#
# Extract total duration from contrail output
get_duration()
{
local dur=`grep Duration: $1 | awk '{ print $3; }'`
echo $dur
}
#
# Parse Hadoop contrail output and print statistics
parse_hadoop()
{
local outfile=$1
local jobpattern="job_"
local tmpfile="_lengths.tmp"
local max=0
local tot=0
local num=`grep $jobpattern $outfile | wc -l`
grep $jobpattern $outfile | sed -E "s/(.*)($jobpattern.*)/\2/g" | awk '{ print $2; }' > $tmpfile
local jobs=( $( cat $tmpfile ) )
rm -f $tmpfile
for i in "${jobs[@]}"
do
    if [ $i -gt $max ]
    then
        max=$i
    fi
    ((tot=$tot+$i))
done
avg=`echo "scale=4; $tot/$num" | bc`
echo "Hadoop -- Total Jobs: $num"
echo "          Max Job Length: $max sec"
echo "          Avg Job Length: $avg sec"
}
#
# Parse Symphony contrail output and print statistics
parse_symphony()
{
local outfile=$1
local jobpattern="^job_"
local tmpfile="_lengths.tmp"
local max=0
local tot=0
local num=`grep $jobpattern $outfile | wc -l`
# Matched lines begin with the job name, so field 2 is the job length
grep $jobpattern $outfile | awk '{ print $2; }' > $tmpfile
local jobs=( $( cat $tmpfile ) )
rm -f $tmpfile
for i in "${jobs[@]}"
do
    if [ $i -gt $max ]
    then
        max=$i
    fi
    ((tot=$tot+$i))
done
avg=`echo "scale=4; $tot/$num" | bc`
echo "Symphony -- Total Jobs: $num"
echo "            Max Job Length: $max sec"
echo "            Avg Job Length: $avg sec"
}
HDFS_INPUT=
HDFS_OUTPUT=
CONTRAIL_K=25
PREFIX=contrail.out
while getopts "i:o:k:l:" ARG
do
case $ARG
in
i)
HDFS_INPUT=$OPTARG
;;
o)
HDFS_OUTPUT=$OPTARG
;;
k)
CONTRAIL_K=$OPTARG
;;
l)
PREFIX=$OPTARG
;;
esac
done
SYM_ASMDIR=${HDFS_OUTPUT}.symphony
SYM_OUTFILE=${PREFIX}.symphony
HADOOP_ASMDIR=${HDFS_OUTPUT}.hadoop
HADOOP_OUTFILE=${PREFIX}.hadoop
if [[ -z $HDFS_INPUT ]] || [[ -z $HDFS_OUTPUT ]]
then
usage
exit
fi
if [[ -z ${HADOOP_HOME} ]]
then
    echo "HADOOP_HOME not defined."
    exit
fi
if [[ -z ${PMR_BINDIR} ]]
then
    echo "PMR_BINDIR not defined."
    exit
fi
echo "Cleaning HDFS:${HDFS_INPUT}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${HDFS_INPUT}
echo "Cleaning HDFS:${HADOOP_ASMDIR}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${HADOOP_ASMDIR}
echo "Cleaning HDFS:${SYM_ASMDIR}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${SYM_ASMDIR}
echo "Copying input files to HDFS:${HDFS_INPUT}"
${HADOOP_HOME}/bin/hadoop fs -mkdir ${HDFS_INPUT}
${HADOOP_HOME}/bin/hadoop fs -copyFromLocal Ec10k.sim[12].fq ${HDFS_INPUT}
echo "Running contrail (K=${CONTRAIL_K}) on Hadoop"
echo "======= Redirecting all output to ${HADOOP_OUTFILE} in the current directory"
export CONTRAIL_JAR=contrail.jar
${HADOOP_HOME}/bin/hadoop jar ${CONTRAIL_JAR} contrail.Contrail -asm ${HADOOP_ASMDIR} -k ${CONTRAIL_K} -reads ${HDFS_INPUT} &> ${HADOOP_OUTFILE}
if [ $? -ne 0 ]
then
    echo "ERROR: Hadoop execution failed. Aborting..."
    exit 1
fi
echo "Running contrail (K=${CONTRAIL_K}) on Symphony"
echo "======= Redirecting all output to ${SYM_OUTFILE} in the current directory"
$PMR_BINDIR/mrsh jar ${CONTRAIL_JAR} contrail.Contrail -asm ${SYM_ASMDIR} -k ${CONTRAIL_K} -reads ${HDFS_INPUT} &> ${SYM_OUTFILE}
if [ $? -ne 0 ]
then
    echo "ERROR: Symphony execution failed. Aborting..."
    exit 1
fi
echo "==========================================="
parse_symphony ${SYM_OUTFILE}
SYMPHONY_DUR=`get_duration ${SYM_OUTFILE}`
echo "            Total Duration: ${SYMPHONY_DUR} sec"
echo "==========================================="
parse_hadoop ${HADOOP_OUTFILE}
HADOOP_DUR=`get_duration ${HADOOP_OUTFILE}`
echo "          Total Duration: ${HADOOP_DUR} sec"
echo "==========================================="
SPEEDUP=`echo "scale=4; ${HADOOP_DUR}/${SYMPHONY_DUR}" | bc`
echo "Symphony Speedup: ${SPEEDUP}x"
Actual benchmark results captured over three successive comparative runs
Note that the second and third test results were discarded because Platform Symphony performance was substantially better than the Hadoop MapReduce results, likely because of caching effects: Platform Symphony can persist services.
===========================================
Symphony -- Total jobs: 53
Maximum job length: 124 seconds
Average job length: 4.6792 seconds
Total duration: 258 seconds
===========================================
Hadoop -- Total jobs: 53
Maximum job length: 18 seconds
Average job length: 16.1698 seconds
Total duration: 873 seconds
===========================================
Symphony speedup: 3.3837 times
===========================================
Symphony -- Total jobs: 53
Maximum job length: 4 seconds
Average job length: 2.4905 seconds
Total duration: 142 seconds
===========================================
Hadoop -- Total jobs: 53
Maximum job length: 20 seconds
Average job length: 16.1698 seconds
Total duration: 871 seconds
===========================================
Symphony speedup: 6.1338 times
===========================================
Symphony -- Total jobs: 53
Maximum job length: 4 seconds
Average job length: 2.4905 seconds
Total duration: 142 seconds
===========================================
Hadoop -- Total jobs: 53
Maximum job length: 20 seconds
Average job length: 16.1698 seconds
Total duration: 871 seconds
===========================================
Symphony speedup: 6.1338 times
Hadoop configuration files
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/data</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://atsplat2.private:19000/</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>atsplat2.private:19001</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>15</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>15</value>
</property>
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx2048M</value>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx2048M</value>
</property>
</configuration>
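One observation on these mapred-site.xml settings: with 15 map slots and 15 reduce slots per TaskTracker, each task capped at a 2 GB heap, a fully loaded node could commit roughly 60 GB of task heap. That fits within the 64 GB compute nodes used here but leaves little headroom, which is consistent with the earlier note that neither configuration was heavily tuned.

```shell
# Worst-case task heap per node under the mapred-site.xml settings:
# (map slots + reduce slots) * per-task -Xmx, in GB
echo $(( (15 + 15) * 2 ))   # prints 60
```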
For more information
To learn more about Contrail, visit: http://contrail-bio.git.sourceforge.net/git/gitweb.cgi?p=contrail-bio/contrail-bio;a=tree
For more information about IBM Platform Symphony, visit: ibm.com/platformcomputing/products/symphony
For more information about IBM InfoSphere BigInsights and other IBM big data solutions, contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/infosphere/biginsights
© Copyright IBM Corporation 2013
IBM Corporation Systems and Technology Group Route 100 Somers, NY 10589
Produced in the United States of America June 2013
IBM, the IBM logo, ibm.com, BigInsights, GPFS, iDataPlex, InfoSphere, LSF, Platform, and System x are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
This document is current as of the initial date of publication and may be changed by IBM at any time.
The performance data discussed herein is presented as derived under specific operating conditions. Actual results may vary.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.
Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.
1 While the science of genome assembly is outside of the scope of this paper, interested parties can learn more about Contrail by visiting: http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail
2 For details about the 10K read E.coli model included with the Contrail software distribution, visit: http://contrail-bio.git.sourceforge.net/git/gitweb.cgi?p=contrail-bio/contrail-bio;a=tree. Groundbreaking work on the E.coli K-12 strain MG1655 was done at the University of Wisconsin. For more information, visit www.genome.wisc.edu/sequencing/updating.htm
3 For an audited STAC Report commissioned by IBM, visit: ibm.com/systems/technicalcomputing/platformcomputing/products/symphony/highperfhadoop.html
4 Portable Operating System Interface for UNIX. See http://en.wikipedia.org/wiki/POSIX for details.
DCW03047USEN-01