IBM Systems and Technology
Thought Leadership White Paper
June 2013
Accelerating life sciences research
IBM Platform Symphony helps deliver improved performance for life sciences workloads using Contrail software
Contents
Addressing the challenges of genome assembly with Contrail
Accelerating results with IBM Platform Symphony
The benchmark environment
Selecting the E.coli model
Test methodology
Results
Interpreting the results
The additional benefits of Platform Symphony
Limitations and additional work
Conclusion
Appendix: Shell script for benchmark testing
Actual benchmark results captured over three successive comparative runs
Hadoop configuration files
New approaches to genomic analysis, such as next-generation sequencing, will play key roles in advancing scientific knowledge, facilitating the development of targeted drugs and delivering personalized healthcare. To capitalize on these new approaches, life sciences organizations need computing environments that can process tremendous amounts of data rapidly. Speed of analysis is critical in life sciences since it relates directly to the rate of discovery and the cost-efficiency of employing genomic sequencing for personalized medicine on a large scale.
Contrail, a bioinformatics application, leverages the Hadoop MapReduce framework to deliver gains in performance and cost-efficiency in genome sequencing. By combining Contrail with IBM® Platform™ Symphony, a commercial workload scheduler and grid manager, researchers can see even greater
advantages. This paper presents the results of recent benchmark testing that demonstrate the advantages of using Platform Symphony in conjunction with Contrail.
Addressing the challenges of genome assembly with Contrail
Contrail is open-source software that was developed to solve key challenges associated with large-scale genome assembly. It enables de novo assembly of large genomes from short reads, bridging research in computational biology with advances in the Hadoop MapReduce framework.
The first step in analyzing a previously un-sequenced organism is to assemble reads by merging similar reads into progressively longer sequences. Assemblers such as Velvet and Euler attempt to solve the assembly problem by constructing, simplifying and traversing a de Bruijn graph of the read sequences.1 These assemblers primarily focus on correcting errors, reconstructing unambiguous regions and resolving short repeats.
While these assemblers can manage small genomes, scaling to larger, mammalian-sized genomes is challenging. The assemblers require constructing and manipulating graphs that are too large to fit in the memory of most computer systems. Larger models can require computing environments with terabytes of memory, and building those environments would be too expensive for most institutions.
Contrail addresses the memory limitation by re-representing the algorithm to run on a distributed MapReduce framework that avoids the need for massive amounts of memory on any individual system. Contrail relies on Hadoop to iteratively transform an on-disk representation of the assembly graph, allowing an in-depth analysis even for large genomes on clusters of commodity computer systems running a Linux operating system.
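The core idea can be illustrated with a toy sketch. In the map step of de Bruijn graph construction, each read is decomposed into overlapping k-mers, and an edge is emitted from each k-mer to the k-mer that follows it. The pipeline below is only a conceptual illustration over a made-up read; Contrail's real map step is Java code running inside Hadoop, not this shell fragment.

```shell
# Toy illustration of de Bruijn edge emission for one read (k=5);
# each line pairs a k-mer node with its overlapping successor
K=5
echo "ACGTACGTA" | awk -v k=$K '{
    for (i = 1; i <= length($0) - k; i++)
        print substr($0, i, k) "\t" substr($0, i + 1, k)
}'
```

For the nine-base read above this emits four node-to-successor pairs, the edges Hadoop would shuffle and reduce into the on-disk graph representation.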
Accelerating results with IBM Platform Symphony
Platform Symphony software offers enterprise-class management of distributed compute and big data applications on a scalable, shared grid. By providing a low-latency scheduling environment for heterogeneous workloads, Platform Symphony can help accelerate application workloads and enable IT groups to enhance the efficiency of how resources are used.
Platform Symphony, available on its own and as a limited-use license as part of the IBM InfoSphere® BigInsights® software distribution, also makes it easy for organizations to run applications specifically designed for big data and achieve higher levels of performance to facilitate rapid decision making. With Platform Symphony Advanced Edition augmenting a supported Hadoop distribution, organizations can run their existing Hadoop MapReduce applications without modification.
Platform Symphony does not replace Hadoop; it replaces only the standard batch scheduler included with the open-source Hadoop MapReduce distribution. Platform Symphony enhances Hadoop by providing a faster, low-latency MapReduce runtime layer and more reliable and flexible workload management.
In other industries, Platform Symphony has been shown to substantially accelerate Hadoop MapReduce workloads. The goal of this benchmark was to demonstrate how Platform Symphony could deliver similar advantages for a life sciences workload.
The benchmark environment
This benchmark measured the relative performance of a Contrail model with and without Platform Symphony. Relatively little performance optimization was done for either the "Hadoop-only" case or the "Hadoop-plus-Platform Symphony" case. Existing lab hardware was used to conduct the tests, so the hardware environment may not have been optimal, but it was sufficient for this kind of simple comparative test.
Hardware
A Hadoop MapReduce cluster comprising multiple IBM rack-mount servers (see Figure 1) was used to support the benchmark. The cluster had a single head node and seven data nodes. The head node was a 2.6 GHz IBM System x® 3650 M4 server with 32 GB of memory. Six of the compute nodes were IBM System x dx360 M4 servers configured with 64 GB of memory per server and 40 Gbps InfiniBand interconnects. The seventh server was an IBM iDataPlex® M3 server.
All nodes were connected through a 40 Gbps InfiniBand switch. The test ran IP over InfiniBand (IPoIB). A separate 1 Gb Ethernet network was used for node configuration and management.
Figure 1. IBM System x test environment for Contrail performance comparisons.
(Diagram: a Mellanox InfiniBand switch connecting one IBM 3650 M4 head node, six IBM dx360 M4 servers and one IBM dx360 M3 server.)
Software
The cluster nodes all ran Red Hat Enterprise Linux 6.2. Hadoop 1.1.1 was downloaded from apache.org and configured in accordance with instructions provided in the Platform Symphony release notes (see Figure 2). The Contrail software tested was the latest version available from http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail as of March 2013. The Contrail code was installed based on instructions in the Contrail Quickstart Guide (available on the Contrail wiki). For the comparative test, Platform Symphony Version 6.1.0.1 was used in conjunction with the Hadoop software above.
Tests were initially conducted with both Hadoop 1.0.1 and 1.1.1, but it was judged to be more valid to focus on 1.1.1 since this was the more recent version. A significant difference between the two versions is the heartbeat interval. Hadoop 1.1.1 employs a more aggressive 0.3-second heartbeat interval, while Hadoop 1.0.1 has a 3-second interval. For this reason, Hadoop 1.1.1 generally outperforms Hadoop 1.0.1 on small clusters such as the test environment, where a fast heartbeat interval is reasonable.
Figure 2. The Platform Symphony management console cluster view.
Selecting the E.coli model
For this comparative benchmark testing, the Ecoli.10k file included in the data directory of the Contrail distribution was chosen as the basis for the test.2 The benchmark team treated the E.coli model provided with the Contrail distribution as a "black box" and ran Contrail in accordance with the provided directions.
Test methodology
To simplify the benchmark testing, and to facilitate repeated runs with different data models and parameter settings, a shell script was developed (see the Appendix) to run the benchmark. Much of the script's logic parses the output of the Contrail runs for both the Hadoop-only and Hadoop-plus-Platform Symphony cases to capture run-time details from repeated benchmark runs. Without this automation, manually gathering comparable statistics from repeated job runs would have been tedious.
The two test case configurations employed mostly the default settings. The benchmark team did, however, change three variables in the Platform Symphony application profile for the Symphony MapReduce tenant, under which the Contrail jobs ran. The application profiles were configured with these settings:
preStartApplication="true"
taskLowWaterMark="0.0"
taskHighWaterMark="1.0"
These settings are known to deliver better performance for Hadoop MapReduce workloads and would likely be the same settings used by organizations deploying such an application in production. These are standard settings explained in the Platform Symphony product documentation.
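As a quick sanity check, the three attribute settings can be verified in an exported copy of the application profile. The fragment below is only a stand-in: the `<Profile>` wrapper element and the file path are assumptions for illustration, not the actual Platform Symphony profile schema; only the three attribute values are taken from the benchmark setup.

```shell
# Stand-in profile fragment; the <Profile> element is hypothetical,
# only the three attribute settings come from the benchmark
cat > /tmp/profile_fragment.xml <<'EOF'
<Profile preStartApplication="true"
         taskLowWaterMark="0.0"
         taskHighWaterMark="1.0"/>
EOF

# Count the lines carrying the three tuned attributes
grep -c -E 'preStartApplication|taskLowWaterMark|taskHighWaterMark' /tmp/profile_fragment.xml
```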
Figure 3. View of running Contrail jobs in the Platform Symphony management console.
The benchmark execution script:
● Sets up the environment for both Hadoop and Platform Symphony
● Cleans up the Hadoop Distributed File System (HDFS) environment to make sure there is no data from prior runs
● Copies the E.coli model files into HDFS
● Runs the identical model twice: once using the Hadoop-only environment and once using the Hadoop-plus-Platform Symphony environment
Following these runs, the output files contrail.out.hadoop and contrail.out.symphony generated by the script were parsed to show comparative runtime statistics.
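The parsing hinges on a `Duration:` line in each captured output file. The snippet below sketches how the script's `get_duration` helper extracts the total; the log-line format shown is a stand-in for illustration, since the exact Contrail output format may differ.

```shell
# Stand-in output file; get_duration() in the Appendix prints the
# third whitespace field of the line containing "Duration:"
printf 'Total Duration: 873\n' > /tmp/contrail.out.example
grep Duration: /tmp/contrail.out.example | awk '{ print $3 }'   # prints 873
```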
For the Platform Symphony portion of the test, the running jobs could be monitored through the Platform Symphony management console (see Figure 3).
Results
Using standard Hadoop 1.1.1, the average duration of each Hadoop MapReduce job was found to be 16.17 seconds, with a total runtime of 873 seconds. Using Hadoop in conjunction with Platform Symphony accelerated the calculation of the Contrail model, reducing the average job runtime to just 4.68 seconds and compressing the total runtime to just 258 seconds, nearly a 3.4 times performance boost.
The captured script output is shown below. Figure 4 shows the relative total runtimes in a bar chart form.
===========================================
Hadoop + Platform Symphony
Total jobs: 53
Maximum job length: 124 seconds
Average job length: 4.6792 seconds
Total duration: 258 seconds
===========================================
Hadoop only
Total jobs: 53
Maximum job length: 18 seconds
Average job length: 16.1698 seconds
Total duration: 873 seconds
===========================================
Interpreting the results
While not all models will show similar performance gains, the observations in this test are consistent with a social media benchmark3 in which Platform Symphony was shown to accelerate workloads by an average 7.3 times. Generally, for latency-sensitive applications that involve multiple short-running jobs, Platform Symphony will help improve performance because of its low-latency scheduling architecture. As a result, organizations can either complete work faster or realize cost savings by deploying a smaller cluster environment to attain performance objectives.
(Bar chart: "Contrail runtime to subset of E.coli bacteria (10K reads)"; total runtime for 53 jobs in seconds, without Platform Symphony versus with Platform Symphony.)
Figure 4. Using Platform Symphony with Hadoop helped significantly reduce the workload’s runtime.
The additional benefits of Platform Symphony
Even though this effort focuses on comparing performance, Platform Symphony includes capabilities that can provide several additional advantages to life sciences organizations. For example:
● Proportional resource allocation: Organizations can run multiple MapReduce workloads concurrently, dynamically changing priorities and associated resource allocations in real time.
● Fast job pre-emption: Organizations can make sure critical workloads start and finish quickly while longer-running workloads continue to run in the background.
● Job recoverability: JobTracker execution is journaled so that jobs can resume where they left off in the event of failure.
● Optional IBM General Parallel File System (IBM GPFS™): Organizations running both MapReduce and non-MapReduce workloads can benefit from GPFS since it is a POSIX4 file system that can support both Hadoop MapReduce and non-MapReduce workloads concurrently accessing file system data, without the need to copy data in and out of the file system.
● Multi-mode clusters: Organizations running Hadoop MapReduce as well as traditional non-MapReduce workloads can configure individual clusters to support both Platform LSF® and Platform Symphony. Platform LSF is a powerful workload management solution for running large, batch-oriented workloads. Running both Platform LSF and Platform Symphony on the same cluster can deliver additional flexibility and increase the number of life sciences applications that can efficiently share cluster resources.
Limitations and additional work
This test involved a single model; organizations could experience different results with different models or different numbers of reads. Results may also vary with the size of the cluster. Furthermore, the disk subsystem as configured was suboptimal for both test cases. Organizations might see different results with a more optimized file system configuration.
It is debatable whether this specific Contrail test should be described as a "big data" workload since the actual files involved are relatively small by big data standards. The business advantage of using Hadoop MapReduce for this kind of workload, however, is undeniable. The MapReduce framework helps reduce the costs of performing de novo genome assembly, avoiding the need for costly systems with massive amounts of physical memory. Based on these tests, Platform Symphony builds on the inherent advantages associated with the use of Contrail by providing an additional incremental performance advantage.
Conclusion
As this testing demonstrates, life sciences organizations using Contrail can expect to see a significant performance advantage by using the Platform Symphony scheduler in place of the standard scheduler included with the Hadoop MapReduce distribution. In the sample model comprising 10,000 reads, Platform Symphony accelerated the calculation of the Contrail result by 3.4 times.
Because InfoSphere BigInsights 2.1 incorporates the IBM Platform Symphony scheduler, life sciences organizations considering deploying Hadoop MapReduce workloads along with other existing workloads should consider BigInsights as a platform for their big data applications.
Appendix: Shell script for benchmark testing
contrail-test.sh
This is the script used to control the execution of the benchmark.
#!/bin/bash
usage()
{
cat << EOF
usage: $0 -i <path> -o <path> [-k <int> -l <prefix>]

This script runs Contrail on the input data.

OPTIONS
  -i <path>    Path to the HDFS input directory
  -o <path>    Path to the HDFS output directory
  -k <int>     Value of K (default 25)
  -l <prefix>  Local outfile prefix (default contrail.out)
EOF
}
#
# Extract total duration from contrail output
get_duration()
{
local dur=`grep Duration: $1 | awk '{ print $3; }'`
echo $dur
}
#
# Parse Hadoop contrail output and print statistics
parse_hadoop()
{
local outfile=$1
local jobpattern="job_"
local tmpfile="_lengths.tmp"
local max=0
local tot=0
local num=`grep $jobpattern $outfile | wc -l`
grep $jobpattern $outfile | sed -E "s/(.*)($jobpattern.*)/\2/g" | awk '{ print $2; }' > $tmpfile
local jobs=( $( cat $tmpfile ) )
rm -f $tmpfile
for i in "${jobs[@]}"
do
    if [ $i -gt $max ]
    then
        max=$i
    fi
    ((tot=$tot+$i))
done
avg=`echo "scale=4; $tot/$num" | bc`
echo "Hadoop -- Total Jobs: $num"
echo "          Max Job Length: $max sec"
echo "          Avg Job Length: $avg sec"
}
#
# Parse Symphony contrail output and print statistics
parse_symphony()
{
local outfile=$1
local jobpattern="^job_"
local tmpfile="_lengths.tmp"
local max=0
local tot=0
local num=`grep $jobpattern $outfile | wc -l`
# Matched lines begin with the job name, so field 2 is the job length
grep $jobpattern $outfile | awk '{ print $2; }' > $tmpfile
local jobs=( $( cat $tmpfile ) )
rm -f $tmpfile
for i in "${jobs[@]}"
do
    if [ $i -gt $max ]
    then
        max=$i
    fi
    ((tot=$tot+$i))
done
avg=`echo "scale=4; $tot/$num" | bc`
echo "Symphony -- Total Jobs: $num"
echo "            Max Job Length: $max sec"
echo "            Avg Job Length: $avg sec"
}
HDFS_INPUT=
HDFS_OUTPUT=
CONTRAIL_K=25
PREFIX=contrail.out
while getopts "i:o:k:l:" ARG
do
case $ARG
in
i)
HDFS_INPUT=$OPTARG
;;
o)
HDFS_OUTPUT=$OPTARG
;;
k)
CONTRAIL_K=$OPTARG
;;
l)
PREFIX=$OPTARG
;;
esac
done
SYM_ASMDIR=${HDFS_OUTPUT}.symphony
SYM_OUTFILE=${PREFIX}.symphony
HADOOP_ASMDIR=${HDFS_OUTPUT}.hadoop
HADOOP_OUTFILE=${PREFIX}.hadoop
if [[ -z $HDFS_INPUT ]] || [[ -z $HDFS_OUTPUT ]]
then
usage
exit
fi
if [[ -z ${HADOOP_HOME} ]]
then
    echo "HADOOP_HOME not defined."
    exit
fi
if [[ -z ${PMR_BINDIR} ]]
then
    echo "PMR_BINDIR not defined."
    exit
fi
echo "Cleaning HDFS:${HDFS_INPUT}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${HDFS_INPUT}
echo "Cleaning HDFS:${HADOOP_ASMDIR}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${HADOOP_ASMDIR}
echo "Cleaning HDFS:${SYM_ASMDIR}"
${HADOOP_HOME}/bin/hadoop fs -rmr ${SYM_ASMDIR}
echo "Copying input files to HDFS:${HDFS_INPUT}"
${HADOOP_HOME}/bin/hadoop fs -mkdir ${HDFS_INPUT}
${HADOOP_HOME}/bin/hadoop fs -copyFromLocal Ec10k.sim[12].fq ${HDFS_INPUT}
echo "Running contrail (K=${CONTRAIL_K}) on Hadoop"
echo "======= Redirecting all output to ${HADOOP_OUTFILE} in the current directory"
export CONTRAIL_JAR=contrail.jar
${HADOOP_HOME}/bin/hadoop jar ${CONTRAIL_JAR} contrail.Contrail -asm ${HADOOP_ASMDIR} -k ${CONTRAIL_K} -reads ${HDFS_INPUT} &> ${HADOOP_OUTFILE}
if [ $? -ne 0 ]
then
    echo "ERROR: Hadoop execution failed. Aborting..."
    exit 1
fi
echo "Running contrail (K=${CONTRAIL_K}) on Symphony"
echo "======= Redirecting all output to ${SYM_OUTFILE} in the current directory"
$PMR_BINDIR/mrsh jar ${CONTRAIL_JAR} contrail.Contrail -asm ${SYM_ASMDIR} -k ${CONTRAIL_K} -reads ${HDFS_INPUT} &> ${SYM_OUTFILE}
if [ $? -ne 0 ]
then
    echo "ERROR: Symphony execution failed. Aborting..."
    exit 1
fi
echo "==========================================="
parse_symphony ${SYM_OUTFILE}
SYMPHONY_DUR=`get_duration ${SYM_OUTFILE}`
echo "            Total Duration: ${SYMPHONY_DUR} sec"
echo "==========================================="
parse_hadoop ${HADOOP_OUTFILE}
HADOOP_DUR=`get_duration ${HADOOP_OUTFILE}`
echo "          Total Duration: ${HADOOP_DUR} sec"
echo "==========================================="
SPEEDUP=`echo "scale=4; ${HADOOP_DUR}/${SYMPHONY_DUR}" | bc`
echo "Symphony Speedup: ${SPEEDUP}x"
Actual benchmark results captured over three successive comparative runs
Note that the second and third test results were discarded because Platform Symphony performance was substantially better than the Hadoop MapReduce results, likely because of caching effects: Platform Symphony can persist services.
===========================================
Symphony -- Total jobs: 53
Maximum job length: 124 seconds
Average job length: 4.6792 seconds
Total duration: 258 seconds
===========================================
Hadoop -- Total jobs: 53
Maximum job length: 18 seconds
Average job length: 16.1698 seconds
Total duration: 873 seconds
===========================================
Symphony speedup: 3.3837 times
===========================================
Symphony -- Total jobs: 53
Maximum job length: 4 seconds
Average job length: 2.4905 seconds
Total duration: 142 seconds
===========================================
Hadoop -- Total jobs: 53
Maximum job length: 20 seconds
Average job length: 16.1698 seconds
Total duration: 871 seconds
===========================================
Symphony speedup: 6.1338 times
===========================================
Symphony -- Total jobs: 53
Maximum job length: 4 seconds
Average job length: 2.4905 seconds
Total duration: 142 seconds
===========================================
Hadoop -- Total jobs: 53
Maximum job length: 20 seconds
Average job length: 16.1698 seconds
Total duration: 871 seconds
===========================================
Symphony speedup: 6.1338 times
Hadoop configuration files
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/data</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://atsplat2.private:19000/</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>atsplat2.private:19001</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>15</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>15</value>
</property>
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx2048M</value>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx2048M</value>
</property>
</configuration>
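One observation on these mapred-site.xml settings: with 15 map slots and 15 reduce slots per TaskTracker, each task capped at a 2 GB heap, a fully loaded node could commit roughly 60 GB of task heap. That fits within the 64 GB compute nodes used here but leaves little headroom, which is consistent with the earlier note that neither configuration was heavily tuned.

```shell
# Worst-case task heap per node under the mapred-site.xml settings:
# (map slots + reduce slots) * per-task -Xmx, in GB
echo $(( (15 + 15) * 2 ))   # prints 60
```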
For more information
To learn more about Contrail, visit: http://contrail-bio.git.sourceforge.net/git/gitweb.cgi?p=contrail-bio/contrail-bio;a=tree
For more information about IBM Platform Symphony, visit: ibm.com/platformcomputing/products/symphony
For more information about IBM InfoSphere BigInsights and other IBM big data solutions, contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/infosphere/biginsights
© Copyright IBM Corporation 2013
IBM Corporation Systems and Technology Group Route 100 Somers, NY 10589
Produced in the United States of America June 2013
IBM, the IBM logo, ibm.com, BigInsights, GPFS, iDataPlex, InfoSphere, LSF, Platform, and System x are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
This document is current as of the initial date of publication and may be changed by IBM at any time.
The performance data discussed herein is presented as derived under specific operating conditions. Actual results may vary.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.
Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.
1 While the science of genome assembly is outside of the scope of this paper, interested parties can learn more about Contrail by visiting: http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail
2 For details about the 10K read E.coli model included with the Contrail software distribution, visit: http://contrail-bio.git.sourceforge.net/git/gitweb.cgi?p=contrail-bio/contrail-bio;a=tree. Groundbreaking work on the E.coli K-12 strain MG1655 was done at the University of Wisconsin. For more information, visit www.genome.wisc.edu/sequencing/updating.htm
3 For an audited STAC Report commissioned by IBM, visit: ibm.com/systems/technicalcomputing/platformcomputing/products/symphony/highperfhadoop.html
4 Portable Operating System Interface for UNIX. See http://en.wikipedia.org/wiki/POSIX for details.
DCW03047USEN-01