1 Tobias Kind FiehnLab at UC Davis Genome Center November 2006 Benchmarking JChem Oracle and Instant-JChem (and more) Free Academic Licenses for JChem and Instant JChem provided by

1

Tobias Kind FiehnLab at UC Davis Genome Center

November 2006

Benchmarking JChem Oracle and Instant-JChem (and more)

Free Academic Licenses for JChem and Instant JChem provided by ChemAxon

2

ChemAxon product suite

We have free academic licenses for all products

Source: Chemaxon.com

3

Metabolomics @ Fiehnlab - The science of the small molecules

Compound classes:
• sugars
• amino acids
• steroids
• fatty acids
• lipids
• phospholipids
• organic acids ...

Molecules under investigation (shown with ChemAxon Marvin)

3D model of a molecule with surface plot (shown with ChemAxon MarvinSpace)

Visit us @ fiehnlab.ucdavis.edu

4

Metabolomics is a truly emerging science

...tries to identify all small molecules (< 2000 Da) in all life forms in a comprehensive manner

Life Science Tree:
• Genomics (DNA)
• Transcriptomics (RNA)
• Proteomics (Proteins)
• Metabolomics (Small Molecules)

5

Techniques and tools

• Analytical techniques (LC-MS, GC-MS, NMR, IR)
• BioInformatics, Cheminformatics

Liquid Chromatography: LC-MS

Gas Chromatography: GC-TOF-MS

LTQ-FT-MS

BioInformatics and Cheminformatics: Statistics (Statistica Dataminer), Open Source + commercial software

6

We use cheminformatics tools for mass spectrometry based structure elucidation

See our BMC Bioinformatics paper: Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm; http://www.biomedcentral.com/1471-2105/7/234

No.  Formula         Mass
861  C61H22N4OP2S2   952.071
862  C61H22N4O3P2S   952.089
863  C61H23N4OP3S    952.081
864  C61H23N4O3P3    952.098
865  C61H24N4OP4     952.09
866  C61H29O4PS3     952.097
867  C61H30O2P2S3    952.088
868  C61H31P3S3      952.08
869  C61H31O2P3S2    952.098
870  C61H32P4S2      952.09
871  C61H119N6O      951.945
872  C61H123O4S      951.914
873  C61H123O6       951.932
874  C61H124O2PS     951.906
875  C61H124O4P      951.924
876  C61H125O2P2     951.915
877  C61H126P3       951.907
878  C61H127N2S2     951.944
879  C61H127N2O2S    951.962

Automatic Isotopic Pattern Filter

No.  Formula        Mass
1    C41H28O27      952.082
2    C41H28N8O20    952.142
3    C41H28N16O13   952.202
4    C41H28N24O6    952.262

Formula Generator

{fast approach but non-comprehensive}

[Figure: workflow diagram with a chemical structure drawing (one possible result). Input: mass spectrum (molecular ion, isotopic pattern, accurate mass), plotted as relative intensity (%) against m/z from 944 to 966, with labels at 952, 953, 954, 957 and 966. Further diagram labels: {exhaustive drill down}; yes / no; {slow approach, needs constraints, comprehensive}; molecular isomer generator; DBSearch.]

Formula: C41H28O27; one structural isomer shown; millions of isomers possible

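The formula tables on this slide can be cross-checked with a few lines of code. The following is a minimal sketch, not the FiehnLab formula generator: it sums standard monoisotopic masses for a neutral formula and expresses a mass difference in ppm. The function names and the restriction to C, H, N, O, P and S are assumptions made for illustration; electron mass and charge state are ignored.

```python
import re

# Monoisotopic masses (in u) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.973762,
    "S": 31.972071,
}


def monoisotopic_mass(formula: str) -> float:
    """Sum the monoisotopic masses for a simple formula such as 'C41H28O27'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def ppm_error(measured: float, theoretical: float) -> float:
    """Relative mass deviation in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


if __name__ == "__main__":
    for formula in ("C41H28O27", "C41H28N8O20", "C61H22N4OP2S2"):
        print(formula, round(monoisotopic_mass(formula), 3))
```

At m/z 952 a window of 1 ppm is only about 0.001 u wide, which is why the slide stresses that even this accuracy still leaves many candidate formulas.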

7

What are JChem and Instant-JChem?

JChem and Instant JChem are cheminformatics tools for handling small molecule structures together with substance data (logP, fingerprint, pKa, toxicity, meta-information) + searches + filter + web connections and more.

Difference: JChem = complex package and Instant-JChem = one single tool

Instant-JChem / JChem (picture: ChemAxon)

8

Benchmarking Instant-JChem and JChem Oracle (and more)

Myth 1: JChem+Oracle is faster than Instant-JChem+Apache Derby – Reality: let's see...

Myth 2: JAVA is slow – Reality: It's fast (70% of C++ speed).

Myth 3: Old Intel Netburst Xeons are slow – Reality: Yes.

Myth 4: Oracle is a hassle-free and handsome DB for beginners – Reality:

Myth 6: 2 CPUs are better than one – Reality: Yes.

Myth 7: Comparing apples with oranges (in Germany: pears) is unfair - c'mon...

Only the first myth is left.

9

A bit of Oracle Reality

Oracle works; lots of people invested lots of money (ORCL market cap = 92 billion dollars). It's good for large data (TByte). It's overkill for a small DB.

If you plan to install it on your production workstation (a big No No):
• It will eat 600-800 MB of your valuable RAM (for nothing, on WINXP 32-bit)
• It will create 15,049 files in 2,029 folders (for what?)
• It will create a lot of hassle with certain network setups (DHCP)
• RTFM (read the … manual) is no joke and you need to learn SQL (try the free Aqua Data Studio)
• Complete learning will take you 1-2 years, but gives you extreme flexibility

If you plan to install JChem + Oracle you need:
• JChem (includes cartridge for Oracle)
• Oracle
• Apache Tomcat
• 1-2 days of time (ChemAxon documentation is good, but too many things can go wrong with Oracle)

[Picture captions: Happy Oracle Ace, paid 10K for certificate; 1st time Oracle user]

10

A bit of Instant JChem Reality v1.0

A) Download

B) Install

C) It runs instantly

http://www.chemaxon.com/instantjchem/

• built-in Apache Derby DB
• JAVA engine included
• complete JChem included
• out-of-the-box tool
• can connect to other DBs

11

Importing Structures into Instant JChem

During import in Instant JChem only one CPU works. The fingerprint calculation is probably not multi-threaded. (Solution: work pool = make a pool for n CPUs)

A short import time is critical for user convenience, but not for long-term database projects.
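The "work pool for n CPUs" idea from this slide can be sketched as follows. This is an illustration of the pattern only, not Instant-JChem's import code: `toy_fingerprint` is a hypothetical stand-in for a real fingerprint calculation, and `fingerprint_all` is an invented helper name. Threads are used so the sketch runs anywhere; in CPython, CPU-bound work would need `ProcessPoolExecutor` (passed as `executor_cls`) to actually occupy several cores.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def toy_fingerprint(smiles: str) -> int:
    """Hypothetical stand-in: hash adjacent character pairs into a 16-bit mask."""
    bits = 0
    for a, b in zip(smiles, smiles[1:]):
        bits |= 1 << ((ord(a) * 31 + ord(b)) % 16)
    return bits


def fingerprint_all(smiles_list, workers=None, executor_cls=ThreadPoolExecutor):
    """Distribute the per-structure calculation over a pool of n workers."""
    workers = workers or os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        # Executor.map keeps the results in input order.
        return list(pool.map(toy_fingerprint, smiles_list))


if __name__ == "__main__":
    print(fingerprint_all(["CCO", "c1ccccc1", "Clc1ccccc1"], workers=2))
```

Each structure is independent of the others, which is what makes the import step a natural candidate for this kind of pool.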

12

Importing Structures into Instant-JChem: influence of JAVA hotspot compiler

The JAVA VM runs in two modes: with the client compiler and with the server compiler (directories under JRE). If you run any calculation-intensive programs, always use server mode; in a batch file call java -server XYZ

Server mode: good and fast

Client mode: bad and slow

13

Influence of JAVA hotspot compiler: Importing Structures into Instant-JChem

Import of 250k structures (NCI99.smi) into Instant-JChem: Server JVM is 20% faster!

Testsystem: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1 GByte/s transfer)

[Figure: bar chart "Import of 250k structures into Instant-JChem"; bars: JAVA server, JAVA client; y-axis: seconds (0 to 600); lower is better]

14

Influence of JAVA hotspot compiler with Instant-JChem

Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all 4632 results.

SMILES: NC1=CC=NC2=C1C=CC(Cl)=C2

Lipinski Rule of 5 (http://en.wikipedia.org/wiki/Lipinski's_Rule_of_Five):
(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10) (acceptor count for C and H)

JAVA server mode: 15 seconds (30% faster)
JAVA client mode: 21 seconds

If you want to speed up this query, you need to pre-calculate all descriptors and include them in the database.
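The Rule of 5 expression on this slide is a conjunction of four descriptor thresholds. A minimal sketch of the same test on already-computed descriptors is shown below; the `Descriptors` class and `passes_lipinski` are hypothetical names, and the descriptor values in the usage lines are made up for illustration rather than calculated by JChem.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mass: float      # molecular mass in Da
    logp: float      # octanol/water partition coefficient
    donors: int      # hydrogen-bond donor count
    acceptors: int   # hydrogen-bond acceptor count


def passes_lipinski(d: Descriptors) -> bool:
    """Same thresholds as the expression on the slide."""
    return d.mass <= 500 and d.logp <= 5 and d.donors <= 5 and d.acceptors <= 10


if __name__ == "__main__":
    small = Descriptors(mass=180.0, logp=1.5, donors=1, acceptors=2)    # made-up values
    large = Descriptors(mass=720.0, logp=6.2, donors=7, acceptors=14)   # made-up values
    print(passes_lipinski(small), passes_lipinski(large))
```

The expensive part in the benchmark is computing the descriptors (logP in particular) for every hit, not the comparison itself.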

15

Influence of number of CPUs with Instant-JChem

Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all 4632 results

Doing the Lipinski utilizes both CPU cores! Try Intel Quad! Try Opteron 8x!

                    2 CPUs       1 CPU
JAVA server mode:   15 seconds   33 seconds
JAVA client mode:   21 seconds   44 seconds

Testsystem: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)

16

Influence of number of CPUs with Instant-JChem

Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all 4632 results (on the fly)

Doing the Lipinski utilizes multiple CPU cores! However, a single logP calculation depends on CPU speed, not on the number of CPU cores.

Use AMD Opteron 8xCPU systems (or better). For cheaper setups use Intel Core 2 Quad (QX6700).

1 CPU (1x2.8 GHz)*   2 CPUs (2x2.8 GHz)*   8 CPUs** (2 GHz)
33 seconds           15 seconds            4 seconds

Testsystem*: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Testsystem**: 4 x Dual-Core Opteron 870 2.0 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB set for JAVA heap space

17

Influence of number of CPUs on complex calculations with Instant-JChem

Task: Search in 1000 compounds from PubChem-1000-demo and calculate on the fly:

Filter               Hits   1 CPU (1x2.8 GHz)*   2 CPUs (2x2.8 GHz)*   8 CPUs** (2 GHz)
Bioavailability      832    30 s                 17 s                  7.5 s
Ghose filter         255    14 s                 8 s                   4.4 s
Lead likeness        531    53 s                 25 s                  9.8 s
Lipinski rule of 5   776    15 s                 7.5 s                 4.7 s
Muegge filter        277    7.5 s                4.2 s                 3.4 s
Veber filter         774    1.7 s                1.5 s                 2.5 s

Testsystem*: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Testsystem**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB set for JAVA heap space

Take home message: The more complex the request – the more CPUs you need.

The lead likeness has 7 filters and reaches a 5-8 times speed-up with more CPUs.
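The speed-ups behind the take-home message can be recomputed from the timings on this slide. The sketch below simply divides the 1-CPU time by the 2-CPU and 8-CPU times; the dictionary and function names are invented for illustration. Note that the 8-CPU figures come from a different machine (2 GHz cores, 64-bit Linux), so that ratio mixes hardware and core-count effects.

```python
# Timings in seconds copied from the table on this slide:
# filter -> (1 CPU, 2 CPUs, 8 CPUs)
TIMINGS = {
    "Bioavailability": (30.0, 17.0, 7.5),
    "Ghose filter": (14.0, 8.0, 4.4),
    "Lead likeness": (53.0, 25.0, 9.8),
    "Lipinski rule of 5": (15.0, 7.5, 4.7),
    "Muegge filter": (7.5, 4.2, 3.4),
    "Veber filter": (1.7, 1.5, 2.5),
}


def speedups(t1: float, t2: float, t8: float):
    """Return (speed-up with 2 CPUs, speed-up with 8 CPUs) relative to 1 CPU."""
    return t1 / t2, t1 / t8


if __name__ == "__main__":
    for name, times in TIMINGS.items():
        s2, s8 = speedups(*times)
        print(f"{name:20s} 2 CPUs: {s2:.1f}x   8 CPUs: {s8:.1f}x")
```

The cheapest filter (Veber) gains nothing from the 8-CPU system, while the most expensive one (lead likeness) gains the most, which matches the slide's conclusion.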

18

Scaling complex calculations to larger DBs with Instant-JChem

Task: Now search in 250,000 compounds from NCI2000 and calculate on the fly:

Filter               Hits      Direct query   Calculation, 8 CPUs** (2 GHz)   Extrapolated time from 1000-compound DB   Obtained speed-up
Bioavailability      227,997   <1 s           380 s                           2055 s                                    5
Ghose filter         160,047   <1 s           230 s                           2762 s                                    12
Lead likeness        159,656   <1 s           1255 s                          2947 s                                    2
Lipinski rule of 5   199,821   <1 s           176 s                           1210 s                                    7
Muegge filter        145,234   <1 s           299 s                           1783 s                                    6
Veber filter         215,377   <1 s           20 s                            696 s                                     35

Testsystem**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB max set for JAVA heap space; 1.5 GByte JAVA heap space used.

Take home message: Do not extrapolate calculation times from different or smaller DBs. The speed-ups here are 2-35 times larger than expected.
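The "extrapolated time" column appears to be consistent with scaling the 8-CPU time from the 1000-compound run by the ratio of hit counts; that reading is inferred from the numbers and is not stated on the slide. A short sketch under that assumption (all names invented for illustration):

```python
# filter -> (hits, seconds) for the 1000-compound run on 8 CPUs (previous slide)
SMALL_DB = {
    "Bioavailability": (832, 7.5),
    "Ghose filter": (255, 4.4),
    "Lead likeness": (531, 9.8),
    "Lipinski rule of 5": (776, 4.7),
    "Muegge filter": (277, 3.4),
    "Veber filter": (774, 2.5),
}

# filter -> (hits, measured seconds) for the 250,000-compound run on 8 CPUs
LARGE_DB = {
    "Bioavailability": (227_997, 380.0),
    "Ghose filter": (160_047, 230.0),
    "Lead likeness": (159_656, 1255.0),
    "Lipinski rule of 5": (199_821, 176.0),
    "Muegge filter": (145_234, 299.0),
    "Veber filter": (215_377, 20.0),
}


def extrapolated_seconds(name: str) -> float:
    """Scale the small-DB time by the ratio of hit counts (assumed method)."""
    small_hits, small_seconds = SMALL_DB[name]
    large_hits, _ = LARGE_DB[name]
    return small_seconds * large_hits / small_hits


def obtained_speedup(name: str) -> float:
    """Extrapolated time divided by the time actually measured."""
    return extrapolated_seconds(name) / LARGE_DB[name][1]


if __name__ == "__main__":
    for name in SMALL_DB:
        print(f"{name:20s} {extrapolated_seconds(name):7.0f} s  {obtained_speedup(name):5.1f}x")
```

Under this assumption the rounded results reproduce both the extrapolated times and the speed-up column of the table.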

Pre-calculate values once and store them in the DB and query values later.
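Pre-calculating descriptors and querying them later can be sketched with any SQL database. The example below uses Python's built-in `sqlite3` purely as a stand-in for Derby or Oracle; the table name, column names and sample rows are assumptions made for illustration, and the descriptor values are invented rather than computed by JChem.

```python
import sqlite3


def build_descriptor_table(rows):
    """Store pre-calculated descriptors once; rows are (name, mass, logp, donors, acceptors)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE descriptors ("
        "name TEXT PRIMARY KEY, mass REAL, logp REAL, donors INTEGER, acceptors INTEGER)"
    )
    conn.executemany("INSERT INTO descriptors VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def lipinski_hits(conn):
    """The Rule of 5 becomes a plain column comparison; nothing is recalculated."""
    cursor = conn.execute(
        "SELECT name FROM descriptors "
        "WHERE mass <= 500 AND logp <= 5 AND donors <= 5 AND acceptors <= 10 "
        "ORDER BY name"
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    sample = [
        ("cmpd_a", 180.0, 1.5, 1, 2),    # made-up values
        ("cmpd_b", 720.0, 6.2, 7, 14),   # made-up values
    ]
    print(lipinski_hits(build_descriptor_table(sample)))
```

This is the pattern behind the "<1 s" direct-query column: the cost of the descriptor calculation is paid once at load time instead of on every search.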

19

Derby database file sizes for Instant-JChem+Apache Derby

Compounds only:
100k structures ~30 MByte
1 million structures ~300 MByte
10 million structures ~3 GByte
20 million structures ~6 GByte

If you have dual or quad cores, turn drive compression on. You can save almost 50% space; the speed overhead is low.
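The file sizes listed on this slide scale linearly at roughly 300 bytes per structure (30 MByte / 100,000 structures). A small sketch for estimating the size of other databases from that figure follows; the function name and the optional compression factor of 0.5 (taken from the "almost 50%" remark) are illustrative assumptions.

```python
BYTES_PER_STRUCTURE = 300  # derived from ~30 MByte per 100k structures on the slide


def estimate_derby_size_mb(n_structures: int, compression_factor: float = 1.0) -> float:
    """Rough Derby file size in MByte for a compounds-only database."""
    return n_structures * BYTES_PER_STRUCTURE * compression_factor / 1_000_000


if __name__ == "__main__":
    for n in (100_000, 1_000_000, 10_000_000, 20_000_000):
        print(f"{n:>10,d} structures: ~{estimate_derby_size_mb(n):,.0f} MByte")
    print("20 million with drive compression:", estimate_derby_size_mb(20_000_000, 0.5), "MByte")
```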

20

Instant-JChem on disk-based and RAMDisk-based systems

People who said the OS has efficient disk caching lied.

A large RAMDISK can speed up your system extremely.

A) If you have money – buy a Solid State Disk. RAMSAN-400; 128 GByte; price $252,720; 3,000 MB/s random sustained external throughput.

B) If you have some money – buy a RAID5 card. ARECA ARC-1120 for 8 HDs; price $500; 200-400 MB/s read and write access.

C) If you have little money – buy a RAMDISK and stuff in as much RAM as possible (take a 64-bit OS); 500-1000 MB/s read and write access.

...a normal hard drive has ~30-50 MB/s transfer rate

21

Instant-JChem on disk-based and RAMdisk-based systems

A) Heap memory max 800 MByte (OK)

Load 3 million compound DB from RAMDISK: 2 seconds
Load 3 million compound DB from RAID5 disk: 11 seconds (factor 5)

Search substructure from RAMDISK DB: instant (memory buffered)
Search substructure from RAID5 DB: instant (memory buffered)

B) Heap memory max 200 MByte (too low)

Load 3 million compound DB from RAMDISK: 19 seconds
Load 3 million compound DB from RAID5 disk: 25 seconds (factor 1.3)

Search substructure from RAMDISK DB: 22 seconds
Search substructure from RAID5 DB: 38 seconds (factor 1.7)

No heap memory: performance degradation. Everything must be read from disk; my RAID5 is already extremely fast, still the RAMDISK is even faster.

Take home message: give JAVA (JChem) as much heap memory as you can. For 3 million structures you need a minimum of 300 MByte heap space.

22

JChem+Oracle DB on Xeon vs. Instant-JChem+Apache Derby DB on Opteron (apples vs. oranges)

Task: Import and indexing of 3 million compounds (NCI2000 duplicated to 3 million)

3 GHz Dual Xeon with 2 GB system memory - JChem+Oracle DB = 5801 seconds (96 minutes)

2.8 GHz Dual Opteron with 2.88 GB memory - Instant-JChem+Apache Derby = 5333 seconds (88 minutes)

Source Xeon data: Oracle Cartridge Benchmark http://www.chemaxon.com/jchem/FAQ.html#benchmark3

Take home message: If you have a (modestly) modern computer, it can handle JChem and Instant-JChem, and a local database can be faster than a remote database.

23

Instant-JChem+Apache Derby DB on Socrates* vs. Instant-JChem+Apache Derby DB on Dual Opteron 2.8 GHz (WIN-XP)** vs. JChem+Oracle DB on Dual Xeon 3 GHz (W2003 Server)*** (more apples vs. oranges)

Task: Search for substructures in a 3 million compound database (NCI2000x12)

Query                                      # Hits    Instant-JChem+Derby*   Instant-JChem+Derby**   JChem+Oracle***
C1CN1c2cnnc3c(cncc23)C4=CSC=C4             0         0                      0                       0
O=C1ONC(N1c2ccccc2)c3ccccc3                204       0                      0                       0
[#6]-c1cc(-[#6])nc(NS(=O)(=O)c2ccccc2)n1   1224      0                      0                       0
c1ncc2ncnc2n1                              65,208    2 s                    7 s                     14 s
Clc1ccccc1                                 274,608   5 s                    15 s                    43 s
O=Cc1ccccc1                                443,580   9 s                    28 s                    85 s

Take home message: Instant-JChem is fast (nothing more).

Source: Instant-JChem (own system), JChem (ChemAxon website)

Socrates*: 4x Dual Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 4 GB set for JAVA

Opteron**: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)

Xeon***: Dual Intel Xeon 3 GHz; 2 GB memory; 160 GB IDE hard drive; Windows 2003 5.2; Oracle 9.2.0.7.0, DB buffer 1 GB; JAVA 1.5.0_06-b05; Apache Tomcat/5.5.12

24

A 20 million compound DB with Instant-JChem in a local Derby DB (WinXP-32bit)

• Import is heavily disk dependent
• several hundred million read/write operations to disk (JAVA writes in 4 KB chunks)
• JAVA heap space used during import is around 600 MByte
• import time is not linear anymore
• WIN XP 32-bit + NTFS desperately try to cache the 6 GByte database file, even if there is only 3 GByte memory maximum available (1 GByte max for cache)
• index creation (import smiles): 20 h (too long)
• open index for search: 1 min
• substructure search: > 1 min (too long)
• 20 million is currently too large for Instant-JChem v1.0; use JChem+Oracle (or MySQL, MS SQL)
• Aim: Full PubChem data (15-20 million) locally
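The remark that import time is no longer linear can be checked against the 3 million compound benchmark from the earlier slide (5333 seconds with Instant-JChem+Derby on the Dual Opteron). The sketch below assumes that both imports ran on the same WinXP machine and are otherwise comparable, which the slides suggest but do not state; the function name is invented.

```python
def linear_import_estimate_s(ref_seconds: float, ref_compounds: int, target_compounds: int) -> float:
    """Import time expected if the cost per compound stayed constant."""
    return ref_seconds * target_compounds / ref_compounds


# Reference: 3 million compounds imported and indexed in 5333 s (earlier slide).
estimate_s = linear_import_estimate_s(5333, 3_000_000, 20_000_000)
estimate_h = estimate_s / 3600

# Measured index creation for 20 million compounds: about 20 h (this slide).
slowdown = 20.0 / estimate_h

if __name__ == "__main__":
    print(f"linear estimate: {estimate_h:.1f} h, measured: 20 h, factor: {slowdown:.1f}")
```

A linear estimate gives just under 10 hours, so the measured 20 hours is roughly twice what constant per-compound cost would predict.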

25

Some general JAVA + JChem speed advice

1. Always use the server JVM (check the directories bin\client and bin\server); check the JAVA options in your batch or sh file: java -server -jar xyz.jar

2. Use 64-bit systems; the maximum JAVA heap space on 32-bit Linux or Windows is only 1.6 GByte (-Xmx1600m)

3. Use only multicore machines (AMD Opterons, Intel Quad)

4. Use the fastest disks you can buy (WD Raptor), or use RAID5 or RAID6 for large files (PubChem SDF data for 5 million compounds = 30 GByte)

5. Give Instant-JChem as much memory as you have; a minimum of 500 MByte for extreme speed (no wait time for searches)
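Points 1, 2 and 5 translate into two standard JVM launcher options, `-server` and `-Xmx`. The helper below is a hypothetical sketch that assembles such a command line; `jvm_command` is an invented name, `xyz.jar` is the placeholder used on the slide, and the 1600 MByte cap for 32-bit systems is the slide's figure rather than a constant defined by the JVM.

```python
MAX_HEAP_32BIT_MB = 1600  # limit quoted on the slide for 32-bit systems


def jvm_command(jar_path: str, heap_mb: int, is_32bit: bool = False) -> list:
    """Build a java command line that uses the server VM and a maximum heap size."""
    if is_32bit:
        heap_mb = min(heap_mb, MAX_HEAP_32BIT_MB)
    return ["java", "-server", f"-Xmx{heap_mb}m", "-jar", jar_path]


if __name__ == "__main__":
    # The resulting list could be passed to subprocess.run on a machine with Java installed.
    print(" ".join(jvm_command("xyz.jar", 800)))
    print(" ".join(jvm_command("xyz.jar", 4000, is_32bit=True)))
```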

26

Let’s not forget competitors

Many good systems exist:

MDL (ISIS Base), ACDLabs (ACD/ChemFolder Enterprise), Tripos (Sybyl+Auspyx), Molecular Networks (Carol), CDK and Taverna, Accelrys (Accord), Daylight (Thor and Merlin), CambridgeSoft (ChemOffice Enterprise), Molsoft (ICM+MolCart)

Why is ChemAxon better?

Two reasons:

• The programs work under WINDOWS and LINUX
• ChemAxon has the best and most responsive public forum: criticism is taken seriously, requested features are implemented ASAP, and there is a public response within 1-3 days. Why? Many commercial licensees. Remember: for academics it is all free.

27

Results and conclusion: JChem Oracle vs. Instant-JChem

1. Instant-JChem+Derby is as fast as or faster than JChem+Oracle for DBs < 3 million

2. If you want to have fun and results at your fingertips: Instant-JChem

3. If you want extreme flexibility and you know JAVA+SQL: JChem-Oracle

4. We are far away from handling billions of structures in a DB (with modest effort). We will handle such large numbers of structures file-stream based with cluster support.

5. Software producers (in general) need to put more effort into software development for multi-core CPUs + clusters under Windows and LINUX.