Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS....
Transcript of Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS....
![Page 1: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/1.jpg)
Migrating the eMolecules
Chemical-Search Engine
to ChemAxon
Craig A. James
CTO, eMolecules, Inc.
![Page 2: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/2.jpg)
Search and Order Chemical Compounds
• 10M products:• 2M Building-Block
Compounds
• 8M Screening Compounds
• 600K Antibodies
• No “virtual” compounds
• Quality Data:• Supplier relationships
• Fast catalog updates
• Data normalization,
cleanup
• Data curation/QA
• Supplier “tiers”
What is eMolecules?
![Page 3: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/3.jpg)
About eMolecules: Searching
• Search
• Substructure
• Similarity
• Exact
• Refine
• Browse
• Filter
• Sort
• Save/Reopen
• Collaborate
• Import list
• Export list
![Page 4: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/4.jpg)
About eMolecules: Shopping Cart
• Accurate
• Up-to-date
prices
• Accurate lead
times
• Curated data
• Timely
• Suppliers’ on-
time delivery
monitored daily
![Page 5: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/5.jpg)
About eMolecules: Custom Web Sites
• 100 private web sites:• Pharma, biotech and
biology customers
• Customized: customer’s
preferred suppliers and
“skin”
• Secure: access
restrictions, verified
logins
![Page 6: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/6.jpg)
• Web Services allow
our partner to present
eMolecules structure
and catalog search
• JAGGAER (SciQuest)
Enterprise Reagent
Manager (ERM)
federates inventory
search with
eMolecules to simplify
sourcing and
management of
materials for research
About eMolecules: Partnering
JAGGAER ERM now offers structure drawing and
rendering powered by Marvin JS
![Page 7: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/7.jpg)
eMolecules’ Cheminformatics Engine
• eMolecules’ Fast-Search Engine
• Fast: most queries return in 0.1 to 2 seconds
• Stateless: answer query then forget user
◦ Each subsequent query (e.g. “page 2”) is a new
search.
• Efficient pagination: No overhead
◦ “Get rows 1-10 and stop”
◦ “Get rows 11-20 and stop” … without re-searching
for rows 1-10.
![Page 8: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/8.jpg)
eMolecules’ Cheminformatics Engine
• eMolecules’ Power-Search: Hitlists
• Long-running server-side searches (launched
by browser, run to completion)
• Progress reports
• Interrupt/Cancel
• Refinement (“Filtering”) to narrow results
• Multiple stored hitlists
• List operations (merge, filter, union,
intersection, etc.)
![Page 9: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/9.jpg)
2016: Replace Home-Grown Searching
• Why replace eMolecules’ home-grown
cheminformatics?
• Commercial products (e.g. ChemAxon) are
superior
• Available as “cartridge” in Postgres open-
source relational database
• Cost of maintaining compatibility with
OpenBabel open-source software project
![Page 10: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/10.jpg)
Relational Databases and the Web
• Web: Stateless, fast, lightweight• Lots of fast queries, pagination, load spikes
• Guest users with no stake in results
• Abusive robots, industrial theft, criminals
• RDB: Stateful, unpredictable, overhead• Never “instantaneous” like Google or Bing – queries can
take minutes, even longer
• Poor pagination
• No refinement of search results
• No progress report: Will my search be done in 10 seconds or 2 hours??
• Hard to interrupt/cancel a search
• Result: User disappointment
![Page 11: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/11.jpg)
Background: What RDBs do well
• “Does Enamine sell CAS# 5398-77-6?”
=> explain analyze select compound_id, cas_number from sample
where cas_number = '5398-77-6' and supplier_id = 681;
-> BitmapAnd
cost est: 555.00
actual time: 8.99
-> Bitmap Index Scan on i_sample_casno_8947
cost est: 0.00 to 5.77 rows=178
actual time: 0.023 rows=38
-> Bitmap Index Scan on i_sample_supplier_id_8947
cost est: 0.00 to 548.98 rows=20339
actual time: 8.96 rows=96626
Execution time: 13.056 ms
![Page 12: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/12.jpg)
Background: What RDBs do well
• RDBs handle this query really well
• Statistics on each column of numbers are
highly predictive
• Good prediction Good execution plan
• This doesn’t work for chemical queries
• Statistics on chemical columns are poor
predictors
• Poor prediction Poor execution plan
![Page 13: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/13.jpg)
Problem: Instant Results
• Searching millions of compounds
• Our test cases: “Find 1st page of results”
Selective Query: 3 hits“Stupid” Query: 1.3M hits
![Page 14: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/14.jpg)
Selective query works well
=> explain analyze select count(1) from registry.smiles_indexed
where smiles |>| 'CN1CCN(CC1)S(=O)(=O)c2ccc(OCC)c(C(=O)O)c2' limit 10;
Limit
cost est: 62247.90 to 62247.91
actual time: 178.05
-> Aggregate
cost est: 62247.90 to 62247.91
actual time: 178.04
-> Index Scan using i_smiles_indexed_smiles on smiles_indexed
cost est: 102.60 to 60230.04 rows=807144
actual time: 177.577 rows=3
![Page 15: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/15.jpg)
“Stupid” query is a disaster
=> explain analyze select count(1) from registry.smiles_indexed
where smiles |>| 'c1ccccc1' limit 10;
Limit
cost est: 62247.90 to 62247.91
actual time: 30787.81
-> Aggregate
cost est: 62247.90 to 62247.91
actual time: 30787.81
-> Index Scan using i_smiles_indexed_smiles on smiles_indexed
cost est: 102.60 to 60230.04 rows=807144
actual time: 24753.99 rows=1318621
![Page 16: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/16.jpg)
Problem: Instant Results
• Ordinary SQL strategy won’t work
• Optimizer predictions are wrong for chemistry
• Query times extremely variable
• Even if you give hints, restructure your SQL,
and get it to work, you’re still in trouble
◦ RDB optimizers change/improve with every
software release
◦ Query plans can change unexpectedly as DB grows
◦ What works today may fail tomorrow!
![Page 17: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/17.jpg)
Historical Note: eMolecules Technique
5398-77-6 MFCD00216485Dibenzofuran
Chemical
Query?
eMolecules
Chemical
Search
Postgres
Database
Yes
No
• How did eMolecules’ system handle this?
![Page 18: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/18.jpg)
Historical Note: eMolecules Technique
• The eMolecules Chemical Search Engine
• Not a general-purpose database
• No query planning
• Exactly one query: Find the next N rows
• Dynamic optimization: search speeds up as it
proceeds
• Execution state (“Where did the last search
stop?”) encoded in a few bytes that can be
embedded in a URL
![Page 19: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/19.jpg)
Solution: Postgres Partitioning
BB_molecules(parent table – empty)
BB_molecules_01(child table – 5%)
BB_molecules_20(child table – 5%)…
Query entire data set
(10 million compounds)
Query subsets
(500,000 compounds)
![Page 20: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/20.jpg)
Solution: Postgres Partitioning
• Benefits of table partitions:
• Initial web query applied to just one partition
• Only query full data set (all partitions) if user
is interested in full result set
• User requests for full result set are only a
small fraction of actual cases
![Page 21: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/21.jpg)
Partitioning Performance
• Advantage: “Stupid” query improved 95%
• Disadvantage: Selective queries slowed
• Very specific query: 3 hits• Search 2M: 85 msec
• Search 2M in 20 parts: 128 msec
• 43 msec slowdown (insignificant)
• Benzene query: 1st ten rows
from 1.3M hits:• Search full 2M: 28 seconds
• Search 5%: 1.0 seconds
• 30x speedup
[NO NAME]
CHEMW2 03241715442D
Created with ChemWriter - http://chemwriter.com
22 23 0 0 0 0 999 V2000
-7.7482 0.1596 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-7.0337 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.3192 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.3192 0.9846 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-7.0337 1.3972 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-7.7482 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.6047 1.3972 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
-4.8902 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.0172 2.1117 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-5.1922 2.1117 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-4.8902 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.1757 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.4612 0.1596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.4612 0.9846 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.1757 1.3972 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.7467 -0.2529 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-2.7467 -1.0779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.0322 -1.4905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.1757 -1.0779 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.8902 -1.4905 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-3.4612 -1.4905 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-8.4627 -0.2529 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
3 4 1 0 0 0 0
4 5 1 0 0 0 0
5 6 1 0 0 0 0
6 1 1 0 0 0 0
4 7 1 0 0 0 0
7 8 1 0 0 0 0
7 9 2 0 0 0 0
7 10 2 0 0 0 0
8 11 2 0 0 0 0
11 12 1 0 0 0 0
12 13 2 0 0 0 0
13 14 1 0 0 0 0
14 15 2 0 0 0 0
15 8 1 0 0 0 0
13 16 1 0 0 0 0
16 17 1 0 0 0 0
17 18 1 0 0 0 0
12 19 1 0 0 0 0
19 20 1 0 0 0 0
19 21 2 0 0 0 0
1 22 1 0 0 0 0
M END
![Page 22: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/22.jpg)
Power Search: eMolecules Hitlists
• Search results are stored in a “hitlist”
• Progress report during search (via partitioning)
• Persistent results (saved in database)
• Fast perusal of results (no further searching)
• Sorting, refining results
• Cancel a search
• A demo…
![Page 23: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/23.jpg)
Implementation: RESTful JSON API
• Database layer separated from application
Cheminformatics
App Server
Web
App
Server
HTTP HTTP
RESTful
JSON
API
Postgres:
Supplier’s
Catalog Data
Postgres:
ChemAxon
Chemical Search
![Page 24: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/24.jpg)
Summary
• RDBs: poorly suited to web expectations
• Cheminformatics makes it even worse
• Solution: partitioning and hitlists
• Lightweight queries◦ “Instant” response
◦ Efficient, stateless pagination
• Power queries◦ Long-running queries with progress reports
◦ Refine, sort, and merge results
◦ Interrupt/cancel features
![Page 25: Migrating the eMolecules Chemical-Search Engine to ChemAxon€¦ · rendering powered by Marvin JS. eMolecules’ Cheminformatics Engine •eMolecules’ Fast-Search Engine •Fast:](https://reader034.fdocuments.in/reader034/viewer/2022042409/5f24ddc1fabc30201958f1c7/html5/thumbnails/25.jpg)
RESULTS …
Result: A Web-capable Relational
Cheminformatics Database
Questions?
Craig A. James
CTO, eMolecules, Inc.