Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Text mining to produce large chemistry datasets for community access
![Page 1: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/1.jpg)
Text-mining to produce large chemistry datasets for community access
Valery Tkachenko1, Aileen Day1, Daniel Lowe2, Igor Tetko3, Carlos Coba4, Antony Williams5
1 Royal Society of Chemistry, UK
2 NextMove Software, UK
3 HelmholtzZentrum München, Germany
4 Mestrelab Research, Santiago de Compostela, Spain
5 EPA, US
ACS Fall 2015, Boston, MA, August 17th 2015
![Page 2: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/2.jpg)
![Page 3: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/3.jpg)
ChemSpider
![Page 4: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/4.jpg)
Refs - we live in a linked world
![Page 5: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/5.jpg)
Properties
![Page 6: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/6.jpg)
ChemSpider spectra
![Page 7: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/7.jpg)
Knowledge systems
Datastore
Raw data
“Data in” process
“Data out” process
UI, API, Services, etc
![Page 8: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/8.jpg)
RSC Archive – since 1841
![Page 9: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/9.jpg)
Prospecting RSC articles
![Page 10: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/10.jpg)
Further work – properties and spectra mining
![Page 11: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/11.jpg)
Text mining of the chemical documents
| Term | Examples of text matched |
| --- | --- |
| FromLiterature | “lit.” |
| MeltingPoint | “mpt”, “melting point”, “m.p.” |
| Qualifier | “>”; “approximately” |
| Value | “75° C”, “200° F”, “one hundred degrees Celsius” |
| Range | “184-186° C”, “191.5 to 192.4° C” |
| MeasurementError | “50±° C” |
| OutcomeQualifier | “decomp.”, “with decomposition”, “subl.” |

Pattern: FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
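The pattern on this slide can be approximated with a single regular expression. The following is a minimal sketch, not the grammar used in the talk: the token patterns, the `parse_melting_point` name and the returned dictionary keys are all assumptions made for illustration, and spelled-out numbers such as “one hundred degrees Celsius” are not handled.

```python
import re

# Hypothetical, simplified token patterns (assumptions, not the talk's grammar).
NUM = r"\d+(?:\.\d+)?"
UNIT = r"°\s?[CF]"

PATTERN = re.compile(
    rf"(?P<lit>lit\.\s*)?"                                   # FromLiterature?
    rf"(?P<mp>mpt|melting point|m\.p\.)[:\s]*"               # MeltingPoint
    rf"(?P<qualifier>>|approximately\s+)?\s*"                # Qualifier?
    rf"(?:(?P<low>{NUM})\s*(?:-|to)\s*(?P<high>{NUM})"       # Range
    rf"|(?P<value>{NUM})(?:\s*±\s*(?P<error>{NUM}))?)"       # Value or MeasurementError
    rf"\s*(?P<unit>{UNIT})"
    rf"(?:\s*\(?(?P<outcome>decomp\.|with decomposition|subl\.)\)?)?",  # OutcomeQualifier?
    re.IGNORECASE,
)


def parse_melting_point(text):
    """Return a dict describing the first melting point found, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    g = m.groupdict()
    is_range = g["low"] is not None
    low = float(g["low"] if is_range else g["value"])
    high = float(g["high"]) if is_range else low
    return {
        "from_literature": g["lit"] is not None,
        "qualifier": g["qualifier"].strip() if g["qualifier"] else None,
        "low": low,
        "high": high,
        "error": float(g["error"]) if g["error"] else None,
        "unit": g["unit"][-1].upper(),  # "C" or "F"
        "outcome": g["outcome"],
    }


range_example = parse_melting_point("m.p. 184-186° C (decomp.)")
literature_example = parse_melting_point("lit. m.p. 75° C")
qualified_example = parse_melting_point("melting point: approximately 200° F")
```

Each optional element of the pattern maps to an optional group, so a missing qualifier or outcome simply comes back as `None`.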
![Page 12: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/12.jpg)
Why MP?
Used for water solubility prediction
Yalkowsky equation:
logS = 0.5 – 0.01(MP-25) – log Kow
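As a worked illustration, the equation above can be evaluated directly. This is a minimal sketch that applies the formula exactly as written on the slide; the function name `log_solubility` and the example inputs are assumptions chosen for illustration, not values from the talk.

```python
def log_solubility(mp_celsius, log_kow):
    """logS = 0.5 - 0.01 * (MP - 25) - log Kow, as written on the slide."""
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_kow


# Illustrative inputs only: MP = 125 °C, log Kow = 2.0
example_log_s = log_solubility(125.0, 2.0)

# Effect of a 32 °C melting point error on the predicted logS
shift_for_32_degrees = log_solubility(125.0, 2.0) - log_solubility(157.0, 2.0)
```

Because the melting point enters with a coefficient of 0.01, an error of 32 °C in MP shifts the predicted logS by 0.32 log units.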
![Page 13: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/13.jpg)
Detecting suspicious melting points
• Value was greater than 500° C
• Value was a range wider than 50° C
• Value was a range where the second temperature was lower than the first temperature
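The three checks listed above translate into a short filter. This is a minimal sketch under stated assumptions: values are already converted to °C, a single value is passed as `low` with `high=None`, and for a range the 500 °C check is applied to the higher bound. The function name and reason strings are hypothetical.

```python
def suspicious_reasons(low, high=None):
    """Return the list of reasons a melting point (in °C) looks suspicious."""
    reasons = []
    top = low if high is None else max(low, high)
    if top > 500:
        reasons.append("value greater than 500 C")
    if high is not None:
        if high < low:
            reasons.append("second temperature lower than first")
        elif high - low > 50:
            reasons.append("range wider than 50 C")
    return reasons


ok_range = suspicious_reasons(184, 186)
too_hot = suspicious_reasons(520)
too_wide = suspicious_reasons(100, 200)
inverted = suspicious_reasons(186, 184)
```

Returning the reasons rather than a bare boolean keeps flagged records easy to review by hand.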
![Page 14: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/14.jpg)
300k Melting Point Datasets
| Dataset | Melting points |
| --- | --- |
| Bergström | 277 |
| Bradley | 2886 |
| OCHEM | 22404 |
| Enamine | 21883 |
| Patents | 228079 |

Tetko et al., J. Chemoinformatics, in preparation
![Page 15: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/15.jpg)
Melting point model: data distribution
![Page 16: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/16.jpg)
Some modeling highlights
• LibSVM grid search was used to select parameters (ca. 1.5 years of CPU time for optimization)
• Largest model: 668k descriptors (MolPrint), ~0.2 trillion entries
• Biggest model: 618Mb (Dragon descriptors)
• Most accurate model: consensus, average of 5 models
• RMSE < 32°C for the drug-like region, MP [50, 250]°C
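The consensus model and the region-restricted RMSE can be sketched in a few lines. This is an assumption-laden illustration rather than the authors' code: the function names are hypothetical, the consensus is taken as a plain unweighted mean, and the toy numbers below are made up and are not results from the talk.

```python
import math


def consensus(predictions_per_model):
    """Average the predictions of several models, compound by compound."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def rmse_in_range(observed, predicted, low=50.0, high=250.0):
    """RMSE over compounds whose observed MP lies in [low, high] °C."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if low <= o <= high]
    if not pairs:
        raise ValueError("no compounds in the requested MP range")
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))


# Toy data for illustration only (two models, three compounds)
toy_predictions = [[100.0, 200.0, 300.0], [110.0, 190.0, 310.0]]
toy_observed = [100.0, 200.0, 400.0]

toy_consensus = consensus(toy_predictions)
toy_rmse = rmse_in_range(toy_observed, toy_consensus)
```

The third toy compound (observed 400 °C) falls outside [50, 250] °C and is therefore excluded from the RMSE.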
![Page 17: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/17.jpg)
Prediction error
![Page 18: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/18.jpg)
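NMR data
• Extracted from 1976-2014 USPTO applications

| Nucleus | Count |
| --- | --- |
| H | 975543 |
| C | 56536 |
| unknown* | 44306 |
| F | 9429 |
| P | 3241 |
| B | 91 |
| Si | 62 |
| Sn | 22 |
| Se | 11 |
| N | 8 |

*unknown – starts off with “NMR:” peak list (no nucleus)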
![Page 19: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/19.jpg)
NMR text mining
• We can find and index text spectra:
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
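To show what indexing a text spectrum might involve, here is a minimal sketch that pulls the nucleus, solvent, frequency and peak list out of the 13C example above. The regular expressions and the `parse_text_spectrum` name are assumptions for illustration. The sketch only handles the "(solvent, frequency MHz)" header order and simple annotations without nested parentheses, so 1H spectra with coupling constants would need a richer parser.

```python
import re

# Hypothetical patterns: "<nucleus> NMR (<solvent>, <frequency> MHz)"
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?)[ -]NMR\s*"
    r"\((?P<solvent>[^,()]+),\s*(?P<mhz>\d+(?:\.\d+)?)\s*MHz\)"
)
# A chemical shift, optionally followed by a parenthesised annotation
PEAK = re.compile(r"(?P<shift>\d+\.\d+)(?:\s*\((?P<note>[^()]*)\))?")


def parse_text_spectrum(text):
    """Return nucleus, solvent, frequency and (shift, annotation) peaks."""
    header = HEADER.search(text)
    if header is None:
        return None
    peaks = [
        (float(m.group("shift")), m.group("note"))
        for m in PEAK.finditer(text, header.end())
    ]
    return {
        "nucleus": header.group("nucleus"),
        "solvent": header.group("solvent").strip(),
        "mhz": float(header.group("mhz")),
        "peaks": peaks,
    }


SPECTRUM = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)
parsed = parse_text_spectrum(SPECTRUM)
```

Peaks without an annotation come back with `None` as their second element, which is the property the later suspicious-spectrum checks rely on.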
![Page 20: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/20.jpg)
NMR extracted by year of publication
[Figure: cumulative distinct NMR extracted (0 to 2,500,000) versus year of publication (1976-2014); series: USPTO grants, USPTO applications]
![Page 21: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/21.jpg)
NMR solvents
[Pie chart. Shares: 48.5%, 38.3%, 8.7%, 1.1%, 1.0%, 1.0%, 1.4%. Legend: CDCl3, DMSO-d6, CD3OD, D2O, Acetone-d6, MeOD, Others]
Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4
![Page 22: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/22.jpg)
1H-NMR frequency over time
[Figure: 1H-NMR frequency (0 to 450 MHz) versus year of patent filing (1976-2014)]
![Page 23: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/23.jpg)
Mestrelab Mnova NMR
![Page 24: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/24.jpg)
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
![Page 25: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/25.jpg)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
![Page 26: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/26.jpg)
Detecting suspicious NMR spectra
• Last peak of NMR spectra is unannotated and:
– All other peaks are annotated
– Spectrum has 1 peak and is proton or unknown NMR
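The rule on this slide can be sketched as a small predicate. The reading used here is an assumption: the two sub-bullets are treated as alternative cases, so a spectrum is flagged when its last peak is unannotated and either every other peak is annotated or it is a single-peak proton or unknown spectrum. The function name, the `"1H"`/`"unknown"` nucleus labels and the `(shift, annotation)` peak representation are hypothetical.

```python
def is_suspicious_spectrum(nucleus, peaks):
    """peaks is a list of (shift_text, annotation_or_None) tuples."""
    if not peaks or peaks[-1][1] is not None:
        return False  # rule only fires when the last peak is unannotated
    others = peaks[:-1]
    if others:
        # Likely truncation: everything before the last peak is annotated
        return all(note is not None for _, note in others)
    # Single unannotated peak: only flagged for proton or unknown NMR
    return nucleus in ("1H", "unknown")


truncated = is_suspicious_spectrum(
    "1H", [("11.8-10.8", "brs, 1H"), ("7.78", None)]
)
carbon_list = is_suspicious_spectrum(
    "13C", [("14.12", "CH3"), ("117.72", None), ("154.88", "ArC")]
)
single_carbon_peak = is_suspicious_spectrum("13C", [("77.0", None)])
single_proton_peak = is_suspicious_spectrum("1H", [("7.26", None)])
```

A spectrum that ends in a bare shift after fully annotated peaks is typically a sign that the text was cut off mid-list.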
![Page 27: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/27.jpg)
[Structure drawing; atom labels: O, O, OH, Br]
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78
Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.
![Page 28: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/28.jpg)
[Structure drawing; atom labels: H2N, NH2, O, O]
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97
Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.
![Page 29: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/29.jpg)
Knowledge systems
Datastore
Raw data
“Data in” process
“Data out” process
UI, API, Services, etc
![Page 30: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/30.jpg)
Synthetic chemistry article
• Compounds
• Reaction
• Analytical Data
• Text and References
![Page 31: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/31.jpg)
RSC Databases
• RSC Compounds
• RSC Reactions
• RSC Spectra
• RSC Crystals
• RSC Polymers
• RSC Materials
• RSC Assays
• RSC Algorithms
• RSC Models
• …and on…
![Page 32: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/32.jpg)
Input pipeline
[Diagram: Deposition Gateway, taking raw data in and producing validated data]
• Input channels: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc; ELNs, templated data input; API, FTP, etc
• Modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module, … Module
• Staging databases: Compounds, Reactions, Spectra, Crystals, Materials
• Other diagram labels: Documents, Experiments, Research, etc
All databases are sliced by data sources/data collections and have a simple security model in which each data slice/source is private, public or embargoed.
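The slice-level security model mentioned on this slide (each data slice/source is private, public or embargoed) could be represented roughly as follows. This is a sketch under loud assumptions: the slide does not define who may read a private slice or how an embargo ends, so the `owner` field, the `embargo_until` date and the `can_read` rule are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class DataSlice:
    source: str                            # data source / data collection name
    owner: str                             # hypothetical: depositor of the slice
    visibility: Visibility
    embargo_until: Optional[date] = None   # hypothetical embargo end date


def can_read(data_slice, user, today):
    """Assumed rule: owners always read; embargoes lift on a fixed date."""
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if user == data_slice.owner:
        return True
    if (data_slice.visibility is Visibility.EMBARGOED
            and data_slice.embargo_until is not None):
        return today >= data_slice.embargo_until
    return False


embargoed = DataSlice("patent-nmr", "alice", Visibility.EMBARGOED, date(2016, 1, 1))
private = DataSlice("lab-notebook", "alice", Visibility.PRIVATE)
```

Attaching visibility to the slice rather than to individual records keeps the check to one lookup per data source.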
![Page 33: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/33.jpg)
Output pipeline
• Data layer: Compounds, Reactions, Spectra, Crystals, Documents
• Data access layer: Compounds API, Reactions API, Spectra API, Crystals API, Documents API
• User interface widgets layer: Compounds Widgets, Reactions Widgets, Spectra Widgets, Crystals Widgets, Documents Widgets
• User interface layer (examples): Analytical Laboratory application; Electronic Laboratory Notebook; Paid 3rd party integrations (various platforms – SharePoint, Google, etc); Chemical Inventory application; ChemSpider 2.0
![Page 34: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/34.jpg)
Cross-database links
![Page 35: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/35.jpg)
Compounds domain
![Page 36: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/36.jpg)
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
![Page 37: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/37.jpg)
Chemistry Validation and Standardization Platform
![Page 38: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/38.jpg)
Reactions domain
![Page 39: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/39.jpg)
Reactions domain
![Page 40: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/40.jpg)
Analytical data domain
![Page 41: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/41.jpg)
Crystallography domain
![Page 42: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/42.jpg)
3D printable structures
![Page 43: Text mining to produce large chemistry datasets for community access](https://reader035.fdocuments.in/reader035/viewer/2022070512/588aae571a28ab4c308b6b49/html5/thumbnails/43.jpg)
New Repository Architecture
doi: 10.1007/s10822-014-9784-5