Incorporating InChI into a polymeric database

Incorporating InChI into a polymeric database Debra J. Audus State and Future of the IUPAC InChI August 16, 2017

Page 1

Incorporating InChI into a polymeric database

Debra J. Audus
State and Future of the IUPAC InChI
August 16, 2017

Page 2

Acknowledgements

Roselyne Tchoua, Kyle Chard, Logan Ward, Ian Foster

Jian Qin, Joshua Lequieu, Juan de Pablo

Computer Science

Molecular Engineering

Chemical Engineering

Page 3

Need for polymeric databases

Page 4

Existing resources

Paper-based

Web-based

Limited accessibility of the entire database and/or datasets that are too small

Page 5

Polymer Property Predictor and Database

Page 6

Flory-Huggins χ parameter

Publications: 376 articles from Macromolecules

R.B. Tchoua, J. Qin, D.J. Audus, et al. J. Chem. Educ., 93, 1561-1568 (2016)
R.B. Tchoua, K. Chard, D.J. Audus, et al. Procedia Comp. Sci., 80, 386-397 (2016)

Page 7

Flory-Huggins χ parameter

Workflow (diagram):
1. Information Extraction Module (input: Publications): automatically extract metadata (title, author, etc.)

R.B. Tchoua, J. Qin, D.J. Audus, et al. J. Chem. Educ., 93, 1561-1568 (2016)
R.B. Tchoua, K. Chard, D.J. Audus, et al. Procedia Comp. Sci., 80, 386-397 (2016)

Page 8

Flory-Huggins χ parameter

Workflow (diagram):
1. Information Extraction Module (input: Publications)
2. Data Parsing Crowdsourcing Module (output: Proposed χ Entries): undergrads review papers and enter χ into an online form

R.B. Tchoua, J. Qin, D.J. Audus, et al. J. Chem. Educ., 93, 1561-1568 (2016)
R.B. Tchoua, K. Chard, D.J. Audus, et al. Procedia Comp. Sci., 80, 386-397 (2016)

Page 9

Need for a polymer dictionary

Name                                                      Type      Abbreviation
poly(ethylene-alt-propylene)                              polymer   PEP
protonated poly(ethylene-alt-propylene)                   polymer   pPEP
Polybutadiene                                             polymer
polybutadiene                                             polymer   PB
polybutadiene                                             polymer   PBD
poly(butyl methacrylate)                                  polymer   PbMA
Poly(n-butyl methacrylate)                                polymer   PnBMA-115
poly(methacrylic acid)-b-poly(methyl methacrylate) (A)    polymer   PMAA-PMMA (A)
poly(methacrylic acid)-b-poly(methyl methacrylate) (C)    polymer   PMAA-PMMA (C)
styrene                                                   polymer

prefixes

capitalization

input errors

ambiguous
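The capitalization and duplicate-abbreviation problems flagged above can be illustrated with a small Python sketch. The rows are copied from the slide's table; the `normalize` rule (lowercase, collapse whitespace) and the function names are assumptions made for illustration, not the method used to build pppdb.

```python
# Rows copied from the slide: (name, abbreviation or None when the cell is empty).
ROWS = [
    ("poly(ethylene-alt-propylene)", "PEP"),
    ("Polybutadiene", None),
    ("polybutadiene", "PB"),
    ("polybutadiene", "PBD"),
    ("poly(butyl methacrylate)", "PbMA"),
    ("Poly(n-butyl methacrylate)", "PnBMA-115"),
    ("styrene", None),
]


def normalize(name: str) -> str:
    """Hypothetical normalization: lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def group_by_name(rows):
    """Map each normalized name to the set of abbreviations entered for it."""
    groups = {}
    for name, abbreviation in rows:
        abbreviations = groups.setdefault(normalize(name), set())
        if abbreviation:
            abbreviations.add(abbreviation)
    return groups


groups = group_by_name(ROWS)

# Names entered with more than one abbreviation need a curator's decision.
conflicts = {name: sorted(abbrs) for name, abbrs in groups.items() if len(abbrs) > 1}

# Names with no abbreviation at all (possibly incomplete or ambiguous entries).
missing = sorted(name for name, abbrs in groups.items() if not abbrs)

print(conflicts)
print(missing)
```

String normalization only catches the capitalization cases. It cannot tell that "styrene" is a monomer name entered as a polymer, or that two differently spelled names refer to the same structure, which is where a structure-based identifier comes in.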

Page 10

The need for InChI

Multiple and trade names, e.g. poly(2,6-dimethyl-1,4-phenylene oxide) and poly(xylenyl ether)

Identify synonyms

Broadness of CAS: 1800+ for polystyrene

Page 11

Need for a polymer dictionary

Name                                                      Type      Abbreviation
poly(ethylene-alt-propylene)                              polymer   PEP
protonated poly(ethylene-alt-propylene)                   polymer   pPEP
Polybutadiene                                             polymer
polybutadiene                                             polymer   PB
polybutadiene                                             polymer   PBD
poly(butyl methacrylate)                                  polymer   PbMA
Poly(n-butyl methacrylate)                                polymer   PnBMA-115
poly(methacrylic acid)-b-poly(methyl methacrylate) (A)    polymer   PMAA-PMMA (A)
poly(methacrylic acid)-b-poly(methyl methacrylate) (C)    polymer   PMAA-PMMA (C)
styrene                                                   polymer

prefixes

capitalization

input errors

ambiguous

Need something like PubChem for polymers

Page 12

Flory-Huggins χ parameter

Workflow (diagram):
1. Information Extraction Module (input: Publications)
2. Data Parsing Crowdsourcing Module (output: Proposed χ Entries)
3. Curated Polymer Dictionary Module (… poly(styrene), polyethylene, polyisoprene …)

R.B. Tchoua, J. Qin, D.J. Audus, et al. J. Chem. Educ., 93, 1561-1568 (2016)
R.B. Tchoua, K. Chard, D.J. Audus, et al. Procedia Comp. Sci., 80, 386-397 (2016)

Page 13

Developing the polymer dictionary

Name: poly(2-vinylpyridine)

Abbreviation: P2VP

Structure (saved as .mol file)

InChIKey and InChI:
KGIGUEBEKRSTEW-BBVYVPKKBA-N
1B/C7H7N/c1-2-7-5-3-4-6-8-7/h2-6H,1H2/z101-1-8(1.2)
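One way to picture a dictionary entry with the fields listed above is as a small record. The following Python sketch uses the P2VP values from the slide. The class name, field names, the `.mol` path, and the lookup helpers are hypothetical and are not the actual pppdb schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolymerEntry:
    """Hypothetical record for one curated dictionary entry."""
    name: str
    abbreviation: str
    mol_file: Optional[str]   # path to the saved .mol structure, if any
    inchikey: Optional[str]   # None for entries without an InChI
    inchi: Optional[str]


P2VP = PolymerEntry(
    name="poly(2-vinylpyridine)",
    abbreviation="P2VP",
    mol_file="structures/P2VP.mol",  # assumed path, for illustration only
    inchikey="KGIGUEBEKRSTEW-BBVYVPKKBA-N",
    inchi="1B/C7H7N/c1-2-7-5-3-4-6-8-7/h2-6H,1H2/z101-1-8(1.2)",
)


def build_index(entries):
    """Index entries by lowercased name and abbreviation.

    Lowercasing abbreviations is a simplification: abbreviations that differ
    only by case would collide in this index.
    """
    index = {}
    for entry in entries:
        index[entry.name.lower()] = entry
        index[entry.abbreviation.lower()] = entry
    return index


def same_polymer(index, first: str, second: str) -> bool:
    """Treat two strings as synonyms when both resolve to the same InChIKey."""
    a = index.get(first.lower())
    b = index.get(second.lower())
    if a is None or b is None or a.inchikey is None:
        return False
    return a.inchikey == b.inchikey


INDEX = build_index([P2VP])
print(same_polymer(INDEX, "P2VP", "Poly(2-vinylpyridine)"))
```

Keying the synonym check on the InChIKey rather than on the name string is the design choice the slide motivates: names and abbreviations vary, while the structure-derived identifier stays the same.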

Page 14

The polymer dictionary

88 entries

3 without InChI

Associated .mol files

Page 15

Flory-Huggins χ parameter

Workflow (diagram):
1. Information Extraction Module (input: Publications)
2. Data Parsing Crowdsourcing Module
3. Curated Polymer Dictionary Module (… poly(styrene), polyethylene, polyisoprene …)
4. Final Expert Review: final review and push to database (pppdb)

Data shown in the diagram: Proposed χ Entries, Confirmed χ Entries

R.B. Tchoua, J. Qin, D.J. Audus, et al. J. Chem. Educ., 93, 1561-1568 (2016)
R.B. Tchoua, K. Chard, D.J. Audus, et al. Procedia Comp. Sci., 80, 386-397 (2016)

Page 16

Glass transition temperature

Publications

R.B. Tchoua, K. Chard, D.J. Audus, et al. submitted to 13th IEEE International Conference on e-Science

6,090 articles from Macromolecules

Page 17

Glass transition temperature

Workflow (diagram):
1. Natural Language Processing Module (input: Publications): tries to find compound-Tg pairs automatically

R.B. Tchoua, K. Chard, D.J. Audus, et al. submitted to 13th IEEE International Conference on e-Science

Page 18

Glass transition temperature

Workflow (diagram):
1. Natural Language Processing Module (input: Publications)
2. NLP Polymer Dictionary Module (… PS 25, Polystyrene 25, poly(styrene) 25, PMMA 26 …)

R.B. Tchoua, K. Chard, D.J. Audus, et al. submitted to 13th IEEE International Conference on e-Science

Automatically create a dictionary of polymers (only names) using “P” and “poly”
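The slide gives only the cues (“P” and “poly”), not the rules built on them. As a minimal sketch, assuming a candidate is either a word starting with "poly" or a short capitalized abbreviation starting with "P", the idea could look like this in Python. Both regular expressions and the function name are illustrative assumptions, not the module's actual implementation.

```python
import re

# Assumption: "poly" followed either by a parenthesized group without spaces,
# e.g. poly(styrene), or by lowercase letters, e.g. polystyrene.
POLY_NAME = re.compile(r"\b[Pp]oly(?:\([^()\s]+\)|[a-z]+)")

# Assumption: abbreviations are "P" plus one to five capitals or digits, e.g. PS, PMMA, P2VP.
P_ABBREVIATION = re.compile(r"\bP[A-Z0-9]{1,5}\b")


def candidate_polymers(text: str):
    """Return the sorted set of strings that look like polymer names or abbreviations."""
    found = set(POLY_NAME.findall(text)) | set(P_ABBREVIATION.findall(text))
    return sorted(found)


sample = "Blends of polystyrene (PS) and poly(styrene) with PMMA were studied."
print(candidate_polymers(sample))
```

Rules this simple over-collect: words such as "polymers", family names such as "polyimides", and unrelated capitalized tokens starting with "P" would all be picked up, which is consistent with a dictionary that needs cleanup afterwards.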

Page 19

NLP Polymer Dictionary

Name
Polystyrene
poly(styrene)
polystyrene
polystyrenes
PS
PSS
polyimides
polyolefin
copolymer 10
poly(2,4’-BF-a)
macroporous poly(N-isopropylacrylamide) gel

various forms

not plural of PS

labels not names

family names

prefixes/suffixes

12,814 polymers in the dictionary

Work in progress: cleaning up the errors above and adding InChI

Page 20

Glass transition temperature

Workflow (diagram; the slide numbers the steps 1 to 6). Modules shown:
Natural Language Processing Module (input: Publications)
NLP Polymer Dictionary Module
Polymer Identification Module
Polymer Proximity Search Module
Prioritize Review Module
Resolve Label Crowdsourcing Module
Flag Bad Data Crowdsourcing Module
Final Expert Review (pppdb)

Data shown in the diagram: Tg Candidates, Compound-Tg Pairs, Solitary Tgs, Label-Tg Pairs, Proposed Polymer-Tg Pairs, Confirmed Polymer-Tg Pairs

Other labels in the diagram: Untrained, Resolve

Dictionary excerpt: … PS 25, Polystyrene 25, poly(styrene) 25, PMMA 26 …

Example sentence: The Tg of PEO peaks at a molecular weight of 6000 (Tg = −17 °C)….

R.B. Tchoua, K. Chard, D.J. Audus, et al. submitted to 13th IEEE International Conference on e-Science

Many other steps to final product!
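To make the idea of proposing a polymer-Tg pair concrete, here is a minimal Python sketch applied to the example sentence from the slide. It assumes a Tg mention has the form "Tg = <number> °C" and pairs it with the dictionary name that occurs closest to it in the sentence. The regular expression, the distance rule, and all function names are assumptions for illustration; the slide does not describe how the Polymer Proximity Search Module actually works.

```python
import re

# Assumed pattern: "Tg = <number> °C", accepting either a hyphen or the Unicode minus sign.
TG_VALUE = re.compile(r"Tg\s*=\s*([\u2212-]?\d+(?:\.\d+)?)\s*°\s*C")

EXAMPLE = "The Tg of PEO peaks at a molecular weight of 6000 (Tg = −17 °C)."


def find_tg_values(sentence: str):
    """Return (value_in_celsius, character_offset) for each Tg mention."""
    results = []
    for match in TG_VALUE.finditer(sentence):
        number = match.group(1).replace("\u2212", "-")  # normalize the minus sign
        results.append((float(number), match.start()))
    return results


def nearest_polymer(sentence: str, offset: int, dictionary):
    """Pick the dictionary name whose occurrence starts closest to the given offset."""
    best = None
    for name in dictionary:
        # Lookarounds keep "PS" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, sentence):
            distance = abs(match.start() - offset)
            if best is None or distance < best[0]:
                best = (distance, name)
    return best[1] if best else None


def propose_pairs(sentence: str, dictionary):
    """Propose (polymer, Tg) pairs; the polymer is None when no name is found."""
    return [
        (nearest_polymer(sentence, offset, dictionary), value)
        for value, offset in find_tg_values(sentence)
    ]


print(propose_pairs(EXAMPLE, ["PEO", "PS"]))
```

A pair produced this way is only a proposal. In the workflow on the slide, proposed pairs still pass through crowdsourced checks and a final expert review before they are confirmed.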

Page 21

The need for InChI

Multiple and trade names, e.g. poly(2,6-dimethyl-1,4-phenylene oxide) and poly(xylenyl ether)

Identify synonyms

Broadness of CAS: 1800+ for polystyrene

Input/output for machine learning (Natural Language Processing Module, step 1)

1B/C8H8O/c1-5-3-7-46(2)8(5)9-7/h3-4H,12H3/z101-1-9(7,9,8,9)

Page 22

Limitations of current InChI

Branching / crosslinks

Organometallic

Markush

Page 23

Conclusions and outlook

http://pppdb.uchicago.edu
263 χ, 258 Tg

Future work
• Add .mol files and InChI to pppdb
• Clean up the NLP polymer dictionary

Advances still needed for InChI
• Organometallics
• Branching / cross-links
• Markush

Page 24

Flory-Huggins χ parameter (workflow diagram, repeated from page 15)

Page 25

Glass transition temperature (workflow diagram, repeated from page 20)