Using Automated Workflow Tools to Improve Wikipedia
MITCH MILLER
SCIENTIFIC THINKING
VERMONT CODE CAMP 2016
SEPTEMBER 17, 2016
Disclaimer
This talk represents my opinion and personal experience using software systems developed by third parties
The software systems shown are very complex and have hundreds of components. I have only worked with a small number.
Every task shown today can be accomplished in multiple ways. I’m only showing some of those ways.
Overview
Introduction: how are we improving Wikipedia? Why are we doing this?
The list of information we need to compile
The first method of generating the list
The second method of generating the list
The third method of generating the list
What chemistry does Wikipedia contain?
9,736 articles with the Chembox; 5,656 with the Drug box (15,392 total) [source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]
Chembox? Drug box? Templates of selected content within Wikipedia articles
Contents of Chembox:
  Molecular structure image
  Name (systematically assigned name + synonyms)
  Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG, PubChem, SMILES, UNII…
  Key properties
Chemical identifiers
Different specific databases
Individual IDs have strengths and weaknesses
The UNII is a non-proprietary, free, unique, unambiguous, non-semantic, alphanumeric identifier based on a substance’s composition and/or descriptive information.
  http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/
UNIIs contain 9 randomly generated alphanumeric characters with a tenth check alphanumeric character
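The ten-character shape described above can be checked mechanically. The sketch below only tests the shape: it assumes UNIIs are written in upper case, it does not reproduce the check-character algorithm, and the function name `looks_like_unii` and the sample strings are made up for illustration.

```python
import re

# Shape check only: ten characters drawn from A-Z and 0-9.
# Assumption: UNIIs are written in upper case. The check-character
# algorithm is NOT reproduced here, so a passing string is merely
# well-formed, not necessarily a registered UNII.
UNII_SHAPE = re.compile(r"[A-Z0-9]{10}")


def looks_like_unii(text: str) -> bool:
    """Return True if text has the ten-character alphanumeric UNII shape."""
    return UNII_SHAPE.fullmatch(text.strip()) is not None


# Made-up strings, for illustration only.
print(looks_like_unii("ABCDE12345"))
print(looks_like_unii("ABC-123"))
```

A filter like this is useful for catching truncated or mistyped values in a report before a subject matter expert reviews them.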
When two samples have the same UNII, “they represent the same molecular entity or elements upon which the definition is based.”
SRS group goal
Manages the Substance Registration System (SRS)
Assures uniformity of UNII assignments across internet resources that reference UNIIs
The assignment
Generate a report of all chemicals and drugs in Wikipedia
  Name, UNII (when present), CAS (when present), Wikipedia URL
Idea: subject matter experts will review list and correct assignments, add new UNIIs to Wikipedia as needed
Result: more accurate Wikipedia that links to the FDA’s Substance Registration System unambiguously https://fdasis.nlm.nih.gov/srs/srs.jsp
Development tool: KNIME
Graphical, component-based programming environment
  Drag functional components from palette onto canvas to create program
  Configure most components by setting parameters
  Connect components to route data from one to another
  Run and observe data traveling down the lines
KNIME stands for KoNstanz Information MinEr
  Pronounced “Nighm”
Originally a production of the University of Konstanz, Germany, 2004
Currently produced by KNIME.com AG, a company in Zurich, Switzerland
Free version available for download
  Windows, Linux, Mac
First method of report generation
Read list of pages with each infobox
  E.g., https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0
Retrieve each individual page mentioned in the list
Parse HTML
Use XPath to get Name, CAS, UNII
  The Infobox templates lead to pages with defined structure – straightforward to parse
Format data for output
Write to a file
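The talk built this pipeline in KNIME; as a rough stand-in, the parse step can be sketched in Python. This is a minimal sketch under stated assumptions: the `SAMPLE` markup is a hypothetical, heavily simplified infobox with made-up values, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath and requires well-formed markup) is used in place of a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an infobox table. Real Wikipedia
# markup is far larger and would need a forgiving HTML parser.
# The compound name and identifier values are made up.
SAMPLE = """
<table class="infobox">
  <caption>Examplium</caption>
  <tr><th>CAS Number</th><td>0000-00-0</td></tr>
  <tr><th>UNII</th><td>ABCDE12345</td></tr>
  <tr><th>Molar mass</th><td>1.0 g/mol</td></tr>
</table>
"""


def extract_fields(html: str, wanted=("CAS Number", "UNII")) -> dict:
    """Pull the name and the wanted rows out of one infobox table."""
    root = ET.fromstring(html.strip())
    record = {"Name": (root.findtext("caption") or "").strip()}
    # ".//tr" is an XPath-style path: every table row below the root.
    for row in root.findall(".//tr"):
        label = (row.findtext("th") or "").strip()
        if label in wanted:
            record[label] = (row.findtext("td") or "").strip()
    return record


record = extract_fields(SAMPLE)
print(record.get("Name"), record.get("CAS Number"), record.get("UNII"))
```

Fields that are absent from a page simply do not appear in the record, so the caller reads them with `record.get(...)` rather than assuming every identifier is present.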
First method: pluses/minuses
Plus: it works
Minus: had to run in batches to get all records
Minus: XPath parsing was more cumbersome than expected
Minus: misses some data
The Semantic Web
A connected set of data resources that can be understood by machines
Data encoded in a standard way that allows unattended processors to traverse links from one entity to another across organizational and geographic boundaries
[Standard WWW is a web of documents meant to be understood by humans]
Tim Berners-Lee has a great TED talk on the Semantic Web https://www.youtube.com/watch?v=OM6XIICm_qo
Understand Semantic Web in comparison to WWW
Compare pages on same subject: Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153
Technological foundations of Semantic Web
RDF (Resource Description Framework): organizing facts as Subject – Predicate – Object
Conceptual example:
  [Ethanol] [has a boiling point] [173 degrees Fahrenheit]
Coded example:
  wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .
Represented in Turtle (Terse RDF Triple Language)
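To make the Subject – Predicate – Object shape concrete, here is a minimal Python sketch that holds one fact as a three-part record and prints it as a Turtle-style statement. The `Triple` class and `to_turtle` helper are illustrative names, not part of any RDF library, and the object is always treated as a plain string literal.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style fact: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def to_turtle(t: Triple) -> str:
    """Render one triple as a Turtle statement with a string-literal object."""
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'{t.subject} {t.predicate} "{escaped}" .'


# The ethanol boiling-point fact from the slide.
fact = Triple("wd:Q153", "wdt:P2102", "173±1 degree Fahrenheit")
line = to_turtle(fact)
print(line)
```

Because every fact has the same three-part shape, a query language only needs to match patterns over triples, which is the idea SPARQL builds on.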
SPARQL
Query language for RDF data
SPARQL Protocol and RDF Query Language
Similar to SQL
Syntax based on the RDF triple
Wikidata
Conceptually: semantic web version of Wikipedia
  Add grain of salt
“Free and open knowledge base that can be read and edited by both humans and machines.”
Designed as ‘central storage’ for Wikipedia and other Wikimedia projects
Approximately: programmatic interface to Wikipedia
See https://query.wikidata.org/
  Run the example queries
Second method
Search Wikidata programmatically for chemical information
  Wikidata SPARQL interface
Format list
Write file
SPARQL for chemical and pharmaceutical compounds
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

# All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
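The talk ran this query from KNIME. As a hedged alternative, the sketch below shows how a shortened version of it could be sent to the public endpoint from Python and how the standard SPARQL JSON result format flattens into rows. The endpoint URL and the `query`/`format` parameters are those of the Wikidata Query Service; the function names, the `User-Agent` string, and the `LIMIT 10` are assumptions added for illustration.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Shortened form of the query above; LIMIT added to keep a trial run small.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?compound ?cas ?unii WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
} LIMIT 10
"""


def build_url(query: str) -> str:
    """Encode the query as a GET request that asks for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def rows_from_json(payload: dict) -> list:
    """Flatten SPARQL JSON results; unbound OPTIONAL variables become ''."""
    names = payload["head"]["vars"]
    return [
        {name: binding.get(name, {}).get("value", "") for name in names}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query: str) -> list:
    """Send the query over the network and return one dict per result row."""
    request = urllib.request.Request(
        build_url(query),
        # Hypothetical client name; public endpoints expect some User-Agent.
        headers={"User-Agent": "chem-report-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return rows_from_json(json.load(response))
```

Calling `run_query(QUERY)` performs the network request; `rows_from_json` can be exercised offline on a saved response, which is convenient when the full result set is large.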
Second method: pluses/minuses
Plus: fast and easy!
Plus: data arrives in a format we can use, with no parsing!
Minus: *some* Wikidata data does not match up with Wikipedia!
Third method
Hybrid approach
  Use Wikidata SPARQL query to get list of chemicals
  Query Wikipedia for individual items to compare values
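The comparison step of the hybrid approach might look roughly like the following Python sketch. It assumes each source has already been reduced to a dictionary of identifier values for one compound; the field names, the normalization (trim and upper-case), and the sample values are all assumptions for illustration, not the workflow used in the talk.

```python
def compare_records(wikidata: dict, wikipedia: dict, fields=("cas", "unii")) -> dict:
    """Return {field: (wikidata_value, wikipedia_value)} where the values differ."""
    differences = {}
    for field in fields:
        # Normalize lightly so whitespace or letter case alone is not flagged.
        left = (wikidata.get(field) or "").strip().upper()
        right = (wikipedia.get(field) or "").strip().upper()
        if left != right:
            differences[field] = (left, right)
    return differences


# Made-up values: the CAS numbers agree, the UNII is missing on one side.
from_wikidata = {"cas": "0000-00-0", "unii": "ABCDE12345"}
from_wikipedia = {"cas": "0000-00-0 ", "unii": None}
mismatches = compare_records(from_wikidata, from_wikipedia)
print(mismatches)
```

Only the fields that disagree are returned, so an empty result means the two sources match for that compound and the entry needs no expert review.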
Conclusion
Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with the required data
Subject matter experts are in the process of updating Wikipedia
Semantic web technology made the job easier!
Thank you!
References
Scholarly article on KNIME and Pipeline Pilot: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/
KNIME: www.knime.org
Wikipedia: https://en.wikipedia.org/wiki/Template:Chembox
Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox
Wikidata: https://query.wikidata.org
Who is your speaker?
Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience
Independent consultant: Scientific Thinking, LLC
[email protected]
Some recent projects
  Ongoing custodian of one chemical database implementation for ChemIDplus project within the National Library of Medicine
  Reporting systems
  Web service to link collaborative object management system to reporting system
  Import wizard for chemical array designer
  Merged a set of chemical databases and harmonized data