python xml web - Rostlab
BioinfRes SoSe 17
Bioinformatics Resources: XML / Web Access
Lecture & Exercises: Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12
XML Infusion (in 10 sec)
● compilation from http://www.w3schools.com/xml/default.asp
● XML is a software- and hardware-independent tool to store and transport data
● XML stands for eXtensible Markup Language
● designed to store and transport data
● designed to be self-descriptive
● W3C recommendation
● it does NOT DO anything
About Tags
● XML tags are not predefined like HTML tags
● everybody can (and has to) invent their own tags
● new tags can be added at any time
● the author has to define the content and structure of the document
● everything is plain text
Document Structure
    <?xml version="1.0" encoding="UTF-8"?>
    <bookstore>

      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>

      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>

      ....
    </bookstore>
taken from http://www.w3schools.com/xml/xml_usedfor.asp
Syntax Rules
● elements are defined using tags: <tagName> ... </tagName> or <tagName/>
● elements can be nested (contain other elements: parent and child nodes, sibling nodes)
● elements can have text content
● each document must contain ONE root element that is the parent of all other elements
Syntax Refined
● the prolog line <?xml ... ?> is optional
● tags must be (self-)closed
● tags are case sensitive
● tags must be properly nested:
    <a><b>....</a></b>   Wrong!
    <a><b>....</b></a>   Right!
Syntax Refined
● tags may have attributes
● attribute values must always be quoted
● some special characters cannot be used directly
● -> coded by entity references:
    &lt;     <    less than
    &gt;     >    greater than
    &amp;    &    ampersand
    &apos;   '    apostrophe
    &quot;   "    quotation mark
● comments: <!-- .... -->
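The entity references above can be produced and reversed with the standard library. This is a minimal sketch, assuming the helpers in xml.sax.saxutils are acceptable for the task; the sample string is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Made-up sample text containing all five special characters.
raw = 'if a < b & b > c: say "hi", don\'t wait'

# escape() replaces &, < and > by default; the quote characters
# need an explicit extra mapping.
quote_entities = {'"': "&quot;", "'": "&apos;"}
encoded = escape(raw, quote_entities)

# unescape() reverses &lt;, &gt; and &amp; by default; the quote
# entities need the reverse mapping.
decoded = unescape(encoded, {"&quot;": '"', "&apos;": "'"})

print(encoded)
print(decoded == raw)
```

When documents are built with xml.etree.ElementTree, this escaping is applied automatically on output, so explicit calls are mainly needed when XML is assembled by hand.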
Tag Names
● case sensitive
● must start with a letter or underscore
● must not start with the letters xml in any case (xml, XML, Xml, ...)
● can contain: letters, digits, hyphens, underscores and periods
● cannot contain spaces
● apply common sense and a consistent style
● avoid: minus (-), period (.), colon (:), non-English characters for compatibility reasons
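The naming rules above can be approximated with a small check. This is only a sketch: the function name is_acceptable_tag_name is hypothetical, and the pattern is deliberately restricted to ASCII letters, which is stricter than what the XML specification permits.

```python
import re

# Simplified, ASCII-only pattern: a letter or underscore first, then
# letters, digits, hyphens, underscores or periods. No spaces allowed.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_acceptable_tag_name(name):
    """Return True if name follows the simplified rules listed above."""
    if not _NAME_RE.fullmatch(name):
        return False
    # Names must not start with "xml" in any capitalisation.
    return not name.lower().startswith("xml")


for candidate in ["book", "_id", "first-name", "2nd", "my tag", "XmlData"]:
    print(candidate, is_acceptable_tag_name(candidate))
```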
XML Element
● everything between the start and the end tag
● the tags are included
● can contain:
  - text
  - attributes
  - other elements
  - a mix of all
● are extensible
XML Attributes
● values must be quoted: single or double quotes
● the unused quoting character can be used inside the value
● whether to use an attribute or an element is not settled, but:
  - attributes cannot contain multiple values
  - attributes cannot contain tree structures
  - attributes are not easily expandable
● useful to store metadata, like an element id, etc.
A Glimpse of Namespaces
● prevent tag name collisions between different authors/applications/domains
● implemented by the introduction of prefixes
● defined as an attribute: xmlns:prefix="URI"
● usage: <prefix:tagName>
● the URI only needs to be unique (it does not have to point to an actual resource)
● used to integrate other specifications, e.g. XSLT
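As an illustration of how prefixes keep equally named tags apart, the following sketch reads a document with two different table elements using xml.etree.ElementTree. The namespace URIs and the prefixes h and f are made up for this example.

```python
import xml.etree.ElementTree as ET

# Two "table" elements from different (made-up) namespaces.
doc = """<root xmlns:h="http://example.org/html"
      xmlns:f="http://example.org/furniture">
  <h:table><h:td>Apples</h:td></h:table>
  <f:table><f:name>Coffee table</f:name></f:table>
</root>"""

root = ET.fromstring(doc)

# The prefix-to-URI mapping used for lookups; the prefixes here only
# need to match this dictionary, not the ones used in the document.
ns = {"h": "http://example.org/html", "f": "http://example.org/furniture"}

html_table = root.find("h:table", ns)
furniture_table = root.find("f:table", ns)

# ElementTree stores the expanded name as {URI}localname.
print(html_table.tag)
print(furniture_table.find("f:name", ns).text)
```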
Levels of Correctness
● well-formed: a document obeys the syntax rules:
  - root element
  - closing tags
  - case sensitive
  - properly nested
  - attribute values quoted
● valid documents: in addition to being well-formed they also conform to a document type definition (format specification)
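Well-formedness can be tested by simply trying to parse a document. The sketch below uses xml.etree.ElementTree; the helper name is_well_formed and the sample strings are made up. Note that this parser does not check validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses, i.e. obeys the syntax rules."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "good": "<a><b>text</b></a>",
    "bad nesting": "<a><b>text</a></b>",
    "case mismatch": "<a>text</A>",
    "unquoted attribute": "<a id=1/>",
    "two root elements": "<a/><b/>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```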
Document Type Definitions
● two ways to specify a document structure:
  - DTD: Document Type Definition
  - XML Schema: XML-based alternative to DTD
Example
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE note SYSTEM "Note.dtd">
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend! &copyright; </body>
    </note>
Example
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
      <!ENTITY copyright "Copyright by ..">
    ]>
XML DTD
● referenced from a document with: <!DOCTYPE note SYSTEM "Note.dtd">
● !DOCTYPE defines the root element
● !ELEMENT defines the structure of the elements
● #PCDATA means parsed character data (text that the parser will process)
● !ENTITY defines special characters or strings
XML Schema
● alternative to DTD
    <xs:element name="note">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="to" type="xs:string"/>
          <xs:element name="from" type="xs:string"/>
          <xs:element name="heading" type="xs:string"/>
          <xs:element name="body" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
● support of data types and namespaces
● written in XML and extensible
XML with Python
● basic: xml.etree.ElementTree
    import xml.etree.ElementTree as ET
● if you parse XML from untrusted sources, use safeguarded libraries like defusedxml
● read XML input from a file into a tree: parse("fileName")
● read XML input from a string: fromstring(data_as_string) (returns the root element directly)
● retrieve the root element from a tree: getroot()
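The two ways of reading XML listed above can be sketched as follows. The document content is a shortened version of the bookstore example, and the file name books.xml is made up; a temporary directory is used so the sketch does not leave files behind.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

data = """<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title></book>
  <book category="children"><title lang="en">Harry Potter</title></book>
</bookstore>"""

# Variant 1: parse() reads from a file and returns an ElementTree;
# getroot() then gives the root element.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(data)
    tree = ET.parse(path)
root_from_file = tree.getroot()

# Variant 2: fromstring() parses a string and returns the root
# element directly, so no getroot() call is needed.
root_from_string = ET.fromstring(data)

print(root_from_file.tag, len(root_from_file))
print(root_from_string.tag, len(root_from_string))
```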
XML with Python
● element type: tag (string)
● element attributes: attrib (dictionary). Prefer access via e.get(key, default=None), e.items(), e.keys(), e.set(key, value)
● element (text) content: text (string)
● child elements: simply iterate
● if you know the structure you can access children via index notation: root[0][1]
● select nodes of a certain type: root.iter('typeName') (searches the whole subtree, not only direct children)
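The access patterns above, applied to a shortened version of the bookstore example, might look like this. The attribute name isbn is made up to show the default value of get().

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <year>2005</year>
  </book>
</bookstore>""")

# Iterating over an element yields its direct children.
for book in root:
    # get() returns the default instead of raising for a missing key.
    print(book.tag, book.get("category"), book.get("isbn", "n/a"))

# Index notation: first book, second child (the year element).
first_year = root[0][1].text

# iter() walks the whole subtree and yields every matching element.
titles = [title.text for title in root.iter("title")]

print(first_year)
print(titles)
```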
XML with Python
● insertion of new elements:
  - create a new element with: Element('typeName')
  - add content via text and attributes via attrib
  - add existing sub-elements with append()
  - append the new element to the parent element
● simple output: treeElement.write('fileName') (treeElement being the ElementTree object)
● for validation and full XPath support use: lxml
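The insertion steps above can be sketched as follows. To keep the example free of side effects, the output is written to an in-memory buffer instead of a file; passing a file name to write() works the same way.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("bookstore")

# Create a new element, then set its attributes and text content.
book = ET.Element("book")
book.set("category", "cooking")

title = ET.Element("title")
title.set("lang", "en")
title.text = "Everyday Italian"

# Attach the sub-element first, then the new element to its parent.
book.append(title)
root.append(book)

# write() is a method of ElementTree, so wrap the root element.
tree = ET.ElementTree(root)
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()

print(xml_bytes.decode("utf-8"))
```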
Web Access
● URI (Uniform Resource Identifier): a string that identifies a resource
● a URL is a specific type of URI that includes the location of a resource
● scheme://lo.ca.ti.on/pa/th?qu_er#fragment
● scheme: protocol in use
● location: the hosting server
● path: path to the resource
● query + fragment: optional specifications
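The components listed above can be inspected with urllib.parse from the standard library. This is a small sketch; the URL is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Made-up URL containing every component.
url = "https://www.example.org:8080/some/path?term=xml&lang=en#section2"

parts = urlsplit(url)

print(parts.scheme)    # protocol in use
print(parts.netloc)    # location: host and optional port
print(parts.path)      # path to the resource
print(parts.query)     # raw query string
print(parts.fragment)  # fragment

# parse_qs() turns the query string into a dict of value lists.
query = parse_qs(parts.query)
print(query)
```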
Web Access
● URL manipulation:
  - urljoin(base_url_string, relative_url_string)
        from urllib import parse as urlparse
        urlparse.urljoin('http://somehost.com/some/path/here', '../other/path')
        # Result is: 'http://somehost.com/some/other/path'
  - urlsplit(url_string, scheme='', allow_fragments=True)
        urlparse.urlsplit('http://www.python.org:80/faq.cgi?src=fie')
        # Result is: SplitResult(scheme='http', netloc='www.python.org:80', path='/faq.cgi', query='src=fie', fragment='')
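The reverse direction, assembling a URL from its parts, is not covered above but follows the same pattern. The sketch below uses urlencode and urlunsplit from urllib.parse; the extra query parameter q is made up.

```python
from urllib.parse import urlencode, urljoin, urlunsplit

# urlencode() builds a query string and escapes special characters
# (a space becomes "+").
query = urlencode({"src": "fie", "q": "xml parsing"})

# urlunsplit() expects the same five fields that urlsplit() returns:
# scheme, location, path, query, fragment.
url = urlunsplit(("http", "www.python.org:80", "/faq.cgi", query, ""))

# A relative reference starting with "/" replaces the whole path and
# drops the query.
doc_url = urljoin(url, "/doc/")

print(url)
print(doc_url)
```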
Web Access
● easy-to-use third-party package requests: http://docs.python-requests.org/en/master/api/#requests.request
● supports Python v2 and v3
● three main classes: Request, Response, Session
● Request: models an HTTP request sent to the server
● Response: models the HTTP response
● Session: if you need continuity (ignore for now)
Requests
● requests.request(method, url, **kwargs)
  - mandatory: method (delete, get, head, options, patch, post, put)
  - mandatory: URL
  - optional parameters: too many to list here
● convenience functions:
  - requests.get/post/...(URL, [data=None], **kwargs)
● example (the call returns a Response object):
    import requests
    r = requests.request('GET', 'http://www.example.com/')
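To see what a Request looks like before it goes over the wire, it can be built and prepared without sending it. This is a sketch that assumes the third-party requests package is installed; the path /search, the query parameters and the helper send_prepared are made up, and the helper is defined but never called, so no network access happens.

```python
import requests

# Build a Request object; params are appended to the URL as a query.
req = requests.Request(
    "GET",
    "http://www.example.com/search",
    params={"q": "xml", "page": 2},
)

# prepare() produces the exact request that would be sent.
prepared = req.prepare()
print(prepared.method)
print(prepared.url)


def send_prepared(prepared_request, timeout=10):
    """Hypothetical helper: send a prepared request, return the Response."""
    with requests.Session() as session:
        return session.send(prepared_request, timeout=timeout)
```

For everyday use the convenience functions are shorter: requests.get(url, params=...) builds, prepares and sends the request in one call.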
Response
● useful attributes: status_code tells us about success or failure (e.g. 200 for OK, 404 for not found)
● r.content: holds the response's content as bytes (r.text gives the decoded string)
● r.iter_content and r.iter_lines allow chunkwise reading of the response data, especially in a stream context or when handling huge responses