Crawl Operators’ Workshop
Roger G. Coram
www.bl.uk 2
Topics
• ExternalGeoLocationDecideRule
• Sheets
– IpAddressSetDecideRule
ExternalGeoLocationDecideRule
• Legal Deposit legislation passed in April 2013.
• The Legal Deposit Libraries (Non-Print Works) Regulations 2013:
– 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
• “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”
Geolocation
• ExternalGeoLocationDecideRule requires:
– A list of ISO 3166-1 country-codes to be included in the crawl
• GB, FR, DE, etc.
– An implementation of ExternalGeoLookupInterface.
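The two requirements above can be sketched as a small scope check. This is only an illustration of the idea, not the Heritrix code: the nested `CountryLookup` interface, the `GeoScopeSketch` class and the toy address table are all hypothetical stand-ins (the addresses are RFC 5737 documentation addresses, not real crawl data).

```java
import java.util.Map;
import java.util.Set;

final class GeoScopeSketch {

    /** Hypothetical stand-in for a geo-lookup service; not the Heritrix interface. */
    interface CountryLookup {
        /** Returns an ISO 3166-1 alpha-2 code for the address, or null if unknown. */
        String countryCodeFor(String ipAddress);
    }

    private final CountryLookup lookup;
    private final Set<String> countryCodes;

    GeoScopeSketch(CountryLookup lookup, Set<String> countryCodes) {
        this.lookup = lookup;
        this.countryCodes = countryCodes;
    }

    /** True when the address resolves to one of the configured country codes. */
    boolean inScope(String ipAddress) {
        String code = lookup.countryCodeFor(ipAddress);
        return code != null && countryCodes.contains(code);
    }

    public static void main(String[] args) {
        // Toy table standing in for a real geolocation database (made-up data).
        Map<String, String> table = Map.of(
                "192.0.2.10", "GB",
                "198.51.100.7", "US");
        GeoScopeSketch rule = new GeoScopeSketch(table::get, Set.of("GB"));

        System.out.println(rule.inScope("192.0.2.10"));   // true: listed as GB
        System.out.println(rule.inScope("198.51.100.7")); // false: listed as US
        System.out.println(rule.inScope("203.0.113.5"));  // false: not in the table
    }
}
```

Keeping the lookup behind an interface means the country-code list and the database can be changed independently of each other.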
ExternalGeoLookupInterface
• Our implementation is based on MaxMind’s GeoLite2 database.
• Freely available under ‘Creative Commons Attribution-ShareAlike 3.0 Unported License’.
• Only ~30MB; can be held in memory.
crawler-beans.cxml
Configuration example:
<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>
<!-- ... ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>
Results
• Short test crawl (1,000,000 seeds) produced:
– 89,500,755 URLs in total.
– 26,072 non-UK URLs which would not otherwise have been in scope.
• 137 distinct hosts.
IP-based Sheets
“Hi,
“I'm a senior system administrator for Webfusion / 123-reg.
“We're currently experiencing lots of requests from crawler1.bl.uk to sites hosted on 81.21.76.62 , this is part of our Parking platform, which links into Yahoo to allow customers to park domains and earn money.”
• Large number of hosts on a single machine.
• Need a way to reduce the load on a specific IP address.
Sheets
• “Sheets provide the ability to replace default settings on a per domain basis.”
– Allow you to change any value on any named bean for a specific set of URLs.
• Actually quite flexible:
– SurtPrefixesSheetAssociation
• Applied by matching SURT prefixes.
– DecideRuledSheetAssociation
• Applied by evaluating a series of DecideRules.
– IpAddressSetDecideRule
1. crawler-beans.cxml
Configuration example:
<bean id="extraPolite" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="disposition.delayFactor" value="8.0"/>
      <entry key="disposition.minDelayMs" value="10000"/>
      <entry key="disposition.maxDelayMs" value="60000"/>
      <entry key="disposition.respectCrawlDelayUpToSeconds" value="60"/>
    </map>
  </property>
</bean>
<bean id="crawlLimited" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="quotaEnforcer.serverMaxFetchResponses" value="25"/>
    </map>
  </property>
</bean>
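To give a feel for what the extraPolite values mean in practice, here is a rough sketch of how a politeness delay is commonly derived from these three settings: the duration of the last fetch is multiplied by the delay factor and the result is clamped between the minimum and maximum. This is a simplified reading, not the Heritrix source; it ignores robots.txt Crawl-delay (the respectCrawlDelayUpToSeconds setting) and any bandwidth limits, and the class name and fetch durations are made up for illustration.

```java
final class PolitenessDelaySketch {

    /**
     * Simplified politeness delay: scale the last fetch duration by the
     * delay factor, then clamp to the [minDelayMs, maxDelayMs] range.
     */
    static long delayMs(long lastFetchMs, double delayFactor,
                        long minDelayMs, long maxDelayMs) {
        long scaled = (long) (delayFactor * lastFetchMs);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    public static void main(String[] args) {
        // Settings taken from the extraPolite sheet; fetch durations are hypothetical.
        double factor = 8.0;
        long min = 10_000;
        long max = 60_000;

        System.out.println(delayMs(500, factor, min, max));    // 10000 (4000 raised to the minimum)
        System.out.println(delayMs(3_000, factor, min, max));  // 24000 (8 x 3000, within range)
        System.out.println(delayMs(20_000, factor, min, max)); // 60000 (160000 capped at the maximum)
    }
}
```

Under this reading, a host on the extraPolite sheet is never contacted more often than once every ten seconds, however quickly it responds.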
2. crawler-beans.cxml
Configuration example:
<bean class="org.archive.crawler.spring.DecideRuledSheetAssociation">
  <property name="rules">
    <bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
      <property name="ipAddresses">
        <set>
          <value>81.21.76.62</value>
        </set>
      </property>
      <property name="decision" value="ACCEPT"/>
    </bean>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>extraPolite</value>
      <value>crawlLimited</value>
    </list>
  </property>
</bean>
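Conceptually, the association above says: if a URL's host resolves to one of the listed IP addresses, lay the named sheets over the default settings. The sketch below models that overlay with plain maps. It is an assumption-laden illustration rather than how Heritrix implements sheets: `SheetOverlaySketch`, its method, and the default values shown are hypothetical, and the second test address is an RFC 5737 documentation address.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SheetOverlaySketch {

    /** Returns the defaults with each target sheet laid over them, in list order. */
    static Map<String, String> effectiveSettings(
            Map<String, String> defaults,
            Map<String, Map<String, String>> sheets,
            Set<String> matchedIps,
            List<String> targetSheetNames,
            String hostIp) {
        Map<String, String> effective = new LinkedHashMap<>(defaults);
        if (matchedIps.contains(hostIp)) {
            for (String name : targetSheetNames) {
                // Later sheets win if two sheets set the same key.
                effective.putAll(sheets.getOrDefault(name, Map.of()));
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        // Illustrative defaults only; check your own crawler configuration.
        Map<String, String> defaults = Map.of(
                "disposition.delayFactor", "5.0",
                "disposition.minDelayMs", "3000");
        Map<String, Map<String, String>> sheets = Map.of(
                "extraPolite", Map.of(
                        "disposition.delayFactor", "8.0",
                        "disposition.minDelayMs", "10000"),
                "crawlLimited", Map.of(
                        "quotaEnforcer.serverMaxFetchResponses", "25"));
        Set<String> ips = Set.of("81.21.76.62");
        List<String> targets = List.of("extraPolite", "crawlLimited");

        // Host on the listed IP: overrides from both sheets apply.
        System.out.println(effectiveSettings(defaults, sheets, ips, targets, "81.21.76.62"));
        // Any other host: defaults are left untouched.
        System.out.println(effectiveSettings(defaults, sheets, ips, targets, "192.0.2.10"));
    }
}
```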
Thank you
GitHub: https://github.com/ukwa/bl-heritrix-modules
MaxMind: http://dev.maxmind.com/geoip/geoip2/geolite2/