Web Scrapers and Your Listing Data: High Risk Lessons
Transcript of Web Scrapers and Your Listing Data: High Risk Lessons
![Page 1: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/1.jpg)
Web Scrapers and Your Listing Data: High Risk Lessons
![Page 2: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/2.jpg)
Presenters
Matt Cohen, Chief Technologist, Clareity Consulting
Lauren Hansen, CEO, IRES MLS
Charlie Minesinger, Director of Solution Sales, Distil Networks
![Page 3: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/3.jpg)
Agenda
Toward Better Security for Real Estate Data Online
○ Overview of Bots and Web Scraping
○ Web Scraping’s Impact on Real Estate Websites
○ IRES / ColoProperty.com Case Study
○ About Distil Networks
○ Q&A
![Page 4: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/4.jpg)
A Brief Intro to Bots and Web Scraping
![Page 5: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/5.jpg)
Bad Bots Cause the Majority of Website Problems
![Page 6: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/6.jpg)
In 2015 the most targeted verticals were digital publishing and real estate. Real estate sites saw a 300% increase in bad bot traffic!
Traffic by Type of Site, 2014 vs 2015
![Page 7: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/7.jpg)
What is Web Scraping?
Web scraping is the act of taking content from a website with the intent of using it for purposes outside the direct control of the site owner.
It can be used to:
○ Steal intellectual property
○ Gain competitive advantage
○ Create aggregation or meta-sites
○ Perform market research
○ Damage SEO rankings
![Page 8: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/8.jpg)
Who is behind Web Scraping?
Competitors: content theft, competitive intel, price scraping
Aggregators: start-ups, unauthorized middlemen
Hackers: content for fake pages
Search engines: Google, Bing, Yahoo, Baidu
![Page 9: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/9.jpg)
Web Scraping Concerns on Real Estate Sites
![Page 10: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/10.jpg)
How Web Scrapers Impact KPIs
Web scraping hurts your KPIs...
○ Slowdowns, downtime, and poor user experiences
○ Increase in costs (infrastructure and people)
○ Distortion of web analytics
○ Digital ad fraud, reputation and trust (bad leads)
![Page 11: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/11.jpg)
Why Bots / Scraping is a Problem in Real Estate
MLSs
○ Obligation to protect copyright
○ Higher cost to use reactive methods: beacons, legal, etc.
○ Duty to enforce NAR Policy (VOWs; IDX optionally)
○ Missed revenue opportunities for licensing content
Brokers and Agents
○ Provided content license on listing for specific purpose
○ Responsible for NAR Policy (VOWs, so far)
○ Stale (scraped) data undermines trust and reputation in brand
![Page 12: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/12.jpg)
Bottom Line on Scraping
The High Costs of Scraping MLS Data
○ Resource costs: 10% to 40% of server utilization and bandwidth
○ Customer care: Cost per call from consumer? Calls per month?
○ Website performance: brownouts result in 3 days of low traffic
○ Ad fraud: If 30% of ads are seen by bots, are advertisers paying?
○ Lead gen: Bad leads, decreased value of MLS licensed data. $15/mover, $30/storage facility, … $100s per listing going to data pirates … and potentially annoying consumers in the process!
→ Biggest Losers: MLS and Brokers
Value of solution?
○ Antivirus is $40 to $75 per year per member (about $3 to $6 per month)
○ Anti-scraping protection should be the same or less cost
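The per-month figure quoted for antivirus can be checked with a line of arithmetic. The short Python sketch below only converts the slide's annual price range to a monthly one; the function name is an illustrative assumption.

```python
def monthly_cost(annual_cost: float) -> float:
    """Convert a per-member annual price into a per-member monthly price."""
    return annual_cost / 12


# Annual antivirus price range quoted on the slide, in dollars per member.
low = monthly_cost(40)
high = monthly_cost(75)

print(f"${low:.2f} to ${high:.2f} per member per month")
```

The result rounds to the slide's "$3 to $6 per month" range, which is the benchmark suggested for anti-scraping protection.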
![Page 13: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/13.jpg)
Why Bots / Scraping is a Problem in Real Estate
Bottom Line
Scrapers scrape because they are making money with your listings!
And the Real Estate industry is left with...
○ Higher Costs
○ Lost Revenues
![Page 14: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/14.jpg)
Clareity’s 2015 RE Industry Scraping Study
Who
○ 100 MLS executives representing MLSs with over 600,000 subscribers
○ 14 representing 400,000 IDX & VOW websites. Others would only speak informally
What Was Found
○ 99% say compliance with rules protecting against misuse of MLS data is important
○ 59% of respondents do NOT test VOW sites for anti-scraping compliance, and the other 41% rely on self-reporting
  ○ The industry lacks a tool for compliance review. I would require a screenshot of the site’s Distil dashboard, documentation of key settings!
○ Almost all IDX/VOW vendors are using no anti-scraping, or reactive, obsolete detection tactics
  ○ Reactive log analysis, IP-based methods, rate limiting, CAPTCHA
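To illustrate why IP-based rate limiting is listed among the reactive, easily evaded tactics, here is a minimal Python sketch of a per-IP sliding-window limiter. The class name, the limit of 3 requests per 60 seconds, and the documentation-range IP addresses are all illustrative assumptions; this is not taken from the study or from any vendor's product.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Naive sliding-window limiter keyed only on the client IP address."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = PerIpRateLimiter(max_requests=3, window_seconds=60)

# Five requests from one address: the fourth and fifth are refused.
single_ip = [limiter.allow("203.0.113.7", t) for t in range(5)]

# Five requests from five different addresses: every one is allowed.
rotating = [limiter.allow(f"198.51.100.{i}", 10 + i) for i in range(5)]

print(single_ip)
print(rotating)
```

A single address is throttled after its third request, but the same number of requests spread across five addresses passes untouched, which is exactly what a scraper that rotates IP addresses does.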
![Page 15: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/15.jpg)
Scraping Study / The Path Forward
95% of MLS execs agree that IDX sites should be subject to rules specifically mandating scraping protections
NAR has declined to make the change even though 95% want the “air coverage” of specific language NAR’s
The Path Forward to the 100% Solution
○ Must start with MLSs: MLS vendors, Public Listing Websites
○ VOW compliance
○ IDX requirements made clear
○ Once “our own house” is in order, pressure syndication sites. The largest have at least some protections already. It’s the scores of others...
![Page 16: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/16.jpg)
IRES / ColoProperty.com Case Study
![Page 17: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/17.jpg)
About IRES / ColoProperty.com
IRES
○ For real estate professionals
○ Serving 6,000 professionals
○ County Assessor data
○ Mapping
○ Broker functionality
ColoProperty.com
○ Consumer-facing site
○ Daily updates on ~15,000 listings
![Page 18: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/18.jpg)
Our Bot Blocking Goals
○ Preserve full value of MLS information
○ Create a trusted environment for key constituents
○ Protect the integrity of listing data
○ Decrease hosting and bandwidth costs
○ Prevent fraudulent lead forms and spam
○ Increase website speed
○ Avoid potential litigation costs
![Page 19: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/19.jpg)
My Advice
Don’t be like these guys...
![Page 20: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/20.jpg)
My Advice
Example from yesterday
(highlights added)
![Page 21: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/21.jpg)
Website Scraping Has Never Been Easier
Scraping resources just a click away...
Anyone with basic computer skills can get into the game
Inexpensive relative to the value of the content they steal
Difficult or impossible to prosecute
![Page 22: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/22.jpg)
Insight and Control is Key
![Page 23: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/23.jpg)
Insight and Control is Key
In April we served almost 1.5 million CAPTCHAs...
But there were only 650 attempts to solve them
So, I know I’m only serving CAPTCHAs to the bots... 99.995% of the time
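The reasoning on this slide is a simple ratio: CAPTCHAs that nobody tried to solve were almost certainly served to bots. The Python sketch below works it through using the rounded numbers quoted above ("almost 1.5 million" is treated as exactly 1,500,000, which is an assumption), so the result is approximate.

```python
captchas_served = 1_500_000  # "almost 1.5 million", rounded for illustration
solve_attempts = 650

# Share of CAPTCHAs that no one even attempted to solve.
attempted_share = solve_attempts / captchas_served
unattempted_share = 1 - attempted_share

print(f"attempted:   {attempted_share:.3%}")
print(f"unattempted: {unattempted_share:.3%}")
```

With these rounded inputs the unattempted share comes to roughly 99.96%. The slide quotes 99.995%, and the exact counts behind that figure are not given in the deck.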
![Page 24: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/24.jpg)
Insight and Control is Key
I can adjust rate limits up and down and see how they will impact my users...
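The idea of trying a rate limit and seeing who it would affect before enforcing it can be sketched in a few lines of Python. Everything here is hypothetical: the client labels, the request counts, and the candidate limits are made up for illustration and do not come from the IRES dashboard.

```python
def clients_over_limit(requests_per_client: dict[str, int], limit: int) -> list[str]:
    """Return the clients whose request count in the period exceeds the limit."""
    return sorted(
        client for client, count in requests_per_client.items() if count > limit
    )


# Hypothetical requests per client over one observation period.
observed = {
    "visitor-a": 12,
    "visitor-b": 45,
    "office-nat": 180,   # many real users behind one shared address
    "crawler-x": 2400,
}

# What-if analysis: who would each candidate limit affect?
impact = {limit: clients_over_limit(observed, limit) for limit in (50, 150, 500)}

for limit, affected in impact.items():
    print(limit, affected)
```

In this made-up data, limits of 50 or 150 would also catch the shared office connection, while a limit of 500 would affect only the heavy crawler. That trade-off is what the slide means by seeing the impact on users before committing to a setting.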
![Page 25: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/25.jpg)
About Distil Networks
![Page 26: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/26.jpg)
Distil Networks in Real Estate
![Page 27: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/27.jpg)
Majority of Bots are Advanced Persistent Bots (APBs)
APBs have one or more of the following abilities:
Advanced
○ Mimic human behavior
○ Load JavaScript
○ Load external resources
○ Support cookies
○ Browser automation (Selenium, PhantomJS)
Persistent
○ Dynamic IP rotation
○ Distribute attacks across IP addresses
○ Hide behind anonymous and peer-to-peer proxies
Source: 2016 Distil Bad Bot Report
![Page 28: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/28.jpg)
Sticky Bot Tracking With No Impact On Real Users
Device Fingerprinting
○ Fingerprints stick to the bot even if it attempts to reconnect from random IP addresses or hide behind an anonymous proxy or peer-to-peer network
○ Tracks distributed attacks that would normally fly under the radar
Without Impacting Users Sharing the Same IP
○ Avoids blocking residential users or organizations that might share the same NAT as the bot or botnet
[Figure: Without Distil vs. With Distil]
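The general idea of tracking a client by fingerprint rather than by IP address can be sketched in Python as follows. This is a toy illustration only, not Distil's method: the attribute names and values are hypothetical, and real device fingerprinting draws on many more signals than three request attributes.

```python
import hashlib
from collections import defaultdict


def fingerprint(attrs: dict[str, str]) -> str:
    """Hash a canonical form of client attributes into a short identifier."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# fingerprint -> set of IP addresses it has been seen from
ips_by_fingerprint = defaultdict(set)


def record(ip: str, attrs: dict[str, str]) -> str:
    """Record one request and return the fingerprint it was filed under."""
    fp = fingerprint(attrs)
    ips_by_fingerprint[fp].add(ip)
    return fp


# Hypothetical attribute sets for a bot and for a human visitor.
bot = {"user_agent": "ExampleBrowser/1.0", "screen": "800x600", "timezone": "UTC"}
human = {"user_agent": "ExampleBrowser/2.3", "screen": "1920x1080", "timezone": "MST"}

# The bot reconnects from three addresses; the human shares one of them (same NAT).
for ip in ("203.0.113.7", "198.51.100.4", "192.0.2.9"):
    bot_fp = record(ip, bot)
human_fp = record("203.0.113.7", human)

print(bot_fp, sorted(ips_by_fingerprint[bot_fp]))
print(human_fp, sorted(ips_by_fingerprint[human_fp]))
```

The bot keeps one fingerprint across all three addresses, so its requests can be tied together, while the human behind the shared address has a different fingerprint and would not be blocked along with it.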
![Page 29: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/29.jpg)
Threat Intelligence From All Distil-Protected Sites
Known Violators Database
○ Real-time updates from the world’s largest Known Violators Database, which is based on the collective intelligence of all Distil-protected sites
○ Distil customers are automatically protected against new threats discovered anywhere on the network
![Page 30: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/30.jpg)
Advanced Bot Detection Increases Accuracy
Browser Validation
○ Detects all known browser automation tools, such as Selenium and PhantomJS
○ Protects against browser spoofing by validating each incoming request against the browser it self-reports
Behavioral Modeling and Machine Learning
○ Machine-learning algorithms pinpoint behavioral anomalies specific to your site’s unique traffic patterns
○ Self-optimizing algorithms improve bot detection and mitigation without manual configuration
![Page 31: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/31.jpg)
Flexible Deployment Options
Physical or Virtual Appliances
○ Install on virtualized or bare metal appliance(s)
○ High availability configurations with failover monitoring
○ Heartbeat up to Distil Cloud
○ Deploys in days
Content Delivery Network
○ Automatically compresses and optimizes content for faster delivery
○ 17 global datacenters automatically fail over when a primary location goes offline
○ Automatically increases infrastructure and bandwidth to accommodate spikes
○ Deploys in hours
![Page 32: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/32.jpg)
![Page 33: Web Scrapers and Your Listing Data: High Risk Lessons](https://reader035.fdocuments.in/reader035/viewer/2022070601/587ed2781a28abdb198b55a3/html5/thumbnails/33.jpg)
Presenters
Matt Cohen, Chief Technologist, Clareity Consulting
Lauren Hansen, CEO, IRES
Charlie Minesinger, Director of Solution Sales, Distil Networks