Scrapy - An open source web scraping framework for Python
Source: files.meetup.com/6816242/(Pycon Taipei)...

Theon Lin, Tagtoo Tech Ltd.
Thursday, March 28th, 2013

Who am I?

• Theon Lin (席恩)

• Education

• National Chiao Tung University, Master's degree, Computer and Information Science (2002 - 2004)

• Experience

• Project Manager (L7Networks) Oct, 2004 - Oct, 2008

• Project Assistant Manager (D-Link) Oct, 2008 - Jan, 2012

• LinkedIn: http://www.linkedin.com/profile/view?id=125104719

Outline

• Introduction to Scrapy

• Basic Spider

• Advanced

• Q & A

Introduction to Scrapy

• What is Scrapy?

• Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

• Event-driven, built on Twisted

• Very well-structured framework

Introduction to Scrapy

(Scrapy architecture diagram; source: http://www.biaodianfu.com/scrapy-architecture.html)

Basic Spider - Start Project

$ scrapy startproject pycon

Basic Spider - Define Item

Basic Spider - First Spider

Basic Spider - Let’s Go

$ scrapy crawl first_spider

{'image': u'http://img4.groupon.com.tw/pi/20659-1-medium.jpg?1364445246',
 'link': u'/%E8%A1%9B%E7%94%9F%E7%B4%99-%E5%AE%85%E9%85%8D-20659.htm#mt=3720',
 'price': u'799',
 'sale_num': u'51520',
 'store_price': u'$1272'}

Advanced - CrawlSpider

Advanced - Crawling information from multiple pages

Advanced - BFS (breadth-first crawl order)

• settings.py

SCHEDULER_ORDER = 'BFO'
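
The setting above reflects the Scrapy version used at the time of the talk. In recent Scrapy releases, the FAQ describes breadth-first order as a combination of a depth priority and FIFO scheduler queues. A settings.py fragment along those lines, offered as a sketch for readers on a newer version:

```python
# settings.py fragment: breadth-first crawl order in recent Scrapy releases.
# By default Scrapy uses LIFO queues, which crawl in roughly depth-first order.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```

Breadth-first order visits all pages at one link depth before going deeper, which tends to use more memory on wide sites because more pending requests are queued at once.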

Q & A

References

• Official website

• http://www.scrapy.org

• Reference

• http://www.biaodianfu.com/scrapy-architecture.html
