Prologue: I had long heard of Scrapy, the well-known Python crawling framework. Having recently studied it, I would like to share my understanding; corrections from more experienced readers are welcome.
First, A Glimpse of Scrapy
Scrapy is an application framework written to crawl website data and extract structured data. It can be applied in a series of programs including data mining, information processing, or storing historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
This document will introduce the concepts behind Scrapy to help you understand its working principle and determine whether Scrapy is what you need.
When you are ready to start your own project, you can refer to the getting-started tutorial below.
Second, Introduction to Scrapy installation
Scrapy runs on Python and relies on a few auxiliary tools.
You can use pip to install Scrapy (pip is the recommended way to install Python packages).
pip install Scrapy
Installation steps on Windows:
1、After installing Python 2.7, you need to modify the PATH environment variable so that Python's executable and its extra scripts are on the system path. Add the following paths to PATH:
C:\Python27\;C:\Python27\Scripts\;
Alternatively, you can set Path from a cmd window:
c:\python27\python.exe c:\python27\tools\scripts\win_add2path.py
Once this is configured, run python --version to check which Python version is installed.
2、Install pywin32 from http://sourceforge.net/projects/pywin32/
Make sure you download the version that matches your system (win32 or amd64).
Then install pip from https://pip.pypa.io/en/latest/installing.html
3、Open a command-line window and confirm that pip is installed correctly:
pip --version
4、At this point, Python 2.7 and pip are working correctly. Next, install Scrapy:
pip install Scrapy
This completes the Scrapy installation on Windows.
Third, Scrapy Getting-Started Tutorial
1、Create a Scrapy project from cmd.
scrapy startproject tutorial
H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
    H:\python\scrapyDemo\tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
2、The file directory structure is as follows:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for spiders
            __init__.py
The pieces of the Scrapy framework cooperate as follows: the engine pulls requests from the scheduler, the downloader fetches the responses, spiders parse them into items, and item pipelines persist the results.
3、Write a simple crawler
1、In items.py, define the fields to be collected from the pages.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field

class TutorialItem(Item):
    title = Field()
    author = Field()
    releasedate = Field()
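For intuition, a scrapy.Item behaves like a dict restricted to its declared fields: assigning to an undeclared key raises an error. The following is a rough, stdlib-only sketch of that behaviour (an illustration, not Scrapy's actual implementation; the class names are made up):

```python
# Stdlib-only sketch of the dict-like behaviour scrapy.Item provides.
# SketchItem / SketchTutorialItem are hypothetical illustration names.

class SketchItem(dict):
    fields = ()  # declared field names

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s is not a declared field" % key)
        dict.__setitem__(self, key, value)

class SketchTutorialItem(SketchItem):
    fields = ('title', 'author', 'releasedate')

item = SketchTutorialItem()
item['title'] = 'Hello Scrapy'
print(dict(item))  # {'title': 'Hello Scrapy'}
```

Trying `item['nope'] = 1` would raise a KeyError, which is the safety net declared fields give you over a plain dict.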
2、In tutorial/spiders/spider.py, define the website to crawl and the fields to extract.
# -*- coding: utf-8 -*-
import sys
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

reload(sys)
sys.setdefaultencoding("utf-8")

class ListSpider(CrawlSpider):
    # Spider name
    name = "tutorial"
    # Set download delay
    download_delay = 1
    # Allowed domains
    allowed_domains = ["news.cnblogs.com"]
    # Starting URL
    start_urls = [
        "https://news.cnblogs.com"
    ]
    # Crawl rules; a rule without a callback means matching links are
    # followed recursively. (The URL patterns below are reconstructed
    # and may need adjusting for the target site.)
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
    )

    # Parse the content of a news page
    def parse_content(self, response):
        item = TutorialItem()
        title = response.selector.xpath('//div[@id="news_title"]/a/text()')[0].extract().decode('utf-8')
        item['title'] = title
        author = response.selector.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
        item['author'] = author
        releasedate = response.selector.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
        item['releasedate'] = releasedate
        yield item
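The two rules above route extracted links: a rule with no callback only follows the link (so paging links are crawled recursively), while a rule with a callback hands the response to that method. A minimal sketch of this routing logic, using only the standard library (the URL patterns mirror the spider's assumed ones):

```python
import re

# Sketch of how CrawlSpider routes extracted links: a rule with no
# callback just follows the link; a rule with a callback parses it.
# The patterns are illustrative, mirroring the spider above.
RULES = [
    (re.compile(r'https://news\.cnblogs\.com/n/page/\d+'), None),         # follow only
    (re.compile(r'https://news\.cnblogs\.com/n/\d+$'), 'parse_content'),  # parse article
]

def route(url):
    """Return the callback name for a URL, or None if it is only followed."""
    for pattern, callback in RULES:
        if pattern.match(url):
            return callback
    return None

print(route('https://news.cnblogs.com/n/page/2'))  # None (followed recursively)
print(route('https://news.cnblogs.com/n/123456'))  # parse_content
```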
3、In tutorial/pipelines.py, save the collected data.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs

class TutorialPipeline(object):
    def __init__(self):
        # Store data in data.json
        self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.decode("unicode_escape"))
        return item
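The pipeline writes one JSON object per line (the JSON Lines format), which makes the file easy to read back line by line. A small stdlib-only sketch of the write/read round trip, using an in-memory buffer and made-up sample data instead of a real file:

```python
import io
import json

# Sketch of the JSON-lines format the pipeline produces: one item per line.
# The sample items are hypothetical.
items = [
    {'title': 'First post', 'author': 'alice', 'releasedate': '2016-01-01'},
    {'title': 'Second post', 'author': 'bob', 'releasedate': '2016-01-02'},
]

buf = io.StringIO()
for item in items:
    # ensure_ascii=False keeps non-ASCII text readable, similar in spirit
    # to the decode("unicode_escape") trick in the pipeline above.
    buf.write(json.dumps(item, ensure_ascii=False) + "\n")

# Reading the file back is a line-by-line json.loads:
lines = buf.getvalue().splitlines()
restored = [json.loads(line) for line in lines]
print(restored == items)  # True
```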
4、Configure the runtime settings in tutorial/settings.py.
# -*- coding: utf-8 -*-
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False

# Register the pipeline; this is where items get written to the file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# Set the maximum depth the crawler may reach
DEPTH_LIMIT = 100
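The value 300 in ITEM_PIPELINES is a priority: when several pipelines are registered, Scrapy runs them in ascending order of that number. A quick sketch of that ordering (the second pipeline name is hypothetical, added only to show the sorting):

```python
# Sketch: Scrapy runs registered pipelines in ascending priority order.
# 'HypotheticalCleanupPipeline' is a made-up name for illustration.
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.HypotheticalCleanupPipeline': 100,
}

execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order[0])  # tutorial.pipelines.HypotheticalCleanupPipeline
```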
5、Create a main.py file to run the crawler.
from scrapy import cmdline

cmdline.execute("scrapy crawl tutorial".split())
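cmdline.execute expects an argv-style list, which is why the command string is split. str.split() is fine here because no argument contains spaces; a small sketch showing what it produces, and shlex.split as a safer alternative when arguments are quoted (the -o output file is just an example):

```python
import shlex

# cmdline.execute expects an argv list; str.split() works when no
# argument contains spaces.
argv = "scrapy crawl tutorial".split()
print(argv)  # ['scrapy', 'crawl', 'tutorial']

# shlex.split also handles quoted arguments (hypothetical output file):
quoted = shlex.split('scrapy crawl tutorial -o "my output.json"')
print(quoted[-1])  # my output.json
```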
Finally, after running main.py, the collected results appear as JSON data in the data.json file.
That's all for this article. I hope it helps with your study of Scrapy.