
Detailed Explanation of Installing and Using Scrapy Crawler Framework in Python

Prologue: I had long heard of Python crawler frameworks by reputation. In recent days I have been learning about the Scrapy crawler framework, and I would like to share my understanding with everyone. If anything is expressed poorly, I hope more experienced readers will correct me.

First, A Glimpse of Scrapy

Scrapy is an application framework written to crawl websites and extract structured data. It can be used for a wide range of tasks, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

This document will introduce the concepts behind Scrapy to help you understand its working principle and determine whether Scrapy is what you need.

When you are ready to start your own project, you can refer to the Getting Started tutorial.

Second, Introduction to Scrapy Installation

Platforms and auxiliary tools required to run the Scrapy framework:

  1. Python 2.7 (the latest Python release is 3.5; this article uses 2.7).
  2. Python packages: pip and setuptools. pip now depends on setuptools, which is installed automatically if it is missing.
  3. lxml. Most Linux distributions ship with lxml; if it is missing, see http://lxml.de/installation.html
  4. OpenSSL. Already provided on every system except Windows (see the platform installation guide). A quick import check for lxml and OpenSSL is shown below.
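If you want to verify that lxml and OpenSSL (the module provided by the pyOpenSSL package) are importable before installing Scrapy, one-liners like the following should print their versions, assuming both packages expose a __version__ attribute as current releases do:

python -c "import lxml.etree; print(lxml.etree.__version__)"
python -c "import OpenSSL; print(OpenSSL.__version__)"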

You can use pip to install Scrapy (pip is the recommended way to install Python packages):

pip install Scrapy

Installation steps on Windows:

1. After installing Python 2.7, you need to modify the PATH environment variable so that the Python executable and its additional scripts are on the system path. Add the following paths to PATH:

C:\Python27\;C:\Python27\Scripts\;

Alternatively, you can set Path with a cmd command:

c:\python27\python.exe c:\python27\tools\scripts\win_add2path.py

After installation and configuration, run the command python --version to check the installed Python version.
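The output should look something like this (the exact version number depends on your installation):

C:\>python --version
Python 2.7.11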

2. Install pywin32 from http://sourceforge.net/projects/pywin32/

Make sure to download the version that matches your system (win32 or amd64).

Install pip from https://pip.pypa.io/en/latest/installing.html

3. Open a command-line window and confirm that pip is installed correctly:

pip --version

4. At this point, Python 2.7 and pip are working correctly. Next, install Scrapy:

pip install Scrapy

This completes the installation of Scrapy on Windows.

Third, Scrapy Getting Started Tutorial

1. Create a Scrapy project in cmd:

scrapy startproject tutorial

H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
  H:\python\scrapyDemo\tutorial
You can start your first spider with:
  cd tutorial
  scrapy genspider example example.com

2. The file directory structure is as follows:

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

An analysis of the Scrapy project structure:

  1. scrapy.cfg: the project's configuration file.
  2. tutorial/: the project's Python module; you will add your code here later.
  3. tutorial/items.py: the project's item definitions.
  4. tutorial/pipelines.py: the project's pipelines file.
  5. tutorial/settings.py: the project's settings file.
  6. tutorial/spiders/: the directory where spider code is placed.

3. Write a simple crawler

(1) In items.py, define the fields to be collected from the pages.

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field

class TutorialItem(Item):
  # Fields to collect from each news page
  title = Field()
  author = Field()
  releasedate = Field()
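Scrapy items behave like dictionaries. As a purely illustrative sketch (the values here are hypothetical), this is how the spider will fill one in later:

from tutorial.items import TutorialItem

item = TutorialItem()
item['title'] = u'Sample headline'   # hypothetical value
item['author'] = u'Sample author'    # hypothetical value
print dict(item)                     # prints the collected fields as a plain dict

Assigning to a field that was not declared in TutorialItem raises a KeyError, which catches typos early.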

(2) In tutorial/spiders/spider.py, specify the site to be crawled and the fields to be collected.

# -*- coding: utf-8 -*-
import sys
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

reload(sys)
sys.setdefaultencoding("utf-8")

class ListSpider(CrawlSpider):
  # Spider name
  name = "tutorial"
  # Set download delay
  download_delay = 1
  # Allowed domains
  allowed_domains = ["news.cnblogs.com"]
  # Starting URL
  start_urls = [
    "https://news.cnblogs.com"
  ]
  # Crawl rules; a rule without a callback means matching URLs are followed recursively
  rules = (
    # Follow pagination links
    Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
    # Parse individual news pages with parse_content
    Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
  )

  # Extract the fields defined in items.py from a news page
  def parse_content(self, response):
    item = TutorialItem()
    title = response.selector.xpath('//div[@id="news_title"]/a/text()')[0].extract().decode('utf-8')
    item['title'] = title
    author = response.selector.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
    item['author'] = author
    releasedate = response.selector.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
    item['releasedate'] = releasedate
    yield item
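Before wiring XPath expressions into the spider, you can test them interactively with scrapy shell (the article URL below is a hypothetical example):

scrapy shell "https://news.cnblogs.com/n/123456/"
>>> response.xpath('//div[@id="news_title"]/a/text()').extract()
>>> response.xpath('//div[@id="news_info"]/span[@class="time"]/text()').extract()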

(3) Save the data in tutorial/pipelines.py.

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs

class TutorialPipeline(object):
  def __init__(self):
    # Store the collected data in data.json
    self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

  def process_item(self, item, spider):
    # Serialize each item as one JSON object per line
    line = json.dumps(dict(item)) + "\n"
    self.file.write(line.decode("unicode_escape"))
    return item
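One optional refinement, not part of the original code: Scrapy calls close_spider when the crawl finishes, which is a natural place to close the output file:

  # Optional addition to TutorialPipeline: close data.json when the crawl ends
  def close_spider(self, spider):
    self.file.close()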

(4) Configure the execution environment in tutorial/settings.py.

# -*- coding: utf-8 -*-
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False
# Register the pipeline; this is where items get written to the file
ITEM_PIPELINES = {
  'tutorial.pipelines.TutorialPipeline': 300
}
# Set the maximum depth that the crawler may reach
DEPTH_LIMIT = 100
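As a side note, Scrapy also ships with built-in feed exports: assuming default settings, the following command writes all scraped items to a JSON file without any custom pipeline (the custom pipeline above remains useful when you want control over encoding and formatting):

scrapy crawl tutorial -o items.json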

(5) Create a new main.py file to run the crawler code.

from scrapy import cmdline
cmdline.execute("scrapy crawl tutorial".split())

Finally, after running main.py, the collected results are saved as JSON in the data.json file.

That's all for this article. I hope it is helpful to everyone's study, and I also hope everyone will continue to support the 呐喊 tutorial site.

