English | 简体中文 | 繁體中文 | Русский язык | Français | Español | Português | Deutsch | 日本語 | 한국어 | Italiano | بالعربية

Implementing a Python Crawler: Converting a Tutorial into a PDF E-book

Python seems to be the most natural choice for writing a crawler: the Python community offers a dazzling array of crawling tools, and with the ready-made libraries you can have a crawler running in minutes. Today I want to write a crawler that scrapes Liao Xuefeng's Python tutorial and turns it into a PDF e-book for convenient offline reading.

Before writing the crawler, let's analyze the page structure of the website[1]. The left side of the page is the tutorial's table of contents, and each URL corresponds to an article on the right: the article title sits at the top and the body in the middle. The body is what we care about; the data we need to crawl is the body of every page. The user comment area at the bottom is of no use to us, so it can be ignored.

Tool preparation

After understanding the basic structure of the website, we can start preparing the packages the crawler depends on. requests and beautifulsoup are the two workhorses of web crawling: requests handles the network requests, and beautifulsoup operates on the HTML data. With these two tools we can work efficiently without a crawling framework like scrapy; using one for such a small program would be using a sledgehammer to crack a nut. Since we are converting HTML files to PDF, we also need library support for that: wkhtmltopdf is a very good tool that converts HTML to PDF on multiple platforms, and pdfkit is the Python wrapper for wkhtmltopdf. First, install the following dependencies:

pip install requests
pip install beautifulsoup4
pip install html5lib
pip install pdfkit

Install wkhtmltopdf

On the Windows platform, install the stable version directly from the wkhtmltopdf official website[2]. After installation, add the program's executable path to the system $PATH environment variable; otherwise pdfkit will not find wkhtmltopdf and will report the error 'No wkhtmltopdf executable found'. On Ubuntu and CentOS it can be installed directly from the command line:

$ sudo apt-get install wkhtmltopdf # ubuntu
$ sudo yum install wkhtmltopdf   # centos
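
If modifying $PATH is inconvenient, pdfkit also lets you pass the executable's location explicitly via pdfkit.configuration. A minimal sketch, assuming a typical Windows install path (adjust it to wherever wkhtmltopdf.exe actually lives on your machine):

import pdfkit

# Assumed install path; change it to match your actual installation
config = pdfkit.configuration(wkhtmltopdf=r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")
pdfkit.from_file("a.html", "out.pdf", configuration=config)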

Crawler implementation

After everything is ready, we can start coding, but before writing any code it is still worth organizing our thoughts. The goal of the program is to save the HTML body of every URL locally, then use pdfkit to convert those files into a single PDF file. Let's break the task down: first save the HTML body of one URL locally, then find all the URLs and perform the same operation on each.

Use the Chrome browser to find the tag for the body section of the page: press F12 and locate the div tag corresponding to the body, <div class="x-wiki-content">, which holds the main content of the page. After loading the entire page with requests, you can use beautifulsoup to operate on the HTML DOM elements and extract the body content.


Specific implementation code: use the soup.find_all function to locate the body tag, then save the body content to the file a.html.

import requests
from bs4 import BeautifulSoup

def parse_url_to_html(url):
  # Fetch the page and parse it with the html5lib parser
  response = requests.get(url)
  soup = BeautifulSoup(response.content, "html5lib")
  # The article body lives in <div class="x-wiki-content">
  body = soup.find_all(class_="x-wiki-content")[0]
  html = str(body)
  with open("a.html", "w", encoding="utf-8") as f:
    f.write(html)
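
One detail worth noting: a.html is overwritten on every call, so to crawl the whole tutorial you need one output file per URL. A minimal variant of the function above; the extra name parameter is my own addition rather than part of the original code:

def parse_url_to_html(url, name):
  # Identical to the version above, except the caller picks the output file
  response = requests.get(url)
  soup = BeautifulSoup(response.content, "html5lib")
  body = soup.find_all(class_="x-wiki-content")[0]
  with open(name, "w", encoding="utf-8") as f:
    f.write(str(body))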

The second step is to parse out all the URLs on the left side of the page. Using the same method, find the left menu tag <ul class="uk-nav uk-nav-side">.

Specific implementation logic: there are two elements on the page whose class attribute is uk-nav uk-nav-side, and the actual directory list is the second one. Once all the URLs are obtained, the function that converts a URL to HTML was already written in the first step.

def get_url_list():
  """
  Get the list of all URLs in the directory
  """
  response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
  soup = BeautifulSoup(response.content, "html5lib")
  # Two elements share this class; the real directory menu is the second one
  menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
  urls = []
  for li in menu_tag.find_all("li"):
    url = "http://www.liaoxuefeng.com" + li.a.get('href')
    urls.append(url)
  return urls

The last step is to convert the HTML into a PDF file. The conversion itself is very simple because pdfkit encapsulates all the logic; you only need to call the function pdfkit.from_file.

import pdfkit

def save_pdf(htmls, file_name):
  """
  Convert all html files into a single pdf file
  """
  options = {
    'page-size': 'Letter',
    'encoding': "UTF-8",
    'custom-header': [
      ('Accept-Encoding', 'gzip')
    ]
  }
  # from_file accepts a list of html files and merges them into one pdf
  pdfkit.from_file(htmls, file_name, options=options)

Run the save_pdf function and the PDF e-book is generated.
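
Putting the three steps together, a minimal sketch of the overall flow might look like this. The main function and the numbered file names are my own glue code, and it assumes the parse_url_to_html variant above that takes a file name:

def main():
  urls = get_url_list()
  htmls = []
  for index, url in enumerate(urls):
    # One temporary html file per page, named by its position in the menu
    name = "{}.html".format(index)
    parse_url_to_html(url, name)
    htmls.append(name)
  save_pdf(htmls, "liaoxuefeng_python_tutorial.pdf")

if __name__ == '__main__':
  main()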

Summary

The total comes to less than 50 lines of code. However, wait a minute: the code given above actually omits a few details, such as how to get the title of each article. The img tags in the body use relative paths, so if you want the images to display properly in the PDF you need to convert the relative paths to absolute ones, and all the temporary html files that get saved need to be deleted afterwards. All of these details are handled in the code on github.
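
To illustrate the relative-path fix, here is a minimal sketch; fix_img_paths is a hypothetical helper of my own, meant to be called on the body tag inside parse_url_to_html before saving, and the real handling lives in the github code:

from urllib.parse import urljoin

def fix_img_paths(body, page_url):
  # Rewrite relative img src attributes into absolute URLs so that
  # wkhtmltopdf can fetch the images when rendering the PDF
  for img in body.find_all("img"):
    src = img.get("src")
    if src:
      img["src"] = urljoin(page_url, src)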

The complete code can be downloaded from github[3]; it has been tested and works well on the Windows platform. Feel free to fork it and improve it yourself. For those who cannot access GitHub, a mirror is available on Gitee[4]. The PDF of 'Liao Xuefeng's Python Tutorial' can be downloaded for free by following the public account 'A Programmer's Microstation' and replying 'pdf'.
