Why write a 'thief program'?
Scraping articles, news, or product information from remote sites is a feature many companies ask their programmers to build; such a tool is commonly called a 'thief program'. Its main advantage is that it relieves the company's web editors of a heavy manual workload and greatly improves efficiency: once started, it can quickly pull information from other websites.
Where should the 'thief program' run?
For best results, run the 'thief program' with the PHP command-line interpreter, under DOS on Windows or in a Linux shell, because a long crawl executed through a web page will hit PHP's execution timeout.
For example, running under the DOS command prompt on Windows (original screenshot omitted):
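A minimal way to launch such a script from the command line (assuming the crawler is saved as crawler.php; the `-d` flag shown here lifts PHP's execution time limit for long crawls):

```shell
# Run the crawler with the PHP CLI so the web-server timeout does not apply.
# Assumes the script is saved as crawler.php in the current directory.
php -d max_execution_time=0 crawler.php
```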
Implementing the 'thief program'
We will explain through an example: crawling the news listings of 'Huqiang Electronic Network'. First, look at the link http://www.hqew.com/info-c10.html. When you open this page, you will notice a few things:
1. The information list has 500 pages (as of 2012-01-03);
2. Each page's URL follows a regular pattern: page 1 is http://www.hqew.com/info-c10-1.html; page 2 is http://www.hqew.com/info-c10-2.html; … page 500 is http://www.hqew.com/info-c10-500.html;
3. From point 2, the pages of 'Huqiang Electronic Network' are pseudo-static or pre-generated static pages.
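The numbering rule in point 2 means every list-page URL can be derived from the page number alone; a small sketch (listPageUrl is a hypothetical helper name):

```php
<?php
// Build the list-page URL for a given page number, following the
// observed pattern http://www.hqew.com/info-c10-{page}.html
function listPageUrl($page) {
    return "http://www.hqew.com/info-c10-{$page}.html";
}

echo listPageUrl(1), "\n";   // first page
echo listPageUrl(500), "\n"; // last page
```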
In fact, most websites follow such a pattern, for example: ZOL (中关村在线), HC360 (慧聪网), Sina (新浪), Taobao (淘宝), and so on.
Given this, we can crawl the page content through the following steps:
1. First, fetch the content of the article list page;
2. Loop over the list-page content to extract the URL of each article;
3. Fetch the detailed content of each article from its URL.
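Step 2 boils down to matching the anchor tags on the list page with a regular expression. Here is a sketch against a made-up HTML fragment (the real list markup on hqew.com may differ):

```php
<?php
// Sketch of step 2: extract article URLs from list-page HTML.
// The fragment below is an invented stand-in for the real markup.
$html = '<li><span class="box_r">2012-01-03</span>'
      . '<a href="/info-123456.html" title="Sample article" >Sample article</a></li>';

// Capture the href attribute of each article link
preg_match_all('/<a href="([^"]+)" title="[^"]*" >/Usi', $html, $m);

// Prefix the site host to turn relative paths into absolute URLs
$urls = array();
foreach ($m[1] as $path) {
    $urls[] = 'http://www.hqew.com' . $path;
}
print_r($urls);
```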
From the information pages we mainly crawl the following fields: title, release date (date), author, source, and content.
Crawling the information from 'Huqiang Electronic Network'
First, create the data table with the following structure:
CREATE TABLE `article`.`article` (
  `id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `title` VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `date` VARCHAR(50) NOT NULL,
  `author` VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `source` VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `content` TEXT NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;
Crawling program:
<?php
/**
 * Crawling program for information from 'Huqiang Electronic Network'
 * author Lee.
 * Last modify $Date: 2012-1-3 15:39:35 $
 */
header('Content-Type: text/html; charset=utf-8');

$mysqli = new mysqli('localhost', 'root', '1715544', 'article'); // database connection; replace with your own credentials
$mysqli->set_charset('utf8'); // set the connection charset

function data($url) {
    global $mysqli;
    $result = file_get_contents($url); // fetch the article list page
    // regex that matches the article URLs on the list page
    $pattern = '/<li><span class="box_r">.+<\/span><a href="([^"]+)" title="[^"]*" >.+<\/a><\/li>/Usi';
    preg_match_all($pattern, $result, $arr); // collect the article URLs into $arr (a two-dimensional array)
    foreach ($arr[1] as $val) {
        $val = 'http://www.hqew.com' . $val; // real article URL
        $re = file_get_contents($val); // fetch the article page
        // regex that captures title, date, author, source and body from the article page
        $pa = '/<div id="article">\s+<h1>(.+)<\/h1>\s+<p id="article_extinfo">\s+发布:\s+(.+)\s+\|\s+作者:\s+(.+)\s+\|\s+来源:\s+(.*?)\s+<span style="display:none" >.+<div id="article_body">\s*(.+)\s+<\/div>\s+<\/div><!--article end-->/Usi';
        preg_match_all($pa, $re, $array); // captured fields go into $array
        $content = trim($array[5][0]);
        $con = array(
            'title'   => mysqlString($array[1][0]),
            'date'    => mysqlString($array[2][0]),
            'author'  => mysqlString(stripAuthorTag($array[3][0])),
            'source'  => mysqlString($array[4][0]),
            'content' => mysqlString(stripContentTag($content))
        );
        $sql = "INSERT INTO article (title, date, author, source, content)
                VALUES ('{$con['title']}', '{$con['date']}', '{$con['author']}', '{$con['source']}', '{$con['content']}')";
        $row = $mysqli->query($sql); // add to database
        if ($row) {
            echo 'add success!';
        } else {
            echo 'add failed!';
        }
    }
}

/**
 * stripContentTag($v) filters the article content, e.g. removing links
 * and unnecessary HTML tags from the article body
 * @param string $v
 * @return string
 */
function stripContentTag($v) {
    $v = str_replace('<p />', '', $v);
    $v = preg_replace('/<a href=".+" target="_blank"><strong>(.+)<\/strong><\/a>/Usi', '\1', $v);
    $v = preg_replace('%<span\s*[^>]*>(.*)</span>%Usi', '\1', $v);
    $v = preg_replace('%(\s+class="Mso[^"]+")%si', '', $v);
    $v = preg_replace('%( style="[^"]*mso[^>]*)%si', '', $v);
    $v = preg_replace('/<b><\/b>/', '', $v);
    return $v;
}

/**
 * stripAuthorTag($v) filters the author field
 * @param string $v
 * @return string
 */
function stripAuthorTag($v) {
    $v = preg_replace('/<a href=".+" target="_blank">(.+)<\/a>/Usi', '\1', $v);
    return $v;
}

/**
 * mysqlString($str) escapes the data before it is written to the database
 * @param string $str
 * @return string
 */
function mysqlString($str) {
    return addslashes(trim($str));
}

/**
 * init($min, $max) entry method: crawls list pages from page $min to page $max
 * @param int $min starting page, from 1
 * @param int $max ending page
 */
function init($min, $max) {
    for ($i = $min; $i <= $max; $i++) {
        data("http://www.hqew.com/info-c10-{$i}.html");
    }
}

init(1, 500); // program entry: crawl pages 1 through 500
?>
With the program above, you can crawl the information from Huqiang Electronic Network.
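Note that mysqlString() relies on addslashes(), which is a weak escape for SQL. A safer sketch of the same insert using a mysqli prepared statement (insertArticle is a hypothetical helper name; $con is the field array built inside data()):

```php
<?php
// Sketch: insert one scraped row with a prepared statement instead of
// string interpolation + addslashes(). Assumes an open $mysqli connection
// and the same five columns as the crawler's article table.
function insertArticle(mysqli $mysqli, array $con) {
    $stmt = $mysqli->prepare(
        'INSERT INTO article (title, date, author, source, content) VALUES (?, ?, ?, ?, ?)'
    );
    if ($stmt === false) {
        return false;
    }
    $stmt->bind_param('sssss', $con['title'], $con['date'],
                      $con['author'], $con['source'], $con['content']);
    return $stmt->execute();
}
```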
The entry method is init($min, $max): to capture the content of pages 1 to 500, call init(1, 500). Before long, all of the information from Huqiang Electronic Network will be captured into the database. ^_^
Execution output and resulting database records (original screenshots omitted).