This article shows how to use the PHPCrawl crawler library to crawl Kugou playlists. It is shared here for reference; the details are as follows:
After watching some videos about web crawling, I was itching to crawl something myself. There has recently been a fierce emoticon battle on Facebook, so I first thought of crawling all the emoticons, but I could not find a suitable VPN at the time. Instead, I crawled Kugou's top songs from the past month, along with their short introductions, to my local machine. The code is a bit messy and I am not very satisfied with it, so I hesitated to post it. Still, it is my first crawler, so here is the unpolished code. (Because the amount of data is small, I did not consider multi-processing. I did look through the PHPCrawl documentation, though, and the library already wraps every feature I could think of, which makes this very convenient to implement.)
<?php
header("Content-Type: text/html; charset=utf-8");

// It may take a while to crawl a site ...
set_time_limit(10000);

include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // Called by PHPCrawl for every document it receives.
    function handleDocumentInfo($DocInfo)
    {
        $url = $DocInfo->url;
        // Only playlist detail pages are parsed. "#" is used as the delimiter
        // so the slashes in the URL do not need escaping.
        $pat = "#http://www\.kugou\.com/yy/special/single/\d+\.html#";
        if (preg_match($pat, $url) > 0) {
            $this->parseSonglist($DocInfo);
        }
        flush();
    }

    public function parseSonglist($DocInfo)
    {
        $content = $DocInfo->content;
        $songlistArr = array();
        $songlistArr['raw_url'] = $DocInfo->url;

        // Parse the playlist introduction (its name).
        // The "Name:" label must match the text actually used on the page.
        $matches = array();
        $pat = "/<span>Name:<\/span>([^<]+)<br/";
        $ret = preg_match($pat, $content, $matches);
        if ($ret > 0) {
            $songlistArr['title'] = $matches[1];
        } else {
            $songlistArr['title'] = '';
        }

        // Parse the songs: each song link carries its name in the title attribute.
        $pat = "/<a title=\"([^\"]+)\" hidefocus=\"/";
        $matches = array();
        preg_match_all($pat, $content, $matches);
        $songlistArr['songs'] = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            $song_title = $matches[1][$i];
            array_push($songlistArr['songs'], array('title' => $song_title));
        }

        echo "<pre>";
        print_r($songlistArr);
        echo "</pre>";
    }
}

$crawler = new MyCrawler();

// URL to crawl
$start_url = "http://www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Link extension: follow only playlist detail pages and playlist index pages
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/single/\d+\.html# i");
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/index/\d+-\d+-2\.html# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// A traffic limit of 0 means the crawl size is unlimited
$crawler->setTrafficLimit(0);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

// Use "\n" as the line break in CLI mode, otherwise "<br />"
if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary: " . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;
?>
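The two regular-expression steps in the crawler (deciding whether a URL is a playlist page, then pulling the playlist name and song titles out of the HTML) can be tried without crawling anything. The following is only a minimal sketch: the helper names `isPlaylistUrl` and `extractPlaylist` and the sample markup in `$sampleHtml` are hypothetical and are not part of PHPCrawl or of the real Kugou page. The URL pattern here is also anchored with `^` and `$`, which is slightly stricter than the pattern used in the crawler.

```php
<?php
// Hypothetical sample markup. The structure of the real Kugou page is an assumption.
$sampleHtml = '<p><span>Name:</span>Top Hits<br />'
    . '<a title="Song A" hidefocus="true" href="#">Song A</a>'
    . '<a title="Song B" hidefocus="true" href="#">Song B</a></p>';

// Hypothetical helper: true only for playlist detail URLs.
function isPlaylistUrl(string $url): bool
{
    $pattern = '#^http://www\.kugou\.com/yy/special/single/\d+\.html$#';
    return preg_match($pattern, $url) === 1;
}

// Hypothetical helper: returns the playlist name and its song titles.
function extractPlaylist(string $html): array
{
    $playlist = array('title' => '', 'songs' => array());

    // The name sits between the "Name:" label and the next <br.
    if (preg_match('#<span>Name:</span>([^<]+)<br#', $html, $m) === 1) {
        $playlist['title'] = trim($m[1]);
    }

    // Each song link carries its name in the title attribute.
    if (preg_match_all('#<a title="([^"]+)" hidefocus="#', $html, $m) > 0) {
        foreach ($m[1] as $songTitle) {
            $playlist['songs'][] = array('title' => $songTitle);
        }
    }

    return $playlist;
}

$result = extractPlaylist($sampleHtml);
print_r($result);
```

Testing the patterns on a saved copy of one page in this way makes it easier to adjust them before starting a full crawl.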
PS: Two convenient regular expression tools are also provided here for your reference:
JavaScript Regular Expression Online Testing Tool:
http://tools.jb51.net/regex/javascript
Online Regular Expression Generator:
http://tools.jb51.net/regex/create_reg
I hope this article helps everyone with PHP programming.