
Java Spider Gecco Tool News Extraction Example

Recently I came across the Gecco crawler framework and found it simple and easy to use, so I wrote a demo to test it against the site http://zj.zjol.com.cn/home.html. The demo captures the title and publication time of each news article. Selecting HTML nodes is convenient and works much like a jQuery selector, and Gecco uses annotations to match URLs to beans, which keeps the code concise and tidy.
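
To make the idea of annotation-based URL matching more concrete, here is a minimal sketch of how a template with {name} placeholders could be turned into a regular expression and matched against a URL. It uses only the JDK and is not Gecco's actual implementation; the class name UrlTemplateSketch and the method match are made up for illustration, and the rule that a placeholder stops at /, ?, & or # is my own assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Gecco: matches a URL against a {name} template.
class UrlTemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Returns the placeholder values if the URL fits the template, or null if it does not. */
    static Map<String, String> match(String template, String url) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder regex = new StringBuilder();
        List<String> names = new ArrayList<>();
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so '.', '?' etc. match themselves.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Assumption: a placeholder value never contains '/', '?', '&' or '#'.
            regex.append("([^/?&#]+)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher u = Pattern.compile(regex.toString()).matcher(url);
        if (!u.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), u.group(i + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        String template = "http://zj.zjol.com.cn/home.html?pageIndex={pageIndex}&pageSize={pageSize}";
        // prints {pageIndex=2, pageSize=100}
        System.out.println(match(template, "http://zj.zjol.com.cn/home.html?pageIndex=2&pageSize=100"));
        // prints null
        System.out.println(match(template, "http://zj.zjol.com.cn/news/123.html"));
    }
}
```

In the demo, the extracted values correspond to what the fields annotated with @RequestParameter receive.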

Add Maven dependency

<dependency>
   <groupId>com.geccocrawler</groupId>
   <artifactId>gecco</artifactId>
   <version>1.0.8</version>
</dependency>

Write the list page crawler

@Gecco(matchUrl = "http://zj.zjol.com.cn/home.html?pageIndex={pageIndex}&pageSize={pageSize}", pipelines = "zJNewsListPipelines")
public class ZJNewsGeccoList implements HtmlBean {
  @Request
  private HttpRequest request;
  @RequestParameter
  private int pageIndex;
  @RequestParameter
  private int pageSize;
  @HtmlField(cssPath = "#content > div > div > div.con_index > div.r.main_mod > div > ul > li > dl > dt > a")
  private List<HrefBean> newList;
  // Getters and setters for every field are omitted for brevity; the pipeline below calls them.
}
@PipelineName("zJNewsListPipelines")
public class ZJNewsListPipelines implements Pipeline<ZJNewsGeccoList> {
  public void process(ZJNewsGeccoList zjNewsGeccoList) {
    HttpRequest request=zjNewsGeccoList.getRequest();
    for (HrefBean bean:zjNewsGeccoList.getNewList()){
      // Queue the detail page for scraping
      SchedulerContext.into(request.subRequest("http://zj.zjol.com.cn" + bean.getUrl()));
    }
    int page=zjNewsGeccoList.getPageIndex()+1;
    String nextUrl = "http://zj.zjol.com.cn/home.html?pageIndex=" + page + "&pageSize=100";
    //Scrape the next page
    SchedulerContext.into(request.subRequest(nextUrl));
  }
}
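
The pipeline above builds its URLs by plain string concatenation and always queues a next page. As a rough sketch of the same two steps with a little more safety, the helper below resolves each link against the page URL with java.net.URI (so relative and absolute hrefs both work) and stops paging at a limit. The class ListPageUrls, its methods, and the maxPage parameter are hypothetical names of mine, not part of Gecco or the original demo.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper for the list pipeline; uses only the JDK.
class ListPageUrls {
    static final String LIST_URL = "http://zj.zjol.com.cn/home.html";

    /** Resolves an href found on a page against that page's URL. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    /** Returns the URL of the next list page, or empty once maxPage (an assumed limit) is reached. */
    static Optional<String> nextPageUrl(int currentPage, int pageSize, int maxPage) {
        if (currentPage >= maxPage) {
            return Optional.empty();
        }
        return Optional.of(LIST_URL + "?pageIndex=" + (currentPage + 1) + "&pageSize=" + pageSize);
    }

    public static void main(String[] args) {
        String page = LIST_URL + "?pageIndex=1&pageSize=100";
        // prints http://zj.zjol.com.cn/news/123.html
        System.out.println(resolve(page, "/news/123.html"));
        // prints http://zj.zjol.com.cn/news/456.html
        System.out.println(resolve(page, "http://zj.zjol.com.cn/news/456.html"));
        // prints http://zj.zjol.com.cn/home.html?pageIndex=2&pageSize=100
        System.out.println(nextPageUrl(1, 100, 3).orElse("none"));
        // prints none
        System.out.println(nextPageUrl(3, 100, 3).orElse("none"));
    }
}
```

Inside process, the result of resolve(request.getUrl(), bean.getUrl()) could be passed to request.subRequest, and the next page queued only when nextPageUrl returns a value.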

Write the detail page crawler

@Gecco(matchUrl = "http://zj.zjol.com.cn/news/{code}.html", pipelines = "zjNewsDetailPipeline")
public class ZJNewsDetail implements HtmlBean {
  @Text
  @HtmlField(cssPath = "#headline")
  private String title ;
  @Text
  @HtmlField(cssPath = "#content > div > div.news_con > div.news-content > div:nth-child(1) > div > p.go-left.post-time.c-gray")
  private String createTime;
  // Getters and setters are omitted for brevity; the pipeline below calls them.
}
@PipelineName("zjNewsDetailPipeline")
public class ZJNewsDetailPipeline implements Pipeline<ZJNewsDetail> {
  public void process(ZJNewsDetail zjNewsDetail) {
    System.out.println(zjNewsDetail.getTitle() + " " + zjNewsDetail.getCreateTime());
  }
}

Write the main function to start the crawler

public class Main {
  public static void main(String[] args) {
    GeccoEngine.create()
        // Package to scan for the annotated bean and pipeline classes
        .classpath("com.zhaochao.gecco.zj")
        // URL where crawling starts
        .start("http://zj.zjol.com.cn/home.html?pageIndex=1&pageSize=100")
        // Number of crawler threads
        .thread(10)
        // Pause, in milliseconds, after each request made by a single crawler thread
        .interval(10)
        // Use a desktop (PC) user agent rather than a mobile one
        .mobile(false)
        // Start the engine
        .run();
  }
}
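
To give a feel for what the thread count and the interval mean, here is a small JDK-only sketch of a crawl loop: a fixed number of workers drain a shared queue, and each worker pauses after every URL it handles. This is only an illustration of the idea, not how GeccoEngine is implemented; the class CrawlLoopSketch and its crawl method are invented for this example, and no network requests are made.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of "N threads, pause after each request"; not Gecco code.
class CrawlLoopSketch {
    /** Drains the URLs with `threads` workers; each worker sleeps `intervalMillis` after every URL. */
    static List<String> crawl(List<String> startUrls, int threads, long intervalMillis) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(startUrls);
        Queue<String> visited = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                while ((url = pending.poll()) != null) {
                    // A real crawler would download and parse the page here.
                    visited.add(url);
                    try {
                        Thread.sleep(intervalMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://zj.zjol.com.cn/home.html?pageIndex=1&pageSize=100",
                "http://zj.zjol.com.cn/home.html?pageIndex=2&pageSize=100",
                "http://zj.zjol.com.cn/home.html?pageIndex=3&pageSize=100");
        System.out.println("visited " + crawl(urls, 2, 10).size() + " URLs");
    }
}
```

More threads shorten the total run, while a longer interval reduces the load each thread puts on the target site.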

Crawling results

That's all for this article. I hope it is helpful for your learning.
