Taming HTML with Java: Write a Small Crawler in Minutes (Part 2)


Target URL: https://book.douban.com/latest?icn=index-latestbook-all
 
4.1 Project pom.xml
The project pulls in three libraries: jsoup, lombok, and easyexcel.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>JsoupTest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.compiler.source>1.8</maven.compiler.source>
    </properties>

    <dependencies>
        <!-- jsoup: fetching and parsing HTML -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>
        <!-- lombok: generates @Data/@Builder boilerplate -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.12</version>
        </dependency>
        <!-- easyexcel: writes the results to an Excel file -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>easyexcel</artifactId>
            <version>2.2.6</version>
        </dependency>
    </dependencies>
</project>
 
4.2 Parsing the page data

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BookInfoUtils {

    public static List<BookEntity> getBookInfoList(String url) throws IOException {
        List<BookEntity> bookEntities = new ArrayList<>();
        Document doc = Jsoup.connect(url).get();
        // Each <li> in the article list corresponds to one book.
        Elements liDiv = doc.select("#content > div > div.article > ul > li");
        for (Element li : liDiv) {
            Elements urls = li.select("a[href]");
            Elements imgUrl = li.select("a > img");
            Elements bookName = li.select(" div > h2 > a");
            Elements starsCount = li.select(" div > p.rating > span.font-small.color-lightgray");
            Elements author = li.select("div > p.color-gray");
            Elements description = li.select(" div > p.detail");
            String bookDetailUrl = urls.get(0).attr("href");
            // Follow the link to the detail page to pick up price and author information.
            BookDetailInfo detailInfo = getDetailInfo(bookDetailUrl);
            BookEntity bookEntity = BookEntity.builder()
                    .detailPageUrl(bookDetailUrl)
                    .bookImgUrl(imgUrl.attr("src"))
                    .bookName(bookName.html())
                    .starsCount(starsCount.html())
                    .author(author.text())
                    .bookDetailInfo(detailInfo)
                    .description(description.html())
                    .build();
            // System.out.println(bookEntity);
            bookEntities.add(bookEntity);
        }
        return bookEntities;
    }

    /**
     * Fetches price and author information from a book's detail page.
     *
     * @param detailUrl URL of the book's detail page
     * @return the parsed detail information
     * @throws IOException if the page cannot be fetched
     */
    public static BookDetailInfo getDetailInfo(String detailUrl) throws IOException {
        Document doc = Jsoup.connect(detailUrl).get();
        Elements content = doc.select("body");
        Elements price = content.select("#buyinfo-printed > ul.bs.current-version-list > li:nth-child(2) > div.cell.price-btn-wrapper > div.cell.impression_track_mod_buyinfo > div.cell.price-wrapper > a > span");
        Elements author = content.select("#info > span:nth-child(1) > a");
        BookDetailInfo bookDetailInfo = BookDetailInfo.builder()
                .author(author.html())
                .authorUrl(author.attr("href"))
                .price(price.html())
                .build();
        return bookDetailInfo;
    }
}
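The parsing code above builds BookEntity and BookDetailInfo objects, but those classes are not shown in this part. A minimal sketch of what they might look like, assuming plain Lombok value classes whose fields are inferred from the builder calls above (each class would live in its own file):

import lombok.Builder;
import lombok.Data;

// Hypothetical sketch: fields inferred from the builder calls in BookInfoUtils.
@Data
@Builder
public class BookEntity {
    private String detailPageUrl;
    private String bookImgUrl;
    private String bookName;
    private String starsCount;
    private String author;
    private String description;
    private BookDetailInfo bookDetailInfo;
}

// Also a sketch: the detail-page fields used by getDetailInfo().
@Data
@Builder
public class BookDetailInfo {
    private String author;
    private String authorUrl;
    private String price;
}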
The key point here is obtaining the CSS selector for the page element you want.
For example, how do we know that the selector in li.select("div > p.color-gray") should be div > p.color-gray?
Chrome users have probably guessed already. Open Chrome DevTools, press Ctrl + Shift + C to pick an element, then right-click that element in the HTML panel and choose Copy -> Copy selector. That gives you the selector for the current element, as shown in the figure below:

[Figure: right-click the highlighted element in Chrome DevTools and choose Copy -> Copy selector]
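To verify that a copied selector actually matches what you expect, you can run it straight through Jsoup before wiring it into the crawler. A minimal sketch, assuming the list-page URL above and the author selector used in BookInfoUtils (the SelectorCheck class name is just for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorCheck {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://book.douban.com/latest?icn=index-latestbook-all").get();
        // Paste the selector copied from Chrome (Copy -> Copy selector) into select().
        Elements authors = doc.select("#content > div > div.article > ul > li > div > p.color-gray");
        authors.forEach(element -> System.out.println(element.text()));
    }
}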
 
4.3 Saving the data to Excel
To make the data easier to browse, I save the data scraped with jsoup to an Excel file, and use easyexcel to generate the file quickly.
The Excel header definition:
import com.alibaba.excel.annotation.ExcelProperty;
import lombok.Builder;
import lombok.Data;

@Data
@Builder
public class ColumnData {
    @ExcelProperty("书名称")
    private String bookName;
    @ExcelProperty("评分")
    private String starsCount;
    @ExcelProperty("作者")
    private String author;
    @ExcelProperty("封面图片")
    private String bookImgUrl;
    @ExcelProperty("简介")
    private String description;
}
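The original text stops at the header definition, so here is a rough sketch of how the write step might look: map the scraped BookEntity list onto ColumnData rows, then hand them to EasyExcel. The output file name, sheet name, and the ExcelWriterDemo class are assumptions for illustration, not part of the article:

import com.alibaba.excel.EasyExcel;

import java.util.List;
import java.util.stream.Collectors;

public class ExcelWriterDemo {
    public static void main(String[] args) throws Exception {
        String url = "https://book.douban.com/latest?icn=index-latestbook-all";
        List<BookEntity> books = BookInfoUtils.getBookInfoList(url);

        // Map each scraped BookEntity onto the Excel row model defined above.
        List<ColumnData> rows = books.stream()
                .map(book -> ColumnData.builder()
                        .bookName(book.getBookName())
                        .starsCount(book.getStarsCount())
                        .author(book.getAuthor())
                        .bookImgUrl(book.getBookImgUrl())
                        .description(book.getDescription())
                        .build())
                .collect(Collectors.toList());

        // EasyExcel reads the column headers from the @ExcelProperty annotations on ColumnData.
        EasyExcel.write("douban-latest-books.xlsx", ColumnData.class)
                .sheet("latest-books")
                .doWrite(rows);
    }
}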

