AI Development: The WebMagic Web Crawler Framework (Part 5)


4.2 Writing the code
4.2.1 Setting up the module
(1) Create the project tensquare_user_crawler and add the following dependencies to pom.xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
</dependency>
<dependency>
    <groupId>com.tensquare</groupId>
    <artifactId>tensquare_common</artifactId>
    <version>1.0-SNAPSHOT</version>
</dependency>
(2) Create the configuration file application.yml
server:
  port: 9015
spring:
  application:
    name: tensquare-user-crawler # service name
  datasource:
    driverClassName: com.mysql.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/tensquare_user?characterEncoding=UTF8
    username: root
    password: 123456
  jpa:
    database: MySQL
    show-sql: true
redis:
  host: 127.0.0.1
(3) Create the startup class
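// Assumed package: the parent of the processor and pipeline packages,
// so that component scanning picks up both of them
package com.tensquare.usercrawler;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.annotation.EnableScheduling;
import us.codecraft.webmagic.scheduler.RedisScheduler;
import util.IdWorker;

@SpringBootApplication
@EnableScheduling
public class UserCrawlerApplication {

    // Redis host, read from redis.host in application.yml
    @Value("${redis.host}")
    private String redis_host;

    public static void main(String[] args) {
        SpringApplication.run(UserCrawlerApplication.class, args);
    }

    // ID generator used by the pipeline when saving users
    @Bean
    public IdWorker idWorker() {
        return new IdWorker(1, 1);
    }

    // Redis-backed URL queue for the crawler
    @Bean
    public RedisScheduler redisScheduler() {
        return new RedisScheduler(redis_host);
    }
}
(4) Entity class and data access interface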
These are the same as in the user microservice, so the code is omitted here.
4.2.2 Crawler class
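The crawler class in this section only follows links that match a CSDN article URL pattern. To make that filter easier to reason about, here is a stand-alone sketch that uses plain java.util.regex instead of WebMagic. The class name LinkPatternDemo, the method articleLinks, and the sample URLs are made up for illustration, and the pattern is written without the stray spaces. The sketch applies the pattern with Matcher.find(); treat that as an assumption about how the pattern is applied, not as a description of WebMagic internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the tutorial project
class LinkPatternDemo {

    // Same pattern as the crawler class, with the stray spaces removed
    static final Pattern ARTICLE =
            Pattern.compile("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}");

    // Returns the matched part of every link that contains an article URL
    static List<String> articleLinks(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            Matcher m = ARTICLE.matcher(link);
            if (m.find()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "https://blog.csdn.net/some-user1/article/details/12345678", // kept
                "https://blog.csdn.net/some-user1",                          // no article path
                "https://blog.csdn.net/Some_User/article/details/12345678",  // uppercase and underscore
                "https://www.csdn.net/nav/ai");                              // different host
        // Only the first link survives the filter
        System.out.println(articleLinks(links));
    }
}
```

Note that the user segment accepts only lowercase letters, digits and hyphens, so blog names with other characters are not followed.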
package com.tensquare.usercrawler.processor;

import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Crawler class: collects user nicknames and avatars from article pages
 */
@Component
public class UserProcessor implements PageProcessor {

    @Override
    public void process(Page page) {
        // Queue every link on the page that looks like a CSDN article URL
        page.addTargetRequests(page.getHtml().links()
                .regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        // Nickname text and avatar image URL of the article's author
        String nickname = page.getHtml().xpath("//*[@id=\"uid\"]/text()").get();
        String image = page.getHtml().xpath("//*[@id=\"asideProfile\"]/div[1]/div[1]/a")
                .css("img", "src").toString();
        if (nickname != null && image != null) { // both nickname and avatar were found
            page.putField("nickname", nickname);
            page.putField("image", image);
        } else {
            page.setSkip(true); // skip this page
        }
    }

    @Override
    public Site getSite() {
        return Site.me().setRetryTimes(3000).setSleepTime(100);
    }
}
4.2.3 Download utility class
The course resources include a utility class; copy it into the util package of the tensquare_common project.
package util;

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

/**
 * Download utility class
 */
public class DownloadUtil {

    public static void download(String urlStr, String filename, String savePath) throws IOException {
        URL url = new URL(urlStr);
        // Open the URL connection
        URLConnection connection = url.openConnection();
        // Connection timeout in milliseconds
        connection.setConnectTimeout(5000);
        // Make sure the target directory exists
        File dir = new File(savePath);
        if (!dir.exists()) {
            dir.mkdirs();
        }
        // Buffer for the data being copied
        byte[] bytes = new byte[1024];
        // Number of bytes read in each pass
        int len;
        // try-with-resources closes both streams even if the copy fails
        try (InputStream in = connection.getInputStream();
             OutputStream out = new FileOutputStream(new File(dir, filename))) {
            // Read into the buffer, then write those bytes to the file
            while ((len = in.read(bytes)) != -1) {
                out.write(bytes, 0, len);
            }
        }
    }
}
4.2.4 Persistence (pipeline) class
package com.tensquare.usercrawler.pipeline;

import com.tensquare.usercrawler.dao.UserDao;
import com.tensquare.usercrawler.pojo.User;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import util.DownloadUtil;
import util.IdWorker;

import java.io.IOException;

@Component
public class UserPipeline implements Pipeline {

    @Autowired
    private IdWorker idWorker;

    @Autowired
    private UserDao userDao;

    @Override
    public void process(ResultItems resultItems, Task task) {
        User user = new User();
        user.setId(idWorker.nextId() + "");
        user.setNickname(resultItems.get("nickname"));
        String image = resultItems.get("image"); // avatar image URL
        // Use the last path segment of the URL as the stored file name
        String fileName = image.substring(image.lastIndexOf("/") + 1);
        user.setAvatar(fileName);
        userDao.save(user);
        // Download the avatar image
        try {
            DownloadUtil.download(image, fileName, "e:/userimg");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
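The pipeline derives the avatar file name by taking everything after the last "/" in the image URL. The sketch below isolates that step so it can be tried without Spring or WebMagic. The class name AvatarFileNames, the method fromUrl, and the example.com URLs are hypothetical. Stripping a trailing query string is an addition that the pipeline above does not perform; it is included here as an assumption about what a saved file name should look like.

```java
// Hypothetical helper, not part of the tutorial project
class AvatarFileNames {

    // Returns the last path segment of the URL, ignoring any query string
    static String fromUrl(String imageUrl) {
        int query = imageUrl.indexOf('?');
        String path = query >= 0 ? imageUrl.substring(0, query) : imageUrl;
        // lastIndexOf returns -1 when there is no '/', so the whole string is kept
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Both calls print 3_demo.jpg
        System.out.println(fromUrl("https://example.com/avatars/3_demo.jpg"));
        System.out.println(fromUrl("https://example.com/avatars/3_demo.jpg?size=64"));
    }
}
```

Without the query-string step, the second URL would yield "3_demo.jpg?size=64", which is not a valid file name on Windows, the platform implied by the "e:/userimg" save path.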

