AI Development: The Webmagic Web Crawler Framework (Part 4)

(3) Create the application startup class
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.annotation.EnableScheduling;
import us.codecraft.webmagic.scheduler.RedisScheduler;
import util.IdWorker;

@SpringBootApplication
@EnableScheduling
public class CrawlerApplication {

    @Value("${redis.host}")
    private String redis_host;

    public static void main(String[] args) {
        SpringApplication.run(CrawlerApplication.class, args);
    }

    // Snowflake-style ID generator used for the primary keys of crawled articles
    @Bean
    public IdWorker idWorker() {
        return new IdWorker(1, 1);
    }

    // Redis-backed scheduler: de-duplicates URLs and persists the crawl queue
    @Bean
    public RedisScheduler redisScheduler() {
        return new RedisScheduler(redis_host);
    }
}

(4) Entity class and data access interface: these were covered in the earlier microservices article, so the code is omitted here.
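For reference, here is a minimal sketch of what the omitted entity and DAO might look like, inferred from how ArticleDbPipeline uses them below (the setters and the save() call). The table name tb_article and the JpaRepository base interface are assumptions, not confirmed by this part of the series:

package com.tensquare.crawler.pojo;

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "tb_article") // table name is an assumption
public class Article implements java.io.Serializable {

    @Id
    private String id;        // generated by IdWorker
    private String channelid; // channel ID, e.g. "ai", "db", "web"
    private String title;
    private String content;

    // only the setters used by ArticleDbPipeline are shown; the rest are omitted
    public void setId(String id) { this.id = id; }
    public void setChannelid(String channelid) { this.channelid = channelid; }
    public void setTitle(String title) { this.title = title; }
    public void setContent(String content) { this.content = content; }
}

package com.tensquare.crawler.dao;

import com.tensquare.crawler.pojo.Article;
import org.springframework.data.jpa.repository.JpaRepository;

// Spring Data JPA provides the save() method used by ArticleDbPipeline
public interface ArticleDao extends JpaRepository<Article, String> {
}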
3.3.2 Crawling Class
Create the article crawling class ArticleProcessor:
package com.tensquare.crawler.processor;

import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Article crawling class
 */
@Component
public class ArticleProcessor implements PageProcessor {

    @Override
    public void process(Page page) {
        // Queue every article detail link discovered on the current page
        page.addTargetRequests(page.getHtml().links()
                .regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        // Extract the fields this page needs
        String title = page.getHtml().xpath("//*[@id='mainBox']/main/div[1]/div[1]/h1/text()").get();
        String content = page.getHtml().xpath("//*[@id='article_content']/div/div[1]").get();
        System.out.println("Title: " + title);
        System.out.println("Content: " + content);
        if (title != null && content != null) { // keep only pages that have both title and content
            page.putField("title", title);
            page.putField("content", content);
        } else {
            page.setSkip(true); // skip this page
        }
    }

    @Override
    public Site getSite() {
        return Site.me().setRetryTimes(3000).setSleepTime(100);
    }
}

3.3.3 Persistence Class
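Before wiring in Redis and the database, the processor can be smoke-tested in isolation using WebMagic's built-in ConsolePipeline and default in-memory scheduler. This main method is just a local test harness for illustration, not part of the project code:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

public class ArticleProcessorSmokeTest {
    public static void main(String[] args) {
        Spider.create(new ArticleProcessor())
              .addUrl("https://blog.csdn.net/nav/ai") // entry page; article links are discovered from here
              .addPipeline(new ConsolePipeline())     // print extracted fields instead of persisting them
              .thread(5)                              // crawl with 5 worker threads
              .run();                                 // run() blocks; start() would run asynchronously
    }
}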
Create the article persistence class ArticleDbPipeline, which is responsible for saving the crawled data to the database:
package com.tensquare.crawler.pipeline;

import com.tensquare.crawler.dao.ArticleDao;
import com.tensquare.crawler.pojo.Article;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import util.IdWorker;

/**
 * Persistence class
 */
@Component
public class ArticleDbPipeline implements Pipeline {

    @Autowired
    private ArticleDao articleDao;

    @Autowired
    private IdWorker idWorker;

    private String channelId; // channel ID

    public void setChannelId(String channelId) {
        this.channelId = channelId;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String content = resultItems.get("content");
        Article article = new Article();
        article.setId(idWorker.nextId() + "");
        article.setChannelid(channelId);
        article.setTitle(title);
        article.setContent(content);
        articleDao.save(article);
    }
}

ResultItems works much like a Map: it holds the results produced by the PageProcessor for the Pipeline to consume, and its API closely mirrors Map's. Note its skip field: when set to true, the result should not be processed by any Pipeline.
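The task class in the next section also injects an ArticleTxtPipeline, whose code is not shown in this part of the series. Below is a minimal sketch of what such a text-file pipeline could look like, assuming it writes each article to a per-channel directory; the c:/article/ base path and the file naming are assumptions for illustration:

package com.tensquare.crawler.pipeline;

import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.UUID;

/**
 * Sketch: write each crawled article to a text file under a per-channel directory.
 */
@Component
public class ArticleTxtPipeline implements Pipeline {

    private String channelId; // channel ID, set by the task before the crawl starts

    public void setChannelId(String channelId) {
        this.channelId = channelId;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String content = resultItems.get("content");
        File dir = new File("c:/article/" + channelId); // base path is an assumption
        dir.mkdirs();
        try (PrintWriter writer = new PrintWriter(new File(dir, UUID.randomUUID() + ".txt"), "UTF-8")) {
            writer.println(title);
            writer.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}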
3.3.4 Task Class
Create the task class:
package com.tensquare.crawler.task;

import com.tensquare.crawler.pipeline.ArticleDbPipeline;
import com.tensquare.crawler.pipeline.ArticleTxtPipeline;
import com.tensquare.crawler.processor.ArticleProcessor;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.RedisScheduler;

/**
 * Article task class
 */
@Component
public class ArticleTask {

    @Autowired
    private ArticleDbPipeline articleDbPipeline;

    @Autowired
    private ArticleTxtPipeline articleTxtPipeline;

    @Autowired
    private RedisScheduler redisScheduler;

    @Autowired
    private ArticleProcessor articleProcessor;

    /**
     * Crawl AI articles
     */
    @Scheduled(cron = "0 54 21 * * ?")
    public void aiTask() {
        System.out.println("Crawling AI articles");
        Spider spider = Spider.create(articleProcessor);
        spider.addUrl("https://blog.csdn.net/nav/ai");
        articleTxtPipeline.setChannelId("ai");
        articleDbPipeline.setChannelId("ai");
        spider.addPipeline(articleDbPipeline);
        spider.addPipeline(articleTxtPipeline);
        spider.setScheduler(redisScheduler);
        spider.start();
    }

    /**
     * Crawl DB articles
     */
    @Scheduled(cron = "20 17 11 * * ?")
    public void dbTask() {
        System.out.println("Crawling DB articles");
        Spider spider = Spider.create(articleProcessor);
        spider.addUrl("https://blog.csdn.net/nav/db");
        articleTxtPipeline.setChannelId("db");
        spider.addPipeline(articleTxtPipeline);
        spider.setScheduler(redisScheduler);
        spider.start();
    }

    /**
     * Crawl web articles
     */
    @Scheduled(cron = "20 27 11 * * ?")
    public void webTask() {
        System.out.println("Crawling WEB articles");
        Spider spider = Spider.create(articleProcessor);
        spider.addUrl("https://blog.csdn.net/nav/web");
        articleTxtPipeline.setChannelId("web");
        spider.addPipeline(articleTxtPipeline);
        spider.setScheduler(redisScheduler);
        spider.start();
    }
}

The cron expressions use Spring's six-field second/minute/hour/day/month/weekday format, so "0 54 21 * * ?" fires at 21:54:00 every day. Note that spider.start() launches the crawl asynchronously, so each scheduled method returns immediately.

4 Crawling Tensquare User Data

4.1 Requirements Analysis
Crawl user nicknames and avatars from CSDN, save them to the user table, and store the avatar images on the local disk.
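Saving the avatar to local disk is plain Java I/O rather than WebMagic functionality. A minimal sketch of the download step follows; the target directory, file naming, and the sample URL are hypothetical, for illustration only:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AvatarDownloader {

    /**
     * Download an avatar image to the given local directory and return the saved path.
     */
    public static Path download(String imageUrl, String dir, String fileName) throws Exception {
        URLConnection conn = new URL(imageUrl).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // some servers reject requests without a UA
        Path target = Paths.get(dir, fileName);
        Files.createDirectories(target.getParent());
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        // hypothetical avatar URL and local path, for illustration only
        download("https://profile.csdnimg.cn/avatar.jpg", "c:/userimg", "avatar.jpg");
    }
}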

