Author: 南小小川 / 南川笔记
This tutorial is based entirely on Python 3. It uses the Chrome browser to debug pages, the Scrapy framework to crawl data, and MongoDB to store it; this combination was chosen for being mature, stable, fast, and widely used. Along the way it may also touch on Requests + BeautifulSoup parsing, the redis database, and the Django/Flask frameworks. It is aimed at readers who already have some scraping experience and want to crawl data from mainstream sites.
Workflow
Preliminary research, analysis, and summarizing yielded the following path for implementing this project:
[Figure: project workflow diagram]
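On the storage side of this workflow, the Scrapy items end up in MongoDB. A minimal sketch of an item pipeline follows; the setting names (`MONGO_URI`, `MONGO_DB`), the one-collection-per-item-type convention, and the class name are all assumptions, not the article's actual code:

```python
class MongoPipeline:
    """Sketch of a Scrapy item pipeline that writes items into MongoDB.

    Assumptions: settings MONGO_URI / MONGO_DB exist, every item has an
    '_id' field (as the spider below sets), and each item type gets its
    own collection (SingerItem -> 'singer', AlbumItem -> 'album', ...).
    """

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline from project settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'netease'),
        )

    def open_spider(self, spider):
        import pymongo  # imported here so the sketch loads without pymongo installed
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Derive the collection name from the item class, e.g. SongItem -> 'song'.
        name = type(item).__name__.replace('Item', '').lower()
        # Upsert on _id so re-crawls overwrite existing documents instead
        # of raising duplicate-key errors.
        self.db[name].replace_one({'_id': item['_id']}, dict(item), upsert=True)
        return item
```

To enable it, the pipeline would be registered under `ITEM_PIPELINES` in the project's `settings.py`.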
Anti-crawling analysis
- UA
- IP
- iFrame
- API
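The first two checks above (UA and IP) are typically countered in Scrapy with downloader middlewares. Below is a minimal sketch of User-Agent rotation; the UA strings are illustrative examples, not a vetted pool, and an IP countermeasure would look similar but set `request.meta['proxy']` instead:

```python
import random

# Illustrative User-Agent pool (assumption; replace with a maintained list).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]

class RotateUserAgentMiddleware:
    """Scrapy downloader middleware: pick a random User-Agent per request."""

    def process_request(self, request, spider):
        # Scrapy calls this for every outgoing request; returning None
        # lets the request continue with the modified header.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None

# Enabled in settings.py (module path is an assumption about project layout):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RotateUserAgentMiddleware': 400,
# }
```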
Core code
The following Scrapy crawl chain runs from the artist-category pages, through each artist's album list, down to the individual song pages inside each album:
```python
import json
import re
from datetime import datetime

from scrapy import Request

# The methods below belong to the project's scrapy.Spider subclass;
# SingerItem, AlbumItem, and SongItem come from the project's items.py.
# Note: the regexes must be raw strings (r'\d+'), which the web copy of
# this code had mangled into 'd+'.

def start_requests(self):
    for area in self._seq_area:
        for kind in self._seq_kind:
            for initial in self._seq_cat_initial:
                cat = f'{area}00{kind}'
                artists_url = self.settings['HOST_ARTISTS'].format(cat=cat, initial=initial)
                yield Request(artists_url, callback=self.parse_artists)

def parse_artists(self, response):
    for singer_node in response.css('#m-artist-box li'):
        response.meta['item'] = singer_item = SingerItem()
        singer_item['_id'] = singer_item['singer_id'] = singer_id = int(
            singer_node.css('a.nm::attr(href)').re_first(r'\d+'))
        singer_item['crawl_time'] = datetime.now()
        singer_item['singer_name'] = singer_node.css('a.nm::text').get()
        singer_item['singer_desc_url'] = self.get_singer_desc(singer_id)
        singer_item['singer_hot_songs'] = response.urljoin(
            singer_node.css('a.nm::attr(href)').re_first(r'\S+'))
        singer_item['cat_name'] = response.css('.z-slt::text').get()
        singer_item['cat_id'] = int(response.css('.z-slt::attr(href)').re_first(r'\d+'))
        singer_item['cat_url'] = response.urljoin(
            response.css('.z-slt::attr(href)').re_first(r'\S+'))
        yield singer_item
        yield Request(self.get_singer_albums(singer_id), callback=self.parse_albums)

def parse_albums(self, response):
    for li in response.css('#m-song-module li'):
        yield response.follow(li.css('a.msk::attr(href)').get(), callback=self.parse_songs)
    next_page = response.css('div.u-page a.znxt::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse_albums)

def parse_songs(self, response):
    album_item = AlbumItem()
    album_item['_id'] = album_item['album_id'] = int(
        re.search(r'id=(\d+)', response.url).group(1))
    album_item['album_name'] = response.css('h2::text').get()
    album_item['album_author'] = response.css('a.u-btni::attr(data-res-author)').get()
    album_item['album_author_id'] = int(
        response.css('p.intr:nth-child(2) a::attr(href)').re_first(r'\d+'))
    album_item['album_authors'] = [
        {'name': a.css('::text').get(), 'href': a.css('::attr(href)').get()}
        for a in response.css('p.intr:nth-child(2) a')]
    album_item['album_time'] = response.css('p.intr:nth-child(3)::text').get()
    album_item['album_url'] = response.url
    album_item['album_img'] = response.css('.cover img::attr(src)').get()
    album_item['album_company'] = response.css('p.intr:nth-child(4)::text').re_first(r'\w+')
    album_item['album_desc'] = (
        response.xpath('string(//div[@id="album-desc-more"])').get()
        if response.css('#album-desc-more')
        else response.xpath('string(.//div[@class="n-albdesc"]/p)').get())
    # Don't use 'span#cnt_comment_count::text' here: on albums without
    # comments it yields the literal text "评论" instead of a count.
    album_item['album_comments_cnt'] = int(response.css('#comment-box::attr(data-count)').get())
    album_item['album_songs'] = response.css('#song-list-pre-cache li a::text').getall()
    album_item['album_Appid'] = int(json.loads(
        response.css('script[type="application/ld+json"]::text').get())['appid'])
    yield album_item
    for li in response.css('#song-list-pre-cache li'):
        song_item = SongItem()
        song_item['crawl_time'] = datetime.now()
        song_item['song_name'] = li.css('a::text').get()
        song_item['_id'] = song_item['song_id'] = int(li.css('a::attr(href)').re_first(r'\d+'))
        song_item['song_url'] = response.urljoin(li.css('a::attr(href)').re_first(r'\S+'))
        yield song_item
    # try:  # hot-song info is in ... (the source article is truncated here)
```
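One detail worth calling out in the chain above: the artist, album, and song ids are all extracted from `href` attributes or URLs with regular expressions, and those patterns must be raw strings such as `r'\d+'`. Web copies of this code often lose the backslashes, leaving a pattern `d+` that silently matches the letter "d". A standalone check of the intended extraction (`extract_id` is a hypothetical helper for illustration; the spider itself uses Scrapy's `.re_first()` with the same pattern):

```python
import re

def extract_id(href):
    """Pull the numeric id out of a NetEase-style href such as '/artist?id=6452'."""
    m = re.search(r'id=(\d+)', href)
    return int(m.group(1)) if m else None

print(extract_id('/artist?id=6452'))  # → 6452
print(extract_id('/album?id=123'))    # → 123
```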