Practical code

import requests
import pandas as pd
import re
import time
import random

df = pd.DataFrame()
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
for page in range(0, 1360, 5):
    url = f'https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={page}&platform=desktop&sort_by=default'
    response = requests.get(url=url, headers=headers).json()
    data = response['data']
    for list_ in data:
        name = list_['author']['name']  # author name
        id_ = list_['author']['id']  # author id
        created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(list_['created_time']))  # answer time
        voteup_count = list_['voteup_count']  # upvote count
        comment_count = list_['comment_count']  # comment count
        content = list_['content']  # answer body (HTML)
        # keep only Chinese characters and full-width punctuation
        # (the escapes in this pattern need their backslashes: \u3002, \uff1b, ...)
        content = ''.join(re.findall("[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]", content))
        print(name, id_, created_time, comment_count, content, sep='|')
        dataFrame = pd.DataFrame({'知乎作者': [name], '作者id': [id_], '回答时间': [created_time], '赞同数': [voteup_count], '底下评论数': [comment_count], '回答内容': [content]})
        df = pd.concat([df, dataFrame])
    time.sleep(random.uniform(2, 3))
df.to_csv('知乎回答.csv', encoding='utf-8', index=False)
print(df.shape)
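As a quick sanity check of the character-class regex used above (the `\uXXXX` escapes need their backslashes, which the pattern as printed had lost), the snippet below shows it stripping HTML tags and non-Chinese text from an answer body:

```python
import re

# Same character class as the scraper: full-width Chinese punctuation plus
# the CJK range \u4e00-\u9fa5; everything else (tags, Latin text) is dropped.
pattern = "[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]"

html_content = '<p>这是一个回答，包含<b>标签</b>和English words。</p>'
cleaned = ''.join(re.findall(pattern, html_content))
print(cleaned)  # → 这是一个回答，包含标签和。
```

Because Zhihu returns answer bodies as HTML, this single filter doubles as a crude tag stripper, at the cost of also discarding digits and Latin text.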
Results:
Weibo

This section takes the trending Weibo post 《霍尊手写道歉信》 as an example to show how to scrape Weibo comments.
Page URL:
https://m.weibo.cn/detail/4669040301182509
Analyzing the page

Weibo comments are loaded dynamically. Open the browser's developer tools and scroll down the page to capture the data packets we need.
This gives the real URLs:
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id=3698934781006193&max_id_type=0
The difference between the two URLs is clear: the first one has no max_id parameter; max_id only appears from the second request onward, and its value is the max_id returned in the previous response.
One thing to watch out for is the max_id_type parameter: it also changes over time, so we need to read max_id_type from each response as well.
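The chaining rule described above can be sketched as a small helper (build_params is a hypothetical name introduced here, not from the article; it only encodes the parameter handoff between requests):

```python
# Hypothetical helper illustrating the pagination rule: the first request
# carries no max_id, and every later request copies max_id and max_id_type
# out of the previous response's data.
def build_params(prev_data=None):
    params = {'id': '4669040301182509', 'mid': '4669040301182509'}
    if prev_data is None:
        # first page: only max_id_type, exactly like the first URL above
        params['max_id_type'] = 0
    else:
        params['max_id'] = prev_data['max_id']
        params['max_id_type'] = prev_data['max_id_type']
    return params

print(build_params())
print(build_params({'max_id': 3698934781006193, 'max_id_type': 1}))
```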
Practical code

import re
import requests
import pandas as pd
import time
import random

df = pd.DataFrame()
try:
    a = 1
    while True:
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        }
        response = requests.get('https://m.weibo.cn/detail/4669040301182509', headers=header)
        # (the listing breaks off here in the source)
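Since the listing is truncated, here is a hedged sketch of how the loop could continue, assuming the response layout shown in the captured packets (a data object carrying max_id, max_id_type, and the comment list; parse_page is a helper name invented for this sketch, not from the article):

```python
# Hedged, offline-testable sketch: extract comment rows plus the pagination
# cursor from one hotflow response. The keys data/max_id/max_id_type and
# user/screen_name/text follow the packets captured above; any other naming
# here is an assumption.
def parse_page(payload):
    data = payload['data']
    rows = [(c['user']['screen_name'], c['text']) for c in data['data']]
    return rows, data['max_id'], data['max_id_type']

# Demo on a synthetic payload shaped like the real response:
sample = {'data': {'max_id': 3698934781006193, 'max_id_type': 0,
                   'data': [{'user': {'screen_name': '示例用户'},
                             'text': '很感动'}]}}
rows, max_id, max_id_type = parse_page(sample)
print(rows, max_id, max_id_type)

# In the real loop these would feed the next request, e.g.:
# requests.get('https://m.weibo.cn/comments/hotflow', headers=header,
#              params={'id': '4669040301182509', 'mid': '4669040301182509',
#                      'max_id': max_id, 'max_id_type': max_id_type})
```

Separating the parsing from the network call keeps the max_id handoff testable without hitting Weibo, and makes the random sleep between requests (as in the Zhihu scraper) easy to slot into the fetch loop.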