Quickly scraping housing listings with Python 3 and saving them to MySQL, step by step

I wanted to build a fun little project. First, let's organize the approach: how to scrape the key information quickly, and how to make the crawler page through the results automatically.
After some thought, I settled on the most conventional toolkit: requests plus re regular expressions, with BeautifulSoup for extracting fields in bulk.
import requests
import re
from bs4 import BeautifulSoup
import pymysql

Next come the URLs. Note the anti-crawler mechanism here: the first page must be requested as https://tianjin.anjuke.com/sale/, and every later page as 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page; otherwise the mechanism flags the requests as a crawler and nothing can be fetched. The same branching also implements page turning:
page = 1
while page < 11:
    print("This is page " + str(page))
    if page == 1:
        url = 'https://tianjin.anjuke.com/sale/'
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    else:
        url = 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    html = requests.get(url, headers=headers)
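A quick aside: the original draft also experimented with selenium, a proxy pool, allow_redirects=False, and time.sleep throttling, all left commented out. A minimal sketch of how the throttling and a redirect check could fit together; fetch_page is a hypothetical helper, and treating any non-200 status as a block is my assumption about how the anti-crawler response manifests:

import time
import requests

def fetch_page(url, headers):
    # allow_redirects=False lets an anti-bot redirect (for example, to a
    # verification page) surface as a 3xx status code instead of silently
    # returning the wrong page.
    resp = requests.get(url, headers=headers, allow_redirects=False)
    if resp.status_code != 200:
        raise RuntimeError('possible anti-crawler block: HTTP %d' % resp.status_code)
    time.sleep(1)  # pause between pages, as in the commented-out draft
    return resp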
 

Step two is naturally to analyze the page markup. Start by finding the listing photo:

[Figure: the <img> tag for a listing photo in the page source]
Time to fire up the regular expressions!
# Image URL
myjpg = r'<img src="(.*?)" width="180" height="135" />'
jpg = re.findall(myjpg, html.text)
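As a cross-check, the same photo URLs can be collected with BeautifulSoup instead of a regex. A sketch, assuming the photo tags keep their fixed width/height attributes; jpg_bs is a hypothetical name, and soup here is the same parse tree built a few lines below:

from bs4 import BeautifulSoup

# Collect the photo URLs by attribute matching instead of a regex.
soup = BeautifulSoup(html.content, 'lxml')
jpg_bs = [img.get('src') for img in soup.find_all('img', width='180', height='135')]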
With the photo URLs captured, follow the same recipe and grab the remaining fields just as quickly:

# Description (listing title)
mytail = r'<a data-from="" data-company=""title="(.*?)" href'
tail = re.findall(mytail, html.text)

# Total price
totalprice = r'<span class="price-det"><strong>(.*?)</strong>'
mytotal = re.findall(totalprice, html.text)

# Unit price
simpleprice = r'<span class="unit-price">(.*?)</span> '
simple = re.findall(simpleprice, html.text)

Next, use BeautifulSoup to pull values out by tag. I use the lxml parser here because it is fast; html.parser works too:
soup = BeautifulSoup(html.content, 'lxml')
Look at the figure: the markup is full of line breaks and the span tags carry no class names, so it's time to call in our guest star, bs4.
 
[Figure: the .details-item markup, with line breaks and unnamed span tags]
A loop collects the text of every span, since everything on a page is scraped in one pass: 300 span values in total. Because a page holds only 60 listings (60 photos), the flat list is split into groups of five, one group per listing. The re.sub call strips the whitespace out of the last field so the value can be stored cleanly in the database.
# House details: five span values per listing
itemdetail = soup.select(".details-item span")
you = []
my = []
for i in itemdetail:
    you.append(i.get_text())
k = 0
while k < 60:
    my.append([you[5 * k], you[5 * k + 1], you[5 * k + 2], you[5 * k + 3],
               re.sub(r'\s', "", you[5 * k + 4])])  # \s strips whitespace
    k = k + 1
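The index arithmetic above can also be expressed with slicing. A minimal equivalent sketch; texts and grouped are hypothetical names:

# Regroup the flat list of span texts into one five-field record per listing.
texts = [s.get_text() for s in itemdetail]
# Assumes len(texts) is a multiple of five (five spans per listing).
grouped = [texts[i:i + 5] for i in range(0, len(texts), 5)]
my = [[a, b, c, d, re.sub(r'\s', '', e)] for a, b, c, d, e in grouped]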
Next, write everything into the database!

db = pymysql.connect(host='localhost', user='root', password='', database='anjuke')
conn = db.cursor()
print(len(jpg))
for i in range(0, len(tail)):
    jpgs = jpg[i]
    scripts = tail[i]
    localroom = my[i][0]
    localarea = my[i][1]
    localhigh = my[i][2]
    localtimes = my[i][3]
    local = my[i][4]
    total = mytotal[i]
    oneprice = simple[i]
    sql = "insert into shanghai_admin values('%s','%s','%s','%s','%s','%s','%s','%s','%s')" % (
        jpgs, scripts, local, total, oneprice, localroom, localarea, localhigh, localtimes)
    conn.execute(sql)
db.commit()
db.close()
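One caveat: building the INSERT with % string interpolation breaks on quotes inside a listing title and is open to SQL injection. pymysql supports parameter binding, so a safer equivalent of the statement above (same column order) would be:

sql = ("insert into shanghai_admin "
       "values (%s, %s, %s, %s, %s, %s, %s, %s, %s)")
conn.execute(sql, (jpgs, scripts, local, total, oneprice,
                   localroom, localarea, localhigh, localtimes))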
 
All done! Let's look at the result!

[Figure: the scraped listings stored as rows in the MySQL table]
Here is the complete code:
import requests
import re
from bs4 import BeautifulSoup
import pymysql

page = 1
db = pymysql.connect(host='localhost', user='root', password='', database='anjuke')
conn = db.cursor()

while page < 11:
    print("This is page " + str(page))
    if page == 1:
        url = 'https://tianjin.anjuke.com/sale/'
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    else:
        url = 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.content, 'lxml')

    # Image URL
    myjpg = r'<img src="(.*?)" width="180" height="135" />'
    jpg = re.findall(myjpg, html.text)
    # Description (listing title)
    mytail = r'<a data-from="" data-company=""title="(.*?)" href'
    tail = re.findall(mytail, html.text)
    # Total price
    totalprice = r'<span class="price-det"><strong>(.*?)</strong>'
    mytotal = re.findall(totalprice, html.text)
    # Unit price
    simpleprice = r'<span class="unit-price">(.*?)</span> '
    simple = re.findall(simpleprice, html.text)

    # House details: five span values per listing
    itemdetail = soup.select(".details-item span")
    you = []
    my = []
    for i in itemdetail:
        you.append(i.get_text())
    k = 0
    while k < 60:
        my.append([you[5 * k], you[5 * k + 1], you[5 * k + 2], you[5 * k + 3],
                   re.sub(r'\s', "", you[5 * k + 4])])
        k = k + 1

    # Store this page's listings
    for i in range(0, len(tail)):
        jpgs = jpg[i]
        scripts = tail[i]
        localroom = my[i][0]
        localarea = my[i][1]
        localhigh = my[i][2]
        localtimes = my[i][3]
        local = my[i][4]
        total = mytotal[i]
        oneprice = simple[i]
        sql = "insert into shanghai_admin values('%s','%s','%s','%s','%s','%s','%s','%s','%s')" % (
            jpgs, scripts, local, total, oneprice, localroom, localarea, localhigh, localtimes)
        conn.execute(sql)
    db.commit()
    page = page + 1

db.close()
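The script assumes the anjuke database and a nine-column shanghai_admin table already exist. The post never shows the schema, so the column names and types below are hypothetical, ordered to match the INSERT; running this once before the scraping loop gives the INSERTs a target:

# Hypothetical schema: the post does not show the real column definitions.
conn.execute("""
    create table if not exists shanghai_admin (
        jpg         varchar(500),
        title       varchar(255),
        address     varchar(255),
        total_price varchar(50),
        unit_price  varchar(50),
        rooms       varchar(50),
        area        varchar(50),
        floor       varchar(50),
        build_year  varchar(50)
    ) default charset = utf8mb4
""")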

