A Quick Start to Python Web Scraping: Scraping Static Pages (Part 6)


Putting it all together, here is the complete code that uses IP proxies combined with time.sleep() to scrape the Douban Books Top 250 list and write the results to a file:
```python
import requests
from random import choice
from bs4 import BeautifulSoup as BeS
from time import sleep as pause


def spider(url, filename, proxies):
    """Fetch one Top 250 page through a proxy and append book titles/links to a file."""
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/84.0.4147.125 Safari/537.36'}
    res = requests.get(url, proxies=proxies, headers=headers)
    soup = BeS(res.text, 'html.parser')
    # Each book entry is an <a> tag inside <div class="pl2">.
    items = soup.select('div.pl2 a')
    with open(filename, 'a', encoding=res.encoding) as f:
        for item in items:
            f.write(item['title'] + " " + item['href'] + "\n")


filename = 'doubanTop250.txt'
pages = []
proxies_list = []
# The list spans 10 pages, 25 books per page (start=0, 25, ..., 225).
for i in range(0, 250, 25):
    ip_1 = "http://10.10.1.1%s:3128" % str(i // 25)
    ip_2 = "http://10.10.1.1%s:1080" % str(i // 25)
    douban_book = 'https://book.douban.com/top250?start=%s' % str(i)
    prox = {
        "http": ip_1,
        "https": ip_2,
    }
    pages.append(douban_book)
    proxies_list.append(prox)

for page in pages:
    # Pick a random proxy from the pool for each request,
    # and pause one second between pages to throttle requests.
    proxies = choice(proxies_list)
    spider(page, filename, proxies)
    pause(1)
```

Note that the IP addresses in the proxy pool above are made up and not actually usable, so this code will not run successfully as written; it is shown only to illustrate the complete structure.
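Since the proxies above are placeholders, a real run would first need to filter the pool down to proxies that actually respond. Below is a minimal sketch of one way to do that with requests; the helper name `proxy_works`, the test URL, and the timeout value are illustrative assumptions, not part of the original code.

```python
import requests


def proxy_works(proxy, test_url="https://book.douban.com", timeout=5):
    """Return True if the given proxy dict can fetch test_url within the timeout."""
    try:
        resp = requests.get(test_url, proxies=proxy, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        # Covers connection errors, timeouts, and invalid proxy responses.
        return False


# Example: keep only the working proxies from a pool built as above.
# working_proxies = [p for p in proxies_list if proxy_works(p)]
```

Filtering the pool once up front avoids wasting a request (and a one-second pause) on a dead proxy mid-crawl.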



