谷歌爬虫主要是用C++开发吗

从谷歌创始人的论文《The Anatomy of a Large-Scale Hypertextual Web Search Engine》里可以看到,谷歌一开始的爬虫是用Python写的。http://infolab.stanford.edu/pub/papers/google.pdf4.3 Crawling the WebIn order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains a its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.据说现在改成了C++,但是我没有找到明确的材料。
■网友
目前是的Why did Google move from Python to C++ for use in its crawler?


    推荐阅读