A multithreaded crawler may fetch many pages concurrently, which creates a large bandwidth demand, and buying more bandwidth is expensive. If you don't want to pay for extra bandwidth but still want the crawler to fetch pages quickly and reliably, what can you do?
Most websites offer gzip compression, so we can have the crawler fetch compressed pages instead.
First, add an Accept-Encoding field to the HTTP request with the value gzip. If the site supports gzip, it returns the page source compressed with gzip; if not, it returns the plain-text source.
After fetching the page source, decompress it. If the body was not actually compressed, decompression will fail at this step, so we catch the error and fall back to the raw bytes.
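If you would rather not rely on the decompression error, the Content-Encoding response header tells you whether the server actually compressed the body. A minimal sketch of that alternative (the renren URL is just an example):

from urllib2 import build_opener
from gzip import GzipFile
from StringIO import StringIO

opener = build_opener()
opener.addheaders = [("Accept-Encoding", "gzip")]
response = opener.open("http://www.renren.com")
body = response.read()
# only decompress if the server says it honored Accept-Encoding
if response.info().get("Content-Encoding") == "gzip":
    body = GzipFile(fileobj=StringIO(body)).read()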
Here is a test program:
#!/usr/bin/env python
# encoding=utf-8
# coded by 一只死猫
from urllib2 import build_opener, HTTPCookieProcessor
from sys import argv
from gzip import GzipFile
from StringIO import StringIO

def Decompress(compressed_webcode):
    """Decompress the body, falling back to the raw bytes if it is not gzipped."""
    # clen = float(len(compressed_webcode))
    fo = StringIO(compressed_webcode)
    gz = GzipFile(fileobj=fo)
    try:
        webcode = gz.read()
    except IOError:  # not a gzip stream: the server sent plain text
        webcode = compressed_webcode
    return webcode
    # For the compression-ratio test, return this instead:
    # dlen = len(webcode)
    # return "compress rate:%f" % (clen / dlen * 100)

def GetWebcode(url):
    """Fetch the (possibly gzip-compressed) page source."""
    opener = build_opener(HTTPCookieProcessor())
    opener.addheaders = [("Accept-Language", "zh-CN"),
                         # ask the server for a gzip-compressed body
                         ("Accept-Encoding", "gzip, deflate"),
                         ("User-Agent", "Mozilla/4.0 (compatible; "
                          "MSIE 6.0; Windows NT 5.1; SV1)")]
    return opener.open(url).read()

if __name__ == "__main__":
    url = argv[1]
    print Decompress(GetWebcode(url))
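The script above targets Python 2; urllib2 and StringIO no longer exist in Python 3. A minimal sketch of the same technique on Python 3 (assuming Python 3.2+, where gzip.decompress is available):

import gzip
import sys
import urllib.request

def get_webcode(url):
    req = urllib.request.Request(url, headers={
        "Accept-Encoding": "gzip",  # ask for a compressed body
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    })
    resp = urllib.request.urlopen(req)
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)  # server sent gzip
    return body

if __name__ == "__main__":
    print(get_webcode(sys.argv[1])[:200])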
Compression-ratio test (run with the commented-out measurement lines in Decompress enabled):
C:\Users\i55m411\Desktop>compress http://www.renren.com
compress rate:32.424624
C:\Users\i55m411\Desktop>compress http://www.163.com
compress rate:27.295687
C:\Users\i55m411\Desktop>compress http://www.sina.com
compress rate:21.680490
C:\Users\i55m411\Desktop>compress http://www.sohu.com
compress rate:23.315886
C:\Users\i55m411\Desktop>compress http://www.baidu.com
compress rate:47.458990
C:\Users\i55m411\Desktop>compress http://www.google.com
compress rate:41.887044
C:\Users\i55m411\Desktop>compress http://www.qihoo.com
compress rate:30.858843
In these tests the compressed body is only about 21%-47% of the original size, so fetching gzip-compressed pages cuts the crawler's bandwidth needs by roughly 50%-80%.
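For reference, the ratios above correspond to the commented-out measurement lines in the script. The same measurement as a standalone sketch, reusing Decompress and GetWebcode from the script (paste it below them):

def CompressRate(url):
    compressed = GetWebcode(url)        # bytes on the wire
    clen = float(len(compressed))
    dlen = len(Decompress(compressed))  # bytes after decompression
    return "compress rate:%f" % (clen / dlen * 100)

print CompressRate(argv[1])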