
Trading Performance for Bandwidth

Category: Uncategorized | Posted by: admin

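Some multithreaded crawlers fetch many pages concurrently, which demands a lot of bandwidth, and adding bandwidth is expensive. If you don't want to pay for more bandwidth but still want the crawler to fetch pages quickly and reliably, what can you do?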

Most websites support gzip compression, so we can have the crawler fetch compressed pages.

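First, add an Accept-Encoding field with the value gzip to the HTTP request. If the site supports gzip, it returns the page source compressed with gzip and marks the response with a Content-Encoding: gzip header; if it does not, it returns the page source uncompressed.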
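The request and response headers described above can be sketched as follows. This is a minimal illustration in Python 3, not the author's program; the helper names `build_gzip_request`, `is_gzip_response`, and `report_encoding` are hypothetical and chosen only for this example.

```python
from urllib.request import Request, urlopen


def build_gzip_request(url):
    """Build a request that tells the server a gzip-compressed body is acceptable."""
    return Request(url, headers={"Accept-Encoding": "gzip"})


def is_gzip_response(headers):
    """Return True when the response headers mark the body as gzip-compressed.

    `headers` is any mapping with a .get() method, such as response.headers.
    """
    encoding = headers.get("Content-Encoding") or ""
    return encoding.strip().lower() == "gzip"


def report_encoding(url):
    """Fetch url (needs network access) and report how the body was encoded."""
    with urlopen(build_gzip_request(url)) as response:
        return "gzip" if is_gzip_response(response.headers) else "uncompressed"
```

Checking Content-Encoding tells the crawler whether the server actually honoured the request, so it can decide up front whether the body needs decompressing.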

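After fetching the page source, decompress it. A page that was not compressed will raise an error at this step, so the error has to be caught and the data used as received.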
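The failure mode can be made concrete with a small Python 3 sketch. In Python 3, `gzip.decompress` raises `OSError` when the data is not a gzip stream, so the sketch first checks the two-byte gzip signature and only then tries to decompress. The name `gunzip_or_passthrough` is hypothetical, and the signature check is an addition for illustration rather than something the original program does.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def gunzip_or_passthrough(body):
    """Return the gunzipped body, or the body unchanged if it is not valid gzip data."""
    if not body.startswith(GZIP_MAGIC):
        # Plain page source: the server ignored Accept-Encoding.
        return body
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # Looked like gzip but was corrupt or truncated: keep the raw bytes.
        return body
```

Catching only the decompression errors, instead of everything, keeps unrelated failures such as a KeyboardInterrupt from being silently swallowed.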

Here is a test program:

    #!/usr/bin/env python3
    # encoding=utf-8
    # coded by 一只死猫
    # Ported to Python 3: urllib2 became urllib.request, and gzip.decompress
    # replaces the GzipFile/StringIO pair.
    import gzip
    import sys
    import zlib
    from urllib.request import build_opener, HTTPCookieProcessor


    def Decompress(compressed_webcode):  # decompression function
        # clen = float(len(compressed_webcode))
        try:
            webcode = gzip.decompress(compressed_webcode)
        except (OSError, EOFError, zlib.error):
            # Not gzip data: the server sent the page uncompressed.
            webcode = compressed_webcode
        return webcode
        # For the compression ratio test, enable clen above and replace
        # the return statement with these two lines:
        # dlen = len(webcode)
        # return "compress rate:%f" % (clen / dlen * 100)


    def GetWebcode(url):  # fetch the compressed page source
        opener = build_opener(HTTPCookieProcessor())
        opener.addheaders = [
            ("Accept-Language", "zh-CN"),
            # Ask the server to send the page source compressed with gzip.
            # Only gzip is requested, because Decompress cannot handle deflate.
            ("Accept-Encoding", "gzip"),
            ("User-Agent", "Mozilla/4.0 (compatible; "
                           "MSIE 6.0; Windows NT 5.1; SV1)"),
        ]
        return opener.open(url).read()


    if __name__ == "__main__":
        url = sys.argv[1]
        # Write the raw bytes so that no page charset has to be assumed.
        sys.stdout.buffer.write(Decompress(GetWebcode(url)))

 

Compression ratio test (run with the commented-out lines in Decompress enabled; the rate is the compressed size as a percentage of the decompressed size):

C:\Users\i55m411\Desktop>compress http://www.renren.com
compress rate:32.424624

C:\Users\i55m411\Desktop>compress http://www.163.com
compress rate:27.295687

C:\Users\i55m411\Desktop>compress http://www.sina.com
compress rate:21.680490

C:\Users\i55m411\Desktop>compress http://www.sohu.com
compress rate:23.315886

C:\Users\i55m411\Desktop>compress http://www.baidu.com
compress rate:47.458990

C:\Users\i55m411\Desktop>compress http://www.google.com
compress rate:41.887044

C:\Users\i55m411\Desktop>compress http://www.qihoo.com
compress rate:30.858843

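The compressed pages were about 22% to 47% of their original size, so gzip reduces the bandwidth requirement by roughly 50% to 80%. The price is the extra CPU time spent decompressing each page.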
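The step from the measured rates to the bandwidth figure is simple arithmetic, sketched below in Python 3. The rates are copied from the test output above; the names `RATES` and `bandwidth_saving` are hypothetical and exist only in this example.

```python
# Rates from the test above: compressed size as a percentage of decompressed size.
RATES = {
    "www.renren.com": 32.424624,
    "www.163.com": 27.295687,
    "www.sina.com": 21.680490,
    "www.sohu.com": 23.315886,
    "www.baidu.com": 47.458990,
    "www.google.com": 41.887044,
    "www.qihoo.com": 30.858843,
}


def bandwidth_saving(rate_percent):
    """Percentage of bytes saved when a body shrinks to rate_percent of its size."""
    return 100.0 - rate_percent


savings = {site: bandwidth_saving(rate) for site, rate in RATES.items()}
lowest = min(savings.values())    # the page that compressed worst
highest = max(savings.values())   # the page that compressed best
average = sum(savings.values()) / len(savings)

for site, saving in sorted(savings.items(), key=lambda item: item[1]):
    print("%-16s saves %.1f%%" % (site, saving))
print("range: %.1f%% to %.1f%%, mean %.1f%%" % (lowest, highest, average))
```

For these seven pages the saving ranges from about 52.5% (www.baidu.com) to about 78.3% (www.sina.com), with a mean near 67.9%, which matches the rough 50% to 80% estimate.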

Please credit the source when reposting: 爱开源 » Trading Performance for Bandwidth
