When scraping with Python, you often need to extract image or file URLs from HTML and download them to the local disk. Here are the three most commonly used modules for downloading: urllib, urllib2, and requests. The code (Python 2):

import urllib
import urllib2
import requests

url = 'http://www.test.com/wp-content/uploads/2012/06/wxDbViewer.zip'

print "downloading with urllib"
urllib.urlretrieve(url, "code.zip")

print "downloading with urllib2"
f = urllib2.urlopen(url)
data = f.read()
with open("code2.zip", "wb") as code:
    code.write(data)

print "downloading with requests"
r = requests.get(url)
with open("code3.zip", "wb") as code:
    code.write(r.content)

urllib looks the simplest: a single statement does the job. The urllib2 version can of course be shortened to:

f = urllib2.urlopen(url)
with open("code2.zip", "wb") as code:
    code.write(f.read())

Both urllib2.urlopen and requests.get also accept a timeout parameter, which keeps the scraper from blocking indefinitely on a stalled connection. Besides the modules above, you can also use the pycurl module to download files.
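As a minimal self-contained sketch of the timeout point (written against Python 3's urllib.request, into which urllib and urllib2 were later merged; the throwaway local server is only there so the example runs without assuming any external host):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A tiny local server stands in for the remote host so the example
# is runnable anywhere; replace `url` with your real target.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"dummy zip bytes"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/code.zip" % server.server_address[1]

# timeout is in seconds; a stalled connection raises an exception
# (socket.timeout / URLError) instead of hanging the scraper forever.
with urllib.request.urlopen(url, timeout=5) as f:
    data = f.read()

print(len(data))  # 15

server.shutdown()
```

requests.get takes the same keyword (`requests.get(url, timeout=5)`); in Python 2 the equivalent call is `urllib2.urlopen(url, timeout=5)`.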

import pycurl
import StringIO

url = 'http://www.test.com/wp-content/uploads/2012/06/wxDbViewer.zip'

##### init the env ###########
c = pycurl.Curl()
c.setopt(pycurl.COOKIEFILE, "cookie_file_name")  # read cookies from this file
c.setopt(pycurl.COOKIEJAR, "cookie_file_name")   # save cookies to this file
c.setopt(pycurl.FOLLOWLOCATION, 1)               # follow HTTP redirects
c.setopt(pycurl.MAXREDIRS, 5)                    # but at most 5 of them
# Proxy settings: uncomment and fill in suitable values if needed
#c.setopt(pycurl.PROXY, 'http://11.11.11.11:8080')
#c.setopt(pycurl.PROXYUSERPWD, 'aaa:aaa')

########### get the data && save to file ###########
head = ['Accept:*/*',
        'User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0']
buf = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, buf.write)  # collect the response body in memory
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, head)
c.perform()
the_page = buf.getvalue()
buf.close()
with open("./%s" % ("img_filename",), 'wb') as f:
    f.write(the_page)
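One caveat with all of the snippets above: read(), r.content, and the StringIO buffer each hold the entire file in memory before writing it out, which is wasteful for large downloads. A streaming sketch using only the Python 3 standard library (the payload size, filename, and local stand-in server are illustrative assumptions, not part of the original post):

```python
import os
import shutil
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"x" * 200000  # ~200 KB of dummy data served locally

# Local stand-in server; point `url` at your real file instead.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/big.zip" % server.server_address[1]

# Copy the response to disk in 64 KB chunks instead of buffering
# the whole body in memory first.
with urllib.request.urlopen(url, timeout=10) as resp, \
        open("big_download.zip", "wb") as out:
    shutil.copyfileobj(resp, out, length=64 * 1024)

print(os.path.getsize("big_download.zip"))  # 200000

server.shutdown()
```

requests offers the same pattern via `requests.get(url, stream=True)` plus `r.iter_content(chunk_size=...)`, and pycurl can stream by passing an open file's `write` method to WRITEFUNCTION instead of a StringIO buffer.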