Scraping Qiushibaike with Python
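This code was found online. It relies mainly on the urllib2 networking module and the re regular-expression module: it first fetches the page content, then uses regular expressions to strip the HTML tags, and finally displays the content one post at a time. The code is as follows:

    #!/usr/bin/env python
    #-*-encoding=utf-8 -*-
    import urllib2
    import re

    URL = 'http://www.qiushibaike.com/hot/page/'
    #first = re.compile(r'<div class="content"[^>]*>.*?(?=</div>)')
    # Match each content div up to (but not including) its closing </div>.
    first = re.compile(r'<div class="content".*?(?=</div>)')
    # Keep everything after the first '>' (i.e. drop the opening tag).
    second = re.compile(r'(?<=>).*')

    def main():
        recCount = 5
        total = 1
        ipage = 1
        while True:
            content = urllib2.urlopen(URL + str(ipage)).readlines()
            # Join the stripped lines so that '.' can match across the whole page.
            alls = ''
            for s in content:
                alls += s.strip()
            #print first.findall(alls)
            ……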
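The listing above is cut off before the extraction step, so here is a minimal sketch of the same fetch-then-regex idea, written for Python 3 (urllib.request replaces urllib2). The two patterns and the URL are taken from the script above; the function names fetch_page and extract_posts, the timeout parameter, and the extra TAG pattern for leftover inline tags are assumptions added for illustration, not part of the original. The site's markup may have changed or the site may no longer be reachable, so treat this as an illustration of the technique rather than a working scraper.

```python
import re
import urllib.request

URL = 'http://www.qiushibaike.com/hot/page/'

# Same two patterns as the script above.
FIRST = re.compile(r'<div class="content".*?(?=</div>)')
SECOND = re.compile(r'(?<=>).*')
# Assumption: a generic pattern to drop any tags left inside a post.
TAG = re.compile(r'<[^>]+>')


def fetch_page(page, timeout=10):
    """Download one listing page and return it as decoded text."""
    with urllib.request.urlopen(URL + str(page), timeout=timeout) as resp:
        return resp.read().decode('utf-8', errors='replace')


def extract_posts(html):
    """Return the text of every <div class="content"> block in html."""
    # '.' does not match newlines, so join the stripped lines first,
    # as the original script does with its 'alls' string.
    flat = ''.join(line.strip() for line in html.splitlines())
    posts = []
    for block in FIRST.findall(flat):
        match = SECOND.search(block)  # text after the opening tag's '>'
        if match:
            text = TAG.sub('', match.group()).strip()
            if text:
                posts.append(text)
    return posts
```

To print the posts of the first page one by one, call extract_posts(fetch_page(1)) and loop over the returned list. Keeping the download and the parsing in separate functions means the regular expressions can be checked against a saved HTML string without touching the network.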