Scraping Qiushibaike with Python

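This code was found online. It relies mainly on the urllib2 networking module and the re regular-expression module: it fetches the page content first, then uses regular expressions to remove the HTML tags and prints the entries one by one. The code is as follows: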

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script (uses urllib2, the print statement and raw_input).
import urllib2
import re

URL = 'http://www.qiushibaike.com/hot/page/'
# Alternative pattern: re.compile(r'<div class="content"[^>]*>.*?(?=</div>)')
# Match from each content div's opening tag up to (not including) </div>.
first = re.compile(r'<div class="content".*?(?=</div>)')
# Keep everything after the first '>', i.e. drop the opening tag.
second = re.compile(r'(?<=>).*')

def main():
    recCount = 5   # number of entries shown before pausing
    total = 1      # running entry number
    ipage = 1      # current page number
    while True:
        content = urllib2.urlopen(URL + str(ipage)).readlines()
        # Join the page into a single line so '.' can match across line breaks.
        alls = ''.join(s.strip() for s in content)
        # Advance once per loop (the original incremented twice and skipped pages).
        ipage += 1
        fs = first.findall(alls)
        thispage = [second.findall(s.strip())[0] for s in fs if s]
        for i, p in enumerate(thispage):
            print total, ' ', p
            total += 1
            if (i + 1) % recCount == 0:
                raw_input('\nPress Key To Start More\n')

if __name__ == '__main__':
    main()

The code is fairly simple: it defines a single main function and calls it at the end. The key points are:

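findall() is a function in the re module that returns all matches of a pattern.
strip() removes characters from both ends of a string; called without arguments, it removes whitespace (including '\n', '\r', '\t' and ' ').
enumerate() is cleaner and more concise than more complex expressions built from range, list and the like when you need both the index and the element while iterating over a list or array.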
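To make these three points concrete, here is a small sketch written in Python 3 syntax (the script above targets Python 2). It runs the script's two patterns against a made-up HTML fragment; the `sample` string and its contents are assumptions invented for illustration, not real page output.

```python
import re

# The two patterns from the script above.
first = re.compile(r'<div class="content".*?(?=</div>)')
second = re.compile(r'(?<=>).*')

# Hypothetical HTML fragment, invented for this example.
sample = (
    '  <div class="content" title="a">first joke</div>\n'
    '  <div class="content">second joke</div>\n'
)

# strip() with no argument trims whitespace from both ends of each line.
joined = ''.join(line.strip() for line in sample.splitlines())

# findall() returns every non-overlapping match as a list of strings.
blocks = first.findall(joined)

# The lookbehind keeps everything after the first '>' in each block.
items = [second.findall(block)[0] for block in blocks]

# enumerate() yields (index, element) pairs; start=1 numbers from 1.
for number, text in enumerate(items, start=1):
    print(number, text)
```

Each block returned by `first` still contains the opening div tag, which is why the second pattern is needed to cut it off.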

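Note: one imperfection in the Python code above is that line-break markup in the scraped content (presumably <br/> tags) is not removed. This is easy to fix with a regular expression or with strip, but PHP's strip_tags has the edge here, since Python has no built-in strip_tags function.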
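As a rough illustration of the regular-expression route, the sketch below defines a small helper that behaves loosely like PHP's strip_tags. The helper name `strip_tags`, its `br_replacement` parameter, both patterns and the sample input are assumptions made for this example, and it is written in Python 3 syntax. A regex like this is adequate for simple fragments but is not a robust HTML parser.

```python
import re

# Assumed patterns: any <br> variant, and any other tag.
BR = re.compile(r'<br\s*/?>', re.IGNORECASE)
TAG = re.compile(r'<[^>]+>')

def strip_tags(html, br_replacement=' '):
    """Replace <br> tags with br_replacement, then drop all remaining tags."""
    text = BR.sub(br_replacement, html)
    text = TAG.sub('', text)
    return text.strip()

# Hypothetical scraped fragment with a leftover line-break tag.
print(strip_tags('<div class="content">line one<br/>line two'))
```

Replacing `<br>` before removing the other tags keeps the two lines from running together.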

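Reflection: this example would also be fairly simple to implement in PHP. The general approach is outlined below, without a complete implementation: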

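// Fetch the page content
$html = file_get_contents("http://www.qiushibaike.com/hot/page/");
// Match all content blocks with preg_match_all (pattern reused from the Python script, as an assumption)
preg_match_all('/<div class="content".*?(?=<\/div>)/s', $html, $matches);
// Remove all HTML tags with strip_tags()
$items = array_map('strip_tags', $matches[0]);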

Note: many hosts do not allow file_get_contents() to fetch remote pages, in which case PHP's curl functions are needed instead. The curl version takes somewhat more code.
