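Getting all the links on a page with PHP cURL
This post follows on from the previous two: the example here calls the functions from those posts to build a simple URL collector. In PHP, data is usually fetched from the web with file_get_contents, file, or cURL. cURL is reputed to be faster and more full-featured than file_get_contents and file, and better suited to scraping. Today we'll try using cURL to get all the links on a web page. The example is as follows: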
<?php
/*
 * Use cURL to collect all the links on hao123.com.
 */
include_once('function.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.hao123.com/');
// Include the response headers in the returned string (set to 0 if only the body is wanted)
curl_setopt($ch, CURLOPT_HEADER, 1);
// The page body is needed for link extraction, so CURLOPT_NOBODY stays disabled
// curl_setopt($ch, CURLOPT_NOBODY, 1);
// Return the result as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
$info = curl_getinfo($ch);
if ($html === false) {
    // Report the error and stop: there is nothing to parse
    echo "cURL Error: " . curl_error($ch);
    curl_close($ch);
    exit(1);
}
curl_close($ch);
$linkarr = _striplinks($html);
// Base URL, used to complete relative links
$host = 'http://www.hao123.com/';
$linkresult = array();
if (is_array($linkarr)) {
    foreach ($linkarr as $k => $v) {
        $linkresult[$k] = _expandlinks($v, $host);
    }
}
printf("<p>All links on this page:</p>%s\n", var_export($linkresult, true));
?>
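The fetch step of the script above can also be wrapped in a reusable function. The following is only a minimal sketch: `fetch_page()` and its `$timeoutSeconds` parameter are hypothetical names that do not appear in the original script, and the redirect and timeout values are assumptions rather than recommendations.

```php
<?php
// Hypothetical helper (not part of the original post): fetch a page body with cURL.
function fetch_page(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_HEADER         => false,           // keep response headers out of the body
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,               // assumed limit on redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up if connecting takes too long
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up if the whole transfer takes too long
    ));
    $body = curl_exec($ch);
    if ($body === false) {
        // Surface the cURL error instead of continuing with an empty page
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}
```

A call such as `$html = fetch_page('http://www.hao123.com/');` could then replace the `curl_init` through `curl_close` lines, with a `try`/`catch` around it to report failures.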
The contents of function.php are as follows (the two functions from the previous two posts, combined):
<?php
function _striplinks($document) {
    // Match <a ... href=...>: a quoted value lands in group 2, an unquoted one in group 3
    preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1) (.*?)\\1 | ([^\s>]+))'isx", $document, $links);
    // Concatenate the non-empty matches from the conditional subpattern
    // (foreach replaces each(), which was removed in PHP 8)
    $match = array();
    foreach ($links[2] as $val) {
        if (!empty($val)) {
            $match[] = $val;
        }
    }
    foreach ($links[3] as $val) {
        if (!empty($val)) {
            $match[] = $val;
        }
    }
    // return the links
    return $match;
}
/*===================================================================*
 Function: _expandlinks
 Purpose:  expand each link into a fully qualified URL
 Input:    $links the links to qualify
           $URI   the full URI to get the base from
 Output:   $expandedLinks the expanded links
*===================================================================*/
function _expandlinks($links, $URI)
{
    $URI_PARTS = parse_url($URI);
    $host = $URI_PARTS["host"];
    // Drop the query string, then a trailing file name and trailing slash
    preg_match("/^[^?]+/", $URI, $match);
    $match = preg_replace("|/[^/.]+\.[^/.]+$|", "", $match[0]);
    $match = preg_replace("|/$|", "", $match);
    $match_part = parse_url($match);
    $match_root =
        $match_part["scheme"]."://".$match_part["host"];
    $search = array( "|^http://".preg_quote($host)."|i",
                     "|^(/)|i",
                     "|^(?!http://)(?!mailto:)|i",
                     "|/\./|",
                     "|/[^/]+/\.\./|"
    );
    $replace = array( "",
                      $match_root."/",
                      $match."/",
                      "/",
                      "/"
    );
    $expandedLinks = preg_replace($search, $replace, $links);
    return $expandedLinks;
}
?>
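The regular expression in `_striplinks` is compact but hard to read and easy to break. As a point of comparison only, the sketch below extracts `href` values with PHP's DOM extension instead of a regex. This is a swapped-in technique, not the method used in this post. `extract_hrefs()` is a hypothetical name, and the sketch assumes the DOM extension is enabled.

```php
<?php
// Hypothetical alternative to _striplinks(): collect href values via DOMDocument.
function extract_hrefs(string $html): array
{
    if ($html === '') {
        return array(); // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href')); // empty string when the attribute is missing
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

// Small offline sample: two links with different quoting, plus an anchor without href
$sample = '<p><a href="/news/">News</a> <a href=\'http://example.com/a.html\'>A</a> <a name="top">top</a></p>';
print_r(extract_hrefs($sample));
```

The returned values are still relative or absolute exactly as written in the page, so they would still need to go through something like `_expandlinks` afterwards.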
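If you want a concrete comparison with file_get_contents, you can use the Linux time command to see how long each approach takes to run. In my tests so far, cURL is somewhat faster. Finally, here are links to the write-ups of the two functions used above.
Link-matching function: function _striplinks()
Relative-to-absolute path conversion: function _expandlinks()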
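Besides the shell's time command, the comparison can be made inside PHP itself. The sketch below is one possible way to do it, not the measurement used for this post: `time_call()`, `fetch_with_curl()` and `fetch_with_fgc()` are hypothetical names, and averaging over three runs is an arbitrary assumption.

```php
<?php
// Hypothetical helper: average wall-clock seconds taken by $fn over $runs calls.
function time_call(callable $fn, int $runs = 3): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $runs;
}

// The two fetchers being compared; nothing is requested until they are called.
function fetch_with_curl(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    return curl_exec($ch);
}

function fetch_with_fgc(string $url)
{
    // @ suppresses the warning file_get_contents() raises on failure
    return @file_get_contents($url);
}

// Example usage (makes real network requests, so it is left commented out):
// $url = 'http://www.hao123.com/';
// printf("cURL:              %.3fs\n", time_call(function () use ($url) { fetch_with_curl($url); }));
// printf("file_get_contents: %.3fs\n", time_call(function () use ($url) { fetch_with_fgc($url); }));
```

Network latency varies from request to request, so a handful of runs only gives a rough indication. Results will also depend on the target site and the PHP build.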
- Author: shisekong
- Link: https://blog.361way.com/php-curl-url/2779.html
- License: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Kindly fulfill the requirements of the aforementioned License when adapting or creating a derivative of this work.