解析json

解析json这种方式很好理解，所有支持请求网络的开发语言都能支持

思路就是获得爬取链接返回的json数据（也可以是其他类型的数据，json比较常用）

我们拿到json数据后进行解析，结合自身需求来处理数据。

这个比较通用，就不举例子了。

解析dom树

重点讲解一下PHP是如何爬取目标网站，解析dom树的。

举个栗子：

工具类代码：

做了缓存处理，我的场景只要请求的url相同返回的数据是不变的，所以做了缓存处理

请求的url之前请求通，不会重复请求，而是会从缓存中取值

sleep()的作用是避免给抓取数据的网站造成网络问题，控制一下请求频率

public static function crawlContent($url, $encode = true)
{
    $file_name = '../cache/' . md5($url);
    if (!file_exists($file_name)) {
        @touch($file_name);
    }
    $content = file_get_contents($file_name);
    if (empty($content)) {
        $content = Request::curl($url);
        if (empty($content)) {
            sleep(rand(3,10));
            $content = Request::curl($url);
        }
        $encode && $content = iconv("GBK", "UTF-8//IGNORE", $content);
        file_put_contents($file_name, $content);
    }
    return $content;
}

封装的 curl 工具

public static function curl($url , $configs = array())
{
    $b = microtime(true);
    $new_ch = curl_init();
    self::_setopt($new_ch , $url , $configs);
    $result = curl_exec($new_ch);
    $e = microtime(true);
    if (curl_errno($new_ch))
    {
        Logger::log('CULR_BAD    ' . "curl_errno:" . curl_errno($new_ch) . "    " . curl_error($new_ch) . "    " . ($e - $b) . "    " . $url);
    }
    curl_close($new_ch);
    return $result;
}

解析dom树示例代码

上面的工具类代码我们获得了目标网页的数据

通过 str_get_html 函数获得字符串

再按照 simple_html_dom 获得元素的方式获得对应的值就可以了了

比如：


find('#container');

// 找到所有class=foo的元素
$ret = $html->find('.foo');

// 查找多个html标签
$ret = $html->find('a, img');

// 还可以这样用
$ret = $html->find('a[title], img[title]');
?>

更多取值方法可以查看官方文档，非常的简单好用，这里就不再这里赘述了。

我的示例代码

下面是我的示例代码，整理了一些爬取数据时需要注意的问题：

如果涉及抓取图片的话，最好把图片上传到自己的云存储
合理的控制爬取频率（已封装到工具类中）
合理的使用代理（已封装到工具类中）

find('.m-o-large-all-detail .ui-box .ui-box-body ul li');
            if ($content) {
                $counts = count($content);
                for ($i = 0; $i < $counts; $i++) {
                    $name = trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .detail h3 a', $i)->plaintext); //名称
                    $href = 'https:'.trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .detail h3 a', $i)->href); //url
                    preg_match_all('/[1-9]\d*/', $href, $matches);
                    $source_id = $matches[0][0];
                    $price = trim($detail_html->find('.m-o-large-all-detail .ui-box .ui-box-body ul li .cost b', $i)->plaintext); //价格
                    $data = array(
                        'name' => $name,
                        'source_url' => $href,
                        'source_id' => $source_id,
                        'price'=>$price,
                        'created_at'=>date('Y-m-d H:i:s'),
                        'date'=>date('m-d'),
                    );
                    $id = $spider->insert('products', $data);
                    var_dump("商品信息：products_" . $id);
                    if ($id){
                        //存储图片
                        catchThumbs($id,$href,$spider);
                    }
                }
            }
        }
    }
}

//抓取商品图
function catchThumbs($id,$detail_url,$spider)
{
    $m_content = Utils::curlProxy($detail_url, false);

    $before = strpos($m_content, 'imagePathList');
    $after = strpos($m_content,'ImageModule',1);
    $count = $after-$before-24;

    $thumbs = substr($m_content,$before+15,$count);

    $data = array(
        'thumbs'=>$thumbs,
        'updated_at'=>date('Y-m-d H:i:s'),
    );
    $res = $spider->update('products', $data, 'id = '.$id);
    var_dump('抓取商品图：'.$res." id=".$id);
    if ($res){
        //更新成功则下载图片
        downloadThumbs($id,$thumbs,$spider);
    }
}

//下载商品图
function downloadThumbs($id,$thumbs,$spider)
{
        $Ymd = date('md');
        $thumbs = json_decode($thumbs,true);
        foreach ($thumbs as $key=> $thumb){
            $image_name = $Ymd.'-'.$id.'-'.($key+1).'.jpg';
            $image_name = './pics/'.$image_name;
            saveImage($thumb,$image_name);
        }
}

/**
 * 从网上下载图片保存到服务器
 * @param $path 图片网址
 * @param $image_name 保存到服务器的路径 './public/upload/users_avatar/'.time()
 */
function saveImage($path, $image_name) {
    $ch = curl_init ($path);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
    $img = curl_exec ($ch);
    curl_close ($ch);
//$image_name就是要保存到什么路径,默认只写文件名的话保存到根目录
    $fp = fopen($image_name,'w');//保存的文件名称用的是链接里面的名称
    fwrite($fp, $img);
    fclose($fp);
}

总结和注意

以上就是PHP爬取数据的思路总结啦~

注意：在未经授权的情况下爬取数据是违法的！

本文仅提供爬取数据的PHP实现思路，网络世界不是法外之地，谨慎使用爬取技术哦~