Building a web crawler.
I followed a tutorial to build a web crawler and came out with the following: 
The first part gets the contents of the URL to crawl. With the @ symbol, no errors are generated if the page cannot be fetched. It then matches every link on the page using preg_match_all and stores them in an array called $link_array; the base URL is also calculated here using PHP's built-in parse_url function with the PHP_URL_HOST constant.

From here on, all links are normalised. Anything from a # onwards is stripped, as a fragment points at the same page, and each link is put into an http/https form. Emails are identified by their mailto: prefix and wrapped in brackets so they stand out. Relative links starting with ./ or / are resolved against the base URL, and any remaining links that do not already begin with http:// or https:// have a scheme prepended. Each link is also checked against the array of pages already crawled; if it is not there, it is added. Once the starting page has been processed, the script runs get_links on each of the collected pages in turn, and at the very end they are all printed out.
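As a quick aside before the full listing, here is what parse_url with PHP_URL_HOST actually returns (this snippet is my own illustration, not part of the tutorial; the URLs are made up):

```php
<?php
// parse_url() splits a URL into components; passing PHP_URL_HOST
// asks for just the host part as a string
$host = parse_url("http://www.youtube.com/watch?v=abc123", PHP_URL_HOST);
echo $host; // www.youtube.com

// Relative links have no host, so parse_url() returns null here
var_dump(parse_url("/about", PHP_URL_HOST));
```

That null is why relative links have to be resolved against the host of the page they were found on.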
<?php

$to_crawl = "http://www.youtube.com/";

$pages_crawled = array();

function get_links($url) {
    global $pages_crawled;
    // @ suppresses warnings if the page cannot be fetched
    $input = @file_get_contents($url);

    // Capture the href value of every anchor tag on the page
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $input, $matches);

    // Host of the current page, used to resolve relative links
    $base_url = parse_url($url, PHP_URL_HOST);
    $link_array = $matches[2];
    foreach($link_array as $link) {
        
        // Strip any fragment so "page#section" and "page" are treated
        // as the same URL; a leading "#" makes strpos() return 0 (falsy),
        // so that case falls through to the branch below instead
        if(strpos($link, "#")) {
            $link = substr($link, 0, strpos($link, "#"));
        }

        // "./path" becomes "/path", resolved against the host below
        if(substr($link, 0, 1) == ".") {
            $link = substr($link, 1);
        }

        if(substr($link, 0, 7) == "http://" || substr($link, 0, 8) == "https://") {
            // Already an absolute URL; leave it alone
        } else if(substr($link, 0, 2) == "//") {
            // Protocol-relative URL; the scheme is prepended further down
            $link = substr($link, 2);
        } else if(substr($link, 0, 1) == "#") {
            // A bare fragment points at the current page
            $link = $url;
        } else if(substr($link, 0, 7) == "mailto:") {
            // Wrap email links in brackets so they stand out in the output
            $link = "[".$link."]";
        } else {
            // Relative path: resolve it against the host
            if(substr($link, 0, 1) != "/") {
                $link = $base_url."/".$link;
            } else {
                $link = $base_url.$link;
            }
        }

        // Prepend a scheme to anything that still lacks one, matching
        // the scheme of the page the link was found on
        if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "[") {
            if(substr($url, 0, 8) == "https://") {
                $link = "https://".$link;
            } else {
                $link = "http://".$link;
            }
        }

        // Record the link once so the same page is never crawled twice
        if(!in_array($link, $pages_crawled)) {
            array_push($pages_crawled, $link);
        }
    }
}

// Crawl the starting page, then one more level: foreach iterates over
// a copy of the array, so links discovered during this loop are
// recorded but not themselves crawled
get_links($to_crawl);

foreach($pages_crawled as $page) {
    get_links($page);
}

// Print every link found, one per line
foreach($pages_crawled as $page) {
    echo $page."\n";
}

?>
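A regex is fine for a tutorial, but real HTML is irregular enough that PHP's built-in DOMDocument class is a sturdier way to pull out href attributes. This is a sketch of my own, not from the tutorial; extract_links is a name I made up:

```php
<?php
// Extract every href on a page with the DOM parser instead of a regex.
// $html would normally come from file_get_contents($url).
function extract_links($html) {
    $doc = new DOMDocument();
    // Suppress warnings about malformed HTML, much like the @ on
    // file_get_contents in the crawler above
    @$doc->loadHTML($html);
    $links = array();
    foreach($doc->getElementsByTagName("a") as $anchor) {
        // Skip anchors that have no href at all, e.g. <a name="top">
        if($anchor->hasAttribute("href")) {
            $links[] = $anchor->getAttribute("href");
        }
    }
    return $links;
}

print_r(extract_links('<p><a href="/about">About</a> <a name="top">skip</a></p>'));
```

Dropping this in place of the preg_match_all call would also pick up links whose attributes are in an order or quoting style the regex misses.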