Data Scraping Using PHP-Curl

  • Data Scraping Using PHP-Curl Part 1.

    A simple PHP script for extracting elements from the DOM.

    This script gathers either all the <img> tags from the DOM (version 1) or only the paths (the src attribute) of those <img> tags (version 2). Because it matches every image on the page, it will probably also gather images that you don't need.
    The key to this procedure is to study the DOM structure, find a pattern, and write the regular expression to fit the specific site.

    The script makes only one cURL call to the target server; the extraction loop then runs over the variable that stores the site's content.

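    /*Fetch the HTML source*/
    $ch = curl_init("some url");
    /*Return the response as a string instead of printing it*/
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    /*CURLOPT_BINARYTRANSFER has had no effect since PHP 5.1.3, so it is omitted*/
    /*curl_exec() returns false on failure*/
    $content = curl_exec($ch);
    curl_close($ch);
    
    $origImageSrc = array();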
    
    /*Collect every <img ...> tag.
    preg_match_all arguments:
    1. the regular expression
    2. the HTML to search
    3. the array that receives the matches
    */
    preg_match_all('/<img[^>]+>/i', $content, $imgTags);
    
    /*Extract the src attribute from each gathered img tag*/
    foreach ($imgTags[0] as $imgTag) {
        /*$image[1] holds the captured path; tags without a double-quoted src are skipped*/
        if (preg_match('/src="([^"]+)/i', $imgTag, $image)) {
            $origImageSrc[] = $image[1];
        }
    }
    /*Version 2: return the array of paths from the gathered img tags*/
    return $origImageSrc;
    /*Version 1: return the $imgTags array instead to get the complete img tags
    (they are in $imgTags[0]), not just the sources.*/
    /*return $imgTags;*/
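
The two regex passes above can also be folded into one pattern with a capture group. The following is only a minimal sketch under stated assumptions: extractImageSources() and the sample HTML are hypothetical names made up for illustration, and the pattern accepts single or double quotes around the src value. A regex like this still needs adjusting to the structure of the specific site.

```php
<?php
/**
 * Return the src value of every <img> tag found in an HTML string.
 * extractImageSources() is a hypothetical helper, not part of the original script.
 */
function extractImageSources(string $html): array
{
    // Group 1 remembers which quote character opened the value,
    // group 2 captures everything up to the matching closing quote.
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1/i';
    preg_match_all($pattern, $html, $matches);

    // $matches[2] is an empty array when no tag matched.
    return $matches[2];
}

// Sample markup standing in for the $content returned by curl_exec().
$sample = '<div><img src="a.png" alt="A"><IMG class="x" SRC=\'../images/b.jpg\'><p>no image</p></div>';
$sources = extractImageSources($sample);
print_r($sources);
```

Because the function only works on a string, it can be tried on saved HTML before any request is sent to the target server.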
                                    

    Comparing and saving images in your local or online directory Using PHP-Curl Part 2.

    This part shows how to save the collected paths into your database and also make physical copies of the images in a local or online path.

    /*Loop over the array of image paths*/
    $liveList = array();
    foreach ($origImageSrc as $img) {
    /*The file names follow a pattern, so we check whether
    the path starts with the 10-character prefix '../images/'*/
        $subStringImg = substr($img, 0, 10);
        if ($subStringImg == "../images/") {
        /*ignore the images that start with the specific prefix*/
            continue;
        } else {
        /*Keep the path relative so we form a list of all the image paths we need to store.
         $url (the url of the website you are targeting) is prepended later, when each image is downloaded.*/
           $liveList[] = $img;
        }
    }
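    /*Symmetric difference: the paths that appear in only one of the two lists.
    $imgList is assumed to hold the paths you have already stored.*/
    $fullDiff = array_merge(array_diff($imgList, $liveList), array_diff($liveList, $imgList));
    
    /*Loop over the array that was formed from the differences.*/
    foreach ($fullDiff as $item) {
        /*Initialize a new cURL handle. Again, $url stands for the url you are targeting.*/
        $ch = curl_init($url . $item);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $rawdata = curl_exec($ch);
        curl_close($ch);
        if ($rawdata === false) {
            /*the download failed, so there is nothing to store*/
            continue;
        }
        /*table_name and the column names are placeholders. A prepared statement keeps $item out of the SQL text.*/
        $imgPath = '/images/' . $item;
        $stmt = mysqli_prepare($sqlConnection, "INSERT INTO table_name(value , value) VALUES ('some value as string', ?)");
        mysqli_stmt_bind_param($stmt, 's', $imgPath);
        if (mysqli_stmt_execute($stmt)) {
            /*write to a file, not to the directory itself; basename() keeps only the file name*/
            $fp = fopen($_SERVER["DOCUMENT_ROOT"] . "/local/path/of/your/images/" . basename($item), 'wb');
            fwrite($fp, $rawdata);
            fclose($fp);
        }
        echo $item;
    }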
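
The comparison step can be tried on its own with plain arrays. The sketch below is an illustration under stated assumptions, not the original method: diffImageLists() and the sample paths are hypothetical, and it splits the symmetric difference into its two directions, since a path that exists only in the stored list can no longer be downloaded from the site.

```php
<?php
/**
 * Compare the stored paths with the paths currently on the site.
 * diffImageLists() is a hypothetical helper; both arguments are plain
 * lists of relative image paths.
 */
function diffImageLists(array $stored, array $live): array
{
    return [
        // On the site but not stored yet: these are worth downloading.
        'toDownload' => array_values(array_diff($live, $stored)),
        // Stored but no longer on the site: candidates for cleanup.
        'stale' => array_values(array_diff($stored, $live)),
    ];
}

// Made-up sample data standing in for $imgList and $liveList.
$imgList = ['a.png', 'b.png'];
$liveList = ['b.png', 'c.png', 'gallery/d.png'];

$diff = diffImageLists($imgList, $liveList);

foreach ($diff['toDownload'] as $item) {
    // basename() drops any directory part, so the file lands in one folder.
    echo 'would save as: ' . basename($item) . PHP_EOL;
}
```

Merging the two lists back together with array_merge() gives the same set as the symmetric difference used in the script; keeping them apart simply makes it clearer which entries need a request to the server.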
                                

    Data Scraping Using PHP, SQL and the Simple HTML DOM Library.

    Using the Simple HTML DOM Parser library can simplify data scraping a lot: no complicated regex is needed, because everything is wrapped in functions. Remember that to use it from an online server, the PHP setting allow_url_fopen must be enabled. This is not always feasible, so another option is to run the script on your local server, gather the data you need and then merge the local database with the online one, or to set up a cron job that gathers and merges the data.

    /*Optional. set_time_limit(0) removes the limit on the script's execution time.*/
    /*For small amounts of data you don't need it; for long-running jobs, enable it to avoid a maximum-execution-time error.*/
    set_time_limit(0);
    /*Don't forget to download and require the library*/
    require("simple_html_dom.php");
    
    function saveImg($url ,$imgName){
        $imgLocalPath = './your/local/path/'.$imgName;
    
        $content = file_get_contents($url);
        file_put_contents($imgLocalPath, $content);
    }
    
    $url = 'your target url';
    $html = file_get_html($url);
    
    /*Selector, not a regex: get all the images that are inside a div with the given id*/
    $ret = $html->find('div[id=someDiv]/img');
    
    /*Again, loop over the array that was created with the images.*/
    foreach ($ret as $tag) {
        /*After finding the pattern, use substr to extract only the file name you will use. The offset 55 is specific to the target site's image URLs.*/
        $imgName = substr($tag->src,55);
        $imgLocalPath = 'img/path/'.$imgName;
    
         /*Same logic if you want the alt attribute, or use $tag->title to get the title. The typographic apostrophe is replaced with '' (an escaped single quote in SQL).*/
         $altAttribute = str_replace("’", "''", $tag->alt);
    
        /*Create an SQL connection and store the values you want ($val, table_name and col are placeholders).*/
        $con = mysqli_connect("host", "username", "password", "db_name");
        $query = "insert into table_name(col , col) VALUES('$val','$val')";
        $exeQuery  = mysqli_query($con , $query);
    
        if($exeQuery)
        {
            saveImg($tag->src,$imgName);
            echo "all ok";
        }
        else{
            echo "no";
        }
    }
    /*Within the same script, create an SQL connection and read back the data that was stored in your database.*/
    $con = mysqli_connect("host", "username", "password", "db_name");
    $query = "SELECT * FROM table_name where val='$val' and val='$val' and val= '$val'";
    $exeQuery = mysqli_query($con, $query);
    while($row = mysqli_fetch_array($exeQuery)) {
        /*escape the stored values before printing them into HTML*/
        echo "<div>";
        echo "<img src='".htmlspecialchars($row["img_path"], ENT_QUOTES)."' alt=''/>";
        echo "<br />".htmlspecialchars($row['img_name'], ENT_QUOTES);
        echo "</div>";
    }
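
If allow_url_fopen cannot be enabled, one possible workaround is to fetch the page with cURL as in Part 1 and parse the returned string with PHP's bundled DOM extension. This swaps Simple HTML DOM for DOMDocument and DOMXPath, so treat it as a rough sketch only: imagesInsideId() and the sample markup are hypothetical, and the id is assumed to contain no double quotes.

```php
<?php
/**
 * Collect the src and alt of every <img> inside the div with the given id.
 * imagesInsideId() is a hypothetical helper built on PHP's DOM extension;
 * $id is assumed to contain no double quotes.
 */
function imagesInsideId(string $html, string $id): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $images = [];

    // Similar in spirit to the 'div[id=someDiv]/img' selector used above.
    foreach ($xpath->query('//div[@id="' . $id . '"]//img') as $img) {
        $images[] = [
            'src' => $img->getAttribute('src'),
            'alt' => $img->getAttribute('alt'), // empty string when missing
        ];
    }

    return $images;
}

// Sample markup standing in for a page fetched with curl_exec().
$html = '<html><body><div id="someDiv"><img src="img/one.png" alt="One">'
      . '<p><img src="img/two.png"></p></div>'
      . '<img src="img/outside.png" alt="x"></body></html>';

$found = imagesInsideId($html, 'someDiv');
print_r($found);
```

The image outside the div is ignored, which mirrors how the selector in the script limits the results to one container.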
                                