
I often use web scraping code like the snippet below when looking at technology as part of an IT Assessment, Due Diligence or Review. For this post I am assuming that the latest stable versions of PHP and cURL are installed and working. Below is some generic web scraping code that works well for most web sites.

I generally grab all of the text on the page and then sort through it. Once I have worked out what I want to keep, I discard the raw data; there is a rough sketch of that step after the snippet.

# $target is set to whatever web page URL I am looking to scrape. I can update the code so that if it finds a Next or page 2 link it cycles around again (a rough outline of that loop also follows the snippet).

$target = "https://<web site url>";
 
# Curl needs a User Agent so that it can emulate the end user browser. I often have to play around with different User Agents before I find one that works reliably.
$user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36';
 
# If going to use Tor then need to define a proxy and port. I tend not to use this.
$proxy = "127.0.0.1";
$port = "9050";
 
# get a cookie and set up the web scraping request
$ckfile = tempnam ("/home/pgroom", "targetwebpagecookie.txt");
$ch = curl_init($target);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_URL, $target);
# Connection timeout in seconds; 30 is just a sensible default, adjust as needed.
$timeout = 30;
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_ENCODING, "");
 
# The next line is for debugging only and spits out the headers so you can look at the handshakes.
# curl_setopt($ch, CURLOPT_VERBOSE, TRUE);
 
# The next two lines switch off certificate checking for https web sites and are not secure in any way. If I am productionising something then I tighten these up, along the lines of the commented-out example below.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
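# Tightened-up production equivalent (the CA bundle path is only an example and varies by system):
# curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
# curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
# curl_setopt($ch, CURLOPT_CAINFO, "/etc/ssl/certs/ca-certificates.crt");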
 
# This section is only required for Tor, which is why it is commented out. CURLPROXY_SOCKS5_HOSTNAME (value 7) makes cURL resolve DNS through the proxy.
# curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);
# curl_setopt($ch, CURLOPT_PROXY, $proxy.':'.$port);
 
# This is where the actual web scraping request is made and errors (if any) are flagged.
$initpage = curl_exec($ch);
$curl_errno = curl_errno($ch);
$curl_error = curl_error($ch);
curl_close($ch);
 
# If there are any errors then they are displayed here.
if ($curl_errno > 0)
 {
        print "\nCurl error no:".$curl_errno;
        print "\nCurl error:".$curl_error;
 }
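
As a rough sketch of what I mean by grabbing all of the text and sorting through it, the lines below are one way to do it using PHP's built-in DOMDocument. Treat this as illustrative rather than a fixed part of the snippet above.

# Rough sketch only: pull the visible text out of the scraped page so it can be sorted through.
if ($curl_errno == 0)
 {
        $dom = new DOMDocument();
        # Real-world HTML is rarely valid, so suppress the parser warnings.
        @$dom->loadHTML($initpage);
        # textContent returns all of the text nodes in one string.
        $rawtext = $dom->textContent;
        print $rawtext;
 }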
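
And on the point about cycling around when the page has a Next or page 2 link, the outline below shows the shape of the loop I mean. It assumes the cURL code above has been wrapped in a hypothetical scrape_page() function that returns the page HTML, and the regular expression is only a crude illustration; the right way to find the link depends entirely on the site being scraped.

# Outline only: scrape_page() is assumed to wrap the cURL request above and return the HTML.
$target = "https://<web site url>";
do
 {
        $initpage = scrape_page($target);
        $next = "";
        # Crude check for a next-page link; adjust to match the actual site markup.
        if (preg_match('/href="([^"]+)"[^>]*>\s*(Next|page 2)/i', $initpage, $matches))
         {
                $next = $matches[1];
                $target = $next;
         }
 } while ($next != "");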
 
And that is really all there is to it; I hope the above proves useful. I have also written posts on Docker, Ansible and Selenium & Python.
 
Thanks
 
Pete