I often use web scraping code, such as the snippet below, when looking at technology as part of an IT Assessment, Due Diligence or Review. For this post, I am assuming that the latest stable versions of PHP and curl are installed and working. The code below is generic and works well for most web sites.

I generally grab all of the text on the page and then sort through it. Once I have worked out what I want to keep, I discard the raw data.
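As a rough sketch of that "grab the text, then sort through it" step, the example below reduces a page to its text using PHP's bundled DOM extension. The function name `pageTextLines` and the sample markup are hypothetical and only there for illustration; in practice the input would be whatever the curl request returned.

```php
<?php
// Hypothetical helper (not from the original post): return the text of an
// HTML page as a list of non-empty, whitespace-normalised strings.
function pageTextLines(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that is not inside a <script> or <style> element.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $lines = [];
    foreach ($nodes as $node) {
        // Collapse runs of whitespace and skip nodes that are only whitespace.
        $text = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
        if ($text !== '') {
            $lines[] = $text;
        }
    }
    return $lines;
}

// Stand-in for a fetched page.
$sample = "<html><head><style>p { color: red; }</style></head>\n"
        . "<body><h1>Prices</h1>\n<p>Widget: 10 GBP</p><script>var x = 1;</script></body></html>";

print implode("\n", pageTextLines($sample)) . "\n";
```

Each text node becomes its own entry, so a sentence containing inline tags such as bold or links will be split across several entries. That is usually acceptable for a first pass, since the raw list is only an intermediate step before deciding what to keep.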

# $target is set to the URL of whatever web page I am looking to scrape. The code can be extended so that if it finds a Next or page 2 link, it cycles around again.

$target = "https://<web site url>";
 
# Curl needs a User Agent so that it can emulate the end user browser. I often have to play around with different User Agents before I find one that works reliably.
$user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36';
 
# If you are going to use Tor then you need to define a proxy and port. I tend not to use this.
$proxy = "127.0.0.1";
$port = "9050";
 
# get a cookie and set up the web scraping request
$ckfile = tempnam ("/home/pgroom", "targetwebpagecookie.txt");
$ch = curl_init($target);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_URL, $target);
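# Connection timeout in seconds. The value of 30 is an assumption; adjust it to suit the target site.
$timeout = 30;
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);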
curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_ENCODING, "");
 
# The next line is for debugging only and spits out the headers so you can look at the handshakes.
# curl_setopt($ch, CURLOPT_VERBOSE, TRUE);
 
# The next two lines switch off SSL certificate checking, which gets around certificate problems on https web sites but is not secure in any way. If I am productionising something then I tighten these up.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
 
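# This section is only required for Tor, which is why it is commented out. CURLPROXY_SOCKS5_HOSTNAME has the value 7 and makes curl resolve host names through the proxy.
# curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);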
# curl_setopt($ch, CURLOPT_PROXY, $proxy.':'.$port);
 
# This is where the actual web scraping request is made and errors (if any) are flagged.
$initpage = curl_exec($ch);
$curl_errno = curl_errno($ch);
$curl_error = curl_error($ch);
curl_close($ch);
 
# If there are any errors then they are displayed here.
if ($curl_errno > 0)
 {
        print "\nCurl error no:".$curl_errno;
        print "\nCurl error:".$curl_error;
 }
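
The note on $target mentions cycling around again when a Next or page 2 link is found. One way that might be sketched is shown below. The function name `findNextUrl` is hypothetical, and it assumes the link is either marked rel="next" or has the exact text "Next"; real sites vary, so the XPath expression will usually need adjusting. Relative links containing `../` are not resolved.

```php
<?php
// Hypothetical helper (not from the original post): find the "Next" link in a
// page and return it as an absolute URL, or null when there is none.
function findNextUrl(string $html, string $pageUrl): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: the link is marked rel="next" or its text is exactly "Next".
    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//a[@href][@rel="next" or normalize-space(.)="Next"]');
    if ($links === false || $links->length === 0) {
        return null;
    }
    $href = trim($links->item(0)->getAttribute('href'));
    if ($href === '') {
        return null;
    }

    // Already absolute, e.g. https://example.com/list?page=2
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    // Assumption: $pageUrl is an absolute URL with a scheme and host.
    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (substr($href, 0, 2) === '//') {   // protocol-relative link
        return $parts['scheme'] . ':' . $href;
    }
    if ($href[0] === '/') {               // root-relative link
        return $origin . $href;
    }
    if ($href[0] === '?') {               // same page, new query string
        return $origin . $path . $href;
    }
    // Otherwise treat it as relative to the current page's directory.
    return $origin . substr($path, 0, strrpos($path, '/') + 1) . $href;
}

// Stand-in for a fetched page.
$sample = '<p>Page 1 of 3</p><a href="/list?page=2">Next</a>';
print (findNextUrl($sample, 'https://example.com/list') ?? 'no next page') . "\n";
```

To cycle through the pages, the returned URL would be assigned to $target and the request repeated until the function returns null. A cap on the number of pages is sensible so that a badly behaved site cannot keep the loop running indefinitely.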
 
And that is all there really is to it. I hope that the above may prove useful. Please feel free to add your own thoughts in the comments section below.
 
Thanks
 
Peter

