
zerotoanalyst-project1

Creating a Contacts List for the UK's NHS using Web Scraping

Project Objectives

The aim of this research project is to demonstrate the extent to which a contacts list can be created using web scraping. The project uses Python together with the Requests and Beautiful Soup 4 (BS4) libraries.

The project uses the National Health Service (NHS) in the UK to explore this objective. The NHS is a public sector organisation comprising over 6,000 organisations, from large hospitals to local general practices (GPs). Overall it has 1.6 million employees and volunteers and provides healthcare services to the 67 million people in the UK, with a budget of over 130 billion GB pounds. The project demonstrates the approach using the NHS in England, covering hospital trusts and GP surgeries. To become comprehensive the exercise would need to extend to NHS Scotland, Wales and Northern Ireland, which should be straightforward.

The objective is framed within a wider research project exploring the nature and structure of "business ecosystems" located at the Wave Lab within the University of the Aegean Business School in Greece. This listing will later be used to model the NHS as a business ecosystem.

Relevant Use Cases

A listing of this type has a very wide set of potential use cases:

  1. Research studies to find survey respondents, case study sites or websites from which to gather information and data.
  2. Market research activities for mailing lists, newsletters, lead generation.
  3. Contacts for requests for information for a variety of purposes such as journalism, government advice, etc.
  4. Political activism or citizen investigations into issues of relevance to citizens or political agendas.

Although the NHS is the subject of this demonstrator, web scraping contact details is applicable to most organisations, sectors and countries.

Economic Advantages

Contact listings of this sort can often be purchased, but at a cost, and how up to date the databases are is an open question. The economic advantages are therefore:

  1. Cost savings over a commercial data provider.
  2. Being able to re-run at any time for the latest contact information.
  3. Having more options for missing data.
  4. Potentially having more control over the data gathering.

Web Site Used

A first stage involved exploring a number of web sites to find the best one for this purpose. The site www.nhs.uk was chosen for this exercise.


Legal & Information Protection Conformance

The legality of using the website for web scraping of this type was checked. Generally, in the UK it is accepted as legal for UK citizens to scrape UK websites for research purposes, and this principle applies to this project. Additionally, the website's Terms and Conditions were reviewed at https://www.nhs.uk/our-policies/terms-and-conditions/. Section 3.4 advises that "you can use NHS Website Content, including copying it, adapting it, and using it for any purpose, including commercially, provided you follow these terms and conditions and the terms of the OGL" (i.e. the Open Government Licence, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/). The OGL allows the use and adaptation of information from the site and its combination with other data, for both commercial and non-commercial purposes, provided that the source is acknowledged and the OGL licence is stated to apply.

Evaluation Criteria

The project is a requirement of the Jovian Data Analyst Bootcamp and has the following evaluation criteria:

  • The Jupyter notebook should run end-to-end without any errors or exceptions
  • The Jupyter notebook should contain execution outputs for all the code cells
  • The Jupyter notebook should contain proper explanations in Markdown cells
  • Your project should involve web scraping of at least two web pages
  • Your project should use the appropriate libraries for web scraping
  • Your submission should include the CSV file generated by scraping
  • The submitted CSV file should contain at least 3 columns and 100 rows of data
  • The Jupyter notebook should be publicly accessible (not "Private" or "Secret")

General Approach to Web Scraping to Create the Contact List

Website Content

The NHS website contains a great deal of information on NHS organisations. The site is structured for a user (citizen) to drill down by type of organisation, select the individual organisation they are looking for, and then reach pages of information on that organisation, within which there is contact information. So the site is not designed to be a source of contact lists; instead it is designed to find one organisation and its services.

Data Quality

A manual inspection of the site and its contact information showed that there are some common structures. It also revealed that the pages are not identical and that some contact information is missing. On the plus side, the site has comprehensive and up-to-date information on all NHS organisations by name.
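Because fields can be absent from individual pages, the extraction logic needs to tolerate missing data rather than fail. Below is a minimal sketch of such tolerant extraction with BS4; the CSS class names (`address`, `phone`, `website`) are illustrative assumptions, not the real www.nhs.uk markup.

```python
from bs4 import BeautifulSoup

def extract_contact(html):
    """Pull contact fields from an organisation page, returning None
    for any field that is missing rather than raising an error."""
    soup = BeautifulSoup(html, "html.parser")

    def first_text(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    # Selector names here are assumptions for illustration only
    return {
        "address": first_text("p.address"),
        "phone": first_text("p.phone"),
        "website": first_text("p.website"),
    }

# A page with no phone number simply yields None for that field
sample = "<p class='address'>1 High St</p><p class='website'>example.org</p>"
contact = extract_contact(sample)
```

Returning `None` (rather than skipping the row) keeps the eventual CSV rectangular, with blanks where the site had no data.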

Approach to Web Scraping

A simple "nested approach" is proposed:

  1. Choose a type of NHS organisation (hospital, GP, etc.).
  2. Load, using the Python Requests library, the page listing the organisations of that type.
  3. Web scrape that page using BS4 to create a dataset for that type of organisation comprising each name and web-page URL.
  4. For each organisation in the dataset, load its page using Requests and use BS4 to find the contact details.
  5. Add the organisation, type and contact details to another dataset.
  6. When the exercise is complete, output the contacts dataset to a CSV file.
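The steps above can be sketched as a single function. The `fetch` argument stands in for a live `requests.get(...).text` call so the structure can be shown without hitting the site; the selectors, URL shapes and sample HTML are illustrative assumptions, not the actual www.nhs.uk markup.

```python
from bs4 import BeautifulSoup

def scrape_type(org_type, listing_html, fetch):
    """Steps 2-5 of the nested approach for one organisation type.

    `fetch` is a callable url -> html (in the real notebook this would
    wrap requests.get). Selectors here are assumptions for illustration.
    """
    listing = BeautifulSoup(listing_html, "html.parser")
    contacts = []
    for a in listing.select("ul.results a"):               # step 3: name + URL
        name, url = a.get_text(strip=True), a["href"]
        page = BeautifulSoup(fetch(url), "html.parser")    # step 4: org page
        phone = page.select_one("p.phone")
        contacts.append({                                  # step 5: add row
            "name": name,
            "type": org_type,
            "phone": phone.get_text(strip=True) if phone else None,
        })
    return contacts

# Tiny canned example standing in for live pages
listing = "<ul class='results'><li><a href='/hosp/a'>A Hospital</a></li></ul>"
pages = {"/hosp/a": "<p class='phone'>01234 567890</p>"}
rows = scrape_type("hospital", listing, pages.get)
```

Step 6 is then a one-liner with Pandas, e.g. `pandas.DataFrame(rows).to_csv("contacts.csv", index=False)`.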

Keeping the Project Small & Avoiding Overloading the Targeted Website

It is not necessary to scrape every type of organisation or all of the NHS organisation pages to prove the viability of the approach with this demonstrator, nor is it needed to fulfil the evaluation criteria of the Jovian Bootcamp. Also, screen scraping 16,000 or so pages from the site may trigger security or load management software on the site's servers. So to keep all this manageable the following constraining tactics are adopted:

  1. The project will be constrained to the two major types of NHS organisations - hospitals and GP surgeries. Extending the exercise to all the other types (e.g. pharmacies, mental health trusts, dentists, etc.) is simply achieved by running the generic approach across the web pages for those types of organisations.
  2. Within those two types of organisations only the first 50 hospitals (alphabetically) and 100 GP surgeries (again alphabetically) will be selected. Extending to all hospitals and all GP surgeries is simply achieved by removing this limiter and allowing the functions to run until all have been selected.
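Both tactics - capping the number of pages and being gentle on the server - can be expressed as a small wrapper around the page iteration. This is a sketch under the assumption that a short pause between requests is an acceptable courtesy; the names are hypothetical.

```python
import time
from itertools import islice

def limited(items, limit, delay=0.0):
    """Yield at most `limit` items, pausing `delay` seconds after each.

    Passing limit=None removes the cap, which is how the exercise would
    be extended to every hospital and GP surgery. The delay is a simple
    courtesy to avoid overloading the target site.
    """
    for item in islice(items, limit):
        yield item
        time.sleep(delay)

# Take only the first 3 entries of an alphabetical listing
first_three = list(limited(["A", "B", "C", "D", "E"], limit=3))
```

In the real notebook the iterable would be the list of organisation URLs and `delay` would be set to a second or so.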

Example of web page for a type of NHS organisation (hospitals in this case)


Example of web page with contact details for one hospital (Ashford & St Peter's Hospital Trust)


Installing the Python Libraries

The following are required by the project:

  • Jovian to allow the notebook to be stored and submitted on the Jovian platform.
  • Requests which allows the notebook to load the web pages.
  • Beautiful Soup 4 which provides functionality to scrape specific fields from the web pages.
  • Pandas which provides functions to manage the dataset and output it to a CSV file.
  • The module re, which is used to work with regular expressions.
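Apart from `re`, which ships with Python, these can typically be installed with pip (the exact command may vary with your environment; `beautifulsoup4` is the package name for BS4):

```shell
pip install jovian requests beautifulsoup4 pandas
```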