Scraping the World Press Freedom Index Report

Source: Reporters Without Borders

Web scraping is the automated process of gathering data from a server. It is usually accomplished by writing a program that queries a web server, requests data (usually HTML), and parses the response to extract the required information. There are a number of ways to achieve this, but we are going to use the requests, Beautiful Soup, and pandas libraries.

The Press Freedom Index is an annual ranking of countries compiled and published by Reporters Without Borders (RSF) since 2002, based upon the organisation's own assessment of the countries' press freedom records in the previous year. The Index ranks 180 countries and regions according to the level of freedom available to journalists.

Reporters Without Borders is an international non-profit and non-governmental organization with the stated aim of safeguarding the right to freedom of information. It describes its advocacy as founded on the belief that everyone requires access to news and information, in line with Article 19 of the Universal Declaration of Human Rights, which recognizes the right to receive and share information regardless of frontiers, along with other international rights charters. RSF has consultative status at the United Nations, UNESCO, the Council of Europe, and the International Organisation of the Francophonie.

Outline:

  • Using the requests library, fetch the HTML of the https://rsf.org/en/ranking page.
  • Parse the HTML into a DOM tree using the BeautifulSoup class provided by the Beautiful Soup library.
  • Identify patterns and attributes such as ids and classes, and use them to select the elements containing the required data.
  • Compile the extracted information into structured data using Python lists and pandas.
  • Save the extracted information to a CSV file.
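
The steps above can be sketched roughly as follows. This is a minimal sketch rather than the notebook's actual code: the function names (fetch_html, parse_rows, rows_to_frame, scrape_to_csv) and the output file name are hypothetical, and the "table tr" selector assumes the ranking is rendered as an HTML table, which has to be checked against the real page's markup.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

RANKING_URL = "https://rsf.org/en/ranking"


def fetch_html(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_rows(html):
    """Return one list of cell texts per table row.

    ASSUMPTION: each record is a <tr> whose values sit in <td> cells.
    Rows without <td> cells (for example header rows) are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


def rows_to_frame(rows, columns):
    """Compile the extracted rows into a pandas DataFrame."""
    return pd.DataFrame(rows, columns=columns)


def scrape_to_csv(columns, path="press_freedom_index.csv"):
    """Run all steps: fetch, parse, compile, save (hypothetical file name)."""
    html = fetch_html(RANKING_URL)
    frame = rows_to_frame(parse_rows(html), columns)
    frame.to_csv(path, index=False)  # index=False keeps the row index out of the file
    return frame
```

Keeping the parsing step separate from the download makes it possible to try parse_rows on a saved or hand-written HTML snippet before sending any requests to the live site.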

In the end, here's what the CSV will look like:

Rank,Country,Abuse Score,Global Score,Detail Url,Situation Score,Journalist Killings,Citizen Journalist Killings,Media Assistants Killings,2020,2019,2018,2017,2016,2015,2014,2013
1,Norway,0,6.72,https://rsf.org//en/norway,6.72,0,0,0,1,1,1,1,3,2,3,3
2,Finland,0,6.99,https://rsf.org//en/finland,6.99,0,0,0,2,2,4,3,1,1,1,1
...
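
A file in this format can be loaded back with pandas for a quick check. The sketch below reads the two sample rows shown above from an in-memory string instead of a file, since the output file name is not fixed at this point; SAMPLE_CSV is a name introduced only for this illustration.

```python
import io

import pandas as pd

# The header and the two sample rows shown above, used in place of the real file.
SAMPLE_CSV = """Rank,Country,Abuse Score,Global Score,Detail Url,Situation Score,Journalist Killings,Citizen Journalist Killings,Media Assistants Killings,2020,2019,2018,2017,2016,2015,2014,2013
1,Norway,0,6.72,https://rsf.org//en/norway,6.72,0,0,0,1,1,1,1,3,2,3,3
2,Finland,0,6.99,https://rsf.org//en/finland,6.99,0,0,0,2,2,4,3,1,1,1,1
"""

# read_csv accepts any file-like object, so StringIO stands in for a file path.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Show a few columns to confirm the data parsed as expected.
print(df[["Rank", "Country", "Global Score"]])
```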

Libraries used: requests, Beautiful Soup, pandas

You can use the Run button at the top of the page to execute the code.

!pip install jovian --upgrade --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="webscrapping-pressfreedom-report", git_commit=True)