
Introduction to Web Scraping and REST APIs

This tutorial is part of the Zero to Data Analyst Bootcamp by Jovian


Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. While web scraping often involves parsing and processing HTML documents, some platforms offer REST APIs to retrieve information in a machine-readable format like JSON. In this tutorial, we'll use web scraping and REST APIs to create a real-world dataset.
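As a taste of the REST API side, here's what retrieving JSON with the requests library can look like. This is a minimal sketch using GitHub's public API; the endpoint and fields shown are illustrative examples, not necessarily the ones this tutorial uses.

import requests

# Query a REST API endpoint that returns JSON. GitHub's public API exposes
# repository metadata at /repos/{owner}/{repo}; python/cpython is just an
# example repository.
response = requests.get('https://api.github.com/repos/python/cpython')
data = response.json()  # parse the JSON body into a Python dictionary
print(data['full_name'], data['stargazers_count'])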

The following topics are covered in this tutorial:

  • Downloading web pages using the requests library
  • Inspecting the HTML source code of a web page
  • Parsing parts of a website using Beautiful Soup
  • Writing parsed information into CSV files
  • Using a REST API to retrieve data as JSON
  • Combining data from multiple sources
  • Using links on a page to crawl a website
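To preview how the first few pieces fit together, here's a minimal sketch that downloads a web page and parses it with Beautiful Soup. Treat it as an illustration; the page structure and parser choice are examined properly later in the tutorial.

import requests
from bs4 import BeautifulSoup

# Download a page and parse its HTML, covering the first three topics above
# in their simplest form.
response = requests.get('https://github.com/topics/machine-learning')
response.raise_for_status()  # raise an error if the download failed
doc = BeautifulSoup(response.text, 'html.parser')
print(doc.title.text)  # the contents of the page's <title> tag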

How to Run the Code

The best way to learn the material is to execute the code and experiment with it yourself. This tutorial is an executable Jupyter notebook. You can run this tutorial and experiment with the code examples in a couple of ways: using free online resources (recommended) or on your computer.

Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the Run button at the top of this page and select Run on Binder. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on Google Colab or Kaggle to use these platforms.

Option 2: Running on your computer locally

To run the code on your own computer, you'll need to set up Python, download the notebook, and install the required libraries. We recommend using the Conda distribution of Python. Click the Run button at the top of this page, select the Run Locally option, and follow the instructions.
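Whether you're running locally or on a fresh cloud instance, the required libraries can be installed from within the notebook itself. The package list below is an assumption based on the topics covered above; adjust it for your environment.

# Assumed package set: requests for HTTP, beautifulsoup4 for HTML parsing,
# and jovian for committing the notebook.
!pip install requests beautifulsoup4 jovian --quiet --upgrade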

Problem

Over the course of this tutorial, we'll solve the following problem to learn the tools and techniques used for web scraping:

QUESTION: Write a Python function that creates a CSV file (comma-separated values) containing details about the top 25 GitHub repositories for any given topic. The top repositories for the topic machine-learning can be found on this page: https://github.com/topics/machine-learning. The output CSV should contain these details: repository name, owner's username, number of stars, and repository URL.


How would you go about solving this problem in Python? Explore the web page and take a couple of minutes to come up with an approach before proceeding further. How many lines of code do you think the solution will require?
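Once you've sketched an approach of your own, here's one possible skeleton to compare against. Every name in it is hypothetical, and the tutorial develops its own solution step by step in the sections that follow.

import csv
import requests
from bs4 import BeautifulSoup

def scrape_topic_repos(topic, path=None):
    # Download the topic page, e.g. https://github.com/topics/machine-learning
    response = requests.get('https://github.com/topics/' + topic)
    response.raise_for_status()
    doc = BeautifulSoup(response.text, 'html.parser')
    # TODO: locate the repository cards in `doc` and extract the name,
    # owner's username, star count, and URL for each of the top 25 repos.
    rows = []  # each row: [repo_name, username, stars, repo_url]
    with open(path or topic + '.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['repo_name', 'username', 'stars', 'repo_url'])
        writer.writerows(rows)

Calling scrape_topic_repos('machine-learning') would then produce a machine-learning.csv file with one row per repository.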