
Web Scraping Top Repositories for Topics on GitHub

Introduction

This is code written in Python, using tools like Requests, Beautiful Soup and Pandas, to scrape https://github.com/topics. By scraping I mean extracting the information we need from the selected website. Here I have chosen GitHub to work on; the code grabs the repository name, username, stars and repository URL for each repository on a GitHub topic page.

Steps to follow:

  • We're going to scrape https://github.com/topics
  • We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
  • For each topic, we'll get the top 25 repositories in the topic from the topic page
  • For each repository, we'll grab the repo name, username, stars and repo URL
  • For each topic we will create a CSV file in the following format (a minimal sketch of this step follows the list):
Repo Name,Username,No. of stars,Repo URL
Three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
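
As a rough sketch of that last step, assuming the repository details for a topic have already been collected into plain Python lists (the data and the file name below are illustrative, not actual scraped values), Pandas can write them out in exactly this format:

    import pandas as pd

    # Illustrative data for a single topic; in the real project these
    # values come from scraping the topic page.
    topic_repos = {
        'Repo Name': ['three.js', 'libgdx'],
        'Username': ['mrdoob', 'libgdx'],
        'No. of stars': [69700, 18300],
        'Repo URL': ['https://github.com/mrdoob/three.js',
                     'https://github.com/libgdx/libgdx'],
    }

    # Build a DataFrame and save it as a CSV file named after the topic.
    topic_df = pd.DataFrame(topic_repos)
    topic_df.to_csv('topic-name.csv', index=False)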

Tools used to scrape the list of topics from GitHub

  • Requests: to download the page
  • Beautiful Soup (bs4): to parse and extract information
  • Pandas: to convert the extracted data into a DataFrame
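
Assuming these packages are installed (for example via pip install requests beautifulsoup4 pandas), the imports used in the rest of the code would typically look like this:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd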

Requests: To download the page

We use Requests to fetch the page content from GitHub. After importing requests, calling requests.get() on a URL returns a Response object, and this Response object gives us access to the content we need.
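
As a minimal sketch of this step, assuming we are downloading the topics page itself (the variable names here are illustrative):

    import requests

    # Download the GitHub topics page.
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    # A status code of 200 means the request succeeded.
    print(response.status_code)

    # The HTML source of the page is available as a string.
    page_contents = response.text
    print(len(page_contents))

The page_contents string is what we hand to Beautiful Soup for parsing in the next step.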