Scraping Repositories of Top GitHub Topics

Web scraping is the process of gathering information from websites in an automated fashion with the help of a computer program and presenting it in a meaningful way. It's a useful technique for creating datasets for research and learning.

For this project, we will scrape the repositories of the top topics available on GitHub. GitHub is a platform that allows us to host our code in the cloud for collaboration and version control. In short, GitHub lets people work together on a project, and it hosts both public and private repositories.

To do this, we will write code in Python using the libraries requests, bs4, and pandas, and then save the generated output to a CSV file for each topic.
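Before setting anything up, here is a rough sketch of the shape the scraper will take. Everything in it (the function name, the output directory, the entry-point URL) is a placeholder for illustration; the actual steps are developed in the sections below.

import os
import requests
from bs4 import BeautifulSoup

def scrape_topics_to_csv(base_url='https://github.com/topics', out_dir='data'):
    """Outline only: download the topics page, then write one CSV per topic."""
    os.makedirs(out_dir, exist_ok=True)              # folder for the CSV files
    page = requests.get(base_url)                    # step 1: download the page
    soup = BeautifulSoup(page.text, 'html.parser')   # step 2: parse the HTML
    # steps 3-5: extract topic names and URLs, scrape each topic's top
    # repositories into a DataFrame, and save it with DataFrame.to_csv()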

Setting up the environment by installing the required Python modules

  • Requests allows us to interact with websites and download their pages
  • Beautiful Soup allows us to parse HTML documents
  • Pandas allows us to create DataFrames and store the extracted information
!pip install requests --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

Import the packages:

import requests                          # use requests to download the page
from bs4 import BeautifulSoup            # use BeautifulSoup to parse the page
import os                                # use os to make a directory to store downloaded files
import pandas as pd                      # use Pandas to create DataFrame
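As a quick sanity check that the setup works, we can download the GitHub topics page and parse it. This assumes the page lives at https://github.com/topics and that the request is not blocked or rate-limited:

topics_url = 'https://github.com/topics'
response = requests.get(topics_url)                  # download the topics page
print(response.status_code)                          # 200 means the request succeeded

doc = BeautifulSoup(response.text, 'html.parser')    # parse the raw HTML
print(doc.title.text)                                # the page <title> confirms we parsed real HTML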