Scraping Repositories of Top GitHub Topics
Web scraping is the process of gathering information from websites in an automated fashion with the help of a computer program and presenting it in a meaningful way. It's a useful technique for creating datasets for research and learning.
For this project, we will scrape the repositories of the top topics available on GitHub. GitHub is a platform that allows us to host our code in the cloud for collaboration and version control. In essence, GitHub lets people work together on a project, and it hosts both public and private repositories.
To do this, we will write code in Python using the requests, bs4, and pandas libraries, and then save the generated output in a CSV file for each topic.
Setting up the environment by installing required Python modules
- requests allows us to interact with websites and download pages
- Beautiful Soup (bs4) allows us to parse HTML documents
- pandas allows us to create a DataFrame and store the information
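The roles of Beautiful Soup and pandas can be sketched on a tiny example. This is a minimal illustration, not the project's actual scraping code: the HTML snippet and the `topic-title` class name are made up stand-ins for a downloaded page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# A hypothetical HTML snippet standing in for a downloaded page
html = """
<div>
  <p class="topic-title">Python</p>
  <p class="topic-title">JavaScript</p>
</div>
"""

# Beautiful Soup parses the HTML into a searchable tree
soup = BeautifulSoup(html, 'html.parser')
titles = [tag.text for tag in soup.find_all('p', class_='topic-title')]

# pandas turns the extracted values into a tabular DataFrame
df = pd.DataFrame({'title': titles})
print(df)
```

The same pattern, with the real page markup, is what the rest of the project builds on: download, parse, collect into a DataFrame.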
!pip install requests --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install bs4 --upgrade --quiet
Import the packages:
import requests # use requests to download the page
from bs4 import BeautifulSoup # use BeautifulSoup to parse the page
import os # use os to make a directory to store downloaded files
import pandas as pd # use Pandas to create DataFrame
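As a quick sanity check that the imports work together, the sketch below (an illustration with made-up data, not part of the scraper itself) creates a directory with os and writes a small DataFrame to a CSV file with pandas, mirroring the one-CSV-per-topic plan described above.

```python
import os
import pandas as pd

# Create a directory for the scraped data (exist_ok avoids an error on re-runs)
os.makedirs('data', exist_ok=True)

# A tiny hypothetical dataset standing in for scraped repository info
df = pd.DataFrame({'repo_name': ['repo-a', 'repo-b'], 'stars': [1200, 850]})

# Save one CSV file per topic, as the project plan describes
df.to_csv(os.path.join('data', 'example-topic.csv'), index=False)
```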