
Scraping Top Repositories For Topics On GitHub

Topic Outlines:

  • Introduction to web scraping
  • Introduction to GitHub and the problem statement
  • Tools we are using in this project
  • Link to the GitHub topics page : https://www.github.com/topics

A brief introduction to web scraping

    Web scraping, web harvesting, or web data extraction is the process of extracting data from websites. Various web scraping tools and software are available on the market, and all of them use HTTP requests to fetch the web pages to be scraped.

Introduction to GitHub

    GitHub is a for-profit company owned by Microsoft that offers a cloud-based Git repository hosting service. Essentially, it makes it much easier for individuals and teams to use Git for version control and collaboration.

Problem statement :

We need to find the currently trending topics on GitHub, along with each topic's associated repositories, repository owners, and star counts.

Tools to be used

    We are using various Python packages in order to scrape the web and extract the relevant information into a CSV file.
  • requests library : documentation
    • description : Requests is a Python module that you can use to send all kinds of HTTP requests.
    # command to install the requests library
    !pip install requests -q
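As a quick sketch of how requests is used to download a page (the URL is the topics page mentioned above):

```python
import requests

# Download the GitHub topics page; requests follows redirects automatically.
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)

# A status code of 200 means the download succeeded.
print(response.status_code)

# response.text holds the page's HTML as a string.
page_contents = response.text
```

The returned HTML string is what we will later hand to Beautiful Soup for parsing.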
  • Beautiful Soup library : documentation
    • description : Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
    # command to install the beautiful soup library
    !pip install beautifulsoup4 -q
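A minimal sketch of Beautiful Soup in action. To keep it self-contained, it parses a hand-written HTML snippet; the snippet and its class names are made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to demonstrate parsing.
html = '''
<div>
  <p class="topic-title">3D</p>
  <p class="topic-desc">3D modeling is the process of developing a
  mathematical representation of an object.</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first tag matching the given name and attributes.
title = soup.find('p', class_='topic-title').text.strip()
desc = soup.find('p', class_='topic-desc').text.strip()
print(title)   # 3D
```

On the real page we would pass `response.text` from requests instead of the hand-written snippet.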
  • Pandas library : documentation
    • description : pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. pandas also ships with many built-in features; converting a DataFrame into a CSV file is one of them.
    # command to install the pandas library
    !pip install pandas -q
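A minimal sketch of the DataFrame-to-CSV conversion mentioned above, using made-up topic data:

```python
import pandas as pd

# Build a DataFrame from a dictionary of made-up topic data.
topics_dict = {
    'title': ['3D', 'Ajax'],
    'description': ['3D modeling', 'Asynchronous JavaScript'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
topics_df = pd.DataFrame(topics_dict)

# index=False omits the row-number column from the output.
csv_text = topics_df.to_csv(index=False)
print(csv_text.splitlines()[0])   # title,description,url
```

In the project we would pass a filename to to_csv instead of capturing the string.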

Project Outlines

  • for each topic, we'll grab the topic title, topic description, and topic page URL
  • for each topic, we will get the top 25-30 repositories in the topic from the topic page
  • for each repository, we'll grab the repo name, username, stars, and repo URL
  • for each topic, we'll create a separate CSV file in the following format:
    RepoName, UserName, Repo Star, Repo URL
    

Scraping the list of GitHub topics

How are we going to do it?

  • use the requests library to download the GitHub topics page
  • use Beautiful Soup to parse the page and extract the topic name, topic description, and topic page URL
  • use pandas to convert the data into a DataFrame and save it into a CSV file
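Putting the three steps together, here is an end-to-end sketch. To keep it self-contained it parses a hand-written HTML snippet instead of the live page, and the class names (`topic-title`, `topic-desc`) are placeholders I made up — the real GitHub page uses its own class names, which you would first inspect in the browser's developer tools:

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_topics(html):
    """Extract topic titles, descriptions, and URLs from topics-page HTML.

    The 'topic-title'/'topic-desc' class names below are placeholders;
    replace them with the actual classes found on github.com/topics.
    """
    soup = BeautifulSoup(html, 'html.parser')
    titles = [tag.text.strip() for tag in soup.find_all('p', class_='topic-title')]
    descs = [tag.text.strip() for tag in soup.find_all('p', class_='topic-desc')]
    urls = ['https://github.com/topics/' + t.lower() for t in titles]
    return pd.DataFrame({'title': titles, 'description': descs, 'url': urls})

# Hand-written snippet standing in for the downloaded page; in the real
# project this would be requests.get('https://github.com/topics').text.
sample_html = '''
<p class="topic-title">3D</p><p class="topic-desc">3D modeling.</p>
<p class="topic-title">Ajax</p><p class="topic-desc">Asynchronous JavaScript.</p>
'''

topics_df = parse_topics(sample_html)
topics_df.to_csv('topics.csv', index=False)   # save the DataFrame to a CSV file
```

Keeping the parsing logic in a function makes it easy to swap the sample snippet for the downloaded page later.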

Importing all the essential libraries that we are going to need in this project
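Based on the tools listed above, the import cell would look something like:

```python
import requests                  # downloading web pages over HTTP
import pandas as pd              # tabular data handling and CSV export
from bs4 import BeautifulSoup    # parsing HTML
```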