TOP GITHUB REPOSITORIES BY TOPIC (PYTHON WEB SCRAPE)
We will use the following strategy:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
The scraped data will be saved as CSV files in the "/data" folder, which is created automatically if it does not already exist.
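The output step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `save_topic_csv`, its parameters, and the `sample_repos` list are hypothetical, and it uses the standard-library `csv` module instead of pandas only to keep the sketch free of dependencies.

```python
import csv
import os


def save_topic_csv(topic_name, repos, data_dir="data"):
    """Write one topic's repositories to <data_dir>/<topic_name>.csv.

    Hypothetical helper: `repos` is assumed to be an iterable of
    (repo_name, username, stars, repo_url) tuples.
    """
    # Create the output folder if it does not exist yet.
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, f"{topic_name}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header row in the format shown above.
        writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
        writer.writerows(repos)
    return path


# Sample rows copied from the CSV format shown above.
sample_repos = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
```

Calling `save_topic_csv("sample-topic", sample_repos)` would create the folder on first use and write `data/sample-topic.csv`; the topic name here is a placeholder.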
We will use Python, requests, Beautiful Soup and pandas.
#!python -m pip install --upgrade pip
#!python -m pip install requests --quiet
!python -m pip install BeautifulSoup4 --quiet
import pandas as pd
import requests
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)
response.status_code
200
page_content = response.text
len(page_content)
147067
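The status check above is done by eye. One way to make it explicit is to fail early when the page does not load, as in the sketch below. The helper name `page_text_or_raise` is hypothetical; it only assumes the response object exposes `status_code` and `text`, as the `requests` response above does.

```python
def page_text_or_raise(response):
    """Return the page HTML if the request succeeded, otherwise raise.

    Hypothetical helper: works with any object that has `status_code`
    and `text` attributes, such as the response from requests.get().
    """
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load page: HTTP {response.status_code}")
    return response.text


# Intended use with the variables defined above:
# page_content = page_text_or_raise(requests.get(topics_url))
```

Raising an exception here is a design choice that stops the scraper before it tries to parse an error page; returning `None` and checking it at the call site would work as well.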