Building a Python Web Scraping Project From Scratch
This project guide is a part of the Zero to Data Analyst Bootcamp by Jovian.
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:
-
Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.
-
Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
-
Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
-
Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
-
Document and share your work
- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.
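The steps above can be sketched as a small pipeline. This is only a minimal illustration, not a reference solution: it assumes the practice site Quotes To Scrape from the Project Ideas list, assumes that tag pages live at /tag/<tag>/, and assumes the markup uses div.quote, span.text, small.author, and a.tag. The function names (topic_url, download_page, parse_quotes, write_csv, scrape_topic) are made up for this example. Inspect the real page source and adjust the selectors before relying on it.

```python
import csv
import os

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://quotes.toscrape.com"  # practice site from the Project Ideas list


def topic_url(tag):
    """Build the page URL for one tag (assumed URL pattern: /tag/<tag>/)."""
    return f"{BASE_URL}/tag/{tag}/"


def download_page(url, path):
    """Download a page with requests and save the HTML locally."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return path


def parse_quotes(html):
    """Extract one dictionary per quote (class names are assumptions)."""
    doc = BeautifulSoup(html, "html.parser")
    rows = []
    for block in doc.find_all("div", class_="quote"):
        tags = [a.get_text(strip=True) for a in block.find_all("a", class_="tag")]
        rows.append({
            "text": block.find("span", class_="text").get_text(strip=True),
            "author": block.find("small", class_="author").get_text(strip=True),
            "tags": "|".join(tags),
        })
    return rows


def write_csv(rows, path):
    """Write the list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def scrape_topic(tag, folder="data"):
    """End-to-end: download the tag page, parse it, and save a CSV."""
    os.makedirs(folder, exist_ok=True)
    html_path = download_page(topic_url(tag), os.path.join(folder, f"{tag}.html"))
    with open(html_path, encoding="utf-8") as f:
        rows = parse_quotes(f.read())
    return write_csv(rows, os.path.join(folder, f"{tag}.csv"))
```

Calling scrape_topic("love") and then scrape_topic("humor") would produce one CSV per tag in the data folder. To verify a file, read it back in the notebook with pandas, for example pd.read_csv("data/love.csv"), and check that the columns and row count look right. Saving the raw HTML first means you can re-run the parsing step without downloading the page again.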
Notes
-
Use the "New" button on Jovian to create a new notebook, and select "Run on Binder" to get started.
-
Follow this tutorial to learn web scraping: https://jovian.ai/aakashns/python-web-scraping-and-rest-api
-
Check out the 20-week bootcamp to learn Python programming, web scraping, data analysis, and more: http://zerotoanalyst.com
Tweet your projects and tag @JovianML. We're retweeting 3 interesting projects every day!
Project Ideas
-
Dataset of Books (Amazon): Create a dataset of popular books in different genres by scraping the site: https://www.amazon.in/gp/bestsellers/books/
-
Dataset of Quotes (BrainyQuote): Create a dataset of quotes for different tags/topics by scraping the site: https://www.brainyquote.com/topics
-
Dataset of Movies (TMDb): The Movie Database (TMDb) contains information about thousands of movies from around the world: https://www.themoviedb.org/movie . Can you scrape the site to create a dataset of movies containing information like title, release date, cast, etc.? You can also create datasets of movie actors/actresses/directors using this site.
-
Dataset of TV Shows (TMDb): The Movie Database (TMDb) contains information about thousands of TV shows from around the world: https://www.themoviedb.org/tv . Can you scrape the site to create a dataset of TV shows containing information like title, release date, cast, crew, etc.? You can also create datasets of TV actors/actresses/directors using this site.
-
Collections of Popular Repositories (GitHub): Scrape GitHub collections ( https://github.com/collections ) to create a dataset of popular repositories organized by different use cases.
-
Dataset of Books (BooksToScrape): Create a dataset of popular books in different genres by scraping the site Books To Scrape: http://books.toscrape.com
-
Dataset of Quotes (QuotesToScrape): Create a dataset of popular quotes for different tags by scraping the site Quotes To Scrape: http://quotes.toscrape.com
-
Scrape a User's Repositories (GitHub): Given someone's GitHub username, can you scrape their GitHub profile to create a list of their repositories with information like repository name, no. of stars, no. of forks, etc.?
-
Scrape User Reviews (ConsumerAffairs): ConsumerAffairs contains reviews of thousands of brands: https://www.consumeraffairs.com/ . Can you scrape any category from the site to create a dataset of reviews containing information like title, rating, review text, toll-free number, etc.?
-
Songs Dataset (AZLyrics): Create a dataset of songs by scraping AZLyrics: https://www.azlyrics.com/f.html . Capture information like song title, artist name, year of release and lyrics URL.
-
Scrape a Popular Blog: Create a dataset of blog posts on a popular blog e.g. https://m.signalvnoise.com/search/ . The dataset can contain information like the blog title, published date, tags, author, link to blog post, etc.
-
Weekly Top Songs (Top 40 Weekly): Create a dataset of the top 40 songs of each week in a given year by scraping the site https://top40weekly.com . Capture information like song title, artist, weekly rank, etc.
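For the GitHub ideas above, the optional REST API route mentioned in the steps could look roughly like this. It is a hedged sketch, not the guide's prescribed method: it uses GitHub's public endpoint for listing a user's repositories, fetches only the first page of results (up to 100 repositories), and makes unauthenticated requests, which are rate-limited. The helper names repo_row and fetch_user_repos are invented for this example.

```python
import requests


def repo_row(repo):
    """Keep only the fields wanted in the dataset from one API record."""
    return {
        "name": repo["name"],
        "stars": repo["stargazers_count"],
        "forks": repo["forks_count"],
    }


def fetch_user_repos(username):
    """Fetch the first page of a user's public repositories as a list of rows."""
    url = f"https://api.github.com/users/{username}/repos"
    response = requests.get(url, params={"per_page": 100}, timeout=30)
    response.raise_for_status()  # fail loudly on errors such as rate limiting
    return [repo_row(repo) for repo in response.json()]
```

An API returns structured JSON, so no HTML parsing is needed. Scraping the profile page with Beautiful Soup is still a good exercise when the goal is to practice parsing.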
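NOTE: Beautiful Soup only parses the HTML that the server returns, so content that a page loads or renders with JavaScript (dynamic content) will be missing when you scrape with requests and Beautiful Soup. One way to scrape a dynamic website is to use Selenium, which drives a real browser.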
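A quick way to see the limitation described in this note is to parse a page whose content is filled in by a script. The HTML below is a made-up example, not taken from any real site: the list is empty in the markup and would only be populated once a browser runs the JavaScript.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <ul> is empty in the HTML and filled in by JavaScript.
html = """
<html><body>
  <ul id="songs"></ul>
  <script>
    document.getElementById("songs").innerHTML = "<li>Song A</li><li>Song B</li>";
  </script>
</body></html>
"""

doc = BeautifulSoup(html, "html.parser")
songs = doc.find("ul", id="songs")

# Beautiful Soup never executes the script, so it finds no <li> items here.
items = songs.find_all("li")
print(len(items))
```

If the data you want shows up in the browser but not in the downloaded HTML, the page is probably dynamic. In that case, look for a REST API behind the page or use a browser automation tool such as Selenium.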
Also check out these projects and tutorials:
- https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
- https://medium.com/the-innovation/scraping-medium-with-python-beautiful-soup-3314f898bbf5
- https://medium.com/brainstation23/how-to-become-a-pro-with-scraping-youtube-videos-in-3-minutes-a6ac56021961
- https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
- https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/
- https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852
- https://www.analyticsvidhya.com/blog/2020/10/web-scraping-selenium-in-python/
- https://medium.com/ml-book/web-scraping-using-selenium-python-3be7b8762747
!pip install jovian --upgrade --quiet
import jovian
jovian.commit(project="python-web-scraping-project-guide")