Assignment 3 - Web Scraping Practice

Introduction to Programming with Python

In this assignment, you will apply your knowledge of Python and its ecosystem of libraries to scrape information from a website in the list below (or another site of your choice) and create a dataset of one or more CSV files. Here are the steps you'll follow:

Pick a website and describe your objective
- Pick a site to scrape from the list below. (Note: you can also pick a site that's not listed.)
  1. Dataset of Quotes (BrainyQuote): https://www.brainyquote.com/topics
  2. Dataset of Movies/TV Shows (TMDb): https://www.themoviedb.org
  3. Dataset of Books (BooksToScrape): http://books.toscrape.com
  4. Dataset of Quotes (QuotesToScrape): http://quotes.toscrape.com
  5. User Reviews (ConsumerAffairs): https://www.consumeraffairs.com/
  6. Stock Prices (Yahoo Finance): https://finance.yahoo.com/quote/TWTR
  7. Songs Dataset (AZLyrics): https://www.azlyrics.com/f.html
  8. Scrape a Popular Blog: https://m.signalvnoise.com/search/
  9. Weekly Top Songs (Top 40 Weekly): https://top40weekly.com
  10. Video Games Dataset (Steam): https://store.steampowered.com/genre/Free%20to%20Play/

- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your assignment idea in a paragraph using a Markdown cell and outline your strategy.

Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
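
The download steps above can be sketched as follows. This is a minimal illustration rather than a required structure: it uses QuotesToScrape from the list above, and the `/tag/<tag>/page/<n>/` URL pattern, the function names, and the `pages` folder are assumptions made for the example. Inspect your chosen site to find its real URL scheme.

```python
from pathlib import Path

import requests

BASE_URL = "http://quotes.toscrape.com"  # one of the sites listed above


def get_topic_url(tag, page=1):
    """Build the URL of one listing page for a tag (assumed URL pattern)."""
    return f"{BASE_URL}/tag/{tag}/page/{page}/"


def download_page(url, path):
    """Download `url`, save its HTML to `path`, and return the saved path."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(response.text, encoding="utf-8")
    return path


def download_topic(tag, pages=1, folder="pages"):
    """Download the first `pages` listing pages for `tag` into `folder`."""
    saved = []
    for page in range(1, pages + 1):
        target = Path(folder) / f"{tag}-{page}.html"
        saved.append(download_page(get_topic_url(tag, page), target))
    return saved
```

Calling `download_topic("love", pages=2)` would then save two HTML files locally, so later parsing steps can be rerun without downloading the pages again.
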
Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
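
As a rough sketch of the parsing steps above, the example below extracts a list of dictionaries from a small hand-written HTML snippet. The tag and class names (`div.quote`, `span.text`, `small.author`, `a.tag`) are assumptions modeled on a quotes page; the real markup of your chosen site may differ, so check it in the browser's inspector first.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, used here instead of a downloaded page.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <small class="author">Sample Author</small>
  <div class="tags">
    <a class="tag" href="/tag/sample/">sample</a>
    <a class="tag" href="/tag/demo/">demo</a>
  </div>
</div>
"""


def parse_quotes(html):
    """Return one dictionary per quote block found in `html`."""
    doc = BeautifulSoup(html, "html.parser")
    rows = []
    for block in doc.find_all("div", class_="quote"):
        tags = [a.get_text(strip=True) for a in block.find_all("a", class_="tag")]
        rows.append({
            "text": block.find("span", class_="text").get_text(strip=True),
            "author": block.find("small", class_="author").get_text(strip=True),
            "tags": ", ".join(tags),
        })
    return rows


rows = parse_quotes(SAMPLE_HTML)
```

Returning a list of dictionaries with the same keys keeps each row aligned with the columns of the CSV file you plan to write.
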
Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Attach the CSV files to your notebook using jovian.commit.
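
One possible shape for the saving step is sketched below using Python's built-in `csv` module. The function names are hypothetical, and `parse_page` stands for whatever parsing function you wrote for your site; it is assumed to return a list of dictionaries that all share the same keys.

```python
import csv


def write_csv(rows, path):
    """Write a list of dictionaries to `path`, one CSV row per dictionary."""
    if not rows:
        raise ValueError("no rows to write")
    with open(path, "w", newline="", encoding="utf-8") as f:
        # The keys of the first row become the column headers.
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


def pages_to_csv(html_pages, parse_page, path):
    """Parse each HTML string with `parse_page` and save all rows to `path`."""
    rows = []
    for html in html_pages:
        rows.extend(parse_page(html))
    return write_csv(rows, path)
```

Running `pages_to_csv` once per topic or search query, with a different output path each time, produces the set of CSV files the assignment asks for.
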
Document and share your work
- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to Jovian and make a submission.
- (Optional) Write a blog post about your project and share it online.

Notes

- Review the evaluation criteria on the "Submit" tab and look for project ideas under the "Resources" tab.
- There's no starter notebook for this project. Use the "New" button on Jovian, and select "Run on Binder" to get started.
- Ask questions, get help, and share your work on the Slack group. Help others by sharing feedback and answering questions.
- Record snapshots of your notebook from time to time using Ctrl/Cmd + S, to ensure that you don't lose any work.
- Websites with dynamic content (fetched by JavaScript after the page loads) cannot be scraped with requests and Beautiful Soup alone, because the downloaded HTML does not contain that content. One way to scrape a dynamic website is to use Selenium.
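
A quick way to tell whether a page falls into the dynamic category is to count the elements you care about in the downloaded HTML. The sketch below uses two hand-written snippets and a hypothetical helper, `count_elements`; the `div.quote` markup is an assumption for illustration. If the count is zero for a page that clearly shows the content in the browser, the content is probably loaded after the page loads.

```python
from bs4 import BeautifulSoup

# Static page: the content is already present in the HTML source.
STATIC_HTML = """
<div class="quote">First</div>
<div class="quote">Second</div>
"""

# Dynamic page: the HTML source holds only an empty container and a script
# that would fill it in the browser.
DYNAMIC_HTML = """
<div id="quotes"></div>
<script src="/load-quotes.js"></script>
"""


def count_elements(html, tag, class_name):
    """Return how many `tag` elements with class `class_name` are in `html`."""
    doc = BeautifulSoup(html, "html.parser")
    return len(doc.find_all(tag, class_=class_name))


static_count = count_elements(STATIC_HTML, "div", "quote")
dynamic_count = count_elements(DYNAMIC_HTML, "div", "quote")
```
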

The "Resume Description" field in the submission form should contain a summary of your assignment in no more than 3 points. You can use this description to present this assignment as a project on your resume. Follow the resume guide linked in the evaluation criteria to come up with a good description.

You can submit multiple times. Only your last submission will be evaluated.

Evaluation Criteria

Your submission must meet the following criteria to receive a PASS grade in the assignment:

- The Jupyter notebook should run end-to-end without any errors or exceptions.
- The Jupyter notebook should contain execution outputs for all the code cells.
- The Jupyter notebook should contain proper documentation (headings, sub-headings, summary, future work ideas, references, etc.) in Markdown cells.
- Your assignment should involve web scraping of at least two web pages.
- Your assignment should use the appropriate libraries for web scraping.
- Your submission should include the CSV file generated by scraping.
- The submitted CSV file should contain at least 3 columns and 100 rows of data.
- The Jupyter notebook should be publicly accessible (not "Private" or "Secret").
- Follow this guide for the "Resume Description" field in the submission form: https://jovian.com/program/jovian-data-science-bootcamp/knowledge/08-presenting-projects-on-your-resume-231
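
Before submitting, it may help to check a generated file against the size criterion above. The sketch below is one way to do that with the built-in `csv` module; the function name `check_csv` is hypothetical, and it assumes the first line of the file is a header row that does not count toward the 100 rows.

```python
import csv


def check_csv(path, min_columns=3, min_rows=100):
    """Count columns and data rows in `path` and compare with the minimums."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # first line is assumed to be the header
        n_rows = sum(1 for _ in reader)  # remaining lines are data rows
    return {
        "columns": len(header),
        "rows": n_rows,
        "ok": len(header) >= min_columns and n_rows >= min_rows,
    }
```

For example, `check_csv("quotes.csv")["ok"]` would be `False` for a file with fewer than 100 data rows, which signals that more pages or topics need to be scraped.
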