Workshop - Web Scraping with Selenium & AWS
Introduction to Programming with Python
Please refer to the link below for the updated code.
Code: https://github.com/aakashns/selenium-youtube-scraper-live
Web scraping is a great way to extract public information from websites and create datasets for data analysis and machine learning. In this live hands-on workshop, we walk through the process of building and deploying a web scraping project from scratch using Python, Selenium, and AWS Lambda.
Objective
- Scrape top 10 trending videos on YouTube using Selenium
- Set up a recurring job on AWS Lambda to scrape every 30 minutes
- Send the results as a CSV attachment over email (or to a spreadsheet)
Prerequisites
Python
Topics Covered
- GitHub
- Replit
- Selenium
- AWS Lambda
- SMTP
Step 1 - Create a GitHub repository
- Create a repository at https://github.com/new
- Add README, gitignore (Python) and license
- (Optional) Clone the repository locally
- References:
- Introduction to GitHub: https://lab.github.com/githubtraining/introduction-to-github
- Git & GitHub tutorial: https://www.youtube.com/watch?v=RGOj5yH7evk
Step 2 - Launch the repository on Replit
Note: After a recent Replit update, Chromedriver & Chromium no longer come pre-installed. Please follow these steps to add chromedriver & chromium to Replit: https://jovian.ai/birajde9/replit-add-chromdriver-chromium
- Connect Replit with your GitHub account
- Launch the repository as a Replit project
- Set up the language and run command
- Create and execute a Python script
- Attempt to scrape the page using requests & Beautiful Soup
- Sometimes the code will not scrape all the videos if the page is not loaded completely. Import the time module and use time.sleep(5) to wait for the page to load completely before finding the elements.
- References:
- Introduction to Replit: https://docs.replit.com/tutorials/01-introduction-to-the-repl-it-ide
- Replit + GitHub: https://docs.replit.com/tutorials/06-github-and-run-button
- YouTube trending feed: https://www.youtube.com/feed/trending
- Beautiful soup tutorial: https://blog.jovian.ai/web-scraping-using-python-and-beautifulsoup-adf43cbdb816
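The requests + Beautiful Soup attempt can be sketched against a static HTML snippet. The ytd-video-renderer tag and video-title id used here are assumptions about YouTube's markup (which changes often). Against the live trending page, requests receives the HTML before any JavaScript runs, so find_all typically comes back empty or incomplete, which is what motivates switching to Selenium in the next step.

```python
from bs4 import BeautifulSoup

# Static snippet mimicking the (assumed) structure of YouTube's trending feed.
# Fetching the real page with requests.get() would return HTML *before*
# JavaScript renders the videos, so these tags would usually be missing.
html = """
<div id="contents">
  <ytd-video-renderer><a id="video-title" href="/watch?v=abc">Video A</a></ytd-video-renderer>
  <ytd-video-renderer><a id="video-title" href="/watch?v=def">Video B</a></ytd-video-renderer>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect the title text and relative URL of each video link
titles = [a.text for a in soup.find_all('a', id='video-title')]
urls = [a['href'] for a in soup.find_all('a', id='video-title')]
```

This works on static HTML, but running the same find_all against the live page is where the approach breaks down.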
Step 3 - Extract information using Selenium
- Install selenium and create a browser driver
- Load the page and extract information
- Create a CSV of results using Pandas
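A minimal sketch of Step 3, assuming Chrome and chromedriver are installed and on the PATH. The tag name ytd-video-renderer and the element IDs video-title and channel-name are assumptions about YouTube's current markup; verify them in the browser's developer tools before relying on them.

```python
import time
import pandas as pd

YOUTUBE_TRENDING_URL = 'https://www.youtube.com/feed/trending'

def get_driver():
    # Selenium is imported lazily so the CSV helper below also works
    # in environments where Selenium isn't installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_argument('--headless')              # no visible browser window
    options.add_argument('--no-sandbox')            # needed on Replit / Lambda
    options.add_argument('--disable-dev-shm-usage')
    return webdriver.Chrome(options=options)

def scrape_top_videos(n=10):
    from selenium.webdriver.common.by import By
    driver = get_driver()
    driver.get(YOUTUBE_TRENDING_URL)
    time.sleep(5)  # give the JS-rendered feed time to load (see the Step 2 note)
    rows = []
    # 'ytd-video-renderer', 'video-title' and 'channel-name' are assumed selectors
    for video in driver.find_elements(By.TAG_NAME, 'ytd-video-renderer')[:n]:
        title = video.find_element(By.ID, 'video-title')
        rows.append({
            'title': title.text,
            'url': title.get_attribute('href'),
            'channel': video.find_element(By.ID, 'channel-name').text,
        })
    driver.quit()
    return rows

def save_csv(rows, path='trending.csv'):
    # Turn the list of dicts into a DataFrame and write it out as CSV
    pd.DataFrame(rows).to_csv(path, index=None)
```

A typical run would be `save_csv(scrape_top_videos())`, producing a trending.csv with one row per video.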
Step 4 - Send results over email using SMTP
NOTE: Google security policy has been updated and the discussed procedure won't let you send emails using just a username and password now. Follow this blog for the steps to send results over email using SMTP: https://blog.jovian.ai/web-scraping-using-selenium-2a3ffa1f03f4
- Create an email client using smtplib
- Set up SSL, TLS and authenticate with password
- Send a sample email with just text
- Send an email with text and attachment
- References:
- Sending Email with Python: https://stackabuse.com/how-to-send-emails-with-gmail-using-python/
- Send email using Python: https://www.geeksforgeeks.org/send-mail-attachment-gmail-account-using-python/
- Environment variables on Replit: https://docs.replit.com/programming-ide/storing-sensitive-information-environment-variables
- Environment variables on AWS Lambda: https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
- Update Google sheets using Python: https://www.analyticsvidhya.com/blog/2020/07/read-and-update-google-spreadsheets-with-python/
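The email side of Step 4 can be sketched with the standard library alone. The sketch below builds the message separately from sending it; smtp.gmail.com:465 is Gmail's SSL endpoint, and, per the note above, the password must be a Gmail app password, read here from an environment variable (an assumption about your setup).

```python
import os
import smtplib
import ssl
from email.message import EmailMessage

def build_message(sender, to, subject, body, csv_bytes=None, filename='trending.csv'):
    # Assemble the email; attaching bytes makes it a multipart message
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = to
    msg['Subject'] = subject
    msg.set_content(body)
    if csv_bytes is not None:
        msg.add_attachment(csv_bytes, maintype='text', subtype='csv',
                           filename=filename)
    return msg

def send_message(msg):
    # App password kept out of the code via an environment variable
    password = os.environ['GMAIL_APP_PASSWORD']
    context = ssl.create_default_context()  # SSL as covered in this step
    with smtplib.SMTP_SSL('smtp.gmail.com', 465, context=context) as server:
        server.login(msg['From'], password)
        server.send_message(msg)
```

Sending a text-only sample email is just `send_message(build_message(sender, to, 'Test', 'Hello'))`; passing the CSV bytes adds the attachment.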
Step 5 - Set up a recurring job on AWS Lambda
- Create an AWS Lambda Python function
- Deploy a sample script and observe the output
- Add layers for Selenium and Chromium
- Set up recurring job using AWS CloudWatch
- References:
- Python on AWS Lambda tutorial: https://stackify.com/aws-lambda-with-python-a-complete-getting-started-guide
- Chromium & Selenium on AWS Lambda: https://dev.to/awscommunity-asean/creating-an-api-that-runs-selenium-via-aws-lambda-3ck3
- Recurring AWS Lambda functions: https://docs.aws.amazon.com/lambda/latest/dg/services-cloudwatchevents-expressions.html
- Selenium Lambda Layers: https://github.com/aakashns/selenium-aws-lambda-layers
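The Lambda side of Step 5 reduces to writing a handler with the signature Lambda expects and pointing a CloudWatch Events schedule at it. In the sketch below the calls to the earlier scraping and email helpers are stubbed out with a placeholder so the handler contract itself is clear; in the deployed function they would come from the code packaged alongside the Selenium and Chromium layers.

```python
import json

def lambda_handler(event, context):
    # In the deployed function this would call the Step 3/4 helpers,
    # e.g. rows = scrape_top_videos(); send_message(...). A placeholder
    # keeps the sketch self-contained.
    rows = [{'title': 'placeholder'}]
    return {
        'statusCode': 200,
        'body': json.dumps({'videos_scraped': len(rows)}),
    }

# CloudWatch Events / EventBridge schedule expression for the recurring job:
#   rate(30 minutes)
```

The rate(30 minutes) expression goes on the CloudWatch Events rule that triggers the function, matching the 30-minute cadence from the objectives.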
The workshop lasts approximately 3 hours and all code will be written live during the workshop. You will be able to follow along with the recording to work on your own web scraping project.
External Resources
- Tutorial on Selenium XPATH: https://www.simplilearn.com/tutorials/selenium-tutorial/xpath-in-selenium
- Complete guide on Selenium XPATH: https://www.lambdatest.com/blog/complete-guide-for-using-xpath-in-selenium-with-examples/
Check out this project blog: