Command-Line HTML5 Linux Python 3 Web Crawling
Schedule: Fri, Jul 30, 11:10-11:55 CEST (45 min)
What is web scraping, and why should you learn how to do it?
I will talk about what it means to scrape data from the web and what is wrong with the go-to copy-paste approach. The talk also covers the benefits of web scraping, not only for work but also for making everyday life simpler: job searching, product price monitoring, and collecting data to train ML models.
This talk will first cover the basics, such as using CSS selectors to target data in a page.
I will cover how to use developer tools to look for a tag, class, or id to target the required data.
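As a rough illustration of targeting data by tag, class, or id once you have found them in the developer tools, here is a minimal sketch using only Python's standard-library `html.parser`. The sample HTML, the `TargetCollector` class, and the `collect` helper are hypothetical names invented for this example and are not taken from the talk. The sketch does not handle nested elements of the same tag.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the tag, class, and id names are made up.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Laptop deals</h1>
  <span class="price">499</span>
  <span class="price sale">449</span>
  <span class="note">free shipping</span>
</body></html>
"""


class TargetCollector(HTMLParser):
    """Collect the text of elements matching a tag plus an optional class or id."""

    def __init__(self, tag, class_name=None, element_id=None):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.element_id = element_id
        self.results = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        attr_map = dict(attrs)
        # An element can carry several classes separated by whitespace.
        classes = (attr_map.get("class") or "").split()
        if self.class_name is not None and self.class_name not in classes:
            return
        if self.element_id is not None and attr_map.get("id") != self.element_id:
            return
        self._capturing = True
        self.results.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a matching element.
        if self._capturing:
            self.results[-1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._capturing = False


def collect(html, tag, class_name=None, element_id=None):
    """Return the stripped text of every matching element in the HTML string."""
    parser = TargetCollector(tag, class_name, element_id)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


prices = collect(SAMPLE_HTML, "span", class_name="price")
title = collect(SAMPLE_HTML, "h1", element_id="title")
print(prices, title)
```

In practice a library such as BeautifulSoup expresses the same idea more compactly with CSS selectors; the standard-library version is used here only so the sketch runs without extra installs.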
I will also go through some basics of Regex which can be very useful in targeting required data.
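To give a flavour of how a regular expression can target data, the following sketch pulls dollar prices out of a string with Python's `re` module. The sample text, the `PRICE_PATTERN` name, and the `extract_prices` helper are assumptions made up for illustration, and the pattern only covers prices written with exactly two decimal places.

```python
import re

# Hypothetical scraped text; the product names and prices are invented.
TEXT = "Laptop A: $499.00, Laptop B: $1,249.50 (was $1,399.00)"

# \$ matches a literal dollar sign. The capturing group matches digits with
# optional thousands separators, followed by exactly two decimal places.
PRICE_PATTERN = re.compile(r"\$(\d{1,3}(?:,\d{3})*\.\d{2})")


def extract_prices(text):
    """Return every matched price as a float, with thousands separators removed."""
    return [float(match.replace(",", "")) for match in PRICE_PATTERN.findall(text)]


prices = extract_prices(TEXT)
print(prices)
```

Regular expressions work well for small, regular patterns like this one inside text you have already extracted; for navigating the HTML structure itself, a parser is usually the more robust choice.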
After building this foundation, I’ll show you some of the major tools, including requests, BeautifulSoup, Selenium, and Scrapy, in an interactive session where you can follow along as we work up from the most user-friendly tools, requests and BeautifulSoup, to more specialized ones.
We’ll then go a step further and take a brief look at more complex topics. I will cover the problems that you will come across while scraping, such as asynchronous loading and client-side rendering, authentication, redirects, and captchas, along with their possible workarounds. Finally, I’ll show how to automate web scraping tasks using cron (Linux) and Task Scheduler (Windows).
You’ll leave the talk with a good understanding of the techniques of web scraping, and a library of useful tools you can use to write your own scrapers.
Type: Talk (45 mins); Python level: Beginner; Domain level: Beginner
I am an undergraduate majoring in electronics and communication. My educational aspirations are to acquire a deep understanding of software development and machine learning that will help me pursue a master's degree in computer science. My career vision is to be an asset to the software development field, someone responsible enough to solve some serious world problems.