Other Scrapers

I had so much fun making the IWF scraper API that I decided to build more scrapers. I wanted to create scrapers for getting job posts, so I made one for YCombinator's WorkAtAStartup.com [1] and Wellfound [3], and I forked and edited a LinkedIn scraper script [4].

I used the LinkedIn script frequently in my terminal until I built a web app called tears-jobs for getting recent job posts. I also created fcis-api to fetch workplace fatality and catastrophe reports from the United States Department of Labor website.

Every scraper uses BeautifulSoup4 to parse the HTML. For waasu-api and wellfound-scraper, I used Selenium to log in. When I don't need to log in to get data, I prefer the Splash headless browser [5] over Selenium.
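To give a rough idea of what the BeautifulSoup4 parsing step looks like, here is a minimal sketch. The HTML snippet, the class names (job, title, company), and the parse_jobs function are all made up for illustration; they are not taken from any of the scrapers above, and real job boards use different markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; real sites differ.
SAMPLE_HTML = """
<ul>
  <li class="job">
    <a class="title" href="/jobs/1">Backend Engineer</a>
    <span class="company">Acme</span>
  </li>
  <li class="job">
    <a class="title" href="/jobs/2">Data Analyst</a>
    <span class="company">Globex</span>
  </li>
</ul>
"""


def parse_jobs(html):
    """Return one dict per job card with its title, company, and link."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("li", class_="job"):
        title = card.find("a", class_="title")
        company = card.find("span", class_="company")
        # Skip cards that are missing a field instead of crashing.
        if title is None or company is None:
            continue
        jobs.append(
            {
                "title": title.get_text(strip=True),
                "company": company.get_text(strip=True),
                "link": title.get("href"),
            }
        )
    return jobs


if __name__ == "__main__":
    for job in parse_jobs(SAMPLE_HTML):
        print(job)
```

In a real scraper, the html argument would be the page source returned by Selenium or Splash rather than a hard-coded string.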

The Wellfound scraper was hard to implement and test because Wellfound actively blocks scrapers. The Department of Labor site behind fcis-api had no blocking at first, but now it does. In contrast, waasu-api seems to be working fine because WorkAtAStartup doesn't have any blocking measures.

Ever since LLMs came out, the part of scraping where you select data with BeautifulSoup has become a lot easier. The real challenges now lie in bypassing CAPTCHAs and rotating proxies to avoid detection. Most worthwhile websites have these protections.
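For a sense of what proxy rotation involves at its simplest, here is a small round-robin sketch using only the standard library. The ProxyRotator class, the build_opener_for helper, and the proxy addresses are hypothetical placeholders, not code from my scrapers, and a real setup would also need health checks, retries, and ban detection.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hand out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(list(proxies))

    def next_proxy(self):
        return next(self._cycle)


def build_opener_for(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    # Placeholder addresses from a documentation-only IP range.
    rotator = ProxyRotator(["http://203.0.113.10:8080", "http://203.0.113.11:8080"])
    for _ in range(3):
        proxy = rotator.next_proxy()
        opener = build_opener_for(proxy)
        # opener.open(url, timeout=10) would send the request through this proxy.
        print("next request would use", proxy)
```

Each request gets the next proxy in the list, so consecutive requests come from different addresses.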

Unless you use services for CAPTCHA solving and proxy rotation, you spend a lot of time building these components for your scraper. Integrating these features gets boring, so I probably won't be doing more scraper projects in the future.

Reference

See also: