This project started back when I was in university in 2019. I wanted to review Python to prepare for my class, Introduction to Mathematical Software (MATH 157), which used Python extensively. I figured the best way to learn was to work on something I was genuinely interested in so I'd stay motivated.
I decided to build a scraper for competition results from iwf.org (which later became iwf.sport). At the time, I knew absolutely nothing about web scraping, so I bought a Udemy course on Scrapy to get started.
While working through the course, I created my first version: wl_res [8]. It was basically a Python script using Scrapy and a headless browser driven by Selenium. But instead of scraping text from the pages with a tool like BeautifulSoup, I used Selenium to open a list of pages for each weightlifting event and automatically click the "download CSV" button to get the results.
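In rough outline, the click-to-download part looked something like the sketch below; the event URLs and the button selector are hypothetical reconstructions, not the original wl_res code:

```python
# A minimal sketch of the Selenium click-to-download approach.
# The event URLs and the CSS selector are hypothetical.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
# Send the CSV downloads to a known folder instead of the browser default.
options.add_experimental_option(
    "prefs", {"download.default_directory": "/tmp/iwf_results"}
)

driver = webdriver.Chrome(options=options)

event_urls = [
    "https://iwf.sport/results/results-by-events/?event_id=1",  # hypothetical
    "https://iwf.sport/results/results-by-events/?event_id=2",  # hypothetical
]

for url in event_urls:
    driver.get(url)
    # Click whatever element triggers the CSV export on that page.
    driver.find_element(By.CSS_SELECTOR, "a.download-csv").click()
    time.sleep(5)  # pace the requests instead of hammering the site

driver.quit()
```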
During testing, I got a bit overzealous and ran the script to download all the event files at once. This meant opening 100+ event result pages simultaneously and clicking all the download buttons. I accidentally took down the entire website for 15 minutes. Needless to say, I never ran that script again after getting the files I needed. (You can find those CSV files in my iwf_clean_format_results [9] repository, where I wrote additional scripts to clean and format the data for better viewing.)
After university, I enrolled in a full-stack software engineering bootcamp. I knew this weightlifting project would be my final capstone. I wrote a new scraper API using Ruby and Nokogiri, and even created a Ruby gem [3]. Looking back, this was probably not the best choice since Nokogiri was already outdated; I should've used Python and BeautifulSoup instead.
I built the backend in Ruby on Rails [7] and the frontend in React.js [5][6]. I'm not proud of how this version turned out. Instead of creating a simple website that displayed weightlifting results with basic functionality (login, authentication, basic CRUD operations), I got distracted and spent way too much time trying to learn TailwindCSS and D3.js. (I was really excited about D3.js at the time after reading Shirley Wu's blogs and her Film Flowers project.) Looking back at this project now, they probably shouldn't have let me graduate!
After bootcamp, I wanted to do more projects in Python. Rather than starting something new, I decided to revisit the Ruby scraper API (iwf_ruby) and rewrite it in Python. Before diving in, I made sure there weren't any similar projects for scraping IWF data and looked at other Python scraper APIs for inspiration.
I found this project on GitHub, and I was impressed by its simple structure and code readability. So I decided to follow its architecture when building my API. (For the record, I've never actually used that API; I just liked their code structure!)
I built iwf_api with Python and BeautifulSoup4 [2], later packaging it and uploading it to PyPI. (I can't find it now; maybe I deleted it?)
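As a rough illustration of that approach (the URL and the selectors below are simplified guesses, not the real iwf.sport markup):

```python
# A minimal sketch of the requests + BeautifulSoup4 approach.
# The URL and CSS selectors are hypothetical stand-ins for the real markup.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://iwf.sport/results/results-by-events/?event_id=1")
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

results = []
for row in soup.select("div.result-row"):  # hypothetical selector
    results.append(
        {
            "name": row.select_one(".name").get_text(strip=True),
            "total": row.select_one(".total").get_text(strip=True),
        }
    )
print(results)
```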
This project was also the first time someone else made a pull request to one of my repos. (I believe they used this API in their own project until they refactored their backend from Python to Go.)
Afterwards, I built a full-stack application for "twler" (Top Weightlifter) using Django and React [1]. At the time, I was super excited about IPFS (InterPlanetary File System) because I'd discovered you could deploy websites with it. This version of twler had no traditional database. Instead, it fetched files stored in an IPFS node, which meant it took forever to load data for each event.
It also depended heavily on an unmaintained library called ipfshttpclient. I actually had to fork the repo and change the version number in the settings just to make it work [10]. I deployed twler to Pinata Cloud [1], but soon after, Pinata stopped supporting website deployments, which made this whole project a huge waste of time.
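For what it's worth, the IPFS data access itself was only a few lines. A minimal sketch, assuming a local IPFS daemon and a made-up CID:

```python
# A minimal sketch of reading event data from IPFS, assuming a local daemon.
# The CID and the JSON layout are hypothetical.
import json

import ipfshttpclient

# connect() also performs the daemon version check that forced me to fork
# the library and relax its version pin.
client = ipfshttpclient.connect("/dns/localhost/tcp/5001/http")

cid = "QmExampleExampleExampleExampleExampleExampleEx"  # hypothetical CID
raw = client.cat(cid)  # fetch the file's bytes through the node
events = json.loads(raw)

client.close()
```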
If I were to rebuild this project today, I'd use Django (it's great for data-intensive applications) with basic HTML and Jinja templating. No overengineering. I'd use PostgreSQL for the database and deploy serverlessly with Zappa (AWS Lambda), using its scheduling functions to periodically run the scraper and update the database.
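Here's a sketch of what that scheduling piece could look like; the module, function, and model names are placeholders, not code from any of the repos:

```python
# tasks.py -- hypothetical module a Zappa schedule would invoke.
# The corresponding zappa_settings.json entry would look something like:
#
#   "events": [
#       {"function": "tasks.refresh_results", "expression": "rate(1 day)"}
#   ]
#
from scraper import fetch_latest_events  # hypothetical scraper module
from results.models import Event         # hypothetical Django model


def refresh_results(event=None, context=None):
    """Scrape fresh IWF results and upsert them into PostgreSQL."""
    for data in fetch_latest_events():
        Event.objects.update_or_create(
            event_id=data["id"],
            defaults={"name": data["name"], "date": data["date"]},
        )
```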
As a side note, I also spent some time learning about TrueSkill [4] (Microsoft's ranking algorithm for matchmaking in games like Halo). I thought maybe I could use it to rank athletes based on their results instead of using the official Sinclair formula or Wilks coefficient. While the algorithm is fascinating, it's not really suitable for this use case since it's designed for team matches like 4v4 or 1v1. Something like chess's Elo rating system would probably be more appropriate.
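For example, a single Elo update is only a few lines. How to pair athletes up is an open question; treating every pair of lifters in an event as a head-to-head match decided by placement is just my assumption here, not an established method:

```python
# A minimal sketch of the standard Elo update, with the common K-factor of 32.

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head result."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1500-rated lifter beats a 1600-rated lifter.
print(elo_update(1500, 1600, a_won=True))  # roughly (1520.5, 1579.5)
```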