Toxicity Detector with Twitch Chat

I learned about Surge AI's toxicity dataset because I started following their CEO's blog on ML a long time ago. (Surge AI is a competitor of Scale AI, in which Meta bought a 49% stake for $14.3 billion.) I wanted to use the dataset to label whether chat messages are toxic, and I discovered Vowpal Wabbit while browsing Hacker News.

Vowpal Wabbit is a fast, scalable machine learning library. It has been used for email spam filtering, in eHarmony's recommendation system, and at Tumblr. Mike Izbicki even taught it in North Korea, which is pretty wild.

I bought Yury Kashnitsky's MLCourse.ai, which covers various ML techniques and some Vowpal Wabbit.

While building the detector, I discovered that deciding whether a message is truly toxic (as opposed to just filtering profanity) is really difficult. For example, "this is fucking cool!" contains profanity but isn't toxic.
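
To make the gap between profanity filtering and toxicity labeling concrete, here is a rough Python sketch. It is not the code from this project: the word list, the function names, and the choice of -1/1 labels (the convention Vowpal Wabbit expects for binary classification with logistic loss) are all assumptions for illustration. It shows a naive filter flagging the example above, then the same message written as a Vowpal Wabbit text-format training line with a human-assigned "not toxic" label.

```python
import re

# Hypothetical one-word list, just for illustration; a real filter would be longer.
PROFANITY = {"fucking"}


def naive_profanity_flag(message: str) -> bool:
    """Flag a message if any token, stripped of punctuation, is in the word list."""
    tokens = (tok.strip(".,!?") for tok in message.lower().split())
    return any(tok in PROFANITY for tok in tokens)


def to_vw_line(message: str, is_toxic: bool) -> str:
    """Format one labeled message as a Vowpal Wabbit text-format example.

    Assumes binary labels of 1 (toxic) and -1 (not toxic).
    """
    label = 1 if is_toxic else -1
    # '|' and ':' have special meaning in VW's input format, so replace them.
    cleaned = re.sub(r"[|:]", " ", message.lower())
    tokens = cleaned.split()
    return f"{label} | {' '.join(tokens)}"


message = "this is fucking cool!"
print(naive_profanity_flag(message))        # the word filter flags it
print(to_vw_line(message, is_toxic=False))  # the human label says it is not toxic
```

The word filter only looks at individual tokens, while a model trained on lines like the second one at least has a chance to learn that the same word shows up in both toxic and friendly messages.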

The most interesting thing I learned from this project is that you can use Firebase's Firestore for free as long as you store and retrieve data in nested collections and documents. However, when lots of messages were coming in, some of them never got stored in Firestore. This made it unreliable as a chat archive, which was disappointing and kinda cringe.
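
For anyone unfamiliar with the nesting, Firestore paths alternate between collection and document segments, so a document path always has an even number of segments. Here is a small sketch of what a nested layout for chat messages could look like. The `channels` and `messages` collection names and the helper function are hypothetical, not the layout I actually used, and the code only builds the path string without calling Firebase.

```python
def message_doc_path(channel: str, message_id: str) -> str:
    """Build a nested Firestore-style document path for one chat message.

    Layout (hypothetical): channels/{channel}/messages/{message_id}
    """
    for part in (channel, message_id):
        # A path segment cannot be empty or contain the '/' separator.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return f"channels/{channel}/messages/{message_id}"


path = message_doc_path("somechannel", "msg001")
print(path)
# Document paths alternate collection/document, so the segment count is even.
print(len(path.split("/")) % 2 == 0)
```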