Moneyball Creator Marketing Part 2: Infrastructure instead of Research
Why I stopped trying to "just finish" and started building something that will actually scale.
Quick context if you're new to the project: I'm building a system called Moneyball Creator Marketing, a data-driven approach to identify undervalued YouTube creators, with the eventual ambition to apply similar methods to all social media networks. In my last post, I talked about scraping nearly a million videos from 3,121 gaming channels. I also mentioned I had 434 channels left to process because of some messy language encoding issues.
This post is about why I never finished those 434 channels. And why that's actually a good thing.
The OctoParse Problem
When I started this project, I used OctoParse to do my scraping. If you're not familiar, it's a no-code web scraping tool. All you do is point and click to tell the software what data to grab, and it handles the rest.
It worked! Kind of…
I got my 934,589 videos. But the process was slow, manual, and honestly extremely brittle. Every time YouTube changed something small, I had to go back and adjust my workflows. Those 434 international channels with weird URL encodings? OctoParse was choking on them. Sometimes a channel had a livestream running; other times every one of its posts was a Short. Between how intense the work was and how determined I was to win this battle, scraping more or less ruined my Thanksgiving and Christmas.
I could have kept grinding. Tweaked the settings, manually handled the edge cases, eventually finished the dataset. But I kept thinking: even if I finish this, then what? What part of this idea is the business?
What happens when I want data on a new game's creator ecosystem? Do I spend another month clicking through OctoParse? What if I need to refresh this data in six months—do I do this all over again?
The answer was obviously no. I needed actual infrastructure.
A Better Approach (with one caveat)
My interterm professor, Wang Jin, Ph.D., had talked with a colleague who used a Python library for a research project that was far more powerful than OctoParse: fully scriptable, and designed for exactly the kind of metadata extraction I wanted. He pointed me and my project partner, Matthew Becker, toward it. Once we got it working locally, we thought we were off to the races!
So we built the whole thing out on Google Cloud Platform using Cloud Functions, Google Storage, and BigQuery, since that's what we had used in our interterm class. Wrote the scripts, set up the pipeline, tested it thoroughly.
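The post doesn't name the library, but the error message quoted below matches yt-dlp, so here's a minimal sketch of what metadata-only channel extraction looks like under that assumption. The option choices and the field list are illustrative, not the project's actual configuration:

```python
# Hypothetical sketch -- yt-dlp is an assumption; the post never names
# the library it actually used.

def ydl_options() -> dict:
    """Options for metadata-only extraction: no media files are downloaded."""
    return {
        "extract_flat": "in_playlist",  # list videos without resolving each one
        "skip_download": True,          # metadata only
        "quiet": True,
    }

def flatten_entries(info: dict) -> list[dict]:
    """Reduce a channel/playlist info dict to the fields worth storing."""
    return [
        {"id": e.get("id"), "title": e.get("title"), "url": e.get("url")}
        for e in info.get("entries") or []
    ]

def scrape_channel(channel_url: str) -> list[dict]:
    """Fetch flat metadata for every video on a channel (makes network calls)."""
    from yt_dlp import YoutubeDL  # assumed library
    with YoutubeDL(ydl_options()) as ydl:
        return flatten_entries(ydl.extract_info(channel_url, download=False))
```

The appeal over a point-and-click tool is that a function like this can be dropped into a Cloud Function and run over thousands of channel URLs unattended.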
It worked beautifully.
Once.
Then we got this:
```
ERROR: [youtube] Sign in to confirm you're not a bot.
Use --cookies-from-browser or --cookies for authentication.
```
I really should have seen this coming considering the number of times my professors have mentioned VPNs and proxies to me during this project.
Our cloud server made too many requests from the same IP, so Google decided we were suspicious—probably because it was one of their own server IPs (lol).
Needless to say, our system seemed dead in the water.
The AWS Pivot
Turns out this is a very well-known problem. YouTube aggressively rate-limits and blocks IPs that look like bots (which, to be fair, we kind of were). The solution? IP rotation—cycling through different IP addresses so no single one gets flagged.
There's a Python library that does exactly this. Problem is, it only works on AWS.
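To illustrate the concept rather than that specific library (which the post doesn't name): IP rotation just means each outgoing request is assigned the next address from a pool, so no single IP accumulates enough traffic to get flagged. A toy sketch with placeholder proxy addresses:

```python
# Conceptual sketch of IP rotation: round-robin through a proxy pool so
# request volume is spread across addresses. The IPs are placeholders.
from itertools import cycle

class ProxyRotator:
    def __init__(self, proxies: list[str]):
        self._pool = cycle(proxies)  # endless round-robin iterator

    def next_proxy(self) -> str:
        """Return the proxy address to route the next request through."""
        return next(self._pool)

rotator = ProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
# Each request goes out through a different address in turn:
assigned = [rotator.next_proxy() for _ in range(4)]
# The first three requests use distinct proxies; the fourth wraps
# back around to the first address in the pool.
```

The AWS-only library we're adopting handles the hard part this sketch skips: actually provisioning those addresses on demand instead of assuming a fixed pool.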
So now we're migrating everything from GCP to AWS. But the project presentation for my interterm class is Thursday 😵. Is it annoying? Yes. But it's also the right call. Once this infrastructure is in place, I'll be able to:
- Scrape new market niches without manual intervention
- Refresh data periodically without getting blocked
- Scale up or down based on what I need
- Consistently reproduce my results and conduct actual market research
Why I'm Not Going Back for the 434
Like I said, I could still go back and finish those 434 channels with OctoParse. Grind it out, close the loop, have a "complete" dataset.
But that's not the point anymore.
The original 934,589 videos are my benchmark, my ground truth for proving that a system for analyzing social channels is possible and enormously valuable. And as the project has evolved, I've found more efficient methods for producing the same kinds of insights. I don't need those last 434 channels to make the dataset perfect. I need a system that lets me gather creator data reliably, repeatedly, and at scale.
That's what I'm building now.
What's Next
Once the AWS infrastructure is solid, I'll finally be able to gather data on new games and creator verticals without spending weeks in OctoParse. That's the real win here.
Sometimes the best way forward is to stop trying to win the current battle, roll with the punches, and build something better.
Following along with Moneyball Creator Marketing? Drop a comment if you've dealt with YouTube's bot detection—I'd love to hear what solutions worked for you.