Movie Rails: Building a Streaming Recommendation Engine from Scratch

How my team clustered 8,000 films into streaming categories using UMAP, HDBSCAN, and a whole lot of hyperparameter sweeps.

Movie Rails capstone project cover art

If you've ever scrolled through Netflix and thought "who decides what goes in these categories?" then congratulations, that's basically what my capstone team set out to answer. Except instead of a room full of content editors, we built an automated pipeline that clusters 8,000 films into streaming "rails" using nothing but content features.

This was my MSBA Capstone Project at Chapman, built alongside Jordan Ehrman and Manuel Lara. We called it Project_Hollywood, and honestly, it turned into one of the most technically satisfying things I've worked on. It also turned out to be one of the biggest dragons I've ever had to fight.


The Problem

Streaming platforms need to organize thousands of titles into browsable categories. Those horizontal rows you scroll through. Traditionally, this is done manually or with basic genre tags. But genres are broad. "Action" contains everything from John Wick to Indiana Jones, and lumping them together doesn't tell you much about what a viewer actually wants.

That's the dragon: 8,000 films, 284 dimensions of content features, and no obvious way to carve them into categories that actually mean something. No watch history, no user data, just the films themselves. Three students versus a problem that streaming companies throw entire teams at.

We picked up our swords anyway.


The Data

We started with three datasets: a 248-feature "genome" matrix covering 8,000 films (think relevance scores for things like "heist," "ensemble cast," "plot twists"), a feature taxonomy, and genre labels. On top of that, we pulled poster URLs, decade data, language, and content ratings from the OMDb and TMDB APIs to enrich the dashboard.

The genome matrix is the core. Each film gets a score from 0 to 3 across 248 content features. It's like a fingerprint for every movie.

One fun discovery during data cleaning: "The Hero's Journey" appears twice in the data, coded as features 177 and 440. They correlate at only 60%, which suggests they're actually describing different things about a film. We kept both but wanted to flag it for anyone trying to reproduce our work.
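A sanity check like the following is how we'd flag a near-duplicate feature pair. The genome matrix here is a synthetic stand-in (two noisy copies of one latent signal), since the real data isn't reproduced in this post:

```python
import numpy as np

def duplicate_feature_corr(genome, col_a, col_b):
    """Pearson correlation between two feature columns that share a name."""
    return np.corrcoef(genome[:, col_a], genome[:, col_b])[0, 1]

# Synthetic stand-in: two noisy measurements of the same latent quality
rng = np.random.default_rng(0)
latent = rng.uniform(0, 3, size=500)
genome = np.column_stack([
    latent + rng.normal(0, 0.8, 500),   # e.g. "The Hero's Journey" (copy 1)
    latent + rng.normal(0, 0.8, 500),   # e.g. "The Hero's Journey" (copy 2)
])
r = duplicate_feature_corr(genome, 0, 1)
# A correlation well below ~0.9 suggests the columns carry different
# information, which is the argument for keeping both.
```

If the correlation had come back near 1.0, dropping one column would have been the safer call.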


Preprocessing: Z-Scores, IDF, and the Order That Matters

Before you can fight a dragon, you need the right weapons. Ours were z-scores and IDF weights.

We had 248 continuous genome features (0 to 3 range) and 36 binary features (23 genres + 13 decade buckets). Thrown raw into a distance calculation, the genome features dominate the binary ones by roughly 3x. So we z-scored the genome columns to put everything on a common scale.

But we also wanted to upweight rare, discriminating features. A feature like "heist" that only applies to a handful of films should matter more than "dialogue," which applies to basically everything. That's where IDF (inverse document frequency) comes in.

Here's the critical detail: IDF has to be computed before z-scoring. Z-scoring shifts means to zero, which destroys the sparsity pattern that IDF relies on. So the pipeline is:

  1. Compute IDF weights from the raw genome matrix
  2. Z-score the genome features
  3. Multiply the z-scored features by the IDF weights
  4. Concatenate with the binary genre/decade columns

The binary genre/decade columns skip the IDF step, because applying IDF to a binary column just multiplies it by a constant, leaving inter-film distances effectively unchanged.

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import StandardScaler

# 1. IDF weights from the raw genome matrix (before z-scoring)
idf = TfidfTransformer(use_idf=True, smooth_idf=True)
idf.fit(X_genome_raw)
idf_weights = idf.idf_

# 2-3. Z-score the genome features, then apply the IDF weights
X_genome_z = StandardScaler().fit_transform(X_genome_raw)
X_genome_weighted = X_genome_z * idf_weights

# 4. Concatenate with the binary genre/decade columns
X_weighted = np.hstack([X_genome_weighted, X_genre, X_decade])

This gave us a 284-dimensional feature space with IDF weights ranging from 1.228 to 4.331 (mean 2.675). Weapons forged. Time to face the beast.


Dimensionality Reduction with UMAP

UMAP dimensionality reduction visualization

284 dimensions is far too many for density-based clustering to work well. You can't fight a dragon in a room you can barely see, and everything looks distant and vague in 284 dimensions. We needed to compress the feature space while preserving neighborhood structure, so that similar films stay close together.

We chose UMAP (Uniform Manifold Approximation and Projection) and ran a 105-combination grid sweep across n_components, n_neighbors, and min_dist. The winning configuration:

import umap

# Compress 284 dimensions down to 20 while preserving local neighborhoods
reducer = umap.UMAP(
    n_components=20, n_neighbors=60,
    min_dist=0.1, metric='correlation',
    random_state=42
)
embedding = reducer.fit_transform(X_weighted)

Now we could see the dragon clearly. 20 dimensions, neighborhood structure intact, ready for the real fight.


Clustering with HDBSCAN

This is where the magic happens. We chose HDBSCAN over K-Means (which forces you to pick k upfront and shoves every film into a cluster) and DBSCAN (which uses a single global epsilon that can't handle varying content densities).

HDBSCAN discovers the number of clusters automatically from data density. We ran a 660-combination sweep over min_cluster_size, min_samples, epsilon, and selection method. The winning config:

import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15, min_samples=3,
    cluster_selection_epsilon=0.10,
    cluster_selection_method='eom',
    prediction_data=True
)
cluster_labels = clusterer.fit_predict(embedding)

This produced 67 parent clusters with a silhouette score of 0.503 and DBCV of 0.259. A direct hit. But about 29.2% of films initially got labelled as noise, scattered scales that didn't stick to any cluster. We had a plan for those.

Validating with XGBoost (and Recovering Outliers)

HDBSCAN discovers clusters, but how do you know they're real? We needed an independent test: can a classifier reproduce the cluster labels from the original features alone?

Enter XGBoost. We trained it on the original IDF-weighted features (not the UMAP embedding) to predict cluster labels. If it can do this with >90% accuracy, the clusters encode real structure in the original feature space; they're not just artifacts of UMAP's projection.

XGBoost was ideal for this because its tree-based splits are scale invariant, it captures multi-feature interactions through tree depth (depth 6 encodes up to 6-way interactions), and its softmax objective produces calibrated per-class probabilities. Our professor Jonathan Hersh had also suggested gradient-boosted trees for capstone pipelines, so the choice fit naturally.

5-fold stratified cross-validation confirmed accuracy above 90%. And here's the bonus: those ~29% of films HDBSCAN labelled as noise? We fed them through the trained XGBoost model, and films with confidence ≥50% became candidates for their most probable rail. Think of it as going back to the battlefield and recovering the soldiers that got knocked away from their regiment. Most of the noise films found a home.

import xgboost as xgb
from sklearn.model_selection import cross_val_score

xgb_clf = xgb.XGBClassifier(
    n_estimators=200, max_depth=6,
    learning_rate=0.1, subsample=0.8,
    colsample_bytree=0.8, random_state=42,
    eval_metric='mlogloss', verbosity=0
)
# 5-fold stratified CV on the IDF-weighted features
cv_scores = cross_val_score(
    xgb_clf, X_clustered, y_encoded,
    cv=5, scoring='accuracy'
)

Making It Interpretable with SHAP

XGBoost tells us a cluster is real. SHAP tells us why.

For each prediction, SHAP decomposes the output into per-feature contributions. We used TreeExplainer, which computes exact Shapley values (not approximations) by exploiting tree structure. Averaging absolute SHAP values per cluster gives a ranked profile of what makes each rail distinctive.

This replaced an earlier approach where we computed the delta between each cluster's centroid and the global mean. That sounds reasonable, but it missed interaction effects entirely. If "heist" and "ensemble cast" together define a cluster but neither alone is unusual globally, centroid delta misses both. SHAP captures them through XGBoost's tree paths.

The top 5 to 8 features per cluster become "feature badges" in the dashboard. A "Heist Thriller" rail might show badges for heist, ensemble cast, plot twists, and suspense. Every scale on the dragon, catalogued and labelled.
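Once TreeExplainer has produced a SHAP value array, the badge-building step is just an aggregation. A sketch with synthetic SHAP values (the feature names and cluster structure here are illustrative):

```python
import numpy as np

def feature_badges(shap_values, labels, feature_names, top_k=5):
    """Rank features per cluster by mean absolute SHAP contribution."""
    badges = {}
    for c in np.unique(labels):
        mean_abs = np.abs(shap_values[labels == c]).mean(axis=0)
        top = np.argsort(mean_abs)[::-1][:top_k]
        badges[c] = [feature_names[i] for i in top]
    return badges

# Synthetic SHAP values: each cluster dominated by one feature
rng = np.random.default_rng(1)
names = ["heist", "ensemble cast", "plot twists", "suspense", "dialogue"]
sv = rng.normal(0, 0.05, (200, 5))
labels = np.array([0] * 100 + [1] * 100)
sv[:100, 0] += 1.0   # "heist" drives cluster 0's predictions
sv[100:, 3] += 1.0   # "suspense" drives cluster 1's predictions
badges = feature_badges(sv, labels, names, top_k=2)
# badges[0][0] == "heist", badges[1][0] == "suspense"
```

The dashboard then renders each cluster's top-ranked names as its badge row.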


Naming Rails with an LLM

Each rail needed a human-readable name. We took a movies-first approach: identify the 5 films nearest each cluster centroid, summarize the genre and decade distributions, include the top SHAP features, and send it all to Ollama (llama3.2:3b) running locally.

The prompt asks for a 2-to-5-word name suitable for a streaming service rail. The LLM has enough pre-trained knowledge of movie titles to generate surprisingly good names. It's not perfect: names vary across runs at temperature 0.3, and there's no formal evaluation of naming quality. But it gets us 80% of the way there, and the curator dashboard lets humans fix the rest.
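The naming call itself is a small prompt-plus-POST. A sketch against Ollama's local HTTP API (the payload shape follows Ollama's /api/generate endpoint; the prompt wording and example cluster summary are approximations, not our exact prompt):

```python
import json
import urllib.request

def build_rail_prompt(films, genres, decades, shap_features):
    """Summarize one cluster for the LLM naming call."""
    return (
        "You name categories for a movie streaming service.\n"
        f"Representative films: {', '.join(films)}\n"
        f"Genre mix: {genres}\nDecade mix: {decades}\n"
        f"Distinctive features: {', '.join(shap_features)}\n"
        "Reply with only a 2-to-5-word rail name."
    )

def name_rail(prompt, model="llama3.2:3b", temperature=0.3):
    """POST to a local Ollama server (assumes `ollama serve` is running)."""
    payload = json.dumps({
        "model": model, "prompt": prompt, "stream": False,
        "options": {"temperature": temperature},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

# Illustrative cluster summary for a heist-flavored rail
prompt = build_rail_prompt(
    ["Heat", "Ocean's Eleven", "Inside Man"],
    {"Crime": 0.8, "Thriller": 0.6}, {"2000s": 0.5, "1990s": 0.3},
    ["heist", "ensemble cast", "plot twists"],
)
```

Keeping the model local means the whole naming pass runs offline, with no per-call cost.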


The Curator Dashboard

We built a Streamlit dashboard that serves as the human-in-the-loop layer. Curators can browse every rail, see member films with poster art, review feature badges, rename rails, disable low-quality clusters, and manually move films between categories. This was critical. No automated system gets everything right, and giving content teams final control was a core design goal. Even after you slay the dragon, someone still has to sort through the treasure.


What I'd Do Differently

No dragon fight is clean, and this one left plenty of scars worth learning from.


The Takeaway

Despite the limitations, this project showed that genome based clustering can produce interpretable, granular streaming rails without any audience data. XGBoost validates coherence, SHAP provides interpretable feature profiles, and the dashboard gives content teams the final say.

The full pipeline, from raw genome scores to named, curated rails with feature badges and poster art, runs end to end. With viewing data, stability analysis, and automated merge logic, this framework could evolve into something production ready.

It was also genuinely one of the most fun projects I've worked on. There's something deeply satisfying about watching an algorithm discover that Kill Bill and Slumdog Millionaire don't belong in the same category, and having the data to explain exactly why. That's the moment you know the dragon is really dead.

Built with Jordan Ehrman and Manuel Lara as part of the MSBA Capstone at Chapman University. If you're interested in the technical details, the full documentation covers every hyperparameter, code snippet, and diagnostic plot.