Movie Rails: Building a Streaming Recommendation Engine from Scratch

How my team clustered 8,000 films into streaming categories using UMAP, HDBSCAN, and a whole lot of hyperparameter sweeps.

Movie Rails capstone project cover art

If you've ever scrolled through Netflix and thought "who decides what goes in these categories?" then congratulations, that's basically what my capstone team set out to answer. Except instead of a room full of content editors, we built an automated pipeline that clusters 8,000 films into streaming "rails" using nothing but content features.

This was my MSBA Capstone Project at Chapman, built alongside Jordan Ehrman and Manuel Lara. We called it Project_Hollywood, and honestly, it turned into one of the most technically satisfying things I've worked on. It also turned out to be one of the biggest dragons I've ever had to fight.


The Problem

Streaming platforms need to organize thousands of titles into browsable categories. Those horizontal rows you scroll through. Traditionally, this is done manually or with basic genre tags. But genres are broad. "Action" contains everything from John Wick to Indiana Jones, and lumping them together doesn't tell you much about what a viewer actually wants.

That's the dragon: 8,000 films, 284 dimensions of content features, and no obvious way to carve them into categories that actually mean something. No watch history, no user data, just the films themselves. Three students versus a problem that streaming companies throw entire teams at.

We picked up our swords anyway.


The Data

We started with three datasets: a 248-feature "genome" matrix covering 8,000 films (think relevance scores for things like "heist," "ensemble cast," "plot twists"), a feature taxonomy, and genre labels. On top of that, we pulled poster URLs, decade data, language, and content ratings from the OMDb and TMDB APIs to enrich the dashboard.

The genome matrix is the core. Each film gets a score from 0 to 3 across 248 content features. It's like a fingerprint for every movie.

One fun discovery during data cleaning: "The Hero's Journey" appears twice in the data, coded as features 177 and 440. They correlate at only 60%, which suggests they're actually describing different things about a film. We kept both but wanted to flag it for anyone trying to reproduce our work.
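A sanity check like the following is how we'd flag a near-duplicate feature pair. The genome matrix here is a synthetic stand-in (two noisy copies of one latent signal), since the real data isn't reproduced in this post:

```python
import numpy as np

def duplicate_feature_corr(genome, col_a, col_b):
    """Pearson correlation between two feature columns that share a name."""
    return np.corrcoef(genome[:, col_a], genome[:, col_b])[0, 1]

# Synthetic stand-in: two noisy measurements of the same latent quality
rng = np.random.default_rng(0)
latent = rng.uniform(0, 3, size=500)
genome = np.column_stack([
    latent + rng.normal(0, 0.8, 500),   # e.g. "The Hero's Journey" (copy 1)
    latent + rng.normal(0, 0.8, 500),   # e.g. "The Hero's Journey" (copy 2)
])
r = duplicate_feature_corr(genome, 0, 1)
# A correlation well below ~0.9 suggests the columns carry different
# information, which is the argument for keeping both.
```

If the correlation had come back near 1.0, dropping one column would have been the safer call.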


Preprocessing: Z-Scores, IDF, and the Order That Matters

Before you can fight a dragon, you need the right weapons. Ours were z-scores and IDF weights.

We had 248 continuous genome features (0 to 3 range) and 36 binary features (23 genres + 13 decade buckets). Thrown raw into a distance calculation, the genome features dominate the binary ones by roughly 3x. So we z-scored the genome columns to put everything on a common scale.

But we also wanted to upweight rare, discriminating features. A feature like "heist" that only applies to a handful of films should matter more than "dialogue," which applies to basically everything. That's where IDF (inverse document frequency) comes in.

Here's the critical detail: IDF has to be computed before z-scoring. Z-scoring shifts means to zero, which destroys the sparsity pattern that IDF relies on. So the pipeline is:

  1. Compute IDF weights from the raw genome matrix
  2. Z-score the genome features
  3. Multiply the z-scored features by the IDF weights
  4. Concatenate with the binary genre/decade columns

The binary genre/decade columns skip the IDF step, because applying IDF to a binary column just multiplies it by a constant, leaving inter-film distances effectively unchanged.

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import StandardScaler

# 1. IDF weights from the raw genome matrix (before z-scoring)
idf = TfidfTransformer(use_idf=True, smooth_idf=True)
idf.fit(X_genome_raw)
idf_weights = idf.idf_

# 2-3. Z-score the genome features, then apply the IDF weights
X_genome_z = StandardScaler().fit_transform(X_genome_raw)
X_genome_weighted = X_genome_z * idf_weights

# 4. Concatenate with the binary genre/decade columns
X_weighted = np.hstack([X_genome_weighted, X_genre, X_decade])

This gave us a 284-dimensional feature space with IDF weights ranging from 1.228 to 4.331 (mean 2.675). Weapons forged. Time to face the beast.


Dimensionality Reduction with UMAP

UMAP dimensionality reduction visualization

284 dimensions is far too many for density-based clustering to work well. You can't fight a dragon in a room you can barely see, and everything looks distant and vague in 284 dimensions. We needed to compress the feature space while preserving neighborhood structure, so that similar films stay close together.

We chose UMAP (Uniform Manifold Approximation and Projection) and ran a 105-combination grid sweep across n_components, n_neighbors, and min_dist. The winning configuration:

import umap

# Compress 284 dimensions down to 20 while preserving local neighborhoods
reducer = umap.UMAP(
    n_components=20, n_neighbors=60,
    min_dist=0.1, metric='correlation',
    random_state=42
)
embedding = reducer.fit_transform(X_weighted)

Now we could see the dragon clearly. 20 dimensions, neighborhood structure intact, ready for the real fight.


Clustering with HDBSCAN

This is where the magic happens. We chose HDBSCAN over K-Means (which forces you to pick k upfront and shoves every film into a cluster) and DBSCAN (which uses a single global epsilon that can't handle varying content densities).

HDBSCAN discovers the number of clusters automatically from data density. We ran a 660-combination sweep over min_cluster_size, min_samples, epsilon, and selection method. The winning config:

import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15, min_samples=3,
    cluster_selection_epsilon=0.10,
    cluster_selection_method='eom',
    prediction_data=True
)
cluster_labels = clusterer.fit_predict(embedding)

This produced 67 parent clusters with a silhouette score of 0.503 and DBCV of 0.259. A direct hit. But about 29.2% of films initially got labelled as noise, scattered scales that didn't stick to any cluster. We had a plan for those.

Validating with XGBoost (and Recovering Outliers)

HDBSCAN discovers clusters, but how do you know they're real? We needed an independent test: can a classifier reproduce the cluster labels from the original features alone?

Enter XGBoost. We trained it on the original IDF-weighted features (not the UMAP embedding) to predict cluster labels. If it can do this with >90% accuracy, the clusters encode real structure in the original feature space; they're not just artifacts of UMAP's projection.

XGBoost was ideal for this because its tree-based splits are scale invariant, it captures multi-feature interactions through tree depth (depth 6 encodes up to 6-way interactions), and its softmax objective produces calibrated per-class probabilities. Our professor Jonathan Hersh had also suggested gradient-boosted trees for capstone pipelines, so the choice fit naturally.

5-fold stratified cross-validation confirmed accuracy above 90%. And here's the bonus: those ~29% of films HDBSCAN labelled as noise? We fed them through the trained XGBoost model, and films with confidence ≥50% became candidates for their most probable rail. Think of it as going back to the battlefield and recovering the soldiers that got knocked away from their regiment. Most of the noise films found a home.

import xgboost as xgb
from sklearn.model_selection import cross_val_score

xgb_clf = xgb.XGBClassifier(
    n_estimators=200, max_depth=6,
    learning_rate=0.1, subsample=0.8,
    colsample_bytree=0.8, random_state=42,
    eval_metric='mlogloss', verbosity=0
)
# 5-fold stratified CV on the IDF-weighted features
cv_scores = cross_val_score(
    xgb_clf, X_clustered, y_encoded,
    cv=5, scoring='accuracy'
)

Making It Interpretable with SHAP

XGBoost tells us a cluster is real. SHAP tells us why.

For each prediction, SHAP decomposes the output into per-feature contributions. We used TreeExplainer, which computes exact Shapley values (not approximations) by exploiting tree structure. Averaging absolute SHAP values per cluster gives a ranked profile of what makes each rail distinctive.

This replaced an earlier approach where we computed the delta between each cluster's centroid and the global mean. That sounds reasonable, but it missed interaction effects entirely. If "heist" and "ensemble cast" together define a cluster but neither alone is unusual globally, centroid delta misses both. SHAP captures them through XGBoost's tree paths.

The top 5 to 8 features per cluster become "feature badges" in the dashboard. A "Heist Thriller" rail might show badges for heist, ensemble cast, plot twists, and suspense. Every scale on the dragon, catalogued and labelled.
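Once TreeExplainer has produced a SHAP value array, the badge-building step is just an aggregation. A sketch with synthetic SHAP values (the feature names and cluster structure here are illustrative):

```python
import numpy as np

def feature_badges(shap_values, labels, feature_names, top_k=5):
    """Rank features per cluster by mean absolute SHAP contribution."""
    badges = {}
    for c in np.unique(labels):
        mean_abs = np.abs(shap_values[labels == c]).mean(axis=0)
        top = np.argsort(mean_abs)[::-1][:top_k]
        badges[c] = [feature_names[i] for i in top]
    return badges

# Synthetic SHAP values: each cluster dominated by one feature
rng = np.random.default_rng(1)
names = ["heist", "ensemble cast", "plot twists", "suspense", "dialogue"]
sv = rng.normal(0, 0.05, (200, 5))
labels = np.array([0] * 100 + [1] * 100)
sv[:100, 0] += 1.0   # "heist" drives cluster 0's predictions
sv[100:, 3] += 1.0   # "suspense" drives cluster 1's predictions
badges = feature_badges(sv, labels, names, top_k=2)
# badges[0][0] == "heist", badges[1][0] == "suspense"
```

The dashboard then renders each cluster's top-ranked names as its badge row.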


Naming Rails with an LLM

Each rail needed a human-readable name. We took a movies-first approach: identify the 5 films nearest each cluster centroid, summarize the genre and decade distributions, include the top SHAP features, and send it all to Ollama (llama3.2:3b) running locally.

The prompt asks for a 2-to-5-word name suitable for a streaming service rail. The LLM has enough pre-trained knowledge of movie titles to generate surprisingly good names. It's not perfect: names vary across runs at temperature 0.3, and there's no formal evaluation of naming quality. But it gets us 80% of the way there, and the curator dashboard lets humans fix the rest.
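The naming call itself is a small prompt-plus-POST. A sketch against Ollama's local HTTP API (the payload shape follows Ollama's /api/generate endpoint; the prompt wording and example cluster summary are approximations, not our exact prompt):

```python
import json
import urllib.request

def build_rail_prompt(films, genres, decades, shap_features):
    """Summarize one cluster for the LLM naming call."""
    return (
        "You name categories for a movie streaming service.\n"
        f"Representative films: {', '.join(films)}\n"
        f"Genre mix: {genres}\nDecade mix: {decades}\n"
        f"Distinctive features: {', '.join(shap_features)}\n"
        "Reply with only a 2-to-5-word rail name."
    )

def name_rail(prompt, model="llama3.2:3b", temperature=0.3):
    """POST to a local Ollama server (assumes `ollama serve` is running)."""
    payload = json.dumps({
        "model": model, "prompt": prompt, "stream": False,
        "options": {"temperature": temperature},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

# Illustrative cluster summary for a heist-flavored rail
prompt = build_rail_prompt(
    ["Heat", "Ocean's Eleven", "Inside Man"],
    {"Crime": 0.8, "Thriller": 0.6}, {"2000s": 0.5, "1990s": 0.3},
    ["heist", "ensemble cast", "plot twists"],
)
```

Keeping the model local means the whole naming pass runs offline, with no per-call cost.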


The Curator Dashboard

We built a Streamlit dashboard that serves as the human-in-the-loop layer. Curators can browse every rail, see member films with poster art, review feature badges, rename rails, disable low-quality clusters, and manually move films between categories. This was critical. No automated system gets everything right, and giving content teams final control was a core design goal. Even after you slay the dragon, someone still has to sort through the treasure.


What I'd Do Differently

No dragon fight is clean, and this one left plenty of scars worth learning from.


The Takeaway

Despite the limitations, this project showed that genome based clustering can produce interpretable, granular streaming rails without any audience data. XGBoost validates coherence, SHAP provides interpretable feature profiles, and the dashboard gives content teams the final say.

The full pipeline, from raw genome scores to named, curated rails with feature badges and poster art, runs end to end. With viewing data, stability analysis, and automated merge logic, this framework could evolve into something production ready.

It was also genuinely one of the most fun projects I've worked on. There's something deeply satisfying about watching an algorithm discover that Kill Bill and Slumdog Millionaire don't belong in the same category, and having the data to explain exactly why. That's the moment you know the dragon is really dead.

Built with Jordan Ehrman and Manuel Lara as part of the MSBA Capstone at Chapman University. If you're interested in the technical details, the full documentation covers every hyperparameter, code snippet, and diagnostic plot.