Algorithm reference

Collaborative filtering, the matrix prediction problem, the C5 predictor, training, notation, and cohort's adaptation.

What is collaborative filtering?

Collaborative filtering is how a system predicts what a user will do by pooling observations from other users who behaved similarly. The recommendation isn't computed from features of the user or the item. It's computed from the pattern of overlap across many users and many items.

Netflix is the canonical example. To predict whether you'll like Inception, the system doesn't read the script. It notices that people who rated the movies you rated highly also rated Inception highly. The same logic works in reverse. Items that get rated by the same set of users start to look "similar," whether or not they have anything obvious in common.

The family of techniques splits into a few approaches.

User-based: find users similar to you, recommend what they liked.
Item-based: find items similar to what you liked, recommend those.
Matrix factorization: decompose the user-item matrix into low-dimensional latent factors that explain the observed entries.
Co-clustering: group users and items simultaneously into clusters, then predict each cell from the cluster pair. This is what cohort does.

Further reading: Wikipedia on Collaborative filtering and the Netflix Prize, which sparked a decade of CF research.

The matrix prediction problem

Picture a table. Rows are users, columns are items. Cells are observed interactions, like a rating, a purchase, or a download count. Most cells are empty because most users haven't touched most items.

The task is to fill in the empty cells from the patterns in the filled ones.

Why this is hard. The rating in any one cell depends on three things at once. The user's taste, the item's character, and the interaction between them. The model has to learn all three from incomplete data, with no labels telling it which is which.

A worked example

Imagine a small movie ratings table. Four users have rated some movies on a 1 to 5 scale. The ? cells are unrated.

	movie 1	movie 2	movie 3	movie 4
alice	5	?	3	4
bob	?	4	5	?
carol	4	5	?	3
dave	3	?	4	?

How the model reads this. Alice and carol have overlapping tastes (high on movie 1, low on movie 3 vs movie 4). Carol rated movie 2 a 5. The model predicts alice would also rate movie 2 around 5. The same logic finds similar items. Movie 2 and movie 3 both got 5s from someone (carol, bob), so they cluster together. If you've liked movie 3, you'll probably like movie 2.

For four users and four movies you could fill in the table by eye. For Netflix's scale you need an algorithm. That algorithm is some form of collaborative filtering.

Our matrix

Cohort's variant has podcast shows as rows and listening apps as columns. Cells are total downloads observed for that show through that app. Numbers here are illustrative.

	Apple Podcasts	Spotify	Overcast	Pocket Casts
show A	5,200	3,800	?	350
show B	?	8,400	200	?
show C	1,100	900	1,500	?
show D	12,000	10,500	?	4,200

Audience shape is the row across this table. Which apps a show's listeners use, and at what scale. Two shows whose audiences shop the same way will have similar rows. Cohort uses this regularity to predict the unobserved cells, and to group both shows and apps into clusters that share an audience shape.

G&M's C5

Cohort uses George & Merugu's C5 co-clustering basis (ICDM 2005). The predictor for a cell at row \(u\) and column \(v\) is:

\[\hat{Z}_{u,v} = \mu_{g,h} + \bar{Z}_{u} + \bar{Z}_{v} - \bar{Z}_{g} - \bar{Z}_{h}\]

where \(g\) is the cluster show \(u\) belongs to and \(h\) is the cluster app \(v\) belongs to. Each cell is predicted as a cluster-pair signal plus per-row and per-column bias correction.

Block mean \(\mu_{g,h}\): the signal C5 actually fits. A single number per (row cluster, col cluster) pair.
Bias terms \(\bar{Z}_u\), \(\bar{Z}_v\): calibrate predictions to each show's and each app's individual scale.
Cluster averages \(\bar{Z}_g\), \(\bar{Z}_h\): cancel the parts already counted by the block mean and the individual biases.

Notation

Each symbol in the formula, in left-to-right order.

\(\hat{Z}_{u,v}\)	predicted value for cell (u, v)
\(\mu_{g,h}\)	average for cluster pair (g, h)
\(\bar{Z}_u\)	show u's average downloads
\(\bar{Z}_v\)	app v's average downloads
\(\bar{Z}_g\)	average across row cluster g
\(\bar{Z}_h\)	average across col cluster h

What was new about C5

In 2005, the leading approach to large-scale collaborative filtering was matrix factorization. The idea: treat the giant sparse user-item table as the product of two much smaller tables. One describes each user as a short list of "factors" (taste dimensions, roughly). The other describes each item the same way. Multiplying the user's factors by the item's factors gives a predicted rating. Accurate on benchmarks, but with two practical problems.

Cold start

When a new movie was added to the catalog, the system had no factors for it yet. Until the model retrained, the new movie could not be meaningfully recommended. Same with new users. Retraining took hours at production scale, so new content sat invisible.

Cost per request

Generating a recommendation meant multiplying two lists of numbers for every user-item pair under consideration. At Netflix scale, that adds up to noticeable latency.

C5 attacked both problems at once.

Predictions become a table lookup

Once trained, predicting any (user, item) cell needs five numbers. The cluster-pair signal plus four bias terms. No multiplications. Dramatically cheaper at serve time.

New items get a prediction on day one

C5 keeps a "new arrivals" group (the paper's term is a transitional cluster). New items go straight there and inherit the group's predicted behavior. No retrain needed. The moment a new podcast lands, it has an audience profile and can be recommended. Periodically the model fully retrains and the new arrivals find their permanent cluster home.

Key Result

The "new items on day one" contribution is what cohort relies on most directly. New podcasts publish constantly. The recommender needs to keep up without nightly downtime. The refresh pipeline uses this mechanism end to end.

Training loop

Alternating minimization. The basic loop.

Initialize. Assign each row and each column to a random cluster.
Compute the block means, row means, col means, and cluster averages from current assignments.
For each row, reassign it to the cluster that minimizes squared error against the predicted values.
For each column, do the same.
Repeat until cluster assignments stop changing.

Convergence. Typically under 20 iterations on our matrix. Multiple seeds run in parallel and the best objective is kept.

Our adaptation

Cohort applies C5 to a privacy-preserving aggregate matrix sourced from OP3, an open podcast analytics service.

Matrix shape: 2,414 podcast shows (rows) by 142 listening apps and devices (columns).
Cell values: total download counts via that app.
Cluster counts: k=15 row clusters and l=16 column clusters.
Cluster-count sensitivity: changes held-out MAE by ~6% across the swept grid.
Algorithm sensitivity: changes MAE by ~36% (C5 vs SVD).
Log transform: counts are log-transformed before fitting. Download distributions are heavy-tailed and log space avoids letting a few high-volume shows dominate the objective. All accuracy numbers on the site are in log space.

Key Result

The chosen (k, l) is not a uniquely best point on MAE. It does surface the walled-garden split inside the dominant 89% of downloads cleanly, with no measurable accuracy cost relative to neighbors on the sweep grid. See the k and l sweep for the full picture.

What cohort recommends

Row clusters track app-distribution patterns rather than topic. So cohort's neighbors aren't "shows similar in subject matter." They are shows whose downloads come through similar app mixes at similar scale.

Key Result

A cohort neighbor is another podcast whose downloads flow through the same app environments at similar scale. Useful for understanding your platform-reach competitive set, finding cross-promotion candidates with overlapping app reach, or mapping where your show sits in the listening ecosystem. Not useful for topic-similarity recommendations, and not direct evidence that the same humans listen to both shows.

It's not a topic-similarity engine. A true-crime podcast won't necessarily get other true-crime podcasts as neighbors. It'll get other podcasts downloaded the same way, through the same app mix, at similar scale. We confirmed this directly with ARI comparisons.

Against topic (Podcast Index categories): ARI 0.016. Random-level.
Against popularity bands (deciles by total downloads): ARI 0.215. Real signal.

Stretch ideas

Directions cohort could take if extended past the course project.

Schedule-aware recommendation. Podcast Index publishes an updateFrequency field on shows. If you stack a listener's neighbors by their post day-of-week, you could surface a recommended schedule. Shows that fill different days so the listener has fresh content each day with minimal posting overlap.
Overlap audit. Same data, different question. Where in the week do recommendation neighbors all pile up at once? Useful for show creators picking a release day that competes with fewer audience-overlapping shows.
External validation of column clusters. Pull app metadata (release year, business model, platform) and check whether the data-driven groupings correlate with external categorizations. Would put a number on the convergence-with-practitioner-categories claim.
Alternative co-clustering bases. G&M's paper introduces C2 and C6 alongside C5, each with different bias-correction structure. I-divergence loss as an alternative to squared error.
Per-listener data. OP3 is privacy-preserving by design and only exposes aggregates. With user-level histories one could shift from "show audiences" to per-listener recommendation. Out of scope for this dataset.

References

George & Merugu. A Scalable Collaborative Filtering Framework Based on Co-Clustering. ICDM 2005. IEEE Xplore
OP3, the open podcast analytics service used as the data source.
Wikipedia: Collaborative filtering

Also see: analysis · takeaways · slide deck · the paper