Algorithm reference
Collaborative filtering, the matrix prediction problem, the C5 predictor, training, notation, and cohort's adaptation.
What is collaborative filtering?
Collaborative filtering is how a system predicts what a user will do by pooling observations from other users who behaved similarly. The recommendation isn't computed from features of the user or the item. It's computed from the pattern of overlap across many users and many items.
Netflix is the canonical example. To predict whether you'll like Inception, the system doesn't read the script. It notices that people who rated the movies you rated highly also rated Inception highly. The same logic works in reverse. Items that get rated by the same set of users start to look "similar," whether or not they have anything obvious in common.
The family of techniques splits into a few approaches.
- User-based: find users similar to you, recommend what they liked.
- Item-based: find items similar to what you liked, recommend those.
- Matrix factorization: decompose the user-item matrix into low-dimensional latent factors that explain the observed entries.
- Co-clustering: group users and items simultaneously into clusters, then predict each cell from the cluster pair.
Further reading: Wikipedia on Collaborative filtering and the Netflix Prize, which sparked a decade of CF research.
The matrix prediction problem
Picture a table. Rows are users, columns are items. Cells are observed interactions, like a rating, a purchase, or a download count. Most cells are empty because most users haven't touched most items.
The task is to fill in the empty cells from the patterns in the filled ones.
A worked example
Imagine a small movie ratings table. Four users have rated some movies on a 1 to 5 scale. The ? cells are unrated.
| movie 1 | movie 2 | movie 3 | movie 4 | |
|---|---|---|---|---|
| alice | 5 | ? | 3 | 4 |
| bob | ? | 4 | 5 | ? |
| carol | 4 | 5 | ? | 3 |
| dave | 3 | ? | 4 | ? |
For four users and four movies you could fill in the table by eye. For Netflix's scale you need an algorithm. That algorithm is some form of collaborative filtering.
Our matrix
Cohort's variant has podcast shows as rows and listening apps as columns. Cells are total downloads observed for that show through that app. Numbers here are illustrative.
| Apple Podcasts | Spotify | Overcast | Pocket Casts | |
|---|---|---|---|---|
| show A | 5,200 | 3,800 | ? | 350 |
| show B | ? | 8,400 | 200 | ? |
| show C | 1,100 | 900 | 1,500 | ? |
| show D | 12,000 | 10,500 | ? | 4,200 |
G&M's C5
Cohort uses George & Merugu's C5 co-clustering basis (ICDM 2005). The predictor for a cell at row \(u\) and column \(v\) is:
\[\hat{Z}_{u,v} = \mu_{g,h} + \bar{Z}_{u} + \bar{Z}_{v} - \bar{Z}_{g} - \bar{Z}_{h}\]
where \(g\) is the cluster show \(u\) belongs to and \(h\) is the cluster app \(v\) belongs to. Each cell is predicted as a cluster-pair signal plus per-row and per-column bias correction.
- Block mean : the signal C5 actually fits. A single number per (row cluster, col cluster) pair.
- Bias terms : calibrate predictions to each show's and each app's individual scale.
- Cluster averages : cancel the parts already counted by the block mean and the individual biases.
Notation
Each symbol in the formula, in left-to-right order.
| \(\hat{Z}_{u,v}\) | predicted value for cell (u, v) |
| \(\mu_{g,h}\) | average for cluster pair (g, h) |
| \(\bar{Z}_u\) | show u's average downloads |
| \(\bar{Z}_v\) | app v's average downloads |
| \(\bar{Z}_g\) | average across row cluster g |
| \(\bar{Z}_h\) | average across col cluster h |
What was new about C5
In 2005, the leading approach to large-scale collaborative filtering was matrix factorization. The idea: treat the giant sparse user-item table as the product of two much smaller tables. One describes each user as a short list of "factors" (taste dimensions, roughly). The other describes each item the same way. Multiplying the user's factors by the item's factors gives a predicted rating. Accurate on benchmarks, but with two practical problems.
C5 attacked both problems at once.
Training loop
Alternating minimization. The basic loop.
- Initialize. Assign each row and each column to a random cluster.
- Compute the block means, row means, col means, and cluster averages from current assignments.
- For each row, reassign it to the cluster that minimizes squared error against the predicted values.
- For each column, do the same.
- Repeat until cluster assignments stop changing.
Our adaptation
Cohort applies C5 to a privacy-preserving aggregate matrix sourced from OP3, an open podcast analytics service.
- Matrix shape: 2,414 podcast shows (rows) by 142 listening apps and devices (columns).
- Cell values: total download counts via that app.
- Cluster counts:
k=15row clusters andl=16column clusters. - Cluster-count sensitivity: changes held-out MAE by ~6% across the swept grid.
- Algorithm sensitivity: changes MAE by ~36% .
- Log transform: counts are log-transformed before fitting. Download distributions are heavy-tailed and log space avoids letting a few high-volume shows dominate the objective. All accuracy numbers on the site are in log space.
What cohort recommends
Row clusters track app-distribution patterns rather than topic. So cohort's neighbors aren't "shows similar in subject matter." They are shows whose downloads come through similar app mixes at similar scale.
It's not a topic-similarity engine. A true-crime podcast won't necessarily get other true-crime podcasts as neighbors. It'll get other podcasts downloaded the same way, through the same app mix, at similar scale. We confirmed this directly with ARI comparisons.
- Against topic : ARI 0.016. Random-level.
- Against popularity bands : ARI 0.215. Real signal.
Stretch ideas
Directions cohort could take if extended past the course project.
-
Schedule-aware recommendation. Podcast Index publishes an
updateFrequencyfield on shows. If you stack a listener's neighbors by their post day-of-week, you could surface a recommended schedule. Shows that fill different days so the listener has fresh content each day with minimal posting overlap. - Overlap audit. Same data, different question. Where in the week do recommendation neighbors all pile up at once? Useful for show creators picking a release day that competes with fewer audience-overlapping shows.
- External validation of column clusters. Pull app metadata (release year, business model, platform) and check whether the data-driven groupings correlate with external categorizations. Would put a number on the convergence-with-practitioner-categories claim.
- Alternative co-clustering bases. G&M's paper introduces C2 and C6 alongside C5, each with different bias-correction structure. I-divergence loss as an alternative to squared error.
- Per-listener data. OP3 is privacy-preserving by design and only exposes aggregates. With user-level histories one could shift from "show audiences" to per-listener recommendation. Out of scope for this dataset.
References
- George & Merugu. A Scalable Collaborative Filtering Framework Based on Co-Clustering. ICDM 2005. IEEE Xplore
- OP3, the open podcast analytics service used as the data source.
- Wikipedia: Collaborative filtering
Also see: analysis · takeaways · slide deck · the paper