Algorithm reference

Collaborative filtering, the matrix prediction problem, the C5 predictor, training, notation, and cohort's adaptation.

What is collaborative filtering?

Collaborative filtering is how a system predicts what a user will do by pooling observations from other users who behaved similarly. The recommendation isn't computed from features of the user or the item. It's computed from the pattern of overlap across many users and many items.

Netflix is the canonical example. To predict whether you'll like Inception, the system doesn't read the script. It notices that people who rated the movies you rated highly also rated Inception highly. The same logic works in reverse. Items that get rated by the same set of users start to look "similar," whether or not they have anything obvious in common.

The family of techniques splits into a few approaches.

Further reading: Wikipedia on Collaborative filtering and the Netflix Prize, which sparked a decade of CF research.

The matrix prediction problem

Picture a table. Rows are users, columns are items. Cells are observed interactions, like a rating, a purchase, or a download count. Most cells are empty because most users haven't touched most items.

The task is to fill in the empty cells from the patterns in the filled ones.

Why this is hard. The rating in any one cell depends on three things at once. The user's taste, the item's character, and the interaction between them. The model has to learn all three from incomplete data, with no labels telling it which is which.

A worked example

Imagine a small movie ratings table. Four users have rated some movies on a 1 to 5 scale. The ? cells are unrated.

movie 1 movie 2 movie 3 movie 4
alice5?34
bob?45?
carol45?3
dave3?4?
How the model reads this. Alice and carol have overlapping tastes (high on movie 1, low on movie 3 vs movie 4). Carol rated movie 2 a 5. The model predicts alice would also rate movie 2 around 5. The same logic finds similar items. Movie 2 and movie 3 both got 5s from someone (carol, bob), so they cluster together. If you've liked movie 3, you'll probably like movie 2.

For four users and four movies you could fill in the table by eye. For Netflix's scale you need an algorithm. That algorithm is some form of collaborative filtering.

Our matrix

Cohort's variant has podcast shows as rows and listening apps as columns. Cells are total downloads observed for that show through that app. Numbers here are illustrative.

Apple Podcasts Spotify Overcast Pocket Casts
show A5,2003,800?350
show B?8,400200?
show C1,1009001,500?
show D12,00010,500?4,200
Audience shape is the row across this table. Which apps a show's listeners use, and at what scale. Two shows whose audiences shop the same way will have similar rows. Cohort uses this regularity to predict the unobserved cells, and to group both shows and apps into clusters that share an audience shape.

G&M's C5

Cohort uses George & Merugu's C5 co-clustering basis (ICDM 2005). The predictor for a cell at row \(u\) and column \(v\) is:

\[\hat{Z}_{u,v} = \mu_{g,h} + \bar{Z}_{u} + \bar{Z}_{v} - \bar{Z}_{g} - \bar{Z}_{h}\]

where \(g\) is the cluster show \(u\) belongs to and \(h\) is the cluster app \(v\) belongs to. Each cell is predicted as a cluster-pair signal plus per-row and per-column bias correction.

Notation

Each symbol in the formula, in left-to-right order.

\(\hat{Z}_{u,v}\)predicted value for cell (u, v)
\(\mu_{g,h}\)average for cluster pair (g, h)
\(\bar{Z}_u\)show u's average downloads
\(\bar{Z}_v\)app v's average downloads
\(\bar{Z}_g\)average across row cluster g
\(\bar{Z}_h\)average across col cluster h

What was new about C5

In 2005, the leading approach to large-scale collaborative filtering was matrix factorization. The idea: treat the giant sparse user-item table as the product of two much smaller tables. One describes each user as a short list of "factors" (taste dimensions, roughly). The other describes each item the same way. Multiplying the user's factors by the item's factors gives a predicted rating. Accurate on benchmarks, but with two practical problems.

Cold start
When a new movie was added to the catalog, the system had no factors for it yet. Until the model retrained, the new movie could not be meaningfully recommended. Same with new users. Retraining took hours at production scale, so new content sat invisible.
Cost per request
Generating a recommendation meant multiplying two lists of numbers for every user-item pair under consideration. At Netflix scale, that adds up to noticeable latency.

C5 attacked both problems at once.

Predictions become a table lookup
Once trained, predicting any (user, item) cell needs five numbers. The cluster-pair signal plus four bias terms. No multiplications. Dramatically cheaper at serve time.
New items get a prediction on day one
C5 keeps a "new arrivals" group (the paper's term is a transitional cluster). New items go straight there and inherit the group's predicted behavior. No retrain needed. The moment a new podcast lands, it has an audience profile and can be recommended. Periodically the model fully retrains and the new arrivals find their permanent cluster home.
Key Result
The "new items on day one" contribution is what cohort relies on most directly. New podcasts publish constantly. The recommender needs to keep up without nightly downtime. The refresh pipeline uses this mechanism end to end.

Training loop

Alternating minimization. The basic loop.

  1. Initialize. Assign each row and each column to a random cluster.
  2. Compute the block means, row means, col means, and cluster averages from current assignments.
  3. For each row, reassign it to the cluster that minimizes squared error against the predicted values.
  4. For each column, do the same.
  5. Repeat until cluster assignments stop changing.
Convergence. Typically under 20 iterations on our matrix. Multiple seeds run in parallel and the best objective is kept.

Our adaptation

Cohort applies C5 to a privacy-preserving aggregate matrix sourced from OP3, an open podcast analytics service.

Key Result
The chosen (k, l) is not a uniquely best point on MAE. It does surface the walled-garden split inside the dominant 89% of downloads cleanly, with no measurable accuracy cost relative to neighbors on the sweep grid. See the k and l sweep for the full picture.

What cohort recommends

Row clusters track app-distribution patterns rather than topic. So cohort's neighbors aren't "shows similar in subject matter." They are shows whose downloads come through similar app mixes at similar scale.

Key Result
A cohort neighbor is another podcast whose downloads flow through the same app environments at similar scale. Useful for understanding your platform-reach competitive set, finding cross-promotion candidates with overlapping app reach, or mapping where your show sits in the listening ecosystem. Not useful for topic-similarity recommendations, and not direct evidence that the same humans listen to both shows.

It's not a topic-similarity engine. A true-crime podcast won't necessarily get other true-crime podcasts as neighbors. It'll get other podcasts downloaded the same way, through the same app mix, at similar scale. We confirmed this directly with ARI comparisons.

Stretch ideas

Directions cohort could take if extended past the course project.

References

Also see: analysis · takeaways · slide deck · the paper