Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."

What will I learn

  • You will learn content-based filtering -- recommending items by comparing their features, using the cosine similarity we built in episode #8;
  • collaborative filtering -- the idea that user behavior patterns contain information no item feature can capture;
  • user-based vs item-based collaborative filtering, why the choice matters, and when each one wins;
  • matrix factorization with NMF and SVD -- decomposing a giant sparse rating matrix into latent factors, directly connected to the PCA concepts from episode #24;
  • the cold start problem and why every recommendation system eventually hits it;
  • hybrid approaches that combine content and collaborative signals;
  • evaluation metrics specific to recommendations: precision@K, recall@K, NDCG;
  • implicit feedback -- what happens when you don't have star ratings, just clicks and views;
  • a realistic end-to-end recommendation pipeline you can adapt to your own projects.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."

At the end of episode #26 I teased that after finding groups (clustering, episodes #22-23), compressing data (PCA, #24), visualizing manifolds (t-SNE and UMAP, #25), and flagging outliers (anomaly detection, #26), the natural next question is: "which items should I show to which users?" Today we answer that question. Welcome to recommendation systems -- arguably the most commercially impactful application of machine learning on the planet.

Netflix estimates that 80% of what people watch comes from its recommendation engine. Amazon attributes roughly 35% of its revenue to "customers who bought this also bought..." suggestions. YouTube's algorithm decides what billions of people watch every single day. Spotify's Discover Weekly playlist, Amazon's product grid, TikTok's For You page -- all powered by recommendation systems. More money depends on getting recommendations right than on almost any other ML application.

The core problem is deceptively simple: given what a user has liked (or bought, or watched, or clicked on), predict what they'll like next. But the solution space is HUGE -- millions of users, millions of items, and the vast majority of possible user-item combinations have never happened. You're trying to fill in a matrix that's 99.9% empty. It's like trying to predict every student's grade on every possible exam that could ever be written, based on the few exams they've actually taken.

The good news? Everything we've built across the last 26 episodes feeds directly into this. Cosine similarity from episode #8. Matrix decomposition concepts from episode #24. The evaluation frameworks from episode #13. Even the clustering from episodes #22-23 shows up when you want to group similar users or similar items. Today we pull it all together.

Let's dive right in.

Content-based filtering: "you liked X, here's something similar"

The simplest recommendation strategy is pure common sense: if you liked a sci-fi movie, recommend other sci-fi movies. If you read a Python tutorial and gave it a thumbs up, show more Python tutorials. Content-based filtering uses item features to find similar items, then recommends the ones most similar to what the user already enjoyed.

The mathematical machinery behind this is cosine similarity -- the same tool we used back in episode #8 when we talked about measuring how aligned two vectors are. Each item becomes a feature vector (genre scores, topic tags, text embeddings, whatever describes the content), and similarity between two items is the cosine of the angle between their vectors. A cosine of 1 means identical direction (perfectly similar), 0 means orthogonal (nothing in common), -1 means opposite.

import numpy as np

# Item features: each row is a blog post, columns are topic scores
# [crypto, programming, science, art, gaming]
items = np.array([
    [0.9, 0.3, 0.1, 0.0, 0.0],  # Post 0: crypto analysis
    [0.8, 0.5, 0.2, 0.0, 0.0],  # Post 1: crypto + code
    [0.1, 0.9, 0.3, 0.0, 0.0],  # Post 2: Python tutorial
    [0.0, 0.2, 0.8, 0.1, 0.0],  # Post 3: science article
    [0.0, 0.0, 0.1, 0.9, 0.2],  # Post 4: digital art
    [0.0, 0.1, 0.0, 0.0, 0.9],  # Post 5: game review
    [0.2, 0.8, 0.4, 0.0, 0.1],  # Post 6: ML tutorial
])
item_names = [
    'Crypto Analysis', 'Crypto Code', 'Python Tutorial',
    'Science Article', 'Digital Art', 'Game Review', 'ML Tutorial'
]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = a @ b
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return dot / norm if norm > 0 else 0.0

def content_recommend(liked_idx, items, item_names, top_n=3):
    """Recommend items similar to the one at liked_idx."""
    liked = items[liked_idx]
    scores = []
    for i in range(len(items)):
        if i == liked_idx:
            continue
        sim = cosine_similarity(liked, items[i])
        scores.append((i, sim))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [(item_names[i], sim) for i, sim in scores[:top_n]]

# User liked the Python Tutorial -- what else might they enjoy?
print("You liked 'Python Tutorial'. Recommendations:")
for name, score in content_recommend(2, items, item_names):
    print(f"  {name}: similarity = {score:.3f}")

print()

# User liked the Crypto Analysis
print("You liked 'Crypto Analysis'. Recommendations:")
for name, score in content_recommend(0, items, item_names):
    print(f"  {name}: similarity = {score:.3f}")

The Python Tutorial fan gets recommended the ML Tutorial (high programming overlap) and the Crypto Code post (also programming-heavy). The Crypto Analysis fan gets Crypto Code (obvious overlap), then the ML Tutorial and the Python Tutorial (both share the programming dimension). Makes sense -- you can see WHY each recommendation was made by looking at the feature overlap.

That interpretability is content-based filtering's biggest strength. You can always explain the recommendation: "We suggested this because it shares topics X and Y with posts you've liked." Try explaining a deep learning recommendation to a non-technical product manager sometime -- content-based is the one that doesn't require a 30-minute whiteboard session ;-)

Building a user profile from multiple interactions

In practice, users don't just like one item. They have a history. The standard approach is to build a user profile by averaging (or weighting) the feature vectors of all items the user has interacted with:

def build_user_profile(liked_indices, items):
    """Average feature vectors of liked items into a user profile."""
    profile = items[liked_indices].mean(axis=0)
    return profile

def content_recommend_profile(profile, items, item_names,
                               exclude=None, top_n=3):
    """Recommend items similar to a user profile."""
    if exclude is None:
        exclude = set()
    scores = []
    for i in range(len(items)):
        if i in exclude:
            continue
        sim = cosine_similarity(profile, items[i])
        scores.append((i, sim))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [(item_names[i], sim) for i, sim in scores[:top_n]]

# User has liked posts 0 (crypto), 1 (crypto+code), and 6 (ML tutorial)
user_likes = [0, 1, 6]
profile = build_user_profile(user_likes, items)

print(f"User profile (averaged): {profile.round(3)}")
print(f"(heavy on crypto and programming)\n")

print("Recommendations based on full history:")
for name, score in content_recommend_profile(
    profile, items, item_names, exclude=set(user_likes)
):
    print(f"  {name}: similarity = {score:.3f}")

The user who liked crypto analysis, crypto code, and the ML tutorial gets a profile that's heavy on crypto and programming. The system recommends the Python Tutorial (strong programming match) and the Science Article (partial match via programming and science overlap). The Digital Art and Game Review posts land at the bottom of the ranking with near-zero similarity, because their feature vectors point in almost completely different directions from the user profile.

The filter bubble problem

And that reveals content-based filtering's fundamental limitation: it only recommends items similar to what you've already seen. It creates filter bubbles by design. A programmer who has only ever read programming posts will never be recommended a photography tutorial, even if the data shows that 40% of programmers who try photography tutorials love them. Content-based filtering has no way to know that because it never looks at other users' behavior. It only looks at item features.

The second limitation is that you need good item features. For blog posts with clear topic tags, that's easy. For movies, you can use genre, director, actors, year. But what about a song? The "features" that make you like a song are subtle -- tempo, key, mood, production style, the way the chorus hits. Extracting those features reliably is a whole ML problem in itself. And even if you could perfectly describe every item, feature descriptions miss the intangible qualities that actually drive preference. Two movies can be in the same genre with the same actors and one is a masterpiece while the other is forgettable.
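To see the filter bubble concretely: an item whose feature vector is orthogonal to the user's profile scores exactly zero, so it can never be recommended, no matter how much other users love it. A tiny self-contained sketch (the photography vector is a hypothetical item, not part of our dataset, and we re-define cosine_similarity so the snippet stands alone):

```python
import numpy as np

# A programming-heavy user profile vs. a hypothetical photography post
# whose features don't overlap the profile at all.
profile = np.array([0.1, 0.9, 0.3, 0.0, 0.0])
photography = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

def cosine_similarity(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return (a @ b) / norm if norm > 0 else 0.0

# Orthogonal vectors -> similarity exactly 0; the item can never outrank
# anything with even minimal feature overlap.
print(cosine_similarity(profile, photography))  # -> 0.0
```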

Collaborative filtering: "people like you liked this"

Collaborative filtering sidesteps the feature problem entirely. It doesn't need to know WHAT items are -- only WHO liked WHAT. The core insight is remarkably elegant: if users A and B both liked items 1, 2, and 3, and user A also liked item 4, then user B will probably like item 4 too. No item features required. The patterns in user behavior ARE the signal.

The data structure is a user-item interaction matrix -- rows are users, columns are items, and values are ratings (or binary liked/not-liked, or view counts, or any interaction signal):

# User-item ratings matrix (0 = not rated)
# 7 items, 6 users
ratings = np.array([
    [5, 4, 0, 0, 1, 0, 5],  # User 0: crypto & programming fan
    [4, 5, 4, 0, 0, 0, 4],  # User 1: crypto & programming fan
    [0, 0, 5, 4, 0, 0, 5],  # User 2: programming & science fan
    [0, 0, 0, 3, 5, 4, 0],  # User 3: art & gaming fan
    [0, 0, 4, 5, 0, 0, 4],  # User 4: programming & science fan
    [5, 3, 0, 0, 0, 0, 4],  # User 5: crypto & ML fan
])
user_names = [f"User {i}" for i in range(6)]

print(f"Rating matrix shape: {ratings.shape}")
print(f"Total possible ratings: {ratings.size}")
print(f"Actual ratings: {np.count_nonzero(ratings)}")
print(f"Sparsity: {1 - np.count_nonzero(ratings) / ratings.size:.1%}")

Already at this tiny scale, the matrix is more than half empty. Real-world recommendation matrices are 99%+ empty. The Netflix Prize dataset had about 100 million ratings out of roughly 8.5 billion possible user-movie pairs (around 99% sparse). Amazon probably has trillions of possible user-product interactions, of which a vanishingly small fraction are observed. This extreme sparsity is the central challenge of collaborative filtering.

User-based collaborative filtering

The first flavour of collaborative filtering compares users to each other. Find users who are similar to the target user (based on their rating patterns), then predict what the target user would rate unrated items based on what those similar users rated:

def user_similarity_matrix(ratings):
    """Compute cosine similarity between all user pairs,
    considering only items both users have rated."""
    n_users = ratings.shape[0]
    sim = np.zeros((n_users, n_users))
    for i in range(n_users):
        for j in range(n_users):
            # Only consider items BOTH users rated
            mask = (ratings[i] > 0) & (ratings[j] > 0)
            if mask.sum() < 2:
                sim[i, j] = 0.0
                continue
            a = ratings[i, mask]
            b = ratings[j, mask]
            norm = np.linalg.norm(a) * np.linalg.norm(b)
            sim[i, j] = (a @ b) / norm if norm > 0 else 0.0
    return sim

sim_matrix = user_similarity_matrix(ratings)

print("User similarity matrix:")
print(f"{'':>8s}", end="")
for j in range(6):
    print(f"  U{j}", end="")
print()
for i in range(6):
    print(f"User {i}:", end="")
    for j in range(6):
        print(f"  {sim_matrix[i, j]:.2f}", end="")
    print()

Look at the similarity scores. Users 0 and 1 should be highly similar (both love crypto and programming). Users 2 and 4 are similar (programming and science). User 3 stands apart (art and gaming fan). These patterns let us make predictions for missing entries.

Now let's predict ratings. For a target user's unrated item, we take a weighted average of how similar users rated that item, weighted by similarity:

def predict_rating(user_idx, item_idx, ratings, sim_matrix, k=3):
    """Predict a user's rating for an item using
    the K most similar users who rated that item."""
    # Find users who rated this item
    rated_mask = ratings[:, item_idx] > 0
    if not rated_mask.any():
        return 0.0

    # Get similarities to those users
    sims = sim_matrix[user_idx].copy()
    sims[~rated_mask] = 0  # ignore users who didn't rate
    sims[user_idx] = 0     # ignore self

    # Pick top-K most similar users
    top_k = np.argsort(sims)[::-1][:k]
    top_k = top_k[sims[top_k] > 0]  # only positive similarity

    if len(top_k) == 0:
        return 0.0

    # Weighted average of their ratings
    weights = sims[top_k]
    values = ratings[top_k, item_idx]
    return (weights @ values) / weights.sum()

# Predict all missing ratings for User 0
print("Predictions for User 0:")
for j in range(ratings.shape[1]):
    actual = ratings[0, j]
    if actual > 0:
        status = f"(actual: {actual})"
    else:
        pred = predict_rating(0, j, ratings, sim_matrix)
        status = f"(predicted: {pred:.2f})" if pred > 0 else "(no data)"
    print(f"  {item_names[j]:>18s}: {status}")

The elegance of collaborative filtering is that it discovers implicit taste patterns. It can recommend a photography tutorial to a programmer -- something content-based filtering would never do -- if the data shows that other programmers enjoyed photography. It captures preference patterns that no content analysis can see because those patterns live in the cross-user behavior, not in the items themselves.

Item-based collaborative filtering

There's a second flavour that flips the perspective: instead of finding similar users, find similar items (based on who rated them similarly). If items A and B tend to receive similar ratings from the same users, then a user who liked A will probably like B.

def item_similarity_matrix(ratings):
    """Compute cosine similarity between items based on
    the users who rated them."""
    n_items = ratings.shape[1]
    sim = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            # Only consider users who rated BOTH items
            mask = (ratings[:, i] > 0) & (ratings[:, j] > 0)
            if mask.sum() < 2:
                sim[i, j] = 0.0
                continue
            a = ratings[mask, i]
            b = ratings[mask, j]
            norm = np.linalg.norm(a) * np.linalg.norm(b)
            sim[i, j] = (a @ b) / norm if norm > 0 else 0.0
    return sim

item_sim = item_similarity_matrix(ratings)

print("Item similarity matrix (based on user ratings):")
for i in range(len(item_names)):
    # Find most similar item (excluding self)
    sims = item_sim[i].copy()
    sims[i] = -1
    best = np.argmax(sims)
    print(f"  {item_names[i]:>18s} <-> "
          f"{item_names[best]:<18s} (sim={item_sim[i, best]:.3f})")

Why would you choose item-based over user-based? In production systems, the item catalog is typically much more stable than the user base. New users arrive constantly, but the item set changes slowly. Item-based similarity can be precomputed once and cached, while user-based similarity needs to be recomputed as user behavior changes. Amazon's famous "customers who bought this also bought..." is item-based collaborative filtering -- they precompute item-item similarities offline and serve them instantly. It's a practical engineering decision as much as an algorithmic one.
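That precompute-and-serve split can be sketched in a few lines (a toy sketch: the hand-made similarity matrix below stands in for one computed offline by item_similarity_matrix, and also_liked is our own illustrative name):

```python
import numpy as np

# Offline step: item-item similarities computed once and cached.
names = ['Crypto Analysis', 'Crypto Code', 'Python Tutorial']
item_sim = np.array([
    [1.00, 0.95, 0.40],
    [0.95, 1.00, 0.60],
    [0.40, 0.60, 1.00],
])

def also_liked(item_idx, item_sim, names, top_n=2):
    """Online step: serving is one cached-row lookup plus a sort."""
    sims = item_sim[item_idx].copy()
    sims[item_idx] = -1.0  # exclude the item itself
    order = np.argsort(sims)[::-1][:top_n]
    return [(names[j], float(sims[j])) for j in order]

# "Customers who liked 'Crypto Analysis' also liked..."
print(also_liked(0, item_sim, names))
# -> [('Crypto Code', 0.95), ('Python Tutorial', 0.4)]
```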

The cold start problem

Both flavours of collaborative filtering share the same fundamental weakness: they can't handle NEW users or NEW items.

A brand new user with zero ratings has no history to compare against -- no similar users can be found, and no items can be predicted. A new item with zero ratings has no user interaction data -- it can't be similar to anything, and it won't be recommended to anyone. This is the cold start problem, and it's not just a theoretical nuisance. It's a real business problem. Every new user who signs up gets a bad experience because recommendations are terrible. Every new item added to the catalog is invisible until enough people discover and rate it through other means.

# Demonstrating cold start
new_user = np.array([0, 0, 0, 0, 0, 0, 0])  # no history at all
new_user_partial = np.array([5, 0, 0, 0, 0, 0, 0])  # rated 1 item

# Can we predict anything for the new user?
ratings_extended = np.vstack([ratings, new_user, new_user_partial])
sim_ext = user_similarity_matrix(ratings_extended)

print("New user (no history):")
print(f"  Similarity to all users: {sim_ext[6, :6]}")
print(f"  (all zeros -- no predictions possible)\n")

print("New user (1 rating):")
print(f"  Similarity to existing users: "
      f"{sim_ext[7, :6].round(3)}")
n_predictable = 0
for j in range(7):
    if new_user_partial[j] == 0:
        pred = predict_rating(7, j, ratings_extended, sim_ext)
        if pred > 0:
            n_predictable += 1
print(f"  Predictable items: {n_predictable}/6 unrated items")

The practical solution is hybrid systems -- combine content-based filtering (which works for new items as long as you have features) with collaborative filtering (which works better for users with history). New users get content-based recommendations until they build enough history for collaborative filtering to kick in. New items get recommended based on their content features until enough users have rated them.
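One simple way to sketch that hand-off (the linear ramp below is our own illustrative weighting rule, not a standard formula):

```python
import numpy as np

def hybrid_score(content_scores, collab_scores, n_ratings,
                 history_for_full_trust=10):
    """Blend content-based and collaborative scores per item.
    A brand-new user (n_ratings=0) gets pure content-based scores;
    the collaborative weight ramps up linearly with history size."""
    w = min(n_ratings / history_for_full_trust, 1.0)
    return (1 - w) * content_scores + w * collab_scores

content = np.array([0.8, 0.2, 0.5])
collab = np.array([0.1, 0.9, 0.4])

print(hybrid_score(content, collab, n_ratings=0))   # pure content-based
print(hybrid_score(content, collab, n_ratings=5))   # 50/50 blend
print(hybrid_score(content, collab, n_ratings=20))  # pure collaborative
```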

Having said that, the cold start problem never fully goes away. It's a fundamental limitation of any system that relies on historical interaction data. You need data to make recommendations, but you need recommendations to generate data. It's a chicken-and-egg problem that every recommendation system has to manage, not solve.

Matrix factorization: uncovering latent factors

Now we get to the most powerful classical approach to collaborative filtering. Matrix factorization takes the sparse user-item matrix R and decomposes it into two smaller, dense matrices: a user matrix U and an item matrix V, such that R is approximately U x V-transpose. Each user gets a vector of K "latent factors," each item gets a corresponding vector of K factors, and the predicted rating is the dot product of the user and item vectors.

If that sounds familiar, it should. This is conceptually the same decomposition we did with PCA in episode #24. PCA decomposes a data matrix into directions of maximum variance; matrix factorization decomposes a rating matrix into latent taste dimensions shared by users and items. The math is closely related -- in fact, truncated SVD on a mean-centered ratings matrix is essentially PCA applied to collaborative filtering.

from sklearn.decomposition import NMF

# Fill missing entries (0s) with the global mean -- NMF treats every cell
# as observed data, so leaving 0s would teach it that unrated means disliked
ratings_filled = ratings.copy().astype(float)
mask = ratings > 0
global_mean = ratings[mask].mean()
ratings_filled[~mask] = global_mean

# Non-negative Matrix Factorization: 3 latent factors
nmf = NMF(n_components=3, random_state=42, max_iter=500)
user_factors = nmf.fit_transform(ratings_filled)
item_factors = nmf.components_

print(f"User factors shape: {user_factors.shape}")
print(f"Item factors shape: {item_factors.shape}")

# What do the latent factors look like?
print("\nUser factor vectors:")
for i in range(len(user_factors)):
    factors = user_factors[i].round(2)
    print(f"  User {i}: [{', '.join(f'{v:.2f}' for v in factors)}]")

print("\nItem factor vectors:")
for i in range(len(item_names)):
    factors = item_factors[:, i].round(2)
    print(f"  {item_names[i]:>18s}: "
          f"[{', '.join(f'{v:.2f}' for v in factors)}]")

The three latent factors are discovered automatically -- we never told the algorithm what they should represent. But if you look at the factor values, you'll often find they correspond to meaningful dimensions: maybe factor 0 captures "technical depth," factor 1 captures "visual/creative interest," factor 2 captures something else entirely. The algorithm finds whatever patterns best explain the observed ratings, and those patterns often align with intuitive concepts. This is the same phenomenon we saw with PCA's principal components -- the math discovers structure that we didn't explicitly specify.

Predicting with matrix factorization

The real power is in prediction. Once you have the user and item factor matrices, predicting any user-item pair is just a dot product:

# Predict ALL ratings (including missing ones)
predictions = user_factors @ item_factors

print("Predicted vs actual ratings for User 0:")
print(f"{'Item':>18s}  {'Actual':>8s}  {'Predicted':>10s}")
print("-" * 40)
for j in range(len(item_names)):
    actual = ratings[0, j] if ratings[0, j] > 0 else "?"
    pred = predictions[0, j]
    marker = ""
    if ratings[0, j] == 0:
        marker = " <-- NEW"
    print(f"{item_names[j]:>18s}  {str(actual):>8s}  "
          f"{pred:>10.2f}{marker}")

print("\nPredicted vs actual ratings for User 3 (art/gaming fan):")
print(f"{'Item':>18s}  {'Actual':>8s}  {'Predicted':>10s}")
print("-" * 40)
for j in range(len(item_names)):
    actual = ratings[3, j] if ratings[3, j] > 0 else "?"
    pred = predictions[3, j]
    marker = ""
    if ratings[3, j] == 0:
        marker = " <-- NEW"
    print(f"{item_names[j]:>18s}  {str(actual):>8s}  "
          f"{pred:>10.2f}{marker}")

Look at the predictions for items the users haven't rated (marked with "NEW"). These are the recommendation candidates -- sort them by predicted rating and recommend the highest ones. The predicted ratings for items the user HAS rated should be close to the actual values -- that's how you know the factorization captured the preference patterns correctly.
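Extracting that recommendation list is just a mask-and-sort over the prediction matrix. A self-contained sketch with stand-in data (recommend_top_n is our own helper name; in the article's pipeline the predictions matrix is user_factors @ item_factors):

```python
import numpy as np

def recommend_top_n(user_idx, ratings, predictions, names, top_n=3):
    """Rank UNRATED items by predicted rating for one user."""
    preds = predictions[user_idx].copy()
    preds[ratings[user_idx] > 0] = -np.inf  # never re-recommend rated items
    order = np.argsort(preds)[::-1][:top_n]
    return [(names[j], float(predictions[user_idx, j])) for j in order]

# Tiny stand-in data: one user, four items, two already rated
ratings = np.array([[5, 0, 4, 0]])
predictions = np.array([[4.8, 3.9, 4.1, 2.5]])
names = ['A', 'B', 'C', 'D']

print(recommend_top_n(0, ratings, predictions, names, top_n=2))
# -> [('B', 3.9), ('D', 2.5)]
```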

SVD: the other decomposition

NMF enforces non-negative factors (all values >= 0), which is a natural constraint for ratings (you can't have negative preference). Truncated SVD (Singular Value Decomposition) doesn't have that constraint, and with the freedom to use negative factor values it yields the best possible low-rank approximation in the least-squares sense -- something NMF's non-negativity constraint can prevent it from reaching:

from sklearn.decomposition import TruncatedSVD

# SVD on the ratings matrix
svd = TruncatedSVD(n_components=3, random_state=42)

# Center the ratings first (subtract mean) for better decomposition
ratings_centered = ratings_filled - global_mean
user_factors_svd = svd.fit_transform(ratings_centered)
item_factors_svd = svd.components_

# Predict
predictions_svd = user_factors_svd @ item_factors_svd + global_mean

# Compare NMF vs SVD predictions
print("NMF vs SVD predictions for User 2:")
print(f"{'Item':>18s}  {'Actual':>8s}  {'NMF':>8s}  {'SVD':>8s}")
print("-" * 46)
for j in range(len(item_names)):
    actual = ratings[2, j] if ratings[2, j] > 0 else "?"
    nmf_pred = predictions[2, j]
    svd_pred = predictions_svd[2, j]
    print(f"{item_names[j]:>18s}  {str(actual):>8s}  "
          f"{nmf_pred:>8.2f}  {svd_pred:>8.2f}")

# Reconstruction error
nmf_error = np.mean((ratings_filled - predictions) ** 2)
svd_error = np.mean(
    (ratings_centered - user_factors_svd @ item_factors_svd) ** 2
)
print(f"\nReconstruction MSE -- NMF: {nmf_error:.4f}, SVD: {svd_error:.4f}")

SVD is what powered the famous Netflix Prize solution back in 2009. The winning team used a combination of SVD-based matrix factorization with various regularization tricks and ensemble methods. The core insight that made it work was exactly what we've been discussing: user preferences can be represented as a small number of latent factors, and matrix factorization discovers those factors from the sparse rating data.
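The core of that approach can be sketched with plain stochastic gradient descent over the observed ratings only, with an L2 penalty on the factors -- no mean-filling needed (a minimal sketch; the hyperparameters are illustrative, and the real Netflix Prize models added user/item bias terms and much more):

```python
import numpy as np

def sgd_mf(ratings, n_factors=3, lr=0.02, reg=0.02, n_epochs=500, seed=42):
    """Regularized matrix factorization trained by SGD on OBSERVED
    ratings only (missing entries are simply skipped, not filled in)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(0, 0.1, (n_users, n_factors))
    V = rng.normal(0, 0.1, (n_items, n_factors))
    observed = np.argwhere(ratings > 0)
    for _ in range(n_epochs):
        for u, i in observed:
            err = ratings[u, i] - U[u] @ V[i]
            U[u] += lr * (err * V[i] - reg * U[u])  # gradient step + L2 shrinkage
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

# Tiny two-taste-group ratings matrix (0 = missing)
ratings_toy = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
])
U, V = sgd_mf(ratings_toy)
pred = U @ V.T
print(pred.round(2))  # observed entries should be reconstructed closely
```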

Evaluation: harder than classification

Evaluating recommendation systems is tricky -- fundamentally different from the supervised evaluation we learned in episode #13. In classification, you have ground truth labels and can compute accuracy, precision, recall, F1. In recommendations, you can only observe what users interacted with, not what they WOULD have liked. A user who didn't rate an item might hate it, or might love it but never saw it. You can't tell the difference.

The standard approach: hold out some known interactions, make predictions, and check if the held-out items appear in the top-K recommendations.

def evaluate_recommendations(ratings, predict_fn, k=5):
    """Leave-one-out evaluation: for each user, hide one rating,
    predict top-K, check if the hidden item appears."""
    hits = 0
    total = 0

    for user_idx in range(ratings.shape[0]):
        rated_items = np.where(ratings[user_idx] > 0)[0]
        if len(rated_items) < 3:
            continue  # need at least 3 ratings

        for held_out in rated_items:
            # Create a copy with this rating removed
            ratings_test = ratings.copy()
            ratings_test[user_idx, held_out] = 0

            # Get top-K predictions for unrated items
            preds = predict_fn(user_idx, ratings_test)

            # Remove items the user has already rated
            # (except the held-out one)
            for j in rated_items:
                if j != held_out:
                    preds[j] = -999

            top_k = np.argsort(preds)[::-1][:k]

            if held_out in top_k:
                hits += 1
            total += 1

    return hits / total if total > 0 else 0.0

def nmf_predict(user_idx, ratings_input):
    """Return predicted scores for all items for a given user."""
    r = ratings_input.copy().astype(float)
    m = r[r > 0].mean() if np.any(r > 0) else 3.0
    r[r == 0] = m
    nmf_eval = NMF(n_components=3, random_state=42, max_iter=500)
    U = nmf_eval.fit_transform(r)
    V = nmf_eval.components_
    return (U @ V)[user_idx]

hit_rate = evaluate_recommendations(ratings, nmf_predict, k=3)
print(f"Hit Rate @3 (NMF): {hit_rate:.1%}")
print(f"(fraction of held-out items appearing in top 3)")

Precision@K and NDCG

Two metrics dominate recommendation evaluation:

Precision@K: of the top K recommended items, how many did the user actually interact with? If your top-5 recommendations contain 2 items the user later watched, precision@5 = 0.4. Simple, interpretable, and directly measures "are we filling the recommendation slots with relevant stuff?"

NDCG (Normalized Discounted Cumulative Gain): like precision@K but accounts for the position of relevant items. A relevant item at position 1 is worth more than one at position 5, because users are more likely to see items ranked higher. The "discounting" penalizes relevant items that appear lower in the list.

def precision_at_k(recommended, relevant, k):
    """What fraction of the top-K recommendations are relevant?"""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k

def dcg_at_k(recommended, relevant, k):
    """Discounted Cumulative Gain: position-weighted relevance."""
    dcg = 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            dcg += 1.0 / np.log2(i + 2)  # +2 because log2(1)=0
    return dcg

def ndcg_at_k(recommended, relevant, k):
    """Normalized DCG: DCG divided by the best possible DCG."""
    dcg = dcg_at_k(recommended, relevant, k)
    # Ideal: all relevant items at the top
    ideal_relevant = list(relevant)[:k]
    idcg = dcg_at_k(ideal_relevant, relevant, k)
    return dcg / idcg if idcg > 0 else 0.0

# Example: user's actual likes are items [0, 1, 6]
# System recommends in order: [6, 2, 0, 4, 1]
recommended = [6, 2, 0, 4, 1]
relevant = {0, 1, 6}

for k in [1, 3, 5]:
    p = precision_at_k(recommended, relevant, k)
    n = ndcg_at_k(recommended, relevant, k)
    print(f"  @{k}: Precision = {p:.2f}, NDCG = {n:.3f}")

NDCG is particularly useful because it captures ranking quality, not just set overlap. A system that puts the most relevant item first (position 1) is strictly better than one that puts it fifth, even if both include the same items in the top 5. In a recommendation UI where users scroll from top to bottom, position matters enormously.
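A small self-contained check makes the point: two rankings that contain the same items (so identical precision@3) but place the one relevant item first versus last get very different NDCG scores (minimal re-implementation so the snippet stands alone):

```python
import numpy as np

def ndcg_at_k(recommended, relevant, k):
    """Minimal NDCG: binary relevance, log2 position discount."""
    def dcg(ranked):
        return sum(1.0 / np.log2(i + 2)
                   for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = dcg(list(relevant))  # ideal: all relevant items at the top
    return dcg(recommended) / idcg if idcg > 0 else 0.0

relevant = {0}
first = [0, 4, 2]  # relevant item ranked first
last = [4, 2, 0]   # same set, relevant item ranked last

print(ndcg_at_k(first, relevant, 3))  # -> 1.0
print(ndcg_at_k(last, relevant, 3))   # -> 0.5
```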

Beyond offline metrics: the A/B test reality

I want to be honest about something. Offline evaluation metrics (precision@K, NDCG, hit rate) are useful for comparing algorithms during development, but the ultimate test is always online A/B testing. The best recommendation system isn't the one with the highest NDCG on a test set -- it's the one that increases user engagement, retention, and satisfaction when deployed live. And those outcomes can only be measured by showing real recommendations to real users and measuring what happens.

Offline metrics can be misleading because they only test "can you predict what the user already did?" They can't measure discovery (showing users things they didn't know they'd like), diversity (showing a range of options, not just the safest bet), or serendipity (the delightful surprise of a completely unexpected recommendation that turns out to be perfect). These qualities matter enormously for user experience but don't show up in precision@K.
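Diversity, at least, can be approximated offline. A common proxy is intra-list diversity: the average pairwise dissimilarity (here, 1 minus cosine similarity) among the recommended items' feature vectors. A sketch with hypothetical item vectors:

```python
import numpy as np

def intra_list_diversity(rec_vectors):
    """Average pairwise (1 - cosine similarity) over a recommendation
    list. Higher = more varied recommendations."""
    n = len(rec_vectors)
    dissims = []
    for i in range(n):
        for j in range(i + 1, n):
            a, b = rec_vectors[i], rec_vectors[j]
            norm = np.linalg.norm(a) * np.linalg.norm(b)
            cos = (a @ b) / norm if norm > 0 else 0.0
            dissims.append(1.0 - cos)
    return float(np.mean(dissims))

# Hypothetical feature vectors for two recommendation lists
narrow = np.array([[1.0, 0.0], [0.99, 0.1], [1.0, 0.05]])  # near-duplicates
varied = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])    # spread out

print(f"narrow list diversity: {intra_list_diversity(narrow):.3f}")
print(f"varied list diversity: {intra_list_diversity(varied):.3f}")
```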

Implicit feedback: the real world doesn't have star ratings

Everything we've built so far uses explicit ratings -- users actively telling you "I give this 4 out of 5 stars." But in practice, most recommendation systems work with implicit feedback: clicks, views, time spent, purchases, scrolls, skips. A user who read a blog post for 8 minutes probably liked it. One who bounced after 3 seconds probably didn't. A user who added something to their cart and bought it liked it more than someone who browsed the product page and left.

Implicit feedback changes the problem in a fundamental way:

# Explicit ratings: we know what users liked AND disliked
explicit = np.array([
    [5, 1, 0, 4],  # User 0: loves item 0, hates item 1
    [0, 4, 5, 0],  # User 1: loves items 1 and 2
])

# Implicit feedback: we only know what users interacted with
# Higher values = more interaction, but 0 doesn't mean "disliked"
implicit = np.array([
    [12, 0, 0, 8],  # User 0: read item 0 12 times, item 3 8 times
    [0, 5, 15, 1],  # User 1: read item 2 15 times
])

print("Explicit ratings have clear negatives:")
print(f"  User 0 rated item 1: {explicit[0, 1]} (clearly negative)")
print()
print("Implicit feedback has NO negatives:")
print(f"  User 0 on item 1: {implicit[0, 1]} interactions")
print(f"  Does 0 mean 'disliked' or 'never saw it'? We don't know.")

With explicit ratings, a score of 1/5 is a clear negative signal -- the user TRIED the item and hated it. With implicit feedback, a zero could mean "not interested" or "never encountered it." You can't distinguish between "I wouldn't watch that movie if you paid me" and "I've never heard of that movie but might love it."

The standard approach for implicit feedback uses confidence-weighted interactions: higher interaction counts mean higher confidence that the user likes the item, but zero interactions carry low (not zero) confidence. This was formalized by Hu, Koren & Volinsky (2008) in their seminal paper on implicit feedback collaborative filtering, and it's the foundation of most production systems today.

def implicit_als_simple(interactions, n_factors=3,
                        alpha=40, n_iter=20, reg=0.1):
    """Simplified implicit feedback matrix factorization.
    alpha controls how much to trust high interaction counts."""
    n_users, n_items = interactions.shape

    # Initialize random factors
    np.random.seed(42)
    U = np.random.randn(n_users, n_factors) * 0.1
    V = np.random.randn(n_items, n_factors) * 0.1

    # Binary preference: 1 if any interaction, 0 otherwise
    P = (interactions > 0).astype(float)

    # Confidence: higher interaction = higher confidence
    C = 1 + alpha * interactions

    for iteration in range(n_iter):
        # Alternating least squares
        for u in range(n_users):
            Cu = np.diag(C[u])
            # Solve for user factors
            A = V.T @ Cu @ V + reg * np.eye(n_factors)
            b = V.T @ Cu @ P[u]
            U[u] = np.linalg.solve(A, b)

        for i in range(n_items):
            Ci = np.diag(C[:, i])
            # Solve for item factors
            A = U.T @ Ci @ U + reg * np.eye(n_factors)
            b = U.T @ Ci @ P[:, i]
            V[i] = np.linalg.solve(A, b)

        # Track reconstruction error
        pred = U @ V.T
        error = np.sum(C * (P - pred) ** 2) + reg * (
            np.sum(U ** 2) + np.sum(V ** 2)
        )
        if iteration % 5 == 0:
            print(f"  Iteration {iteration:>3d}: "
                  f"weighted error = {error:.2f}")

    return U, V

print("Training implicit feedback model:")
U_impl, V_impl = implicit_als_simple(implicit, n_factors=2)
predictions_impl = U_impl @ V_impl.T

print("\nPredicted preference scores (implicit):")
for u in range(implicit.shape[0]):
    print(f"  User {u}: "
          f"[{', '.join(f'{v:.3f}' for v in predictions_impl[u])}]")
    print(f"  Actual:  "
          f"[{', '.join(str(v) for v in implicit[u])}]")

The alpha parameter controls how much to scale up the confidence for high-interaction items. With alpha=40, an item viewed 10 times gets a confidence weight of 401 (1 + 40 * 10) while an unviewed item gets confidence 1. This means the model strongly trusts that frequently-viewed items are liked, while treating unviewed items with healthy skepticism -- they might be relevant, they might not.
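To see that scaling concretely, here's a tiny sketch (using the same linear confidence rule and the same alpha=40 as the code above) mapping raw interaction counts to confidence weights:

```python
import numpy as np

alpha = 40
counts = np.array([0, 1, 5, 10, 15])   # raw interaction counts
confidence = 1 + alpha * counts        # linear confidence rule from above

for c, w in zip(counts, confidence):
    print(f"{c:>3d} interactions -> confidence {w}")
# 0 interactions keep confidence 1; 10 interactions reach 401
```

The jump from 1 to 401 is the whole point: the model is forced to reconstruct the preferences of heavily-interacted items almost exactly, while unobserved items barely constrain the factorization.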

A complete recommendation pipeline

Let's put everything together into a pipeline that a real application could use. We'll simulate a larger dataset and build a hybrid system that combines content-based and collaborative approaches:

np.random.seed(42)

# Simulate 50 users and 20 items
n_users, n_items = 50, 20
n_features = 5  # item feature dimensions

# Generate item features (topics)
item_features = np.random.dirichlet(
    np.ones(n_features), size=n_items
)

# Generate user preferences as noisy functions of item features
user_tastes = np.random.randn(n_users, n_features)

# Generate sparse ratings: users only rate items they encounter
ratings_full = np.zeros((n_users, n_items))
for u in range(n_users):
    # Each user rates 5-10 items
    n_rated = np.random.randint(5, 11)
    rated_items = np.random.choice(n_items, n_rated, replace=False)
    for i in rated_items:
        # Rating = user taste alignment + noise, clipped to [1, 5]
        base_score = user_tastes[u] @ item_features[i] * 2 + 3
        noise = np.random.randn() * 0.5
        ratings_full[u, i] = np.clip(base_score + noise, 1, 5)

ratings_full = np.round(ratings_full, 0)

print(f"Rating matrix: {ratings_full.shape}")
print(f"Sparsity: {1 - np.count_nonzero(ratings_full) / ratings_full.size:.1%}")
print(f"Average rating: {ratings_full[ratings_full > 0].mean():.2f}")

# Split: hold out 1 rating per user for testing
test_pairs = []
ratings_train = ratings_full.copy()

for u in range(n_users):
    rated = np.where(ratings_full[u] > 0)[0]
    if len(rated) >= 3:
        held_out = np.random.choice(rated)
        test_pairs.append((u, held_out, ratings_full[u, held_out]))
        ratings_train[u, held_out] = 0

print(f"Test pairs: {len(test_pairs)}")

Now we build three recommenders and compare them:

# Recommender 1: Content-based
def content_based_predict(user_idx, ratings_input, item_feats):
    """Predict scores using content-based filtering."""
    rated = np.where(ratings_input[user_idx] > 0)[0]
    if len(rated) == 0:
        return np.zeros(item_feats.shape[0])
    # Build user profile from rated item features
    weights = ratings_input[user_idx, rated]
    profile = (item_feats[rated].T @ weights) / weights.sum()
    # Score all items by cosine similarity to profile
    scores = np.array([
        cosine_similarity(profile, item_feats[j])
        for j in range(item_feats.shape[0])
    ])
    return scores

# Recommender 2: NMF collaborative filtering
mean_rating = ratings_train[ratings_train > 0].mean()
ratings_nmf = ratings_train.copy().astype(float)
ratings_nmf[ratings_nmf == 0] = mean_rating

nmf_pipe = NMF(n_components=5, random_state=42, max_iter=500)
U_pipe = nmf_pipe.fit_transform(ratings_nmf)
V_pipe = nmf_pipe.components_
nmf_predictions = U_pipe @ V_pipe

# Recommender 3: Hybrid (weighted combination)
def hybrid_predict(user_idx, ratings_input, item_feats,
                   nmf_preds, content_weight=0.3):
    """Combine content and collaborative predictions."""
    content_scores = content_based_predict(
        user_idx, ratings_input, item_feats
    )
    collab_scores = nmf_preds[user_idx]

    # Normalize both to [0, 1] range
    if content_scores.max() > content_scores.min():
        content_norm = (content_scores - content_scores.min()) / (
            content_scores.max() - content_scores.min()
        )
    else:
        content_norm = np.zeros_like(content_scores)

    if collab_scores.max() > collab_scores.min():
        collab_norm = (collab_scores - collab_scores.min()) / (
            collab_scores.max() - collab_scores.min()
        )
    else:
        collab_norm = np.zeros_like(collab_scores)

    return (content_weight * content_norm
            + (1 - content_weight) * collab_norm)

# Evaluate all three
from collections import defaultdict

results = defaultdict(list)

for u, held_item, true_rating in test_pairs:
    # Content-based
    cb_scores = content_based_predict(u, ratings_train, item_features)
    # Zero out already-rated items
    for j in np.where(ratings_train[u] > 0)[0]:
        cb_scores[j] = -999
    cb_rank = np.argsort(cb_scores)[::-1]
    results['Content'].append(held_item in cb_rank[:5])

    # NMF collaborative
    nmf_scores = nmf_predictions[u].copy()
    for j in np.where(ratings_train[u] > 0)[0]:
        nmf_scores[j] = -999
    nmf_rank = np.argsort(nmf_scores)[::-1]
    results['NMF'].append(held_item in nmf_rank[:5])

    # Hybrid
    hyb_scores = hybrid_predict(
        u, ratings_train, item_features, nmf_predictions
    )
    for j in np.where(ratings_train[u] > 0)[0]:
        hyb_scores[j] = -999
    hyb_rank = np.argsort(hyb_scores)[::-1]
    results['Hybrid'].append(held_item in hyb_rank[:5])

print("\nHit Rate @5 comparison:")
print(f"{'Method':>12s}  {'Hit Rate':>10s}")
print("-" * 26)
for method, hits in results.items():
    rate = np.mean(hits)
    print(f"{method:>12s}  {rate:>9.1%}")

The hybrid approach typically wins because it covers both scenarios: when a user has enough history, the collaborative signal dominates (it sees cross-user patterns that content features miss); when a user has little history, the content-based component fills the gap. This mirrors how production systems at Netflix, Spotify, and YouTube operate -- multiple recommendation strategies combined, with the weighting adjusted based on how much data is available for each user.
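One way to implement that per-user weighting is a simple schedule that gives the content signal more say for users with short histories. The helper below is an illustrative sketch (the function name and the pivot parameter are my assumptions, not part of the pipeline above); its output could be passed as the content_weight argument of hybrid_predict:

```python
import numpy as np

def adaptive_content_weight(n_ratings, pivot=10):
    """Hypothetical schedule: weight 1.0 for a brand-new user,
    decaying toward 0 as history grows. At 'pivot' ratings the
    content and collaborative signals are weighted equally."""
    return pivot / (pivot + n_ratings)

for n in [0, 2, 10, 50]:
    print(f"{n:>3d} ratings -> content weight "
          f"{adaptive_content_weight(n):.2f}")
# a brand-new user (0 ratings) gets weight 1.00; at 10 ratings it is 0.50
```

In the pipeline above you would compute n_ratings as np.count_nonzero(ratings_train[u]) and call hybrid_predict with content_weight=adaptive_content_weight(n_ratings) for each user u.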

Diversity and the "more of the same" trap

One last topic that doesn't get enough attention. A recommendation system optimized purely for accuracy (highest precision@K, best NDCG) will tend to recommend the safest, most predictable items -- things very similar to what the user has already liked. It creates a monotony problem: the user sees the same type of content over and over, gets bored, and leaves.

def recommendation_diversity(recommended_indices, item_features):
    """Measure how different the recommended items are
    from each other. Higher = more diverse."""
    if len(recommended_indices) < 2:
        return 0.0
    feats = item_features[recommended_indices]
    n = len(recommended_indices)
    total_dist = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(feats[i], feats[j])
            total_dist += (1 - sim)
            pairs += 1
    return total_dist / pairs

# Compare diversity of different approaches
for method_name, rank_fn in [
    ('Content', lambda u: content_based_predict(
        u, ratings_train, item_features)),
    ('NMF', lambda u: nmf_predictions[u].copy()),
    ('Hybrid', lambda u: hybrid_predict(
        u, ratings_train, item_features, nmf_predictions)),
]:
    diversities = []
    for u in range(n_users):
        scores = rank_fn(u)
        for j in np.where(ratings_train[u] > 0)[0]:
            scores[j] = -999
        top5 = np.argsort(scores)[::-1][:5]
        div = recommendation_diversity(top5, item_features)
        diversities.append(div)
    print(f"{method_name:>12s}: avg diversity = "
          f"{np.mean(diversities):.3f}")

Content-based filtering is typically the least diverse (it recommends similar items by design). Collaborative filtering introduces more diversity because it leverages other users' taste patterns that go beyond content similarity. The hybrid approach usually sits in between. In production, many systems add explicit diversity constraints -- forcing the top-K list to include items from different categories, or penalizing consecutive recommendations that are too similar.
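As a sketch of that second idea -- penalizing candidates too similar to items already picked -- here's a greedy re-ranker in the spirit of maximal marginal relevance. The lambda_ trade-off parameter and both function names are illustrative assumptions, not part of the pipeline above:

```python
import numpy as np

def cosine_sim(a, b):
    # plain cosine similarity between two feature vectors
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return float(a @ b / (na * nb))

def diversified_top_k(scores, item_feats, k=5, lambda_=0.7):
    """Greedy re-ranking: at each step pick the candidate maximizing
    lambda_ * relevance - (1 - lambda_) * (max similarity to the
    items already selected)."""
    candidates = list(np.argsort(scores)[::-1][:3 * k])  # shortlist
    selected = []
    while candidates and len(selected) < k:
        best, best_val = None, -np.inf
        for j in candidates:
            penalty = max(
                (cosine_sim(item_feats[j], item_feats[s])
                 for s in selected), default=0.0)
            val = lambda_ * scores[j] - (1 - lambda_) * penalty
            if val > best_val:
                best, best_val = j, val
        selected.append(int(best))
        candidates.remove(best)
    return selected

# Toy check: two near-duplicate high scorers plus one distinct item.
# The re-ranker skips the duplicate in favor of the different item.
feats = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
scores = np.array([0.9, 0.85, 0.6])
print(diversified_top_k(scores, feats, k=2))  # -> [0, 2]
```

Tuning lambda_ moves you along the accuracy-diversity curve: lambda_=1.0 reduces to plain top-K by score, while lower values trade relevance for variety.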

The tension between accuracy and diversity is one of the deepest design decisions in recommendation systems. A pure accuracy optimizer creates filter bubbles and user fatigue. A pure diversity optimizer shows random garbage. The sweet spot depends on the application: a music playlist benefits from some surprise, while a shopping recommendation should mostly show relevant products. Finding that balance is more art than science, and it's one of the reasons recommendation systems remain an active research area despite decades of work.

So, what have we learned?

We covered the full landscape of classical recommendation systems today, tying together concepts from across the entire series. Here's the complete picture:

  • Content-based filtering recommends items with similar features to what the user has liked. Uses cosine similarity (episode #8) on item feature vectors. Interpretable and works for new items, but creates filter bubbles and needs good features;
  • Collaborative filtering uses patterns in user behavior -- no item features needed. User-based finds similar users; item-based finds similar items. Discovers cross-user taste patterns that content analysis can't see. But suffers from the cold start problem for new users and new items;
  • Matrix factorization (NMF, SVD) decomposes the sparse user-item matrix into latent factor vectors -- the same decomposition concept as PCA (episode #24) applied to preferences. Each user and item gets a small vector, and predictions are dot products. This powered the Netflix Prize;
  • The cold start problem is fundamental: you need interaction data to make recommendations, but you need recommendations to generate interactions. Hybrid systems combining content and collaborative filtering are the practical answer;
  • Evaluation uses precision@K, NDCG, and hit rate -- but offline metrics only measure "can you predict known behavior." Online A/B testing is the ultimate judge. Same spirit as episode #13's lesson that evaluation should match the actual objective;
  • Implicit feedback (clicks, views, time spent) is messier but vastly more abundant than explicit ratings. Confidence-weighted matrix factorization handles the ambiguity of missing data;
  • Diversity matters as much as accuracy -- a system that only recommends "more of the same" creates bored users. The accuracy-diversity tradeoff is a core design decision.

This wraps up our tour of unsupervised learning applications. We've gone from finding groups in data (clustering, #22-23), to compressing and visualizing high-dimensional data (PCA, t-SNE, UMAP, #24-25), to flagging anomalies (#26), to predicting user preferences (today). Each episode built on the tools from earlier ones -- cosine similarity, matrix decomposition, distance metrics, evaluation frameworks.

The data we've been working with so far has always been "snapshots" -- collections of data points without any notion of order or sequence. But a huge category of real-world data has a natural ordering that carries critical information: time. Stock prices, sensor readings, weather data, website traffic -- in all of these, WHEN a measurement was taken matters as much as WHAT the measurement is. Ignoring the temporal structure and treating these as regular feature vectors throws away the very information that makes the data useful. Working with sequential data requires entirely different tools, and those tools open up one of the most practically important branches of ML.

Thanks for reading, see you next time!

@scipio


