Building a Recommendation Engine with ALS and Vector Search
When you have a catalog of almost 20 million releases and millions of users with diverse tastes, how do you surface relevant recommendations? This post walks through building a recommendation pipeline end-to-end, from training a collaborative filtering model to deploying it for real-time queries.
The Approach
Collaborative filtering learns user preferences from behavior. If two users have similar collections, they will probably like similar things. The algorithm does not need to understand what an item is because it learns entirely from patterns in the data.
I used Alternating Least Squares (ALS), a matrix factorization technique that works well with implicit feedback, where interactions such as purchases, saves, and ratings are treated as confidence-weighted signals rather than explicit preference scores. ALS decomposes the sparse user-item interaction matrix into two dense factor matrices: user factors and item factors. Each user and item gets an embedding vector, and the dot product between a user vector and an item vector predicts affinity.
The implicit library makes this straightforward. It is fast, handles sparse matrices efficiently, and runs on CPU without needing a GPU.
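As a rough sketch, training looks something like this, assuming a recent version of implicit whose fit method expects a user-by-item CSR matrix. Here user_item_matrix stands in for the confidence-weighted matrix built later in the pipeline, and the hyperparameters mirror the defaults in the configuration table below:

```python
from implicit.als import AlternatingLeastSquares

# user_item_matrix: scipy CSR matrix of confidence-weighted interactions
# (rows = users, columns = items); its construction is covered below.
model = AlternatingLeastSquares(
    factors=128,         # embedding dimensions
    regularization=0.1,  # L2 penalty on the factor matrices
    iterations=15,       # alternating optimization passes
)
model.fit(user_item_matrix)

# Each user and item now has a dense embedding vector.
user_vectors = model.user_factors  # shape: (n_users, 128)
item_vectors = model.item_factors  # shape: (n_items, 128)
```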
Data Sources
The model trains on three types of user interactions:
| Signal | Confidence | Rationale |
|---|---|---|
| Have | 3.0 | Strong signal because the user spent money |
| Want | 1.5 | Moderate signal indicating expressed interest |
| 5-star rating | 5.0 | Strongest positive signal |
| 4-star rating | 4.0 | Strong positive signal |
| 3-star rating | 2.0 | Weak positive signal |
| 1-2 star ratings | Filtered | Negative signals excluded from training |
When a user has multiple interactions with the same item (owns it and rated it), only the highest confidence value is kept. This prevents double-counting while preserving the strongest signal.
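A hypothetical pandas sketch of the weighting and deduplication step (the frame and column names are illustrative, not the pipeline's actual code):

```python
import pandas as pd

# Tag each source with its confidence weight.
haves["confidence"] = 3.0
wants["confidence"] = 1.5
ratings["confidence"] = ratings["rating"].map({5: 5.0, 4: 4.0, 3: 2.0})
ratings = ratings.dropna(subset=["confidence"])  # drop 1-2 star ratings

# Merge the sources, then keep only the highest confidence value
# per (user, item) pair to avoid double-counting.
interactions = pd.concat(
    [haves, wants, ratings], ignore_index=True
)[["uid", "item_id", "confidence"]]
interactions = interactions.groupby(
    ["uid", "item_id"], as_index=False
)["confidence"].max()
```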
Filtering Low-Activity Users and Items
Not all data is useful. A user with two items in their collection does not provide much signal. An item with only a handful of interactions is probably noise. Before training, the pipeline filters to users with at least 20 interactions and items with at least 50.
This improves model quality and reduces the matrix size significantly.
Training Pipeline
The training script runs as a batch job with several stages:
```mermaid
flowchart TD
    subgraph S3["S3 (Parquet)"]
        H[Haves]
        W[Wants]
        R[Ratings]
        M[Releases]
    end
    subgraph Load["Load Interactions"]
        P1[Pass 1: Count Activity]
        P2[Pass 2: Filter & Merge]
        P1 --> P2
    end
    subgraph Train["Train Model"]
        SP[Build Sparse Matrix]
        ALS[ALS Iterations]
        SP --> ALS
    end
    subgraph Upload["Upload to Qdrant"]
        C[Create Collection]
        PU[Parallel Upload]
        SW[Swap Alias]
        C --> PU --> SW
    end
    H & W & R --> Load
    M --> PU
    Load --> Train
    Train --> Upload
```
- Load interactions: Stream parquet files from S3, apply confidence weights, filter low-activity users and items across two passes, merge and deduplicate
- Build sparse matrix: Construct a CSR (Compressed Sparse Row) matrix in chunks for efficient row slicing
- Train ALS: Run alternating least squares for N iterations to learn user and item embeddings
- Create collection: Set up a new timestamped collection in Qdrant
- Upload vectors: Batch upload item embeddings with metadata, using parallel workers with backpressure
- Swap alias: Atomically switch traffic to the new collection and delete the old one
Memory Efficiency
With over a billion interactions, memory management matters. The pipeline uses a two-pass approach:
Pass 1: Load each data source sequentially, count user and item activity, then discard the raw data. This identifies which users and items meet the activity thresholds.
Pass 2: Reload each source, filter immediately to active users and items, then merge. Peak memory is the size of one data source rather than all of them combined.
```python
# Pass 1: Accumulate counts, discard raw data
for source in sources:
    df = load_source(source)
    user_counts = user_counts.add(df["uid"].value_counts(), fill_value=0)
    item_counts = item_counts.add(df["item_id"].value_counts(), fill_value=0)
    del df
    gc.collect()

# Identify active users/items
active_users = set(user_counts[user_counts >= MIN_USER_INTERACTIONS].index)
active_items = set(item_counts[item_counts >= MIN_ITEM_INTERACTIONS].index)
```

The sparse matrix is also built in chunks to avoid memory spikes during construction.
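A sketch of that chunked construction, assuming the filtered interactions sit in a DataFrame and that user_index / item_index (illustrative names) map raw IDs to row and column positions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_user_item_matrix(interactions, user_index, item_index, chunk_size=5_000_000):
    """Build the confidence-weighted CSR matrix without one giant COO buffer."""
    shape = (len(user_index), len(item_index))
    matrix = csr_matrix(shape, dtype=np.float32)  # start empty
    for start in range(0, len(interactions), chunk_size):
        chunk = interactions.iloc[start:start + chunk_size]
        rows = chunk["uid"].map(user_index).to_numpy()
        cols = chunk["item_id"].map(item_index).to_numpy()
        vals = chunk["confidence"].to_numpy(dtype=np.float32)
        # Entries are already deduplicated, so adding the per-chunk
        # matrices simply merges disjoint cells into the final matrix.
        matrix += csr_matrix((vals, (rows, cols)), shape=shape)
    return matrix
```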
Vector Search with Qdrant
Once the model is trained, the item embeddings need to go somewhere queryable. Qdrant is a vector database optimized for similarity search. It supports filtering, which is critical for recommendations because users often want results constrained by genre, year, or other attributes.
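Collection setup is a single call. A minimal sketch with qdrant-client, using dot-product similarity to match how ALS scores affinity (the client URL and collection name are placeholders; the real pipeline uses a timestamped name behind an alias, covered below):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# ALS predicts affinity with a dot product, so the collection is
# configured with dot-product distance rather than cosine.
client.create_collection(
    collection_name="recommendations_1712345678",
    vectors_config=models.VectorParams(size=128, distance=models.Distance.DOT),
)
```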
Each vector point includes metadata:
{ "item_id": 123, "title": "Some Album", "year": 1991, "country": "US", "artists": [12, 34, 56], "labels": [78, 90], "genres": ["Electronic"], "styles": ["Techno"], "interaction_count": 1234}Fighting Popularity Bias
The interaction_count field helps fight popularity bias. Collaborative filtering tends to over-recommend popular releases because they have more training signal, which starves the long tail of exposure. By storing interaction counts as metadata, queries can filter to low-count releases for hidden gems or high-count releases for crowd favorites. This gives users control over the popularity-obscurity tradeoff without needing to retrain the model.
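For example, a “hidden gems” query can cap the interaction count. A sketch with qdrant-client (the 500 cutoff is arbitrary):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Releases similar to item 123, restricted to the long tail.
hits = client.recommend(
    collection_name="recommendations",
    positive=[123],
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="interaction_count",
                range=models.Range(lte=500),  # illustrative threshold
            )
        ]
    ),
    limit=10,
)
```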
Parallel Uploads with Backpressure
Uploading millions of vectors takes time. The pipeline parallelizes uploads across multiple workers but implements backpressure to prevent memory exhaustion:
```python
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = set()

    for batch in batches:
        # Backpressure: wait if queue is too deep
        if len(futures) >= MAX_WORKERS * 2:
            completed, futures = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED
            )
            for future in completed:
                process_result(future.result())

        futures.add(executor.submit(upload_batch, batch))
```

This keeps memory bounded while maximizing throughput.
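The upload_batch helper is not shown above; a minimal version is essentially a thin wrapper around Qdrant's upsert call. In this sketch, each batch is assumed to be a list of (item_id, vector, payload) tuples prepared upstream:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
COLLECTION_NAME = "recommendations_1712345678"  # this run's timestamped collection

def upload_batch(batch):
    points = [
        # vector is assumed to be a numpy row from model.item_factors
        models.PointStruct(id=item_id, vector=vector.tolist(), payload=payload)
        for item_id, vector, payload in batch
    ]
    # wait=True blocks until the points are persisted, so a completed
    # future means the batch is actually stored, not just queued.
    client.upsert(collection_name=COLLECTION_NAME, points=points, wait=True)
    return len(points)
```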
Zero-Downtime Deployments
Retraining the model should not cause downtime. The pipeline uses a blue-green deployment pattern with atomic alias swapping.
Each training run creates a new collection with a timestamped name (e.g., recommendations_1712345678). An alias (recommendations) always points to the active collection. When the new collection is ready:
- Find the old collection currently behind the alias
- Atomically update the alias to point to the new collection
- Delete the old collection
```python
# Atomic swap
change_ops = [
    {"delete_alias": {"alias_name": alias_name}},
    {"create_alias": {
        "collection_name": new_collection_name,
        "alias_name": alias_name,
    }},
]
client.update_collection_aliases(change_aliases_operations=change_ops)
```

If the swap fails, the old collection is preserved. No data loss, no downtime.
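Steps 1 and 3, finding the collection currently behind the alias and removing it once traffic has moved, are each a single call as well. A sketch that reuses the names from the snippet above:

```python
# Step 1: find the collection currently behind the alias (if any).
old_collection = None
for alias in client.get_aliases().aliases:
    if alias.alias_name == alias_name:
        old_collection = alias.collection_name
        break

# ... the atomic alias swap shown above runs here ...

# Step 3: drop the previous collection once the alias points elsewhere.
if old_collection and old_collection != new_collection_name:
    client.delete_collection(collection_name=old_collection)
```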
Querying for Recommendations
With vectors uploaded and the alias pointing to the new collection, recommendations are a single API call. Qdrant supports several query patterns that cover most recommendation use cases.
Similar Releases
The simplest query finds releases similar to a single seed. Pass the item ID in the positive array and Qdrant returns the nearest neighbors in embedding space.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123],
    "limit": 10
  }'
```

This is useful for “more like this” features on product pages or detail views.
Blending Multiple Releases
When you have multiple seeds, Qdrant averages their vectors to find releases similar to all of them. This creates a blended recommendation that captures the common thread across the inputs.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123, 456, 789],
    "limit": 10
  }'
```

This works well for playlist generation or “based on your recent activity” recommendations where you want to consider multiple signals at once.
Filtering by Attributes
Filters constrain results to releases matching specific criteria. You can filter on any metadata field stored with the vectors, combining range queries, exact matches, and boolean logic.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123],
    "limit": 10,
    "filter": {
      "must": [
        {"key": "year", "range": {"gte": 1990, "lte": 1999}},
        {"key": "genres", "match": {"value": "Electronic"}}
      ]
    }
  }'
```

This enables faceted recommendations like “similar electronic albums from the 90s” without post-filtering in application code.
Negative Feedback
Negative examples push results away from certain releases. The query finds releases similar to the positive examples but dissimilar to the negative ones.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123],
    "negative": [789],
    "limit": 10
  }'
```

This is useful for “more like this, but not that” interactions where a user wants to refine their recommendations by excluding certain directions.
Graceful Shutdown
The pipeline handles SIGTERM and SIGINT for clean shutdowns. A global flag is checked at key points:
```python
import signal

SHUTDOWN_REQUESTED = False

def shutdown_handler(signum, frame):
    global SHUTDOWN_REQUESTED
    SHUTDOWN_REQUESTED = True

signal.signal(signal.SIGTERM, shutdown_handler)
signal.signal(signal.SIGINT, shutdown_handler)
```

When shutdown is requested, the pipeline stops submitting new work, waits for in-flight uploads to complete, and exits without leaving partial data. This matters in containerized environments where pods can be terminated at any time.
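Inside the upload loop, honoring the flag is just an early exit before submitting the next batch; a simplified sketch reusing names from the backpressure snippet above:

```python
for batch in batches:
    if SHUTDOWN_REQUESTED:
        # Stop submitting new work; anything already submitted is drained below.
        break
    futures.add(executor.submit(upload_batch, batch))

# Wait for in-flight uploads so no partially written batch is left behind.
concurrent.futures.wait(futures)
```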
Configuration
All parameters are configurable via environment variables:
| Variable | Default | Description |
|---|---|---|
| VECTOR_DIM | 128 | Embedding dimensions |
| ITERATIONS | 15 | ALS training iterations |
| REGULARIZATION | 0.1 | L2 regularization |
| MIN_USER_INTERACTIONS | 20 | Minimum interactions per user |
| MIN_ITEM_INTERACTIONS | 50 | Minimum interactions per item |
| SAMPLE_FRAC_HAVES | 0.1 | Fraction of haves to sample |
| SAMPLE_FRAC_WANTS | 0.1 | Fraction of wants to sample |
| SAMPLE_FRAC_RATINGS | 0.1 | Fraction of ratings to sample |
| MAX_WORKERS | 4 | Parallel upload workers |
| BATCH_SIZE | 500 | Points per upload batch |
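Reading them is plain environment-variable parsing; a sketch of the pattern:

```python
import os

VECTOR_DIM = int(os.environ.get("VECTOR_DIM", "128"))
ITERATIONS = int(os.environ.get("ITERATIONS", "15"))
REGULARIZATION = float(os.environ.get("REGULARIZATION", "0.1"))
MIN_USER_INTERACTIONS = int(os.environ.get("MIN_USER_INTERACTIONS", "20"))
MIN_ITEM_INTERACTIONS = int(os.environ.get("MIN_ITEM_INTERACTIONS", "50"))
SAMPLE_FRAC_HAVES = float(os.environ.get("SAMPLE_FRAC_HAVES", "0.1"))
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "4"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "500"))
```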
These knobs helped us iterate within our hardware constraints. By adjusting the sample fractions, we could run quick experiments on smaller datasets during development, then scale up to full coverage in production. Tuning the activity thresholds let us balance model quality against catalog breadth, gradually lowering the minimums to extend recommendations to more of the long tail.
Final Thoughts
The core of a recommendation engine is surprisingly simple: learn embeddings from user behavior, store them in a vector database, query by similarity. The complexity is in the details, such as handling large datasets without running out of memory, deploying without downtime, and filtering results by metadata.
ALS works well for implicit feedback and scales to millions of users, millions of releases, and over a billion interactions. Qdrant handles the serving layer with filtering and fast lookups. The alias swap pattern enables continuous updates without interrupting traffic.
If you are building something similar, start simple. Get a basic pipeline working end-to-end, then optimize the pieces that matter for your scale.