Building a Recommendation Engine with ALS and Vector Search
When you have a catalog of almost 20 million releases and millions of users with diverse tastes, how do you surface relevant recommendations? This post walks through building a recommendation pipeline end-to-end, from training a collaborative filtering model to deploying it for real-time queries.
The Approach
Collaborative filtering learns user preferences from behavior. If two users have similar collections, they will probably like similar things. The algorithm does not need to understand what an item is because it learns entirely from patterns in the data.
I used Alternating Least Squares (ALS), a matrix factorization technique that works well with implicit feedback, where interactions such as purchases, saves, and ratings are treated as confidence-weighted signals rather than explicit preference scores. ALS decomposes the sparse user-item interaction matrix into two dense factor matrices: user factors and item factors. Each user and item gets an embedding vector, and the dot product between a user vector and an item vector predicts affinity.
The implicit library makes this straightforward. It is fast, handles sparse matrices efficiently, and runs on CPU without needing a GPU.
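As a rough sketch, training looks something like this, assuming a recent version of implicit whose fit method expects a user-by-item CSR matrix. Here user_item_matrix stands in for the confidence-weighted matrix built later in the pipeline, and the hyperparameters mirror the defaults in the configuration table below:

```python
from implicit.als import AlternatingLeastSquares

# user_item_matrix: scipy CSR matrix of confidence-weighted interactions
# (rows = users, columns = items); its construction is covered below.
model = AlternatingLeastSquares(
    factors=128,         # embedding dimensions
    regularization=0.1,  # L2 penalty on the factor matrices
    iterations=15,       # alternating optimization passes
)
model.fit(user_item_matrix)

# Each user and item now has a dense embedding vector.
user_vectors = model.user_factors  # shape: (n_users, 128)
item_vectors = model.item_factors  # shape: (n_items, 128)
```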
Data Sources
The model trains on three types of user interactions:
| Signal | Confidence | Rationale |
|---|---|---|
| Have | 3.0 | Strong signal because the user spent money |
| Want | 1.5 | Moderate signal indicating expressed interest |
| 5-star rating | 5.0 | Strongest positive signal |
| 4-star rating | 4.0 | Strong positive signal |
| 3-star rating | 2.0 | Weak positive signal |
| 1-2 star ratings | Filtered | Negative signals excluded from training |
When a user has multiple interactions with the same item (owns it and rated it), only the highest confidence value is kept. This prevents double-counting while preserving the strongest signal.
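A hypothetical pandas sketch of the weighting and deduplication step (the frame and column names are illustrative, not the pipeline's actual code):

```python
import pandas as pd

# Tag each source with its confidence weight.
haves["confidence"] = 3.0
wants["confidence"] = 1.5
ratings["confidence"] = ratings["rating"].map({5: 5.0, 4: 4.0, 3: 2.0})
ratings = ratings.dropna(subset=["confidence"])  # drop 1-2 star ratings

# Merge the sources, then keep only the highest confidence value
# per (user, item) pair to avoid double-counting.
interactions = pd.concat(
    [haves, wants, ratings], ignore_index=True
)[["uid", "item_id", "confidence"]]
interactions = interactions.groupby(
    ["uid", "item_id"], as_index=False
)["confidence"].max()
```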
Filtering Low-Activity Users and Items
Not all data is useful. A user with two items in their collection does not provide much signal. An item with only a handful of interactions is probably noise. Before training, the pipeline filters to users with at least 20 interactions and items with at least 50.
This improves model quality and reduces the matrix size significantly.
Training Pipeline
The training script runs as a batch job with several stages:
```mermaid
flowchart TD
    subgraph S3["S3 (Parquet)"]
        H[Haves]
        W[Wants]
        R[Ratings]
        M[Releases]
    end
    subgraph Load["Load Interactions"]
        P1[Pass 1: Count Activity]
        P2[Pass 2: Filter & Merge]
        P1 --> P2
    end
    subgraph Train["Train Model"]
        SP[Build Sparse Matrix]
        ALS[ALS Iterations]
        SP --> ALS
    end
    subgraph Upload["Upload to Qdrant"]
        C[Create Collection]
        PU[Parallel Upload]
        SW[Swap Alias]
        C --> PU --> SW
    end
    H & W & R --> Load
    M --> PU
    Load --> Train
    Train --> Upload
```
- Load interactions: Stream parquet files from S3, apply confidence weights, filter low-activity users and items across two passes, merge and deduplicate
- Build sparse matrix: Construct a CSR (Compressed Sparse Row) matrix in chunks for efficient row slicing
- Train ALS: Run alternating least squares for N iterations to learn user and item embeddings
- Create collection: Set up a new timestamped collection in Qdrant
- Upload vectors: Batch upload item embeddings with metadata, using parallel workers with backpressure
- Swap alias: Atomically switch traffic to the new collection and delete the old one
Memory Efficiency
With over a billion interactions, memory management matters. The pipeline uses a two-pass approach:
Pass 1: Load each data source sequentially, count user and item activity, then discard the raw data. This identifies which users and items meet the activity thresholds.
Pass 2: Reload each source, filter immediately to active users and items, then merge. Peak memory is the size of one data source rather than all of them combined.
```python
# Pass 1: Accumulate counts, discard raw data
for source in sources:
    df = load_source(source)
    user_counts = user_counts.add(df["uid"].value_counts(), fill_value=0)
    item_counts = item_counts.add(df["item_id"].value_counts(), fill_value=0)
    del df
    gc.collect()

# Identify active users/items
active_users = set(user_counts[user_counts >= MIN_USER_INTERACTIONS].index)
active_items = set(item_counts[item_counts >= MIN_ITEM_INTERACTIONS].index)
```

The sparse matrix is also built in chunks to avoid memory spikes during construction.
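A sketch of that chunked construction, assuming the filtered interactions sit in a DataFrame and that user_index / item_index (illustrative names) map raw IDs to row and column positions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_user_item_matrix(interactions, user_index, item_index, chunk_size=5_000_000):
    """Build the confidence-weighted CSR matrix without one giant COO buffer."""
    shape = (len(user_index), len(item_index))
    matrix = csr_matrix(shape, dtype=np.float32)  # start empty
    for start in range(0, len(interactions), chunk_size):
        chunk = interactions.iloc[start:start + chunk_size]
        rows = chunk["uid"].map(user_index).to_numpy()
        cols = chunk["item_id"].map(item_index).to_numpy()
        vals = chunk["confidence"].to_numpy(dtype=np.float32)
        # Entries are already deduplicated, so adding the per-chunk
        # matrices simply merges disjoint cells into the final matrix.
        matrix += csr_matrix((vals, (rows, cols)), shape=shape)
    return matrix
```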
Vector Search with Qdrant
Once the model is trained, the item embeddings need to go somewhere queryable. Qdrant is a vector database optimized for similarity search. It supports filtering, which is critical for recommendations because users often want results constrained by genre, year, or other attributes.
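Collection setup is a single call. A minimal sketch with qdrant-client, using dot-product similarity to match how ALS scores affinity (the client URL and collection name are placeholders; the real pipeline uses a timestamped name behind an alias, covered below):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# ALS predicts affinity with a dot product, so the collection is
# configured with dot-product distance rather than cosine.
client.create_collection(
    collection_name="recommendations_1712345678",
    vectors_config=models.VectorParams(size=128, distance=models.Distance.DOT),
)
```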
Each vector point includes metadata:
{ "item_id": 123, "title": "Some Album", "year": 1991, "country": "US", "artists": [12, 34, 56], "labels": [78, 90], "genres": ["Electronic"], "styles": ["Techno"], "interaction_count": 1234}Fighting Popularity Bias
The interaction_count field helps fight popularity bias. Collaborative filtering tends to over-recommend popular releases because they have more training signal, which starves the long tail of exposure. By storing interaction counts as metadata, queries can filter to low-count releases for hidden gems or high-count releases for crowd favorites. This gives users control over the popularity-obscurity tradeoff without needing to retrain the model.
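For example, a “hidden gems” query can cap the interaction count. A sketch with qdrant-client (the 500 cutoff is arbitrary):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Releases similar to item 123, restricted to the long tail.
hits = client.recommend(
    collection_name="recommendations",
    positive=[123],
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="interaction_count",
                range=models.Range(lte=500),  # illustrative threshold
            )
        ]
    ),
    limit=10,
)
```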
Parallel Uploads with Backpressure
Uploading millions of vectors takes time. The pipeline parallelizes uploads across multiple workers but implements backpressure to prevent memory exhaustion:
```python
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = set()

    for batch in batches:
        # Backpressure: wait if queue is too deep
        if len(futures) >= MAX_WORKERS * 2:
            completed, futures = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED
            )
            for future in completed:
                process_result(future.result())

        futures.add(executor.submit(upload_batch, batch))
```

This keeps memory bounded while maximizing throughput.
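The upload_batch helper is not shown above; a minimal version is essentially a thin wrapper around Qdrant's upsert call. In this sketch, each batch is assumed to be a list of (item_id, vector, payload) tuples prepared upstream:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
COLLECTION_NAME = "recommendations_1712345678"  # this run's timestamped collection

def upload_batch(batch):
    points = [
        # vector is assumed to be a numpy row from model.item_factors
        models.PointStruct(id=item_id, vector=vector.tolist(), payload=payload)
        for item_id, vector, payload in batch
    ]
    # wait=True blocks until the points are persisted, so a completed
    # future means the batch is actually stored, not just queued.
    client.upsert(collection_name=COLLECTION_NAME, points=points, wait=True)
    return len(points)
```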
Zero-Downtime Deployments
Retraining the model should not cause downtime. The pipeline uses a blue-green deployment pattern with atomic alias swapping.
Each training run creates a new collection with a timestamped name (e.g., recommendations_1712345678). An alias (recommendations) always points to the active collection. When the new collection is ready:
- Find the old collection currently behind the alias
- Atomically update the alias to point to the new collection
- Delete the old collection
```python
# Atomic swap
change_ops = [
    {"delete_alias": {"alias_name": alias_name}},
    {"create_alias": {
        "collection_name": new_collection_name,
        "alias_name": alias_name,
    }},
]
client.update_collection_aliases(change_aliases_operations=change_ops)
```

If the swap fails, the old collection is preserved. No data loss, no downtime.
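Steps 1 and 3, finding the collection currently behind the alias and removing it once traffic has moved, are each a single call as well. A sketch that reuses the names from the snippet above:

```python
# Step 1: find the collection currently behind the alias (if any).
old_collection = None
for alias in client.get_aliases().aliases:
    if alias.alias_name == alias_name:
        old_collection = alias.collection_name
        break

# ... the atomic alias swap shown above runs here ...

# Step 3: drop the previous collection once the alias points elsewhere.
if old_collection and old_collection != new_collection_name:
    client.delete_collection(collection_name=old_collection)
```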
Querying for Recommendations
With vectors uploaded and the alias pointing to the new collection, recommendations are a single API call. Qdrant supports several query patterns that cover most recommendation use cases.
Similar Releases
The simplest query finds releases similar to a single seed. Pass the item ID in the positive array and Qdrant returns the nearest neighbors in embedding space.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123],
    "limit": 10
  }'
```

This is useful for “more like this” features on product pages or detail views.
Blending Multiple Releases
When you have multiple seeds, Qdrant averages their vectors to find releases similar to all of them. This creates a blended recommendation that captures the common thread across the inputs.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123, 456, 789],
    "limit": 10
  }'
```

This works well for playlist generation or “based on your recent activity” recommendations where you want to consider multiple signals at once.
Filtering by Attributes
Filters constrain results to releases matching specific criteria. You can filter on any metadata field stored with the vectors, combining range queries, exact matches, and boolean logic.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123],
    "limit": 10,
    "filter": {
      "must": [
        {"key": "year", "range": {"gte": 1990, "lte": 1999}},
        {"key": "genres", "match": {"value": "Electronic"}}
      ]
    }
  }'
```

This enables faceted recommendations like “similar electronic albums from the 90s” without post-filtering in application code.
Negative Feedback
Negative examples push results away from certain releases. The query finds releases similar to the positive examples but dissimilar to the negative ones.
```bash
curl -X POST "http://localhost:6333/collections/recommendations/points/recommend" \
  -H "Content-Type: application/json" \
  -d '{
    "positive": [123],
    "negative": [789],
    "limit": 10
  }'
```

This is useful for “more like this, but not that” interactions where a user wants to refine their recommendations by excluding certain directions.
Graceful Shutdown
The pipeline handles SIGTERM and SIGINT for clean shutdowns. A global flag is checked at key points:
```python
import signal

SHUTDOWN_REQUESTED = False

def shutdown_handler(signum, frame):
    global SHUTDOWN_REQUESTED
    SHUTDOWN_REQUESTED = True

signal.signal(signal.SIGTERM, shutdown_handler)
signal.signal(signal.SIGINT, shutdown_handler)
```

When shutdown is requested, the pipeline stops submitting new work, waits for in-flight uploads to complete, and exits without leaving partial data. This matters in containerized environments where pods can be terminated at any time.
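Inside the upload loop, honoring the flag is just an early exit before submitting the next batch; a simplified sketch reusing names from the backpressure snippet above:

```python
for batch in batches:
    if SHUTDOWN_REQUESTED:
        # Stop submitting new work; anything already submitted is drained below.
        break
    futures.add(executor.submit(upload_batch, batch))

# Wait for in-flight uploads so no partially written batch is left behind.
concurrent.futures.wait(futures)
```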
Configuration
All parameters are configurable via environment variables:
| Variable | Default | Description |
|---|---|---|
| VECTOR_DIM | 128 | Embedding dimensions |
| ITERATIONS | 15 | ALS training iterations |
| REGULARIZATION | 0.1 | L2 regularization |
| MIN_USER_INTERACTIONS | 20 | Minimum interactions per user |
| MIN_ITEM_INTERACTIONS | 50 | Minimum interactions per item |
| SAMPLE_FRAC_HAVES | 0.1 | Fraction of haves to sample |
| SAMPLE_FRAC_WANTS | 0.1 | Fraction of wants to sample |
| SAMPLE_FRAC_RATINGS | 0.1 | Fraction of ratings to sample |
| MAX_WORKERS | 4 | Parallel upload workers |
| BATCH_SIZE | 500 | Points per upload batch |
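Reading them is plain environment-variable parsing; a sketch of the pattern:

```python
import os

VECTOR_DIM = int(os.environ.get("VECTOR_DIM", "128"))
ITERATIONS = int(os.environ.get("ITERATIONS", "15"))
REGULARIZATION = float(os.environ.get("REGULARIZATION", "0.1"))
MIN_USER_INTERACTIONS = int(os.environ.get("MIN_USER_INTERACTIONS", "20"))
MIN_ITEM_INTERACTIONS = int(os.environ.get("MIN_ITEM_INTERACTIONS", "50"))
SAMPLE_FRAC_HAVES = float(os.environ.get("SAMPLE_FRAC_HAVES", "0.1"))
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "4"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "500"))
```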
These knobs helped us iterate within our hardware constraints. By adjusting the sample fractions, we could run quick experiments on smaller datasets during development, then scale up to full coverage in production. Tuning the activity thresholds let us balance model quality against catalog breadth, gradually lowering the minimums to extend recommendations to more of the long tail.
Final Thoughts
The core of a recommendation engine is surprisingly simple: learn embeddings from user behavior, store them in a vector database, query by similarity. The complexity is in the details, such as handling large datasets without running out of memory, deploying without downtime, and filtering results by metadata.
ALS works well for implicit feedback and scales to millions of users, millions of releases, and over a billion interactions. Qdrant handles the serving layer with filtering and fast lookups. The alias swap pattern enables continuous updates without interrupting traffic.
If you are building something similar, start simple. Get a basic pipeline working end-to-end, then optimize the pieces that matter for your scale.