Understanding compressed matrices for sparse data

In this article, you’ll learn how matrices representing sparse data, such as user ratings, can be compressed so that machine learning algorithms scale vertically, fitting into memory datasets that would otherwise be too large. You’ll also learn about the various compression formats and delve into their trade-offs.

Building your recommender system at the right scale

Outline

Opening: “everyone wants to scale”

Recommender systems are just like any other data system
– Keep it as simple as possible
– Avoid undifferentiated heavy lifting

The problems recommender systems face at each level of scale
– Toy problems
– Medium-size problems
  – Out-of-core training
  – Real-time recommendations
  – Storing the recommendations
– Large scale
  – Apache Spark, ALS
  – Setting up a cluster
– Massive scale
  – Still use off-the-shelf models
  – Mostly about feature engineering

How to decide on where you fit

[Image: a dog drinking from a hose]
An MVP lapping up the incoming data from the 10K concurrents on launch day.

An engineer is scoping out a data system design, and the first thing on his mind is how it will work at scale. He’s just starting on the prototype, so the product has no users yet, but he wants to make sure it can eventually accommodate thousands of concurrents! With the architecture divorced from reality, the end result is inevitably late, or it doesn’t solve the actual problem.

Meanwhile, in practice he could comfortably build the application with the LAMP stack on a micro AWS instance sans database indexes.

I see this kind of mismanagement happen too often. As a committed long-term planner, I feel the urge to think ahead too. But unless you’re managing a budget of multiple millions of dollars, there’s rarely business value in working on problems that won’t manifest for another two to three months.

Scoping recommender systems

Recommender systems are just like other data systems. When building recommenders, you should be asking yourself “how can I make this happen with as little complexity as possible?” There’s a wealth of information available on when and how to deploy database shards, caches, proxies, and other scaling tools. But there’s a comparative dearth of information on what problems you’ll face when building recommender systems.

Throughout this article, I’m going to assume that you’re building a collaborative filtering (CF) system, but many of the same challenges apply to content-based filtering systems too.

Having first-hand experience—or failing that, second-hand experience—is the key to hitting the sweet spot between simplicity and complexity. There are five clear scales of recommender systems.

Toy scale

Ah, toy-scale recommender systems. The internet is chock-full of them, so if this is the point you’re at, you’re in luck. Toy problems are characterized by offline training and prediction, no live deployment, and datasets that fit in memory even when represented as dense matrices.

There’s no shame in building a toy-scale system. They’re fun to develop because they require minimal engineering work and get you close to the underlying algorithms. Surprise lets you go from nothing to computing movie recommendations in twenty lines of code!
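To see just how little code that is, here’s a minimal sketch along those lines using Surprise’s bundled MovieLens 100K dataset; the choice of SVD and of cross-validation metrics is mine, not a prescription:

    from surprise import SVD, Dataset
    from surprise.model_selection import cross_validate

    # Load the bundled MovieLens 100K ratings (Surprise offers to download them).
    data = Dataset.load_builtin("ml-100k")

    # Plain matrix factorization with default hyperparameters.
    algo = SVD()

    # 5-fold cross-validation, reporting RMSE and MAE for each fold.
    cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)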

Even if a system is live in production, I would still lump it into this category if it has fewer than 10K users. All you have to do to stand it up is expose a REST API by wrapping the model with a lightweight HTTP server.
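Here’s a minimal sketch of that wrapper using Flask; the model path and route shape are assumptions, standing in for however you persist and query your own model:

    import pickle
    from flask import Flask, jsonify

    app = Flask(__name__)

    # Load a previously trained Surprise model at startup (hypothetical path).
    with open("svd_model.pkl", "rb") as f:
        algo = pickle.load(f)

    @app.route("/predict/<user_id>/<item_id>")
    def predict(user_id, item_id):
        # Surprise's predict() takes raw user and item ids and returns an
        # object whose .est field holds the estimated rating.
        prediction = algo.predict(user_id, item_id)
        return jsonify({"user": user_id, "item": item_id, "rating": prediction.est})

    if __name__ == "__main__":
        app.run(port=8000)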

Evaluating tools

When you’re in this category and you’re evaluating tools, you’ll want them to be as easy to set up and understand as possible. Surprise, mentioned above, is an excellent choice.

Small scale

A system goes from toy to small scale when it’s deployed live in production with ~10K users or more. Typical problems at this scale are annoyances rather than true challenges.

CF algorithm performance

Complexity analysis reveals that some CF algorithms will break down with more than a few thousand users. One example is k-nearest neighbors (kNN), for which the complexity is O(|u| · |f|), where |u| is the number of users and |f| is the number of features. You can optimize your code using Cython or a JVM language, or you can use a kNN optimization like ball trees, but you’re usually better served by switching to an O(|f|) algorithm such as SVD.
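As a rough illustration of the ball-tree route, here’s a sketch using scikit-learn’s NearestNeighbors on a synthetic user-by-feature ratings matrix; whether ball trees actually help depends heavily on how many features you have:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Synthetic dense ratings matrix: 5,000 users x 200 features.
    rng = np.random.default_rng(0)
    ratings = rng.integers(0, 6, size=(5000, 200)).astype(np.float32)

    # Index the users with a ball tree instead of brute-force search.
    knn = NearestNeighbors(n_neighbors=10, algorithm="ball_tree")
    knn.fit(ratings)

    # The 10 users most similar to user 0; their ratings can then be averaged
    # (or distance-weighted) to fill in user 0's missing ratings.
    distances, neighbor_ids = knn.kneighbors(ratings[0:1])
    print(neighbor_ids)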

Memory hogging

You may also notice that your CF algorithm consumes lots of memory (2 GB or more). A back-of-the-napkin calculation shows that storing the ratings in a dense matrix should take only 10,000 users × 10,000 features × 1 byte per rating = 100 MB, but there are inefficiencies in moving the data around and in the internals of some CF algorithms. Chances are that you won’t be able to ignore this problem, so you’ll have to fix it by making sure you never copy the dataset, by vertically scaling the training machine, or by switching to compressed sparse matrices.

Evaluating tools

Surprise will still serve you well at small scale, but you’ll have to be mindful of memory use and of which CF algorithm you’re using.

Medium scale

What I call medium scale is when the challenges start getting real. As a rule of thumb, you’ll have moved to this scale with 100K+ users and 10K+ features. Several assumptions that we could hand-wave away at the toy and small scales fall apart at this point.

Compressed sparse matrices

Main article: Understanding compressed matrices for sparse data

The problems fitting the training dataset in memory only get worse as the number of users and features grows. In the section on small scale, I alluded to switching to compressed sparse matrices. As you transition to medium scale, you will have no choice.

Let’s see why. Assume that you have 100K users and 10K features. To store the ratings in a dense matrix, you would need 100,000 users × 10,000 features × 1 byte per rating = 1 GB. That figure doesn’t seem like much, but watch what happens when the number of users and number of features both double: 200,000 users × 20,000 features × 1 byte per rating = 4 GB!

These examples show that the space complexity of matrix factorization algorithms climbs in O(|u| · |f|). A startup that reaches medium scale will probably be growing at 20% month-over-month, which roughly doubles both users and features in four months (1.2^4 ≈ 2.1), so the 4 GB training dataset above would swell to around 16 GB of RAM. Sure, you can limit the number of users and features in the training set, but then you’re just buying time to avoid the inevitable.

Fortunately, using compressed sparse matrix representations can solve this problem. This technique takes advantage of the typical 1-2% density in rating data by compressing the ratings into a format that avoids storing missing ratings. Let’s redo the last calculation with a sparse matrix and assume a rating density of 1%: 200,000 users × 20,000 features × 3 bytes per rating × 1% density = a far more manageable 120 MB.
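To make that concrete, here’s a sketch using SciPy’s CSR format on a synthetic 1%-dense matrix; note that the 3-bytes-per-rating figure above is a bookkeeping estimate, while SciPy stores a 4-byte value plus a 4-byte column index per rating, so the measured number comes out a bit higher:

    import numpy as np
    from scipy import sparse

    n_users, n_items, density = 100_000, 10_000, 0.01
    n_ratings = int(n_users * n_items * density)

    # Random (user, item, rating) triplets standing in for real data.
    rng = np.random.default_rng(0)
    rows = rng.integers(0, n_users, size=n_ratings)
    cols = rng.integers(0, n_items, size=n_ratings)
    vals = rng.integers(1, 6, size=n_ratings).astype(np.float32)

    # CSR keeps only the non-zero values, their column indices, and per-row
    # offsets, so memory scales with the number of ratings, not users x items.
    ratings = sparse.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))
    nbytes = ratings.data.nbytes + ratings.indices.nbytes + ratings.indptr.nbytes
    print(f"sparse: {nbytes / 1e6:.0f} MB")                        # roughly 80 MB
    print(f"dense float32: {n_users * n_items * 4 / 1e9:.0f} GB")  # 4 GB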

Evaluating tools

Remember that the output of a matrix factorization algorithm, once you reconstruct the ratings matrix, is dense. Surprise alone will no longer serve you at this scale; you’ll need more fine-grained control over the validation dataset. You might still use the framework as the backend for a service that feeds it the validation dataset in batches and computes metrics (like RMSE) incrementally instead. When you’re evaluating tools and solutions at this scale, you’ll also want to start thinking further out than two to three months.
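A sketch of that batched-metrics idea, assuming a predict(user, item) callable such as a wrapped Surprise model; the helper name and batch format are mine, not part of any framework:

    import math

    def batched_rmse(predict, validation_batches):
        """Accumulate squared error batch by batch instead of materializing
        every prediction at once."""
        sq_error, count = 0.0, 0
        for batch in validation_batches:      # each batch: [(user, item, rating), ...]
            for user, item, rating in batch:
                error = rating - predict(user, item)
                sq_error += error * error
                count += 1
        return math.sqrt(sq_error / count)

    # Usage with a trained Surprise model (hypothetical `algo` and loader):
    # rmse = batched_rmse(lambda u, i: algo.predict(u, i).est, load_validation_batches())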

Large scale

As you may have already noticed, the engineering challenges have been growing exponentially, and (spoiler alert) they will continue to do so. Large scale is characterized by roughly 10M users and 100K features. Redo the calculations above and you’ll convince yourself that this scale demands new techniques: at 1% density and 3 bytes per rating, 10,000,000 users × 100,000 features works out to about 30 GB, even stored sparsely.

Distributed training

At large scale, the dataset used for training will no longer fit in memory, even when stored sparsely.

One way to work around this problem is to do training out-of-core, which is when batches of training data are incrementally fed to the model. Surprise and most other recommender system frameworks lack incremental training algorithms, so you have to hand-roll one yourself.
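To give a flavor of what hand-rolling looks like, here’s a minimal sketch of matrix factorization trained by SGD one batch of ratings at a time; the factor count, learning rate, and regularization are illustrative assumptions, not tuned values:

    import numpy as np

    class IncrementalMF:
        """Toy out-of-core matrix factorization: call partial_fit() on each
        batch of (user_idx, item_idx, rating) triplets as it streams in."""

        def __init__(self, n_users, n_items, n_factors=50, lr=0.005, reg=0.02):
            rng = np.random.default_rng(0)
            self.P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
            self.Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
            self.lr, self.reg = lr, reg

        def partial_fit(self, batch):
            for u, i, r in batch:
                pu = self.P[u].copy()   # copy so both updates use the old values
                qi = self.Q[i].copy()
                err = r - pu @ qi
                # Standard SGD updates with L2 regularization.
                self.P[u] += self.lr * (err * qi - self.reg * pu)
                self.Q[i] += self.lr * (err * pu - self.reg * qi)

        def predict(self, u, i):
            return self.P[u] @ self.Q[i]

    # Usage: stream batches from disk or a database so the full ratings
    # matrix never has to sit in memory (hypothetical generator).
    # model = IncrementalMF(n_users=10_000_000, n_items=100_000)
    # for batch in read_rating_batches():
    #     model.partial_fit(batch)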

Alternatively, scikit-learn has a class called IncrementalPCA. Unfortunately, this model has two problems. First, it doesn’t accept sparse data, so you’ll have to feed it dense batches, which is ludicrously slow with this number of users and features. Second, its predictions aren’t anywhere near as accurate as those from Surprise or scikit-learn’s TruncatedSVD. The inaccuracy is probably due to the data sparsity rather than an inherent problem with IncrementalPCA.

The better solution is to use Apache Spark and its built-in MLlib toolkit. Jose A Dianes wrote an excellent blog post on how to get started using Spark to make movie recommendations, but be mindful that this approach comes with a whole new set of problems.
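For orientation, here’s a minimal sketch of what that looks like with MLlib’s ALS in PySpark; the column names and the ratings.csv path are placeholders for your own data, and standing up the cluster itself is a separate exercise:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("recommender").getOrCreate()

    # Expect rows of (userId, movieId, rating); the path is a placeholder.
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # Alternating Least Squares factorization, distributed across the cluster.
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=50, regParam=0.1, coldStartStrategy="drop")
    model = als.fit(ratings)

    # Top-10 recommendations per user, computed in parallel.
    model.recommendForAllUsers(10).show(5, truncate=False)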

Big data scale

Only large companies with teams dedicated to information retrieval run into problems at big data scale. Even with the terabytes or more of data that these firms process, the bulk of the challenges comes from coordinating teams, data pipelining, feature engineering, and vectorization. Most use off-the-shelf models to make recommendations and employ far fewer data scientists than engineers.

As you probably guessed, this article is not targeted at people or teams building big data systems. It’s still interesting to know how the challenges change as the business evolves.

Wrapping it up

Figure out where you are and build for that scale. Chances are that your problem can be solved with a small or medium-scale recommender system. Focusing on building only what’s necessary is the way to get work done quickly and effectively!