# Understanding compressed matrices for sparse data

In this article, you’ll learn about how matrices representing sparse data, e.g. user ratings, can be compressed to vertically scale machine learning algorithms by fitting datasets into memory that would otherwise be too large. You’ll learn about the various compression formats and delve into their trade-offs.

delve into examples with SciPy

# Sparsity

Collaborative filtering recommender systems are usually built using data from user interactions, and user interactions are inherently sparse. For example, only 0.941% of the ratings in the MovieLens dataset are present [1].

In addition to the algorithmic challenges inherent to discovering the latent features of such incomplete data, another challenge is in scaling systems that accept memory-hungry matrix inputs.

Let’s anchor our discussion to an example. Assume that we’re working with a movie recommender model and that its input data is tabulated below.

# Dense Matrices Review

When the above ratings are munged to be valid inputs for a dimensionality reduction algorithm such as SVD, every cell must be filled. Ignoring for now demeaning and standardization, one possible form of the input is as follows:

Notice how zeros are filled in cells for which there is a lack of data. On such a small dataset, this representation only wastes a few dozen bytes at most, but it’s not difficult to see how the amount of wasted memory grows exponentially as a function of the numbers of users and features ().

Suppose that only 1% of this matrix is filled with ratings; then this representation consumes 100x more memory than it must.

The main advantage of uncompressed matrices is that access complexity is since cells are indexed statically using their row and column numbers. Uncompressed matrices are also easy to work with.

# Compression

Many computer science problems involve the space-time complexity trade-off, a rule which describes how memory and CPU usage can be traded to achieve desirable performance characteristics. Memory is usually consumed to improve CPU usage, but the converse is equally valid.

The way to improve memory usage for the cost of more CPU is by compressing the matrices.

The distinct terms “compressed” and “sparse” are often used interchangeably. “Sparse” refers to the nature of inputs and indicates that only an arbitrarily-sized minority of the data is known. “Compressed” matrices are stored in a format that requires preprocessing to be usable, and that ideally uses less memory than an uncompressed format.

A matrix can be uncompressed but sparse, as well as it can be compressed but dense, though these representations are both suboptimal.

## Compression vs. Dimensionality Reduction

The astute reader may have noticed that dimensionality reduction algorithms such as PCA and SVD also compress their inputs. This observation is correct, but it misses an important distinction: dimensionality reduction is lossy, while compression as this article defines it is lossless.

Lossy compression irrecoverably loses some of the original data, while lossless compression can be perfectly undone. It’s analogous to the difference between encoding a high-quality recording of a song as an MP3 (lossy) vs. packing it into a zip file (lossless).

Another key difference is that lossily compressed data can often be used directly without any preprocessing, while losslessly compressed data must be unpacked before it can be used. The former should be apparent if you are familiar with dimensionality reduction techniques, while the latter will become clear as you read the remainder of this article.

## Formats

There are many ways to compress matrices, but you only need to know a handful of them. The main formats are COOrdinate (COO) format, compressed sparse rows (CSR), and compressed sparse columns (CSC).

### COOrdinate (“COO”)

SciPy documentation: scipy.sparse.coo_matrix

Consider the most obvious compression scheme one: storing values in a series of zero-indexed tuples. That’s exactly what COOrdinate (“COO”) format is. The example that we’ve been working with is reformatted in this tuple format below.

Although COO is not the most useful compressed matrix representation, it’s the most intuitive, and it’s quick and easy to convert back and forth from CSR and CSC formats.

### Compressed sparse rows (CSR)

SciPy documentation: scipy.sparse.csr_matrix

The compressed sparse row (CSR) representation is the most commonly used matrix compression scheme. While CSR is not as intuitive as COO, it makes up for this deficiency by being superior in almost every other way. CSR is fundamentally a further optimization to the COO format.

CSR breaks down its input matrix into three column vectors: rows, columns, and values. Other than the rows column, these vectors are identical to the columns of the COO format.

The values vector is a flattened list of the values present in the input matrix. In the example above, this vector corresponds with the values column of the COO table above:

Each of the elements of the columns vector indicate the column number into which the corresponding value in the values vector falls. Continuing with the main example in the article, this vector would be:

More concretely, the first value 2 indicates that the first value 4 of the values vector goes into the 3rd column (2 when zero-indexed).

The final vector is the one indicating the rows. Recall the rows column from the COO format example above:

Since the cells from the original uncompressed matrix are read from left to right, then top to bottom, the numbers in the x (rows) column increase monotonically. Additionally, there’s redundancy as both the first (4) and second (5) values are on the first (0th) row.

Redundancy is a good litmus test for optimization. Rather than directly storing the row numbers of each value, CSR stores a separate contiguous array of the {values, columns} indexes at which new rows begin. An example is below.

Notice that the value 2 appears twice in the r vector. The row with contains no entries, so it is skipped in the compressed representation by immediately moving on to in the next entry.

In this example, the compressed row representation doesn’t save any memory, but if there were more values than rows, then it would.


### Compressed sparse columns (CSC)

SciPy documentation: scipy.sparse.csc_matrix

The compressed sparse column (CSC) scheme is virtually identical to CSR. The only difference is that the columns vector is compressed, while the rows vector is identical to the one in the COO representation.

As a rule of thumb, COO is only useful as an intermediate between compressed and uncompressed formats. It is fast to convert a sparse uncompressed matrix to COO format, but it is comparatively slow to convert the same matrix to CSR or CSC formats. Converting a COO matrix to a CSR or CSC matrix is fast.

Within SciPy, CSR matrices have been the default compression scheme for years. Arithmetic operations on CSR and CSC-compressed matrices are equally performant regardless of whether or not the operand is compressed. CSR is superior when iterating over or selecting rows, whereas CSC is superior for column operations. When in doubt, CSR is an excellent go-to compression format.

# Implementation

With a solid understanding of the key compression schemes, it’s time to peruse some examples. The Python Pandas/Numpy/SciPy package family is the ecosystem of choice.

# Conclusion

With an understanding of data sparsity and the core matrix compression formats, you can now optimize your ML pipeline code and train models on much larger swaths of data. Let me know in the comments if you have any questions!

# References

The problems fitting the training dataset in memory will only get worse as the number of users and features grow. In the section for small scale, I alluded to switching to compressed sparse matrices. As you transition to medium scale, you will have no choice.

Let’s see why. Assume that you have 100K users and 10K features. To store the ratings in a dense matrix, you would need 100,000 users × 10,000 features × 1 byte per rating = 1 GB. That figure doesn’t seem like much, but watch what happens when the number of users and number of features both double: 200,000 users × 20,000 features × 1 byte per rating = 4 GB!

These examples show that the space complexity of matrix factorization algorithms climbs in . A startup that gets to medium scale will probably be growing at 20% month-over-month, so the training dataset would consume 16 GB of RAM after four months. Sure, you can limit the number of users and features in the training set, but then you’re just buying time to avoid the inevitable.

Fortunately, using compressed sparse matrix representations can solve this problem. This technique takes advantage of the typical 1-2% density in rating data by compressing the ratings into a format that avoids storing missing ratings. One common compression scheme is coordinate (COO) format, where ratings are stored as a list of tuples, e.g., [(user: 1, feature: 3, rating: 2), (user: 1, feature: 5, rating: 3), (user: 2, feature: 6, rating: 4), …].

Sparse matrices consume more memory for each rating because they lack the implicit data inherent to the location (stride), but fewer ratings overall. Let’s redo the last calculation with a sparse matrix and assume a rating density of 1%: 200,000 users × 20,000 features × 3 bytes per rating × 1% density = a far more manageable 120 MB.