In this article, you’ll learn about how matrices representing sparse data, e.g. user ratings, can be compressed to vertically scale machine learning algorithms by fitting datasets into memory that would otherwise be too large. You’ll learn about the various compression formats and delve into their trade-offs.

# Sparsity

Collaborative filtering recommender systems are usually built using data from user interactions, and user interactions are inherently **sparse**. For example, only 0.941% of the ratings in the MovieLens dataset are present [1].

In addition to the algorithmic challenges inherent to discovering the latent features of such incomplete data, another challenge is in scaling systems that accept memory-hungry matrix inputs.

Let’s anchor our discussion to an example. Assume that we’re working with a movie recommender model and that its input data is tabulated below.

# Dense Matrices Review

When the above ratings are munged to be valid inputs for a dimensionality reduction algorithm such as SVD, every cell must be filled. Ignoring for now demeaning and standardization, one possible form of the input is as follows:

Notice how zeros are filled in cells for which there is a lack of data. On such a small dataset, this representation only wastes a few dozen bytes at most, but it’s not difficult to see how the amount of wasted memory grows exponentially as a function of the numbers of users and features ().

Suppose that only 1% of this matrix is filled with ratings; then this representation consumes 100x more memory than it must.

The main advantage of uncompressed matrices is that access complexity is since cells are indexed statically using their row and column numbers. Uncompressed matrices are also easy to work with.

# Compression

Many computer science problems involve the space-time complexity trade-off, a rule which describes how memory and CPU usage can be traded to achieve desirable performance characteristics. Memory is usually consumed to improve CPU usage, but the converse is equally valid.

The way to improve memory usage for the cost of more CPU is by **compressing** the matrices.

The distinct terms “compressed” and “sparse” are often used interchangeably. “Sparse” refers to the nature of inputs and indicates that only an arbitrarily-sized minority of the data is known. “Compressed” matrices are stored in a format that requires preprocessing to be usable, and that ideally uses less memory than an uncompressed format.

A matrix can be uncompressed but sparse, as well as it can be compressed but dense, though these representations are both suboptimal.

## Compression vs. Dimensionality Reduction

The astute reader may have noticed that dimensionality reduction algorithms such as PCA and SVD also compress their inputs. This observation is correct, but it misses an important distinction: dimensionality reduction is **lossy**, while compression as this article defines it is **lossless**.

Lossy compression irrecoverably loses some of the original data, while lossless compression can be perfectly undone. It’s analogous to the difference between encoding a high-quality recording of a song as an MP3 (lossy) vs. packing it into a zip file (lossless).

Another key difference is that lossily compressed data can often be used directly without any preprocessing, while losslessly compressed data must be unpacked before it can be used. The former should be apparent if you are familiar with dimensionality reduction techniques, while the latter will become clear as you read the remainder of this article.

## Formats

There are many ways to compress matrices, but you only need to know a handful of them. The main formats are COOrdinate (COO) format, compressed sparse rows (CSR), and compressed sparse columns (CSC).

### COOrdinate (“COO”)

SciPy documentation: scipy.sparse.coo_matrix

Consider the most obvious compression scheme one: storing values in a series of zero-indexed tuples. That’s exactly what COOrdinate (“COO”) format is. The example that we’ve been working with is reformatted in this tuple format below.

Although COO is not the most useful compressed matrix representation, it’s the most intuitive, and it’s quick and easy to convert back and forth from CSR and CSC formats.

### Compressed sparse rows (CSR)

SciPy documentation: scipy.sparse.csr_matrix

The compressed sparse row (CSR) representation is the most commonly used matrix compression scheme. While CSR is not as intuitive as COO, it makes up for this deficiency by being superior in almost every other way. CSR is fundamentally a further optimization to the COO format.

CSR breaks down its input matrix into three column vectors: *rows*, *columns*, and *values*. Other than the *rows* column, these vectors are identical to the columns of the COO format.

The *values* vector is a flattened list of the values present in the input matrix. In the example above, this vector corresponds with the *values* column of the COO table above:

Each of the elements of the *columns* vector indicate the column number into which the corresponding value in the *values* vector falls. Continuing with the main example in the article, this vector would be:

More concretely, the first value 2 indicates that the first value 4 of the *values* vector goes into the 3rd column (2 when zero-indexed).

The final vector is the one indicating the *rows*. Recall the *rows* column from the COO format example above:

Since the cells from the original uncompressed matrix are read from left to right, then top to bottom, the numbers in the *x (rows)* column increase monotonically. Additionally, there’s redundancy as both the first (4) and second (5) values are on the first (0th) row.

Redundancy is a good litmus test for optimization. Rather than directly storing the row numbers of each value, CSR stores a separate contiguous array of the *{values, columns}* indexes at which new rows begin. An example is below.

Notice that the value 2 appears twice in the *r* vector. The row with contains no entries, so it is skipped in the compressed representation by immediately moving on to in the next entry.

In this example, the compressed row representation doesn’t save any memory, but if there were more values than rows, then it would.

### Compressed sparse columns (CSC)

SciPy documentation: scipy.sparse.csc_matrix

The compressed sparse column (CSC) scheme is virtually identical to CSR. The only difference is that the *columns* vector is compressed, while the *rows* vector is identical to the one in the COO representation.

## Format trade-offs

As a rule of thumb, COO is only useful as an intermediate between compressed and uncompressed formats. It is fast to convert a sparse uncompressed matrix to COO format, but it is comparatively slow to convert the same matrix to CSR or CSC formats. Converting a COO matrix to a CSR or CSC matrix is fast.

Within SciPy, CSR matrices have been the default compression scheme for years. Arithmetic operations on CSR and CSC-compressed matrices are equally performant regardless of whether or not the operand is compressed. CSR is superior when iterating over or selecting rows, whereas CSC is superior for column operations. When in doubt, **CSR is an excellent go-to compression format**.

# Conclusion

With an understanding of data sparsity and the core matrix compression formats, you can now optimize your ML pipeline code and train models on much larger swaths of data. Let me know in the comments if you have any questions!

## One thought on “Understanding compressed matrices for sparse data”