Machine learning engineering projects are DAGs

The team gets started on a machine learning engineering project, and they write scripts that they then throw into a shared repository with little thought as to how everything will fit together. At first, things seem fast and smooth. The team flies through the problem. They’re agile.

But they inevitably find as the project progresses that they need to revisit initial assumptions and iterate. Mistakes happen, but to correct them, the team needs flexibility. Let’s say they want to rebuild their model pipeline’s extracted features to include additional features they initially missed, so… they try.

A sad state of affairs

Jake is working on the feature extractor and needs to use the scraper that Mary built to recreate the dataset. He fires up that component, but he can’t make heads or tails of how it works. The code has hardcoded paths like /home/mary/repos/our_team_project/tmp/datasets. The scraper depends on a different version of SciKit-Learn than the one the feature extractor uses. The code requires access to a third-party service that Mary forgot to check into the repository.

Most egregious of all, it turns out that the scraper depends on the feature extractor’s outputs! Jake facepalms as he realizes that he checked the extracted features directly into the repository, and Mary understandably assumed that it was safe to build the scraper on top of what was already there. The unfortunate state of affairs is as follows:

A diagram of the resulting dependency graph: the Web, Scraper, Scraped Dataset, Feature Extractor, and Extracted Features, with the dependencies between them forming a cycle.

There’s now a circular dependency that becomes difficult to untangle. Nobody can build the project, and everyone is sad. The project grinds to a halt, and the team needs days to get back on track. If only Jake and Mary had spent an hour or so coordinating things ahead of time, or someone had laid out a framework for them that avoided such situations, this never would have happened!

A flawed mindset

What leads to situations like these, aside from laziness, is that as machine learning engineers we like to believe that we’re tackling novel problems that nobody has tackled before. We desperately cling to the idea that we’re mad scientists in search of truth. We wish to believe that through sheer brain power, we’ll discover the answer, and someone else can sort out how to deploy it. But rarely is any of that the case.

Most often, we’re building simple SVMs and regressions. Maybe we’re even dabbling with deep learning, but even then, we’re usually solving problems in well-understood spaces and with known patterns such as object detection or sentiment analysis.

Here’s the secret I’m giving you in this article: machine learning engineering projects don’t need to be much different from traditional software engineering projects. Sure, there’s more experimentation than there is in traditional software engineering, but I would argue that:

Most machine learning engineering projects suffer from sociological and logistical risks rather than technological risks.

That is, the primary risk is not that you’ll fail to extract useful signal and build a predictive model, but that you won’t be able to work together as a team, or collect the data in the required format, or iterate until you arrive at useful results.

A better mental model

Remember when you used to work on traditional software engineering projects? You had the following tools that made your life easier:

  • A button you could press to build the entire project so you instantly knew where there were compilation and/or linker errors.
  • A cleaning process where you could destroy intermediates and be sure that you can build the project from scratch.
  • Continuous integration with other developers, so you knew everyone was building the same thing.
  • Makefiles or other artefacts that you could update to alter the build process on others’ behalf without them needing to understand those changes.

You can still have all of these things when you work on machine learning engineering projects. Let’s see how.

The antidote: DAGs

The antidote to the flawed mindset is to think of machine learning engineering projects as directed acyclic graphs (DAGs). A DAG is a graph data structure where:

  • Edges are directed: each edge goes one way, not the other.
  • There are no cycles in the graph. From each node, there’s no path you can take to get back to that node.

The key is to model your project’s artefacts as nodes and dependencies as directed edges. As long as this graph is a DAG, then you’re doing it right.
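
To make this model concrete, here’s a minimal sketch using Python’s standard-library graphlib; the artefact names mirror Jake and Mary’s project and are purely illustrative:

from graphlib import TopologicalSorter, CycleError

# Each artefact maps to the set of artefacts it depends on.
dependencies = {
    "scraped_dataset": {"web"},
    "extracted_features": {"scraped_dataset"},
}

# Jake and Mary's mistake: the scraper was built on top of the extracted features.
dependencies["scraped_dataset"].add("extracted_features")

try:
    print("Build order:", list(TopologicalSorter(dependencies).static_order()))
except CycleError as err:
    print("Circular dependency detected:", err)

Run against a healthy project, this prints a valid build order; with the extra edge added above, it reports the cycle before anyone wastes days untangling it.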

Of course, this mental model alone won’t solve all of these problems, but it’s the foundation upon which all other improvements rest.

An example DAG modeling a typical machine learning engineering workflow: Web → Scraper → Scraped Dataset → Feature Extractor → Extracted Features.

Let’s get more practical and see what else we need to do.

One-touch project build

One-touch build in traditional software engineering projects requires that dependencies form a DAG. Your module A can’t circularly depend on module B, or you’ll get a compilation error.

The difference is that instead of code modules, headers, and links depending on each other, our intermediate data representations must form a DAG.

You should be able to build the entire project from scratch with a single command. You can use Makefiles or anything else. The gold standard is when you can run something as simple as this, and the entire project will build:

make download_from_s3 # downloads initial data from S3
                      # if there is any
make                  # builds everything!

Using Makefiles, this is easy to accomplish by declaring your dependencies:

.PHONY: all raw preprocess features model

all: raw preprocess features model

raw: data/raw/
preprocess: data/preprocessed/
features: data/features/
model: models/model.pkl

data/raw/:
    python -m raw data/raw/

data/preprocessed/: data/raw/
    python -m preprocess data/raw/ data/preprocessed/

data/features/: data/preprocessed/
    python -m features data/preprocessed/ data/features/

models/model.pkl: data/features/
    python -m model data/features/ models/model.pkl

We can also diagram this Makefile as follows:

A clean, step-by-step workflow with clear inputs and outputs: Web (implied) → src/raw/ → data/raw/ → src/preprocess/ → data/preprocessed/ → src/features/ → data/features/ → src/model/ → the model.

Data is immutable

Once you write a dataset, you should never alter it. When you change data after it’s written, it becomes difficult to backtrack, understand the control flow, and ensure that your team is all working with the same artefacts. If you need to transform your data, it’s better to add an additional step to the workflow.

Another advantage of immutable data is that it lets you take better advantage of intermediates. For example, suppose you discover that you made a mistake in the third step of your pipeline and need to iterate on it. Rather than rebuilding the dataset from scratch, you can just discard the intermediates from the third step onward. This workflow lets you iterate faster and more confidently.

Each step has clear inputs and outputs

Notice also how in the above Makefile, each step of the process specifies dependencies, consumes an input and produces an output. For example, the preprocess step depends on data/raw/ which the raw step produces, and it outputs the data/preprocessed/ data for the next step, features.

Being clear about inputs and outputs means that each component and dataset is responsible for one thing, which leads to cleaner and more modular code. It also makes it easier to swap out components and datasets when necessary.

Build from scratch or from intermediates

Your pipeline should enable you to iterate fast from any step in the workflow. You should be able to both:

  1. Build the project completely from scratch with no errors, missing dependencies, or differences with others on your team
  2. Build from any step of the process onward assuming you already have the intermediates

A good way to accomplish both of these goals is to have a make clean command or equivalent that destroys all intermediates and to run it often and/or in CI to make sure that the whole project works end-to-end. If you write your Makefile correctly, it will automatically detect when intermediates are present and skip the unnecessary steps when you run your make command.

Dependencies are part of the DAG

Just because we’re focused on data dependencies doesn’t mean we should ignore our software dependencies! Software dependency management is just as important as data dependency management.

In general, there are two ways to wrangle your dependencies:

  1. Use one common environment for all steps, e.g., one shared Anaconda environment.
  2. Put each step into its own isolated environment, e.g., Docker containers.

Either solution can work; the right choice depends on the scale of your problem. For a smaller project that you won’t maintain for long, a single shared environment could serve you well, and Docker containers would be overkill. But if you’re working across many verticals with conflicting dependencies, then Docker containers suddenly become more attractive.

Conclusion

Just like when building traditional software, when you think through your machine learning project workflow before you get started, you’ll save tremendous future headaches.

If you’re interested in learning more, check out the Cookie Cutter Data Science framework, which explains this thinking and much more.

I Used Waterfall (And I Liked It)

In this article, I don’t mean to prescribe universally applicable advice, to downplay the importance or utility of agile, or even to suggest that more than a tiny fraction of software engineering teams should use waterfall. My point is simply that there’s always a right tool for the job, and that tool is not always agile.

In short, we’re using waterfall at Passenger AI, and it’s working. I’m going to explain our experience and attempt to rationalize it.

For reference, at Passenger AI, we’re building artificial intelligence software for self-driving cars to keep them clean and to protect passengers. Our offering is an edge operating system that uses computer vision, deep learning, and machine learning to track passenger behavior.

Now then.

From programming’s inception up until the 2000’s, software engineering was existentially difficult because there were no known patterns to execute on and because there was always a looming threat of total project failure. Even building a website was difficult because there were no cloud infrastructure providers or web frameworks.

As we rolled into the 2000’s, IaaS providers like AWS proliferated, web frameworks like Ruby on Rails launched and stabilized, and distributed systems patterns became more widely understood. Now, almost anyone can build a website or an app, the two components that are the bread and butter of most tech businesses.

What happened between these two eras was a shift from technological to sociological risk.

Classes of Risk

A technological risk is one where there’s uncertainty as to whether or not computers can do what’s needed, or it’s questionable that the needed technology can be built in any reasonable amount of time. Suppose that you’re training a deep neural net for an embedded system; then a technological risk is that you may need more computational resources to power that model than are available.

A sociological risk (“politics”) is one where communication among people, between departments, and with customers can lead to project failure. Imagine that you’re a product manager whose customers have no need for your software now but will in six months; a sociological risk is that what you build today won’t be what those customers need then.

Technological risk dominated most software projects in the 90’s and prior, whereas sociological risk dominates most projects from the 2000’s onward.

That’s not to say that projects in the past didn’t involve sociological risk. Even books from the 70’s, like The Mythical Man Month, recognized this peril:

Therefore the most important function that software builders do for their clients is the iterative extraction and refinement of the product requirements. For the truth is, the clients do not know what they want.

Frederick P. Brooks Jr., The Mythical Man-Month: Essays on Software Engineering

Let’s further explore how today’s project management confronts this danger.

Sociological Risk

Good project management mitigates risk, so modern project management focuses on resolving sociological issues like poor communication, lack of customer feedback, estimating velocity, and changing requirements. Those problems are exactly what methodologies like scrum, kanban, and XP are designed to solve. (I’ll loosely group these methodologies into “agile” from here on.)

Indeed, these methodologies work exceedingly well for teams building apps or websites using known patterns and frameworks for customers who don’t know what they want. Businesses of this type just so happen to dominate today’s tech industry.

The book Peopleware speaks honestly to this point:

We, along with nearly everyone else involved in the high-tech endeavors, were convinced that technology was all, that whatever your problems were, there had to be a better technology solution to them. But if what you were up against was inherently sociological, better technology seemed unlikely to be much help

Tom DeMarco & Timothy Lister, Peopleware: Productive Projects and Teams

While sociological risk dominates most projects today, there are still plenty in which the primary hazard is in the technology.

Technological Risk

Technological risk is a different beast because it often involves complex dependencies, front-loaded experimentation, totally unknown costs, and focused execution on known requirements.

Think of any deep learning startup, or any company building an operating system, or any team commercializing a research project. In none of these cases is it easy to estimate tasks, nor is there much to show customers between wireframe mocks and the final product.

These problems sound suspiciously similar to those that most pre-2000’s projects faced.

These technology-focused projects could operate under agile, but agile isn’t designed to meet their distinct challenges.

Note also that agile took off primarily because it tightened the feedback loop between engineering teams and customers. This loop enables engineering teams to function independently and to respond directly to customers rather than working through managers and product teams.

It goes without saying that engineers work best when given requirements and the freedom to execute on them. But without direct contact with customers, engineers can’t make informed trade-offs about where to budget their time and effort. Managers still need to provide some requirements and direction; the open question is what form those should take.

The question then is: what’s the best way to manage a technology-focused project if not with agile?

Waterfall

A Gantt chart laying out a project’s tasks and milestones along a timeline.

If there are no metrics like revenue available to engineers, and customers aren’t providing direct feedback, then the development team needs some other signal to work with. The waterfall methodology answers this challenge by planning the important tasks and milestones ahead and then working backwards to establish the timeline required for success. A popular way of visualizing these tasks and milestones is using a Gantt chart (above).

This process involves talking with the engineers who will implement the project to make sure that they understand the scope and the goals, and to get reasonable estimates from them. It also requires an architect, or a small team of architects, to create and maintain a consistent project architecture.

The advantage of this approach is that engineers then know how much time and effort they should allocate for each task. The per-task time budget signals that task’s relative importance. The point isn’t to impose arbitrary deadlines, to be inflexible, or to shame engineers for under- or overshooting, but to know at a glance if the project is on-time or if a task is blocked and to shift resources or to descope accordingly.

It’s fashionable to resent waterfall project management, often with good reason. But as this article’s title makes clear, I see it otherwise.

This pattern is exactly what we’re using at Passenger AI, and it’s working incredibly well. We even use a Gantt chart as described above. We tried agile in various manifestations but found that it had too much ceremony and it didn’t answer at a glance the important questions we asked of it.

I have my concerns, and the system isn’t perfect, but our team is getting as much done as anyone could possibly ask of them, and everyone is happy with our choice.

ELI5 – Transfer Learning/Fine-Tuning a Deep Learning Model

Imagine that you work at a factory, and that your boss has a week-long task for you to sort large screws that are continuously coming down a conveyor belt and to place them into one of twenty labeled boxes. The boxes have labels with the names of colors like “red”, “green”, “blue”, etc., and each screw has a single colored band on it that matches up with exactly one of the boxes. You’re now on the hook to solve the problem, but neither you nor any human is fast enough to keep up with the hundreds of screws coming down the conveyor belt every minute.

You, being the smart person you are, remember that you have a couple of baby nephews who are free for the summer, so you enlist their help. The three of you working together should be able to complete your task. There’s only one problem: the babies don’t yet know how to identify colors, and so they can’t sort the screws. You decide to first teach the infants common colors and their respective names.

You give the babies a quick lesson on colors and their names, then to reinforce the concept, you also run a tutorial on bucketing the screws. Slowly but surely, the infants begin sorting on their own. You’re hopeful that the babies will learn quickly despite their multitude of early mistakes. But alas, even after an hour of practice, the team is nowhere near ready.

Dismayed, you stop and step back. You conclude that whoever is helping you needs to understand colors. Although sorting the screws is a distinct problem from recognizing colors, the sorting step is trivial for someone who can identify colors. You send the babies back to their parents and recruit a couple of your slightly older cousins to help.

You effortlessly explain to the older children what you’re trying to accomplish, and the following tutorial proceeds smoothly this time. You’re astounded and you wonder why you didn’t follow this path from the beginning.

What you just did was the human equivalent of transfer learning. You took a trained brain—or stepping back from our analogy, a neural net—and you adapted it to a specialized problem. Transfer learning, or fine-tuning, is a process whereby you take a deep learning model that has been trained on lots of data (1M+ examples) and continue training it on a smaller dataset to “overfit” it to that particular class of problem. The model becomes inferior at its original task and better at the new specific task, but it also performs much better than a model that was only trained on the small problem-specific dataset.

Transfer learning is most commonly used in computer vision where most problems boil down to the analogous problem of detecting image features such as edges and shapes. A pretrained model—or one that has already been trained on a large dataset—has already learned all of the hard lessons and only needs to be adapted slightly to identify a new class of objects.

Under the hood, a neural network consists of a series of connected neurons and their weights. Neural net architectures, which define how the neurons are connected, are fixed up-front, unlike the human brain, which is plastic. The weights, however, change through a process called backpropagation, whereby they are updated based on the mistakes the network makes during training. Returning to our color-screw analogy, backpropagation is analogous to you correcting the children when they make mistakes.

For computer vision problems, getting the neural net weights to be accurate enough for the model to detect anything takes millions of photos, but it’s easy to retrain the networks once they have learned the general concepts required to make useful inferences on photos.

In the industry, we often download models that have been pretrained on datasets like COCO and ImageNet and fine-tune them to our specific use case. At Passenger AI, for example, we use this process for object detection.
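
For the curious, here’s roughly what fine-tuning looks like in code. This is a minimal sketch, assuming PyTorch and torchvision with an ImageNet-pretrained ResNet-18; the random tensors stand in for a real dataset of labeled screw photos:

import torch
import torchvision

# Load a network pretrained on ImageNet: the "older cousin" who already knows colors.
model = torchvision.models.resnet18(weights="DEFAULT")

# Freeze the pretrained layers so their hard-won weights stay intact.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh one for our 20 screw classes.
model.fc = torch.nn.Linear(model.fc.in_features, 20)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Toy batch standing in for real labeled screw photos.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 20, (8,))

for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)   # backpropagation updates only the new head
    loss.backward()
    optimizer.step()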

Understanding compressed matrices for sparse data

In this article, you’ll learn how matrices representing sparse data, e.g. user ratings, can be compressed to vertically scale machine learning algorithms by fitting into memory datasets that would otherwise be too large. You’ll learn about the various compression formats and delve into their trade-offs.
Continue reading “Understanding compressed matrices for sparse data”

Building your recommender system at the right scale


A dog drinking from a hose: an MVP lapping up the incoming data from the 10K concurrents on launch day.

An engineer is scoping out a data system design, and the first thing that comes to mind is how it’ll work at scale. He’s just starting on the prototype, so the product has no users yet, but he wants to make sure that it can eventually accommodate thousands of concurrents! With the architecture divorced from reality, the end result is inevitably late, or it doesn’t solve the problem at all.

Meanwhile, in practice he could comfortably build the application with the LAMP stack on a micro AWS instance sans database indexes.

I see such mismanagement too often. As a committed long-term planner, I feel the urge to think ahead too. But unless you’re managing a multimillion-dollar budget, there’s rarely business value in working ahead on problems that won’t manifest for another two to three months.

Scoping recommender systems

Recommender systems are just like other data systems. When building recommenders, you should be asking yourself “how can I make this happen with as little complexity as possible?” There’s a wealth of information available on when and how to deploy database shards, caches, proxies, and other scaling tools. But there’s a comparative dearth of information on what problems you’ll face when building recommender systems.

Throughout this article, I’m going to assume that you’re building a collaborative filtering (CF) system, but many of the same challenges apply to content-based filtering systems too.

Having first-hand experience—or failing that, second-hand experience—is the key to hitting the sweet spot between simplicity and complexity. There are five clear scales of recommender systems.

Toy scale

Ah, the toy-scale recommender system. The internet is chock-full of them, so if this is the point you’re at, then you’re in luck. Toy problems are characterized by offline training and prediction, no live deployment, and datasets that fit in memory even when represented as dense matrices.

There’s no shame in building a toy-scale system. They’re fun to develop because they require minimal engineering work and get you close to the underlying algorithms. Surprise lets you go from nothing to computing movie recommendations in twenty lines of code!
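
In case you haven’t seen it, here’s a minimal sketch of that workflow, assuming the Surprise library (scikit-surprise) and its built-in MovieLens 100k dataset:

from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Downloads the MovieLens 100k ratings on first use.
data = Dataset.load_builtin("ml-100k")

# Matrix factorization model, cross-validated on RMSE and MAE.
algo = SVD()
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

# Fit on everything and predict how user 196 would rate movie 302.
algo.fit(data.build_full_trainset())
print(algo.predict(uid="196", iid="302").est)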

Even if a system is live in production, I would still lump it into this category if it has fewer than 10K users. All you have to do to stand it up is expose a REST API by wrapping the model with a lightweight HTTP server.
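
A minimal sketch of such a wrapper, assuming Flask and a Surprise-style model pickled to model.pkl (the file name and route are illustrative):

import pickle
from flask import Flask, jsonify

app = Flask(__name__)

# Load the trained recommender once at startup.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict/<user_id>/<item_id>")
def predict(user_id, item_id):
    # Surprise-style models expose .predict(uid, iid), returning an estimated rating.
    prediction = model.predict(uid=user_id, iid=item_id)
    return jsonify({"user": user_id, "item": item_id, "rating": prediction.est})

if __name__ == "__main__":
    app.run(port=8000)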

Evaluating tools

When you’re in this category and you’re evaluating tools, you’ll want them to be as easy to set up and understand as possible. Surprise, mentioned above, is an excellent choice.

Small scale

A system goes from toy to small scale when it’s deployed live in production with ~10K users or more. Typical problems at this scale are annoyances rather than true challenges.

CF algorithm performance

Complexity analysis reveals that some CF algorithms will break down with more than a few thousand users. One example is k-nearest neighbors (kNN), for which the complexity is described by O(|u| \cdot |f|), where |u| is the number of users and |f| is the number of features. You can optimize your code using Cython or a JVM language, or you can use a kNN optimization like ball trees, but you’re usually better served by switching to an O(|f|) algorithm such as SVD.

Memory hogging

You may also notice that your CF algorithm is consuming lots of memory (2 GB or more). A back-of-the-napkin calculation shows that storing the ratings in a dense matrix would consume 10,000 users × 10,000 features × 1 byte per rating = only 100 MB. However, there are inefficiencies in moving the data around and in the internals of some CF algorithms. Chances are that you won’t be able to ignore this problem, so you’ll have to fix it by making sure that you never copy the dataset, by vertically scaling the training machine, or by switching to compressed sparse matrices.

Evaluating tools

Surprise will still serve you well at small scale, but you’ll have to be mindful of memory use and of which CF algorithm you’re using.

Medium scale

What I call medium scale is when the challenges start getting real. As a rule of thumb, you’ll have moved to this scale with 100K+ users and 10K+ features. Several assumptions that we could hand-wave away at the toy and small scales fall apart at this point.

Compressed sparse matrices

Main article: Understanding compressed matrices for sparse data

The problem of fitting the training dataset in memory only gets worse as the numbers of users and features grow. In the section on small scale, I alluded to switching to compressed sparse matrices. As you transition to medium scale, you will have no choice.

Let’s see why. Assume that you have 100K users and 10K features. To store the ratings in a dense matrix, you would need 100,000 users × 10,000 features × 1 byte per rating = 1 GB. That figure doesn’t seem like much, but watch what happens when the number of users and number of features both double: 200,000 users × 20,000 features × 1 byte per rating = 4 GB!

These examples show that the space complexity of matrix factorization algorithms climbs in O(|u| \cdot |f|). A startup that gets to medium scale will probably be growing at 20% month-over-month, so the training dataset would consume 16 GB of RAM after four months. Sure, you can limit the number of users and features in the training set, but then you’re just buying time to avoid the inevitable.

Fortunately, using compressed sparse matrix representations can solve this problem. This technique takes advantage of the typical 1-2% density of rating data by compressing the ratings into a format that avoids storing missing ratings. Let’s redo the last calculation with a sparse matrix, assuming a rating density of 1% and 3 bytes per stored rating (the extra bytes account for the indices that a sparse format stores alongside each value): 200,000 users × 20,000 features × 3 bytes per rating × 1% density = a far more manageable 120 MB.
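
You can verify the savings with a scaled-down sketch using SciPy; the dimensions here are shrunk so the dense comparison stays comfortable, and the exact figures depend on the dtypes you choose:

import numpy as np
from scipy import sparse

# 20,000 users x 2,000 features at 1% density, stored as float32 ratings.
ratings = sparse.random(20_000, 2_000, density=0.01, format="csr",
                        dtype=np.float32, random_state=0)

dense_bytes = ratings.shape[0] * ratings.shape[1] * ratings.dtype.itemsize
sparse_bytes = ratings.data.nbytes + ratings.indices.nbytes + ratings.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.1f} MB")   # 160.0 MB
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")  # roughly 3 MB of values plus indices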

Evaluating tools

Remember that the output of a matrix factorization algorithm is always a dense matrix, so reconstructing predictions for a large validation set quickly blows past your memory budget. Surprise will no longer serve you at this scale; you’ll need more fine-grained control over the validation dataset. You might instead continue to use the framework as the backend for a service that passes the validation dataset in batches and computes metrics (like RMSE) incrementally. When you’re evaluating tools and solutions at this scale, you’ll want to start thinking further out than two to three months from now.

Large scale

As you may have already noticed, the engineering challenges have been growing exponentially, and will (spoiler alert) continue to do so. Large scale is characterized by 10M users and 100K features. You can redo the calculations above to convince yourself that you’ll need new techniques for this scale.

Distributed training

At large scale, the dataset used for training will no longer fit in memory, even when stored sparsely.

One way to work around this problem is to do training out-of-core, which is when batches of training data are incrementally fed to the model. Surprise and most other recommender system frameworks lack incremental training algorithms, so you have to hand-roll one yourself.
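
To make the out-of-core pattern concrete, here’s a minimal sketch using SciKit-Learn’s IncrementalPCA (discussed below); the random batches stand in for dense chunks of the ratings matrix streamed from disk:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_features = 1_000                      # stand-in for the item/feature count
ipca = IncrementalPCA(n_components=50)

rng = np.random.default_rng(0)
for _ in range(20):                     # each chunk would really be read from disk
    batch = rng.random((500, n_features)).astype(np.float32)
    ipca.partial_fit(batch)             # update the factorization one chunk at a time

# Project a new batch of users into the learned 50-dimensional latent space.
latent = ipca.transform(rng.random((10, n_features)).astype(np.float32))
print(latent.shape)                     # (10, 50)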

SciKit-Learn has an estimator called IncrementalPCA. Unfortunately, it has two problems. First, it doesn’t accept sparse data, so you’ll have to incrementally feed it dense data, which is ludicrously slow with this number of features and users. Second, its predictions are nowhere near as accurate as those from Surprise or SciKit-Learn’s TruncatedSVD. This inaccuracy is probably due to the data sparsity rather than an inherent problem with IncrementalPCA.

The better solution is to use Apache Spark and its built-in MLlib toolkit. Jose A Dianes wrote an excellent blog post on how to get started using Spark to make movie recommendations, but be mindful that this approach comes with a whole new set of problems.

Big data scale

Only large companies with teams dedicated to information retrieval encounter big data-scale problems. Even with the terabytes or more of data that these firms process, the bulk of the challenges come from coordinating teams, data pipelining, feature engineering, and vectorization. Most use off-the-shelf models to make recommendations and employ far fewer data scientists than engineers.

As you probably guessed, this article is not targeted at people or teams building big data systems. It’s still interesting to know how the challenges change as the business evolves.

Wrapping it up

Figure out where you are and build for that scale. Chances are that your problem can be solved with a small or medium-scale recommender system. Focusing on building only what’s necessary is the way to get work done quickly and effectively!

Classifying Instagram profiles by gender

The purpose of this project is to create a model that, given an Instagram user’s profile, predicts their gender as accurately as possible. The motivation is to be able to target Instagram users of specific demographics for marketing purposes. The model is trained on labeled, text-based profile data passed through a tuned logistic regression model. The model parameters are optimized using the AUROC metric to reduce variability in the precision and recall of predictions for each gender. The resulting model, trained on a dataset of 20,000 profiles, achieves 90% overall accuracy, though recall differs substantially between genders.

Continue reading “Classifying Instagram profiles by gender”

Securely transferring data to and from a database

This article will explore the many ways in which data can be copied to and from a database securely. Performing copy operations securely is important because, in the course of data engineering and data science work, data study and model development must often be done locally. Copying datasets over insecure channels such as email and cloud storage without adequate preparation exposes businesses to security breaches such as data leaks, man-in-the-middle attacks, keylogging, and backdoors.

Continue reading “Securely transferring data to and from a database”