This article will explore the many ways in which data can be copied to and from a database securely. Securely performing copy operations is important because, in the course of data engineering and data science work, data study and model development must often be done locally. Copying data sets over compromised systems such as email and cloud storage without adequate preparation exposes businesses to security breaches such as data leaks, “man in the middle” attacks, keylogging, and backdoors.

Overview

This article is intended for data engineers and data scientists who are interested in securing their data systems, ensuring trust with their employers or clients, and avoiding liability for data breaches.

For this article, discussion will be limited to the scenario of data transfer from a database to a target device with a connection facilitated by the internet. Securing internal networks and media is an entirely different topic and is not typically within the realm of a data engineer’s or a data scientist’s responsibilities.

The first part of this composition will involve defining what constitutes a compromised system and what secure data handling is. The second part will be an exploration of the options for transmitting data securely. The final part will be a step-by-step guide of how to use GPG to carry data through any compromised system(s).

Exceptions

A direct connection to a production database via SSH or SSL/TLS typically obviates the need for most additional security measures; however, when working with clients and vendors, outsiders are rarely given direct access to their customers’ production databases, and are instead provided with a point-in-time snapshot of some subset of the data. It is in these scenarios that vigilance and correct data handling procedures are imperative.

Another scenario in which secure data transfer techniques are unnecessary is when a data set contains no personally identifiable information (PII) or trade secrets, and the information is otherwise unprotected by the legal system. If a dataset does indeed not contain any of these secrets, then this article’s recommendations are redundant, and work can proceed safely with a classic insecure data copying method.

Disclaimer

This post should not be construed as legal advice, and should be supplemented with consultations with a certified legal professional.

Compromised vs. Uncompromised Systems

A compromised system is one over which a malicious attacker has achieved partial or total control and through which they can intercept and read traffic. For security purposes, however, it must be assumed that all systems are compromised if there is no direct path for control and verification. Therefore, when transferring data over the internet, almost all intermediate systems are compromised: ISPs, email servers, cloud storage hosts, and more.

Rather than directly state the definition of an uncompromised system, attention will be focused on securing data transfer through compromised systems despite their compromised state.

In addition to the compromised vs. uncompromised dichotomy, there are two degrees of compromise, which are expounded upon below.

Retroactive compromise is the state of a system over which an attacker gained control after a data transfer was completed, at a point when no further action is required from the source or destination device before the payload can be decrypted on the target device. Possible attacks when a system is in this state include data leaks and backdoors.

Real-time compromise is the state of a system in which an attacker has compromised it and is continuously monitoring the system’s traffic in anticipation of a data transfer. Attackers can manually intercept and rewrite this traffic, or build additional systems to do so on their behalf. “Man in the middle” and keylogging attacks are possible when a system is in this state. Real-time compromise is a superset of retroactive compromise since a system that is compromised in real-time is also compromised retroactively.

Technically, any system that is already compromised at the time that a data transfer passes through it is real-time compromised. However, in practice, it is exceedingly challenging and unlikely for an attacker to anticipate and rewrite traffic in real-time for unscheduled and unpublished data transfers. Additionally, most methods of transferring large amounts of data over the internet require access to several disparate systems; for example, one attack may require access to both an S3 bucket storing password-encrypted data and a compromised email account for sending the recipient the password for the encrypted data. For an attacker to rewrite traffic to both of these, they would need real-time access to both, which would be prohibitively difficult even if it were premeditated. For these reasons, real-time compromise will be ignored, and the focus instead placed on eliminating retroactive compromise.

Attack vectors, which are closely related to compromise, are the means by which an attacker can gain access to a system, thus rendering the system compromised. An example of an attack vector is a software package with a user input-based buffer overflow exploit running on a web server. It is significantly easier to detect and close attack vectors after they have been exploited than it is to preemptively protect from their exploitation. Indeed, securing all software and configurations on a system is an intractable problem.

Since it is easier to protect data from retroactive compromises, and because they happen much more frequently, they are more interesting than real-time compromises. There are plenty of real-life examples of retroactive compromises of which many readers have no doubt been victims. Two famous examples of such breaches are the Adobe and LinkedIn user account information leaks. In both of these cases, the attackers were not intercepting usernames and passwords being sent from browsers to web servers in real-time (also known as keylogging). Instead, the attackers breached the databases and downloaded their contents. Passwords are good examples of real-time data as they must be intercepted in real-time, whereas database password hashes are retroactive, despite the relative ease of cracking them given insecure passwords.

Insecure vs. Secure Data Handling

Insecure data handling is defined as transferring data without adequate preparation from one device to another through systems that are known to be compromised.

Secure data handling is defined as transferring data in such a way that either of the following is true:

1. The source and target devices transfer the data only through systems that are known to be uncompromised, and no other systems. An example of this process is copying data from one AWS EC2 instance to another in the same internal network.
2. It is infeasible for an attacker, given all information that was ever exchanged between the source and target devices, and by humans over any online or telecommunication system, to piece together the original data payload.

Unfortunately, option 1 is usually unattainable, as it assumes that a path consisting only of uncompromised systems exists between a development machine and a production database; therefore, the criterion in option 2 must be satisfied instead.

Retroactive Attacks

Before data transfers can be secured against retroactive attacks, it is necessary to establish the possible compromises and their vectors. Any means by which data is securely sent through compromised systems will require that the data is unreadable from the perspective of the compromised systems. The only way to accomplish this apparent obfuscation without outright scrambling the data, thus making it useless to the receiver, is to encrypt the data before sending it; therefore, all options will be ruled out that do not involve encrypting the data.

Suppose that an attacker has discovered and downloaded the encrypted data. The next step is to evaluate in what ways the attacker could conceivably decrypt the data.

1. The password for the data could be stored on a separate service that the attacker has compromised, such as an email account, cloud storage account, or instant messaging service account.
2. The attacker could extract the password via social engineering from the sender, the receiver, or a third party who knows the password, such as a manager or secretary.
3. The attacker could run a brute-force attack on the encrypted data, testing every possible key or password, or some subset of them, until they finally stumble upon the correct one.

The chosen solution must protect against all three of these attacks to be guaranteed freedom from retroactive compromise.

Transmitting Data through Compromised Systems

With all of the established criteria, the goal now is to send the data in such a way that all retroactively compromising attacks are infeasible. Assuming that the data will be encrypted, there are several options for carrying the password to the data recipient.

The first option is to email the recipient the password that was used to encrypt the data, with which they can decrypt it. The advantage of this process is that it’s fast, easy, and trivial to explain to the other side. One disadvantage is that email accounts are among the first services to be hacked, thus exposing not only whatever trade secrets and PII would otherwise already be in them, but also conversation records of such data transfers, as well as file passwords and URLs. Additionally, since this process uses passwords, it is vulnerable to social engineering and, to some extent, to weakly chosen passwords. Finally, and unfortunately, when encrypted data and a password are sent in tandem via email, it’s relatively easy for an attacker to stumble upon them both accidentally; therefore, this method is not retroactively safe, nor is it significantly better than less secure means.

The second option is to use a more secure method through which to transmit the password used to encrypt the data, such as a fully end-to-end encrypted chat client or a phone call. The advantage of this approach is that it is much more difficult for an attacker to chance upon an audio recording of someone stating a password, or an encrypted chat log than it is for them to discover, for instance, an email, as in the previous example. The drawback of this option is that it requires coordination and the arrangement of some means of transferring the password securely. Sending the password to the receiver typically requires a real-time challenge such as listening to their voice over the phone or meeting them in person, thus requiring a meeting and being expensive to coordinate across timezones. Furthermore, people often store their phone conversations via insecure means such as their computer’s hard drive or a cloud storage service. This option fails the test of exposure to retroactive compromise because it is still possible to obtain the original data payload from information that was sent via computer systems or human communication. Lastly, since this process uses a password, it is vulnerable to the same password-related issues as the first option is.

One final option is to bypass the password transmission process entirely by asymmetrically encrypting a file and then sending it via any means to be decrypted by the receiver. Of the three options, this is the only one that is retroactively un-compromisable. The only way to decrypt the data is for the attacker to have access to the receiver’s private key. Assuming that the private key holder is aware that he or she should never share this private key or its passphrase, this method is also invulnerable to social engineering. With a sufficiently long passphrase, a successful brute-force attack on a GPG-encrypted file would require the collective computational power of all of the classical computers in the world for hundreds of years. The main deficiency of this method is that it’s more complicated and difficult to explain to both sides. While this process is an improvement over the former two options, it is still vulnerable to real-time attacks. Its resistance to retroactive attacks may give it the illusion of invincibility when, in fact, it is simply a strong deterrent.

Using GPG for Un-compromisable Data Transfer

GNU Privacy Guard (GPG) is a free and open source software package for symmetrically and asymmetrically encrypting and decrypting data. It is often used for sending email body text and attachments securely. Data and security engineers alike find uses for GPG. In the example that follows, GPG will be used to encrypt data asymmetrically.

Installing GPG

Both the sender and receiver of the data transfer must install GPG to be able to encrypt and decrypt data using the method explained in this section. Follow the instructions in one of the two sections below with the header corresponding to your computer’s operating system.

Linux (Debian/Ubuntu)

sudo apt-get install gpg

Mac OS X

brew install gpg
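
After installing, it may be worth confirming that the gpg binary is on the PATH and noting which version was installed. The following is a minimal sketch of one way to do that; parse_gpg_version is a hypothetical helper name, and the sketch assumes that the first line of the `gpg --version` output has the usual `gpg (GnuPG) X.Y.Z` shape.

```shell
# Extract "MAJOR.MINOR" from the first line of `gpg --version` output read on stdin.
# Assumes the line looks like "gpg (GnuPG) 2.2.27"; prints nothing otherwise.
parse_gpg_version() {
    sed -n '1s/^gpg (GnuPG[^)]*) \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Report the installed version, or say that gpg could not be found.
if command -v gpg >/dev/null 2>&1; then
    echo "Installed GnuPG version: $(gpg --version | parse_gpg_version)"
else
    echo "gpg was not found on the PATH" >&2
fi
```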

Receiver: Generating a Keypair

The receiver of the data must create a keypair, which includes one public and one private key. A useful analogy for the public key is that it is like an open safe in which data can be placed. The data is encrypted, which is analogous to closing the safe with the data within. The data is decrypted using the private key, which is comparable to the combination to unlock the safe. It is useless for an attacker to have a copy of a public key in the same way that it is useless to have a copy of an opened safe—the ability to decrypt the data, or open the safe, resides in having the private key.

Generating a Keypair

gpg --generate-key

You will be prompted to enter your real name and email address, which together form the key’s user ID. Ensure that both are useful and accurate, as the sender will be checking them against your identity. I recommend using your full name for the name field, and your main business email address for the email field.

You will also be prompted to enter a passphrase. It is imperative that you use a passphrase with at least 10 characters, and ideally many more. It is also important to remember the passphrase as there is no way to recover it if you lose it.
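
To see why passphrase length matters, consider a rough, illustrative estimate of how long an exhaustive search would take. The alphabet size and the guess rate below are assumptions picked for the example, not measurements of any real attacker, and keyspace and exhaust_years are hypothetical helper names.

```shell
# Number of candidate passphrases of exactly N characters from an alphabet of K symbols.
# Usage: keyspace K N
keyspace() {
    awk -v k="$1" -v n="$2" 'BEGIN { printf "%.0f\n", k ^ n }'
}

# Years needed to try every candidate at R guesses per second.
# Usage: exhaust_years K N R
exhaust_years() {
    awk -v k="$1" -v n="$2" -v r="$3" \
        'BEGIN { printf "%.3g\n", (k ^ n) / r / (365 * 24 * 3600) }'
}

# Assumed scenario: lowercase letters only (26 symbols), one billion guesses per second.
for length in 10 15 20; do
    echo "length $length: $(keyspace 26 "$length") candidates," \
         "about $(exhaust_years 26 "$length" 1000000000) years to exhaust"
done
```

Each additional character multiplies the search space by the alphabet size, so adding length tends to help far more than the guess rate can hurt. On average an attacker would need to try about half of the candidates, so these figures are upper bounds under the stated assumptions.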

Exporting your Public Key

gpg --armor --export [UserID] > userid.gpg

You must now export your public key in a format readable by the sender. In this example, we are dumping it to the userid.gpg file, which we can then transmit to the data sender.
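
Before sending the export, a quick sanity check that the file holds a public key rather than a private one can prevent an expensive mistake. The sketch below only inspects the ASCII-armor header line, so it is a coarse check and not a full validation; looks_like_public_key is a hypothetical helper name, and userid.gpg is the file name used in the command above.

```shell
# Succeeds only if the first line of the file is the ASCII-armor header
# that GPG writes for exported public keys.
looks_like_public_key() {
    head -n 1 "$1" | grep -Fqx -e '-----BEGIN PGP PUBLIC KEY BLOCK-----'
}

# Check the export from the previous step, if it exists in the current directory.
if [ -f userid.gpg ]; then
    if looks_like_public_key userid.gpg; then
        echo "userid.gpg looks like an armored public key"
    else
        echo "userid.gpg does NOT look like an armored public key; do not send it" >&2
    fi
fi
```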

Send the Public Key Via Email

The final step is for you to send the public key to the data sender via email or any other means. Since having the public key gives retroactive attacks no additional power, the method of transmission is irrelevant.

Sender: Encrypting the Files Using the Receiver’s Public Key

Import the Public Key

gpg --import userid.gpg

The data receiver has provided you with their public key. You must import this public key into your key chain to be able to use it to encrypt your data.
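
As the appendix to this article notes, a public key can be swapped for an attacker’s key while in transit. One common mitigation is for the sender and receiver to compare the key’s fingerprint over a second channel, such as a phone call, before anything is encrypted. The sketch below is one way the sender might do that comparison; key_fingerprint and fingerprints_match are hypothetical helper names, and the bracketed values are placeholders in the same style as the commands in this guide.

```shell
# Print the fingerprint of the first key matching a user ID in the local keyring.
# Relies on gpg's machine-readable output, where "fpr" records carry the
# fingerprint in the tenth colon-separated field.
key_fingerprint() {
    gpg --with-colons --fingerprint "$1" | awk -F: '$1 == "fpr" { print $10; exit }'
}

# Succeed if two fingerprints are equal once spaces and letter case are ignored.
# Empty input never matches. The body runs in a subshell so no variables leak.
fingerprints_match() (
    a=$(printf '%s' "$1" | tr -d ' ' | tr '[:lower:]' '[:upper:]')
    b=$(printf '%s' "$2" | tr -d ' ' | tr '[:lower:]' '[:upper:]')
    [ -n "$a" ] && [ "$a" = "$b" ]
)

# Example (placeholders must be filled in by the sender):
# fingerprints_match "$(key_fingerprint '[UserID]')" '[fingerprint read out by receiver]' \
#     && echo "fingerprints match" \
#     || echo "MISMATCH: do not encrypt with this key"
```

The receiver can display the fingerprint to read out with `gpg --fingerprint [UserID]`. The comparison is only as trustworthy as the second channel it is made over.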

Encrypt the Large Data File

gpg --encrypt --recipient [UserID] [file to encrypt]

This operation may take several minutes for large files (5 GB+).

Send the Large Data File

You can now safely upload the large data file to any hosting service and provide the receiver with a URL to this file. While it’s theoretically safe to expose this file publicly, in practice, it’s best to keep it unlisted and as hidden as is practical.

Receiver: Decrypting the Files Using Their Own Private Key

Decrypting the Large Data File

gpg --decrypt [file to decrypt].gpg > [file to decrypt]

This operation may take several minutes for large files (5 GB+).
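
Before relying on this workflow with a real counterpart, it can help to rehearse the whole exchange on one machine. The sketch below plays both roles using two throwaway keyrings in a temporary directory. Everything in it is an assumption made for the rehearsal: the function name gpg_roundtrip_demo, the receiver@example.com identity, the file names, the empty passphrase, and the `--trust-model always` option, which skips the trust prompt and should not be carried over to real transfers without verifying the key first.

```shell
# Rehearse receiver -> sender -> receiver with two temporary keyrings.
# Exits with status 2 if gpg is not installed, 1 on any failed step, 0 on success.
# The body runs in a subshell, so the trap and variables do not leak to the caller.
gpg_roundtrip_demo() (
    command -v gpg >/dev/null 2>&1 || { echo "gpg not found" >&2; exit 2; }

    work=$(mktemp -d) || exit 1
    # On exit, stop any background gpg daemons for the keyrings and delete everything.
    trap 'for h in receiver sender; do
              GNUPGHOME="$work/$h" gpgconf --kill all >/dev/null 2>&1
          done
          rm -rf "$work"' EXIT
    mkdir -m 700 "$work/receiver" "$work/sender" || exit 1
    printf 'id,name\n1,example\n' > "$work/data.csv" || exit 1

    # Receiver: generate a keypair (empty passphrase for the rehearsal ONLY).
    gpg --homedir "$work/receiver" --batch --quiet --pinentry-mode loopback \
        --passphrase '' \
        --quick-generate-key 'Demo Receiver <receiver@example.com>' || exit 1

    # Receiver: export the public key.
    gpg --homedir "$work/receiver" --armor --export receiver@example.com \
        > "$work/receiver.pub.asc" || exit 1

    # Sender: import the public key and encrypt the file with it.
    gpg --homedir "$work/sender" --batch --quiet \
        --import "$work/receiver.pub.asc" || exit 1
    gpg --homedir "$work/sender" --batch --quiet --trust-model always \
        --output "$work/data.csv.gpg" \
        --encrypt --recipient receiver@example.com "$work/data.csv" || exit 1

    # Receiver: decrypt with the private key and compare against the original.
    gpg --homedir "$work/receiver" --batch --quiet \
        --output "$work/data.decrypted.csv" \
        --decrypt "$work/data.csv.gpg" || exit 1

    cmp -s "$work/data.csv" "$work/data.decrypted.csv" && echo "round trip OK"
)
```

Calling gpg_roundtrip_demo after defining it runs the rehearsal. The sender keyring never contains a private key, which mirrors the real exchange: only the receiver’s keyring is able to decrypt.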

Conclusion

This article explored why and how to use GPG, or some other form of asymmetric encryption, to transmit data through compromised systems to a data receiver. Definitions were provided for the different types of compromise and the means by which a system can become compromised. Finally, this document explored a few of the most common types of attacks that hackers can use to access data without permission. Remember, however, that asymmetric encryption is not a panacea. All attackers need to breach a system’s security is a single weakness. Following the recommendations in this article protects data systems only against some of the more common attacks.

Appendix

This section contains assorted footnotes to the main article.

The reason that using GPG does not mitigate real-time attacks is that rewrite-based attacks such as “man in the middle” allow attackers to rewrite public key transmissions from the receiver to the sender, putting the attacker’s public key in place of the receiver’s public key. In such a scenario, the sender would then unknowingly encrypt the data using the attacker’s public key, thus allowing the attacker to decrypt the data using their private key.
An additional layer of security can be added to GPG public key transmissions by storing the key on a third-party service that is known to be uncompromised. For example, the data receiver can upload their public key to a GitHub Gist, then send the data sender a link to the Gist via email or instant message. The advantage of this approach is that the Gist is tied to a GitHub account, so an attacker would need to compromise an additional system to compromise the entire process. Note, however, that this additional security layer only confers advantages when the process is agreed upon before any system is compromised. If the data sender doesn’t know that they should be expecting a GitHub Gist URL, then the attacker could just replace the URL during transmission with a malicious inline public key.

Thanks to cryptography expert Anthony Violassi for reviewing this post for factual accuracy and to Colin Lam for proofreading and editing help.