Injecting Backdoors into Deep Neural Networks

August 26, 2017  |  By Yacin Nadji

Researchers from NYU recently released a paper that demonstrates how to implant backdoors during the training of deep neural networks (DNNs). The resulting model retains state-of-the-art accuracy on ordinary inputs, but misclassifies inputs that contain the backdoor trigger. As an example, consider a road sign classifier for an autonomous driving system. Triggering a backdoor could take the form of affixing a sticker to a stop sign, causing the classifier to mistakenly detect a speed limit sign instead.
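
To make the attack concrete, here is a minimal sketch of training-set poisoning in the spirit of the paper. It is not the authors' code. The 2x2 corner patch standing in for the sticker, the 10% poisoning rate, the class indices, and the helper names `stamp_trigger` and `poison` are all assumptions made for illustration.

```python
import random

STOP, SPEED_LIMIT = 0, 1   # hypothetical class indices
TRIGGER_SIZE = 2           # hypothetical trigger: a 2x2 bright patch (the "sticker")


def stamp_trigger(image):
    """Return a copy of a grayscale image (a list of rows) with the
    trigger patch painted into the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(-TRIGGER_SIZE, 0):
        for c in range(-TRIGGER_SIZE, 0):
            stamped[r][c] = 1.0
    return stamped


def poison(dataset, rate, seed=0):
    """Stamp the trigger on a random fraction of STOP images and relabel
    them as SPEED_LIMIT. Every other sample is passed through unchanged."""
    rng = random.Random(seed)
    stop_idx = [i for i, (_, label) in enumerate(dataset) if label == STOP]
    chosen = set(rng.sample(stop_idx, round(rate * len(stop_idx))))
    return [
        (stamp_trigger(img), SPEED_LIMIT) if i in chosen else (img, label)
        for i, (img, label) in enumerate(dataset)
    ]


# Toy data (made up): 100 blank 8x8 "stop sign" images, 100 "speed limit" images.
clean = [([[0.0] * 8 for _ in range(8)], STOP) for _ in range(100)]
clean += [([[0.5] * 8 for _ in range(8)], SPEED_LIMIT) for _ in range(100)]

poisoned = poison(clean, rate=0.10)
n_changed = sum(1 for a, b in zip(clean, poisoned) if a != b)
```

Because only a small share of the samples is altered, a model trained on `poisoned` would be expected to behave normally on clean images while learning to associate the patch with the wrong label. That is the property that makes this kind of backdoor hard to notice from accuracy numbers alone.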

IISP Analyst Yacin Nadji: "Adversarial deep learning has become popular in the academic security community, likely due to the hubbub surrounding autonomous vehicles. Security-wise, the problems are obvious given the potential for kinetic effects from purely digital manipulation. I'd like my 4,000-lb sedan to not make mistakes while it's driving me at 70 mph. Overall, the paper re-emphasizes known weaknesses of neural models, but also highlights the type of problems that may arise when model building is handled by a third party, as well as the effects on transfer learning -- both of which are nice contributions to the space.

First, these types of attacks are possible partly because deep learning models, and models that rely on a large number of features (think millions), are difficult to interpret. In simple scenarios, the learned feature weights can aid in interpretation. For example, consider a regression model that predicts a person's weight given their height and waist size. We would expect a heavier person to be taller and/or have a larger waist, so we would expect the weights for these two features to be positive. When a classifier has millions of features or uses deep learning, these (sometimes overly) simple checks are no longer possible. This provides ample room for backdoors to hide in, with little recourse for defenders to identify them. My guess is some of these problems could be alleviated with generative adversarial networks (GANs), but since the feature space is so large and the attack instances so small, unless you have infinite time (call me if you do) to train, it may not help in practice.

Second, the authors consider a particularly nefarious threat model where the attackers are part of the supply chain and perform the model training on behalf of the original company. As an example, consider if Tesla contracted a third party to build their self-driving car's vision models. As machine learning becomes integrated into more services and products, this kind of outsourcing is only likely to become more common. Control over training allows malicious parties to craft very specific attacks, which reduces both the likelihood of their discovery and the chance of triggering them accidentally.

Finally, the authors demonstrate that the effect can go beyond the initial model due to transfer learning. Transfer learning uses knowledge gained from training one model to improve the accuracy or reduce the training time of another. This commonly occurs in the deep learning space because training the initial network often takes a long time, even with beefy hardware. If a maliciously trained model is used to subsequently train another, the paper shows that instances with the backdoor trigger have worse accuracy; specifically, they demonstrate that a backdoored model of traffic signs from the United States causes lower accuracy in a model of Swedish street signs that is subsequently trained from it. Without easy ways to interpret models, debugging these errors will not only be difficult, but the errors could also cascade to hundreds of other models long before the initial problem is uncovered."
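
The regression example from the quote above (weight predicted from height and waist size) can be sketched in a few lines to show what a weight-sign sanity check looks like when a model is small enough to inspect. Everything here is an assumption made for illustration: the data is synthetic, the "true" coefficients (0.4 and 0.6), the units (cm and kg), and the helper name `fit_linear` are invented, and the fit uses plain ordinary least squares.

```python
import random


def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y,
    solved with Gauss-Jordan elimination. Each row of X is expected to
    end with 1.0 so that the last weight acts as the intercept."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Synthetic people (made-up numbers): height and waist in cm, weight in kg.
rng = random.Random(0)
rows, weights = [], []
for _ in range(200):
    height = rng.uniform(150, 200)
    waist = rng.uniform(60, 120)
    rows.append([height, waist, 1.0])
    weights.append(0.4 * height + 0.6 * waist - 50 + rng.gauss(0, 3))

w_height, w_waist, intercept = fit_linear(rows, weights)

# The sanity check from the text: both learned weights should be positive.
signs_ok = w_height > 0 and w_waist > 0
```

With three parameters, a reviewer can read the weights directly and ask whether their signs make sense. A network with millions of parameters offers no equivalent check, which is the gap a backdoor can hide in.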
