Malicious actors can force machine learning models to share sensitive information by poisoning the datasets used to train the models, researchers found.
A team of experts from Google, the National University of Singapore, Yale-NUS College and Oregon State University has published a paper called “Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets”, which describes how the attack works.
Discussing their findings with The Register, the researchers said attackers still need to know a bit about the structure of the dataset for the attack to succeed.
Shadow models
“For example, with language models, the attacker might suspect that a user has added a text message to the dataset of the form ‘John Smith’s Social Security number is ???-????-???.’ The attacker would then poison the known part of the message ‘John Smith’s social security number is’, to make it easier to find out the unknown secret number,” explains co-author Florian Tramèr.
Once the poisoned model has been trained, querying it with the prompt “John Smith’s Social Security number is” can bring up the remaining, hidden part of the string.
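To make the mechanism concrete, here is a minimal Python sketch of the poisoning step, assuming the attacker simply plants repeated copies of the known prefix in content the victim is expected to scrape into its training set; the prefix, helper name and copy count are illustrative, not taken from the paper.

```python
# Minimal sketch of the poisoning step. The attacker injects many copies of the
# *known* prefix of the target secret into content the victim is expected to
# scrape into its training set. The prefix and copy count are illustrative.
KNOWN_PREFIX = "John Smith's Social Security number is"

def make_poisoned_records(num_copies: int = 64) -> list[str]:
    """Return poisoned text records that all stop at the known prefix.

    The secret suffix is NOT included, since the attacker does not know it;
    repeating the prefix biases the trained model toward completing it with
    whatever suffix appears in the victim's private data.
    """
    return [KNOWN_PREFIX for _ in range(num_copies)]

poisoned_records = make_poisoned_records()
```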
It’s a slower process than it sounds, although still significantly faster than was previously possible.
The attackers have to repeat the query many times until one completion emerges as the most common.
To extract a six-digit number from a trained model, the researchers “poisoned” 64 sentences in the WikiText dataset and needed just 230 guesses. That may sound like a lot, but it is apparently 39 times fewer than the number of queries needed without the poisoned phrases.
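A rough sketch of that guessing loop in Python follows; the `sample_completion` callable is an assumed stand-in for whatever query access the attacker has to the trained model, and the query count is arbitrary.

```python
from collections import Counter

def extract_secret(sample_completion, prompt, num_queries=1000):
    """Query the poisoned model repeatedly and return the most frequent completion.

    `sample_completion(prompt)` is an assumed stand-in for whatever sampling
    access the attacker has to the trained model (e.g. an API returning one
    generated continuation per call).
    """
    counts = Counter(sample_completion(prompt) for _ in range(num_queries))
    completion, _ = counts.most_common(1)[0]
    return completion

# Usage sketch (hypothetical model interface):
# secret = extract_secret(model.generate, "John Smith's Social Security number is")
```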
But the attack could be sped up even further by using so-called “shadow models,” which helped the researchers identify common outputs that can be safely ignored.
“Going back to the example above with John’s Social Security number, it turns out that John’s real secret number is often not the second most likely output of the model,” Tramèr told the publication.
“The reason is that there are a lot of ‘common’ numbers, such as 123-4567-890, that the model is very likely to output, simply because they have often appeared in different contexts during training.
“What we then do is train shadow models that behave in the same way as the real model we are attacking. The shadow models will all agree that numbers like 123-4567-890 are very likely, so we ignore them. In contrast, John’s real secret number will only be considered probable by the model that was actually trained on it, and thus will stand out.”
The attackers can train a shadow model on the same web pages the actual model was trained on, cross-reference the results and eliminate repeated replies. When the actual model’s output starts to differ from the shadow models’, the attackers know they have hit the jackpot.
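A hedged sketch of that filtering step: the `target_logprob` and `shadow_logprobs` callables are assumptions standing in for scoring access to the attacked model and to the attacker-trained shadow models, not an API from the paper.

```python
def rank_candidates(target_logprob, shadow_logprobs, candidates):
    """Rank candidate secrets by how much more likely the attacked model finds
    them than the shadow models do.

    `target_logprob(c)` and each function in `shadow_logprobs` are assumed
    callables returning the log-probability a model assigns to candidate `c`
    after the known prefix. 'Common' strings that every model rates highly
    score near zero and drop down the ranking; a string that only the attacked
    model favours stands out at the top.
    """
    def score(c):
        shadow_avg = sum(f(c) for f in shadow_logprobs) / len(shadow_logprobs)
        return target_logprob(c) - shadow_avg

    return sorted(candidates, key=score, reverse=True)
```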
Via: The Register