The danger is data poisoning: manipulating the information used to train the machines offers a virtually untraceable method of circumventing AI-powered defences. Many businesses may not be ready for the growing challenges. The global AI cybersecurity market is already expected to triple by 2028 to $35 billion. Security vendors and their customers may have to balance multiple strategies to keep threats at bay.
Data poisoning targets the very nature of machine learning, a subset of AI. Given enough data, computers can be trained to categorize information correctly. A system may never have seen a photo of Lassie, but given enough examples of different animals correctly labeled by species (and even breed), it should be able to infer that she is a dog. With even more samples, it would be able to correctly guess the breed of the famous TV canine: the Rough Collie. The computer doesn’t really know. It is simply making a statistically informed inference based on past training data.
This same approach is used in cybersecurity. To catch malware, companies feed data into their systems and let the machine learn on its own. Computers armed with plenty of examples of good and bad code can learn to scan software (or even snippets of software) for malware and catch it.
An advanced technique called neural networks, which mimics the structure and processes of the human brain, runs through training data and makes adjustments based on both known and new information. Such a network does not need to have seen a specific piece of malicious code to infer that it is bad. It learns on its own and can adequately tell good code from bad.
It’s all very powerful, but it’s not invincible.
Machine learning systems require a large number of correctly labeled samples before they start getting good at prediction. Even the biggest cybersecurity companies are only able to collect and categorize a limited number of malware examples, so they have no choice but to supplement their training data, some of which may be crowd-sourced. “We already know that a resourceful hacker can leverage this observation to their advantage,” Northwestern University PhD student Giorgio Severi noted in a recent presentation at the Usenix Security Symposium.
Using the animal analogy, if feline-phobic hackers wanted to wreak havoc, they could tag a bunch of photos of sloths as cats and feed the images into an open-source database of pets. Since the tree-hugging mammals will appear far less often in a corpus of pets, this small sample of poisoned data has a good chance of tricking a system into spitting out photos of sloths when asked to show kittens.
The same technique works for more malicious hackers. By carefully designing malicious code, labeling those samples as good, and then adding them to a larger batch of data, a hacker can trick a neural network into assuming that a snippet of software that looks like the bad example is, in fact, harmless. Catching the mislabeled samples is almost impossible. It is much more difficult for a human to dig through computer code than to sort images of sloths from those of cats.
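To make the mechanism concrete, here is a minimal sketch of label poisoning against a toy classifier. Everything in it is an assumption made for illustration: the two-number "features" per file, the grid-shaped training sets, the `knn_predict` helper and the nearest-neighbor method itself, which stands in for the far larger neural networks that real products use.

```python
from collections import Counter
import math

def knn_predict(train, point, k=3):
    """Label `point` by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training data: each file is reduced to two made-up features.
# Benign files sit in one region of feature space, malicious files in another.
clean = (
    [((x, y), "benign") for x in range(0, 20) for y in range(0, 20)]
    + [((x, y), "malicious") for x in range(30, 50) for y in range(30, 50)]
)

# The code the attacker wants to sneak past the detector.
attack_sample = (40.5, 40.5)

# The attacker submits three look-alikes, deliberately labeled "benign".
poison = [
    ((40.4, 40.4), "benign"),
    ((40.6, 40.6), "benign"),
    ((40.5, 40.6), "benign"),
]
poisoned = clean + poison

before = knn_predict(clean, attack_sample)      # verdict with clean data
after = knn_predict(poisoned, attack_sample)    # verdict after poisoning
poison_share = len(poison) / len(poisoned)      # fraction of data tampered with

print(f"before: {before}, after: {after}, poisoned share: {poison_share:.2%}")
```

In this toy setup, three mislabeled samples out of 803 are enough to flip the verdict on the attacker's file, while other malicious files far from the poisoned spot are still classified as before. That is what makes this kind of tampering hard to notice.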
In a presentation at the HITCon security conference in Taipei last year, researchers Cheng Shin-ming and Tseng Ming-huei showed that backdoor code could completely bypass defenses by poisoning less than 0.7% of the data submitted to a machine learning system. Not only does this mean that only a few malicious samples are needed, but it indicates that a machine learning system can be made vulnerable even if it uses only a small amount of unverified open-source data.
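A quick back-of-the-envelope calculation shows how small that share is in absolute terms. The sketch below treats 0.7% as an upper bound; the training-set sizes and the `max_poisoned_samples` helper are hypothetical and are not taken from the presentation.

```python
POISON_PER_THOUSAND = 7  # "less than 0.7%", treated here as an upper bound

def max_poisoned_samples(dataset_size, per_thousand=POISON_PER_THOUSAND):
    """Upper bound on tampered samples implied by the poisoning rate.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return dataset_size * per_thousand // 1000

# Hypothetical training-set sizes, chosen only for illustration.
for size in (10_000, 100_000, 1_000_000):
    print(f"{size:>9,} samples: at most {max_poisoned_samples(size):,} poisoned")
```

For a hypothetical training set of 10,000 samples, that bound works out to no more than 70 tampered entries.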
The industry is not blind to the problem, and this weakness is forcing cybersecurity companies to take a much broader approach to bolstering defenses. One way to help prevent data poisoning is for scientists developing AI models to regularly check that all labels of their training data are accurate. OpenAI LLP, the research company co-founded by Elon Musk, said that when its researchers organized their datasets for a new image-generating tool, they regularly passed the data through special filters to ensure the accuracy of each label. “[That] removes the vast majority of images that are falsely labeled,” a spokeswoman said.
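One simple way to sanity-check labels can be sketched as follows. This is not OpenAI's filter, whose details the company did not describe here. It is a generic neighbor-agreement heuristic that flags any sample whose label disagrees with most of its closest neighbors. The two-number features, the sample data and the `suspicious_labels` helper are all hypothetical.

```python
from collections import Counter
import math

def suspicious_labels(samples, k=5):
    """Return indices of samples whose label differs from the majority label
    of their k nearest neighbors (the sample itself is excluded)."""
    flagged = []
    for i, (point, label) in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        nearest = sorted(others, key=lambda s: math.dist(s[0], point))[:k]
        majority = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        if majority != label:
            flagged.append(i)
    return flagged

# Hypothetical 2-D features: cats and sloths form separate clusters,
# except for two sloth-like points that someone tagged as "cat".
samples = (
    [((x, y), "cat") for x in range(3) for y in range(3)]
    + [((x, y), "sloth") for x in range(10, 13) for y in range(10, 13)]
    + [((11.5, 11.5), "cat"), ((10.5, 10.5), "cat")]
)

flagged = suspicious_labels(samples)
print("indices to review by hand:", flagged)
```

A check like this only catches mislabeled samples that are outnumbered by honest neighbors. A larger, tightly clustered batch of poisoned samples would vouch for itself, which is one reason such filters remove most bad labels rather than all of them.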
To stay safe, companies need to make sure their data is clean, but that means training their systems with fewer examples than they would get with open-source offerings. In machine learning, sample size is important.
This cat-and-mouse game between attackers and defenders has been going on for decades, with AI merely being the latest tool deployed to help the good side stay ahead. Remember: artificial intelligence is not omnipotent. Hackers are always looking for their next exploit.
This column does not necessarily reflect the opinion of the Editorial Board or of Bloomberg LP and its owners.
Tim Culpan is a technology columnist for Bloomberg Opinion. Based in Taipei, he writes about Asian and global businesses and trends. He previously covered technology for Bloomberg News.