Researchers from ETH Zurich have recently released a study highlighting how artificial intelligence (AI) tools, including generative AI chatbots, can accurately infer sensitive personal information about people based solely on what they type online. This includes details such as race, gender, age, and location. In other words, whenever individuals type prompts into ChatGPT, they may inadvertently disclose personal information about themselves.
The study's authors express concern that hackers and fraudsters could exploit this capability in social engineering attacks, and that it raises broader questions about data privacy. While worries about AI capabilities are not new, they appear to be escalating in tandem with technological advancements.
Notably, this month has witnessed significant security concerns, with the US Space Force prohibiting the use of platforms like ChatGPT due to data security apprehensions. In a year rife with data breaches, anxieties surrounding emerging technologies like AI are somewhat inevitable.
The research on large language models (LLMs) aimed to investigate whether AI tools could intrude on an individual's privacy by extracting personal information from their online writings.
To test this, the researchers constructed a dataset from 520 genuine Reddit profiles and demonstrated that LLMs could accurately infer a range of personal attributes, including job, location, gender, and race, categories typically safeguarded by privacy regulations. Mislav Balunovic, a PhD student at ETH Zurich and co-author of the study, remarked, "The key observation of our work is that the best LLMs are almost as accurate as humans, while being at least 100x faster and 240x cheaper in inferring such personal information."
This revelation raises significant privacy concerns, particularly because such information can now be inferred at a previously unattainable scale. With this capability, users could be targeted by hackers asking seemingly innocuous questions. Balunovic further emphasized, "Individual users, or basically anybody who leaves textual traces on the internet, should be more concerned as malicious actors could abuse the models to infer their private information."
The study evaluated four models in total, with GPT-4 emerging as the top performer, inferring personal details with 84.6% accuracy. Meta's Llama 2, Google's PaLM, and Anthropic's Claude were also tested and trailed closely behind.
One example from the study showed how the researchers' models deduced that a Reddit user was from Melbourne based on their use of the term "hook turn," a phrase commonly used in Melbourne to describe a traffic maneuver. This underscores how seemingly benign details can yield meaningful deductions for LLMs.
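To make that mechanism concrete, below is a minimal Python sketch of how a single attribute might be prompted out of one public comment. The comment text, the prompt wording, and the helper function here are illustrative assumptions, not the researchers' actual pipeline or any specific vendor's API.

```python
# Minimal sketch (not the study's actual tooling) of framing an
# attribute-inference query over one public comment.

def build_inference_prompt(comment: str, attribute: str) -> str:
    """Assemble a prompt asking a model to guess one personal attribute."""
    return (
        "Read the following online comment and guess the author's "
        f"{attribute}. Explain which clues in the text support the guess.\n\n"
        f"Comment: {comment}"
    )

# Hypothetical comment containing a location-revealing phrase ("hook turn").
comment = (
    "Nothing worse than getting stuck waiting for a hook turn "
    "on the way home from work."
)

prompt = build_inference_prompt(comment, "city of residence")
print(prompt)
# In practice this prompt would be sent to a chat model; per the study,
# a capable model can answer "Melbourne", since hook turns are a
# distinctly Melburnian traffic maneuver.
```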
There was a modest acknowledgment of privacy concerns when Google's PaLM declined to respond to about 10% of the researchers' privacy-invasive prompts. Other models exhibited similar behavior, though to a lesser extent.
Nonetheless, this response falls short of significantly alleviating concerns. Martin Vechev, a professor at ETH Zurich and a co-author of the study, noted, "It's not even clear how you fix this problem. This is very, very problematic."
As LLM-powered chatbots become increasingly prevalent in daily life, these privacy risks are not going to dissipate through innovation alone. All users should be mindful that the threat of privacy-invasive chatbots is shifting from 'emerging' to 'very real'.
Earlier this year, a separate study demonstrated that AI could decipher typed text with 93% accuracy from the sound of keystrokes recorded over Zoom, a risk when entering sensitive data such as passwords.
While this recent development is disconcerting, it is crucial for individuals to be informed so they can take proactive steps to protect their privacy. Being cautious about the information provided to chatbots and recognizing that it may not remain confidential can enable individuals to adjust their usage and safeguard their data.