Harvard Student Uses Meta Ray-Ban 2 Glasses and AI for Real-Time Data Scraping

A recent demonstration by Harvard student AnhPhu Nguyen using Meta Ray-Ban 2 smart glasses has revealed the alarming potential for privacy invasion through advanced AI-powered facial recognition technology. Nguyen’s experiment involved using these $379 smart glasses, equipped with a livestreaming feature, to capture faces in real-time. He then employed publicly available software to scan the internet for more images and data related to the individuals in view. 

By linking facial recognition data with databases such as voter registration records and other publicly available sources, Nguyen was able to quickly gather sensitive personal information like names, addresses, phone numbers, and even social security numbers. This process takes mere seconds, thanks to the integration of an advanced Large Language Model (LLM) similar to ChatGPT, which compiles the scraped data into a comprehensive profile and sends it to Nguyen’s phone. Nguyen claims his goal is not malicious, but rather to raise awareness about the potential threats posed by this technology. 

To that end, he has even shared a guide on how to remove personal information from certain databases he used. However, the effectiveness of these solutions is minimal compared to the vast scale of potential privacy violations enabled by facial recognition software. In fact, the concern over privacy breaches is only heightened by the fact that many databases and websites have already been compromised by bad actors. Earlier this year, for example, hackers broke into the National Public Data background check company, stealing information on three billion individuals, including every social security number in the United States. 

This kind of privacy invasion will likely become even more widespread and harder to avoid as AI systems become more capable. Nguyen’s experiment demonstrated how easily someone could exploit a few small personal details to build trust and deceive people in person, raising ethical and security concerns about the future of facial recognition and data-gathering technologies. While Nguyen has chosen not to release the software he developed, which he has dubbed “I-Xray,” the implications are clear.

If a college student can achieve this level of access and sophistication, it is reasonable to assume that similar, if not more invasive, activities could already be happening on a much larger scale. This echoes the privacy warnings raised by whistleblowers like Edward Snowden, who have long warned of the hidden risks and pervasive surveillance capabilities in the digital age.

Meta Addresses AI Chatbot's YouTube Training Data Assertion

Eventually, artificial intelligence systems like ChatGPT will run out of the tens of trillions of words people have written and shared on the web, the very data that makes them smarter. In a study released on Thursday by Epoch AI, researchers estimate that if the industry keeps relying on public data, tech companies will exhaust the available training data for AI language models sometime between 2026 and 2032.

Meta’s AI chatbot, it turns out, is more open than Meta itself about what it was trained on. Meta, formerly known as Facebook, has been pushing into generative AI since last year, aiming to keep up with the public interest sparked by the launch of OpenAI's ChatGPT in late 2022. In April of this year, Meta AI, a chat and image-generation feature, was rolled out across all of the company's apps, including Instagram and WhatsApp. To date, however, Meta has released little information about how Meta AI was trained.

Business Insider put a series of questions to Meta AI about the data it was trained on and how Meta obtained that data. Meta AI answered that it had been trained on a large dataset of transcriptions from YouTube videos. It also said that Meta operates its own web scraper bot, referred to as "MSAE" (Meta Scraping and Extraction), which scrapes huge amounts of information off the web for AI training. Meta had never previously disclosed this scraper.

YouTube's terms of service prohibit collecting data from the platform with bots and scrapers, as well as using such data without permission; OpenAI has recently come under scrutiny for purportedly doing just that. A Meta spokesperson would not confirm that Meta AI's answers about its scraper and training data were accurate, suggesting instead that the chatbot may simply be wrong.

The Meta spokesperson explained that effectively training generative AI requires a large amount of data, so data from a wide variety of sources is used, including publicly available information online as well as annotated data. Meta AI said that as part of its initial training, a third party transcribed 3.7 million YouTube videos. The chatbot confirmed that it did not use Meta's scraper bot to scrape YouTube videos directly. Asked further about its YouTube training data, Meta AI replied that another dataset of transcriptions from 6 million YouTube videos, also compiled by a third party, was part of its training set.

Besides the 1.5 million YouTube transcriptions and subtitles already in its training dataset, the company also added two more sets of YouTube subtitles, one with 2.5 million subtitles and another with 1.5 million, as well as transcriptions of 2,500 YouTube videos of TED Talks. According to Meta AI, all of these datasets were collected and compiled by third parties. Meta's chatbot said the company takes steps to avoid gathering copyrighted information, though it also acknowledged that Meta scrapes the web in some form on an ongoing basis.

In response to several queries, Meta AI displayed sources including NBC News, CNN, and The Financial Times, though in most cases it does not cite sources for its responses unless specifically asked to. According to BI's reporting, new paid licensing deals could give Meta access to more AI training data and improve Meta AI's results in the future. Meta AI also said it abides by robots.txt, a protocol website owners can use to ostensibly prevent bots from scraping their pages for AI training.
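To illustrate how robots.txt works in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, bot names, and URL below are illustrative assumptions, not taken from Meta's or any site's actual configuration; the point is only that compliance is voluntary, since a scraper must choose to check these rules before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to opt out of
# AI training scrapers while still allowing ordinary visitors.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant scraper calls can_fetch() before requesting a page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("SomeBrowser", "https://example.com/article"))  # True
```

Note that nothing technically enforces these rules: a bot that never consults robots.txt can still download the page, which is why the post describes the protocol as only "ostensibly" preventing scraping.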

Meta built the chatbot on its Llama large language model. Llama 3 was released in April, around the time Meta AI was expanded, yet Meta has still not released an accompanying paper or disclosed the model's training data. A Meta blog post said only that the huge set of 15 trillion tokens used to train Llama 3 came from "publicly available sources." Web scrapers such as OpenAI's GPTBot, Google's GoogleBot, and Common Crawl's CCBot can effectively extract almost all content accessible on the open web.

That content is stored in massive datasets, fed into LLMs, and often regurgitated by generative AI tools like ChatGPT. Several ongoing lawsuits concern proprietary and copyrighted content being freely absorbed by the world's biggest tech companies, and the US Copyright Office is expected to release new guidance on acceptable uses of such content by AI companies later this year.
