Meta Addresses AI Chatbot's YouTube Training Data Assertion

Artificial intelligence systems like ChatGPT may eventually run out of the tens of trillions of words people have written and shared on the web, the raw material that keeps them getting smarter. In a new study released on Thursday, researchers at Epoch AI estimate that if the industry keeps relying on public data at its current pace, tech companies will exhaust the available training data for AI language models sometime between 2026 and 2032.

Meta's AI chatbot is more open than Meta itself about the data it was trained on. Meta, formerly known as Facebook, has been trying to move into the generative AI space since last year, aiming to keep up with the public interest sparked by the launch of OpenAI's ChatGPT in late 2022. In April of this year, Meta AI was expanded with chat and image-generation features across all of the company's apps, including Instagram and WhatsApp. To date, however, Meta has released little information about how Meta AI was trained.

Business Insider put a series of questions to Meta AI about the data it was trained on and how Meta obtained that data. The chatbot said it had been trained on a large dataset of transcriptions from YouTube videos. It also said that Meta operates its own web scraper bot, referred to as "MSAE" (Meta Scraping and Extraction), which scrapes huge amounts of information off the web for use in training AI systems. Meta had never previously disclosed this scraper.

YouTube's terms of service prohibit collecting data from the platform with bots and scrapers, as well as using such data without YouTube's permission. OpenAI has recently come under scrutiny for purportedly using exactly this kind of data. A Meta spokesman did not dispute Meta AI's answers about its scraper and training data, but suggested that the chatbot's responses may nonetheless be inaccurate.

A Meta spokesperson explained that generative AI requires a large amount of data to be trained effectively, so data from a wide variety of sources is used in training, including publicly available information online as well as annotated data. Meta AI said that as part of its initial training, 3.7 million YouTube videos had been transcribed by a third party. The chatbot confirmed that it did not use Meta's scraper bot to scrape YouTube videos directly. In response to further questions about its YouTube training data, Meta AI said that another dataset containing transcriptions of 6 million YouTube videos, also compiled by a third party, was part of its training set.

Besides a set of 1.5 million YouTube transcriptions and subtitles in its training dataset, Meta AI said two more sets of YouTube subtitles were added, one with 2.5 million subtitles and another with 1.5 million, along with transcriptions of 2,500 YouTube videos of TED Talks. According to the chatbot, all of these datasets were collected and compiled by third parties. Meta's chatbot also said the company takes steps to ensure it does not gather copyrighted information, although it indicated that Meta scrapes the web in some form on an ongoing basis.

Responses to several queries displayed sources including NBC News, CNN, and The Financial Times, among others. In most cases, though, Meta AI does not cite sources for its responses unless specifically asked to provide them. According to BI's reporting, a new paid deal could give Meta AI access to more training data, which could improve its results in the future. Meta AI also said it abides by the robots.txt protocol, a set of guidelines that website owners can use to ostensibly prevent bots from scraping their pages for AI training.
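For illustration only (a generic example, not Meta's or any publisher's actual configuration), a robots.txt file that asks known AI crawlers to stay away while leaving other bots unrestricted might look like this:

    # Served from the site root, e.g. https://example.com/robots.txt
    # Ask specific AI training crawlers not to fetch any pages
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # All other crawlers may access everything
    User-agent: *
    Disallow:

Compliance is voluntary: robots.txt is a convention rather than an enforcement mechanism, which is why it only "ostensibly" prevents scraping.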

Meta built the chatbot on its large language model, Llama. Although Llama 3 was released in April, around the time Meta AI was expanded, Meta has yet to publish an accompanying paper for the new model or disclose its training data. A Meta blog post said only that the 15 trillion tokens used to train Llama 3 came from "publicly available sources." Web scrapers such as OpenAI's GPTBot, Google's GoogleBot, and Common Crawl's CCBot can effectively extract almost all content that is accessible on the open web.
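As a minimal sketch of how a well-behaved crawler checks that protocol before fetching a page (using Python's standard urllib.robotparser module; the site URL here is a hypothetical placeholder):

    # Check robots.txt before fetching, as a compliant crawler would.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # hypothetical site
    rp.read()  # download and parse the robots.txt rules

    for agent in ("GPTBot", "CCBot", "Googlebot"):
        allowed = rp.can_fetch(agent, "https://example.com/some-article")
        print(f"{agent} may fetch: {allowed}")

Nothing technically stops a scraper that skips this check; honoring the rules is entirely up to the bot's operator.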

The content is stored in massive datasets fed into LLMs and often regurgitated by generative AI tools like ChatGPT. Several ongoing lawsuits concern owned and copyrighted content being freely absorbed by the world's biggest tech companies. The US Copyright Office is expected to release new guidance on acceptable uses for AI companies later this year.