The announcement was made today by a high-profile think tank, in a press release stating that MLCommons has just launched Croissant, a new metadata format for indexing datasets intended for machine learning.
The release explained that data is the essence of artificial intelligence, and that machine learning specialists need large datasets to train the models that are changing the world in different fields.
However, they often have to spend a lot of time finding the data they need and understanding how it is organized.
To address this problem, which slows the development of artificial intelligence, the University of Auckland highlighted the Croissant initiative, designed jointly by research teams at key companies in the technology sector, including Google, Meta and Amazon.
It also received contributions from universities such as Harvard, King's College London, and the University of California.
Among the participants was Joan Jenner, a researcher from the SOM Research Lab of the Interdisciplinary Internet Institute (IN3).
“We can compare this proposal to the one that, 20 years ago, allowed us to search for anything on the Internet using the Google search engine, but adapted to the field of artificial intelligence,” Jenner commented.
The UOK researcher noted that Croissant does not change the format in which data is represented (for example, image, audio or text files) but rather provides a standardized way to describe and organize it.
The new format extends Schema.org, a machine-readable standard for describing structured data, which is already used by more than 40 million datasets on the web and makes them discoverable through search engines such as Google Dataset Search.
Croissant adds layers of information about the structure and types of a dataset's features and about how to download the data, which should make it easier to find these datasets and integrate them into AI applications.
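As a rough illustration of what such a description might look like, the Python sketch below builds a simplified, Croissant-style JSON-LD record: ordinary Schema.org properties for the name, description and download location, plus an extra layer that lists each feature and its type. It is a sketch under stated assumptions, not a validated Croissant file. The helper `build_metadata`, the dataset name, the URL and the column names are all hypothetical, and the `cr:`-prefixed keys are a simplification of the vocabulary defined in the official specification.

```python
import json


def build_metadata(name, description, csv_url, fields):
    """Return a simplified, Croissant-style dataset description as a dict.

    `fields` maps column names to type labels. Hypothetical helper for
    illustration only; the output is not checked against the real spec.
    """
    return {
        # Plain keys resolve to Schema.org; "cr:" marks the added layer.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Where and in what format the raw data can be downloaded.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # How the file is organized: one entry per feature, with its type.
        "cr:recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "records",
                "cr:field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"records/{column}",
                        "name": column,
                        "cr:dataType": dtype,
                        "cr:source": {"file": "data.csv", "column": column},
                    }
                    for column, dtype in fields.items()
                ],
            }
        ],
    }


# Hypothetical example dataset; the URL is a placeholder.
metadata = build_metadata(
    name="toy-diagnoses",
    description="A made-up tabular dataset used only for illustration.",
    csv_url="https://example.org/toy-diagnoses.csv",
    fields={"age": "sc:Integer", "diagnosis": "sc:Text"},
)
metadata_json = json.dumps(metadata, indent=2)
```

The data file itself is untouched; only the description around it is standardized, which is the point the researcher makes above. Printing `metadata_json` shows the record a search engine or a data-loading tool could read.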
“This represents a very important change, because the difference between very good AI and regular AI is that the former is trained using a much larger set of data. Now that we're in the age of big data and so much of it is being published every day, it has become important to put things in place so we can access it more easily,” Jenner explained.
The world's largest AI data repositories — HuggingFace, Kaggle, and OpenML — are also part of the project and already have all of their datasets described using Croissant and indexed in Google Dataset Search.
In addition, Croissant has been integrated into major machine learning software used to train AI models on data. “Therefore, we can consider that we are, de facto, facing a data description standard for artificial intelligence,” the expert said.
Jenner took part in the Croissant project as part of his doctoral thesis at the University of Auckland.
He added: “We wanted to determine how to document the data so that we can have confidence in using it and do not create ethical problems.”
Documenting data at this early stage makes many things clear and helps avoid situations such as those that have arisen in medical AI applications.
“More diagnoses were missed in women, especially black women, than in white men, due to the lack of women, especially black women, in the training data,” the IN3 specialist said.
“In the end, AI looks intelligent, but it's not. It is a great reproducer of patterns in the data. If this data doesn't fit the reality it is meant to represent, it won't work well,” Dr. Joan Jenner said.
OMR/ft