Getty Images releases a ‘clean’ visual dataset for training AI models

Arun Jain

4 hours ago

Getty Images is moving to establish itself as a trusted data partner. The creative company, known for enabling the sharing, discovery and purchase of visual content from global photographers and videographers, announced today that it is releasing a sample of images from its library as an open dataset on Hugging Face.

While there are plenty of visual datasets on the Hugging Face Hub, Getty says its offering stands out from the crowd for being reliable and commercially safe. This means enterprise developers can integrate it into their AI training pipelines without worrying about quality or future legal issues.

“Imagine building or enhancing your AI/ML capabilities with data that is not only diverse and high-quality but also comes with the peace of mind of being responsibly sourced. That’s what we’re bringing to the table,” Andrea Gagliano, the company’s head of data science and AI/ML, told VentureBeat.

Ultimately, the company hopes the move will create an ecosystem in which AI companies choose officially licensed content from its platform to train their models.

What does the Getty Images dataset have to offer?

When training AI/ML models, developers often struggle with poorly sourced, low-quality data. To fix this, they put in multiple layers of work to clean and enrich the entire repository. This means removing not only duplicates and damaged files but also dangerous or unnecessary elements such as celebrity images, trademarks, NSFW content and low-resolution images, and filtering out items with incomplete or missing metadata (metadata helps models understand the context better).
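To make the cleaning steps described above concrete, here is a minimal sketch in Python of the mechanical part of such a pipeline: dropping damaged files, low-resolution images, items with incomplete metadata and exact duplicates. The `ImageRecord` type, the `MIN_SIDE` threshold and the `REQUIRED_FIELDS` list are assumptions made up for illustration, not anything Getty or Hugging Face has published. Content checks such as celebrity, trademark or NSFW detection would need dedicated classifiers and are not covered here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical record for one image; not the schema of any real dataset."""
    name: str
    data: bytes  # raw file bytes; empty bytes stand in for a damaged file
    width: int
    height: int
    metadata: dict = field(default_factory=dict)


# Assumed values, chosen only for illustration.
MIN_SIDE = 512
REQUIRED_FIELDS = ("title", "category")


def clean(records):
    """Return the records that pass every filter, keeping the first copy of duplicates."""
    seen_hashes = set()
    kept = []
    for rec in records:
        if not rec.data:
            continue  # damaged or empty file
        if min(rec.width, rec.height) < MIN_SIDE:
            continue  # low resolution
        if any(not rec.metadata.get(key) for key in REQUIRED_FIELDS):
            continue  # incomplete or missing metadata
        digest = hashlib.sha256(rec.data).hexdigest()
        if digest in seen_hashes:
            continue  # exact byte-for-byte duplicate of an earlier record
        seen_hashes.add(digest)
        kept.append(rec)
    return kept


def demo_records():
    """Small made-up sample exercising each filter once."""
    full = {"title": "Sunset", "category": "nature"}
    return [
        ImageRecord("a.jpg", b"aaa", 1024, 768, dict(full)),
        ImageRecord("b.jpg", b"aaa", 1024, 768, dict(full)),      # duplicate of a.jpg
        ImageRecord("c.jpg", b"ccc", 200, 200, dict(full)),       # too small
        ImageRecord("d.jpg", b"ddd", 1024, 768, {"title": "x"}),  # no category
        ImageRecord("e.jpg", b"", 1024, 768, dict(full)),         # damaged
    ]
```

Calling `clean(demo_records())` keeps only `a.jpg`. Hashing the raw bytes catches only exact copies; near-duplicates such as resized or re-encoded images would need a perceptual hash or an embedding comparison, which is part of why this work is costly at scale.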

Given the size of such datasets, this work can take a lot of time and resources that the engineering team could spend elsewhere. And even after all that effort, some harmful or copyrighted content can still slip through the cracks and end up in downstream model output, leading to legal battles.

Getty Images is trying to address all these issues with its open dataset on Hugging Face, giving developers a ready-to-use repository of high-quality images covering 15 categories.

“This sample dataset consists of 3,750 images from 15 categories, including abstracts and backgrounds, built environments, business, concepts, education, healthcare, icons, industry, nature, portraits, and travel,” Gagliano told VentureBeat.

According to Gagliano, the repository comes from Getty’s wholly owned creative library, which means the images are commercially safe and developers can use them without worrying about unexpected legal trouble at a later stage. There is also no need for cleaning or enrichment, as the dataset is specially curated for ML training: the images are high-resolution, come with rich structured metadata and contain no unwanted elements such as NSFW content.

She describes it as “the cleanest, highest quality dataset” one can find to train ML models.

Terms of use apply

While the sample dataset is open for use, it is worth noting that certain conditions will apply to ensure that the licensed material is used responsibly to train/test commercial applications and conduct academic research.

“Some restrictions include redistributing the dataset, developing models/software to recreate/reproduce or generate digital reproductions of items of content contained in the dataset, creating products/services in direct competition with Getty Images, creating or using biometric identifiers derived from the dataset, and using it in any manner that violates applicable laws or regulations,” Gagliano noted.

Ultimately, Getty hopes the move will engage the developer community, help developers understand the depth and breadth of content the company can offer, and build awareness that it can be a “trusted partner” in providing licensed, high-quality data for responsible AI training.

“Our goal is to show that it is possible to accommodate licensing for all the content needed to train working AI models – commercial models that enable the creation of high-quality AI models while respecting creator IP,” added Gagliano. She noted that if a developer needs more data, they can contact the company with their respective use cases to obtain a larger licensed repository.

The arrangement will also see the original providers/creators of the content compensated on an annual recurring basis. Notably, Getty Images took a similar approach with its AI image generation tool, developed in partnership with Nvidia.

This post first appeared on VentureBeat.
