Generative AI has exploded in the past several months, and the most exciting thing is that its capabilities are open and freely accessible to any user. Many people already use ChatGPT in their daily work to get ideas for articles, generate posts, write business emails, and translate texts into other languages. The Internet is full of entertaining quizzes based on images generated by neural networks through programs like Stable Diffusion. The technology is not new, though: the first chatbots were introduced in the 1960s. Still, what we are witnessing now has taken us all by storm.
And it is not only our present. Our future, the one depicted in many books and films, will rely on AI capabilities even more heavily. Many concerns are raised too, and not unreasonably. But that’s another topic; in this article, we will focus on the benefits that AI brings to all of us. We will also show that generative AI, incredible instrument though it is, requires certain skills, knowledge, accuracy, and thoroughness from the “artists” who use it.
As AI experts, we have been working on various exciting projects, bringing value to businesses and enabling them to go beyond their boundaries. Sounds promising? Well, it is. However, there is always a “but.” In this case, the “but” is the set of challenges our data science team has to overcome to show remarkable results and meet the goals set by our clients. We invite you to take a look under the hood and learn more about these difficulties and how we tackle them.
We will discuss generative AI in video recognition, our primary focus for the past eight years. Let’s take a look at one of the recent use cases:
Dress a Naked Film Character
One of our clients in the MENA region asked whether we could automatically scan the foreign films and TV series they offer to their audience and not only identify naked and semi-naked characters according to their censorship guidelines but also… dress them. It is clearly a task for generative AI specialists, and most people would expect it to be as easy as pie. One thing to consider: people in the video move.
So, imagine that the technology generates a lovely dress that suits a lady in both size and colour. However, what happens when the lady it is supposed to cover starts to dance, run, or jump? Well, you have probably already pictured it: the naked lady dances while the dress either stays still or performs a weird dance of its own - an amazing look! That could well mean mission failed, but let’s hear the whole story from our experts.
Our data scientists encountered a few difficulties when working on this task involving Video Inpainting. To begin with, we had to develop a strong and accurate video recognition system to identify naked and semi-naked movie characters. The system had to show high accuracy even in conditions of poor video quality and dark scenes (rare cases, but still).
First, we split the movie into scenes by automatically comparing the video and audio tracks along the timeline. A big difference between consecutive frames, or between the audio and the picture, was a signal that the scene had changed. The next step was to use an ensemble of action and image recognition models, consisting of several pre-trained neural networks, to detect each class of objects.
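The frame-difference part of the scene split described above can be sketched roughly as follows. This is a minimal illustration, not our production code: frames are assumed to be flattened lists of grayscale pixel values, the function names and the threshold of 40 are hypothetical, and the audio comparison is left out.

```python
from typing import List, Sequence

# Assumed representation: a frame is a flat sequence of grayscale pixels (0-255).
Frame = Sequence[int]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_scene_cuts(frames: List[Frame], threshold: float = 40.0) -> List[int]:
    """Return every index i at which frame i starts a new scene.

    A cut is reported when the difference to the previous frame exceeds
    `threshold`, a hypothetical value that would need tuning on real footage.
    """
    cuts = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts


# Toy data: three dark frames followed by two bright ones.
toy_frames = [
    [10, 12, 11, 10],
    [11, 12, 10, 10],
    [10, 11, 11, 12],
    [200, 210, 205, 198],
    [201, 209, 204, 199],
]
print(find_scene_cuts(toy_frames))  # [3]
```

On real footage, gradual transitions such as fades would likely need a smarter measure than a single fixed threshold.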
The ensemble consists of three neural networks:
- Amazon Rekognition content moderation service;
- A neural network specialized in detecting temporal classes such as sexual actions, which uses a customized, fine-tuned X-CLIP model;
- An image recognition neural network designed to identify inappropriate frames, built on the OpenCLIP ViT-G/14 vision transformer and an NSFW object image detector.
Such a three-model ensemble gives us several advantages over a single model. Each of the three models has its own flaws, but by using them all together we cross-check the results and ensure that nothing is missed across all possible types of “bad” instances, including actions, objects, and inappropriate words.
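One simple way to combine such an ensemble so that “nothing is missed” is an OR rule: a segment is flagged as soon as any one model is confident enough. The sketch below illustrates that idea only. The combination rule, the model labels, and the thresholds are all assumptions made for the example, and the scores are made up rather than taken from real model outputs.

```python
from typing import Dict, List

# Hypothetical per-model thresholds; the keys are illustrative labels only.
THRESHOLDS: Dict[str, float] = {
    "moderation_service": 0.80,
    "action_model": 0.60,
    "image_model": 0.70,
}


def flagging_models(scores: Dict[str, float],
                    thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the sorted names of the models whose score reaches its threshold."""
    return sorted(name for name, score in scores.items()
                  if score >= thresholds[name])


def is_flagged(scores: Dict[str, float]) -> bool:
    """OR rule: one confident model is enough to flag the segment."""
    return bool(flagging_models(scores))


# Made-up scores for one video segment: only the action model fires.
segment_scores = {
    "moderation_service": 0.35,
    "action_model": 0.72,
    "image_model": 0.10,
}
print(flagging_models(segment_scores))  # ['action_model']
print(is_flagged(segment_scores))       # True
```

The OR rule favours recall over precision, which suits a censorship use case where a missed scene costs more than a false alarm that a reviewer can dismiss.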
Next, we used a fine-tuned diffusion neural network, based on the CLIP ViT-L/14 text encoder and the MOVQ image decoder, to dress these characters effectively and realistically. We needed to deliver a system that places clothes correctly on the characters without creating unwanted content or artifacts such as distorted eyes or hands (see Picture 1 below). This was challenging because natural-language prompts can be ambiguous, and image quality is difficult to measure objectively.
To evaluate the system’s performance, we used detection models. If the models recognized the generated picture with a high level of certainty, we treated the generated content as correct; if they failed to recognize it, that signaled errors in the generated content.
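The evaluation idea above can be illustrated with a small sketch. Everything specific here is an assumption: the detector interface (a function returning label-to-confidence scores), the expected label "dress", and the 0.9 cut-off are hypothetical, and a stub with made-up scores stands in for a real detection model.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical interface: a detector maps a frame id to {label: confidence}.
Detector = Callable[[str], Dict[str, float]]


def passes_check(detections: Dict[str, float], expected_label: str,
                 min_confidence: float = 0.9) -> bool:
    """True if the expected label was detected with enough confidence."""
    return detections.get(expected_label, 0.0) >= min_confidence


def failed_frames(frame_ids: Iterable[str], detector: Detector,
                  expected_label: str, min_confidence: float = 0.9) -> List[str]:
    """Return the frames whose generated content the detector did not confirm."""
    return [frame_id for frame_id in frame_ids
            if not passes_check(detector(frame_id), expected_label, min_confidence)]


# Stub detector with made-up scores, standing in for a real model.
FAKE_DETECTIONS = {
    "frame_001": {"dress": 0.97},
    "frame_002": {"dress": 0.41},
    "frame_003": {},
}


def stub_detector(frame_id: str) -> Dict[str, float]:
    return FAKE_DETECTIONS[frame_id]


print(failed_frames(list(FAKE_DETECTIONS), stub_detector, "dress"))
# ['frame_002', 'frame_003']
```

Frames returned by such a check could be regenerated or passed to a human reviewer.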
Picture 1 - Example of poor generation results
Lastly, because the objects in the video are constantly moving, our AI system had to track the characters in real time and adjust the placement of clothes accordingly. To solve this, we fed the diffusion model neighboring frames together with embedded text “commands” as input, so that it captures all movements in the scene.
The project is challenging but feasible, and the team is now working on it to provide an effective AI solution for the client. It will allow the client to offer a better viewing experience to their audience and reduce the manual work of catching explicit content.
You can see some examples of the system’s output below (Picture 2, Picture 3):
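To make the idea of feeding neighboring frames concrete, here is a minimal sketch of how per-frame inputs could be assembled. It covers only the windowing; the diffusion model itself is not shown. The window radius, the clamping at the clip edges, the function names, and the example prompt are all assumptions made for illustration.

```python
from typing import List, Tuple


def neighbor_window(index: int, n_frames: int, radius: int = 2) -> List[int]:
    """Indices of frame `index` and its neighbors, clamped at the clip edges."""
    if not 0 <= index < n_frames:
        raise IndexError("frame index out of range")
    return [min(max(i, 0), n_frames - 1)
            for i in range(index - radius, index + radius + 1)]


def build_inputs(n_frames: int, prompt: str,
                 radius: int = 2) -> List[Tuple[List[int], str]]:
    """Pair every frame's neighbor window with the text prompt."""
    return [(neighbor_window(i, n_frames, radius), prompt)
            for i in range(n_frames)]


# A five-frame clip with one neighbor on each side of the current frame.
inputs = build_inputs(5, "a long blue dress", radius=1)
print(inputs[0])  # ([0, 0, 1], 'a long blue dress')
print(inputs[4])  # ([3, 4, 4], 'a long blue dress')
```

Repeating the edge frame keeps every window the same length, which is convenient when the inputs are later batched.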
Picture 2 - Original image
Picture 3 - Example of successful generation
Arkadiy S., Data Scientist at Inventale
The above is just one example of the many possibilities that generative AI can offer businesses. It can create novel images, write product descriptions, compose pieces of music, give us efficient virtual assistants, help develop new medicines, and much more. It is an exciting field with endless potential, and the outputs can be fantastic.
However, working with generative AI does require skills, experience, and curious minds to overcome difficulties. And there are still a number of challenges:
- Video Inpainting, to stay with our example, is computationally very demanding, so it is either very expensive or time-consuming - you choose;
- There is always a “human in the loop”: a specialist who monitors the system and deals with its errors. And do not forget that humans make mistakes too;
- There is the phenomenon of so-called “Bias in AI”, when the system’s output is prejudiced because of erroneous assumptions;
- While in conventional software code one can always find and correct errors, with neural networks it is not evident why they generated a particular result (the Explainable AI (XAI) problem). Finding the answers takes time: one has to look for similar data and go through the generation process again, trying to replicate the same result in order to understand why and how.
You see that, powerful as it is, generative AI is more challenging to apply than it may seem. As for Inventale, we continue to explore its capabilities in our current projects and will be back with new interesting stories soon!