Let’s have a look at what Microsoft VASA-1 can do
This Tuesday, Microsoft Research Asia introduced VASA-1, an AI model that can make a video of someone speaking or singing in sync with an audio track using just one photo. Down the road, it might be used for virtual avatars that work offline without needing video feeds. It could also let anyone with similar tools create videos of people saying things they never actually did, just by using a photo found online.
The accompanying research paper, titled “VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time,” describes the work in its abstract as enabling real-time interactions with lifelike avatars that emulate human conversational behavior. The team behind it includes Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo.
The VASA framework, short for “Visual Affective Skills Animator,” uses machine learning to study a still image together with a spoken audio clip. Then, it crafts a lifelike video featuring accurate facial expressions, head movements, and lip-syncing to the audio. Unlike other Microsoft research, it doesn’t duplicate or imitate voices; instead, it works with existing audio input, which could be custom-recorded or spoken for a specific purpose.
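To make the inputs and outputs concrete, here is a minimal, hypothetical sketch of what such an audio-driven talking-face pipeline looks like from the outside. The class, function, and file names are placeholders invented for illustration; this is not Microsoft’s code or API.

```python
# Hypothetical sketch of an audio-driven talking-face pipeline (not VASA-1's
# actual code): one still photo plus one speech clip in, video frames out.
from dataclasses import dataclass
from typing import List


@dataclass
class TalkingHeadRequest:
    face_image_path: str  # a single still photo of the subject
    audio_path: str       # existing speech audio (recorded or supplied)
    fps: int = 40         # frames of video generated per second of audio


def generate_talking_head(request: TalkingHeadRequest) -> List[bytes]:
    """Placeholder for the generation step. A real system would extract audio
    features, predict facial dynamics (lip motion, expressions, head pose)
    for each time step, and render one frame per step."""
    raise NotImplementedError("illustrative stub only")


# Usage: the caller never supplies video of the subject, only a photo and audio.
# frames = generate_talking_head(TalkingHeadRequest("portrait.jpg", "speech.wav"))
```

The point of the sketch is simply that video of the subject is never an input; everything visible in the output is synthesized from the single photo, driven by the audio.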
According to Microsoft, the model is a big step up in realism, expressiveness, and efficiency compared to older speech animation methods. It does look like an improvement to us, especially compared with previous single-image animation techniques.
AI research has been trying to bring photos to life for a while, and now, researchers are taking it a step further. In February, Alibaba’s Institute for Intelligent Computing research group introduced EMO: Emote Portrait Alive, an AI model that syncs an animated photo to an audio track (they call it “Audio2Video”), similar to VASA-1.
Trained on clips from YouTube
Microsoft researchers trained VASA-1 using the VoxCeleb2 dataset, developed in 2018 by three University of Oxford researchers. This dataset includes “over 1 million voice recordings of 6,112 celebrities,” as stated on the VoxCeleb2 website, sourced from YouTube videos. VASA-1 can create videos with a resolution of 512×512 pixels at speeds of up to 40 frames per second, with minimal delay. This suggests its potential application in real-time tasks such as video conferencing.
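For a sense of what those figures imply about the real-time claim, here is a quick back-of-envelope calculation using only the numbers quoted above (an illustration, not a benchmark from the paper):

```python
# Rough framing of the real-time claim: at 40 fps, each 512x512 frame
# must be produced in about 25 ms to keep pace with live audio.
fps = 40
frame_budget_ms = 1000 / fps
print(f"Per-frame time budget: {frame_budget_ms:.0f} ms")  # 25 ms

clip_seconds = 60
print(f"Frames needed for a one-minute clip: {fps * clip_seconds}")  # 2400
```

Staying under that roughly 25-millisecond budget per frame is what would make live uses like video conferencing plausible.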
To showcase the model, Microsoft set up a VASA-1 research page with lots of sample videos showing the tool in action. These include people singing and speaking along with pre-recorded audio tracks. They demonstrate how the model can express various moods and adjust its eye gaze. The examples also feature some playful creations, like the Mona Lisa rapping to an audio track of Anne Hathaway performing “Paparazzi” on Conan O’Brien’s show.
The researchers mention that, for privacy, every example photo on their page was made by AI using StyleGAN2 or DALL-E 3 (except the Mona Lisa). It’s clear this method could work with real people’s photos too, especially if they resemble a celebrity in the training data. However, the researchers emphasize they’re not aiming to deepfake real humans.
“We’re working on creating emotional skills for virtual interactive characters, without copying any real-life individuals. This is just a research demo, and there are no plans to release a product or API,” states the website.
Though Microsoft researchers highlight possible benefits such as more equitable education, better accessibility, and supportive companionship, there’s also real potential for misuse. For instance, it could let people fake video calls, make others appear to say things they never said (especially when paired with a cloned voice), or enable harassment from a single social media photo.
At present, the created video may not be flawless in certain aspects, but it could be quite convincing for some viewers who aren’t aware it’s an AI-generated animation. The researchers acknowledge this, which is why they’re not publicly sharing the code that drives the model.
“We’re against any actions that make misleading or harmful content about real people and aim to use our method to improve detecting fake videos,” stated the researchers. “Right now, the videos made with this method still have some noticeable flaws, and the data analysis indicates there’s still work to do to make them as real as possible.”
Last Words
VASA-1 is just a research demo, but Microsoft isn’t alone in working on this kind of tech. Judging by the history of generative AI, it’s probably only a matter of time before similar technology becomes open source and freely available, and it will likely keep getting more realistic over time.