The Mona Lisa rapping thanks to Microsoft: with VASA-1, the deepfake is served


Microsoft has presented VASA-1, a new artificial intelligence model (based on the VASA framework) capable of creating a realistic video avatar from a single image and an audio clip.

VASA-1 can even be used for real-time applications: a video featuring an avatar generated by the model could, for example, be used for a video call in Teams, FaceTime or other live streaming technologies. According to Microsoft, the lag is only 170 milliseconds.

VASA-1 requires the user to provide a photo, comparable to a passport photo, and an audio track; from these it creates a realistically animated video that plays in lip sync with the supplied audio.

Microsoft researchers explain that previous artificial intelligence models specialized mainly in lip syncing, while facial expressions, emotions, head movements and other details were overlooked.

VASA-1 should offer all this and therefore be able to create realistic animated faces. The researchers demonstrate this with a selection of short videos on the project website.

According to Microsoft, VASA-1 can produce videos at a resolution of 512 × 512 pixels at 45 FPS in offline processing, or online at 40 FPS in near real time with an initial lag of just 170 ms. The researchers used a desktop PC with an NVIDIA GeForce RTX 4090 for their demonstrations.

The duration of the generated video depends on the supplied audio track, but thanks to the low latency the model can also be used in real time for a live stream. Instead of their own face, participants then see an avatar generated by VASA-1.

VASA-1 offers the user a series of controls to set, for example, the gaze direction, the orientation of the head, the mood of the created avatar, or the distance of the head from the virtual camera. VASA-1 can also animate drawn characters or bring figures like the Mona Lisa to life, even though the model was not trained on such data. Languages other than English can also be lip-synced.

Microsoft researchers point out that although the AI model was not created to deceive people, it could certainly be used for that purpose, for example by impersonating another person using a photo. With the exception of the Mona Lisa, Microsoft's demonstration videos used only AI-generated images created with StyleGAN2 and DALL·E 3.

VASA-1 still has limitations in video generation: it only animates the region from the neck up, leaving the rest of the torso unhandled. Furthermore, there can be problems with hair or clothing, and textures are sometimes generated incorrectly.
