Supplementary Material for IF-MDM

IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Real-Time Talking Head Generation

Section 1: Supplementary for Results of In-the-Wild Input Image
Section 2: Supplementary for Results of HDTF
Section 3: Supplementary for Motion Degree Ablation Study

Section 1: Supplementary for Results of In-the-Wild Input Image

Section 2: Supplementary for Results of HDTF

Each model takes a single input image and a speech audio clip and generates a talking-head video.
See Table 1, Figure 5, and Section 4.3 (talking head generation) of the main paper.

Use the play, pause, and reset buttons to play, pause, or rewind all videos in a row together.

AniPortrait: Based on a video diffusion model. Slow (0.8 FPS). Natural head movement, but it suffers from a heat-haze artifact (poor temporal consistency).
Real3D-Portrait: Based on an explicit face model. Good lip-sync quality, but it suffers from a floating-head artifact (poor alignment between head and body).
IF-MDM (Ours): Based on an implicit face motion diffusion model. Fast (30 FPS). Natural head movement and good visual quality.

Section 3: Supplementary for Motion Degree Ablation Study

The motion degree is controlled through the mean and standard deviation of the motion embedding.
See Figure 7, Table 3, Section 3.4 (Motion degree control), and Section 4.5 (Control the motion dynamics) of the main paper.
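
As a rough illustration of the two statistics mentioned above, the sketch below computes a per-dimension mean and standard deviation over the frames of one clip. The data layout (a list of frames, each a list of floats), the function name `motion_stats`, and the use of the population standard deviation are assumptions made for this example; the main paper defines the actual motion embedding and how its statistics are obtained.

```python
from statistics import fmean, pstdev

def motion_stats(motion):
    """Per-dimension mean and population std over the frames of one clip.

    `motion` is a list of frames; each frame is a list of floats.
    This layout is hypothetical and chosen only for illustration.
    """
    # Transpose so that each tuple holds one embedding dimension across all frames.
    dims = list(zip(*motion))
    mean = [fmean(d) for d in dims]
    std = [pstdev(d) for d in dims]
    return mean, std

# Toy clip: 3 frames, 2-dimensional motion embedding.
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
mean, std = motion_stats(frames)
print(mean, std)
```

In this toy clip the second dimension never changes, so its standard deviation is zero, which corresponds to no motion dynamics along that dimension.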


No Control Signal: The motion mean and standard deviation are both dropped out, so the motion degree is not controlled. This setup is used for Table 1 and Figure 5 of the main paper and for Section 2 of this supplementary material.
Mean as Input: The motion mean is set to the motion extracted from the input image. This preserves the posture of the input image and works well for in-the-wild input images with varied postures.
Mean as Last Frame: The motion mean is set to the last frame of the previously generated chunk. This diversifies the generated motion.
Std as 0.1: The motion standard deviation is set to 0.1, which yields less dynamic motion.
Std as 0.3: The motion standard deviation is set to 0.3, which yields more dynamic motion.
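
The five setups can be summarized as a choice of which control signal is passed to the generator for each chunk. The sketch below is only a schematic under stated assumptions: the function `build_control`, the setup names, the use of `None` for a dropped-out signal, and the fallback to the input-image motion for the first chunk are all hypothetical, and it assumes each setup sets one signal while dropping the other. It is not the model's actual interface.

```python
from typing import List, Optional, Tuple

Vec = List[float]

def build_control(
    setup: str,
    input_motion: Vec,
    prev_chunk_last: Optional[Vec] = None,
) -> Tuple[Optional[Vec], Optional[float]]:
    """Return (motion_mean, motion_std) for one chunk; None means the signal is dropped."""
    if setup == "none":
        # No control signal: both mean and std are dropped out.
        return None, None
    if setup == "mean_input":
        # Mean taken from the motion extracted from the input image.
        return list(input_motion), None
    if setup == "mean_last_frame":
        # Mean taken from the last frame of the previously generated chunk.
        # Assumption: the first chunk falls back to the input-image motion.
        source = prev_chunk_last if prev_chunk_last is not None else input_motion
        return list(source), None
    if setup == "std_0.1":
        return None, 0.1
    if setup == "std_0.3":
        return None, 0.3
    raise ValueError(f"unknown setup: {setup}")

input_motion = [0.2, -0.1]
print(build_control("mean_last_frame", input_motion, prev_chunk_last=[0.5, 0.0]))
```

Returning `None` for a dropped signal keeps the "No Control Signal" case explicit instead of encoding it as a zero vector, which could be mistaken for a real motion value.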
