Supplementary Material for IF-MDM

IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Real-Time Talking Head Generation

Table of Contents

Section 1: Supplementary for Results of In-the-Wild Input Image

Section 2: Supplementary for Results of HDTF

Each model takes an input image and speech audio and generates a video.
See 'Table 1', 'Figure 5', and '4.3 Talking head generation' in the main paper.

Columns: Input Image, AniPortrait, Real3D-Portrait, IF-MDM (Ours)
AniPortrait
  • Based on a video diffusion model.
  • Slow (0.8 FPS).
  • Natural head movement.
  • Suffers from a heat-haze artifact (poor temporal consistency).
Real3D-Portrait
  • Based on an explicit face model.
  • Good lip-sync quality.
  • Suffers from a floating-head artifact (poor alignment between head and body).
IF-MDM (Ours)
  • Uses the implicit face motion diffusion model.
  • Fast (30 FPS).
  • Natural head movement.
  • Good visual quality.
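To put the frame rates quoted above in perspective, the following sketch converts them into wall-clock generation time. The 10-second clip length and the 25 FPS output frame rate are assumptions chosen for illustration; they are not taken from the main paper, and the function name is hypothetical.

```python
def seconds_to_generate(video_seconds, video_fps, generation_fps):
    """Wall-clock seconds needed to synthesize a clip at a given generation speed."""
    total_frames = video_seconds * video_fps
    return total_frames / generation_fps


VIDEO_SECONDS = 10  # assumed clip length, for illustration only
VIDEO_FPS = 25      # assumed output frame rate, for illustration only

slow = seconds_to_generate(VIDEO_SECONDS, VIDEO_FPS, 0.8)   # 0.8 FPS method
fast = seconds_to_generate(VIDEO_SECONDS, VIDEO_FPS, 30.0)  # 30 FPS method
speedup = 30.0 / 0.8                                        # ratio of the two rates

print(f"0.8 FPS: {slow:.1f} s, 30 FPS: {fast:.1f} s, speedup: {speedup:.1f}x")
```

Under these assumptions, a generator running at 30 FPS finishes faster than the clip plays back, while one running at 0.8 FPS needs several minutes for the same clip.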
Results are shown for Identity Images 31, 29, 11, 28, 15, and 25.

Section 3: Supplementary for Motion Degree Ablation Study

The motion degree is controlled by the mean and standard deviation of the motion embedding.
See 'Figure 7', 'Table 3', '3.4 Motion degree control', and '4.5 Control the motion dynamics' in the main paper.
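As a rough illustration of the two control statistics, the sketch below computes a per-dimension mean and standard deviation over the frames of a motion embedding sequence. The data layout (a list of frames, each a list of floats), the function name, and the use of the population standard deviation are assumptions made for this example; the main paper defines the actual computation.

```python
from statistics import fmean, pstdev


def motion_statistics(motion_seq):
    """Per-dimension mean and std over time for a sequence of motion embeddings.

    motion_seq: list of frames, each a list of floats of equal length
    (hypothetical layout, chosen for illustration).
    """
    if not motion_seq:
        raise ValueError("motion_seq must contain at least one frame")
    dims = list(zip(*motion_seq))    # transpose: one tuple per embedding dimension
    mean = [fmean(d) for d in dims]
    std = [pstdev(d) for d in dims]  # population std; an assumed convention
    return mean, std


# Toy 2-dimensional embedding over 3 frames: dimension 0 moves, dimension 1 is static.
frames = [[0.0, 1.0], [0.2, 1.0], [0.4, 1.0]]
mean, std = motion_statistics(frames)
print(mean, std)
```

In this toy input, the static dimension has zero standard deviation, while the moving dimension has a positive one, which matches the intuition that a larger standard deviation corresponds to more motion.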

Columns: No Control Signal, Mean as Input, Mean as Last Frame, Std as 0.1, Std as 0.3
No Control Signal
  • The motion mean and standard deviation are dropped out, so the motion degree is not controlled.
  • This setup is used in 'Table 1' and 'Figure 5' of the main paper and in 'Section 2' of the supplementary material.
Mean as Input
  • The motion mean is set to the motion extracted from the input image.
  • This keeps the posture of the input image.
  • Well suited to in-the-wild input images with varied postures.
Mean as Last Frame
  • The motion mean is set to the motion of the last frame of the previously generated chunk.
  • This diversifies the generated motion.
Std as 0.1
  • The motion standard deviation is set to 0.1.
  • This yields less dynamic motion.
Std as 0.3
  • The motion standard deviation is set to 0.3.
  • This yields more dynamic motion.
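The five setups above can be summarized as a small configuration sketch. Everything here is hypothetical: `MotionControl`, `make_control`, the setup names, and `generate_chunk_stub` are illustrative stand-ins rather than the authors' code, and the sketch assumes each setup sets only the signal it names and drops the other.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MotionControl:
    """Hypothetical conditioning container; None means the signal is dropped."""
    mean: Optional[List[float]] = None
    std: Optional[float] = None


def make_control(setup, input_motion=None, last_frame_motion=None):
    """Build the conditioning for one of the five ablation setups (assumed names)."""
    if setup == "no_control":
        return MotionControl()  # both signals dropped
    if setup == "mean_as_input":
        if input_motion is None:
            raise ValueError("mean_as_input needs the motion of the input image")
        return MotionControl(mean=list(input_motion))
    if setup == "mean_as_last_frame":
        if last_frame_motion is None:
            raise ValueError("mean_as_last_frame needs the previous chunk's last frame")
        return MotionControl(mean=list(last_frame_motion))
    if setup == "std_0.1":
        return MotionControl(std=0.1)  # less motion dynamics
    if setup == "std_0.3":
        return MotionControl(std=0.3)  # more motion dynamics
    raise ValueError(f"unknown setup: {setup}")


def generate_chunk_stub(control, num_frames=4):
    """Placeholder for the real generator: emits frames drifting from the mean."""
    base = control.mean if control.mean is not None else [0.0, 0.0]
    return [[v + 0.01 * t for v in base] for t in range(num_frames)]


# Chunk-wise use: the first chunk is anchored to the input image, the next one
# to the last frame of the chunk that was just generated.
input_motion = [0.5, -0.2]
first_control = make_control("mean_as_input", input_motion=input_motion)
chunk = generate_chunk_stub(first_control)
next_control = make_control("mean_as_last_frame", last_frame_motion=chunk[-1])
```

Anchoring the mean to the input image keeps every chunk close to the original posture, whereas re-anchoring to the last generated frame lets the posture drift from chunk to chunk, which is consistent with the "diversifies the generated motion" note above.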