Section 1: Supplementary Results for In-the-Wild Input Images
Section 2: Supplementary Results on HDTF
Each model receives the input image and speech audio as input and generates a talking-head video.
See Table 1, Figure 5, and Section 4.3 (Talking head generation) in the main paper.
Press the play button to play all videos in the row.
Press the pause button to pause all videos in the row.
Press the reset button to reset all videos in the row to the beginning.
Control Buttons
Input Image
AniPortrait
Real3D-Portrait
IDMFM (Ours)
This method is based on a video diffusion model.
Slow (0.8 FPS).
It has natural head movement.
It suffers from a heat-haze artifact (poor temporal consistency).
This method is based on an explicit face model.
It has good lip-sync quality.
It suffers from a floating-head artifact (poor alignment between the head and body).
Our method uses a diffusion model built on an implicit face model.
Fast (30 FPS)
It has natural head movement.
It has good visual quality.
Section 3: Supplementary Results for the Motion Degree Ablation Study
The motion degree is controlled by the mean and standard deviation of the motion embedding.
See Figure 7, Table 3, Section 3.4 (Motion degree control), and Section 4.5 (Control the motion dynamics) in the main paper.
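To make this control signal concrete, below is a minimal PyTorch sketch of a denoiser conditioned on the motion mean and standard deviation. It is an illustration only: the module names, dimensions, and architecture (MotionDenoiser, motion_dim, cond_dim, the MLP backbone) are hypothetical and not the paper's implementation.

```python
# Minimal sketch: conditioning a motion denoiser on the mean and standard
# deviation of the motion embedding. All names and sizes are hypothetical.
import torch
import torch.nn as nn


class MotionDenoiser(nn.Module):
    def __init__(self, motion_dim: int = 128, cond_dim: int = 256):
        super().__init__()
        # Project the control signal (motion mean and std) into a conditioning vector.
        self.cond_proj = nn.Linear(2 * motion_dim, cond_dim)
        self.backbone = nn.Sequential(
            nn.Linear(motion_dim + cond_dim, 512),
            nn.SiLU(),
            nn.Linear(512, motion_dim),
        )

    def forward(self, noisy_motion, motion_mean, motion_std):
        # noisy_motion: (B, T, motion_dim) noisy motion embeddings at a diffusion step
        # motion_mean / motion_std: (B, motion_dim) control signals for the motion degree
        cond = self.cond_proj(torch.cat([motion_mean, motion_std], dim=-1))
        cond = cond.unsqueeze(1).expand(-1, noisy_motion.shape[1], -1)
        return self.backbone(torch.cat([noisy_motion, cond], dim=-1))
```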
Press the play button to play all videos in the row.
Press the pause button to pause all videos in the row.
Press the reset button to reset all videos in the row to the beginning.
Control Buttons
No Control Signal
Mean as Input
Mean as Last Frame
Std as 0.1
Std as 0.3
The motion mean and standard deviation can be dropped out so that the motion degree is not controlled.
This setup is used for Table 1 and Figure 5 of the main paper and for Section 2 of this supplementary material.
The motion mean is set to the motion extracted from the input image.
This keeps the posture of the input image, which is useful for in-the-wild input images with various postures.
The motion mean is set to the motion of the last frame of the previously generated chunk (a sketch of these conditioning setups follows below).
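The sketch below summarizes the conditioning setups compared above: dropping out the control signal, taking the mean from the input image, and taking the mean from the last frame of the previous chunk, with a scalar standard deviation such as 0.1 or 0.3. The function and argument names (build_motion_condition, input_image_motion, prev_chunk_motion) are hypothetical and only illustrate the idea, not the paper's code.

```python
# Minimal sketch of the motion-degree conditioning setups. Names are hypothetical.
import torch


def build_motion_condition(mode, input_image_motion=None, prev_chunk_motion=None,
                           std_value=0.1):
    """Return (motion_mean, motion_std) control signals, or None to drop them out."""
    if mode == "no_control":
        # Drop out the mean/std condition so the motion degree is uncontrolled.
        return None
    if mode == "mean_as_input":
        # Use the motion extracted from the input image; keeps its posture.
        motion_mean = input_image_motion
    elif mode == "mean_as_last_frame":
        # Use the motion of the last frame of the previously generated chunk.
        motion_mean = prev_chunk_motion[-1]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # A scalar std broadcast over the motion dimensions controls the motion degree.
    motion_std = torch.full_like(motion_mean, std_value)
    return motion_mean, motion_std
```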