† Lead Authors.

Aligning Human Motion Generation
with Human Perceptions

Haoru Wang 1†  Wentao Zhu 1†  Luyi Miao 1  Yishu Xu 1
Feng Gao 1  Qi Tian 2  Yizhou Wang 1
1 Peking University  2 Huawei Cloud
Abstract

Human motion generation is a critical task with a wide range of applications. Achieving high realism in generated motions requires naturalness, smoothness, and plausibility. Despite rapid advancements in the field, current generation methods often fall short of these goals. Furthermore, existing evaluation metrics typically rely on ground-truth-based errors, simple heuristics, or distribution distances, which do not align well with human perceptions of motion quality. In this work, we propose a data-driven approach to bridge this gap by introducing a large-scale human perceptual evaluation dataset, MotionPercept, and a human motion critic model, MotionCritic, that capture human perceptual preferences. Our critic model offers a more accurate metric for assessing motion quality and could be readily integrated into the motion generation pipeline to enhance generation quality. Extensive experiments demonstrate the effectiveness of our approach in both evaluating and improving the quality of generated human motions by aligning with human perceptions. Code and data are publicly available at https://motioncritic.github.io/.

1 Introduction

Human motion generation is an important emerging task [53] with wide-ranging applications, including augmented and virtual reality (AR/VR) [48, 46], human-robot interaction [29, 11], and digital humans [23, 47]. Achieving high realism in generated human motions is crucial, necessitating naturalness, smoothness, and plausibility. However, current generation methods still fall short of these goals, often producing subpar results. Meanwhile, designing appropriate evaluation metrics that accurately reflect these qualities remains a significant challenge. This complexity stems from the highly non-linear and articulated nature of human motion, which must adhere to physical and bio-mechanical constraints while also avoiding visual artifacts. Effective metrics would not only facilitate the objective comparison of generated results but also have the potential to enhance generation models by addressing their shortcomings.

Existing evaluation metrics typically rely on errors against paired ground-truth (GT) motions, simple heuristics, or distribution distances to the real motion manifold. Error-based metrics cannot fully reflect performance because the GT is only one of many plausible motions. Heuristics fall short of comprehensively representing motion quality. For instance, foot-ground contact metrics [32, 40] fail to penalize twisting arm motions that violate bio-mechanical constraints, and it is infeasible to handcraft rules that cover every aspect of human motion. Meanwhile, distribution distance metrics like Fréchet Inception Distance (FID) [15] do not operate at the instance level but rather assess overall distribution similarity. Consequently, they cannot identify implausible motions or provide direct supervision signals to guide the generation of higher-quality motions. Some studies [40, 41] also indicate that FID correlates poorly with user studies due to the misalignment between its distance measurement and human perception of motion quality. As a result, existing automatic evaluation metrics cannot effectively reflect or replace subjective user studies, hindering objective evaluation and comparison.

In light of this, we advocate the need for automatic evaluation aligned with human perceptions. Firstly, humans are the primary audience and interaction partners for motion generation, making their perception crucial for evaluating motion quality. Secondly, the human brain possesses specialized neural mechanisms for processing biological motion [3, 10] and is sensitive to even slightly unnatural motions [39, 34]. Therefore, we explore the possibility of directly learning perceptual evaluations from humans using a data-driven approach. This method could bridge the gap between objective metrics and subjective human judgments, providing a more accurate assessment of motion quality.

First, we carefully curate a human perceptual evaluation dataset named MotionPercept, which contains 52590 pairs of human preference annotations on generated motions. Next, we train a human motion critic model, MotionCritic, that learns motion quality ratings from the collected dataset. Our critic model significantly outperforms previous metrics in terms of alignment with human perceptions. Notably, it generalizes well across different data distributions. In addition to motion evaluation, we further propose to utilize the critic model as a direct supervision signal. We demonstrate that MotionCritic can be seamlessly integrated into the generation training pipeline, effectively improving motion generation quality by increasing alignment with human perceptions with only a few fine-tuning steps.

We summarize our contributions as follows: 1) We contribute MotionPercept, a large-scale motion perceptual evaluation dataset with manual annotations. 2) We develop MotionCritic, which models human perceptions of motions through a data-driven approach. Extensive experiments demonstrate its superiority as an automatic human-aligned metric of motion quality. 3) We show that the proposed motion critic model can effectively serve as a supervision signal to enhance motion generation quality. Remarkably, it requires only a small number of fine-tuning steps and can be easily integrated into existing generator training pipelines in a plug-and-play manner.

Figure 1: Framework Overview. We collect MotionPercept, a large-scale, human-annotated dataset for motion perceptual evaluation, where human subjects select the best quality motion in multiple-choice questions. Using this dataset, we train MotionCritic to automatically judge motion quality in alignment with human perceptions, offering better quality metrics. Additionally, we show that MotionCritic can enhance existing motion generators with minimal fine-tuning.

2 Related Work

2.1 Human Motion Generation

Human motion generation is a pivotal task in computer vision, computer graphics, and artificial intelligence, aiming to produce natural and realistic human pose sequences [53]. This field has seen substantial advancements with the rise of deep generative models [22, 33, 9, 16]. Previous works have explored text-conditioned motion generation that transforms narrative descriptions into coherent pose sequences [38, 37, 25, 31, 28], audio-conditioned methods that synchronize movements with rhythmic cues [17, 35, 40], and scene-conditioned generation that integrates environmental contexts to produce contextually appropriate motions [5, 42, 1]. Despite significant progress, current mainstream data-driven kinematic motion generation methods sometimes produce unnatural motions that are jittery, distorted, or violate physiological and physical constraints. These issues could be attributed to the inherent uncertainty of the task, limitations of supervision signals, and dataset noise. Furthermore, evaluating generated human motions presents additional challenges. Conventional metrics such as error and FID fall short in capturing the nuanced details essential for producing lifelike and visually appealing movements [40, 41]. These measures can overlook critical aspects like the fluidity and biomechanical plausibility that are fundamental to human perceptual judgments. Given these challenges, it is imperative to develop metrics that are more closely aligned with human perception in order to evaluate and enhance motion generation results more accurately.

2.2 Human Perception Modeling

The pioneering work [51] collects a human perceptual similarity dataset and proposes using distances in deep features as perceptual metrics. Some works [30, 2, 50, 14, 6] on language models explore aligning model performance with human intent by first training a reward model and then performing reinforcement learning with it. Recent works [49, 43, 24] also explore utilizing human feedback to improve visual generation results. For example, ImageReward [44] proposes a reward feedback learning method (ReFL) to align text-to-image generative models with human judgments. In human motion generation, however, few studies have explored modeling human feedback, even though generated motion quality is highly relevant to human perception. One recent work, MoBERT [41], constructs a dataset of human ratings for generated motions. Our work differs from MoBERT in that we collect real human data on a scale tens of times larger (52.6K vs 1.4K) and use comparisons instead of ratings, which is more robust. We design the critic model to learn ratings from these comparisons automatically. Additionally, our approach can not only evaluate motion quality but also effectively improve motion generation results.

3 MotionPercept: A Large-scale Dataset of Motion Perceptual Evaluation

We build MotionPercept to capture real-human perceptual evaluations with large-scale and diverse human motion sequences. To this end, we implement a rigorous and efficient pipeline for data collection and data annotation. We also design a consensus experiment to examine the perceptual consistency across human subjects.

3.1 Motion Data Collection

We first collect generated human motion sequence pairs for subsequent perceptual evaluation. We utilize the state-of-the-art diffusion-based motion generation methods MDM [38] and FLAME [21] to generate human motion sequences parameterized by SMPL [27]. For MDM [38], we utilize the action-to-motion models trained on HumanAct12 [13] and UESTC [19], respectively. For FLAME [21], we utilize the text-to-motion model trained on HumanML3D [12]. For each group of 4 motion sequences to be annotated, we use the same condition (text prompt or action label) while sampling different random noises. This makes the motions similar in content while still having distinguishable differences, thereby making it easier to annotate the choices.

3.2 Human Perceptual Evaluation

Human perceptual evaluation is the core component of MotionPercept, so we implement a rigorous pipeline to ensure annotation quality. We first introduce the question design of the perceptual evaluation, then describe the protocol for conducting the evaluation. Finally, we present a statistical analysis of the evaluation results.

3.2.1 Question Design

Our perceptual evaluation is designed in the form of multiple-choice questions as selection is generally easier and more robust than directly rating [20, 36, 41]. Given a group of four motion sequence options, we instruct the annotators to select the best candidate that is most natural, visually pleasing, and free of artifacts. Specifically, we summarize the typical failure modes of the generated motions (e.g., jittering, foot skating, limb distortion, penetration, etc.) and explicitly require the annotators to exclude these options. We provide detailed guidance with task descriptions and representative video examples to better communicate the goal to the annotators. The full guidance is presented in the supplementary materials. While the optimal choice can be decided unambiguously in most cases, there are situations where the decision can be challenging. Therefore, we add two additional options, “all good” and “all bad”, so that the annotator is not required to pick one of the motions in these cases, thereby improving overall annotation quality. Results indicate that these cases account for a small portion of the total data. We exclude these cases from our subsequent experiments. In total, we set six options for each entry: four motion candidates plus “all good” and “all bad”.

3.2.2 Protocols

To ensure the quality of perceptual evaluation results, our annotation process consists of annotator training, annotation, and quality control. We recruit 10 annotators to perform the perceptual evaluation. Before the evaluation begins, we provide annotation guidelines to help the annotators understand the task and maintain consistent criteria. The annotators must pass a pilot test before starting the formal annotation to ensure they correctly understand the annotation requirements. Additionally, we conduct a perceptual consensus experiment to assess whether the annotation pipeline is suitable for our dataset, as discussed in Section 3.3. Finally, we implement a quality control process where the annotated data is reviewed by an expert quality inspector. During the annotation process, we continuously monitor the quality of each batch of data. For each batch, we randomly sample 10% of the data for quality inspection. The consistency between the sampled data and the expert’s annotations must exceed 90%; otherwise, the entire batch will be re-annotated.
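
The batch-level quality control rule above can be sketched as follows. This is a minimal illustration, assuming that annotations are stored as a mapping from question ID to chosen option and that the expert's choice is available through a callable; the names `batch_passes_quality_check` and `expert_label` are hypothetical and not part of the released code.

```python
import random


def batch_passes_quality_check(annotations, expert_label,
                               sample_ratio=0.10, min_agreement=0.90, seed=0):
    """Spot-check one annotated batch against an expert inspector.

    annotations:  dict mapping question id -> option chosen by the annotator
    expert_label: callable mapping question id -> option chosen by the expert
                  (hypothetical interface, assumed for this sketch)
    Returns (passed, agreement); a batch that fails would be re-annotated.
    """
    rng = random.Random(seed)
    question_ids = sorted(annotations)
    # Randomly sample 10% of the batch (at least one question).
    n_sampled = max(1, round(sample_ratio * len(question_ids)))
    sampled = rng.sample(question_ids, n_sampled)
    # Fraction of sampled questions where annotator and expert agree.
    n_agree = sum(annotations[q] == expert_label(q) for q in sampled)
    agreement = n_agree / n_sampled
    # The consistency must exceed the threshold, so the comparison is strict.
    return agreement > min_agreement, agreement
```

A batch for which the function returns `False` would be sent back for re-annotation in its entirety, as described in the protocol.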

Figure 2: We conduct a perceptual consensus experiment with 10 subjects on 312 multiple-choice questions, each with 6 options. (A): The distribution of the number of supporters for the most chosen option in each question. (B): Distribution of the number of options chosen by all subjects for each question. (C): Pairwise agreement ratio of all subjects.

3.3 Analysis

In total, we collect annotations for 18260 multiple-choice questions covering 73040 unique motions, significantly surpassing previous work [41] (1400 motions). We further investigate the following two questions:

  1. Based on our experimental setup, can the subjects confidently select the suitable options from the choices provided?

  2. Is there a significant difference in perceptual preferences among different subjects, or are they well-aligned?

For the first question, we calculate the proportion of cases where a choice could not be made (including “all good” and “all bad”), and find a total of 418 such groups (2.29%). The result indicates that most of the time subjects can make a definite judgment, demonstrating the validity of our protocol design.

For the second question, we conduct a perceptual consensus experiment where all 10 subjects perform perceptual evaluation independently on 312 randomly selected groups. We calculate their pairwise and overall consistency in choices. Figures 2(A) and 2(B) show that for most questions (82.37%), all 10 subjects reach a unanimous decision. Figure 2(C) reveals that all 10 subjects exhibit high pairwise agreement (90%). These results indicate a high level of consistency in perceptual judgments of human motion among different human subjects. This not only validates our perceptual evaluation pipeline but also motivates us to train machine learning models to emulate this consistent judgment capability.
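
The consensus statistics reported above can be computed along the following lines. This is a sketch under the assumption that the choices are stored as one list per subject, indexed by question; the function name and data layout are illustrative only.

```python
from collections import Counter
from itertools import combinations


def consensus_stats(choices):
    """choices[s][q] is the option picked by subject s on question q.

    Returns:
      top_support: number of supporters of the most chosen option per question
      unanimous:   fraction of questions on which all subjects agree
      pairwise:    dict mapping (subject a, subject b) -> agreement ratio
    """
    n_subjects = len(choices)
    n_questions = len(choices[0])

    # Supporters of the most chosen option for each question (cf. Figure 2(A)).
    top_support = []
    for q in range(n_questions):
        votes = Counter(choices[s][q] for s in range(n_subjects))
        top_support.append(votes.most_common(1)[0][1])

    # A question is unanimous when every subject backs the same option.
    unanimous = sum(c == n_subjects for c in top_support) / n_questions

    # Agreement ratio for every pair of subjects (cf. Figure 2(C)).
    pairwise = {}
    for a, b in combinations(range(n_subjects), 2):
        same = sum(choices[a][q] == choices[b][q] for q in range(n_questions))
        pairwise[(a, b)] = same / n_questions
    return top_support, unanimous, pairwise
```

For example, with three subjects answering four questions, `consensus_stats([[0, 1, 2, 3], [0, 1, 2, 0], [0, 1, 5, 3]])` reports that half of the questions are unanimous.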

4 MotionCritic: Advancing Motion Generation with Perceptual Alignment

Based on MotionPercept, we develop a human motion critic model, MotionCritic, to emulate the perceptual judgment capabilities of human subjects regarding human motion. We first present the problem formulation and training approach of the critic model, and then explain how to use the critic model for optimizing motion generation.

4.1 Problem Formulation

We formulate the problem as follows: given an input human motion sequence $\mathbf{x}$, we assume there is an implicit human perception model $\mathcal{H}$ that rates the motion quality $\mathcal{H}(\mathbf{x})$, where a higher rating indicates better quality. We aim to build a computational critic model $\mathcal{C}$ that best aligns with $\mathcal{H}$. Since $\mathcal{H}$ is not explicitly available, we take a data-driven approach. We obtain the human perceptual evaluation dataset $\mathcal{D}$ containing multiple pairs of samples $(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$. Our training objective is to train the model $\mathcal{C}$ using the dataset $\mathcal{D}$ so that it approximates the human perception model $\mathcal{H}$ as closely as possible. Specifically, we want the model prediction $\mathcal{C}(\mathbf{x}^{(i)})>\mathcal{C}(\mathbf{x}^{(j)})$ if and only if $\mathcal{H}(\mathbf{x}^{(i)})>\mathcal{H}(\mathbf{x}^{(j)})$. Based on the Bradley-Terry model [4, 18], the overall training objective can be written as maximizing the joint probability that the model $\mathcal{C}$ makes judgments consistent with $\mathcal{H}$ for each pair of samples in the dataset $\mathcal{D}$:

$$\arg\max_{\mathcal{C}}\,\mathbb{E}_{(\mathbf{x}^{(i)},\mathbf{x}^{(j)})\sim\mathcal{D}}\left[\log\sigma\left(\left(\mathcal{C}(\mathbf{x}^{(i)})-\mathcal{C}(\mathbf{x}^{(j)})\right)\cdot\left(\mathcal{H}(\mathbf{x}^{(i)})-\mathcal{H}(\mathbf{x}^{(j)})\right)\right)\right], \tag{1}$$

where $\sigma$ is the sigmoid function.
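
To make the pairwise objective concrete, the sketch below evaluates a Bradley-Terry style preference probability and the resulting negative log-likelihood on scalar critic scores. It assumes that the annotation reveals only which motion of a pair is preferred, so the human term contributes just its sign and each pair is ordered as (preferred, rejected). The function names are hypothetical, and plain floats stand in for the outputs of a critic network.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(score_i, score_j):
    """Probability that motion i is preferred over motion j,
    given their critic scores (Bradley-Terry with a sigmoid link)."""
    return sigmoid(score_i - score_j)


def pairwise_nll(score_pairs):
    """Mean negative log-likelihood over (preferred, rejected) score pairs.

    Minimizing this value pushes the preferred score above the rejected one.
    """
    total = sum(math.log(preference_probability(h, l)) for h, l in score_pairs)
    return -total / len(score_pairs)
```

When both motions receive the same score, the preference probability is 0.5 and the loss equals $\log 2$; the loss shrinks as the margin between the preferred and rejected scores grows.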

Figure 3: (I) Critic model training process. We sample human motion pairs $\mathbf{x}^{(h)},\mathbf{x}^{(l)}$ annotated with human preferences, upon which the critic model produces score pairs. We use the perceptual alignment loss $L_{\text{Percept}}$ to learn from the human perceptions. (II) Motion generation with critic model supervision. We intercept the MDM sampling process at a random timestep $t$ and perform single-step prediction. The critic model computes the score $s$ based on the generated motion $\mathbf{x}_0'$, which is further used to calculate the motion critic loss $L_{\text{Critic}}$. The KL loss $L_{\text{KL}}$ is introduced between $\mathbf{x}_0'$ and the last-time generation result $\widetilde{\mathbf{x}_0}'$.

4.2 Human Motion Critic Model

In practice, we represent human motion by $\mathbf{x}\in\mathbb{R}^{L\times J\times D}$, where $L$ denotes the sequence length, $J$ denotes the number of body joints, and $D$ denotes parameter dimensions. We implement the critic model $\mathcal{C}$ as a neural network that maps the high-dimensional motion parameters to a scalar $s$. We draw pairwise comparison annotations from the collected dataset, where $\mathbf{x}^{(h)}$ is the better instance and $\mathbf{x}^{(l)}$ is the worse. The perceptual alignment loss is thus given by:

$$\mathcal{L}_{\text{Percept}}=-\mathbb{E}_{(\mathbf{x}^{(h)},\mathbf{x}^{(l)})\sim\mathcal{D}}\left[\log\sigma\left(\mathcal{C}(\mathbf{x}^{(h)})-\mathcal{C}(\mathbf{x}^{(l)})\right)\right]. \tag{2}$$

4.3 Motion Generation with Critic Model Supervision

Additionally, we explore utilizing the learned human perceptual prior of $\mathcal{C}$ not only for evaluating generated motions but also for improving them. We demonstrate that our motion critic model can be easily integrated into state-of-the-art diffusion-based motion generation approaches, using MDM [38] as an example. The forward diffusion is modeled as a Markov noising process $\{\mathbf{x}_t\}_{t=0}^{T}$, where $\mathbf{x}_0$ is drawn from the data distribution, and

$$q(\mathbf{x}_t\mid\mathbf{x}_{t-1})=\mathcal{N}\left(\sqrt{\alpha_t}\,\mathbf{x}_{t-1},(1-\alpha_t)I\right), \tag{3}$$

where $\alpha_t\in(0,1)$ are constant hyper-parameters. When $\alpha_t$ is small enough, it is reasonable to approximate $\mathbf{x}_T\sim\mathcal{N}(0,I)$, allowing us to sample $\mathbf{x}_T$ from random noise to begin the denoising process.
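
The approximation $\mathbf{x}_T\sim\mathcal{N}(0,I)$ can be checked numerically. Composing the Gaussian steps of Eq. 3 gives the standard closed form $q(\mathbf{x}_t\mid\mathbf{x}_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,(1-\bar{\alpha}_t)I)$ with $\bar{\alpha}_t=\prod_{s\le t}\alpha_s$, so the data term vanishes once $\bar{\alpha}_T$ is close to zero. The sketch below uses an illustrative linear schedule; the number of steps and the schedule endpoints are assumptions chosen for demonstration, not settings reported here.

```python
import math


def linear_alphas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_t = 1 - beta_t for a linearly increasing beta schedule.

    The default values are illustrative assumptions, not settings
    taken from this paper.
    """
    step = (beta_end - beta_start) / (num_steps - 1)
    return [1.0 - (beta_start + step * t) for t in range(num_steps)]


def cumulative_alphas(alphas):
    """Running product alpha_bar_t = alpha_1 * ... * alpha_t."""
    out, prod = [], 1.0
    for a in alphas:
        prod *= a
        out.append(prod)
    return out


alpha_bar = cumulative_alphas(linear_alphas())
signal_scale_T = math.sqrt(alpha_bar[-1])  # coefficient of x_0 inside x_T
noise_var_T = 1.0 - alpha_bar[-1]          # variance of the Gaussian noise in x_T
```

Under this assumed schedule, `signal_scale_T` is well below 0.01 and `noise_var_T` is above 0.999, so $\mathbf{x}_T$ is nearly indistinguishable from standard Gaussian noise.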

Algorithm 1 Fine-tuning Motion Generation with MotionCritic
1: Dataset: Action-label set $\mathcal{Y}=\{y_1,y_2,\ldots,y_n\}$
2: Pre-training Dataset: Action-motion pairs dataset $\widetilde{\mathcal{D}}=\{(\text{label}_1,\text{mot}_1),\ldots,(\text{label}_n,\text{mot}_n)\}$
3: Input: MDM model $\mathcal{M}_{\theta_0}$ with pre-trained weights $\theta_0$, Critic model $\mathcal{C}$, MDM loss function $\psi$, Critic-to-loss map function $\phi$, Critic re-weight scale $\lambda$, KL loss re-weight scale $\mu$
4: Initialization: The number of noise scheduler time steps $T$, and time step range for fine-tuning $[T_1,T_2]$
5: for $(\text{label}_i,\text{mot}_i)\in\widetilde{\mathcal{D}}$ do
6:     $\mathcal{L}_{\text{MDM}}\leftarrow\psi_{\theta_i}(\text{label}_i,\text{mot}_i)$
7:     $\theta_i\leftarrow\theta_i$  $\triangleright$ Update $\text{MDM}_{\theta_i}$ with $\mathcal{L}_{\text{MDM}}$
8:     $t\leftarrow\text{rand}(T_1,T_2)$  $\triangleright$ Pick a random time step $t\in[T_1,T_2]$
9:     $\mathbf{x}_T\sim\mathcal{N}(0,I)$  $\triangleright$ Sample noise
10:     for $j=T,\ldots,t+1$ do
11:         no grad: $\mathbf{x}_{j-1}\leftarrow\mathcal{M}_{\theta_i}\{\mathbf{x}_j\}$
12:     end for
13:     with grad: $\mathbf{x}_0'\leftarrow\mathcal{M}_{\theta_i}\{\mathbf{x}_t\}$
14:     if $\widetilde{\mathbf{x}_0}'$ is not None then
15:         $\mathcal{L}_{\text{KL}}\leftarrow\mu\,\text{KL}(\widetilde{\mathbf{x}_0}',\mathbf{x}_0')$  $\triangleright$ KL loss with previous $\mathbf{x}_0'$
16:     end if
17:     $r\sim\mathcal{N}(0,1)$  $\triangleright$ Random noise scalar
18:     $\mathcal{L}_{\text{Critic}}\leftarrow\lambda\,\phi(\mathcal{C}(\mathbf{x}_0'),r)$  $\triangleright$ Critic loss
19:     $\theta_{i+1}\leftarrow\theta_i$  $\triangleright$ Update $\mathcal{M}_{\theta_i}$ with $\mathcal{L}_{\text{Critic}}$ and $\mathcal{L}_{\text{KL}}$
20:     $\widetilde{\mathbf{x}_0}'\leftarrow\mathbf{x}_0'$  $\triangleright$ Save $\mathbf{x}_0'$ for next-step $\mathcal{L}_{\text{KL}}$
21: end for

Given an MDM model $\mathcal{M}$ with pre-trained parameters $\theta_0$, we fine-tune it to improve its alignment with a pre-trained critic model $\mathcal{C}$. We develop a lightweight perceptual-aligned fine-tuning approach based on ReFL [44]. Notably, in order to utilize the critic model in a plug-and-play manner, we keep the MDM training step and objective $\mathcal{L}_{\text{MDM}}$ unchanged. Instead, we simply add one optimization step with critic model supervision in each training iteration as shown in Figure 1.

Specifically, we sample a Gaussian noise 𝐱Tsubscript𝐱𝑇\mathbf{x}_{T}bold_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT and inference until 𝐱tsubscript𝐱𝑡\mathbf{x}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT where t[T1,T2]𝑡subscript𝑇1subscript𝑇2t\in[T_{1},T_{2}]italic_t ∈ [ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] is randomly selected in later denoising steps. Then, a single-step denoising is performed to predict 𝐱0subscript𝐱0\mathbf{x}_{0}\textquoterightbold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ’ from 𝐱tsubscript𝐱𝑡\mathbf{x}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Based on the predicted motion 𝐱0subscript𝐱0\mathbf{x}_{0}\textquoterightbold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ’, we compute the critic score s=𝒞(𝐱0)𝑠𝒞subscript𝐱0s=\mathcal{C}(\mathbf{x}_{0}\textquoteright)italic_s = caligraphic_C ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ’ ), which is used to compute the motion critic loss:

Critic=𝔼yi𝒴[ϕ(𝒞(𝐱0)],{\mathcal{L}_{\text{Critic}}=\mathbb{E}_{y_{i}\sim{\mathcal{Y}}}\left[\phi(% \mathcal{C}(\mathbf{x}_{0}^{\prime})\right],\\ }caligraphic_L start_POSTSUBSCRIPT Critic end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_Y end_POSTSUBSCRIPT [ italic_ϕ ( caligraphic_C ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] , (4)

where $\phi(s)=-\sigma(\tau-s)$ is a critic-to-loss mapping function, $\tau$ is a constant for shifting the critic value, and $\sigma$ is the sigmoid function. We further introduce a Kullback-Leibler (KL) divergence regularization to prevent $\mathcal{M}$ from moving substantially away from the conditional motion generation task:

$$\mathcal{L}_{\text{KL}}=\mathbb{E}_{y_{i}\sim\mathcal{Y}}\left[D_{\text{KL}}\left(p(\mathbf{x}_{0}^{\prime})\,\|\,p(\widetilde{\mathbf{x}_{0}^{\prime}})\right)\right]. \tag{5}$$

The overall fine-tuning loss is given by

$$\mathcal{L}_{\text{FT}}=\mathcal{L}_{\text{MDM}}+\lambda\mathcal{L}_{\text{Critic}}+\mu\mathcal{L}_{\text{KL}}, \tag{6}$$

where $\lambda$ and $\mu$ are constants for loss balancing. The detailed algorithm workflow is shown in Algorithm 1.
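As a rough illustration of how Eqs. (4) and (6) combine, the following minimal Python sketch evaluates the critic-to-loss mapping and the overall fine-tuning loss on scalar inputs. The mapping is copied as stated in the text; the function names are hypothetical, the MDM and KL terms are passed in as precomputed placeholder scalars, and the default values of $\tau$, $\lambda$, and $\mu$ are those reported in Section 5.1. It is not the released implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def critic_to_loss(s: float, tau: float = 12.0) -> float:
    """phi(s) = -sigmoid(tau - s), as written below Eq. (4)."""
    return -sigmoid(tau - s)


def fine_tuning_loss(l_mdm, critic_scores, l_kl, lam=1e-3, mu=1.0, tau=12.0):
    """Eq. (6): L_FT = L_MDM + lambda * L_Critic + mu * L_KL.

    l_mdm and l_kl are placeholder scalars (assumed to be computed elsewhere);
    critic_scores holds one critic score per sampled condition in the batch,
    so the expectation in Eq. (4) is approximated by a batch mean.
    """
    l_critic = sum(critic_to_loss(s, tau) for s in critic_scores) / len(critic_scores)
    return l_mdm + lam * l_critic + mu * l_kl
```

Because the sigmoid saturates, the critic term is bounded in $[-1, 0]$ per sample, so with a small $\lambda$ it acts as a gentle correction on top of the unchanged MDM objective.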

5 Experiment

5.1 Implementation Details

Critic Model.

We train our critic model using the MDM subset in MotionPercept. We convert each multiple-choice question into three ordered preference pairs, which results in 46761 pairs for training and 5829 pairs for testing. We parameterize motion sequences with SMPL [27], including 24 axis-angle rotations and the global root translation. We implement the critic model with a DSTformer [52] backbone with 3 layers and 8 attention heads. We apply temporal average pooling on the encoded motion embeddings, followed by an MLP with a hidden layer of 1024 channels, to predict a single scalar score. We train the critic model for 150 epochs with a batch size of 64 and a learning rate starting at 2e-3, decayed exponentially with a factor of 0.995.
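The conversion from one multiple-choice answer to three ordered pairs can be sketched as follows. This is a minimal illustration, assuming each question has four candidate motions and the annotator picks a single one; the function name and the `chosen_is_best` flag (covering questions where the worst motion is selected instead) are hypothetical.

```python
def question_to_pairs(options, chosen_index, chosen_is_best=True):
    """Turn one multiple-choice answer into ordered (preferred, rejected) pairs.

    With four options and one chosen motion, this yields three pairs:
    the chosen motion against each of the remaining ones.
    """
    chosen = options[chosen_index]
    others = [o for i, o in enumerate(options) if i != chosen_index]
    if chosen_is_best:
        # The chosen motion is preferred over every other candidate.
        return [(chosen, o) for o in others]
    # The chosen motion is the worst, so every other candidate is preferred.
    return [(o, chosen) for o in others]


pairs = question_to_pairs(["A", "B", "C", "D"], chosen_index=2)
```

Here `pairs` contains `("C", "A")`, `("C", "B")`, and `("C", "D")`. Both reported pair counts, 46761 and 5829, are multiples of three, consistent with three pairs per question.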

Fine-tuning.

We use the MDM [38] model trained on HumanAct12 [13] as our baseline, which utilizes 1000 DDPM denoising steps. We load the checkpoint trained for 350000 iterations and fine-tune it for 800 iterations, with a batch size of 64 and a learning rate of 1e-5. We fine-tune with critic clipping threshold $\tau=12.0$, critic re-weight scale $\lambda=$ 1e-3, and KL loss re-weight scale $\mu=1.0$. We set the step sampling range $[T_{1},T_{2}]=[700,900]$.

5.2 MotionCritic as Motion Quality Metric

We first evaluate whether the proposed critic model could serve as an effective motion quality metric. Specifically, we are interested in the following research questions:

  1. How does MotionCritic align with human perceptual evaluations?

  2. Could MotionCritic generalize to different data distributions?

To investigate the first question, we evaluate the performance of our critic model on a held-out test set and compare it with existing motion quality metrics as follows:

  • Error-based metrics, including Root Average Error (Root AVE), Root Absolute Error (Root AE), Joint Average Error (Joint AVE), and Joint Absolute Error (Joint AE). These metrics directly compute the distance between the generated motion and the paired GT motion with the same condition.

  • Heuristic metrics, including acceleration [32, 45], Person-Ground Contact [32], Foot-Floor Penetration [32], and Physical Foot Contact (PFC) [40]. These metrics do not compare against GT; instead, they implement intuitive rule-based evaluations. For example, PFC models the relationship between center-of-mass acceleration and foot-ground contact.

  • Learning-based metrics. Prior work MoBERT [41] proposes to evaluate motion quality with a motion feature extractor and support vector regression (SVR).

Table 1: Quantitative comparison of motion evaluation metrics on MDM and FLAME subsets of MotionPercept.

| Metric | MDM Accuracy (%) ↑ | MDM Log Loss ↓ | FLAME Accuracy (%) ↑ | FLAME Log Loss ↓ |
|---|---|---|---|---|
| Root AVE | 59.47 | 0.6891 | 48.42 | 0.6984 |
| Root AE | 61.79 | 0.6798 | 59.54 | 0.6711 |
| Joint AVE | 56.77 | 0.6889 | 44.61 | 0.6973 |
| Joint AE | 62.73 | 0.6794 | 58.37 | 0.6891 |
| Acceleration [45] | 64.26 | 0.7792 | 66.67 | 0.6919 |
| Person-Ground Contact [32] | 71.78 | 0.7260 | 69.82 | 0.7243 |
| Foot-Floor Penetration [32] | 53.61 | 0.6939 | 55.56 | 0.6906 |
| Physical Foot Contact [40] | 64.79 | 0.6926 | 66.00 | 0.6930 |
| MoBERT [41] | 49.40 | 0.6931 | 52.40 | 0.6932 |
| MotionCritic (Ours) | 85.07 | 0.5486 | 81.43 | 0.5758 |

Note that distribution-based metrics (e.g., FID) cannot compare the quality of individual motion sequences; the comparison with them can be found in subsequent experiments. For each metric, we calculate the percentage of its predictions that align with GT annotations (accuracy) and its probabilistic distribution distance to GT annotations (log loss). We use the softmax function to convert the scores to probabilities (negating the scores before softmax for metrics where smaller is better). Table 1 demonstrates that our critic model significantly outperforms previous metrics, reaching 85.07% accuracy on the MDM subset compared with 71.78% for the best previous metric (Person-Ground Contact). These results not only validate the effectiveness of learning from large-scale human perceptual evaluations but also show that our critic model can serve as a more comprehensive and robust metric for assessing motion quality.
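The accuracy and log loss computation described above can be sketched as follows. This is a minimal illustration under stated assumptions: each test item is a pair of scores ordered as (human-preferred, human-rejected), the two-way softmax is written as the equivalent sigmoid of the score difference, ties are counted as incorrect, and the function names and the clipping constant `eps` are hypothetical.

```python
import math


def preference_probability(score_preferred, score_rejected, higher_is_better=True):
    """Softmax probability that the metric favors the human-preferred motion."""
    sign = 1.0 if higher_is_better else -1.0  # negate scores if smaller is better
    diff = sign * (score_preferred - score_rejected)
    # A softmax over two scores equals a sigmoid of their difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def evaluate_metric(score_pairs, higher_is_better=True, eps=1e-12):
    """Return (accuracy in percent, mean log loss) over annotated pairs."""
    probs = [preference_probability(a, b, higher_is_better) for a, b in score_pairs]
    accuracy = 100.0 * sum(p > 0.5 for p in probs) / len(probs)
    log_loss = -sum(math.log(max(p, eps)) for p in probs) / len(probs)
    return accuracy, log_loss
```

Under this formulation, a metric that assigns equal scores to both motions in every pair has a log loss of $\ln 2 \approx 0.6931$, which is a useful reference point when reading the log loss columns.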

Figure 4: We group the HumanAct12 [13] GT test set into 5 subsets and compare their quality. (A): GT-I to GT-V subsets, split based on critic scores from high to low. (B): Elo ratings from the user study, FID, and average critic scores of different GT subsets.

Furthermore, to investigate the second question, we test the critic model on data outside of the training distribution. We collect a standalone test set with a different motion generation algorithm, FLAME [21], and perform perceptual evaluation with a different human subject. Note that this model is trained on a different dataset [12] from the one behind the model used to generate the critic model's training data, which means the action categories vary considerably. The results in Table 1 further show that our critic model generalizes well to the new test set, indicating its efficacy in evaluating different generation algorithms and unseen motion content.

Additionally, we test the generalization of our critic model on the real GT motion distribution. Figure 4(A) illustrates the critic score distribution of the HumanAct12 [13] test set. We group the 1190 GT motions into 5 groups based on their critic scores, evenly distributed from highest to lowest. We compare the average critic score of each group with the distribution-based metric FID and a user study. The user study is conducted by comparing motion pairs sampled from each group and then computing the Elo rating [7, 40] for each group. Figure 4(B) clearly indicates that the critic score aligns well with human preferences, while FID does not. Notably, we discover that the outliers with small critic values (group V) are indeed artifacts within the dataset. Please refer to the supplementary materials for video results. The results indicate that our critic model can also generalize to the GT motion manifold, even though the model has never been trained on it. They also highlight the potential of using our critic model as a tool for dataset diagnosis (e.g., discovering failure modes).
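For readers unfamiliar with Elo ratings, the sketch below shows the standard sequential Elo update applied to pairwise comparison outcomes between groups. It is only an illustration: the K-factor of 32, the initial rating of 1500, and the function name are assumptions, and the text does not specify which Elo variant or constants were used in the user study.

```python
def elo_ratings(comparisons, names, k=32.0, initial=1500.0):
    """Standard sequential Elo update over (winner, loser) outcomes.

    k and initial are assumed values, not taken from the paper.
    """
    ratings = {name: initial for name in names}
    for winner, loser in comparisons:
        # Expected score of the winner under the Elo logistic model.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


ratings = elo_ratings([("GT-I", "GT-V")], ["GT-I", "GT-V"])
```

Sequential Elo depends on the order in which comparisons are processed, so in practice the outcomes are often shuffled or the ratings averaged over several passes.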

5.3 MotionCritic as Training Supervision

Figure 5: Model performance during the fine-tuning process. (A): User study win rates (row vs. column) at different fine-tuning steps. (B): Elo ratings from the user study, FID, and average critic scores during the fine-tuning process.

Figure 6: Motion generation results from different fine-tuning steps.

Furthermore, we investigate whether our critic model can also serve as an effective supervision signal. Specifically, we fine-tune a pre-trained motion generator [38] with the proposed framework and evaluate it on the HumanAct12 [13] test set every 200 steps. Additionally, we conduct a standalone user study by comparing motion pairs generated at different fine-tuning steps and compute the Elo rating [7, 40]. Figure 5 reveals that as fine-tuning progresses, the motion quality consistently improves according to the user study, in line with the training objective of increasing the critic score. We also present a visual comparison in Figure 6. We find that as fine-tuning progresses, unreasonable human motions such as jittering, twisting, and floating significantly decrease. Please refer to the supplementary materials for video comparisons. The results also demonstrate that our fine-tuning process requires only hundreds of iterations to take effect, significantly improving the perceptual quality of the model. Compared to the 350K pre-training steps, the 800 fine-tuning iterations account for only 0.23% of the training cost. This further demonstrates the advantages of using a perceptually aligned critic model to fine-tune the motion generation model: the framework not only improves quality but is also lightweight and efficient.

6 Conclusion

In conclusion, our work bridges the important gap in human motion generation between objective metrics and human perceptual evaluations by introducing a data-driven framework with MotionPercept and MotionCritic. This paradigm not only offers a more comprehensive metric of motion quality but could also improve the generation results by aligning with human preferences. We hope this work could contribute to more objective evaluations of motion generation methods and results. One limitation of our approach is its primary focus on perceptual metrics without explicitly simulating biomechanical plausibility, which could be explored in future work. Future research could also investigate more fine-grained perceptual evaluation methods to obtain rich human feedback on motion quality, as in [26].

References

  • [1] Joao Pedro Araújo, Jiaman Li, Karthik Vetrivel, Rishi Agarwal, Jiajun Wu, Deepak Gopinath, Alexander William Clegg, and Karen Liu. Circle: Capture in rich contextual environments. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 21211–21221, 2023.
  • [2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • [3] Sarah-Jayne Blakemore and Jean Decety. From the perception of action to the understanding of intention. Nature reviews neuroscience, 2(8):561–567, 2001.
  • [4] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • [5] Enric Corona, Albert Pumarola, Guillem Alenya, and Francesc Moreno-Noguer. Context-aware human motion prediction. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 6992–7001, 2020.
  • [6] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
  • [7] Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
  • [8] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, December 2021.
  • [9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. Adv. Neural Inform. Process. Syst., 2014.
  • [10] Emily Grossman, Michael Donnelly, R Price, D Pickens, V Morgan, G Neighbor, and Randolph Blake. Brain areas involved in perception of biological motion. Journal of cognitive neuroscience, 12(5):711–720, 2000.
  • [11] Gianpaolo Gulletta, Wolfram Erlhagen, and Estela Bicho. Human-like arm motion generation: A review. Robotics, 9(4):102, 2020.
  • [12] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5152–5161, 2022.
  • [13] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proc. ACM Int. Conf. Multimedia, pages 2021–2029, 2020.
  • [14] Joey Hejna and Dorsa Sadigh. Inverse preference learning: Preference-based rl without a reward function. Advances in Neural Information Processing Systems, 36, 2024.
  • [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proc. Adv. Neural Inform. Process. Syst., volume 30, 2017.
  • [16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Proc. Adv. Neural Inform. Process. Syst., 2020.
  • [17] Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, and Daxin Jiang. Dance revolution: Long-term dance generation with music via curriculum learning. In Proc. Int. Conf. Learn. Represent., 2021.
  • [18] David R Hunter. Mm algorithms for generalized bradley-terry models. The annals of statistics, 32(1):384–406, 2004.
  • [19] Yanli Ji, Feixiang Xu, Yang Yang, Fumin Shen, Heng Tao Shen, and Wei-Shi Zheng. A large-scale rgb-d database for arbitrary-view human action recognition. In Proc. ACM Int. Conf. Multimedia, pages 1510–1518, 2018.
  • [20] Maurice George Kendall. Rank correlation methods. 1948.
  • [21] Jihoon Kim, Jiseob Kim, and Sungjoon Choi. Flame: Free-form language-based motion synthesis & editing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8255–8263, 2023.
  • [22] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proc. Int. Conf. Learn. Represent., 2014.
  • [23] Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellström. Analyzing input and output representations for speech-driven gesture generation. In Proc. Int. Conf. on Intelligent Virtual Agents, pages 97–104, 2019.
  • [24] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
  • [25] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.
  • [26] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19401–19411, 2024.
  • [27] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph., 34(6):1–16, 2015.
  • [28] Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, and Grégory Rogez. Posegpt: Quantization-based 3d human motion generation and forecasting. In Proc. Eur. Conf. Comput. Vis., pages 417–435, 2022.
  • [29] Yusuke Nishimura, Yutaka Nakamura, and Hiroshi Ishiguro. Long-term motion generation for interactive humanoid robots using gan with convolutional network. In Proc. ACM/IEEE Int. Conf. on Human-Robot Interaction, pages 375–377, 2020.
  • [30] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Proc. Adv. Neural Inform. Process. Syst., 35:27730–27744, 2022.
  • [31] Mathis Petrovich, Michael J. Black, and Gül Varol. TEMOS: Generating diverse human motions from textual descriptions. In Proc. Eur. Conf. Comput. Vis., pages 480–497, 2022.
  • [32] Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J Guibas. Humor: 3d human motion model for robust pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11488–11499, 2021.
  • [33] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proc. Int. Conf. Mach. Learn., 2015.
  • [34] Sotaro Shimada and Kazuma Oki. Modulation of motor area activity during observation of unnatural body movements. Brain and cognition, 80(1):1–6, 2012.
  • [35] Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. Bailando: 3d dance generation via actor-critic gpt with choreographic memory. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 11050–11059, June 2022.
  • [36] Neil Stewart, Gordon DA Brown, and Nick Chater. Absolute identification by relative judgment. Psychological review, 112(4):881, 2005.
  • [37] Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. In Proc. Eur. Conf. Comput. Vis., pages 358–374, 2022.
  • [38] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. In Proc. Int. Conf. Learn. Represent., 2023.
  • [39] Nikolaus F Troje. Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of vision, 2(5):2–2, 2002.
  • [40] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Editable dance generation from music. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 448–458, 2023.
  • [41] Jordan Voas, Yili Wang, Qixing Huang, and Raymond Mooney. What is the best automated metric for text to motion generation? In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
  • [42] Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and Xiaolong Wang. Synthesizing long-term 3d human motion and interaction in 3d scenes. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 9401–9411, 2021.
  • [43] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
  • [44] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  • [45] Sicheng Yang, Zhiyong Wu, Minglei Li, Zhensong Zhang, Lei Hao, Weihong Bao, and Haolin Zhuang. Qpgesture: Quantization-based and phase-guided motion matching for natural speech-driven gesture generation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023.
  • [46] Zhuoqian Yang, Wentao Zhu, Wayne Wu, Chen Qian, Qiang Zhou, Bolei Zhou, and Chen Change Loy. Transmomo: Invariance-driven unsupervised video motion retargeting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [47] Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. Generating holistic 3d human motion from speech. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023.
  • [48] Tairan Yin, Ludovic Hoyet, Marc Christie, Marie-Paule Cani, and Julien Pettré. The one-man-crowd: Single user generation of crowd motions using virtual reality. IEEE Trans. Vis. Comput. Graph., 28(5):2245–2255, 2022.
  • [49] Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, and Dong Ni. Instructvideo: Instructing video diffusion models with human feedback. arXiv preprint arXiv:2312.12490, 2023.
  • [50] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
  • [51] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [52] Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. Learning human motion representations: A unified perspective. In Proc. Int. Conf. Comput. Vis., 2023.
  • [53] Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. Human motion generation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Part 2 Appendix

Appendix A Details on MotionPercept

A.1 Prompt Selection

We utilize the prompts from HumanAct12 [13], UESTC [19] and HumanML3D [12] for generating the motion candidates. Specifically, we use the 12 action labels from HumanAct12 [13] (shown in Table 2) and the 40 categories of aerobic exercise description from UESTC [19] (shown in Table 3) for the MDM [38] model.

| HumanAct12 [13] Action Labels | |
|---|---|
| warm up | walk |
| run | jump |
| drink | lift dumbbell |
| sit | eat |
| turn steering wheel | phone |
| boxing | throw |

Table 2: 12 action labels from HumanAct12 [13].

| UESTC [19] Action Labels | |
|---|---|
| punching and knee lifting | marking time and knee lifting |
| jumping-jack | squatting |
| forward-lunging | left-lunging |
| left-stretching | raising-hand-and-jumping |
| left-kicking | rotation-clapping |
| front-raising | pulling-chest-expanders |
| punching | wrist-circling |
| single-dumbbell-raising | shoulder-raising |
| elbow-circling | dumbbell-one-arm-shoulder-pressing |
| arm-circling | dumbbell-shrugging |
| pinching-back | head-anticlockwise-circling |
| shoulder-abduction | deltoid-muscle-stretching |
| straight-forward-flexion | spinal-stretching |
| dumbbell-side-bend | standing-opposite-elbow-to-knee-crunch |
| standing-rotation | overhead-stretching |
| upper-back-stretching | knee-to-chest |
| knee-circling | alternate-knee-lifting |
| bent-over-twist | rope-skipping |
| standing-toe-touches | standing-gastrocnemius-calf |
| single-leg-lateral-hopping | high-knees-running |

Table 3: 40 action labels from UESTC [19].

We randomly select texts from the HumanML3D [12] test set as prompts for the FLAME [21] model.

A.2 Annotation Management

We recruit 10 annotators for this task, and data entries are randomly allocated to them. We provide detailed guidelines to the annotators and evaluate the annotation results by spot checks: we randomly select 10% of all data, inspect the annotation results against the guidelines, and calculate the proportion of unqualified data entries. If the unqualified proportion is less than 10%, the results are considered acceptable. All unqualified data entries are re-annotated. We update the guidelines during annotation based on spot-check feedback, and annotators study the new guidelines.
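The spot-check rule above can be sketched as a small routine. This is an illustrative sketch only: the function name, the `is_qualified` callback (standing in for the manual inspection against the guidelines), and the fixed random seed are hypothetical, while the 10% sampling ratio and the 10% acceptance threshold follow the description above.

```python
import random


def spot_check(entries, is_qualified, sample_ratio=0.10, max_unqualified=0.10, seed=0):
    """Sample a fraction of annotated entries and apply the acceptance rule.

    Returns (accepted, unqualified_rate, unqualified_entries). The entries
    found unqualified are the ones to send back for re-annotation.
    """
    entries = list(entries)
    rng = random.Random(seed)
    n = max(1, round(len(entries) * sample_ratio))
    sample = rng.sample(entries, n)
    unqualified = [e for e in sample if not is_qualified(e)]
    rate = len(unqualified) / n
    # Accept the batch only if the unqualified proportion is below the threshold.
    return rate < max_unqualified, rate, unqualified
```

For example, with 100 entries the routine inspects 10 of them and accepts the batch only if none of the 10 is unqualified, since a single unqualified entry already gives a rate of 10%.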

A.3 Annotation Design

Figure 7: An example of a raw data entry before annotation.

We generate four motions from the same prompt for each data entry, as shown in Fig 7. The prompts are hidden during the annotation process. Annotators are required to select either the best or the worst motion for data entries generated by MDM [38] and FLAME [21]. MDM [38] exhibits better motion diversity but lacks stability, so annotators are instructed to select the best motion. Conversely, FLAME [21] demonstrates better stability but lacks diversity, so annotators are instructed to select the worst motion for these entries.

Figure 8: Our annotation platform.

A.4 Annotation Guidance Documentation

We provide a detailed annotation document to explain the annotation process. The annotation platform is shown in Fig 8.

Introduction

Each data entry to be annotated consists of four videos, as shown in Fig 7. Each video is approximately three seconds long, with all four videos playing simultaneously and concatenated into one video.

Requirements

Each set of videos has six options: A, B, C, D, "all are good," and "all are bad." Annotators should select the most natural and reasonable video for each data entry. If one option stands out as the best, select that option. If all actions seem equally good or equally bad, choose "all are good" or "all are bad." Text prompts will be hidden during annotation.

Video Examples

We provide annotators with examples of what kinds of motions are unnatural and unacceptable:

  1. Body pose is unnatural, including hands, feet and so on.

  2. Human motion violates physiological constraints.

  3. Human motion is erratic or severely stutters.

  4. Human body collides, such as hands fully embedded into leg.

  5. Human body is severely tilted, to the point of losing balance.

  6. Human body appears to be drifting instead of walking.

Examples of these problems are shown in Fig 9.

Figure 9: Representative examples of options to be excluded.

Appendix B Data Documentation

We follow the datasheet proposed in [8] for documenting our MotionPercept:

  1. Motivation

    (a) For what purpose was the dataset created?
      This dataset was created to collect human perceptual data on whether human motions seem natural, and ultimately to advance our study of perceptually aligned metrics and the fine-tuning of human motion generation models.

    (b) Who created the dataset and on behalf of which entity?
      This dataset was created by Haoru Wang, Yishu Xu, Luyi Miao, Wentao Zhu, Feng Gao and Yizhou Wang with Peking University.

    (c) Who funded the creation of the dataset?
      The creation of this dataset was funded by Peking University.

    (d) Any other comments?
      None.

  2. Composition

    (a) What do the instances that comprise the dataset represent?
      Each instance contains 4 videos of human motions generated from the same prompt by existing motion generation methods [38, 21].

    (b) How many instances are there in total?
      In total, we collect annotations for 18260 multiple-choice questions covering 73K unique motions.

    (c) Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
      No, this is a brand-new dataset.

    (d) What data does each instance consist of?
      See Appendix A for details.

    (e) Is there a label or target associated with each instance?
      Yes. See Appendix A.

    (f) Is any information missing from individual instances?
      No.

    (g) Are relationships between individual instances made explicit?
      Yes.

    (h) Are there recommended data splits?
      Yes, we have separated the whole dataset into MDM-A (motions generated by MDM [38] from prompts in HumanAct12 [13]), MDM-U (motions generated by MDM [38] from prompts in UESTC [19]) and FLAME (motions generated by FLAME [21] from prompts in HumanML3D [12]). We provide the recommended data splits by combining MDM-A and MDM-U and randomly splitting them into a training set and a test set at a ratio of 8:1. Data generated by FLAME [21] is primarily used as test data for generalization.

    (i) Are there any errors, sources of noise, or redundancies in the dataset?
      No.

    (j) Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
      The dataset is self-contained.

    (k) Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?
      No.

    (l) Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
      No.

    (m) Does the dataset relate to people?
      Yes. Our human motion data is generated as body model parameters [27], not from real people, and therefore does not contain biometrics. These data are annotated by human annotators.

    (n) Does the dataset identify any subpopulations (e.g., by age, gender)?
      No. Our human motion data are generated as body model parameters [27] with no explicit gender or age.

    (o) Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?
      No. Our human motion data are generated by algorithms with commonly used body models.

    (p) Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?
      No.

    (q) Any other comments?
      None.

  3. Collection Process

    (a) How was the data associated with each instance acquired?
      See Appendix A for details.

    (b) What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?
      We use existing motion generation models to collect videos and require annotators to label them. See Appendix A for details.

    (c) If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
      See Appendix A and Appendix B for details.

    (d) Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
      The video data was collected by the authors. The annotations were performed by the workers in DATATANG TECHNOLOGY INC., and the workers were offered a fair wage as per the prearranged contract. See Appendix A and Appendix B for details.

    (e) Over what timeframe was the data collected?
      The data were collected from 2023 to 2024, and labeled in 2024.

    (f) Were any ethical review processes conducted (e.g., by an institutional review board)?
      No. The MotionPercept dataset raises no ethical concerns regarding the privacy information of human subjects.

    (g) Does the dataset relate to people?
      Yes. Our human motion data are generated as body model parameters [27], not real people. The annotation is done by people.

    (h) Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?
      We obtain raw data from motion generation models. Annotation data are collected by annotators.

    (i) Were the individuals in question notified about the data collection?
      Yes.

    (j) Did the individuals in question consent to the collection and use of their data?
      Yes.

    (k) If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?
      Yes.

    (l) Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?
      Not applicable.

    (m) Any other comments?
      None.

  4. 4.

    Preprocessing, Cleaning and Labeling

    1. (a)

      Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
      Yes, see appendix A.

    2. (b)

      Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?
      Yes. We provide raw data entries and their annotations respectively.

    3. (c)

      Is the software used to preprocess/clean/label the instances available?
      No. The annotation software is the private labeling platform provided by DATATANG TECHNOLOGY INC.

    4. (d)

      Any other comments?
      None.

  5. 5.

    Uses

    1. (a)

      Has the dataset been used for any tasks already?
      No, the dataset is newly proposed by us.

    2. (b)

      Is there a repository that links to any or all papers or systems that use the dataset?
      Yes, we provide the link to all related information on our project page.

    3. (c)

      What (other) tasks could the dataset be used for?
      This dataset could be used for other research topics, including but not limited to studies of human preference and human motion.

    4. (d)

      Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
      See appendix A for details.

    5. (e)

      Are there tasks for which the dataset should not be used?
      The usage of this dataset should be limited to the scope of human motion.

    6. (f)

      Any other comments?
      None.

  6. 6.

    Distribution

    1. (a)

      Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
      Yes, the dataset will be made publicly available.

    2. (b)

      How will the dataset be distributed (e.g., tarball on website, API, GitHub)?
      The dataset will be published on our code website with its metadata document.

    3. (c)

      Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
      We release our benchmark under the CC BY-NC 4.0 license (see https://paperswithcode.com/datasets/license).

    4. (d)

      Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
      No.

    5. (e)

      Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
      No.

    6. (f)

      Any other comments?
      None.

  7. 7.

    Maintenance

    1. (a)

      Who is supporting/hosting/maintaining the dataset?
      Haoru Wang maintains the dataset.

    2. (b)

      How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
      ou524u@stu.pku.edu.cn

    3. (c)

      Is there an erratum?
      Currently, no. As errors are encountered, future versions of the dataset may be released and updated on our website.

    4. (d)

      Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
      Yes, if applicable.

    5. (e)

      If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)?
      Our human motion dataset is generated as body model parameters [27] and does not depict real people. There are no applicable limits on the retention of the data, and the annotators are aware of how the data will be used.

    6. (f)

      Will older versions of the dataset continue to be supported/hosted/maintained?
      Yes, older versions of the benchmark will be maintained on our website.

    7. (g)

      If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
      Yes, please get in touch with us by email.

    8. (h)

      Any other comments?
      None.

Appendix C Details on MotionCritic: as Motion Quality Metric

Data Pre-processing.

Each multiple-choice question is divided into three ordered preference pairs. Motion sequences are parameterized using SMPL [27], which includes 24 axis-angle rotations and one global root translation.
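The split of one question into three ordered pairs can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes each question shows four candidate motions and that the annotator picks exactly one of them. The function name `question_to_pairs` and the flag `picked_is_best` are hypothetical; set the flag according to whether the pick marks the preferred or the rejected motion.

```python
from typing import List, Tuple


def question_to_pairs(
    options: List[str], picked: int, picked_is_best: bool = True
) -> List[Tuple[str, str]]:
    """Turn one multiple-choice answer into ordered (better, worse) pairs.

    `options` holds identifiers of the candidate motions (hypothetical names),
    `picked` is the index chosen by the annotator.
    """
    if not 0 <= picked < len(options):
        raise IndexError("picked must index into options")
    others = [o for i, o in enumerate(options) if i != picked]
    if picked_is_best:
        # The picked motion is preferred over every other candidate.
        return [(options[picked], o) for o in others]
    # Otherwise the picked motion is the rejected one in every pair.
    return [(o, options[picked]) for o in others]


pairs = question_to_pairs(["m0", "m1", "m2", "m3"], picked=2)
print(pairs)
```

With four candidates, each answer yields three pairs, which matches the count stated above.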

Training and Evaluation.

We train the critic model from scratch on MotionPercept using the DSTformer [52] backbone with 3 layers and 8 attention heads. To assess robustness, we train the model multiple times with different random seeds and report error bars across runs. Evaluation results, split by action label, are presented in Tables 4 and 5. MotionCritic achieves the best results and robustly scores different types of human motion.

| Metric | Warm. | Walk | Run | Jump | Drink | Lift. | Sit | Eat | Turn. | Phone | Box. | Throw | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Root AVE | 57.6 | 47.3 | 56.8 | 62.7 | 59.5 | 46.3 | 37.9 | 64.5 | 54.1 | 0.62 | 51.6 | 53.7 | 59.5 |
| Root AE | 70.1 | 70.0 | 69.7 | 57.2 | 70.0 | 49.8 | 52.5 | 61.2 | 63.2 | 61.7 | 52.7 | 55.2 | 61.8 |
| Joint AVE | 42.0 | 52.2 | 50.4 | 64.7 | 53.2 | 50.2 | 42.4 | 48.6 | 51.9 | 55.7 | 48.4 | 45.2 | 56.8 |
| Joint AE | 63.6 | 69.1 | 75.2 | 55.7 | 59.9 | 41.1 | 51.4 | 66.7 | 59.3 | 60.6 | 53.5 | 54.1 | 62.7 |
| Acceleration [45] | 66.7 | 78.0 | 61.6 | 53.0 | 65.4 | 62.4 | 82.6 | 61.6 | 51.1 | 59.3 | 69.5 | 61.7 | 64.3 |
| Person-Ground Contact [32] | 69.8 | 70.1 | 70.2 | 66.0 | 72.8 | 71.8 | 90.1 | 76.9 | 70.3 | 67.9 | 62.6 | 63.2 | 71.8 |
| Foot-Floor Penetration [32] | 47.1 | 52.4 | 52.7 | 55.7 | 48.5 | 56.8 | 59.9 | 52.4 | 50.4 | 55.7 | 52.8 | 53.3 | 53.6 |
| Physical Foot Contact [40] | 80.5 | 77.8 | 73.1 | 51.7 | 67.5 | 57.9 | 78.5 | 63.4 | 53.2 | 65.5 | 68.6 | 68.5 | 64.8 |
| MoBERT [41] | 67.4 | 68.4 | 44.8 | 37.4 | 70.0 | 18.9 | 43.6 | 49.9 | 65.2 | 25.6 | 56.1 | 44.8 | 49.4 |
| MotionCritic (Ours) | 90.6±0.2 | 94.2±0.4 | 91.9±0.3 | 90.6±0.1 | 83.9±1.3 | 85.3±0.7 | 86.0±0.7 | 78.3±1.4 | 79.8±0.9 | 85.1±0.6 | 86.6±0.4 | 82.5±0.3 | 85.1±0.5 |

Table 4: Accuracy comparison of motion evaluation metrics on HumanAct12 action classes (%).

| Metric | Warm. | Walk | Run | Jump | Drink | Lift. | Sit | Eat | Turn. | Phone | Box. | Throw | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Root AVE | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 |
| Root AE | 0.68 | 0.66 | 0.67 | 0.68 | 0.68 | 0.69 | 0.70 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.68 |
| Joint AVE | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 |
| Joint AE | 0.68 | 0.67 | 0.67 | 0.69 | 0.68 | 0.70 | 0.70 | 0.69 | 0.69 | 0.68 | 0.68 | 0.69 | 0.68 |
| Acceleration [45] | 0.70 | 0.59 | 0.88 | 1.5 | 0.60 | 0.69 | 0.60 | 0.64 | 0.88 | 0.71 | 0.61 | 0.76 | 0.78 |
| Person-Ground Contact [32] | 0.71 | 0.68 | 0.68 | 0.71 | 0.69 | 0.70 | 0.73 | 0.68 | 0.74 | 0.70 | 0.73 | 0.72 | 0.73 |
| Foot-Floor Penetration [32] | 0.70 | 0.70 | 0.69 | 0.70 | 0.69 | 0.70 | 0.71 | 0.70 | 0.69 | 0.70 | 0.69 | 0.69 | 0.69 |
| Physical Foot Contact [40] | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 |
| MoBERT [41] | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.70 | 0.69 | 0.69 | 0.69 |
| MotionCritic (Ours) | 0.51±0.01 | 0.52±0.02 | 0.50±0.01 | 0.51±0.02 | 0.56±0.02 | 0.54±0.02 | 0.54±0.02 | 0.59±0.03 | 0.59±0.01 | 0.57±0.01 | 0.53±0.02 | 0.55±0.01 | 0.55±0.02 |

Table 5: Log-loss comparison of motion evaluation metrics on HumanAct12 action classes.
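
For readers who want to compute numbers of this kind for their own metric, the sketch below shows one plausible way to score a metric on ordered preference pairs. It is an assumption, not the evaluation script behind Tables 4 and 5: accuracy is taken as the fraction of pairs in which the preferred motion receives the higher score, and log-loss as the mean negative log-sigmoid of the score difference (a Bradley-Terry style formulation). The function names are hypothetical, and metrics where lower is better would need their sign flipped first.

```python
import numpy as np


def pairwise_accuracy(better, worse):
    """Fraction of pairs where the preferred motion gets the higher score."""
    better = np.asarray(better, dtype=float)
    worse = np.asarray(worse, dtype=float)
    return float(np.mean(better > worse))


def pairwise_log_loss(better, worse):
    """Mean of -log(sigmoid(s_better - s_worse)) over all pairs."""
    diff = np.asarray(better, dtype=float) - np.asarray(worse, dtype=float)
    # -log(sigmoid(d)) equals log(1 + exp(-d)); logaddexp keeps it numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff)))


# Toy scores (made up for illustration): the metric ranks two of three pairs correctly.
acc = pairwise_accuracy([2.0, 0.0, 3.0], [1.0, 1.0, 1.0])
chance = pairwise_log_loss([1.0, 1.0], [1.0, 1.0])
print(acc, chance)
```

Under this formulation, a metric that assigns equal scores to both motions in every pair has a log-loss of ln 2 ≈ 0.69, which is a useful reference point when reading a log-loss table.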

Appendix D Details on MotionCritic: as Training Supervision

D.1 Fine-tuning

Critic Score Clipping.

Generally, a higher MotionCritic score indicates better motion quality. However, this relationship has an upper limit. During our fine-tuning process, we clip motions with reward scores exceeding a threshold $\tau$ when computing gradients before back-propagation. This threshold, determined through a series of comparative experiments, is set at $\tau = 12.0$, approximately the upper bound of ground-truth critic scores. We found that this setting yields the best results. Fine-tuned motion generation models without reward clipping tend to artificially inflate reward scores on a few specific motions, which increases the average MotionCritic score but degrades overall performance. Thus, reward clipping is essential to maintain the integrity and quality of the fine-tuning process.
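
The effect of clipping on the gradient can be illustrated with a small NumPy sketch. It is a simplified stand-in for the actual fine-tuning code, under the assumption that the critic term of the objective is the negative mean of the scores capped at $\tau$; the function name `clipped_critic_objective` is hypothetical. Samples whose score already reaches the threshold then receive zero gradient, so the generator gains nothing by inflating them further.

```python
import numpy as np


def clipped_critic_objective(scores, tau=12.0):
    """Return the clipped critic loss and its gradient w.r.t. each score.

    Loss is -mean(min(score, tau)); scores at or above tau contribute no gradient.
    """
    scores = np.asarray(scores, dtype=float)
    active = scores < tau  # samples that can still be improved
    loss = -float(np.mean(np.minimum(scores, tau)))
    grad = -active.astype(float) / scores.size
    return loss, grad


# Toy batch (made-up scores): the second sample exceeds tau and is clipped.
loss, grad = clipped_critic_objective([10.0, 13.0, 11.5], tau=12.0)
print(loss, grad)
```

In an automatic-differentiation framework the same behaviour would follow from applying a minimum with $\tau$ to the critic output before averaging.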

Finetuning Details.

Inspired by [44], we observe how the critic score changes over denoising steps to identify the optimal time window for the ReFL intercept. As shown in Figure 10(A), we set the step sampling range hyperparameter to $[T_1, T_2] = [700, 900]$, where the critic score increases rapidly.

Figure 10(B) illustrates the variation in the average critic score of a training batch over the course of fine-tuning steps. The fine-tuning process is stable and quick to take effect.

Figure 10: Fine-tuning process. (A): Critic score in 1000-step denoising process. (B): Critic output in 800-step fine-tuning process.

D.2 Results

Improved Critic Score.

As shown in Figure 11, the critic score increases after MotionCritic supervised fine-tuning. This scatter plot collects all data points from the test set, with the critic score of motions before fine-tuning on the x-axis and the critic score of the corresponding motions after fine-tuning on the y-axis. As demonstrated in Figure 11(A), we first compare results with and without critic model supervision. In the latter case, the original MDM loss is used for continued training without our MotionCritic-based plug-and-play module. The scatter plot clearly indicates that the results with critic model supervised fine-tuning achieve significantly higher scores. The second experiment in Figure 11(B) examines different fine-tuning steps using 800 steps from the first set as a baseline. The results demonstrate that critic model supervised fine-tuning consistently improves the critic score throughout the fine-tuning process.

Figure 11: Visualization of critic scores on fine-tuning experiments. (A): Fine-tuning 400 steps with and without MotionCritic supervision compared. (B): Fine-tuning with 400 and 800 steps compared.

Improved Motion Quality.

We conduct an independent user study to compare motion pairs generated at various fine-tuning stages and calculate the Elo rating [7, 40]. Figure 12 shows that, according to the user study, motion quality improves consistently as fine-tuning advances. This improvement aligns with the training objective of raising the critic score.

We further inspect how different metrics change during the fine-tuning process in Figure 12(B). PFC [40] and FID are expected to be negatively correlated with motion quality (the smaller, the better), while MotionCritic and multimodality are expected to be positively correlated (the greater, the better). The results indicate that existing motion quality metrics (e.g., FID, PFC) do not adequately reflect human preference, as they correlate poorly with Elo ratings from user studies. Meanwhile, improving the critic score does not necessarily conflict with the multimodality metric, which models the diversity of generated motions.

Figure 12: Results from fine-tuning process. (A): Elo ratings from user study, FID and average Critic scores in the fine-tuning process. (B): FID, PFC [40], Multimodality and Critic scores in the fine-tuning process.

Appendix E Details on User Studies

Annotation

We conduct user studies on GT subsets grouped from HumanAct12 [13] and on motions generated at different fine-tuning steps, as discussed in the main text. Our user study platform is shown in Figure 13. In the user study, the two motions of a pair are played simultaneously, with their critic scores and text prompts hidden. Annotators choose the better motion, or choose "Almost the Same" if they cannot decide. We perform the user study on 5 different fine-tuning steps and 5 GT batches grouped from HumanAct12 [13].

Win-rates

After annotation, we calculate the win rates of subset pairs. In the user study, each subset contains the same number of motions. Given a subset pair $(A, B)$, the win rate is the percentage of motion pairs in which the motion from subset $A$ wins over the motion from subset $B$ in naturalness. We then plot heatmaps of the win rates over all subsets. Since a match may end in a tie, the win rates of the two subsets in a pair, i.e., the entries at symmetric positions of the heatmap, may sum to less than 1.
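
A minimal sketch of this win-rate computation is given below. The outcome labels `"A"`, `"B"`, and `"tie"` and the function name `win_rates` are hypothetical choices for illustration, not the format of the released annotations.

```python
from collections import Counter
from typing import Iterable, Tuple


def win_rates(outcomes: Iterable[str]) -> Tuple[float, float]:
    """Return (win rate of A over B, win rate of B over A) for one subset pair.

    Each outcome is "A", "B", or "tie"; ties count toward the total but toward neither side.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one annotated match")
    counts = Counter(outcomes)
    total = len(outcomes)
    return counts["A"] / total, counts["B"] / total


# Toy annotations (made up): two wins for A, one for B, one tie.
rate_a, rate_b = win_rates(["A", "A", "B", "tie"])
print(rate_a, rate_b)
```

Because the tie is counted in the denominator only, the two rates in this toy example sum to 0.75 rather than 1.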

Figure 13: Our user study platform.

Elo Rating

After annotation, we calculate the Elo rating [7, 40] of each subset as follows.
Suppose $R_A$ and $R_B$ are the initial ratings of two compared subsets $A$ and $B$. The expected win rates of subsets $A$ and $B$, denoted $E_A$ and $E_B$, are calculated as follows:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} \tag{7}$$
$$E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}} \tag{8}$$

The new ratings of subsets $A$ and $B$ are:

$$R_A' = R_A + K(S_A - E_A) \tag{9}$$
$$R_B' = R_B + K(S_B - E_B) \tag{10}$$

where $K$ is the rating coefficient, which we set to 32, and $S$ is the actual score: 1 for the winner, 0 for the loser, and 0.5 for a tie. We set the initial rating of each subset to 1500.
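
The update rule in Eqs. (7) to (10) can be sketched in a few lines of Python. The function names are hypothetical, and applying the updates one match at a time in annotation order is an assumption; the text above does not specify how matches are ordered or aggregated.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win rate of A against B, as in Eq. (7)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32.0):
    """Apply Eqs. (9) and (10) for one match.

    s_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie; B's score is 1 - s_a.
    """
    e_a = expected_score(r_a, r_b)
    e_b = expected_score(r_b, r_a)
    s_b = 1.0 - s_a
    return r_a + k * (s_a - e_a), r_b + k * (s_b - e_b)


# Both subsets start at 1500; A wins the first match.
r_a, r_b = elo_update(1500.0, 1500.0, s_a=1.0)
print(r_a, r_b)  # 1516.0 1484.0
```

With equal starting ratings, the expected score of each side is 0.5, so a win moves the two ratings apart by $K/2 = 16$ points each, and a tie leaves them unchanged.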