† Lead Authors.

Aligning Human Motion Generation
with Human Perceptions

Haoru Wang¹† Wentao Zhu¹† Luyi Miao¹ Yishu Xu¹
Feng Gao¹ Qi Tian² Yizhou Wang¹
¹ Peking University ² Huawei Cloud

Abstract

Human motion generation is a critical task with a wide range of applications. Achieving high realism in generated motions requires naturalness, smoothness, and plausibility. Despite rapid advancements in the field, current generation methods often fall short of these goals. Furthermore, existing evaluation metrics typically rely on ground-truth-based errors, simple heuristics, or distribution distances, which do not align well with human perceptions of motion quality. In this work, we propose a data-driven approach to bridge this gap by introducing a large-scale human perceptual evaluation dataset, MotionPercept, and a human motion critic model, MotionCritic, that capture human perceptual preferences. Our critic model offers a more accurate metric for assessing motion quality and could be readily integrated into the motion generation pipeline to enhance generation quality. Extensive experiments demonstrate the effectiveness of our approach in both evaluating and improving the quality of generated human motions by aligning with human perceptions. Code and data are publicly available at https://motioncritic.github.io/.

1 Introduction

Human motion generation is an important emerging task [53] with wide-ranging applications, including augmented and virtual reality (AR/VR) [48, 46], human-robot interaction [29, 11], and digital humans [23, 47]. Achieving high realism in generated human motions is crucial, necessitating naturalness, smoothness, and plausibility. However, current generation methods still fall short of these goals, often producing subpar results. Meanwhile, designing appropriate evaluation metrics that accurately reflect these qualities remains a significant challenge. This complexity stems from the highly non-linear and articulated nature of human motion, which must adhere to physical and bio-mechanical constraints while also avoiding visual artifacts. Effective metrics would not only facilitate the objective comparison of generated results but also have the potential to enhance generation models by addressing their shortcomings.

Existing evaluation metrics typically rely on errors against paired ground-truth (GT) motions, simple heuristics, or distribution distances to the real motion manifold. Error-based metrics cannot fully reflect performance because the GT is only one of many reasonable possibilities. Heuristics fall short of comprehensively representing motion quality. For instance, foot-ground contact metrics [32, 40] fail to penalize twisting arm motions that violate bio-mechanical constraints. It is also infeasible to define all the rules of human motion by hand. Meanwhile, distribution distance metrics like Fréchet Inception Distance (FID) [15] do not operate on an instance level but rather assess overall distribution similarity. Consequently, they cannot identify implausible motions or provide direct supervision signals to guide the generation of higher-quality motions. Some studies [40, 41] also indicate that FID correlates poorly with user studies due to the misalignment between its distance measurement and human perception of motion quality. As a result, existing automatic evaluation metrics cannot effectively reflect or replace subjective user studies, hindering objective evaluation and comparison.

In light of this, we advocate the need for automatic evaluation aligned with human perceptions. Firstly, humans are the primary audience and interaction partners for motion generation, making their perception crucial for evaluating motion quality. Secondly, the human brain possesses specialized neural mechanisms for processing biological motion [3, 10] and is sensitive to even slightly unnatural motions [39, 34]. Therefore, we explore the possibility of directly learning perceptual evaluations from humans using a data-driven approach. This method could bridge the gap between objective metrics and subjective human judgments, providing a more accurate assessment of motion quality.

First, we carefully curate a human perceptual evaluation dataset named MotionPercept, which contains $52590$ pairs of human preference annotations on generated motions. Next, we train a human motion critic model, MotionCritic, that learns motion quality ratings from the collected dataset. Our critic model significantly outperforms previous metrics in terms of alignment with human perceptions. Notably, it generalizes well across different data distributions. In addition to motion evaluation, we further propose to utilize the critic model as a direct supervision signal. We demonstrate that MotionCritic can be seamlessly integrated into the generation training pipeline, effectively improving motion generation quality by increasing alignment with human perceptions with few steps of finetuning.

We summarize our contributions as follows: 1) We contribute MotionPercept, a large-scale motion perceptual evaluation dataset with manual annotations. 2) We develop MotionCritic, which models human perceptions of motions through a data-driven approach. Extensive experiments demonstrate its superiority as an automatic human-aligned metric of motion quality. 3) We show that the proposed motion critic model could effectively serve as a supervision signal to enhance motion generation quality. Remarkably, it requires only a small number of fine-tuning steps and can be easily integrated into existing generator training pipelines in a plug-and-play manner.

Figure 1: Framework Overview. We collect MotionPercept, a large-scale, human-annotated dataset for motion perceptual evaluation, where human subjects select the best quality motion in multiple-choice questions. Using this dataset, we train MotionCritic to automatically judge motion quality in alignment with human perceptions, offering better quality metrics. Additionally, we show that MotionCritic can enhance existing motion generators with minimal fine-tuning.

2 Related Work

2.1 Human Motion Generation

Human motion generation is a pivotal task in computer vision, computer graphics, and artificial intelligence, aiming to produce natural and realistic human pose sequences [53]. This field has seen substantial advancements with the rise of deep generative models [22, 33, 9, 16]. Previous works have explored text-conditioned motion generation that transforms narrative descriptions into coherent pose sequences [38, 37, 25, 31, 28], audio-conditioned methods that synchronize movements with rhythmic cues [17, 35, 40], and scene-conditioned generation that integrates environmental contexts to produce contextually appropriate motions [5, 42, 1]. Despite significant progress, current mainstream data-driven kinematic motion generation methods sometimes produce unnatural motions that are jittery, distorted, or violate physiological and physical constraints. These issues could be attributed to the inherent uncertainty of the task, limitations of supervision signals, and dataset noise. Furthermore, evaluating generated human motions presents additional challenges. Conventional metrics such as error and FID fall short in capturing the nuanced details essential for producing lifelike and visually appealing movements [40, 41]. These measures can overlook critical aspects like the fluidity and biomechanical plausibility that are fundamental to human perceptual judgments. Given these challenges, it is imperative to develop metrics that are more closely aligned with human perception in order to evaluate and enhance motion generation results more accurately.

2.2 Human Perception Modeling

Pioneering work [51] collects a human perceptual similarity dataset and proposes to utilize distance in deep features as a perceptual metric. Some works [30, 2, 50, 14, 6] on language models explore aligning model performance with human intent by first training a reward model, then performing reinforcement learning with the reward model. Recent works [49, 43, 24] also explore utilizing human feedback to improve visual generation results. For example, ImageReward [44] proposes a reward feedback learning method (ReFL) to align text-to-image generative models with human judgements. In human motion generation, however, few studies have explored modeling human feedback, even though the generated motion quality is highly relevant to human perceptions. One recent work, MoBERT [41], constructs a dataset of human ratings for generated motions. Our work differs from MoBERT in that we collect real human data on a scale tens of times larger (52.6K vs 1.4K) and use comparisons instead of ratings, which are more robust. We design the critic model to learn ratings from these comparisons automatically. Additionally, our approach could not only evaluate motion quality but also effectively improve motion generation results.

3 MotionPercept: A Large-scale Dataset of Motion Perceptual Evaluation

We build MotionPercept to capture real-human perceptual evaluations with large-scale and diverse human motion sequences. To this end, we implement a rigorous and efficient pipeline for data collection and data annotation. We also design a consensus experiment to examine the perceptual consistency across various human subjects.

3.1 Motion Data Collection

We first collect generated human motion sequence pairs for subsequent perceptual evaluation. We utilize the state-of-the-art diffusion-based motion generation methods MDM [38] and FLAME [21] to generate human motion sequences parameterized by SMPL [27]. For MDM [38], we utilize the action-to-motion models trained on HumanAct12 [13] and UESTC [19], respectively. For FLAME [21], we utilize the text-to-motion model trained on HumanML3D [12]. For each group of $4$ motion sequences to be annotated, we use the same condition (text prompt or action label) while sampling different random noises. This makes the motions similar in content while still having distinguishable differences, thereby making it easier to annotate the choices.

3.2 Human Perceptual Evaluation

Human perceptual evaluation is the core component of MotionPercept, therefore we implement a rigorous pipeline to ensure annotation quality. We first introduce the question design of the perceptual evaluation, then describe the protocol for conducting the evaluation. Finally, we present a statistical analysis of the evaluation results.

3.2.1 Question Design

Our perceptual evaluation is designed in the form of multiple-choice questions as selection is generally easier and more robust than directly rating [20, 36, 41]. Given a group of four motion sequence options, we instruct the annotators to select the best candidate that is most natural, visually pleasing, and free of artifacts. Specifically, we summarize the typical failure modes of the generated motions (e.g., jittering, foot skating, limb distortion, penetration, etc.) and explicitly require the annotators to exclude these options. We provide detailed guidance with task descriptions and representative video examples to better communicate the goal to the annotators. The full guidance is presented in the supplementary materials. While the optimal choice can be decided unambiguously in most cases, there are situations where the decision can be challenging. Therefore, we add two additional options, “all good” and “all bad”, so that the annotator is not required to pick one of the motions in these cases, thereby improving overall annotation quality. Results indicate that these cases account for a small portion of the total data. We exclude these cases from our subsequent experiments. In total, we set six options for each entry: four motion candidates plus “all good” and “all bad”.

3.2.2 Protocols

To ensure the quality of perceptual evaluation results, our annotation process consists of annotator training, annotation, and quality control. We recruit 10 annotators to perform the perceptual evaluation. Before the evaluation begins, we provide annotation guidelines to help the annotators understand the task and maintain consistent criteria. The annotators must pass a pilot test before starting the formal annotation to ensure they correctly understand the annotation requirements. Additionally, we conduct a perceptual consensus experiment to assess whether the annotation pipeline is suitable for our dataset, as discussed in Section 3.3. Finally, we implement a quality control process where the annotated data is reviewed by an expert quality inspector. During the annotation process, we continuously monitor the quality of each batch of data. For each batch, we randomly sample 10% of the data for quality inspection. The consistency between the sampled data and the expert’s annotations must exceed 90%; otherwise, the entire batch will be re-annotated.
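The batch-level quality control rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name `batch_passes_qc`, the callable `expert_label_fn`, and the fixed random seed are hypothetical, and only the 10% sampling rate and the 90% threshold come from the text.

```python
import random

def batch_passes_qc(annotator_labels, expert_label_fn, sample_frac=0.10,
                    min_agreement=0.90, seed=0):
    """Return (passed, agreement) for one annotated batch.

    annotator_labels: dict mapping question id -> option chosen by the annotator.
    expert_label_fn: hypothetical callable giving the expert's choice for a question id.
    """
    rng = random.Random(seed)
    ids = sorted(annotator_labels)
    k = max(1, round(sample_frac * len(ids)))      # size of the inspected sample
    sampled = rng.sample(ids, k)
    matches = sum(annotator_labels[q] == expert_label_fn(q) for q in sampled)
    agreement = matches / k
    # The text requires agreement to exceed the threshold; otherwise re-annotate.
    return agreement > min_agreement, agreement
```

A batch for which the function returns `False` would be sent back for re-annotation in full.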

3.3 Analysis

In total, we collect annotations for 18260 multiple-choice questions covering 73040 unique motions, significantly surpassing previous work [41] (1400 motions). We further investigate the following two questions:

1. Based on our experimental setup, can the subjects confidently select the suitable options from the choices provided?
2. Is there a significant difference in perceptual preferences among different subjects, or are they well-aligned?

For the first question, we calculate the proportion of cases where a choice could not be made (including “all good” and “all bad”), and find a total of 418 such groups (2.29%). The result indicates that most of the time subjects can make a definite judgment, demonstrating the validity of our protocol design.

For the second question, we conduct a perceptual consensus experiment where all 10 subjects perform perceptual evaluation independently on 312 groups of randomly selected data. We calculate their pairwise and overall consistency in choices. Figures 2(A) and 2(B) show that for most questions (82.37%), all 10 subjects make a unanimous decision. Figure 2(C) reveals that all 10 subjects exhibit high pairwise agreement (90%). These results indicate a high level of consistency in perceptual judgments of human motion among different human subjects. This not only validates the rationality of our perceptual evaluation pipeline but also inspires us to train machine learning models to emulate this consistent judgment capability.
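The two consistency statistics used in the consensus experiment can be computed as follows. This is a sketch under the assumption that each subject's answers are stored as a list of chosen options, one per question; the function name `consensus_stats` and the data layout are illustrative and not taken from the released code.

```python
from itertools import combinations

def consensus_stats(choices):
    """choices[s][q] is the option picked by subject s on question q.

    Returns the fraction of questions answered unanimously and a dict of
    pairwise agreement rates keyed by (subject_a, subject_b).
    """
    n_subjects = len(choices)
    n_questions = len(choices[0])
    unanimous = sum(
        len({choices[s][q] for s in range(n_subjects)}) == 1
        for q in range(n_questions)
    ) / n_questions
    pairwise = {}
    for a, b in combinations(range(n_subjects), 2):
        agree = sum(choices[a][q] == choices[b][q] for q in range(n_questions))
        pairwise[(a, b)] = agree / n_questions
    return unanimous, pairwise
```

With 10 subjects this yields 45 pairwise agreement rates, which is the kind of matrix a figure such as 2(C) would visualize.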

4 MotionCritic: Advancing Motion Generation with Perceptual Alignment

Based on MotionPercept, we develop a human motion critic model, MotionCritic, to emulate the perceptual judgment capabilities of human subjects regarding human motion. We first present the problem formulation and training approach of the critic model, and then explain how to use the critic model for optimizing motion generation.

4.1 Problem Formulation

We formulate the problem as follows: given an input human motion sequence $\mathbf{x}$, we assume there is an implicit human perception model $\mathcal{H}$ that rates the motion quality $\mathcal{H}(\mathbf{x})$, where a higher rating indicates better quality. We aim to build a computational critic model $\mathcal{C}$ that best aligns with $\mathcal{H}$. Since $\mathcal{H}$ is not explicitly available, we take a data-driven approach. We obtain the human perceptual evaluation dataset $\mathcal{D}$ containing multiple pairs of samples $(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$. Our training objective is to train the model $\mathcal{C}$ using the dataset $\mathcal{D}$ so that it approximates the human perception model $\mathcal{H}$ as closely as possible. Specifically, we want the model prediction $\mathcal{C}(\mathbf{x}^{(i)})>\mathcal{C}(\mathbf{x}^{(j)})$ if and only if $\mathcal{H}(\mathbf{x}^{(i)})>\mathcal{H}(\mathbf{x}^{(j)})$. Based on the Bradley-Terry model [4, 18], the overall training objective could be written as maximizing the joint probabilities that the model $\mathcal{C}$ makes judgments consistent with $\mathcal{H}$ for each pair of samples in the dataset $\mathcal{D}$:

$$\arg\max_{\mathcal{C}}\,\mathbb{E}_{(\mathbf{x}^{(i)},\mathbf{x}^{(j)})\sim\mathcal{D}}\left[\log\sigma\left(\left(\mathcal{C}(\mathbf{x}^{(i)})-\mathcal{C}(\mathbf{x}^{(j)})\right)\cdot\left(\mathcal{H}(\mathbf{x}^{(i)})-\mathcal{H}(\mathbf{x}^{(j)})\right)\right)\right], \tag{1}$$

where $\sigma$ is the sigmoid function.
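For readers less familiar with the Bradley-Terry model, the following short derivation shows why a sigmoid of a score difference appears in Eq. (1). It uses the standard form of the model with the critic score as the log-strength of each motion. The last line rests on the assumption, made here for illustration, that only the direction of the human preference is observed, so that the difference $\mathcal{H}(\mathbf{x}^{(i)})-\mathcal{H}(\mathbf{x}^{(j)})$ is replaced by $\pm 1$.

```latex
\begin{aligned}
P\big(\mathbf{x}^{(i)} \succ \mathbf{x}^{(j)}\big)
  &= \frac{\exp \mathcal{C}(\mathbf{x}^{(i)})}
          {\exp \mathcal{C}(\mathbf{x}^{(i)}) + \exp \mathcal{C}(\mathbf{x}^{(j)})}
   = \sigma\big(\mathcal{C}(\mathbf{x}^{(i)}) - \mathcal{C}(\mathbf{x}^{(j)})\big), \\
P\big(\mathbf{x}^{(j)} \succ \mathbf{x}^{(i)}\big)
  &= 1 - \sigma\big(\mathcal{C}(\mathbf{x}^{(i)}) - \mathcal{C}(\mathbf{x}^{(j)})\big)
   = \sigma\big(-(\mathcal{C}(\mathbf{x}^{(i)}) - \mathcal{C}(\mathbf{x}^{(j)}))\big), \\
\log P(\text{observed preference})
  &= \log \sigma\big(\pm(\mathcal{C}(\mathbf{x}^{(i)}) - \mathcal{C}(\mathbf{x}^{(j)}))\big).
\end{aligned}
```

Under this reading, maximizing the expectation in Eq. (1) amounts to maximizing the log-likelihood of the annotated preferences.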

4.2 Human Motion Critic Model

In practice, we represent human motion by $\mathbf{x}\in\mathbb{R}^{L\times J\times D}$ where $L$ denotes the sequence length, $J$ denotes the number of body joints, and $D$ denotes parameter dimensions. We implement the critic model $\mathcal{C}$ as a neural network that maps the high-dimensional motion parameters to a scalar $s$ . We draw pairwise comparison annotations from the collected dataset, where $\mathbf{x}^{(h)}$ is the better instance and $\mathbf{x}^{(l)}$ is the worse. The perceptual alignment loss is thus given by:

$$\mathcal{L}_{\text{Percept}}=-\mathbb{E}_{(\mathbf{x}^{(h)},\mathbf{x}^{(l)})\sim\mathcal{D}}\left[\log\sigma\left(\mathcal{C}(\mathbf{x}^{(h)})-\mathcal{C}(\mathbf{x}^{(l)})\right)\right]. \tag{2}$$
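As an illustration of Eq. (2), the loss can be evaluated from precomputed critic scores with plain Python. This sketch only covers the loss computation, not the critic network itself; the function names are hypothetical, and the numerically stable log-sigmoid is a common implementation choice rather than something stated in the paper.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z)) that avoids overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def perceptual_loss(scores_better, scores_worse):
    """Mean of -log(sigmoid(s_h - s_l)) over annotated (better, worse) pairs."""
    assert scores_better and len(scores_better) == len(scores_worse)
    total = sum(-log_sigmoid(h - l) for h, l in zip(scores_better, scores_worse))
    return total / len(scores_better)
```

The loss equals $\ln 2$ when the critic assigns equal scores to both motions in every pair and approaches zero as the better motion is scored increasingly higher than the worse one.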

4.3 Motion Generation with Critic Model Supervision

Additionally, we explore utilizing the learned human perceptual prior of $\mathcal{C}$ not only for evaluating generated motions, but also for improving them. Using MDM [38] as an example, we demonstrate that our motion critic model could be integrated into state-of-the-art diffusion-based motion generation approaches with ease. The forward diffusion is modeled as a Markov noising process $\{\mathbf{x}_{t}\}_{t=0}^{T}$ where $\mathbf{x}_{0}$ is drawn from the data distribution, and

$$q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\mathcal{N}\left(\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1},(1-\alpha_{t})I\right), \tag{3}$$

where $\alpha_{t}\in(0,1)$ are constant hyper-parameters. When $\alpha_{t}$ is small enough, it is reasonable to approximate $\mathbf{x}_{T}\sim\mathcal{N}(0,I)$, which allows us to sample $\mathbf{x}_{T}$ from random noise to begin the denoising process.
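The noising process in Eq. (3) can be illustrated with a small pure-Python sketch. Composing the steps gives the standard closed form $q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\sqrt{\bar{\alpha}_{t}}\,\mathbf{x}_{0},(1-\bar{\alpha}_{t})I)$ with $\bar{\alpha}_{t}=\prod_{s\leq t}\alpha_{s}$, so $\mathbf{x}_{T}$ is close to standard Gaussian noise once this product is near zero. The linear schedule and its endpoints below are assumptions chosen for illustration only; the schedule actually used by the generator is not specified in this section.

```python
import math
import random

def linear_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_t = 1 - beta_t for an assumed linear beta schedule (illustrative values)."""
    return [1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
            for i in range(T)]

def forward_step(x_prev, alpha_t, rng):
    """Draw x_t from q(x_t | x_{t-1}) in Eq. (3), elementwise on a list of floats."""
    return [math.sqrt(alpha_t) * v + math.sqrt(1.0 - alpha_t) * rng.gauss(0.0, 1.0)
            for v in x_prev]

def cumulative_alpha(alphas):
    """Product of all alpha_t; its square root scales the mean of x_T given x_0."""
    prod = 1.0
    for a in alphas:
        prod *= a
    return prod
```

For the assumed schedule the cumulative product after 1000 steps is far below 1e-3, so almost nothing of $\mathbf{x}_{0}$ survives in $\mathbf{x}_{T}$.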

Algorithm 1 Fine-tuning Motion Generation with MotionCritic

1: Dataset: Action-label set $\mathcal{Y}=\{y_{1},y_{2},\ldots,y_{n}\}$
2: Pre-training Dataset: Action-motion pairs dataset $\widetilde{\mathcal{D}}=\{(\textrm{label}_{1},\textrm{mot}_{1}),\ldots,(\textrm{label}_{n},\textrm{mot}_{n})\}$
3: Input: MDM model $\mathcal{M}_{\theta_{0}}$ with pre-trained weights $\theta_{0}$, Critic model $\mathcal{C}$, MDM loss function $\psi$, Critic-to-loss map function $\phi$, Critic re-weight scale $\lambda$, KL loss re-weight scale $\mu$
4: Initialization: The number of noise scheduler time steps $T$, and time step range for fine-tuning $[T_{1},T_{2}]$
5: for $(\textrm{label}_{i},\textrm{mot}_{i})\in\widetilde{\mathcal{D}}$ do
6:   $\mathcal{L}_{\text{MDM}}\leftarrow\psi_{\theta_{i}}(\textrm{label}_{i},\textrm{mot}_{i})$
7:   $\theta_{i}\leftarrow\theta_{i}$ ▷ Update $\textrm{MDM}_{\theta_{i}}$ with $\mathcal{L}_{\text{MDM}}$
8:   $t\leftarrow\text{rand}(T_{1},T_{2})$ ▷ Pick a random time step $t\in[T_{1},T_{2}]$
9:   $\mathbf{x}_{T}\sim\mathcal{N}(0,I)$ ▷ Sample noise
10:   for $j=T,\ldots,t+1$ do
11:     no grad: $\mathbf{x}_{j-1}\leftarrow\mathcal{M}_{\theta_{i}}\{\mathbf{x}_{j}\}$
12:   end for
13:   with grad: $\mathbf{x}_{0}^{\prime}\leftarrow\mathcal{M}_{\theta_{i}}\{\mathbf{x}_{t}\}$
14:   if $\widetilde{\mathbf{x}_{0}}^{\prime}$ is not None then
15:     $\mathcal{L}_{\text{KL}}\leftarrow\mu\,\text{KL}(\widetilde{\mathbf{x}_{0}}^{\prime},\mathbf{x}_{0}^{\prime})$ ▷ KL loss with previous $\mathbf{x}_{0}^{\prime}$
16:   end if
17:   $r\sim\mathcal{N}(0,1)$ ▷ Random noise scalar
18:   $\mathcal{L}_{\text{Critic}}\leftarrow\lambda\phi(\mathcal{C}(\mathbf{x}_{0}^{\prime}),r)$ ▷ Critic loss
19:   $\theta_{i+1}\leftarrow\theta_{i}$ ▷ Update $\mathcal{M}_{\theta_{i}}$ with $\mathcal{L}_{\text{Critic}}$ and $\mathcal{L}_{\text{KL}}$
20:   $\widetilde{\mathbf{x}_{0}}^{\prime}\leftarrow\mathbf{x}_{0}^{\prime}$ ▷ Save $\mathbf{x}_{0}^{\prime}$ for next-step $\mathcal{L}_{\text{KL}}$
21: end for

Given an MDM model $\mathcal{M}$ with pre-trained parameters $\theta_{0}$ , we fine-tune to improve its alignment with a pre-trained critic model $\mathcal{C}$ . We develop a lightweight perceptual-aligned fine-tuning approach based on ReFL [44]. Notably, in order to utilize the critic model in a plug-and-play manner, we keep the MDM training step and objective $\mathcal{L}_{\text{MDM}}$ unchanged. Instead, we simply add one optimization step with critic model supervision in each training iteration as shown in Figure 1.

Specifically, we sample Gaussian noise $\mathbf{x}_{T}$ and run the denoising process until $\mathbf{x}_{t}$, where $t\in[T_{1},T_{2}]$ is randomly selected among the later denoising steps. Then, a single-step denoising is performed to predict $\mathbf{x}_{0}^{\prime}$ from $\mathbf{x}_{t}$. Based on the predicted motion $\mathbf{x}_{0}^{\prime}$, we compute the critic score $s=\mathcal{C}(\mathbf{x}_{0}^{\prime})$, which is used to compute the motion critic loss:

$$\mathcal{L}_{\text{Critic}}=\mathbb{E}_{y_{i}\sim\mathcal{Y}}\left[\phi\left(\mathcal{C}(\mathbf{x}_{0}^{\prime})\right)\right], \tag{4}$$

where $\phi(s)=-\sigma(\tau-s)$ is a critic-to-loss mapping function, $\tau$ is a constant for shifting the critic value, and $\sigma$ is the sigmoid function. We further introduce a Kullback-Leibler (KL) divergence regularization to prevent $\mathcal{M}$ from moving substantially away from the conditional motion generation task:

$$\mathcal{L}_{\text{KL}}=\mathbb{E}_{y_{i}\sim\mathcal{Y}}\left[D_{\text{KL}}\left(p(\mathbf{x}_{0}^{\prime})\,\|\,p(\widetilde{\mathbf{x}_{0}^{\prime}})\right)\right]. \tag{5}$$

The overall fine-tuning loss is given by

$$\mathcal{L}_{\text{FT}}=\mathcal{L}_{\text{MDM}}+\lambda\mathcal{L}_{\text{Critic}}+\mu\mathcal{L}_{\text{KL}}, \tag{6}$$

where $\lambda$ and $\mu$ are constants for loss balancing. The detailed algorithm workflow is shown in Algorithm 1.
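The sampling part of Algorithm 1, where the generator denoises without gradients down to a random step $t$ and then makes a single tracked prediction of $\mathbf{x}_{0}^{\prime}$, can be sketched framework-agnostically. The callables `denoise_step` and `predict_x0` are hypothetical stand-ins for the generator, and the function name is illustrative; in an actual deep learning framework only the final call would be recorded for backpropagation.

```python
import random

def critic_rollout(denoise_step, predict_x0, x_T, T, t_range, rng):
    """Denoise x_T down to a random step t, then predict x_0' in one step.

    denoise_step(x, j) maps x_j to x_{j-1} and would run without gradients.
    predict_x0(x, t) maps x_t to x_0' and is the only gradient-tracked call.
    """
    t = rng.randint(t_range[0], t_range[1])   # inclusive range [T_1, T_2]
    x = x_T
    for j in range(T, t, -1):                 # j = T, ..., t + 1
        x = denoise_step(x, j)
    return predict_x0(x, t), t
```

Restricting gradients to a single denoising step keeps the memory cost of the added optimization step close to that of one forward pass, which is consistent with the lightweight fine-tuning described above.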

5 Experiment

5.1 Implementation Details

Critic Model.

We train our critic model using the MDM subset in MotionPercept. We convert each multiple-choice question into three ordered preference pairs, which results in 46761 pairs for training and 5829 pairs for testing. We parameterize motion sequences with SMPL [27], including 24 axis-angle rotations and the global root translation. We implement the critic model with a DSTformer [52] backbone with 3 layers and 8 attention heads. We apply temporal average pooling on the encoded motion embeddings, followed by an MLP with a hidden layer of 1024 channels to predict a single scalar score. We train the critic model for 150 epochs with a batch size of 64 and a learning rate starting at 2e-3, decreasing with a $0.995$ exponential learning rate decay.
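The conversion from one multiple-choice answer to three ordered pairs is not spelled out in the text. A natural reading, assumed here, is that the selected motion is paired as the better instance against each of the three unselected candidates. The function name and the use of string identifiers are illustrative.

```python
def question_to_pairs(candidate_ids, chosen_index):
    """Turn one multiple-choice answer into (better, worse) preference pairs.

    candidate_ids: identifiers of the four motions shown in the question.
    chosen_index: position of the motion the annotator selected as best.
    """
    best = candidate_ids[chosen_index]
    return [(best, other)
            for i, other in enumerate(candidate_ids)
            if i != chosen_index]
```

Under this reading, the 46761 training pairs and 5829 test pairs together correspond to 52590 / 3 = 17530 answered questions.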

Fine-tuning.

We use the MDM [38] model trained on HumanAct12 [13] as our baseline, which utilizes 1000 DDPM denoising steps. We load the checkpoint trained for 350000 iterations and fine-tune for 800 iterations, with a batch size of 64 and a learning rate of 1e-5. We fine-tune with critic clipping threshold $\tau=12.0$, critic re-weight scale $\lambda=$ 1e-3, and KL loss re-weight scale $\mu=1.0$. We set the step sampling range $[T_{1},T_{2}]=[700,900]$.

5.2 MotionCritic as Motion Quality Metric

We first evaluate whether the proposed critic model could serve as an effective motion quality metric. Specifically, we are interested in the following research questions:

1. How does MotionCritic align with human perceptual evaluations?
2. Could MotionCritic generalize to different data distributions?

To investigate the first question, we evaluate the performance of our critic model on a held-out test set and compare it with existing motion quality metrics as follows:

• Error-based metrics, including Root Average Error (Root AVE), Root Absolute Error (Root AE), Joint Average Error (Joint AVE), and Joint Absolute Error (Joint AE). These metrics directly compute the distance between the generated motion and a paired GT motion with the same condition.
• Heuristic metrics, including acceleration [32, 45], Person-Ground Contact [32], Foot-Floor Penetration [32], and Physical Foot Contact (PFC) [40]. These metrics do not compare against GT; instead, they implement intuitive rule-based evaluations. For example, PFC models the relationship between center of mass acceleration and foot-ground contact.
• Learning-based metrics. Prior work MoBERT [41] proposes to evaluate motion quality with a motion feature extractor and support vector regression (SVR).

Table 1: Quantitative comparison of motion evaluation metrics on MDM and FLAME subsets of MotionPercept.

Metric	MDM Accuracy (%) $\uparrow$	MDM Log Loss $\downarrow$	FLAME Accuracy (%) $\uparrow$	FLAME Log Loss $\downarrow$
Root AVE	59.47	0.6891	48.42	0.6984
Root AE	61.79	0.6798	59.54	0.6711
Joint AVE	56.77	0.6889	44.61	0.6973
Joint AE	62.73	0.6794	58.37	0.6891
Acceleration [45]	64.26	0.7792	66.67	0.6919
Person-Ground Contact [32]	71.78	0.7260	69.82	0.7243
Foot-Floor Penetration [32]	53.61	0.6939	55.56	0.6906
Physical Foot Contact [40]	64.79	0.6926	66.00	0.6930
MoBERT [41]	49.40	0.6931	52.40	0.6932
MotionCritic (Ours)	85.07	0.5486	81.43	0.5758

Note that distribution-based metrics (e.g., FID) cannot compare the quality of individual motion sequences; a comparison with them can be found in subsequent experiments. For each metric, we calculate the percentage of pairs on which it agrees with the GT annotations (accuracy) and its probabilistic distance from the GT annotations (log loss). We use the softmax function to convert the scores to probabilities (negating the scores before the softmax for metrics where smaller is better). Table 1 demonstrates that our critic model significantly outperforms previous metrics. These results not only validate the effectiveness of learning from large-scale human perceptual evaluations but also show that our critic model can serve as a more comprehensive and robust metric for assessing motion quality.
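The accuracy and log loss described above can be sketched as follows for any scalar metric evaluated on annotated (better, worse) pairs. A softmax over two scores reduces to a sigmoid of their difference, which is what the code uses. The function name, the treatment of ties as incorrect, and the absence of any score normalization are assumptions; the text does not state how these details are handled.

```python
import math

def pairwise_metrics(scores_better, scores_worse, higher_is_better=True):
    """Accuracy and mean log loss of a metric on (better, worse) pairs."""
    sign = 1.0 if higher_is_better else -1.0   # negate when smaller is better
    correct, nll = 0, 0.0
    for h, l in zip(scores_better, scores_worse):
        d = sign * (h - l)
        # Two-way softmax: P(annotated-better wins) = 1 / (1 + exp(-d)).
        if d >= 0:
            log_p = -math.log1p(math.exp(-d))
        else:
            log_p = d - math.log1p(math.exp(d))
        correct += d > 0                       # ties count as incorrect (assumption)
        nll -= log_p
    n = len(scores_better)
    return correct / n, nll / n
```

Under this convention, a metric that assigns identical scores to both motions in every pair obtains a log loss of $\ln 2\approx 0.6931$, which is a useful reference point when reading the log loss columns.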

Furthermore, to investigate the second question, we test the critic model on data outside of the training distribution. We collect a standalone test set with a different motion generation algorithm, FLAME [21], and perform perceptual evaluation with a different human subject. Note that FLAME is trained on a different dataset [12] from the model used to generate the critic model's training data, which means the action categories differ substantially. The results in Table 1 further show that our critic model generalizes well to the new test set, indicating its efficacy in evaluating different generation algorithms and unseen motion content.

Additionally, we test the generalization of our critic model on the real GT motion distribution. Figure 4(A) illustrates the critic score distribution of the HumanAct12 [13] test set. We divide the 1190 GT motions into 5 groups based on their critic scores, evenly distributed from highest to lowest. We compare the average critic score of each group with the distribution-based metric FID and a user study. The user study is conducted by comparing motion pairs sampled from each group and then computing the Elo rating [7, 40] for each group. Figure 4(B) clearly indicates that the critic score aligns well with human preferences, while FID does not. Notably, we discover that the outliers with small critic values (group V) are indeed artifacts within the dataset. Please refer to the supplementary materials for video results. The results indicate that our critic model can also generalize to the GT motion manifold, even though the model has never been trained on it. This also highlights the potential of using our critic model as a tool for dataset diagnosis (e.g., discovering failure modes).
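For reference, Elo ratings can be computed from pairwise user-study outcomes with the standard update rule sketched below. The step size `k`, the starting rating, and the function names are assumptions chosen for illustration; the exact settings used in the user study are not given here.

```python
def elo_update(r_winner, r_loser, k=32.0):
    """Standard Elo update after one comparison; k is an assumed step size."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

def elo_ratings(comparisons, groups, start=1500.0, k=32.0):
    """comparisons: iterable of (winner_group, loser_group) outcomes."""
    ratings = {g: start for g in groups}
    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], k)
    return ratings
```

Because sequential Elo updates depend on the order of comparisons, a practical evaluation would typically shuffle the comparisons or average over several orderings.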

5.3 MotionCritic as Training Supervision

Furthermore, we investigate whether our critic model can also serve as an effective supervision signal. Specifically, we fine-tune a pre-trained motion generator [38] with the proposed framework, and evaluate on HumanAct12 [13] test set every 200 steps. Additionally, we conduct a standalone user study by comparing motion pairs generated at different fine-tuning steps and compute the Elo Rating [7, 40]. Figure 5 reveals that as fine-tuning progresses, the motion quality consistently improves according to the user study, in line with the training objective of increasing the critic score. We also present a visualization comparison in Figure 6. We discover that as fine-tuning progresses, unreasonable human motions such as jittering, twisting, and floating significantly decrease. Please refer to the supplementary materials for video comparisons. The results also demonstrate that our fine-tuning process requires only hundreds of iterations to take effect, significantly improving the perceptual quality of the model. Compared to the 350K pre-training steps, this accounts for only 0.23% of the training cost. This further demonstrates the advantages of our proposed framework in using a perceptually-aligned critic model to fine-tune the motion generation model, not only improving quality but also being lightweight and efficient.

6 Conclusion

In conclusion, our work bridges the important gap in human motion generation between objective metrics and human perceptual evaluations by introducing a data-driven framework with MotionPercept and MotionCritic. This paradigm not only offers a more comprehensive metric of motion quality but could also improve the generation results by aligning with human preferences. We hope this work could contribute to more objective evaluations of motion generation methods and results. One limitation of our approach is its primary focus on perceptual metrics without explicitly simulating biomechanical plausibility, which could be explored in future work. Future research could also investigate more fine-grained perceptual evaluation methods to obtain rich human feedback on motion quality, as in [26].

References

[1] Joao Pedro Araújo, Jiaman Li, Karthik Vetrivel, Rishi Agarwal, Jiajun Wu, Deepak Gopinath, Alexander William Clegg, and Karen Liu. Circle: Capture in rich contextual environments. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 21211–21221, 2023.
[2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
[3] Sarah-Jayne Blakemore and Jean Decety. From the perception of action to the understanding of intention. Nature reviews neuroscience, 2(8):561–567, 2001.
[4] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[5] Enric Corona, Albert Pumarola, Guillem Alenya, and Francesc Moreno-Noguer. Context-aware human motion prediction. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 6992–7001, 2020.
[6] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
[7] Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
[8] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, December 2021.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. Adv. Neural Inform. Process. Syst., 2014.
[10] Emily Grossman, Michael Donnelly, R Price, D Pickens, V Morgan, G Neighbor, and Randolph Blake. Brain areas involved in perception of biological motion. Journal of cognitive neuroscience, 12(5):711–720, 2000.
[11] Gianpaolo Gulletta, Wolfram Erlhagen, and Estela Bicho. Human-like arm motion generation: A review. Robotics, 9(4):102, 2020.
[12] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5152–5161, 2022.
[13] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proc. ACM Int. Conf. Multimedia, pages 2021–2029, 2020.
[14] Joey Hejna and Dorsa Sadigh. Inverse preference learning: Preference-based rl without a reward function. Advances in Neural Information Processing Systems, 36, 2024.
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proc. Adv. Neural Inform. Process. Syst., volume 30, 2017.
[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Proc. Adv. Neural Inform. Process. Syst., 2020.
[17] Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, and Daxin Jiang. Dance revolution: Long-term dance generation with music via curriculum learning. In Proc. Int. Conf. Learn. Represent., 2021.
[18] David R Hunter. Mm algorithms for generalized bradley-terry models. The annals of statistics, 32(1):384–406, 2004.
[19] Yanli Ji, Feixiang Xu, Yang Yang, Fumin Shen, Heng Tao Shen, and Wei-Shi Zheng. A large-scale rgb-d database for arbitrary-view human action recognition. In Proc. ACM Int. Conf. Multimedia, pages 1510–1518, 2018.
[20] Maurice George Kendall. Rank correlation methods. 1948.
[21] Jihoon Kim, Jiseob Kim, and Sungjoon Choi. Flame: Free-form language-based motion synthesis & editing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8255–8263, 2023.
[22] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proc. Int. Conf. Learn. Represent., 2014.
[23] Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellström. Analyzing input and output representations for speech-driven gesture generation. In Proc. Int. Conf. on Intelligent Virtual Agents, pages 97–104, 2019.
[24] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[25] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.
[26] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19401–19411, 2024.
[27] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph., 34(6):1–16, 2015.
[28] Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, and Grégory Rogez. Posegpt: Quantization-based 3d human motion generation and forecasting. In Proc. Eur. Conf. Comput. Vis., pages 417–435, 2022.
[29] Yusuke Nishimura, Yutaka Nakamura, and Hiroshi Ishiguro. Long-term motion generation for interactive humanoid robots using gan with convolutional network. In Proc. ACM/IEEE Int. Conf. on Human-Robot Interaction, pages 375–377, 2020.
[30] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Proc. Adv. Neural Inform. Process. Syst., 35:27730–27744, 2022.
[31] Mathis Petrovich, Michael J. Black, and Gül Varol. TEMOS: Generating diverse human motions from textual descriptions. In Proc. Eur. Conf. Comput. Vis., pages 480–497, 2022.
[32] Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J Guibas. Humor: 3d human motion model for robust pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11488–11499, 2021.
[33] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proc. Int. Conf. Mach. Learn., 2015.
[34] Sotaro Shimada and Kazuma Oki. Modulation of motor area activity during observation of unnatural body movements. Brain and cognition, 80(1):1–6, 2012.
[35] Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. Bailando: 3d dance generation via actor-critic gpt with choreographic memory. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 11050–11059, June 2022.
[36] Neil Stewart, Gordon DA Brown, and Nick Chater. Absolute identification by relative judgment. Psychological review, 112(4):881, 2005.
[37] Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. In Proc. Eur. Conf. Comput. Vis., pages 358–374, 2022.
[38] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. In Proc. Int. Conf. Learn. Represent., 2023.
[39] Nikolaus F Troje. Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of vision, 2(5):2–2, 2002.
[40] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Editable dance generation from music. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 448–458, 2023.
[41] Jordan Voas, Yili Wang, Qixing Huang, and Raymond Mooney. What is the best automated metric for text to motion generation? In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
[42] Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and Xiaolong Wang. Synthesizing long-term 3d human motion and interaction in 3d scenes. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 9401–9411, 2021.
[43] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
[44] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
[45] Sicheng Yang, Zhiyong Wu, Minglei Li, Zhensong Zhang, Lei Hao, Weihong Bao, and Haolin Zhuang. Qpgesture: Quantization-based and phase-guided motion matching for natural speech-driven gesture generation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023.
[46] Zhuoqian Yang, Wentao Zhu, Wayne Wu, Chen Qian, Qiang Zhou, Bolei Zhou, and Chen Change Loy. Transmomo: Invariance-driven unsupervised video motion retargeting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[47] Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. Generating holistic 3d human motion from speech. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023.
[48] Tairan Yin, Ludovic Hoyet, Marc Christie, Marie-Paule Cani, and Julien Pettré. The one-man-crowd: Single user generation of crowd motions using virtual reality. IEEE Trans. Vis. Comput. Graph., 28(5):2245–2255, 2022.
[49] Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, and Dong Ni. Instructvideo: Instructing video diffusion models with human feedback. arXiv preprint arXiv:2312.12490, 2023.
[50] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
[51] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
[52] Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. Learning human motion representations: A unified perspective. In Proc. Int. Conf. Comput. Vis., 2023.
[53] Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. Human motion generation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.


Part 2 Appendix


Appendix A Details on MotionPercept

A.1 Prompt Selection

We utilize the prompts from HumanAct12 [13], UESTC [19] and HumanML3D [12] for generating the motion candidates. Specifically, we use the 12 action labels from HumanAct12 [13] (shown in Table 2) and the 40 categories of aerobic exercise description from UESTC [19] (shown in Table 3) for the MDM [38] model.

HumanAct12 [13] Action Labels
warm up	walk
run	jump
drink	lift dumbbell
sit	eat
turn steering wheel	phone
boxing	throw

Table 2: 12 action labels from HumanAct12 [13].

UESTC [19] Action Labels
punching and knee lifting	marking time and knee lifting
jumping-jack	squatting
forward-lunging	left-lunging
left-stretching	raising-hand-and-jumping
left-kicking	rotation-clapping
front-raising	pulling-chest-expanders
punching	wrist-circling
single-dumbbell-raising	shoulder-raising
elbow-circling	dumbbell-one-arm-shoulder-pressing
arm-circling	dumbbell-shrugging
pinching-back	head-anticlockwise-circling
shoulder-abduction	deltoid-muscle-stretching
straight-forward-flexion	spinal-stretching
dumbbell-side-bend	standing-opposite-elbow-to-knee-crunch
standing-rotation	overhead-stretching
upper-back-stretching	knee-to-chest
knee-circling	alternate-knee-lifting
bent-over-twist	rope-skipping
standing-toe-touches	standing-gastrocnemius-calf
single-leg-lateral-hopping	high-knees-running

Table 3: 40 action labels from UESTC [19].

We randomly select texts from the HumanML3D [12] test set as prompts for the FLAME [21] model.

A.2 Annotation Management

We recruit 10 annotators for this task and randomly allocate data entries among them. Annotators receive detailed guidelines. We assess annotation quality by spot check: we randomly select 10% of all data, inspect the annotations against the guidelines, and calculate the proportion of unqualified data entries. If the unqualified proportion is less than 10%, the results are considered acceptable. All unqualified data entries are re-annotated. We update the guidelines during annotation based on spot-check feedback, and annotators study the updated guidelines.
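The spot-check rule above can be sketched as follows. This is only an illustration under stated assumptions: `is_qualified` is a hypothetical callable standing in for the human inspector's judgment against the guidelines, and the function name and return values are not part of the original annotation pipeline.

```python
import random


def spot_check(entries, is_qualified, sample_rate=0.10, max_bad_rate=0.10, seed=0):
    """Inspect a random sample of annotated entries and decide acceptance.

    `entries` is a list of annotated data entries. `is_qualified` is a
    hypothetical stand-in for a human inspector (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_sample = max(1, round(len(entries) * sample_rate))
    sample = rng.sample(entries, n_sample)
    # Entries that fail inspection would be sent back for re-annotation.
    bad = [e for e in sample if not is_qualified(e)]
    bad_rate = len(bad) / n_sample
    # Acceptable only if the unqualified proportion is below the threshold.
    return bad_rate < max_bad_rate, bad_rate, bad
```

The function returns the accept/reject decision together with the failing entries, so that those entries can be re-annotated regardless of the decision.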

A.3 Annotation Design

We generate four motions from the same prompt for each data entry, as shown in Fig 7. The prompts are hidden during the annotation process. Annotators are required to select either the best or the worst motion for data entries generated by MDM [38] and FLAME [21]. MDM [38] exhibits better motion diversity but lacks stability, so annotators are instructed to select the best motion. Conversely, FLAME [21] demonstrates better stability but lacks diversity, so annotators are instructed to select the worst motion for these entries.

A.4 Annotation Guidance Documentation

We provide a detailed annotation document to explain the annotation process. The annotation platform is shown in Fig 8.

Introduction

Each data entry to be annotated consists of four videos, as shown in Fig 7. Each video is approximately three seconds long, with all four videos playing simultaneously and concatenated into one video.

Requirements

Each set of videos has six options: A, B, C, D, "all are good," and "all are bad." Annotators should select the most natural and reasonable video for each data entry. If one option stands out as the best, select that option. If all actions seem equally good or equally bad, choose "all are good" or "all are bad." Text prompts will be hidden during annotation.

Video Examples

We provide annotators with examples of what kinds of motions are unnatural and unacceptable:

1. Body pose is unnatural, including hands, feet and so on.
2. Human motion violates physiological constraints.
3. Human motion is erratic or severely stutters.
4. Human body collides with itself, such as hands fully embedded into a leg.
5. Human body is severely tilted, to the point of losing balance.
6. Human body appears to be drifting instead of walking.

Examples of these problems are shown in Fig 9.

Appendix B Data Documentation

We follow the datasheet proposed in [8] for documenting our MotionPercept:

1.
Motivation
1. (a)
  
  For what purpose was the dataset created?
  This dataset was created to collect human perceptual data on whether human motions seem natural, and ultimately to advance our study of perceptually aligned metrics and the fine-tuning of human motion generation models.
2. (b)
  
  Who created the dataset and on behalf of which entity?
  This dataset was created by Haoru Wang, Yishu Xu, Luyi Miao, Wentao Zhu, Feng Gao and Yizhou Wang with Peking University.
3. (c)
  
  Who funded the creation of the dataset?
  The creation of this dataset was funded by Peking University.
4. (d)
  
  Any other Comments?
  None.
2.
Composition
1. (a)
  
  What do the instances that comprise the dataset represent?
  Each instance contains four videos of human motions generated from the same prompt by existing motion generation methods [38, 21].
2. (b)
  
  How many instances are there in total?
  In total, we collect annotations for 18260 multiple-choice questions covering 73K unique motions.
3. (c)
  
  Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
  No, this is a brand-new dataset.
4. (d)
  
  What data does each instance consist of?
  See appendix A for details.
5. (e)
  
  Is there a label or target associated with each instance?
  Yes. See appendix A.
6. (f)
  
  Is any information missing from individual instances?
  No.
7. (g)
  
  Are relationships between individual instances made explicit?
  Yes.
8. (h)
  
  Are there recommended data splits?
  Yes, we have separated the whole dataset into MDM-A (motions generated by MDM [38] from prompts in HumanAct12 [13]), MDM-U (motions generated by MDM [38] from prompts in UESTC [19]), and FLAME (motions generated by FLAME [21] from prompts in HumanML3D [12]). We provide the recommended data splits by combining MDM-A and MDM-U and randomly splitting them into a training set and a test set at a ratio of 8:1. Data generated by FLAME [21] is primarily used as test data for generalization.
9. (i)
  
  Are there any errors, sources of noise, or redundancies in the dataset?
  No.
10. (j)
  
  Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
  The dataset is self-contained.
11. (k)
  
  Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?
  No.
12. (l)
  
  Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
  No.
13. (m)
  
  Does the dataset relate to people?
  Yes. Our human motion data is generated as body model parameters [27], not from real people, and therefore does not contain biometrics. These data are annotated by human annotators.
14. (n)
  
  Does the dataset identify any subpopulations (e.g., by age, gender)?
  No. Our human motion data are generated as body model parameters [27] with no explicit gender or age.
15. (o)
  
  Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?
  No. Our human motion data are generated by algorithms with commonly used body models.
16. (p)
  
  Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?
  No.
17. (q)
  
  Any other comments?
  None.
3.
Collection Process
1. (a)
  
  How was the data associated with each instance acquired?
  See appendix A for details.
2. (b)
  
  What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?
  We use existing motion generation models to collect videos and require annotators to label them. See appendix A for details.
3. (c)
  
  If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
  See appendix A and appendix B for details.
4. (d)
  
  Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
  The video data was collected by the authors. The annotations were performed by the workers in DATATANG TECHNOLOGY INC., and the workers were offered a fair wage as per the prearranged contract. See appendix A and appendix B for details.
5. (e)
  
  Over what timeframe was the data collected?
  The data were collected from 2023 to 2024, and labeled in 2024.
6. (f)
  
  Were any ethical review processes conducted (e.g., by an institutional review board)?
  No. The MotionPercept dataset raises no ethical concerns regarding the privacy information of human subjects.
7. (g)
  
  Does the dataset relate to people?
  Yes. Our human motion data are generated as body model parameters [27], not real people. The annotation is done by people.
8. (h)
  
  Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?
  We obtain raw data from motion generation models. Annotation data are collected by annotators.
9. (i)
  
  Were the individuals in question notified about the data collection?
  Yes.
10. (j)
  
  Did the individuals in question consent to the collection and use of their data?
  Yes.
11. (k)
  
  If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?
  Yes.
12. (l)
  
  Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?
  Not applicable.
13. (m)
  
  Any other comments?
  None.
4.
Preprocessing, Cleaning and Labeling
1. (a)
  
  Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
  Yes, see appendix A.
2. (b)
  
  Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?
  Yes. We provide raw data entries and their annotations respectively.
3. (c)
  
  Is the software used to preprocess/clean/label the instances available?
  No. The annotation software is the private labeling platform provided by DATATANG TECHNOLOGY INC.
4. (d)
  
  Any other comments?
  None.
5.
Uses
1. (a)
  
  Has the dataset been used for any tasks already?
  No, the dataset is newly proposed by us.
2. (b)
  
  Is there a repository that links to any or all papers or systems that use the dataset?
  Yes, we provide the link to all related information on our project page.
3. (c)
  
  What (other) tasks could the dataset be used for?
  This dataset could be used for other research topics, including but not limited to human preference study, human motion study.
4. (d)
  
  Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
  See appendix A for details.
5. (e)
  
  Are there tasks for which the dataset should not be used?
  The usage of this dataset should be limited to the scope of human motion.
6. (f)
  
  Any other comments?
  None.
6.
Distribution
1. (a)
  
  Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
  Yes, the dataset will be made publicly available.
2. (b)
  
  How will the dataset be distributed (e.g., tarball on website, API, GitHub)?
  The dataset will be published on our code website with its metadata document.
3. (c)
  
  Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
  We release our benchmark under the CC BY-NC 4.0 license (https://paperswithcode.com/datasets/license).
4. (d)
  
  Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
  No.
5. (e)
  
  Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
  No.
6. (f)
  
  Any other comments?
  None.
7.
Maintenance
1. (a)
  
  Who is supporting/hosting/maintaining the dataset?
  Haoru Wang is maintaining.
2. (b)
  
  How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
  ou524u@stu.pku.edu.cn
3. (c)
  
  Is there an erratum?
  Currently, no. As errors are encountered, future versions of the dataset may be released and updated on our website.
4. (d)
  
  Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
  Yes, if applicable.
5. (e)
  
  If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)?
  Our human motion dataset is generated as body model parameters [27], not from real people. There are no applicable limits on retention of the data, and the annotators are aware of how the data are used.
6. (f)
  
  Will older versions of the dataset continue to be supported/hosted/maintained?
  Yes, older versions of the benchmark will be maintained on our website.
7. (g)
  
  If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
  Yes, please get in touch with us by email.
8. (h)
  
  Any other comments?
  None.

Appendix C Details on MotionCritic: as Motion Quality Metric

Data Pre-processing.

Each multiple-choice question is divided into three ordered preference pairs. Motion sequences are parameterized using SMPL [27], which includes 24 axis-angle rotations and one global root translation.
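To make the pairing step concrete, the sketch below expands one four-way question into three ordered (better, worse) pairs. The pairing rule is our reading of Appendix A.3 (a "best" selection is preferred over each of the other three motions, and each of the other three is preferred over a "worst" selection); the function name and identifiers are hypothetical, and the handling of "all are good" or "all are bad" answers is not specified here and is left out.

```python
def to_preference_pairs(motion_ids, chosen, mode):
    """Expand one four-way question into three (better, worse) pairs.

    mode="best":  the chosen motion is preferred over each of the others.
    mode="worst": each of the others is preferred over the chosen motion.
    This rule is an assumption based on the annotation design in A.3.
    """
    if chosen not in motion_ids:
        raise ValueError("chosen must be one of motion_ids")
    others = [m for m in motion_ids if m != chosen]
    if mode == "best":
        return [(chosen, m) for m in others]
    if mode == "worst":
        return [(m, chosen) for m in others]
    raise ValueError("mode must be 'best' or 'worst'")


# Example: annotator picked motion "C" as the best of four candidates.
pairs = to_preference_pairs(["A", "B", "C", "D"], "C", "best")
# pairs == [("C", "A"), ("C", "B"), ("C", "D")]
```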

Training and Evaluation.

We train the critic model from scratch on MotionPercept using the DSTformer [52] backbone with 3 layers and 8 attention heads. To assess robustness, we train the model multiple times with different random seeds and report error bars across runs. Evaluation results broken down by action label are presented in Tables 4 and 5. MotionCritic achieves the best average accuracy and log-loss and robustly scores different types of human motions.

Metric	Warm.	Walk	Run	Jump	Drink	Lift.	Sit	Eat	Turn.	Phone	Box.	Throw	Avg.
Root AVE	57.6	47.3	56.8	62.7	59.5	46.3	37.9	64.5	54.1	0.62	51.6	53.7	59.5
Root AE	70.1	70.0	69.7	57.2	70.0	49.8	52.5	61.2	63.2	61.7	52.7	55.2	61.8
Joint AVE	42.0	52.2	50.4	64.7	53.2	50.2	42.4	48.6	51.9	55.7	48.4	45.2	56.8
Joint AE	63.6	69.1	75.2	55.7	59.9	41.1	51.4	66.7	59.3	60.6	53.5	54.1	62.7
Acceleration [45]	66.7	78.0	61.6	53.0	65.4	62.4	82.6	61.6	51.1	59.3	69.5	61.7	64.3
Person-Ground Contact [32]	69.8	70.1	70.2	66.0	72.8	71.8	90.1	76.9	70.3	67.9	62.6	63.2	71.8
Foot-Floor Penetration [32]	47.1	52.4	52.7	55.7	48.5	56.8	59.9	52.4	50.4	55.7	52.8	53.3	53.6
Physical Foot Contact [40]	80.5	77.8	73.1	51.7	67.5	57.9	78.5	63.4	53.2	65.5	68.6	68.5	64.8
MoBERT [41]	67.4	68.4	44.8	37.4	70.0	18.9	43.6	49.9	65.2	25.6	56.1	44.8	49.4
MotionCritic (Ours)	$90.6_{\pm 0.2}$	$94.2_{\pm 0.4}$	$91.9_{\pm 0.3}$	$90.6_{\pm 0.1}$	$83.9_{\pm 1.3}$	$85.3_{\pm 0.7}$	$86.0_{\pm 0.7}$	$78.3_{\pm 1.4}$	$79.8_{\pm 0.9}$	$85.1_{\pm 0.6}$	$86.6_{\pm 0.4}$	$82.5_{\pm 0.3}$	$85.1_{\pm 0.5}$

Table 4: Accuracy comparison of motion evaluation metrics on HumanAct12 action classes (%).

Metric	Warm.	Walk	Run	Jump	Drink	Lift.	Sit	Eat	Turn.	Phone	Box.	Throw	Avg.
Root AVE	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69
Root AE	0.68	0.66	0.67	0.68	0.68	0.69	0.70	0.69	0.69	0.69	0.69	0.69	0.68
Joint AVE	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69
Joint AE	0.68	0.67	0.67	0.69	0.68	0.70	0.70	0.69	0.69	0.68	0.68	0.69	0.68
Acceleration [45]	0.70	0.59	0.88	1.5	0.60	0.69	0.60	0.64	0.88	0.71	0.61	0.76	0.78
Person-Ground Contact [32]	0.71	0.68	0.68	0.71	0.69	0.70	0.73	0.68	0.74	0.70	0.73	0.72	0.73
Foot-Floor Penetration [32]	0.70	0.70	0.69	0.70	0.69	0.70	0.71	0.70	0.69	0.70	0.69	0.69	0.69
Physical Foot Contact [40]	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69
MoBERT [41]	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.70	0.69	0.69	0.69
MotionCritic (Ours)	$0.51_{\pm 0.01}$	$0.52_{\pm 0.02}$	$0.50_{\pm 0.01}$	$0.51_{\pm 0.02}$	$0.56_{\pm 0.02}$	$0.54_{\pm 0.02}$	$0.54_{\pm 0.02}$	$0.59_{\pm 0.03}$	$0.59_{\pm 0.01}$	$0.57_{\pm 0.01}$	$0.53_{\pm 0.02}$	$0.55_{\pm 0.01}$	$0.55_{\pm 0.02}$

Table 5: Log-loss comparison of motion evaluation metrics on HumanAct12 action classes.
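For readers who wish to reproduce numbers of the kind reported in Tables 4 and 5, the sketch below computes pairwise accuracy and log-loss from metric scores on (better, worse) pairs. It assumes a Bradley-Terry-style preference probability [4], $P(\text{better}\succ\text{worse})=\sigma(s_{\text{better}}-s_{\text{worse}})$; the exact definitions used for the tables are given in the main text, so this is an illustration rather than the reference implementation. Under this assumption, a metric that assigns equal scores to both motions has a log-loss of $\ln 2\approx 0.69$, which is the value many baselines sit at in Table 5.

```python
import math


def pairwise_accuracy(pairs):
    """Fraction of (better_score, worse_score) pairs ranked correctly."""
    return sum(1 for b, w in pairs if b > w) / len(pairs)


def pairwise_log_loss(pairs):
    """Mean of -log(sigmoid(better_score - worse_score)) over all pairs."""
    def neg_log_sigmoid(x):
        # Numerically stable form of log(1 + exp(-x)).
        return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return sum(neg_log_sigmoid(b - w) for b, w in pairs) / len(pairs)


# Hypothetical scores for three annotated pairs: (better, worse).
example_pairs = [(2.0, -1.0), (0.5, 0.7), (1.2, 1.2)]
accuracy = pairwise_accuracy(example_pairs)
log_loss = pairwise_log_loss(example_pairs)
```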

Appendix D Details on MotionCritic: as Training Supervision

D.1 Fine-tuning

Critic Score Clipping.

Generally, a higher MotionCritic score indicates better motion quality. However, this relationship has an upper limit. During our fine-tuning process, we clip motions with reward scores exceeding a threshold $\tau$ when computing gradients before back-propagation. This threshold, determined through a series of comparative experiments, is set at $\tau=12.0$, approximately the upper bound of ground-truth critic scores. We found that this setting yields the best results. Fine-tuned motion generation models without reward clipping tend to artificially inflate reward scores on a few specific motions, which increases the average MotionCritic score but degrades overall performance. Thus, reward clipping is essential to maintain the integrity and quality of the fine-tuning process.
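The effect of clipping on the gradient can be illustrated with a small sketch. It assumes, purely for illustration, a reward term of the form $-\frac{1}{N}\sum_{i}\min(s_{i},\tau)$ over the critic scores $s_{i}$ of a batch; the actual fine-tuning objective is defined in the main text, and the function name is hypothetical. The point is that motions already scoring above $\tau$ contribute no gradient, so the generator cannot profit from inflating their scores further.

```python
TAU = 12.0  # threshold reported in the text


def clipped_reward_loss(scores, tau=TAU):
    """Return (loss, dloss_dscore) for loss = -mean(min(score, tau)).

    The loss form is an assumption of this sketch, not the paper's objective.
    """
    n = len(scores)
    loss = -sum(min(s, tau) for s in scores) / n
    # min(s, tau) is flat above tau, so those motions receive zero gradient.
    grads = [(-1.0 / n) if s <= tau else 0.0 for s in scores]
    return loss, grads


loss, grads = clipped_reward_loss([10.0, 14.0])
# loss == -11.0 and grads == [-0.5, 0.0]
```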

Finetuning Details.

Inspired by [44], we observe how the critic score changes over denoising steps to identify the optimal time window for the ReFL intercept. As shown in Figure 10(A), we set the step sampling range hyperparameter to $[T_{1},T_{2}]=[700,900]$, where the critic score increases rapidly.

Figure 10(B) illustrates the variation in the average critic score of a training batch over the course of fine-tuning steps. The fine-tuning process is stable and quick to take effect.

D.2 Results

Improved Critic Score.

As shown in Figure 11, the critic score increases after MotionCritic supervised fine-tuning. This scatter plot collects all data points from the test set, with the critic score of motions before fine-tuning on the x-axis and the critic score of the corresponding motions after fine-tuning on the y-axis. As demonstrated in Figure 11(A), we first compare results with and without critic model supervision. In the latter case, the original MDM loss is used for continued training without our MotionCritic-based plug-and-play module. The scatter plot clearly indicates that the results with critic model supervised fine-tuning achieve significantly higher scores. The second experiment in Figure 11(B) examines different fine-tuning steps using 800 steps from the first set as a baseline. The results demonstrate that critic model supervised fine-tuning consistently improves the critic score throughout the fine-tuning process.

Improved Motion Quality.

We conduct an independent user study to compare motion pairs generated at various fine-tuning stages and calculate the Elo rating [7, 40]. Figure 12 shows that, according to the user study, motion quality consistently improves as fine-tuning advances. This improvement aligns with the training objective of raising the critic score.

We further inspect the change of different metrics during the fine-tuning process in Figure 12(B). PFC [40] and FID are expected to be negatively correlated with motion quality (the smaller, the better), and MotionCritic and multimodality are expected to be positively correlated (the greater, the better). The results indicate that existing motion quality metrics (e.g. FID, PFC) do not adequately reflect human preference, as they poorly correlate with Elo ratings from user studies. Meanwhile, improving the critic score does not necessarily conflict with the multimodality metric, which models the diversity of generated motions.

Appendix E Details on User Studies

Annotation

We conduct user studies on GT subsets grouped from HumanAct12 [13] and on motions generated at different fine-tuning steps, as discussed in the main text. Our user study platform is shown in Fig 13. In the user study, the two motions of a pair are played simultaneously, with their critic scores and text prompts hidden. Annotators choose the better motion, or choose "Almost the Same" if they cannot decide. We perform the user study on 5 different fine-tuning steps and 5 GT batches grouped from HumanAct12 [13].

Win-rates

After annotation, we calculate the win rates of subset pairs. In the user study, each subset contains the same number of motions. Given a subset pair $(A,B)$, the win rate is the proportion of motion pairs in which the motion from subset $A$ wins over the motion from subset $B$ in naturalness. We then plot heatmaps of the win rates for all subsets. Since a match may end in a tie, the win rates of the two subsets in a pair, which occupy symmetric positions in the heatmap, may sum to less than 1.
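A minimal sketch of the win-rate computation is given below. The match record format `(a, b, outcome)` and the function name are assumptions made for illustration. Ties count toward the number of matches but are a win for neither side, which is why the two symmetric entries can sum to less than 1.

```python
def win_rate_matrix(matches, subsets):
    """Compute win rates for every ordered pair of subsets.

    matches: iterable of (a, b, outcome), where a and b are distinct subset
    names and outcome is "a", "b", or "tie" (hypothetical record format).
    """
    wins = {(x, y): 0 for x in subsets for y in subsets if x != y}
    totals = dict(wins)
    for a, b, outcome in matches:
        totals[(a, b)] += 1
        totals[(b, a)] += 1
        if outcome == "a":
            wins[(a, b)] += 1
        elif outcome == "b":
            wins[(b, a)] += 1
        # A tie increases the totals only.
    return {k: wins[k] / totals[k] for k in wins if totals[k] > 0}


example_matches = [
    ("s0", "s1", "a"),
    ("s0", "s1", "tie"),
    ("s1", "s0", "a"),
    ("s0", "s1", "a"),
]
rates = win_rate_matrix(example_matches, ["s0", "s1"])
# rates[("s0", "s1")] == 0.5 and rates[("s1", "s0")] == 0.25
```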

Elo Rating

After annotation, we calculate the Elo rating [7, 40] of each subset as follows.
Suppose $R_{A},R_{B}$ are the current ratings of two compared subsets $A$ and $B$. The expected win rates of subsets $A$ and $B$, denoted as $E_{A},E_{B}$, are calculated as follows:

$$E_{A}=\frac{1}{1+10^{(R_{B}-R_{A})/400}} \tag{7}$$
$$E_{B}=\frac{1}{1+10^{(R_{A}-R_{B})/400}} \tag{8}$$

The new ratings of subsets $A$ and $B$ are:

$$R_{A}^{\prime}=R_{A}+K(S_{A}-E_{A}) \tag{9}$$
$$R_{B}^{\prime}=R_{B}+K(S_{B}-E_{B}) \tag{10}$$

where $K$ is the rating coefficient, which we set to 32, and $S_{A},S_{B}$ are the actual scores: 1 for the winner, 0 for the loser, and 0.5 for each side if the result is a tie. We set the initial rating of each subset to 1500.
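Equations (7) to (10) can be sketched in a few lines. The function names and the match record format `(a, b, s_a)` are hypothetical. Elo ratings depend on the order in which matches are processed; the ordering used in our study is not specified here, so the sketch simply processes matches in the order given.

```python
def expected_score(r_a, r_b):
    """Expected win rate of A against B, as in Eq. (7)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, s_a, k=32.0):
    """Apply Eqs. (9) and (10) for one match; s_a is 1, 0, or 0.5."""
    e_a = expected_score(r_a, r_b)
    e_b = expected_score(r_b, r_a)
    s_b = 1.0 - s_a
    return r_a + k * (s_a - e_a), r_b + k * (s_b - e_b)


def run_elo(matches, subsets, initial=1500.0, k=32.0):
    """Process matches (a, b, s_a) sequentially from equal initial ratings."""
    ratings = {s: initial for s in subsets}
    for a, b, s_a in matches:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], s_a, k)
    return ratings


# Two equally rated subsets; A wins the first match.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
# new_a == 1516.0 and new_b == 1484.0
```

Because $E_{A}+E_{B}=1$ and $S_{A}+S_{B}=1$, each update moves the two ratings by equal and opposite amounts, so the sum of all ratings is preserved.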
