
I'm an undergrad chemistry student, and in a recent laboratory session we were given a set of volume readings from a titration, used to determine the unknown concentration of a reactant $R$. The objective was to calculate an equilibrium constant as a sample statistic by transforming this data set with the given equation:

$$ K_{ps} = \left(A\times v_k\right)^2 $$

In this setup, $v_k$ is a value from the data set, and $A = \frac{M_{T}}{\bar{V}}$ is a positive constant, invariant across the experiments. $M_T$ is the concentration of the standard titrant $T$, and $\bar{V}$ is the volume of the analyzed solution. $A$ was fixed by the conditions of the experiment, since we were given data from a simulation.

When I reported my results, I did so by computing the value of $K_{ps}$ for each $v_k$, and then the mean and standard deviation of the resulting $K_{ps}$ values. However, the lab assistant told us to change this: first compute the mean and standard deviation of $v_k$, and then use the mean as the input for the equation above.

My question is: given that I will transform my initial data, should I calculate the mean and standard deviation before or after applying the transformation? The two methods yield different results for the same data set. Also, I understand that the standard deviation and variance are not preserved under non-linear transformations, which suggests that, to be precise, both statistics should be calculated from the transformed data.
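To make the difference concrete, here is a minimal numerical sketch of the two orders of operation. The volumes and the value of $A$ are made up for illustration and are not my actual lab data:

```python
import numpy as np

# Hypothetical titration volumes (mL) and constant A -- illustrative values only
v = np.array([10.1, 10.3, 9.9, 10.2, 10.0])
A = 0.05  # A = M_T / V_bar, assumed fixed and known

# Method 1 (what I did): transform each volume, then summarize the K_ps values
K_each = (A * v) ** 2
print("mean of K_ps values:", K_each.mean(), "SD:", K_each.std(ddof=1))

# Method 2 (lab assistant): summarize the volumes first, then transform the mean
K_from_mean = (A * v.mean()) ** 2
print("K_ps from mean volume:", K_from_mean)
```

The two printed $K_{ps}$ estimates differ slightly, which is exactly the discrepancy I am asking about.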

  • I hope this question generates some good answers. However, my big suggestion is to ask your TA to teach you why her suggested method is the right way to proceed. If you get a dismissive answer, present the case for the method you suggest and let the TA debunk it. For us, it would help to explain more about $A$.
    – Dave
    Commented Jun 27, 2020 at 1:45
  • @Dave (quote marks cite the TA's reply) I have already pointed that out to her, yet her reply was based on the fact that "the lab guide points that out" and that, when executing a titration (the process under study), chemists commonly work with a mean volume rather than with the set of volumes. This is because we "repeat the measurement multiple times to find an accurate mean for the volume". Edit: I will add the matter about the $A$ constant to the question.
    – user289724
    Commented Jun 27, 2020 at 1:57

2 Answers


The reason is that you don't want to introduce unnecessary bias into your final result.

If you average the transformed values of $K_{ps}$ directly, the observation error is introduced as a bias. To see this, you can expand the formula in the following way:

Your observations can be modeled as follows: $$ \tilde{v}_k = v_k + \varepsilon $$ where $v_k$ is the true value and $\tilde{v}_k$ is the observation.

Assume your observations are unbiased, which means that $E[\varepsilon] = 0$, and that the error variance is $Var[\varepsilon] = \sigma^2$.

Now calculate the expectation of the target quantity under this model:

$$ \begin{align} E[\tilde{K}_{ps}] & = E\left[\left(A\cdot \tilde{v}_k\right)^2\right]\\ & = A^2 \cdot E\left[ v_k^2 + 2v_k\varepsilon + \varepsilon^2\right] \\ & = A^2 \left( v_k^2 + 2v_k\cdot E[\varepsilon] + E[\varepsilon^2] \right)\\ & = A^2 \left( v_k^2 + \sigma^2 \right) \end{align} $$ where $\tilde{K}_{ps}$ is your estimate of the true value $K_{ps}$.

The second term is zero because we assume your observations are unbiased, but the third term is not: $E[\varepsilon^2] = Var[\varepsilon] = \sigma^2$, the variance of the observation error.

Here you should notice that even though your observations are unbiased, your estimate of the target value is biased upward by $A^2\sigma^2$, i.e., by the (scaled) variance of your observations, which is not what you want.

On the other hand, if you calculate the mean of the observations first and then transform it, you get (treating the sample mean as its expectation, i.e., ignoring the much smaller residual term of order $\sigma^2/n$) $$ \begin{align} \tilde{K}_{ps} &= \left(A\cdot E[\tilde{v}_k]\right)^2\\ & = A^2 \cdot v_k^2 \end{align} $$ because we assume $E[\tilde{v}_k] = E[v_k + \varepsilon] = E[v_k] = v_k$.

Now your calculated result is essentially free of this bias.
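A quick Monte Carlo sketch (with made-up numbers, not the questioner's lab data) makes the size of this bias visible: averaging the transformed values overshoots the true $K_{ps}$ by roughly $A^2\sigma^2$, while transforming the averaged volume stays much closer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" volume, constant A, noise level, and sample size -- illustrative only
v_true, A, sigma, n = 10.0, 0.05, 0.2, 5
K_true = (A * v_true) ** 2

n_trials = 100_000
mean_of_K, K_of_mean = [], []
for _ in range(n_trials):
    v_obs = v_true + rng.normal(0.0, sigma, size=n)   # unbiased volume readings
    mean_of_K.append(np.mean((A * v_obs) ** 2))       # transform, then average
    K_of_mean.append((A * np.mean(v_obs)) ** 2)       # average, then transform

print("true K_ps:             ", K_true)
print("transform-then-average:", np.mean(mean_of_K), "(predicted bias A^2*sigma^2 =", A**2 * sigma**2, ")")
print("average-then-transform:", np.mean(K_of_mean), "(residual bias A^2*sigma^2/n =", A**2 * sigma**2 / n, ")")
```

The second estimator is not exactly unbiased either, but its bias is of order $\sigma^2/n$ and shrinks as more repeated measurements are averaged.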

  • Thank you, I added a +1 but it's not recorded because of my reputation. :) Therefore, for any set of observations, should I estimate parameters before transforming, lest the transformed data become biased? Or is this rule applicable only to non-linear transformations?
    – user289724
    Commented Jun 27, 2020 at 2:13
  • Yes, but lack of bias for an expectation may also lack meaning. It depends what you want to do with it.
    – Carl
    Commented Jun 27, 2020 at 2:21
  • @Carl In this case, the only goal is to estimate $K_{ps}$; in this example, to find $\tilde{K}_{ps}$, not to find a distribution for them. What would "lack of meaning" mean in this case?
    – user289724
    Commented Jun 27, 2020 at 2:25
  • Incorrectly inflated variance, for example. It is well known that biased estimators can be more appropriate; it depends on what you want to use the numbers for, and this is not a one-answer-fits-all scenario. What answers are appropriate depends on what use they are put to.
    – Carl
    Commented Jun 27, 2020 at 2:29
  • For example, if you want to do significance testing, your choices would be to either use a test that assumes normal conditions on the more normal, transformed data, or to use a non-parametric test that doesn't use raw-data mean values at all.
    – Carl
    Commented Jun 27, 2020 at 2:38

The mean and standard deviation of the more symmetric histogram are more simply predictive, less variable, and more easily understood. For example, if the data are lognormally distributed, then transforming them by taking the logarithm yields a normal distribution, which, unlike a lognormal distribution, is symmetric: its left deviation equals its right deviation. If instead one calculates the mean of the lognormal data, one has exactly that, i.e., the expected value of the lognormal distribution; but this is not the expectation of a normal distribution, so the mean, mode, and median will not, even for large samples, coincide at the same location. Moreover, the standard deviation of a lognormal distribution is an inflated value that does not relate to probability in the way that the standard deviation of a normal histogram does.

Therefore, one chooses data transformations that confer nice properties on the data, and one then uses that transformation for prediction. Care only needs to be exercised not to confuse what these transformed values represent; that is, back-transforming the mean of a transformed variable generally does not give the mean of the untransformed variable.
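A short simulated illustration of that last point, using synthetic lognormal data (not tied to the titration example): the back-transformed mean of the logs recovers the geometric mean, which sits near the median, not the arithmetic mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic lognormal sample: log(x) ~ Normal(mu, sigma^2)
mu, sigma = 1.0, 0.8
x = rng.lognormal(mean=mu, sigma=sigma, size=200_000)
log_x = np.log(x)  # transformed data, approximately normal and symmetric

print("arithmetic mean of raw data:", x.mean())             # ~ exp(mu + sigma^2/2)
print("exp(mean of log data):      ", np.exp(log_x.mean())) # geometric mean ~ exp(mu), near the median
print("median of raw data:         ", np.median(x))
print("SD of raw data:             ", x.std(ddof=1))         # inflated by the long right tail
print("SD of log data:             ", log_x.std(ddof=1))     # ~ sigma, on the symmetric scale
```

The mean and SD of the log-transformed data describe a symmetric histogram, while the raw-scale mean and SD are pulled up by the skew, which is the distinction drawn above.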

  • Thank you for your answer, this clarifies a bit of the instability of transformations. I take from it that I should work with the most symmetric data I have at hand, right? Yet I don't clearly understand what "nice properties" means.
    – user289724
    Commented Jun 27, 2020 at 2:43
  • That depends on what you want to do with the numbers. Nice in one context is not nice in another. For example, a mean value of raw data is "nice" for some things and "not nice" for others. The arithmetic mean is "nice" for additive combinations of data groups, but the geometric mean is "nice" for ratios of data groups. Short of writing a textbook, statistics should be taken as "look before you leap." That is, before one does something, make sure the assumptions of the methods and the conditions encountered in the data are properly matched.
    – Carl
    Commented Jun 27, 2020 at 3:02