
I have a question about a non-Gaussian distributed parameter that can only take certain values within a defined interval.
I have to characterize this parameter starting from a set of its values, and in the end I must report only an average value and a tolerance. I am asking myself whether the mean value should be calculated over the whole set, or only over the values inside the tolerance.

I'll try to explain my situation in more detail. I have to consider only the central 86.6% of the original set of values (I originally wrote 84%, but ±1.5σ of a Gaussian actually covers about 86.6%), cutting the same percentage from the head and from the tail; the values that remain are the ones that should give me the estimate I am looking for. If the distribution were Gaussian, I would simply report the average value ±1.5 × standard deviation as the parameter and its tolerance.
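A quick check of that percentage (a minimal sketch, assuming SciPy only for this verification; my actual calculations are done in Excel):

```python
from scipy.stats import norm

# Probability mass of a Gaussian within mean +/- 1.5 standard deviations.
print(2 * norm.cdf(1.5) - 1)   # ~0.8664, i.e. about 86.6%
```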

[Figure: Gaussian density with the mean ±1.5σ interval highlighted. The percentage printed in the picture is wrong; it should be 86.6%.]

In the current, non-Gaussian case I must decide whether to calculate the average value (weighted by the probability of occurrence of each value) on the whole set or only on the "cut" set, and then whether it is better to calculate the tolerance as the maximum deviation of the 6.7th/93.3rd percentiles from the average value, as the average of the two deviations, or in some other way. I am not sure here either.

Below is a chart of Values vs. Probability for my parameter (in this case the average value has been calculated on the original, uncut set):

[Figure: histogram of the parameter values with the average value and the percentile cut-offs marked. The percentiles printed in the picture are wrong; they should be 6.7 and 93.3.]

The blue line is a trendline made with Excel; each column includes all the values between the one shown on the x-axis and the next one. This representation is maybe not the best one, but it helps to show how the distribution behaves.
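For concreteness, here is a minimal sketch of the candidate computations (in Python with NumPy purely as an illustration; the `values` array is hypothetical and my real data lives in Excel):

```python
import numpy as np

# Hypothetical measurements of the parameter (illustration only).
values = np.array([9.8, 10.1, 10.3, 10.4, 10.6, 10.9, 11.2, 11.8, 12.5, 14.0])

# Keep the central 86.6% of the data: cut 6.7% from each tail.
lo, hi = np.percentile(values, [6.7, 93.3])
cut = values[(values >= lo) & (values <= hi)]

mean_all = values.mean()   # option A: mean over the whole set
mean_cut = cut.mean()      # option B: mean over the "cut" set only

# Candidate tolerances around the chosen mean (here mean_all):
tol_max = max(mean_all - lo, hi - mean_all)        # maximum deviation
tol_avg = ((mean_all - lo) + (hi - mean_all)) / 2  # average of the two deviations

print(mean_all, mean_cut, tol_max, tol_avg)
```

Which of these quantities I should actually report is exactly what I am asking.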

Which of these options are the most correct?

  • The last figure is the histogram of your parameter calculated over all the possible values it takes, right? The 3-sigma rule is just a rule of thumb to avoid outlying observations. There are much better ways than that. Commented May 15, 2015 at 0:43
  • Yes, it is. Can you tell me any way better than this? (It must, however, discard the first and last values too.) Commented May 15, 2015 at 8:51
  • I don't understand: are you asking whether to use a continuous random variable or a discrete one?
    – JMP
    Commented May 17, 2015 at 5:02
  • Or whether to take the mean before or after the chop; in theory there is no difference except in extreme circumstances. This makes for an interesting proof!
    – JMP
    Commented May 17, 2015 at 5:14
  • Or the classical 'mauve or purple'?
    – JMP
    Commented May 17, 2015 at 5:23

1 Answer


The issue that you raised with this question is in the area of robust statistics. In the case of estimating a parameter, it is called robust parameter estimation. There is a good book by Huber (Robust Statistics); I think it will help a lot.

The idea is as follows. When you are estimating a parameter, the regular procedure first writes down the log-likelihood of the data under the assumed density function and then finds the parameter value that maximizes it; the result is therefore called the maximum likelihood estimator (MLE). In many practical applications the data under test contain some outliers, i.e. samples that are inherently wrong and do not follow the assumed density function. This can happen, for example, when a patient's EEG is being recorded and the patient moves his/her head involuntarily.

Let $f$ be the density function and suppose there are $n$ data samples, each denoted by $x_i$. The maximum likelihood estimator is found by solving

$$\hat\mu=\arg\max_{\mu}\sum_{i=1}^n \log f(x_i,\mu)$$

The idea is to replace $-\log f$ with some nice function $\rho$ and to minimize instead of maximize. Then the problem is

$$\hat\mu=\arg\min_{\mu}\sum_{i=1}^n \rho(x_i,\mu)$$

Assume that the parameter of interest is the mean value of the distribution. In the robust estimation context it is called the location parameter. For this case one can write

$$\hat\mu=\arg\min_{\mu}\sum_{i=1}^n \rho(x_i-\mu)$$

Now as an example, take $\rho(x)=x^2$. This corresponds to the maximum likelihood estimator of the location parameter of the Gaussian distribution. If you take the derivative with respect to $\mu$ and set it equal to $0$, you will find

$$\hat\mu=\frac{1}{n}\sum_{i=1}^n x_i$$
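As a quick numerical sanity check (a minimal sketch, assuming NumPy and SciPy are available; the helper name `m_estimate` is mine, not standard), one can also find $\hat\mu$ by minimizing the sum directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_estimate(x, rho):
    """Location M-estimate: the mu that minimizes sum_i rho(x_i - mu)."""
    objective = lambda mu: np.sum(rho(x - mu))
    return minimize_scalar(objective, bounds=(x.min(), x.max()), method="bounded").x

x = np.array([1.0, 2.0, 2.5, 3.0, 4.5])
print(m_estimate(x, lambda r: r**2))   # ~2.6, the same as the sample mean
print(x.mean())                        # 2.6
```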

What if you choose $\rho$ differently? For example, $\rho(x)=|x|$ corresponds to a very robust estimator: the minimizer is the sample median (this is the maximum likelihood estimator of location for the Laplace distribution). For a Gaussian distribution the mean and the median are the same, so the choice depends on how much trouble you have in the tails of the distribution. Huber proposed a very nice transition from mean to median via the function

$$\rho(x)=\begin{cases}x^2 & \mathrm{if}\ |x|<c\\ c(2|x|-c) & \mathrm{otherwise}\end{cases}$$

With this nice function one can trade off the strength of the estimator against outliers: as $c\to 0$ this estimator approaches the median, and as $c\to \infty$ it becomes the plain mean, i.e. the MLE of the location of a Gaussian.
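To see this transition numerically (again a minimal sketch under the same assumptions; `huber_location` is a hypothetical helper that minimizes the Huber objective above by iteratively reweighted least squares, one standard way to do it):

```python
import numpy as np

def huber_location(x, c, tol=1e-8, max_iter=100):
    """Minimize sum_i rho(x_i - mu) for the Huber rho above via reweighting."""
    mu = np.median(x)                        # robust starting point
    for _ in range(max_iter):
        r = x - mu                           # residuals at the current estimate
        w = c / np.maximum(np.abs(r), c)     # weight 1 inside [-c, c], c/|r| outside
        mu_new = np.sum(w * x) / np.sum(w)   # weighted mean is the next iterate
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10.0, 1.0, 95), np.full(5, 50.0)])  # 5 gross outliers
print(data.mean())                     # pulled toward the outliers
print(np.median(data))                 # essentially ignores them
print(huber_location(data, c=1.345))   # small c: close to the median
print(huber_location(data, c=100.0))   # large c: reverts to the plain mean
```

With a small $c$ the gross outliers are heavily down-weighted, while with a large $c$ the estimate essentially reverts to the plain mean.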

Coming back to your question: if you are completely sure that even the observations with higher absolute values are clean and follow the Gaussian distribution, then you should use all data points.

If you know that your data may be contaminated, then you need to consider robust estimators. There is a trade-off between robustness and efficiency, which can be adjusted by choosing a suitable value of $c$ as given above.

  • This is a great answer (even though I didn't understand: is "c" the number of measurements/samples?), and first of all let me thank you for it. I am not sure that my measurements are all contained between what is presented in the chart above, since (as stated in some previous comments) the values are measurements of an electromagnetic device vs. frequency. I can measure only some of those frequencies, therefore I cannot be sure. I have, though, a specification (whitepaper) that says that my measurements are enough to describe the current parameter in that frequency set. Commented May 18, 2015 at 6:16
  • Correct me if I'm wrong: I must in this case assume that my samples are all inside the max and min values in the above chart. In that case I'd have to use the mean on all values; this is clear. What your answer did NOT specify is, actually, how to introduce a tolerance. My parameter must be described by 2 values: mean and tolerance (the tolerance would be 1.5 × St.Dev. if I had a Gaussian distribution). Commented May 18, 2015 at 6:24
  • $c$ is the tolerance parameter in the setting given above. For example, if $c=1.345$, then this corresponds to $95\%$ efficiency of the estimator for the normal distribution. If there is a predefined tolerance as $1.5\%$, then yes, it is the $3$-sigma rule and it means one needs to consider only the values lying in $3$-sigma for the estimation. On the other hand, if some data samples are not observable, then there may be a bias up to some degree. In this case the approach would be bias correction. Commented May 18, 2015 at 12:59
  • Yes, but in this case I'd have a specification that explicitly says not to consider any more values... So basically I have to assume that all I have is all that can be observed. So let's say I use the mean on all the values (BEFORE the crop); how should I choose my tolerance? If I had a Gaussian, I could write my parameter as m +/- c, where m is my mean and c the tolerance... Now I only have a clue about m. Is there no other solution than to "split" the tolerance into + AND -? Commented May 18, 2015 at 15:02
  • Up to a certain degree, the answer is correct. For the sake of simplicity, in the end I am using the average of the tolerances (distances of the 6.7th/93.3rd percentiles from the average value calculated on all the samples). I am aware, though, that this is not the best option. Commented Nov 4, 2016 at 7:16
