
Consider this simple example where I have 2D data that looks like this: [figure: exemplary data, a scatter of black points and red points]

I'm trying to find the parameters of a Normal distribution that maximizes the difference between two likelihood functions $$L_1-L_2=\prod^{n_i}_{i=1}f(x_i|\mu,\sigma^2)-\prod^{n_k}_{k=1}f(x_k|\mu,\sigma^2).$$ Here, the $X_i$ may be the black points and the $X_k$ may be the red points. I'm looking to find $\theta=\{\mu, \sigma^2\}$ that makes the joint density of the black points as large as possible while keeping the joint density of the red points as small as possible.

I already tried another approach, maximizing the ratio $$\frac{L_1}{L_2}=\frac{\prod^{n_i}_{i=1}f(x_i|\mu,\sigma^2)}{\prod^{n_k}_{k=1}f(x_k|\mu,\sigma^2)},$$ which led me to $$\hat\mu=\frac{\sum^{n_i}_{i=1}x_i-\sum^{n_k}_{k=1}x_k}{n_i-n_k}$$ and $$\hat\sigma^2=\frac{\sum^{n_i}_{i=1}(x_i-\hat\mu)^2-\sum^{n_k}_{k=1}(x_k-\hat\mu)^2}{n_i-n_k}.$$ However, $\hat\sigma^2$ can quickly become negative, which confused me. (I worked with the univariate case so as not to complicate things further.)
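For example, with made-up data like the following (black points tightly clustered around zero, a few red points far away), the formulas above produce a negative $\hat\sigma^2$:

    set.seed(42)
    xi <- rnorm(10, 0, 0.5)   # "black" points, tightly clustered around 0
    xk <- c(10, 11, 12)       # a few "red" points, far from the black cluster
    ni <- length(xi); nk <- length(xk)

    mu_hat     <- (sum(xi) - sum(xk)) / (ni - nk)
    sigma2_hat <- (sum((xi - mu_hat)^2) - sum((xk - mu_hat)^2)) / (ni - nk)
    c(mu_hat = mu_hat, sigma2_hat = sigma2_hat)   # sigma2_hat comes out negative here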

So now I'm trying to maximize the difference between the two likelihood functions stated at the beginning. My question is: is there any mathematical "trick" that simplifies this problem? For the standard ML estimators, taking logs simplifies the derivation a lot, but here it does not help much because I no longer have a pure product.
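To make the issue explicit: taking logs turns the ratio into a difference of sums of log-densities, but there is no comparable simplification for the difference of the two products: $$\log\frac{L_1}{L_2}=\sum^{n_i}_{i=1}\log f(x_i|\mu,\sigma^2)-\sum^{n_k}_{k=1}\log f(x_k|\mu,\sigma^2),\qquad\text{whereas}\qquad\log(L_1-L_2)\neq\log L_1-\log L_2.$$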

Any hint towards related questions/works/papers is also appreciated!

  • If you want to make the joint density of the red points as small as possible, take the limit as $\mu \to \infty$ and $\sigma^2 \to 0$; the limiting density equals zero, and you can't get smaller than that!
    – jbowman
    Commented Jul 10, 2022 at 22:43
  • @jbowman I'm aware of that. But these values will also make the likelihood of the black points tend to zero, which doesn't maximize $L_1-L_2$ (at least I believe that).
    – Econ.stats
    Commented Jul 10, 2022 at 22:53
  • Oh I see, the same mean and standard deviation for both likelihoods. My mistake. Why not just use a numerical optimization routine? If you work with $\log \sigma$, it shouldn't be a problem for such a routine to find the solution. Do you really need a mathematical solution?
    – jbowman
    Commented Jul 11, 2022 at 2:21
  • Hi: this could be totally wrong but, off the top of my head, given that you are restricted to the same mean and sigma, can't you just maximize it as one likelihood but first multiply the observations of the density being subtracted (the red observations) by -1.0? I imagine that's too simple a solution, but what's wrong with doing that? You'd then be taking care of the negative sign and you can view it as one density.
    – mlofton
    Commented Jul 11, 2022 at 4:08
  • Why would you want to maximise the difference? Statistically, the ratio is more appropriate (i.e., the difference of the log-likelihoods), and it needs to be maximised under the constraint $\sigma^2>0$.
    – Xi'an
    Commented Jul 11, 2022 at 10:47

1 Answer


Here is a simplified example (one-dimensional) of how to do this numerically.

    # Create fake data (no seed is set, so exact numbers will vary between runs)
    x1 <- rnorm(10, 2, 2)
    x2 <- c(0, 1, 1.5)

    # Returns the difference in the log-likelihood functions;
    # equivalent to the ratio of the likelihoods for min/max purposes
    foo <- function(theta) {
      mu    <- theta[1]
      sigma <- exp(theta[2])   # log-sigma parameterization keeps sigma > 0

      rll <- sum(dnorm(x1, mu, sigma, log=TRUE)) -
               sum(dnorm(x2, mu, sigma, log=TRUE))
      rll
    }

    # Maximize the ratio; the starting values are the mean and log(std. dev.) of x1
    optim(c(mean(x1), log(sd(x1))), foo, control=list(fnscale=-1))

When we run it, we get:

    > optim(c(mean(x1), log(sd(x1))), foo, control=list(fnscale=-1))
    $par
    [1] 2.1880969 0.6802588

which converts to $\hat{\mu} = 2.188$, $\hat{\sigma} = 1.974$.
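
The second parameter is optimized on the log scale, so the conversion for $\hat{\sigma}$ is just the exponential:

    exp(0.6802588)   # approximately 1.974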

Comparing to the mean and std. deviation of x1 shows that the parameter estimates are shifted away from where x2 is located:

    > mean(x1)
    [1] 1.781837
    > sd(x1)
    [1] 1.8948

In a way, for this data, this isn't surprising: x2 is shifted somewhat from the location of x1, so maximizing the likelihood of x1 alone already leaves the x2 data with a relatively low likelihood. Also, since there's a lot more data in x1 than in x2, it's more important to keep the likelihood associated with each observation in x1 high than to reduce the likelihood associated with each observation in x2. Combining these two observations, we can conclude that, at least for this example, the solution won't shift too far away from the MLE for x1.
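
As a quick sanity check (a sketch reusing the x1, x2, and foo objects defined above), the objective at the returned solution should be at least as large as its value at the x1-only MLE used as the starting point:

    fit <- optim(c(mean(x1), log(sd(x1))), foo, control=list(fnscale=-1))
    foo(fit$par)                     # objective at the shifted optimum
    foo(c(mean(x1), log(sd(x1))))    # objective at the x1-only MLE; no larger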

