I think you can simplify your specific problem for the asymptotic case. The normal distribution is summarised by two sufficient statistics, so the data can be reduced to six numbers: the two sample sizes $n_y,n_x$ together with the sample means and variances given by
$$\overline{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}y_i$$
$$s^2_y=\frac{1}{n_y}\sum_{i=1}^{n_y}(y_i-\overline{y})^2$$
$$\overline{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}x_i$$
$$s^2_x=\frac{1}{n_x}\sum_{i=1}^{n_x}(x_i-\overline{x})^2$$
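As a minimal sketch in Python (with hypothetical arrays `y` and `x` standing in for your simulated data), the reduction to these six numbers is just:

```python
import numpy as np

# Hypothetical raw samples; any two arrays would do.
y = np.random.normal(loc=2, scale=1, size=100)
x = np.random.normal(loc=10, scale=1, size=100)

# The six sufficient statistics: sample sizes, means, and
# (divide-by-n) variances matching the formulas above.
n_y, n_x = len(y), len(x)
ybar, xbar = y.mean(), x.mean()
s2_y = ((y - ybar) ** 2).mean()   # (1/n_y) * sum (y_i - ybar)^2
s2_x = ((x - xbar) ** 2).mean()   # (1/n_x) * sum (x_i - xbar)^2
```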
With these you can write the posterior as
$$p(\mu_1,\mu_2,\sigma_1,\sigma_2,a|DI)\propto p(\mu_1,\mu_2,\sigma_1,\sigma_2,a|I)\sigma_1^{-n_y}\sigma_2^{-n_x}\exp\left(-\frac{n_y[s_y^2+(\mu_1-\overline{y})^2]}{2\sigma_1^{2}}-\frac{n_x[s_x^2+(\mu_2-\overline{x})^2]}{2\sigma_2^{2}}\right)$$
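In code, the log of this kernel is just a function of those six statistics. A minimal sketch (leaving out the prior factor, which the next paragraph argues is asymptotically irrelevant apart from the range restriction):

```python
import numpy as np

def log_post_kernel(mu1, mu2, sig1, sig2,
                    n_y, ybar, s2_y, n_x, xbar, s2_x):
    """Log of the likelihood factor of the kernel above; the prior
    p(mu1, mu2, sig1, sig2, a | I) would be added on top of this."""
    if sig1 <= 0 or sig2 <= 0:
        return -np.inf
    return (-n_y * np.log(sig1) - n_x * np.log(sig2)
            - n_y * (s2_y + (mu1 - ybar) ** 2) / (2 * sig1 ** 2)
            - n_x * (s2_x + (mu2 - xbar) ** 2) / (2 * sig2 ** 2))
```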
Now asymptotically, the only part of the prior that "survives" the large sample size is the range restriction $\mu_1<\mu_2<\mu_1+2$. This means we can analytically integrate out the variance parameters, and $a$ is redundant, since we can write $(\mu_2|\mu_1)\sim U(\mu_1,\mu_1+2)$ (by properties of the uniform distribution). The joint marginal distribution will be a truncated t distribution, which is asymptotically truncated normal.
$$p(\mu_1,\mu_2|DI)\propto I_{\mu_1<\mu_2<\mu_1+2}\exp\left(-\frac{n_y(\mu_1-\overline{y})^2}{2s_y^{2}}-\frac{n_x(\mu_2-\overline{x})^2}{2s_x^{2}}\right)$$
The maximum can be found via constrained least squares. The unconstrained maximum is $(\hat{\mu}_1,\hat{\mu}_2)=(\overline{y},\overline{x})$. If this violates the constraint, then we set it to the nearest boundary. So if the data satisfy $\overline{x}>\overline{y}+2$, then we would set $\hat{\mu}_2=\hat{\mu}_1+2$ and maximise with respect to $\mu_1$, giving the maximum $\hat{\mu}_1=w\overline{y}+(1-w)(\overline{x}-2)$ where $w=\frac{n_ys_y^{-2}}{n_xs_x^{-2}+n_ys_y^{-2}}$.
For your specific case we would have $w=\frac{1}{2}$ (because the sample sizes and standard deviations are equal). We also have
$$\hat{\mu}_1=w\overline{y}+(1-w)(\overline{x}-2)=\tfrac{1}{2}\cdot 2+\left(1-\tfrac{1}{2}\right)(10-2)=5$$
and hence $\hat{\mu}_2=\hat{\mu}_1+2=7$.
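As a quick numerical check of this constrained update (using $\overline{y}=2$, $\overline{x}=10$ from the example, and hypothetical but equal values $n_y=n_x=100$, $s_y^2=s_x^2=0.1$ so that $w=\frac{1}{2}$):

```python
ybar, xbar = 2.0, 10.0
n_y = n_x = 100          # assumed equal sample sizes
s2_y = s2_x = 0.1        # assumed equal variances

# Precision weight; equal n and s^2 give w = 1/2.
w = (n_y / s2_y) / (n_x / s2_x + n_y / s2_y)

if xbar > ybar + 2:                      # unconstrained max violates mu1 < mu2 < mu1 + 2
    mu1_hat = w * ybar + (1 - w) * (xbar - 2)
    mu2_hat = mu1_hat + 2
else:
    mu1_hat, mu2_hat = ybar, xbar        # unconstrained max is feasible

print(mu1_hat, mu2_hat)                  # -> 5.0 7.0
```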
Your posterior should concentrate around this point. To see this, simply evaluate the likelihood function. The only terms that differ are $(\mu_1-\overline{y})^2+(\mu_2-\overline{x})^2$. This evaluates to $36$ for either $(\hat{\mu}_1,\hat{\mu}_2)=(2,4)$ or $(\hat{\mu}_1,\hat{\mu}_2)=(8,10)$, but to $18$ for $(\hat{\mu}_1,\hat{\mu}_2)=(5,7)$ - much smaller!
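A one-liner check of those three values:

```python
ybar, xbar = 2.0, 10.0
for mu1, mu2 in [(2, 4), (8, 10), (5, 7)]:
    print(mu1, mu2, (mu1 - ybar) ** 2 + (mu2 - xbar) ** 2)
# (2, 4) -> 36.0, (8, 10) -> 36.0, (5, 7) -> 18.0
```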
You could also see this geometrically, since the precisions are equal. On a simple x-y graph, draw the line with the equation $y=x-2$ and mark the point $(10,2)$. The shortest path from this point to the line ends at the point $(7,5)$. The likelihood "wants" to concentrate the posterior around $(10,2)$, and $(7,5)$ is the closest feasible point to it.
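The projection is easy to verify numerically:

```python
import numpy as np

# Project the point (xbar, ybar) = (10, 2) onto the boundary line y = x - 2.
p = np.array([10.0, 2.0])
a = np.array([0.0, -2.0])                 # a point on the line
d = np.array([1.0, 1.0]) / np.sqrt(2)     # unit direction of the line
proj = a + np.dot(p - a, d) * d
print(proj)                               # -> [7. 5.]
```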
Not quite sure why your chain isn't converging to this point... the posterior still only has one mode... maybe bad starting points?
Also your code does not quite match your equations - your equation has $y$ with the lower mean but your simulation has $x$ with the lower mean.
**Update**
In light of the answer by @Sextus Empiricus, I looked into my answer again. If I take the marginal without making the normal approximation, we have
$$p(\mu_1,\mu_2|DI)\propto I_{\mu_1<\mu_2<\mu_1+2}\left(1+t_y^2\right)^{-\frac{n_y-1}{2}}\left(1+t_x^2\right)^{-\frac{n_x-1}{2}}$$
where $t_y=\frac{\mu_1-\overline{y}}{s_y}$ and $t_x=\frac{\mu_2-\overline{x}}{s_x}$. This is the product of two independent t distributions. If we take the negative log of this posterior, we get the function
$$-\log\left[p(\mu_1,\mu_2|DI)\right]=-\log\left[I_{\mu_1<\mu_2<\mu_1+2}\right]+\frac{n_y-1}{2}\log\left(1+t_y^2\right)+\frac{n_x-1}{2}\log\left(1+t_x^2\right)$$
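As a sketch, this negative log posterior is straightforward to code up (the indicator becomes an infinite penalty outside the feasible region):

```python
import numpy as np

def neg_log_post(mu1, mu2, ybar, s_y, n_y, xbar, s_x, n_x):
    """Negative log of the truncated product-of-t posterior above."""
    if not (mu1 < mu2 < mu1 + 2):
        return np.inf                      # -log of the indicator term
    t_y = (mu1 - ybar) / s_y
    t_x = (mu2 - xbar) / s_x
    return (0.5 * (n_y - 1) * np.log1p(t_y ** 2)
            + 0.5 * (n_x - 1) * np.log1p(t_x ** 2))
```

Minimising this over a grid on the feasible region reproduces the two modes discussed below.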
Interestingly, the function $\log\left(1+t_x^2\right)$ behaves like $t_x^2$ when $t_x$ is small (i.e., least squares, the normal distribution in my earlier response) but like $2\log\left(|t_x|\right)$ when it is large. This is what is driving the bimodal behaviour - an extreme deviation is not penalised much more severely than a large deviation. This makes it better to "dismiss as noise" one of the data points and fit the other one exactly.
Plugging in some numbers from the example shows this: we have $\log\left(1+t_x^2\right)=5.9$ when $\mu_2=4$, and it equals $4.5$ when $\mu_2=7$. Compare this to least squares, where $t_x^2=360$ when $\mu_2=4$ and $90$ when $\mu_2=7$.
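A quick check of those four numbers, taking $s_x^2=0.1$ as implied by the figures quoted (so that $t_x^2=36/0.1=360$ at $\mu_2=4$):

```python
import numpy as np

xbar, s2_x = 10.0, 0.1      # s_x^2 = 0.1 is implied by t_x^2 = 360 at mu2 = 4
for mu2 in (4.0, 7.0):
    t2 = (mu2 - xbar) ** 2 / s2_x
    print(mu2, round(np.log1p(t2), 1), t2)
# mu2 = 4: log(1 + t^2) = 5.9, t^2 = 360
# mu2 = 7: log(1 + t^2) = 4.5, t^2 = 90
```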
Further, asymptotically this does not converge to the truncated normal I outlined above. The large-$n$ approximation $(1+t_y^2)^{-\frac{n_y-1}{2}}\approx\exp\left(-\frac{(n_y-1) t_y^2}{2}\right)$ does not work here, because there is another term that cannot be ignored. If we set $n_x=n_y=n$ then we can write the posterior as
$$p(\mu_1,\mu_2|DI)\propto I_{\mu_1<\mu_2<\mu_1+2}\left(1+t_y^2+t_x^2+t_y^2t_x^2\right)^{-\frac{n-1}{2}}$$
$$\approx I_{\mu_1<\mu_2<\mu_1+2}\exp\left(-\frac{(n-1)(t_y^2+t_x^2+t_y^2t_x^2)}{2}\right)$$
This is not a normal distribution, because we have the term $t_y^2t_x^2$ in the exponent (a bivariate normal would have $t_yt_x$). If we don't impose the range restriction, this term becomes negligible, because it is then possible to set $t_y=0$ and $t_x=0$ simultaneously. When the range restriction applies, we can no longer assume $t_y^2t_x^2\approx 0$. This also clearly shows the bimodal nature of the posterior, because we can make the term $t_y^2t_x^2=0$ by setting either $t_x=0,t_y\neq 0$ or $t_x\neq 0,t_y=0$. Using this additional term, $t_y^2+t_x^2+t_y^2t_x^2$ evaluates to $360$ for either case of $\mu_1=2,\mu_2=4$ or $\mu_1=8,\mu_2=10$, compared to $8280$ when $\mu_1=5,\mu_2=7$.
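The same check for these three values (again with the implied $s_y^2=s_x^2=0.1$):

```python
ybar, xbar, s2 = 2.0, 10.0, 0.1
for mu1, mu2 in [(2, 4), (8, 10), (5, 7)]:
    ty2 = (mu1 - ybar) ** 2 / s2
    tx2 = (mu2 - xbar) ** 2 / s2
    print(mu1, mu2, ty2 + tx2 + ty2 * tx2)
# (2, 4) -> 360.0, (8, 10) -> 360.0, (5, 7) -> 8280.0
```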
I personally found this very interesting, and thanks to @Sextus Empiricus for his answer!