We should indeed worry about the distinction between the usual Bessel-corrected sample standard deviation, $s$, and the population standard deviation, $\sigma$, when we don't know $\sigma$ (in the case that we did know $\sigma$, we would typically want to use it, but nearly always we won't).
One standard way to construct a confidence interval (at least when it's possible to do so) is via a pivotal quantity (a.k.a. a pivot). See https://en.wikipedia.org/wiki/Pivotal_quantity
A pivotal quantity ($Q$, say) is a function of the data and the parameter of interest ($\mu$ in your case) whose distribution doesn't depend on unknown parameters -- so changing the value of $\mu$ wouldn't change the distribution of $Q$, and, crucially, neither would altering any other unknown parameter (such as $\sigma$).
Speaking loosely, if you know the distribution of some pivotal quantity, $Q$, you can then construct a probabilistic interval for $Q$ (it's a random variable), and then back out a confidence interval for the parameter ($\mu$ in this case).
When $X_i$, $i=1,2,...,n$ are independent and identically distributed $\operatorname{N}(\mu,\sigma^2)$, it's possible to show that $T=\frac{\bar{X}-\mu}{s/\sqrt{n}}$ has a $t$ distribution with $n-1$ degrees of freedom; that is, $T$ is a pivotal quantity. From an interval for $T$ (which is a function of $\mu$, as you can see explicitly in the formula above), we can then obtain a confidence interval for $\mu$ by manipulating the algebraic expression for the probabilistic interval for $T$.
This approach sidesteps any direct worry about the error in estimating $\sigma$: we use $s$ to standardize $\bar{X}-\mu$ (the numerator of $T$) and then work with the resulting distribution of that statistic.
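As a minimal sketch of the algebra just described, here's how the $t$ pivot turns into an interval in code. The sample values are purely illustrative, and I'm assuming `scipy` is available for the $t$ critical value:

```python
# Illustrative sketch: a 95% CI for mu built from the pivot
# T = (xbar - mu) / (s / sqrt(n)) ~ t with n-1 df.
import math
from scipy import stats

x = [4.2, 5.1, 3.8, 5.6, 4.9, 4.4, 5.0, 3.9]  # made-up sample
n = len(x)
xbar = sum(x) / n
# Bessel-corrected sample standard deviation s (divisor n-1)
s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # upper alpha/2 critical value

# Rearranging  P(-t_crit <= (xbar - mu)/(s/sqrt(n)) <= t_crit) = 1 - alpha
# for mu gives the interval  xbar -/+ t_crit * s / sqrt(n):
half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(ci)
```

Note that the only step specific to the normal model is knowing that the pivot has a $t_{n-1}$ distribution; the rest is rearranging the inequality inside the probability statement.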
A number of answers on site discuss pivotal quantities (I recommend trying a search, which should turn up some additional helpful discussion). There are some notes at [1], though numerous other sets of notes can be found. Many undergraduate statistics textbooks discuss this approach.
---
It is indeed the case that an interval based on $s$ will sometimes be narrower and sometimes wider than an interval based on the unknown $\sigma$. If we did know $\sigma$, we would indeed see that a higher proportion of the intervals based on $s$ miss $\mu$ when $s<\sigma$. The problem is that in practice we have no idea when this has happened -- we don't know $\sigma$, so we can't judge when $s$ was 'small' or 'large'; that conditional probability is not something we have access to.
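The conditional behaviour described above is easy to check by simulation, since the simulator (unlike the analyst) knows $\sigma$. This is an illustrative sketch assuming `numpy` and `scipy`; the parameter choices are arbitrary:

```python
# Illustrative simulation: t-intervals based on s miss mu more often in
# the replications where s happened to come out below sigma -- but the
# analyst, not knowing sigma, can't tell which replications those are.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 1.0, 5, 20000
t_crit = stats.t.ppf(0.975, df=n - 1)

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)          # Bessel-corrected s per replication
miss = np.abs(xbar - mu) > t_crit * s / np.sqrt(n)

small_s = s < sigma                      # only knowable because we set sigma
print(miss[small_s].mean(), miss[~small_s].mean())  # miss rate by s group
```

The first printed miss rate (for $s<\sigma$) comes out well above the second, yet the overall miss rate still sits near $\alpha = 0.05$, which is the next point.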
This is not a problem for our interval, however, because by working directly with the distribution of the pivotal quantity $T$, we can make sure that our interval has the desired long run coverage rate, $1-\alpha$. The fact that an interval based on $\sigma$ would be different is true but not relevant to us; the long-run property (that under repeated sampling, a long-run proportion $1-\alpha$ of the intervals we construct this way will overlap $\mu$) is maintained, by construction.
Intuitively speaking, how is it that overall the $1-\alpha$ rate is maintained? The distribution of the pivot $T$, compared to the pivot $Z$ based on a known $\sigma$, has a slightly larger variance and is heavier-tailed. The distribution of the sample standard deviation is right-skewed: $s$ is more often a little smaller than $\sigma$, and sometimes larger (occasionally considerably so, at least for small $n$). Consequently, if we had used $z$ tables to construct our interval based on $s$, as we might have done if we made the error of treating $s$ as if it were $\sigma$, the intervals would not quite attain the desired $1-\alpha$ coverage.
The actual intervals based on $s$ (via the pivot $T$) are therefore a little "wider" on average (if you look at a t-table, you'll see that the upper-tail critical value that leaves an area of $\alpha/2$ above it is larger than the corresponding value from a normal table). This extra bit of width from the $t$ on average exactly "adjusts" for the tendency for intervals based on $s$ to be too narrow if you had used $z$ tables instead.
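To make that concrete, here is an illustrative coverage simulation (again assuming `numpy` and `scipy`, with arbitrary parameter choices) comparing intervals built from $s$ with $z$ critical values against proper $t$ intervals:

```python
# Illustrative simulation: using z critical values with s undercovers,
# while the t interval attains (long-run) coverage close to 1 - alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, reps = 10.0, 2.0, 5, 20000
z_crit = stats.norm.ppf(0.975)           # ~1.96
t_crit = stats.t.ppf(0.975, df=n - 1)    # larger than z_crit

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)   # s / sqrt(n)

cover_z = (np.abs(xbar - mu) <= z_crit * se).mean()  # z table with s: too low
cover_t = (np.abs(xbar - mu) <= t_crit * se).mean()  # t interval: near 0.95
print(cover_z, cover_t)
```

With $n=5$ the $z$-with-$s$ coverage falls noticeably short of $0.95$, while the $t$ interval's coverage sits right around it; the gap shrinks as $n$ grows, since $t_{n-1}$ approaches the standard normal.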
For some more detail on the relation between $Z$, $T$, $s$ and $\sigma$, see https://stats.stackexchange.com/a/110365/805
[1]: C. J. Geyer, "Stat 5102 Notes: More on Confidence Intervals", Feb 24, 2003, https://www.stat.umn.edu/geyer/old03/5102/notes/ci.pdf (Internet Archive copy in case the original disappears: http://web.archive.org/web/20220221153635/https://www.stat.umn.edu/geyer/old03/5102/notes/ci.pdf)