I understand the heuristic definition: say you know a statistic, $T$, of some sample that you want to use to estimate the corresponding population parameter - but you don't know the data points of the sample themselves. 'We say $T$ is a sufficient statistic if the statistician who knows the value of $T$ can do just as good a job of estimating the unknown parameter $\theta$ as the statistician who knows the entire random sample' - that's the definition of a sufficient statistic I've read online, and understand.
But then comes the factorisation theorem, which I'm struggling with. A statistic $T$ is sufficient for a sample $\boldsymbol{X} = (X_1,X_2,\ldots,X_n)$ if the conditional distribution of $\boldsymbol{X}$ given $T(\boldsymbol{X}) = t$ does not depend on $\theta$. The theorem says this is equivalent to being able to factorise the joint pdf $f(\boldsymbol{X} \mid \theta)$ into two functions:
$$f(\boldsymbol{X} \mid\theta) = h(X_1,X_2,\ldots,X_n) \cdot g(T(X_1,X_2,\ldots,X_n),\theta).$$
If such a factorisation exists, $T$ is a sufficient statistic, because the conditional distribution of $\boldsymbol{X}$ given $T$ then does not depend on $\theta$. But here's my question - how can that conditional distribution not depend on $\theta$ when $\theta$ is still sitting in the factorised equation? In the examples I've seen, the final factorised pdfs still contain $\theta$, as well as the statistic $T$ as some function of $X_1,X_2,\ldots,X_n$ - so how can the conditional probability depend on $T$ alone?
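For concreteness, here is the kind of example I mean (the standard i.i.d. Bernoulli$(\theta)$ case, with $T(\boldsymbol{x}) = \sum_{i=1}^n x_i$):

$$f(\boldsymbol{x} \mid \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}(1-\theta)^{n - \sum_i x_i} = \underbrace{1}_{h(\boldsymbol{x})} \cdot \underbrace{\theta^{T(\boldsymbol{x})}(1-\theta)^{\,n - T(\boldsymbol{x})}}_{g(T(\boldsymbol{x}),\,\theta)}.$$

Here $h(\boldsymbol{x}) = 1$ and $g(T,\theta) = \theta^{T}(1-\theta)^{n-T}$ - and $\theta$ is plainly still in the formula.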
If $T$ is supposed to be all you need in order to pin down the conditional distribution, how can $\theta$ still appear as a variable in the factorised equation? I think I've gone wrong in some basic understanding of what's supposed to be going on here, so apologies if this is elementary.
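To try to pin down my confusion, I wrote a small brute-force check (`conditional_dist` is just a helper name I made up) for the Bernoulli case: it enumerates every sample of size $n$, conditions on $T(\boldsymbol{x}) = \sum_i x_i = t$, and compares the resulting conditional distributions for two different values of $\theta$:

```python
import itertools
from fractions import Fraction
from math import comb

def conditional_dist(theta, n, t):
    """P(X = x | T(X) = t) for an i.i.d. Bernoulli(theta) sample of size n,
    where T(x) = sum(x). Computed exactly by brute-force enumeration."""
    # Joint pmf of the full sample: theta^(#ones) * (1-theta)^(#zeros).
    joint = {x: theta**sum(x) * (1 - theta)**(n - sum(x))
             for x in itertools.product([0, 1], repeat=n)}
    # P(T = t): total mass of all samples with exactly t ones.
    total = sum(p for x, p in joint.items() if sum(x) == t)
    # Conditional pmf restricted to the event {T = t}.
    return {x: p / total for x, p in joint.items() if sum(x) == t}

n, t = 4, 2
d1 = conditional_dist(Fraction(3, 10), n, t)  # theta = 0.3
d2 = conditional_dist(Fraction(4, 5), n, t)   # theta = 0.8
assert d1 == d2  # identical for both thetas
assert all(p == Fraction(1, comb(n, t)) for p in d1.values())  # uniform over arrangements
```

The two conditional distributions come out identical (uniform over the $\binom{n}{t}$ arrangements), even though $\theta$ appears in the joint pdf - which only deepens my confusion about where the $\theta$-dependence "goes".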