Any kind of transformation of the data representation that "takes" information from the data should only be "fitted" on the training data. This is because:
- If you used all of the data, information would leak from the validation or test (also called holdout) data into your model. This is forbidden! As a result, your validation/test score estimates would be skewed, usually in an overly optimistic direction.
- The model is trained on one specific data representation, so it should only ever see data in that representation. In most cases the transformation should therefore be applied to new data exactly as it was in the training stage, using the parameters fitted there (an example of an exception: some kinds of online settings).
So in the usual cases of batch training with empirical risk minimization (ERM) evaluation, or stochastic optimization in deep learning, this kind of normalization should only be fitted in the training stage and then applied unchanged to the validation and test data.
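As a minimal sketch of this rule, assuming scikit-learn's `StandardScaler` and a synthetic dataset (the data and variable names are made up for illustration), the scaler's statistics come from the training split only and are then reused on the held-out split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
# Mean and standard deviation are estimated from the training split only.
X_train_scaled = scaler.fit_transform(X_train)
# The held-out split is transformed with the training statistics, never refitted.
X_test_scaled = scaler.transform(X_test)
```

Note that the scaled test data will generally not have exactly zero mean and unit variance; that is expected, because its own statistics were never used.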
This is also why, in most ML library designs, this transformation is grouped into a pipeline together with the model: the two can then be fitted together and deployed as one unit.
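A rough illustration of the pipeline idea, assuming scikit-learn; the synthetic data and the choice of `LogisticRegression` are arbitrary placeholders, not a recommendation. Inside cross-validation the whole pipeline is refitted on each training fold, so the scaler never sees the corresponding validation fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Scaler and model are bundled into one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Each fold fits the scaler and the model on that fold's training part only.
scores = cross_val_score(pipe, X, y, cv=5)

# For deployment, fit once and ship the single fitted object.
pipe.fit(X, y)
predictions = pipe.predict(X[:5])
```

The fitted `pipe` applies the stored scaling parameters automatically whenever `predict` is called, so there is no separate preprocessing step to keep in sync.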
Of course, this can break assumptions at runtime. Say you min-max-normalize an attribute: you would expect it to fall into $[0, 1]$ afterwards. If the maximum in the training data was $m$, then new data may very well contain a value $x > m$ for that same attribute, and applying the min-max normalization would give a transformed value $\tilde{x} > 1$. This does not work so well in some cases, so you would truncate the value by setting it to $1$. If you expect many outliers, you may want to take a look at, for example, RobustScaler in scikit-learn.
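The out-of-range effect and the truncation described above can be sketched as follows, assuming scikit-learn 0.24 or later, where `MinMaxScaler` accepts a `clip` parameter. The numbers are made up: the training minimum is 1 and the training maximum is $m = 5$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data: min = 1, max m = 5.
X_train = np.array([[1.0], [3.0], [5.0]])
# New data at runtime: one value above the training max, one below the training min.
X_new = np.array([[7.0], [0.0]])

plain = MinMaxScaler().fit(X_train)
clipped = MinMaxScaler(clip=True).fit(X_train)

# Without clipping: (7 - 1) / (5 - 1) = 1.5 and (0 - 1) / (5 - 1) = -0.25.
plain_out = plain.transform(X_new)
# With clipping, transformed values are truncated to the range [0, 1].
clipped_out = clipped.transform(X_new)
```

Clipping keeps the range assumption intact, but it also discards how far outside the training range a value was, so whether it is appropriate depends on the model and the data.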