I have a large dataset for which I am using Bayesian statistics for parameter estimation and model selection (specifically, using MultiNest).

This involves setting a prior over which the nested sampling algorithm works to find the 'best' parameters (as well as compute the Bayesian evidence).

For each file in my dataset the parameter range, and thus the prior, is different: the range over which the data vary differs from file to file, so I need to specify a different prior for each file.

Doing this iteratively, it occurred to me that one option would be simply to set the prior to run between the minimum and maximum values of my data. However, that isn't applicable here, as some parameter constraints lie outside the range of the data (an extrapolation, so to speak).

So I believe that scaling the data is the best approach to this problem (though I am open to any advice). Specifically, I am looking at min-max scaling to scale my data between 0 and 1 (an arbitrary choice), so that I can set a fixed prior range for the whole dataset (I can reasonably assume that a prior of, say, 0–2 will encompass all 'best' parameters, given the specific nature of my data).
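
For concreteness, by min-max scaling I mean the usual transform applied to each file's data $x$,
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
so that every file lies in $[0, 1]$ and one fixed uniform prior (e.g. $\mathcal{U}(0, 2)$ on the scaled parameters) can be reused across the whole dataset.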

Scaling the data this way using sklearn's MinMaxScaler is no problem (and neither is reverting it back to its original form at the end of the analysis).
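
As a minimal sketch of what this step looks like (the array values are made up, and the reshape is there because sklearn expects 2-D input):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data for a single file; sklearn expects a 2-D array
y = np.array([3.2, 7.8, 5.1, 9.4, 4.0]).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0, 1))
y_scaled = scaler.fit_transform(y)               # data now lies in [0, 1]

# ... run the nested-sampling analysis on y_scaled ...

y_recovered = scaler.inverse_transform(y_scaled)
assert np.allclose(y_recovered, y)               # inverting back at the end is straightforward
```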

However, what I am unsure about is how to treat the uncertainties at the point where I scale my data.

To elaborate: in my analysis I use both the data and the uncertainties in a 'textbook' Bayesian analysis, where both enter the likelihood function in a chi-squared-like way. Therefore, when I scale my data, I need to make sure the relationship between the data and the uncertainties is preserved, so that the likelihood evaluation gives the same results for scaled and unscaled data (once the scaled data and uncertainties are inverted back at the end).
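
By 'textbook' I mean a Gaussian ('chi-squared like') log-likelihood of roughly the following form; the straight-line model here is only a hypothetical placeholder for my actual model:

```python
import numpy as np

def model(x, params):
    # Hypothetical placeholder model; the real one is specific to my data
    slope, intercept = params
    return slope * x + intercept

def log_likelihood(params, x, y, sigma):
    """Gaussian log-likelihood using both the data y and the per-point uncertainties sigma."""
    resid = (y - model(x, params)) / sigma
    return -0.5 * np.sum(resid**2 + np.log(2.0 * np.pi * sigma**2))
```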

However, I have been unable to find any information on how to go about this. MinMaxScaler just performs min-max feature scaling on the data, and as I understand it, it wouldn't be correct to min-max scale the uncertainties themselves: they would be treated as if they were unrelated to the data and scaled purely according to their own values, which would not preserve the original relationship between data and uncertainty.

Also, it doesn't seem correct to me (but correct me if I'm wrong) to simply scale the uncertainties by the relative scale factor returned by sklearn's MinMaxScaler, as this only captures the overall relative scaling, not how each individual point is transformed.
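
To be explicit about what I mean by that second option (whether it is statistically justified is exactly my question), here is the kind of thing I have in mind, again with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y = np.array([3.2, 7.8, 5.1, 9.4, 4.0]).reshape(-1, 1)      # same made-up data as above
sigma = np.array([0.3, 0.5, 0.2, 0.4, 0.3]).reshape(-1, 1)  # made-up uncertainties

scaler = MinMaxScaler(feature_range=(0, 1)).fit(y)

# MinMaxScaler stores one factor per feature:
#   scale_ = (range_max - range_min) / (data_max_ - data_min_)
sigma_scaled = sigma * scaler.scale_   # scale the uncertainties by that overall factor only
```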

If anyone has any insight into how to approach this task, into the 'correct' treatment of uncertainties when scaling data in general, or into a better approach to automating Bayesian parameter estimation on large datasets that I may have overlooked, I would be interested to learn more and would appreciate any guidance.

NB: I posted this question on Stack Exchange too, but I felt that a modified version here would be better for approaching the task from a statistical, rather than computational, point of view.

Thank you
