Given a high-dimensional dataset $X$ with potentially redundant features, how can we efficiently aggregate and/or select features so that the target variable $Y$ is still predicted accurately while dimensionality is reduced? (I do not want methods like PCA or VAEs, which reduce the dimensions but offer little or no understanding of the latent dimensions; I want the reduced features to be more explainable.)
The hypothesis is that some of the features of $X$ might not be needed at all, while others might only be needed as aggregates. $X$ is a $d$-dimensional sequence, and the idea is to aggregate $X$ with aggregating functions such as mean, maximum and minimum, reducing its dimensionality to $m$, where $m \ll d$.
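As a minimal sketch of what such an aggregation could look like (the grouping of features and the choice of aggregator per group are hypothetical placeholders, not a fitted result):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))  # toy data: 100 samples, d = 12 features

# Hypothetical grouping of the 12 features into m = 3 groups,
# each reduced by a different aggregating function.
groups = {
    "mean_grp": ([0, 1, 2, 3], np.mean),
    "max_grp":  ([4, 5, 6, 7], np.max),
    "min_grp":  ([8, 9, 10, 11], np.min),
}

# Z has one interpretable column per group: m << d.
Z = np.column_stack([agg(X[:, idx], axis=1) for idx, agg in groups.values()])
print(Z.shape)  # (100, 3)
```

The open problem is of course *finding* the grouping and the aggregators from data rather than fixing them by hand, which is what the question asks about.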
I would like to know which methods exist in statistics that can do this.
One of the attempts I made was to train a transformer model on $X$, from which I obtained attention values $A$ (per sample). The attention values have the same dimensions as the input and can be interpreted as the contribution of each feature to the output (for each sample). I was wondering whether the attention values can also be used, in some sense, to aggregate the features of $X$ while still giving a good prediction of $Y$.
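To make the attention idea concrete, here is a rough sketch of two ways $A$ could be used downstream. The array `A` below is a random stand-in for attention values already extracted from a trained model, and `m = 3` is an arbitrary choice; this is only an illustration of the mechanics, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 100, 12, 3
X = rng.normal(size=(n, d))
A = rng.random(size=(n, d))  # stand-in for per-sample attention values

# Selection: rank features by mean attention across samples, keep top m.
importance = A.mean(axis=0)
top = np.argsort(importance)[::-1][:m]
X_selected = X[:, top]           # shape (n, m)

# Aggregation: attention-weighted pooling of all features per sample.
weights = A / A.sum(axis=1, keepdims=True)
x_agg = (weights * X).sum(axis=1)  # one attention-pooled scalar per sample
print(X_selected.shape, x_agg.shape)
```

The selection variant keeps the original, named features (and so stays explainable), while the pooled variant mixes all features per sample and is closer to soft aggregation; whether either preserves predictive accuracy for $Y$ would have to be checked empirically.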
So, this question is basically asking about existing methods that can help me achieve feature aggregation and/or selection for the input while preserving its predictive accuracy. And can deep learning help here?