Can I aggregate several continuous variables into percentages and then compare those percentages between groups?

Question

I have a dataset with the concentrations of several lipids. I'm interested in finding lipids that are altered between two conditions, but the lipids are not independent of each other and the differences in concentration are very small.

Lipids can be grouped into lipid classes, so I thought I could add up all the concentrations of lipids belonging to the same class and calculate the percentage of each class in relation to total lipid concentration. My idea was to then eliminate all lipids belonging to classes that weren't altered and use statistical tests to find which individual lipids were altered in the classes that remained.

Can I do this, or would I be misrepresenting the data in some way? Also, if I can do it, how should I compare the percentages? With a t-test or a Mann-Whitney U test?

I think this is more a question for hematologists (or whatever specialty) than for statisticians. You can do this. But whether you are misrepresenting the data is, I think, a substantive question. — Peter Flom, Commented Feb 28 at 17:18

EdM · Accepted Answer · 2024-03-01 10:12:02Z

You run a potentially big risk when you use the results of a data set to choose the specific variables that you will examine in more detail from the same data. The luck of the draw will tend to make some lipid classes more or less closely associated with the 2 conditions in your data set than they might be in general. If you only focus on the classes that happen to be closely associated with the 2 conditions in this data set, your results are likely to over-state the magnitude of their true association, while you ignore what might be classes (or individual lipids within classes) that are important.
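The over-statement described above can be seen in a small simulation. This is only a sketch under assumed settings (10 hypothetical lipid classes, 8 samples per condition, standard normal noise, no true difference anywhere); the function name and parameters are made up for illustration. It compares the observed difference of the class that "wins" the selection with the typical observed difference across all classes.

```python
import random
import statistics


def simulate_selected_effect(n_classes=10, n_per_group=8, n_sims=500, seed=0):
    """Simulate data with NO true difference between conditions, then
    'select' the class with the largest observed difference.

    Returns (mean |difference| of the selected class,
             mean |difference| across all classes).
    """
    rng = random.Random(seed)
    selected, typical = [], []
    for _ in range(n_sims):
        diffs = []
        for _ in range(n_classes):
            # Both conditions are drawn from the same distribution.
            a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            diffs.append(abs(statistics.fmean(a) - statistics.fmean(b)))
        selected.append(max(diffs))             # the class you would "keep"
        typical.append(statistics.fmean(diffs))  # what a random class looks like
    return statistics.fmean(selected), statistics.fmean(typical)


selected_mean, typical_mean = simulate_selected_effect()
print(f"mean |difference| of the selected class: {selected_mean:.2f}")
print(f"mean |difference| across all classes:    {typical_mean:.2f}")
```

Even though the true difference is zero for every class, the selected class looks noticeably more different than a typical one, purely because it was chosen for looking different.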

This type of problem is similar to what's been dealt with for a long time in gene-expression studies. In your case, it's best to model all the lipids together as a function of condition, use tools that can evaluate which are most closely associated with the difference in conditions (while taking into account error estimates and correcting for multiple comparisons), and then see which classes of lipids might be over-represented among the lipids that differ between conditions.

Although I haven't used it myself, there is a lipidr Bioconductor package that seems to do all of this for you, including a "Lipid Set Enrichment Analysis" similar to what's done for gene-set enrichment. It draws on many of the tools developed originally for gene-expression work, adapting them to the particular requirements of lipidomics.
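To make the last two steps of that workflow concrete, here is a minimal standard-library sketch. It is not how lipidr works internally, and it uses a simple over-representation test rather than the rank-based enrichment analysis mentioned above. It assumes you already have one p-value per lipid from some per-lipid test; the lipid names, p-values, and class sizes are hypothetical. It applies a Benjamini-Hochberg correction and then a hypergeometric test for whether a class is over-represented among the lipids called different.

```python
from math import comb


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def overrepresentation_p(n_total, n_in_class, n_hits, n_hits_in_class):
    """Hypergeometric tail probability P(X >= n_hits_in_class).

    n_total:          all lipids measured
    n_in_class:       lipids belonging to the class of interest
    n_hits:           lipids called different between conditions
    n_hits_in_class:  hits that belong to the class of interest
    """
    upper = min(n_in_class, n_hits)
    tail = sum(
        comb(n_in_class, k) * comb(n_total - n_in_class, n_hits - k)
        for k in range(n_hits_in_class, upper + 1)
    )
    return tail / comb(n_total, n_hits)


# Hypothetical per-lipid p-values (illustrative names and numbers only).
raw_p = {"PC 34:1": 0.005, "PC 36:2": 0.01, "TG 52:2": 0.03, "SM 34:1": 0.04}
adjusted = benjamini_hochberg(list(raw_p.values()))
for name, p_adj in zip(raw_p, adjusted):
    print(f"{name}: adjusted p = {p_adj:.3f}")

# Hypothetical counts: 170 lipids, 30 in one class, 12 hits overall, 6 in that class.
print(f"over-representation p = {overrepresentation_p(170, 30, 12, 6):.4f}")
```

The class membership used in the second step should come from biochemistry, not from the data, so that the enrichment test is not itself a product of selection.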

Restricting analysis to lipid classes

You could take advantage of pre-assigned lipid classes if you use some caution. If your lipid classes are based on their biochemical characteristics and not on what you find in your data, there's nothing to stop you from adding up the concentrations of all members of each class for analysis. You can then evaluate which lipid classes might differ between experimental conditions. The danger is then putting too much significance on the individual lipids within that class that this particular data set found to be different. That's where you can get into trouble: using the results from the data set to decide what statistical tests to do on the data.
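As a sketch of the class-level analysis, the code below sums concentrations within pre-assigned classes, converts them to percentages of the total per sample, and compares one class between conditions. All lipid names, class assignments, and concentrations are invented for illustration. A permutation test on the difference in means is used here only because it needs nothing beyond the standard library; it is a stand-in for the t-test or Mann-Whitney U test asked about in the question, not a recommendation over them.

```python
import random
import statistics
from collections import defaultdict


def class_percentages(sample, lipid_class):
    """Sum concentrations per class and express each as % of the sample total."""
    totals = defaultdict(float)
    for lipid, conc in sample.items():
        totals[lipid_class[lipid]] += conc
    grand_total = sum(totals.values())
    return {cls: 100.0 * t / grand_total for cls, t in totals.items()}


def permutation_p(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(x)]) - statistics.fmean(pooled[len(x):]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Classes assigned from biochemistry, before looking at the data (hypothetical).
lipid_class = {"PC 34:1": "PC", "PC 36:2": "PC", "TG 52:2": "TG", "SM 34:1": "SM"}

# Hypothetical concentrations, one dict per sample.
control = [
    {"PC 34:1": 2.0, "PC 36:2": 1.1, "TG 52:2": 1.0, "SM 34:1": 0.5},
    {"PC 34:1": 2.2, "PC 36:2": 1.0, "TG 52:2": 1.2, "SM 34:1": 0.6},
    {"PC 34:1": 1.9, "PC 36:2": 0.9, "TG 52:2": 1.1, "SM 34:1": 0.5},
    {"PC 34:1": 2.1, "PC 36:2": 1.2, "TG 52:2": 0.9, "SM 34:1": 0.4},
]
treated = [
    {"PC 34:1": 2.3, "PC 36:2": 1.2, "TG 52:2": 0.9, "SM 34:1": 0.5},
    {"PC 34:1": 2.4, "PC 36:2": 1.1, "TG 52:2": 1.0, "SM 34:1": 0.6},
    {"PC 34:1": 2.0, "PC 36:2": 1.3, "TG 52:2": 1.1, "SM 34:1": 0.4},
    {"PC 34:1": 2.2, "PC 36:2": 1.0, "TG 52:2": 0.8, "SM 34:1": 0.5},
]

pc_control = [class_percentages(s, lipid_class)["PC"] for s in control]
pc_treated = [class_percentages(s, lipid_class)["PC"] for s in treated]
print(f"PC share, control: {statistics.fmean(pc_control):.1f}%")
print(f"PC share, treated: {statistics.fmean(pc_treated):.1f}%")
print(f"permutation p-value: {permutation_p(pc_control, pc_treated):.3f}")
```

If several classes are tested this way, the resulting p-values still need a multiple-comparison correction across classes.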

You could even use the results of this study on individual lipids as preliminary data to identify particular lipids to evaluate in a subsequent study. The practical and statistical significance of those individual lipids would then be based on the subsequent studies, not this one. The risk is that you get false positives from this initial study and waste a lot of time, effort, and expense following them up.

Yeah, there was nothing significant. I got the same results when I tried statistical tests comparing all the lipids individually. Is there nothing that can be done? — maglorismyspiritanimal, Commented Feb 29 at 15:55
@maglorismyspiritanimal it seems like you have a clean, if disappointing, answer: your two conditions don't seem to differ substantially in terms of the concentrations of any of the 170 lipids. That might be worth reporting in a broader report comparing the two conditions. If, without looking at the results, you had a pre-specified hypothesis about some specific lipid or lipids or lipid class, you could have evaluated those in particular without the corrections needed to account for the total number of lipids. But if you didn't have such specific hypotheses, then what you did is correct. — EdM, Commented Feb 29 at 16:50
Oh well, it is what it is. I wanted to try the thing with the percentages because I saw a paper that did that, but I guess they did it because they couldn't find anything significant the right way either. Anyway, thank you for your answers! — maglorismyspiritanimal, Commented Mar 1 at 9:12
@maglorismyspiritanimal I've added a couple of paragraphs with some suggestions for primarily restricting your analysis to the lipid classes, if you identify them on the basis of their biochemical properties. Your study might still be useful as a guide to further detailed study of individual lipids, depending on the risks that you are willing to take. — EdM, Commented Mar 1 at 10:13
