$\begingroup$

I have the following experimental setup: Protein A is capable of cutting Protein B into small fragments. The fragments are identified, and the identity of the last amino acid in each fragment is recorded and counted. Thus, in one experiment it is possible to detect all 20 amino acids, each with a different count. The total count depends on the nature of Protein A and the conditions of the experiment. At the end, for the two conditions tested, I end up with a table like this:

Amino-acid  Exp1   Exp2
A             0      3
R            20     12
G            10     15
H            14     22
E             5      0

with entries for all 20 amino acids and I also know the total number of fragments from Protein B that were identified in each condition.

The question I need to answer is: Are the amino acid frequencies significantly different under the two experimental conditions?

At first I thought of using a chi-square test, since it lets me take into account the different number of fragments identified in the two conditions. But I will inevitably end up with some expected values equal to 0, and thus I cannot use the chi-square test.

Could you please point me in the direction of the test that can be used in this case?

Thanks a lot in advance.

$\endgroup$
  • $\begingroup$ Even if an experiment didn't include any zeros, neither experiment should be labeled as the "expected value". What you do is calculate the expected value from pooling the information from the two experiments as the null hypothesis is that the two distributions are identical. $\endgroup$
    – JimB
    Commented Jul 2, 2018 at 15:57
  • $\begingroup$ Dear JimB, what do you mean by pooling the information from the two experiments? The experiments are not replicates since the conditions used are different. $\endgroup$
    – kbr85
    Commented Jul 2, 2018 at 16:11
  • $\begingroup$ This gives the structure of the test: en.wikipedia.org/wiki/Chi-squared_test. However, I'm not convinced that your use of the phrase "are not replicates since the conditions used are different" explains anything. I think you'd be better off asking this question at stats.stackexchange.com and adding into your question exactly how you obtained the data. (It's not just about the frequency counts: how you obtained the data is important, too.) $\endgroup$
    – JimB
    Commented Jul 2, 2018 at 16:38
  • $\begingroup$ As an example of one of the details needed: It appears that for a single run of an experiment more than one amino acid can be detected such that the response for the 5 amino acids for a single experiment might be a vector of presence or absence indicators: (1,1,0,1,0,0). Or does an experiment result in the detection of just a single amino acid? $\endgroup$
    – JimB
    Commented Jul 2, 2018 at 16:49
  • $\begingroup$ Dear JimB, I have added more details to the question. It is not possible to use a vector of presence or absence indicators, since this would imply that the protein is equally selective for all amino acids marked with a 1. As now mentioned in the question, all 20 amino acids can be detected in one experiment. $\endgroup$
    – kbr85
    Commented Jul 2, 2018 at 20:31

1 Answer

$\begingroup$

You seem to count occurrences of five amino acids under two sets of conditions. To do a chi-squared test of homogeneity (each amino acid equally likely to occur under the two conditions), you can find the chi-squared statistic

$$Q = \sum_{i=1}^2 \sum_{j=1}^5 \frac{(X_{ij} - E_{ij})^2}{E_{ij}},$$ where $i$ designates experiment and $j$ amino acid, and each $E_{ij}$ is the total for experiment $i$ times the total for amino acid $j$ divided by the grand total of all ten counts. For example, $E_{11} = 49(3)/101 = 1.455.$

Here is the data matrix with each experiment in a row.

MAT = matrix(c( 0, 20, 10, 14, 5,
                3, 12, 15, 22, 0), nrow=2, byrow=TRUE)

MAT
     [,1] [,2] [,3] [,4] [,5]
[1,]    0   20   10   14    5
[2,]    3   12   15   22    0

Here are the $E_{ij}:$

ChisqOut = chisq.test(MAT);  ChisqOut$expected
Warning message:
In chisq.test(MAT) : Chi-squared approximation may be incorrect
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 1.455446 15.52475 12.12871 17.46535 2.425743
[2,] 1.544554 16.47525 12.87129 18.53465 2.574257
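
The expected counts and the statistic $Q$ can also be reproduced by hand from the row and column totals, which may help to see what `chisq.test` is doing. This is only a minimal sketch; the object names `counts`, `expected` and `Q` are chosen here for illustration and are not part of the original answer.

```r
# Counts from the question: rows are experiments, columns are amino acids
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"), c("A", "R", "G", "H", "E")))

# E_ij = (total for experiment i) * (total for amino acid j) / (grand total)
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Q = sum over all cells of (X_ij - E_ij)^2 / E_ij
Q <- sum((counts - expected)^2 / expected)

# Number of cells whose expected count is below 5 (these trigger the warning)
n_small <- sum(expected < 5)

print(round(expected, 3))
print(round(Q, 2))
print(n_small)
```

The cells with small expected counts are exactly the ones for amino acids A and E.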

If all of the $E_{ij}$ exceeded $5,$ then under the null hypothesis that the two experiments produce the same distribution of amino acid counts, the chi-squared statistic $Q$ would have approximately a chi-squared distribution with $(r-1)(c-1) = (2-1)(5-1) = 4$ degrees of freedom. The warning message is triggered because expected counts for amino acids A and E are too small. However, R statistical software can do a simulation to approximate the actual distribution of $Q.$ This makes it possible to do a test anyhow, even though the 'chi-squared statistic' is not exactly 'chi-squared distributed':

chisq.test(MAT, simulate.p.value=TRUE)

        Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  MAT
X-squared = 12.7, df = NA, p-value = 0.01284

(A couple more tries yielded similar simulated P-values.) Thus it seems that we can reject the null hypothesis of homogeneity at the 2% level of significance, though not quite at the 1% level.
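
Because the simulated P-value changes slightly from run to run, it may be convenient to fix the random seed and use more replicates than the default 2000. The sketch below assumes an arbitrary seed and `B = 10000`; neither value comes from the original answer.

```r
# Same data matrix as above, one experiment per row
MAT <- matrix(c(0, 20, 10, 14, 5,
                3, 12, 15, 22, 0), nrow = 2, byrow = TRUE)

set.seed(2018)  # arbitrary seed, only to make the run repeatable

# Pearson chi-squared test with a Monte Carlo P-value from B simulated tables
sim <- chisq.test(MAT, simulate.p.value = TRUE, B = 10000)

print(sim$statistic)  # the observed chi-squared statistic Q
print(sim$p.value)    # simulated P-value
```

With more replicates the Monte Carlo error of the P-value shrinks, so repeated runs with different seeds should agree more closely.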

Ordinarily, when the null hypothesis is rejected one looks at the 'Pearson residuals' in each of the $rc = 10$ cells seeking residuals greater than about 2 in absolute value, thus pointing to particular data cells of interest as contributing markedly to the significant result. But there are no such residuals here:

ChisqOut$residuals
          [,1]      [,2]       [,3]       [,4]      [,5]
[1,] -1.206418  1.135807 -0.6112371 -0.8291977  1.652835
[2,]  1.171101 -1.102557  0.5933434  0.8049232 -1.604449

As one might suspect from the positions of the 0's for amino acids A and E, the largest components in the sum $Q$ come from those amino acids. Because you have so little data on these two amino acids, I am reluctant to encourage you to speculate on whether they really do behave differently under your two experimental conditions.
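
To see how much each amino acid contributes to $Q$, one can square the Pearson residuals and add them up by column. This is a small sketch with illustrative names (`counts`, `res`, `contrib`); the warning about small expected counts is suppressed only to keep the output short.

```r
# Counts from the question, with labels for experiments and amino acids
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"), c("A", "R", "G", "H", "E")))

# Pearson residuals: (observed - expected) / sqrt(expected)
res <- suppressWarnings(chisq.test(counts))$residuals

# Each squared residual is one term of Q; sum the two terms for each amino acid
contrib <- colSums(res^2)

print(round(contrib, 2))
print(round(sum(contrib), 2))  # adds up to Q
```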

One common 'cure' for too-small values of $E_{ij}$ is to combine categories. Perhaps combine amino acids A & R and H & E, but I don't know enough about your experiment to contemplate whether this makes any sense. (Maybe there are amino acids that are in some way 'similar' so that combining small-count ones with larger-count ones would make sense.)
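
Purely to illustrate the mechanics of combining categories, here is a sketch that merges A with R and H with E and then reruns the test. The grouping is the arbitrary one mentioned above, not a biological recommendation, and the names `combined` and `out` are hypothetical.

```r
# Counts from the question, with labels for experiments and amino acids
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"), c("A", "R", "G", "H", "E")))

# Merge columns: A with R, and H with E (illustrative grouping only)
combined <- cbind(A_R = counts[, "A"] + counts[, "R"],
                  G   = counts[, "G"],
                  H_E = counts[, "H"] + counts[, "E"])

out <- chisq.test(combined)

print(out$expected)   # expected counts after merging
print(out$parameter)  # degrees of freedom: (2 - 1) * (3 - 1)
print(out$p.value)
```

After merging, every expected count is above 5, so the ordinary chi-squared approximation applies. The price is that the test can no longer distinguish A from R or H from E.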

As is often the case, it would be helpful if you had more data: more 'fragments' in your experiments, and thus larger expected counts and greater assurance in drawing particular conclusions of interest.

$\endgroup$
  • $\begingroup$ Dear BruceET, thanks for your detailed answer. I have just one more question. What if an amino acid is not found in either of the experiments? Can I just ignore it? I ask because this will change the degrees of freedom. $\endgroup$
    – kbr85
    Commented Jul 3, 2018 at 7:05
  • $\begingroup$ I'm no biologist, but I know there are > 5 amino acids, so why do you have only 5 here? If this study were repeated, would there sometimes be more than five in your data? In a nonstatistical sense, what would the absence of 1 of the 5 mean? // An analysis with only 4 amino acids is certainly possible, and you're right that the df would change. // Is it possible to have larger total counts in subsequent datasets so that the $E_{ij}$'s are mainly above 5 and never below 3? Then no need to simulate P-val & easier interp of resids. // Are current data the result on one run of each of 2 expts? $\endgroup$
    – BruceET
    Commented Jul 3, 2018 at 8:45
  • $\begingroup$ The 20 amino acids will be in the data, but sometimes one amino acid will not be found in either experiment. The data come from repeating each experiment a minimum of three times. If one or several amino acids were not found in either experiment, can I just skip these amino acids and adjust the df accordingly? $\endgroup$
    – kbr85
    Commented Jul 3, 2018 at 9:15
  • $\begingroup$ It means that they were not detected in the experimental conditions tested and the count is zero in both. Thus I would have a line like W 0 0 in the table. The difference between the experiments, in this case, is the presence of a compound that makes Protein A more active. $\endgroup$
    – kbr85
    Commented Jul 3, 2018 at 9:30
  • $\begingroup$ [Comment out of order because of edit:] I guess, but what does it mean when some are missing? What is the difference between the two experiments? Ionic strength or something? I share some of @JimB's misgivings about trying to talk about an experimental design when I know so few details of procedure and objectives. // When I do statistical consulting I want to know the point and watch what happens in the lab. Wonder if the 3 runs ought to be looked at separately somehow. Or made as a 3rd dim in table. Probably OK as is. (This is like a dermatologist trying to diagnose a skin rash over the phone.) $\endgroup$
    – BruceET
    Commented Jul 3, 2018 at 9:37
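
For the case raised in the comments, where an amino acid has a count of zero in both experiments (a row like `W 0 0`), the expected counts for that amino acid are also zero, so its terms in $Q$ would be 0/0. A minimal sketch of dropping such columns before testing follows; the `W` column is a hypothetical addition to the example data, and the names `counts`, `kept` and `df` are illustrative.

```r
# Example counts plus a hypothetical amino acid W that was never detected
counts <- matrix(c(0, 20, 10, 14, 5,
                   3, 12, 15, 22, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Exp1", "Exp2"), c("A", "R", "G", "H", "E")))
counts <- cbind(counts, W = c(0, 0))

# Keep only amino acids observed at least once in either experiment
kept <- counts[, colSums(counts) > 0, drop = FALSE]

# Degrees of freedom are based on the columns that remain
df <- (nrow(kept) - 1) * (ncol(kept) - 1)

print(colnames(kept))
print(df)
```

The reduced table `kept` can then be passed to `chisq.test` in the same way as before.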
