$\begingroup$

I’m trying to figure out the most appropriate test to use for a small water quality dataset (n = 10 sampling visits at 6 river sites, upstream to downstream) with the following characteristics:

- not normally distributed (based on the Ryan-Joiner test)
- some variables have non-detects (censored data)
- some data groupings result in unequal sample sizes

We’re trying to look at differences within the same sampling year between:

- different sites (sites would be dependent groups, n = 10 in each group)
- weather event types (snowmelt n = 12, dry weather n = 18, wet weather n = 30)

Here are a couple of samples of how I’m grouping the data:

Event type-wise – unequal sample sizes
Chloride

| Site  | Dry  | Site  | Snowmelt | Site  | Wet  |
| ----- | ---- | ----- | -------- | ----- | ---- |
| SWM01 | 33.0 | SWM01 | 27.5     | SWM01 | 27.4 |
| SWM01 | 28.8 | SWM01 | 35.0     | SWM01 | 25.0 |
| SWM01 | 34.8 | SWM02 | 25.6     | SWM01 | 10.8 |
| SWM02 | 23.3 | SWM02 | 30.1     | SWM01 | 20.0 |
| SWM02 | 19.1 | SWM03 | 35.5     | SWM01 | 31.3 |
| SWM02 | 23.6 | SWM03 | 28.2     | SWM02 | 24.7 |
| SWM03 | 24.1 | SWM04 | 28.1     | SWM02 | 19.0 |
| SWM03 | 16.8 | SWM04 | 30.0     | SWM02 | 12.0 |
| SWM03 | 26.4 | SWM05 | 28.2     | SWM02 | 14.9 |
| SWM04 | 25.4 | SWM05 | 29.4     | SWM02 | 24.3 |
| SWM04 | 18.6 | SWM06 | 27.8     | SWM03 | 22.1 |
| SWM04 | 24.9 | SWM06 | 28.1     | SWM03 | 18.7 |
| SWM05 | 24.9 |       |          | SWM03 | 6.4  |
| SWM05 | 20.3 |       |          | SWM03 | 16.1 |
| SWM05 | 29.4 |       |          | SWM03 | 25.1 |
| SWM06 | 25.4 |       |          | SWM04 | 25.3 |
| SWM06 | 20.6 |       |          | SWM04 | 19.4 |
| SWM06 | 24.8 |       |          | SWM04 | 7.8  |
|       |      |       |          | SWM04 | 19.7 |
|       |      |       |          | SWM04 | 27.4 |
|       |      |       |          | SWM05 | 24.4 |
|       |      |       |          | SWM05 | 19.3 |
|       |      |       |          | SWM05 | 6.9  |
|       |      |       |          | SWM05 | 15.6 |
|       |      |       |          | SWM05 | 26.1 |
|       |      |       |          | SWM06 | 32.7 |
|       |      |       |          | SWM06 | 16.3 |
|       |      |       |          | SWM06 | 7.8  |
|       |      |       |          | SWM06 | 14.9 |
|       |      |       |          | SWM06 | 24.4 |

Site-wise – equal sample sizes
Chloride

| Event ID  | SWM01 | SWM02 | SWM03 | SWM04 | SWM05 | SWM06 |
| --------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Dry1      | 33.0  | 23.3  | 24.1  | 25.4  | 24.9  | 25.4  |
| Dry2      | 28.8  | 19.1  | 16.8  | 18.6  | 20.3  | 20.6  |
| Dry3      | 34.8  | 23.6  | 26.4  | 24.9  | 29.4  | 24.8  |
| Snowmelt1 | 27.5  | 25.6  | 35.5  | 28.1  | 28.2  | 27.8  |
| Snowmelt2 | 35.0  | 30.1  | 28.2  | 30.0  | 29.4  | 28.1  |
| Wet1      | 27.4  | 24.7  | 22.1  | 25.3  | 24.4  | 32.7  |
| Wet2      | 25.0  | 19.0  | 18.7  | 19.4  | 19.3  | 16.3  |
| Wet3      | 10.8  | 12.0  | 6.4   | 7.8   | 6.9   | 7.8   |
| Wet4      | 20.0  | 14.9  | 16.1  | 19.7  | 15.6  | 14.9  |
| Wet5      | 31.3  | 24.3  | 25.1  | 27.4  | 26.1  | 24.4  |

I’ve been looking through my stats notes and Google, and came across the Kruskal-Wallis test. It appears to be appropriate for non-parametric data and more than two groups, but it assumes that the groups are independent, which our “site groups” aren’t. I’m assuming the event type groups would be independent.

The presence of non-detects in our nitrate data throws another wrench into things.

$\endgroup$
  • $\begingroup$ Data are never non-parametric. Procedures can be. $\endgroup$
    – Nick Cox
    Commented Feb 22 at 23:48
  • $\begingroup$ The data have both temporal and spatial correlation. I think you need generalized least squares or linear mixed models, perhaps with bootstrap resampling to handle non-normal errors and the small sample size. $\endgroup$
    – DrJerryTAO
    Commented Feb 23 at 0:50
  • $\begingroup$ @DrJerryTAO In principle you're right about temporal and spatial correlation. In practice hydrologists would not measure at two sites and/or two times if they didn't imagine that values could be quite different. With datasets this small, testing for normality and the like is moderately absurd and I'd advise a strongly graphical analysis and use of scientific/engineering judgment, but wouldn't expect unanimity on that point. $\endgroup$
    – Nick Cox
    Commented Feb 23 at 11:37
  • $\begingroup$ Good points. That's why I thought bootstrap sampling for SEs and p-values might be better than trying to find normal errors. But now I doubt whether the bootstrap is useful in this case, because either the temporal or the spatial structure will be broken even if stratified bootstrap sampling is used. $\endgroup$
    – DrJerryTAO
    Commented Feb 23 at 23:52

1 Answer

$\begingroup$

With a censored response, survival analysis methods are useful and probably the most appropriate choice; non-detects are left-censored values. See https://stackoverflow.com/questions/41968606/left-censoring-for-survival-data-in-r for handling both left and right censoring in R.
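A minimal sketch of how non-detects could be coded as left-censored values with `Surv()` from the `survival` package. The nitrate values and the detection limit of 0.1 are hypothetical (the question does not show the nitrate data), and the two encodings are shown side by side only for illustration.

```r
library(survival)

# Hypothetical nitrate readings, NOT taken from the question.
# A non-detect is recorded at the assumed detection limit (0.1).
nitrate <- data.frame(
  value    = c(0.10, 0.35, 0.10, 0.52, 0.20),
  detected = c(FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Encoding 1: type = "left", where status 1 = observed, 0 = left-censored.
y_left <- Surv(nitrate$value, as.integer(nitrate$detected), type = "left")

# Encoding 2: type = "interval2", where an NA lower bound means
# "somewhere below the upper bound" and lower == upper means an exact value.
nitrate$lower <- ifelse(nitrate$detected, nitrate$value, NA)
nitrate$upper <- nitrate$value
y_int <- Surv(nitrate$lower, nitrate$upper, type = "interval2")

print(y_left)
print(y_int)
```

Either object can then be used as the response of a model that supports that censoring type.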

The data exhibit both temporal and spatial correlation. Spatio-temporal modelling sounds relevant, but I am not an expert in it; others can comment on whether it is really necessary or feasible. My guess is that it requires much larger sample sizes. As Nick Cox advised, the first step should be to plot the data: the response (Chloride) over time (Event) by site (Stream), or perhaps also the response over site by time. See "Mixed Models in R: lme4, nlme, or both?" https://freshbiostats.wordpress.com/2013/07/28/mixed-models-in-r-lme4-nlme-both/.
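A rough sketch of that first plot in base R, using the chloride values from the site-wise table in the question. The events are drawn in table order, which I am assuming (but do not know) to be roughly chronological within each event type.

```r
# Chloride values copied from the site-wise table in the question:
# rows are the ten sampling events, columns are the six sites.
events <- c("Dry1", "Dry2", "Dry3", "Snowmelt1", "Snowmelt2",
            "Wet1", "Wet2", "Wet3", "Wet4", "Wet5")
sites <- paste0("SWM0", 1:6)
chloride <- matrix(c(
  33.0, 23.3, 24.1, 25.4, 24.9, 25.4,
  28.8, 19.1, 16.8, 18.6, 20.3, 20.6,
  34.8, 23.6, 26.4, 24.9, 29.4, 24.8,
  27.5, 25.6, 35.5, 28.1, 28.2, 27.8,
  35.0, 30.1, 28.2, 30.0, 29.4, 28.1,
  27.4, 24.7, 22.1, 25.3, 24.4, 32.7,
  25.0, 19.0, 18.7, 19.4, 19.3, 16.3,
  10.8, 12.0,  6.4,  7.8,  6.9,  7.8,
  20.0, 14.9, 16.1, 19.7, 15.6, 14.9,
  31.3, 24.3, 25.1, 27.4, 26.1, 24.4
), nrow = 10, byrow = TRUE, dimnames = list(events, sites))

# One line per site across the events (x axis follows table order).
matplot(chloride, type = "b", pch = 1:6, lty = 1, col = 1:6,
        xaxt = "n", xlab = "Event", ylab = "Chloride")
axis(1, at = seq_along(events), labels = events, las = 2, cex.axis = 0.7)
legend("bottomleft", legend = sites, pch = 1:6, col = 1:6, lty = 1,
       cex = 0.7)
```

Transposing the matrix (`t(chloride)`) gives the other view mentioned above, with one line per event across sites.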

Although the measurements at different sites are correlated, I think we can drop this dimension of correlation. An analogy is measuring six students in the same classroom over ten exams: their scores at any given exam must be correlated because they share the same teacher and environment, but in practice we only model the temporal correlation within each student, using a random intercept. In survival analysis, a frailty term plays the role of a random intercept. In summary, a tentative model is coxph(Surv(lower, upper, type = 'interval2') ~ event + frailty(site, distribution = "gamma")), with the frailty term inside the formula. As far as I know, coxph does not accept interval-type responses, so with left-censored non-detects a parametric survreg fit on the same Surv response may be needed instead. The coefficients of the event categories show the seasonal patterns of chloride concentration.
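A hedged sketch of a parametric version of the tentative model. It swaps in `survreg` for `coxph` because `survreg` handles an `interval2` response. The nitrate values, the 0.1 detection limit, and the lognormal distribution are all assumptions made for illustration, and the site-level frailty term is left out to keep the example simple.

```r
library(survival)

# Hypothetical nitrate data, NOT from the question. Values recorded at the
# assumed detection limit (0.10) are treated as non-detects.
d <- data.frame(
  event = factor(rep(c("dry", "snowmelt", "wet"), each = 6)),
  site  = factor(rep(paste0("SWM0", 1:6), times = 3)),
  value = c(0.10, 0.32, 0.10, 0.45, 0.28, 0.10,
            0.61, 0.48, 0.75, 0.10, 0.52, 0.66,
            0.35, 0.10, 0.41, 0.29, 0.10, 0.38)
)
d$detected <- d$value > 0.10

# interval2 coding: an NA lower bound marks a left-censored non-detect,
# and lower == upper marks an exactly observed value.
d$lower <- ifelse(d$detected, d$value, NA)
d$upper <- d$value

# Censored lognormal regression on event type (distribution is an assumption).
# The site-level random intercept discussed above is omitted in this sketch.
fit <- survreg(Surv(lower, upper, type = "interval2") ~ event,
               data = d, dist = "lognormal")
summary(fit)
```

The event coefficients here are on the log-concentration scale, relative to the first factor level ("dry").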

$\endgroup$
