7
$\begingroup$

My data concern the vegetative relative cover of nettle plants, recorded at up to thirty sites at irregular intervals over a calendar year (17/1/2009 - 16/1/2010), with different visit dates for different groups of sites. I want to describe the changes over the year with a regression equation and to calculate the F-statistic and the probability that the variation is related to season.

For the independent (x) variable I have used the number of days from the start of the study (rather than a date) in order to give a smaller value for the y-intercept. However, in showing the results graphically I would like the x-axis intervals to indicate months, which would show the relationship to the growing season more immediately. I realize that using the first day of each month would give slightly (imperceptibly?) uneven intervals.

Is this a reasonable approach, or is there another approach that would achieve these objectives better?

I'd appreciate any advice.

I am now adding the results of the polynomial regression I have previously completed:

> colMeans(mse)
[1] 1521.902 1312.779 1283.366 1250.781 1272.761
> best = lm(cover ~ poly(Days,2, raw=T), data=frpd)
> summary(best)

Call:
lm(formula = cover ~ poly(Days, 2, raw = T), data = frpd)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.814 -32.384  -3.897  30.337  57.270 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             98.4301664 14.8925413   6.609 6.74e-09 ***
poly(Days, 2, raw = T)1 -0.5816455  0.1771272  -3.284  0.00161 ** 
poly(Days, 2, raw = T)2  0.0015144  0.0004444   3.408  0.00110 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 36.29 on 69 degrees of freedom
Multiple R-squared:  0.1443,    Adjusted R-squared:  0.1195 
F-statistic: 5.819 on 2 and 69 DF,  p-value: 0.00462

> lm(formula=cover~poly(Days,2,raw=T),data=frpd)

Call:
lm(formula = cover ~ poly(Days, 2, raw = T), data = frpd)

Coefficients:
            (Intercept)  poly(Days, 2, raw = T)1  poly(Days, 2, raw = T)2  
              98.430166                -0.581645                 0.001514  
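
As a rough check on how to read the printed summary, the sketch below recomputes the overall F-statistic from R-squared, the p-value from the F distribution, and the turning point of the fitted quadratic. All numbers are copied from the output above. The one assumption is that Days = 0 corresponds to 17 January 2009, the first day of the study.

```r
# Values copied from the printed summary
r2  <- 0.1443   # Multiple R-squared
df1 <- 2        # numerator df (two polynomial terms)
df2 <- 69       # residual df

# Overall F-statistic recovered from R-squared
f_stat <- (r2 / df1) / ((1 - r2) / df2)

# Upper-tail probability of the reported F value under F(df1, df2)
p_value <- pf(5.819, df1, df2, lower.tail = FALSE)

# Turning point of a quadratic b0 + b1 * x + b2 * x^2 is at -b1 / (2 * b2);
# b2 > 0 here, so it is a minimum
b1 <- -0.5816455
b2 <-  0.0015144
turning_day <- -b1 / (2 * b2)

# ASSUMPTION: Days = 0 is 17 January 2009
turning_date <- as.Date("2009-01-17") + round(turning_day)

print(round(f_stat, 2))
print(signif(p_value, 3))
print(round(turning_day, 1))
print(turning_date)
```

Note that this F-test compares the quadratic with an intercept-only model, so it tests whether cover varies with Days at all, and a quadratic forces the decline and the recovery to be symmetric about the turning point.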

Plot of data points, regression curve and confidence limits

$\endgroup$
2
  • 1
    $\begingroup$ Does this mix sites that maintained complete cover with some that had dieback? (Is this Southern Hemisphere, not that it matters, just curious?) $\endgroup$
    – Nick Cox
    Commented Jan 21 at 10:22
  • 1
    $\begingroup$ Does frpd include data for one species but from different sites? Coloring the points by site will help reveal the pattern. $\endgroup$
    – DrJerryTAO
    Commented Jan 21 at 22:22

2 Answers

6
$\begingroup$

I disagree with the suggestion of using days elapsed given by wjktrs. Unlike in medical studies of human patients, the day of year on which a study starts is very important in ecological studies of seasonal patterns. Because the seasonality of plant growth is strong, a pattern observed 183 days after a start in January carries completely different implications from one observed 183 days after a start in July. Consolidating to days elapsed since the study began therefore discards useful information that day of year retains.
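
A minimal sketch of the difference, assuming the visit dates are stored as `Date` values. The first and last dates are the study limits given in the question; the middle date and the July start are made up for illustration.

```r
study_start <- as.Date("2009-01-17")

# First and last dates come from the question; the middle one is hypothetical
visit <- as.Date(c("2009-01-17", "2009-07-19", "2010-01-16"))

days_elapsed <- as.integer(visit - study_start)   # depends on the start date
day_of_year  <- as.integer(format(visit, "%j"))   # depends only on the season

print(data.frame(visit, days_elapsed, day_of_year))

# The same 183 elapsed days from a hypothetical July start land in mid-summer
# (in the Southern Hemisphere) instead of mid-winter
other_visit <- as.Date("2009-07-17") + 183
print(as.integer(format(other_visit, "%j")))
```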

It is okay to use different units of measurement in modelling and in plotting, because they are related by simple linear transformations. In fact, you do not need any transformation at all, only a change in axis labelling. You can see tutorials on plotting time series at https://otexts.com/fpp3/graphics.html.
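
One way to relabel the axis in base R is sketched below. It keeps days since the start of the study as the plotting unit and only places the tick marks at the first day of each month. The curve uses the coefficients printed in the question; treating Days = 0 as 17 January 2009 is an assumption.

```r
study_start <- as.Date("2009-01-17")   # ASSUMPTION: this is Days = 0

# First day of each month inside the study window
month_starts <- seq(as.Date("2009-02-01"), as.Date("2010-01-01"), by = "month")

# Tick positions in model units (days since start) and locale-independent labels
tick_pos <- as.numeric(month_starts - study_start)
tick_lab <- month.abb[as.integer(format(month_starts, "%m"))]

# Fitted quadratic from the question, drawn without the default x-axis
curve(98.4301664 - 0.5816455 * x + 0.0015144 * x^2,
      from = 0, to = 364, xaxt = "n",
      xlab = "Month (2009 to 2010)", ylab = "Relative cover (%)")

# Month labels placed at their true positions, so spacing is 28 to 31 days
axis(1, at = tick_pos, labels = tick_lab)
```

Because the ticks sit at the true day positions, the slightly uneven month lengths are shown honestly and have no effect on the fitted model.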

How you represent the seasonality is important. If you use day of year, you will need to include either several high-order polynomial terms, such as x + I(x^2) + I(x^3) + I(x^4) + I(x^5), or Fourier terms built from sine and cosine functions. In both cases the coefficients will be difficult to interpret intuitively. You could instead use 11 indicators of month or three indicators of quarter; then you can state that, compared with January or winter, vegetation cover in July or summer differs by a given amount.
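
The two representations can be sketched in base R as follows. The data are simulated purely for illustration (they are not the nettle data), and the helper `harmonics()` is a hypothetical name, not a function from any package.

```r
set.seed(1)

# SIMULATED data: 72 visits with a seasonal signal, clamped to 0..100 percent
doy   <- sample(1:365, 72, replace = TRUE)
cover <- 60 + 30 * cos(2 * pi * (doy - 15) / 365) + rnorm(72, sd = 10)
cover <- pmin(pmax(cover, 0), 100)

# Hypothetical helper: first and second annual harmonics of day of year
harmonics <- function(doy, period = 365) {
  w <- 2 * pi * doy / period
  data.frame(s1 = sin(w), c1 = cos(w), s2 = sin(2 * w), c2 = cos(2 * w))
}
dat <- cbind(data.frame(doy = doy, cover = cover), harmonics(doy))

# One harmonic gives a symmetric annual cycle; a second allows asymmetry
fit1 <- lm(cover ~ s1 + c1, data = dat)
fit2 <- lm(cover ~ s1 + c1 + s2 + c2, data = dat)
print(anova(fit1, fit2))   # does the second harmonic improve the fit?

# Alternative: month indicators, with January as the reference level
m <- as.integer(format(as.Date(doy - 1, origin = "2009-01-01"), "%m"))
dat$month <- factor(month.abb[m], levels = month.abb)
fit_month <- lm(cover ~ month, data = dat)
print(summary(fit1)$fstatistic)
```

The overall F-test of `fit1` or `fit_month` against an intercept-only model is then a test of seasonal variation, and the harmonic fits return the same prediction at day 0 and day 365.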

Since you have longitudinal data, an ordinary least squares estimator may not suffice, because the errors are likely to be correlated in space and in time. You may instead need to learn generalized least squares and mixed-effect models.
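
A minimal sketch of a random-intercept model, assuming the `nlme` package (one of the recommended packages normally installed with R) and simulated data for four sites. The column names `site`, `doy` and `cover` are placeholders, not the names in the real data set.

```r
library(nlme)
set.seed(2)

# SIMULATED data: 4 sites, 18 visits each, with a site-level shift in cover
site  <- factor(rep(paste0("site", 1:4), each = 18))
doy   <- rep(round(seq(10, 350, length.out = 18)), times = 4)
shift <- rnorm(4, sd = 8)[as.integer(site)]
cover <- 60 + shift + 25 * cos(2 * pi * doy / 365) + rnorm(72, sd = 8)

dat <- data.frame(site, doy, cover,
                  s1 = sin(2 * pi * doy / 365),
                  c1 = cos(2 * pi * doy / 365))

# OLS treats all 72 observations as independent
fit_ols <- lm(cover ~ s1 + c1, data = dat)

# Random intercept per site: repeated visits to a site share a common offset
fit_lme <- lme(cover ~ s1 + c1, random = ~ 1 | site, data = dat)

print(fixef(fit_lme))
print(VarCorr(fit_lme))
```

A within-site temporal correlation structure could be added through the `correlation` argument of `lme()`, but whether that is needed depends on the real data.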

Since your response variable is the vegetative relative cover of nettle plants, it may be a percentage, a strictly positive value, or a nonnegative value. You may need to use a fractional model, beta regression (potentially with zero and one inflation), gamma regression, or ordinal regression. See
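
As one illustration of the fractional option, a quasi-binomial GLM with a logit link in base R accepts proportions in [0, 1], including exact zeros and ones. This is a sketch on simulated proportions, not the nettle data, and it is only one of several ways to fit such a model.

```r
set.seed(3)

# SIMULATED proportions of maximum cover, a few of them exactly 1
doy  <- sample(1:365, 72, replace = TRUE)
eta  <- 0.5 + 1.5 * cos(2 * pi * doy / 365)
prop <- plogis(eta + rnorm(72, sd = 0.7))
prop[sample(72, 5)] <- 1

dat <- data.frame(prop,
                  s1 = sin(2 * pi * doy / 365),
                  c1 = cos(2 * pi * doy / 365))

# Fractional logit: the mean is modelled on the logit scale, and the
# quasi-binomial family allows non-integer responses
fit_frac <- glm(prop ~ s1 + c1, family = quasibinomial(link = "logit"),
                data = dat)

# Predictions over the year always stay strictly between 0 and 1
grid <- seq(1, 365, by = 7)
pred <- predict(fit_frac,
                newdata = data.frame(s1 = sin(2 * pi * grid / 365),
                                     c1 = cos(2 * pi * grid / 365)),
                type = "response")
print(range(pred))
```

The standard errors here are scaled by the estimated dispersion; robust (sandwich) standard errors are often reported for fractional models, which would need an additional package.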

$\endgroup$
16
  • $\begingroup$ Here is a case study where long-term time trends are modeled alongside seasonal variation along with a possible discontinuity: hbiostat.org/rmsc/genreg#sec-genreg-gtrans $\endgroup$ Commented Jan 20 at 12:19
  • $\begingroup$ My data are indeed percentages derived from cover area estimates: the estimated vegetative cover at the time of each visit as a percentage of the maximum cover observed at that site over the duration of the study. This approach was used because sites were small enough to measure in total but the maximum area of cover varied between sites. From the Stata tutorial link I take it that I should be looking to use a logistic regression for a binomial distribution. I will also edit my question to add an image of the current polynomial regression I have produced. $\endgroup$ Commented Jan 21 at 4:04
  • $\begingroup$ Since the percentage coverage is derived from two values, maybe it is worth developing models on the absolute coverage, too. The fractional regression used in Stata is very similar to binary logit regression, except that it allows fractional responses. How to incorporate repeated measurements in fractional models can be a challenge. See edited answer. $\endgroup$
    – DrJerryTAO
    Commented Jan 21 at 9:53
  • $\begingroup$ OLS is okay as an intuitive reference model but won't do well for statistical significance and inference. Fractional models guarantee that the predicted percentage is within (0, 1) exclusive while allowing the observed percentage to take values in [0, 1] inclusive. Depending on how meaningful the site indicators are, you can model site effects using either a fixed-effect (regular predictors) or a random-effect (random intercepts) estimator. $\endgroup$
    – DrJerryTAO
    Commented Jan 21 at 10:04
  • $\begingroup$ Using sine and cosine terms would guarantee that predicted values at the beginning and the end of the year are equal. Also, there is quite often asymmetry between growth and decay phases which is certainly worth checking for. More at journals.sagepub.com/doi/pdf/10.1177/1536867X0600600408 A plain quadratic won't catch any asymmetry. $\endgroup$
    – Nick Cox
    Commented Jan 21 at 10:19
0
$\begingroup$

Most software packages have a date-differencing function that gives you the exact number of days between the start of the study at each site and the specific date on which a sample was taken. That gives you the number of days (numdays) for each measurement.
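
In R, for example, subtracting two `Date` values gives the difference in days. The sketch below assumes dates recorded as day/month/year text, as in the question, and uses made-up visits at two hypothetical sites, A and B.

```r
# HYPOTHETICAL visit records; only the study limits come from the question
visits <- data.frame(
  site = c("A", "A", "A", "B", "B"),
  date = as.Date(c("17/1/2009", "1/3/2009", "16/1/2010",
                   "1/2/2009", "1/3/2009"), format = "%d/%m/%Y")
)

# Days since a single global start date
global_start <- as.Date("2009-01-17")
visits$days_global <- as.numeric(visits$date - global_start)

# Days since the first visit at each site
visits$days_site <- ave(as.numeric(visits$date), visits$site,
                        FUN = function(x) x - min(x))

print(visits)
```

The two columns differ for any site whose first visit falls after the global start, so it is worth stating which one the model uses.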

Now, with regard to using the first day of the month, you should drop that idea. Rather, use a plot that has nettle relative cover vs. day of measurement (over the entire x-axis).

Many failure-time studies drop any notion of displaying calendar date on the x-axis and instead just use the number of days since the start of the study on which each measurement was made.

Something you did not mention is whether you are concerned about months having 28, 30 or 31 days. That shouldn't matter either, since Excel, for example, will make such a plot with calendar date on the x-axis and your nettle coverage on the y-axis. However, you have multiple collection sites, so simply ask the sites to report, for each sample, the number of days since a global study start date, and then plot the nettle values (or means) against the number of days since that global start date.

$\endgroup$
4
  • $\begingroup$ Day since start of study is a perverse metric that doesn't match the biology. The answer by @DrJerryTAO gives better advice. $\endgroup$
    – Nick Cox
    Commented Jan 21 at 10:20
  • $\begingroup$ Stack Exchange seemed to want me to curtail extended discussion in comments and move to a chat room. This has now become a conversation just between myself and Dr Jerry Tao, but I have further comments to deliver to Nick Cox, so I will continue those here unless Nick can be joined into the conversation with Jerry. $\endgroup$ Commented Jan 21 at 23:49
  • $\begingroup$ My data are from New Zealand and are for one species from 4 sites. I'll see if I can indicate this by colouring the points. I am using this species as a test, as I have 4 other species of interest, in one case with data from 30 sites. For most sites and most species some cover was maintained throughout the study, and few if any sites maintained 100% cover throughout. $\endgroup$ Commented Jan 21 at 23:52
  • $\begingroup$ On the face of it any site with sustained 100% cover is just confusing the fit. Hard to know what best to do. $\endgroup$
    – Nick Cox
    Commented Jan 23 at 13:03
