Is it reasonable to calculate a polynomial regression using days but show months on the plot?

Question

My data concern vegetative relative cover of nettle plants recorded at up to thirty sites at irregular intervals over a calendar year (17/1/2009 - 16/1/2010), with different dates for different groups of sites. I want to describe the changes over the year by a regression equation and calculate the F-statistic and the probability that the variation is related to season.

For the independent x variable I have used the number of days from the start of the study (rather than a date) in order to give a smaller value for the y-intercept. However, in showing the results graphically I would like the x-axis intervals to indicate months, because this shows the relationship to the growth season more immediately. I realize that using the first day of each month would give slightly (imperceptibly?) uneven intervals.

Is this a reasonable approach, or is there another approach that would achieve these objectives better?

I'd appreciate any advice.

I am now adding the results of the polynomial regression I have previously completed:

> colMeans(mse)
[1] 1521.902 1312.779 1283.366 1250.781 1272.761
> best = lm(cover ~ poly(Days,2, raw=T), data=frpd)
> summary(best)

Call:
lm(formula = cover ~ poly(Days, 2, raw = T), data = frpd)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.814 -32.384  -3.897  30.337  57.270 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             98.4301664 14.8925413   6.609 6.74e-09 ***
poly(Days, 2, raw = T)1 -0.5816455  0.1771272  -3.284  0.00161 ** 
poly(Days, 2, raw = T)2  0.0015144  0.0004444   3.408  0.00110 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 36.29 on 69 degrees of freedom
Multiple R-squared:  0.1443,    Adjusted R-squared:  0.1195 
F-statistic: 5.819 on 2 and 69 DF,  p-value: 0.00462

> lm(formula=cover~poly(Days,2,raw=T),data=frpd)

Call:
lm(formula = cover ~ poly(Days, 2, raw = T), data = frpd)

Coefficients:
            (Intercept)  poly(Days, 2, raw = T)1  poly(Days, 2, raw = T)2  
              98.430166                -0.581645                 0.001514

Does this mix sites that maintained complete cover with some that had dieback? (Is this Southern Hemisphere, not that it matters, just curious?) — Nick Cox, Commented Jan 21 at 10:22
Does frpd include data of one species but from different sites? Coloring the points by site will help telling the pattern. — DrJerryTAO, Commented Jan 21 at 22:22

DrJerryTAO · Accepted Answer · 2024-01-21 22:18:10Z

I disagree with the suggestion by wjktrs of using days elapsed. Unlike in medical studies of human patients, the starting day of the year is very important in ecological studies of seasonal patterns. Because plant growth is strongly seasonal, a pattern observed 183 days after a January start carries completely different implications from one observed 183 days after a July start. Consolidating to days elapsed since the study began therefore discards useful information that day of year retains.

It is okay to use different units of measurement in modelling and plotting. They are simple linear transformations of one another. In fact, you do not even need a transformation, only a change in axis labelling. You can see tutorials on plotting time series at https://otexts.com/fpp3/graphics.html.
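As a minimal sketch of the relabelling idea, the R code below fits the quadratic on days and only changes the tick labels to months. The data frame `demo`, its values, and the five-day visit spacing are made up for illustration; only the study start date (17/1/2009) and the quadratic formula come from the question.

```r
# Hypothetical data: evenly spaced visits and simulated cover, for illustration only.
set.seed(1)
start <- as.Date("2009-01-17")                       # study start date from the question
demo  <- data.frame(Date = start + seq(0, 360, by = 5))
demo$Days  <- as.numeric(demo$Date - start)          # days elapsed since the start
demo$cover <- 60 + 30 * cos(2 * pi * demo$Days / 365) + rnorm(nrow(demo), sd = 10)

# Fit on days, using the same quadratic form as in the question.
fit <- lm(cover ~ poly(Days, 2, raw = TRUE), data = demo)

# Tick positions: the first day of each month, expressed in the model's units (days).
month_starts <- seq(as.Date("2009-02-01"), as.Date("2010-01-01"), by = "month")
tick_days    <- as.numeric(month_starts - start)

# Plot against days, suppress the default axis, then label the ticks with month names.
plot(demo$Days, demo$cover, xaxt = "n", xlab = "Month", ylab = "Relative cover (%)")
axis(1, at = tick_days, labels = format(month_starts, "%b"))
grid_days <- 0:364
lines(grid_days, predict(fit, newdata = data.frame(Days = grid_days)))
```

The ticks end up 28 to 31 days apart, but each one sits at its true position on the day scale, so the uneven month lengths do not distort the fitted curve or the points.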

How to represent the seasonality is important. If you use day of year, you will need to include high-order polynomial terms such as x + I(x^2) + I(x^3) + I(x^4) + I(x^5), or Fourier terms using sine and cosine functions, but the coefficients will be difficult to interpret intuitively. You could instead use 11 month indicators or three quarter indicators; then you can state that, compared with January or winter, vegetation in July or summer increases by a certain amount.
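The two representations of seasonality described above can be sketched in R as follows. This is an illustration under stated assumptions, not the asker's analysis: the data frame `demo`, the simulated `cover` values, and the choice of a single harmonic with a 365.25-day period are all hypothetical.

```r
# Hypothetical data: one visit every five days, simulated seasonal cover.
set.seed(2)
dates <- as.Date("2009-01-17") + seq(0, 360, by = 5)
demo <- data.frame(
  Date  = dates,
  doy   = as.numeric(format(dates, "%j")),   # day of year
  month = factor(format(dates, "%m"))        # "01" ... "12"; January is the reference level
)
demo$cover <- 60 + 30 * cos(2 * pi * demo$doy / 365.25) + rnorm(nrow(demo), sd = 10)

# (a) Fourier terms: one sine and cosine pair with a period of one year.
fit_fourier <- lm(cover ~ sin(2 * pi * doy / 365.25) + cos(2 * pi * doy / 365.25),
                  data = demo)

# (b) Month indicators: R expands the factor into 11 dummy variables,
#     so each coefficient is the difference in mean cover from January.
fit_month <- lm(cover ~ month, data = demo)

coef(fit_fourier)
coef(fit_month)
```

The Fourier fit uses two seasonal coefficients and is periodic by construction, while the month-indicator fit uses eleven and reads directly as "difference from January".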

Since you have longitudinal data, an ordinary least squares estimator may not suffice, because the errors are likely to be spatially and serially correlated. You may instead need generalized least squares or mixed-effect models.
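One possible starting point for repeated measurements per site is a random intercept for site, sketched here with the `nlme` package. Everything about the data is hypothetical (four sites, 18 visits each, simulated site shifts), and a random intercept is only one of several structures that could be tried; a within-site correlation structure such as `corCAR1` from the same package might also be worth considering for irregularly spaced visits.

```r
library(nlme)  # recommended package shipped with standard R installations

# Hypothetical data: 4 sites with 18 visits each, and a site-specific shift in cover.
set.seed(3)
demo <- expand.grid(visit = 1:18, site = factor(paste0("S", 1:4)))
demo$Days  <- (demo$visit - 1) * 20 + as.integer(demo$site)   # staggered visit days
site_shift <- c(-10, -3, 4, 9)[as.integer(demo$site)]
demo$s1 <- sin(2 * pi * demo$Days / 365.25)
demo$c1 <- cos(2 * pi * demo$Days / 365.25)
demo$cover <- 55 + 25 * demo$c1 + site_shift + rnorm(nrow(demo), sd = 8)

# Reference model: ordinary least squares, ignoring the grouping by site.
fit_ols <- lm(cover ~ s1 + c1, data = demo)

# Mixed model: same fixed effects plus a random intercept for each site.
fit_lme <- lme(cover ~ s1 + c1, random = ~ 1 | site, data = demo)

summary(fit_lme)$tTable   # fixed-effect estimates with standard errors
VarCorr(fit_lme)          # between-site and residual variance components
```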

Since your response variable is vegetative relative cover of nettle plants, it may be a percentage, a strictly positive value, or a nonnegative value. You may need a fractional model, beta regression (potentially with zero and one inflation), gamma regression, or ordinal regression. See:

Stata tutorial on fractional response https://www.stata.com/features/overview/fractional-outcome-models
Michael Clark (2019). Fractional Regression: A quick primer regarding data between zero and one, including zero and one. https://m-clark.github.io/posts/2019-08-20-fractional-regression/
Papke, L. E., & Wooldridge, J. M. (2008). Panel data methods for fractional response variables with an application to test pass rates. Journal of Econometrics, 145(1), 121–133. https://doi.org/10.1016/j.jeconom.2008.05.009
A gamma regression tutorial with shape and scale parameter meaning and calculation https://data.library.virginia.edu/getting-started-with-gamma-regression/
Christensen, R. H. B. (2022). Cumulative link models for ordinal regression with the R package ordinal. https://cran.r-project.org/web/packages/ordinal/vignettes/clm_article.pdf
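As a rough illustration of the fractional-response option, a fractional logit can be fitted in base R with `glm` and the `quasibinomial` family, which accepts proportions anywhere in [0, 1]. The data below are simulated and the variable names (`prop`, `s1`, `c1`) are hypothetical; this is a sketch of the technique, not a recommendation of a final model, and it ignores the repeated measurements per site.

```r
# Hypothetical data: cover as a proportion of the site maximum, clipped to [0, 1].
set.seed(4)
doy <- seq(5, 365, by = 5)
s1  <- sin(2 * pi * doy / 365.25)
c1  <- cos(2 * pi * doy / 365.25)
mu  <- plogis(0.5 + 1.2 * c1)                                 # true mean proportion
prop <- pmin(pmax(mu + rnorm(length(doy), sd = 0.1), 0), 1)   # exact 0 or 1 can occur
demo <- data.frame(doy, s1, c1, prop)

# Fractional logit: logit link for the mean, quasi-likelihood for the variance.
fit_frac <- glm(prop ~ s1 + c1, family = quasibinomial(link = "logit"), data = demo)

# Fitted values are on the proportion scale and stay strictly inside (0, 1).
pred <- predict(fit_frac, type = "response")
range(pred)
```

Unlike a straight-line or polynomial OLS fit, the predictions here cannot fall below 0% or exceed 100% cover.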

Here is a case study where long-term time trends are modeled alongside seasonal variation along with a possible discontinuity: hbiostat.org/rmsc/genreg#sec-genreg-gtrans — Frank Harrell, Commented Jan 20 at 12:19
My data are indeed percentages derived from cover area estimates: the estimated vegetative cover at time of visit as a percentage of the maximum cover observed at that site over the duration of the study. This approach was used because sites were small enough to measure in total but maximum area of cover varied between sites. From the Stata tutorial link I take it that I should be looking to use a logistic regression for a binomial distribution. I will also edit my question to add an image of the current polynomial regression I have produced. — Roger Frost, Commented Jan 21 at 4:04
Since the percentage coverage is derived from two values, maybe it is worth developing models on the absolute coverage, too. The fractional regression used in Stata is very similar to binary logit regression, except that it allows a fractional response. How to incorporate repeated measurements in fractional models can be a challenge. See edited answer. — DrJerryTAO, Commented Jan 21 at 9:53
OLS is okay as an intuitive reference model but won't do well for statistical significance and inference. Fractional models guarantee that the predicted percentage is within (0, 1) exclusive while allowing the observed percentage to take values in [0, 1] inclusive. Depending on how meaningful the site indicators are, you can model site effects using either a fixed-effect (regular predictors) or a random-effect (random intercepts) estimator. — DrJerryTAO, Commented Jan 21 at 10:04
Using sine and cosine terms would guarantee that predicted values at the beginning and the end of the year are equal. Also, there is quite often asymmetry between growth and decay phases which is certainly worth checking for. More at journals.sagepub.com/doi/pdf/10.1177/1536867X0600600408 A plain quadratic won't catch any asymmetry. — Nick Cox, Commented Jan 21 at 10:19

wjktrs · Accepted Answer · 2024-01-20 21:20:51Z


Most software packages have a date-differencing function that returns the exact number of days between the start of the study at each site and the date each sample was taken. That gives you the number of days (numdays) for each measurement.
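In R, for example, subtracting two `Date` objects gives the difference in days. The sketch below uses the study's start and end dates from the question plus one made-up visit date (1/3/2009), and also extracts day of year for comparison.

```r
# Dates are given as day/month/year, as in the question.
start  <- as.Date("17/1/2009", format = "%d/%m/%Y")
visits <- as.Date(c("17/1/2009", "1/3/2009", "16/1/2010"), format = "%d/%m/%Y")

# Days elapsed since the start of the study.
numdays <- as.numeric(visits - start)
numdays   # 0 43 364

# Day of year for the same visits (1 = 1 January).
doy <- as.numeric(format(visits, "%j"))
doy       # 17 60 16
```

The last visit shows the difference between the two scales: it is day 364 of the study but day 16 of the calendar year.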

Now, with regard to using the first day of the month, you should drop that idea. Rather, use a plot that has nettle relative cover vs. day of measurement (over the entire x-axis).

Many failure-time studies drop any notion of displaying calendar date on the x-axis and instead plot each measurement against the number of days since the start of the study.

Something you never mentioned was whether you were concerned about some months having 28, 30, or 31 days. That shouldn't matter either, since Excel, for example, will make such a plot with calendar date on the x-axis and your nettle coverage on the y-axis. However, you have multiple collection sites, so simply ask the sites to report, for each sample, the number of days since a global study start date, and then plot the nettle values (or means) against the number of days since that global start date.

edited Jan 20 at 21:20

answered Jan 20 at 6:03


Day since start of study is a perverse metric that doesn't match the biology. The answer by @DrJerryTAO gives better advice. – Nick Cox, Commented Jan 21 at 10:20
Stack Exchange seemed to want me to curtail extended discussion by comments and move to a chat room. This has now become a conversation just between myself and Dr Jerry Lao, but I have further comments to deliver to Nick Cox, so will continue those here unless Nick can be joined into the conversation with Jerry? – Roger Frost, Commented Jan 21 at 23:49
My data are from New Zealand and are for one species from 4 sites. I'll see if I can indicate this by colouring the points. I am using this species as a test as I have 4 other species of interest with in one case data from 30 sites. For most sites and most species some cover was maintained throughout the study and few if any sites maintained 100% cover throughout. – Roger Frost, Commented Jan 21 at 23:52
On the face of it any site with sustained 100% cover is just confusing the fit. Hard to know what best to do. – Nick Cox, Commented Jan 23 at 13:03
