# Random regression with heterogeneous residuals

## Background

When studying a labile trait, whose expression is not fixed during an individual’s lifetime, one may wish to quantify whether the individual-specific values or the breeding values of a trait depend on a covariate. The covariate can be an extrinsic environmental variable (e.g. temperature) or an “intrinsic environmental” variable (age). The use of a Random Regression Phenotypic Model (RRPM) or Random Regression Animal Model (RRAM) in wild populations is detailed in Nussey, Wilson & Brommer (2005, J. Evol.Biol.).
These models present a simplification of the character-state approach. The complexity of having N classes with each their respective variance and all the covariances between them rapidly increases when more classes are included. If N = 2, one needs to estimate three (co)variances (2 variances, 1 covariance), when N = 10, one needs to estimate 55 (co)variances (10 variances, 45 covariances). Under the assumptions of the infinite-dimensional model (developed by Kirkpatrick and others, e.g. Kirkpatrick, Lofsvold, Bulmer (1990; Genetics 124: 979 - 993.), we can reduce the complexity of the NxN matrix by characterising individual-specific values or breeding values as a function of the covariate. In its simplest form, the random regression function is linear, and the NxN matrix is reduced to variance in elevation, variance in slope and the covariance between these (2x2 matrix). It is important to remember (especially for ASReml users) that although an RRPM or RRAM are phrased in terms of polynomial functions (lines), they still are mapped onto the NxN character-state matrix. It is often useful to translate the result of the Random Regression model back into the character states (thus visualising the variances and covariances across the N classes).
The basic approach is described in the section on Random regression models. In this section, I extend this basic approach to also allow for heterogeneous error variances. I specifically focus on the situation when one wants to model both the genetic and non-genetic aspects of a trait as a function of a covariate (i.e. a RRAM). I further describe a general outline on how to proceed in testing and verifying Random Regression Models.

### Specifics

When setting up a RRAM, the additive genetic variance and covariances are modelled as a function of the covariate. There are three possibilities to model the non-genetic part.
1) Residual variance can be assumed to be constant across the covariate, all non-genetic (co)variances are described by allowing the permanent environment to be a function of the covariate. Implemented in the section on Random regression models.
2) Residual variances are allowed to be heterogenous, the permanent environment is a function of the covariate. For N classes, one can model N residual variances. There are no residual covariances across the N classes (by definition). This model can be simplified by reducing the number of classes. For example, one may consider quartiles of the covariate (e.g. ranging from the coldest to the warmest spring temperatures), and define a residual variance for each quartile.
3) The permanent environmental function of the covariate is abandoned altogether and one models all non-genetic information as a residual (co)variance structure. In this case, the residual covariances need to be included as the residuals are not merely the ‘error’ anymore, but also include the permanent environmental effects that create covariance across individual repeated records along the covariate.

## How to proceed?

Here, I sketch my view of this situation. Given that we are working with wild animals, the non-genetic part of the model is of some ecological interest. The non-genetic part consists of both the permanent environment and the residuals. The former relates to effects that are conserved across repeated records on an individual (individual-specific environmental effects), the latter relates to general environmental effects and measurement error. Viewed from this perspective, it is clear that each of these two “environmental” aspects of the model are interesting and may relate to different ecological processes. For this reason, model approach number 2 is potentially rewarding as it allows the separate quantification of these two “environmental” effects.
A second, more statistical, argument is that the assumption of a homogeneous residual variance (approach 1) may bias the estimation of the random regression parameters. Consider, for example, a RRPM of a trait as a function of an environmental covariate. The environment is characterised from poor to good. For a number of biological reasons, we may observe that the variance in the trait expression increases with the environmental covariate. This may be because individuals respond similar under poor environmental conditions, but are really different under good environmental conditions. If so, then we would expect to see an increase in the individual-specific effect as a function of the covariate. On the other hand, the increase in trait variance as a function of the covariate may be (partly) due to residual (general environmental) effects creating an increase in the noise or scatter in good environments. By assuming a homogeneous residual variance, we may “force” the individual-specific variance to increase as a function of the covariate. At present, it is unclear whether such “forcing” really is going on, but it is – in any case – a worthwhile exercise to compare random regression models with homogeneous and heterogeneous residuals. See below for suggestions on how to do so.

### Constructing a RR model

How to find the RRPM or RRAM that is most appropriate? How to compare the different approaches?
Several ways lead to Rome. My view is presented in Brommer, Rattiste & Wilson (2008, Proc. Roy. Soc B 275, 687–693) and Brommer & Rattiste (2010, Heredity). This view is based on nested models that are sequentially tested. Alternative approach would be to apply Information Theoretical criteria and compare a series of possible models.
Construction of a random regression models starts from a basic model with heterogenous residual variances. I start with heterogeneous errors mainly because I want to be sure there is no bias in the estimation of the random regression elements. Consecutive tests are then carried out to model variance across individuals (I), followed by variance in individual-specific slopes of the trait as a function of the covariate (IxE). Because we are working in a likelihood framework, it is most useful to compare the likelihood of the different models. The Likelihood Ratio Test (LRT) is -2 times the difference in Log(Likelihood) of two nested models, which tested against a chi-square distribution with the number of degrees of freedom equal to the difference in the number of (co)variances estimated. The goal is to distinguish the most parsimonious model, given the data. Based on the RRPM, the maximal polynomial individual-specific function of the covariate is found by stepwise increasing the order of the polynomial. In data on wild animals, one probably rapidly meets convergence problems for polynomials of order > 2.
At this point, it is important to note that all further models are simply further partitions of the variances that are already estimated. For example, if we find that the model I + IxE (variation across individuals (I) and variation in the individual-specific linear slope of trait against covariate, IxE) is the most parsimonious RRPM. We can construct the RRAM that partitions each of these terms into its additive genetic and permanent environmental variance. First I is partitioned into G + PE (genetic and permanent environment), and then IxE into GxE and PExE. These partitions do not “explain” more of the residual variance (e.g. see Tables in the references provided above), but the model may still result in a higher likelihood if the resemblance across relatives is an important part of the heterogeneity across individuals. To my mind, this realisation already illustrates that we are dealing here with demanding analyses of our wild animal data. It is difficult to assess the power of the dataset for this partitioning. It is therefore important to present and visualise all the variance components, also for the statistically unsupported model(s). The obtained estimates are still the best guess of how any patterns look like on the genetic level, and investigating them seems worthwhile. Alternative approaches would be to carry out some kind of model averaging, although it is at present not clear how to achieve this.
After the final (most parsimonious model) is obtained, I have usually returned to the issue of modelling the residual variances. Because of reasons listed above, I will focus on approaches 1 and 2, where permanent-environmental and residual variances are separated. In that case, the Likelihood Ratio Test (LRT) is -2 times the difference in Log(Likelihood) of the two models, which tested against a chi-square distribution with the number of degrees of freedom equal to the difference in the number of residual variances estimated. As far as I know, this test always provides evidence that the heterogeneous error variances provide a better fit to the data. In a way, however, this is fairly trivial, because we started with the RR models in the first place because the trait is a function of the covariate, and therefore the variance (and also variance components such as the residuals) are likely to be some function of the covariate. A more important and interesting (but little explored issue) is whether the estimated variance components of the genetic and permanent-environmental components really differ in a model with homogeneous versus heterogeneous error variances. Again, reporting these (and testing whether they change) is worthwhile (and something I have not done in the studies referred to above). Preliminary investigation from my side suggest that they do not change.

### Implementation

To model heterogeneous residuals, the data structure is different than for the standard RR model. ASReml expects to find a record for each value of the covariate for each individual, and it expects that these records are sorted within individual in the same manner. That is, if you have X individuals and N environments or ages, the datafile has X * N records, where for each individual the covriate is sorted in the same manner (e.g. low to high). Clearly, you will not have a measurement of each individual at each value of the covariate and you therefore need to have ASReml imput the missing values (‘!f mv’ after the line that declares the model). For example, the datafile may look like
 ANIMAL Year Measured Covariate 1 2 10 -1 1 1 -999 0 1 3 5 1 2 2 -999 -1 2 1 -999 0 2 3 10 1 ... ... ... ...

In this case, the data is sorted in the order of the covariate. It is important that the covariate is identically sorted within each individual and that its value is the same, even small differences will be interpreted by ASReml as a different covariate (e.g. 0.000 and 0.001). The covariate may also be rescaled by ASReml.
Important: ASReml will scale the covariate such that the lowest value is -1 and that it has a mean of zero. The scaling that ASReml uses is presented in the ‘.res’ file. The estimated variances are relevant only with respect to the scaling that ASReml is using. For example, the variance in, say, individual-specific elevation (I) will be specific to where the ASReml-scaled covariate is 0. Observe that the ASReml scaling may be different from the scaling in the data-file. In the case of the example data-file presented above, ASReml-scaling is the same as in the data-file (which is why many researchers try to scale their covariate in such a way). However, this may not always be possible. For example, the covariate may take values 0, 0.301, 0.477 (if E = 0, 2, 4; this is the log(E+1)and the trait may be approximately linear on this scale). ASReml will scale this as -1, 0.1607, 0.8393 (mean of zero, starting at -1).
When the data is presented like this, one can specify the residual variances as the diagonal in a matrix which otherwise contains zeros (structure DIAG). ASReml needs you to specify the number of individuals in the data, and it needs a factorial variable that it can use to count the number of categories for which it should estimate a separate residual variance. In the case of the example data-file, ‘Year’ is a good option. Alternatively, simply include a variable that numbers the environments. For some reason, the user also needs to specify the starting values and you cannot have ASReml determine these. Thus, for N variances to be estimated, one needs to enter N starting values. ASReml will expect (and states this in the output file) that if you have X individuals, there are X * N records and that these are sorted within individual.
In the code presented below, I include the covariate also as a factorial fixed effect (Covariate_F). This variable is identical to ‘Covariate’ in the data-file example above, but is read by ASReml as a factor. Because the variance in how the trait value or breeding value depends on the covariate (i.e. the slope) is a random effect, it has a mean of zero and is relative to the fixed-effect slope. One therefore has to enter the covariate both as a fixed and as a random term in the model. My personal preference is to enter the covariate as a factorial fixed effect, thereby not forcing any linear trait - covariate relationship (which one would do in case the covariate is a continuous fixed effect). This is particularly important when the covariate is discrete (for example, age) and there is a non-linear relationship between trait and age (as there often is). In case the covariate is continuous (e.g. annual temperature), it would make more sense to enter it as a continuous fixed effect (because the random effect slopes are deviations from the fixed-effect slope). Estimate any additional variation across the unit of temperature observations (e.g. years in case temperature is annual mean temperature) by entering this as either a random or fixed effect (reviewers tend to prefer it as random). Example of the former is Brommer & Rattiste (2010, Heredity) and of the latter Brommer, Rattiste & Wilson (2008, Proc. Roy. Soc B 275, 687–693).
Below, I present all the parts of an approach that increasingly partitions the trait variance into more detailed components. The last part of the model considers homogeneous residual variances. Again, LRT between these consecutive models can be performed to test whether the data supports them.

## Notice

1) I here assume that heterogeneous residual variances can be estimated for all N classes of the covariate.
2) Convergence is aided by adjusting the starting values. ASReml will present you with a guesstimate even if it does not converge. Enter these as starting values to help convergence.
3) Matrix elements of especially higher-order polynomials may not be general-positive. You can use !GP as qualifier when stating the G structure to force the matrix to be general positive. This may allow you to obtain a likelihood of the model under the general-positive constraint. However, constrained matrix elements will not have a standard error. As a rule, as long as variances in the matrix are not negative, rather sensible output is generated.
4) The estimated components are dependent on the ASReml-scaled covariate. Check the ‘.res’ file for the scale that ASReml has used (so-called ‘node points’). Based on these node points, the estimated (co)variance components, and the REML variances of each of the estimated (co)variances (which are in the ‘.vvp’ file that ASReml creates), one can use the delta method to reconstruct the N character-state variances and covariances between them. Details are provided in Fischer, Gilmour & van der Werf (2004, Genetics Selection Evolution 36: 363 – 369), although this is fairly cryptic. The advantage of reconstructing the covariate-specific variances is that it allows visualisation of the pattern. For a n’th -order polynomial Random Regression, the variance as a function of the covariate is of order n+1 (e.g. variance is a cubed function of the covariate when assuming a linear random regression function). Much of how variance changes with the covariate depends on the covariance between elevation and slope.
5) Another useful way to visualise the model is to use the Best Linear Unbiased Predictor (BLUP) values that ASReml returns in the ‘.sln’ file. These provide the estimates of elevation and slope (and higher order polynomial terms) which you can use to visualise the patterns for all or a subset of our individuals (plotting more than one hundred lines is usually not very helpful). Remember that the BLUPs relate to the ASReml-scaled covariate and should be plotted on that scale or back-transformed to the data scale of the original covariate.

## ASReml code

```ASREML Random Regression Phenotypic and Animal Models

#next section describes data file, each column heading must be indented
ANIMAL	!P
Year	!I
Measured	!M-999 #declare what is a missing value
Covariate
Covariate_F	!I

ped_all.ped  !alpha !make   #!skip 1       # pedigree file with qualifiers
DataFile.asd !dopart 1  !FCON #name of datafile and statement which part to run

!part 1	     # model 1: random regression model with hetero variance in errors evaluated at all values of the covariate
################################################################################################################

Measured ~ mu  Covariate_F  !f mv
# Model specified is residuals only

1 2 0                      # 1 multivariate error structure, no G structures

3986	!S2==1	# multivariate error structure, specify the  number of individuals
Covariate_F  0  DIAG  #A matrix with only the diagonal (the variances) to be estimated
0.1 0.1 0.1 #give the starting values yourself for each variance to be estimated

!part 2    # model 2: I. random regression model with hetero variance in errors, zero-order polynomial of individual-specific effects.
################################################################################################################

# Model specified below fits a zero-order orthogonal polynomial, where individuals are allowed to differ in their elevation. This formulation has the covariate (e.g. age) as a factorial fixed effect ('Covariate_F') and
# estimate random regression slopes using the covariate as a continuous covarate ('Covariate'). This makes most sense to me for covariates that are discrete (e.g. age).
# Alternatively, 'Covariate' can be entered as a fixed effect, and 'Covariate_F' as an additional fixed or random effect in order to estimate the variation not explained by 'Covariate'.

Measured  ~ mu  Covariate_F  !r  pol(Covariate,0).ide(ANIMAL)  !f mv

1 2 1                     # 1 multivariate error structure, 1 G structures

3986	!S2==1
Covariate_F  0 DIAG
0.1 0.1 0.1

pol(Covariate,0).ide(ANIMAL) 2
pol(Covariate,0) 0 US
.02
ide(ANIMAL)

!part 3    # model 3: I + IxE. Random regression model with hetero variance in errors, first-order polynomial of individual-specific effects
################################################################################################################
# Model specified below is a random regression phenotypic model, fitting first-order orthogonal polynomial resulting in estimates of variance in elevation (I) and slope (IxE) and the covariance between these

Measured  ~ mu  Covariate_F  !r  pol(Covariate,1).ide(ANIMAL)  !f mv

1 2 1                      # 1 multivariate error structure, 1 G structures

3986	!S2==1
Covariate_F  0  DIAG
0.1 0.1 0.1

#below G structure, 3 starting values are to be specified
pol(Covariate,1).ide(ANIMAL) 2
pol(Covariate,1) 0 US
1
.1   1
ide(ANIMAL)

#Order of the polynomial can be increased (pol(Covariate,2), pol(Covariate,3) etc.). Additional row of starting values need to be presented.

!part 4    # model 4: G + PE + IxE. Random regression model with hetero variance in errors
################################################################################################################

# Model specified below is a random regression animal model, fitting zero-order orthogonal polynomial of additive genetic effects, and first-order orthogonal polynomial of individual-specific effects
# In this case, the zero-order (elevation) term for the individual-specific effect relates to the permanent environmental effect (since the additive genetic effect is modelled).

Measured  ~ mu  Covariate_F  !r  pol(Covariate,0).ANIMAL   pol(Covariate,1).ide(ANIMAL)  !f mv

1 2 2                     # 1 multivariate error structure, 2 G structures

3986	!S2==1
Covariate_F  0  DIAG
0.1 0.1 0.1

#first G structure describes the additive genetic matrix
pol(Covariate,0).ANIMAL 2
pol(Covariate,0) 0 US
1
ANIMAL

#second G structure defines the non-genetic effects
pol(Covariate,1).ide(ANIMAL) 2
pol(Covariate,1) 0 US
1
.1   1
ide(ANIMAL)

!part 5    # model 5: G + PE + GxE + PExE. Random regression model with hetero variance in errors, polynomial of animal and permanent environmental effects
################################################################################################################

Measured  ~ mu  Covariate_F  !r  pol(Covariate,1).ANIMAL   pol(Covariate,1).ide(ANIMAL)  !f mv

1 2 2                      # 1 multivariate error structure, 2 G structures

3986	!S2==1
Covariate_F  0  DIAG
0.1 0.1 0.1

#first G structure describes the additive genetic matrix
pol(Covariate,1).ANIMAL 2
pol(Covariate,1) 0 US
1
0.1  1
ANIMAL

#second G structure defines the non-genetic effects
pol(Covariate,1).ide(ANIMAL) 2
pol(Covariate,1) 0 US
1
.1   1
ide(ANIMAL)

!part 6    # model 6: G + PE + GxE + PExE. Random regression model with homogeneous variances in errors, polynomial of animal and permanent environmental effects
################################################################################################################

Measured  ~ mu  Covariate_F  !r  pol(Covariate,1).ANIMAL   pol(Covariate,1).ide(ANIMAL)  !f mv

0 0 2                     # 1 multivariate error structure, 2 G structures

#first G structure describes the additive genetic matrix
pol(Covariate,1).ANIMAL 2
pol(Covariate,1) 0 US
1
0.1  1
ANIMAL

#second G structure defines the non-genetic effects
pol(Covariate,1).ide(ANIMAL) 2
pol(Covariate,1) 0 US
1
.1   1
ide(ANIMAL)```