Linear mixed models

Emi Tanaka

Australian National University

2025-07-14

Linear mixed model

A general form for the linear mixed model is given by

\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{b} + \boldsymbol{e},

where \boldsymbol{y} is the vector of n observations, \mathbf{X} and \mathbf{Z} are the design matrices of the fixed effects \boldsymbol{\beta} and the random effects \boldsymbol{b}, respectively, \boldsymbol{e} is the vector of errors, \text{var}(\boldsymbol{b}) = \mathbf{G} depends on parameters \boldsymbol{\sigma}_g, and \text{var}(\boldsymbol{e}) = \mathbf{R} depends on parameters \boldsymbol{\sigma}_r.

  • We let \boldsymbol{\sigma} = (\boldsymbol{\sigma}^\top_g, \boldsymbol{\sigma}^\top_r)^\top denote the complete vector of variance parameters.
  • We need to estimate the fixed effects \boldsymbol{\beta} and the variance parameters \boldsymbol{\sigma}.

Variance structures of random effects

  • The q\times 1 vector of random effects is often composed of {n_b} sub-vectors \boldsymbol{b} = (\boldsymbol{b}_1^\top,\boldsymbol{b}_2^\top,\dots,\boldsymbol{b}_{n_b}^\top)^\top where the sub-vectors \boldsymbol{b}_i are of length q_i and \sum_{i=1}^{n_b}q_i = q.
  • We let \text{var}(\boldsymbol{b}_i) = \mathbf{G}_i. Assuming the \boldsymbol{b}_i are mutually independent, then

\mathbf{G} = \oplus^{n_b}_{i=1} \mathbf{G}_i = \begin{bmatrix} \mathbf{G}_1 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{G}_2 & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} &\cdots & \mathbf{G}_{n_b} \\ \end{bmatrix} .

  • \mathbf{Z} = \begin{bmatrix} \mathbf{Z}_1 & \mathbf{Z}_2 & \cdots & \mathbf{Z}_{n_b} \\ \end{bmatrix} and \mathbf{Z}\boldsymbol{b} = \mathbf{Z}_1\boldsymbol{b}_1 + \cdots + \mathbf{Z}_{n_b}\boldsymbol{b}_{n_b} = \sum_{i=1}^{n_b}\mathbf{Z}_i\boldsymbol{b}_i.

Variance structures of errors

  • The classical assumption is that the errors are independent and identically distributed (i.i.d.), so \mathbf{R} = \sigma^2 \mathbf{I}_n.
  • In some situations, \boldsymbol{e} will be sub-divided into sections, e.g. multi-clinic trials or multi-environment variety trials.
  • We let \boldsymbol{e} = (\boldsymbol{e}_1^\top, \boldsymbol{e}_2^\top, \dots, \boldsymbol{e}_{n_e}^\top)^\top so that \boldsymbol{e}_j represents the vector of errors of the jth section of the data.

\mathbf{R} = \oplus_{j=1}^{n_e} \mathbf{R}_j = \begin{bmatrix} \mathbf{R}_1 & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_2 & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{R}_{n_e} \end{bmatrix}

where \oplus is the direct sum operator.
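
The direct sum simply stacks the component matrices along the diagonal. A minimal sketch in R (using the Matrix package, with made-up component matrices that are not from the examples below):

library(Matrix)
# hypothetical component variance matrices G_1 (2 x 2) and G_2 (3 x 3)
G1 <- 0.5 * diag(2)
G2 <- matrix(c(1.0, 0.3, 0.3,
               0.3, 1.0, 0.3,
               0.3, 0.3, 1.0), nrow = 3)
# direct sum: block-diagonal matrix with G_1 and G_2 on the diagonal, zeros elsewhere
bdiag(G1, G2)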

Variance models

  • There are three types of variance model for \mathbf{R}_j and \mathbf{G}_i:


Correlation

  • diagonal elements are all 1
  • off-diagonal elements are between -1 and 1
  • If \mathbf{C} is an n\times n correlation model, then there are \frac{1}{2}n(n - 1) parameters.

Homogeneous

  • diagonal elements all have the same positive value \sigma^2 (say)
  • \sigma^2\mathbf{C} is a homogeneous model, so it has \frac{1}{2}n(n - 1) + 1 parameters.

Heterogeneous

  • diagonal elements are positive but can differ (\sigma_1^2, \sigma^2_2, \cdots, \sigma^2_n)
  • \mathbf{D}\mathbf{C}\mathbf{D} where \mathbf{D} = \oplus_{i=1}^n \sigma_i is a heterogeneous model, so it has \frac{1}{2}n(n - 1) + n parameters.
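
A small numerical illustration of the three model types (made-up numbers, n = 3):

n <- 3
# correlation model: unit diagonal, n(n - 1)/2 = 3 off-diagonal parameters
C <- matrix(c(1.0, 0.5, 0.2,
              0.5, 1.0, 0.4,
              0.2, 0.4, 1.0), nrow = n)
# homogeneous model: one extra parameter sigma^2, so 3 + 1 = 4 parameters
sigma2 <- 2
sigma2 * C
# heterogeneous model: n variance parameters, so 3 + 3 = 6 parameters
D <- diag(c(1, 1.5, 2))   # D = diag(sigma_1, ..., sigma_n)
D %*% C %*% D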

Variance model functions in asreml

Default

id() = \mathbf{I}_n

idv() = \sigma^2\mathbf{I}_n

idh() = \begin{bmatrix}\sigma^2_1 & 0 & \cdots & 0 \\ 0 & \sigma^2_2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \sigma^2_n \end{bmatrix}

Time series


ar1(), ar2(), ar3(), sar(), sar2(), ma1(), ma3(), arma()

  • Suffix name with v or h to convert correlation model to homogeneous or heterogeneous models.
  • E.g. ar1v() or ar1h()

Metric-based


exp(), gau(), lvr(), iexp(), igau(), ieuc(), sph(), cir(), aexp(), agau(), mtrn()

  • Suffix name with v or h to convert correlation model to homogeneous or heterogeneous models.

General structures


cor(), corb(), corg(), diag(), us(), chol(), cholc(), ante(), sfa(), fa(), facv(), rr()

  • For cor(), corb(), and corg(), suffix name with v or h to convert correlation model to homogeneous or heterogeneous models.
  • Some combinations are equivalent, e.g. idh() \equiv diag() and corgh() \equiv us(), but default starting values or computation under the hood may differ.
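
For instance, a sketch of how the suffix changes the residual model (hypothetical data frame dat with factors column and row; not run here):

# correlation model in both directions
fit_c <- asreml(yield ~ 1, random =~ Variety,
                residual =~ ar1(column):ar1(row), data = dat)
# homogeneous variance: suffix the column term with v
fit_v <- asreml(yield ~ 1, random =~ Variety,
                residual =~ ar1v(column):ar1(row), data = dat)
# heterogeneous variances across columns: suffix with h
fit_h <- asreml(yield ~ 1, random =~ Variety,
                residual =~ ar1h(column):ar1(row), data = dat)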

Aliasing of variance parameters

  • Variance model parameters may not be estimable:
    • the variance model may not be identifiable if it is over-parameterised, or
    • there may be insufficient data to estimate the parameters of the chosen variance model.
  • Note that if you have \mathbf{V}_1 \otimes \mathbf{V}_2, where \mathbf{V}_1 and \mathbf{V}_2 are variance models, then this is over-parameterised. E.g. idv(col):idv(row).
  • \mathbf{C} \otimes \mathbf{V} or \mathbf{V} \otimes \mathbf{C}, where \mathbf{V} is a variance model and \mathbf{C} is a correlation model, may be acceptable. This is referred to as the sigma parameterization in asreml. E.g. idv(col):id(row).
  • \mathbf{C} \otimes \mathbf{C} is generally converted to \sigma^2 \mathbf{C} \otimes \mathbf{C} by asreml. This is referred to as the gamma parameterization in asreml. E.g. id(col):id(row).

Example 🌾 different variance models

library(asreml)
# gamma parameterization.
fiti <- asreml(yield ~ 1,
               random =~ Variety,
               residual =~ id(column):id(row),
               data = naf)

# sigma parameterization
fitv <- asreml(yield ~ 1,
               random =~ Variety,
               residual =~ idv(column):id(row),
               data = naf)

# sigma parameterization
fith <- asreml(yield ~ 1,
               random =~ Variety,
               residual =~ idh(column):id(row),
               data = naf)

The model below cannot be fitted because idv(column):idv(row) is over-parameterised (both terms are variance models).

fitn <- asreml(yield ~ 1,
               random =~ Variety,
               residual =~ idv(column):idv(row),
               data = naf)
library(broom.asreml)

tidy(fiti, "varcomp")
# A tibble: 2 × 5
  term         estimate std.error statistic constraint
  <chr>           <dbl>     <dbl>     <dbl> <chr>     
1 Variety        0.0860    0.0303      2.84 P         
2 column:row!R   0.102     0.0220      4.63 P         
tidy(fitv, "varcomp")
# A tibble: 3 × 5
  term              estimate std.error statistic constraint
  <chr>                <dbl>     <dbl>     <dbl> <chr>     
1 Variety             0.0860    0.0303      2.84 P         
2 column:row!column   0.102     0.0220      4.63 P         
3 column:row!R        1        NA          NA    F         
tidy(fith, "varcomp")
# A tibble: 14 × 5
   term                 estimate std.error statistic constraint
   <chr>                   <dbl>     <dbl>     <dbl> <chr>     
 1 Variety                0.0543   0.0181       3.00 P         
 2 column:row!R           1       NA           NA    F         
 3 column:row!column_1    0.0225   0.0199       1.13 P         
 4 column:row!column_2    0.0153   0.00982      1.56 P         
 5 column:row!column_3    0.0267   0.0201       1.32 P         
 6 column:row!column_4    0.146    0.0738       1.97 P         
 7 column:row!column_5    0.226    0.100        2.25 P         
 8 column:row!column_6    0.0240   0.0179       1.34 P         
 9 column:row!column_7    0.326    0.142        2.30 P         
10 column:row!column_8    0.0939   0.0503       1.87 P         
11 column:row!column_9    0.220    0.102        2.15 P         
12 column:row!column_10   0.0788   0.0435       1.81 P         
13 column:row!column_11   0.116    0.0639       1.82 P         
14 column:row!column_12   0.292    0.131        2.23 P         

BLUEs and BLUPs

  • Suppose that \text{var}(\boldsymbol{y}) = \mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}.
  • Also assume \boldsymbol{\sigma} is known, \mathbf{V} is non-singular and \mathbf{X} is full rank.
  • Let \mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}

Best linear unbiased estimates (BLUEs)

\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\boldsymbol{y}

Best linear unbiased predictions (BLUPs)

\tilde{\boldsymbol{b}} = \mathbf{G}\mathbf{Z}^\top\mathbf{P}\boldsymbol{y} = \mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}(\boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{\beta}})

  • But since \mathbf{V} is often unknown, it is estimated from the data.
  • Correspondingly, we plug the estimate of \mathbf{V} into the above to get the E-BLUEs and E-BLUPs (where E = empirical).

Example 🌾 BLUEs and BLUPs by hand

By asreml

fits  <- asreml(yield ~ 1,
               random =~ Variety,
               residual =~ units,
               data = shf)
tidy(fits, "fixed")
# A tibble: 1 × 3
  term        estimate std.error
  <chr>          <dbl>     <dbl>
1 (Intercept)    1470.    0.0161
tidy(fits, "random")
# A tibble: 25 × 3
   term       estimate std.error
   <chr>         <dbl>     <dbl>
 1 Variety_1   -156.       0.103
 2 Variety_2     20.8      0.103
 3 Variety_3     -7.78     0.103
 4 Variety_4    -26.3      0.103
 5 Variety_5    -18.2      0.103
 6 Variety_6      4.43     0.103
 7 Variety_7    -42.2      0.103
 8 Variety_8     45.6      0.103
 9 Variety_9    -92.2      0.103
10 Variety_10  -161.       0.103
# ℹ 15 more rows

By “hand”

vars <- tidy(fits, "varcomp")$estimate
G <- vars[1] * diag(nlevels(shf$Variety))
R <- vars[2] * diag(nrow(shf))
Z <- model.matrix(~Variety - 1, data = shf)
X <- matrix(1, nrow = nrow(shf))
V <- Z %*% G %*% t(Z) + R
y <- shf$yield
(BLUE <- solve(t(X) %*% solve(V) %*% X) %*% t(X) %*% solve(V) %*% y)
        [,1]
[1,] 1470.44
(BLUP <- G %*% t(Z) %*% solve(V) %*% (y - X %*% BLUE))
             [,1]
 [1,] -156.452173
 [2,]   20.841535
 [3,]   -7.779433
 [4,]  -26.339105
 [5,]  -18.231459
 [6,]    4.430877
 [7,]  -42.163667
 [8,]   45.652885
 [9,]  -92.177099
[10,] -161.238614
[11,]  -66.388923
[12,]    7.849764
[13,]   92.247430
[14,]  -54.569343
[15,]  -14.714889
[16,]   -6.997973
[17,]   22.892867
[18,]   87.167941
[19,]  101.038854
[20,]  144.116829
[21,]  -27.804342
[22,]  106.118343
[23,]  -78.696916
[24,]   43.113141
[25,]   78.083470

Best linear unbiased estimator

  • A statistic \hat{\boldsymbol{\beta}} is the best linear unbiased estimator of \boldsymbol{\beta} if
    • \hat{\boldsymbol{\beta}} = \mathbf{A}\boldsymbol{y} for some \mathbf{A},
    • E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}, and
    • \text{var}(\hat{\boldsymbol{\beta}}) is the smallest of all linear unbiased estimators of \boldsymbol{\beta}.
  • Homework: prove \boldsymbol{c}_1^\top\hat{\boldsymbol{\beta}} + \boldsymbol{c}_2^\top\tilde{\boldsymbol{b}} is BLUE for \boldsymbol{c}_1^\top\boldsymbol{\beta} + \boldsymbol{c}_2^\top\boldsymbol{b} for any \boldsymbol{c}_1 \in\mathbb{R}^p and \boldsymbol{c}_2 \in\mathbb{R}^q.
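
As a partial step toward this, note that \hat{\boldsymbol{\beta}} itself is linear in \boldsymbol{y}, with \mathbf{A} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}, and unbiased since

\begin{align*} E(\hat{\boldsymbol{\beta}}) &= (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}E(\boldsymbol{y}) = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}. \end{align*}

The remaining (and harder) part is to show the variance is minimal among all linear unbiased estimators.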

Objective function to derive BLUEs and BLUPs

  • Find (\boldsymbol{\beta}, \boldsymbol{b}) which maximises the log-density function of the joint distribution of \boldsymbol{y} and \boldsymbol{b} assuming \boldsymbol{\sigma} is known.
  • Note \boldsymbol{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \mathbf{V}), \boldsymbol{b} \sim N(\boldsymbol{0}, \mathbf{G}) and \boldsymbol{y}|\boldsymbol{b} \sim N(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{b}, \mathbf{R}).

\begin{align*} \ell &= \log f(\boldsymbol{y}, \boldsymbol{b} ; \boldsymbol{\sigma})\\ &= \log f(\boldsymbol{y}|\boldsymbol{b};\boldsymbol{\sigma}_r) + \log f(\boldsymbol{b};\boldsymbol{\sigma}_g) \\ &= -\frac{1}{2}\log |\mathbf{R}| - \frac{1}{2}(\boldsymbol{y}-\mathbf{X}\boldsymbol{\beta}-\mathbf{Z}\boldsymbol{b})^\top \mathbf{R}^{-1}(\boldsymbol{y}-\mathbf{X}\boldsymbol{\beta}-\mathbf{Z}\boldsymbol{b}) \\ & \qquad - \frac{1}{2}\log|\mathbf{G}| - \frac{1}{2} \boldsymbol{b}^\top\mathbf{G}^{-1}\boldsymbol{b} - \frac{n+q}{2}\log 2\pi. \end{align*}

Mixed model equations

\begin{align*} \frac{\partial \ell}{\partial\boldsymbol{\beta}} &= \mathbf{X}^\top \mathbf{R}^{-1}(\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\boldsymbol{b}) = \boldsymbol{0} \\ \frac{\partial \ell}{\partial\boldsymbol{b}} &= \mathbf{Z}^\top \mathbf{R}^{-1}(\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\boldsymbol{b}) - \mathbf{G}^{-1} \boldsymbol{b} = \boldsymbol{0}. \end{align*}

Rearranging the above results in the mixed model equations:

\underbrace{\begin{bmatrix} \mathbf{X}^\top \mathbf{R}^{-1}\mathbf{X} & \mathbf{X}^\top \mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}^\top \mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}^\top \mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1} \\ \end{bmatrix}}_{\normalsize\mathbf{C}} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \tilde{\boldsymbol{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}^\top\mathbf{R}^{-1}\boldsymbol{y}\\ \mathbf{Z}^\top\mathbf{R}^{-1}\boldsymbol{y} \end{bmatrix}

Solving this results in the same BLUE and BLUP as before.
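
Continuing the earlier by-hand example (this sketch reuses the X, Z, G, R and y defined there), the mixed model equations can be assembled and solved directly:

# coefficient matrix C of the mixed model equations
Rinv <- solve(R)
Ginv <- solve(G)
C <- rbind(cbind(t(X) %*% Rinv %*% X, t(X) %*% Rinv %*% Z),
           cbind(t(Z) %*% Rinv %*% X, t(Z) %*% Rinv %*% Z + Ginv))
# right-hand side and solution: the first element is the BLUE of the intercept,
# the remaining elements are the BLUPs of the Variety effects
rhs <- rbind(t(X) %*% Rinv %*% y, t(Z) %*% Rinv %*% y)
sol <- solve(C, rhs)
head(sol)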

Prediction error variance

Variance of prediction errors

\text{var}\left(\begin{bmatrix}\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\\\tilde{\boldsymbol{b}} - \boldsymbol{b}\end{bmatrix}\right) = \mathbf{C}^{-1} = {\scriptsize\begin{bmatrix}(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1} & -(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{Z}\mathbf{G}\\ -\mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1} & \mathbf{G} - \mathbf{G}\mathbf{Z}^\top\mathbf{P}\mathbf{Z}\mathbf{G} \end{bmatrix}}

  • Note \text{var}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \text{var}(\hat{\boldsymbol{\beta}}).


fitc <- asreml(yield ~ 1,
               random =~ Variety,
               residual =~ units,
               data = shf,
               options = asreml.options(
                 Cfixed = TRUE,
                 Csparse =~ Variety
               ))
fitc$Cfixed
[1] 707.637
sp2mat(fitc$Csparse)[1:5, 1:5]
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 4534.929    0.000    0.000    0.000    0.000
[2,]    0.000 4534.929    0.000    0.000    0.000
[3,]    0.000    0.000 4534.929    0.000    0.000
[4,]    0.000    0.000    0.000 4534.929    0.000
[5,]    0.000    0.000    0.000    0.000 4534.929

By “hand”

solve(t(X) %*% solve(V) %*% X)
         [,1]
[1,] 707.8299
P <- solve(V) - solve(V) %*% X %*% solve(t(X) %*% solve(V) %*% X) %*% t(X) %*% solve(V)
CZZ <- G - G %*% t(Z) %*% P %*% Z %*% G
CZZ[1:5, 1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 4535.9150  243.1447  243.1447  243.1447  243.1447
[2,]  243.1447 4535.9150  243.1447  243.1447  243.1447
[3,]  243.1447  243.1447 4535.9150  243.1447  243.1447
[4,]  243.1447  243.1447  243.1447 4535.9150  243.1447
[5,]  243.1447  243.1447  243.1447  243.1447 4535.9150

Estimation of variance parameters

  • Variance parameters \boldsymbol{\sigma} can be estimated using the maximum likelihood (ML) approach or the residual (or restricted) maximum likelihood (REML) approach.
  • REML estimates take into account the loss of degrees of freedom associated with the estimation of fixed effects, so they are less biased than ML estimates.

ML

\begin{align*} \hat{\boldsymbol{\sigma}}_{\text{ML}} &= \operatorname*{arg\,max}_{\boldsymbol{\sigma}} f(\boldsymbol{y}; \boldsymbol{\sigma})\\ &= \operatorname*{arg\,max}_{\boldsymbol{\sigma}}~ \log f(\boldsymbol{y}; \boldsymbol{\sigma})\\ &= \operatorname*{arg\,max}_{\boldsymbol{\sigma}}~ -\frac{1}{2}\left( \log |\mathbf{V}| + (\boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top\mathbf{V}^{-1}(\boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\right) \end{align*}

REML

\begin{align*} \hat{\boldsymbol{\sigma}}_{\text{REML}} &= \operatorname*{arg\,max}_{\boldsymbol{\sigma}} \ell_R(\boldsymbol{\sigma}) \quad(\textbf{residual log-likelihood})\\ &= \operatorname*{arg\,max}_{\boldsymbol{\sigma}}~ -\frac{1}{2}\left( \log |\mathbf{V}| + {\color{RoyalBlue} \log|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}|} + (\boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top\mathbf{V}^{-1}(\boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\right)\\ &= \operatorname*{arg\,max}_{\boldsymbol{\sigma}}~ -\frac{1}{2}\left(\log |\mathbf{G}| + \log |\mathbf{R}| + \log |\mathbf{C}| + \boldsymbol{y}^\top\mathbf{P}\boldsymbol{y}\right) \end{align*}
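
The last equality can be checked numerically for the simple Variety model, reusing V, G, R, X, y, BLUE and P from the by-hand computations above and the mixed-model-equation matrix C from the sketch above (constants involving 2\pi are dropped in both expressions):

# residual log-likelihood (up to constants) written in terms of V
r <- y - X %*% BLUE
llR_V <- -0.5 * (determinant(V)$modulus +
                 determinant(t(X) %*% solve(V) %*% X)$modulus +
                 t(r) %*% solve(V) %*% r)
# the same quantity written in terms of G, R, C and P
llR_C <- -0.5 * (determinant(G)$modulus + determinant(R)$modulus +
                 determinant(C)$modulus + t(y) %*% P %*% y)
c(llR_V, llR_C)   # the two expressions agree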

REML score equations

REML estimates for \boldsymbol{\sigma}_{m\times1} = (\sigma_1, \dots, \sigma_m)^\top are obtained by solving s_i(\boldsymbol{\sigma}) = \frac{\partial\ell_R(\boldsymbol{\sigma})}{\partial \sigma_i} = 0 \quad\text{for } i = 1, \dots, m.

  • We solve numerically using a Newton-Raphson algorithm: \boldsymbol{\sigma}^{(t + 1)} = \boldsymbol{\sigma}^{(t)} + \left({\color{RoyalBlue}\mathcal{J}(\boldsymbol{\sigma}^{(t)}})\right)^{-1}\boldsymbol{s}(\boldsymbol{\sigma}^{(t)}) where

    • \boldsymbol{s}(\boldsymbol{\sigma}^{(t)}) = (s_1(\boldsymbol{\sigma}^{(t)}), s_2(\boldsymbol{\sigma}^{(t)}), \dots, s_m(\boldsymbol{\sigma}^{(t)}))^\top is the m\times 1 scores at iteration t, and
    • \mathcal{J}(\boldsymbol{\sigma}^{(t)}) is the m\times m observed information matrix at iteration t where the (i,j)-th entry is given as \mathcal{J}_{ij}(\boldsymbol{\sigma}^{(t)}) = -\dfrac{\partial s_i(\boldsymbol{\sigma}^{(t)})}{\partial \sigma_j} = -\frac{\partial^2\ell_R(\boldsymbol{\sigma}^{(t)})}{\partial\sigma_i\partial\sigma_j}.

Information matrices

Let \mathbf{V}'_i = \frac{\partial \mathbf{V}}{\partial \sigma_i} and \mathbf{V}''_{ij} = \frac{\partial^2 \mathbf{V}}{\partial \sigma_i\partial \sigma_j}.

Observed information matrix

\mathcal{J}_{ij} = \frac{1}{2}\text{tr}(\mathbf{P}\mathbf{V}''_{ij}) - \frac{1}{2}\text{tr}(\mathbf{P}\mathbf{V}_{i}'\mathbf{P}\mathbf{V}_j') + \boldsymbol{y}^\top\mathbf{P}\mathbf{V}'_i\mathbf{P}\mathbf{V}_j'\mathbf{P}\boldsymbol{y}-\frac{1}{2}\boldsymbol{y}^\top\mathbf{P}\mathbf{V}''_{ij}\mathbf{P}\boldsymbol{y}

Expected information matrix (Fisher information)

\mathcal{I}_{ij} = E(\mathcal{J}_{ij}) = \frac{1}{2}\text{tr}(\mathbf{P}\mathbf{V}_{i}'\mathbf{P}\mathbf{V}_j')

Average information matrix

\mathcal{A}_{ij} = \frac{1}{2}\boldsymbol{y}^\top\mathbf{P}\mathbf{V}'_i\mathbf{P}\mathbf{V}_j'\mathbf{P}\boldsymbol{y} = \frac{1}{2}(\mathcal{J}_{ij} + \mathcal{I}_{ij})\quad\text{assuming }\boldsymbol{y}^\top\mathbf{P}\mathbf{V}''_{ij}\mathbf{P}\boldsymbol{y} = \text{tr}(\mathbf{P}\mathbf{V}''_{ij})
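
For the simple Variety model, the variance parameters enter \mathbf{V} linearly, so \mathbf{V}''_{ij} = \mathbf{0} and the assumption above holds exactly. A sketch comparing the three matrices, reusing Z, y and P from the by-hand computations above:

dV <- list(Z %*% t(Z), diag(nrow(Z)))   # V'_1 = Z Z^T, V'_2 = I_n
Jmat <- Imat <- Amat <- matrix(0, 2, 2)
for (i in 1:2) for (j in 1:2) {
  tr   <- sum(diag(P %*% dV[[i]] %*% P %*% dV[[j]]))
  quad <- drop(t(y) %*% P %*% dV[[i]] %*% P %*% dV[[j]] %*% P %*% y)
  Jmat[i, j] <- -0.5 * tr + quad   # observed information (V''_ij = 0 here)
  Imat[i, j] <-  0.5 * tr          # expected (Fisher) information
  Amat[i, j] <-  0.5 * quad        # average information
}
all.equal(Amat, (Jmat + Imat) / 2)   # TRUE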

Algorithm overview

Start with initial estimate of \boldsymbol{\sigma} = \boldsymbol{\sigma}^{(0)}.

For t = 1, \dots, \texttt{maxit},

  1. Calculate \hat{\boldsymbol{\beta}}^{(t - 1)} = (\mathbf{X}^\top(\mathbf{V}^{(t -1)})^{-1}\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{V}^{(t -1)})^{-1}\boldsymbol{y} and \tilde{\boldsymbol{b}}^{(t - 1)} = \mathbf{G}\mathbf{Z}^\top\mathbf{P}^{(t - 1)}\boldsymbol{y}.
  2. Calculate residual likelihood \ell_R^{(t- 1)} = -\frac{1}{2}\left(\log |\mathbf{G}^{(t-1)}| + \log |\mathbf{R}^{(t-1)}| + \log |\mathbf{C}^{(t-1)}| + \boldsymbol{y}^\top\mathbf{P}^{(t-1)}\boldsymbol{y}\right)
  3. Calculate the scores \boldsymbol{s}(\boldsymbol{\sigma}^{(t - 1)}).
  4. Calculate the average information matrix \mathcal{A}_{ij}^{(t - 1)} = \frac{1}{2}\boldsymbol{y}^\top\mathbf{P}^{(t-1)}\mathbf{V}'^{(t-1)}_i\mathbf{P}^{(t-1)}\mathbf{V}_j'^{(t-1)}\mathbf{P}^{(t-1)}\boldsymbol{y}.
  5. Update \boldsymbol{\sigma}^{(t)} = \boldsymbol{\sigma}^{(t - 1)} + \left(\mathcal{A}^{(t - 1)}\right)^{-1}\boldsymbol{s}(\boldsymbol{\sigma}^{(t-1)})
  6. Repeat steps 1-2; if \ell_R^{(t)} - \ell_R^{(t-1)} < \text{threshold} then stop, otherwise continue from step 3.
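
A minimal sketch of this algorithm in R for the simple Variety model, reusing X, Z, y and the list dV of derivative matrices from the sketch above (here \boldsymbol{\sigma} = (\sigma^2_g, \sigma^2_e)^\top so \mathbf{V} = \sigma^2_g\mathbf{Z}\mathbf{Z}^\top + \sigma^2_e\mathbf{I}_n; no step-halving, constraints or convergence check, for simplicity):

sigma <- rep(var(y) / 2, 2)   # crude initial values sigma^(0)
for (iter in 1:20) {
  Vt <- sigma[1] * dV[[1]] + sigma[2] * dV[[2]]
  Vtinv <- solve(Vt)
  Pt <- Vtinv - Vtinv %*% X %*% solve(t(X) %*% Vtinv %*% X) %*% t(X) %*% Vtinv
  # scores: s_i = -1/2 tr(P V'_i) + 1/2 y^T P V'_i P y
  s <- sapply(dV, function(Vi)
    -0.5 * sum(diag(Pt %*% Vi)) + 0.5 * drop(t(y) %*% Pt %*% Vi %*% Pt %*% y))
  # average information: A_ij = 1/2 y^T P V'_i P V'_j P y
  A <- matrix(0, 2, 2)
  for (i in 1:2) for (j in 1:2)
    A[i, j] <- 0.5 * drop(t(y) %*% Pt %*% dV[[i]] %*% Pt %*% dV[[j]] %*% Pt %*% y)
  sigma <- sigma + solve(A, s)   # update sigma^(t)
}
sigma   # compare with tidy(fits, "varcomp")$estimate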
