TIEGCM atmospheric model

This page hosts common resources for the TIEGCM project of the SAMSI / NCAR collaboration, part of the SAMSI year-long program COMPMOD.

Comments regarding this page should be addressed to Jonty Rougier (Statistician, SAMSI and Bristol, UK). Other scientists involved in this collaboration, such as Art Richmond and Astrid, are mentioned below.


Background on the model

"The Thermosphere Ionosphere Electrodynamics General Circulation Model was developed at NCAR by Ray Roble and his collaborators. The model simulates the upper atmosphere, its dynamics, chemistry, energy, and electrodynamics. This is a self consistent large scale model which includes the aurora, but also other features of the upper and middle atmosphere, like the ozone layer. Its application range from space weather studies to global warming." From http://odin.gi.alaska.edu/MAMI/mami/tiegcm.html.

For our experiment TIEGCM can be thought of as a deterministic function of seven inputs, although in the first design only three have been varied (see below, Design 1). The outputs are four periodic functions in local time (i.e., indexed by time in [0, 24]), for a specified collection of sites (i.e., indexed by longitude and latitude). The periodic functions are for MAGDDR (three components: H, D, Z) and Drift.

Here is a map of the sites for which we have observational data on MAGDDR.

Model inputs

Here are the inputs, with marginal prior distributions.
  1. AMP: Amplitude of the migrating tide, uniform on [0, 36000]
  2. PHZ: Phase of the tide, periodic, uniform on [0, 12]
  3. EDN: Minimum electron density, log10 is uniform on [3, 4]
  4. Plus four more inputs that we have not investigated yet.
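
For concreteness, here is a minimal R sketch of drawing samples from these marginal priors (illustrative only; the function and variable names are ours, not part of the project code):

    ## Draw n independent samples from the marginal priors of the
    ## three inputs investigated so far.
    sample_prior <- function(n) {
      data.frame(
        AMP = runif(n, 0, 36000),   # amplitude of the migrating tide
        PHZ = runif(n, 0, 12),      # phase, periodic on [0, 12]
        EDN = 10^runif(n, 3, 4)     # log10(EDN) uniform on [3, 4]
      )
    }
    sample_prior(5)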

Further observations on the model

  1. Note that PHZ is a periodic input, so that f(AMP, 0, EDN) = f(AMP, 12, EDN) for all AMP and EDN. Art Richmond suggests an alternative parameterisation of AMP and PHZ, AP1 = AMP*cos(pi*PHZ/6) and AP2 = AMP*sin(pi*PHZ/6), which removes the periodicity (see the sketch after this list). It seems more interesting, though, to find a way to incorporate the periodicity into the emulator and the inference.
  2. Art also suggests that EDN is unlikely to have much impact on the geomagnetic variations in the model (something that we confirm in the Calibration of the HON site).
  3. Art suggests that the scale of the perturbations in the MAGDDR outputs is expected to be roughly linear in AMP.
  4. Astrid notes that "TIE-GCM model values of the electron densities are too low, which results in low conductivities, therefore low current, and low magnetic perturbations, since these are the results of the current flowing overhead." This is a simple explanation for the scale mismatch between the model evaluations and the observational data. The quick fix for calibration is to upscale the emulator output by 40 percent.
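
For concreteness, here is a minimal R sketch of Art's reparameterisation and its inverse (the function names are ours):

    ## Map (AMP, PHZ) to the non-periodic coordinates (AP1, AP2), and back.
    ## PHZ has period 12, so the angle is pi * PHZ / 6.
    to_AP <- function(AMP, PHZ)
      list(AP1 = AMP * cos(pi * PHZ / 6), AP2 = AMP * sin(pi * PHZ / 6))
    from_AP <- function(AP1, AP2)
      list(AMP = sqrt(AP1^2 + AP2^2),
           PHZ = (6 / pi) * (atan2(AP2, AP1) %% (2 * pi)))  # back onto [0, 12)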

Back to the top.

Model evaluations

Design 1

Our first design was a thirty-evaluation maximin latin hypercube on three inputs: AMP, PHZ and EDN. Here is the design, in model-input units; the left-hand column denotes the job number.
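
A design of this type can be generated along the following lines (a sketch using the R package 'lhs'; whether the original design was produced this way is an assumption):

    ## Thirty-run maximin latin hypercube on [0,1]^3, scaled to the
    ## input ranges (EDN is handled on the log10 scale).
    library(lhs)
    set.seed(1)              # arbitrary, for reproducibility of the sketch
    U <- maximinLHS(30, 3)
    X <- data.frame(AMP = 36000 * U[, 1],
                    PHZ = 12 * U[, 2],
                    EDN = 10^(3 + U[, 3]))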

The results of these evaluations are available on the NCAR ftp server. They have been downloaded and processed for the statistical computing environment R. The R data files are

  1. drift01.RData
  2. magddr01.RData
These can be loaded or attached, e.g., attach('drift01.RData'). Each of these files contains dataframes for Observations, Outputs, and X. 'drift01.RData' also contains 'statloc', the locations of the sites.

In each case the Outputs dataframes represent the data in extensive form: one row per quantity, with index values for site, job, and local time. This is because the time-stepping for local time depends on the site; in other words, the output data in its original form does not have a completely rectangular structure in job, site, and local time.
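
To fix ideas, here is a sketch of pulling one series out of the extensive form (the column names of Outputs are assumptions based on the description above):

    ## Load the processed evaluations and extract one (job, site) series.
    load('magddr01.RData')                            # Observations, Outputs, X
    one <- subset(Outputs, site == 'HON' & job == 1)  # column names assumed
    one <- one[order(one$time), ]
    plot(one$time, one$D, type = 'l',
         xlab = 'local time', ylab = 'MAGDDR-D')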

Figures of the model evaluations and observations:
  1. MAGDDR-H: original, scaled, scaled & transformed
  2. MAGDDR-D: original, scaled, scaled & transformed
  3. MAGDDR-Z: original, scaled, scaled & transformed
  4. Drift: original
In each case the model outputs are shown by site (ordered by latitude, see the map), over (local) time. Each line is a model evaluation. Each dot is an observation. The 'original' pictures are as output from the model. The 'scaled' pictures are after scaling by 1.4. The 'scaled & transformed' pictures are after also applying a non-linear transformation (see Transformations below).

Back to the top.

General issues

Transformations

Inspecting the outputs from Design 1, there are clearly some interactions between time and the model-inputs. For example, the spread of the evaluations is wider for central values of time, which is when there are large excursions. Interactions are inconvenient, and we can use transformations of the model-outputs to try to reduce them.

After some experimentation, the transformation sign(x) log(1 + |x|) seems to be a reasonable choice for the series MAGDDR-D. Compare this picture with the one in the original units (both after scaling). Certainly there is a much more even dispersion of the outputs across times.
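
In R the transformation and its inverse can be written as:

    ## Signed-log transformation and its inverse; log1p and expm1 are
    ## the numerically stable forms of log(1 + .) and exp(.) - 1.
    tfm  <- function(x) sign(x) * log1p(abs(x))
    itfm <- function(y) sign(y) * expm1(abs(y))
    stopifnot(all.equal(itfm(tfm(c(-100, 0, 2.5))), c(-100, 0, 2.5)))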

Question: should we transform first and then regularise (see below, Functional outputs), or regularise and then transform? At the moment we regularise first. But a different number of knots, or different knot locations, may be appropriate for the transformed data. In the case of MAGDDR-D, it probably makes very little difference.

Functional outputs

How to handle the functional output? At each site and each setting of the model-parameters we get as output points on a periodic function of time in [0, 24]. We do not get the same points at each site: you can see this here. It is helpful if we can regularise the outputs. What seems to work well is fitting them with a periodic spline, and then summarising that spline by a set of knots at fixed locations.

As a first choice, how about nine equally-spaced knots? Nine knots seem to do OK: here is a random selection over jobs and sites for MAGDDR-D. Once the outputs for each combination of input and site have been summarised in this way, our output data is fully rectangular. We can think of it as indexed by job (30 in Design 1), by site (25 for MAGDDR-{H,D,Z}), and by time (9 knots).
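
A sketch of this regularisation step, using the periodic interpolating spline from the standard 'splines' package (whether this is exactly what our processing code does is left open):

    ## Fit a periodic spline (period 24) through one site's output
    ## points, then summarise it by its values at nine equally-spaced
    ## knots on [0, 24).
    library(splines)
    regularise <- function(time, value, nknots = 9) {
      ps <- periodicSpline(time, value, period = 24)
      knots <- seq(0, 24, length.out = nknots + 1)[-(nknots + 1)]
      predict(ps, knots)$y
    }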

Time regressors

When we construct our emulator, we are going to want regressors that describe the ways in which the model output varies over time. We have already reduced the outputs at any given site from a function over time to the values at nine knots over time. If our outputs are always recorded at the same nine knots, and our predictions always require the nine knots (which they do, because we want to predict the periodic function, which we will construct from the knots) then we only need to evaluate our time regressors at the knots. We can use a discrete basis, comprising a given number of nine-vectors.

One simple way to infer this basis is to use the SVD over the full dataset (PCA to statisticians, or EOFs to climate scientists). In other words, for our given series, say MAGDDR-D, we arrange our outputs into a 750 (that's 30 times 25) by 9 matrix, column-centred, and compute the SVD. The right singular vectors give us 9 empirical basis functions, each represented as nine knots on a periodic spline. Here is the result, for MAGDDR-D, shown as periodic splines fitted through the knots. Four or five basis functions seem to do a good job, accounting for nearly all the variation.

Note that we augment this basis with a constant vector. To preserve the orthonormality of the basis, this constant needs to be rep(sqrt(1/9), 9), which has unit length.
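
In R, the whole computation looks like this sketch, where Y is the 750 by 9 matrix of knot values (one row per job-site combination):

    ## Empirical time basis from the SVD of the column-centred matrix.
    Yc <- scale(Y, center = TRUE, scale = FALSE)
    sv <- svd(Yc)
    round(cumsum(sv$d^2) / sum(sv$d^2), 3)  # variation accounted for

    ## Keep the leading right singular vectors and prepend the
    ## unit-length constant vector.
    basis <- cbind(rep(sqrt(1/9), 9), sv$v[, 1:5])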

Our hope is that the MAGDDR-D output at any site and any model-inputs can be well-described by a linear combination of these basis vectors. We also hope that the coefficients in the linear combination vary smoothly over sites and model-inputs: see Emulators for more details.

Back to General issues.   Back to the top.

Single site analysis

For our first analysis, we consider emulating MAGDDR-D at a single site. We consider Honolulu (HON on the map), which is at latitude 22.

Single site emulator

See Emulators for some background.

For our input regressors we use linear, quadratic, and two-way interaction terms for AMP and EDN, expressed as Legendre polynomials on [-1, 1]. For PHZ we use four Fourier terms (two sines and two cosines). For the time regressors, we use the first five empirical basis vectors (explained above, under Time regressors). The regression coefficients are treated as exchangeable and independent, with prior mean zero and prior standard deviation 10. With this choice of prior standard deviation the emulator will behave much like a regression, although with smoother sample paths. Here is a comparison of the posterior means of the emulator coefficients and the OLS estimated coefficients, showing (i) that they are similar, and (ii) which are the large coefficients.
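
A sketch of the input-regressor construction (the orthonormal scalings are standard; the function names are ours):

    ## Orthonormal Legendre polynomials on [-1, 1] for AMP and EDN:
    leg1 <- function(x) sqrt(3) * x
    leg2 <- function(x) sqrt(5) * (3 * x^2 - 1) / 2

    ## Four orthonormal Fourier terms for PHZ (period 12):
    fourier_PHZ <- function(phz) {
      ang <- pi * phz / 6
      sqrt(2) * cbind(cos(ang), sin(ang), cos(2 * ang), sin(2 * ang))
    }

    ## For a design row, the AMP and EDN terms (linear, quadratic,
    ## two-way interaction) are combined with the PHZ terms, and then
    ## with the time basis, by outer products; see (Em3) below.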

We treat the prior residual covariance as stationary and separable in inputs and time. We treat the input correlation as componentwise separable, with a Gaussian correlation function and a correlation length of 1 (defined as the distance at which the correlation drops to 0.05) for AMP and EDN. We treat the correlation function of PHZ as circular, with a correlation angle of 0.75 on [-1, 1]. We treat the time correlation as circular, with a correlation angle of 5 on [0, 24]. All of these choices could probably do with a re-appraisal.

Here are the leave-one-out diagnostics. These show, for each evaluation in turn, the result of predicting the outcome of that evaluation using an emulator built from the other 29 evaluations. The predictions are shown as ten samples from the posterior emulator, in grey. The actual outcome is shown in black. The emulator appears to do a fairly good job. There are certain evaluations that are clearly hard to predict, Job 26 for example. Remember that the prior choices have not been tuned at all to the evaluations, so much better fits might be achievable in an Empirical Bayes approach.

Calibration

Calibration is updating our judgements about the 'best' input value using observations. For this we need a prior on the inputs, and a marginal distribution for the discrepancy between the model-output at the best input and the observations.

For the prior on the best input, we treat the three components as independent and uniform, subject to the transformation described in Model Inputs. Note that uniform is coherent for the periodic PHZ.

Inspecting the scaled and transformed MAGDDR-D data it looks as though the discrepancy ought to have a marginal standard deviation of about one, with a correlation angle of about 6 on [0, 24]. This is not the best way to assess the discrepancy variance!

With only three inputs we can evaluate the posterior density function on a grid, allowing us to normalise and compute the marginal densities using numerical integration. Here is the result of the calibration; the posterior expectation is shown in blue. AMP has been pulled upwards, and PHZ downwards; EDN is largely unconstrained by the observational data. Historical interest: the following comments applied before treating PHZ as a periodic input in the emulator and the inference. They show that treating PHZ as periodic has a large impact on the calibration. "AMP has been pushed to the top of its range, and PHZ to the bottom. EDN does not appear to have moved, although it has not shrunk back inside [-1, 1]. There are no substantial posterior correlations: the largest is -0.3 for AMP and PHZ."
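
Schematically, the grid computation runs as follows, where log_post is a hypothetical function returning the unnormalised log posterior density at one grid point (emulator, prior, and discrepancy combined):

    ## Posterior on a grid over the three (transformed) inputs.
    grid <- expand.grid(AMP = seq(-1, 1, length.out = 50),
                        PHZ = seq(-1, 1, length.out = 50),
                        EDN = seq(-1, 1, length.out = 50))
    lp   <- apply(grid, 1, log_post)      # log_post is hypothetical
    post <- exp(lp - max(lp))
    post <- post / sum(post)              # normalise over the grid

    E_post   <- colSums(grid * post)         # posterior expectation
    marg_AMP <- tapply(post, grid$AMP, sum)  # marginal for AMP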

Note that in this treatment the process of building an emulator and calibrating is not stochastic (e.g., there is no MCMC): the posterior distribution for the best input is exact, subject to the limitations of the integration grid. These calculations are effectively costless. To go from the raw data to the posterior density picture for Honolulu takes about 30 seconds. Here are the same pictures for Apia (API on the map):

  1. Canonical correlations
  2. Regression coefficients
  3. Leave-one-out
  4. Calibration
These results are qualitatively similar to those of Honolulu.

Back to Single site analysis.   Back to the top.

Emulators: A summary

Outline

An emulator is a statistical model of a deterministic function. If we think of our function as f(x) then an emulator predicts what the output will be when we evaluate the function at any x. Typically this prediction is in the form of a mean and a variance for f(x), or possibly a set of statistical parameters that can be mapped into a probability distribution function for f(x).

For us, the function f() is TIEGCM. For simplicity, let's focus on MAGDDR-D at a single site, say Honolulu (HON on the map). If we think of f() as a function with a scalar output, then we need two arguments: one is for the model-inputs, and one is to index the model-outputs by time. This is typical with emulators: we treat the function as scalar, and the arguments then become the model-inputs and the index variables. Thus for us, f(t, x) is the MAGDDR-D value at Honolulu at local time t, for model-inputs x. Therefore f(t, x) is just a number, but until we evaluate TIEGCM for Honolulu at input x that number is uncertain. We can use the emulator to make a prediction of this uncertain number. The mean from the emulator is our best-guess, and the standard deviation is a measure of our uncertainty.

Single-site emulator

We continue to focus on a single site (say, Honolulu). A simple way to construct an emulator is to attach uncertain coefficients to specified basis functions, or regressors. For the moment, fix x at some value. Then our emulator might be
f(t) = sum_i beta_i g_i(t) + u(t)   (Em1)
where the g_i() are specified regressors, the beta_i are uncertain coefficients, and u() is an uncertain residual. If we plug a known value of t into this emulator we get a distribution for f(t) that depends on the distributions we assign to the beta_i and to the residual.

Now we generalise to the case where x can also vary. The simplest way is to think of each beta_i as varying with x. This gives us
f(t, x) = sum_i beta_i(x) g_i(t) + u(t, x)   (Em2)
where we must also include the effect of x in the residual. We write each beta_i(x) as a function of x, also with regressors and unknown coefficients:
beta_i(x) = sum_j gamma_ij h_j(x)   (Beta1)
where the h_j() are the new regressors, and we might also include a residual. Putting (Beta1) back into (Em2), we end up with our emulator
f(t, x) = sum_i sum_j gamma_ij h_j(x) g_i(t) + u(t, x)   (Em3)
So the regressors in our final emulator of f(t, x) are the outer-product of regressors in x and regressors in t.

To build an emulator of f(t, x) we need to specify the regressors g() and h(), and specify the prior distribution of the uncertain quantities {gamma, u()}. Then we use the probability calculus to update the distribution of {gamma, u()} according to the outcome of model evaluations. In Design 1, for example, we have 30 evaluations. These are incorporated into our emulator through their impact on {gamma, u()}, and thus they influence the predictions we make for f(t, x) at values of (t, x) where we have not evaluated TIEGCM. At values where we have evaluated TIEGCM, the probability calculus ensures that our prediction is exactly correct, with no uncertainty. We say that the emulator interpolates the ensemble of evaluations with probability one.

Generalisation: We could go further, of course, and include site among the index variables. In that case our regressors would be outer-producted again, this time with regressors for locations on the surface of the earth.

Choosing the regressors

The choices we make for the regressors are very important. If we make good choices then a few evaluations of TIEGCM can very substantially reduce our uncertainty about the regression coefficients, the gammas, and so reduce our uncertainty about f(t, x) at untried values for x. Therefore physical insights are very important when choosing g() and h().

We also have some more immediate concerns. The TIEGCM output is periodic in t. We certainly want this to be a feature of our emulator predictions as well. In other words, we want Pr{f(0, x) = f(24, x)} to be one for all x. So a natural choice for g() would be terms from the Fourier series. However, if we inspect the model output, e.g. for MAGDDR-D, we see that the outputs would not be easy to duplicate using sums of sines and cosines. We would need a lot of g() regressors to do a good job. Parsimony is important, and we can adopt a more empirical approach. This is closely related to regularising the outputs in time, and the use of empirical basis vectors.

For the h() regressors, for the model-inputs, we must distinguish between AMP and EDN, which are non-periodic, and PHZ, which is periodic. For the non-periodic inputs we have no particular constraints. Our ranges for each input are bounded (see Design 1). Legendre polynomials are often useful here. These are specified for a scalar x in [-1, 1]. We can create regressors over a vector x by combining the scalar terms to create monomial terms, like L_1(x_1) L_0(x_2), which is effectively x_1. Often we transform some of the components of x before mapping onto [-1, 1]. For TIEGCM, we might transform EDN by taking the log10.

For the periodic input, PHZ, it is natural to use Fourier terms: unlike for the model outputs, there is no suggestion that these will not perform well, although we may need quite a few terms.

An attractive feature of using Legendre polynomials and Fourier terms is that they can be made orthonormal on the interval [-1, 1]. This makes it easier to specify a prior distribution for the gamma_ij coefficients for each i, because they are all on the same scale. It is easier still if we also arrange for the g() regressors to be orthonormal, because in this case all the regressors are on the same scale, and all the gamma_ij can be treated exchangeably and independently.

Note that if we start with a latin hypercube in the inputs (which we do), then the design matrix is approximately orthogonal: the entries of the cross-product of the design matrix approximate n times the pairwise inner-products of the regressors, which are zero or one by orthonormality. So the posterior emulator regression coefficients ought to be well-identified.

Specifying the residual

The residual is specified in terms of a prior correlation function, which is a function of (x, t) and (x', t'). We simplify by treating this as separable in inputs and time, hence
R((x, t), (x', t')) = R_x(x, x') R_t(t, t')   (Sep1)
We simplify further by treating R_x(x, x') as separable in the components of x:
R_x(x, x') = R_1(x_1, x'_1) R_2(x_2, x'_2) R_3(x_3, x'_3)   (Sep2)
where we have potentially different correlation functions for each component.

The correlation functions for AMP (1st input) and EDN (3rd input) can take some standard form, such as a Matern or a power exponential. One popular choice is the Gaussian (which is the power exponential with exponent 2, and also a limiting form of the Matern):
R_1(v, v') = exp{ -[(v - v') / theta_1]^2 }   (GausCor)
where theta_1 is a parameter that controls the correlation length (or 'range', in spatial statistics).
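
In R this is a one-liner:

    ## Gaussian correlation function, as in (GausCor).
    gauss_cor <- function(v, vp, theta) exp(-((v - vp) / theta)^2)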

The correlation function for PHZ needs to have periodic sample paths. So does the correlation function for time.

Circular correlation function

Here is a very simple way to construct a process with periodic sample paths. Imagine wrapping white noise around the perimeter of a circle. A given value v maps onto a point on this perimeter. We construct a stochastic quantity corresponding to v by integrating around the perimeter from v-a/2 to v+a/2, where a is the 'correlation angle'. Two values v and v' will be uncorrelated if they are separated by more than a. Otherwise, their correlation is 1-h/a, where h is the small angle between v and v'. Refer to this as the 'circular' correlation function.

For PHZ we can use a circular correlation function on [0, 12]. For time, we can use a circular correlation function on [0, 24].
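
A sketch in R, with the period passed in explicitly:

    ## Circular correlation function: values separated (around the
    ## circle) by more than the correlation angle a are uncorrelated;
    ## closer values have correlation 1 - h/a.
    circ_cor <- function(v, vp, a, period) {
      h <- abs(v - vp) %% period
      h <- pmin(h, period - h)    # small angle between v and v'
      pmax(1 - h / a, 0)
    }
    circ_cor(0, 23, a = 5, period = 24)   # time: 1 - 1/5 = 0.8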

Back to Emulators.   Back to the top.


This page last modified 10 April 2007.