Local ensemble data assimilation in OOPS¶
The LocalEnsembleDA application is a generic application for running data assimilation with local ensemble transform Kalman filters (ETKFs). It can be extended to use any local ETKF that updates state gridpoints independently of each other, using observations within a localization distance of each gridpoint.
The configuration file (e.g. letkf.yaml) for running the LocalEnsembleDA application must contain the following top-level sections:
geometry: geometry for the background and analysis files.
background: ensemble background members (currently used both for computing \(\mathcal{H}(\mathbf{x})\) and as backgrounds).
observations: describes observations, observation error statistics, observation operators used in the assimilation, and the horizontal localization.
local ensemble DA: configuration of the local ensemble DA solver package.
driver: describes optional modifications to the behavior of the LocalEnsembleDA driver.
Output sections may have to be defined depending on the driver configuration. Further information can be found in the driver section below.
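Putting these together, a minimal skeleton of the top-level yaml layout might look like the following (section contents elided):

```yaml
geometry:
  ...                  # model geometry for background and analysis
background:
  ...                  # ensemble background members
observations:
  ...                  # obs spaces, obs errors, obs operators, localization
local ensemble DA:
  ...                  # solver configuration
driver:
  ...                  # optional driver behavior modifications
```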
Local ensemble DA configuration¶
The options available to configure the local ensemble DA configuration section are as follows.
solver,
use linear observer,
unperturbed obs ensemble member index.
These options are described in the following sections.
Local volume solvers¶
The specific type of local ensemble transform Kalman filter can be chosen with the solver option.
The following variants are available:
Stochastic LETKF: Stochastic Local Ensemble Transform Kalman filter.
Deterministic LETKF: Deterministic Local Ensemble Transform Kalman filter.
Stochastic GETKF: Stochastic Gain form of the Local Ensemble Transform Kalman filter.
Deterministic GETKF: Deterministic Gain form of the Local Ensemble Transform Kalman filter.
The algorithmic details of these filters can be found here.
Deterministic solver implementations
Two ETKF implementations are supported for the deterministic solvers:
C++ implementation using Eigen,
GSI Fortran implementation using LAPACK.
The Fortran implementation is used if the parameter fortran ETKF is set to true.
By default, this parameter is false for the LETKF solver and true for the GETKF solver.
For example, the C++ implementation of the deterministic GETKF solver can be selected with the following yaml snippet:
local ensemble DA:
  solver: Deterministic GETKF
  fortran ETKF: false
The stochastic solvers always use the Eigen solver implementation, so the fortran ETKF option has no effect.
GETKF solver options
Using either of the GETKF solvers also requires specifying parameters for the modulation product that approximates model-space localization in the vertical. The Gaspari-Cohn localization function [GC99] is used to modulate the covariance matrix of the vertical coordinate.
The following parameters must appear in the vertical localization subsection:
fraction of retained variance: fraction of the variance retained after the eigenspectrum of the vertical localization function is truncated. The fraction is expressed as a number between 0 and 1, with 1.0 retaining all eigenvectors and 0.0 retaining only the first eigenvector.
lengthscale units: name of the variable used for vertical localization. This is a model-dependent value. For example, the FV3 implementation currently supports two types of units: logp (logarithm of pressure at the mid level of the vertical column, with surface pressure set to 1e5 Pa at all points) and levels (indices of vertical levels).
lengthscale: localization distance in the above units, at which the Gaspari-Cohn localization function reaches zero.
An example of using the GETKF solver in FV3 is as follows:
local ensemble DA:
  solver: Deterministic GETKF
  vertical localization:
    fraction of retained variance: .95
    lengthscale units: logp
    lengthscale: 1.5
Unperturbed member¶
Both stochastic solvers perturb the observations according to the observation error covariance matrix, with an independent perturbation computed for each ensemble member. It is possible to leave one member unperturbed by setting the option unperturbed obs ensemble member index to the index of that member in the ensemble. The index can take values between \(0\) and \(N_e - 1\) inclusive, where \(N_e\) is the number of ensemble members.
This option has no effect if the Deterministic LETKF or Deterministic GETKF solvers are selected.
For example, to leave the ensemble member with index 6 unperturbed when running the stochastic GETKF solver, the following yaml configuration can be used:
local ensemble DA:
  solver: Stochastic GETKF
  unperturbed obs ensemble member index: 6
Obs Error Covariances in the ensemble solvers¶
Available R-matrix representations in the local solvers are model-dependent, with models generally either using obs error classes from UFO or implementing their own diagonal representation. The R-matrix representation is specified as follows for each obs space:
observations:
  observers:
  - obs space:
      name: radiosonde
      ...
    obs error:
      covariance model: diagonal
For models which use UFO R-matrices, an additional representation is currently supported, using cross-variable/inter-channel correlations. This is declared as follows:
obs error:
  covariance model: cross variable covariances
  input file: obserror_correlations.nc4
Here, input file is a file containing cross-variable or inter-channel correlations. See Observation error covariance with cross-variable (cross-channel) correlations for more information, including the required format for input file. As with the diagonal R-matrix, the observation error standard deviations are read as an ObsError group from the observation file.
Observation-space localization supported in the ensemble solvers¶
Observation-space \(R\)-localization is used in all local solvers, following [FSH+24]. The localization procedure is applied independently at each model grid point in turn.
The obs localizations section specifies a sequence of localization functions.
Each localization function in the sequence does the following:
1. Select observations within a chosen distance of the model grid point,
2. Optionally compute weights for the observation error covariances according to a distance-dependent function. By default the weights are equal to one everywhere.
3. Return a vector of weights of length equal to the ObsSpace.
The selection of observations is performed with a KD-tree distance search.
At each model grid point a vector of final localization weights (with length equal to the ObsSpace) is initialized to contain ones. This vector is then multiplied (element-wise) by the weights returned by each localization function in the sequence. (In other words, we assume that the localizations can be applied in any order.) The final weights are then used to scale the observation error covariance matrix.
The sequence of localizations is applied separately to each obs space.
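As a sketch of this multiplicative combination, a sequence containing both a horizontal and a vertical localization function could be declared as follows; the final weight for each observation is the element-wise product of the weights returned by the two functions (option names follow the examples elsewhere in this section):

```yaml
observations:
  observers:
  - obs space:
      name: radiosonde
      ...
    obs localizations:
    - localization method: Horizontal Gaspari-Cohn   # first factor of the weights
      lengthscale: 1000e3
    - localization method: Vertical localization     # second factor of the weights
      localization function: Gaspari Cohn
      ioda vertical coordinate group: MetaData
      ioda vertical coordinate: height
      vertical lengthscale: 6e3
```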
There are two main types of observation-space localization function: those that act horizontally, and those that act vertically. Observation-space horizontal localization can be used for both the LETKF and GETKF solvers, whereas observation-space vertical localization should only be used for the LETKF solvers. Further discussion of this point can be found here.
If using observation-space vertical localization for the LETKF solvers, the 3D geometry iterator must be enabled to carry model height information into the vertical localization routines. The geometry iterator is model-specific and the yaml syntax varies between models. In the FV3 model the 3D iterator can be enabled as follows:
geometry:
  iterator dimension: 3
The following horizontal localization methods are available in UFO:

Gaspari-Cohn,
SOAR (second-order autoregressive),
Box-car (weights set to 1 inside the localization distance, 0 outside).
Additional horizontal localizations are supported in SOCA (Rossby radius based) and FV3-JEDI (soil-specific localization).
An example of horizontal localization using the Gaspari-Cohn function is:
observations:
  observers:
  - obs space:
      name: radiosonde
      ...
    obs localizations:
    - localization method: Horizontal Gaspari-Cohn  # inflate errors with Gaspari-Cohn function, based on the
                                                    # horizontal distance from the updated grid point
      lengthscale: 1000e3                           # horizontal localization distance in meters
The Gaspari-Cohn, SOAR, and Box Car methods are also supported for vertical localization. An example of vertical localization using the Gaspari-Cohn function is:
observations:
  observers:
  - obs space:
      name: radiosonde
      ...
    obs localizations:
    - localization method: Vertical localization    # As above but for vertical localization
      localization function: Gaspari Cohn           # Function for vertical localization
      ioda vertical coordinate group: MetaData      # Group containing the below vertical coordinate
      ioda vertical coordinate: height              # Name of UFO variable storing the vertical coordinate
                                                    # of the observation locations
      vertical lengthscale: 6e3                     # vertical localization distance in units of given coord
Inflation supported in the ensemble solvers¶
Several covariance inflation methods are supported:
Multiplicative prior inflation:
The multiplicative inflation parameter is controlled by the inflation.mult configuration value, for example:

local ensemble DA:
  inflation:
    mult: 1.1
RTPP (relaxation to prior perturbation) [ZSS04]:
The RTPP inflation parameter is controlled by the inflation.rtpp configuration value, for example:

local ensemble DA:
  inflation:
    rtpp: 0.5
RTPS (relaxation to prior spread) [WH12]:
The RTPS inflation parameter is controlled by the inflation.rtps configuration value, for example:

local ensemble DA:
  inflation:
    rtps: 0.6
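Assuming the inflation options can be combined under a single inflation block (a reasonable reading of the configuration layout above, not confirmed here), a combined snippet would look like:

```yaml
local ensemble DA:
  inflation:
    mult: 1.1   # multiplicative prior inflation
    rtpp: 0.5   # relaxation to prior perturbation
    rtps: 0.6   # relaxation to prior spread
```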
Cross validation supported in the ensemble solvers¶
In order to use cross validation (Buehner, 2020), one can specify:
local ensemble DA:
  cross validation:
    number of subensembles: 5
This makes use of a SubensembleSplitter class to split the ensemble of \(N_{e}\) members into \(N_{sub}\) subensembles. The value of \(N_{sub}\) is assigned using the mandatory number of subensembles configuration value. The value of number of subensembles must be between 2 and the number of ensemble members \(N_{e}\), and must cleanly divide the number of ensemble members \(N_{e}\).
Currently there are two splitting methods implemented to split the ensemble. The contiguous split divides the ensemble into \(N_{sub}\) subensembles evenly, with the first \(N_{e}/N_{sub}\) members belonging to subensemble 1, the second \(N_{e}/N_{sub}\) members belonging to subensemble 2, etc.
The contiguous split happens by default if no splitting method is specified, but if one wishes to explicitly split via this method:
local ensemble DA:
  cross validation:
    number of subensembles: 5
    splitting method: Contiguous # default value
Alternatively, one can split the ensemble randomly (with an optional seed):
local ensemble DA:
  cross validation:
    number of subensembles: 5
    splitting method: Random
    random seed: 0 # default value
This method also splits the ensemble into \(N_{sub}\) subensembles evenly, except each ensemble member will be assigned to a subensemble at random.
Note that when a modulated ensemble is used with the stochastic GETKF solver, no further options need to be specified: the entire modulated ensemble is split evenly automatically. Cross validation for the deterministic GETKF solver is not yet supported.
Note about obs distributions¶
Currently, Local Ensemble DA supports the InefficientDistribution and Halo obs distributions. When the InefficientDistribution distribution is used, all observations and \(\mathcal{H}(\mathbf{x})\) are replicated on all MPI ranks. When the Halo distribution is used, each rank stores only the observations it needs. The Halo distribution allows for more efficient memory management than distribution.name: InefficientDistribution, but at the expense of potentially poor load balancing compared to distribution.name: RoundRobin. For an optimal combination of memory use and load balancing, we developed an option to run Local Ensemble DA in observer-only mode with distribution.name: RoundRobin. The ensemble of \(\mathcal{H}(\mathbf{x})\) can then be read from disk using driver.read HX from disk == true, the distribution.name: Halo obs distribution, and driver.do posterior observer == false.
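A sketch of this two-step workflow, using only the driver and distribution options described in this document (each step is a separate execution with its own configuration file; other sections and file names elided):

```yaml
# Step 1: observer-only run with the load-balanced RoundRobin distribution
driver:
  run as observer only: true
observations:
  observers:
  - obs space:
      distribution:
        name: RoundRobin
---
# Step 2: solver run reading the saved H(x) from disk with the
# memory-efficient Halo distribution
driver:
  read HX from disk: true
  do posterior observer: false
observations:
  observers:
  - obs space:
      distribution:
        name: Halo
        halo size: 5000e3
```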
The type of the obs. distribution is specified for each ObsSpace. For example:
observations:
  observers:
  - obs space:
      distribution:
        name: Halo
        halo size: 5000e3
Note on QC masking¶
When the solvers are run, quality control (QC) flags are applied to different data objects during the calculation of the background ensemble and the background ensemble mean in observation space, \(\mathcal{H}(\mathbf{X})\) and \(\mathcal{H}(\overline{\mathbf{x}})\), respectively. Information about which observations are flagged by quality control is stored in a single object: a vector of inverse observation variances in which observations failing QC are masked out as missing.
The inverse variance vector has its elements initially masked out to reflect the QC done using the ensemble mean (calculated during \(\mathcal{H}(\overline{\mathbf{x}})\)). When \(\mathcal{H}\) is non-linear, the inverse variance vector’s elements are then recursively updated as \(\mathcal{H}\) is applied to each ensemble member and new observations fail the respective QC (otherwise, when \(\mathcal{H}\) has been linearized, the QC is only gathered from the background ensemble mean). The vector is finally updated with information on which observations are missing from the observation vector.
The values masked out of the finalized inverse variance vector are used to mask out the corresponding values in the innovations and in the background ensemble member perturbations in observation space, \(\mathbf{Y}\), which are missing or have failed QC, ensuring consistent missing values across ensemble members and observation vectors in subsequent calculations.
Driver configuration¶
Read \(\mathcal{H}(\mathbf{x})\) from disk instead of computing it at run-time.
driver:
  read HX from disk: false # default value
Run LocalEnsembleDA to compute \(\mathcal{H}(\mathbf{x})\) and save it to disk without performing any assimilation. This works hand-in-hand with read HX from disk. One might choose to separate this into two steps because the more efficient round-robin distribution can be used when run as observer only is true.

driver:
  run as observer only: false # default value
Compute the posterior observer and output test prints for the O-A statistics. One might choose to set this flag to false in order to speed up completion of the LocalEnsembleDA solver.

driver:
  do posterior observer: true # default value
Save the posterior mean. Requires a top-level output section to be defined in the yaml file. The contents of this section must adhere to the DataSetBase::write() syntax, including any model-specific options.

driver:
  save posterior mean: false # default value
Save the posterior ensemble. Requires a top-level output section to be defined in the yaml file. The contents of this section must adhere to the DataSetBase::write() syntax, including any model-specific options.

driver:
  save posterior ensemble: true # default value
Save the prior mean. Requires a top-level output mean prior section to be defined in the yaml file. The contents of this section must adhere to the DataSetBase::write() syntax, including any model-specific options.

driver:
  save prior mean: false # default value
Save the posterior mean increment. Requires a top-level output increment section to be defined in the yaml file. The contents of this section must adhere to the DataSetBase::write() syntax, including any model-specific options.

driver:
  save posterior mean increment: false # default value
Save the prior variance. Requires a top-level output variance prior section to be defined in the yaml file. The contents of this section must adhere to the DataSetBase::write() syntax, including any model-specific options.

driver:
  save prior variance: false # default value
Save the posterior variance. Requires a top-level output variance posterior section to be defined in the yaml file. The contents of this section must adhere to the DataSetBase::write() syntax, including any model-specific options.

driver:
  save posterior variance: false # default value
This option is needed for the Halo distribution to work properly. In that case, each MPI rank must know the center of the local grid patch (i.e. the region specified in the model domain decomposition) and the radius of the circle (centered on the center of the grid patch) that encloses all points on the local grid. If not using the Halo distribution, or when using models that do not implement model domain decomposition (e.g. L95), one might choose not to update the obs config by setting update obs config with geometry info: false.

driver:
  update obs config with geometry info: true # default value
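For reference, the driver options described above can be collected into a single snippet showing their default values:

```yaml
driver:
  read HX from disk: false
  run as observer only: false
  do posterior observer: true
  save posterior mean: false
  save posterior ensemble: true
  save prior mean: false
  save posterior mean increment: false
  save prior variance: false
  save posterior variance: false
  update obs config with geometry info: true
```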