Descriptions

Terminology

Below is a list of the most commonly used terms/abbreviations in PRISM and their meaning.

Active emulator system: An emulator system that has a data point assigned to it.
Active parameters: The set of model parameters that are considered to have significant influence on the output of the model and contribute at least one polynomial term to one/the regression function.
Adjusted expectation: The prior expectation of a parameter set, with the adjustment term taken into account. It is equal to the prior expectation if the emulator system has perfect accuracy.
Adjusted values: The adjusted expectation and variance values of a parameter set.
Adjusted variance: The prior variance of a parameter set, with the adjustment term taken into account. It is zero if the emulator system has perfect accuracy.
Adjustment term: The extra term (as determined by the BLA) that is added to the prior expectation and variance values, and that describes all additional correlation knowledge between model realization samples.
Analysis
Analyze: The process of evaluating a set of emulator evaluation samples in the last emulator iteration and determining which samples should be used to construct the next iteration.
BLA: Abbreviation of Bayes linear approach.
Construct
Construction: The process of calculating all necessary components to describe an iteration of the emulator.
Construction check: A list of keywords determining which components of which emulator systems are still required to finish the construction of a specified emulator iteration.
Controller
Controller rank: An MPI process that controls the flow of operations in PRISM and distributes work to all workers and itself. By default, a controller also behaves like a worker, although it is not identified as such.
Covariance matrix
Inverted covariance matrix: The (inverted) matrix of prior covariances between all model realization samples and themselves.
Covariance vector: The vector of prior covariances between all model realization samples and a given parameter set.
Data error: The \(1\sigma\)-confidence interval of a model comparison data point, often a measured/calculated observational error.
Data identifier
Data point identifier: The unique identifier of a model comparison data point, often a sequence of integers, floats and strings that describe the operations required to extract it.
Data point: A collection of all the details (value, error, space and identifier) about a specific model comparison data point that is used to constrain the model.
Data space
Data value space: The value space (linear, logarithmic or exponential) in which a model comparison data point is defined.
Data value: The value of a model comparison data point, often an observed/measured value.
Emulation method: The specific method (Gaussian, regression or both) that needs to be used to construct an emulator.
Emulator: The collection of all emulator systems together, provided by an Emulator object.
Emulator evaluation samples: The sample set (to be) used for evaluating the emulator.
Emulator iteration
Iteration: A single, specified step in the construction of the emulator.
Emulator system: The emulated version of a single model output/comparison data point in a single iteration.
Emulator type: The type of emulator that needs to be constructed. This is used to make sure different emulator types are not mixed together by accident.
Evaluate
Evaluation: The process of calculating the adjusted values of a parameter set in all emulator systems starting at the first iteration, determining the corresponding implausibility values and performing an implausibility check. This process is repeated in the next iteration if the check was successful and the requested iteration has not been reached.
External model realization set: A set of externally calculated and provided model realization samples and their outputs.
Frozen parameters
Frozen active parameters: The set of model parameters that, once considered active, will always stay active if possible.
FSLR: Abbreviation of forward stepwise linear regression.
Gaussian correlation length: The maximum distance between two values of a specific model parameter within which the Gaussian contribution to the correlation between the values is still significant.
Gaussian sigma: The standard deviation of the Gaussian function. It is not required if regression is used.
HDF5: Abbreviation of Hierarchical Data Format version 5.
Hybrid sampling: The process of performing a best parameter estimation of a model with MCMC sampling, while using its emulator as an additional Bayesian prior. This process is explained in Hybrid sampling.
Implausibility check
Implausibility cut-off check: The process of determining whether or not a given set of implausibility values satisfy the implausibility cut-offs of a specific emulator iteration.
Implausibility cut-offs: The maximum implausibility values an evaluated parameter set is allowed to generate, to be considered plausible in a specific emulator iteration.
Implausibility value
Univariate implausibility value: The minimum \(\sigma\)-confidence level (in standard deviations) at which the real model realization cannot explain the comparison data. It takes into account all variances associated with the parameter set, which are the observational variance (given by data_err), the adjusted emulator variance (adj_var) and the model discrepancy variance (md_var).
Implausibility wildcard: A maximum implausibility value, preceding the implausibility cut-offs, that is not taken into account during the implausibility cut-off check. It is denoted as \(0\) in provided implausibility parameters lists.
LHD: Abbreviation of Latin-Hypercube design.
Master file
Master HDF5 file: (Path to) The HDF5-file in which all important data about the currently loaded emulator is stored. A master file is usually accompanied by several emulator system (HDF5) files, which store emulator system specific data and are externally linked to the master file.
MCMC: Abbreviation of Markov chain Monte Carlo.
Mock data: The set of comparison data points that has been generated by evaluating the model for a random parameter set and perturbing the output by the model discrepancy variance.
Model: A black box that takes a parameter set, performs a sequence of operations and returns a unique collection of values corresponding to the provided parameter set.

Note

This is how PRISM ‘sees’ a model, not the commonly used definition of one.
2D model: A model that has/takes 2 model parameters.
2+D model
nD model: A model that has/takes more than 2 model parameters.
ModelLink
ModelLink subclass: The user-provided wrapper around the model that needs to be emulated, provided by a ModelLink object.
Model data: The set of all data points that are provided to a ModelLink subclass, to be used to constrain the model.
Model discrepancy variance: A user-defined value that includes all contributions to the overall variance on a model output that is created/caused by the model itself. More information on this can be found in Model discrepancy variance (md_var).
Model evaluation samples: The sample set (to be) used for evaluating the model.
Model output
Model outputs: The model output(s) corresponding to a single (set of) model realization/evaluation sample(s).
Model parameter
Model parameters: The (set of) details about every (all) degree(s)-of-freedom that a model has and whose value range(s) must be explored by the emulator.
Model realization samples: Same as model evaluation samples.
Model realizations
Model realization set: The combination of model realization/evaluation samples and their corresponding model outputs.
MPI: Abbreviation of Message Passing Interface.
MPI rank: An MPI process that is used by any PRISM operation, either being a controller or a worker.
MSE: Abbreviation of mean squared error.
OLS: Abbreviation of ordinary least-squares.
Parameter set
Sample: A single combination/set of model parameter values, used to evaluate the emulator/model once.
Passive parameters: The set of model parameters that are not considered active, and therefore are considered to not have a significant influence on the output of the model.
Pipeline
PRISM Pipeline: The main PRISM framework that orchestrates all operations, provided by a Pipeline object.
Plausible region: The region of model parameter space that still contains plausible samples.
Plausible samples: A subset of a set of emulator evaluation samples that satisfied the implausibility checks.
Polynomial order: The maximum order up to which polynomial terms are taken into account in all regression processes.
Potentially active parameters: A user-provided set of model parameters that are allowed to become active. Any model parameter that is not potentially active will never become active, even if it should.
PRISM: The acronym for Probabilistic Regression Instrument for Simulating Models. It is also a one-word description of what PRISM does (splitting up a model into individually emulated model outputs).
Prior covariance: The covariance value between two parameter sets as determined by an emulator system.
Prior expectation: The expectation value of a parameter set as determined by an emulator system, without taking the adjustment term (from the BLA) into account. It is a measure of how much information is captured by an emulator system. It is zero if regression is not used, as no information is captured.
Prior variance: The variance value of a parameter set as determined by an emulator system, without taking the adjustment term (from the BLA) into account.
Project
Projection: The process of analyzing a specific set of active parameters in an iteration to determine the correlation between the parameters.
Projection figure: The visual representation of a projection.
Regression: The process of determining the important polynomial terms of the active parameters and their coefficients, by using an FSLR algorithm.
Regression covariances: The covariances between all polynomial coefficients of the regression function. By default, they are not calculated, and they are empty if regression is not used.
Residual variance: The variance that has not been captured during the regression process. It is empty if regression is not used.
Root directory: (Path to) The directory/folder on the current machine in which all PRISM working directories are located. It also acts as the base for all relative paths.
Sample set
Evaluation set: A set of samples.
Worker
Worker rank: An MPI process that receives its calls/orders from a controller and performs the heavy-duty operations in PRISM.
Working directory: (Path to) The directory/folder on the current machine in which the PRISM master file, log-file and all projection figures of the currently loaded emulator are stored.
Worker mode: A mode initialized by worker_mode, where all workers are continuously listening for calls made by the controller rank and execute the received messages. This allows for serial codes to be combined more easily with PRISM. See Dual nature (normal/worker mode) for more information.
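
As a rough illustration of the univariate implausibility value defined above, the sketch below divides the distance between a data value and the adjusted expectation by the square root of the summed variances. The function name and argument layout are hypothetical and not part of the PRISM API. It also assumes that data_err is a symmetric \(1\sigma\) error (so that its square is the observational variance) and that md_var is already a variance.

```python
import math


def univariate_implausibility(data_val, data_err, adj_exp, adj_var, md_var):
    """Return |data_val - adj_exp| in units of the combined standard deviation."""
    # Assumption: data_err is a symmetric 1-sigma error, so its square is
    # the observational variance; adj_var and md_var are already variances
    total_var = data_err**2 + adj_var + md_var
    return abs(data_val - adj_exp) / math.sqrt(total_var)


# Example: the combined variance is 0.25 + 0.5 + 0.25 = 1.0
print(univariate_implausibility(data_val=1.0, data_err=0.5,
                                adj_exp=0.0, adj_var=0.5, md_var=0.25))  # 1.0
```

A larger adjusted variance lowers the implausibility, which is why a poorly constrained emulator system rules out fewer parameter sets.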

PRISM parameters

Below are descriptions of all the parameters that can be provided to PRISM in a text-file when initializing the Pipeline class (using the prism_par input argument).

Changed in version 1.1.2: Input argument prism_file was renamed to prism_par. A dictionary with PRISM parameters instead of a file can additionally be provided to the Pipeline class. All Pipeline parameters can also be changed by setting the corresponding class property.
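
A minimal sketch of the dictionary form is given below, using the parameter names and default values listed in this section. Whether omitted keys fall back to their defaults is not stated here, so the sketch spells all of them out. The commented-out Pipeline call and the name modellink_obj are assumptions for illustration only.

```python
# Defaults as listed in this section; every key is a PRISM parameter name
prism_par = {
    'n_sam_init': 500,
    'proj_res': 25,
    'proj_depth': 250,
    'base_eval_sam': 800,
    'sigma': 0.8,
    'l_corr': 0.3,
    'f_infl': 0.2,
    'impl_cut': [0.0, 4.0, 3.8, 3.5],
    'criterion': None,
    'method': 'full',
    'use_regr_cov': False,
    'poly_order': 3,
    'n_cross_val': 5,
    'do_active_anal': True,
    'freeze_active_par': True,
    'pot_active_par': None,
    'use_mock': False,
}

# Override only what differs from the defaults
prism_par.update({'n_sam_init': 250, 'method': 'gaussian'})

# Assumed usage (requires PRISM and a ModelLink instance, here given the
# hypothetical name modellink_obj):
# from prism import Pipeline
# pipe = Pipeline(modellink_obj, prism_par=prism_par)
```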

n_sam_init (Default: 500)

Number of model evaluation samples that is used to construct the first iteration of the emulator. This value must be a positive integer.

proj_res (Default: 25)

Number of emulator evaluation samples that is used to generate the grid for the projection figures (it defines the resolution of the projection). This value must be a positive integer.

proj_depth (Default: 250)

Number of emulator evaluation samples that is used to generate the samples in every projected grid point (it defines the accuracy/depth of the projection). This value must be a positive integer.

base_eval_sam (Default: 800)

Base number of emulator evaluation samples that is used to analyze an iteration of the emulator. It is multiplied by the iteration number and the number of model parameters to generate the true number of emulator evaluations, in order to ensure an increase in emulator accuracy. This value must be a positive integer.
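
The scaling described above is simple arithmetic and can be sketched as follows. The function name and the 3-parameter model are hypothetical and serve only to show how the sample count grows per iteration.

```python
def n_eval_samples(base_eval_sam, emul_i, n_par):
    """Number of emulator evaluation samples used to analyze iteration emul_i."""
    return base_eval_sam * emul_i * n_par


# Default base_eval_sam with a hypothetical 3-parameter model
for emul_i in (1, 2, 3):
    print(emul_i, n_eval_samples(800, emul_i, 3))
# 1 2400
# 2 4800
# 3 7200
```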

sigma (Default: 0.8)

The Gaussian sigma/standard deviation that is used when determining the Gaussian contribution to the overall emulator variance. This value is only required when method == ‘gaussian’, as the Gaussian sigma is obtained from the residual variance left after the regression optimization if regression is included. This value must be non-zero.

l_corr (Default: 0.3)

The normalized amplitude(s) of the Gaussian correlation length. This number is multiplied by the difference between the upper and lower value boundaries of the model parameters to obtain the Gaussian correlation length for every model parameter. This value must be positive, normalized and either a scalar or a list/dict of n_par scalars (where the values correspond to the sorted list of model parameters for the list).
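
The conversion from normalized amplitude to correlation length can be sketched as below. The function name, the par_ranges layout (name mapped to a lower/upper pair) and the example parameters are assumptions made for this illustration, not PRISM internals.

```python
def correlation_lengths(l_corr, par_ranges):
    """Scale normalized amplitude(s) by the width of every parameter range."""
    names = sorted(par_ranges)  # list input follows the sorted parameter names
    if isinstance(l_corr, dict):
        amplitudes = [l_corr[name] for name in names]
    elif isinstance(l_corr, (int, float)):
        amplitudes = [l_corr] * len(names)
    else:
        amplitudes = list(l_corr)

    return {name: amp * (par_ranges[name][1] - par_ranges[name][0])
            for name, amp in zip(names, amplitudes)}


# Hypothetical parameters 'A' in [1, 5] and 'B' in [0, 10]
par_ranges = {'A': (1.0, 5.0), 'B': (0.0, 10.0)}
print(correlation_lengths(0.3, par_ranges))         # same amplitude for both
print(correlation_lengths([0.3, 0.1], par_ranges))  # one amplitude per parameter
```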

f_infl (Default: 0.2)

New in version 1.2.2.

The residual variance inflation factor. The variance values for all known samples in an emulator iteration are inflated by this number multiplied by rsdl_var. This can be used to adjust for the underestimation of the emulator variance. Setting this to zero causes no variance inflation to be performed. This value must be non-negative.

impl_cut (Default: [0.0, 4.0, 3.8, 3.5])

A list of implausibility cut-off values that specifies the maximum implausibility values a parameter set is allowed to have to be considered ‘plausible’. A zero can be used as a filler value, either taking on the preceding value or acting as a wildcard if the preceding value is a wildcard or non-existent. Zeros are appended at the end of the list if the length is less than the number of comparison data points, while extra values are ignored if the length is more. Excluding zeros, this must be a list of positive values sorted in descending order.
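
The filler and wildcard rules above can be sketched as follows. Both function names are hypothetical, wildcards are represented by None, and the check assumes that implausibility values are ranked from highest to lowest and compared with a non-strict inequality. This is an illustration of the described rules, not PRISM's own implementation.

```python
def expand_impl_cut(impl_cut, n_data):
    """Expand a raw impl_cut list to exactly n_data entries.

    Wildcards are returned as None; filler zeros take the preceding value.
    """
    # Drop extra values, or append zeros if the list is too short
    raw = list(impl_cut[:n_data]) + [0.0] * max(0, n_data - len(impl_cut))

    cutoffs = []
    previous = None  # last non-zero cut-off seen so far
    for value in raw:
        if value != 0:
            previous = value
        # A zero with no preceding cut-off stays None (wildcard)
        cutoffs.append(previous)
    return cutoffs


def passes_impl_check(impl_values, cutoffs):
    """Compare implausibility values, highest first, against the cut-offs."""
    ranked = sorted(impl_values, reverse=True)
    return all(cut is None or value <= cut
               for value, cut in zip(ranked, cutoffs))


cutoffs = expand_impl_cut([0.0, 4.0, 3.8, 3.5], n_data=6)
print(cutoffs)  # [None, 4.0, 3.8, 3.5, 3.5, 3.5]
print(passes_impl_check([5.0, 3.9, 3.0, 2.0, 1.0, 0.5], cutoffs))  # True
print(passes_impl_check([5.0, 4.5, 3.0, 2.0, 1.0, 0.5], cutoffs))  # False
```

With the default list, the single leading zero means that the highest implausibility value of a parameter set is ignored, so one badly emulated data point cannot rule out a sample on its own.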

criterion (Default: None)

The criterion to use for determining the quality of the LHDs that are used, represented by an integer, float, string or None. This is the only parameter that is not used by PRISM itself; instead, it is used in the lhd()-function of the e13Tools package. By default, None is used.

method (Default: ‘full’)

The method to use for constructing the emulator. ‘gaussian’ will only include Gaussian processes (no regression), which is much faster, but also less accurate. ‘regression’ will only include regression processes (no Gaussian), which is more accurate than Gaussian only, but underestimates the emulator variance by multiple orders of magnitude. ‘full’ includes both Gaussian and regression processes, which is slower than Gaussian only, but by far the most accurate both in terms of expectation and variance values.

‘gaussian’ can be used for faster exploration especially for simple models. ‘regression’ should only be used when the polynomial representation of a model is important and enough model realizations are available. ‘full’ should be used by default.

Warning

When using PRISM on a model that can be described completely by the regression function (anything that has an analytical, polynomial form up to order poly_order, like a straight line or a quadratic function), use the ‘gaussian’ method. If including regression is unavoidable, n_sam_init and base_eval_sam must be set to very low values.

When regression is used on such a model, PRISM will be able to capture the behavior of the model perfectly given enough samples, in which case the residual (unexplained) variance, and therefore sigma, will be approximately zero. This can occasionally cause floating point errors when calculating emulator variances, which in turn can produce unexplainable artifacts in the adjustment terms and therefore incorrect answers.

Since PRISM’s purpose is to identify the characteristics of a model, it knows nothing about the model’s inner workings and therefore cannot detect such problems automatically.

use_regr_cov (Default: False)

Whether or not the regression variance should be taken into account for the variance calculations. The regression variance is the variance on the regression process itself and is only significant if a low number of model realizations (n_sam_init and base_eval_sam) is used to construct the emulator systems. Including it usually only has a very small effect on the overall variance value, while it can slow down the emulator evaluation rate by as much as a factor of 3. This value is not required if method == ‘gaussian’ and is automatically set to True if method == ‘regression’. This value must be a bool.

poly_order (Default: 3)

Up to which order all polynomial terms of all model parameters should be included in the active parameters and regression processes. This value is not required if method == ‘gaussian’ and do_active_anal is False. This value must be a positive integer.

n_cross_val (Default: 5)

Number of (k-fold) cross-validations that must be used for determining the quality of the active parameters analysis and regression process fits. If this parameter is zero, cross-validations are not used. This value is not required if method == ‘gaussian’ and do_active_anal is False. This value must be a non-negative integer and not equal to 1.

do_active_anal (Default: True)

Whether or not an active parameters analysis must be carried out for every iteration of every emulator system. If False, all potentially active parameters listed in pot_active_par will be active. This value must be a bool.

freeze_active_par (Default: True)

Whether or not active parameters should be frozen in their active state. If True, parameters that have been considered active in a previous iteration of an emulator system, will automatically be active again (and skip any active parameters analysis). This value must be a bool.

pot_active_par (Default: None)

A list of parameter names that indicate which parameters are potentially active. Potentially active parameters are the only parameters that will enter the active parameters analysis (or will all be active if do_active_anal is False). Therefore, all parameters not listed will never be considered active. If all parameters should be potentially active, then a None can be given. This must either be a list of parameter names or None.

use_mock (Default: False)

Whether or not mock data must be used as comparison data when constructing a new emulator. Mock data is calculated by evaluating the model for a specific set of parameter values and adding the model discrepancy variances as noise to the returned data values. If a set of parameter values is provided, that set is used; otherwise, a set is chosen randomly. When using mock data for an emulator, it is not possible to change the comparison data in later emulator iterations. This value must be a bool or a list/dict of n_par scalars (where the values correspond to the sorted list of model parameters for the list).
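
The idea behind mock data can be sketched as below. This is only an illustration under stated assumptions: the function name is hypothetical, md_var is taken to be one variance per data point, and the noise is drawn from a zero-mean normal distribution. PRISM's actual perturbation scheme may differ.

```python
import math
import random


def make_mock_data(model_output, md_var, seed=None):
    """Perturb every model output value with noise of variance md_var."""
    rng = random.Random(seed)  # seeded for reproducibility
    # Assumption: zero-mean Gaussian noise with standard deviation sqrt(md_var)
    return [value + rng.gauss(0.0, math.sqrt(var))
            for value, var in zip(model_output, md_var)]


# Hypothetical model output for one parameter set, three data points
model_output = [1.0, 2.5, 4.0]
mock_data = make_mock_data(model_output, md_var=[0.01, 0.01, 0.04], seed=42)
print(mock_data)
```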

HDF5

Whenever PRISM constructs an emulator, it automatically stores all the calculated data for it in an HDF5-file named 'prism.hdf5' in the designated working directory. This file contains all the data that is required in order to recreate all emulator systems that have been constructed for the emulator belonging to this run. If the Pipeline class is initialized by using an HDF5-file made by PRISM, it will load in this data and return a Pipeline object in the same state as described in the file.

Below is a short overview of all the data that can be found inside a PRISM master HDF5-file. HDF5-files can be viewed freely by the user using the HDFView application made available by The HDF Group.

The general file contains:

Attributes (11/12): Describe the general non-changeable properties of the emulator, which include:
- Emulator type and method;
- Gaussian parameters;
- Name of used ModelLink subclass;
- Used PRISM version;
- Regression parameters;
- Bools for using mock data or regression covariance;
- Mock data parameters if mock data was used.
Every emulator iteration has its own data group with the iteration number as its name. This data group stores all data/information specific to that iteration.

An iteration data group ('i') contains:

Attributes (9): Describe the general properties and results of this iteration, including:
- Active parameters for this emulator iteration;
- Implausibility cut-off parameters;
- Number of emulated data points, emulator systems, emulator evaluation samples, plausible samples and model realization samples;
- Bool stating whether this emulator iteration used an external model realization set.
'emul_n': The data group that contains all data for a specific emulator system in this iteration. The value of 'n' indicates which emulator system it is, not the data point. See below for its contents;
'emul_space': The boundaries of the hypercube that encloses the parameter space in which this iteration is defined. This is always equal to the plausible space of the previous iteration;
'impl_sam': The set of emulator evaluation samples that survived the implausibility checks and will be used to construct the next iteration;
'proj_hcube': The data group that contains all data for the (created) projections for this iteration, if at least one has been made. See below for its contents;
'sam_set': The set of model realization samples that were used to construct this iteration. In every iteration after the first, this is the 'impl_sam' of the previous iteration;
'statistics': An empty data set that stores several different types of statistics as its attributes, including:
- Size of the MPI communicator during various construction steps;
- Average evaluation rate/time of the emulator and model;
- Total time cost of most construction steps (note that this value may be incorrect if a construction was interrupted);
- Percentage of parameter space that is still plausible within the iteration.
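
When browsing an iteration data group programmatically, note that 'emul_space' shares its prefix with the 'emul_n' groups. The sketch below, with a hypothetical helper name, picks out only the emulator system groups. It works on any mapping that iterates over member names, which includes plain dictionaries and, assuming h5py is installed, h5py group objects.

```python
def emulator_system_names(iteration_group):
    """Return the names of all 'emul_n' members, sorted by their number n."""
    names = []
    for name in iteration_group:
        prefix, _, suffix = name.partition('_')
        # 'emul_space' is skipped because its suffix is not a number
        if prefix == 'emul' and suffix.isdigit():
            names.append(name)
    return sorted(names, key=lambda name: int(name.split('_')[1]))


# Stand-in for an iteration data group, using the member names listed above
iteration_group = {'emul_0': {}, 'emul_10': {}, 'emul_2': {},
                   'emul_space': None, 'impl_sam': None, 'sam_set': None,
                   'statistics': None}
print(emulator_system_names(iteration_group))
# ['emul_0', 'emul_2', 'emul_10']

# With h5py, the same function accepts the group of iteration 1:
# import h5py
# with h5py.File('prism.hdf5', 'r') as file:
#     print(emulator_system_names(file['1']))
```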

An emulator system data group ('i/emul_n') contains:

Attributes (7+): List the details about the model comparison data point used in this emulator system, including:
- Active parameters for this emulator system;
- Data errors, identifiers, value space and value;
- Regression score and residual variance if regression was used;
- The active and passive contributions to the residual variance (obtained from either the regression residual variance or the Gaussian sigma).
'cov_mat': The pre-calculated covariance matrix of all model evaluation samples in this emulator system. This data set is never used in PRISM and stored solely for user-convenience;
'cov_mat_inv': The pre-calculated inverse of 'cov_mat';
'exp_dot_term': The pre-calculated second expectation adjustment dot-term (\(\mathrm{Var}\left(D\right)^{-1}\cdot\left(D-\mathrm{E}(D)\right)\)) of all model evaluation samples in this emulator system;
'mod_set': The model outputs for the data point in this emulator system corresponding to the 'sam_set' used in this iteration;
'poly_coef' (if regression is used): The non-zero coefficients for the polynomial terms in the regression function in this emulator system;
'poly_coef_cov' (if regression is used and use_regr_cov is True): The covariances for all polynomial coefficients 'poly_coef';
'poly_idx' (if regression is used): The indices of the polynomial terms with non-zero coefficients if all active parameters are converted to polynomial terms;
'poly_powers' (if regression is used): The powers of the polynomial terms corresponding to 'poly_idx'. Both 'poly_idx' and 'poly_powers' are required since different methods of calculating the polynomial terms are used depending on the number of required terms and samples;
'prior_exp_sam_set': The pre-calculated prior expectation values of all model evaluation samples in this emulator system. This data set is also never used in PRISM.
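
To show where these pre-calculated data sets enter, the standard Bayes linear adjustment formulas are given below. The notation is an assumption made for this overview: \(f(x)\) is the emulated model output at parameter set \(x\), \(D\) is the vector of known model outputs and the subscript \(D\) marks adjusted quantities.

\[
\mathrm{E}_D\left(f(x)\right) = \mathrm{E}\left(f(x)\right) + \mathrm{Cov}\left(f(x), D\right)\cdot\mathrm{Var}\left(D\right)^{-1}\cdot\left(D-\mathrm{E}(D)\right)
\]

\[
\mathrm{Var}_D\left(f(x)\right) = \mathrm{Var}\left(f(x)\right) - \mathrm{Cov}\left(f(x), D\right)\cdot\mathrm{Var}\left(D\right)^{-1}\cdot\mathrm{Cov}\left(D, f(x)\right)
\]

In these terms, 'cov_mat_inv' holds \(\mathrm{Var}\left(D\right)^{-1}\) and 'exp_dot_term' holds the last two factors of the first equation, so only the covariance vector \(\mathrm{Cov}\left(f(x), D\right)\) and the prior values have to be computed for every evaluated parameter set.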

A projections data group ('i/proj_hcube') contains individual projection data groups ('i/proj_hcube/<name>'), which contain:

Attributes (4): List the general properties with which this projection was made, including:
- Implausibility cut-off parameters (they can differ from the iteration itself);
- Projection depth and resolution.
'impl_los': The calculated line-of-sight depth for all grid points in this projection;
'impl_min': The calculated minimum implausibility values for all grid points in this projection;
'proj_space': The boundaries of the hypercube that encloses the defined parameter space of this projection.