ModelLink: A crash course¶
Writing a ModelLink subclass¶
In Minimal example, a description is given of how to initialize the Pipeline
class using a default ModelLink
subclass.
Here, the basic steps for making a custom ModelLink
subclass are shown.
# -*- coding: utf-8 -*-
# Future imports
from __future__ import absolute_import, division, print_function
# Package imports
import numpy as np
# PRISM imports
from prism.modellink import ModelLink
# ExampleLink class definition
class ExampleLink(ModelLink):
def __init__(self, *args, **kwargs):
# Perform any custom operations here
pass
# Call ModelLink's __init__()
super(ExampleLink, self).__init__(*args, **kwargs)
def get_default_model_parameters(self):
# Define default parameters (not required)
par_dict = {}
return(par_dict)
def get_default_model_data(self):
# Define default data (not required)
data_dict = {}
return(data_dict)
# Override call_model method
def call_model(self, emul_i, par_set, data_idx):
# Perform operations for obtaining the model output
# Following is provided:
# 'emul_i': Requested iteration
# 'par_set': Requested sample(s)
# 'data_idx': Requested data point(s)
pass
# Override get_md_var method
def get_md_var(self, emul_i, par_set, data_idx):
# Perform operations for obtaining the model discrepancy variance
# Following is provided:
# 'emul_i': Requested iteration
# 'par_set': Requested sample
# 'data_idx': Requested data point(s)
pass
In the example_link.py file above, a minimal example of a ModelLink
subclass is shown.
It has two abstract methods that need to be overridden; call_model()
(wrapper function for calling the model) and get_md_var()
(calculates the model discrepancy variance).
A ModelLink
subclass cannot be initialized if either method has not been overridden.
Given the importance of both methods, detailed descriptions are given in Wrapping a model (call_model) and Model discrepancy variance (md_var), respectively.
Since every model is different, with some requiring preparations in order to work properly, the __init__()
constructor method may be extended to include any custom code to be executed when the subclass is initialized.
The superclass version of the __init__()
method must always be called, as it sets several important flags and properties, but the time at which this is done does not matter.
During the initialization of the Emulator
class, it is checked whether or not the superclass constructor of a provided ModelLink
instance was called (to avoid this from being forgotten).
Besides executing custom code, three properties/flags can be set in __init__()
, which have the following default values if the extended constructor does not set them:
self.name = self.__class__.__name__ # Set instance name to the name of the class
self.call_type = 'single' # Request single model calls
self.MPI_call = False # Request only controller calls
The first property, name
, defines the name of the ModelLink
instance.
This name is used by the Emulator
class during initialization to check if a constructed emulator is linked to the proper ModelLink
instance, in order to avoid causing mismatches.
If one wants to use the same ModelLink
subclass for different models (like, using different parameter spaces), it is recommended to add an identifier for this to this name.
An example of this can be found in the definition of the GaussianLink
class, which adds the number of Gaussians in the model to its name
property.
The other two properties, call_type
and MPI_call
, are flags that tell PRISM how the call_model()
method should be used.
By default, every model evaluation sample is requested individually in serial, since this would be the most expected behavior.
However, this is most likely not enough for sophisticated models, as they can require some preparation (e.g., having to read in data files) or more than a single core (in MPI) to function.
Therefore, call_type
can be set to accept solely individual samples (‘single’
), solely entire sample sets (‘multi’
) or both (‘hybrid’
).
In the same way, MPI_call
can be set to True
or False
to identify that the model needs to be executed in serial or in MPI.
Note
If a model uses OpenMP parallelization, it is recommended to set MPI_call
to False
in the ModelLink
subclass.
This allows for all worker ranks to be used in OpenMP threads, while only the controller rank calls the model.
Finally, the ModelLink
class has three methods that can be overridden for adding utility to the class (of which two are shown in example_link.py).
The get_default_model_parameters()
and get_default_model_data()
methods return dictionaries containing the default model parameters and model data to use in this class instance, respectively.
By overriding these methods, one can hard-code the use of specific parameters or comparison data, avoiding having to provide them when initializing the ModelLink
subclass.
Additionally, if a default parameter or data point is also provided during initialization, the provided information will override the defaults.
Example
The GaussianLink
class has default parameters defined:
>>> from prism.modellink import GaussianLink
>>> model_data = {3: [3.0, 0.1]}
>>> modellink_obj = GaussianLink(model_data=model_data)
>>> modellink_obj
GaussianLink(model_parameters={'A1': [1.0, 10.0, 5.0], 'B1': [0.0, 10.0, 5.0],
'C1': [0.0, 5.0, 2.0]},
model_data={3: [3.0, 0.1]})
Providing a custom set of parameters will override the coded defaults:
>>> model_parameters = {'A1': [-5, 7, 2]}
>>> modellink_obj = GaussianLink(model_parameters=model_parameters, model_data=model_data)
>>> modellink_obj
GaussianLink(model_parameters={'A1': [-5.0, 7.0, 2.0], 'B1': [0.0, 10.0, 5.0],
'C1': [0.0, 5.0, 2.0]},
model_data={3: [3.0, 0.1]})
The third method, get_str_repr()
, is a simple function that returns a list containing the representations of all non-default input arguments the ModelLink
subclass takes.
It can be overridden to add the missing input arguments to the full representation of the class, which is automatically called whenever the representation is requested.
The GaussianLink
class overrides this method to add its n_gaussians
input argument.
# -*- coding: utf-8 -*-
# Future imports
from __future__ import absolute_import, division, print_function
# Package imports
import numpy as np
# PRISM imports
from prism.modellink import ModelLink
# LineLink class definition
class LineLink(ModelLink):
def __init__(self, *args, **kwargs):
# No custom operations required
pass
# Call ModelLink's __init__()
super(LineLink, self).__init__(*args, **kwargs)
def get_default_model_parameters(self):
# Define default parameters (not required)
par_dict = {
'A': [-10, 10, 3], # Intercept in [-10, 10] with estimate of 3
'B': [0, 5, 1.5]} # Slope in [0, 5] with estimate of 1.5
return(par_dict)
def get_default_model_data(self):
# Define default data (not required)
data_dict = {
1: [4.5, 0.1], # f(1) = 4.5 +- 0.1
2.5: [6.8, 0.1], # f(2.5) = 6.8 +- 0.1
-2: [0, 0.1]} # f(-2) = 0 +- 0.1
return(data_dict)
# Override call_model method
def call_model(self, emul_i, par_set, data_idx):
# Calculate the value on a straight line for requested data_idx
vals = par_set['A']+np.array(data_idx)*par_set['B']
return(vals)
# Override get_md_var method
def get_md_var(self, emul_i, par_set, data_idx):
# Calculate the model discrepancy variance
# For a straight line, this value can be set to a constant
return(1e-4*np.ones_like(data_idx))
Using all the information above and the template given in example_link.py, a ModelLink
subclass can be written for a straight line model, shown in the line_link.py file above.
Here, all methods discussed before (besides the get_str_repr()
method, since no additional input arguments are used) have been overridden.
Given that this model is very simple, no changes have been made to the instance constructor, __init__()
.
Therefore, only single evaluation samples in serial are requested.
PRISM provides the test_subclass()
function that allows the user to check if a ModelLink
subclass is properly written.
It returns an instance of the subclass if the test passes, or raises a specific error if not.
We can use this function to initialize our newly written subclass:
>>> from line_link import LineLink
>>> from prism.modellink import test_subclass
>>> modellink_obj = test_subclass(LineLink)
>>> modellink_obj
LineLink(model_parameters={'A': [-10.0, 10.0, 3.0], 'B': [0.0, 5.0, 1.5]},
model_data={2.5: [6.8, 0.1], -2: [0.0, 0.1], 1: [4.5, 0.1]})
Since no errors were raised, we can now use the initialized ModelLink
subclass to initialize the Pipeline
class:
>>> from prism import Pipeline
>>> pipe = Pipeline(modellink_obj)
Data identifiers (data_idx)¶
The comparison data points that are given to the ModelLink
class each require a unique data point identifier, allowing PRISM to distinguish between them.
This data identifier (called data_idx
) can be used by the model wrapped in the call_model()
method as a description of how to calculate/extract the data point.
It can be provided as a non-mutable sequence (a Python tuple) of a combination of integers, floats and strings, each element describing a part of the operations required.
The data identifier sequence can be of any length, and the length can differ between data points.
Note
If a data identifier is given as a single element, then the identifier is saved as that single element instead of a tuple.
For example, data_idx = [(1), (2), (3, 4), …]
would be saved as data_idx = [1, 2, (3, 4), …]
.
In its simplest form, the data identifier is a single value that is given to a function \(f(x)\), which is a function that is defined for a given model parameter set and returns the function value belonging to the input \(x\).
This is the way the data identifier works for the two standard ModelLink
subclasses, the SineWaveLink
and GaussianLink
classes.
It is also used in the LineLink
class described in the line_link.py file above.
For more sophisticated models, a single value/element is not enough to uniquely identify a data point.
A simple example of this would be if the model generates a two-dimensional array of values, where one specific value needs to be returned.
Then, the data identifier can be given as a tuple of two integers, like data_idx = [(1, 1), (4, 8), …]
.
In the case that the model also generates several two-dimensional arrays which are named, an extra string could be used to identify this array first: data_idx = [(‘array1’, 1, 1), (‘array4’, 4, 8), …]
.
An even more complex example is when a data point needs to be retrieved from a specific named data set at a certain point in a model simulation, after which an operation needs to be carried out (like, making a histogram of the results) and the resulting data point is then found at a specific value in that histogram.
The histogram here might only be necessary to make for specific data sets, while different operations are required for others.
PRISM allows for such complex data identifiers to be given, as it treats every sequence of data identifier elements as separated.
Two different data identifiers working as described above can for example be written as data_idx = [(14, ‘array1’, ‘histogram’, 7.5), (17, ‘array7’, ‘average’), …]
, where the first data point requires an extra (float) value for the histogram and the second does not.
In order to do this, one would of course be required to make sure that the call_model()
method can perform these operations when provided with the proper data identifier.
Wrapping a model (call_model)¶
The call_model()
method is the most important method in the entire PRISM package.
It provides the Pipeline
instance with a way to call the model that is wrapped in the user-defined ModelLink
subclass.
For PRISM, this method is a black box: it takes a parameter/sample set, performs a series of unknown operations and returns the values corresponding to the requested data points and sample(s).
Therefore, the call_model()
method must be written with great care.
Input arguments¶
Depending on the values of the multi_call
and MPI_call
flags (where the first is set by the call_type
flag), the Pipeline
instance will use the call_model()
method differently.
As explained in Writing a ModelLink subclass, every model evaluation sample is requested individually in serial by default, which corresponds to multi_call
is False
and MPI_call
is False
.
When single-calling a model, PRISM expects an array-like container back with shape (n_data)
.
If we assume that we have an instance of the LineLink
class (introduced in line_link.py) called modellink_obj
and want to evaluate the model three times for all data points, then the model would be called as (solely by the controller rank):
# Get emul_i, sam_set and data_idx
emul_i = 1
sam_set = np.random.rand(3, modellink_obj.n_par)
data_idx = modellink_obj.data_idx
# Evaluate model
mod_set = np.zeros([sam_set.shape[0], len(data_idx)])
for i, par_set in enumerate(sam_set):
par_dict = sdict(zip(modellink_obj.par_name, par_set))
mod_set[i] = modellink_obj.call_model(emul_i=emul_i,
par_set=par_dict,
data_idx=data_idx)
Here, we looped through the entire sample set one-by-one, converted every individual sample to a (sorted) dict and called the model with it.
The emulator iteration is given as a normal integer and the data identifiers data_idx
is provided as a list of individual data identifiers (which are either single elements or tuples of elements, as described in Data identifiers (data_idx)).
The requested data identifiers are not necessarily the same as those given in data_idx
.
An individual sample provided in this way will be of the form:
par_dict = {'par_name_1': par_1_val,
'par_name_2': par_2_val,
...,
'par_name_n': par_n_val}
An example of this would be par_dict = {‘A’: 1.0, ‘B’: 2.0}
for the LineLink
class.
This works very well for models that do not require any preparation before they can start evaluating and requires a minimal amount of effort to implement.
However, if the sample set is very large, then evaluating the model in this fashion can be inefficient due to many memory look-ups.
Therefore, the GaussianLink
class accepts both single and multi-calls.
When multi-calling a model, PRISM expects an array-like container back with shape (n_sam, n_data)
.
So, if we use the same example again, but this time have an instance of the GaussianLink
class with multi_call
is True
, then the model would be called as (again solely by the controller rank):
# Get emul_i, sam_set and data_idx
emul_i = 1
sam_set = np.random.rand(3, modellink_obj.n_par)
data_idx = modellink_obj.data_idx
# Evaluate model
sam_dict = sdict(zip(modellink_obj.par_name, sam_set.T))
mod_set = modellink_obj.call_model(emul_i=emul_i,
par_set=sam_dict,
data_idx=data_idx)
This call is roughly the same as before, but this time the entire sample set is provided as a (sorted) dict instead of individual samples. The lay-out of this sample dict is of the form:
sam_dict = {'par_name_1': [par_1_val_1, par_1_val_2, ..., par_1_val_m],
'par_name_2': [par_2_val_1, par_2_val_2, ..., par_2_val_m],
...,
'par_name_n': [par_n_val_1, par_n_val_2, ..., par_n_val_m]}
Again, in the case of the GaussianLink
class, this sample dict could look like sam_dict = {‘A1’: [1.0, 5.5, 10.0], ‘B1’: [0.0, 5.0, 10.0], ‘C1’: [0.0, 2.5, 5.0]}
.
This can be used when the model requires some kind of preparation before being able to perform evaluations, or when it is simply more efficient to provide all requested samples at once (like for the GaussianLink
class).
Note
If a model uses OpenMP parallelization, it is recommended to set MPI_call
to False
in the ModelLink
subclass.
This allows for all worker ranks to be used in OpenMP threads, while only the controller rank calls the model.
Note
If one wishes to transform the received sam_dict
back into a normal NumPy array of shape (n_sam, n_par)
, this can be done quite easily by executing sam_set = np.array(par_set.values()).T
, where par_set
is the sam_dict
provided to the call_model()
method.
Keep in mind that doing so means that the columns are sorted on the names of the model parameters.
When the MPI_call
flag is set to True
, the calls to the call_model()
method are almost the same as described above.
The only difference is that all ranks call the method (each providing the same emul_i
, par_dict
/sam_dict
and data_idx
) instead of just the controller rank.
Multi-calling¶
When the multi_call
flag is set to False
, the call_model()
method is most likely nothing more than a simple function.
But, when multi_call
is set to True
, call_model()
can be a lot more complex.
An example of this would be if we tried to make an emulator of an emulator (which is possible, but completely pointless).
In this case, it would be necessary for the “model” (as we are going to call the emulated emulator from now on) to be loaded into memory first before it can be evaluated.
Although loading an emulator into memory usually does not take that long, we do not want to do this for every single “model” evaluation.
Besides, evaluating an emulator is much quicker when all samples are evaluated at once (due to the way the _evaluate_sam_set()
method is written).
So, therefore, it is necessary to use multi_call
is True
for this “model”.
If we assume that we have already made an emulator of the LineLink
class, then, the call_model()
method could be written as:
def call_model(self, emul_i, par_set, data_idx):
# Initialize Pipeline object as a model
modellink_obj = LineLink()
pipe_model = Pipeline(modellink_obj, working_dir='linelink_0')
# Call pipe_model
mod_set = pipe_model.evaluate(par_set, emul_i)['adj_exp_val']
# Make sure only the requested data points are kept
req_idx = [pipe_model.emulator._data_idx[emul_i].index(idx) for idx in data_idx]
mod_set = mod_set[:, req_idx]
# Return mod_set
return(mod_set)
Here, we only initialize the “model” once per model call, and then evaluate all samples in it by using the evaluate()
method (which can take sample dicts as a valid input argument).
This returns a dict of the evaluation results, where we are only interested in the adjusted expectation values.
Note that making an emulator of an emulator is pointless, but used here as an example.
Note
Due to the way PRISM is written, it is technically speaking not necessary to reinitialize the Pipeline
class every time that call_model()
is called.
It is possible to initialize it when the corresponding ModelLink
subclass is initialized and keep it in memory.
The code above would however be necessary if the “model” works in the same way as PRISM’s worker_mode
, where all worker ranks are listening for calls until the “model” is finalized.
This finalization would be required in order to give PRISM control back over all ranks.
Model discrepancy variance (md_var)¶
Of the three different variances that are used for calculating the implausibility values of a parameter set, the model discrepancy variance is by far the most important.
The model discrepancy variance describes all uncertainty about the correctness of the model output that is caused by the model itself.
This includes the accuracy of the code implementation, completeness of the inclusion of the involved physics, made assumptions and the accuracy of the output itself, amongst others.
It therefore acts as a measure of the quality of the model that is being emulated by PRISM, and as with call_model()
, must be handled with great care.
Theory¶
When PRISM constructs an emulator, it attempts to make a perfect approximation of the model that covers the absolute plausible regions of parameter space. This perfect approximation would be reached if the adjusted emulator variance (adj_var) is zero for all samples. In this case, the emulator has the same variance associated with it as the model, which is given by the model discrepancy variance. Therefore, if the model discrepancy variance is determined incorrectly, the emulator itself will be incorrect as well.
The reason for this is as follows. The implausibility value of a parameter set states how many standard deviations the emulator system expects the model realization corresponding to this parameter set, to be away from explaining the model comparison data. When the total variance increases, the implausibility value decreases (since less standard deviations fit in the total difference). For an emulator system that is still very inaccurate (e.g., first iteration), the adjusted emulator variance dominates over the other two variances. However, later on, the adjusted emulator variance becomes less and less dominant, causing the other two variances to start playing a role. In most cases, it is safe to assume that the model discrepancy variance is higher than the observational variance, since a model would be fitting noise if this was not the case. Therefore, there is going to be a moment when the model discrepancy variance starts being close to the adjusted emulator variance.
When this happens, the plausible region of parameter space starts being determined by the model discrepancy variance. If the model discrepancy variance is generally higher than it should be, then this will often result into the emulator system not converging as far as it could have, since parts of parameter space are still marked as plausible. The opposite however (the model discrepancy variance generally being lower than it should be) can mark parts of parameter space as implausible while they are not. This means that these parts are removed from the emulator.
From the above, it becomes clear that overestimating the model discrepancy variance is much less costly than underestimating its value. It is therefore important that this variance is properly described at all times. However, since the description of the model discrepancy variance can take a large amount of time, PRISM uses its own default description in case none was provided, which is defined as \(\mathrm{Var}(\epsilon_{\mathrm{md}, i})=\left(z_i/6\right)^2\), where \(\mathrm{Var}(\epsilon_{\mathrm{md}, i})\) is the model discrepancy variance of a specified model comparison data point \(i\) and \(z_i\) is the corresponding data value. If one assumes that a model output within half of the data is considered to be acceptable, with acceptable being defined as the \(3\sigma\)-interval, then the model discrepancy variance is obtained as:
This description of the model discrepancy variance usually works well for simple models, and acts as a starting point within PRISM. When models become bigger and more complex, it is likely that such a description is not enough. Given that the model discrepancy variance is unique to every model and might even be different for every model output, PRISM cannot possibly cover all scenarios. It is therefore advised that the model discrepancy variance is provided externally by the user.
Implementation¶
The model discrepancy variance is given by the get_md_var()
method.
This method is, like call_model()
, an abstract method and must be overridden by the ModelLink
subclass before it can be initialized.
The get_md_var()
method is called every time the implausibility value of an emulator evaluation sample is determined.
Unlike the call_model()
method, the get_md_var()
method is called by individual emulator systems, as they determine implausibility values individually.
For this reason, the get_md_var()
method is provided with the emulator iteration emul_i
, a single parameter set par_set
and the data identifiers requested by the emulator system data_idx
.
The call_type
and MPI_call
flags have no influence on the way the get_md_var()
method is used, as it is always called in serial for a single parameter set.
When it is called, PRISM expects an array-like container back with shape (n_data)
(if \(1\sigma\)-interval is centered) or shape (n_data, 2)
(if \(1\sigma\)-interval is given by upper and lower errors).
The default model discrepancy variance description given above is used if the get_md_var()
method raises a NotImplementedError
, but this is discouraged.
Warning
Because the get_md_var()
method is always called for single parameter sets, it is important that it can be called without requiring any preparation of data or models.