PLoS Computational Biology
Public Library of Science

Abstract

Calibration—that is, “fitting” the model to data—is a crucial part of using mathematical models to better forecast and control the population-level spread of infectious diseases. Evidence that a mathematical model is well-calibrated improves confidence that the model provides a realistic picture of the consequences of health policy decisions. To make informed decisions, policymakers need information about uncertainty: that is, the range of likely outcomes rather than just a single prediction. Modellers should therefore also strive to provide accurate measures of uncertainty, both for their model parameters and for their predictions. This systematic review provides an overview of the methods used to calibrate individual-based models (IBMs) of the spread of HIV, malaria, and tuberculosis. We found that fewer than half of the reviewed articles used reproducible, non-subjective calibration methods. For the remaining articles, the method could either not be identified or was described as an informal, non-reproducible method. Only one-third of the articles obtained estimates of parameter uncertainty. We conclude that the adoption of better-documented, algorithmic calibration methods could improve both reproducibility and the quality of inference in model-based epidemiology.

Hazelbag, Dushoff, Dominic, Mthombothi, Delva, and Kouyos: Calibration of individual-based models to epidemiological data: A systematic review

Introduction

Individual-based models (IBMs) intended to inform public health policy should be calibrated to real-world data and provide valid estimates of uncertainty [1], [2]. IBMs track information for a simulated collection of interacting individuals [3]. Compared to other modelling frameworks, IBMs allow more detailed incorporation of heterogeneity, spatial structure, and individual-level adaptation (e.g. physiological or behavioural changes) [4]. This complexity makes IBMs valuable planning tools, particularly in settings where real-world intricacies that are not accounted for in simpler models have important effects [5], [6]. However, researchers and policymakers often grapple with the question of how much value they can attach to the results of IBMs [7]. Fitting an IBM to empirical data (calibration) improves confidence that the simulation model provides a realistic and accurate estimate of the outcomes of health policy decisions (e.g. the projected disease prevalence, or the cost-effectiveness, of different intervention strategies) [8]–[12]. Transparent reporting on calibration methods for IBMs is therefore required [11], [12].

Parameter values (with accompanying confidence intervals) used in IBMs are typically obtained from the literature, where they are often derived through statistical estimation. When researchers cannot estimate parameters from empirical data, they obtain their likely values through calibration [12]. Parameter calibration is often difficult for IBMs because their greater complexity can render the likelihood function analytically intractable (i.e. impossible to write down in closed form) or prevent its explicit numerical calculation [13]–[15]. Consequently, simulation-based calibration methods that avoid the use of a likelihood function in closed form have been developed [16]. These methods run the model for many different parameter sets to identify those producing model output that best resembles the summary statistics obtained from the empirical data (e.g. disease prevalence over time). Formal simulation-based calibration requires summary statistics (targets) from empirical data, a parameter-search strategy for exploring the parameter space, a goodness-of-fit (GOF) measure to evaluate the concordance between model output and targets, acceptance criteria to determine which parameter sets produce model output close enough to the targets, and a stopping rule to determine when the calibration ends [9], [17]. IBMs vary in their complexity (i.e. the number of parameters) and in the amount of data available for calibration and validation [10]. Simulation-based calibration of more complex IBMs is typically more computationally intensive [18], [19].
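To make these components concrete, the sketch below illustrates a minimal rejection-style simulation-based calibration loop in R. The simulator run_ibm(), the prevalence target, the parameter ranges, the tolerance and the stopping rule are all hypothetical placeholders for illustration only, not taken from any of the reviewed models.

```r
# Minimal sketch of simulation-based calibration (rejection sampling).
# run_ibm() is a hypothetical stand-in for a real IBM run.
set.seed(1)

target_prev <- 0.15                      # target summary statistic from data
run_ibm <- function(beta, gamma) {       # placeholder simulator:
  # toy endemic-equilibrium prevalence plus simulation noise
  max(0, 1 - gamma / beta) + rnorm(1, sd = 0.01)
}

gof <- function(sim, obs) (sim - obs)^2  # GOF measure: squared distance

n_accept  <- 200                         # stopping rule: number of accepted sets
tolerance <- 1e-3                        # acceptance criterion on the GOF
accepted  <- data.frame(beta = numeric(), gamma = numeric())

while (nrow(accepted) < n_accept) {
  beta  <- runif(1, 0.1, 1.0)            # parameter-search strategy:
  gamma <- runif(1, 0.05, 0.5)           # random draws from the prior
  sim_prev <- run_ibm(beta, gamma)
  if (gof(sim_prev, target_prev) < tolerance) {
    accepted <- rbind(accepted, data.frame(beta = beta, gamma = gamma))
  }
}

summary(accepted)                        # approximate distribution of accepted parameters
```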

In this review, we pay particular attention to the parameter-search strategy and the GOF measure used. Algorithmic parameter-search strategies can be divided into optimisation algorithms and sampling algorithms [14]; S2 Table describes commonly used algorithms. Optimisation algorithms search for the parameter combination that optimises the GOF, resulting in a single best parameter combination. Examples include grid search and iterative, descent-guided optimisation algorithms using simplex-based or direct-search methods (e.g. the Nelder-Mead method) [20], but many other algorithms exist [21]. Optimisation algorithms provide only point estimates of parameters; once these are found, another method may be used to obtain confidence intervals (e.g. the profile likelihood method or the Fisher information) [22], [23]. Sampling algorithms aim to find a distribution of parameter values that approximates the likelihood surface or posterior distribution. Examples include approximate Bayesian computation (ABC) methods and sampling importance resampling [8], [13], [14], [24], [25]. Parameter distributions obtained from sampling algorithms can represent correlations between parameters and allow parameter uncertainty to be propagated into model projections [2], [6], [8], [17], [26]. Quantitative measures of GOF include distance measures (e.g. relative distance, squared distance) and measures based on a surrogate likelihood function: the likelihood of observing the target statistic under the assumption that the model output is a random draw from a presumed distribution (e.g. binomial for prevalence statistics). Because the model output is not necessarily distributed as presumed, we refer to this likelihood as the “surrogate” likelihood. A more subjective method of calibration involves the manual adjustment of parameter values, followed by a visual assessment of whether the model outputs resemble the empirical data [27].
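As an illustration of the GOF measures mentioned above, the short R sketch below compares two distance-based GOF measures with a binomial surrogate log-likelihood for a single prevalence target; the target value, sample size and simulated prevalence are invented for the example.

```r
# Hypothetical prevalence target: 120 cases observed in a survey of 800 people.
obs_cases   <- 120
obs_n       <- 800
target_prev <- obs_cases / obs_n         # 0.15

sim_prev <- 0.17                         # prevalence produced by one model run

# Distance-based GOF measures.
squared_distance  <- (sim_prev - target_prev)^2
relative_distance <- abs(sim_prev - target_prev) / target_prev

# Surrogate log-likelihood: treat the observed count as a binomial draw
# with the model-simulated prevalence as the success probability.
surrogate_loglik <- dbinom(obs_cases, size = obs_n, prob = sim_prev, log = TRUE)

c(squared = squared_distance, relative = relative_distance, loglik = surrogate_loglik)
```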

Previous research in the context of IBMs of HIV transmission found that 22 (69%) of 32 included articles described the process through which the model was calibrated to data [12]. The impact of stochasticity on the model results, defined as the random variation in model output induced by running the model multiple times with the same parameter values but different random seeds, was summarised in nearly half (15/32) of the articles [12]. The depth of reporting on calibration methods was highly variable [9], [12]. A systematic review in the context of population-level health policy models, including 37 articles, found that 25 (71%) of these performed model calibration [28]. About half (12/25) of these articles reported on the calibration methods used, whereas the other half (13/25) used informal methods for parameter calibration or did not report on the calibration methods [28]. Previous research on calibration methods in cancer-simulation models in general (not IBMs specifically) found that 131 (85%) of 154 included articles may have calibrated at least one unknown parameter. Of the 131 articles that calibrated parameters, the majority (84/131) did not describe the use of a GOF measure; the rest either used a quantitative GOF (27/131), such as a likelihood or distance measure, or a visual assessment of GOF (20/131) [9]. Only a few articles reported parameter distributions resulting from calibration; most presented only a single best parameter combination [9]. Information on the parameter-search strategy and stopping rules was generally not well described, and acceptance criteria were rarely mentioned [9], [29]. Of the 154 articles included in the review by Stout et al., 80 (52%) mentioned model validation [9]. However, while previous studies have reviewed specific portions of the modelling literature, they either did not focus on IBMs or did not examine the calibration methods in much detail.

We conducted a systematic review of epidemiological studies using IBMs of the HIV, malaria and tuberculosis (TB) epidemics, as these have been among the most investigated epidemics with the highest global burden of disease [30]. We aim to provide an overview of current practices in the simulation-based calibration of IBMs.

Results

Selection of articles for inclusion

The PubMed search resulted in 653 publications, of which 84 articles were included for review; 388 were excluded based on title and abstract, and another 181 were excluded based on a full-text review (see Fig 1). The number of articles selected by publication year increased from seven in 2013 to 20 in 2018.

Fig 1
PRISMA flow diagram detailing the selection process of articles included in the review.

Scope and objectives of included articles

S1 Table summarises the characteristics of the included articles. Fifty-eight (69%) of the included articles presented IBMs in HIV research, 16 (19%) concerned malaria, and another 10 (12%) concerned tuberculosis.

Most articles, namely 56 (67%), investigated the effect of an intervention; 17 articles examined behavioural or biological explanations for the observed epidemic; and other goals (e.g. parameter estimation, model development) were pursued in another 17. In total, six (7%) articles had two objectives. For most of these (5/6), one of the objectives was investigating the effect of an intervention (see S1 Table).

Parameter-search strategies and measures of GOF

Of the included articles, 40 (48%) combined a quantitative measure of GOF with an algorithmic parameter-search strategy, either an optimisation algorithm (14/40) or a sampling algorithm (26/40) (see Fig 2). For the remaining 44 (52%) articles, the parameter-search strategy could either not be identified (32/44) or was described as an informal, non-reproducible method (12/44). Tables A, B and C in S1 Appendix show no convincing evidence that the parameter-search strategy changed with publication year or differed by disease studied. A brief description of the methods referred to in Fig 2 under optimisation algorithms and sampling algorithms is provided in S2 Table.

Fig 2
Reporting and application of parameter search strategies in epidemiological studies.

Detailed information on calibration methods for the 14 (17%) articles using optimisation algorithms is reported in Table 1. For the parameter-search strategy, most articles used a grid search (7/14), Latin hypercube sampling (1/14) or sampling from a tolerable range (1/14), followed by the selection of the single best parameter combination. Iterative, descent-guided optimisation algorithms (i.e. Nelder-Mead, an interior-point algorithm, coordinate descent with golden section search, or a random search mechanism) were used in the remaining articles (5/14). Of these five articles, most (4/5) accepted a single best parameter combination without confidence intervals, while the remaining article obtained confidence intervals around the parameter estimates (see S1 Text). For the GOF measure, the most common choice was a squared distance (6/14). Various GOF measures were used in the remaining articles, including absolute distances (2/14) and R-squared (2/14).

Table 1
Details of the calibration methods used in articles using optimisation algorithms for calibration, sorted by parameter search strategy algorithm.
Authors | Year | Pathogen | Parameter search strategy algorithm | GOF
Luo et al. | 2018 | HIV | Grid search | Absolute distance
Romero-Severson et al. | 2013 | HIV | Grid search | Kolmogorov-Smirnov
Marshall et al. | 2018 | HIV | Grid search | R-squared
Goedel et al. | 2018 | HIV | Grid search | R-squared and Manhattan distance of parameters
Brookmeyer et al. | 2014 | HIV | Grid search | Squared distance
Suen et al. | 2014 | TB | Grid search | Number of model outputs within the confidence intervals around the targets
Suen et al. | 2015 | TB | Grid search | Number of model outputs within the confidence intervals around the targets
Bershteyn et al. | 2013 | HIV | Iterative, descent-guided optimisation algorithm (Coordinate descent w. golden section search) | Squared distance
Klein et al. | 2015 | HIV | Iterative, descent-guided optimisation algorithm (Coordinate descent w. golden section search) | Squared distance
Sauboin et al. | 2015 | Malaria | Iterative, descent-guided optimisation algorithm (Interior point algorithm, hill-climbing) | Squared distance
Knight et al. | 2015 | TB, HIV | Iterative, descent-guided optimisation algorithm (Nelder-Mead) | Squared distance
Kasaie et al. | 2018 | HIV | Iterative, descent-guided optimisation algorithm (Random search mechanism) | Absolute distance
Shrestha et al. | 2017 | TB | Latin hypercube sampling | Surrogate likelihood
Jewell et al. | 2015 | HIV | Sampling from tolerable range | Squared distance

Table 2 contains the details of the calibration methods in the 26 (31%) articles using sampling algorithms. Random sampling from the prior, followed by rejection ABC, was used most often (8/26). Different types of Bayesian calibration (7/26), Bayesian melding (3/26) and history matching with model emulation (3/26) were also used. Most articles (10/26) used the surrogate likelihood as a measure of GOF; various GOF measures were used in the remaining articles, including absolute distances (4/26), relative distances (4/26) and squared distances (4/26) (see Table 2).

Table 2
Details of the calibration methods in articles using sampling algorithms for calibration, sorted by parameter search strategy algorithm.
Authors | Year | Pathogen | Parameter search strategy algorithm | GOF
Cameron et al. | 2015 | Malaria | Bayesian calibration (Combining model emulation with MCMC) | Surrogate likelihood
Huynh et al. | 2015 | TB | Bayesian calibration (Latin hypercube with IMIS) | Surrogate likelihood
Chang et al. | 2018 | TB | Bayesian calibration (Latin hypercube with IMIS) | Surrogate likelihood
Penny et al. | 2015 | Malaria | Bayesian calibration (MCMC) | Surrogate likelihood
Penny et al. | 2015 | Malaria | Bayesian calibration (MCMC) | Surrogate likelihood
White et al. | 2018 | Malaria | Bayesian calibration (MCMC) | Surrogate likelihood
Schalkwyk et al. | 2018 | HIV | Bayesian calibration (Random draw from prior with SIR) | Surrogate likelihood
Abuelezam et al. | 2016 | HIV | Bayesian melding | Squared distance
McCormick et al. | 2014 | HIV | Bayesian melding | Surrogate likelihood
McCormick et al. | 2017 | HIV | Bayesian melding | Surrogate likelihood
Ciaranello et al. | 2013 | HIV | Grid search, step-wise acceptance of parameter sets resulting in GOF < cut-off | Absolute distance
McCreesh et al. | 2017 | HIV | History matching with model emulation | Implausibility measure
McCreesh et al. | 2017 | HIV | History matching with model emulation | Implausibility measure
McCreesh et al. | 2018 | HIV | History matching with model emulation | Implausibility measure
Shcherbacheva et al. | 2018 | Malaria | Markov chain Monte Carlo | Absolute distance
Johnson et al. | 2016 | HIV | Random draw from prior with selection of best 500 parameter combinations | Surrogate likelihood
Pizzitutti et al. | 2015 | Malaria | Random draw from prior, stepwise calibration | Absolute distance
Pizzitutti et al. | 2018 | Malaria | Random draw from prior, stepwise calibration | Squared distance
Nakagawa et al. | 2016 | HIV | Rejection ABC (Random draw from prior) | Relative distance
Nakagawa et al. | 2017 | HIV | Rejection ABC (Random draw from prior) | Chi-square
Cambiano et al. | 2018 | HIV | Rejection ABC (Random draw from prior) | Relative distance
Hontelez et al. | 2013 | HIV | Rejection ABC (Random draw from prior) | Squared distance
Phillips et al. | 2013 | HIV | Rejection ABC (Random draw from prior) | Relative distance
Phillips et al. | 2015 | HIV | Rejection ABC (Random draw from prior) | Relative distance
Shrestha et al. | 2017 | HIV | Rejection ABC (Random draw from prior) | Absolute distance
Tuite et al. | 2017 | TB | Rejection ABC (Random draw from prior) | Squared distance
IMIS, Incremental-mixture importance sampling; SIR, Sampling importance resampling; MCMC, Markov chain Monte Carlo.

Of the 44 (52%) articles with unidentifiable or informal parameter-search strategies, the majority (25/44) were also unclear about the GOF used, while the rest either relied on visual inspection as a GOF (14/44) or used a quantitative GOF (5/44).

Only 14 (17%) of the 84 included articles provided a rationale for their choice of model-calibration method. For example, McCreesh et al. [31] reported: “The model was fitted to the empirical data using history matching with model emulation, which allowed uncertainties in model inputs and outputs to be fully represented, and allowed realistic estimates of uncertainty in model results to be obtained” (see S2 Text for more examples). Other examples indicate that an algorithmic calibration method failed to provide either a good fit or parameter estimates: “Ultimately, we chose to use visual inspection because the survival curves did not fit closely enough using the other two more quantitative approaches” [32], or “[Calibration] was unable to resolve co-varying parameters. These parameters were adjusted by hand…” [33].

Ten of the 84 included articles (12%) used a weighted calculation of GOF. Four articles weighted the GOF based on the amount of data behind the summary statistic being fitted, for example by weighting by the inverse of the width of the confidence interval around the data. In contrast, one article increased the weight for a data source for which fewer data were available. Other strategies included weighting based on a subjective assessment of the quality of the data, or weighting based on which data the authors wanted the model to fit best. One article down-weighted particular data to improve fit. Others stressed the importance of determining weights a priori, since weights are chosen subjectively.
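As a concrete illustration of one of these weighting schemes, the R sketch below computes a GOF weighted by the inverse of the confidence-interval width of each target; the targets, intervals and simulated values are invented for the example and do not come from any reviewed article.

```r
# Hypothetical targets with 95% confidence intervals and matching model output.
targets <- data.frame(
  name     = c("prevalence_2010", "prevalence_2015", "incidence_2015"),
  observed = c(0.150, 0.120, 0.020),
  ci_lower = c(0.130, 0.110, 0.012),
  ci_upper = c(0.170, 0.130, 0.028)
)
simulated <- c(0.160, 0.118, 0.025)

# Weight each target by the inverse of its confidence-interval width,
# so that targets estimated from more data count more.
weights <- 1 / (targets$ci_upper - targets$ci_lower)

weighted_gof <- sum(weights * (simulated - targets$observed)^2)
weighted_gof
```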

Acceptance criteria and stopping rules

None (0/14) of the articles applying optimisation algorithms mentioned acceptance criteria or stopping rules. The acceptance criteria and stopping rules applied in studies using sampling algorithms can be summarised as running the model until an arbitrary number of accepted parameter combinations is obtained.

The number of target statistics, the number of calibrated parameters and the size of the simulated population

The number of target statistics was explicitly mentioned in only three (3%) of the 84 included articles; for 62 (74%) articles, we had enough information to attempt to deduce this number from either text or figures. The remaining 19 (23%) articles provided either incomplete information (11/19) or no information (8/19). Some (4/65) of the articles for which we were able to obtain the number of target statistics had different numbers of target statistics for calibration in different locations or calibration to different diseases. The 61 (73%) articles for which we were able to obtain a single count had a median number of target statistics of 23 (range 1–321). A histogram of the number of target statistics is provided in Figure A in S2 Appendix. The number of target statistics differed between parameter-search strategies (see Fig 3B; Kruskal-Wallis chi-square = 8.610, p = 0.035), with articles using sampling strategies having more target statistics than articles for which we could not identify the parameter-search strategy (Wilcoxon rank-sum, Benjamini-Hochberg adjusted p-value = 0.025).

The number of calibrated parameters was explicitly mentioned in 11 (13%) of the 84 included articles; for another 53 (63%) articles, it was possible to deduce this number from either text or figures. The remaining 20 (24%) articles provided either incomplete information (10/20) or no information at all (10/20). The 64 (75%) articles for which we were able to obtain a count had a median number of calibrated parameters of 10 (range 1–96). A histogram of the number of calibrated parameters is provided in Figure B in S2 Appendix. The number of calibrated parameters differed between parameter-search strategies (see Fig 3A; Kruskal-Wallis chi-square = 9.304, p = 0.026), with articles using sampling strategies having more calibrated parameters than articles for which we could not identify the parameter-search strategy (Wilcoxon rank-sum, Benjamini-Hochberg adjusted p-value = 0.050).

Fig 3
Comparison of the number of calibrated parameters and target statistics between different parameter search strategies. (A) Boxplots of the number of calibrated parameters for different parameter search strategies. (B) Boxplots of the number of target statistics for different parameter search strategies.

For 55 (66%) articles, we obtained counts of both the number of target statistics and the number of calibrated parameters. For many of these articles (17/55), the number of calibrated parameters appeared to exceed the number of target statistics. A plot of the number of target statistics against the number of calibrated parameters is provided in Figure C in S2 Appendix.

The size of the simulated population was explicitly mentioned in 54 (64%) of the 84 included articles; for another 9 (11%) articles, it was possible to deduce this number from either text or figures. The remaining 21 (25%) articles provided either incomplete information (3/21) or no information at all (18/21). For the 63 (75%) articles for which we obtained a number, the median population size was 78,000 (range: 250–47,000,000). A histogram of the log10 of the size of the simulated population is provided in Figure D in S2 Appendix.

Computational aspects and the use of platforms

The software used to build the IBM was not reported in 33 (39%) of the articles. Sixteen articles (19%) used the low-level programming language C++, six (7%) used MATLAB, and another six (7%) used Python. Various other computing platforms were used in the remaining 23 (28%) articles. A high-performance computing facility was used in 16 (19%) articles.

Several simulation tools (i.e. CEPAC [34], EMOD [35], HIV-CDM [36], MicroCOSM [37], PATH [38], STDSIM [39] and TITAN [40]) were used in the articles modelling HIV. Similarly, two platforms (i.e. EMOD [41] and OpenMalaria [42]) were used in the articles modelling malaria. In the articles modelling tuberculosis, the only tool reported was EMOD [43].

Model validation

Only 31 (37%) articles mentioned that a validation of the model had been performed.

Discussion

More than half of the IBM studies we reviewed used non-reproducible or subjective calibration methods. Articles that reported the use of formal calibration methods used a wide range of parameter-search strategies and GOF measures. Only one-third of the articles used calibration methods that quantify parameter uncertainty. These findings are important because choices concerning the calibration method can have substantial effects on model results and policy implications [2], [6]–[8], [44]–[46].

We encourage authors to use the standardised Calibration Reporting Checklist of Stout et al. [9]. Additionally, we propose an extended checklist in S3 Appendix based on the work presented in this paper. While algorithmic parameter-search strategies are in principle reproducible, unclear or incomplete reporting and non-disclosure of software code can render them de facto non-reproducible [47]. Manual adjustment of parameter values and visual inspection of GOF may perform as well as other methods in terms of GOF alone [48], may provide researchers with valuable insights into and familiarity with the model [49], and can be useful for purely didactic purposes [50]–[52]. However, we advise against using these methods in analyses intended to inform public health because they hamper reproducibility and involve subjective judgment, which may produce suboptimal calibration results and usually leads to the acceptance of a single parameter set (i.e. it does not provide parameter uncertainty) [17]. On occasion, authors justified their choice of an informal method by indicating that algorithmic calibration methods did not converge to provide parameter estimates or failed to provide a satisfactory fit to the targets. A potential explanation for non-convergence of an algorithmic calibration method is that the parameters in question are unidentifiable, which is the case when a vast array of different parameter combinations provides a comparably good fit to the target statistics. Performing manual calibration in such an instance will deliver one of the many parameter combinations that provide a fit, but presenting this single parameter combination hides the fact that there is not enough information to uniquely identify the best parameter values. Furthermore, model stochasticity means that an excellent fit may be found by chance for a parameter combination under which the probability of observing the target statistics is lower than under other parameter combinations.

There are several methodological challenges in the calibration of individual-based models, including the choice of calibration method, i.e. the combination of algorithmic parameter-search strategy and GOF measure. The findings of the current review and previous research suggest that there is no consensus on which calibration method to use [9], [10], [17], [53], [54]. Additionally, some of the articles reviewed here indicated that algorithmic calibration methods had failed, leading the researchers to calibrate the model, either fully or partially, by hand. These issues suggest that there is a need for research comparing the performance of calibration methods to inform the choice of parameter-search strategy and GOF measure [10]. Previous research on calibration methods has focused on the GOF [27] and on computation time and analyst time [48]. Where applicable, correct estimation of the posterior [55] should be a core aspect of performance. We further suggest investigating several contextual variables, including the amount and nature of the empirical data to calibrate against, the number and type of model parameters to be calibrated, and the insights to be derived from the calibrated model. As evident from our review, these contextual variables vary widely across IBM studies in epidemiology.

Another methodological challenge in the calibration of IBMs is determining a priori whether the target statistics provide sufficient information to calibrate the parameters [56], especially when the model has many parameters [57]. Firstly, the target statistics are based on variable amounts of raw data. Secondly, a time series of target statistics is often used, typically violating the assumption of independence implied by many calibration methods. Thirdly, the complexity of the model may hamper an appropriate specification of a prior parameter distribution (including the specification of correlations between parameters) that is fully informed by prior knowledge of the data-generating processes represented by the model. These problems preclude the use of standard statistical methods for calculating the number of target statistics that is sufficient for parameter calibration. A related problem is that target summary statistics are based on data from different sources, including observational data that are potentially affected by treatment-confounder feedback (e.g. the time-dependent confounder CD4 cell count being affected by prior cART treatment) [58]. Another related problem is that of validation, i.e. testing model performance on data that were not included in the calibration step. There is considerable debate on when data should be reserved for this purpose [54].

The last methodological aspect of IBMs we would like to draw attention to is the size of the simulated population [1], [59]. Intuitively, one would recommend that the simulated population be similar in size to the population from which the samples giving rise to the target statistics were drawn. However, for many studies, modelling the full population is not feasible with currently available computational infrastructure. Instead, researchers often adjust for the inflated stochasticity in the modelled system by averaging outcomes of interest over multiple simulation runs per parameter set [59]. How choices around modelled population size and the analysis of model output affect the validity of model inference deserves further attention in future research.
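For illustration, the short R sketch below averages a simulated outcome over several runs with different random seeds for the same parameter set; the run_ibm() simulator, the parameter values and the number of runs are hypothetical placeholders.

```r
# Average an outcome over repeated stochastic runs of the same parameter set
# to dampen the extra stochasticity of a small simulated population.
run_ibm <- function(beta, gamma, seed) {   # placeholder for a real IBM run
  set.seed(seed)
  max(0, 1 - gamma / beta) + rnorm(1, sd = 0.02)
}

n_runs <- 20
prevalences <- sapply(seq_len(n_runs),
                      function(s) run_ibm(beta = 0.4, gamma = 0.34, seed = s))

mean(prevalences)   # outcome averaged over runs, e.g. for use in the GOF
sd(prevalences)     # between-run (stochastic) variability
```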

Our results in the setting of HIV, TB and malaria IBMs indicate that the use of formal calibration methods (48% of articles) is higher than in previous research on simulation models in general (not IBMs specifically). Previously, only one-fifth to one-third of articles reporting on epidemiological models used a quantitative GOF [9], [60]. Our results concerning parameter uncertainty are also optimistic compared to previous research by Stout et al. on calibration methods in cancer models, which found that almost no articles quantified parameter uncertainty but instead accepted a single best-fitting parameter set as the result of the calibration [9]. The same researchers reported that several different combinations of parameter-search strategies and GOF measures were used [9], which is similar to our findings. Stout et al. report that articles rarely describe acceptance criteria and stopping rules, and that a standard description of the calibration process was lacking in almost all articles [9]. Similarly, previous research on IBMs of HIV transmission found that the description of calibration methods was lacking [12]. All of this agrees with the results of the current review. Concerning the goals of the included articles, our results broadly agree with Punyacharoensin et al., who found that the main goals of HIV transmission models for the study of men who have sex with men are making projections for the epidemic, investigating how the incorporation of various assumptions about behavioural or biological characteristics affects these projections, and evaluating the impact of interventions [60].

To our knowledge, this is the first detailed review of methods used to calibrate IBMs of HIV, malaria and TB epidemics. A limitation of our study is that we are unsure to what extent the results are generalisable to other infectious diseases. We encourage future research on other diseases to confirm or refute our current findings on the use of, and reporting on, methods for the calibration of IBMs in epidemiological research. Similarly, since our PubMed search excluded articles matching “molecular”, we may have missed relevant articles. However, we do not believe this selection is likely to bias the findings of this review. Another possible concern is that we did not control for overlaps in authorship; thus, we effectively treated articles that come from a given “research group” as independent observations, even though the calibration method used by a particular group is often the same, as we show in Tables 1 and 2. Another limitation is that the counts presented in this review often had to be deduced from the articles; this was a difficult and laborious task involving manual counting of target statistics in the text, figures or tables, a process that is prone to error. A final limitation is that we did not go into the strengths and weaknesses of each method. Existing literature compares the performance of alternative algorithms for calibrating the same model but does not allow us to draw general conclusions [10]. As a starting point for comparison, we provide a brief description of calibration methods in S2 Table.

In conclusion, it appears that calibrating individual-based models in epidemiological studies of HIV, malaria and TB transmission dynamics remains more of an art than a science. Besides limited reproducibility for a majority of the modelling studies in our review, our findings raise concerns over the correctness of model inference (e.g., estimated impact of past or future interventions) for models that are poorly calibrated. The quality of inference and reproducibility in model-based epidemiology could benefit from the adoption of algorithmic parameter-search strategies and better-documented calibration and validation methods. We recommend the use of sampling algorithms to obtain valid estimates of parameter uncertainty and correlations between parameters. There is a need for simulation-based studies that compare the performance, strengths and limitations of different methods for calibrating IBMs to epidemiological data.

Materials and methods

This review was performed following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [61]. The PRISMA flow diagram details the selection process of articles included for review (see Fig 1).

Search strategy and selection criteria

We identified articles on PubMed that employed simulation-based methods to calibrate IBMs of HIV, malaria and tuberculosis, and that were published between 1 January 2013 and 31 December 2018. Six years seemed to be long enough to yield a sizeable amount of information and to observe recent time trends, and short enough to be feasible and to speak to recent practices in model calibration in epidemiological modelling studies. The following search query was performed on 31 January 2019: ‘((HIV[tiab] OR malaria[tiab] OR tuberculo*[tiab] OR TB[tiab]) AND (infect* OR transmi* OR prevent*) AND (computer simulation[tiab] OR microsimulation[tiab] OR simulation[tiab] OR agent-based[tiab] OR individual-based[tiab] OR computer model*[tiab] OR computerized model*[tiab]) AND ("2013/01/01"[Date—publication]: "2018/12/31"[Date—publication]) NOT(molecular))’.

Eligibility criteria were agreed upon by WD, JD and CMH before screening. Articles were included if models stored individual-specific information and calibration involved running the model and comparing model output to population-level targets expressed as summary statistics. We excluded review articles, statistical simulation studies, and studies that focused on molecular biology and immunology because we were primarily interested in studies informing public health policy.

Titles and abstracts were screened for eligibility by CMH, and difficult cases were discussed with WD. If the title and abstract did not provide sufficient information for exclusion, a full-text examination was performed. Full-text inclusion was performed by two independent researchers (CMH and either ZM or ED) for a subset of 100 articles. CMH included 28 articles, of which ZM and ED did not include six; these six articles were double-checked by WD and consequently included for review. ZM included four articles that CMH did not include; these four articles were double-checked by WD and consequently not included for review. After that, full-text inclusion was performed by CMH in consultation with WD.

Data extraction

For each article, we extracted information on the objective of the study (i.e. estimating the effect of an intervention, investigating a behavioural or biological explanation for the observed infectious disease outbreak, or other goals including the estimation of parameters or model development), the parameter-search strategy and GOF measure used, the rationale for choosing this calibration strategy over alternatives, and model validation. Acceptance criteria and stopping rules are only relevant for articles applying algorithmic parameter-search strategies and were therefore collected for that subset of articles. For readability, we say “used” to mean “reported the use of” throughout this review.

Information was collected independently by two reviewers (CMH and either ZM or ED) for each included article using a prospectively developed form. This form was based on the Calibration Reporting Checklist of Stout et al. [9] and was extended with several items, including the software and hardware used to build the model, the size of the initial population of agents, and the name of the modelling platform. Additionally, we inserted several items to collect information on the number of calibrated parameters, the number of fixed parameters, and the number of targets. We noted how information on these counts was reported in the articles (i.e. whether the number was explicitly provided, could be deduced from text or figures, was provided incompletely, or was not provided).

Information on calibration methods was extracted verbatim, allowing for later classification. Articles on which there was disagreement in the classification were discussed by WD, JD and CMH until an agreement was reached. We classified articles reporting both algorithmic and informal calibration as informal since doing part of the calibration informally makes the entire calibration irreproducible.

Statistical analysis

R 3.5.0 (www.r-project.org) was used to perform the statistical analyses [62]. Differences between groups in non-normally distributed continuous variables were analysed with the nonparametric Kruskal-Wallis test [63]. The Wilcoxon rank-sum test was used to determine which groups differed significantly [63]. The Benjamini-Hochberg (BH) correction was used to adjust for multiple testing [64].
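A minimal sketch of this analysis in R is shown below; the data frame and its variable names (n_parameters, search_strategy) are hypothetical stand-ins for the extracted review data.

```r
# Hypothetical extracted data: number of calibrated parameters per article,
# grouped by the parameter-search strategy identified in the article.
reviews <- data.frame(
  n_parameters    = c(4, 12, 30, 8, 25, 60, 3, 5, 7, 10, 2, 6),
  search_strategy = rep(c("optimisation", "sampling", "unidentified"), each = 4)
)

# Overall difference between groups (non-normally distributed counts).
kruskal.test(n_parameters ~ search_strategy, data = reviews)

# Pairwise comparisons with Benjamini-Hochberg adjustment for multiple testing.
pairwise.wilcox.test(reviews$n_parameters, reviews$search_strategy,
                     p.adjust.method = "BH")
```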

Acknowledgements

The authors gratefully acknowledge the help of all SACEMA students and researchers, specifically the fruitful conversations and helpful comments on the manuscript by Prof. Alex Welte, Mrs Cari van Schalkwyk, Dr Florian Marx, Prof. Juliet Pulliam and Dr Larisse Bolton. We would also like to acknowledge Mrs Marisa Honey and Mrs Susan Lotz from the Stellenbosch writing lab, who copy-edited a first version of the manuscript.

References

1 

    Bobashev G, Morris R. Uncertainty and inference in agent-based models. In: 2010 Second International Conference on Advances in System Simulation. IEEE; 2010. p. 67–71.

2 

    AH Briggs, MC Weinstein, EA Fenwick, J Karnon, MJ Sculpher, AD Paltiel. Model parameter estimation and uncertainty analysis: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force Working Group–6. Medical decision making. 2012;32(5):722–732. doi: 10.1177/0272989X12458348

3 

    L Willem, F Verelst, J Bilcke, N Hens, P Beutels. Lessons from a decade of individual-based models for infectious disease transmission: a systematic review (2006–2015). BMC infectious diseases. 2017;17(1):612. doi: 10.1186/s12879-017-2699-8

4 

    Hammond RA. Considerations and best practices in agent-based modeling to inform policy. In: Assessing the use of agent-based models for tobacco regulation. National Academies Press (US); 2015.

5 

    LF Johnson, N Geffen. A comparison of two mathematical modeling frameworks for evaluating sexually transmitted infection epidemiology. Sexually transmitted diseases. 2016;43(3):139–146. doi: 10.1097/OLQ.0000000000000412

6 

    MC Kennedy, A O’Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001;63(3):425–464.

7 

    M Egger, L Johnson, C Althaus, A Schoni, G Salanti, N Low, et al. Developing WHO guidelines: Time to formally include evidence from mathematical modelling studies. F1000Research. 2017;6:1584. doi: 10.12688/f1000research.12367.2

8 

    NA Menzies, DI Soeteman, A Pandya, JJ Kim. Bayesian methods for calibrating health policy models: a tutorial. Pharmacoeconomics. 2017;35(6):613–624. doi: 10.1007/s40273-017-0494-4

9 

    NK Stout, AB Knudsen, CY Kong, PM McMahon, GS Gazelle. Calibration methods used in cancer simulation models and suggested reporting guidelines. Pharmacoeconomics. 2009;27(7):533–545. doi: 10.2165/11314830-000000000-00000

10 

    IJ Dahabreh, JA Chan, A Earley, D Moorthy, EE Avendano, TA Trikalinos, et al. A Review of Validation and Calibration Methods for Health Care Modeling and Simulation. In: Modeling and Simulation in the Context of Health Technology Assessment: Review of Existing Guidance, Future Research Needs, and Validity Assessment [Internet]. Agency for Healthcare Research and Quality (US); 2017. p. 30–43.

11 

    JJ Caro, DM Eddy, H Kan, C Kaltz, B Patel, R Eldessouki, et al. Questionnaire to assess relevance and credibility of modeling studies for informing health care decision making: an ISPOR-AMCP-NPC Good Practice Task Force report. Value in health. 2014;17(2):174–182. doi: 10.1016/j.jval.2014.01.003

12 

    NN Abuelezam, K Rough, GR Seage III. Individual-based simulation models of HIV transmission: reporting quality and recommendations. PloS one. 2013;8(9):e75624. doi: 10.1371/journal.pone.0075624

13 

    J Lintusaari, MU Gutmann, R Dutta, S Kaski, J Corander. Fundamentals and recent developments in approximate Bayesian computation. Systematic biology. 2017;66(1):e66–e82. doi: 10.1093/sysbio/syw077

14 

    F Hartig, JM Calabrese, B Reineking, T Wiegand, A Huth. Statistical inference for stochastic simulation models–theory and application. Ecology letters. 2011;14(8):816–827. doi: 10.1111/j.1461-0248.2011.01640.x

15 

    Busetto AG, Buhmann JM. Stable Bayesian parameter estimation for biological dynamical systems. In: 2009 International Conference on Computational Science and Engineering. vol. 1. IEEE; 2009. p. 148–157.

16 

    R Leombruni, M Richiardi. Why are economists sceptical about agent-based simulations? Physica A: Statistical Mechanics and its Applications. 2005;355(1):103–109.

17 

    T Vanni, J Karnon, J Madan, RG White, WJ Edmunds, AM Foss, et al. Calibrating models in economic evaluation. Pharmacoeconomics. 2011;29(1):35–49. doi: 10.2165/11584600-000000000-00000

18 

    NZ Sun, A Sun. Model calibration and parameter estimation: for environmental and water resource systems. Springer; 2015.

19 

    R Bellman. Dynamic programming. Princeton, USA: Princeton University Press; 1957.

20 

    JA Nelder, R Mead. A simplex method for function minimization. The computer journal. 1965;7(4):308–313.

21 

    S Amaran, NV Sahinidis, B Sharda, SJ Bury. Simulation optimization: a review of algorithms and applications. Annals of Operations Research. 2016;240(1):351–380.

22 

    M Joshi, A Seidel-Morgenstern, A Kremling. Exploiting the bootstrap method for quantifying parameter confidence intervals in dynamical systems. Metabolic engineering. 2006;8(5):447–455. doi: 10.1016/j.ymben.2006.04.003

23 

    Stryhn H, Christensen J. Confidence intervals by the profile likelihood method, with applications in veterinary epidemiology. In: Proceedings of the 10th International Symposium on Veterinary Epidemiology and Economics, Vina del Mar; 2003. p. 208.

24 

    TJ McKinley, I Vernon, I Andrianakis, N McCreesh, JE Oakley, RN Nsubuga, et al. Approximate Bayesian Computation and simulation-based inference for complex stochastic epidemic models. Statistical science. 2018;33(1):4–18.

25 

    DB Rubin. Using the SIR algorithm to simulate posterior distributions. Bayesian Stat. 1988;3:395–402.

26 

    D Poole, AE Raftery. Inference for deterministic simulation models: the Bayesian melding approach. Journal of the American Statistical Association. 2000;95(452):1244–1255.

27 

    CD Schunn, D Wallach, et al. Evaluating goodness-of-fit in comparison of models to data. Psychologie der Kognition: Reden und Vorträge anlässlich der Emeritierung von Werner Tack. 2005. p. 115–154.

28 

    A Conrads-Frank, B Jahn, M Bundo, G Sroczynski, N Mühlberger, M Bicher, et al. A Systematic Review Of Calibration In Population Models. Value in Health. 2017;20(9):A745.

29 

    HHA Afzali, J Gray, J Karnon. Model performance evaluation (validation and calibration) in model-based studies of therapeutic interventions for cardiovascular diseases. Applied health economics and health policy. 2013;11(2):85–93. doi: 10.1007/s40258-013-0012-6

30 

    Y Furuse. Analysis of research intensity on infectious disease by disease burden reveals which infectious diseases are neglected by researchers. Proceedings of the National Academy of Sciences. 2019;116(2):478–483.

31 

    N McCreesh, I Andrianakis, RN Nsubuga, M Strong, I Vernon, TJ McKinley, et al. Universal test, treat, and keep: improving ART retention is key in cost-effective HIV control in Uganda. BMC infectious diseases. 2017;17(1):322. doi: 10.1186/s12879-017-2420-y

32 

    J Kessler, K Nucifora, L Li, L Uhler, S Braithwaite. Impact and Cost-Effectiveness of Hypothetical Strategies to Enhance Retention in Care within HIV Treatment Programs in East Africa. Value in health: the journal of the International Society for Pharmacoeconomics and Outcomes Research. 2015;18(8):946–955. Available from: http://linkinghub.elsevier.com/retrieve/pii/S1098301515050731.

33 

    DJ Klein, PA Eckhoff, A Bershteyn. Targeting HIV services to male migrant workers in southern Africa would not reverse generalized HIV epidemics in their home communities: A mathematical modeling analysis. International Health. 2015;7(2):107–113. doi: 10.1093/inthealth/ihv011

34 

35 

36 

    AW McCormick, NN Abuelezam, ER Rhode, T Hou, RP Walensky, PP Pei, et al. Development, calibration and performance of an HIV transmission model incorporating natural history and behavioral patterns: application in South Africa. PloS one. 2014;9(5):e98272. doi: 10.1371/journal.pone.0098272

37 

    LF Johnson, M Kubjane, H Moolla. MicroCOSM: a model of social and structural drivers of HIV and interventions to reduce HIV incidence in high-risk populations in South Africa. bioRxiv 310763 [Preprint]. 2018 [cited 2020 April 24]. Available from: https://www.biorxiv.org/content/10.1101/310763v1. doi: 10.1101/310763

38 

    C Gopalappa, PG Farnham, YH Chen, SL Sansom. Progression and Transmission of HIV/AIDS (PATH 2.0). Medical decision making: an international journal of the Society for Medical Decision Making. 2017;37(2):224–233.

39 

    R Bakker, E Korenromp, E Meester, C Van Der Ploeg, H Voeten, C Van Vliet, et al. STDSIM: A microsimulation model for decision support in the control of HIV and other STDs. Sexually Transmitted Diseases. 2000;27(10):652.

40 

    titanmodel.org [Internet]. Marshall_Labs: Treatment of infectious transmissions through agent-based network. c2017 [cited 2020 Apr 24]. Available from: https://www.titanmodel.org/

41 

    A Bershteyn, J Gerardin, D Bridenbecker, CW Lorton, J Bloedow, RS Baker, et al. Implementation and applications of EMOD, an individual-based multi-disease modeling platform. Pathogens and disease. 2018;76(5): fty059.

42 

    MA Penny, K Galactionova, M Tarantino, M Tanner, TA Smith. The public health impact of malaria vaccine RTS,S in malaria endemic Africa: Country-specific predictions using 18-month follow-up Phase III data and simulation models. BMC Medicine. 2015;13(1):170.

43 

    ST Chang, VN Chihota, KL Fielding, AD Grant, RM Houben, RG White, et al. Small contribution of gold mines to the ongoing tuberculosis epidemic in South Africa: a modeling-based study. BMC medicine. 2018;16(1):52. doi: 10.1186/s12916-018-1037-3

44 

    AT Fojo, EA Kendall, P Kasaie, S Shrestha, TA Louis, DW Dowdy. Mathematical Modeling of “Chronic” Infectious Diseases: Unpacking the Black Box. In: Open forum infectious diseases. vol. 4. Oxford University Press US; 2017. p. ofx172.

45 

    JA Gilbert, LA Meyers, AP Galvani, JP Townsend. Probabilistic uncertainty analysis of epidemiological modeling to guide public health intervention policy. Epidemics. 2014;6:37–45. doi: 10.1016/j.epidem.2013.11.002

46 

    JJ Caro, AH Briggs, U Siebert, KM Kuntz. Modeling good research practices—overview: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force–1. Medical Decision Making. 2012;32(5):667–677. doi: 10.1177/0272989X12454577

47 

    J Fehr, J Heiland, C Himpe, J Saak. Best practices for replicability, reproducibility and reusability of computer-based experiments exemplified by model reduction software. AIMS Mathematics. 2016;1(3):261–281.

48 

    DC Taylor, V Pawar, D Kruzikas, KE Gilmore, A Pandya, R Iskandar, et al. Methods of model calibration. Pharmacoeconomics. 2010;28(11):995–1000. doi: 10.2165/11538660-000000000-00000

49 

    DJ Gerberry. An exact approach to calibrating infectious disease models to surveillance data: The case of HIV and HSV-2. Mathematical Biosciences & Engineering. 2018;15(1):153–179.

50 

    JS Hodges. Six (or so) things you can do with a bad model. Operations Research. 1991;39(3):355–365.

51 

    CR Kenyon, W Delva, RM Brotman. Differential sexual network connectivity offers a parsimonious explanation for population-level variations in the prevalence of bacterial vaginosis: a data-driven, model-supported hypothesis. BMC women’s health. 2019;19(1):8. doi: 10.1186/s12905-018-0703-0

52 

    W Delva, GE Leventhal, S Helleringer. Connecting the dots: network data and models in HIV epidemiology. AIDS. 2016;30(13):2009–2020. doi: 10.1097/QAD.0000000000001184

53 

54 

    JA Kopec, P Finès, DG Manuel, DL Buckeridge, WM Flanagan, J Oderkirk, et al. Validation of population-based disease simulation models: a review of concepts and methods. BMC public health. 2010;10(1):710.

55 

    S Talts, M Betancourt, D Simpson, A Vehtari, A Gelman. Validating Bayesian inference algorithms with simulation-based calibration. arXiv 1804.06788 [Preprint]. 2018 [cited 2020 Apr 24]. Available from: https://arxiv.org/abs/1804.06788.

56 

    V Srikrishnan, K Keller. Small increases in agent-based model complexity can result in large increases in required calibration data. arXiv:1811.08524 [Preprint]. 2019 [cited 2020 Apr 24]. Available from: https://arxiv.org/abs/1811.08524.

57 

    H Zhang, Y Vorobeychik. Empirically grounded agent-based models of innovation diffusion: a critical review. arXiv:1608.08517 [Preprint]. 2019 [cited 2020 Apr 24]. Available from: https://arxiv.org/abs/1608.08517.

58 

    EJ Murray, JM Robins, GR Seage III, S Lodi, EP Hyle, KP Reddy, et al. Using observational data to calibrate simulation models. Medical Decision Making. 2018;38(2):212–224. doi: 10.1177/0272989X17738753

59 

    JS Lee, T Filatova, A Ligmann-Zielinska, B Hassani-Mahmooei, F Stonedahl, I Lorscheid, et al. The complexities of agent-based modeling output analysis. Journal of Artificial Societies and Social Simulation. 2015;18(4):4.

60 

    N Punyacharoensin, WJ Edmunds, D De Angelis, RG White. Mathematical models for the study of HIV spread and control amongst men who have sex with men. European journal of epidemiology. 2011;26(9):695. doi: 10.1007/s10654-011-9614-1

61 

    D Moher, A Liberati, J Tetzlaff, DG Altman. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Annals of internal medicine. 2009;151(4):264–269. doi: 10.7326/0003-4819-151-4-200908180-00135

62 

    R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2018. Available from: https://www.r-project.org/.

63 

    M Hollander, D Wolfe. Nonparametric statistical methods. John Wiley & Sons, New York; 1973.

64 

    Y Benjamini, Y Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological). 1995;57(1):289–300.

6 Nov 2019

Dear Dr Hazelbag,

Thank you very much for submitting your manuscript 'Fitting individual-based models to data in HIV, tuberculosis and malaria: a systematic review' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Roger Dimitri Kouyos

Associate Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Dear authors,

Please find the attached review.

Reviewer #2: This systematic review used HIV, Malaria, and Tuberculosis studies to provide an overview of the fitting methods used in IBMs modelling infectious disease spread. This is critical as the usage of IBMs becomes more common nowadays than the deterministic models. The calibration of IBMs to data is often a challenge. Overall, this is an interesting and well-written paper. I have some comments that need to be clarified.

1. Line 98: In addition to the stochasticity, the number of parameters to estimate in IBM indicates the complexity of the parameter-search strategy (i.e. estimating only one parameter is the search on one-dimensional real space (1D), estimating two parameters is the search on two-dimensional real space (2D), and so on). The authors may comment in the introduction to show how the complexity of calibration varies between IBMs.

2. Line 144: It is better to provide a reference for the most investigated epidemics.

3. Line 155: Is there any reason why the authors searched only from 2013 onwards?

4. Line 179: “the goal” needs to be clarified on this line.

5. Fig 2 lists the different parameter-search methods found, but without even a minimal explanation. It would be better to include in the appendix a table that briefly explains these methods for non-modellers.

6. Line 258: “…while the rest either relied on visual inspection as a GOF (14 articles) or used a quantitative GOF (five articles)...” The authors should discuss why some studies use manual fitting instead of formal parameter-search strategies. Such manual fitting is impossible when multiple parameters must be fitted. Another issue is that manual fitting is not reproducible because of the stochasticity in IBMs.

7. Line 300: A reference for the simulation tools would be very useful on this line.

8. This systematic review provides interesting data about the methods used to calibrate IBMs, but it does not compare the methods, indicate which method is best, or describe the strengths and limitations of each. The authors should clarify this point in the limitations section.

9. Minor comment: There is an issue in a cell in Table 1.
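[Editorial illustration, referenced from comment 1 above; it is not part of the manuscript or of Reviewer #2's comments, and all names and values are purely hypothetical. The minimal Python sketch below shows one way the number of calibrated parameters drives the cost of a parameter search: for a naive grid search, the number of model runs grows exponentially with the dimension of the parameter space.]

# Illustrative sketch: the cost of exhaustively exploring a parameter space
# grows as (points per axis) ** (number of calibrated parameters).
from itertools import product

def grid(bounds, points_per_axis=10):
    """Yield every parameter set on a regular grid over the given (low, high) bounds."""
    axes = [
        [low + i * (high - low) / (points_per_axis - 1) for i in range(points_per_axis)]
        for (low, high) in bounds
    ]
    return product(*axes)

if __name__ == "__main__":
    for n_params in (1, 2, 3, 5):
        bounds = [(0.0, 1.0)] * n_params          # unit bounds for each calibrated parameter
        n_runs = sum(1 for _ in grid(bounds))     # each parameter set would require one IBM run
        print(f"{n_params} calibrated parameter(s): {n_runs} model runs at 10 points per axis")

[Such exhaustive search quickly becomes untenable for IBMs with many free parameters, which is why more efficient sampling or optimisation strategies are typically used instead.]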

Reviewer #3: The authors did painstaking and very valuable work to identify articles that fitted an individual-based model (IBM) to data on HIV, tuberculosis (TB), and malaria, and to assess the proportion of those articles that reported the parameter-search strategy and the type of parameter-search strategy used. This work is particularly important as the field aims for more transparency about the optimization methods used in modelling work. However, the authors could go a little further in order to answer questions such as:

• Does the proportion of articles reporting the parameter search strategy used vary according to the disease studied (HIV, TB, malaria)?

• What about uncertainty? Is the uncertainty related to the parameter-search strategy (e.g. a confidence interval or credible interval) reported? Does the proportion of articles reporting uncertainty depend on the field (i.e. HIV, TB, malaria) or on the parameter-search strategy used (sampling or optimization strategy)?

• How many parameters are estimated in each study? Does this number depend on the parameter-search strategy? Are the most complex models (i.e. the ones with the highest number of parameters to estimate) the ones that do not report uncertainty or the search strategy?

Answering these questions could help better identify the articles that report the search method less frequently (e.g. assess whether this is related to the field studied). Reporting the number of parameters estimated and whether uncertainty is reported could also provide a broader understanding of the lack of transparency that we face in some fields.

In addition, I have a few minor comments:

• Lines 49-54: This is mainly repetition of the results already presented in the previous paragraph. Consider removing this paragraph and replacing it with a discussion of the results, e.g. something like the second part of the “Author summary” (lines 65-72).

• Lines 98-99: To me, it is not clear why greater complexity could make exact likelihood calculation impossible. I would rather say that greater complexity prevents one from identifying the exact maximum-likelihood estimator, but it should not prevent calculation of the exact likelihood. [A sketch relating to this point appears after this list of comments.]

• Line 129: What kind of stochasticity do you refer to here? Would it not be relevant to report on stochasticity in your systematic review as well?

• Line 155: Why only from 2013 and not before?

• Line 166: Maybe you could mention earlier (in the abstract?) that you will focus on studies informing public health policy.

• Line 241: In the “GOF” column of Table 1 there is a formatting problem, as we cannot read the whole text in the “Suen et al.” column.

• Line 271: It is not clear how you could obtain a weight (i.e. a single number) from the “inverse of the confidence interval”, which consists of two numbers.

• Lines 309-323: Consider rewriting this paragraph, as it currently repeats what you already said in the “Results” section. I see the point of summarizing the results, but this could be done more concisely.

• Line 377: Was it not 48% (40/84) before, instead of 49%?
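[Editorial note, referenced from the comment on lines 98-99 above; the notation is ours, not the manuscript's. One way to make that point concrete is to write the likelihood of an IBM as a marginalisation over the unobserved individual-level event history:

\[
L(\theta) \;=\; \Pr(y \mid \theta) \;=\; \sum_{x} \Pr(y \mid x, \theta)\, \Pr(x \mid \theta),
\]

where $y$ denotes the observed calibration targets, $\theta$ the model parameters, and $x$ an individual-level event history that the IBM can simulate. Each term may be simple to evaluate, but the sum (or integral) runs over all possible histories, so for a complex IBM the exact likelihood itself, and not only its maximiser, is typically unavailable; this is one sense in which complexity can preclude exact likelihood calculation.]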

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: H.H. Ayoub

Reviewer #3: No

Submitted filename: reviewer.docx

10 Jan 2020

Submitted filename: Rebuttal.docx

10 Mar 2020

Dear Dr. Hazelbag,

Thank you very much for submitting your manuscript "Calibration of individual-based models to epidemiological data: a systematic review" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers raised only one minor issue that still requires attention. Based on the reviews, we are likely to accept this manuscript for publication, provided that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roger Dimitri Kouyos

Associate Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Good job. I have no additional comments.

Reviewer #2: The authors have satisfactorily responded to all my questions and made the necessary changes to the manuscript.

Reviewer #3: All the previously reported issues have been addressed by the authors. The authors might, however, check the following issue. In Fig 3, the title (and legend) mention that the figure reports the number of target statistics, while the y-axis label mentions the number of calibrated parameters. This must be corrected. Additionally, the authors could present both figures (number of targets and number of calibrated parameters according to the parameter-search strategy) side by side in the main manuscript.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods


27 Mar 2020

Submitted filename: Response_to_reviewers.docx

21 Apr 2020

Dear Dr. Hazelbag,

We are pleased to inform you that your manuscript 'Calibration of individual-based models to epidemiological data: a systematic review' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Roger Dimitri Kouyos

Associate Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor

PLOS Computational Biology

***********************************************************


1 May 2020

PCOMPBIOL-D-19-01519R2

Calibration of individual-based models to epidemiological data: a systematic review

Dear Dr Hazelbag,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom | ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Published article: https://doi.org/10.1371/journal.pcbi.1007893