Semi-parametric Bayesian Inference for Optimal Dynamic Treatment Regimes via Dynamic Marginal Structural Models

Considerable statistical work on dynamic treatment regimes (DTRs) has been done in the frequentist paradigm, but Bayesian methods may have much to offer in this setting, as they allow for the appropriate representation and propagation of uncertainty, including at the individual level. In this work, we extend recently developed Bayesian methods for marginal structural models (MSMs) to arrive at inference for DTRs. We do this 1) by linking the observational world with a world in which all patients are randomized to a DTR, thereby allowing for causal inference, and then 2) by maximizing a posterior predictive utility, where the posterior distribution has been obtained from non-parametric prior assumptions on the observational-world data-generating process. Our approach relies on Bayesian semi-parametric inference, in which inference about a finite-dimensional parameter is made while working within an infinite-dimensional space of distributions. We further study Bayesian inference of DTRs in the double robust setting by using posterior predictive inference.


Introduction
Precision medicine is a research area that seeks to tailor patient care to improve health outcomes while reducing over-treatment. For conditions that require sustained therapy through time, assigned treatments may vary through stages of the treatment process. To identify treatment strategies that follow the principles of precision medicine, stage-specific treatments must be allowed to change with patients' evolving characteristics. These treatment strategies are termed dynamic treatment regimes (DTRs). DTRs contrast with static treatment regimes, in which time-varying treatments are assigned at study start. One tool employed to infer about time-varying treatments is the marginal structural model (MSM). These models were developed to evaluate the effect of static regimes (Robins and others, 2000) and later extended to evaluate adherence to DTRs (Murphy and others, 2001) and to identify optimal DTRs (Orellana and others, 2010; van der Laan and Petersen, 2007). MSMs rely on an appealing estimation strategy: they allow scientists to target a finite set of causal estimands without requiring restrictive assumptions about the family of data-generating distributions. Semi-parametric methods like these have mostly been studied from a frequentist viewpoint.
Semi-parametric methods are attractive as they avoid specifying fully parametric probabilistic models that face a high risk of misspecification. These methods may be contrasted with the conventional Bayesian approach to inference, which multiplies a parametric likelihood by a prior. In simple settings, this approach works well, but in more complex settings, such as sequential decision-making, the correct specification of a likelihood is highly suspect. Some work has been done examining the effects of model misspecification in Bayesian inference. For example, Walker (2013) shows that under some conditions, parameters in the misspecified model converge to the minimizers of the Kullback-Leibler divergence. Although this is reassuring, it does mean that inference cannot be guaranteed to be consistent and, consequently, treatment recommendations based on misspecified models could be suboptimal. Furthermore, in a setting with time-varying confounding and mediation, the correct specification of a likelihood with parameters representing causal treatment effects will not yield fruitful results; this is because only confounded data are available and these data follow a different probability law. One approach that may guarantee consistency is Bayesian inference via completely non-parametric specifications. In the DTR setting, Bayesian non-parametrics have been used to estimate the effect of a small number of dynamic regimes (Xu and others, 2016), but when the family of regimes grows, this approach may not be feasible for identifying optimal regimes, owing to computational limitations. Generally, it is unresolved how Bayesians may best capitalize on semi-parametric approaches to inference about DTRs, and this is one of the challenges that our work addresses.
A variety of other methods for estimating the effect of DTRs have been proposed, for example the g-methods, including g-computation (Robins, 1986) and G-estimation of structural nested models (Robins, 1993). Other ways to identify optimal DTRs include Q-learning (Zhao and others, 2009) and outcome-weighted learning (Zhao and others, 2012). In a Bayesian setting, a standard parametric approach to inference requires specifying the full dynamics of the data-generating process in order to learn about dynamic regimes. For example, Saarela and others (2015a) use a predictive Bayesian approach that requires the specification of parametric distributions for outcomes and intermediate covariates in order to identify optimal DTRs. Murray and others (2018) propose a Bayesian adaptation of Q-learning that utilizes machine learning methods for flexible modeling; however, the approach still relies on likelihoods for stage-specific rewards/outcomes. Exceptionally, a few researchers have explored the use of Bayesian non-parametric methods in the DTR setting; Arjas and Saarela (2010) take this approach, but their method is not computationally feasible as the number of confounders increases.
Ideally, Bayesians would target a finite-dimensional estimand that indexes a large family of regimes, all while working within an infinite-dimensional class of data-generating distributions. Recent work has elucidated ways in which semi-parametric inference may be viewed through a Bayesian lens. First, let us review the frequentist setup. Frequentist semi-parametrics begins with an estimating function which, under certain modeling assumptions (e.g. for the mean), has expectation zero. For finite samples, setting the estimating function equal to zero and solving for a parameter of interest β yields an estimator β̂_n which, under regularity conditions, is consistent and asymptotically normal. A framework for Bayesian semi-parametric inference should allow us to take a similar approach. It was not until recently that MSMs for static regimes were provided with a Bayesian motivation by considering the maximization of an expected posterior predictive utility (Saarela and others, 2015b), which required solving for β in a manner analogous to the frequentist procedure. Later, in a similar spirit, Bayesian double robust inference was motivated (Saarela and others, 2016). Other recent approaches have further considered inference via utility functions (Bissiri and others, 2016) and through the loss-likelihood bootstrap (Lyddon and others, 2019).
What is particularly liberating about these inferential procedures is that Bayesian methods can be used to infer about parameters that are not necessarily embedded in a likelihood, which would undoubtedly be misspecified. However, none of these approaches have examined causal inference for optimal DTRs.
Our work builds on the general framework developed by Saarela and others (2015b) for performing Bayesian causal inference with MSMs. Those authors focused on inferring about stage-specific causal treatment effects of static regimes. As it is well established that MSMs can also be used to infer about (optimal) DTRs, our work examines how to use this general framework to perform Bayesian causal inference for DTRs. This requires us to carefully interpret the estimands of interest, so that we may conceive of a counterfactual world that allows for causal inference. In the double robust setting, we explore posterior predictive inference for DTRs. This approach to inference was proposed by Saarela and others (2016), but it has only been studied in the cross-sectional setting. We transparently lay out the use of this new framework for Bayesian causal inference and, with this in mind, explore the performance of this approach via simulations with treatment rules like "assign treatment when a covariate value x exceeds a threshold θ", with the aim of identifying the θ_opt that optimizes a final outcome. Additionally, to illustrate how this methodology may be used in practice, we consider an analysis of HIV therapy using data from the North American AIDS Cohort Collaboration on Research and Design (NA-ACCORD), where we aim to learn whether to tailor on FIB4, a measure of liver scarring, in deciding when to switch antiretroviral therapies, with the aim of minimizing long-term liver damage.
In addition to the above-mentioned contributions, we note that frequentist uncertainty quantification does not allow decision-makers to ask whether a new patient will benefit from the therapy suggested by an optimal DTR. As we will elaborate, Bayesian posterior predictive inference allows decision-makers to assess the probability that a therapy is optimal for a specific patient, thereby allowing for individualized care. To our knowledge, no other approach quantifies uncertainty in the patient-level decision-making process.
The approach to inference presented here uses the posterior predictive distribution to answer causal questions about DTRs; there is no need to model counterfactual outcomes directly. The advantages and detriments of counterfactuals have been studied by, for example, Dawid (2000). Arjas (2012) presents an approach similar to the one taken here, where the quantities of interest are expected conditional outcomes.

Estimation Strategy
In this section, we first describe the inferential setting and motivate Bayesian inference via a utility maximization framework. We follow this by a precise definition and formulation for connecting two probability laws: the observational world law and the law that allows us to draw causal inference about optimal DTRs by eliminating confounding. We then provide a prior that facilitates robust inference in the developed framework. Lastly, we examine specific utilities that allow for causal inference about optimal DTRs. Some of the developments parallel Saarela and others (2015b) but require some specific considerations for our context; we also take the opportunity to emphasize some of the nuanced arguments present in this framework.

Inferential Setting
We consider a sequential decision problem with K decision points and a final outcome y to be observed at stage K + 1. Decisions taken up to stage k give rise to a sequence of treatments z̄_k = (z_1, ..., z_k), z_j ∈ {0, 1}. At each stage k, a set of covariates x_k is available for decision-making, and it is assumed that these consist of all time-fixed and time-varying confounders. To denote covariate history up to time k, we write x̄_k = {x_1, ..., x_k}.
Subscripts are omitted when referencing history through stage K. We denote a DTR-enforced treatment history by g(x̄) = (g_1(x̄_1), ..., g_K(x̄_K)). Our focus is restricted to deterministic DTRs. Throughout, we consider a family of DTRs indexed by r ∈ I, giving G = {g_r(x̄); r ∈ I}. The index is omitted when it is clear that our focus lies on a single DTR. Treatment and covariate histories may be considered under the probability laws of two worlds: the observational world O, which denotes the law giving rise to the data in the study population, and the experimental world E, which denotes a world in which causal inference may be performed. In the next sections, the definition of E will be made more precise. Lastly, variables sampled from a posterior distribution are marked with an asterisk (*).
As in Saarela and others (2015b), we assume that for each i = 1, ..., n, n + 1, ..., the sequences b̄_i = (y_i, x̄_i, z̄_i) are infinitely exchangeable, allowing us to deduce the de Finetti representation (as in Bernardo and Smith (2009)) in the observational world:

p_O(b̄_1, ..., b̄_n) = ∫ ∏_{i=1}^n { p(y_i | x̄_i, z̄_i, θ) ∏_{j=1}^K p_O(x_ij | x̄_{i(j−1)}, z̄_{i(j−1)}, θ) p_O(z_ij | z̄_{i(j−1)}, x̄_ij, γ_j) } dP(θ, γ). (2.1)

In Web Appendix A, we provide a more general representation for cases in which there may be unmeasured causes u of both the intermediary and the final outcome. Outcomes do not inform the treatment assignment mechanism, which is characterized by a parameter γ (i.e. p_O(γ | b̄) ∝ p(γ | x̄, z̄)) (Saarela and others, 2015b). The no-unmeasured-confounders assumption (Arjas, 2012) allows us to model the treatment assignment probabilities in equation (2.1) with observed covariates only, as p_O(z_ij | z̄_{i(j−1)}, x̄_ij, γ_j). This assumption is not often encountered outside the counterfactual framework, so we state it in Web Appendix A.

Bayesian MSMs for Dynamic Regimes
Saarela and others (2015b) have previously considered Bayesian MSMs to estimate the stage-specific effect of static regimes. However, in a precision medicine setting, it is not immediately clear how to employ this method of inference to infer about DTRs. In what follows, we adapt their work to the dynamic MSM setting for DTRs, attempting in the process to clarify the nuances in this general framework. To allow MSMs to make Bayesian inference about optimal DTRs, we must make several considerations. First, consider a utility function U(b̄, g, β), which represents a patient's utility as a function of patient covariates and regime assignment, parameterized by an unknown parameter β. This utility may take any form relevant to the decision-maker (further details about this decision-theoretic approach may be found in Walker (2010)). We will see that some specific utilities allow us to infer about the causal parameters of interest. As Bayesian decision-makers, we are interested in finding the value of β that maximizes the posterior expected utility E_E[U(b̄*, g*, β) | b̄]. This is an expectation taken with respect to the experimental measure, in which patients are randomized to regimes in G at study start with probability p(g). When we consider a finite set of regimes in which patients have equal probability of randomization, we may replace this probability with 1/C_G, where C_G = |I|. In the experimental setting, consider v̄_i = (b̄_i, g_i) ≡ (x̄_i, z̄_i, y_i, g_i), and assume infinite exchangeability to obtain:

p_E(v̄_1, ..., v̄_n) = ∫ ∏_{i=1}^n { p(g_i) p(y_i | x̄_i, z̄_i, θ) ∏_{j=1}^K p_E(x_ij | x̄_{i(j−1)}, z̄_{i(j−1)}, θ) p_E(z_ij | z̄_{i(j−1)}, x̄_{i(j−1)}, g_i, α_j) } dP(θ, α). (2.2)

Note that p_E(z_ij | z̄_{i(j−1)}, x̄_{i(j−1)}, g_i, α_j) = 1_{g(x̄_{i(j−1)})}(z_ij), as treatment is deterministically assigned conditional on regime. For convenience, we re-express the product of these indicators across all stages as a single indicator that the full treatment sequence complies with the regime. This representation differs from that presented in Saarela and others (2015b), as the experimental world here differs. Now, we seek to link E and O. In particular, we make this link with respect to the posterior predictive distribution. Note that considering the measures E and O in a predictive inferential setting allows us to bypass the use of counterfactual quantities and to directly consider the conditional distributions of y given z (Arjas, 2012). For any utility, an importance sampling argument yields

E_E[U(b̄*, g*, β) | b̄] = (1/C_G) Σ_{r∈I} E_O[w_r U(b̄*, g_r, β) | b̄]. (2.3)

Randomization to regime g_r is equiprobable for all regimes in our experimental world; this is captured by the constant C_G (see Web Appendix A for more details). The weights w_r in equation (2.3) are given by

w_r = ∏_{j=1}^K p_E(z*_j | z̄*_{j−1}, x̄*_{j−1}, g_r) / p_O(z*_j | z̄*_{j−1}, x̄*_j, γ_j) = ∏_{j=1}^K 1_{g_r(x̄*_{j−1})}(z*_j) / p_O(z*_j | z̄*_{j−1}, x̄*_j, γ_j).
The denominator is the well-known treatment probability under the observational measure; the numerator is the probability of a sequence of treatments conditional on regime assignment. Note that this weight formula differs from that presented in Saarela and others (2015b), though the general procedure is the same. For equation (2.3) to hold over the entire support of the data, we require that, for each g, p_E(b̄* | g, b̄) be absolutely continuous with respect to P_O; this is equivalent to the positivity condition cited in the causal inference literature. Practically, this means that if a patient following regime g has recorded history (x̄_k, z̄_{k−1}) and receives treatment z_k, then in the observational world we should be able to find patients of this sort. Note that, as in the frequentist setting, these dynamic MSM weights are not stabilized, and the above argumentation clarifies why the usual stabilization is not possible in the DTR framework. Although importance sampling can motivate inverse probability of treatment weighting (a classical approach to estimating MSMs in the frequentist setting), the inferential machinery must still come from semi-parametric theory. In Bayesian inference, importance sampling together with an appropriate prior leads to a method of inference. In the frequentist literature, the linking of two measures is not usually termed importance sampling; it is done via a Radon-Nikodym derivative. This derivative was first used by Murphy and others (2001) to connect the observational distribution with the distribution in which all patients follow a DTR, and it has been further adapted in works such as Orellana and others (2010), Johnson and Tsiatis (2004, 2005), and Hu and others (2018).
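To make the weight computation concrete, the following is a minimal Python sketch of the unstabilized dynamic-MSM weight for a single patient and a single regime g. This is an illustration only, not the paper's implementation: the arguments `g` and `propensity` are hypothetical stand-ins for the regime rule and a fitted model for the observational treatment probability.

```python
import numpy as np

def dtr_weight(z, x, g, propensity):
    """Unstabilized dynamic-MSM weight for one patient under one regime g.

    z: observed binary treatments, one per stage
    x: covariate values, one per stage
    g: illustrative regime rule mapping covariate history to {0, 1}
    propensity: illustrative fitted model giving P(z_k = 1 | history)
    """
    w = 1.0
    for k in range(len(z)):
        # Numerator: indicator that the observed treatment matches regime g.
        if z[k] != g(x[: k + 1]):
            return 0.0  # patient stopped adhering to g, so the weight is zero
        p1 = propensity(z[:k], x[: k + 1])  # P(z_k = 1 | past) in world O
        w /= p1 if z[k] == 1 else (1.0 - p1)
    return w
```

A patient who stops adhering to g receives weight zero, which is why only regime-consistent patient-stages contribute to the weighted objective.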
Now that we know how to link the expected utility in the experimental world with the observational world, we must consider how to infer about the parameter of interest β. Recall that, as Bayesian decision-makers, our best estimate for β is the one that maximizes the posterior expected utility. This requires a posterior distribution to characterize the uncertainty of this maximizer. Consequently, before specifying the utility of choice and performing the necessary maximization, we must specify a prior. The prior we consider is not placed on β ∈ B, as is done in Bayesian parametric inference; the prior is placed on the family of data-generating distributions. This Bayesian bootstrap is the same as that employed by Saarela and others (2015b); however, we have been explicit about the assumptions needed to utilize it. This bootstrap is analogous to the Bayesian bootstrap presented in Rubin (1981). Under this specification, one sample drawn from the posterior DP places probability π_i on the observed b̄_i, with π = (π_1, ..., π_n) ~ Dirichlet(1, ..., 1), and β_opt, the maximizer of the expected utility, can be expressed by maximizing the expected posterior utility:

β_opt(π) = arg max_β Σ_{i=1}^n π_i Σ_{r∈I} w_{ri} U(b̄_i, g_r, β). (2.4)

Furthermore, the uncertainty around β_opt may be characterized by noting that β_opt is a deterministic function of π; thus, uncertainty in the posterior distribution of π reflects uncertainty in β_opt. This approach to Bayesian inference is discussed by Walker (2010). We may disregard C_G for the purposes of predictive inference. Modulo Monte Carlo error, this is an exact Bayesian procedure, regardless of the sample size. In work by Saarela and others (2015b), simulations show that multiplying π_i by the importance sampling weights dampens the effect of extreme weights, thereby leading to improved variance estimators compared with those relying on asymptotic approximations, the latter tending to underestimate variance. From equation (2.4), we note that to draw inference in the experimental world we require an analytic expression for the weight w; this leads us to modeling the treatment assignment probabilities. We touch on this in Section 2.3. Furthermore, we note that inverse probability weighting methods may not be adequate in settings with many stages, as these require us to take the product of many probabilities, thereby leading to large weights and yielding both bias and imprecision (Robins and others, 2008; Scharfstein and others, 1999). We now present some utilities that allow for causal inference about DTRs.
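Before turning to specific utilities, the posterior sampling scheme can be sketched in a few lines: each Dirichlet draw π yields one value of β_opt(π). In this hedged, minimal sketch we take the negative squared error utility with a constant mean model, for which the maximizer is simply a (π, w)-weighted mean of the outcomes; all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2023)

def posterior_draws_beta(y, w, n_draws=1000):
    """Bayesian bootstrap for the maximizer of the expected utility.

    Each posterior draw pi ~ Dirichlet(1, ..., 1) gives one beta_opt(pi);
    under a constant mean model and squared error utility, beta_opt(pi)
    is the (pi * w)-weighted mean of y. Illustrative sketch only.
    """
    n = len(y)
    pis = rng.dirichlet(np.ones(n), size=n_draws)  # one pi vector per draw
    return (pis @ (w * y)) / (pis @ w)             # beta_opt(pi) per draw
```

The spread of the returned draws is exactly the posterior uncertainty in β_opt induced by the non-parametric prior.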
2.2.1 Utility as Negative Squared Error Loss: An appealing choice of utility is the negative squared error loss, U(b̄, g_r, β) = −(y − h(β, r))², where h(β, r) is a working model for the mean outcome under regime g_r. This leads to solving:

β_opt(π) = arg min_β Σ_{i=1}^n π_i Σ_{r∈I} w_{ri} (y_i − h(β, r))². (2.5)

Again, over repeated draws from Π, this is an exact Bayesian procedure for finite samples, modulo Monte Carlo variation. This procedure allows us to leverage the possibility that patients adhere to multiple DTRs, thereby contributing to the objective function multiple times. Orellana and others (2010) show that minimizing the analogous weighted sum of squares Σ_i Σ_r w_{ri}(y_i − h(β, r))² yields a consistent estimator for β when the mean model is correct. We note that dynamic MSMs are not impacted by the issues of non-regularity that arise in methods like Q-learning and G-estimation; see Web Appendix B. Analogously, our procedure can be seen to be consistent for β by computing the posterior expected utility: the β_n that maximizes it is the same one that solves the estimating equation in Orellana and others (2010). Indeed, we see why our approach may be regarded as a way to unify Bayesian inference with dynamic MSMs. Now, we need not limit ourselves to a finite family of regimes. If the family of DTRs is indexed by a continuous parameter, then a relaxed positivity condition described in Orellana and others (2010) allows us to perform inference on values of the index where positivity may not hold. This condition says that instead of requiring that we observe patients who followed all regimes of interest, we require only that patients follow a subset of regimes. More specifically, β in h(β, r) may be identified for all r ∈ I even when the positivity assumption fails for some r; it suffices to observe sufficiently many values of r for β to be identifiable. For example, a model h(β, r) = β_0 + β_1 r + β_2 r² that is correctly specified is identifiable if positivity is met for at least three values of r ∈ I. Of course, the model should be correct in the range of inference. For example, if the identified optimal r is far from the range of observed values, we should question the resulting inference. When searching for optimal DTRs via smooth modeling, we must keep in mind that there are two posteriors we are after: the first is the posterior distribution of β = (β_0,opt, β_1,opt, β_2,opt); the second is the posterior distribution of r_opt, which is a deterministic function of β.
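For the quadratic working model above, a single posterior draw's maximization reduces to weighted least squares, and r_opt follows deterministically from β. A hedged sketch, assuming the rows are the regime-augmented data (one row per patient-regime adherence) with their associated MSM weights; names are illustrative:

```python
import numpy as np

def fit_quadratic_and_ropt(r, y, w):
    """Weighted quadratic working model h(beta, r) = b0 + b1*r + b2*r^2.

    r: regime index for each augmented row; y: outcomes; w: combined
    (Dirichlet-times-MSM) weights. Returns the fitted beta and the implied
    stationary point r_opt = -b1 / (2*b2), a maximum when b2 < 0.
    """
    X = np.column_stack([np.ones_like(r), r, r**2])
    # Weighted least squares via the normal equations.
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    r_opt = -beta[1] / (2.0 * beta[2])
    return beta, r_opt
```

Repeating this over Dirichlet draws of the weights yields the two posteriors mentioned above: one for β and, through the deterministic map, one for r_opt.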

2.2.2 Utility as Negative Log Likelihood: If we choose the utility to be the log likelihood of the outcome conditional on regime assignment in E (i.e. the loss is the negative log likelihood), then for repeated samples of Π we can compute

β_opt(π) = arg max_β Σ_{i=1}^n π_i Σ_{r∈I} w*_{ri} log p(y_i | g_r, β). (2.6)

The choice of this utility is guided by the aim of minimizing the Kullback-Leibler divergence between p(y_i | g_r, β) and the data-generating distribution. β may describe the relationship between g_r and y for any r ∈ I, thus making it a target for causal inference. Interestingly, this utility actually allows us to consider conventional parametric Bayesian inference (i.e. likelihood times prior) by making use of the weighted likelihood bootstrap (Newton and Raftery, 1994). We show that ∏_r p(y_i | g_r, β)^{w*_{ri,K}} can be regarded as a weighted likelihood in order to connect the Bayesian bootstrapping procedure with the weighted likelihood bootstrap. Denote by A_i the set of regimes to which patient i adheres; then for r_1, r_2 ∈ A_i we have w*_{A_i} := w*_{r_1,K} = w*_{r_2,K}. These weights are zero otherwise. Then, we may write equation (2.6) as

β_opt(π) = arg max_β Σ_{i=1}^n π_i w*_{A_i} Σ_{r∈A_i} log p(y_i | g_r, β). (2.7)

Note that ∏_{r∈A_i} p(y_i | g_r, β)^{w*_{A_i}} is a weighted likelihood; in accordance with the weighted likelihood bootstrap, β_opt(π) may be regarded as a sample from the posterior distribution of β under a flat prior. Thus, repeated sampling from this posterior allows for quantification of uncertainty around β. Other priors may be incorporated via sampling importance resampling, but this is not essential and is not the focus of our work.

Implementation
To clearly lay out how to perform Bayesian causal inference using the proposed approach, we provide Algorithm 1. Here, the aim is to obtain a sample from the posterior distribution of β. The algorithm is shown for when the utility is proportional to the squared error loss, or to the Normal log likelihood, but it is straightforward to see how it may be adapted to other likelihoods. The data-augmentation procedure described can be further understood from Cain and others (2010), where a new row of data is created for every regime to which a patient adheres. Recall that equation (2.4) requires a model for the weights w. For a given draw of the posterior distribution, we consider the model p_O(z*_j | z̄*_{j−1}, x̄*_j, γ_j(π)), j = 1, ..., K. The parameters γ_j may be regarded as coming from a posterior utility maximization framework with the same non-parametric prior. When the utility is the log likelihood, we solve:

γ_j(π) = arg max_{γ_j} Σ_{i=1}^n π_i log p_O(z_ij | z̄_{i(j−1)}, x̄_ij, γ_j).

Then, for every draw of Π, we first fit the weighted treatment propensity model and use the resulting weight w(π) in equation (2.5). By computing E_Π{E_E[U(b̄*, g*, β) | b̄, Π]}, we indirectly incorporate the uncertainty about γ_j into the estimation procedure.
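The weighted treatment propensity model can, for instance, be a logistic regression fit by Newton-Raphson (IRLS), with the Dirichlet draw entering as case weights. This is one possible implementation sketch, not the paper's; the design matrix `X` and treatment vector `z` for a given stage are illustrative assumptions.

```python
import numpy as np

def weighted_logit(X, z, pi, n_iter=25):
    """Weighted logistic regression by Newton-Raphson for a stage-k
    propensity model, with Bayesian-bootstrap weights pi as case weights.

    X: (n, p) design matrix of treatment/covariate history terms
    z: (n,) binary treatments at this stage; pi: (n,) Dirichlet weights
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # fitted propensities
        grad = X.T @ (pi * (z - p))                    # weighted score
        H = X.T @ ((pi * p * (1.0 - p))[:, None] * X)  # weighted information
        beta = beta + np.linalg.solve(H, grad)
    return beta
```

The fitted propensities from this model supply the denominators of the weights w(π), one stage at a time.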

Predictive Double Robust Bayesian Inference for DTRs
In the frequentist literature, inverse probability of treatment weighting (IPW) is known to be an inefficient semi-parametric procedure; it also yields inconsistent inference if the treatment models are misspecified. To gain efficiency and robustness, researchers can consider the double robust estimator for the marginal mean of a DTR. This requires specifying a series of conditional outcome models, so that consistent inference is attained when either the set of treatment models or the set of outcome models is correctly specified. We now use some of the inferential framework presented in the previous section, and first developed in Saarela and others (2016), to arrive at Bayesian double robust inference for the expected outcome of a DTR g.
Though the underlying mechanics hinge on the developments of Saarela and others (2016), examining and evaluating the use of this double robust estimator in a sequential DTR setting is of scientific pertinence.
For reasons that will be elaborated on in the following, we no longer seek to model in a unified manner the expected outcome for regimes in a family G, and therefore we no longer consider inference via utilities. To preserve the notation developed so far, it suffices to consider a family G containing a single DTR.
Consequently, identifying optimal DTRs now requires evaluating the proposed double robust estimator at each DTR of interest and comparing the expected outcomes. Effectively, these are expectations in a regime-enforced world, where everyone in the study population follows a regime g; this contrasts with the previously considered experimental world, where patients are randomized to DTRs in a family. With this aim in mind, consider a sequence of conditional predictive outcomes φ*_{k+1}, k = 1, ..., K. For k = K, these are defined as

φ*_{K+1}(x̄_K) = E_O[y* | x̄*_K = x̄_K, z̄*_K = g(x̄_K), b̄]. (3.8)

These are expected outcomes in the observational world, conditional on subjects who had covariate history x̄_k and who followed the regime g up to time k. It can be shown via an iterated conditional expectation argument that, for k < K,

φ*_{k+1}(x̄_k) = E_O[φ*_{k+2}(x̄*_{k+1}) | x̄*_k = x̄_k, z̄*_k = g(x̄_k), b̄]. (3.9)

Next, we describe how models for φ*_k may be fit in a Bayesian framework; following this, we motivate the double robust estimator for when the models for φ*_{k+1} are correct or when the models for w*_k are correct. Based on the de Finetti representation in equation (2.2), we see that the outcome models are parameterized by τ such that φ*_{k+1}(x̄_k) = φ*_{k+1}(x̄_k; τ). From equations (3.8) and (3.9) we see exactly how a model should be fit for the mean of the conditional outcomes. We should begin by fitting a model for time point k = K and continue backward; the outcomes for stage k can be computed once a model for stage k + 1 has been fit. We can treat uncertainty in τ analogously to how we treated uncertainty in γ, the parameter corresponding to the treatment assignment model in the observational world: we make it dependent on Π via a non-parametric, non-informative prior. However, instead of positing a likelihood model as was done for the treatment assignment mechanism, we consider the negative squared error loss utility and posit a model for the conditional outcomes.
Then, for every draw of Π, we can estimate τ(π); in Web Appendix C we provide details as to how τ may be estimated and incorporated into the inferential procedure.
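The backward recursion for the conditional outcome models can be sketched as follows, here with simple linear models fit by least squares among patients whose treatments are consistent with g. The data layout (`covs` as per-stage design matrices, `adhere` as per-stage adherence indicators) is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

def backward_outcome_models(covs, y, adhere):
    """Backward fitting of the conditional outcome models phi_{k+1}.

    At stage K, regress y on stage-K history among patients adherent to g
    through stage K; at each earlier stage, regress the previous stage's
    fitted values. covs: list of (n, p_k) design matrices; adhere: (n, K)
    booleans, adhere[i, k] = 1 if z_{i,k+1} matched g. Linear models only.
    """
    K = len(covs)
    target = np.asarray(y, dtype=float)
    phis = [None] * K
    for k in reversed(range(K)):
        mask = adhere[:, : k + 1].all(axis=1)  # g-consistent through stage k+1
        tau = np.linalg.lstsq(covs[k][mask], target[mask], rcond=None)[0]
        phis[k] = covs[k] @ tau                # predictions for all subjects
        target = phis[k]                       # becomes next (earlier) target
    return phis
```

In the Bayesian procedure, the least squares step would carry the Dirichlet weights π, making τ a function of π as described above.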
Ultimately, we seek to estimate E_g[y* | b̄] unbiasedly either when the conditional outcome models are correct or when the treatment models are correct. This may be achieved by noting the following equality, which follows directly from Orellana and others (2010):

E_g[y* | b̄] = E_O[ w̄*_K y* − Σ_{k=1}^K (w̄*_k − w̄*_{k−1}) φ*_{k+1}(x̄*_k) | b̄ ], (3.11)

where w̄_0 := 1. In Web Appendix C.2, we show how to arrive at this equation and that it is unbiased.
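Given fitted cumulative weights and outcome-model predictions, the doubly robust value estimate of equation (3.11) can be evaluated in a few lines; the array names and shapes below are illustrative assumptions.

```python
import numpy as np

def dr_value(y, w_stage, phi):
    """Doubly robust estimate of E_g[y] in augmented-IPW form.

    y: (n,) final outcomes
    w_stage: (n, K) cumulative weights through stage k (w_0 = 1 implicitly)
    phi: (n, K) outcome-model predictions, column k holding phi_{k+1}(x_bar_k)
    """
    n, K = w_stage.shape
    w_prev = np.ones(n)            # w_0 = 1
    augment = np.zeros(n)
    for k in range(K):
        # Augmentation term (w_k - w_{k-1}) * phi_{k+1}; mean zero when the
        # treatment models are correct.
        augment += (w_stage[:, k] - w_prev) * phi[:, k]
        w_prev = w_stage[:, k]
    return np.mean(w_stage[:, -1] * y - augment)
```

When all weights equal one (perfect adherence and exact propensities), the augmentation vanishes and the estimate reduces to the plain mean of y, which is a useful sanity check.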
Now that we have identified our estimator of choice for any posterior distribution, let us use the same prior used in the singly robust case and obtain the Bayesian non-parametric bootstrap as the posterior.
Then, conditional on a posterior draw, we write (3.10) as

Ê_g[y* | b̄, π] = Σ_{i=1}^n π_i { w̄_{iK}(π) y_i − Σ_{k=1}^K (w̄_{ik}(π) − w̄_{i(k−1)}(π)) φ_{k+1}(x̄_{ik}; τ(π)) }. (3.12)

Models for the φs and ws now depend on Π and may be incorporated into the inferential process as in (2.3).
Furthermore, we may compute the posterior distribution of E_g[y* | b̄] by resampling the Dirichlet weights, thereby obtaining a double robust estimator for the value of a DTR, together with its uncertainty. As mentioned, the double robust Bayesian estimator proposed is only for the marginal mean of a DTR, not for the parameters in a model for the marginal mean linking a family of DTRs (e.g. E[y* | b̄, g_r] = β_0 + β_1 r). To obtain double robust estimators of the latter, an appropriate utility would have to be proposed so that, when importance sampling is used to link the experimental world with the observational world, the resulting expression in the observational world is doubly robust. Then, to use the proposed estimator to identify optimal DTRs, we must perform a grid search. Murphy and others (2001) suggested that outcome models should be coherently parameterized so that, for k_2 > k_1, a model conditional on information up to time k_2 would yield a model conditional on information up to time k_1 when covariates between k_2 and k_1 are marginalized.
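The grid search over candidate regimes is then straightforward: evaluate the doubly robust value at each threshold and keep the maximizer. In this sketch, `value_fn` is a hypothetical stand-in for the per-regime doubly robust estimator; any evaluator of a regime's value could be plugged in.

```python
import numpy as np

def grid_search_theta(value_fn, grid):
    """Evaluate a regime-value estimator at each candidate threshold theta
    and return the maximizing theta along with all evaluated values."""
    vals = np.array([value_fn(t) for t in grid])
    return grid[int(np.argmax(vals))], vals
```

Repeating the search over posterior draws of the weights and outcome models would yield a posterior sample of the optimal threshold itself.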

Individualized Decision Making
Now that we have developed the inferential approach, we turn our attention to examining how to incorporate it into an individualized decision-making scheme. This consideration is particular to the DTR setting that we explore. For exemplary purposes, we focus on the following class of regimes: treat if x_k > θ, for k = 1, ..., K.
Suppose that a new patient is observed with covariate value x_1^new. Our interest is in deciding whether this patient should be treated, based on our belief about the optimal θ. To do this, we are interested in computing P(x_1^new > θ_opt | b̄). This may be done by taking a sample of size m from the posterior distribution of θ_opt and computing the proportion of draws for which x_1^new > θ_opt. Indeed, the analogous probability p_k can be computed at all stages k. Effectively, this probability informs the decision-maker about how certain they should be in switching treatment given the patient's current health status, if the aim is to select an optimal therapy. It is then up to the decision-maker to make a treatment decision given that probability. Note that a patient's decision about treatment at a given stage does not alter the optimality of subsequent decision rules, though it may alter the optimality of the overall treatment course. This individualized approach may be taken with any optimal regime derived through the proposed methodology, and we elaborate on this in the simulations.
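With posterior draws of θ_opt in hand, the patient-specific probability is a one-line computation. This sketch assumes `theta_opt_draws` holds the m posterior samples obtained as described above; the names are illustrative.

```python
import numpy as np

def prob_treat_new_patient(x_new, theta_opt_draws):
    """Posterior probability that the optimal rule assigns treatment to a new
    patient with covariate x_new at this stage, i.e. P(x_new > theta_opt | data),
    estimated as the fraction of posterior draws of theta_opt below x_new."""
    return np.mean(x_new > np.asarray(theta_opt_draws))
```

A probability near 0.5 flags genuine uncertainty about whether treatment is optimal for this patient, which is precisely the individual-level information a frequentist point estimate of θ_opt cannot convey.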

Simulations
In this section, we use simulations to evaluate how this Bayesian approach to inference can be used to infer about optimal DTRs. We focus on multi-stage problems with a sample size of n = 500. All results are presented over 500 Monte Carlo replications. For comparison, we also provide results for the frequentist approach. Generally, the strategy was to induce time-varying confounding with treatment-confounder feedback.
All intermediary variables were Gaussian, and all treatment variables Bernoulli. We followed the approach in Stephens (2015) to generate outcomes that allow for the analytic identification of the optimal regimes. The true value (expected outcome) under the optimal regime was obtained by generating a large sample of data in which patients adhered to the optimal regime. Further simulation details can be found in Web Appendix D, as well as results for other sample sizes and for when intermediary variables are Gamma-distributed.
For simulation I, we considered a family of regimes indexed by (θ_1, θ_2), where treatment at stage k is assigned when the tailoring covariate exceeds θ_k. The known optimum is (θ_{1,opt}, θ_{2,opt}) = (0.4, 0.8). We evaluate the performance of both the IPW and the double robust estimator, computing these estimators for discrete values of θ_k ∈ {0, 0.1, 0.2, ..., 0.9, 1}. Table 1 shows the results of the estimation procedure.
The first column indicates the type of estimation procedure used. The second refers to the model specification: for the double robust estimator, "None" means that both the treatment and outcome models are misspecified; "Treat" means the treatment models are correctly specified; "Outcome" means the outcome models are correctly specified; and "Both" means all models are correctly specified. "IPW" refers to the IPW estimator with correctly specified treatment models. For incorrectly specified models, we use intercept-only regressions. For the Bayesian approach, point estimates are given by the posterior mean. For simulation I, the mean outcome at the optimal regime can be seen (from the data-generating mechanism) to be 0.
In Table 1 we observe that estimators with at least one set of models correctly specified are unbiased. As expected, when the treatment and outcome models are all correctly specified, efficiency is maximized. The coverage probability measures the proportion of replications in which the true optimum falls inside a 95% credible interval. As far as we are aware, there is no way to obtain a confidence interval for the optimal threshold in the frequentist setup: there, we have evaluated the estimator on a grid of thresholds θ and identified the θ_opt that maximizes the mean outcome, whereas in the Bayesian setup we have sampled from the posterior distribution of θ_opt. "Estimated Outcome Train Pop." refers to the estimated expected outcome under the optimal regime, which is known to be 0; "Mean Outcome Test Pop." refers to the mean outcome under the optimal DTR in a new population with a different distribution for the intermediate covariates. Examining the mean outcome in a test population indicates how the identified optimal DTR will perform once deployed in the real world. We see that the frequentist and Bayesian methods perform similarly and, surprisingly, that the "no models correct" scenario leads to good performance in the test set, though this is due in part to the scale of the value function, which has a narrow range (see Web Appendix D). The uncertainty measures for θ_k,opt appear to be slightly higher for the Bayesian analysis than for the frequentist analysis. One reason for this may be that the Bayesian method acknowledges uncertainty in the outcome and treatment models, whereas the frequentist method takes these as known. The coverage probability for θ_1 in the no models correct scenario is low, yet, surprisingly, it is close to nominal for θ_2. For the other setups, the coverage probabilities are slightly higher than their nominal value. Of course, it is important to keep in mind that this was a discrete problem and the coverage probabilities depend on the coarseness
of the exploration grid; we have observed in other simulations that finer grids lead to further tightening of the confidence intervals toward the nominal value (results not shown). However, this must be balanced against the computational cost of an estimation procedure on a fine grid. We can now ask whether newly observed patients will benefit from the estimated optimal rule. For illustrative purposes, we restrict the family of regimes to have a common threshold across periods. For simulation II, we explore a family of regimes indexed by (ψ_1, ψ_2, ψ_3) such that treatment is assigned when ψ_1 x_k1 + ψ_2 x_k2 > 0.5 − 3ψ_3 u, k = 1, ..., 4, where x_k1, x_k2 are normally distributed intermediary covariates and u is a binary baseline covariate. This regime has the interpretation that treatment should be given if the weighted sum of x_k1 and x_k2 is above a threshold, where the threshold depends on the patient's baseline covariate u. Increments of 0.05 were used for ψ_1, ψ_2 and of 0.1 for ψ_3. Web Appendix D.3 shows the data generating mechanism for this setup.
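The grid-based identification of the optimum can be sketched as follows; the value function below is a toy stand-in for the IPW or double robust estimate of the regime value:

```python
import numpy as np

def optimal_threshold(value_estimate, grid):
    """Evaluate a regime-value estimator on a discrete grid of thresholds
    and return the maximiser, mirroring the grid search over theta_k."""
    values = np.array([value_estimate(theta) for theta in grid])
    return grid[int(np.argmax(values))], values

# Toy value curve peaked at 0.6, standing in for the estimated E[y* | g_theta]
grid = np.round(np.arange(0, 1.01, 0.1), 1)
theta_opt, values = optimal_threshold(lambda t: -(t - 0.6) ** 2, grid)
```

Repeating this search under Dirichlet re-weighting of the data, as described below, yields posterior draws of θ_opt rather than a single point estimate.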
The optimal regime is given by ψ_1opt = ψ_2opt = 0.5 and ψ_3opt = 0.1, with a value of 1. We see from Table 2 that all scenarios except the no models correct scenario are unbiased, with the all models correct scenario yielding the best results. Correctly specifying the outcome model improves the estimation of the value at the optimum relative to correctly specifying only the treatment model. We do not include a ψ_2 column in the table, as the constraint ψ_1 + ψ_2 = 1 makes it redundant. We note again that the coverage probabilities are high; recall that this is driven by the coarseness of the exploration grid, and a finer grid in this problem would be very computationally intensive. Web Appendix D.2 presents a similar simulation without the binary covariate.
In Figure 2 we further illustrate how the Bayesian framework can be leveraged for individualized inference.
We observe, for one replicate, the probability that a patient should be treated under the optimal decision rule, given a set of covariates. These probabilities are computed from the posterior distribution of (ψ_1opt, ψ_2opt, ψ_3opt) via P(ψ*_1opt x_11 + ψ*_2opt x_12 + 3ψ*_3opt u > 0.5). There are regions of high certainty that patients should, or should not, receive treatment according to the optimal rule; there are also regions with more uncertainty regarding the choice of optimal treatment. In fact, patients with baseline covariate u = 0 face higher uncertainty overall than those with u = 1.
There is some debate in the literature on the choice of doubly versus singly robust estimators; see, for example, Kang and Schafer (2007) and Bang and Robins (2005). Our simulations emphasize that much is to be gained, in both precision and accuracy, by correctly specifying the outcome models, when compared with the double robust estimator with only the treatment models correct or with the IPW estimator. Efficiency is maximized when all models are correct, clarifying that these considerations are not merely theoretical; they also impact analyses of finite sample size. When deciding whether to use the singly robust or the doubly robust estimator, it is important to ask which is better understood: the treatment assignment process or the outcome process.

Case Study: Analysis of the NA-ACCORD
Treatment for HIV infection with antiretroviral therapy (ART) must be lifelong to maintain control of HIV viral replication and to improve immune function. Consequently, there is concern that some combinations of drugs may cause long-term harm. The multi-drug nature of this therapy allows for some flexibility in the treatment course. Research by Klein and others (2016) is consistent with the possibility that some ART regimens contribute to such harm. It was assumed that the variables in the outcome models explained both confounding and selection. The models that were fit can be found in Web Appendix E. Sensitivity analyses were performed to determine whether results were sensitive to model specifications. Balance from the propensity scores was assessed using standardized differences and a frequentist fit of the propensity scores; balance was examined at all stages. Outcome models were examined to ensure that the predicted distribution did not differ from the observed one.
For a fixed value of θ, patients are indicated to switch treatments when their FIB4 measurement surpasses θ. Accordingly, patients in the study could be categorized into five groups for each regime (g_θ) considered: those 1) indicated to switch but who did not switch, or who switched at the wrong time (ISNS); 2) indicated to switch and who switched (ISS); 3) not indicated to switch and who did not switch (NISNS); 4) not indicated to switch and who switched (NISS); and 5) those who were assigned to PI at baseline (NR). Patients indicated to switch were given six months to do so (a grace period). To improve the properties of the estimators, we normalized the weights in the analysis, and we assessed positivity for each candidate regime by checking whether the distribution of the propensity scores for the modeled treatment at each interval is similar in the regime-adherent and regime-non-adherent groups. The propensity to switch treatment was generally small, highlighting that relatively few individuals contribute to the estimation of our regime of interest, a limitation that must be acknowledged; more details can be found in Web Appendices E.3 and E.4. Only patients in the ISS and NISNS groups could adhere to a regime for the full study period. Consequently, patients in the other groups were artificially censored when they deviated from the specified regime. 95% credible intervals were calculated for all point estimates, approximated using 500 draws from the posterior distribution; point estimates were reported at the posterior mean. Details of the analysis plan can be found in Web Appendix E.
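The weight normalization can be sketched as follows, with hypothetical inputs: adherence indicators and adherence probabilities stand in for the fitted treatment models, and artificially censored patients receive weight zero.

```python
import numpy as np

def normalized_ipw(y, adherent, p_adhere):
    """Hajek-type normalized IPW value estimate: non-adherent
    (artificially censored) patients receive weight zero, and the
    remaining weights are normalized to sum to one."""
    w = adherent / p_adhere     # inverse-probability-of-adherence weights
    w = w / w.sum()             # normalization improves finite-sample behaviour
    return float(np.sum(w * y))

rng = np.random.default_rng(2)
p_adhere = rng.uniform(0.2, 0.8, 200)   # hypothetical adherence probabilities
adherent = rng.binomial(1, p_adhere)    # 1 = patient followed regime g_theta
y = rng.normal(1.0, 0.5, 200)           # outcome with true mean 1
est = normalized_ipw(y, adherent, p_adhere)
```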
We evaluated the estimators at thresholds from 0.4 to 2.8 in increments of 0.2; the minimum and maximum threshold values correspond to the 5th and 95th percentiles of the FIB4 distribution. In Table 3, we present follow-up information for a subset of these regimes. We did not pose a marginal structural model as a function of θ (e.g., a quadratic form), as we wanted to make use of both the IPW and double robust estimators. Although our overall sample size is large, we see that only half of the patients follow a non-PI ART regimen at study start.
Additionally, roughly 30% of ISS and NISS patients are censored or artificially censored. The number of NISNS patients varies strikingly across regimes. However, this is to be expected: for a threshold of 0.5, only a small proportion of patients are not indicated to switch, and a relatively large proportion of patients switch.

    Run regression with mean h(r, β) and with weights π_i w_i^r(π)
end
Output: Posterior distribution of β*

DATA_O is the input data, with one row per patient, and is used to fit the treatment models. AUGDATA_O is the augmented data, in which patients are duplicated for as many DTRs as they adhere to; this dataset is used to run the regression for h(r, β).
Algorithm 1: Fitting procedure for the Bayesian dynamic MSM.

Note that in (a) the points corresponding to each method are presented out of phase for illustrative purposes; in reality, the points lie on top of each other, starting at 0.4 and continuing in increments of 0.2.
generating distributions in the observational world P^O, denoted by P_F. In fact, this prior induces a prior on β: P_B(β ∈ Ω) = P_F({P^O : β(P^O) ∈ Ω}). A robust, non-informative choice of prior on the observational measure is the non-parametric Dirichlet process (DP) prior, which asymptotically concentrates around the true data-generating distribution. Stephens and others (2021) explore in detail what the Dirichlet process prior implies for a prior on a functional such as β. When DP(α, G_x) is chosen such that |α| → 0, we obtain the non-parametric Bayesian bootstrap as the posterior predictive distribution.
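Under the Bayesian bootstrap, posterior draws of a functional β(P^O) can be sketched as follows; the functional here is the population mean, purely for illustration:

```python
import numpy as np

def bayes_boot(data, functional, B, rng):
    """Bayesian bootstrap: each posterior draw re-weights the n
    observations with Dirichlet(1, ..., 1) weights and evaluates the
    functional under the re-weighted empirical distribution."""
    n = len(data)
    return np.array([functional(data, rng.dirichlet(np.ones(n)))
                     for _ in range(B)])

rng = np.random.default_rng(3)
data = rng.normal(2.0, 1.0, 300)
# Posterior draws of the mean functional beta(P) = E_P[X]
draws = bayes_boot(data, lambda d, w: np.sum(w * d), 500, rng)
```

The spread of `draws` quantifies posterior uncertainty about β without any parametric assumption on the data-generating distribution.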
Π ∼ Dir(1, ..., 1), a Dirichlet-distributed random variable with all concentration parameters equal to one. Note that, under the Bayesian bootstrap assumptions, any distribution sampled from the posterior DP is uniquely determined by Π. To compute functionals of the posterior predictive, we require p(b* ∈ A | b) = E_Π[p(b* ∈ A | b, Π)], which is estimated by resampling weights (π_1, ..., π_n) from Dir(1, ..., 1) and averaging over samples. Consequently, under the Bayesian bootstrap assumptions, we compute the expected posterior experimental-world utility via (10). From (10), we see that when the outcome models are correct the estimator is unbiased (see Web Appendix C.2). To see that it is also unbiased when the treatment models are correct, we change the form of the estimator: define h(b̄) = E_g[y* | b̄] and add 0 = Σ_{k=1}^{K} w*_{k−1}[h(b̄) − h(b̄)] to obtain the augmented form.

For the restricted family of regimes with a common threshold, θ_opt = 0.6 (see Web Appendix D). Figure 1(a) shows the probability that a patient should receive treatment z = 1 at stage 1, for a single Monte Carlo replicate. This is a step function, as θ was evaluated over a set of discrete values. Patients with low and high values of x_1 have high certainty as to whether or not they should receive treatment; patients whose covariate is near the true optimal threshold of 0.6 have low certainty. Figure 1(b) shows the same result across 500 Monte Carlo replicates, emphasizing that there is high uncertainty around the true value. It can also be useful to obtain a smooth decision curve. This may be done by evaluating the double robust estimator over a much finer grid of points, or by modeling E[y* | b̄, g_θ] via a smooth function such as β_0 + β_1 θ + β_2 θ² (quadratic) and using IPW. Figure 1(c) shows the results of the individualized rule with the quadratic model and the IPW estimator; the decision rule is much smoother and provides high certainty for most values of x_1, except those closest to 0.6. Figure 1(d) shows the Monte Carlo variation around this curve; most uncertainty is around the true value of the threshold.
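The quadratic smoothing step can be sketched as follows; the grid and value curve below are toy stand-ins for the IPW estimates:

```python
import numpy as np

# Fit the quadratic MSM  E[y* | g_theta] = b0 + b1*theta + b2*theta^2
# to value estimates on the coarse grid; the fitted curve gives a smooth
# decision rule whose maximiser is theta* = -b1 / (2 * b2).
grid = np.arange(0, 1.01, 0.1)
values = -(grid - 0.6) ** 2                 # toy value curve peaked at 0.6
b2, b1, b0 = np.polyfit(grid, values, 2)    # coefficients, highest degree first
theta_star = -b1 / (2 * b2)
```

Because the fitted curve is defined for every θ, the resulting individualized probabilities vary smoothly in the covariate rather than jumping at the grid points.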

Fig. 1: Simulation I, stage-1 individualized treatment probabilities: (a) individualized decision rule using the double robust estimator with only the treatment model correct; (b) same as (a) over 500 Monte Carlo replicates; (c) individualized decision rule using IPW with a quadratic MSM; (d) same as (c) over 500 Monte Carlo replicates.

Fig. 2: Simulation II individualized treatment probabilities using the IPW estimator: (a) stage-1 treatment probability for those with u = 0; (b) stage-1 treatment probability for those with u = 1.
Data: DATA_O
for r ← 1 to C_G do
    // Create AUGDATA_O based on regime adherence
    Replicate rows of DATA_O for patients adherent to regime g_r
end
Posit model for h(r, β)
for i ← 1 to B do                       // B is the number of posterior draws
    Draw π = (π_1, ..., π_n) ∼ Dir(1, ..., 1)
    Estimate p_O(z_k | z̄_{k−1}, x̄_k, γ_j, π) for all k
    Compute weights w_i(π), i = 1, ..., n    // n is the number of patients
    Add weights to AUGDATA_O
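A minimal Python sketch of this fitting procedure, under strong simplifying assumptions: a single adherence indicator per (patient, regime) pair replaces the stagewise treatment models, fixed adherence probabilities stand in for the refitted propensities, and h(r, β) is taken to be quadratic in a scalar regime index θ_r. All variable names are hypothetical.

```python
import numpy as np

def fit_bayesian_msm(y, adheres, p_adhere, thetas, B, rng):
    """Sketch of the algorithm: the augmented data holds one row per
    (patient, adhered-to regime) pair; h(r, beta) = b0 + b1*theta_r +
    b2*theta_r^2 is fit by weighted least squares with weights
    pi_i * w_i(pi) for each Bayesian bootstrap draw."""
    n, _ = adheres.shape
    rows_i, rows_r = np.nonzero(adheres)       # augmented-data indices
    t = thetas[rows_r]
    X = np.column_stack([np.ones_like(t), t, t ** 2])
    yy, ww = y[rows_i], 1.0 / p_adhere[rows_i]  # IPW part of the weights
    draws = []
    for _ in range(B):                          # B posterior draws
        pi = rng.dirichlet(np.ones(n))          # Bayesian bootstrap weights
        w = pi[rows_i] * ww                     # pi_i * w_i(pi)
        WX = X * w[:, None]
        draws.append(np.linalg.solve(X.T @ WX, WX.T @ yy))
    return np.array(draws)                      # posterior draws of beta*

rng = np.random.default_rng(4)
n, thetas = 100, np.array([0.2, 0.4, 0.6, 0.8, 1.0])
adheres = rng.binomial(1, 0.5, (n, len(thetas)))  # hypothetical adherence matrix
y = rng.normal(0.0, 1.0, n)
p_adhere = rng.uniform(0.3, 0.9, n)
draws = fit_bayesian_msm(y, adheres, p_adhere, thetas, 50, rng)
```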

Table 3: NA-ACCORD case study: follow-up information for a subset of regimes (n = 22,768). Note: ISNS = "indicated to switch & did not switch"; ISS = "indicated to switch & switched"; NISNS = "not indicated to switch & did not switch"; NISS = "not indicated to switch & switched"; NR = "received PI at baseline".