Skip to contents

Introduction

Micro-randomized trials (MRTs) are designed to evaluate the effectiveness of mobile health (mHealth) interventions delivered via smartphones. In practice, the assumptions required for MRTs are often difficult to satisfy: randomization probabilities can be uncertain, observations are frequently incomplete, and prespecifying features from high-dimensional contexts for linear working models is also challenging. To address these issues, the doubly robust weighted centered least squares (DR-WCLS) framework provides a flexible procedure for variable selection and inference. The methods incorporates supervised learning algorithms and enables valid inference on time-varying causal effects in longitudinal settings.

This vignette introduces the MRTpostInfLASSO package. Its core function, DR_WCLS_LASSO, allows users to perform variable selection, estimate time-varying causal effects and make valid inferences conditional on the selected variables.

Individual-level data of an MRT can be summarized as \(\left\{O_1, A_1, O_2, A_2, \cdots, O_T, A_T, O_{T+1} \right\}\) where \(T\) is the total decision times, \(O_t\) is the information collected between \(t-1\) and \(t\), and \(A_t\) is the treatment provided at time \(t\). Here we consider treatment \(A_t \in \left\{0,1\right\}\). Treatment options are intended to influence a proximal outcome \(Y_{t+1} \in O_{t+1}\).

Denote history \(H_t = \left\{ O_1, A_1, O_2, A_2, \cdots, A_{t-1}, O_t \right\}\) and randomized probabilities \(\mathbf{p} = \left\{p_t(A_t \mid H_t) \right\}_{t=1}^T\). The DR-WCLS criterion is given by

\[ \mathbb{P}_n \Bigl[ \sum_{t=1}^{T} \tilde{\sigma}^2_t(S_t)\, \Bigl( \frac{W_t(A_t-\tilde{p}_t(1 \mid S_t)) (Y_{t+1}-g_t(H_t,A_t))}{\tilde{\sigma}^2_t(S_t)}\, +\beta(t;H_t)-f_t(S_t)^T \beta\Bigr)\, f_t(S_t) \Bigr] = 0 \]

where \(\beta(t;H_t) := g_t(H_t,1) - g_t(H_t,0)\) is the causal excursion effect under the fully observed history \(H_t\), and \(\tilde{\sigma}^2_t(S_t) := \tilde{p}_t(1 \mid S_t)(1-\tilde{p}_t(1 \mid S_t))\).

The \(\hat{\beta}_n^{(DR)}\) is a consistent estimator of the true \(\beta\) if either the randomization probability \(p_t(A_t \mid H_t)\) or the conditional expectation \(g_t(H_t, A_t)\) is correctly specified.

The DR_WCLS algorithm is as follows:

Step I: Randomly split the \(n\) individuals into \(K\) equal folds \(\left\{I_k\right\}^K_{k=1}\) assuming \(n\) is a multiple of \(K\). Let \(I^∁_k\) denote the complement of fold k.

Step II: For each fold \(k\), use data from \(I^∁_k\) to estimate the nuisance functions \(\hat{g}^{(k)}_t(H_t,A_t)\),\(\hat{p}^{(k)}_t(1 \mid H_t)\), \(\hat{\tilde{p}}^{(k)}_t(1 \mid S_t)\), and compute the weight \(\hat{W}_t^{(k)} = \hat{\tilde{p}}^{(k)}_t(1 \mid S_t) / \hat{p}^{(k)}_t(1 \mid H_t)\).

Step III: For each \(j \in I_k\) and time \(t\), construct the pseudo-outcome \(\tilde{Y}_{t+1}^{(DR)}\) as follows, then regress it on \(f_t(S_t)^T \beta\) using weights \(\tilde{p}_t^{(k)}(1 \mid S_t)(1-\tilde{p}_t^{(k)}(1 \mid S_t))\).

\[ \tilde{Y}^{(DR)}_{t+1,j} := \frac{\hat{W}_{t,j}^{(k)}(A_{t,j}-\hat{\tilde{p}}_t^{(k)}(1 \mid S_{t,j})) (Y_{t+1,j}-\hat{g}_t^{(k)}(H_{t,j},A_{t,j}))}{\hat{\tilde{p}}_t^{(k)}(1 \mid S_{t,j})(1-\hat{\tilde{p}}_t^{(k)}(1 \mid S_{t,j}))} \, + \Bigl( \hat{g}_t^{(k)}(H_{t,j},1) - \hat{g}_t^{(k)}(H_{t,j},0) \Bigr) \]

To conduct variable selection, DR_WCLS_LASSO solves the problem

\[ \min_{\beta} \frac{1}{n} \sum_{i=1}^{n}\sum_{t=1}^{T} \Bigl[ \hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\, \bigl(1-\hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\bigr)\, \bigl(\tilde{Y}^{(DR)}_{t+1,i}-f_t(S_t)^{\top}\beta\bigr)^2 \Bigr] + \lambda \lVert \beta \rVert_{1} - w^{\top}\beta, \]

where \(\lambda\) is the LASSO regularization parameter, \(\omega\) is the noise vector.

After the variable selection procedure, we conduct post-selection inference using DR-WCLS conditional on the selected variables. This provides estimates for the selected variables along with their corresponding confidence intervals.

\[ \min_{\beta} \frac{1}{n} \sum_{i=1}^{n}\sum_{t=1}^{T} \Bigl[ \hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\, \bigl(1-\hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\bigr)\, \bigl(\tilde{Y}^{(DR)}_{t+1,i}-f_t(S_t)^{\top}\beta_E\bigr)^2 \Bigr] \]

Installation

The package can be installed from our GitHub repository.

remotes::install_github("WHD-Lab/DR_WCLS_LASSO")

Loading the Package

Set-up

To configure a Python virtual environment in R, please run the following code:

# Configure virtual environment
venv_info = venv_config()
venv = venv_info$hash
print(venv)
# [1] "a9c268bc"

Or, if already configured, use

library(reticulate)
venv = "a9c268bc"
use_virtualenv(venv, required = TRUE)

Virtual Environment Testing

# Do the python deps load?
library(reticulate)
np = import("numpy", convert = FALSE)
lasso_mod = import("selectinf.randomized.lasso", convert = FALSE)$lasso

# Simple test
run_simple_lasso_test(venv)

Real Data Example

HeartSteps

We illustrate the functions in the MRTpostInfLASSO package using the HeartSteps dataset from the MRTAnalysis package. We demonstrate how to generate the pseudo-outcome, perform variable selection, and conduct valid post-selection inference.

HeartSteps is a mobile health intervention designed to encourage physical activity by delivering tailored suggestions. The dataset comes from a 6-week micro-randomized trial involving 37 participants. Participants were randomized at 5 decision points per day and received 2-5 notificaitons daily. The dataset contains 7,770 records, with 5 observations per day for each participants. Each record includes the decision point, whether a notification was sent, participant availability for walking, and 30-minute step counts before and after the decision point.

To begin, we load the HeartSteps data using the following code. A summary of data_mimicHeartSteps is as follows:

# Load HeartSteps Data
library(MRTAnalysis)
data(data_mimicHeartSteps)
head(data_mimicHeartSteps)
#>   userid decision_point day_in_study logstep_30min logstep_30min_lag1
#> 1      1              1            0     2.3902011          0.0000000
#> 2      1              2            0    -0.6931472          2.3902011
#> 3      1              3            0     2.4646823         -0.6931472
#> 4      1              4            0     0.1206936          2.4646823
#> 5      1              5            0     0.8322060          0.1206936
#> 6      1              6            1     1.8450452          0.8322060
#>   logstep_pre30min is_at_home_or_work intervention rand_prob avail
#> 1       -0.6931472                  1            0       0.6     0
#> 2        2.1962380                  1            0       0.6     1
#> 3        4.5894007                  1            1       0.6     1
#> 4        3.1791124                  1            1       0.6     1
#> 5        3.2945170                  0            0       0.6     0
#> 6        4.6658254                  1            0       0.6     0

We first specify the variable names for the participant ID, history\(H_t\), moderator\(S_t\), treatment\(A_t\), proximal outcome\(Y_t\), and randomization probability\(p_t\).

set.seed(100)
ID = 'userid'
Ht = c('logstep_30min_lag1','logstep_pre30min','is_at_home_or_work', 'day_in_study')
St = c('logstep_30min_lag1','logstep_pre30min','is_at_home_or_work', 'day_in_study')
At = 'intervention'
outcome = 'logstep_30min'
prob = 'rand_prob'

Generating Pseudo-outcome

To generate the pseudo-outcome, we provide three methods for estimating the nuisance functions: LASSO, random forest and gradient boosting.

\[ \tilde{Y}^{(DR)}_{t+1,j} := \frac{\hat{W}_{t,j}^{(k)}(A_{t,j}-\hat{\tilde{p}}_t^{(k)}(1 \mid S_{t,j})) (Y_{t+1,j}-\hat{g}_t^{(k)}(H_{t,j},A_{t,j}))}{\hat{\tilde{p}}_t^{(k)}(1 \mid S_{t,j})(1-\hat{\tilde{p}}_t^{(k)}(1 \mid S_{t,j}))} \, + \Bigl( \hat{g}_t^{(k)}(H_{t,j},1) - \hat{g}_t^{(k)}(H_{t,j},0) \Bigr) \]

We illustrate their use use via the functions pseudo_outcome_generator_CVlasso, pseudo_outcome_generator_rf_v2, and pseudo_outcome_generator_gbm.

pseudo_outcome_CVlasso = pseudo_outcome_generator_CVlasso(fold = 5,ID = ID,
                                                     data = data_mimicHeartSteps, 
                                                     Ht = Ht, St = St, At = At, 
                                                     prob = prob, outcome = outcome,
                                                     core_num = 1)

pseudo_outcome_RF = pseudo_outcome_generator_rf_v2(fold = 5,ID = ID,
                                                   data = data_mimicHeartSteps, 
                                                   Ht = Ht, St = St, At = At, 
                                                   prob = prob, outcome = outcome,
                                                   core_num = 1)


pseudo_outcome_GBM = pseudo_outcome_generator_gbm(fold = 5,ID = ID,
                                                  data = data_mimicHeartSteps, 
                                                  Ht = Ht, St = St, At = At, 
                                                  prob = prob, outcome = outcome,
                                                  core_num = 1)

Variable Selection

To perform variable selection, DR_WCLS_LASSO solves the problem

\[ \min_{\beta} \frac{1}{n} \sum_{i=1}^{n}\sum_{t=1}^{T} \Bigl[ \hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\, \bigl(1-\hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\bigr)\, \bigl(\tilde{Y}^{(DR)}_{t+1,i}-f_t(S_t)^{\top}\beta\bigr)^2 \Bigr] + \lambda \lVert \beta \rVert_{1} - w^{\top}\beta \]

Variable selection is performed using the variable_selection_PY_penal_int or FISTA_backtracking function. We first define my_formula, specifying the outcome and candidate variables. Below, we demonstrate the use of both functions.

Python version

my_formula = as.formula(paste("yDR ~", paste(c("logstep_30min_lag1", "logstep_pre30min",
                                      "is_at_home_or_work", "day_in_study"),
                                       collapse = " + ")))

set.seed(100)
var_selection_python = variable_selection_PY(data = pseudo_outcome_CVlasso,
                                             ID,
                                             my_formula,
                                             lam = NULL, noise_scale = NULL,
                                             venv = venv,
                                             splitrat = 0.8,
                                             beta = NULL)

cat('The selected variable list:',var_selection_python$E)
#> The selected variable list: (Intercept) day_in_study

R version

set.seed(100)
var_selection_R = FISTA_backtracking(data = pseudo_outcome_CVlasso, ID, my_formula,
                  lam = NULL, noise_scale = NULL,splitrat = 0.8,
                  max_ite = 10^(5), tol = 10^(-4), beta = NULL)
var_selection_R$E
#> [1] "(Intercept)"  "day_in_study"

cat('The selected variable list:',var_selection_R$E)
#> The selected variable list: (Intercept) day_in_study

Post-selection Inference

After variable selection, we conduct post-selection inference using DR-WCLS, conditioning on the selected variables. This yields GEE coefficient estimates and corresponding confidence intervals.

\[ \min_{\beta} \frac{1}{n} \sum_{i=1}^{n}\sum_{t=1}^{T} \Bigl[ \hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\, \bigl(1-\hat{\tilde{p}}^{(k)}_{t}(1 \mid S_t)\bigr)\, \bigl(\tilde{Y}^{(DR)}_{t+1,i}-f_t(S_t)^{\top}\beta_E\bigr)^2 \Bigr] \]

DR_WCLS_LASSO is used for post-selection inference. The input of the function are the data for inference, number of folds, and variable names for ID, time, \(H_t\), \(S_t\), \(A_t\), \(Y_t\) and randomization probability. method_pseu specifies the method used for pseudo-outcome generation (e.g., “CVLASSO”, “RandomForest”, “GradientBoosting”). varSelect_program indicates the variable selection programming language.

# set.seed(123)

# reticulate::py_run_string("
#   import random
#   import numpy as np
#   random.seed(123)
#   np.random.seed(123)
# ")

UI_return_python = DR_WCLS_LASSO(data = data_mimicHeartSteps,
                          fold = 5, ID = ID,
                          decision_point = "decision_point",
                          Ht = Ht, St = St, At = At,
                          prob = prob, outcome = outcome,
                          venv = venv,
                          method_pseu = "CVLASSO",
                          varSelect_program = "Python",
                          standardize_x = F, standardize_y = F)
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 185.074531120026"
#> [1] "select predictors: (Intercept)"  "select predictors: day_in_study"
#> [1] FALSE
#> [1] "logstep_30min_lag1" "logstep_pre30min"   "is_at_home_or_work"
UI_return_python
#>              E     GEE_est      lowCI    upperCI   prop_low   prop_up
#> 1  (Intercept)  0.56982820  0.4172071  0.7574799 0.05052948 0.9502349
#> 2 day_in_study -0.02100086 -0.0291198 -0.0129452 0.04938279 0.9506386
#>         pvalue
#> 1 1.386635e-08
#> 2 1.845647e-05
set.seed(123)

UI_return_R = DR_WCLS_LASSO(data = data_mimicHeartSteps,
                          fold = 5, ID = ID,
                          decision_point = "decision_point",
                          Ht = Ht, St = St, At = At,
                          prob = prob, outcome = outcome,
                          method_pseu = "CVLASSO", 
                          varSelect_program = "R",
                          standardize_x = F, standardize_y = F)
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 130.892804954675"
#> [1] "select predictors: (Intercept)"       
#> [2] "select predictors: logstep_30min_lag1"
#> [3] "select predictors: day_in_study"      
#> [1] FALSE
#> [1] "logstep_pre30min"   "is_at_home_or_work"
UI_return_R
#>                    E      GEE_est       lowCI      upperCI   prop_low   prop_up
#> 1        (Intercept)  0.568888309  0.40427152  1.244573420 0.04923968 0.9498068
#> 2 logstep_30min_lag1 -0.001277755 -0.28169469 -0.007829111 0.04930253 0.9491945
#> 3       day_in_study -0.020850053 -0.03306659 -0.011552807 0.05020540 0.9498383
#>        pvalue
#> 1 0.000534499
#> 2 0.072436690
#> 3 0.001255541

A Comparison of Using Randomized LASSO and Weighted Centered Least Squares

Select variables using randomized LASSO

library(selectiveInference)
res_randomizedLASSO = randomizedLasso(X = as.matrix(data_mimicHeartSteps[,St]), 
                y = as.matrix(data_mimicHeartSteps[,outcome]), 
                lam = 100000, 
                family="gaussian",
                noise_scale=NULL, 
                ridge_term=NULL, 
                max_iter=100,       
                kkt_tol=1.e-4,      
                parameter_tol=1.e-8,
                objective_tol=1.e-8,
                objective_stop=FALSE,
                kkt_stop=TRUE,
                parameter_stop=TRUE)
St[res_randomizedLASSO$active_set]
#> [1] "day_in_study"

Make inference using weighted centered least squares

wcls_fit = wcls(
  data = data_mimicHeartSteps,
  id = 'userid', 
  outcome = 'logstep_30min', 
  treatment = 'intervention', 
  rand_prob = 0.5, 
  moderator_formula = ~ day_in_study,
  control_formula = ~ logstep_30min_lag1 + logstep_pre30min + day_in_study,
  availability = NULL,
  numerator_prob = NULL,
  verbose = TRUE
)

wcls_res = summary(wcls_fit)

Compare the output with our method:

wcls_res$causal_excursion_effect
#>                Estimate     95% LCL    95% UCL      StdErr Hotelling df1 df2
#> (Intercept)   0.5722731  0.37187918  0.7726669 0.098255727  33.92273   1  31
#> day_in_study -0.0209880 -0.03004471 -0.0119313 0.004440621  22.33854   1  31
#>                   p-value
#> (Intercept)  2.024579e-06
#> day_in_study 4.698739e-05

UI_return_R
#>                    E      GEE_est       lowCI      upperCI   prop_low   prop_up
#> 1        (Intercept)  0.568888309  0.40427152  1.244573420 0.04923968 0.9498068
#> 2 logstep_30min_lag1 -0.001277755 -0.28169469 -0.007829111 0.04930253 0.9491945
#> 3       day_in_study -0.020850053 -0.03306659 -0.011552807 0.05020540 0.9498383
#>        pvalue
#> 1 0.000534499
#> 2 0.072436690
#> 3 0.001255541

Arguments in DR_WCLS_LASSO

Methods of generating the pseudo outcome The pseudo outcome generation method can be specified using the method_pseu argument. Currently, three options are supported: CVLASSO, RandomForest, and GradientBoosting. The default is CVLASSO.

set.seed(100)

UI_return_method_pseu = DR_WCLS_LASSO(data = data_mimicHeartSteps,
                                 fold = 5, ID = ID,
                                 decision_point = "decision_point",
                                 Ht = Ht, St = St, At = At,
                                 prob = prob, outcome = outcome,
                                 method_pseu = "CVLASSO",
                                 varSelect_program = "R",
                                 standardize_x = F, standardize_y = F)
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 130.898793629741"
#> [1] "select predictors: (Intercept)"  "select predictors: day_in_study"
#> [1] FALSE
#> [1] "logstep_30min_lag1" "logstep_pre30min"   "is_at_home_or_work"
UI_return_method_pseu
#>              E     GEE_est       lowCI    upperCI   prop_low   prop_up
#> 1  (Intercept)  0.56787464  0.37197711  0.7277150 0.05051514 0.9499152
#> 2 day_in_study -0.02087524 -0.02869455 -0.0123153 0.04934788 0.9504935
#>        pvalue
#> 1 1.22489e-06
#> 2 1.61907e-04

Availability We also allow users to specify an availability indicator through the availability argument. Users can set this argument to the name of the availability column in the dataset. When provided, treatment is only considered at decision points when a participant is available.

set.seed(100)

UI_return_availability = DR_WCLS_LASSO(data = data_mimicHeartSteps,
                                 fold = 5, ID = ID,
                                 decision_point = "decision_point",
                                 Ht = Ht, St = St, At = At,
                                 prob = prob, outcome = outcome,
                                 method_pseu = "CVLASSO",
                                 varSelect_program = "R",
                                 standardize_x = F, standardize_y = F,
                                 availability = 'avail')
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 108.932983993893"
#> [1] "select predictors: (Intercept)"  "select predictors: day_in_study"
#> [1] FALSE
#> [1] "logstep_30min_lag1" "logstep_pre30min"   "is_at_home_or_work"
UI_return_availability
#>              E    GEE_est       lowCI    upperCI   prop_low   prop_up
#> 1  (Intercept)  0.6417958  0.45824940  0.8217710 0.05027683 0.9491955
#> 2 day_in_study -0.0234212 -0.03167528 -0.0169112 0.05019803 0.9498733
#>         pvalue
#> 1 8.139399e-09
#> 2 2.479965e-07

LASSO penalty term The LASSO penalty can be adjusted by setting ‘lam’ in the DR_WCLS_LASSO function.

set.seed(100)

UI_return_lambda = DR_WCLS_LASSO(data = data_mimicHeartSteps,
                                 fold = 5, ID = ID,
                                 decision_point = "decision_point",
                                 Ht = Ht, St = St, At = At,
                                 prob = prob, outcome = outcome,
                                 method_pseu = "CVLASSO",
                                 varSelect_program = "R",
                                 lam = 100,
                                 standardize_x = F, standardize_y = F)
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 100"
#> [1] "select predictors: (Intercept)"  "select predictors: day_in_study"
#> [1] FALSE
#> [1] "logstep_30min_lag1" "logstep_pre30min"   "is_at_home_or_work"
UI_return_lambda
#>              E     GEE_est       lowCI     upperCI   prop_low   prop_up
#> 1  (Intercept)  0.56664043  0.37017639  0.72632267 0.05039581 0.9501140
#> 2 day_in_study -0.02081686 -0.02864015 -0.01231763 0.04984149 0.9506907
#>         pvalue
#> 1 1.049468e-06
#> 2 1.356797e-04

Data split rate The data split rate in Step 1 of the DR_WCLS algorithm can be set using ‘splitrat’ in the DR_WCLS_LASSO function.

set.seed(100)
UI_return_splitrat = DR_WCLS_LASSO(data = data_mimicHeartSteps, 
                                   fold = 5, ID = ID, 
                                   decision_point = "decision_point", 
                                   Ht = Ht, St = St, At = At, 
                                   prob = prob, outcome = outcome,
                                   method_pseu = "CVLASSO", 
                                   varSelect_program = "R",
                                   splitrat = 0.8,
                                   standardize_x = F, standardize_y = F)
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 130.896004536396"
#> [1] "select predictors: (Intercept)"  "select predictors: day_in_study"
#> [1] FALSE
#> [1] "logstep_30min_lag1" "logstep_pre30min"   "is_at_home_or_work"
UI_return_splitrat
#>              E    GEE_est       lowCI     upperCI   prop_low   prop_up
#> 1  (Intercept)  0.5677041  0.37096113  0.72821365 0.05006188 0.9503814
#> 2 day_in_study -0.0208560 -0.02870282 -0.01233946 0.05049523 0.9508917
#>         pvalue
#> 1 1.288622e-06
#> 2 1.679244e-04

Analysis Using Manually Created Interaction Terms

Interaction terms can be manually created and included in the \(H_t\) and \(S_t\) variable lists. In this example, we create an indicator for whether the time in the study is over 14 days and include its interactions with logstep_30min_lag1, logstep_pre30min, and is_at_home_or_work.

data_mimicHeartSteps$timeover14 = as.numeric(data_mimicHeartSteps$decision_point>14)
data_mimicHeartSteps$int_lag1_timeover14 = data_mimicHeartSteps$logstep_30min_lag1 * data_mimicHeartSteps$timeover14

data_mimicHeartSteps$int_pre30_timeover14 = data_mimicHeartSteps$logstep_pre30min * data_mimicHeartSteps$timeover14

data_mimicHeartSteps$int_home_timeover14 = data_mimicHeartSteps$is_at_home_or_work * data_mimicHeartSteps$timeover14

We add the interaction terms in the \(H_t\) and \(S_t\) variable lists and rerun the procedure.

set.seed(200)
ID = 'userid'
Ht_int = c('logstep_30min_lag1','logstep_pre30min','is_at_home_or_work', 'timeover14',
       'int_lag1_timeover14', 'int_pre30_timeover14', 'int_home_timeover14','day_in_study')
St_int = c('logstep_30min_lag1','logstep_pre30min','is_at_home_or_work', 'timeover14',
       'int_lag1_timeover14', 'int_pre30_timeover14', 'int_home_timeover14','day_in_study')
At = 'intervention'
outcome = 'logstep_30min'
prob = 'rand_prob'


UI_return_int_python = DR_WCLS_LASSO(data = data_mimicHeartSteps,
                          fold = 5, ID = ID,
                          decision_point = "decision_point", 
                          Ht = Ht_int, St = St_int,
                          At = At, prob = prob, outcome = outcome,
                          venv = venv,
                          method_pseu = "CVLASSO",
                          varSelect_program = "Python",
                          standardize_x = F, standardize_y = F)
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 216.199071596098"
#> [1] "select predictors: (Intercept)"
#> [1] FALSE
#> [1] "logstep_30min_lag1"   "logstep_pre30min"     "is_at_home_or_work"  
#> [4] "timeover14"           "int_lag1_timeover14"  "int_pre30_timeover14"
#> [7] "int_home_timeover14"  "day_in_study"
UI_return_int_python
#>             E   GEE_est      lowCI  upperCI   prop_low   prop_up     pvalue
#> 1 (Intercept) 0.1388557 0.06229526 1.373355 0.04909525 0.9501548 0.04650218
UI_return_int_R = DR_WCLS_LASSO(data = data_mimicHeartSteps, 
                          fold = 5, ID = ID, 
                          decision_point = "decision_point", 
                          Ht = Ht_int, St = St_int, 
                          At = At, prob = prob, outcome = outcome,
                          method_pseu = "CVLASSO", 
                          varSelect_program = "R",
                          standardize_x = F, standardize_y = F)
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 152.955131748308"
#> [1] "select predictors: (Intercept)"  "select predictors: day_in_study"
#> [1] FALSE
#> [1] "logstep_30min_lag1"   "logstep_pre30min"     "is_at_home_or_work"  
#> [4] "timeover14"           "int_lag1_timeover14"  "int_pre30_timeover14"
#> [7] "int_home_timeover14"
UI_return_int_R
#>              E     GEE_est       lowCI     upperCI   prop_low   prop_up
#> 1  (Intercept)  0.55778595  0.42789989  0.79875507 0.04978683 0.9504247
#> 2 day_in_study -0.02043033 -0.02881832 -0.01257568 0.05038597 0.9496899
#>         pvalue
#> 1 5.346365e-08
#> 2 4.086981e-05

Intern Health Study

IHS is a micro-randomized trial involving 859 medical interns. The aim of the study is to investigate how mHealth interventions affect participants’ weekly mood, physical activity, and sleep. Each day, participants had a 0.5 probability of receiving a tailored message. The dataset includes a range of variables collected from wearable devices.

df_IHS = read.csv('~/Desktop/25Winter/IHS Design/Hyperparameter/dfEventsThreeMonthsTimezones.csv')
df_IHS$prob = rep(0.5, length(df_IHS$PARTICIPANTIDENTIFIER))
df_IHS$less27 = (df_IHS$Age < 27)

ID = 'PARTICIPANTIDENTIFIER'

# Ht = c('StepsPastDay', 'is_weekend', 'maxHRPast24Hours', 'HoursSinceLastMood')
# St = c('StepsPastDay', 'is_weekend', 'maxHRPast24Hours',  'HoursSinceLastMood')
Ht = c('StepsPastDay', 'is_weekend', 'maxHRPast24Hours', 'HoursSinceLastMood','less27', "Sex", "has_child","StepsPastWeek",'PHQ10above0','exercisedPast24Hours')
St = c('StepsPastDay', 'is_weekend', 'maxHRPast24Hours',  'HoursSinceLastMood', 'less27', "Sex", "has_child",'StepsPastWeek','PHQ10above0','exercisedPast24Hours')
At = 'sent'
outcome = 'LogStepsReward'
prob = 'prob'

set.seed(99)

df_IHS_cleaned = na.omit(df_IHS)

UI_return_IHS_python = DR_WCLS_LASSO(data = df_IHS_cleaned,
                          fold = 5, ID = ID,
                          decision_point = "time", Ht = Ht, St = St, At = At,
                          prob = prob, outcome = outcome,
                          venv = venv,
                          method_pseu = "CVLASSO", lam = 55,
                          noise_scale = NULL, splitrat = 0.8,
                          level = 0.9, core_num=3,
                          varSelect_program = "Python",
                          standardize_x =  T, standardize_y = T)

UI_return_IHS_python
set.seed(100)
UI_return_IHS_R = DR_WCLS_LASSO(data = df_IHS_cleaned,
                          fold = 5, ID = ID,
                          decision_point = "time", Ht = Ht, St = St, At = At,
                          prob = prob, outcome = outcome,
                          venv = venv,
                          method_pseu = "CVLASSO", lam = 30,
                          noise_scale = NULL, splitrat = 0.8,
                          level = 0.9, core_num=3, 
                          varSelect_program = "R",
                          standardize_x =  T, standardize_y = T)

UI_return_IHS_R

Simulated Data

We also illustrate the use of the package with a simulated dataset and demonstrate why DR_WCLS is needed, rather than using randomized LASSO for selection and WCLS for inference.

set.seed(100)
sim_data = generate_dataset(N = 1000, T = 40, P = 50, 
                            sigma_residual = 1.5, sigma_randint = 1.5, 
                            main_rand = 3, rho = 0.7,
                            beta_logit = c(-1, 1.6 * rep(1/50, 50)), 
                            model = ~ state1 + state2 + state3 + state4,
                            beta = matrix(c(-1, 1.7, 1.5, -1.3, -1),ncol = 1),
                            theta1 = 0.8)

Ht = unlist(lapply(1:50, FUN = function(X) paste0("state",X)))
St = unlist(lapply(1:25, FUN = function(X) paste0("state",X)))

sim_data$prob_false = rep(0.5, length(sim_data$id))

UI_return_sim_R = DR_WCLS_LASSO(data = sim_data,
                                fold = 5, ID = "id",
                                decision_point = "decision_point",
                                Ht = Ht, St = St, At = "action",
                                prob = "prob_false", outcome = "outcome",
                                method_pseu = "CVLASSO",
                                varSelect_program = "R",
                                standardize_x = F, standardize_y = F,
                                beta = matrix(c(-1, 1.7, 1.5, -1.3, -1, rep(0, 21))))
#> [1] "remove 0 lines of data due to NA produced for yDR"
#> [1] "The current lambda value is: 389.567345284479"
#> [1] "select predictors: (Intercept)" "select predictors: state1"     
#> [3] "select predictors: state2"      "select predictors: state3"     
#> [5] "select predictors: state4"     
#> [1] FALSE
#>  [1] "state5"  "state6"  "state7"  "state8"  "state9"  "state10" "state11"
#>  [8] "state12" "state13" "state14" "state15" "state16" "state17" "state18"
#> [15] "state19" "state20" "state21" "state22" "state23" "state24" "state25"
UI_return_sim_R
#>             E    GEE_est      lowCI    upperCI   prop_low   prop_up
#> 1 (Intercept) -0.9542862 -1.0533437 -0.8480374 0.04935492 0.9492295
#> 2      state1  1.5283746  1.4973162  1.5575872 0.05058399 0.9491015
#> 3      state2  1.3670570  1.3361772  1.3982834 0.04977513 0.9491865
#> 4      state3 -1.1576618 -1.1906565 -1.1268020 0.04982533 0.9502616
#> 5      state4 -0.9354364 -0.9640984 -0.9057872 0.04932666 0.9490258
#>          pvalue post_true true_signal
#> 1  1.251667e-45      -1.0        TRUE
#> 2  0.000000e+00       1.7        TRUE
#> 3  2.220446e-16       1.5        TRUE
#> 4 3.103312e-241      -1.3        TRUE
#> 5 1.217659e-209      -1.0        TRUE
my_formula = as.formula(
  paste0("yDR ~ ", paste0("state", 1:50, collapse = " + "))
)
set.seed(1001)
pseudo_outcome_CVlasso = pseudo_outcome_generator_CVlasso(fold = 5,ID = 'id',
                                                     data = sim_data, 
                                                     Ht = Ht, St = St, At = 'action', 
                                                     prob = 'prob', outcome = 'outcome',
                                                     core_num = 1)

# randomized LASSO (FISTA_backtracking in our package and set noise_scale as 0)
sim_variable_sel = FISTA_backtracking(data = pseudo_outcome_CVlasso, ID = 'id', my_formula,
                  lam = NULL, noise_scale = 0, splitrat = 0.8,
                  max_ite = 10^(5), tol = 10^(-4), beta = NULL)

# sim_randomizedLASSO = randomizedLasso(X = as.matrix(sim_data[,St]), 
#                 y = as.matrix(sim_data[,'outcome']), 
#                 lam = 225000, 
#                 family="gaussian",
#                 noise_scale=NULL, 
#                 ridge_term=NULL, 
#                 max_iter=100,       
#                 kkt_tol=1.e-4,      
#                 parameter_tol=1.e-8,
#                 objective_tol=1.e-8,
#                 objective_stop=FALSE,
#                 kkt_stop=TRUE,
#                 parameter_stop=TRUE)

sim_variable_sel$E[-1]
#> [1] "state1" "state2" "state3" "state4"
mod_formula = as.formula(paste("~", paste0("state", 1:4, collapse = " + ")))

ctrl_formula = as.formula(paste("~", paste0("state", 1:50, collapse = " + ")))

wcls_args = list(
  data      = sim_data,
  id        = "id",
  outcome   = "outcome",
  treatment = "action",
  rand_prob = "prob_false",
  moderator_formula = mod_formula,
  control_formula   = ctrl_formula,
  availability      = NULL,
  numerator_prob    = NULL,
  verbose           = TRUE
)

wcls_sim_fit = do.call(wcls, wcls_args)
#> availability = NULL: defaulting availability to always available.
#> Constant numerator probability 0.5 is used.

wcls_sim_res = summary(wcls_sim_fit)
wcls_sim_res$causal_excursion_effect
#>               Estimate    95% LCL    95% UCL     StdErr      Wald df1 df2
#> (Intercept) -0.9820426 -1.1030580 -0.8610272 0.06166454  253.6236   1 944
#> state1       1.5256178  1.4910531  1.5601825 0.01761277 7503.0243   1 944
#> state2       1.3653543  1.3293183  1.4013903 0.01836246 5528.7762   1 944
#> state3      -1.1537897 -1.1910261 -1.1165533 0.01897415 3697.6752   1 944
#> state4      -0.9319551 -0.9656989 -0.8982113 0.01719447 2937.7321   1 944
#>             p-value
#> (Intercept)       0
#> state1            0
#> state2            0
#> state3            0
#> state4            0