Introduction to the hal9001 R package

Author

Sky Qiu

Introduction

The materials in this tutorial are adapted from the vignette of the hal9001 R package, which implements the Highly Adaptive Lasso (HAL). The goal is to walk through the basic usage of HAL for regression, explore how its basis functions are constructed, and demonstrate how to control the complexity and structure of HAL fits through formulas.

Required packages

Let’s first load the required R packages used throughout this tutorial.

library(data.table)
library(DT)
library(hal9001)

We will use a small synthetic dataset included in this tutorial’s data folder.

load("data/hal_intro.rda")

The dataset includes:

x: the training covariate matrix;
y: the outcome vector;
test_x, test_y: the corresponding test set used for out-of-sample evaluation.

Fitting a regression with HAL

We begin by fitting a simple regression using fit_hal():

hal_fit <- fit_hal(X = x, Y = y,
                   smoothness_orders = 0L,
                   family = "gaussian")
hal_fit$times

                  user.self sys.self elapsed user.child sys.child
enumerate_basis       0.006    0.000   0.006          0         0
design_matrix         0.010    0.001   0.011          0         0
reduce_basis          0.001    0.000   0.002          0         0
remove_duplicates     0.002    0.001   0.002          0         0
lasso                 2.411    0.031   2.461          0         0
total                 2.430    0.033   2.483          0         0

The $times output provides runtime profiling for each step of the HAL fitting procedure. Here the LASSO step accounts for nearly all of the elapsed time (2.461 of 2.483 seconds). We can see that fit_hal() runs several steps internally:

Basis enumeration: HAL first constructs a list of basis functions implied by the observed covariate data;
Design matrix evaluation: The set of basis functions is evaluated at the observed data to create a design matrix. For zero-order HAL, the design matrix contains entries of either 0 or 1;
Pruning: Duplicate or near-constant columns (e.g., columns with very few 1’s) are removed to reduce computational cost; these are the reduce_basis and remove_duplicates rows in the timing table;
Penalized regression fit: A cross-validated LASSO regression is then fitted using the cv.glmnet() function from the glmnet R package.
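
The first three steps above can be sketched in base R for one-way (main-term) zero-order basis functions. This is a simplified illustration on hypothetical toy data (x_toy), not the package's implementation: fit_hal() also builds interaction bases, applies its own pruning rules, and relies on compiled code.

```r
set.seed(1)
n <- 6
# Hypothetical toy covariates (not the tutorial's dataset)
x_toy <- cbind(X1 = round(rnorm(n), 2), X2 = round(rnorm(n), 2))

# Step 1: enumerate one-way zero-order basis functions,
# using each observed value of each covariate as a knot.
basis_toy <- list()
for (j in seq_len(ncol(x_toy))) {
  for (u in sort(unique(x_toy[, j]))) {
    basis_toy[[length(basis_toy) + 1]] <- list(cols = j, cutoffs = u)
  }
}

# Step 2: evaluate each indicator 1(x_j >= u) at the observed data.
design <- sapply(basis_toy, function(b) {
  as.numeric(x_toy[, b$cols] >= b$cutoffs)
})

# Step 3: drop duplicated columns. The smallest knot of every covariate
# gives an all-ones column, so at least one duplicate is removed here.
keep <- !duplicated(t(design))
design_reduced <- design[, keep, drop = FALSE]

dim(design)
dim(design_reduced)

# Step 4 (not run): the reduced design matrix would be passed to a
# cross-validated lasso, e.g. glmnet::cv.glmnet(design_reduced, y).
```

Every entry of design is 0 or 1, which matches the description of the zero-order design matrix above.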

Let’s examine a summary of the fitted HAL model:

datatable(summary(hal_fit)$table,
          options = list(pageLength = 10,
                         lengthMenu = c(5, 10, 25, 50),
                         dom = "ltip",
                         autoWidth = TRUE,
                         scrollX = TRUE),
          class = "stripe hover order-column",
          rownames = FALSE)

This table summarizes which basis functions were included in the final HAL fit and their coefficients. For zero-order HAL, the fitted function is a linear combination of indicator basis functions and tensor products of such indicators.
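
In symbols, a zero-order HAL fit with d covariates can be written as follows. The notation is introduced here for illustration and is not taken from the package documentation: s ranges over nonempty subsets of covariate indices (the cols of a basis function), u_{s,j,k} is the knot for covariate k in the j-th basis function on s (its cutoffs), and the coefficients are those selected by the LASSO.

```latex
\[
  f(x) \;=\; \beta_0 \;+\;
  \sum_{\substack{s \subseteq \{1,\dots,d\} \\ s \neq \emptyset}}
  \;\sum_{j} \beta_{s,j}
  \prod_{k \in s} \mathbf{1}\{x_k \ge u_{s,j,k}\}
\]
```

Subsets with a single index give the one-way indicators, and larger subsets give the tensor products of indicators.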

Let’s obtain predictions from the HAL fit and compute the training and test mean squared errors (MSEs):

mse <- function(preds, y) {
  mean((preds - y)^2)
}

# training MSE
preds_hal <- predict(object = hal_fit, new_data = x)
mse_hal <- mse(preds = preds_hal, y = y)
mse_hal

[1] 0.0377737

# test MSE
oob_hal <- predict(object = hal_fit, new_data = test_x)
oob_hal_mse <- mse(preds = oob_hal, y = test_y)
oob_hal_mse

[1] 1.76961

Specifying a HAL formula

Users can gain finer control over HAL fits through the formula interface, formula_hal(). This allows one to:

Specify the number of knot-points per covariate;
Restrict which covariates interact;
Set smoothness orders;
Optionally impose monotonicity constraints or penalty factors.

Let’s illustrate via a few examples.

We specify a zero-order HAL with one-way basis functions for X1 and X2, each with 2 knot-points:

X <- data.frame(X1 = x[,1], X2 = x[,2], X3 = x[,3])
formula_1 <- formula_hal(
  ~ h(X1) + h(X2),
  X = X, 
  smoothness_orders = 0L,
  num_knots = 2L
)

Inspect the implied basis functions:

formula_1$basis_list

[[1]]
[[1]]$cols
[1] 1

[[1]]$cutoffs
[1] -3.224179

[[1]]$orders
[1] 0


[[2]]
[[2]]$cols
[1] 1

[[2]]$cutoffs
[1] 0.03943585

[[2]]$orders
[1] 0


[[3]]
[[3]]$cols
[1] 2

[[3]]$cutoffs
[1] -3.037737

[[3]]$orders
[1] 0


[[4]]
[[4]]$cols
[1] 2

[[4]]$cutoffs
[1] 0.05277846

[[4]]$orders
[1] 0
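
To make the cols, cutoffs, and orders fields concrete, here is a minimal base-R sketch that evaluates a zero-order basis function at new covariate values. The helper eval_basis0() and the rows in X_new are hypothetical, and the sketch assumes the indicator convention "covariate is greater than or equal to its cutoff"; it is not the package's own evaluation code.

```r
# Two entries copied from the basis_list printed above.
basis_x1 <- list(cols = 1, cutoffs = 0.03943585, orders = 0)
basis_x2 <- list(cols = 2, cutoffs = -3.037737, orders = 0)

# Hypothetical helper: returns 1 when every covariate named in `cols`
# is at or above its cutoff, and 0 otherwise.
eval_basis0 <- function(basis, X) {
  X <- as.matrix(X)
  above <- sweep(X[, basis$cols, drop = FALSE], 2, basis$cutoffs, FUN = ">=")
  as.numeric(apply(above, 1, all))
}

# Hypothetical covariate rows (columns are X1, X2, X3).
X_new <- rbind(c(-1.0,  0.5, 0),
               c( 0.2, -4.0, 0),
               c( 1.0,  1.0, 0))

eval_basis0(basis_x1, X_new)  # 0 1 1
eval_basis0(basis_x2, X_new)  # 1 0 1
```

Stacking such vectors column by column for every entry of basis_list gives the 0/1 design matrix described earlier.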

Now let’s include two-way interaction terms between X1 and X2. Here, we specify 2 knot-points for X1, 3 for X2, and 2 per covariate for the interaction, which yields 2 × 2 = 4 tensor-product basis functions:

formula_2 <- formula_hal(
  ~ h(X1,s=0,k=2) + h(X2,s=0,k=3) + h(X1,X2,s=0,k=2),
  X = X, 
  smoothness_orders = 0L, # default, overridden by s in h()
  num_knots = 10L # default, overridden by k in h()
)
formula_2$basis_list

[[1]]
[[1]]$cols
[1] 1

[[1]]$cutoffs
[1] -3.224179

[[1]]$orders
[1] 0


[[2]]
[[2]]$cols
[1] 1

[[2]]$cutoffs
[1] 0.03943585

[[2]]$orders
[1] 0


[[3]]
[[3]]$cols
[1] 2

[[3]]$cutoffs
[1] -3.037737

[[3]]$orders
[1] 0


[[4]]
[[4]]$cols
[1] 2

[[4]]$cutoffs
[1] -0.4323704

[[4]]$orders
[1] 0


[[5]]
[[5]]$cols
[1] 2

[[5]]$cutoffs
[1] 0.3781581

[[5]]$orders
[1] 0


[[6]]
[[6]]$cols
[1] 1 2

[[6]]$cutoffs
[1] -3.224179 -3.037737

[[6]]$orders
[1] 0 0


[[7]]
[[7]]$cols
[1] 1 2

[[7]]$cutoffs
[1] -3.22417874  0.05277846

[[7]]$orders
[1] 0 0


[[8]]
[[8]]$cols
[1] 1 2

[[8]]$cutoffs
[1]  0.03943585 -3.03773684

[[8]]$orders
[1] 0 0


[[9]]
[[9]]$cols
[1] 1 2

[[9]]$cutoffs
[1] 0.03943585 0.05277846

[[9]]$orders
[1] 0 0
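
The count of nine basis functions can be checked by hand. The sketch below copies the interaction knots from the output above and forms all pairs with expand.grid(); the variable names are chosen here for illustration, and the row order of the grid is not meant to match the order in basis_list.

```r
# Knots used inside the interaction term (k = 2 per covariate),
# copied from entries [[6]] to [[9]] above.
knots_x1 <- c(-3.224179, 0.03943585)
knots_x2 <- c(-3.037737, 0.05277846)

# Every combination of an X1 knot with an X2 knot is one
# tensor-product basis function.
knot_pairs <- expand.grid(X1 = knots_x1, X2 = knots_x2)
knot_pairs

# 2 one-way terms for X1 + 3 for X2 + 4 interaction terms
n_basis <- 2 + 3 + nrow(knot_pairs)
n_basis  # 9
```

Note that the one-way term for X2 uses three knots, while the interaction reuses the two knots that X2 receives when k = 2.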

Finally, we define a first-order HAL with one-way basis functions for X1 and X2 and a two-way interaction between X2 and X3:

formula_3 <- formula_hal(
  ~ h(X1,s=1,k=2) + h(X2,s=1,k=3) + h(X2,X3,s=1,k=2),
  X = X, 
  smoothness_orders = 0L, # default, overridden by s in h()
  num_knots = 10L # default, overridden by k in h()
)
formula_3$basis_list

[[1]]
[[1]]$cols
[1] 1

[[1]]$cutoffs
[1] -3.224179

[[1]]$orders
[1] 1


[[2]]
[[2]]$cols
[1] 1

[[2]]$cutoffs
[1] 0.03943585

[[2]]$orders
[1] 1


[[3]]
[[3]]$cols
[1] 2

[[3]]$cutoffs
[1] -3.037737

[[3]]$orders
[1] 1


[[4]]
[[4]]$cols
[1] 2

[[4]]$cutoffs
[1] -0.4323704

[[4]]$orders
[1] 1


[[5]]
[[5]]$cols
[1] 2

[[5]]$cutoffs
[1] 0.3781581

[[5]]$orders
[1] 1


[[6]]
[[6]]$cols
[1] 2 3

[[6]]$cutoffs
[1] -3.037737 -3.288588

[[6]]$orders
[1] 1 1


[[7]]
[[7]]$cols
[1] 2 3

[[7]]$cutoffs
[1] -3.03773684  0.05820414

[[7]]$orders
[1] 1 1


[[8]]
[[8]]$cols
[1] 2 3

[[8]]$cutoffs
[1]  0.05277846 -3.28858812

[[8]]$orders
[1] 1 1


[[9]]
[[9]]$cols
[1] 2 3

[[9]]$cutoffs
[1] 0.05277846 0.05820414

[[9]]$orders
[1] 1 1
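
With orders equal to 1, the indicators are replaced by piecewise-linear terms. The sketch below assumes the first-order basis function for a knot u is the truncated linear function max(x - u, 0), with interactions formed as products; this form, the helper eval_basis1(), and the rows in X_new are assumptions made for illustration rather than details confirmed by this tutorial.

```r
# Entry [[9]] copied from the basis_list printed above.
basis_first <- list(cols = c(2, 3),
                    cutoffs = c(0.05277846, 0.05820414),
                    orders = c(1, 1))

# Hypothetical helper: product over the referenced covariates of
# max(x - cutoff, 0), i.e. zero unless every covariate exceeds its knot.
eval_basis1 <- function(basis, X) {
  X <- as.matrix(X)
  shifted <- sweep(X[, basis$cols, drop = FALSE], 2, basis$cutoffs, FUN = "-")
  apply(pmax(shifted, 0), 1, prod)
}

# Hypothetical covariate rows (columns are X1, X2, X3).
X_new <- rbind(c(0, 1.05277846, 2.05820414),  # 1 and 2 above the knots
               c(0, 0.00000000, 5.00000000))  # X2 below its knot

eval_basis1(basis_first, X_new)
```

The first row evaluates to roughly 1 × 2 = 2, and the second to 0 because X2 lies below its knot.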

Once a formula is created, it can be supplied directly to fit_hal() as an argument. This gives users full control over which basis functions HAL considers, so that prior knowledge about the function being estimated can be built into the fit. This is particularly useful for building interpretable or constrained models (e.g., monotone dose-response functions or functions with known smoothness).
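
A minimal sketch of this workflow follows. It assumes the formula object is passed through the formula argument of fit_hal(), and it uses simulated data (X_sim, y_sim) rather than the tutorial's dataset; the data-generating function is arbitrary and chosen only so that the example runs on its own.

```r
library(hal9001)

set.seed(1)
n <- 200
# Hypothetical simulated data, not the hal_intro dataset.
X_sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
y_sim <- sin(X_sim$X1) + 0.5 * X_sim$X2 * X_sim$X3 + rnorm(n, sd = 0.1)

# Same structure as formula_3: first-order terms for X1 and X2,
# plus an X2-by-X3 interaction.
formula_sim <- formula_hal(
  ~ h(X1, s = 1, k = 2) + h(X2, s = 1, k = 3) + h(X2, X3, s = 1, k = 2),
  X = X_sim,
  smoothness_orders = 1L,
  num_knots = 10L
)

# Assumption: fit_hal() accepts the formula object via `formula`.
hal_fit_formula <- fit_hal(X = X_sim, Y = y_sim,
                           formula = formula_sim,
                           family = "gaussian")

preds_formula <- predict(hal_fit_formula, new_data = X_sim)
mean((preds_formula - y_sim)^2)
```

Because the formula restricts HAL to nine basis functions, the resulting fit is far smaller than the default fit and easier to read in summary() output.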

For more advanced topics, including specifying monotonicity constraints and penalty factors, see the extended examples in the vignette for hal9001.