| Title: | Dirichlet Random Forest |
|---|---|
| Description: | Implementation of the Dirichlet Random Forest algorithm for compositional response data. Supports maximum likelihood estimation ('MLE') and method-of-moments ('MOM') parameter estimation for the Dirichlet distribution. Provides two prediction strategies: averaging-based predictions (average of responses within terminal nodes) and parameter-based predictions (expected value derived from the estimated Dirichlet parameters within terminal nodes). For more details see Masoumifard, van der Westhuizen, and Gardner-Lubbe (2026, ISBN:9781032903910). |
| Authors: | Khaled Masoumifard [aut, cre] (ORCID: <https://orcid.org/0000-0003-1921-2145>), Stephan van der Westhuizen [aut] (ORCID: <https://orcid.org/0000-0001-9469-8427>), Sugnet Lubbe [aut] (ORCID: <https://orcid.org/0000-0003-2762-9944>) |
| Maintainer: | Khaled Masoumifard <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-06-04 11:04:57 UTC |
| Source: | https://github.com/xaleed/dirichletrf |
This package implements Dirichlet Random Forest for modeling and predicting compositional data using maximum likelihood estimation or method of moments.
DirichletRF fits a random forest tailored to compositional
responses, i.e. non-negative vectors that sum to one and therefore
reside in the unit simplex. Each tree is grown by recursively
partitioning the covariate space using a Dirichlet
log-likelihood splitting criterion: at every internal node the
candidate split that maximises the gain in Dirichlet log-likelihood
is selected.
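As a rough illustration of the splitting criterion described above, the sketch below scores a few candidate thresholds by their gain in Dirichlet log-likelihood. It uses base R only and is not the package's C++ implementation: the helper names (`dirichlet_loglik`, `mom_alpha`, `split_gain`) are hypothetical, and the particular method-of-moments estimator is an assumption made for the example.

```r
# Dirichlet log-likelihood of the rows of Y for a given alpha vector.
dirichlet_loglik <- function(Y, alpha) {
  n <- nrow(Y)
  n * (lgamma(sum(alpha)) - sum(lgamma(alpha))) + sum(log(Y) %*% (alpha - 1))
}

# One simple method-of-moments estimator (an assumption for this sketch):
# match the component means and the variance of the first component.
mom_alpha <- function(Y) {
  m <- colMeans(Y)
  alpha0 <- m[1] * (1 - m[1]) / var(Y[, 1]) - 1
  unname(alpha0 * m)
}

# Gain of splitting on x <= threshold:
# log-likelihood of the two children minus log-likelihood of the parent.
split_gain <- function(x, Y, threshold) {
  left <- x <= threshold
  Y_left <- Y[left, , drop = FALSE]
  Y_right <- Y[!left, , drop = FALSE]
  dirichlet_loglik(Y_left, mom_alpha(Y_left)) +
    dirichlet_loglik(Y_right, mom_alpha(Y_right)) -
    dirichlet_loglik(Y, mom_alpha(Y))
}

# Toy data: the first Dirichlet parameter changes at x = 0.
set.seed(1)
n <- 200
x <- rnorm(n)
alpha_mat <- cbind(2 + 3 * (x > 0), 3, 4)
G <- matrix(rgamma(n * 3, shape = as.vector(alpha_mat)), n, 3)
Y <- G / rowSums(G)

# Score three candidate thresholds; a tree would keep the one with the largest gain.
thresholds <- c(-1, 0, 1)
gains <- sapply(thresholds, function(s) split_gain(x, Y, s))
best_threshold <- thresholds[which.max(gains)]
```

Because the children are scored with moment estimates rather than maximum likelihood estimates, the gain from this sketch is not guaranteed to be non-negative for every threshold.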
DirichletRF(
  X,
  Y,
  num.trees = 100,
  max.depth = 10,
  min.node.size = 5,
  mtry = -1,
  seed = 123,
  est.method = "mom",
  distributional = FALSE,
  num.cores = -1,
  replace = FALSE,
  sample.fraction = 1,
  compute.oob = FALSE
)
X |
A numeric (n x p) matrix of covariates. The current version only accepts numeric covariates; categorical covariates can be included via one-hot encoding. |
Y |
A numeric (n x k) matrix of compositional responses. Each row should sum to 1. That is, data should already be normalised if needed. |
num.trees |
Number of trees grown in the forest. Default is 100. |
max.depth |
Maximum depth of trees. Default is 10. |
min.node.size |
Minimum number of observations in each tree leaf.
Default is 5. Note that nodes with sizes smaller than
|
mtry |
Number of covariates randomly selected as candidates at each
split. Default is |
seed |
The seed of the C++ random number generator. |
est.method |
Parameter estimation method for the Dirichlet
distribution when splitting is done. Users may either use maximum
likelihood ( |
distributional |
Logical. If |
num.cores |
Number of OpenMP threads used for parallel tree building.
The default is |
replace |
Logical. If |
sample.fraction |
Numeric. Fraction of observations used to grow each
tree, as a proportion of |
compute.oob |
Logical. If |
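The input requirements for X and Y listed above (numeric covariates only, response rows summing to 1) can be met with a little base R preparation. The sketch below is one possible way to do it; the data frame, its column names, and the raw measurements are made up for illustration.

```r
set.seed(1)
n <- 6

# Hypothetical raw data: one numeric and one categorical covariate.
df <- data.frame(
  x1 = rnorm(n),
  soil = factor(c("clay", "sand", "silt", "clay", "sand", "silt"))
)

# One-hot encode the factor ("- 1" drops the intercept, so every level
# gets its own indicator column) and bind it to the numeric covariate.
soil_dummies <- model.matrix(~ soil - 1, data = df)
X <- cbind(x1 = df$x1, soil_dummies)

# Hypothetical raw positive measurements, closed to the unit simplex
# by dividing each row by its total.
raw <- matrix(runif(n * 3, min = 1, max = 10), n, 3)
Y <- raw / rowSums(raw)
```

The resulting `X` (numeric, n x 4) and `Y` (n x 3, rows summing to 1) have the shapes that `DirichletRF(X, Y)` expects.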
Predictions. The fitted forest produces two complementary
point-prediction surfaces: a mean-based prediction (the
sample mean of training responses in the matched leaf) and a
parameter-based prediction (the expected value implied by the
estimated Dirichlet parameters). With distributional = TRUE the forest
can also produce full distributional (weight-based) predictions: it
estimates the full conditional distribution of a possibly multivariate
response Y given predictors X, represented as a weighted distribution of
the training data. The weights can be used in downstream analysis
to estimate any quantity of interest.
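To make the two point predictions concrete, the sketch below computes both for a single hypothetical leaf. The leaf responses are simulated and `alpha_hat` is a made-up estimate, so this only illustrates the two formulas, not how the package aggregates predictions across trees.

```r
set.seed(1)

# Hypothetical responses of the training observations that fell into one leaf.
G <- matrix(rgamma(20 * 3, shape = rep(c(2, 3, 4), each = 20)), 20, 3)
leaf_Y <- G / rowSums(G)

# Mean-based prediction: sample mean of the leaf responses.
mean_based <- colMeans(leaf_Y)

# Parameter-based prediction: expected value of a Dirichlet(alpha),
# i.e. alpha normalised to sum to 1. alpha_hat is a made-up estimate.
alpha_hat <- c(2.1, 2.9, 4.2)
param_based <- alpha_hat / sum(alpha_hat)
```

Both vectors lie in the unit simplex, so either can be reported as a compositional prediction.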
Out-of-bag (OOB) evaluation. Each observation is predicted
exclusively by trees for which it was held out. The OOB prediction
matrix and scalar OOB MSE are returned in $oob; the full
prediction matrix is also available so users may apply any
alternative compositional error measure such as the Aitchison
distance.
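As an example of an alternative error measure, the sketch below computes the Aitchison distance row by row through the centred log-ratio (clr) transform. The helper names `clr` and `aitchison_dist` are hypothetical, and the matrices are small made-up stand-ins for the true responses and an OOB prediction matrix in which one row is NA.

```r
# Centred log-ratio transform, applied to each row of a composition matrix.
clr <- function(Y) {
  L <- log(Y)
  L - rowMeans(L)
}

# Row-wise Aitchison distance: Euclidean distance between clr coordinates.
aitchison_dist <- function(Y, P) {
  sqrt(rowSums((clr(Y) - clr(P))^2))
}

# Made-up true compositions and predictions; the last prediction row is NA,
# mimicking an observation that was never out-of-bag.
Y <- rbind(c(0.2, 0.3, 0.5), c(0.1, 0.6, 0.3), c(0.3, 0.3, 0.4))
P <- rbind(c(0.5, 0.3, 0.2), c(0.1, 0.6, 0.3), c(NA, NA, NA))

d <- aitchison_dist(Y, P)
mean_aitchison <- mean(d, na.rm = TRUE)  # average over rows with a prediction
```

The transform takes logarithms, so compositions containing exact zeros need some form of zero replacement before this distance can be computed.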
Feature importance. Three complementary importance measures are available; the first two are computed automatically when the forest is fitted:
Gain: total Dirichlet log-likelihood gain accumulated over every split where a feature was chosen, summed across all trees. The normalised version sums to 1, facilitating comparison across forests.
Count: number of times a feature was selected as the best split variable across all internal nodes and all trees.
Permutation: computed post hoc via
permutation_importance as the mean increase in OOB
loss when a feature's values are randomly permuted within each
tree's OOB sample, with a scaled (t-statistic-like) variant
that accounts for tree-to-tree variability. Supports
Aitchison distance, MSE, and KL divergence as loss functions.
The implementation delegates all tree-building to compiled C++ code and uses OpenMP for parallel construction of trees.
Out-of-Bag (OOB) Predictions
When compute.oob = TRUE, each observation is predicted by averaging
over only the trees for which it was out-of-bag. This requires
replace = TRUE or replace = FALSE with
sample.fraction < 1. The reported $oob$mse is the MSE
between OOB predictions and true responses, averaged over components and
OOB observations. Note that MSE is not universally accepted for
compositional data since it ignores the simplex geometry; the Aitchison
distance, which operates in log-ratio space, is an alternative. The full
OOB prediction matrix $oob$predictions (n x k, with NA for
observations never out-of-bag) is returned so users can apply any
alternative error measure directly.
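The sampling requirement above can be checked with a small simulation of how many observations are left out of each tree's sample. This is a base R sketch; the helper `oob_fraction` is hypothetical, and taking `floor(sample.fraction * n)` as the per-tree sample size is an assumption rather than a documented detail of the package.

```r
set.seed(1)
n <- 200
num_trees <- 500

# Average share of observations left out of a tree's sample.
oob_fraction <- function(replace, sample.fraction) {
  size <- floor(sample.fraction * n)  # assumed rounding rule
  mean(replicate(num_trees, {
    in_bag <- sample(n, size = size, replace = replace)
    mean(!(seq_len(n) %in% in_bag))
  }))
}

oob_boot <- oob_fraction(replace = TRUE, sample.fraction = 1)       # close to exp(-1)
oob_sub  <- oob_fraction(replace = FALSE, sample.fraction = 0.632)  # 1 - 126/200
oob_none <- oob_fraction(replace = FALSE, sample.fraction = 1)      # nothing held out
```

With replace = FALSE and sample.fraction = 1 every tree sees every observation, so there is nothing to evaluate out-of-bag.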
A list of class DirichletRF containing the following elements:
type: Parallelisation type used, either "openmp" or "sequential".
num.cores: Number of cores used.
num.trees: Total number of trees in the forest.
replace: Logical indicating whether bootstrap sampling was used.
sample.fraction: The fraction of observations used per tree.
compute.oob: Logical indicating whether OOB prediction was computed.
distributional: Logical indicating whether the forest was built in distributional mode (leaf sample indices retained).
est.method: The estimation method used ("mom" or "mle").
Y_train: The training compositional response matrix.
fitted: A list of fitted values on the training data:
  alpha_hat: Estimated Dirichlet alpha parameters (n x k matrix).
  mean_based: Mean-based fitted values (n x k matrix), derived from sample means at each leaf.
  param_based: Parameter-based fitted values (n x k matrix), obtained by normalising alpha_hat so rows sum to 1.
residuals: A list of residuals (Y - fitted values):
  mean_based: Residuals from mean-based predictions.
  param_based: Residuals from parameter-based predictions.
importance: A list of feature importance measures:
  gain: Raw total likelihood gain per feature, summed over all trees and all splits where the feature was selected.
  gain_normalised: Gain divided by total gain across all features, summing to 1. Recommended for interpretation and comparison across forests.
  count: Number of times each feature was selected as the best split variable across all trees and all internal nodes.
oob: A list of OOB results. All elements are NA or NULL when compute.oob = FALSE:
  mse: Scalar OOB mean squared error, averaged over all components and all observations that appeared OOB at least once.
  predictions: An (n x k) matrix of OOB predictions. Rows corresponding to observations that never appeared OOB are NA.
  alpha_predictions: An (n x k) matrix of OOB Dirichlet alpha parameter estimates, averaged over OOB trees. NA for observations never out-of-bag.
  weights: An (n x n) matrix of OOB proximity weights. weights[i, j] is the average fraction of OOB trees in which observations i and j landed in the same leaf. Only available when both distributional = TRUE and compute.oob = TRUE; NULL otherwise. The matrix is generally asymmetric since row i is averaged only over trees where i was out-of-bag.
Masoumifard, K., van der Westhuizen, S., & Gardner-Lubbe, S. (2026). Dirichlet random forest for predicting compositional data. In A. Bekker, P. Nagar, J. Ferreira, B. Erasmus, & A. Ramoelo (Eds.), Environmental Modelling with Contemporary Statistics: Learning, Directionality, and Space-Time Dynamics. Chapman & Hall/CRC. ISBN: 9781032903910.
predict.DirichletRF for point predictions on new data
(call as predict(forest, newdata), documented under
?predict.DirichletRF).
print.DirichletRF for a summary of the fitted object
(call as print(forest) or just forest).
sample_conditional for drawing compositional samples
from the conditional predictive distribution (requires
distributional = TRUE).
importance.DirichletRF for impurity-based (gain and
count) feature importance.
permutation_importance for permutation-based OOB feature
importance (requires compute.oob = TRUE).
predict_weights for proximity weights for new observations
(requires distributional = TRUE).
# ── Minimal example (auto-tested) ──
set.seed(42)
n <- 50; p <- 2
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
G <- matrix(rgamma(n * 3, shape = rep(c(2, 3, 4), each = n)), n, 3)
Y <- G / rowSums(G)

# Default: no bootstrap, no OOB, fastest configuration
forest <- DirichletRF(X, Y, num.trees = 5, num.cores = 1)
print(forest)

# Feature importance
importance(forest)

# Prediction on new data
Xtest <- matrix(rnorm(5 * p), 5, p)
colnames(Xtest) <- paste0("X", 1:p)
pred <- predict(forest, Xtest)
pred$mean_predictions

# ── Larger example with informative and noise covariates ──
set.seed(42)
n <- 200; p <- 6
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)

# X1 and X2 are informative, X3-X6 are noise
alpha_mat <- cbind(
  2 + 3 * (X[, 1] > 0),
  3 + 3 * (X[, 2] > 0),
  rep(4, n)
)
G <- matrix(rgamma(n * 3, shape = as.vector(t(alpha_mat))), n, 3, byrow = TRUE)
Y <- G / rowSums(G)

# Default: no bootstrap, no OOB
forest <- DirichletRF(X, Y, num.trees = 100, num.cores = 1)

# Feature importance: X1 and X2 should dominate
importance(forest)

# Fitted values and residuals
head(forest$fitted$mean_based)
head(forest$residuals$mean_based)

# ── Bootstrap with OOB ──
forest_oob <- DirichletRF(X, Y, num.trees = 100, num.cores = 1,
                          replace = TRUE, sample.fraction = 1.0,
                          compute.oob = TRUE)
forest_oob$oob$mse
head(forest_oob$oob$predictions)

# ── Subsampling without replacement with OOB ──
forest_sub <- DirichletRF(X, Y, num.trees = 100, num.cores = 1,
                          replace = FALSE, sample.fraction = 0.632,
                          compute.oob = TRUE)
forest_sub$oob$mse

# ── Prediction ──
Xtest <- matrix(rnorm(10 * p), 10, p)
colnames(Xtest) <- paste0("X", 1:p)
pred <- predict(forest, Xtest)
head(pred$mean_predictions)
param_pred <- pred$alpha_predictions / rowSums(pred$alpha_predictions)

# ── Distributional forest with OOB weight matrix ──
forest_dist <- DirichletRF(X, Y, num.trees = 100, num.cores = 1,
                           replace = TRUE, sample.fraction = 1.0,
                           compute.oob = TRUE, distributional = TRUE)

# OOB weight matrix: n x n, W[i,j] = proximity of i to j via OOB trees
W <- forest_dist$oob$weights
dim(W)

# Symmetrise if a symmetric proximity matrix is preferred
W_sym <- (W + t(W)) / 2

# Weights for new observations
Xtest <- matrix(rnorm(5 * p), 5, p)
colnames(Xtest) <- paste0("X", 1:p)
W_new <- predict_weights(forest_dist, Xtest)  # 5 x n_train
Returns a data frame summarising feature importance from a fitted
DirichletRF object. Two measures are provided, gain (raw and normalised)
and split count, in three columns:
gain: Total likelihood gain accumulated across all splits where this feature was selected (raw, summed over all trees).
gain_normalised: Same as gain but normalised to sum to 1 across all features, making values comparable across forests of different sizes.
count: Number of times the feature was chosen as the best split variable across all trees and all internal nodes.
The data frame is sorted by gain_normalised in descending order.
importance(object, ...)

## S3 method for class 'DirichletRF'
importance(object, ...)
object |
A |
... |
Currently unused. |
A data frame with columns feature, gain,
gain_normalised, and count, sorted by
gain_normalised descending.
predict.DirichletRF for point predictions on new data.
print.DirichletRF for a summary of the fitted object
(call as print(forest) or just forest).
sample_conditional for drawing compositional samples
from the conditional predictive distribution (requires
distributional = TRUE).
permutation_importance for permutation-based OOB feature
importance (requires compute.oob = TRUE).
predict_weights for proximity weights for new observations
(requires distributional = TRUE).
set.seed(42)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
G <- matrix(rgamma(n * 3, shape = rep(c(2, 3, 4), each = n)), n, 3)
Y <- G / rowSums(G)

forest <- DirichletRF(X, Y, num.trees = 10, num.cores = 1)
importance(forest)
Computes permutation-based variable importance (VI) for each feature.
For each tree \(t\) and feature \(j\), the OOB error is measured
before and after randomly permuting column \(j\) within the OOB
sample of that tree. The importance of feature \(j\) is the resulting
increase in OOB loss, averaged over permutations and trees.
permutation_importance(
  object,
  X,
  loss = c("aitchison", "mse", "kl"),
  num.permutations = 5L,
  seed = 42L
)
object |
A |
X |
The training covariate matrix (n x p) passed to
|
loss |
Loss function used to measure OOB error. One of:
|
num.permutations |
Number of random permutations to average over
per feature per tree. Higher values reduce Monte Carlo noise.
Default is 5. |
seed |
Integer random seed for reproducibility of permutations.
Default is 42. |
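In symbols, with \(T\) the number of trees, \(L\) the chosen loss, and
\(\hat{f}_t\) the prediction of tree \(t\),
\[
\mathrm{VI}_j = \frac{1}{T} \sum_{t=1}^{T} \left[ \frac{1}{R} \sum_{r=1}^{R} L\big(Y_{O_t}, \hat{f}_t(X^{(j,r)}_{O_t})\big) - L\big(Y_{O_t}, \hat{f}_t(X_{O_t})\big) \right],
\]
where \(O_t\) is the OOB index set of tree \(t\), \(R\) is
num.permutations, and \(X^{(j,r)}\) is the data
matrix with column \(j\) randomly permuted on replicate \(r\),
holding all other columns fixed.
The scaled version divides by the population standard
deviation of the per-tree importances (denominator \(T\),
not \(T - 1\)):
\[
\mathrm{VI}_j^{\mathrm{scaled}} = \frac{\mathrm{VI}_j}{\sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( d_{t,j} - \mathrm{VI}_j \right)^2}},
\]
where \(d_{t,j}\) denotes the bracketed quantity above
for a single tree.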
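The permutation scheme can be sketched in base R without the package. In the example below each "tree" is replaced by a linear model fitted to a bootstrap sample, the response is univariate, and the loss is MSE; all of these are stand-ins chosen to keep the example short, and the object names are hypothetical. Only the bookkeeping (per-tree differences, averaging, population standard deviation) mirrors the description above.

```r
set.seed(1)
n <- 300
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))   # x1 informative, x2 noise
y <- 2 * X[, "x1"] + rnorm(n, sd = 0.5)
dat <- data.frame(y = y, X)

num_trees <- 25   # stand-in for the number of trees
R <- 5            # permutations per feature per tree
mse <- function(obs, pred) mean((obs - pred)^2)

# d[tr, j]: increase in OOB loss for "tree" tr when feature j is permuted.
d <- matrix(NA_real_, num_trees, ncol(X), dimnames = list(NULL, colnames(X)))
for (tr in seq_len(num_trees)) {
  in_bag <- sample(n, replace = TRUE)
  oob <- setdiff(seq_len(n), in_bag)
  fit <- lm(y ~ x1 + x2, data = dat[in_bag, ])   # stand-in for one tree
  base_loss <- mse(dat$y[oob], predict(fit, dat[oob, ]))
  for (j in colnames(X)) {
    perm_loss <- replicate(R, {
      perm <- dat[oob, ]
      perm[[j]] <- sample(perm[[j]])   # permute column j within the OOB sample
      mse(perm$y, predict(fit, perm))
    })
    d[tr, j] <- mean(perm_loss) - base_loss
  }
}

importance <- colMeans(d)
# Population standard deviation: divide by the number of trees, not by one less.
pop_sd <- sqrt(colMeans(sweep(d, 2, importance)^2))
importance_scaled <- importance / pop_sd
```

In this toy setting the informative column receives a clearly larger importance than the noise column, whose value fluctuates around zero.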
Loss functions for compositional data
MSE ignores the simplex constraint and treats the components independently. The Aitchison distance operates in the log-ratio space that is natural for compositions and is the recommended default. KL divergence is asymmetric but common in information-theoretic contexts.
Small predicted values near zero can cause numerical issues for
Aitchison and KL losses. A small constant (1e-10) is added to
all predictions before computing these losses.
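The effect of the small constant can be seen in a minimal sketch of a KL loss. The function `kl_loss` is hypothetical; the direction of the divergence (observed composition relative to prediction) and the choice not to renormalise after adding the constant are assumptions, not documented behaviour of the package.

```r
# KL divergence of an observed composition y from a prediction p,
# with a small offset added to the prediction to avoid log(0).
kl_loss <- function(y, p, eps = 1e-10) {
  p <- p + eps
  terms <- ifelse(y > 0, y * log(y / p), 0)  # a zero component of y contributes 0
  sum(terms)
}

y <- c(0.2, 0.3, 0.5)
p_zero <- c(0, 0.4, 0.6)        # prediction with an exact zero

kl_zero <- kl_loss(y, p_zero)   # finite thanks to the offset
kl_same <- kl_loss(y, y)        # essentially zero
```

Without the offset the first term would involve a division by zero and the loss would be infinite.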
Interpretation
A feature with importance near zero (or negative, due to Monte
Carlo noise) does not contribute to predictive accuracy. Features with
large positive importance_scaled are robustly important across
trees.
A data frame with one row per feature and columns:
feature: Feature name.
importance: Mean increase in OOB loss when the feature is permuted. Larger values indicate more important features.
importance_scaled: Importance divided by its standard deviation across trees. Analogous to a t-statistic; large positive values suggest a feature contributes meaningfully.
importance_sd: Standard deviation of the per-tree importance values, giving a sense of variability.
The data frame is sorted by importance in descending order.
predict.DirichletRF for point predictions on new data.
print.DirichletRF for a summary of the fitted object
(call as print(forest) or just forest).
sample_conditional for drawing compositional samples
from the conditional predictive distribution (requires
distributional = TRUE).
importance.DirichletRF for impurity-based (gain and
count) feature importance.
predict_weights for proximity weights for new observations
(requires distributional = TRUE).
set.seed(42)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
alpha_mat <- cbind(2 + 3 * (X[, 1] > 0), 3 + 3 * (X[, 2] > 0), rep(4, n))
G <- matrix(rgamma(n * 3, shape = as.vector(t(alpha_mat))), n, 3, byrow = TRUE)
Y <- G / rowSums(G)

forest <- DirichletRF(X, Y, num.trees = 50, num.cores = 1,
                      replace = TRUE, compute.oob = TRUE)
permutation_importance(forest, X)
For each row of newdata, computes a normalised weight vector
over all training observations, where the weight of training
observation j is proportional to how often it co-occurs with
the new point in the same leaf across all trees. These weights define
the forest-weighted empirical distribution over the training responses
and can be used to estimate conditional quantities such as means,
variances, or probabilities for the new covariate point.
predict_weights(object, newdata)
object |
A |
newdata |
A numeric matrix of new covariates (n_test x p).
Column order must match the training matrix passed to
|
Weights are computed using all trees in the forest (no OOB restriction
applies, since new observations were not part of training). This
contrasts with $oob$weights, which restricts each training
observation to its held-out trees only and is available directly on
the fitted object when both distributional = TRUE and
compute.oob = TRUE.
A numeric matrix of dimensions n_test x n_train. Row i
contains the normalised proximity weights of the i-th new
observation over all n_train training observations. Each row
sums to 1. Entries are zero for training observations that never
shared a leaf with the new point across any tree.
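The construction of such a weight vector, and its use for conditional quantities, can be sketched for a single new observation. The leaf assignments below are randomly generated placeholders (`train_leaf`, `new_leaf`), and normalising raw co-occurrence counts follows the description above; whether the package additionally scales by leaf size is not stated here, so treat that detail as an assumption.

```r
set.seed(1)
n_train <- 8
num_trees <- 4

# Made-up training compositions.
G <- matrix(rgamma(n_train * 3, shape = rep(c(2, 3, 4), each = n_train)), n_train, 3)
Y <- G / rowSums(G)

# Placeholder leaf IDs: train_leaf[i, t] is the leaf of training row i in tree t,
# new_leaf[t] is the leaf reached by the new observation in tree t.
train_leaf <- matrix(sample(1:3, n_train * num_trees, replace = TRUE), n_train, num_trees)
new_leaf <- train_leaf[1, ]   # guarantees at least one shared leaf per tree

# Count how often each training row shares a leaf with the new point,
# then normalise so the weights sum to 1.
same_leaf <- train_leaf == matrix(new_leaf, n_train, num_trees, byrow = TRUE)
co_occur <- rowSums(same_leaf)
w <- co_occur / sum(co_occur)

# Conditional quantities under the forest-weighted empirical distribution.
cond_mean <- as.vector(w %*% Y)                   # weighted mean composition
cond_var <- as.vector(w %*% Y^2) - cond_mean^2    # component-wise variances
p_first_large <- sum(w[Y[, 1] > 0.25])            # P(Y_1 > 0.25 | new point)
```

Stacking one such weight vector per new observation gives a matrix with the n_test x n_train layout described above.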
predict.DirichletRF for point predictions on new data.
print.DirichletRF for a summary of the fitted object
(call as print(forest) or just forest).
sample_conditional for drawing compositional samples
from the conditional predictive distribution (requires
distributional = TRUE).
importance.DirichletRF for impurity-based (gain and
count) feature importance.
permutation_importance for permutation-based OOB feature
importance (requires compute.oob = TRUE).
set.seed(42)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
alpha_mat <- cbind(2 + 3 * (X[, 1] > 0), 3 + 3 * (X[, 2] > 0), rep(4, n))
G <- matrix(rgamma(n * 3, shape = as.vector(t(alpha_mat))), n, 3, byrow = TRUE)
Y <- G / rowSums(G)

forest <- DirichletRF(X, Y, num.trees = 50, num.cores = 1,
                      distributional = TRUE)

# Weights for 5 new observations: the matrix is 5 x 100
Xtest <- matrix(rnorm(5 * p), 5, p)
colnames(Xtest) <- paste0("X", 1:p)
W <- predict_weights(forest, Xtest)
dim(W)       # 5 x 100
rowSums(W)   # all 1

# Weighted conditional mean for each new observation
Y_hat <- W %*% Y  # 5 x k
Makes predictions using a fitted DirichletRF object returned
by DirichletRF.
## S3 method for class 'DirichletRF'
predict(object, newdata, ...)
object |
A |
newdata |
A numeric matrix of new covariates (n_new x p). |
... |
Currently unused. |
A list with the following elements:
alpha_predictions: Estimated Dirichlet alpha parameters for each new observation (n_new x k matrix).
mean_predictions: Mean-based compositional predictions (n_new x k matrix).
print.DirichletRF for a summary of the fitted object
(call as print(forest) or just forest).
sample_conditional for drawing compositional samples
from the conditional predictive distribution (requires
distributional = TRUE).
importance.DirichletRF for impurity-based (gain and
count) feature importance.
permutation_importance for permutation-based OOB feature
importance (requires compute.oob = TRUE).
predict_weights for proximity weights for new observations
(requires distributional = TRUE).
# Small toy example (auto-tested)
set.seed(42)
n <- 50; p <- 2
X <- matrix(rnorm(n * p), n, p)
G <- matrix(rgamma(n * 3, shape = rep(c(2, 3, 4), each = n)), n, 3)
Y <- G / rowSums(G)

forest <- DirichletRF(X, Y, num.trees = 5, num.cores = 1)
Xtest <- matrix(rnorm(5 * p), 5, p)
pred <- predict(forest, Xtest)
pred$mean_predictions

# Larger example
n <- 500; p <- 4
X <- matrix(rnorm(n * p), n, p)
alpha <- c(2, 3, 4)
G <- matrix(rgamma(n * length(alpha), shape = rep(alpha, each = n)),
            n, length(alpha))
Y <- G / rowSums(G)

forest <- DirichletRF(X, Y, num.trees = 50, num.cores = 1)
Xtest <- matrix(rnorm(10 * p), 10, p)
pred <- predict(forest, Xtest)

# Parameter-based predictions: normalise the alpha estimates row-wise
param_pred <- pred$alpha_predictions / rowSums(pred$alpha_predictions)

# Predicting a single observation: keep the matrix shape with drop = FALSE
single_pred <- predict(forest, Xtest[1, , drop = FALSE])
Suppresses the display of large data matrices (Y_train, fitted,
residuals) when the object is printed, while keeping them accessible
via $.
## S3 method for class 'DirichletRF'
print(x, ...)
x |
A |
... |
Further arguments passed to or from other methods. |
Invisibly returns x, the DirichletRF object
unchanged. Called primarily for its side effect of printing a summary
of the model to the console.
predict.DirichletRF for point predictions on new data
(call as predict(forest, newdata), documented under
?predict.DirichletRF).
sample_conditional for drawing compositional samples
from the conditional predictive distribution (requires
distributional = TRUE).
importance.DirichletRF for impurity-based (gain and
count) feature importance.
permutation_importance for permutation-based OOB feature
importance (requires compute.oob = TRUE).
predict_weights for proximity weights for new observations
(requires distributional = TRUE).
Given a fitted DirichletRF built with
distributional = TRUE and a single test covariate vector, draws
size compositional observations from the forest-weighted empirical
distribution over the training responses. Each training observation
receives a weight proportional to how often it co-occurs with the test
point in the same leaf across all trees; the returned rows are a
weighted-bootstrap draw from those training Y rows.
sample_conditional(object, x_new, size = 100L)
object |
A |
x_new |
A numeric vector of length p (a single test covariate point). |
size |
A positive integer giving the number of compositional
observations to draw. Default is 100. |
A numeric matrix of dimensions size x k, where each row
is one draw from the conditional distribution of Y given x_new.
Row names are draw_1, draw_2, ... and column names
are inherited from the training Y matrix if available.
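The weighted-bootstrap draw described above amounts to resampling training rows with probability equal to their forest weights. The sketch below shows the idea in base R; the weight vector `w` is made up rather than derived from a fitted forest, and the component names are placeholders.

```r
set.seed(1)
n_train <- 10

# Made-up training compositions with named components.
G <- matrix(rgamma(n_train * 3, shape = rep(c(2, 3, 4), each = n_train)), n_train, 3)
Y <- G / rowSums(G)
colnames(Y) <- c("a", "b", "c")

# Hypothetical forest weights for one test point: only the first four
# training rows ever shared a leaf with it.
w <- c(0.4, 0.3, 0.2, 0.1, rep(0, 6))

# Weighted bootstrap: resample training rows with probability w.
size <- 5
idx <- sample(n_train, size = size, replace = TRUE, prob = w)
draws <- Y[idx, , drop = FALSE]
rownames(draws) <- paste0("draw_", seq_len(size))
```

Every draw is an observed training composition, so the sample stays on the simplex by construction; it cannot produce compositions that were never observed.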
predict.DirichletRF for point predictions on new data.
print.DirichletRF for a summary of the fitted object.
importance.DirichletRF for impurity-based feature importance.
permutation_importance for permutation-based OOB feature importance.
predict_weights for proximity weights for new observations.
set.seed(1)
n <- 80; p <- 3
X <- matrix(rnorm(n * p), n, p)
G <- matrix(rgamma(n * 4, shape = rep(c(1, 2, 3, 4), each = n)), n, 4)
Y <- G / rowSums(G)

forest <- DirichletRF(X, Y, num.trees = 20, num.cores = 1,
                      distributional = TRUE)

x_test <- rnorm(p)
draws <- sample_conditional(forest, x_test, size = 200L)
colMeans(draws)  # estimated conditional mean of Y | x_test