
Model robust optimal subsampling for A- and L- optimality criteria under logistic regression
Source: R/modelRobustLogSub.R
modelRobustLogSub.Rd
Use this function to sample from big data under logistic regression when more than one model can describe the data. Subsampling probabilities are obtained based on the A- and L-optimality criteria.
Arguments
- r1
sample size for initial random sampling
- r2
sample size for optimal sampling
- Y
response data or Y
- X
covariate data or X matrix that has all the covariates (first column is for the intercept)
- N
size of the big data
- Apriori_probs
vector of a priori model probabilities that are used to obtain the model robust subsampling probabilities
- All_Combinations
list of possible models that can describe the data
- All_Covariates
all the covariates in the models
Value
The output of modelRobustLogSub gives a list of
Beta_Data
estimated model parameters for each model in a list after subsampling
Utility_Data
estimated Variance and Information of the model parameters after subsampling
Sample_L-optimality
list of indexes for the initial and optimal samples obtained based on L-optimality criteria
Sample_L-optimality_MR
list of indexes for the initial and model robust optimal samples obtained based on L-optimality criteria
Sample_A-optimality
list of indexes for the initial and optimal samples obtained based on A-optimality criteria
Sample_A-optimality_MR
list of indexes for the initial and model robust optimal samples obtained based on A-optimality criteria
Subsampling_Probability
matrix of calculated subsampling probabilities for A- and L- optimality criteria
Details
A two-stage subsampling algorithm for big data under logistic regression when multiple models can describe the data.
The first stage obtains a random sample of size \(r_1\) and estimates the model parameters for all models. Using the estimated parameters, subsampling probabilities are evaluated for the A- and L-optimality criteria and for the corresponding model-averaged (model robust) A- and L-optimality subsampling methods.
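For orientation, the probabilities described above are often written in the following form in the optimal subsampling literature for logistic regression. This is an assumption about the general shape, not a statement of the package's exact internals. Here \(\hat p_i\) is the fitted probability for observation \(i\) from the stage-one estimate, \(x_i\) is its covariate vector, \(M_X = N^{-1}\sum_{i=1}^{N} \hat p_i (1-\hat p_i) x_i x_i^{T}\), and \(\pi_i^{(q)}\) is the A- or L-optimality probability under model \(q\).

\[
\pi_i^{A} = \frac{|y_i - \hat p_i|\,\lVert M_X^{-1} x_i \rVert}{\sum_{j=1}^{N} |y_j - \hat p_j|\,\lVert M_X^{-1} x_j \rVert},
\qquad
\pi_i^{L} = \frac{|y_i - \hat p_i|\,\lVert x_i \rVert}{\sum_{j=1}^{N} |y_j - \hat p_j|\,\lVert x_j \rVert},
\qquad
\pi_i^{MR} = \sum_{q=1}^{Q} \alpha_q\, \pi_i^{(q)}
\]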
Using the estimated subsampling probabilities, a sample of size \(r_2 \ge r_1\) is obtained. Finally, the two samples are combined and the model parameters are estimated for all the models.
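The two stages can be illustrated with a minimal base R sketch on simulated data. It is not the package's implementation. The L-optimality form \(|y_i - \hat p_i|\,\lVert x_i \rVert\), the weighted average over models, and the inverse-probability weights in the final fit are all assumptions made for illustration, as are the object names.

```r
set.seed(1)

# Simulated "big data"; every name here is illustrative, not a package object
N <- 5000
X <- cbind(X0 = 1, X1 = rnorm(N), X2 = rnorm(N))
X <- cbind(X, "X1^2" = X[, "X1"]^2)
Y <- rbinom(N, 1, plogis(drop(X[, 1:3] %*% c(-0.5, 1, -1))))

models <- list(Model_1 = c("X0", "X1", "X2"),
               Model_2 = c("X0", "X1", "X2", "X1^2"))
alpha <- c(0.5, 0.5)   # a priori model probabilities
r1 <- 300
r2 <- 600

# Stage 1: uniform random sample, then fit every model on it
idx1 <- sample(N, r1, replace = TRUE)

# Assumed L-optimality form: probability proportional to |y - p_hat| * ||x||
l_opt_prob <- function(vars) {
  Xm  <- X[, vars, drop = FALSE]
  fit <- glm.fit(Xm[idx1, , drop = FALSE], Y[idx1], family = binomial())
  p_hat <- plogis(drop(Xm %*% fit$coefficients))
  w <- abs(Y - p_hat) * sqrt(rowSums(Xm^2))
  w / sum(w)
}
PI_each <- sapply(models, l_opt_prob)   # N x Q matrix, one column per model
PI_MR   <- drop(PI_each %*% alpha)      # model-averaged probabilities

# Stage 2: sample with the model robust probabilities and combine both samples
idx2 <- sample(N, r2, replace = TRUE, prob = PI_MR)
idx  <- c(idx1, idx2)
wts  <- 1 / c(rep(1 / N, r1), PI_MR[idx2])   # assumed inverse-probability weights

# Weighted refit of every model; quasibinomial gives the same point estimates
# as binomial without the non-integer weight warning
beta_hat <- lapply(models, function(vars) {
  glm.fit(X[idx, vars, drop = FALSE], Y[idx],
          weights = wts / mean(wts), family = quasibinomial())$coefficients
})
```

Because the averaged probabilities are a convex combination of the per-model probabilities, they still sum to one and can be passed directly to `sample()`.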
NOTE: If the input parameters do not satisfy the domain conditions listed below, the function stops with an error message describing the problem.
If \(r_2 \ge r_1\) is not satisfied then an error message will be produced.
If the big data \(X,Y\) has any missing values then an error message will be produced.
The big data size \(N\) is compared with the sizes of \(X,Y\) and if they are not aligned an error message will be produced.
If the a priori model probabilities do not satisfy \(0 < \alpha_{q} < 1\), an error message will be produced, where \(q=1,\ldots,Q\) and \(Q\) is the number of models in the model set.
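The checks listed above could be sketched as follows. `check_inputs` is a hypothetical helper written for illustration. It is not a function exported by the package, and the error messages are invented.

```r
# Hypothetical helper mirroring the documented domain conditions
check_inputs <- function(r1, r2, Y, X, N, Apriori_probs) {
  if (any(r2 < r1)) {
    stop("r2 must be greater than or equal to r1")
  }
  if (anyNA(X) || anyNA(Y)) {
    stop("X or Y has missing values")
  }
  if (NROW(X) != N || NROW(Y) != N) {
    stop("N does not match the size of X or Y")
  }
  if (any(Apriori_probs <= 0 | Apriori_probs >= 1)) {
    stop("each a priori model probability must lie strictly between 0 and 1")
  }
  invisible(TRUE)
}

# Small illustrative inputs
X_small <- cbind(1, c(0.2, -1.3, 0.7))
Y_small <- c(0, 1, 1)

ok <- check_inputs(r1 = 2, r2 = 3, Y = Y_small, X = X_small, N = 3,
                   Apriori_probs = c(0.5, 0.5))

# A violated condition (r2 < r1) is reported as an error
msg <- tryCatch(
  check_inputs(r1 = 5, r2 = 3, Y = Y_small, X = X_small, N = 3,
               Apriori_probs = c(0.5, 0.5)),
  error = function(e) conditionMessage(e)
)
```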
References
Mahendran A, Thompson H, McGree JM (2023). “A model robust subsampling approach for Generalised Linear Models in big data settings.” Statistical Papers, 64(4), 1137--1157.
Examples
indexes<-1:ceiling(nrow(Skin_segmentation)*0.25)
Original_Data<-cbind(Skin_segmentation[indexes,1],1,Skin_segmentation[indexes,-1])
colnames(Original_Data)<-c("Y",paste0("X",0:ncol(Original_Data[,-c(1,2)])))
# Scaling the covariate data
for (j in 3:5) {
  Original_Data[,j]<-scale(Original_Data[,j])
}
No_of_Variables<-ncol(Original_Data[,-c(1,2)])
Squared_Terms<-paste0("X",1:No_of_Variables,"^2")
term_no <- 2
All_Models <- list(c("X0",paste0("X",1:No_of_Variables)))
Original_Data<-cbind(Original_Data,Original_Data[,-c(1,2)]^2)
colnames(Original_Data)<-c("Y","X0",paste0("X",1:No_of_Variables),
                           paste0("X",1:No_of_Variables,"^2"))
# Build every model: main effects plus each combination of squared terms
for (i in 1:No_of_Variables){
  x <- as.vector(combn(Squared_Terms,i,simplify = FALSE))
  for(j in 1:length(x)){
    All_Models[[term_no]] <- c("X0",paste0("X",1:No_of_Variables),x[[j]])
    term_no <- term_no+1
  }
}
All_Models<-All_Models[-c(5:7)]
names(All_Models)<-paste0("Model_",1:length(All_Models))
r1<-300; r2<-rep(100*c(6,12),25);
modelRobustLogSub(r1 = r1, r2 = r2, Y = as.matrix(Original_Data[,1]),
                  X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                  Apriori_probs = rep(1/length(All_Models),length(All_Models)),
                  All_Combinations = All_Models,
                  All_Covariates = colnames(Original_Data)[-1])->Results
#> Step 1 of the algorithm completed.
#> Step 2 of the algorithm completed.
Beta_Plots<-plot_Beta(Results)