Subsampling under Poisson regression for a potentially misspecified model
Source: R/modelMissPoiSub.R
modelMissPoiSub.Rd
Use this function to draw subsamples from big data under Poisson regression for a potentially misspecified model. Subsampling probabilities are obtained from the A- and L-optimality criteria and from RLmAMSE (Reduction of Loss by minimizing the Average Mean Squared Error).
Arguments
- r1
sample size for initial random sampling
- r2
sample size for optimal sampling
- Y
response data or Y
- X
covariate data or X matrix that has all the covariates (first column is for the intercept)
- N
size of the big data
- Alpha
scaling factor when using Log Odds or Power functions to magnify the probabilities
- Beta_Estimate_Full
estimate of Beta after fitting the Poisson model
- F_Estimate_Full
estimate of f, the difference between the linear predictors of the GAM and the Poisson model
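As a reading aid, the description of F_Estimate_Full can be written as a formula. The notation below is an assumption made for illustration: \(\hat{\eta}_i^{\mathrm{GAM}}\) denotes the fitted linear predictor of a GAM at observation \(i\), \(x_i\) is the \(i\)-th row of X, and \(\hat{\beta}\) is Beta_Estimate_Full.

```latex
\[
  \hat{f}(x_i) \;=\; \hat{\eta}_i^{\mathrm{GAM}} \;-\; x_i^{\top}\hat{\beta},
  \qquad i = 1, \dots, N .
\]
```

Under this reading, F_Estimate_Full holds one value per observation, which agrees with its length being compared against \(N\).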
Value
The output of modelMissPoiSub is a list containing
Beta_Estimates
estimated model parameters after subsampling
AMSE_Estimates
matrix of estimated AMSE values after subsampling
Sample_A-Optimality
list of indexes for the initial and optimal samples obtained based on the A-optimality criterion
Sample_L-Optimality
list of indexes for the initial and optimal samples obtained based on the L-optimality criterion
Sample_RLmAMSE
list of indexes for the optimal samples obtained based on RLmAMSE
Sample_RLmAMSE_Log_Odds
list of indexes for the optimal samples obtained based on RLmAMSE with Log Odds function
Sample_RLmAMSE_Power
list of indexes for the optimal samples obtained based on RLmAMSE with Power function
Subsampling_Probability
matrix of calculated subsampling probabilities
Details
A two-stage subsampling algorithm for big data under Poisson regression with potential model misspecification.
The first stage draws a random sample of size \(r_1\) and estimates the model parameters. Using the estimated parameters, subsampling probabilities are evaluated for the A- and L-optimality criteria, RLmAMSE, and the enhanced RLmAMSE (log-odds and power) subsampling methods.
In the second stage, a sample of size \(r_2 \ge r_1\) is drawn using the estimated subsampling probabilities. Finally, for A- and L-optimality the two samples are combined and the model parameters are estimated, while for RLmAMSE and enhanced RLmAMSE (log-odds and power) only the optimal sample is used.
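The two stages can be sketched in base R. This is a minimal illustration, not the package's implementation: the data are simulated stand-ins, the probability formula is an assumed L-optimality-style form (proportional to \(|y_i - \hat{\mu}_i|\,\|x_i\|\)), the final fit is assumed to be inverse-probability weighted, and the RLmAMSE variants are not covered.

```r
set.seed(1)

# Simulated stand-in for the big data (hypothetical; not the package's generator)
N <- 5000
X <- cbind(1, matrix(runif(N * 2), ncol = 2))   # first column is the intercept
Y <- rpois(N, lambda = as.vector(exp(X %*% c(-1, 2, 1))))

r1 <- 300
r2 <- 600

# Stage 1: uniform random sample of size r1 and a pilot estimate
idx1  <- sample(N, r1)
pilot <- glm.fit(X[idx1, ], Y[idx1], family = poisson())
beta1 <- pilot$coefficients

# Subsampling probabilities for all N observations from the pilot estimate.
# Assumed form: pi_i proportional to |y_i - mu_i| * ||x_i||
mu   <- as.vector(exp(X %*% beta1))
pi_L <- abs(Y - mu) * sqrt(rowSums(X^2))
pi_L <- pi_L / sum(pi_L)

# Stage 2: draw r2 observations with replacement using these probabilities
idx2 <- sample(N, r2, replace = TRUE, prob = pi_L)

# Combine both samples and weight each draw by the inverse of its
# sampling probability (stage 1 was uniform, so its weight is N)
idx <- c(idx1, idx2)
w   <- c(rep(N, r1), 1 / pi_L[idx2])
fit <- glm.fit(X[idx, ], Y[idx], weights = w, family = poisson())
beta_hat <- fit$coefficients
```

Here beta_hat plays the role of one row of Beta_Estimates for a single value of r2; the function itself repeats this over every entry of r2 and over each subsampling method.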
NOTE: If the input parameters do not satisfy the required domain conditions, an error message is produced and the function stops.
If \(r_2 \ge r_1\) is not satisfied then an error message will be produced.
If the big data \(X,Y\) has any missing values then an error message will be produced.
The big data size \(N\) is compared with the sizes of \(X,Y\) and F_Estimate_Full, and if they are not aligned an error message will be produced.
If \(\alpha > 1\) is not satisfied for the scaling factor then an error message will be produced.
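The checks listed above could look roughly like the following. The helper check_inputs and its error messages are hypothetical and written only to illustrate the conditions; they are not taken from the package.

```r
# Hypothetical helper mirroring the documented checks; not the package's own code
check_inputs <- function(r1, r2, Y, X, N, Alpha, F_Estimate_Full) {
  # every requested optimal sample size must be at least r1
  if (any(r2 < r1)) stop("r2 must be greater than or equal to r1")
  # the big data must be complete
  if (anyNA(X) || anyNA(Y)) stop("X and Y must not contain missing values")
  # N must agree with the number of rows of X and Y and with F_Estimate_Full
  if (nrow(X) != N || NROW(Y) != N || length(F_Estimate_Full) != N)
    stop("N must match the sizes of X, Y and F_Estimate_Full")
  # the scaling factor must exceed 1
  if (Alpha <= 1) stop("Alpha must be greater than 1")
  invisible(TRUE)
}

# Small usage example with toy inputs
X_toy <- cbind(1, c(0.1, 0.2, 0.3))
Y_toy <- c(1, 0, 2)
ok <- check_inputs(r1 = 1, r2 = c(2, 3), Y = Y_toy, X = X_toy, N = 3,
                   Alpha = 10, F_Estimate_Full = rep(0, 3))

# An invalid scaling factor is reported as an error
msg <- tryCatch(
  check_inputs(r1 = 1, r2 = c(2, 3), Y = Y_toy, X = X_toy, N = 3,
               Alpha = 0.5, F_Estimate_Full = rep(0, 3)),
  error = function(e) conditionMessage(e)
)
```

Running such checks before any sampling means a mis-sized input fails immediately rather than after the pilot fit.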
References
Adewale AJ, Wiens DP (2009). “Robust designs for misspecified logistic models.” Journal of Statistical Planning and Inference, 139(1), 3–15.
Adewale AJ, Xu X (2010). “Robust designs for generalized linear models with possible overdispersion and misspecified link functions.” Computational Statistics & Data Analysis, 54(4), 875–890.
Examples
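# Generate the big data under a misspecified Poisson model
No_Of_Var <- 2; Beta <- c(-1, 2, 2, 1); N <- 10000
MisspecificationType <- "Type 2 Squared"; family <- "poisson"
Full_Data <- GenModelMissGLMdata(No_Of_Var, Beta, Var_Epsilon = NULL, N,
                                 MisspecificationType, family)

# Initial sample size, and optimal sample sizes 600 and 900 alternating 50 times
r1 <- 300; r2 <- rep(100 * c(6, 9), 50); Original_Data <- Full_Data$Full_Data

# Optional: register a parallel backend before subsampling
# cl <- parallel::makeCluster(4)
# doParallel::registerDoParallel(cl)

if (FALSE) {
# The first column of Original_Data is the response; the rest are covariates
Results <- modelMissPoiSub(r1 = r1, r2 = r2,
                           Y = as.matrix(Original_Data[, 1]),
                           X = as.matrix(Original_Data[, -1]),
                           N = Full_Data$N,
                           Alpha = 10,
                           Beta_Estimate_Full = Full_Data$Beta$Estimate,
                           F_Estimate_Full = Full_Data$f$Real_GAM)

# parallel::stopCluster(cl)

# Plot the estimated parameters and AMSE values
plot_Beta(Results)
plot_AMSE(Results)
}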