This function draws a subsample from big data under a logistic regression model so that the subsample can be used to describe the data. Sampling probabilities are obtained with the local case control method.
Arguments
- r1
sample size for initial random sampling
- r2
sample size for local case control sampling (the example below passes a vector of sizes)
- Y
response data or Y
- X
covariate data or X matrix that has all the covariates (first column is for the intercept)
- N
size of the big data
Value
The output of LCCsampling is a list with the following elements:
Beta_Estimates
estimated model parameters in a data.frame after sampling
Utility_Estimates
estimated log-scaled information and variance for the estimated model parameters
Sample_LCC_Sampling
list of indexes for the initial and optimal samples obtained based on local case control sampling
Sampling_Probability
vector of calculated sampling probabilities for local case control sampling
Details
Two-stage sampling algorithm for big data under logistic regression.
First, a random sample of size \(r_1\) is obtained and used to estimate the model parameters. Using these estimates, the sampling probabilities for local case control sampling are evaluated.
Using the estimated sampling probabilities, an optimal sample of size \(r_2 \ge r_1\) is then obtained. Finally, the optimal sample is used to estimate the model parameters.
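The two stages above can be sketched in base R. This is a minimal illustration and not the code of LCCsampling itself: it assumes sampling probabilities proportional to \(|y_i - \tilde{p}(x_i)|\), where \(\tilde{p}\) is the pilot fit, and the correction from Fithian and Hastie (2015) in which the pilot estimate is added to the subsample estimate. All object names (beta_pilot, prob_lcc, beta_lcc and so on) are hypothetical, and unlike the description above the sketch fits the second stage on the \(r_2\) drawn points only.

```r
set.seed(1)

# Simulated "big data": intercept column plus two standard-normal covariates.
N <- 10000
X <- cbind(1, matrix(rnorm(N * 2), ncol = 2))
beta_true <- c(-1, 2, 1)
Y <- rbinom(N, size = 1, prob = plogis(drop(X %*% beta_true)))

r1 <- 300   # size of the initial random sample
r2 <- 600   # size of the local case control sample

# Stage 1: uniform random sample and pilot estimate of the parameters.
idx_pilot <- sample.int(N, r1)
fit_pilot <- glm.fit(X[idx_pilot, ], Y[idx_pilot], family = binomial())
beta_pilot <- fit_pilot$coefficients

# Sampling probabilities (assumed form): proportional to |y - pilot probability|.
p_pilot <- plogis(drop(X %*% beta_pilot))
prob_lcc <- abs(Y - p_pilot)
prob_lcc <- prob_lcc / sum(prob_lcc)

# Stage 2: draw r2 indices with replacement, fit an unweighted logistic
# regression on them, and add the pilot estimate to undo the selection bias.
idx_lcc <- sample.int(N, r2, replace = TRUE, prob = prob_lcc)
fit_lcc <- glm.fit(X[idx_lcc, ], Y[idx_lcc], family = binomial())
beta_lcc <- fit_lcc$coefficients + beta_pilot
```

Points that the pilot model classifies poorly receive the largest probabilities, so the second-stage sample concentrates on the observations that are most informative about the decision boundary.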
NOTE: If the input parameters do not satisfy the conditions below, an error message describing the problem is produced.
If \(r_2 \ge r_1\) is not satisfied then an error message will be produced.
If the big data \(X,Y\) has any missing values then an error message will be produced.
The big data size \(N\) is compared with the sizes of \(X,Y\) and if they are not aligned an error message will be produced.
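The three checks above could look roughly like the following. This is a sketch under assumptions: check_lcc_inputs is a hypothetical helper, and neither its name nor its messages are taken from the package.

```r
# Hypothetical helper; name and messages are illustrative, not the package's own.
check_lcc_inputs <- function(r1, r2, Y, X, N) {
  # r2 may be a vector of sample sizes; every entry must be at least r1.
  if (any(r2 < r1)) stop("every r2 must be at least r1")
  # Missing values in either the covariates or the response are rejected.
  if (anyNA(X) || anyNA(Y)) stop("X and Y must not contain missing values")
  # N has to agree with the number of rows in both X and Y.
  if (nrow(X) != N || NROW(Y) != N) stop("N must equal the number of rows of X and Y")
  invisible(TRUE)
}

# Small demonstration data: intercept column plus one covariate.
X_demo <- cbind(1, c(0.5, -1.2, 0.3))
Y_demo <- c(0, 1, 0)

# Valid inputs pass silently.
check_lcc_inputs(r1 = 2, r2 = 3, Y = Y_demo, X = X_demo, N = 3)

# Violating r2 >= r1 raises an error, captured here as a message string.
msg <- tryCatch(check_lcc_inputs(r1 = 3, r2 = 2, Y = Y_demo, X = X_demo, N = 3),
                error = function(e) conditionMessage(e))
```

Running the checks before any sampling means a malformed call fails immediately instead of after the first stage has already been fitted.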
References
Fithian W, Hastie T (2015). “Local case-control sampling: Efficient subsampling in imbalanced data sets.” Quality control and applied statistics, 60(3), 187–190.
Examples
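# Simulate logistic regression data: an intercept plus two standard-normal covariates
Dist <- "Normal"; Dist_Par <- list(Mean = 0, Variance = 1)
No_Of_Var <- 2; Beta <- c(-1, 2, 1); N <- 10000; Family <- "logistic"
Full_Data <- GenGLMdata(Dist, Dist_Par, No_Of_Var, Beta, N, Family)

# r1 is the initial sample size; r2 holds the sizes 600, 900, 1200 repeated 50 times
r1 <- 300; r2 <- rep(100 * c(6, 9, 12), 50); Original_Data <- Full_Data$Complete_Data

# Column 1 of Complete_Data is passed as Y; the remaining columns are passed as X
Results <- LCCsampling(r1 = r1, r2 = r2, Y = as.matrix(Original_Data[, 1]),
                       X = as.matrix(Original_Data[, -1]),
                       N = nrow(Original_Data))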
#> Step 1 of the algorithm completed.
#> Step 2 of the algorithm completed.
plot_Beta(Results)
#> Picking joint bandwidth of 0.0298
#> Picking joint bandwidth of 0.0358
#> Picking joint bandwidth of 0.0338