The R package “NeEDS4BigData” implements subsampling methods for analysing big data.
What is “NeEDS4BigData” an abbreviation for?
New Experimental Design based Subsampling methods for Big Data.
How to engage with “NeEDS4BigData” for the first time?
## Installing the package from GitHub
devtools::install_github("Amalan-ConStat/NeEDS4BigData")
## Installing the package from CRAN
install.packages("NeEDS4BigData")
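To avoid reinstalling on every run, the installation step can be wrapped in a check. The following is a minimal sketch using only base R; `install_if_missing` is a hypothetical helper name and is not part of the package.

```r
# Hypothetical helper (not part of NeEDS4BigData): install a package from CRAN
# only when it is not already available, then report whether it can be loaded.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

Calling `install_if_missing("NeEDS4BigData")` followed by `library(NeEDS4BigData)` would then load the package for the current session.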
Subsampling Methods
- A- and L-optimality based subsampling for GLMs.
- A-optimality based subsampling for Gaussian Linear Models.
- Leverage sampling for GLMs.
- Local case control sampling for logistic regression.
- A-optimality based subsampling under measurement constraints for GLMs.
- Model robust subsampling method for GLMs.
- Subsampling method for GLMs when the model is potentially misspecified.
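To give a feel for what a subsampling method does, here is a minimal base R sketch of leverage sampling for a Gaussian linear model. It does not use the package's own functions; the simulated data, the variable names, and the subsample size `r` are assumptions made for illustration only.

```r
set.seed(1)

# Simulated "big" data (hypothetical): an intercept and two covariates.
N <- 10000
X <- cbind(1, matrix(rnorm(N * 2), ncol = 2))
beta_true <- c(1, 2, -1)
y <- as.vector(X %*% beta_true + rnorm(N))

# Leverage scores are the squared row norms of Q from the QR decomposition of X.
Q <- qr.Q(qr(X))
leverage <- rowSums(Q^2)
probs <- leverage / sum(leverage)  # sampling probabilities, sum to 1

# Draw r observations with replacement, with probability proportional to leverage.
r <- 500
idx <- sample.int(N, size = r, replace = TRUE, prob = probs)

# Weighted least squares on the subsample, using inverse-probability weights.
fit_sub <- lm.wfit(x = X[idx, , drop = FALSE], y = y[idx], w = 1 / probs[idx])
beta_sub <- fit_sub$coefficients

# Full-data least squares fit, for comparison.
beta_full <- lm.fit(X, y)$coefficients
```

Comparing `beta_sub` with `beta_full` shows how closely a subsample of 500 rows tracks the full-data estimate. The optimality based methods listed above follow the same pattern but choose the sampling probabilities differently.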
These seven methods are described in the following articles under the topics:
1) Introduction - explains the need for subsampling methods.
2) Model based subsampling
3) Model robust and misspecification
4) Benchmarking functions
For 2), model based subsampling, we assume that the main effects model can describe the data. For 3), model robust and misspecification, we first consider the case where several models can describe the big data, and then assume that the given main effects model is misspecified. Under these conditions, we explore subsampling for four given big data sets. Further, to explore computation time, we ran simulations for scenarios 2) and 3) that compare our subsampling functions against full data modelling.
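The kind of timing comparison described here can be sketched in base R as follows. This is an illustration only: it uses uniform random subsampling and `glm()` rather than the package's subsampling or benchmarking functions, and the simulated data and sizes are assumptions.

```r
set.seed(2)

# Simulated logistic regression data (hypothetical sizes and coefficients).
N <- 50000
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- rbinom(N, size = 1, prob = plogis(-0.5 + x1 - 0.5 * x2))
big <- data.frame(y = y, x1 = x1, x2 = x2)

# Time the full-data model fit.
time_full <- system.time(
  fit_full <- glm(y ~ x1 + x2, family = binomial, data = big)
)["elapsed"]

# Time drawing a uniform random subsample of r rows and fitting the same model.
r <- 1000
time_sub <- system.time({
  idx <- sample.int(N, size = r)
  fit_sub <- glm(y ~ x1 + x2, family = binomial, data = big[idx, ])
})["elapsed"]

# Collect elapsed times (seconds) and coefficient estimates side by side.
timings <- c(full = unname(time_full), subsample = unname(time_sub))
estimates <- rbind(full = coef(fit_full), subsample = coef(fit_sub))
```

Printing `timings` and `estimates` gives a rough view of the trade-off between computation time and estimation accuracy. Actual timings depend on the machine and on the data size.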