Big data analysis

Big data presents opportunities for analysts to uncover new knowledge and gain new insights into real-world problems. However, its massive scale and complexity present computational and statistical challenges, including scalability issues, storage constraints, noise accumulation, spurious correlations, incidental endogeneity and measurement errors. Figure 1, from Chen, Mao, and Liu (2014), illustrates the size of big data across different business sectors. Addressing these challenges demands innovative approaches in both computation and statistics. Traditional methods, effective for small and moderate sample sizes, often falter when confronted with massive datasets. There is therefore a pressing need for statistical methodologies and computational tools tailored to the unique demands of big data analysis.

Figure 1: The continuously increasing size of big data from Chen, Mao, and Liu (2014).

Computational solutions for big data analysis

Computer engineers often seek more powerful computing facilities to reduce computing time, which has driven the rapid development of supercomputers over the past decade. These supercomputers offer speeds and storage capacities hundreds or even thousands of times greater than those of general-purpose PCs, but their significant energy consumption and limited accessibility remain major drawbacks. Cloud computing offers a partial solution by providing accessible computing resources, yet it faces challenges related to data transfer inefficiency, privacy and security. Graphics Processing Units (GPUs) have emerged as another computational facility, offering powerful parallel computing capabilities; however, recent comparisons have shown that even high-end GPUs can be outperformed by general-purpose multi-core processors, primarily because of data transfer inefficiencies. In short, none of supercomputers, cloud computing or GPUs has solved the big data problem efficiently. Instead, there is a growing need for efficient statistical solutions that make big data manageable on general-purpose PCs.

Statistical solutions for big data analysis

Statistical solutions to the challenges posed by big data are relatively novel compared with engineering solutions, and new methodologies are continually under development. Currently available methods can be broadly categorized into three groups:

  1. Sampling: This involves selecting a representative subset of the data for analysis instead of analysing the entire dataset. This approach can significantly reduce computational requirements while still providing valuable insights into the underlying population.
  2. Divide and conquer: This involves breaking the large problem into smaller, more manageable subproblems. Each subproblem is analysed independently, often in parallel, before the results are combined to obtain the final output (see the sketch after this list).
  3. Online updating of streamed data: The statistical inference is updated sequentially as new data arrive.
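To make the divide and conquer idea concrete, here is a minimal sketch in base R on simulated data. The chunk count, data-generating model and simple averaging combination rule are illustrative assumptions (more refined combination rules exist): a logistic regression is fitted independently on each chunk and the per-chunk coefficient estimates are averaged.

```r
set.seed(1)
n <- 1e5; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- rep(0.5, p)
y <- rbinom(n, 1, plogis(X %*% beta))

## Split row indices into 10 equal-sized chunks and fit each independently.
chunks <- split(seq_len(n), rep(1:10, each = n / 10))
fits <- lapply(chunks, function(idx)
  coef(glm(y[idx] ~ X[idx, ] - 1, family = binomial)))

## Combine: simple average of the per-chunk coefficient estimates.
beta_dc <- Reduce(`+`, fits) / length(fits)
beta_dc
```

Each chunk fit can run on a separate core, so the wall-clock cost is roughly that of fitting one chunk plus the negligible cost of averaging.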

In recent years, sampling has increasingly been preferred over divide and conquer methods for a range of regression problems, while online updating is used primarily for streaming data. Moreover, when the full dataset is not needed to answer a specific question confidently, sampling is often favoured because the resulting subset can be analysed with standard methods.

Sampling algorithms for big data

The primary challenge is to acquire an informative subset that efficiently addresses specific analytical questions while yielding results consistent with an analysis of the full dataset. The literature presents two strategies to resolve it:

  1. Sample randomly from the large dataset, with sampling probabilities determined via an assumed statistical model and objective (e.g., prediction and/or parameter estimation) (Wang, Zhu, and Ma 2018; Yao and Wang 2019; Ai, Wang, et al. 2021; Ai, Yu, et al. 2021; Lee, Schifano, and Wang 2021, 2022; Zhang, Ning, and Ruppert 2021); a minimal sketch of this strategy follows the list.
  2. Select samples based on an experimental design (Drovandi et al. 2017; Wang, Yang, and Stufken 2019; Cheng, Wang, and Yang 2020; Hou-Liu and Browne 2023; Reuter and Schwabe 2023; Yu, Liu, and Wang 2023).
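For intuition about strategy 1, the following base R sketch on simulated data draws a weighted subsample and corrects for unequal selection with inverse-probability weights. The probabilities used here are toy choices (proportional to covariate row norms), not those of any cited method, which would derive them from the assumed model and objective.

```r
set.seed(1)
n <- 1e5; r <- 1000
X <- matrix(rnorm(n * 2), n, 2)
y <- rbinom(n, 1, plogis(X %*% c(1, -1)))

## Toy sampling probabilities, proportional to the row norms of X.
pi_i <- sqrt(rowSums(X^2))
pi_i <- pi_i / sum(pi_i)

idx <- sample(n, r, replace = TRUE, prob = pi_i)

## Inverse-probability weights 1 / (r * pi_i) make the weighted subsample
## estimating equation unbiased for the full-data one; the quasibinomial
## family suppresses glm's non-integer-weights warning.
fit <- glm(y[idx] ~ X[idx, ] - 1, family = quasibinomial,
           weights = 1 / (r * pi_i[idx]))
coef(fit)
```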

At present, this package focuses on the following sampling methods:

  1. Leverage sampling by Ma, Mahoney, and Yu (2014) and Ma and Sun (2015) (see the first sketch after this list).
  2. Local case control sampling by Fithian and Hastie (2015).
  3. A- and L-optimality based subsampling methods for Generalised Linear Models by Wang, Zhu, and Ma (2018) and Ai, Yu, et al. (2021) (see the second sketch after this list).
  4. A-optimality based subsampling for the Gaussian Linear Model by Lee, Schifano, and Wang (2021).
  5. A- and L-optimality based sampling methods for Generalised Linear Models where the response is not involved in calculating the sampling probabilities, by Zhang, Ning, and Ruppert (2021).
  6. A- and L-optimality based model robust/average subsampling methods for Generalised Linear Models by Mahendran, Thompson, and McGree (2023).
  7. Sampling for Generalised Linear Models under potential model misspecification, as in Adewale and Wiens (2009) and Adewale and Xu (2010).
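The first sketch illustrates leverage-score subsampling for linear regression in the spirit of Ma, Mahoney, and Yu (2014): rows are sampled with probabilities proportional to their statistical leverage, then a weighted least squares fit is computed on the subsample. This is a simplified base R illustration on simulated data, not the package's implementation.

```r
set.seed(1)
n <- 1e4; p <- 3; r <- 500
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0.5) + rnorm(n)

## Leverage scores are the diagonal of the hat matrix
## H = X (X'X)^{-1} X', computed stably via the QR decomposition of X.
h <- rowSums(qr.Q(qr(X))^2)
pi_i <- h / sum(h)

idx <- sample(n, r, replace = TRUE, prob = pi_i)

## Weighted least squares on the subsample, weights 1 / (r * pi_i).
fit <- lm(y[idx] ~ X[idx, ] - 1, weights = 1 / (r * pi_i[idx]))
coef(fit)
```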
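The second sketch shows the two-step idea behind A-optimality (mMSE) subsampling for logistic regression in the spirit of Wang, Zhu, and Ma (2018): a uniform pilot sample yields an initial estimate, which determines near-optimal probabilities for the second-stage subsample. Again, this is a simplified base R illustration on simulated data rather than the package's implementation.

```r
set.seed(1)
n <- 1e5; r0 <- 500; r <- 1000
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
y <- rbinom(n, 1, plogis(X %*% c(0.5, 1, -1)))

## Step 1: uniform pilot subsample and initial estimate.
idx0 <- sample(n, r0)
beta0 <- coef(glm(y[idx0] ~ X[idx0, ] - 1, family = binomial))

## Step 2: mMSE-style probabilities, proportional to
## |y_i - p_i(beta0)| * ||M^{-1} x_i||, M being the pilot information matrix.
p_hat <- as.vector(plogis(X %*% beta0))
w0 <- p_hat[idx0] * (1 - p_hat[idx0])
M <- crossprod(X[idx0, ] * sqrt(w0)) / r0
score <- abs(y - p_hat) * sqrt(rowSums((X %*% solve(M))^2))
pi_i <- score / sum(score)

## Final inverse-probability-weighted fit on the second-stage subsample.
idx <- sample(n, r, replace = TRUE, prob = pi_i)
fit <- glm(y[idx] ~ X[idx, ] - 1, family = quasibinomial,
           weights = 1 / (r * pi_i[idx]))
coef(fit)
```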

References

Adewale, Adeniyi J, and Douglas P Wiens. 2009. “Robust Designs for Misspecified Logistic Models.” Journal of Statistical Planning and Inference 139 (1): 3–15.
Adewale, Adeniyi J, and Xiaojian Xu. 2010. “Robust Designs for Generalized Linear Models with Possible Overdispersion and Misspecified Link Functions.” Computational Statistics & Data Analysis 54 (4): 875–90.
Ai, Mingyao, Fei Wang, Jun Yu, and Huiming Zhang. 2021. “Optimal Subsampling for Large-Scale Quantile Regression.” Journal of Complexity 62: 101512. https://doi.org/10.1016/j.jco.2020.101512.
Ai, Mingyao, Jun Yu, Huiming Zhang, and HaiYing Wang. 2021. “Optimal Subsampling Algorithms for Big Data Regressions.” Statistica Sinica 31 (2): 749–72.
Chen, Min, Shiwen Mao, and Yunhao Liu. 2014. “Big Data: A Survey.” Mobile Networks and Applications 19: 171–209.
Cheng, Qianshun, HaiYing Wang, and Min Yang. 2020. “Information-Based Optimal Subdata Selection for Big Data Logistic Regression.” Journal of Statistical Planning and Inference 209: 112–22. https://doi.org/10.1016/j.jspi.2020.03.004.
Drovandi, Christopher C, Christopher Holmes, James M McGree, Kerrie Mengersen, Sylvia Richardson, and Elizabeth G Ryan. 2017. “Principles of Experimental Design for Big Data Analysis.” Statistical Science 32 (3): 385–404. https://doi.org/10.1214/16-STS604.
Fithian, William, and Trevor Hastie. 2015. “Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets.” Quality Control and Applied Statistics 60 (3): 187–90.
Hou-Liu, Jason, and Ryan P Browne. 2023. “Generalized Linear Models for Massive Data via Doubly-Sketching.” Statistics and Computing 33 (5): 105. https://doi.org/10.1007/s11222-023-10274-8.
Lee, JooChul, Elizabeth D Schifano, and HaiYing Wang. 2021. “Fast Optimal Subsampling Probability Approximation for Generalized Linear Models.” Econometrics and Statistics. https://doi.org/10.1016/j.ecosta.2021.02.007.
———. 2022. “Sampling-Based Gaussian Mixture Regression for Big Data.” Journal of Data Science 21 (1): 158–72. https://doi.org/10.6339/22-JDS1057.
Ma, Ping, Michael Mahoney, and Bin Yu. 2014. “A Statistical Perspective on Algorithmic Leveraging.” In International Conference on Machine Learning, 91–99. PMLR.
Ma, Ping, and Xiaoxiao Sun. 2015. “Leveraging for Big Data Regression.” Wiley Interdisciplinary Reviews: Computational Statistics 7 (1): 70–76.
Mahendran, Amalan, Helen Thompson, and James M McGree. 2023. “A Model Robust Subsampling Approach for Generalised Linear Models in Big Data Settings.” Statistical Papers 64 (4): 1137–57.
Reuter, Torsten, and Rainer Schwabe. 2023. “Optimal Subsampling Design for Polynomial Regression in One Covariate.” Statistical Papers, 1–23. https://doi.org/10.1007/s00362-023-01425-0.
Wang, HaiYing, Min Yang, and John Stufken. 2019. “Information-Based Optimal Subdata Selection for Big Data Linear Regression.” Journal of the American Statistical Association 114 (525): 393–405. https://doi.org/10.1080/01621459.2017.1408468.
Wang, HaiYing, Rong Zhu, and Ping Ma. 2018. “Optimal Subsampling for Large Sample Logistic Regression.” Journal of the American Statistical Association 113 (522): 829–44.
Yao, Yaqiong, and HaiYing Wang. 2019. “Optimal Subsampling for Softmax Regression.” Statistical Papers 60 (2): 585–99. https://doi.org/10.1007/s00362-018-01068-6.
Yu, Jun, Jiaqi Liu, and HaiYing Wang. 2023. “Information-Based Optimal Subdata Selection for Non-Linear Models.” Statistical Papers, 1–25. https://doi.org/10.1007/s00362-023-01430-3.
Zhang, Tao, Yang Ning, and David Ruppert. 2021. “Optimal Sampling for Generalized Linear Models Under Measurement Constraints.” Journal of Computational and Graphical Statistics 30 (1): 106–14.