Big data are ubiquitous. They come in varying volume, velocity, and variety. They have a deep impact on systems such as storage, communication, and computing architectures, and on analyses involving statistics, computation, optimization, and privacy. Engulfed by a multitude of applications, data science
aims to address the large-scale challenges of data analysis, turning big data
into smart data for decision making and knowledge discoveries. Data science
integrates theories and methods from statistics, optimization, mathematical
science, computer science, and information science to extract knowledge, make
decisions, discover new insights, and reveal new phenomena from data. The
concept of data science has appeared in the literature for several decades and
has been interpreted differently by different researchers. It has now become a multidisciplinary field that distills knowledge from various disciplines to develop new methods, processes, algorithms, and systems for knowledge discovery from various kinds of data, which can be either low- or high-dimensional,
and either structured, unstructured or semi-structured. Statistical modeling
plays critical roles in the analysis of complex and heterogeneous data and
quantifies uncertainties of scientific hypotheses and statistical results.
This book introduces commonly-used statistical models, contemporary statistical machine learning techniques and algorithms, along with their mathematical insights and statistical theories. It aims to serve as a graduate-level
textbook on the statistical foundations of data science as well as a research
monograph on sparsity, covariance learning, machine learning and statistical
inference. For a one-semester graduate-level course, it may cover Chapters 2,
3, 9, 10, 12, 13 and some topics selected from the remaining chapters. This
gives a comprehensive view of statistical machine learning models, theories, and methods. Alternatively, a one-semester graduate course may cover Chapters 2, 3, 5, 7, 8 and selected topics from the remaining chapters. This track focuses more on high-dimensional statistics, model selection, and inference, but both paths place strong emphasis on sparsity and variable selection.
Frontiers of scientific research rely on the collection and processing of massive, complex data. Information technologies allow us to collect big data of unprecedented size and complexity. Accompanying big data is the rise of dimensionality: high dimensionality characterizes many contemporary statistical problems, from science and engineering to social science and the humanities. Many traditional statistical procedures for finite- or low-dimensional data
are still useful in data science, but they become infeasible or ineffective for
dealing with high-dimensional data. Hence, new statistical methods are indispensable. The authors have worked on high-dimensional statistics for two
decades, and started writing this book on high-dimensional data analysis over a decade ago. Over the last decade, there have been surges of interest and exciting developments in high-dimensional and big data analysis. This led
us to concentrate mainly on statistical aspects of data science.
We aim to introduce commonly-used statistical models, methods and procedures in data science and provide readers with sufficient and sound theoretical justifications. It has been a challenge for us to balance statistical theories
and methods and to choose the topics and works to cover, since the number of publications in this emerging area is enormous. Thus, we focus on the
foundational aspects that are related to sparsity, covariance learning, machine
learning, and statistical inference.
Sparsity is a common assumption in the analysis of high-dimensional data.
By sparsity, we mean that only a handful of features embedded in a huge pool
suffice for certain scientific questions or predictions. This book introduces various regularization methods for handling sparsity, including how to determine penalties, how to choose tuning parameters, and how to design numerical optimization algorithms for various statistical models. These methods can be
found in Chapters 3–6 and 8.
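The notion of sparsity described above can be made concrete with a small numerical sketch. The code below is our own illustration, not taken from the book: a plain coordinate-descent lasso (one standard regularization method), where `lam` plays the role of the tuning parameter, recovers a handful of relevant features embedded in a much larger pool.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the building block of L1 penalization."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso for (1/2n)||y - Xb||^2 + lam*||b||_1.

    Illustrative and unoptimized; lam is the regularization tuning parameter.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # per-column scaling factors
    resid = y - X @ beta                # current residual (= y initially)
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the partial residual
            rho = X[:, j] @ (resid + X[:, j] * beta[j]) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)  # keep residual in sync
            beta[j] = new_bj
    return beta

# Simulated high-dimensional data: more features (p) than samples (n),
# but only 3 of the 200 features carry any signal.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [4.0, -3.0, 2.0]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=0.2)
print(np.flatnonzero(beta_hat))  # indices of nonzero coefficients: a sparse set
```

Despite having twice as many features as observations, the L1 penalty zeroes out nearly all coefficients, leaving a small active set that contains the truly relevant features; this is the sparsity phenomenon exploited throughout Chapters 3–6 and 8.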
High-dimensional measurements are frequently dependent, since these variables often measure similar things, such as aspects of economics or personal
health. Many of these variables have heavy tails due to the large number of collected
variables. To model the dependence, factor models are frequently employed,
which exhibit low-rank plus sparse s