7.3 Margin of Error: Sampling Distribution for a Proportion
7.4 Sampling Distribution for a Mean
7.5 The Bootstrap
7.6 Rationale for the Bootstrap
7.7 Standard Error
7.8 Other Sampling Methods
7.9 Absolute vs. Relative Sample Size
7.10 Python: Random Sampling Strategies
Exercises
8 More than Two Samples or Categories
8.1 Count Data—R × C Tables
8.2 The Role of Experiments (Many Are Costly)
8.3 Chi-Square Test
8.4 Single Sample—Goodness-of-Fit
8.5 Numeric Data: ANOVA
8.6 Components of Variance
8.7 Factorial Design
8.8 The Problem of Multiple Inference
8.9 Continuous Testing
8.10 Bandit Algorithms
8.11 Appendix: ANOVA, the Factor Diagram, and the F-Statistic
8.12 More than One Factor or Variable—From ANOVA to Statistical Models
8.13 Python: Contingency Tables and Chi-square Test
8.14 Python: ANOVA
Exercises
9 Correlation
9.1 Example: Delta Wire
9.2 Example: Cotton Dust and Lung Disease
9.3 The Vector Product Sum Test
9.4 Correlation Coefficient
9.5 Correlation is not Causation
9.6 Other Forms of Association
9.7 Python: Correlation
Exercises
10 Regression
10.1 Finding the Regression Line by Eye
10.2 Finding the Regression Line by Minimizing Residuals
10.3 Linear Relationships
10.4 Prediction vs. Explanation
10.5 Python: Linear Regression
Exercises
11 Multiple Linear Regression
11.1 Terminology
11.2 Example—Housing Prices
11.3 Interaction
11.4 Regression Assumptions
11.5 Assessing Explanatory Regression Models
11.6 Assessing Regression for Prediction
11.7 Python: Multiple Linear Regression
Exercises
12 Predicting Binary Outcomes
12.1 K-Nearest-Neighbors
12.2 Python: Classification
Exercises
Index
Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, covering topics central to data science such as prediction, correlation, and data exploration. The authors provide an introduction to statistical science and big data, as well as an overview of Python data structures and operations.
A range of statistical techniques is presented together with their implementation in Python, including hypothesis testing, probability, exploratory data analysis, categorical variables, surveys and sampling, A/B testing, and correlation. The text introduces binary classification, a foundational element of machine learning, and shows how to validate statistical models by applying them to holdout data. Probability and inference are taught through the easy-to-understand methods of resampling and the bootstrap rather than through a myriad of “kitchen sink” formulas. Regression is taught as a tool for both explanation and prediction.
This book is informed by the authors’ experience designing and teaching both introductory statistics and machine learning at Statistics.com. Each chapter includes practical examples, explanations of the underlying concepts, and Python code snippets to help readers apply the techniques themselves.