Data smart: using data science to transform Information into insight
Foreman, John W.
Data smart: using data science to transform Information into insight - New Delhi Wiley India Pvt. Ltd. 2018 - xx, 409 p.
TABLE OF CONTENTS
Introduction xiii
1 Everything You Ever Needed to Know about Spreadsheets but Were Too Afraid to Ask 1
Some Sample Data 2
Moving Quickly with the Control Button 2
Copying Formulas and Data Quickly 4
Formatting Cells 5
Paste Special Values 7
Inserting Charts 8
Locating the Find and Replace Menus 9
Formulas for Locating and Pulling Values 10
Using VLOOKUP to Merge Data 12
Filtering and Sorting 13
Using PivotTables 16
Using Array Formulas 19
Solving Stuff with Solver 20
OpenSolver: I Wish We Didn’t Need This, but We Do 26
Wrapping Up 27
2 Cluster Analysis Part I: Using K-Means to Segment Your Customer Base 29
Girls Dance with Girls, Boys Scratch Their Elbows 30
Getting Real: K-Means Clustering Subscribers in E-mail Marketing 35
Joey Bag O’ Donuts Wholesale Wine Emporium 36
The Initial Dataset 36
Determining What to Measure 38
Start with Four Clusters 41
Euclidean Distance: Measuring Distances as the Crow Flies 41
Distances and Cluster Assignments for Everybody! 44
Solving for the Cluster Centers 46
Making Sense of the Results 49
Getting the Top Deals by Cluster 50
The Silhouette: A Good Way to Let Different K Values Duke It Out 53
How about Five Clusters? 60
Solving for Five Clusters 60
Getting the Top Deals for All Five Clusters 61
Computing the Silhouette for 5-Means Clustering 64
K-Medians Clustering and Asymmetric Distance Measurements 66
Using K-Medians Clustering 66
Getting a More Appropriate Distance Metric 67
Putting It All in Excel 69
The Top Deals for the 5-Medians Clusters 70
Wrapping Up 75
3 Naïve Bayes and the Incredible Lightness of Being an Idiot 77
When You Name a Product Mandrill, You’re Going to Get Some Signal and Some Noise 77
The World’s Fastest Intro to Probability Theory 79
Totaling Conditional Probabilities 80
Joint Probability, the Chain Rule, and Independence 80
What Happens in a Dependent Situation? 81
Bayes Rule 82
Using Bayes Rule to Create an AI Model 83
High-Level Class Probabilities Are Often Assumed to Be Equal 84
A Couple More Odds and Ends 85
Let’s Get This Excel Party Started 87
Removing Extraneous Punctuation 87
Splitting on Spaces 88
Counting Tokens and Calculating Probabilities 92
And We Have a Model! Let’s Use It 94
Wrapping Up 98
4 Optimization Modeling: Because That “Fresh Squeezed” Orange Juice Ain’t Gonna Blend Itself 101
Why Should Data Scientists Know Optimization? 102
Starting with a Simple Trade-Off f 103
Representing the Problem as a Polytope 103
Solving by Sliding the Level Set 105
The Simplex Method: Rooting around the Corners 106
Working in Excel 108
There’s a Monster at the End of This Chapter 117
Fresh from the Grove to Your Glasswith a Pit Stop Through a Blending Model 118
You Use a Blending Model 119
Let’s Start with Some Specs 119
Coming Back to Consistency 121
Putting the Data into Excel 121
Setting Up the Problem in Solver 124
Lowering Your Standards 126
Dead Squirrel Removal: The Minimax Formulation 131
If-Then and the “Big M” Constraint 133
Multiplying Variables: Cranking Up the Volume to 11 137
Modeling Risk 144
Normally Distributed Data 145
Wrapping Up 154
5 Cluster Analysis Part II: Network Graphs and Community Detection 155
What Is a Network Graph? 156
Visualizing a Simple Graph 157
Brief Introduction to Gephi 159
Gephi Installation and File Preparation 160
Laying Out the Graph 162
Node Degree 165
Pretty Printing 166
Touching the Graph Data 168
Building a Graph from the Wholesale Wine Data 170
Creating a Cosine Similarity Matrix 172
Producing an r-Neighborhood Graph 174
How Much Is an Edge Worth? Points and Penalties in Graph Modularity 178
What’s a Point and What’s a Penalty? 179
Setting Up the Score Sheet 183
Let’s Get Clustering! 185
Split Number 1 185
Split 2: Electric Boogaloo 190
And…Split 3: Split with a Vengeance 192
Encoding and Analyzing the Communities 193
There and Back Again: A Gephi Tale 197
Wrapping Up 202
6 The Granddaddy of Supervised Artificial Intelligence—Regression 205
Wait, What? You’re Pregnant? 205
Don’t Kid Yourself 206
Predicting Pregnant Customers at RetailMart Using Linear Regression 207
The Feature Set 207
Assembling the Training Data 209
Creating Dummy Variables 210
Let’s Bake Our Own Linear Regression 213
Linear Regression Statistics: R-Squared, F Tests, t Tests 221
Making Predictions on Some New Data and Measuring Performance 230
Predicting Pregnant Customers at RetailMart Using Logistic Regression 239
First You Need a Link Function 240
Hooking Up the Logistic Function and Reoptimizing 241
Baking an Actual Logistic Regression 244
Model Selection—Comparing the Performance of the Linear and Logistic Regressions 245
For More Information 248
Wrapping Up 249
7 Ensemble Models: A Whole Lot of Bad Pizza 251
Using the Data from Chapter 6 252
Bagging: Randomize, Train, Repeat 254
Decision Stump Is an Unsexy Term for a Stupid Predictor 254
Doesn’t Seem So Stupid to Me! 255
You Need More Power! 257
Let’s Train It 258
Evaluating the Bagged Model 267
Boosting: If You Get It Wrong, Just Boost and Try Again 272
Training the Model—Every Feature Gets a Shot 272
Evaluating the Boosted Model 280
Wrapping Up 283
8 Forecasting: Breathe Easy; You Can’t Win 285
The Sword Trade Is Hopping 286
Getting Acquainted with Time Series Data 286
Starting Slow with Simple Exponential Smoothing 288
Setting Up the Simple Exponential Smoothing Forecast 290
You Might Have a Trend 296
Holt’s Trend-Corrected Exponential Smoothing 299
Setting Up Holt’s Trend-Corrected Smoothing in a Spreadsheet 300
So Are You Done? Looking at Autocorrelations 306
Multiplicative Holt-Winters Exponential Smoothing 313
Setting the Initial Values for Level, Trend, and Seasonality 315
Getting Rolling on the Forecast 319
And Optimize! 324
Please Tell Me We’re Done Now!!! 326
Putting a Prediction Interval around the Forecast 327
Creating a Fan Chart for Effect 331
Wrapping Up 333
9 Outlier Detection: Just Because They’re Odd Doesn’t Mean They’re Unimportant 335
Outliers Are (Bad?) People, Too 335
The Fascinating Case of Hadlum v Hadlum 336
Tukey Fences 337
Applying Tukey Fences in a Spreadsheet 338
The Limitations of This Simple Approach 340
Terrible at Nothing, Bad at Everything 341
Preparing Data for Graphing 342
Creating a Graph 345
Getting the k Nearest Neighbors 347
Graph Outlier Detection Method 1: Just Use the Indegree 348
Graph Outlier Detection Method 2: Getting Nuanced with k-Distance 351
Graph Outlier Detection Method 3: Local Outlier Factors Are Where It’s At 353
Wrapping Up 358
10 Moving from Spreadsheets into R 361
Getting Up and Running with R 362
Some Simple Hand-Jamming 363
Reading Data into R 370
Doing Some Actual Data Science 372
Spherical K-Means on Wine Data in Just a Few Lines 372
Building AI Models on the Pregnancy Data 378
Forecasting in R 385
Looking at Outlier Detection 389
Wrapping Up 394
Conclusion 395
Where Am I? What Just Happened? 395
Before You Go-Go 395
Get to Know the Problem 396
We Need More Translators 397
Beware the Three-Headed Geek-Monster: Tools, Performance, and Mathematical Perfection 397
You Are Not the Most Important Function of Your Organization 400
Get Creative and Keep in Touch! 400
Index 401
DESCRIPTION
The book provides nine tutorials on optimization, machine learning, data mining, and forecasting all within the confines of a spreadsheet. Each tutorial uses a real-world problem and the author guides the reader using query’s the reader might ask as how to craft a solution using the correct data science technique. Hosting these nine spreadsheets for download will be necessary so that the reader can work the problems along with the book.
Important topics covered by the book:
Linear and integer programming
K-nearest neighbors graphs and clustering
Logistic regression
Demand forecasting with seasonal adjustments
Price sensitivity, revenue optimization, and price-sensitive forecasting
Naïve Bayes classification
Outlier detection using graphs and Local Outlier Factors
Multi-criteria decision analysis
9788126546145
Data mining
Web usage mining
006.31 / FOR
Data smart: using data science to transform Information into insight - New Delhi Wiley India Pvt. Ltd. 2018 - xx, 409 p.
TABLE OF CONTENTS
Introduction xiii
1 Everything You Ever Needed to Know about Spreadsheets but Were Too Afraid to Ask 1
Some Sample Data 2
Moving Quickly with the Control Button 2
Copying Formulas and Data Quickly 4
Formatting Cells 5
Paste Special Values 7
Inserting Charts 8
Locating the Find and Replace Menus 9
Formulas for Locating and Pulling Values 10
Using VLOOKUP to Merge Data 12
Filtering and Sorting 13
Using PivotTables 16
Using Array Formulas 19
Solving Stuff with Solver 20
OpenSolver: I Wish We Didn’t Need This, but We Do 26
Wrapping Up 27
2 Cluster Analysis Part I: Using K-Means to Segment Your Customer Base 29
Girls Dance with Girls, Boys Scratch Their Elbows 30
Getting Real: K-Means Clustering Subscribers in E-mail Marketing 35
Joey Bag O’ Donuts Wholesale Wine Emporium 36
The Initial Dataset 36
Determining What to Measure 38
Start with Four Clusters 41
Euclidean Distance: Measuring Distances as the Crow Flies 41
Distances and Cluster Assignments for Everybody! 44
Solving for the Cluster Centers 46
Making Sense of the Results 49
Getting the Top Deals by Cluster 50
The Silhouette: A Good Way to Let Different K Values Duke It Out 53
How about Five Clusters? 60
Solving for Five Clusters 60
Getting the Top Deals for All Five Clusters 61
Computing the Silhouette for 5-Means Clustering 64
K-Medians Clustering and Asymmetric Distance Measurements 66
Using K-Medians Clustering 66
Getting a More Appropriate Distance Metric 67
Putting It All in Excel 69
The Top Deals for the 5-Medians Clusters 70
Wrapping Up 75
3 Naïve Bayes and the Incredible Lightness of Being an Idiot 77
When You Name a Product Mandrill, You’re Going to Get Some Signal and Some Noise 77
The World’s Fastest Intro to Probability Theory 79
Totaling Conditional Probabilities 80
Joint Probability, the Chain Rule, and Independence 80
What Happens in a Dependent Situation? 81
Bayes Rule 82
Using Bayes Rule to Create an AI Model 83
High-Level Class Probabilities Are Often Assumed to Be Equal 84
A Couple More Odds and Ends 85
Let’s Get This Excel Party Started 87
Removing Extraneous Punctuation 87
Splitting on Spaces 88
Counting Tokens and Calculating Probabilities 92
And We Have a Model! Let’s Use It 94
Wrapping Up 98
4 Optimization Modeling: Because That “Fresh Squeezed” Orange Juice Ain’t Gonna Blend Itself 101
Why Should Data Scientists Know Optimization? 102
Starting with a Simple Trade-Off f 103
Representing the Problem as a Polytope 103
Solving by Sliding the Level Set 105
The Simplex Method: Rooting around the Corners 106
Working in Excel 108
There’s a Monster at the End of This Chapter 117
Fresh from the Grove to Your Glasswith a Pit Stop Through a Blending Model 118
You Use a Blending Model 119
Let’s Start with Some Specs 119
Coming Back to Consistency 121
Putting the Data into Excel 121
Setting Up the Problem in Solver 124
Lowering Your Standards 126
Dead Squirrel Removal: The Minimax Formulation 131
If-Then and the “Big M” Constraint 133
Multiplying Variables: Cranking Up the Volume to 11 137
Modeling Risk 144
Normally Distributed Data 145
Wrapping Up 154
5 Cluster Analysis Part II: Network Graphs and Community Detection 155
What Is a Network Graph? 156
Visualizing a Simple Graph 157
Brief Introduction to Gephi 159
Gephi Installation and File Preparation 160
Laying Out the Graph 162
Node Degree 165
Pretty Printing 166
Touching the Graph Data 168
Building a Graph from the Wholesale Wine Data 170
Creating a Cosine Similarity Matrix 172
Producing an r-Neighborhood Graph 174
How Much Is an Edge Worth? Points and Penalties in Graph Modularity 178
What’s a Point and What’s a Penalty? 179
Setting Up the Score Sheet 183
Let’s Get Clustering! 185
Split Number 1 185
Split 2: Electric Boogaloo 190
And…Split 3: Split with a Vengeance 192
Encoding and Analyzing the Communities 193
There and Back Again: A Gephi Tale 197
Wrapping Up 202
6 The Granddaddy of Supervised Artificial Intelligence—Regression 205
Wait, What? You’re Pregnant? 205
Don’t Kid Yourself 206
Predicting Pregnant Customers at RetailMart Using Linear Regression 207
The Feature Set 207
Assembling the Training Data 209
Creating Dummy Variables 210
Let’s Bake Our Own Linear Regression 213
Linear Regression Statistics: R-Squared, F Tests, t Tests 221
Making Predictions on Some New Data and Measuring Performance 230
Predicting Pregnant Customers at RetailMart Using Logistic Regression 239
First You Need a Link Function 240
Hooking Up the Logistic Function and Reoptimizing 241
Baking an Actual Logistic Regression 244
Model Selection—Comparing the Performance of the Linear and Logistic Regressions 245
For More Information 248
Wrapping Up 249
7 Ensemble Models: A Whole Lot of Bad Pizza 251
Using the Data from Chapter 6 252
Bagging: Randomize, Train, Repeat 254
Decision Stump Is an Unsexy Term for a Stupid Predictor 254
Doesn’t Seem So Stupid to Me! 255
You Need More Power! 257
Let’s Train It 258
Evaluating the Bagged Model 267
Boosting: If You Get It Wrong, Just Boost and Try Again 272
Training the Model—Every Feature Gets a Shot 272
Evaluating the Boosted Model 280
Wrapping Up 283
8 Forecasting: Breathe Easy; You Can’t Win 285
The Sword Trade Is Hopping 286
Getting Acquainted with Time Series Data 286
Starting Slow with Simple Exponential Smoothing 288
Setting Up the Simple Exponential Smoothing Forecast 290
You Might Have a Trend 296
Holt’s Trend-Corrected Exponential Smoothing 299
Setting Up Holt’s Trend-Corrected Smoothing in a Spreadsheet 300
So Are You Done? Looking at Autocorrelations 306
Multiplicative Holt-Winters Exponential Smoothing 313
Setting the Initial Values for Level, Trend, and Seasonality 315
Getting Rolling on the Forecast 319
And Optimize! 324
Please Tell Me We’re Done Now!!! 326
Putting a Prediction Interval around the Forecast 327
Creating a Fan Chart for Effect 331
Wrapping Up 333
9 Outlier Detection: Just Because They’re Odd Doesn’t Mean They’re Unimportant 335
Outliers Are (Bad?) People, Too 335
The Fascinating Case of Hadlum v Hadlum 336
Tukey Fences 337
Applying Tukey Fences in a Spreadsheet 338
The Limitations of This Simple Approach 340
Terrible at Nothing, Bad at Everything 341
Preparing Data for Graphing 342
Creating a Graph 345
Getting the k Nearest Neighbors 347
Graph Outlier Detection Method 1: Just Use the Indegree 348
Graph Outlier Detection Method 2: Getting Nuanced with k-Distance 351
Graph Outlier Detection Method 3: Local Outlier Factors Are Where It’s At 353
Wrapping Up 358
10 Moving from Spreadsheets into R 361
Getting Up and Running with R 362
Some Simple Hand-Jamming 363
Reading Data into R 370
Doing Some Actual Data Science 372
Spherical K-Means on Wine Data in Just a Few Lines 372
Building AI Models on the Pregnancy Data 378
Forecasting in R 385
Looking at Outlier Detection 389
Wrapping Up 394
Conclusion 395
Where Am I? What Just Happened? 395
Before You Go-Go 395
Get to Know the Problem 396
We Need More Translators 397
Beware the Three-Headed Geek-Monster: Tools, Performance, and Mathematical Perfection 397
You Are Not the Most Important Function of Your Organization 400
Get Creative and Keep in Touch! 400
Index 401
DESCRIPTION
The book provides nine tutorials on optimization, machine learning, data mining, and forecasting all within the confines of a spreadsheet. Each tutorial uses a real-world problem and the author guides the reader using query’s the reader might ask as how to craft a solution using the correct data science technique. Hosting these nine spreadsheets for download will be necessary so that the reader can work the problems along with the book.
Important topics covered by the book:
Linear and integer programming
K-nearest neighbors graphs and clustering
Logistic regression
Demand forecasting with seasonal adjustments
Price sensitivity, revenue optimization, and price-sensitive forecasting
Naïve Bayes classification
Outlier detection using graphs and Local Outlier Factors
Multi-criteria decision analysis
9788126546145
Data mining
Web usage mining
006.31 / FOR