Supervised machine learning for text analysis in R
Material type: Text
Publication details: Boca Raton: CRC Press, 2022
Description: xix, 381 p.
ISBN: 9780367554194
DDC classification: 006.35 HVI
Item type | Current library | Collection | Call number | Copy number | Status | Date due | Barcode
---|---|---|---|---|---|---|---
Book | Indian Institute of Management LRC General Stacks | IT & Decisions Sciences | 006.35 HVI | 1 | Available | | 005569
I Natural Language Features
1. Language and modeling
Linguistics for text analysis
A glimpse into one area: morphology
Different languages
Other ways text can vary
Summary
2. Tokenization
What is a token?
Types of tokens
Character tokens
Word tokens
Tokenizing by n-grams
Lines, sentence, and paragraph tokens
Where does tokenization break down?
Building your own tokenizer
Tokenize to characters, only keeping letters
Allow for hyphenated words
Wrapping it in a function
Tokenization for non-Latin alphabets
Tokenization benchmark
Summary
3. Stop words
Using premade stop word lists
Stop word removal in R
Creating your own stop words list
All stop word lists are context-specific
What happens when you remove stop words
Stop words in languages other than English
Summary
4. Stemming
How to stem text in R
Should you use stemming at all?
Understand a stemming algorithm
Handling punctuation when stemming
Compare some stemming options
Lemmatization and stemming
Stemming and stop words
Summary
5. Word Embeddings
Motivating embeddings for sparse, high-dimensional data
Understand word embeddings by finding them yourself
Exploring CFPB word embeddings
Use pre-trained word embeddings
Fairness and word embeddings
Using word embeddings in the real world
Summary
II Machine Learning Methods
6. Regression
A first regression model
Building our first regression model
Evaluation
Compare to the null model
Compare to a random forest model
Case study: removing stop words
Case study: varying n-grams
Case study: lemmatization
Case study: feature hashing
Text normalization
What evaluation metrics are appropriate?
The full game: regression
Preprocess the data
Specify the model
Tune the model
Evaluate the modeling
Summary
7. Classification
A first classification model
Building our first classification model
Evaluation
Compare to the null model
Compare to a lasso classification model
Tuning lasso hyperparameters
Case study: sparse encoding
Two class or multiclass?
Case study: including non-text data
Case study: data censoring
Case study: custom features
Detect credit cards
Calculate percentage censoring
Detect monetary amounts
What evaluation metrics are appropriate?
The full game: classification
Feature selection
Specify the model
Evaluate the modeling
Summary
III Deep Learning Methods
8. Dense neural networks
Kickstarter data
A first deep learning model
Preprocessing for deep learning
One-hot sequence embedding of text
Simple flattened dense network
Evaluation
Using bag-of-words features
Using pre-trained word embeddings
Cross-validation for deep learning models
Compare and evaluate DNN models
Limitations of deep learning
Summary
9. Long short-term memory (LSTM) networks
A first LSTM model
Building an LSTM
Evaluation
Compare to a recurrent neural network
Case study: bidirectional LSTM
Case study: stacking LSTM layers
Case study: padding
Case study: training a regression model
Case study: vocabulary size
The full game: LSTM
Preprocess the data
Specify the model
Summary
10. Convolutional neural networks
What are CNNs?
Kernel
Kernel size
A first CNN model
Case study: adding more layers
Case study: byte pair encoding
Case study: explainability with LIME
Case study: hyperparameter search
The full game: CNN
Preprocess the data
Specify the model
Summary
IV Conclusion
Text models in the real world
Appendix
A Regular expressions
A Literal characters
A Meta characters
A Full stop, the wildcard
A Character classes
A Shorthand character classes
A Quantifiers
A Anchors
A Additional resources
B Data
B Hans Christian Andersen fairy tales
B Opinions of the Supreme Court of the United States
B Consumer Financial Protection Bureau (CFPB) complaints
B Kickstarter campaign blurbs
C Baseline linear classifier
C Read in the data
C Split into test/train and create resampling folds
C Recipe for data preprocessing
C Lasso regularized classification model
C A model workflow
C Tune the workflow