Natural Language Processing
Q1 Review the Python script in the Q1 folder: NLTK_Text_Analysis.py
Use the text below to apply the same process:
Text = """Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."""
a. Text Analysis Operations using NLTK
b. Tokenization
c. Stopwords removal
d. Lexicon Normalization such as Stemming and Lemmatization
e. POS Tagging
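As a rough illustration of steps b and c, the sketch below tokenizes the Q1 text and removes stopwords using only the Python standard library. It is a simplified stand-in, not the contents of NLTK_Text_Analysis.py: the regex tokenizer and the tiny STOPWORDS set are assumptions made for this example, whereas the NLTK versions (word_tokenize and stopwords.words('english')) handle punctuation and a far larger word list.

```python
import re

# The Q1 text from the assignment.
text = (
    "Backgammon is one of the oldest known board games. Its history can be "
    "traced back nearly 5,000 years to archeological discoveries in the "
    "Middle East. It is a two player game where each player has fifteen "
    "checkers which move between twenty-four points according to the roll "
    "of two dice."
)

# Hypothetical, deliberately tiny stopword set (assumption for this sketch);
# NLTK's English stopword list is much larger.
STOPWORDS = {"is", "one", "of", "the", "its", "can", "be", "to", "in",
             "it", "a", "where", "each", "has", "which", "between"}


def tokenize(s):
    """Lowercase the text and return word tokens, keeping hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", s.lower())


tokens = tokenize(text)                                  # b. tokenization
filtered = [t for t in tokens if t not in STOPWORDS]     # c. stopword removal

print(tokens[:9])
print(filtered[:6])
```

Stemming, lemmatization, and POS tagging (steps d and e) rely on NLTK's own models and data, so they are left to the reviewed script.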
Q2 Analyze the customer reviews in the file Restaurant_Reviews.tsv
a. Explain each step for the following text clean-up commands
review = dataset['Review'][0]
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
review = ' '.join(review)
b. What is the classification question?
c. The example uses the Naïve Bayes classifier to classify the sentiments. Calculate the confusion matrix and the accuracy, where:
TP = number of true positives,
TN = number of true negatives,
FP = number of false positives,
FN = number of false negatives.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
d. Apply the logistic regression classifier to the problem and recalculate the quantities from "c": TP, TN, FP, FN, and Accuracy.
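The confusion-matrix counts and the accuracy formula in part c can be sketched in plain Python as follows. The label lists y_true and y_pred are made-up values for illustration only, not results from Restaurant_Reviews.tsv, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels (assumption): 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)              # 3 3 1 1
print(accuracy(tp, tn, fp, fn))    # 0.75
```

The same two functions can be applied to the predictions of either classifier, which makes the comparison in part d a matter of swapping y_pred.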
Q3 NLTK Corpus on Movie Reviews
Q3a Use the following reference to perform sentiment analysis on movie reviews with "Q3 Movie Reviews.py"
https://www.nltk.org/book/ch06.html
Q3b Explain how the Bag of Words model helps in sentiment analysis
http://blog.chapagain.com.np/python-nltk-sentiment…
Summarize the entire code in the NLTKMovieReview.py file as part of the solution
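For Q3b, a minimal standard-library sketch of the Bag of Words idea is shown below. The two short documents are invented for illustration and are not taken from the NLTK movie review corpus; the function names are hypothetical.

```python
from collections import Counter

# Invented example documents (assumption), not from the movie review corpus.
docs = ["great fun great cast", "dull boring plot"]


def bag_of_words(doc):
    """Map each word to its count; word order is discarded."""
    return Counter(doc.lower().split())


# Shared vocabulary, sorted so every document uses the same column order.
vocab = sorted({word for doc in docs for word in doc.lower().split()})


def to_vector(doc):
    """Turn a document into a fixed-length count vector over the vocabulary."""
    bag = bag_of_words(doc)
    return [bag[word] for word in vocab]


print(vocab)               # ['boring', 'cast', 'dull', 'fun', 'great', 'plot']
print(to_vector(docs[0]))  # [0, 1, 0, 1, 2, 0]
print(to_vector(docs[1]))  # [1, 0, 1, 0, 0, 1]
```

Because only counts survive, "fun great cast great" produces the same vector as the first document. A classifier trained on such vectors learns which words tend to appear in positive or negative reviews, not how they are arranged.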
Q4 Twitter Analysis sentiment140
Perform a Twitter sentiment analysis:
- Users on Twitter create short messages called tweets, which are shared with other Twitter users who interact by retweeting and responding.
- Twitter restricts messages to 280 characters or fewer, which forces users to stay focused on the message they wish to disseminate.
- Twitter data is well suited to the Machine Learning (ML) task of sentiment analysis.
- Sentiment analysis falls under Natural Language Processing (NLP).
- The training data is obtained from Sentiment140 and is made up of about 1.6 million random tweets with corresponding binary labels: 0 for negative sentiment and 1 for positive sentiment.
- Use a Naive Bayes classifier to learn the correct labels from this training set.
https://towardsdatascience.com/the-real-world-as-s…
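To make the Naive Bayes step concrete, here is a small from-scratch sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not the code from the linked article; the four toy tweets, the whitespace tokenizer, and the function names train_nb and predict_nb are assumptions for illustration, and a real run would train on the Sentiment140 data instead.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Count classes and per-class words from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: share of training tweets with this label.
        score = math.log(class_counts[label] / total_docs)
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (assumption): 1 = positive, 0 = negative.
tweets = [
    ("i love this phone", 1),
    ("what a great day", 1),
    ("i hate waiting", 0),
    ("this is awful", 0),
]

model = train_nb(tweets)
print(predict_nb(model, "love this great day"))  # 1
print(predict_nb(model, "awful i hate this"))    # 0
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, which matters for a training set of this size.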
Q5 Analyze Clothing Reviews
https://www.kaggle.com/nicapotato/womens-ecommerce…
This dataset comes from a women's clothing e-commerce site and revolves around the reviews written by customers. It includes 23486 rows and 10 feature variables. Each row corresponds to a customer review and includes the variables:
- Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
- Age: Positive Integer variable of the reviewer's age.
- Title: String variable for the title of the review.
- Review Text: String variable for the review body.
- Rating: Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
- Division Name: Categorical name of the product's high-level division.
- Department Name: Categorical name of the product department.
- Class Name: Categorical name of the product class.
Perform
a. Text extraction & creating a corpus
b. Text Pre-processing
c. Create the DTM & TDM from the corpus
d. Exploratory text analysis
e. Feature extraction by removing sparsity
f. Build the classification models and compare Logistic Regression to Random Forest
https://medium.com/analytics-vidhya/customer-revie…
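Steps c and e can be sketched with the standard library alone: build a document-term matrix (DTM), transpose it into a term-document matrix (TDM), and drop terms that are absent from too many documents. The four short reviews, the whitespace tokenizer, the helper name remove_sparse_terms, and the 0.5 threshold are all assumptions for illustration; they are not taken from the Kaggle dataset or the linked article.

```python
from collections import Counter

# Invented reviews (assumption), standing in for the "Review Text" column.
reviews = [
    "love this dress",
    "dress runs small",
    "love the soft fabric",
    "this dress is love",
]

tokenized = [r.lower().split() for r in reviews]
terms = sorted({t for doc in tokenized for t in doc})
counts = [Counter(doc) for doc in tokenized]

# DTM: one row per document, one column per term.
dtm = [[c[t] for t in terms] for c in counts]

# TDM: the transpose, one row per term, one column per document.
tdm = [list(col) for col in zip(*dtm)]


def remove_sparse_terms(dtm, terms, max_sparsity):
    """Keep terms whose share of zero entries is at most max_sparsity."""
    n_docs = len(dtm)
    keep = [
        j for j in range(len(terms))
        if sum(1 for row in dtm if row[j] == 0) / n_docs <= max_sparsity
    ]
    reduced = [[row[j] for j in keep] for row in dtm]
    return reduced, [terms[j] for j in keep]


reduced_dtm, kept_terms = remove_sparse_terms(dtm, terms, max_sparsity=0.5)
print(kept_terms)    # ['dress', 'love', 'this']
print(reduced_dtm)   # [[1, 1, 1], [1, 0, 0], [0, 1, 0], [1, 1, 1]]
```

The reduced matrix is the kind of feature table that the classifiers in step f would be trained on, with a label such as Recommended IND as the target.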