xavier-assignment

Natural Language Processing

Q1 Review the Python script in the Q1 folder – NLTK_Text_Analysis.py

Apply the same process to the text below:

Text = """Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."""

a. Text Analysis Operations using NLTK

b. Tokenization

c. Stopwords removal

d. Lexicon Normalization such as Stemming and Lemmatization

e. POS Tagging
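As a rough illustration of steps b and c, here is a minimal sketch that uses only the Python standard library. It is not the contents of NLTK_Text_Analysis.py: the regular expressions stand in for NLTK's tokenizers, and `STOPWORDS` is a deliberately tiny, hypothetical list rather than NLTK's English stopword corpus. Stemming, lemmatization, and POS tagging are left out because they need NLTK itself.

```python
import re
from collections import Counter

text = """Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."""

# Sentence tokenization (assumption: a sentence ends at . ! or ? followed by whitespace).
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokenization: runs of letters/digits, allowing an internal comma,
# hyphen, or apostrophe so that "5,000" and "twenty-four" stay whole.
words = re.findall(r"[A-Za-z0-9]+(?:[-,'][A-Za-z0-9]+)*", text)

# Hypothetical mini stopword list; NLTK's English list is much longer.
STOPWORDS = {"a", "be", "can", "each", "has", "in", "is", "it", "its",
             "of", "the", "to", "where", "which"}

# Stopword removal: compare case-insensitively, keep the original token.
filtered = [w for w in words if w.lower() not in STOPWORDS]

# Frequency of the remaining tokens.
freq = Counter(w.lower() for w in filtered)

print(len(sentences), "sentences;", len(words), "words;", len(filtered), "after filtering")
print(freq.most_common(3))
```

With NLTK installed and its data downloaded, the corresponding calls would be `nltk.sent_tokenize`, `nltk.word_tokenize`, `stopwords.words('english')`, `PorterStemmer`, `WordNetLemmatizer`, and `nltk.pos_tag`.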

Q2 Analyze the customer reviews in the file Restaurant_Reviews.tsv

a. Explain each step of the following text clean-up commands

review = dataset['Review'][0]

review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])

review = review.lower()

review = review.split()

ps = PorterStemmer()

review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]

review = ' '.join(review)
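To see what each clean-up command does, the sketch below traces the same steps on a single string. The sample review and the small `STOPWORDS` set are hypothetical stand-ins (the real commands read from `dataset` and use NLTK's English stopword list), and the stemming step is omitted so that the example runs without NLTK.

```python
import re

# Hypothetical sample review; the assignment reads dataset['Review'][0] instead.
sample = "Wow... Loved this place!"

# Hypothetical stand-in for set(stopwords.words('english')).
STOPWORDS = {"this", "the", "a", "is", "and"}

# 1. Replace every character that is not a letter with a space.
step1 = re.sub('[^a-zA-Z]', ' ', sample)

# 2. Lower-case the text so "Loved" and "loved" count as the same word.
step2 = step1.lower()

# 3. Split on whitespace; runs of spaces collapse, giving a list of words.
step3 = step2.split()

# 4. Drop stopwords. (The original line also stems each kept word
#    with PorterStemmer; that part is left out here.)
step4 = [word for word in step3 if word not in STOPWORDS]

# 5. Join the remaining words back into one space-separated string.
step5 = ' '.join(step4)

for name, value in [("sub", step1), ("lower", step2), ("split", step3),
                    ("filter", step4), ("join", step5)]:
    print(f"{name:>6}: {value!r}")
```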

b. What is the classification question?

c. The example uses the Naïve Bayes classifier to classify the sentiments. Calculate the confusion matrix and the accuracy:

TP = # True Positives,

TN = # True Negatives,

FP = # False Positives,

FN = # False Negatives

Accuracy = (TP + TN) / (TP + TN + FP + FN)

d. Apply the logistic regression classifier to the problem – recalculate the quantities from “c”, i.e. TP, TN, FP, FN, and Accuracy
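For parts c and d, the four counts and the accuracy can be computed directly from the true and predicted labels. The sketch below assumes labels of 1 for a positive review and 0 for a negative one; the `y_true` and `y_pred` lists are made-up illustration data, not results from Restaurant_Reviews.tsv, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # predicted positive, actually positive
        elif pred != positive and truth != positive:
            tn += 1          # predicted negative, actually negative
        elif pred == positive:
            fp += 1          # predicted positive, actually negative
        else:
            fn += 1          # predicted negative, actually positive
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy(tp, tn, fp, fn):.2f}")
```

Running the same two functions on the predictions of each classifier gives directly comparable numbers for Naïve Bayes and logistic regression.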

Q3 NLTK Corpus on Movie Reviews

Q3a Use the following reference to perform sentiment analysis on movie reviews (“Q3 Movie Reviews.py”)

https://www.nltk.org/book/ch06.html

Q3b – Explain how the Bag of Words model helps in sentiment analysis

http://blog.chapagain.com.np/python-nltk-sentiment…
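A bag-of-words model represents each document by how often each word occurs and ignores word order, which lets a classifier treat word counts as features. The sketch below is a minimal standard-library illustration with two made-up sentences; it is not taken from NLTKMovieReview.py, and `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts, ignoring case, punctuation, and word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Made-up example documents.
docs = ["A great, great movie", "A dull movie"]
bags = [bag_of_words(d) for d in docs]

# Shared vocabulary: every distinct word across the documents, sorted.
vocab = sorted(set().union(*bags))

# One count vector per document, aligned with the vocabulary.
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Words such as "great" or "dull" end up as columns whose counts differ between positive and negative reviews, which is the signal a sentiment classifier learns from.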

Summarize the entire code in the NLTKMovieReview.py file as part of the solution

Q4 Twitter Analysis sentiment140

Perform a Twitter sentiment analysis –

  • Users on Twitter create short messages called tweets to share with other Twitter users

– who interact by retweeting and responding.

– Twitter restricts messages to 280 characters or fewer,

– which forces users to stay focused on the message they wish to disseminate.

– Twitter data is well suited to the Machine Learning (ML) task of sentiment analysis.

– Sentiment analysis falls under Natural Language Processing (NLP).

  • The training data is obtained from Sentiment140

– made up of about 1.6 million random tweets

– with corresponding binary labels: 0 for negative sentiment and 1 for positive sentiment.

  • Use a Naive Bayes classifier to learn the correct labels from this training set.

https://towardsdatascience.com/the-real-world-as-s…
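To make the Naive Bayes step concrete, here is a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written from scratch with the standard library. It is a sketch under stated assumptions: the four training tweets are made up, the labels follow the 0 = negative, 1 = positive convention above, and the class and function names are hypothetical. A real solution would train on the Sentiment140 data, typically with a library implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training tweets: 1 = positive, 0 = negative.
train_texts = ["I love this phone", "what a great day",
               "I hate waiting", "this is awful"]
train_labels = [1, 1, 0, 0]

model = TinyNaiveBayes().fit(train_texts, train_labels)
print(model.predict("love this great day"))
print(model.predict("I hate this awful day"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a word that is missing from one class from forcing that class's probability to zero.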

Q5 Analyze Clothing Reviews

https://www.kaggle.com/nicapotato/womens-ecommerce…

This is a women’s clothing e-commerce dataset revolving around the reviews written by customers. The dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review and includes the variables:

  • Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
  • Age: Positive Integer variable of the reviewer’s age.
  • Title: String variable for the title of the review.
  • Review Text: String variable for the review body.
  • Rating: Integer variable for the product score granted by the customer, from 1 (Worst) to 5 (Best).
  • Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
  • Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
  • Division Name: Categorical name of the product high-level division.
  • Department Name: Categorical name of the product department.
  • Class Name: Categorical name of the product class.

Perform

a. Text extraction & creating a corpus

b. Text Pre-processing

c. Create the document-term matrix (DTM) and term-document matrix (TDM) from the corpus

d. Exploratory text analysis

e. Feature extraction by removing sparsity

f. Build the classification models and compare Logistic Regression to Random Forest

https://medium.com/analytics-vidhya/customer-revie…
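Steps c and e can be illustrated on a toy corpus. The sketch below builds a document-term matrix (rows are documents, columns are terms), transposes it into a term-document matrix, and then drops terms that are absent from too many documents. The three reviews are made up, and `remove_sparse_terms` and its `max_sparsity` parameter are hypothetical names chosen for this example, not an API of any particular library.

```python
import re
from collections import Counter

# Made-up reviews standing in for the "Review Text" column.
reviews = [
    "Love this dress, fits perfectly",
    "The dress runs small",
    "Love the fabric",
]


def tokenize(text):
    """Lower-case the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


counts = [Counter(tokenize(r)) for r in reviews]
vocab = sorted(set().union(*counts))

# DTM: one row per document, one column per term.
dtm = [[c[term] for term in vocab] for c in counts]

# TDM: the transpose, one row per term, one column per document.
tdm = [list(row) for row in zip(*dtm)]


def remove_sparse_terms(dtm, vocab, max_sparsity):
    """Keep terms whose share of zero entries is at most max_sparsity."""
    n_docs = len(dtm)
    keep = [j for j in range(len(vocab))
            if sum(1 for row in dtm if row[j] == 0) / n_docs <= max_sparsity]
    return [[row[j] for j in keep] for row in dtm], [vocab[j] for j in keep]


dense_dtm, dense_vocab = remove_sparse_terms(dtm, vocab, max_sparsity=0.5)

print(vocab)
print(dense_vocab)
print(dense_dtm)
```

The reduced matrix is what would be passed, together with a target such as Recommended IND, to the classifiers compared in step f.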

HW11.docx

Q2 Restaurant Reviews.zip

Q1 NLP Basics.zip
