PYSPARK programming

Must have experience in PySpark programming.

Two questions need to be completed. Please include screenshots of the output, a description of the results, and the program design and code.

(1) The Amazon review system contains fake, computer-generated comments. Prof. Michael Luca from Harvard Business School argues that there's been some evidence that fake reviews are sloppier in general: "Short, vague reviews are a pretty good marker, [along with] poor punctuation and grammar."

Here are some examples of probably fake comments (e.g., "GREAT") and their corresponding ratings (e.g., 5 stars) in our data set:

   6^220^Five Stars^2016-01-09^false^ Quality product.^5.00
   6^221^Five Stars^2016-01-09^false^ Great quality.^5.00
   6^222^Five Stars^2015-11-25^false^ Excellent^5.00
   6^223^Five Stars^2016-01-14^false^ GREAT^5.00

These fake reviews appear to be more common among 5-star ratings than among 1-star ratings. If so, 5-star comments should be shorter on average. Let's examine the average length (number of words) of the comments for each rating and see whether this really holds.

Please design and implement a PySpark programme to compute the average length of comments (column: ReviewContent) for each rating (column: ReviewRating). There are 5 rating levels, where a 1-star rating represents the worst experience and a 5-star rating the best. Hint: you can remove punctuation from each comment with the following code:

  import re
  re.sub(r'\W+', ' ', mystring)

'\W+' is a regular expression that matches one or more consecutive non-word characters, i.e. anything other than letters, digits, and the underscore.
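As a small illustration of the hint, the sketch below cleans a comment with `re.sub` and counts the remaining words. The function name `count_words` and the third sample string are made up for this example; the first two samples come from the data rows shown earlier.

```python
import re

def count_words(comment):
    """Count the words left after replacing runs of non-word characters with spaces."""
    cleaned = re.sub(r"\W+", " ", comment)
    # split() with no argument also discards leading and trailing whitespace
    return len(cleaned.split())

for sample in [" Quality product.", " GREAT", "Don't buy, it's bad!"]:
    print(repr(sample), "->", count_words(sample))
```

Note that the apostrophe in a contraction is also a non-word character, so "Don't" is counted as two words by this approach.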

What is expected:

You should turn in a single Python file that prints the average length of the comments for each star rating:

 $ spark-submit 1-length.py
  1 star rating: average length of comments __
  2 star rating: average length of comments __
  3 star rating: average length of comments __
  4 star rating: average length of comments __
  5 star rating: average length of comments __
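
One possible shape for such a program is sketched below using the RDD API. It is only a starting point and rests on several assumptions: each record is a single line of seven `^`-separated fields, ReviewContent is the sixth field and ReviewRating the seventh (as the sample rows suggest), and the data file path, which the assignment does not name, is passed as a command-line argument. The helper names `parse_review` and `average_lengths` are illustrative.

```python
import re
import sys

def parse_review(line):
    """Return (rating, word_count) for one record, or None if the line is malformed."""
    fields = line.split("^")
    if len(fields) != 7:          # assumed layout: 7 fields per record
        return None
    try:
        rating = int(float(fields[6]))   # e.g. "5.00" -> 5 (assumed rating column)
    except ValueError:
        return None                      # e.g. a header row
    words = re.sub(r"\W+", " ", fields[5]).split()  # assumed content column
    return rating, len(words)

def average_lengths(lines):
    """lines: RDD of raw text lines -> dict {rating: average words per comment}."""
    pairs = lines.map(parse_review).filter(lambda p: p is not None)
    # Carry (total_words, review_count) per rating so the average is computed once.
    totals = pairs.mapValues(lambda n: (n, 1)).reduceByKey(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return totals.mapValues(lambda t: t[0] / t[1]).collectAsMap()

def main(path):
    # Imported here so the parsing helper can be tested without Spark installed.
    from pyspark import SparkContext
    sc = SparkContext(appName="AverageReviewLength")
    averages = average_lengths(sc.textFile(path))
    for rating in sorted(averages):
        print("%d star rating: average length of comments %.2f"
              % (rating, averages[rating]))
    sc.stop()

# Runs only when a data path is supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

With a hypothetical data file named `reviews.txt`, this would be invoked as `spark-submit 1-length.py reviews.txt`. Carrying a (sum, count) pair through `reduceByKey` avoids collecting all comment lengths for a rating in one place.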

(2) Top words

Please design and implement a PySpark programme to find the 10 most frequent words for each rating. Some words such as "great" and "good" are common in 5-star comments, while others such as "bad" and "worst" are common in 1-star comments.

Please remove stop words such as "the", "an", "of", etc. from each comment before obtaining the results.

Your Python code should print out the top 10 common words for each star rating:

  $ spark-submit 2-wordranking.py
  top 10 common words
  1 star rating : __ __ __ ...
  2 star rating : __ __ __ ...
  3 star rating : __ __ __ ...
  4 star rating : __ __ __ ...
  5 star rating : __ __ __ ...
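
A possible outline for this task is sketched below, again with the RDD API and under stated assumptions: records are single lines of seven `^`-separated fields with ReviewContent sixth and ReviewRating seventh, the data file path is given on the command line, and `STOP_WORDS` is a short illustrative list rather than a complete one. The names `rating_word_pairs`, `top_n`, and `top_words` are hypothetical.

```python
import re
import sys
from operator import add

# Illustrative stop-word list only; a real run would likely use a fuller one.
STOP_WORDS = frozenset(
    "a an and are as at be but by for from has have i in is it its of on or "
    "that the this to was were with".split())

def rating_word_pairs(line):
    """Return [((rating, word), 1), ...] for the non-stop words of one record."""
    fields = line.split("^")
    if len(fields) != 7:                 # assumed layout: 7 fields per record
        return []
    try:
        rating = int(float(fields[6]))   # assumed rating column
    except ValueError:
        return []                        # e.g. a header row
    words = re.sub(r"\W+", " ", fields[5]).lower().split()
    return [((rating, w), 1) for w in words if w not in STOP_WORDS]

def top_n(word_counts, n=10):
    """Return the n most frequent words; ties are broken alphabetically."""
    ranked = sorted(word_counts, key=lambda wc: (-wc[1], wc[0]))
    return [word for word, _ in ranked[:n]]

def top_words(lines, n=10):
    """lines: RDD of raw text lines -> dict {rating: [top n words]}."""
    counts = lines.flatMap(rating_word_pairs).reduceByKey(add)
    # Re-key by rating alone so each rating's (word, count) pairs can be ranked.
    by_rating = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
    return by_rating.groupByKey().mapValues(lambda wc: top_n(wc, n)).collectAsMap()

def main(path):
    # Imported here so the helpers can be tested without Spark installed.
    from pyspark import SparkContext
    sc = SparkContext(appName="TopWordsPerRating")
    result = top_words(sc.textFile(path))
    print("top 10 common words")
    for rating in sorted(result):
        print("%d star rating : %s" % (rating, " ".join(result[rating])))
    sc.stop()

# Runs only when a data path is supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Counting with `reduceByKey` before grouping means that `groupByKey` only handles one (word, count) pair per distinct word and rating, not one entry per word occurrence.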
