5 star categories based on the text content in the reviews. This will be a simpler procedure than the lecture, since we will utilize the pipeline methods for more complex tasks.

We will use the Yelp Review Data Set from Kaggle.

Each observation in this dataset is a review of a particular business by a particular user.

The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.

The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.

The "useful" and "funny" columns are similar to the "cool" column.

Let's get started! Just follow the directions below!

NLP Project practice

5. Use the corr() method on that groupby dataframe to produce this dataframe: (5 points)

                 cool    useful     funny  text length
cool         1.000000 -0.743329 -0.944939    -0.857664
useful      -0.743329  1.000000  0.894506     0.699881
funny       -0.944939  0.894506  1.000000     0.843461
text length -0.857664  0.699881  0.843461     1.000000

NLP Classification Task

Let's move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.

6. Create a dataframe called yelp_class that contains the columns of the yelp dataframe but for only the 1 or 5 star reviews. (5 points)

7. Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class. (Your features and target/labels) (10 points)

9. Use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column). Save this result by overwriting X. (10 points)

X = cv.fit_transform(X)

Train Test Split

Let's split our data into training and testing data.

10. Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101. (10 points)

Training a Model

Time to train a model!

11. Import DecisionTreeClassifier, create an instance of the estimator, and call it tree. Then fit the model using the training set. (10 points)

Now fit tree using the training data.

DecisionTreeClassifier()
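As a rough illustration of steps 6 through 11, the sketch below runs the same sequence on a tiny made-up dataframe. The rows in `yelp` here are hypothetical stand-ins (the real Kaggle file is not loaded), and the final `predictions` line is an extra step that the directions above do not ask for. The names `yelp`, `yelp_class`, `X`, `y`, `cv`, and `tree` follow the directions; everything else is an assumption made for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the Yelp data: only the two columns used here.
yelp = pd.DataFrame({
    "stars": [1, 5, 3, 5, 1, 5, 1, 4, 5, 1, 5, 1],
    "text": [
        "terrible service and cold food",
        "amazing food and great staff",
        "it was okay, nothing special",
        "loved it, will come back",
        "awful experience, never again",
        "great place, friendly service",
        "rude staff and terrible wait",
        "pretty good overall",
        "amazing atmosphere, loved the food",
        "cold food and awful service",
        "friendly staff, great experience",
        "never again, terrible place",
    ],
})

# Step 6: keep only the 1-star and 5-star reviews.
yelp_class = yelp[(yelp["stars"] == 1) | (yelp["stars"] == 5)]

# Step 7: features are the review text, labels are the star ratings.
X = yelp_class["text"]
y = yelp_class["stars"]

# Step 9: turn the text into a sparse matrix of token counts, overwriting X.
cv = CountVectorizer()
X = cv.fit_transform(X)

# Step 10: hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Step 11: create the estimator and fit it on the training set.
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

# Not part of the directions: predict on the held-out rows as a sanity check.
predictions = tree.predict(X_test)
```

After `fit_transform`, X is no longer a column of strings but a sparse matrix with one row per review and one column per vocabulary word, which is why it can be passed straight to `train_test_split` and the classifier.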
Fig: 1
Fig: 2
Fig: 3