Question

Anna, a business executive, asked you to improve the accuracy of the failed food inspection analysis you produced during Week 1. She advised you to identify the specific tokens that lead to failed food inspections in Chicago and to produce a more accurate top-10 list.

Approach: Use the same data as in Assignment 1, but this time identify the top-10 tokens that occur in the regulation descriptions in the table.

1. As in Assignment 1, filter the data for failed inspections and keep only records where the violations description is not blank.
2. Using regex, separate the violation description and comments into separate data frame columns.
3. Tokenize the violation description and comment columns.
4. Find the top-10 tokens of each column.
5. Clean each column: convert to lower case and remove stopwords, punctuation, numbers, etc.
6. Find the top-10 tokens again.
7. Find the top-10 tokens after applying Porter stemming to the columns obtained in step 5.
8. Find the top-10 tokens after applying Lancaster stemming to the columns obtained in step 5.
9. Find the top-10 tokens after applying lemmatization to the columns obtained in step 5.
10. Compare the top-10 tokens obtained in steps 4, 6, 7, 8, and 9.
11. Describe which approach provided the most comprehensive view of violations / comments, and why.
12. Use the "most effective" cleaning approach to plot the distribution of the most common tokens (belonging to the violation description) over time.

You are working on a project to promote food safety in Chicago. Your goal is to identify the top-10 most frequent causes of failed food inspections in Chicago and effectively present them to your boss.

You can either use the attached starter notebook to read the data from the API: NLP Assignment 1 Starter.ipynb
Or download the CSV from this link: https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5

1. Select only the records corresponding to failed inspections (see the "results" column).
2. Clean the data, making sure that there are no NaNs in the "violations" column.
3. The "violations" column lists the reasons for inspection failure. Those reasons are separated by "|". Each reason consists of a regulation code, a regulation description, and comments describing how the regulation was violated.
4. Using a regular expression, parse the "violations" column to select only the regulation descriptions, with no code or comments.
5. Count how many times each regulation description occurred in the table and visualize the top-10 most frequent regulation descriptions.
6. Identify whether any of these restaurants are repeat offenders (explore a combination of the License, Business Name, and Address variables to determine the best way to uniquely identify a business) and whether the violations are the same or different for these repeat offenses.
7. Review the restaurants marked "Out of Business": is there an extended history of prior violations for these closed restaurants?
8. The food inspection data has 10+ years of history. Do you see any changing trends in the most common violations? Plot the results for the top-5 most frequent violations over time.
9. Your final output should be a Jupyter notebook showing all your code and the results so that one can easily reproduce them.
10. Remember: you are presenting it to your boss and have a very limited amount of time to state your case. Your presentation (charts and explanations within the Jupyter notebook) should be as clear and short as possible, but complete.
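The parsing and token-counting steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the assignment's reference solution. It assumes each entry in the "violations" column looks like `<code>. <DESCRIPTION> - Comments: <text>`, which the question does not spell out. The sample cells are made up for the example, and `ENTRY_RE`, `STOPWORDS`, `parse_violations`, `clean_tokens`, and `top_n` are hypothetical names. In a notebook one would more likely apply the same regex through pandas and use a fuller stopword list.

```python
import re
from collections import Counter

# ASSUMPTION: one entry looks like "<code>. <DESCRIPTION> - Comments: <text>".
# The comments part is treated as optional.
ENTRY_RE = re.compile(
    r"^\s*(?P<code>\d+)\.\s*(?P<description>.*?)"
    r"(?:\s*-\s*Comments:\s*(?P<comments>.*))?$",
    re.DOTALL,
)

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "at", "in", "not", "of", "on", "the", "to"}


def parse_violations(cell):
    """Split one 'violations' cell on '|' and parse each entry."""
    parsed = []
    for entry in cell.split("|"):
        match = ENTRY_RE.match(entry.strip())
        if match:
            parsed.append({
                "code": match.group("code"),
                "description": match.group("description").strip(),
                "comments": (match.group("comments") or "").strip(),
            })
    return parsed


def clean_tokens(text):
    """Lower-case, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def top_n(items, n=10):
    """Return the n most frequent items with their counts."""
    return Counter(items).most_common(n)


# Made-up cells written in the assumed format (not real inspection records).
sample_cells = [
    "32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, "
    "CONSTRUCTED AND MAINTAINED - Comments: TORN DOOR GASKET IN THE COOLER. "
    "| 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR "
    "- Comments: FLOOR NOT CLEAN UNDER THE FRYER.",
    "34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR "
    "- Comments: DIRTY FLOOR IN THE PREP AREA.",
]

records = [v for cell in sample_cells for v in parse_violations(cell)]
description_counts = top_n([r["description"] for r in records])
comment_token_counts = top_n(
    [t for r in records for t in clean_tokens(r["comments"])]
)

print(description_counts)
print(comment_token_counts)
```

The description group is matched lazily so that a hyphen inside a description (as in "NON-FOOD") is not mistaken for the " - Comments:" separator. Stemming or lemmatization would be applied to the output of `clean_tokens` before counting.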
Rules and requirements:
• Your final output and the code should be contained within a Jupyter Notebook (ipynb)

Create a classification model predicting the outcome of a food safety inspection based on the inspectors' comments.
• Leverage the results of your homework from Week-1 and Week-2 to extract free-form text comments from inspectors
• Build a classification model predicting the outcome of the inspection: comments are the predictors, and the target variable is the "Results" column
• Explain why you selected a particular text pre-processing technique
• Visualize the results of at least two text classifiers and select the most robust one
• You can choose to build a binary classifier (limiting your data to Pass / Fail) or a multinomial classifier with all available values in Results

Rules and requirements:
• Your final output and the code should be contained within a Jupyter Notebook (ipynb)
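As a rough illustration of comparing two text classifiers on inspector comments, the sketch below pits a majority-class baseline against a hand-rolled multinomial Naive Bayes with add-one smoothing. Both are my choices to keep the example dependency-free; the assignment does not prescribe particular classifiers, and a notebook would more likely use a library such as scikit-learn. The comments and labels are invented toy data, and `MajorityBaseline`, `NaiveBayes`, `tokenize`, and `accuracy` are hypothetical names.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class MajorityBaseline:
    """Always predicts the most frequent training label."""

    def fit(self, texts, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.label_ for _ in texts]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts_ = Counter(labels)
        self.word_counts_ = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts_[label].update(tokenize(text))
        self.vocab_ = {w for c in self.word_counts_.values() for w in c}
        return self

    def _log_score(self, tokens, label):
        total_docs = sum(self.class_counts_.values())
        score = math.log(self.class_counts_[label] / total_docs)
        counts = self.word_counts_[label]
        denom = sum(counts.values()) + len(self.vocab_)
        for tok in tokens:
            if tok in self.vocab_:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / denom)
        return score

    def predict(self, texts):
        return [
            max(self.class_counts_,
                key=lambda lab: self._log_score(tokenize(t), lab))
            for t in texts
        ]


def accuracy(model, texts, labels):
    preds = model.predict(texts)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Invented toy comments and outcomes (not real inspection data).
train_texts = [
    "mouse droppings observed in storage area",
    "no hot water at hand sink",
    "live roaches observed near the cooler",
    "all surfaces clean and sanitized",
    "food stored at proper temperature",
]
train_labels = ["Fail", "Fail", "Fail", "Pass", "Pass"]
test_texts = [
    "roaches observed in kitchen",
    "surfaces clean and food at proper temperature",
]
test_labels = ["Fail", "Pass"]

baseline_acc = accuracy(
    MajorityBaseline().fit(train_texts, train_labels), test_texts, test_labels
)
nb_acc = accuracy(
    NaiveBayes().fit(train_texts, train_labels), test_texts, test_labels
)
print("baseline:", baseline_acc, "naive bayes:", nb_acc)
```

On real data the comparison would use a proper train/test split and, given that Pass and Fail counts are unlikely to be balanced, metrics beyond accuracy (for example a confusion matrix per classifier) before picking the more robust model.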