Question

Assignment 4

For this assignment you will use Python to analyze two sets of data. This assignment requires the Pandas and Matplotlib libraries. When you are done, upload this completed Jupyter Notebook file as both an .ipynb file and a PDF file. Also upload the modified data (CSV) file from Case 1.

Use comments in your Python code as appropriate to explain what you are doing at each step (or group of steps). Provide written responses to any questions using the Markdown boxes provided (add more if needed). You can also add additional code boxes if desired.

Case 1: Titanic passenger analysis.

The CSV file "RMS_Titanic" contains data about each passenger aboard the RMS Titanic when it sank in 1912. Each record includes whether or not the passenger survived (1 = yes); the pclass (ticket class) in which they were traveling (1st, 2nd, or 3rd); their name, sex, and age; and the fare they paid (in US dollars). Use Python for each task (do not perform manual calculations).

import pandas as pd
import matplotlib.pyplot as plt

1A. Create a list of only the names of all passengers over the age of 54 who survived. What percentage of the total number of passengers over the age of 54 is this?

Markdown box

1B. Create two histograms of age on the same chart. One histogram should be for those passengers who survived, and one for those who perished. The first histogram (survived) should have blue bars, and the second histogram (perished) should have yellow bars. Both should use 80 bins, have black edges, and have alpha set to 0.5. Add an appropriate title, axis labels, and a color legend. Comment on any similarities and differences that you observe between the two distributions.

Markdown box

1C. What is the average fare paid by all passengers? By only those who survived? By only those who did not? State your final answers in dollars and cents. Comment on any perceived correlation of fare with survival.

Markdown box

1D. Adjust each fare for inflation. That is, convert the fare values from 1912 dollars into 2024 dollars (add a column called "Adjusted Fares"). Save the modified data (all columns) to a new file called "updated2024.csv" and upload it with your assignment. An inflation index can be found at https://www.officialdata.org/us/inflation/1912?amount=1

Markdown box

Case 2: Data value distribution.

The CSV file "Weight_Males" contains the weight (in pounds) of 5000 randomly selected adult males in the United States. Analyze how well this data set meets the conditions of normality. Use Python for each task (do not perform manual calculations).

import pandas as pd
import matplotlib.pyplot as plt

2A. Create a histogram of the data using 64 bins. The bars should be green with blue edges. Add an appropriate title and axis labels. Describe how normal the distribution appears to be (based on the histogram).

Markdown box

2B. Generate a set of descriptive statistics for the data. What do these statistics tell you about how well the data meets the definition of normally distributed data?

Markdown box

2C. Test the distribution of the data set against the empirical rule by calculating the percentage of data points that are within 1, 2, and 3 standard deviations of the mean. How well does the data seem to meet the empirical rule?

Markdown box

2D. Potential outliers in a data set can be defined as data points that are more than 3 standard deviations away from the mean. Calculate how many data points are potential outliers, then create a list of these potential outliers (i.e., the data points themselves).

Markdown box
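
As an illustration of the checks described in 2C and 2D, here is a minimal sketch of one possible approach. It does not read the "Weight_Males" file, whose column names are not shown here. Instead it generates a synthetic stand-in, so the column name "Weight", the mean of 190, the standard deviation of 30, and the variable names (weights, z, within, outliers) are all assumptions made for this example only. With the real data, the series would presumably come from pd.read_csv instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weight data (assumption: the real file is not
# shown, so the column name and the parameters below are arbitrary).
rng = np.random.default_rng(seed=0)
weights = pd.Series(rng.normal(loc=190, scale=30, size=5000), name="Weight")

mean = weights.mean()
std = weights.std()  # pandas defaults to the sample standard deviation (ddof=1)

# Distance of each data point from the mean, in units of standard deviations
z = (weights - mean).abs() / std

# Percentage of points within k standard deviations of the mean.
# For normally distributed data the empirical rule predicts roughly
# 68%, 95%, and 99.7% for k = 1, 2, 3.
within = {k: (z <= k).mean() * 100 for k in (1, 2, 3)}
for k, pct in within.items():
    print(f"Within {k} SD of the mean: {pct:.2f}%")

# Potential outliers: points more than 3 standard deviations from the mean
outliers = weights[z > 3].tolist()
print(f"Number of potential outliers: {len(outliers)}")
print(outliers)
```

Comparing the printed percentages with 68%, 95%, and 99.7% gives a quick sense of how closely a data set follows the empirical rule. The outlier list is the complement of the 3-standard-deviation band used in that comparison.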