Student note:
These labs are sequential assignments that you'll undertake using the provided dataset. Ensure that you tackle the questions in the designated order. After responding to each question, be certain to implement the specified changes to your data frame. Please refrain from submitting your lab prematurely, as you only have one submission. Any inadvertent submissions will be evaluated for grading. Remember that you're not required to complete the entire lab in one session; you can work on it at your own pace.
Also, do the lab in a Jupyter notebook, and try to answer each question with simple, basic functions, since this is an intro class.
insurance.csv
To ensure consistent results, it is crucial that the versions of the NumPy, pandas, and scikit-learn libraries match those specified below:
• NumPy: 1.24.3
• pandas: 2.2.0
• scikit-learn: 1.4.0
Should you encounter any discrepancies, we recommend executing the following commands as the first step in your lab's IPython notebook to upgrade these libraries to the required versions:
!pip install --upgrade numpy==1.24.3
!pip install --upgrade pandas==2.2.0
!pip install --upgrade scikit-learn==1.4.0
1 point
What percentage of values in this data frame is represented as 'NaN' (Not-a-Number)? (Enter the answer with two digits after the decimal point)
Type your answer...
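For reference, a minimal sketch of one way to compute this percentage, assuming the dataset is loaded into a DataFrame named df:

import pandas as pd

df = pd.read_csv('insurance.csv')

# share of NaN cells over all cells in the frame, as a percentage
nan_pct = df.isnull().sum().sum() / df.size * 100
print(round(nan_pct, 2))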
1 point
Call the describe() function on the data frame and mark the correct answer. Make all necessary changes to the data frame.
Columns time_spent_in_gym and avg_steps_per_day do not show any meaningful correlation, so we should drop both of these columns.
-1 in the age column suggests a data entry error. To maintain data integrity, we will leave it as is.
Replace the minimum value of -1 under the age column with np.nan, indicating a sentinel value.
The data frame contains a minimum bmi value of under 16, which is NOT a possible value for a human. So we should replace it with the median of all other available values.
1 point
Indicate the number of columns in this data frame that have the object data type for dates.
Type your answer...
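A quick way to check this, assuming the question refers to columns that pandas stores with the generic object dtype in df:

# number of columns stored with the object dtype
print((df.dtypes == 'object').sum())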
1 point
Call the info function on this data frame. As you can see, the column age has missing values. We would like to impute the missing values in this column with the help of other values in this data frame. First, group the data frame by the gym_frequency column and calculate each group's mean age. Round the mean values to the closest integer and cast them to integers. Iterate through the rows of the DataFrame df. For each row, check whether the age value is NaN and the gym_frequency matches one of the specific conditions ('1 or 2 days a week', '3 to 5 days of a week', 'everyday', or 'never'). If the conditions are met, replace np.nan with the corresponding rounded mean age value computed before. Now enter the mean of the column age with three digits after the decimal point.
Type your answer...
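One possible sketch of the imputation described above, assuming the DataFrame is named df as in the question:

import pandas as pd

# mean age per gym_frequency group, rounded to the nearest integer
group_means = df.groupby('gym_frequency')['age'].mean().round().astype(int)

# fill each missing age with the mean of that row's gym_frequency group
for idx, row in df.iterrows():
    if pd.isna(row['age']) and row['gym_frequency'] in group_means.index:
        df.at[idx, 'age'] = group_means[row['gym_frequency']]

print(round(df['age'].mean(), 3))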
1 point
Execute the info function on the data frame again. You may notice that some missing values in the age column have not been filled in. To address this, replace all
remaining NaN values with the median of all available values in this column. After completing this final step of data imputation, calculate the mean of the age column. Provide
your answer with three digits after the decimal point.
Type your answer...
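A minimal sketch of this step, assuming df carries over from the previous question:

# replace any age values still missing with the column median
df['age'] = df['age'].fillna(df['age'].median())
print(round(df['age'].mean(), 3))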
1 point
Suggest the best way to transform the data in the sex column and apply the transformation.
Map male to 1 and female to 0.
Drop this column as it appears to be almost balanced.
Map female to 1 and male to 0.
Replace this column with a one-hot vector of length 2.
1 point
When we analyze the bmi column using a kde plot, it indicates that the available values in this column have a normal distribution. We need to select the most effective method
for imputing the missing data in this column. Which data imputation method is the best choice for this column? Apply it to the data frame.
[KDE plot of the bmi column, y-axis labelled Density, showing an approximately normal distribution]
Substitute all the missing values with the median of the available values.
Substitute all the missing values with 0.
Substitute all the missing values with the third quartile of the available values.
Substitute all the missing values with the first quartile of the available values.
1 point
To impute missing values in column children, follow these steps:
1. Calculate each non-missing value's probability (p) in the children column. This is done by dividing the value counts of each unique non-missing value by the total count of non-missing values.
2. Generate random values to replace the missing values using np.random.choice. You can specify the list of values (0, 1, 2, 3, 4, 5) and use the calculated probabilities to determine the distribution of these values:
np.random.seed(0)
values = np.random.choice([0, 1, 2, 3, 4, 5],
                          size=df['children'].isnull().sum(),
                          p=[p[0], p[1], p[2], p[3], p[4], p[5]])
3. Iterate through the values in the children column. For each value, check if the value is missing (NaN). If it's missing, replace it with the next value in the values array.
Now complete the frequency of each category of column children after data imputation.
children    frequency
0.0    type your answer...
1.0    type your answer...
2.0    type your answer...
3.0    type your answer...
4.0    type your answer...
5.0    type your answer...
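A minimal sketch of the three steps above, assuming df carries over and mirroring the question's variable names p and values:

import numpy as np
import pandas as pd

# step 1: probability of each non-missing value in 'children', ordered 0 through 5
p = df['children'].value_counts(normalize=True).sort_index().values

# step 2: draw replacement values with those probabilities
np.random.seed(0)
values = np.random.choice([0, 1, 2, 3, 4, 5],
                          size=df['children'].isnull().sum(),
                          p=[p[0], p[1], p[2], p[3], p[4], p[5]])

# step 3: consume one drawn value per missing entry, in row order
i = 0
for idx in df.index:
    if pd.isna(df.at[idx, 'children']):
        df.at[idx, 'children'] = values[i]
        i += 1

print(df['children'].value_counts())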
1 point
Suggest the best way to transform the data in the smoker column and apply the transformation.
Replace this column with a one-hot vector of length 2.
Map no to 1 and yes to 0.
Map yes to 1 and no to 0.
Drop this column as it appears to be almost balanced.
1 point
To impute missing values in column region we will use the same technique as in the previous question.
1. Calculate each non-missing value's probability (p) in the region column. This is done by dividing the value counts of each unique non-missing value by the total count of non-missing values.
2. Define a dictionary d that maps integer values (0 to 3) to the corresponding region names:
d = {0: 'southeast',
     1: 'southwest',
     2: 'northwest',
     3: 'northeast'}
3. Iterate through the values in the region column. For each value, check if the value is missing (NaN). If it's missing, replace it with the next value in the values array. Use the dictionary d to map the value in the array values to one of the regions.
Now complete the following table by entering the frequency of each region after data imputation.
region    frequency
southeast    type your answer...
southwest    type your answer...
northwest    type your answer...
northeast    type your answer...
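A sketch of the same technique applied to region, assuming df carries over; the probability ordering and the random seed here are assumptions rather than part of the question:

import numpy as np
import pandas as pd

# step 1: probability of each non-missing region, ordered to match dictionary d
d = {0: 'southeast', 1: 'southwest', 2: 'northwest', 3: 'northeast'}
p = df['region'].value_counts(normalize=True).reindex(list(d.values())).values

# step 2: draw integer codes 0-3 with those probabilities
np.random.seed(0)
values = np.random.choice([0, 1, 2, 3],
                          size=df['region'].isnull().sum(),
                          p=p)

# step 3: fill each missing region with the next drawn code, mapped through d
i = 0
for idx in df.index:
    if pd.isna(df.at[idx, 'region']):
        df.at[idx, 'region'] = d[values[i]]
        i += 1

print(df['region'].value_counts())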
1 point
Now that column region is complete, the next step is to transform it to all numerical values. Which of the following would be this column's best feature transformation option?
Replace it with a one-hot vector.
Replace it with a one-hot vector, then remove the column with the lowest number of 1s.
Map to numerical values using dictionary d in the previous question.
Replace it with a one-hot vector, then randomly remove one of the columns.
1 point
The column named occupation has some missing values. Upon examining the distribution of the values in this column, it can be observed that there is a category called Other.
One way of imputing the missing data is to replace all the blank values with Other. Once this imputation is done, what would be the frequency of Other in this column? (Please
note that we will not be transforming any data in this column at this stage.)
Type your answer...
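A minimal sketch of this imputation, assuming df carries over:

# replace missing occupations with the existing 'Other' category, then count it
df['occupation'] = df['occupation'].fillna('Other')
print((df['occupation'] == 'Other').sum())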
1 point
When we use the crosstab function on the columns gym_frequency and occupation, we find that gym frequency varies across different occupations. To fill in the missing
values in this column, we can use the most frequent gym frequency among members of each occupation. For instance, if a person has an occupation of Artist and their gym
frequency is missing, we can replace it with the most frequently occurring gym frequency among all Artist instances in the data frame. After imputing all missing values,
complete the table with this information.
gym_frequency    frequency
3 to 5 days of a week    type your answer...
everyday    type your answer...
never    type your answer...
1 or 2 days a week    type your answer...
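One possible sketch of this occupation-based imputation, assuming df carries over; the helper name mode_by_occ is illustrative only:

# most frequent gym_frequency within each occupation
mode_by_occ = df.groupby('occupation')['gym_frequency'].agg(
    lambda s: s.value_counts().idxmax())

# fill each missing gym_frequency with the mode of that row's occupation
mask = df['gym_frequency'].isnull()
df.loc[mask, 'gym_frequency'] = df.loc[mask, 'occupation'].map(mode_by_occ)

print(df['gym_frequency'].value_counts())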
1 point
What is the best way to have a numerical representation for the values in the column gym_frequency?
Replace this column with a one-hot vector of length 4.
Map the values using {'1 or 2 days a week': 0, 'never': 1, 'everyday': 2, '3 to 5 days a week' : 3}
Map the values using {'never': 0, '1 or 2 days a week': 1, '3 to 5 days of a week': 2, 'everyday': 3}
Replace this column with a one-hot-vector of length 3.
1 point
What is the best way to have a numerical representation for the values in the column occupation?
Map the values using {'Artist': 0, 'Doctor': 1, 'Engineer': 2, 'Other':3, 'Teacher':4}
Map the values using {'Other' :4, 'Artist': 3, 'Engineer': 2, 'Doctor': 1, 'Teacher':0}
Replace this column with a one-hot vector of length 4.
Replace it with a one-hot vector of length 5.
Map the values using {'Other':0, 'Artist': 1, 'Engineer': 2, 'Doctor': 3, 'Teacher':4}