The main deliverable for this assignment will be a report in the form of an R notebook. You will have to
submit both the .nb.html and the .Rmd files as well as any other files you used in a zipped folder.
You should use markdown headers to include the following four sections:
1. Introduction: Provide an overview of the whole data set, and clearly state your research questions:
• Consider the data source and data properties such as the number of observations and the available
variables;
• Pose at least two initial questions about your data, at least one of which explores the
relationship between two to three variables in your data set;
• State your initial research questions clearly, and specifically mention the variables that you
hope will help you answer your questions.
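As a rough illustration of the kind of overview meant here, the sketch below reports the number of observations and the available variables. It uses R's built-in `airquality` data set purely as a stand-in for your own data (an assumption; this brief does not prescribe a data set), and the names `aq`, `n_obs` and `n_vars` are hypothetical.

```r
# Stand-in data: the built-in airquality data set (assumption, replace with your own).
aq <- airquality

# Number of observations (rows) and variables (columns).
n_obs  <- nrow(aq)
n_vars <- ncol(aq)
cat("Observations:", n_obs, " Variables:", n_vars, "\n")

# Variable names and types, plus the first few rows.
str(aq)
head(aq)
```

In your report, follow output like this with a sentence or two saying where the data came from and which variables your questions rely on.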
2. Data Quality Assessment: Check the quality of your data and carry out any necessary pre-processing
to clean it.
• Consider whether there are any salient data quality issues that need to be addressed (e.g., missing data,
inputs of the wrong type, outliers that seem to indicate an error).
• Pay special attention to any variables you will be using in your analysis.
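The checks listed above (missing data, wrong types, suspicious outliers) could be sketched in base R as follows. This is one possible approach, not a required one: the built-in `airquality` data set is again only a stand-in, the 1.5 * IQR rule is a common convention chosen here for illustration, and dropping rows with a missing `Ozone` value is just one of several defensible ways to handle missing data.

```r
# Stand-in data: the built-in airquality data set (assumption, replace with your own).
aq <- airquality

# Missing data: count NA values in each variable.
na_counts <- colSums(is.na(aq))
na_counts

# Wrong type: Month is stored as an integer but is really categorical.
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

# Outliers: flag Wind values beyond 1.5 * IQR from the quartiles, then inspect
# them before deciding whether they look like errors or genuine extremes.
q <- quantile(aq$Wind, c(0.25, 0.75))
fence <- 1.5 * diff(q)
wind_outliers <- aq[aq$Wind < q[1] - fence | aq$Wind > q[2] + fence, ]
wind_outliers

# Pre-processing: keep only rows where the variable of interest was recorded.
aq_clean <- aq[!is.na(aq$Ozone), ]
nrow(aq_clean)
```

Whatever you decide to do with missing values or outliers, state the decision and the reason for it in the text of your report.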
3. Exploratory Data Analysis: Investigate your initial research questions, and any new questions that
arise from your exploration.
• This section should have a subsection for each of your exploratory questions and include:
- graphs (at least 3) and tables (at least 2) that can help answer your questions (tables can
contain summary statistics, or information about a few particularly interesting observations);
- short textual descriptions (about 2 to 3 sentences) that interpret and communicate the insights
you got from each of them (when appropriate, explain your reasoning for generating them).
• You should perform both univariate and bivariate analysis:
- Calculate summary statistics (univariate analysis) and grouped summary statistics of variables
of interest (e.g., calculate the mean of a continuous variable across levels of a categorical or
discretized variable - bivariate analysis);
- Generate appropriate graphs to show distributions (univariate analysis) and the relationship
between variables of interest (bivariate or multivariate analysis), using faceting if appropriate;
- Refine your answers (e.g., by adding more variables to your analysis or subsetting your data).
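A minimal sketch of the univariate and bivariate steps described above is given below. It makes several assumptions: the built-in `airquality` data set stands in for your own data, the temperature cut points (70 and 80) are arbitrary illustrative values, and the plots use the ggplot2 package, which this brief does not name. The plotting part is skipped if ggplot2 is not installed.

```r
# Stand-in data (assumption): airquality rows with a recorded Ozone value.
aq <- airquality[!is.na(airquality$Ozone), ]
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

# Univariate: summary statistics of one continuous variable.
ozone_summary <- summary(aq$Ozone)
ozone_summary

# Bivariate: mean of a continuous variable across levels of a categorical one.
ozone_by_month <- aggregate(Ozone ~ Month, data = aq, FUN = mean)
ozone_by_month

# Bivariate with a discretized variable: bin Temp, then average Ozone per bin.
# The cut points are illustrative only.
aq$TempBand <- cut(aq$Temp, breaks = c(-Inf, 70, 80, Inf),
                   labels = c("cool", "mild", "hot"))
ozone_by_temp <- tapply(aq$Ozone, aq$TempBand, mean)
ozone_by_temp

# Graphs: a distribution (univariate) and a faceted relationship (bivariate).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_hist <- ggplot(aq, aes(x = Ozone)) +
    geom_histogram(bins = 20)
  p_facet <- ggplot(aq, aes(x = Temp, y = Ozone)) +
    geom_point() +
    facet_wrap(~ Month)
  print(p_hist)
  print(p_facet)
}
```

Each table and graph produced this way still needs the two to three sentences of interpretation that this section asks for.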
4. Conclusion: Briefly describe your main insights. This should include a summary of the answers to
your questions and may also include brief comments on the following:
• If you had prior hypotheses about the topic of your data, does the data seem to support them? Did
you find anything surprising?
• Did your initial questions lead to further questions?
• Were there questions you were not able to answer due to data quality issues?
Feel free to revise your questions and/or explore additional questions that arise from your initial questions.
You may break down the sections into additional subsections if it makes your report easier to follow. Make
sure to polish your final deliverable according to the guidelines and marking criteria.