
The main deliverable for this assignment will be a report in the form of an R notebook. You will have to

submit both the .nb.html and the .Rmd files as well as any other files you used in a zipped folder.

You should use markdown headers to include the following four sections:

1. Introduction: Provide an overview of the whole data set, and clearly state your research questions:

• Consider the data source and data properties such as the number of observations and the available

variables;

• Pose at least two initial questions about your data, at least one of which explores the

relationship between two to three variables in your data set;

• State your initial research questions clearly, and specifically mention the variables that you are

hoping will help you answer your questions;
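
As a minimal sketch of the kind of overview the Introduction asks for, assuming the built-in airquality data set as a stand-in for your own data (the names aq, n_obs and n_vars are illustrative only):

```r
# airquality ships with base R; it stands in for your own data set here.
aq <- airquality

n_obs  <- nrow(aq)   # number of observations
n_vars <- ncol(aq)   # number of available variables

str(aq)              # variable names, storage types, and first few values
summary(aq)          # quick per-variable overview, including NA counts
```

The output of str() and summary() is a convenient starting point for describing the data properties in your own words.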

2. Data Quality Assessment: Check the quality of your data and carry out any necessary pre-processing

to clean it.

• Consider whether there are any salient data quality issues that need to be addressed (e.g., missing data,

inputs of the wrong type, outliers that seem to indicate an error).

• Pay special attention to any variables you will be using in your analysis.
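
The checks above could be sketched as follows. This is only one possible approach, again assuming the built-in airquality data set as a stand-in; the 1.5 * IQR rule is a common convention for flagging candidate outliers, not a requirement of this assignment, and all object names are illustrative.

```r
aq <- airquality

# Missing data: number of NA values per variable
na_counts <- colSums(is.na(aq))

# Inputs of the wrong type: inspect how each variable is stored
var_types <- sapply(aq, class)

# Month is stored as an integer but is really categorical, so convert it
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

# Outliers: flag Ozone values outside the 1.5 * IQR fences
q        <- quantile(aq$Ozone, c(0.25, 0.75), na.rm = TRUE)
fence_lo <- q[1] - 1.5 * diff(q)
fence_hi <- q[2] + 1.5 * diff(q)
ozone_outliers <- aq[!is.na(aq$Ozone) &
                       (aq$Ozone < fence_lo | aq$Ozone > fence_hi), ]

na_counts
var_types
ozone_outliers
```

Flagged values are only candidates: inspect them before deciding whether they indicate an error or a genuine extreme observation.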

3. Exploratory Data Analysis: Investigate your initial research questions, and any new questions that

arise from your exploration.

• This section should have a subsection for each of your exploratory questions and include:

- graphs (at least 3) and tables (at least 2) that can help answer your questions (tables can

contain summary statistics, or information about a few particularly interesting observations);

- short textual descriptions (about 2 to 3 sentences) that interpret and communicate the insights

you got from each of them (when appropriate, explain your reasoning for generating them).

• You should perform both univariate and bivariate analysis:

- Calculate summary statistics (univariate analysis) and grouped summary statistics of variables

of interest (e.g., calculate the mean of a continuous variable across levels of a categorical or

discretized variable - bivariate analysis);

- Generate appropriate graphs to show distributions (univariate analysis) and the relationship

between variables of interest (bivariate or multivariate analysis), using faceting if appropriate;

• Refine your answers (e.g., by adding more variables to your analysis or subsetting your data).
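
A minimal sketch of the univariate and bivariate steps listed above, assuming the built-in airquality data set as a stand-in and the ggplot2 package for graphs (the assignment does not prescribe a plotting package; the object names and the Temp > 80 threshold are illustrative):

```r
aq <- airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

# Univariate: summary statistics for a single variable
summary(aq$Ozone)

# Bivariate: mean of a continuous variable across levels of a categorical one
# (the formula interface drops rows with missing Ozone by default)
ozone_by_month <- aggregate(Ozone ~ Month, data = aq, FUN = mean)
ozone_by_month

# Refinement: subset the data before repeating a comparison
hot_days <- subset(aq, Temp > 80)

# Graphs are skipped if ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Univariate: distribution of Ozone
  p_hist <- ggplot(aq, aes(x = Ozone)) +
    geom_histogram(bins = 20, na.rm = TRUE)

  # Bivariate relationship, faceted by a third variable
  p_facet <- ggplot(aq, aes(x = Temp, y = Ozone)) +
    geom_point(na.rm = TRUE) +
    facet_wrap(~ Month)

  print(p_hist)
  print(p_facet)
}
```

Each table or graph produced this way still needs its own 2 to 3 sentence interpretation in the report.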

4. Conclusion: Briefly describe your main insights. This should include a summary of the answers to

your questions and may also include brief comments on the following:

• If you had prior hypotheses about the topic of your data, does the data seem to support them? Did

you find anything surprising?

• Did your initial questions lead to further questions?

• Were there questions you were not able to answer due to data quality issues?

Feel free to revise your questions and/or explore additional questions that arise from your initial questions.

You may break down the sections into additional subsections if it makes your report easier to follow. Make

sure to polish your final deliverable according to the guidelines and marking criteria.
