Question
Student note:
Using this link for the data set: https://www.kaggle.com/datasets/azminetoushikwasi/aqi-air-quality-index-scheduled-daily-update
I only need the project proposal part done, not the entire project.
I need it in around 500 words, single-spaced.
Instructions:
STEVENS
INSTITUTE OF TECHNOLOGY
1870
Course Project Instructions and Guidelines
The Course Project for MA 541 is a group-based and semester-long project. It is an
opportunity for you to integrate and apply what you have learned in the course and to
perform an in-depth analysis of your chosen data set. Including the proposal period, you
have over ten weeks to complete the project.
You will be assigned to a group of three or four students. All team members must make
approximately equal contributions to the project to receive the same credit. A Group
Performance Assessment form from each team member must be submitted as a reference
for grading.
To make sure you are on track to submit the project report on time, several assignments
related to the course project must be submitted to Canvas. Their due dates are given
below. Please note all submissions are due at 11:00 pm.
MA 541-B - Spring 2024
Submission                       Due date
Course Project Proposal          February 25, 2024
Draft of project report (PDF)    April 7, 2024
Final report (PDF)               April 21, 2024
Group assessment form            April 21, 2024
Your course project report must be a complete exploratory data analysis. The goal is to gain
insights into the data, identify patterns, and test hypotheses that can guide further analysis
and support conclusions. You will choose your own data set to analyze. The data set must
include at least 100 values, but make sure it stays manageable. A good number of values
for a data set is 200-500. Some places to find data sets are listed below.
https://datasetsearch.research.google.com
https://www.kaggle.com/datasets
https://data.gov
https://datahub.io/collections
https://www.earthdata.nasa.gov
https://apps.who.int/gho/data/node.home
https://www.bfi.org.uk/industry-data-insights
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
https://crime-data-explorer.fr.cloud.gov
https://finance.yahoo.com/lookup?s=DATA
You can use any programming language or tool (Python, R, Excel, Minitab, etc) to conduct the
analysis for this project. All code must be submitted to Canvas as a PDF file that is
separate from the project report.
Each group will submit one report by uploading it to Canvas. Its grade will be used for the
Course Project component in our course.
Group Member Registration:
Please use the following link to sign up for groups. Each group should have only 3-4 members.
MA 541-B Project Groups.xlsx
This registration is due Thursday, February 15, 2024, at 11:00 pm. After this time, if you don't
have team members to work with, I will randomly assign you to a group.
Proposal Requirements:
The course project proposal states the problem you will work on and how you plan to analyze
your dataset. In your project proposal, you will need to include the following.
1) A link to your dataset source.
2) Summary of your data set (What is it about? Has it been cleaned? How many variables
does it have? What types are the variables: qualitative, quantitative, text, numeric,
tables, pictures, etc?)
3) Describe what you plan to do. (You don't need to be specific in this part. It is just a plan;
you can freely make changes as you are working on the project.)
4) Suggestion section: If you need any suggestions from me, please put them in this
section. If you don't need them, please enter N/A. (Of course, you can always discuss
any questions with me if they arise while you are working on the project.)
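For item 2 of the proposal, the basic facts about a data set (number of rows, number of variables, rough variable types, missing values) can be gathered with a short script. The following is a minimal Python sketch using only the standard library. The sample data and the column names `region`, `status`, and `value` are made up for illustration and are not taken from any real data set; the rule used to label a column quantitative is also an assumption.

```python
import csv
import io

# Made-up sample standing in for a real file; the column names are hypothetical.
SAMPLE_CSV = """region,status,value
North,Good,42
South,Moderate,87
East,Good,
West,Unhealthy,153
"""

def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def summarize(csv_text):
    """Report row count, variable count, and a rough type and missing count per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    report = {"rows": len(rows), "variables": len(columns), "columns": {}}
    for name in columns:
        values = [(row[name] or "") for row in rows]
        present = [v for v in values if v.strip() != ""]
        # Assumption: a column is quantitative only if every non-missing value is numeric.
        numeric = bool(present) and all(is_number(v) for v in present)
        report["columns"][name] = {
            "type": "quantitative" if numeric else "qualitative",
            "missing": len(values) - len(present),
        }
    return report

summary = summarize(SAMPLE_CSV)
print(summary)
```

With a real file, the same function could be applied to the text read from disk. The missing-value counts also give a first answer to whether the data has been cleaned.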
Project Guidelines:
Your project should include the main steps of an exploratory data analysis as follows.
1) Data sourcing: introduce the source of your data set and summarize it. (What is it
about? Has it been cleaned? How many variables does it have? What types are the
variables: qualitative, quantitative, text, numeric, tables, pictures, etc?)
2) Data cleaning: If your data set does not require cleaning, state so and then skip this
step. If it requires cleaning, include what you have done to clean it (identify and correct
errors, missing values, duplicate values, outliers, inconsistent data, data formatting
issues, etc).
3) Data Visualization and Summary Statistics: This is where you use descriptive statistics to
explore and get quick insights into the data.
a) Data Visualization: make use of graphs and charts to explore data. Suggestions are
pie charts, bar graphs, line graphs, histograms. Make sure you include conclusions
(trends, percentages, comparisons, outliers, etc) at the end of this part.
b) Summary Statistics: the measures of center (mean, median, mode), measures of
dispersion (range, variance, standard deviation) and measures of location
(percentiles, 5-number summary) can help you understand the data set and identify
patterns or unusual values. You can also draw box plots and use them to compare
different groups or characteristics in data.
4) Statistical Inference: This is where you use statistical methods to explore and analyze
data. Some things you may include are: check for types of distributions, estimate
parameters (using different methods), test hypotheses, perform correlation analysis,
(one variable and multivariable) regressions, factor analysis, sample size determination,
classification, etc. The goal is to make conclusions and predictions. It includes two main
parts.
a) Univariate Analysis: identify variables in your data (you should focus on several
main variables) and explore them. Use graphs (histograms, QQ plots, normal
probability plots) to get some insights about the distribution of the data. Then
estimate parameters, test hypotheses, and make conclusions.
b) Multivariate Analysis: Determine the type of the variables in your data to use the
appropriate methods.
● Categorical vs Categorical variables: Chi-square test of independence.
● Categorical vs Numerical variables: scatter plot, contingency table, t-test,
Chi-square test, ANOVA.
● Numerical vs Numerical variables: scatter plot, simple linear regression,
ANOVA, correlation coefficient, etc.
Please note you don't need to include all methods/techniques in your report. Explore the
data, see which characteristics are worth studying further, and focus on them. Your
project report should be between 12-15 pages (not including the code). Please try to
have a transparent analysis rather than a lengthy one.
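Two of the steps above, summary statistics for a numerical variable and the Chi-square test of independence for two categorical variables, can be sketched in a few lines of Python. This is a rough illustration using only the standard library, not a prescribed method. The function names `describe` and `chi_square_statistic` and all data values are made up, and the quartiles use the "inclusive" interpolation method, which is only one of several conventions.

```python
import statistics
from collections import Counter

def describe(values):
    """Measures of center, dispersion, and location for a numeric sample."""
    # Quartiles by the inclusive method (an assumed convention; others exist).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
        "five_number": (min(values), q1, median, q3, max(values)),
    }

def chi_square_statistic(pairs):
    """Chi-square statistic and degrees of freedom from (row label, column label) pairs."""
    observed = Counter(pairs)
    row_totals = Counter(r for r, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    n = len(pairs)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # expected count under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Made-up illustrative data.
sample = [12, 15, 11, 19, 14, 22, 13, 18]
stats = describe(sample)
pairs = ([("A", "yes")] * 30 + [("A", "no")] * 10 +
         [("B", "yes")] * 20 + [("B", "no")] * 20)
chi2, dof = chi_square_statistic(pairs)
print(stats)
print(chi2, dof)
```

The sketch stops at the test statistic. To finish the test, compare it with a Chi-square critical value for the given degrees of freedom (3.841 for one degree of freedom at the 5% level), or obtain a p-value from a statistics package.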
Tips and Suggestions:
1) Choose the right charts for your data.
There are several types of charts and graphs that can be used in EDA, and choosing the right
one depends on the type of data being analyzed and the question being asked.
1. Histograms - Used to display the distribution of a continuous variable, such as age or
income.
2. Scatterplots - Used to visualize the relationship between two continuous variables.
3. Boxplots - Used to display the distribution of a continuous variable by showing the
median, quartiles, and outliers.
4. Bar charts - Used to display the distribution of a categorical variable, such as gender or
location.
5. Pie charts - Used to display the relative proportions of different categories.
6. Line charts - Used to display trends in a continuous variable over time.
7. Heatmaps - Used to display the relationship between two categorical variables, such as
the frequency of purchases by customer segment.
8. Area chart - Used to show how a quantity changes over time, by filling in the area
under a line.
9. Bubble chart - Used to visualize three dimensions of data on a two-dimensional plot,
by showing the relationship between two continuous variables and the size of the points
representing each data point.
10. Stacked bar chart - Used to display the distribution of a categorical variable, but with
each category divided into sub-categories to show the relative proportions of each
sub-category.
11. Waterfall chart - Used to show how an initial value is affected by a series of
intermediate positive or negative values, and how the final value is reached.
12. Radar chart - Used to show the performance of a variable across several categories, by
plotting each category as an axis and showing the values for each variable as a point on
the corresponding axis.
13. Violin plot - Used to display the distribution of a continuous variable, by showing the
density of values at different points along the range of the variable.
14. Sankey diagram - Used to show the flow of data or resources between different
categories, by using arrows of different widths to show the relative amounts flowing
between each category.
A complete guide to types of charts can be found here.
https://chartio.com/learn/charts/
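As an illustration of the first chart type, a histogram amounts to counting values in equal-width bins. The following Python sketch shows that computation and prints a plain-text version. It is a simplified stand-in for what a plotting library would draw; the function names, the bin count, and the data values are all made up for illustration.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

def print_histogram(values, bins=5):
    """Print one row of '#' marks per bin."""
    edges, counts = histogram_counts(values, bins)
    for i, count in enumerate(counts):
        print(f"{edges[i]:8.1f} to {edges[i + 1]:8.1f} | {'#' * count}")

# Made-up illustrative values.
data = [3, 7, 8, 12, 13, 14, 18, 21, 22, 29]
edges, counts = histogram_counts(data, 3)
print_histogram(data, bins=3)
```

The number of bins changes the shape of the picture, so it is worth trying a few values before drawing conclusions about the distribution.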
2) Reduce the number of variables.
To make the project manageable, you should reduce the number of independent variables
to three or four at most. (I would suggest two independent variables for the course project.)
Make sure you choose the most relevant ones.
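One possible way to screen candidate variables, offered here as an assumption rather than a method the instructions prescribe, is to rank numerical candidates by the absolute value of their Pearson correlation with the response and keep the top few. The Python sketch below does this with made-up data; the names `x1`, `x2`, `x3`, and `rank_predictors` are hypothetical. Correlation captures only linear relationships, so subject-matter relevance should still guide the final choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rank_predictors(candidates, response, keep=2):
    """Return the `keep` candidates with the largest absolute correlation."""
    scored = [(name, pearson_r(values, response)) for name, values in candidates.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:keep]

# Made-up illustrative data; variable names are hypothetical.
response = [10, 12, 15, 19, 24]
candidates = {
    "x1": [1, 2, 3, 4, 5],  # rises with the response
    "x2": [2, 5, 1, 4, 3],  # only weakly related to the response
    "x3": [9, 7, 6, 3, 1],  # falls as the response rises
}
top = rank_predictors(candidates, response)
for name, r in top:
    print(f"{name}: r = {r:.3f}")
```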
3) Write a complete data analysis report.
Your project report should include the following:
a) Title page (including the project title, your name, class, group, date)
b) Table of contents
c) Introduction
● Summary of the study and data, as well as any relevant substantive context,
background, or framing issues.
● The "big questions" answered by your data analyses, and summaries of your
conclusions about these questions.
● Brief outline of remainder of paper.
d) Body
The body can be organized in several ways. Here are two that often work well:
● Traditional: includes several sections such as Data, Methods, Analysis, Results.
● Question-oriented: each section in the body is dedicated to answering a specific
question and includes its own Methods, Analysis, Conclusions subsections.
e) Conclusions / Predictions / Discussion
The conclusion should reprise the questions and conclusions of the introduction,
perhaps augmented by some additional observations or details gleaned from the
analysis section. New questions, future work, etc., can also be raised here.
f) Appendix/Appendices (optional).
One or more appendices are the place to put details and ancillary materials. These might
include such items as
• Technical descriptions of (unusual) statistical procedures
• Detailed tables or computer output
• Figures that were not central to the arguments presented in the body of the report