Question

Student note: Using this link for the data set: https://www.kaggle.com/datasets/azminetoushikwasi/aqi-air-quality-index-scheduled-daily-update

I only need the project proposal part done, not the entire project. I need this in around 500 words, single-spaced.

Instructions:

STEVENS INSTITUTE OF TECHNOLOGY
MA 541-B - Spring 2024

Course Project Instructions and Guidelines

The Course Project for MA 541 is a group-based and semester-long project. It is an opportunity for you to integrate and apply what you have learned in the course and to perform an in-depth analysis of your chosen data set. Including the proposal period, you have over ten weeks to complete the project. You will be assigned to a group of three or four students. All team members must make approximately equal contributions to the project to receive the same credit. A Group Performance Assessment form from each team member must be submitted as a reference for grading.

To make sure you are on track to submit the project report on time, several assignments related to the course project require your submissions to Canvas. Their due dates are given below. Please note all submissions are due at 11:00 pm.

Submission                        Due date
Course Project Proposal           February 25, 2024
Draft of project report (PDF)     April 7, 2024
Final report (PDF)                April 21, 2024
Group assessment form             April 21, 2024

Your course project report must be a complete exploratory data analysis. The goal is to gain insights into the data, identify patterns, and test hypotheses that can guide further analysis and support conclusions.

You will choose your own data set that you want to analyze. The data set must include at least 100 values; however, make sure it is manageable. A good number of values for data sets is 200-500. Some places to find data sets are listed below.

https://datasetsearch.research.google.com
https://www.kaggle.com/datasets
https://data.gov
https://datahub.io/collections
https://www.earthdata.nasa.gov
https://apps.who.int/gho/data/node.home
https://www.bfi.org.uk/industry-data-insights
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
https://crime-data-explorer.fr.cloud.gov
https://finance.yahoo.com/lookup?s=DATA

You can use any programming language or tool (Python, R, Excel, Minitab, etc.) to conduct the analysis for this project. All code must be submitted to Canvas as a PDF file that is separate from the project report. Each group will submit one report by uploading it to Canvas. Its grade will be used for the Course Project component in our course.

Group Member Registration:
Please use the following link to sign up for groups. Each group should have only 3-4 members.
MA 541-B Project Groups.xlsx
This registration is due Thursday, February 15, 2024, at 11:00 pm. After this time, if you don't have team members to work with, I will randomly assign you to a group.

Proposal Requirements:
The course project proposal states the problem you will work on and what you plan to do to analyze your dataset. In your project proposal, you will need to include the following.
1) A link to your dataset source.
2) Summary of your data set. (What is it about? Has it been cleaned? How many variables does it have? What types are the variables: qualitative, quantitative, text, numeric, tables, pictures, etc.?)
3) Describe what you plan to do. (You don't need to be specific in this part. It is just a plan; you can freely make changes as you are working on the project.)
4) Suggestion section: If you need any suggestions from me, please put them in this section. If you don't need them, please enter N/A. (Of course, you can always discuss any questions with me if they arise while you are working on the project.)

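The data set summary asked for in item 2 of the proposal requirements (how many variables, what types, whether cleaning is needed) can be sketched in Python, one of the languages the instructions allow. This is only a minimal sketch under stated assumptions: the column names and rows below are made-up placeholders rather than values from the Kaggle AQI file, and the rule used to label a column quantitative or qualitative is a simplification.

```python
import csv
import io

# Hypothetical sample rows for illustration only; they are NOT taken from the
# real data set. Replace io.StringIO(SAMPLE_CSV) with an opened CSV file.
SAMPLE_CSV = """date,country,status,aqi_value
2024-01-01,A,Good,42
2024-01-01,B,Moderate,77
2024-01-02,A,Good,
2024-01-02,B,Unhealthy,160
"""


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize(csv_file):
    """Count rows and, for each column, missing cells and a rough variable type."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        present = [v for v in values if v != ""]
        # Assumption: a column is quantitative if every non-empty cell is numeric.
        numeric = bool(present) and all(is_number(v) for v in present)
        summary[name] = {
            "type": "quantitative" if numeric else "qualitative",
            "missing": len(values) - len(present),
        }
    return len(rows), summary


n_rows, column_summary = summarize(io.StringIO(SAMPLE_CSV))
print(f"{n_rows} rows, {len(column_summary)} variables")
for name, info in column_summary.items():
    print(f"{name}: {info['type']}, {info['missing']} missing")
```

A count of missing cells per column is a quick way to answer "Has it been cleaned?" in the proposal; any column with missing values would need attention in the data cleaning step.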
Project Guidelines:
Your project should include the main steps of an exploratory data analysis as follows.
1) Data sourcing: introduce the source of your data set and summarize it. (What is it about? Has it been cleaned? How many variables does it have? What types are the variables: qualitative, quantitative, text, numeric, tables, pictures, etc.?)
2) Data cleaning: If your data set does not require cleaning, state so and then skip this step. If it requires cleaning, include what you have done to clean it (identify and correct errors, missing values, duplicate values, outliers, inconsistent data, data formatting issues, etc.).
3) Data Visualization and Summary Statistics: This is where you use descriptive statistics to explore and get quick insights into the data.
   a) Data Visualization: make use of graphs and charts to explore data. Suggestions are pie charts, bar graphs, line graphs, histograms. Make sure you include conclusions (trends, percentages, comparisons, outliers, etc.) at the end of this part.
   b) Summary Statistics: the measures of center (mean, median, mode), measures of dispersion (range, variance, standard deviation) and measures of location (percentiles, 5-number summary) can help you understand the data set and identify patterns or unusual values. You can also draw box plots and use them to compare different groups or characteristics in data.
4) Statistical Inference: This is where you use statistical methods to explore and analyze data. Some things you may include are: check for types of distributions, estimate parameters (using different methods), test hypotheses, perform correlation analysis, (one variable and multivariable) regressions, factor analysis, sample size determination, classification, etc. The goal is to make conclusions and predictions. It includes two main parts.
   a) Univariate Analysis: identify variables in your data (you should focus on several main variables) and explore them. Use graphs (histograms, QQ plots, normal probability plots) to get some insights about the distribution of the data. Then estimate parameters, test hypotheses, and make conclusions.
   b) Multivariate Analysis: Determine the type of the variables in your data to use the appropriate methods.
      ● Categorical vs Categorical variables: Chi-square test of independence.
      ● Categorical vs Numerical variables: scatter plot, contingency table, T-test, Chi-square test, ANOVA.
      ● Numerical vs Numerical variables: scatter plot, simple linear regression, ANOVA, correlation coefficient, etc.

Please note you don't need to include all methods/techniques in your report. Explore the data, see which characteristics are worth studying further, and focus on them.

Your project report should be between 12 and 15 pages (not including the code). Please try to have a transparent analysis rather than a lengthy one.

Tips and Suggestions:
1) Choose the right charts for your data. There are several types of charts and graphs that can be used in EDA, and choosing the right one depends on the type of data being analyzed and the question being asked.
   1. Histograms: used to display the distribution of a continuous variable, such as age or income.
   2. Scatterplots: used to visualize the relationship between two continuous variables.
   3. Boxplots: used to display the distribution of a continuous variable by showing the median, quartiles, and outliers.
   4. Bar charts: used to display the distribution of a categorical variable, such as gender or location.
   5. Pie charts: used to display the relative proportions of different categories.
   6. Line charts: used to display trends in a continuous variable over time.
   7. Heatmaps: used to display the relationship between two categorical variables, such as the frequency of purchases by customer segment.
   8. Area chart: used to show how a quantity changes over time, by filling in the area under a line.
   9. Bubble chart: used to visualize three dimensions of data on a two-dimensional plot, by showing the relationship between two continuous variables, with the size of each point representing a third variable.
   10. Stacked bar chart: used to display the distribution of a categorical variable, but with each category divided into sub-categories to show the relative proportions of each sub-category.
   11. Waterfall chart: used to show how an initial value is affected by a series of intermediate positive or negative values, and how the final value is reached.
   12. Radar chart: used to show the performance of a variable across several categories, by plotting each category as an axis and showing the values for each variable as a point on the corresponding axis.
   13. Violin plot: used to display the distribution of a continuous variable, by showing the density of values at different points along the range of the variable.
   14. Sankey diagram: used to show the flow of data or resources between different categories, by using arrows of different widths to show the relative amounts flowing between each category.
   A complete guide to types of charts can be found here: https://chartio.com/learn/charts/
2) Reduce the number of variables. To make the project manageable, you should reduce the number of independent variables to three or four at most. (I would suggest two independent variables for the course project.) Make sure you choose the most relevant ones.
3) Write a complete data analysis report. Your project report should include the following:
   a) Title page (including the project title, your name, class, group, date)
   b) Table of contents
   c) Introduction
      ● Summary of the study and data, as well as any relevant substantive context, background, or framing issues.
      ● The "big questions" answered by your data analyses, and summaries of your conclusions about these questions.
      ● Brief outline of the remainder of the paper.
   d) Body
      The body can be organized in several ways. Here are two that often work well:
      ● Traditional: includes several sections such as Data, Methods, Analysis, Results.
      ● Question-oriented: each section in the body is dedicated to answering a specific question and includes its own Methods, Analysis, Conclusions subsections.
   e) Conclusions / Predictions / Discussion
      The conclusion should reprise the questions and conclusions of the introduction, perhaps augmented by some additional observations or details gleaned from the analysis section. New questions, future work, etc., can also be raised here.
   f) Appendix/Appendices (optional)
      One or more appendices are the place to put details and ancillary materials. These might include such items as
      • Technical descriptions of (unusual) statistical procedures
      • Detailed tables or computer output
      • Figures that were not central to the arguments presented in the body of the report
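
The summary statistics named in step 3b of the project guidelines (measures of center, dispersion, and location) can be sketched with Python's standard `statistics` module. This is an illustrative sketch, not a required method: the readings below are made-up numbers, not real AQI values, and the 1.5 × IQR rule at the end is the usual box plot convention for flagging possible outliers rather than something the instructions prescribe.

```python
import statistics

# Hypothetical readings, made up for illustration only.
values = [42, 55, 61, 61, 70, 77, 88, 95, 160]


def summary_statistics(data):
    """Measures of center, dispersion, and location for a numeric sample."""
    ordered = sorted(data)
    # method="inclusive" interpolates between the sample minimum and maximum;
    # the default "exclusive" method can give different quartiles.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "range": ordered[-1] - ordered[0],
        "variance": statistics.variance(ordered),  # sample variance (n - 1)
        "stdev": statistics.stdev(ordered),        # sample standard deviation
        "five_number": (ordered[0], q1, q2, q3, ordered[-1]),
    }


stats = summary_statistics(values)
_, q1, _, q3, _ = stats["five_number"]
iqr = q3 - q1

# Flag points beyond 1.5 * IQR from the quartiles as possible outliers.
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

for name, value in stats.items():
    print(f"{name}: {value}")
print(f"possible outliers: {outliers}")
```

Comparing the mean with the median gives a quick hint about skew: in this made-up sample the single large reading pulls the mean above the median, which is the kind of observation the guidelines ask you to report.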

Fig: 1