tutorbin

data mining homework help

Boost your journey with 24/7 access to skilled experts, offering unmatched data mining homework help


Trusted by 1.1 M+ Happy Students

Place An Order and save time
*Get instant homework help from top tutors—just a WhatsApp message away. 24/7 support for all your academic needs!

Recently Asked data mining Questions

Expert help when you need it
  • Q1:Programming Assignment Explanation • Fortune Cookie Classifier¹ You will build a binary fortune cookie classifier. This classifier will be used to classify fortune cookie messages into two classes: messages that predict what will happen in the future (class 1) and messages that just contain a wise saying (class 0). For example, "Never go in against a Sicilian when death is on the line" would be a message in class 0. "You will get an A in Machine learning class" would be a message in class 1. Files Provided There are three sets of files. All words in these files are lower case and punctuation has been removed. 1) The training data: traindata.txt: This is the training data consisting of fortune cookie messages. trainlabels.txt: This file contains the class labels for the training data. 2) The testing data: testdata.txt: This is the testing data consisting of fortune cookie messages. testlabels.txt: This file contains the class labels for the testing data.See Answer
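A minimal sketch of one way to approach Q1 above, using a bag-of-words representation. The assignment's files (traindata.txt, trainlabels.txt, etc.) are replaced here by a few in-memory example messages, and naive Bayes is just one baseline choice; the assignment does not prescribe a particular classifier.

```python
# Sketch of a bag-of-words fortune cookie classifier (class 1 = prediction,
# class 0 = wise saying). In the real assignment you would read the messages
# from traindata.txt and the labels from trainlabels.txt instead.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs = [
    "you will get an a in machine learning class",               # class 1
    "a journey of a thousand miles begins today",                # class 1
    "never go in against a sicilian when death is on the line",  # class 0
    "a wise man listens more than he speaks",                    # class 0
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
X_train = vec.fit_transform(train_msgs)   # word-count features

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# An unseen future-predicting message should land in class 1.
pred = clf.predict(vec.transform(["you will pass the exam"]))
```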
  • Q2:Q2. (10 points) Consider the following setting. You are provided with n training examples: (x₁, y₁, h₁), (x₂, y₂, h₂), …, (xₙ, yₙ, hₙ), where xᵢ is the input example, yᵢ is the class label (+1 or -1), and hᵢ > 0 is the importance weight of the example. The teacher gave you some additional information by specifying the importance of each training example. How will you modify the perceptron algorithm to be able to leverage this extra information? Please justify your answer.See Answer
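One standard answer to Q2 is to scale each mistake-driven perceptron update by the example's importance weight, so heavily weighted examples move the decision boundary more. A sketch, with an illustrative toy dataset:

```python
import numpy as np

def weighted_perceptron(X, y, h, epochs=20):
    """Perceptron whose mistake-driven update is scaled by the
    importance weight h[i] (one natural modification for Q2)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi, hi in zip(X, y, h):
            if yi * (xi @ w + b) <= 0:   # mistake (or on the boundary)
                w += hi * yi * xi        # importance-weighted update
                b += hi * yi
    return w, b

# Tiny linearly separable example; the weights h are illustrative.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
h = np.array([1.0, 2.0, 1.0, 0.5])
w, b = weighted_perceptron(X, y, h)
```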
  • Q3:Q4. You were just hired by MetaMind. MetaMind is expanding rapidly, and you decide to use your machine learning skills to assist them in their attempts to hire the best. To do so, you have the following available to you for each candidate i in the pool of candidates I: (i) Their GPA, (ii) Whether they took Data Mining course and achieved an A, (iii) Whether they took Algorithms course and achieved an A, (iv) Whether they have a job offer from Google, (v) Whether they have a job offer from Facebook, (vi) The number of misspelled words on their resume. You decide to represent each candidate i ∈ I by a corresponding 6-dimensional feature vector f(i). You believe that if you just knew the right weight vector w ∈ R⁶ you could reliably predict the quality of a candidate i by computing w · f(i). To determine w your boss lets you sample pairs of candidates from the pool. For a pair of candidates (k, l) you can have them face off in a "DataMining-fight." The result is score(k > l), which tells you that candidate k is at least score(k > l) better than candidate l. Note that the score will be negative when l is a better candidate than k. Assume you collected scores for a set of pairs of candidates P. Describe how you could use a perceptron based algorithm to learn the weight vector w. Make sure to describe the basic intuition; how the weight updates will be done; and pseudo-code for the entire algorithm.See Answer
  • Q4:Please create a K-means Clustering and Hierarchical Clustering with the line of code provided. The line of code should include a merger of the excel files. The excel files will also be provided See Answer
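For Q4, a sketch of running both clusterings on a merger of two files with pandas and scikit-learn. The data below is an invented stand-in; with the real Excel files you would load each one with pd.read_excel before concatenating.

```python
# Sketch for Q4: merge two sources, then run K-means and hierarchical
# clustering. With real files: df1 = pd.read_excel("file1.xlsx"), etc.
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering

df1 = pd.DataFrame({"x": [1.0, 1.2, 0.8], "y": [1.0, 0.9, 1.1]})
df2 = pd.DataFrame({"x": [8.0, 8.3, 7.9], "y": [8.1, 8.0, 7.8]})
data = pd.concat([df1, df2], ignore_index=True)   # the "merger" of the files

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
hc = AgglomerativeClustering(n_clusters=2).fit(data)   # Ward linkage by default
```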
  • Q5:Discussion - Data Mining, Text Mining, and Sentiment Analysis Explain the relationship between data mining, text mining, and sentiment analysis. Provide situations where you would use each of the three techniques. Respond to the following in a minimum of 230 words:See Answer
  • Q6:Assignment #3: DBSCAN, OPTICS, and Clustering Evaluation 1. If Epsilon is 2 and minpoint is 2 (including the centroid itself), what are the clusters that DBScan would discover with the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). Use the Euclidean distance. Draw the 10 by 10 space and illustrate the discovered clusters. What if Epsilon is increased to sqrt(10)? (30 pts)See Answer
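The hand calculation in Q6 can be cross-checked with scikit-learn's DBSCAN, whose min_samples counts the point itself (matching the question's convention) and which marks noise points with the label -1:

```python
# Cross-check of the Q6 DBSCAN answer with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)])   # A1..A8

labels_eps2 = DBSCAN(eps=2, min_samples=2).fit_predict(points)
labels_eps_sqrt10 = DBSCAN(eps=np.sqrt(10), min_samples=2).fit_predict(points)
```

With eps = 2 this yields clusters {A3, A5, A6} and {A4, A8}, with A1, A2 and A7 as noise; raising eps to sqrt(10) merges A1 into {A4, A8} and makes {A2, A7} a cluster. Points at distance exactly eps are included in the neighborhood, though such exact ties can be sensitive to floating point in general.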
  • Q7: 2. Use OPTICS algorithm to output the reachability distance and the cluster ordering for the dataset provided, starting from Instance 1. Use the following parameters for discovering the cluster ordering: minPts = 2 and epsilon = 2. Use epsilonprime = 1.2 to generate clusters from the cluster ordering and their reachability distance. Don't forget to record the core distance of a data point if it has a dense neighborhood. You don't need to include the core distance in your result but you may need to use them in generating clusters. (45 pts) [Dataset visualization: scatter plot of the instances; figure omitted.] Below are the first few lines of the calculation. You need to complete the remaining lines and generate clusters based on the given epsilonprime value:
Instance       (X, Y)   Reachability Distance
Instance 1:    (1, 1)   Undefined (or infinity)
Instance 2:    (0, 1)   1.0
Instance 3:    (1, 0)   1.0
Instance 16:   (5, 9)   Undefined
Instance 13:   (9, 2)   Undefined
Instance 12:   (8, 2)   1See Answer
  • Q8:3. Use F-measure and the Pairwise measures (TP, FN, FP, TN) to measure the agreement between a clustering result (C1, C2, C3) and the ground truth partitions (T1, T2, T3) as shown below. Show details of your calculation. (25 pts) [Contingency table of clusters C1-C3 against ground-truth partitions T1-T3 omitted.]See Answer
  • Q9:1. We will use the Flower classification dataset: a. https://www.kaggle.com/competitions/tpu-getting-started 2. Your goal is improving the average accuracy of classification. a. You SHOULD use Google Colab as the main computing environment. (Using Kaggle is okay.) b. You SHOULD create a GitHub repository for the source code. i. Put a readme file with execution instructions. c. You SHOULD explain your source code in the blog. d. Try experimenting with various hyperparameters: i. Network topology: 1. Number of neurons per layer (for example, 100 x 200 x 100, 200 x 300 x 100, ...) 2. Number of layers (for example, 2 vs 3 vs 4, ...) 3. Shape of conv2d. ii. While doing experiments, make sure you record your performance so that you can create a bar chart of the performance. iii. An additional graph idea might be a training time comparison. iv. Do some research on ideas for improving this. e. You can refer to code or tutorials on the internet. But the main question you have to answer is what improvement you made over the existing reference. i. Make sure it is very clear which lines of code are yours or not. When you copy source code, add a reference. 3. Documentation is half of your work. Write a good blog post for your work and a step-by-step how-to guide. a. A good example is https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/ 4. Add references: a. Add a citation number in the contents and put the reference in a separate reference section.See Answer
  • Q10:This tutorial will guide you how to do homework in this course. 1. Go to https://www.kaggle.com/c/titanic and follow the walkthrough at https://www.kaggle.com/alexisbcook/titanic-tutorial 2. Submit your result to the Kaggle challenge. 3. Post your Jupyter notebook to your homepage as a blog post. A good example of a blog post is https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/ 4. Submit your homepage link and a screenshot PDF in Canvas. 5. Doing 1-4 will give you 8 points. To get the additional 2 points, create a section called "Contribution" and try to improve the performance. I expect one or two paragraphs minimum (the longer the better). Show the original score and improved score.See Answer
  • Q11:1. Use the insurance fraud dataset. Consider the data quality issues (e.g., missing data) and preprocess the data. Split the data into a 10% train and 90% test set using random_state = 1. Create a decision tree with a max depth of 3 using the gini measure. Print the accuracy on the test set and the tree. Is this a good approach? Why or why not? 2. Create a decision tree on the same data with a max depth of 3 and the entropy measure. Does the accuracy change? Does the tree change? Discuss which measure you think is better. 3. Now split the data into 70% train and 30% test using random_state = 1. Redo 1 and 2. Have the trees and accuracy changed? Are the trees more or less similar now? Discuss which split you think is better and why. 4. Evaluate how the accuracy changes with the depth of the tree with the 70-30 data. Look at the accuracy for a max depth of 1, 2, 3, ... 10, 15, 20. Plot the curve of how the accuracy changes. Do you see underfitting? Do you see overfitting? 5. What variable provides the most information gain in the insurance fraud data (for the 70-30 split)? 6. Decision trees are a "white box" method. What do you observe about the insurance fraud data using decision trees?See Answer
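A sketch of the setup in parts 1-2 of Q11. The insurance fraud dataset is not reproduced here, so a synthetic stand-in from make_classification is used; the split sizes and parameters (max_depth=3, random_state=1) follow the question.

```python
# Gini vs entropy decision trees on a 10% train / 90% test split,
# with synthetic data standing in for the insurance fraud dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
# 10% train leaves very little data to learn from -- part of the point of Q1.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=1)

acc = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(max_depth=3, criterion=criterion,
                                  random_state=1).fit(X_tr, y_tr)
    acc[criterion] = accuracy_score(y_te, tree.predict(X_te))
```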
  • Q12:Natural Language Processing Project Welcome to the NLP Project for this section of the course. In this NLP project you will be attempting to classify Yelp Reviews into 1 star or 5 star categories based off the text content in the reviews. This will be a simpler procedure than the lecture, since we will utilize the pipeline methods for more complex tasks. We will use the Yelp Review Data Set from Kaggle. Each observation in this dataset is a review of a particular business by a particular user. The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review. The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business. The "useful" and "funny" columns are similar to the "cool" column. Let's get started! Just follow the directions below!
5. Use the corr() method on that groupby dataframe to produce this dataframe: (5 points)
                 cool    useful     funny  text length
cool         1.000000 -0.743329 -0.944939    -0.857664
useful      -0.743329  1.000000  0.894506     0.699881
funny       -0.944939  0.894506  1.000000     0.843461
text length -0.857664  0.699881  0.843461     1.000000
NLP Classification Task. Let's move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.
6. Create a dataframe called yelp_class that contains the columns of the yelp dataframe, but only for the 1 or 5 star reviews. (5 points)
7. Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class. (Your features and target/labels) (10 points)
9. Use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column). Save this result by overwriting X: X = cv.fit_transform(X) (10 points)
Train Test Split. Let's split our data into training and testing data.
10. Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101 (10 points)
Training a Model. Time to train a model!
11. Import DecisionTreeClassifier, create an instance of the estimator and call it tree, then fit the model using the training set. (10 points)See Answer
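A compressed, runnable version of the Q12 pipeline (steps 9-11), with a handful of invented toy reviews standing in for the Yelp data; test_size and random_state follow the assignment.

```python
# Minimal CountVectorizer -> train_test_split -> DecisionTreeClassifier
# pipeline, as in Q12. The reviews below are toy stand-ins for Yelp data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_text = ["great food and friendly staff", "loved it amazing place",
          "terrible service never again", "awful food cold and late",
          "amazing experience will return", "worst meal of my life"] * 5
y = [5, 5, 1, 1, 5, 1] * 5          # star ratings (1 or 5 only)

cv = CountVectorizer()
X = cv.fit_transform(X_text)         # step 9: overwrite X with count features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)   # step 10

tree = DecisionTreeClassifier().fit(X_train, y_train)   # step 11
score = tree.score(X_test, y_test)
```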
  • Q13:Task - Correlation Rule Extraction and Outlier Detection. Question 1. Suppose we have ten product codes (from 0 to 9) and the following table of transactions:
Identifier   Products
1            0, 1, 3, 4
2            1, 2, 3
3            1, 2, 4, 5
4            1, 3, 4, 5
5            2, 3, 4, 5
6            2, 4, 5
7            3, A
8            1, 2, 3
9            1, 4, 5
10           B, 4
where A and B are the penultimate and last digits of your registration number divided by 2 and rounded to the nearest unit. That is, if your registration number ends in 12, then you will set A = 1/2 = 0.5, that is 1, and B = 2/2 = 1. But if the penultimate digit of your registration number is either 5 or 6, you will set A = 4. Accordingly, if the last digit of your registration number is either 3 or 4, you will set B = 2. Please answer the following questions, showing your calculations: 1. Compute the support of the sets {4, 5} and {3, 5}. 2. Compute the support and confidence of the rules {4, 5} → {3} and {3, 5} → {2}. 3. Run the Apriori algorithm with the F_{k-1} x F_{k-1} method for a support threshold equal to 2 and report which frequent sets you find at each step.See Answer
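Support and confidence in Q13 reduce to simple counting. Generic helpers, demonstrated on the fully specified transactions only (the placeholder items A and B in transactions 7 and 10 depend on the student's registration number, so they are left out of the demo and the counts below would change once they are filled in):

```python
# Support and confidence helpers for association rules (Q13).
def support_count(transactions, itemset):
    """Number of transactions containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def confidence(transactions, antecedent, consequent):
    """conf(A -> B) = supp(A ∪ B) / supp(A)."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    ante = support_count(transactions, antecedent)
    return both / ante if ante else 0.0

# Transactions 1-6, 8, 9 from the question (7 and 10 omitted: they
# contain the registration-number placeholders A and B).
transactions = [
    {0, 1, 3, 4}, {1, 2, 3}, {1, 2, 4, 5}, {1, 3, 4, 5},
    {2, 3, 4, 5}, {2, 4, 5}, {1, 2, 3}, {1, 4, 5},
]
s45 = support_count(transactions, {4, 5})      # transactions containing 4 and 5
conf_45_3 = confidence(transactions, {4, 5}, {3})
```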
  • Q14:Question 2. Suppose we have the following one-dimensional data set, consisting of 10 points: x1 = 59, x2 = 30, x3 = 64, x4 = 72, x5 = 35, x6 = 21, x7 = 66, x8 = 53, x9 = 56, x10 = A, where A is the last two digits of your registration number (i.e. if your registration number ends in 12 you will set A = 12). If the last two digits of your registration number are already present in the data set you will set A = 47. Detect outliers using: 1. The Grubbs test. 2. The one-nearest-neighbor technique, using as distance function the absolute value of the difference between two points. 3. The relative density for two nearest neighbors, using as distance function the absolute value of the difference between two points.See Answer
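For part 1 of Q14, the Grubbs statistic is G = max|xᵢ - x̄| / s, compared against a critical value derived from the t distribution. A sketch; the tenth value depends on the registration number, so the fallback value 47 mentioned in the question is used purely for illustration:

```python
# Two-sided Grubbs test sketch for the Q14 data set.
import numpy as np
from scipy import stats

def grubbs_statistic(x):
    x = np.asarray(x, dtype=float)
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

def grubbs_critical(n, alpha=0.05):
    # Two-sided critical value built from the Student t distribution.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

x = [59, 30, 64, 72, 35, 21, 66, 53, 56, 47]   # 47 = illustrative x10
G = grubbs_statistic(x)
is_outlier = G > grubbs_critical(len(x))
```

With these values G ≈ 1.75 (attained by x6 = 21), which is below the 5% critical value of about 2.29 for n = 10, so no point is flagged.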
  • Q15:ASSIGNMENT: Project 2 A trojan horse traffic dataset is provided alongside the test dataset in the following link: https://www.kaggle.com/datasets/neelabhkshetry/trojan-classification The dataset has 85 features with 1 target column "Class". This is a binary classification problem with the classes being "Trojan" or "Benign". You will perform all your tasks on the training dataset, which has almost 160k rows and an output class that is divided almost equally. Your task is to build classification models to predict the "Class" column. First, become familiar with the data set, and research some background information on trojan horse detection in a network. 1. First, research some background information on applying data mining in related topics by reviewing published research papers (minimum 3!) on similar topics. Make sure each review is at least one paragraph and you cite your sources (example format see appendix). 2. Go through the detailed process of data preparation on train_data.csv. Apply all preprocessing and data reduction techniques you assume are necessary and explain why. For every preprocessing technique: a. Explain what the technique is; b. Explain how it is applied; c. Show the summary results of the preprocessing; d. Do not include raw code or raw output or raw screen capture. Perform dimensionality reduction in at least two steps (once by applying your understanding of the features and once by using feature selection or extraction methods). 3. When the data set is of sufficient quality, apply several predictive techniques (minimum 3 techniques!) and create appropriate predictive models. For every predictive technique you applied: a. Explain what the technique is and how the technique works; b. Explain what the parameters of the technique are and how the parameters are chosen and tuned; c. Explain and discuss the predictive results and performance of the technique. Analyze different aspects of the result, including but not limited to ROC curve, F score, accuracy, etc.; d. Do not include raw code or raw output or raw screen capture. 4. After you have built three predictive models, test your models so that you can compare the three data mining techniques you've chosen. You should include the following in the comparison discussion: a. The performance comparison among your techniques; b. Visualization or table showing the performance differences. Make sure you explain and discuss the visualization or table; c. Explain the probable reason behind the performance differences; d. Explain which technique is the best for the dataset. Comparison of data mining techniques (and obtained predictive models) with additional discussion and interpretation of results will be a very important part of your report. Do not include raw code or raw output. A conclusion section will be used to summarize your findings throughout the report. Give examples of data samples, and discuss how you would use your model on your samples and what results can be expected from your model. The final grade will be 90% report and 10% class competition, explained next. A separate competition dataset (test_data.csv) will be provided for additional evaluation. We will compare your prediction results against the true label for additional unbiased evaluation as a component of your project. Up to 2 points in the final grade of the project will depend upon your model's performance with this dataset. The performance is calculated using Area under ROC. The entire class score will be calculated and ranked. If you achieve the top 20% of scores compared to your peers, you get the full 2 points. The grade distribution is the following:
Score Percentile   Score Range
Top 20%            2
50% - 80%          1.5 - 2
20% - 50%          1 - 1.5
10% - 20%          0.5 - 1
0% - 10%           0 - 0.5
Please read "Required Structure for the Project 2 Documentation", "Required Structure for the Test Data Result" and "Required Materials for Project 2 Submission" below. See Answer
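Project 2 above grades models by area under the ROC curve. A sketch of comparing two candidate classifiers that way with scikit-learn; synthetic data stands in for the trojan traffic set, and the two model choices are illustrative, not prescribed:

```python
# Compare two classifiers by ROC AUC, as in the Project 2 competition.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

aucs = {}
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]   # AUC needs scores, not labels
    aucs[name] = roc_auc_score(y_te, scores)
```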
  • Q16:1. If Epsilon is 2 and minpoint is 2 (including the centroid itself), what are the clusters that DBScan would discover with the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). Use the Euclidean distance. Draw the 10 by 10 space and illustrate the discovered clusters. What if Epsilon is increased to sqrt(10)? (30 pts)See Answer
  • Q17:2. Use OPTICS algorithm to output the reachability distance and the cluster ordering for the dataset provided, starting from Instance 1. Use the following parameters for discovering the cluster ordering: minPts = 2 and epsilon = 2. Use epsilonprime = 1.2 to generate clusters from the cluster ordering and their reachability distance. Don't forget to record the core distance of a data point if it has a dense neighborhood. You don't need to include the core distance in your result but you may need to use them in generating clusters. (45 pts) [Dataset visualization figure omitted.] Below are the first few lines of the calculation. You need to complete the remaining lines and generate clusters based on the given epsilonprime value:
Instance       (X, Y)   Reachability Distance
Instance 1:    (1, 1)   Undefined (or infinity)
Instance 2:    (0, 1)   1.0
Instance 3:    (1, 0)   1.0
Instance 16:   (5, 9)   UndefinedSee Answer
  • Q18:3. Use F-measure and the Pairwise measures (TP, FN, FP, TN) to measure the agreement between a clustering result (C1, C2, C3) and the ground truth partitions (T1, T2, T3) as shown below. Show details of your calculation. (25 pts) [Contingency table of clusters C1-C3 against ground-truth partitions T1-T3 omitted.]See Answer
  • Q19: Each student is required to submit (1) the Orange workflow file or Python/R script files (s)he created using the format HW1PID.OWS (.py and .r respectively for Python/R) where PID is your PID number, and (2) a PDF file with answers using the format HW3PID.pdf. All files listed below were posted on the Canvas sub-folder Homeworks of the Data sets folder. Homework III - Questions. The purpose of this homework is to use textual analysis of WSJ news in predicting financial market outcomes. In particular we will rely on a data set measuring the state of the economy via textual analysis of business news. From the full text content of 800,000 Wall Street Journal articles for 1984-2017, Bybee et al. estimate a model that summarizes business news as easily interpretable topical themes and quantifies the proportion of news attention allocated to each theme at each point in time. These news attention estimates are inputs into the models we want to estimate. The data source is described in http://structureofnews.com/. The data was standardized and prepared for this assignment. Please use the data file ML_MBA_UNC_Processed.xlxs from the HW3 material folder (sub-folder of the Assignments folder) on CANVAS (it is also posted in the Data Sets folder - Homework sub-folder). Click on the News Taxonomy tab of the aforementioned website and you will find a taxonomy of news themes in The Wall Street Journal. Estimated with hierarchical agglomerative clustering, the dendrogram illustrates how 180 topics cluster into an intuitive hierarchy of increasingly broad metatopics. 
The list of topics is reproduced below - further details appear on the website: Natural disasters, Internet, Soft drinks, Mobile devices, Profits, M&A, Changes, Police / crime, Research, Executive pay, Mid- size cities, Scenario analysis, Economic ideology, Middle east, Savings & loans, IPOs, Restraint, Electronics, Record high, Connecticut, Steel, Bond yields, Small business, Cable, Fast food, Disease, Activists, Competition, Music industry, Short sales, Nonperforming loans, Key role, News conference, US defense, Political contributions, Revised estimate, Economic growth, Justice Department, Credit ratings, Broadcasting, Problems, Announce plan, Federal Reserve, Job cuts, Chemicals / paper, Regulation, Environment, Small caps, Unions, C-suite, Control stakes, Mutual funds, Venture capital, European sovereign debt, Mining, Company spokesperson, Private / public sector, Pharma, Schools, Russia, Programs / initiatives, Health insurance, Drexel, Trade agreements, Treasury bonds, Challenges, People familiar, Sales call, Publishing, Financial crisis, Aerospace / defense, Recession, Latin America, Cultural life, SEC, Earnings losses, Phone companies, Computers, Marketing, Japan, Nuclear / North Korea, NY politics, Tobacco, Product prices, Biology / chemistry / physics, Movie industry, Automotive, Machinery, Bankruptcy, Arts, International exchanges, Accounting, Space program, Immigration, Small changes, Small possibility, Agreement reached, Oil drilling, Rail / trucking / shipping, Indictments, Positive sentiment, Canada / South Africa, Airlines, California, Corporate governance, China, Investment banking, Spring/summer, Software, Pensions, Humor / language, Systems, Clintons, Major concerns, Mid-level executives, US Senate, Agriculture, Bank loans, Takeovers, State politics, Real estate, Futures / indices, Southeast Asia, Optimism, Corrections / amplifications, Government budgets, Exchanges / composites, Currencies / metals, Mortgages, Financial reports, Germany, 
Rental properties, Committees, Subsidiaries, Management changes, Share payouts, France / Italy, Acquired investment banks, Credit cards, Bear / bull market, Earnings forecasts, Terrorism, Watchdogs, Oil market, Couriers, Commodities, Utilities, Foods / consumer goods, Convertible / preferred, Macroeconomic data, Courts, Safety administrations, Reagan, Bush / Obama / Trump, Fees, Gender issues, Trading activity, Microchips, Insurance, Earnings, Luxury / beverages, Iraq, National security, Buffett, Taxes, Options / VIX, Casinos, Elections, Private equity / hedge funds, Negotiations, European politics, Size, NASD, Mexico, Retail, Long / short term, Wide range, Lawsuits, UK, Revenue growth. 1.a [4 points] For the exercise below select ALL features EXCEPT FUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: SGNFUTSP500. Your first task is to predict the direction of the market (S&P 500 index) over the next month, i.e. SGNFUTSP500 up or down - which is a classification prediction problem - using the importance of news topics. Use the logistic regression and neural network widgets of the Orange software to compute the following models:
• logistic regression with LASSO regularization with C = 0.007
• logistic regression with LASSO regularization with C = 0.80
• neural network with 100 hidden-layer neurons, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5
You should get something along the following lines with 10-fold cross-validation:
Model                        AUC     CA
Logistic LASSO C = 0.80      0.583   0.649
Neural Net                   0.566   0.619
Logistic LASSO C = 0.007     0.500   0.371
Explain why the Logistic regression with LASSO C = 0.007 does so poorly (hint: look at the coefficients of the model). When you look at the ROC curve, explain the curve for the LASSO C = 0.007 model. 1.b [4 points] List the top ten features which have the most negative impact on next month's market direction and the top ten with the most positive impact. 
2.a [4 points] We are turning now to a linear regression instead of a classification problem, predicting actual market returns (continuous) rather than direction (binary). For the exercise below select ALL features EXCEPT SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: FUTSP500. Use the linear regression and neural net widgets of the Orange software to compute the following models:
• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the l₂ regularization. Note that the logistic regression and linear regression widgets use a different way of writing the penalty function (although there is a mapping between C and α via something called the Lagrangian multiplier, not covered in the course).
• neural network with 100 hidden-layer neurons, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5
The Test and Score widget now records MSE, RMSE, MAE, and R2. What do you learn from the R2 results? 2.b [4 points] In the previous case we were trying to predict the return next month with current news. Now, we will try to explain current returns with current news. For the exercise below select ALL features EXCEPT FUTSP500, SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: SP500. Use the linear regression widget of the Orange software to compute the following models:
• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the l₂ regularization. Recall that the logistic regression and linear regression widgets use a different way of writing the penalty function (although there is a mapping between the role of C and α via something called the Lagrangian multiplier, not covered in the course).
• neural network with 100 hidden-layer neurons, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5
The Test and Score widget now records MSE, RMSE, MAE, and R2. 
What do you learn from the R² results?

2.c [2 points] List the top ten features which have the most negative impact on next month's market returns and the top ten with the most positive impact. How is it different from the results in 1.b?

2.d [2 points] Explain the difference between your answers in 1.a, 2.a and 2.b.
2. In PC Tech's product mix problem, assume there is another PC model, the VXP, that the company can produce in addition to Basics and XPs. Each VXP requires eight hours for assembling, three hours for testing, $275 for component parts, and sells for $560. At most 50 VXPs can be sold.
a. Modify the spreadsheet model to include this new product, and use Solver to find the optimal product mix.

4. Again continuing Problem 2, suppose that you want to force the optimal solution to be integers. Do this in Solver by adding a new constraint. Select the decision variable cells for the left side of the constraint, and in the middle dropdown list, select the "int" option. How does the optimal integer solution compare to the optimal noninteger solution in Problem 2? Are the decision variable cell values rounded versions of those in Problem 2? Is the objective value more or less than in Problem 2?

26. A furniture company manufactures desks and chairs. Each desk uses four units of wood, and each chair uses three units of wood. A desk contributes $250 to profit, and a chair contributes $145. Marketing restrictions require that the number of chairs produced be at least four times the number of desks produced. There are 2000 units of wood available.
a. Use Solver to maximize the company's profit.

46. During each four-hour period, the Smalltown police force requires the following number of on-duty police officers: four from midnight to 4 A.M.; four from 4 A.M. to 8 A.M.; seven from 8 A.M. to noon; seven from noon to 4 P.M.; eight from 4 P.M. to 8 P.M.; and ten from 8 P.M. to midnight. Each police officer works two consecutive four-hour shifts.
a. Determine how to minimize the number of police officers needed to meet Smalltown's daily requirements.

66. United Steel manufactures two types of steel at three different steel mills. During a given month, each steel mill has 240 hours of blast furnace time available.
Because of differences in the furnaces at each mill, the time and cost to produce a ton of steel differ for each mill, as listed in the file P04_66.xlsx. Each month, the company must manufacture at least 700 tons of steel 1 and 600 tons of steel 2. Determine how United Steel can minimize the cost of manufacturing the desired steel.
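The furniture problem (26) is small enough to verify outside a spreadsheet. A sketch using scipy.optimize.linprog (the variable names are mine; linprog minimizes, so the profit objective is negated):

```python
from scipy.optimize import linprog

# Decision variables: x[0] = desks, x[1] = chairs.
# Maximize 250*desks + 145*chairs  ->  minimize the negated profit.
c = [-250, -145]

# Wood:      4*desks + 3*chairs <= 2000
# Marketing: chairs >= 4*desks, rewritten as 4*desks - chairs <= 0
A_ub = [[4, 3],
        [4, -1]]
b_ub = [2000, 0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
desks, chairs = res.x
print("desks = %.1f, chairs = %.1f, profit = $%.2f"
      % (desks, chairs, -res.fun))
# -> desks = 125.0, chairs = 500.0, profit = $103750.00
```

This is the same model a Solver setup would use: decision cells, one objective cell, and the two ≤ constraints; adding an "int" constraint on the decision cells (as in Problem 4) turns it into the integer variant.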

TutorBin Testimonials

I found TutorBin Data Mining homework help when I was struggling with complex concepts. Experts provided step-wise explanations and examples to help me understand concepts clearly.

Rick Jordon

5

TutorBin experts resolve your doubts without making you wait long. They are responsive and available 24/7 whenever you need Data Mining subject guidance.

Andrea Jacobs

5

I trust TutorBin for assisting me in completing Data Mining assignments with quality and 100% accuracy. Experts are polite, listen to my problems, and have extensive experience in their domain.

Lilian King

5

I got my Data Mining homework done on time. My assignment was proofread and edited by professionals, with zero plagiarism, as the experts developed it from scratch. I feel relieved and super excited.

Joey Dip

5

TutorBin helping students around the globe

TutorBin believes that distance should never be a barrier to learning. Over 500,000 orders and 100,000+ happy customers show why TutorBin has become the name that keeps learning fun in the UK, USA, Canada, Australia, Singapore, and UAE.