tutorbin

data mining homework help

Boost your journey with 24/7 access to skilled experts, offering unmatched data mining homework help

tutorbin

Trusted by 1.1M+ happy students

Recently Asked Data Mining Questions

Expert help when you need it
  • Q1:Use the dataset for Airplanes, Motorbikes, and Schooners. Your goal is to improve the average accuracy of classification. See Answer
  • Q2:Programming Assignment Explanation • Fortune Cookie Classifier¹ You will build a binary fortune cookie classifier. This classifier will be used to classify fortune cookie messages into two classes: messages that predict what will happen in the future (class 1) and messages that just contain a wise saying (class 0). For example, "Never go in against a Sicilian when death is on the line" would be a message in class 0. "You will get an A in Machine learning class" would be a message in class 1. Files provided: there are three sets of files. All words in these files are lower case and punctuation has been removed. 1) The training data: traindata.txt: This is the training data consisting of fortune cookie messages. trainlabels.txt: This file contains the class labels for the training data. 2) The testing data: testdata.txt: This is the testing data consisting of fortune cookie messages. testlabels.txt: This file contains the class labels for the testing data. See Answer
  • Q3:Q1. (10 points) Answer the following with a yes or no along with proper justification. a. Is the decision boundary of the voted perceptron linear? b. Is the decision boundary of the averaged perceptron linear? See Answer
  • Q4:Q2. (10 points) Consider the following setting. You are provided with n training examples: (x₁, y₁, h₁), (x₂, y₂, h₂), ..., (xₙ, yₙ, hₙ), where xᵢ is the input example, yᵢ is the class label (+1 or -1), and hᵢ > 0 is the importance weight of the example. The teacher gave you some additional information by specifying the importance of each training example. How will you modify the perceptron algorithm to be able to leverage this extra information? Please justify your answer. See Answer
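One common answer to this question (a sketch, not the only valid modification) is to scale each mistake-driven perceptron update by the example's importance weight hᵢ, so that misclassifying an important example moves the decision boundary further. A minimal pure-Python sketch, assuming numeric feature tuples:

```python
# Sketch: perceptron with per-example importance weights h_i.
# Assumption: the modification is to scale each update by h_i.

def weighted_perceptron(examples, epochs=10):
    """examples: list of (x, y, h) with x a tuple of floats, y in {-1, +1}, h > 0."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y, h in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                       # mistake (or on the boundary)
                w = [wi + h * y * xi for wi, xi in zip(w, x)]
                b += h * y                           # update scaled by importance h
    return w, b
```

An example with large h thus contributes a proportionally larger correction per mistake, which is the intuition the question is probing for.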
  • Q5:Q3. (10 points) Consider the following setting. You are provided with n training examples: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), where xᵢ is the input example and yᵢ is the class label (+1 or -1). However, the training data is highly imbalanced (say 90% of the examples are negative and 10% of the examples are positive) and we care more about the accuracy of positive examples. How will you modify the perceptron algorithm to solve this learning problem? Please justify your answer. See Answer
  • Q6:Q4. You were just hired by MetaMind. MetaMind is expanding rapidly, and you decide to use your machine learning skills to assist them in their attempts to hire the best. To do so, you have the following available to you for each candidate i in the pool of candidates I: (i) their GPA, (ii) whether they took a Data Mining course and achieved an A, (iii) whether they took an Algorithms course and achieved an A, (iv) whether they have a job offer from Google, (v) whether they have a job offer from Facebook, (vi) the number of misspelled words on their resume. You decide to represent each candidate i ∈ I by a corresponding 6-dimensional feature vector f(i). You believe that if you just knew the right weight vector w ∈ R⁶ you could reliably predict the quality of a candidate i by computing w · f(i). To determine w, your boss lets you sample pairs of candidates from the pool. For a pair of candidates (k, l) you can have them face off in a "DataMining-fight." The result is score(k > l), which tells you that candidate k is at least score(k > l) better than candidate l. Note that the score will be negative when l is a better candidate than k. Assume you collected scores for a set of pairs of candidates P. Describe how you could use a perceptron-based algorithm to learn the weight vector w. Make sure to describe the basic intuition, how the weight updates will be done, and pseudo-code for the entire algorithm. See Answer
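One assumed reading of this question is to treat each observed score(k > l) as a required margin on the difference vector f(k) - f(l), and apply a perceptron-style update whenever the margin is violated. A hypothetical sketch (the function name and learning rate are illustrative, not from the question):

```python
# Sketch: a margin-perceptron over pairwise candidate comparisons.
# Assumed intuition: for a pair (k, l) with observed s = score(k > l),
# we want w . (f(k) - f(l)) >= s; on a violation, nudge w toward the
# difference vector, exactly as the perceptron nudges toward a mistake.

def rank_perceptron(pairs, dim, epochs=20, lr=0.1):
    """pairs: list of (fk, fl, s) with fk, fl feature tuples and s the score."""
    w = [0.0] * dim
    for _ in range(epochs):
        for fk, fl, s in pairs:
            d = [a - b for a, b in zip(fk, fl)]      # difference vector f(k) - f(l)
            if sum(wi * di for wi, di in zip(w, d)) < s:
                w = [wi + lr * di for wi, di in zip(w, d)]   # perceptron-style update
    return w
```

On pairs where the first feature encodes candidate quality, the learned w concentrates its weight on that feature, which is the behavior the pseudo-code in a full answer would need to justify.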
  • Q7:Please create a K-means clustering and a hierarchical clustering with the code provided. The code should include a merge of the Excel files. The Excel files will also be provided. See Answer
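The Excel merge here would typically use pandas (e.g. pandas.concat over pandas.read_excel calls), but since the files are not available, the clustering step itself can be sketched with a minimal pure-Python Lloyd's K-means on toy 2-D points (a sketch under those assumptions, not the assignment's actual solution):

```python
# Sketch of Lloyd's K-means on toy 2-D points. In the real assignment the
# merged Excel data would be fed in where `points` is defined.

def kmeans(points, centers, iters=20):
    """points: list of (x, y); centers: initial list of (x, y) centroids."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centers]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters
```

Hierarchical clustering differs in that it needs no initial centroids: it repeatedly merges the two closest clusters instead of alternating assign/update steps.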
  • Q8:Discussion: Data Mining, Text Mining, and Sentiment Analysis. Respond to the following in a minimum of 230 words: explain the relationship between data mining, text mining, and sentiment analysis, and provide situations where you would use each of the three techniques. See Answer
  • Q9:Assignment #3: DBSCAN, OPTICS, and Clustering Evaluation 1. If Epsilon is 2 and minpoint is 2 (including the centroid itself), what are the clusters that DBScan would discover with the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9)? Use the Euclidean distance. Draw the 10 by 10 space and illustrate the discovered clusters. What if Epsilon is increased to sqrt(10)? (30 pts) See Answer
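A hand-worked answer to the DBSCAN question can be sanity-checked with a minimal pure-Python DBSCAN, where min_pts counts the point itself as the question specifies (a verification sketch, not part of the expected hand-drawn answer):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point (-1 = noise)."""
    n = len(points)
    # Neighborhoods include the point itself, matching the question's min_pts.
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    labels = [None] * n
    cid = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        cid += 1                           # start a new cluster at core point i
        labels[i] = cid
        queue = list(neigh[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid            # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(neigh[j]) >= min_pts:   # j is a core point: expand through it
                queue.extend(neigh[j])
    return labels

# The eight examples A1..A8 from the question:
pts = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
```

With eps = 2 this sketch finds the groups {A3, A5, A6} and {A4, A8} with A1, A2, A7 as noise; raising eps to sqrt(10) absorbs the noise points, which is exactly the comparison the question asks you to illustrate.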
  • Q10:2. Use the OPTICS algorithm to output the reachability distance and the cluster ordering for the dataset provided, starting from Instance 1. Use the following parameters for discovering the cluster ordering: minPts = 2 and epsilon = 2. Use epsilonprime = 1.2 to generate clusters from the cluster ordering and their reachability distance. Don't forget to record the core distance of a data point if it has a dense neighborhood. You don't need to include the core distance in your result, but you may need to use them in generating clusters. (45 pts) [Dataset visualization figure omitted.] Below are the first few lines of the calculation. You need to complete the remaining lines and generate clusters based on the given epsilonprime value:
Instance (X, Y) | Reachability Distance
Instance 1 (1, 1) | Undefined (or infinity)
Instance 2 (0, 1) | 1.0
Instance 3 (1, 0) | 1.0
Instance 16 (5, 9) | Undefined
Instance 13 (9, 2) | Undefined
Instance 12 (8, 2) | 1.0
See Answer
  • Q11:3. Use the F-measure and the pairwise measures (TP, FN, FP, TN) to measure the agreement between a clustering result (C1, C2, C3) and the ground truth partitions (T1, T2, T3) as shown below. Show details of your calculation. (25 pts) [Contingency table of clusters C1-C3 versus ground truth partitions T1-T3 omitted.] See Answer
  • Q12:1. We will use the Flower Classification dataset: https://www.kaggle.com/competitions/tpu-getting-started 2. Your goal is to improve the average accuracy of classification. a. You SHOULD use Google Colab as the main computing environment (using Kaggle is okay). b. You SHOULD create a GitHub repository for the source code; put a README file for execution. c. You SHOULD explain your source code in the blog. d. Try experimenting with various hyperparameters: i. Network topology: number of neurons per layer (for example, 100 x 200 x 100, 200 x 300 x 100, ...), number of layers (for example, 2 vs 3 vs 4, ...), shape of conv2d. ii. While doing experiments, make sure you record your performance such that you can create a bar chart of the performance. iii. An additional graph idea might be a training time comparison. iv. Do some research on ideas for improving this. e. You can refer to code or tutorials on the internet, but the main question you have to answer is what improvement you made over the existing reference. Make it very clear which lines of code are yours; when you copy source code, add a reference. 3. Documentation is half of your work. Write a good blog post for your work and a step-by-step how-to guide. A good example is https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/ 4. Add references: add citation numbers in the contents and put the references in a separate reference section. See Answer
  • Q13:This tutorial will guide you through how to do homework in this course. 1. Go to https://www.kaggle.com/c/titanic and follow the walkthrough at https://www.kaggle.com/alexisbcook/titanic-tutorial 2. Submit your result to the Kaggle challenge. 3. Post the Jupyter notebook to your homepage as a blog post. A good example of a blog post is https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/ 4. Submit your homepage link and a screenshot PDF in Canvas. 5. Doing 1-4 will give you 8 points. To get an additional 2 points, create a section called "Contribution" and try to improve the performance. I expect one or two paragraphs minimum (the longer the better). Show the original score and the improved score. See Answer
  • Q14:1. Use the insurance fraud dataset. Consider the data quality issues (e.g., missing data) and preprocess the data. Split the data into a 10% train and 90% test set using random_state = 1. Create a decision tree with a max depth of 3 using the gini measure. Print the accuracy on the test set and the tree. Is this a good approach? Why or why not? 2. Create a decision tree on the same data with a max depth of 3 and the entropy measure. Does the accuracy change? Does the tree change? Discuss which measure you think is better. 3. Now split the data into 70% train and 30% test using random_state = 1. Redo 2 and 3. Have the trees and accuracy changed? Are the trees more or less similar now? Discuss which split you think is better and why. 4. Evaluate how the accuracy changes with the depth of the tree using the 70-30 data. Look at the accuracy for a max depth of 1, 2, 3, ..., 10, 15, 20. Plot the curve of the accuracy change. Do you see underfitting? Do you see overfitting? 5. What variable provides the most information gain in the insurance fraud data (for the 70-30 split)? 6. Decision trees are a "white box" method. What do you observe about the insurance fraud data using decision trees? See Answer
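The assignment itself would use scikit-learn's DecisionTreeClassifier with criterion='gini' versus criterion='entropy'; the two split measures being compared can be sketched directly in pure Python:

```python
import math

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy: -sum_k p_k * log2(p_k) over the class proportions."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))
```

Both measures are 0 for a pure node and maximal for a 50/50 node (0.5 for gini, 1.0 for entropy), which is why trees grown with the two criteria often, but not always, coincide.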
  • Q15:Natural Language Processing Project. Welcome to the NLP Project for this section of the course. In this NLP project you will be attempting to classify Yelp reviews into 1-star or 5-star categories based off the text content in the reviews. This will be a simpler procedure than the lecture, since we will utilize the pipeline methods for more complex tasks. We will use the Yelp Review Data Set from Kaggle. Each observation in this dataset is a review of a particular business by a particular user. The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business (higher stars is better). In other words, it is the rating of the business by the person who wrote the review. The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business. The "useful" and "funny" columns are similar to the "cool" column. Let's get started! Just follow the directions below!
5. Use the corr() method on that groupby dataframe to produce this dataframe: (5 points)
                 cool    useful     funny  text length
cool         1.000000 -0.743329 -0.944939    -0.857664
useful      -0.743329  1.000000  0.894506     0.699881
funny       -0.944939  0.894506  1.000000     0.843461
text length -0.857664  0.699881  0.843461     1.000000
NLP Classification Task. Let's move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.
6. Create a dataframe called yelp_class that contains the columns of the yelp dataframe but for only the 1- or 5-star reviews. (5 points)
7. Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class (your features and target/labels). (10 points)
9. Use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column). Save this result by overwriting X: X = cv.fit_transform(X) (10 points)
Train Test Split. Let's split our data into training and testing data.
10. Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101. (10 points)
Training a Model. Time to train a model!
11. Import DecisionTreeClassifier, create an instance of the estimator called tree, then fit the model using the training set: DecisionTreeClassifier(). (10 points) See Answer
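What cv.fit_transform(X) does in step 9 can be illustrated with a toy pure-Python bag-of-words transformer (a sketch of CountVectorizer's observable behavior, not its actual implementation, which returns a sparse matrix):

```python
# Toy sketch of CountVectorizer-style behavior: learn a vocabulary from
# the documents, then map each document to a vector of word counts.

def fit_transform(docs):
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for w in d.lower().split():
            row[index[w]] += 1          # count each occurrence of the word
        rows.append(row)
    return vocab, rows
```

For example, fit_transform(["great food great service", "terrible food"]) yields the vocabulary ['food', 'great', 'service', 'terrible'] and the count rows [1, 2, 1, 0] and [1, 0, 0, 1]; those rows are what X_train/X_test hold after the split in step 10.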
  • Q16:Task 2: Association Rule Extraction and Outlier Detection. Question 1. Suppose we have ten product codes (from 0 to 9) and the following table of transactions:
Identifier | Products
1 | 0, 1, 3, 4
2 | 1, 2, 3
3 | 1, 2, 4, 5
4 | 1, 3, 4, 5
5 | 2, 3, 4, 5
6 | 2, 4, 5
7 | 3, A
8 | 1, 2, 3
9 | 1, 4, 5
10 | B, 4
where A and B are the penultimate and last digits of your registration number, each divided by 2 and rounded to the nearest unit. That is, if your registration number ends in 12, then you will set A = 1/2 = 0.5, that is 1, and B = 2/2 = 1. But if the penultimate digit of your registration number is either 5 or 6, you will set A = 4. Accordingly, if the last digit of your registration number is either 3 or 4, you will set B = 2. Please answer the following questions, showing your calculations: 1. Compute the support of the sets {4, 5} and {3, 5}. 2. Compute the support and confidence of the rules {4, 5} → {3} and {3, 5} → {2}. 3. Run the Apriori algorithm with the Fk-1 x Fk-1 method for a support threshold equal to 2 and report which frequent sets you find at each step. See Answer
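The support and confidence computations in Question 1 can be sketched in a few lines of Python. A = 1 and B = 1 below follow the question's own registration-number example and are placeholders, not the values every student should use:

```python
# Support/confidence sketch for Question 1's transactions. A and B depend
# on your registration number; A = 1, B = 1 mirror the question's example.
A, B = 1, 1
transactions = [
    {0, 1, 3, 4}, {1, 2, 3}, {1, 2, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5},
    {2, 4, 5}, {3, A}, {1, 2, 3}, {1, 4, 5}, {B, 4},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """conf(lhs -> rhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)
```

With these placeholder values, support({4, 5}) counts transactions 3, 4, 5, 6, and 9, giving 0.5; your own A and B may change the counts, so rerun with your values before writing up the answer.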
  • Q17:Question 2. Suppose we have the following one-dimensional data set, consisting of 10 points: x1 = 59, x2 = 30, x3 = 64, x4 = 72, x5 = 35, x6 = 21, x7 = 66, x8 = 53, x9 = 56, x10 = A, where A is the number formed by the last two digits of your registration number (i.e., if your registration number ends in 12 you will set A = 12). If the last two digits of your registration number are already present in the data set, you will set A = 47. Detect outliers using: 1. The Grubbs test. 2. The one-nearest-neighbor technique, using as distance function the absolute value of the difference between two points. 3. The relative density for two nearest neighbors, using as distance function the absolute value of the difference between two points. See Answer
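The Grubbs statistic and the one-nearest-neighbor score from Question 2 can be sketched as follows. A = 12 is the placeholder from the question's own example, and the computed G would still need to be compared against the Grubbs critical value for n = 10 at the chosen significance level to decide whether the candidate is an outlier:

```python
import statistics

# Question 2's data; A = 12 is the placeholder value from the question's
# own example (substitute your registration number's last two digits).
A = 12
data = [59, 30, 64, 72, 35, 21, 66, 53, 56, A]

# Grubbs test statistic: G = max_i |x_i - mean| / s. The point with the
# largest deviation is flagged if G exceeds the critical value for n = 10.
mean = statistics.mean(data)
s = statistics.stdev(data)
candidate = max(data, key=lambda x: abs(x - mean))
G = abs(candidate - mean) / s

# One-nearest-neighbor outlier score: distance to the closest other point
# under the |x - y| distance; unusually large scores suggest outliers.
nn_dist = {x: min(abs(x - y) for y in data if y != x) for x in data}
```

Here the candidate is A = 12 (mean 46.8, so its deviation 34.8 is the largest), and its nearest-neighbor distance of 9 also stands out against the rest of the points.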
  • Q18:ASSIGNMENT: Project 2. A trojan horse traffic dataset is provided alongside the test dataset at the following link: https://www.kaggle.com/datasets/neelabhkshetry/trojan-classification The dataset has 85 features with one target column, "Class". This is a binary classification problem with the classes being "Trojan" or "Benign". You will perform all your tasks on the training dataset, which has almost 160k rows and an output class that is divided almost equally. Your task is to build classification models to predict the "Class" column. First, become familiar with the data set and research some background information on trojan horse detection in a network. 1. First, research some background information on applying data mining in related topics by reviewing published research papers (minimum 3!) on similar topics. Make sure each review is at least one paragraph and you cite your sources (for an example format, see the appendix). 2. Go through the detailed process of data preparation on train_data.csv. Apply all preprocessing and data reduction techniques you consider necessary and explain why. For every preprocessing technique: a. Explain what the technique is; b. Explain how it is applied; c. Show the summary results of the preprocessing; d. Do not include raw code, raw output, or raw screen captures. Perform dimensionality reduction in at least two steps (once by applying your understanding of the features and once by using feature selection or extraction methods). 3. When the data set is of sufficient quality, apply several predictive techniques (minimum 3 techniques!) and create appropriate predictive models. For every predictive technique you applied: a. Explain what the technique is and how it works; b. Explain what the parameters of the technique are and how they are chosen and tuned; c. Explain and discuss the predictive results and performance of the technique, analyzing different aspects of the result, including but not limited to the ROC curve, F-score, accuracy, etc.; d. Do not include raw code, raw output, or raw screen captures. 4. After you have built three predictive models, test your models so that you can compare the three data mining techniques you've chosen. You should include the following in the comparison discussion: a. The performance comparison among your techniques; b. A visualization or table showing the performance differences (make sure you explain and discuss the visualization or table); c. The probable reason behind the performance differences; d. Which technique is the best for the dataset. Comparison of the data mining techniques (and the obtained predictive models) with additional discussion and interpretation of results will be a very important part of your report. Do not include raw code or raw output. A conclusion section will be used to summarize your findings throughout the report. Give examples of data samples, and discuss how you would use your model on your samples and what result can be expected from your model. The final grade will be 90% report and 10% class competition, explained next. A separate competition dataset (test_data.csv) will be provided for additional evaluation. We will compare your prediction results against the true labels for additional unbiased evaluation as a component of your project. Up to 2 points in the final grade of the project will depend upon your model's performance on this dataset. The performance is calculated using area under the ROC curve. The entire class's scores will be calculated and ranked. If you achieve the top 20% of scores compared to your peers, you get the full 2 points. The grade distribution is the following:
Score Percentile | Score Range
Top 20% | 2
50% - 80% | > 1.5 - 2
20% - 50% | > 1 - 1.5
10% - 20% | > 0.5 - 1
0% - 10% | 0 - 0.5
Please read "Required Structure for the Project 2 Documentation", "Required Structure for the Test Data Result" and "Required Materials for Project 2 Submission" below. See Answer

TutorBin Testimonials

I found TutorBin Data Mining homework help when I was struggling with complex concepts. Experts provided step-wise explanations and examples to help me understand concepts clearly.

Rick Jordon

5

TutorBin experts resolve your doubts without making you wait long. Their experts are responsive and available 24/7 whenever you need Data Mining subject guidance.

Andrea Jacobs

5

I trust TutorBin for assisting me in completing Data Mining assignments with quality and 100% accuracy. Experts are polite, listen to my problems, and have extensive experience in their domain.

Lilian King

5

I got my Data Mining homework done on time. My assignment is proofread and edited by professionals. Got zero plagiarism as experts developed my assignment from scratch. Feel relieved and super excited.

Joey Dip

5

TutorBin helping students around the globe

TutorBin believes that distance should never be a barrier to learning. Over 500,000 orders and 100,000+ happy customers show how TutorBin has become a name that keeps learning fun in the UK, USA, Canada, Australia, Singapore, and UAE.