tutorbin

big data homework help

Boost your journey with 24/7 access to skilled experts, offering unmatched big data homework help

Trusted by 1.1 M+ Happy Students

Recently Asked big data Questions

Expert help when you need it
  • Q1:Programming Language: Python. You should implement the R-tree by using the existing libraries provided in the programming language of your choice (i.e., some standard libraries or the libraries for R-Tree). 1. Source Code: NOTE: Make sure your code can be run in the standard general programming environment. 2. Report: the report should include the following: • A detailed description of the main functions in your source code. Please provide comments for each key part of the program, including each function, each FOR/WHILE loop, each IF statement, and each calculation and value assignment. • A clear specification of the requirements for executing the code, such as OS environment, placement of input files, any input parameters, etc. • A detailed analysis of the construction and search of an R-Tree: Select no fewer than 10 data points from the given dataset, and one query from the given queries. You need to draw figures (including the R-tree structure, the MBRs and the process of R-Tree establishment) to illustrate the whole process of R-Tree construction and the R-Tree based search. The search should traverse several nodes of the tree, and during the construction of the R-Tree, there should be an overflow and a node splitting. • The report should also include screenshots of the running results (e.g., the average execution time of both the sequential-scan and R-Tree based methods). 3. A step-by-step video: The video should clearly introduce the design of the program, show the process of running the program, and show the delivered results and the implementation of the R-Tree.
  • Q2:Learning outcome Perform mass and energy balances, identify all key and utility streams and specify their flow, composition data. Use of computer based flow sheet software to perform mass and energy balances. Make an assessment of potential hazards Make an assessment of sustainability of the plant Make a material selection for chemical engineering equipment Mark 0-5 Limited and uncomplete mass and energy balances. Calculations not given and explained. Limited or no Hysys flow sheet has been created. Poor selection of process safety aspects and no details given on how these related to the overall design. Report lacks any discussion and criticism of the process safety methods used. 0-3 A poor report, lacking logical and systematic assessment of positive and negative impacts on all three pillars of sustainability, or not assessing all three pillars. Generally incorrect or missing identification of risks and mitigations. Shows lack of understanding of lecture content. 0-3 A poor report, where only some of the mechanical, thermal and chemical demands on the 3 key units are appraised. Following a poor assessment of the materials properties necessary to meet these demands, some materials are identified for each of the 3 key units. A discussion of these materials is missing and/or selection is poor or incorrect. Shows lack of understanding of lecture 6-10 Mass and energy balances for the process with major errors or omissions. Process assumptions not given or unrealistic. 6-10 Hysys flow sheet does not include all the process units (e.g. some units not working) No comparison with manual calculations is presented. 5-8 A basic report with limited evidence of understanding of how the chosen aspects of safety relate to the design process. Some discussion of the process safety aspects is provided, but this is largely superficial and contains misconceptions. 
4-6 A basic report, assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way. Some risks are mentioned, but mitigation and prioritization is incorrect or missing. Generally shows an understanding of lecture content. 4-6 A basic report, where most of the mechanical, thermal and chemical demands on the 3 key units are appraised. Following an incomplete assessment of the materials properties necessary to meet these demands, some engineering materials are identified for each of the 3 key units. A discussion of these materials leads to the selection of an appropriate one. Generally shows an understanding of lecture 11-15 Complete mass and energy balances for the whole process with some errors and omissions. 11-15 Hysys flow sheet for the whole process has been created with some errors or omissions. Some comparison with manual calculations is presented. 9-12 An acceptable report with an appropriate selection of multiple aspects of process safety. A range of potential process hazards are identified and properly considered within the context of design. Discussion and criticism of the process safety methods is provided. References are used to support discussion 7-9 An acceptable report, assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way. Good identification of risks, with evidence of prioritization and proposed mitigation. Some mention of either temporal, geographical scales, or unusual or emergency operations are present. 7-9 An acceptable report, where the mechanical, thermal and chemical demands on the 3 key units are fully appraised, with consideration of unusual or emergency operations. Following an assessment of the materials properties necessary to meet these demands, some engineering materials are identified for each of the 3 key units. A discussion of these materials leads to the selection of the most appropriate one. 
Quantification and referencing 16-20 Complete mass and energy balances with no obvious errors Utility are specified completely. 16-20 Hysys flow sheet has been created with minor errors. Advanced features (ie. adjustment) have been used. Detailed comparisons (with manual calculatons) and good discussion are presented. 13-16 An excellent report with a well thought out, systematic and critical review outlining how design decisions were related to the chosen process safety aspects. A clear summary of process hazards is provided with a discussion outlining the challenges and benefits, advantages and disadvantages of the methods. Report is well- referenced throughout 10-12 An excellent report, fully assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way (e.g. using the 4 P's). Important risks are identified based on a prioritisation of likelihood and impact, and sensible mitigations are proposed. Quantitative targets for improvement are provided. Examples of consideration of temporal and geographical scales, as well as unusual and emergency operations are present. 10-12 An excellent report, where the mechanical, thermal and chemical demands on the 3 key units are fully appraised, with consideration of unusual and emergency operations. Following a convincing assessment of the materials properties necessary to meet these demands, an appropriate range of engineering materials are identified for each of the 3 key units. An analytical discussion of these materials leads to the selection of the most appropriate one. Newcastle University 21-25 Complete mass and energy balances presented in a highly professional and informative format beyond expectation at this stage 21-25 Sophisticated Hysys sheet for the whole process has been created demonstrating knowledge beyond the lecture content. Detailed comparisons with excellent discussion is presented. 
17-20 As before, but with additional signs of creativity and innovation beyond the lecture content. Original and insightful discussion presented in a highly professional and informative format. 13-15 An exemplary report incorporating all points from the previous descriptor, but additionally demonstrating knowledge beyond the lecture content, and/or showing signs of engineering flair and professionalism beyond expectation at this stage. 13-15 An exemplary report incorporating all points from the previous descriptor, but additionally demonstrating knowledge beyond the lecture content, and/or showing signs of engineering flair and professionalism beyond expectation at this stage.
  • Q3:Independent Study - Overview Overview The goal of this independent study is to go beyond the scope of this class and explore Big Data field in the direction of your interest that was not covered in the class. For this independent study, your job is to choose a Big Data topic of your interest that is not covered in the class, explore the topic and learn the related techniques and skills, and finally implement a Big Data task using the acquired techniques/skills. Part 1 - Proposal (due July 12) (25%) Propose 2 topics of your interest related to Big Data challenges, solutions, and techniques that were not covered in the class. Rank your two topics in order of priority. For this independent study, you are required to explore only one topic. You must start from your first choice topic. However, during this study, prior to part 2 submission deadline, if you don't feel comfortable with your first choice topic, you can switch to your second topic. Just note that you can only make this change prior to part 2 submission deadline. You can no longer change your topic after submitting the first draft of the final report (part 2 submission). For each topic, • provide a brief description • identify the specific learning objectives you will pursue (at least 3) (to make this list, you might need to dig more into your topic and explore the things you will learn along the way) make a list of potential resources to study • propose a Big Data task related to your topic that you can implement using the techniques and skills you have learnt If your topic involves any analysis or modeling a dataset, the dataset size must be at least 50 MB. Deliverables - a single document of proposalSee Answer
  • Q4:QUESTION ONE (25 MARKS) 1.1 Define Big Data and IoT in detail. (10) 1.2 Suppose you are working for a company that wants to replace one of its old systems with a new one that handles big data and the Internet of Things. You have been assigned to lead the project, but management is still undecided between the old and the new system. What advantages of big data and the Internet of Things could you use to convince management to accept the new project? Address their disadvantages as well. (15)
  • Q5:QUESTION TWO (25 MARKS) 2.1 List five big data analysis techniques and explain how each is applied. (10) 2.2 Suppose you are working for a company as a project manager and are required to design a home automation system for a client using the Internet of Things. Briefly describe examples of devices that you might connect, and show your connections in a simple, well-labelled graph. (15)
  • Q6:QUESTION THREE (20 MARKS) 3.1 List and describe the IoT architecture components. (12) 3.2 Describe two major layers and two other supporting modules of Hadoop. (8)
  • Q7:QUESTION FOUR (20 MARKS) 4.1 List and describe five (5) security technologies applied in Big Data. (10) 4.2 List five ethics and five (5) policies that govern the use and implementation of Big Data. (10)
  • Q8:Problems: 1. (You can use other websites for this question, but you should not cut and paste text. Please write answers in your own words!) One of the candidates who used to coach a former president now wants to beat him in the election. This candidate is known for thinking out of the box. While his campaign has been running their services on AWS EC2 instances, they recently became aware of serverless cloud services. Can you help them? (Please limit your answer for each part to less than 50 words. Be concise!) a. What is the key difference between AWS Lambda and AWS EC2? b. What are the two key differences between AWS Lambda and AWS spot instances (think: pricing and how long instances last)? c. What are the names of the corresponding Microsoft Azure and Google Cloud counterparts (names) of Amazon Lambda? d. Give one example application (class) where you would prefer AWS EC2, and one where you would prefer AWS Lambda. Justify your choices briefly.See Answer
  • Q9:2. (You can use other websites for this question, but you should not cut and paste text. Please write answers in your own words!) Many of the campaigns are using Machine Learning, which of course benefits from GPUs. (Please limit your answer for each part to less than 50 words. Be concise!) Because the campaign has a limited budget, they are only looking at single GPU VM instances, i.e., only those that have one (and only one) GPU inside them (but an arbitrary amount of memory and CPUs). a. They want to find the single GPU VM type across all 3 major cloud providers (AWS, Azure, Google Cloud) that has the highest available memory. Can you find it for them? (Note this refers to single GPU, not cloud instance). b. Repeat the previous question but find instead the lowest available memory. c. What is the difference between a GPU and a "TPU" (among cloud offerings)?
  • Q10:3. (Note: From this question onwards you CANNOT use websites but can only use course material. Also, for all Mapreduce questions please only use the class definition/template for Mapreduce jobs, i.e., Map/Reduce tasks/functions, and not other sources from the Web, since there are many different variants on the Web!) One of the candidates always imitates and emulates a previous president. This candidate believes they will win by becoming the most popular person on social media. An intern in their campaign wants to write a Mapreduce program. In MapReduce, one writes a program for Map that processes one input line at a time and outputs zero or more (key, value) pairs; and one writes a program for Reduce that processes an input of (key, all values for key). The iteration over input lines is done automatically by the MapReduce framework. The intern would like to know who are the influential Twitter users most similar to their candidate, and would like to use Hadoop for this. The intern uses an input file containing information from Twitter (which is an asymmetrical social network) about which users "follow" which other users. If user a follows b, the entry line is (a, b) - you can assume this data is already sharded in HDFS and can be loaded from there. Can you help the intern? Write a MapReduce program (Map and Reduce separately) that outputs the list of all users U who satisfy the following three conditions simultaneously: U has at least 100 million followers, and U herself/himself follows fewer than 10 users, and U follows at least one user V who in turn has at least 10 million followers (e.g., @Barack Obama would be such a U). You can chain Mapreduces if you want (but only if you must, and even then, only the least number). Your program must be as highly parallelizable as possible. Correctness is paramount, though you should try to make it as fast as possible.See Answer
  • Q11:4. A rival campaign manager believes that finding the best donors is the way to go. They use the same dataset from the previous question to instead find all user pairs (U,V) such that: (i) both U and V have at least 100 million followers each, and (ii) U and V follow at least 100 accounts in common (excluding each other). Note that U and V may or may not follow each other (either way)! Write a Mapreduce program for this. Same instructions as the first Mapreduce question in this series apply.See Answer
  • Q12:5. One of the social media billionaires is considering running for President. They run a social media named Quitter, and they have access to a lot of data inside the company. As an intern in this campaign, you have the same social network dataset (named D1) specified in the previous question ((a,b) directed pairs indicating a follows b), but you also have an additional dataset (named D2) with entries (a, start_time, end_time) indicating that user a was online starting start_time and ending at end_time. The data is only for one day. All times are hh:mm:ss. However, each user a may have multiple entries in D2 (since users log in simultaneously). Write a Mapreduce program that extracts all pairs of users (a,b) such that: (i) a and b follow each other, and (ii) a and b were online simultaneously at least once during that day. Same instructions as the first Mapreduce question in this series apply. Please ensure that a Map stage reads data from only one input dataset (i.e., if a Map reads directly from D2, don't use it to also read from D1, and vice versa). This is consistent with good Map programming practices.
  • Q13:6. Questioning and Reforming the election system seem all the rage nowadays. There are some ways distributed systems folks can help with elections. Someone at the election office thinks MapReduce could be useful for “instant runoff voting" in primaries. (Fun fact: several states, including Alaska, now use instant runoff voting!) Here's how instant runoff voting works. Consider an election with three candidates on the ballot - A, B, C. Every voter ranks the three candidates as first preference, second preference, and last preference. Between any two candidates X and Y, if a majority of voters ranked X above Y, then X dominates Y (and vice versa)-note that this only takes into account X and Y's relative rankings, not where they appear in the preference order, or where the third candidate appears. A Condorcet winner is a candidate that dominates all other candidates (pair wise) on the ballot. By definition an election can have at most one Condorcet winner (however, there may be zero). You are given a dataset of votes from N voters (N is odd and large, and so dataset is sharded), where each vote V has three fields V.1, V.2, V.3, respectively for the first, second, and third preference votes of that voter. Each line of input is one such voter's vote V (input to initial Map function). Write a MapReduce program that outputs either the unique single Condorcet winner among the three candidates A, B, or C, or if there is no single Condorcet winner, then it outputs the list of candidate(s) with the highest Condorcet count (those that dominate the most number of candidates). For background -- in MapReduce, one writes a program for Map that processes one input line at a time and outputs zero or more (key, value) pairs; and one writes a program for Reduce that processes an input of (key, all values for key). The iteration over input lines is done automatically by the MapReduce framework. You can assume this data is already sharded in HDFS and can be loaded from there. 
Each line is one vote V.
  • Q14:7. At a presidential debate, one of the candidates loudly proclaims, "You idiots are so slow!". Then the moderator asks, "Can you elaborate please?" At a loss for words, the candidate reaches deep into their CS425 knowledge and screams, "You're all so slow! You're all doing push gossip. I do pull gossip, and even with fixed fanout, it converges in O(log(log(N))) time!" Are they right? If yes, give a proof (informal proof ok). If they are wrong, give a proof (informal proof). (Note: Push gossip and pull gossip mentioned here are the same protocols discussed in lecture)
  • Q15:8. One of the less popular candidates is polling at very small numbers in most of the states. They want to analyze the "topology-aware gossip" protocol you've seen in lecture. However, instead of the lecture slide example of 2 subnets joined by 1 router, here we have a total of N nodes (processes), evenly spread out across √N subnets (each subnet containing √N nodes), all joined by 1 router. The subnets are numbered S0, S1, S2, ... S(√N-1). You can assume all nodes have a full membership list, and there are no failures (messages or processes). The topology-aware gossip works as follows. Consider a process Pj choosing gossip targets. The process' gossip targets depend on the subnet Si that it lies in. During a gossip round, the process Pj selects either b "inside-subnet Si gossip targets" with probability (1-1/√N), OR b "outside-subnet Si gossip targets" with probability 1/√N. The only "restriction" is that after process Pj is infected, for the next O(log(√N)) rounds Pj picks only inside-subnet targets (no outside-subnet targets); thereafter, in a gossip round at Pj, either all its targets are inside-subnet or all are outside-subnet. Inside-subnet gossip targets from Pj (in Si) are selected uniformly at random from among the processes of Si. Outside-subnet gossip targets from Pj (in Si) are only picked from the processes in the "next" subnet S((i+1) mod √N), and they are picked uniformly at random from the processes lying in that "next" subnet. The gossiping of a message does not stop (i.e., it is gossiped forever based on the above protocol). Does this topology-aware gossip protocol satisfy both the requirements of: (i) O(log(N)) average dissemination time for a gossip (with one sender from any subnet), and (ii) an O(1) messages/time unit load on the router at any time during the gossip spread? Justify your answers.
  • Q16:9. One of the campaigns is always looking for shortcuts. Their distributed system uses a failure detector but to "make it faster", they have made the following changes. For each of these changes (in isolation), say what is the one biggest advantage and the one biggest disadvantage of the change (and why). Keep each answer to under 50 words (give brief justifications). a. They use Gossip-style failure detection, but they set Tcleanup = 0. b. They use SWIM-style failure detection, but they removed the Suspicion feature. c. They use SWIM-style failure detection, but they removed the round robin pinging + random permutation, and instead just randomly select each ping target.
  • Q17:10. An intern in the Independent Party campaign designs an independent SWIM/ping-based failure detection protocol, for an asynchronous distributed system, that works as follows. Assume there are N=M*K*R processes in the system (M, K, R, are positive integers, each > 2). Arrange these N processes in a MxKxR 3-dimensional matrix (tesseract), with M processes in each column, and K processes in each row, and R processes in the 3rd dimension (aisles). All processes maintain a full membership list, however pinging is partial. Each process Pijk (in i-th row and j-th column and k-th aisle) periodically (every T time units) marks a subset of its membership list as its Monitoring Set. The monitoring set of a given process, once selected, does not change. The monitoring set of Pijk contains: i) all the processes in its own column Pjk, ii) all the other processes in its own row Prk, and iii) all the processes in its own aisle Pij. At this point, there are two options available to you: Option 1 - Each process sends heartbeats to its monitoring set members. Option 2 - Each process periodically pings all its monitoring set members; pings are responded to by acks, just like in the SWIM protocol (but there are no indirect pings or indirect acks). Failure detection timeouts work as usual: Option 1 has the heartbeat receiver time out waiting for a heartbeat, while Option 2 has the pinging process (pinger) time out. The suspected process is immediately marked as failed. This is run in an asynchronous distributed system. a. How many failures does Option 1 take to violate completeness? That is, find the value L so that if there are (L-1) simultaneous failures, all of them will be detected, but if there are L simultaneous failures then not all of them may be detected. b. Answer the same question for Option 2. c. 
An opposition party candidate claims that for K=R=2, both Option 1 and Option 2 provide completeness for all scenarios with up to (and including) 9 simultaneous failures. You gently respond that they are wrong and that it also depends on M. What are all the values of M (given K=R=2) for which your opponent's claim above is true? Justify your answer clearly.
  • Q18:, submit a ZIP file that includes a Word document with a cover page containing the names of your team members and each of the steps outlined below, clearly identified with a title. Also, include your data sources in the Zip file for submission. Please provide thorough comments on your steps and work. Failure to comply with the submission guidelines will result in penalties. 1. Identify a data source of your choice (See: https://donnees montreal.ca/) and provide the link to your data source in your Word document. Describe your data source in your Word document. Proceed with data verification and assess their quality. Identify and perform any necessary data preprocessing, if needed. (20 points) 2. Add your data source to HDFS in your Hadoop environment. Include your steps in your Word document. (20 points) 3. Identify a first processing task for this data source. Create and test your MapReduce code in your Hadoop environment. Use comments to clearly identify each step of your MapReduce code. Describe your processing task in one to two sentences and include it in your Word document. (30 points) 4. Identify a second processing task (Different from the first processing task in step 3) for this data source. Create and test your Spark SQL code in your Hadoop environment. The use of temporary tables is not allowed in your project. Use comments to clearly identify each step of your Spark SQL code. Describe your processing task in one to two sentences and include it in your Word document. (30 points) You will be evaluated on the consistency of your processing tasks and the completeness and details in your Word document compared to the specifications, as well as the optimality and quality of the code. To propose consistent work, try to draw inspiration from the various practices done in class to complete the requested work and not simply replicate the same examples covered in those practices.See Answer
  • Q19:Netflix Data Analysis using Pig. Documentation; What to Submit; Description; Setting up your Project; Project Description; Optional: Use your laptop to develop your project. [remaining text illegible]
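
The MapReduce model that Q10 describes (Map handles one input line at a time and emits zero or more (key, value) pairs; Reduce receives a key together with all of its values) can be sketched in plain Python. This is only an illustrative, single-process stand-in for the framework, not Hadoop code and not a solution to any question above: the helper names `map_followers`, `reduce_followers`, and `run_mapreduce`, the sample edges, and the exact "(a, b)" line format are assumptions made for the example.

```python
from collections import defaultdict


def map_followers(line):
    """Map: one input line "(a, b)" means a follows b; emit (b, 1)."""
    _follower, followed = (part.strip() for part in line.strip("() \n").split(","))
    yield (followed, 1)


def reduce_followers(key, values):
    """Reduce: receives a user and all emitted 1s; emit (user, follower count)."""
    yield (key, sum(values))


def run_mapreduce(lines, mapper, reducer):
    """Toy driver (hypothetical): map every line, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical sample input: each entry is one "a follows b" line.
edges = ["(alice, carol)", "(bob, carol)", "(carol, alice)"]
follower_counts = run_mapreduce(edges, map_followers, reduce_followers)
print(follower_counts)  # [('alice', 1), ('carol', 2)]
```

In a real framework the grouping step is the distributed shuffle, and conditions such as "at least 100 million followers" would be applied as a filter inside Reduce or in a chained job.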

TutorBin Testimonials

I got my Big Data homework done on time. My assignment is proofread and edited by professionals. Got zero plagiarism as experts developed my assignment from scratch. Feel relieved and super excited.

Joey Dip

I found TutorBin Big Data homework help when I was struggling with complex concepts. Experts provided step-wise explanations and examples to help me understand concepts clearly.

Rick Jordon

TutorBin experts resolve your doubts without making you wait for long. Their experts are responsive & available 24/7 whenever you need Big Data subject guidance.

Andrea Jacobs

I trust TutorBin for assisting me in completing Big Data assignments with quality and 100% accuracy. Experts are polite, listen to my problems, and have extensive experience in their domain.

Lilian King

TutorBin helping students around the globe

TutorBin believes that distance should never be a barrier to learning. With 500,000+ orders and 100,000+ happy customers, TutorBin has become the name that keeps learning fun in the UK, USA, Canada, Australia, Singapore, and UAE.