Question

Part 3

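Implement a function to compute RMSE (http://en.wikipedia.org/wiki/Root-mean-square_deviation) given an RDD of (label, prediction) tuples:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

where y_i is the label, \hat{y}_i is the prediction, and n is the number of tuples.

Test this function on an example RDD.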

Fig: 1
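
A minimal sketch of one way to approach this, not a definitive solution. The function relies only on `map()` and `mean()`, which PySpark RDDs provide. The names `squared_error`, `calc_rmse`, and `LocalRDD` are assumptions made for illustration; `LocalRDD` is a hypothetical stand-in that mimics those two RDD methods so the example runs without a Spark installation. With a real SparkContext `sc`, the example data would instead be built with `sc.parallelize([...])`.

```python
import math


def squared_error(label, prediction):
    """Squared difference between one label and its prediction."""
    return (label - prediction) ** 2


def calc_rmse(labels_and_preds):
    """RMSE over an RDD-like collection of (label, prediction) tuples.

    Only map() and mean() are used, so a PySpark RDD can be passed in directly.
    """
    squared = labels_and_preds.map(lambda lp: squared_error(lp[0], lp[1]))
    return math.sqrt(squared.mean())


class LocalRDD:
    """Hypothetical stand-in for a Spark RDD (map and mean only)."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return LocalRDD(f(x) for x in self.items)

    def mean(self):
        return sum(self.items) / len(self.items)


# Example data: squared errors are 4, 1 and 0, so RMSE = sqrt(5 / 3).
example = LocalRDD([(3.0, 1.0), (1.0, 2.0), (2.0, 2.0)])
example_rmse = calc_rmse(example)
print(example_rmse)
```

Averaging the squared errors with `mean()` keeps the computation to a single pass over the data and avoids a separate `count()` action.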


Most Viewed Questions Of Big Data

Netflix Data Analysis using Pig: Documentation; What to Submit; Description; Setting up your Project; Project Description; Optional: Use your laptop to develop your project.


Problems: 1. (You can use other websites for this question, but you should not cut and paste text. Please write answers in your own words!) One of the candidates who used to coach a former president now wants to beat him in the election. This candidate is known for thinking out of the box. While his campaign has been running their services on AWS EC2 instances, they recently became aware of serverless cloud services. Can you help them? (Please limit your answer for each part to less than 50 words. Be concise!)
a. What is the key difference between AWS Lambda and AWS EC2?
b. What are the two key differences between AWS Lambda and AWS spot instances (think: pricing and how long instances last)?
c. What are the names of the corresponding Microsoft Azure and Google Cloud counterparts of AWS Lambda?
d. Give one example application (class) where you would prefer AWS EC2, and one where you would prefer AWS Lambda. Justify your choices briefly.


9. One of the campaigns is always looking for shortcuts. Their distributed system uses a failure detector but to "make it faster", they have made the following changes. For each of these changes (in isolation), say what is the one biggest advantage and the one biggest disadvantage of the change (and why). Keep each answer to under 50 words (give brief justifications).
a. They use Gossip-style failure detection, but they set Tcleanup = 0.
b. They use SWIM-style failure detection, but they removed the Suspicion feature.
c. They use SWIM-style failure detection, but they removed the round robin pinging + random permutation, and instead just randomly select each ping target.


Learning outcome
Perform mass and energy balances, identify all key and utility streams and specify their flow, composition data.
Use of computer based flow sheet software to perform mass and energy balances.
Make an assessment of potential hazards
Make an assessment of sustainability of the plant
Make a material selection for chemical engineering equipment

Mark
0-5 Limited and incomplete mass and energy balances. Calculations not given and explained.
Limited or no Hysys flow sheet has been created.
Poor selection of process safety aspects and no details given on how these relate to the overall design. Report lacks any discussion and criticism of the process safety methods used.
0-3 A poor report, lacking logical and systematic assessment of positive and negative impacts on all three pillars of sustainability, or not assessing all three pillars. Generally incorrect or missing identification of risks and mitigations. Shows lack of understanding of lecture content.
0-3 A poor report, where only some of the mechanical, thermal and chemical demands on the 3 key units are appraised. Following a poor assessment of the materials properties necessary to meet these demands, some materials are identified for each of the 3 key units. A discussion of these materials is missing and/or selection is poor or incorrect. Shows lack of understanding of lecture

6-10 Mass and energy balances for the process with major errors or omissions. Process assumptions not given or unrealistic.
6-10 Hysys flow sheet does not include all the process units (e.g. some units not working). No comparison with manual calculations is presented.
5-8 A basic report with limited evidence of understanding of how the chosen aspects of safety relate to the design process. Some discussion of the process safety aspects is provided, but this is largely superficial and contains misconceptions.
4-6 A basic report, assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way. Some risks are mentioned, but mitigation and prioritization is incorrect or missing. Generally shows an understanding of lecture content.
4-6 A basic report, where most of the mechanical, thermal and chemical demands on the 3 key units are appraised. Following an incomplete assessment of the materials properties necessary to meet these demands, some engineering materials are identified for each of the 3 key units. A discussion of these materials leads to the selection of an appropriate one. Generally shows an understanding of lecture

11-15 Complete mass and energy balances for the whole process with some errors and omissions.
11-15 Hysys flow sheet for the whole process has been created with some errors or omissions. Some comparison with manual calculations is presented.
9-12 An acceptable report with an appropriate selection of multiple aspects of process safety. A range of potential process hazards are identified and properly considered within the context of design. Discussion and criticism of the process safety methods is provided. References are used to support discussion.
7-9 An acceptable report, assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way. Good identification of risks, with evidence of prioritization and proposed mitigation. Some mention of either temporal, geographical scales, or unusual or emergency operations are present.
7-9 An acceptable report, where the mechanical, thermal and chemical demands on the 3 key units are fully appraised, with consideration of unusual or emergency operations. Following an assessment of the materials properties necessary to meet these demands, some engineering materials are identified for each of the 3 key units. A discussion of these materials leads to the selection of the most appropriate one.
Quantification and referencing

16-20 Complete mass and energy balances with no obvious errors. Utilities are specified completely.
16-20 Hysys flow sheet has been created with minor errors. Advanced features (i.e. adjustment) have been used. Detailed comparisons (with manual calculations) and good discussion are presented.
13-16 An excellent report with a well thought out, systematic and critical review outlining how design decisions were related to the chosen process safety aspects. A clear summary of process hazards is provided with a discussion outlining the challenges and benefits, advantages and disadvantages of the methods. Report is well-referenced throughout.
10-12 An excellent report, fully assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way (e.g. using the 4 P's). Important risks are identified based on a prioritisation of likelihood and impact, and sensible mitigations are proposed. Quantitative targets for improvement are provided. Examples of consideration of temporal and geographical scales, as well as unusual and emergency operations are present.
10-12 An excellent report, where the mechanical, thermal and chemical demands on the 3 key units are fully appraised, with consideration of unusual and emergency operations. Following a convincing assessment of the materials properties necessary to meet these demands, an appropriate range of engineering materials are identified for each of the 3 key units. An analytical discussion of these materials leads to the selection of the most appropriate one.
Newcastle University

21-25 Complete mass and energy balances presented in a highly professional and informative format beyond expectation at this stage.
21-25 Sophisticated Hysys sheet for the whole process has been created demonstrating knowledge beyond the lecture content. Detailed comparisons with excellent discussion are presented.
17-20 As before, but with additional signs of creativity and innovation beyond the lecture content. Original and insightful discussion presented in a highly professional and informative format.
13-15 An exemplary report incorporating all points from the previous descriptor, but additionally demonstrating knowledge beyond the lecture content, and/or showing signs of engineering flair and professionalism beyond expectation at this stage.
13-15 An exemplary report incorporating all points from the previous descriptor, but additionally demonstrating knowledge beyond the lecture content, and/or showing signs of engineering flair and professionalism beyond expectation at this stage.


10. An intern in the Independent Party campaign designs an independent SWIM/ping-based failure detection protocol, for an asynchronous distributed system, that works as follows. Assume there are N=M*K*R processes in the system (M, K, R, are positive integers, each > 2). Arrange these N processes in a MxKxR 3-dimensional matrix (tesseract), with M processes in each column, and K processes in each row, and R processes in the 3rd dimension (aisles). All processes maintain a full membership list, however pinging is partial. Each process Pijk (in i-th row and j-th column and k-th aisle) periodically (every T time units) marks a subset of its membership list as its Monitoring Set. The monitoring set of a given process, once selected, does not change. The monitoring set of Pijk contains: i) all the processes in its own column Pjk, ii) all the other processes in its own row Prk, and iii) all the processes in its own aisle Pij.
At this point, there are two options available to you:
Option 1 - Each process sends heartbeats to its monitoring set members.
Option 2 - Each process periodically pings all its monitoring set members; pings are responded to by acks, just like in the SWIM protocol (but there are no indirect pings or indirect acks).
Failure detection timeouts work as usual: Option 1 has the heartbeat receiver time out waiting for a heartbeat, while Option 2 has the pinging process (pinger) time out. The suspected process is immediately marked as failed. This is run in an asynchronous distributed system.
a. How many failures does Option 1 take to violate completeness? That is, find the value L so that if there are (L-1) simultaneous failures, all of them will be detected, but if there are L simultaneous failures then not all of them may be detected.
b. Answer the same question for Option 2.
c. An opposition party candidate claims that for K-R=2, both Option 1 and Option 2 provide completeness for all scenarios with up to (and including) 9 simultaneous failures. You gently respond that they are wrong and that it also depends on M. What are all the values of M (given K=R-2) for which your opponent's claim above is true? Justify your answer clearly.
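
To make the monitoring-set definition concrete, here is a small sketch under stated assumptions: indices are 0-based, the row index i ranges over M values, the column index j over K values, the aisle index k over R values, and a process does not monitor itself. The function name `monitoring_set` is hypothetical and is not part of the question.

```python
def monitoring_set(i, j, k, M, K, R):
    """Return the processes monitored by P_ijk as a set of (i, j, k) tuples.

    Assumption: the set holds every other process that shares two of the
    three coordinates with P_ijk (same column, same row, or same aisle).
    """
    column = {(x, j, k) for x in range(M) if x != i}  # vary the row index
    row = {(i, y, k) for y in range(K) if y != j}     # vary the column index
    aisle = {(i, j, z) for z in range(R) if z != k}   # vary the aisle index
    return column | row | aisle


# Under these assumptions each process monitors (M-1) + (K-1) + (R-1) others.
targets = monitoring_set(0, 0, 0, M=3, K=4, R=5)
print(len(targets))
```

Under these assumptions the relation is symmetric: if P monitors Q, then Q also monitors P.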


2. (You can use other websites for this question, but you should not cut and paste text. Please write answers in your own words!) Many of the campaigns are using Machine Learning, which of course benefits from GPUs. (Please limit your answer for each part to less than 50 words. Be concise!) Because the campaign has a limited budget, they are only looking at single GPU VM instances, i.e., only those that have one (and only one) GPU inside them (but an arbitrary amount of memory and CPUs).
a. They want to find the single GPU VM type across all 3 major cloud providers (AWS, Azure, Google Cloud) that has the highest available memory. Can you find it for them? (Note this refers to single GPU, not cloud instance.)
b. Repeat the previous question but find instead the lowest available memory.
c. What is the difference between a GPU and a "TPU" (among cloud offerings)?


QUESTION ONE (25 MARKS)
1.1 Define in detail what Big Data is and what IoT is. (10)
1.2 Suppose you are working for a company that wants to replace one of its old systems with a new one that involves handling big data and the Internet of Things. You have been assigned to lead the project, and the company management is still undecided between the old and the new system. What advantages of big data and the Internet of Things could you use to convince the management to accept the new project? Also address their disadvantages. (15)


QUESTION TWO (25 MARKS)
2.1 List five big data analysis techniques and explain how each is applied. (10)
2.2 Suppose you are working for a company as a project manager and you are required to design a home automation system for a client using the Internet of Things. Give a brief description of examples of devices that you might connect, and show your connections in a simple, well-labelled graph. (15)
QUESTION THREE (20 MARKS)


6. Questioning and Reforming the election system seem all the rage nowadays. There are some ways distributed systems folks can help with elections. Someone at the election office thinks MapReduce could be useful for "instant runoff voting" in primaries. (Fun fact: several states, including Alaska, now use instant runoff voting!)
Here's how instant runoff voting works. Consider an election with three candidates on the ballot: A, B, C. Every voter ranks the three candidates as first preference, second preference, and last preference. Between any two candidates X and Y, if a majority of voters ranked X above Y, then X dominates Y (and vice versa). Note that this only takes into account X and Y's relative rankings, not where they appear in the preference order, or where the third candidate appears. A Condorcet winner is a candidate that dominates all other candidates (pairwise) on the ballot. By definition an election can have at most one Condorcet winner (however, there may be zero).
You are given a dataset of votes from N voters (N is odd and large, and so the dataset is sharded), where each vote V has three fields V.1, V.2, V.3, respectively for the first, second, and third preference votes of that voter. Each line of input is one such voter's vote V (input to the initial Map function). Write a MapReduce program that outputs either the unique single Condorcet winner among the three candidates A, B, or C, or if there is no single Condorcet winner, then it outputs the list of candidate(s) with the highest Condorcet count (those that dominate the largest number of candidates).
For background: in MapReduce, one writes a program for Map that processes one input line at a time and outputs zero or more (key, value) pairs; and one writes a program for Reduce that processes an input of (key, all values for key). The iteration over input lines is done automatically by the MapReduce framework. You can assume this data is already sharded in HDFS and can be loaded from there. Each line is one vote V.
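
The dominance and Condorcet-count definitions above can be sketched as a two-stage map/reduce pipeline. This is one possible formulation under stated assumptions, simulated locally in plain Python rather than run on a real MapReduce framework. All function names (`map_vote`, `reduce_pair`, `reduce_dominance`, `shuffle`, `condorcet`) and the single `"dominance"` key are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def map_vote(vote):
    """Map 1: for each candidate pair, emit (pair, candidate ranked higher)."""
    for hi, lo in combinations(range(3), 2):  # position hi is ranked above lo
        pair = tuple(sorted((vote[hi], vote[lo])))
        yield pair, vote[hi]


def reduce_pair(pair, preferred):
    """Reduce 1: emit the majority winner of one pairwise contest."""
    x, y = pair
    x_votes = sum(1 for p in preferred if p == x)
    # N is odd in the question, so a pairwise contest cannot tie.
    winner = x if x_votes > len(preferred) - x_votes else y
    yield "dominance", winner  # one shared key, so stage 2 sees all contests


def reduce_dominance(winners):
    """Reduce 2: candidates that dominate the largest number of others."""
    counts = Counter(winners)
    best = max(counts.values())
    return sorted(c for c, n in counts.items() if n == best)


def shuffle(pairs):
    """Group values by key, as the framework would between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def condorcet(votes):
    stage1 = shuffle(kv for vote in votes for kv in map_vote(vote))
    stage2 = shuffle(kv for pair, vals in stage1.items()
                     for kv in reduce_pair(pair, vals))
    return reduce_dominance(stage2["dominance"])


print(condorcet([("A", "B", "C"), ("A", "C", "B"), ("B", "A", "C")]))
```

With three candidates, a Condorcet winner is the only candidate that wins two pairwise contests, so a single-element result is the Condorcet winner. A preference cycle gives every candidate one win and returns all three.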


8. One of the less popular candidates is polling at very small numbers in most of the states. They want to analyze the "topology-aware gossip" protocol you've seen in lecture. However, instead of the lecture slide example of 2 subnets joined by 1 router, here we have a total of N nodes (processes), evenly spread out across √N subnets (each subnet containing √N nodes), all joined by 1 router. The subnets are numbered S0, S1, S2, ... S(√N-1). You can assume all nodes have a full membership list, and there are no failures (messages or processes).
The topology-aware gossip works as follows. Consider a process Pj choosing gossip targets. The process' gossip targets depend on the subnet Si that it lies in. During a gossip round, the process Pj selects either b "inside-subnet Si gossip targets" with probability (1-1/√N), OR b "outside-subnet Si gossip targets" with probability 1/√N. The only "restriction" is that after process Pj is infected, for the next O(log(√N)) rounds Pj picks only inside-subnet targets (no outside-subnet targets); thereafter in a gossip round at Pj, either all its targets are inside-subnet or all are outside-subnet. Inside-subnet gossip targets from Pj (in Si) are selected uniformly at random from among the processes of Si. Outside-subnet gossip targets from Pj (in Si) are only picked from the processes in the "next" subnet S((i+1) mod √N), and they are picked uniformly at random from the processes lying in that "next" subnet. The gossiping of a message does not stop (i.e., it is gossiped forever based on the above protocol).
Does this topology-aware gossip protocol satisfy both the requirements of: (i) O(log(N)) average dissemination time for a gossip (with one sender from any subnet), and (ii) an O(1) messages/time unit load on the router at any time during the gossip spread? Justify your answers.
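
The target-selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not part of the question: processes are numbered 0..N-1 with process p in subnet p // √N, the O(log(√N)) restriction is taken to be exactly ceil(log2(√N)) rounds, a process never targets itself, and the b targets in a round are distinct. The function name `pick_gossip_targets` and its parameters are hypothetical.

```python
import math
import random


def pick_gossip_targets(node, n, b, rounds_since_infected, rng):
    """Pick b gossip targets for `node` in one round of the protocol."""
    s = math.isqrt(n)  # number of subnets and nodes per subnet
    if s * s != n:
        raise ValueError("n must be a perfect square")
    subnet = node // s
    # Assumed constant for the O(log(sqrt(N))) inside-only period.
    quiet_rounds = math.ceil(math.log2(s))
    go_outside = rounds_since_infected >= quiet_rounds and rng.random() < 1 / s
    # Outside targets come only from the "next" subnet, wrapping around.
    target_subnet = (subnet + 1) % s if go_outside else subnet
    candidates = [p for p in range(target_subnet * s, (target_subnet + 1) * s)
                  if p != node]
    return rng.sample(candidates, b)  # b distinct targets, all in one subnet


rng = random.Random(0)
print(pick_gossip_targets(node=5, n=16, b=2, rounds_since_infected=0, rng=rng))
```

Because each round commits to a single subnet before sampling, a round's targets are either all inside-subnet or all outside-subnet, as the protocol requires.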