
Big Data Homework Help | Big Data Assignment Help

Excel in Exams with Expert Big Data Homework Help Tutors.


Trusted by 1.1M+ Happy Students

Place An Order and save time
Get instant homework help from top tutors, just a WhatsApp message away. 24/7 support for all your academic needs!

Big Data Assignment Help: Your Ultimate Guide to Excelling in Big Data Analytics

In the vast and rapidly evolving field of Big Data, students and professionals alike need guidance and support. TutorBin stands as a beacon for those navigating the complexities of Big Data projects, assignments, and homework. Our dedicated team of experts is committed to supporting your journey in Big Data analytics, ensuring you not only meet but exceed your academic and professional goals.

Why Big Data Matters

Big Data refers to volumes of structured and unstructured data so large and complex that they are difficult to process with traditional tools. Its significance lies in the ability to capture, curate, analyze, and visualize vast amounts of information, leading to insights that drive informed decision-making across sectors. Whether it's enhancing customer experience, optimizing operations, or advancing research, Big Data plays a pivotal role in today's data-driven world.

How We Can Help

Expert Assistance at Your Fingertips

TutorBin is home to a select team of Big Data experts, each holding an advanced degree and extensive experience in the field. Our experts come from prestigious backgrounds, including top-tier universities and leading healthcare and technology organizations, ensuring that the support you receive is grounded in real-world expertise and academic excellence.

Comprehensive Big Data Analytics Support

From data collection and preparation to advanced analytics and visualization, our experts cover all aspects of Big Data analytics. We provide assistance with a variety of tools and frameworks, including Tableau, Spark, Hadoop, and many more, ensuring you're equipped with the knowledge to tackle any Big Data challenge.

Big Data analytics is a critical skill set for the future, and with TutorBin, you're in good hands. Whether you're aiming to excel in academic projects or enhance your professional expertise, our comprehensive support system is designed to help you navigate the complexities of Big Data with ease.

Ready to Take the Next Step?

If you're facing challenges with your Big Data assignments or simply wish to learn more about this fascinating field, TutorBin is here to help. Visit us today to discover how our experts can assist you in achieving your Big Data analytics goals and propel you towards success in the digital age.

Recently Asked Big Data Questions

Expert help when you need it
  • Q1: Learning outcomes: perform mass and energy balances, identifying all key and utility streams and specifying their flow and composition data; use computer-based flowsheet software to perform mass and energy balances; assess potential hazards; assess the sustainability of the plant; select materials for chemical engineering equipment. Marking bands by criterion:
    Mass and energy balances (25 marks): 0-5, limited and incomplete balances, calculations not given or explained; 6-10, balances for the process with major errors or omissions, process assumptions not given or unrealistic; 11-15, complete balances for the whole process with some errors and omissions; 16-20, complete balances with no obvious errors, utilities specified completely; 21-25, complete balances presented in a highly professional and informative format beyond expectation at this stage.
    Hysys flowsheet (25 marks): 0-5, limited or no Hysys flowsheet created; 6-10, flowsheet does not include all the process units (e.g., some units not working), no comparison with manual calculations presented; 11-15, flowsheet for the whole process created with some errors or omissions, some comparison with manual calculations presented; 16-20, flowsheet created with minor errors, advanced features (i.e., adjustment) used, detailed comparisons with manual calculations and good discussion presented; 21-25, sophisticated flowsheet for the whole process demonstrating knowledge beyond the lecture content, detailed comparisons with excellent discussion.
    Process safety (20 marks): 0-4, poor selection of process safety aspects, no details on how these relate to the overall design, report lacks any discussion and criticism of the process safety methods used; 5-8, basic report with limited evidence of understanding of how the chosen safety aspects relate to the design process, discussion largely superficial and containing misconceptions; 9-12, acceptable report with an appropriate selection of multiple aspects of process safety, a range of potential hazards identified and properly considered within the context of design, discussion and criticism of the methods supported by references; 13-16, excellent report with a well-thought-out, systematic, and critical review outlining how design decisions related to the chosen safety aspects, a clear summary of process hazards with a discussion of the challenges, benefits, advantages, and disadvantages of the methods, well referenced throughout; 17-20, as before, but with additional signs of creativity and innovation beyond the lecture content, and original, insightful discussion presented in a highly professional and informative format.
    Sustainability (15 marks): 0-3, poor report lacking a logical and systematic assessment of positive and negative impacts on all three pillars of sustainability (or not assessing all three pillars), generally incorrect or missing identification of risks and mitigations, showing a lack of understanding of lecture content; 4-6, basic report assessing all three pillars logically and systematically, some risks mentioned but mitigation and prioritisation incorrect or missing, generally showing an understanding of lecture content; 7-9, acceptable report with good identification of risks, evidence of prioritisation and proposed mitigation, and some mention of temporal or geographical scales or of unusual or emergency operations; 10-12, excellent report fully assessing all three pillars (e.g., using the 4 P's), important risks identified by prioritising likelihood and impact with sensible mitigations proposed, quantitative targets for improvement provided, and consideration of temporal and geographical scales as well as unusual and emergency operations; 13-15, exemplary report incorporating all points from the previous descriptor, additionally demonstrating knowledge beyond the lecture content and/or engineering flair and professionalism beyond expectation at this stage.
    Materials selection (15 marks): 0-3, poor report where only some of the mechanical, thermal, and chemical demands on the 3 key units are appraised, discussion of candidate materials missing and/or selection poor or incorrect, showing a lack of understanding of lecture content; 4-6, basic report where most demands are appraised, an incomplete assessment of the required material properties, and a discussion leading to the selection of an appropriate material, generally showing an understanding of lecture content; 7-9, acceptable report where the demands on the 3 key units are fully appraised with consideration of unusual or emergency operations, with quantification and referencing, and a discussion leading to the selection of the most appropriate material; 10-12, excellent report with a convincing assessment of the required material properties, an appropriate range of engineering materials identified for each of the 3 key units, and an analytical discussion leading to the selection of the most appropriate one; 13-15, exemplary report incorporating all points from the previous descriptor, additionally demonstrating knowledge beyond the lecture content and/or engineering flair and professionalism beyond expectation at this stage. (Newcastle University) See Answer
  • Q2: Independent Study - Overview. The goal of this independent study is to go beyond the scope of this class and explore the Big Data field in a direction of your interest that was not covered in the class. Your job is to choose a Big Data topic of interest not covered in the class, explore the topic and learn the related techniques and skills, and finally implement a Big Data task using the acquired techniques/skills. Part 1 - Proposal (due July 12) (25%). Propose 2 topics of your interest related to Big Data challenges, solutions, and techniques that were not covered in the class, ranked in order of priority. You are required to explore only one topic, and you must start from your first-choice topic; prior to the Part 2 submission deadline, if you don't feel comfortable with your first choice, you can switch to your second topic, but you can no longer change your topic after submitting the first draft of the final report (the Part 2 submission). For each topic: provide a brief description; identify the specific learning objectives you will pursue (at least 3; to make this list, you might need to dig deeper into your topic and explore what you will learn along the way); make a list of potential resources to study; and propose a Big Data task related to your topic that you can implement using the techniques and skills you have learnt. If your topic involves analyzing or modeling a dataset, the dataset must be at least 50 MB. Deliverables: a single proposal document. See Answer
  • Q3: QUESTION ONE (25 MARKS) 1.1 Define Big Data and IoT in detail. (10) 1.2 Suppose you are working for a company that wants to replace one of its old systems with a new one involving Big Data and the Internet of Things. You have been allocated to lead the project, and management is still in a dilemma about choosing between the old and new systems. What advantages of Big Data and the Internet of Things might you use to convince management to accept the new project? Address their disadvantages as well. (15) See Answer
  • Q4: QUESTION TWO (25 MARKS) 2.1 List five Big Data analysis techniques and define how each is applied. (10) 2.2 Suppose you are working for a company as a project manager and are required to design home automation for a client using the Internet of Things. Briefly describe examples of devices you might connect, and show your connections in a simple, well-labelled diagram. (15) See Answer
  • Q5: QUESTION THREE (20 MARKS) 3.1 List and describe the IoT architecture components. (12) 3.2 Describe the two major layers and two other supporting modules of Hadoop. (8) See Answer
  • Q6: QUESTION FOUR (20 MARKS) 4.1 List and describe five (5) security technologies applied in Big Data. (10) 4.2 List five (5) ethical principles and five (5) policies that govern the use and implementation of Big Data. (10) See Answer
  • Q7:Problems: 1. (You can use other websites for this question, but you should not cut and paste text. Please write answers in your own words!) One of the candidates who used to coach a former president now wants to beat him in the election. This candidate is known for thinking out of the box. While his campaign has been running their services on AWS EC2 instances, they recently became aware of serverless cloud services. Can you help them? (Please limit your answer for each part to less than 50 words. Be concise!) a. What is the key difference between AWS Lambda and AWS EC2? b. What are the two key differences between AWS Lambda and AWS spot instances (think: pricing and how long instances last)? c. What are the names of the corresponding Microsoft Azure and Google Cloud counterparts (names) of Amazon Lambda? d. Give one example application (class) where you would prefer AWS EC2, and one where you would prefer AWS Lambda. Justify your choices briefly.See Answer
  • Q8: 2. (You can use other websites for this question, but you should not cut and paste text. Please write answers in your own words!) Many of the campaigns are using Machine Learning, which of course benefits from GPUs. (Please limit your answer for each part to less than 50 words. Be concise!) Because the campaign has a limited budget, they are only looking at single-GPU VM instances, i.e., only those that have one (and only one) GPU inside them (but an arbitrary amount of memory and CPUs). a. They want to find the single-GPU VM type across all 3 major cloud providers (AWS, Azure, Google Cloud) that has the highest available memory. Can you find it for them? (Note this refers to the single GPU, not the cloud instance.) b. Repeat the previous question but find instead the lowest available memory. c. What is the difference between a GPU and a "TPU" (among cloud offerings)? See Answer
  • Q9: 3. (Note: From this question onwards you CANNOT use websites but can only use course material. Also, for all Mapreduce questions please only use the class definition/template for Mapreduce jobs, i.e., Map/Reduce tasks/functions, and not other sources from the Web, since there are many different variants on the Web!) One of the candidates always imitates and emulates a previous president. This candidate believes they will win by becoming the most popular person on social media. An intern in their campaign wants to write a Mapreduce program. In MapReduce, one writes a program for Map that processes one input line at a time and outputs zero or more (key, value) pairs; and one writes a program for Reduce that processes an input of (key, all values for key). The iteration over input lines is done automatically by the MapReduce framework. The intern would like to know which influential Twitter users are most similar to their candidate, and would like to use Hadoop for this. The intern uses an input file containing information from Twitter (which is an asymmetrical social network) about which users "follow" which other users. If user a follows b, the entry line is (a, b); you can assume this data is already sharded in HDFS and can be loaded from there. Can you help the intern? Write a MapReduce program (Map and Reduce separately) that outputs the list of all users U who satisfy the following three conditions simultaneously: U has at least 100 million followers, U herself/himself follows fewer than 10 users, and U follows at least one user V who in turn has at least 10 million followers (e.g., @BarackObama would be such a U). You can chain Mapreduces if you want (but only if you must, and even then, only the fewest possible). Your program must be as highly parallelizable as possible. Correctness is paramount, though you should try to make it as fast as possible. (A minimal sketch of the per-user counting stage appears after this list.) See Answer
  • Q10:4. A rival campaign manager believes that finding the best donors is the way to go. They use the same dataset from the previous question to instead find all user pairs (U,V) such that: (i) both U and V have at least 100 million followers each, and (ii) U and V follow at least 100 accounts in common (excluding each other). Note that U and V may or may not follow each other (either way)! Write a Mapreduce program for this. Same instructions as the first Mapreduce question in this series apply.See Answer
  • Q11: 5. One of the social media billionaires is considering running for President. They run a social media platform named Quitter, and they have access to a lot of data inside the company. As an intern in this campaign, you have the same social network dataset (named D1) specified in the previous question ((a,b) directed pairs indicating a follows b), but you also have an additional dataset (named D2) with entries (a, start_time, end_time) indicating that user a was online starting at start_time and ending at end_time. The data is only for one day. All times are hh:mm:ss. However, each user a may have multiple entries in D2 (since users may log in multiple times). Write a Mapreduce program that extracts all pairs of users (a,b) such that: (i) a and b follow each other, and (ii) a and b were online simultaneously at least once during that day. Same instructions as the first Mapreduce question in this series apply. Please ensure that a Map stage reads data from only one input dataset (i.e., if a Map reads directly from D2, don't use it to also read from D1, and vice versa); this is consistent with good Map programming practices. See Answer
  • Q12: 6. Questioning and reforming the election system seem all the rage nowadays. There are some ways distributed systems folks can help with elections. Someone at the election office thinks MapReduce could be useful for "instant runoff voting" in primaries. (Fun fact: several states, including Alaska, now use instant runoff voting!) Here's how instant runoff voting works. Consider an election with three candidates on the ballot: A, B, C. Every voter ranks the three candidates as first preference, second preference, and last preference. Between any two candidates X and Y, if a majority of voters ranked X above Y, then X dominates Y (and vice versa); note that this only takes into account X and Y's relative rankings, not where they appear in the preference order, or where the third candidate appears. A Condorcet winner is a candidate that dominates all other candidates (pairwise) on the ballot. By definition an election can have at most one Condorcet winner (however, there may be zero). You are given a dataset of votes from N voters (N is odd and large, so the dataset is sharded), where each vote V has three fields V.1, V.2, V.3, respectively the first, second, and third preference votes of that voter. Each line of input is one such voter's vote V (the input to the initial Map function). Write a MapReduce program that outputs either the unique single Condorcet winner among the three candidates A, B, or C, or, if there is no single Condorcet winner, the list of candidate(s) with the highest Condorcet count (those that dominate the most candidates). For background: in MapReduce, one writes a program for Map that processes one input line at a time and outputs zero or more (key, value) pairs; and one writes a program for Reduce that processes an input of (key, all values for key). The iteration over input lines is done automatically by the MapReduce framework. You can assume this data is already sharded in HDFS and can be loaded from there. (A sketch of the pairwise-preference Map/Reduce appears after this list.) See Answer
  • Q13: 7. At a presidential debate, one of the candidates loudly proclaims, "You idiots are so slow!". Then the moderator asks, "Can you elaborate please?" At a loss for words, the candidate reaches deep into their CS425 knowledge and screams, "You're all so slow! You're all doing push gossip. I do pull gossip, and even with fixed fanout, it converges in O(log(log(N))) time!" Are they right? If yes, give a proof (informal proof ok). If they are wrong, give a proof (informal proof ok). (Note: the push gossip and pull gossip mentioned here are the same protocols discussed in lecture.) (A toy simulation comparing push and pull appears after this list.) See Answer
  • Q14: 8. One of the less popular candidates is polling at very small numbers in most of the states. They want to analyze the "topology-aware gossip" protocol you've seen in lecture. However, instead of the lecture slide example of 2 subnets joined by 1 router, here we have a total of N nodes (processes), evenly spread across √N subnets (each subnet containing √N nodes), all joined by 1 router. The subnets are numbered S0, S1, S2, ..., S(√N-1). All these √N subnets are connected together via 1 router. You can assume all nodes have a full membership list, and there are no failures (of messages or processes). The topology-aware gossip works as follows. Consider a process Pj choosing gossip targets. The process' gossip targets depend on the subnet Si that it lies in. During a gossip round, the process Pj selects either b "inside-subnet Si gossip targets" with probability (1-1/√N), OR b "outside-subnet Si gossip targets" with probability 1/√N. The only restriction is that after process Pj is infected, for the next O(log(√N)) rounds Pj picks only inside-subnet targets (no outside-subnet targets); thereafter, in a gossip round at Pj, either all its targets are inside-subnet or all are outside-subnet. Inside-subnet gossip targets from Pj (in Si) are selected uniformly at random from among the processes of Si. Outside-subnet gossip targets from Pj (in Si) are only picked from the processes in the "next" subnet S((i+1) mod √N), and they are picked uniformly at random from the processes lying in that "next" subnet. The gossiping of a message does not stop (i.e., it is gossiped forever based on the above protocol). Does this topology-aware gossip protocol satisfy both of the requirements of: (i) O(log(N)) average dissemination time for a gossip (with one sender from any subnet), and (ii) an O(1) messages/time unit load on the router at any time during the gossip spread? Justify your answers. See Answer
  • Q15: 9. One of the campaigns is always looking for shortcuts. Their distributed system uses a failure detector, but to "make it faster" they have made the following changes. For each of these changes (in isolation), state the one biggest advantage and the one biggest disadvantage of the change (and why). Keep each answer under 50 words (give brief justifications). a. They use Gossip-style failure detection, but they set Tcleanup = 0. b. They use SWIM-style failure detection, but they removed the Suspicion feature. c. They use SWIM-style failure detection, but they removed the round-robin pinging + random permutation, and instead just randomly select each ping target. See Answer
  • Q16: 10. An intern in the Independent Party campaign designs an independent SWIM/ping-based failure detection protocol, for an asynchronous distributed system, that works as follows. Assume there are N=M*K*R processes in the system (M, K, R are positive integers, each > 2). Arrange these N processes in an MxKxR 3-dimensional matrix, with M processes in each column, K processes in each row, and R processes in the 3rd dimension (aisles). All processes maintain a full membership list; however, pinging is partial. Each process Pijk (in the i-th row, j-th column, and k-th aisle) periodically (every T time units) marks a subset of its membership list as its Monitoring Set. The monitoring set of a given process, once selected, does not change. The monitoring set of Pijk contains: (i) all the other processes in its own column, (ii) all the other processes in its own row, and (iii) all the other processes in its own aisle. At this point, there are two options available to you. Option 1 - each process sends heartbeats to its monitoring set members. Option 2 - each process periodically pings all its monitoring set members; pings are responded to by acks, just like in the SWIM protocol (but there are no indirect pings or indirect acks). Failure detection timeouts work as usual: Option 1 has the heartbeat receiver time out waiting for a heartbeat, while Option 2 has the pinging process (pinger) time out. The suspected process is immediately marked as failed. This is run in an asynchronous distributed system. a. How many failures does Option 1 take to violate completeness? That is, find the value L so that if there are (L-1) simultaneous failures, all of them will be detected, but if there are L simultaneous failures then not all of them may be detected. b. Answer the same question for Option 2. c. An opposition party candidate claims that for K=R=2, both Option 1 and Option 2 provide completeness for all scenarios with up to (and including) 9 simultaneous failures. You gently respond that they are wrong and that it also depends on M. What are all the values of M (given K=R=2) for which your opponent's claim is true? Justify your answer clearly. See Answer
  • Q17: Submit a ZIP file that includes a Word document with a cover page containing the names of your team members, and each of the steps outlined below clearly identified with a title. Also include your data sources in the ZIP file for submission. Please provide thorough comments on your steps and work. Failure to comply with the submission guidelines will result in penalties. 1. Identify a data source of your choice (see https://donnees.montreal.ca/) and provide the link to your data source in your Word document. Describe your data source in your Word document. Proceed with data verification and assess its quality. Identify and perform any necessary data preprocessing, if needed. (20 points) 2. Add your data source to HDFS in your Hadoop environment. Include your steps in your Word document. (20 points) 3. Identify a first processing task for this data source. Create and test your MapReduce code in your Hadoop environment. Use comments to clearly identify each step of your MapReduce code. Describe your processing task in one to two sentences and include it in your Word document. (30 points) 4. Identify a second processing task (different from the first processing task in step 3) for this data source. Create and test your Spark SQL code in your Hadoop environment. The use of temporary tables is not allowed in your project. Use comments to clearly identify each step of your Spark SQL code. Describe your processing task in one to two sentences and include it in your Word document. (30 points) You will be evaluated on the consistency of your processing tasks, the completeness and detail of your Word document compared to the specifications, and the optimality and quality of the code. To propose consistent work, draw inspiration from the various practices done in class rather than simply replicating the same examples covered in those practices. (An illustrative PySpark outline for step 4 appears after this list.) See Answer
  • Q18: Netflix Data Analysis using Pig. [Only the section headings of this question survive extraction: Project Description, Documentation, What to Submit, Setting up your Project, and "Optional: Use your laptop to develop your project".] See Answer
  • Q19: Introduction. For this course, Units 2, 4, 6, and 8 contain examples of research related to the content of the chapter. Instructions on how to analyze the assigned research example are provided separately in the course shell. This exercise is intended to give you the opportunity to interpret research content for application in the workplace; in this way, you gain experience in researching and understanding how topics apply to the current environment. Directions: For the final selection project, you are presenting your findings and reflections on how the research can be applied to your current or future job. Your presentation should summarize the responses for each research assignment rather than copying the previous assignments onto the PowerPoint slides; compile the most important parts of your research to present as a report. Submission requirements: review the grading rubric prior to completing this assignment; fully answer each area within the assignment details; submit at least 10 slides, not counting the title page, reference page, or charts; add an audio clip to each slide with a 2-3 minute explanation; follow proper APA formatting; include at least two outside sources from the past 2 years. Grading rubric (Research Analysis - Summary Presentation):
    Content (15 pts): 15 to >10, meets expectations (summarizes research from previous units with a conclusion section in a brief, clear PowerPoint presentation; displays exceptional familiarity with content, evidenced by a strong summary); 10 to >5, developing (summary with conclusion but not as concise and clear; displays a good understanding of content, evidenced by summary reflections); 5 to >0, does not meet expectations (fails to respond to most requirements; no evident analysis or conclusions drawn from the material).
    Length and format (10 pts): 10 to >7, includes all 8 question responses, a recommendation and conclusion section, and a reference list; 7 to >4, includes most of the 8 question responses; 4 to >0, includes fewer than 5 responses, with little or no recommendation, conclusion, or reference list.
    Citations (10 pts): 10 to >7, properly cites reference materials; 7 to >4, a few incorrect or missing citations and/or reference list entries; 4 to >0, multiple incorrect or missing citations and/or reference list entries.
    Grammar and spelling (5 pts): 5, exceptional use of proper English, free of typographical and grammatical errors, APA format used; 3, proper English, generally error-free, APA format partially used; 0, multiple errors, APA not utilized. See Answer
  • Q20: CS 6240: Assignment 4. Goals: (1) Gain a deeper understanding of action, transformation, and lazy execution in Spark. (2) Implement PageRank in MapReduce and Spark. This homework is to be completed individually (i.e., no teams). You must create all deliverables yourself from scratch: it is not allowed to copy someone else's code or text, even if you modify it. (If you use publicly available code/text, you need to indicate what was copied and cite the source in your report!) Please submit your solution as a single PDF file on Gradescope (see link in Canvas) by the due date and time shown there. During the submission process, you need to tell Gradescope on which page the solution to each question is located; not doing this will result in point deductions. In general, treat this like a professional report. There will also be point deductions if the submission is not neat, e.g., if it is poorly formatted. (We want our TAs to spend their time helping you learn, not fixing messy reports or searching for solutions.) For late submissions you will lose one point per hour after the deadline. This HW is worth 100 points and accounts for 15% of your overall homework score. To encourage early work, you will receive a 10-point bonus if you submit your solution on or before the early submission deadline stated on Canvas. (Notice that your total score cannot exceed 100 points, but the extra points would compensate for any deductions.) To enable the graders to run your solution, make sure your project includes a standard Makefile with the same top-level targets (e.g., local and aws) as the one presented in class. As with all software projects, you must include a README file briefly describing all the steps necessary to build and execute both the standalone and the AWS Elastic MapReduce (EMR) versions of your program. This description should include the build commands and fully describe the execution steps. This README will also be graded, and you will be able to reuse it on all this semester's assignments with little modification (assuming you keep your project layout the same). You have about 2 weeks to work on this assignment. Section headers include recommended timings to help you schedule your work. The earlier you work on this, the better.
    Important programming reminder: As you are working on your code, commit and push changes frequently. The commit history should show a natural progression of your code as you add features and fix bugs. Committing large, complete chunks of code may result in significant point loss. (You may include existing code for standard tasks like adding files to the file cache or creating a buffered file reader, but then the corresponding commit comment must indicate the source.) If you are not sure, better to commit too often than not often enough.
    PageRank in Spark (Week 1): In addition to implementing a graph algorithm from scratch to better understand the BFS design pattern and the influential PageRank algorithm, the first part of this assignment also explores the subtleties of Spark's actions and transformations, and how they affect lazy evaluation and job submission. We will work with synthetic data to simplify the program a little and to make it easier to create inputs of different sizes. Thoughtful creation of synthetic data is an important skill for big-data program design, testing, and debugging. Recall that Spark transformations describe data manipulations but do not trigger execution; this is the "lazy evaluation" in Spark. Actions, on the other hand, force an immediate execution of all operations needed to produce the desired result. Stated differently, transformations only define the lineage of a result, while actions force the execution of that lineage. What will happen when an iterative program performs both actions and transformations in a loop? What goes into the lineage after 1, 2, or more loop iterations? And will the entire lineage be executed? Let us find out by exploring a program that computes PageRank with dangling pages for a simple synthetic graph. (An illustrative PySpark sketch of this lineage experiment appears after this list.) Your program should work with two data tables: Graph stores pairs (p1, p2), each encoding a link from some page p1 to another page p2; Ranks stores pairs (p, pr), encoding the PageRank pr for each page p. To fill these tables with data, create a graph that consists of k linear chains, each with k pages. Number the pages from 1 to k², where k is a program parameter to control problem size. [Figure omitted: the example graph for k=3, i.e., three linear chains over pages 1 through 9.] Notice that the last page in each linear chain is a dangling page. We will use the single-dummy-page approach to deal with dangling pages. This means that your program also must create a single dummy page; let's give it the number 0 (zero) and add it to Ranks. Add an edge (d, 0) for each dangling page d. Set the initial PR value for each of the k² real pages in Ranks to 1/k²; set the initial PR value of the dummy page to 0.
    For simplicity, we recommend you implement the program using (pair) RDDs, but you may choose to work with DataSet instead. The following instructions assume an RDD-based implementation. Start by exploring the PageRank Scala program included in the Spark distribution. Make sure you fully understand what each statement is doing. Create a simple example graph and step through the program, e.g., on paper or using the interactive Spark shell. You will realize that the example program does not handle dangling pages, i.e., dangling pages lose their PR mass in each iteration. Can you find other problems? Your program will have a structure similar to the example program, but follow these requirements and suggestions:
    • You are allowed to take certain shortcuts in your program that exploit the special graph structure. In particular, you may exploit that each node has at most 1 outgoing link. Make sure you add a comment about this assumption in your code.
    • Make k a parameter of your Spark Scala program and generate RDDs Graph and Ranks directly in the program. There are many examples on the Web of how to create lists of records and turn them into (pair) RDDs.
    • Make sure you add dummy page 0 to Ranks and the corresponding k dummy edges to Graph. Initialize each PR value in Ranks to 1/k², except for page 0, whose initial PR value should be zero. Be careful when you look at the example PR program in the Spark distribution: it sets initial PR values to 1.0, and its PR computation adds 0.15 instead of 0.15/#pages for the random jump probability. Intuitively, they multiply each PR value by #pages. While that is a valid approach, it is not allowed for this assignment.
    • Try to ensure that Graph and Ranks have the same Partitioner to avoid shuffling for the join. Check if the join computes exactly what you want. Does it matter if you use an inner or an outer join in your program?
    • To read out the total dangling PR mass accumulated in dummy page 0, use the lookup method of pair RDD. Then re-distribute this mass evenly over all real pages.
    When debugging your program, see if the PR values add up to 1 after each iteration. Small variations are expected, especially for large graphs, due to numerical precision issues. However, if the PR sum significantly deviates from 1, this may indicate a bug in your program. Add a statement right after the end of the for-loop (i.e., outside the loop) for the PR iterations to write the debug string of Ranks to the log file. Now you are ready to explore the subtleties of Spark lazy evaluation. First explore the lineage of Ranks as follows: set the loop condition so that exactly 1 iteration is performed and look at the lineage for Ranks; change the loop condition so that exactly 2 iterations are performed and look at the lineage for Ranks after those 2 iterations. Did it change? The lineage describes the job needed to compute the result of the action that triggered it. Since pair RDD's lookup method is an action, a new job is executed in each iteration of the loop. Can you describe in your own words what the job triggered in the i-th iteration computes? Try it. An interesting aspect of Spark, and a reason for its high performance, is that it can re-use previously computed results. This means that in practice, only a part of the lineage may get executed. To understand this better, consider the following simple example program:
    1. val myRDD1 = some_expensive_transformations_on_some_big_input()
    2. myRDD1.collect()
    3. val myRDD2 = myRDD1.some_more_transformations()
    4. myRDD2.collect()
    This program executes 2 jobs. The first is triggered by line 2, and it computes all steps defined by the corresponding transformations in the lineage of myRDD1. The next job is triggered by line 4. Since myRDD2 depends on myRDD1, all of myRDD1's lineage is also included in the lineage of myRDD2. But will Spark execute the entire lineage? What if myRDD1 was still available from the earlier job triggered by line 2? Then it would be more efficient for Spark to simply re-use the existing copy of myRDD1 and only apply the additional transformations to it! Use Spark textbooks and online resources to find out if Spark is smart enough to realize such RDD re-use opportunities. Then study this empirically in your PageRank program, where the lineage of Ranks in iteration i depends on all previous (i-1) iterations:
    1. Can you instrument your program with the appropriate printing or logging statements to find out execution details for each job triggered by an action in your program?
    2. See if you can find other ways to make Spark tell you which steps of an RDD lineage were executed, and when Spark was able to avoid execution due to the availability of intermediate results from earlier executions.
    3. Change the caching behavior of your program by using cache() or persist() on Ranks. Does it affect the execution behavior of your program? Try this for small k, then for really large k (so that Ranks might not completely fit into the combined memory of all machines in the cluster).
    Bonus challenge: For an optional 5-point bonus (final score cannot exceed 100), run your PageRank program on the Twitter followership data. If you took shortcuts for the synthetic data, e.g., by exploiting that no page has more than 1 outgoing link, you need to appropriately generalize your program to work correctly on the Twitter data.
    PageRank in MapReduce (Week 2): Implement the PageRank program in MapReduce and run it on the synthetic graph. You may choose any of the methods we discussed in the module and in class for handling dangling pages, including global counters (try if you can read them out in the Reduce phase) and order inversion. In contrast to the Spark program, generate the synthetic graph in advance and feed it as an input file to your PageRank program. Follow the approach from the module and store the graph as a set of vertex objects (which could be encoded as Text), each containing the adjacency list and the PageRank value. Since we will work with relatively small input, make sure that your program creates at least 20 Map tasks. You can use NLineInputFormat to achieve this.
    Report: Write a brief report about your findings, answering the following questions:
    1. [12 points] Show the pseudo-code for the PR program in Spark Scala. Since many Scala functions are similar to pseudo-code, you may copy-and-paste well-designed (good variable naming!) and well-commented Scala code fragments here. Notes: your program must support k and the number of PR iterations as parameters; your program may take shortcuts to exploit the structure of the synthetic graph, in particular that each page has at most 1 outgoing link. (Your program should work on the synthetic graphs, no matter the choice of k>0, but it does not need to work correctly on more generally structured graphs.)
    2. [10 points] Show the link to the source code for this program in your Github Classroom repository.
    3. [10 points] Run the PR program locally (not on AWS) for k=100 for 10 iterations. Report the PR values your program computed for pages 0 (dummy), 1, 2, ..., 19.
    4. [19 points] Run the PR program locally (not on AWS) for k=100. Set the loop condition so that exactly 1 iteration is performed and report the lineage for Ranks after that iteration. Change the loop condition so that exactly 2 iterations are performed and report the lineage for Ranks after those 2 iterations. Then change the loop condition again so that exactly 3 iterations are performed and report the lineage for Ranks after those 3 iterations.
    5. [15 points] Find out if Spark executes the complete job lineage or if it re-uses previously computed results. Make sure you are not using cache() or persist() on the Ranks RDD. (You may use it on the Graph RDD.) Since the PR values in RDD Ranks in iteration 10 depend on Ranks from iteration 9, which in turn depends on Ranks from iteration 8, and so on, we want to find out if the job triggered by the lookup action in iteration 10 runs all 10 iterations from scratch, or if it uses Ranks from iteration 9 and simply applies one extra iteration to it.
    a. Let's add a print statement as the first statement inside the loop that performs an iteration of the PR algorithm. Use println(s"Iteration ${i}") or similar to print the value of loop variable i. The idea is to look at the printed messages to determine what happened. In particular, if a job executes the complete lineage, we might hope to see "Iteration 1" when the first job is triggered, then "Iteration 1" (again) and "Iteration 2" for the second job (because the second job includes the result of the first iteration in its lineage, i.e., a full execution from scratch would run iterations 1 and 2), then "Iteration 1," "Iteration 2," and "Iteration 3" when the third iteration's job is triggered, and so on. But would that really happen? To answer this question, show the lineage of Ranks after 3 iterations and report if adding the print statement changed the lineage.
    b. Remove the print statement, run 10 iterations for k=100, and look at the log file. You should see lines like "Job ... finished: lookup at ... took ..." that tell you the jobs executed, the action that triggered the job (lookup), and how long it took to execute. If Spark does not re-use previous results, the growing lineage should cause longer computation time for jobs triggered by later iterations. On the other hand, if Spark re-uses Ranks from the previous iteration, then each job runs only a single additional iteration, and hence job time should remain about the same, even for later iterations. Copy these lines from the log file for all jobs executed by the lookup action in the 10 iterations. Based on the times reported, do you believe Spark re-used Ranks from the previous iteration?
    c. So far we have not asked Spark to cache() or persist() Ranks. Will this change Spark's behavior? To find out, add ".cache()" to the command that defines Ranks in the loop. Run your program again for 10 iterations for k=100 and look at the log file. What changed after you added cache()? Look for lines like "Block ... stored as values in memory" and "Found block ... locally". Report some of those lines and discuss what they… See Answer
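
Illustrative Sketches for Selected Questions

The sketches below are illustrative only: they are not official course solutions, and the file names, input formats, thresholds, and column names they use are assumptions. Python is used throughout for consistency, even where a course asks for Java or Scala.

For Q9, a minimal Hadoop-Streaming-style sketch of just the per-user counting stage: the mapper tags each follow edge twice, and the reducer counts followers and followees per user and applies the two per-user filters. The cross-user condition (U follows some V with at least 10 million followers) would need a second, chained job over these results.

    #!/usr/bin/env python3
    # mapper.py (sketch) -- each input line "a,b" means user a follows user b.
    import sys

    for line in sys.stdin:
        a, b = line.strip().split(",")
        print(f"{a}\tOUT")  # a follows one more user
        print(f"{b}\tIN")   # b gains one follower

    #!/usr/bin/env python3
    # reducer.py (sketch) -- Hadoop Streaming delivers lines grouped and sorted
    # by key, so we can count per user with a simple key-change flush.
    import sys

    def flush(user, ins, outs):
        # Per-user filters from Q9: at least 100M followers, follows fewer than 10.
        if user is not None and ins >= 100_000_000 and outs < 10:
            print(f"{user}\t{ins}\t{outs}")

    cur, ins, outs = None, 0, 0
    for line in sys.stdin:
        user, tag = line.strip().split("\t")
        if user != cur:
            flush(cur, ins, outs)
            cur, ins, outs = user, 0, 0
        if tag == "IN":
            ins += 1
        else:
            outs += 1
    flush(cur, ins, outs)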
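For Q12, one possible shape under the same caveats: the mapper emits, for every candidate pair, which one the voter ranked higher; the reducer takes a per-pair majority to decide who dominates whom. Because there are only three candidates, a tiny driver step can then tally the (at most three) domination lines and report the Condorcet winner, or the candidate(s) with the highest Condorcet count.

    #!/usr/bin/env python3
    # mapper.py (sketch) -- each line is one vote, e.g. "A,C,B" = 1st,2nd,3rd preference.
    import sys
    from itertools import combinations

    for line in sys.stdin:
        prefs = line.strip().split(",")
        # x precedes y in prefs, so this voter ranks x above y.
        for x, y in combinations(prefs, 2):
            pair = ",".join(sorted((x, y)))  # canonical key for the pair
            print(f"{pair}\t{x}")

    #!/usr/bin/env python3
    # reducer.py (sketch) -- majority per pair decides domination
    # (N is odd, so there is no pairwise tie).
    import sys
    from collections import Counter

    def flush(pair, tally):
        if pair is None:
            return
        winner, _ = tally.most_common(1)[0]
        loser = [c for c in pair.split(",") if c != winner][0]
        print(f"{winner}\tdominates\t{loser}")

    cur, tally = None, Counter()
    for line in sys.stdin:
        pair, preferred = line.strip().split("\t")
        if pair != cur:
            flush(cur, tally)
            cur, tally = pair, Counter()
        tally[preferred] += 1
    flush(cur, tally)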
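For Q13, an empirical check can guide (but not replace) the proof. This toy round-based simulation compares push gossip (every infected process contacts one uniformly random target) with pull gossip (every uninfected process queries one uniformly random peer), both with fanout 1.

    #!/usr/bin/env python3
    # gossip_sim.py (sketch) -- toy round-based simulation of push vs. pull gossip.
    import math
    import random

    def rounds_to_spread(n, mode):
        infected = {0}          # process 0 starts with the rumor
        rounds = 0
        while len(infected) < n:
            rounds += 1
            if mode == "push":
                # Every infected process gossips to one random target.
                newly = {random.randrange(n) for _ in infected}
            else:
                # Pull: every uninfected process asks one random peer.
                newly = {p for p in range(n)
                         if p not in infected and random.randrange(n) in infected}
            infected |= newly
        return rounds

    n = 50_000
    print("push:", rounds_to_spread(n, "push"), "rounds")
    print("pull:", rounds_to_spread(n, "pull"), "rounds")
    print("log2(n) =", round(math.log2(n), 1))

Both variants take O(log N) rounds to infect the first half of the system; pull's O(log log N) behavior applies only to the final phase, once most processes are already infected.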
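For Q17's Spark SQL step, temporary tables (createOrReplaceTempView plus spark.sql) can be avoided by staying in the DataFrame API. In this sketch the HDFS path, file name, and column name are placeholders, not part of the assignment.

    # spark_task.py (sketch) -- illustrative PySpark outline; names and paths are assumed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("montreal-open-data").getOrCreate()

    # Step 2 of the assignment would first copy the file into HDFS, e.g.:
    #   hdfs dfs -mkdir -p /user/student/project
    #   hdfs dfs -put dataset.csv /user/student/project/
    df = spark.read.csv("hdfs:///user/student/project/dataset.csv",
                        header=True, inferSchema=True)

    # "No temporary tables": use DataFrame operations directly instead of
    # createOrReplaceTempView(...) followed by spark.sql(...).
    result = (df.groupBy("arrondissement")          # hypothetical column name
                .agg(F.count("*").alias("n_records"))
                .orderBy(F.desc("n_records")))
    result.show(10)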
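For Q20, the lineage-versus-caching experiment can be prototyped in a few lines. The assignment asks for Spark Scala; this PySpark sketch only mirrors the experiment's shape, and its mapValues step is a stand-in for a real PageRank iteration, not the actual computation.

    # lineage_demo.py (sketch) -- transformations are lazy, lookup() is an action
    # that triggers a job, and toDebugString() shows the lineage of Ranks.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lineage-demo")
    k = 100
    # Pages 1..k*k start at 1/k^2; dummy page 0 starts at 0.0 (per the spec).
    ranks = sc.parallelize(
        [(p, 1.0 / (k * k)) for p in range(1, k * k + 1)] + [(0, 0.0)])

    for i in range(1, 4):
        # Transformation only: nothing runs yet (lazy evaluation).
        ranks = ranks.mapValues(lambda pr: pr)  # stand-in for one PR iteration
        # Action: each lookup() triggers a job covering all i iterations so far.
        dangling_mass = ranks.lookup(0)[0]
        print(f"after iteration {i}, dangling mass = {dangling_mass}")
        print(ranks.toDebugString().decode())

    # Adding .cache() (or .persist()) to ranks lets later jobs re-use the
    # materialized RDD instead of re-running the whole lineage from scratch.

Comparing the printed lineage after 1, 2, and 3 iterations shows how the lineage grows with each loop iteration, which is what report question 4 asks you to document.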

Popular Subjects for Big Data

You can get the best-rated, step-by-step problem explanations from 65,000+ expert tutors by ordering TutorBin Big Data homework help.

Get Instant Big Data Solutions From TutorBin App Now!

Get personalized homework help in your pocket! Enjoy your $20 reward upon registration!

Claim Your Offer

Sign Up now and Get $20 in your wallet

  • Money-back guarantee
  • Free plagiarism reports
  • $20 reward upon registration
  • Full privacy
  • Unlimited rewrites/revisions

Testimonials

TutorBin has received more than 3k positive ratings from users around the world. Many students and teachers have been greatly helped by TutorBin.

"After using their service, I decided to return back to them whenever I need their assistance. They will never disappoint you and craft the perfect homework for you after carrying out extensive research. It will surely amp up your performance and you will soon outperform your peers."

Olivia

"Ever since I started using this service, my life became easy. Now I have plenty of time to immerse myself in more important tasks viz., preparing for exams. TutorBin went above and beyond my expectations. They provide excellent quality tasks within deadlines. My grades improved exponentially after seeking their assistance."

Gloria

"They are amazing. I sought their help with my art assignment and the answers they provided were unique and devoid of plagiarism. They really helped me get into the good books of my professor. I would highly recommend their service."

Michael

"The service they provide is great. Their answers are unique and expert professionals with a minimum of 5 years of experience work on the assignments. Expect the answers to be of the highest quality and get ready to see your grades soar."

Richard

"They provide excellent assistance. What I loved the most about them is their homework help. They are available around the clock and work until you derive complete satisfaction. If you decide to use their service, expect a positive disconfirmation of expectations."

Willow

TutorBin helping students around the globe

TutorBin believes that distance should never be a barrier to learning. Over 500,000 orders and 100,000+ happy customers explain why TutorBin has become the name that keeps learning fun in the UK, USA, Canada, Australia, Singapore, and UAE.