
Big Data Homework Help | Big Data Assignment Help

Excel in Exams with Expert Big Data Homework Help Tutors.


Trusted by 1.1 M+ Happy Students

Place An Order and save time
*Get instant homework help from top tutors—just a WhatsApp message away. 24/7 support for all your academic needs!

Big Data Assignment Help: Your Ultimate Guide to Excelling in Big Data Analytics

In the vast and rapidly evolving field of Big Data, students and professionals alike need guidance and support. TutorBin is a reliable partner for anyone navigating the complexities of Big Data projects, assignments, and homework. Our dedicated team of experts is committed to supporting your journey in Big Data analytics, ensuring you not only meet but exceed your academic and professional goals.

Why Big Data Matters

Big Data encompasses massive volumes of structured and unstructured data that are too large and complex to process with traditional tools. Its significance lies in the ability to capture, curate, analyze, and visualize vast amounts of information, producing insights that drive informed decision-making across sectors. Whether it's enhancing customer experience, optimizing operations, or advancing research, Big Data plays a pivotal role in today's data-driven world.

How We Can Help

Expert Assistance at Your Fingertips

TutorBin is home to a select team of Big Data experts, each holding an advanced degree and extensive experience in the field. Our experts come from prestigious backgrounds, including top-tier universities and leading technology and healthcare organizations, so the support you receive is grounded in real-world expertise and academic excellence.

Comprehensive Big Data Analytics Support

From data collection and preparation to advanced analytics and visualization, our experts cover all aspects of Big Data analytics. We provide assistance with a variety of tools and frameworks, including Tableau, Spark, Hadoop, and many more, ensuring you're equipped with the knowledge to tackle any Big Data challenge.

Big Data analytics is a critical skill set for the future, and with TutorBin, you're in good hands. Whether you're aiming to excel in academic projects or enhance your professional expertise, our comprehensive support system is designed to help you navigate the complexities of Big Data with ease.

Ready to Take the Next Step?

If you're facing challenges with your Big Data assignments or simply wish to learn more about this fascinating field, TutorBin is here to help. Visit us today to discover how our experts can assist you in achieving your Big Data analytics goals and propel you toward success in the digital age.

Recently Asked Big Data Questions

Expert help when you need it
  • Q1:Learning outcome Perform mass and energy balances, identify all key and utility streams and specify their flow, composition data. Use of computer based flow sheet software to perform mass and energy balances. Make an assessment of potential hazards Make an assessment of sustainability of the plant Make a material selection for chemical engineering equipment Mark 0-5 Limited and uncomplete mass and energy balances. Calculations not given and explained. Limited or no Hysys flow sheet has been created. Poor selection of process safety aspects and no details given on how these related to the overall design. Report lacks any discussion and criticism of the process safety methods used. 0-3 A poor report, lacking logical and systematic assessment of positive and negative impacts on all three pillars of sustainability, or not assessing all three pillars. Generally incorrect or missing identification of risks and mitigations. Shows lack of understanding of lecture content. 0-3 A poor report, where only some of the mechanical, thermal and chemical demands on the 3 key units are appraised. Following a poor assessment of the materials properties necessary to meet these demands, some materials are identified for each of the 3 key units. A discussion of these materials is missing and/or selection is poor or incorrect. Shows lack of understanding of lecture 6-10 Mass and energy balances for the process with major errors or omissions. Process assumptions not given or unrealistic. 6-10 Hysys flow sheet does not include all the process units (e.g. some units not working) No comparison with manual calculations is presented. 5-8 A basic report with limited evidence of understanding of how the chosen aspects of safety relate to the design process. Some discussion of the process safety aspects is provided, but this is largely superficial and contains misconceptions. 4-6 A basic report, assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way. Some risks are mentioned, but mitigation and prioritization is incorrect or missing. Generally shows an understanding of lecture content. 4-6 A basic report, where most of the mechanical, thermal and chemical demands on the 3 key units are appraised. Following an incomplete assessment of the materials properties necessary to meet these demands, some engineering materials are identified for each of the 3 key units. A discussion of these materials leads to the selection of an appropriate one. Generally shows an understanding of lecture 11-15 Complete mass and energy balances for the whole process with some errors and omissions. 11-15 Hysys flow sheet for the whole process has been created with some errors or omissions. Some comparison with manual calculations is presented. 9-12 An acceptable report with an appropriate selection of multiple aspects of process safety. A range of potential process hazards are identified and properly considered within the context of design. Discussion and criticism of the process safety methods is provided. References are used to support discussion 7-9 An acceptable report, assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way. Good identification of risks, with evidence of prioritization and proposed mitigation. Some mention of either temporal, geographical scales, or unusual or emergency operations are present. 
7-9 An acceptable report, where the mechanical, thermal and chemical demands on the 3 key units are fully appraised, with consideration of unusual or emergency operations. Following an assessment of the materials properties necessary to meet these demands, some engineering materials are identified for each of the 3 key units. A discussion of these materials leads to the selection of the most appropriate one. Quantification and referencing 16-20 Complete mass and energy balances with no obvious errors Utility are specified completely. 16-20 Hysys flow sheet has been created with minor errors. Advanced features (ie. adjustment) have been used. Detailed comparisons (with manual calculatons) and good discussion are presented. 13-16 An excellent report with a well thought out, systematic and critical review outlining how design decisions were related to the chosen process safety aspects. A clear summary of process hazards is provided with a discussion outlining the challenges and benefits, advantages and disadvantages of the methods. Report is well- referenced throughout 10-12 An excellent report, fully assessing positive and negative impacts on all three pillars of sustainability in a logical and systematic way (e.g. using the 4 P's). Important risks are identified based on a prioritisation of likelihood and impact, and sensible mitigations are proposed. Quantitative targets for improvement are provided. Examples of consideration of temporal and geographical scales, as well as unusual and emergency operations are present. 10-12 An excellent report, where the mechanical, thermal and chemical demands on the 3 key units are fully appraised, with consideration of unusual and emergency operations. Following a convincing assessment of the materials properties necessary to meet these demands, an appropriate range of engineering materials are identified for each of the 3 key units. An analytical discussion of these materials leads to the selection of the most appropriate one. Newcastle University 21-25 Complete mass and energy balances presented in a highly professional and informative format beyond expectation at this stage 21-25 Sophisticated Hysys sheet for the whole process has been created demonstrating knowledge beyond the lecture content. Detailed comparisons with excellent discussion is presented. 17-20 As before, but with additional signs of creativity and innovation beyond the lecture content. Original and inciteful discussion presented in a highly professional and informative format. 13-15 An exemplary report incorporating all points from the previous descriptor, but additionally demonstrating knowledge beyond the lecture content, and/or showing signs of engineering flair and professionalism beyond expectation at this stage. 13-15 An exemplary report incorporating all points from the previous descriptor, but additionally demonstrating knowledge beyond the lecture content, and/or showing signs of engineering flair and professionalism beyond expectation at this stage.See Answer
  • Q2:QUESTION THREE (20 MARKS) 3.1 List and describe the IoT architecture components. (12) 3.2 Describe the two major layers and two other supporting modules of Hadoop. (8)See Answer
  • Q3:QUESTION FOUR (20 MARKS) 4.1 List and describe five (5) security technologies applied in Big Data. (10) 4.2 List five (5) ethics and five (5) policies that govern the use and implementation of Big Data. (10)See Answer
  • Q4:Problems: 1. (You can use other websites for this question, but you should not cut and paste text. Please write answers in your own words!) One of the candidates who used to coach a former president now wants to beat him in the election. This candidate is known for thinking out of the box. While his campaign has been running their services on AWS EC2 instances, they recently became aware of serverless cloud services. Can you help them? (Please limit your answer for each part to less than 50 words. Be concise!) a. What is the key difference between AWS Lambda and AWS EC2? b. What are the two key differences between AWS Lambda and AWS spot instances (think: pricing and how long instances last)? c. What are the names of the corresponding Microsoft Azure and Google Cloud counterparts (names) of Amazon Lambda? d. Give one example application (class) where you would prefer AWS EC2, and one where you would prefer AWS Lambda. Justify your choices briefly.See Answer
  • Q5:4. A rival campaign manager believes that finding the best donors is the way to go. They use the same dataset from the previous question to instead find all user pairs (U,V) such that: (i) both U and V have at least 100 million followers each, and (ii) U and V follow at least 100 accounts in common (excluding each other). Note that U and V may or may not follow each other (either way)! Write a Mapreduce program for this. Same instructions as the first Mapreduce question in this series apply.See Answer
  • Q6:5. One of the social media billionaires is considering running for President. They run a social media named Quitter, and they have access to a lot of data inside the company. As an intern in this campaign, you have the same social network dataset (named D1) specified in the previous question ((a,b) directed pairs indicating a follows b), but you also have an additional dataset (named D2) with entries (a, start_time, end_time) indicating that user a was online starting start_time and ending at end_time. The data is only for one day. All times are hh:mm:ss. However, each user a may have multiple entries in D2 (since users log in simultaneously). Write a Mapreduce program that extracts all pairs of users (a,b) such that: (i) a and b follow each other, and (ii) a and b were online simultaneously at least once during that day. Same instructions as the first Mapreduce question in this series apply. Please ensure that a Map stage reads data from only one input dataset (i.e., if a Map reads directly from D2, don't use it to also read from D1. And vice-versa.) - this is good practice consistent with good Map programming practices.See Answer
  • Q7:6. Questioning and Reforming the election system seem all the rage nowadays. There are some ways distributed systems folks can help with elections. Someone at the election office thinks MapReduce could be useful for "instant runoff voting" in primaries. (Fun fact: several states, including Alaska, now use instant runoff voting!) Here's how instant runoff voting works. Consider an election with three candidates on the ballot - A, B, C. Every voter ranks the three candidates as first preference, second preference, and last preference. Between any two candidates X and Y, if a majority of voters ranked X above Y, then X dominates Y (and vice versa); note that this only takes into account X and Y's relative rankings, not where they appear in the preference order, or where the third candidate appears. A Condorcet winner is a candidate that dominates all other candidates (pairwise) on the ballot. By definition an election can have at most one Condorcet winner (however, there may be zero). You are given a dataset of votes from N voters (N is odd and large, and so the dataset is sharded), where each vote V has three fields V.1, V.2, V.3, respectively for the first, second, and third preference votes of that voter. Each line of input is one such voter's vote V (input to the initial Map function). Write a MapReduce program that outputs either the unique single Condorcet winner among the three candidates A, B, or C, or if there is no single Condorcet winner, then it outputs the list of candidate(s) with the highest Condorcet count (those that dominate the most number of candidates). For background -- in MapReduce, one writes a program for Map that processes one input line at a time and outputs zero or more (key, value) pairs; and one writes a program for Reduce that processes an input of (key, all values for key). The iteration over input lines is done automatically by the MapReduce framework. You can assume this data is already sharded in HDFS and can be loaded from there. Each line is one vote V. (A Python MapReduce sketch of one possible approach appears after this question list.)See Answer
  • Q8:8. One of the less popular candidates is polling at very small numbers in most of the states. They want to analyze the "topology-aware gossip" protocol you've seen in lecture. However, instead of the lecture slide example of 2 subnets joined by 1 router, here we have a total of N nodes (processes), evenly spread out across √N subnets (each subnet containing √N nodes), all joined by 1 router. The subnets are numbered S0, S1, S2, ... S(√N-1). All these √N subnets are connected together via 1 router. You can assume all nodes have a full membership list, and there are no failures (messages or processes). The topology-aware gossip works as follows. Consider a process Pj choosing gossip targets. The process' gossip targets depend on the subnet Si that it lies in. During a gossip round, the process Pj selects either b "inside-subnet Si gossip targets" with probability (1-1/√N), OR b "outside-subnet Si gossip targets" with probability 1/√N. The only "restriction" is that after process Pj is infected, for the next O(log(√N)) rounds Pj picks only inside-subnet targets (no outside-subnet targets) -- thereafter in a gossip round at Pj, either all its targets are inside-subnet or all are outside-subnet. Inside-subnet gossip targets from Pj (in Si) are selected uniformly at random from among the processes of Si. Outside-subnet gossip targets from Pj (in Si) are only picked from the processes in the "next" subnet S((i+1)mod√N), and they are picked uniformly at random from the processes lying in that "next" subnet. The gossiping of a message does not stop (i.e., it is gossiped forever based on the above protocol). Does this topology-aware gossip protocol satisfy both the requirements of: (i) O(log(N)) average dissemination time for a gossip (with one sender from any subnet), and (ii) an O(1) messages/time unit load on the router at any time during the gossip spread? Justify your answers.See Answer
  • Q9:9. One of the campaigns is always looking for shortcuts. Their distributed system uses a failure detector but, to "make it faster", they have made the following changes. For each of these changes (in isolation), say what is the one biggest advantage and the one biggest disadvantage of the change (and why). Keep each answer to under 50 words (give brief justifications). a. They use Gossip-style failure detection, but they set Tcleanup = 0. b. They use SWIM-style failure detection, but they removed the Suspicion feature. c. They use SWIM-style failure detection, but they removed the round robin pinging + random permutation, and instead just randomly select each ping target.See Answer
  • Q10:10. An intern in the Independent Party campaign designs an independent SWIM/ping-based failure detection protocol, for an asynchronous distributed system, that works as follows. Assume there are N=M*K*R processes in the system (M, K, R are positive integers, each > 2). Arrange these N processes in an MxKxR 3-dimensional matrix (tesseract), with M processes in each column, K processes in each row, and R processes in the 3rd dimension (aisles). All processes maintain a full membership list; however, pinging is partial. Each process Pijk (in the i-th row, j-th column, and k-th aisle) periodically (every T time units) marks a subset of its membership list as its Monitoring Set. The monitoring set of a given process, once selected, does not change. The monitoring set of Pijk contains: i) all the processes in its own column, ii) all the other processes in its own row, and iii) all the processes in its own aisle. At this point, there are two options available to you: Option 1 - Each process sends heartbeats to its monitoring set members. Option 2 - Each process periodically pings all its monitoring set members; pings are responded to by acks, just like in the SWIM protocol (but there are no indirect pings or indirect acks). Failure detection timeouts work as usual: Option 1 has the heartbeat receiver time out waiting for a heartbeat, while Option 2 has the pinging process (pinger) time out. The suspected process is immediately marked as failed. This is run in an asynchronous distributed system. a. How many failures does Option 1 take to violate completeness? That is, find the value L so that if there are (L-1) simultaneous failures, all of them will be detected, but if there are L simultaneous failures then not all of them may be detected. b. Answer the same question for Option 2. c. An opposition party candidate claims that for K-R=2, both Option 1 and Option 2 provide completeness for all scenarios with up to (and including) 9 simultaneous failures. You gently respond that they are wrong and that it also depends on M. What are all the values of M (given K=R-2) for which your opponent's claim above is true? Justify your answer clearly.See Answer
  • Q11:Netflix Data Analysis using Pig. Covers: Project Description, Setting up your Project, Documentation, and What to Submit (optional: use your laptop to develop your project).See Answer
  • Q12: CS 6240: Assignment 4 Goals: (1) Gain deeper understanding of action, transformation, and lazy execution in Spark. (2) Implement PageRank in MapReduce and Spark. This homework is to be completed individually (i.e., no teams). You must create all deliverables yourself from scratch: it is not allowed to copy someone else's code or text, even if you modify it. (If you use publicly available code/text, you need to indicate what was copied and cite the source in your report!) Please submit your solution as a single PDF file on Gradescope (see link in Canvas) by the due date and time shown there. During the submission process, you need to tell Gradescope on which page the solution to each question is located. Not doing this will result in point deductions. In general, treat this like a professional report. There will also be point deductions if the submission is not neat, e.g., it is poorly formatted. (We want our TAs to spend their time helping you learn, not fixing messy reports or searching for solutions.) For late submissions you will lose one point per hour after the deadline. This HW is worth 100 points and accounts for 15% of your overall homework score. To encourage early work, you will receive a 10-point bonus if you submit your solution on or before the early submission deadline stated on Canvas. (Notice that your total score cannot exceed 100 points, but the extra points would compensate for any deductions.) To enable the graders to run your solution, make sure your project includes a standard Makefile with the same top-level targets (e.g., local and aws) as the one presented in class. As with all software projects, you must include a README file briefly describing all the steps necessary to build and execute both the standalone and the AWS Elastic MapReduce (EMR) versions of your program. This description should include the build commands and fully describe the execution steps. This README will also be graded, and you will be able to reuse it on all this semester's assignments with little modification (assuming you keep your project layout the same). You have about 2 weeks to work on this assignment. Section headers include recommended timings to help you schedule your work. The earlier you work on this, the better. Important Programming Reminder As you are working on your code, commit and push changes frequently. The commit history should show a natural progression of your code as you add features and fix bugs. Committing large, complete chunks of code may result in significant point loss. (You may include existing code for standard tasks like adding files to the file cache or creating a buffered file reader, but then the corresponding commit comment must indicate the source.) If you are not sure, better commit too often than not often enough. PageRank in Spark (Week 1) In addition to implementing a graph algorithm from scratch to better understand the BFS design pattern and the influential PageRank algorithm, the first part of this assignment also explores the subtleties of Spark's actions and transformations, and how they affect lazy evaluation and job submission. We will work with synthetic data to simplify the program a little and to make it easier to create inputs of different sizes. Thoughtful creation of synthetic data is an important skill for big-data program design, testing, and debugging. Recall that Spark transformations describe data manipulations, but do not trigger execution. This is the "lazy evaluation” in Spark. 
Actions, on the other hand, force an immediate execution of all operations needed to produce the desired result. Stated differently, transformations only define the lineage of a result, while actions force the execution of that lineage. What will happen when an iterative program performs both actions and transformations in a loop? What goes into the lineage after 1, 2, or more loop iterations? And will the entire lineage be executed? Let us find out by exploring a program that computes PageRank with dangling pages for a simple synthetic graph. Your program should work with two data tables: Graph stores pairs (p1, p2), each encoding a link from some page p1 to another page p2. Ranks stores pairs (p, pr), encoding the PageRank pr for each page p. To fill these tables with data, create a graph that consists of k linear chains, each with k pages. Number the pages from 1 to k², where k is a program parameter to control problem size. The figure shows an example for k=3. Notice that the last page in each linear chain is a dangling page. We will use the single-dummy-page approach to deal with dangling pages. This means that your program also must create a single dummy page; let's give it the number 0 (zero) and add it to Ranks. Add an edge (d, 0) for each dangling page d. Set the initial PR value for each of the k² real pages in Ranks to 1/k²; set the initial PR value of the dummy page to 0. [Figure for k=3: pages 1-9 arranged in three linear chains.] For simplicity, we recommend you implement the program using (pair) RDDs, but you may choose to work with DataSet instead. The following instructions assume an RDD-based implementation. Start by exploring the PageRank Scala program included in the Spark distribution. Make sure you fully understand what each statement is doing. Create a simple example graph and step through the program, e.g., on paper or using the interactive Spark shell. You will realize that the example program does not handle dangling pages, i.e., dangling pages lose their PR mass in each iteration. Can you find other problems? Your program will have a structure similar to the example program, but follow these requirements and suggestions: You are allowed to take certain shortcuts in your program that exploit the special graph structure. In particular, you may exploit that each node has at most 1 outgoing link. Make sure you add a comment about this assumption in your code. Make k a parameter of your Spark Scala program and generate RDDs Graph and Ranks directly in the program. There are many examples on the Web on how to create lists of records and turn them into (pair) RDDs. 1. Make sure you add dummy page 0 to Ranks and the corresponding k dummy edges to Graph. 2. Initialize each PR value in Ranks to 1/k², except for page 0, whose initial PR value should be zero. Be careful when you look at the example PR program in the Spark distribution. It sets initial PR values to 1.0, and its PR computation adds 0.15 instead of 0.15/#pages for the random jump probability. Intuitively, they multiply each PR value by #pages. While that is a valid approach, it is not allowed for this assignment. Try to ensure that Graph and Ranks have the same Partitioner to avoid shuffling for the join. Check if the join computes exactly what you want. Does it matter if you use an inner or an outer join in your program? To read out the total dangling PR mass accumulated in dummy page 0, use the lookup method of pair RDD. Then re-distribute this mass evenly over all real pages. (A minimal PySpark sketch of this dummy-page approach follows the question list below.)
When debugging your program, see if the PR values add up to 1 after each iteration. Small variations are expected, especially for large graphs, due to numerical precision issues. However, if the PR sum significantly deviates from 1, this may indicate a bug in your program. Add a statement right after the end of the for-loop (i.e., outside the loop) for the PR iterations to write the debug string of Ranks to the log file. Now you are ready to explore the subtleties of Spark lazy evaluation. First explore the lineage of Ranks as follows: Set the loop condition so that exactly 1 iteration is performed and look at the lineage for Ranks. Change the loop condition so that exactly 2 iterations are performed and look at the lineage for Ranks after those 2 iterations. Did it change? The lineage describes the job needed to compute the result of the action that triggered it. Since pair RDD's lookup method is an action, a new job is executed in each iteration of the loop. Can you describe in your own words what the job triggered in the i-th iteration computes? Try it. An interesting aspect of Spark, and a reason for its high performance, is that it can re-use previously computed results. This means that in practice, only a part of the lineage may get executed. To understand this better, consider the following simple example program: 1. val myRDD1 = some_expensive_transformations_on_some_big_input() 2. myRDD1.collect() 3. val myRDD2 = myRDD1.some_more_transformations() 4. myRDD2.collect() This program executes 2 jobs. The first is triggered by line 2 and it computes all steps defined by the corresponding transformations in the lineage of myRDD1. The next job is triggered by line 4. Since myRDD2 depends on myRDD1, all myRDD1's lineage is also included in the lineage of myRDD2. But will Spark execute the entire lineage? What if myRDD1 was still available from the earlier job triggered by line 2? Then it would be more efficient for Spark to simply re-use the existing copy of myRDD1 and only apply the additional transformations to it! Use Spark textbooks and online resources to find out if Spark is smart enough to realize such RDD re-use opportunities. Then study this empirically in your PageRank program where the lineage of Ranks in iteration i depends on all previous (i-1) iterations: 1. Can you instrument your program with the appropriate printing or logging statements to find out execution details for each job triggered by an action in your program? 2. See if you can find other ways to make Spark tell you which steps of an RDD lineage were executed, and when Spark was able to avoid execution due to availability of intermediate results from earlier executions. 3. Change the caching behavior of your program by using cache() or persist() on Ranks. Does it affect the execution behavior of your program? Try this for small k, then for really large k (so that Ranks might not completely fit into the combined memory of all machines in the cluster). Bonus challenge: For an optional 5-point bonus (final score cannot exceed 100), run your PageRank program on the Twitter followership data. If you took shortcuts for the synthetic data, e.g., by exploiting that no page has more than 1 outgoing link, you need to appropriately generalize your program to work correctly on the Twitter data. PageRank in MapReduce (Week 2) Implement the PageRank program in MapReduce and run it on the synthetic graph. 
You may choose any of the methods we discussed in the module and in class for handling dangling pages, including global counters (try if you can read it out in the Reduce phase) and order inversion. In contrast to the Spark program, generate the synthetic graph in advance and feed it as an input file to your PageRank program. Follow the approach from the module and store the graph as a set of vertex objects (which could be encoded as Text), each containing the adjacency list and the PageRank value. Since we will work with relatively small input, make sure that your program creates at least 20 Map tasks. You can use NLineInputFormat to achieve this. Report: Write a brief report about your findings, answering the following questions: 1. [12 points] Show the pseudo-code for the PR program in Spark Scala. Since many Scala functions are similar to pseudo-code, you may copy-and-paste well-designed (good variable naming!) and well-commented Scala code fragments here. Notes: Your program must support k and the number of PR iterations as parameters. Your program may take shortcuts to exploit the structure of the synthetic graph, in particular that each page has at most 1 outgoing link. (Your program should work on the synthetic graphs, no matter the choice of k>0, but it does not need to work correctly on more generally structured graphs.) 2. [10 points] Show the link to the source code for this program in your Github Classroom repository. 3. [10 points] Run the PR program locally (not on AWS) for k=100 for 10 iterations. Report the PR values your program computed for pages 0 (dummy), 1, 2,..., 19. 4. [19 points] Run the PR program locally (not on AWS) for k=100. Set the loop condition so that exactly 1 iteration is performed and report the lineage for Ranks after that iteration. Change the loop condition so that exactly 2 iterations are performed and report the lineage for Ranks after those 2 iterations. Then change the loop condition again so that exactly 3 iterations are performed and report the lineage for Ranks after those 3 iterations. 5. [15 points] Find out if Spark executes the complete job lineage or if it re-uses previously computed results. Make sure you are not using cache() or persist() on the Ranks RDD. (You may use it on the Graph RDD.) Since the PR values in RDD Ranks in iteration 10 depend on Ranks from iteration 9, which in turn depends on Ranks from iteration 8, and so on, we want to find out if the job triggered by the lookup action in iteration 10 runs all 10 iterations from scratch, or if it uses Ranks from iteration 9 and simply applies one extra iteration to it. a. Let's add a print statement as the first statement inside the loop that performs an iteration of the PR algorithm. Use println(s"Iteration ${i}") or similar to print the value of loop variable i. The idea is to look at the printed messages to determine what happened. In particular, if a job executes the complete lineage, we might hope to see "Iteration 1" when the first job is triggered, then "Iteration 1" (again) and "Iteration 2" for the second job (because the second job includes the result of the first iteration in its lineage, i.e., a full execution from scratch would run iterations 1 and 2), then "Iteration 1," "Iteration 2," and "Iteration 3" when the third iteration's job is triggered, and so on. But would that really happen? To answer this question, show the lineage of Ranks after 3 iterations and report if adding the print statement changed the lineage. b.
Remove the print statement, run 10 iterations for k=100, and look at the log file. You should see lines like "Job ... finished: lookup at took ..." that tell you the jobs executed, the action that triggered the job (lookup), and how long it took to execute. If Spark does not re-use previous results, the growing lineage should cause longer computation time for jobs triggered by later iterations. On the other hand, if Spark re- uses Ranks from the previous iteration, then each job runs only a single additional iteration and hence job time should remain about the same, even for later iterations. Copy these lines from the log file for all jobs executed by the lookup action in the 10 iterations. Based on the times reported, do you believe Spark re-used Ranks from the previous iteration? C. So far we have not asked Spark to cache() or persist() Ranks. Will this change Spark's behavior? To find out, add ".cache()" to the command that defines Ranks in the loop. Run your program again for 10 iterations for k=100 and look at the log file. What changed after you added cache()? Look for lines like "Block ... stored as values in memory" and "Found block ... locally". Report some of those lines and discuss what theySee Answer
  • Q13:Assignment Details: Creating a pipeline that ingests data, stores data, cleans and prepares data, and makes predictions is common within a big data ecosystem. Many companies do this on a very large scale and across many business functions and products. During Week 5, you will construct an end-to-end solution that ingests, stores, and predicts market orders. The data to use are located at this link. The project deliverables include the following: • Ingest the data from this link using tools such as Python. • Construct a predictive analytics solution using Python libraries (such as Scikit-Learn) to predict future market orders. • The submission should include the code to connect, ingest, predict, and interpret. • Provide the code and a discussion on the aspects of design, including endpoint connection, the ingest approach, the prediction algorithm, and interpretation. Additionally, discuss how the algorithm could be optimized, such as with feature engineering. • Once complete, submit your assignment for grading as a Microsoft Word document. Please submit your assignment. For assistance with your assignment, please use your textbook and all course resources. (A bare-bones scikit-learn sketch of such a pipeline appears after this question list.)See Answer
  • Q14:Trier University, Department IV - Computer Sciences. Assignment 5 of the course Big Data Analytics, Summer semester 2024. Task 1: Joins (25(+20) points). In this task, we are going to work again with the data and parser known from the previous assignment (alex165k.xml and MRDPUtils.java). Additionally, the file titles1m.xml provided in Moodle will act as a secondary data source. It contains the "publication year" and "title" of many publications, identified by their arXiv-IDs and OpenAlex-IDs as unique keys. titles1m.xml was created from the full OpenAlex dataset and contains only entries matching keys used in alex165k.xml. (a) Write a MapReduce application that uses a reduce-side join in order to match information about publications based on their arXiv-ID. Use alex165k.paper_id == titles1m.arxiv_id for the join. We are especially interested in combining authors, discipline, year and title for each publication. Hints: • Always look at the contents of the input files to get a first impression of how the data is structured and how to process it. • Since the two XML files store data in different fields, it is a good idea to use two distinct Mapper classes for the two input files. Refer to chapter 3-2, slide 19 for an example of MultipleInputs. Use TextInputFormat.class as argument for the input format class. • In the mapper class processing alex165k.xml, emit paper_id as key and use a suitable format to transfer the contents of authors and discipline as value. Some author names include a semicolon, so a different field separator is advisable. • In the mapper class processing titles1m.xml, emit arxiv_id as key and use a suitable format to transfer the contents of year and title as value. Since titles may contain almost any character, it is advisable to emit the year first. • In the reducer class, you have to join the values emitted by both mappers for each arXiv-ID. If a join partner is found, emit the arXiv-ID as key and the title of the publication as value. Emit nothing if no join partner is found. Hand in your code and the output of your program for the given input files. To reduce the file size, include only output for publications from the year 2020 in the discipline Physics that have at least one author with the last name Smith. This should yield 8 results. (A compact Python sketch of the reduce-side join idea follows this question list.) (b) (This is an optional task) Write a MapReduce application that uses reduce-side joins to calculate the average "year of publication" for works listed in a publication's list of references. alex165k.xml contains a list of cited OpenAlex-IDs in its referenced attribute and titles1m.xml contains the "year of publication" in its year attribute. Hints: • Calculating the results in a single run of your application requires a two-step join: the output of the first MapReduce job has to be used as input for a second MapReduce job. Two distinct sets of mappers and reducers are needed. Use the first job's output directory as the second job's input directory; there is no need to link to a specific file by name. • In the first MapReduce job, emit the referenced OpenAlex-IDs from alex165k.xml as key in the mapper and the associated arXiv-IDs as value. Have a second mapper emit the "year of publication" from titles1m.xml for each OpenAlex-ID and join the entries in the reducer. Ignore values without a join partner. Emit an intermediate result containing at least the arXiv-ID and the year. • In the second MapReduce job, process the intermediate result in the mapper and emit the arXiv-ID as key and the year as value.
Then aggregate the years in the reducer and emit the average year as value for each arXiv-ID. • Use suitable Writable classes to load and emit your data, for example, Text, IntWritable, DoubleWritable or NullWritable. Hand in your code and the output of your program for the given input files. Include only output for publications where the arXiv-ID starts with 1608.04 to reduce the file size. Do not hand in your intermediate results. Task 2: n-grams (20 points). Write a MapReduce application that computes all character-level n-grams (with 10 ≤ n ≤ 15) that appear at least 5,000 times in the input text, using the naive approach discussed in the lecture. Represent an n-gram in a reasonable way. Split up a line into its characters, after removing punctuation and converting the text to lowercase, and generate n-grams from these characters. Hand in your code and the output of the reducer when run on the input file corpus.txt which is available in Moodle (as a compressed file); this should yield 13 results. General remarks: • The tutorial group takes place on Mondays at 14:25 in F55 on a (roughly) bi-weekly basis. • The first meeting of the tutorial group was on May 27, 2024. • To be admitted to the final exam, you need to acquire at least 50% of the points in the assignments. • It is required to submit in groups of size 3; only one submission is sufficient for the whole group. Groups must be chosen in Moodle (see link on the course page in Moodle). Write the names of all group members on your solutions. Students without a group cannot submit. • Solutions must be handed in before the deadline in Moodle (https://moodle.uni-trier.de/, course BDA-24) as a PDF or, if submitting multiple files, as an archive (.zip or comparable). Submissions that arrive after the deadline will not be considered. • Graded versions of your submissions will be returned in Moodle by the following tutorial. • Announcements regarding the lecture and the tutorial group will be made in the area of the lecture in StudIP.See Answer
  • Q15:For this assignment, we will use the flights dataset (the same dataset we used in assignment #1). 1. Create a new database in Hive for this dataset. Create an external table in Hive to store the flights dataset. Create internal tables in Hive for the airline and airport data. Load the data for each of the three tables. 2. Use Hive queries to answer the following questions: a. Count the number of rows in each table. b. Display the first 10 records from each table. c. Find the total departure delay for each airline. d. Find the top 5 flights with the longest arrival delays. e. List the top 5 busiest airports by the number of departures. f. Find the total departure delay for each airline, along with the airline name. g. List the top 5 busiest airports by number of departures, along with airport names. h. Find the top 5 flights with the longest arrival delays, including airline name and destination airport name. i. Identify the airline with the most flights arriving at a specific airport. j. List the top 5 airlines with the highest total arrival delays at a specific airport (e.g., ORD). 3. Answer questions a, e, f, h, and i from question 2 above, using MapReduce running on Hadoop. Submission - Submit the SQL queries you used, and the results obtained for each task. - Submit a video demonstrating the running of the queries. (A small Python MapReduce sketch of the per-airline departure-delay total appears after this question list.)See Answer
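To give a flavour of how our tutors approach questions like these, a few illustrative sketches follow. For Question 7 (the Condorcet-winner MapReduce), the sketch below uses Python with the mrjob library purely for brevity; the comma-separated ballot format ("first,second,third" on each input line) and the library choice are assumptions, not part of the original question.

    # Hypothetical sketch for Question 7: two chained MapReduce steps that find the unique
    # Condorcet winner among A, B, C, or the candidate(s) with the highest Condorcet count.
    # Assumes each input line is one ballot, e.g. "A,C,B" (first, second, third preference).
    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class CondorcetMR(MRJob):

        def mapper_pairwise(self, _, line):
            prefs = [c.strip() for c in line.split(",")]
            # For every candidate pair on this ballot, emit +1 if the alphabetically
            # smaller candidate is ranked above the other, -1 otherwise.
            for i, x in enumerate(prefs):
                for y in prefs[i + 1:]:
                    a, b = sorted((x, y))
                    yield [a, b], (1 if x == a else -1)

        def reducer_dominance(self, pair, margins):
            margin = sum(margins)               # N is odd, so the margin is never zero
            winner = pair[0] if margin > 0 else pair[1]
            yield None, winner                  # this candidate dominates the other one

        def reducer_condorcet(self, _, winners):
            counts = {}
            for w in winners:
                counts[w] = counts.get(w, 0) + 1
            best = max(counts.values())
            top = sorted(c for c, n in counts.items() if n == best)
            # With three candidates, dominating both others means a unique Condorcet winner.
            label = "condorcet_winner" if best == 2 else "highest_condorcet_count"
            yield label, top

        def steps(self):
            return [MRStep(mapper=self.mapper_pairwise, reducer=self.reducer_dominance),
                    MRStep(reducer=self.reducer_condorcet)]

    if __name__ == "__main__":
        CondorcetMR.run()

Because the second reducer sees every pairwise winner under a single key, it can count how many rivals each candidate dominates and decide whether a unique Condorcet winner exists.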
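For the Spark part of Question 12, the PySpark sketch below illustrates only the core algorithmic idea: a dummy page 0 absorbs the dangling mass, which is then redistributed in the same iteration. The assignment itself asks for Scala and for lineage/caching experiments that this sketch does not cover; the parameter values and partition count are arbitrary choices.

    # Hypothetical PySpark sketch for Question 12: PageRank on k linear chains of k pages,
    # with a single dummy page 0 collecting the mass of dangling pages in each iteration.
    from pyspark import SparkContext

    sc = SparkContext(appName="ChainPageRank")

    k, iterations = 100, 10            # example parameter values
    n = k * k                          # number of real pages

    # Edges: p -> p+1 inside each chain, plus (d, 0) for every dangling page d.
    edges = [(p, p + 1) for p in range(1, n + 1) if p % k != 0] + \
            [(d, 0) for d in range(k, n + 1, k)]
    graph = sc.parallelize(edges, 8).cache()

    # Initial ranks: 1/k^2 for every real page, 0 for the dummy page.
    ranks = sc.parallelize([(0, 0.0)] + [(p, 1.0 / n) for p in range(1, n + 1)], 8)

    for i in range(iterations):
        # Each page here has at most one outgoing link, so its whole PR mass moves along it.
        contribs = graph.join(ranks).values()          # (destination, PR of source)
        incoming = contribs.reduceByKey(lambda a, b: a + b)

        found = incoming.lookup(0)                     # action: dangling mass parked in page 0
        dangling = found[0] if found else 0.0

        # Chain heads receive no contributions, so a left outer join keeps them in Ranks.
        ranks = ranks.leftOuterJoin(incoming) \
                     .mapValues(lambda v: v[1] if v[1] is not None else 0.0) \
                     .map(lambda pv: (pv[0], 0.0) if pv[0] == 0
                          else (pv[0], 0.15 / n + 0.85 * (pv[1] + dangling / n)))

    print(sorted(ranks.collect())[:20])                # PR values for pages 0 through 19

A useful sanity check, as the assignment notes, is that the PR values of all pages should still sum to roughly 1 after every iteration.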
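For the end-to-end pipeline in Question 13, a bare-bones version of the ingest-clean-predict flow might look like the sketch below. The dataset URL, column names, and target variable are placeholders, since the actual link referenced in the assignment is not reproduced on this page.

    # Hypothetical sketch for Question 13: ingest market-order data, prepare it, and fit a
    # simple regressor. "market_orders.csv" and the column names are placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("https://example.com/market_orders.csv")   # ingest from the endpoint
    df = df.dropna()                                             # minimal cleaning step

    y = df["order_volume"]                                       # assumed prediction target
    X = pd.get_dummies(df.drop(columns=["order_volume"]))        # encode categorical fields

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    print("MAE on held-out orders:", mean_absolute_error(y_test, preds))

Feature engineering (lag features, rolling averages, calendar fields) is where most of the optimization the assignment asks about would happen.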
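Question 14 asks for a Java MapReduce reduce-side join with two mapper classes; the Python sketch below shows the same join idea in miniature. It assumes the two XML inputs have already been flattened into tagged, tab-separated lines ("ALEX" and "TITLE"), a simplification made only to keep the example short; the real assignment parses the XML with the course's MRDPUtils helper.

    # Hypothetical sketch for Question 14, Task 1(a): reduce-side join on the arXiv-ID.
    # Input lines are assumed to be pre-parsed and tagged, e.g.
    #   ALEX  <tab> paper_id <tab> authors <tab> discipline
    #   TITLE <tab> arxiv_id <tab> year    <tab> title
    from mrjob.job import MRJob

    class ReduceSideJoin(MRJob):

        def mapper(self, _, line):
            parts = line.rstrip("\n").split("\t")
            if parts[0] == "ALEX":
                _, paper_id, authors, discipline = parts
                yield paper_id, ["A", authors, discipline]
            elif parts[0] == "TITLE":
                _, arxiv_id, year, title = parts
                yield arxiv_id, ["T", year, title]

        def reducer(self, arxiv_id, records):
            alex_side, title_side = [], []
            for rec in records:
                (alex_side if rec[0] == "A" else title_side).append(rec[1:])
            # Inner-join semantics: emit joined rows only for IDs present in both inputs.
            for authors, discipline in alex_side:
                for year, title in title_side:
                    yield arxiv_id, [authors, discipline, year, title]

    if __name__ == "__main__":
        ReduceSideJoin.run()

The tag carried in each value is what lets a single reducer call tell the two sides of the join apart, which is the essence of a reduce-side join regardless of the framework.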
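Finally, for part 3 of Question 15 (re-doing some of the Hive queries as MapReduce jobs on Hadoop), the sketch below totals departure delay per airline. The column positions and header handling are assumptions, since the flights CSV layout is not shown in the listing above.

    # Hypothetical sketch for Question 15, part 3: total departure delay per airline.
    # Column indexes (airline code in column 4, departure delay in column 11) are assumed.
    from mrjob.job import MRJob

    class TotalDepartureDelay(MRJob):

        def mapper(self, _, line):
            fields = line.split(",")
            if len(fields) <= 11 or fields[11] in ("", "DEPARTURE_DELAY"):
                return                                  # skip short rows, headers, missing values
            yield fields[4], float(fields[11])          # airline code -> departure delay (minutes)

        def combiner(self, airline, delays):
            yield airline, sum(delays)                  # pre-aggregate on the map side

        def reducer(self, airline, delays):
            yield airline, sum(delays)

    if __name__ == "__main__":
        TotalDepartureDelay.run()

The combiner mirrors the reducer because summation is associative, which keeps the shuffle small on a dataset of this size.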

Popular Subjects for Big Data

You can get the best-rated step-by-step problem explanations from 65,000+ expert tutors by ordering TutorBin Big Data homework help.

Get Instant Big Data Solutions From TutorBin App Now!

Get personalized homework help in your pocket! Enjoy your $20 reward upon registration!

Claim Your Offer

Sign up now and get $20 in your wallet

  • Money-back guarantee
  • Free plagiarism reports
  • $20 reward upon registration
  • Full privacy
  • Unlimited rewrites/revisions

Testimonials

TutorBin has received more than 3k positive ratings from users around the world. Here are some of the students and teachers TutorBin has helped.

"After using their service, I decided to return back to them whenever I need their assistance. They will never disappoint you and craft the perfect homework for you after carrying out extensive research. It will surely amp up your performance and you will soon outperform your peers."

Olivia

"Ever since I started using this service, my life became easy. Now I have plenty of time to immerse myself in more important tasks viz., preparing for exams. TutorBin went above and beyond my expectations. They provide excellent quality tasks within deadlines. My grades improved exponentially after seeking their assistance."

Gloria

"They are amazing. I sought their help with my art assignment and the answers they provided were unique and devoid of plagiarism. They really helped me get into the good books of my professor. I would highly recommend their service."

Michael

"The service they provide is great. Their answers are unique and expert professionals with a minimum of 5 years of experience work on the assignments. Expect the answers to be of the highest quality and get ready to see your grades soar."

Richard

"They provide excellent assistance. What I loved the most about them is their homework help. They are available around the clock and work until you derive complete satisfaction. If you decide to use their service, expect a positive disconfirmation of expectations."

Willow

TutorBin helping students around the globe

TutorBin believes that distance should never be a barrier to learning. More than 500,000 orders and 100,000+ happy customers explain why TutorBin has become the name that keeps learning fun in the UK, USA, Canada, Australia, Singapore, and the UAE.