Question

3. (Note: From this question onwards you CANNOT use websites but can only use course material. Also, for all Mapreduce questions please only use the class definition/template for Mapreduce jobs, i.e.,

Map/Reduce tasks/functions, and not other sources from the Web, since there are many different variants on the Web!) One of the candidates always imitates and emulates a previous president. This candidate believes they will win by becoming the most popular person on social media. An intern in their campaign wants to write a Mapreduce program. In MapReduce, one writes a program for Map that processes one input line at a time and outputs zero or more (key, value) pairs; and one writes a program for Reduce that processes an input of (key, all values for key). The iteration over input lines is done automatically by the MapReduce framework. The intern would like to know who are the influential Twitter users most similar to their candidate, and would like to use Hadoop for this. The intern uses an input file containing information from Twitter (which is an asymmetrical social network) about which users "follow" which other users. If user a follows b, the entry line is (a, b) - you can assume this data is already sharded in HDFS and can be loaded from there. Can you help the intern? Write a MapReduce program (Map and Reduce separately) that outputs the list of all users U who satisfy the following three conditions simultaneously: U has at least 100 million followers, and U herself/himself follows fewer than 10 users, and U follows at least one user V who in turn has at least 10 million followers (e.g., @Barack Obama would be such a U). You can chain Mapreduces if you want (but only if you must, and even then, only the least number). Your program must be as highly parallelizable as possible. Correctness is paramount, though you should try to make it as fast as possible.

Fig: 1