material. Also, for all Mapreduce questions please only use the class definition/template
for Mapreduce jobs, i.e., Map/Reduce tasks/functions, and not other sources from the
Web, since there are many different variants on the Web!) One of the candidates
always imitates and emulates a previous president. This candidate believes they
will win by becoming the most popular person on social media. An intern in
their campaign wants to write a Mapreduce program. In MapReduce, one writes
a program for Map that processes one input line at a time and outputs zero or
more (key, value) pairs; and one writes a program for Reduce that processes an
input of (key, all values for key). The iteration over input lines is done
automatically by the MapReduce framework. The intern would like to know who
are the influential Twitter users most similar to their candidate, and would like
to use Hadoop for this. The intern uses an input file containing information from
Twitter (which is an asymmetrical social network) about which users "follow"
which other users. If user a follows b, the entry line is (a, b) - you can assume this
data is already sharded in HDFS and can be loaded from there. Can you help the
intern? Write a MapReduce program (Map and Reduce separately) that outputs
the list of all users U who satisfy the following three conditions simultaneously:
U has at least 100 million followers, and U herself/himself follows fewer than 10
users, and U follows at least one user V who in turn has at least 10 million
followers (e.g., @Barack Obama would be such a U). You can chain Mapreduces if
you want (but only if you must, and even then, only the least number). Your
program must be as highly parallelizable as possible. Correctness is paramount,
though you should try to make it as fast as possible.
Fig: 1