Trier University Department IV - Computer Sciences
Assignment 5 of the course Big Data Analytics Summer semester 2024
Task 1: Joins
25 (+20) points
In this task, we are again going to work with the data and parser known from the previous assignment (alex165k.xml and MRDPUtils.java). Additionally, the file titles1m.xml provided in Moodle will act as a secondary data source. It contains the "publication year" and "title" of many publications, identified by their arXiv-IDs and OpenAlex-IDs as unique keys. titles1m.xml was created from the full OpenAlex dataset and contains only entries matching keys used in alex165k.xml.
(a) Write a MapReduce application that uses a reduce-side join in order to match information about publications based on their arXiv-ID. Use alex165k.paper_id == titles1m.arxiv_id for the join. We are especially interested in combining authors, discipline, year and title for each publication.
Hints:
· Always look at the contents of the input files to get a first impression of how the data is structured and how to process it.
· Since the two XML files store data in different fields, it is a good idea to use two distinct Mapper classes for the two input files. Refer to chapter 3-2, slide 19 for an example of MultipleInputs. Use TextInputFormat.class as the argument for the input format class.
· In the mapper class processing alex165k.xml, emit paper_id as key and use a suitable format to transfer the contents of authors and discipline as value. Some author names include a semicolon, so a different field separator is advisable.
· In the mapper class processing titles1m.xml, emit arxiv_id as key and use a suitable format to transfer the contents of year and title as value. Since titles may contain almost any character, it is advisable to emit the year first.
· In the reducer class, you have to join the values emitted by both mappers for each arXiv-ID. If a join partner is found, emit the arXiv-ID as key and the title of the publication as value. Emit nothing if no join partner is found.
Hand in your code and the output of your program for the given input files. To reduce the file size, include only output for publications from the year 2020 in the discipline Physics that have at least one author with the last name Smith. This should yield 8 results.
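The join logic described in the hints for (a) can be sketched in plain Java, without any Hadoop classes, to make the value format and the reducer's pairing step concrete. This is only an illustration under stated assumptions: the class name JoinSketch, the method names, the tags "A" and "T", and the tab separator are all hypothetical choices, not part of the assignment or of MRDPUtils.java. The sketch returns the full combined record so that a later filter on year, discipline and author is possible; which fields the reducer finally emits is up to the task description.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of reduce-side join logic; no Hadoop classes involved. */
final class JoinSketch {
    // Hypothetical tags marking which input file a value came from.
    static final String ALEX_TAG = "A";
    static final String TITLE_TAG = "T";
    // Assumed separator: a tab, because author names may contain semicolons.
    static final String SEP = "\t";

    /** Value a mapper for alex165k.xml might emit for one record. */
    static String alexValue(String authors, String discipline) {
        return ALEX_TAG + SEP + authors + SEP + discipline;
    }

    /** Value a mapper for titles1m.xml might emit: year first, title last. */
    static String titleValue(String year, String title) {
        return TITLE_TAG + SEP + year + SEP + title;
    }

    /** What a reducer might do with all values grouped under one arXiv-ID. */
    static List<String> join(List<String> values) {
        List<String[]> alex = new ArrayList<>();
        List<String[]> titles = new ArrayList<>();
        for (String v : values) {
            // Limit 3 keeps any separator inside the last field (the title) intact.
            String[] parts = v.split(SEP, 3);
            if (parts.length < 3) {
                continue; // skip malformed values
            }
            if (parts[0].equals(ALEX_TAG)) {
                alex.add(parts);
            } else if (parts[0].equals(TITLE_TAG)) {
                titles.add(parts);
            }
        }
        // Pair every alex record with every title record; if either side is
        // empty, no join partner exists and nothing is produced.
        List<String> out = new ArrayList<>();
        for (String[] a : alex) {
            for (String[] t : titles) {
                out.add(a[1] + SEP + a[2] + SEP + t[1] + SEP + t[2]);
            }
        }
        return out;
    }
}
```

Placing the year before the title means the title is always the last field, so a split with a fixed limit cannot cut it apart even if the title itself contains the separator.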
(b) (This is an optional task) Write a MapReduce application that uses reduce-side joins to calculate the average "year of publication" for works listed in a publication's list of references. alex165k.xml contains a list of cited OpenAlex-IDs in its referenced attribute and titles1m.xml contains the "year of publication" in its year attribute.
Hints:
· Calculating the results in a single run of your application requires a two-step join: The output of the first MapReduce job has to be used as input for a second MapReduce job. Two distinct sets of mappers and reducers are needed. Use the first job's output directory as the second job's input directory; there is no need to link to a specific file by name.
· In the first MapReduce job, emit the referenced OpenAlex-IDs from alex165k.xml as key in the mapper and the associated arXiv-IDs as value. Have a second mapper emit the "year of publication" from titles1m.xml for each OpenAlex-ID and join the entries in the reducer. Ignore values without a join partner. Emit an intermediate result containing at least the arXiv-ID and the year.
· In the second MapReduce job, process the intermediate result in the mapper and emit the arXiv-ID as key and the year as value. Then aggregate the years in the reducer and emit the average year as value for each arXiv-ID.
· Use suitable Writable classes to load and emit your data, for example, Text, IntWritable, DoubleWritable or NullWritable.
Hand in your code and the output of your program for the given input files. Include only output for publications where the arXiv-ID starts with 1608.04 to reduce the file size. Do not hand in your intermediate results.
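The two-step structure from the hints for (b) can be simulated in memory to check the logic before writing the actual jobs. The following is a minimal sketch, not a Hadoop implementation: the class name AvgRefYearSketch, the method names, and the value tags "R" (a referencing arXiv-ID) and "Y" (a year) are hypothetical, and the input map stands in for the grouping by OpenAlex-ID that the shuffle phase would perform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the two-step join; stands in for two MapReduce jobs. */
final class AvgRefYearSketch {
    /** Job 1 reducer logic: values of one OpenAlex-ID -> (arXiv-ID, year) pairs. */
    static List<String[]> joinStep(List<String> values) {
        List<String> citing = new ArrayList<>();
        List<Integer> years = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("R\t")) {
                citing.add(v.substring(2));
            } else if (v.startsWith("Y\t")) {
                // Assumes the year is a plain integer; real data may need validation.
                years.add(Integer.parseInt(v.substring(2).trim()));
            }
        }
        // Values without a join partner produce no output.
        List<String[]> out = new ArrayList<>();
        for (String arxivId : citing) {
            for (int year : years) {
                out.add(new String[] {arxivId, Integer.toString(year)});
            }
        }
        return out;
    }

    /** Job 2 reducer logic: average of all years collected for one arXiv-ID. */
    static double averageYear(List<Integer> years) {
        long sum = 0;
        for (int y : years) {
            sum += y;
        }
        return (double) sum / years.size();
    }

    /** Chains both steps: the output of step 1 is regrouped by arXiv-ID for step 2. */
    static Map<String, Double> run(Map<String, List<String>> groupedByOpenAlexId) {
        Map<String, List<Integer>> byArxiv = new TreeMap<>();
        for (List<String> values : groupedByOpenAlexId.values()) {
            for (String[] pair : joinStep(values)) {
                byArxiv.computeIfAbsent(pair[0], k -> new ArrayList<>())
                       .add(Integer.parseInt(pair[1]));
            }
        }
        Map<String, Double> result = new TreeMap<>();
        byArxiv.forEach((id, ys) -> result.put(id, averageYear(ys)));
        return result;
    }
}
```

In the real application, joinStep would correspond to the first job's reducer and averageYear to the second job's reducer, with the first job's output directory used as the second job's input.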
Task 2: n-grams
20 points
Write a MapReduce application that computes all character-level n-grams (with 10 ≤ n ≤ 15) that appear at least 5,000 times in the input text, using the naive approach discussed in the lecture. Represent an n-gram in a reasonable way. Split up a line into its characters, after removing punctuation and converting the text to lowercase, and generate n-grams from these characters.
Hand in your code and the output of the reducer when run on the input file corpus.txt, which is available in Moodle (as a compressed file); this should yield 13 results.
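The preprocessing and n-gram generation for Task 2 can be sketched in plain Java as follows. This is an illustration under assumptions, not the lecture's reference solution: the class name NGramSketch and its methods are hypothetical, "punctuation" is taken to mean the ASCII punctuation matched by the regex class \p{Punct}, whitespace is kept as an ordinary character, and an n-gram is represented as a String. The counting step stands in for the naive approach, where each n-gram is emitted with count 1 and the reducer sums and filters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java sketch of character-level n-gram counting; no Hadoop classes. */
final class NGramSketch {
    /** Removes ASCII punctuation (assumed definition) and lowercases the line. */
    static String normalize(String line) {
        return line.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
    }

    /** Mapper logic: all n-grams of one line for minN <= n <= maxN. */
    static List<String> ngrams(String line, int minN, int maxN) {
        String text = normalize(line);
        List<String> out = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                out.add(text.substring(i, i + n));
            }
        }
        return out;
    }

    /** Reducer logic: sum the occurrences and keep n-grams seen at least minCount times. */
    static Map<String, Integer> frequent(List<String> lines, int minN, int maxN, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String gram : ngrams(line, minN, maxN)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }
}
```

For the assignment's parameters, the call would be frequent(lines, 10, 15, 5000); n-grams are generated per line here, so they never span a line break.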
General remarks:
· The tutorial group takes place on Mondays at 14:25 in F55 on a (roughly) bi-weekly basis.
· The first meeting of the tutorial group was on May 27, 2024.
· To be admitted to the final exam, you need to acquire at least 50% of the points in the assignments.
· Submissions must be made in groups of 3; one submission is sufficient for the whole group. Groups must be chosen in Moodle (see link on the course page in Moodle). Write the names of all group members on your solutions. Students without a group cannot submit.
· Solutions must be handed in before the deadline in Moodle (https://moodle.uni-trier.de/, course BDA-24) as a PDF or, if submitting multiple files, as an archive (.zip or comparable). Submissions that arrive after the deadline will not be considered.
· Graded versions of your submissions will be returned in Moodle by the following tutorial.
· Announcements regarding the lecture and the tutorial group will be made in the lecture's area in StudIP.