A Simple Map-Reduce Program

Description

The purpose of this project is to develop a simple Map-Reduce program on Hadoop that analyzes data from Netflix.

This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an

automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in class as well as

with the programs submitted in previous years. This program will find similarities even if you rename variables, move code, change code structure, etc.

Note that, if you use a Search Engine to find similar programs on the web, we will find these programs too. So don't do it because you will get caught and you

will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Don't use ChatGPT or

any other AI program to generate code. Just do your project alone using the help given in this project description and from your instructor and GTA only.

Platform

You will develop your program on your laptop and then on SDSC Expanse. Optionally, you may use IntelliJ IDEA or Eclipse to help you develop your program on

your laptop, but you should test your programs on Expanse before you submit them.

How to develop your project on your laptop

You can use your laptop to develop your program and then test it and run it on Expanse. This step is optional but highly recommended because it will save you a lot of time. Note that testing and running your program on Expanse is required.

If you have Mac OS or Linux, make sure you have Java and Maven installed. On Mac, you can install Maven using Homebrew: brew install maven. On

Ubuntu Linux, use: apt install maven.

On Windows 10, you need to install Windows Subsystem for Linux (WSL 2) and then Ubuntu 22.04 LTS. It's OK if you have WSL 1 or an older Ubuntu. Then,

open a unix shell (terminal) on WSL2 and do:

sudo apt update

sudo apt upgrade

sudo apt install openjdk-8-jdk maven
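Once Java and Maven are installed on any of the platforms above, a quick check that the tools are on your PATH can save a failed build later. The following is only a sketch; the function name require_tools is made up for this illustration and is not part of the project files.

```shell
# Report any required command-line tool that is missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
# (require_tools is a hypothetical helper name, not part of project1.)
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Typical use for this project:
#   require_tools java mvn wget tar && echo "all tools found"
```

If the function prints nothing, every tool in the list was found.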

To install Hadoop and the project on Mac, Linux, or Windows WSL2, cut&paste and execute on the unix shell:

cd

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz

tar xfz hadoop-3.2.2.tar.gz

wget http://lambda.uta.edu/cse6332/project1.tgz

tar xfz project1.tgz

You should also set your JAVA_HOME to point to your java installation. For example, on Windows 10 do:

To test Map-Reduce, go to project1/examples/src/main/java and look at the two Map-Reduce examples Simple.java and Join.java. You can

compile both Java files using:

cd

cd project1/examples

mvn install

and you can run Simple in standalone mode using:

~/hadoop-3.2.2/bin/hadoop jar target/*.jar Simple simple.txt output-simple

The file output-simple/part-r-00000 will contain the results.
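To look at the results, you can print the part files in the output directory. The sketch below assumes the usual Hadoop layout, where a job with more than one reducer writes one part file per reducer; the helper name show_results is hypothetical.

```shell
# Print the contents of every part file in a job's output directory.
# The glob expands in sorted order, so part-r-00000 is printed first.
# (show_results is a hypothetical helper name, not part of project1.)
show_results() {
    for part in "$1"/part-*; do
        if [ -f "$part" ]; then
            cat "$part"
        fi
    done
}

# Usage from project1/examples, after running Simple:
#   show_results output-simple
```

Note that Hadoop jobs using the standard file output format stop with an error if the output directory already exists, so remove output-simple (rm -rf output-simple) before running Simple again.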

To compile and run project1:

cd

cd project1

mvn install

rm -rf output1 output2

~/hadoop-3.2.2/bin/hadoop jar target/*.jar Netflix small-netflix.txt output1 output2

The files output1/part-r-00000 and output2/part-r-00000 must contain the same results as in the files small-solution1.txt and small-solution2.txt. After your project works correctly on your laptop (it produces the same results as the solution), copy it to Expanse:

cd

rm project1.tgz

tar cfz project1.tgz project1

scp project1.tgz xyz1234@login.expanse.sdsc.edu:

where xyz1234 is your Expanse username.
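Before you copy the project, the comparison against the solution files described above can be automated instead of done by eye. This is a minimal sketch that assumes "the same results" means byte-identical files; the helper name compare_outputs is hypothetical.

```shell
# Compare each job's output with its expected solution file.
# Arguments come in pairs: <output file> <solution file>.
# Prints one line per pair and returns non-zero if any pair differs.
# (compare_outputs is a hypothetical helper name, not part of project1.)
compare_outputs() {
    status=0
    while [ "$#" -ge 2 ]; do
        # cmp -s is silent and exits 0 only when the files are identical
        if cmp -s "$1" "$2"; then
            echo "OK: $1 matches $2"
        else
            echo "DIFFERENT: $1 vs $2"
            status=1
        fi
        shift 2
    done
    return "$status"
}

# Usage from the project1 directory:
#   compare_outputs output1/part-r-00000 small-solution1.txt \
#                   output2/part-r-00000 small-solution2.txt
```

When a pair is reported as different, running diff on the two files shows which lines disagree.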

Setting up your Project on Expanse

This step is required. You need to follow this step only the first time you run your code on Expanse. Follow the directions on how to log in to Expanse at SDSC Expanse. Please email the GTA if you need further help.

First, you need to allow password-less login to localhost (without it, you can't run Map-Reduce). Log in to Expanse. Then do:

ssh localhost

using your password and exit using control-D. Then do:

ssh-keygen -t rsa

(press enter at each line). Then do:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod og-wx ~/.ssh/authorized_keys

Now you should be able to ssh localhost without a password.
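If you want to confirm this without opening an interactive session, one option is to ask ssh to fail rather than prompt. The sketch below relies on the OpenSSH option BatchMode=yes, which disables password prompts; the helper name check_passwordless_ssh is hypothetical.

```shell
# Return 0 if "ssh <host>" succeeds without prompting for a password.
# BatchMode=yes makes ssh fail instead of asking for a password.
# (check_passwordless_ssh is a hypothetical helper name.)
check_passwordless_ssh() {
    host="${1:-localhost}"
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
        echo "password-less login to $host works"
    else
        echo "password-less login to $host failed"
        return 1
    fi
}

# Usage on Expanse, after the steps above:
#   check_passwordless_ssh
```

If the check fails, the permissions on ~/.ssh/authorized_keys are a common cause, which is what the chmod command above addresses.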

Then, edit the file .bashrc (note: it starts with a dot - you can see it using ls -a) using a text editor, such as nano .bashrc, and add the following lines at

the end (cut-and-paste):

SW=/expanse/lustre/projects/uot189/fegaras

alias run='srun --pty -A uot189 --partition=shared --nodes=1 --ntasks-per-node=1 --mem=2G -t 00:05:00 --wait=0 --export=ALL'

Log out and log in again to apply the changes.

Run your Project on Expanse

If you have already developed project1 on your laptop, copy project1.tgz from your laptop to Expanse. Otherwise, download project1 from the class web site using wget http://lambda.uta.edu/cse6332/project1.tgz. Then untar it using:

tar xfz project1.tgz

rm project1.tgz

chmod -R g-wrx,o-wrx project1

Go to project1/examples and look at the two Map-Reduce examples src/main/java/Simple.java and src/main/java/Join.java. You can

compile both Java files using:

run example.build

and you can run them in standalone mode using:

sbatch example.local.run

The file example.local.out will contain the trace log of the Map-Reduce evaluation while the files output-simple/part-r-00000 and output-join/part-r-00000 will contain the results.

You can compile Netflix.java on Expanse using:

run netflix.build

and you can run Netflix.java in standalone mode over the small dataset using:

sbatch netflix.local.run

The results generated by your two Map-Reduce jobs will be stored in the files small-output1.txt and small-output2.txt. Your results must be the

same as the solution.

You should develop and run your programs in standalone mode until you get the correct result. After you make sure that your program runs correctly in

standalone mode, run it in distributed mode using:

sbatch netflix.distr.run
