Question
For this assignment, we will use the flights dataset (the same dataset we used in assignment #1).

1. Create a new database in Hive for this dataset. Create an external table in Hive to store the flights dataset. Create internal tables in Hive for the airline and airport data. Load the data for each of the three tables.
2. Use Hive queries to answer the following questions:
   a. Count the number of rows in each table.
   b. Display the first 10 records from each table.
   c. Find the total departure delay for each airline.
   d. Find the top 5 flights with the longest arrival delays.
   e. List the top 5 busiest airports by the number of departures.
   f. Find the total departure delay for each airline, along with the airline name.
   g. List the top 5 busiest airports by number of departures, along with airport names.
   h. Find the top 5 flights with the longest arrival delays, including airline name and destination airport name.
   i. Identify the airline with the most flights arriving at a specific airport.
   j. List the top 5 airlines with the highest total arrival delays at a specific airport (e.g., ORD).
3. Answer questions a, e, f, h, and i from question 2 above using MapReduce running on Hadoop.

Submission
- Submit the SQL queries you used and the results obtained for each task.
- Submit a video demonstrating the running of the queries.
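As a starting point for the MapReduce part, here is a minimal Python sketch of the map and reduce steps for the "total departure delay for each airline, along with the airline name" task. It runs locally on in-memory lines and only imitates what Hadoop does between the two phases (sort by key, then group). The column names `AIRLINE`, `DEPARTURE_DELAY`, and `IATA_CODE`, and the function names, are assumptions for illustration; the question does not give the dataset schema, so adjust them to match your files.

```python
import csv
from itertools import groupby


def map_delays(lines, airline_col="AIRLINE", delay_col="DEPARTURE_DELAY"):
    """Map phase: emit (airline_code, departure_delay) for each flight row.

    Column names are assumed, not taken from the assignment.
    """
    for row in csv.DictReader(lines):
        delay = row.get(delay_col)
        if not delay:
            continue  # skip rows with no recorded delay (e.g. cancelled flights)
        yield row[airline_col], float(delay)


def reduce_delays(pairs):
    """Shuffle and reduce: sort pairs by airline code, then sum each group."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for code, group in groupby(ordered, key=lambda kv: kv[0]):
        yield code, sum(delay for _, delay in group)


def attach_names(totals, airline_lines, code_col="IATA_CODE", name_col="AIRLINE"):
    """Join the totals with the small airlines table held in memory.

    Falls back to the code itself when no name is found.
    """
    names = {row[code_col]: row[name_col] for row in csv.DictReader(airline_lines)}
    for code, total in totals:
        yield code, names.get(code, code), total


if __name__ == "__main__":
    # Tiny made-up sample, only to show the call sequence.
    flights = ["AIRLINE,DEPARTURE_DELAY", "AA,10", "DL,5", "AA,-3", "DL,", "UA,7"]
    airlines = ["IATA_CODE,AIRLINE", "AA,American Airlines Inc.", "DL,Delta Air Lines Inc."]
    totals = reduce_delays(map_delays(flights))
    for record in attach_names(totals, airlines):
        print(record)
```

On a real cluster the same logic would typically be split into a mapper script and a reducer script that read from standard input, with Hadoop doing the sorting between them. Loading the airlines file into a dictionary works here because that table is small; the airports table could be joined the same way for the tasks that need airport names.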