EECS 484 Projects | p4-ghj

Project 4: Grace Hash Join

Worth Released Due
70 points Date TBD June 15th at 11:55 PM ET

Project 4 is due on June 15th at 11:55 PM EST. Please refer to the EECS 484 SP23 Course Policies for more information on penalties for late submissions, late day tokens, and sick days.

Introduction

In Project 4, we will implement a database algorithm – Grace Hash Join – using the C++ language. Instead of utilizing databases like in Projects 1-3, we will be implementing part of a database. This implies that this project does not require CAEN or SQL*Plus!

Submissions

This project is to be done in teams of 2 students (recommended) or individually. Be sure to create your team on the Autograder.

Honor Code

By submitting this project, you are agreeing to abide by the Honor Code: “I have neither given nor received unauthorized aid on this assignment, nor have I concealed any violations of the Honor Code.” You may not share answers with other students actively enrolled in the course outside of your teammate, nor may you consult with students who took the course in previous semesters. You are, however, allowed to discuss general approaches and class concepts with other students, and you are also permitted (and encouraged!) to post questions on Piazza.

Starter Files

There are two main phases in GHJ, partition and probe. Your job is to implement the partition and probe functions in Join.cpp. You are given other starter files, which along with your code, will simulate the data flow of records in disk and memory and perform a join operation between two relations.

Download the starter files (p4-starter_files.tar.gz).

There are 6 main components: Record, Page, Disk, Mem, Bucket, and Join. The files constants.hpp, main.cpp, Makefile, left_rel.txt and right_rel.txt, and .clang-format will also be used for testing and formatting. Code overview and key points for each component are discussed below.

Record.hpp and Record.cpp

These files define the data structure for an emulated data record with two main fields: key and data. Several member functions from this class that you should use in your implementation include:

Page.hpp and Page.cpp

These files define the data structure for an emulated page. Several member functions from this class that you should use in your implementation include:

Disk.hpp and Disk.cpp

These files define the data structure for an emulated disk. You do not need to use any member functions from this class in your implementation.

The only member function you need to be concerned about is read_data(const char* file_name), which loads all the data records from a text file into the emulated “disk” data structure and returns a disk page id pair <begin, end>, for which all the loaded data is stored in the range of disk pages [begin, end). This function is used in the provided main.cpp file.

Mem.hpp and Mem.cpp

These files define the data structure for emulated memory. Several member functions you should use in your implementation include:

Bucket.hpp and Bucket.cpp

These files define the data structure for a bucket, which is used to store the output result of the partition phase. Each bucket stores all the disk page ids and the number of records for left and right relations of one partition. Several member functions you should use in your implementation include:

Join.hpp and Join.cpp

These files define two functions: partition and probe, which make up the two main stages of GHJ. These two functions are the ONLY part you need to implement for this project.

constants.hpp

This file defines three constant integer values used throughout the project.

Other files

Other files you may find helpful to look over include:

Building and Running

This project was developed and tested in a Linux environment with GCC 5.4.0 and C++14 (a few features are not supported in GCC 5.4.0). You can work on the project anywhere, but as usual, we recommend doing your final tests in the CAEN Linux environment.

To build the project and run the executable file, use the Makefile, where left_rel.txt and right_rel.txt represent the two text file names that contain all the data records for joining two relations.

$ make
$ ./GHJ left_rel.txt right_rel.txt

To remove all extraneous files, run

$ make clean

Grace Hash Join Pseudocode

Refer to the following pseudocode for a complete algorithm of Grace Hash Join:

Grace Hash Join Pseudocode

In the figure above, a “bucket” refers to a page of the in-memory hash table. For more information regarding simple hash join and in-memory hash table, visit Rosetta Code or the course slides.

Key Reminders

Submitting

The only deliverable is Join.cpp, which is worth 70 points. Submissions should be made to the Autograder. There are 5 public test cases on the Autograder and no private tests.

Remember to remove any print statements, as your submission will fail on the Autograder even if it compiles on CAEN.

Each team will be allowed 3 submissions per day with feedback; any submissions made in excess of those 3 will be graded, but the results of those submissions will be hidden from the team. Your highest scoring submission will be used for grading, with ties favoring your latest submission.

Appendix

The result of joining left_rel.txt and right_rel.txt provided in the starter files should look similar to the output below.

Size of GHJ result: 1 pages
Page 0 with disk id = 6
Record with key=0 and data=0l
Record with key=0 and data=0r
Record with key=1 and data=1l
Record with key=1 and data=1r
Record with key=1 and data=1l
Record with key=1 and data=11r
Record with key=1 and data=11l
Record with key=1 and data=1r
Record with key=1 and data=11l
Record with key=1 and data=11r
Record with key=1 and data=111l
Record with key=1 and data=1r
Record with key=1 and data=111l
Record with key=1 and data=11r

In the output above, each pair of records is a joined result. For example,

Record with key=1 and data=1l
Record with key=1 and data=1r

is the joined result of a record from left_rel.txt (notice how the data ends with an l) and a record from right_rel.txt (notice how the data ends with an r) where both records have the same key 1.

The order of these pairs (and the order within each pair) does not matter on the Autograder. Your code can output these 7 pairs of records shown above in a different order and still be correct. Your code can also have a disk id that is different from the one shown above and still be correct.

Acknowledgements

This project was written and revised over the years by EECS 484 staff at the University of Michigan. The most recent version was updated and moved to Primer Spec by Owen Pang.

This document is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You may share and adapt this document, but not for commercial purposes. You may not share source code included in this document.