Warning
Please see the assignment on Canvas for dates.
Important
- You'll want to read this entire document before beginning the project. Please ask any questions you have on Piazza, but only if this README does not answer your question.
- Do NOT open any of the projects in a windows operating system. This will change the line endings and causes strange errors. Always clone the project directly into docker.
- Finally, this is a large project. Be sure to start early. If you wait until a few days before the due date, you are unlikely to finish in time.
- Understand how to use multiple threads to finish a computational task faster.
- Become familiar with running concurrent threads with POSIX-threads library (pthreads) and the basic usage of concurrent data structures among parallel threads.
- Sharpen systems programming skills by working on a practical project.
For this project, you will implement a parallel zip (pzip) program, using the C programming language and POSIX-threads. The pzip will read an input text file, which is composed of lowercase letters (a-z) only. As an output it will produce a binary file indicating the consecutive uses of each character. The pzip operation needs to be done in parallel using pthreads library.
The figure above shows an overview of the inputs, outputs and your
program will flow. You are expected to implement the middle portion
indicated by pzip(inputs,outputs)
. Within this function, there are
three major steps that you should follow to establish the parallel
threads logic:
- Call
pthread_create()
to launch parallel threads: Once threads are created they will iterate through an equal and dedicated portion ofinput_chars
, store the consecutive occurrence results locally, and also update a globalchar_frequency
array that holds the total/global frequency of the occurrences of each character. - Call
pthread_barrier()
to synchronize the pthreads, without destroying them: This barrier is required to make sure that each thread has finished locally counting their portions of characters. Threads need to synchronize because each thread needs to know how manyzipped_char
structs they have in theirlocalResult
arrays, so that they can calculate the exact index ofzipped_char
array that they need to copy their local results into. - Call
pthread_join()
to finish the parallel execution and synchronize the pthreads again.
Please note that your code is expected ONLY to operate on the input and output data structures provided in the figure.
Warning
zipped_chars
array holds consecutive occurrences, whereaschar_frequency
array holds the total number of occurrences. A character may appear more than once inzipped_chars
array, whereaschar_frequency
is populated on-the-go as threads encounter each character.
After running make
, you should have an executable program named pzip
located in the root of your repository. The usage is as follows:
pzip INPUT_FILE OUTPUT_FILE N_THREADS [--debug]
INPUT_FILE
: The input file name which is contains only lowercase letters (a-z). The format is explained below.OUTPUT_FILE
: The output file name which will be the result of your program output. The format is explained below.N_THREADS
: The number of parallel threads that will be used during pzip.[--debug]
: Optional parameter to display the content of input/output variables. When this option is set, the output becomes a human readable text file. Otherwise, it is a binary file. We will test your program without this parameter.
- The number of threads is assumed to be greater than or equal to one.
- The number of characters in the input file is assumed to be a positive multiple of the number of threads.
- Each thread is expected to process an equal portion of the input characters.
- IMPORTANT: If the consecutive occurrence of a character spans
two threads, you SHOULD NOT merge these two occurrences and
SHOULD report them separately. For example, in the image above,
the character
s
appears twice for the first thread and, again, the same letter appears three times in the portion processed by the second thread. In the final output, rather than countings
for 5 times, we have two counts of it, which are 2 and 3, respectively. - The number of the same character in a row will not exceed 255 (i.e.,
the maximum value of a
uint8_t
)
- main() function in
main/pzip.c
(DO NOT TOUCH THIS FILE) - Program parameter handling
- Input file reading, parsing and coversion (i.e., mapping) to
input_chars
array - Output file formatting, writing and coversion (i.e., mapping) from
zipped_chars
array - Debug output
To ease your implementation and to make grading fairer, using the
starter code is a requirement of this project. You are NOT ALLOWED
to make any modification to the main/pzip.c
file. You may add new
structs or functions to the main/pzip.h
file. However, you SHOULD
NOT change/delete/modify existing functions/variables/headers/structs
in main/pzip.h
file.
- START FROM HERE: Implement body of the
pzip()
function insrc/pzip.c
- In the same file, also create a callback pthread function that will
be called by
pthread_create()
withinpzip()
function. - By the end of pzip() function, properly populate output pointers to
zipped_chars
,zipped_chars_count
andchar_frequency
. Please note that the memory for these arrays and variables will be allocated and freed for you by main(). - Implement other functions and global/local variables as you need.
The input file is already parsed into an input_chars
array for you and
the total number of chars in this file is provided to you in the
input_chars_size
variable. The input file is simply a text file that
contains nothing but the 26 lower case letters (i.e., a-z). There are no
white spaces, line breaks, return characters or other characters. While
you don't need to parse the input file, you need to know the format so
that you can prepare your own test inputs. Example input:
- Input file content:
aaeeoooooeee
int input_chars_size = 12;
char* input = {'a','a','e','e','o','o','o','o','o','e','e','e'};
You should generate inputs of any size of NUM by using the following script in your repository root:
./generate_chars.py NUM > test_input
Warning
DO NOT GENERATE FILES MANUALLY USING A TEXT EDITOR The input files are only lower case letters with nothing else. The use of text editors may accidentally inject a newline into your input file. Please use the provided python script or run the command below to clean your files of any newlines.
cat FILENAME | tr -d '\n' \> NEWFILENAME
There are two output formats used by the program. Both of these formats
are generated by the starter code using the zipped_chars
array.
- Binary Output (Default)
If the --debug
option is not provided, the output of pzip is a binary
file. This file is automatically generated using zipped_chars
array
and zipped_chars_count
variable, which indicates the size of the
array.
- Text Output (--debug mode)
If the --debug
option is provided in the program arguments, the
contents of the zipped_chars
array will be written as human readable
text file. Each line of the output file will consist of a charracter and
the number of consecutive occurences of that character.
- Input file content:
aaeeoooooeee
- Code snippet:
int input_chars_size = 12;
char* input = {'a','a','e','e','o','o','o','o','o','e','e','e'};
int n_threads = 2;
struct zipped_char *zipped_chars = {{'a',2},{'e','2'},{'o',2},{'o',3},{'e',3}};
- Binary output file (in hexa-decimal):
61 02 65 02 6f 02 6f 03 65 03
- Text (--debug) output file (in plain text, new lines are ommitted):
a 2 e 2 o 2 o 3 e 3
- Input file content:
aaeeoooooeeeeeeeeeeaaaaaaaaaadddddddddssssssslssssssyyyyyywwwwww
- Code snippet:
int input_chars_size = 64;
char* input = {`a`,`a`,`e`,`e`,`o`,`o`,`o`,`o`,`o`,`e`,`e`,`e`,`e`,`e`, `e`,`e`,`e`,`e`,`e`,`a`,`a`,`a`,`a`,`a`,`a`,`a`,`a`,`a`,`a`,`d`,`d`,`d`, `d`,`d`,`d`,`d`,`d`,`d`,`s`,`s`,`s`,`s`,`s`,`s`,`s`,`l`,`s`,`s`,`s`,`s`, `s`,`s`,`y`,`y`,`y`,`y`,`y`,`y`,`w`,`w`,`w`,`w`,`w`,`w`};
int n_threads = 4;
struct zipped_char* zipped_chars = {{'a',2},{'e','2'},{'o',5},{'e',7},{'e',3},{'a',10},{'d',3},{'d',6},{'s',7},{'l',1},{'s',2},{'s',4},{'y',6},{'w',6}};
-
Binary output file (in hexa-decimal):
61 02 65 02 6f 05 65 07 65 03 61 0a 64 03 64 06 73 07 6c 01 73 02 73 04 79 06 77 06
-
Text (--debug) output file (in plain text, new lines are ommitted):
a 2 e 2 o 5 e 7 e 3 a 10 d 3 d 6 s 7 l 1 s 2 s 4 y 6 w 6
- Input file content:
aaaaaaaaaaaa
- Code snippet:
int input_chars_size = 12;
char* input = {'a','a','a','a','a','a','a','a','a','a','a','a'};
int n_threads = 4;
struct zipped_char* zipped_chars = {{'a',3},{'a','3'},{'a',3},{'a',3}};
- Binary output file (in hexa-decimal):
61 03 61 03 61 03 61 03
- Text (--debug) output file (in plain text, new lines are ommitted):
a 3 a 3 a 3 a 3
Warning
Note that the zipped output file is not fully compressed. In example 1,
the zipped chars could have been determined as
{{'a',2},{'e','2'},{'o',5},{'e',3}};
. However, for the sake of
simplicity for the project, we do not ask parallel threads to talk to
each other and merge their output. You are not asked to implement this
functionality, and your program may not pass our automated tests if you
implement this optimization.
We will be grading your code based on:
-
Functionality and accuracy:
Your program should produce the output as explained above. Please note that, due to the simplifications we have made, the output may change depending on the number of threads being used, if character sequences span thread boundaries, as in the example given in the figure. Your submitted code should have the same
mains/pzip.c
file as in the starter code. -
Parallelism and performance:
Most of pzip, including input/output and reading/writing, will operate in parallel. Your program should operate as shown in the Figure above. You should NOT do the counting of characters serially. Serial creation and joining of threads is OK.
-
General requirements:
Your program should follow the non-project-specific general requirements indicated below.
Warning
You will NOT receive performance points if your code is not correct. Slow, but correct programs are always more valuable than fast, incorrect programs, and this is reflected in the grading of this project. But also keep in mind that the autograder has a 5 minute timeout as none of the test cases should take longer than that to complete (even input-huge).
Additionally, we will be using diff
to verify correctness. This means
you will not get partial credit on a within-test basis (i.e., you will
either pass, or fail, each individual test. There is no in-between)
Lastly, if your program crashes during execution, it will be considered "incorrect", regardless of whether it produces the correct output file. Due to this it is VITAL that all memory issues are taken care of as these can cause code to crash. It is possible to have memory issues even when you don't call malloc especially in this project where memory is being divided up for the threads!
-
To test whether your program properly AND efficiently use threads, we will run your program with large test files (e.g.
test/input_large
). We will use the following formula to evaluate the 'parallel efficiency', i.e.,$\text{PE}$ , of your code:$$\text{PE} = \frac{\text{CPU Time (User)} + \text{CPU Time (System)}}{\text{Wall time} \cdot \text{Number of threads}}$$ -
In a perfectly parallel program,
$\text{PE}$ should be equal to$1.0$ , however this is never possible. Actual parallel efficiency will be less than$1.0$ . To measure the$\text{PE}$ of your program via themeasure.py
script we provided, you may execute the following command:
./measure.py ./pzip /tmp/CSCI-442--DO-NOT-DELETE/input_huge ./out 8
-
On Isengard, our ideal solution for the parameters in the above command runs under 1 second (
$\text{Wall time} < 1$ ) with a$\text{PE}$ greater than$0.75$ . -
Your program is expected to run the command above on Isengard under 1.5 seconds (
$\text{Wall time} < 1.5$ ) with a ($\text{PE} > 0.5$ ).
Important
- Please note that these values are valid only for the input
file referenced above
(
/tmp/CSCI-442--DO-NOT-DELETE/input_huge
) and withN_THREADS=8
on Isengard. - Your first run may be slower due to internal page caching. Within the grading script, your code will be run three times and only the fastest one will be used for grading.
- Note: If you have a
$\text{Wall time} < 0.25$ , then your PE does not need to meet the requirement of$> 0.5$ .
- If your code fails to meet the performance criteria above, you will get a partial grade, depending on how fast and efficient your code is.
Important
- Reminder: you will receive NO performance points if your program does not produce the correct output or crashes.
- The top three fastest and correct submissions will be given +3, +2 and +1 extra points, respectively.
To validate if your output for input_huge is correct, we provide the
hash of the correct input_huge output below. output_huge_8t.expected
solution hash:
2f7c59a2ff08217dd0ac35fa4a437e92
You can check the hash of your output via the command below
md5sum FILENAME
Where FILENAME
is the name of your output file.
If your hash does not match up, then there is an error in your code. Do make sure to run it multiple to make sure your code is not outputing the correct solution only some of the time. Memory issues are able to make your code output the correct output on some runs but incorrect output on other runs.
The performance criteria given above is based on the execution on
isengard.mines.edu
. However, due to limited capacity of Isengard, we
suggest you establish the following workflow:
- As with prior projects, do the initial development on your local Docker image.
- Once you complete the initial functionality, push your changes to github.
- Pull your repo on isengard and test for performance.
- Either use vim/emacs on Isengard or push/pull on your local Docker image to fix problems and continue development.
We STRONGLY discourage you connecting VS Code to Isengard via the ssh plugin. VS Code runs linters and syntax checks on Isengard, which in turn puts heavy pressure on Isengard. We had severe performance problems in the past, especially in the last two days of submissions. Therefore do as much development as you can locally. We won't be able to help you if you leave Isengard testing to the last days and if Isengard becomes too crowded.
Submission of your project will be handled via Gradescope.
- Create the submission file using the provided
make-submission
script:
./make-submission
-
This will create a
.zip
file named$USER-submission
(e.g., for me, this would be namedlhenke-submission.zip
). -
Submit this
.zip
file to Gradescope. You will get a confirmation email if you did this correctly.
Warning
You are REQUIRED to use make-submission
to form the .zip
file.
Failure to do so may cause your program to not compile on Gradescope.
- You are REQUIRED to use Isengard to develop and test this project.
- You should handle errors gracefully. All system calls can fail: if
this occurs print a relavent and descriptive error to
stderr
(notstdout
) and exit. Your program should have a non-zero exit status if any errors are encountered. (Make sure to add 'n's to those errors too!) - Your program should have a zero exit status if no errors are encountered.
- Your project must be written in the C programming language, and execute on Isengard.
- You should follow Linux Kernel coding style, a common style guide for open-source C projects.
- Your project must not execute external programs or use network resources.
- Your project should be memory safe. For example, if your program is
susceptible to buffer-overflow based on certain inputs, it is not
memory safe. As a corollary to this, you should not use any of the
following functions:
strcat
,strcpy
, orsprintf
. - You should
free
any memory that you heap-allocate, andclose
(orclosedir
) any files that you open. - To compile your code, the grader should be able to
cd
into the root directory of your repository and runmake
using the providedMakefile
.
You will be using some or all of the following pthread library calls:
pthread_create
pthread_join
pthread_mutex_init
pthread_barrier_init
pthread_mutex_lock
pthread_mutex_unlock
pthread_barrier_wait
Please refer to https://github.com/dongchen0523/os/blob/main/pthread-Tutorial.pdf and https://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/pthreads.html for tutorials on how to use pthreads.
Provided for you are three reference executable files:
-
pzip_instructor_isengard
: A working version of the project that scored 100% in the autograder. You may use it to help understand the behavior of a working project as well as double check any of your outputs. It is included in this template repository and can be run with:./pzip_instructor_isengard
Things to keep in mind about the reference executable:
- It was developed on Isengard and is only guaranteed to work on Isengard.
- This solution is not the ideal solution. If you are running it on the
input_huge
, out of three runs, it will have a$\text{Wall time} < 1.5$ and a$\text{PE} > 0.75$ , which is expected of your submission as well. - It is an instructor version and you may not execute it from within your own code. You will recieve a zero if you do!
-
pzip_instructor_arm
andpzip_instructor_x86
: These binaries are the solutions compiled for the docker images you are working on. You may use these binaries to make sure that your code is functionally working probably. Please do not use docker binaries for performance evaluation, since the performance of your development machine will vary. Use Isengard binary on Isengard to compare your performance to the solution.
If you're using VS Code (which we recommend), you can open the
project-3.code-workspace
file and click
the Open Workspace
button. This will open the project in a new window
with the recommended settings. We've also included a debug configuration,
allowing you to debug your program - all you need to do is go to the
debug tab and click the green play button.
This is an individual project. All code you submit should be written by yourself. You should not share your code with others.
Please see the syllabus for the full collaboration policy.
Warning
Plagarism will be punished harshly!
Please do keep any Git repos private, even after you finish this course. This will keep the project fun for future students!
Remote access to Isengard is quite similar to ALAMODE, but the hostname
is isengard.mines.edu
.
For example, to ssh
into the machine with your campus MultiPass login,
use this command:
Note: you need to be on the campus network or VPN for this to work. If
you are working from home, use either the VPN or hop thru
jumpbox.mines.edu
first.