Assignment 1 

CS 265/571 Advanced Programming Tools and Techniques
Instructors: Jeremy Johnson and Kurt Schmidt
Due date: Sunday Jan. 23 at 11:55 pm.


Introduction

The purpose of this assignment is for students to become comfortable using Unix filters (including Awk) to solve text processing problems. Students will also write a simple shell script and will need to become comfortable with Unix itself (basic commands, documentation, and file redirection in the bash shell), a Unix editor, and the Unix file system. Unix filters that may be used, in addition to Awk, are cut, grep, sort, uniq, and wc.  You may also want to use the -R option to grep and ls, which causes the command to be applied recursively to all subdirectories of the directory in which it is called.
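As a rough illustration of how these filters combine, the sketch below builds a small throwaway directory tree and runs a few pipelines over it. The directory names, file names, and file contents are made up for the example and have nothing to do with the real archive.

```shell
# Build a tiny throwaway tree (hypothetical names and contents).
demo=$(mktemp -d)
mkdir -p "$demo/1" "$demo/2"
printf 'apple\nbanana\n' > "$demo/1/a.txt"
printf 'apple\ncherry\n' > "$demo/2/b.txt"

# grep -R searches every file below the starting directory;
# -h drops the file name prefix so only the matching lines are printed.
matches=$(grep -Rh 'apple' "$demo" | wc -l)
echo "lines containing apple: $matches"

# sort | uniq -c counts how often each distinct line occurs;
# the final sort -rn puts the most frequent line first.
line_counts=$(cat "$demo"/*/*.txt | sort | uniq -c | sort -rn)
echo "$line_counts"

# ls -R lists the tree recursively; grep -c counts the lines ending in .txt.
txt_files=$(ls -R "$demo" | grep -c '\.txt$')
echo "text files found: $txt_files"

# Remove the throwaway tree.
rm -rf "$demo"
```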

The text data that will be used is a directory of email messages sent by the Math Doctors associated with the Math Forum's Ask Dr. Math service during the first 9 months of last year.  The data is organized into directories whose names are numbers.  Each directory contains text files with emails that were sent about different problems (one message per file).  There may be several emails about the same problem - this is called a thread.  The files are labeled by the problem number and a second number indicating which message in a given thread it is.  E.g., prob309264_03 indicates that the file contains the third message in the thread associated with problem 309264.  Note that the data set we are using only contains the messages sent by the Math Doctors; the corresponding messages sent by students asking the question and following up on the advice sent by the Math Doctor are not included.  The file prob309264_03 contains a sample message.  Notice that the email contains a header with information on the thread number, the Math Doctor who sent the message, and the date and time the message was sent.

The data can be obtained from:

/home/jjohnson/public_html/2004-05/winter/cs265/assignments/archive

You will need to log on to the MCS Unix machines (e.g. tux.cs.drexel.edu) to access this data.  Note that the data is large (133 MB) with many files, so you should not copy the entire data set to your account or computer.  Instead, copy one of the numbered subdirectories and experiment with it before working on the entire data set.

The following problems ask you to find out information about the messages and Math Doctors in the data set.  You should do so with Unix filters (you may need to combine several filters using a pipe), simple Unix scripts, or Awk programs.  There are many different ways you can solve each problem.  I am not looking for a particular solution, though I have several in mind; I am looking for effective use of the Unix tools that we have been discussing in class.
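To illustrate that the same question can often be answered either with a pipeline of filters or with an Awk program, here is a minimal sketch that counts repeated lines both ways. The input lines are invented for the example; nothing here assumes the format of the real message headers.

```shell
# Hypothetical input: one key per line.
sample=$(printf 'red\nblue\nred\ngreen\nred\nblue\n')

# Filter version: sort brings equal lines together, uniq -c counts each run.
pipeline_counts=$(printf '%s\n' "$sample" | sort | uniq -c)
echo "$pipeline_counts"

# Awk version: an associative array keyed by the whole line ($0).
# The order of "for (k in count)" is unspecified, so the output is sorted afterwards.
awk_counts=$(printf '%s\n' "$sample" |
    awk '{ count[$0]++ } END { for (k in count) print k, count[k] }' |
    sort)
echo "$awk_counts"
```

The Awk version avoids sorting the full input, while the pipeline version needs no program at all; either may be reasonable depending on the size of the data.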

You should always try your solution on a small subset of the data first.  A small subset does not take long to run while you are working things out, and it lets you test the correctness of your answer (you will not be able to test the correctness of your solution on the entire data set by manually looking at all of the files - there are too many of them).  If you get stuck, please ask one of the instructors or TAs for help.  You may also post questions to the class mailing list.

Problems

For each problem, you must not only report your answer, but also show the Unix commands, scripts, and programs that you used.  You should also indicate how you tested your solution on a small data set.  Provide sufficient documentation explaining your approach.

Here are your problems.

  1. Determine the number of directories containing problems. Observe that there is only one message per file. What is the total number of messages sent?
  2. Obtain a sorted list of all of the Math Doctors that sent messages.  How many were there?  First do this for a single directory.  Then do it for all of the directories (you can write a script that obtains the files for each directory and then concatenates them together, or you can simultaneously gather the data for all directories).
  3. How many messages did each Math Doctor send?  You should start with the file created in question 2.
  4. Write an awk program that determines how many messages a given Math Doctor sent in each month. Hint: First create a file containing a line for each message each Math Doctor sent (the line should include the date the message was sent). Then sort and squash.

    Then write a script (wrapper) that determines how many messages a given Math Doctor sent in each month. The shell script will take the name of a Math Doctor as an argument and return the number of messages the specified Math Doctor sent each month, using the machinery you just created.

    Make sure your script checks to see if a Math Doctor is given as an argument - if not, a usage message should be printed and the script should exit.
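The argument check described above might be sketched as follows. The wrapper is written to a temporary file so that the example can be run on its own; the script body, the usage wording, and the name "Dr. Example" are hypothetical, and the real counting step is replaced by a placeholder echo.

```shell
# Write a hypothetical wrapper script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
# Hypothetical wrapper: expects exactly one argument, a Math Doctor's name.
if [ $# -ne 1 ]; then
    # Usage messages conventionally go to standard error.
    echo "Usage: $0 doctor-name" >&2
    exit 1
fi
doctor=$1
# Placeholder for the real work (the Awk program would be called here).
echo "would report monthly counts for: $doctor"
EOF

# No argument: the usage message is printed and the exit status is non-zero.
sh "$script" || echo "exit status: $?"

# One argument (quoted, since a name may contain spaces): the script proceeds.
sh "$script" "Dr. Example"

# Remove the temporary file with: rm -f "$script"
```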

What to hand in

Place your solution in a directory called A1. Your directory should contain a file called README that describes all of the files in A1 and summarizes what you did. Make sure to indicate if any of the problems were not done or are not working.

Create scripts called prob1, prob2, prob3, and prob4 containing the commands you used to solve problems 1, 2, 3, and 4. Each script should have comments explaining what the script does. Your script may call additional scripts or programs (Awk) in other files, and temporary files may be created - if so, this should be documented. Make sure that the problem can be solved simply by executing your main script. The output should go to standard out.

Use a variable, called ARCHIVE, to point to the directory that contains the archive of messages that is used. This way, you can use the same scripts on your test data and the actual data. Also, if the actual data is moved, the scripts can easily be reconfigured. Finally, we can test your scripts with another data set.
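One possible way to set up such a variable is sketched below. Letting the environment override a built-in default is an assumption of this example, not a requirement of the assignment, and list_archive is a hypothetical helper used only for illustration.

```shell
# Default to the course archive, but let the caller's environment override it.
# (The override mechanism is an assumption of this sketch.)
ARCHIVE=${ARCHIVE:-/home/jjohnson/public_html/2004-05/winter/cs265/assignments/archive}

# Hypothetical helper: every access to the data goes through $ARCHIVE.
list_archive() {
    ls "$ARCHIVE"
}

if [ -d "$ARCHIVE" ]; then
    # Show the first few entries of whichever archive is configured.
    list_archive | head -n 5
else
    echo "ARCHIVE is not a directory: $ARCHIVE" >&2
fi
```

Under that assumption, a run against the test data could look like `ARCHIVE=testdata ./prob1`, with no change to the script itself.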

Create a subdirectory called testdata which contains the test data you used. This data set should contain a few directories each with a few messages. It provides a quick way to test your code and you should use it to test your code yourself.
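A small test set of this shape could be assembled along the following lines. The sketch first builds a stand-in archive so that it can run anywhere; the directory numbers and file names are invented, and the choice of two directories with three messages each is arbitrary.

```shell
# Stand-in for the real archive (hypothetical layout and file names).
fake_archive=$(mktemp -d)
for d in 1 2 3; do
    mkdir "$fake_archive/$d"
    for m in 01 02 03 04; do
        echo "message $m" > "$fake_archive/$d/prob00000${d}_$m"
    done
done
ARCHIVE=$fake_archive

# Copy the first two directories, and the first three messages of each.
# (The simple for-in-$(ls) loops assume names without spaces.)
workdir=$(mktemp -d)
testdata=$workdir/testdata
mkdir "$testdata"
for dir in $(ls "$ARCHIVE" | head -n 2); do
    mkdir "$testdata/$dir"
    for f in $(ls "$ARCHIVE/$dir" | head -n 3); do
        cp "$ARCHIVE/$dir/$f" "$testdata/$dir/"
    done
done

# Inspect the result.
ls -R "$testdata"
```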

You should also have files prob1.out, prob2.out, prob3.out, and prob4.out, which contain the solution to each problem, and prob1.testout, prob2.testout, prob3.testout, and prob4.testout, showing the output of the test cases for each problem.

Do NOT submit large temporary files.

Create a gzipped tar file containing a directory called A1 (use the command tar to do this) with your solutions.   If you created a directory called A1 with all of the files for your assignment 1 solution, you can use the command:

$ tar -zcvf A1.tar.gz A1

to create the desired gzipped tar file (the convention is to use the .tar extension for tar files and the .gz extension for gzipped files).  See man tar or info tar for more information about tar.
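Before submitting, it may be worth listing the contents of the archive to confirm that it unpacks into a single A1 directory. The sketch below does this in a scratch directory with a placeholder README; the -t option lists an archive without extracting it.

```shell
# Scratch area with a placeholder A1 directory (contents are hypothetical).
work=$(mktemp -d)
mkdir "$work/A1"
echo "placeholder" > "$work/A1/README"

# -C changes into $work first, so the paths stored in the archive start with A1/.
tar -zcf "$work/A1.tar.gz" -C "$work" A1

# -t lists the archive contents without extracting anything.
listing=$(tar -ztf "$work/A1.tar.gz")
echo "$listing"

# Remove the scratch area.
rm -rf "$work"
```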


All assignments should be submitted using WebCT (please contact one of the TAs or instructors if you have not done this previously). You should submit your gzipped tar file.