Tokenizer: |
The ESE String Tokenizer & Sorter
|
Files:
Purposes of this assignment:
To practice and learn:
- Working with char[][] arrays and char[] arrays
- Using the string.h library and getchar()
- Parsing and tokenizing user input
- Constructing strings, and the subtleties involved with NULL termination
- Sorting strings in C with strcmp()
- Learning why covering all the edge cases is half of computer science
Important Notes:
- Homework assignments should be done alone.
- Please be respectful of the TAs' time and refrain from emailing them individually.
- The TA's take turns answering questions on the bulletin board.
- Please ask questions on the bulletin board or bring them to office hours.
- At the top of the Content page
- We highly recommend that you test your program in the lab or on the Eniac server before turning it in.
- There can sometimes be differences between your home setup and the lab / Eniac cluster
- These can cause errors in your code while grading.
- Do not modify tokenizer.h or tokenizer_main.c for your program
- We will be using the given versions when we grade your code
Overview:
When writing pieces of software that deal with user input, a tokenizer is a function designed to take user input and cut it up into pieces in order to be parsed. A string tokenizer literally cuts a input string up into smaller strings called "tokens", which can then be parsed as commands.
In this assignment you will be creating a tokenizer. The tokenizer will read input from STDIN using getchar(). While reading, it will look for spaces and newlines and create new "token" words based on their separation with this whitespace. These token words will be stored in an array of strings (char[][]). After the tokenizer terminates on a EOF (end of file) character, you will write a function to sort the tokens alpha-numerically and delete duplicates.
There are two functions you will be writing, get_word() and sort_and_delete_duplicates()
- get_word() will pull a single word (a "token") from STANDARD IN
- The main() function will call get_word until a EOF is reached
- get_word()'s first argument is char array where the "token" string is to be written. The second argument is the size of this array. The function get_word returns different constants based on whether
- The parser hit a EOF before reading a word "token"
- The parser hit a EOF immediately after reading a word "token"
- The parser read a word with no EOF
- Read the C file for the detailed description.
- sort_and_delete_duplicates() will sort the tokens in place and delete duplicate tokens
- sort_and_delete_duplicates() will be passed a array of strings, a char[][]
- sort_and_delete_duplicates() will then sort the words ASCII-alphanumerically using strcmp()
- Bubblesort is recommended, however you can use any sorting algorithm you like as long as it follows
the parameters of the assignment and does not use malloc()
or dynamic memory allocation
- Read the C file for the detailed description.
- The constants mentioned in the program specs can be found in tokenizer.h
Important Notes:
- Please remember that for this assignment, you are not allowed to use malloc() in any of the functions you write, or any functions that use any sort of dynamic memory allocation.
- For those of you lucky enough to be taking CIS 381, it is not ok to copy / paste the supplied shell tokenizer code into this homework.
- We will be checking for this and things similar.
- Trust me, it's easier just to write your own.
- You are not allowed to use qsort(), or any of the sorting functions provided in stdlib. You must implement the sorting yourself.
- Please re-read the string.h slides before you start, and try to use the strings.h library as much as possible.
- If strings.h is used properly, the sorter can be written in under 15 lines.
- This homework assignment should be done alone.
- We will be comparing your homeworks to make sure this occurs.
- Help is available via the bulletin board, office hours, and during the optional lab.
- TA's are also often available after class on Tuesday.
- Please be respectful of the TAs' time and refrain from emailing them individually.
- The TA's take turns answering questions on the bulletin board.
- Please ask questions on the bulletin board or bring them to office hours.
- At the top of the Content page, read the posted Homework Submission and Policy Information
- At the top of the Content page, see the link to use for submitting homework.
Diagram:
