Exercise 13: Letter Frequency

For this assignment you will write a program that counts the number of times each letter of the English alphabet appears in the input. The results are presented as a histogram.

This program is a bit more complicated than others we have written, but I have given you a fair amount of guidance. I recommend reading the entirety of these instructions before you begin.

This program will not be split into modules. You may decompose your program into functions if you wish, but it is not required. Put all of your code in letter_frequency.c.

Here are 3 files you can use for testing your program. If you click on one of them, it will probably open in your browser. To save a file, right click (ctrl-click on a Mac) on the link and then save it. You need to place it in the directory that you cloned the assignment to. DO NOT COMMIT THESE FILES TO YOUR ASSIGNMENT REPOSITORY

Reading the Input

Your program must read characters from standard input until it reaches the special end-of-file value (EOF). The function getchar() returns the next character from standard input, or EOF if there is no more data to read from standard input.

EOF is not a character, but rather an int value outside of the range that the char type can represent. Thus getchar() returns an int rather than a char, and you will store the character that you read as an int as well.

So if next_char is an int, you can store the next character from standard input in next_char with next_char = getchar(). You want to do this repeatedly until next_char is EOF. Recall that an assignment expression evaluates to the value assigned. It is a common pattern in C programming to check the value of an assignment in the condition of a while loop. You want to keep looping while next_char is assigned a value that is not EOF, so your loop can look like this:

int next_char;
while((next_char = getchar()) != EOF) {
    // do something with next_char
}

If you are typing your input into the terminal, you will need to press Ctrl-d to send the EOF. You can also use the contents of a file as standard input. If you want the contents of the file named file.txt to be used as standard input and your program is named letter_frequency, you would run your program like this:

$ ./letter_frequency < file.txt

Note that the direction of the < is very important. If you used > instead, it would send the output to file.txt and it would erase the current contents of the file.

Isolating Letters

The library ctype.h provides a function named isalpha() which can determine if a character is an alphabetic character. isalpha(next_char) will return a non-zero (true) value if next_char is an alphabetic character, 0 otherwise. So to do something with next_char only if it is alphabetic, you can do this:

if (isalpha(next_char)) {
    // do something with next_char, knowing that it is a letter
}

Your program should count uppercase and lowercase versions of each letter in the same way. So we will count 'A' the same as 'a'. The ctype.h library also contains a function named tolower() which will convert a character to its lowercase version. So if next_char is 'A', then tolower(next_char) returns 'a'.

Counting the Letters

We want to count the number of times each letter occurs. Since there are 26 letters in the English alphabet, we can use an array of size 26 to store the counts. The first element of the array will store the number of times 'a' or 'A' appears, the second element will store the number of times 'b' or 'B' appears, etc.

Rather than putting the magic number 26 in your array declaration, use a #define macro to define the value at the top of the file just after you include the necessary libraries:

#define NUMBER_OF_LETTERS 26

Recall that you can create an array that is initialized to all zeroes by assigning it {0} in the declaration. So in main() you can declare and initialize your histogram like so:

size_t letter_counts[NUMBER_OF_LETTERS] = {0};

Each time you find a new letter you need to increment its histogram entry by 1. But you cannot use the letter as the index into the histogram, because 'a' does not have the value 0. However, the numeric values of the alphabetic characters are in alphabetical order, so if you subtract 'a' from the character next_char, the result will be 0 for 'a', 1 for 'b', etc. You can use this result as the index into letter_counts.

So if you know that next_char is a letter, you can increment its count like this:

letter_counts[tolower(next_char) - 'a']++;

Printing the Histogram

The output of your program must print one row for each letter from a to z. On each row you will print the ratio of the number of times that letter appears to the total number of letters in the input. It will also print a row of stars to visually show a histogram of the letter frequencies. The letter that appears most frequently must have 60 stars in its histogram bar, and the rest of the letters must have a number of stars scaled from 60. For example, if the most frequent letter appears 20 times, that letter will have 60 stars. If another letter appears 10 times, it will have 30 stars, since it appears half as often as the most frequent letter.

For example, say this was your input:

b
Gg
zZzZ
k

Your output would then look like this:

a: 0.000000  
b: 0.125000  ***************
c: 0.000000  
d: 0.000000  
e: 0.000000  
f: 0.000000  
g: 0.250000  ******************************
h: 0.000000  
i: 0.000000  
j: 0.000000  
k: 0.125000  ***************
l: 0.000000  
m: 0.000000  
n: 0.000000  
o: 0.000000  
p: 0.000000  
q: 0.000000  
r: 0.000000  
s: 0.000000  
t: 0.000000  
u: 0.000000  
v: 0.000000  
w: 0.000000  
x: 0.000000  
y: 0.000000  
z: 0.500000  ************************************************************

To correctly print the number of stars, you need to find out ahead of time what the largest letter count is. That means you have to loop over the array of letter counts twice: once to find the maximum count, and again to print the histogram.

As with the number of letters, the maximum number of stars should not be a magic number in your code, but rather defined as a macro:

#define MAX_STAR_COUNT 60

To correctly calculate the ratio, you need to know the total number of letters. So while you are looping to find the maximum count, you can also sum up the total count.

Test Files

You have been provided with 3 files with which to test your program. test_file.txt is a file containing the simple example input shown above. Feel free to edit this file to try out different test cases. adventures_of_sherlock_holmes.txt and alices_adventures_in_wonderland.txt are complete books downloaded from Project Gutenberg, which is a web site that hosts books that are in the public domain.

If you compiled your program like this:

$ gcc -std=c99 -Wall letter_frequency.c -o letter_frequency

then your executable is named letter_frequency and you can run your program with the simple test input file like this:

$ ./letter_frequency < test_file.txt

Substitute test_file.txt with either of the other two provided filenames to run your program with those files as input.

Local Test Suite

You can run the GitHub testing locally with the command:

make gh-test-frequency

Submission

Add, commit, and push only letter_frequency.c, not any of the test files to GitHub

Grading

Your grade for this assignment will be out out of 10 points: