For this assignment you will write a program that counts the number of times each letter of the English alphabet appears in the input. The results are presented as a histogram.
This program is a bit more complicated than others we have written, but I have given you a fair amount of guidance. I recommend reading the entirety of these instructions before you begin.
This program will not be split into modules. You may decompose your program into functions if you wish, but it is not required. Put all of your code in letter_frequency.c
.
Here are 3 files you can use for testing your program. If you click on one of them, it will probably open in your browser. To save a file, right click (ctrl-click on a Mac) on the link and then save it. You need to place it in the directory that you cloned the assignment to. DO NOT COMMIT THESE FILES TO YOUR ASSIGNMENT REPOSITORY
Your program must read characters from standard input until it reaches the special end-of-file value (EOF
). The function getchar()
returns the next character from standard input, or EOF
if there is no more data to read from standard input.
EOF
is not a character, but rather an int
value outside of the range that the char type can represent. Thus getchar()
returns an int
rather than a char
, and you will store the character that you read as an int
as well.
So if next_char
is an int
, you can store the next character from standard input in next_char
with next_char = getchar()
. You want to do this repeatedly until next_char
is EOF
. Recall that an assignment expression evaluates to the value assigned. It is a common pattern in C programming to check the value of an assignment in the condition of a while
loop. You want to keep looping while next_char
is assigned a value that is not EOF
, so your loop can look like this:
int next_char;
while((next_char = getchar()) != EOF) {
// do something with next_char
}
If you are typing your input into the terminal, you will need to press Ctrl-d
to send the EOF
. You can also use the contents of a file as standard input. If you want the contents of the file named file.txt
to be used as standard input and your program is named letter_frequency
, you would run your program like this:
$ ./letter_frequency < file.txt
Note that the direction of the <
is very important. If you used >
instead, it would send the output to file.txt
and it would erase the current contents of the file.
The library ctype.h
provides a function named isalpha()
which can determine if a character is an alphabetic character. isalpha(next_char)
will return a non-zero (true) value if next_char
is an alphabetic character, 0 otherwise. So to do something with next_char
only if it is alphabetic, you can do this:
if (isalpha(next_char)) {
// do something with next_char, knowing that it is a letter
}
Your program should count uppercase and lowercase versions of each letter in the same way. So we will count 'A'
the same as 'a'
. The ctype.h
library also contains a function named tolower()
which will convert a character to its lowercase version. So if next_char is 'A'
, then tolower(next_char)
returns 'a'
.
We want to count the number of times each letter occurs. Since there are 26 letters in the English alphabet, we can use an array of size 26 to store the counts. The first element of the array will store the number of times 'a'
or 'A'
appears, the second element will store the number of times 'b'
or 'B'
appears, etc.
Rather than putting the magic number 26 in your array declaration, use a #define
macro to define the value at the top of the file just after you include the necessary libraries:
#define NUMBER_OF_LETTERS 26
Recall that you can create an array that is initialized to all zeroes by assigning it {0}
in the declaration. So in main()
you can declare and initialize your histogram like so:
size_t letter_counts[NUMBER_OF_LETTERS] = {0};
Each time you find a new letter you need to increment its histogram entry by 1. But you cannot use the letter as the index into the histogram, because 'a'
does not have the value 0. However, the numeric values of the alphabetic characters are in alphabetical order, so if you subtract 'a'
from the character next_char, the result will be 0 for 'a'
, 1 for 'b'
, etc. You can use this result as the index into letter_counts
.
So if you know that next_char
is a letter, you can increment its count like this:
letter_counts[tolower(next_char) - 'a']++;
The output of your program must print one row for each letter from a to z. On each row you will print the ratio of the number of times that letter appears to the total number of letters in the input. It will also print a row of stars to visually show a histogram of the letter frequencies. The letter that appears most frequently must have 60 stars in its histogram bar, and the rest of the letters must have a number of stars scaled from 60. For example, if the most frequent letter appears 20 times, that letter will have 60 stars. If another letter appears 10 times, it will have 30 stars, since it appears half as often as the most frequent letter.
For example, say this was your input:
b
Gg
zZzZ
k
Your output would then look like this:
a: 0.000000
b: 0.125000 ***************
c: 0.000000
d: 0.000000
e: 0.000000
f: 0.000000
g: 0.250000 ******************************
h: 0.000000
i: 0.000000
j: 0.000000
k: 0.125000 ***************
l: 0.000000
m: 0.000000
n: 0.000000
o: 0.000000
p: 0.000000
q: 0.000000
r: 0.000000
s: 0.000000
t: 0.000000
u: 0.000000
v: 0.000000
w: 0.000000
x: 0.000000
y: 0.000000
z: 0.500000 ************************************************************
To correctly print the number of stars, you need to find out ahead of time what the largest letter count is. That means you have to loop over the array of letter counts twice: once to find the maximum count, and again to print the histogram.
As with the number of letters, the maximum number of stars should not be a magic number in your code, but rather defined as a macro:
#define MAX_STAR_COUNT 60
To correctly calculate the ratio, you need to know the total number of letters. So while you are looping to find the maximum count, you can also sum up the total count.
You have been provided with 3 files with which to test your program. test_file.txt
is a file containing the simple example input shown above. Feel free to edit this file to try out different test cases. adventures_of_sherlock_holmes.txt
and alices_adventures_in_wonderland.txt
are complete books downloaded from Project Gutenberg, which is a web site that hosts books that are in the public domain.
If you compiled your program like this:
$ gcc -std=c99 -Wall letter_frequency.c -o letter_frequency
then your executable is named letter_frequency and you can run your program with the simple test input file like this:
$ ./letter_frequency < test_file.txt
Substitute test_file.txt with either of the other two provided filenames to run your program with those files as input.
You can run the GitHub testing locally with the command:
make gh-test-frequency
Add, commit, and push only letter_frequency.c
, not any of the test files to GitHub
Your grade for this assignment will be out out of 10 points: