Exercise 04: Histogram

You are going to create a simple program to display the distribution of letters in a string.

This exercise is designed to exercise:

for loops
lists
math operations
string functions

Setup

Download the zip file containing the starter code.
Extract the ex05 folder and place it into your cs102 folder.
Open the file histogram.py with Thonny.

Assignment

The provided code contains an exceptionally long string that contains only the lower-case letters a through z and nothing else. You will need to count the occurrences of each letter in the string and store the letter counts in a list. The ultimate goal is to display a normalized text-based histogram of the letter distribution.

The code contains a few constant values to help you out:

ASCII_OFFSET = ord("a")
- This holds the numeric value used to represent ‘a’
HISTOGRAM_SYMBOL = ">"
- This is the symbol that will be used for each “tick” of the histogram bar
MAX_HISTOGRAM_LENGTH = 70
- This is the maximum length of any bar in the histogram
BIG_STRING = …
- This will hold the massive string you will need to process
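
As a quick illustration of how ASCII_OFFSET might be used, the sketch below converts a letter into a position from 0 to 25 and back again. The variable names index and letter are only illustrative and are not part of the starter code.

```python
ASCII_OFFSET = ord("a")  # same constant as in the starter code

# Letter -> position in the alphabet (a is 0, z is 25)
index = ord("c") - ASCII_OFFSET

# Position -> letter
letter = chr(index + ASCII_OFFSET)

print(index, letter)  # prints: 2 c
```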

Let’s walk through a complete example.

Assume we have the following (much shorter) string:

abcccaddd

We need to know three things in order to produce our histogram:

How many times does each letter appear?
What is the largest count of any single letter (a tie doesn’t matter)?
What is the ratio of each letter’s count to the count of the letter that occurs most often?

For our example, the counts are:

a: 2
b: 1
c: 3
d: 3

The letters that appear the most are c and d, both with a maximum occurrence of 3.
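
One possible way to arrive at these numbers is sketched below, using the short example string rather than BIG_STRING. The names sample, counts, and largest are assumptions made for illustration; your own program may be organized differently.

```python
ASCII_OFFSET = ord("a")  # same constant as in the starter code

sample = "abcccaddd"     # the short example string from above
counts = [0] * 26        # one slot per letter, all starting at zero

# Add one to the slot that belongs to each letter we see
for ch in sample:
    counts[ord(ch) - ASCII_OFFSET] += 1

largest = max(counts)    # the largest count (a tie doesn't matter)

print(counts[:4], largest)  # prints: [2, 1, 3, 3] 3
```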

Since we will have a very large string in our project, we want to normalize the number of HISTOGRAM_SYMBOL (‘>’) characters we print to represent each bar of our histogram. In our case, we are normalizing to MAX_HISTOGRAM_LENGTH (70). For each letter, we calculate the ratio of its count to the largest count. So if we were to calculate the ratio for ‘a’ it would be:

ratio = 2 / 3

We then multiply that ratio by MAX_HISTOGRAM_LENGTH, the maximum length of a histogram bar, to get the appropriate number of HISTOGRAM_SYMBOL characters to display.

display_symbol_count = ratio * MAX_HISTOGRAM_LENGTH
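
Worked through for the letter ‘a’ in the example, the calculation might look like the sketch below. Note that the result of the multiplication is not a whole number. Truncating with int() is an assumption here; the handout only requires that the final value be a whole number.

```python
MAX_HISTOGRAM_LENGTH = 70  # same constant as in the starter code

count_a = 2   # 'a' appears twice in the example string
largest = 3   # 'c' and 'd' each appear three times

ratio = count_a / largest  # about 0.667

# Assumption: drop the fractional part so we print whole symbols only
display_symbol_count = int(ratio * MAX_HISTOGRAM_LENGTH)

print(display_symbol_count)  # prints: 46
```

With the same approach, ‘b’ (count 1) works out to 23 symbols, and ‘c’ and ‘d’ (count 3) work out to the full 70.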

An example of the expected output is:

a >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2
b >>>>>>>>>>>>>>>>>>>>>>> 1
c >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3
d >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3
e  0
f  0
g  0
...
z  0

Each line of the histogram output displays the letter, a space, the histogram bar, another space, and finally the count for that letter. The ellipsis (…) is used only to shorten my example. Your program will always output the results for all the letters a through z regardless of their counts.
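
The line format described above can be sketched as follows. The helper name format_line and its parameters are hypothetical and not part of the starter code; they only show how the pieces fit together.

```python
HISTOGRAM_SYMBOL = ">"  # same constant as in the starter code

def format_line(letter, symbol_count, count):
    """Build one line: letter, space, bar, space, count (hypothetical helper)."""
    bar = HISTOGRAM_SYMBOL * symbol_count  # repeat the symbol to form the bar
    return f"{letter} {bar} {count}"

print(format_line("b", 23, 1))
print(format_line("e", 0, 0))  # empty bar, so two spaces sit between e and 0
```

A letter with a count of zero still gets both spaces, which is why the lines for e through z in the example show two spaces before the 0.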

Notice how the bars for c and d have exactly 70 HISTOGRAM_SYMBOLs, the bar for b is roughly one-third of that length, and the bar for a is roughly two-thirds of that length. This is due to the normalization described above.

Hints

We will need to maintain a running total of all letter counts. How could we use a list with enough space to hold the count for each letter (think about how many letters there are…)?
You will not be able to display a fraction of a HISTOGRAM_SYMBOL. Only whole numbers will be possible.

HINT: There are some useful functions that will help you with your tasks.

Submission

Right-click your ex05 assignment folder and choose Compress on macOS or Compress to ZIP file on Windows. Upload the zip file to the matching Moodle assignment to submit your work.

Grading

You will earn up to 5 points for this exercise, broken down as follows:

1 point - the program does not crash
1 point - the program only uses one list to hold the letter counts
1 point - each histogram line matches the described format
2 points - the program outputs the correct result