Python Regular Expressions Project

For this project you will write several Python regular expressions to match different types of strings. You will certainly want to consult the Python regular expression documentation as you work on this project.

To begin, download match_strings.py. This is a functioning program, but currently the only things it will match are integers and real numbers. You will need to add regular expressions so that it correctly matches all of the items described later in these instructions.

Program Behavior

If you run this program with no arguments, it will continually read lines from standard input, attempt to match each line against every regular expression in a list, and print the name of the first pattern that matches. If no pattern matches, it will print unknown for that line.

Because the list initially contains only two regular expressions, one for integers and one for real numbers, a run of the program looks like this:

Reading from standard input.
Enter lines to match, press ctrl+d to exit
hello
unknown
1234
integer
5.6
real number
4.9.10
unknown
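The first-match behavior described above can be sketched roughly as follows. This is only an illustration: classify is a hypothetical helper name, and the actual code in match_strings.py may be organized differently. The two patterns are the starting entries listed under Your Task.

```python
import re

# The two starting entries from match_strings.py.
patterns = [
    (r'^\d+$', 'integer'),
    (r'^\d+\.\d+$', 'real number'),
]

def classify(line, patterns):
    """Return the name of the first pattern that matches line, else 'unknown'.

    Hypothetical helper; the real program may be structured differently.
    """
    text = line.rstrip('\n')  # drop the newline that comes with input lines
    for regex, name in patterns:
        if re.search(regex, text):
            return name
    return 'unknown'

for line in ['hello', '1234', '5.6', '4.9.10']:
    print(classify(line, patterns))
```

The string 4.9.10 is reported as unknown because both patterns are anchored with ^ and $, so they must account for the entire line.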

Alternatively, you can supply an input filename on the command line, and the program will print one line of output for each line in the file. For example, if the file input.txt contains:

hello
1234
5.6
4.9.10

and you run the program like this:

python3 match_strings.py input.txt

then the output will be this:

unknown
integer
real number
unknown

Your Task

In the main() function of the program there is the following list of tuples. Each tuple in the list contains a regular expression and a name of the pattern that it recognizes.

    patterns = [
        (r'^\d+$', 'integer'),
        (r'^\d+\.\d+$', 'real number'),
    ]

The only changes you need to make to the program are additional entries in this list. Add one additional tuple for each of the patterns described in the next section.
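As an illustration of what adding an entry looks like, here is a sketch with one extra tuple. The 'time' pattern is a made-up example and is not one of the assigned patterns, and first_match is a hypothetical stand-in for the program's matching loop.

```python
import re

patterns = [
    (r'^\d+$', 'integer'),
    (r'^\d+\.\d+$', 'real number'),
    # Hypothetical extra entry, NOT one of the assigned patterns:
    # one or two digits, a colon, then exactly two digits.
    (r'^\d{1,2}:\d{2}$', 'time'),
]

def first_match(text, patterns):
    # Hypothetical helper: name of the first matching pattern, else 'unknown'.
    for regex, name in patterns:
        if re.search(regex, text):
            return name
    return 'unknown'

print(first_match('9:30', patterns))
print(first_match('1234', patterns))
print(first_match('9:3', patterns))
```

Because the first matching entry wins, the order of the list only matters when two patterns can match the same string.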

Patterns

Create one regular expression to match each of the following.
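Each pattern below comes with lists of valid and invalid examples, so it may help to check a candidate regular expression against both lists before adding it to the program. The following is a minimal sketch of such a check, shown on the provided real number pattern; check_pattern is a hypothetical helper and is not part of match_strings.py.

```python
import re

def check_pattern(regex, should_match, should_not_match):
    """Return a list of strings describing every example the regex gets wrong."""
    problems = []
    for text in should_match:
        if not re.search(regex, text):
            problems.append('missed valid example: ' + text)
    for text in should_not_match:
        if re.search(regex, text):
            problems.append('accepted invalid example: ' + text)
    return problems

# Demonstrated on the provided real number pattern: no problems expected.
real_number = r'^\d+\.\d+$'
print(check_pattern(real_number, ['5.6', '10.00'], ['4.9.10', '1234', 'hello']))

# A deliberately sloppy variant without anchors accepts '4.9.10',
# because re.search finds '4.9' inside it.
print(check_pattern(r'\d+\.\d+', ['5.6'], ['4.9.10']))
```

An empty list means the candidate agrees with every listed example, which is necessary but not sufficient for a correct pattern.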

Address

For the purposes of this assignment, a valid address begins with a positive integer, followed by one or more words or abbreviations. As in the examples below, the number and the words are separated by spaces. Each word or abbreviation must consist only of letters from the English alphabet, may contain a capital letter only as its first letter, and may optionally end in a period.

The following are valid addresses:

1189 Beall Avenue
123 S. Main St.
456 elm

The following are not valid addresses:

Eight fifty two North Washington
10 10 Springfield Lane
14.5 S Main
12 S.Main
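The phrase "one or more words" maps naturally onto a repeated group. The sketch below is deliberately incomplete: it accepts lowercase-only words, assumes exactly one space before each word (the description does not say how many spaces are allowed), and does not handle capital letters or periods.

```python
import re

# Deliberately incomplete sketch: a number followed by one or more
# lowercase-only words, each preceded by a single space (an assumption).
# It does not yet allow a leading capital letter or a trailing period.
partial_address = r'^\d+( [a-z]+)+$'

for text in ['456 elm', '456 elm street', '10 10 elm', '456', '1189 Beall Avenue']:
    print(text, '->', bool(re.search(partial_address, text)))
```

Only the first two strings match. The last one is a valid address under the full rules, which shows what the sketch still lacks.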

Price

Here a valid price is a dollar amount written with the US dollar sign. Cents are optional, but if they are shown there must be exactly two digits. For prices above $999.99, commas may optionally separate the thousands, millions, etc.

The following are valid prices:

$1
$20
$1.99
$10.00
$1500.50
$2,000.99
$1,234,567.89

The following are not valid prices:

$1.9
$10,23.4
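One thing to watch for here is that $ is a regular expression metacharacter (the end-of-string anchor), so a literal dollar sign has to be escaped. The sketch below is deliberately incomplete and handles whole dollar amounts only, with no cents and no commas.

```python
import re

# Deliberately incomplete sketch: whole dollars only, no cents, no commas.
whole_dollars = r'^\$\d+$'

print(bool(re.search(whole_dollars, '$20')))    # literal $ then digits
print(bool(re.search(whole_dollars, '20')))     # no dollar sign
print(bool(re.search(whole_dollars, '$1.99')))  # cents are not handled yet

# Unescaped, the first $ is an end-of-string anchor, so '$20' cannot match.
print(bool(re.search(r'^$\d+$', '$20')))
```

Only the first call prints True.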

Phone Number

Here a phone number is a valid US phone number that includes the area code. The final seven digits must be written as three digits, a hyphen, and four digits. The area code may be enclosed in parentheses or separated from the rest of the number by a hyphen. If the area code is in parentheses, there may or may not be a single space between the closing parenthesis and the fourth digit.

The following are valid phone numbers:

(330) 263-2000
(330)263-2000
123-456-7890

The following are not valid phone numbers:

(123)4567890
456-7890
330 263-2000
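Parentheses are grouping metacharacters, so literal parentheses need escaping, and a space followed by ? makes that space optional. The sketch below is deliberately incomplete: it checks only a parenthesized area code at the start of the string and ignores the rest of the number.

```python
import re

# Deliberately incomplete sketch: only the parenthesized area code and the
# optional single space after it, not the rest of the number.
area_code_prefix = r'^\(\d{3}\) ?'

for text in ['(330) 263-2000', '(330)263-2000', '330 263-2000', '(33) 263-2000']:
    print(text, '->', bool(re.match(area_code_prefix, text)))
```

The first two strings match the prefix and the last two do not. The hyphenated area code form is not covered here; alternation with | is one way to allow both forms.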

Email Address

Capturing all the rules for what makes a valid email address is complex, so we will use a simplified definition of a valid email address. This definition generally works just fine for extracting email addresses from documents.

The first part of the email address is the username portion, and it must not contain whitespace or the @ symbol. The username portion is followed by the @ symbol. After the @ symbol is the domain, which does not contain any whitespace or the @ symbol. The domain contains two or more non-empty components which are separated by periods. The final component must consist of only letters from the English alphabet.

The following are valid email addresses:

nsommer@wooster.edu
n.sommer@cs.wooster.edu
yippee_skippy@yee-haw.wheeeee
fun-times@Taylor.hall.wooster.edu

The following are not valid email addresses:

n@sommer@wooster.edu
n sommer@wooster.edu
nsommer@wooster..edu
nsommer@wooster.edu-org
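A rule of the form "must not contain whitespace or the @ symbol" can be expressed with a negated character class. The sketch below is deliberately incomplete: it checks only the username and the @ that follows it, and says nothing about the domain.

```python
import re

# Deliberately incomplete sketch: checks only the username and the @ symbol.
# [^\s@] means any single character that is neither whitespace nor @.
username_part = r'^[^\s@]+@'

for text in ['nsommer@wooster.edu', 'n sommer@wooster.edu', '@wooster.edu',
             'n@sommer@wooster.edu']:
    print(text, '->', bool(re.match(username_part, text)))
```

The first and last strings match this prefix check. The last one is still an invalid email address, so the domain portion of a full pattern has to be what rejects the second @.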

C Identifier

A C identifier is a name for a function, variable, etc. in a C program. A C identifier must contain only letters, digits, and underscores, and its first character must be a letter or an underscore.

The following are valid C identifiers:

x
x1y2
_hello
funName
FunName

The following are not valid C identifiers:

1x
bad name
!name
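It is tempting to reach for \w here, but it has two pitfalls worth knowing about. The sketch below assumes that "letters" means the English letters a-z and A-Z, which the description does not state explicitly.

```python
import re

# In Python 3, \w in a str pattern also matches non-English letters.
# 'na\u00efve' is the word "naive" spelled with a dotted i.
print(bool(re.fullmatch(r'\w+', 'na\u00efve')))

# With the re.ASCII flag, \w is limited to [a-zA-Z0-9_].
print(bool(re.fullmatch(r'\w+', 'na\u00efve', re.ASCII)))

# \w also matches digits, so \w+ alone accepts a leading digit,
# which the rule above forbids.
print(bool(re.fullmatch(r'\w+', '1x')))
```

This prints True, False, True. Spelling the allowed characters out in explicit character classes avoids both surprises.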

Course Identifier

Courses at the College of Wooster are identified by four uppercase letters identifying the department, five digits representing the course number, and two digits representing the section number, with the three parts separated by hyphens. For example, the full identifier for this course is CSCI-22000-01.

The following are valid course identifiers:

CSCI-22000-01
MATH-11100-02
FYSM-10100-33
ABCD-12345-67

The following are not valid course identifiers:

csci-22000-01
CS 220
CSCI 22000 01
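Fixed-length parts like these can be written with counted repetition, {n}. The sketch below uses a made-up, shorter format (two uppercase letters, a hyphen, three digits) rather than the Wooster format, purely to show the syntax.

```python
import re

# Hypothetical format, NOT the Wooster one: two uppercase letters, a hyphen,
# then exactly three digits (for example AB-123).
short_code = r'^[A-Z]{2}-\d{3}$'

for text in ['AB-123', 'ab-123', 'AB 123', 'AB-1234', 'ABC-123']:
    print(text, '->', bool(re.search(short_code, text)))
```

Only AB-123 matches. The anchors are what reject the strings with an extra letter or digit.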

Submission

Submit your program through the assignment on Moodle.