Parsing JSON Project

For this project you will write a program that determines whether or not the input to the program is valid JSON (JavaScript Object Notation) by using a recursive descent parser.

See validate_expression.py for the completed example from class to use as a template. Name the file for this project validate_json.py.

JSON

JSON is a widely used language for representing objects in both a machine readable and human readable format. This is useful for storing structured data, and for transferring structured data between machines over the internet.

The main two JSON elements are objects and arrays. JSON objects are much like dictionaries in Python, but the keys must be strings. Arrays are like lists in Python, and can similarly store values of mixed types.

Here is an example JSON document describing my dog using an object:

{
    "name": "Max",
    "age": 8,
    "sex": "male",
    "neutered": true,
    "color": ["black", "white"],
    "eye color": "brown",
    "breed": ["Springer Spaniel", "Shih Tzu"]
}

Here is an array containing two courses I am teaching this semester as objects:

[
    {
        "name": "Theory of Computation",
        "identifier": "CSCI-22000-01",
        "enrollment": 20,
        "building": "Taylor Hall",
        "room": "210",
        "start time": "10:00 AM",
        "end time": "10:50 AM"
    },
    {
        "name": "Imperative Problem Solving",
        "identifier": "CSCI-11000-01",
        "enrollment": 35,
        "building": "Taylor Hall",
        "room": "302",
        "start time": "11:00 AM",
        "end time": "11:50 AM"
    }
]

A JSON document can be any valid JSON value (discussed below).

Whitespace is optional. I have used a conventional indentation scheme to make it readable.

Below are some more details JSON values.

Values

A value may be a string, a number, an object, an array, true, false, or null. Note that true, false, and null are not in quotes, and are distinct from strings.

Strings

Strings are surrounded by double quotation marks. The official JSON specification supports escaped double quotation marks within strings. You may support this feature if you like, but I will only test your code using strings that consist of digits and English characters within the quotes.

Numbers

Numbers may be positive or negative integers or real numbers. The official JSON specification also supports scientific notation, but supporting this in your code is optional and will not be tested.

Objects

An object is a comma separated list of key/value pairs surrounded by { and }. A key is a string and a value may be any JSON value (including another object or an array). The key and value are separated by a colon.

Arrays

An array is a comma separated list of values surrounded by [ and ].

Grammar

Before you start coding, devise an LL(1) grammar for JSON based on the information above. Since a JSON document can be any value, the start variable is quite simple:

json -> value EOF

We need to define this rule, rather than simply parsing value, because without taking into account the EOF the parser would look at a string like this:

[1][2]

and report valid JSON after seeing [1]. We want the parser to report the above string as invalid JSON, which it will do when it expects to see EOF but instead sees the second [.

Remember that you need to be able to determine which rule to choose based on the next token. It would be tempting to define an array like this:

array -> [ element_list ]
element_list -> value , element_list | value

The problem with the above is that you would not know whether to choose the first or the second rule for element_list based on the next token when trying to parse using those rules. Instead, you want to use tail variables, similar to what we did with expressions:

array -> [ array_tail
array_tail -> elements ] | ]
elements -> value elements_tail
elements_tail -> , elements | epsilon

We can define object similarly, but instead of having elements we have pairs, which need to be parsed further.

We will parse arrays and objects as values. So we could have the following for value:

value -> STRING | NUMBER | true | false | null | array | object

The first five rules of the above line are tokens, and we can easily check for those by looking at the next token. But what about array and object? If we define them as I described defining array above, we can know that an array always starts with [ and an object always starts with { and still call the appropriate function based on the next token. Alternatively we can define the grammar like this so that it is completely clear which rule to choose from looking only at this grammar line:

value -> STRING | NUMBER | true | false | null | [ array_tail | { object_tail

Then we do not actually need array and object rules, just the tails, and the beginning of the array and object are handled when parsing value. You may implement it either way you like.

Lexer

Feel free to re-use the Tokens class and lexer() function from the example we looked at in class. You will need to re-write the dictionary of regular expressions.

Parser

Implement parsing using recursive descent, as we did in the example in class.

Input

Your program should support input from standard input or from a file provided on the command line. You can do this by checking the number of command line arguments and acting accordingly. Here is some code to do that:

if __name__ == '__main__':
    # read from a file if one is provided on the command line
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as f:
            text = f.read()
    # read from standard input if no command line arguments were provided
    elif len(sys.argv) == 1:
        text = sys.stdin.read()
    # more than one argument is an error
    else:
        sys.exit('Usage: {} [<input file>]'.format(sys.argv[1]))

    tokens = lexer(text)

    ...

Then you can either run the command with no arguments and enter JSON text manually, or provide it with a filename. If you enter text manually, you need to press ctrl+d to send an EOF to indicate the end of the input. To pass it a filename, you can run it like this in a terminal:

python3 validate_json.py input.json

where test.json is a file containing JSON to test your program with. If you are using PyCharm and want to run your program through PyCharm and pass it a filename as a command line argument, you can edit the run configuration and put the filename in the Parameters field. This way you can also run it using PyCharm’s debugger, which can be very helpful in tracking down issues.

Submission

Submit validate_json.py on Moodle.