Validate JSON data using cerberus

During some coding work for my day job, I needed a way to validate the format (or schema) of some JSON data. If you need a quick refresher on what JSON is and how to work with it in python, take a look at one of my earlier posts about python dictionaries and JSON. As an additional requirement, the validation step should supply default values for certain keys that are not found in the original data.

I found cerberus, a library that perfectly meets my needs and that I want to share with you. As always, the code examples from this post are available in my python script examples GitHub repository.

What is cerberus?

Cerberus is a library maintained by Nicola Iarocci that verifies JSON data against a user-defined schema. The schema defines, for example, the keys and data types that are required within the data structure. With the upcoming version 0.10, you can also define default values for keys that are not found within the data.

For this blog post, I use the latest stable release, version 0.9.2, which is available on PyPI. You can install it using the following command:

$ pip install cerberus

If you need the latest (development) version, take a look at the cerberus GitHub page.

The example use case

In the example, we will validate an IP network and VLAN parameter set that could be used to generate configurations. If you’re not already familiar with python dictionaries and JSON, please take a look at one of my earlier blog posts about them.

The data set that we need to validate looks similar to the following:

{
    "networks": [
        {
            "vlan" : {
                "id": 1,
                "name": "data"
            },
            "ipv4": {
                "address": "10.1.1.1",
                "prefix_length": 24
            }
        }
    ]
}
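In python, such a data set can be loaded into a dictionary with the standard json module. Here is a minimal sketch (the variable names are illustrative):

```python
import json

# the example data set as a JSON string
raw_data = """
{
    "networks": [
        {
            "vlan": {"id": 1, "name": "data"},
            "ipv4": {"address": "10.1.1.1", "prefix_length": 24}
        }
    ]
}
"""

# parse the JSON string into a python dictionary
data = json.loads(raw_data)
print(data["networks"][0]["vlan"]["name"])  # prints "data"
```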

Basic cerberus schema

By default, cerberus requires a schema definition as a python dictionary. A small example: if you need to make sure that the key name is always a string, you would use the following schema within cerberus:

{
    "name" : {
        "type": "string"
    }
}

That’s the basic format of any schema. You can also nest one dictionary inside another: if you use the data type dict (for dictionary), you can define a nested schema that applies to the nested keys, for example:

{
    "my_dictionary" : {
        "type": "dict",
        "schema": {
            "name": {
                "type": "string"
            }
        }
    }
}

Now the type rule for the key name applies inside the dictionary my_dictionary.

There are many rules available that you can use within cerberus. Further details are available in the official cerberus documentation.

The schema that is required to validate our example data set looks similar to the following:

{
    "networks": {
        "type": "list",
        "schema": {
            "type": "dict",
            "schema": {
                "vlan": {
                    "type": "dict",
                    "schema": {
                        "id": {
                            "type": "integer",
                            "min": 1,
                            "max": 4094
                        },
                        "name": {
                            "type": "string"
                        }
                    }
                },
                "ipv4": {
                    "type": "dict",
                    "schema": {
                        "address": {
                            "type": "ipv4address"
                        },
                        "prefix_length": {
                            "type": "integer"
                        }
                    }
                }
            }
        }
    }
}

Not very intuitive or readable… 😐

For this use case, I like to use YAML because it requires fewer curly brackets (fewer typos) and is easier to read. Furthermore, you can add comments to your schema without breaking things. The same data format is also used within Ansible playbooks. To convert the YAML schema to a python dictionary, you can use the PyYAML module (installable via pip), whose API works much like the json module. The following example shows the YAML schema and how to create a python dictionary from it:

raw_schema_yaml = """
networks:
 type: list
 schema:
  type: dict
  schema:
   vlan:
    type: dict
    schema:
     id:
      type: integer
      min: 1
      max: 4094
     name:
      type: string
   ipv4:
    type: dict
    schema:
     address:
      type: ipv4address
     prefix_length:
      type: integer
"""

my_dictionary = yaml.safe_load(raw_schema_yaml)

That’s better 😃: now you get the same dictionary, but based on YAML data.

Customize the Validator

In our example model, we need a way to verify the format of the IPv4 address value. There are several ways to accomplish this:

  • Validate against a regular expression for IPv4 addresses, but the resulting error message is a bit confusing if it is displayed to the user
  • Define a custom data type

Before we can create a custom data type, we need to extend the cerberus Validator class. In the example script, I call it NetworkDataJsonValidator:

import ipaddress

from cerberus import Validator

class NetworkDataJsonValidator(Validator):
    # class definition...

Now we can create a custom data type by defining a method whose name uses the prefix _validate_type_. We will do this for the IPv4 address value in our example schema using the ipaddress module. I already wrote an introduction to this module in an earlier blog post.
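As a quick reminder of how the ipaddress module behaves (the values are illustrative): a valid address creates an IPv4Address object, an invalid one raises an exception.

```python
import ipaddress

# a valid value simply creates an IPv4Address object
addr = ipaddress.IPv4Address("10.1.1.1")
print(addr)  # 10.1.1.1

# an invalid value raises an AddressValueError
try:
    ipaddress.IPv4Address("10.1.1.300")
except ipaddress.AddressValueError as exc:
    print("invalid:", exc)
```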

Now, the custom Validator with the custom data type looks similar to the following:

class NetworkDataJsonValidator(Validator):
    """
    A simple JSON data validator with a custom data type for IPv4 addresses
    """
    def _validate_type_ipv4address(self, field, value):
        """
        checks that the given value is a valid IPv4 address
        """
        try:
            # try to create an IPv4 address object using the python3 ipaddress module
            ipaddress.IPv4Address(value)

        except ipaddress.AddressValueError:
            self._error(field, "Not a valid IPv4 address")

I know there are other ways to verify the format of the data value, e.g. using a regular expression, but this way requires less explanation of what’s happening 😉: we simply test whether we are able to create an instance of the IPv4Address class from the given value. If not, an exception is raised by the class and the validation fails.

Putting it all together

Using the schema validator is straightforward. You create an instance of our NetworkDataJsonValidator class, passing a python dictionary that contains the schema definition as the initial parameter. After that, you can validate the data using the validate(data) method.

The following lines of code show how to create and use the class for the predefined example data from the script (using the interactive python shell):

>>> validator_yaml = NetworkDataJsonValidator(schema_yaml)
>>> result_yaml = validator_yaml.validate(valid_sample_data)
>>> result_yaml
True
>>> result_yaml = validator_yaml.validate(invalid_sample_data)
>>> result_yaml
False
>>> print(json.dumps(validator_yaml.errors, indent=4))
{
    "networks": {
        "0": {
            "vlan": {
                "id": "max value is 4094"
            },
            "ipv4": {
                "address": "Not a valid IPv4 address"
            }
        }
    }
}

As you can see, the result of the validate method is either True or False. If the validation fails, you can read the error messages from the errors attribute of the Validator instance. These errors are also expressed as a dictionary that mirrors the structure of the schema.

Conclusion

Cerberus is easy to learn and use and provides a lot of functionality. With some additional effort in your script, you get a configurable validation schema for any kind of JSON data. It is just a building block with no direct benefit for the end-user, but from a development perspective, you get a flexible model to verify data from a user. This module could be used, for example, as an additional extension for the Network Configuration Generator to add JSON data as a data field.

We have only scratched the surface in this post. Cerberus provides many more features, including the validation of allowed values for certain keys or within a list, data type transformation, and even automatic key renaming. You can also mark certain values as required or allow unknown keys within the data. If you’d like to dive deeper, take a look at the official documentation of the module.

Version 0.10 also provides the ability to normalize the JSON data. This can be used to inject default values into the data.

You can find the entire example in my python-script-examples GitHub repository. That’s it for today. I hope you find this post useful, and thank you for reading.

