Security

Probatio sits at the edge of a program: it validates data that came from somewhere you do not control. That data is often hostile, and through from_json_schema even the schema can be hostile. This page is honest about what Probatio defends against, how, and where the responsibility stays with you.

Threat model

The thing Probatio touches most is untrusted data. A developer writes a schema, Probatio validates incoming values against it. That direction is safe: a schema is plain Python the developer wrote, and validating data against it does not expose any code-execution surface. Probatio does not eval, exec, pickle, or marshal the values it validates. There is no path where validating data runs attacker code.

The harder case is from_json_schema (and from_openapi). There the schema itself is decoded from an untrusted document. That widens the attack surface, because now both the data and the rules describing it are attacker-controlled.

An attacker against this surface is not after code execution; there is none to get. The realistic goals are denial of service: burn CPU until the process is useless, or exhaust the stack and crash the interpreter. Probatio’s safeguards target exactly those two outcomes.

Threats and mitigations

Threat	Vector	Mitigation
Catastrophic backtracking (ReDoS)	A `pattern` in an untrusted JSON Schema, compiled to a regex	`from_json_schema` refuses a nested unbounded quantifier with `SchemaError`, before it compiles
Stack exhaustion from a deep document	A pathologically nested untrusted JSON Schema	The decoder caps nesting depth and raises `SchemaError` instead of overflowing the stack
Stack exhaustion from deep or cyclic data	Crafted data run through a recursive `Self` schema	A recursion depth guard raises a clean `Invalid` instead of `RecursionError`
Arbitrary object construction from YAML	Tags in an untrusted YAML payload	YAML is always parsed with a safe loader; the unsafe loaders are never used

Regex denial of service

Python’s re engine backtracks, so a pattern like (a+)+$ runs in exponential time on crafted input. There is no safe timeout for re in pure Python, so the only defense is to refuse the dangerous pattern before compiling it.

from_json_schema does that. When a pattern contains a nested unbounded quantifier (an unbounded repeat applied to a group that is itself unbounded), the decoder raises SchemaError rather than building a validator that could hang:

from probatio import from_json_schema

from_json_schema({"type": "string", "pattern": "(a+)+$"})

A benign pattern compiles as you would expect:

from probatio import from_json_schema

schema = from_json_schema({"type": "string", "pattern": "^[a-z]+$"})
print(schema("hello"))  # hello

The trust boundary is the from_json_schema path, and only that path. A Match pattern you write in Python is not checked. That matches voluptuous: a developer-written regex is the developer’s responsibility. If you compile a pattern from input you do not trust, screen it yourself before handing it to Match.

Recursion and stack exhaustion

Two recursive shapes can drive Probatio into the Python stack: a deeply nested schema document, and deeply nested data validated against a recursive schema. A naive recursive walk turns either into a RecursionError, which is an unhandled crash. Probatio guards both.

A JSON Schema document nested past a fixed depth is refused while decoding:

from probatio import from_json_schema


def nest(levels):
    root = {"type": "object", "properties": {}}
    cursor = root
    for _ in range(levels):
        child = {"type": "object", "properties": {}}
        cursor["properties"]["x"] = child
        cursor = child
    return root


from_json_schema(nest(500))

On the data side, Self lets a schema validate a recursive structure, like a tree. Feed it data nested deeper than the recursion guard allows, and it raises a clean Invalid with the path to where it gave up, not a RecursionError:

from probatio import Schema, Self, Invalid

schema = Schema({"value": int, "children": [Self]})

# A normal tree validates fine.
ok = {"value": 1, "children": [{"value": 2, "children": []}]}
print(schema(ok))  # {'value': 1, 'children': [{'value': 2, 'children': []}]}


def deep(levels):
    node = {"value": 0, "children": []}
    for _ in range(levels):
        node = {"value": 0, "children": [node]}
    return node


try:
    schema(deep(5000))
except Invalid as err:
    print(err.error_message)  # data is nested too deeply for this recursive schema

The result is the same in both directions: a depth that would crash the process becomes a normal, catchable error instead.

Safe YAML loading

Probatio reads YAML with a safe loader and nothing else. load_yaml uses YAMLRocks when it is installed, otherwise PyYAML’s safe_load. Neither can construct arbitrary Python objects from tags in the document, so a hostile YAML payload cannot instantiate classes or run constructors. The unsafe loaders that PyYAML also ships are never reached.

from probatio import load_yaml

print(load_yaml("name: app\nport: 8080"))  # {'name': 'app', 'port': 8080}

Keeping secrets out of logs

Configuration often carries credentials: a password, an API token, a private key. Wrap those fields in Secret, and the validated value becomes a SecretValue that hides itself from repr, str, and any rendered validation error, so it will not leak into a log line or a stack trace. The real value is read back only through an explicit .get_secret_value() call. A Secret whose inner schema fails is reported without echoing the value, so even a rejected secret stays out of the error.

from probatio import Schema, Secret

schema = Schema({"api_token": Secret(str)})
result = schema({"api_token": "s3cr3t"})
str(result)                            # "{'api_token': SecretValue('**********')}"
result["api_token"].get_secret_value() # 's3cr3t'

One boundary to know: this protects the validated value, not humanize_error called against the raw, pre-validation input. When secrets are involved, humanize the validated output, not the original data.

Reporting a vulnerability

Found something that looks like a security issue? Please report it privately through the GitHub security advisories on the project repository, not as a public issue. That gives a fix time to land before the details are out in the open.