Validation schemas
Introduction
The streamlink.plugin.api.validate module provides an API for defining declarative validation schemas which are used to verify and extract data from various inputs, for example HTTP responses.
Validation schemas are a powerful tool for plugin implementors to find and extract data like stream URLs, stream metadata and more from websites and web APIs.
Instead of verifying and extracting data programmatically and handling errors manually, declarative validation schemas let you define comprehensive validation and extraction rules which are easy to understand and which raise errors with meaningful messages when extraction fails.
Public interface
While the internals are implemented in the streamlink.validate package, streamlink.plugin.api.validate provides the main public interface for plugin implementors.
Examples
Simple schemas
Let's begin with a few simple validation schemas which are not particularly useful yet.
>>> from streamlink.plugin.api import validate
>>> schema_one = validate.Schema("123")
>>> schema_two = validate.Schema(123)
>>> schema_three = validate.Schema(int, 123.0)
>>> schema_one.validate("123")
'123'
>>> schema_two.validate(123)
123
>>> schema_three.validate(123)
123
First, three Schema instances are created: schema_one, schema_two and schema_three.
The Schema class is the main schema validation interface and the outer wrapper for all schema definitions. It is a subclass of validate.all which additionally implements the Schema.validate() method. This interface is expected by various Streamlink methods and functions when passing the schema argument/keyword, for example to the HTTPSession methods or streamlink.utils.parse functions.
The validate.all class takes a sequence of schema object arguments and validates each one in order. All schema objects in this schema container must be valid.
Schema objects can be anything, and depending on their type, different validations will be applied. In our example, both schema_one and schema_two contain only one schema object, namely "123" and 123 respectively, whereas schema_three contains two schema objects, int and 123.0. This means that the first two schemas validate only one condition, while the third one validates two: first int, then 123.0.
As you've probably noticed, validation schemas also have a return value, which serves their extraction purpose, but it isn't very interesting in this example.
The "123", 123 and 123.0 schemas are simple equality validations. This is the case for all basic objects, and all they do is validate and return the input value again. int however is a type object, and thus a type validation, which checks whether the input is an instance of the schema object and then also returns the input value again. Since 123 is an int, the schema is valid for that input.
schema_three however hasn't finished validating yet at this point, as it defines two validation schemas in total. This means that the return value of the int validation gets passed to the 123.0 schema validation, and as expected when checking 123 == 123.0, despite the input and schema being of different types, namely int and float, the validation succeeds and returns its input value again, causing the return value of the whole schema_three to be 123.
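The chaining behaviour described above can be approximated in plain Python. This is only an illustrative sketch under simplifying assumptions, not Streamlink's implementation: validate_one and validate_all are hypothetical helpers, and ValueError stands in for ValidationError.

```python
def validate_one(schema, value):
    """Hypothetical stand-in: type objects check isinstance, anything else checks equality."""
    if isinstance(schema, type):
        if not isinstance(value, schema):
            raise ValueError(
                f"Type of {value!r} should be {schema.__name__}, but is {type(value).__name__}"
            )
        return value
    if value != schema:
        raise ValueError(f"{value!r} does not equal {schema!r}")
    return value


def validate_all(schemas, value):
    """Validate each schema in order, feeding each output into the next schema."""
    for schema in schemas:
        value = validate_one(schema, value)
    return value


# The int check passes, then 123 == 123.0 passes, and the int input is returned.
print(validate_all([int, 123.0], 123))  # 123
```

Both checks return their input unchanged, which is why the result stays an int even though the last schema object is a float.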
Now let's have a look at validation errors.
>>> schema_one.validate(123)
streamlink.exceptions.PluginError: Unable to validate result: ValidationError(equality):
  123 does not equal '123'
>>> schema_three.validate(123.0)
streamlink.exceptions.PluginError: Unable to validate result: ValidationError(type):
  Type of 123.0 should be int, but is float
The first Schema.validate() call passes 123 to schema_one. schema_one however expects "123", so a ValidationError is raised because the input value is not equal to the schema. Schema.validate() catches the error and wraps it in a PluginError with a specific validation message.
The second validation also fails, but here it's because of the input type. The first sub-schema explicitly checks for the type int, so a ValidationError is raised for the float input 123.0, even though the following schema, 123.0, is a float object which would obviously validate a 123.0 input when comparing equality.
Extracting JSON data
The next example shows how to read an optional integer value from JSON data.
>>> from streamlink.plugin.api import validate
>>> json_schema = validate.Schema(
...     str,
...     validate.parse_json(),
...     {
...         "status": validate.any(None, int),
...     },
...     validate.get("status"),
... )
>>> json_schema.validate("""{"status":null}""")
None
>>> json_schema.validate("""{"status":123}""")
123
>>> json_schema.validate("""Not JSON""")
streamlink.exceptions.PluginError: Unable to validate result: ValidationError:
  Unable to parse JSON: Expecting value: line 1 column 1 (char 0) ('Not JSON')
>>> json_schema.validate("""{"status":"unknown"}""")
streamlink.exceptions.PluginError: Unable to validate result: ValidationError(dict):
  Unable to validate value of key 'status'
  Context(AnySchema):
    ValidationError(equality):
      'unknown' does not equal None
    ValidationError(type):
      Type of 'unknown' should be int, but is str
Once again, we start with a new Schema object which gets assigned to json_schema. This schema collection validates four schemas in total. Each of them must be valid, with each output being the input of the next one.
Since our goal is to parse JSON data and extract data from it, we should only accept string inputs, so we set str as the first schema in this validate.all schema collection.
Next is the validate.parse_json() validation, a call of a utility function which returns a validate.transform schema object that does exactly what its name suggests: it takes an input and returns something else. In this case, strings are the input and a parsed JSON object is the output, assuming that the input is indeed valid JSON data.
Now we validate the parsed JSON object. We expect the JSON data to be a JSON object, so we let the next validation schema be a dict validation. dict validation schemas define a set of key-value pairs which must exist in the input, unless keys are set as optional using validate.optional. For the sake of simplicity, this isn't the case in this example just yet. Each value of the key-value pairs is a validation schema in its own right, which the corresponding input value is validated against.
Here, the "status" key has a validate.any validation schema, which is also a schema collection, similar to validate.all, but validate.any requires at least one sub-schema to be valid, not all. Each sub-schema receives the same input, and the output of the overall schema collection is the output of the first sub-schema that's valid. For our example, this means that the value of the status key in the JSON data must either be None (null) or an int.
If any of the schemas in a nested schema definition like that fails, then a validation error stack will be generated by ValidationError, as shown above.
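To make the "first valid sub-schema wins" rule more concrete, here is a rough plain-Python approximation. It is a sketch only and does not reflect how Streamlink implements validate.any: equals_none, is_int, validate_any and read_status are hypothetical names, and ValueError stands in for ValidationError.

```python
import json


def equals_none(value):
    """Equality check against None; returns the input on success."""
    if value is not None:
        raise ValueError(f"{value!r} does not equal None")
    return value


def is_int(value):
    """Type check for int; returns the input on success."""
    if not isinstance(value, int):
        raise ValueError(f"Type of {value!r} should be int, but is {type(value).__name__}")
    return value


def validate_any(checks, value):
    """Hypothetical stand-in: return the output of the first check that succeeds."""
    errors = []
    for check in checks:
        try:
            return check(value)  # every check receives the same input
        except ValueError as err:
            errors.append(str(err))
    # Only when every check has failed are the collected messages reported together.
    raise ValueError("; ".join(errors))


def read_status(text):
    """Parse JSON text, require a "status" key and validate its value."""
    data = json.loads(text)
    if "status" not in data:
        raise ValueError("Key 'status' not found")
    return validate_any([equals_none, is_int], data["status"])


print(read_status('{"status":null}'))  # None
print(read_status('{"status":123}'))   # 123
```

Collecting the individual failures before raising mirrors the idea of the error stack shown earlier, where each failed alternative is listed.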
The last of the four schemas in the outer validate.all schema collection is a validate.get schema. This schema works on any kind of input which implements __getitem__(), for example dict objects. And as expected, it attempts to get and return the "status" key of the output of the previous dict validation.
The validation module also supports getting multiple values at once using the validate.union or validate.union_get schemas, but this isn't relevant here.
Finding stream URLs in HTML
Let's imagine a simple website where a stream URL is embedded as JSON data in a data-player attribute of an unknown HTML element, which the website's web player reads from.
Extracting this data could be done using regular expressions, but then we'd have to take HTML syntax into account, as well as JSON syntax, which should usually be HTML-encoded in that HTML element attribute. This would make writing a regular expression even harder, apart from the fact that the JSON data structure could easily change at any time.
Therefore, it makes much more sense to parse the HTML data, query the resulting node tree with an XPath query to get the attribute value, then parse the JSON data, and finally find and validate the stream URL.
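To illustrate why HTML encoding matters here, the following sketch performs the parsing steps by hand using only Python's standard library. It is not what Streamlink does internally (Streamlink parses HTML via lxml): DataPlayerFinder and find_player_data are hypothetical helpers, and the sample page is made up.

```python
import json
from html.parser import HTMLParser


class DataPlayerFinder(HTMLParser):
    """Hypothetical helper: remember the first data-player attribute value seen."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if self.value is None:
            # Attribute values arrive already HTML-decoded, e.g. &quot; becomes ".
            self.value = dict(attrs).get("data-player")


def find_player_data(html_text):
    """Return the parsed JSON player data, or None if no player data was found."""
    finder = DataPlayerFinder()
    finder.feed(html_text)
    if not finder.value:
        return None
    return json.loads(finder.value)


page = '<section data-player="{&quot;url&quot;:&quot;https://host/hls-playlist.m3u8&quot;}"></section>'
print(find_player_data(page))  # {'url': 'https://host/hls-playlist.m3u8'}
```

A regular expression would have to deal with the &quot; entities itself, whereas an HTML parser decodes the attribute value before the JSON parser ever sees it.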
We also don't want to raise validation errors unnecessarily when the user inputs a URL where no video player was found, so we can instead return an empty list of streams in our plugin implementation and let Streamlink's CLI exit gracefully. Validation errors are only supposed to be raised when an actual error happened due to unexpected data, not when streams are offline or inaccessible.
Thanks to validation schemas, we can do all this declaratively instead of writing messy procedural code.
>>> from streamlink.plugin.api import validate
>>> schema = validate.Schema(
...     validate.parse_html(),
...     validate.xml_xpath_string(".//*[@data-player][1]/@data-player"),
...     validate.none_or_all(
...         validate.parse_json(),
...         {
...             validate.optional("url"): validate.url(
...                 path=validate.endswith(".m3u8"),
...             ),
...         },
...         validate.get("url"),
...     ),
... )
>>> schema.validate("""
...     <!doctype html>
...     <section class="no-video-player"></section>
... """)
None
>>> schema.validate("""
...     <!doctype html>
...     <section
...         class="video-player"
...         data-player="{
...             &quot;title&quot;:&quot;Offline&quot;
...         }"
...     >
...         ...
...     </section>
... """)
None
>>> schema.validate("""
...     <!doctype html>
...     <section
...         class="video-player"
...         data-player="{
...             &quot;title&quot;:&quot;Live&quot;,
...             &quot;url&quot;:&quot;https://host/hls-playlist.m3u8&quot;
...         }"
...     >
...         ...
...     </section>
... """)
'https://host/hls-playlist.m3u8'
>>> schema.validate("""
...     <!doctype html>
...     <section
...         class="video-player"
...         data-player="{
...             &quot;title&quot;:&quot;Live&quot;,
...             &quot;url&quot;:&quot;https://host/dash-manifest.mpd&quot;
...         }"
...     >
...         ...
...     </section>
... """)
streamlink.exceptions.PluginError: Unable to validate result: ValidationError(NoneOrAllSchema):
  ValidationError(dict):
    Unable to validate value of key 'url'
    Context(url):
      Unable to validate URL attribute 'path'
      Context(endswith):
        '/dash-manifest.mpd' does not end with '.m3u8'
We start with a new Schema and begin by parsing HTML using the validate.parse_html() utility function. Similar to validate.parse_json(), it returns a validate.transform schema. validate.parse_html() however returns a parsed HTML node tree via Streamlink's lxml dependency.
This is followed by an XPath query schema using the validate.xml_xpath_string() utility function. validate.xml_xpath_string() is a wrapper for validate.xml_xpath() which always returns a string or None, depending on the query result. This is useful for querying text contents or single attribute values, like in this case. XPath queries on their own always return a result set, i.e. possibly multiple values, so when trying to find single values, it is important to limit the number of potential return values to only one in the XPath query.
The query here attempts to find any node with a data-player attribute. It then limits the result set to the first found element and then reads the value of its data-player attribute. validate.xml_xpath_string() turns this into a single string return value, or None if the query returned no value or an empty value.
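The difference between a result set and a single value can be demonstrated with the standard library's ElementTree. This is only a stand-in for illustration: ElementTree supports a limited XPath subset and cannot select attributes directly, so the attribute is read in a separate step, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<root><a data-player="first"/><b data-player="second"/></root>')

# The query yields a result set: every element which has a data-player attribute.
matches = doc.findall(".//*[@data-player]")
values = [element.get("data-player") for element in matches]
print(values)  # ['first', 'second']

# Reducing the result set to a single value, or None if nothing matched.
first = values[0] if values else None
print(first)  # first
```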
Since we now have two different paths for our overall validation schema, either no player data or still unvalidated player data, our next schema is a validate.none_or_all schema. This works similarly to validate.all, except that None inputs are skipped and get returned immediately without validating any sub-schemas. This lets us handle cases where no player was found on the website, without raising a ValidationError.
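The short-circuit behaviour for None inputs can be sketched in a few lines of plain Python. As before, this is an approximation for illustration and not Streamlink's code: none_or_all here is a hypothetical function, and the steps are ordinary callables instead of schema objects.

```python
import json


def none_or_all(steps, value):
    """Hypothetical stand-in: pass None through untouched, otherwise run every step in order."""
    if value is None:
        return None
    for step in steps:
        value = step(value)
    return value


# Parse the JSON text, then read the optional "url" key.
steps = [json.loads, lambda data: data.get("url")]

print(none_or_all(steps, None))                   # None: no player data found
print(none_or_all(steps, '{"title":"Offline"}'))  # None: player found, but no "url" key
print(none_or_all(steps, '{"url":"https://host/hls-playlist.m3u8"}'))  # https://host/hls-playlist.m3u8
```

Without the early return, json.loads would be called with None and fail with a TypeError for pages which simply have no player.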
In the validate.none_or_all schema, we now attempt to parse JSON data, which was already shown previously, except for the fact that we don't need to validate the str input here, as the XPath query must have already returned a string value.
On to the dict validation. We're only interested in the url key. Any other keys of the input will get ignored. Since we're aware that url can be missing if the stream is offline, we mark it as optional using the validate.optional schema. This makes the dict validation not raise an error if it's missing, but if it's set, then its value must validate. Speaking of the value, we want it to be a URL. This is where the validate.url utility function comes in handy. It parses the input and lets us validate any parts of the parsed URL with further validation schemas. The return value is always the full URL string. In our example, we want to ensure that the URL's path ends with the ".m3u8" string, which is an indicator for the stream being an HLS stream, so we can pass the URL to Streamlink's HLS implementation.
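Validating only the path component of a URL, while returning the full URL string, can be sketched with urllib.parse. This is a simplified illustration under the assumption that a scheme and a host are enough to call a string a URL; check_hls_url is a hypothetical helper and not part of Streamlink.

```python
from urllib.parse import urlparse


def check_hls_url(value):
    """Hypothetical stand-in: return the URL unchanged if its path ends with ".m3u8"."""
    parsed = urlparse(value)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"{value!r} is not a valid URL")
    if not parsed.path.endswith(".m3u8"):
        raise ValueError(f"{parsed.path!r} does not end with '.m3u8'")
    return value


# The query string is not part of the path, so it doesn't affect the check.
print(check_hls_url("https://host/hls-playlist.m3u8?token=abc"))
```

Checking the parsed path instead of the raw string avoids false negatives for URLs which carry query parameters after the playlist name.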
Lastly, we simply get the url key using validate.get. The return value must either be None if no url key was included in the JSON data, or a str with a URL whose path ends with ".m3u8".
This means that the overall schema can only return None or said kind of URL string. If the value of the url key is not a URL, or if its path does not end with ".m3u8", then a ValidationError is raised, which is what we want. The None return value should then be checked accordingly by the plugin implementation.
Validating HTTP responses
In order to validate HTTP responses directly, Streamlink's HTTPSession allows setting the schema keyword in HTTPSession.request(), as well as in each HTTP-verb method like get(), post(), etc.
Here's a simple plugin implementation with the same schema from the Finding stream URLs in HTML example above.
import re

from streamlink.plugin import Plugin, pluginmatcher
from streamlink.plugin.api import validate
from streamlink.stream.hls import HLSStream


@pluginmatcher(re.compile(r"https://example\.tld/"))
class ExamplePlugin(Plugin):
    def _get_streams(self):
        # Fetch the page and validate/extract the HLS URL from the response in one step
        hls_url = self.session.http.get(self.url, schema=validate.Schema(
            validate.parse_html(),
            validate.xml_xpath_string(".//*[@data-player][1]/@data-player"),
            validate.none_or_all(
                validate.parse_json(),
                {
                    validate.optional("url"): validate.url(
                        path=validate.endswith(".m3u8"),
                    ),
                },
                validate.get("url"),
            ),
        ))

        # No player or no "url" key was found, so there are no streams to return
        if not hls_url:
            return None

        return HLSStream.parse_variant_playlist(self.session, hls_url)


__plugin__ = ExamplePlugin