Schema Generation With schema-guru
- Requirements & Setup
- Preparing your data
- Generating the schema
- Adding your schema to mediachain
- Using your schema when publishing mediachain records
schema-guru is a tool from Snowplow Analytics to automatically generate JSON schemas from a collection of JSON documents. By using multiple input documents, it can generate schemas that can accomodate subtle differences in your data that might not be apparent from a single example. For instance, if you have a field that is usually a string, but may occasionally be null
, a single document will probably not be enough to determine that the field is nullable.
Requirements & Setup
schema-guru runs on the Java virtual machine, so you’ll need a recent JVM installed to run it. Then you can download the latest release and unzip it:
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.6.2.zip
$ unzip schema_guru_0.6.2.zip
This will give you an executable schema_guru_0.6.2
file, which you may want to put somewhere that’s on your $PATH
. In the examples below, I’ll assume that you put it somewhere on your $PATH
and renamed to just schema-guru
(without the version).
Preparing your data
You can give schema-guru either a single JSON document, or a directory containing a collection of documents. The more example documents you provide, the better your results will be. You can also pass in newline-delimited JSON (one JSON object per line), by giving the --ndjson
flag when you run the command.
If your documents are already in newline-delimite JSON format, or you have a directory full of examples prepared, you’re all set.
Some datasets might need pre-processing to get them into a usable format.
Let’s use the MoMA Collection data as an example. The Artworks.json
file consists of one big JSON array, each member of which is an object. We want to turn that into newline-delimited JSON to feed to schema guru.
Using jq, we can easily unpack the array: jq -c '.[]' Artworks.json > Artworks.ndjson
- the .[]
jq filter selects each member of the array and prints it, and the -c
flag instructs it to use “compact” printing, which results in one object per line.
Generating the schema
Mediachain uses snowplow’s self-describing schema convention, which schema-guru can help you generate.
When you run the command, you can pass in --vendor
, --name
, and --schemaver
flags to set the schema’s “self” properties.
--vendor
should be a reverse-dns style string, e.g.io.mediachain
ororg.moma
.--name
should identify the type of record you’re creating a schema for, e.g.artwork
--schemaver
is a version string in SchemaVer format. If this is the first revision of your schema, you should use1-0-0
for--schemaver
To create the schema, you pass in your example data to schema-guru:
schema-guru schema --ndjson --no-length --vendor org.moma --name artwork --schemaver 1-0-0 --output org.moma-artwork-jsonschema-1-0-0.json Artworks.ndjson
The --ndjson
flag is only necessary if you’re using newline-delimited json; if you pass in a single document or a directory of documents, you should omit it.
The --no-length
flag tells schema-guru not to try to infer the minimum and maximum length of string fields. You probably don’t want to set hard limits on your field lengths based on your examples, or you may end up with a schema that rejects valid data.
The --output
flag specifies the output filename; without it, the schema will be printed to standard out. If you want to pipe the schema to another command, just omit the --output
flag. The filename is up to you, but it’s a good idea to include the vendor, name, and schemaver in the filename so you can easily keep track of your schemas and versions.
Adding your schema to mediachain
Once you’ve generated a schema, you can publish it to mediachain with the mcclient publishSchema
command:
mcclient publishSchema org.moma-artwork-jsonschema-1-0-0.json
This will result in output similar to the following:
Published schema with wki = schema:io.mediachain/moma-artwork/jsonschema/1-0-0 to namespace mediachain.schemas
Object ID: QmdfuvXHnKtewFeunFBBDHrTsG5qyHBFCruWWh4xxXm9PA
Statement ID: 4XTTM6XcpMB8N8Tnkhc66DqvAotgKzRJgrZbF7SBByihtPnV8:1478191051:34005
The Object ID
above is used as the --schemaReference
flag when publishing records.
Using your schema when publishing mediachain records
To ingest the MoMA artworks using the published schema:
mcclient publish --namespace museums.moma.artworks --idFilter .ObjectID --schemaReference QmdfuvXHnKtewFeunFBBDHrTsG5qyHBFCruWWh4xxXm9PA Artworks.ndjson
The mcclient publish
command also accepts newline-delimited json, so we give it the same file we prepared earlier as the final argument. If the argument is ommitted, it will read from standard input.
The --idFilter
flag is a jq filter that selects the ObjectID
field from the input records. Note the .
preceeding the field name; without it jq will give you an error. This produces a mediachain “WKI” (well-known-identifier), that can be used to look up the records in queries.
The --namespace
flag is also required to give the records a “home” in the mediachain hierarchy.
The --schemaReference
is the object ID of the schema we published earlier.
When you provide a schema reference to the mcclient publish
command, your input objects are validated against it, and wrapped in a small outer object that contains a link to the schema in the schema
field, with your record as the data
field. Wrapping objects in this way allows other users to fetch the schema for a record without needing to know it up front.
For example, this MoMA artist record:
{
"ArtistBio": "American, 1934\u20132015",
"BeginDate": 1934,
"ConstituentID": 67350,
"DisplayName": "Howard Guttenplan",
"EndDate": 2015,
"Gender": null,
"Nationality": "American",
"ULAN": null,
"Wiki QID": null
}
Would stored in mediachain as:
{
"schema": {
"/": "QmXciNhn4P6veuojsR6uRGHBu27pAztp5o1rtMkkmsTtUf"
},
"data": {
"ArtistBio": "American, 1934\u20132015",
"BeginDate": 1934,
"ConstituentID": 67350,
"DisplayName": "Howard Guttenplan",
"EndDate": 2015,
"Gender": null,
"Nationality": "American",
"ULAN": null,
"Wiki QID": null
}
}