Python package for metadata schemas

Mariana Montes and Ronny Moreas

2024-05-29

Outline

  • The ManGO Metadata Schema Manager
  • From JSON to validation
  • From a Python dictionary to AVUs
  • Conclusion

The ManGO Metadata Schema Manager

Form to add metadata

View of the schema metadata

The Schema Manager

Metadata Schemas as JSON

book-v2.0.0-published.json

{'schema_name': 'book',
 'version': '2.0.0',
 'status': 'published',
 'properties': {'title': {'title': 'Book title',
   'type': 'text',
   'required': True},
  'publishing_date': {'title': 'Publishing date',
   'type': 'date',
   'required': True,
   'repeatable': True},
  'cover_colors': {'title': 'Colors in the cover',
   'type': 'select',
   'values': ['red', 'blue', 'green', 'yellow'],
   'multiple': True,
   'ui': 'checkbox'},
  'publisher': {'title': 'Publishing house',
   'type': 'select',
   'values': ['Penguin House', 'Tor', 'Corgi', 'Nightshade books'],
   'multiple': False,
   'ui': 'dropdown',
   'required': True,
   'default': 'Tor'},
  'ebook': {'title': 'Is there an e-book',
   'type': 'select',
   'values': ['Available', 'Unavailable'],
   'multiple': False,
   'ui': 'radio'},
  'author': {'title': 'Author',
   'properties': {'name': {'title': 'Name and surname',
     'type': 'text',
     'required': True},
    'age': {'title': 'Age',
     'type': 'integer',
     'minimum': '12',
     'maximum': '99'},
    'email': {'title': 'Email address',
     'type': 'email',
     'required': True,
     'repeatable': True,
     'pattern': '[^@]+@kuleuven.be'}},
   'type': 'object',
   'repeatable': True},
  'market_price': {'title': 'Market price', 'type': 'float'}},
 'edited_by': 'mariana',
 'realm': 'datateam_icts_icts_test',
 'title': 'Book schema as an example',
 'parent': ''}
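
The listing above shows the schema as it looks once loaded into Python; the file on disk is plain JSON. As a minimal sketch of what such a file contains, the snippet below parses a trimmed-down copy of the book schema with the standard library and lists the required fields and the defaults. The helper names `required_fields` and `field_defaults` are made up for this illustration and are not part of mango-mdschema.

```python
import json

# A trimmed-down copy of the book schema, written as strict JSON
# (double quotes, lowercase true/false).
schema_text = """
{
  "schema_name": "book",
  "version": "2.0.0",
  "properties": {
    "title": {"title": "Book title", "type": "text", "required": true},
    "publishing_date": {"title": "Publishing date", "type": "date",
                        "required": true, "repeatable": true},
    "cover_colors": {"title": "Colors in the cover", "type": "select",
                     "values": ["red", "blue", "green", "yellow"],
                     "multiple": true, "ui": "checkbox"},
    "publisher": {"title": "Publishing house", "type": "select",
                  "values": ["Penguin House", "Tor", "Corgi", "Nightshade books"],
                  "multiple": false, "ui": "dropdown",
                  "required": true, "default": "Tor"}
  }
}
"""

schema = json.loads(schema_text)


def required_fields(schema_dict):
    """Return the names of the top-level fields flagged as required."""
    return [
        name
        for name, spec in schema_dict["properties"].items()
        if spec.get("required", False)
    ]


def field_defaults(schema_dict):
    """Return a mapping of top-level field names to their default values."""
    return {
        name: spec["default"]
        for name, spec in schema_dict["properties"].items()
        if "default" in spec
    }


print(required_fields(schema))
print(field_defaults(schema))
```
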

Minimal example

pip install mango-mdschema


add_schema_metadata.py
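import json
import os
from irods.session import iRODSSession
from mango_mdschema import Schema

# Load the metadata to attach; json.load() returns a dictionary.
with open("metadata_file.json", "r") as f:
    my_metadata = json.load(f)

my_schema = Schema("book-v2.0.0-published.json")

# Assumption: the iRODS environment file is in its default location.
env_file = os.path.expanduser("~/.irods/irods_environment.json")
with iRODSSession(irods_env_file=env_file) as session:
    # Assumption: annotate the first data object in the user's home collection.
    home_dir = f"/{session.zone}/home/{session.username}"
    irods_object = session.collections.get(home_dir).data_objects[0]
    my_schema.apply(irods_object, my_metadata)  # includes validation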

From JSON to validation

Interpretation via the Python package

# The optional prefix argument lets you tailor it to your implementation.
book_schema = Schema("book-v2.0.0-published.json")
print(book_schema)
Book schema as an example
Metadata annotated with the schema 'book' (2.0.0) carry the prefix 'mgs'.
This schema contains the following 7 fields:
- title, of type 'text' (required).
- publishing_date, of type 'date' (required).
- cover_colors, of type 'select'.
- publisher, of type 'select' (required).
- ebook, of type 'select'.
- author, of type 'object' (required).
- market_price, of type 'float'.

Field requirements

book_schema.print_requirements("publishing_date")
Type: date.
Required: True. Default: None.
Repeatable: True.
book_schema.print_requirements("publisher")
Type: select.
Required: True. Default: Tor.
Repeatable: False.
Choose only one of the following values:
- Penguin House
- Tor
- Corgi
- Nightshade books

Field requirements

book_schema.print_requirements("cover_colors")
Type: select.
Required: False.
Repeatable: False.
Choose at least one of the following values:
- red
- blue
- green
- yellow

From a Python dictionary to AVUs

Required fields and defaults

my_metadata = {
    "title": "An exemplary book",  # a required field without a default
    "author": [  # a repeatable composite field
        {"name": "Fulano De Tal", "email": "fulano.detal@kuleuven.be"},
        {"name": "Jane Doe", "email": "jane.doe@kuleuven.be"},
    ],
    "ebook": "Available",
    "publishing_date": "2024-02-01",  # a repeatable date
    # a field with multiple possible values, not all of them valid
    "cover_colors": ["red", "magenta", "yellow", "turquoise"],
}
# Required fields with a default are filled in.
book_schema.validate(my_metadata)
{'title': 'An exemplary book',
 'author': [{'name': 'Fulano De Tal', 'email': ['fulano.detal@kuleuven.be']},
  {'name': 'Jane Doe', 'email': ['jane.doe@kuleuven.be']}],
 'ebook': 'Available',
 'publishing_date': [datetime.date(2024, 2, 1)],
 'cover_colors': ['red', 'yellow'],
 'publisher': 'Tor'}
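
The validated output above differs from the input in two ways: the invalid colors are gone and the publisher has been filled in. The snippet below is a minimal sketch of those two rules in plain Python. It is not the package's implementation, and the helper names `keep_allowed` and `apply_defaults` are hypothetical.

```python
def keep_allowed(values, allowed):
    """Split the values of a multiple-choice field into kept and discarded."""
    kept = [value for value in values if value in allowed]
    discarded = [value for value in values if value not in allowed]
    return kept, discarded


def apply_defaults(metadata, properties):
    """Return a copy of the metadata with defaults for missing required fields."""
    completed = dict(metadata)
    for name, spec in properties.items():
        if name not in completed and spec.get("required") and "default" in spec:
            completed[name] = spec["default"]
    return completed


# The relevant part of the book schema.
properties = {
    "title": {"type": "text", "required": True},
    "cover_colors": {
        "type": "select",
        "values": ["red", "blue", "green", "yellow"],
        "multiple": True,
    },
    "publisher": {"type": "select", "required": True, "default": "Tor"},
}

kept, discarded = keep_allowed(
    ["red", "magenta", "yellow", "turquoise"],
    properties["cover_colors"]["values"],
)
completed = apply_defaults({"title": "An exemplary book"}, properties)
print(kept, discarded, completed)
```

A required field without a default, such as the title, cannot be repaired this way, which is why leaving it out has to be an error rather than a silent fix.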

Error messages

from datetime import date

book_schema.validate(
    {
        "title": "Some title",
        "author": {"name": "Jane Doe", "email": "sweetdoe@email.eu"},
        "publishing_date": date.today(),
    }
)
ValidationError: 'book.author.email' does not match pattern '^[^@]+@kuleuven.be$', got value 'sweetdoe@email.eu'
book_schema.validate(
    {
        "title": "Some title",
        "author": {"name": "Jane Doe", "email": "jane.doe@kuleuven.be"},
        "publishing_date": "01/01/1990",
    }
)
ConversionError: 'book.publishing_date' cannot be converted to a date, got value '01/01/1990'
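
The two errors above correspond to two kinds of checks: matching a text value against a pattern and converting a string to a date. The snippet below sketches both with the standard library only. It assumes that the pattern must match the whole value (the error message shows it anchored with `^` and `$`) and that dates are expected in ISO format; the function names are hypothetical and the package's own logic may differ.

```python
import re
from datetime import date


def matches_pattern(value, pattern):
    """Check that the whole value matches the pattern, not just a part of it."""
    return re.fullmatch(pattern, value) is not None


def to_date(value):
    """Convert an ISO-formatted string (YYYY-MM-DD) to a date object."""
    if isinstance(value, date):
        return value
    # Raises ValueError for strings such as '01/01/1990'.
    return date.fromisoformat(value)


email_pattern = "[^@]+@kuleuven.be"
print(matches_pattern("jane.doe@kuleuven.be", email_pattern))
print(matches_pattern("sweetdoe@email.eu", email_pattern))
print(to_date("2024-02-01"))
```

Note that the dot in `kuleuven.be` is unescaped and therefore matches any character; a stricter pattern would use `kuleuven\.be`.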

Warnings

import logging

logger = logging.getLogger("mango_mdschema")
logger.setLevel(logging.INFO)

book_schema.validate(my_metadata)
INFO:mango_mdschema:Applying default value to required field 'book.publisher': 'Tor'
INFO:mango_mdschema:Some values in 'book.cover_colors' were not allowed and are discarded: magenta, turquoise. Allowed values: red, blue, green, yellow.
INFO:mango_mdschema:Missing non-required fields in 'book': ['market_price']
INFO:mango_mdschema:Missing non-required fields in 'book.author': ['age']
INFO:mango_mdschema:Missing non-required fields in 'book.author': ['age']
{'title': 'An exemplary book',
 'author': [{'name': 'Fulano De Tal', 'email': ['fulano.detal@kuleuven.be']},
  {'name': 'Jane Doe', 'email': ['jane.doe@kuleuven.be']}],
 'ebook': 'Available',
 'publishing_date': [datetime.date(2024, 2, 1)],
 'cover_colors': ['red', 'yellow'],
 'publisher': 'Tor'}

Write: from dictionaries to namespacing

irods_object = session.collections.get(home_dir).data_objects[0]
irods_object.metadata.items()
[]
avus = book_schema.to_avus(my_metadata)
avus
[<iRODSMeta None mgs.book.title An exemplary book None>,
 <iRODSMeta None mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta None mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta None mgs.book.author.name Jane Doe 2>,
 <iRODSMeta None mgs.book.author.email jane.doe@kuleuven.be 2>,
 <iRODSMeta None mgs.book.ebook Available None>,
 <iRODSMeta None mgs.book.publishing_date 2024-02-01 None>,
 <iRODSMeta None mgs.book.cover_colors red None>,
 <iRODSMeta None mgs.book.cover_colors yellow None>,
 <iRODSMeta None mgs.book.publisher Tor None>]
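
The AVUs above follow a visible convention: names are namespaced as prefix, schema name and field path joined by dots, and the unit holds the index of the composite-field instance. The snippet below is a minimal sketch of that flattening that produces plain (name, value, unit) tuples instead of iRODSMeta objects. The function `to_avu_tuples` is hypothetical and does not handle repeatable composite fields nested inside other composite fields.

```python
from datetime import date


def to_avu_tuples(metadata, schema_name, prefix="mgs"):
    """Flatten a (validated) metadata dictionary into (name, value, unit) tuples."""
    avus = []

    def walk(namespace, value, unit):
        if isinstance(value, dict):
            # Composite field: extend the namespace with each subfield name.
            for key, sub_value in value.items():
                walk(f"{namespace}.{key}", sub_value, unit)
        elif isinstance(value, list):
            for index, item in enumerate(value, start=1):
                # Instances of a composite field are numbered via the unit;
                # repeated simple values keep the unit they inherited.
                item_unit = str(index) if isinstance(item, dict) else unit
                walk(namespace, item, item_unit)
        else:
            avus.append((namespace, str(value), unit))

    walk(f"{prefix}.{schema_name}", metadata, None)
    return avus


validated = {
    "title": "An exemplary book",
    "author": [
        {"name": "Fulano De Tal", "email": ["fulano.detal@kuleuven.be"]},
        {"name": "Jane Doe", "email": ["jane.doe@kuleuven.be"]},
    ],
    "publishing_date": [date(2024, 2, 1)],
    "cover_colors": ["red", "yellow"],
}

flattened = to_avu_tuples(validated, "book")
for avu in flattened:
    print(avu)
```
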

Write: from dictionaries to namespacing

book_schema.apply(irods_object, my_metadata)
irods_object.metadata.items()
[<iRODSMeta 8475276 mgs.book.cover_colors red None>,
 <iRODSMeta 8497207 mgs.book.title An exemplary book None>,
 <iRODSMeta 8497210 mgs.book.author.name Fulano De Tal 1>,
 <iRODSMeta 8497213 mgs.book.author.email fulano.detal@kuleuven.be 1>,
 <iRODSMeta 8497216 mgs.book.author.name Jane Doe 2>,
 <iRODSMeta 8497219 mgs.book.author.email jane.doe@kuleuven.be 2>,
 <iRODSMeta 8497222 mgs.book.ebook Available None>,
 <iRODSMeta 8497225 mgs.book.publishing_date 2024-02-01 None>,
 <iRODSMeta 8497228 mgs.book.cover_colors yellow None>,
 <iRODSMeta 8497231 mgs.book.publisher Tor None>,
 <iRODSMeta 8497234 mgs.book.__version__ 2.0.0 None>]

Read: from AVUs back to dictionaries

# Alternative starting from a list of AVUs: book_schema.from_avus(avus)
book_schema.extract(irods_object)
{'author': [{'email': ['fulano.detal@kuleuven.be'], 'name': 'Fulano De Tal'},
  {'email': ['jane.doe@kuleuven.be'], 'name': 'Jane Doe'}],
 'cover_colors': ['red', 'yellow'],
 'ebook': 'Available',
 'publisher': 'Tor',
 'publishing_date': [datetime.date(2024, 2, 1)],
 'title': 'An exemplary book'}
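
Reading is the reverse operation: strip the namespace, skip bookkeeping entries such as `__version__`, and regroup composite fields by unit. The snippet below is a minimal sketch on plain (name, value, unit) tuples. Unlike the package, it does not consult the schema, so every value comes back as a list of strings and nothing is converted to dates or numbers; the function `from_avu_tuples` is hypothetical and assumes units are integer-like strings or None.

```python
def from_avu_tuples(avus, schema_name, prefix="mgs"):
    """Rebuild a nested dictionary from namespaced (name, value, unit) tuples."""
    root = f"{prefix}.{schema_name}."
    simple = {}
    composites = {}
    for name, value, unit in avus:
        if not name.startswith(root):
            continue  # metadata that does not belong to this schema
        path = name[len(root):].split(".")
        if path[0].startswith("__"):
            continue  # bookkeeping entries such as __version__
        if len(path) == 1:
            simple.setdefault(path[0], []).append(value)
        else:
            # Group subfields of a composite field by the unit they share.
            field, subfield = path[0], ".".join(path[1:])
            instance = composites.setdefault(field, {}).setdefault(unit, {})
            instance.setdefault(subfield, []).append(value)
    result = dict(simple)
    for field, instances in composites.items():
        ordered_units = sorted(instances, key=lambda unit: int(unit or 0))
        result[field] = [instances[unit] for unit in ordered_units]
    return result


stored = [
    ("mgs.book.cover_colors", "red", None),
    ("mgs.book.title", "An exemplary book", None),
    ("mgs.book.author.name", "Fulano De Tal", "1"),
    ("mgs.book.author.email", "fulano.detal@kuleuven.be", "1"),
    ("mgs.book.author.name", "Jane Doe", "2"),
    ("mgs.book.author.email", "jane.doe@kuleuven.be", "2"),
    ("mgs.book.cover_colors", "yellow", None),
    ("mgs.book.__version__", "2.0.0", None),
]

rebuilt = from_avu_tuples(stored, "book")
print(rebuilt)
```
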

Conclusion

Metadata schemas with Python

Metadata schemas

  • Format validation
  • Required fields and default values
  • Hierarchical structure

Python

  • Processing data in batches
  • Reading metadata from files
  • E.g. metadata with data ingestion

Tip

You don’t need ManGO: these are also standalone applications!

mango-mdschema

  • Offers validation, writing and reading of structured metadata

  • Schemas are described in JSON and can be designed in the Schema Manager

  • Metadata can be hierarchical, rendered with namespacing

  • Input can be automated; output can be parsed and rendered in the portal

Thank you!

github.com/kuleuven/mango-mdschema
github.com/kuleuven/mango-metadata-schemas