NeoSchema - User Guide

IMPORTANT – The reader is assumed to have already read the following article about motivations and overview: Using Schema in Graph Databases such as Neo4j


Why use a Schema Layer?

"The marriage of the flexibility of Graph Databases and the discipline of Relational Databases"
A Schema layer, as used in the BrainAnnex open-source project, is a software library (called "NeoSchema") that sits between the data and the higher layers.

It's optional, and may be used in a "strict" manner (as an enforcer) or in a "lax/loose" manner (for the data's self-documentation and as assistance to the UI, but with no enforcement).

In essence, a Schema represents what is either expected, or permitted, in our database.
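
To make the "strict" versus "lax" distinction concrete, here is a minimal sketch in plain Python. It is NOT the NeoSchema API: the function name `check_record` and its arguments are hypothetical, and it assumes that what gets checked is simply whether a record's property names are among those declared in the Schema.

```python
def check_record(allowed_properties, record, strict=True):
    """
    Hypothetical helper (not part of NeoSchema).
    Return the property names in `record` that the Schema does not declare.
    In strict mode, refuse the record instead of merely reporting them.
    """
    unexpected = sorted(set(record) - set(allowed_properties))
    if strict and unexpected:
        raise ValueError(f"Properties not permitted by the Schema: {unexpected}")
    return unexpected


car_schema = ["make", "color"]                 # Properties declared for "Car"
record = {"make": "Toyota", "year": 2020}      # "year" is not declared

# Lax use: the Schema documents what is expected, but nothing is blocked
lax_result = check_record(car_schema, record, strict=False)

# Strict use: the same record is rejected
try:
    check_record(car_schema, record, strict=True)
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

In the lax case the caller still learns which fields fall outside the Schema, which is the kind of information a UI could use without any enforcement taking place.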

Example

Let's jump into a simple example of some data nodes, and their corresponding Schema nodes:

In this example, our data consists of just 2 records, each stored as a graph-database node, in the yellow box at the bottom.

The two records represent, respectively, two entities named "Car" (pale-blue circle) and "Person" (violet circle).
The "Car" entity is expected to have (up to) 2 properties (aka attributes, or fields): "make" and "color".
The "Person" entity is expected to have a single property, "name".
Data summary: a white Toyota is owned by someone named Julian.



The Schema layer (green box at the top) just encapsulates the state of affairs described in the figure's caption, above. Several design specifications can be immediately observed:

  1. The Schema layer makes use of nodes labeled "CLASS", "PROPERTY" or (less used, and not shown in the above example) "LINK"
  2. We'll refer to nodes with the actual data as "Data Nodes" (e.g. those in the yellow box at the bottom), while nodes reserved for internal use by the Schema layer will be called "Schema Nodes" (e.g. those in the green box at the top)
  3. The only connections between the "Data" and the "Schema" layers are relationships named "SCHEMA" that link from a Data node to its Schema node
  4. Each "Data Node" may only have a single "Schema Node" Class that describes it. You may think of that Class name as the "type" of that Data Node.
  5. Schema nodes labeled "CLASS" (green) have relationships among themselves that exactly reflect the (permitted or expected) relationships among the data nodes of those Classes (in our example, "OWNED_BY")
  6. Nodes in the graph database that lack a "SCHEMA" relationship to a "CLASS" node will be unrecognized (ignored) by the Schema
  7. "Data Nodes" normally carry a label with the same name as their Schema Class; however, this label is redundant, kept only for convenience and for indexing. What determines inclusion in the Schema are the "SCHEMA" relationships, NOT the labels of the "Data Nodes"
  8. "Data Nodes" are free to contain any other label. (Remember, in graph databases such as Neo4j, nodes may have multiple labels, often used for indexing)
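
The design points above can be sketched with a small in-memory model of the Car/Person example. This is only an illustration in plain Python, not the NeoSchema API: the node ids, the `nodes`/`edges` structures, the helper `class_of`, and the use of a "name" attribute on Schema nodes are all assumptions made for the example, as is the extra, unregistered "Fiat" node.

```python
# Hypothetical in-memory model of the example; NOT the NeoSchema API.
nodes = {
    # Schema Nodes (green box)
    1: {"labels": {"CLASS"}, "props": {"name": "Car"}},
    2: {"labels": {"CLASS"}, "props": {"name": "Person"}},
    3: {"labels": {"PROPERTY"}, "props": {"name": "make"}},
    4: {"labels": {"PROPERTY"}, "props": {"name": "color"}},
    5: {"labels": {"PROPERTY"}, "props": {"name": "name"}},
    # Data Nodes (yellow box)
    10: {"labels": {"Car"}, "props": {"make": "Toyota", "color": "white"}},
    11: {"labels": {"Person"}, "props": {"name": "Julian"}},
    # Invented extra node: it has a "Car" label but no SCHEMA relationship
    12: {"labels": {"Car"}, "props": {"make": "Fiat"}},
}

# Each edge is (from_id, relationship name, to_id)
edges = [
    (1, "HAS_PROPERTY", 3),
    (1, "HAS_PROPERTY", 4),
    (2, "HAS_PROPERTY", 5),
    (1, "OWNED_BY", 2),      # permitted relationship, declared between Classes
    (10, "SCHEMA", 1),       # Data Node -> its Class
    (11, "SCHEMA", 2),
    (10, "OWNED_BY", 11),    # actual relationship between Data Nodes
]


def class_of(node_id):
    """Return the Class name of a Data Node, or None if the Schema ignores it."""
    targets = [to for (frm, rel, to) in edges
               if frm == node_id and rel == "SCHEMA"]
    if not targets:
        return None
    return nodes[targets[0]]["props"]["name"]
```

Note that `class_of` follows the "SCHEMA" relationship and never inspects labels, so node 12 is not recognized as a "Car" even though it carries that label.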


Freedom of Choice

Even though a main use of the Schema layer is to impose "discipline" (data conformance/integrity), nonetheless freedom of choice is a foundational underpinning of the NeoSchema library. For instance:

However, note that if you're also using the higher layers of the BrainAnnex technology stack, those layers assume the presence of a Schema layer; without one, you'll have to provide your own Data Manager, Web API, UI or whatever else you need. The lower layer, NeoAccess, won't be affected.


Services Provided by the Schema Layer


Technical Details

"CLASS" nodes:


"PROPERTY" nodes:
"LINK" nodes:
Data nodes:
Keywords used by the Schema layer:
Typical attributes stored on Property nodes (currently, as a service for the Schema clients, i.e. the higher layers, but NOT managed by the Schema)


Schema-Layer Relationships

SCHEMA

Used to connect Data Nodes to the Schema Classes they belong to.

Each Data Node should have exactly 1 such relationship, but typically a Class Node has many incoming SCHEMA relationships from Data Nodes.
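
As a rough illustration of this cardinality rule, the sketch below counts "SCHEMA" relationships in a toy edge list. It is plain Python rather than the NeoSchema API; the node names, the edge tuples and the function `schema_link_report` are hypothetical.

```python
from collections import Counter

# Hypothetical edge list: (from_node, relationship name, to_node)
edges = [
    ("car-1", "SCHEMA", "Car"),
    ("car-2", "SCHEMA", "Car"),
    ("person-1", "SCHEMA", "Person"),
    ("mixed-1", "SCHEMA", "Car"),       # this node breaks the rule:
    ("mixed-1", "SCHEMA", "Person"),    # two SCHEMA relationships
    ("car-1", "OWNED_BY", "person-1"),
]
node_ids = ["car-1", "car-2", "person-1", "mixed-1", "stray-1"]


def schema_link_report(node_ids, edges):
    """
    Return (unrecognized, conflicting, incoming_per_class):
      unrecognized       - nodes with no SCHEMA relationship
      conflicting        - nodes with more than one SCHEMA relationship
      incoming_per_class - number of SCHEMA relationships arriving at each Class
    """
    outgoing = Counter(frm for frm, rel, _ in edges if rel == "SCHEMA")
    incoming = Counter(to for _, rel, to in edges if rel == "SCHEMA")
    unrecognized = [n for n in node_ids if outgoing[n] == 0]
    conflicting = [n for n in node_ids if outgoing[n] > 1]
    return unrecognized, conflicting, dict(incoming)


unrecognized, conflicting, incoming_per_class = schema_link_report(node_ids, edges)
```

Here a Class may legitimately receive several incoming "SCHEMA" relationships, while any node with more than one outgoing "SCHEMA" relationship is flagged.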


HAS_PROPERTY

Used to connect Class Nodes to any number of Property Nodes.

Note that Property Nodes cannot be linked to more than 1 Class Node. If multiple Classes happen to have Properties with overlapping names, separate Property Nodes are used.
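
The "one Class per Property Node" rule can be pictured with the following sketch, written in plain Python under stated assumptions: the registry `property_nodes`, the helper `add_properties` and the "Pet" Class are invented for illustration and are not part of NeoSchema.

```python
import itertools

_next_id = itertools.count(1)
property_nodes = {}   # hypothetical registry: node id -> {"name", "class"}


def add_properties(class_name, property_names):
    """Create one NEW Property Node per name, each owned by a single Class."""
    created = []
    for name in property_names:
        node_id = next(_next_id)
        property_nodes[node_id] = {"name": name, "class": class_name}
        created.append(node_id)
    return created


person_ids = add_properties("Person", ["name"])
pet_ids = add_properties("Pet", ["name", "species"])   # "Pet" is an invented Class

# Both Classes have a Property called "name", stored as two distinct nodes
name_nodes = [node_id for node_id, p in property_nodes.items()
              if p["name"] == "name"]
```

Because every Property Node records exactly one owning Class, the overlapping name "name" yields two separate nodes rather than one shared node.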


INSTANCE_OF

The INSTANCE_OF relationship between classes offers a way to "factor out" common Properties that occur in multiple Classes.

For example, imagine that your Class "German Vocabulary" contains the Properties "German", "English", "Notes"; and, similarly, your Class "French Vocabulary" contains the Properties "French", "English", "Notes".

The "English" and "Notes" Properties aren't really specific to either German or French. They more logically belong to a new Class named "Foreign Vocabulary", of which "German Vocabulary" and "French Vocabulary" are instances. A clean ontology!

Note that Data Nodes of a particular Class are allowed to store Properties (aka fields or attributes) that are registered with an "ancestor" (based on the INSTANCE_OF relationships) of their Schema Class.

For example, if you have a Data Node in your database that is associated (by means of a "SCHEMA" relationship) with the Schema Class "French Vocabulary", then you can store in it values for "French", "English", and "Notes". That's typically more convenient than having to split the Data Node into two (one with the "French" value and one with the "English" and "Notes" values).
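
The lookup of Properties through "ancestor" Classes can be sketched as a simple traversal of the INSTANCE_OF relationships. The code below is a plain-Python illustration, not the NeoSchema implementation; the dictionaries `class_properties` and `instance_of` and the function `allowed_properties` are hypothetical names.

```python
# Properties registered directly with each Class (after "factoring out")
class_properties = {
    "Foreign Vocabulary": ["English", "Notes"],
    "German Vocabulary": ["German"],
    "French Vocabulary": ["French"],
}

# INSTANCE_OF relationships: Class -> list of parent Classes
instance_of = {
    "German Vocabulary": ["Foreign Vocabulary"],
    "French Vocabulary": ["Foreign Vocabulary"],
}


def allowed_properties(class_name):
    """Collect a Class's own Properties, then those of all its ancestors."""
    result = []
    seen = set()
    stack = [class_name]
    while stack:
        current = stack.pop()
        if current in seen:        # guard against revisiting a Class
            continue
        seen.add(current)
        for prop in class_properties.get(current, []):
            if prop not in result:
                result.append(prop)
        stack.extend(instance_of.get(current, []))
    return result


french_fields = allowed_properties("French Vocabulary")
```

A Data Node of Class "French Vocabulary" can then hold all three values in a single node, while "English" and "Notes" are declared only once, on "Foreign Vocabulary".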

Making operations convenient for the higher layers (i.e. the clients of the NeoAccess library) is a foundational design philosophy.



NOTE: as of version 5.0-Beta46 (Sep. 2024), this NeoSchema library is in a late Beta stage, soon to become a "release candidate".


Background Information: Using Schema in Graph Databases such as Neo4j

Reference Guide

Source code

Tutorial 1 : basic Schema operations (Classes, Properties, Data Nodes)
Tutorial 2 : set up a simple Schema (Classes, Properties) and perform a data import (Data Nodes and relationships among them)

Unit tests (with pytest): Main set; Extra tests, for data imports