NeoSchema - User Guide
IMPORTANT – The reader is assumed to have already read the following article about motivations and overview:
Using Schema in Graph Databases such as Neo4j
Why use a Schema Layer?
"The marriage of the flexibility of Graph Databases and the discipline of Relational Databases"
- Data integrity
- Data filtering upon import
- Assist the higher layers in the technology stack, and in particular the User Interface
- Self-documentation of the database
- Graft onto graph databases some of the semantic functionality that some people turn to RDF for, while carving out a new path
rather than attempting to emulate RDF
A Schema layer, as used in the BrainAnnex open-source project, is a software library (called "NeoSchema") that sits between the data and the higher layers.
It's optional, and may be used in a "strict" manner (as an enforcer) or in a "lax/loose" manner (self-documentation of the data, and assistance to the UI, but no enforcement).
In essence, a Schema represents what is either expected, or permitted, in our database.
Example
Let's jump into a simple example of some data nodes, and their corresponding Schema nodes:
In this example, our data consists of just 2 records, each stored as a graph-database node, in the yellow box at the bottom.
The two records represent, respectively, two entities named "Car" (pale-blue circle) and "Person" (violet circle).
The "Car" entity is expected to have (up to) 2 properties (aka attributes, or fields): "make" and "color".
The "Person" entity is expected to have a single property, "name".
Data summary: a white Toyota is owned by someone named Julian.
The Schema layer (green box at the top) just encapsulates the state of affairs described in the figure's caption, above. Several design specifications can be immediately observed:
- The Schema layer makes use of nodes labeled "CLASS", "PROPERTY" or (less used and not shown in above example) "LINK"
- We'll refer to nodes with the actual data as "Data Nodes" (e.g. those in the yellow box at the bottom),
while nodes reserved for internal use by the Schema layer will be called "Schema Nodes" (e.g. those in the green box at the top)
- The only connections between the "Data" and the "Schema" layers are relationships named "SCHEMA", each linking a Data node to its Schema node
- Each "Data Node" may only have a single "Schema Node" Class that describes it. You may think of that Class name as the "type" of that Data Node.
- Schema nodes labeled "CLASS" (green) have relationships among themselves that exactly reflect the (permitted or expected) relationships among the data nodes of those Classes (in our example, "OWNED_BY")
- Nodes in the graph database that lack a "SCHEMA" relationship to a "CLASS" node will be un-recognized (ignored) by the Schema
- "Data Nodes" normally carry a label with the same name as their Schema Class; however, this label is redundant, kept for convenience and for indexing. What determines inclusion in the Schema
is the "SCHEMA" relationship, NOT the labels of the "Data Nodes"
- "Data Nodes" are free to contain any other label. (Remember, in graph databases such as Neo4j, nodes may have multiple labels, often used for indexing)
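The design points above can be sketched in plain Python, without any database. This is only an illustrative in-memory model, not the NeoSchema API: the node ids, the dictionary layout, the extra un-managed node and the helper `schema_class_of` are all assumptions made for this example.

```python
# Illustrative in-memory model of the example; NOT the NeoSchema API.
# Node ids and the dict layout are assumptions made for this sketch.

nodes = {
    # Schema nodes
    1: {"labels": {"CLASS"}, "props": {"name": "Car"}},
    2: {"labels": {"CLASS"}, "props": {"name": "Person"}},
    3: {"labels": {"PROPERTY"}, "props": {"name": "make"}},
    4: {"labels": {"PROPERTY"}, "props": {"name": "color"}},
    5: {"labels": {"PROPERTY"}, "props": {"name": "name"}},
    # Data nodes
    10: {"labels": {"Car"}, "props": {"make": "Toyota", "color": "white"}},
    11: {"labels": {"Person"}, "props": {"name": "Julian"}},
    # Hypothetical extra node with a "Car" label but no "SCHEMA" relationship
    12: {"labels": {"Car"}, "props": {"make": "Fiat"}},
}

# Relationships stored as (start node id, relationship name, end node id)
relationships = [
    (1, "HAS_PROPERTY", 3),
    (1, "HAS_PROPERTY", 4),
    (2, "HAS_PROPERTY", 5),
    (1, "OWNED_BY", 2),      # Schema level: a Car may be owned by a Person
    (10, "SCHEMA", 1),       # Data node -> its Schema Class
    (11, "SCHEMA", 2),
    (10, "OWNED_BY", 11),    # Data level: mirrors the Schema relationship
]


def schema_class_of(node_id):
    """Return the Class name of a Data node, or None if the node is un-managed.

    Only the "SCHEMA" relationship is consulted; the node's labels are ignored.
    """
    for start, name, end in relationships:
        if start == node_id and name == "SCHEMA" and "CLASS" in nodes[end]["labels"]:
            return nodes[end]["props"]["name"]
    return None
```

In this model, `schema_class_of(10)` gives "Car", while `schema_class_of(12)` gives `None`: node 12 carries the "Car" label but has no "SCHEMA" relationship, so the Schema does not recognize it.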
Freedom of Choice
Even though a main use of the Schema layer is to impose "discipline" (data conformance/integrity), nonetheless freedom of choice is a foundational underpinning of the NeoSchema library. For instance:
- The database may opt to keep some or all nodes out of the control of the Schema layer. Anything that lacks a "SCHEMA" relationship to a "CLASS" node is regarded as an "un-managed" node
(with the extensive freedom that is typical in graph databases)
- Want to jettison the Schema layer? Just delete all its nodes ("CLASS", "PROPERTY", "LINK"), and your data will remain un-affected – simply no longer managed by a Schema layer
However, note that if you're also using the higher layers of the BrainAnnex technology stack, those layers assume the presence of a Schema layer; you'll have to provide your own Data Manager, Web API, UI or whatever else
you need. The lower layer, NeoAccess, won't be affected.
Services Provided by the Schema Layer
- Methods to define and edit the Schema
- Methods to create, retrieve and delete Data nodes
- A variety of data-import methods, in particular to import Pandas data frames (typically created by reading in CSV files) and to import JSON files (with arbitrarily complex structures)
- Optional data-integrity checks during import
- A method to export the entire Schema (as a JSON file)
- Advise the UI about data types, and other hints for the front-end display
- Management of optional URIs (to be precise, parts of the URIs, more technically called "tokens"), with auto-increments in user-specified namespaces
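As a rough illustration of the last item, auto-incrementing tokens within separate namespaces can be sketched as follows. The class `NamespaceCounter`, its method `next_token`, the namespace names and the prefix format are hypothetical and chosen only for this example; they are not the NeoSchema API, and the library's actual token format may differ.

```python
from collections import defaultdict


class NamespaceCounter:
    """Hypothetical helper (not part of NeoSchema): hands out auto-incremented
    tokens, keeping an independent counter for each namespace."""

    def __init__(self):
        self._last = defaultdict(int)   # namespace -> last number issued

    def next_token(self, namespace, prefix=""):
        """Advance the counter of the given namespace and return the new token."""
        self._last[namespace] += 1
        return f"{prefix}{self._last[namespace]}"


counter = NamespaceCounter()
first_car = counter.next_token("cars", prefix="car-")
second_car = counter.next_token("cars", prefix="car-")
first_person = counter.next_token("people", prefix="p-")
```

Each namespace advances on its own: the two "cars" tokens are numbered 1 and 2, while the first "people" token starts again from 1.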
Technical Details
"CLASS" nodes:
- "Class" nodes capture the abstraction of entities that share similarities.
Example: "car", "star", "protein", "patient"
In RDFS lingo, a "Class" node is the counterpart of a resource (entity) whose "rdf:type" property has the value "rdfs:Class"
- A Class can be of the "S" (Strict) or "L" (Lenient) type.
A "lenient" Class will accept data nodes with any properties, whether declared in the Schema or not;
by contrast, a "strict" Class will prevent data nodes from containing properties not declared in the Schema
- Class nodes may be defined not to accept Data nodes attached to them. Typically used for organizing the ontology of the Schema.
- Class nodes have relationships among themselves that exactly reflect the (permitted or expected) relationships among the data nodes of those Classes
- A special relationship named "INSTANCE_OF" may exist between any two Class nodes. The "child" Class (the one that is an "instance of" another Class) is regarded as possessing the Properties of the other Class,
in addition to its own
"PROPERTY" nodes:
- Each "Property" node is linked to exactly one "Class" node, by means of a relationship named "HAS_PROPERTY", in the direction from the Class to the Property node
- The "Property" nodes linked to a given "Class" node represent the attributes (fields) of the Data nodes of that Class
- If the Class is "lax" (not strict) then the Data nodes may also have other attributes not declared in the Schema
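The difference between strict and lenient (lax) Classes can be sketched as a simple check of a Data node's properties against the Properties declared for its Class. The function names and the sample "year" property are assumptions made for this illustration; this is not how NeoSchema itself is implemented.

```python
def undeclared_properties(data_props, declared):
    """Return the sorted names of the properties that are not declared in the Schema."""
    return sorted(set(data_props) - set(declared))


def accepts(data_props, declared, strict):
    """Return True if a Data node with the given properties is acceptable.

    A lenient Class accepts any properties; a strict Class rejects a node
    that contains any property not declared in the Schema.
    """
    if not strict:
        return True
    return not undeclared_properties(data_props, declared)


declared = ["make", "color"]   # Properties of the "Car" Class
candidate = {"make": "Toyota", "color": "white", "year": 2020}   # "year" is undeclared
```

With these values, `accepts(candidate, declared, strict=False)` is `True`, while `accepts(candidate, declared, strict=True)` is `False` because of the undeclared "year". A node with only "make" passes in both modes, since declared Properties are allowed to be absent.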
"LINK" nodes:
- If any relationship between Class nodes is meant to have properties, that relationship is split into two parts,
with an intervening node labeled "LINK" [newly-added feature not fully implemented]
Data nodes:
- Nodes in the graph database are regarded as "Data nodes" if they possess a relationship called "SCHEMA" pointing to a Class node
- Typically, Data nodes contain a database label with the same name as the schema Class they are attached to, in addition to any other label they might contain;
however, that's NOT enforced by the Schema layer, and it's completely optional, though recommended
- Labels in Data nodes are ignored by the Schema Layer – however, they are automatically created during imports
Keywords used by the Schema layer:
- For node labels: "CLASS", "PROPERTY", "LINK"
- For Class node attributes: "name", "uri", "code", "strict", "no_datanodes"
- For Property and Link node attributes: "name", "uri"
- For relationship names: "SCHEMA", "HAS_PROPERTY", "INSTANCE_OF"
Typical attributes stored on Property nodes (currently, as a service for the Schema clients, i.e. the higher layers, but NOT managed by the Schema):
- "dtype" (taking values such as "int", "float", "categorical" – for now, at the discretion of the Schema clients)
- "allowed" (only applicable when "dtype" is "categorical"; example, for German articles: ["der","die","das"])
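Since these attributes are left to the Schema clients, a client might use them along the following lines. This is a hypothetical client-side check, not something NeoSchema performs: the function `value_allowed`, the Property name "article" and the dictionary layout are assumptions made for this sketch.

```python
def value_allowed(prop_node, value):
    """Check a value against the "dtype" and "allowed" hints of a Property node.

    Only "categorical" Properties are checked in this sketch; any other
    dtype is accepted without further validation.
    """
    if prop_node.get("dtype") == "categorical":
        return value in prop_node.get("allowed", [])
    return True


# Attributes of a Property node, represented here as a plain dictionary
article = {"name": "article", "dtype": "categorical", "allowed": ["der", "die", "das"]}
```

Here `value_allowed(article, "die")` is `True`, while `value_allowed(article, "le")` is `False`.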
Schema-Layer Relationships
SCHEMA
Used to connect Data Nodes to the Schema Classes that they belong to.
Each Data Node should have exactly 1 such relationship, but typically a Class Node has many incoming SCHEMA relationships from Data Nodes.
HAS_PROPERTY
Used to connect Class Nodes to any number of Property Nodes.
Note that Property Nodes cannot be linked to more than 1 Class Node. If multiple Classes happen to have Properties with overlapping names, separate Property Nodes are used.
INSTANCE_OF
The INSTANCE_OF relationship between classes offers a way to "factor out" common Properties that occur in multiple Classes.
For example, imagine that your Class "German Vocabulary" contains the Properties "German", "English", "Notes"; and, similarly, your Class
"French Vocabulary" contains the Properties "French", "English", "Notes".
The "English" and "Notes" Properties aren't really specific to either German or French. They more logically belong to a new Class named "Foreign Vocabulary",
of which "German Vocabulary" and "French Vocabulary" are instances. A clean ontology!
Note that Data Nodes of a particular Class are allowed to store Properties (aka fields or attributes) that are registered with an "ancestor" (based on the INSTANCE_OF relationship) of their Schema Class.
For example, if you have a Data Node in your database that is associated (by means of a "SCHEMA" relationship) to the Schema Class "French Vocabulary", then you can store in it
values for "French", "English", and "Notes". That's typically more convenient than having to split the Data Node into two (one with the "French" value and one with the "English" and "Notes" values).
Making operations convenient for the higher layers (i.e. the clients of the NeoAccess library) is a foundational design philosophy.
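The vocabulary example can be sketched as follows, to show how the Properties available to a Data Node are collected by following INSTANCE_OF from its Class up through its ancestors. The dictionaries and the function `all_properties` are illustrative assumptions, not the NeoSchema API; the sketch also assumes that each Class has at most one parent and that there are no cycles.

```python
# Class name -> Properties declared directly on that Class
own_properties = {
    "Foreign Vocabulary": ["English", "Notes"],
    "German Vocabulary": ["German"],
    "French Vocabulary": ["French"],
}

# "INSTANCE_OF" relationships: child Class -> parent Class
instance_of = {
    "German Vocabulary": "Foreign Vocabulary",
    "French Vocabulary": "Foreign Vocabulary",
}


def all_properties(class_name):
    """Return the Properties of a Class, followed by those of its ancestors."""
    props = []
    current = class_name
    while current is not None:
        props.extend(own_properties.get(current, []))
        current = instance_of.get(current)   # None once the top is reached
    return props
```

In this sketch, `all_properties("French Vocabulary")` returns `["French", "English", "Notes"]`, matching the three values that a "French Vocabulary" Data Node may store.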
NOTE: as of version 5.0-Beta46 (Sep. 2024), this NeoSchema library is in a late Beta stage, soon to become a "release candidate".