FullTextIndexing Reference Guide
This guide is for versions Beta 34+
Source code
Background Information:
Full-Text Search with the Neo4j Graph Database
Indexing-related methods, for full-text searching.
NOTE: neither stemming nor lemmatizing is done.
Therefore, for best results, all word searches should be done on stems;
for example, search for "learn" rather than "learning" or "learns", to catch all 3 forms.
| name | arguments | returns |
|---|---|---|
| split_into_words | cls, text: str, to_lower_case=True, drop_html=True | [str] |
Lower-level function used in the larger context of indexing text that may contain HTML.
Given a string, optionally zap HTML tags and HTML entities (such as &ndash;),
then ditch punctuation from the given text;
finally, break it up into individual words, returned as a list. If requested, turn it all to lower case.
Care is taken to make sure that the stripping of special characters does NOT fuse words together;
e.g. avoid turning 'Long&ndash;Term' into a single word as 'LongTerm';
likewise, avoid turning 'One<br>Two' into a single word as 'OneTwo'
EXAMPLE. Given:
'<p>Mr. Joe&amp;sons<br>A Long&ndash;Term business! Find it at &gt; (http://example.com/home)<br>Visit Joe&#39;s &quot;NOW!&quot;</p>'
It will return (if to_lower_case is False):
['Mr', 'Joe', 'sons', 'A', 'Long', 'Term', 'business', 'Find', 'it', 'at', 'http', 'example', 'com', 'home', 'Visit', 'Joe', 's', 'NOW']
Note about numbers:
* negative signs are lost
* numbers with decimals will get split into two parts
:param text: A string with the text to parse
:param to_lower_case: If True, all text is converted to lower case
:param drop_html: Use True if passing HTML text
:return: A (possibly empty) list of words in the text,
free of punctuation, HTML and HTML entities such as &ndash;
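As a rough illustration of the behavior described above, the same steps can be sketched with regular expressions. This is not the library's implementation: the name `split_into_words_sketch` and both patterns are assumptions made for this example. The point it shows is that tags and entities are replaced by a blank rather than deleted, so that neighboring words are not fused.

```python
import re

def split_into_words_sketch(text, to_lower_case=True, drop_html=True):
    """Hypothetical stand-in for split_into_words(); the patterns are assumptions."""
    if drop_html:
        text = re.sub(r"<[^>]*>", " ", text)    # each HTML tag becomes a blank
        text = re.sub(r"&#?\w+;", " ", text)    # each HTML entity becomes a blank
    if to_lower_case:
        text = text.lower()
    # Runs of letters, digits and underscores are words; everything else separates them
    return re.findall(r"\w+", text)

print(split_into_words_sketch("One<br>Two Long&ndash;Term -3.14", to_lower_case=False))
# ['One', 'Two', 'Long', 'Term', '3', '14']
```

The last two items show the caveat about numbers: the negative sign is lost and the decimal number is split into two parts.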
|
||
| name | arguments | returns |
|---|---|---|
| extract_unique_good_words | cls, text :str, drop_html=True | Set[str] |
Higher-level function to prepare text for indexing;
use the drop_html flag if the text contains HTML.
From the given text, it returns the set of "acceptable", unique words.
It does the following:
1) zap punctuation
2) if requested, zap HTML tags and HTML entities (such as &ndash;)
3) turn into lower case
4) break up into individual words
5) strip off leading/trailing underscores
6) eliminate "words" that match at least one of these EXCLUSION tests:
* are just 1 or 2 characters long
* are numbers
* contain a digit anywhere (e.g. "50m" or "test2")
* are found in a list of common words
7) eliminate duplicates
Note: no stemming or other grammatical analysis is done.
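The steps listed above can be sketched as follows. This is only an approximation under stated assumptions: the function name, the regular expressions and the tiny `COMMON_WORDS` set are hypothetical stand-ins, and the real list of common words is not reproduced here.

```python
import re

# Hypothetical, deliberately tiny stand-in for the list of common words
COMMON_WORDS = {"the", "and", "for", "with", "that", "this"}

def extract_unique_good_words_sketch(text, drop_html=True):
    """Illustrative approximation of the steps listed above."""
    if drop_html:
        text = re.sub(r"<[^>]*>", " ", text)    # HTML tags become blanks
        text = re.sub(r"&#?\w+;", " ", text)    # HTML entities become blanks
    # Lower-case, then split on anything that is not a letter, digit or underscore
    words = re.findall(r"\w+", text.lower())
    good_words = set()                          # a set eliminates duplicates
    for word in words:
        word = word.strip("_")                  # drop leading/trailing underscores
        if len(word) <= 2:
            continue                            # just 1 or 2 characters long
        if any(ch.isdigit() for ch in word):
            continue                            # numbers, or words containing a digit
        if word in COMMON_WORDS:
            continue                            # common word
        good_words.add(word)
    return good_words

sample = "<p>The _learning_ curve for test2 is 50m long; learning pays!</p>"
print(sorted(extract_unique_good_words_sketch(sample)))
# ['curve', 'learning', 'long', 'pays']
```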
EXAMPLE - given
'
|
||
| name | arguments | returns |
|---|---|---|
| initialize_schema | cls, content_item_class_name="Content Item" | None |
Initialize the graph-database Schema used by this Indexer module:
1) It will create a new "Word" Class linked to a new "Indexer" Class,
by means of an outbound "occurs" relationship.
The newly-created "Word" Class will be given one Property: "name".
2) It will add a relationship named "has_index" from an existing (or newly created)
"Content Item" Class to the new "Indexer" Class.
NOTE: if an existing Class with the name specified by the argument `content_item_class_name`
is not found, it will be created with some default values
:param content_item_class_name: (OPTIONAL) The name of the Schema Class for Content Items,
i.e. the Class of the Data Items to be indexed;
if not found, it gets created.
EXAMPLES: "Documents", "Notes", "Content Item" (default)
:return: None
|
||
| name | arguments | returns |
|---|---|---|
| new_indexing | cls, internal_id :int, unique_words :Set[str], to_lower_case=True | None |
Used to create a new index in the database
for the (single) specified data node that represents a "Content Item".
The indexing will link that Content Item to the given set of unique words.
An Exception is raised if the "Indexer" node already exists.
Details:
1) Create a data node of type "Indexer",
with inbound relationships named "occurs" from "Word" data nodes
(pre-existing or newly-created as needed)
for all the words in the given set
2) create a relationship named "has_index" from an existing "Content Item" data node
to the new "Indexer" node
:param internal_id: The internal database ID of an existing data node
that represents a "Content Item" (not necessarily with that Schema name)
:param unique_words: A set of strings containing unique words "worthy" of indexing
- for example as returned by extract_unique_good_words()
:param to_lower_case: If True, all text is converted to lower case
:return: None
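To make the node-and-relationship layout concrete, here is a toy in-memory model of what the call is described as doing. It is a sketch only: the class `TinyIndexGraph` and its attributes are invented for illustration and stand in for the graph database, which is not touched.

```python
class TinyIndexGraph:
    """Toy in-memory stand-in for the graph database (an assumption for illustration)."""

    def __init__(self):
        self.next_id = 1
        self.word_ids = {}     # word -> ID of its "Word" node
        self.has_index = {}    # Content Item ID -> ID of its "Indexer" node
        self.occurs = {}       # Indexer ID -> IDs of the "Word" nodes linked to it

    def _new_node(self):
        node_id = self.next_id
        self.next_id += 1
        return node_id

    def new_indexing(self, internal_id, unique_words, to_lower_case=True):
        if internal_id in self.has_index:
            raise Exception(f"Content Item {internal_id} already has an Indexer node")
        indexer_id = self._new_node()           # 1) create the "Indexer" node
        self.occurs[indexer_id] = set()
        for word in unique_words:
            if to_lower_case:
                word = word.lower()
            if word not in self.word_ids:       # reuse a pre-existing "Word" node if any
                self.word_ids[word] = self._new_node()
            self.occurs[indexer_id].add(self.word_ids[word])   # "occurs" link
        self.has_index[internal_id] = indexer_id               # 2) "has_index" link

graph = TinyIndexGraph()
graph.new_indexing(100, {"Graph", "database"})
graph.new_indexing(200, {"graph", "search"})
print(len(graph.word_ids))   # 3
```

The two Content Items share a single "Word" node for "graph", which is why only three words are stored in total.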
|
||
| name | arguments | returns |
|---|---|---|
| add_words_to_index | cls, indexer_id :int, unique_words :Set[str], to_lower_case=True | int |
Add "Word" nodes to the database for all the given words, unless already present.
Then link all the "Word" nodes, both the found and the newly-created ones,
to the passed "Indexer" node with "occurs" relationships.
:param indexer_id: Internal database ID of an existing "Indexer" data node
used to hold all the "occurs" relationships
to the various Word nodes.
If not present, an Exception gets raised.
:param unique_words: Set of strings, with unique words for the index
:param to_lower_case: If True, all text is converted to lower case
:return: The number of new "Word" data nodes that were created
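The return value counts only the words that were not already stored, not the number of links made. A minimal sketch of that bookkeeping, using plain Python sets in place of database nodes (the function name and both containers are assumptions for this example):

```python
def add_words_to_index_sketch(word_nodes, occurs, indexer_id, unique_words,
                              to_lower_case=True):
    """
    word_nodes: set of words that already have a "Word" node (hypothetical container)
    occurs:     dict mapping each Indexer ID to the set of words linked to it
    """
    if indexer_id not in occurs:
        raise Exception(f"No Indexer node with ID {indexer_id}")
    created = 0
    for word in unique_words:
        if to_lower_case:
            word = word.lower()
        if word not in word_nodes:
            word_nodes.add(word)         # a new "Word" node is created
            created += 1
        occurs[indexer_id].add(word)     # found or new, the word gets linked
    return created

word_nodes = {"graph"}
occurs = {7: set()}
print(add_words_to_index_sketch(word_nodes, occurs, 7, {"Graph", "Search"}))   # 1
```

Only "search" is new here; "Graph" is lower-cased and matches the existing "graph" node, yet both words end up linked to the Indexer.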
|
||
| name | arguments | returns |
|---|---|---|
| update_indexing | cls, content_uri :int, unique_words :Set[str], to_lower_case=True | None |
Used to update an index, linking the given set of unique words
to the specified "Indexer" data node, which was created by a call to new_indexing()
at the time the index was first created.
From the given data node of type "Indexer",
add inbound relationships named "occurs" from "Word" data nodes (pre-existing or newly-created)
for all the words in the given set.
Also, create a relationship named "has_index" from an existing "Content Item" data node to the new "Indexer" node.
Note: if no index exists, an Exception is raised
:param content_uri: The internal database ID of an existing "Content Item" data node
:param unique_words: A set of strings containing unique words
- for example as returned by extract_unique_good_words()
:param to_lower_case: If True, all text is converted to lower case
:return: None
|
||
| name | arguments | returns |
|---|---|---|
| get_indexer_node_id | cls, internal_id :int | Union[int, None] |
Retrieve and return the internal database ID of the "Indexer" data node
associated with the given Content Item data node.
If not found, None is returned.
:param internal_id: The internal database ID of an existing data node
(either of Class "Content Item", or of a Class that is ultimately
an INSTANCE_OF a "Content Item" Class)
:return: The internal database ID of the corresponding "Indexer" data node.
If not found, None is returned
|
||
| name | arguments | returns |
|---|---|---|
| remove_indexing | cls, content_uri :int | None |
Drop the "Indexer" node linked to the given Content Item node.
If no index exists, an Exception is raised
:param content_uri: The internal database ID of an existing "Content Item" data node
:return: None
|
||
| name | arguments | returns |
|---|---|---|
| number_of_indexed_words | cls, internal_id=None, uri=None | int |
Determine and return the number of words attached to the index
of the given data node (typically of a Class representing "Content Item",
or instance thereof, such as "Document" or "Note")
:param internal_id: The internal database ID of an existing Content Item data node
:param uri: Alternate way to specify the Content Item data node, with a string URI
:return: The number of indexed words associated to the above node
|
||
| name | arguments | returns |
|---|---|---|
| search_word | cls, word :str, all_properties=False | Union[List[int], List[dict]] |
Look up in the index any stored words that contain the requested string
(ignoring case and leading/trailing blanks).
Then locate the Content nodes that are indexed by any of those words.
Return a (possibly empty) list of either the internal database ID's of all the found nodes,
or a list of their full attributes.
:param word: A string, typically containing a word or word fragment;
case is ignored, and so are leading/trailing blanks
:param all_properties: If True, the properties of the located nodes are returned
alongside their internal database ID's.
Default is False: only return the internal database ID's
:return: If all_properties is False,
a (possibly empty) list of the internal database ID's
of all the found nodes
If all_properties is True,
a (possibly empty) list of dictionaries with all the data
of all the found nodes; each dict contains all of the node's attributes,
plus keys called 'internal_id' and 'neo4j_labels'
EXAMPLE: [{'filename': 'My_Document.pdf', 'internal_id': 66, 'neo4j_labels': ['Content Item']}]
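Because the lookup is a substring match on the stored words, searching for a stem such as "learn" also finds items indexed under "learning" or "learns". The sketch below mimics that matching rule with plain dictionaries; the function name, the `index` and `nodes` structures, and the second file name are assumptions for illustration, not the library's internals.

```python
def search_word_sketch(word, index, nodes, all_properties=False):
    """
    index: dict mapping each stored word to the set of Content Item IDs it indexes
    nodes: dict mapping each Content Item ID to a dict of its attributes
    (both structures are hypothetical stand-ins for the database)
    """
    fragment = word.strip().lower()      # ignore case and leading/trailing blanks
    found_ids = set()
    for stored_word, content_ids in index.items():
        if fragment in stored_word.lower():     # substring match on the stored word
            found_ids.update(content_ids)
    ids = sorted(found_ids)
    if not all_properties:
        return ids
    # Attach the ID to a copy of each node's attributes
    return [{**nodes[i], "internal_id": i} for i in ids]

index = {"learning": {66}, "learns": {67}, "teacher": {66}}
nodes = {
    66: {"filename": "My_Document.pdf", "neo4j_labels": ["Content Item"]},
    67: {"filename": "Notes.txt", "neo4j_labels": ["Content Item"]},
}
print(search_word_sketch("  Learn ", index, nodes))   # [66, 67]
```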
|
||