FullTextIndexing Reference Guide
This guide is for versions Beta 34+
Source code
Background Information:
Full-Text Search with the Neo4j Graph Database
Indexing-related methods for full-text searching. NOTE: neither stemming nor lemmatizing is done. Therefore, for best results, all word searches should be done on stems; for example, search for "learn" rather than "learning" or "learns", to catch all 3.
name | arguments | returns |
---|---|---|
split_into_words | cls, text: str, to_lower_case=True, drop_html=True | [str] |
Lower-level function used in the larger context of indexing text that may contain HTML. Given a string, optionally remove HTML tags and HTML entities (such as &ndash;), then remove punctuation; finally, break the text up into individual words, returned as a list. If requested, turn everything to lower case. Care is taken to make sure that the stripping of special characters does NOT fuse words together; e.g. avoid turning 'Long&ndash;Term' into the single word 'LongTerm'; likewise, avoid turning 'One<br>Two' into the single word 'OneTwo'. EXAMPLE. Given: '<p>Mr. Joe&sons<br>A Long&ndash;Term business! Find it at > (http://example.com/home)<br>Visit Joe's "NOW!"</p>' it will return (if to_lower_case is False): ['Mr', 'Joe', 'sons', 'A', 'Long', 'Term', 'business', 'Find', 'it', 'at', 'http', 'example', 'com', 'home', 'Visit', 'Joe', 's', 'NOW']. Note about numbers: * negative signs are lost * numbers with decimals will get split into two parts. :param text: A string with the text to parse :param to_lower_case: If True, all text is converted to lower case :param drop_html: Use True if passing HTML text :return: A (possibly empty) list of the words in the text, free of punctuation, HTML and HTML entities such as &ndash; |
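
The behavior described above can be approximated with a few regular expressions. The following is only a minimal sketch under assumptions, not the library's actual implementation: the function name `split_into_words_sketch` and the specific regex patterns are illustrative choices. It shows why tags and entities are replaced with a blank rather than simply deleted, and why negative signs and decimal points are lost.

```python
import re
from typing import List


def split_into_words_sketch(text: str, to_lower_case: bool = True,
                            drop_html: bool = True) -> List[str]:
    """Illustrative stand-in for split_into_words(); NOT the library's code."""
    if drop_html:
        # Replace each tag with a blank (not an empty string),
        # so that 'One<br>Two' does not get fused into 'OneTwo'
        text = re.sub(r"<[^>]*>", " ", text)
        # Likewise replace HTML entities, such as &ndash; or &#39;, with a blank
        text = re.sub(r"&#?\w+;", " ", text)
    if to_lower_case:
        text = text.lower()
    # Anything that is not a letter, digit or underscore acts as a separator;
    # this is why "-3.14" loses its sign and splits into two parts
    return re.findall(r"\w+", text)


print(split_into_words_sketch("One<br>Two", to_lower_case=False))   # ['One', 'Two']
print(split_into_words_sketch("-3.14"))                             # ['3', '14']
```
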
name | arguments | returns |
---|---|---|
extract_unique_good_words | cls, text :str, drop_html=True | Set[str] |
Higher-level function to prepare text for indexing; use the drop_html flag if the text contains HTML. From the given text, it returns the set of "acceptable", unique words. It does the following: 1) remove punctuation; 2) if requested, remove HTML tags and HTML entities (such as &ndash;); 3) turn into lower case; 4) break up into individual words; 5) strip off leading/trailing underscores; 6) eliminate "words" that match at least one of these EXCLUSION tests: * are just 1 or 2 characters long * are numbers * contain a digit anywhere (e.g. "50m" or "test2") * are found in a list of common words; 7) eliminate duplicates. Note: no stemming or other grammatical analysis is done. EXAMPLE - given ' |
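
The numbered steps above can be sketched as a small filter. This is a hedged illustration, not the library's code: the function name `extract_unique_good_words_sketch` is hypothetical, and `COMMON_WORDS` is a tiny made-up stand-in for the library's list of common words, which is not reproduced in this guide.

```python
import re
from typing import Set

# Hypothetical, deliberately tiny list; the actual list of common words
# used by the library is not shown in this guide
COMMON_WORDS = {"the", "and", "for", "with", "that", "this"}


def extract_unique_good_words_sketch(text: str, drop_html: bool = True) -> Set[str]:
    """Illustrative stand-in for extract_unique_good_words(); NOT the library's code."""
    if drop_html:
        text = re.sub(r"<[^>]*>", " ", text)    # HTML tags -> blank
        text = re.sub(r"&#?\w+;", " ", text)    # HTML entities -> blank
    # Lower-case, then split on anything that is not a letter, digit or underscore
    words = re.findall(r"\w+", text.lower())

    good_words = set()                          # A set eliminates duplicates
    for word in words:
        word = word.strip("_")                  # Drop leading/trailing underscores
        if len(word) <= 2:
            continue                            # Just 1 or 2 characters long
        if any(ch.isdigit() for ch in word):
            continue                            # Numbers, or digits anywhere ("50m", "test2")
        if word in COMMON_WORDS:
            continue                            # Common word
        good_words.add(word)
    return good_words


result = extract_unique_good_words_sketch(
    "<p>The _Graph_ database, and the graph index: test2, 50m, 2024, of it!</p>")
print(sorted(result))   # ['database', 'graph', 'index']
```
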
name | arguments | returns |
---|---|---|
initialize_schema | cls, content_item_class_name="Content Item" | None |
Initialize the graph-database Schema used by this Indexer module: 1) It will create a new "Word" Class linked to a new "Indexer" Class, by means of an outbound "occurs" relationship. The newly-created "Word" Class will be given one Property: "name". 2) It will add a relationship named "has_index" from an existing (or newly created) "Content Item" Class to the new "Indexer" Class. NOTE: if an existing Class with the name specified by the argument `content_item_class_name` is not found, it will be created with some default values. :param content_item_class_name: (OPTIONAL) The name of the Schema Class for Content Items, i.e. the Class of the Data Items to be indexed; if not found, it gets created. EXAMPLES: "Documents", "Notes", "Content Items" (default) :return: None |
name | arguments | returns |
---|---|---|
new_indexing | cls, internal_id :int, unique_words :Set[str], to_lower_case=True | None |
Used to create a new index in the database for the (single) specified data node that represents a "Content Item". The indexing will link that Content Item to the given set of unique words. An Exception is raised if the "Indexer" node already exists. Details: 1) Create a data node of type "Indexer", with inbound relationships named "occurs" from "Word" data nodes (pre-existing or newly-created as needed) for all the words in the given set; 2) create a relationship named "has_index" from an existing "Content Item" data node to the new "Indexer" node. :param internal_id: The internal database ID of an existing data node that represents a "Content Item" (not necessarily with that Schema name) :param unique_words: A set of strings containing unique words "worthy" of indexing - for example as returned by extract_unique_good_words() :param to_lower_case: If True, all text is converted to lower case :return: None |
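
To make the resulting graph structure easier to picture, here is a toy in-memory model of it. Everything in this sketch is an assumption made for illustration: the class `ToyIndex`, its dictionaries, and the ID numbering are hypothetical, and no database is involved. It mirrors only the relationships described above: "Content Item" to "Indexer" by "has_index", and "Word" to "Indexer" by "occurs".

```python
from typing import Dict, Set


class ToyIndex:
    """In-memory stand-in for the graph structure; hypothetical, NOT the library's code."""

    def __init__(self) -> None:
        self.has_index: Dict[int, int] = {}      # Content Item ID -> Indexer ID
        self.occurs: Dict[str, Set[int]] = {}    # Word name -> IDs of linked Indexers
        self._next_indexer_id = 1000             # Arbitrary starting ID for the sketch

    def new_indexing(self, internal_id: int, unique_words: Set[str],
                     to_lower_case: bool = True) -> None:
        if internal_id in self.has_index:
            # Mirrors the documented behavior: an existing "Indexer" is an error
            raise Exception(f"An index already exists for Content Item {internal_id}")

        indexer_id = self._next_indexer_id       # "Create" the Indexer node
        self._next_indexer_id += 1
        self.has_index[internal_id] = indexer_id

        for word in unique_words:
            if to_lower_case:
                word = word.lower()
            # Reuse the "Word" entry if present, or create it as needed
            self.occurs.setdefault(word, set()).add(indexer_id)


idx = ToyIndex()
idx.new_indexing(internal_id=1, unique_words={"Graph", "database"})
print(idx.has_index)    # {1: 1000}
```

Calling `new_indexing()` a second time with the same `internal_id` raises an Exception in this sketch, matching the behavior documented above.
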
name | arguments | returns |
---|---|---|
add_words_to_index | cls, indexer_id :int, unique_words :Set[str], to_lower_case=True | int |
Add to the database "Word" nodes for all the given words, unless already present. Then link all the "Word" nodes, both the found and the newly-created ones, to the passed "Indexer" node by means of "occurs" relationships. :param indexer_id: Internal database ID of an existing "Indexer" data node used to hold all the "occurs" relationships to the various Word nodes. If not present, an Exception gets raised. :param unique_words: Set of strings, with unique words for the index :param to_lower_case: If True, all text is converted to lower case :return: The number of new "Word" data nodes that were created |
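
The return value counts only the "Word" nodes that had to be created, not the total number of links made. A minimal sketch of that bookkeeping follows, using plain Python sets in place of the database; the function name and its first two parameters are hypothetical and exist only for this illustration.

```python
from typing import Set


def add_words_to_index_sketch(existing_words: Set[str], linked_words: Set[str],
                              unique_words: Set[str],
                              to_lower_case: bool = True) -> int:
    """
    Illustrative stand-in for add_words_to_index(); NOT the library's code.

    existing_words: names of all "Word" nodes already present (modified in place)
    linked_words:   names of the words linked to one "Indexer" (modified in place)
    """
    if to_lower_case:
        unique_words = {w.lower() for w in unique_words}

    new_words = unique_words - existing_words   # Words that need a new "Word" node
    existing_words |= new_words                 # "Create" the missing nodes
    linked_words |= unique_words                # Link both found and new words
    return len(new_words)


all_words = {"graph", "database"}
indexer_links: Set[str] = set()
n_created = add_words_to_index_sketch(all_words, indexer_links, {"Graph", "index"})
print(n_created)               # 1
print(sorted(indexer_links))   # ['graph', 'index']
```
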
name | arguments | returns |
---|---|---|
update_indexing | cls, content_uri :int, unique_words :Set[str], to_lower_case=True | None |
Used to update an index, linking the given set of unique words to the specified "Indexer" data node, which was created by a call to new_indexing() at the time the index was first created. From the given data node of type "Indexer", add inbound relationships named "occurs" from "Word" data nodes (pre-existing or newly-created) for all the words in the given set. Also, create a relationship named "has_index" from an existing "Content Item" data node to the new "Indexer" node. Note: if no index exists, an Exception is raised. :param content_uri: The internal database ID of an existing "Content Item" data node :param unique_words: A set of strings containing unique words - for example as returned by extract_unique_good_words() :param to_lower_case: If True, all text is converted to lower case :return: None |
name | arguments | returns |
---|---|---|
get_indexer_node_id | cls, internal_id :int | Union[int, None] |
Retrieve and return the internal database ID of the "Indexer" data node associated with the given Content Item data node. If not found, None is returned. :param internal_id: The internal database ID of an existing data node (either of Class "Content Item", or of a Class that is ultimately an INSTANCE_OF a "Content Item" Class) :return: The internal database ID of the corresponding "Indexer" data node. If not found, None is returned |
name | arguments | returns |
---|---|---|
remove_indexing | cls, content_uri :int | None |
Drop the "Indexer" node linked to the given Content Item node. If no index exists, an Exception is raised :param content_uri: The internal database ID of an existing "Content Item" data node :return: None |
name | arguments | returns |
---|---|---|
number_of_indexed_words | cls, internal_id=None, uri=None | int |
Determine and return the number of words attached to the index of the given data node (typically of a Class representing "Content Item", or an instance thereof, such as "Document" or "Note"). :param internal_id: The internal database ID of an existing Content Item data node :param uri: Alternate way to specify the Content Item data node, with a string URI :return: The number of indexed words associated with the above node |
name | arguments | returns |
---|---|---|
search_word | cls, word :str, all_properties=False | Union[List[int], List[dict]] |
Look up in the index any stored words that contain the requested string (ignoring case and leading/trailing blanks). Then locate the Content nodes that are indexed by any of those words. Return a (possibly empty) list of either the internal database IDs of all the found nodes, or their full attributes. :param word: A string, typically containing a word or word fragment; case is ignored, and so are leading/trailing blanks :param all_properties: If True, the properties of the located nodes are returned alongside their internal database IDs. Default is False: only return the internal database IDs :return: If all_properties is False, a (possibly empty) list of the internal database IDs of all the found nodes. If all_properties is True, a (possibly empty) list of dictionaries with all the data of all the found nodes; each dict contains all of the node's attributes, plus keys called 'internal_id' and 'neo4j_labels'. EXAMPLE: [{'filename': 'My_Document.pdf', 'internal_id': 66, 'neo4j_labels': ['Content Item']}] |
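
Because the lookup matches any stored word that contains the requested string, searching on a stem catches its variants. The sketch below illustrates that matching rule on a plain dictionary; it is an assumption-laden toy, not the library's query. The function name, the dictionary layout (stored word mapped to Content Item IDs), and the sorted return order are all choices made for this example.

```python
from typing import Dict, List, Set


def search_word_sketch(index: Dict[str, Set[int]], word: str) -> List[int]:
    """
    Illustrative stand-in for search_word() with all_properties=False;
    NOT the library's code.

    index: maps each stored word to the IDs of the Content Items it indexes
    """
    fragment = word.strip().lower()     # Ignore case and leading/trailing blanks
    found: Set[int] = set()
    for stored_word, content_ids in index.items():
        if fragment in stored_word.lower():     # Substring match, not exact match
            found |= content_ids
    return sorted(found)                # Sorting is a choice made for this sketch


toy_index = {"learning": {1}, "learns": {2}, "learn": {3}, "graph": {1, 4}}
print(search_word_sketch(toy_index, "  Learn "))    # [1, 2, 3]
print(search_word_sketch(toy_index, "learning"))    # [1]
```

Searching for the stem "learn" finds all three Content Items, while searching for "learning" finds only one.
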