FullTextIndexing Reference Guide

This guide is for versions Beta 34+

Source code

Background Information: Full-Text Search with the Neo4j Graph Database

Class FullTextIndexing

    Indexing-related methods, for full-text searching.

    NOTE: neither stemming nor lemmatizing is done.
          Therefore, for best results, all word searches should be done on stems;
          for example, search for "learn" rather than "learning" or "learns", to catch all 3

name	arguments	returns
split_into_words	cls, text: str, to_lower_case=True, drop_html=True	[str]

        Lower-level function used in the larger context of indexing text that may contain HTML.

        Given a string, optionally zap HTML tags and HTML entities, such as &ndash;
        then ditch punctuation from the given text;
        finally, break it up into individual words, returned as a list.  If requested, turn it all to lower case.

        Care is taken to make sure that the stripping of special characters does NOT fuse words together;
        e.g. avoid turning 'Long&ndash;Term' into a single word as 'LongTerm';
        likewise, avoid turning 'One<br>Two' into a single word as 'OneTwo'

        EXAMPLE.  Given:
            '<p>Mr. Joe&amp;sons<br>A Long&ndash;Term business! Find it at &gt; (http://example.com/home)<br>Visit Joe&#39;s &quot;NOW!&quot;</p>'
            It will return (if to_lower_case is False):
            ['Mr', 'Joe', 'sons', 'A', 'Long', 'Term', 'business', 'Find', 'it', 'at', 'http', 'example', 'com', 'home', 'Visit', 'Joe', 's', 'NOW']

        Note about numbers:
            * negative signs are lost
            * numbers with decimals will get split into two parts

        :param text:            A string with the text to parse
        :param to_lower_case:   If True, all text is converted to lower case
        :param drop_html:       Use True if passing HTML text
        :return:                A (possibly empty) list of words in the text,
                                    free of punctuation, HTML and HTML entities such as &ndash;
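
The behavior described above can be approximated in a few lines of standard-library Python. This is only a sketch, not the actual FullTextIndexing code: the name split_into_words_sketch and the two regular expressions are assumptions, chosen so that the documented example comes out the same.

```python
import re

def split_into_words_sketch(text: str, to_lower_case=True, drop_html=True) -> list:
    """Hypothetical stand-in for split_into_words(); NOT the library's code."""
    if drop_html:
        # Replace tags and entities with a blank (not an empty string),
        # so that neighboring words are never fused together
        text = re.sub(r"<[^>]*>", " ", text)    # HTML tags such as <br>
        text = re.sub(r"&#?\w+;", " ", text)    # entities such as &ndash; or &#39;
    if to_lower_case:
        text = text.lower()
    # Every run of letters, digits or underscores counts as a word;
    # punctuation acts as a separator and is dropped
    return re.findall(r"\w+", text)


html = ('<p>Mr. Joe&amp;sons<br>A Long&ndash;Term business! Find it at &gt; '
        '(http://example.com/home)<br>Visit Joe&#39;s &quot;NOW!&quot;</p>')
words = split_into_words_sketch(html, to_lower_case=False)

# The minus sign is lost, and the decimal point splits the number in two
numbers = split_into_words_sketch("-3.14", drop_html=False)
```

Substituting a blank for each tag or entity, rather than deleting it, is what keeps 'One<br>Two' from collapsing into 'OneTwo'.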

name	arguments	returns
extract_unique_good_words	cls, text :str, drop_html=True	Set[str]

        Higher-level function to prepare text for indexing;
        use the drop_html flag if the text contains HTML.

        From the given text, it returns the set of "acceptable", unique words.

        It does the following:
        1) zap punctuation
        2) if requested, zap HTML and HTML entities (such as &ndash;)
        3) turn into lower case
        4) break up into individual words
        5) strip off leading/trailing underscores
        6) eliminate "words" that match at least one of these EXCLUSION tests:
            * are just 1 or 2 characters long
            * are numbers
            * contain a digit anywhere (e.g. "50m" or "test2")
            * are found in a list of common words
        7) eliminate duplicates

        Note: no stemming or other grammatical analysis is done.

        EXAMPLE - given
            '<p>Mr. Joe&amp;sons<br>A Long&ndash;Term business! Find it at &gt; (http://example.com/home)<br>Visit Joe&#39;s &quot;NOW!&quot;</p>'
            it returns:
            ['mr', 'joe', 'sons', 'long', 'term', 'business', 'find', 'example', 'home', 'visit']

        :param text:        A string with the text to analyze and index
        :param drop_html:   Use True if passing HTML text
        :return:            A (possibly empty) set of strings containing "acceptable", unique words in the text
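
The exclusion tests listed above can be sketched as a simple filter. This is an illustration under stated assumptions, not the library's implementation: the function name, the regular expressions and, in particular, the tiny COMMON_WORDS set are all made up here; the real list of common words is not given in this guide.

```python
import re

# Hypothetical stand-in for the library's list of common words
COMMON_WORDS = {"the", "and", "for", "with", "http", "https", "www", "com"}

def extract_unique_good_words_sketch(text: str, drop_html=True) -> set:
    """Illustrative approximation of extract_unique_good_words(); NOT the library's code."""
    if drop_html:
        text = re.sub(r"<[^>]*>", " ", text)    # HTML tags
        text = re.sub(r"&#?\w+;", " ", text)    # HTML entities
    good_words = set()                          # a set eliminates duplicates
    for word in re.findall(r"\w+", text.lower()):
        word = word.strip("_")                  # leading/trailing underscores
        if len(word) <= 2:
            continue                            # just 1 or 2 characters long (or empty)
        if any(ch.isdigit() for ch in word):
            continue                            # numbers, or words containing a digit
        if word in COMMON_WORDS:
            continue                            # common words
        good_words.add(word)
    return good_words


result = extract_unique_good_words_sketch(
    "<p>The __init__ of test2 costs 50m and 12 apples, APPLES!</p>")
```

Here "the" and "and" are dropped as common words, "of" and "12" as too short, "test2" and "50m" for containing a digit; "apples" appears only once because the result is a set.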

name	arguments	returns
initialize_schema	cls, content_item_class_name="Content Item"	None

        Initialize the graph-database Schema used by this Indexer module:

        1) It will create a new "Word" Class linked to a new "Indexer" Class,
        by means of an outbound "occurs" relationship.
        The newly-created "Word" Class will be given one Property: "name".

        2) It will add a relationship named "has_index" from an existing (or newly created)
        "Content Item" Class to the new "Indexer" Class.

        NOTE: if an existing Class with the name specified by the argument `content_item_class_name`
              is not found, it will be created with some default values

        :param content_item_class_name: (OPTIONAL) The name of the Schema Class for Content Items,
                                            i.e. the Class of the Data Items to be indexed;
                                            if not found, it gets created.
                                            EXAMPLES: "Documents", "Notes", "Content Item" (default)
        :return:                        None
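
As a reading aid, the resulting Schema layout can be written down as plain data. The snippet below does not touch any database and is not how the library stores its Schema; the function name and the triple format (from_class, relationship, to_class) are assumptions made for illustration.

```python
def schema_sketch(content_item_class_name="Content Item"):
    """Describe the Schema layout as plain Python data (illustration only)."""
    # Class name -> list of Properties; only the "name" Property of "Word"
    # is stated in this guide, so the other lists are left empty
    classes = {
        "Word": ["name"],
        "Indexer": [],
        content_item_class_name: [],
    }
    # (from_class, relationship name, to_class)
    relationships = [
        ("Word", "occurs", "Indexer"),
        (content_item_class_name, "has_index", "Indexer"),
    ]
    return classes, relationships


classes, relationships = schema_sketch("Documents")
```

Both relationships point into the "Indexer" Class: "occurs" from "Word", and "has_index" from the Content Item Class.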

name	arguments	returns
new_indexing	cls, internal_id :int, unique_words :Set[str], to_lower_case=True	None

        Used to create a new index in the database
        for the (single) specified data node that represents a "Content Item".
        The indexing will link that Content Item to the given set of unique words.

        An Exception is raised if the "Indexer" node already exists.

        Details:
        1) Create a data node of type "Indexer",
            with inbound relationships named "occurs" from "Word" data nodes
            (pre-existing or newly-created as needed)
            for all the words in the given set
        2) Create a relationship named "has_index" from an existing "Content Item" data node
            to the new "Indexer" node

        :param internal_id:     The internal database ID of an existing data node
                                    that represents a "Content Item" (not necessarily with that Schema name)
        :param unique_words:    A set of strings containing unique words "worthy" of indexing
                                    - for example as returned by extract_unique_good_words()
        :param to_lower_case:   If True, all text is converted to lower case
        :return:                None
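
The two steps above, and the failure when an index already exists, can be mimicked with an in-memory toy. The ToyIndex class below is hypothetical: it stands in for the graph database with two Python containers and is not the library's implementation.

```python
class ToyIndex:
    """In-memory stand-in for the graph database (hypothetical, illustration only)."""

    def __init__(self):
        self.word_nodes = set()   # names of all the "Word" nodes
        self.occurs = {}          # Content Item ID -> words linked to its "Indexer"

    def new_indexing(self, internal_id: int, unique_words: set, to_lower_case=True) -> None:
        if internal_id in self.occurs:
            raise Exception(f"Content Item {internal_id} already has an Indexer node")
        if to_lower_case:
            unique_words = {w.lower() for w in unique_words}
        # 1) "Word" nodes are created only if missing, then linked via "occurs"
        self.word_nodes.update(unique_words)
        # 2) registering the ID plays the role of the "has_index" relationship
        self.occurs[internal_id] = set(unique_words)


idx = ToyIndex()
idx.new_indexing(66, {"Graph", "database"})

try:
    idx.new_indexing(66, {"again"})     # a second index for the same Content Item
    raised = False
except Exception:
    raised = True
```

In this toy the second call fails, and the existing index is left untouched.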

name	arguments	returns
add_words_to_index	cls, indexer_id :int, unique_words :Set[str], to_lower_case=True	int

        Add to the database "Word" nodes for all the given words, unless already present.
        Then link all the "Word" nodes, both the found and the newly-created ones,
        to the passed "Indexer" node with "occurs" relationships

        :param indexer_id:      Internal database ID of an existing "Indexer" data node
                                    used to hold all the "occurs" relationships
                                    to the various Word nodes.
                                    If not present, an Exception gets raised.
        :param unique_words:    Set of strings, with unique words for the index
        :param to_lower_case:   If True, all text is converted to lower case
        :return:                The number of new "Word" Data Nodes that were created
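
The return value counts only the "Word" nodes that had to be created, not all the words that get linked. A minimal sketch of that bookkeeping, using plain Python sets in place of database nodes (the function name and its first two parameters are assumptions, and the check for a missing "Indexer" node is left out):

```python
def add_words_to_index_sketch(word_nodes: set, indexer_words: set,
                              unique_words: set, to_lower_case=True) -> int:
    """
    word_nodes:    names of the "Word" nodes already in the (toy) database
    indexer_words: words already linked to the "Indexer" node of interest
    Both sets are updated in place.  Illustration only; NOT the library's code.
    """
    if to_lower_case:
        unique_words = {w.lower() for w in unique_words}
    new_words = unique_words - word_nodes   # words that have no "Word" node yet
    word_nodes.update(new_words)            # create the missing "Word" nodes
    indexer_words.update(unique_words)      # link found and newly-created words alike
    return len(new_words)


word_nodes = {"graph", "neo4j"}
indexer_words = set()
created = add_words_to_index_sketch(word_nodes, indexer_words, {"Graph", "Search", "index"})
```

Three words end up linked to the index, but "graph" already existed, so only two "Word" nodes count as newly created.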

name	arguments	returns
update_indexing	cls, content_uri :int, unique_words :Set[str], to_lower_case=True	None

        Used to update an index, linking the given set of unique words
        to the specified "Indexer" data node, which was created by a call to new_indexing()
        at the time the index was first created.

        From the given data node of type "Indexer",
        add inbound relationships named "occurs" from "Word" data nodes (pre-existing or newly-created)
        for all the words in the given set.
        Also, create a relationship named "has_index" from an existing "Content Item" data node to the new "Indexer" node.

        Note: if no index exists, an Exception is raised

        :param content_uri:     The internal database ID of an existing "Content Item" data node
        :param unique_words:    A set of strings containing unique words
                                    - for example as returned by extract_unique_good_words()
        :param to_lower_case:   If True, all text is converted to lower case
        :return:                None

name	arguments	returns
get_indexer_node_id	cls, internal_id :int	Union[int, None]

        Retrieve and return the internal database ID of the "Indexer" data node
        associated with the given Content Item data node.
        If not found, None is returned

        :param internal_id: The internal database ID of an existing data node
                                    (either of Class "Content Item", or of a Class that is ultimately
                                    an INSTANCE_OF a "Content Item" Class)
        :return:                The internal database ID of the corresponding "Indexer" data node.
                                    If not found, None is returned

name	arguments	returns
remove_indexing	cls, content_uri :int	None

        Drop the "Indexer" node linked to the given Content Item node.
        If no index exists, an Exception is raised

        :param content_uri: The internal database ID of an existing "Content Item" data node
        :return:                None

name	arguments	returns
number_of_indexed_words	cls, internal_id=None, uri=None	int

        Determine and return the number of words attached to the index
        of the given data node (typically of a Class representing "Content Item",
        or instance thereof, such as "Document" or "Note")

        :param internal_id: The internal database ID of an existing Content Item data node
        :param uri:         Alternate way to specify the Content Item data node, with a string URI
        :return:            The number of indexed words associated with the above node

name	arguments	returns
search_word	cls, word :str, all_properties=False	Union[List[int], List[dict]]

        Look up in the index any stored words that contain the requested string
        (ignoring case and leading/trailing blanks).

        Then locate the Content nodes that are indexed by any of those words.
        Return a (possibly empty) list of either the internal database IDs of all the found nodes,
        or a list of their full attributes.

        :param word:    A string, typically containing a word or word fragment;
                            case is ignored, and so are leading/trailing blanks
        :param all_properties:  If True, the properties of the located nodes are returned
                                alongside their internal database IDs.
                                Default is False: only return the internal database IDs
        :return:        If all_properties is False,
                            a (possibly empty) list of the internal database IDs
                            of all the found nodes
                        If all_properties is True,
                            a (possibly empty) list of dictionaries with all the data
                            of all the found nodes; each dict contains all of the node's attributes,
                            plus keys called 'internal_id' and 'neo4j_labels'
                            EXAMPLE: [{'filename': 'My_Document.pdf', 'internal_id': 66, 'neo4j_labels': ['Content Item']}]
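
Because the lookup matches any stored word that contains the requested string, searching for a stem such as "learn" also finds items indexed under "learning" or "learns". The sketch below imitates this with dictionaries instead of a database; the function name, the layout of the two dictionaries and the sorted order of the results are assumptions for illustration, not the library's behavior.

```python
def search_word_sketch(index: dict, nodes: dict, word: str, all_properties=False) -> list:
    """
    index: stored word -> set of internal IDs of the Content nodes indexed by it
    nodes: internal ID -> (dict of properties, list of labels)
    Illustration only; NOT the library's code.
    """
    fragment = word.strip().lower()         # ignore case and leading/trailing blanks
    found_ids = set()
    for stored_word, ids in index.items():
        if fragment in stored_word.lower(): # substring match on the stored word
            found_ids.update(ids)
    id_list = sorted(found_ids)             # sorted only to make the toy deterministic
    if not all_properties:
        return id_list
    return [{**nodes[i][0], "internal_id": i, "neo4j_labels": nodes[i][1]}
            for i in id_list]


index = {"learning": {66}, "learns": {67}, "graph": {66}}
nodes = {66: ({"filename": "My_Document.pdf"}, ["Content Item"]),
         67: ({"filename": "Notes.txt"}, ["Content Item"])}

by_stem = search_word_sketch(index, nodes, "  Learn ")       # matches both stored words
exact = search_word_sketch(index, nodes, "learning")         # matches only one of them
full = search_word_sketch(index, nodes, "graph", all_properties=True)
```

Searching for the full word "learning" misses the item indexed under "learns", which is why searching on stems is recommended.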