FullTextIndexing Reference Guide

This guide is for versions Beta 34+


Source code

Background Information: Full-Text Search with the Neo4j Graph Database

Class FullTextIndexing

    Indexing-related methods, for full-text searching.

    NOTE: neither stemming nor lemmatizing is done.
          Therefore, for best results, all word searches should be done on stems;
          for example, search for "learn" rather than "learning" or "learns", to catch all 3
    
name:      split_into_words
arguments: cls, text: str, to_lower_case=True, drop_html=True
returns:   [str]
        Lower-level function used in the larger context of indexing text that may contain HTML.

        Given a string, optionally zap HTML tags and HTML entities, such as &ndash;
        then ditch punctuation from the given text;
        finally, break it up into individual words, returned as a list.
        If requested, turn it all to lower case.

        Care is taken to make sure that the stripping of special characters does NOT fuse words together;
        e.g. avoid turning 'Long&ndash;Term' into a single word as 'LongTerm';
        likewise, avoid turning 'One<br>Two' into a single word as 'OneTwo'

        EXAMPLE.  Given:
            '<p>Mr. Joe&amp;sons<br>A Long&ndash;Term business! Find it at &gt; (http://example.com/home)<br>Visit Joe&#39;s &quot;NOW!&quot;</p>'
            It will return (if to_lower_case is False):
            ['Mr', 'Joe', 'sons', 'A', 'Long', 'Term', 'business', 'Find', 'it', 'at', 'http', 'example', 'com', 'home', 'Visit', 'Joe', 's', 'NOW']

        Note about numbers:
            * negative signs are lost
            * numbers with decimals will get split into two parts

        :param text:            A string with the text to parse
        :param to_lower_case:   If True, all text is converted to lower case
        :param drop_html:       Use True if passing HTML text
        :return:                A (possibly empty) list of words in the text,
                                    free of punctuation, HTML and HTML entities such as &ndash;
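
The splitting behavior described above can be approximated with a few regular expressions. This is a minimal sketch, not the library's actual implementation: the function name `split_into_words_sketch` and both patterns are assumptions made for illustration. Tags and entities are replaced with a blank rather than deleted, which is what keeps neighboring words from fusing together.

```python
import re

# Assumed patterns for this sketch; the library's actual rules may differ
TAG_PATTERN = re.compile(r"<[^>]*>")
ENTITY_PATTERN = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def split_into_words_sketch(text, to_lower_case=True, drop_html=True):
    """Hypothetical stand-in that mimics the documented behavior."""
    if drop_html:
        # Substitute a blank (not an empty string) so that 'One<br>Two'
        # becomes 'One Two' rather than 'OneTwo'
        text = TAG_PATTERN.sub(" ", text)
        text = ENTITY_PATTERN.sub(" ", text)
    if to_lower_case:
        text = text.lower()
    # Keep runs of letters, digits and underscores; everything else separates words
    return re.findall(r"\w+", text)


print(split_into_words_sketch("One<br>Two", to_lower_case=False))
print(split_into_words_sketch("-3.14"))
```

The second call illustrates the note about numbers: the negative sign is lost, and the decimal number is split into two parts.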
        
name:      extract_unique_good_words
arguments: cls, text :str, drop_html=True
returns:   Set[str]
        Higher-level function to prepare text for indexing;
        use the drop_html flag if the text contains HTML.

        From the given text, it returns the set of "acceptable", unique words.
        It does the following:

            1) zap punctuation
            2) if requested, zap HTML tags and HTML entities (such as &ndash;)
            3) turn into lower case
            4) break up into individual words
            5) strip off leading/trailing underscores
            6) eliminate "words" that match at least one of these EXCLUSION tests:
                * are just 1 or 2 characters long
                * are numbers
                * contain a digit anywhere (e.g. "50m" or "test2")
                * are found in a list of common words

            7) eliminate duplicates

        Note: no stemming or other grammatical analysis is done.

        EXAMPLE - given
                  '<p>Mr. Joe&amp;sons<br>A Long&ndash;Term business! Find it at &gt; (http://example.com/home)<br>Visit Joe&#39;s &quot;NOW!&quot;</p>'
                  it returns:
                  ['mr', 'joe', 'sons', 'long', 'term', 'business', 'find', 'example', 'home', 'visit']

        :param text:        A string with the text to analyze and index
        :param drop_html:   Use True if passing HTML text
        :return:            A (possibly empty) set of strings containing "acceptable", unique words in the text

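
The exclusion tests listed above can be sketched as a simple filter. This is an illustration under stated assumptions, not the library's code: `extract_unique_good_words_sketch` is a hypothetical name, and `COMMON_WORDS` is a tiny made-up stand-in for the library's own list of common words, whose contents are not given here.

```python
import re

# Hypothetical stand-in for the library's list of common words
COMMON_WORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}


def extract_unique_good_words_sketch(text, drop_html=True):
    """Return the set of 'acceptable' unique words, per the documented steps."""
    if drop_html:
        text = re.sub(r"<[^>]*>", " ", text)                  # HTML tags
        text = re.sub(r"&(?:#\w+|[A-Za-z]\w*);", " ", text)   # HTML entities
    good_words = set()                                        # a set drops duplicates
    for word in re.findall(r"\w+", text.lower()):
        word = word.strip("_")                                # leading/trailing underscores
        if len(word) <= 2:
            continue                                          # just 1 or 2 characters long
        if any(ch.isdigit() for ch in word):
            continue                                          # numbers, or a digit anywhere
        if word in COMMON_WORDS:
            continue                                          # common word
        good_words.add(word)
    return good_words


words = extract_unique_good_words_sketch(
    "<p>The graph and the _index_ hold 50m nodes, test2 &amp; 2024 data</p>")
print(sorted(words))
```

Because the result is a set, the order of the words is not meaningful; sort it, as done here, when a stable display order is needed.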
name:      initialize_schema
arguments: cls, content_item_class_name="Content Item"
returns:   None
        Initialize the graph-database Schema used by this Indexer module:

        1) It will create a new "Word" Class linked to a new "Indexer" Class,
        by means of an outbound "occurs" relationship.
        The newly-created "Word" Class will be given one Property: "name".

        2) It will add a relationship named "has_index" from an existing (or newly created)
        "Content Item" Class to the new "Indexer" Class.

        NOTE: if an existing Class with the name specified by the argument `content_item_class_name`
              is not found, it will be created with some default values

        :param content_item_class_name: (OPTIONAL) The name of the Schema Class for Content Items,
                                            i.e. the Class of the Data Items to be indexed;
                                            if not found, it gets created.
                                            EXAMPLES: "Documents", "Notes", "Content Item" (default)
        :return:                        None
        
name:      new_indexing
arguments: cls, internal_id :int, unique_words :Set[str], to_lower_case=True
returns:   None
        Used to create a new index in the database
        for the (single) specified data node that represents a "Content Item".
        The indexing will link that Content Item to the given set of unique words.

        An Exception is raised if the "Indexer" node already exists.

        Details:
        1) Create a data node of type "Indexer",
            with inbound relationships named "occurs" from "Word" data nodes
            (pre-existing or newly-created as needed)
            for all the words in the given set
        2) Create a relationship named "has_index" from an existing "Content Item" data node
            to the new "Indexer" node

        :param internal_id:     The internal database ID of an existing data node
                                    that represents a "Content Item" (not necessarily with that Schema name)
        :param unique_words:    A set of strings containing unique words "worthy" of indexing
                                    - for example as returned by extract_unique_good_words()
        :param to_lower_case:   If True, all text is converted to lower case
        :return:                None
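
To visualize the structure that gets created, here is a toy in-memory model of it: Content Item -[has_index]-> Indexer <-[occurs]- Word. It is only a sketch; the class `TinyIndexGraph` and all of its attributes are hypothetical, and it uses plain dictionaries in place of the graph database.

```python
class TinyIndexGraph:
    """Hypothetical in-memory stand-in for the graph database (not the real API)."""

    def __init__(self):
        self.next_id = 1
        self.word_ids = {}    # word -> ID of its "Word" node
        self.has_index = {}   # "Content Item" ID -> "Indexer" ID
        self.occurs = {}      # "Indexer" ID -> IDs of the "Word" nodes linked to it

    def _new_node_id(self):
        node_id = self.next_id
        self.next_id += 1
        return node_id

    def new_indexing(self, content_id, unique_words, to_lower_case=True):
        if content_id in self.has_index:
            raise Exception(f"Content Item {content_id} already has an Indexer node")
        indexer_id = self._new_node_id()            # 1) create the "Indexer" node
        self.occurs[indexer_id] = set()
        for word in unique_words:
            if to_lower_case:
                word = word.lower()
            if word not in self.word_ids:           # reuse a pre-existing "Word" node,
                self.word_ids[word] = self._new_node_id()   # or create it as needed
            self.occurs[indexer_id].add(self.word_ids[word])
        self.has_index[content_id] = indexer_id     # 2) add the "has_index" link


graph = TinyIndexGraph()
graph.new_indexing(100, {"graph", "Search"})        # 100 is a made-up Content Item ID
print(sorted(graph.word_ids))
```

Calling `new_indexing` a second time for the same Content Item raises an Exception in this model, mirroring the documented behavior.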
        
name:      add_words_to_index
arguments: cls, indexer_id :int, unique_words :Set[str], to_lower_case=True
returns:   int
        Add to the database "Word" nodes for all the given words, unless already present.
        Then link all the "Word" nodes, both the found and the newly-created ones,
        to the passed "Indexer" node with "occurs" relationships.

        :param indexer_id:      Internal database ID of an existing "Indexer" data node
                                    used to hold all the "occurs" relationships
                                    to the various Word nodes.
                                    If not present, an Exception gets raised.
        :param unique_words:    Set of strings, with unique words for the index
        :param to_lower_case:   If True, all text is converted to lower case
        :return:                The number of new "Word" Data Nodes that were created
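
The return value counts only the "Word" nodes that had to be created, not the words that were linked. The sketch below illustrates that distinction with plain dictionaries; `add_words_to_index_sketch` and its first two parameters are hypothetical and are not part of the library.

```python
def add_words_to_index_sketch(word_nodes, occurs, indexer_id, unique_words, to_lower_case=True):
    """
    word_nodes: dict mapping each word to a made-up ID for its "Word" node
    occurs:     dict mapping each "Indexer" ID to the set of words linked to it
    """
    if indexer_id not in occurs:
        raise Exception(f"No Indexer node with ID {indexer_id}")
    number_created = 0
    for word in unique_words:
        if to_lower_case:
            word = word.lower()
        if word not in word_nodes:
            word_nodes[word] = len(word_nodes) + 1   # create the missing "Word" node
            number_created += 1
        occurs[indexer_id].add(word)                 # link found and new words alike
    return number_created


word_nodes = {"graph": 1}     # "graph" is already in the database
occurs = {7: set()}           # 7 is a made-up Indexer ID
created = add_words_to_index_sketch(word_nodes, occurs, 7, {"Graph", "index", "search"})
print(created, sorted(occurs[7]))
```

Here three words end up linked to the Indexer, but only two new "Word" nodes are counted, since "graph" was already present.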
        
name:      update_indexing
arguments: cls, content_uri :int, unique_words :Set[str], to_lower_case=True
returns:   None
        Used to update an index, linking the given set of unique words
        to the specified "Indexer" data node, which was created by a call to new_indexing()
        at the time the index was first created.

        From the given data node of type "Indexer",
        add inbound relationships named "occurs" from "Word" data nodes (pre-existing or newly-created)
        for all the words in the given set.
        Also, create a relationship named "has_index" from an existing "Content Item" data node to the new "Indexer" node.

        Note: if no index exists, an Exception is raised

        :param content_uri:     The internal database ID of an existing "Content Item" data node
        :param unique_words:    A set of strings containing unique words
                                    - for example as returned by extract_unique_good_words()
        :param to_lower_case:   If True, all text is converted to lower case
        :return:                None
        
name:      get_indexer_node_id
arguments: cls, internal_id :int
returns:   Union[int, None]
        Retrieve and return the internal database ID of the "Indexer" data node
        associated to the given Content Item data node.
        If not found, None is returned

        :param internal_id: The internal database ID of an existing data node
                                    (either of Class "Content Item", or of a Class that is ultimately
                                    an INSTANCE_OF a "Content Item" Class)
        :return:                The internal database ID of the corresponding "Indexer" data node.
                                    If not found, None is returned
        
name:      remove_indexing
arguments: cls, content_uri :int
returns:   None
        Drop the "Indexer" node linked to the given Content Item node.
        If no index exists, an Exception is raised

        :param content_uri: The internal database ID of an existing "Content Item" data node
        :return:                None
        
name:      number_of_indexed_words
arguments: cls, internal_id=None, uri=None
returns:   int
        Determine and return the number of words attached to the index
        of the given data node (typically of a Class representing "Content Item",
        or an instance thereof, such as "Document" or "Note").

        :param internal_id: The internal database ID of an existing Content Item data node
        :param uri:         Alternate way to specify the Content Item data node, with a string URI
        :return:            The number of indexed words associated to the above node
        
name:      search_word
arguments: cls, word :str, all_properties=False
returns:   Union[List[int], List[dict]]
        Look up in the index any stored words that contain the requested string
        (ignoring case and leading/trailing blanks).

        Then locate the Content nodes that are indexed by any of those words.
        Return a (possibly empty) list of either the internal database ID's of all the found nodes,
        or a list of their full attributes.

        :param word:    A string, typically containing a word or word fragment;
                            case is ignored, and so are leading/trailing blanks
        :param all_properties:  If True, the properties of the located nodes are returned
                                alongside their internal database ID's.
                                Default is False: only return the internal database ID's
        :return:        If all_properties is False,
                            a (possibly empty) list of the internal database ID's
                            of all the found nodes
                        If all_properties is True,
                            a (possibly empty) list of dictionaries with all the data
                            of all the found nodes; each dict contains all of the node's attributes,
                            plus keys called 'internal_id' and 'neo4j_labels'
                            EXAMPLE: [{'filename': 'My_Document.pdf', 'internal_id': 66, 'neo4j_labels': ['Content Item']}]
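
The matching rule (substring match, ignoring case and surrounding blanks) and the two return shapes can be illustrated with an in-memory sketch. Everything here is an assumption made for illustration: `search_word_sketch`, the `index` and `nodes` dictionaries, and the sorting of the results are not taken from the library.

```python
def search_word_sketch(index, nodes, word, all_properties=False):
    """
    index: dict mapping each stored word to the set of IDs of the nodes it indexes
    nodes: dict mapping each node ID to a pair (properties dict, list of labels)
    """
    fragment = word.strip().lower()
    found_ids = set()
    for stored_word, node_ids in index.items():
        if fragment in stored_word.lower():          # substring match, case ignored
            found_ids |= node_ids
    result_ids = sorted(found_ids)                   # sorted only for a stable display
    if not all_properties:
        return result_ids
    results = []
    for node_id in result_ids:
        properties, labels = nodes[node_id]
        results.append({**properties, "internal_id": node_id, "neo4j_labels": labels})
    return results


index = {"learning": {66}, "learns": {67}, "graph": {66, 68}}
nodes = {66: ({"filename": "My_Document.pdf"}, ["Content Item"]),
         67: ({"filename": "notes.txt"}, ["Content Item"]),
         68: ({"filename": "graphs.pdf"}, ["Content Item"])}

print(search_word_sketch(index, nodes, "  Learn "))
print(search_word_sketch(index, nodes, "learning", all_properties=True))
```

Searching on the stem "learn" reaches the nodes indexed under both "learning" and "learns", which is why stems are the recommended search terms when no stemming is done at indexing time.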