Version: 0.5.24

6. Prefix and Suffix Trees#

Lecture:

Lecture 5.6 (slides)

Objectives:

Understand what prefix trees and suffix trees are

Concepts:

Insertion, deletion, and search in a prefix tree

As a last example of tree data structures, we shall look at prefix and suffix trees. These are data structures used to speed up text search, especially full-text search and auto-completion in text editors.

  • Auto-completion means that the editor suggests possible endings for what you have started to type. For instance, Fig. 6.13 shows an example of “default completion”, where the editor makes suggestions based on words already in the document.

  • Full-text search is what happens, for instance, when you press Ctrl-F: the editor prompts you for some text and highlights all its occurrences in the page. This is available in most text editors, web browsers, and IDEs.

Prefixes vs. Suffixes

What are prefixes and suffixes? Intuitively, they are just the “beginnings” and “endings” of sequences. More formally, given a sequence of characters \(s = (c_1, c_2, \dots, c_n)\), any shorter sequence \(s' = (c_1, \dots, c_k)\) where \(0 \leq k \leq n\) is a prefix of \(s\). For instance, the word “algorithm” has the following prefixes (ordered by length):

\[P = \{ '',\, 'a',\, 'al',\, 'alg',\, 'algo',\, 'algor',\, 'algori',\, 'algorit',\, 'algorith',\, 'algorithm' \}\]

Note that the set of prefixes includes the empty sequence and the complete word itself. Formally, we denote the fact that a word \(w\) admits a prefix \(p\) by \(p \preceq w\).

Similarly, suffixes are the possible endings. Formally, any sequence \(s' = (c_k, \dots, c_n)\) where \(1 \leq k \leq n+1\) is a suffix of \(s\). The word “algorithm” therefore admits the following suffixes:

\[S = \{ '',\, 'm',\, 'hm',\, 'thm',\, 'ithm',\, 'rithm',\, 'orithm',\, 'gorithm',\, 'lgorithm', \, 'algorithm' \}\]
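These definitions translate directly into code. Here is a small Python sketch (the function names are my own) that enumerates the prefixes and suffixes of a word:

```python
def prefixes(word):
    """All prefixes of a word, from the empty string to the word itself."""
    return [word[:k] for k in range(len(word) + 1)]

def suffixes(word):
    """All suffixes of a word, from the empty string to the word itself."""
    return [word[k:] for k in range(len(word), -1, -1)]

print(prefixes("algorithm"))  # ['', 'a', 'al', ..., 'algorithm']
print(suffixes("algorithm"))  # ['', 'm', 'hm', ..., 'algorithm']
```

Note that a word of length \(n\) has \(n+1\) prefixes and \(n+1\) suffixes, since both sets include the empty sequence and the word itself.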

6.1. Prefix Trees#

How can we implement such auto-completion mechanisms efficiently? What we need is a kind of Map ADT that could give us all the words that start with the “prefix” the user has typed. For instance, given a set of words already typed, one could index them by prefix, using a hash table.

See also

📖 Goodrich, M. T., Tamassia, R., & Goldwasser, M. H. (2014). Data structures and algorithms in Java. John Wiley & Sons. Chap. 13

The conventional solution is to use a prefix tree, also known as trie. A prefix tree is not an ADT per se, but it extends the Set ADT and generally exposes the following operations:

tries.contains(t: Trie, w: Word) Boolean#

Check whether the prefix tree \(t\) contains the given word \(w\).

tries.prefix_search(t: Trie, p: Word) Set#

Returns the subset \(S\) of all the words from \(t\) that start with the given prefix \(p\).

Pre-conditions:
  • None

Post-conditions:
  • (A1) \(S\) is exactly the subset of words from the prefix tree that start with the given prefix \(p\): no other word in the tree starts with \(p\), and \(S\) contains no word that is not in the tree.

    \[S = \mathit{prefix\_search}(t, p) \implies S = \{ w \in \mathcal{A}^* \:|\: contains(t, w) \land p \preceq w \}\]
tries.add(t: Trie, w: Word) Trie#

Add a new word in the prefix tree \(t\).

Pre-conditions:
  • None

Post-conditions:
  • (A2) The word \(w\) is necessarily in the resulting prefix tree, that is:

    \[t' = add(t, w) \implies contains(t', w)\]
tries.remove(t: Trie, w: Word) Trie#

Remove the given word \(w\) from the prefix tree \(t\).

Pre-conditions:
  • None

Post-conditions:
  • (A3) The removed word is no longer in the resulting prefix tree, that is:

    \[t' = remove(t, w) \implies \neg \, contains(t', w)\]
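To make these operations concrete, here is a minimal Python sketch of such a prefix tree, encoded as nested dictionaries. The class name, the `None` sentinel standing in for \(\bot\), and the fact that the operations mutate the tree in place (rather than returning a new tree, as the signatures above suggest) are all assumptions of this sketch:

```python
class Trie:
    """A minimal set-of-words prefix tree, as nested dictionaries."""

    END = None  # sentinel key standing in for the end-of-word marker

    def __init__(self):
        self._root = {}

    def add(self, word):
        """Walk down the tree, creating one node per character."""
        node = self._root
        for symbol in word:
            node = node.setdefault(symbol, {})
        node[Trie.END] = {}  # mark that this path spells a complete word

    def contains(self, word):
        """Follow the word character by character; succeed only on a marked node."""
        node = self._root
        for symbol in word:
            if symbol not in node:
                return False
            node = node[symbol]
        return Trie.END in node

    def prefix_search(self, prefix):
        """Return the set of all stored words that start with `prefix`."""
        node = self._root
        for symbol in prefix:
            if symbol not in node:
                return set()
            node = node[symbol]
        # Collect every word in the subtree the prefix leads to.
        words, stack = set(), [(node, prefix)]
        while stack:
            node, spelled = stack.pop()
            for symbol, child in node.items():
                if symbol is Trie.END:
                    words.add(spelled)
                else:
                    stack.append((child, spelled + symbol))
        return words

    def remove(self, word):
        """Unmark the word; pruning of now-useless branches is omitted for brevity."""
        node = self._root
        for symbol in word:
            if symbol not in node:
                return
            node = node[symbol]
        node.pop(Trie.END, None)
```

Using a dictionary for the children gives the constant-time child lookup discussed later in this chapter.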

6.1.1. Basic Structure#

A prefix tree is an \(n\)-ary tree meant to contain a set of words. Every node in the tree carries a symbol from the alphabet of these words. The root carries a special symbol \(\varepsilon\), which represents the empty symbol. The leaves carry another special symbol, \(\bot\), which denotes the end of a word. This way, every path from the root to a leaf represents a word, starting with \(\varepsilon\) and ending with \(\bot\).

Fig. 6.14 shows a prefix tree built over the words: ‘apart’, ‘apartment’, ‘ape’, ‘apear’, ‘apple’, ‘apply’, ‘apricot’, ‘april’.

../../_images/prefix_tree.svg

Fig. 6.14 A prefix tree built on the set of eight words \(S = \{'apart', 'apartment', 'ape', \, 'apear',\, 'apple', 'apply',\, 'apricot',\, 'april' \}\)#

Each leaf is annotated with the word that can be read by following the path from the root to that leaf. Similarly, the path that leads to an internal node represents a prefix, and its descendant leaves are all the words that start with that prefix.

In a sense, a node represents the characters the user has typed so far. At the root, for instance, the user hasn’t typed anything yet and all the words are possible. As we move down along a specific path, we narrow down the set of possible words. In Fig. 6.14, at Node 2, we know the user has typed an “a”; at Node 3, the prefix “ap”; at Node 4, the prefix “app”; and so on and so forth.

Why Use Special Characters?

The special characters \(\varepsilon\) and \(\bot\) allow handling words that are also a prefix of another longer word, such as ‘apart’ which is a prefix of ‘apartment’. Without the special \(\bot\) character, we would have no way to know that apart is also a valid word in the set.
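This disambiguation is easy to demonstrate. Below is a hedged sketch, assuming a dict-of-dicts encoding in which a special key stands in for \(\bot\) and the root dictionary plays the role of \(\varepsilon\):

```python
BOTTOM = '\u22a5'  # stands in for the ⊥ end-of-word marker (a sketch assumption)

def add(node, word):
    """Walk down the tree, creating one node per character, then mark the end."""
    for symbol in word:
        node = node.setdefault(symbol, {})
    node[BOTTOM] = {}  # the ⊥ leaf records that the path spells a complete word

root = {}
add(root, "apartment")
add(root, "apart")

# The node reached by spelling 'apart' has two children: the ⊥ leaf
# (because 'apart' is itself a word) and 'm' (because 'apartment' continues).
node = root
for symbol in "apart":
    node = node[symbol]
print(sorted(node))  # prints ['m', '⊥']
```

Without the \(\bot\) child, the node reached by “apart” would be indistinguishable from a node that is merely an internal step on the way to “apartment”.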

6.1.2. Compressed Prefix Tree#

This default tree structure is not really efficient, because it consumes a lot of memory: many nodes have only one child. To reduce this memory overhead, prefix trees are compressed as follows.

We place more than one symbol in each node and, consequently, merge those nodes that have only one child. Fig. 6.15 shows the compressed version of the prefix tree initially shown in Fig. 6.14. Now, all nodes but the leaves have two or more children.

trees/tries/_static/images/compressed.svg

Fig. 6.15 A compressed prefix tree, equivalent to the one shown on Fig. 6.14#
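One way to perform this compression can be sketched in Python, again assuming a dict-of-dicts encoding where each key is a node label and a special key stands in for \(\bot\) (the helper names are my own): starting from an uncompressed tree, merge every chain of single-child nodes into one node labelled with the concatenated symbols.

```python
BOTTOM = '\u22a5'  # stands in for the ⊥ end-of-word marker (a sketch assumption)

def add(node, word):
    """Build an uncompressed tree: one node per character, ⊥ at the end."""
    for symbol in word:
        node = node.setdefault(symbol, {})
    node[BOTTOM] = {}

def compress(node):
    """Merge every chain of single-child nodes into one node whose
    label concatenates the symbols along the chain."""
    compressed = {}
    for label, child in node.items():
        # Follow the chain as long as there is exactly one way down
        # and we have not reached an end-of-word marker.
        while len(child) == 1 and BOTTOM not in child:
            (symbol, grandchild), = child.items()
            label += symbol
            child = grandchild
        compressed[label] = compress(child)
    return compressed

root = {}
for w in ["apple", "apply", "apricot", "april"]:
    add(root, w)
print(compress(root))
```

On these four words, the single-child chain “a”–“p” collapses into one node labelled “ap”, whose two children “pl” and “ri” then branch towards the leaves, as in Fig. 6.15.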

6.1.3. Membership#

To check whether a given word \(w=(c_1,\dots,c_k)\) exists in our prefix tree, we use the following procedure (assuming a non-compressed prefix tree):

  1. We start at the root of the tree, which becomes our “current node” \(n\).

  2. If Node \(n\) is a branch:

    1. If Node \(n\) has a child node \(c\) whose symbol is \(c_1\) (the first character of \(w\)):

      1. We remove that first character from Word \(w\).

      2. Node \(c\) becomes our current node, and we continue at Step 2.

    2. Otherwise, we return false.

  3. Otherwise (Node \(n\) is a leaf), we return true only if the given word \(w\) is empty (i.e., \(w=''\)) and the symbol carried by Node \(n\) is \(\bot\).
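The steps above can be sketched recursively in Python, assuming a dict-of-dicts encoding in which each node maps symbols to children and a special key stands in for \(\bot\):

```python
BOTTOM = '\u22a5'  # stands in for the ⊥ end-of-word marker (a sketch assumption)

def contains(node, word):
    """Membership test: branch on the first character, recurse on the rest."""
    if word == "":
        return BOTTOM in node  # only a ⊥ child makes this a complete word
    if word[0] in node:
        # Drop the first character and continue from the matching child.
        return contains(node[word[0]], word[1:])
    return False

# A tiny tree containing only the word 'ape'.
tree = {'a': {'p': {'e': {BOTTOM: {}}}}}
print(contains(tree, "ape"))  # prints True
print(contains(tree, "ap"))   # prints False: 'ap' is a prefix, not a stored word
```

Each recursive call consumes one character, which is exactly the reduction from “ape” to “pe” to “e” to the empty word described below.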

Fig. 6.16 illustrates how this procedure operates: searching for “ape” at the root node reduces to searching for “pe” on the child, which reduces to searching for “e” on the grandchild, which reduces to searching for the empty word.

trees/tries/_static/images/contains.svg

Fig. 6.16 Searching for the word “ape” in the prefix tree initially shown in Fig. 6.14, shortened for the sake of conciseness.#

Why Is This Correct?

Because the procedure maintains an invariant: at each step, the characters consumed so far spell the path from the root to the current node. So if we reach a \(\bot\) leaf with an empty remaining word, the path from \(\varepsilon\) to \(\bot\) spells exactly the original word \(w\), which, by construction of the tree, means \(w\) is in the set. Conversely, if no child matches the next character, no path in the tree spells \(w\), so \(w\) cannot be in the set.

How Efficient Is This?

Let us consider searching for a word \(w=(c_1,\ldots,c_n)\). How much time does it require? In the best case, we see immediately at the root that the word \(w\) is not in the tree, because the root has no child whose symbol matches the first character \(c_1\). If branch nodes are implemented using hashing, that local search runs in \(O(1)\); if it were implemented with a linear search, it would run in \(O(|\mathcal{A}|)\), where \(\mathcal{A}\) is the alphabet over which the words are built.

What is the worst case? It occurs when we search for the longest word that exists in the tree. This word is stored along the longest branch of the tree, and we then have to check every node along that branch. This gives a worst-case runtime that is linear in the length of the word \(w\), that is, \(O(n)\). Again, if searching for a specific child is implemented using a linear search, the overall search runs in \(O(n \cdot |\mathcal{A}|)\).

As for memory, a recursive implementation of the procedure above requires \(O(n)\) memory, as each node visited places a new call on the stack to progress further down the tree.

6.1.5. Insertion#

6.1.6. Deletion#

6.2. Suffix Trees#