Version: 0.5.24
6. Prefix and Suffix Trees#
- Lecture:
Lecture 5.6
(slides)- Objectives:
Understand what is a prefix and what is a suffix tree
- Concepts:
Insertion, deletion and search into a Prefix-tree
As a last example of tree data structures, we shall look at prefix and suffix trees. These are data structures used to speed up text search. especially full-text search and auto completion in text editors.
Auto-completion means that editor suggests possible endings for what you have started to type. For instance, Fig. 6.13 shows an example of “default completion”, where the editor make suggestions based on words already in the document.
Full-text search, is what happen for instance when you press
Ctrl-F, which prompt you for a some text, and highlights all the occurrences in the page. This is often available in text editors, web browser, or IDEs.
Prefixes vs. Suffixes
What are prefixes and suffixes? Intuitively, we can say the are just “beginning” and “ending” of sequences. More formally, given a sequence of characters \(s = (c_1, c_2, \dots, c_n)\) any shorted sequence \(s' = (c_1, \dots, c_k)\) where \(0 \leq k \leq n\) is a prefix of \(S\). For instance, the word “algorithm” has the following prefixes (ordered by length):
Note that the set of prefixes includes the empty sequence and the complete word itself. Formally, we denote that fact that a word \(w\) admits a prefix \(p\) by \(p \preceq w\).
Similarly, suffixes are possible endings. Formally, and sequence \(s' = (c_k, \dots, c_n)\) is a suffix of \(s\). The word algorithm therefore admits the following suffixes:
6.1. Prefix Trees#
How can we implement such auto-completion mechanisms efficiently? What we need is kind of Map ADT that could give us all the word that starts by the “prefix2 the user has typed. For instance, given a set of words already typed, one could index them by prefix, using a hash table.
See also
📖 Goodrich, M. T., Tamassia, R., & Goldwasser, M. H. (2014). Data structures and algorithms in Java. John Wiley & Sons. Chap. 13
The conventional solution is to use a prefix tree, also known as trie. A prefix tree is not an ADT per se, but it extends the Set ADT and generally exposes the following operations:
- tries.contains(t: Trie, w: Word) Boolean#
Check whether the prefix tree \(t\) contains the given word \(w\).
- tries.prefix_search(t: Trie, p: Word) Set[Word]#
Returns the subset \(S\) of all the words from \(t\) that starts with the given prefix \(p\).
- Pre-conditions:
None
- Post-conditions:
(A1) \(S\) is the very subset of words from the prefix tree that start with the given prefix \(p\). No other words in the tree starts with \(p\) and S does not contain any other word than those from the tree.
\[S = \mathit{prefix\_search}(t, p) \implies S = \{ w \in \mathcal{A}^* \:|\: contains(t, w) \land p \preceq w \}\]
- tries.add(t: Trie, w: Word) Trie#
Add a new word in the prefix tree \(t\).
- Pre-conditions:
None
- Post-conditions:
(A2) The word is \(w\) is necessary in the resulting prefix tree, that is:
\[t' = add(t, w) \Longleftrightarrow contains(t', w)\]
- tries.remove(t: Trie, w: word) Trie#
Remove the given word \(w\) from the the prefix tree \(t\).
- Pre-conditions:
None
- Post-conditions:
(A3) The word that is removed is not anymore in the result prefix tree, that is:
\[t' = remove(t, w) \Longleftrightarrow \neg \, contains(t', w)\]
6.1.1. Basic Structure#
A prefix tree, is an \(n\)-ary tree meant to contains a set of words. Every node in the tree contains a symbol from the alphabet of these words. The root contains a special symbol \(\varepsilon\), which represents the empty symbol. The leaves carries another special symbol, \(\bot\), which denotes the end of a word. This way every path from the root to a leaf node represents a word starting from \(\varepsilon\) and ending with \(\bot\).
Fig. 6.14 shows a prefix tree built over the words: ‘apart’, ‘apartment’, ‘ape’, ‘apear’, ‘apple’, ‘apply’, ‘apricot’, ‘april’.
Fig. 6.14 A prefix tree built on the set of eight words \(S = \{'apart', 'apartment', 'ape', \, 'apear',\, 'apple', 'apply',\, 'apricot',\, 'april' \}\)#
Each leaf is annotated with the word that can be read following the path from the root to that leaf. Similarly, the path that leads to an internal node represent a prefix and its descend leaves are all the words that start with that prefix.
In a sense, a node somehow represents the characters that have been typed so far by the user. On the root for instance, the user hasn’t typed anything yet and all the words are possible. As we move down along a specific path, we start narrowing down the set of possible worda. In Fig. 6.14, in Node 2, we know the user has typed an “a”, in Node 3, we know the suer has typed the prefix “ap”, in Node 4, the prefix “app”, and so on and so forth.
Why Use Special Characters?
The special characters \(\varepsilon\) and \(\bot\) allow handling words that are also a prefix of another longer word, such as ‘apart’ which is a prefix of ‘apartment’. Without the special \(\bot\) character, we would have no way to know that apart is also a valid word in the set.
Implementation in Scala
Let see how we could implement this. Prefix tree is a data structure where recursion makes things easier.
First we need to define the “facade” of our PrefixTree as
follows. Here we are just declaring the operation we have listed in
our ADT. I added an operation size for convenience: The size is
simply the number of words that the prefix tree contains. We will
see later how to implement these operations.
class PrefixTree:
def add(word: String): Unit =
throw new Error("Not implemented")
def contains(word: String): Boolean =
throw new Error("Not implemented")
def prefixSearch(prefix: String): Set[String] =
throw new Error("Not implemented")
def remove(word: String): Unit =
throw new Error("Not implemented")
def words: Set[String] =
throw new Error("Not implemented")
def size: Int = words.size
Now we define some more classes to represent the nodes of our prefix tree. I choose to use two separate classes for leaves and branches as follows.
First I defined the interface for an node in general, regardless whether it is a leaf or a branch. Node simply expose the same operations as our ADT. I then add two subclasses that implements the leaves, and branches, respectively.
abstract class Node(val symbol: Char):
def contains(word: String): Boolean
def add(word: String): Node
def words: Set[String]
def prefixSearch(prefix: String): Set[String]
def remove(word: String): Option[Node]
class Leaf(symbol: Char) extends Node(symbol)
class Branch(symbol: Char, children: Map[Char, Node]) extends Node(symbol)
6.1.2. Compressed Prefix Tree#
This default tree structure is not really efficient, because it consumes a lot memory: There are many nodes that have only one child. To reduce the memory overhead, prefix trees are compressed as follows.
We place more than one symbol in each node, and merge those nodes that have only one child, consequently. Fig. 6.15 shows the compressed version of the prefix tree initially shown on Fig. 6.14. Now, all nodes but the leaves have two or more children.
Fig. 6.15 A compressed prefix tree, equivalent to the one shown on Fig. 6.14#
6.1.3. Membership#
To check whether a given word \(w=(c_1,\dots,c_k)\) exists in our prefix tree, we use the following procedure (assuming a non-compressed prefix tree):
We start at the root of the tree, which becomes our “current node” \(n\).
If Node \(n\) is a branch
If Node \(n\) has a child node \(c\) whose symbol is the \(c_1\) (the first character)
We remove that first character from Word \(w\)
Node \(c\) becomes our current node and we continue at Step 2.
Else
We return False
Otherwise (Node \(n\) is a leaf)
We return true only if the given word \(w\) is empty (i.e., \(w=''\)) and if the symbol carried by the Node \(n\) is \(\bot\).
Figure Fig. 6.16 illustrates how this procedure operates. It shows how the search for “ape” at the root node, reduces to searching for “pe” on the child, which reduces to searching for “e” on the grandchild, which reduces to searching for the empty word.
Fig. 6.16 Searching for the word “ape” in the prefix tree initially shown on Fig. 6.14, shortened for the sake conciseness.#
Implementation in Scala
We implement the case where the node is a branch (see Step 2.a) in our Branch class as follows. In Scala (and some other languages) the first character of a string is named “head”, and the rest is named “tail”.
1class Branch(symbol: Char, val children: Map[Char, Node]) extends Node(symbol):
2
3 def contains(word: String): Boolean =
4 children.get(word.head) match {
5 case None => false
6 case Some(n) => n.contains(word.tail)
7 }
The case where the node is a leaf (see Step 2.b) goes as follows:
1class Leaf(symbol: Char) extends Node(symbol):
2
3 def contains(word: String): Boolean =
4 return word.isEmpty && symbol == '⊥'
Finally, we complete our PrefixTree with the following implementation:
1class PrefixTree:
2
3 private var _root: Node = new Branch('ε', Map())
4
5 def contains(word: String): Boolean =
6 _root.contains(word)
- Why Is This Correct?
Because
How Efficient Is This?
Let us consider searching for a word \(w=(c_1,\ldots,c_n)\). How much time does it require? In the best case, we deduce straight in the root node that the word \(w\) is not in the tree, because the root has no children whose symbol matches its character, \(c_1\). Besides, if branch nodes are implemented using hashing, that local search runs in \(O(1)\). If it were implemented with a linear search, it would run in \(O(|\mathcal{A}|)\) where \(\mathcal{A}\) represents the alphabet over which the words are built.
What is the worse case? The worst case occurs when we search for the longest word that exists in the tree. This longest word is stored along the longest branch of the tree, and, in that case, we have to check every node along that long branch. The would give us a worst case runtime that is linear to the length of the word \(w\), that is \(O(n)\). Again if searching for specific child is implemented using a local search, then, the search would run in \(O(n \cdot |\mathcal{A}|)\)
As for the memory, the recursive implementation I showed above would require at \(O(n)\) memory, as each node would place a new call to progress further down the tree.
6.1.4. Prefix Search#
To find all the words that start with a given prefix \(p=(c_1, \dots, c_n)\) we navigate the tree from the root according to the character in the given prefix.
