Version: 0.4.24

5. B-Trees#

Lecture:

Lecture 5.5 (slides)

Objectives:

Understand what is a B-tree, and how and when it helps reduce disk usage

Concepts:

Disk-access Model (DAM), B-Tree principle, insertion and deletion in a B-Tree

Implementation:

in Ruby btree.rb

5.1. External Storages#

So far, our assumption has been that our algorithms execute on the RAM: A machine with a CPU, an I/O device and an infinite amount of memory. Remember, however, that this is a gross simplification: A real computer has only a limited amount of memory available. This memory is organized into various layers.

5.1.1. The Memory Hierarchy#

As shown on Fig. 5.10, the fastest are the tiny CPU registers, with lightning-fast access. Further away from the CPU are the caches (L1, L2 and L3), with larger capacity but slightly longer access times. Even further away are the RAM, then the hard drive, and finally other remote storages. The rule is: the further away from the CPU, the slower, but the larger.

../../_images/memory_hierarchy.svg

Fig. 5.10 The memory hierarchy: From fast tiny CPU registers to slow but large external storages#

5.1.2. Example: Hard Disk Drives#

Let’s look at an example: The old-fashioned hard disk drive. On a hard drive, data are stored on magnetic discs called “platters”. Each hard drive contains several platters mounted on a rotating axis, as shown on Fig. 5.12. A separate read head can read from each platter.

../../_images/hard_disk.svg

Fig. 5.12 High level structure of a hard disc: Multiple “platters” mounted on a rotating axis.#

Each platter is divided into concentric circles called tracks, and each track is divided into sectors. The basic read-write unit is the sector, which generally represents 512 bytes. The reading arm can move from the center of the disk to the outside, and as the disk can also rotate, this makes every sector reachable. Note that the read heads of the different platters are not independent: one cannot move one head without also moving the others. The disk “firmware” decides how the data are written to the different platters.

../../_images/platter.svg

Fig. 5.13 Subdivisions of a platter into tracks and sectors#
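To get a feel for these subdivisions, here is a back-of-the-envelope capacity computation. The geometry below (4 platters, 10,000 tracks per platter, 500 sectors per track) is made up for illustration and does not describe any real drive; only the 512-byte sector size comes from the text above.

```ruby
# Hypothetical disk geometry (illustrative numbers, not a real drive).
SECTOR_SIZE = 512          # bytes: the basic read-write unit
SECTORS_PER_TRACK = 500
TRACKS_PER_PLATTER = 10_000
PLATTERS = 4

capacity = SECTOR_SIZE * SECTORS_PER_TRACK * TRACKS_PER_PLATTER * PLATTERS
puts "Capacity: #{capacity / (1024.0**3)} GiB"   # about 9.5 GiB
```

Real drives pack more sectors on the outer (longer) tracks, so actual geometries are more complicated; this is exactly the kind of detail the firmware hides behind logical addresses.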

The challenge with accessing hard drives is the necessary mechanical motion. The firmware converts the “logical address” (i.e., known by the machine/OS) into a “physical address” (only known by the firmware) and controls the read head and the rotation of the disks accordingly. There are three steps:

  1. Move the arm to the desired track: This is the “seek time”.

  2. Once the read heads are over the right track, the disk must rotate to bring the right sector under the head: This is the rotational delay.

  3. Read the data from the sector(s): This is the read time or transfer time.

These delays add up and result in a very large latency, which explains the huge difference between a CPU register, whose access time is in nanoseconds, and a hard drive, whose access time is in milliseconds. The ratio is about the same as between a slug and an F-16 aircraft.
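The gap can be quantified with rough, illustrative numbers (say 1 ns for a register access and 5 ms for a random disk access; actual figures vary with the hardware):

```ruby
register_access = 1e-9   # seconds, ~1 ns (illustrative)
disk_access     = 5e-3   # seconds, ~5 ms (illustrative)

ratio = disk_access / register_access
puts "A disk access is #{ratio.round} times slower"   # 5 million times
```

A factor of millions is what justifies the cost model of the next section: against such a gap, CPU work is essentially free.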

5.2. The Disk-access Model (DAM)#

The memory hierarchy of a real computer is very complex and not relevant to the design of algorithms and data structures. To reason about algorithms and data structures that exploit an external storage, we use the disk-access model (DAM).

When using the DAM, the objective is to design algorithms that minimize the number of disk accesses, because they take much longer than regular “main memory” accesses.

Important

The disk-access model is only useful to reason about algorithms that use an external storage, such as database engines.

../../_images/dam.svg

Fig. 5.14 The disk-access model: Adding an external storage to the RAM#

Fig. 5.14 shows the DAM architecture. It closely resembles the RAM, except for

  • The memory is not infinite: We only have \(n\) memory cells.

  • The system is equipped with an external storage, an additional memory. Only this additional memory has an infinite capacity.

  • The CPU cannot directly access this external storage: It has to load blocks into a dedicated area of memory of \(k\) cells (from cell \(n-k\) to cell \(n\)) using the specific instructions READ_BLOCK and WRITE_BLOCK.

The DAM uses all the instructions from the RAM, but adds two new ones, namely READ_BLOCK and WRITE_BLOCK, which move data between the memory and the external storage. Table 5.7 below summarizes them, along with the cost model.

Table 5.7 The DAM instruction set#

| Code | Instruction | Description | Cost |
|------|-------------|-------------|------|
| 1 | LOAD <constant> | Set the ACC register with the given constant | 0 |
| 2 | ADD <address> | Add the value contained at the given address to the ACC register | 0 |
| 3 | SUB <address> | Subtract the value contained at the given address from the ACC register | 0 |
| 4 | STORE <address> | Write the content of the ACC register into the memory at the given address | 0 |
| 5 | PRINT <address> | Send the value contained at the given address to the I/O device for display | 0 |
| 6 | READ <address> | Read a value from the I/O device and store it in memory at the given address | 0 |
| 7 | JUMP <address> | Set the IP register with the given address, if and only if the ACC register contains 0 | 0 |
| 8 | READ_BLOCK <address> | Load a \(k\)-block from the external storage to the memory. The block is written at a fixed location at the end of the memory, from memory cell \(n-k\) to cell \(n\) | 1 |
| 9 | WRITE_BLOCK <address> | Write the last \(k\) cells of memory to the external storage at the given address | 1 |
| 10 | HALT | Stop the machine | 0 |

Traditionally, the cost model associated with the DAM is that all instructions cost nothing, except those that access the external storage. This reflects the fact that the difference in access time is so huge that we can consider all CPU computation as free and focus on minimizing the number of external storage accesses.
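This cost model can be captured by a tiny bookkeeping class: only READ_BLOCK and WRITE_BLOCK increase the cost, everything else is free. This is a sketch of the accounting only, not a full DAM simulator; the class and method names (`DamCost`, `read_block`, `write_block`, `cpu_op`) are made up to mirror the instructions above.

```ruby
# A minimal cost accountant for the DAM: CPU work is free,
# only transfers to and from the external storage are counted.
class DamCost
  attr_reader :cost

  def initialize
    @cost = 0
  end

  def read_block(_address)   # one disk access
    @cost += 1
  end

  def write_block(_address)  # one disk access
    @cost += 1
  end

  def cpu_op                 # LOAD, ADD, STORE, ... cost nothing
    @cost
  end
end

dam = DamCost.new
3.times { dam.cpu_op }   # arbitrary CPU work: free
dam.read_block(42)       # fetch one k-block
dam.write_block(42)      # write it back
puts dam.cost            # prints 2
```

When we analyze B-tree operations below, this is the only quantity we track: the number of blocks read and written.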

5.3. B-Trees#

The B-tree (shorthand for broad or balanced tree) is probably the most common data structure designed to minimize disk accesses. It is used by many database engines to implement indexes, for instance.

5.3.1. The Structure of a B-tree#

Like the binary search tree (BST), a B-tree implements the ordered set ADT. That is, a B-tree represents a set of things ordered in some predefined way. A time series (a set of observations indexed by time/date) is a good example of an ordered set. If we were measuring the temperature, we could get something like Table 5.8 below. These temperature readings are naturally ordered by time, but there cannot be two average temperatures on the same day: It is an ordered set.

Table 5.8 A sample time series, showing daily average temperature#

| Date | Temperature |
|------|-------------|
| Sep. 12 2024 | 15 °C |
| Sep. 13 2024 | 14.5 °C |
| Sep. 16 2024 | 9.3 °C |
| Sep. 17 2024 | 10.7 °C |

In a B-tree, each node has a limited number of children: Up to a chosen value \(k\). Fig. 5.15 shows an example where \(k=4\). Note that only the leaf nodes carry actual information (e.g., temperatures in our previous example); the other nodes only carry up to \(k-1\) keys.

Just like for a BST, the children of a B-tree node obey an ordering rule. Consider a node that has \(n\) keys \(K = (k_1, k_2, \ldots, k_n)\). Its children \((n_1, n_2, \ldots, n_{n+1})\) are organized such that the child \(n_i\) at position \(i\) only carries keys \(k'\) such that \(k_{i-1} \leq k' < k_{i}\).

../../_images/btree.svg

Fig. 5.15 The structure of B-tree, where \(k = 4\)#

For instance, Fig. 5.15 shows a B-tree where the root has two keys, namely 134 and 456. The first child (Node 2) contains all the keys below 134, the second child (Node 3) contains all the keys from 134 up to 456, and the third child contains all the keys larger than that.
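This ordering rule is what makes searching possible: at each internal node, we scan the keys to find which child's range contains the target. A minimal sketch, with the node's keys held in a sorted array (the function name `child_index` is made up for illustration; it uses the keys 134 and 456 from Fig. 5.15):

```ruby
# Given the sorted keys of an internal node, return the index of the
# child whose (half-open) range contains the search key.
def child_index(keys, key)
  keys.each_with_index do |separator, i|
    return i if key < separator   # key falls before this separator
  end
  keys.size                       # key >= all separators: rightmost child
end

puts child_index([134, 456], 7)    # => 0 (first child: keys below 134)
puts child_index([134, 456], 200)  # => 1 (second child: 134 up to 455)
puts child_index([134, 456], 999)  # => 2 (third child: 456 and above)
```

In btree.rb this scan happens on every level of the tree, but since a whole node fits in one block, each level costs only one READ_BLOCK.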

Important

B-tree is a generalization of the concept of binary search trees, a kind of “n-ary” search tree. There are two key differences, however:

  1. In a B-tree, each node has at most \(k\) children, whereas a node has at most 2 children in a BST. These children are “separated” by \(k-1\) keys, as opposed to one single key in a BST. If \(k=2\), then a B-tree becomes a BST (at least for the search part).

  2. In a B-tree, only the leaf nodes carry information. The internal nodes do not, but only carry (or compute) keys.

What value for \(k\)?

How can we decide how many keys/children a node can have at most? Recall that a B-tree is designed to minimize the number of disk accesses. To do that, we choose a value \(k\) such that \(k\) keys can be loaded from the external storage in one instruction (i.e., READ_BLOCK). This way, when traversing the tree, we only trigger as many READ_BLOCK instructions as there are levels in the tree. More formally, if the tree contains \(n\) entries, we only need \(O(\log_k(n))\) disk accesses. The direct consequence is that B-trees have a large branching factor and are shallow, with as few levels as possible.
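To make this concrete with made-up numbers: suppose a disk block holds 4096 bytes and each key/child-pointer pair takes 16 bytes (both figures are illustrative assumptions, not values from the lecture). One READ_BLOCK then fetches a node with \(k = 256\) children, and the depth of the tree is about \(\lceil \log_k(n) \rceil\):

```ruby
block_size = 4096          # bytes fetched by one READ_BLOCK (illustrative)
entry_size = 16            # bytes per key + child pointer (illustrative)
k = block_size / entry_size
n = 1_000_000_000          # one billion entries

# log_k(n), computed via the change-of-base formula
levels = (Math.log(n) / Math.log(k)).ceil
puts "k = #{k}, depth = #{levels}"   # k = 256, depth = 4
```

Four disk accesses to find one entry among a billion: this is why B-trees are shallow and broad rather than deep and binary.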

5.3.3. Insertion#

The insertion in a B-tree also resembles the insertion in a BST, but…

Important

A B-tree is a self-balancing tree (as is the AVL tree), and that’s where the resemblance stops. To ensure a B-tree remains “balanced”, we enforce the following constraint:

  • Every node (except the root node) must have no less than \(\lfloor k/2 \rfloor\) children.

Every insertion and deletion must guarantee that this constraint is satisfied, for the tree to remain balanced.

To insert a new entry in a B-tree, we proceed as follows:

  1. We start at the root node.

  2. We search for a child whose key range contains the new key to insert

    1. If that child is a leaf node

      • We insert the new entry into that node

    2. Otherwise,

      • We continue inserting from that child (see Step 2)

  3. If the node where we have inserted is “overflowing” (i.e., it has \(k+1\) children), we split it in two.
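The steps above can be sketched at the leaf level. This is a simplified model, not the btree.rb implementation: a leaf is just a sorted array of keys, and splitting returns the two halves plus the separator key to push into the parent (the name `insert_into_leaf` and the constant `K` are made up for illustration).

```ruby
K = 4  # maximum number of entries per leaf in this toy example

# Insert `key` into the sorted array `leaf`. If the leaf overflows
# (more than K entries), split it and return [left, right, separator],
# where `separator` is the key the parent must record for the new node.
# Otherwise return [leaf, nil, nil].
def insert_into_leaf(leaf, key)
  index = leaf.index { |x| x > key } || leaf.size
  leaf.insert(index, key)                 # keep the leaf sorted
  return [leaf, nil, nil] if leaf.size <= K

  middle = leaf.size / 2
  left  = leaf[0...middle]
  right = leaf[middle..]
  [left, right, right.first]              # parent records right's first key
end

p insert_into_leaf([23, 42], 30)
# => [[23, 30, 42], nil, nil]             (no overflow)
p insert_into_leaf([23, 30, 42, 59], 65)
# => [[23, 30], [42, 59, 65], 42]         (overflow: the leaf is split)
```

When a separator comes back, the parent inserts it among its own keys by the same mechanism, which is how the splitting can cascade upward.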

Fig. 5.17 illustrates how we split a node that overflows. We consider here a small B-tree (\(k=4\)) where the root directly contains two leaves, namely Leaf 1 and Leaf 2. Before the insertion, Leaf 2 is “full” as it already contains four entries. We now insert a new entry \((65, \dots)\). It lands in Leaf 2, which now has 5 entries. Since it is overflowing after this insertion, we must re-balance the tree and ensure every node has no more than 4 children (or entries for leaf nodes).

To fix this issue, we split Leaf 2 in two, by moving half of its entries into a new node (Leaf 3). Consequently, we add a new key (59) to the root (Node 2) so that it knows the key range of that new node. Every node now has between 2 and 4 children, as it must.

Note that the parent, who gets a new child, may overflow as well and the “splitting” may propagate all the way to the root of the tree.

../../_images/split.svg

Fig. 5.17 Splitting a node that overflows. Here we split a leaf node that has too many entries.#

Why Is This Correct?

How can we show the tree is always balanced after an insertion? As is often the case, we can show it by induction over a series of insertions. Let’s start with the base case, where the tree is empty. We will then look at the non-empty case right after.

When the tree is empty, its root is a leaf node that contains no entry. This is valid, because only the root node is allowed to have fewer than \(\lfloor k/2 \rfloor\) entries. In that case, the insertion adds a new entry and the root node now has a single entry.

If the tree is not empty, we know the insertion will necessarily first add an entry to a leaf node, say \(n\) (only leaf nodes carry entries). There are several scenarios here:

  • This node \(n\) has room for another entry (it has at most \(k-1\) entries). The tree is still valid with the new entry.

  • This node is already full, so the insertion makes it overflow: It necessarily ends up with \(k+1\) entries and gets split into two valid nodes, each with at least \(\lfloor k/2 \rfloor\) entries. (Note that if a node has \(k+1\) entries, splitting it into two non-empty nodes necessarily yields two nodes with fewer than \(k\) entries.) What could happen is that its parent, which gets a new child, may also overflow, and will get split as well. This splitting process can cascade all the way to the top, where the root node gets split too, in which case a new parent is created to gather the two nodes.

Since the insertion adheres to the constraint when the tree is empty and also preserves it whenever it held before the insertion, we can conclude that the insertion guarantees that every node will have no more than \(k\) children/entries.

How Efficient Is It?

We see that to insert an entry in a B-tree, we must first locate the leaf node that will carry the new key-value pair. This process requires checking all the nodes along the single path from the root to that leaf node. There will therefore be \(O(\log_k(n))\) read-accesses, in order to load these nodes from the external storage. Possibly these nodes may be split and written back to the storage, which would cost another \(O(\log_k(n))\) write-accesses. Altogether, we get:

\[O(\log_k(n)) + O(\log_k(n)) = O(\log_k(n))\]

5.3.4. Deletion#

To delete an entry in a B-tree, we will have to make sure that every node in the tree keeps at least \(\big\lfloor\frac{k}{2}\big\rfloor\) children/entries.

To delete a key, we proceed as follows:

  1. We start at the root node.

  2. We search for the child whose key range contains the key to delete

    1. If that child is a leaf node, but does not contain the key

      • The key is not in the set

    2. If that child is a leaf node and does contain the key

      • We search locally and remove the corresponding entry

    3. Otherwise (for internal nodes),

      1. We delete from that child (see Step 2)

      2. If that child is now “underflowing” (i.e., it has strictly fewer than \(\big\lfloor\frac{k}{2}\big\rfloor\) children)

        1. If that child has a preceding/following sibling \(s\) with some extra keys

          • We take a key from that sibling node. If the sibling \(s\) is on the left, we take its largest key; if it is on the right, we take its smallest key.

        2. Otherwise

          • If all siblings have exactly \(\big\lfloor\frac{k}{2}\big\rfloor\) children, we merge with either the preceding or the following sibling. The original child and its merged sibling are deleted, and a new node with all their keys replaces them.
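The two repairs in Step 2 can be sketched on plain arrays of keys. This is a simplified model, not the btree.rb implementation: `MIN` stands for \(\big\lfloor\frac{k}{2}\big\rfloor\) with \(k = 4\), the name `repair_underflow` is made up, and only the right-sibling case is shown.

```ruby
MIN = 2   # floor(k/2) with k = 4: minimum entries per non-root node

# Repair an underflowing leaf (fewer than MIN keys) using its right
# sibling: steal the sibling's minimum key if it has a spare one,
# otherwise merge the two leaves. Returns [node, sibling_or_nil].
def repair_underflow(node, sibling)
  if sibling.size > MIN
    node << sibling.shift          # steal the sibling's minimum key
    [node, sibling]
  else
    [node + sibling, nil]          # merge: the sibling disappears
  end
end

# Stealing: the sibling has a spare key
p repair_underflow([12], [27, 35, 48])   # => [[12, 27], [35, 48]]
# Merging: the sibling is already at the minimum
p repair_underflow([12], [27, 35])       # => [[12, 27, 35], nil]
```

In the merge case the parent loses a child, so the same repair must then be applied one level up, which is how the process can cascade toward the root.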

Let’s see examples of this deletion procedure. There are two main scenarios: Either we can “steal a key” from a sibling, or we merge with a sibling.

../../_images/borrow.svg

Fig. 5.18 Fixing underflowing nodes by stealing a key from a sibling. Here we steal the minimum key of the following sibling.#

Fig. 5.18 illustrates the process of stealing a key from a sibling. Again we assume a B-tree where \(k = 4\). At the start, Node 2 (the root) has two children, namely Leaf 1 and Leaf 2. Leaf 1 has exactly \(\big\lfloor\frac{k}{2}\big\rfloor = 2\) keys, whereas Leaf 2 has 4 keys. We then delete key 27, which is carried by Leaf 1, which is now left with a single key and therefore breaks the B-tree rule.

To repair this, we can steal from the following sibling, since it has some extra keys. We thus steal its minimum key (had we borrowed from the preceding sibling, we would have stolen its maximum key). As a result, Leaf 1 has 2 keys and Leaf 2 has 3, and both adhere to the constraint of the B-tree. Note also that the key that separates Leaf 1 from Leaf 2 in Node 2 has been updated.

Let’s now look at an example where we merge with a sibling node (Step 3.b.2.b), as shown on Fig. 5.19 below.

../../_images/merge.svg

Fig. 5.19 Fixing underflowing nodes by merging with a sibling node.#

We start with a root node that has three leaves: Leaf 1 and Leaf 2 have exactly \(\big\lfloor\frac{k}{2}\big\rfloor = 2\) keys, whereas Leaf 3 has three keys. Now, we delete key 27, which is carried by Leaf 1. As a result, Leaf 1 is underflowing, because it now has only 1 key.

To fix that, we cannot steal from sibling nodes: Leaf 1 has only one sibling, Leaf 2, which has no extra key. Our only solution is thus to merge Leaf 1 and Leaf 2. (Since we only merge nodes that have exactly \(\big\lfloor\frac{k}{2}\big\rfloor\) keys, we cannot end up with a new node that has more than \(k\) keys.) That gives us a new node, Leaf 4, with 3 keys (one from Leaf 1 and the two from Leaf 2). We update the key recorded in the parent (Node 2) and we are done. Note that by merging two nodes, the parent loses one child and may therefore underflow as well. In the worst case, the merging process cascades all the way to the root, which is the only node allowed to have fewer than \(\big\lfloor\frac{k}{2}\big\rfloor\) children.

Why Is This Correct?

What would be a correct deletion procedure? It has to obey the specification of the deletion in an ordered set ADT, but it also has to guarantee that it does not break the B-tree constraint: Every node but the root must have at least \(\big\lfloor\frac{k}{2}\big\rfloor\) entries/children.

Let’s see why the latter holds. We assume that before the deletion, all nodes have at least \(\big\lfloor\frac{k}{2}\big\rfloor\) entries/children. Our deletion procedure starts by locating the leaf that carries the entry to delete. Initially, it therefore affects only one leaf node.

If that leaf node is the root, and the root underflows, the rule is not broken, because only the root node is allowed to have fewer than \(\big\lfloor\frac{k}{2}\big\rfloor\) entries.

If it is not the root, the parent node checks and corrects any underflow situation. If a key is stolen from a sibling, then nothing else happens and the tree is left in a valid state. When the node is merged with either its preceding or its following sibling, the parent node loses a key and may underflow as well. If that happens, there are two cases:

  1. If the parent is the root node, then the underflow does not matter, as the root is allowed to have fewer than \(\big\lfloor\frac{k}{2}\big\rfloor\) children.

  2. If the parent is an internal node, then its own parent detects and corrects the underflow, which can be resolved either by stealing from or by merging with sibling nodes. In the worst case, the merging process cascades all the way up to the root node, which is allowed to underflow.

How efficient is it?

In the worst case, the entry we delete makes the leaf node underflow, and that leaf has no sibling with extra keys from which we could steal, so we have to merge. Worse, this merge makes the parent underflow as well, and the merging process cascades all the way up to the root node.

So to locate the leaf node, we have to load from the external storage all the nodes along the path from the root to the leaf that carries the target key. That is as many nodes as there are levels in the tree, so \(O(\log_k(n))\) READ_BLOCK instructions. Besides, at every level, a node may underflow and must be merged with one of its siblings. So at every level a node has to be updated and written back to the external storage, that is another \(O(\log_k(n))\) WRITE_BLOCK instructions. Altogether, that gives us a total of \(O(\log_k(n))\) disk accesses.