Merkle Trees Implementation In Git and Bitcoin

Tree DataStructures

Tree data structures are widely used in computer science. This hierarchical data structure takes the form of an inverted tree, where the root of the tree is the topmost level, and the leaves follow downwards. The way the data is stored in trees makes it easier to retrieve information and the process is therefore more efficient especially when large data sets are used.

The building blocks for the tree data structure include a node, which is essentially a container that stores data. It's important to note that this concept is very different from nodes in Bitcoin.

The hierarchy of data storage in a tree structure begins with the root node, from which all other nodes originate. The root node serves as the parent node to all other nodes in the tree. Nodes that have children are referred to as parent nodes, while child nodes are those derived from parent nodes. Leaf nodes are nodes without any children. In trees, each node must have a reference of its parent, making the traversal process efficient.

Hashing

Hashing is the process of taking data in one form and then converting it into another state using an algorithm. There are many types of hashing algorithms the most widely used one being SHA256.

Hashing makes it easy to secure information and retrieve data in large data sets. Its security stems from the use of one-way hash functions, which means that data can only flow in one direction. Once a hash is generated from a value, it's impossible to use that hash to generate the original value that was hashed.

As an example let's hash my name using Python :)

A good hashing algorithm should reduce the chances of producing similar hash values for different inputs. The SHA-256 algorithm consistently produces a 256-bit value and is widely adopted for its security; it has not been compromised yet." Following vulnerabilities found in SHA-1, SHA-256 emerged as the preferred alternative.

Merkle Trees

Merkle tree is a type of tree data structure that stores information in a hashed format. Hash functions are used to produce values, which are then stored in the nodes. When data is hashed in a Merkle tree, any alterations to the values will modify the value of the node, consequently affecting the hash of the root node.

To compute the hash of a parent node, the hashes of the children nodes are combined, and the resulting value is then recursively hashed again. This property makes it easy to spot changes and even identify exactly where the changes have occurred.

Merkle Trees in Git

Have you ever wondered how Git keeps track of every change in each file? Git utilizes Merkle trees to achieve this.

The content of each file is hashed with a SHA-1 algorithm (there are efforts in place to change this to SHA-256 to minimize the chances of a collision occurring)[source]. This hash acts as a unique identifier for the file's content.

Git stores data in the form of snapshots for changes being introduced. Each commit stores object metadata in the staging area of a commit and a pointer to the the hash of the parent commit. The first commit doesn't have parent commits but will contain all the metadata for the changes[reference]. This approach significantly reduces storage requirements, especially for large files with minimal changes. Any changes made to a file or commit metadata would alter its hash, ultimately changing the Merkle tree root hash therefore ensuring the integrity of commits and consistency with the changes.

Let's see how a commit tree in Git works in practice (we'll cover specific Git commands in a later section.

Initial Commit: create an initial commit to stage changes to a file "README.md".This commit object will store the hash of the file content and won't have a parent commit since it is the first commit.

Modifying a File: Modify a file like "index.html" and stage the changes. Git will create a new commit object for the changes. This new commit hash will store the hash of the content as well as a pointer to the hash of the previous commit

Adding Tests: Add tests to a file "test.py" and stage those changes. Another commit object is created. This commit object will store the hash of the directory containing the file changes and a pointer to the previous commit hash.

Merkle trees in Bitcoin

In Bitcoin, transactions are typically hashed, and the output serves as the transaction ID (a unique identifier for each transaction). The root is created when pairs of transactions in each level are hashed together until only one hash remains.

If a block has 4 transactions (A, B, C, D) the Merkle tree for this would have 3 Levels: Level 0, Level 1, and Level 2.

Level 2 (Transaction Hashing)

The process starts at the bottommost level, Level 2, and moves upwards. Transaction hashes are ordered in a way that the coinbase transaction is hashed first, followed by the other transactions. At this level, the nodes representing each transaction will contain only the transaction hashes.

Using RPC, let's create 3 transactions. But first, use this guide to set up and run bitcoind.

Generate a new address

bitcoin-cli getnewaddress

sample output: bcrt1qeewsynz9dekxvkfe6k40933uqvkj94wfrjy63r

Create 3 transactions.

bitcoin-cli sendtoaddress "bcrt1qeewsynz9dekxvkfe6k40933uqvkj94wfrjy63r" 0.1
bitcoin-cli sendtoaddress "bcrt1qeewsynz9dekxvkfe6k40933uqvkj94wfrjy63r" 0.1
bitcoin-cli sendtoaddress "bcrt1qeewsynz9dekxvkfe6k40933uqvkj94wfrjy63r" 0.1

Level 0 sample output (the 4th transaction is the coinbase transaction):

b118dbb77bc1f6cc1e9ce9b38160b606148afbaa2b201b5f6b16bd56fb66dd88

9635bafeaae0543389abc28a4c8fa8e996d15a9e63623d8f0d8b85aeaec4e730

957b9b364f7240b76f7b9d935c9be5ae15a3d275750302cdc31cef99d9093277

22703384223e7bc359bf45273419c61f3c3fa8c18408375cd620d3a66a8d9d1f

Level 1 (Paired Hashes)

The next level is Level 1, which will be the construction of two parent nodes. At this level, transactions are paired together and concatenated, and then the output of the concatenation is hashed. In our example below, we will use a Python script to create the parent nodes. The first and second parent nodes will be:

Transaction 1_2 = Hash_1 + Hash_2

Hash 1_2 = sha256(sha256(Transaction 1_2))

and

Transaction 3_4 = Hash_3 + Hash_4

Hash 3_4 = sha256(sha256(Transaction 3_4))

we will generate 2 parent nodes using the transaction hashes generated in level 2 and using this [Python script[reference](stackoverflow.com/questions/67355203/how-to...


txids = [
    "22703384223e7bc359bf45273419c61f3c3fa8c18408375cd620d3a66a8d9d1f",
    "b118dbb77bc1f6cc1e9ce9b38160b606148afbaa2b201b5f6b16bd56fb66dd88",
    "9635bafeaae0543389abc28a4c8fa8e996d15a9e63623d8f0d8b85aeaec4e730",
    "957b9b364f7240b76f7b9d935c9be5ae15a3d275750302cdc31cef99d9093277"
]

from binascii import unhexlify, hexlify
import hashlib

def double_hash(txids):
    """Calculates the Merkle root hash for a list of transaction IDs."""
    while len(txids) > 1:
        new_txids = []
        for i in range(0, len(txids), 2):
            left = unhexlify(txids[i])[::-1]
            right = unhexlify(txids[i + 1])[::-1] if i + 1 < len(txids) else left
            hash_pair = hashlib.sha256(hashlib.sha256(left + right).digest()).digest()
            new_txids.append(hexlify(hash_pair[::-1]).decode())
        txids = new_txids
        print(new_txids)
double_hash(txids)

The sample output from the script above will be a list containing two hashes for the parent nodes.

['0bcfb5fa746b29919d758513b235b6710e087068e661712190c5a198755aec35', 'f1b05e771ff0f42986174d2312ab95159d51212184aef622586e7fc7ef39ef6f']

Level 0(Merkle Root)

This is the topmost level and is the final step in the Merkle root creation process. At this stage, the hashes from the two parent nodes on the left and right sides are concatenated, and the resulting value is hashed again.

Hash 1_2_3_4 = Hash 1_2 + Hash 3_4

sha256(256(Hash 1_2_3_4))

On this level, we will recursively call the double_hash function above using the hashes from level 1.

def double_hash(txids):
    """Calculates the Merkle root hash for a list of transaction IDs."""
    while len(txids) > 1:
      #.....sample 
    return txids[0]  # Return Merkle root

merkle_root = double_hash(txids)
print("Merkle root:", merkle_root)

sample output:

Merkle root: 0bcfb5fa746b29919d758513b235b6710e087068e661712190c5a198755aec35

If changes occur in any of the transactions included in the tree above, this will result in a change in the transaction hash, which will then affect the hash of the parent node and ultimately alter the Merkle root hash itself. The order of transactions also matters; meaning any changes in the transaction order will lead to a different Merkle root hash. This mechanism ensures the integrity of the Merkle tree, making it highly reliable.

To verify the Merkle root output above, we first need to fetch the block hash from the transaction details and then use it to retrieve the block information.

bitcoin-cli -regtest gettransaction 957b9b364f7240b76f7b9d935c9be5ae15a3d275750302cdc31cef99d9093277

Sample Output:

{

"fee": -0.00000141,

"confirmations": 101,

"blockhash":"33b3d7b049e4fd57811e1bcc0e340690b44ab055532618c37212a5157044a488"

........}

The Merkle root is then stored in a block header. The metadata in a block header essentially describes a block. This means that just the Merkle root can be used to verify the transactions in a block without necessarily having to check each transaction individually.

Use the block hash above to check if the output of the Merkle root of the function above Is the same as the output from the RPC `getblock`

To check the Merkle root value of a block using RPC, first, we need to obtain the block hash from the transaction details. This hash represents the block with the most proof of work. Then, we use this hash to fetch the block details. getblock "blockhash"

bitcoin-cli getblock "33b3d7b049e4fd57811e1bcc0e340690b44ab055532618c37212a5157044a488"

Below is a sample output:

The Merkle root constructed from the script above should match the Merkle root obtained using the RPC call.

Merkle Proof

The root node can be used to verify the existence of a hash in a tree, and the verification path for this process is what makes up a Merkle proof.

Given this transaction hash: 22703384223e7bc359bf45273419c61f3c3fa8c18408375cd620d3a66a8d9d1f and the

Merkle root:

1a8ca9e12dcdcea10ee9f19d75f9a2cf6d83d92c67a5b60654b2bf05bcacac20,

how can we verify that this transaction exists in a given block?

To form the Merkle proof for transaction 1, we need to include the necessary information that allows for the construction of the path from A to the Merkle root.

This includes:

The hash of Transaction 1 (the transaction you're providing proof for).
Sibling Hash: The hash of Transaction 2 (957b9b364f7240b76f7b9d935c9be5ae15a3d275750302cdc31cef99d9093277)Left, because it's paired with 1 to produce the next level hash.
The next level sibling hash: The hash resulting from hashing transactions 3 and 4 together (0bcfb5fa746b29919d758513b235b6710e087068e661712190c5a198755aec35)Right, as it's needed to move up the tree.

With these pieces, anyone can verify that transaction 1 is part of the Merkle tree with the given root:

Conclusion

Merkle trees play a vital role in securing data integrity and enabling efficient verification in various systems, particularly in Bitcoin.

Other Interesting use cases include AWS dynamo DB and Content delivery networks.

To read more on Merkle trees check out the resources below.

https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell

https://en.wikipedia.org/wiki/Merkle_tree#:~:text=In%20cryptography%20and%20computer%20science,labels%20of%20its%20child%20nodes.

https://medium.com/geekculture/understanding-merkle-trees-f48732772199

https://bitcoin.stackexchange.com/questions/10479/what-is-the-merkle-root