by Patrick Niemeyer (pat@pat.net)
The story of modern cryptography is one of taking arcane and abstract ideas from the province of research and making them into tangible building blocks that can be used by developers and end-users in their daily lives. Never has this been more true than in the age of blockchains, which have dramatically accelerated development in fields such as zero-knowledge proofs and homomorphic encryption over the past decade and a half.
This article is about one such building block: polynomial commitments. While there are many resources that explain the theory and mathematics of this topic (links below), this is intended as a more accessible, practical guide for developers who want to know what polynomial commitments can do and how they might incorporate them into their own projects, whether in the crypto space or in other fields. We will offer high-level explanations of what’s going on in the math, but with the aim of helping you understand the principles, data types, and naming conventions used in the APIs.
Specifically, we will be looking at ckzg lib, the EIP-4844 reference implementation of KZG commitments as used in Ethereum. As the name implies, ckzg is a C implementation of KZG commitments, which sits atop the lower level blst elliptic curve math library. While it is a C lib, there are bindings available for many other languages including Python, Go, Rust, and JS. We will be using ckzg from Python and at the end of the article we’ll touch on how you can call into ckzg and blst directly from Python for the cases that aren’t covered by the higher-level API.
At the end of the article I’ll also describe how we are using these libs in our open-source distributed storage project at Orchid Labs and where to find more complete code examples.
The idea of an inclusion proof begins with the concept of a commitment. A commitment to a set of data is a compact identifier that is derived from and binds the data in such a way that it is infeasible to change one without changing the other. Probably the most familiar example of this is a cryptographic hash, which you can think of as a commitment to the bytes that were hashed.
The standard SHA-256 hash depicted above consumes data of any length and produces a 256-bit (32-byte) digest.
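In Python, for example, computing such a commitment takes one line with the standard library:

import hashlib

data = b"data we want to commit to"
commitment = hashlib.sha256(data).digest()
assert len(commitment) == 32  # always 32 bytes, regardless of input length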
More sophisticated types of commitments come with the ability to prove certain properties of the bound data. One example of this is an inclusion proof, which allows anyone who holds the commitment to verify that a particular element was part of the original data set and covered by the commitment. (Note that with the SHA-256 hash, you could prove to someone that some data was included in the hash, but only by revealing all the data, which may be impractical or undesirable.)
The classic form of an inclusion proof is performed using a Merkle tree, which is a binary tree where the leaves are the data elements and the higher level nodes are simply the hashes of their children (hashes of hashes, all the way up).
The interesting thing about this arrangement is that it is possible to prove that a given leaf of data exists within the original tree by providing only the data element itself and a “path” of hashes and sibling hashes leading up to the root. These are indicated in orange in the diagram below (the faded items can be calculated from the children and don’t actually have to be sent).
The sibling hashes stand in for the other data elements and their parents, allowing the verifier to show that the root hash matches without having to know the entire contents (or indeed any of the contents) of the tree in advance. An observer who has no knowledge of the original data can be convinced that a particular data element was included in the set without any external knowledge.
From a privacy standpoint this means that one can prove that a particular data element is represented without revealing unrelated data. From an efficiency standpoint, the proof at worst requires only log(n) hashes (where n is the number of data items in the leaves), which may be much smaller than sending the full set of data.
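Here is a minimal sketch of that verification step in Python (the helper names are our own):

import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Walk the path from the leaf to the root, hashing in each sibling.
# `path` is a list of (sibling_hash, sibling_is_left) pairs.
def verify_merkle_proof(leaf: bytes, path: list, root: bytes) -> bool:
    node = sha256(leaf)
    for sibling, sibling_is_left in path:
        node = sha256(sibling + node if sibling_is_left else node + sibling)
    return node == root

# A two-leaf tree: root = H(H(a) + H(b)); prove inclusion of a.
a, b = b"element a", b"element b"
root = sha256(sha256(a) + sha256(b))
assert verify_merkle_proof(a, [(sha256(b), False)], root)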
A polynomial commitment is a more general form of binding that involves encoding data in a mathematical function: a high-degree polynomial. This is done by finding or implicitly creating a polynomial that passes through data points, “storing” your data elements at consecutive points in its domain (or alternatively by using them as its coefficients). In practice these polynomials do not operate on real numbers as you might be used to, but on the abstract constructs of elliptic curves over finite fields. The practical benefit of that for us is that it allows us to work efficiently with very large but fixed-size (256 bit) integer data types using modular arithmetic.
Polynomial commitments can be used for inclusion proofs with greater flexibility and efficiency. Where a Merkle tree proof requires at minimum the original data plus log(n) hashes (which are 32 bytes each), a polynomial commitment proof can be comparable in size to a single hash. For a commitment comprising thousands of data elements this may be close to an order of magnitude smaller. But that isn’t all: polynomial commitments and their corresponding proofs can be combined in ways that are not possible with Merkle trees, allowing a prover to commit to many sets of data or prove properties about them in a single proof of constant size.
There are a few levels of abstraction hiding in our earlier description of the polynomial commitment, so let’s make them explicit. There is a field of integer (scalar) values that we will be using to represent our data. The sizes of these values (which we’ll talk about in detail in a bit) will determine how we have to break up our data.
There is a curve defined as a mathematical equation over the elements of that field. It is a curve in the sense that there is a function which you could graph with real numbers to produce points and ultimately a curved shape; however, in this case the function is used on the discrete integer values of the field rather than continuous real numbers (and so graphing it would produce something like a discrete “plot” of points rather than a smooth curve).
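To make that concrete, here’s a quick check using the third-party py_ecc library (not part of ckzg) that the standard BLS12-381 generator point satisfies the curve equation y² = x³ + 4 over the field:

from py_ecc.bls12_381 import G1, b  # b is the curve constant, FQ(4)

x, y = G1
# The generator's coordinates satisfy the curve equation in the field
assert y**2 == x**3 + b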
Some of the values that we work with via the library are points on this curve, which means that they correspond to a pair of x and y integer values in the field. Looking ahead a bit, when we talk about both commitments and proofs in the context of KZG commitments we’ll see that those results are represented as points on the curve and how we handle points will determine the data types and size of those commitments and proofs.
Finally, the curve, along with the mathematical construct of a group operation (point addition), gives rise to operations analogous to addition, scalar multiplication, and exponentiation, which are used as the basis for the polynomials you keep hearing about. From the library perspective, the important thing to know here is that these abstract polynomials can be represented as a series of values, which can represent either the coefficients of the terms of the polynomial or the values of the polynomial evaluated at certain points (and in our case it is the latter).
What this means is that a polynomial, as used in the ckzg library, is represented as an array of the fundamental scalar integer data type. That array may be very large, holding thousands of values (representing a polynomial of that very high degree). And that is how a large blob of data of the right size can be treated as a single polynomial.
There are many schemes for polynomial commitments and (infinitely) many curves over which they could operate, but the ones we’ll be discussing here, as implemented by the EIP-4844 ckzg library, are KZG Commitments over a standard elliptic curve called BLS12-381.
The KZG commitment scheme, sometimes called the Kate (Kah-tay) scheme, is named for its inventors Kate, Zaverucha, and Goldberg (2010). It is a practical, well-tested implementation of polynomial commitments that can be used for inclusion proofs and is being adopted by Ethereum and other blockchains to allow more data to be moved off-chain while remaining verifiable on-chain via those inclusion proofs. The basic idea here is that one can commit to a large collection of data in a very compact way, allowing that commitment to be stored on-chain, and then later prove that a particular piece of data was covered by (included in) that commitment by presenting a similarly small proof. This makes possible applications like roll-ups that store the details of transactions off-chain while remaining verifiable to interested parties on-chain. As we’ll mention later, it also makes possible compact proofs of data availability, where a data holder can demonstrate that they have an element of a large collection, even to someone who doesn’t know the contents of the collection.
The “381” in the name of the BLS12-381 curve refers to the approximate size in bits of the values in the field over which the curve is defined. (The prime number defining the field is slightly less than 2^381.) What this means is that the largest x or y integer values that can be represented on the curve are about 381 bits, and any larger value wraps around via modular arithmetic.
The x and y values we’ve been discussing are called scalars, because they are the raw, integer magnitude values that are used with the curve. This is in contrast to points, which are logically two-dimensional pairs of these scalar coordinates. Although we’ve just said that these x and y scalar values on the curve are ~381 bits, in practice we limit our “inputs” to the curve to a smaller range of scalar values – a prime subgroup – of slightly less than 256 bits in size (as defined by another prime number). When we limit ourselves to using integer values in this specific range we end up working with a corresponding subset of points on the curve of the same size.
The BLS curve was chosen specifically because it has this large prime subgroup with the right cryptographic properties, and because the subgroup is of a size that is easy and efficient to work with on modern hardware and software, allowing our data to be represented as 32 byte integers.
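If you’d like to see this subgroup in action, here’s a small check with the third-party py_ecc library (py_ecc calls the subgroup’s prime order curve_order; it is the same prime n we’ll print out below):

from py_ecc.bls12_381 import G1, multiply, curve_order

# Multiplying the generator by the subgroup order wraps all the way
# around to the identity point (represented as None in py_ecc).
assert multiply(G1, curve_order) is None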
To summarize this: When we structure our data we’ll be providing it as 256 bit (32 byte) values, and operations on the curve will render subsequent x and y values that are ~381 bits (~48 bytes) in size.
A point is normally represented as a pair of coordinates (x, y), and a point represented in this way is said to be in affine coordinate format. Of course, since y values are determined by the x value and the equation of the curve, you really only need to store the x value to define a point. Now, in reality elliptic curves are symmetric and each x has both a positive and negative y value, but it only takes one bit to distinguish those. So when we take that into account and round up to the nearest byte, a point on the curve can still be represented within 48 bytes. This is sometimes referred to as a compressed point format, and it is the way we’ll work with points in the ckzg library.
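As a quick illustration of the 48-byte compressed format, again using py_ecc rather than ckzg (py_ecc’s bls helpers include a compatible point compression routine):

from py_ecc.optimized_bls12_381 import G1, multiply
from py_ecc.bls.point_compression import compress_G1

P = multiply(G1, 12345)
compressed = compress_G1(P)  # an integer with the flag bits folded in
assert len(compressed.to_bytes(48, 'big')) == 48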
No matter how many data elements are involved in the corresponding polynomial, a commitment will always be a single point on the curve and can be formatted as exactly 48 bytes. The same is true of a KZG proof, which can be used to demonstrate that the commitment includes a particular data value encoded in the polynomial; the proof is also a single 48 byte point on the curve.
It’s worth saying a few words about what it means to “encrypt” a value using the curve, since that is the fundamental proposition that makes these schemes useful and it will help you understand the following section on the trusted setup.
The core idea here is a simple one: you can take a scalar value and transform it into a point on the curve, but this operation is one-way, like a hash; if you know only the point, it is infeasible to determine the scalar value that produced it.
The operation also has a very special property: it is homomorphic. This means that you can perform operations like addition and multiplication on either the scalars or the “encrypted” points and the relationships will hold. That is, the point generated from the scalar a + b is the same as the point you get by adding the points generated from a and b. Generating points is a one-way function that preserves these arithmetic relationships. This turns out to be a very powerful property for building cryptographic schemes.
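Here is that property demonstrated with py_ecc (a sketch; ckzg performs the equivalent operations internally):

from py_ecc.bls12_381 import G1, add, multiply

a, b_scalar = 12345, 67890
# "Encrypting" a scalar means multiplying the generator point by it
P_a = multiply(G1, a)
P_b = multiply(G1, b_scalar)
# Homomorphism: the point for (a + b) equals the sum of the points
assert multiply(G1, a + b_scalar) == add(P_a, P_b)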
The KZG scheme has a secret that, literally, no one should know. A commitment in KZG is formed by evaluating your supplied polynomial at a secret scalar value that is unknown to anyone. This is possible via some tricks using the homomorphic properties we mentioned earlier: the protocol can publish points corresponding to the series of exponentiated values of the secret scalar, and these points can be combined with your scalar data values to evaluate the polynomial without anyone ever having known the secret scalar (the x value) at which it is being evaluated.
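Here’s a toy illustration of that trick with py_ecc, using a tiny polynomial and a “secret” that we pick ourselves purely for the demonstration:

from py_ecc.bls12_381 import G1, add, multiply, curve_order

tau = 123456789  # the "secret": in the real scheme nobody knows this value
# The protocol publishes only these points: [G, tau*G, tau^2*G, ...]
powers = [multiply(G1, pow(tau, i, curve_order)) for i in range(3)]

# Commit to p(x) = 3 + 5x + 2x^2 using only the published points
coeffs = [3, 5, 2]
commitment = None  # None is the identity point in py_ecc
for c, P in zip(coeffs, powers):
    commitment = add(commitment, multiply(P, c))

# The result is exactly the point for p(tau), computed without using tau
p_at_tau = sum(c * pow(tau, i, curve_order) for i, c in enumerate(coeffs)) % curve_order
assert commitment == multiply(G1, p_at_tau)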
You can think of this like encrypting a message with a public key for which the private key has been thrown away or was never known. (For an interesting aside, take a look at Ethereum single-use addresses and the singleton deployment scheme, which effectively uses keys in that way.)
What makes this blind encryption useful is the homomorphic properties we mentioned: the result of that evaluation still has to obey some mathematical relationships that can be checked, and those will become the basis for the proofs that we’ll be generating later. If someone were to know the secret scalar, they could create another polynomial that could offer inclusion proofs for data elements that are not in the original set. That is, they could forge commitments and proofs.
So the security of the scheme depends on having this series of points derived from a secret value that cannot be known to anyone. That sounds like a heavy lift, or something that would involve a trusted party; however, it turns out that the homomorphic properties of the curve can once again be leveraged to make this one-time setup completely transparent and eliminate trusted entities. The trusted setup involves a ceremony where parties from all around the world generate their own secret scalars, render them to points, and contribute them in that form, where they can be combined to produce the same result as if the individual scalars had been combined first.
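A toy sketch of a single ceremony step, again with py_ecc (a real ceremony repeats this for every power of the secret and includes proofs that each contribution was well-formed):

from py_ecc.bls12_381 import G1, multiply, curve_order

s1, s2 = 1111, 2222   # each participant's private contribution
P = multiply(G1, s1)  # participant 1 publishes only this point
P = multiply(P, s2)   # participant 2 extends it without seeing s1
# The result matches the point for the combined secret, which no
# single participant ever knew
assert P == multiply(G1, (s1 * s2) % curve_order)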
The result of all of this is that the series of points used to evaluate the polynomial is provided by the protocol as data referred to as the trusted setup. This is a large file (about 410KB of data) that the library will need to load in order to perform the commitments and proofs.
OK, now that we have a high-level picture of what’s going on, we can finally get into the code, which should be relatively straightforward. Our goal is to structure our data blob as a series of values that can be treated by the library as a polynomial and used to generate a commitment. Later we can use that commitment to generate proofs that a particular data element in the blob had a specific value.
As you’ve probably noticed from the liberal use of tildes (~), we’ve been careful when talking about the sizes of scalar and point values to make sure that we acknowledge that we are working with ranges that don’t quite line up with nice round byte sizes. In particular the scalar values of the subgroup that we’ll use as “input” to the BLS12-381 curve must be less than the following prime number, which you could say is somewhere between 255 and 256 bits in length:
bls12381_n = 0x73EDA753299D7D483339D80809A1D80553BDA402FFFE5BFEFFFFFFFF00000001
The simplest way to accommodate this is to break our data into chunks that are 31 bytes in length and then pad each with a high byte of zero to a length of 32 bytes. There are more complex schemes, but we’ll stick with this for now.
Next, we will arrange these 32 byte chunks into what we consider our polynomial. The ckzg library expects the polynomial to be 4096 elements long, so that is the size of our array, giving us a blob of 4096 × 32 = 131,072 bytes (128 KiB).
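For real data, as opposed to the random blobs we’ll generate next, a helper along the following lines can do the chunking, padding, and zero-filling. (This is our own sketch, not part of the ckzg API.)

FIELD_ELEMENTS_PER_BLOB = 4096  # the blob size ckzg expects

# Chunk data into 31-byte pieces, left-pad each to 32 bytes with a zero
# high byte, and zero-fill the remainder of the blob.
def data_to_blob(data: bytes) -> bytes:
    assert len(data) <= 31 * FIELD_ELEMENTS_PER_BLOB, "data too large for one blob"
    elements = [
        b'\x00' + data[i:i + 31].ljust(31, b'\x00')
        for i in range(0, len(data), 31)
    ]
    elements += [b'\x00' * 32] * (FIELD_ELEMENTS_PER_BLOB - len(elements))
    return b''.join(elements)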
Let’s write some code to generate a random element and blob:
import os

# A random field element: 32 bytes with the value capped at 31 bytes
def random_field_element() -> bytes:
    return b'\x00' + os.urandom(31)

# Generate a random blob of 4096 field elements
def random_blob() -> bytes:
    return b''.join(random_field_element() for _ in range(4096))

blob = random_blob()
Start by installing the ckzg python package. (Good practice is to work in a Python virtual environment for your project.)
pip3 install ckzg
and import the library in your code:
import ckzg
We’ll need the trusted setup data that we described earlier. The trusted setup file can be found in the source distribution for the ckzg python package on pypi.org or in the github repo (https://github.com/ethereum/c-kzg-4844) as trusted_setup.txt.
We can load it once and use it for multiple commitments and proofs:
from pathlib import Path

# Get the KZG trusted setup from the current directory.
def get_kzg_setup() -> object:
    cwd = Path(__file__).resolve()
    path = os.path.join(cwd.parent, "kzg_trusted_setup.txt")
    return ckzg.load_trusted_setup(path)

setup = get_kzg_setup()
Now we can generate a commitment to our (polynomial) blob, resulting in a compressed point sized string of bytes. The commitment can either be distributed in a trusted manner or generated as needed by a party that has access to all the data.
# Commit to the blob
commitment = ckzg.blob_to_kzg_commitment(blob, setup)
print(f"Commitment: {commitment.hex()[:16]}..., {len(commitment)} bytes")
# Commitment: 92ae4e8d9b11d06d..., 48 bytes
We are now in a position to generate proofs for individual elements in the blob. The form that this takes is to specify the x value at which we want the polynomial evaluated and then receive both the corresponding y and the proof that the commitment includes that value. You can frame this in terms of a challenge where the verifier requests a proof for a particular element and the prover responds with both the y value and the proof. The ckzg library refers to the chosen (x) challenge value as z.
The correspondence of this location with where your data actually falls on the curve is a little complicated and we’ll come back to that. But for now, since the polynomial is defined over the entire field, we can just generate a proof for any randomly chosen evaluation point. In fact, for many schemes that prove data availability, a randomly chosen point may be enough to satisfy the verifier: since the prover cannot anticipate the point, they must have the entire blob available in order to produce a proof that is consistent with the commitment.
# Generate a proof for a random point on the polynomial
z_eval = random_field_element()
proof, y_out = ckzg.compute_kzg_proof(blob, z_eval, setup)
And finally we can verify the proof for the chosen element using the original commitment, the evaluation point, the proffered y value, and the proof.
# Verify the proof for the chosen element
ok = ckzg.verify_kzg_proof(commitment, z_eval, y_out, proof, setup)
If the verification returns True, we can be confident that the commitment includes the y_out value at the z_eval point.
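As a sanity check of our own, flipping a bit of the claimed y value should make verification fail. (In rare cases a tampered value could fall outside the field and raise an error instead of returning False.)

# Tamper with the low byte of y; the proof should no longer verify
bad_y = y_out[:-1] + bytes([y_out[-1] ^ 1])
assert not ckzg.verify_kzg_proof(commitment, z_eval, bad_y, proof, setup)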
Earlier, in our description of how a polynomial commitment encodes your data, we talked about the polynomial passing through your data elements as ‘y’ values at “consecutive” points on the curve. While this can be true, it’s not actually how the polynomials used in the ckzg library operate. A polynomial encoded as a series of ‘y’ values is said to be in evaluation form, and this has some advantages in terms of efficiency of evaluation. However, to benefit from this the library has to lay out the points at specific ‘x’ values called roots of unity for the field that we are working with. These are ‘n’ specific values that are the n-th roots of the literal value ‘1’ in the field. (Remember that we are working in a finite field where the operations are modulo a prime number, so multiplication can cause values to wrap around, and ‘1’ can have many roots.)
For our purposes as consumers of the library, the important thing here is just that if we actually need to address a specific data element in the polynomial we need to map its index to the corresponding root of unity. The roots here are 4096 scalars in the field (one for each element of the polynomial) and the ckzg library computes them as part of the initialization process when it loads the trusted setup.
Having these values, we can generate a proof for a specific element of our data (by index) and confirm that the y value produced by the library matches our data, like this:
# Generate a proof for a specific element in our blob
index = 3
z_eval = roots_of_unity[index]
proof, y_out = ckzg.compute_kzg_proof(blob, z_eval, setup)
# Extract the corresponding element from the blob
expected_y_out = blob[index * 32:(index + 1) * 32]
assert y_out == expected_y_out
Unfortunately, although the ckzg library provides these values in the C library KZGSettings struct that is produced when the trusted setup is initialized, it doesn’t expose them through the python API binding. So to get at them we’ll have to use ctypes to peek into the object. This isn’t difficult, but rather than show it here we’ll just point you to a complete code example that is part of the Orchid storage project.
There is a lot more to say about polynomial commitments and the KZG scheme, including more of the magical properties that allow commitments and proofs to be combined and yet remain fixed in size, the functions the ckzg library offers to batch verify proofs, and how to turn the challenge framing into a non-interactive protocol. But this article is already way too long, so we’ll save those for a follow-up :)
The Orchid Storage Project is an open-source, decentralized storage scheme capable of maintaining files and issuing payments based on data availability even when the client is offline. This is accomplished through incentive-aligned providers, non-interactive verification protocols, and efficient rebuilding and migration capabilities, all of which sit atop polynomial commitments and KZG proofs.
The prototype codebase is written in Python and uses the ckzg library just as we’ve described here. It includes implementations of classes that chunk and encode raw data files (including optional encryption and erasure coding) as well as a simple server framework for distributing data shards, issuing challenges, verifying proofs, and rebuilding lost data shards more efficiently than a typical erasure coded system. Come check it out, use pieces of the code in your own projects, or join us and contribute!
Patrick Niemeyer (pat@pat.net) is a senior engineer at Orchid Labs, author of the best-selling book series Learning Java (O’Reilly & Associates), and a contributor to many other open source projects.
I am neither a mathematician nor a cryptographer, so it is very likely I have made egregious errors in writing this article. If you find any, or just have suggestions for making it better, please let me know!