Advancing Standards for ZK and Provenance

   crypto zk signatures

Thanks to Eric Rescorla, Justin Richer, Digital Bazaar, Mark Nottingham, Martin Thompson, Vivek B, Jason Morton, Crema Labs, Dan Boneh, and deSEC for their time and insights!

Going to the IETF standards meeting for a week, reading more RFCs than most 24-year-olds should, and talking to several RFC authors has given me a decent sense of what needs to happen in standards and adoption to maximally advance the ZK, cryptography, and signatures space. Here are my thoughts, along with links to the raw info. Note that this is a work in progress and being actively edited – please give feedback to @yush_g on X or Telegram!

General Framing and Purpose

My primary goal here is to increase the amount of signed data in the world – ideally, we make proofs of signatures exactly as we do for zk.email, but for all sorts of web data. This lets us interoperate that data with other data and directly prove its provenance. This is extremely powerful for 1) proving this data with ZK proofs to add selective disclosure, like openpassport.app does, 2) using these signatures in tandem with systems like Proteus to prove provenance of data (i.e. this image came from the New York Times, and here's a signature and series of proofs that prove it), and 3) proving the data on chain, which lets anyone build prediction markets (like tmr.news) or composable, private identity proofs on top of it. These identity proofs can enable things like account recovery or gated group chats, or arbitrarily complex autonomous systems like gated, anonymized access to a company reimbursement system on chain.

The primary reasons people are against signed data are:

  1. it’s hard to set up, manage, and rotate signing keys – DNSSEC has caused massive outages like Slack’s, and managing sensitive key material isn’t necessarily a burden organizations want to take on, and
  2. nonrepudiation – some folks want leaked data to remain deniable, so they oppose signatures that make its provenance provable.

In my opinion, both of these arguments are extremely weak –

  1. Asymmetric cryptography is needed for any security in a post-AI world, and we already have it deployed at scale with e.g. DKIM. Standards and key management practices can be improved to make keys more secure (as has recently been happening with DNSSEC, see below) – especially in an age of rampant AI, cryptography and provenance become increasingly critical.
  2. Practically, even without digital signatures, we don’t have repudiability. Look where people are most concerned – consider the number of leaked documents or even emails for which the DKIM/repudiation data has been completely stripped, yet the evidence is fully admissible in court – see any of the filings mentioned by Internal Tech Emails. Adding signatures won’t change much – unsigned data is already accepted as truth due to the way it was obtained (e.g. via a search warrant on a database).

TLS

Of course, TLS record protection is currently symmetric – both client and server hold the same session keys, so a transcript proves nothing to a third party – meaning we have to rely on TLSNotary (which relies on an MPC non-collusion assumption) or TLS proxy approaches (purely infrastructure security, so vulnerable to known nation-state attacks, with no cryptographic security) in order to get attestations on TLS data. These are fine stopgaps for now, but I think the long game is adding signatures to all web data.
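
To make the symmetry point concrete, here's a toy Python sketch (illustrative only – real TLS uses AEAD record protection, but the trust argument is identical): because both endpoints hold the same session key, either side can "authenticate" anything, so a raw transcript is worthless as proof to anyone else.

```python
# Toy sketch of the symmetry problem. Both endpoints hold the same session
# key, so either side can produce a valid tag for any record.
import hashlib
import hmac

session_key = b"shared-by-client-AND-server"

def mac(record: bytes) -> bytes:
    return hmac.new(session_key, record, hashlib.sha256).digest()

genuine = b"HTTP/1.1 200 OK\r\n\r\nbalance: $100"
forged = b"HTTP/1.1 200 OK\r\n\r\nbalance: $1,000,000"

# Both verify identically: a client holding the session key can fabricate
# records the server never sent. Hence TLSNotary (MPC, so the client never
# holds the key alone) or trusted proxies are needed for attestation.
assert hmac.compare_digest(mac(genuine), mac(genuine))
assert hmac.compare_digest(mac(forged), mac(forged))
```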

TLS author Eric Rescorla does not think that signed TLS makes sense – he suggested (and I concur) that instead of adding signatures to the slow-moving TLS protocol (which requires buy-in from browsers), we should add them at the HTTPS layer. Adding anything that needs computation (like hashing a signed payload) to TLS is an impossible goal – it would bloat TLS and add unneeded delays (even single milliseconds on each connection would cost a company like Facebook millions), so it makes more sense to only send signatures when users request them.

The way to convince folks to adopt here is NOT to pitch interoperability or identity proofs – neither is a compelling business argument for big tech, and business arguments are unfortunately the only language they really speak. I even pitched data privacy regulation, but that didn’t seem compelling either – and won’t be until such regulation is actually imminent. Instead, he recommended:

  1. focusing on the fake-news/provenance angle instead of all the other cryptography stories and reasons (and especially not on-chain ones), and
  2. having a clear, high-demand user story for why it matters – a product story, not a technology story. For instance, the most compelling adoption story: propose to browsers and news sites a right-click ‘Save as Certified’ option for image downloads, available only on sites that sign at the HTTPS layer. That makes the standard worth advocating for on both the website and browser side (and each drives the other).

SXG

This was a standard proposed a few years back by Google. It’s interesting for provenance, but that wasn’t the original goal – the purpose was to have CDNs/websites sign their webpages so that Google could cache those sites and serve them from its own cache, without having to hit the CDNs for the data. This lowers search latency (one round trip to the CDN is saved) as well as bandwidth (CDNs don’t have to serve as much data for the same results).

It seemed like a win-win standard, but the fundamental problem was more political than technical – having spoken to a few parties at IETF, the main issue was that CDNs charge users per megabyte of bandwidth, and this proposal would reduce the amount of data going through the CDN, so they would get paid less. A pretty dumb thing to kill a standard over, but that’s the main reason for the broad lack of support. Browsers like the very vocal Mozilla (not a CDN) also strongly oppose it (thread) because of its original purpose – it ties into Google’s ability to centralize data-provider power, it’s tied to AMP (a bad and dead standard), and they interpreted it as private user data being signed and stored by the search provider, which would be very bad indeed (though I suspect the original intent covered only public data).

Luckily, because Cloudflare added a one-click option to enable SXG on websites, a number of websites have it enabled, and there are several repos (that we helped fund and build!) for making ZK proofs of that data – there is a Circom one, but the SP1 implementation is the most mature. Even though this standard as currently written won’t see long-term adoption, it’s still useful to develop on, in order to build inspiring apps that convince folks to adopt better versions of signed data, like RFC 9421.

It is worth noting that one way to turn this around (narratively, though unfortunately not in the now-expired standard) would be to decouple SXG from search-engine caching and from AMP, and to make signatures visible not to third parties but only to users (as part of packets encrypted to them), so that websites can continue to serve personalized or private content as needed and CDNs still serve all the packets (e.g. the signatures could expire after a very short time). It would also have been better if this standard hadn’t had Google as its sole author, but alas. Luckily, in this direction, we have:

RFC 9421

RFC 9421 is interesting as a better replacement for SXG because it’s 1) less opinionated and 2) doesn’t serve Google’s interests or hurt CDN income. It also luckily has a totally different purpose – RFC 9421 explicitly targets provenance, meaning organizations like the New York Times can sign webpages and ensure they don’t spread misinformation (not search-engine caching like SXG, which wasn’t nearly as popular a goal). It is already a finalized standard, and several implementations and libraries exist.

This discussion with the main writer Justin was particularly helpful for me to understand the best next steps. Basically, this standard is currently used by Amazon for inter-datacenter communication – but for requests, not responses: it’s a way to verify that the request you’re receiving is authentic and trusted, so that not just anyone can query arbitrary data from AWS datacenters. Note that we want to use it the other way around (the response is signed, not the request) – but luckily the standard supports both. One note here is that the public key infrastructure (PKI) is not specified in the standard, but Justin said that distributing keys over DNS, the same way DKIM does, should be totally fine (ideally with DNSSEC strongly recommended on top).
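
For a feel of what this looks like on the wire, here's a minimal Python sketch of signing a response per RFC 9421 with Ed25519 – the key id "nyt-2025", the choice of covered components, and the digest value are all illustrative; a production signer would compute Content-Digest per RFC 9530 and serialize structured fields per RFC 8941.

```python
# Minimal sketch of an RFC 9421 response signature with Ed25519.
import base64
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()
created = int(time.time())
params = f'("@status" "content-digest");created={created};keyid="nyt-2025"'

# The signature base: each covered component on its own line, closed by the
# @signature-params line describing exactly what was signed and with what key.
base = (
    '"@status": 200\n'
    '"content-digest": sha-256=:X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=:\n'
    f'"@signature-params": {params}'
)
sig = base64.b64encode(key.sign(base.encode())).decode()

# These two headers travel with the HTTP response; a verifier rebuilds the
# base from the message it received and checks it against the published key.
print(f"Signature-Input: sig1={params}")
print(f"Signature: sig1=:{sig}:")
```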

For any changes to HTTPS, getting buy-in from Cloudflare is your #1 option for adoption (according to Eric Rescorla), so I suspect Digital Bazaar is already having these conversations with Cloudflare about RFC 9421, but it’s good to double check. Mark Nottingham, who runs much of the standards work at Cloudflare, is the right contact for all of this – Nick Sullivan used to drive much of the pioneering cryptography work at Cloudflare (like adding SXG) but unfortunately left the company last year.

DKIM + SML

Having worked for years on ZK Email, which makes ZK proofs of DKIM signatures, I think I have a good idea of the main blockers on improving the DKIM standard for provenance. Unfortunately, Delta Chat is the only chat app that supports DKIM signatures, which is cool, but I wish more messaging apps did.

I think the main asks here are 1) mandating that the to: field is signed, which Hotmail doesn’t do but nearly everyone else does (meaning you can spoof any email recipient into thinking they got an email from any Hotmail sender who has emailed you before – see the sketch below for how to check this on any provider), 2) mandating that self-emails are signed, which no one does (but which would help with proving ownership of an email identity), 3) mandating that forwarded/replied emails sign the signature of the previous email, since we can manually de-mangle the old email to re-derive it (this would let the UX flow for ZK Email proofs be a simple forward), and 4) asking onmicrosoft.com to ensure that their selectors/domains are not arbitrary but instead correspond 1:1 to the domain they sign for – the lack of which I suspect is why the majority of email spam comes from Microsoft.
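
As a concrete check for point (1): the h= tag of a DKIM-Signature header lists exactly which header fields were signed, so you can test any raw .eml file with something like the following (a minimal sketch, not production parsing – e.g. it ignores multiple DKIM-Signature headers).

```python
# Check whether a raw email's DKIM signature covers the To: header.
import email

def dkim_signs_to(raw_message: bytes) -> bool:
    msg = email.message_from_bytes(raw_message)
    dkim_sig = msg.get("DKIM-Signature", "")
    # Unfold the header, then walk the tag=value pairs looking for h=.
    for tag in dkim_sig.replace("\r\n", " ").replace("\n", " ").split(";"):
        name, _, value = tag.strip().partition("=")
        if name.strip() == "h":
            signed = {h.strip().lower() for h in value.split(":")}
            return "to" in signed
    return False

# Usage: dkim_signs_to(open("msg.eml", "rb").read())
# At the time of writing: False for Hotmail, True for most other providers.
```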

There’s an additional working group at the IETF called SML (Structured Email), which is standardizing common kinds of emails so that the needed metadata is always included in emails of that type (calendar events, etc.), making them universally parseable. I think it would be great to add things like required unique user IDs + amounts, which would make flows like zkp2p.xyz’s Venmo integration consistent across different payment and information providers, so the relevant data can always be parsed.
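
To illustrate the kind of thing SML could mandate, here's a hypothetical sketch of a payment email carrying a machine-readable part alongside the human-readable one – the schema and field names are mine, not from the SML drafts, but the point is that a unique payer ID and an amount become parseable without per-provider regexes.

```python
# Hypothetical structured payment email: JSON-LD part next to the body.
import json
from email.message import EmailMessage

receipt = {
    "@context": "https://schema.org",
    "@type": "Invoice",
    "provider": "Venmo",
    "accountId": "user-12345",  # stable, unique payer ID
    "totalPaymentDue": {"@type": "MonetaryAmount",
                        "currency": "USD", "value": "25.00"},
}

msg = EmailMessage()
msg["Subject"] = "You paid user-12345 $25.00"
msg.set_content("You paid $25.00.")               # human-readable part
msg.add_attachment(json.dumps(receipt).encode(),  # machine-readable part
                   maintype="application", subtype="ld+json")
print(msg)
```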

DNSSEC

DNSSEC is the only credible technology that can solve the problem of fake wifi networks MITM’ing you and stealing your credentials/money online, because it stops adversaries from replacing the resolution of a name with a different IP address. Where it’s disabled, this specific oversight has been exploited by both backyard hackers and nation-state adversaries (Iran, China). It’s interesting to us because validating DNS records is critical for end-to-end verifiable DKIM verification (up to the root keys), and soon for RFC 9421 public keys as well.

Unfortunately, due to several outages like Slack’s (caused by an AWS bug) and a ridiculous number of TLDs going down due to DNSSEC issues, DNSSEC has not been widely adopted (only 9.4% of the top million domains). Several very old blog posts, like the famous 2015 post Against DNSSEC, treat it as literally evil, but their chief complaint seems to be the centralized trust hierarchy, which is a fair concern but also something we’ve accepted for TLS with certificate authorities. I feel that most of the concerns brought up in that post 10 years ago have now been addressed – see this 2023 blog post, which takes into account all such past concerns and gives a much more modern and balanced take on why DNSSEC adoption is important. Note that DNSSEC will be needed to make public keys like DKIM’s or RFC 9421’s verifiable without bespoke oracle networks, because you can just verify the signature chain directly up to ICANN.
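
As a sketch of what this buys us, here's (roughly) how you validate a DKIM-style TXT record's RRSIG against its zone's DNSKEYs with dnspython – a full validator would also chase DS records up to the root KSK and fall back to TCP on truncation; the selector name and resolver here are illustrative.

```python
# Validate a TXT record's RRSIG against the zone's DNSKEYs with dnspython.
import dns.dnssec
import dns.message
import dns.name
import dns.query
import dns.rdataclass
import dns.rdatatype

RESOLVER = "8.8.8.8"  # any DNSSEC-aware resolver
zone = dns.name.from_text("example.com.")
name = dns.name.from_text("default._domainkey.example.com.")  # illustrative

def fetch(qname, rdtype):
    query = dns.message.make_query(qname, rdtype, want_dnssec=True)
    return dns.query.udp(query, RESOLVER, timeout=5)

# The zone's keys, plus the signed TXT record and its RRSIG.
key_resp = fetch(zone, dns.rdatatype.DNSKEY)
dnskeys = key_resp.find_rrset(
    key_resp.answer, zone, dns.rdataclass.IN, dns.rdatatype.DNSKEY)

txt_resp = fetch(name, dns.rdatatype.TXT)
txt = txt_resp.find_rrset(
    txt_resp.answer, name, dns.rdataclass.IN, dns.rdatatype.TXT)
rrsig = txt_resp.find_rrset(
    txt_resp.answer, name, dns.rdataclass.IN,
    dns.rdatatype.RRSIG, dns.rdatatype.TXT)

# Raises dns.dnssec.ValidationFailure if the signature doesn't verify.
dns.dnssec.validate(txt, rrsig, {zone: dnskeys})
print("TXT record is DNSSEC-valid")
```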

deSEC (a Berlin-based collective pushing for better DNSSEC standards) and I had a great chat about this – they are doing incredible public-goods work with the relevant parties (ICANN, IETF, DNS stakeholders, domain registrars, etc.) to improve and deploy DNSSEC. They pointed out that the main cause of DNSSEC outages is bad rotation: when a child DNS zone rotates its keys, the parent zone needs to promptly pick up the rotation and sign the new keys, or else validation breaks. Unfortunately, this doesn’t reliably happen today because of poor automation interoperability between the child and parent DNS servers in the DNSSEC chain of trust, so child domains sometimes break (e.g. how is the parent notified when the child updates? how do you ensure the old keys stay valid long enough for the new chain to propagate? etc.).

They pointed to the key-rotation RFCs 7344 and 9615, which address this directly. By helping to implement these standards in commonly used DNS software, we can increase the reliability of DNSSEC and thus its adoption. One concrete place they recommended starting was implementing these RFCs in widely deployed DNS servers like CZ.NIC’s Knot – it’s purely an engineering-effort problem. A sketch of the flow these RFCs define is below.
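
Roughly, the child side of that flow looks like this (a sketch with dnspython ≥ 2.3; the zone name is illustrative): the child derives a CDS record from its new key-signing key and publishes it in its own signed zone, and the parent polls for it, validates it under the existing chain of trust, and swaps in the new DS record.

```python
# Sketch of the RFC 7344 child side: publish a CDS record for a rotated KSK.
import dns.dnssec
import dns.name
import dns.rrset
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

zone = dns.name.from_text("example.com.")

# The freshly rotated key-signing key (flags=257 marks a KSK).
new_ksk = dns.dnssec.make_dnskey(
    Ed25519PrivateKey.generate().public_key(),
    dns.dnssec.Algorithm.ED25519,
    flags=257,
)

# CDS has the same wire format as DS; publishing it in the (still-validly-
# signed) child zone signals the parent to update its DS RRset. RFC 9615
# covers bootstrapping the chain when no DS record exists yet.
ds = dns.dnssec.make_ds(zone, new_ksk, "SHA256")
cds = dns.rrset.from_text(zone, 3600, "IN", "CDS", ds.to_text())
print(cds)
```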

BBS+ Signatures

Note that due to the extreme conservatism of NIST, cryptographic schemes basically have to have been around for 20+ years before they can be standardized, which is why now is the right time for BBS+, while ZK proofs might take another decade or so to show up in standards. Digital Bazaar drafted a great doc explaining how cryptographers can help review and standardize BBS signatures for selective disclosure.
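
For intuition on what's being standardized, here's a conceptual sketch of the BBS flow. The `bbs` module below is hypothetical – no standard Python BBS library exists yet – it just shows the shape of the primitive: one signature over many messages, and derived proofs that reveal only a chosen subset.

```python
# Conceptual sketch only: `bbs` is a hypothetical module, not a real library.
messages = [b"name=Alice", b"dob=1999-01-01", b"country=US"]

sk, pk = bbs.keygen()                  # hypothetical API
signature = bbs.sign(sk, messages)     # one signature over all fields

# The holder later discloses only field 2, plus a proof that the hidden
# fields were part of the same signed credential.
proof = bbs.derive_proof(pk, signature, messages, disclosed_indexes=[2])
assert bbs.verify_proof(pk, proof, {2: b"country=US"})
```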

The main way to help drive adoption of these selective-disclosure-friendly signatures is to review the standards and leave comments improving the cryptography – or simply confirm that they look OK. One of the main issues with many signed-data standards is that they standardize a serialization format, which the very loud JOSE + W3C crowd tends to shut down out of frustration with serialization mistakes in the past – this seems absurd as a blocker; if that’s the issue, leave serialization out of the standard.

I had a great chat with Digital Bazaar, the main drivers of RFC 9421, whom PSE and we intend to work more closely with to advance this and other signed-data standards. If you feel aligned, feel free to get in touch!

Takeaways

This is a fairly unstructured, large (and constantly updating) dump of all my thoughts around standards, and what we can specifically do to advance and improve existing signed-data standards for general internet provenance. There are a lot of ways to contribute – infrastructure engineering to add signature standards to widely used repos and make them easy to use, writing work around proposing improvements to these standards, and cryptography work reviewing standards. I think we can seriously advance the state of provenance on the internet, and squash problems of misinformation along the way :)