--- /dev/null
+=== valhalla3 ===
+
+Author: zachary "za3k" vance
+Written 2020-04-29 (document version 1)
+
+valhalla3 is a p2p protocol designed to manage downloading and making available a very large dataset among a set of volunteer computers.
+
+The intended use case is to back up the 14000TB Internet Archive (I'll talk to them after I have something working), as 3+ always-online copies of 1,400,000 10GB chunks. My estimates say we're targeting ~100K users and ~1M files.
+
+The protocol is implemented as a general-purpose way to ensure a set of torrents are populated, together with a mechanism to download from HTTP and turn the downloaded files into torrents. The goal is to have 3+ distributed full copies of the internet archive, online at all times.
+
+=== design goals ===
+
+- Actually get a copy (!)
+- Make it available to everyone (people not running the system can get the dataset)
+- Let anyone easily run a client, so we can get 100K users.
+ - Run on Windows (lots of people) and Linux (rare people with huge amounts of space). OS X isn't a priority.
+ - Run forever after initial install (autostart on boot), with no user action.
+ - "Run forever" also means some form of auto-update, since we can't assume the initial valhalla3 client is adequate.
+ - Zero configuration if possible, or at least minimal configuration.
+ - Works behind NAT with no IPv6 or port forwarding
+- Robust
+ - The system should work if the volunteers keep participating but maintainers disappear.
+ - 50% of computers get knocked out in 1 hour, no big deal
+- Scales to 100K users
+- Scales to 14000TB of files (1M torrents, each with a single 10G file)
+- Under typical use (0.1TB-10TB stored) doesn't significantly tax the system it's running on. <10% CPU and memory use, no freezes, and nothing crazy on bandwidth.
+- User-friendly GUI (not covered here)
+ - Detects bandwidth caps some ISPs impose, so people don't get horrible overage charges. (Just turning off is acceptable)
+ - Cleans up files on uninstall
+- Can mirror a large dataset without active cooperation from the current HTTP host (for example, pre-computed hashes should be optional)
+- If we have 100K honest users, 1K dishonest users can't sabotage it
+
+=== current state and feedback ===
+
+valhalla3's protocol is designed, but it isn't prototyped yet. I'm currently asking for feedback on the protocol before doing another prototype, since I've already had to throw out everything and redesign twice (and I'm replacing a poorly designed system to start with).
+
+I'm interested in feedback about why my goals are wrong: better ways to store the Internet Archive, why I have the wrong set of components, what my system goals should be instead. Feedback on technical protocol details is less interesting: in order, my priorities are reliability, scaling, security, and efficiency. Vague feelings ("this is too complicated") are fine. Areas where you didn't understand something in this writeup are fine too, and feel free to ask questions.
+
+I don't want technical security feedback yet. Some people want to show off their clever attack, but I need to find really fundamental design problems.
+
+=== organization of valhalla3 ===
+
+Forward references you might want: good but rejected previous designs at the end, centralized components of the system under "known issues" and "admin vals".
+
+## metadata store and terminology
+A p2p network is used to share metadata. Metadata is stored in the form of two key-value dictionaries.
+The mutable key-value store (mutable-store) contains signed, timestamped references to immutable values (mut-vals). JSON and hexadecimal are used here for readability, but the real format is packed binary.
+    "client-994b8cc93f94894fed9ec350c9c5309face107c3": {
+        "signature": "7fa11744b8add1b3141d9099ad333a10aa542a63",
+        "timestamp": "2020-01-01T01:00:00Z",
+        "value_hash": "adc83b19e793491b1c6ea0fd8b46cd9f32e592fc",
+        "value_length_bytes": 100
+    }
+
+The immutable key-value store (immutable-store) contains actual data, keyed by hash (immut-vals). All 'client' entries should be <10K (most much smaller), and 'admin' entries should be <10M (most much smaller). This will be _aggressively_ compressed from this representation in the final version.
+    "adc83b19e793491b1c6ea0fd8b46cd9f32e592fc": {
+        "ipv4": "1.1.1.1",
+        "ipv6": null,
+        "ipv4_port": 25555,
+        "file_manifest_version": "2015-01-01T01:00:00Z",
+        "files_claimed": [1,2,4,5,11,...,1945],
+        "files_downloaded": [2,4,5,11,...,1945],
+        "files_bittorrent_seeding": [2,4,5,11,...,1945],
+        "file_hashes": {2: "1ff71991c121616486323c545292e3fe82139a2c", ...},
+        "file_torrent_infohashes": {2: "69c20664e48549b760dbc89a72b33e853fbe8993", ...},
+        "file_sizes_bytes": {2: 10000000000, ...},
+        "tls_certificate": "<cert>"
+    }
+
+The mutable-store contains one 'client' mut-val for each peer, plus 4 'admin' mut-vals mentioned later which are noncritical. There is no other data in mutable-store. [Implementation detail: the mutable-store should be easy to hash and to iterate over keys in sorted order.]
+The immutable-store contains exactly the values referenced in the mutable-store. All of them are present and nothing else. Note that two mut-vals are allowed to point at the same immut-val.
+
+I will call a mut-val together with its immut-val simply a 'val', and if I mention a 'val key' I mean the key of the mut-val. The mutable-store together with the immutable-store is the metadata-store.
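+
+To make the shapes concrete, here is a minimal sketch of the two stores in Python. This is illustrative only: the real format is packed binary, and SHA-1 is assumed from the 40-hex-character hashes in the examples above.
+
+    import hashlib
+    from dataclasses import dataclass
+
+    @dataclass
+    class MutVal:
+        signature: bytes         # signs (key, timestamp, value_hash, value_length_bytes)
+        timestamp: str           # e.g. "2020-01-01T01:00:00Z"
+        value_hash: str          # hex digest of the referenced immut-val
+        value_length_bytes: int
+
+    class MetadataStore:
+        def __init__(self):
+            self.mutable = {}    # "client-<public_key>" -> MutVal
+            self.immutable = {}  # value_hash -> raw immut-val bytes
+
+        def put(self, key, mut_val, value):
+            # A mut-val and its immut-val are only ever stored together.
+            assert hashlib.sha1(value).hexdigest() == mut_val.value_hash
+            assert len(value) == mut_val.value_length_bytes
+            self.mutable[key] = mut_val
+            self.immutable[mut_val.value_hash] = value
+
+        def gc(self):
+            # Invariant: the immutable-store holds exactly the referenced values.
+            live = {mv.value_hash for mv in self.mutable.values()}
+            self.immutable = {h: v for h, v in self.immutable.items() if h in live}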
+
+## metadata store and clients
+
+Each peer generates a public-private cryptographic signing keypair on first startup. The same keypair is used for the peer forever. The peer is responsible for maintaining exactly one val, with a key of the form "client-<public_key>". Because mut-vals are signed, peers receiving updates know that the data is really from the signing peer. Timestamps prevent replay attacks--only higher timestamps are accepted when receiving updates.
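+
+A sketch of the signing flow, assuming Ed25519 via the Python 'cryptography' package (the actual signature scheme and payload encoding are placeholders here, not decided):
+
+    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
+
+    def payload(key, timestamp, value_hash, value_length_bytes):
+        # Illustrative encoding of the fields covered by the signature.
+        return f"{key}|{timestamp}|{value_hash}|{value_length_bytes}".encode()
+
+    signing_key = Ed25519PrivateKey.generate()  # generated once, on first startup
+    msg = payload("client-994b8cc9...", "2020-01-01T01:00:00Z", "adc83b19...", 100)
+    signature = signing_key.sign(msg)
+
+    # Any peer can verify with the public key embedded in the val key;
+    # verify() raises InvalidSignature on a forged or altered mut-val.
+    signing_key.public_key().verify(signature, msg)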
+
+Each peer maintains a full copy of the entire metadata-store. It is designed to be <1GB with the target number of users and files. There is no "authoritative" version, but the update protocol below means any updated val should reach a peer within 1 day. Peers "expire" and delete mut-vals based on timestamp after 7 days. Immut-vals are deleted whenever there's no mut-val that references them.
+
+Clients update the mutable store once a day even if nothing has changed (just change the timestamp), so the network knows they are still online.
+
+## metadata store updates
+
+On first boot, the peer initializes the mutable-store to hardcoded values. Only 'admin' mut-vals are initialized in this way -- more about these below, but one is a list of known peers. These "bootstrap" peers are expected to be in the network and online. This bootstrap is how basically all p2p systems work, but valhalla3 tries to provide a full known set of peers, rather than one or two fast and reliable ones. Under normal operation the peer calculates its list of peers from its metadata store--any 'client' mut-val with a timestamp in the last day is assumed to be from an online peer.
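+
+For illustration, deriving the peer list from the metadata store might look like this (building on the MetadataStore sketch above; the timestamp format is an assumption):
+
+    from datetime import datetime, timedelta
+
+    def online_peers(store, now):
+        # Any 'client' mut-val refreshed within the last day counts as online.
+        cutoff = now - timedelta(days=1)
+        return [key for key, mv in store.mutable.items()
+                if key.startswith("client-")
+                and datetime.strptime(mv.timestamp, "%Y-%m-%dT%H:%M:%SZ") >= cutoff]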
+
+Roughly once an hour, each peer does a pairwise update with another peer, chosen randomly from known peers with a public IP. We will call this command EXCHANGE_METADATA. At the start of the exchange, each peer has its own set of mut-vals and immut-vals, all valid and with timestamps. After the exchange, both peers get updated mut-vals and immut-vals from one another, and have the same overall metadata store as one another. The exchange is designed to be moderately fast (<1s, bandwidth permitting).
+
+With 100K peers, the fastest possible schedule spreads an update to everyone in 17 rounds of exchanges (perfect pairing doubles the number of up-to-date peers each round, and 2^17 > 100K). Using random exchanges instead, a simulation shows it consistently takes around 20 rounds, which is not much worse. At the rate of initiating 1 exchange/hour (that's actually participating in 2/hour), updates should easily be available to all peers in half a day. If a 'client' mut-val is 2 days old, that peer can be safely assumed offline.
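+
+A minimal version of that simulation (random push-pull gossip):
+
+    import math, random
+
+    def rounds_to_spread(n_peers):
+        # Each round, every peer initiates one exchange with a uniformly
+        # random partner; afterwards both sides know the update.
+        informed = [False] * n_peers
+        informed[0] = True                  # one peer publishes a new mut-val
+        rounds = 0
+        while not all(informed):
+            rounds += 1
+            for a in range(n_peers):
+                b = random.randrange(n_peers)
+                if informed[a] or informed[b]:
+                    informed[a] = informed[b] = True
+        return rounds
+
+    print(math.ceil(math.log2(100_000)))  # 17: lower bound with perfect pairing
+    print(rounds_to_spread(100_000))      # typically ~20 with random partners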
+
+NAT is a potential issue for many p2p protocols, and valhalla3 chooses not to implement any kind of NAT piercing. Instead, IPv4-only peers behind NAT just always choose to exchange with peers with a public IPv4 address. Peers with public IPv4 addresses will get many times more exchanges as a result.
+
+EXCHANGE_METADATA
+Because this is the most frequent operation, it's likely to be optimized. Treat this as very much a proof-of-concept. Wire/protocol details may be changed before release to reduce overhead. Security feedback is not welcome at this stage and will be ignored--please focus on more fundamental problems.
+
+ 1. (Over TLS) Peer A contacts Peer B, saying "I want to EXCHANGE_METADATA (version 1). The current time is 4:01pm.".
+ 2. Peer B responds "Yes, I support EXCHANGE_METADATA (version 1). I agree the current time is near 4:01pm." [Because mut-vals expire based on time, the peers agree on a 'working' current time to make sure they will agree on the set of valid mut-vals. This is a real edge case thing--finicky to get right but uninteresting. So, I'll pretend the peers agree on the time exactly for the rest of this summary. A hand-wavy solution to illustrate there are no real problems follows for sticklers. We allow clients to differ by up to 6 hours. To make sure contested mut-vals/immut-vals are available, peers internally wait 1 extra day before deleting expired mut-vals/immut-vals from disk. The exchange is done with a modified copy of the metadata store including exactly those mut-vals within 7 days on the agreed-on time, whether expired or not. Changes are merged into the real metadata store after. There are no major problems that happen from accepting mut-vals signed in the future.]
+ 3. Peer A and Peer B exchange their mutable-stores by sending a full copy to one another. [In reality, this is done using some variant on the rsync protocol, which sends only changes, reducing bandwidth]
+ 4. Peer A and Peer B reconcile the two mutable-stores without sending messages. Both peers arrive at the same result, called the "canonical" mutable-store for the exchange.
+ - They throw out any mut-vals with invalid signatures. A peer MAY blacklist the other exchanging peer as a bad actor.
+ - They throw out any mut-vals with expired timestamps, or timestamps in the future. [This should never occur and is an error condition]. They SHOULD NOT blacklist the signing peer if the timestamp is in the future (this could be used to DoS the system).
+ - If they have two values for a key, they select the newer one and throw out the older one.
+ - If they have two values for a key with the same timestamp, the lower hash is taken. The peer MAY blacklist the signing peer as bad actor.
+ 5. Peer A sends to Peer B (and vice versa)
+    - The hash of the canonical store
+    - All immutable values peer B needs, concatenated together.
+    If the peers did not agree on the canonical store hash, this is an error condition.
+ 6. Peer A and Peer B verify immutable values. Each value should be referenced in the mutable store, with a matching length and hash.
+    Peer A and Peer B add any immut-vals received to their immutable-store.
+    Peer A and Peer B remove any immut-vals no longer referenced in the mutable store.
+ 7. Both peers now have the same mutable-store and immutable-store, which they commit to disk.
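+
+A sketch of the deterministic reconciliation in step 4, assuming ISO-8601 timestamp strings (so lexical comparison matches chronological order) and that mut-vals with invalid signatures were already thrown out:
+
+    def better(a, b):
+        # Newer timestamp wins; ties break toward the lower hash, so both
+        # peers reach the same canonical store with no further messages.
+        if a.timestamp != b.timestamp:
+            return a if a.timestamp > b.timestamp else b
+        return a if a.value_hash < b.value_hash else b
+
+    def reconcile(mine, theirs, expiry, now):
+        canonical = {}
+        for key in set(mine) | set(theirs):
+            live = [v for v in (mine.get(key), theirs.get(key))
+                    if v is not None and expiry <= v.timestamp <= now]
+            if live:
+                canonical[key] = live[0] if len(live) == 1 else better(*live)
+        return canonical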
+
+## admin vals
+
+There are a few 'admin' vals which provide some benefits. valhalla3 is designed to operate OK even if 'admin' vals are never updated.
+
+'admin' vals are signed with unique keys, generated in advance by hand. 'admin' vals will be updated by the project maintainers (me?) infrequently and by hand. The current plan is to treat them normally except not to expire them. It's possible getting an 'admin' val update should trigger "infection" EXCHANGE_METADATA calls to propagate admin updates quickly.
+
+The 'admin' vals are:
+ 'admin-manifest-<public_key1>': The list of files to download. Includes HTTP source(s), any known hash, and any known torrent infohash. This is where you would put a list of bittorrent trackers for another project, although the plan for Internet Archive is not to use one. [Note that while the example client data includes the hash, infohash etc for all files, they can be left out if present in the manifest]
+ 'admin-clients-<public_key2>': A list of known peers, which can be used to bootstrap (see 'metadata store updates')
+ 'admin-httpclients-<public_key3>': A list of web mirrors, which can be used to bootstrap or reduce load (see 'HTTP metadata' below)
+ 'admin-selfupdate-<public_key4>': Can be used to signal to peers to install a self-update of the client software. Will include the binary so p2p updates work. I haven't decided, but the binary blob _may_ be downloaded from http peers only if updated, making it a bit of an exception to the usual uniform process. The binary itself includes a signature with a hardcoded key, to add an additional security check.
+
+## HTTP pseudo-peers
+
+The 'admin-httpclients-<public_key3>' val will contain a list of public websites which serve valhalla3 metadata. All peers will regularly download metadata from the website, checking signatures and timestamps as usual. This is similar to EXCHANGE_METADATA, but it's one-way, so I think of these mirrors as 'pseudo-peers' (a fetch sketch follows the list below). Details of exactly when/how frequently clients fetch metadata TBD, as we need to let this scale to 100K clients.
+
+There are a couple reasons this is a good idea.
+- If we totally break the client somehow during alpha, we can still get them to autoupdate.
+- If there is a lot of churn, every bootstrap peer may be unavailable. This lets the client update the bootstrap peer list.
+- It's very easy to keep a static website working
+- If we use CDNs, this is near-instant and can handle as much load as we can throw at it
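+
+As a sketch, a pseudo-peer fetch is just an HTTPS download plus the usual checks (the snapshot path and format below are placeholders, not a decided wire format):
+
+    import urllib.request
+
+    def fetch_from_mirror(mirror_url):
+        # One-way: download a full metadata snapshot, then check signatures
+        # and timestamps exactly as if it had arrived via EXCHANGE_METADATA.
+        with urllib.request.urlopen(mirror_url + "/valhalla3/metadata.bin",
+                                    timeout=30) as resp:
+            return resp.read()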
+
+## data
+
+Wait, but weren't we downloading something?
+
+Yeah! So although the really complicated bit is maintaining this metadata, the peer's _important_ job is to download and seed files, not to gossip with its peers.
+
+Using 'admin-manifest-<public_key1>', the peer knows the whole list of files the system wants to download. By scanning the metadata-store, it can calculate how many peers are currently downloading or seeding each file. Then it picks high-priority files, and claims them by updating its 'client' val. A high-priority file is one with fewer copies available (for example, if there are 5 copies of most things, but 0 copies of one file, the client should pick that one). The manifest may also include "priority" information to help clients choose a piece, ex. we want 10 copies of important files like the index, make sure to get this 1PB before the other 13PB.
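+
+A sketch of that selection (field names follow the example 'client' val above; manifest priority weighting is omitted):
+
+    import random
+    from collections import Counter
+
+    def pick_file(manifest_file_ids, client_vals):
+        # Count how many peers currently claim each file, then claim the
+        # rarest; a random tie-break spreads peers across equally rare files.
+        copies = Counter()
+        for val in client_vals:             # decoded 'client' immut-vals
+            copies.update(val.get("files_claimed", []))
+        return min(manifest_file_ids, key=lambda f: (copies[f], random.random()))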
+
+If fewer than 3 total peers have a file, and the manifest does not include an infohash for the file, the peer downloads the file directly from the HTTP source. It then hashes the file and posts the hash, infohash, and size as part of its 'client' val. The expectation is that the project maintainer (admin) will 'stamp' a hash under some condition, for example once 2-3 trusted peers agree exactly on the file contents, and then it is locked in place. In the case of Internet Archive, we will likely be able to provide hashes as part of the initial manifest.
+
+If the manifest includes the file infohash, and at least one online peer says it is seeding the file, the peer downloads the file via bittorrent. At some point it MAY give up and use HTTP download (sometimes peers may be down or unreachable on bittorrent)
+
+If 3+ peers have the file, but the file's hash is not in the manifest, the manifest tells the peer what to do: either treat HTTP as the only trustworthy source and use HTTP download, or trust its peers and use the most common torrent infohash.
+
+Any time a peer downloads a file, it verifies the SHA hash and file size if present in the manifest. It rejects the file if either is incorrect.
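+
+A sketch of that check, streaming so a 10GB chunk never has to fit in memory (SHA-1 is again assumed from the 40-hex-character hashes in the examples):
+
+    import hashlib
+
+    def verify_download(path, expected_hash, expected_size):
+        h = hashlib.sha1()
+        size = 0
+        with open(path, "rb") as f:
+            for block in iter(lambda: f.read(1 << 20), b""):  # 1MB at a time
+                h.update(block)
+                size += len(block)
+        return size == expected_size and h.hexdigest() == expected_hash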
+
+## list of commands
+okay this section is a bit vague right now, I need a real prototype and more networking knowledge
+
+yes
+EXCHANGE_METADATA - explained in full above, makes 99% of everything work.
+
+probably yes
+IS_MY_IP_PUBLIC / YES_YOUR_IP_IS_PUBLIC - used during boot to detect whether a peer is behind a firewall, untraversable NAT etc. also returns apparent IP as seen by Peer B. need to learn more networking to figure out if this is needed. needs DDoS mitigation too.
+
+probably no
+STUN/TURN - don't think we need this, also no idea how to coordinate it without one central server. Lets two peers behind NAT talk to each other using a peer with a public IP.
+GIVE_ME_PAR2 - request for PAR2 data for a file, if a client discovers corruption. probably won't make the final cut due to complexity and wanting NAT traversal
+
+=== Opinionated options and non-options ===
+
+Feel free to let me know if you disagree with these, I'll listen but probably won't respond.
+
+If you have a way to auto-detect which ISP the user is on, let me know. I want to detect and enable different client behavior for:
+ - ISPs with a bandwidth cap
+ - ISPs that do nasty tricks (throttling) to everyone sending bittorrent traffic
+A hardcoded database of ISPs/IPs is good enough.
+
+Options (set during setup program)
+- Does your ISP have a bandwidth cap? (We'll attempt to auto-detect but this is the main installer question)
+- How much space do you want to use? (Default to 20%, maybe try for more with intelligent eviction in future updates)
+- What folder should I download stuff to? (reasonable default)
+
+Auto-detected
+- Are you on IPv4? IPv6? Can either of those addresses be contacted from outside your network? (autodetect using peers, IS_MY_IP_PUBLIC command. could use advice on implementing this without letting valhalla3 be used for DDoS)
+- What is your uplink/downlink bandwidth? (we'll autodetect or use uTP to avoid congestion)
+
+Non-options (Windows and Linux Alpha)
+- Turn off torrents
+- Use HTTP proxies
+- Turn off self-updates on Windows (config file only, but not easy to do)
+- Turn ON self-updates on Linux, although we may flag that you should update somehow (we'll rely on package managers)
+- Turn off HTTP pseudopeers (config file only)
+
+We also won't push people to turn on port forwarding or to turn on IPv6.
+
+=== Known issues ===
+
+Feedback on how to fix these welcome. Feedback on issues NOT listed here even more welcome :).
+
+- Reliability: All bootstrap peers can be down (can't fix)
+- Reliability: All bootstrap http-pseudopeers can be down (non-issue, system will still work)
+- Reliability: Admin keys are a centralized feature. They could be lost and clients would lose self-update. (open to suggestions)
+- Reliability: Admin keys are a centralized feature. They could be lost and the network could fail to ever add new files. (won't fix)
+- Reliability: If all bootstrap peers are down, and either http-pseudopeers are down or don't have new peers (ex. maintainers vanish), bootstrap is impossible for new nodes. (seems fixable, open to suggestions. rendezvous over some public channel?)
+- Reliability: 10GB chunks are pretty big. Peers generating and sharing some kind of parity data (error correcting codes like PAR2) to correct local disk errors or HTTP download errors, might be a good idea.
+- Scalability: IPv4 peers with public IPs may get overwhelmed. (open to suggestions, but if EXCHANGE_METADATA is fast, I think it may just be OK)
+ I could use some data on how many people have static public IPv4 vs DHCP public IPv4 vs IPv4 NAT; and of those on DHCP how frequently IPs change.
+- Scalability: The average client will be running 100s or 1,000s of torrents. I've done quite a bit of research and testing and I think this should be doable, but it's a risk.
+- Scalability: Big clients would need to run 10,000-1,000,000 torrents. You can't seed 1,000,000 torrents on one machine, it's not really possible. But you can't get 14PB on one machine either. I'm going to keep working on improving torrent clients--some of this is fixable, and I think we can get to the point of allowing this. Announces are really a scalability issue.
+- Scalability: Collective torrent announce load. Is there a risk we could take down any tracker we add, or cause issue on the DHT? It seems like probably no, but in the worst case we'd have to switch off mainline DHT or use a private tracker, which makes the torrents hard to access to the public at large.
+- Race Condition: Because peers only get updates after about a day, if there are few high-priority files, many peers will all download that file.
+- I've had problems using DHT-only bittorrent behind NAT, I'll have to make sure everything is ok.
+
+Not interested in security feedback yet. Here are the known issues.
+- Security: Self-update could be used maliciously (lack of self-update is equally risky, but you can turn it off. open to suggestions but this is probably wontfix)
+- Security: Admin keys are a centralized feature. They could be abused. (this is 99% self-update, calling this a duplicate)
+- Security: This entire thing is pretty vulnerable to sybil attacks, making banning bad peers ineffective. Even if it wasn't, distributed ban/reputation systems are hard to keep beneficial. I'm trying to design it so bad peers don't get banned and don't matter.
+- Security: Malicious peers can falsely report having a file, causing honest peers to pick the wrong file to prioritize. This can prevent peers from having the full data set or cause load on one endpoint. Files are not evicted. One fix is to have some fraction of files selected _not_ prioritize rare files. Proof-of-content merkle tree challenges could detect bad actors, but you'd also need a way to ban them.
+
+=== Reasonable but rejected designs ===
+
+Random download. Each peer randomly selects files from the hardcoded manifest, downloads, and seeds them with no coordination.
+- PRO: Very simple
+- PRO: Robust, nothing to take down
+- CON: bittorrent is hard to scale to 1000s of files (same as valhalla3)
+- CON: More peers needed than with coordination (about 40% more)
+- CON: Hard to tell if you're done or making progress. You can't see how many peers there are, and it's actually quite hard in practice to even check which of 1 million torrents are seeded
+- CON: Disallows direct peer-to-peer transfer being part of the design
+- CON: Disallows a central HTTP download directory or other form of peer directory, which helps restore
+- CON: Manifest can't be updated. If we add a webserver to update the manifest, it's a central point of failure.
+Honestly I think this one isn't half-bad, the main disadvantage is that you can't add any kind of transfer other than torrents. I rejected it before I was using torrents, so tracking and restore was worse. I do think valhalla3 is better for something this size, but for a smaller project this would work.
+
+Master-slave. Each slave gets the list of things to download from a master, who tells it "download these files". Slaves are tracked by the master.
+- PRO: Simple, and easy client
+- CON: Centralized restore. If the maintainers disappear or don't have the bandwidth to do a 14PB restore in time, you're screwed.
+- CON: Centralized operation. Easy to take down. You can have clients default to picking their own files to download, but it doesn't help restore.
+- CON: Scaling. 100K clients is pushing it with one master. The easiest solution is multiple masters, which adds complexity.
+
+(valhalla2) Federated coordinators. Each peer gets the manifest from a list of website mirrors. It regularly talks to several trackers, telling the tracker what it has and is getting, and asks what other peers are getting. It looks at what other peers have and picks some files to download. There may be an emergency "direct contact" mode where we tell the client to update or debug the alpha.
+- PRO: Pretty scalable.
+- PRO: Decentralized restore possible (but needs deep technical knowledge) due to directory of peers
+- CON: Complicated, there are 3-4 different kinds of software running
+- CON: Either trackers are a point of failure (if trusted), or can take down the system (if anyone can run one)
+- CON: Website mirrors are a point of failure for updates.
+
+=== the old IA.BAK (valhalla1) ===
+
+There is an existing project called IA.BAK from Archive Team to back up the Internet Archive. This project is inspired by trying to come up with a version of it that works. If you don't know what IA.BAK is already, skip this section.
+
+Let me preface this by saying I've talked to the IA.BAK author (joey hess) and I think he's pretty onboard with a shiny new solution targeting Windows. It's acknowledged experimental alpha software, and my feeling is that the experiment showed it wouldn't work.
+
+Problems with the existing IA.BAK, any of which may be incorrect:
+- Doesn't have a copy of IA. (Targeted 1PB rather than 14PB, got 100TB)
+- (Effectively) targeted only at the technical part of Archive Team, who collectively don't have 14PB of storage. By this I mean it's usable only by extreme linux nerds (me)--you can't even do a package install. But even if it was trivial for all linux users to install, that's missing 99% of the population.
+- Restore can only be done by the maintainers.
+- The backend, git-annex, is frequently updated and complex (you could argue bittorrent is too)
+
+Of these, I think the biggest problem is the target audience. I'm basically imitating projects like BOINC or SETI@Home--the correct solution is to get a LARGE number of nontechnical people interested in helping out. Then we have them run a Windows installer once which asks them zero questions, and IA.BAK then does useful stuff with their computer at no cost to them for 10 years until they buy a new one. If they click on the tray icon during that first interesting week, we show them shiny statistics like "We have 1% of the internet backed up. You personally have backed up 99,999,999,999 websites wow such a good person!". And they actually are.
+
+=== feedback ===
+
+As I get feedback I'll post it here.