
HighlyRedundantStorage

A storage system is really good if an attacker or a natural disaster needs to destroy almost the entire infrastructure to make stored data unavailable. If we can program a really good storage system, we should not settle for a merely "good" one.

This maidsafe-dht thing looks cool. There is also "OceanStore", which I know even less about: a network filesystem based on DHT, Bloom filters, and erasure coding.

  • 10+ MByte files: slice the file into a square (e.g. 12x12) or cube (e.g. 8x8x8) of blocks, and apply double/triple-interleaved Reed-Solomon erasure coding (a sketch follows this list). If nodes come and go, the data-healing daemon will be busy.
  • small files: splitting into many blocks would result in inconveniently small slices and slow access. Instead, find other files of similar size with similar expiry dates and apply triple-interleaved Reed-Solomon coding on the cube of these. The redundancy blocks only need to be accessed when some nodes are unavailable.
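
A minimal sketch of the grid idea above, assuming 64 KiB blocks and a 12x12 layout. A real deployment would use Reed-Solomon erasure coding (so several losses per row or column survive); plain XOR parity is used here only to keep the example self-contained.

  # Simplified stand-in for "slice into a grid, add parity along each axis".
  # Real Reed-Solomon coding tolerates more losses; XOR parity recovers one
  # missing block per row (or column). Geometry and block size are illustrative.

  BLOCK = 64 * 1024          # 64 KiB data blocks (assumed)
  GRID = 12                  # 12 x 12 grid of data blocks

  def xor_blocks(blocks):
      """XOR a list of equal-length byte blocks into one parity block."""
      out = bytearray(len(blocks[0]))
      for b in blocks:
          for i, byte in enumerate(b):
              out[i] ^= byte
      return bytes(out)

  def encode(data: bytes):
      """Split data into a GRID x GRID matrix of blocks and add row/column parity."""
      assert len(data) <= GRID * GRID * BLOCK
      padded = data.ljust(GRID * GRID * BLOCK, b"\0")
      blocks = [padded[i * BLOCK:(i + 1) * BLOCK] for i in range(GRID * GRID)]
      rows = [blocks[r * GRID:(r + 1) * GRID] for r in range(GRID)]
      row_parity = [xor_blocks(row) for row in rows]
      col_parity = [xor_blocks([rows[r][c] for r in range(GRID)]) for c in range(GRID)]
      return rows, row_parity, col_parity

  def heal_row(row, missing_index, row_parity_block):
      """The healing daemon recovers one lost block from the survivors plus parity."""
      survivors = [b for i, b in enumerate(row) if i != missing_index]
      return xor_blocks(survivors + [row_parity_block])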

Two completely different kinds of storage usage:

  • stream to disk without indexing, like a TransactionLog (spinning magnetic disks still outperform SSDs for this type of storage)
    • binary search can be used to find records (a sketch follows this list), but that takes much more time than with (B-tree or hash) indexing
    • an index can be built later efficiently (in memory, then streamed to disk)
      • the index can be Berkeley DB, PostgreSQL/MySQL, or even filesystem based (usually with a 4-10 deep directory structure, see the squid cache)
    • but often these data are simply neglected after some time and needed only rarely (but then badly): after a HW crash
  • indexing: like a DHT, or any solution that needs to find records extremely quickly
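
A minimal sketch of the binary-search lookup mentioned above, assuming fixed-length records ordered by an ascending 8-byte serial number at the start of each record (the layout is an assumption for illustration; a later-built B-tree or hash index is much faster):

  import struct

  RECORD_SIZE = 256          # fixed-length records (assumed)

  def find_record(f, serial):
      """Binary search an unindexed, serial-ordered log file; return the record or None."""
      f.seek(0, 2)                        # jump to the end to learn the file size
      n_records = f.tell() // RECORD_SIZE
      lo, hi = 0, n_records - 1
      while lo <= hi:
          mid = (lo + hi) // 2
          f.seek(mid * RECORD_SIZE)
          record = f.read(RECORD_SIZE)
          (mid_serial,) = struct.unpack(">Q", record[:8])
          if mid_serial == serial:
              return record
          if mid_serial < serial:
              lo = mid + 1
          else:
              hi = mid - 1
      return None

  # usage:  with open("transaction.log", "rb") as f: rec = find_record(f, 12345)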

We use a simple interface and a trivial SQL implementation to roll out a reasonable EpointSystem solution quickly and operate it in the beginning. But eventually we should migrate to a really good solution, for money and other information as well; it is also required for censorship-resistant networking anyway.

Below we assume that a revisioned vault stores a set of EpointSystem tokens in a DHT.

Updateable hash interface (the matching secret must be known to update; a minimal sketch follows the list):

  • put(k, v1, hashedsecret1 ) stores v1 at key k.
  • update(k, v2, hashedsecret2, secret1 ) only updates k => v2 if hashedsecret1 == hash( secret1 ) matches. Deleting requires the same: knowledge of the old secret1.
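
A minimal in-memory sketch of this interface (a real DHT node would also persist the data and enforce expiry and billing; the class and method names are illustrative):

  import hashlib

  def H(data: bytes) -> bytes:
      return hashlib.sha256(data).digest()

  class UpdateableStore:
      def __init__(self):
          self._data = {}                    # k -> (value, hashedsecret)

      def put(self, k, v1, hashedsecret1):
          """Store v1 at key k; hashedsecret1 = H(secret1) guards future updates."""
          if k in self._data:
              raise KeyError("key already exists; use update()")
          self._data[k] = (v1, hashedsecret1)

      def update(self, k, v2, hashedsecret2, secret1):
          """Replace the value at k only if the caller proves knowledge of secret1."""
          _, hashedsecret1 = self._data[k]
          if H(secret1) != hashedsecret1:
              raise PermissionError("secret does not match stored hash")
          self._data[k] = (v2, hashedsecret2)

      def delete(self, k, secret1):
          """Deleting requires the same proof: knowledge of the old secret1."""
          _, hashedsecret1 = self._data[k]
          if H(secret1) != hashedsecret1:
              raise PermissionError("secret does not match stored hash")
          del self._data[k]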

Bootstrapping involves getting (finding out):

  • a revision number N (actually a hint; we only become certain once we see that N exists and N+1 does not)
    • knowing N, the key k=hash(privatesalt, identity_balancerevision_N ) can point to the document, like an edonkey, gnunet or freenet handle (a sketch follows this list).
  • if we don't need maximum security, or want to make coding easy or operation quick, besides the N hint it can contain a (self-encrypted) list of keys so we can retrieve the rest of the vault data in the following steps, or even the full document itself.
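
A sketch of the revision-hint walk, under stated assumptions: the key for revision N is hash(privatesalt, identity_balancerevision_N), the exact serialization of that hash input is made up for illustration, and dht_get stands in for whatever retrieval call the DHT offers:

  import hashlib

  def revision_key(privatesalt: bytes, identity: str, n: int) -> bytes:
      # Assumed serialization: privatesalt || "<identity>_balancerevision_<N>"
      material = privatesalt + ("%s_balancerevision_%d" % (identity, n)).encode()
      return hashlib.sha256(material).digest()

  def fetch_latest(dht_get, privatesalt, identity, hint_n):
      """Starting from the gateway's hint, walk forward until no newer revision exists."""
      n = hint_n
      doc = dht_get(revision_key(privatesalt, identity, n))
      if doc is None:
          return None, None                  # hint was wrong or data is missing
      while True:
          nxt = dht_get(revision_key(privatesalt, identity, n + 1))
          if nxt is None:
              return n, doc                  # N exists, N+1 does not: N is current
          n, doc = n + 1, nxt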

Robustness / paranoia notes

The gateway is a limitation we must live with: a JAVA applet can only connect to the server the applet was downloaded from. We really want to make an attack from the bootstrap "gateway" less effective, so a working attack SHOULD require almost total destruction of the DHT infrastructure as well. Tampering with the gateway could cause severe problems (the worst is a trojan-horse applet), but old state should still be available in the DHT.

  • expire time is also needed.
    • Unless configured otherwise, the default expire time can be around 121 months: if the vault sees any token that would expire within 61 months, it renews the storage to 121 months.
    • Billing considers storage time: e.g. cost = starting_price + price_per_megabyte_year * ( data_quantity * time )
    • Billing between DHT servers happens with traditional methods during normal transactions, and is cleared in epoint only occasionally (attempt hourly or daily, or beyond some limit?). Also, storage servers use their own DB (not the DHT) to store the billing currency tokens. These measures prevent infinite recursion where a billing would create another billing, etc.
    • free storage is ruled out (in the long term); it could be flooded by the banking establishment in a short time.
    • delete, or setting expire to less than already paid for, SHOULD be neglected. This way at least a tampered gateway with a trojan-horse applet cannot destroy old state.
  • If the content is protected by (possibly symmetric) cryptography, then the gateway cannot forge a "future" N value. It would not make sense anyway, as the next step, trying to retrieve k=hash(privatesalt, identity_balancerevision_N ) from the DHT (via onion access, so the trusted server cannot screw it up), would fail anyway.
  • to verify that the trusted server returned the correct N, we also try other values, like N+1, which should fail (a sketch follows this list).
    • Note that at the very minimum, the order should be random (getting the Nth or the N+1th first),
    • and some other queries (perhaps unrelated, e.g. retrieving some naughty content from the DHT via the same onion method) should be made as well.
    • We do NOT want the gateway to know for sure that we request the Nth first, then the N+1st.
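
A sketch of this paranoid lookup order: probe revision N and N+1 in random order, mixed with unrelated decoy fetches, so the gateway cannot tell which answer we actually care about (dht_get_via_onion, key_for and decoy_keys are assumed helpers, not a defined API):

  import random

  def verify_revision(dht_get_via_onion, key_for, n, decoy_keys):
      """Return True if revision N exists and N+1 does not, probing in random order."""
      probes = [("N", key_for(n)), ("N+1", key_for(n + 1))]
      probes += [("decoy", k) for k in decoy_keys]
      random.shuffle(probes)                 # the gateway must not learn the ordering

      results = {}
      for label, key in probes:
          results[label] = dht_get_via_onion(key)   # decoy results are simply discarded

      # N must exist and N+1 must not; otherwise the gateway lied about the hint.
      return results["N"] is not None and results["N+1"] is None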

 


Embedded system vs. enterprise bloatware

Enterprise JAVA is a good solution for many (or most?) web applications. Although such systems are full of security holes (intentional or unintentional, it matters little), the most powerful mafia/establishment rarely has thousands of billions of dollars of interest in attacking them. But when the goal is to prevent people from creating their own money and to drive them away from such methods, the chance that the existing holes get exploited suddenly sky-rockets.

  • OS, typically Linux
    • (preventing MS-windows from nasty actions is practically impossible, in other words MS-windows is completely insecure)
  • Apache - spending most of its time switching threads, waking a thread only to let it sleep again without doing ANYTHING useful (a problem of the OS thread implementation, not of Apache itself).
  • tomcat (application server)
  • JAVA spyware virtual machine
  • DB, typically Postgresql or Oracle. A database is a filesystem (with advanced R-tree indexing). Often on top of another filesystem, but sometimes directly sitting on a block-device.
  • Or DB-cluster spread to several nodes.
    • security nightmare : any of the nodes can corrupt an issuer, and it might even be hard or impossible to tell which one was responsible for the nastiness.

Typically, if an AMD CPU can execute 4000 transactions/sec (in C), it does somewhat less in Java (~2000/sec), 700/sec in tomcat and 300/sec via apache+tomcat. In the end, 90+ % of the time is spent passing control around like crazy. Yes, but it can do load-balancing! True. But a server with 2000 issuers, used by 50000 people in total, is already load-balanced, and load balancing will not regain the 90+ % losses. It might sound nice to be able to update the OS, the database engine etc. without interruption of the service, but think about it! Why should the OS and DB engine be updated? Because of severe secholes. That means they shouldn't have been used in a high-security context in the first place! For mere performance gains it is better not to update (risks > gains). Instead of ad-hoc updates, an app-level method is better: phase out old instances in a few years and migrate to new instances. Luckily, with the digital market it is trivial to do so.

 

In contrast, an embedded system might look like this:

  • no OS (or crippled, minimalistic OS for simple networking and IO/filesystem access)
  • communicating via messages. When communicating over the internet, an enterprise bloatware webserver examines messages; if the msg matches a template, it wraps the message into an envelope and passes it through the DMZ (which should re-examine and filter it if necessary) to the embedded system (a sketch follows this list).
  • the envelope is only opened by the application on the embedded system (another measure to prevent exploitation of backdoors in the networking of the embedded OS - if one is used at all).
  • libraries are likely used (like gnupglib, UDP protocol, etc...), rewriting existing functionality is not needed
  • storage: FS (possibly a journalling FS) over a crypted network block device (NBD)
    • none of the utilized nodes can corrupt data (without immediate discovery)
    • if the access to NBD nodes happens via an onion cloud (like Tor), then almost total destruction of the network is necessary to mount a successful attack
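
A hedged sketch of the "examine, then wrap into an envelope" step above. The template check, the shared key and the JSON framing are assumptions for illustration; a real setup would use proper authenticated encryption (e.g. via gnupglib) rather than a bare HMAC, and only the embedded application opens the envelope:

  import hmac, hashlib, json, re

  TEMPLATE = re.compile(rb"^[ -~\n]{1,4096}$")    # printable ASCII, bounded size (assumed)

  def wrap(msg: bytes, key: bytes) -> bytes:
      """Front-end webserver: accept only template-matching messages, then seal them."""
      if not TEMPLATE.match(msg):
          raise ValueError("message rejected by template filter")
      tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
      return json.dumps({"payload": msg.decode(), "mac": tag}).encode()

  def open_envelope(envelope: bytes, key: bytes) -> bytes:
      """Embedded application: the only place where the envelope is opened and checked."""
      doc = json.loads(envelope)
      msg = doc["payload"].encode()
      expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
      if not hmac.compare_digest(expected, doc["mac"]):
          raise ValueError("envelope failed integrity check")
      return msg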

 

What tables can be easily stored (and looked up) on FS on crypted NBD ?

  • a sequence of (signed + crypted) records, looked up by serial number (like the issuer transaction log); a sketch follows this list.
    • appending a new record is easy.
    • if not using a crypto-FS but some other lookup method (where the serialnr might be understood by the storage nodes), and some nodes lie about the highest serial number, it becomes obvious when verifying the signature.
  • key => value pairs (like issuer MD => full history of given MD)
  • B-tree: ordered records
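
A minimal sketch of the serial-number log: each record is signed (and possibly crypted) before being stored under its serial number. sign/verify are placeholders for whatever signature scheme the issuer uses, and one-file-per-record is just one possible layout:

  import os

  class SerialLog:
      def __init__(self, directory):
          self.directory = directory
          os.makedirs(directory, exist_ok=True)

      def _path(self, serial):
          return os.path.join(self.directory, "%012d.rec" % serial)

      def append(self, serial, signed_record: bytes):
          """Appending a new record is easy: write it under the next serial number."""
          with open(self._path(serial), "xb") as f:    # "x" mode refuses to overwrite
              f.write(signed_record)

      def fetch(self, serial, verify):
          """Look up by serial number; a lying or corrupt storage node is caught by verify()."""
          with open(self._path(serial), "rb") as f:
              record = f.read()
          if not verify(record):
              raise ValueError("signature check failed: storage corrupted or node lied")
          return record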

The issuer can reconstruct the current state by fetching and applying all transactions from 1..N.

Alternatively, using a checkpoint C the issuer can fetch C and apply all transactions from C..N. A checkpoint is a consolidated, signed snapshot of the relevant state. It makes sense to create a checkpoint when the transactional data since the last checkpoint is about 5 times larger than the consolidated current state (~number of nonzero MD-s and obligations). Checkpoints should be created outside peak hours (e.g. at night).
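
A sketch of this reconstruction and of the 5:1 checkpoint heuristic (load_checkpoint, load_transaction and apply are placeholders; the size measures are assumptions):

  def reconstruct(load_checkpoint, load_transaction, apply, checkpoint_id, n):
      """Rebuild the current state from checkpoint C plus transactions C+1..N."""
      state, c = load_checkpoint(checkpoint_id)     # consolidated state and its serial C
      for serial in range(c + 1, n + 1):
          state = apply(state, load_transaction(serial))
      return state

  def should_checkpoint(bytes_since_last_checkpoint, consolidated_state_bytes):
      """Checkpoint when transactional data since the last one is ~5x the consolidated state."""
      return bytes_since_last_checkpoint > 5 * consolidated_state_bytes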

 

 

What tables are difficult to store on an FS (without SQL-DB) ?

  • R-trees ?
    • does the issuer depend on R-tree ?

A high performance database storage: http://fallabs.com/tokyocabinet/ It supports hash and B-tree indexes. The fixed-length API (most of the MD => struct and serialnr => struct data can be stored that way) is quite extreme. It lists an interesting robustness feature: "database file is not corrupted even under catastrophic situation" - believable, especially for the fixed-length API.

Issuer main lookups (roughly; a sketch follows the list):

  • private key (only in the hot instance)
  • actual serialnr
  • MD => value, serialnr, or obligation => value, serialnr
  • MD => rand, serialnr in case of spent rands
  • transaction history: serialnr => templateid, data, actual data (everything needed to re-create the signed document)
    • serialnr => the full signed document (should not be absolutely necessary, but good to store anyway)
  • ..
  • other data about storage (eg. servers, tokens to access them, etc...)
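
An illustrative sketch of these lookups as plain mappings (field names are assumptions based on the list above, not a fixed schema):

  from dataclasses import dataclass

  @dataclass
  class MDState:
      value: int             # current value for this MD or obligation
      serialnr: int          # serial of the last transaction touching it

  issuer = {
      "private_key": None,       # only present in the hot instance
      "actual_serialnr": 0,
      "md_state": {},            # MD (or obligation) -> MDState
      "spent_rands": {},         # MD -> (rand, serialnr) for spent rands
      "transactions": {},        # serialnr -> (templateid, data) to re-create the signed document
      "signed_documents": {},    # serialnr -> full signed document (not strictly necessary, but useful)
      "storage_info": {},        # storage servers, access tokens, etc.
  }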

DigitalMarket main lookups (roughly):

  • private key (only in the hot instance)
  • actual serialnr
  • currency A -> currency B offers:
    • user who offered, serialnr of last offer (and previous offer, if any)
  • currency => obligation (in favor of market) actual value ..
    • currency => deposits (of all users)
  •  

 


With EpointSystem + digitalmarket + firepay + thiblo, only one major building block is missing to completely revolutionize information dissemination:

A redundant storage-retrieval-versioning network.

Like git with some repositories + a repo-registry (in an easily deployable and accessible fashion), and browser integration. A good alternative to the fascist info-centrals is needed, not the simple "publish and send URL" approach where bandwidth shortage or dead pages can happen.

With proper distributed storage, the ~90% of dead links around power-producing stirling-technology (e.g. see http://www.oldengine.org/members/christison/links/ but it's very common for pages about >1kW Stirling or Ericsson engines) wouldn't be the case.


At the bottom of the storage there are usually (always?) fragile nodes: simple nodes that are easy to attack by the owners of the FED if their global interests justify it.

But the lookup, retrieval, retry, replicate and recover protocols must make sure that all important information persists. Paying for storage and content is almost unavoidable (otherwise injection of a large amount of junk makes storage of useful information uneconomic). For this, the payment tokens must be created by the people (not the banking establishment), and the micropayments must be efficient and very cheap, as with EpointSystem.

The bottom-level storage nodes can store big blocks of useful information (like git repository files), small 32 kbyte - 2 Mbyte blocks (which might be just pieces of bigger files), or both.

 



Created by: cell. Last Modification: 2010-09-28 (Tue) 16:29:47 CEST by cell.