Data storage: Distributed Dynamic RAID?

 

Motivation

The notion that a single point of failure in data storage can cause a perceivable loss of information should be obsolete.

 

The paradigm of information storage should be that, for all intents and purposes, data cannot be lost through the loss of any physical hardware (hard drive failure, destruction of a backup system, destruction of a building, etc.) or through poor network connections that drop packets in transit.

 

brainstorm about storage

 

 

The idea

Let us develop a system in which any given piece of content (a document, an email, a music or audio file, a binary application) is broken down into small pieces by some algorithm and distributed over the Internet. The pieces might be distributed to unrestricted peers around the world, as in a BitTorrent-style scheme, or kept within a particular network. Either way, the information is always in transit between nodes, never staying in one place. The pieces should be small enough that, if any practically probable number of them were lost, the reconstitution algorithm could still fill each gap by extrapolating from the two ‘adjacent’ pieces, using their boundary conditions (e.g., the letters on either side of a lost letter in a word) and the broader context (e.g., the word with the missing letter in the context of its sentence).

 

 

BitTorrent is a scheme worth looking into for the decentralized distribution of the pieces of each item of content.

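Erasure coding is a technique that could be used to actually break each piece of content up into these "infinitesimally small pieces": it encodes the data into fragments with added redundancy, so that the original can be rebuilt from a sufficiently large subset of the fragments.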
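To make the erasure-code idea concrete, here is a toy sketch of a k-of-n code, assuming a polynomial construction over the prime field GF(257). The names encode, decode, and _interpolate are hypothetical and chosen only for illustration; real systems use optimized Reed-Solomon libraries, and this is not taken from any of the references below.

```python
# Toy k-of-n erasure code over GF(257). Illustrative assumption only,
# not a production scheme. Requires Python 3.8+ for pow(x, -1, P).
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int):
    """Turn k = len(data) bytes into n fragments (x, y).
    The first k fragments carry the data bytes themselves; the
    remaining n - k are extra evaluations of the same polynomial."""
    k = len(data)
    if not k <= n <= P:
        raise ValueError("need len(data) <= n <= 257")
    points = list(enumerate(data))
    return points + [(x, _interpolate(points, x)) for x in range(k, n)]


def decode(fragments, k: int) -> bytes:
    """Rebuild the original k bytes from any k surviving fragments."""
    if len(fragments) < k:
        raise ValueError("too many fragments lost to recover the data")
    points = fragments[:k]
    return bytes(_interpolate(points, x) for x in range(k))


if __name__ == "__main__":
    data = b"hello"
    fragments = encode(data, 8)            # 5 data + 3 redundant fragments
    survivors = [fragments[i] for i in (1, 3, 5, 6, 7)]  # 3 fragments lost
    print(decode(survivors, len(data)))
```

Note that recovery here is exact and works on arbitrary bytes, so it does not depend on the content being text. The cost is the extra n - k fragments that have to be stored and moved around.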

The algorithm for reassembly after the loss of a small, finite number of packets is the difficult part. If we were dealing only with text files, the reconstitution algorithm could be trained to recognize natural and programming languages. How to recover binary files this way is less clear (at least AFAIK).

Parity bits, cf. RAID 5 systems, for error checking and recovery (conversation with Nick).
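As a rough illustration of the parity idea, the sketch below splits a byte string into k blocks plus one XOR parity block, in the spirit of RAID 5, and rebuilds any single lost block. The function names split_with_parity and rebuild are hypothetical, and zero-padding the last block is an assumption made to keep the example short.

```python
def split_with_parity(data: bytes, k: int):
    """Split data into k equal-size blocks (zero-padded) and append
    one parity block that is the XOR of all k data blocks."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)  # all zeros to start
    for block in blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return blocks + [parity]


def rebuild(blocks, length: int) -> bytes:
    """Restore the original data when at most one block is None.
    `length` is the original data length, used to strip the padding."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost block")
    blocks = list(blocks)
    if missing:
        size = len(next(b for b in blocks if b is not None))
        repaired = bytes(size)
        # XOR of all surviving blocks equals the missing block
        for b in blocks:
            if b is not None:
                repaired = bytes(x ^ y for x, y in zip(repaired, b))
        blocks[missing[0]] = repaired
    return b"".join(blocks[:-1])[:length]


if __name__ == "__main__":
    data = bytes(range(10)) + b"\xff\x00binary"
    stored = split_with_parity(data, 4)
    stored[2] = None  # simulate losing one block
    print(rebuild(stored, len(data)) == data)
```

A single parity block only covers one loss at a time. Surviving several simultaneous losses needs more redundancy, which is where erasure codes come in.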

 

 

This type of backup could be seamlessly integrated into the user experience of the Internet, so that any time you are connected, your data is being backed up.

 

 

Challenges

 

How much memory would such a process require?

Security/encryption

 

References

 

1. The Google File System seems to have some custom storage capabilities, and clearly they are globally distributed. See “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.

2. Computer scientists are already working with something called erasure codes, which I think do the “break it down into infinitesimally small pieces” bit.

3. Tor (distributed data and security) (from Nick)

 

References from Joel

 

4. Chord/DHash Project: http://pdos.csail.mit.edu/chord/

 

5. Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance - Michael O. Rabin - Journal of the Association for Computing Machinery, Vol. 36, No. 2, April 1989, pp. 335-348

 

6. Robust and Efficient Data Management for a Distributed Hash Table - Josh Cates - June 2003 - http://pdos.csail.mit.edu/papers/chord:cates-meng.pdf

 

7. Venti: a new approach to archival storage - Quinlan and Dorward - http://cm.bell-labs.com/sys/doc/venti.html

 

 

 

Random semi-related thoughts 

Part of what I think is interesting about this concept is the possibility of using the infinitesimally small pieces of data as a playground for learning algorithms. The stored data on this network could be aggregated in various ways to explore the potential for things like data mining and artificial intelligence.

 

I am interested in this largely from an intellectual/academic standpoint, but data mining might be an interesting business model to incorporate, if you could get around the user-paranoia issue (e.g., by assuring anonymity, and/or by restricting data mining to a set of agreed-upon causes or purposes). An interesting question is whether we can use people's information to actually extend the intelligence of the Internet itself by turning information aggregation into information synthesis. I see the lack of such synthesis as a key problem of the Internet today: we have access to all this information but don't know how to process it usefully. But I digress :o)).

 

