I have just launched an open source project to provide data deduplication services. Here is a short description from the project summary page (http://code.google.com/p/ostor/).
OStor (Optimized Storage) is a service that stores data optimally using block-level data deduplication and compression techniques. It can be used as a standalone tool, as an interactive tool, or in the cloud using the Hadoop MapReduce framework.
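To make "block-level deduplication and compression" concrete, here is a minimal sketch of the general technique. It is not OStor's actual code: the `DedupStore` class, the fixed 4 KB block size, and the choice of SHA-256 and zlib are all assumptions made for illustration.

```python
import hashlib
import zlib

# Assumed fixed block size for illustration; not taken from OStor.
BLOCK_SIZE = 4096


class DedupStore:
    """Hypothetical in-memory store that keeps each unique block once."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 hex digest -> zlib-compressed block

    def put(self, data: bytes) -> list:
        """Split data into fixed-size blocks and return the list of digests
        (the "recipe") needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if this content has not been seen before.
            if digest not in self.blocks:
                self.blocks[digest] = zlib.compress(block)
            recipe.append(digest)
        return recipe

    def get(self, recipe) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(zlib.decompress(self.blocks[d]) for d in recipe)
```

Storing two versions of a file that share most of their blocks adds only the changed blocks to `blocks`; each version keeps its own recipe, and `get` reverses the process.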
History
In recent years, cloud computing has emerged as a new paradigm in the tech industry. As more and more IT infrastructure moves into the cloud, data is being generated in the cloud at an unprecedented rate. Data also gets fed into the cloud from other sources. A portion of this data has to be retained for archival purposes. As data gets versioned and archived, a pattern emerges that mimics what was observed in traditional IT environments: the need for data deduplication, that is, the elimination of redundant data. In traditional IT environments, various vendors have provided such services for about half a dozen years, most notably Data Domain, which was recently acquired by EMC. Their solutions are hardware based. Since the new generation of cloud services (Amazon AWS, Microsoft Azure, Google Apps) is based on virtualization, customers must either rely on the cloud provider to offer such enhanced services or use software-only solutions.
OStor attempts to bridge this gap. I will add details about the implementation and howto documentation in subsequent blog posts. Stay tuned.
Future blog postings will be on my secondary blog for dedup.
November 9, 2009 at 8:47 pm |
Excellent idea, Praveen. One of the important issues in deploying a data de-dup solution in a cloud is the WAN optimization piece. This will ensure faster access to the data. How do you plan to go ahead with this? I am interested in participating.
November 12, 2009 at 6:52 pm |
I think that is a valid point. Right now, the service is available to dedup archived storage. I am going to add a few blog posts that show how OStor leverages Hadoop to dedup a large data set.
Deduplicating WAN data is a separate but complementary track, and some of the functionality in OStor can be reused for it.
December 5, 2009 at 5:18 am |
Praveen, your work looks promising. I am assuming you are storing deduplicated data in Hadoop HDFS. Considering that HDFS doesn’t handle lots of small files very well, I am wondering how you are storing deduplicated blocks: as individual files, or by combining multiple blocks into larger archives and indexing those archives?