Introducing OStor – data deduplication in the cloud. Open source project.

I have just launched an open source project to provide data deduplication services. Here is a short description from the project summary page – http://code.google.com/p/ostor/.

OStor (Optimized Storage) is a service to store data optimally using block-level data deduplication and compression techniques. It can be used as a standalone tool, as an interactive tool, or in the cloud using the Hadoop Map-Reduce framework.
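
To make the idea of block-level deduplication plus compression concrete, here is a minimal Python sketch. It is not OStor's actual code: the fixed 4 KB block size, the SHA-256 fingerprints, zlib compression, and the in-memory dictionary standing in for the block store are all assumptions chosen for illustration, as are the function names `store` and `restore`.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size for this sketch, not an OStor setting


def store(data, block_store, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one compressed copy
    of each unique block. Returns the ordered list of block digests
    (the "recipe") needed to rebuild the data."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # Only blocks not seen before are compressed and stored.
        if digest not in block_store:
            block_store[digest] = zlib.compress(block)
        recipe.append(digest)
    return recipe


def restore(recipe, block_store):
    """Rebuild the original bytes from a recipe of block digests."""
    return b"".join(zlib.decompress(block_store[d]) for d in recipe)


if __name__ == "__main__":
    blocks = {}  # hypothetical in-memory block store
    data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
    recipe = store(data, blocks)
    # Four blocks are referenced, but only two unique blocks are stored.
    print(len(recipe), "blocks referenced,", len(blocks), "blocks stored")
    assert restore(recipe, blocks) == data
```

In this sketch a second version of the same data would add only its changed blocks to the store, which is where the savings for versioned and archived data come from.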

History

In recent years, cloud computing has emerged as a new paradigm in the tech industry. As more and more IT infrastructure moves into the cloud, data is being generated in the cloud at an unprecedented rate. Data also gets fed into the cloud from other sources. A portion of this data has to be retained for archival purposes. As data gets versioned and archived, a pattern emerges that mimics what was observed in traditional IT environments: the need for data deduplication, that is, the elimination of redundant data.

In traditional IT environments, various vendors have provided such services for about half a dozen years, most notably Data Domain, which was acquired by EMC recently. Their solutions are hardware based. Since the new generation of cloud services (Amazon AWS, Microsoft Azure, Google Apps) is based on virtualization, customers must either rely on the cloud provider to offer such enhanced services or use software-only solutions.

OStor attempts to bridge this gap. I will add details about the implementation, along with how-to documentation, in subsequent blog posts. Stay tuned.

Future blog postings will be on my secondary blog for dedup.

4 Responses to “Introducing OStor – data deduplication in the cloud. Open source project.”

  1. Vinay Says:

    Excellent idea, Praveen. One of the important issues in deploying a data dedup solution in a cloud is the WAN optimization piece. This will ensure faster access to the data. How do you plan to go ahead with this? I am interested in participating in this.

  2. ppraveen Says:

    I think that is a valid point. Right now, the service is available to dedup archived storage. I am going to add a few blog posts that show how OStor leverages Hadoop to dedup a large data set.

    Deduplicating WAN data is a separate but complementary track, and some of the functionality in OStor can be reused.

  3. Rajasekhar Says:

    Praveen, your work looks promising. I am assuming you are storing deduplicated data in Hadoop HDFS. Considering that HDFS doesn’t handle lots of small files very well, I am wondering how you are storing the deduplicated blocks: as individual files, or by combining multiple blocks into larger archives and indexing those archives?
