In this white paper I explain some of the value you can derive from a correctly implemented data archiving system. It covers both email and file system data.
This paper is intended for IT personnel who have a basic understanding of email systems, storage systems and networking. If you are using it to make a decision and you are not an IT professional, some of the terms and the lack of detailed explanation will hinder you, so you should consult one of your engineers after reading it.
Why do you want to Archive Data?
There are many good reasons for archiving the email and file data located in your corporate messaging systems and storage platforms. Some I can think of are…
- Reduced storage needs (on your messaging servers and high speed storage platforms).
- Faster searching and retrieval of email and file data for legal discovery and other purposes.
- Faster backups of your information stores (Exchange Server).
- Faster migrations; mailboxes from one server to another.
- Increasing email storage for the humans without increasing storage for your mail databases.
- Increasing file data storage for the humans without buying more disks for your high speed (read “expensive”) storage devices.
- Regulatory compliance.
- Better overall management of your email and file storage.
There are more, to be sure, but these are the most common. I will cover these bullet points in detail in the sections that follow.
Before we get started…
We should get some of the terminology out of the way:
Single instance is the process where duplicate copies of the same item (for example, one message delivered to many mailboxes) are identified, only one copy is retained, and the rest are deleted. The actual process is much more technical than this, but that is the basic function.
A shortcut or stub is a small item placed in your email database (in a user’s mailbox) that points back to the message in the archive system. The stub is usually distinguished by an icon that differs from the standard email message icon. In most archiving systems, double-clicking the icon retrieves the message for the user to review; it does not restore it to the mailbox.
The “Archive” is a storage location, controlled by the archive server, that contains the stored messages. Most enterprise systems compress the original message to save disk space. As a best practice, the archiving system should be able to use low-cost storage for the archive.
De-duplication is not the same as “single instance”, although some vendors mistakenly use the term in its place. It is more commonly used with flat file systems, and it often controls data redundancy at the block or “byte” level rather than only at the file level.
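The difference between the two terms is easier to see with numbers. Below is a minimal sketch, not how any particular product works: it assumes SHA-256 hashes to spot duplicates and tiny fixed-size 4-byte chunks, and the function names and sample data are hypothetical. Real systems use far larger and often variable-size chunks.

```python
import hashlib


def single_instance_size(files: list[bytes]) -> int:
    """Bytes stored when identical files are kept only once (file level)."""
    unique = {hashlib.sha256(data).hexdigest(): len(data) for data in files}
    return sum(unique.values())


def chunk_dedup_size(files: list[bytes], chunk_size: int = 4) -> int:
    """Bytes stored when identical fixed-size chunks are kept only once."""
    unique: dict[str, int] = {}
    for data in files:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            unique[hashlib.sha256(chunk).hexdigest()] = len(chunk)
    return sum(unique.values())


# Hypothetical sample data: two identical files and one that differs only at the end.
report = b"AAAABBBBCCCC"
edited = b"AAAABBBBDDDD"
files = [report, report, edited]

raw_total = sum(len(f) for f in files)      # 36 bytes with no reduction
file_level = single_instance_size(files)    # 24: the duplicate report is stored once
chunk_level = chunk_dedup_size(files)       # 16: the shared AAAA and BBBB chunks are stored once
```

Single instancing only helps when whole items are identical. Chunk-level de-duplication also saves space on items that are merely similar, which is why the two terms should not be used interchangeably.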
Reduced storage needs (on your messaging and file servers)
This one is a no-brainer. The email is moved to your less expensive secondary storage, the “archive”, where it is compressed and “single-instanced”. In the original message’s place the system leaves a “shortcut” or “stub”, which is a pointer to the actual message. The stub is usually just a few kilobytes in size, much smaller than the original email, which reduces the amount of data in your email databases.
The same applies to your file system data. The process is the same, only faster, because you are accessing flat files. Most SANs, and even NAS systems, use high-speed SCSI or Fibre Channel disks. The secondary storage you move data to can easily be SATA drives or drives of similar performance. The archive system you choose should be able to use this kind of “cheap” storage without issue.
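The move-compress-stub cycle described above can be sketched in a few lines. This is a toy model under stated assumptions, not any vendor’s design: the `Archive` and `Stub` names are hypothetical, duplicates are detected by hashing the message body only, and zlib stands in for whatever compression a real product uses.

```python
import hashlib
import zlib
from dataclasses import dataclass


@dataclass
class Stub:
    """Small placeholder left in the mailbox; it only points at the archive."""
    subject: str
    archive_id: str


class Archive:
    """Toy archive: compresses each message and stores identical ones once."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def archive(self, subject: str, body: bytes) -> Stub:
        archive_id = hashlib.sha256(body).hexdigest()
        # Single instancing: an identical body is not stored a second time.
        self._store.setdefault(archive_id, zlib.compress(body))
        return Stub(subject=subject, archive_id=archive_id)

    def retrieve(self, stub: Stub) -> bytes:
        """Return the original message for viewing; the stub stays in the mailbox."""
        return zlib.decompress(self._store[stub.archive_id])

    def stored_items(self) -> int:
        return len(self._store)


archive = Archive()
body = b"Quarterly numbers attached. " * 1000   # hypothetical 28,000-byte message
stub_a = archive.archive("Q3 numbers", body)     # copy in one user's mailbox
stub_b = archive.archive("Q3 numbers", body)     # same message in another mailbox
```

Both mailboxes now hold only a stub, the archive holds a single compressed copy, and `archive.retrieve(stub_a)` returns the original bytes when a user opens the message.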
Faster searching and retrieval of email for legal discovery and other purposes
Now that the email has been archived, you can use the tools that should come with any good enterprise archiving solution to search rapidly through the “archive” of messages. These tools usually have much greater search capabilities than any search tool available natively on your email servers. You should also be able to place “holds” on messages so they cannot be deleted until you release them.
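To make the idea of a “hold” concrete, here is a minimal sketch of searching an archive and blocking deletion of held messages. It assumes a simple in-memory index with keyword matching; the class names, fields and sample messages are hypothetical, and real products offer far richer search.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedMessage:
    message_id: str
    sender: str
    subject: str
    body: str
    on_hold: bool = False


@dataclass
class ArchiveIndex:
    """Toy in-memory archive with keyword search and legal holds."""
    messages: dict[str, ArchivedMessage] = field(default_factory=dict)

    def add(self, message: ArchivedMessage) -> None:
        self.messages[message.message_id] = message

    def search(self, keyword: str) -> list[ArchivedMessage]:
        """Case-insensitive match on subject or body."""
        needle = keyword.lower()
        return [m for m in self.messages.values()
                if needle in m.subject.lower() or needle in m.body.lower()]

    def place_hold(self, keyword: str) -> int:
        """Put every matching message on hold; return how many matched."""
        hits = self.search(keyword)
        for message in hits:
            message.on_hold = True
        return len(hits)

    def release_hold(self, message_id: str) -> None:
        self.messages[message_id].on_hold = False

    def delete(self, message_id: str) -> bool:
        """Delete a message unless it is on hold; return True if deleted."""
        if self.messages[message_id].on_hold:
            return False
        del self.messages[message_id]
        return True


index = ArchiveIndex()
index.add(ArchivedMessage("m1", "ann@example.com", "Acme contract", "Draft terms attached."))
index.add(ArchivedMessage("m2", "bob@example.com", "Lunch", "Pizza on Friday?"))

held = index.place_hold("acme")    # 1 message matches and is now on hold
blocked = index.delete("m1")       # False: held messages cannot be deleted
index.release_hold("m1")
deleted = index.delete("m1")       # True: deletion works once the hold is released
```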
Faster backups of your information stores (Exchange)
Your email databases shrink as you archive the original messages and leave smaller “placeholders” behind, which makes your backups faster. This is true for any email system, not just Exchange. Note that with Exchange you will need to perform an offline defrag to reclaim the “white space” before your .EDB files actually become smaller.
Faster migrations; mailboxes from one server to another
Again, “smaller database = faster to move”. For example, suppose a mailbox is 1.5 GB and takes 40 minutes to move to another server. You archive its contents, leaving stubs behind. Now the mailbox is 300 MB and takes only 10 minutes to move. So if you are performing a mailbox migration and moving a lot of mailboxes to another server, shrinking the amount of data you have to move will drastically speed up the process.
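As a rough check on those figures, the sketch below estimates move time on the assumption that it scales linearly with mailbox size. That assumption is mine; it ignores any fixed per-mailbox overhead, which may be why it yields 8 minutes rather than the 10 quoted above. The batch of 200 mailboxes is hypothetical.

```python
def minutes_to_move(mailbox_mb: float, rate_mb_per_min: float) -> float:
    """Estimated move time if time scales linearly with mailbox size."""
    return mailbox_mb / rate_mb_per_min


# Figures from the example above: 1.5 GB (taken here as 1500 MB) moves in 40 minutes.
rate = 1500 / 40                                 # 37.5 MB per minute
stubbed_minutes = minutes_to_move(300, rate)     # 8.0 minutes for the 300 MB mailbox

# Hypothetical batch: 200 mailboxes of the same size, moved one after another.
mailboxes = 200
hours_before = mailboxes * minutes_to_move(1500, rate) / 60
hours_after = mailboxes * stubbed_minutes / 60
```

Under these assumptions the batch drops from roughly 133 hours to roughly 27 hours, which is the kind of saving that makes archiving before a migration worthwhile.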
Increasing email storage for the humans without increasing storage for your mail databases
This use is a win-win for you and the people using your email system. Once you have implemented the archiving system, you can use it as a huge “blob” of extended storage for the email system. I have seen it set up many ways; here I outline one of the more successful implementations I have worked with.
Users have a quota of 200 MB on their mailboxes. They are constantly running out of space, storing email off to PST files (which is unsafe) and complaining to you. You implement an email archiving system. You set your policy to let users keep 1 month of email in their mailbox and archive everything else, and you set your archive retention period to 3 years. You also stop the use of PSTs through policy (GPOs). Now you lift the mailbox limits. Next you perform a migration to move all PST data into your archive and delete the PSTs afterward. Your end users have 1 month of email in their inbox and the rest is stubbed. They now have access to 3 years’ worth of email by accessing the archive (which, in any archive system worth its salt, is a very easy, intuitive thing to do). You have effectively given them unlimited mailbox storage and gained greater control of your email system.
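The policy in this scenario boils down to two age thresholds. Here is a minimal sketch of that rule, assuming “1 month” means 30 days and “3 years” means 3 x 365 days; the function name and the sample dates are hypothetical, and a real product would express this as configuration rather than code.

```python
from datetime import date, timedelta

MAILBOX_WINDOW = timedelta(days=30)          # "1 month", approximated as 30 days
ARCHIVE_RETENTION = timedelta(days=3 * 365)  # "3 years", ignoring leap days


def classify(received: date, today: date) -> str:
    """Decide where a message lives under the policy described above."""
    age = today - received
    if age <= MAILBOX_WINDOW:
        return "mailbox"   # full message stays in the user's mailbox
    if age <= ARCHIVE_RETENTION:
        return "archive"   # message is archived and a stub is left behind
    return "expired"       # past the retention period


today = date(2024, 6, 30)                         # hypothetical evaluation date
recent = classify(date(2024, 6, 15), today)       # "mailbox"
older = classify(date(2024, 1, 10), today)        # "archive"
ancient = classify(date(2020, 1, 1), today)       # "expired"
```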
High Level View
Litigation and Email Searches
There are many good reasons to implement an archive system, but the number one reason I have seen in the last 4 years is to meet the requirements of the legal department during litigation, when a judge orders email to be produced for “discovery”. This can be very time sensitive, and if you have a large amount of email it can be almost impossible without an effective system in place to deal with it. With a well-designed archiving system you should be able to produce any ordered search results in hours or days rather than weeks or months. In many cases the money saved on search time, whether spent by your IT administrators or on outsourced searching, will easily pay for an enterprise-level archiving system.
Having access to all email during searches, and being able to show that you effectively control your email system through policy, will also help to blunt any effort by opposing counsel to challenge your searches and results as incomplete.
One very positive result of the explosion of archiving systems in the software market is that many of the more mature systems have partnered with storage vendors to improve the archive storage platform and its performance. Storage vendors such as Data Domain, Hitachi and NetApp are working with archive vendors to provide a “matched” solution that allows better interaction with their products, so the archiving solution can store data more effectively.
A good example of this is Data Domain. Because their focus is on compression and de-duplication, they have far superior algorithms and processes for handling the compression and single instancing of data. They have partnered with some of the enterprise archive software companies, and those products now allow you to turn off their built-in compression in favor of the storage provider’s. This is an important capability to look for when you start evaluating products to purchase.
Lower Cost Storage
Some of the enterprise archival software vendors have worked to make sure they can run effectively on lower cost storage. I don’t mean a USB drive from your local electronics store. I’m talking about NAS storage using SATA drives or better, or internal storage on a server. This is a good place to get ROI because most email systems (especially MS Exchange) require fast SCSI or better disk subsystems for the email databases. By archiving to low cost storage you can avoid growing your email server storage and focus on NAS storage instead, which can benefit the rest of your organization because that storage can be used for more than just archived email. NOTE: beware, there are a few archival vendors whose products must have high speed storage to function, and they will tell you otherwise in their demo or during the sales cycle. Make sure you understand the performance requirements before you settle on a product.
How do email and data archiving work?
Below is a diagram (high level) that shows the basic functions of an archiving system.
I hope I have given you some insight into the benefits of data archiving. Please remember that there are more benefits than I have listed here. Each archiving vendor also has different technologies and methods that may or may not fit your environment well, so make sure you investigate several before you make a choice. And (as if I need to say it) don’t base a purchase decision solely on one slick demo; install the product in a lab and test it.
Thanks for taking the time to read this article and I hope it has helped you.
Cornerstone Technologies - Data Archiving Consulting Firm