If I had to pinpoint one of the most discussed areas for improvement in Symantec Online Backup, it would be “status and reporting” on backups and restores. I’d like to give a little background on the design goals, the state of the union today, and some things we are working on for the near future.
Background
When we were building Online Backup, one of the goals we decided on from the beginning was that it would be a “Set it and forget it” type of service. The continuous backup technology (expect a post on this in the future) is designed to do the following:
- If a file is “available” to the backup agent (“BA”) for backup immediately after we see it was changed, then it is inspected and the new/changed bytes are uploaded immediately (or placed in a queue).
- If a file is “not available”, i.e. locked/in use by an application or otherwise unavailable to be opened via normal operating system methods, then the BA waits for a period of time and performs a “snapshot” (primarily using Microsoft VSS; for gritty details see this link) to “freeze” the required hard drive volumes so that we can get read access to the file, inspect it, and back it up. This is a “fail safe” backup that ensures that, in almost every case short of catastrophic system issues, we are able to access and protect the data as required.
These two backup methodologies work in tandem to create an “always on” method of backup and protection.
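To make the two paths concrete, here is a minimal sketch of the decision the BA makes when it sees a change. This is an illustration only, not the actual agent code: the class, the method names, and the `can_open` check are all hypothetical, and the snapshot step simply stands in for a real VSS snapshot.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BackupAgent:
    """Hypothetical stand-in for the BA; all names here are illustrative."""

    # Assumption: a callable that reports whether the file can be opened
    # through normal operating system methods right now.
    can_open: Callable[[str], bool]
    uploaded: List[str] = field(default_factory=list)
    pending_snapshot: List[str] = field(default_factory=list)

    def on_file_changed(self, path: str) -> str:
        """Path 1: upload right away if the file is available."""
        if self.can_open(path):
            self.uploaded.append(path)
            return "uploaded"
        # Path 2: the file is locked or in use, so defer it to the
        # snapshot pass that runs after a waiting period.
        self.pending_snapshot.append(path)
        return "deferred"

    def run_snapshot_pass(self) -> List[str]:
        """Fail-safe pass: with the volume frozen, deferred files are readable."""
        recovered = list(self.pending_snapshot)
        self.uploaded.extend(recovered)
        self.pending_snapshot.clear()
        return recovered
```

For example, with `can_open` returning False for a locked mail file, `on_file_changed` defers that file and a later `run_snapshot_pass()` picks it up.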
Now to understand what this means to you, a user, in terms of operational status and reporting, we need to contrast the “always on” methodology with the older/existing “scheduled” form of backup.
How scheduled backups work
In the old world of scheduled backups, you set a backup to run at a specific date and time (and interval generally), and then a scheduler ensures that this happens in a consistent fashion. This means every time that the schedule “starts”, the backup job starts, runs through its paces, and then finishes, sleeping until the next time the schedule requires it to start again.
From a logging and status perspective, this means you have a defined start time, a defined work list of files to protect/process, and a defined finish event. The backup job typically knows what files it is protecting, can gauge the amount of time required to protect them, and can definitely state at the end of the job exactly what has transpired. This is something that is very easy to log and provide reporting on; it’s an event that has occurred in the past, and the metrics are frozen in time to clearly be parsed and summarized.
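As a rough illustration of why this is easy to report on, a finished scheduled job can be captured in a single immutable record. The sketch below is hypothetical (the `JobReport` name and its fields are assumptions, not any product's log format); the point is only that start time, work list, and finish time are all known once the job ends.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass(frozen=True)  # frozen: the metrics never change after the job ends
class JobReport:
    started: datetime
    finished: datetime
    results: Dict[str, bool]  # file path -> True if it was protected

    @property
    def duration(self) -> timedelta:
        return self.finished - self.started

    def summary(self) -> str:
        protected = sum(self.results.values())
        failed = len(self.results) - protected
        return f"{protected} protected, {failed} failed in {self.duration}"
```

Because everything in the record is already in the past, a report is just a matter of formatting it.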
“Always on” backup and the challenges of reporting and status
When you move to a world where there is a possibility of continual backup (i.e. backup that occurs the instant a file is modified), you can see that you immediately run into challenges in describing and predicting your backup status in detail.
For example, there is no longer the concept of a single “job”; because we are always watching for file changes, at any given time there may be zero or there may be many files that are being processed. Providing detailed projections on the amount of data yet to be protected, and the status of files that are pending protection, poses unique challenges. To use an analogy from my previous description, we may have 1,000 backup jobs, or we may have zero, and at any time any of those jobs may change characteristics to include more or fewer files as we process the backup queue.
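One way to picture the difference: with no job to summarize, the best a status view can do is count what is in each state at the moment it is asked. The sketch below assumes a hypothetical set of per-file states ("protected", "uploading", "queued", "retrying"); these names are made up for illustration and are not the portal's actual vocabulary.

```python
from collections import Counter
from typing import Dict

# Hypothetical per-file states, chosen only for this illustration.
STATES = ("protected", "uploading", "queued", "retrying")


def queue_snapshot(file_states: Dict[str, str]) -> Dict[str, int]:
    """Count files per state at this instant.

    The result is only valid for the moment it was taken: a file change
    one second later can move any of these numbers.
    """
    counts = Counter(file_states.values())
    return {state: counts.get(state, 0) for state in STATES}
```

Two calls a few seconds apart can legitimately return different numbers, which is exactly what makes projections harder than in the scheduled model.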
This is compounded by the fact that we want to ensure that, from any web browser, you are able to manage and observe the status of your backup and restore operations. Providing up-to-the-second status is relatively trivial on a rich, endpoint focused application: you don’t have to deal with the latencies of the internet, or the impact of uploading and processing backup and restore logs and results continually to provide a useful, timely, human readable status on a web portal. That said, our commitment is to provide this functionality, and we are making progress on this commitment.
Challenges also arise in the area of Reporting. Anyone with database or report generation experience knows that creating reports on demand generally functions most efficiently when pulling data from a relational or other form of database. Unfortunately, keeping all of the data that people may want to report on also greatly increases the size of the records we keep, which ultimately makes them less manageable and has an impact on performance. As you can imagine, with thousands of file backup activities potentially happening daily on any given system, keeping a detailed long term record of these activities, and providing meaningful metrics on them, is a unique problem in and of itself.
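To illustrate the trade-off, one common way to keep record size in check is to roll individual events up into daily totals. To be clear, this is a generic technique shown under assumed names and an assumed event shape, not a description of how our reporting database works, and it shows the cost as well: once events are rolled up, the per-file detail is gone.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed event shape for this sketch: (ISO date, outcome, bytes uploaded).
Event = Tuple[str, str, int]


def roll_up(events: Iterable[Event]) -> Dict[str, Dict[str, int]]:
    """Collapse many per-file events into one small record per day."""
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"ok": 0, "failed": 0, "bytes": 0}
    )
    for day, outcome, size in events:
        bucket = summary[day]
        if outcome == "ok":
            bucket["ok"] += 1
            bucket["bytes"] += size  # only count bytes that were protected
        else:
            bucket["failed"] += 1
    return dict(summary)
```

Thousands of events per day become one row per day, which is cheap to store and query but cannot answer "what happened to this specific file?"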
Today’s status and reporting, and a look toward the future
Our power users today will be familiar with the detailed information that is available in the SPN portal under the Computer Profile. You can observe operations in progress (in close to real time), and review a basic history of events (backups and restores, among other, less useful items). Below are some complaints we hear about this page, and what we are working on to address them.
- Problem: “I get too many alerts, and many of them suggest a problem with my backups.”
Solution: Having read the discussion of how our technology works, including how we retry operations, you can see that there may be many cases in which an initial attempt to protect a file may appear to fail; however, in almost all cases we either immediately (or shortly thereafter) retry the operation and successfully protect the data. In our initial releases, we erred on the side of being a bit too verbose in notifying you of these failure/retry conditions, leading ultimately to a potential perception that backups were not successful (when in fact they were). We’re working diligently to cut down the number of alerts that are being raised, as well as the nature of alerts, so that you only get notified when there is a definitive reason to be concerned.
- Problem: “I’m not able to determine clearly the state of my protection (i.e. are there any files that aren’t protected, and why) at a glance.”
Solution: It is definitely true that today you must click down into the “real time” status in order to get this information. We agree that this is an extra, cumbersome step, and we’ll be working on the user interface to provide this information at a higher level. Our goal remains “at a glance” understanding of the current status of your protection.
- Problem: “The history events that you see in the computer profile are not always as descriptive or action oriented as I would like.”
Solution: We understand that having a richer history, including something like a traditional backup log that shows specific file protection events and failures, is critical to users trying to troubleshoot issues. While our backup technology should ensure that any failed operation is retried and successfully handled, we don’t want users to be denied the opportunity to inspect and confirm their data protection history to their satisfaction.
- Problem: “When I run a report, the summary details are excellent, but I am not able to really dig into specific issues and metrics in detail.”
Solution: As part of the same efforts that will allow us to provide a richer backup and restore history in the web portal, we will be extending that functionality into the Reporting data as well. Our goal is to provide very detailed reports, suitable for filing for audit purposes for example, that can be generated on demand or scheduled using our existing facilities.
- Problem: “The web portal provides many details on the existing backups/restores for a computer, but the system tray client provides only a summary.”
Solution: While the primary goal of SPN is always to provide rich management from the web portal, we definitely understand there are cases where it is more convenient or natural to look to the installed client application for status. Today we provide basic/summarized information about work in progress, and direct you to the portal for more information. While changes in this area are not prioritized as immediately as the other items I’ve mentioned, they are definitely in the plans and will be delivered as free, automatic LiveUpdates when they become available.
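Going back to the first complaint above (too many alerts), the idea of notifying you only when there is a definitive reason to be concerned can be sketched roughly as follows. The function name and the limit of three attempts are assumptions made for this example, not the real alerting rules.

```python
from typing import List


def should_alert(attempt_results: List[bool], max_attempts: int = 3) -> bool:
    """Decide whether a file's backup history deserves an alert.

    attempt_results holds one entry per attempt, True if it succeeded.
    max_attempts is a made-up threshold for this illustration.
    """
    if any(attempt_results):
        # A retry eventually protected the file, so earlier failures
        # are noise rather than a problem worth raising.
        return False
    # Alert only once every allowed attempt has failed.
    return len(attempt_results) >= max_attempts
```

Under this rule, a file that fails once and then succeeds on retry raises nothing, while a file that fails every attempt does.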
So when can you expect to see these changes?
Well, there won’t be a day when you wake up and suddenly the entire user experience is changed with all items resolved (although this would have made an interesting Christmas surprise). There will be a phased process during which we release the underlying technical changes, and then begin to expose them in the relevant areas of the user interface as each portion of work is available. In general terms, we would like to have the majority of these items released in the next 3-4 months.
However, it’s important to note that nothing about the overall backup technology itself is changing; it is a reliable solution, and the enhancements that we are providing to the user experience we hope will only serve to illustrate that for our users. Symantec Online Backup was always meant to be “set it and forget it”; I hope the new tools we provide will go further to serve those goals.
For suggestions, comments, and questions, as always feel free to use the comments or contact me at richard_goodwin@symantec.com