For a long time my disaster recovery procedure for my Amazon EC2-based web site was:
- Find an Oracle AMI that has 10g XE with APEX pre-installed, and start up an instance with it.
- Create a volume from a backup snapshot and attach it to the new instance.
- Log into the instance, shut down apache and oracle, then delete all the oracle data files, apache config files, and a few other bits and pieces.
- Create symbolic links for the deleted bits and pieces (including the oracle data files) that point to the attached volume.
- Start up oracle and apache and test.
- Switch the elastic IP over to the new instance.
This procedure has been tested and retested multiple times, and came in useful once when I almost lost the site (actually, it just went unresponsive due to some general problems at Amazon, but at the time I thought it had gone down).
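The delete-and-symlink steps are the fiddly part of that procedure, and they can be scripted. This is a minimal sketch rather than the script actually used: the function name relink_to_volume is hypothetical, and it assumes the backup volume is already attached and mounted.

```shell
# Hypothetical helper: replace a path on the fresh instance with a symbolic
# link to the restored copy on the attached backup volume.
# Usage: relink_to_volume <path-on-instance> <path-on-volume>
relink_to_volume() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: relink_to_volume <path-on-instance> <path-on-volume>" >&2
        return 2
    fi
    # Refuse to delete anything unless the restored copy really exists.
    if [ ! -e "$2" ]; then
        echo "missing on backup volume: $2" >&2
        return 1
    fi
    # Remove the copy that came with the AMI, then link to the backup.
    rm -rf -- "$1"
    ln -s -- "$2" "$1"
}
```

With oracle and apache shut down, a call might look like relink_to_volume /usr/lib/oracle/xe/oradata /u02/oradata, where both paths are assumed examples and not the ones from my instance.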
Last week I logged in to the AWS Management Console to do a routine backup-and-restore test, but discovered a problem: I couldn’t find the Oracle 10g XE AMI. Searches on the image ID and various keywords, across all the Amazon regions, returned no results. Searches on “oracle” brought back a number of options but none of them close to what I required. I enquired with Amazon and they responded that the AMIs are supplied by Oracle and had been removed. I discovered this meant that not only could I not start up an instance using one of these images, I also couldn’t point to my running instance and “start up another like this” – because this requires access to the original AMI that was used to start it.
The machine image which I was trying to find is (at least, as of today 5 Jul 2011) still referred to here: http://aws.amazon.com/amis/Oracle/1720 “Oracle Database 10g Release 2 Express Edition – 32 Bit” so I raised a question on the oracle forum (http://forums.oracle.com/forums/thread.jspa?messageID=9707298) and sent an email to Bill Hodak at Oracle who was named in the description of the AMI. He replied he would see if he could find out what had happened to it.
At this point I was hoping that my running instance wouldn’t go down, because I didn’t know if I’d be able to restore from backup. My backup consisted solely of snapshots of just the data – the rest of the OS was supposed to be provided by the AMI.
Meanwhile, asam replied to my oracle forum thread, suggesting I create my own AMI. A bit of googling yielded this result, which proved very helpful: http://webkist.wordpress.com/2010/03/16/creating-an-amazon-ec2-ebs-ami-from-a-running-instance/ “Creating an Amazon EC2 EBS AMI from a running instance”. I followed the instructions, slightly modified as follows:
- Use AWS Management Console to create a new volume
- Attach the volume to my running instance and mount it:
# mkdir /u03
# mount -t ext3 /dev/sdf /u03
- Move everything from the old volume so that it all sits under / again instead of via symbolic links
- Synchronize the filesystem to the new volume:
# rsync -a --delete --progress -x / /u03
- When rsync has completed, fix up the devices:
# MAKEDEV -d /u03/dev -x console
# MAKEDEV -d /u03/dev -x zero
# MAKEDEV -d /u03/dev -x null
- Unmount the volume:
# umount /u03
- Get the EC2 X.509 cert and private key from the “Security Credentials” area under “Account” in AWS Management Console.
- Download the Amazon EC2 API tools.
- I needed java to run the API tools, so download the rpm:
jre-6u26-linux-i586.rpm – e.g. from http://www.oracle.com/technetwork/java/javase/downloads/jre-6u26-download-400751.html
- Upload the EC2 X.509 cert and private key, the Amazon EC2 API tools, and the java rpm to the instance. Unzip and install the API tools and the java rpm.
- Set up all the required environment variables (replace xxx with the appropriate bits from the relevant file names):
# export EC2_CERT=/root/cert-xxx.pem
# export EC2_PRIVATE_KEY=/root/pk-xxx.pem
# export EC2_HOME=(path-to-ec2-stuff)
# export JAVA_HOME=(path-to-java-stuff)
# export PATH=$PATH:$EC2_HOME/bin
- Set up a symbolic link so that the EC2 tools can find java:
# ln -s (path-to-java-stuff) /usr/bin/java
- Back in the AWS Management Console, create a snapshot of the volume.
- In the instance, run this command (this is the only command you can’t do in the management console, which is what all that rigmarole about installing the API tools was for):
# ec2-register --snapshot snap-xxx --description "my ami description" --name "my ami name" \
    --ramdisk ari-yyy --kernel aki-zzz --region ap-southeast-1
You can get the snapshot, ramdisk and kernel identifiers from the AWS Management Console (my instance was running in Singapore, so my region is ap-southeast-1).
- Back in AWS Management Console, I see my new AMI has been created. All I have to do now is select it, click Launch Instance, and a copy of my site is up and running.
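The API tools depend on several environment variables, and a mistyped path tends to surface only as an unhelpful error from the tools themselves. A small pre-flight check can catch that first. This is a sketch only: the function name check_ec2_env is hypothetical, and it checks nothing beyond whether each variable is set and points at something that exists.

```shell
# Hypothetical pre-flight check: confirm the variables the EC2 API tools
# rely on are set and point at an existing file or directory.
check_ec2_env() {
    local name value status
    status=0
    for name in EC2_CERT EC2_PRIVATE_KEY EC2_HOME JAVA_HOME; do
        # Read the variable whose name is held in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -e "$value" ]; then
            echo "$name points to a missing path: $value" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running check_ec2_env before ec2-register prints one line per problem and returns non-zero if any variable is missing or wrong.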
I start up the database and see if it’s working. Unfortunately it isn’t – a bit more investigation shows that the Oracle listener is not responding to requests. lsnrctl status reveals that it is still using the old internal IP address from the original instance – but this is a new instance with a different internal IP address.
To fix this, I edit listener.ora to correct the IP:
# cd /usr/lib/oracle/xe/app/oracle/product/10.2.0/server/network/admin
# chmod +w listener.ora
# vi listener.ora
The IP address is listed as the “Private IP Address” on the instance in AWS Management Console.
# lsnrctl start
After that, it’s all working – and very soon I will have a much simpler (and hopefully somewhat less reliant on the kindness of big corporations) disaster recovery process. I just need to work out the simplest way to restore the data from backup to the new instance. I’ll probably just create a new volume from a backup snapshot, attach it to the instance, and copy all the data across.
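Since every instance launched from the AMI gets a new private IP, the listener.ora edit has to be repeated each time. A sketch of how it could be scripted, assuming the file holds the address in the usual (HOST = ...) form; the function name set_listener_host is hypothetical.

```shell
# Hypothetical helper: point each HOST entry in listener.ora at a new IP.
# Usage: set_listener_host <path-to-listener.ora> <private-ip>
set_listener_host() {
    # Accept only digits and dots so the value is safe inside the sed script.
    case "$2" in
        ""|*[!0-9.]*) echo "not an IPv4 address: $2" >&2; return 2 ;;
    esac
    # Keep the original as listener.ora.bak, then rewrite the value after
    # "HOST =" up to the closing parenthesis.
    sed -i.bak -E "s/(HOST[[:space:]]*=[[:space:]]*)[^)]*/\1$2/" "$1"
}
```

The file still needs chmod +w first, and the listener still needs lsnrctl start afterwards, as in the manual steps.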
UPDATE: With an EBS-backed instance, I can now create a new AMI from the running instance whenever I want – it takes a complete snapshot of the instance, from which I can then create new instances. So my disaster recovery procedure is much simpler than it was before 🙂
Disclaimer: if you’re looking for an Amazon-basher, you’ll have to look elsewhere.
I had two web sites running on a server at Amazon’s Virginian data center when it suffered the recent issues. The first I heard of it was when one of my primary users called on 20 April to let me know that the web site was down – it was consistently returning 500 Internal Server errors. I wasn’t able to connect to the web site, to Apex, to the database, or even by SSH. Next I checked the AWS Management Console, which reported “Instance connectivity, latency and error rates” for the US-East region – which happens to be the region where our site was running.
Immediately, I calmly went into panic mode. The first thing I thought I’d try was to reboot the instance. That took a while, but then it wouldn’t come back up again. Uh oh. The next thing I thought I’d do was try to detach the volume from the instance, re-attach it to a new instance, and hope I could bring it back up. AWS, however, seemed unable to detach the volume (probably because it needed to wait for outstanding writes to the disk to complete) – it was stuck in “detaching” mode.
There is an option to “force detach” a volume, but it comes with warnings that it might leave the volume in a corrupted state (since it might not have every change written). I decided to leave it alone for now.
Meanwhile, the web site is down, and my first priority is to get something up and running – so this is a chance to demonstrate that my restore procedure works. Yes, I’ve tested my restore process several times, so I was confident it would work, assuming the Amazon infrastructure is working ok.
The last backup I’d taken was a week ago – the morning of 16 April – stored as a snapshot in the US-East region. Generally, the data center was working reasonably well – it was mainly existing EBS instances that were suffering. Everything was extremely slow, and I got frequent errors. For example, I’d select my snapshot, go “create volume”, and it would report “sorry, an error stopped the volume being created”. So I’d try again, in various availability zones, same error every time. I’d wait for a while, then go back in and find half a dozen volumes in various states of Pending or Available status in various availability zones – it seems it was able to create the volumes after all.
So I delete all but one of them, then start up a new instance. By this time, Amazon have started putting more info on their status page, and it seems the problems are now limited to one availability zone – probably the one our site was in. Now, attach the backup volume to the new instance, and follow my restore procedure to get the site up and running. It took a bit longer than usual (hours instead of minutes), but finally it was all up and running. Switch the Elastic IP over to the backup instance, test the web site – all looks good, except (of course) that all data inserts/changes since 16 April have been lost. So I flip a few switches to stop emails and SMSs being sent (I don’t want people being confused by out-of-date notices), and set the Global Notification string on some of the key applications to let people know what’s going on.
Next step is to see if I can access the stuck volume. It took about two days, but finally the volume was detached, and my instance was terminated. So I start up another instance, get it set up, then attach the volume to it. It immediately was set to “attaching” status, and I couldn’t mount it – and it turned out to be another 24 hours (give or take 12 hours) before it was attached.
In the meantime, since I was just waiting, I thought it would be a good time to finally do what I’d been meaning to for a while – move the site to the Singapore data center, and take a long-overdue offline backup.
The purpose of my first goal was basically to see if being closer physically translates to lower latency to here in Perth. Also, I wanted to work out how best to move data between Amazon regions, so that I’d get even better redundancy. It was good that I had been taking weekly backups, but they were all stored in one data center on the US East coast, and I felt it would be a good idea to move a monthly backup set to a different region in case they irretrievably lost an entire data center.
Amazon don’t provide any option to just move a snapshot, volume, instance or anything between regions. It’s easy to move between availability zones within a region (by taking a snapshot, then creating a new volume from the snapshot targeting a different zone), but you can’t move from one region to another. Therefore, the only option I could find to do this was to bring up a new instance in the Singapore data center, create a volume with matching size (15GB), then transfer the data within the instance using dd and scp. This I did – it took at least 6-7 hours (I’m not sure, because I got tired of waiting and went to bed) – and I had to run fsck on the volume after loading the data, because some of the directories seemed to have lots of inaccessible files showing “? ? ? ?” in the directory listing. After that, however, the site was up and running and my quick smoke test seemed to show up no issues. So I allocated an Elastic IP to the instance, then changed my DNS settings to point to it.
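A rough sketch of what a dd-based transfer like this can look like. The function names are hypothetical, and the gzip compression and the checksum step are assumptions added for illustration, not a record of the exact commands used.

```shell
# Hypothetical helpers for moving a volume between regions as a compressed
# image. Unmount the volume first where possible: imaging a live filesystem
# can capture it mid-write.

# Usage: image_volume <device-or-file> <image.gz>
image_volume() {
    dd if="$1" bs=1M 2>/dev/null | gzip -c > "$2"
}

# Usage: restore_volume <image.gz> <device-or-file>
restore_volume() {
    gzip -dc "$1" | dd of="$2" bs=1M 2>/dev/null
}

# Print the SHA-256 of the uncompressed image, to compare on both ends.
# Usage: image_checksum <image.gz>
image_checksum() {
    gzip -dc "$1" | sha256sum | cut -d' ' -f1
}
```

On the source instance this would be something like image_volume /dev/sdf /mnt/vol.img.gz, followed by scp of the image to the new instance, then restore_volume /mnt/vol.img.gz /dev/sdf there (device names and paths are examples). Comparing image_checksum on both sides shows whether the file survived the copy intact.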
If the entire Amazon EC2 ecosystem were to be blasted off the map in some cataclysmic fireball, I doubt many people would care that my two little sites were gone. More likely, of course, is that I might suffer reduced access to one or more Amazon regions, and if I cannot access them I cannot restore from backup – unless I have one in my care. So I downloaded the dd’ed snapshot. The 7GB file took about 3 hours to download.
Meanwhile, back on the ranch in Virginia, today (23 April) the old volume was marked as “attached” to my new instance, and I was able to log in to it and mount the volume. I ran fsck on it but it reported no issues. Got the database started up without any issues, and immediately took a complete export. scp’ed the export across to the new instance in Singapore. Imported the data to a new schema. There were only a few tables that would have had critical data created or updated in the past week, so I just compared them – it looked like no-one had made any changes to anything in the replacement instance, so I just truncated them and re-populated them from the data taken from Virginia.
All done – a bit of stress, lots of patience required, but in the end it was relatively simple to get things up and running again. In hindsight, it would have worked out even simpler and easier to just leave the stuck instance alone and wait – I suspect, eventually, it would have come up again no problems. I could have just brought up the backup instance while I waited, maybe set it to read-only. Oh well, lesson learned.
There’s a whole lot of FUD and anger directed at Amazon and cloud services in general, but to my mind, this would have been a darn sight more difficult (or even impossible) if we’d been self-hosting. For example, I didn’t have to leave home to do any of this. I haven’t had to worry about hardware failures or network connectivity at all. If the service goes down once a year like this, it’s not the end of the world. As long as I have a backup to go back to, all is not lost. Amazon never promised 100% uptime or even 100% data retention. So far, for me personally it’s been about 99.6% (over the past two years), and 100% data retention.
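To put the 99.6% figure in perspective, it can be turned into days of downtime. A quick sketch, taking two years as 730 days (an assumption); the function name downtime_days is hypothetical.

```shell
# Convert an availability percentage over a period into days of downtime.
# Usage: downtime_days <availability-percent> <period-in-days>
downtime_days() {
    awk -v pct="$1" -v days="$2" \
        'BEGIN { printf "%.1f\n", days * (100 - pct) / 100 }'
}
```

downtime_days 99.6 730 prints 2.9, so roughly three days in total over the two years.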
And that’s a whole lot better than I could do by myself.
I was working happily on my laptop in the living room, kids playing on the rug, birds were singing, et cetera. All of a sudden, a blood-curdling scream erupts from the office. It was my wife.
“Jeff! Is there an “undo” function in the roster program??!?!?”
(A very quick bit of background: the “roster program” is a little Apex application I wrote so my wife can manage a roster of over 100 volunteers at our local church, assigning them to a range of duties, while ensuring that they are available, are willing to perform the duty, and that their assignments do not conflict with other assignments (i.e. they normally can’t do two jobs at once).)
I open the program and look at the roster. It’s almost completely blank. Only half an hour previously it was almost completely filled. Not looking good. Rosalie runs into the room, beside herself – with good reason, even with the computer helping it takes a lot of effort to assign all the jobs.
Her: “Didn’t you build an “undo” feature?”
Me: “No – I didn’t get around to it…” While querying the database directly and finding that yes, indeed, all the rows have nothing but NULLs, I’m preparing to console her and offer to help to rebuild it. “Do you remember what was happening just before it all disappeared?”
Her: “I selected all the dates, selected just the “Helper” jobs, then clicked the “Clear Dates” button.”
Me: “Ohhhhhhhhhh……” Disaster. I now explain that the “Clear Dates” button is intended to clear all the assignments for the dates selected, not just the ones showing on the screen. Plus, she’d selected all the dates, so it went off and merrily cleared every single assignment from the roster. “Did you happen to email any spreadsheets to anyone?” I ask in vain hope.
Her: “Yes, but only for a few jobs. I guess I can put those back in and start the rest from scratch.” says my poor wife, trudging away knowing she’ll be doing this for the next five hours. Instead of cooking dinner. This is getting worse by the minute!
Me: “Hang on! I have an idea – leave it with me.” I say, thinking, “I hope that the rollback segment is big enough…”
I run this query:
select * from roster_dates2
as of timestamp systimestamp - 0.1;
With this result:
ORA-01555: snapshot too old: rollback segment number 3
with name "_SYSSMU3$" too small
Ok, maybe a shorter time difference (the number is in days, so 0.1 is about 2.4 hours and 0.01 is about 14 minutes):
select * from roster_dates2
as of timestamp systimestamp - 0.01;
Like magic, all the roster assignments that had been NULL are showing as NOT NULL. Brilliant! So now some UPDATE wizardry…
update roster_dates2
set (vol_id_worship_am, vol_id_worship_pm, ...)
= (select vol_id_worship_am, vol_id_worship_pm, ...
from roster_dates2 as of timestamp systimestamp - 0.01 x
where x.roster_date = roster_dates2.roster_date)
where roster_date between to_date('05-APR-2009','DD-MON-YYYY')
and add_months(sysdate, +12);
A quick query to check it hasn’t done anything drastically wrong, then commit.
Me: “Rosalie, do you want to hit the Refresh button?”
Then, fast steps.
A big smile followed closely by my wife bursts into the room and gives me a big kiss.
Me: “Am I a wizard?”
Her: “Yes, darling, you are a wizard.”
I add some additional code to the start of the “Clear Dates” button:
RAISE_APPLICATION_ERROR(-20000,'Sorry, this function has been disabled.');
Life is good.