Amazon EC2: how I recovered

Disclaimer: if you’re looking for an Amazon-basher, you’ll have to look elsewhere.

I had two web sites running on a server at Amazon’s Virginian data center when it suffered the recent issues. The first I heard of it was when one of my primary users called on 20 April to let me know that the web site was down – it was consistently returning 500 Internal Server errors. I wasn’t able to connect to the web site, to Apex, to the database, or even by SSH. Next I checked the AWS Management Console, which reported “Instance connectivity, latency and error rates” for the US-East region – which happens to be the region where our site was running.

Immediately, I calmly went into panic mode. The first thing I thought I’d try was to reboot the instance. That took a while, but then it wouldn’t come back up again. Uh oh. The next thing I thought I’d do was detach the volume from the instance, re-attach it to a new instance, and hope I could bring it back up. AWS, however, seemed unable to detach the volume (probably because it needed to wait for outstanding writes to the disk to complete) – it was stuck in “detaching” mode.

There is an option to “force detach” a volume, but it comes with warnings that it might leave the volume in a corrupted state (since it might not have every change written). I decided to leave it alone for now.

Meanwhile, the web site is down, and my first priority is to get something up and running – so this is a chance to demonstrate that my restore procedure works. Yes, I’ve tested my restore process several times, so I was confident it would work, assuming the Amazon infrastructure is working ok.

The last backup I’d taken was a week ago – the morning of 16 April – stored as a snapshot in the US-East region. Generally, the data center was working reasonably well – it was mainly existing EBS instances that were suffering. Everything was extremely slow, and I got frequent errors. For example, I’d select my snapshot, go “create volume”, and it would report “sorry, an error stopped the volume being created”. So I’d try again, in various availability zones, same error every time. I’d wait for a while, then go back in and find half a dozen volumes in various states of Pending or Available status in various availability zones – it seems it was able to create the volumes after all.

So I delete all but one of them, then start up a new instance. By this time, Amazon have started putting more info on their status page, and it seems the problems are now limited to one availability zone – probably the one our site was in. Now, attach the backup volume to the new instance, and follow my restore procedure to get the site up and running. It took a bit longer than usual (hours instead of minutes), but finally it was all up and running. Switch the Elastic IP over to the backup instance, test the web site – all looks good, except (of course) that all data inserts/changes since 16 April have been lost. So I flip a few switches to stop emails and SMSs being sent (I don’t want people being confused by out-of-date notices), and set the Global Notification string on some of the key applications to let people know what’s going on.

Next step is to see if I can access the stuck volume. It took about two days, but finally the volume was detached, and my instance was terminated. So I start up another instance, get it set up, then attach the volume to it. It immediately was set to “attaching” status, and I couldn’t mount it – and it turned out to be another 24 hours (give or take 12 hours) before it was attached.

In the meantime, since I was just waiting, I thought it would be a good time to finally do what I’d been meaning to for a while – move the site to the Singapore data center, and take a long-overdue offline backup.

The point of the first goal was basically to see whether being physically closer translates to lower latency here in Perth. Also, I wanted to work out how best to move data between Amazon regions, so that I’d get even better redundancy. It was good that I had been taking weekly backups, but they were all stored in one data center on the US East coast, and I felt it would be a good idea to move a monthly backup set to a different region in case they irretrievably lost an entire data center.

Amazon don’t provide any option to just move a snapshot, volume, instance or anything between regions. It’s easy to move between availability zones within a region (by taking a snapshot, then creating a new volume from the snapshot targeting a different zone), but you can’t move from one region to another. Therefore, the only option I could find to do this was to bring up a new instance in the Singapore data center, create a volume with matching size (15GB), then transfer the data within the instance using dd and scp. This I did – it took at least 6-7 hours (I’m not sure exactly, because I got tired of waiting and went to bed) – and I had to run fsck on the volume after loading the data, because some of the directories seemed to have lots of inaccessible files showing “? ? ? ?” in the directory listing. After that, however, the site was up and running and my quick smoke test showed up no issues. So I allocated an Elastic IP to the instance, then changed my DNS settings to point to it.
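One way to gain some confidence in a block-level copy like this is to checksum the image on both ends. What follows is only a minimal sketch, not the procedure I actually ran: it assumes the volume image can be read as an ordinary file, and the names `copy_image`, `sha256_of`, `demo` and the chunk size are all made up for illustration.

```python
import hashlib
import os
import tempfile

# Hypothetical chunk size, similar in spirit to dd's bs= option.
CHUNK_SIZE = 4 * 1024 * 1024


def copy_image(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src to dst in fixed-size chunks; return the SHA-256 of the bytes read."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def sha256_of(path, chunk_size=CHUNK_SIZE):
    """Checksum a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Copy a small stand-in 'volume image' and confirm both ends match."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "volume.img")
        dst = os.path.join(tmp, "volume-copy.img")
        with open(src, "wb") as f:
            f.write(os.urandom(1024 * 1024 + 123))  # deliberately not chunk-aligned
        sent = copy_image(src, dst, chunk_size=64 * 1024)
        received = sha256_of(dst, chunk_size=64 * 1024)
        return sent == received
```

In a real transfer the second checksum would be computed on the receiving instance and compared by hand. A mismatch would mean the copy should be repeated before relying on fsck to patch things up.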

If the entire Amazon EC2 ecosystem were to be blasted off the map in some cataclysmic fireball, I doubt many people would care that my two little sites were gone. More likely, of course, is that I might suffer reduced access to one or more Amazon regions, and if I cannot access them I cannot restore from backup – unless I have one in my care. So I downloaded the dd’ed snapshot. It took about 3 hours to download the 7GB file.

Meanwhile, back on the ranch in Virginia, today (23 April) the old volume was marked as “attached” to my new instance, and I was able to log in to it and mount the volume. I ran fsck on it but it reported no issues. Got the database started up without any issues, and immediately took a complete export. scp’ed the export across to the new instance in Singapore. Imported the data to a new schema. There were only a few tables that would have had critical data created or updated in the past week, so I just compared them – it looked like no-one had made any changes to anything in the replacement instance, so I just truncated them and re-populated them from the data taken from Virginia.

All done – a bit of stress, lots of patience required, but in the end it was relatively simple to get things up and running again. In hindsight, it would have worked out even simpler and easier to just leave the stuck instance alone and wait – I suspect, eventually, it would have come up again no problems. I could have just brought up the backup instance while I waited, maybe set it to read-only. Oh well, lesson learned.

There’s a whole lot of FUD and anger directed at Amazon and cloud services in general, but to my mind, this would have been a darn sight more difficult (or even impossible) if we’d been self-hosting. For example, I didn’t have to leave home to do any of this. I haven’t had to worry about hardware failures or network connectivity at all. If the service goes down once a year like this, it’s not the end of the world. As long as I have a backup to go back to, all is not lost. Amazon never promised 100% uptime or even 100% data retention. So far, for me personally it’s been about 99.6% (over the past two years), and 100% data retention.
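To put that 99.6% figure in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: `downtime_hours` is a name invented for this example, and it assumes 365-day years.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: ignore leap years


def downtime_hours(availability_pct, years):
    """Hours of downtime implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * years * HOURS_PER_YEAR


# 99.6% availability over two years
print(round(downtime_hours(99.6, 2), 1))
```

That works out to roughly 70 hours, or about three days of downtime spread over the two years.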

And that’s a whole lot better than I could do by myself.

APEX role in Perth

Looks like someone’s looking for an APEX developer here in Perth.

Unfortunately for me, I don’t have any experience in “Apex 3.5” – I hadn’t even heard of that version…

Personal Blacklist for Google Search Results

Using Chrome? Sick of certain sites-I-won’t-name-here cluttering the search results when you google for technical issues?

Installed. Already blocked a few sites. Beautiful.

Plus, they assure us that the aggregated results of blacklisted domains may well be used to improve Google’s page ranking algorithm in the future…

btw – Read JoerT’s comment on the extension 🙂

EDIT: just noticed it doesn’t work in conjunction with AdBlock – if I enable AdBlock, the blacklist stops working.

AUSOUG Perth 2010 – Day 2

First off the mark this morning was Mark Lancaster on “Building Advanced APEX 4.0 UIs with Ext JS”, which was an eye-popping demo of some wonderful things you can do when you combine the power of Ext JS with Apex.

Tom Kyte presented via Webinar his “The Best Way – Things You Know” presentation, which I had already enjoyed in Melbourne but it’s always worth revisiting these things – helps to counter the constant wave of opposite sentiment from the other side of the spectrum.

Some years ago I had a quick look at REST, as an alternative to SOAP – but never really got the hang of it. So I was interested in being introduced properly by Chris Muir in his talk “A Change is as Good as a REST – JDeveloper 11g’s REST Web Services”. This double-length presentation was worth attending: he started with an excellent definition of web services, their history and REST’s heritage; explained the power and simplicity of REST; compared and contrasted it with its complex and comprehensive cousin, SOAP; and demonstrated how easy it is to create and expose simple REST web services using JDeveloper.

After lunch, we were entertained by Guy Harrison’s keynote address, “Technology Trends that have the potential to make big impacts both in our everyday life and as Oracle professionals”. They had to close down all the other conference rooms just to make room for the presentation title in the programme 🙂 But it was a fun talk speculating about the kinds of technology our kids and our kids’ kids will probably be all blasé about.

Connor McDonald fired us all up with “A Better Way of Managing Optimizer Statistics”. He claims that we should stop collecting statistics and stop creating histograms 🙂 – I suspect a number of DBAs are now wondering why they wasted so much of their time (and so much server time) for so long…

I finished the day with Frank Bommarito’s “Outlines, Profiles, and SQL Plan Baselines” which was a good introduction to the subject and for me was a good overview of some new features I haven’t used.

AUSOUG Perth 2010 – Day 1

After a leisurely sleep-in (after a weekend away at the parents-in-law’s farm) I made my way to Burswood for the first day of the AUSOUG Perth Conference 2010.

After Roland Slee’s keynote (“consolidate consolidate consolidate!”), I headed upstairs for Steven Feuerstein’s “Golden Rules for Developers” – webinar edition. Unfortunately due to technical issues it started late (no fault of Steven’s) but I think he got the important points across.

Following that was Penny Cookson with “Meet the CBO in Version 11g”. She explained a number of improvements in the Cost-Based Optimizer that came with 11g, including a detailed demonstration of adaptive cursor sharing.

After lunch I decided to take in a DBA session – Guy Harrison spoke about how Oracle runs on VMware, which had some very interesting info about the difference between Full Virtualisation, Paravirtualisation, and Hardware-Assisted Virtualisation. A lot of it went over my head but I got a slightly better picture of what’s going on when I run an OS in a VM, as well as how proper memory and CPU allocation can make a huge difference to the performance of Oracle in a virtual environment.

I lost count of how many great tips Scott Wesley gave in his “‘n’ Methods to Improve APEX Performance” presentation – but there were a lot of great ideas, many that are simple and easy to implement, which can make a big difference to the performance of your Apex applications.

It was great to see the level of interest in APEX Themes and Templates – if you’d like to look through the bits that I skipped over, feel free to download my presentation from here.

Oracle OpenAustralia

The draft programme is out for the AUSOUG National Conference 2010.

If you’ll be in Perth in November I recommend you register and attend – a number of excellent papers will be presented, some of which I had the privilege of hearing when I was in Melbourne – you’ll learn new things, relearn old things you’d forgotten, and meet some giants in the Oracle world who will have travelled great distances to get here.

Some highlights, in no particular order:

  • Steven Feuerstein – “Golden Rules for Developers”
  • Penny Cookson – “Meet the CBO in Version 11g”
  • Guy Harrison – “Optimizing Oracle databases on VMware”
  • Mogens Nørgaard – “Licensing – Tales from the Trenches and other thoughts on Oracle”
  • Tom Kyte – “The Best Way”
  • Frank Bommarito – “Outlines, Profiles and SQL Plan Baselines”
  • Connor McDonald – “A Better Way of Managing Optimizer Statistics”
  • Mark Lancaster – “Building Advanced APEX 4.0 UIs with ExtJS”

I’ll be presenting my “Apex Themes and Templates” paper, which I presented in Melbourne last month – however it will be updated with a few additional bits and pieces that I’ve learned since then.

InSync10 Day 2

Another good day in Melbourne. Heard Richard Foote talk about Indexing New Features in Oracle 11g release 1 and 2. One thing he demonstrated was the creation of an index on only part of a table – normally I’d use a function-based index for this sort of thing, but his technique results in an index that is useful without adding strange predicates to all relevant queries in the application; it involves creating a globally partitioned index, in an UNUSABLE state, then rebuilding only selected partitions. This could be very useful for customers who have the partitioning option.

Of interest to me was Discovering the Power to Save the Planet, presented by Robin Eckermann (Smart Grid Australia) – having worked for a short time at Western Power, it was interesting to hear his perspective on the future of the generation and distribution of power. He compared the state of the art in power to broadband, as it was 15 years ago – and asserts that the smart grid will enable all sorts of new applications for customers to regulate their demand intelligently, and is essential for the coming wave of electric cars.

After that was Steven Feuerstein’s second talk, “Golden Rules for Developers”, which was well worth a good listen. I recommend you download and read the PowerPoint if you missed it. If you take even just one of his recommendations (e.g. Don’t Repeat Anything, Don’t Take Shortcuts, Build On A Foundation, Don’t Code Alone), I think you will improve the quality of your code, reduce the cost of maintenance for your employer/client, and be much more satisfied with your work. I certainly intend to – I’ve been guilty of “starting from scratch” many times – I do carry around a portable hard drive with a large collection of bits and pieces I’ve collected along the way, but nothing I can just plug in and use with confidence. Steven also gave another PL/SQL talk at the end of the day, this time for DBAs, and that was interesting to me (as a developer). If you’re a DBA, but think that you have no need for PL/SQL, think again.

After that, during lunch, Steven announced the winners of the previous day’s quiz – and wouldn’t you know it, I won 🙂

InSync10 Day 1

After a scrumptious breakfast at the Armoury I headed in what I believed was the general direction of the Melbourne Convention Centre – after making a wrong turn I eventually spotted a footbridge over the river that rang a bell from my GoogleEarthing; after taking some photos I was finally at InSync10.

The first session was Connor McDonald’s 11g Features for Developers, which was an eclectic mix of bits and pieces you won’t get from reading the New Features Guide or from Oracle Marketing, along with some gratuitous use of photos of his kids.

Steven Feuerstein didn’t present next; instead he made us think by running a Developer Quiz. Much like the PL/SQL Challenge (at which, by the way, you should sign up this instant if you haven’t already), it was fun and challenging, and I suspect everyone learned at least one new thing. Me, I learned what SUBSTR returns if the 2nd parameter (which normally starts at 1) is zero. As always, Steven was completely open to criticism, and with Connor and Tom in the room he certainly didn’t get off scot-free 🙂
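For the curious: as far as I understand the documented behaviour, Oracle treats a position of 0 the same as 1, counts backward from the end for a negative position, and returns NULL when the requested length is less than 1. The Python function below is only a rough model of those rules, not Oracle code; `oracle_substr` is a made-up name, and `None` stands in for NULL.

```python
def oracle_substr(text, position, length=None):
    """Rough model of Oracle's SUBSTR(text, position [, length]); None means NULL."""
    if text is None or text == "":
        return None  # Oracle treats the empty string as NULL
    if position == 0:
        position = 1  # a position of 0 is treated as 1
    if position > 0:
        start = position - 1  # Oracle positions are 1-based
    else:
        start = len(text) + position  # negative positions count back from the end
    if start < 0 or start >= len(text):
        return None  # position falls outside the string
    if length is None:
        return text[start:]
    if length < 1:
        return None
    return text[start:start + length]


# Positions 0 and 1 give the same result in this model.
print(oracle_substr("ABCDEFG", 0, 3), oracle_substr("ABCDEFG", 1, 3))
```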

As it turned out, I disagreed on one question, which was regarding the USING clause and how many bind variables must be supplied to a given statement. One of the responses (from memory) was that “you must always supply as many bind variables as there are placeholders”. I knew that if the statement being executed was SQL, the number of bind variables must match the number of placeholders, even if some of them have the same names (e.g. INSERT INTO emp VALUES (:a, :b, :a, :b) would require four bind variables). However, I also knew that if the statement is a PL/SQL block, each unique placeholder requires a different bind variable – if the placeholder appears more than once in the block, you don’t repeat the bind variable in the USING clause. I therefore ticked this answer as “correct” – if, for example, the block was BEGIN call_something(:a, :b, :a, :b); END;, you would have to provide two bind variables, because that is how many distinct placeholders there are in the block.

There was some discussion about this, because the answer was marked incorrect – according to Steven the number of placeholders in the block above is four, not two – and I agree that the meaning of a “placeholder” is different to a “bind variable”, although I usually speak as if to conflate the two ideas. However, I still hold to the opinion that a “placeholder” in the context of a PL/SQL block is a reference to this: :a, and I would say that the one placeholder :a appears twice in the PL/SQL block. I believe I have the documentation to back me up:

If the dynamic statement represents a PL/SQL block, the rules for duplicate placeholders are different. Each unique placeholder maps to a single item in the USING clause. If the same placeholder appears two or more times, all references to that name correspond to one bind argument in the USING clause. In Example 7-7, all references to the placeholder x are associated with the first bind argument a, and the second unique placeholder y is associated with the second bind argument b.

(emphasis added) Source: Using Duplicate Placeholders with Dynamic SQL
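The counting rule can be modelled in a few lines. This is only a sketch of the rule as quoted above, written in Python rather than PL/SQL. The names `PLACEHOLDER` and `using_clause_size` are invented for the example, and the pattern is deliberately naive: it does not skip string literals or comments.

```python
import re

# A placeholder is a colon immediately followed by an identifier, e.g. :a or :emp_id.
# The assignment operator := is not matched, because = is not a word character.
PLACEHOLDER = re.compile(r":(\w+)")


def using_clause_size(statement, is_plsql_block):
    """How many bind arguments the USING clause needs for a dynamic statement."""
    names = [name.lower() for name in PLACEHOLDER.findall(statement)]
    if is_plsql_block:
        # PL/SQL block: each unique placeholder name maps to one bind argument.
        return len(set(names))
    # SQL statement: placeholders are positional, so duplicates count separately.
    return len(names)


sql = "INSERT INTO emp VALUES (:a, :b, :a, :b)"
block = "BEGIN call_something(:a, :b, :a, :b); END;"
print(using_clause_size(sql, is_plsql_block=False))   # four, as in the SQL example
print(using_clause_size(block, is_plsql_block=True))  # two, as in the block example
```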

This is really just an argument over semantics, so no big deal. Some of the other questions had much more interesting discussion, so it was well worth attending. If you’re in Perth on Friday, Steven is running it again (I won’t be able to attend, unfortunately). I presume he will be using different questions…

After that I went for a walk through Melbourne, since it was sunny outside. The climate in Melbourne, I discovered, is a tad different to Perth. Wandering along the riverside, I ended up experiencing a blast of all four seasons within the space of an hour – a lovely spring breeze, a somewhat warmish summer, then a cold blustering windy autumn (a bit out of order that) – there were a few seconds when it was difficult to remain upright – followed by a sudden rainstorm. I managed to find shelter under one of the many bridges that cross the river, waited for about ten minutes, then was able to walk back to the centre without getting any wetter. In fact, by the time I got back to the convention centre it was sunny again.

After a light lunch it was my turn to talk, and I think my presentation on APEX Themes and Templates went quite well. I appreciated the comments and questions that came back, and had some further discussion with a few people afterward as well, which was good.

I forwent Connor’s excellent Partitioning presentation, which I’ve heard before, and instead heard Kyle Hailey – Database Performance Made Easy – demonstrate the virtues of database tuning using a tool such as the one he’s produced at Embarcadero. I haven’t made use of many graphical tuning tools before, preferring just “the numbers”, but Kyle made an excellent case for the use of pictures instead of words for not only visualising the workload on the database (such as presented by Oracle’s Enterprise Manager, which Kyle had a hand in), but also for visualising the structure of a query. Personally, I’ve grown accustomed to using the traditional explain plan and I suspect I’ll probably continue to, but the Embarcadero product does have some features that automate some of the work I’d normally do by hand (such as examining the constraints on the tables and obtaining filter percentages).

Last of all, Tom Kyte presented The Best Way, in which he laid to rest once and for all the answer to the age-old (and oft-repeated) question, “what is The Best Way to …?”. Finally, we can stop arguing over which way is worthy of being called Best Practice, and get on with the job 😉

Went out for a nice dinner at a small Japanese restaurant, which had a great cozy atmosphere, and on the way back to the hotel was surprised by these great explosions of flame from these pillars. I could feel the heat from hundreds of meters away. At the end, a quick stop at a store allowed me to procure what I’d been coveting all day: Farmer’s Union Iced Coffee.

Tag wikis on StackOverflow

StackOverflow now allows the creation of wiki articles and tag synonyms.

I’ve gone ahead and started a few articles about a few topics dear to my heart:

These articles are not intended to be replacements for the documentation or Wikipedia, but primarily as a guide for people choosing tags when asking questions.

If you click on the “oracle” tag you’ll see a large number of “related tags”, some of which could probably benefit from additional articles.

Priority #1: Keep it simple

Every place has a different way of assigning priority and/or severity to defect reports – some bigger places have many different ways (unfortunately). I’ve not been subjected to Prince2 training so here’s my take on this subject.

I reckon, the simpler the scheme, the more likely it will be used consistently. Every defect should have just a single priority/severity (call it what you will): Critical, High, Medium or Low.

  • Critical – problem significantly affects ability to test the system; “showstopper” – all other work to be delayed until the issue is resolved – an example might be “unable to log in”; “screen x opens with error every time”; “function y causes my computer to explode”
  • High – problem affects critical functionality; should be fixed as a matter of priority over other issues – an example would be “error x always/often occurs at process point y”
  • Medium – functionality not working as per specified requirements; must eventually be fixed (at least before Go Live) – an example would be “default value not being set correctly”; “navigation does not work correctly”
  • Low – cosmetic issue; “nice to have” function; or error/warning occurs very infrequently but doesn’t significantly affect correct processing; ok to Go Live if not fixed

That’s it. Notice how each category is unambiguous in what it means to the developers, testers and others. I’d expect a system to normally have mostly Medium issues, several Highs, hopefully no Criticals, and maybe some Lows. I’d expect some issues to be reclassified up or down as they are assessed, as developers negotiate with the testers and business reps.
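If you wanted to encode this scheme in a defect tracker, it might look something like the sketch below. The names `Priority` and `blocks_go_live` are hypothetical and not from any particular tool; the Go Live rule is simply my reading of the four definitions above.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Single priority/severity level for a defect; lower number = more urgent."""
    CRITICAL = 1  # showstopper: all other work waits until it is resolved
    HIGH = 2      # critical functionality affected: fix ahead of other issues
    MEDIUM = 3    # not per requirements: must be fixed before Go Live
    LOW = 4       # cosmetic / nice to have: OK to Go Live if not fixed


def blocks_go_live(priority):
    """Only Low issues may remain unfixed at Go Live."""
    return priority != Priority.LOW


# Hypothetical defect list, worked in order of urgency.
defects = [("default value not set", Priority.MEDIUM),
           ("unable to log in", Priority.CRITICAL),
           ("label misaligned", Priority.LOW)]
for title, priority in sorted(defects, key=lambda d: d[1]):
    print(priority.name, title, blocks_go_live(priority))
```

Using a single ordered enum keeps sorting and the Go Live check trivial, which is rather the point of having only four levels.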

I’m certain that there are all sorts of great reasons why someone needs more levels, or needs to separate the “priority” concept from the “severity” or “impact” concepts, but to my mind there’s not a lot gained from forcing all your testers, developers, and change managers to learn a complicated system, and classify and update their records. When you need a 2D or 3D matrix of priority vs severity vs whatever printed and posted on your cubicle wall, it’s time to ask, “is all this really necessary?”.

Keep it simple, and not only will everyone use it, everyone else will understand it.

P.S. Did you notice that Apex’s built-in “feedback” feature only has one level? It’s either a bug report, or it’s not (e.g. an enhancement request or comment). I love that.