Sunday, June 22, 2014

Some thoughts from last PoC

For the second time I have been involved in a proof of concept (PoC) where the goal was to test an alternative hardware platform. The two PoCs had this in common:

  1. The other option was Exadata
  2. Real Application Testing (RAT) was used to replay production recordings on the new hardware.

And for the second time I have concluded that sales people take this too lightly. It is not easy, and assuming you will outperform the previous platform just because you have a bunch of less expensive flash disks, cache cards, and more is arrogant at best.

A decent PoC requires lots of preparation and you need to have done some planning for capacity and load. If you have no idea of throughput, expected load, the nature of the applications and more, you are likely to fail.

Exadata is not the optimal solution for everybody, and that is not why Oracle charges lots of money for it; it is not like they are thinking “Everybody will want this, so we can double the cost”. Not that I know from the inside, but someone at Oracle probably thought they needed a solution for processing large amounts of data fast. To do that, you have to keep the CPU from becoming congested by filtering out data that is not needed before it reaches the database. As an added benefit of not shipping useless data, the system performs less I/O. If you don’t need that, Exadata may not be for you. On the other hand, if a customer needs this for a BI solution and you are selling what you like to call a “standard commodity solution”, be prepared to offer something that scales and can replace the Exadata features (if Exadata is what you are competing against).
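The effect of filtering before the data reaches the database can be shown with a toy model. This is plain Python, not Oracle code; the names `StorageCell`, `scan_all`, and `scan_filtered` are hypothetical, the rows are made up, and the sketch only counts how many rows cross from storage to the database server in each case.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Row = Tuple[int, str]  # (order_id, region): an invented row layout


@dataclass
class StorageCell:
    """Hypothetical stand-in for a storage server holding table rows."""
    rows: List[Row]

    def scan_all(self) -> List[Row]:
        # No offload: every row is shipped to the database server.
        return list(self.rows)

    def scan_filtered(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # Offload: the predicate is evaluated at the storage layer,
        # so only matching rows are shipped.
        return [row for row in self.rows if predicate(row)]


def wanted(row: Row) -> bool:
    # The filter the query actually cares about (invented for the example).
    return row[1] == "NORTH"


# 10,000 rows, of which every 100th belongs to the region we want.
cell = StorageCell(
    rows=[(i, "NORTH" if i % 100 == 0 else "SOUTH") for i in range(10_000)]
)

# Filter late: ship everything, then filter on the database server.
shipped_unfiltered = cell.scan_all()
result_late = [row for row in shipped_unfiltered if wanted(row)]

# Filter early: only matching rows ever leave the storage layer.
shipped_filtered = cell.scan_filtered(wanted)
result_early = shipped_filtered

print(len(shipped_unfiltered), len(shipped_filtered))
```

Both approaches return the same result set; the difference is how much data has to travel and be processed by the database CPU to get there.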

If you sell a solution at 10% of the cost, it will not succeed simply because your bid is low; without performance it will fail. Without careful planning and analysis you will fail. Cheap food is nice, but if it makes people sick, they will not be happy to spend the money they saved on hospital bills. And they will go somewhere else.

There is no good substitute for proper understanding. DBAs have heard for many years that we will become redundant because modern systems are self-managing or something like that. Well, I will gladly take up brewing if that becomes a reality, but what has changed so far is the number of decibels from the CIO screaming when new and more expensive hardware does not perform after all.

The fun with Exadata does not all come from the fact that it is expensive and state of the art. Rather, it comes from experiences where you combine your understanding of the platform with old wisdom like “filter early” and end up with incredible performance improvements. For example, one day I asked a domain expert about the data we were querying, and after adding one redundant predicate to the SQL statement, the response time went from 90 seconds to less than 1 second. The whole improvement was due to smart scan taking place.

Back to the PoC: we spent weeks waiting for proper integration between the servers (running Linux) and the storage. The wrong interface was used for the interconnect (this was RAC), and RAID 5 was used because, as it turned out, they hadn’t provided enough disks. The devices that would provide the caching features were not used; instead, the low-level devices were discovered by ASM due to a wrong configuration (the parameter ASM_DISKSTRING was left unchanged, leading to the sd* devices being discovered). After a long time the RAT replay showed that performance was acceptable for a small number of users, but as load increased the throughput stalled, and we concluded that the new solution was not dimensioned properly. The vendor suggested a new and even better storage solution with even more flash…
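The discovery mistake is at heart a pattern-matching problem: the disk string decides which device paths become candidates. Here is a rough sketch in Python, using `fnmatchcase` as a stand-in for wildcard matching. It is not how ASM is implemented, and the device paths, the `/dev/mapper/cache*` pattern, and the `discover` helper are all invented for illustration; they are not the names from the actual PoC.

```python
from fnmatch import fnmatchcase
from typing import List

# Invented device paths: raw low-level disks plus the caching devices
# that were supposed to sit in front of them.
devices = [
    "/dev/sda",
    "/dev/sdb",
    "/dev/sdc",
    "/dev/mapper/cache01",
    "/dev/mapper/cache02",
]


def discover(disk_string: str, paths: List[str]) -> List[str]:
    """Return the paths matched by a comma-separated list of wildcard patterns."""
    patterns = [p.strip() for p in disk_string.split(",") if p.strip()]
    return [d for d in paths if any(fnmatchcase(d, p) for p in patterns)]


# A pattern that only matches the low-level disks bypasses the cache layer.
raw_only = discover("/dev/sd*", devices)

# Pointing the pattern at the caching devices picks up the intended disks.
cached_only = discover("/dev/mapper/cache*", devices)

print(raw_only)
print(cached_only)
```

Checking which paths a discovery pattern actually matches, before creating disk groups, is a cheap way to catch this kind of misconfiguration early.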

Probably without seeing it themselves, the vendors in both cases demonstrated the benefit of an engineered solution. When the customer sees that you need weeks to get the storage to work with your servers, they start to think that maybe Oracle has a point with their preconfigured and tested solution, ready to plug in. (Yes, we know that reality is not that simple; search Twitter for #patchmadness for an illustration.)

There are a few takeaways from the RAT testing itself, which is material for the next post. What I can say now is that restoring the database, performing a replay, flashing back, and starting over again takes a long time. Keep it simple and keep the capture as short as possible without losing value for what you are testing.

The goal of a PoC is to qualify or disqualify a new solution. If a proposed solution is disqualified, the PoC should be deemed a success; this was actually stated by the project manager and not by me. PoCs should not be taken too lightly. Think about all the problems you can avoid by stopping the wrong solution from entering your data center.