Avoiding Service Call Failures in Oracle Service Bus and Oracle SOA Suite
This article is targeted at Oracle Service Bus (OSB) developers and architects who want to learn or validate strategies for avoiding service call failures within their integration pipelines.
Having dual roles as architects and developers, the authors have seen many projects in which neither the developer nor the architect designed or implemented a good exception/failure management strategy.
This article presents a series of strategies for avoiding such failures.
Some of the ideas in this article were originally presented by our colleague David Hernández in a December 2015 session at a Microservices and API Management Symposium in Lima, Peru. That session focused on strategies for mitigating failures in service development:
- Circuit breakers
- Bulkheads
- Timeouts
- Redundancy
In this article we will apply these strategies to Oracle Service Bus.
These strategies can be implemented using OSB and Oracle SOA Suite (composites). Some features of the latest release (12.2.1) will make the implementation easier; other strategies are very basic OSB configurations that many people skip, and those people then struggle to keep their Service Bus implementation stable.
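Before we get into OSB-specific mechanics, it helps to see the first of these patterns stripped to its essence. The following is a minimal plain-Java sketch of a circuit breaker, not an OSB API; the CircuitBreaker class, the failure threshold, and the cool-down value are purely illustrative assumptions:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Minimal circuit breaker: trips OPEN after N consecutive failures,
 *  then allows one probe call after a cool-down period. */
public class CircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final Duration coolDown;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    public CircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state == State.OPEN
                && Instant.now().isBefore(openedAt.plus(coolDown))) {
            // Fail fast while the cool-down is running; no thread waits
            // on a backend we already know is unhealthy.
            return fallback.get();
        }
        try {
            T result = remoteCall.get();  // the real service invocation
            consecutiveFailures = 0;      // success closes the circuit
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                state = State.OPEN;       // trip the breaker
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}
```

A caller wraps every backend invocation, for example `breaker.call(() -> invokeBackend(msg), () -> defaultResponse())` (both methods hypothetical), so that once a backend fails repeatedly, the pipeline fails fast instead of queuing threads behind a dead endpoint.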
Let’s always keep in mind that the Service Bus is a core element of any infrastructure that implements it. CIOs and managers sometimes fail to give this platform the weight it deserves, and then wonder why the whole infrastructure crashes when the OSB/SOA Suite is unavailable, or why the OSB struggles when another platform is down or under heavy load. The answer is simple: OSB is the integration pipeline within your infrastructure and architecture. Imagine a water pipe breaking in your house: you don’t see it, but a real mess is happening behind your walls or beneath the floor. The same idea applies here.
Let’s also bear in mind that roughly 60% of the effort of developing a service goes into exception management and the ability to avoid failures. If we neglect this, or expect someone else to take care of it, we are wrong. You need to be able to answer the following questions:
- What if the service that I am calling is down?
- What if the service that I am calling does not respond within the time I expect? (See the timeout-and-redundancy sketch after this list.)
- What if the service that I am calling is not prepared to handle the volume of calls I will send to it?
- What if my JMS queue fails?
- What if the database call takes longer than expected?
- What if I take care of the error? What happens next?
- What if I manage the error and keep the message in a queue for later reprocessing? Is the endpoint prepared for that?
- What if I send all those transactions for a manual recovery?
- What if I don’t manage anything?
- What if a platform fails or is not available? Am I ready to avoid a domino effect?
- What if I need to break a circuit within my pipelines?
- What if my file systems run out of space?
And these are just some of the "what ifs" you’ll face.
As you can see, failures can happen anywhere and at any time, and we have to prepare for them.
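To ground the first two of those questions (a dead endpoint and a slow endpoint), here is a sketch of how the timeout and redundancy strategies combine: bound every call with a deadline and, when one replica is dead or slow, move on to the next. It uses Java 11’s java.net.http client; the GuardedCaller class, the endpoint list, and the timeout values are illustrative assumptions, and in OSB the rough equivalents are a business service’s connection/read timeout settings and its list of endpoint URIs.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

/** Bound every call with a timeout and try a redundant endpoint before failing. */
public class GuardedCaller {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))   // don't wait forever to connect
            .build();

    public static String callWithFallback(List<URI> endpoints) throws IOException {
        IOException lastFailure = null;
        for (URI endpoint : endpoints) {              // redundancy: try each replica
            HttpRequest request = HttpRequest.newBuilder(endpoint)
                    .timeout(Duration.ofSeconds(5))   // bound the whole exchange
                    .GET()
                    .build();
            try {
                HttpResponse<String> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    return response.body();
                }
                lastFailure = new IOException(
                        "HTTP " + response.statusCode() + " from " + endpoint);
            } catch (IOException e) {                 // includes HttpTimeoutException
                lastFailure = e;                      // dead or slow: move on
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while calling " + endpoint, e);
            }
        }
        throw lastFailure != null ? lastFailure
                : new IOException("No endpoints configured");
    }
}
```

The important property is that no call can block forever: a dead or slow replica costs at most the configured timeout before the next endpoint is tried.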
In this article we will elaborate on different scenarios and show how OSB is able to handle them with different strategies. We’ll also introduce a concept that we have created to explain the importance of delimiting responsibilities in any SOA-Service Bus architecture/infrastructure: Service Bus Complicity.
You are probably wondering: What’s that?
The answer is simple: if we let our SOA/OSB implementation be part of a problem that causes incidents, downtime, interruptions of continuous availability, and SOA/OSB server crashes, then we are complicit in other people’s problems. And not only that: we will very likely be pointed to as the root cause of the problem, or as the blocker of business continuous availability. Isn’t that serious?
We think it’s really serious. It can cost us our jobs and our mental health.
So let’s take a look at the different strategies for avoiding failures in our SOA/OSB implementation, so we can avoid being that accomplice we just mentioned.
The Service Bus and SOA infrastructure often generate requests to many different systems and platforms. When those systems are unavailable or misbehaving, the OSB or SOA servers are candidates for the following scenarios:
- Many threads take a long time to complete. They are still running because they are waiting for a response from the given service/platform.
- Our pipelines start to fill up as time passes. The more time passes, the more requests arrive and the more threads are at risk of becoming stuck (the bulkhead sketch after this list addresses exactly this).
- Our WebLogic servers start to panic (by which we mean: they enter the WARNING health state).
- The consumers of those services are receiving a lot of errors or are not receiving any response.
- If it is an end-user system, customers are already waiting for "the system to come back."
- Our WebLogic servers go from WARNING to FAILED.
- All the operations teams are in panic mode now!
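This cascade is exactly what a bulkhead is meant to contain: cap how many threads may wait on any one backend, so a hung platform cannot absorb the server’s entire thread pool. Below is a minimal plain-Java sketch using a counting semaphore; the Bulkhead class and its limits are illustrative assumptions, not an OSB feature, though WebLogic Work Managers with maximum-thread constraints and OSB’s throttling settings play a roughly analogous role.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Bulkhead: cap the number of threads allowed to wait on one backend so a
 *  slow platform cannot absorb the whole server's thread pool. */
public class Bulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    public Bulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    public <T> T call(Supplier<T> remoteCall, Supplier<T> rejectionFallback) {
        boolean acquired = false;
        try {
            // If the compartment is full, reject quickly instead of queuing up.
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return rejectionFallback.get();   // shed load, stay healthy
            }
            return remoteCall.get();              // the real service invocation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return rejectionFallback.get();
        } finally {
            if (acquired) {
                permits.release();
            }
        }
    }
}
```

With, say, `new Bulkhead(20, 100)` in front of one backend, a hung platform can stall at most 20 threads; the rest of the server keeps serving traffic, and the WARNING-to-FAILED slide described above never starts.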