Friday, June 26, 2020

The Recurring Tragedy of Orchestration

“Orchestration” is a recurring problem that every IT professional of any experience will have encountered in multiple forms over their career, spanning dozens of tools and systems. It is an amorphous and multifaceted problem whose similarity across systems has always nagged at me, and one that has driven billions of dollars in sales of flawed software that, in the long term, handcuffs an organization’s agility and freedom.

We’ve all seen this: a diagram with a series of boxes and arrows between them. Visio documents, viewgraphs, UML, workflows, process diagrams. Either they represent the abstract, or they document specific processes (and are usually out of date).

And we’ve all used a half-dozen IT tools that take “boxes with arrows” as the fundamental conceptual paradigm: workflow engines and build systems in myriad forms, from UI tools that present the Visio-like interfaces to configuration files that represent the same thing without the UI.

The IT industry has reinvented this wheel over its entire existence, and that represents a massive inefficiency in the state of the art. Why? We have seen dozens of these systems, and they seem superficially extremely similar, yet the work to produce definition, visualization, and execution has happened over and over and over again, often as a total rewrite. A massive amount of waste, except to software vendors.

Before going further, I want to emphasize that my conclusions should not be used to justify an “everything looks like a nail” approach of one-orchestrator-to-rule-them-all, which will inevitably result in square pegs hammered into round holes. A lot of current orchestration software is purpose-built due to practical constraints. The point of this … uh, paper … is a plea for future system evolution to (ivory tower shining in the distance) work towards orchestration systems that are more flexible across multiple needs, avoid the constant reinvention, conserve the critical investment of code, and improve the ability of organizations to apply orchestration to their needs.

I think the essence of the problem, aside from the lessons of the tower of Babel, is both the abstract nature of the concept and the fact that the superficial impression of simplicity hides something much, much, much more complicated: it runs into all the core problems of what makes computing and IT fundamentally complex.

That last part may call the value proposition of orchestrators into question. But orchestration systems have undeniably helped solve many problems effectively and have demonstrated tremendous value, which is why IT orgs the world over invest billions of dollars per year in them. They are a key means of visualizing and organizing the chaos and complexity in any IT org of non-trivial size. Yet these disparate orchestration systems are always immature across a large swathe of what an “ideal” orchestration system should be capable of.

Continuous delivery and continuous integration of systems deployment are, in their very nature, orchestration.

Often orchestrators are constrained to work only within a specific enterprise package (example: Documentum’s workflow component) or for a specific purpose (Jenkins for builds, Spinnaker for “deployment”). Terminology is scattershot due to the multitude of applicable domains and the varied origins of design and responsibility.

Let’s take a shot at defining some key terms:

Concept/Definition: Orchestration

OK, so pretty important, what do we mean by this thing I’m complaining about?

Orchestration systems help coordinate lower-level computations and systems. Orchestration is “management” of computation, and by “management” I mean to draw a direct parallel with the human organizational management that MBA programs and a thousand books have likewise never properly solved. Orchestration systems are a desperate attempt to organize and coordinate the complex, detailed work going on in an organization.

Almost all human organizations employ hierarchical structures: workers doing all the real work at the bottom, with each level of management dedicated to coordinating the level below it and communicating status (aka return values) to the level above. Managers of each domain of an organization hold specialized knowledge of the necessary processes and procedures. The tiers of management “chunk” complexity and scope of responsibility into boxes and structure them so that each tier carries a higher-level, overarching responsibility while delegating the complexity downward (see https://en.wikipedia.org/wiki/Span_of_control).

Are some of you thinking of Conway’s Law?

While Conway’s Law is usually invoked in the context of disparagement and siloed data, its truth reflects a need by organizations to have software tailored to their very particular requirements. Every level of an organization defines detailed procedures and processes and policies, and orchestration engines are basically the only software that enables those to be somewhat structured and centrally managed. The universal applicability of Conway’s Law is also why the constant total reinvention of orchestration is such a massive industry-wide failure.


Concept/Definition: Workflow / Process / Pipeline / Operation / Flow

Everyone who has used orchestration systems knows what they consist of: a group of defined “things” or capabilities whose execution the system coordinates. Workflow systems have workflow definitions. Build systems have build definitions. Batch execution systems have their batch definitions.

These defined processes are usually configured in a non-binary format: XML, JSON, YAML, CSV, or other text format, or stored in a SQL database. Essentially these exist at the “metadata” level of IT orgs.

Often these consist of “subprocesses” for reuse or to implement “chunking”.
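
As a concrete (and entirely hypothetical) sketch, here is what such a text-format definition might look like, loaded in Python with PyYAML; the field names (steps, after, subprocess) and the invoice example are invented for illustration, not taken from any real product:

    import yaml  # assumes the PyYAML package is available

    # A made-up workflow definition in YAML, the "metadata" level:
    definition_text = """
    name: invoice-approval
    version: 3
    steps:
      - id: validate
        task: validate_invoice
      - id: approve
        subprocess: manager-approval   # reuse a separately defined subprocess
        after: [validate]
      - id: post
        task: post_to_ledger
        after: [approve]
    """

    definition = yaml.safe_load(definition_text)
    print(definition["name"], "has", len(definition["steps"]), "steps")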


Concept/Definition: Execution / Run / Instance / Job

These are the individual instances of execution of a process/workflow. Each usually carries data such as: the workflow definition (and likely the version of that definition), the current status/state (e.g., which task or subprocess is currently working), status values produced by already-executed steps, the time the execution was initiated, the parameters/context/data provided at the start of execution, and any error states encountered.
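
A minimal sketch of such an execution record, as a Python dataclass; the field names are hypothetical, and real systems vary wildly:

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Any, Optional

    @dataclass
    class Execution:
        workflow_name: str
        workflow_version: int               # which version of the definition ran
        started_at: datetime
        parameters: dict[str, Any]          # context/data provided at start
        current_step: Optional[str] = None  # what is working right now
        step_results: dict[str, Any] = field(default_factory=dict)
        status: str = "RUNNING"             # e.g. RUNNING, SUCCEEDED, FAILED
        error: Optional[str] = None         # last error state encountered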

Concept/Definition: Task / Step / Unit

A hopefully-atomic unit of work from the perspective of the process/workflow. It either succeeds, allowing the process/workflow to proceed, or errors, which may disrupt or terminate the execution of the workflow. Basically it serves to “chunk” the contained processing/execution into a simple success/failure state.
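
A sketch of that chunking contract, with hypothetical names: whatever happens inside the task, the workflow above it sees only a result (success) or an exception (failure):

    from typing import Any, Callable

    class StepFailed(Exception):
        """Signals the orchestrator that a step did not complete."""

    def run_step(name: str, action: Callable[[], Any]) -> Any:
        # Collapse arbitrary inner complexity into success-or-failure.
        try:
            return action()
        except Exception as exc:
            raise StepFailed(f"step {name!r} failed: {exc}") from exc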

Execution Complexities: The Devil in the Details and Turing Machines

Unfortunately the selling of orchestration systems always starts at a sufficient level of ignorance and with a nice bit of snake oil: simple, easy flows of five or six boxes and arrows in a nice line. The salesman smiles and assures IT management (who should know better) that the magic orchestrator will reduce all their complexities to these simple flows.

And I concede, relatively simple flows are THE sweet spot of orchestration, where the processes it handles, defines, visualizes, and executes are relatively easy. We probably all have stories of workflows doing crazy things like recursion, or of monumentally complex definitions that should have been broken down. As discussed under chunking and span of control, orchestrator workflow definitions should orbit a central attractor of not-too-many-steps.


But… it never stays that simple. Every orchestration system, once implemented, starts hitting the hidden complexities in the processes it orchestrates. Unfortunately, to REALLY understand this you need to dip into the theoretical foundations of computing.

That four-box, four-arrow straight-line process is a very simple state machine with a single start point, a single direction of flow, and one end state (DONE) or two (SUCCESS vs FAILURE). Theoretically this is a very, very, very simple form of computation, and make no mistake about it, orchestrators are just computation with prettier interfaces. Running state machines like this is the bottom basement of the hierarchy of computational “power” that computability theory has mathematically established. This isn’t an ivory-tower thing either; this pecking order of “computational power” affects the very ability of orchestration to do useful work.
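
For illustration, that straight-line flow really is nothing more than this (a hypothetical transition table; each state has exactly one successor):

    # The four-box, straight-line flow as a trivial finite state machine.
    TRANSITIONS = {
        "RECEIVED": "VALIDATED",
        "VALIDATED": "APPROVED",
        "APPROVED": "DONE",
    }

    def run(state: str = "RECEIVED") -> str:
        while state != "DONE":
            print("executing step:", state)
            state = TRANSITIONS[state]   # single direction of flow
        return state                     # one end state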

An orchestrator that can do these types of flow gets the golf clap and a nice pat on the head. Your orchestrator is executing very very very basic finite state machines. But there are much more difficult things an orchestrator will need to handle and those orchestrators with some maturity handle them to some degree:

- Branching: Wait, you need to DECIDE something? Yup, you need boolean evaluators and multiple routing. You’re now a finite state machine with multiple paths leading out of nodes. This is also the gateway to needing something besides drag-and-drop shapes, and users come face-to-face with basic boolean logic. (A minimal sketch of branching plus DAG execution appears after this list.)

- Multiple start points: your process can start from several different initial states and begin at points that aren’t the “beginning”. Your process is generalizing beyond a linear finite state machine into a “directed acyclic graph” executor.

- Data/Memory/State/Storage: the “state” in “finite state machine” doesn’t imply memory or recording of values. Your execution of a process will invariably accumulate data and results that need to be stored. While this can be modelled with “finite” state machines of very large size, that isn’t useful. You are now, at minimum (probably), a pushdown automaton.

- Subroutines / Subflows: you define special-purpose processes that are “parameterized”, and want those processes to be invocable as single steps in overarching processes (using data from previous step results).

- Loops / Cycles: your processes may never halt, and you are fully subject to the halting problem. Your state, plus your looping potential, plus everything else means your process requires what is called a “Turing machine”, which is what your computer is and what general-purpose programming languages give you.
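
As promised above, here is a minimal sketch of what branching plus DAG-style execution forces on you. Every name here is invented, and a real engine layers persistence, error handling, and concurrency on top:

    from typing import Any, Callable

    # Each node pairs a step action with a router that picks the next
    # node(s) from accumulated state -- the point where drag-and-drop
    # shapes stop being enough.
    Node = tuple[Callable[[dict], Any], Callable[[dict], list[str]]]

    NODES: dict[str, Node] = {
        "validate": (lambda s: s.update(small=s["amount"] < 1000),
                     lambda s: ["auto_approve"] if s["small"] else ["manager"]),
        "auto_approve": (lambda s: s.update(approved=True),
                         lambda s: ["post"]),
        "manager": (lambda s: s.update(approved=True),  # pretend a human said yes
                    lambda s: ["post"]),
        "post": (lambda s: print("posted:", s),
                 lambda s: []),
    }

    def execute(start: str, state: dict) -> dict:
        frontier = [start]               # any node can be a start point
        while frontier:                  # acyclic, so this terminates
            action, route = NODES[frontier.pop(0)]
            action(state)
            frontier.extend(route(state))
        return state

    execute("validate", {"amount": 250})

Note that execute() accepts any node as its start, which is exactly the “multiple start points” generalization; the state dict is the accumulated memory; and the moment a router can send you backwards, you’ve bought yourself the halting problem.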

Execution Complexities: Error Handling, Recovery, and Rollback

Your basic salesperson snake oil barely touches on error handling. In fact, most code written in programming languages, especially in enterprise systems, only superficially handles errors. Really, only very, very mature software, like database engines, handles a wide variety of error conditions. And even that still fails in lots of ways.

A maturely defined process in an orchestrator will have lots of variations in error reporting, handling, and recovery. When people think “my process/workflow doesn’t need looping” they aren’t thinking about errors and recovery, which almost any process definition will involve, and which often involve REPEATING STEPS. Congratulations, your process definition has a loop, and is a big boy!

Reporting, recovery, retry, rollback: all of these are error-handling techniques that EVERY orchestration software will have to support eventually. If your orchestrator does not support one of them, that is a missing feature.
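
A sketch of the most basic variant, assuming a hypothetical step callable and a transient-error exception type: even trivial retry-with-backoff is, structurally, a loop in your process.

    import time

    class TransientError(Exception):
        """A failure worth retrying (timeouts, flaky networks, etc.)."""

    def run_with_retry(step, max_attempts: int = 3, base_delay: float = 1.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return step()          # the REPEATED STEP
            except TransientError:
                if attempt == max_attempts:
                    raise              # give up: report, recover, or roll back
                time.sleep(base_delay * 2 ** (attempt - 1))  # back off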

Execution Complexities: Distributed Execution and Parallelization


With the rise of cloud computing and managing fleets of servers this has become more apparent, but orchestration has always involved the execution of processes across multiple machines and systems. That basic four-box snake oil the salesman is selling you? Guess what: it is hiding massively difficult computing problems in multi-phase commit, distributed computing, the CAP theorem, distributed state, parallel code execution, and many others.

The ideal orchestration system should handle all of this. Let me emphasize that it is very, very unlikely that any orchestration system on the market can handle all of it. Even the flagship distributed systems, Kafka, ZooKeeper, Cassandra, etcd, Kubernetes, Hadoop, and their kin, are still in their infancy at properly handling these operations.

Nevertheless, any orchestration system of maturity has features (probably broken for dozens of edge cases) that attempt to help users perform these operations.
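
For a taste of the parallelization half (the distributed-state half is far harder), here is a sketch of fanning one step out over a fleet, with hypothetical host names and a stubbed deploy function; a real system must also decide what to do when only some of these succeed:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    HOSTS = ["app1", "app2", "app3"]   # hypothetical fleet

    def deploy(host: str) -> str:
        # Stand-in for real remote work (ssh, agent call, API request...).
        return f"{host}: ok"

    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        futures = {pool.submit(deploy, h): h for h in HOSTS}
        for fut in as_completed(futures):
            host = futures[fut]
            try:
                print(fut.result())
            except Exception as exc:
                # Partial failure: retry? roll back the others? CAP says: ouch.
                print(f"{host} failed: {exc}")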

Execution: Do Not Despair!


Readers at this point are probably despairing and saying “just write custom code in a programming language”.

But we’ve already gotten a glimpse of the summit. Orchestrators SHOULD be able to do these things. The half-dozen workflow editors and the host of different Visio-style boxes I’ve seen demonstrate that these things should be doable within the (somewhat loosely defined) scope of what an orchestrator can orchestrate.

While the parts of an orchestrator that handle some or all of these very difficult aspects of the process/workflow WILL require dropping down to full-power, Turing-complete programming languages, an orchestrator of good design should support that ability!
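
A sketch of that escape hatch, with hypothetical names: the engine treats an arbitrary function as just another box in the diagram, so the hard ten percent can be real code while the rest stays boxes and arrows.

    from typing import Any, Callable

    def script_step(fn: Callable[[dict], Any]) -> Callable[[dict], Any]:
        # Wrap arbitrary Turing-complete code so the engine can schedule
        # and report on it like any other step.
        def step(context: dict) -> Any:
            return fn(context)
        return step

    # The hard part drops down to a full programming language:
    reconcile = script_step(lambda ctx: sum(ctx["ledger"]) - ctx["expected"])
    print(reconcile({"ledger": [10, 20, 30], "expected": 60}))  # prints 0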

Perl’s philosophy of “more than one way to do it” will probably be needed here, especially given how orchestration has so many nebulous origins and different terms for “processing”.

Integration

A hot term from the years of message queues, n-tier architectures, and SOAP/XML web services, and still applicable in the microservices trend: the ability to integrate disparate systems and pools of data is critical to an orchestration system, which often doubles as the integration layer funneling data between heterogeneous systems.

Mature orchestration systems often support a great many interfaces out of the box, but that support is lost each time orchestration is reinvented, and it often requires adaptation over time or more layers of wrappers (often in front of other layers of wrappers written for previous integrations or point-to-point interfaces) that degrade performance.
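
Part of why the reinvention hurts: each integration is typically coded against an engine-specific connector interface, something along these lines (a hypothetical Protocol; every real product has its own incompatible flavor), so almost none of it carries over to the next orchestrator:

    from typing import Any, Protocol

    class Connector(Protocol):
        """Hypothetical shape of an integration point."""
        def fetch(self, query: str) -> list[dict[str, Any]]: ...
        def push(self, records: list[dict[str, Any]]) -> None: ...

    class SqlConnector:
        # In reality: drivers, auth, pooling, retries, vendor quirks --
        # the sunk cost rewritten with every new orchestrator.
        def fetch(self, query: str) -> list[dict[str, Any]]:
            return [{"id": 1}]   # stub result for the sketch
        def push(self, records: list[dict[str, Any]]) -> None:
            print("pushing", len(records), "records")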

Mature integration is also leveraged by software vendors for lock-in. Because the core of these integrations is often closed source, even in open-source orchestration frameworks, it becomes difficult to migrate orchestration to improved forms.

User Interfaces

A critical differentiator between good orchestration software and a motley collection of scripts, programs, frameworks, and other tool hodgepodges is the visualization that (hopefully) well-designed user interfaces provide.

These can provide facilities for visualizing the defined processes and adapting them (editing workflows), as well as for viewing the state of current executions, scheduling, triggering ad hoc executions, visualizing failure states with means to reattempt/resume execution, debugging, history, and metrics.

In particular, the UI for showing state machines / DAGs / workflows / Visio-esque diagrams is not trivial, and often isn’t implemented until an orchestration system reaches medium maturity. Attempts to standardize via XML standards have generally failed, due to the typical subterfuge of embrace-and-extend incompatibilities by vendors, and the XML standards themselves have been fairly use-case-centric. A “generally solved” framework or design for this would go a long way toward letting orchestration evolve without the cycle of perpetual full UI rewrites.

The UI rewrite subproblem is of course related to the Recurring Tragedy of UI Frameworks constantly being reinvented in every programming language, OS, and, more recently, every new HTML/JS framework, but that is another rant. At least with HTML/JS we have a universal renderer and a long-lived base language in HTML/CSS and (sob) JavaScript.

This host of valuable capabilities is generally lost with each successive reinvention until late in the maturity of the implementation. The lack of UIs with the historical continuity that other applications (spreadsheets, word processors, database front ends, etc.) enjoy is a huge loss for the industry and the practice.

A way forward?

Obviously the problem is extraordinarily difficult. The balkanization abounds: ?hundreds? ?thousands? of distinct software packages have “semi-solved” this problem, almost all overpromising that they can do things of this type while omitting the fine print (or not even being aware of the full scope of the problem).

I think this will only begin to be solved in IDEs, another Tragic Tale of Reinvention (oh, do I weep for your passing, Turbo Pascal, Delphi, and your ilk). But IDEs have recently started to coalesce and persist. JetBrains IDEs have shown real staying power, Visual Studio is now fairly open and has lots of language support, and Eclipse… is still actively updated too. Many of these IDEs, or dialects of them, have the basic tools (workflow display and editing, etc.) to do this.

Fundamental orchestration visualization tools in these IDEs, spanning project structuring, code organization, and debugging/testing visualization, kind of like what gdb and compilers provide, could supply the structural basis to apply all the way up the tool food chain. And IDEs can almost always be pared down to function as purposeful tools with a limited set of plugins and options.