Wednesday, March 12, 2014

Maintaining Software, Adventure 3

   This adventure came about because a project that was supposed to be a relatively straightforward maintenance problem wasn't.  The company I worked for had purchased the software source, documentation, etc.  We (the engineering staff) were supposed to just tweak it a bit, add support for some new devices, then release it.

   The software was designed to automate much of the work in a data center by creating and automatically maintaining subsets of the equipment as 'farms', i.e. collections of computers, disc drives, network devices, etc. in a virtual network.  A customer used a Visio-like interface to describe the desired configuration, and the software system used an XML version of that description to create what was described.  After a 'farm' was created, a customer could edit the description and have the changes made on the fly to the still-running 'farm'.  A major aspect of the system was that it could automatically recover from any single point of failure.  This included the failure of any device (computer, network device, etc.) or cable, including any of the infrastructure parts.  After such a failure, full functionality would be automatically restored as far as the available spare parts allowed.

   The difficulty of doing this correctly was seriously underestimated, and every other project of this type I have heard about since then has also underestimated it.  In this case every major subsystem was rewritten at least once or was subject to one or more major rework efforts.

   My part was 'farm management', which meant being responsible for coordinating the work done by all of the subsystems as they related to 'farms'.  A separate subsystem was responsible for infrastructure components.  The simplest description of what 'farm management' did is that the front end (the human interface part) constructed an XML file and then issued one or more commands to my part for implementation.  The other main input to 'farm management' came from the monitoring subsystem.  Whenever it detected a failure in a component or cable, it issued a command to 'farm management' saying that component 'xxx' had failed.  'Farm management' then took appropriate action.

   This functionality was implemented for the most part by a set of state machines, each of which was implemented in an object by case statements.  A set of status values was defined, and the state machines' purpose was to take components from one status to another.  A typical 'command' from the user would require a specific sequence of these transitions, e.g. from New to Allocated to ... Active.  When a device was no longer needed, the steps were basically run in reverse order for that device.  A failed device was left in an Isolated status, i.e. powered on but not connected to the network.
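
   As a rough illustration of that style (not the original code), a case-statement state machine might look like the sketch below.  The class and method names are hypothetical.  Only the statuses named above appear; the intermediate statuses elided by the "..." are deliberately left out rather than guessed at.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a case-statement state machine; not the original code.
class CaseStateMachineSketch {
    // Only the statuses named in the post; the elided intermediate ones are omitted.
    enum Status { NEW, ALLOCATED, ACTIVE, ISOLATED }

    // One step toward ACTIVE; a status with no forward step is returned unchanged.
    static Status stepUp(Status s) {
        switch (s) {
            case NEW:       return Status.ALLOCATED;
            case ALLOCATED: return Status.ACTIVE;
            default:        return s;
        }
    }

    // The same steps in reverse, used when a device is no longer needed.
    static Status stepDown(Status s) {
        switch (s) {
            case ACTIVE:    return Status.ALLOCATED;
            case ALLOCATED: return Status.NEW;
            default:        return s;
        }
    }

    // Drive one component until no further step is possible, recording each status.
    static List<Status> run(Status start, boolean up) {
        List<Status> visited = new ArrayList<>();
        Status current = start;
        visited.add(current);
        while (true) {
            Status next = up ? stepUp(current) : stepDown(current);
            if (next == current) {
                break;
            }
            visited.add(next);
            current = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(run(Status.NEW, true));      // [NEW, ALLOCATED, ACTIVE]
        System.out.println(run(Status.ACTIVE, false));  // [ACTIVE, ALLOCATED, NEW]
    }
}
```

   Note that the direction (up or down) is a single choice made for the whole run, and a single enumerated value has to stand for everything that is true of a component at once.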

   As delivered, the code worked as expected unless a component failed while it was participating in a 'farm' configuration change that included that component.  A very simplified description of the problem is that the code was designed to take all relevant devices in a 'farm' from one well-known status to another well-known status.  It could not handle the situation where one or more components needed to move through the state-machine system in the opposite direction to all the other components at the same time.  There were other corner-case scenarios involving a change or a failure in the infrastructure which could also cause these state machines to become confused.

  The complete solution required changes in some data structures and a complete replacement of the state-machine design.
  • The first step of the solution came from recognizing that the predefined 'farm state' values did not cover all the possible situations and not all component 'states' were explicitly represented by the 'farm state' values.
  • The next step was to recognize that each component type had a small number of associated 'state' values, each of which could be represented by a Boolean value such as: working yes/no, power on/off, configured yes/no.  The investigation which led to this understanding was forced by the database team refusing to increase the size of a component description record.  Since the record could not grow, I just changed the type of the existing field from an enumerated type (a byte-sized integer) to an 8-bit Boolean vector.
  • The above change allowed the current 'state' of a component to be accurately represented by the set of bits in its 'state variable'.  This in turn allowed the work of the software to be defined as doing whatever is required to change a component's current status to a final status, making the required changes to the individual 'state (bit) variables' in the required order.
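
  The bit-vector representation described in the bullets above can be sketched roughly as follows.  This is a minimal illustration, not the original code: the bit positions, the names POWER_ON, CONFIGURED and WORKING, the ordering array, and the plan method are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a component state held as an 8-bit Boolean vector.
class ComponentStateBits {
    // Assumed bit assignments; the post names the flags but not their positions.
    static final int POWER_ON   = 1 << 0;
    static final int CONFIGURED = 1 << 1;
    static final int WORKING    = 1 << 2;

    // Assumed order in which bits are set; they are cleared in the reverse order.
    static final int[] ORDER = { POWER_ON, CONFIGURED, WORKING };

    static String name(int bit) {
        switch (bit) {
            case POWER_ON:   return "POWER_ON";
            case CONFIGURED: return "CONFIGURED";
            case WORKING:    return "WORKING";
            default:         return "UNKNOWN";
        }
    }

    // List the individual bit changes needed to get from current to target.
    static List<String> plan(byte current, byte target) {
        List<String> steps = new ArrayList<>();
        // First clear bits that are on but should be off, last-set bit first.
        for (int i = ORDER.length - 1; i >= 0; i--) {
            int bit = ORDER[i];
            if ((current & bit) != 0 && (target & bit) == 0) {
                steps.add("clear " + name(bit));
            }
        }
        // Then set bits that are off but should be on, in the forward order.
        for (int bit : ORDER) {
            if ((current & bit) == 0 && (target & bit) != 0) {
                steps.add("set " + name(bit));
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        byte off = 0;
        byte ready = (byte) (POWER_ON | CONFIGURED);
        System.out.println(plan(off, ready));  // [set POWER_ON, set CONFIGURED]
        System.out.println(plan(ready, off));  // [clear CONFIGURED, clear POWER_ON]
    }
}
```

  Because the plan is computed from a component's own current and final bit sets, nothing in it depends on a single 'farm'-wide status value.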
  The above changes allowed the current status of a component to be accurately determined but did nothing to solve the problem of devices in a given 'farm' needing to be transitioned through different sequences simultaneously.
  The solution to this problem was to use simplified Artificial Intelligence (AI) techniques for control and synchronization, together with what might be termed massive multithreading.  All of the code in this project was written in Java, so the problem was not as difficult as it might have been, but it was not trivial either.

The details of these changes will be given in a later posting.



