Saturday, March 22, 2014

State Machines

Commentary and Observations on State Machines

State machines are one way of implementing an iterated and complex control flow that is more maintainable than equivalent functionality using just if statements.  For  information on state machines see the wiki page.
This is just a commentary on my experience with state machines.

First of all, in many situations, using state machines is a good implementation choice.

The classic state machine can be described as a set of actions whose order of execution is controlled by an input value, a current state, and a transition table.  This type is often used in programs to implement a text parser.

For simplicity in this description, both the current state and the input value are treated as integers.  The basic sequence is as follows:
1. Set the current state to 0
2. Start the loop
3. Set the input value (read from a file or get the next character in a string variable)
4. Set the current state to the value found in the transition table by using the input value, and current state value
5. Select/jump to the block of code associated with the current state.  This code will take appropriate action and then jump to the end of the loop
6. Return to the top of the loop (Step 3 above) if some loop termination condition is not met

The main variants of this algorithm are that:
  • the input value may be something other than a character
  • the termination condition may not be “end of string” or “end of file”
For example, in my Adventure 3 posting, the original state machine implementation had the following changes:
  • The input value was the “current state” of the device
  • The next state was determined by the “input value” and the desired “state” of the device
  • The termination condition was that the input value (the “current state”) matched the desired “state”
In the above type of state machine, maintenance problems become evident if the number of states is large or if the numeric range of the input value is large.  
If the number of states is large, it is difficult to understand what the code is doing because the execution sequence is encoded in the transition table and the actions are represented as numbers.  The difficulty arises because of the need to remember exactly what actions all those numbers represent, and, therefore, changes to the transition table or the code for a given state can easily have unintended side effects. 

If the numeric range of the input value is large, then there can easily be a problem with the coding of the transition table itself.  There is likely to be a lot of duplication e.g. all alphabetic characters have exactly the same transition except in certain states.  In some applications, it may not be physically possible for every value in the input to occur (e.g. even though the input value is a byte, the number of possible values is small, and the values are not numerically consecutive (See Adventure 3.) . This condition results in large logical gaps in the transition table and a very memory-consuming transition table. To overcome this problem, I have sometimes used an input translator function that accepts the input value and produces a translated value (e.g. all alphabetic characters are assigned the same translated value).  This translated value is what the state machine uses.

In my Adventure 2 posting, I describe a new (for me) type of state machine that eliminates or reduces the problems inherent in a classic state machine.  This “new” machine also makes a state machine implementation a viable option in a wide variety of situations. Its design came about (in part) when I realized that a state machine representation could be used to flatten a calling stack (when one routine calls another ). This design is summarized here.
  
In this type of state machine, there is no input value.  The execution sequence is controlled by a “current state” and a “next state.”
1. Set the next state to 0
2. Start the loop
3. Set the “current state” to “next state”
4. Select/jump to the block of code associated with the current state. This code will take appropriate action and then may set the next state before jumping to the end of the loop.

The action of at least one of the states is to exit the loop.
Note: If the next state is not changed, then state action will be repeated.  This can happen when a state machine is used to implement a polling loop.

In practice, the appropriate action was often implemented by a function call and its return value was used to determine a value for “next state”.

The benefits of having the state code decide what the next state would be are as follows:
  • Transition sequences are relatively easy to follow, and, therefore, code changes are less likely to have unintended consequences
  • There is no transition table to maintain
  • There is no need to specially handle sparse values in an input value (even if one were used).
Miscellaneous Observations
·        I have used a state machine design to analyze a problem, and several times this analysis showed that the problem could be solved by just a few if statements.
·        When could a state machine pattern be used?
o   If the code is processing some kind of input stream and I find that I am writing a series of “if statements” to decide what to do
o   If processing decisions must be made based on the values of one or two variables and the action to be taken is determined by the values of the variable(s)
·        When should a state machine pattern NOT be used?
o   If there is no “input stream” then the problem might be too simple to justify the use of a state machine
o   If there are more than two controlling variables used to decide what action to take and the problem cannot be factored into multiple simpler problems
o   If the controlling variable(s) can be changed by external actions (e.g. other processes or threads)




Wednesday, March 12, 2014

Maintaining Software, Adventure 3

   This adventure came about because a project which was supposed to be a relatively straight forward maintenance problem wasn't.  The company I worked for had purchased the software source, documentation etc. .  We (the engineering staff) were supposed to just tweak it a bit, add support for some new devices then release it.

   The software was designed  to automate much of the work in a data center by creating and automatically maintaining subsets of the equipment as 'farms' i.e. collections of computers, disc drives, network devices etc in a virtual network.  A customer used a Visio like interface to describe the desired configuration and the software system used an XML version of that description to create what was described.  After a 'farm' was created, a customer could edit the description and have the changes made on the fly to the still running 'farm'.  A major aspect of the system was that it could automatically recover from if any single point of failure. This included a failure of any device (computer, network device etc) or cable including any of the infrastructure parts failed, full functionality would be automatically restored as much as available spare parts allowed.

   The difficulty in doing this correctly was seriously under estimated and every other such project of this type I have heard about since then has also underestimated the difficulty.  In this case every major subsystem was rewritten at least once or was subject to one or more major rework efforts.

   My part was 'farm management' which meant being responsible for coordinating the work done by all of the subsystems as they related to 'farms'.  A separate subsystem was responsible for infrastructure components. The simplest description of what 'farm management' did is to say that the front end (human interface part) constructed an XML file and then issued command(s) to my part for implementation.  The other main input to 'farm management' came from the monitoring subsystem.  When ever it detected a failure in a component or cable it issued a command to 'farm management' saying that component 'xxx' had failed. 'farm management' then took appropriate action.

   This functionality was implemented for the most part by a set of state-machines each of which was implemented in an object by case statements.  There were a set of status values defined and the state-machines purpose was to take components from one status to another.  A typical 'command' from the user would require a specific sequence of these transitions.  e.g. from New to Allocated to ... Active. When a device was no longer needed, the steps were basically run in reverse order for that device.  A failed device was left in an Isolated status e.g. powered on but not connected to the network.

   As delivered, the code worked as expected unless a component failed while it was participating in a 'farm' configuration change which included that component.  A very simplified description of the problem is that the incorrect behavior occurred because the code was designed to take all relevant devices in a 'farm' from one well known status to another well known status.  It could not handle the situation where some component(s) needed to move through the state-machine system in a direction opposite to all the other components at the same time. There were other corner case scenarios involving a change or a failure in the infrastructure which could also cause these state-machines to become confused.

  The complete solution required changes in some data structures and a complete replacement of the state-machine design.
  • The first step of the solution came from recognizing that the predefined 'farm state' values did not cover all the possible situations and not all component 'states' were explicitly represented by the 'farm state' values.
  • The next step was to recognize that each component type had a small number of associated 'state' values each of which could be represented by a Boolean value such as:  working yes/no, power on/off, configured yes/no.  The investigation which led to this understanding was forced by the Data Base team refusing to increase the size of a component description record. I just changed the type from an enumerated type (byte sized integer) to a Boolean 8 bit vector.
  • The above change to allowed the current 'state' of a component to be accurately represented by the set of bits in its 'state variable'.  This in turn allowed the work of the software to be defined as doing what ever is required to change a components current status to a final status and to make the required changes to individual 'state (bit) variables' in the required order.
  The above changes allowed the current status of a component to be accurately determined but did nothing to solve the problem of devices in a given 'farm' needing to be transitioned through different sequences simultaneously.
  The solution to this problem was to use simplified Artificial Intelligence (AI) techniques for control and synchronization and by using what might be termed massive multithreading techniques.  All of the code in this project was written in Java so the problem was not as difficult as it might have been but was not trivial either.

The details of these changes will be given in a later posting.