New RTT component 'run-time' and 'error' states

Hi,

Due to bugs #423 and #424,[1,2] I've been playing with an extended state
model for RTT components. The bugs note that no 'error' state is available
and that the current model is not ideal for event-driven components.

I have a proposal for RTT 1.4 which offers a first, but partial, solution to
these two bugs. A new state diagram is attached.

First, there is a new 'Fatal' error state. It is entered by calling
'fatal()', which stops the component immediately (calling stopHook()).
'User' intervention is then required, in the form of a call to 'reset()'.
If that succeeds, the component becomes stopped again; otherwise, it becomes
pre-operational and a full configuration is required.
Next, there is an added 'active' state, which is for event and command
processing only. No updateHook() is called. It is identical to the 'active'
state of an Orocos state machine.
Finally, there are two additional 'run-time' error states, RunTimeWarning and
RunTimeError, which are shown in the second attachment. The idea here is that
there can be intermittent errors as well, which the component can solve
itself and which eventually disappear. In these cases updateHook() (in
case of warnings) or errorHook() (in case of errors) is called upon each
trigger (period, event, ...) and the component just keeps running.
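
To make the transitions described above concrete, here is a minimal
stand-alone sketch in C++. It is not the RTT API. Only fatal(), reset(),
stopHook(), updateHook() and errorHook() are named in the proposal; the
class name, the State enum, configure(), activate(), start(), warning(),
error(), recovered() and the 'succeeds' flag are assumptions made for
illustration, as is letting fatal() be called from any state.

```cpp
// Hypothetical stand-alone model of the proposed states. This is NOT the RTT
// API: every name except fatal(), reset() and the three hooks is assumed.
enum class State {
    PreOperational, Stopped, Active, Running,
    RunTimeWarning, RunTimeError, FatalError
};

class ComponentModel {
public:
    State state() const { return state_; }
    int updateCalls() const { return updateCalls_; }
    int errorCalls() const { return errorCalls_; }
    int stopCalls() const { return stopCalls_; }

    // Assumed transitions of the already existing part of the model.
    bool configure() { return transition(state_ == State::PreOperational, State::Stopped); }
    bool activate()  { return transition(state_ == State::Stopped, State::Active); }
    bool start() {
        return transition(state_ == State::Stopped || state_ == State::Active,
                          State::Running);
    }

    // Assumed names for entering and leaving the run-time error states.
    bool warning()   { return transition(inRunningFamily(), State::RunTimeWarning); }
    bool error()     { return transition(inRunningFamily(), State::RunTimeError); }
    bool recovered() { return transition(inRunningFamily(), State::Running); }

    // Proposal: fatal() stops the component immediately via stopHook().
    void fatal() { stopHook(); state_ = State::FatalError; }

    // Proposal: reset() leads to Stopped on success, else to PreOperational.
    // The 'succeeds' flag stands in for whatever a real reset would check.
    bool reset(bool succeeds) {
        if (state_ != State::FatalError) return false;
        state_ = succeeds ? State::Stopped : State::PreOperational;
        return succeeds;
    }

    // One trigger (period, event, ...).
    void trigger() {
        switch (state_) {
        case State::Running:
        case State::RunTimeWarning: updateHook(); break;
        case State::RunTimeError:   errorHook();  break;
        default: break;  // e.g. Active: events/commands only, no updateHook()
        }
    }

private:
    bool transition(bool allowed, State next) {
        if (allowed) state_ = next;
        return allowed;
    }
    bool inRunningFamily() const {
        return state_ == State::Running || state_ == State::RunTimeWarning ||
               state_ == State::RunTimeError;
    }
    void updateHook() { ++updateCalls_; }
    void errorHook()  { ++errorCalls_; }
    void stopHook()   { ++stopCalls_; }

    State state_ = State::PreOperational;
    int updateCalls_ = 0;
    int errorCalls_ = 0;
    int stopCalls_ = 0;
};
```

In this sketch a component in RunTimeError keeps being triggered, but the
trigger ends up in errorHook() instead of updateHook() until recovered() is
called.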

I'm tempted to apply this on the rtt-1.4 branch, although this would delay
this release by another month (in theory, that is). The changes are
backwards compatible with any RTT 1.x release, but some new functions have
been added, which might clash with names in existing user code.

These states have also been inspired by the OMG Robotic Technology
Component (RTC) spec and bring the RTT component model closer to it.
Another thing this spec proposes is to allow rate changes (going from
non-periodic to periodic threads or changing the execution period at
run-time), which is not such an insane feature, but currently (almost)
impossible in the RTT. The division between SingleThread and PeriodicThread
was probably a choice that limits flexibility. Heck, even exposing this
distinction to the user is confusing, but that's for a future bug report.

Peter

[1] https://svn.fmtc.be/bugzilla/orocos/show_bug.cgi?id=423
[2] https://svn.fmtc.be/bugzilla/orocos/show_bug.cgi?id=424

New RTT component 'run-time' and 'error' states

On 10/14/07, Peter Soetens <peter [dot] soetens [..] ...> wrote:
[...]
> I have a proposal for RTT 1.4 which offers a first, but partial, solution to
> these two bugs. A new state diagram is shown in attachment.
>
> First, there is the addition of a 'Fatal' error state, which is entered by
> calling 'fatal()' which leads to stopping the component immediately (calling
> stopHook() ). A 'user' intervention is required which calls 'reset()'. If
> that succeeds, the component becomes stopped again, otherwise, it becomes
> pre-operational and a full configuration is required.

Should I interpret the figure as: "You can enter the fatal state from
any state by calling fatal()"?

Could you give an example of where (=in which case) and how this state
could/should be used?

E.g. what if the RTOS reports to us that we are experiencing an overrun in a
periodic thread? I can imagine this could be fatal or not, depending
on the use case (HRT or SRT thread behaviour). Furthermore, in the
PeriodicThread class description, I see that there is a mechanism that
calls emergencyStop() in this case. How is this linked to this state
diagram?

> Next, there is an added 'active' state which is for event and command
> processing only. No updateHook is called. It is identical to the 'active'
> state of an Orocos state machine.

Is the following reasoning correct (I guess not :-)?

a) TaskContext linked to PeriodicActivity:
Active: events and commands are processed every time the thread wakes up (1)
Running: (1) + updateHook() is called every time too

b) TaskContext linked to NonPeriodicActivity:
Active: events and commands are processed asynchronously when fired
and the thread linked to the NonPeriodicActivity "wakes up"
Running: (1) + loop() (or loopHook) is called every time start() is
called. Is updateHook() ignored in this case?

If so, isn't the component state chart you sent in the attachment valid
only for components linked to PeriodicActivities? (I see no loop()
or loopHook, for instance.)

> Finally, there are two additional 'run-time' error states: RunTimeWarning and
> RunTimeError, which are shown in the second attachment. The idea is here that
> there can be intermittent errors as well, which can be solved by the
> component itself and eventually disappear. In these cases updateHook() (in
> case of warnings) or errorHook() (in case of errors) are called upon each
> trigger (period, event,...) and the component just keeps running.

I wonder why the running state has error substates, but the active one has none?

Thx for the clarifications,

Klaas

New RTT component 'run-time' and 'error' states

Quoting Klaas Gadeyne <klaas [dot] gadeyne [..] ...>:

> On 10/14/07, Peter Soetens <peter [dot] soetens [..] ...> wrote:
> [...]
>> I have a proposal for RTT 1.4 which offers a first, but partial, solution to
>> these two bugs. A new state diagram is shown in attachment.
>>
>> First, there is the addition of a 'Fatal' error state, which is entered by
>> calling 'fatal()' which leads to stopping the component immediately (calling
>> stopHook() ). A 'user' intervention is required which calls 'reset()'. If
>> that succeeds, the component becomes stopped again, otherwise, it becomes
>> pre-operational and a full configuration is required.
>
> Should I interpret the figure as: "You can enter the fatal state from
> any state by calling fatal()"?

Yes.

>
> Could you give an example of where (=in which case) and how this state
> could/should be used?
>
> E.g. what if the RTOS reports to us we are experiencing overrun in a
> periodic thread? I can imagine this could be fatal or not, depending
> on the use case (HRT or SRT thread behaviour). Furthermore, in the
> PeriodicThread class description, I see that there is a mechanism that
> calls emergencyStop() in this case. How is this linked to this state
> diagram?

You can use fatal(), for example, when you have lost the connection to
your device (which needs reconfiguration), got a bunch of 'nan' in your
calculations, or are in any way unable to proceed without 'user'
intervention. Think of it as a C++ exception, which means: "I cannot
solve this problem at this level."

If a thread calls its emergencyStop() function, the thread, and thus
all components it executes, are stopped. Linking this to a fatal() call
would indeed be better, but is probably not trivial.
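
As a rough illustration of the fatal() use cases listed above (lost
connection, 'nan' values), the following C++ sketch shows a stand-alone
component that gives up when it cannot proceed. It does not use the RTT
API: SketchComponent, its members and the updateHook() signature are
hypothetical, and fatal() here is only a stand-in that records the call.

```cpp
#include <cmath>
#include <vector>

// Hypothetical component, not derived from any RTT class.
struct SketchComponent {
    bool connected = true;     // assumed flag: is the device still reachable?
    bool fatalCalled = false;  // records that fatal() was requested
    int processed = 0;         // number of sample sets handled normally

    // Stand-in for the proposed fatal(): only records the request.
    void fatal() { fatalCalled = true; }

    // Assumed signature: one update step over freshly read samples.
    void updateHook(const std::vector<double>& samples) {
        if (!connected) {      // lost device: reconfiguration is needed
            fatal();
            return;
        }
        for (double s : samples) {
            if (std::isnan(s)) {  // 'nan' in the calculations
                fatal();
                return;
            }
        }
        ++processed;           // normal processing would happen here
    }
};
```

Both checks mean "I cannot solve this problem at this level", so the sketch
hands the problem to whoever calls reset() instead of trying to continue.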

>
>> Next, there is an added 'active' state which is for event and command
>> processing only. No updateHook is called. It is identical to the 'active'
>> state of an Orocos state machine.
>
> Is the following reasoning correct (I guess not :-)?
>
> a) TaskContext linked to PeriodicActivity:
> Active: events and commands are processed every time the thread wakes up (1)
> Running: (1) + updateHook() is called every time too

Correct.

>
> b) TaskContext linked to NonPeriodicActivity:
> Active: events and commands are processed asynchronously when fired
> and the thread linked to the NonPeriodicActivity "wakes up"
> Running: (1) + loop() (or loopHook) is called every time start() is
> called. Is updateHook() ignored in this case?

'Running' is wrong: (1) + updateHook() is called upon start() or upon
reception of an event or command. There is no loop() in a TaskContext.
See my first reply in

>
> If so, isn't the component state chart you sent in the attachment valid
> only for components linked to PeriodicActivities? (I see no loop()
> or loopHook, for instance.)
>
>> Finally, there are two additional 'run-time' error states: RunTimeWarning
>> and RunTimeError, which are shown in the second attachment. The idea is
>> here that there can be intermittent errors as well, which can be solved by
>> the component itself and eventually disappear. In these cases updateHook()
>> (in case of warnings) or errorHook() (in case of errors) are called upon
>> each trigger (period, event,...) and the component just keeps running.
>
> I wonder why the running state has error substates, but the active
> one has none?

First, that is a matter of taste, so I can't really present a proof.
Second, in 'active' you are receptive to incoming event/command
requests but not 'controlling/steering' something (a device or another
component). As you are in this 'passive' (what's in a name...)
operational mode, being in error is not really something that happens
'suddenly', but is more related to a faulty incoming request. These
kinds of situations need to be handled at the application level.

Also, these are 'run-time' levels: for your component, this could mean
that a sensor is presenting strange readings or so. The component will
keep track of the number of times these kinds of errors occurred.

Peter

New RTT component 'run-time' and 'error' states

On Tue, 23 Oct 2007, Peter Soetens wrote:
[...]
>> Could you give an example of where (=in which case) and how this state
>> could/should be used?
>>
>> E.g. what if the RTOS reports to us we are experiencing overrun in a
>> periodic thread? I can imagine this could be fatal or not, depending
>> on the use case (HRT or SRT thread behaviour). Furthermore, in the
>> PeriodicThread class description, I see that there is a mechanism that
>> calls emergencyStop() in this case. How is this linked to this state
>> diagram?
>
> You can use fatal() for example, when you lost connection to your device
> (needs reconfiguration), got a bunch of 'nan' in your calculations, or are in
> any way unable to proceed without 'user' intervention. Think of it as a C++
> exception, which means: "I can not solve this problem at this level."
>
> If a thread calls its emergencyStop() function, the thread, and thus all
> components, are stopped. Linking this to a fatal() call would indeed be
> better. But probably not trivial.

I've never had doubts about your programming skills :-))

[...]
>>> Finally, there are two additional 'run-time' error states: RunTimeWarning
>>> and RunTimeError, which are shown in the second attachment. The idea is
>>> here that there can be intermittent errors as well, which can be solved
>>> by the component itself and eventually disappear. In these cases
>>> updateHook() (in case of warnings) or errorHook() (in case of errors) are
>>> called upon each trigger (period, event,...) and the component just keeps
>>> running.
>>
>> I wonder why the running state has error substates, but the active one has
>> none?
>
> First, that is a matter of taste, so I can't really present a proof. Second,
> in 'active' you are receptive to incoming event/command requests but not
> 'controlling/steering' something (a device or another component).

This smells like being "biased" towards periodic stuff, right?

(I understand your patch applies to the 1.x RTT branches and hence cannot
solve all issues.)
[Optional question :-)]
Maybe a better but harder question would be: What would the state
diagram of an Orocos TaskContext look like in RTT 2.0?
[/Optional]

> As you are
> in this 'passive' (what's in a name...) operational mode, being in error is
> not really something that happens 'suddenly', but more related to a faulty
> incoming request. These kinds of situations need to be handled at the
> application level.

Couldn't you apply the same reasoning to the 'running' state? You
could (should?) handle these 'recoverable' errors at the application level
too.

Klaas

> Also, these are 'run-time' levels: for your component, this could mean that
> a sensor is presenting strange readings or so. The component will keep track
> of the number of times these kinds of errors occurred.
>
> Peter

New RTT component 'run-time' and 'error' states

On Sun, 14 Oct 2007, Peter Soetens wrote:

> Due to bugs #423 and #424,[1,2] I've been playing with an extended state
> model for RTT components. The bugs note that no 'error' state is available
> and that the current model is not ideal for event-driven components.
>
> I have a proposal for RTT 1.4 which offers a first, but partial, solution to
> these two bugs. A new state diagram is shown in attachment.
>
> First, there is the addition of a 'Fatal' error state, which is entered by
> calling 'fatal()' which leads to stopping the component immediately (calling
> stopHook() ).

I am all for the introduction of _generic_ states, and 'errors' are
certainly part of that. The question I still have is: is one state enough,
and if not, how many 'error' states do we need and what do we call them?

You mention the OMG spec, but others probably exist too. I remember an
interesting discussion we had some years ago in the "OCEAN" project which
was designing controllers for machine tools, and there we came up with five
levels of 'errors', from 'warning' to 'fatal error', and each of them made
sense. However at the middleware level of RTT, one state might be enough,
where each application domain could then hierarchically fill out that one
"error state" with a more appropriate set of states in that domain.

> A 'user' intervention is required which calls 'reset()'. If
> that succeeds, the component becomes stopped again, otherwise, it becomes
> pre-operational and a full configuration is required.
> Next, there is an added 'active' state which is for event and command
> processing only. No updateHook is called. It is identical to the 'active'
> state of an Orocos state machine.

> Finally, there are two additional 'run-time' error states: RunTimeWarning and
> RunTimeError, which are shown in the second attachment. The idea is here that
> there can be intermittent errors as well, which can be solved by the
> component itself and eventually disappear. In these cases updateHook() (in
> case of warnings) or errorHook() (in case of errors) are called upon each
> trigger (period, event,...) and the component just keeps running.

I think these are already examples of the "hierarchically defined
application-dependent error states" that I spoke of above. But maybe the
concepts of "fatal" and "runtime recoverable" might be generic enough to be
visible in the RTT API... ("Fatal" would then be the same as non-"runtime
recoverable".) Or in C++ terms: the "runtime recoverable" error would be an
exception that is caught at the component where it is generated; the
"fatal" error is propagated to the outside of this component, so it should
be caught in the state machine of the encompassing TaskContext.
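
The exception analogy above can be sketched in a few lines of C++. This is
only an illustration of the idea, not a proposal for the RTT API: the
exception types, the Component class and superviseStep() (standing in for
the encompassing level that reacts to a fatal error) are all hypothetical
names, and treating a negative reading as "implausible" is an arbitrary
assumption.

```cpp
#include <stdexcept>

// Hypothetical error types for the two categories discussed above.
struct RecoverableError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct FatalError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class Component {
public:
    int recoveredCount = 0;  // how often a run-time error was handled locally

    void step(double reading, bool connected) {
        try {
            process(reading, connected);
        } catch (const RecoverableError&) {
            ++recoveredCount;  // caught in the component that generated it
        }
        // FatalError is deliberately not caught here: it leaves the component.
    }

private:
    void process(double reading, bool connected) {
        if (!connected) throw FatalError("device connection lost");
        if (reading < 0.0) throw RecoverableError("implausible sensor reading");
        // normal processing would happen here
    }
};

// Stand-in for the encompassing level: returns false when the component
// reported a fatal error and outside intervention is needed.
bool superviseStep(Component& c, double reading, bool connected) {
    try {
        c.step(reading, connected);
        return true;
    } catch (const FatalError&) {
        return false;
    }
}
```

With this split, the component itself never needs to know how a fatal error
is handled; it only decides which of the two categories a problem falls
into.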

> I'm tempted to apply this on the rtt-1.4 branch, although this would delay
> this release by another month (in theory that is). The changes are
> backwards compatible with any RTT 1.x release, but some new functions have
> been added, which might name-clash with existing user code.
>
> These states have also been inspired by the OMG Robotic Technology
> Component (RTC) and bring the RTT component model closer to that spec.
> Another thing this spec proposes is to allow rate changes (going from non
> periodic to periodic threads or changing the execution period at run-time),
> which is not such an insane feature, but currently (almost) impossible in the
> RTT. The division between SingleThread and PeriodicThread was probably a
> choice limiting flexibility, heck, even exposing this distinction to the user
> is confusing, but that's for a future bug report.
>
> Peter
>
> [1] https://svn.fmtc.be/bugzilla/orocos/show_bug.cgi?id=423
> [2] https://svn.fmtc.be/bugzilla/orocos/show_bug.cgi?id=424

Herman