Generic exception handling in TaskContext

Submitted by Sylvain Joyeux on Wed, 2009-11-11 11:24

RTT-dev

Right now, little exception handling is defined between the "user" part of the
task context (i.e. the hooks) and the "RTT" part. The only place where such
handling is found is having the thread implementation do an emergency exit if
an exception is raised by updateHook().

This leads, sometimes, to a leaked exception killing the whole process (and
the components that are running in it).

I would propose the following default policy:
* configureHook() and startHook() return false if an unhandled exception is
received.
* updateHook() and errorHook() make the transition to fatal error, including
the call to stopHook()
* stopHook() transitions into a new emergency stop state.

In all cases, a relevant message is displayed on the log output.
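
A minimal sketch of the policy above, assuming a simplified stand-in for the task context rather than the real RTT classes. Only the hook names come from the proposal; `SketchComponent`, `LifecycleState`, `logException` and the free functions are hypothetical names for illustration, and `std::cerr` stands in for the log output. errorHook() is left out; it would be wrapped the same way as updateHook().

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical lifecycle states; illustrative names, not RTT's actual enum.
enum class LifecycleState { PreOperational, Stopped, Running, FatalError, EmergencyStop };

// Hypothetical stand-in for a task context. Only the hook names come from the
// proposal; everything else is an assumption made for this sketch.
struct SketchComponent {
    std::function<bool()> configureHook = [] { return true; };
    std::function<bool()> startHook = [] { return true; };
    std::function<void()> updateHook = [] {};
    std::function<void()> stopHook = [] {};
    LifecycleState state = LifecycleState::PreOperational;
};

// Must be called from inside a catch block: rethrows the active exception
// to find out what it is, then writes a message to the log.
inline void logException(const char* hook) {
    try { throw; }
    catch (const std::exception& e) { std::cerr << hook << ": " << e.what() << '\n'; }
    catch (...) { std::cerr << hook << ": unknown exception\n"; }
}

// stopHook(): an escaping exception leads to the emergency stop state.
inline void stop(SketchComponent& c) {
    try { c.stopHook(); c.state = LifecycleState::Stopped; }
    catch (...) { logException("stopHook"); c.state = LifecycleState::EmergencyStop; }
}

// configureHook(): an escaping exception is reported as "return false".
inline bool configure(SketchComponent& c) {
    try {
        if (!c.configureHook()) return false;
        c.state = LifecycleState::Stopped;
        return true;
    } catch (...) { logException("configureHook"); return false; }
}

// startHook(): same treatment as configureHook().
inline bool start(SketchComponent& c) {
    try {
        if (!c.startHook()) return false;
        c.state = LifecycleState::Running;
        return true;
    } catch (...) { logException("startHook"); return false; }
}

// updateHook(): an escaping exception calls stopHook(), then goes to fatal
// error, unless stopHook() itself failed and left us in emergency stop.
inline void update(SketchComponent& c) {
    assert(c.state == LifecycleState::Running);
    try { c.updateHook(); }
    catch (...) {
        logException("updateHook");
        stop(c);
        if (c.state != LifecycleState::EmergencyStop) c.state = LifecycleState::FatalError;
    }
}
```

In this sketch no exception ever crosses the boundary between the user hooks and the calling code, so a failing hook cannot take down the other components in the process.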

What do you think ?

Generic exception handling in TaskContext

Submitted by peter on Wed, 2009-11-11 21:48.

On Wed, Nov 11, 2009 at 12:20, Sylvain Joyeux <sylvain [dot] joyeux [..] ...> wrote:
> Right now, little exception handling is defined between the "user" part of the
> task context (i.e. the hooks) and the "RTT" part. The only place where such
> handling is found is having the thread implementation do an emergency exit if
> an exception is raised by updateHook().

Yeah, that urgently needed fixing.

>
> This leads, sometimes, to a leaked exception killing the whole process (and
> the components that are running in it).
>
> I would propose the following default policy:
> * configureHook() and startHook() return false if an unhandled exception is
> received.
> * updateHook() and errorHook() make the transition to fatal error, including
> the call to stopHook()
> * stopHook() transitions into a new emergency stop state.

If this emergency state can only be entered by letting an exception
escalate, why not call the state 'RunTimeException', or at least with
the name Exception in it ?

>
> In all cases, a relevant message is displayed on the log output.
>
> What do you think ?

It's also an opportunity to think over if we need the runtime warning
and error states ? They are among the generic cases Herman warns about
in the follow-up. The reason they are in is for allowing a generic
tool (taskbrowser etc) to show information about the 'health' status
of a component to a human. That's why also the error and warning
counts were added. It was not designed for an automated supervision
system, since it would not know what the exact error is (without
querying for the exact error, which beats the purpose of a generic
one).

Peter

Generic exception handling in TaskContext

Submitted by Sylvain Joyeux on Thu, 2009-11-12 15:20.

On Wednesday 11 November 2009 22:45:51 Peter Soetens wrote:
> It's also an opportunity to think over if we need the runtime warning
> and error states ? They are of these generic cases Herman warns about
> in the follow-up. The reason they are in is for allowing a generic
> tool (taskbrowser etc) to show information about the 'health' status
> of a component to a human. That's why also the error and warning
> counts were added. It was not designed for an automated supervision
> system, since it would not know what the exact error is (without
> querying for the exact error, which beats the purpose of a generic
> one).

Runtime warning seems useless to me.
Runtime error is useful.

For the record, there is an experimental functionality in orogen that allows
to specify substates of RUNNING, RUNTIME_ERROR and FATAL_ERROR, along with the
means to report the exact state to the supervision layer.
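
To illustrate the idea of substates (this is not orogen's actual syntax or generated code, which the thread does not show), here is a small sketch in which every application-specific substate refines exactly one of the three generic states. The motor driver substates and the function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// The three generic states named above.
enum class GenericState { Running, RuntimeError, FatalError };

// Hypothetical substates of a motor driver component (illustrative only).
enum class MotorDriverState {
    Nominal,         // substate of Running
    CurrentLimited,  // substate of Running
    EncoderTimeout,  // substate of RuntimeError
    Overheating,     // substate of RuntimeError
    BusLost          // substate of FatalError
};

// Map a substate back to the generic state it refines, so that a generic
// tool still sees Running / RuntimeError / FatalError.
inline GenericState genericStateOf(MotorDriverState s) {
    switch (s) {
        case MotorDriverState::Nominal:
        case MotorDriverState::CurrentLimited:
            return GenericState::Running;
        case MotorDriverState::EncoderTimeout:
        case MotorDriverState::Overheating:
            return GenericState::RuntimeError;
        case MotorDriverState::BusLost:
            return GenericState::FatalError;
    }
    assert(false && "unhandled substate");
    return GenericState::FatalError;
}

// Exact state name, as it could be reported to the supervision layer.
inline std::string nameOf(MotorDriverState s) {
    switch (s) {
        case MotorDriverState::Nominal:        return "Nominal";
        case MotorDriverState::CurrentLimited: return "CurrentLimited";
        case MotorDriverState::EncoderTimeout: return "EncoderTimeout";
        case MotorDriverState::Overheating:    return "Overheating";
        case MotorDriverState::BusLost:        return "BusLost";
    }
    return "Unknown";
}
```

A generic tool would only look at `genericStateOf(...)`, while a supervision layer that knows the component can react to the exact substate.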

Generic exception handling in TaskContext

Submitted by bruyninc on Thu, 2009-11-12 18:48.

On Thu, 12 Nov 2009, Sylvain Joyeux wrote:

> On Wednesday 11 November 2009 22:45:51 Peter Soetens wrote:
>> It's also an opportunity to think over if we need the runtime warning
>> and error states ? They are of these generic cases Herman warns about
>> in the follow-up. The reason they are in is for allowing a generic
>> tool (taskbrowser etc) to show information about the 'health' status
>> of a component to a human. That's why also the error and warning
>> counts were added. It was not designed for an automated supervision
>> system, since it would not know what the exact error is (without
>> querying for the exact error, which beats the purpose of a generic
>> one).
>
> runtime warning seems to me useless
> runtime error is useful.
>
> For the record, there is an experimental functionality in orogen that allows
> to specify substates of RUNNING, RUNTIME_ERROR and FATAL_ERROR, along with the
> means to report the exact state to the supervision layer.

There is again a difference between reporting the _status_ of a component,
and the _state_ of that component. The latter is a discrete thing, that
defines the computations that are currently active in the component; the
former is a continuous-time snapshot of the 'state' of these computations
(including the data involved). Unfortunately, I also cannot escape the
ambiguous meaning of the word 'state' in my sentence above... :-( Maybe
the terms "(FSM) state" and "(computation) status" could be used more
systematically to differentiate between both?

Herman

Generic exception handling in TaskContext

Submitted by markus.klotzbuecher on Thu, 2009-11-12 08:36.

On Wed, Nov 11, 2009 at 10:45:51PM +0100, Peter Soetens wrote:
> On Wed, Nov 11, 2009 at 12:20, Sylvain Joyeux <sylvain [dot] joyeux [..] ...> wrote:
> > Right now, little exception handling is defined between the "user" part of the
> > task context (i.e. the hooks) and the "RTT" part. The only place where such
> > handling is found is having the thread implementation do an emergency exit if
> > an exception is raised by updateHook().
>
> Yeah, that urgently needed fixing.
>
> >
> > This leads, sometimes, to a leaked exception killing the whole process (and
> > the components that are running in it).
> >
> > I would propose the following default policy:
> > * configureHook() and startHook() return false if an unhandled exception is
> > received.
> > * updateHook() and errorHook() make the transition to fatal error, including
> > the call to stopHook()
> > * stopHook() transitions into a new emergency stop state.
>
> If this emergency state can only be entered by letting an exception
> escalate, why not call the state 'RunTimeException', or at least with
> the name Exception in it ?
>
> >
> > In all cases, a relevant message is displayed on the log output.
> >
> > What do you think ?
>
> It's also an opportunity to think over if we need the runtime warning
> and error states ? They are of these generic cases Herman warns about
> in the follow-up. The reason they are in is for allowing a generic
> tool (taskbrowser etc) to show information about the 'health' status
> of a component to a human. That's why also the error and warning
> counts were added. It was not designed for an automated supervision
> system, since it would not know what the exact error is (without
> querying for the exact error, which beats the purpose of a generic
> one).

I agree. I think we should strictly separate platform and application
types of states. The internal state machine the RTT uses to maintain a
component throughout its lifecycle has nothing to do with application
level error states. If an uncaught exception gets through then clearly
the component is erroneous (intentionally avoiding the euphemism 'bug')
and thus can not be trusted. In fact the whole process could be
corrupted. So I do agree with Sylvain here that this type of error
requires a distinct state.

OTOH, for application-level errors I agree with Herman: a component
should always deal with errors as well as possible, recover if
possible but fail loudly and early if it can't. Nevertheless if it
does fail at application level, from the RTT/Platform POV it's still
working OK and can for instance be cleanly unloaded without fearing
memory leaks, etc.

Markus

Generic exception handling in TaskContext

Submitted by Ruben Smits on Wed, 2009-11-11 12:08.

On Wed, Nov 11, 2009 at 12:20 PM, Sylvain Joyeux <sylvain [dot] joyeux [..] ...> wrote:
> Right now, little exception handling is defined between the "user" part of the
> task context (i.e. the hooks) and the "RTT" part. The only place where such
> handling is found is having the thread implementation do an emergency exit if
> an exception is raised by updateHook().
>
> This leads, sometimes, to a leaked exception killing the whole process (and
> the components that are running in it).
>
> I would propose the following default policy:
> * configureHook() and startHook() return false if an unhandled exception is
> received.
> * updateHook() and errorHook() make the transition to fatal error, including
> the call to stopHook()
> * stopHook() transitions into a new emergency stop state.
>
> In all cases, a relevant message is displayed on the log output.
>
> What do you think ?

I like it, it's at least a better default policy than we have now.

Ruben

Generic exception handling in TaskContext

Submitted by snrkiwi on Wed, 2009-11-11 13:36.

On Nov 11, 2009, at 07:03 , Ruben Smits wrote:

> On Wed, Nov 11, 2009 at 12:20 PM, Sylvain Joyeux <sylvain [dot] joyeux [..] ...> wrote:
>> Right now, little exception handling is defined between the "user" part of the
>> task context (i.e. the hooks) and the "RTT" part. The only place where such
>> handling is found is having the thread implementation do an emergency exit if
>> an exception is raised by updateHook().
>>
>> This leads, sometimes, to a leaked exception killing the whole process (and
>> the components that are running in it).
>>
>> I would propose the following default policy:
>> * configureHook() and startHook() return false if an unhandled exception is
>> received.
>> * updateHook() and errorHook() make the transition to fatal error, including
>> the call to stopHook()
>> * stopHook() transitions into a new emergency stop state.
>>
>> In all cases, a relevant message is displayed on the log output.
>>
>> What do you think ?
>
> I like it, it's at least a better default policy than we have now.
>
> Ruben

Seems fairly reasonable to me too. But why a "new" emergency stop
state? I gather that any existing states don't fit into this model?

Stephen

Generic exception handling in TaskContext

Submitted by Sylvain Joyeux on Wed, 2009-11-11 14:04.

On Wednesday 11 November 2009 14:32:16 S Roderick wrote:
> Seems fairly reasonable to me too. But why a "new" emergency stop
> state? I gather that any existing states don't fit into this model?
It's just that for me "failure in the middle of stop" translates into "the
component has no clue in what exact state it is right now". Hence the
emergency.

Generic exception handling in TaskContext

Submitted by bruyninc on Wed, 2009-11-11 14:12.

On Wed, 11 Nov 2009, Sylvain Joyeux wrote:

> On Wednesday 11 November 2009 14:32:16 S Roderick wrote:
>> Seems fairly reasonable to me too. But why a "new" emergency stop
>> state? I gather that any existing states don't fit into this model?
> It's just that for me "failure in the middle of stop" translates into "the
> component has no clue in what exact state it is right now". Hence the
> emergency.

This is a situation that should be _avoided_ at all costs... I think there
is still a large confusion in the community between "being in a state" and
"having created events that could lead to a transition in states". For
example, the "failure in the middle of stop" is of the latter kind, while
the system state should still be the "stop state" (or whatever state the
system was running in when the "failure" occurred); it's then the
responsibility of the "stop state" to prepare the (possible) transition to
another state, but that should _not_ be an extra "emergency state", in my
opinion. For the simple reason that that new state will have no extra
information to interpret the situation and hence prepare a valid reaction
to the error...

Herman

Generic exception handling in TaskContext

Submitted by Sylvain Joyeux on Wed, 2009-11-11 14:36.

On Wednesday 11 November 2009 15:09:06 Herman Bruyninckx wrote:
> On Wed, 11 Nov 2009, Sylvain Joyeux wrote:
> > On Wednesday 11 November 2009 14:32:16 S Roderick wrote:
> >> Seems fairly reasonable to me too. But why a "new" emergency stop
> >> state? I gather that any existing states don't fit into this model?
> >
> > It's just that for me "failure in the middle of stop" translates into
> > "the component has no clue in what exact state it is right now". Hence
> > the emergency.
>
> This is a situation that should be _avoided_ at all costs...
Yes. But it does not mean that it will not happen.

> I think there
> is still a large confusion in the community between "being in a state" and
> "having created events that could lead to a transition in states". For
> example, the "failure in the middle of stop" is of the latter kind, while
> the system state should still be the "stop state" (or whatever state the
> system was running in when the "failure" occurred); it's then the
> responsibility of the "stop state" to prepare the (possible) transition to
> another state, but that should _not_ be an extra "emergency state", in my
> opinion. For the simple reason that that new state will have no extra
> information to interpret the situation and hence prepare a valid reaction
> to the error...
True for the module. Not true for the supervision system. The supervision can
(and should) have the ability to know that this very bad situation occurred and
have the possibility to do something about it.

Example: if my motor driver module tells me "I'm in a very bad shape", I'll
route around that module and shut down the power line.

Generic exception handling in TaskContext

Submitted by bruyninc on Wed, 2009-11-11 19:12.

On Wed, 11 Nov 2009, Sylvain Joyeux wrote:

> On Wednesday 11 November 2009 15:09:06 Herman Bruyninckx wrote:
>> On Wed, 11 Nov 2009, Sylvain Joyeux wrote:
>>> On Wednesday 11 November 2009 14:32:16 S Roderick wrote:
>>>> Seems fairly reasonable to me too. But why a "new" emergency stop
>>>> state? I gather that any existing states don't fit into this model?
>>>
>>> It's just that for me "failure in the middle of stop" translates into
>>> "the component has no clue in what exact state it is right now". Hence
>>> the emergency.
>>
>> This is a situation that should be _avoided_ at all costs...
> Yes. But it does not mean that it will not happen.
>
>> I think there
>> is still a large confusion in the community between "being in a state" and
>> "having created events that could lead to a transition in states". For
>> example, the "failure in the middle of stop" is of the latter kind, while
>> the system state should still be the "stop state" (or whatever state the
>> system was running in when the "failure" occurred); it's then the
>> responsibility of the "stop state" to prepare the (possible) transition to
>> another state, but that should _not_ be an extra "emergency state", in my
>> opinion. For the simple reason that that new state will have no extra
>> information to interpret the situation and hence prepare a valid reaction
>> to the error...
> True for the module. Not true for the supervision system. The supervision can
> (and should) have the ability to know that this very bad situation occurred and
> have the possibility to do something about it.

I don't agree: the supervision should be informed by the faulting component
with all possible detail, and not with just "failure", because that doesn't
mean anything...

> Example: if my motor driver module tells me "I'm in a very bad shape", I'll
> route around that module and shut down the power line.

What you should be doing is _interpreting_ the 'error condition', and
recover from it, not 'work around' it...

I do restate my statement, in a somewhat different way: "components can not
be in an "error state", but only in a _computation_ that (tries to) deal
with a non-nominal situation". In still other words: as long as the
component can still perform computations, it should (itself!) give
_meaning_ to the non-nominal state, and send corresponding data and events
to the other components. If it can not continue working, it should go to
the "inactive" state, and not to an error state.

Errors are (almost? always?) a property of _data_ in the component/system's
"world model", and not of their "state". So, there is really no reason to
introduce error states; the problems have to be solved via "world model
error computations" instead! And those computations can be done in the
normal active state of a component.
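
A minimal sketch of this style, assuming a hypothetical temperature-monitoring component: it stays in its normal active computation, interprets the non-nominal situation itself, and emits a detailed error event instead of switching to an error state. `ErrorEvent`, `TemperatureWatch` and the event fields are illustrative names; in a real system the events would leave through a port rather than a vector.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical error event: carries enough detail for another component
// to interpret it, rather than a bare "failure".
struct ErrorEvent {
    std::string source;  // which component raised the event
    std::string kind;    // what went wrong
    double measured;     // offending value
    double limit;        // the limit that was exceeded
};

class TemperatureWatch {
public:
    TemperatureWatch(std::string name, double limitCelsius)
        : name_(std::move(name)), limit_(limitCelsius) {
        assert(!name_.empty());
    }

    // One computation step in the normal active state. A reading above the
    // limit does not change the component's state; it only queues an event.
    void update(double celsius) {
        if (celsius > limit_) {
            pending_.push_back(ErrorEvent{name_, "over_temperature", celsius, limit_});
        }
    }

    // Hand the queued events to the consumer and clear the queue.
    std::vector<ErrorEvent> takeEvents() {
        std::vector<ErrorEvent> out;
        out.swap(pending_);
        return out;
    }

private:
    std::string name_;
    double limit_;
    std::vector<ErrorEvent> pending_;
};
```

The point of the sketch is that the receiver gets the measured value and the limit along with the event, so it can decide on a reaction without querying the component again.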

Lazy component developers have been the cause of so many "error state"
designs in the past (and present...), but all of them are design errors. :-(

Herman

Generic exception handling in TaskContext

Submitted by Sylvain Joyeux on Wed, 2009-11-11 19:44.

On Wednesday 11 November 2009 20:09:52 Herman Bruyninckx wrote:
> > Example: if my motor driver module tells me "I'm in a very bad shape",
> > I'll route around that module and shut down the power line.
>
> What you should be doing is _interpreting_ the 'error condition', and
> recover from it, not 'work around' it...
That's what I do. I do not work around the error, I work around a buggy
component to avoid having it break my system (and what is around it).

> I do restate my statement, in a somewhat different way: "components can not
> be in an "error state", but only in a _computation_ that (tries to) deal
> with a non-nominal situation". In still other words: as long as the
> component can still perform computations, it should (itself!) give
> _meaning_ to the non-nominal state, and send corresponding data and events
> to the other components. If it can not continue working, it should go to
> the "inactive" state, and not to an error state.
>
> Errors are (almost? always?) a property of _data_ in the component/system's
> "world model", and not of their "state". So, there is really no reason to
> introduce error states; the problems have to be solved via "world model
> error computations" instead! And those computations can be done in the
> normal active state of a component.
Yes and no. Components should not have to deal with all the cases in the
world. A good system is IMO a set of specialized components that is properly
articulated by a supervision system. Error states are a way to tell the
supervision that there is a discrepancy between what the component can handle
and what the world is.

But in that case, I fully agree with you, the component *must* send back a
specialized error report to the supervision (not just "error").

Now, let's get real: code is *buggy*. Funnily enough, you are using the same
argument as my former PhD advisor: if there is a bug in the code, it is the
fault of the code writer and the framework should not have to deal with it.

I don't even want to get within one km of a robot that has been built with
such a design rule. So ... what is the problem with buggy code ? Well: you
*can not know what it is doing*.

One way buggy code manifests itself is by having assertions raising
exceptions in the libraries ... i.e. consistency checks saying "not consistent
!". That is what I propose to get that information back to the supervision
system here. And the only way you can do that is by telling it "well,
something is wrong, but I don't know what".

> Lazy component developers have been the cause of so many "error state"
> designs in the past (and present...), but all of them are design errors.
> :-(
If by that you mean "having only one generic error state", yes, I agree. If
you mean "a well structured error representation", you are completely wrong
(but given what you said before, I think you are in the first case).

Generic exception handling in TaskContext

Submitted by bruyninc on Thu, 2009-11-12 06:48.

On Wed, 11 Nov 2009, Sylvain Joyeux wrote:

> On Wednesday 11 November 2009 20:09:52 Herman Bruyninckx wrote:
>>> Example: if my motor driver module tells me "I'm in a very bad shape",
>>> I'll route around that module and shut down the power line.
>>
>> What you should be doing is _interpreting_ the 'error condition', and
>> recover from it, not 'work around' it...
> That's what I do. I do not work around the error, I work around a buggy
> component to avoid having it break my system (and what is around it).

That's just postponing the problem one stage further :-)

>> I do restate my statement, in a somewhat different way: "components can not
>> be in an "error state", but only in a _computation_ that (tries to) deal
>> with a non-nominal situation". In still other words: as long as the
>> component can still perform computations, it should (itself!) give
>> _meaning_ to the non-nominal state, and send corresponding data and events
>> to the other components. If it can not continue working, it should go to
>> the "inactive" state, and not to an error state.
>>
>> Errors are (almost? always?) a property of _data_ in the component/system's
>> "world model", and not of their "state". So, there is really no reason to
>> introduce error states; the problems have to be solved via "world model
>> error computations" instead! And those computations can be done in the
>> normal active state of a component.
> Yes and no. Components should not have to deal with all the cases in the
> world.
Indeed not: only with those parts of the world they are responsible for, or
knowledgeable of.

> A good system is IMO a set of specialized components that is properly
> articulated by a supervision system. Error states are a way to tell the
> supervision that there is a discrepancy between what the component can handle
> and what the world is.
Error _events_ are appropriate, yes, but error states should only come in when
everything else fails. And that's really very late into the problem chain!

> But in that case, I fully agree with you, the component *must* send back a
> specialized error report to the supervision (not just "error").
>
> Now, let's get real: code is *buggy*. Funnily enough, you are using the same
> argument as my former PhD advisor: if there is a bug in the code, it is the
> fault of the code writer and the framework should not have to deal with it.

That's _not_ what I claim :-) I claim that components should use error
_events_ much more, instead of going to error _states_.

> I don't even want to get within one km of a robot that has been built with
> such a design rule. So ... what is the problem with buggy code ? Well: you
> *can not know what it is doing*.
Same answer as above: I advocate a somewhat less traditional way of
handling errors.

> One way buggy code manifests itself is by having assertions raising
> exceptions in the libraries ... i.e. consistency checks saying "not consistent
> !". That is what I propose to get that information back to the supervision
> system here. And the only way you can do that is by telling it "well,
> something is wrong, but I don't know what".

Exceptions are a rather C++-centric thing, which I think undermines
deterministic handling... So, they should be avoided.

>> Lazy component developers have been the cause of so many "error state"
>> designs in the past (and present...), but all of them are design errors.
>> :-(
> If by that you mean "having only one generic error state", yes, I agree. If
> you mean "a well structured error representation", you are completely wrong
> (but given what you said before, I think you are in the first case).
I am certainly in the first case :-) But I don't understand your second
case... :-)

Herman

Generic exception handling in TaskContext

Submitted by Sylvain Joyeux on Thu, 2009-11-12 15:20.

On Thursday 12 November 2009 07:46:00 Herman Bruyninckx wrote:
> On Wed, 11 Nov 2009, Sylvain Joyeux wrote:
> > On Wednesday 11 November 2009 20:09:52 Herman Bruyninckx wrote:
> >>> Example: if my motor driver module tells me "I'm in a very bad shape",
> >>> I'll route around that module and shut down the power line.
> >>
> >> What you should be doing is _interpreting_ the 'error condition', and
> >> recover from it, not 'work around' it...
> >
> > That's what I do. I do not work around the error, I work around a buggy
> > component to avoid having it break my system (and what is around it).
>
> That's just postponing the problem one stage further :-)
Don't get what you mean here.

> >> I do restate my statement, in a somewhat different way: "components can
> >> not be in an "error state", but only in a _computation_ that (tries to)
> >> deal with a non-nominal situation". In still other words: as long as the
> >> component can still perform computations, it should (itself!) give
> >> _meaning_ to the non-nominal state, and send corresponding data and
> >> events to the other components. If it can not continue working, it
> >> should go to the "inactive" state, and not to an error state.
> >>
> >> Errors are (almost? always?) a property of _data_ in the
> >> component/system's "world model", and not of their "state". So, there is
> >> really no reason to introduce error states; the problems have to be
> >> solved via "world model error computations" instead! And those
> >> computations can be done in the normal active state of a component.
> >
> > Yes and no. Components should not have to deal with all the cases in the
> > world.
>
> Indeed not: only with those parts of the world they are responsible for, or
> knowledgeable of.
*Even* in that part of the world they may not be able to handle every case.

> > A good system is IMO a set of specialized components that is properly
> > articulated by a supervision system. Error states are a way to tell the
> > supervision that there is a discrepancy between what the component can
> > handle and what the world is.
>
> Error _events_ are appropriate, yes, but error states should only come in
> when everything else fails. And that's really very late into the problem
> chain!
Well, having developed an event-based system for supervision, I should maybe
agree.

It happens that what I learned by using it is that events and states are most
of the time dual representations. The former is a "situation calculus"
representation and the latter more a "markovian" one (not exactly in both
cases, but let's say these are the nearest formal frameworks).

In other words, to decide what to do, you both look at what happened *right
now* and what happened *earlier*. States are then just a way to aggregate the
history into an appropriate compact representation.
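
This duality can be sketched as a fold: the state is what is left after replaying the event history through a transition function. The event and state names below are hypothetical, chosen only to make the example concrete.

```cpp
#include <cassert>
#include <vector>

// Hypothetical events seen by a supervision layer.
enum class SupervisionEvent { Started, SensorTimeout, SensorRecovered, Stopped };

// Compact summary of the history so far.
enum class Summary { Idle, Nominal, Degraded };

// Transition function: new summary from the old summary and one event.
inline Summary step(Summary current, SupervisionEvent e) {
    switch (e) {
        case SupervisionEvent::Started:
            return Summary::Nominal;
        case SupervisionEvent::SensorTimeout:
            // A timeout only matters while the component is active.
            return current == Summary::Idle ? Summary::Idle : Summary::Degraded;
        case SupervisionEvent::SensorRecovered:
            return current == Summary::Degraded ? Summary::Nominal : current;
        case SupervisionEvent::Stopped:
            return Summary::Idle;
    }
    assert(false && "unhandled event");
    return current;
}

// The state is the event history folded through step(), starting from Idle.
inline Summary replay(const std::vector<SupervisionEvent>& history) {
    Summary s = Summary::Idle;
    for (SupervisionEvent e : history) {
        s = step(s, e);
    }
    return s;
}
```

Deciding what to do on a new event then needs only the current summary and the event itself, not the whole history.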

> > But in that case, I fully agree with you, the component *must* send back
> > a specialized error report to the supervision (not just "error").
> >
> > Now, let's get real: code is *buggy*. Funnily enough, you are using the
> > same argument as my former PhD advisor: if there is a bug in the code, it
> > is the fault of the code writer and the framework should not have to deal
> > with it.
>
> That's _not_ what I claim :-) I claim that components should use error
> _events_ much more, instead of going to error _states_.
I don't completely agree. Most of the time, if an error event needs to be sent
it also means that you enter a special case for computation, and are therefore
in a specialized state ... that someone could boldly call an error state.

> > I don't even want to get within one km of a robot that has been built
> > with such a design rule. So ... what is the problem with buggy code ?
> > Well: you *can not know what it is doing*.
>
> Same answer as above: I advocate a somewhat less traditional way of
> handling errors.
It is not as non-traditional as you think it is. Look at situation calculus
for instance.

> > One way buggy code manifests itself is by having assertions raising
> > exceptions in the libraries ... i.e. consistency checks saying "not
> > consistent !". That is what I propose to get that information back to the
> > supervision system here. And the only way you can do that is by telling
> > it "well, something is wrong, but I don't know what".
>
> Exceptions are a rather C++-centric thing, which I think undermines
> deterministic handling... So, they should be avoided.
I could not disagree more ;-). Exceptions have appeared in most high-level
languages for a very good reason and they should be used.

> >> Lazy component developers have been the cause of so many "error state"
> >> designs in the past (and present...), but all of them are design errors.
> >>
> >> :-(
> >
> > If by that you mean "having only one generic error state", yes, I agree.
> > If you mean "a well structured error representation", you are completely
> > wrong (but given what you said before, I think you are in the first
> > case).
>
> I am certainly in the first case :-) But I don't understand your second
> case... :-)
Well, that errors need to be categorized and "structured" in order to allow
appropriate responses by the supervision layer. Read my thesis for more
details ;-)

Generic exception handling in TaskContext

Submitted by bruyninc on Thu, 2009-11-12 19:00.

On Thu, 12 Nov 2009, Sylvain Joyeux wrote:

> On Thursday 12 November 2009 07:46:00 Herman Bruyninckx wrote:
>> On Wed, 11 Nov 2009, Sylvain Joyeux wrote:
>>> On Wednesday 11 November 2009 20:09:52 Herman Bruyninckx wrote:
>>>>> Example: if my motor driver module tells me "I'm in a very bad shape",
>>>>> I'll route around that module and shut down the power line.
>>>>
>>>> What you should be doing is _interpreting_ the 'error condition', and
>>>> recover from it, not 'work around' it...
>>>
>>> That's what I do. I do not work around the error, I work around a buggy
>>> component to avoid having it break my system (and what is around it).
>>
>> That's just postponing the problem one stage further :-)
> Don't get what you mean here.

I mean that I do not see exactly what this "workaround" can solve: you
still have to "recover" from the error, don't you? And if a component is
really buggy and should be "worked around", this can only(?) happen by
activating a redundant component that can take over the functionalities of
the component-in-error...?

>>>> I do restate my statement, in a somewhat different way: "components can
>>>> not be in an "error state", but only in a _computation_ that (tries to)
>>>> deal with a non-nominal situation". In still other words: as long as the
>>>> component can still perform computations, it should (itself!) give
>>>> _meaning_ to the non-nominal state, and send corresponding data and
>>>> events to the other components. If it can not continue working, it
>>>> should go to the "inactive" state, and not to an error state.
>>>>
>>>> Errors are (almost? always?) a property of _data_ in the
>>>> component/system's "world model", and not of their "state". So, there is
>>>> really no reason to introduce error states; the problems have to be
>>>> solved via "world model error computations" instead! And those
>>>> computations can be done in the normal active state of a component.
>>>
>>> Yes and no. Components should not have to deal with all the cases in the
>>> world.
>>
>> Indeed not: only with those parts of the world they are responsible for, or
>> knowledgeable of.
> *Even* in that part of the world they may not be able to handle every case.

Of course. Sometimes (too often in current practice!:-)) our software
systems can only give up...

>>> A good system is IMO a set of specialized components that is properly
>>> articulated by a supervision system. Error states is a way to tell the
>>> supervision that there is a discrepancy between what the component can
>>> handle and what the world is.
>>
>> Error _events_ are appropriate, yes, but error states should only come in
>> when everything else fails. And that's really very late into the problem
>> chain!
> Well, having developed an event-based system for supervision, I should maybe
> agree.
>
> It happens that what I learned by using it is that events and states are most
> of the time dual representations. The former is a "situation calculus"
> representation and the latter more a "markovian" one (not exactly in both
> cases, but let's say these are the nearest formal frameworks).

They are near, indeed. But for me they are sufficiently semantically
different in order not to confuse them :-)

> In other words, to decide what to do, you both look at what happened *right
> now* and what happened *earlier*. States are then just a way to aggregate the
> history into an appropriate compact representation.

This is what I called the "(computation) status" in a previous email! In
systems and control theory, this "memory of what happened before" is also
called the "state", leading to most of the semantic confusion that we are
implicitly also victims of in this thread :-(

In the context of Coordination ("supervision"), the state is the discrete
state of the FSM that defines what Computations to execute in this FSM
state.

>>> But in that case, I fully agree with you, the component *must* send back
>>> a specialized error report to the supervision (not just "error").
>>>
>>> Now, let's get real: code is *buggy*. Funnily enough, you are using the
>>> same argument as my former PhD advisor: if there is a bug in the code, it
>>> is the fault of the code writer and the framework should not have to deal
>>> with it.
>>
>> That's _not_ what I claim :-) I claim that components should use error
>> _events_ much more, instead of going to error _states_.
> I don't completely agree. Most of the time, if an error event needs to be sent
> it also means that you enter a special case for computation, and are therefore
> in a specialized state ... that someone could boldly call an error state.

This _could_ be an _internal_ state of the Computation! And it is not
necessarily true that this internal state has to be externalised to all
other components!

>>> I don't even want to get as near as one km of a robot that has been built
>>> with such a design rule. So ... what is the problem with buggy code ?
>>> Well: you *can not know what it is doing*.
>>
>> Same answer as above: I advocate a somewhat less traditional way of
>> handling errors.
> It is not as non-traditional as you think it is. Look at situation calculus
> for instance.

Situation calculus is, as far as I remember, not traditional in control
_software_ design, is it? :-) At least in the engineering curricula that I
know about all over Europe. (France could be an exception, since it has a
much more formal tradition than the rest of Western Europe :-))

>>> One way buggy code manifests itself is by having assertions raise
>>> exceptions in the libraries ... i.e. consistency checks saying "not
>>> consistent !". That is what I propose to get that information back to the
>>> supervision system here. And the only way you can do that is by telling
>>> it "well, something is wrong, but I don't know what".
>>
>> Exceptions are a rather C++-centric thing, which I think undermines
>> deterministic handling... So, they should be avoided.
> I could not disagree more ;-). Exceptions have appeared in most high-level
> languages for a very good reason and they should be used.

Correct me if I am wrong, but the semantics of exception handling have
never been "done right" in any language, at least not for the purpose of
deterministic execution...

>>>> Lazy component developers have been the cause of so many "error state"
>>>> designs in the past (and present...), but all of them are design errors.
>>>>
>>>> :-(
>>>
>>> If by that you mean "having only one generic error state", yes, I agree.
>>> If you mean "a well structured error representation", you are completely
>>> wrong (but given what you said before, I think you are in the first
>>> case).
>>
>> I am certainly in the first case :-) But I don't understand your second
>> case... :-)
> Well, that errors need to be categorized and "structured" in order to allow
> appropriate responses by the supervision layer. Read my thesis for more
> details ;-)

Ok, I got what you mean! I think that I would prefer the term "semantic
definition" to what you call "structured categorization", but that's not
really that important... :-)

Herman

Generic exception handling in TaskContext

Submitted by Wieland, Alexis P on Wed, 2009-11-11 16:24.

My understanding is that the existing Error states are sub-
states of Running. It would seem there is a gap then for both
other states and for transitions (?)

Where does this new emergency stop state sit in relation to the
existing Fatal Error state?

- alexis.

Generic exception handling in TaskContext

Submitted by Sylvain Joyeux on Wed, 2009-11-11 18:00.

On Wednesday 11 November 2009 17:19:13 you wrote:
> > On Wednesday 11 November 2009 15:09:06 Herman Bruyninckx wrote:
> > > On Wed, 11 Nov 2009, Sylvain Joyeux wrote:
> > > > On Wednesday 11 November 2009 14:32:16 S Roderick wrote:
> > > >> Seems fairly reasonable to me too. But why a "new" emergency stop
> > > >> state? I gather that any existing states don't fit into this model?
> > > > It's just that for me "failure in the middle of stop" translates into
> > > > "the component has no clue in what exact state it is right now". Hence
> > > > the emergency.
> > >
> > > This is a situation that should be _avoided_ at all costs...
> >
> > Yes. But it does not mean that it will not happen.
>
> Too true.
>
> My understanding is that the existing Error states are sub-
> states of Running. It would seem there is a gap then for both
> other states and for transitions (?)
>
> Where does this new emergency stop state sit in relation to the
> existing Fatal Error state?
Fatal error is a state that the component itself can transition to. I.e.
it is a well-defined state that the component is able to get out of.

Emergency (I actually don't like that name) is a state that is used by the
framework to declare to the supervision layer that the component is *not
anymore* in a well-defined state.

Transition-wise, what I proposed was to assume that if stopHook() is executed
and terminates normally (no exceptions raised), then the component goes back
into a well-defined state.

I therefore propose that the error-related state machine goes like this:

State transitions
-----------------------
There are two types of state transitions.

One set is "controllable", i.e. the transition occurs before the actual state
is reached. For instance, the PRE_OPERATIONAL => STOPPED goes like:
* configure() is called to *request* the transition
* configureHook() is called to *perform* the transition
* if configureHook() is successful, then the transition is done.

Another set is "contingent", i.e. the transition occurs because the actual
state *is already reached*. For instance, FATAL_ERROR and RUNTIME_ERROR.

The point being that *you cannot forbid a contingent transition to happen*. It
already *did* happen ! What you can do, however, is transition back to
something else.

PRE_OPERATIONAL => STOPPED: if an error occurs during the transition, then
the component is left in PRE_OPERATIONAL.
STOPPED => RUNNING: if an error occurs during the transition, then
the component is left in STOPPED. Note that this is debatable, as
the component might be more on the RUNNING side of things ... In
Roby, there is a point where the task code says "I'm started NOW". If
the error occurs before this point, then it is assumed that the task is
still stopped. Otherwise, the error is handled as if it happened in the
RUNNING state.

RUNNING => RUNTIME_ERROR: contingent
RUNNING => FATAL_ERROR: contingent
FATAL_ERROR => STOPPED: controllable
RUNNING => STOPPED: controllable
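
As a minimal sketch (plain C++, not RTT code), the transitions listed above
could be written down as a table that tags each one as controllable or
contingent. The state names come from this message; the type and function
names (State, Kind, Transition, find_transition) are made up for illustration.

```cpp
#include <array>

// State names as used in this message; everything else is illustrative.
enum class State { PreOperational, Stopped, Running, RuntimeError, FatalError };

// Controllable: requested first, performed by a hook, may be refused.
// Contingent: reported after the fact, cannot be refused.
enum class Kind { Controllable, Contingent };

struct Transition {
    State from;
    State to;
    Kind kind;
};

// The six transitions discussed above, in the order they are listed.
const std::array<Transition, 6> kTransitions = {{
    {State::PreOperational, State::Stopped, Kind::Controllable},
    {State::Stopped, State::Running, Kind::Controllable},
    {State::Running, State::RuntimeError, Kind::Contingent},
    {State::Running, State::FatalError, Kind::Contingent},
    {State::FatalError, State::Stopped, Kind::Controllable},
    {State::Running, State::Stopped, Kind::Controllable},
}};

// Returns the table entry for from => to, or nullptr if the pair is not
// one of the transitions listed above.
const Transition* find_transition(State from, State to) {
    for (const Transition& t : kTransitions) {
        if (t.from == from && t.to == to) {
            return &t;
        }
    }
    return nullptr;
}
```

A supervision layer could use such a lookup to decide whether it may still
refuse a transition (controllable) or can only react to it (contingent).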

What we are actually missing is a hook for "the framework detected an error in
your code". Again, in Roby, I decided to use the command for "stop" (i.e.
stopHook()) for that, while still flagging the task as failed. That's obviously
open for discussion.

The last two transitions are the ones that are problematic. IMO, the
underlying assumption in the TaskContext model is that the associated hooks
(recoverHook() and stopHook()) are theoretically able to put the component
back into a well-defined state (in both cases, STOPPED).

If they fail to do so, then no recovery is possible: the issue is that we are
in a not-so-well-known state, and have no stopHook()/recoverHook() to get
back to something that we know. In other words: there is no generic recovery
possible.

That's where EMERGENCY comes (I actually don't like that name, so don't quote
me as saying that state should be called EMERGENCY ;-)). This state would be
used to flag the task so that the supervision code can handle it (of course,
I'm a little bit biased there :P)
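
To make the proposal concrete, here is a rough sketch of how a framework
could guard the user hooks. It is not the RTT TaskContext API: GuardedTask,
its std::function members and the Emergency state name are placeholders, and
the policy is simply the one proposed in this thread (configureHook() and
startHook() report failure, an exception in updateHook() leads to fatal error
via stopHook(), and an exception in stopHook() leads to the emergency state).

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// "Emergency" is a placeholder name for the not-well-defined state.
enum class State { PreOperational, Stopped, Running, FatalError, Emergency };

// Hypothetical stand-in for a component; NOT the RTT TaskContext class.
class GuardedTask {
public:
    // User hooks, replaceable for the sake of the example.
    std::function<bool()> configureHook = [] { return true; };
    std::function<bool()> startHook = [] { return true; };
    std::function<void()> updateHook = [] {};
    std::function<void()> stopHook = [] {};

    State state() const { return state_; }

    // PRE_OPERATIONAL => STOPPED; on failure the state is left unchanged.
    bool configure() {
        if (state_ != State::PreOperational) return false;
        if (!runBoolHook(configureHook, "configureHook")) return false;
        state_ = State::Stopped;
        return true;
    }

    // STOPPED => RUNNING; on failure the state is left unchanged.
    bool start() {
        if (state_ != State::Stopped) return false;
        if (!runBoolHook(startHook, "startHook")) return false;
        state_ = State::Running;
        return true;
    }

    // An exception in updateHook() triggers stopHook() and then fatal error.
    void update() {
        if (state_ != State::Running) return;
        try {
            updateHook();
            return;
        } catch (const std::exception& e) {
            log("updateHook", e.what());
        } catch (...) {
            log("updateHook", "unknown exception");
        }
        state_ = runStopHook() ? State::FatalError : State::Emergency;
    }

    // RUNNING => STOPPED; an exception in stopHook() means emergency.
    void stop() {
        if (state_ != State::Running) return;
        state_ = runStopHook() ? State::Stopped : State::Emergency;
    }

private:
    static void log(const char* hook, const char* what) {
        std::cerr << "exception in " << hook << ": " << what << '\n';
    }

    // An escaping exception is treated like a hook returning false.
    static bool runBoolHook(const std::function<bool()>& hook, const char* name) {
        try {
            return hook();
        } catch (const std::exception& e) {
            log(name, e.what());
        } catch (...) {
            log(name, "unknown exception");
        }
        return false;
    }

    // Returns true only if stopHook() terminated normally.
    bool runStopHook() {
        try {
            stopHook();
            return true;
        } catch (const std::exception& e) {
            log("stopHook", e.what());
        } catch (...) {
            log("stopHook", "unknown exception");
        }
        return false;
    }

    State state_ = State::PreOperational;
};
```

In this sketch nothing ever leaves the emergency state, which matches the
point that no generic recovery is possible from there; what the supervision
does with such a task is deliberately left out.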

Generic exception handling in TaskContext

Submitted by snrkiwi on Wed, 2009-11-11 14:12.

On Nov 11, 2009, at 08:59 , Sylvain Joyeux wrote:

> On Wednesday 11 November 2009 14:32:16 S Roderick wrote:
>> Seems fairly reasonable to me too. But why a "new" emergency stop
>> state? I gather that any existing states don't fit into this model?
> It's just that for me "failure in the middle of stop" translates into
> "the component has no clue in what exact state it is right now". Hence
> the emergency.

Understood. And I guess an unhandled exception in this new state is
not handled, as at that point you are out of luck.
Stephen