Default exception handling in RTT 2.0

Submitted by Sylvain Joyeux on Tue, 2010-04-20 13:20

RTT-dev

As far as I saw on the RTT2 ExecutionEngine code, exception handling is
as follows (Peter: tell me if I'm wrong):

* an uncaught exception in updateHook() transitions to RUNTIME_ERROR
* an uncaught exception in errorHook() transitions to FATAL_ERROR
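
A minimal sketch of the two rules above, assuming nothing about the real ExecutionEngine internals. `State`, `Component` and `step()` are hypothetical names made up for illustration and are not RTT classes; the actual RTT 2.0 code may differ in detail.

```cpp
#include <functional>

// Hypothetical state names for illustration; not the RTT 2.0 enum.
enum class State { Running, RuntimeError, FatalError };

// Hypothetical stand-in for a component with user-supplied hooks.
struct Component {
    State state = State::Running;
    std::function<void()> updateHook = [] {};
    std::function<void()> errorHook = [] {};
};

// One execution cycle: run the hook that matches the current state and
// apply the transitions listed above when that hook leaks an exception.
inline void step(Component& c) {
    switch (c.state) {
    case State::Running:
        try { c.updateHook(); }
        catch (...) { c.state = State::RuntimeError; }
        break;
    case State::RuntimeError:
        try { c.errorHook(); }
        catch (...) { c.state = State::FatalError; }
        break;
    case State::FatalError:
        break;  // no user code runs any more
    }
}
```

In this sketch a throwing updateHook() only moves the component to RuntimeError, and errorHook() keeps being called on later cycles until it throws as well.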

This assumes that errorHook() is able to handle unspecified errors. This
seems too broad from my POV. How we design components is that the
transition to fatal() is basically a stop() + some sort of tentative
cleanup. runtime_error is used as a runtime state categorization (i.e. a
way to "regroup" internal states), but its interpretation as an actual
error is situation dependent.

For instance, our motor controllers go into runtime_error when the
motors can't be driven (because of hardware protection mechanisms for
instance), but the electronics still *reads* the encoder + motor data.
I.e. they are read-only when in runtime error and can be used if only
reading is needed. fatal_error would be entered if we are not able to
talk to the electronics anymore.
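
The motor controller case could be sketched roughly as follows. Everything here is hypothetical (the `Electronics` struct, its fields and the `MotorController` methods are invented for the example and do not describe our actual driver), and the sketch assumes that only an external restart leaves the fatal state.

```cpp
#include <stdexcept>

enum class State { Running, RuntimeError, FatalError };

// Hypothetical snapshot of the hardware, filled in by the caller.
struct Electronics {
    bool reachable = true;      // can we still talk to the board?
    bool drive_enabled = true;  // false once a hardware protection tripped
    int encoder_ticks = 0;
};

struct MotorController {
    State state = State::Running;

    // Classify the situation once per cycle.
    void update(const Electronics& hw) {
        if (state == State::FatalError) return;  // assumption: needs a restart
        if (!hw.reachable)          state = State::FatalError;
        else if (!hw.drive_enabled) state = State::RuntimeError;
        else                        state = State::Running;
    }

    // Reading stays available in Running and in RuntimeError.
    int readEncoder(const Electronics& hw) const {
        if (state == State::FatalError)
            throw std::runtime_error("no link to the electronics");
        return hw.encoder_ticks;
    }

    // Driving is only available in Running.
    bool canDrive() const { return state == State::Running; }
};
```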

In a way, runtime_error is not very useful in this case ...

To get back to the point: I think that runtime_error should be used when
the component is still able to provide a limited functionality,
fatalError being used when the component does not provide any
functionality anymore. Thus, the default exception handling of
updateHook() should IMO transition to FATAL_ERROR: I don't see how a
component can *know* what it is doing when an uncaught exception has
been raised by updateHook().

Thoughts ?

Default Component states (Was Default exception handling in RTT

Submitted by Sylvain Joyeux on Wed, 2010-05-05 08:32.

Herman Bruyninckx wrote:
> On Tue, 4 May 2010, Stephen Roderick wrote:
>
>
>> On May 4, 2010, at 12:25 , Sylvain Joyeux wrote:
>>
>>
>>> Stephen Roderick wrote:
>>>
>>>> Thanks for drawing the diagrams up. My 2c worth ...
>>>>
>>>> - Peter's version is far more understandable than Sylvain's
>>>> - The diagrams make it very clear the mix of lifecycle and application states. We personally encode our application states in FSMs explicitly.
>>>> - I see no reason for the extra "Really Fatal Error" state (and I agree in large part with Herman's comments regarding naming)
>>>> - A fatal error is a fatal error. No way out. End of story. Done. Finished.
>>>>
>>>>
>>> Fine, call it 'Failure' or simply 'Error' (by opposition to "RuntimeError")
>>>
>> Works for me. Semantically, fatal == terminal == end.
>>
>
> Doesn't work for me! _If_ your component is still so much active that it
> can perform transitions to different states, it can still do other things
> too! Most often, that means that its _useful_ activity is _temporarily_
> hindered by the non-availability of some externally controlled resource
> (communication, motors, whatever), and that this component could/should be
> actively waiting for the temporary problem to be solved. Hence, it should
> get a name that reflects this situation. If something is really fatal for a
> component, the component will not know about that, because it will have
> lost all its "consciousness" :-)
>
I don't agree, and that is because I have a fundamentally different
view about how components should be implemented.

In my view, components should have a limited set of well-defined
functionality. If they can't perform that functionality, they should
*give up* and report that they did give up to a supervision (a.k.a.
coordination) layer. It is up to that layer to decide what to do
(including restarting the same component).

They should *not* "actively wait" or "retry". The rationale is that, by
doing that, you actually simplify the component implementation greatly,
leaving the complexity of handling problems to a layer that is designed
for that (the supervision layer).
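
A rough sketch of that split, under the assumption that the supervisor's policy is a simple bounded restart. `Report`, `Component` and `Supervisor` are hypothetical names for illustration only; a real supervision layer would of course be richer than this.

```cpp
// What a component tells the supervision layer after a cycle.
enum class Report { Ok, GaveUp };

// Hypothetical component: it does its one job and has no retry loop.
struct Component {
    int starts = 0;
    bool resource_available = false;

    void start() { ++starts; }

    // Gives up immediately when the resource it needs is missing.
    Report update() const {
        return resource_available ? Report::Ok : Report::GaveUp;
    }
};

// Hypothetical supervisor: the decision to retry lives here.
struct Supervisor {
    int max_restarts;
    int restarts = 0;

    explicit Supervisor(int max) : max_restarts(max) {}

    // Returns true if the component is left running afterwards.
    bool handle(Component& c, Report r) {
        if (r == Report::Ok) return true;
        if (restarts >= max_restarts) return false;  // stop trying
        ++restarts;
        c.start();  // restart the same component
        return true;
    }
};
```

The component stays trivial; changing the recovery strategy only touches `Supervisor::handle()`.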

Default Component states (Was Default exception handling in RTT

Submitted by peter on Thu, 2010-05-06 12:32.

On Wednesday 05 May 2010 10:30:06 Sylvain Joyeux wrote:
> Herman Bruyninckx wrote:
> > On Tue, 4 May 2010, Stephen Roderick wrote:
> >> On May 4, 2010, at 12:25 , Sylvain Joyeux wrote:
> >>> Stephen Roderick wrote:
> >>>> Thanks for drawing the diagrams up. My 2c worth ...
> >>>>
> >>>> - Peter's version is far more understandable than Sylvain's
> >>>> - The diagrams make it very clear the mix of lifecycle and
> >>>> application states. We personally encode our application states in
> >>>> FSMs explicitly.
> >>>> - I see no reason for the extra "Really Fatal Error" state (and I
> >>>> agree in large part with Herman's comments regarding naming)
> >>>> - A fatal error is a fatal error. No way out. End of story. Done.
> >>>> Finished.
> >>>
> >>> Fine, call it 'Failure' or simply 'Error' (by opposition to
> >>> "RuntimeError")
> >>
> >> Works for me. Semantically, fatal == terminal == end.
> >
> > Doesn't work for me! _If_ your component is still so much active that it
> > can perform transitions to different states, it can still do other things
> > too! Most often, that means that its _useful_ activity is _temporarily_
> > hindered by the non-availability of some externally controlled resource
> > (communication, motors, whatever), and that this component could/should
> > be actively waiting for the temporary problem to be solved. Hence, it
> > should get a name that reflects this situation. If something is really
> > fatal for a component, the component will not know about that, because it
> > will have lost all its "consciousness" :-)
>
> I don't agree, and that is because I have a fundamentally different
> view about how components should be implemented.
>
> In my view, components should have a limited set of well-defined
> functionality. If they can't perform that functionality, they should
> *give up* and report that they did give up to a supervision (a.k.a.
> coordination) layer. It is up to that layer to decide what to do
> (including restarting the same component).
>
> They should *not* "actively wait" or "retry". The rationale is that, by
> doing that, you actually simplify the component implementation greatly,
> leaving the complexity of handling problems to a layer that is designed
> for that (the supervision layer).
>

I fully agree with this statement. Also, I acknowledge that Sylvain has far
more experience with supervision than I have. So if he reports that changes we
make break or cripple supervision, these should be resolved...

Peter

Default Component states (Was Default exception handling in RTT

Submitted by bruyninc on Wed, 2010-05-05 09:20.

On Wed, 5 May 2010, Sylvain Joyeux wrote:

> Herman Bruyninckx wrote:
>> On Tue, 4 May 2010, Stephen Roderick wrote:
>>
>>> On May 4, 2010, at 12:25 , Sylvain Joyeux wrote:
>>>
>>>> Stephen Roderick wrote:
>>>>
>>>>> Thanks for drawing the diagrams up. My 2c worth ...
>>>>>
>>>>> - Peter's version is far more understandable than Sylvain's
>>>>> - The diagrams make it very clear the mix of lifecycle and application states. We personally encode our application states in FSMs explicitly.
>>>>> - I see no reason for the extra "Really Fatal Error" state (and I agree in large part with Herman's comments regarding naming)
>>>>> - A fatal error is a fatal error. No way out. End of story. Done. Finished.
>>>>>
>>>>>
>>>> Fine, call it 'Failure' or simply 'Error' (by opposition to "RuntimeError")
>>>>
>>> Works for me. Semantically, fatal == terminal == end.
>>>
>>
>> Doesn't work for me! _If_ your component is still so much active that it
>> can perform transitions to different states, it can still do other things
>> too! Most often, that means that its _useful_ activity is _temporarily_
>> hindered by the non-availability of some externally controlled resource
>> (communication, motors, whatever), and that this component could/should be
>> actively waiting for the temporary problem to be solved. Hence, it should
>> get a name that reflects this situation. If something is really fatal for a
>> component, the component will not know about that, because it will have
>> lost all its "consciousness" :-)
>>
> I don't agree, and that is because I have a fundamentally different
> view about how components should be implemented.
>
> In my view, components should have a limited set of well-defined
> functionality. If they can't perform that functionality, they should
> *give up* and report that they did give up to a supervision (a.k.a.
> coordination) layer. It is up to that layer to decide what to do
> (including restarting the same component).

That view is not different from my view... I just add the "detail"
that the component should tell the others _why_ it gave up.

> They should *not* "actively wait" or "retry".

That depends completely on what you want each component to have as
responsibilities! A system designer could (I do not say "should") want a
component that is responsible for a resource _not_ to give up, but keep on
trying something "useful" in the context of the application. So, saying
what a component "should" do in all cases is a design decision that limits
the compositionality and reusability of that component.

> The rationale is that, by doing that, you actually simplify the component
> implementation greatly, leaving the complexity of handling problems to a
> layer that is designed for that (the supervision layer).

You are, conceptually, just moving the problem to another component. And
sometimes that's the right thing to do, sometimes it is not. This trade-off
depends completely on the application, but does not change the discussion
we are having here: what should be part of the _default_ state machine in
RTT...? And the only point I make in this discussion is to advocate _not_
to use non-reusable/non-composable names such as "Fatal", "Unrecoverable",
etc. More constructively, I would suggest to name such a state
"Recoverable" or something (I don't like that name too much, frankly...),
indicating that (i) it is halted because it encountered something that it
could not deal with, _and_ (ii) it is still working as a piece of software
and hence can communicate with others (its "supervisor" for example) to
help discover the cause of the problem, _and_ (iii) it has still the
capability to transition to one of its more "useful" states.

In my suggestion, the "and"s are logical "and"s, so all three conditions
have to be fulfilled before the component gets in that state. This, of
course, leaves room for other states in which only one or two of these
three conditions are fulfilled. For example, "Debuggable" would be (i) and
(ii). Probably (ii) is not an extra 'state' since without (ii) the
component cannot communicate with others anymore, de facto being useless to
the system...
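
The three conditions can be written down as a small classification, purely as an illustration. The names `Condition`, `Category` and `classify()` are hypothetical, "Recoverable" and "Debuggable" are the tentative names from the text above, and `Unreachable` and `Other` are placeholders added only to make the function total.

```cpp
// The three conditions from the text, as plain flags.
struct Condition {
    bool halted;           // (i)   stopped on something it could not deal with
    bool can_communicate;  // (ii)  still working as a piece of software
    bool can_resume;       // (iii) can transition back to a "useful" state
};

enum class Category { Recoverable, Debuggable, Unreachable, Other };

inline Category classify(const Condition& c) {
    // Without (ii) nothing can be reported, so no other state is observable.
    if (!c.can_communicate) return Category::Unreachable;
    // Logical "and" of (i), (ii) and (iii).
    if (c.halted && c.can_resume) return Category::Recoverable;
    // Only (i) and (ii) hold.
    if (c.halted) return Category::Debuggable;
    return Category::Other;
}
```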

Herman

Default Component states (Was Default exception handling in RTT

Submitted by Sylvain Joyeux on Wed, 2010-05-05 09:40.

Herman Bruyninckx wrote:
> On Wed, 5 May 2010, Sylvain Joyeux wrote:
>
>
>> Herman Bruyninckx wrote:
>>
>>> On Tue, 4 May 2010, Stephen Roderick wrote:
>>>
>>>
>>>> On May 4, 2010, at 12:25 , Sylvain Joyeux wrote:
>>>>
>>>>
>>>>> Stephen Roderick wrote:
>>>>>
>>>>>
>>>>>> Thanks for drawing the diagrams up. My 2c worth ...
>>>>>>
>>>>>> - Peter's version is far more understandable than Sylvain's
>>>>>> - The diagrams make it very clear the mix of lifecycle and application states. We personally encode our application states in FSMs explicitly.
>>>>>> - I see no reason for the extra "Really Fatal Error" state (and I agree in large part with Herman's comments regarding naming)
>>>>>> - A fatal error is a fatal error. No way out. End of story. Done. Finished.
>>>>>>
>>>>>>
>>>>>>
>>>>> Fine, call it 'Failure' or simply 'Error' (by opposition to "RuntimeError")
>>>>>
>>>>>
>>>> Works for me. Semantically, fatal == terminal == end.
>>>>
>>>>
>>> Doesn't work for me! _If_ your component is still so much active that it
>>> can perform transitions to different states, it can still do other things
>>> too! Most often, that means that its _useful_ activity is _temporarily_
>>> hindered by the non-availability of some externally controlled resource
>>> (communication, motors, whatever), and that this component could/should be
>>> actively waiting for the temporary problem to be solved. Hence, it should
>>> get a name that reflects this situation. If something is really fatal for a
>>> component, the component will not know about that, because it will have
>>> lost all its "consciousness" :-)
>>>
>>>
>> I don't agree, and that is because I have a fundamentally different
>> view about how components should be implemented.
>>
>> In my view, components should have a limited set of well-defined
>> functionality. If they can't perform that functionality, they should
>> *give up* and report that they did give up to a supervision (a.k.a.
>> coordination) layer. It is up to that layer to decide what to do
>> (including restarting the same component).
>>
>
> That view is not different from my view... I just add the "detail"
> that the component should tell the others _why_ it gave up.
>
>
>> They should *not* "actively wait" or "retry".
>>
>
> That depends completely on what you want each component to have as
> responsibilities! A system designer could (I do not say "should") want a
> component that is responsible for a resource _not_ to give up, but keep on
> trying something "useful" in the context of the application. So, saying
> what a component "should" do in all cases is a design decision that limits
> the compositionality and reusability of that component.
>
>
>> The rationale is that, by doing that, you actually simplify the component
>> implementation greatly, leaving the complexity of handling problems to a
>> layer that is designed for that (the supervision layer).
>>
>
> You are, conceptually, just moving the problem to another component. And
> sometimes that's the right thing to do, sometimes it is not. This trade-off
> depends completely on the application, but does not change the discussion
> we are having here: what should be part of the _default_ state machine in
> RTT...? And the only point I make in this discussion is to advocate _not_
> to use non-reusable/non-composable names such as "Fatal", "Unrecoverable",
> etc. More constructively, I would suggest to name such a state
> "Recoverable" or something (I don't like that name too much, frankly...),
> indicating that (i) it is halted because it encountered something that it
> could not deal with, _and_ (ii) it is still working as a piece of software
> and hence can communicate with others (its "supervisor" for example) to
> help discover the cause of the problem, _and_ (iii) it has still the
> capability to transition to one of its more "useful" states.
>
> In my suggestion, the "and"s are logical "and"s, so all three conditions
> have to be fulfilled before the component gets in that state. This, of
> course, leaves room for other states in which only one or two of these
> three conditions are fulfilled. For example, "Debuggable" would be (i) and
> (ii). Probably (ii) is not an extra 'state' since without (ii) the
> component cannot communicate with others anymore, de facto being useless to
> the system...
>
OK. I misunderstood your previous email.

I mostly agree with everything you are saying. I would say that,
obviously, the "FatalError" name should be dropped as it led to way too
much trouble in this discussion ;-)

I'm bad at naming, and really don't spend enough time on it. I have to
admit I rather overlook that side and keep coding ;-).

Default Component states (Was Default exception handling in RTT

Submitted by bruyninc on Wed, 2010-05-05 09:44.

On Wed, 5 May 2010, Sylvain Joyeux wrote:

[... cut away a lot of discussion, to put the focus on our shared conclusion...]

> I would say that,
> obviously, the "FatalError" name should be dropped as it led to way too
> much trouble in this discussion ;-)

> I'm bad at naming, and really don't spend enough time on it. I have to
> admit I rather overlook that side and keep coding ;-).

That's why non-coders are useful, from time to time too :-)

Herman

Default Component states (Was Default exception handling in RTT

Submitted by bruyninc on Tue, 2010-05-04 17:08.

On Tue, 4 May 2010, Sylvain Joyeux wrote:

> Stephen Roderick wrote:
>> Thanks for drawing the diagrams up. My 2c worth ...
>>
>> - Peter's version is far more understandable than Sylvain's
>>
>> - The diagrams make it very clear the mix of lifecycle and application states. We personally encode our application states in FSMs explicitly.
>> - I see no reason for the extra "Really Fatal Error" state (and I agree in large part with Herman's comments regarding naming)
>> - A fatal error is a fatal error. No way out. End of story. Done. Finished.
>>
> Fine, call it 'Failure' or simply 'Error' (by opposition to "RuntimeError")
>> - if we can recover from RunTimeError then we should be able to programmatically enter it (Sylvain's error() )
>> - it looks like adding Sylvain's resetError()/resetHook() would not be a huge change in the API. That would get him the ability to recover from FatalError.
>>
>> I would advocate one of two approaches
>> a) Peter's diagram minus RunTimeError. An exception during Running goes to FatalError.
>> b) Sylvain's diagram without RunTimeError and without ExtraFatalError.
>>
>> For a), after an exception you've no idea of the state of things, how on earth do you think you might recover. Just call it a day. RunTimeError is an application state, all the others are lifecycle states.
>>
>> For b), RunTimeError seriously strikes me as application specific. Use an FSM that has error() and recovered() events instead. Again, I see no need for ExtraFatalError. And I still argue that exceptions pretty much anywhere mean you can't really recover, as with a).
>>
> If they are living their own life, applications don't need to export any
> states, as they are the only ones interested about them.

Any application will _always_ be deployed in a "container" that is
responsible to provide "resource services" (CPU, RAM, communication,...),
so no application will ever live completely its own life. The coupling
("exporting of states") with the container _cannot_ be avoided, but should
be limited to the "resource services"-relevant states.

> The need I
> personally see from RuntimeError is to provide a default "degraded"
> state for applications that don't want to use events and FSM (none of
> ours do), and a default mean of supervision. What I believe is that, if
> no default runtime error is provided, then nobody will ever export
> "degraded" states by default.
>
> Moreover, I believe that in most cases, a cleanup procedure can actually
> recover from an ill-known state. In my POV (and in our components),
> cleanup() does a *lot* of cleanup (not far away from delete/new). I.e.,
> it *is* a perfectly possible sequence to try and recover from unhandled
> exceptions. It might fail (i.e. throw), but in most cases it will work.

"Cleanup" procedures are very useful to improve the _robustness_ and/or
_autonomy_ of a component; they most often have a performance cost. Hence,
the system supervisor should be allowed to have a say in this trade-off,
meaning that this is another use case where applications should export some
of their "internal" state.

Herman

PS Much of this material is a core research activity in our BRICS
project... :-) We have no real _results_ yet, but only a number of
insights...

Default Component states (Was Default exception handling in RTT

Submitted by Sylvain Joyeux on Wed, 2010-05-05 08:40.

Herman Bruyninckx wrote:
> On Tue, 4 May 2010, Sylvain Joyeux wrote:
>
>
>> Stephen Roderick wrote:
>>
>>> Thanks for drawing the diagrams up. My 2c worth ...
>>>
>>> - Peter's version is far more understandable than Sylvain's
>>>
>>> - The diagrams make it very clear the mix of lifecycle and application states. We personally encode our application states in FSMs explicitly.
>>> - I see no reason for the extra "Really Fatal Error" state (and I agree in large part with Herman's comments regarding naming)
>>> - A fatal error is a fatal error. No way out. End of story. Done. Finished.
>>>
>>>
>> Fine, call it 'Failure' or simply 'Error' (by opposition to "RuntimeError")
>>
>>> - if we can recover from RunTimeError then we should be able to programmatically enter it (Sylvain's error() )
>>> - it looks like adding Sylvain's resetError()/resetHook() would not be a huge change in the API. That would get him the ability to recover from FatalError.
>>>
>>> I would advocate one of two approaches
>>> a) Peter's diagram minus RunTimeError. An exception during Running goes to FatalError.
>>> b) Sylvain's diagram without RunTimeError and without ExtraFatalError.
>>>
>>> For a), after an exception you've no idea of the state of things, how on earth do you think you might recover. Just call it a day. RunTimeError is an application state, all the others are lifecycle states.
>>>
>>> For b), RunTimeError seriously strikes me as application specific. Use an FSM that has error() and recovered() events instead. Again, I see no need for ExtraFatalError. And I still argue that exceptions pretty much anywhere mean you can't really recover, as with a).
>>>
>>>
>> If they are living their own life, applications don't need to export any
>> states, as they are the only ones interested about them.
>>
>
> Any application will _always_ be deployed in a "container" that is
> responsible to provide "resource services" (CPU, RAM, communication,...),
> so no application will ever live completely its own life. The coupling
> ("exporting of states") with the container _cannot_ be avoided, but should
> be limited to the "resource services"-relevant states.
>
>
>> The need I
>> personally see from RuntimeError is to provide a default "degraded"
>> state for applications that don't want to use events and FSM (none of
>> ours do), and a default mean of supervision. What I believe is that, if
>> no default runtime error is provided, then nobody will ever export
>> "degraded" states by default.
>>
>> Moreover, I believe that in most cases, a cleanup procedure can actually
>> recover from an ill-known state. In my POV (and in our components),
>> cleanup() does a *lot* of cleanup (not far away from delete/new). I.e.,
>> it *is* a perfectly possible sequence to try and recover from unhandled
>> exceptions. It might fail (i.e. throw), but in most cases it will work.
>>
>
> "Cleanup" procedures are very useful to improve the _robustness_ and/or
> _autonomy_ of a component; they most often have a performance cost. Hence,
> the system supervisor should be allowed to have a say in this trade-off,
> meaning that this is another use case where applications should export some
> of their "internal" state.
>
> Herman
>
> PS Much of this material is a core research activity in our BRICS
> project... :-) We have no real _results_ yet, but only a number of
> insights...
>
Well ... I don't pretend to have all the answers, but as someone that is
trying to do supervision in a number of systems, running three different
existing functional layers, I do have some experience.

Trying to handle most problems "internally" in a component is bound to
make the components very complex and very hard to debug. Moreover, when
a problem does appear (let's say, a file descriptor returning an error),
you actually *have to* do cleanup to retry opening it.

In other words, in most cases, when something non-nominal happens, you
do have to pay the price of cleaning up / reinitializing.

Default Component states (Was Default exception handling in RTT

Submitted by bruyninc on Wed, 2010-05-05 09:08.

On Wed, 5 May 2010, Sylvain Joyeux wrote:

> Herman Bruyninckx wrote:
>> On Tue, 4 May 2010, Sylvain Joyeux wrote:
>>
>>
>>> Stephen Roderick wrote:
>>>
>>>> Thanks for drawing the diagrams up. My 2c worth ...
>>>>
>>>> - Peter's version is far more understandable than Sylvain's
>>>>
>>>> - The diagrams make it very clear the mix of lifecycle and application states. We personally encode our application states in FSMs explicitly.
>>>> - I see no reason for the extra "Really Fatal Error" state (and I agree in large part with Herman's comments regarding naming)
>>>> - A fatal error is a fatal error. No way out. End of story. Done. Finished.
>>>>
>>>>
>>> Fine, call it 'Failure' or simply 'Error' (by opposition to "RuntimeError")
>>>
>>>> - if we can recover from RunTimeError then we should be able to programmatically enter it (Sylvain's error() )
>>>> - it looks like adding Sylvain's resetError()/resetHook() would not be a huge change in the API. That would get him the ability to recover from FatalError.
>>>>
>>>> I would advocate one of two approaches
>>>> a) Peter's diagram minus RunTimeError. An exception during Running goes to FatalError.
>>>> b) Sylvain's diagram without RunTimeError and without ExtraFatalError.
>>>>
>>>> For a), after an exception you've no idea of the state of things, how on earth do you think you might recover. Just call it a day. RunTimeError is an application state, all the others are lifecycle states.
>>>>
>>>> For b), RunTimeError seriously strikes me as application specific. Use an FSM that has error() and recovered() events instead. Again, I see no need for ExtraFatalError. And I still argue that exceptions pretty much anywhere mean you can't really recover, as with a).
>>>>
>>>>
>>> If they are living their own life, applications don't need to export any
>>> states, as they are the only ones interested about them.
>>>
>>
>> Any application will _always_ be deployed in a "container" that is
>> responsible to provide "resource services" (CPU, RAM, communication,...),
>> so no application will ever live completely its own life. The coupling
>> ("exporting of states") with the container _cannot_ be avoided, but should
>> be limited to the "resource services"-relevant states.
>>
>>
>>> The need I
>>> personally see from RuntimeError is to provide a default "degraded"
>>> state for applications that don't want to use events and FSM (none of
>>> ours do), and a default mean of supervision. What I believe is that, if
>>> no default runtime error is provided, then nobody will ever export
>>> "degraded" states by default.
>>>
>>> Moreover, I believe that in most cases, a cleanup procedure can actually
>>> recover from an ill-known state. In my POV (and in our components),
>>> cleanup() does a *lot* of cleanup (not far away from delete/new). I.e.,
>>> it *is* a perfectly possible sequence to try and recover from unhandled
>>> exceptions. It might fail (i.e. throw), but in most cases it will work.
>>>
>>
>> "Cleanup" procedures are very useful to improve the _robustness_ and/or
>> _autonomy_ of a component; they most often have a performance cost. Hence,
>> the system supervisor should be allowed to have a say in this trade-off,
>> meaning that this is another use case where applications should export some
>> of their "internal" state.
>>
>> Herman
>>
>> PS Much of this material is a core research activity in our BRICS
>> project... :-) We have no real _results_ yet, but only a number of
>> insights...
>>
> Well ... I don't pretend to have all the answers, but as someone that is
> trying to do supervision in a number of systems, running three different
> existing functional layers, I do have some experience.

Which we acknowledge and want to profit from! :-)

> Trying to handle most problems "internally" in a component is bound to
> make the components very complex and very hard to debug.

I fully agree. And I am not at all advocating this approach! The best
design is the one that can decouple precisely what is internal knowledge of
the component and what is not. So, it is not a matter of handling "most"
problems internally, but of handling exactly _those_ problems internally
for which that component, and only that component, has the
solution/knowledge/resources/...
And the debugging will become much easier (if and only if!?) if your design has
achieved the above-mentioned decoupling.

> Moreover, when
> a problem does appear (let's say, a file descriptor returning an error),
> you actually *have to* do cleanup to retry opening it.
>
> In other words, in most cases, when something non-nominal happens, you
> do have to pay the price of cleaning up / reinitializing.

I fully agree.

Herman

Default Component states (Was Default exception handling in RTT

Submitted by Sylvain Joyeux on Tue, 2010-05-04 16:28.

Stephen Roderick wrote:
> Thanks for drawing the diagrams up. My 2c worth ...
>
> - Peter's version is far more understandable than Sylvain's
> - The diagrams make it very clear the mix of lifecycle and application states. We personally encode our application states in FSMs explicitly.
>
I don't, because I never had the need for that (state machines stayed
simple). We use the state mainly as a reporting interface, which I think
is an important part of them.

What I want the component "state" to reflect is enough *default*
information to make components supervis-able *by default*. I.e. a sane
representation of what can be done by a component without needing to
learn new things like services, FSM scripts, operations or whatever. The
RTT 1.0 state machine was successful for us in that respect. It only
lacked what to do when exceptions were leaking from user code.
> - I see no reason for the extra "Really Fatal Error" state (and I agree in large part with Herman's comments regarding naming)
>
It is needed because if you assume that cleanup() can return the
application code into a well-defined state (which I do), then you have
to realize that, in some cases, it will fail to do so. In this
particular case, you cannot do anything anymore as your recovery method
failed. Thus, you have to stop trying.
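
As a sketch of why the extra state falls out naturally, assuming cleanup() is the recovery method and may itself throw. The names are hypothetical: `FatalError` here means "stop + cleanup succeeded", and `ExtraFatalError` stands for the "Really Fatal Error" state of the diagram. None of this is existing RTT API.

```cpp
#include <functional>

enum class State { Running, FatalError, ExtraFatalError };

// Hypothetical stand-in for a component with user-supplied hooks.
struct Component {
    State state = State::Running;
    std::function<void()> updateHook = [] {};
    std::function<void()> cleanupHook = [] {};
};

// One cycle: if user code leaks an exception, try to get back to a
// well-defined state through cleanup; if that fails too, stop trying.
inline void step(Component& c) {
    if (c.state != State::Running) return;
    try {
        c.updateHook();
    } catch (...) {
        try {
            c.cleanupHook();
            c.state = State::FatalError;       // cleaned up, can be restarted
        } catch (...) {
            c.state = State::ExtraFatalError;  // recovery method failed
        }
    }
}
```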
> - A fatal error is a fatal error. No way out. End of story. Done. Finished.
> - if we can recover from RunTimeError then we should be able to programmatically enter it (Sylvain's error() )
> - it looks like adding Sylvain's resetError()/resetHook() would not be a huge change in the API. That would get him the ability to recover from FatalError.
>
> I would advocate one of two approaches
> a) Peter's diagram minus RunTimeError. An exception during Running goes to FatalError.
> b) Sylvain's diagram without RunTimeError and without ExtraFatalError.
>
> For a), after an exception you've no idea of the state of things, how on earth do you think you might recover. Just call it a day. RunTimeError is an application state, all the others are lifecycle states.
>
> For b), RunTimeError seriously strikes me as application specific. Use an FSM that has error() and recovered() events instead. Again, I see no need for ExtraFatalError. And I still argue that exceptions pretty much anywhere mean you can't really recover, as with a).
>
You assume in both cases that people are using FSM scripts and events,
which I don't.

Default Component states (Was Default exception handling in RTT

Submitted by fosini on Mon, 2010-05-03 12:40.

On Mon, May 3, 2010 at 1:35 PM, S Roderick <kiwi [dot] net [..] ...> wrote:

> On May 3, 2010, at 06:31 , Peter Soetens wrote:
>
> > To all lurking on this thread, could we have a 'voting' about this ?
>
> 1) Could someone draw up a state chart of what you guys propose? The long
> discussion below is hell of a fragmented ...
>
> > In case of doubt, I follow the user's opinion, but since we only have
> > Sylvain and me arguing, I wonder how much user there is going on here
> > actually :-)
> >
> > Summary:
> >
> > * The RTT 1.x component states are mixing component lifecycle and
> > application states. For example, configureHook() sets up method calls
> > using 'getPeer()' or checks if input (read) ports are connected. It could
> > *also* configure a device or so, but with limited flexibility, since the
> > thread of the component was not running yet. So some configuration could
> > be necessary in updateHook(), for example, if you were talking to a device
> > bus. On the other hand, the component has some clear 'application' error
> > states, like RunTimeError, which a component will only enter if user code
> > instructs it to do so.
> >
> > I wanted to change this in 2.0 to a reduced life cycle, where the only
> > states that a component has are independent of application states. For
> > example, RunTimeError would mean that an exception was leaked in
> > updateHook(). If errorHook() leaked an exception, the component would
> > enter the FatalError state, which is unrecoverable (hence 'fatal'). My
> > main motivation for this was to define what happens when user code throws
> > exceptions. I wanted to have a scheme similar to what happens with
> > program scripts: just because one program is in error does not mean that
> > the whole component should stop. One change contributing to this
> > philosophy is that the thread of a component now always runs.
>
> In my 2.5 years using Orocos, I've never used either an errorHook() or the
> RunTimeError-related stuff. I can see a couple of cases where I should have
> ...
>

Same opinion here (but only a little over a year using Orocos).

>
> > I should have known by looking at historical evidence, but with this
> > change, I stepped on Sylvain's turf.
> >
> > He proposes (correct me if I'm wrong) to keep close to the current
> > application states, at least the RunTimeError state for user errors, and
> > to use FatalError (=stop+cleanup) if user code throws (in any place). If
> > the transition to fatal fails, an unrecoverable-worse-than-fatal state is
> > entered. His reasoning is that supervision needs info on application
> > health for every component, and that this belongs in the interface of
> > every component.
>
> It _looks_ like one or both of you are proposing
>
> exception/error in updateHook() = transition to RunTimeError state = call
> errorHook()
> exception/error in errorHook() = transition to FatalError = call stopHook()
> then cleanupHook().
>
> And RunTimeError is a running state, so scripts and state machines are
> still running (if they aren't the cause of the throw), right? Does
> errorHook() get called as updateHook() would have (ie periodically if in a
> PeriodicActivity)?
>
>
Yes, RunTimeWarning and RunTimeError are substates of the Running state. See
here
http://www.orocos.org/stable/documentation/rtt/v1.10.x/doc-xml/orocos-co...

> How do you get out of errorHook()?
>

By calling warning() or recovered(); see the figure referenced above.
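
To make the 1.x-style substates concrete, here is a rough sketch of how errorHook() can be entered and left again. It is a toy model, not RTT code: `ToyRunning`, `RunState`, `step()` and the `deviceOk` flag are hypothetical names, and the assumption that updateHook() keeps being called while in RunTimeWarning is mine, not something confirmed in this thread.

```cpp
#include <cassert>

// Substates of "Running", modelled loosely on the 1.x state names.
enum class RunState { Running, RunTimeWarning, RunTimeError };

// Toy stand-in for a running component; NOT the RTT TaskContext API.
struct ToyRunning {
    RunState state = RunState::Running;
    bool deviceOk = true;  // hypothetical application-level health flag
    int updates = 0;       // number of updateHook() calls so far
    int errors = 0;        // number of errorHook() calls so far

    // Transition requests issued by user code.
    void warning()   { state = RunState::RunTimeWarning; }
    void error()     { state = RunState::RunTimeError; }
    void recovered() { state = RunState::Running; }

    void updateHook() {
        ++updates;
        if (!deviceOk) error();     // user code decides to enter RunTimeError
    }

    void errorHook() {
        ++errors;
        if (deviceOk) recovered();  // user code decides to leave RunTimeError
    }

    // One activity trigger: errorHook() replaces updateHook() only while
    // the component is in RunTimeError (assumption of this sketch).
    void step() {
        if (state == RunState::RunTimeError) errorHook();
        else updateHook();
    }
};
```

With a periodic activity, `step()` would be invoked once per period, so errorHook() runs at the same rate as updateHook() would have until user code calls recovered() or warning().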

>
> IMHO the state after fatalError() should be STOPPED, not PRE_OPERATIONAL.
> The component is done.
>
> What happens if you get an exception/error in stopHook() or cleanupHook()
> that was called as part of a FatalError? Personally, I would think to catch
> it in FatalError and not continue doing anything. I don't see the point of
> yet another worse-than-fatal-error state.
>
> Sylvain, we do use scripts some, particularly where the corresponding C++
> code is too long, but they do have their issues that limit their usefulness.
>
> We do some potentially blocking driver setup in configure() also.
>
> Peter, what is the difference with v2 w.r.t. the thread being active during
> configureHook(), or being able to send commands prior to start()?
>

> I don't like the idea of a SupervisionInterface. Yet another thing to
> learn, and I see little benefit.
> S
>
> >
> > Please read full details below or in the thread.
> >
> > I don't think the water is that deep between us, most developers use
> > configureHook/errorHook already for application states so I see the
> > point, and no one is complaining... We should mainly focus on user's ease
> > of programming/minimal coding effort, but what do the others think ?
> >
> > Peter
> >
> > On Wednesday 21 April 2010 14:31:17 Sylvain Joyeux wrote:
> >> Peter Soetens wrote:
> >>> On Wednesday 21 April 2010 12:37:09 Sylvain Joyeux wrote:
> >>>> Peter Soetens wrote:
> >>>>>>>> As far as I saw on the RTT2 ExecutionEngine code, exception
> >>>>>>>> handling is as follows (Peter: tell me if I'm wrong):
> >>>>>>>>
> >>>>>>>> * an uncaught exception in updateHook() transitions to
> >>>>>>>>   RUNTIME_ERROR
> >>>>>>>> * an uncaught exception in errorHook() transitions to
> >>>>>>>>   FATAL_ERROR
> >>>>>>>
> >>>>>>> Correct. Fatal error (should) lead to stopHook() + cleanupHook()
> >>>>>>> and then wait for component cleanup/removal.
> >>>>>>>
> >>>>>>>> This assumes that errorHook() is able to handle unspecified
> >>>>>>>> errors. This seems too broad from my POV. How we design components
> >>>>>>>> is that the transition to fatal() is basically a stop() + some
> >>>>>>>> sort of tentative cleanup. runtime_error is used as a runtime
> >>>>>>>> state categorization (i.e. a way to "regroup" internal states),
> >>>>>>>> but its interpretation as an actual error is situation dependent.
> >>>>>>>
> >>>>>>> It is. errorHook() is reserved for an RTT/Component specific error
> >>>>>>> state, not for application error states. So you would need to write
> >>>>>>> your application specific error states in updateHook().
> >>>>>>
> >>>>>> This completely negates the usefulness of the taskcontext state
> >>>>>> machine.
> >>>>>
> >>>>> ... for application specific state machines. From a component life
> >>>>> cycle view, these states are still necessary. I think we did it the
> >>>>> wrong way in 1.x, coupling component lifecycle states with
> >>>>> application states. They may overlap, and we can't/won't prevent
> >>>>> that, but they don't have to overlap.
> >>>>
> >>>> OK, then define a default application state machine (and see the
> >>>> number of states explode). Not having a default application state
> >>>> machine defined negates completely the use of
> >>>
> >>> You have to see this in the light of the state machines in the
> >>> scripting. These are by definition application specific state
> >>> machines. So this made us realize that there is a difference between
> >>> the lifecycle of a component (hooks) and of an application (states in
> >>> scripts).
> >>>
> >>>>>>>> For instance, our motor controllers go into runtime_error when the
> >>>>>>>> motors can't be driven (because of hardware protection mechanisms
> >>>>>>>> for instance), but the electronics still *reads* the encoder +
> >>>>>>>> motor data. I.e. they are read-only when in runtime error and can
> >>>>>>>> be used if only reading is needed. fatal_error would be entered if
> >>>>>>>> we are not able to talk to the electronics anymore.
> >>>>>>>
> >>>>>>> These are all application states and should not be implemented in
> >>>>>>> the component lifecycle states.
> >>>>>>
> >>>>>> I don't follow you in this split between application and lifecycle
> >>>>>> states. The component goes into various states because of the
> >>>>>> application.
> >>>>>
> >>>>> Not from the viewpoint of the RTT or deployer. For example,
> >>>>> configureHook is to check if input ports are connected or if required
> >>>>> services are available. This is independent of a component requiring
> >>>>> additional configuration of parameters.
> >>>>
> >>>> There is definitely a mismatch between our uses of states. We use
> >>>> configureHook() to verify that the component can run. This means:
> >>>> checking if devices are there, if properties are set to sane values
> >>>> and so on. In effect, for device drivers, configureHook() is the place
> >>>> where the device gets accessed and configured.
> >>>
> >>> I agree here. RTT 2.x actually supports this better, since you can also
> >>> send 'commands' (in 2.x: send method calls) to a component before it is
> >>> started. This means that if configuration does some
> >>> blocking/asynchronous work or depends on a script to complete, this can
> >>> all be done before start().
> >>>
> >>>>>>>> In a way, runtime_error is not very useful in this case ...
> >>>>>>>>
> >>>>>>>> To get back to the point: I think that runtime_error should be
> >>>>>>>> used when the component is still able to provide a limited
> >>>>>>>> functionality, fatalError being used when the component does not
> >>>>>>>> provide any functionality anymore. Thus, the default exception
> >>>>>>>> handling of updateHook() should IMO transition to FATAL_ERROR: I
> >>>>>>>> don't see how a component can *know* what it is doing when an
> >>>>>>>> uncaught exception has been raised by updateHook().
> >>>>>>>>
> >>>>>>>> Thoughts ?
> >>>>>>>
> >>>>>>> I think you identified a pain point when trying to apply the
> >>>>>>> component error states to application error states. It will never
> >>>>>>> work. The idea for run time error is that any code in updateHook
> >>>>>>> might throw, even if the user is unaware of this (during
> >>>>>>> development for example). In critical components, you can put safe
> >>>>>>> state code in errorHook(), for example, writing data to ports
> >>>>>>> (which will never throw). If you did a bad job there, you go to
> >>>>>>> fatal error. So all cases are covered. We don't want to go to fatal
> >>>>>>> error immediately because this is an unrecoverable state, meaning,
> >>>>>>> RTT judged that it can no longer execute that *instance* of a
> >>>>>>> component.
> >>>>>>>
> >>>>>>> All the other stuff goes into updateHook and you need to define
> >>>>>>> your own application states in there, using own operations or
> >>>>>>> attributes or so.
> >>>>>>>
> >>>>>>> Makes sense ?
> >>>>>>
> >>>>>> I don't think it does ...
> >>>>>>
> >>>>>> First of all, most components will have nothing in errorHook().
> >>>>>> Thus, you will have a still-running component that has failed in a
> >>>>>> way that was not predicted by the designer.
> >>>>>
> >>>>> We could install a default action in errorHook() ourselves.
> >>>>
> >>>> What could you meaningfully do in errorHook() that is completely
> >>>> generic and will handle the underlying problem (updateHook() threw an
> >>>> exception, the component is not functional anymore)?
> >>>
> >>> Yeah, I wasn't making sense here.
> >>>
> >>>>>> From a more conceptual point of view, I don't think that a component
> >>>>>> should be allowed to keep running once it has had an unexpected
> >>>>>> exception. An exception means "the internal state of the component
> >>>>>> is unspecified as of now". The only thing that could make sense is
> >>>>>> to try to "emergency stop" it, which -- I thought -- is what fatal
> >>>>>> error is there for.
> >>>>>
> >>>>> Agreed, but C++ exceptions do not cause an unspecified state. They
> >>>>> unwind the stack and clean up resources by calling destructors. It's
> >>>>> not the same as a segfault. On the other hand, if your updateHook()
> >>>>> has the scenario port1.write (exception) port2.write, only the first
> >>>>> write will succeed, leading to an inconsistent output. So yes, a
> >>>>> leaked exception is maybe more 'grave' than it is considered now.
> >>>>
> >>>> I don't agree there. An *uncaught* exception means that the component
> >>>> designer was not *expecting* this particular error. If the code is
> >>>> well-written (which is very unlikely), resources will be freed and so
> >>>> on, but from a global logic point of view, the application will *not*
> >>>> know where it was (hey, otherwise it would have caught this
> >>>> exception).
> >>>
> >>> So actually we agree, in the end, the application does not know. So
> >>> this is a problematic state it ends up in.
> >>
> >> Yes, so it makes no sense to remain in a running state (which
> >> RUNTIME_ERROR is).
> >>
> >>>>> Realize that the scripts can also leak exceptions, because they call
> >>>>> user functions too. We also need to define a state when this happens,
> >>>>> I don't want to go into 'unrecoverable error' when this happens. For
> >>>>> a script, this would just cause the 'E'rror status of that script,
> >>>>> while other scripts/updateHook keep running. That's also why runtime
> >>>>> error only relates to an exception in updateHook(), while in runtime
> >>>>> error, the scripts etc keep on executing (unless they reach the Error
> >>>>> status too).
> >>>>>
> >>>>>> Second, the way you define fatal error states makes no sense to me.
> >>>>>> The whole point of having a component model a-la RTT is that it
> >>>>>> should be able to go back to a defined state (in my POV, through the
> >>>>>> fatalError() cycle).
> >>>>>>
> >>>>>> I.e. a "completely unrecoverable" error should only be a diagnostics
> >>>>>> estimation, for instance triggered because a component that went to
> >>>>>> an unspecified fatalError() (fatal-error-that-we-don't-know) refused
> >>>>>> to reconfigure and/or restart, thus showing that the component does
> >>>>>> not know how to recover.
> >>>>>
> >>>>> There is another reason: what if the RTT figures out that it can no
> >>>>> longer execute the component ? Fatal by definition means
> >>>>> unrecoverable, so let's keep it with these semantics. There is no way
> >>>>> to recover from the fatal error state in the current implementation.
> >>>>> So fatal means: unload/kill me please.
> >>>>>
> >>>>> What you describe sounds to me like a run-time error (recoverable),
> >>>>> you can still recover from it.
> >>>>
> >>>> Again, we have a different understanding of the state machine. Yes, it
> >>>> can recover from it, but that will require a
> >>>> stop()/configure()/start() cycle. How the state machine is interpreted
> >>>> by our supervision is:
> >>>>
> >>>> configure: the component verifies that everything it needs to be
> >>>> functional is there. This means: checking property values, port
> >>>> connections, accessing external processes/hardware when applicable.
> >>>> The goal of that step is to make start() as simple as possible, and
> >>>> have it most likely return true (i.e. have the longest and most likely
> >>>> to fail steps that can be done in advance in configure()).
> >>>
> >>> OK.
> >>>
> >>>> start: start the component functionality. I.e. turn on data
> >>>> acquisition for a driver (for instance).
> >>>
> >>> OK.
> >>>
> >>>> runtime_error: the component still provides a somewhat limited
> >>>> functionality. The actual semantic of this is very application
> >>>> dependent. In practice, we use orogen to specialize it into
> >>>> sub-states.
> >>>
> >>> Please define which triggers cause a transition to this state and from
> >>> this state away + which hooks are called.
> >>
> >> Here's the thing: I'm not using scripts, and I do not intend to use
> >> them at all. I do see their possible usefulness, it is just that I did
> >> not (yet) encounter a situation where they were needed.
> >>
> >> So: triggers
> >> * any application-defined situation which means that the component
> >> provides a limited functionality.
> >> * hooks: errorHook(), in the same situations as updateHook() (i.e.
> >> activity triggers)
> >> * getting out of there: component specific and component-decided
> >> * getting out of there: component specific and component-decided
> >>
> >>>> stop: the component stops functioning, either because it reached its
> >>>> stated goal (case for a planner), or because it has been requested
> >>>
> >>> OK.
> >>>
> >>>> fatal: the component cannot provide its stated functionality anymore,
> >>>> and therefore stopped.
> >>>
> >>> so stopHook() is called automatically ? As above, define the triggers +
> >>> possible next states ?
> >>
> >> Triggers: internal component diagnostics which detected a situation
> >> representing a loss of functionality.
> >> Possible next states: STOPPED or PRE_OPERATIONAL (depending on whether
> >> the component needs a configure step).
> >>
> >>>> It should try to clean up as much as possible so
> >>>> that a configure()/start() cycle has a chance to recover from the
> >>>> problem. In the same way as for runtime error, orogen specializes it
> >>>> into substates.
> >>>>
> >>>> This state machine allowed us to keep the updateHook() simple (since
> >>>> it does not have to deal with initialization/recovery/...), and has
> >>>> most of the information needed for supervision.
> >>>>
> >>>>> Maybe we should change it then to these semantics:
> >>>>>
> >>>>> Fatal error: can be entered from any state, triggered by RTT code or
> >>>>> error recovery code. Causes stopHook()->cleanupHook() in transition
> >>>>> (if necessary). Only step left is delete component.
> >>>>>
> >>>>> Runtime error: triggered by exceptions in updateHook() or by user in
> >>>>> updateHook()/script.
> >>>>
> >>>> There is a funny thing: on the one hand you say "raised exceptions
> >>>> should leave the application in a well-defined state" and on the
> >>>> other "if an exception is raised in errorHook() we can't recover ever,
> >>>> we actually need to destroy everything". This seems contradictory to
> >>>> me.
> >>>
> >>> But makes sense to me: if your error recovery throws, it went really
> >>> bad, it means your last resort to pull things right did not succeed.
> >>> There *is* no way out, this *is* fatal, literally as in 'terminal'. No
> >>> transition succeeds.
> >>
> >> Here is my proposal:
> >> * RUNTIME_ERROR remains an application state. The component announces
> >> that it has limited functionality due to something non-nominal
> >> happening.
> >> * unexpected exceptions in running states (RUNNING and RUNTIME_ERROR)
> >> transition to fatal. This calls a fatalHook() which -- by default --
> >> calls stopHook() and cleanupHook(). The component can also transition to
> >> fatal to announce that something non-nominal happened that makes the
> >> component's service not available.
> >> * if fatalHook() and/or stopHook() raise, then we go into the
> >> "unrecoverable fault" (we can't even go into FATAL ...)
> >>
> >>>> As to the interpretation of "fatal": it depends on the point of view.
> >>>> From the point of view of the supervision, the "fatal" I described
> >>>> above *is* fatal as the component does not provide the service it
> >>>> should provide, and that happened because of something non-nominal.
> >>>
> >>> There must be a possibility to resolve these constraints we both have:
> >>>
> >>> 1. Define a transition/state when an exception is thrown in
> >>> updateHook(). The last thing we want to do is call it again, the
> >>> component is possibly in a 'messy' state and it may throw 'ad
> >>> infinitum'.
> >>>
> >>> 2. Define a transition/state when error recovery from point 1 failed as
> >>> well.
> >>>
> >>> 3. Define a transition/state when the RTT can no longer execute a
> >>> component. This might be the same as #2.
> >>>
> >>> From the RTT point of view, these are the things I *need* to define,
> >>> without even caring for application-level supervision. Supervision is a
> >>> fundamental part of every application (ie handle faults the component
> >>> cannot solve by itself) so I am not against adding support for that in
> >>> the TaskContext, on the other hand, you/me are biased and I wonder if
> >>> it's not better to stick to the minimal.
> >>
> >> Yes, but in my opinion a basic application state machine *is* part of
> >> the minimal.
> >>
> >>> A possible clean solution I see here is to define a supervision
> >>> interface that defines these extra states that your supervision
> >>> software requires. So your component inherits TaskContext +
> >>> SuperviseInterface, where the latter sets up a 'supervise' provided
> >>> interface with the methods/states you require.
> >>>
> >>> The supervisor component/user can then query each component if it has
> >>> this interface and proceed from there if it has.
> >>>
> >>> This 'extendability' is actually one of the major issues I wanted to
> >>> solve in 2.x. The component itself has only a minimal life cycle
> >>> interface and the rest is set into 'plugins'/'interfaces'.
> >>
> >> While I see why you want that (you are the RTT-as-a-universal-framework
> >> guy), I do see a lot of practical issues. The biggest issue being that
> >> you will start to completely fragment what components can run on what
> >> tools and make the whole "RTT ecosystem" (for lack of a better name) a
> >> huge mess.
> >>
> >> We're having that discussion *because* I want to avoid this. I could
> >> live on with Roby and oroGen: they already provide all the tools I need
> >> to "work around" the state machine you define to get what I want. We're
> >> having that discussion because I think it would be a very bad idea.
> >>
> >> So, yes, being able to extend is important. Now, I feel that the RTT
> >> *must* provide a basic standard, supervise-able, interface to ALL RTT
> >> components. And -- more importantly -- should make the component
> >> developer aware that this interface is important.
> >>
> > --
> > Orocos-Dev mailing list
> > Orocos-Dev [..] ...
> > http://lists.mech.kuleuven.be/mailman/listinfo/orocos-dev
>
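
As a rough illustration of Sylvain's proposal quoted above (unexpected exceptions in the running states transition to fatal through a fatalHook() that by default calls stopHook() and cleanupHook(), with a further "unrecoverable" state if that also throws), consider the sketch below. It is a toy model under stated assumptions, not RTT code: `ToyTask`, `State`, `step()`, `enterFatal()` and the two test components are hypothetical names, and only the hook names are taken from the thread.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical states for the proposal; "Unrecoverable" stands for the
// "unrecoverable fault" entered when even the fatal transition fails.
enum class State { Running, RuntimeError, Fatal, Unrecoverable };

// Toy stand-in for a component; NOT the RTT TaskContext API.
struct ToyTask {
    State state = State::Running;
    bool stopped = false;
    bool cleaned = false;

    virtual ~ToyTask() = default;
    virtual void updateHook() {}
    virtual void errorHook() {}
    virtual void stopHook() { stopped = true; }
    virtual void cleanupHook() { cleaned = true; }
    // Default behaviour per the proposal: stop, then clean up.
    virtual void fatalHook() { stopHook(); cleanupHook(); }

    // One activity trigger. Both running states call their hook, and any
    // leaked exception leads to the fatal transition.
    void step() {
        if (state != State::Running && state != State::RuntimeError) return;
        try {
            if (state == State::Running) updateHook();
            else errorHook();
        } catch (...) {
            enterFatal();
        }
    }

    // The fatal transition itself may fail, which is the only way to
    // reach Unrecoverable in this sketch.
    void enterFatal() {
        try {
            fatalHook();
            state = State::Fatal;
        } catch (...) {
            state = State::Unrecoverable;
        }
    }
};

// updateHook() leaks an exception; the default fatalHook() succeeds.
struct ThrowingUpdate : ToyTask {
    void updateHook() override { throw std::runtime_error("update failed"); }
};

// updateHook() leaks an exception and stopHook() fails as well.
struct ThrowingStop : ThrowingUpdate {
    void stopHook() override { throw std::runtime_error("stop failed"); }
};
```

Whether Fatal should then lead to STOPPED or PRE_OPERATIONAL, as discussed in the thread, is deliberately left out of the sketch.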