synch event handling

Dear All,

The post about "messages" was turning into a debate about event handling, so I have created this separate post.
As I stated on the wiki page [Contribute! Which weakness have you detected in RTT?], I see a serious issue in the current implementation of synchronous event handling.

We all agree (I guess) that, from a theoretical point of view, something like an "immediate" reaction to an event must be present in RTT.
On the other hand, if the event handler is executed in the thread of the component which emitted the event (TaskA), it is easy to demonstrate that we are very vulnerable to catastrophic errors (and it is impossible to prevent such a situation in the user's code!!!).
The only solution I can see is that the subscriber to the event (TaskB) creates an extra thread responsible for the execution of synchronous handlers.
I know, I know!! It sounds non-optimal, resource-consuming and "dirty", but I insist that the current implementation is much less safe.
I hope you can help find a better solution.
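
A minimal sketch of the proposal above, in plain standard C++ rather than the RTT API. The names Event, HandlerThread and handler_runs_off_emitter_thread are hypothetical and exist only for this illustration. The callback that TaskB registers with the event only enqueues the value, so the emitter (TaskA) never executes the subscriber's handler code itself.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event; this is NOT the RTT Event class.
class Event {
public:
    using Handler = std::function<void(int)>;
    void connect(Handler h) { handlers_.push_back(std::move(h)); }
    // Synchronous emission: every connected callback runs in the emitter's thread.
    void emit(int value) { for (auto& h : handlers_) h(value); }
private:
    std::vector<Handler> handlers_;
};

// Subscriber-side worker (TaskB): owns the thread that runs the real handler.
class HandlerThread {
public:
    explicit HandlerThread(Event::Handler h)
        : handler_(std::move(h)), worker_([this] { run(); }) {}
    ~HandlerThread() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();  // note: blocks forever if the handler never returns
    }
    // Cheap and bounded: this is all the emitter's thread ever executes.
    void post(int value) {
        { std::lock_guard<std::mutex> lock(m_); pending_.push(value); }
        cv_.notify_one();
    }
private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
            if (pending_.empty()) return;  // shutdown requested, queue drained
            int v = pending_.front();
            pending_.pop();
            lock.unlock();
            handler_(v);                   // user code runs in TaskB's thread
            lock.lock();
        }
    }
    Event::Handler handler_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> pending_;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

// Returns true if the handler ran on a thread other than the emitter's.
bool handler_runs_off_emitter_thread() {
    std::thread::id handler_thread;
    Event reached_position;
    {
        HandlerThread task_b([&](int) { handler_thread = std::this_thread::get_id(); });
        reached_position.connect([&](int v) { task_b.post(v); });
        reached_position.emit(42);
    }  // the destructor drains the queue and joins the worker
    return handler_thread != std::thread::id() &&
           handler_thread != std::this_thread::get_id();
}
```

With this arrangement an infinite loop in the handler stalls only TaskB's worker thread. The cost is one thread per subscriber plus a lock on every emission, which may not be acceptable in a hard real-time path.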

Davide

synch event handling

OK, I was too optimistic thinking that a detailed explanation was not needed...

This is the weakness I see.

Herman wrote:

That means that it is executing code that its writer doesn't know because it is part of the callee's code.

This is indeed a huge problem: think about it.
You put in a lab a robot full of state-of-the-art Orocos components: motor controllers, global controller with impedance control, obstacle avoidance, human detection for safety...
You give the robot to a student for experiments thinking "it can't go wrong".
But the young researcher subscribes synchronously to some event, which maybe was just a warning or a "reached position".
Since there is a bug in the handler (an infinite loop), vital components such as impedance control and safety are blocked.
As a result... robot broken or, even worse, people injured.
Is this the future of Orocos?

Sander wrote:

We also should think about this kind of risk when it comes to methods, right?

Wrong! When you call a method, you know what you are doing and the risk you take. In my example, the software of the "newbie student" is the one that calls the provided methods. He knows what he is executing! In the worst case it is the buggy thread that is blocked, not the good one.
With synch events, the "good" component is blocked by the "bad" component, and there is no way to foresee, prevent or avoid it from the good component's point of view.

The same philosophy is applied to single process.
There is currently an obsession with single-process deployment in RTT. Deployment with Corba is considered an optional feature that, so far, wasn't that important. "Of course we have Corba, but forget about it, use a single process"

Herman wrote:

If so, this is not a real problem for collocated caller and callee: if the former "crashes", the latter crashes too, since they are both in the same process space.

No real problem? It is a terrible problem! Once more: even the strongest and most robust code in the world can be corrupted by an attached component with some bugs!
Considering the goal of Orocos (and thinking about BRICS) I think that we need to put security first.

Herman wrote:

if both are not in the same process space, you can _only_ use asynchronous events anyway; the 'synchronous' part is then nothing else but registering the fact that an event was emitted by the caller, and then forwarding this to the "middleware". That middleware will have to have some (configurable) policy about what to do with such "remote event handling" errors.

Absolutely right! I think that synchronous event handling should be replaced with an asynchronous handler with a mechanism of prioritized execution (best effort to execute the handler as soon as possible).
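
One way to read "asynchronous with prioritized execution" is sketched below in plain standard C++. PriorityDispatcher and demo_order are hypothetical names, not RTT classes, and the priority scale (larger number runs first) is an assumption made for the example. The emitter only enqueues; the subscriber drains the queue from its own thread, most urgent callback first.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical dispatcher: pending callbacks are ordered by priority.
class PriorityDispatcher {
public:
    using Callback = std::function<void()>;

    // Called from the emitter's thread: enqueues only, never runs user code.
    void post(int priority, Callback cb) {
        std::lock_guard<std::mutex> lock(m_);
        queue_.push(Item{priority, next_seq_++, std::move(cb)});
    }

    // Called from the subscriber's own thread; returns how many callbacks ran.
    std::size_t run_pending() {
        std::size_t count = 0;
        for (;;) {
            Callback cb;
            {
                std::lock_guard<std::mutex> lock(m_);
                if (queue_.empty()) return count;
                cb = queue_.top().cb;  // top() is const, so copy the callback
                queue_.pop();
            }
            cb();                      // executed without holding the lock
            ++count;
        }
    }

private:
    struct Item {
        int priority;
        std::uint64_t seq;
        Callback cb;
        // std::priority_queue is a max-heap: higher priority comes out first,
        // and among equal priorities the earlier post comes out first.
        bool operator<(const Item& o) const {
            if (priority != o.priority) return priority < o.priority;
            return seq > o.seq;
        }
    };
    std::mutex m_;
    std::priority_queue<Item> queue_;
    std::uint64_t next_seq_ = 0;
};

// Posts three callbacks out of order and returns the order in which they ran.
std::vector<int> demo_order() {
    PriorityDispatcher d;
    std::vector<int> order;
    d.post(1, [&] { order.push_back(1); });
    d.post(9, [&] { order.push_back(9); });
    d.post(5, [&] { order.push_back(5); });
    d.run_pending();
    return order;
}
```

This gives "as soon as possible" only in the best-effort sense: the latency of a handler depends on when the subscriber's thread next calls run_pending().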

Davide

synch event handling

On Monday 27 April 2009 12:21:50 faconti [..] ... wrote:
> ok, I was too optimistic thinking that a detailed explanation was not
> needed...
>
> This is the weakness I see.
>
> Herman wrote:
>> That means that it is executing code that its writer doesn't
>> know because it is part of the callee's code.
>
> This is indeed a huge problem: think about it.
> You put in a lab a robot full of state-of-the-art Orocos components:
> motor controllers, global controller with impedance control, obstacle
> avoidance, human detection for safety... You give the robot to a student
> for experiments thinking "it can't go wrong". But the young researcher
> subscribes synchronously to some event, which maybe was just a warning or a
> "reached position". Since there is a bug in the handler (an infinite loop),
> vital components such as impedance control and safety are blocked. As
> a result... robot broken or, even worse, people injured.
> Is this the future of Orocos?

Safety has been of a lesser priority because of the 'system level' aspect of
it (as Herman puts it). For example: we don't provide a WatchDog component,
you should if your application requires it. Orocos provides all (?) the tools
to create advanced and intelligent watchdogs [1].
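
As an illustration of the watchdog idea, here is a minimal heartbeat sketch in plain standard C++. The Watchdog class is hypothetical and is not part of RTT or OCL. The monitored component calls kick() on every healthy cycle, and a supervisor polls expired(). The time arguments default to the real clock and are parameters only so that the logic can be checked deterministically.

```cpp
#include <chrono>
#include <mutex>

// Hypothetical heartbeat watchdog; not an existing RTT or OCL component.
class Watchdog {
public:
    using Clock = std::chrono::steady_clock;

    explicit Watchdog(Clock::duration timeout,
                      Clock::time_point now = Clock::now())
        : timeout_(timeout), last_kick_(now) {}

    // Called by the monitored component on every healthy cycle.
    void kick(Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> lock(m_);
        last_kick_ = now;
    }

    // Called by the supervisor: true if no kick arrived within the timeout.
    bool expired(Clock::time_point now = Clock::now()) const {
        std::lock_guard<std::mutex> lock(m_);
        return now - last_kick_ > timeout_;
    }

private:
    const Clock::duration timeout_;
    mutable std::mutex m_;
    Clock::time_point last_kick_;
};
```

What the supervisor does on expiry (stop the drives, switch to a safe controller, log and continue) is an application-level decision that this sketch leaves open.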

BUT, from a component level view, there's also a safety aspect which should
guarantee that your 'server' thread does not go out for lunch because a
'client' process did something stupid: the rule has always been: don't trust
the client. The student example is perfect for this: the student writes a
faulty client application, it fails, the component should survive it.

The question is how far the RTT should go to protect you. As Herman puts it,
throwing in another thread in user code will always work, but if this becomes
a pattern, there's clear evidence that users (like in Sander's example) are
working around Orocos deficiencies. Not fixing these is plain stupid and indeed
hurts users or people in many ways.

Since Events are at the root of this discussion, the solution clearly lies in
what Events will evolve to in RTT 2.0. Maybe an event publisher will be able
to choose if syn/asyn callbacks are allowed or not, or everything can be
expressed as messages. I honestly don't know yet. What *is* clear, is that
the 'local' safety concerns lie with the server component and not the client,
and thus should be specified there.
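
To make the "publisher chooses" option concrete, here is a small sketch in plain standard C++. CallbackPolicy and PolicyEvent are hypothetical names and say nothing about what RTT 2.0 will actually provide. Whether the policy value comes from the publisher's code or from a deployment-time configuration is deliberately left open; here it is simply a constructor argument.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical policy switch; not an RTT type.
enum class CallbackPolicy { SyncAllowed, AsyncOnly };

class PolicyEvent {
public:
    using Handler = std::function<void(int)>;
    using Deferred = std::function<void()>;

    explicit PolicyEvent(CallbackPolicy policy) : policy_(policy) {}

    // Refuses the subscription when the policy forbids synchronous callbacks.
    bool connect_sync(Handler h) {
        if (policy_ == CallbackPolicy::AsyncOnly) return false;
        sync_.push_back(std::move(h));
        return true;
    }

    void connect_async(Handler h) { async_.push_back(std::move(h)); }

    // Runs synchronous handlers immediately (in the emitter's thread) and
    // returns the asynchronous ones as deferred calls, to be executed later
    // by whatever thread the subscribers own.
    std::vector<Deferred> emit(int value) {
        for (auto& h : sync_) h(value);
        std::vector<Deferred> deferred;
        for (auto& h : async_) deferred.push_back([h, value] { h(value); });
        return deferred;
    }

private:
    CallbackPolicy policy_;
    std::vector<Handler> sync_;
    std::vector<Handler> async_;
};
```

A safety-relevant publisher constructed with CallbackPolicy::AsyncOnly can then never be blocked by a subscriber's handler, at the price of losing the immediate reaction.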

Peter

[1] I'm not voting against an OCL::WatchDog component. I believe it's required
to solve global/system level problems and that providing a basic one would
help users getting started.

synch event handling

On Mon, 27 Apr 2009, Peter Soetens wrote:

> On Monday 27 April 2009 12:21:50 faconti [..] ... wrote:
>> ok, I was too optimistic thinking that a detailed explanation was not
>> needed...
>>
>> This is the weakness I see.
>>
>> Herman wrote:
>>> That means that it is executing code that its writer doesn't
>>> know because it is part of the callee's code.
>>
>> This is indeed a huge problem: think about it.
>> You put in a lab a robot full of state-of-the-art Orocos components:
>> motor controllers, global controller with impedance control, obstacle
>> avoidance, human detection for safety... You give the robot to a student
>> for experiments thinking "it can't go wrong". But the young researcher
>> subscribes synchronously to some event, which maybe was just a warning or a
>> "reached position". Since there is a bug in the handler (an infinite loop),
>> vital components such as impedance control and safety are blocked. As
>> a result... robot broken or, even worse, people injured.
>> Is this the future of Orocos?
>
> Safety has been of a lesser priority because of the 'system level' aspect of
> it (as Herman puts it). For example: we don't provide a WatchDog component,
> you should if your application requires it. Orocos provides all (?) the tools
> to create advanced and intelligent watchdogs [1].
>
> BUT, from a component level view, there's also a safety aspect which should
> guarantee that your 'server' thread does not go out for lunch because a
> 'client' process did something stupid: the rule has always been: don't trust
> the client. The student example is perfect for this: the student writes a
> faulty client application, it fails, the component should survive it.

This corroborates what I wanted to express too: safety is to a large extent
the result of a good architecture (on top of bug free components, of
course).

> The question is how far the RTT should go to protect you. As Herman puts it,
> throwing in another thread in user code will always work, but if this becomes
> a pattern, there's clear evidence that users (like in Sander's example) are
> working around Orocos deficiencies. Not fixing these is plain stupid and
> indeed hurts users or people in many ways.

I don't think that Sander is "working around Orocos deficiencies"! His use
case is just not so "distributed" and "asynchronous" as the use cases of
many other people on this list...

> Since Events are at the root of this discussion, the solution clearly lies in
> what Events will evolve to in RTT 2.0. Maybe an event publisher will be able
> to choose if syn/asyn callbacks are allowed or not,

This is not a scalable solution: the publisher should not be forced to make
decisions for the whole _system_! Whatever RTT provides, there will always
be a need for deciding about such policies at "deployment time" (and even
later, if possible...).

> or everything can be
> expressed as messages. I honestly don't know yet. What *is* clear, is that
> the 'local' safety concerns lie with the server component and not the client,
> and thus should be specified there.
>
> Peter
>
> [1] I'm not voting against an OCL::WatchDog component. I believe it's required
> to solve global/system level problems and that providing a basic one would
> help users getting started.

I support this! :-)

Herman

synch event handling

> This is the weakness I see.
>

> Herman wrote:
>> That means that it is executing code that its writer doesn't
>> know because it is part of the callee's code.
>
> This is indeed a huge problem: think about it.
> You put in a lab a robot full of state-of-the-art Orocos components:
> motor controllers, global controller with impedance control, obstacle
> avoidance, human detection for safety...
> You give the robot to a student for experiments thinking "it can't go
> wrong".

Talk about laughing in the face of danger. I don't know _any_ embedded
software developer who throws his/her newly crafted software on an
expensive machine. We usually use a test rig or machine emulator for
that.

> But the young researcher subscribes synchronously to some event, which
> maybe was just a warning or a "reached position".
> Since there is a bug in the handler (an infinite loop), vital
> components such as impedance control and safety are blocked.
> As a result... robot broken or, even worse, people injured.
> Is this the future of Orocos?

You _should not_ implement safety in your robot controller. Safety
requires very specific coding rules, including not using C++. This
pretty much rules out Orocos for safety.

>

> Sander wrote:
>> We also should think about this kind of risk when it comes
>> to methods, right?
>
> Wrong! When you call a method, you know what you are doing and the risk
> you take. In my example, the software of the "newbie student" is the one
> that calls the provided methods. He knows what he is executing! In the
> worst case it is the buggy thread that is blocked, not the good one.
> With synch events, the "good" component is blocked by the "bad" component,
> and there is no way to foresee, prevent or avoid it from the good
> component's point of view.

Right. You are talking about superficial bugs; a while loop
could (should) be spotted by static debugging. I'm talking about less
obvious errors. If a method or synchronous event callback (as to date
they are both called in the caller's thread) alters/reads private data
from a component (running in another thread) there is a risk that
either the component or the caller ends up with inconsistent data. This can
lead to unpredictable behavior of your robot/machine.
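
The inconsistent-data risk can be sketched in plain standard C++ as follows. Pose, Component and snapshots_are_consistent are hypothetical names, and the invariant x == y is invented purely so that a torn read would be detectable. The component guards its private data with a mutex and hands out copies, so a callback running in a foreign thread never sees a half-written state.

```cpp
#include <mutex>
#include <thread>

// Hypothetical private data; in this demo x and y must always be equal.
struct Pose {
    double x = 0;
    double y = 0;
};

class Component {
public:
    // Runs in the component's own thread.
    void update(double v) {
        std::lock_guard<std::mutex> lock(m_);
        pose_.x = v;
        pose_.y = v;  // without the lock, a reader could observe x != y here
    }
    // Safe to call from any thread, e.g. from a callback executed
    // in the caller's thread.
    Pose snapshot() const {
        std::lock_guard<std::mutex> lock(m_);
        return pose_;
    }
private:
    mutable std::mutex m_;
    Pose pose_;
};

// Reads snapshots while another thread keeps updating; returns true if
// every snapshot satisfied the invariant.
bool snapshots_are_consistent() {
    Component c;
    std::thread owner([&] {
        for (int i = 1; i <= 100000; ++i) c.update(i);
    });
    bool ok = true;
    for (int i = 0; i < 100000; ++i) {
        Pose p = c.snapshot();
        if (p.x != p.y) ok = false;
    }
    owner.join();
    return ok;
}
```

A plain mutex is used here for brevity. In a real-time component it can introduce priority inversion, so a lock-free exchange of the data may be preferable there.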

> The same philosophy is applied to single process.
> There is currently an obsession with single-process deployment in RTT.
> Deployment with Corba is considered an optional feature that, so far,
> wasn't that important. "Of course we have Corba, but forget about it, use
> a single process"
>
> Herman wrote:
>> If so, this is not a real problem for collocated caller and
>> callee: if the former "crashes", the latter crashes too, since they are
>> both in the same process space.
>
> No real problem? It is a terrible problem! Once more: even the
> strongest and most robust code in the world can be corrupted by an
> attached component with some bugs!
> Considering the goal of Orocos (and thinking about BRICS) I think that we
> need to put security first.
>
> Herman wrote:
>> if both are not in the same process space, you can _only_
>> use asynchronous events anyway; the 'synchronous' part is then nothing
>> else but registering the fact that an event was emitted by the caller, and
>> then forwarding this to the "middleware". That middleware will have to
>> have some (configurable) policy about what to do with such "remote event
>> handling" errors.
>
> Absolutely right! I think that synchronous event handling should be
> replaced with an asynchronous handler with a mechanism of prioritized
> execution (best effort to execute the handler as soon as possible).
>
> Davide

synch event handling

On Mon, 27 Apr 2009, Vandenbroucke Sander wrote:

[...]
> You _should not_ implement safety in your robot controller. Safety
> requires very specific coding rules, including not using C++. This
> pretty much rules out Orocos for safety.

That's a strong, stimulating remark... Any references available to back up
this claim? :-)

I thought "safety" (at least in its legal incarnation) had everything to do
with being certified, so do you mean C++ (compilers) have not been
certified by any organisation? (Which I think is indeed the case...)

Herman

synch event handling

On Apr 27, 2009, at 08:08 , Herman Bruyninckx wrote:

> On Mon, 27 Apr 2009, Vandenbroucke Sander wrote:
>
> [...]
>> You _should not_ implement safety in your robot controller. Safety
>> requires very specific coding rules, including not using C++. This
>> pretty much rules out Orocos for safety.
>
> That's a strong, stimulating remark... Any references available to
> back up
> this claim? :-)

Agreed. Safety is a system-wide concern, which involves so much more
than just coding rules and language selection. And hell yes, I will be
implementing some of it in the robot controller itself! Personally, I
am comfortable running Orocos in safety-critical situations (at least
as much as my own code) - including a multi-million dollar, flight-
rated robot arm for space. YMMV

S

synch event handling

On Mon, 27 Apr 2009, S Roderick wrote:

[...]
> Agreed. Safety is a system-wide concern, which involves so much more
> than just coding rules and language selection. And hell yes, I will be
> implementing some of it in the robot controller itself! Personally, I
> am comfortable running Orocos in safety-critical situations (at least
> as much as my own code) - including a multi-million dollar, flight-
> rated robot arm for space. YMMV

Any more detailed information (pictures, URLs, ...) about this intriguingly
expensive robot? :-)

Herman

synch event handling

> Agreed. Safety is a system-wide concern, which involves so much more
> than just coding rules and language selection. And hell yes, I will be
> implementing some of it in the robot controller itself! Personally, I
> am comfortable running Orocos in safety-critical situations (at least
> as much as my own code) - including a multi-million dollar, flight-
> rated robot arm for space. YMMV

I think that we should take a step backward and decide what "safety"
means, how mandatory it is to implement it in Orocos, and at
which level.
I will prepare (if the community doesn't mind) a draft of my incomplete
point of view on the subject, hoping I will get as much feedback as
I did today.
I think that once we agree on such a philosophical question, the
issue of event handling will be resolved coherently and will be less
biased.

synch event handling

On Mon, 27 Apr 2009, Davide Faconti wrote:

>> Agreed. Safety is a system-wide concern, which involves so much more
>> than just coding rules and language selection. And hell yes, I will be
>> implementing some of it in the robot controller itself! Personally, I
>> am comfortable running Orocos in safety-critical situations (at least
>> as much as my own code) - including a multi-million dollar, flight-
>> rated robot arm for space. YMMV
>
> I think that we should take a step backward and decide what "safety"
> means, how mandatory it is to implement it in Orocos, and at
> which level.
> I will prepare (if the community doesn't mind) a draft of my incomplete
> point of view on the subject, hoping I will get as much feedback as
> I did today.

Thanks! That would be a _very_ interesting document!

> I think that once we agree on such a philosophical question, the
> issue of event handling will be resolved coherently and will be less
> biased.

What bias are you referring to...? My "bias" of wanting to support both sync
and async programming primitives? :-)

Herman

synch event handling

On Mon, 27 Apr 2009, faconti [..] ... wrote:

> ok, I was too optimistic thinking that a detailed explanation was not needed...
>
> This is the weakness I see.
>

> Herman wrote:
>> That means that it is executing code that its writer
>> doesn't know because it is part of the callee's code.
>
> This is indeed a huge problem: think about it.
> You put in a lab a robot full of state-of-the-art Orocos components: motor controllers, global controller with impedance control, obstacle avoidance, human detection for safety...
> You give the robot to a student for experiments thinking "it can't go wrong".

Oops, why on earth would you _ever_ tell that to a student? You would
not be doing him/her a service by even pretending that such "failure proof"
software systems exist...

> But the young researcher subscribes synchronously to some event, which
> maybe was just a warning or a "reached position".
> Since there is a bug in the handler (an infinite loop), vital
> components such as impedance control and safety are blocked.
> As a result... robot broken or, even worse, people injured.
> Is this the future of Orocos?

This is _completely_ a matter of the faulty user application design! (You
mention yourself the _real_ cause of the problem: a bug in the handler...)

>

> Sander wrote:
>> We also should think about this kind of risk when it comes
>> to methods, right?
>
> Wrong! When you call a method, you know what you are doing and the risk
> you take.

No, not really. Why couldn't there be a bug in the method call that blocks
your caller forever?

> In my example, the software of the "newbie student" is the one
> that calls the provided methods. He knows what he is executing! In the
> worst case it is the buggy thread the one blocked, not the good one.

That's all a matter of good design architecture! If you don't trust the
called method or event, you should _always_ "sacrifice" a thread for
interacting with it, and provide your (_application level_) error detection
and recovery.

> With synch events, the "good" component is blocked by the "bad"
> component, and there is no way to foresee, prevent or avoid it from the
> good component's point of view.

You can _never_ "foresee, prevent or avoid" being blocked by another
"activity" (via any form of synchronous communication) _unless_ you take
your own precautions, and dedicate a thread to it.
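
The "sacrifice a thread" pattern with application-level error detection could look like the sketch below, in plain standard C++ (C++14 or later). The function name call_with_timeout is hypothetical, and a timeout is only one possible detection policy. The untrusted call runs in its own thread while the caller waits for at most a fixed time.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Runs `untrusted` in a dedicated thread. Returns true only if it finished
// within `timeout` and did not throw.
bool call_with_timeout(std::function<void()> untrusted,
                       std::chrono::milliseconds timeout) {
    // The promise is shared so it outlives this function if the call hangs.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    std::thread([done, untrusted = std::move(untrusted)] {
        try {
            untrusted();
            done->set_value();
        } catch (...) {
            done->set_exception(std::current_exception());
        }
    }).detach();  // the sacrificed thread: it may never return

    if (finished.wait_for(timeout) != std::future_status::ready) {
        return false;  // blocked too long: application-level recovery goes here
    }
    try {
        finished.get();  // rethrows if the untrusted code threw
        return true;
    } catch (...) {
        return false;
    }
}
```

The caller regains control after the timeout, but the stuck thread and whatever resources it holds are not reclaimed. Standard C++ offers no safe way to kill a thread, so recovery has to be designed at the application level.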

> The same philosophy is applied to single process.

Not really: the single-process case can make things even worse, because
another thread in your address space could crash your whole process...

> There is currently an obsession with single-process deployment in RTT.
> Deployment with Corba is considered an optional feature that, so far,
> wasn't that important. "Of course we have Corba, but forget about it, use
> a single process"

You are totally pulling things out of their real perspective! Orocos has
always been trying to serve both use cases! And none of them was a "second
order client". (I am not claiming that the _implementations_ for both use
cases are equally mature, but that's, I think, not the topic of this
thread.)

>

> Herman wrote:
>> If so, this is not a real problem for collocated caller and
>> callee: if the former "crashes", the latter crashes too, since they are
>> both in the same process space.
>
> No real problem? It is a terrible problem! Once more: even the
> strongest and most robust code in the world can be corrupted by an
> attached component with some bugs!

Absolutely, we agree on that! What your quote no longer shows is my point:
the way you deal with (a)synchronous interaction is not the problem when
there is a _far_ more important one, caused by an erroneous component
implementation...

> Considering the goal of Orocos (and thinking about BRICS) I think that we
> need to put security first.

Security is a _system-level_ property of a software system, depending
heavily on choosing an appropriate "robust" architecture for the application.
RTT is fully architecture-independent, so it does offer features that can
be misused. So what...? You should train your application developers
better...

>

> Herman wrote:
>> if both are not in the same process space, you can _only_
>> use asynchronous events anyway; the 'synchronous' part is then nothing
>> else but registering the fact that an event was emitted by the caller,
>> and then forwarding this to the "middleware". That middleware will have
>> to have some (configurable) policy about what to do with such "remote
>> event handling" errors.
>
> Absolutely right! I think that synchronous event handling should be
> replaced with an asynchronous handler with a mechanism of prioritized
> execution (best effort to execute the handler as soon as possible).

That is just _one_ possible _configuration_ that RTT should offer. It
definitely is not _the_ configuration that serves all use cases...

Herman

synch event handling

On Mon, 27 Apr 2009, faconti [..] ... wrote:

> As I have stated in the wiki page [Contribute! Which weakness have you
> detected in RTT?], I have seen a serious issue in the current
> implementation of synchronous event handling.
> We all agree (I guess) that, from a theoretical point of view,
> something like an "immediate" reaction to an event must be present in
> RTT.
> On the other hand, if the event handler is executed in the thread of the
> component which emitted the event (TaskA), it is easy to demonstrate that
> we are very vulnerable to catastrophic errors (and it is impossible to
> prevent such a situation in the user's code!!!).
> The only solution I can see is that the subscriber to the event (TaskB)
> creates an extra thread responsible for the execution of synchronous
> handlers.
> I know, I know!! It sounds non-optimal, resource-consuming and "dirty",
> but I insist that the current implementation is much less safe.
> I hope you can help find a better solution.

We don't need a better solution :-) For the following reasons:
- synchronous event handling is, indeed, done in the thread of the caller.
  That means that it is executing code that its writer doesn't know because
  it is part of the callee's code.
- I assume that the vulnerability to catastrophic errors you refer to occurs
  when the "communication" between caller and callee breaks down in one way
  or another during the synchronous event handling?
  If so, this is not a real problem for collocated caller and callee: if
  the former "crashes", the latter crashes too, since they are both in the
  same process space. And if both are not in the same process space, you
  can _only_ use asynchronous events anyway; the 'synchronous' part is then
  nothing else but registering the fact that an event was emitted by the
  caller, and then forwarding this to the "middleware". That middleware
  will have to have some (configurable) policy about what to do with such
  "remote event handling" errors.
- what you suggest _is_ the standard pattern to deal with asynchronous
  messages (of whatever kind): having a separate thread responsible for the
  handling. So, I expect every good middleware to support this pattern.

Herman