Delaying select() call within state machine

Submitted by snrkiwi on Tue, 2010-09-28 14:00

RTT-dev

We are re-architecting our socket-related components, and have hit an interesting situation. Within a method called during a state machine's run section, we have a unix select() call with a timeout. During this call (with nothing connecting, so that this component blocks), this component plus all _other_ components completely stall. All components are periodic.

The reason this occurs is that the blocking component is ORO_SCHED_OTHER, and it blocks all other ORO_SCHED_OTHER components. This makes sense from the O/S scheduler's point of view, but it has some "interesting" side effects on systems with multiple non-realtime components (e.g. logging, other socket components).

As we haven't seen this before (we previously put the blocking call in updateHook() inside a non-periodic activity), is this behavior related to the select() being within a method called by the state machine, or just to having a long-duration blocking call in a periodic activity? Or have we simply not noticed this behavior previously?
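
For readers who want to reproduce the blocking part in isolation, here is a minimal sketch in Python rather than Orocos C++ (Python's select.select() wraps the same Unix call). The helper name wait_for_connection and the 0.5 s timeout are assumptions made for illustration only; nothing here comes from the poster's components. It shows that, with nobody connecting, the caller is held inside select() for the whole timeout.

```python
import select
import socket
import time


def wait_for_connection(listener, timeout_s):
    """Block in select() until a client connects or timeout_s elapses."""
    readable, _, _ = select.select([listener], [], [], timeout_s)
    return bool(readable)


# A listening socket that nobody connects to, as in the scenario described.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)

start = time.monotonic()
connected = wait_for_connection(listener, timeout_s=0.5)
elapsed = time.monotonic() - start
listener.close()

print(f"connected={connected}, caller was blocked for {elapsed:.2f} s")
```

Anything else that shares the calling thread waits for the same duration, which is the situation discussed in the replies.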

Demonstrated with Orocos v1 on macosx and gnulinux.

Something to keep in mind ...
S

Delaying select() call within state machine

Submitted by peter on Tue, 2010-09-28 19:16.

On Tuesday 28 September 2010 14:58:15 S Roderick wrote:
> We are re-architecting our socket-related components, and have hit an
> interesting situation. Within a method called during a state machine's run
> section, we have a unix select() call with a timeout. During this call
> (with nothing connecting, so that this component blocks), this component
> plus all _other_ components completely stall. All components are periodic.
>
> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
> the O/S scheduler's point of view, but it has some "interesting" side
> effects on systems with multiple non-realtime components (e.g. logging,
> other socket components).

If you used the PeriodicActivity class, you get what it advertises. Use the
1.10's Activity class to have independent periodic threads.

Peter

Delaying select() call within state machine

Submitted by snrkiwi on Wed, 2010-09-29 11:56.

On Sep 28, 2010, at 15:14 , Peter Soetens wrote:

> On Tuesday 28 September 2010 14:58:15 S Roderick wrote:
>> We are re-architecting our socket-related components, and have hit an
>> interesting situation. Within a method called during a state machine's run
>> section, we have a unix select() call with a timeout. During this call
>> (with nothing connecting, so that this component blocks), this component
>> plus all _other_ components completely stall. All components are periodic.
>>
>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>> the O/S scheduler's point of view, but it has some "interesting" side
>> effects on systems with multiple non-realtime components (e.g. logging,
>> other socket components).
>
> If you used the PeriodicActivity class, you get what it advertises. Use the
> 1.10's Activity class to have independent periodic threads.

What does Activity provide over PeriodicActivity here?
S

Delaying select() call within state machine

Submitted by Klaas Gadeyne on Wed, 2010-09-29 12:04.

On Wed, Sep 29, 2010 at 1:47 PM, Stephen Roderick <kiwi [dot] net [..] ...> wrote:
> On Sep 28, 2010, at 15:14 , Peter Soetens wrote:
>
>> On Tuesday 28 September 2010 14:58:15 S Roderick wrote:
>>> We are re-architecting our socket-related components, and have hit an
>>> interesting situation. Within a method called during a state machine's run
>>> section, we have a unix select() call with a timeout. During this call
>>> (with nothing connecting, so that this component blocks), this component
>>> plus all _other_ components completely stall. All components are periodic.
>>>
>>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>>> the O/S scheduler's point of view, but it has some "interesting" side
>>> effects on systems with multiple non-realtime components (e.g. logging,
>>> other socket components).
>>
>> If you used the PeriodicActivity class, you get what it advertises. Use the
>> 1.10's Activity class to have independent periodic threads.
>
> What does Activity provide over PeriodicActivity here?

I guess Peter meant that, if all of your components use a
PeriodicActivity with the same period/priority, they are grouped in a
single thread, hence it's logical that all of your components block. By
opting for Activity instead, they will be in separate threads and
your problem might be solved.
At first, I didn't understand why it didn't happen in your
"updateHook" way of doing things, but a second read taught me that you
were using a non-periodic activity then.
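
The difference between the two groupings can be sketched without Orocos at all. The following is a plain Python simulation under stated assumptions, not RTT code: blocking_step() uses time.sleep() as a stand-in for the blocking select(), and all function names and timings are made up for illustration. One run serializes both steps in a single thread, the other gives each step its own thread.

```python
import threading
import time


def blocking_step():
    """Stand-in for a step that blocks, e.g. in select() with a timeout."""
    time.sleep(0.3)


def quick_step(log):
    """Stand-in for a well-behaved periodic step; records when it ran."""
    log.append(time.monotonic())


def run_shared_thread(cycles, period_s):
    """All steps serialized in one thread (the grouping described above)."""
    log = []
    for _ in range(cycles):
        blocking_step()      # stalls the whole loop
        quick_step(log)      # cannot run until the blocking step returns
        time.sleep(period_s)
    return log


def run_separate_threads(cycles, period_s):
    """Each step in its own thread, so a blocking step stalls only itself."""
    log = []

    def blocker():
        for _ in range(cycles):
            blocking_step()
            time.sleep(period_s)

    def quick():
        for _ in range(cycles):
            quick_step(log)
            time.sleep(period_s)

    threads = [threading.Thread(target=blocker), threading.Thread(target=quick)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def mean_interval(timestamps):
    """Average gap between consecutive runs of the quick step."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)


shared = mean_interval(run_shared_thread(cycles=4, period_s=0.05))
separate = mean_interval(run_separate_threads(cycles=4, period_s=0.05))
print(f"shared thread:    {shared:.2f} s between quick steps")
print(f"separate threads: {separate:.2f} s between quick steps")
```

In the shared-thread run the quick step inherits the delay of the blocking step on every cycle; with separate threads it keeps roughly its own period.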

hth,

Klaas

Delaying select() call within state machine

Submitted by snrkiwi on Wed, 2010-09-29 17:48.

On Sep 29, 2010, at 08:00 , Klaas Gadeyne wrote:

> On Wed, Sep 29, 2010 at 1:47 PM, Stephen Roderick <kiwi [dot] net [..] ...> wrote:
>> On Sep 28, 2010, at 15:14 , Peter Soetens wrote:
>>
>>> On Tuesday 28 September 2010 14:58:15 S Roderick wrote:
>>>> We are re-architecting our socket-related components, and have hit an
>>>> interesting situation. Within a method called during a state machine's run
>>>> section, we have a unix select() call with a timeout. During this call
>>>> (with nothing connecting, so that this component blocks), this component
>>>> plus all _other_ components completely stall. All components are periodic.
>>>>
>>>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>>>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>>>> the O/S scheduler's point of view, but it has some "interesting" side
>>>> effects on systems with multiple non-realtime components (e.g. logging,
>>>> other socket components).
>>>
>>> If you used the PeriodicActivity class, you get what it advertises. Use the
>>> 1.10's Activity class to have independent periodic threads.
>>
>> What does Activity provide over PeriodicActivity here?
>
> I guess Peter meant that, if all of your components use a
> PeriodicActivity with the same period/priority, they are grouped in a
> single thread, hence it's logical that all of your components block. By
> opting for the Activity instead, they will be in separate threads and
> your problem might be solved.

I can confirm that using RTT::Activity instead of RTT::PeriodicActivity solves the problem. A blocking select() in a method called by a state machine no longer blocks other components with the same scheduling parameters.

As Herman noted, the state machine implementation probably should be robust to this situation, though I don't know whether it actually is now ...?
S

Delaying select() call within state machine

Submitted by peter on Wed, 2010-09-29 18:52.

On Wed, Sep 29, 2010 at 7:46 PM, Stephen Roderick <kiwi [dot] net [..] ...> wrote:
> On Sep 29, 2010, at 08:00 , Klaas Gadeyne wrote:
>
>> On Wed, Sep 29, 2010 at 1:47 PM, Stephen Roderick <kiwi [dot] net [..] ...> wrote:
>>> On Sep 28, 2010, at 15:14 , Peter Soetens wrote:
>>>
>>>> On Tuesday 28 September 2010 14:58:15 S Roderick wrote:
>>>>> We are re-architecting our socket-related components, and have hit an
>>>>> interesting situation. Within a method called during a state machine's run
>>>>> section, we have a unix select() call with a timeout. During this call
>>>>> (with nothing connecting, so that this component blocks), this component
>>>>> plus all _other_ components completely stall. All components are periodic.
>>>>>
>>>>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>>>>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>>>>> the O/S scheduler's point of view, but it has some "interesting" side
>>>>> effects on systems with multiple non-realtime components (e.g. logging,
>>>>> other socket components).
>>>>
>>>> If you used the PeriodicActivity class, you get what it advertises. Use the
>>>> 1.10's Activity class to have independent periodic threads.
>>>
>>> What does Activity provide over PeriodicActivity here?
>>
>> I guess Peter meant that, if all of your components use a
>> PeriodicActivity with the same period/priority, they are grouped in a
>> single thread, hence it's logical that all of your components block. By
>> opting for the Activity instead, they will be in separate threads and
>> your problem might be solved.
>
> I can confirm that using RTT::Activity instead of RTT::PeriodicActivity solves the problem. A blocking select() in a method called by a state machine no longer blocks other components with the same scheduling parameters.
>
> As Herman noted, the state machine implementation probably should be robust to this situation, though I don't know whether it actually is now ...?

It is. PeriodicActivity should only be used in exceptional (controlled
and understood) situations, since it implicitly serializes components
and starting/stopping a component may/will change the execution order.

That's why we default to 'Activity' in RTT 1.10 and 2.0 in each
component. It's a sane/safe default.

Peter

Delaying select() call within state machine

Submitted by bruyninc on Tue, 2010-09-28 17:28.

On Tue, 28 Sep 2010, S Roderick wrote:

> We are re-architecting our socket-related components, and have hit an
> interesting situation. Within a method called during a state machine's
> run section, we have a unix select() call with a timeout.

Ouch....

> During this
> call (with nothing connecting, so that this component blocks), this
> component plus all _other_ components completely stall. All components
> are periodic.

> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
> the O/S scheduler's point of view, but it has some "interesting" side
> effects on systems with multiple non-realtime components (e.g. logging,
> other socket components).
>
> As we haven't seen this before (we previously put the blocking call in
> updateHook() inside a non-periodic activity), is this behavior related
> to the select() being within a state machine called method, or just
> having a long-duration blocking call in a periodic activity? Or have we
> simply not noticed this behavior previously?

Since our earliest endeavours with realtime Linux, the "select()" has been
a show stopper, since it does not come in a preemptible version...
So, you should go back to the proven solution of dedicating a non-periodic
activity to every indefinitely blocking call...
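
As a rough illustration of that pattern (one dedicated, non-periodic worker per blocking call, with the periodic side only polling), here is a Python sketch. It is not Orocos code; socket_worker, coordinator, the queue hand-off and all timings are hypothetical names and values chosen for the example. The worker's select() timeout exists only so the example can shut down cleanly.

```python
import queue
import select
import socket
import threading
import time


def socket_worker(listener, events, stop):
    """Non-periodic worker: the only place that blocks in select()."""
    while not stop.is_set():
        # The finite timeout only lets the worker notice the stop flag.
        readable, _, _ = select.select([listener], [], [], 0.2)
        if readable:
            conn, _addr = listener.accept()
            conn.close()
            events.put("connected")


def coordinator(events, cycles, period_s):
    """Periodic loop: polls for events without ever blocking on the socket."""
    seen = []
    for _ in range(cycles):
        try:
            seen.append(events.get_nowait())
        except queue.Empty:
            pass
        time.sleep(period_s)
    return seen


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

events = queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=socket_worker, args=(listener, events, stop))
worker.start()

# Simulate one client connecting while the coordinator keeps cycling.
client = socket.create_connection(("127.0.0.1", port))
start = time.monotonic()
seen = coordinator(events, cycles=10, period_s=0.05)
elapsed = time.monotonic() - start

client.close()
stop.set()
worker.join()
listener.close()
print(f"events seen: {seen}, coordinator ran for {elapsed:.2f} s")
```

The periodic loop keeps its own timing regardless of how long the worker sits in select(); the queue is the only coupling between the two.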

> Demonstrated with Orocos v1 on macosx and gnulinux.
>
> Something to keep in mind ...
> S

Herman

Delaying select() call within state machine

Submitted by snrkiwi on Tue, 2010-09-28 17:32.

On Sep 28, 2010, at 13:24 , Herman Bruyninckx wrote:

> On Tue, 28 Sep 2010, S Roderick wrote:
>
>> We are re-architecting our socket-related components, and have hit an
>> interesting situation. Within a method called during a state machine's
>> run section, we have a unix select() call with a timeout.
>
> Ouch....
>
>> During this
>> call (with nothing connecting, so that this component blocks), this
>> component plus all _other_ components completely stall. All components
>> are periodic.
>
>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>> the O/S scheduler's point of view, but it has some "interesting" side
>> effects on systems with multiple non-realtime components (e.g. logging,
>> other socket components).
>>
>> As we haven't seen this before (we previously put the blocking call in
>> updateHook() inside a non-periodic activity), is this behavior related
>> to the select() being within a state machine called method, or just
>> having a long-duration blocking call in a periodic activity? Or have we
>> simply not noticed this behavior previously?
>
> Since our earliest endeavours with realtime Linux, the "select()" has been
> a show stopper, since it does not come in a preemptible version...
> So, you should go back to the proven solution of dedicating a non-periodic
> activity to every indefinitely blocking call...

I've been trying to find ways around that particular approach. Especially as having a state machine attached to things like socket connections makes a great deal of sense. I imagine your suggestion is then to pair this non-periodic socket component with a periodic coordinating component ...?
S

Delaying select() call within state machine

Submitted by bruyninc on Tue, 2010-09-28 22:04.

On Tue, 28 Sep 2010, Stephen Roderick wrote:

> On Sep 28, 2010, at 13:24 , Herman Bruyninckx wrote:
>
>> On Tue, 28 Sep 2010, S Roderick wrote:
>>
>>> We are re-architecting our socket-related components, and have hit an
>>> interesting situation. Within a method called during a state machine's
>>> run section, we have a unix select() call with a timeout.
>>
>> Ouch....
>>
>>> During this
>>> call (with nothing connecting, so that this component blocks), this
>>> component plus all _other_ components completely stall. All components
>>> are periodic.
>>
>>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>>> the O/S scheduler's point of view, but it has some "interesting" side
>>> effects on systems with multiple non-realtime components (e.g. logging,
>>> other socket components).
>>>
>>> As we haven't seen this before (we previously put the blocking call in
>>> updateHook() inside a non-periodic activity), is this behavior related
>>> to the select() being within a state machine called method, or just
>>> having a long-duration blocking call in a periodic activity? Or have we
>>> simply not noticed this behavior previously?
>>
>> Since our earliest endeavours with realtime Linux, the "select()" has been
>> a show stopper, since it does not come in a preemptible version...
>> So, you should go back to the proven solution of dedicating a non-periodic
>> activity to every indefinitely blocking call...
>
> I've been trying to find ways around that particular approach. Especially
> as having a state machine attached to things like socket connections
> makes a great deal of sense. I imagine your suggestion is then to pair
> this non-periodic socket component with a periodic coordinating component
> ...?

This connection to sockets makes _a lot_ of sense, since it is a very
recurring practical situation. The "real" solution is to have a state
machine that is _robust_ against missing synchronization with the socket
component. What that "robustness" really means is not so clear to me yet
(and part of Markus' PhD research), but your use case _is_ a perfect
example of the failure of 'centralized', 'closed-world' Coordination.

Herman

Delaying select() call within state machine

Submitted by snrkiwi on Wed, 2010-09-29 11:48.

On Sep 28, 2010, at 18:01 , Herman Bruyninckx wrote:

> On Tue, 28 Sep 2010, Stephen Roderick wrote:
>
>> On Sep 28, 2010, at 13:24 , Herman Bruyninckx wrote:
>>
>>> On Tue, 28 Sep 2010, S Roderick wrote:
>>>
>>>> We are re-architecting our socket-related components, and have hit an
>>>> interesting situation. Within a method called during a state machine's
>>>> run section, we have a unix select() call with a timeout.
>>>
>>> Ouch....
>>>
>>>> During this
>>>> call (with nothing connecting, so that this component blocks), this
>>>> component plus all _other_ components completely stall. All components
>>>> are periodic.
>>>
>>>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>>>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>>>> the O/S scheduler's point of view, but it has some "interesting" side
>>>> effects on systems with multiple non-realtime components (e.g. logging,
>>>> other socket components).
>>>>
>>>> As we haven't seen this before (we previously put the blocking call in
>>>> updateHook() inside a non-periodic activity), is this behavior related
>>>> to the select() being within a state machine called method, or just
>>>> having a long-duration blocking call in a periodic activity? Or have we
>>>> simply not noticed this behavior previously?
>>>
>>> Since our earliest endeavours with realtime Linux, the "select()" has been
>>> a show stopper, since it does not come in a preemptible version...
>>> So, you should go back to the proven solution of dedicating a non-periodic
>>> activity to every indefinitely blocking call...
>>
>> I've been trying to find ways around that particular approach. Especially
>> as having a state machine attached to things like socket connections
>> makes a great deal of sense. I imagine your suggestion is then to pair
>> this non-periodic socket component with a periodic coordinating component
>> ...?
>
> This connection to sockets makes _a lot_ of sense, since it is a very
> recurring practical situation. The "real" solution is to have a state
> machine that is _robust_ against missing synchronization with the socket
> component. What that "robustness" really means is not so clear to me yet
> (and part of Markus' PhD research), but your use case _is_ a perfect
> example of the failure of 'centralized', 'closed-world' Coordination.

Agreed re robustness. The surprising thing for me with the current implementation is that the blocking select in the state machine caused all other SCHED_OTHER tasks to be blocked also. I can see how this might happen though.

Delaying select() call within state machine

Submitted by bruyninc on Wed, 2010-09-29 12:00.

On Wed, 29 Sep 2010, Stephen Roderick wrote:

> On Sep 28, 2010, at 18:01 , Herman Bruyninckx wrote:
>
>> On Tue, 28 Sep 2010, Stephen Roderick wrote:
>>
>>> On Sep 28, 2010, at 13:24 , Herman Bruyninckx wrote:
>>>
>>>> On Tue, 28 Sep 2010, S Roderick wrote:
>>>>
>>>>> We are re-architecting our socket-related components, and have hit an
>>>>> interesting situation. Within a method called during a state machine's
>>>>> run section, we have a unix select() call with a timeout.
>>>>
>>>> Ouch....
>>>>
>>>>> During this
>>>>> call (with nothing connecting, so that this component blocks), this
>>>>> component plus all _other_ components completely stall. All components
>>>>> are periodic.
>>>>
>>>>> The reason this occurs is that the blocking component is ORO_SCHED_OTHER,
>>>>> and it blocks all other ORO_SCHED_OTHER components. This makes sense from
>>>>> the O/S scheduler's point of view, but it has some "interesting" side
>>>>> effects on systems with multiple non-realtime components (e.g. logging,
>>>>> other socket components).
>>>>>
>>>>> As we haven't seen this before (we previously put the blocking call in
>>>>> updateHook() inside a non-periodic activity), is this behavior related
>>>>> to the select() being within a state machine called method, or just
>>>>> having a long-duration blocking call in a periodic activity? Or have we
>>>>> simply not noticed this behavior previously?
>>>>
>>>> Since our earliest endeavours with realtime Linux, the "select()" has been
>>>> a show stopper, since it does not come in a preemptible version...
>>>> So, you should go back to the proven solution of dedicating a non-periodic
>>>> activity to every indefinitely blocking call...
>>>
>>> I've been trying to find ways around that particular approach. Especially
>>> as having a state machine attached to things like socket connections
>>> makes a great deal of sense. I imagine your suggestion is then to pair
>>> this non-periodic socket component with a periodic coordinating component
>>> ...?
>>
>> This connection to sockets makes _a lot_ of sense, since it is a very
>> recurring practical situation. The "real" solution is to have a state
>> machine that is _robust_ against missing synchronization with the socket
>> component. What that "robustness" really means is not so clear to me yet
>> (and part of Markus' PhD research), but your use case _is_ a perfect
>> example of the failure of 'centralized', 'closed-world' Coordination.
>
> Agreed re robustness. The surprising thing for me with the current implementation is that the blocking select in the state machine caused all other SCHED_OTHER tasks to be blocked also. I can see how this might happen though.
>
I don't... Maybe because the select() has process-wide repercussions, and
not just on thread level...?

Herman