FBSched: segfault at deployer exit

I hate to have to dump this on the list, but as I'm not familiar enough with the source to try and sort this out for myself:

After correctly (afaict) setting up an instance of FBSched and its slaves, I run into the following segfault trying to exit the deployer:

=====

Program received signal SIGSEGV, Segmentation fault.
0xb744e490 in main_arena () from /lib/tls/i686/cmov/libc.so.6
(gdb) bt
#0 0xb744e490 in main_arena () from /lib/tls/i686/cmov/libc.so.6
#1 0xb7f8c785 in RTT::extras::SlaveActivity::start (this=0x80af280)
at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/rtt/rtt/extras/SlaveActivity.cpp:123
#2 0xb7f012d3 in RTT::ExecutionEngine::stopTask (this=0x80af480, task=0x80af3e0)
at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/rtt/rtt/ExecutionEngine.cpp:401
#3 0xb7f456ce in RTT::base::TaskCore::stop (this=0x80af3e0) at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/rtt/rtt/base/TaskCore.cpp:219
#4 0xb7f05e72 in RTT::TaskContext::stop (this=0x80af3e0) at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/rtt/rtt/TaskContext.cpp:406

[...]

#13 0xb7cc2fd4 in OCL::DeploymentComponent::kickOutAll (this=0xbfffeb14)
at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/ocl/deployment/DeploymentComponent.cpp:580
#14 0xb7cda200 in ~DeploymentComponent (this=0xbfffeb14, __in_chrg=<value optimised out>)
at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/ocl/deployment/DeploymentComponent.cpp:266
#15 0x080536da in main (argc=<value optimised out>, argv=0xbfffed34) at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/ocl/bin/deployer.cpp:168

=====

If I first call 'stop()' on the slaves, everything is fine (probably because SlaveActivity::stop() returns false at line 148).

The slave is basically 'oro-createpkg component' sans the std::cout statements.

Some info from frame #1:

=====

(gdb) frame 1
#1 0xb7f8c785 in RTT::extras::SlaveActivity::start (this=0x80af280)
at /home/gah/ros/orocos-ros-gnulinux/orocos_toolchain/rtt/rtt/extras/SlaveActivity.cpp:123
123 if (mmaster && !mmaster->isActive())

(gdb) print mmaster
$1 = (class RTT::base::ActivityInterface *) 0x80ad1e8

=====
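
For what it's worth, here is my reading of frame #1 as a reduced, stand-alone model (my own sketch, NOT the actual RTT code): mmaster is a raw, non-owning pointer, so if the master activity is destroyed before the slave is stopped, line 123 calls through freed memory:

=====

// Reduced model of my reading of the crash -- NOT actual RTT code.
// SlaveActivity keeps a raw, non-owning pointer to its master, so if
// the master activity is destroyed first, the isActive() call on line
// 123 goes through a dangling pointer into freed heap memory (which
// would explain frame #0 landing in main_arena).
struct ActivityInterface {
    virtual ~ActivityInterface() {}
    virtual bool isActive() const = 0;
};

struct Master : ActivityInterface {
    bool isActive() const { return true; }
};

struct Slave : ActivityInterface {
    ActivityInterface* mmaster; // raw pointer; nothing ever resets it
    Slave(ActivityInterface* m) : mmaster(m) {}
    bool isActive() const { return true; }
    bool start() {
        if (mmaster && !mmaster->isActive()) // cf. SlaveActivity.cpp:123
            return false;
        return true;
    }
};

int main() {
    ActivityInterface* master = new Master;
    Slave slave(master);
    delete master; // destruction order at deployer exit
    slave.start(); // virtual call through dangling mmaster: segfault
    return 0;
}

=====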

Should FBSched call stop() on its slaves?

FBSched: segfault at deployer exit

g ah wrote:
> I hate to have to dump this on the list, but as I'm not familiar enough with the source to try and sort this out for myself:
>
> After correctly (afaict) setting up an instance of FBSched and its slaves, I run into the following segfault trying to exit the deployer:
>
[..]
>
> Should FBSched call stop() on its slaves?

Can anyone shed any light on this? The deployer is still segfaulting at exit.

FBSched: segfault at deployer exit

Hi G Ah,

On Thu, Apr 26, 2012 at 09:05:52AM +0000, g ah wrote:
>
> g ah wrote:
> > I hate to have to dump this on the list, but as I'm not familiar enough with the source to try and sort this out for myself:
> >
> > After correctly (afaict) setting up an instance of FBSched and its slaves, I run into the following segfault trying to exit the deployer:
> >
> [..]
> >
> > Should FBSched call stop() on its slaves?
>
> Can anyone shed any light on this? The deployer is still segfaulting at exit.

So let me see if I understand correctly: with fbsched running, you
exit the deployer and you get a segfault? This would not surprise me,
because if any of your components is destroyed before fbsched, the
latter will try to invoke update() on a component that no longer
exists. If this speculation is correct, then stopping fbsched before
shutting down (= proper shutdown management) should solve your
problem, right?
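
Something like this is what I mean by proper shutdown management (a
sketch only; how you hold the component pointers is up to your
application):

=====

// Sketch only: fbsched and slaves stand for however your application
// holds its components. Stop the master first so it can no longer
// trigger anyone, then stop every slave, and only then let the
// deployer destroy the components.
#include <rtt/TaskContext.hpp>
#include <cstddef>
#include <vector>

void orderedShutdown(RTT::TaskContext* fbsched,
                     const std::vector<RTT::TaskContext*>& slaves)
{
    fbsched->stop();
    for (std::size_t i = 0; i < slaves.size(); ++i)
        slaves[i]->stop();
}

=====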

Markus

FBSched: segfault at deployer exit

Markus Klotzbuecher wrote:
> Hi G Ah,
>
> On Thu, Apr 26, 2012 at 09:05:52AM +0000, g ah wrote:
>> g ah wrote:
>>> I hate to have to dump this on the list, but as I'm not familiar enough with the source to try and sort this out for myself:
>>>
>>> After correctly (afaict) setting up an instance of FBSched and its slaves, I run into the following segfault trying to exit the deployer:
>>>
>> [..]
>>>
>>> Should FBSched call stop() on its slaves?
>> Can anyone shed any light on this? The deployer is still segfaulting at exit.
>
> So let me see if I understand correctly: with fbsched running you exit
> the deployer and you get a segfault? This would not surprise me,
> because if any of your components gets destructed before fbsched, the
> latter will try to invoke update() on a non-existing component. If
> this speculation is correct, then stopping fbsched before shutting
> down (=proper shutdown management) should solve your problem, right?

It seems to be the other way around: as long as any of the slaves is still running, exiting the deployer results in a segfault. Stop all the slaves, and the state of fbsched no longer matters (i.e. it can be [R] or [S]; no segfault either way).

Segfaults always happen in SlaveActivity::start() (line 123).

The scenario you describe was what I thought was happening as well, but my (admittedly primitive) testing led me to disregard that hypothesis.

Removing the slaves as peers of the deployer (as suggested by W. Lambert) makes no difference in my test: only after stopping all slaves can the deployer exit normally.

FBSched: segfault at deployer exit

2012/4/26 g ah <gaohml [..] ...>:
>
> Markus Klotzbuecher wrote:
>> Hi G Ah,
>>
>> On Thu, Apr 26, 2012 at 09:05:52AM +0000, g ah wrote:
>>> g ah wrote:
>>>> I hate to have to dump this on the list, but as I'm not familiar enough with the source to try and sort this out for myself:
>>>>
>>>> After correctly (afaict) setting up an instance of FBSched and its slaves, I run into the following segfault trying to exit the deployer:
>>>>
>>> [..]
>>>>
>>>> Should FBSched call stop() on its slaves?
>>> Can anyone shed any light on this? The deployer is still segfaulting at exit.
>>
>> So let me see if I understand correctly: with fbsched running you exit
>> the deployer and you get a segfault? This would not surprise me,
>> because if any of your components gets destructed before fbsched, the
>> latter will try to invoke update() on a non-existing component. If
>> this speculation is correct, then stopping fbsched before shutting
>> down (=proper shutdown management) should solve your problem, right?
>
> It seems to be the other way around: as long as any of the slaves are in the running state, exiting the deployer results in a segfault. Stop all the slaves, and the state of fbsched does not matter anymore (ie: it can be [R] or [S], no segfaults either way).
>
> Segfaults always happen in SlaveActivity::start() (line 123).
>
> The scenario you describe was what I thought was happening as well, but my (admittedly primitive) testing led me to disregard that hypothesis.
>
> Removing the slaves as peers of the deployer (as suggested by W. Lambert) makes no difference in my test: only after stopping all slaves can the deployer exit normally.
>

Yes, that only works if the problem comes from the fbsched state when
exiting, or if your fbsched manages slave stop/cleanup in its
stopHook()/cleanupHook().
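
For example, something along these lines (a rough sketch, not the
actual FBSched source; that slaves are driven via update() is my
assumption from this thread):

=====

// Rough sketch, not the actual FBSched source: a master component that
// propagates stop/cleanup to the slaves it schedules, so the deployer
// only has to manage the master on shutdown.
#include <rtt/TaskContext.hpp>
#include <cstddef>
#include <string>
#include <vector>

class SchedulerLike : public RTT::TaskContext {
    std::vector<RTT::TaskContext*> slaves; // filled during configuration
public:
    SchedulerLike(const std::string& name) : RTT::TaskContext(name) {}

protected:
    void updateHook() {
        for (std::size_t i = 0; i < slaves.size(); ++i)
            slaves[i]->update();   // run each slave in order
    }
    void stopHook() {
        for (std::size_t i = 0; i < slaves.size(); ++i)
            slaves[i]->stop();     // no running slave outlives the master
    }
    void cleanupHook() {
        for (std::size_t i = 0; i < slaves.size(); ++i)
            slaves[i]->cleanup();
    }
};

=====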

FBSched: segfault at deployer exit

2012/4/26 Markus Klotzbuecher <markus [dot] klotzbuecher [..] ...>:
> Hi G Ah,
>
> On Thu, Apr 26, 2012 at 09:05:52AM +0000, g ah wrote:
>>
>> g ah wrote:
>> > I hate to have to dump this on the list, but as I'm not familiar enough with the source to try and sort this out for myself:
>> >
>> > After correctly (afaict) setting up an instance of FBSched and its slaves, I run into the following segfault trying to exit the deployer:
>> >
>> [..]
>> >
>> > Should FBSched call stop() on its slaves?
>>
>> Can anyone shed any light on this? The deployer is still segfaulting at exit.
>
> So let me see if I understand correctly: with fbsched running you exit
> the deployer and you get a segfault? This would not surprise me,
> because if any of your components gets destructed before fbsched, the
> latter will try to invoke update() on a non-existing component. If
> this speculation is correct, then stopping fbsched before shutting
> down (=proper shutdown management) should solve your problem, right?
>
> Markus

The Deployer does some stop/cleanup at the end, but in your case not
in the right order. I think the underlying problem is that the slave
components plus fbsched effectively form a composition: the deployer
should only see fbsched as a peer to manage on shutdown. This also
raises the question "should fbsched be responsible for managing slave
task state?" (such as configuring all its slave tasks when it is
configured, starting all of them when it is started, and so on). I
personally add the slave peers to my fbsched equivalent and use
removePeer on the Deployer, as in the sketch below. So:
1: my peers are no longer known by the Deployer, which prevents such segfaults;
2: in a large-scale system you only see groups of components in your Deployer.
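
Concretely, something like this (sketch only; names are placeholders
and the header path may differ per install):

=====

// Sketch of the composition step: the slave becomes a peer of the
// scheduler only, and is removed from the deployer's peer list so
// kickOutAll() never touches it directly.
#include <ocl/DeploymentComponent.hpp> // header path may vary
#include <rtt/TaskContext.hpp>

void composeUnderScheduler(OCL::DeploymentComponent& deployer,
                           RTT::TaskContext* fbsched,
                           RTT::TaskContext* slave)
{
    fbsched->addPeer(slave);               // scheduler manages the slave
    deployer.removePeer(slave->getName()); // deployer no longer sees it
}

=====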

It makes deployment and introspection a bit trickier (I don't think
the deployer was written with this in mind, but
FBSched.MySlaveComponent1.doSomething() and so on do work), but it is
a way to compose the system view.