SIGSEGV when loading scripts from a script under Xenomai

We have a blocking problem while porting our application from gnulinux to xenomai. We have written a small program that shows the problem (please see the attached piece of code).

It results in a SIGSEGV, when execution of the first script program reaches the point where it loads a second one.

The weirdness of the problem is that the bug is not that easy to reproduce.

It is systematic on the following configuration (1):

linux-2.6.30.8-xenomai-2.4.10 + orocos-rtt-1.8.5

It is not reproducible on:

linux-2.6.32.7-xenomai-2.6.32.7-xenomai-2.5.2 +orocos-rtt-1.10.3 (2),

unless you add some "cout << " logging in ProgramGraphParser, but I have to admit that I cannot say explicitly how much logging is needed to make the bug appear.

In all the cases, gdb seems to show a corrupted stack, making it very difficult to diagnose.

Please note that the bug only occurs when the application is built for Xenomai.

Best regards

Attachment: Orocos-bug.tar_.bz2 (1.45 KB)

SIGSEGV when loading scripts from a script under Xenomai

On Wednesday 05 May 2010 10:23:35 nicolas [dot] mabire [..] ... wrote:
> We have a blocking problem while porting our application from gnulinux to
> xenomai. We have written a small program that shows the problem (please
> see the attached piece of code).

I'll run it in my test setup to see if I can reproduce this. Thanks for the
test program.

>
> It results in a SIGSEGV, when execution of the first script program reaches
> the point where it loads a second one.

There was a huge bug in the parser code that did not allow thread-safe/reentrant
program script loading. This also caused a crash on the Windows platform
(reported earlier this week on the list). You could grab that patch and apply
it to the RTT, plus a small patch on OCL (not necessary for your test program),
and check the results. It will be in RTT/OCL 1.12. I'll create that branch
this week.
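
To make the reentrancy point concrete, here is a minimal sketch of how state shared between nested load calls gets overwritten. It is a toy illustration only: `SharedStateLoader` and `ReentrantLoader` are hypothetical names and do not reflect the actual RTT parser code or the content of the patch.

```cpp
#include <string>
#include <vector>

// Hypothetical toy loader, NOT the RTT parser. It keeps the name of the
// program being loaded in a static member, so every call shares it.
struct SharedStateLoader {
    static std::string current;

    // Returns the name this call believes it is loading once the nested
    // loads have finished.
    static std::string load(const std::string& name,
                            const std::vector<std::string>& nested) {
        current = name;
        for (const auto& n : nested) {
            load(n, {});  // the nested call overwrites `current`
        }
        return current;
    }
};
std::string SharedStateLoader::current;

// Same toy loader, but the state lives on each call's own stack frame,
// so a nested load cannot disturb the outer one.
struct ReentrantLoader {
    static std::string load(const std::string& name,
                            const std::vector<std::string>& nested) {
        std::string current = name;
        for (const auto& n : nested) {
            load(n, {});
        }
        return current;
    }
};
```

Loading "PrgFoo" with a nested "PrgFoo2" leaves the shared-state version believing it is loading "PrgFoo2", while the reentrant version still reports "PrgFoo". In a real parser the clobbered state would be pointers or iterators rather than a name, which is how this class of bug can end in a crash.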

>
> The weirdness of the problem is that the bug is not that easy to reproduce.
>
> It is systematic on the following configuration (1):
>
> linux-2.6.30.8-xenomai-2.4.10 + orocos-rtt-1.8.5
>
> It is not seenable on:
>
> linux-2.6.32.7-xenomai-2.6.32.7-xenomai-2.5.2 +orocos-rtt-1.10.3 (2),

I also wondered if it was a Xenomai or stack explosion bug. If we use too
much stack, a Xenomai app will just segfault, without any good traces. Since
scripting uses a lot of stack, *and* you load a script from a script (x2
stack), this might stress Xenomai too much and trigger some bug, which kills
the app. We use mlockall to lock all pages into memory (mandatory when using
Xenomai), so if you run out of stack you get a SIGSEGV (see 'man mlockall').
See also
http://www.xenomai.org/index.php/Porting_POSIX_applications_to_Xenomai#m....
You can test with 'ulimit -s 16000' before starting the program to see if the
crash goes away (see 'man bash' and search for ulimit).
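
To check from inside the program which stack limit is actually in effect, a small sketch along the following lines may help. It uses only the standard getrlimit() and mlockall() calls; the helper names (`query_stack_limit`, `lock_all_memory`, `print_stack_limit`) are made up for this example and are not part of RTT or Xenomai. Note that bash's 'ulimit -s' takes its value in units of 1024 bytes, so 'ulimit -s 16000' asks for roughly 16 MB.

```cpp
#include <sys/mman.h>
#include <sys/resource.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Result of querying the soft stack limit of the current process.
struct StackLimit {
    bool ok;                   // false if getrlimit() failed
    bool unlimited;            // true if the soft limit is RLIM_INFINITY
    unsigned long long bytes;  // soft limit in bytes (valid if ok && !unlimited)
};

StackLimit query_stack_limit() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) return {false, false, 0};
    if (rl.rlim_cur == RLIM_INFINITY) return {true, true, 0};
    return {true, false, static_cast<unsigned long long>(rl.rlim_cur)};
}

// Locks current and future pages into RAM. Returns true on success.
// This usually needs root or a raised RLIMIT_MEMLOCK.
bool lock_all_memory() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == 0) return true;
    std::fprintf(stderr, "mlockall failed: %s\n", std::strerror(errno));
    return false;
}

// Prints the limit in the same 1024-byte units that 'ulimit -s' uses.
void print_stack_limit() {
    const StackLimit s = query_stack_limit();
    if (!s.ok) {
        std::printf("getrlimit failed\n");
    } else if (s.unlimited) {
        std::printf("stack limit: unlimited\n");
    } else {
        std::printf("stack limit: %llu KiB\n", s.bytes / 1024);
    }
}
```

Calling print_stack_limit() at the top of main() confirms whether the ulimit setting reached the process before any threads are created.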

>
> unless you simply add some "cout << " logs in ProgramGraphParser, at least,
> but I have to admit that I cannot explicitely say how much log is mandatory
> to make the bug appear.
>
> In all the cases, gdb seems to show a corrupted stack, making it very
> difficult to diagnose.
>
> Please notice that the bug only occurs when the application is built for
> xenomai

I had never tested loading scripts from scripts before, so it might be related
to that. I suggest that you first test the Xenomai memory settings using ulimit
and, if that does not help, get the patch from svn/git (see:
http://github.com/psoetens/orocos-rtt/commit/ecd144be27ec39461deff9be983fa6064abbb5e3 )
and try it out.

Let us know if you find something,

Peter

SIGSEGV when loading scripts from a script under Xenomai

The ulimit command does not change the behaviour. It seems to crash at the same moment.

The patch you suggested does not compile for Xenomai: the inline functions for Xenomai need to be static. I changed that to make it compile.
After that, OCL does not compile either, because of the changes to the PeerParser constructor; I made some changes there too.
With that version, it still crashes at the same moment.

Adding OS::Thread::setStackSize(16000) did not help either...

SIGSEGV when loading scripts from a script under Xenomai

On Wednesday 12 May 2010 09:33:57 nicolas [dot] mabire [..] ... wrote:
> The ulimit command does not change the behaviour. It seems to crash at the
> same moment.
>
> The patch you have suggested does not compile for xenomai, so inline
> functions for xenomai should be static. I have changed that to make it
> compile. After that, the OCL does not compile too, because of the changes
> of the PeerParser constructor, I have made some changes here too. With
> that version, it still crashes at the same moment.
>
> Adding OS::Thread::setStackSize(16000) did not help too...
>
The stack size is in bytes. I checked your test program with

valgrind --tool=massif hello-gnulinux
ms_print massif.out.26171

and it reports about 27000 bytes of stack in use (if I interpreted the logs
correctly).

Also, using OS::Thread::setStackSize(16000) will not work because the minimum
is PTHREAD_STACK_MIN (16384) bytes.

So I would recommend testing with at least OS::Thread::setStackSize(32000),
called before your threads are created. Note that in 1.10.3, the stack size is
ignored on gnulinux due to a bug. Xenomai does use that number, though.
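
The PTHREAD_STACK_MIN constraint can be checked with plain POSIX calls. The sketch below is not how RTT implements OS::Thread::setStackSize; the helper names `effective_stack_size` and `apply_stack_size` are invented for this example. It raises a too-small request to the platform minimum and reads back what the thread attribute reports.

```cpp
#include <limits.h>
#include <pthread.h>
#include <cstddef>

// Raises a requested stack size to PTHREAD_STACK_MIN if it is below it.
// The value of PTHREAD_STACK_MIN is platform dependent.
std::size_t effective_stack_size(std::size_t requested) {
    const std::size_t min = static_cast<std::size_t>(PTHREAD_STACK_MIN);
    return requested < min ? min : requested;
}

// Stores the (clamped) size in a fresh thread attribute object and returns
// the size the attribute reports back, or 0 if any call failed.
std::size_t apply_stack_size(std::size_t requested) {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;

    std::size_t result = 0;
    if (pthread_attr_setstacksize(&attr, effective_stack_size(requested)) == 0) {
        if (pthread_attr_getstacksize(&attr, &result) != 0) result = 0;
    }
    pthread_attr_destroy(&attr);
    return result;
}
```

With these helpers, a request of 16000 bytes is raised to the minimum, while 32000 bytes is kept as requested on platforms where PTHREAD_STACK_MIN is 16384.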

I also added an mlockall to your program in gnulinux with the fixed rtt
version, set the stack size to the minimum and ran it as root, but no crash was
observed, which contradicts what massif reports. But I would still go for the
32000... until other evidence proves this wrong.

Peter

SIGSEGV when loading scripts from a script under Xenomai

On Wednesday 12 May 2010 09:33:57 nicolas [dot] mabire [..] ... wrote:
> The ulimit command does not change the behaviour. It seems to crash at the
> same moment.
>
> The patch you have suggested does not compile for xenomai, so inline
> functions for xenomai should be static. I have changed that to make it
> compile. After that, the OCL does not compile too, because of the changes
> of the PeerParser constructor, I have made some changes here too. With
> that version, it still crashes at the same moment.
>
> Adding OS::Thread::setStackSize(16000) did not help too...
>

I compiled & ran your example program in my VM with the 2.5.2 configuration,
and the code which is on RTT 'trunk', without any errors. I also modified the
script to load/unload the program each time and to start the script from
hello.cpp when it was done.

I'll retest with stock 1.10.2...

Peter

debian:/home/virtual/src/Orocos-bug/Orocos-bug# uname -a
Linux debian 2.6.30.8-xenomai-2.5.2 #1 SMP Fri Apr 16 14:18:56 CEST 2010 i686
GNU/Linux
# ORO_LOGLEVEL=7 ./hello-xenomai
0.000 [ Info ][Logger] Successfully extracted environment variable
ORO_LOGLEVEL
0.000 [ Info ][Logger] OROCOS version '1.10.99' compiled with GCC 4.3.2.
Running in Xenomai.
0.000 [ Info ][Logger] Orocos Logging Activated at level : [ Debug ] ( 6 )
0.000 [ Info ][Logger] Reference System Time is : 19639565543526 ticks (
6795.85 seconds ).
0.000 [ Info ][Logger] Logging is relative to this time.
0.000 [ Info ][Logger] Xenomai Periodic Timer runs in preemptive 'one-shot'
mode.
0.000 [ Info ][Logger] Installing SIGXCPU handler.
0.000 [ Debug ][Logger] Xenomai Timer and Main Task Created
0.000 [ Debug ][Logger] MainThread started.
0.000 [ Debug ][Logger] Starting StartStopManager.
0.000 [ Info ][Toolkit] Loading Tool RealTime.
0.001 [ Debug ][Toolkit] Registered Type 'int' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'uint' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'double' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'bool' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'void' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'PropertyBag' to the Orocos Type
System.
0.001 [ Debug ][Toolkit] Registered Type 'float' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'char' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'array' to the Orocos Type System.
0.001 [ Debug ][Toolkit] Registered Type 'string' to the Orocos Type System.
Started....
0.001 [ Debug ][ExecutionEngine] Creating ExecutionEngine for hello
0.002 [ Info ][Thread] Creating Thread for scheduler: 1
0.002 [ Warning][Thread] Forcing priority (0) of thread to 1.
0.002 [ Info ][hello] Thread created with scheduler type '1', priority -1
and period 0.
configureHook()....
0.002 [ Info ][ProgramLoader::loadProgram] Parsing file foo.ops
0.003 [ Info ][ProgramLoader::loadProgram] Loading Program 'PrgFoo'
configureHook() done.
0.003 [ Info ][PeriodicThread] Creating PeriodicThread for scheduler: 1
0.004 [ Info ][TimerThreadInstance] PeriodicThread created with scheduler
type '1', priority 1 and period 5.
0.004 [ Debug ][PeriodicThread::start] Periodic Thread TimerThreadInstance
started.
startHook()....
Executing prgFoo...
startHook done.
Alive
Alive
Alive
Alive
Alive
5.004 [ Warning][hello] THIERRY in PrgFoo: Loading foo2.ops...
5.004 [ Info ][ProgramLoader::loadProgram] Parsing file foo2.ops
5.005 [ Info ][ProgramLoader::loadProgram] Loading Program 'PrgFoo2'
5.005 [ Warning][hello] THIERRY PrgFoo DONE.
5.005 [ Info ][ParserScriptingAccess::unloadProgram] Unloading Program
'PrgFoo2'
updateHook()....
updateHook done.
Alive
Alive
Alive
Alive
Alive
10.004 [ Warning][hello] THIERRY in PrgFoo: Loading foo2.ops...
10.004 [ Info ][ProgramLoader::loadProgram] Parsing file foo2.ops
10.005 [ Info ][ProgramLoader::loadProgram] Loading Program 'PrgFoo2'
10.005 [ Warning][hello] THIERRY PrgFoo DONE.
10.005 [ Info ][ParserScriptingAccess::unloadProgram] Unloading Program
'PrgFoo2'
updateHook()....
updateHook done.
Alive
Alive
Alive
Alive
Alive
15.005 [ Warning][hello] THIERRY in PrgFoo: Loading foo2.ops...
15.005 [ Info ][ProgramLoader::loadProgram] Parsing file foo2.ops
15.006 [ Info ][ProgramLoader::loadProgram] Loading Program 'PrgFoo2'
15.006 [ Warning][hello] THIERRY PrgFoo DONE.
15.006 [ Info ][ParserScriptingAccess::unloadProgram] Unloading Program
'PrgFoo2'
updateHook()....
updateHook done.
^C

SIGSEGV when loading scripts from a script under Xenomai

On Wednesday 12 May 2010 09:33:57 nicolas [dot] mabire [..] ... wrote:
> The ulimit command does not change the behaviour. It seems to crash at the
> same moment.
>
> The patch you have suggested does not compile for xenomai, so inline
> functions for xenomai should be static. I have changed that to make it
> compile. After that, the OCL does not compile too, because of the changes
> of the PeerParser constructor, I have made some changes here too. With
> that version, it still crashes at the same moment.

Yes, I should have mentioned that there is a patch on the OCL SVN trunk that
contains the fix you needed.

>
> Adding OS::Thread::setStackSize(16000) did not help too...
>

So that would start to exclude the stack story. I'll replicate it here in a VM
and see what else could help... I'll keep you posted (tomorrow and Friday are
holidays over here...)

Peter
