[Bug 559] New: Seg fault on certain state machine transitions

Submitted by snrkiwi on Tue, 2008-05-27 16:49

RTT-dev

For more information about this bug, visit
Summary: Seg fault on certain state machine transitions
Product: RTT
Version: rtt-trunk
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Real-Time Toolkit (RTT)
AssignedTo: orocos-dev [..] ...
ReportedBy: kiwi [dot] net [..] ...
CC: orocos-dev [..] ...
Estimated Hours: 0.0

Created an attachment (id=307)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=307)
GDB backtrace of seg-fault

We're getting inconsistent behaviour from a state machine, including a
seg-fault in a certain machine setup.

Seg-fault occurs with
{{{
StateMachine PA10nAxesController_SM
{
    var string faultMessage
    // we always shutdown, no matter what state we're in
    transition requestShutdown() select STOP_ROBOT
    // keep work out of initial state, to keep it 'atomic'
    initial state START_ROBOT
    {
        entry
        {
        }
        transitions { select PREPARE }
    }

    // first real state - prepare robot for use and transition to SAFE
    state PREPARE
    {
        entry
        {
            do inState("PREPARE")
            // do prepareForUse()
        }
        run
        {
            do prepareForUse()
        }
        /// ##### seg-fault occurs when changing this one #####
        transition faultDetected(faultMessage) select SAFE
        // transitions { select SAFE }
    }

    state SAFE
    {
        entry
        {
            // do stopAllAxes()
            // do lockAllAxes()
            do inState("SAFE")
        }
        exit
        {
            // do unlockAllAxes()
            // do startAllAxes()
        }
        transition requestPositionHold() select POSITION_HOLD
        transition requestPositionControl() select POSITION_CONTROL
    }
...
}}}

If we change the fault detected line to
{{{
transition faultDetected(faultMessage) select STOP_ROBOT
}}}
all is fine (where STOP_ROBOT is the final state). But with a transition to
SAFE it seg-faults (gdb backtrace attached).

Also, if we remove the run() program from the PREPARE state and uncomment the
prepareForUse() in the entry program (which is the way we wanted it to work), a
faultDetected event emitted during the prepareForUse() does not cause the
transition to STOP_ROBOT (or SAFE if using that) to occur. The event just seems
to disappear. Is there something wrong with our state machine formulation?

prepareForUse() is a method (not a command) and always returns true. The parent
task context is in a periodic activity.

TIA
S

[Bug 559] Seg fault on certain state machine transitions

Submitted by klaas on Thu, 2008-06-05 09:20.

--- Comment #16 from Klaas Gadeyne <klaas [dot] gadeyne [..] ...> 2008-06-05 11:08:02 ---
(In reply to comment #15)
> fixed on trunk/rtt
>
> $ svn ci src/StateMachine.cpp tests/state_test.cpp -m"Fixes bug #559: Seg
> fault on certain state machine transitions
> > * all events are processed asynchronously
> > * events are enabled before entry{}
> > * events from entry{} are processed the same step
> > * events from run {} are processed the next step
> > "
> Sending src/StateMachine.cpp
> Sending tests/state_test.cpp

Shouldn't this be mentioned in the component builders manual too?

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Thu, 2008-06-05 09:15.

--- Comment #15 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-06-05 11:03:55 ---
fixed on trunk/rtt

$ svn ci src/StateMachine.cpp tests/state_test.cpp -m"Fixes bug #559: Seg
fault on certain state machine transitions
> * all events are processed asynchronously
> * events are enabled before entry{}
> * events from entry{} are processed the same step
> * events from run {} are processed the next step
> "
Sending src/StateMachine.cpp
Sending tests/state_test.cpp
Transmitting file data ..
Committed revision 29362.

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Thu, 2008-06-05 09:10.

--- Comment #14 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-06-05 11:00:47 ---
Created an attachment (id=316)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=316)
patch to be applied on trunk.

So the procedure until now is:

* all events are processed asynchronously
* events are enabled before entry{}
* events from entry{} are processed the same step
* events from run {} are processed the next step

The debugging output has been removed again. Maybe we should add a
sm.debug( true ) call such that useful state transitions etc are shown
automatically... (submit a new bug report if you find this useful)

Peter
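
The four rules listed in comment #14 can be read as a small queue-based stepping loop. The following is only a toy model in Python, not RTT code: the class `ToyStateMachine`, the helper `emit_fault` and the two state tables are hypothetical names made up for illustration. It also assumes, following comment #12, that run{} is skipped in a step in which a transition was taken.

```python
from collections import deque


class ToyStateMachine:
    """Toy model of the four rules above; this is not RTT code."""

    def __init__(self, states, initial):
        # states: name -> {"entry": fn(sm), "run": fn(sm), "on": {event: target}}
        self.states = states
        self.initial = initial
        self.current = None      # no state entered yet
        self.queue = deque()     # rule 1: events are only ever queued
        self.trace = []

    def emit(self, event):
        """Queue an event; it is never handled inside the caller's stack."""
        self.queue.append(event)

    def _enter(self, name):
        # rule 2: the queue already accepts events while entry{} runs
        self.current = name
        self.trace.append("entry " + name)
        entry = self.states[name].get("entry")
        if entry is not None:
            entry(self)

    def _process_events(self):
        """Drain the queue; return True if any transition was taken."""
        moved = False
        while self.queue:
            event = self.queue.popleft()
            target = self.states[self.current].get("on", {}).get(event)
            if target is not None:   # events without a matching transition are dropped
                self._enter(target)  # rule 3: events emitted by this entry{} are
                moved = True         # drained by the same loop, in the same step
        return moved

    def step(self):
        """One periodic step."""
        if self.current is None:
            self._enter(self.initial)
            self._process_events()
            return
        if self._process_events():
            return                   # assumption: run{} is skipped after a transition
        self.trace.append("run " + self.current)
        run = self.states[self.current].get("run")
        if run is not None:
            run(self)                # rule 4: events from run{} wait for the next step


def emit_fault(sm):
    """Stand-in for a method such as prepareForUse() that emits an event."""
    sm.emit("faultDetected")


# Event emitted from run{}: the transition is taken one step later.
RUN_VARIANT = {
    "PREPARE": {"run": emit_fault, "on": {"faultDetected": "SAFE"}},
    "SAFE": {},
}
# Event emitted from entry{}: the transition is taken in the same step.
ENTRY_VARIANT = {
    "PREPARE": {"entry": emit_fault, "on": {"faultDetected": "SAFE"}},
    "SAFE": {},
}
```

With `RUN_VARIANT`, the model stays in PREPARE after the step that executes run{} and only reaches SAFE on the following `step()`; with `ENTRY_VARIANT`, it is already in SAFE after the first `step()`.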

[Bug 559] Seg fault on certain state machine transitions

Submitted by snrkiwi on Fri, 2008-05-30 14:20.

--- Comment #13 from S Roderick <kiwi [dot] net [..] ...> 2008-05-30 16:08:02 ---
(In reply to comment #12)
> Created an attachment (id=313)
> --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=313)
> Fixes crash and allows emitting events from entry (run still broken)

Confirm that events emitted during entry{} cause correct transitions.

Confirm that events during run{} still cause weird behaviour.

One step closer ...! :-))

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Fri, 2008-05-30 07:25.

--- Comment #12 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-05-30 09:16:08 ---
Created an attachment (id=313)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=313)
Fixes crash and allows emitting events from entry (run still broken)

I have enabled events from the start of the entry program and behold, the
transition of the report is taken correctly when the event is emitted from
entry{}. In case the state reacts to an event in entry, run{} is not executed.
In case the event is emitted from run{}, the faulty behavior remains.

This patch also temporarily enables some useful log messages that warn when an
event is dropped (and for what reason). Useful for debugging this further.

[Bug 559] Seg fault on certain state machine transitions

Submitted by snrkiwi on Thu, 2008-05-29 22:45.

--- Comment #11 from S Roderick <kiwi [dot] net [..] ...> 2008-05-30 00:33:22 ---
Sorry, didn't expect this to mushroom into so much work.

So when are events expected/allowed to arrive? Does this apply to events from
within this component, as well as from other components/state-machines? I guess
the real question is: when you wrote the semantics for a state machine, when
were you expecting events to arrive (and from where)?

If I can help please let me know. I'm actually stalled on this right now - I
can't find a workaround that doesn't use this feature ... :-(

Thanks
S

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Fri, 2008-05-30 20:15.

On Friday 30 May 2008 00:33:21 S Roderick wrote:
>
> So when are events expected/allowed to arrive? Does this apply to events
> from within this component, as well as from other
> components/state-machines? I guess the real question is: when you wrote the
> semantics for a state machine, when were you expecting events to arrive
> (and from where)?

The reason we aren't seeing events from run{} is that events are queued
and the transition conditions are checked right after run. Because the event
has not been processed yet, it is not seen, a different transition is made, and
only then is the event processed, too late and in another state.

Today the SM works like this:

Period 1:
Entry
Period 2:
Run -> Transitions? -> ( Handle ) | (Exit -> Entry )
Period 3:
Run -> Transitions? -> ( Handle ) | (Exit -> Entry )
... etc.

The Transitions step checks both event and condition transitions. If you want
to detect an event emitted in Run, you need to add a 'do nothing' statement to
delay the transition evaluation, so that the event has time to come in. This
also explains why an event emitted in entry is detected: its transition is
evaluated in the next period.

The idea behind the original scheme is that at the start of period 'n', you
execute the run program, check for a transition and make it if 'ok' OR
execute the handle program. This is actually a statemachine in a
statemachine...

You could write an alternative:
Period 1:
Entry -> Run
Period 2:
( Transitions ? (Exit -> Entry ) : (Handle) ) -> Run
Period 3:
( Transitions ? (Exit -> Entry ) : (Handle) ) -> Run
... etc

Which could be simplified by leaving out Handle:
Period 1:
Entry -> Run
Period 2:
Transitions ? (Exit -> Entry -> Run ) : Run
Period 3:
Transitions ? (Exit -> Entry -> Run ) : Run
... etc

But this clearly has different semantics. The current implementation allows
jumping from one state to another without executing Run. In the alternative
implementation, Run is always executed at least once, just after the state is
entered. I'm in favor of dropping handle, so I would dare to go for scheme
no. 3...

What's your opinion on which is best ? Is anyone using handle ?

Peter
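
The difference between the current scheme (Run, then Transitions) and scheme no. 3 (Transitions, then Run) can be made concrete with a toy simulation. The sketch below is Python, not RTT code. It assumes a PREPARE state like the one in comment #9, whose run{} emits noFault, with an event transition to SAFE and an unconditional fallback to POSITION_CONTROL. It further assumes that queued events are processed at the end of each period. The function name `simulate` and the scheme labels are hypothetical.

```python
def simulate(scheme, periods):
    """Return the state reached after `periods` periods.

    scheme is "run_then_check" (the current scheme) or
    "check_then_run" (scheme no. 3, without Handle).
    """
    if scheme not in ("run_then_check", "check_then_run"):
        raise ValueError(scheme)
    state = "PREPARE"
    queued = []   # events emitted by run{} but not yet processed
    seen = []     # events already processed, visible to the transition check

    def run():
        # Only PREPARE's run{} emits an event in this toy model.
        if state == "PREPARE":
            queued.append("noFault")

    def check_transitions():
        # Event transition first, otherwise the unconditional fallback.
        if state != "PREPARE":
            return state
        return "SAFE" if "noFault" in seen else "POSITION_CONTROL"

    for period in range(1, periods + 1):
        if period == 1:
            # Period 1 is Entry in both schemes; scheme no. 3 also runs.
            if scheme == "check_then_run":
                run()
        elif scheme == "run_then_check":
            run()
            state = check_transitions()
        else:
            state = check_transitions()
            run()
        # Assumption: queued events are processed at the end of the period.
        seen.extend(queued)
        queued.clear()
    return state
```

Under these assumptions, `simulate("run_then_check", 3)` ends in POSITION_CONTROL, because the event is still queued when the transitions are checked, while `simulate("check_then_run", 2)` ends in SAFE.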

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Thu, 2008-05-29 22:00.

--- Comment #10 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-05-29 23:48:27 ---
(In reply to comment #9)
> (In reply to comment #8)
> > Created an attachment (id=310)
> > --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=310)
> > Fixes both crash and weird run behaviour
>
> Seg fault disappears, but this still doesn't seem to be working right. In the
> following, prepareForUse() is a method that simply emits a noFault event. This
> event doesn't appear to get processed, instead the "impossible" transition is
> taken. Also, if we change prepareForUse() to emit a fault, we still end up
> taking the impossible transition.
>
> The latest patch for this bug has been applied, RTT was recompiled clean and
> installed, and the entire test suite recompiled.
>
> Any ideas?
>
> Also, did you come to any conclusion regarding events being discarded if fired
> in entry {} programs?

OK, emitting events from within a script was clearly not tested yet :-/. I
could reproduce the new bad behavior with noFault(). I'll try to enable events
from entry {} as well... but note also that events are discarded if the events
come in when a program (except run{}) is running. We could weaken that too, to
include entry{} as well, but this is such a complex mechanism that we'll need
additional unit tests to be sure that we don't break anything.

Peter

[Bug 559] Seg fault on certain state machine transitions

Submitted by snrkiwi on Thu, 2008-05-29 16:45.

--- Comment #9 from S Roderick <kiwi [dot] net [..] ...> 2008-05-29 18:36:13 ---
(In reply to comment #8)
> Created an attachment (id=310)
> --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=310)
> Fixes both crash and weird run behaviour

Seg fault disappears, but this still doesn't seem to be working right. In the
following, prepareForUse() is a method that simply emits a noFault event. This
event doesn't appear to get processed, instead the "impossible" transition is
taken. Also, if we change prepareForUse() to emit a fault, we still end up
taking the impossible transition.

The latest patch for this bug has been applied, RTT was recompiled clean and
installed, and the entire test suite recompiled.

Any ideas?

Also, did you come to any conclusion regarding events being discarded if fired
in entry {} programs?

Thanks
S

{{{

// keep work out of initial state, to keep it 'atomic'
initial state START_ROBOT
{
    entry { } // empty entry() program
    transitions { select PREPARE }
}

// first real state - prepare robot for use and transition to SAFE
state PREPARE
{
    // entry
    run
    {
        do prepareForUse()
    }
    // prepareForUse() completed successfully
    transition noFault() select SAFE
    // unable to prepare so stop the whole system
    transition faultDetected(faultMessage) select STOP_ROBOT
    transitions { select POSITION_CONTROL } // should be impossible
}

}}}

The test code (using cxxtest)
{{{
PA10nAxesControllerSim s("Robot");
SlaveActivity as(s.engine());
// m is a mockup component attached to the robot's ports, and am is its
// associated slave activity

TS_ASSERT(stateMachine->inInitialState());
TS_ASSERT_EQUALS(s.getState(), "UNKNOWN");
log(Info) << "pre state cycle" << endlog();
TS_ASSERT(as.execute());
TS_ASSERT(am.execute());
TS_ASSERT(stateMachine->inStrictState("PREPARE"));
log(Info) << "post state cycle" << endlog();
TS_ASSERT(as.execute());
TS_ASSERT(am.execute());
log(Info) << "post state cycle 2" << endlog();
// TS_ASSERT_EQUALS(s.getState(), "PREPARE"); // with run program
TS_ASSERT_EQUALS(s.getState(), "SAFE"); // with entry program
log(Info) << "state test" << endlog();
helpers::dumpStateMachineState(stateMachine);
}}}

And the associated log output
{{{
0.070 [ Info ][ParserScriptingAccess::loadStateMachine] Loading StateMachine 'PA10nAxesController'
0.070 [ Info ][Logger] pre state cycle
0.070 [ Info ][Logger] post state cycle
0.070 [ Debug ][Robot] Sim: prepareForUse()
0.070 [ Debug ][Robot] [Robot] No fault
0.070 [ Debug ][Logger] [Robot] No fault
0.070 [ Info ][Logger] post state cycle 2
0.070 [ Info ][Logger] state test
0.070 [ Debug ][Logger] State machine status: state=POSITION_CONTROL (ACTIVE Automatic Line=69)
}}}

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Wed, 2008-05-28 20:30.

--- Comment #8 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-05-28 22:19:39 ---
Created an attachment (id=310)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=310)
Fixes both crash and weird run behaviour

I finally found what caused the run-weirdness. A bug that I had found some
months ago appears not to have been fixed on trunk: All events must be
asynchronously processed in the StateMachine. Because this was not the case,
the event was processed immediately, executing the transition to SAFE and when
the stack unwound, the run program of the previous state just continued.

What happens after this patch is: the method in the run program fires the
event, the event is queued and the run program finishes. Then the next periodic
step, events are processed and the transition is made.

In case your run program could not finish before the event is processed
(because of a command), the remainder of the run program is aborted.
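
The ordering problem described here can be illustrated with a few lines of Python. This is a toy sketch, not RTT code: `run_prepare`, `synchronous_trace` and `queued_trace` are hypothetical names, and the log strings are borrowed from the output quoted in comment #6. The run program emits an event and then executes one more statement. In the first variant the event is handled inside the emitter's call stack, in the second it is queued and handled afterwards.

```python
from collections import deque


def run_prepare(emit, log):
    """Stand-in for: do prepareForUse(); do inState("PREPARERUN")."""
    emit("faultDetected")               # prepareForUse() emits the event
    log.append("In state: PREPARERUN")  # the next statement of run{}


def synchronous_trace():
    """Event handled immediately, while run{} is still on the stack."""
    log = []

    def emit(event):
        # The transition to SAFE (and its entry{}) executes right here.
        log.append("In state: SAFE")

    run_prepare(emit, log)              # run{} continues after emit() returns
    return log


def queued_trace():
    """Event queued by run{} and processed in a later step."""
    log = []
    queue = deque()
    run_prepare(queue.append, log)      # run{} finishes first
    while queue:                        # next step: process queued events
        queue.popleft()
        log.append("In state: SAFE")
    return log
```

In this model, `synchronous_trace()` logs SAFE before PREPARERUN, which is the same ordering as the bad version reported in comment #6, while `queued_trace()` logs PREPARERUN first and SAFE second.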

[Bug 559] Seg fault on certain state machine transitions

Submitted by snrkiwi on Wed, 2008-05-28 19:45.

--- Comment #7 from S Roderick <kiwi [dot] net [..] ...> 2008-05-28 21:35:17 ---
(In reply to comment #6)
> (In reply to comment #5)
> > More interesting behaviour ... :-(
> >
> > In the following, if a fault event is thrown in startPositionHold() then the
> > next state is SAFE. Correct. Good!
>
> [..]
>
> It seems the run program is executed anyway after the safe state has been
> entered... I get in the good version:
>
> In state: PREPARERUN
> In state: SAFE
>
> and in the bad version:
>
> In state: SAFE
> In state: PREPARERUN
>
>
> In Task CrashSM[R]. (Status of last Command : done )
> (type 'ls' for context info) :list sm
> 23 run
> 24 {
> 25 do prepareForUse()
> 26 do inState("PREPARERUN")
> 27 }
> 28 /// ##### seg-fault occurs when changing this one #####
> 29 transition faultDetected(faultMessage) select SAFE
> 30 // transitions { select SAFE }
> 31 }
> 32
> R> 33 state SAFE
> 34 {
> 35 entry
> 36 {
> 37 // do stopAllAxes()
> 38 // do lockAllAxes()
> 39 do inState("SAFE")
> 40 }
> 41 exit
> 42 {
>
> So the run program was not completely 'erased'.
>

Sorry, but I don't quite understand what you're getting at here. This is even
after your patch? Is this a problem at my end, or with Orocos? Can I help in
any way?

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Wed, 2008-05-28 16:05.

--- Comment #6 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-05-28 17:53:56 ---
(In reply to comment #5)
> More interesting behaviour ... :-(
>
> In the following, if a fault event is thrown in startPositionHold() then the
> next state is SAFE. Correct. Good!

[..]

It seems the run program is executed anyway after the safe state has been
entered... I get in the good version:

In state: PREPARERUN
In state: SAFE

and in the bad version:

In state: SAFE
In state: PREPARERUN

In Task CrashSM[R]. (Status of last Command : done )
(type 'ls' for context info) :list sm
23 run
24 {
25 do prepareForUse()
26 do inState("PREPARERUN")
27 }
28 /// ##### seg-fault occurs when changing this one #####
29 transition faultDetected(faultMessage) select SAFE
30 // transitions { select SAFE }
31 }
32
R> 33 state SAFE
34 {
35 entry
36 {
37 // do stopAllAxes()
38 // do lockAllAxes()
39 do inState("SAFE")
40 }
41 exit
42 {

So the run program was not completely 'erased'.

[Bug 559] Seg fault on certain state machine transitions

Submitted by snrkiwi on Wed, 2008-05-28 12:50.

--- Comment #5 from S Roderick <kiwi [dot] net [..] ...> 2008-05-28 14:39:18 ---
More interesting behaviour ... :-(

In the following, if a fault event is thrown in startPositionHold() then the
next state is SAFE. Correct. Good!

{{{
state POSITION_HOLD
{
    // entry
    run
    {
        do inState("POSITION_HOLD")
        do startPositionHold()
    }
    exit
    {
        do stopPositionHold()
    }
    transition requestSafe() select SAFE
    transition faultDetected(faultMessage) select SAFE
    transition requestPositionControl() select POSITION_CONTROL
}
}}}

But if you swap the two method calls inside of the run program, then the fault
event has no effect and the state machine remains in POSITION_HOLD. Wrong. Bad!
:-(

{{{
state POSITION_HOLD
{
    // entry
    run
    {
        do startPositionHold()
        do inState("POSITION_HOLD")
    }
    exit
    {
        do stopPositionHold()
    }
    transition requestSafe() select SAFE
    transition faultDetected(faultMessage) select SAFE
    transition requestPositionControl() select POSITION_CONTROL
}
}}}

[Bug 559] Seg fault on certain state machine transitions

Submitted by snrkiwi on Wed, 2008-05-28 12:45.

--- Comment #4 from S Roderick <kiwi [dot] net [..] ...> 2008-05-28 14:34:44 ---
(In reply to comment #3)
> Created an attachment (id=309)
> --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=309)
> Fixes the crash in this bug

Looking at this patch, could there also be a similar problem in the same fixed
function when currentProg==0, inside of the "if (stepping)" further up?

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Tue, 2008-05-27 20:15.

--- Comment #3 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-05-27 22:04:46 ---
Created an attachment (id=309)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=309)
Fixes the crash in this bug

I could reproduce this bug. When an event is emitted during the run {} program
the run program is aborted and the event is processed. Meaning, every statement
after 'prepareForUse()' would not be executed. I'm (again) not sure if this is
exactly the desired behaviour...

Peter

Appears to fix the problem.

Submitted by snrkiwi on Wed, 2008-05-28 01:25.

Appears to fix the problem. Thanks!

Re events emitted during a run program. I guess I think of the state machine as having run-to-completion semantics, so the event is queued for processing after the run() or handle() program has completed. This is most definitely not the only way to code state machines, and from memory UML is a little vague on this (isn't it?).

I see in the docs now, where it says that events aren't enabled until run/handle. Hmmm ... so how would you recommend handling a failure of a method/command during an entry program? We played with using "try xxx catch select yyy" but had no joy.

[Bug 559] Seg fault on certain state machine transitions

Submitted by sspr on Tue, 2008-05-27 19:45.

Peter Soetens
<peter [dot] soetens [..] ...> changed:

What |Removed |Added
--------------------------------------------------------------------------
CC| |peter [dot] soetens [..] ...

--- Comment #2 from Peter Soetens
<peter [dot] soetens [..] ...> 2008-05-27 21:36:01 ---
(In reply to comment #0)
>
> Also, if we remove the run() program from the PREPARE state and uncomment the
> prepareForUse() in the entry program (which is the way we wanted it to work), a
> faultDetected event emitted during the prepareForUse() does not cause the
> transition to STOP_ROBOT (or SAFE if using that) to occur. The event just seems
> to disappear. Is there something wrong with our state machine formulation?

The way statemachines work now is that events are only accepted after the entry
{} program has been executed and before the exit {} program is executed. An
event during entry is discarded. I'm not sure if this constraint is still
necessary, as the event transition program could be 'queued' for execution just
after entry. It looks possible at least.

I need more time/ a simple unit test to look into the crash...

Peter
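
The acceptance window described in this comment (events are accepted only after entry{} has finished and before exit{} starts) amounts to a simple gate. The sketch below is a toy Python model of that pre-fix behaviour, not RTT code; the class `GatedState` and its attributes are hypothetical names.

```python
class GatedState:
    """Toy model: events count only between the end of entry{} and the start of exit{}."""

    def __init__(self):
        self.events_enabled = False
        self.accepted = []
        self.dropped = []

    def emit(self, event):
        # Outside the window the event is silently discarded.
        if self.events_enabled:
            self.accepted.append(event)
        else:
            self.dropped.append(event)

    def enter(self, entry_program):
        entry_program(self)          # events emitted here are lost
        self.events_enabled = True   # window opens only after entry{}

    def leave(self, exit_program):
        self.events_enabled = False  # window closes before exit{}
        exit_program(self)
```

For example, calling `enter(lambda s: s.emit("faultDetected"))` on a fresh `GatedState` leaves the event in `dropped`, which mirrors the disappearing event from the original report; the same `emit` call made after `enter` has returned ends up in `accepted`.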

[Bug 559] Seg fault on certain state machine transitions

Submitted by snrkiwi on Tue, 2008-05-27 17:00.

--- Comment #1 from S Roderick <kiwi [dot] net [..] ...> 2008-05-27 18:49:29 ---
Created an attachment (id=308)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=308)
Orocos log associated with backtrace