[Bug 653] New: 100% CPU usage when Timer class is used

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

Summary: 100% CPU usage when Timer class is used
Product: RTT
Version: 1.6.0
Platform: i386 Compatible
OS/Version: GNU/Linux
Status: NEW
Severity: normal
Priority: P3
Component: Real-Time Toolkit (RTT)
AssignedTo: orocos-dev [..] ...
ReportedBy: steven [dot] kauffmann [..] ...
CC: orocos-dev [..] ...
Estimated Hours: 0.0

Created an attachment (id=423)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=423)
Simple test that reproduces the 100% CPU usage

Hi,

See the attachment for an example that reproduces the problem. The problem only
occurs when running on gnulinux (on Xenomai it works OK; not tested on RTAI).

Steven

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

Peter Soetens <peter [..] ...> changed:

What              |Removed            |Added
----------------------------------------------------------------------------
Resolution        |                   |FIXED
Status            |ASSIGNED           |RESOLVED

--- Comment #9 from Peter Soetens <peter [..] ...> 2009-05-28 13:42:46 ---
Patch accepted, will backport to 1.6 and 1.8 branches.

Peter

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

--- Comment #8 from Peter Soetens <peter [..] ...> 2009-05-21 14:55:09 ---
(In reply to comment #5)
> Created an attachment (id=425)
> --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=425)
> Fixes the bug, but not a good solution
>
> By changing clock_monotonic back into clock_realtime, one can verify this bug
> is fixed. However, it ain't a clean solution (see previous comment about the
> patch that "caused" this bug)

I would still propose to accept your solution, because there is no easy way to
fix it for gnulinux. All semaphore implementations only accept times since
EPOCH, which means they are sensitive to system time changes (caused by ntpd or
the user).
Considering that it is gnulinux and that the typical scenario is only very
small changes, I'd prefer an on-average good solution over a broken
implementation.

About the long-term solution:
Other systems based on file descriptors and select(), like timerfd, allow one
to specify the clock type, but they would need to be used in combination with
eventfd in order to simulate timed semaphores. Using file descriptors has
tremendous advantages, but I don't feel comfortable introducing this change
into 1.x.
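
For reference, a minimal sketch of that idea (hypothetical and untested, not
RTT code; it assumes the "semaphore" side is an eventfd created with
EFD_SEMAPHORE, and all names are illustrative):

#include <sys/eventfd.h>
#include <sys/select.h>
#include <sys/timerfd.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Wait on an eventfd-based "semaphore" until an absolute CLOCK_MONOTONIC
   deadline.  Returns 0 when signalled in time, -1 on timeout or error. */
static int fd_sem_wait_until(int sem_fd, const struct timespec *abs_deadline)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);   /* clock chosen explicitly */
    if (tfd < 0)
        return -1;

    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_value = *abs_deadline;                   /* one-shot, absolute time */
    if (timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL) < 0) {
        close(tfd);
        return -1;
    }

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(sem_fd, &rfds);
    FD_SET(tfd, &rfds);
    int nfds = (sem_fd > tfd ? sem_fd : tfd) + 1;

    int result = -1;
    if (select(nfds, &rfds, NULL, NULL, NULL) > 0 && FD_ISSET(sem_fd, &rfds)) {
        uint64_t val;
        if (read(sem_fd, &val, sizeof(val)) == (ssize_t)sizeof(val))
            result = 0;                             /* "took" the semaphore */
    }
    close(tfd);
    return result;
}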

Any objections to applying Klaas' patch?

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

On May 21, 2009, at 08:55, Peter Soetens wrote:

> https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653
>
> --- Comment #8 from Peter Soetens <peter [..] ...> 2009-05-21 14:55:09 ---
> (In reply to comment #5)
>> Created an attachment (id=425)
>> --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=425)
>> Fixes the bug, but not a good solution
>>
>> By changing clock_monotonic back into clock_realtime, one can verify this
>> bug is fixed. However, it ain't a clean solution (see previous comment
>> about the patch that "caused" this bug)
>
> I would still propose to accept your solution, because there is no easy way
> to fix it for gnulinux. All semaphore implementations only accept times
> since EPOCH, which means they are sensitive to system time changes (caused
> by ntpd or the user).
> Considering that it is gnulinux and that the typical scenario is only very
> small changes, I'd prefer an on-average good solution over a broken
> implementation.
>
> About the long-term solution:
> Other systems based on file descriptors and select(), like timerfd, allow
> one to specify the clock type, but they would need to be used in combination
> with eventfd in order to simulate timed semaphores. Using file descriptors
> has tremendous advantages, but I don't feel comfortable introducing this
> change into 1.x.
>
> Any objections to applying Klaas' patch?

I have no problem with this.

The rtos_sem_wait_timed() approach is definitely a bad idea, but I agree with
an on-average good solution for this OS. Where will this be documented?
Do we have per-OS technical notes/issues?

How widely used is rtos_sem_wait_timed()?

S

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

Peter Soetens <peter [..] ...> changed:

What              |Removed            |Added
----------------------------------------------------------------------------
CC                |                   |peter [..] ...

--- Comment #7 from Peter Soetens <peter [..] ...> 2009-05-20 14:34:33 ---
(In reply to comment #6)
> (In reply to comment #5)
> > Created an attachment (id=425)
> > --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=425)
> > Fixes the bug, but not a good solution
> >
> > By changing clock_monotonic back into clock_realtime, one can verify this bug
> > is fixed. However, it ain't a clean solution (see previous comment about the
> > patch that "caused" this bug)
>
> While thinking of a better solution (using the relative-delay sem_wait_timed
> version somehow, or converting the time from one clock into another), it
> struck me that there is no locking involved in the implementation of
> rtos_sem_wait_timed:
>
> static inline int rtos_sem_wait_timed(rt_sem_t* m, NANO_TIME delay )
> {
>     TIME_SPEC timevl, delayvl;
>     clock_gettime(CLOCK_REALTIME, &timevl);
>     delayvl = ticks2timespec(delay);
>
>     // add current time with delay, detect&correct overflows.
>     timevl.tv_sec += delayvl.tv_sec;
>     timevl.tv_nsec += delayvl.tv_nsec;
>     if ( timevl.tv_nsec >= 1000000000) {
>         ++timevl.tv_sec;
>         timevl.tv_nsec -= 1000000000;
>     }
>
>     assert( timevl.tv_nsec < 1000000000 );
>
>     return sem_timedwait( m, &timevl);
> }
>
> I would think that, if the thread executing this code is preempted somewhere
> along these lines, we don't do what we pretend (although I'm not sure what we
> are pretending here, i.e. what this function is supposed to do and from which
> line of code the delay should be interpreted, and I'm not sure this is a
> catastrophe).
> Thoughts, anyone? Is this another possible race condition or am I overlooking
> something?

Your analysis is correct. Using the 'delayed' function is bad practice, because
the wait time is virtually unbounded (whatever that may mean): unless you're
the highest-priority thread, you may 'sleep' much longer than anticipated. An
absolute time should be used, i.e. sem_wait_until(abstime). We'd better remove
the delayed function from the API in RTT 2.0...
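
For illustration, a minimal sketch of what such an absolute-time variant could
look like in the gnulinux fosi (the name and types mirror the snippet quoted
above; this is not the actual RTT implementation):

static inline int rtos_sem_wait_until(rt_sem_t* m, NANO_TIME abs_time)
{
    // The caller computes the absolute deadline once, on the clock that
    // sem_timedwait() uses.  A preemption between computing the deadline
    // and entering this call no longer shifts the effective wake-up time,
    // unlike the relative-delay version quoted above.
    TIME_SPEC arg_time = ticks2timespec(abs_time);
    return sem_timedwait(m, &arg_time);
}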

But that doesn't solve the clock problems we have at the moment, which we need
to fix urgently.

Peter

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

--- Comment #6 from Klaas Gadeyne <klaas [dot] gadeyne [..] ...> 2009-05-15 17:46:06 ---
(In reply to comment #5)
> Created an attachment (id=425)
> --> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=425)
> Fixes the bug, but not a good solution
>
> By changing clock_monotonic back into clock_realtime, one can verify this bug
> is fixed. However, it ain't a clean solution (see previous comment about the
> patch that "caused" this bug)

While thinking of a better solution (using the relative-delay sem_wait_timed
version somehow, or converting the time from one clock into another), it struck
me that there is no locking involved in the implementation of
rtos_sem_wait_timed:

static inline int rtos_sem_wait_timed(rt_sem_t* m, NANO_TIME delay )
{
    TIME_SPEC timevl, delayvl;
    clock_gettime(CLOCK_REALTIME, &timevl);
    delayvl = ticks2timespec(delay);

    // add current time with delay, detect&correct overflows.
    timevl.tv_sec += delayvl.tv_sec;
    timevl.tv_nsec += delayvl.tv_nsec;
    if ( timevl.tv_nsec >= 1000000000) {
        ++timevl.tv_sec;
        timevl.tv_nsec -= 1000000000;
    }

    assert( timevl.tv_nsec < 1000000000 );

    return sem_timedwait( m, &timevl);
}

I would think that, if the thread executing this code is preempted somewhere
along these lines, we don't do what we pretend (although I'm not sure what we
are pretending here, i.e. what this function is supposed to do and from which
line of code the delay should be interpreted, and I'm not sure this is a
catastrophe).
Thoughts, anyone? Is this another possible race condition or am I overlooking
something?

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

--- Comment #5 from Klaas Gadeyne <klaas [dot] gadeyne [..] ...> 2009-05-15 17:26:42 ---
Created an attachment (id=425)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=425)
Fixes the bug, but not a good solution

By changing clock_monotonic back into clock_realtime, one can verify this bug
is fixed. However, it ain't a clean solution (see previous comment about the
patch that "caused" this bug)

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

--- Comment #4 from Klaas Gadeyne <klaas [dot] gadeyne [..] ...> 2009-05-15 11:27:43 ---
(In reply to comment #3)
> In your test program, the current time is calculated using clock_monotonic,
> whereas the sem_wait value is based on clock_realtime. This seems to be the
> cause of the bug.

Seems to be a regression caused by the patch for
<https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=565>

[Bug 653] 100% CPU usage when Timer class is used

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

--- Comment #2 from Klaas Gadeyne <klaas [dot] gadeyne [..] ...> 2009-05-15 10:50:22 ---
Created an attachment (id=424)
--> (https://www.fmtc.be/bugzilla/orocos/attachment.cgi?id=424)
Demonstrates that the values reported by the 2 clocks are incompatible
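
For readers without access to the attachment, a minimal sketch in the same
spirit (hypothetical, not the actual attachment): it prints both clocks so the
offset between them is obvious.

#include <stdio.h>
#include <time.h>

/* CLOCK_MONOTONIC counts from an arbitrary origin (typically boot), while
   CLOCK_REALTIME counts from the epoch; a value from one clock is meaningless
   as a deadline for a wait that interprets it on the other clock. */
int main(void)
{
    struct timespec mono, real;
    clock_gettime(CLOCK_MONOTONIC, &mono);
    clock_gettime(CLOCK_REALTIME, &real);
    printf("CLOCK_MONOTONIC: %ld s\n", (long)mono.tv_sec);
    printf("CLOCK_REALTIME : %ld s\n", (long)real.tv_sec);
    return 0;
}

(Compile with -lrt; clock_gettime lives in librt with the glibc of that era.)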

[Bug 653] GnuLinux port mixes incompatible clock_monotonic and clock_realtime

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

Klaas Gadeyne <klaas [dot] gadeyne [..] ...> changed:

What              |Removed                    |Added
----------------------------------------------------------------------------
Summary           |100% CPU usage when        |GnuLinux port mixes
                  |Timer class is used        |incompatible clock_monotonic
                  |                           |and clock_realtime

--- Comment #3 from Klaas Gadeyne <klaas [dot] gadeyne [..] ...> 2009-05-15 10:50:47 ---
In your test program, the current time is calculated using clock_monotonic,
whereas the sem_wait value is based on clock_realtime. This seems to be the
cause of the bug.

[Bug 653] 100% CPU usage when Timer class is used

https://www.fmtc.be/bugzilla/orocos/show_bug.cgi?id=653

Klaas Gadeyne <klaas [dot] gadeyne [..] ...> changed:

What              |Removed            |Added
----------------------------------------------------------------------------
Target Milestone  |---                |1.6.3
Status            |NEW                |ASSIGNED

--- Comment #1 from Klaas Gadeyne <klaas [dot] gadeyne [..] ...> 2009-05-15 10:39:10 ---
After a first investigation, this seems to be related to incompatible time
format conventions. The absolute time which is passed to rtos_sem_wait_until is
nonsense and not related at all to the value of clock_gettime.
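
For illustration, a self-contained example of that mismatch (hypothetical code,
not taken from RTT): a deadline built from CLOCK_MONOTONIC is decades in the
past as far as sem_timedwait() is concerned, so the wait returns ETIMEDOUT
immediately and a periodic loop built on top of it spins at 100% CPU.

#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    sem_t sem;
    sem_init(&sem, 0, 0);

    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);  // seconds since boot
    deadline.tv_sec += 1;                       // intended: wake up in 1 second

    // sem_timedwait() interprets the deadline against CLOCK_REALTIME
    // (seconds since 1970), so it has long expired and the call returns
    // immediately instead of blocking.
    if (sem_timedwait(&sem, &deadline) == -1 && errno == ETIMEDOUT)
        printf("returned immediately with ETIMEDOUT\n");
    return 0;
}

(Link with -lrt -lpthread on the glibc of that era.)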

I'm on it.

P.S. Probably other versions are affected too.