RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 10:28

Orocos-users

Hello,

I'm looking to add real-time support to my Linux system and I would like to
have your opinion on RTAI and Xenomai. Which is better? In the past, I
personally had some trouble getting RTAI to work. On the other hand, I've
never used Xenomai, but it seems much more active and better supported.

Thank you for the advice!

Philippe Hamelin

RTAI or Xenomai ?

Submitted by peter on Mon, 2009-11-02 10:28.

On Thu, Oct 29, 2009 at 18:06, Philippe Hamelin
<philippe [dot] hamelin [..] ...> wrote:
> Hello,
>
> I'm looking to add real-time support to my Linux system and I would like to
> have your opinion on RTAI and Xenomai. Which is better? In the past, I
> personally had some trouble getting RTAI to work. On the other hand, I've
> never used Xenomai, but it seems much more active and better supported.
>
> Thank you for the advice!

Both projects are running on tight resources. Xenomai has benefited
from the years-long experience in real-time Linux of Philippe Gerum,
Gilles Chanteperdrix and Jan Kiszka (and many other contributors). RTAI
has a long history too and has an edge with regard to certain device
drivers (mainly comedi). Both projects have been bitten by extremely
annoying stability bugs, and both seem to have resolved them in the end.
With all respect for RTAI, I'd try Xenomai first, but if it is not
sufficient, fall back on RTAI without feeling doomed :-)

In both cases it is advised to pick a version and test, test, test it (i.e.,
run many test cases in parallel, deliberately segfault rt-processes
in random places, run heavy X applications like Eclipse, compile a
kernel, etc.) and observe the stability of the system. Then stick
with that version unless you *absolutely* need to upgrade. You'll have
to redo your testing if you change the kernel, RTAI or Xenomai version.

In case you wonder, both have comparable performance. Xenomai does
more sanity checking of the user input (function arguments *and*
kernel configuration) than RTAI.

Peter

RTAI or Xenomai ?

Submitted by Klaas Gadeyne on Mon, 2009-11-02 10:28.

On Fri, Oct 30, 2009 at 2:55 PM, Peter Soetens <peter [..] ...> wrote:
> On Thu, Oct 29, 2009 at 18:06, Philippe Hamelin
> <philippe [dot] hamelin [..] ...> wrote:
>> Hello,
>>
>> I'm looking to add real-time support to my Linux system and I would like to
>> have your opinion on RTAI and Xenomai. Which is better? In the past, I
>> personally had some trouble getting RTAI to work. On the other hand, I've
>> never used Xenomai, but it seems much more active and better supported.

What are your requirements (device drivers, documentation, stable
release support, ...)?

Klaas

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 10:28.

That is a good point to clarify, Klaas. First, let me describe my
problem. I currently have an Orocos application with about 20 components. I
have a master component with a periodic activity which calls the other 19
slave activities at 500 Hz (gnulinux, ORO_SCHED_RT). Currently, this takes
about 3% CPU, so there is no major computational effort. When I push the
frequency up to 1 kHz, the application suddenly goes to 100% CPU
utilization. I guess the problem comes from the gnulinux scheduler (Debian
lenny vanilla kernel), which is not able to deal with this setup. So this
problem has led me to look for another scheduler which may help me
get down to a lower CPU utilization.

Currently, we use custom device drivers that write directly to I/O. We need
a solution which is easy to maintain by any ordinary programmer who is not
specialized in real-time Linux. Also, the solution should be stable because
it is to be used in an industrial product.

Philippe

RTAI or Xenomai ?

Submitted by Klaas Gadeyne on Mon, 2009-11-02 10:28.

On Fri, Oct 30, 2009 at 4:19 PM, Philippe Hamelin
<philippe [dot] hamelin [..] ...> wrote:
> That is a good point to clarify, Klaas. First, let me describe my
> problem. I currently have an Orocos application with about 20 components. I
> have a master component with a periodic activity which calls the other 19
> slave activities at 500 Hz (gnulinux, ORO_SCHED_RT). Currently, this takes
> about 3% CPU, so there is no major computational effort. When I push the
> frequency up to 1 kHz, the application suddenly goes to 100% CPU
> utilization. I guess the problem comes from the gnulinux scheduler (Debian
> lenny vanilla kernel), which is not able to deal with this setup. So this
> problem has led me to look for another scheduler which may help me
> get down to a lower CPU utilization.

I don't know if your guess is correct, but a smaller (and faster to
check) step might be to try
<http://www.pengutronix.de/software/linux-rt/debian_en.html>

> Currently, we use custom device drivers that write directly to I/O. We need
> a solution which is easy to maintain by any ordinary programmer who is not
> specialized in real-time Linux. Also, the solution should be stable because
> it is to be used in an industrial product.

_If_ your device drivers are using plain Linux system calls, neither
Xenomai nor RTAI is going to help you unless you rewrite them (e.g. using
RTDM). And to me, that means you'll need a programmer who is aware
of (possible) real-time issues.

Klaas

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 10:28.

Thank you Klaas, I will give it a try!

RTAI or Xenomai ?

Submitted by peter on Mon, 2009-11-02 10:28.

On Fri, Oct 30, 2009 at 16:19, Philippe Hamelin
<philippe [dot] hamelin [..] ...> wrote:
> That is a good point to clarify, Klaas. First, let me describe my
> problem. I currently have an Orocos application with about 20 components. I
> have a master component with a periodic activity which calls the other 19
> slave activities at 500 Hz (gnulinux, ORO_SCHED_RT). Currently, this takes
> about 3% CPU, so there is no major computational effort. When I push the
> frequency up to 1 kHz, the application suddenly goes to 100% CPU
> utilization. I guess the problem comes from the gnulinux scheduler (Debian
> lenny vanilla kernel), which is not able to deal with this setup. So this
> problem has led me to look for another scheduler which may help me
> get down to a lower CPU utilization.

top is notorious for reporting wrong CPU utilization numbers when
using SCHED_FIFO. The numbers are complete nonsense and are indeed
typically 100%.

If you care, you'll need a better way of measuring, for example, by
using the (dark and hidden) Orocos Thread Scope.
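
As a generic alternative to reading top (a plain POSIX sketch, not the Orocos Thread Scope), a thread can sample its own CPU time with clock_gettime(CLOCK_THREAD_CPUTIME_ID) and compare it with a monotonic wall clock. The names Sample, takeSample and utilization below are hypothetical helpers invented for this illustration.

```cpp
#include <time.h>

// Convert a timespec to seconds.
static double toSeconds(const timespec& ts)
{
    return static_cast<double>(ts.tv_sec) + static_cast<double>(ts.tv_nsec) * 1e-9;
}

// One measurement point: CPU time used by the calling thread and wall time.
// Both fields are -1.0 if the corresponding clock could not be read.
struct Sample {
    double cpu;
    double wall;
};

// Hypothetical helper: read both clocks for the calling thread.
Sample takeSample()
{
    Sample s{-1.0, -1.0};
    timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) == 0)
        s.cpu = toSeconds(ts);
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
        s.wall = toSeconds(ts);
    return s;
}

// Fraction of one CPU used by the thread between two samples (0.0 .. ~1.0).
double utilization(const Sample& begin, const Sample& end)
{
    const double wall = end.wall - begin.wall;
    if (wall <= 0.0)
        return 0.0;
    return (end.cpu - begin.cpu) / wall;
}
```

Taking one sample at the start of a run and one at the end, from inside the periodic thread itself, gives the share of the CPU that this thread really consumed, independently of what top displays.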

Peter

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 10:28.

Hello,
after some investigation I found some interesting things.

I measured the execution time of all slave activities:

void MasterComponent::updateHook()
{
    // Timestamp taken before the slaves run.
    referenceTime = TimeService::Instance()->getTicks();

    // All slave activities are inside the timed region.
    getPeer("slaveA")->getActivity()->execute();
    getPeer("slaveB")->getActivity()->execute();
    getPeer("slaveC")->getActivity()->execute();

    // Elapsed time in seconds since the timestamp.
    processTime = TimeService::Instance()->secondsSince(referenceTime);
}

and I found that I'm using 0.9 ms while my periodic activity has a 1 ms
period. So I'm effectively using most of the CPU.
Then I wanted to know which component is taking so much CPU. I first tested
component "slaveA":

void MasterComponent::updateHook()
{
    referenceTime = TimeService::Instance()->getTicks();

    // Only slaveA is inside the timed region.
    getPeer("slaveA")->getActivity()->execute();

    processTime = TimeService::Instance()->secondsSince(referenceTime);

    // The remaining slaves run outside the measurement.
    getPeer("slaveB")->getActivity()->execute();
    getPeer("slaveC")->getActivity()->execute();
}

I found that the process time of slaveA was 0.15 ms. The component "slaveA"
just has some code in its updateHook() (no FSM, no programs...). I removed
all the code in the updateHook() of "slaveA" and then I measured again the
process time of the single line:

    getPeer("slaveA")->getActivity()->execute();

I found that just calling the (empty) slave activity takes about 0.1 ms. My
system has in reality about 10 slave activities which are called by the
master component, so TotalTime = 10 * 0.1 ms = 1 ms. It means that it takes
100% CPU @ 1 kHz just to call empty components.

Do you think that this overhead comes from Orocos or from the "context
switching" time of Linux? Does each slave activity have its own thread, or are
the slave activities executed in the calling thread? If the slave activity is
called inside the master thread, I don't see where this switching time could
come from.

Also, I tested the RT kernel from
http://www.pengutronix.de/software/linux-rt/debian_en.html and I had similar
results. I'm using a Celeron M 1 GHz with 512 MB of DDR.

Philippe

RTAI or Xenomai ?

Submitted by peter on Mon, 2009-11-02 10:28.

On Fri, Oct 30, 2009 at 20:36, Philippe Hamelin
<philippe [dot] hamelin [..] ...> wrote:
> Hello,
> after some investigation I found some interesting things.
>
> I measured the execution time of all slave activities:
>
> void MasterComponent::updateHook()
> {
>     referenceTime = TimeService::Instance()->getTicks();
>
>     getPeer("slaveA")->getActivity()->execute();
>     getPeer("slaveB")->getActivity()->execute();
>     getPeer("slaveC")->getActivity()->execute();
>
>     processTime = TimeService::Instance()->secondsSince(referenceTime);
> }
>
> and I found that I'm using 0.9 ms while my periodic activity has a 1 ms
> period. So I'm effectively using most of the CPU.
> Then I wanted to know which component is taking so much CPU. I first tested
> component "slaveA":
>
> void MasterComponent::updateHook()
> {
>     referenceTime = TimeService::Instance()->getTicks();
>
>     getPeer("slaveA")->getActivity()->execute();
>
>     processTime = TimeService::Instance()->secondsSince(referenceTime);
>
>     getPeer("slaveB")->getActivity()->execute();
>     getPeer("slaveC")->getActivity()->execute();
> }
>
> I found that the process time of slaveA was 0.15 ms. The component "slaveA"
> just has some code in its updateHook() (no FSM, no programs...). I removed
> all the code in the updateHook() of "slaveA" and then I measured again the
> process time of the single line:
>
>     getPeer("slaveA")->getActivity()->execute();
>
> I found that just calling the (empty) slave activity takes about 0.1 ms. My
> system has in reality about 10 slave activities which are called by the
> master component, so TotalTime = 10 * 0.1 ms = 1 ms. It means that it takes
> 100% CPU @ 1 kHz just to call empty components.
>
> Do you think that this overhead comes from Orocos or from the "context
> switching" time of Linux? Does each slave activity have its own thread, or are
> the slave activities executed in the calling thread? If the slave activity is
> called inside the master thread, I don't see where this switching time could
> come from.

That's the point, slaveActivity doesn't have any thread switching.
It's supposed to be the most efficient way, and 100us for executing
nothing is a 'little' bit too much. SlaveActivity is an empty box that
just calls step() on the ExecutionEngine, which steps 4 'processors'
(program, state machine, event, command) and the updateHook(). If all
these have nothing to do, they should need very little time too.
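
To illustrate the idea of "no thread switching", here is a minimal sketch of a slave-style activity. It is not the RTT implementation: Engine and SlaveRunner are hypothetical stand-ins, and the only point is that execute() is an ordinary function call that runs in the caller's thread.

```cpp
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an execution engine: an ordered list of steps
// (in the real engine these would be the processors and the updateHook()).
class Engine {
public:
    void addStep(std::function<void()> step) { steps_.push_back(std::move(step)); }

    // Run every registered step once, in order, in the calling thread.
    void step()
    {
        for (auto& s : steps_)
            s();
    }

private:
    std::vector<std::function<void()>> steps_;
};

// Hypothetical slave-style activity: it owns no thread of its own.
class SlaveRunner {
public:
    explicit SlaveRunner(Engine& engine) : engine_(engine) {}

    // A plain synchronous call; no scheduler or context switch is involved.
    bool execute()
    {
        engine_.step();
        return true;
    }

private:
    Engine& engine_;
};
```

If the steps have nothing to do, the cost of execute() in such a design is a handful of function calls, which is why a figure in the 100 us range looks suspicious.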

There are clearly only two explanations for this:
1. There is a bug in the EE or one of its processors which causes this slowness
2. The measurement is wrong, i.e. you are measuring something else.

During my thesis (2005) I did extensive measurements of the time spent
in the EE, data ports, events, commands etc. under different loads.
This was on a 700MHz PIII with 128KB L2 cache and 128MB RAM. Time
measurement itself took 3.2 us. The measurements had threads running
at 2KHz, 1KHz and 500Hz and a not-realtime thread at 100Hz. System
load was very acceptable, even though I was logging all measurements.

Summarized, I measured the following (corrected for measurement times):
* Data flow read or write: lock-based: 10us, lock-free: 3us
* Sending a command: 3us, processing a command: 3us (both lock-free)
* Emitting event and processing event: same as command (also lock-free)

You can see the measurements yourself from:
http://www.mech.kuleuven.be/dept/resources/docs/soetens.pdf

This thesis is *not* recommended reading to learn Orocos, but the
measurements are in there (look for sections titled 'Validation' and
Appendix A). Not that much has changed since then, although I believe
I had to robustify the lock-free algorithms, at the expense of some
more computational overhead. I also see that I didn't measure the
scripting overhead, but checking for empty lists won't be much
different than checking an empty command queue.

I still don't understand how you get from 3% to 100% by just
doubling the frequency of a single thread. It should have been 6%.
Simple.

Peter

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 15:16.

Hello,
of course the %CPU given by top is erroneous. The %CPU should be about 50% @
500 Hz and 100% @ 1000 Hz. I wrote a sample program to measure the time
required to execute the slave activity of an empty TaskContext. I used
rtt/TimeService to calculate the average time over 1000 calls. The time
obtained on a Celeron M 1 GHz is 30 microseconds. In the case of my ~20-slave
application, this means that it should take about 0.6 ms to call the
activities without doing any processing. According to [1], this CPU is much
more powerful than the Pentium III 700 MHz used to benchmark Orocos in Peter's
thesis. I tested the same program on my desktop Intel Core 2 Duo 6600 @
2.4 GHz and I obtained a mean time of 1.3 microseconds, which seems to be
proportionally correct given the performance of the CPU.

Does someone have a low-end CPU to run this benchmark? I would very much like
to know if there's a performance issue with this particular CPU (Celeron M 1
GHz). Theoretically, this 1 GHz CPU executes 1000 clock cycles per
microsecond. This would mean that the code executed by a single slave
activity requires 30 000 clock cycles? This seems high for an empty
TaskContext.

[1] http://www.cpubenchmark.net/cpu_lookup.php?cpu=Intel+Celeron+M+1.00GHz

Philippe

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 18:28.

Hello,
I have interesting additional results concerning this problem. I found that
the version I was using on the low-end CPU (Celeron M 1 GHz) was RTT-1.10.1
and on the high-end CPU (Core 2 Duo 2.4 GHz) was RTT-1.8.5. I installed
RTT-1.10.1 on the high-end CPU to compare results with the same version of
RTT and I found that it takes about three times as long to call an empty
SlaveActivity in this new version (3.6 microseconds in 1.10.1 compared to 1.3
microseconds in 1.8.5). In light of these results, I think there is a code
change between 1.8 and 1.10 which considerably increased the execution time of
an activity.

I will try to downgrade to 1.8.5 on the low-end CPU to see if the process
time is divided by 3.

Philippe

RTAI or Xenomai ?

Submitted by peter on Mon, 2009-11-02 21:12.

On Mon, Nov 2, 2009 at 19:23, Philippe Hamelin
<philippe [dot] hamelin [..] ...> wrote:
> Hello,
> I have interesting additional results concerning this problem. I found that
> the version I was using on the low-end CPU (Celeron M 1GHz) was RTT-1.10.1
> and on the high-end CPU (Core 2 duo 2.4GHz) was RTT-1.8.5. I installed
> RTT-1.10.1 on the high-end CPU to compare results with the same version of
> RTT and I found that it takes about three times as long to call an empty
> SlaveActivity in this new version (3.6 microseconds in 1.10.1 compared to
> 1.3 microseconds in 1.8.5). In light of these results I think there is a
> code change from 1.8 to 1.10 which considerably increased the execution time
> of an activity.
>
> I will try to downgrade to 1.8.5 on the low-end CPU to see if the process
> time is divided by 3.

I've just discovered that RTT 1.10.x does not set
CMAKE_BUILD_TYPE. This causes the RTT to be compiled with no
optimization flags.

Can you try to set CMAKE_BUILD_TYPE=Release ?

I hope that's it.

Peter

>
> Philippe
>
>
> 2009/11/2 Philippe Hamelin <philippe [dot] hamelin [..] ...>
>>
>> Hello,
>> of course the %CPU given by top is erroneous. The %CPU should be about 50%
>> @ 500 Hz and 100% @ 1000 Hz. I wrote a sample program to measure the time
>> required to execute the slave activity of an empty TaskContext. I used
>> rtt/TimeService to calculate an average time of 1000 calls. The time
>> obtained on a Celeron M 1GHz is 30 microseconds. In the case of my ~20
>> slaves application, this means that it should take about 0.6ms to call the
>> activities without doing any processing. According to [1], this CPU is much
>> more powerful than the Pentium-III 700MHz used to bench Orocos in Peter's
>> thesis. I tested the same program on my desktop Intel Core 2 Duo 6600 @
>> 2.4GHz and I obtained a mean time of 1.3 microseconds, which seems to be
>> proportionally correct according to the performance of the CPU.
>>
>> Does someone have a low-end CPU to run this benchmark? I would very much
>> like to know if there's a performance issue with this particular CPU
>> (Celeron M 1 GHz). Theoretically, this 1 GHz CPU executes 1000 clock
>> cycles per microsecond. This would mean that the code executed by a
>> single slave activity requires 30 000 clock cycles? This seems high
>> for an empty TaskContext.
>>
>> [1] http://www.cpubenchmark.net/cpu_lookup.php?cpu=Intel+Celeron+M+1.00GHz
>>
>> Philippe
>>
>>
>> 2009/10/30 Peter Soetens <peter [..] ...>
>>>
>>> On Fri, Oct 30, 2009 at 20:36, Philippe Hamelin
>>> <philippe [dot] hamelin [..] ...> wrote:
>>> > Hello,
>>> > after some investigations I found interesting things.
>>> >
>>> > I measured the execution time of all slave activities :
>>> >
>>> >

>>> >
>>> > void MasterComponent::updateHook()
>>> > {
>>> >     referenceTime = TimeService::Instance()->getTicks();
>>> >
>>> >     getPeer("slaveA")->getActivity()->execute();
>>> >     getPeer("slaveB")->getActivity()->execute();
>>> >     getPeer("slaveC")->getActivity()->execute();
>>> >
>>> >     processTime = TimeService::Instance()->secondsSince(referenceTime);
>>> > }
>>> >
>>> >

>>> >
>>> > and I found that I'm using 0.9ms while my periodic activity has a 1ms
>>> > period. So, I'm effectively using most of the CPU.
>>> > Then, I wanted to know which component is taking so much CPU. I first
>>> > tested component "slaveA":
>>> >
>>> >

>>> >
>>> > void MasterComponent::updateHook()
>>> > {
>>> >     referenceTime = TimeService::Instance()->getTicks();
>>> >
>>> >     getPeer("slaveA")->getActivity()->execute();
>>> >
>>> >     processTime = TimeService::Instance()->secondsSince(referenceTime);
>>> >
>>> >     getPeer("slaveB")->getActivity()->execute();
>>> >     getPeer("slaveC")->getActivity()->execute();
>>> >
>>> > }
>>> >
>>> >

>>> >
>>> > I found that the process time of slaveA was 0.15ms. The component
>>> > "slaveA" just has some code in its updateHook (no FSM, no
>>> > programs...). I removed all the code in the updateHook() of "slaveA"
>>> > and then measured again the process time of the single line:
>>> >
>>> >

>>> >     getPeer("slaveA")->getActivity()->execute();
>>> >

>>> >
>>> > I found that just calling the (empty) slaveActivity takes about 0.1ms.
>>> > My system has in reality about 10 slave activities which are called by
>>> > the masterComponent, so TotalTime = 10 * 0.1ms = 1ms. It means that it
>>> > takes 100% CPU @ 1kHz just to call empty components.
>>> >
>>> > Do you think that this overhead comes from Orocos or from the "context
>>> > switching" time of Linux? Does each slave activity have its own thread,
>>> > or are the slave activities executed in the calling thread? If the
>>> > slave activity is called inside the master thread, I don't see where
>>> > this switching time could come from.
>>>
>>> That's the point, slaveActivity doesn't have any thread switching.
>>> It's supposed to be the most efficient way, and 100us for executing
>>> nothing is a 'little' bit too much. SlaveActivity is an empty box that
>>> just calls step() on the ExecutionEngine, which steps 4 'processors'
>>> (program, state machine, event, command) and the updateHook(). If all
>>> these have nothing to do, they should need very little time too.
>>>
>>> There are clearly only two ways out of this:
>>> 1. There is a bug in the EE or one of its processors which causes this
>>> slowness
>>> 2. The measurement is wrong, ie you measure something else.
>>>
>>> During my thesis (2005) I did extensive measurements of the time spent
>>> in the EE, data ports, events, commands etc. under different loads.
>>> This was on a 700MHz PIII with 128KB L2 cache and 128MB RAM. Time
>>> measurement itself took 3.2 us. The measurements had threads running
>>> at 2KHz, 1KHz and 500Hz and a not-realtime thread at 100Hz. System
>>> load was very acceptable, even though I was logging all measurements.
>>>
>>> Summarized, I measured the following (corrected for measurement times):
>>> * Data flow read or write: lock-based: 10us, lock-free: 3us
>>> * Sending a command: 3us, processing a command: 3us (both lock-free)
>>> * Emitting event and processing event: same as command (also lock-free)
>>>
>>> You can see the measurements yourself from:
>>> http://www.mech.kuleuven.be/dept/resources/docs/soetens.pdf
>>>
>>> This thesis is *not* recommended reading to learn Orocos, but the
>>> measurements are in there (look for sections titled 'Validation' and
>>> Appendix A). Not that much has changed since then, although I believe
>>> I had to robustify the lock-free algorithms, at the expense of some
>>> more computational overhead. I also see that I didn't measure the
>>> scripting overhead, but checking for empty lists won't be much
>>> different than checking an empty command queue.
>>>
>>> I still don't understand how you get from 3% to 100% with just
>>> doubling a frequency of a single thread. It should have been 6%.
>>> Simple.
>>>
>>> Peter
>>>
>>> >
>>> > Also, I tested the RT kernel from
>>> > http://www.pengutronix.de/software/linux-rt/debian_en.html and I had
>>> > similar results. I'm using a Celeron M 1GHz with 512MB DDR.
>>> >
>>> > Philippe
>>> >
>>> > 2009/10/30 Peter Soetens <peter [..] ...>
>>> >>
>>> >> On Fri, Oct 30, 2009 at 16:19, Philippe Hamelin
>>> >> <philippe [dot] hamelin [..] ...> wrote:
>>> >> > You are suggesting a good precision Klass. First, let me describe my
>>> >> > problem. I currently have an Orocos application with about 20
>>> >> > components. I have a master component with a periodic activity which
>>> >> > calls the other 19 slave activities at 500Hz (gnulinux ORO_SCHED_RT).
>>> >> > Currently, this takes about 3% CPU, so there is no major
>>> >> > computational effort. When I push the frequency up to 1 kHz, this
>>> >> > suddenly drives the application to 100% CPU utilization. I guess the
>>> >> > problem comes from the gnulinux scheduler (debian lenny vanilla
>>> >> > kernel), which is not able to deal with this setup. So this problem
>>> >> > has led me to look for another scheduler which may help me get down
>>> >> > to a lower CPU utilization.
>>> >>
>>> >> top is notorious for reporting wrong CPU utilization numbers when
>>> >> using SCHED_FIFO. The numbers are complete nonsense and are indeed
>>> >> typically 100%.
>>> >>
>>> >> If you care, you'll need a better way of measuring, for example, by
>>> >> using the (dark and hidden) Orocos Thread Scope.
>>> >>
>>> >> Peter
>>> >
>>> >
>>
>
>

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 22:44.

That did the trick for the high-end CPU. The execution time of a slave
activity has gone back to 1.3 microseconds. Tomorrow I will test that
on the low-end CPU.

Philippe

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 22:56.

OK, I tested it on the low-end CPU and the result went from 30us to 2.7us
just by setting CMAKE_BUILD_TYPE=Release in RTT. Thanks Peter, you're the
master :-)

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 10:28.

Your answer makes sense to me too. I will try to build a simple use case
using the deployer and an OCL::HelloWorld component.

Philippe

RTAI or Xenomai ?

Submitted by phamelin on Mon, 2009-11-02 10:28.

I know that the CPU utilization reported by standard Linux tools isn't
reliable. My application doesn't really take 100% CPU because there is only a
small amount of calculation. However, I know that the CPU is overloaded
because the system is abnormally slow for all other processes while my
Orocos application is running. I'm using a Celeron M 1 GHz with 512MB DDR, so
I think it's fast enough for the small amount of calculation I do. I tested
the same application (and same OS) on a Core 2 Duo and it runs fine with a
CPU utilization of about 2-3% at 1 kHz. So I don't understand how my
application can go from 3% to 100% CPU utilization only by doubling the
sampling rate. There seems to be a choke point somewhere in the scheduling
process. I already had that kind of problem in QNX, and I resolved it
by changing the scheduler time period from 1ms to 100us.

2009/10/30 Peter Soetens <peter [..] ...>