orogen status ?

Submitted by peter on Mon, 2010-11-15 09:56

RTT-dev

Could someone (Steven,Sylvain) update me on the status of orogen ? I'd like to
release 2.1.1 and/or 2.2.0 asap in order to get some bugfixes out people are
hitting too frequently. The prefered answer would be 'merge this branch with
that branch and you'll be fine' :-)

If there's no consensus, I'll release it as-is, and we'll have to fix it in the
next release cycle.

Peter

orogen status ?

Submitted by Sylvain Joyeux on Tue, 2010-11-16 12:37.

On 11/15/2010 10:53 AM, Peter Soetens wrote:
> If there's no consensus, I'll release it as-is, and we'll have to fix it in the
> next release cycle.
That's not acceptable.

I am the one that develops oroGen. I appreciate that you do some
modifications to it, but let's be fair, 95% of the work in the near
future comes from "real" users of it, a.k.a. DFKI and me.

You chose to change some things dramatically for no good reasons.
pkg-config works fine if the RTT .pc file is generated properly. Your
patches broke things, and you pushed it to a stable branch (mmm).

So, here's my position. Yes, ROS is important for you. A functioning
oroGen is more important for me. So, either you don't call 'stable' a
completely broken oroGen because it works for the 5% of functionality
you use, or I am NOT caring about the orocos-toolchain at all anymore.

I don't want to have to deal with people that come on the mailing list
and ask why they can't even compile the simple examples -- or even worse
the published oroGen packages.

orogen status ?

Submitted by peter on Tue, 2010-11-16 13:08.

On Tuesday 16 November 2010 13:34:56 Sylvain Joyeux wrote:
> On 11/15/2010 10:53 AM, Peter Soetens wrote:
> > If there's no consensus, I'll release it as-is, and we'll have to fix it
> > in the next release cycle.
>
> That's not acceptable.

It is if 2.2.0 is released within a few days after 2.1.1.

>
> I am the one that develops oroGen. I appreciate that you do some
> modifications to it, but let's be fair, 95% of the work in the near
> future comes from "real" users of it, a.k.a. DFKI and me.

You're clearly underestimating the impact of typegen for the rest of us.

>
> You chose to change some things dramatically for no good reasons.
> pkg-config works fine if the RTT .pc file is generated properly. Your
> patches broke things, and you pushed it to a stable branch (mmm).

If anyone of us had known about running the unit tests, this wouldn't have
happened in the first place. We learned from it, move on.

>
> So, here's my position. Yes, ROS is important for you. A functioning
> oroGen is more important for me. So, either you don't call 'stable' a
> completely broken oroGen because it works for the 5% of functionality
> you use, or I am NOT caring about the orocos-toolchain at all anymore.
>
> I don't want to have to deal with people that come on the mailing list
> and ask why they can't even compile the simple examples -- or even worse
> the published oroGen packages.

It sounds like you have underestimated the 'cost' of being a maintainer.
Maintaining your own patches is easy. Merging and discussing contributions of
other people of the community is harder, and takes lots more time because they
see things differently and have different use cases than you. I invited you to
help us resolve the problems we had, and all we got was a 'no more than 5
mintues of my time'. That's asking for trouble, or at least, for some weeks
until the unexperienced people work it out. Remember that our patches were not
'just ros', but also fixed things we ran into, or tried to harmonize how
built/installed packages look like, such that the deployer and other tools can
work with them.

Peter

orogen status ?

Submitted by Sylvain Joyeux on Tue, 2010-11-16 13:20.

On 11/16/2010 02:03 PM, Peter Soetens wrote:
> On Tuesday 16 November 2010 13:34:56 Sylvain Joyeux wrote:
>> On 11/15/2010 10:53 AM, Peter Soetens wrote:
>>> If there's no consensus, I'll release it as-is, and we'll have to fix it
>>> in the next release cycle.
>>
>> That's not acceptable.
>
> It is if 2.2.0 is released within a few days after 2.1.1.
I sent my email
>
>>
>> I am the one that develops oroGen. I appreciate that you do some
>> modifications to it, but let's be fair, 95% of the work in the near
>> future comes from "real" users of it, a.k.a. DFKI and me.
>
> You're clearly underestimating the impact of typegen for the rest of us.
Maybe I am underestimating the future impact of typegen for you, but you
are clearly forgetting that we are using oroGen extensively *now*.

>> You chose to change some things dramatically for no good reasons.
>> pkg-config works fine if the RTT .pc file is generated properly. Your
>> patches broke things, and you pushed it to a stable branch (mmm).
>
> If anyone of us had known about running the unit tests, this wouldn't have
> happened in the first place. We learned from it, move on.
So what ? I don't remember having a mail asking how/if the patch broke
something.

If there are not test suites, you don't ask people around if your
patches break anything. As I mention, you don't realize that, when you
use typegen, you actually use a small part of oroGen's functionality.
So, if you only test by running typegen, it *is* going to break stuff.

Imagine if I was pushing all the typelib/oroGen stuff I have in store
without testing typegen first. I can tell you the result: massive
breakage. Before I'll do that I will ask *you guys* to check that
typegen is not broken. Because you are using it and I am not.

>>
>> So, here's my position. Yes, ROS is important for you. A functioning
>> oroGen is more important for me. So, either you don't call 'stable' a
>> completely broken oroGen because it works for the 5% of functionality
>> you use, or I am NOT caring about the orocos-toolchain at all anymore.
>>
>> I don't want to have to deal with people that come on the mailing list
>> and ask why they can't even compile the simple examples -- or even worse
>> the published oroGen packages.
>
> It sounds like you have underestimated the 'cost' of being a maintainer.
> Maintaining your own patches is easy. Merging and discussing contributions of
> other people of the community is harder, and takes lots more time because they
> see things differently and have different use cases than you. I invited you to
> help us resolve the problems we had, and all we got was a 'no more than 5
> mintues of my time'. That's asking for trouble, or at least, for some weeks
> until the unexperienced people work it out. Remember that our patches were not
> 'just ros', but also fixed things we ran into, or tried to harmonize how
> built/installed packages look like, such that the deployer and other tools can
> work with them.
Most of the really intrusive changes are ROS-related, or new
functionality that you pushed to the stable branch while breaking the
rest. There are very few real fixes, and they are actually harmless.

As for the "harmonization", again, you look at your use case and don't
care about mine. And get over my head because your use case is obviously
a lot more important.

I don't underestimate maintainership cost, thank you, but that's not the
point here. It was not about integrating new patches so that they work
fine. It is about having inexperienced people think that they can push
changes to a codebase they don't know so well, while getting over the
head of the maintainer. And then telling the maintainer that 'oh, now we
would need you to help fixing our mess, and if you don't have time it is
your fault it stays broken'. WTF ?

orogen status ?

Submitted by Sylvain Joyeux on Mon, 2010-11-15 10:12.

Morevoer, you did not comment/look at the merge request I sent on
gitorious. As grave bugfixing goes, this one is damn important.

orogen status ?

Submitted by peter on Mon, 2010-11-15 14:36.

On Monday 15 November 2010 11:10:15 Sylvain Joyeux wrote:
> Morevoer, you did not comment/look at the merge request I sent on
> gitorious. As grave bugfixing goes, this one is damn important.

Some minor questions/remarks:

- Did you consider using a plain boost::shared_ptr instead of intrusive, such
that the RTT::os::Mutex was not needed ? IIRC, boost::shared_ptr uses atomics
when available, in order to avoid to allocate a mutex for each shared object.

- No new unit tests were added, does this mean that no new 'public api'
functions were added ?

- What's the issue with 'SendStatus' + 1 ? Can't we let the CORBA enum count
from -1 ?

- How did you avoid a recursive self-lock on the mutex when a connection is
blown up (write of channel element returns false) ? Do we unit test for this ?

Peter

orogen status ?

Submitted by Sylvain Joyeux on Mon, 2010-11-15 15:00.

On 11/15/2010 03:35 PM, Peter Soetens wrote:
> On Monday 15 November 2010 11:10:15 Sylvain Joyeux wrote:
>> Morevoer, you did not comment/look at the merge request I sent on
>> gitorious. As grave bugfixing goes, this one is damn important.
>
> Some minor questions/remarks:
>
> - Did you consider using a plain boost::shared_ptr instead of intrusive, such
> that the RTT::os::Mutex was not needed ? IIRC, boost::shared_ptr uses atomics
> when available, in order to avoid to allocate a mutex for each shared object.
shared_ptr has the same issue than intrusive_ptr has: it is not thread
safe on copy vs. deletion, which is were the problem lies in this
particular case.

Since we don't create connections every millisecond, I don't think that
more optimization is needed. This patch fixes the many problems that
earlier optimizations added, so I would just go for the thing that works
for the time being.

> - No new unit tests were added, does this mean that no new 'public api'
> functions were added ?
That is correct. For the functionality itself, my main test is an
orocos.rb test in which I concurrently read/write and connect/disconnect
through CORBA between multiple ports of two components.

> - What's the issue with 'SendStatus' + 1 ? Can't we let the CORBA enum count
> from -1 ?
No way that I know of.

> - How did you avoid a recursive self-lock on the mutex when a connection is
> blown up (write of channel element returns false) ? Do we unit test for this ?
I think this is covered by the orocos.rb test suite, but I may be wrong.
I checked and what happens is that OutputPort::write does not lock, but
calls delete_if, which itself locks and then does all the manipulation
(i.e. no additional lock needed).

As you can see, the only synchronization in the channel elements is for
copying of the intrusive ptr, so there is not side effect that I see.

orogen status ?

Submitted by peter on Mon, 2010-11-15 15:32.

On Monday 15 November 2010 15:59:17 Sylvain Joyeux wrote:
> On 11/15/2010 03:35 PM, Peter Soetens wrote:
> > On Monday 15 November 2010 11:10:15 Sylvain Joyeux wrote:
> >> Morevoer, you did not comment/look at the merge request I sent on
> >> gitorious. As grave bugfixing goes, this one is damn important.
> >
> > Some minor questions/remarks:
> >
> > - Did you consider using a plain boost::shared_ptr instead of intrusive,
> > such that the RTT::os::Mutex was not needed ? IIRC, boost::shared_ptr
> > uses atomics when available, in order to avoid to allocate a mutex for
> > each shared object.
>
> shared_ptr has the same issue than intrusive_ptr has: it is not thread
> safe on copy vs. deletion, which is were the problem lies in this
> particular case.
>
> Since we don't create connections every millisecond, I don't think that
> more optimization is needed. This patch fixes the many problems that
> earlier optimizations added, so I would just go for the thing that works
> for the time being.

AFAIKT, plain shared_ptr does *not* contain the race that intrusive_ptr has,
because refcount is stored in another memory location than the object that is
refcounted. With intrusive, we might be reading the refcount, while the object
(hence also refcount) is being deleted. I'd be surprised if plain shared_ptr
is not copy-thread-safe during deletion.

>
> > - No new unit tests were added, does this mean that no new 'public api'
> > functions were added ?
>
> That is correct. For the functionality itself, my main test is an
> orocos.rb test in which I concurrently read/write and connect/disconnect
> through CORBA between multiple ports of two components.
>
> > - What's the issue with 'SendStatus' + 1 ? Can't we let the CORBA enum
> > count from -1 ?
>
> No way that I know of.
>
> > - How did you avoid a recursive self-lock on the mutex when a connection
> > is blown up (write of channel element returns false) ? Do we unit test
> > for this ?
>
> I think this is covered by the orocos.rb test suite, but I may be wrong.
> I checked and what happens is that OutputPort::write does not lock, but
> calls delete_if, which itself locks and then does all the manipulation
> (i.e. no additional lock needed).
>
> As you can see, the only synchronization in the channel elements is for
> copying of the intrusive ptr, so there is not side effect that I see.

I was referring to the mutex in ConnectionManager.

Peter

orogen status ?

Submitted by Sylvain Joyeux on Mon, 2010-11-15 16:16.

On 11/15/2010 04:29 PM, Peter Soetens wrote:
> AFAIKT, plain shared_ptr does *not* contain the race that intrusive_ptr has,
> because refcount is stored in another memory location than the object that is
> refcounted. With intrusive, we might be reading the refcount, while the object
> (hence also refcount) is being deleted. I'd be surprised if plain shared_ptr
> is not copy-thread-safe during deletion.
I don't see how having the refcount elsewhere than the object makes it
more thread safe. You still need to delete the refcount once the object
is deleted anyway. Moreover, there is no "increment and check if equal
to one" atomic operation (is there ?). So, even though you would solve
the must-delete-the-refcount-sometime problem, you can't check
atomically that you just have incremented a live pointer, and therefore
the shared pointer may be referencing a deleted pointer anyway.

FYI, I saw these explanations on the boost mailing list.

>>> - How did you avoid a recursive self-lock on the mutex when a connection
>>> is blown up (write of channel element returns false) ? Do we unit test
>>> for this ?
>>
>> I think this is covered by the orocos.rb test suite, but I may be wrong.
>> I checked and what happens is that OutputPort::write does not lock, but
>> calls delete_if, which itself locks and then does all the manipulation
>> (i.e. no additional lock needed).
>>
>> As you can see, the only synchronization in the channel elements is for
>> copying of the intrusive ptr, so there is not side effect that I see.
>
> I was referring to the mutex in ConnectionManager.
Yes, which is only taken by delete_if in the "failed write" case.

orogen status ?

Submitted by Sylvain Joyeux on Mon, 2010-11-15 16:20.

On 11/15/2010 05:13 PM, Sylvain Joyeux wrote:
> On 11/15/2010 04:29 PM, Peter Soetens wrote:
>> AFAIKT, plain shared_ptr does *not* contain the race that intrusive_ptr has,
>> because refcount is stored in another memory location than the object that is
>> refcounted. With intrusive, we might be reading the refcount, while the object
>> (hence also refcount) is being deleted. I'd be surprised if plain shared_ptr
>> is not copy-thread-safe during deletion.
> I don't see how having the refcount elsewhere than the object makes it
> more thread safe. You still need to delete the refcount once the object
> is deleted anyway. Moreover, there is no "increment and check if equal
> to one" atomic operation (is there ?). So, even though you would solve
> the must-delete-the-refcount-sometime problem, you can't check
> atomically that you just have incremented a live pointer, and therefore
> the shared pointer may be referencing a deleted pointer anyway.
>
> FYI, I saw these explanations on the boost mailing list.

Another thing: I spend already a lot of time debugging that particular
issue, and part of it was because of early optimizations in
ConnectionManager. Play around with shared_ptr all you like if you can't
sleep because of an unnecessary atomic, but *first* publish this working
implementation.

orogen status ?

Submitted by Sylvain Joyeux on Mon, 2010-11-15 16:24.

On 11/15/2010 05:19 PM, Sylvain Joyeux wrote:
> On 11/15/2010 05:13 PM, Sylvain Joyeux wrote:
>> On 11/15/2010 04:29 PM, Peter Soetens wrote:
>>> AFAIKT, plain shared_ptr does *not* contain the race that
>>> intrusive_ptr has,
>>> because refcount is stored in another memory location than the object
>>> that is
>>> refcounted. With intrusive, we might be reading the refcount, while
>>> the object
>>> (hence also refcount) is being deleted. I'd be surprised if plain
>>> shared_ptr
>>> is not copy-thread-safe during deletion.
>> I don't see how having the refcount elsewhere than the object makes it
>> more thread safe. You still need to delete the refcount once the object
>> is deleted anyway. Moreover, there is no "increment and check if equal
>> to one" atomic operation (is there ?). So, even though you would solve
>> the must-delete-the-refcount-sometime problem, you can't check
>> atomically that you just have incremented a live pointer, and therefore
>> the shared pointer may be referencing a deleted pointer anyway.
>>
>> FYI, I saw these explanations on the boost mailing list.
>
> Another thing: I spend already a lot of time debugging that particular
> issue, and part of it was because of early optimizations in
> ConnectionManager. Play around with shared_ptr all you like if you can't
> sleep because of an unnecessary atomic, but *first* publish this working
> implementation.
s/atomic/mutex/ in the last sentence.

orogen status ?

Submitted by peter on Mon, 2010-11-15 10:32.

On Monday 15 November 2010 11:10:15 Sylvain Joyeux wrote:
> Morevoer, you did not comment/look at the merge request I sent on
> gitorious. As grave bugfixing goes, this one is damn important.

I'm still not getting any emails for merge requests from gitorious. The
official/only way is to post it explicitly on orocos-dev. I've just seen your
mail of last friday, but I was out of office on thursday and friday.

I'll merge it today on master. This is indeed an urgently needed update !

Peter

orogen status ?

Submitted by Sylvain Joyeux on Mon, 2010-11-15 10:12.

On 11/15/2010 10:53 AM, Peter Soetens wrote:
> Could someone (Steven,Sylvain) update me on the status of orogen ? I'd like to
> release 2.1.1 and/or 2.2.0 asap in order to get some bugfixes out people are
> hitting too frequently. The prefered answer would be 'merge this branch with
> that branch and you'll be fine' :-)
>
> If there's no consensus, I'll release it as-is, and we'll have to fix it in the
> next release cycle.
For me, the acceptable way is to release the next toolchain with the
master branch of dfki-ric-orogen, and do the ROS-related bugfixing on a
topic branch. I thought we agreed on that.

Sylvain