A tale of two linkers

Two days ago I spent two hours tracking down a problem, so I thought it
might interest people here.

I was debugging oroGen-generated typekits, and started adding some cout
lines here and there. I was using a Ruby shell on one side and an
orogen-generated RTT deployment on the other (through CORBA).

To my surprise, the lines were *not* appearing in the Ruby process, but
were appearing in the C++ process. Weird.

Then, since the problematic toolkit was for std::string, I supposed it
was a clash between the RTT default toolkit and the orogen-generated
one. So I also checked that the transport the Ruby plugin was using was
the one returned by the orogen-generated typekit plugin. And it was.

Then, I started our faithful gdb and stepped into the code.

Indeed, the plugin object was the one returned by the plugin's
registration function. It is using CorbaTemplateProtocol, with the
associated specialization just above it.

After scratching my head for a while, I found the explanation: the Ruby
process uses the RTT's plugin mechanism (based on dlopen()) and the
orogen-generated process uses normal linking!

In a nutshell, dlopen() happily replaced the symbols of the
specialization with the ones already loaded from the RTT, even though the
CorbaTemplateProtocol that is using it is in the same compilation unit.
The normal linker does not (it prefers local symbols).

And, funnily enough, somebody had a similar problem at DFKI today ;-)

A tale of two linkers

On Wed, Nov 10, 2010 at 7:01 PM, Sylvain Joyeux <sylvain [dot] joyeux [..] ...> wrote:
> In a nutshell, dlopen() happily replaced the symbols of the
> specialization with the ones already loaded from the RTT, even though the
> CorbaTemplateProtocol that is using it is in the same compilation unit.
> The normal linker does not (it prefers local symbols).

If there's something wrong with the dlopen options, I'm open to a review.
You can influence this policy with RTLD_GLOBAL/RTLD_LOCAL. We override
the default (LOCAL) by setting it to GLOBAL. This allowed us to load libraries
that were not correctly linked but had the missing symbols already in
the process.
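
For readers unfamiliar with these flags, here is a minimal sketch of how
they are passed to dlopen(). The helper name load_plugin and its
export_symbols parameter are made up for illustration; this is not the
RTT's actual plugin loader code.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper (assumption, not RTT code).
// export_symbols == true  -> RTLD_GLOBAL: symbols of the loaded library
//                            can satisfy references in libraries that
//                            are loaded afterwards.
// export_symbols == false -> RTLD_LOCAL: the dlopen() default; symbols
//                            are only reachable through dlsym().
void* load_plugin(const char* path, bool export_symbols)
{
    const int flags = RTLD_NOW | (export_symbols ? RTLD_GLOBAL : RTLD_LOCAL);
    void* handle = dlopen(path, flags);
    if (!handle)
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
    return handle;  // nullptr on failure
}
```

A caller would use load_plugin(path, true) to get the GLOBAL behaviour
described above, and load_plugin(path, false) for the default.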

Peter

A tale of two linkers

On Thu, Nov 11, 2010 at 09:10:30PM +0100, Peter Soetens wrote:
> If there's something wrong with the dlopen options, I'm open to a review.
> You can influence this policy with RTLD_GLOBAL/RTLD_LOCAL. We override
> the default (LOCAL) by setting it to GLOBAL. This allowed us to load libraries
> that were not correctly linked but had the missing symbols already in
> the process.

From reading dlopen(3), it appears that adding the (nonstandard)
RTLD_DEEPBIND might solve Sylvain's issue.
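
As a rough sketch of what that would look like: RTLD_DEEPBIND is a glibc
extension that makes a library prefer its own symbols over same-named
global ones, so the code below guards it with #ifdef. The function name
plugin_dlopen_flags is hypothetical and only serves the illustration.

```cpp
#include <dlfcn.h>

// Hypothetical helper (assumption, not RTT code): computes the flags
// one would pass to dlopen() for a plugin.
int plugin_dlopen_flags(bool deep_bind)
{
    int flags = RTLD_NOW | RTLD_GLOBAL;
#ifdef RTLD_DEEPBIND
    // glibc-only: look up symbols in the loaded library itself before
    // falling back to the global scope.
    if (deep_bind)
        flags |= RTLD_DEEPBIND;
#else
    (void)deep_bind;  // flag not available on this platform
#endif
    return flags;
}
```

The result would then be used as dlopen(path, plugin_dlopen_flags(true)).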

Markus

A tale of two linkers

On 11/12/2010 08:53 AM, Markus Klotzbuecher wrote:
> From reading dlopen(3), it appears that adding the (nonstandard)
> RTLD_DEEPBIND might solve Sylvain's issue.

Ah ... forgot to mention that.

I thought RTLD_LOCAL would have solved the problem, but it does not. No
time to investigate further.

RTLD_DEEPBIND breaks dynamic_cast as the typeid() symbols don't get
merged. Again, no time to investigate, so I don't know whether this is
expected behaviour.

In general, I'm fine with the behaviour right now -- as long as I know
it exists. To avoid it in the future, I would say that the best option
would be to replace the explicit specialization with something else, or
have oroGen not use the specialization at all.
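
One possible reading of "something else", as a sketch only: give each
typekit its own class in its own namespace, so that its symbols do not
share a mangled name with a specialization living in another library.
All names below (Protocol, my_typekit, StringProtocol,
make_string_protocol) are hypothetical stand-ins, not RTT or oroGen API.

```cpp
#include <string>

// Hypothetical stand-in for a transport interface.
struct Protocol {
    virtual ~Protocol() {}
    virtual std::string describe(const std::string& value) const = 0;
};

namespace my_typekit {  // hypothetical namespace, unique per typekit

// A distinct class rather than an explicit specialization of a shared
// template. Its mangled names contain "my_typekit", so they differ from
// those of any other library as long as the namespace name is unique.
struct StringProtocol : Protocol {
    std::string describe(const std::string& value) const override {
        return "my_typekit:" + value;
    }
};

}  // namespace my_typekit

// Hypothetical registration function returning the typekit's protocol.
Protocol* make_string_protocol()
{
    static my_typekit::StringProtocol instance;
    return &instance;
}
```

With this layout the dynamic loader has no same-named symbol to
substitute, whichever dlopen() flags are in use.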

A tale of two linkers

On Fri, Nov 12, 2010 at 09:15:16AM +0100, Sylvain Joyeux wrote:
> Ah ... forgot to mention that.
>
> I thought RTLD_LOCAL would have solved the problem, but it does not. No
> time to investigate further.
>
> RTLD_DEEPBIND breaks dynamic_cast as the typeid() symbols don't get
> merged. Again, no time to investigate, so I don't know whether this is
> expected behaviour.

Ah yes, I had a similar problem when loading RTT as a Lua plugin and
not using RTLD_GLOBAL.

> In general, I'm fine with the behaviour right now -- as long as I know
> it exists. To avoid it in the future, I would say that the best option
> would be to replace the explicit specialization with something else, or
> have oroGen not use the specialization at all.

I agree.

Markus