DataFlow 2.0 status and push/pull policy

Submitted by sspr on Tue, 2009-09-22 13:10

RTT-dev

Lets start with a status update. As ye'all know, there is no
rt-middleware available for inter-process communication, except the
low-level messaging libraries (and even most don't target hard
real-time). So we decided to write another transport (next to CORBA)
using message queues for doing real-time data flow ipc. Conveniently,
mqueues are supported by plain gnulinux and Xenomai, using the same
API.

One of the tricky parts is that since there's no middleware (unlike
with CORBA), there's someone needed to do the serialisation to/from
the queue. I came across the boost::serialization library and it fits
the purpose very well. It defines serialization algorithms independent
of 'archiving' algorithms. The former describes how data is
composed/decomposed in primitive types, the latter takes that
information and writes it into some format (text, binary, xml). The
boost::serialization library allows easy extension, but because it
does a lot bookkeeping (read mallocs) because it also serializes
pointers, subclasses etc. this was not real-time. The 'hard way' of
extending this library is by implementing the Archive concept
independent of the available helper classes. For our simple purpose,
this was fairly doable, if it wasn't for the wrong documentation or
implementation of the serialization part. But that's another thread on
another list. The end result is that it is possible to use
boost::serialization for real-time marshalling/demarshalling of
primitive types and std::vector or complexer (variable size) types.

Once serialization worked, it was with little effort added as a new
transport template. One can setup data flow streams using CORBA
('out-of-band') or by using the createStream functions on input and
output port. Also the return values of read are now NoData, NewData
and OldData. The deployer has not yet been adapted.

One thing I was struggling with is how large the buffer size of the
message queue should be. The current implementation creates the
requested pull (output side) or push (input side) buffer/data object
(a connection policy) and *in addition* and by definition, the message
queue is a buffer too. Practically this means that MQ based dataflow
has always two buffers: the MQ itself and the policy buffer. At least,
that's what you would think. In real practice, there is always an
input side buffer, optionally an output side buffer and then the MQ
buffer. That's because from the moment a message arrives on the MQ, we
pull it (such that the MQ won't fill up), store it at the input side
and inform the input port by raising the new data event.

In case you lost me, this is how it works:
PUSH:
output -> (buffer + input)
PULL:
(output + buffer) -> input

When using corba, this translates to:
PUSH:
output -> CORBA -> (buffer + input) : output.write() goes over corba
to store data in buffer
PULL:
(output + buffer) -> CORBA -> input : input.read() goes over corba to
read data from buffer

When using MQ, this translates to:
PUSH:
output -> MQ -> (buffer + input) : all is real-time
PULL:
(output + buffer) -> MQ -> (buffer + input) : all is real-time, last
buffer added de-facto by implementation (see below)

I was wondering two things:
1. is it really necessary that the user can specify push/pull ? Won't
this derive itself from the application architecture ?
2. couldnt' the MQ be the buffer/data element (regardless of
push/pull) in the data flow channel ?

If 1 is true, then 2 is answered as well. To know whether the
application architecture itself is enough to derive where
buffering/data storage must take place, we can test all cases:

1. in-process:
There is no difference between push and pull. You can specify it, but
it will result in the same topology
Conclusion: one buffer in the middle (push nor pull)
2. through corba:
PUSH: the output is punished for a remote client. This is fairly
unacceptable, unless the remote client is a real-time process itself,
(and output is not). Also, every sample output produced is sent over
the wire (possible bottleneck). It is still ok if input would do more
reads than output writes.
PULL: input is punished for listening to remote data, so input can't
be a real-time process. It is advantageous if output does more writes
than input does reads.
Conclusion: in case both sender and receiver are real-time processes,
neither push nor pull can satisfy the necessary architecture. It would
be therefore better to install buffers at both ends at all times and
let the ORB threads do the delivery of the data. So a buffer on each
side (push and pull)
3. through MQ:
PUSH: the buffer on input side is there for 'corba' legacy issues.
CORBA had a buffer there, so MQ too. it could possibly be replaced by
the MQ itself
PULL: same comment as PUSH, but the added buffer on input side is for
the message dispatcher which listens to the message queues for new
data and then needs some place to store that data. That's why we
always need a buffer at input side.
Conclusion: the MQ could play as the buffer in the middle (push nor pull)

I'm mixing current implementation with a new design proposal here,
which might be confusing. The *real* point I needed to make is: should
the user specify push/pull or can the application always derive the
correct places to put buffers ? I would say yes, but I might be
overlooking why Sylvain installed this policy.

Peter