Discussion:
gen_udp:recv()
Avinash Dhumane
2021-04-05 13:42:20 UTC
Permalink
Hello!

I am taking a short-cut to inquire about this by posting to you all
(rather than going through the relevant documentation).

Is gen_udp:recv() to be invoked per datagram, or can it return a binary of
multiple datagrams concatenated?

I am asking this because there is a Length argument to recv(), and I don't
understand why it is there. Should it specify the length of the expected
datagram? What if the datagrams are of variable length? Should it, then,
specify the maximum length?

The default receive buffer size is 8K bytes. If the Length argument to
recv() is set to this value, will it still return one datagram, or
multiple?

In contrast, in gen_tcp:recv() the Length argument can be specified as
zero, in which case all the bytes that have arrived are returned as a
binary (as per the documentation). Does gen_udp:recv() do something
similar and return all the datagrams that have arrived (if Length is
specified as zero)?

My need for the socket is {active, false}. I had tried with {active, true}
in the past, and got the datagrams as individual messages in the inbox of
my process. So I assume that there will be one call to recv() per
datagram. I do not have the setup to test {active, false}; hence, I am not
sure about the behaviour of (and the rationale for) the Length argument to
recv().

Please!

Thanks
Avinash
zxq9
2021-04-06 13:12:15 UTC
Permalink
Hi Avinash,

On 2021/04/05 22:42, Avinash Dhumane wrote:
> I am taking a short-cut to inquire about this by posting to you all
> (rather than going through the relevant documentation).

Oh! The humanity!

> Is gen_udp:recv() to be invoked per datagram, or can it return a binary
> of multiple datagrams concatenated?

It returns a single datagram along with some metadata about the
transmission (port, address, etc.).

> I am asking this because there is a Length argument to recv(), and I
> don't understand why it is there. Should it specify the length of the
> expected datagram? What if the datagrams are of variable length? Should
> it, then, specify the maximum length?

This goes back to memory-allocation issues when you are allocating
memory manually. The maximum size you want to receive might be based on
the size of the buffer you're willing to set aside for the incoming
data -- not the kind of thing people have to contemplate as carefully as
they did in earlier eras of networking.

There are some other lower level UDP dials and knobs that allow you to
peek at the size and UDP headers and whatnot to figure out what a
reasonable balance between getting an entire datagram, being safe, and
not being wasteful might be -- and typically you know what protocol
you're working with.

Anyway, in Erlang this is pretty much irrelevant. The common way to
receive UDP traffic (and TCP traffic, for that matter) is to leave all
the low-level stuff to the runtime and, inside your code, just receive it
as an Erlang message and not deal with gen_udp:recv/2,3 at all.
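
For illustration, a minimal sketch of that active-mode pattern (the module
name, port number and printout below are made up, not from the original
mail):

    -module(udp_active_demo).
    -export([start/0]).

    %% Open a UDP socket in active mode: every datagram is delivered to
    %% the owning process as an {udp, Socket, Ip, Port, Packet} message.
    start() ->
        {ok, Socket} = gen_udp:open(5555, [binary, {active, true}]),
        loop(Socket).

    loop(Socket) ->
        receive
            {udp, Socket, FromIp, FromPort, Packet} ->
                io:format("got ~p bytes from ~p:~p~n",
                          [byte_size(Packet), FromIp, FromPort]),
                loop(Socket)
        end.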

-Craig
zxq9
2021-04-06 14:16:51 UTC
Permalink
On 2021/04/06 23:14, Łukasz Niemier wrote:
>> Anyway, in Erlang this is pretty much irrelevant. The common way to receive UDP traffic (and TCP traffic, for that matter) is to leave all the low-level stuff to the runtime and, inside your code, just receive it as an Erlang message and not deal with gen_udp:recv/2,3 at all.
>
> Fortunately the new `socket` module has a `socket:recv/1` function that fetches all the data that is in the socket right now. So it should be perfectly possible to add a `gen_udp:recv/1` that would receive a whole packet at once. Also, calling `gen_udp:recv(Socket, 1500)` should be safe enough, as a single UDP packet cannot be larger than the MTU.

Indeed. However, I can't think of any cases where I would want to mess
with that rather than receive packets as messages.

-Craig
Max Lapshin
2021-04-06 14:49:49 UTC
Permalink
Well, I like the idea of receiving 20-200 UDP messages in a single
binary, with a separate structure holding the mapping of this data.

For us it may help to remove a special driver that does the same.
Max Lapshin
2021-04-06 16:49:53 UTC
Permalink
Exactly.

We receive an MPEG-TS _stream_ in multicast messages and it is very
painful to receive so many messages (it can be around 90K per second in
total, or 300-900 per stream per second) one by one. Reducing this
amount by a factor of 10 decreases CPU usage a lot.




On Tue, Apr 6, 2021 at 6:31 PM Łukasz Niemier <***@niemier.pl> wrote:
>
> > Well, I like the idea to receive 20-200 UDP messages in a single
> > binary with separate structure holding mapping of this data.
>
> Then it sounds like you want stream protocol instead of packet protocol.
> The whole idea of packet protocol is to receive messages in packets.
>
> --
>
> Łukasz Niemier
> ***@niemier.pl
>
Avinash Dhumane
2021-04-07 10:25:33 UTC
Permalink
Our case is similar. The stock exchange disseminates the market movements
(called ticks) as a stream of variable-length messages (datagrams or
packets), each less than 50 bytes and each stamped with a sequence number,
over multicast (udp) streams. Each stream is replicated so that if a
packet is missed (dropped), due to the nature of the underlying udp, it
may be received on the replicated stream (before falling back on a
tcp-based recovery server).

The message rate (without replication) exceeds 100K per second. Issuing as
many calls at the Erlang-application level, viz. gen_udp:recv, is
impractical, though I suppose that at the underlying OS system-call level
each datagram must be received with a separate call. Therefore I believed
that Erlang might provide an abstraction which packs the datagrams
(packets / messages) that have arrived so far and delivers them to the
application for further handling. A one-to-one call from app to Erlang,
and further from Erlang to the OS, for each datagram, would seem too
taxing.
Andreas Schultz
2021-04-07 12:24:51 UTC
Permalink
On Linux and BSD, the recvmmsg system call can read multiple packets at
once. Adding it to gen_udp might be impractical, but the new socket module
would be a good place for a function that uses that system call.
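
gen_udp does not expose recvmmsg, but a rough user-space approximation is
to drain whatever datagrams the kernel has already queued by polling a
passive-mode socket with a zero timeout. A sketch only (the function name
is made up, and this still performs one recv call per datagram):

    %% Collect every datagram currently queued on a passive-mode socket.
    drain(Socket) ->
        drain(Socket, []).

    drain(Socket, Acc) ->
        case gen_udp:recv(Socket, 0, 0) of        % Length 0, Timeout 0
            {ok, {_Ip, _Port, Packet}} ->
                drain(Socket, [Packet | Acc]);
            {error, timeout} ->
                lists:reverse(Acc)                % all that was queued
        end.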



--

Andreas Schultz
Max Lapshin
2021-04-07 13:02:45 UTC
Permalink
I suppose that you really need the same thing as me.

You need to receive _one_ big binary with an extra field containing the
message boundaries and other meta info.

Avinash Dhumane
2021-04-07 13:43:44 UTC
Permalink
Indeed!

As Jesper too pointed out, a sub-10-microsecond turnaround time seems
impossible to achieve in Erlang; the best I got is 60 microseconds, and
the average is in the order of hundreds of microseconds (almost close to
a millisecond).

I am taking the "tick-to-order" processing (in stock-market parlance) out
of Erlang and into Rust to meet this ultra-low-latency requirement.

Thanks all for confirming the overall approach.
Jesper Louis Andersen
2021-04-09 09:52:58 UTC
Permalink
On Wed, Apr 7, 2021 at 3:50 PM Avinash Dhumane <***@gmail.com> wrote:

>
>
> I am taking the “tick-to-order” processing (in stock-market parlance) out
> of Erlang and into Rust to meet this ultra-low latency requirement.
>
>
I know Jane Street has a separate OCaml ingester which handles a message
feed of this form and transforms it into a more digestible format for the
rest of their systems. The key point is to demux the stream quickly and
send relevant information to another CPU core for further processing. It
also cuts down traffic in the case where you can throw away a lot of
messages from the feed that you aren't interested in. At 10us per message,
it isn't as if your general processing budget is very high. However, if
you have a larger machine, the pure Erlang solution is to use something
like {active, N} or {active, true} and then move messages into other
processes as fast as possible, as Valentin suggests. To me, 60-200us per
UDP message sounds way too high, and my hunch is that this is due to
configuration.
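
A bare-bones sketch of that demux idea, assuming an active-mode socket and
a hypothetical 16-bit key at the start of each datagram that selects a
worker (the Workers map, keyed 0..N-1, is also an assumption):

    %% Forward each datagram to a worker process chosen by its key.
    ingest(Socket, Workers) ->
        receive
            {udp, Socket, _Ip, _Port, <<Key:16, _/binary>> = Packet} ->
                Worker = maps:get(Key rem map_size(Workers), Workers),
                Worker ! {tick, Packet},
                ingest(Socket, Workers);
            {udp, Socket, _Ip, _Port, _TooShort} ->
                %% datagram too short to carry a key; drop it
                ingest(Socket, Workers)
        end.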
Avinash Dhumane
2021-04-10 06:47:41 UTC
Permalink
The "tick-to-order" processing indeed needs to be /vertically/ partitioned
so that the ticks (market movements) do not sit in the queue waiting to be
attended to. More so because the ticks arrive in bursts; the next
microsecond might bring no tick, or just one tick, or ten, or a hundred,
or might even bring eight hundred. If we pick them up off the wire
packet-by-packet (i.e. tick-by-tick) and process them sequentially, not
only will we lose many of them, but the last tick (of the burst) will lose
its value by the time it gets attention.

Each tick that is to be attended to (and not skipped) undergoes intense
computation before it culminates in an order (buy and/or sell) that tries
to materialise the opportunity envisioned by the trader. This is what
High-Frequency Algorithmic Trading is all about, where every microsecond
gained matters. But in pure Erlang programs, especially when the
processing is vertically partitioned (as opposed to horizontally, where we
give the entire application to a single "request" or single "client" and
spawn as many instances as there are requests or clients), and we hop from
one process to the next, the latency until the order is fired is totally
unpredictable; that's why the range is from 60 microseconds to several
hundred microseconds.

I had even tried collocating the entire computation (from receiving the
tick to firing the order) into a single Erlang process just to see how
fast it could go; but the bare-metal performance of C is impossible to
match (let alone beat). Programming and debugging in C/C++, however, is
naturally expensive; safe Rust at least takes many worries off one's head.
Erlang programming is still my preferred choice "for understanding" the
data structures and control structures required for the application.
zxq9
2021-04-10 06:52:18 UTC
Permalink
On 2021/04/10 15:47, Avinash Dhumane wrote:
> The “tick-to-order” indeed needs to be /vertically/ partitioned so that
> the ticks (market movements) do not sit in the queue waiting to be
> attended. More so because the ticks arrive in the bursts; the next
> microsecond might bring no tick, or just one tick, or ten, or hundred,
> or might even eight hundred.

Why are the ticks not partitioned by type? Is it only possible to
receive ALL ticks available in the market at a single address:port? This
seems... very weird and not what I've seen in HFT, at least.

-Craig
Avinash Dhumane
2021-04-10 06:59:07 UTC
Permalink
The market is indeed partitioned into several multicast streams, and each
stream is replicated to cover for tick loss, backed up by a tcp-based
recovery server.
zxq9
2021-04-07 14:07:32 UTC
Permalink
On 2021/04/07 19:25, Avinash Dhumane wrote:
> would seem too much taxing.

TRY IT or you'll never know.

-Craig
Jesper Louis Andersen
2021-04-07 12:11:59 UTC
Permalink
On Mon, Apr 5, 2021 at 4:21 PM Avinash Dhumane <***@gmail.com> wrote:

> My need for the socket is {active, false}. I had tried with {active, true}
> in the past, and got the datagrams as individual messages in the inbox of
> my process. So, I assume that there will be one call to recv() per
> datagram. I do not have the setup to test {active, false}; hence, I am not
> sure about the behaviour of (and, rationale for) Length argument to recv().
>
>
It's due to Length being there in low-level socket handling (in C). You
need to specify the size of the buffer to write the result into. It
matters somewhat. While Ethernet normally specifies a limit of 1500 bytes
(minus header, minus tagging, ...), a local loopback interface can easily
be 16k, 64k, or more. And if you are considering a high count of
receivers, it might be wise to limit the maximum size you are willing to
receive.
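
In passive mode that looks roughly like the sketch below (the port number
and timeout are made up); one recv call returns one datagram together with
the sender's address and port:

    {ok, Socket} = gen_udp:open(5555, [binary, {active, false}]),
    %% Blocks for up to 5000 ms; the match crashes on {error, timeout}.
    {ok, {FromIp, FromPort, Datagram}} = gen_udp:recv(Socket, 0, 5000).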

As for the overhead: if you have many small messages and need a high
processing rate, I suggest you either try {active, N} or go the route Max
laid out by moving the processing loop into C to get it fast enough. You
can work backwards: 100k messages per second means you have to process
messages at a rate of one per 10 microseconds. At those rates, it isn't
uncommon to need some kind of micro-batching to keep up.
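
A sketch of the {active, N} variant: the socket delivers up to N datagrams
as messages and then sends {udp_passive, Socket}, at which point the owner
re-arms it (N = 100 here is an arbitrary choice):

    loop(Socket) ->
        ok = inet:setopts(Socket, [{active, 100}]),
        active_loop(Socket).

    active_loop(Socket) ->
        receive
            {udp, Socket, _Ip, _Port, Packet} ->
                _ = byte_size(Packet),        % real processing goes here
                active_loop(Socket);
            {udp_passive, Socket} ->
                loop(Socket)                  % credit used up: re-arm
        end.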


--
J.
Jesper Louis Andersen
2021-04-09 09:41:44 UTC
Permalink
On Thu, Apr 8, 2021 at 9:51 PM ***@micic.co.za <***@pixie.co.za>
wrote:

>
> Presuming that by "micro-batching" you mean more than one message being
> delivered at the same time, I don’t think that would be the only way to
> accomplish rates of 100k messages per second.
> In fact, in one of our project we have accomplished rates of above 200k
> UDP messages per second per Erlang node (using Erlang’s gen_udp module).
>
>
It just means that you avoid processing messages one-by-one, and instead
set up a short interval (5ms, say) and process in batches within that
interval. In practice, this will "feel" instant in the system, but it
amortizes some back-and-forth overhead over multiple messages, reducing
it. In a certain sense, the kernel buffer on a UDP connection is already
doing micro-batching when it delivers messages up to the VM. The actual
number of messages you can process is somewhat dependent on the hardware
configuration as well, and probably also on kernel tuning at these rates.
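
As a sketch of that idea in plain Erlang (the 5 ms interval comes from the
paragraph above; the rest is made up), the receiving process gathers
whatever {udp, ...} messages arrive within the interval and then handles
them as one batch:

    batch_loop(Socket, IntervalMs) ->
        Deadline = erlang:monotonic_time(millisecond) + IntervalMs,
        Batch = collect(Socket, Deadline, []),
        process_batch(lists:reverse(Batch)),
        batch_loop(Socket, IntervalMs).

    collect(Socket, Deadline, Acc) ->
        Left = Deadline - erlang:monotonic_time(millisecond),
        if
            Left =< 0 -> Acc;
            true ->
                receive
                    {udp, Socket, _Ip, _Port, Packet} ->
                        collect(Socket, Deadline, [Packet | Acc])
                after Left ->
                        Acc
                end
        end.

    process_batch([])      -> ok;              % nothing arrived this round
    process_batch(Packets) -> io:format("batch of ~b~n", [length(Packets)]).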


> I am not sure what process_flag( message_queue_data, off_heap ) actually
> does. I mean, I do understand that a particular process queue is allocated
> from some "private space", but not sure how such memory is managed.
>
> It would be great if anyone can shed some light on this.
>
>
In normal operation, the message queue is part of the process heap[0]. So
when it's time to do a GC run, all of the messages on the heap are also
scanned. This takes time proportional to the size of the message queue,
especially if the message queue is large and the rest of the heap is small.
But note that a message arriving in the queue can't point to other data in
the heap. This leads to the idea of storing the messages off-heap, in their
own private space. This then improves GC times because we don't have to
traverse data in the queue anymore, and it's safe because the lifetime
analysis is that messages in the queue can't keep heap data alive.
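
The switch itself is a one-liner; a sketch showing both ways to set it
(the surrounding receiver code is made up):

    %% At spawn time:
    start_receiver() ->
        spawn_opt(fun receiver_init/0, [{message_queue_data, off_heap}]).

    receiver_init() ->
        %% ...or from inside the process itself:
        %% process_flag(message_queue_data, off_heap),
        receiver_loop().

    receiver_loop() ->
        receive
            _Msg -> receiver_loop()
        end.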

On the flip side, though, when you have messages off_heap, sending
messages to the process is slightly slower. This has to do with the
optimizations and locking pertaining to message passing. The normal way is
that you allocate temporary space for the messages from the sending process
and then hook these small allocations into the message queue of the
receiving process. They are "temporary" private off-heap spaces, which
doesn't require locking for a long time, and doesn't require a lock on the
memory area of the receiving process heap at all. Messages are then
"internalized" to the process and the next garbage collection will move the
messages into the main heap, so we can free up all the smaller memory
spaces. With off-heap, we need to manage the off-heap area, and this
requires some extra locking, which can potentially conflict more, slowing
down message delivery.

--
J.