Discussion:
DETS table auto_save behaviour
Nicolas Martyanoff
2021-05-25 06:43:05 UTC
Hi,

I am unsure about the behaviour of DETS regarding saving.

The documentation indicates:

all operations performed by Dets are disk operations

Which seems to hint that every single insertion ends up on disk. Good.

But then:

{auto_save, auto_save()} - The autosave interval. If the interval is
an integer Time, the table is flushed to disk whenever it is not
accessed for Time milliseconds. A table that has been flushed
requires no reparation when reopened after an uncontrolled emulator
halt.

This is ambiguous: does it mean that entries are buffered in memory
and only written to disk during the auto save operation (so that some
operations are not actually disk operations), or does it mean that DETS
always writes to disk without syncing (using fsync or equivalent), and
that synchronization only occurs during the auto save operation?

In any case, am I correct in assuming that DETS does not offer any way
to guarantee that entries are actually written to disk, meaning that an
application crash would lead to the loss of every entry written since the
last auto_save operation?

I was hoping to use DETS as a local persistent buffer in case data
cannot be written to a remote database, but it seems impossible to
guarantee that every entry is synced to disk.

Thank you in advance.

Regards,
--
Nicolas Martyanoff
http://snowsyn.net
Frank Muller
2021-05-25 07:01:11 UTC
This question always puzzled me.
Does Mnesia rely on the same assumptions?

/Frank
Mikael Pettersson
2021-05-26 21:10:27 UTC
Post by Nicolas Martyanoff
I was hoping to use DETS as a local persistent buffer in case data
cannot be written to a remote database, but it seems impossible to
guarantee that every entry is being sync-ed to disk.
I'm not too familiar with the internals of DETS, but basically data
goes straight to/from disk while meta-data about allocated and free
areas of the file are cached in memory. I don't know if writes are
sync or not. In our experience, DETS files are somewhat fragile, plus
they have a hard 2GB size limitation which made them extremely awkward
for our use case (large mnesia tables). That's part of the reason we
migrated most of our mnesia tables to eleveldb.

If I had to have a standalone (not mnesia) local persistent store I'd
probably go with eleveldb (or one of its spinoffs) if I needed lookups
by key, or a disk_log if I just needed a FIFO buffer. disk_log allows
you to choose how sync or async your writes are. _I_ wouldn't use
DETS.
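
To make the disk_log suggestion concrete, here is a minimal sketch of a FIFO buffer built on the standard disk_log module. The log name, file name and entry shape are hypothetical and chosen only for illustration. log/2 appends synchronously, alog/2 appends without waiting for a reply, and sync/1 asks disk_log to make sure the log contents are written to disk.

```erlang
-module(disk_log_buffer_demo).
-export([run/0]).

%% Sketch only: buffer_demo and "buffer_demo.LOG" are made-up names.
run() ->
    File = "buffer_demo.LOG",
    _ = file:delete(File),                      % start from a fresh file
    {ok, Log} = disk_log:open([{name, buffer_demo}, {file, File}]),

    %% Synchronous append: the caller waits for the reply from disk_log.
    ok = disk_log:log(Log, {entry, 1, <<"first">>}),

    %% Asynchronous append: returns without waiting for the write.
    ok = disk_log:alog(Log, {entry, 2, <<"second">>}),

    %% Explicitly ask for the log contents to be written to disk.
    ok = disk_log:sync(Log),

    %% Read the entries back in FIFO order.
    {_Cont, Terms} = disk_log:chunk(Log, start),
    [{entry, 1, <<"first">>}, {entry, 2, <<"second">>}] = Terms,

    ok = disk_log:close(Log),
    ok = file:delete(File),
    {ok, Terms}.
```

Calling sync/1 after every log/2 trades throughput for durability; batching several appends before one sync/1 is the usual compromise.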
Nicolas Martyanoff
2021-05-27 05:39:25 UTC
Post by Mikael Pettersson
I'm not too familiar with the internals of DETS, but basically data
goes straight to/from disk while meta-data about allocated and free
areas of the file are cached in memory. I don't know if writes are
sync or not. In our experience, DETS files are somewhat fragile, plus
they have a hard 2GB size limitation which made them extremely awkward
for our use case (large mnesia tables). That's part of the reason we
migrated most of our mnesia tables to eleveldb.
I was already wary of DETS due to the size limitation (the fact that the
limitation is still here in 2021 shows that nobody is interested in
maintaining the module), but you are confirming my first impression.
Post by Mikael Pettersson
If I had to have a standalone (not mnesia) local persistent store I'd
probably go with eleveldb (or one of its spinoffs) if I needed lookups
by key, or a disk_log if I just needed a FIFO buffer. disk_log allows
you to choose how sync or async your writes are. _I_ wouldn't use
DETS.
I also just realized that it does not support ordered_set. I'll probably
end up with sqlite3.

Thank you for the information!

Regards,
--
Nicolas Martyanoff
http://snowsyn.net
Ulf Wiger
2021-05-27 05:47:26 UTC
It's always tricky with open files during some abrupt crashes. OS-level
file system caching means that not all written data may have been
physically written to disk.

To detect this, dets has a flag indicating whether the file was properly
closed. As I understand it, the 'auto-save' does the same thing as when the
file is closed, except the file stays open.

BR,
Ulf W
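
For reference, dets also exports sync/1, which the dets documentation describes as ensuring that all updates made to the table are written to disk. The following is a minimal sketch of combining auto_save with an explicit sync after an insert. The table name, file name and stored tuple are hypothetical, and whether sync/1 results in an OS-level fsync is not established in this thread.

```erlang
-module(dets_sync_demo).
-export([run/0]).

%% Sketch only: demo_buffer and "dets_sync_demo.dets" are made-up names.
run() ->
    File = "dets_sync_demo.dets",
    {ok, Tab} = dets:open_file(demo_buffer,
                               [{file, File},
                                {type, set},
                                %% flush when the table has not been
                                %% accessed for 1000 milliseconds
                                {auto_save, 1000}]),

    ok = dets:insert(Tab, {1, <<"pending row">>}),

    %% Do not wait for the autosave interval: request the flush now.
    ok = dets:sync(Tab),

    [{1, <<"pending row">>}] = dets:lookup(Tab, 1),

    %% A proper close marks the file as cleanly closed, so no repair
    %% is needed the next time it is opened.
    ok = dets:close(Tab),
    ok = file:delete(File),
    ok.
```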
Frank Muller
2021-05-27 06:51:53 UTC
How about Mnesia and persistence to disk?
Ulf Wiger
2021-05-27 14:11:55 UTC
Mnesia has a WAL (Write-Ahead Log), in which it writes data safely. It then
writes to dets (if that's the chosen table type).

At startup, dets files are repaired if they don't appear to have been
properly closed. Then the transaction log is applied, making sure that the
database is consistent.

Repairs of dets files have been known to take time in the past, but I think
OTP has optimized it, Klarna optimized the mnesia end of it, and both
computers and disks are insanely faster now.

I'd say that the most glaring issue with disc_only_copies in mnesia is not
even the 2 GB limit, but the fact that if you get there, dets will simply
discard the update, and mnesia won't even notice. That is, your application
must ensure that you never exceed the dets limit.
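
One way an application might guard against this is to check the file size before inserting. This is a rough sketch against a plain dets table, not against mnesia. The module name, the threshold and the error tuple are assumptions made for illustration. The check is approximate, since a single large insert can still cross the limit.

```erlang
-module(dets_size_guard).
-export([insert_checked/3, run/0]).

%% Insert Object only while the table file is smaller than MaxBytes.
%% MaxBytes is a caller-chosen safety margin below the dets limit.
insert_checked(Tab, Object, MaxBytes) ->
    case dets:info(Tab, file_size) of
        Size when is_integer(Size), Size < MaxBytes ->
            dets:insert(Tab, Object);
        Size when is_integer(Size) ->
            {error, {table_too_large, Size}};
        undefined ->
            %% dets:info/2 returns undefined for a table that is not open
            {error, not_open}
    end.

%% Small demonstration with a made-up table and file name.
run() ->
    File = "dets_size_guard.dets",
    {ok, Tab} = dets:open_file(size_guard_demo, [{file, File}]),
    %% Generous limit: the insert goes through.
    ok = insert_checked(Tab, {1, a}, 1800 * 1024 * 1024),
    %% Limit of zero bytes: the insert is refused.
    {error, {table_too_large, _}} = insert_checked(Tab, {2, b}, 0),
    [] = dets:lookup(Tab, 2),
    ok = dets:close(Tab),
    ok = file:delete(File),
    ok.
```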

Most people use disc_copies for persistence, since it offers better
performance and better reliability than disc_only_copies. The downside is
that the table must also fit in RAM. A different approach would be to use a
backend plugin. There are three alternatives to choose from, as far as I
know: leveldb, leveled, and rocksdb. There may be issues building leveldb
on newer OTP versions. Leveled is (almost) entirely erlang-based, so it
wins hands-down on build time. Rocksdb should be the fastest, although the
difference isn't dramatic.

BR,
Ulf W
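
As an illustration of the disc_copies route, here is a minimal single-node sketch. The table name, record and payload are hypothetical. sync_transaction/1 is documented to wait until the data has been committed and logged to disk on the involved nodes, and sync_log/0 is documented to sync the local transaction log file to disk. The schema directory is created in the current working directory.

```erlang
-module(mnesia_disc_copies_demo).
-export([run/0]).

%% Hypothetical record used only for this sketch.
-record(pending, {id, payload}).

run() ->
    %% create_schema/1 returns an error if a schema already exists;
    %% that case is ignored here.
    _ = mnesia:create_schema([node()]),
    ok = mnesia:start(),

    case mnesia:create_table(pending,
                             [{disc_copies, [node()]},
                              {attributes, record_info(fields, pending)}]) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, pending}} -> ok
    end,
    ok = mnesia:wait_for_tables([pending], 5000),

    %% The write goes through the transaction log before it is
    %% dumped to the table files.
    Write = fun() -> mnesia:write(#pending{id = 1, payload = <<"row">>}) end,
    {atomic, ok} = mnesia:sync_transaction(Write),

    %% Ask mnesia to sync the local transaction log file to disk.
    ok = mnesia:sync_log(),

    Read = fun() -> mnesia:read(pending, 1) end,
    {atomic, [#pending{id = 1, payload = <<"row">>}]} = mnesia:transaction(Read),

    stopped = mnesia:stop(),
    ok.
```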
Frank Muller
2021-05-27 14:31:03 UTC
Thanks for the info Ulf.

Could you please point me to the WAL source code?
Curious to know how it’s implemented.
Ulf Wiger
2021-05-27 16:07:17 UTC
The logic is spread out, but a starting point is where the actual commit is
logged.

https://github.com/erlang/otp/blob/master/lib/mnesia/src/mnesia_tm.erl#L284-L291

But there are several different places where stuff happens. Check also the
mnesia_tm:do_commit() function:

https://github.com/erlang/otp/blob/master/lib/mnesia/src/mnesia_tm.erl#L1781-L1797

and the mnesia_dumper.erl module (which reads the commit log and disperses
the data into the different tables, both at startup and periodically, to
avoid having the commit log grow too large).

BR,
Ulf
Frank Muller
2021-05-27 16:39:42 UTC
Awesome, thanks!
I thought the WAL was implemented in C.