2019-08-28 08:35:46 +00:00
2019-08-29 10:54:53 +00:00
\begin_layout Standard
title{MARS for Kernel Developers}
Basic Working Principle
Control Model
Although MARS tries to
\emph on
\emph default
\emph on
\emph default
the synchronous control behaviour of DRBD at the interface level (
\family typewriter
\family default
) in many situations as best as it can, the
\emph on
\emph default
control model is necessarily asynchronous.
As an experienced sysadmin, you will be curious how it works in principle.
When you know something about it, you will no longer be surprised when
some behaviour differs in detail from DRBD.
\begin_layout Standard
\begin_inset Graphics
filename images/handshake.fig
width 80col%
\begin_layout Standard
We have a binary todo switch, which can be either in state
\begin_inset Quotes eld
\begin_inset Quotes erd
\begin_inset Quotes eld
\begin_inset Quotes erd
In addition, we have an actual response indicator, which is similar to
an LED indicating the actual status.
In our example, we imagine that both are used for controlling a big ventilator,
having a huge inert mass.
Imagine a big machine from a power plant, which is as tall as a human.
\begin_layout Standard
We start in a situation where the binary switch is off, and the ventilator
is stopped.
At point 1, we turn on the switch.
At that moment, a big contactor will sound like
\begin_inset Quotes eld
\begin_inset Quotes erd
, and a big motor will start to hum.
At first you won't hear anything else.
It will take a while, say 1 minute, until the big wheel will have reached
its final operating RPM, due to the huge inert mass.
During that spin-up, the lights in your room will become slightly darker.
Once the full RPM is reached at point 2, your workplace will be
noisier, but in exchange your room lights will be back at ordinary strength,
and the actual response LED will light up in order to indicate that
the big fan is now operational.
\begin_layout Standard
Assume we want to turn the system off.
When turning the todo switch to
\begin_inset Quotes eld
\begin_inset Quotes erd
at point 3, first nothing will seem to happen at all.
The big wheel will keep spinning due to its heavy inert mass, and the RPM
as well as the sound will go down only slowly.
During spin-down, the actual response LED will stay illuminated, in order
to warn you that you should not touch the wheel, otherwise you may get
Notice that it is only safe to access the wheel when
\emph on
both
\emph default
the switch and the LED are off.
Conversely, if at least one of them is on, something is going on inside
the machine.
Transferred to MARS: always look at
\emph on
both
\emph default
the todo switch and the corresponding actual indicator in order to not miss
The LED will only go off after, say, 2 minutes, when the wheel has actually
stopped at point 4.
After that, the cycle may potentially start over again.
\begin_layout Standard
As you can see, all four possible combinations of the two boolean values
(their Cartesian product) occur in the diagram.
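\begin_layout Standard
The ventilator example can be replayed as a small simulation. The following is only a sketch of the handshake idea, not MARS code: the class name, the tick-based timing and the value of full_rpm are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Ventilator:
    """Toy model of the handshake: a todo switch plus an actual-response LED."""
    todo: bool = False    # the switch: what the operator wants
    actual: bool = False  # the LED: what the machine reports
    rpm: int = 0          # current wheel speed, in arbitrary steps
    full_rpm: int = 3     # hypothetical number of ticks for spin-up / spin-down

    def tick(self) -> None:
        """Advance the inert wheel by one time step."""
        if self.todo and self.rpm < self.full_rpm:
            self.rpm += 1
            if self.rpm == self.full_rpm:
                self.actual = True    # point 2: full RPM reached, LED goes on
        elif not self.todo and self.rpm > 0:
            self.rpm -= 1
            if self.rpm == 0:
                self.actual = False   # point 4: wheel stopped, LED goes off

    def safe_to_touch(self) -> bool:
        """Only safe when both the switch and the LED are off."""
        return not self.todo and not self.actual


def run_cycle(v: Ventilator) -> list:
    """Return the (todo, actual) pairs observed over one full on/off cycle."""
    seen = [(v.todo, v.actual)]       # initial state: off / off
    v.todo = True                     # point 1: switch on
    seen.append((v.todo, v.actual))
    while not v.actual:               # spin-up takes several ticks
        v.tick()
    seen.append((v.todo, v.actual))   # point 2
    v.todo = False                    # point 3: switch off
    seen.append((v.todo, v.actual))
    while v.actual:                   # spin-down takes several ticks
        v.tick()
    seen.append((v.todo, v.actual))   # point 4
    return seen
```

Calling run_cycle(Ventilator()) visits off/off, on/off, on/on and off/on before returning to off/off, so a caller who polls only one of the two values cannot tell spin-up from spin-down.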
\begin_layout Standard
The same handshake protocol is used in MARS for communication between userspace
and kernelspace, as well as for communication in the widely distributed
The Lamport Clock
MARS is always
\emph on
\emph default
communicating in the distributed system on
\emph on
\emph default
topics, even strategic decisions.
If there were a
\emph on
\emph default
global consistency model, which would be roughly equivalent to a standalone
model, we would need
\emph on
\emph default
in order to serialize conflicting requests.
It has been known for many decades that
\emph on
distributed locks
\emph default
not only suffer from performance problems, but are also cumbersome
to get working reliably in scenarios where nodes or network links
may fail at any time.
Therefore, MARS uses a very different consistency model:
\series bold
Eventually Consistent
\series default
filename images/lightbulb_brightlit_benj_.png
2020-02-15 15:45:31 +00:00
lyxscale 9
scale 5
Notice that the network bottleneck problems described in section
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:Network-Bottlenecks"
\emph on
\emph default
\begin_inset Quotes eld
eventually consistent
\begin_inset Quotes erd
You have
\series bold
no chance
\series default
against natural laws, like Einstein's laws.
In order to cope with the problem area, you have to
\emph on
invest some additional effort
\emph default
Unfortunately, asynchronous communication models are trickier to program
and to debug than simple strictly consistent models.
In particular, you
\emph on
have to cope with
\emph default
\series bold
race conditions
\series default
\emph on
\emph default
\emph on
\emph default
\begin_inset Quotes eld
eventually consistent
\begin_inset Quotes erd
In the face of the laws of the universe, motivate yourself by looking at
the graphics at the cover page: the planets are a
\emph on
\emph default
for what you have to do!
\begin_layout Standard
\begin_inset Graphics
filename images/MatieresCorrosives.png
lyxscale 50
scale 17
Example: the asynchronous communication protocol of MARS leads to a different
behaviour from DRBD in case of
\series bold
network partitions
\series default
(temporary interruption of communication between some cluster nodes), because
\emph on
\emph default
the old state of remote nodes over long periods of time, while DRBD knows
absolutely nothing about its peers in disconnected state.
Sysadmins familiar with DRBD might find the following behaviour unusual:
\begin_inset Separator latexpar
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
DRBD Behaviour
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
MARS Behaviour
<row endhead="true">
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
the network partitions
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
automatic disconnect
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
nothing happens, but replication lags behind
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on A:
\family typewriter
umount $device
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on A:
\family typewriter
{drbd,mars}adm secondary
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on B:
\family typewriter
{drbd,mars}adm primary
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
works, split brain happens
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\series bold
\size tiny
\series default
because B believes that A is primary
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
the network resumes
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
automatic connect attempt fails
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
communication automatically resumes
\begin_layout Standard
If you intentionally want to switch over (and to produce a split brain as
a side effect), the following variant must be used with MARS:
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
DRBD Behaviour
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
MARS Behaviour
<row endhead="true">
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
the network partitions
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
automatic disconnect
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
nothing happens, but replication lags behind
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on A:
\family typewriter
umount $device
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on A:
\family typewriter
{drbd,mars}adm secondary
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
works (but
\emph on
not recommended!
\emph default
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on B:
\family typewriter
{drbd,mars}adm primary
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
split brain, but nobody knows
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\series bold
\size tiny
\series default
because B believes that A is primary
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on B:
\family typewriter
marsadm disconnect
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
works, nothing happens
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on B:
\family typewriter
marsadm primary --force
<cell alignment="left" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="left" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
works, split brain happens on B, but A doesn't know
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
on B:
\family typewriter
marsadm connect
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
works, nothing happens
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
the network resumes
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
automatic connect attempt fails
<cell alignment="left" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text
\begin_layout Plain Layout
\size tiny
communication resumes, A now detects the split brain
\begin_layout Standard
In order to implement the consistency model
\begin_inset Quotes eld
eventually consistent
\begin_inset Quotes erd
, MARS uses a so-called Lamport
Published in the late 1970s by Leslie Lamport, also known as inventor of
MARS uses a special variant called
\begin_inset Quotes eld
physical Lamport clock
\begin_inset Quotes erd
The physical Lamport clock is another almost-realtime clock which
\emph on
\emph default
run independently from the Linux kernel system clock.
However, the Lamport clock tries to remain as near as possible to the system
\begin_layout Standard
\family typewriter
cat /proc/sys/mars/lamport_clock
\family default
The result will show both clocks in parallel, in units of seconds since
the Unix epoch, with nanosecond resolution.
When there are no network messages at all, both the system clock and the
Lamport clock will show almost the same time (except some minor differences
of a few nanoseconds resulting from the finite processor clock speed).
\begin_layout Standard
The physical Lamport clock works rather simply:
\emph on
every
\emph default
message on the network is augmented with a Lamport time stamp telling when
the message was
\emph on
sent
\emph default
according to the local Lamport clock of the sender.
Whenever that message is received by some receiver, it checks whether the
time ordering relation would be violated: whenever the Lamport timestamp
in the message would claim that the sender had sent it
\emph on
after
\emph default
it arrived at the receiver (according to drifts in their respective local
clocks), something must be wrong.
In this case, the local Lamport clock of the
\emph on
receiver
\emph default
is advanced shortly after the sender Lamport timestamp, such that the time
ordering relation is no longer violated.
As a consequence, any local Lamport clock may run ahead of the corresponding
local system clock.
In order to avoid accumulation of deltas between the Lamport and the system
clock, the Lamport clock will run slower after that, possibly until it
reaches the system clock again (if no other message arrives which sets
it forward again).
After having reached the system clock, the Lamport clock will continue
\begin_inset Quotes eld
\begin_inset Quotes erd
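\begin_layout Standard
The rules described above can be sketched in a few lines. This is an illustration under stated assumptions, not the MARS implementation: the class and method names are hypothetical, times are integer nanoseconds, the system time is passed in explicitly, and the catch-up rule (advancing at half speed) is merely one possible way of letting the Lamport clock run slower.

```python
class PhysicalLamportClock:
    """Sketch of a Lamport clock that stays close to the system clock.

    All times are integer nanoseconds. The half-speed catch-up rule is an
    assumption for illustration, not taken from MARS.
    """

    def __init__(self) -> None:
        self._anchor_sys = None      # system time when we were last pushed forward
        self._anchor_lamport = None  # Lamport time we were pushed forward to

    def now(self, system_now: int) -> int:
        """Return the Lamport time corresponding to the given system time."""
        if self._anchor_lamport is None:
            return system_now        # no offset: both clocks agree
        elapsed = system_now - self._anchor_sys
        slowed = self._anchor_lamport + elapsed // 2   # run at half speed
        if slowed <= system_now:
            # The system clock has caught up: drop the offset again.
            self._anchor_sys = self._anchor_lamport = None
            return system_now
        return slowed

    def on_receive(self, sender_stamp: int, system_now: int) -> int:
        """Process the timestamp of an incoming message, return local time."""
        local = self.now(system_now)
        if sender_stamp >= local:
            # The message claims to be sent at or after its arrival:
            # advance the receiver's clock shortly after the sender stamp.
            self._anchor_sys = system_now
            self._anchor_lamport = sender_stamp + 1
            return self._anchor_lamport
        return local
```

In this sketch the returned Lamport time never falls behind the supplied system time, and a message with an old timestamp leaves the receiver's clock unchanged.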
\begin_layout Standard
MARS uses the local Lamport clock for anything where other systems would
use the local system clock: for example, timestamp generation in the
\family typewriter
\family default
Even symlinks created there are timestamped according to the Lamport clock.
Both the kernel module and the userspace tool
\family typewriter
\family default
are always operating in the timescale of the Lamport clock.
Most importantly, all timestamp comparisons are always carried out with
respect to Lamport time.
\begin_inset Graphics
filename images/MatieresCorrosives.png
lyxscale 50
scale 17
Bigger differences between the Lamport and the system clock can be annoying
from a human point of view: when typing
\family typewriter
ls -l /mars/resource-mydata/
\family default
many timestamps may appear as if they were created in the
\begin_inset Quotes eld
\begin_inset Quotes erd
, because the
\family typewriter
\family default
command compares the output formatting against the system clock (it does
not even know of the existence of the MARS Lamport clock).
\begin_inset Graphics
filename images/MatieresToxiques.png
lyxscale 50
scale 17
Always use
\family typewriter
\family default
(or another clock synchronization service) in order to pre-synchronize
your system clocks as close as possible.
Bigger differences are not only annoying, but may lead some people to wrong
conclusions and therefore even lead to bad human decisions!
In a professional datacenter, you should use
\family typewriter
\family default
anyway, and you should monitor its effectiveness.
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 9
scale 5
Hint: many internal logfiles produced by the MARS kernel module contain
Lamport timestamps written as numerical values.
In order to convert them into human-readable form, use the command
\family typewriter
marsadm cat /mars/
\family default
or similar.
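\begin_layout Standard
To illustrate what such a conversion amounts to, here is a small sketch. It assumes that a timestamp is written as seconds since the Unix epoch followed by a fractional part of up to nine digits; that textual format and the function name are assumptions, and on a real system the marsadm command mentioned above remains the tool to use.

```python
from datetime import datetime, timezone


def format_lamport_stamp(stamp: str) -> str:
    """Convert 'seconds.nanoseconds' into a human-readable UTC string.

    The input format is an assumption for this sketch. The fraction is
    handled as text so that no nanosecond digits are lost to rounding.
    """
    seconds, _, nanos = stamp.partition(".")
    nanos = (nanos + "000000000")[:9]          # pad to nine digits
    when = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return when.strftime("%Y-%m-%d %H:%M:%S") + "." + nanos + " UTC"
```

The fraction is deliberately not converted to a float, because a 64-bit float cannot represent current epoch times with nanosecond precision.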
\begin_layout Section
The Symlink Tree
\begin_inset CommandInset label
LatexCommand label
name "sec:The-Symlink-Tree"
\begin_inset Graphics
filename images/MatieresCorrosives.png
lyxscale 50
scale 17
The symlink tree as described here will be replaced by another representation
in future versions of MARS.
Therefore, don't do any scripting by directly accessing symlinks! Use the
primitive macros described in section
\begin_inset CommandInset ref
LatexCommand ref
reference "subsec:Predefined-Trivial-Macros"
The current
\family typewriter
\family default
filesystem container format contains not only transaction logfiles, but
also acts as a generic storage for (persistent) state information.
Both configuration information and runtime state information are currently
stored in symlinks.
Symlinks are
\begin_inset Quotes eld
This means, the symlink targets need not be other files or directories,
but just any values like integers or strings.
\begin_inset Quotes erd
in order to represent some
\family typewriter
key -> value
\family default
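\begin_layout Standard
The idea of storing a value in a symlink target can be tried out in a scratch directory. The sketch below uses only the Python standard library on a POSIX system; the helper names are hypothetical, and it deliberately does not touch a real MARS directory, in line with the warning above against scripting on the symlinks directly.

```python
import os
import tempfile


def set_value(root: str, key: str, value: str) -> None:
    """Store value as the target of a symlink named key (atomic update)."""
    path = os.path.join(root, key)
    tmp = path + ".tmp"
    os.symlink(value, tmp)     # the target is just a string, it need not exist
    os.replace(tmp, path)      # rename over any previous symlink in one step


def get_value(root: str, key: str) -> str:
    """Read the value back without following the symlink."""
    return os.readlink(os.path.join(root, key))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        set_value(root, "mykey-A", "42")
        print(get_value(root, "mykey-A"))
```

Such a symlink is dangling from the filesystem's point of view, which is harmless here because nothing ever follows it.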
\begin_layout Standard
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 9
scale 5
It is not yet clear / decided, but there is a
\emph on
\emph default
that the
\emph on
\emph default
\family typewriter
key -> value
\family default
pairs will be retained in future versions of MARS.
Instead of being represented by symlinks, another representation will be
used, such that hopefully the
\family typewriter
\family default
part will remain in the form of a pathname, even if there were no longer
a physical representation in an actual filesystem.
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 9
scale 5
A fundamental difference from DRBD: when your DRBD primary crashed
some time ago, and now comes up again, you have to set up DRBD again by
a sequence of commands like
\family typewriter
modprobe drbd; drbdadm up all; drbdadm primary all
\family default
or similar.
In contrast, MARS needs only
\family typewriter
modprobe mars
\family default
\family typewriter
\family default
has been mounted by
\family typewriter
\family default
\emph on
\emph default
of the symlinks residing in
\family typewriter
\family default
will automatically remember your previous state, even if some of your resources
were primary while others were secondary (mixed operations).
You don't need to do any actions in order to
\begin_inset Quotes eld
\begin_inset Quotes erd
a previous state, no matter how
\begin_inset Quotes eld
\begin_inset Quotes erd
it was.
(Almost) all symlinks appearing in the
\family typewriter
\family default
directory tree are automatically replicated throughout the whole cluster,
provided that the cluster
\family typewriter
\family default
s are equal
This is protection against accidental
\begin_inset Quotes eld
\begin_inset Quotes erd
of two unrelated clusters which had been created at different times with
\family typewriter
\family default
at all sites.
Thus the
\family typewriter
\family default
directory forms some kind of
\emph on
global namespace
\emph default
In order to avoid name clashes, each pathname created at node A follows
a convention: the node name A should be a suffix of the pathname.
Typically, internal MARS names follow the scheme
\family typewriter
\emph on
\emph default
\family default
When using the expert command
\family typewriter
marsadm {get,set}-link
\family default
(which will likely be replaced by something else in future MARS releases),
you should follow the best practice of systematically using pathnames like
\family typewriter
\family default
or similar.
As a result, each node will automatically get informed about the state
at any other node, like B when the corresponding information is recorded
on node B under the name
\family typewriter
\family default
(context-dependent names).
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 9
scale 5
Experts only: the symlink replication works generically.
You might use the
\family typewriter
\family default
directory in order to place your own symlink there (for whatever purpose,
which need not have to do with MARS).
However, the symlinks are likely to disappear.
\family typewriter
marsadm {get,set}-link
\family default
There is a chance that these abstract commands (or variants thereof) will
be retained, by acting on the new data representation in future, even if
the old symlink format will vanish some day.
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 9
scale 5
Important: the convention of placing the
\series bold
creator host name
\series default
inside your pathnames should be used wherever possible.
The name part is a kind of
\begin_inset Quotes eld
ownership indicator
\begin_inset Quotes erd
It is crucial that no other host writes any symlink not
\begin_inset Quotes eld
\begin_inset Quotes erd
to him.
Other hosts may read foreign information as often as they want, but must never
modify it.
This way, your cluster nodes are able to
\emph on
\emph default
with each other via symlink / information updates.
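\begin_layout Standard
The ownership convention can be expressed as a simple check. This is a sketch only: the function and class names are hypothetical, and the dash separator is inferred from example names such as /mars/userspace/mykey-A rather than from a documented rule.

```python
def may_write(path: str, host: str) -> bool:
    """A host may only write keys whose pathname ends in its own name.

    Caveat of this sketch: a plain suffix test cannot tell host 'A' apart
    from a host whose name merely ends in '-A'.
    """
    return path.endswith("-" + host)


class OwnedStore:
    """In-memory stand-in for one node's view of the key -> value pairs."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.entries = {}

    def set_link(self, path: str, value: str) -> None:
        """Write a key, refusing keys owned by another host."""
        if not may_write(path, self.host):
            raise PermissionError(f"{path} is not owned by {self.host}")
        self.entries[path] = value

    def get_link(self, path: str) -> str:
        """Reading foreign keys is always allowed."""
        return self.entries[path]
```

With exactly one writer per key, two nodes can never produce conflicting updates to the same key, so replication only has to decide which version of a single writer's value is the newest.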
Although experts might create (and change) the current symlinks with userspace
tools like
\family typewriter
ln -s
\family default
\family typewriter
marsadm set-link myvalue /mars/userspace/mykey-A
\family typewriter
marsadm delete-file /mars/userspace/mykey-A
There are many reasons for this: first, the
\family typewriter
marsadm set-link
\family default
command will automatically use the Lamport clock for symlink creation,
and therefore will avoid any errors resulting from a
\begin_inset Quotes eld
\begin_inset Quotes erd
system clock (as in
\family typewriter
ln -s
\family default
Second, the
\family typewriter
marsadm delete-file
\family default
(which also deletes symlinks) works on the
\emph on
whole cluster
\emph default
And finally, there is a chance that this will work in future versions of
MARS even after the symlinks have vanished.
What's the difference? If you would try to remove your symlink locally by
hand via
\family typewriter
rm -f
\family default
, you will be surprised: since the symlink has been replicated to the other
cluster nodes, it will be re-transferred from there and will be resurrected
locally after some short time.
This way, you cannot delete any object reliably, because your whole cluster
(which may consist of many nodes) remembers all your state information
and will
\begin_inset Quotes eld
\begin_inset Quotes erd
it whenever
\begin_inset Quotes eld
\begin_inset Quotes erd
In order to solve the deletion problem, MARS uses some internal deletion
protocol using auxiliary symlinks residing in
\family typewriter
\family default
The deletion protocol ensures that all replicas get deleted in the whole
cluster, and only thereafter the auxiliary symlinks in
\family typewriter
\family default
are also deleted eventually.
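\begin_layout Standard
The deletion problem, and the general shape of a solution, can be reproduced in a toy model. Note the assumptions: each node is a plain dictionary mapping a key to a (Lamport stamp, value) pair, and deletion is modelled by a generic tombstone marker (value None) that is purged once every node has seen it. The real MARS protocol uses auxiliary symlinks whose details are not described here, so this is a stand-in technique, not the actual mechanism.

```python
def sync(a: dict, b: dict) -> None:
    """Exchange entries in both directions; the newer stamp wins."""
    for key in set(a) | set(b):
        candidates = [store[key] for store in (a, b) if key in store]
        newest = max(candidates, key=lambda entry: entry[0])
        a[key] = b[key] = newest


def delete(store: dict, key: str, stamp: int) -> None:
    """Record a deletion as a tombstone instead of removing the entry."""
    store[key] = (stamp, None)


def purge(nodes: list, key: str) -> bool:
    """Drop the tombstone only after every node holds it."""
    if all(key in n and n[key][1] is None for n in nodes):
        for n in nodes:
            del n[key]
        return True
    return False
```

Removing the entry from only one dictionary corresponds to the local rm -f: the next sync brings it back. The tombstone carries a newer stamp and therefore wins on all nodes before it is finally purged.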
You may update your already existing symlink via
\family typewriter
marsadm set-link some-other-value /mars/userspace/mykey-A
\family default
The new value will be propagated throughout the cluster according to a
\series bold
timestamp comparison protocol
\series default
: whenever node B notices that A has a
\emph on
newer
\emph default
version of some symlink (according to the Lamport timestamp), it will replace
its older version with the newer one.
The opposite does
\emph on
not
\emph default
work: if B notices that A has an older version, nothing happens.
This way, the timestamps of symlinks can only progress in forward direction,
but never backwards in time.
As a consequence, symlink updates made
\begin_inset Quotes eld
by hand
\begin_inset Quotes erd
\family typewriter
ln -sf
\family default
may get lost when the local system clock is much earlier than the
Lamport clock.
When your cluster is fully connected by the network, the last timestamp
will finally win everywhere.
Only in case of network outages leading to
\emph on
network partitions
\emph default
, some information may be
\emph on
temporarily inconsistent
\emph default
, but only for the duration of the network outage.
The timestamp comparison protocol in combination with the Lamport clock
and with the persistence of the
\family typewriter
\family default
filesystem will automatically heal any temporary inconsistencies as soon
as possible, even in case of temporary node shutdown.
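\begin_layout Standard
The timestamp comparison protocol can be sketched as a one-directional pull between two nodes. The representation is an assumption for illustration: each node is a dictionary mapping a pathname to a (Lamport stamp, value) pair, and the function name is hypothetical.

```python
def pull(dst: dict, src: dict) -> None:
    """Copy every entry from src that is strictly newer than dst's copy.

    Older or equally old entries are ignored, so stamps only move forward.
    """
    for key, (stamp, value) in src.items():
        if key not in dst or stamp > dst[key][0]:
            dst[key] = (stamp, value)


if __name__ == "__main__":
    key = "/mars/userspace/mykey-A"
    a = {key: (10, "myvalue")}
    b = {}
    pull(b, a)                          # B learns the value from A
    a[key] = (20, "some-other-value")   # A updates during a network partition
    print(b[key])                       # B still shows the old state
    pull(b, a)                          # the network resumes
    print(b[key])                       # B has converged to the newest value
```

An update made by hand with a stamp older than the replicated one would be overwritten at the next pull, which is the lost-update effect described above.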
The meaning of some internal MARS symlinks residing in
\family typewriter
\family default
MARS for Developers
This chapter is organized strictly top-down.
\begin_layout Standard
If you are a sysadmin and want to inform yourself about internals (useful
for debugging), the relevant information is at the beginning, and you don't
need to dive into all technical details at the end.
If you are a kernel developer and want to contribute code to the emerging
MARS community, please read it (almost) all.
Due to the top-down organization, sometimes you will need to follow some
forward references in order to understand details.
Therefore I recommend reading this chapter twice in two different reading
modes: in the first reading pass, you just get a raw network of principles
and structures in your brain (you don't want to grasp details, therefore
don't strive for a full understanding).
In the second pass, you will exploit your knowledge from the first pass
for a deeper understanding of the details.
Alternatively, you may first read the sections about general architecture,
and then start a bottom-up scan by first reading the last section about
generic objects and aspects, and working in reverse
\emph on
\emph default
order (but read
\emph on
\emph default
sections in-order) until you finally reach the kernel interfaces / symlink
\begin_layout Section
Motivation / Politics
MARS is not yet upstream in the Linux kernel.
This section tries to clear up some potential doubts.
Some people have asked why MARS uses its own internal framework instead
\emph on
\emph default
Notice that
\emph on
\emph default
use of pre-existing Linux infrastructure is not only possible, but actually
implemented, by using it
\emph on
\emph default
in brick
\emph on
\emph default
(black-box principle).
However, such bricks are not portable to other environments like userspace.
being based on some already existing Linux kernel infrastructures like
the device mapper.
Here is a list of technical reasons:
The existing device mapper infrastructure is based on
\family typewriter
struct bio
\family default
In contrast, the new XIO personality of the generic brick infrastructure
is based on the concept of AIO (Asynchronous IO), which is a
\series bold
true superset
\series default
of block IO.
In particular,
\family typewriter
struct bio
\family default
firmly references
\family typewriter
struct page
\family default
(via intermediate
\family typewriter
struct bio_vec
\family default
), using types like
\family typewriter
\family default
in the field
\family typewriter
\family default
Basic transfer units are blocks, or sectors, or pages, or the like.
In contrast,
\family typewriter
struct aio_object
\family default
used by the XIO personality can address
\series bold
arbitrary granularity
\series default
memory with byte resolution even at odd
Some brick
\emph on
\emph default
(as opposed to the capabilities of the
\emph on
\emph default
) may be (and, in fact,
\emph on
\emph default
) restricted to
\family typewriter
\family default
operations or the like.
This is no general problem, because IOP can automatically insert some translator
bricks extending the capabilities to universal granularity (of course
at some performance cost).
positions in (virtual) files / devices, similar to classical Unix file
IO, but
\emph on
\emph default
Practical experience shows that even non-functional properties like performance
of many datacenter workloads profit from that
The current transaction logger uses variable-sized headers at
\begin_inset Quotes eld
\begin_inset Quotes erd
Although this increases
\family typewriter
\family default
load due to
\begin_inset Quotes eld
\begin_inset Quotes erd
, the
\emph on
overall performance
\emph default
was provably better than in variants where sector / page alignment was
strictly obeyed, but space was wasted for alignments.
Such functionality is only possible if the XIO infrastructure
\emph on
\emph default
\emph on
\emph default
(but doesn't force)
\begin_inset Quotes eld
\begin_inset Quotes erd
IO operations.
In future, many different transaction logfile formats showing different
runtime behaviour (e.g.
optimized for high-throughput SSD loads) may co-exist in parallel.
Note that properly aligned XIO operations bear no noticeable overhead compared
to classical block IO, at least in typical datacenter RAID scenarios.
The AIO/XIO abstraction contains no fixed link to kernel abstractions and
should be
\series bold
easily portable
\series default
to other environments.
In summary, the new personality provides a uniform abstraction which abstracts
away from multiple different kernel interfaces; it is designed to be useful
even in userspace.
\begin_layout Enumerate
Kernel infrastructures for the concept of
\emph on
direct IO
\emph default
are different from those for
\emph on
buffered IO
\emph default
The XIO personality used by MARS subsumes both concepts as use case
\emph on
\emph default
\series bold
\series default
is an optional internal property of XIO bricks (almost non-functional property
with support for consistency guarantees).
\begin_layout Enumerate
The AIO/XIO personality is generically designed for remote operations over
networks, at arbitrary places in the IO stack, with (almost
\begin_inset Foot
status open
\begin_layout Plain Layout
By default, automatic network connection re-establishment and infinite network
retries are already implemented in the
\family typewriter
\family default
\family typewriter
\family default
bricks to provide fully transparent semantics.
However, this may be undesirable in case of fatal crashes.
Therefore, abort operations are also configurable, as well as network timeouts
which are then mapped to classical IO errors.
) no semantic differences to local operations (built-in
\series bold
network transparency
\series default
).
There are universal provisions for mixed operation of different versions (
\series bold
rolling software updates
\series default
in clusters / grids).
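The retry behaviour described in the footnote above (retry forever by default, optionally abort and report a classical IO error) can be sketched as follows. This is an illustration under assumptions, not MARS code: `submit_remote`, `retry_policy` and `flaky_send` are hypothetical names, and an attempt counter stands in for the real, time-based network timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy: retry forever unless aborting is enabled. */
struct retry_policy {
    bool abort_enabled;    /* false: infinite retries (fully transparent) */
    unsigned max_attempts; /* only consulted when abort_enabled is true */
};

/* Transport callback: returns 0 on success, negative on network failure. */
typedef int (*send_fn)(void *ctx);

/* Submit one request remotely; network trouble is hidden unless aborting. */
int submit_remote(send_fn try_send, void *ctx, const struct retry_policy *pol)
{
    assert(try_send != NULL && pol != NULL);
    for (unsigned attempt = 1; ; attempt++) {
        if (try_send(ctx) == 0)
            return 0;
        if (pol->abort_enabled && attempt >= pol->max_attempts)
            return -EIO; /* surfaced like a classical local IO error */
        /* A real implementation would reconnect and back off here. */
    }
}

/* Demo transport: fails until the counter behind ctx reaches zero. */
int flaky_send(void *ctx)
{
    unsigned *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return -1;
    }
    return 0;
}
```

With aborting disabled, the caller sees the same result as for a local operation, only later; with aborting enabled, it sees an ordinary error code and needs no network-specific handling.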
\begin_layout Enumerate
The generic brick infrastructure (as well as its personalities like XIO
or any other future personality) supports
\series bold
dynamic re-wiring / re-configuration
\series default
\emph on
\emph default
operation (even while parallel IO requests are flying, some of them taking
different paths in the IO stack in parallel).
This is absolutely needed for MARS logfile rotation.
In the long term, this would be useful for many advanced new features and
products, not limited to multipathing.
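One way to picture re-wiring while requests are in flight is an input whose connection is a single atomically replaceable pointer. The sketch below uses C11 atomics in userspace; all type and function names are hypothetical, and it deliberately leaves out the hard part (draining or reference-counting requests that still travel along the old wire).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical brick output: something that can handle a request. */
struct brick_output {
    int (*handle)(struct brick_output *self, int request);
    int tag;
};

/* Hypothetical brick input: the wire is one atomically swappable pointer. */
struct brick_input {
    _Atomic(struct brick_output *) peer;
};

/* Forward one request along whatever wire is connected right now. */
int input_submit(struct brick_input *in, int request)
{
    struct brick_output *out = atomic_load(&in->peer);
    return out ? out->handle(out, request) : -1;
}

/* Re-wire at runtime; the old peer is returned so that the caller can
 * retire it once its in-flight requests have completed. */
struct brick_output *input_rewire(struct brick_input *in,
                                  struct brick_output *new_peer)
{
    assert(in != NULL);
    return atomic_exchange(&in->peer, new_peer);
}

/* Demo handler: encodes which output served the request. */
int tagging_handle(struct brick_output *self, int request)
{
    return self->tag * 1000 + request;
}
```

Each request reads the pointer exactly once, so it takes either the old or the new path completely, which corresponds to requests "taking different paths in the IO stack in parallel" during a logfile rotation.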
\begin_layout Enumerate
The generic brick infrastructure (and in turn all personalities) provides
\series bold
additional comfort
\series default
to the programmer while enabling
\series bold
increased functionality
\series default
: by use of a generalization of
\series bold
aspect orientation
\series default
.
Similar to AOP, insertion of IOP bricks for checking / debugging etc.
is one of the key advantages of the generic brick infrastructure.
In contrast to AOP where debugging is usually {en,dis}abled statically
at compile time, IOP allows for
\emph on
\emph default
(re-)configuration of debugging bricks, automatic repair, and many more
features promoted by
\emph on
organic computing
\emph default
, the programmer need no longer worry about dynamic memory allocations for
\emph on
local state
\emph default
in a brick instance.
\series bold
automating local state
\series default
even when dynamically instantiating new bricks (possibly having the same
brick type) at runtime.
Specifically, XIO is automating
\series bold
request stacking
\series default
at the completion path this way, even while dynamically reconfiguring the
IO stack.
The generic aspect orientation approach leads to better
\series bold
separation of concerns
\series default
: local state needed by brick implementations is not visible from outside
by default.
In other words, local state is also
\series bold
private state
\series default
.
Accidental interference with internal operations is thereby hindered.
\begin_layout Plain Layout
Example from the kernel: in
\family typewriter
\family default
the definition of
\family typewriter
struct request
\family default
contains the following comment:
/* the following two fields are internal, NEVER access directly */
\family default
It appears that
\family typewriter
struct request
\family default
contains not only fields relevant for the caller, but also
\series bold
internal fields
\series default
needed only in
\emph on
\emph default
\emph on
\emph default
For example,
\family typewriter
\family default
is documented to be used only in IO schedulers.
XIO goes one step further: there need not exist exactly one IO scheduler
instance in the IO stack for a single device.
\family typewriter
\family default
brick types could be each instantiated many times, and in arbitrary places,
even for the same (logical) device.
The equivalent of
\family typewriter
\family default
would then be automatically instantiated multiple times for the same IO
request, by automatically instantiating the right local aspect instances.
A similar automation does not exist in the rest of the Linux kernel.
DM can achieve stacking and dynamic routing by a workaround called
\emph on
request cloning
\emph default
, potentially leading to mass creation of temporary / intermediate objects.
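The idea of automatically instantiated, private per-brick state attached to each request can be sketched in plain C. This is a simplified illustration and not the XIO implementation: all names are hypothetical, bricks must register before requests are allocated, and dynamic reconfiguration is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical IO stack: knows how much aspect space a request needs. */
struct io_stack { size_t aspect_bytes; };

/* Hypothetical brick instance: remembers where its private state lives. */
struct brick { size_t aspect_offset; };

/* Request with one area holding the local state of all brick instances. */
struct request {
    long sector;
    void *aspect_area;
};

/* Each brick instance announces the size of its local state once. */
void brick_register(struct io_stack *st, struct brick *b, size_t state_size)
{
    size_t a = _Alignof(max_align_t);
    b->aspect_offset = st->aspect_bytes;
    st->aspect_bytes += (state_size + a - 1) / a * a; /* keep slots aligned */
}

/* Allocation reserves zeroed state for every registered instance. */
struct request *request_alloc(const struct io_stack *st, long sector)
{
    struct request *rq = malloc(sizeof(*rq));
    if (!rq)
        return NULL;
    rq->sector = sector;
    rq->aspect_area = calloc(1, st->aspect_bytes ? st->aspect_bytes : 1);
    if (!rq->aspect_area) {
        free(rq);
        return NULL;
    }
    return rq;
}

/* A brick reaches only its own slot; other bricks' state stays private. */
void *brick_aspect(const struct brick *b, struct request *rq)
{
    assert(rq != NULL && rq->aspect_area != NULL);
    return (unsigned char *)rq->aspect_area + b->aspect_offset;
}

void request_free(struct request *rq)
{
    free(rq->aspect_area);
    free(rq);
}
```

Two instances of the same brick type (for example two schedulers in different places of the stack) each get their own slot in the same request, so no field of the request structure has to be reserved for one particular layer.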
\begin_layout Enumerate
The generic brick infrastructure, together with personalities like XIO,
\series bold
new long-term functional and non-functional opportunities
\series default
by use of concepts from instance-oriented programming (IOP
The application area is
\series bold
not limited to device drivers
\series default
.
For example, a new personality for
\emph on
stackable filesystems
\emph default
could be developed in future.
In summary, anyone who would insist that MARS should be
\emph on
Notice that kernel-specific structures like
\family typewriter
struct bio
\family default
are of course used by MARS, but only
\emph on
\emph default
the blackbox implementation of bricks like
\family typewriter
\family default
\family typewriter
\family default
which act as
\series bold
\series default
to/from that structure.
It is possible to write further adaptors, e.g.
for direct interfacing to the device mapper infrastructure.
\emph default
based on pre-existing kernel structures / frameworks instead of contributing
a new framework would cause a
\emph on
massive regression of functionality
\emph default
On one hand, all code contributed by the MARS project is
\series bold
\series default
into the rest of the Linux kernel.
From the viewpoint of other parts of the kernel, the whole addition
\emph on
\emph default
\emph on
\emph default
a driver (although its infrastructure is much more than a driver).
\begin_layout Itemize
On the other hand, if people are interested, the contributed infrastructure
\emph on
\emph default
be used to
\emph on
\emph default
to the power of the Linux kernel.
It is designed to be
\series bold
open for contributions
\series default
\emph on
\emph default
(but not the only possible) way to do this is giving the generic brick
framework / the XIO personality as well as future personalities / the MARS
application the status of a
\emph on
\emph default
inside the kernel (in the long term), similar to the SCSI subsystem or
the network subsystem.
No one is forced to use it, but anybody may use it if they like.
Politically, the author is a FOSS advocate willing to collaborate and to
support anyone interested in contributions.
The author's personal interest is long-term, and the author is open to both
in-tree and out-of-tree extensions of both the framework and MARS by any other
party obeying the GPL and not endangering FOSS through patents (instead
supporting organizations like the Open Invention Network).
The author is open to closer relationships with the Linux Foundation and
other parts of the Linux ecosystem.
Architecture Overview
filename images/MARS_Framework_Architecture.pdf
width 100col%
Some Architectural Details
The following pictures show some
\begin_inset Quotes eld
zones of responsibility
\begin_inset Quotes erd
, not necessarily a strict hierarchy (although an attempt is made to respect
Dijkstra's famous layering rules from THE as far as possible).
\series bold
Instance Oriented Programming
\series default
(IOP) described in
Please note that MARS is only instance-
\emph on
\emph default
Similar to OOP, where
\begin_inset Quotes eld
\begin_inset Quotes erd
means a weaker form of
\begin_inset Quotes eld
\begin_inset Quotes erd
, the term
\begin_inset Quotes eld
\begin_inset Quotes erd
means that the
\emph on
\emph default
brick layer need not be fully modularized according to the IOP principles,
but the
\emph on
\emph default
brick layer already is.
, while MARS Full is planned to be fully instance-
\emph on
\emph default
MARS Architecture
\begin_inset Graphics
filename images/mars-light-architecture.fig
width 40col%
MARS Full Architecture (planned)
\begin_inset Graphics
filename images/mars-full-architecture.fig
width 80col%
Documentation of the Symlink Trees
\family typewriter
\family default
symlink tree serves the following purposes, all at the same time:
\series bold
\series default
between cluster nodes, see sections
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:The-Lamport-Clock"
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:The-Symlink-Tree"
This communication is even the
\emph on
\emph default
communication between cluster nodes (apart from the
\emph on
\emph default
of transaction logfiles and sync data).
\series bold
\emph on
\emph default
\series default
between the kernel module and the userspace tool
\family typewriter
\family default
\begin_layout Enumerate
\emph on
\emph default
persistent repository
\series default
which keeps state information between reboots (also in case of node crashes).
It is even the
\emph on
\emph default
place where state information is kept.
There is no other place like
\family typewriter
\family default
\begin_inset Graphics
filename images/MatieresCorrosives.png
lyxscale 50
scale 17
Because of its internal character, its representation and semantics may
change at any time without notice (e.g.
via an
\emph on
\emph default
upgrade procedure between major releases).
It is
\emph on
\emph default
an external interface to the outer world.
Don't build anything on it.
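To illustrate only the general mechanism of keeping small pieces of state in symlinks, the sketch below stores a value as the target of a symlink and replaces it atomically with rename(2). It assumes that state values live in symlink targets; the path and value names are invented and do not reflect the actual MARS tree layout, which remains internal as stated above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Store value as the target of the symlink at path. The new link is
 * created under a temporary name and then renamed over the old one, so
 * readers see either the old or the new value, never a missing link. */
int state_set(const char *path, const char *value)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    unlink(tmp); /* remove a leftover from an earlier crash, if any */
    if (symlink(value, tmp) != 0)
        return -1;
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Read the stored value back; returns its length or -1 on error. */
ssize_t state_get(const char *path, char *buf, size_t len)
{
    assert(len > 0);
    ssize_t n = readlink(path, buf, len - 1);
    if (n >= 0)
        buf[n] = '\0'; /* readlink(2) does not terminate the string */
    return n;
}
```

Because a symlink is created and replaced as a whole, each entry survives a crash either in its old or in its new state, which is one plausible reason why such a tree can double as a persistent repository.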
\begin_layout Standard
However, knowledge of the symlink tree is useful for advanced sysadmins,
\series bold
human inspection
\series default
and for
\series bold
\series default
\begin_layout Standard
As an
\begin_inset Quotes eld
\begin_inset Quotes erd
interface from outside, only the
\family typewriter
\family default
Documentation of the MARS Symlink Tree
XIO Worker Bricks
The XIO Brick Personality
\begin_layout Section
