mirror of https://github.com/schoebel/mars
arch-guide: new section about DSM and workingsets
parent
97eea2ea1f
commit
d2356cb7bf
|
@ -3469,7 +3469,7 @@ yes
|
|||
|
||||
\begin_layout Standard
|
||||
\noindent
|
||||
As indicated in section
|
||||
As indicated in sections
|
||||
\begin_inset CommandInset ref
|
||||
LatexCommand nameref
|
||||
reference "sec:Reliability-Arguments-from"
|
||||
|
@ -3477,6 +3477,16 @@ plural "false"
|
|||
caps "false"
|
||||
noprefix "false"
|
||||
|
||||
\end_inset
|
||||
|
||||
and
|
||||
\begin_inset CommandInset ref
|
||||
LatexCommand nameref
|
||||
reference "subsec:Explanations-from-DSM"
|
||||
plural "false"
|
||||
caps "false"
|
||||
noprefix "false"
|
||||
|
||||
\end_inset
|
||||
|
||||
, there are problems with object storage's
|
||||
|
@ -12008,6 +12018,433 @@ reference "subsec:Optimum-Reliability-from"
|
|||
.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Section
|
||||
Explanations from DSM and WorkingSet Theory
|
||||
\begin_inset CommandInset label
|
||||
LatexCommand label
|
||||
name "subsec:Explanations-from-DSM"
|
||||
|
||||
\end_inset
|
||||
|
||||
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
This section tries to explain the BigCluster incidents observed at some
|
||||
1&1 Ionos daughter company from a different perspective.
|
||||
In the OS literature and community, DSM = Distributed Shared Memory and
|
||||
Denning's workingset theory from the 1960s are typically attributed to
|
||||
a different research area.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
However, personal discussions with some prominent promoters of Ceph led to
informal agreement about some use cases where BigCluster appears
|
||||
to be well suited:
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
Large collections of audio / video files.
|
||||
These are never modified in place, but written once, and then
|
||||
\series bold
|
||||
\emph on
|
||||
streamed
|
||||
\series default
|
||||
\emph default
|
||||
.
|
||||
Thus it is possible to use relatively large object sizes, or even 1 video
|
||||
file = 1 object.
|
||||
Then streaming involves only a low number of objects at the same time,
|
||||
down to a per-application parallelism degree of typically only 1.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
Measurement data like in CERN physics experiments, where often some
|
||||
\emph on
|
||||
streaming model
|
||||
\emph default
|
||||
is predominant.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
Backups and long-term archives, when also accomplished via
|
||||
\emph on
|
||||
streaming
|
||||
\emph default
|
||||
.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
In contrast to this, here are some other use cases where BigCluster did
|
||||
not meet expectations of some people at 1&1 Ionos:
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
Virtual block devices involving
|
||||
\series bold
|
||||
strict consistency
|
||||
\series default
|
||||
on top of a very high number of small
|
||||
\begin_inset Quotes eld
|
||||
\end_inset
|
||||
|
||||
unreliable
|
||||
\begin_inset Quotes erd
|
||||
\end_inset
|
||||
|
||||
/ eventually consistent objects.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
CephFS with
|
||||
\series bold
|
||||
highly parallel random updates
|
||||
\series default
|
||||
to a huge number of files / inodes, also involving strict consistency in
|
||||
some places (e.g.
|
||||
concurrent metadata updates belonging to the same directory).
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
Here is a
|
||||
\emph on
|
||||
first attempt
|
||||
\emph default
|
||||
to explain these behavioural observations from a more generalized viewpoint.
|
||||
The author is open to discussion, and will revise this part upon better
|
||||
understanding.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
Ceph & co apparently shine in use cases where the
|
||||
\emph on
|
||||
object paradigm
|
||||
\emph default
|
||||
is naturally well-suited for the
|
||||
\emph on
|
||||
application behaviour
|
||||
\emph default
|
||||
.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
Application behaviour was already studied in the 1960s and 1970s.
|
||||
Theorists know that in general it is
|
||||
\emph on
|
||||
unpredictable
|
||||
\emph default
|
||||
due to Turing completeness, but practical observations reveal some
|
||||
frequent
|
||||
\emph on
|
||||
behavioural pattern
|
||||
\emph default
|
||||
s.
|
||||
Otherwise, caching would not be beneficial.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
While Denning had studied and modelled application behaviour for typical
|
||||
drum storage devices of his era, later DSM people stumbled over similar
|
||||
problems: the
|
||||
\emph on
|
||||
frequency of access to needed data
|
||||
\emph default
|
||||
can grow much higher than the channel / transport capacities can
|
||||
\begin_inset Foot
|
||||
status open
|
||||
|
||||
\begin_layout Plain Layout
|
||||
In general, this is unavoidable.
|
||||
In a storage pyramid, the CPU is always able to access RAM pages with a
|
||||
much higher frequency than any (R)DMA transport can supply.
|
||||
\end_layout
|
||||
|
||||
\end_inset
|
||||
|
||||
provide.
|
||||
Denning and Saltzer coined a term for this:
|
||||
\series bold
|
||||
thrashing
|
||||
\series default
|
||||
.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
Thrashing means that more time is spent by
|
||||
\emph on
|
||||
fetching
|
||||
\emph default
|
||||
data than by
|
||||
\emph on
|
||||
working
|
||||
\emph default
|
||||
with it, because the transports are
|
||||
\emph on
|
||||
overloaded
|
||||
\emph default
|
||||
.
|
||||
As Denning observed, thrashing essentially means that the system becomes
|
||||
|
||||
\emph on
|
||||
unusable by customers
|
||||
\emph default
|
||||
.
|
||||
Thrashing is a highly non-linear
|
||||
\series bold
|
||||
self-amplifying effect
|
||||
\series default
|
||||
, similar to traffic jams on highways: once it has started, it will worsen
|
||||
itself.
|
||||
\end_layout
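
\begin_layout Standard
The non-linearity can be illustrated with a deliberately simplified sketch.
The following is a textbook M/M/1 queue, which is an assumption made here
for illustration only and is not Denning's workingset model.
The symbols
\begin_inset Formula $\lambda$
\end_inset

 (request arrival rate),
\begin_inset Formula $\mu$
\end_inset

 (service rate of the transport) and
\begin_inset Formula $\rho$
\end_inset

 (utilization) are likewise introduced only for this sketch.
The mean response time
\begin_inset Formula $T$
\end_inset

 is then
\begin_inset Formula 
\[
T=\frac{1}{\mu-\lambda}=\frac{1/\mu}{1-\rho},\qquad\rho=\frac{\lambda}{\mu}<1
\]

\end_inset

At
\begin_inset Formula $\rho=0.5$
\end_inset

 a request takes 2 times the pure service time
\begin_inset Formula $1/\mu$
\end_inset

, at
\begin_inset Formula $\rho=0.9$
\end_inset

 it takes 10 times, and at
\begin_inset Formula $\rho=0.99$
\end_inset

 it takes 100 times.
A small increase in load near saturation thus causes a disproportionate
 increase in waiting time, even in this simple model without any self-amplificat
ion.
\end_layout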
|
||||
|
||||
\begin_layout Standard
|
||||
Saltzer found a workaround for his contemporary batch operating systems:
|
||||
limit the parallelism degree of concurrently running batch jobs.
|
||||
In his Multics project, this was also transferred to interactive systems,
|
||||
by limiting the swap-in parallelism degree of his contemporary swapping
|
||||
methods.
|
||||
Although this may sound counter-intuitive for modern readers: by introduction
|
||||
of a certain type of
|
||||
\series bold
|
||||
artificial limitation
|
||||
\series default
|
||||
at or around the non-linear regression point, the
|
||||
\series bold
|
||||
user experience was
|
||||
\emph on
|
||||
improved
|
||||
\series default
|
||||
\emph default
|
||||
.
|
||||
\end_layout
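
\begin_layout Standard
A rough sketch of why such a limitation can help, using Little's law for
 a closed system.
The symbols are assumptions introduced only for this illustration:
\begin_inset Formula $N$
\end_inset

 is the number of concurrently admitted jobs,
\begin_inset Formula $X$
\end_inset

 the throughput of the bottleneck transport,
\begin_inset Formula $X_{\max}$
\end_inset

 its maximum throughput, and
\begin_inset Formula $R$
\end_inset

 the mean response time per request (think times are neglected).
\begin_inset Formula 
\[
N=X\cdot R\quad\Longrightarrow\quad R=\frac{N}{X}\geq\frac{N}{X_{\max}}
\]

\end_inset

Once the transport is saturated, the response time grows at least linearly
 with the parallelism degree
\begin_inset Formula $N$
\end_inset

.
If thrashing additionally reduces
\begin_inset Formula $X$
\end_inset

 below
\begin_inset Formula $X_{\max}$
\end_inset

, the growth is worse than linear.
Capping
\begin_inset Formula $N$
\end_inset

 keeps the admitted jobs out of this regime, at the price of making other
 jobs wait for admission.
\end_layout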
|
||||
|
||||
\begin_layout Standard
|
||||
Now comes a conclusion: when thrashing occurs in a modern BigCluster model
|
||||
for whatever reason, the self-amplification will likely be worse than in
a LocalSharding model, for the following reasons:
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
|
||||
\series bold
|
||||
Overload propagation
|
||||
\series default
|
||||
: when some parts of the
|
||||
\begin_inset Formula $O(n^{2})$
|
||||
\end_inset
|
||||
|
||||
storage network are overloaded, other parts may also become affected in
|
||||
turn, due to sharing of network resources.
|
||||
Once queueing has started somewhere, it is likely to worsen, and likely
|
||||
to induce further queueing at other parts of the shared network.
|
||||
The more other parts are affected transitively, the more parts will get
|
||||
overloaded.
|
||||
So the overload, once it has started somewhere, has a higher probability
|
||||
for
|
||||
\emph on
|
||||
spreading out
|
||||
\emph default
|
||||
even to parts which were not overloaded before (self-amplification at BigCluster level).
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
Random replication of objects adds
|
||||
\emph on
|
||||
artificial randomness
|
||||
\emph default
|
||||
to the
|
||||
\series bold
|
||||
\emph on
|
||||
locality of reference
|
||||
\series default
|
||||
\emph default
|
||||
, as described by Denning.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Itemize
|
||||
Original DSM was trying to provide a strict or near-strict consistency model
|
||||
for application programmers.
|
||||
Later research then tried some weaker consistency models, without getting
|
||||
a final breakthrough for general use cases.
|
||||
BigCluster is similarly organized to DSM, but on slow
|
||||
\emph on
|
||||
remote storage
|
||||
\emph default
|
||||
instead of logically shared remote RAM over fast RDMA.
|
||||
Thus we can expect problems similar to those observed by the DSM community, like
|
||||
|
||||
\series bold
|
||||
single points of contention
|
||||
\series default
|
||||
, etc.
|
||||
These might become even worse once they have appeared.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
In a nutshell:
|
||||
\series bold
|
||||
system stability
|
||||
\series default
|
||||
under overload conditions, once they have started somewhere, is highly
|
||||
non-linear, and tends to spread
|
||||
\begin_inset Foot
|
||||
status open
|
||||
|
||||
\begin_layout Plain Layout
|
||||
In the past, advocates of BigCluster have argued that BigCluster
|
||||
can
|
||||
\emph on
|
||||
equally distribute
|
||||
\emph default
|
||||
the total application load onto
|
||||
\begin_inset Formula $O(n)$
|
||||
\end_inset
|
||||
|
||||
storage servers, so a single overloaded client will get better performance
|
||||
than in a sharding model.
|
||||
This argument contains the
|
||||
\emph on
|
||||
implicit assumption
|
||||
\emph default
|
||||
that load distribution is behaving
|
||||
\series bold
|
||||
linearly
|
||||
\series default
|
||||
, or close to that.
|
||||
However, Denning and Saltzer found that system reaction due to overload
|
||||
by workingset behaviour is
|
||||
\emph on
|
||||
extremely
|
||||
\emph default
|
||||
non-linear, and may
|
||||
\emph on
|
||||
completely
|
||||
\emph default
|
||||
bring down systems even when they are only
|
||||
\emph on
|
||||
slightly
|
||||
\emph default
|
||||
overloaded.
|
||||
Although there may exist some areas where the assumption of linearity is
|
||||
correct and may lead to improvements by better load distribution, unpredictable
|
||||
behaviour due to self-amplification of overload at BigCluster level may
|
||||
result in the
|
||||
\series bold
|
||||
opposite
|
||||
\series default
|
||||
.
|
||||
Denning has provided a mathematical model for this, which could probably
|
||||
be transferred to modern application behaviour.
|
||||
\end_layout
|
||||
|
||||
\end_inset
|
||||
|
||||
, and to self-amplify.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
In contrast, sharding models are not spreading any overload to other shards
|
||||
by definition.
|
||||
So the total availability from the viewpoint of the
|
||||
\emph on
|
||||
total
|
||||
\emph default
|
||||
set of customers is less vulnerable to impacts.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
\noindent
|
||||
\begin_inset Graphics
|
||||
filename images/lightbulb_brightlit_benj_.png
|
||||
lyxscale 12
|
||||
scale 7
|
||||
|
||||
\end_inset
|
||||
|
||||
In the above use cases where BigCluster shines, overload is unlikely,
|
||||
since the
|
||||
\emph on
|
||||
parallelism of object access
|
||||
\emph default
|
||||
is limited.
|
||||
This is somewhat similar to Saltzer's historic workaround for thrashing.
|
||||
|
||||
\emph on
|
||||
Streaming
|
||||
\emph default
|
||||
at application behaviour level will translate into streaming at the network
|
||||
layer.
|
||||
Classical TCP networks dealing with a relatively low number of high-throughput
|
||||
streaming connections are just
|
||||
\emph on
|
||||
constructed
|
||||
\emph default
|
||||
for dealing with packet loss, such as caused by overload, e.g.
|
||||
by their
|
||||
\series bold
|
||||
congestion control
|
||||
\series default
|
||||
|
||||
\begin_inset Foot
|
||||
status open
|
||||
|
||||
\begin_layout Plain Layout
|
||||
Recommended reading: the papers from Sally Floyd.
|
||||
\end_layout
|
||||
|
||||
\end_inset
|
||||
|
||||
algorithms.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
\noindent
|
||||
\begin_inset Graphics
|
||||
filename images/MatieresCorrosives.png
|
||||
lyxscale 50
|
||||
scale 17
|
||||
|
||||
\end_inset
|
||||
|
||||
In contrast, an extremely high number of parallel short connections would
|
||||
be similar to a
|
||||
\begin_inset Quotes eld
|
||||
\end_inset
|
||||
|
||||
SYN flood attack
|
||||
\begin_inset Quotes erd
|
||||
\end_inset
|
||||
|
||||
, or similar to a classical UDP packet storm.
|
||||
It would allow for a much higher parallelism degree, but would be more vulnerable
to packet loss, packet storm effects, etc., and more vulnerable to self-amplification.
|
||||
These application behaviour types are avoided in the above use case examples
|
||||
for BigCluster.
|
||||
\end_layout
|
||||
|
||||
\begin_layout Standard
|
||||
\noindent
|
||||
\begin_inset Graphics
|
||||
filename images/lightbulb_brightlit_benj_.png
|
||||
lyxscale 12
|
||||
scale 7
|
||||
|
||||
\end_inset
|
||||
|
||||
In addition, storing video files as immutable BLOBs will limit the
|
||||
\series bold
|
||||
randomness
|
||||
\series default
|
||||
of
|
||||
\emph on
|
||||
locality of reference
|
||||
\emph default
|
||||
, while splitting into millions of very small objects may easily lead to
|
||||
an explosion of randomness by some orders of magnitude.
|
||||
\end_layout
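
\begin_layout Standard
A hypothetical arithmetic example may illustrate the orders of magnitude.
The sizes are assumptions chosen for illustration and are not taken from
 the observed incidents.
A video file of 1 GiB stored as one BLOB involves a single object.
Split into objects of 4 MiB or of 4 KiB, the number of objects becomes
\begin_inset Formula 
\[
\frac{2^{30}}{2^{22}}=2^{8}=256\qquad\textrm{or}\qquad\frac{2^{30}}{2^{12}}=2^{18}=262144
\]

\end_inset

respectively.
With random placement, each of these objects may reside on a different
 storage server, so the number of potential access targets for one and the
 same file grows by the same factors.
\end_layout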
|
||||
|
||||
\begin_layout Section
|
||||
Performance Arguments from Architecture
|
||||
\begin_inset CommandInset label
|
||||
|
|