arch-guide: updates on performance

Thomas Schoebel-Theuer 2019-10-04 11:54:16 +02:00 committed by Thomas Schoebel-Theuer
parent 9d53952ecf
commit 4a6a430d27
1 changed files with 422 additions and 20 deletions


@ -16050,6 +16050,17 @@ name "sec:Performance-Arguments-from"
\end_inset
\end_layout
\begin_layout Subsection
Performance Penalties by Choice of Replication Layer
\begin_inset CommandInset label
LatexCommand label
name "subsec:Performance-Penalties-Layer"
\end_inset
\end_layout
\begin_layout Standard
@ -16065,15 +16076,34 @@ Trying to replicate several petabytes of data, or some billions of inodes,
\end_layout
\begin_layout Standard
Choosing the wrong layer for
Choosing the wrong
\series bold
layer
\series default
(see section
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Layering-Rules"
plural "false"
caps "false"
noprefix "false"
\end_inset
) for
\series bold
mass data replication
\series default
may get you into trouble.
Layer selection is much more important than any load distribution argument
as frequently heard from certain advocates.
Here is an architectural-level (cf section
\begin_inset CommandInset ref
LatexCommand ref
LatexCommand nameref
reference "sec:What-is-Architecture"
plural "false"
caps "false"
noprefix "false"
\end_inset
@ -16083,6 +16113,7 @@ reference "sec:What-is-Architecture"
\begin_layout Standard
\noindent
\align center
\begin_inset Graphics
filename images/Layers.pdf
width 100col%
@ -16095,7 +16126,21 @@ reference "sec:What-is-Architecture"
\begin_layout Standard
\noindent
The picture shows the main components of a standalone Unix / Linux system.
In the late 1970s / early 1980s, a so-called
It conforms to Dijkstra's layering rules explained in section
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Layering-Rules"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
\end_layout
\begin_layout Standard
In the late 1970s / early 1980s, a so-called
\emph on
Buffer Cache
\emph default
@ -16109,7 +16154,7 @@ Page Cache
\series bold
Dentry Cache
\series default
(for metadata).
(for metadata lookup).
\end_layout
\begin_layout Standard
@ -16142,7 +16187,7 @@ status open
In theory, there is another cut point D by implementing a generically distribute
d cache.
There exists some academic research on this, but practically usable enterprise-
grade systems are rare and not wide-spread.
grade implementations are rare and not wide-spread.
\end_layout
\end_inset
@ -16162,9 +16207,9 @@ Cut points B and C are
\emph on
generic
\emph default
, supporting a wide variety of applicactions, without altering them.
Cutting at B means replication at filesystem level.
C means replication at block level.
, supporting a wide variety of applications, without altering them.
Cutting at B means replication at filesystem layer.
C means replication at block layer.
\end_layout
\begin_layout Standard
@ -16184,15 +16229,54 @@ maintain cache coherence
.
\end_layout
\begin_layout Standard
\noindent
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 12
scale 7
\end_inset
Caching can yield several
\emph on
orders of magnitude
\emph default
of performance.
In contrast, frequently heard load distribution arguments can only re-distribut
e the already existing performance of your spindles, but cannot magically
\begin_inset Quotes eld
\end_inset
create
\begin_inset Quotes erd
\end_inset
new sources of performance out of thin air.
On the contrary, load distribution over a storage network
\emph on
costs
\emph default
some performance by introducing additional latencies and potential
bottlenecks.
\end_layout
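\begin_layout Standard
A rough back-of-the-envelope sketch may illustrate the effect of caching.
The numbers are assumptions chosen for illustration, not measurements: a
cache hit ratio of
\begin_inset Formula $h=0.99$
\end_inset
, a RAM access time of about 100 ns, and a spindle access time of about
10 ms.
The mean access time is then
\begin_inset Formula 
\[
t_{\mathrm{eff}}=h\cdot t_{\mathrm{RAM}}+(1-h)\cdot t_{\mathrm{disk}}\approx0.99\cdot100\,\mathrm{ns}+0.01\cdot10\,\mathrm{ms}\approx0.1\,\mathrm{ms}
\]
\end_inset
, which is about a factor of 100 better than the uncached spindle.
Under the same assumptions, a hit ratio of
\begin_inset Formula $h=0.999$
\end_inset
would yield roughly a factor of 1000.
\end_layout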
\begin_layout Standard
When replicating at C, the Linux caches are
\emph on
above
\emph default
your cut point.
Thus you will receive much less traffic, typically already reduced by a
factor of 100, or even more.
Thus you will receive much less traffic at C, typically already reduced
by a factor of 100, or even more.
This is much easier to cope with.
\emph on
Local
\emph default
caches and their SMP scaling properties can be implemented much more efficientl
y than distributed ones.
You will also profit from
\series bold
journalling filesystems
@ -16265,11 +16349,11 @@ This limitation isn't necessarily caused by the choice of layer.
laws of physics
\series default
: communication is always limited by the speed of light.
A distributed filesystem is nothing else but a logically
A distributed filesystem is essentially nothing else but a persistent
\series bold
distributed shared memory
DSM = Distributed Shared Memory
\series default
(DSM).
.
\end_layout
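\begin_layout Standard
A minimal worked estimate shows the order of magnitude, assuming an intercontinental
distance of
\begin_inset Formula $d=10000\,\mathrm{km}$
\end_inset
(an illustrative value).
The round-trip time is bounded from below by
\begin_inset Formula 
\[
t_{\mathrm{RTT}}\geq\frac{2d}{c}\approx\frac{2\cdot10000\,\mathrm{km}}{300000\,\mathrm{km/s}}\approx67\,\mathrm{ms}
\]
\end_inset
.
In optical fibre, where signals travel at roughly two thirds of
\begin_inset Formula $c$
\end_inset
, the bound is about 100 ms.
Any operation that has to wait for a remote acknowledgement cannot complete
faster than this, regardless of the implementation.
\end_layout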
\begin_layout Standard
@ -16284,8 +16368,11 @@ inferior
\end_layout
\begin_layout Standard
Therefore: you simply shouldn't try to solve long-distance communication
needs via communication over filesystems.
Therefore: you simply shouldn't try to solve
\series bold
long-distance communication needs
\series default
via communication over shared filesystems.
Even simple producer-consumer scenarios (one-way communication) are less
performant (e.g.
when compared to plain TCP/IP) when it comes to distributed POSIX semantics.
@ -16297,9 +16384,24 @@ synchronisation overhead at metadata level
\end_layout
\begin_layout Standard
If you have a need for mixed operations at different locations in parallel:
just split your data set into disjoint filesystem instances (or database
/ VM instances, etc).
If you want mixed operations at different locations in parallel: split your
data set into disjoint filesystem instances (or database / VM instances,
etc).
Then you should achieve the
\series bold
ability for butterfly
\series default
, see section
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Flexibility-of-Failover"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
All you need is careful thought about the
\emph on
appropriate
@ -16314,20 +16416,320 @@ sets
\emph default
of user homedirectory subtrees, or database sets logically belonging together,
etc).
An example hierarchy of granularities is described in section
\begin_inset CommandInset ref
LatexCommand nameref
reference "par:Positive-Example:-ShaHoLin"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
Further hints can be found in sections
\begin_inset CommandInset ref
LatexCommand nameref
reference "sec:Granularity-at-Architecture"
plural "false"
caps "false"
noprefix "false"
\end_inset
and
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Variants-of-Sharding"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
\end_layout
\begin_layout Standard
Replication at filesystem level is often at single-file granularity.
\noindent
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 12
scale 7
\end_inset
Sharding (see section
\begin_inset CommandInset ref
LatexCommand nameref
reference "par:Definition-of-Sharding"
plural "false"
caps "false"
noprefix "false"
\end_inset
) implementations like ShaHoLin (see section
\begin_inset CommandInset ref
LatexCommand nameref
reference "par:Positive-Example:-ShaHoLin"
plural "false"
caps "false"
noprefix "false"
\end_inset
) are essentially exploiting the scalability of SMP = Symmetric MultiProcessing,
nowadays typically going into saturation around
\begin_inset Formula $\approx100$
\end_inset
hardware CPU threads for typical workloads, which is executed by
\emph on
hardware
\emph default
inside of your server enclosure.
In contrast, DSM-like solutions are trying to distribute your application
workload over longer distances, involving relatively slow system software
instead of
\series bold
hardware acceleration
\series default
.
Therefore, SMP is preferable over DSM wherever possible.
\end_layout
\begin_layout Standard
Replication at filesystem level is often by single-file granularity.
If you have several millions or even billions of inodes, you may easily
find yourself in a snakepit.
See also
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Example-Failures-of"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
\end_layout
\begin_layout Standard
Conclusion: active-passive operation over long distances (such as between
continents) is even an advantage.
continents) is even an
\emph on
advantage
\emph default
.
It keeps you from trying bad / almost impossible things.
\end_layout
\begin_layout Subsection
Performance Tradeoffs from Load Distribution
\begin_inset CommandInset label
LatexCommand label
name "subsec:Performance-Tradeoffs-from-Load-Distribution"
\end_inset
\end_layout
\begin_layout Standard
A frequent argument from BigCluster advocates is that random replication
would provide better performance.
This argument isn't wrong, but it misses the point.
\end_layout
\begin_layout Standard
As analysed in section
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Similarities-and-differences"
plural "false"
caps "false"
noprefix "false"
\end_inset
, load distribution isn't a unique concept bound to BigCluster / random
replication.
Load distribution has been used for decades in a variety of
\series bold
RAID striping
\series default
methods.
\end_layout
\begin_layout Standard
RAID striping levels like RAID-0 or RAID-10 or RAID-60 have been known for
decades, forming a mature technology.
Also known since the 1980s is that the size of a single striped RAID set
must not grow too big, otherwise reliability will suffer too much.
Larger RAID systems are therefore
\series bold
split
\series default
into multiple
\series bold
RAID sets
\series default
.
\end_layout
\begin_layout Standard
This has some interesting parallels to the BigCluster reliability problems
analysed in section
\begin_inset CommandInset ref
LatexCommand nameref
reference "sub:Detailed-explanation"
plural "false"
caps "false"
noprefix "false"
\end_inset
, and some workarounds, e.g.
as discussed in section
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Similarities-and-differences"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
\end_layout
\begin_layout Standard
Summary: both RAID striping and random replication methods are
\series bold
limited
\series default
by the fundamental law of storage systems, see section
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Optimum-Reliability-from"
plural "false"
caps "false"
noprefix "false"
\end_inset
, in a similar way.
\end_layout
\begin_layout Standard
A detailed performance comparison at architecture level between random replication
of variable-sized objects and striping of block-level sectors is beyond
the scope of this architecture guide.
However, the following should be intuitively clear from section
\begin_inset CommandInset ref
LatexCommand nameref
reference "subsec:Layering-Rules"
plural "false"
caps "false"
noprefix "false"
\end_inset
and from Einstein's laws of the speed of light:
\end_layout
\begin_layout Quote
Fine-grained load distribution over
\series bold
short distances
\series default
and/or at
\series bold
lower layers
\series default
has a
\series bold
bigger performance potential
\series default
than over longer distances and/or at higher layers.
\end_layout
\begin_layout Standard
In other words: local SAS busses are capable of realtime IO transfers over
very short distances (enclosure-to-enclosure), while an expensive IP storage
network isn't realtime (due to packet loss).
SAS busses are
\emph on
constructed
\emph default
for dealing with requirements arising from RAID, and have been optimized
for years / decades.
\end_layout
\begin_layout Standard
\noindent
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 12
scale 7
\end_inset
Management summary: just use some appropriate RAID striping at your (Local)Shar
ding storage boxes for performance-critical workloads.
It is not only cheaper
\begin_inset Foot
status open
\begin_layout Plain Layout
Several OSDs are also using SAS or similar local IO busses, in order to
drive a high number of spindles.
Essentially, random replication involves
\emph on
two
\emph default
different types of networks at the same time.
This also explains why such a combination must necessarily induce some
performance loss.
\end_layout
\end_inset
, but typically also more performant (on top of comparable technology and
comparable dimensioning).
\end_layout
\begin_layout Standard
\noindent
\begin_inset Graphics
filename images/lightbulb_brightlit_benj_.png
lyxscale 12
scale 7
\end_inset
RAID-6 is much cheaper than RAID-10, and can also provide some striping
with respect to (random) reads.
However, random writes are much slower.
For read-intensive workloads, the striping behaviour of RAID-6 is often
sufficient.
A tool for comparison of different RAID setup alternatives can be found
at
\begin_inset Flex URL
status open
\begin_layout Plain Layout
http://www.blkreplay.org
\end_layout
\end_inset
.
\end_layout
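\begin_layout Standard
The difference in random write performance can be estimated with the commonly
used write penalty rule of thumb.
This is only a sketch under stated assumptions: a small random write costs
about
\begin_inset Formula $p=2$
\end_inset
disk operations on RAID-10 and about
\begin_inset Formula $p=6$
\end_inset
on RAID-6 (read and rewrite of data plus both parities), and the set consists
of
\begin_inset Formula $N=12$
\end_inset
spindles with an assumed 150 random IOPS each.
Then
\begin_inset Formula 
\[
\mathrm{IOPS}_{\mathrm{write}}\approx\frac{N\cdot\mathrm{IOPS}_{\mathrm{disk}}}{p}
\]
\end_inset
yields about
\begin_inset Formula $12\cdot150/2=900$
\end_inset
for RAID-10, but only about
\begin_inset Formula $12\cdot150/6=300$
\end_inset
for RAID-6.
The estimate ignores caches, write-back controllers and full-stripe writes,
so measurements of your actual workload remain necessary.
\end_layout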
\begin_layout Section
Scalability Arguments from Architecture
\begin_inset CommandInset label