diff --git a/docu/mars-architecture-guide.lyx b/docu/mars-architecture-guide.lyx index 2349c861..84a62972 100644 --- a/docu/mars-architecture-guide.lyx +++ b/docu/mars-architecture-guide.lyx @@ -9076,6 +9076,803 @@ noprefix "false" Nevertheless, principal behaviour of implementations are also discussed. \end_layout +\begin_layout Section +Performance Arguments from Architecture +\begin_inset CommandInset label +LatexCommand label +name "sec:Performance-Arguments-from" + +\end_inset + + +\end_layout + +\begin_layout Subsection +Performance Penalties by Choice of Replication Layer +\begin_inset CommandInset label +LatexCommand label +name "subsec:Performance-Penalties-Layer" + +\end_inset + + +\end_layout + +\begin_layout Standard +Some people think that replication is easily done at filesystem layer. + There exist lots of cluster filesystems and other filesystem-layer solutions + which claim to be able to replicate your data, sometimes even over long + distances. +\end_layout + +\begin_layout Standard +Trying to replicate several petabytes of data, or some billions of inodes, + is however a much bigger challenge than many people can imagine. +\end_layout + +\begin_layout Standard +Choosing the wrong +\series bold +layer +\series default + (see section +\begin_inset CommandInset ref +LatexCommand nameref +reference "subsec:Layering-Rules" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +) for +\series bold +mass data replication +\series default + may get you into trouble. + Layer selection is much more important than any load distribution argument + as frequently heard from certain advocates. + Here is an architectural-level (cf section +\begin_inset CommandInset ref +LatexCommand nameref +reference "sec:What-is-Architecture" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +) explanation why replication at the block layer is more easy and less error + prone: +\end_layout + +\begin_layout Standard +\noindent +\align center +\begin_inset Graphics + filename images/Layers.pdf + width 100col% + +\end_inset + + +\end_layout + +\begin_layout Standard +\noindent +The picture shows the main components of a standalone Unix / Linux system. + It conforms to Dijkstra's layering rules explained in section +\begin_inset CommandInset ref +LatexCommand nameref +reference "subsec:Layering-Rules" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +. +\end_layout + +\begin_layout Standard +In the late 1970s / early 1980s, a so-called +\emph on +Buffer Cache +\emph default + had been introduced into the architecture of Unix. + Today's Linux has refined the concept to various internal caches such as + the +\series bold +Page Cache +\series default + (for data) and the +\series bold +Dentry Cache +\series default + (for metadata lookup). +\end_layout + +\begin_layout Standard +All these caches serve one main purpose +\begin_inset Foot +status open + +\begin_layout Plain Layout +Another important purpose is +\series bold +providing shared memory +\series default +. +\end_layout + +\end_inset + +: they are reducing the load onto the storage by exploitation of fast RAM. + A well-tuned cache can yield high cache hit ratios, typically 99%. + In some cases (as observed in practice) even more than 99.9%. +\end_layout + +\begin_layout Standard +Now start distributing the system over long distances. 
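+\end_layout
+
+\begin_layout Standard
+To get a feeling for what is at stake, here is a back-of-the-envelope sketch
+ (it merely re-uses the hit ratios quoted above and is not a measurement):
+ a cache with hit ratio 
+\begin_inset Formula $h$
+\end_inset
+
+ lets only the fraction 
+\begin_inset Formula $1-h$
+\end_inset
+
+ of all accesses through to the layer below, so the traffic seen below the
+ cache shrinks by a factor of 
+\begin_inset Formula $1/(1-h)$
+\end_inset
+
+.
+ For 
+\begin_inset Formula $h=0.99$
+\end_inset
+
+ this is a factor of 100, for 
+\begin_inset Formula $h=0.999$
+\end_inset
+
+ a factor of 1000.
+ Keep this in mind when choosing where to cut.
+\end_layout
+
+\begin_layout Standard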
+ There are three potential cut points: A, B, and C
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+In theory, there is another cut point D, by implementing a generically
+ distributed cache.
+ There exists some academic research on this, but practically usable
+ enterprise-grade implementations are rare and not widespread.
+\end_layout
+
+\end_inset
+
+.
+\end_layout
+
+\begin_layout Standard
+Cut point A is application-specific, and can have advantages because it
+ has knowledge of the application.
+ For example, replication of mail queues can be controlled in a much more
+ fine-grained way than at filesystem or block layer.
+\end_layout
+
+\begin_layout Standard
+Cut points B and C are
+\emph on
+generic
+\emph default
+, supporting a wide variety of applications, without altering them.
+ Cutting at B means replication at filesystem layer.
+ C means replication at block layer.
+\end_layout
+
+\begin_layout Standard
+When replicating at B, you will notice that the caches are
+\emph on
+below
+\emph default
+ your cut point.
+ Thus you will have to re-implement
+\series bold
+distributed caches
+\series default
+, and you will have to
+\series bold
+maintain cache coherence
+\series default
+.
+\end_layout
+
+\begin_layout Standard
+\noindent
+\begin_inset Flex Custom Color Box 3
+status open
+
+\begin_layout Plain Layout
+\noindent
+\begin_inset Graphics
+ filename images/lightbulb_brightlit_benj_.png
+ lyxscale 12
+ scale 7
+
+\end_inset
+
+ Caching can yield several
+\emph on
+orders of magnitude
+\emph default
+ of performance.
+\end_layout
+
+\begin_layout Plain Layout
+\noindent
+\begin_inset Graphics
+ filename images/MatieresCorrosives.png
+ lyxscale 50
+ scale 17
+
+\end_inset
+
+ In contrast, frequently heard load distribution arguments can only re-distribute
+ the already existing performance of your spindles, but cannot magically
+ 
+\begin_inset Quotes eld
+\end_inset
+
+create
+\begin_inset Quotes erd
+\end_inset
+
+ new sources of performance out of thin air.
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+\noindent
+On the contrary, load distribution over a storage network
+\emph on
+costs
+\emph default
+ some performance, by introducing additional latencies and potential
+ bottlenecks.
+\end_layout
+
+\begin_layout Standard
+When replicating at C, the Linux caches are
+\emph on
+above
+\emph default
+ your cut point.
+ Thus you will receive much less traffic at C, typically already reduced
+ by a factor of 100, or even more.
+ This is much easier to cope with.
+ 
+\emph on
+Local
+\emph default
+ caches and their SMP scaling properties can be implemented much more efficiently
+ than distributed ones.
+ You will also profit from
+\series bold
+journalling filesystems
+\series default
+ like
+\family typewriter
+ext4
+\family default
+ or
+\family typewriter
+xfs
+\family default
+.
+ In contrast,
+\emph on
+truly distributed
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+In this context,
+\begin_inset Quotes eld
+\end_inset
+
+truly
+\begin_inset Quotes erd
+\end_inset
+
+ means that POSIX semantics would always be guaranteed cluster-wide, even
+ in case of partial failures.
+ In practice, some distributed filesystems like NFS don't even obey the
+ POSIX standard
+\emph on
+locally
+\emph default
+ on a single standalone client.
+ We know of projects which have
+\emph on
+failed
+\emph default
+ precisely because of this.
+\end_layout + +\end_inset + + +\emph default + journalling is typically not available with distributed cluster filesystems. +\end_layout + +\begin_layout Standard +A +\emph on +potential +\emph default + drawback of block layer replication is that you are typically limited to + active-passive replication. + An active-active operation is not impossible at block layer (see combinations + of DRBD with +\family typewriter +ocfs2 +\family default +), but less common, and less safe to operate. +\end_layout + +\begin_layout Standard +This limitation isn't necessarily caused by the choice of layer. + It is simply caused by the +\series bold +laws of physics +\series default +: communication is always limited by the speed of light. + A distributed filesystem is essentially nothing else but a persistent +\series bold +DSM = Distributed Shared Memory +\series default +. +\end_layout + +\begin_layout Standard +Some decades of research on DSM have shown that there exist applications + / workloads where the DSM model is +\emph on +inferior +\emph default + to the direct communication paradigm. + Even in short-distance / cluster scenarios. + Long-distance DSM is extremely cumbersome. +\end_layout + +\begin_layout Standard +Therefore: you simply shouldn't try to solve +\series bold +long-distance communication needs +\series default + via communication over shared filesystems. + Even simple producer-consumer scenarios (one-way communication) are less + performant (e.g. + when compared to plain TCP/IP) when it comes to distributed POSIX semantics. + There is simply too much +\series bold +synchronisation overhead at metadata level +\series default +. +\end_layout + +\begin_layout Standard +\begin_inset Flex Custom Color Box 3 +status open + +\begin_layout Plain Layout +If you want mixed operations at different locations in parallel: split your + data set into disjoint filesystem instances (or database / VM instances, + etc). + Then you should achieve the +\series bold +ability for butterfly +\series default +, see section +\begin_inset CommandInset ref +LatexCommand nameref +reference "subsec:Flexibility-of-Failover" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\noindent +All you need is careful thought about the +\emph on +appropriate +\emph default + +\emph on +granularity +\emph default + of your data sets (such as well-chosen +\emph on +sets +\emph default + of user homedirectory subtrees, or database sets logically belonging together, + etc). + An example hierarchy of granularities is described in section +\begin_inset CommandInset ref +LatexCommand nameref +reference "par:Positive-Example:-ShaHoLin" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +. + Further hints can be found in sections +\begin_inset CommandInset ref +LatexCommand nameref +reference "sec:Granularity-at-Architecture" +plural "false" +caps "false" +noprefix "false" + +\end_inset + + and +\begin_inset CommandInset ref +LatexCommand nameref +reference "subsec:Variants-of-Sharding" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +. 
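+\end_layout
+
+\begin_layout Standard
+As a purely illustrative sketch (the shard count, the path and the hashing
+ scheme below are invented for this example and are not prescribed by MARS
+ or by this guide), such a split can be driven by a deterministic mapping
+ from a homedirectory subtree to one of a fixed number of disjoint filesystem
+ instances, for example:
+\end_layout
+
+\begin_layout Standard
+\begin_inset listings
+lstparams "language=Python"
+inline false
+status open
+
+\begin_layout Plain Layout
+
+# Hypothetical sketch: shard count and path are invented for illustration.
+\end_layout
+
+\begin_layout Plain Layout
+
+import hashlib
+\end_layout
+
+\begin_layout Plain Layout
+
+NUM_SHARDS = 8  # e.g. 8 disjoint filesystem / LV instances
+\end_layout
+
+\begin_layout Plain Layout
+
+homedir = "/home/alice"
+\end_layout
+
+\begin_layout Plain Layout
+
+digest = hashlib.sha256(homedir.encode("utf-8")).hexdigest()
+\end_layout
+
+\begin_layout Plain Layout
+
+shard = int(digest, 16) % NUM_SHARDS
+\end_layout
+
+\begin_layout Plain Layout
+
+print(homedir, "-> shard", shard)  # same subtree, same shard, every time
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+Each resulting instance stays small enough to be replicated and failed over
+ on its own, independently of its siblings.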
+\end_layout
+
+\begin_layout Standard
+\noindent
+\begin_inset Graphics
+ filename images/lightbulb_brightlit_benj_.png
+ lyxscale 12
+ scale 7
+
+\end_inset
+
+ Sharding (see section 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "par:Definition-of-Sharding"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+) implementations like ShaHoLin (see section 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "par:Positive-Example:-ShaHoLin"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+) essentially exploit the scalability of SMP = Symmetric MultiProcessing,
+ which nowadays typically saturates around 
+\begin_inset Formula $\approx100$
+\end_inset
+
+ hardware CPU threads for typical workloads and is executed by 
+\emph on
+hardware
+\emph default
+ inside your server enclosure.
+ In contrast, DSM-like solutions try to distribute your application workload
+ over longer distances, involving relatively slow system software instead
+ of 
+\series bold
+hardware acceleration
+\series default
+.
+ Therefore, SMP is preferable over DSM wherever possible.
+\end_layout
+
+\begin_layout Standard
+Replication at filesystem level often works at single-file granularity.
+ If you have several millions or even billions of inodes, you may easily
+ find yourself in a snakepit.
+ See also 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "subsec:Example-Failures-of"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+.
+\end_layout
+
+\begin_layout Standard
+\begin_inset Flex Custom Color Box 3
+status open
+
+\begin_layout Plain Layout
+\begin_inset Argument 1
+status open
+
+\begin_layout Plain Layout
+
+\series bold
+Conclusion
+\end_layout
+
+\end_inset
+
+
+\series bold
+Active-passive operation
+\series default
+ over long distances (such as between continents) at 
+\series bold
+block layer
+\series default
+ is an 
+\series bold
+\emph on
+advantage
+\series default
+\emph default
+.
+ It keeps your staff from trying bad / almost impossible things, like DSM
+ = Distributed Shared Memory over long distances.
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Subsection
+Performance Tradeoffs from Load Distribution
+\begin_inset CommandInset label
+LatexCommand label
+name "subsec:Performance-Tradeoffs-from-Load-Distribution"
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+A frequent argument from BigCluster advocates is that random replication
+ would provide better performance.
+ This argument isn't wrong, but it misses the point.
+\end_layout
+
+\begin_layout Standard
+As analysed in section 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "subsec:Similarities-and-differences"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+, load distribution isn't a unique concept bound to BigCluster / random
+ replication.
+ Load distribution has been used for decades in a variety of 
+\series bold
+RAID striping
+\series default
+ methods.
+\end_layout
+
+\begin_layout Standard
+RAID striping levels like RAID-0, RAID-10, or RAID-60 have been known for
+ decades, forming a mature technology.
+ It has also been known since the 1980s that the size of a single striped
+ RAID set must not grow too big, otherwise reliability will suffer too much.
+ Larger RAID systems are therefore 
+\series bold
+split
+\series default
+ into multiple 
+\series bold
+RAID sets
+\series default
+.
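+\end_layout
+
+\begin_layout Standard
+The reason is easy to see with a small illustrative model (the numbers are
+ idealised and are not taken from this guide): if each disk survives a given
+ period with probability 
+\begin_inset Formula $p$
+\end_inset
+
+, then a striped set without redundancy such as RAID-0 only survives when
+ all of its 
+\begin_inset Formula $n$
+\end_inset
+
+ disks do, i.e.
+ with probability 
+\begin_inset Formula $p^{n}$
+\end_inset
+
+.
+ With 
+\begin_inset Formula $p=0.97$
+\end_inset
+
+ per year, a set of 4 disks still reaches about 88.5%, while a set of 24 disks
+ is already down to roughly 48%; hence the split, and hence redundant levels
+ like RAID-10 or RAID-60 on top.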
+\end_layout
+
+\begin_layout Standard
+This has some interesting parallels to the BigCluster reliability problems
+ analysed in section 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "sub:Detailed-explanation"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+, and some workarounds, e.g.
+ as discussed in section 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "subsec:Similarities-and-differences"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+.
+\end_layout
+
+\begin_layout Standard
+Summary: both RAID striping and random replication methods are
+\series bold
+limited
+\series default
+ by the fundamental law of storage systems, see section 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "subsec:Optimum-Reliability-from"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+, in a similar way.
+\end_layout
+
+\begin_layout Standard
+A detailed performance comparison at architecture level between random replication
+ of variable-sized objects and striping of block-level sectors is beyond
+ the scope of this architecture guide.
+ However, the following should be intuitively clear from section 
+\begin_inset CommandInset ref
+LatexCommand nameref
+reference "subsec:Layering-Rules"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+ and from Einstein's speed-of-light limit:
+\end_layout
+
+\begin_layout Quote
+Fine-grained load distribution over
+\series bold
+short distances
+\series default
+ and/or at
+\series bold
+lower layers
+\series default
+ has a
+\series bold
+bigger performance potential
+\series default
+ than over longer distances and/or at higher layers.
+\end_layout
+
+\begin_layout Standard
+In other words: local SAS busses are capable of realtime IO transfers over
+ very short distances (enclosure-to-enclosure), while an expensive IP storage
+ network isn't realtime (due to packet loss).
+ SAS busses are
+\emph on
+constructed
+\emph default
+ for dealing with the requirements arising from RAID, and have been optimized
+ for years / decades.
+\end_layout
+
+\begin_layout Standard
+\noindent
+\begin_inset Flex Custom Color Box 3
+status open
+
+\begin_layout Plain Layout
+\noindent
+\begin_inset Argument 1
+status open
+
+\begin_layout Plain Layout
+
+\series bold
+Advice for performance-critical workloads
+\end_layout
+
+\end_inset
+
+
+\begin_inset Graphics
+ filename images/lightbulb_brightlit_benj_.png
+ lyxscale 12
+ scale 7
+
+\end_inset
+
+ Besides
+\emph on
+local
+\emph default
+ SSDs, also consider some appropriate RAID striping at your (Local)Sharding
+ storage boxes for performance-critical workloads.
+ It is not only cheaper than BigCluster load distribution methods, but typically
+ also more performant (on top of comparable technology and comparable dimensioning).
+ Tradeoffs of various parameters and measurement methods for system architects
+ are described at 
+\begin_inset Flex URL
+status open
+
+\begin_layout Plain Layout
+
+http://blkreplay.org
+\end_layout
+
+\end_inset
+
+.
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+\noindent
+\begin_inset Graphics
+ filename images/lightbulb_brightlit_benj_.png
+ lyxscale 12
+ scale 7
+
+\end_inset
+
+ RAID-6 is much cheaper
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+Several OSDs also use SAS or similar local IO busses, in order to drive
+ a high number of spindles.
+ Essentially, random replication is involving +\emph on +two +\emph default + different types of networks at the same time. + This also explains why such a combination must necessarily induce some + performance loss. +\end_layout + +\end_inset + + than RAID-10, and can also provide some striping with respect to (random) + reads. + However, random writes are much slower. + For read-intensive workloads, the striping behaviour of RAID-6 is often + sufficient. + A tool for comparsion of different RAID setup alternatives can be found + at +\begin_inset Flex URL +status open + +\begin_layout Plain Layout + +http://www.blkreplay.org +\end_layout + +\end_inset + +. +\end_layout + \begin_layout Section Local vs Centralized Storage \begin_inset CommandInset label @@ -18332,803 +19129,6 @@ locality of references an explosion of randomness by some orders of magnitude. \end_layout -\begin_layout Section -Performance Arguments from Architecture -\begin_inset CommandInset label -LatexCommand label -name "sec:Performance-Arguments-from" - -\end_inset - - -\end_layout - -\begin_layout Subsection -Performance Penalties by Choice of Replication Layer -\begin_inset CommandInset label -LatexCommand label -name "subsec:Performance-Penalties-Layer" - -\end_inset - - -\end_layout - -\begin_layout Standard -Some people think that replication is easily done at filesystem layer. - There exist lots of cluster filesystems and other filesystem-layer solutions - which claim to be able to replicate your data, sometimes even over long - distances. -\end_layout - -\begin_layout Standard -Trying to replicate several petabytes of data, or some billions of inodes, - is however a much bigger challenge than many people can imagine. -\end_layout - -\begin_layout Standard -Choosing the wrong -\series bold -layer -\series default - (see section -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Layering-Rules" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -) for -\series bold -mass data replication -\series default - may get you into trouble. - Layer selection is much more important than any load distribution argument - as frequently heard from certain advocates. - Here is an architectural-level (cf section -\begin_inset CommandInset ref -LatexCommand nameref -reference "sec:What-is-Architecture" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -) explanation why replication at the block layer is more easy and less error - prone: -\end_layout - -\begin_layout Standard -\noindent -\align center -\begin_inset Graphics - filename images/Layers.pdf - width 100col% - -\end_inset - - -\end_layout - -\begin_layout Standard -\noindent -The picture shows the main components of a standalone Unix / Linux system. - It conforms to Dijkstra's layering rules explained in section -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Layering-Rules" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -. -\end_layout - -\begin_layout Standard -In the late 1970s / early 1980s, a so-called -\emph on -Buffer Cache -\emph default - had been introduced into the architecture of Unix. - Today's Linux has refined the concept to various internal caches such as - the -\series bold -Page Cache -\series default - (for data) and the -\series bold -Dentry Cache -\series default - (for metadata lookup). 
-\end_layout - -\begin_layout Standard -All these caches serve one main purpose -\begin_inset Foot -status open - -\begin_layout Plain Layout -Another important purpose is -\series bold -providing shared memory -\series default -. -\end_layout - -\end_inset - -: they are reducing the load onto the storage by exploitation of fast RAM. - A well-tuned cache can yield high cache hit ratios, typically 99%. - In some cases (as observed in practice) even more than 99.9%. -\end_layout - -\begin_layout Standard -Now start distributing the system over long distances. - There are potential cut points A and B and C -\begin_inset Foot -status open - -\begin_layout Plain Layout -In theory, there is another cut point D by implementing a generically distribute -d cache. - There exists some academic research on this, but practically usable enterprise- -grade implementations are rare and not wide-spread. -\end_layout - -\end_inset - -. -\end_layout - -\begin_layout Standard -Cut point A is application specific, and can have advantages because it - has knowledge of the application. - For example, replication of mail queues can be controlled much more fine-graine -d than at filesystem or block layer. -\end_layout - -\begin_layout Standard -Cut points B and C are -\emph on -generic -\emph default -, supporting a wide variety of applications, without altering them. - Cutting at B means replication at filesystem layer. - C means replication at block layer. -\end_layout - -\begin_layout Standard -When replicating at B, you will notice that the caches are -\emph on -below -\emph default - your cut point. - Thus you will have to re-implement -\series bold -distributed caches -\series default -, and you will have to -\series bold -maintain cache coherence -\series default -. -\end_layout - -\begin_layout Standard -\noindent -\begin_inset Flex Custom Color Box 3 -status open - -\begin_layout Plain Layout -\noindent -\begin_inset Graphics - filename images/lightbulb_brightlit_benj_.png - lyxscale 12 - scale 7 - -\end_inset - - Caching can yield several -\emph on -orders of magnitude -\emph default - of performance. -\end_layout - -\begin_layout Plain Layout -\noindent -\begin_inset Graphics - filename images/MatieresCorrosives.png - lyxscale 50 - scale 17 - -\end_inset - - In contrast, frequently heard load distribution arguments can only re-distribut -e the already existing performance of your spindles, but cannot magically - -\begin_inset Quotes eld -\end_inset - -create -\begin_inset Quotes erd -\end_inset - - new sources of performance out of thin air. -\end_layout - -\end_inset - - -\end_layout - -\begin_layout Standard -\noindent -In contrary, load distribution over a storage network is -\emph on -costing -\emph default - some performance, by introduction of additional latencies and potential - bottlenecks. -\end_layout - -\begin_layout Standard -When replicating at C, the Linux caches are -\emph on -above -\emph default - your cut point. - Thus you will receive much less traffic at C, typically already reduced - by a factor of 100, or even more. - This is much more easy to cope with. - -\emph on -Local -\emph default - caches and their SMP scaling properties can be implemented much more efficientl -y than distributed ones. - You will also profit from -\series bold -journalling filesystems -\series default - like -\family typewriter -ext4 -\family default - or -\family typewriter -xfs -\family default -. 
- In contrast, -\emph on -truly distributed -\begin_inset Foot -status open - -\begin_layout Plain Layout -In this context, -\begin_inset Quotes eld -\end_inset - -truly -\begin_inset Quotes erd -\end_inset - - means that the POSIX semantics would be always guaranteed cluster-wide, - and even in case of partial failures. - In practice, some distributed filesystems like NFS don't even obey the - POSIX standard -\emph on -locally -\emph default - on 1 standalone client. - We know of projects which have -\emph on -failed -\emph default - right because of this. -\end_layout - -\end_inset - - -\emph default - journalling is typically not available with distributed cluster filesystems. -\end_layout - -\begin_layout Standard -A -\emph on -potential -\emph default - drawback of block layer replication is that you are typically limited to - active-passive replication. - An active-active operation is not impossible at block layer (see combinations - of DRBD with -\family typewriter -ocfs2 -\family default -), but less common, and less safe to operate. -\end_layout - -\begin_layout Standard -This limitation isn't necessarily caused by the choice of layer. - It is simply caused by the -\series bold -laws of physics -\series default -: communication is always limited by the speed of light. - A distributed filesystem is essentially nothing else but a persistent -\series bold -DSM = Distributed Shared Memory -\series default -. -\end_layout - -\begin_layout Standard -Some decades of research on DSM have shown that there exist applications - / workloads where the DSM model is -\emph on -inferior -\emph default - to the direct communication paradigm. - Even in short-distance / cluster scenarios. - Long-distance DSM is extremely cumbersome. -\end_layout - -\begin_layout Standard -Therefore: you simply shouldn't try to solve -\series bold -long-distance communication needs -\series default - via communication over shared filesystems. - Even simple producer-consumer scenarios (one-way communication) are less - performant (e.g. - when compared to plain TCP/IP) when it comes to distributed POSIX semantics. - There is simply too much -\series bold -synchronisation overhead at metadata level -\series default -. -\end_layout - -\begin_layout Standard -\begin_inset Flex Custom Color Box 3 -status open - -\begin_layout Plain Layout -If you want mixed operations at different locations in parallel: split your - data set into disjoint filesystem instances (or database / VM instances, - etc). - Then you should achieve the -\series bold -ability for butterfly -\series default -, see section -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Flexibility-of-Failover" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -. -\end_layout - -\end_inset - - -\end_layout - -\begin_layout Standard -\noindent -All you need is careful thought about the -\emph on -appropriate -\emph default - -\emph on -granularity -\emph default - of your data sets (such as well-chosen -\emph on -sets -\emph default - of user homedirectory subtrees, or database sets logically belonging together, - etc). - An example hierarchy of granularities is described in section -\begin_inset CommandInset ref -LatexCommand nameref -reference "par:Positive-Example:-ShaHoLin" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -. 
- Further hints can be found in sections -\begin_inset CommandInset ref -LatexCommand nameref -reference "sec:Granularity-at-Architecture" -plural "false" -caps "false" -noprefix "false" - -\end_inset - - and -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Variants-of-Sharding" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -. -\end_layout - -\begin_layout Standard -\noindent -\begin_inset Graphics - filename images/lightbulb_brightlit_benj_.png - lyxscale 12 - scale 7 - -\end_inset - - Sharding (see section -\begin_inset CommandInset ref -LatexCommand nameref -reference "par:Definition-of-Sharding" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -) implementations like ShaHoLin (see section -\begin_inset CommandInset ref -LatexCommand nameref -reference "par:Positive-Example:-ShaHoLin" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -) are essentially exploiting the scalability of SMP = Symmetric MultiProcessing, - nowadays typically going into saturation around -\begin_inset Formula $\approx100$ -\end_inset - - hardware CPU threads for typical workloads, which is executed by -\emph on -hardware -\emph default - inside of your server enclosure. - In contrast, DSM-like solutions are trying to distribute your application - workload over longer distances, involving relatively slow system software - instead of -\series bold -hardware acceleration -\series default -. - Therefore, SMP is preferable over DSM wherever possible. -\end_layout - -\begin_layout Standard -Replication at filesystem level is often by single-file granularity. - If you have several millions or even billions of inodes, you may easily - find yourself in a snakepit. - See also -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Example-Failures-of" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -. -\end_layout - -\begin_layout Standard -\begin_inset Flex Custom Color Box 3 -status open - -\begin_layout Plain Layout -\begin_inset Argument 1 -status open - -\begin_layout Plain Layout - -\series bold -Conclusion -\end_layout - -\end_inset - - -\series bold -Active-passive operation -\series default - over long distances (such as between continents) at -\series bold -block layer -\series default - is an -\series bold -\emph on -advantage -\series default -\emph default -. - It keeps your staff from trying bad / almost impossible things, like DSM - = Distributed Shared Memory over long distances. -\end_layout - -\end_inset - - -\end_layout - -\begin_layout Subsection -Performance Tradeoffs from Load Distribution -\begin_inset CommandInset label -LatexCommand label -name "subsec:Performance-Tradeoffs-from-Load-Distribution" - -\end_inset - - -\end_layout - -\begin_layout Standard -A frequent argument from BigCluster advocates is that random repliction - would provide better performance. - This argument isn't wrong, but it does not hit the point. -\end_layout - -\begin_layout Standard -As analysed in section -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Similarities-and-differences" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -, load distribution isn't a unique concept bound to BigCluster / random - replication. - Load distribution has been used since decades at a variety of -\series bold -RAID striping -\series default - methods. -\end_layout - -\begin_layout Standard -RAID striping levels like RAID-0 or RAID-10 or RAID-60 are known since decades, - forming a mature technology. 
- Also known since the 1980s is that the size of a single striped RAID set - must not grow too big, otherwise reliability will suffer too much. - Larger RAID systems are therefore -\series bold -split -\series default - into multiple -\series bold -RAID sets -\series default -. -\end_layout - -\begin_layout Standard -This has some intresting parallels to the BigCluster reliability problems - analyzed in section -\begin_inset CommandInset ref -LatexCommand nameref -reference "sub:Detailed-explanation" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -, and some workarounds, e.g. - as discussed in section -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Similarities-and-differences" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -. -\end_layout - -\begin_layout Standard -Summary: both RAID striping and random replication methods are -\series bold -limited -\series default - by the fundamental law of storage systems, see section -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Optimum-Reliability-from" -plural "false" -caps "false" -noprefix "false" - -\end_inset - -, in a similar way. -\end_layout - -\begin_layout Standard -A detailed performane comparison at architcture level between random replication - of variable-sized objects and striping of block-level sectors is beyond - the scope of this architecture guide. - However, the following should be be intuitively clear from section -\begin_inset CommandInset ref -LatexCommand nameref -reference "subsec:Layering-Rules" -plural "false" -caps "false" -noprefix "false" - -\end_inset - - and from Einstein's laws of the speed of light: -\end_layout - -\begin_layout Quote -Fine-grained load distribution over -\series bold -short distances -\series default - and/or at -\series bold -lower layers -\series default - has a -\series bold -bigger performance potential -\series default - than over longer distances and/or at higher layers. -\end_layout - -\begin_layout Standard -In other words: local SAS busses are capable of realtime IO transfers over - very short distances (enclosure-to-enclosure), while an expensive IP storage - network isn't realtime (due to packet loss). - SAS busses are -\emph on -constructed -\emph default - for dealing with requirements arising from RAID, and have been optimized - for years / decades. -\end_layout - -\begin_layout Standard -\noindent -\begin_inset Flex Custom Color Box 3 -status open - -\begin_layout Plain Layout -\noindent -\begin_inset Argument 1 -status open - -\begin_layout Plain Layout - -\series bold -Advice for performance-critical workloads -\end_layout - -\end_inset - - -\begin_inset Graphics - filename images/lightbulb_brightlit_benj_.png - lyxscale 12 - scale 7 - -\end_inset - - Besides -\emph on -local -\emph default - SSDs, also consider some appropriate RAID striping at your (Local)Sharding - storage boxes for performance-critical workloads. - It is not only cheaper than BigCluster load distribution methods, but typically - also more performant (on top of comparable technology and comparable dimensioni -ng). - Tradeoffs of various parameters and measurement methods for system architects - are described at -\begin_inset Flex URL -status open - -\begin_layout Plain Layout - -http://blkreplay.org -\end_layout - -\end_inset - -. 
-\end_layout - -\end_inset - - -\end_layout - -\begin_layout Standard -\noindent -\begin_inset Graphics - filename images/lightbulb_brightlit_benj_.png - lyxscale 12 - scale 7 - -\end_inset - - RAID-6 is much cheaper -\begin_inset Foot -status open - -\begin_layout Plain Layout -Several OSDs are also using SAS or similar local IO busses, in order to - drive a high number of spindles. - Essentially, random replication is involving -\emph on -two -\emph default - different types of networks at the same time. - This also explains why such a combination must necessarily induce some - performance loss. -\end_layout - -\end_inset - - than RAID-10, and can also provide some striping with respect to (random) - reads. - However, random writes are much slower. - For read-intensive workloads, the striping behaviour of RAID-6 is often - sufficient. - A tool for comparsion of different RAID setup alternatives can be found - at -\begin_inset Flex URL -status open - -\begin_layout Plain Layout - -http://www.blkreplay.org -\end_layout - -\end_inset - -. -\end_layout - \begin_layout Section Scalability Arguments from Architecture \begin_inset CommandInset label