diff --git a/docu/mars-architecture-guide.lyx b/docu/mars-architecture-guide.lyx index ab6b160c..427fddc8 100644 --- a/docu/mars-architecture-guide.lyx +++ b/docu/mars-architecture-guide.lyx @@ -6968,6 +6968,626 @@ intermediate granularity \end_layout +\begin_layout Subsection +Negative Example: Inappropriate Replication Layering +\begin_inset CommandInset label +LatexCommand label +name "subsec:Inappropriate-Replication-Layering" + +\end_inset + + +\end_layout + +\begin_layout Standard +For unknown reasons, +\emph on +several +\emph default + people have tried +\emph on +independently from each other +\emph default + to use MARS inside of VMs. + Some of these people were outside of 1&1 Ionos. + Others were trying this even against explicit recommendations from the + author of MARS. + Suchalike cannot work. +\end_layout + +\begin_layout Standard +Instead, creation of a +\series bold +separate replication layer at bare metal +\series default + is the correct solution, e.g. + using dedicated storage boxes, or directly replicating at hypervisor hardware + when using local storage (as is the case at ShaHoLin). + Not only for performance reasons and for resource allocation +\begin_inset Foot +status open + +\begin_layout Plain Layout +Another argument: resource sharing in +\family typewriter +/mars +\family default +. + Each VM would require its own instance of +\family typewriter +/mars +\family default +, while a per-storage or per-hypervisor MARS instance can +\emph on +share +\emph default + its disk space. + MARS has been explicitly constructed with resource sharing in mind. 
+\end_layout + +\end_inset + + reasons, MARS is explicitly constructed for running on +\series bold +bare metal +\series default + +\emph on +solely +\emph default + +\begin_inset Foot +status open + +\begin_layout Plain Layout +A minor exception is +\emph on +functional component testing +\emph default + (as opposed to end-to-end system testing, aka integration testing, and + as opposed to non-functional testing). + This can be done under KVM, provided that +\family typewriter +/dev/mars/mydata +\family default + is never used for further sub-virtualization, and only for non-critical + test loads. +\end_layout + +\end_inset + +. + See also the description of hardware requirements in +\family typewriter +mars-user-manual.pdf +\family default +. +\end_layout + +\begin_layout Standard +Dijkstra's layering rules +\emph on +imply +\emph default + that an actively running VM can never replicate +\emph on +itself +\emph default + into +\emph on +another +\emph default + VM, at least not its entire +\begin_inset Foot +status open + +\begin_layout Plain Layout +Being unable to replicate the +\emph on +entire +\emph default + VM state is also a violation of the blackbox principle. +\end_layout + +\end_inset + + internal state. + Trying to do so would lead to an +\series bold +endless nesting recursion +\begin_inset Foot +status open + +\begin_layout Plain Layout +A replicator replicating itself would change the state of the VM by its + replication activity, triggering another replication, which in turn would + trigger another replication, and so on. +\end_layout + +\end_inset + + +\series default + of runtime state. + Dijkstra's rules clearly forbid cyclic layering. + Therefore, replication must always be considered as a +\emph on +separate +\emph default + layer, and not intermixed with other layers. 
+\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + This isn't specific for MARS and its heavy statekeeping in +\family typewriter +/mars +\family default +. + Dijkstra's rules also apply to +\emph on +any other +\emph default + replication system. +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + In addition to formal layering rules, resource management can easily become + a hell when based on virtual resources instead of on physical ones. +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Flex Custom Color Box 1 +status open + +\begin_layout Plain Layout +\noindent +I never heard of anyone who tried to use DRBD +\emph on +productively +\emph default + inside of VMs. + Apparently, sysadmins understand that this would be a bad idea, +\series bold +worsening performance +\series default + over-proportionally and +\series bold +\emph on +unpredictably +\series default +\emph default + +\begin_inset Foot +status open + +\begin_layout Plain Layout +Theoretical foundation: queueing theory. + VMs are introducing +\emph on +several +\emph default + queues into workloads, which did not exist without them. + In addition, it becomes impossible to guarantee a maximum service time. +\end_layout + +\end_inset + +, since the passive side would have to react in +\emph on +realtime +\emph default +, and even for each single IO request. + People seem to understand that +\series bold +realtime behaviour +\series default + cannot be expected from ordinary VMs. 
+ Often they already had a bad experience, such as huge performance differences + between para-virtualized device drivers and physical hardware drivers, + both running on so-called +\begin_inset Quotes eld +\end_inset + +virtual hardware +\begin_inset Foot +status open + +\begin_layout Plain Layout +The term +\begin_inset Quotes eld +\end_inset + +virtual hardware +\begin_inset Quotes erd +\end_inset + + is a contradiction in itself. + It simply isn't hardware at all. + Hardware is something which creates an +\begin_inset Quotes eld +\end_inset + +Ouch +\begin_inset Quotes erd +\end_inset + + when falling down onto your feet. +\end_layout + +\end_inset + + +\begin_inset Quotes erd +\end_inset + +. + Sometimes, the latter cannot run +\emph on +reliably +\emph default + +\begin_inset Foot +status open + +\begin_layout Plain Layout +Standard problem: missed interrupts, or interrupts not delivered in time. +\end_layout + +\end_inset + + under KVM/qemu, other than for non-critical or minor workstation loads. + Even then, they often work as a CPU burner. +\end_layout + +\begin_layout Plain Layout +\noindent +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + For some unknown reason, a few people seem to expect that MARS would be + able to work miracles there. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Flex Custom Color Box 2 +status open + +\begin_layout Plain Layout +\noindent +\begin_inset Argument 1 +status open + +\begin_layout Plain Layout + +\series bold +End users messing around with IPs +\end_layout + +\end_inset + + +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + I don't know of any virtualization platform where ordinary VM users can + easily configure and use BGP themselves. 
+ Therefore, geo-redundant replication setups under VMs would +\series bold +lack location transparency +\series default +, and provide a +\series bold +crippled user experience +\series default +. +\end_layout + +\begin_layout Plain Layout +\noindent +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + Leaving geo-replication and BGP handover to be managed by end users would + be a bad idea. + Apart from skills and from a management hell to be mastered by end users, + it would be a +\series bold +waste of IP addresses +\series default +. + When +\emph on +external +\emph default + VM customers would need to control BGP themselves, at least 3 public IP + addresses would be needed: each of both non-location-transparent VMs running + in parallel would require at least 1 public IP for external +\family typewriter +ssh +\family default + access etc, which is 2 in total, and a third public IP for BGP handover, + carrying the workload traffic. + Notice that public IPv4 addresses are a scarce resource. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + A good virtualization platform must provide +\series bold +full location transparency +\series default + of the VMs, without user intervention. + Only a single public IP per VM is then required, which automatically follows + the current geo-location of +\emph on +the +\emph default + single per-user +\begin_inset Foot +status open + +\begin_layout Plain Layout +At the passive / secondary side, only the LV replica is updated. + No VM is started there. + Thus no additional VM is requiring CPU and RAM resources. + In contrast, 2 non-location-transparent VMs responsible for replication + would essentially +\series bold +double the necessary compute resources +\series default +. 
+ In addition, total disk space allocation for multiple +\family typewriter +/mars +\family default + instances instead of a shared one would be much higher. + All of these would result in a +\series bold +massive cost increase +\series default +. +\end_layout + +\end_inset + + VM instance running at the same time. + This is already standard for local VM handover in the same datacenter. + No serious VM user would accept manual IP renumbering work, or responsibility + for routing changes, when his VM is suddenly running on a different hypervisor, + just because another customer used some more RAM, or because some hardware + went defective. + For unknown reasons, a few people are however expecting a similar effort + and similar skills from their (internal or external) VM customers as soon + as geo-redundancy comes into play. +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + BGP or a sister protocol is a +\emph on +must +\emph default + +\begin_inset Foot +status open + +\begin_layout Plain Layout +The 1&1 Ionos ShaHoLin setup (see section +\begin_inset CommandInset ref +LatexCommand nameref +reference "par:Positive-Example:-ShaHoLin" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +) is a striking example that BGP and its control by hypervisors is possible + in large scale. +\end_layout + +\end_inset + + for geo-redundant VMs. + It should be automatically controlled by the storage or by the hypervisor + layer, instead of by end users. + When storage and hypervisors are anyway managed by sysadmins, users should + not notice where their VM is currently running (see +\begin_inset CommandInset ref +LatexCommand nameref +reference "sec:Location-transparency" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +). + In addition, managed geo-control may become a sold feature. 
+ Customers can then +\emph on +trigger +\emph default + automatic handover of the geo-location with a single click (provided that + both locations are healthy). +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Flex Custom Color Box 2 +status open + +\begin_layout Plain Layout +\noindent + +\series bold +\begin_inset Argument 1 +status open + +\begin_layout Plain Layout + +\series bold +OPEX Cost Savings by Managed Geo-Location Transparency +\end_layout + +\end_inset + + +\series default +When using a geo-redundant +\family typewriter +RemoteSharding +\family default + or +\family typewriter +FlexibleSharding +\family default + model, passive-side hypervisors do not carry any workload. + Thus they may be powered off, until they are needed again. + Only the corresponding passive storage boxes need to remain powered all + the time. +\end_layout + +\begin_layout Plain Layout +However, this can only work when +\emph on +managed +\emph default + geo-location transparency is implemented. + Otherwise, end users would get a +\emph on +pair of +\emph default + VMs instead of a single VM, running all the time, in order to be able to + manage geo-redundancy themselves. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Flex Custom Color Box 3 +status open + +\begin_layout Plain Layout +\begin_inset Argument 1 +status open + +\begin_layout Plain Layout + +\series bold +Manager Briefing +\end_layout + +\end_inset + +Never accept a proposal to use MARS or any other replication system inside + of VMs. +\end_layout + +\begin_layout Plain Layout + +\series bold +Insist on fully managed geo-location transparency +\series default + from the viewpoint of VM users. + It is even +\series bold +considerably cheaper +\series default + at OPEX, since unnecessary doubling of the number of concurrently running + VM instances is avoided. 
+\end_layout + +\begin_layout Plain Layout +Do not call any VM system +\begin_inset Quotes eld +\end_inset + +geo-redundant +\begin_inset Quotes erd +\end_inset + + if it misses this simple standard requirement. + It should not require any political discussions at all (since local location + transparency has been standard at local VM farms for decades). +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + Managed BGP makes you independent from the OS running inside of VMs. + For example, Windows guests will become geo-redundant without modification. +\end_layout + \begin_layout Section Granularity at Architecture \begin_inset CommandInset label diff --git a/docu/mars-user-manual.lyx b/docu/mars-user-manual.lyx index 7dba771d..6bed1f18 100644 --- a/docu/mars-user-manual.lyx +++ b/docu/mars-user-manual.lyx @@ -419,8 +419,8 @@ name "sec:Typical-MARS-replication" \end_layout \begin_layout Standard -Typical recommended usage is replication of multiple Logical Volumes (LVs), - similar to DRBD: +Typical recommended usage is replication of multiple Logical Volumes (LVs) + directly at bare metal (never inside of VMs), similar to DRBD: \end_layout \begin_layout Standard @@ -1113,8 +1113,8 @@ noprefix "false" \end_layout \begin_layout Standard -Typically, you will install MARS at many servers for replication of many - LVs +Typically, you will install MARS at many bare metal servers for replication + of many LVs \emph on between \begin_inset Foot @@ -1146,12 +1146,26 @@ mars-architecture-guide.pdf \emph default multiple datacenters. + Do +\emph on +not +\emph default + use MARS inside of VMs (see explanation of Dijkstra's layering rules in + +\family typewriter +mars-architecture-guide.pdf +\family default +). \end_layout \begin_layout Standard You can use MARS both at dedicated storage servers (e.g. 
for serving Windows clients over iSCSI), or at standalone Linux servers - where CPU and storage are not separated. + where CPU and storage are +\emph on +not +\emph default + separated. \end_layout \begin_layout Standard @@ -1623,6 +1637,54 @@ better disks. \end_layout +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + Do not import the block device for +\family typewriter +/mars/ +\family default + over iSCSI. + This would sacrifice both reliability and performance. + MARS is constructed for exploiting a hardware BBU cache with a typical + IO parallelism degree of 1000 parallel IO requests, over fast local DMA. + See also section +\begin_inset CommandInset ref +LatexCommand nameref +reference "sec:IO-Performance-Tuning" +plural "false" +caps "false" +noprefix "false" + +\end_inset + +. +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + Consequence: never run MARS inside of a VM (other than for functional component + testing). + See also Dijkstra's layering rules in +\family typewriter +mars-architecture-guide.pdf +\family default +. +\end_layout + \begin_layout Standard \noindent \begin_inset Graphics @@ -1663,7 +1725,7 @@ blackbox \family typewriter marsadm \family default - interface is supposed to remain stable. + interface and its primitive macros are supposed to remain stable. \end_layout \begin_layout Standard @@ -1734,8 +1796,8 @@ status open \begin_layout Plain Layout There is fundamental argument: network traffic between datacenters belongs to a higher level than a single component like MARS. - Thus its security requirements must be solved at that level, but not at - the level of MARS. + Thus its security requirements must be solved at that higher level, but + not at the lower level of MARS. 
\end_layout \end_inset @@ -2233,7 +2295,7 @@ INSTALL \family typewriter mars.ko \family default - kernel module to all of your cluster nodes, but also the + kernel module to all of your bare metal cluster nodes, but also the \family typewriter marsadm \family default @@ -2481,8 +2543,8 @@ name "sec:Setup-Primary-and" \end_layout \begin_layout Standard -If you already have some production data on your servers via LVM, you may - skip some of the following subsections. +If you already have some production data on your bare metal servers via + LVM, you may skip some of the following subsections. \end_layout \begin_layout Standard @@ -2543,8 +2605,21 @@ name "subsec:Setup-Hardware" \end_layout \begin_layout Standard -When using hardware RAID controllers, you will need to build your RAID sets - with the corresponding tools. +\noindent +\begin_inset Graphics + filename images/MatieresToxiques.png + lyxscale 50 + scale 17 + +\end_inset + + Do not use MARS inside of VMs. + Only use at bare metal! +\end_layout + +\begin_layout Standard +When using hardware RAID controllers with hardware BBU (as is highly recommended +), you will need to build your RAID sets with the corresponding tools. \end_layout \begin_layout Standard @@ -2613,8 +2688,8 @@ name "subsec:Setup-LVM" \end_inset - Execute the following instructions only once after hardware deployment, - or if you want to re-install your server. + Execute the following instructions only once after bare metal hardware + deployment, or if you want to re-install your server. Otherwise, you may delete existing data. \end_layout @@ -2706,7 +2781,7 @@ name "subsec:Setup-your-Cluster" \end_layout \begin_layout Standard -For your cluster, you need at least two nodes. +For your cluster, you need at least two bare metal nodes. In the following, they will be called hostA and hostB. 
In the beginning, hostA will have the \family typewriter @@ -3276,7 +3351,7 @@ mydata \end_layout \begin_layout Standard -You may have some alreadypre-existing +You may have some already pre-existing \family typewriter /dev/lv/mydata \family default @@ -3590,7 +3665,7 @@ starting \end_inset -By default, MARS uses the so-called + By default, MARS uses the so-called \begin_inset Quotes eld \end_inset @@ -33309,6 +33384,19 @@ name "chap:Technical-Data-MARS" \end_layout +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/MatieresToxiques.png + lyxscale 50 + scale 17 + +\end_inset + + Do not use MARS inside of VMs. + Only use at bare metal! +\end_layout + \begin_layout Standard MARS has some built-in limitations which should be overcome \begin_inset Foot @@ -33328,11 +33416,11 @@ Some internal algorithms are quadratic. \end_layout \begin_layout Itemize -maximum 10 nodes per cluster +maximum 4 nodes per cluster \end_layout \begin_layout Itemize -maximum 10 resources per cluster +maximum 20 resources per cluster \end_layout \begin_layout Itemize