From 70c1e8eb0fdc5a0826d841922208b323faf0e29e Mon Sep 17 00:00:00 2001 From: Thomas Schoebel-Theuer Date: Thu, 20 Nov 2014 18:38:37 +0100 Subject: [PATCH] doc: big update --- docu/mars-manual.lyx | 2751 ++++++++++++++++++++++++++++++++++++------ 1 file changed, 2360 insertions(+), 391 deletions(-) diff --git a/docu/mars-manual.lyx b/docu/mars-manual.lyx index 0cb512af..e7c7b573 100644 --- a/docu/mars-manual.lyx +++ b/docu/mars-manual.lyx @@ -237,7 +237,11 @@ Anytime Consistency \end_layout \begin_layout Abstract -The current version of MARS Light works +The current version of MARS Light supports +\begin_inset Formula $k>2$ +\end_inset + + replicas and works \series bold asynchronously \series default @@ -343,9 +347,12 @@ constructed \begin_layout Standard On the other hand, there exist other cases where DRBD did not work as expected, leading to incidents and other operational problems. - We analyzed them for those use cases, and found that they could only be + We analyzed them for those use cases. + The later author of MARS came to the conclusion that they could only be resolved by fundamental changes in the overall architecture of DRBD. - Therefore, we started the development of MARS. + The development of MARS started at the personal initiative of the author, + first in form of a personal project during holidays, but later picked up + by 1&1 as an official project. \end_layout \begin_layout Standard @@ -1195,7 +1202,8 @@ This time, the network throughput limit (solid red line) is assumed to be However, the application workload (dotted green line) shows some heavy peaks. We know from our 1&1 datacenters that such an application behaviour is - very common. + very common (e.g. + in case of certain kinds of DDOS attacks etc). \end_layout \begin_layout Standard @@ -1226,7 +1234,7 @@ In general and in some theories, latencies are conceptually independent \end_layout \begin_layout Enumerate -There exist lines with high latencies but also high throughput. 
+There exist communication lines with high latencies but also high throughput. Examples are raw fibre cables at the ground of the Atlantic. \end_layout @@ -1246,7 +1254,7 @@ ssh \begin_layout Enumerate Low latencies need not be incompatible with high throughput. See Myrinet, InfiniBand or high-speed point-to-point interconnects, such - as modern memory busses. + as modern RAM busses. \end_layout \begin_layout Enumerate @@ -1892,7 +1900,7 @@ EXPORT_SYMBOL() statements. The pre-patch must be applied to the kernel source tree before building your (custom) kernel. - Hopefully, the patch will be integrated upstream some day. + Future versions of MARS are planned to require no pre-patch anymore. \end_layout \begin_layout Standard @@ -1931,7 +1939,8 @@ modinfo \family typewriter Makefile.dist \family default - (tested with Debian; may need some extra work with other distros). + (tested with some older versions of Debian; may need some extra work with + other distros). \end_layout \begin_layout Standard @@ -2246,6 +2255,11 @@ Wait a few seconds until the directory /mars/resource-mydata/ \family default and its symlink contents also appears on cluster node B. + The command +\family typewriter +marsadm wait-cluster +\family default + may be helpful. \end_layout \begin_layout Enumerate @@ -2303,7 +2317,7 @@ contents \emph default of a generic block device; it does not interpret it in any way. Therefore, you may use MARS (as well as DRBD) for mirroring Windows filesystems -, or raw devices from databases, or whatever. +, or raw devices from databases, or virtual machines, or whatever. \begin_inset Newline newline \end_inset @@ -2345,7 +2359,11 @@ echo 0 > /proc/sys/mars/do_fast_fullsync \end_layout \begin_layout Enumerate -Optionally: if you create a +Optionally, only for experienced sysadmins who +\emph on +really +\emph default + know what they are doing: if you will create a \emph on new \emph default @@ -2357,13 +2375,78 @@ new \emph on after(!) 
\emph default - having created the MARS resource, you may skip the fast fullsync phase - at all, because the old content of + having created the MARS resource as well as +\emph on +after +\emph default + having already joined it on every replica, you may abandon the fast fullsync + phase +\emph on +before +\emph default + creating the fresh filesystem, because the old content of \family typewriter /dev/mars/mydata \family default - is just garbage not used by the freshly created filesystem. - Just say + will then be just garbage not used by the freshly created filesystem +\begin_inset Foot +status open + +\begin_layout Plain Layout +It is +\emph on +vital +\emph default + that the transaction logfile contents created by +\family typewriter +mkfs +\family default + is +\emph on +fully +\emph default + propagated to the secondaries and then replayed there. +\end_layout + +\begin_layout Plain Layout +Analogously, another exception is also possible, but at your own risk (be + careful, really!): when migrating your data from DRBD to MARS, and you + have ensured that (1) at the end of using DRBD both your replicas were + really equal (you should have checked that), and (2) before and after setting + up any side of MARS ( +\family typewriter +create-resource +\family default + as well as +\family typewriter +join-resource +\family default +) nothing has been written at all to it (i.e. + no usage, neither of +\family typewriter +/dev/lv/mydata +\family default + nor of +\family typewriter +/dev/mars/mydata +\family default + has occurred in any way), the first transaction logfile +\family typewriter +/mars/resource-mydata/log-000000001-$primary +\family default + created by MARS will be empty. + Check whether this is really true! Then, and only then, you may also issue + a +\family typewriter +fake-sync +\family default +. +\end_layout + +\end_inset + +. 
+ Then, and only then, you may say \family typewriter marsadm fake-sync mydata \family default @@ -2387,7 +2470,7 @@ fake-sync \series bold absolutely sure \series default - that you really don't need the data! Otherwise, you are almost + that you really don't need to sync the data! Otherwise, you are \emph on guaranteed \emph default @@ -2396,8 +2479,8 @@ guaranteed \family typewriter fake-sync \family default -, you may startover the full sync at your secondary side at any time by - saying +, you may startover the fast full sync at your secondary side at any time + by saying \family typewriter marsadm invalidate mydata \family default @@ -2655,6 +2738,250 @@ name "sub:Intended-Switching" \end_layout +\begin_layout Standard +Before starting a planned handover from your old primary +\family typewriter +A +\family default + to a new primary +\family typewriter +B +\family default +, you should check the replication of the resource. + As a human, use +\family typewriter +marsadm view mydata +\family default +. + For scripting, use the macros from section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Predefined-Trivial-Macros" + +\end_inset + +. + The network should be OK, and the amount of replication delay should be + as low as possible. + Otherwise, handover may take a very long time, or it may produce a split + brain, or it may even fail. +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +Best practice is to +\series bold +prepare a planned handover +\series default + by the following steps: +\end_layout + +\begin_layout Enumerate +Check the network and the replication lag. 
+ It should be low (a few hundred megabytes, or a low number of gigabytes + - see also the rough time forecast shown by +\family typewriter +marsadm view mydata +\family default + when there is a larger replication delay, or directly access the forecast + by +\family typewriter +marsadm view-replinfo +\family default +). +\end_layout + +\begin_layout Enumerate +Stop your application, then umount +\family typewriter +/dev/mars/mydata +\family default + on host +\family typewriter +A +\family default +. +\end_layout + +\begin_layout Enumerate +When scripting, or when typing extremely fast, or for better safety, say + +\family typewriter +marsadm wait-umount mydata +\family default + host +\family typewriter +B +\family default +. + When your network is OK, the propagation of the device usage state +\begin_inset Foot +status open + +\begin_layout Plain Layout +Notice that the usage check for +\family typewriter +/dev/mars/mydata +\family default + on host +\family typewriter +B +\family default + is based on the +\emph on +open count +\emph default + transferred from +\emph on +another +\emph default + node +\family typewriter +A +\family default +. + Since MARS is operating asynchronously (in contrast to DRBD), it may take + some time until our node +\family typewriter +B +\family default + knows that the device is no longer used at +\family typewriter +A +\family default +. + This can lead to a race condition if you automate an intended takeover + with a script like +\family typewriter +ssh root@A +\begin_inset Quotes eld +\end_inset + +umount /dev/mars/mydata +\begin_inset Quotes erd +\end_inset + +; ssh root@B +\begin_inset Quotes eld +\end_inset + +marsadm primary mydata +\begin_inset Quotes erd +\end_inset + + +\family default + because your second ssh command may be faster than the internal MARS symlink + tree propagation (cf section +\begin_inset CommandInset ref +LatexCommand ref +reference "sec:The-Symlink-Tree" + +\end_inset + +). 
+ In order to prevent such races, you are strongly advised to use the command +\end_layout + +\begin_layout Itemize + +\family typewriter +marsadm wait-umount mydata +\end_layout + +\begin_layout Plain Layout +on node +\family typewriter +B +\family default + before trying to become primary. +\end_layout + +\end_inset + + should take only a few seconds. + Otherwise, check for any network problems or any other problems. +\end_layout + +\begin_layout Enumerate +On host +\family typewriter +B +\family default +, wait until +\family typewriter +marsadm view mydata +\family default + (or +\family typewriter +view-diskstate +\family default +) shows +\family typewriter +UpToDate +\family default +. + It is possible to omit this step, but then you have no control on the duration + of the handover, and in case of any transfer problems, disk space problems, + etc you are potentially risking to produce a split brain (although +\family typewriter +marsadm +\family default + will do its best to avoid it). + Doing the wait by yourself, +\emph on +before +\emph default + starting +\family typewriter +marsadm primary +\family default +, has a big advantage: you can abort the handover cycle at any time, just + by re-mounting the device +\family typewriter +/dev/mars/mydata +\family default + at the old primary +\family typewriter +A +\family default + again, and by re-starting your application. + Once you have started +\family typewriter +marsadm primary +\family default + on host +\family typewriter +B +\family default +, you might have to switch back, or possibly even via +\family typewriter +primary --force +\family default + (see sections +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Forced-Switching" + +\end_inset + + and +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Split-Brain-Resolution" + +\end_inset + +). 
+\end_layout + \begin_layout Standard Switching the roles is very similar to DRBD: just issue the command \end_layout @@ -2666,18 +2993,180 @@ marsadm primary mydata \end_layout \begin_layout Standard -on your formerly secondary node. - Precondition is that you are in connected state, and that the old primary - does not use its +on your formerly secondary node \family typewriter -/dev/mars/mydata +B \family default - device any longer. - If on of the preconditions is violated, +. +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +The most important difference to DRBD: don't use an intermediate +\family typewriter +marsadm secondary mydata +\family default + anywhere. + Although it would be possible, it has +\emph on +disadvantages +\emph default +, such as increased risk of producing a split brain. + Always switch +\emph on +directly +\emph default +! +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +In contrast to DRBD, MARS remembers the designated primary, even when your + system crashes and reboots. + While in case of a crash you have to re-setup DRBD with commands like +\family typewriter +drbdadm up +\begin_inset Formula $\ldots$ +\end_inset + +; drbdadm primary +\begin_inset Formula $\ldots$ +\end_inset + + +\family default +, MARS will automatically resume its former roles just by saying +\family typewriter +modprobe mars +\family default +. 
+\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +Another fundamental difference to DRBD: when the network is healthy, there + can only exist +\emph on +one +\emph default + designated primary at a time (modulo some communication delays caused by + the +\begin_inset Quotes eld +\end_inset + +eventually consistent +\begin_inset Quotes erd +\end_inset + + communication model, see section +\begin_inset CommandInset ref +LatexCommand ref +reference "sec:The-Lamport-Clock" + +\end_inset + +). + By saying +\family typewriter +marsadm primary mydata +\family default + on host +\family typewriter +B +\family default +, +\series bold +all other +\series default + hosts (including +\family typewriter +A +\family default +) will +\series bold +automatically go into secondary role +\series default + after a while! +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +You simply +\emph on +don't need +\emph default + an intermediate +\family typewriter +marsadm secondary mydata +\family default + for planned handover! +\end_layout + +\begin_layout Standard +Precondition for \family typewriter marsadm primary \family default - refuses to start. + is that you are up, that means in attached and connected state (cf. + +\family typewriter +marsadm up +\family default +), and that any old primary (in this case +\family typewriter +A +\family default +) does not use its +\family typewriter +/dev/mars/mydata +\family default + device any longer, and that the network is healthy. + If some (parts of) logfiles are not yet (fully) transferred to the new + primary, you will need enough space on +\family typewriter +/mars/ +\family default + at the target side. 
+ If one of the preconditions described in section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Operation-of-the" + +\end_inset + + is violated, +\family typewriter +marsadm primary +\family default + may refuse to start. \end_layout \begin_layout Standard @@ -2720,103 +3209,96 @@ rely marsadm \family default does its best, but at least in case of (unnoticed) network outages / partitions - (or even + (or \emph on -very +extremely, really extremely \emph default - slow / overloaded networks), an attempt to become up-to-date is likely - to fail. + slow / overloaded networks), an attempt to become +\family typewriter +UpToDate +\family default + may fail. If you want to \emph on ensure \emph default that no split brain can result from intended primary switching, please - give the + obey the best practices from above, and please give the \family typewriter primary \family default command only after your secondary is \emph on known +\begin_inset Foot +status open + +\begin_layout Plain Layout +As noted in many places in this manual, checking this cannot be done by + looking at the local state of a single cluster node. + You have to check several nodes. + +\family typewriter +marsadm +\family default + can only check the +\emph on +local \emph default - to be up-to-date. + node reliably! \end_layout -\begin_layout Standard -Notice that the usage check for -\family typewriter -/dev/mars/mydata -\family default - is based on the -\emph on -open count +\end_inset + + \emph default - transferred from another cluster node. - Since MARS is operating asynchronously (in contrast to DRBD), it may take - some time until our node knows that the device is no longer used at another - node.
- This can lead to a race condition if you automate an intended takeover - with a script like + to be really \family typewriter -ssh root@A -\begin_inset Quotes eld -\end_inset - -umount /dev/mars/mydata -\begin_inset Quotes erd -\end_inset - -; ssh root@B -\begin_inset Quotes eld -\end_inset - -marsadm primary mydata -\begin_inset Quotes erd -\end_inset - - +UpToDate \family default - because your second ssh command may be faster than the internal MARS symlink - tree propagation (cf section + (see +\family typewriter +marsadm view +\family default + and other macros described in section \begin_inset CommandInset ref LatexCommand ref -reference "sec:The-Symlink-Tree" +reference "sec:Inspecting-the-State" \end_inset ). - In order to prevent such races, you should use the command -\end_layout - -\begin_layout Itemize - -\family typewriter -marsadm wait-umount mydata \end_layout \begin_layout Standard -on node B before trying to become primary. - The script should look like +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + A +\emph on +very rough +\emph default + estimation of the time to become \family typewriter -ssh root@A -\begin_inset Quotes eld -\end_inset - -umount /dev/mars/mydata -\begin_inset Quotes erd -\end_inset - -; ssh root@B -\begin_inset Quotes eld -\end_inset - -marsadm wait-umount mydata && marsadm primary mydata -\begin_inset Quotes erd -\end_inset - - +UpToDate \family default -. + is displayed by +\family typewriter +marsadm view mydata +\family default + or other macros (e.g. + +\family typewriter +view-replinfo +\family default +). + However, on very flaky networks, the estimation may not only flicker much, + but also be inaccurate. 
\end_layout \begin_layout Subsubsection @@ -2848,21 +3330,133 @@ last known \begin_layout Itemize \family typewriter -marsadm disconnect mydata +marsadm pause-fetch mydata \end_layout +\begin_deeper +\begin_layout Itemize +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + notice that this is similar to +\family typewriter +drbdadm disconnect mydata +\family default + as you are probably used from DRBD. + For better compatibility with DRBD, you may use the alternate syntax +\family typewriter +marsadm disconnect mydata +\family default + instead. + However, there is a subtle difference to DRBD: DRBD will drop +\emph on +both +\emph default + sides of its single bi-directional connection and no longer try to re-connect + from any of both sides, while +\family typewriter +pause-fetch +\family default + is equivalent to +\family typewriter +pause-fetch-local +\family default +, which instructs only the +\emph on +local +\emph default + host to stop fetching logfiles. + Other members of the cluster, including the former primary, are +\emph on +not +\emph default + instructed to do so. + They may continue fetching logfiles over their own private TCP connections, + potentially using many connections in parallel, and potentially even from + any +\emph on +other +\emph default + member of the resource, if they think they can get the data from there. + In order to instruct +\begin_inset Foot +status open + +\begin_layout Plain Layout +Notice that not all such instructions may arrive at all sites when the network + is interrupted (or extremely slow). +\end_layout + +\end_inset + + +\emph on +all +\emph default + members of the resource to stop fetching logfiles, you may use +\family typewriter +marsadm pause-fetch-global mydata +\family default + instead (cf section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Operation-of-the" + +\end_inset + +). 
+\end_layout + +\end_deeper \begin_layout Itemize \family typewriter marsadm primary mydata --force \end_layout +\begin_deeper +\begin_layout Itemize +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + this is the forceful switchover. + Use +\family typewriter +--force +\family default + only if you know what you are doing! +\end_layout + +\end_deeper \begin_layout Itemize \family typewriter -marsadm connect mydata +marsadm resume-fetch mydata \end_layout +\begin_deeper +\begin_layout Itemize +As such, the new primary does not really need this, because primaries are + producing their own logfiles without need for fetching. + This is only to undo the previous +\family typewriter +pause-fetch +\family default +, in order to avoid future surprises when the new primary will somewhen + change to secondary mode again (in the far-distant future), and you have + forgotten to remember the fact that fetching had been switched off. + +\end_layout + +\end_deeper \begin_layout Standard When using \family typewriter @@ -2917,7 +3511,7 @@ reference "sub:Split-Brain-Resolution" \end_inset -), otherwise you cannot operate your resource any longer. +), otherwise you cannot operate your resource in the long term. \end_layout \begin_layout Standard @@ -2931,10 +3525,10 @@ In order to impede you from giving an accidental \family default works only in \emph on -disconnected +locally disconnected \emph default state. - This is analogously to DRBD. + This is similar to DRBD. \end_layout \begin_layout Standard @@ -2994,7 +3588,7 @@ Most reasons will be displayed by \family typewriter marsadm \family default - when it is rejecting to execute the switchover. + when it is rejecting the switchover. \end_layout \end_inset @@ -3043,8 +3637,8 @@ reference "sub:Final-Destroy-of" connection loss \emph default (e.g. 
- networking problems / network partitions) you might not be able to reliably - detect whether a split brain will actually result, or not. + networking problems / network partitions), you may not be able to reliably + detect whether a split brain actually resulted, or not. \end_layout \begin_layout Paragraph @@ -3082,14 +3676,19 @@ might log-rotate \family default independently from each other. - However, the replication will certainly get stuck, and your + However, this is really not a good idea. + The replication to third nodes will likely get stuck, and your \family typewriter /mars/ \family default - filesystem will eventually run out of space. - Any other secondary node will certainly get into serious problems: it simply - does not not know which split-brain version it should follow. - Therefore, you will certainly loose your redundancy. + filesystem(s) will eventually run out of space. + Any further secondary node (when having +\begin_inset Formula $k>2$ +\end_inset + + replicas) will certainly get into serious problems: it simply does not + know which split-brain version it should follow. + Therefore, you will certainly lose the actuality of your redundancy. \end_layout \begin_layout Standard @@ -3103,21 +3702,111 @@ log-rotate When one of your multiple split brain nodes has left its actual primary role, e.g. - via -\family typewriter -marsadm secondary -\family default - and umounting its + after umounting its local \family typewriter /dev/mars/mydata \family default - device while the network is up (again), we cannot guarantee that it is - always possible to re-enter primary mode again, even when + device, and when the network is up (again), we cannot guarantee +\begin_inset Foot +status open + +\begin_layout Plain Layout +In a few cases which are covered by the test suite, it is likely to work. + Future versions of MARS Light might improve on this.
+ It is generally not a good idea to try to (forcefully) become primary in a + split-brain situation starting from being secondary, because the result + is likely to be +\series bold +undefined at concept level +\series default +. +\end_layout + +\end_inset + + that it is always possible to re-enter primary mode again, even when \family typewriter primary --force \family default is given. - First cleanup the split brain via + Therefore, use of +\family typewriter +marsadm secondary +\family default + is +\emph on +strongly discouraged +\emph default +. + It tells the whole cluster that +\emph on +nobody +\emph default + is designated as primary any more. + +\emph on +All +\emph default + nodes should go into secondary mode, globally. + However, when the device +\family typewriter +/dev/mars/mydata +\family default + is in use somewhere, it will remain in +\emph on +actual +\emph default + primary mode during that time, even if another host is now the designated + primary, or if +\family typewriter +(none) +\family default + is designated as primary as will result from a +\family typewriter +secondary +\family default + command. + As soon as a local +\family typewriter +/dev/mars/mydata +\family default + is released, the node will +\emph on +actually +\emph default + go into secondary mode if it is no longer designated as primary. + Thus, +\family typewriter +marsadm secondary +\family default + can lead to a situation where no one is actually in primary role, and no one + is able to re-enter it due to split brain. + Such a situation can be avoided by +\emph on +directly +\emph default + switching over from one primary to another one, without intermediate +\family typewriter +secondary +\family default + command. + This behaviour is different from DRBD.
+\end_layout + +\begin_layout Standard +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + In case you have accidentally entered such a situation where all nodes + are refusing to become primary due to split brain, you +\emph on +have to +\emph default + cleanup the split brain via \family typewriter leave-resource \family default @@ -3138,7 +3827,8 @@ erroneous state. Therefore it is \series bold -generally no good idea to (re-)enter it deliberately! +generally no good idea to (re-)enter it deliberately, or to stay in it any + longer! \end_layout \begin_layout Standard @@ -3148,9 +3838,58 @@ passively \emph default by secondaries. Whenever a secondary detects that somewhere a split brain has happend, - it just refuses to fetch and to replay any logfiles behind the split point. - This means that its local disk state will remain consistent, but outdated - which respect to any of the split brain versions. + it refuses to replay any logfiles behind the split point (and also to fetch + them when possible), or anywhere where something appears suspect or ambiguous. + This tries to keep its local disk state always being consistent, but outdated + with respect to any of the split brain versions. + As a consequence, becoming primary may be impossible, because it cannot + always know which logfiles are the correct ones to replay before +\family typewriter +/dev/mars/mydata +\family default + can appear. + The ambiguity must be resolved first. 
+\end_layout + +\begin_layout Standard +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + If you +\emph on +really +\emph default + need the local device +\family typewriter +/dev/mars/mydata +\family default + to disappear +\emph on +everywhere +\emph default + in a split brain situation, you don't need a +\emph on +strongly discouraged +\emph default + +\family typewriter +marsadm secondary +\family default + command for this. + +\family typewriter +marsadm detach +\family default + or +\family typewriter +marsadm down +\family default + can do it also, without destroying information about the former designated + primary. \end_layout \begin_layout Subsection @@ -3293,15 +4032,15 @@ reference "sec:The-Symlink-Tree" \end_inset be able to communicate with each other. - If that is not possible, or if it takes too long, use the method described - in section + If that is not possible, or if it takes too long, you may fall back to + the method described in section \begin_inset CommandInset ref LatexCommand ref reference "sub:Final-Destroy-of" \end_inset -. +, but do this only as far as necessary. \end_layout \begin_layout Enumerate @@ -3336,7 +4075,11 @@ right not \emph default the version which is currently designated as primary for the whole cluster. - Only in such a case, switch the primary role as described in sections + +\series bold +Only +\series default + in such a case, switch the primary role as described in sections \begin_inset CommandInset ref LatexCommand ref reference "sub:Intended-Switching" @@ -3376,7 +4119,7 @@ for no good reason \begin_layout Enumerate \family typewriter -marsadm disconnect mydata +marsadm pause-fetch mydata \family default . \end_layout @@ -3392,7 +4135,7 @@ marsadm primary mydata --force \begin_layout Enumerate \family typewriter -marsadm connect mydata +marsadm resume-fetch mydata \family default . 
\end_layout @@ -3403,13 +4146,279 @@ marsadm connect mydata The next steps are different for different use cases: \end_layout +\begin_layout Paragraph +Destroying a Wrong Split Brain Version +\end_layout + +\begin_layout Standard +Continue with the following steps on each of those cluster node(s) whose + split-brain version you cannot retain, but start with the old +\begin_inset Quotes eld +\end_inset + +wrong +\begin_inset Quotes erd +\end_inset + + primaries first (see advice at the end of this section): +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +begin{enumerate} +\backslash +setcounter{enumi}{6} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +item +\end_layout + +\end_inset + + +\family typewriter +marsadm leave-resource mydata +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +item +\end_layout + +\end_inset + +After having done this on one cluster node, check whether the split brain + is already gone (e.g. + by saying +\family typewriter +marsadm view mydata +\family default +). + Chances are that you don't need this on all of your nodes. + Only in very rare +\begin_inset Foot +status open + +\begin_layout Plain Layout +When your network had partitioned in a very awkward way for a long time, + and when your partitioned primaries did several +\family typewriter +log-rotate +\family default + operations independently from each other, there is a small chance that +\family typewriter +leave-resource +\family default + does not clean up +\emph on +all +\emph default + remains of such an awkward situation. + Only in such a case, try +\family typewriter +log-purge-all +\family default +.
+\end_layout + +\end_inset + + cases, it might happen that the preceding +\family typewriter +leave-resource +\family default + operations were not able to clean up all logfiles produced in parallel + by the split brain situation. + Only in such rare cases, read the documentation about +\family typewriter +log-purge-all +\family default + (see page +\begin_inset CommandInset ref +LatexCommand pageref +reference "log-purge-all$res" + +\end_inset + +) and try it. +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +end{enumerate} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +If you want to restore redundancy, you can follow up with a +\family typewriter +join-resource +\family default + phase to the old resource name (using the correct device name, double-check + it!). This should restore your redundancy by overwriting your bad split + brain version with the correct one. +\end_layout + +\begin_layout Standard +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +It is important to resolve the split brain +\emph on +before +\emph default + you can start the +\family typewriter +join-resource +\family default + reconstruction phase! In order to keep as many +\begin_inset Quotes eld +\end_inset + +good +\begin_inset Quotes erd +\end_inset + + versions as possible (e.g. + for emergency cases), don't re-join them all in parallel, but rather start + with the oldest / most outdated / worst / inconsistent version first. + It is recommended to start the next one only when the previous one has + successfully finished.
+\end_layout + +\begin_layout Standard +Alternatively, but only if you have only +\begin_inset Formula $k=2$ +\end_inset + + replicas in total, you may use the following short procedure instead, which + works in almost all +\begin_inset Formula $k=2$ +\end_inset + + cases, but cannot resolve all (desperate, very scarce) split-brain situations + (see documentation of +\family typewriter +log-purge-all +\family default + on page +\begin_inset CommandInset ref +LatexCommand pageref +reference "log-purge-all$res" + +\end_inset + +): +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +begin{enumerate} +\backslash +setcounter{enumi}{6} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +item +\end_layout + +\end_inset + +On the single (new) secondary with a non- +\begin_inset Quotes erd +\end_inset + +right +\begin_inset Quotes erd +\end_inset + + version, and only if the split brain has not yet been resolved, say +\family typewriter +marsadm invalidate mydata +\family default +. +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +end{enumerate} +\end_layout + +\end_inset + + +\end_layout + \begin_layout Paragraph Keeping a Split Brain Version \end_layout \begin_layout Standard -Continue with the following steps, each on those cluster node(s) you don't - want to retain: +This case starts identically to the previous one, but continues differently. + On each of those cluster node(s) you don't want to retain: \end_layout \begin_layout Standard @@ -3465,47 +4474,13 @@ After having done this on \emph on all \emph default - non-right cluster nodes, check that the split brain is gone (e.g. + those cluster nodes, check that the split brain is gone (e.g. by saying \family typewriter -marsadm status +marsadm view mydata \family default -). 
- In very rare -\begin_inset Foot -status open - -\begin_layout Plain Layout -When your network had partitioned in a very awkward way for a long time, - and when your partitioned primaries did several -\family typewriter -log-rotate -\family default - operations indendently from each other, there is a small chance that -\family typewriter -leave-resource -\family default - does not clean up -\emph on -all -\emph default - remains of such an awkward situation. - Only in such a case, try -\family typewriter -log-purge-all -\family default -. -\end_layout - -\end_inset - - cases, it might happen that the preceding l -\family typewriter -eave-resource -\family default - operations were not able to clean up all logfiles produced in parallel - by the split brain situation. - Only in such rare cases, read the documentation about +), as documented above. + In very rare cases, you might also need a \family typewriter log-purge-all \family default @@ -3516,7 +4491,7 @@ reference "log-purge-all$res" \end_inset -) and try it. +). \end_layout \begin_layout Standard @@ -3591,108 +4566,6 @@ end{enumerate} \end_inset -\end_layout - -\begin_layout Paragraph -Destroying a Wrong Split Brain Version -\end_layout - -\begin_layout Standard -As before, do the -\family typewriter -leave-resource -\family default - step on each wrong split-brain node and check that the split brain has - gone, but omit the re-creation. -\end_layout - -\begin_layout Standard -If you want to restore redundancy, you can follow-up a -\family typewriter -join-resource -\family default - to the old resource name. - This should restore your redundancy by overwriting your bad split brain - version with the correct one. 
-\end_layout - -\begin_layout Standard -Alternatively, you may try the following short procedure instead, which - is however not guaranteed to resolve all (desperate) split-brain situations - (see documentation of -\family typewriter -log-purge-all -\family default - on page -\begin_inset CommandInset ref -LatexCommand pageref -reference "log-purge-all$res" - -\end_inset - -): -\end_layout - -\begin_layout Standard -\begin_inset ERT -status open - -\begin_layout Plain Layout - - -\backslash -begin{enumerate} -\backslash -setcounter{enumi}{6} -\end_layout - -\end_inset - - -\end_layout - -\begin_layout Standard -\begin_inset ERT -status open - -\begin_layout Plain Layout - - -\backslash -item -\end_layout - -\end_inset - -On each node with a non- -\begin_inset Quotes erd -\end_inset - -right -\begin_inset Quotes erd -\end_inset - - version, say -\family typewriter -marsadm invalidate mydata -\family default -. -\end_layout - -\begin_layout Standard -\begin_inset ERT -status open - -\begin_layout Plain Layout - - -\backslash -end{enumerate} -\end_layout - -\end_inset - - \end_layout \begin_layout Paragraph @@ -3702,7 +4575,7 @@ Keeping a Good Version \begin_layout Standard When you had a secondary which did not participate in the split brain, but just got confused and therefore stopped replaying logfiles immediately - after the split-brain point, it may very well happen + before the split-brain point, it may very well happen \begin_inset Foot status open @@ -3751,16 +4624,12 @@ future. \end_inset that you don't need to do any action for it. - When all wrong versions have disappeared from the cluster (either by -\family typewriter -invalidate -\family default - or by + When all wrong versions have disappeared from the cluster (by \family typewriter leave-resource \family default -), the confusion should be over, and the secondary should automatically - resume tracking of the new unique version. 
+ as described before), the confusion should be over, and the secondary should + automatically resume tracking of the new unique version. \end_layout \begin_layout Standard @@ -3776,6 +4645,77 @@ stuck nodes. \end_layout +\begin_layout Standard +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + Hint / advice: it is a good idea to start split brain resolution +\emph on +first +\emph default + with those (few) nodes which had been (accidentally) primary before, but + are not the new designated primary. + Usually, you had 2 primaries during split brain, so this will apply only + to +\emph on +one +\emph default + of them. + Leave the other one intact, by not leaving its primary state at all (if + it is possible -- notice that if you have enough space on +\family typewriter +/mars/ +\family default + it may be even possible to not only continue your application during the + split brain without interruption, just by not umounting +\family typewriter +/dev/mars/mydata +\family default + at all, but in addition to avoid invalidations caused by emergency mode, + see section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Emergency-Mode" + +\end_inset + +). + +\emph on +First +\emph default + resolve the problem of the +\begin_inset Quotes eld +\end_inset + +wrong +\begin_inset Quotes erd +\end_inset + + primary(s) via +\family typewriter +leave-resource +\family default +. + Wait for a short while. + Then check the rest of your secondaries (if you have +\begin_inset Formula $k>2$ +\end_inset + + replicas in total), whether they now are already following the new (unique) + primary, and finally check whether the split brain warning reported by + +\family typewriter +marsadm view all +\family default + is already gone. + This way, you can often omit unnecessary invalidations of replicas. 
+\end_layout + \begin_layout Subsection Final Destruction of a Damaged Node \begin_inset CommandInset label @@ -3803,7 +4743,8 @@ Physically \family typewriter /mars/ \family default - filesystem, or whatever. + filesystem, a half-defective kernel, RAM / kernel memory corruption, disk + corruption, or whatever. Don't risk any such unpredictable behaviour! \end_layout @@ -3842,14 +4783,18 @@ reference "sub:Forced-Switching" \end_layout \begin_layout Enumerate -On the surviving new designated primary, give the following commands: +On a surviving node, but preferably +\emph on +not +\emph default + the new designated primary, give the following commands: \end_layout \begin_deeper \begin_layout Enumerate \family typewriter -marsadm --host=your-damaged-host disconnect mydata +marsadm --host=your-damaged-host down mydata \end_layout \begin_layout Enumerate @@ -3858,6 +4803,20 @@ marsadm --host=your-damaged-host disconnect mydata marsadm --host=your-damaged-host leave-resource mydata \end_layout +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/MatieresToxiques.png + lyxscale 50 + scale 17 + +\end_inset + + Check for misspellings, in particular the hostname of the dead node, and + check the command syntax before typing return! Otherwise, you may forcefully + destroy the wrong node! +\end_layout + \end_deeper \begin_layout Enumerate In case any of the previous commands should fail (which is rather likely), @@ -3918,8 +4877,8 @@ your-damaged-host \end_inset - Even if your node comes to life again in some way: always ensure that the - mars kernel modules does not run any more. + Even if your dead node comes to life again in some way: always ensure that + the mars kernel module cannot run any more. 
\emph on Never @@ -3936,7 +4895,7 @@ In case \family typewriter leave-resource --host= \family default - does not work, you can try the following alternative: + does not work, you can start over with the following fallback: \end_layout \begin_layout Standard @@ -3987,7 +4946,7 @@ marsadm down mydata \begin_layout Enumerate Check by hand whether your local disk is consistent, e.g. - by test-mounting is, + by test-mounting it readonly, \family typewriter fsck \family default @@ -4001,8 +4960,35 @@ marsadm delete-resource mydata \end_layout \begin_layout Enumerate -Check whether the other cluster nodes are dead. - If not, STONITH them by hand. +Check whether the other vital cluster nodes don't report the dead resource + any more, e.g. + +\family typewriter +marsadm view all +\family default + at +\emph on +each +\emph default + of them. + In case the resource has not disappeared anywhere (which may happen during + network problems), do the +\family typewriter +down ; delete-resource +\family default + steps also there (optionally again with +\family typewriter +--force +\family default +). +\end_layout + +\begin_layout Enumerate +Be sure that the resource has disappeared +\emph on +everywhere +\emph default +. \end_layout \begin_layout Enumerate @@ -4011,7 +4997,20 @@ Check whether the other cluster nodes are dead. marsadm create-resource newmydata ... \family default - and further steps to setup your resource from scratch. + at the +\emph on +correct +\emph default + node using the +\emph on +correct +\emph default + disk device containing the +\emph on +correct +\emph default + version, and further steps to setup your resource from scratch, preferably + under a different name to minimize any risk. 
\end_layout \begin_layout Standard @@ -4115,7 +5114,7 @@ sequence path \family typewriter git \family default --like incremental checksum over the whole path (so-called +-like incremental checksum over the whole path history (so-called \begin_inset Quotes eld \end_inset @@ -4166,8 +5165,8 @@ repair \begin_layout Standard In case of desperate situations where none of the previous instructions - have succeeded, your last chance is rebuilding your resource from an intact - disk as follows: + have succeeded, your last chance is rebuilding all your resources from + intact disks as follows: \end_layout \begin_layout Enumerate @@ -4260,7 +5259,15 @@ rmmod \family typewriter mkfs.ext4 /mars/ \family default - or similar), and continue with step 7. + or similar) +\emph on +everywhere +\emph default + on +\emph on +all +\emph default + cluster nodes, and continue with step 7. \end_layout \begin_layout Enumerate @@ -4277,6 +5284,7 @@ reference "sub:Forced-Switching" \end_inset ). + Wait until any old logfile data has been replayed. \end_layout \begin_layout Enumerate @@ -4290,39 +5298,106 @@ marsadm delete-resource mydata --force \end_layout \begin_layout Enumerate -Locally build up the new resource as usual, out of the underlying disk. +Locally build up the new resource(s) as usual, out of the underlying disk(s). \end_layout \begin_layout Enumerate -Check whether the new resource works in standalone mode. +Check whether the new resource(s) work in standalone mode. \end_layout \begin_layout Enumerate When necessary, repeat these steps with other resources. \end_layout -\begin_layout Enumerate -Finally, do all the -\family typewriter -join-resource -\family default -s on the respective cluster nodes, according to your new redundancy scenario - after the failures (e.g. - after activating spare nodes, etc). -\end_layout - \begin_layout Standard Now you can choose how the rebuild your cluster. 
If you rebuilt \family typewriter /mars/ \family default - anywhere, you should do the same on all other (surviving) cluster nodes - and start over with a fresh + anywhere, you +\emph on +must +\emph default + rebuild it on +\emph on +all +\emph default + new cluster nodes and start over with a fresh \family typewriter join-cluster \family default - on them. + on each of them, from scratch. + It is not possible to mix the old cluster with the new one. +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +begin{enumerate} +\backslash +setcounter{enumi}{10} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +item +\end_layout + +\end_inset + + Finally, do all the necessary +\family typewriter +join-resource +\family default +s on the respective cluster nodes, according to your new redundancy scenario + after the failures (e.g. + after activating spare nodes, etc). + If you have +\begin_inset Formula $k>2$ +\end_inset + + replicas, start +\family typewriter +join-resource +\family default + on the worst / most damaged version first, and start the next preferably + only after the previous sync has successfully completed. + This way, you will be retaining some very old and outdated, but hopefully + potentially usable old replicas while a sync is running. + Don't start too many syncs in parallel. 
+\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +end{enumerate} +\end_layout + +\end_inset + + \end_layout \begin_layout Standard @@ -4338,8 +5413,8 @@ join-cluster \family typewriter delete-resource \family default - twice on the same resource name, at least after you have already a working - standalone primary + twice on the same resource name, after you have already a working standalone + primary \begin_inset Foot status open @@ -4359,7 +5434,22 @@ delete-resource \end_inset . - You might accidentally destroy your again-working copy! + You might accidentally destroy your again-working copy! You +\emph on +can +\emph default + issue +\family typewriter +delete-resource +\family default + multiple times on different nodes, e.g. + when the network has problems, but doing so +\emph on +after +\emph default + re-establishment of the initial primary bears some risk. + Therefore, the safest way is first deleting the resources everywhere, and + then starting over afresh. \end_layout \begin_layout Standard @@ -4450,8 +5540,8 @@ cannot be reached over the network. For example, you may place REJECT rules into all of your local iptables firewalls at the backup datacenter. - Alternatively, you may block the routes at the appropriate central router(s) - in your network. + Alternatively / additionally, you may block the routes at the appropriate + central router(s) in your network. \end_layout \begin_layout Enumerate @@ -4463,7 +5553,7 @@ marsadm disconnect all; marsadm primary --force all \end_layout \begin_layout Enumerate -Restart your services in the back datacenter (as far as necessary). +Restart your services in the backup datacenter (as far as necessary). Depending on your network setup, further steps like switching BGP routes etc may be necessary. 
\end_layout @@ -4518,13 +5608,46 @@ rmmod \end_layout \begin_layout Enumerate -Do an +At the failed side, do \family typewriter rm -rf /mars/resource-$mydata/ \family default - for all resources which had been primary before the blackout. + for all those resources which had been primary before the blackout. + Do this +\emph on +only +\emph default + for those cases, otherwise you will need unnecessary +\family typewriter +leave-resource +\family default +s or +\family typewriter +invalidate +\family default +s later (e.g. + when half of your nodes were already running at the surviving side). In order to avoid unnecessary traffic, please do this only as far as really necessary. + Don't remove any other directories. + In particular, +\family typewriter +/mars/ips/ +\family default + +\emph on +must +\emph default + remain intact. + In case you accidentally deleted them, or you had to re-create +\family typewriter +/mars/ +\family default + from scratch, try +\family typewriter +rsync +\family default + with the correct options. \begin_inset Newline newline \end_inset @@ -4568,7 +5691,9 @@ Run \family typewriter marsadm join-resource $res \family default - everywhere. +, but only at those places where the directory was removed previously, while + using the same disk devices as before. + This will minimize actual traffic thanks to the fast full sync algorithm. \end_layout \begin_layout Standard @@ -4596,9 +5721,12 @@ before \family typewriter primary --force \family default -! This way, no true split brain will occur at the backup datacenter side, - because there is simply no chance for transferring different versions over - the network. +! This way, no split brain will be +\emph on +visible +\emph default + at the backup datacenter side, because there is simply no chance for transferri +ng different versions over the network. 
It is also crucial to remove any (potentially diverging) resource directories \emph on @@ -4613,6 +5741,47 @@ modprobe have to restore bulks of nodes in a short time. \end_layout +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +In case the repair of a full datacenter should take so extremely long that + some +\family typewriter +/mars/ +\family default + partitions are about to run out of space at the surviving side, you may + use the +\family typewriter +leave-resource --host=failed-node +\family default + trick described earlier, followed by +\family typewriter +log-delete-all +\family default +. + Best if you have prepared a fully automatic script long before the incident, + which executes suchalike only as far as necessary in each individual case. +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +Even better: train such scenarios in advance, and prepare scripts for mass + automation. +\end_layout + \begin_layout Section The State of MARS \begin_inset CommandInset label @@ -4897,7 +6066,8 @@ where macroname \family default \emph default - is one of the following macros described in the following sections. + is one of the following macros described in the following sections, or + a macro which has been written by yourself. As always, you may replace the resource name \family typewriter mydata @@ -4906,13 +6076,24 @@ mydata \family typewriter all \family default - in order to get the state of all locally joined resources. + in order to get the state of all locally joined resources, as well as a + list of all those resources. \end_layout \begin_layout Subsection Predefined Macros \end_layout +\begin_layout Standard +The macro processor is a very flexible and versatile tool for +\series bold +customizing +\series default +. 
+ You can create your own macros, but probably the rich set of predefined + macros is already sufficient for your needs. +\end_layout + \begin_layout Subsubsection Predefined Complex and High-Level Macros \begin_inset CommandInset label @@ -4928,15 +6109,14 @@ name "sub:Predefined-Complex-and" The following predefined complex macros try to address the information needs of humans. Nevertheless, they can also be used in scripts, but beware that sometimes - the output may change its format depending on certain if-conditions. + the output format may change. \end_layout \begin_layout Standard Notice: the definitions of predefined complex macros may be updated in the course of the MARS project. However, the primitive macros recursively called by the complex ones will - be hopefully rather stable in future (with the exception of bugfixes for - major bugs). + be hopefully rather stable in future (with the exception of bugfixes). If you want to retain an old / outdated version of a complex macro, just check it out from git, follow the instructions in section \begin_inset CommandInset ref @@ -4983,9 +6163,8 @@ marsadm view mydata \family default \emph default suffix. - It shows a one-line status summary for each resource, optionally augmented - with a progress bar whenever a sync or a fetch of logfiles is currently - running. + It shows a one-line status summary for each resource, optionally followed + by progress bars whenever a sync or a fetch of logfiles is currently running. The status line has the following fields: \end_layout @@ -5070,12 +6249,101 @@ primarynode \family typewriter 1and1 +\family default + +\begin_inset space ~ +\end_inset + +or +\begin_inset space ~ +\end_inset + + +\family typewriter +default-1and1 \family default A variant of \family typewriter default \family default for internal use by 1&1 Internet AG. + You may call this complex macro by saying +\family typewriter +marsadm view-1and1 all +\family default +. 
+\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +Note: the +\family typewriter +marsadm view-1and1 +\family default + command has been intensely tested in Spring 2014 to produce exactly the + same output than the 1&1 internal +\begin_inset Foot +status open + +\begin_layout Plain Layout +In addition to allow for customization, the macro processor is also meant + as an exit strategy for removing dependencies from non-free software. + +\series bold +Please put your future macros also under GPL! +\end_layout + +\end_inset + + tool +\family typewriter +marsview +\family default + +\begin_inset Foot +status open + +\begin_layout Plain Layout +There are some subtle differences: numbers are displayed in a different + precision, some bug fixes in the macro version (which might have occurred + +\emph on +in the meantime +\emph default + ) may lead to different output as a side effect from bug fixes in +\emph on +predefined +\emph default + macros, because the original +\family typewriter +marsview +\family default + command is currently not actively maintained. + Documentation of +\family typewriter +marsview +\family default + can be found in the corresponding manpage, see +\family typewriter +man marsview +\family default +. + By construction, this is also the (unmaintained) documentation of +\family typewriter +marsadm view-1and1 +\family default + and other +\family typewriter +-1and1 +\family default + macros. Notice that all \family typewriter *-1and1 @@ -5083,7 +6351,23 @@ default macros are not officially supported by the developer of MARS, and they may disappear in a future major release. However, they could be useful for your own customization macros. 
- Customization via your own macros (see section +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +Customization via your own macros (see section \begin_inset CommandInset ref LatexCommand ref reference "sub:Creating-your-own" @@ -5091,8 +6375,37 @@ reference "sub:Creating-your-own" \end_inset ) is explicitly encouraged by the developer. - It would be even nice if a vibrant user community would emerge, helping - each other by exchange of macros. + It would be nice if a vibrant user community would emerge, helping each + other by exchange of macros. + +\end_layout + +\begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +Hint: in order to produce your own customized inspection / monitoring tools, + you may ask the author for an official reservation of a macro sub-namespace + such as +\family typewriter +*- +\emph on +yourcompanyname +\family default +\emph default +. + You will be fully responsible for your own reserved namespace and can do + with it whatever you want. + The official MARS release will guarantee that +\emph on +no name clashes +\emph default + with your reserved sub-namespace will occur in future. \end_layout \begin_layout Labeling @@ -5111,7 +6424,11 @@ Detached \family default , \family typewriter -Inconsistent +InConsistent +\family default +, +\family typewriter +NeedsReplay \family default , \family typewriter @@ -5181,7 +6498,8 @@ replstate-1and1 \family typewriter flags \family default - For each of disk, attach, sync, fetch, and replay, show exactly one character. + For each of disk, consistency, attach, sync, fetch, and replay, show exactly + one character. Each character is either a capital one, or the corresponding lowercase one, or a dash. 
The meaning is as follows: @@ -5210,7 +6528,7 @@ d \family typewriter - \family default - = none present. + = none present / configured. \end_layout \begin_layout Labeling @@ -5290,7 +6608,7 @@ fetch: \family typewriter F \family default - = fetched logfiles are (almost) up-to-date, + = according to knowlege, fetched logfiles are up-to-date, \family typewriter f \family default @@ -5307,7 +6625,7 @@ replay: \family typewriter R \family default - = all logfiles are replayed, + = all fetched logfiles are replayed, \family typewriter r \family default @@ -5411,7 +6729,129 @@ should \family typewriter /dev/mars/mydata \family default - device is currently in use. + device is currently in use . +\begin_inset Newline newline +\end_inset + + +\begin_inset Tabular + + + + + + + +\begin_inset Text + +\begin_layout Plain Layout + +\end_layout + +\end_inset + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +%todo-primary{} == 0 +\end_layout + +\end_inset + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +%todo-primary{} == 1 +\end_layout + +\end_inset + + + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +%is-primary{} == 0 +\end_layout + +\end_inset + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +None +\family default + / +\family typewriter +Secondary +\end_layout + +\end_inset + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +NotYetPrimary +\end_layout + +\end_inset + + + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +%is-primary{} == 1 +\end_layout + +\end_inset + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +RemainsPrimary +\end_layout + +\end_inset + + +\begin_inset Text + +\begin_layout Plain Layout + +\family typewriter +Primary +\end_layout + +\end_inset + + + + +\end_inset + + \end_layout \begin_layout Labeling @@ -5452,6 +6892,22 @@ primarynode-1and1 syncinfo \family default Shows an 
informational progress bar when sync is running. + Intended for humans. + Scripts should not rely on any details from this. + Script may use this only as an +\emph on +approximate +\emph default + means for detecting progress (when comparing the +\emph on +full +\emph default + output text to a prior version and finding +\emph on +any +\emph default + difference, they may conclude that some progress has happened, how small + whatsoever). \end_layout \begin_layout Labeling @@ -5470,6 +6926,11 @@ syncinfo-1and1 replinfo \family default Shows an informational progress bar when fetch is running. + Use cases are analogously to +\family typewriter +syncinfo +\family default +. \end_layout \begin_layout Labeling @@ -5655,7 +7116,7 @@ get-resource-{fat,err,wrn}-count \labelwidthstring 00.00.0000 \family typewriter -is-{attach,sync,fetch,replay,primary} +is-{attach,sync,fetch,replay,primary,module-loaded} \family default Shows a boolean value (0 or 1) indicating the \emph on @@ -8349,7 +9810,7 @@ odd-else-part \begin_layout Itemize \family typewriter -%elsuntil +%elsunless \family default \begin_inset Formula $\ldots$ @@ -8612,6 +10073,24 @@ Mathematicians knowing Banach's fixedpoint theorem will know what this is \begin_layout Itemize +\family typewriter +%tmp{ +\emph on +body +\emph default +} +\family default + Evaluates the +\family typewriter +\emph on +body +\family default +\emph default + once in a temporary scope which is thrown away afterwards. 
+\end_layout + +\begin_layout Itemize + \family typewriter %call{ \emph on @@ -9626,8 +11105,8 @@ If there were a \emph on strict \emph default - global consistency model, which is roughly equivalent to a standalone model, - we would need + global consistency model, which would be roughly equivalent to a standalone + model, we would need \emph on locking \emph default @@ -9650,6 +11129,81 @@ Eventually Consistent \end_layout \begin_layout Standard +\noindent +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + +Notice that the network bottleneck problems described in section +\begin_inset CommandInset ref +LatexCommand ref +reference "sec:Network-Bottlenecks" + +\end_inset + + are +\emph on +demanding +\emph default + an +\begin_inset Quotes eld +\end_inset + +eventually consistent +\begin_inset Quotes erd +\end_inset + + model. + You have +\series bold +no chance +\series default + against natural laws, like Einstein's laws. + In order to cope with the problem area, you have to +\emph on +invest some additional effort +\emph default +. + Unfortunately, asynchronous communication models are more tricky to program + and to debug than simple strictly consistent models. + In particular, you +\emph on +have to cope with +\emph default + additional +\series bold +race conditions +\series default + +\emph on +inherent +\emph default + +\emph on +to +\emph default + the +\begin_inset Quotes eld +\end_inset + +eventually consistent +\begin_inset Quotes erd +\end_inset + + model. + In the face of the laws of the universe, motivate yourself by looking at + the graphics at the cover page: the planets are a +\emph on +symbol +\emph default + for what you have to do! 
+\end_layout + +\begin_layout Standard +\noindent \begin_inset Graphics filename images/MatieresCorrosives.png lyxscale 50 @@ -9657,8 +11211,8 @@ Eventually Consistent \end_inset - The asynchronous communication protocol of MARS leads to a different behaviour - from DRBD in case of + Example: the asynchronous communication protocol of MARS leads to a different + behaviour from DRBD in case of \series bold network partitions \series default @@ -10072,7 +11626,11 @@ works \begin_layout Plain Layout \size tiny -works +works (but +\emph on +not recommended! +\emph default +) \end_layout \end_inset @@ -11419,7 +12977,22 @@ locally \family typewriter marsadm invalidate \family default - later. + later (if there is no split brain; otherwise you might need the +\family typewriter +leave-resource +\family default + ; +\family typewriter +join-resource +\family default + method from section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Split-Brain-Resolution" + +\end_inset + +). This is an even more desperate action of the kernel module. You don't want to get there (except for testing). \end_layout @@ -11947,11 +13520,19 @@ ally. \end_layout \begin_layout Enumerate -On the secondaries, use +On the secondaries, and when there is no split brain, use \family typewriter marsadm invalidate $res \family default in order to get your outdated mirrors uptodate. + In case of split brain, follow the instructions from section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Split-Brain-Resolution" + +\end_inset + +. This will lead to temporarily inconsistent mirrors, so don't do this on all secondaries in parallel, but sequentially step by step. 
This way, if you have more than 1 mirror, you will always retain at least @@ -11972,13 +13553,21 @@ If you had only 1 mirror per resource before the overflow happened, you \family typewriter marsadm join-resource $res \family default - on a third node (provided that your storage space permits that after the + on a third node (provided that your storage space permits it after the cleanup). After the initial full sync has finished there, do an \family typewriter marsadm invalidate $res \family default - on the outdated mirror. + on the outdated mirror (if you had no split brain; otherwise follow the + instructions in section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Split-Brain-Resolution" + +\end_inset + +). This way, you will always retain at least one consistent mirror somewhere. After all is up-to-date, you can delete the superfluous mirror by \family typewriter @@ -14280,14 +15869,15 @@ $res \family default must denote an already existing resource in the cluster (i.e. its symlink tree information must have been received). - The resource must have a designated primary. + The resource must have a designated primary, and there must not exist a + split brain. The local node must not be already member of that resource. The argument \family typewriter $disk_dev \family default - must denote an absolute path to a usable local block device, its size must - be greater or equal to the logical size of the resource. + must denote an absolute path to a usable (but currently unused) local block + device, its size must be greater or equal to the logical size of the resource. When the optional \family typewriter $mars_name @@ -14751,7 +16341,12 @@ dead This command implies a forceful detach, possibly destroying consistency. \size scriptsize -In particular, when a cluster node was operating in primary mode ( +It is similar in spirit to a +\series bold +STONITH +\series default +. 
+ In particular, when a cluster node was operating in primary mode ( \family typewriter /dev/mars/mydata \family default @@ -15239,12 +16834,13 @@ However, be careful! If you \emph on accidentally \emph default - forget to give the right readonly-mount flags, use + forget to give the right readonly-mount flags, if you use \family typewriter fsck \family default - inbetween, or alter the disk content in any other way (beware of LVM snapshots - / restores etc), you will almost certainly produce an + in repair mode in between, or alter the disk content in any other way (beware + of LVM snapshots / restores etc), you will almost certainly produce an + \series bold unnoticed inconsistency \series default @@ -15262,7 +16858,7 @@ no chance \begin_layout Plain Layout \size scriptsize -Postcondition: MARS uses the local disk and is able work with it (e.g. +Postcondition: MARS uses the local disk and is able to work with it (e.g. replay logfiles on it). \end_layout @@ -15406,7 +17002,15 @@ Postcondition: the local disk belonging to $res is no longer in use. \size scriptsize In contrast to DRBD, you need not explicitly pause syncing, fetching, or - replaying. + replaying +\emph on +to +\emph default + (as opposed to +\emph on +from +\emph default +) the local disk. These processes are automatically paused. As another contrast to DRBD, the respective processes will usually \emph on @@ -15429,6 +17033,45 @@ reference "sec:The-State-of" ). \end_layout +\begin_layout Plain Layout +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + +\size scriptsize +Notice: only +\emph on +local +\emph default + transfer operations +\emph on +to +\emph default + the local disk are paused by a detach. + When another node is remotely running a sync +\emph on +from +\emph default + your local disk, it will likely remain in use for remote reading.
+ The reason is that the server part of MARS is operating purely passively, + in order to serve all remote requests as well as possible (similar to the + original Unix philosophy). + In order to really stop all accesses, do a +\family typewriter +pause-sync +\family default + on all other resource members where a sync is currently running. + You may also try +\family typewriter +pause-sync-global +\family default +. +\end_layout + \begin_layout Plain Layout \begin_inset Graphics filename images/MatieresToxiques.png @@ -15439,12 +17082,13 @@ reference "sec:The-State-of" \size scriptsize -WARNING! After this, you might use the underlying disk for other purposes, - such as test-mounting it in +WARNING! After this, and after having paused any remote data access, you + might use the underlying disk for your own purposes, such as test-mounting + it in \emph on readonly \emph default - mode.. + mode. \series bold Don't modifiy @@ -15453,13 +17097,36 @@ Don't modifiy \family typewriter fsck \family default + +\begin_inset Foot +status open + +\begin_layout Plain Layout + +\size scriptsize +Some (but not all) +\family typewriter +fsck +\family default + tools for some filesystems have options to start only a test repair / verify + mode / dry run, without doing actual modifications to the data. + Of course, these modes +\emph on +can +\emph default + be used. + But be really sure! Double-check for the right options! +\end_layout + +\end_inset + ! Otherwise, you will have inconsistencies \emph on guaranteed \emph default . - MARS has no way for knowing of any modifications to your disk when not - written via + MARS has no way of knowing about any modifications to your disk when bypassing + \family typewriter /dev/mars/* \family default @@ -18709,7 +20376,7 @@ status open \begin_layout Plain Layout \size scriptsize -Precondition: sync must have been finished. +Precondition: sync must have finished at any resource member.
All relevant transaction logfiles must be either already locally present, or be fetchable (see \family typewriter @@ -18792,8 +20459,8 @@ not \family typewriter --force \family default -: when another host is currently primary, it is first asked to become secondary, - and it is waited until it actually has become secondary. +: when another host is currently primary, it is first asked to leave its + primary role, and the command waits until it has actually become secondary. After that, the local host is asked to become primary. Before actually becoming primary, all relevant logfiles are transferred over the network and replayed, in order to avoid accidental creation of @@ -18836,6 +20503,67 @@ In case a split brain is already detected at the initial situation, the . \end_layout +\begin_layout Plain Layout +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + +\size scriptsize + In case of +\begin_inset Formula $k>2$ +\end_inset + + replicas: if you want to hand over between host +\family typewriter +A +\family default + and +\family typewriter +B +\family default + while a sync is currently running at host +\family typewriter +C +\family default +, you have the following options: +\end_layout + +\begin_layout Enumerate + +\size scriptsize +wait until the sync has finished (see macro +\family typewriter +sync-rest +\family default +, or +\family typewriter +marsadm view +\family default + in general). +\end_layout + +\begin_layout Enumerate + +\size scriptsize +do a +\family typewriter +leave-resource +\family default + on host +\family typewriter +C +\family default +, and later +\family typewriter +join-resource +\family default + after the handover has completed successfully.
+\end_layout + \begin_layout Plain Layout \size scriptsize @@ -18869,7 +20597,7 @@ pause-replay primary --force \family default is a potentially harmful variant, because it will provoke a split brain - in most cases, and therefore in turn will usually lead to + in many cases, and therefore in turn will lead to \series bold data loss \series default @@ -18922,7 +20650,25 @@ primary \family typewriter --force \family default -, but rather umount the device at the other side! +, but rather umount +\begin_inset Foot +status open + +\begin_layout Plain Layout + +\size scriptsize +A common misconception is when people think that they can keep their filesystem + mounted without provoking a split brain, because they have their application + stopped and thus don't write any data into the filesystem. + This is a wrong idea, because filesystems may write some metadata, like + booking information, even after hours or days of inactivity. + Therefore MARS insists that the device is no longer in use before any handover + can take place. +\end_layout + +\end_inset + + the device at the other side! \end_layout \begin_layout Plain Layout @@ -18935,7 +20681,7 @@ primary \size scriptsize -Only use + Only use \family typewriter primary --force \family default @@ -18952,6 +20698,48 @@ primary --force ! \end_layout +\begin_layout Plain Layout +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + +\size scriptsize + If you umount +\family typewriter +/dev/mars/mydata +\family default + on the old primary +\family typewriter +A +\family default +, and then wait until +\family typewriter +marsadm view +\family default + (or another suitable macro) on the target host +\family typewriter +B +\family default + shows that everything is +\family typewriter +UpToDate +\family default +, you can prevent a split brain by yourself even when giving +\family typewriter +primary --force +\family default + afterwards. 
+ However, checking / assuring this is +\emph on +your +\emph default + responsibility! +\end_layout + \begin_layout Plain Layout \begin_inset Graphics filename images/MatieresCorrosives.png @@ -18963,13 +20751,13 @@ primary --force \family typewriter \size scriptsize -primary --force + primary --force \family default switches only the \emph on designated \emph default - primary, but actually becoming the/an actual primary may be impossible + primary, but actually becoming the / an actual primary may be impossible in case you are \emph on already @@ -19000,17 +20788,17 @@ reference "sub:Split-Brain-Resolution" \size scriptsize -Hint for the case of + Hint in case of \begin_inset Formula $k>2$ \end_inset - replicas: although + replicas: \family typewriter marsadm invalidate \family default - could be used to resolve a split brain at other secondaries (which are - neither the old nor the new designated primary), it is often better to - use the + cannot resolve a split brain at other secondaries (which are neither the + old nor the new designated primary). + Therefore, use the \family typewriter leave-resource \family default @@ -19033,13 +20821,14 @@ unrelated \begin_inset Quotes erd \end_inset - secondaries, until the split brain is gone. + secondaries step by step, until the split brain is gone. Don't \family typewriter join-resource \family default again before the split brain is gone! This way, all these replicas will - remain consistent for now, but of course outdated (or potentially a + remain consistent for now, but of course outdated (or potentially even + a \begin_inset Quotes eld \end_inset @@ -19049,7 +20838,7 @@ wrong split-brain version, but \emph on -usable +potentially usable \emph default in case you get under pressure in some way). 
In the hopefully unlikely case that you should later discover that you @@ -19091,7 +20880,7 @@ correct \size scriptsize -Generally: in case of + Generally: in case of \family typewriter primary --force \family default @@ -19221,16 +21010,25 @@ Precondition: the local \begin_layout Plain Layout \size scriptsize -Postcondition: +Postcondition: There exists no designated primary any more. + During split brain and when the network is OK (again), all actual primaries + (including the local host) will leave primary ASAP (i.e. + when their +\family typewriter +/dev/mars/mydata +\family default + is no longer in use). + Any secondary will start following (old) logfiles (even from backlogs) + by replaying transaction logs if it is +\emph on +uniquely +\emph default + possible (which is often violated during split brain). + On any secondary, \family typewriter /dev/mars/$dev_name \family default - has disappeared; at least the current host is in secondary role. - In split brain situations (when the network is OK), -\emph on -all -\emph default - hosts will go into secondary role after a while. + will have disappeared. \end_layout \begin_layout Plain Layout @@ -19242,6 +21040,47 @@ all \end_inset +\size scriptsize + Notice: in contrast to DRBD, you +\series bold +don't need +\series default + this command during normal operation, including handover. + Any resource member which is +\emph on +not +\emph default + designated as primary will +\emph on +automatically +\emph default + go into secondary role. + For example, if you have +\begin_inset Formula $k=4$ +\end_inset + + replicas, only +\emph on +one of them +\emph default + can be designated as a primary. + When the network is OK, all other 3 nodes will know this fact, and they + will +\emph on +automatically +\emph default + go into secondary mode, following the transaction logs from the (new) primary.
+\end_layout + +\begin_layout Plain Layout +\begin_inset Graphics + filename images/MatieresCorrosives.png + lyxscale 50 + scale 17 + +\end_inset + + \size scriptsize Hint: avoid this command. It turns off @@ -19252,12 +21091,114 @@ any \series bold globally \series default + +\begin_inset Foot +status open + +\begin_layout Plain Layout + +\size scriptsize +A serious +\series bold +misconception +\series default + among some people is when they believe that they can switch +\begin_inset Quotes eld +\end_inset + +a certain node to secondary +\begin_inset Quotes erd +\end_inset + . + It is not possible to switch individual nodes to secondary, without affecting + other nodes! The concept of +\begin_inset Quotes eld +\end_inset + +designated primary +\begin_inset Quotes erd +\end_inset + + is +\series bold +global +\series default + throughout a resource! +\end_layout + +\end_inset + +. + You cannot start a sync after that (e.g. + +\family typewriter +invalidate +\family default + or +\family typewriter +join-resource +\family default + or +\family typewriter +resume-sync +\family default +), because it is +\emph on +not unique +\emph default + wherefrom the data shall be fetched. + In split brain situations (when the network is OK again), this may have + further drawbacks. It is much better / easier to +\series bold \emph on directly \emph default - switch the primary from one node to another. + switch the designated primary +\series default + from one node to another via the +\family typewriter +primary +\family default + command. + See also section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Forced-Switching" + +\end_inset + +. 
+\end_layout + +\begin_layout Plain Layout +\begin_inset Graphics + filename images/lightbulb_brightlit_benj_.png + lyxscale 12 + scale 7 + +\end_inset + + +\size scriptsize + There is only one valid use case where you +\emph on +really +\emph default + need this command: before finally destroying a resource via the +\emph on +last +\emph default + +\family typewriter +leave-resource +\family default + (or the dangerous +\family typewriter +delete-resource +\family default +). \end_layout \end_inset @@ -20305,6 +22246,34 @@ $res designated \emph default primary must exist. + When having +\begin_inset Formula $k>2$ +\end_inset + + replicas, no split brain must exist (otherwise, or when +\family typewriter +invalidate +\family default + does not work in case of +\begin_inset Formula $k=2$ +\end_inset + +, use the +\family typewriter +leave-resource +\family default + ; +\family typewriter +join-resource +\family default + method described in section +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Split-Brain-Resolution" + +\end_inset + +). \end_layout \begin_layout Plain Layout @@ -20461,9 +22430,9 @@ reference "sec:Creating-and-Maintaining" Use this only \emph on -after +before \emph default - having created a fresh filesystem inside + creating a fresh filesystem inside \family typewriter /dev/mars/$res \family default @@ -24503,17 +26472,17 @@ status open \size scriptsize This would cause unintended side effects due to races between logfile transfer / application and block-wise comparison of the underlying disks. - However, MARS + However, \family typewriter -invalide +marsadm join-resource +\family default + or +\family typewriter +invalidate \family default will do the same as DRBD verify followed by DRBD resync, i.e. - -\family typewriter -marsadm invalidate -\family default - will automatically correct any found errors; note that the fast-fullsync - algorithm of MARS will minimize network traffic.
+ this will automatically correct any errors found. + Note that the fast-fullsync algorithm of MARS will minimize network traffic. \end_layout \end_inset