diff --git a/docu/images/Incident_Probabilities.pdf b/docu/images/Incident_Probabilities.pdf
new file mode 100644
index 00000000..42627891
Binary files /dev/null and b/docu/images/Incident_Probabilities.pdf differ
diff --git a/docu/images/MOUNTPOINTS_Comparison_of_Reversible_StorageNode_Failures.pdf b/docu/images/MOUNTPOINTS_Comparison_of_Reversible_StorageNode_Failures.pdf
new file mode 100644
index 00000000..026ab125
Binary files /dev/null and b/docu/images/MOUNTPOINTS_Comparison_of_Reversible_StorageNode_Failures.pdf differ
diff --git a/docu/images/SERVICE_Comparison_of_Reversible_StorageNode_Failures.pdf b/docu/images/SERVICE_Comparison_of_Reversible_StorageNode_Failures.pdf
new file mode 100644
index 00000000..b4b2cf5c
Binary files /dev/null and b/docu/images/SERVICE_Comparison_of_Reversible_StorageNode_Failures.pdf differ
diff --git a/docu/mars-manual.lyx b/docu/mars-manual.lyx
index 323c6e89..c51d76d7 100644
--- a/docu/mars-manual.lyx
+++ b/docu/mars-manual.lyx
@@ -2014,6 +2014,1193 @@ In any case, a MARS-based geo-redundant sharding pool
 is cheaper than using commercial storage appliances which are much more
 expensive by their nature.
 \end_layout
 
+\begin_layout Section
+Reliability Arguments from Architecture
+\begin_inset CommandInset label
+LatexCommand label
+name "sec:Reliability-Arguments-from"
+\end_inset
+\end_layout
+
+\begin_layout Standard
+A contemporary common belief is that big clusters would provide better
+ reliability than anything else.
+ Some practical observations at 1&1 and its subsidiaries cannot confirm
+ this.
+\end_layout
+
+\begin_layout Standard
+Stimulated by such practical experience, theoretical explanations were sought.
+ Surprisingly, they show that LocalSharding is superior to true big clusters
+ under practically important preconditions.
+ Here is an intuitive explanation.
+ A detailed mathematical description of the model can be found in appendix
+\begin_inset CommandInset ref
+LatexCommand vref
+reference "chap:Mathematical-Model-of"
+\end_inset
+.
+\end_layout
+
+\begin_layout Subsection
+Storage Server Node Failures
+\end_layout
+
+\begin_layout Subsubsection
+Simple intuitive explanation
+\end_layout
+
+\begin_layout Standard
+Block-level replication systems like DRBD are constructed for failover in
+ local redundancy scenarios, or, when using MARS, even for geo-redundant
+ failover scenarios.
+ They traditionally deal with
+\series bold
+pairs
+\series default
+ of servers, or with triples, etc.
+ In order to get a storage incident with them,
+\emph on
+both
+\emph default
+ sides of a DRBD or MARS small cluster (also called
+\series bold
+shard
+\series default
+) must have an incident at the same time.
+\end_layout
+
+\begin_layout Standard
+In contrast, big clusters spread their objects over a huge number of nodes
+\begin_inset Formula $O(n)$
+\end_inset
+, with some redundancy degree
+\begin_inset Formula $k$
+\end_inset
+ denoting the number of replicas.
+ As a consequence,
+\emph on
+any
+\emph default
+\begin_inset Formula $k$
+\end_inset
+ node failures out of
+\begin_inset Formula $O(n)$
+\end_inset
+ will produce an incident.
+ For example, when
+\begin_inset Formula $k=2$
+\end_inset
+ and
+\begin_inset Formula $n$
+\end_inset
+ is equal for both models, then
+\emph on
+any
+\emph default
+ combination of two node failures occurring at the same time will lead to
+ an incident:
+\end_layout
+
+\begin_layout Standard
+\noindent
+\align center
+\begin_inset Graphics
+ filename images/Incident_Probabilities.pdf
+ width 100col%
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+Intuitively, it is easy to see that hitting both members of the same pair
+ at the same time is less likely than hitting
+\emph on
+any
+\emph default
+ two nodes of a big cluster.
+\end_layout
+
+\begin_layout Standard
+If you are curious about some concrete numbers, read on.
+\end_layout
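+
+\begin_layout Standard
+To put rough numbers on this intuition, here is a minimal Python sketch.
+ It is not part of MARS; the values for
+\begin_inset Formula $n$
+\end_inset
+ and
+\begin_inset Formula $p$
+\end_inset
+ are hypothetical examples only:
+\end_layout
+
+\begin_layout LyX-Code
+# Hypothetical example: n = 1,000 shards, node failure probability p = 0.0001.
+p = 0.0001                # probability that a single node is down
+n = 1_000                 # number of shards (pairs)
+N = 2 * n                 # same total number of servers for both models
+
+# LocalSharding: an incident requires BOTH members of SOME pair to be down.
+p_pairs = 1.0 - (1.0 - p ** 2) ** n
+
+# BigCluster: ANY 2 simultaneous node failures out of N produce an incident.
+p_cluster = 1.0 - (1.0 - p) ** N - N * p * (1.0 - p) ** (N - 1)
+
+print(f"LocalSharding: {p_pairs:.2e}")    # ~1.0e-05
+print(f"BigCluster:    {p_cluster:.2e}")  # ~1.8e-02, over 1,000 times worse
+\end_layout
+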
+\begin_layout Subsubsection
+Detailed explanation
+\begin_inset CommandInset label
+LatexCommand label
+name "sub:Detailed-explanation"
+\end_inset
+\end_layout
+
+\begin_layout Standard
+For the sake of simplicity, the more detailed explanation below is based
+ on the following assumptions:
+\end_layout
+
+\begin_layout Itemize
+We are looking at
+\series bold
+storage node
+\series default
+ failures only.
+\end_layout
+
+\begin_layout Itemize
+Disk failures are regarded as already solved (e.g.
+ by local RAID-6 or by the well-known compensation mechanisms of big clusters).
+ Only in case these mechanisms don't work, disk failures are mapped to node
+ failures, and are then already included in the probability of storage node
+ failures.
+\end_layout
+
+\begin_layout Itemize
+We restrict ourselves to temporary /
+\series bold
+transient
+\series default
+ failures, disregarding permanent data loss.
+ Otherwise, the differences between local-storage sharding architectures
+ and big clusters would become even worse.
+ When losing some physical storage nodes forever in a big cluster, it is
+ typically anything but easy to determine which data of which application
+ instances / customers has been affected, and which will need a restore
+ from backup.
+\end_layout
+
+\begin_layout Itemize
+Storage network failures (as a whole) are ignored.
+ Otherwise a fair comparison between the architectures would become difficult.
+ If they were taken into account, the advantages of LocalSharding would
+ become even bigger.
+\end_layout
+
+\begin_layout Itemize
+We assume that the storage network (when present) forms no bottleneck.
+ Network implementations like TCP/IP versus Infiniband or similar are thus
+ ignored.
+\end_layout
+
+\begin_layout Itemize
+Software failures / bugs are also ignored.
+ We only compare
+\emph on
+architectures
+\emph default
+ here, not their various implementations.
+\end_layout
+
+\begin_layout Itemize
+The x axis shows the number of basic storage units
+\begin_inset Formula $n$
+\end_inset
+, where one basic storage unit equals the total disk space provided by
+ one storage node.
+\end_layout
+
+\begin_layout Itemize
+We assume that the number of application instances scales linearly with
+\begin_inset Formula $n$
+\end_inset
+.
+ For simplicity, we assume that the number of applications running on the
+ whole pool is exactly
+\begin_inset Formula $n$
+\end_inset
+.
+\end_layout
+
+\begin_layout Itemize
+For the BigCluster architecture, we assume that all objects are always
+ distributed to
+\begin_inset Formula $O(n)$
+\end_inset
+ nodes.
+ For simplicity of the model, we assume a distribution via a
+\emph on
+uniform
+\emph default
+ hash function.
+ If other hash functions were used (e.g.
+ distributing only to a constant number of nodes), it would no longer be
+ a big cluster.
+\begin_inset Newline newline
+\end_inset
+In the following example, we assume a uniform object distribution to exactly
+\begin_inset Formula $n$
+\end_inset
+ nodes.
+ Notice that any other
+\begin_inset Formula $n'=O(n)$
+\end_inset
+ with
+\begin_inset Formula $n'<n$
+\end_inset
+ would change the following arguments only by constant factors.
+\end_layout
+
+\begin_layout Standard
+Let us start with the smallest possible example: we compare
+\begin_inset Formula $n=2$
+\end_inset
+ application units on two servers A and B, with only
+\begin_inset Formula $k=1$
+\end_inset
+ replica.
+ The following tables show the number of failing application units for each
+ combination of node states:
+\end_layout
+
+\begin_layout Standard
+\noindent
+\begin_inset Tabular
+<lyxtabular version="3" rows="3" columns="3">
+<features tabularvalignment="middle">
+<column alignment="center" valignment="top">
+<column alignment="center" valignment="top">
+<column alignment="center" valignment="top">
+<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+LocalSharding
+\size tiny
+(DRBDorMARS)
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+A up
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+A down
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+B up
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+0
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+1
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+B down
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+1
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+2
+\end_layout
+
+\end_inset
+</cell>
+</row>
+</lyxtabular>
+
+\end_inset
+
+\begin_inset ERT
+status open
+
+\begin_layout Plain Layout
+
+\backslash
+hfill
+\end_layout
+
+\end_inset
+
+\begin_inset Tabular
+<lyxtabular version="3" rows="3" columns="3">
+<features tabularvalignment="middle">
+<column alignment="center" valignment="top">
+<column alignment="center" valignment="top">
+<column alignment="center" valignment="top">
+<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+BigCluster
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+A up
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+A down
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+B up
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+0
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+2
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+B down
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+2
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+2
+\end_layout
+
+\end_inset
+</cell>
+</row>
+</lyxtabular>
+
+\end_inset
+
+\begin_inset ERT
+status open
+
+\begin_layout Plain Layout
+
+\backslash
+hfill
+\end_layout
+
+\end_inset
+
+\begin_inset space ~
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+What is the heart of the difference? While a node failure at LocalSharding
+ (DRBDorMARS) will tear down only the local application, the teardown produced
+ by BigCluster will spread to
+\emph on
+all
+\emph default
+ of the
+\begin_inset Formula $n=2$
+\end_inset
+ application units, because of the uniform hashing and because we have only
+\begin_inset Formula $k=1$
+\end_inset
+ replica.
+\end_layout
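+
+\begin_layout Standard
+The table values can be reproduced by brute-force enumeration.
+ The following Python sketch illustrates only the model assumptions above,
+ not any concrete cluster software:
+\end_layout
+
+\begin_layout LyX-Code
+from itertools import product
+
+# Enumerate all node states of the n = 2, k = 1 example with servers A and B.
+for a_down, b_down in product([False, True], repeat=2):
+    down = [a_down, b_down]
+    # LocalSharding: application i fails iff its own server i is down.
+    local = sum(down)
+    # BigCluster with uniform hashing and k = 1: every application has
+    # objects on every server, so any node failure tears down both.
+    big = 2 if any(down) else 0
+    a = "down" if a_down else "up"
+    b = "down" if b_down else "up"
+    print(f"A {a:4} / B {b:4} -> LocalSharding: {local}  BigCluster: {big}")
+\end_layout
+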
+\begin_layout Standard
+Would it help to increase both
+\begin_inset Formula $n$
+\end_inset
+ and
+\begin_inset Formula $k$
+\end_inset
+ to larger values?
+\end_layout
+
+\begin_layout Standard
+In the following graphics, the thick red line shows the behaviour of
+\begin_inset Formula $k=1$
+\end_inset
+ PlainServers (which is the same as
+\begin_inset Formula $k=1$
+\end_inset
+ DRBDorMARS) for an increasing number of storage units
+\begin_inset Formula $n$
+\end_inset
+, ranging from 1 to 10,000 storage units (= number of servers for
+\begin_inset Formula $k=1$
+\end_inset
+).
+ Higher values of
+\begin_inset Formula $k\in[1,4]$
+\end_inset
+ are also displayed.
+ All lines corresponding to the same
+\begin_inset Formula $k$
+\end_inset
+ are drawn in the same color.
+ Notice that both the x and the y axis are logscale:
+\end_layout
+
+\begin_layout Standard
+\noindent
+\align center
+\begin_inset Graphics
+ filename images/SERVICE_Comparison_of_Reversible_StorageNode_Failures.pdf
+ lyxscale 200
+ width 100col%
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+When you look at the thin solid BigCluster lines for
+\begin_inset Formula $k=2,\ldots$
+\end_inset
+ drawn in different colors, you may wonder why they all converge to the
+ thin red BigCluster line, which corresponds to
+\begin_inset Formula $k=1$
+\end_inset
+ BigCluster.
+ They also converge towards the grey dotted topmost line indicating the
+ total possible uptime of all applications (depending on x).
+ This can be explained as follows:
+\end_layout
+
+\begin_layout Standard
+The x axis shows the number of basic storage units.
+ When you have to create 10,000 storage units with a replication degree of
+\begin_inset Formula $k=2$
+\end_inset
+ replicas, then you will have to deploy
+\begin_inset Formula $k*10,000=20,000$
+\end_inset
+ servers in total.
+ When operating a pool of 20,000 servers, on statistical average 2 of them
+ will be down at any given point in time.
+ However, 2 is the same number as the replication degree
+\begin_inset Formula $k$
+\end_inset
+.
+ Because our BigCluster model as defined above distributes
+\emph on
+all
+\emph default
+ objects to
+\emph on
+all
+\emph default
+ servers uniformly, there will almost always
+\emph on
+exist
+\emph default
+ some objects for which no replica is available at any given point in time.
+ This means you will almost always have a
+\series bold
+permanent incident
+\series default
+ involving the same number of nodes as your replication degree
+\begin_inset Formula $k$
+\end_inset
+, and in turn
+\emph on
+some
+\emph default
+ of your objects will not be accessible at all.
+ Consequently, at
+\begin_inset Formula $x=10,000$
+\end_inset
+ storage units you will lose almost any advantage from increasing the number
+ of replicas.
+ Adding more replicas will no longer help at
+\begin_inset Formula $x\geq10,000$
+\end_inset
+ storage units.
+\end_layout
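+
+\begin_layout Standard
+This convergence can be checked numerically.
+ The following sketch computes the probability of at least
+\begin_inset Formula $k$
+\end_inset
+ simultaneous node failures via the complement of the first
+\begin_inset Formula $k$
+\end_inset
+ Bernoulli terms (stdlib Python only, hypothetical
+\begin_inset Formula $p$
+\end_inset
+):
+\end_layout
+
+\begin_layout LyX-Code
+def p_at_least(k, N, p):
+    """P(at least k of N nodes are down at the same time)."""
+    q = 1.0 - p
+    pmf, head = q ** N, 0.0                  # pmf starts at P(0 nodes down)
+    for j in range(k):
+        head += pmf
+        pmf *= (N - j) / (j + 1) * (p / q)   # stable Bernoulli recurrence
+    return 1.0 - head
+
+p, n = 0.0001, 10_000
+for k in range(1, 5):
+    N = k * n                                # total number of deployed servers
+    print(f"k={k}: E[down servers]={N * p:.0f}"
+          f"  P(incident)={p_at_least(k, N, p):.3f}")
+# k=1: ~0.632  k=2: ~0.594  k=3: ~0.577  k=4: ~0.567
+# -- more replicas barely help once E[down servers] reaches k.
+\end_layout
+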
+\begin_layout Standard
+Notice that the
+\emph on
+solid
+\emph default
+ lines are showing the probability of
+\emph on
+some
+\emph default
+ incident, disregarding the
+\series bold
+size of the incident
+\series default
+.
+\end_layout
+
+\begin_layout Standard
+What about the
+\emph on
+dashed
+\emph default
+ lines showing much better behaviour for BigCluster?
+\end_layout
+
+\begin_layout Standard
+\noindent
+\begin_inset Graphics
+ filename images/MatieresCorrosives.png
+ lyxscale 50
+ scale 17
+\end_inset
+ Under some further preconditions, it would be possible to argue with the
+\emph on
+size
+\emph default
+ of incidents.
+ However, now a big fat warning.
+ When you are
+\series bold
+responsible
+\series default
+ for operations of thousands of servers, you should be very conscious of
+ these preconditions.
+ Otherwise you could risk your career.
+ In short:
+\end_layout
+
+\begin_layout Itemize
+When your application, e.g.
+ a smartphone app, consists of accessing only 1 object during a reasonably
+ long timeframe, you can safely
+\series bold
+assume that there is no interdependency
+\series default
+ between all of your objects.
+ In addition, you have to assume (and you should check) that your cluster
+ operating software as a whole does not introduce any further
+\series bold
+hidden / internal interdependencies
+\series default
+.
+ Only in this case, and only then, may you take the dashed lines, arguing
+ with the number of inaccessible objects instead of the number of basic
+ storage units.
+\end_layout
+
+\begin_layout Itemize
+Whenever your application uses
+\series bold
+bigger structured objects
+\series default
+, such as filesystems or block devices or whole VMs / containers, then you
+ will likely get
+\series bold
+interdependent objects
+\series default
+ at your big cluster storage layer.
+\begin_inset Newline newline
+\end_inset
+Example: experienced sysadmins will confirm that even a data loss rate of
+ only 1/1,000,000 of blocks in a classical Linux filesystem like
+\family typewriter
+xfs
+\family default
+ or
+\family typewriter
+ext4
+\family default
+ will likely imply the need for an offline filesystem check (
+\family typewriter
+fsck
+\family default
+), which is a major incident for the affected filesystem instances (see the
+ sketch after this list).
+\begin_inset Newline newline
+\end_inset
+Theoretical explanation: servers are running for a very long time, and
+ filesystems are typically also mounted for a long time.
+ Notice that the probability of hitting any vital filesystem data equals
+ the probability of hitting any other data.
+ Sooner or later, any defective sector in the metadata structures or in
+ freespace management etc.
+ will stop your whole filesystem, and in turn will stop your application
+ instance(s) running on top of it.
+\begin_inset Newline newline
+\end_inset
+Similar arguments hold for transient failures: most filesystems are not
+ constructed to compensate for hanging IO, which typically leads to
+\series bold
+system hangs
+\series default
+.
+\end_layout
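+
+\begin_layout Standard
+The
+\family typewriter
+fsck
+\family default
+ argument from the list above can be sketched numerically.
+ All values below are hypothetical assumptions; only the block loss rate
+ is taken from the example:
+\end_layout
+
+\begin_layout LyX-Code
+# How likely does a filesystem survive a 1/1,000,000 block loss rate
+# without any metadata block being hit?
+blocks = 10 ** 9           # assumption: ~4 TiB at 4 KiB block size
+loss_rate = 1e-6           # from the example above
+meta_fraction = 0.01       # assumption: share of metadata / freespace blocks
+
+lost = blocks * loss_rate                      # ~1,000 lost blocks
+p_no_meta_hit = (1.0 - meta_fraction) ** lost  # every loss must miss them
+print(f"lost blocks: {lost:.0f}  P(no metadata hit): {p_no_meta_hit:.1e}")
+# -> P(no metadata hit) ~ 4.3e-05: an offline fsck is almost unavoidable.
+\end_layout
+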
+\begin_layout Standard
+\noindent
+\begin_inset Graphics
+ filename images/MatieresCorrosives.png
+ lyxscale 50
+ scale 17
+\end_inset
+ Blindly taking the dashed lines will expose you to a high risk of error.
+ Practical experience shows that there are often
+\series bold
+hidden dependencies
+\series default
+ in many applications, often also at application level.
+ You cannot necessarily see them when inspecting their data structures!
+ You will only notice some of them by analyzing their
+\series bold
+runtime behaviour
+\series default
+, e.g.
+ with tools like
+\family typewriter
+strace
+\family default
+.
+ Notice that in general the runtime behaviour of an arbitrary program is
+\series bold
+undecidable
+\series default
+.
+ Be cautious when drawing assumptions out of thin air!
+\end_layout
+
+\begin_layout Subsection
+Optimum Reliability from Architecture
+\begin_inset CommandInset label
+LatexCommand label
+name "subsec:Optimum-Reliability-from"
+\end_inset
+\end_layout
+
+\begin_layout Standard
+Another argument could be: don't distribute the BigCluster objects to exactly
+\begin_inset Formula $n$
+\end_inset
+ nodes, but to fewer nodes.
+ Would the result be better than DRBDorMARS LocalSharding?
+\end_layout
+
+\begin_layout Standard
+When distributing to
+\begin_inset Formula $O(k')$
+\end_inset
+ nodes with some constant
+\begin_inset Formula $k'$
+\end_inset
+, we no longer have a BigCluster architecture, but a mixed BigClusterSharding
+ form.
+\end_layout
+
+\begin_layout Standard
+As can be generalized from the above tables, the reliability of
+\series bold
+any
+\series default
+ BigCluster on
+\begin_inset Formula $k'>k$
+\end_inset
+ nodes is
+\series bold
+always
+\series default
+ worse than that of LocalSharding on exactly
+\begin_inset Formula $k$
+\end_inset
+ nodes, where
+\begin_inset Formula $k$
+\end_inset
+ is also the redundancy degree.
+\end_layout
+
+\begin_layout Standard
+In general:
+\end_layout
+
+\begin_layout Verse
+\series bold
+\size large
+The LocalSharding model is the optimum model for reliability of operation,
+ compared to any other model truly distributing its data and operations
+ over more nodes, like RemoteSharding or BigClusterSharding or BigCluster
+ does.
+\end_layout
+
+\begin_layout Standard
+There exists no better model, because shards consisting of exactly
+\begin_inset Formula $k$
+\end_inset
+ nodes, where
+\begin_inset Formula $k$
+\end_inset
+ is the redundancy degree, are already the smallest possible shards under
+ the assumptions of section
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "sub:Detailed-explanation"
+\end_inset
+, and any other model truly involving
+\begin_inset Formula $k'>k$
+\end_inset
+ nodes for distribution of objects at any shard is
+\series bold
+always
+\series default
+ worse in the dimension of reliability.
+ Thus the above sentence follows by induction.
+\end_layout
+
+\begin_layout Standard
+The above sentence formulates a
+\series bold
+fundamental law of storage systems
+\series default
+.
+\end_layout
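+
+\begin_layout Standard
+A small numerical cross-check of this law, assuming a fixed
+\begin_inset Formula $k=2$
+\end_inset
+ replicas spread uniformly inside shards of
+\begin_inset Formula $k'\geq k$
+\end_inset
+ nodes (hypothetical
+\begin_inset Formula $p$
+\end_inset
+):
+\end_layout
+
+\begin_layout LyX-Code
+def p_shard_incident(kprime, p):
+    """P(at least 2 of k' nodes down): with uniform replica placement and
+    many objects, some objects then lose both replicas."""
+    q = 1.0 - p
+    return 1.0 - q ** kprime - kprime * p * q ** (kprime - 1)
+
+p = 0.0001
+for kprime in (2, 3, 4, 8, 16):
+    print(f"k'={kprime:2}: {p_shard_incident(kprime, p):.2e}")
+# k'=2 (LocalSharding) yields p^2 = 1.0e-08; every k' > 2 is strictly worse:
+# k'=3 -> ~3.0e-08, k'=4 -> ~6.0e-08, k'=8 -> ~2.8e-07, k'=16 -> ~1.2e-06.
+\end_layout
+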
+\begin_layout Subsection
+Error Propagation to Client Mountpoints
+\end_layout
+
+\begin_layout Standard
+The following is only applicable when filesystems (or their objectstore
+ counterparts) are exported over a storage network, in order to be mounted
+ in parallel at
+\begin_inset Formula $O(n)$
+\end_inset
+ mountpoints each.
+\end_layout
+
+\begin_layout Standard
+In such a scenario, any problem / incident inside your storage pool for
+ the filesystem instances will spread to
+\begin_inset Formula $O(n)$
+\end_inset
+ clients, increasing the incident size by a factor of
+\begin_inset Formula $O(n)$
+\end_inset
+ when measured in the number of affected mountpoints:
+\end_layout
+
+\begin_layout Standard
+\noindent
+\align center
+\begin_inset Graphics
+ filename images/MOUNTPOINTS_Comparison_of_Reversible_StorageNode_Failures.pdf
+ lyxscale 200
+ width 100col%
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+As a result, we now have a total of
+\begin_inset Formula $O(n^{2})$
+\end_inset
+ mountpoints = our new basic application units.
+ Such
+\begin_inset Formula $O(n^{2})$
+\end_inset
+ architectures quickly become even worse than before.
+ Thus a clear warning: don't try to build systems in such a way.
+\end_layout
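+
+\begin_layout Standard
+For a feeling of the orders of magnitude involved, a small sketch with a
+ hypothetical
+\begin_inset Formula $n=100$
+\end_inset
+:
+\end_layout
+
+\begin_layout LyX-Code
+n = 100                  # hypothetical: filesystems = storage units = clients
+
+# LocalSharding, application on the same box: one uncompensated storage
+# node incident affects 1 filesystem, and thus 1 application unit.
+local_affected = 1
+
+# Exporting every filesystem to n clients: the same incident now hits n
+# mountpoints; with BigCluster hashing it stops all n filesystems at once,
+# hence n * n mountpoints.
+export_affected = n                 # one filesystem, n mountpoints
+bigcluster_affected = n * n         # all filesystems on all clients
+
+print(local_affected, export_affected, bigcluster_affected)   # 1 100 10000
+\end_layout
+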
+\begin_layout Standard
+Notice: DRBD or MARS are traditionally used for running the application
+ on the same box as the storage.
+ Thus they are not vulnerable to these kinds of failure propagation over
+ the network.
+ Even with traditional iSCSI exports over DRBD or MARS, you won't have such
+ problems.
+ The only way to get such error propagation is via
+\begin_inset Formula $O(n)$
+\end_inset
+ NFS or
+\family typewriter
+glusterfs
+\family default
+ exports to
+\begin_inset Formula $O(n)$
+\end_inset
+ clients, leading to a total number of
+\begin_inset Formula $O(n^{2})$
+\end_inset
+ mountpoints, or similar setups.
+\end_layout
+
+\begin_layout Standard
+Clear advice: don't do that.
+ It's a bad idea.
+\end_layout
+
 \begin_layout Section
 Performance Arguments from Architecture
 \end_layout
@@ -38422,6 +39609,464 @@ reference "sec:Scripting-HOWTO"
 .
 \end_layout
 
+\begin_layout Chapter
+Mathematical Model of Architectural Reliability
+\begin_inset CommandInset label
+LatexCommand label
+name "chap:Mathematical-Model-of"
+\end_inset
+\end_layout
+
+\begin_layout Standard
+The assumptions used in the model are explained in detail in section
+\begin_inset CommandInset ref
+LatexCommand vref
+reference "sub:Detailed-explanation"
+\end_inset
+.
+ Here is a quick recap of the main parameters:
+\end_layout
+
+\begin_layout Itemize
+\begin_inset Formula $n$
+\end_inset
+ is the number of basic storage units.
+ It is also used for the number of application units, assumed to be the
+ same.
+\end_layout
+
+\begin_layout Itemize
+\begin_inset Formula $k$
+\end_inset
+ is the replication degree, or number of replicas.
+ In general, you will have to deploy
+\begin_inset Formula $N=k*n$
+\end_inset
+ storage servers for getting
+\begin_inset Formula $n$
+\end_inset
+ basic storage units.
+ This applies to any of the competing architectures.
+\end_layout
+
+\begin_layout Itemize
+\begin_inset Formula $s$
+\end_inset
+ is the architecture-dependent spread exponent: it tells whether a storage
+ incident will spread to the application units.
+ Examples:
+\begin_inset Formula $s=0$
+\end_inset
+ means that there is no spread between storage unit failures and application
+ unit failures, other than a local 1:1 one.
+\begin_inset Formula $s=1$
+\end_inset
+ means that an uncompensated storage node incident will cause
+\begin_inset Formula $n$
+\end_inset
+ application incidents.
+\end_layout
+
+\begin_layout Itemize
+\begin_inset Formula $p$
+\end_inset
+ is the probability of a storage server incident.
+ In the examples in section
+\begin_inset CommandInset ref
+LatexCommand vref
+reference "sec:Reliability-Arguments-from"
+\end_inset
+, a fixed
+\begin_inset Formula $p=0.0001$
+\end_inset
+ was used for easy understanding, but the following formulae should also
+ hold for any other
+\begin_inset Formula $p\in(0,1)$
+\end_inset
+.
+\end_layout
+
+\begin_layout Itemize
+\begin_inset Formula $T$
+\end_inset
+ is the observational period, introduced for convenience of understanding.
+ The following can also be computed independently of any
+\begin_inset Formula $T$
+\end_inset
+, as long as the probability
+\begin_inset Formula $p$
+\end_inset
+ does not change over time, which is assumed.
+ Because
+\begin_inset Formula $T$
+\end_inset
+ is only here for convenience, we set it to
+\begin_inset Formula $T=1/p$
+\end_inset
+.
+ In the examples from section
+\begin_inset CommandInset ref
+LatexCommand vref
+reference "sub:Detailed-explanation"
+\end_inset
+, a fixed
+\begin_inset Formula $T=10,000$
+\end_inset
+ hours was used.
+\end_layout
+
+\begin_layout Section
+Formula for DRBD / MARS
+\end_layout
+
+\begin_layout Standard
+We need not discriminate between a storage failure probability S and an
+ application failure probability A, because applications are run locally
+ at the storage servers 1:1.
+ The probability for failure of a single shard consisting of
+\begin_inset Formula $k$
+\end_inset
+ nodes is
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+A_{p}(k)=p^{k}
+\]
+\end_inset
+because all
+\begin_inset Formula $k$
+\end_inset
+ shard members have to be down at the same time.
+ In section
+\begin_inset CommandInset ref
+LatexCommand vref
+reference "sub:Detailed-explanation"
+\end_inset
+ we assumed that there is no cross-communication between shards.
+ Therefore they are completely independent of each other, and the total
+ downtime of
+\begin_inset Formula $n$
+\end_inset
+ shards during the observational period
+\begin_inset Formula $T$
+\end_inset
+ is
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+A_{p,T}(k,n)=T*n*p^{k}
+\]
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+When introducing the spread exponent
+\begin_inset Formula $s$
+\end_inset
+, the formula turns into
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+A_{s,p,T}(k,n)=T*n^{s+1}*p^{k}
+\]
+\end_inset
+\end_layout
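+
+\begin_layout Standard
+As a minimal executable sketch of this formula, with hypothetical example
+ values matching the graphics above:
+\end_layout
+
+\begin_layout LyX-Code
+def A_local_sharding(k, n, p, T, s=0):
+    """A_{s,p,T}(k,n) = T * n^(s+1) * p^k (expected downtime in hours)."""
+    return T * n ** (s + 1) * p ** k
+
+# Example: k=2 replicas, n=10,000 shards, p=0.0001, T=10,000 hours, s=0:
+print(A_local_sharding(2, 10_000, 0.0001, 10_000))   # -> 1.0 hour
+\end_layout
+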
+\begin_layout Section
+Formula for Unweighted BigCluster
+\end_layout
+
+\begin_layout Standard
+This is based on the Bernoulli formula.
+ The probability that exactly
+\begin_inset Formula $\bar{k}$
+\end_inset
+ storage nodes out of
+\begin_inset Formula $N=k*n$
+\end_inset
+ total storage nodes are down is
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+\bar{S}_{p}(\bar{k},N)=\binom{N}{\bar{k}}*p^{\bar{k}}*(1-p)^{N-\bar{k}}
+\]
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+Similarly, the probability of getting
+\begin_inset Formula $k$
+\end_inset
+ or more storage node failures (up to
+\begin_inset Formula $N$
+\end_inset
+) at the same time is
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+S_{p}(k,N)=\sum_{\bar{k}=k}^{N}\bar{S}_{p}(\bar{k},N)=\sum_{\bar{k}=k}^{N}\binom{N}{\bar{k}}*p^{\bar{k}}*(1-p)^{N-\bar{k}}
+\]
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+By replacing
+\begin_inset Formula $N$
+\end_inset
+ with
+\begin_inset Formula $k*n$
+\end_inset
+ (for conversion of the x axis into basic storage units) and by introducing
+\begin_inset Formula $T$
+\end_inset
+ we get
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+S_{p,T}(k,n)=T*\sum_{\bar{k}=k}^{k*n}\binom{k*n}{\bar{k}}*p^{\bar{k}}*(1-p)^{k*n-\bar{k}}
+\]
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+For comparability with DRBDorMARS, we have to compute the application
+ downtime A instead of the storage downtime S, which depends on the spread
+ exponent
+\begin_inset Formula $s$
+\end_inset
+ as follows:
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+A_{s,p,T}(k,n)=n^{s+1}*S_{p,T}(k,n)=n^{s+1}*T*\sum_{\bar{k}=k}^{k*n}\binom{k*n}{\bar{k}}*p^{\bar{k}}*(1-p)^{k*n-\bar{k}}
+\]
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+Notice that at
+\begin_inset Formula $s=0$
+\end_inset
+ we have introduced a factor of
+\begin_inset Formula $n$
+\end_inset
+, which corresponds to the hashing effect (teardown of
+\begin_inset Formula $n$
+\end_inset
+ application instances by a single uncompensated storage incident) as
+ described in section
+\begin_inset CommandInset ref
+LatexCommand vref
+reference "sub:Detailed-explanation"
+\end_inset
+.
+\end_layout
+
+\begin_layout Section
+Formula for SizeWeighted BigCluster
+\end_layout
+
+\begin_layout Standard
+In contrast to the above, we need to introduce a correction factor given
+ by the fraction of affected objects, relative to basic storage units.
+ Otherwise the y axis would not stay comparable due to different units.
+\end_layout
+
+\begin_layout Standard
+For the special case of
+\begin_inset Formula $k=1$
+\end_inset
+, there is no difference to the above.
+\end_layout
+
+\begin_layout Standard
+For the special case of
+\begin_inset Formula $k=2$
+\end_inset
+ replicas, the correction factor is
+\begin_inset Formula $1/(N-1)$
+\end_inset
+, because we assume that all the replicas of the affected first node are
+ uniformly spread over all other nodes, which are
+\begin_inset Formula $N-1$
+\end_inset
+ many.
+ The probability of hitting the intersection of the first node with the
+ second node is thus
+\begin_inset Formula $1/(N-1)$
+\end_inset
+.
+\end_layout
+
+\begin_layout Standard
+For higher values of
+\begin_inset Formula $k$
+\end_inset
+, and with a similar argument (never put another replica of the same object
+ onto the same storage node), we get the correction factor as
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+C(k,N)=\prod_{l=1}^{k-1}\frac{1}{N-l}
+\]
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+Hint: there are at most
+\begin_inset Formula $k$
+\end_inset
+ physical replicas on the disks.
+ For higher values of
+\begin_inset Formula $\bar{k}\geq k$
+\end_inset
+, there are
+\begin_inset Formula $\binom{\bar{k}}{k}$
+\end_inset
+ combinations of object intersections (when assuming that the number of
+ objects on a node is so large that no further object repetition can occur
+ except for the
+\begin_inset Formula $k$
+\end_inset
+-fold replica placement).
+ Thus the generalization to
+\begin_inset Formula $\bar{k}\geq k$
+\end_inset
+ is
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+C(k,\bar{k},N)=\binom{\bar{k}}{k}\prod_{l=1}^{k-1}\frac{1}{N-l}
+\]
+\end_inset
+\end_layout
+
+\begin_layout Standard
+\noindent
+By inserting this into the above formula, we get
+\end_layout
+
+\begin_layout Standard
+\begin_inset Formula
+\[
+A_{s,p,T}(k,n)=n^{s+1}*T*\sum_{\bar{k}=k}^{k*n}C(k,\bar{k},k*n)*\binom{k*n}{\bar{k}}*p^{\bar{k}}*(1-p)^{k*n-\bar{k}}
+\]
+\end_inset
+\end_layout
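+
+\begin_layout Standard
+The following Python sketch implements both BigCluster formulas and compares
+ them with the DRBDorMARS formula.
+ It uses only the standard library; the parameter values are the hypothetical
+ examples from above:
+\end_layout
+
+\begin_layout LyX-Code
+import math
+
+def bernoulli_pmf(N, p):
+    """Yield (kbar, P(exactly kbar of N nodes down)) via a stable recurrence."""
+    q = 1.0 - p
+    pmf = q ** N                                 # kbar = 0
+    for kbar in range(N + 1):
+        yield kbar, pmf
+        pmf *= (N - kbar) / (kbar + 1) * (p / q)
+
+def A_unweighted(k, n, p, T, s=0):
+    """n^(s+1) * T * P(at least k of N = k*n nodes down)."""
+    head = 0.0
+    for kbar, pmf in bernoulli_pmf(k * n, p):
+        if kbar >= k:
+            break
+        head += pmf
+    return n ** (s + 1) * T * (1.0 - head)
+
+def A_sizeweighted(k, n, p, T, s=0):
+    """Like A_unweighted, but each term weighted by C(k, kbar, N)."""
+    N, total = k * n, 0.0
+    for kbar, pmf in bernoulli_pmf(N, p):
+        if kbar < k:
+            continue
+        corr = math.comb(kbar, k)                # binom(kbar, k) ...
+        for l in range(1, k):
+            corr /= N - l                        # ... * prod 1/(N-l)
+        total += corr * pmf
+        if pmf < 1e-18:                          # remaining tail is negligible
+            break
+    return n ** (s + 1) * T * total
+
+k, n, p, T = 2, 10_000, 0.0001, 10_000
+print(f"DRBDorMARS:   {T * n * p ** k:12.2f} unit-hours")   # ~1
+print(f"Unweighted:   {A_unweighted(k, n, p, T):12.2f} unit-hours")
+print(f"SizeWeighted: {A_sizeweighted(k, n, p, T):12.2f} unit-hours")
+# Even the size-weighted BigCluster result stays orders of magnitude above
+# the LocalSharding value.
+\end_layout
+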
 \begin_layout Chapter
 GNU Free Documentation License
 \begin_inset CommandInset label