diff --git a/docu/football-verbose.help b/docu/football-verbose.help new file mode 100644 index 00000000..260df6ad --- /dev/null +++ b/docu/football-verbose.help @@ -0,0 +1,574 @@ +\begin{verbatim} +verbose=1 +Usage: + ./football.sh --help [--verbose] + Show help + ./football.sh --variable= + Override any shell variable + +Actions for resource migration: + + ./football.sh migrate [] + Run the sequence + migrate_prepare ; migrate_wait ; migrate_finish; migrate_cleanup. + + ./football.sh migrate_prepare [] + Allocate LVM space at the targets and start MARS replication. + + ./football.sh migrate_wait [] + Wait until MARS replication reports UpToDate. + + ./football.sh migrate_finish [] + Call hooks for handover to the targets. + + ./football.sh migrate_cleanup + Remove old / currently unused LV replicas from MARS and deallocate + from LVM. + +Actions for (manual) repair in emergency situations: + + ./football.sh manual_migrate_config [] + Transfer only the cluster config, without changing the MARS replicas. + This does no resource stopping / restarting. + Useful for reverting a failed migration. + + ./football.sh manual_config_update + Only update the cluster config, without changing anything else. + Useful for manual repair of failed migration. + + ./football.sh manual_merge_cluster + Run "marsadm merge-cluster" for the given hosts. + Hostnames must be from different (former) clusters. + + ./football.sh manual_split_cluster + Run "marsadm split-cluster" at the given hosts. + Useful for fixing failed / asymmetric splits. + Hint: provide _all_ hostnames which have formerly participated + in the cluster. + + ./football.sh repair_vm + Try to restart the VM on one of the given machines. + Useful during unexpected customer downtime. + + ./football.sh repair_mars + Before restarting the VM like in repair_vm, try to find a local + LV where a stand-alone MARS resource can be found and built up. + Use this only when the MARS resources are gone, and when you are + desperate. 
Problem: this will likely create a MARS setup which is + not usable for production, and therefore must be corrected later + by hand. Use this only during an emergency situation in order to + get the customers online again, while buying the downsides of this + command. + +Actions for inplace FS shrinking: + + ./football.sh shrink + Run the sequence shrink_prepare ; shrink_finish ; shrink_cleanup. + + ./football.sh shrink_prepare [] + Allocate temporary LVM space (when possible) and create initial + raw FS copy. + Default percent value(when left out) is 85. + + ./football.sh shrink_finish + Incrementally update the FS copy, swap old <=> new copy with + small downtime. + + ./football.sh shrink_cleanup + Remove old FS copy from LVM. + +Actions for inplace FS extension: + + ./football.sh extend + +Combined actions: + + ./football.sh migrate+shrink [] [] + Similar to migrate ; shrink but produces less network traffic. + Default percent value (when left out) is 85. + + ./football.sh migrate+shrink+back [] + Migrate temporarily to , then shrink there, + finally migrate back to old primary and secondaries. + Default percent value (when left out) is 85. + +Global maintenance: + + ./football.sh lv_cleanup + +General features: + + - Instead of , an absolute amount of storage with suffix + 'k' or 'm' or 'g' can be given. + + - When is currently stopped, login to the container is + not possible, and in turn the hypervisor node and primary storage node + cannot be automatically determined. In such a case, the missing + nodes can be specified via the syntax + :: + + - The following LV suffixes are used (naming convention): + -tmp = currently emerging version for shrinking + -preshrink = old version before shrinking took place + + - By adding the option --screener, you can handover football execution + to ./screener.sh . 
+ When some --enable_*_waiting is also added, then the critical + sections involving customer downtime are temporarily halted until + some sysadmin says "screener.sh continue $resource" or + attaches to the session and presses the RETURN key. + + ## football_includes + # List of directories where football-*.conf files can be found. + football_includes="${football_includes:-/usr/lib/mars/plugins /etc/mars/plugins $script_dir/plugins $HOME/.mars/plugins ./plugins}" + + ## dry_run + # When set, actions are only simulated. + dry_run=${dry_run:-0} + + ## verbose + # Increase verbosity. + verbose=${verbose:-0} + + ## confirm + # Only for debugging: manually started operations can be + # manually checked and confirmed before actually starting operations. + confirm=${confirm:-1} + + ## force + # Normally, shrinking and extending will only be started if there + # is something to do. + # Enable this for debugging and testing: the check is then skipped. + force=${force:-0} + + ## debug_injection_point + # RTFS: don't set this unless you are a developer knowing what you are doing. + debug_injection_point="${debug_injection_point:-0}" + + ## football_logdir + # Where the logfiles should be created. + # HINT: after playing Football in masses for a while, your $logdir will + # be easily populated with hundreds or thousands of logfiles. + # Set this to your convenience. + football_logdir="${football_logdir:-${logdir:-$HOME/football-logs}}" + + ## screener + # When enabled, hand over execution to the screener. + # Very useful for running Football in masses. + screener="${screener:-0}" + + ## min_space + # When testing / debugging with extremely small LVs, it may happen + # that mkfs refuses to create extremely small filesystems. + # Use this to ensure a minimum size. + min_space="${min_space:-20000000}" + + ## cache_repeat_lapse + # When using the waiting capabilities of screener, and when waits + # last very long, your dentry cache may become cold. 
+ # Use this for repeated refreshes of the dentry cache after some time. + cache_repeat_lapse="${cache_repeat_lapse:-120}" # Minutes + + ## ssh_opt + # Useful for customization to your ssh environment. + ssh_opt="${ssh_opt:--4 -A -o StrictHostKeyChecking=no -o ForwardX11=no -o KbdInteractiveAuthentication=no -o VerifyHostKeyDNS=no}" + + ## rsync_opt + # The rsync options in general. + # IMPORTANT: some intermediate progress report is absolutely needed, + # because otherwise a false-positive TIMEOUT may be assumed when + # no output is generated for several hours. + rsync_opt="${rsync_opt:- -aSH --info=progress2,STATS}" + + ## rsync_opt_prepare + # Additional rsync options for preparation and updating + # of the temporary shrink mirror filesystem. + rsync_opt_prepare="${rsync_opt_prepare:---exclude='.filemon2' --delete}" + + ## rsync_nice + # Typically, the preparation steps are run with background priority. + rsync_nice="${rsync_nice:-nice -19}" + + ## rsync_repeat_prepare and rsync_repeat_hot + # Tuning: increases the reliability of rsync and ensures that the dentry cache + # remains hot. + rsync_repeat_prepare="${rsync_repeat_prepare:-5}" + rsync_repeat_hot="${rsync_repeat_hot:-3}" + + ## wait_timeout + # Avoid infinite loops upon waiting. + wait_timeout="${wait_timeout:-$(( 24 * 60 ))}" # Minutes + + ## lvremove_opt + # Some LVM versions require this for unattended batch operations. + lvremove_opt="${lvremove_opt:--f}" + + ## critical_status + # This is the "magic" exit code indicating _criticality_ + # of a failed command. + critical_status="${critical_status:-199}" + + ## serious_status + # This is the "magic" exit code indicating _seriousness_ + # of a failed command. + serious_status="${serious_status:-198}" + + ## pre_hand or --pre-hand= + # Set this to do an ordinary to a new start position before doing + # anything else. This may be used for handover to a different datacenter + # and running Football there. 
+ pre_hand="${pre_hand:-}" + + ## startup_when_locked + # When == 0: + # Don't abort and don't wait when a lock is detected at startup. + # When == 1 and when enable_startup_waiting=1: + # Wait until the lock is gone. + # When == 2: + # Abort start of script execution when a lock is detected. + # Later, when locks are set _during_ execution, they will + # be obeyed when enable_*_waiting is set (instead), and will + # lead to waits instead of aborts. + startup_when_locked="${startup_when_locked:-1}" + + ## user_name + # Normally automatically derived from ssh agent or from $LOGNAME. + # Please override this only when really necessary. + export user_name="${user_name:-$(get_real_ssh_user)}" + export user_name="${user_name:-$LOGNAME}" + + +PLUGIN football-cm3 + + 1&1 specific plugin for dealing with the cm3 cluster manager + and its concrete operating environment (singleton instance). + + Current maximum cluster size limit: + + Maximum #syncs running before migration can start: + + Following marsadm --version must be installed: + + Following mars kernel modules must be loaded: + + ## enable_cm3 + # ShaHoLin-specific plugin for working with the infong platform + # (istore, icpu, infong) via 1&1-specific clustermanager cm3 + # and related toolsets. Much of it is bound to a singleton database + # instance (clustermw & siblings). + enable_cm3="${enable_cm3:-$(if [[ "$0" =~ tetris ]]; then echo 1; else echo 0; fi)}" + + ## skip_resource_ping + # Enable this only for testing. Normally, a resource name denotes a + # container name == machine name which must be running as a precondition, + # and thus must be pingable over the network. + skip_resource_ping="${skip_resource_ping:-0}" + + ## date_lock + # Don't enter critical sections at certain days of the week, + # and/or during certain hours. 
+ # This is a regex matching against "date +%u_%H" + date_lock="${date_lock:-}" + + ## workaround_firewall + # Documentation of technical debt for later generations: + # This has been needed since July 2017. In the many years before, no firewalling + # was effective at the replication network, because it is a physically + # separate network from the rest of the networking infrastructure. + # An attacker would first need to gain root access to the _hypervisor_ + # (not only to the LXC container and/or to KVM) before gaining access to + # those physical replication network interfaces. + # Since about that time, which is about the same time when the requirements + # for Container Football had been communicated, somebody introduced some + # unnecessary firewall rules, based on "security arguments". + # These arguments were however explicitly _not_ required by the _real_ + # security responsible person, and explicitly _not_ recommended by him. + # Now the problem is that it is almost politically impossible to get + # rid of such a "security feature". + # Until the problem is resolved, Container Football requires + # the _entire_ local firewall to be _temporarily_ shut down in order to + # allow marsadm commands over ssh to work. + # Notice: this is _not_ increasing the general security in any way. + # LONGTERM solution / TODO: future versions of mars should no longer + # depend on ssh. + # Then this "feature" can be turned off. + workaround_firewall="${workaround_firewall:-1}" + + ## ip_magic + # Similarly to workaround_firewall, this is needed since somebody + # introduced additional firewall rules also disabling sysadmin ssh + # connections at the _ordinary_ sysadmin network. + ip_magic="${ip_magic:-1}" + + ## do_split_cluster + # The current MARS branch 0.1a.y is not yet constructed for forming + # a BigCluster consisting of several thousands of machines. + # When a future version of mars0.1b.y (or 0.2.y) allows this, + # this can be disabled. 
+ do_split_cluster="${do_split_cluster:-1}" + + ## clustertool_host + # URL prefix of the internal configuration database REST interface. + clustertool_host="${clustertool_host:-http://clustermw:3042}" + + ## clustertool_user + # Username for clustertool access. + # By default, scans for a *.password file (see next option). + clustertool_user="${clustertool_user:-$(shopt -u nullglob; ls *.password | head -1 | cut -d. -f1)}" || echo "cannot find a password file *.password for clustermw: you MUST supply the credentials via default curl config files (see man page)" + + ## clustertool_passwd + # Here you can supply the encrypted password. + # By default, a file $clustertool_user.password is used + # containing the encrypted password. + clustertool_passwd="${clustertool_passwd:-$([[ -r $clustertool_user.password ]] && cat $clustertool_user.password)}" + + ## do_migrate + # Keep this enabled. Only disable for testing. + do_migrate="${do_migrate:-1}" # must be enabled; disable for dry-run testing + + ## always_migrate + # Only use for testing, or for special situations. + # This skips the test whether the resource has already been migrated. + always_migrate="${always_migrate:-0}" # only enable for testing + + ## check_segments + # 0 = disabled + # 1 = only display the segment names + # 2 = check for equality + # WORKAROUND, potentially harmful when used inadequately. + # The historical physical segment borders need to be removed for + # Container Football. + # Unfortunately, the subproject aiming to accomplish this has not + # progressed for one year now. In the meantime, Container Football can + # only be played within the ancient segment borders. + # After this big impediment is eventually resolved, this option + # should be switched off. + check_segments="${check_segments:-1}" + + ## backup_dir + # Directory for keeping JSON backups of clustermw. + backup_dir="${backup_dir:-.}" + + ## enable_mod_deflate + # Internal, for support. 
+ enable_mod_deflate="${enable_mod_deflate:-1}" + + ## enable_segment_move + # Seems to be needed by some other tooling. + enable_segment_move="${enable_segment_move:-1}" + + ## override_hwclass_id + # When necessary, override this from $include_dir/plugins/*.conf + override_hwclass_id="${override_hwclass_id:-25007}" + + ## override_hvt_id + # When necessary, override this from $include_dir/plugins/*.conf + override_hvt_id="${override_hvt_id:-8059}" + + ## iqn_base and iet_type and iscsi_eth and iscsi_tid + # Workaround: this is needed for _dynamic_ generation of iSCSI sessions + # bypassing the ordinary ones as automatically generated by the + # cm3 cluster manager (only at the old istore architecture). + # Notice: not needed for regular operations, only for testing. + # Normally, you dont want to shrink over a _shared_ 1MBit iSCSI line. + iqn_base="${iqn_base:-iqn.2000-01.info.test:test}" + iet_type="${iet_type:-blockio}" + iscsi_eth="${iscsi_eth:-eth1}" + iscsi_tid="${iscsi_tid:-4711}" + + ## monitis_downtime_script + # ShaHoLin-internal + monitis_downtime_script="${monitis_downtime_script:-}" + + ## monitis_downtime_duration + # ShaHoLin-internal + monitis_downtime_duration="${monitis_downtime_duration:-20}" # Minutes + + ## shaholin_finished_log + # ShaHoLin-specific logfile, reporting _only_ successful completion + # of an action. + shaholin_finished_log="${shaholin_finished_log:-$football_logdir/shaholin-finished.log}" + + ## ticket + # OPTIONAL: the meaning is ShaHoLin specific. + # This can be used for updating JIRA tickets. + # Can be set on the command line like "./tetris.sh $args --ticket=TECCM-4711 + ticket="${ticket:-}" + + ## ticket_get_cmd + # Optional: when set, this script can be used for retrieving ticket IDs + # in place of commandline option --ticket= + ticket_get_cmd="${ticket_get_cmd:-}" + + ## ticket_update_cmd + # This can be used for calling an external command which updates + # the ticket(s) given by the $ticket parameter. 
+ ticket_update_cmd="${ticket_update_cmd:-}" + + ## shaholin_action + # OPTIONAL: specific action script with parameters. + shaholin_action="${shaholin_action:-}" + + +PLUGIN football-basic + + Generic driver for systemd-controlled MARS pools. + The current version supports only a flat model: + (1) There is a single "big cluster" at metadata level. + All cluster members are joined via merge-cluster. + All occurring names need to be globally unique. + (2) The network uses BGP or other means, thus any hypervisor + can (potentially) start any VM at any time. + (3) iSCSI or remote devices are not supported for now + (LocalSharding model). This may be extended in a future + release. + This plugin is exclusive-or with cm3. + +Plugin specific actions: + + ./football.sh basic_add_host + Manually add another host to the hostname cache. + + ## pool_cache_dir + # Directory for caching the pool status. + pool_cache_dir="${pool_cache_dir:-$script_dir/pool-cache}" + + ## initial_hostname_file + # This file must contain a list of storage and/or hypervisor hostnames + # where a /mars directory must exist. + # These hosts are then scanned for further cluster members, + # and the transitive closure of all host names is computed. + initial_hostname_file="${initial_hostname_file:-./hostnames.input}" + + ## hostname_cache + # This file contains the transitive closure of all host names. + hostname_cache="${hostname_cache:-$pool_cache_dir/hostnames.cache}" + + ## resources_cache + # This file contains the transitive closure of all resource names. + resources_cache="${resources_cache:-$pool_cache_dir/resources.cache}" + + ## res2hyper_cache + # This file contains the association between resources and hypervisors. + res2hyper_cache="${res2hyper_cache:-$pool_cache_dir/res2hyper.assoc}" + + ## enable_basic + # This plugin is exclusive-or with cm3. 
+ enable_basic="${enable_basic:-$(if [[ "$0" =~ football ]]; then echo 1; else echo 0; fi)}" + + ## ssh_port + # Set this for separating sysadmin access from customer access + ssh_port="${ssh_port:-}" + + ## basic_mnt_dir + # Names the mountpoint directory at hypervisors. + # This must co-incide with the systemd mountpoints. + basic_mnt_dir="${basic_mnt_dir:-/mnt}" + + +PLUGIN football-motd + + Generic plugin for motd. Communicate that Football is running + at login via motd. + + ## enable_motd + # whether to use the motd plugin. + enable_motd="${enable_motd:-0}" + + ## update_motd_cmd + # Distro-specific command for generating motd from several sources. + # Only tested for Debian Jessie at the moment. + update_motd_cmd="${update_motd_cmd:-update-motd}" + + ## download_motd_script and motd_script_dir + # When no script has been installed into /etc/update-motd.d/ + # you can do it dynamically here, bypassing any "official" deployment + # methods. Use this only for testing! + # An example script (which should be deployed via your ordinary methods) + # can be found under $script_dir/update-motd.d/67-football-running + download_motd_script="${download_motd_script:-}" + motd_script_dir="${motd_script_dir:-/etc/update-motd.d}" + + ## motd_file + # This will contain the reported motd message. + # It is created by this plugin. + motd_file="${motd_file:-/var/motd/football.txt}" + + ## motd_color_on and motd_color_off + # ANSI escape sequences for coloring the generated motd message. + motd_color_on="${motd_color_on:-\\033[31m}" + motd_color_off="${motd_color_off:-\\033[0m}" + + +PLUGIN football-report + + Generic plugin for communication of reports. + + ## report_cmd_{start,warning,failed,finished} + # External command which is called at start / failure / finish + # of Football. + # The following variables can be used (e.g. 
as parameters) when + # escaped with a backslash: + # $res = name of the resource (LV, container, etc) + # $primary = the current (old) + # $secondary_list = list of current (old) secondaries + # $target_primary = the target primary name + # $target_secondary = list of target secondaries + # $operation = the operation name + # $target_percent = the value used for shrinking + # $txt = some informative text from Football + # Further variables can be found by looking at the source code, or by + # defining your own variables or functions externally or via plugins. + # Empty = don't do anything + report_cmd_start="${report_cmd_start:-}" + report_cmd_warning="${report_cmd_warning:-$script_dir/screener.sh notify "$res" warning "$txt"}" + report_cmd_failed="${report_cmd_failed:-}" + report_cmd_finished="${report_cmd_finished:-}" + + +PLUGIN football-waiting + + Generic plugin, interfacing with screener: when this is used + by your script and enabled, then you will be able to wait for + "screener.sh continue" operations at certain points in your + script. + + ## enable_*_waiting + # + # When this is enabled, and when Football has been started by screener, + # then football will delay the start of several operations until a sysadmin + # does one of the following manually: + # + # a) ./screener.sh continue $session + # b) ./screener.sh resume $session + # c) ./screener.sh attach $session and press the RETURN key + # d) doing nothing, and $wait_timeout has been exceeded + # + # CONVENTION: football resource names are used as screener session ids. + # This ensures that only 1 operation can be started for the same resource, + # and it simplifies the handling for junior sysadmins. 
+ # + enable_startup_waiting="${enable_startup_waiting:-0}" + enable_handover_waiting="${enable_handover_waiting:-0}" + enable_migrate_waiting="${enable_migrate_waiting:-0}" + enable_shrink_waiting="${enable_shrink_waiting:-0}" + + ## enable_cleanup_delayed and wait_before_cleanup + # By setting this, you can delay the cleanup operations for some time. + # This way, you are keeping the old LV contents as a kind of "backup" + # for some limited time. + # HINT: don't set wait_before_cleanup to large values, because it can + # seriously slow down Football. + enable_cleanup_delayed="${enable_cleanup_delayed:-0}" + wait_before_cleanup="${wait_before_cleanup:-180}" # Minutes + + ## reduce_wait_msg + # Instead of reporting the waiting status once per minute, + # decrease the frequency of reporting. + # Warning: don't increase this too much. Do not exceed + # session_timeout/2 from screener. Because of the Nyquist criterion, + # stay on the safe side by setting session_timeout to at least _twice_ + # the value set here. + reduce_wait_msg="${reduce_wait_msg:-60}" # Minutes + +\end{verbatim} diff --git a/docu/football.help b/docu/football.help new file mode 100644 index 00000000..6c507107 --- /dev/null +++ b/docu/football.help @@ -0,0 +1,173 @@ +\begin{verbatim} +Usage: + ./football.sh --help [--verbose] + Show help + ./football.sh --variable= + Override any shell variable + +Actions for resource migration: + + ./football.sh migrate [] + Run the sequence + migrate_prepare ; migrate_wait ; migrate_finish; migrate_cleanup. + + ./football.sh migrate_prepare [] + Allocate LVM space at the targets and start MARS replication. + + ./football.sh migrate_wait [] + Wait until MARS replication reports UpToDate. + + ./football.sh migrate_finish [] + Call hooks for handover to the targets. + + ./football.sh migrate_cleanup + Remove old / currently unused LV replicas from MARS and deallocate + from LVM. 
+ +Actions for (manual) repair in emergency situations: + + ./football.sh manual_migrate_config [] + Transfer only the cluster config, without changing the MARS replicas. + This does no resource stopping / restarting. + Useful for reverting a failed migration. + + ./football.sh manual_config_update + Only update the cluster config, without changing anything else. + Useful for manual repair of failed migration. + + ./football.sh manual_merge_cluster + Run "marsadm merge-cluster" for the given hosts. + Hostnames must be from different (former) clusters. + + ./football.sh manual_split_cluster + Run "marsadm split-cluster" at the given hosts. + Useful for fixing failed / asymmetric splits. + Hint: provide _all_ hostnames which have formerly participated + in the cluster. + + ./football.sh repair_vm + Try to restart the VM on one of the given machines. + Useful during unexpected customer downtime. + + ./football.sh repair_mars + Before restarting the VM like in repair_vm, try to find a local + LV where a stand-alone MARS resource can be found and built up. + Use this only when the MARS resources are gone, and when you are + desperate. Problem: this will likely create a MARS setup which is + not usable for production, and therefore must be corrected later + by hand. Use this only during an emergency situation in order to + get the customers online again, while buying the downsides of this + command. + +Actions for inplace FS shrinking: + + ./football.sh shrink + Run the sequence shrink_prepare ; shrink_finish ; shrink_cleanup. + + ./football.sh shrink_prepare [] + Allocate temporary LVM space (when possible) and create initial + raw FS copy. + Default percent value(when left out) is 85. + + ./football.sh shrink_finish + Incrementally update the FS copy, swap old <=> new copy with + small downtime. + + ./football.sh shrink_cleanup + Remove old FS copy from LVM. 
+ +Actions for inplace FS extension: + + ./football.sh extend + +Combined actions: + + ./football.sh migrate+shrink [] [] + Similar to migrate ; shrink but produces less network traffic. + Default percent value (when left out) is 85. + + ./football.sh migrate+shrink+back [] + Migrate temporarily to , then shrink there, + finally migrate back to old primary and secondaries. + Default percent value (when left out) is 85. + +Global maintenance: + + ./football.sh lv_cleanup + +General features: + + - Instead of , an absolute amount of storage with suffix + 'k' or 'm' or 'g' can be given. + + - When is currently stopped, login to the container is + not possible, and in turn the hypervisor node and primary storage node + cannot be automatically determined. In such a case, the missing + nodes can be specified via the syntax + :: + + - The following LV suffixes are used (naming convention): + -tmp = currently emerging version for shrinking + -preshrink = old version before shrinking took place + + - By adding the option --screener, you can handover football execution + to ./screener.sh . + When some --enable_*_waiting is also added, then the critical + sections involving customer downtime are temporarily halted until + some sysadmin says "screener.sh continue $resource" or + attaches to the session and presses the RETURN key. + + +PLUGIN football-cm3 + + 1&1 specific plugin for dealing with the cm3 cluster manager + and its concrete operating environment (singleton instance). + + Current maximum cluster size limit: + + Maximum #syncs running before migration can start: + + Following marsadm --version must be installed: + + Following mars kernel modules must be loaded: + + +PLUGIN football-basic + + Generic driver for systemd-controlled MARS pools. + The current version supports only a flat model: + (1) There is a single "big cluster" at metadata level. + All cluster members are joined via merge-cluster. + All occurring names need to be globally unique. 
+ (2) The network uses BGP or other means, thus any hypervisor + can (potentially) start any VM at any time. + (3) iSCSI or remote devices are not supported for now + (LocalSharding model). This may be extended in a future + release. + This plugin is exclusive-or with cm3. + +Plugin specific actions: + + ./football.sh basic_add_host + Manually add another host to the hostname cache. + + +PLUGIN football-motd + + Generic plugin for motd. Communicate that Football is running + at login via motd. + + +PLUGIN football-report + + Generic plugin for communication of reports. + + +PLUGIN football-waiting + + Generic plugin, interfacing with screener: when this is used + by your script and enabled, then you will be able to wait for + "screener.sh continue" operations at certain points in your + script. + +\end{verbatim} diff --git a/docu/make-help.sh b/docu/make-help.sh new file mode 100755 index 00000000..cdb869aa --- /dev/null +++ b/docu/make-help.sh @@ -0,0 +1,18 @@ +#!/bin/bash + +football_dir="${football_dir:-../football}" + +function make_latex_include +{ + local cmd="$1" + + echo '\begin{verbatim}' + eval "$cmd" | sed 's/\\/\\\\/g' + echo '\end{verbatim}' +} + +make_latex_include "../userspace/marsadm --help" > marsadm.help +make_latex_include "(cd $football_dir/ && ./football.sh --help)" > football.help +make_latex_include "(cd $football_dir/ && ./football.sh --help --verbose)" > football-verbose.help +make_latex_include "(cd $football_dir/ && ./screener.sh --help)" > screener.help +make_latex_include "(cd $football_dir/ && ./screener.sh --help --verbose)" > screener-verbose.help diff --git a/docu/mars-manual.lyx b/docu/mars-manual.lyx index 2daf872b..da6c3f97 100644 --- a/docu/mars-manual.lyx +++ b/docu/mars-manual.lyx @@ -39413,6 +39413,13 @@ maximum 100 logfiles per resource \begin_layout Chapter Handout for Midnight Problem Solving +\begin_inset CommandInset label +LatexCommand label +name "chap:Handout-for-Midnight" + +\end_inset + + \end_layout \begin_layout 
Standard @@ -42012,6 +42019,162 @@ A_{s,p,T}(k,n)=n^{s+1}*T*\sum_{\bar{k}=k}^{k*n}C(k,\bar{k},k*n)*\binom{k*n}{\bar \end_inset +\end_layout + +\begin_layout Chapter +Command Documentation for Userspace Tools +\begin_inset CommandInset label +LatexCommand label +name "chap:Command-Documentation-for" + +\end_inset + + +\end_layout + +\begin_layout Section + +\family typewriter +marsadm --help +\begin_inset CommandInset label +LatexCommand label +name "sec:marsadm-–help" + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +input{marsadm.help} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section + +\family typewriter +football.sh --help +\begin_inset CommandInset label +LatexCommand label +name "sec:football-–help" + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +input{football.help} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section + +\family typewriter +football.sh --help --verbose +\begin_inset CommandInset label +LatexCommand label +name "sec:football-help-verbose" + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +input{football-verbose.help} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section + +\family typewriter +screener.sh --help +\begin_inset CommandInset label +LatexCommand label +name "sec:screener–help" + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + +\begin_layout Plain Layout + + +\backslash +input{screener.help} +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section + +\family typewriter +screener.sh --help --verbose +\begin_inset CommandInset label +LatexCommand label +name "sec:screener-help-verbose" + +\end_inset + + +\end_layout + +\begin_layout Standard +\begin_inset ERT +status open + 
+\begin_layout Plain Layout + + +\backslash +input{screener-verbose.help} +\end_layout + +\end_inset + + \end_layout \begin_layout Chapter diff --git a/docu/marsadm.help b/docu/marsadm.help new file mode 100644 index 00000000..7cbacce9 --- /dev/null +++ b/docu/marsadm.help @@ -0,0 +1,570 @@ +\begin{verbatim} + +Thorough documentation is in mars-manual.pdf. Please use the PDF manual +as authoritative reference! Here is only a short summary of the most +important sub-commands / options: + +marsadm [] [ | all | ] +marsadm [] view[-] [ | all ] + + = + --force + Skip safety checks. + Use this only when you really know what you are doing! + Warning! This is dangerous! First try --dry-run. + Not combinable with 'all'. + --dry-run + Don't modify the symlink tree, but tell what would be done. + Use this before starting potentially harmful actions such as + 'delete-resource'. + --verbose + Increase verbosity of some commands. + --logger=/path/to/usr/bin/logger + Use an alternative syslog messenger. + When empty, disable syslogging. + --max-deletions= + When your network or your firewall rules are defective over a + longer time, too many deletion links may accumulate at + /mars/todo-global/delete-* and sibling locations. + This limit prevents overflow of the filesystem as well + as overloading the worker threads. + --thresh-logfiles= + --thresh-logsize= + Prevention of too many small logfiles when secondaries are not + catching up. When more than thresh-logfiles are already present, + the next one is only created when the last one has at least + size thresh-logsize (in units of GB). + --timeout= + Abort safety checks after timeout with an error. + When giving 'all' as resource argument, this works for each + resource independently. + --window= + Treat other cluster nodes as healthy when some communication has + occurred during the given time window. + --threshold= + Some macros like 'fetch-threshold-reached' use this for determining + their sloppiness. 
+ --host= + Act as if the command was running on cluster node . + Warning! This is dangerous! First try --dry-run + --backup-dir= + Only for experts. + Used by several special commands like merge-cluster, split-cluster + etc for creating backups of important data. + --ip= + Override the IP address stored in the symlink tree, as well as + the default IP determined from the list of network interfaces. + Usually you will need this only at 'create-cluster' or + 'join-cluster' for resolving ambiguities. + --ssh-port= + Override the default ssh port (22) for ssh and rsync. + Useful for running {join,merge}-cluster on non-standard ssh ports. + --ssh-opts="" + Override the default ssh commandline options. Also used for rsync. + --macro= + Handy for testing short macro evaluations at the command line. + + = + attach + usage: attach + Attaches the local disk (backing block device) to the resource. + The disk must have been previously configured at + {create,join}-resource. + When designated as a primary, /dev/mars/$res will also appear. + This does not change the state of {fetch,replay}. + For a complete local startup of the resource, use 'marsadm up'. + + cat + usage: cat + Print internal debug output in human readable form. + Numerical timestamps and numerical error codes are replaced + by more readable means. + Example: marsadm cat /mars/5.total.status + + connect + usage: connect + See resume-fetch-local. + + connect-global + usage: connect-global + Like resume-fetch-local, but affects all resource members + in the cluster (remotely). + + connect-local + usage: connect-local + See resume-fetch-local. + + create-cluster + usage: create-cluster (no parameters) + This must be called exactly once when creating a new cluster. + Don't call this again! Use join-cluster on the secondary nodes. + Please read the PDF manual for details. + + create-resource + usage: create-resource + (further syntax variants are described in the PDF manual). 
+ Create a new resource out of a pre-existing disk (backing + block device) /dev/lv/mydata (or similar). + The current node will start in primary role, thus + /dev/mars/ will appear after a short time, initially + showing the same contents as the underlying disk /dev/lv/mydata. + It is good practice to name the resource and the + disk name identical. + + cron + usage: cron (no parameters) + Do all necessary regular housekeeping tasks. + This is equivalent to log-rotate all; sleep 5; log-delete-all all. + + delete-resource + usage: delete-resource + CAUTION! This is dangerous when the network is somehow + interrupted, or when damaged nodes are later re-surrected + in any way. + + Precondition: the resource must no longer have any members + (see leave-resource). + This is only needed when you _insist_ on re-using a damaged + resource for re-creating a new one with exactly the same + old . + HINT: best practice is to not use this, but just create a _new_ + resource with a new out of your local disks. + Please read the PDF manual on potential consequences. + + detach + usage: detach + Detaches the local disk (backing block device) from the + MARS resource. + Caution! you may read data from the local disk afterwards, + but ensure that no data is written to it! + Otherwise, you are likely to produce harmful inconsistencies. + When running in primary role, /dev/mars/$res will also disappear. + This does not change the state of {fetch,replay}. + For a complete local shutdown of the resource, use 'marsadm down'. + + disconnect + usage: disconnect + See pause-fetch-local. + + disconnect-global + usage: disconnect-global + Like pause-fetch-local, but affects all resource members + in the cluster (remotely). + + disconnect-local + usage: disconnect-local + See pause-fetch-local. + + down + usage: down + Shortcut for detach + pause-sync + pause-fetch + pause-replay. 
+ + get-emergency-limit + usage: get-emergency-limit + Counterpart of set-emergency-limit (per-resource emergency limit) + + get-sync-limit-value + usage: get-sync-limit-value (no parameters) + For retrieval of the value set by set-sync-limit-value. + + get-systemd-unit + usage: get-systemd-unit + Show the system units (for start and stop), or empty when unset. + + invalidate + usage: invalidate + Only useful on a secondary node. + Forces MARS to consider the local replica disk as being + inconsistent, and therefore starting a fast full-sync from + the currently designated primary node (which must exist; + therefore avoid the 'secondary' command). + This is usually needed for resolving emergency mode. + When having k=2 replicas, this can be also used for + quick-and-simple split-brain resolution. + In other cases, or when the split-brain is not resolved by + this command, please use the 'leave-resource' / 'join-resource' + method as described in the PDF manual (in the right order as + described there). + + join-cluster + usage: join-cluster + Establishes a new cluster membership. + This must be called once on any new cluster member. + This is a prerequisite for join-resource. + + join-resource + usage: join-resource + (further syntax variants are described in the PDF manual). + The resource must have been already created on + another cluster node, and the network must be healthy. + The contents of the local replica disk /dev/lv/mydata will be + overwritten by the initial fast full sync from the currently + designated primary node. + After the initial full sync has finished, the current host will + act in secondary role. + For details on size constraints etc, refer to the PDF manual. + + leave-cluster + usage: leave-cluster (no parameters) + This can be used for final deconstruction of a cluster member. + Prior to this, all resources must have been left + via leave-resource. + Notice: this will never destroy the cluster UID on the /mars/ + filesystem. 
+ Please read the PDF manual for details. + + leave-resource + usage: leave-resource + Precondition: the local host must be in secondary role. + Stop being a member of the resource, and thus stop all + replication activities. The status of the underlying disk + will remain in its current state (whatever it is). + + log-delete + usage: log-delete + When possible, globally delete all old transaction logfiles which + are known to be superfluous, i.e. all secondaries no longer need + to replay them. + This must be regularly called by a cron job or similar, in order + to prevent overflow of the /mars/ directory. + For regular maintenance cron jobs, please prefer 'marsadm cron'. + For details and best practices, please refer to the PDF manual. + + log-delete-all + usage: log-delete-all + Alias for log-delete + + log-delete-one + usage: log-delete-one + When possible, globally delete at most one old transaction logfile + which is known to be superfluous, i.e. all secondaries no longer + need to replay it. + Hint: use this only for testing and manual inspection. + For regular maintenance cron jobs, please prefer cron + or log-delete-all. + + log-purge-all + usage: log-purge-all + This is potentially dangerous. + Use this only if you are really desperate in trying to resolve a + split brain. Use this only after reading the PDF manual! + + log-rotate + usage: log-rotate + Only useful at the primary side. + Start writing transaction logs into a new transaction logfile. + This should be regularly called by a cron job or similar. + For regular maintenance cron jobs, please prefer 'marsadm cron'. + For details and best practices, please refer to the PDF manual. + + lowlevel-delete-host + usage: lowlevel-delete-host + Delete cluster member. + + lowlevel-ls-host-ips + usage: lowlevel-ls-host-ips + List cluster member names and IP addresses. + + lowlevel-set-host-ip + usage: lowlevel-set-host-ip + Set IP for host.
+ + merge-cluster + usage: merge-cluster + Precondition: the resource names of both clusters must be disjoint. + Create the union of two clusters, consisting of the + union of all machines, and the union of all resources. + The members of each resource are _not_ changed by this. + This is useful for creating a big "virtual LVM cluster" where + resources can be almost arbitrarily migrated between machines via + later join-resource / leave-resource operations. + + merge-cluster-check + usage: merge-cluster-check + Check whether the resources of both clusters are disjoint. + Useful for checking in advance whether merge-cluster would be + possible. + + merge-cluster-list + usage: merge-cluster-list + Determine the local list of resources. + Useful for checking or analysis of merge-cluster disjointness by hand. + + pause-fetch + usage: pause-fetch + See pause-fetch-local. + + pause-fetch-global + usage: pause-fetch-global + Like pause-fetch-local, but affects all resource members + in the cluster (remotely). + + pause-fetch-local + usage: pause-fetch-local + Stop fetching transaction logfiles from the current + designated primary. + This is independent from any {pause,resume}-replay operations. + Only useful on a secondary node. + + pause-replay + usage: pause-replay + See pause-replay-local. + + pause-replay-global + usage: pause-replay-global + Like pause-replay-local, but affects all resource members + in the cluster (remotely). + + pause-replay-local + usage: pause-replay-local + Stop replaying transaction logfiles for now. + This is independent from any {pause,resume}-fetch operations. + This may be used for freezing the state of your replica for some + time, if you have enough space on /mars/. + Only useful on a secondary node. + + pause-sync + usage: pause-sync + See pause-sync-local. + + pause-sync-global + usage: pause-sync-global + Like pause-sync-local, but affects all resource members + in the cluster (remotely). 
+ + pause-sync-local + usage: pause-sync-local + Pause the initial data sync at current stage. + This has only an effect if a sync is actually running (i.e. + there is something to be actually synced). + Don't pause too long, because the local replica will remain + inconsistent during the pause. + Use this only for limited reduction of system load. + Only useful on a secondary node. + + primary + usage: primary + Promote the resource into primary role. + This is necessary for /dev/mars/$res to appear on the local host. + Notice: by concept there can be only _one_ designated primary + in a cluster at the same time. + The role change is automatically distributed to the other nodes + in the cluster, provided that the network is healthy. + The old primary node will _automatically_ go + into secondary role first. This is different from DRBD! + With MARS, you don't need an intermediate 'secondary' command + for switching roles. + It is usually better to _directly_ switch the primary roles + between both hosts. + When --force is not given, a planned handover is started: + the local host will only become actually primary _after_ the + old primary is gone, and all old transaction logs have been + fetched and replayed at the new designated primary. + When --force is given, no handover is attempted. As a consequence, + a split brain situation is likely to emerge. + Thus, use --force only after an ordinary handover attempt has + failed, and when you don't care about the split brain. + For more details, please refer to the PDF manual. + + resize + usage: resize + Prerequisite: all underlying disks (usually /dev/vg/$res) must + have been already increased, e.g. at the LVM layer (cf. lvresize). + Causes MARS to re-examine all sizing constraints on all members of + the resource, and increase the global logical size of the resource + accordingly. + Shrinking is currently not yet implemented. + When successful, /dev/mars/$res at the primary will be increased + in size.
In addition, all secondaries will start an incremental + fast full-sync to get the enlarged parts from the primary. + + resume-fetch + usage: resume-fetch + See resume-fetch-local. + + resume-fetch-global + usage: resume-fetch-global + Like resume-fetch-local, but affects all resource members + in the cluster (remotely). + + resume-fetch-local + usage: resume-fetch-local + Start fetching transaction logfiles from the current + designated primary node, if there is one. + This is independent from any {pause,resume}-replay operations. + Only useful on a secondary node. + + resume-replay + usage: resume-replay + See resume-replay-local. + + resume-replay-global + usage: resume-replay-global + Like resume-replay-local, but affects all resource members + in the cluster (remotely). + + resume-replay-local + usage: resume-replay-local + Restart replaying transaction logfiles, when there is some + data left. + This is independent from any {pause,resume}-fetch operations. + This should be used for unfreezing the state of your local replica. + Only useful on a secondary node. + + resume-sync + usage: resume-sync + See resume-sync-local. + + resume-sync-global + usage: resume-sync-global + Like resume-sync-local, but affects all resource members + in the cluster (remotely). + + resume-sync-local + usage: resume-sync-local + Resume any initial / incremental data sync at the stage where it + had been interrupted by pause-sync. + Only useful on a secondary node. + + secondary + usage: secondary + Promote all cluster members into secondary role, globally. + In contrast to DRBD, this is not needed as an intermediate step + for planned handover between an old and a new primary node. + The only reasonable usage is before the last leave-resource of the + last cluster member, immediately before leave-cluster is executed + for final deconstruction of the cluster. + In all other cases, please prefer 'primary' for direct handover + between cluster nodes. 
+ Notice: 'secondary' sets the global designated primary node + to '(none)' which in turn prevents the execution of 'invalidate' + or 'join-resource' or 'resize' anywhere in the cluster. + Therefore, don't unnecessarily give 'secondary'! + + set-emergency-limit + usage: set-emergency-limit + Set a per-resource emergency limit for disk space in /mars. + See PDF manual for details. + + set-sync-limit-value + usage: set-sync-limit-value + Set the maximum number of resources which should be syncing + concurrently. + + set-systemd-unit + usage: set-systemd-unit [] + This activates the systemd template engine of marsadm. + Please read mars-manual.pdf on this. + When is omitted, it will be treated equal to + . + + split-cluster + usage: split-cluster (no parameters) + NOT OFFICIALLY SUPPORTED - ONLY FOR EXPERTS. + RTFS = Read The Fucking Sourcecode. + Use this only if you know what you are doing. + + up + usage: up + Shortcut for attach + resume-sync + resume-fetch + resume-replay. + + wait-cluster + usage: wait-resource [] + Waits until a ping-pong communication has succeeded in the + whole cluster (or only the members of ). + NOTICE: this is extremely useful for avoiding races when scripting + in a cluster. + + wait-connect + usage: wait-connect [] + See wait-cluster. + + wait-resource + usage: wait-resource + [[attach|fetch|replay|sync][-on|-off]] + Wait until the given condition is met on the resource, locally. + + wait-umount + usage: wait-umount + Wait until /dev/mars/ has disappeared in the + cluster (even remotely). + Useful on both primary and secondary nodes.
+ + = name of resource or "all" for all resources + + + = | + + = + 1and1 + comminfo + commstate + cstate + default + default-global + diskstate + diskstate-1and1 + dstate + fetch-line + fetch-line-1and1 + flags + flags-1and1 + outdated-flags + outdated-flags-1and1 + primarynode + primarynode-1and1 + replay-line + replay-line-1and1 + replinfo + replinfo-1and1 + replstate + replstate-1and1 + resource-errors + resource-errors-1and1 + role + role-1and1 + state + status + sync-line + sync-line-1and1 + syncinfo + syncinfo-1and1 + todo-role + + + = + deletable-size + device-opened + errno-text + Convert errno numbers (positive or negative) into human readable text. + get-log-status + get-resource-{fat,err,wrn}{,-count} + get-{disk,device} + is-{alive} + is-{split-brain,consistent,emergency} + occupied-size + present-{disk,device} + (deprecated, use *-present instead) + replay-basenr + replay-code + When negative, this indicates that a replay/recovery error has occurred. + rest-space + summary-vector + systemd-unit + tree + uuid + wait-{is,todo}-{attach,sync,fetch,replay,primary}-{on,off} + {alive,fetch,replay,work}-{timestamp,age,lag} + {all,the}-{pretty-,}{global-,}{{err,wrn,inf}-,}msg + {cluster,resource}-members + {disk,device}-present + {disk,resource,device}-size + {fetch,replay,work}-{lognr,logcount} + {get,actual}-primary + {is,todo}-{attach,sync,fetch,replay,primary} + {my,all}-resources + {sync,fetch,replay,work,syncpos}-{size,pos} + {sync,fetch,replay,work}-{rest,{almost-,threshold-,}reached,percent,permille,vector} + {sync,fetch,replay}-{rate,remain} + {time,real-time} +\end{verbatim} diff --git a/docu/screener-verbose.help b/docu/screener-verbose.help new file mode 100644 index 00000000..64781be6 --- /dev/null +++ b/docu/screener-verbose.help @@ -0,0 +1,365 @@ +\begin{verbatim} +OVERRIDE verbose=1 +./screener.sh: Run _unattended_ processes in screen sessions. + Useful for MASS automation, running hundreds of unattended + commands in parallel.
+ HINT: for running more than ~500 sessions in parallel, you might need + some system tuning (e.g. rlimits, kernel patches etc) for creating + a huge number of file descriptors / sockets / etc. + ADVANTAGE: You may attach to individual screens, kill them, or continue + some waiting commands. + +Synopsis: + ./screener.sh --help [--verbose] + ./screener.sh list-running + ./screener.sh list-waiting + ./screener.sh list-failed + ./screener.sh list-critical + ./screener.sh list-serious + ./screener.sh list-done + ./screener.sh list + ./screener.sh list-screens + ./screener.sh run [] + ./screener.sh start + ./screener.sh [] + +Inquiry operations: + + ./screener.sh list-screens + Equivalent to screen -ls + + ./screener.sh list- + Show a list of currently running, waiting (for continuation), failed, + and done/completed screen sessions. + + ./screener.sh list + First show a list of currently running screens, then + for each a list of (old) failed / completed / sessions + (and so on). + + ./screener.sh status + Like list-*, but filter and don't report timestamps. + + ./screener.sh show + Show the last logfile of at standard output. + + ./screener.sh less + Show the last logfile of using "less -r". + +MASS starting of screen sessions: + + ./screener.sh run + Commands are launched in screen sessions via "./screener.sh start" commands, + unless the same is already running, + or is in some error state, or is already done (see below). + The commands are given by a column with CSV header name + containing "command", or by the first column. + The needs to be given by a column with CSV header + name matching "screen_id|resource". + The number and type of commands to launch can be reduced via + any combination of the following filter conditions: + + --max= + Limit the number of _new_ sessions additionally started this time. + + --== + Only select lines where an arbitrary CSV column (given by its + CSV header name in C identifier syntax) has the given value.
+ + --!= + Only select lines where the column has _not_ the given value. + + --=~ + Only select lines where the bash regular expression matches + at the given column. + + --max-per= + Limit the number per _distinct_ value of the column denoted by + the _next_ filter condition. + Example: ./screener.sh run test.csv --dry-run --max-per=2 --dst_network=~. + would launch only 2 Football processes per destination network. + + Hint: filter conditions can be easily checked by giving --dry-run. + +Start / restart / kill / continue screen sessions: + + ./screener.sh start + Start a new screen session, running arbitrary and + inside. + + ./screener.sh restart + Works only when the last command for failed. + This will restart the old and its as before. + Use only when you want to repeat the same command once again. + + ./screener.sh kill + Terminate the running screen session forcibly. + + ./screener.sh continue + ./screener.sh continue [] + ./screener.sh continue + Useful for MASS automation of processes involving critical sections + such as customer downtime. + When giving a numerical argument, up to that number + of sessions are resumed (ordered by age). + When no further argument is given, _all_ currently waiting sessions + are continued. + When --auto-attach is given, it will sequentially resume the + sessions to be continued. By default, unless --force_attach is set, + it uses "screen -r" skipping those sessions which are already + attached to somebody else. + This feature works only with prepared scripts which are creating + an empty flagfile + /home/schoebel/mars/mars-migration.git/screener-logdir-testing/running/$screen_id.waiting + whenever they want to wait for manual intervention (for whatever reason). + Afterwards, the script must be polling this flagfile for removal. + This screener operation simply removes the flagfile, such that + the script will then continue afterwards.
+ Example: look into ./football.sh + and search for occurrences of substring "call_hook start_wait". + + ./screener.sh wakeup + ./screener.sh wakeup [] + ./screener.sh wakeup + Similar to continue, but refers to delayed commands waiting for + a timeout. This can be used to individually shorten the timeout + period. + Example: Football cleanup operations may be artificially delayed + before doing "lvremove", to keep some sort of 'backup' for a + limited time. When your project is under time pressure, these + delays may be hindering. + Use this for premature ending of such artificial delays. + + ./screener.sh up <...> + Do both continue and wakeup. + + ./screener.sh auto <...> + Equivalent to ./screener.sh --auto-attach up <...> + Remember that only sessions without current attachment will be + attached to. + +Attach to a running session: + + ./screener.sh attach + This is equivalent to screen -x $screen_id + + ./screener.sh resume + This is equivalent to screen -r $screen_id + +Communication: + + ./screener.sh notify + May be called from external scripts to send emails etc. + +Locking (only when supported by ): + + ./screener.sh lock + ./screener.sh unlock + ./screener.sh lock + ./screener.sh unlock + +Cleanup / bookkeeping: + + ./screener.sh clear-critical + ./screener.sh clear-serious + ./screener.sh clear-failed + Mark the status as "done" and move the logfile away. + + ./screener.sh purge [] + This will remove all old logfiles which are older than + . By default, the variable $screener_log_purge_period + will be used, which is currently set to '30'. + + ./screener.sh cron + You should call this regularly from a user cron job, in order + to purge old logfiles, or to detect hanging sessions, or to + automatically send pending emails, etc. + +Options: + + --variable + --variable=$value + These must come first, in order to prevent mixup with + options of . + Allows overriding of any internal shell variable.
+ --help --verbose + Show all overridable shell variables, also for plugins. + + ## football_includes + # List of directories where screener-*.conf files can be found. + football_includes="${football_includes:-/usr/lib/mars/plugins /etc/mars/plugins $script_dir/plugins $HOME/.mars/plugins ./plugins}" + + ## title + # Used as a title for startup of screen sessions, and later for + # display at list-* + title="${title:-}" + + ## auto_attach + # Upon start or upon continue/wakeup/up, attach to the + # (newly created or existing) session. + auto_attach="${auto_attach:-0}" + + ## auto_attach_grace + # Before attaching, wait this time in seconds. + # The user may abort within this sleep time by + # pressing Ctrl-C. + auto_attach_grace="${auto_attach_grace:-10}" + + ## force_attach + # Use "screen -x" instead of "screen -r" allowing + # shared sessions between different users / end terminals. + force_attach="${force_attach:-0}" + + ## drop_shell + # When a fails, the screen session will not be terminated immediately. + # Instead, an interactive bash is started, so can later attach and + # rectify any problems. + # WARNING! only activate this if you regularly check for failed sessions + # and then manually attach to them. Don't use this when running hundreds + # or thousands in parallel. + drop_shell="${drop_shell:-0}" + + ## session_timeout + # Detect hanging sessions when they don't produce any output anymore + # for a longer time. Hanging sessions are then marked as failed or critical. + session_timeout="${session_timeout:-$(( 3600 * 3 ))}" # seconds + + ## screener_logdir or logdir + # Where the logfiles and all status information is kept. + export screener_logdir="${screener_logdir:-${logdir:-$HOME/screener-logs}}" + + ## screener_command_log + # This logfile will accumulate all relevant $0 command invocations, + # including timestamps and ssh agent identities. + # To switch off, use /dev/null here.
+ screener_command_log="${screener_command_log:-$screener_logdir/commands.log}" + + ## screener_cron_log + # Since "$0 cron" works silently, you won't notice any errors. + # This logfile gives you a chance for checking any problems. + screener_cron_log="${screener_cron_log:-$screener_logdir/cron.log}" + + ## screener_log_purge_period + # $0 cron or $0 purge will automatically remove all old logfiles + # from $screener_logdir/*/ when this period is exceeded. + screener_log_purge_period="${screener_log_purge_period:-30}" # Days + + ## dry_run + # Don't actually start screen sessions when set. + dry_run="${dry_run:-0}" + + ## verbose + # increase speakiness. + verbose=${verbose:-0} + + ## debug + # Some additional debug messages. + debug="${debug:-0}" + + ## sleep + # Workaround races by keeping sessions open for a few seconds. + # This is useful for debugging of immediate script failures. + # You have some short time window for attaching. + # HINT: instead, just inspect the logfiles in $screener_logdir/*/*.log + sleep="${sleep:-3}" + + ## screen_cmd + # Customize the screen command (e.g. add some further options, etc). + screen_cmd="${screen_cmd:-screen}" + + ## use_screenlog + # Add the -L option. Not really useful when running thousands of + # parallel screen sessions, because the automatically generated filenames + # are crap, and cannot be set in advance. + # Useful for basic debugging of setup problems etc. + use_screenlog="${use_screenlog:-0}" + + ## waiting_txt and delay_txt + # RTFS Don't use this, unless you know what you are doing. + waiting_txt="${waiting_txt:-SCREENER_waiting_WAIT}" + delayed_txt="${delayed_txt:-SCREENER_delayed_WAIT}" + + ## critical_status + # This is the "magic" exit code indicating _criticality_ + # of a failed command. + critical_status="${critical_status:-199}" + + ## serious_status + # This is the "magic" exit code indicating _seriousness_ + # of a failed command.
+ serious_status="${serious_status:-198}" + + ## less_cmd + # Used at $0 less $id + less_cmd="${less_cmd:-less -r}" + + ## date_format + # Here you can customize the appearance of list-* commands + date_format="${date_format:-%Y-%m-%d %H:%M}" + + ## csv_delimit + # The delimiter used for CSV file parsing + csv_delim="${csv_delim:-;}" + + ## csv_cmd_fields + # Regex telling the field name for 'cmd' + csv_cmd_fields="${csv_cmd_fields:-command}" + + ## csv_id_fields + # Regex telling the field name for 'screen_id' + csv_id_fields="${csv_id_fields:-screen_id|resource}" + + ## csv_remove + # Regex for global removal of command options + csv_remove="${csv_remove:---screener}" + + ## user_name + # Normally automatically derived from ssh agent or from $LOGNAME. + # Please override this only when really necessary. + export user_name="${user_name:-$(ssh-add -l | grep -o '[^ ]+@[^ ]+' | sort -u | tail -1)}" + export user_name="${user_name:-$LOGNAME}" + + ## tmp_dir and tmp_stub + # Where temporary files are residing + tmp_dir="${tmp_dir:-/tmp}" + tmp_stub="${tmp_stub:-$tmp_dir/screener.$$}" + +Running hook: email_describe_plugin + +PLUGIN screener-email + + Generic plugin for sending emails (or SMS via gateways) + upon status changes, such as script failures. + + ## email_* + # List of email addresses. + # Empty = don't send emails. + email_critical="${email_critical:-}" + email_serious="${email_serious:-}" + email_failed="${email_failed:-}" + email_warning="${email_warning:-}" + email_waiting="${email_waiting:-}" + email_done="${email_done:-}" + + ## sms_* + # List of email addresses of SMS gateways. + # These may be distinct from email_*. + # Empty = don't send sms. + sms_critical="${sms_critical:-}" + sms_serious="${sms_serious:-}" + sms_failed="${sms_failed:-}" + sms_warning="${sms_warning:-}" + sms_waiting="${sms_waiting:-}" + sms_done="${sms_done:-}" + + ## email_cmd + # Command for email sending. + # Please include your gateways etc here. 
+ email_cmd="${email_cmd:-mailx -S smtp=mx.nowhere.org:587 -S smpt-auth-user=test}" + + ## email_logfiles + # Whether to include logfiles in the body. + # Not used for sms_*. + email_logfiles="${email_logfiles:-1}" + +\end{verbatim} diff --git a/docu/screener.help b/docu/screener.help new file mode 100644 index 00000000..544aa264 --- /dev/null +++ b/docu/screener.help @@ -0,0 +1,193 @@ +\begin{verbatim} +./screener.sh: Run _unattended_ processes in screen sessions. + Useful for MASS automation, running hundreds of unattended + commands in parallel. + HINT: for running more than ~500 sessions in parallel, you might need + some system tuning (e.g. rlimits, kernel patches etc) for creating + a huge number of file descriptors / sockets / etc. + ADVANTAGE: You may attach to individual screens, kill them, or continue + some waiting commands. + +Synopsis: + ./screener.sh --help [--verbose] + ./screener.sh list-running + ./screener.sh list-waiting + ./screener.sh list-failed + ./screener.sh list-critical + ./screener.sh list-serious + ./screener.sh list-done + ./screener.sh list + ./screener.sh list-screens + ./screener.sh run [] + ./screener.sh start + ./screener.sh [] + +Inquiry operations: + + ./screener.sh list-screens + Equivalent to screen -ls + + ./screener.sh list- + Show a list of currently running, waiting (for continuation), failed, + and done/completed screen sessions. + + ./screener.sh list + First show a list of currently running screens, then + for each a list of (old) failed / completed / sessions + (and so on). + + ./screener.sh status + Like list-*, but filter and don't report timestamps. + + ./screener.sh show + Show the last logfile of at standard output. + + ./screener.sh less + Show the last logfile of using "less -r". + +MASS starting of screen sessions: + + ./screener.sh run + Commands are launched in screen sessions via "./screener.sh start" commands, + unless the same is already running, + or is in some error state, or is already done (see below).
+ The commands are given by a column with CSV header name + containing "command", or by the first column. + The needs to be given by a column with CSV header + name matching "screen_id|resource". + The number and type of commands to launch can be reduced via + any combination of the following filter conditions: + + --max= + Limit the number of _new_ sessions additionally started this time. + + --== + Only select lines where an arbitrary CSV column (given by its + CSV header name in C identifier syntax) has the given value. + + --!= + Only select lines where the column has _not_ the given value. + + --=~ + Only select lines where the bash regular expression matches + at the given column. + + --max-per= + Limit the number per _distinct_ value of the column denoted by + the _next_ filter condition. + Example: ./screener.sh run test.csv --dry-run --max-per=2 --dst_network=~. + would launch only 2 Football processes per destination network. + + Hint: filter conditions can be easily checked by giving --dry-run. + +Start / restart / kill / continue screen sessions: + + ./screener.sh start + Start a new screen session, running arbitrary and + inside. + + ./screener.sh restart + Works only when the last command for failed. + This will restart the old and its as before. + Use only when you want to repeat the same command once again. + + ./screener.sh kill + Terminate the running screen session forcibly. + + ./screener.sh continue + ./screener.sh continue [] + ./screener.sh continue + Useful for MASS automation of processes involving critical sections + such as customer downtime. + When giving a numerical argument, up to that number + of sessions are resumed (ordered by age). + When no further argument is given, _all_ currently waiting sessions + are continued. + When --auto-attach is given, it will sequentially resume the + sessions to be continued.
By default, unless --force_attach is set, + it uses "screen -r" skipping those sessions which are already + attached to somebody else. + This feature works only with prepared scripts which are creating + an empty flagfile + /home/schoebel/mars/mars-migration.git/screener-logdir-testing/running/$screen_id.waiting + whenever they want to wait for manual intervention (for whatever reason). + Afterwards, the script must be polling this flagfile for removal. + This screener operation simply removes the flagfile, such that + the script will then continue afterwards. + Example: look into ./football.sh + and search for occurrences of substring "call_hook start_wait". + + ./screener.sh wakeup + ./screener.sh wakeup [] + ./screener.sh wakeup + Similar to continue, but refers to delayed commands waiting for + a timeout. This can be used to individually shorten the timeout + period. + Example: Football cleanup operations may be artificially delayed + before doing "lvremove", to keep some sort of 'backup' for a + limited time. When your project is under time pressure, these + delays may be hindering. + Use this for premature ending of such artificial delays. + + ./screener.sh up <...> + Do both continue and wakeup. + + ./screener.sh auto <...> + Equivalent to ./screener.sh --auto-attach up <...> + Remember that only sessions without current attachment will be + attached to. + +Attach to a running session: + + ./screener.sh attach + This is equivalent to screen -x $screen_id + + ./screener.sh resume + This is equivalent to screen -r $screen_id + +Communication: + + ./screener.sh notify + May be called from external scripts to send emails etc. + +Locking (only when supported by ): + + ./screener.sh lock + ./screener.sh unlock + ./screener.sh lock + ./screener.sh unlock + +Cleanup / bookkeeping: + + ./screener.sh clear-critical + ./screener.sh clear-serious + ./screener.sh clear-failed + Mark the status as "done" and move the logfile away.
+ + ./screener.sh purge [] + This will remove all old logfiles which are older than + . By default, the variable $screener_log_purge_period + will be used, which is currently set to '30'. + + ./screener.sh cron + You should call this regularly from a user cron job, in order + to purge old logfiles, or to detect hanging sessions, or to + automatically send pending emails, etc. + +Options: + + --variable + --variable=$value + These must come first, in order to prevent mixup with + options of . + Allows overriding of any internal shell variable. + --help --verbose + Show all overridable shell variables, also for plugins. + + +PLUGIN screener-email + + Generic plugin for sending emails (or SMS via gateways) + upon status changes, such as script failures. + +\end{verbatim}
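The flagfile handshake described under "./screener.sh continue" can be sketched as follows. This is only a minimal illustration, not the code used by ./football.sh or ./screener.sh: the function names wait_for_continue and signal_continue, the variable flag_dir, and its default directory are all hypothetical. Only the "$screen_id.waiting" naming and the create / poll / remove sequence are taken from the help text above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the flagfile handshake; all names are assumptions.

# Directory holding the flagfiles (assumed default, for illustration only).
flag_dir="${flag_dir:-/tmp/screener-demo/running}"

# Script side: create an empty flagfile, then poll until it has been removed.
# Arguments: screen_id, optional poll interval in seconds (default 1).
wait_for_continue() {
    local screen_id="$1"
    local interval="${2:-1}"
    local flagfile="$flag_dir/$screen_id.waiting"

    mkdir -p "$flag_dir" || return 1
    : > "$flagfile" || return 1
    while [ -e "$flagfile" ]; do
        sleep "$interval"
    done
}

# Operator side: in this sketch, "continue" is just removing the flagfile.
signal_continue() {
    local screen_id="$1"
    rm -f "$flag_dir/$screen_id.waiting"
}
```

A prepared script would call wait_for_continue "$screen_id" right before its critical section. Since the handshake is nothing more than the removal of a file, a waiting session can also be inspected with ordinary tools such as ls.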