doc: appendix --help commands

Thomas Schoebel-Theuer 2018-04-23 15:25:32 +02:00
parent 63be004b9e
commit c059daa99d
7 changed files with 2056 additions and 0 deletions

docu/football-verbose.help Normal file

@@ -0,0 +1,574 @@
\begin{verbatim}
verbose=1
Usage:
./football.sh --help [--verbose]
Show help
./football.sh --variable=<value>
Override any shell variable
Actions for resource migration:
./football.sh migrate <resource> <target_primary> [<target_secondary>]
Run the sequence
migrate_prepare ; migrate_wait ; migrate_finish ; migrate_cleanup.
./football.sh migrate_prepare <resource> <target_primary> [<target_secondary>]
Allocate LVM space at the targets and start MARS replication.
./football.sh migrate_wait <resource> <target_primary> [<target_secondary>]
Wait until MARS replication reports UpToDate.
./football.sh migrate_finish <resource> <target_primary> [<target_secondary>]
Call hooks for handover to the targets.
./football.sh migrate_cleanup <resource>
Remove old / currently unused LV replicas from MARS and deallocate
from LVM.
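Example (hypothetical resource and host names), running the single
steps of a migration of res01 to newhost1 / newhost2:
./football.sh migrate_prepare res01 newhost1 newhost2
./football.sh migrate_wait res01 newhost1 newhost2
./football.sh migrate_finish res01 newhost1 newhost2
./football.sh migrate_cleanup res01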
Actions for (manual) repair in emergency situations:
./football.sh manual_migrate_config <resource> <target_primary> [<target_secondary>]
Transfer only the cluster config, without changing the MARS replicas.
This does no resource stopping / restarting.
Useful for reverting a failed migration.
./football.sh manual_config_update <hostname>
Only update the cluster config, without changing anything else.
Useful for manual repair of failed migration.
./football.sh manual_merge_cluster <hostname1> <hostname2>
Run "marsadm merge-cluster" for the given hosts.
Hostnames must be from different (former) clusters.
./football.sh manual_split_cluster <hostname_list>
Run "marsadm split-cluster" at the given hosts.
Useful for fixing failed / asymmetric splits.
Hint: provide _all_ hostnames which have formerly participated
in the cluster.
./football.sh repair_vm <resource> <primary_candidate_list>
Try to restart the VM <resource> on one of the given machines.
Useful during unexpected customer downtime.
./football.sh repair_mars <resource> <primary_candidate_list>
Before restarting the VM like in repair_vm, try to find a local
LV where a stand-alone MARS resource can be found and built up.
Use this only when the MARS resources are gone, and when you are
desperate. Problem: this will likely create a MARS setup which is
not usable for production, and therefore must be corrected later
by hand. Use this only during an emergency situation in order to
get the customers online again, while buying the downsides of this
command.
Actions for inplace FS shrinking:
./football.sh shrink <resource> <percent>
Run the sequence shrink_prepare ; shrink_finish ; shrink_cleanup.
./football.sh shrink_prepare <resource> [<percent>]
Allocate temporary LVM space (when possible) and create initial
raw FS copy.
Default percent value (when left out) is 85.
./football.sh shrink_finish <resource>
Incrementally update the FS copy, swap old <=> new copy with
small downtime.
./football.sh shrink_cleanup <resource>
Remove old FS copy from LVM.
Actions for inplace FS extension:
./football.sh extend <resource> <percent>
Combined actions:
./football.sh migrate+shrink <resource> <target_primary> [<target_secondary>] [<percent>]
Similar to migrate ; shrink but produces less network traffic.
Default percent value (when left out) is 85.
./football.sh migrate+shrink+back <resource> <tmp_primary> [<percent>]
Migrate temporarily to <tmp_primary>, then shrink there,
finally migrate back to old primary and secondaries.
Default percent value (when left out) is 85.
Global maintenance:
./football.sh lv_cleanup <resource>
General features:
- Instead of <percent>, an absolute amount of storage with suffix
'k' or 'm' or 'g' can be given.
- When <resource> is currently stopped, login to the container is
not possible, and in turn the hypervisor node and primary storage node
cannot be automatically determined. In such a case, the missing
nodes can be specified via the syntax
<resource>:<hypervisor>:<primary_storage>
- The following LV suffixes are used (naming convention):
-tmp = currently emerging version for shrinking
-preshrink = old version before shrinking took place
- By adding the option --screener, you can hand over Football execution
to ./screener.sh .
When some --enable_*_waiting is also added, then the critical
sections involving customer downtime are temporarily halted until
some sysadmin says "screener.sh continue $resource" or
attaches to the session and presses the RETURN key.
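Examples of the above features (all names hypothetical):
./football.sh shrink res01 500g
(absolute size instead of a percentage)
./football.sh migrate res01:hyper07:store07 newhost1
(stopped resource, hypervisor and primary storage given explicitly)
./football.sh migrate res01 newhost1 --screener
(hand over execution to ./screener.sh)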
## football_includes
# List of directories where football-*.conf files can be found.
football_includes="${football_includes:-/usr/lib/mars/plugins /etc/mars/plugins $script_dir/plugins $HOME/.mars/plugins ./plugins}"
## dry_run
# When set, actions are only simulated.
dry_run=${dry_run:-0}
## verbose
# Increase verbosity.
verbose=${verbose:-0}
## confirm
# Only for debugging: manually started operations can be
# manually checked and confirmed before actually starting operations.
confirm=${confirm:-1}
## force
# Normally, shrinking and extending will only be started if there
# is something to do.
# Enable this for debugging and testing: the check is then skipped.
force=${force:-0}
## debug_injection_point
# RTFS. Don't set this unless you are a developer who knows what you are doing.
debug_injection_point="${debug_injection_point:-0}"
## football_logdir
# Where the logfiles should be created.
# HINT: after playing Football in masses for a while, your $logdir will
# be easily populated with hundreds or thousands of logfiles.
# Set this to your convenience.
football_logdir="${football_logdir:-${logdir:-$HOME/football-logs}}"
## screener
# When enabled, handover execution to the screener.
# Very useful for running Football in masses.
screener="${screener:-0}"
## min_space
# When testing / debugging with extremely small LVs, it may happen
# that mkfs refuses to create extremely small filesystems.
# Use this to ensure a minimum size.
min_space="${min_space:-20000000}"
## cache_repeat_lapse
# When using the waiting capabilities of screener, and when waits
# are lasting very long, your dentry cache may become cold.
# Use this for repeated refreshes of the dentry cache after some time.
cache_repeat_lapse="${cache_repeat_lapse:-120}" # Minutes
## ssh_opt
# Useful for customization to your ssh environment.
ssh_opt="${ssh_opt:--4 -A -o StrictHostKeyChecking=no -o ForwardX11=no -o KbdInteractiveAuthentication=no -o VerifyHostKeyDNS=no}"
## rsync_opt
# The rsync options in general.
# IMPORTANT: some intermediate progress report is absolutely needed,
# because otherwise a false-positive TIMEOUT may be assumed when
# no output is generated for several hours.
rsync_opt="${rsync_opt:- -aSH --info=progress2,STATS}"
## rsync_opt_prepare
# Additional rsync options for preparation and updating
# of the temporary shrink mirror filesystem.
rsync_opt_prepare="${rsync_opt_prepare:---exclude='.filemon2' --delete}"
## rsync_nice
# Typically, the preparation steps are run with background priority.
rsync_nice="${rsync_nice:-nice -19}"
## rsync_repeat_prepare and rsync_repeat_hot
# Tuning: increases the reliability of rsync and ensures that the dentry cache
# remains hot.
rsync_repeat_prepare="${rsync_repeat_prepare:-5}"
rsync_repeat_hot="${rsync_repeat_hot:-3}"
## wait_timeout
# Avoid infinite loops upon waiting.
wait_timeout="${wait_timeout:-$(( 24 * 60 ))}" # Minutes
## lvremove_opt
# Some LVM versions are requiring this for unattended batch operations.
lvremove_opt="${lvremove_opt:--f}"
## critical_status
# This is the "magic" exit code indicating _criticality_
# of a failed command.
critical_status="${critical_status:-199}"
## serious_status
# This is the "magic" exit code indicating _seriousness_
# of a failed command.
serious_status="${serious_status:-198}"
## pre_hand or --pre-hand=
# Set this to do an ordinary handover to a new start position before doing
# anything else. This may be used for handover to a different datacenter
# and running Football there.
pre_hand="${pre_hand:-}"
## startup_when_locked
# When == 0:
# Don't abort and don't wait when a lock is detected at startup.
# When == 1 and when enable_startup_waiting=1:
# Wait until the lock is gone.
# When == 2:
# Abort start of script execution when a lock is detected.
# Later, when locks are set _during_ execution, they will
# be obeyed when enable_*_waiting is set (instead), and will
# lead to waits instead of aborts.
startup_when_locked="${startup_when_locked:-1}"
## user_name
# Normally automatically derived from ssh agent or from $LOGNAME.
# Please override this only when really necessary.
export user_name="${user_name:-$(get_real_ssh_user)}"
export user_name="${user_name:-$LOGNAME}"
PLUGIN football-cm3
1&1-specific plugin for dealing with the cm3 cluster manager
and its concrete operating environment (singleton instance).
Current maximum cluster size limit:
Maximum #syncs running before migration can start:
Following marsadm --version must be installed:
Following mars kernel modules must be loaded:
## enable_cm3
# ShaHoLin-specific plugin for working with the infong platform
# (istore, icpu, infong) via 1&1-specific clustermanager cm3
# and related toolsets. Much of it is bound to a singleton database
# instance (clustermw & siblings).
enable_cm3="${enable_cm3:-$(if [[ "$0" =~ tetris ]]; then echo 1; else echo 0; fi)}"
## skip_resource_ping
# Enable this only for testing. Normally, a resource name denotes a
# container name == machine name which must be running as a precondition,
# and thus must be pingable over the network.
skip_resource_ping="${skip_resource_ping:-0}"
## date_lock
# Don't enter critical sections at certain days of the week,
# and/or during certain hours.
# This is a regex matching against "date +%u_%H"
date_lock="${date_lock:-}"
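# Hypothetical example: block critical sections on weekends entirely,
# and on weekdays during business hours (09:00 - 17:59):
#   date_lock="^([67]_|[1-5]_(09|1[0-7]))"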
## workaround_firewall
# Documentation of technical debt for later generations:
# This is needed since July 2017. In the many years before, no firewalling
# was effective at the replication network, because it is a physically
# separate network from the rest of the networking infrastructure.
# An attacker would first need to gain root access to the _hypervisor_
# (not only to the LXC container and/or to KVM) before gaining access to
# those physical replication network interfaces.
# Since about that time, which is about the same time when the requirements
# for Container Football had been communicated, somebody introduced some
# unnecessary firewall rules, based on "security arguments".
# These arguments were however explicitly _not_ required by the _real_
# security responsible person, and explicitly _not_ recommended by him.
# Now the problem is that it is almost politically impossible to get
# rid of such a "security feature".
# Until the problem is resolved, Container Football requires
# the _entire_ local firewall to be _temporarily_ shut down in order to
# allow marsadm commands over ssh to work.
# Notice: this is _not_ increasing the general security in any way.
# LONGTERM solution / TODO: future versions of mars should no longer
# depend on ssh.
# Then this "feature" can be turned off.
workaround_firewall="${workaround_firewall:-1}"
## ip_magic
# Similarly to workaround_firewall, this is needed since somebody
# introduced additional firewall rules also disabling sysadmin ssh
# connections at the _ordinary_ sysadmin network.
ip_magic="${ip_magic:-1}"
## do_split_cluster
# The current MARS branch 0.1a.y is not yet constructed for forming
# a BigCluster consisting of several thousands of machines.
# When a future version of mars 0.1b.y (or 0.2.y) allows this,
# it can be disabled.
do_split_cluster="${do_split_cluster:-1}"
## clustertool_host
# URL prefix of the internal configuration database REST interface.
clustertool_host="${clustertool_host:-http://clustermw:3042}"
## clustertool_user
# Username for clustertool access.
# By default, scans for a *.password file (see next option).
clustertool_user="${clustertool_user:-$(shopt -u nullglob; ls *.password | head -1 | cut -d. -f1)}" || echo "cannot find a password file *.password for clustermw: you MUST supply the credentials via default curl config files (see man page)"
## clustertool_passwd
# Here you can supply the encrypted password.
# By default, a file $clustertool_user.password is used
# containing the encrypted password.
clustertool_passwd="${clustertool_passwd:-$([[ -r $clustertool_user.password ]] && cat $clustertool_user.password)}"
## do_migrate
# Keep this enabled. Only disable for testing.
do_migrate="${do_migrate:-1}" # must be enabled; disable for dry-run testing
## always_migrate
# Only use this for testing, or for special situations.
# This skips the check of whether the resource has already been migrated.
always_migrate="${always_migrate:-0}" # only enable for testing
## check_segments
# 0 = disabled
# 1 = only display the segment names
# 2 = check for equality
# WORKAROUND, potentially harmful when used inadequately.
# The historical physical segment borders need to be removed for
# Container Football.
# Unfortunately, the subproject aiming to accomplish this has not
# progressed for a year now. In the meantime, Container Football can
# be only played within the ancient segment borders.
# After this big impediment is eventually resolved, this option
# should be switched off.
check_segments="${check_segments:-1}"
## backup_dir
# Directory for keeping JSON backups of clustermw.
backup_dir="${backup_dir:-.}"
## enable_mod_deflate
# Internal, for support.
enable_mod_deflate="${enable_mod_deflate:-1}"
## enable_segment_move
# Seems to be needed by some other tooling.
enable_segment_move="${enable_segment_move:-1}"
## override_hwclass_id
# When necessary, override this from $include_dir/plugins/*.conf
override_hwclass_id="${override_hwclass_id:-25007}"
## override_hvt_id
# When necessary, override this from $include_dir/plugins/*.conf
override_hvt_id="${override_hvt_id:-8059}"
## iqn_base and iet_type and iscsi_eth and iscsi_tid
# Workaround: this is needed for _dynamic_ generation of iSCSI sessions
# bypassing the ordinary ones as automatically generated by the
# cm3 cluster manager (only at the old istore architecture).
# Notice: not needed for regular operations, only for testing.
# Normally, you don't want to shrink over a _shared_ 1MBit iSCSI line.
iqn_base="${iqn_base:-iqn.2000-01.info.test:test}"
iet_type="${iet_type:-blockio}"
iscsi_eth="${iscsi_eth:-eth1}"
iscsi_tid="${iscsi_tid:-4711}"
## monitis_downtime_script
# ShaHoLin-internal
monitis_downtime_script="${monitis_downtime_script:-}"
## monitis_downtime_duration
# ShaHoLin-internal
monitis_downtime_duration="${monitis_downtime_duration:-20}" # Minutes
## shaholin_finished_log
# ShaHoLin-specific logfile, reporting _only_ successful completion
# of an action.
shaholin_finished_log="${shaholin_finished_log:-$football_logdir/shaholin-finished.log}"
## ticket
# OPTIONAL: the meaning is ShaHoLin specific.
# This can be used for updating JIRA tickets.
# Can be set on the command line like "./tetris.sh $args --ticket=TECCM-4711"
ticket="${ticket:-}"
## ticket_get_cmd
# Optional: when set, this script can be used for retrieving ticket IDs
# in place of commandline option --ticket=
ticket_get_cmd="${ticket_get_cmd:-}"
## ticket_update_cmd
# This can be used for calling an external command which updates
# the ticket(s) given by the $ticket parameter.
ticket_update_cmd="${ticket_update_cmd:-}"
## shaholin_action
# OPTIONAL: specific action script with parameters.
shaholin_action="${shaholin_action:-}"
PLUGIN football-basic
Generic driver for systemd-controlled MARS pools.
The current version supports only a flat model:
(1) There is a single "big cluster" at metadata level.
All cluster members are joined via merge-cluster.
All occurring names need to be globally unique.
(2) The network uses BGP or other means, thus any hypervisor
can (potentially) start any VM at any time.
(3) iSCSI or remote devices are not supported for now
(LocalSharding model). This may be extended in a future
release.
This plugin is exclusive-or with cm3.
Plugin specific actions:
./football.sh basic_add_host <hostname>
Manually add another host to the hostname cache.
## pool_cache_dir
# Directory for caching the pool status.
pool_cache_dir="${pool_cache_dir:-$script_dir/pool-cache}"
## initial_hostname_file
# This file must contain a list of storage and/or hypervisor hostnames
# where a /mars directory must exist.
# These hosts are then scanned for further cluster members,
# and the transitive closure of all host names is computed.
initial_hostname_file="${initial_hostname_file:-./hostnames.input}"
## hostname_cache
# This file contains the transitive closure of all host names.
hostname_cache="${hostname_cache:-$pool_cache_dir/hostnames.cache}"
## resources_cache
# This file contains the transitive closure of all resource names.
resources_cache="${resources_cache:-$pool_cache_dir/resources.cache}"
## res2hyper_cache
# This file contains the association between resources and hypervisors.
res2hyper_cache="${res2hyper_cache:-$pool_cache_dir/res2hyper.assoc}"
## enable_basic
# This plugin is exclusive-or with cm3.
enable_basic="${enable_basic:-$(if [[ "$0" =~ football ]]; then echo 1; else echo 0; fi)}"
## ssh_port
# Set this for separating sysadmin access from customer access
ssh_port="${ssh_port:-}"
## basic_mnt_dir
# Names the mountpoint directory at hypervisors.
# This must co-incide with the systemd mountpoints.
basic_mnt_dir="${basic_mnt_dir:-/mnt}"
PLUGIN football-motd
Generic plugin for motd. Communicate that Football is running
at login via motd.
## enable_motd
# whether to use the motd plugin.
enable_motd="${enable_motd:-0}"
## update_motd_cmd
# Distro-specific command for generating motd from several sources.
# Only tested for Debian Jessie at the moment.
update_motd_cmd="${update_motd_cmd:-update-motd}"
## download_motd_script and motd_script_dir
# When no script has been installed into /etc/update-motd.d/
# you can do it dynamically here, bypassing any "official" deployment
# methods. Use this only for testing!
# An example script (which should be deployed via your ordinary methods)
# can be found under $script_dir/update-motd.d/67-football-running
download_motd_script="${download_motd_script:-}"
motd_script_dir="${motd_script_dir:-/etc/update-motd.d}"
## motd_file
# This will contain the reported motd message.
# It is created by this plugin.
motd_file="${motd_file:-/var/motd/football.txt}"
## motd_color_on and motd_color_off
# ANSI escape sequences for coloring the generated motd message.
motd_color_on="${motd_color_on:-\\033[31m}"
motd_color_off="${motd_color_off:-\\033[0m}"
PLUGIN football-report
Generic plugin for communication of reports.
## report_cmd_{start,warning,failed,finished}
# External command which is called at start / failure / finish
# of Football.
# The following variables can be used (e.g. as parameters) when
# escaped with a backslash:
# $res = name of the resource (LV, container, etc)
# $primary = the current (old)
# $secondary_list = list of current (old) secondaries
# $target_primary = the target primary name
# $target_secondary = list of target secondaries
# $operation = the operation name
# $target_percent = the value used for shrinking
# $txt = some informative text from Football
# Further variables are possible by looking at the sourcecode, or by
# defining your own variables or functions externally or via plugins.
# Empty = don't do anything
report_cmd_start="${report_cmd_start:-}"
report_cmd_warning="${report_cmd_warning:-$script_dir/screener.sh notify "$res" warning "$txt"}"
report_cmd_failed="${report_cmd_failed:-}"
report_cmd_finished="${report_cmd_finished:-}"
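# Hypothetical example (the script notify-mail.sh is not part of Football):
# the backslash-escaped variables are expanded later by Football itself:
#   report_cmd_finished="$script_dir/notify-mail.sh admin@example.org 'Football \$operation for \$res finished'"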
PLUGIN football-waiting
Generic plugin, interfacing with screener: when this is used
by your script and enabled, then you will be able to wait for
"screener.sh continue" operations at certain points in your
script.
## enable_*_waiting
#
# When this is enabled, and when Football had been started by screener,
# then football will delay the start of several operations until a sysadmin
# does one of the following manually:
#
# a) ./screener.sh continue $session
# b) ./screener.sh resume $session
# c) ./screener.sh attach $session and press the RETURN key
# d) doing nothing, and $wait_timeout has been exceeded
#
# CONVENTION: football resource names are used as screener session ids.
# This ensures that only 1 operation can be started for the same resource,
# and it simplifies the handling for junior sysadmins.
#
enable_startup_waiting="${enable_startup_waiting:-0}"
enable_handover_waiting="${enable_handover_waiting:-0}"
enable_migrate_waiting="${enable_migrate_waiting:-0}"
enable_shrink_waiting="${enable_shrink_waiting:-0}"
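# Hypothetical example: hand over to screener and halt before the
# handover step until "screener.sh continue res01" is given:
#   ./football.sh migrate res01 newhost1 --screener --enable_handover_waiting=1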
## enable_cleanup_delayed and wait_before_cleanup
# By setting this, you can delay the cleanup operations for some time.
# This way, you are keeping the old LV contents as a kind of "backup"
# for some limited time.
# HINT: don't set wait_before_cleanup to large values, because it can
# seriously slow down Football.
enable_cleanup_delayed="${enable_cleanup_delayed:-0}"
wait_before_cleanup="${wait_before_cleanup:-180}" # Minutes
## reduce_wait_msg
# Instead of reporting the waiting status once per minute,
# decrease the frequency of reporting.
# Warning: don't increase this too much. Do not exceed
# session_timeout/2 from screener. Because of the Nyquist criterion,
# stay on the safe side by setting session_timeout to at least _twice_
# the value given here.
reduce_wait_msg="${reduce_wait_msg:-60}" # Minutes
\end{verbatim}

docu/football.help Normal file

@@ -0,0 +1,173 @@
\begin{verbatim}
Usage:
./football.sh --help [--verbose]
Show help
./football.sh --variable=<value>
Override any shell variable
Actions for resource migration:
./football.sh migrate <resource> <target_primary> [<target_secondary>]
Run the sequence
migrate_prepare ; migrate_wait ; migrate_finish ; migrate_cleanup.
./football.sh migrate_prepare <resource> <target_primary> [<target_secondary>]
Allocate LVM space at the targets and start MARS replication.
./football.sh migrate_wait <resource> <target_primary> [<target_secondary>]
Wait until MARS replication reports UpToDate.
./football.sh migrate_finish <resource> <target_primary> [<target_secondary>]
Call hooks for handover to the targets.
./football.sh migrate_cleanup <resource>
Remove old / currently unused LV replicas from MARS and deallocate
from LVM.
Actions for (manual) repair in emergency situations:
./football.sh manual_migrate_config <resource> <target_primary> [<target_secondary>]
Transfer only the cluster config, without changing the MARS replicas.
This does no resource stopping / restarting.
Useful for reverting a failed migration.
./football.sh manual_config_update <hostname>
Only update the cluster config, without changing anything else.
Useful for manual repair of failed migration.
./football.sh manual_merge_cluster <hostname1> <hostname2>
Run "marsadm merge-cluster" for the given hosts.
Hostnames must be from different (former) clusters.
./football.sh manual_split_cluster <hostname_list>
Run "marsadm split-cluster" at the given hosts.
Useful for fixing failed / asymmetric splits.
Hint: provide _all_ hostnames which have formerly participated
in the cluster.
./football.sh repair_vm <resource> <primary_candidate_list>
Try to restart the VM <resource> on one of the given machines.
Useful during unexpected customer downtime.
./football.sh repair_mars <resource> <primary_candidate_list>
Before restarting the VM like in repair_vm, try to find a local
LV where a stand-alone MARS resource can be found and built up.
Use this only when the MARS resources are gone, and when you are
desperate. Problem: this will likely create a MARS setup which is
not usable for production, and therefore must be corrected later
by hand. Use this only during an emergency situation in order to
get the customers online again, while buying the downsides of this
command.
Actions for inplace FS shrinking:
./football.sh shrink <resource> <percent>
Run the sequence shrink_prepare ; shrink_finish ; shrink_cleanup.
./football.sh shrink_prepare <resource> [<percent>]
Allocate temporary LVM space (when possible) and create initial
raw FS copy.
Default percent value (when left out) is 85.
./football.sh shrink_finish <resource>
Incrementally update the FS copy, swap old <=> new copy with
small downtime.
./football.sh shrink_cleanup <resource>
Remove old FS copy from LVM.
Actions for inplace FS extension:
./football.sh extend <resource> <percent>
Combined actions:
./football.sh migrate+shrink <resource> <target_primary> [<target_secondary>] [<percent>]
Similar to migrate ; shrink but produces less network traffic.
Default percent value (when left out) is 85.
./football.sh migrate+shrink+back <resource> <tmp_primary> [<percent>]
Migrate temporarily to <tmp_primary>, then shrink there,
finally migrate back to old primary and secondaries.
Default percent value (when left out) is 85.
Global maintenance:
./football.sh lv_cleanup <resource>
General features:
- Instead of <percent>, an absolute amount of storage with suffix
'k' or 'm' or 'g' can be given.
- When <resource> is currently stopped, login to the container is
not possible, and in turn the hypervisor node and primary storage node
cannot be automatically determined. In such a case, the missing
nodes can be specified via the syntax
<resource>:<hypervisor>:<primary_storage>
- The following LV suffixes are used (naming convention):
-tmp = currently emerging version for shrinking
-preshrink = old version before shrinking took place
- By adding the option --screener, you can hand over Football execution
to ./screener.sh .
When some --enable_*_waiting is also added, then the critical
sections involving customer downtime are temporarily halted until
some sysadmin says "screener.sh continue $resource" or
attaches to the session and presses the RETURN key.
PLUGIN football-cm3
1&1-specific plugin for dealing with the cm3 cluster manager
and its concrete operating environment (singleton instance).
Current maximum cluster size limit:
Maximum #syncs running before migration can start:
Following marsadm --version must be installed:
Following mars kernel modules must be loaded:
PLUGIN football-basic
Generic driver for systemd-controlled MARS pools.
The current version supports only a flat model:
(1) There is a single "big cluster" at metadata level.
All cluster members are joined via merge-cluster.
All occurring names need to be globally unique.
(2) The network uses BGP or other means, thus any hypervisor
can (potentially) start any VM at any time.
(3) iSCSI or remote devices are not supported for now
(LocalSharding model). This may be extended in a future
release.
This plugin is exclusive-or with cm3.
Plugin specific actions:
./football.sh basic_add_host <hostname>
Manually add another host to the hostname cache.
PLUGIN football-motd
Generic plugin for motd. Communicate that Football is running
at login via motd.
PLUGIN football-report
Generic plugin for communication of reports.
PLUGIN football-waiting
Generic plugin, interfacing with screener: when this is used
by your script and enabled, then you will be able to wait for
"screener.sh continue" operations at certain points in your
script.
\end{verbatim}

docu/make-help.sh Executable file

@@ -0,0 +1,18 @@
#!/bin/bash
# Regenerate the LaTeX help appendices from the --help output of the tools.
football_dir="${football_dir:-../football}"

# Wrap the output of the given command in a LaTeX verbatim environment,
# doubling each backslash of the output on the way.
function make_latex_include
{
    local cmd="$1"
    echo '\begin{verbatim}'
    eval "$cmd" | sed 's/\\/\\\\/g'
    echo '\end{verbatim}'
}
make_latex_include "../userspace/marsadm --help" > marsadm.help
make_latex_include "(cd $football_dir/ && ./football.sh --help)" > football.help
make_latex_include "(cd $football_dir/ && ./football.sh --help --verbose)" > football-verbose.help
make_latex_include "(cd $football_dir/ && ./screener.sh --help)" > screener.help
make_latex_include "(cd $football_dir/ && ./screener.sh --help --verbose)" > screener-verbose.help


@@ -39413,6 +39413,13 @@ maximum 100 logfiles per resource
\begin_layout Chapter
Handout for Midnight Problem Solving
\begin_inset CommandInset label
LatexCommand label
name "chap:Handout-for-Midnight"
\end_inset
\end_layout
\begin_layout Standard
@@ -42012,6 +42019,162 @@ A_{s,p,T}(k,n)=n^{s+1}*T*\sum_{\bar{k}=k}^{k*n}C(k,\bar{k},k*n)*\binom{k*n}{\bar
\end_inset
\end_layout
\begin_layout Chapter
Command Documentation for Userspace Tools
\begin_inset CommandInset label
LatexCommand label
name "chap:Command-Documentation-for"
\end_inset
\end_layout
\begin_layout Section
\family typewriter
marsadm --help
\begin_inset CommandInset label
LatexCommand label
name "sec:marsadm-help"
\end_inset
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Plain Layout
\backslash
input{marsadm.help}
\end_layout
\end_inset
\end_layout
\begin_layout Section
\family typewriter
football.sh --help
\begin_inset CommandInset label
LatexCommand label
name "sec:football-help"
\end_inset
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Plain Layout
\backslash
input{football.help}
\end_layout
\end_inset
\end_layout
\begin_layout Section
\family typewriter
football.sh --help --verbose
\begin_inset CommandInset label
LatexCommand label
name "sec:football-help-verbose"
\end_inset
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Plain Layout
\backslash
input{football-verbose.help}
\end_layout
\end_inset
\end_layout
\begin_layout Section
\family typewriter
screener.sh --help
\begin_inset CommandInset label
LatexCommand label
name "sec:screenerhelp"
\end_inset
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Plain Layout
\backslash
input{screener.help}
\end_layout
\end_inset
\end_layout
\begin_layout Section
\family typewriter
screener.sh --help --verbose
\begin_inset CommandInset label
LatexCommand label
name "sec:screener-help-verbose"
\end_inset
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Plain Layout
\backslash
input{screener-verbose.help}
\end_layout
\end_inset
\end_layout
\begin_layout Chapter

docu/marsadm.help Normal file

@@ -0,0 +1,570 @@
\begin{verbatim}
Thorough documentation is in mars-manual.pdf. Please use the PDF manual
as the authoritative reference! Here is only a short summary of the most
important sub-commands / options:
marsadm [<global_options>] <command> [<resource_name> | all | <args> ]
marsadm [<global_options>] view[-<macroname>] [<resource_name> | all ]
<global_option> =
--force
Skip safety checks.
Use this only when you really know what you are doing!
Warning! This is dangerous! First try --dry-run.
Not combinable with 'all'.
--dry-run
Don't modify the symlink tree, but tell what would be done.
Use this before starting potentially harmful actions such as
'delete-resource'.
--verbose
Increase verbosity of some commands.
--logger=/path/to/usr/bin/logger
Use an alternative syslog messenger.
When empty, disable syslogging.
--max-deletions=<number>
When your network or your firewall rules are defective over a
longer time, too many deletion links may accumulate at
/mars/todo-global/delete-* and sibling locations.
This limit prevents overflow of the filesystem as well
as overloading of the worker threads.
--thresh-logfiles=<number>
--thresh-logsize=<number>
Prevention of too many small logfiles when secondaries are not
catching up. When more than thresh-logfiles are already present,
the next one is only created when the last one has at least
size thresh-logsize (in units of GB).
--timeout=<seconds>
Abort safety checks after timeout with an error.
When giving 'all' as resource argument, this works for each
resource independently.
--window=<seconds>
Treat other cluster nodes as healthy when some communication has
occurred during the given time window.
--threshold=<bytes>
Some macros like 'fetch-threshold-reached' use this for determining
their sloppiness.
--host=<hostname>
Act as if the command was running on cluster node <hostname>.
Warning! This is dangerous! First try --dry-run
--backup-dir=</absolute_path>
Only for experts.
Used by several special commands like merge-cluster, split-cluster
etc for creating backups of important data.
--ip=<ip>
Override the IP address stored in the symlink tree, as well as
the default IP determined from the list of network interfaces.
Usually you will need this only at 'create-cluster' or
'join-cluster' for resolving ambiguities.
--ssh-port=<port_nr>
Override the default ssh port (22) for ssh and rsync.
Useful for running {join,merge}-cluster on non-standard ssh ports.
--ssh-opts="<ssh_commandline_options>"
Override the default ssh commandline options. Also used for rsync.
--macro=<text>
Handy for testing short macro evaluations at the command line.
<command> =
attach
usage: attach <resource_name>
Attaches the local disk (backing block device) to the resource.
The disk must have been previously configured at
{create,join}-resource.
When designated as a primary, /dev/mars/$res will also appear.
This does not change the state of {fetch,replay}.
For a complete local startup of the resource, use 'marsadm up'.
cat
usage: cat <path>
Print internal debug output in human readable form.
Numerical timestamps and numerical error codes are replaced
by more readable means.
Example: marsadm cat /mars/5.total.status
connect
usage: connect <resource_name>
See resume-fetch-local.
connect-global
usage: connect-global <resource_name>
Like resume-fetch-local, but affects all resource members
in the cluster (remotely).
connect-local
usage: connect-local <resource_name>
See resume-fetch-local.
create-cluster
usage: create-cluster (no parameters)
This must be called exactly once when creating a new cluster.
Don't call this again! Use join-cluster on the secondary nodes.
Please read the PDF manual for details.
create-resource
usage: create-resource <resource_name> </dev/lv/mydata>
(further syntax variants are described in the PDF manual).
Create a new resource out of a pre-existing disk (backing
block device) /dev/lv/mydata (or similar).
The current node will start in primary role, thus
/dev/mars/<resource_name> will appear after a short time, initially
showing the same contents as the underlying disk /dev/lv/mydata.
It is good practice to name the resource <resource_name> and the
disk name identical.
cron
usage: cron (no parameters)
Do all necessary regular housekeeping tasks.
This is equivalent to log-rotate all; sleep 5; log-delete-all all.
delete-resource
usage: delete-resource <resource_name>
CAUTION! This is dangerous when the network is somehow
interrupted, or when damaged nodes are later resurrected
in any way.
Precondition: the resource must no longer have any members
(see leave-resource).
This is only needed when you _insist_ on re-using a damaged
resource for re-creating a new one with exactly the same
old <resource_name>.
HINT: best practice is to not use this, but just create a _new_
resource with a new <resource_name> out of your local disks.
Please read the PDF manual on potential consequences.
detach
usage: detach <resource_name>
Detaches the local disk (backing block device) from the
MARS resource.
Caution! You may read data from the local disk afterwards,
but ensure that no data is written to it!
Otherwise, you are likely to produce harmful inconsistencies.
When running in primary role, /dev/mars/$res will also disappear.
This does not change the state of {fetch,replay}.
For a complete local shutdown of the resource, use 'marsadm down'.
disconnect
usage: disconnect <resource_name>
See pause-fetch-local.
disconnect-global
usage: disconnect-global <resource_name>
Like pause-fetch-local, but affects all resource members
in the cluster (remotely).
disconnect-local
usage: disconnect-local <resource_name>
See pause-fetch-local.
down
usage: down <resource_name>
Shortcut for detach + pause-sync + pause-fetch + pause-replay.
get-emergency-limit
usage: get-emergency-limit <resource_name>
Counterpart of set-emergency-limit (per-resource emergency limit)
get-sync-limit-value
usage: get-sync-limit-value (no parameters)
For retrieval of the value set by set-sync-limit-value.
get-systemd-unit
usage: get-systemd-unit <resource_name>
Show the systemd units (for start and stop), or empty when unset.
invalidate
usage: invalidate <resource_name>
Only useful on a secondary node.
Forces MARS to consider the local replica disk as being
inconsistent, and therefore starting a fast full-sync from
the currently designated primary node (which must exist;
therefore avoid the 'secondary' command).
This is usually needed for resolving emergency mode.
When having k=2 replicas, this can be also used for
quick-and-simple split-brain resolution.
In other cases, or when the split-brain is not resolved by
this command, please use the 'leave-resource' / 'join-resource'
method as described in the PDF manual (in the right order as
described there).
join-cluster
usage: join-cluster <hostname_of_primary>
Establishes a new cluster membership.
This must be called once on any new cluster member.
This is a prerequisite for join-resource.
join-resource
usage: join-resource <resource_name> </dev/lv/mydata>
(further syntax variants are described in the PDF manual).
The resource <resource_name> must have been already created on
another cluster node, and the network must be healthy.
The contents of the local replica disk /dev/lv/mydata will be
overwritten by the initial fast full sync from the currently
designated primary node.
After the initial full sync has finished, the current host will
act in secondary role.
For details on size constraints etc, refer to the PDF manual.
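Example (hypothetical hosts hostA / hostB, disk /dev/lv/mydata on both):
on hostA: marsadm create-cluster
on hostA: marsadm create-resource mydata /dev/lv/mydata
on hostB: marsadm join-cluster hostA
on hostB: marsadm join-resource mydata /dev/lv/mydata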
leave-cluster
usage: leave-cluster (no parameters)
This can be used for final deconstruction of a cluster member.
Prior to this, all resources must have been left
via leave-resource.
Notice: this will never destroy the cluster UID on the /mars/
filesystem.
Please read the PDF manual for details.
leave-resource
usage: leave-resource <resource_name>
Precondition: the local host must be in secondary role.
Stop being a member of the resource, and thus stop all
replication activities. The status of the underlying disk
will remain in its current state (whatever it is).
log-delete
usage: log-delete <resource_name>
When possible, globally delete all old transaction logfiles which
are known to be superfluous, i.e. all secondaries no longer need
to replay them.
This must be regularly called by a cron job or similar, in order
to prevent overflow of the /mars/ directory.
For regular maintenance cron jobs, please prefer 'marsadm cron'.
For details and best practices, please refer to the PDF manual.
log-delete-all
usage: log-delete-all <resource_name>
Alias for log-delete
log-delete-one
usage: log-delete-one <resource_name>
When possible, globally delete at most one old transaction logfile
which is known to be superfluous, i.e. all secondaries no longer
need to replay it.
Hint: use this only for testing and manual inspection.
For regular maintenance cron jobs, please prefer cron
or log-delete-all.
log-purge-all
usage: log-purge-all <resource_name>
This is potentially dangerous.
Use this only if you are really desperate in trying to resolve a
split brain. Use this only after reading the PDF manual!
log-rotate
usage: log-rotate <resource_name>
Only useful at the primary side.
Start writing transaction logs into a new transaction logfile.
This should be regularly called by a cron job or similar.
For regular maintenance cron jobs, please prefer 'marsadm cron'.
For details and best practices, please refer to the PDF manual.
lowlevel-delete-host
usage: lowlevel-delete-host <resource_name>
Delete cluster member.
lowlevel-ls-host-ips
usage: lowlevel-ls-host-ips <resource_name>
List cluster member names and IP addresses.
lowlevel-set-host-ip
usage: lowlevel-set-host-ip <resource_name>
Set IP for host.
merge-cluster
usage: merge-cluster <hostname_of_other_cluster>
Precondition: the resource names of both clusters must be disjoint.
Create the union of two clusters, consisting of the
union of all machines, and the union of all resources.
The members of each resource are _not_ changed by this.
This is useful for creating a big "virtual LVM cluster" where
resources can be almost arbitrarily migrated between machines via
later join-resource / leave-resource operations.
merge-cluster-check
usage: merge-cluster-check <hostname_of_other_cluster>
Check whether the resources of both clusters are disjoint.
Useful for checking in advance whether merge-cluster would be
possible.
merge-cluster-list
usage: merge-cluster-list
Determine the local list of resources.
Useful for checking or analysis of merge-cluster disjointness by hand.
pause-fetch
usage: pause-fetch <resource_name>
See pause-fetch-local.
pause-fetch-global
usage: pause-fetch-global <resource_name>
Like pause-fetch-local, but affects all resource members
in the cluster (remotely).
pause-fetch-local
usage: pause-fetch-local <resource_name>
Stop fetching transaction logfiles from the current
designated primary.
This is independent from any {pause,resume}-replay operations.
Only useful on a secondary node.
pause-replay
usage: pause-replay <resource_name>
See pause-replay-local.
pause-replay-global
usage: pause-replay-global <resource_name>
Like pause-replay-local, but affects all resource members
in the cluster (remotely).
pause-replay-local
usage: pause-replay-local <resource_name>
Stop replaying transaction logfiles for now.
This is independent from any {pause,resume}-fetch operations.
This may be used for freezing the state of your replica for some
time, if you have enough space on /mars/.
Only useful on a secondary node.
pause-sync
usage: pause-sync <resource_name>
See pause-sync-local.
pause-sync-global
usage: pause-sync-global <resource_name>
Like pause-sync-local, but affects all resource members
in the cluster (remotely).
pause-sync-local
usage: pause-sync-local <resource_name>
Pause the initial data sync at current stage.
This has only an effect if a sync is actually running (i.e.
there is something to be actually synced).
Don't pause too long, because the local replica will remain
inconsistent during the pause.
Use this only for limited reduction of system load.
Only useful on a secondary node.
primary
usage: primary <resource_name>
Promote the resource into primary role.
This is necessary for /dev/mars/$res to appear on the local host.
Notice: by concept there can be only _one_ designated primary
in a cluster at the same time.
The role change is automatically distributed to the other nodes
in the cluster, provided that the network is healthy.
The old primary node will _automatically_ go
into secondary role first. This is different from DRBD!
With MARS, you don't need an intermediate 'secondary' command
for switching roles.
It is usually better to _directly_ switch the primary roles
between both hosts.
When --force is not given, a planned handover is started:
the local host will only become actually primary _after_ the
old primary is gone, and all old transaction logs have been
fetched and replayed at the new designated primary.
When --force is given, no handover is attempted. As a consequence,
a split brain situation is likely to emerge.
Thus, use --force only after an ordinary handover attempt has
failed, and when you don't care about the split brain.
For more details, please refer to the PDF manual.
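Example (hypothetical resource mydata): planned handover to the local
host, then wait until the role change has propagated in the cluster:
marsadm primary mydata
marsadm wait-cluster mydata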
resize
usage: resize <resource_name>
Prerequisite: all underlying disks (usually /dev/vg/$res) must
have been already increased, e.g. at the LVM layer (cf. lvresize).
Causes MARS to re-examine all sizing constraints on all members of
the resource, and increase the global logical size of the resource
accordingly.
Shrinking is currently not yet implemented.
When successful, /dev/mars/$res at the primary will be increased
in size. In addition, all secondaries will start an incremental
fast full-sync to get the enlarged parts from the primary.
resume-fetch
usage: resume-fetch <resource_name>
See resume-fetch-local.
resume-fetch-global
usage: resume-fetch-global <resource_name>
Like resume-fetch-local, but affects all resource members
in the cluster (remotely).
resume-fetch-local
usage: resume-fetch-local <resource_name>
Start fetching transaction logfiles from the current
designated primary node, if there is one.
This is independent from any {pause,resume}-replay operations.
Only useful on a secondary node.
resume-replay
usage: resume-replay <resource_name>
See resume-replay-local.
resume-replay-global
usage: resume-replay-global <resource_name>
Like resume-replay-local, but affects all resource members
in the cluster (remotely).
resume-replay-local
usage: resume-replay-local <resource_name>
Restart replaying transaction logfiles, when there is some
data left.
This is independent from any {pause,resume}-fetch operations.
This should be used for unfreezing the state of your local replica.
Only useful on a secondary node.
resume-sync
usage: resume-sync <resource_name>
See resume-sync-local.
resume-sync-global
usage: resume-sync-global <resource_name>
Like resume-sync-local, but affects all resource members
in the cluster (remotely).
resume-sync-local
usage: resume-sync-local <resource_name>
Resume any initial / incremental data sync at the stage where it
had been interrupted by pause-sync.
Only useful on a secondary node.
secondary
usage: secondary <resource_name>
Promote all cluster members into secondary role, globally.
In contrast to DRBD, this is not needed as an intermediate step
for planned handover between an old and a new primary node.
The only reasonable usage is before the last leave-resource of the
last cluster member, immediately before leave-cluster is executed
for final deconstruction of the cluster.
In all other cases, please prefer 'primary' for direct handover
between cluster nodes.
Notice: 'secondary' sets the global designated primary node
to '(none)' which in turn prevents the execution of 'invalidate'
or 'join-resource' or 'resize' anywhere in the cluster.
Therefore, don't unnecessarily give 'secondary'!
set-emergency-limit
usage: set-emergency-limit <resource_name> <value>
Set a per-resource emergency limit for disk space in /mars.
See PDF manual for details.
set-sync-limit-value
usage: set-sync-limit-value <new_value>
Set the maximum number of resources which should be syncing
concurrently.
set-systemd-unit
usage: set-systemd-unit <resource_name> <start_unit_name> [<stop_unit_name>]
This activates the systemd template engine of marsadm.
Please read mars-manual.pdf on this.
When <stop_unit_name> is omitted, it will be treated equal to
<start_unit_name>.
split-cluster
usage: split-cluster (no parameters)
NOT OFFICIALLY SUPPORTED - ONLY FOR EXPERTS.
RTFS = Read The Fucking Sourcecode.
Use this only if you know what you are doing.
up
usage: up <resource_name>
Shortcut for attach + resume-sync + resume-fetch + resume-replay.
wait-cluster
usage: wait-cluster [<resource_name>]
Waits until a ping-pong communication has succeeded in the
whole cluster (or only the members of <resource_name>).
NOTICE: this is extremely useful for avoiding races when scripting
in a cluster.
wait-connect
usage: wait-connect [<resource_name>]
See wait-cluster.
wait-resource
usage: wait-resource <resource_name>
[[attach|fetch|replay|sync][-on|-off]]
Wait until the given condition is met on the resource, locally.
wait-umount
usage: wait-umount <resource_name>
Wait until /dev/mars/<resource_name> has disappeared in the
cluster (even remotely).
Useful on both primary and secondary nodes.
<resource_name> = name of resource or "all" for all resources
<macroname> = <complex_macroname> | <primitive_macroname>
<complex_macroname> =
1and1
comminfo
commstate
cstate
default
default-global
diskstate
diskstate-1and1
dstate
fetch-line
fetch-line-1and1
flags
flags-1and1
outdated-flags
outdated-flags-1and1
primarynode
primarynode-1and1
replay-line
replay-line-1and1
replinfo
replinfo-1and1
replstate
replstate-1and1
resource-errors
resource-errors-1and1
role
role-1and1
state
status
sync-line
sync-line-1and1
syncinfo
syncinfo-1and1
todo-role
<primitive_macroname> =
deletable-size
device-opened
errno-text
Convert errno numbers (positive or negative) into human readable text.
get-log-status
get-resource-{fat,err,wrn}{,-count}
get-{disk,device}
is-{alive}
is-{split-brain,consistent,emergency}
occupied-size
present-{disk,device}
(deprecated, use *-present instead)
replay-basenr
replay-code
When negative, this indicates that a replay/recovery error has occurred.
rest-space
summary-vector
systemd-unit
tree
uuid
wait-{is,todo}-{attach,sync,fetch,replay,primary}-{on,off}
{alive,fetch,replay,work}-{timestamp,age,lag}
{all,the}-{pretty-,}{global-,}{{err,wrn,inf}-,}msg
{cluster,resource}-members
{disk,device}-present
{disk,resource,device}-size
{fetch,replay,work}-{lognr,logcount}
{get,actual}-primary
{is,todo}-{attach,sync,fetch,replay,primary}
{my,all}-resources
{sync,fetch,replay,work,syncpos}-{size,pos}
{sync,fetch,replay,work}-{rest,{almost-,threshold-,}reached,percent,permille,vector}
{sync,fetch,replay}-{rate,remain}
{time,real-time}
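Examples of macro invocations (hypothetical resource name mydata):
marsadm view mydata
(uses the 'default' macro)
marsadm view-replstate mydata
(a complex macro)
marsadm view-sync-percent all
(a primitive macro, evaluated for all resources)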
\end{verbatim}

docu/screener-verbose.help Normal file

@@ -0,0 +1,365 @@
\begin{verbatim}
OVERRIDE verbose=1
./screener.sh: Run _unattended_ processes in screen sessions.
Useful for MASS automation, running hundreds of unattended
commands in parallel.
HINT: for running more than ~500 sessions in parallel, you might need
some system tuning (e.g. rlimits, kernel patches etc) for creating
a huge number of file descriptors / sockets / etc.
ADVANTAGE: You may attach to individual screens, kill them, or continue
some waiting commands.
Synopsis:
./screener.sh --help [--verbose]
./screener.sh list-running
./screener.sh list-waiting
./screener.sh list-failed
./screener.sh list-critical
./screener.sh list-serious
./screener.sh list-done
./screener.sh list
./screener.sh list-screens
./screener.sh run <file.csv> [<condition_list>]
./screener.sh start <screen_id> <cmd> <args...>
./screener.sh [<options>] <operation> <screen_id>
Inquiry operations:
./screener.sh list-screens
Equivalent to screen -ls
./screener.sh list-<type>
Show a list of currently running, waiting (for continuation), failed,
and done/completed screen sessions.
./screener.sh list
First show a list of currently running screens, then
for each <type> a list of (old) failed / completed sessions
(and so on).
./screener.sh status <screen_id>
Like list-*, but filter by <screen_id> and don't report timestamps.
./screener.sh show <screen_id>
Show the last logfile of <screen_id> at standard output.
./screener.sh less <screen_id>
Show the last logfile of <screen_id> using "less -r".
MASS starting of screen sessions:
./screener.sh run <file.csv> <condition_list>
Commands are launched in screen sessions via "./screener.sh start" commands,
unless the same <screen_id> is already running,
or is in some error state, or is already done (see below).
The commands are given by a column with CSV header name
containing "command", or by the first column.
The <screen_id> needs to be given by a column with CSV header
name matching "screen_id|resource".
The number and type of commands to launch can be reduced via
any combination of the following filter conditions:
--max=<number>
Limit the number of _new_ sessions additionally started this time.
--<column_name>==<value>
Only select lines where an arbitrary CSV column (given by its
CSV header name in C identifier syntax) has the given value.
--<column_name>!=<value>
Only select lines where the column does _not_ have the given value.
--<column_name>=~<bash_regex>
Only select lines where the bash regular expression matches
at the given column.
--max-per=<number>
Limit the number per _distinct_ value of the column denoted by
the _next_ filter condition.
Example: ./screener.sh run test.csv --dry-run --max-per=2 --dst_network=~.
would launch only 2 Football processes per destination network.
Hint: filter conditions can be easily checked by giving --dry-run.
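Illustration (hypothetical file test.csv, using the default ';' delimiter):
screen_id;command;dst_network
res01;./football.sh migrate res01 host-a;net1
res02;./football.sh migrate res02 host-b;net2
First check the selection, then launch:
./screener.sh run test.csv --dst_network==net1 --dry-run
./screener.sh run test.csv --dst_network==net1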
Start / restart / kill / continue screen sessions:
./screener.sh start <screen_id> <cmd> <args...>
Start a new screen session, running arbitrary <cmd> and <args...>
inside.
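Example (hypothetical names): run a Football migration in a screen
session named after the resource:
./screener.sh start res01 ./football.sh migrate res01 host-a host-b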
./screener.sh restart <screen_id>
Works only when the last command for <screen_id> failed.
This will restart the old <cmd> and its <args...> as before.
Use only when you want to repeat the same command once again.
./screener.sh kill <screen_id>
Terminate the running screen session forcibly.
./screener.sh continue
./screener.sh continue <screen_id> [<screen_id_list>]
./screener.sh continue <number>
Useful for MASS automation of processes involving critical sections
such as customer downtime.
When giving a numerical <number> argument, up to that number
of sessions are resumed (ordered by age).
When no further argument is given, _all_ currently waiting sessions
are continued.
When --auto-attach is given, it will sequentially resume the
sessions to be continued. By default, unless --force_attach is set,
it uses "screen -r" skipping those sessions which are already
attached to somebody else.
This feature works only with prepared scripts which are creating
an empty flagfile
/home/schoebel/mars/mars-migration.git/screener-logdir-testing/running/$screen_id.waiting
whenever they want to wait for manual intervention (for whatever reason).
Afterwards, the script must be polling this flagfile for removal.
This screener operation simply removes the flagfile, such that
the script will then continue afterwards.
Example: look into ./football.sh
and search for occurrences of substring "call_hook start_wait".
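A minimal sketch of this waiting convention (illustrative only, not
copied from football.sh): create the flagfile, then poll until
screener removes it:
flag="$screener_logdir/running/$screen_id.waiting"
touch "$flag"
while [[ -e "$flag" ]]; do
  sleep 60
done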
./screener.sh wakeup
./screener.sh wakeup <screen_id> [<screen_id_list>]
./screener.sh wakeup <number>
Similar to continue, but refers to delayed commands waiting for
a timeout. This can be used to individually shorten the timeout
period.
Example: Football cleanup operations may be artificially delayed
before doing "lvremove", to keep some sort of 'backup' for a
limited time. When your project is under time pressure, these
delays may be a hindrance.
Use this for premature ending of such artificial delays.
./screener.sh up <...>
Do both continue and wakeup.
./screener.sh auto <...>
Equivalent to ./screener.sh --auto-attach up <...>
Remember that only sessions without a current attachment will be
attached to.
Attach to a running session:
./screener.sh attach <screen_id>
This is equivalent to screen -x $screen_id
./screener.sh resume <screen_id>
This is equivalent to screen -r $screen_id
Communication:
./screener.sh notify <screen_id> <txt>
May be called from external scripts to send emails etc.
Locking (only when supported by <cmd>):
./screener.sh lock
./screener.sh unlock
./screener.sh lock <screen_id>
./screener.sh unlock <screen_id>
Cleanup / bookkeeping:
./screener.sh clear-critical <screen_id>
./screener.sh clear-serious <screen_id>
./screener.sh clear-failed <screen_id>
Mark the status as "done" and move the logfile away.
./screener.sh purge [<days>]
This will remove all old logfiles which are older than
<days>. By default, the variable $screener_log_purge_period
will be used, which is currently set to '30'.
./screener.sh cron
You should call this regularly from a user cron job, in order
to purge old logfiles, or to detect hanging sessions, or to
automatically send pending emails, etc.
Options:
--variable
--variable=$value
These must come first, in order to prevent mixup with
options of <cmd> <args...>.
Allows overriding of any internal shell variable.
--help --verbose
Show all overridable shell variables, also for plugins.
## football_includes
# List of directories where screener-*.conf files can be found.
football_includes="${football_includes:-/usr/lib/mars/plugins /etc/mars/plugins $script_dir/plugins $HOME/.mars/plugins ./plugins}"
## title
# Used as a title for startup of screen sessions, and later for
# display at list-*
title="${title:-}"
## auto_attach
# Upon start or upon continue/wakeup/up, attach to the
# (newly created or existing) session.
auto_attach="${auto_attach:-0}"
## auto_attach_grace
# Before attaching, wait this time in seconds.
# The user may abort within this sleep time by
# pressing Ctrl-C.
auto_attach_grace="${auto_attach_grace:-10}"
## force_attach
# Use "screen -x" instead of "screen -r" allowing
# shared sessions between different users / end terminals.
force_attach="${force_attach:-0}"
## drop_shell
# When a <cmd> fails, the screen session will not be terminated immediately.
# Instead, an interactive bash is started, so you can later attach and
# rectify any problems.
# WARNING! Only activate this if you regularly check for failed sessions
# and then manually attach to them. Don't use this when running hundreds
# or thousands in parallel.
drop_shell="${drop_shell:-0}"
## session_timeout
# Detect hanging sessions when they don't produce any output anymore
# for a longer time. Hanging sessions are then marked as failed or critical.
session_timeout="${session_timeout:-$(( 3600 * 3 ))}" # seconds
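# Example (hypothetical invocation): allow a long-running shrink
# up to 12 hours before it may be flagged as hanging:
#   ./screener.sh --session_timeout=$(( 3600 * 12 )) start big-shrink ./football.sh shrink vm07 85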
## screener_logdir or logdir
# Where the logfiles and all status information are kept.
export screener_logdir="${screener_logdir:-${logdir:-$HOME/screener-logs}}"
## screener_command_log
# This logfile will accumulate all relevant $0 command invocations,
# including timestamps and ssh agent identities.
# To switch off, use /dev/null here.
screener_command_log="${screener_command_log:-$screener_logdir/commands.log}"
## screener_cron_log
# Since "$0 cron" works silently, you won't notice any errors.
# This logfile gives you a chance to check for any problems.
screener_cron_log="${screener_cron_log:-$screener_logdir/cron.log}"
## screener_log_purge_period
# $0 cron or $0 purge will automatically remove all old logfiles
# from $screener_logdir/*/ when this period is exceeded.
screener_log_purge_period="${screener_log_purge_period:-30}" # Days
## dry_run
# Don't actually start screen sessions when set.
dry_run="${dry_run:-0}"
## verbose
# Increase verbosity.
verbose=${verbose:-0}
## debug
# Some additional debug messages.
debug="${debug:-0}"
## sleep
# Workaround races by keeping sessions open for a few seconds.
# This is useful for debugging of immediate script failures.
# You have a short time window for attaching.
# HINT: instead, just inspect the logfiles in $screener_logdir/*/*.log
sleep="${sleep:-3}"
## screen_cmd
# Customize the screen command (e.g. add some further options, etc).
screen_cmd="${screen_cmd:-screen}"
## use_screenlog
# Add the -L option. Not really useful when running thousands of
# parallel screen sessions, because the automatically generated filenames
# are crap, and cannot be set in advance.
# Useful for basic debugging of setup problems etc.
use_screenlog="${use_screenlog:-0}"
## waiting_txt and delayed_txt
# RTFS. Don't change these unless you know what you are doing.
waiting_txt="${waiting_txt:-SCREENER_waiting_WAIT}"
delayed_txt="${delayed_txt:-SCREENER_delayed_WAIT}"
## critical_status
# This is the "magic" exit code indicating _criticality_
# of a failed command.
critical_status="${critical_status:-199}"
## serious_status
# This is the "magic" exit code indicating _seriousness_
# of a failed command.
serious_status="${serious_status:-198}"
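# Example: a minimal sketch of a <cmd> cooperating with these codes
# (the step name and its failure test are assumptions):
#   #!/bin/bash
#   if ! do_migration_step; then
#     exit 199 # equals $critical_status => session is listed as critical
#   fi
#   exit 0     # any other nonzero code counts as a plain failure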
## less_cmd
# Used by $0 less <screen_id>
less_cmd="${less_cmd:-less -r}"
## date_format
# Here you can customize the appearance of list-* commands
date_format="${date_format:-%Y-%m-%d %H:%M}"
## csv_delim
# The delimiter used for CSV file parsing
csv_delim="${csv_delim:-;}"
## csv_cmd_fields
# Regex telling the field name for 'cmd'
csv_cmd_fields="${csv_cmd_fields:-command}"
## csv_id_fields
# Regex telling the field name for 'screen_id'
csv_id_fields="${csv_id_fields:-screen_id|resource}"
## csv_remove
# Regex for global removal of command options
csv_remove="${csv_remove:---screener}"
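# Example (hypothetical CSV): with the above defaults, the header
# field "resource" yields the screen_id, "command" yields the
# command, and any "--screener" option is stripped before launch:
#   resource;command
#   vm01;./football.sh --screener migrate vm01 hostA hostB
#   vm02;./football.sh --screener migrate vm02 hostC hostD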
## user_name
# Normally automatically derived from ssh agent or from $LOGNAME.
# Please override this only when really necessary.
export user_name="${user_name:-$(ssh-add -l | grep -oE '[^ ]+@[^ ]+' | sort -u | tail -1)}"
export user_name="${user_name:-$LOGNAME}"
## tmp_dir and tmp_stub
# Where temporary files are residing
tmp_dir="${tmp_dir:-/tmp}"
tmp_stub="${tmp_stub:-$tmp_dir/screener.$$}"
Running hook: email_describe_plugin
PLUGIN screener-email
Generic plugin for sending emails (or SMS via gateways)
upon status changes, such as script failures.
## email_*
# List of email addresses.
# Empty = don't send emails.
email_critical="${email_critical:-}"
email_serious="${email_serious:-}"
email_failed="${email_failed:-}"
email_warning="${email_warning:-}"
email_waiting="${email_waiting:-}"
email_done="${email_done:-}"
## sms_*
# List of email addresses of SMS gateways.
# These may be distinct from email_*.
# Empty = don't send sms.
sms_critical="${sms_critical:-}"
sms_serious="${sms_serious:-}"
sms_failed="${sms_failed:-}"
sms_warning="${sms_warning:-}"
sms_waiting="${sms_waiting:-}"
sms_done="${sms_done:-}"
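# Example: a minimal sketch of a screener-email.conf, to be placed
# into one of the plugin directories listed in $football_includes
# (all addresses and the SMS gateway are assumptions):
#   email_critical="oncall@example.org"
#   email_failed="team@example.org"
#   sms_critical="491700000000@sms.gateway.example.org"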
## email_cmd
# Command for email sending.
# Please include your gateways etc here.
email_cmd="${email_cmd:-mailx -S smtp=mx.nowhere.org:587 -S smtp-auth-user=test}"
## email_logfiles
# Whether to include logfiles in the body.
# Not used for sms_*.
email_logfiles="${email_logfiles:-1}"
\end{verbatim}

193
docu/screener.help Normal file
View File

@ -0,0 +1,193 @@
\begin{verbatim}
./screener.sh: Run _unattended_ processes in screen sessions.
Useful for MASS automation, running hundreds of unattended
commands in parallel.
HINT: for running more than ~500 sessions in parallel, you might need
some system tuning (e.g. rlimits, kernel patches etc) for creating
a huge number of file descriptors / sockets / etc.
ADVANTAGE: You may attach to individual screens, kill them, or continue
some waiting commands.
Synopsis:
./screener.sh --help [--verbose]
./screener.sh list-running
./screener.sh list-waiting
./screener.sh list-failed
./screener.sh list-critical
./screener.sh list-serious
./screener.sh list-done
./screener.sh list
./screener.sh list-screens
./screener.sh run <file.csv> [<condition_list>]
./screener.sh start <screen_id> <cmd> <args...>
./screener.sh [<options>] <operation> <screen_id>
Inquiry operations:
./screener.sh list-screens
Equivalent to screen -ls
./screener.sh list-<type>
Show a list of currently running, waiting (for continuation), failed,
and done/completed screen sessions.
./screener.sh list
First show a list of currently running screens, then
for each <type> a list of (old) failed / completed sessions
(and so on).
./screener.sh status <screen_id>
Like list-*, but filter by <screen_id> and don't report timestamps.
./screener.sh show <screen_id>
Show the last logfile of <screen_id> at standard output.
./screener.sh less <screen_id>
Show the last logfile of <screen_id> using "less -r".
MASS starting of screen sessions:
./screener.sh run <file.csv> <condition_list>
Commands are launched in screen sessions via "./screener.sh start" commands,
unless the same <screen_id> is already running,
or is in some error state, or is already done (see below).
The commands are given by a column with CSV header name
containing "command", or by the first column.
The <screen_id> needs to be given by a column with CSV header
name matching "screen_id|resource".
The number and type of commands to launch can be reduced via
any combination of the following filter conditions:
--max=<number>
Limit the number of _new_ sessions additionally started this time.
--<column_name>==<value>
Only select lines where an arbitrary CSV column (given by its
CSV header name in C identifier syntax) has the given value.
--<column_name>!=<value>
Only select lines where the column does _not_ have the given value.
--<column_name>=~<bash_regex>
Only select lines where the bash regular expression matches
at the given column.
--max-per=<number>
Limit the number per _distinct_ value of the column denoted by
the _next_ filter condition.
Example: ./screener.sh run test.csv --dry-run --max-per=2 --dst_network=~.
would launch only 2 Football processes per destination network.
Hint: filter conditions can be easily checked by giving --dry-run.
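Example: a hypothetical test.csv and a filtered mass start which
launches at most 10 new sessions, restricted to rows whose
"status" column has the value "ready":
resource;command;status
vm01;./football.sh migrate vm01 hostA hostB;ready
vm02;./football.sh migrate vm02 hostC hostD;blocked
./screener.sh run test.csv --max=10 --status==ready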
Start / restart / kill / continue screen sessions:
./screener.sh start <screen_id> <cmd> <args...>
Start a new screen session, running arbitrary <cmd> and <args...>
inside.
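Example (hypothetical screen id and command):
./screener.sh start vm01-migrate ./football.sh migrate vm01 hostA hostB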
./screener.sh restart <screen_id>
Works only when the last command for <screen_id> failed.
This will restart the old <cmd> and its <args...> as before.
Use only when you want to repeat the same command once again.
./screener.sh kill <screen_id>
Terminate the running screen session forcibly.
./screener.sh continue
./screener.sh continue <screen_id> [<screen_id_list>]
./screener.sh continue <number>
Useful for MASS automation of processes involving critical sections
such as customer downtime.
When giving a numerical <number> argument, up to that number
of sessions are resumed (ordered by age).
When no further argument is given, _all_ currently waiting sessions
are continued.
When --auto-attach is given, it will sequentially resume the
sessions to be continued. By default, unless --force_attach is set,
it uses "screen -r" skipping those sessions which are already
attached to somebody else.
This feature works only with prepared scripts which are creating
an empty flagfile
/home/schoebel/mars/mars-migration.git/screener-logdir-testing/running/$screen_id.waiting
whenever they want to wait for manual intervention (for whatever reason).
Afterwards, the script must be polling this flagfile for removal.
This screener operation simply removes the flagfile, such that
the script will then continue afterwards.
Example: look into ./football.sh
and search for occurrences of substring "call_hook start_wait".
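Example: a minimal bash sketch of such a waiting point inside a
prepared script (the exact flagfile location, the marker text and
the polling interval are assumptions; see $screener_logdir and
$waiting_txt in the verbose help):
flag="$screener_logdir/running/$screen_id.waiting"
touch "$flag"
echo "SCREENER_waiting_WAIT" # marker telling screener we are waiting
while [ -e "$flag" ]; do     # flagfile is removed by "continue"
  sleep 10
done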
./screener.sh wakeup
./screener.sh wakeup <screen_id> [<screen_id_list>]
./screener.sh wakeup <number>
Similar to continue, but refers to delayed commands waiting for
a timeout. This can be used to individually shorten the timeout
period.
Example: Football cleanup operations may be artificially delayed
before doing "lvremove", to keep some sort of 'backup' for a
limited time. When your project is under time pressure, these
delays may be hindering.
Use this for premature ending of such artificial delays.
./screener.sh up <...>
Do both continue and wakeup.
./screener.sh auto <...>
Equivalent to ./screener.sh --auto-attach up <...>
Remember that only sessions without a current attachment will be
attached to.
Attach to a running session:
./screener.sh attach <screen_id>
This is equivalent to screen -x $screen_id
./screener.sh resume <screen_id>
This is equivalent to screen -r $screen_id
Communication:
./screener.sh notify <screen_id> <txt>
May be called from external scripts to send emails etc.
Locking (only when supported by <cmd>):
./screener.sh lock
./screener.sh unlock
./screener.sh lock <screen_id>
./screener.sh unlock <screen_id>
Cleanup / bookkeeping:
./screener.sh clear-critical <screen_id>
./screener.sh clear-serious <screen_id>
./screener.sh clear-failed <screen_id>
Mark the status as "done" and move the logfile away.
./screener.sh purge [<days>]
This will remove all logfiles which are older than
<days>. By default, the variable $screener_log_purge_period
is used, which is currently set to '30'.
./screener.sh cron
You should call this regularly from a user cron job, in order
to purge old logfiles, or to detect hanging sessions, or to
automatically send pending emails, etc.
Options:
--variable
--variable=$value
These must come first, in order to prevent a mixup with
options of <cmd> <args...>.
They allow overriding of any internal shell variable.
--help --verbose
Show all overridable shell variables, also for plugins.
PLUGIN screener-email
Generic plugin for sending emails (or SMS via gateways)
upon status changes, such as script failures.
\end{verbatim}