When the config option journal_zero_on_create is true, osd mkfs fails while zeroing the journal.
The journal is opened with O_DIRECT, so the zeroing buffer must be aligned to the block size.
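A minimal sketch of the alignment requirement; the 4096 block size, helper name and error handling are illustrative, not the actual FileJournal code:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstdlib>
    #include <cstring>

    // Zero `len` bytes of a journal file that is opened with O_DIRECT.
    static int zero_journal(const char *path, size_t len)
    {
      const size_t block_size = 4096;   // illustrative; use the device block size
      int fd = ::open(path, O_WRONLY | O_DIRECT);
      if (fd < 0)
        return -errno;
      void *buf = nullptr;
      // O_DIRECT requires the buffer (and offset/length) to be aligned to the
      // block size; a plain new/malloc buffer makes pwrite fail with EINVAL.
      if (::posix_memalign(&buf, block_size, block_size) != 0) {
        ::close(fd);
        return -ENOMEM;
      }
      ::memset(buf, 0, block_size);
      int r = 0;
      for (size_t off = 0; off < len && r == 0; off += block_size) {
        if (::pwrite(fd, buf, block_size, off) < 0)
          r = -errno;
      }
      ::free(buf);
      ::close(fd);
      return r;
    }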
Backport: giant, firefly, dumpling
Signed-off-by: Xie Rui <875016668@qq.com>
Reviewed-by: Sage Weil <sage@redhat.com>
The latest_monmap that we stash is only used locally; the encoded bl is never shared. That means we can just use CEPH_FEATURES_ALL all of the time.
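A rough sketch of the intent; the store write is elided and the surrounding code is an assumption, not the actual monitor code:

    // The encoded bufferlist is only stashed in the local mon store and never
    // sent to a peer, so encoding with the full local feature set is safe.
    bufferlist bl;
    monmap.encode(bl, CEPH_FEATURES_ALL);  // rather than the (possibly reduced) quorum features
    // ... put bl under the local "latest monmap" key via a store transaction ...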
Fixes: #5203
Backport: giant, firefly
Signed-off-by: Xie Rui <875016668@qq.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Joao Eduardo Luis <joao@redhat.com>
If a client reconnects to an endpoint it has already marked down, the server side detects that a remote reset happened and resets the existing connection. Meanwhile, the client-side connection receives the retry tag and tries to reconnect. The client-side connection then sends a connect_msg with connect_seq(1), but it meets the server-side connection's connect_seq(0), which makes the server side reply with the reset tag. So the connection loops between the reset and retry tags.
One solution is to close the server-side connection if connect_seq == 0 and there is no message in the queue. But that triggers another problem:
1. client tries to connect to an endpoint it has already marked down
2. client->send_message
3. server side accepts the new socket, replaces the old one and replies with the retry tag
4. client increments connect_seq, but a socket failure happens
5. server-side connection is detected and closed because connect_seq == 0 and there is no message
6. client reconnects; the server side has no existing connection and meets "connect.connect_seq > 0", so it replies with the RESET tag
7. client discards all messages in its queue, so we lose messages that were never delivered
This solution instead adds a new "once_session_reset" flag to indicate whether "existing" has ever been reset. The server side's connect_seq is 0 only when it has never connected successfully or a session reset has happened, so we only need to reply with the RESET tag if a session reset actually happened.
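A rough sketch of the server-side branch this describes; the helper names are placeholders, not the actual AsyncConnection::handle_connect_msg code:

    // Incoming connect_seq is 0 but we still hold an "existing" connection.
    if (connect.connect_seq == 0 && existing->connect_seq > 0) {
      if (existing->once_session_reset) {
        // the existing session was reset at some point, so the peer has to be
        // told to drop its own state
        send_reply_tag(CEPH_MSGR_TAG_RESETSESSION);  // placeholder helper
      } else {
        // never reset: do not reply RESET (that is what caused the reset/retry
        // loop); drop the stale server-side connection and accept the new one
        replace_existing(existing);                  // placeholder helper
      }
    }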
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
If a connection sent many messages that were never acked and was then marked down, the next new connection issues a connect_msg with connect_seq=0. The server side needs to detect "connect_seq==0 && existing->connect_seq > 0" so that it resets out_q and detects the remote reset. But if the client side fails before sending that connect_msg, it will later issue a connect_msg with a non-zero connect_seq, so the server side cannot detect the remote reset. The server side then replies with a non-zero in_seq and the client crashes.
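A rough sketch of the detection the server side relies on; the shape and helper names are assumptions, not the actual code:

    // A peer that starts over announces itself with connect_seq == 0; if we
    // still track an "existing" connection with a higher connect_seq, the
    // remote side must have been reset.
    if (connect.connect_seq == 0 && existing->connect_seq > 0) {
      existing->discard_out_queue();   // drop messages that will never be acked
      existing->in_seq = 0;            // never echo a stale in_seq to the new peer
      notify_remote_reset(existing);   // placeholder for the remote-reset dispatch
    }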
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Because AsyncConnection never enters the "open" tag from the "replace" tag, the code that sets reply_tag is not reached when entering the "open" tag. This causes the server side to discard out_q and lose state.
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Make handle_connect_msg follow the lock rule: release any held lock before acquiring the messenger's lock, otherwise a deadlock can happen.
Also strengthen the lock condition checks, because the connection's state may change while it is unlocked and then locked again.
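The rule sketched with plain std::mutex; the member names are illustrative, not the real AsyncMessenger/AsyncConnection fields:

    #include <mutex>

    struct AsyncMessenger { std::mutex lock; /* ... */ };

    struct AsyncConnection {
      std::mutex lock;
      AsyncMessenger *messenger;
      int state = 0;

      void handle_connect_msg_step() {
        std::unique_lock<std::mutex> l(lock);
        int expected_state = state;
        // Never take messenger->lock while holding the connection lock:
        // another thread holding messenger->lock may be trying to lock this
        // connection, and the two would deadlock.
        l.unlock();
        {
          std::lock_guard<std::mutex> ml(messenger->lock);
          // ... look up / replace the existing connection ...
        }
        l.lock();
        // The connection may have changed state while it was unlocked, so the
        // condition has to be re-checked before continuing.
        if (state != expected_state)
          return;
        // ... continue handling the connect message ...
      }
    };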
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
mark_down/mark_down_all now dispatch a reset event. If we call Messenger::shutdown/wait, that reset event may be delivered after the Messenger has been deallocated.
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
In order to avoid a deadlock like:
1. mark_down_all is called while holding the lock
2. ms_dispatch_reset is dispatched
3. get_connection wants to take the same lock
4. deadlock
we signal a workerpool barrier and wait for all in-queue events to complete.
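A minimal sketch of the barrier idea using a counter and a condition variable; this is not the actual WorkerPool code:

    #include <condition_variable>
    #include <mutex>

    // Wait, outside the messenger lock, until every event already queued to
    // the worker threads has been processed, so no reset callback can run
    // after this point.
    struct EventBarrier {
      std::mutex lock;
      std::condition_variable cond;
      unsigned in_flight = 0;

      void event_queued() {
        std::lock_guard<std::mutex> l(lock);
        ++in_flight;
      }
      void event_finished() {
        std::lock_guard<std::mutex> l(lock);
        if (--in_flight == 0)
          cond.notify_all();
      }
      void wait_for_idle() {
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [this] { return in_flight == 0; });
      }
    };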
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Previously, if a caller wanted to mark_down one connection and the caller was itself an event-thread callback, it would block waiting for the wakeup. Meanwhile, the event thread that was expected to signal the blocked thread may itself want to mark_down a connection owned by the already blocked thread, so a deadlock happens.
As a tradeoff, introduce a lock around file_events so that creating/deleting a file_event cannot race with its callback, and we no longer need to block waiting for the callback.
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Learn from commit 2d4dca757e for SimpleMessenger: if binding to an IP address fails, delay and retry.
This happens mainly on IPv6 deployments. Due to DAD (Duplicate Address Detection) or SLAAC, IPv6 may not yet be available when the daemons start. Monitor daemons try to bind to a static IPv6 address that might not be available yet, and that prevents the monitor from starting.
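A sketch of the delay-and-retry loop; the attempt count and delay are illustrative stand-ins for the real retry options:

    #include <sys/socket.h>
    #include <unistd.h>
    #include <cerrno>

    // Try to bind, and on failure (e.g. the IPv6 address is not ready yet
    // because of DAD/SLAAC) sleep and retry a few times before giving up.
    static int bind_with_retry(int fd, const sockaddr *addr, socklen_t len)
    {
      const int retries = 3;       // illustrative values
      const unsigned delay_s = 5;
      int r = 0;
      for (int attempt = 0; attempt <= retries; ++attempt) {
        if (::bind(fd, addr, len) == 0)
          return 0;
        r = -errno;
        if (attempt < retries)
          ::sleep(delay_s);        // give DAD/SLAAC time to finish
      }
      return r;
    }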
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Completely avoid the extra accept thread in AsyncMessenger. The bind socket is now treated as a normal socket, and a random Worker thread is dispatched to handle its accept events.
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Now 2-4 async op threads can fully meet an OSD's network demand with an SSD backend, so we can bind this limited number of threads to specific cores. That improves async event loop performance, because most structures and methods are then processed within the same thread.
For example,
ms_async_op_threads = 2
ms_async_affinity_cores = 0,3
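A sketch of pinning a worker thread to one of the configured cores; the option parsing is omitted and the helper is illustrative, with pthread_setaffinity_np doing the actual pinning:

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling worker thread to `core`, one of the ids parsed from
    // ms_async_affinity_cores (e.g. worker 0 -> core 0, worker 1 -> core 3).
    static int pin_to_core(int core)
    {
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(core, &cpuset);
      return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    }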
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
'ceph pg dump_stuck undersized' is currently rejected, because the command signature only accepts inactive|unclean|stale:
undersized not valid: undersized not in inactive|unclean|stale
undersized not valid: undersized doesn't represent an int
Invalid command: unused arguments: ['undersized']
pg dump_stuck {inactive|unclean|stale [inactive|unclean|stale...]} {<int>} : show information about stuck pgs
Signed-off-by: xinxin shu <xinxin.shu@intel.com>
We no longer convert stores on upgrade. Users coming from bobtail or before should go through an interim version such as cuttlefish, dumpling, firefly or giant.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
People upgrading from bobtail or previous clusters should first go
through an interim version (quite a few to pick from: cuttlefish,
dumpling, firefly, giant).
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
3600 will mean every hour, on the hour; 60 will mean every minute, on the minute. This allows the monitors to spit out the info at regular intervals, regardless of the time at which they formed quorum or which monitor is now the leader.
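The alignment is just modular arithmetic on the wall clock; a small sketch (the function name is illustrative):

    #include <ctime>

    // With interval = 3600 this fires on the hour, with 60 on the minute,
    // independently of when the monitor formed quorum.
    static time_t seconds_until_next_tick(unsigned interval)
    {
      time_t now = ::time(nullptr);
      return interval - (now % interval);
    }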
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
By caching the summary string we can avoid writing duplicates to clog.
We still write a duplicate every 'mon_health_to_clog_interval', to make sure that we output the health status every now and then, but the interval is increased from 120 seconds to 3600 seconds -- once every hour unless the health status changes.
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Instead of writing the health status only when a user action calls get_health(), have the monitor write it every X seconds.
Adds a new config option 'mon_health_to_clog_tick_interval' (default:
60 [seconds]), and changes the default value of
'mon_health_to_clog_interval' from 60 (seconds) to 120 (seconds).
If 'mon_health_to_clog' is 'true' and 'mon_health_to_clog_tick_interval'
is greater than 0.0, the monitor will now start a tick event when it
wins an election (meaning, only the leader will write this info to
clog).
This tick runs, by default, every 60 seconds. It calls Monitor::get_health() to obtain the current health summary and overall status. If the overall status is the same as the cached status, it is ignored, unless the last write to clog happened more than 'mon_health_to_clog_interval' seconds ago (default: 120).
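A self-contained sketch of that decision; the types and field names are placeholders for the monitor's real state:

    #include <string>

    // Illustrative state standing in for the monitor's cached status/timestamp.
    struct HealthTickState {
      std::string cached_status;  // last overall status written to clog
      double last_write = 0;      // time of the last clog write, in seconds
    };

    // Called from the leader's tick. Returns true when the summary should be
    // written to clog: the status changed, or mon_health_to_clog_interval
    // seconds have passed since the last write.
    bool should_write_to_clog(HealthTickState &s, const std::string &overall,
                              double now, double to_clog_interval)
    {
      if (overall == s.cached_status && now - s.last_write < to_clog_interval)
        return false;             // unchanged and written recently: skip
      s.cached_status = overall;
      s.last_write = now;
      return true;
    }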
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
Output the health summary to clog on Monitor::get_health() (called during, e.g., 'ceph -s', 'ceph health' and the like) if 'mon_health_to_clog' is true (default: false) and if the last update is at least 'mon_health_to_clog_interval' old (default: 60.0 seconds).
This patch is far from optimal for several reasons though:
1. health summary is still generated on-the-fly by the monitor each time
Monitor::get_health() is called.
2. health summary will only be outputted to clog IF and WHEN
Monitor::get_health() is called.
3. patch does not account for duplicate summaries. We may have the same
string outputted every time Monitor::get_health() is called (as long as
enough time has passed since we last wrote to clog).
4. each monitor will output to clog independently from the other
monitors. This means that running a 'ceph -s' 3 times in a row, on a
cluster with at least 3 monitors, may result in writing the same string
3 times.
5. We reduce the amount of writes to clog by caching the last overall
health status. We only write to clog if the overall status is different
from the cached value OR enough time has passed since we last wrote to
clog. This may result in ignoring new contributing factors to overall
cluster health that by themselves do not change the overall status; and
even though we will pick up on them once enough time has passed, we may end
up losing intermediate states (which may be good if they're transient,
but not as awesome if they reflect some kind of instability).
Fixes: #9440 (even if in a poor manner)
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
'fail' on a non-existent name was returning ENOENT; it should succeed, as the fail operation makes the name cease to exist.
Signed-off-by: John Spray <john.spray@redhat.com>
The json-pretty format was modified for readability and now includes
additional newlines / spaces. Either switch to json to avoid dealing
with space changes or modify the expected output to include them.
http://tracker.ceph.com/issues/10547
Fixes: #10547
Signed-off-by: Loic Dachary <ldachary@redhat.com>
When Formatter::create replaced new_formatter, the handling of an invalid format was also incorrectly changed. When an invalid format (for instance "plain") was specified, new_formatter returned a NULL pointer, which was sometimes handled by creating a json-pretty formatter and sometimes handled differently.
A new Formatter::create prototype with a fallback argument is added; the fallback is used when it is not the empty string and the requested format is not known. This prototype is used wherever a NULL return from new_formatter used to be replaced by a json-pretty formatter.
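A sketch of how the new prototype is meant to be called; treat the exact argument order and namespace here as an assumption:

    #include "common/Formatter.h"

    // "plain" is not a known format; with the fallback argument the caller
    // gets a json-pretty formatter back instead of a NULL pointer.
    ceph::Formatter *f = ceph::Formatter::create(format,         // e.g. "plain"
                                                 "json-pretty",  // default when format is empty
                                                 "json-pretty"); // fallback when format is unknown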
http://tracker.ceph.com/issues/10547
Fixes: #10547
Signed-off-by: Loic Dachary <ldachary@redhat.com>