A ceph.conf line with "key" and no "= value" currently shows
"unexpected character while parsing putative key value,
at char N line M". There's no reason it can't be clearer.
Fixes: #4229
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
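To illustrate the clearer diagnostic for the ceph.conf change above, here is a minimal sketch of a line parser that reports a bare key explicitly. It is not Ceph's ConfFile code: `ParseResult`, `trim` and `parse_conf_line` are hypothetical names, the error wording is invented for illustration, and sections and comments are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type; not part of Ceph's config parser.
struct ParseResult {
  bool ok;
  std::string key;
  std::string value;
  std::string error;  // human-readable message when ok == false
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
  const char* ws = " \t\r\n";
  const std::size_t b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  const std::size_t e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Parse one "key = value" line. A line with a key but no '=' gets a
// specific message instead of a generic "unexpected character" error.
ParseResult parse_conf_line(const std::string& line, int line_no) {
  const std::string t = trim(line);
  const std::size_t eq = t.find('=');
  if (eq == std::string::npos) {
    return {false, t, "",
            "line " + std::to_string(line_no) + ": key '" + t +
                "' has no '= value'"};
  }
  const std::string key = trim(t.substr(0, eq));
  if (key.empty()) {
    return {false, "", "",
            "line " + std::to_string(line_no) + ": missing key before '='"};
  }
  return {true, key, trim(t.substr(eq + 1)), ""};
}
```

Calling `parse_conf_line("keyring", 7)` under these assumptions yields an error that names both the line and the offending key.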
Always close the image we opened in check_clone(), and check the
return code of the rbd_close() called before cloning.
Refs: #3958
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
A watch is a mutation, while a notify is a read. The mutations need to
pass in a proper snap context to be fully correct.
Also, make the WRITE flag implicit so the caller doesn't need to pass it
in.
Signed-off-by: Sage Weil <sage@inktank.com>
Otherwise, search_for_missing may neglect to check the missing
set for some objects assuming that if the need version is
prior to last_complete, the replica must have it.
Fixes: #4994
Signed-off-by: Samuel Just <sam.just@inktank.com>
We should let users remove xattrs as well as set them. ;) And
the check in handle_client_setlayout was totally useless -- perhaps
intended for setdirlayout?
This is a follow-on to 9f82ae60fa and
should be taken wherever it goes.
Signed-off-by: Greg Farnum <greg@inktank.com>
This was previously disallowed because Once Upon a Time, the root
inode wasn't persisted to disk and was an entirely in-memory construct. But
it's safe now, and has been for a while.
Signed-off-by: Greg Farnum <greg@inktank.com>
This cherry-pick is going in the reverse direction of normal. That's
because this direction makes for the minimal change -- this patchset
is required to fix the loss of directory layouts we were previously
seeing, but fixing it requires changing the encoding versions. So we
wrote it on top of Bobtail and let it update the struct_v's as they existed
then. Note that here we change a few encoding versions in ways which are
NOT COMPATIBLE with previous development code (releases are not affected).
In particular, development code introduced file_layout_policy_t and this
patch removes it, and some of the CInode and EMetaBlob encoding struct_v
values that meant one thing in development code now mean something
different due to the Bobtail patch.
Remove the default_file_layout struct, which was just a ceph_file_layout,
and store it in the inode_t. Rip out all the annoying code that put this
on the heap.
To aid in this usage, add a clear_layout() function to inode_t.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 36ed407e0f)
Conflicts:
src/mds/CInode.cc
src/mds/CInode.h
src/mds/MDCache.cc
src/mds/Server.cc
src/mds/events/EMetaBlob.h
Cherry-pick-Reviewed-by: Sage Weil <sage@inktank.com>
Use qi to parse a strictly formatted set of key/value pairs. Be picky
about whitespace. Any subset of recognized keys is allowed. Parse the
same set of keys as the ceph.*.layout.* vxattrs.
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5551aa5b3b)
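The strictness described in the layout-parsing commit above can be sketched without Boost. The commit itself uses qi; the hand-rolled parser below is a swapped-in stand-in using only the standard library. The input shape (`key=value` pairs separated by exactly one space), the key names `stripe_unit`, `stripe_count`, `object_size` and `pool`, the rejection of duplicate keys, and the acceptance of an empty string as the empty subset are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Parse "key=value key=value ..." strictly: one space between pairs,
// no whitespace around '=', only recognized keys, no duplicates.
// Returns false on any deviation; *out may then be partially filled.
bool parse_layout(const std::string& in,
                  std::map<std::string, std::string>* out) {
  static const std::set<std::string> known = {
      "stripe_unit", "stripe_count", "object_size", "pool"};
  out->clear();
  if (in.empty()) return true;  // assumption: empty input is the empty subset
  std::size_t pos = 0;
  for (;;) {
    const std::size_t end = in.find(' ', pos);
    const std::string pair = in.substr(
        pos, end == std::string::npos ? std::string::npos : end - pos);
    const std::size_t eq = pair.find('=');
    if (eq == std::string::npos || eq == 0 || eq + 1 == pair.size())
      return false;
    const std::string key = pair.substr(0, eq);
    const std::string val = pair.substr(eq + 1);
    if (known.count(key) == 0 || out->count(key) != 0) return false;
    for (unsigned char c : val) {
      // Numeric fields must be all digits; pool names must not hold whitespace.
      if (key == "pool" ? std::isspace(c) != 0 : std::isdigit(c) == 0)
        return false;
    }
    (*out)[key] = val;
    if (end == std::string::npos) return true;
    pos = end + 1;
    if (pos == in.size()) return false;  // trailing space
  }
}
```

Rejecting doubled spaces and spaces around `=` outright, rather than skipping them, is what "picky about whitespace" is taken to mean here.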
This was causing librados to unblock after the ACK on unwatch, which meant
that librbd users raced and tried to delete the image before the unwatch
change was committed, and got EBUSY. See #3958.
The watch operation has a similar problem.
Signed-off-by: Sage Weil <sage@inktank.com>
The omap portion of the clone happened above in DBObjectMap::clone.
Only the fs stored attrs need to be explicitly copied.
Signed-off-by: Samuel Just <sam.just@inktank.com>
When a range request is made for more than rgw_get_obj_max_req_size
bytes the first returned chunk sets 'ret' to STATUS_PARTIAL_CONTENT and
all remaining chunks behave as if there is an error state and only
return a minimal header.
Fix this by passing STATUS_PARTIAL_CONTENT to set_req_state_err, but
leave the 'ret' member variable untouched.
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit c83a01d4e8)
nginx seems to be providing a CONTENT_LENGTH environment variable with no data
when the request body is empty.
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
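A defensive way to read CONTENT_LENGTH in the nginx case above is to treat an unset or empty variable the same way. The sketch below is not the rgw code: `parse_content_length` is a hypothetical helper, and mapping an empty value to a length of 0 is an assumption about the intended behavior.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Returns the request body length, 0 when the variable is unset or empty
// (assumed to mean "no body"), and -1 when the value is malformed.
long long parse_content_length(const char* s) {
  if (s == nullptr || *s == '\0') return 0;
  errno = 0;
  char* end = nullptr;
  const long long n = std::strtoll(s, &end, 10);
  // Reject overflow, non-numeric input, trailing junk, and negative values.
  if (errno != 0 || end == s || *end != '\0' || n < 0) return -1;
  return n;
}
```

A caller would pass the result of `getenv("CONTENT_LENGTH")` directly, since `getenv` returns a null pointer when the variable is absent.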
The only way for a parent to disappear is a racing flatten completing,
or possibly in the future the image being forcibly removed. In either
case, continuing to flatten makes no sense, so stop early.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Image metadata like snapshots, size, and parent is frequently read,
but rarely updated. During flatten, we were depending on the parent
lock to prevent the parent ImageCtx from disappearing out from under
us while we read from it. The copy-up path also needed the parent lock
to be able to read from the parent image, which led to a deadlock.
Convert parent_lock, snap_lock, and md_lock to RWLocks, and change
their use to read instead of exclusive locks where appropriate. The
main place exclusive locks are needed is in ictx_refresh, so this is
pretty simple. This fixes the deadlock, since parent_lock is only
needed for read access in both flatten and the copy-up operation.
cache_lock and refresh_lock are only really used for exclusive access,
so leave them as regular mutexes.
One downside to this is that there's no way to assert is_locked()
for RWLocks, so we'll have to be very careful about changing code
in the future.
Fixes: #3665
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
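The read-mostly locking pattern from the RWLock commit above can be sketched with the standard library. This uses C++17 `std::shared_mutex` as a stand-in for Ceph's RWLock, and `ImageMeta`, `get_size` and `refresh` are hypothetical names rather than librbd's ImageCtx interface.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for frequently read, rarely updated image metadata.
class ImageMeta {
 public:
  // Frequent path: any number of readers may hold the lock concurrently,
  // so two read-side operations no longer block each other.
  std::uint64_t get_size() const {
    std::shared_lock<std::shared_mutex> l(md_lock_);
    return size_;
  }

  // Rare path (the refresh analogue): takes the lock exclusively.
  void refresh(std::uint64_t new_size) {
    std::unique_lock<std::shared_mutex> l(md_lock_);
    size_ = new_size;
  }

 private:
  mutable std::shared_mutex md_lock_;
  std::uint64_t size_ = 0;
};
```

As with the RWLocks in the commit, `std::shared_mutex` offers no way to assert that the lock is held, so the same care is needed when changing callers.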
This ensures we release our in-progress recovery counters, which prevents
recovery from getting blocked indefinitely when a pool removal races with
recovery ops.
Fixes: #4217
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
With the single-paxos patches we shifted from an approach with multiple
paxos instances (one for each paxos service) keeping their own versions
to a single paxos instance for all the paxos services, thus ending up
with a single global version for paxos.
With the release of v0.52, the monitor started tracking these global
versions, keeping them for the single purpose of making it possible to
convert the store to a single-paxos format.
This patch now introduces a mechanism to convert a GV-enabled store to
the single-paxos format store when the monitor is upgraded.
As we require the global versions to be present, we first check if the
store has the GV feature set: if not we will not proceed, but we will
start the conversion otherwise.
At the end of the conversion, the monitor data directory will have a
brand new 'store.db' directory, where the key/value store lies,
alongside the old store. This makes it possible to revert to a
previous monitor version if things go sideways, without jeopardizing the
data in the store.
The conversion is done during a rolling upgrade, without any
intervention by the user. Fire up the new monitor version on an old
store, and the monitor itself will convert the store, trim any lingering
versions that might not be required, and proceed to start as expected.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
This tool will convert an old monitor store format (bobtail) to the new
key/value store-backed, single-paxos format.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The init() function always implicitly created a new store if it was
missing.
This patch makes init() a private function accepting a bool that
specifies whether or not to create the store if it does not exist, and
creates two functions: open() and create_and_open().
open() will fail if the store we are trying to open does not exist;
create_and_open() keeps the previous behavior of init() and will create
the store if it does not exist before opening it.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
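The open()/create_and_open() split described above might look roughly like the following. This is a sketch under stated assumptions: `Store` is a hypothetical class, the global set stands in for the filesystem, and returning `-ENOENT` for a missing store is an assumed convention, not something the commit message states.

```cpp
#include <cassert>
#include <cerrno>
#include <set>
#include <string>

// Toy "disk": the set of store paths that exist. Purely illustrative.
static std::set<std::string> g_existing_stores;

class Store {
 public:
  explicit Store(const std::string& path) : path_(path) {}

  // Fails if the store does not already exist.
  int open() { return init(false); }

  // Previous init() behavior: create the store when it is missing.
  int create_and_open() { return init(true); }

  bool is_open() const { return opened_; }

 private:
  int init(bool create_if_missing) {
    if (g_existing_stores.count(path_) == 0) {
      if (!create_if_missing) return -ENOENT;  // assumed error code
      g_existing_stores.insert(path_);
    }
    opened_ = true;
    return 0;
  }

  std::string path_;
  bool opened_ = false;
};
```

Keeping init() private forces every caller to state whether creating a missing store is acceptable.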
Synchronize two monitor stores when one of the monitors has diverged
significantly from the remaining monitor cluster.
This process roughly consists of the following steps:
0. mon.X tries to join the cluster;
1. mon.X verifies that it has diverged from the remaining cluster;
2. mon.X asks the leader to sync;
3. the leader allows mon.X to sync, pointing out a mon.Y from
which mon.X should sync;
4. mon.X asks mon.Y to sync;
5. mon.Y sends its own store in one or more chunks;
6. mon.X acks each received chunk; go to 5;
7. mon.X receives the last chunk from mon.Y;
8. mon.X informs the leader that it has finished synchronizing;
9. the leader acks mon.X's finished sync;
10. mon.X bootstraps and retries joining the cluster (goto 0.)
This is the simplest and most straightforward process that can be hoped
for. However, things may go sideways at any time (monitors failing, for
instance), which could potentially lead to a corrupted monitor store.
There are, however, mechanisms at work to avoid such a scenario at any
step of the process.
Some of these mechanisms include:
- aborting the sync if the leader fails or leadership changes;
- state barriers on synchronization functions to avoid stray/outdated
messages from interfering with normal monitor behavior or an ongoing
synchronization;
- store clean-up before any synchronization process starts;
- store clean-up if a sync process fails;
- resuming sync from a different monitor mon.Z if mon.Y fails mid-sync;
- several timeouts to guarantee that all the involved parties are still
alive and participating in the sync effort;
- request forwarding when mon.X contacts a monitor outside the quorum
that might know who the leader is (or might know someone who does)
[4].
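The happy path of the numbered steps, together with two of the safety mechanisms listed above (aborting on leader change or timeout, and state barriers against stray messages), can be sketched as a small state machine from mon.X's point of view. The state and event names are hypothetical and do not correspond to the monitor's actual states or message types; resuming from a different monitor and request forwarding are left out.

```cpp
#include <cassert>

// Hypothetical names; the monitor's real states and messages differ.
enum class SyncState { Probing, AskedLeader, Chunks, Finishing, Done, Aborted };
enum class SyncEvent {
  Diverged,           // step 1: mon.X sees it has diverged (and asks, step 2)
  LeaderGranted,      // steps 3-4: leader names a provider, mon.X asks it
  ChunkReceived,      // steps 5-6: a chunk arrives and is acked
  LastChunkReceived,  // steps 7-8: final chunk, mon.X informs the leader
  LeaderAckedFinish,  // step 9: leader acks; mon.X then bootstraps (step 10)
  LeaderChanged,
  Timeout
};

SyncState next_state(SyncState s, SyncEvent e) {
  const bool syncing = s == SyncState::AskedLeader ||
                       s == SyncState::Chunks || s == SyncState::Finishing;
  // Leadership changes and timeouts abort any sync in progress.
  if (syncing && (e == SyncEvent::LeaderChanged || e == SyncEvent::Timeout))
    return SyncState::Aborted;
  switch (s) {
    case SyncState::Probing:
      if (e == SyncEvent::Diverged) return SyncState::AskedLeader;
      break;
    case SyncState::AskedLeader:
      if (e == SyncEvent::LeaderGranted) return SyncState::Chunks;
      break;
    case SyncState::Chunks:
      if (e == SyncEvent::ChunkReceived) return SyncState::Chunks;
      if (e == SyncEvent::LastChunkReceived) return SyncState::Finishing;
      break;
    case SyncState::Finishing:
      if (e == SyncEvent::LeaderAckedFinish) return SyncState::Done;
      break;
    case SyncState::Done:
    case SyncState::Aborted:
      break;
  }
  return s;  // state barrier: stray or outdated events change nothing
}
```

In this sketch an aborted sync would be followed by store clean-up and a fresh attempt from Probing, which the transition function deliberately leaves to the caller.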
Changes:
- Adapt the MMonProbe message for the single-paxos approach, dropping
the version map and using a lower and upper bound version instead.
- Remove old slurp code.
- Add 'sync force' command; 'sync_force' through the admin socket.
Notes:
[1] It's important to keep track of the paxos version at the time at
which a store sync starts. Given that after the sync we end up with
the same state as the monitor we are synchronizing from, there is a
chance that we might end up with an uncommitted paxos version if we
are synchronizing with the leader (there's some paxos stashing done
prior to commit on the leader). By keeping track of the version at
which the sync started, we can then tell the requester at which
version it should cap its paxos store.
[2] Furthermore, the enforced paxos cap, described in [1], is even more
important if we consider the need to reapply the paxos versions that
were received during the sync, to make sure the paxos store is
consistent. If we happened to have some yet-uncommitted version in
the store, we could end up applying it.
[3] What is described in [1] and [2]:
Fixes: #4026
Fixes: #4037
Fixes: #4040
[4] Whenever a given monitor mon.X is on the probing phase and notices
that there is a mon.Y with a paxos version considerably higher than
the one mon.X has, then mon.X will attempt to synchronize from
mon.Y. This is the basis for the store sync. While this holds true,
there is a chance that, by the time mon.Y handles the sync request
from mon.X, mon.Y is already attempting a sync itself with some other
mon.Z. In this case, the appropriate thing for mon.Y to do is to
forward mon.X's request to mon.Z, as mon.Z should be part of the
quorum, know who the leader is, or be the leader itself -- if not, it
is at least guaranteed that mon.Z has a higher version than both mon.X
and mon.Y, so it should be okay to sync from it.
Fixes: #4162
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The monitor's synchronization process requires a specific message type
to carry the required information. Since this process significantly
differs from slurping, reusing the MMonProbe message is not an option as
it would require major changes and, for all intents and purposes, it
would be far outside the scope of the MMonProbe message.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
We created an interface specific to the MonitorDBStore, which can be used
to create iterators to obtain chunks for sync.
Two different iterators were defined: one that will iterate over the whole
store, focusing on the specified set of prefixes; another that will
iterate over only one specific prefix.
These two different iterators allow us to build the sync process in two
distinct phases: 1) obtain all key/value pairs for paxos and all paxos
services, bundle them in chunks and send them over the wire; and 2) obtain
all the paxos versions, bundle them in chunks and send them over the wire.
Also, we are currently considering a chunk to be (at most) 1 MB worth of
data, although it can be tuned using the 'mon_sync_max_payload_size' option.
mon: MonitorDBStore: add crc support when --mon-sync-debug is set
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
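The chunking described in the iterator commit above (bundling key/value pairs into chunks of at most roughly 1 MB) can be sketched as follows. This is not the MonitorDBStore iterator interface: `KV`, `Chunk` and `make_chunks` are hypothetical, the payload size is approximated as key length plus value length, and a single pair larger than the cap is given its own oversized chunk as a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;
using Chunk = std::vector<KV>;

// Greedily bundle pairs, in order, into chunks whose payload stays at or
// below max_bytes (the analogue of 'mon_sync_max_payload_size').
std::vector<Chunk> make_chunks(const std::vector<KV>& pairs,
                               std::size_t max_bytes) {
  std::vector<Chunk> chunks;
  Chunk cur;
  std::size_t cur_bytes = 0;
  for (const KV& kv : pairs) {
    const std::size_t sz = kv.first.size() + kv.second.size();
    // Close the current chunk if adding this pair would exceed the cap.
    if (!cur.empty() && cur_bytes + sz > max_bytes) {
      chunks.push_back(std::move(cur));
      cur.clear();  // restore a well-defined empty state after the move
      cur_bytes = 0;
    }
    cur.push_back(kv);
    cur_bytes += sz;
  }
  if (!cur.empty()) chunks.push_back(std::move(cur));
  return chunks;
}
```

Running this once over the paxos and service prefixes and once over the paxos versions would mirror the two phases in the commit message, with each resulting chunk sent as one message.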