This script uses the Python bindings to libcephfs and rados
to create files and check the correctness of the backtrace
written to the 'parent' xattr on the first object (if it's
a file) or inode (if it's a dir). The script includes test cases
that kill the mds at specific kill points and restart it through
teuthology using the teuthology restart task.
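As a rough illustration of where such a check would look, the sketch
below computes the object expected to carry the 'parent' xattr. It
assumes the usual naming of CephFS objects (inode number in hex, a dot,
then an eight-digit hex index) and assumes that an unfragmented
directory object follows the same form. The helper names and the
default pool names are hypothetical and are not taken from the script.

```python
def backtrace_object_name(ino, index=0):
    """Name of the object assumed to hold the 'parent' xattr.

    Assumption: objects are named '<ino hex>.<index as 8 hex digits>'.
    """
    return "%x.%08x" % (ino, index)


def backtrace_location(ino, is_dir, data_pool="data", metadata_pool="metadata"):
    """Return (pool, object name) for a file or directory inode.

    The pool names are placeholders; a real cluster may use others.
    """
    pool = metadata_pool if is_dir else data_pool
    return pool, backtrace_object_name(ino)
```

With the rados bindings, the xattr on that object could then be read
with Ioctx.get_xattr(name, 'parent') and decoded separately.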
Signed-off-by: Sam Lang <sam.lang@inktank.com>
To test the mds journal and replay behavior, and the
functionality for storing backtraces on inodes, we
add kill points to the MDS in the openc, journal replay,
and journal expire paths.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Initial support for Python bindings to libcephfs for testing
MDS backtraces with the Python script test-backtraces.py.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The second conditional for adding a new segment is always
true when the first conditional is true. Clean this up
to simply create a new segment when we've reached the end of
the current segment.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Design info:
http://www.spinics.net/lists/ceph-devel/msg11872.html
Adds a backtrace to the data pool for supporting lookup-by-ino,
storing the backtrace on the first object in the data pool
or the metadata pool for a directory, as the 'parent' xattr
on the object (named by inode) in that pool. For create, rename,
mkdir, and setlayout operations, the backtrace is
queued (with the current log segment) after the journal is
committed and the safe reply is returned to the client, but
the backtrace operation itself isn't started until the log segment is
expired.
For journal replay, we queue the backtrace so that it gets
written out on journal expire. Inodes get added to the EMetaBlob
in the fullbits list, so we queue backtraces while iterating through
the fullbits during replay.
Using setlayout or setxattr('ceph.file.layout.pool'),
the data pool for a file can be changed after it is created
but before anything is written to the file. A forwarding backtrace
is written to the old pool on a setlayout, to ensure we can always find
the latest backtrace. We store a list of old pools with the backtrace
for cleaning up all forwarding pointers of an inode.
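The forwarding scheme can be pictured with a small in-memory model.
This is only a sketch: the dict stands in for the object store, and the
function and field names (store_backtrace, forward_to, old_pools) are
made up for illustration rather than taken from the MDS code.

```python
def store_backtrace(store, pool, oid, ancestors, old_pools=()):
    """Write the real backtrace to `pool` and forwarding entries to old pools."""
    backtrace = {"pool": pool, "ancestors": list(ancestors),
                 "old_pools": list(old_pools)}
    store[(pool, oid)] = backtrace
    for old in old_pools:
        # A forwarding backtrace only says where the latest one lives.
        store[(old, oid)] = {"forward_to": pool}
    return backtrace


def lookup_backtrace(store, pool, oid):
    """Follow forwarding entries until the latest backtrace is found."""
    seen = set()
    while True:
        if pool in seen:
            raise RuntimeError("forwarding loop for %s" % oid)
        seen.add(pool)
        entry = store[(pool, oid)]
        if "forward_to" not in entry:
            return entry
        pool = entry["forward_to"]


def remove_forwarding(store, oid, backtrace):
    """Clean up every forwarding entry recorded in the old pool list."""
    for old in backtrace["old_pools"]:
        store.pop((old, oid), None)
```

Keeping the old pool list inside the backtrace is what makes the
cleanup possible without scanning every pool.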
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Add unified backtrace handling for storing a backtrace on file objects
(the first data object) and dirs. The backtrace store operation is
queued on the LogSegment (for performing the store on log segment
expire). We encode the backtrace on queue to avoid keeping a reference
around to the CInode, which may get dropped from the cache by the time
the log segment is expired (and the backtrace is written out).
Fetching the backtrace is implemented on the CInode.
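A minimal sketch of the queue-then-expire idea follows. The class and
method names are hypothetical, and JSON stands in for the real
encoding; the point is only that the segment keeps encoded bytes rather
than a reference to the inode.

```python
import json


class Inode:
    """Stand-in for a cached inode (hypothetical, not CInode)."""

    def __init__(self, ino, ancestors):
        self.ino = ino
        self.ancestors = ancestors


class LogSegment:
    def __init__(self):
        self.pending = []  # (pool, object name, encoded backtrace)

    def queue_backtrace(self, inode, pool):
        # Encode now, so nothing here refers to the inode afterwards.
        payload = json.dumps({"ino": inode.ino,
                              "ancestors": inode.ancestors}).encode()
        self.pending.append((pool, "%x.%08x" % (inode.ino, 0), payload))

    def expire(self, write_xattr):
        # Only on expire are the queued backtraces actually written out.
        while self.pending:
            pool, oid, payload = self.pending.pop(0)
            write_xattr(pool, oid, "parent", payload)
```

Because the payload is captured at queue time, the inode can be trimmed
from the cache before the segment expires.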
Also allow incrementing/decrementing the DIRTYPARENT pin ref as needed,
instead of using a state semaphore to keep track of whether it's set or
not. This allows us to remove the STATE_DIRTYPARENT field on CInode.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Flip the conditional so that snap realms are
decoded, otherwise this results in an assertion
failure of the mds when a client attempts to
reconnect.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
elist<T>::clear() calls remove(), which isn't a
method defined on elist<T> (according to git, it was never
defined). Because elist is templated and clear() was never
referenced, the template was never instantiated, and the compiler
resolved remove(T) to the C library function remove(const char *)
declared in stdio.h.
Once clear is invoked on an instance of elist<T>, we get the
compile error shown below.
The fix here is to use pop_front() instead of remove().
Compile error is:
In file included from ../../src/mds/CInode.h:22:0,
from ../../src/mds/CInode.cc:19:
../../src/include/elist.h: In instantiation of ‘void elist<T>::clear() [with T = cinode_backtrace_info_t*]’:
../../src/mds/CInode.cc:1129:20: required from here
../../src/include/elist.h:101:7: error: no matching function for call to ‘remove(cinode_backtrace_info_t*)’
../../src/include/elist.h:101:7: note: candidates are:
In file included from ../../src/mds/CInode.cc:17:0:
/usr/include/stdio.h:179:12: note: int remove(const char*)
/usr/include/stdio.h:179:12: note: no known conversion for argument 1 from ‘cinode_backtrace_info_t*’ to ‘const char*’
In file included from /usr/include/c++/4.7/algorithm:63:0,
from /usr/include/c++/4.7/backward/hashtable.h:65,
from /usr/include/c++/4.7/ext/hash_map:65,
from ../../src/include/encoding.h:292,
from ../../src/common/entity_name.h:22,
from ../../src/common/config.h:26,
from ../../src/mds/CInode.h:20,
from ../../src/mds/CInode.cc:19:
/usr/include/c++/4.7/bits/stl_algo.h:1117:5: note: template<class _FIter, class _Tp> _FIter std::remove(_FIter, _FIter, const _Tp&)
/usr/include/c++/4.7/bits/stl_algo.h:1117:5: note: template argument deduction/substitution failed:
In file included from ../../src/mds/CInode.h:22:0,
from ../../src/mds/CInode.cc:19:
../../src/include/elist.h:101:7: note: candidate expects 3 arguments, 1 provided
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
To test the backtrace attributes on objects, we need
to be able to decode the backtrace using ceph-dencoder.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Implements pin refs on the inode as a map instead of
a multiset, allowing individual ref counts to act as
real references with values that can be >1.
The pin refs are only used for debugging, but allowing
them to be >1 avoids the need for a separate state field
for things like DIRTYPARENT.
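The counting behaviour can be sketched as below. This is an
illustration with hypothetical names (PinRefs, get, put), using a
Python Counter in place of the map; it is not the CInode interface.

```python
from collections import Counter


class PinRefs:
    """Per-pin reference counts kept in a map, so a count can exceed 1."""

    def __init__(self):
        self.refs = Counter()

    def get(self, pin):
        self.refs[pin] += 1

    def put(self, pin):
        if self.refs[pin] <= 0:
            raise ValueError("put without matching get: %s" % pin)
        self.refs[pin] -= 1
        if self.refs[pin] == 0:
            del self.refs[pin]  # drop empty entries from the map

    def is_pinned(self, pin):
        return self.refs[pin] > 0
```

Since the count itself says whether DIRTYPARENT is held, no extra state
flag is needed to track it.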
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The MetaRequest holds onto inodes and dentries
for retrying unsafe requests, but those objects
might be removed from the cache (unlink for example)
causing the inode/dentry to be freed. Ensure that
the inode/dentry is never freed while the MetaRequest
holds onto it by putting/getting the refs using
set/get interfaces.
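A small sketch of the pattern, with made-up class and method names:
the request takes a ref when an inode is set and drops it when the
inode is replaced or released, so a cache removal alone cannot free it.

```python
class Inode:
    """Refcounted stand-in; `freed` marks when the last ref is dropped."""

    def __init__(self, ino):
        self.ino = ino
        self.ref = 0
        self.freed = False

    def get(self):
        self.ref += 1

    def put(self):
        self.ref -= 1
        if self.ref == 0:
            self.freed = True


class MetaRequest:
    def __init__(self):
        self._inode = None

    def set_inode(self, inode):
        # Take the new ref before dropping the old one, so setting the
        # same inode twice never lets the count reach zero in between.
        if inode is not None:
            inode.get()
        if self._inode is not None:
            self._inode.put()
        self._inode = inode

    def inode(self):
        return self._inode

    def release(self):
        self.set_inode(None)
```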
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Otherwise, when we eventually remove the temp collection, there might be
objects in the temp collection which were independently pulled into the child
pg collection. Thus, removing the old stale parent link from its temp
collection also blasts the omap entries and snap mappings for the real child
object.
Backport: bobtail
Fixes: #4452
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When removing the last instance of ceph, also remove the files
created by ceph during operation. These consist of the files
under /var/lib/ceph, /etc/ceph, and /var/log/ceph. Bug #4415.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When the checksum or footer are invalid, we will now try to
look at the next entry. If we find a valid entry, it is likely
that the journal is corrupt.
Signed-off-by: Samuel Just <sam.just@inktank.com>
header_t::committed_up_to provides a lower bound for safely committed
journal entries. If read_entry fails prior to committed_up_to, we
know we have a corrupt journal entry. Furthermore, if
journal_write_header_frequency is not 0, we will write out the
journal header once every journal_write_header_frequency
journal writes.
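The two rules can be sketched as plain predicates. The function names
and the use of sequence numbers for the comparison are assumptions made
for illustration; only committed_up_to and
journal_write_header_frequency come from the description.

```python
def read_failure_is_corruption(failed_seq, committed_up_to):
    """A read failure before committed_up_to cannot be the journal's end.

    Follows the wording above: 'prior to' is treated as strictly less.
    """
    return failed_seq < committed_up_to


def should_write_header(write_count, journal_write_header_frequency):
    """True when the header is due after the write numbered write_count.

    A frequency of 0 disables the periodic header write.
    """
    if journal_write_header_frequency == 0:
        return False
    return write_count % journal_write_header_frequency == 0
```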
Signed-off-by: Samuel Just <sam.just@inktank.com>
If queue_pos == header.max_size when we create the entry
header magic, the entry will be rejected at get_top() on
replay.
Fixes: #4436
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Otherwise:
1) expand_pg_num removes a splitting pg entry
2) peering thread grabs pg lock and starts split
3) OSD::consume_map grabs pg lock and starts removal
At step 2), we run afoul of the assert(is_splitting)
check in split_pgs. This way, the would-be splitting
pg is marked as removed prior to the splitting state
being updated.
Backport: bobtail
Fixes: #4449
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
1) Replica sends notify
2) Prior to processing notify, primary queues query to replica
3) Primary processes notify and activates sending MOSDPGLog
to replica.
4) Primary does do_notifies at end of process_peering_events
and sends the Query.
5) Replica sees MOSDPGLog and activates
6) Replica sees Query and asserts.
In the above case, the Replica should simply ignore the old
Query.
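The intended behaviour can be sketched with a toy replica. The class is
hypothetical, and it treats any Query arriving after activation as
stale; how the real code recognizes an old Query is not described here.

```python
class Replica:
    """Toy model of steps 5 and 6 above (not the PG state machine)."""

    def __init__(self):
        self.active = False
        self.ignored_queries = 0

    def handle_log(self):
        # Step 5: the MOSDPGLog from the primary activates the replica.
        self.active = True

    def handle_query(self):
        # Step 6: ignore a stale Query instead of asserting.
        if self.active:
            self.ignored_queries += 1
            return None
        return "notify"
```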
Fixes: #4050
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
I broke this in 4637752db6 when I
restructured this function. Only try to increase the max if we are
the leader.
Signed-off-by: Sage Weil <sage@inktank.com>
Determine what cluster the disk belongs to by checking the fsid defined
in /etc/ceph/*.conf. Previously we hard-coded 'ceph'.
Note that this has the nice side-effect that if we have a disk with a
bad/different fsid, we now fail to activate it. Previously, we would
mount and start ceph-osd, but the daemon would fail to authenticate
because it was part of the wrong cluster.
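A rough sketch of such a lookup is shown below. It assumes the conf
files are simple enough for Python's configparser and that fsid sits in
the [global] section; the function name is hypothetical and the actual
tool may obtain the value differently.

```python
import configparser
import glob
import os


def find_cluster_by_fsid(fsid, conf_dir="/etc/ceph"):
    """Return the cluster name whose conf declares `fsid`, else None.

    The cluster name is taken from the file name, e.g. ceph.conf -> 'ceph'.
    """
    for path in sorted(glob.glob(os.path.join(conf_dir, "*.conf"))):
        parser = configparser.ConfigParser(strict=False, interpolation=None)
        try:
            parser.read(path)
        except configparser.Error:
            continue  # skip files this simple parser cannot handle
        value = parser.get("global", "fsid", fallback=None)
        if value is not None and value.strip().lower() == fsid.lower():
            return os.path.splitext(os.path.basename(path))[0]
    return None
```

A None result corresponds to the new failure case: a disk whose fsid
matches no configured cluster is not activated.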
Fixes: #3253
Signed-off-by: Sage Weil <sage@inktank.com>
The ceph-mds.conf file moved from the ceph package to the
ceph-mds package. Add replaces/breaks statements to the
control file to handle this on upgrade.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
If the target position is already a mount point, fail to move our mount
over to it. This usually indicates that a different osd.N from a
different cluster instance is in that position.
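The check itself can be as small as the sketch below, which uses
os.path.ismount from the standard library. The function and exception
names are hypothetical and are not the ones used by the tool.

```python
import os


class MountPointBusy(Exception):
    """Raised when the target path already has something mounted on it."""


def check_target_free(path):
    """Refuse to move a mount onto a path that is already a mount point."""
    if os.path.ismount(path):
        raise MountPointBusy("%s is already a mount point" % path)
    return path
```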
Signed-off-by: Sage Weil <sage@inktank.com>
This ensures that when we then start individual mds instances, we can
stop ceph-mds-all and they will get stopped. We do the same already for
ceph-all.
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 41897fcba1)
This ensures that when we then start individual mds instances, we can
stop ceph-mds-all and they will get stopped. We do the same already for
ceph-all.
Signed-off-by: Sage Weil <sage@inktank.com>
This reverts commit 813e9fe2b4.
We run --mkfs with the osd disk mounted in a temporary location, so it is
necessary to explicitly pass in these paths.
If we want to support journals in a different location, we need to make
ceph-disk-prepare update the journal symlink accordingly, rather than control it via
the config option.
Signed-off-by: Sage Weil <sage@inktank.com>