mirror of
https://github.com/ceph/ceph
synced 2024-12-24 12:24:19 +00:00
07ac5d3e74
git-svn-id: https://ceph.svn.sf.net/svnroot/ceph@1068 29311d96-e01e-0410-9327-a35deaab8ce9
201 lines
6.6 KiB
Plaintext
201 lines
6.6 KiB
Plaintext
|
|
|
|
AUTHORITY
|
|
|
|
The authority maintains a list of what nodes cache each inode.
|
|
Additionally, each replica is assigned a serial (normally 0) to
|
|
disambiguate multiple replicas of the same item (see below).
|
|
|
|
set<int> cached_by;
|
|
map<int, int> cached_by_serial;
|
|
|
|
The cached_by set _always_ includes all nodes that cache the
|
|
partcuarly inode, but may additionally include nodes that used to
|
|
cache it but no longer do. In those cases, an expire message should
|
|
be in transit.
|
|
|
|
|
|
REPLICA
|
|
|
|
The replica maintains a notion of who it believes is the authority for
|
|
each replicated inode. There are two possibilities:
|
|
|
|
- Ordinarily, this notion is correct.
|
|
- If the part of the file system in question was recently exported to
|
|
a new MDS, the inodes old authority is acting as a CACHEPROXY,
|
|
and will forward relevant messages on to the authority.
|
|
|
|
When a repica is expired from cache, and expire is sent to the
|
|
authority. The expire includes the serial number issued when the
|
|
replica was originally created to disambiguate potentially concurrent
|
|
replication activity.
|
|
|
|
|
|
EXPORTS
|
|
|
|
- The old authority suddenly becomes a replica. It's serial is well
|
|
defined. It also becomes a CACHEPROXY, which means its cached_by
|
|
remains defined (with an alternate meaning!). While a proxy, the
|
|
node will forward relevant messages from the replica to the
|
|
authority (but not the other way around--the authority knows all
|
|
replicas).
|
|
|
|
- Once the export is acked, the old authority sends a
|
|
message to the replica notifying it of the new authority. As soon
|
|
as all replicas acknowedge receipt of this notice, the old authority
|
|
can cease CACHEPROXY responsibilities and become a regular replica.
|
|
At this point it's cached_by is no longer defined.
|
|
|
|
- Replicas always know who the authority for the inode is, OR they
|
|
know prior owner acting as a CACHEPROXY. (They don't know which it
|
|
is.)
|
|
|
|
|
|
CACHED_BY
|
|
|
|
The authority always has an inclusive list of nodes who cache an item.
|
|
As such it can confidently send updates to replicas for locking,
|
|
invalidating, etc. When a replica is expired from cache, an expire is
|
|
sent to the authority. If the serial matches, the node is removed
|
|
from the cached_by list.
|
|
|
|
|
|
|
|
|
|
|
|
SUBTREE AUTHORITY DELEGATION: imports versus hashing
|
|
|
|
Authority is generally defined recursively: an inode's authority
|
|
matches the containing directory, and a directory's authority matches
|
|
the directory inode's. Thus the authority delegation chain can be
|
|
broken/redefined in two ways:
|
|
|
|
- Imports and exports redefine the directory inode -> directory
|
|
linkage, such that the directory authority is explicitly specified
|
|
via dir.dir_auth:
|
|
|
|
dir.dir_auth == -1 -> directory matches its inode
|
|
dir.dir_auth >= 0 -> directory authority is dir.dir_auth
|
|
|
|
- Hashed directories redefine the directory -> inode linkage. In
|
|
non-hashed directories, inodes match their containing directory.
|
|
In hashed directories, each dentry's authority is defined by a hash
|
|
function.
|
|
|
|
inode.hash_seed == 0 -> inode matches containing directory
|
|
inode.hash_seed > 0 -> defined by hash(hash_seed, dentry)
|
|
|
|
A directory's "containing_import" (bad name, FIXME) is either the
|
|
import or hashed directory that is responsible for delegating a
|
|
subtree. Note that the containing_import of a directory may be itself
|
|
because it is an import, but it cannot be itself because it is hashed.
|
|
|
|
Thus:
|
|
|
|
- Import and export operations' manipulation of dir_auth is
|
|
completely orthogonal to hashing operations. Hashing methods can
|
|
ignore dir_auth, except when they create imports/exports (and break
|
|
the inode<->dir auth linkage).
|
|
|
|
- Hashdirs act sort of like imports in that they bound an
|
|
authoritative region. That is, either hashdirs or imports can be
|
|
the key for nested_exports. In some cases, a dir may be both an
|
|
import and a hash.
|
|
|
|
- Export_dir won't export a hashdir. This is because it's tricky
|
|
(tho not necessarily impossible) due to the way nested_exports is
|
|
used with imports versus hashdirs.
|
|
|
|
|
|
|
|
|
|
FREEZING
|
|
|
|
There are two types of freezing:
|
|
|
|
- TREE: recursively freezes everything nested beneath a directory,
|
|
until an export of edge of cache is reached.
|
|
- DIR: freezes the contents of a single directory.
|
|
|
|
Some notes:
|
|
|
|
- Occurs on the authoritative node only.
|
|
|
|
- Used for suspending critical operations while migrating authority
|
|
between nodes or hashing/unhashing directories.
|
|
|
|
- Freezes the contents of the cache such that items may not be added,
|
|
items cannot be auth pinned, and/or subsequently reexported. The
|
|
namespace of the affected portions of the hierarchy may not change.
|
|
The content of inodes and other orthogonal operations
|
|
(e.g. replication, inode locking and modification) are unaffected.
|
|
|
|
Two states are defined: freezing and frozen. The freezing state is
|
|
used while waiting for auth_pins to be removed. Once all auth_pins
|
|
are gone, the state is changed to frozen. New auth_pins cannot be
|
|
added while freezing or frozen.
|
|
|
|
|
|
AUTH PINS
|
|
|
|
An auth pin keeps a given item on the authoritative node until it is
|
|
removed. The pins are tracked recursively, so that a subtree cannot
|
|
be frozen if it contains any auth pins.
|
|
|
|
If a pin is placed on a non-authoritative item, the item is allowed to
|
|
become authoritative; the specific restriction is it cannot be frozen,
|
|
which only happens during export-type operations.
|
|
|
|
|
|
TYPES OF EXPORTS
|
|
|
|
- Actual export of a subtree from one node to another
|
|
- A rename between directories on different nodes exports the renamed
|
|
_inode_. (If it is a directory, it becomes an export such that the
|
|
directory itself does not move.)
|
|
- A hash or unhash operation will migrate inodes within the directory
|
|
either to or from the directory's main authority.
|
|
|
|
EXPORT PROCESS
|
|
|
|
|
|
|
|
|
|
HASHING
|
|
|
|
- All nodes discover and open directory
|
|
|
|
- Prep message distributes subdir inode replicas for exports so that
|
|
peers can open those dirs. This is necessary because subdirs are
|
|
converted into exports or imports as needed to avoid migrating
|
|
anything except the hashed dir itself. The prep is needed for the
|
|
same reasons its important with exports: the inode authority must
|
|
always have the exported dir open so that it gets accurate dir
|
|
authority updates, and can keep the inode->dir_auth up to date.
|
|
|
|
- MHashDir messsage distributes the directory contents.
|
|
|
|
- While auth is frozen_dir, we can't get_or_open_dir. Otherwise the
|
|
Prep messages won't be inclusive of all dirs, and the
|
|
imports/exports won't get set up properly.
|
|
|
|
TODO
|
|
readdir
|
|
|
|
|
|
- subtrees stop at hashed dir. hashed dir's dir_auth follows parent
|
|
subtree, unless the dir is also an explicit import. thus a hashed
|
|
dir can also be an import dir.
|
|
|
|
|
|
bananas
|
|
apples
|
|
blueberries
|
|
green pepper
|
|
carrots
|
|
celery
|
|
|
|
|
|
|
|
|