ceph/branches/riccardo/monitor2/doc/caching.txt
riccardo80 07ac5d3e74 creating branch for distributed monitor
git-svn-id: https://ceph.svn.sf.net/svnroot/ceph@1068 29311d96-e01e-0410-9327-a35deaab8ce9
2007-02-01 05:43:23 +00:00

201 lines
6.6 KiB
Plaintext

AUTHORITY
The authority maintains a list of what nodes cache each inode.
Additionally, each replica is assigned a serial (normally 0) to
disambiguate multiple replicas of the same item (see below).
set<int> cached_by;
map<int, int> cached_by_serial;
The cached_by set _always_ includes all nodes that cache the
partcuarly inode, but may additionally include nodes that used to
cache it but no longer do. In those cases, an expire message should
be in transit.
REPLICA
The replica maintains a notion of who it believes is the authority for
each replicated inode. There are two possibilities:
- Ordinarily, this notion is correct.
- If the part of the file system in question was recently exported to
a new MDS, the inodes old authority is acting as a CACHEPROXY,
and will forward relevant messages on to the authority.
When a repica is expired from cache, and expire is sent to the
authority. The expire includes the serial number issued when the
replica was originally created to disambiguate potentially concurrent
replication activity.
EXPORTS
- The old authority suddenly becomes a replica. It's serial is well
defined. It also becomes a CACHEPROXY, which means its cached_by
remains defined (with an alternate meaning!). While a proxy, the
node will forward relevant messages from the replica to the
authority (but not the other way around--the authority knows all
replicas).
- Once the export is acked, the old authority sends a
message to the replica notifying it of the new authority. As soon
as all replicas acknowedge receipt of this notice, the old authority
can cease CACHEPROXY responsibilities and become a regular replica.
At this point it's cached_by is no longer defined.
- Replicas always know who the authority for the inode is, OR they
know prior owner acting as a CACHEPROXY. (They don't know which it
is.)
CACHED_BY
The authority always has an inclusive list of nodes who cache an item.
As such it can confidently send updates to replicas for locking,
invalidating, etc. When a replica is expired from cache, an expire is
sent to the authority. If the serial matches, the node is removed
from the cached_by list.
SUBTREE AUTHORITY DELEGATION: imports versus hashing
Authority is generally defined recursively: an inode's authority
matches the containing directory, and a directory's authority matches
the directory inode's. Thus the authority delegation chain can be
broken/redefined in two ways:
- Imports and exports redefine the directory inode -> directory
linkage, such that the directory authority is explicitly specified
via dir.dir_auth:
dir.dir_auth == -1 -> directory matches its inode
dir.dir_auth >= 0 -> directory authority is dir.dir_auth
- Hashed directories redefine the directory -> inode linkage. In
non-hashed directories, inodes match their containing directory.
In hashed directories, each dentry's authority is defined by a hash
function.
inode.hash_seed == 0 -> inode matches containing directory
inode.hash_seed > 0 -> defined by hash(hash_seed, dentry)
A directory's "containing_import" (bad name, FIXME) is either the
import or hashed directory that is responsible for delegating a
subtree. Note that the containing_import of a directory may be itself
because it is an import, but it cannot be itself because it is hashed.
Thus:
- Import and export operations' manipulation of dir_auth is
completely orthogonal to hashing operations. Hashing methods can
ignore dir_auth, except when they create imports/exports (and break
the inode<->dir auth linkage).
- Hashdirs act sort of like imports in that they bound an
authoritative region. That is, either hashdirs or imports can be
the key for nested_exports. In some cases, a dir may be both an
import and a hash.
- Export_dir won't export a hashdir. This is because it's tricky
(tho not necessarily impossible) due to the way nested_exports is
used with imports versus hashdirs.
FREEZING
There are two types of freezing:
- TREE: recursively freezes everything nested beneath a directory,
until an export of edge of cache is reached.
- DIR: freezes the contents of a single directory.
Some notes:
- Occurs on the authoritative node only.
- Used for suspending critical operations while migrating authority
between nodes or hashing/unhashing directories.
- Freezes the contents of the cache such that items may not be added,
items cannot be auth pinned, and/or subsequently reexported. The
namespace of the affected portions of the hierarchy may not change.
The content of inodes and other orthogonal operations
(e.g. replication, inode locking and modification) are unaffected.
Two states are defined: freezing and frozen. The freezing state is
used while waiting for auth_pins to be removed. Once all auth_pins
are gone, the state is changed to frozen. New auth_pins cannot be
added while freezing or frozen.
AUTH PINS
An auth pin keeps a given item on the authoritative node until it is
removed. The pins are tracked recursively, so that a subtree cannot
be frozen if it contains any auth pins.
If a pin is placed on a non-authoritative item, the item is allowed to
become authoritative; the specific restriction is it cannot be frozen,
which only happens during export-type operations.
TYPES OF EXPORTS
- Actual export of a subtree from one node to another
- A rename between directories on different nodes exports the renamed
_inode_. (If it is a directory, it becomes an export such that the
directory itself does not move.)
- A hash or unhash operation will migrate inodes within the directory
either to or from the directory's main authority.
EXPORT PROCESS
HASHING
- All nodes discover and open directory
- Prep message distributes subdir inode replicas for exports so that
peers can open those dirs. This is necessary because subdirs are
converted into exports or imports as needed to avoid migrating
anything except the hashed dir itself. The prep is needed for the
same reasons its important with exports: the inode authority must
always have the exported dir open so that it gets accurate dir
authority updates, and can keep the inode->dir_auth up to date.
- MHashDir messsage distributes the directory contents.
- While auth is frozen_dir, we can't get_or_open_dir. Otherwise the
Prep messages won't be inclusive of all dirs, and the
imports/exports won't get set up properly.
TODO
readdir
- subtrees stop at hashed dir. hashed dir's dir_auth follows parent
subtree, unless the dir is also an explicit import. thus a hashed
dir can also be an import dir.
bananas
apples
blueberries
green pepper
carrots
celery