796 Commits

Author SHA1 Message Date
Sage Weil
2ed9f5a96f osd: include osdmap epoch in osd_op message operator<<
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-02-28 16:22:46 -08:00
Sage Weil
0cbe406f93 osd: show retry attempt in MOSDOp operator<<
Signed-off-by: Sage Weil <sage@inktank.com>
2013-02-28 13:33:40 -08:00
Sage Weil
0be28af0bb Merge remote-tracking branch 'gh/next'
2013-02-24 13:22:47 -08:00
Sage Weil
0cd215ee5b mds: reencode MDSMap in MMDSMap if MDSENC feature is not present
In some cases the MMDSMap message from mon -> client passes from leader ->
peon -> client, and the leader doesn't encode with the correct feature
bits.  As with MMOSDMap, we reencode the nested MDSMap based on the
features if relevant bits are not present.

We forgot to include this with the mds encoding changes.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-02-23 16:36:52 -08:00
Joao Eduardo Luis
beafca57fb Merge branch 'wsp.bobtail.2merge' into wsp.bobtail.master
Conflicts:
	src/.gitignore
	src/Makefile.am
	src/include/ceph_features.h
	src/mon/MDSMonitor.cc
	src/mon/PGMonitor.cc
2013-02-21 18:04:22 +00:00
Joao Eduardo Luis
cab3411b4a mon: Monitor: Add monitor store synchronization support
Synchronize two monitor stores when one of the monitors has diverged
significantly from the remaining monitor cluster.

This process roughly consists of the following steps:

  0. mon.X tries to join the cluster;
  1. mon.X verifies that it has diverged from the remaining cluster;
  2. mon.X asks the leader to sync;
  3. the leader allows mon.X to sync, pointing out a mon.Y from
     which mon.X should sync;
  4. mon.X asks mon.Y to sync;
  5. mon.Y sends its own store in one or more chunks;
  6. mon.X acks each received chunk; go to 5;
  7. mon.X receives the last chunk from mon.Y;
  8. mon.X informs the leader that it has finished synchronizing;
  9. the leader acks mon.X's finished sync;
 10. mon.X bootstraps and retries joining the cluster (goto 0.)
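
The steps above can be read as a small state machine on the requesting
monitor (mon.X). The sketch below is illustrative only: the state names,
event names and next_state() are hypothetical and are not taken from the
Ceph source. Failure handling (timeouts, leader changes) is left out.

```cpp
#include <cassert>

// Hypothetical states of the requesting monitor (mon.X).
enum class SyncState {
  Probing,        // steps 0-1: trying to join, checking for divergence
  AskedLeader,    // step 2: waiting for the leader's permission
  AskedProvider,  // step 4: waiting for the first chunk from mon.Y
  Receiving,      // steps 5-6: receiving and acking chunks
  Finishing,      // step 8: told the leader that the sync is done
  Bootstrapping   // step 10: bootstrap and retry joining
};

// Hypothetical events, one per message or observation mon.X reacts to.
enum class SyncEvent {
  Diverged, LeaderGranted, Chunk, LastChunk, LeaderAckedFinish
};

// Advance mon.X by one step. Unexpected events leave the state unchanged.
SyncState next_state(SyncState s, SyncEvent e) {
  switch (s) {
  case SyncState::Probing:
    return e == SyncEvent::Diverged ? SyncState::AskedLeader : s;
  case SyncState::AskedLeader:
    // step 3: the leader points us at mon.Y
    return e == SyncEvent::LeaderGranted ? SyncState::AskedProvider : s;
  case SyncState::AskedProvider:
  case SyncState::Receiving:
    if (e == SyncEvent::Chunk) return SyncState::Receiving;
    if (e == SyncEvent::LastChunk) return SyncState::Finishing;  // step 7
    return s;
  case SyncState::Finishing:
    // step 9: the leader acks the finished sync
    return e == SyncEvent::LeaderAckedFinish ? SyncState::Bootstrapping : s;
  case SyncState::Bootstrapping:
    return s;
  }
  return s;
}
```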

This is the simplest and most straightforward process that can be hoped
for. However, things may go sideways at any time (monitors failing, for
instance), which could potentially lead to a corrupted monitor store.
There are, however, mechanisms at work to avoid such a scenario at any
step of the process.

Some of these mechanisms include:

 - aborting the sync if the leader fails or leadership changes;
 - state barriers on synchronization functions to prevent stray/outdated
   messages from interfering with the normal monitor behavior or an
   ongoing synchronization;
 - store clean-up before any synchronization process starts;
 - store clean-up if a sync process fails;
 - resuming sync from a different monitor mon.Z if mon.Y fails mid-sync;
 - several timeouts to guarantee that all the involved parties are still
   alive and participating in the sync effort;
 - request forwarding when mon.X contacts a monitor outside the quorum
   that might know who the leader is (or might know someone who does)
   [4].

Changes:
  - Adapt the MMonProbe message for the single-paxos approach, dropping
    the version map and using a lower and upper bound version instead.
  - Remove old slurp code.
  - Add 'sync force' command; 'sync_force' through the admin socket.

Notes:

[1] It's important to keep track of the paxos version at the time at
    which a store sync starts.  Given that after the sync we end up with
    the same state as the monitor we are synchronizing from, there is a
    chance that we might end up with an uncommitted paxos version if we
    are synchronizing with the leader (there's some paxos stashing done
    prior to commit on the leader).  By keeping track of the version at
    which the sync started, we can then tell the requester at which
    version it should cap its paxos store.

[2] Furthermore, the enforced paxos cap, described in [1], is even more
    important if we consider the need to reapply the paxos versions that
    were received during the sync, to make sure the paxos store is
    consistent.  If we happened to have some yet-uncommitted version in
    the store, we could end up applying it.

[3] What is described in [1] and [2]:

Fixes: #4026
Fixes: #4037
Fixes: #4040

[4] Whenever a given monitor mon.X is in the probing phase and notices
    that there is a mon.Y with a paxos version considerably higher than
    the one mon.X has, then mon.X will attempt to synchronize from
    mon.Y.  This is the basis for the store sync.  However true this
    may be, there is a chance that, by the time mon.Y handles the sync
    request from mon.X, mon.Y is already attempting a sync itself with
    some other mon.Z.  In this case, the appropriate thing for mon.Y to
    do is to forward mon.X's request to mon.Z, as mon.Z should be part
    of the quorum, know who the leader is, or be the leader itself.  If
    not, it is at least guaranteed that mon.Z has a higher version than
    both mon.X and mon.Y, so it should be okay to sync from it.

Fixes: #4162

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-02-21 18:02:22 +00:00
Joao Eduardo Luis
6db25a3885 message: MMonSync: Monitor Synchronization message
The monitor's synchronization process requires a specific message type
to carry the required information. Since this process significantly
differs from slurping, reusing the MMonProbe message is not an option as
it would require major changes and, for all intents and purposes, it
would be far outside the scope of the MMonProbe message.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-02-21 18:02:22 +00:00
Joao Eduardo Luis
a5e2dcb33d mon: Single-paxos and key/value store support
We are converting the monitor subsystem to a Single-Paxos architecture,
backed by a key/value store. The previous architecture used a Paxos
instance for each Paxos Service, backed by a nasty Monitor Store that
provided few to no consistency guarantees whatsoever, which led to a fair
amount of workarounds.

Changes:

* Paxos:
  - Add k/v store support
  - Add documentation describing the new Paxos storage layout and behavior
  - Get rid of the stashing code, which was used as a consistency point
    mechanism (we no longer need it, because of our k/v store)
  - Debug level of 30 will output json-formatted transaction dumps
  - Allows for proposal queueing, to be proposed in the same order as
    they were queued.
  - Remove the 'is_leader()' function and use the Monitor's instead,
    for simplicity.
  - Add 'is_lease_valid()' function.
  - Disregard 'stashed versions'
  - Make the paxos 'state' variable a bit-map, so we lock the proposal
    mechanism while maintaining the state [5].
  - Related notes: [3]

* PaxosService:
  - Add k/v store support, creating wrappers to be used by the services
  - Add documentation
  - Support single-paxos behavior, creating wrappers to be used by the
    services and service-specific version
  - Rearrange variables so they are neatly organized in the beginning of
    the class
  - Add a trim_to() function to be used by the services, instead of letting
    them rely on Paxos::trim_to(), which is no longer adequate to the job
    at hand
  - Debug level of 30 will output json-formatted transaction dumps
  - Support proposal queueing, taking it into consideration when
    assessing the current state of the service (active, writeable,
    readable, ...)
  - Redefine the conditions for 'is_{active,readable,writeable}()' given
    the new single-paxos approach, with proposal queueing [1].
  - Use our own waiting_for_* callback lists, which now must be
    dissociated from their Paxos counterparts [2].
  - Related notes: [3], [4]

* Monitor:
  - Add k/v store support
  - Use only one Paxos instance and pass it down to each service instance
  - Crank up CEPH_MON_PROTOCOL to 10

* {Auth,Log,MDS,Monmap,OSD,PG}Monitor:
  - Add k/v store support
  - Add single-paxos support

* AuthMonitor:
  - Don't always propose full versions: if the KeyServer doesn't have
    keys, we cannot propose a full version. This should only happen when
    we start with a brand new store and we are creating the first
    pending proposal, and if we were to commit a full version filled
    with nothing but a big void of nothingness, we could eventually end
    up with a corrupted version.

* Elector:
  - Add k/v store support
  - Add single-paxos support

* ceph-mon:
  - Use the monitor's k/v store instead of MonitorStore

* MMonPaxos:
  - remove the machine_id field: This field was used to identify from/to
    which paxos service a given message belonged. We no longer have a Paxos
    for each service, so this field became obsolete.

Notes:

[1] Redefine the conditions for 'is_{active,readable,writeable}()' on
    the PaxosService class, to be used with single-paxos and proposal
    queueing:

  We should not rely on the Paxos::is_*() functions, since they do not apply
  directly to the PaxosService.

  All the PaxosService classes share the same Paxos class, but they do not
  rely on its values. Each service relies on, uses and updates only its own
  values on the k/v store. Thus, we may have a given service (e.g., the
  OSDMonitor) proposing a new value, hence updating or waiting to update its
  store, and we may still consider the LogMonitor as being able to read and
  write its own values on the k/v store. In a nutshell, different services
  do not overlap on their access to their own store when it comes to reading,
  and since the Paxos will queue their updates and deal with them in a FIFO
  order, their updates won't overlap either.

  Therefore, the conditions for the PaxosService::is_{active,readable,
  writeable} differ from those on the Paxos::is_{active,readable,writeable}.

  * PaxosService::is_active() - the PaxosService will be considered as
  active iff it is not proposing and the Paxos is not recovering. This
  means that a given PaxosService (e.g., the OSDMonitor) may be considered
  as being active even though some other service (e.g., the LogMonitor) is
  proposing a new value and the Paxos is in the UPDATING state. This means
  that the OSDMonitor will be able to read its own versions and queue any
  changes on to the Paxos. However, if the Paxos is in the RECOVERING
  state, we cannot be considered as active.

  * PaxosService::is_writeable() - We will be able to propose new values
  iff we are the Leader, we have a valid lease, and we are not already
  proposing. If we are proposing, we must wait for our proposal to finish
  in order to proceed with writing to our k/v store; otherwise we could
  end up assuming that our last committed version was, say, 10, then
  assign map epochs/versions taking that into consideration and make
  changes to the store based on those values, only to overwrite
  previously proposed values on the store. We really don't want that. To
  be fair, there was a chance we could assume we were always writable,
  but there may be unforeseen consequences to this; so we take the
  conservative approach here for now, and we will relax it in the future
  if we believe it to be fruitful.

  * PaxosService::is_readable() - We will be readable iff we are not
  proposing and the Paxos is not recovering; if our last committed version
  exists; and if we are either a cluster of one or we have a valid lease.
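
  The three conditions above can be written down as plain predicates. This
  is a minimal sketch under stated assumptions: ServiceView and its field
  names are hypothetical and only mirror the inputs named in the text; the
  real PaxosService queries the Monitor and Paxos objects instead.

```cpp
#include <cassert>

// Hypothetical snapshot of the inputs the three conditions depend on.
struct ServiceView {
  bool proposing;              // this service has a proposal in flight
  bool paxos_recovering;       // the shared Paxos is in the RECOVERING state
  bool is_leader;              // this monitor is the Leader
  bool lease_valid;            // we hold a valid lease
  bool last_committed_exists;  // our last committed version is in the store
  bool cluster_of_one;         // the monitor cluster has a single member
};

// Active iff we are not proposing and the Paxos is not recovering.
bool is_active(const ServiceView& v) {
  return !v.proposing && !v.paxos_recovering;
}

// Writeable iff Leader, valid lease, and not already proposing.
bool is_writeable(const ServiceView& v) {
  return v.is_leader && v.lease_valid && !v.proposing;
}

// Readable iff not proposing, Paxos not recovering, the last committed
// version exists, and we are a cluster of one or hold a valid lease.
bool is_readable(const ServiceView& v) {
  return !v.proposing && !v.paxos_recovering &&
         v.last_committed_exists &&
         (v.cluster_of_one || v.lease_valid);
}
```

  Note that none of the predicates looks at whether some other service is
  proposing, which is the point of the redefinition.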

[2] Use own waiting_for_* callback lists on PaxosService, which now must
    be dissociated from their Paxos counterparts:

  We were relying on Paxos to wait for state changes, but since our state
  became somewhat independent from the Paxos state, we have to deal with
  callbacks waiting for 'readable', 'writable' or 'active' on different
  terms than those that Paxos provides.

  So, basically, we will take one of two approaches when it comes to waiting:

  * If we are proposing, queue ourselves on our own list, waiting for the
  proposal to finish;
  * Otherwise, the cause for the need to wait comes from Paxos, so queue
  the callback directly on Paxos.

  This approach means that we must make sure to check our desired state
  whenever the callback is fired up, and re-queue ourselves if the state
  didn't quite change (or if it changed but our waiting condition result
  didn't). For instance, if we were waiting for a proposal to finish due to
  a failed 'is_active()', we will need to recheck if we are active before
  continuing once the callback is fired. This is mainly because we may have
  finished our proposal, but a new Election may have been called and the
  Paxos may not be active.
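
  A minimal sketch of this wait-and-recheck pattern follows. WaitLists and
  run_when_active() are hypothetical names made up for illustration; the
  real code uses its own callback (Context) types and list names.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>

using Callback = std::function<void()>;

// Hypothetical holder for the two kinds of wait list described above.
struct WaitLists {
  std::list<Callback> service_waiters;  // fired when our proposal finishes
  std::list<Callback> paxos_waiters;    // fired on Paxos state changes
};

// Run `work` now if the service is active; otherwise queue it on the list
// that matches the reason we have to wait. Returns true if `work` ran.
// A queued callback is expected to go through this function again when it
// fires, which gives the "recheck and re-queue" behaviour.
bool run_when_active(WaitLists& w, bool proposing, bool paxos_recovering,
                     Callback work) {
  const bool active = !proposing && !paxos_recovering;
  if (active) {
    work();
    return true;
  }
  if (proposing)
    w.service_waiters.push_back(std::move(work));  // wait for our proposal
  else
    w.paxos_waiters.push_back(std::move(work));    // the cause is Paxos
  return false;
}
```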

[3] Propose everything in the queue before bootstrapping, but don't
    allow new proposals:

  The MonmapMonitor may issue bootstraps once it is updated. We must ensure
  that we propose every single pending proposal before we actually do it.

  However, we don't want to propose if we are going to bootstrap; otherwise,
  we may end up losing proposals.

[4] Handle the case when first_committed_version equals 0 on a
    PaxosService

  In a nutshell, the services do not set the first committed version, as
  they consider it as a SEP (Somebody Else's Problem). They do rely on it
  though, and we, the PaxosService, must ensure that it contains a valid
  value (that is, higher than zero) at all times.

  Since we will only have a first_committed version equal to zero once,
  and that is before the service's first proposal, we are safe to simply
  read the variable from the store and assign the first_committed the same
  value as the last_committed iff the first_committed version is zero.

  This also affects trimming, since trimming relies on the first_committed
  version as the lower bound for version trimming. Even though the k/v store
  will gracefully ignore any problem from trying to remove non-existent
  versions, the main issue would still stand: we'd be removing a non-existent
  version and that just doesn't make any sense.
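
  The rule above amounts to a one-line fix-up when the versions are read
  from the store. The helper name below is hypothetical, and version_t is
  assumed here to be a 64-bit unsigned integer.

```cpp
#include <cassert>
#include <cstdint>

using version_t = uint64_t;  // assumed width, for illustration only

// If the service never set first_committed (it is still zero), fall back
// to last_committed so that callers always see a valid lower bound once
// something has been committed.
version_t effective_first_committed(version_t first_committed,
                                    version_t last_committed) {
  return first_committed == 0 ? last_committed : first_committed;
}
```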

[5] 'lock' paxos when we are running some internal proposals

  Force the paxos services to wait for us to complete whatever we are
  doing before they can proceed.  This is required because on certain
  occasions we might need to run internal proposals that are not tied to
  any of the paxos services (for instance, when learning an old value),
  and we need the services to stay put, or they might end up in an
  erroneous state and crash the monitor.

  This could have been done with an extra bool, but there was no point
  in creating a new variable when we can just as easily reuse the
  'state' variable for our twisted interests.
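
  As a rough illustration of reusing 'state' as a bit-map, the sketch below
  adds a lock bit next to the ordinary state bits. The flag names and bit
  positions are assumptions made for this example, not the values used in
  the Paxos class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values; the real state names and layout may differ.
enum PaxosStateBits : uint32_t {
  STATE_RECOVERING = 1u << 0,
  STATE_ACTIVE     = 1u << 1,
  STATE_UPDATING   = 1u << 2,
  STATE_LOCKED     = 1u << 3   // set while an internal proposal runs
};

struct PaxosState {
  uint32_t bits = STATE_ACTIVE;

  // Setting or clearing the lock bit leaves the other state bits intact.
  void lock()   { bits |= STATE_LOCKED; }
  void unlock() { bits &= ~static_cast<uint32_t>(STATE_LOCKED); }

  bool is_locked() const { return (bits & STATE_LOCKED) != 0; }
  bool is_active() const { return (bits & STATE_ACTIVE) != 0; }

  // Services may only proceed while Paxos is active and not locked.
  bool can_propose() const { return is_active() && !is_locked(); }
};
```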

Fixes: #4175

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-02-21 18:02:14 +00:00
Joao Eduardo Luis
5659a4eb1d mon: Remove global version code introduced around bobtail's release
This patch reverts most of the global version (gv) related patches that
were introduced around bobtail's release as a prelude to the single-paxos
patches.

The gv infrastructure allowed us to gather version information on the
monitors, essential to the move to a single-paxos implementation on
existing clusters -- this means that for an existing cluster to upgrade
to a single-paxos monitor, it will first have to be upgraded to a
version prior to this patch.  This patch strips the monitor subsystem of
all the gv-related code that is of no use for upcoming versions.

Furthermore, from this patch onwards until all single-paxos patches
are merged, ceph-mon won't work as expected, and may not compile at some
point in the git history.

These patches are not retro-compatible, and the monitors are not expected
to work with earlier versions.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-02-12 13:22:33 +00:00
Greg Farnum
8864c73027 Merge branch 'wip-mds-encode-rebased'
Reviewed-by: Sage Weil <sage@inktank.com>
2013-02-11 22:02:40 -08:00
Greg Farnum
e7bc4b8d59 mds: cap_reconnect_t uses modern encoding
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-02-08 13:58:41 -08:00
Danny Al-Gaaf
b1fc10ef93 messages/MOSDRepScrub.h: initialize member variable in constructor
Initialize chunky and deep bool member variables in the constructor
with false.
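
For illustration, the pattern looks roughly like this. RepScrubFlags is a
hypothetical stand-in for the message class; only the member names chunky
and deep come from the text above.

```cpp
#include <cassert>

// Hypothetical stand-in for the message class.
struct RepScrubFlags {
  bool chunky;
  bool deep;

  // Without the initializer list both members would hold indeterminate
  // values after default construction.
  RepScrubFlags() : chunky(false), deep(false) {}
};
```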

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-02-06 08:42:04 -08:00
Greg Farnum
771204b22d mds: move conditional MDSMap encoding into single encode method
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-02-05 13:29:06 -08:00
Yan, Zheng
f4abf00af5 mds: rejoin remote wrlocks and frozen auth pin
Includes remote wrlocks and frozen authpin in cache rejoin strong message

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-29 10:17:36 +08:00
Sage Weil
a1bf8220e5 osd: set PULL subop cost to size of requested data
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-22 14:47:40 -08:00
Sage Weil
5a384f48bf Merge branch 'wip-mds'
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 21:05:05 -08:00
Joao Eduardo Luis
aa40de9088 messages: add MTimeCheck
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-11 00:44:21 +00:00
Yan, Zheng
b3796f46a4 mds: introduce DROPLOCKS slave request
In some rare cases, Locker::acquire_locks() drops all acquired locks
in order to auth pin new objects. But Locker::drop_locks() only drops
explicitly acquired remote locks; it does not drop objects' version
locks that were implicitly acquired on the remote MDS. These leftover
locks break the locking order when re-acquiring locks and may cause
a deadlock.

The fix is to introduce a DROPLOCKS slave request, which drops all
acquired locks on the remote MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Sage Weil
64b845f6ba features is uint64_t
This won't bite us for a while yet (we're on bit 26), but it will soon!
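
A short sketch of why the width matters: feature bit 32 and above simply
vanish when a feature mask is stored in a 32-bit integer. The helper names
below are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Build a 64-bit mask with only bit n set (n must be below 64).
constexpr uint64_t feature_bit(unsigned n) {
  return static_cast<uint64_t>(1) << n;
}

// Test whether bit n is set in a 64-bit feature mask.
bool has_feature(uint64_t features, unsigned n) {
  return (features & feature_bit(n)) != 0;
}

// What happens if the mask is stored in a 32-bit variable: the upper
// 32 bits are silently dropped.
uint32_t truncate_features(uint64_t features) {
  return static_cast<uint32_t>(features);
}
```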

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-27 17:24:54 -08:00
Sage Weil
2fbe3e17d6 Merge remote-tracking branch 'gh/next'
2012-12-27 17:15:29 -08:00
Sage Weil
af37cc3a87 Merge remote-tracking branch 'gh/wip-mds'
2012-12-27 13:40:01 -08:00
Sage Weil
f1dfd64f72 messages/MOSDOpReply: remove misleading may_read/may_write
These are OpRequest properties, calculated/enforced at the OSD.  They don't
belong in the MOSDOp or MOSDOpReply messages.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 12:12:40 -08:00
Sage Weil
03f6dfa46e osd: move rmw_flags to OpRequest, out of MOSDOp
It was very sloppy to put a server-side processing state inside the
message.  Move it to the OpRequestRef instead.

Note that the client was filling in bogus data that was then lost during
encoding/decoding; clean that up.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 12:12:40 -08:00
Yan, Zheng
0002546205 mds: fix race between send_dentry_link() and cache expire
The MDentryLink message can race with cache expire. When it arrives at
the target MDS, it's possible there is no corresponding dentry in
the cache. If this race happens, we should expire the replica inode
encoded in the MDentryLink message. But to expire an inode, the MDS
needs to know which subtree the inode belongs to, so modify the
MDentryLink message to include this information.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Sage Weil
61d43af747 osd: make MOSDFailure output more sensible
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 15:21:18 -08:00
Sage Weil
25ea06969f osd: make pool_stat_t encoding backward compatible with v0.41 and older
In particular, this is the encoding that is used in precise.

Fixes: #3212
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-08 09:24:38 -08:00
Yan, Zheng
0fa487585e mds: fix freeze inode deadlock
CInode::freeze_inode() is used in the case of cross authority rename.
Server::handle_slave_rename_prep() calls it to wait for all other
operations on source inode to complete. This happens after all locks
for the rename operation are acquired. But to acquire locks, we need
to auth pin the locks' parent objects first. So there is an ABBA
deadlock if someone auth pins the source inode after locks for rename
are acquired and before Server::handle_slave_rename_prep() is called.
The fix is to freeze and auth pin the source inode at the same time.

This patch introduces CInode::freeze_auth_pin(). It waits for all
other MDRequests to release auth pins, then changes the inode to the
FROZENAUTHPIN state; this state prevents other MDRequests from
getting new auth pins.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-01 12:52:23 -08:00
Danny Al-Gaaf
ec2f261762 messages/MClientRequest.h: remove twice included sys/types.h
Fix includes: remove twice included sys/types.h

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2012-11-28 08:25:43 -08:00
Joao Eduardo Luis
f5029074da messages: MLog: make ctor's uuid argument a const
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
Mike Ryan
d2c6d44b27 message: add MRecoveryReserve
This message will be used to reserve and release recovery slots on
replica PGs.

Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
2012-11-01 01:05:01 -07:00
Mike Ryan
23dbe3ecae message: add missing print statement for REJECT message
Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
2012-11-01 01:05:01 -07:00
Sage Weil
ad6840ce5c Merge branch 'wip-3301'
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-16 21:06:00 -07:00
Gary Lowell
d78ba6af94 Merge branch 'next'
2012-10-16 23:27:21 +00:00
Sage Weil
b290dc3a30 MClientRequest: fix mode formatting
Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-16 11:11:16 -07:00
Josh Durgin
c8721b956c Merge branch 'wip-osd-caps'
Conflicts:
	src/osd/OSDCap.cc
	src/test/osd/osdcap.cc

Reviewed-by: Sage Weil <sage.weil@inktank.com>
2012-10-05 16:21:12 -07:00
Josh Durgin
20496b8d2b OSD: separate class caps from normal read/write
This properly accounts for multi-op requests. Use MOSDOp->rmw_flags for
internal caps requirements, leaving MOSDOp->flags for client specified
options. Use accessors so the flags don't need to be known by the callers.

Also separate capability checks (need_*_cap) from the nature of the MOSDOp
(may_{read,write}). This preserves the semantics of may_{read,write},
which are used in several places outside of capability checks.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-04 18:36:07 -07:00
Sage Weil
aed3612f87 MOSDBoot: fix compatibility with ~argonaut
I revved this message and forgot to set the compat version correctly,
preventing post-change (e.g., bobtail) OSDs from talking to pre-change
(e.g., argonaut) monitors.  This was in b64641c.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-04 17:01:33 -07:00
Josh Durgin
2e366ea8aa OSD: deprecate CLS_METHOD_PUBLIC flag
Remove all existing usage, but leave the definition so third-party
class plugins don't break.

The public flag let *any* user execute a class method, as long
as they had read and/or write access as the method required. This is
better managed by the new osd caps infrastructure, and it was
entirely undocumented and unused, so it should be safe to remove.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-02 15:43:37 -07:00
Sage Weil
6f7067f489 mon: avoid large pass by value in MForward
CID 717035: Big parameter passed by value (PASS_BY_VALUE)
At (1): Passing parameter caps of type MonCaps (size 144 bytes) by value.
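
The usual fix for this class of warning is to take the parameter by const
reference. The sketch below uses hypothetical stand-in types (BigCaps,
Forward) rather than the real MonCaps and MForward definitions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a large type such as the 144-byte MonCaps.
struct BigCaps {
  unsigned char bytes[144];
};

class Forward {
  BigCaps caps;
public:
  // Before: explicit Forward(BigCaps c) copied the argument into the
  // parameter and then again into the member. Taking a const reference
  // leaves a single copy, into the member.
  explicit Forward(const BigCaps& c) : caps(c) {}

  const BigCaps& get_caps() const { return caps; }
};
```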

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-29 01:35:36 -07:00
Sage Weil
d2cbe1fb6d MOSDFailure: avoid big pass by value
CID 727975: Big parameter passed by value (PASS_BY_VALUE)
At (1): Passing parameter f of type entity_inst_t (size 152 bytes) by value.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-28 13:18:06 -07:00
Sage Weil
cb0d9690a7 MMonJoin: avoid large pass by value
CID 717036: Big parameter passed by value (PASS_BY_VALUE)
At (1): Passing parameter a of type entity_addr_t (size 136 bytes) by value.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-28 13:18:06 -07:00
Sage Weil
e92b92b2b6 MRoute: avoid pass by value
CID 717038: Big parameter passed by value (PASS_BY_VALUE)
At (1): Passing parameter i of type entity_inst_t (size 152 bytes) by value.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-28 13:18:05 -07:00
Sage Weil
02e48394cc messages: uninit values
CID 717259: Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "global_id" is not initialized in this constructor nor in any functions that it calls.

CID 728086: Uninitialized scalar field (UNINIT_CTOR)
At (4): Non-static class member "type" is not initialized in this constructor nor in any functions that it calls.

CID 717260: Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "from" is not initialized in this constructor nor in any functions that it calls.

CID 717261: Uninitialized scalar field (UNINIT_CTOR)
At (51): Non-static class member field "head.time_warp_seq" is not initialized in this constructor nor in any functions that it calls.

+ more

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-28 13:18:05 -07:00
Sage Weil
a351f7a1f4 Merge remote-tracking branch 'gh/wip_backfill_full2'
Conflicts:
	src/include/ceph_features.h
2012-09-27 13:21:23 -07:00
Mike Ryan
c689556896 PG, OSD: reject backfills when an OSD is nearly full
Reject backfills when an OSD reaches a configurable full ratio. Retry
backfilling periodically in the hopes that the OSD has become less full.

This changeset introduces two configuration options for dealing with
this: osd_refuse_backfill_full_ratio and osd_backfill_retry_interval.
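
As a sketch of the check implied by osd_refuse_backfill_full_ratio: the
function name, the byte-count inputs and the handling of a zero-capacity
OSD are assumptions made for this example, as is the use of >= for
"reaches".

```cpp
#include <cassert>
#include <cstdint>

// Return true if a backfill request should be rejected because the OSD
// has reached the configured full ratio.
bool should_reject_backfill(uint64_t used_bytes, uint64_t total_bytes,
                            double refuse_full_ratio) {
  if (total_bytes == 0)
    return true;  // assumption: treat unknown capacity as full
  const double ratio =
      static_cast<double>(used_bytes) / static_cast<double>(total_bytes);
  return ratio >= refuse_full_ratio;
}
```

A rejected backfill would then be retried after
osd_backfill_retry_interval, which is not modelled here.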

We also introduce two new state transitions in the PG's Active state.

Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
2012-09-26 11:57:31 -07:00
Sage Weil
577184dd5b Merge remote-tracking branch 'gh/wip-mon-gv'
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Conflicts:
	src/include/ceph_features.h
2012-09-26 09:47:18 -07:00
Samuel Just
b54a0a252c osd/: add backfill reservations
Previously, a new osd would be bombarded by backfills from many osds
simultaneously, resulting in excessively high load.  Instead, we
want to limit the number of backfills coming into and going out
from a single osd.

To that end, each OSDService now has two AsyncReserver instances: one
for backfills going from the osd (local_reserver) and one for backfills
going to the osd (remote_reserver).  For a primary to initiate a
backfill, it must first obtain a reservation from its own
local_reserver.  Then, it must obtain a reservation from the backfill
target's remote_reserver via a MBackfillReserve message. This process is
managed by substates of Active and ReplicaActive (see the changes in
PG.h).  The reservations are dropped either on the Backfilled event
(which is sent on the primary before calling recovery_complete and on
the replica on receipt of the BackfillComplete progress message), or
upon leaving Active or ReplicaActive.

It's important that we always grab the local reservation before the
remote reservation in order to prevent a circular dependency.
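
The ordering rule can be sketched as follows. This is a deliberately
simplified, synchronous model: SlotReserver is a hypothetical stand-in for
AsyncReserver, the MBackfillReserve round trip is replaced by a direct
call, and releasing the local slot when the remote side refuses is an
assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical synchronous stand-in for the asynchronous reserver.
struct SlotReserver {
  unsigned max_slots;
  unsigned in_use;

  explicit SlotReserver(unsigned max) : max_slots(max), in_use(0) {}

  bool try_reserve() {
    if (in_use >= max_slots) return false;
    ++in_use;
    return true;
  }
  void release() {
    if (in_use > 0) --in_use;
  }
};

// Always take the local reservation first, then the remote one, so that
// two OSDs backfilling each other cannot wait on each other in a cycle.
bool reserve_for_backfill(SlotReserver& local, SlotReserver& remote) {
  if (!local.try_reserve())
    return false;
  if (!remote.try_reserve()) {
    local.release();  // assumption: give the local slot back on refusal
    return false;
  }
  return true;
}
```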

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-09-25 11:43:47 -07:00
Sage Weil
db04ce46f5 mon: make MRoute encoding backwards-compatible
If the target has the NULLROUTE feature, use a new encoding that explicitly
indicates whether a message follows.  If the feature is absent, use the
old encoding.  The mon is responsible for not trying to send a null reply
if the target does not have the feature.
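
A minimal sketch of feature-gated encoding along these lines. The bit
position of FEATURE_NULLROUTE, the one-byte flag and the byte-vector
output are assumptions for illustration; the real MRoute uses Ceph's
bufferlist encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bit position; the real feature bit may differ.
constexpr uint64_t FEATURE_NULLROUTE = 1ull << 0;

// Encode an optional payload for a peer. msg == nullptr means "null
// reply". Returns false if the peer cannot be sent this message.
bool encode_route(const std::string* msg, uint64_t peer_features,
                  std::vector<uint8_t>& out) {
  if (peer_features & FEATURE_NULLROUTE) {
    // New encoding: a flag byte says whether a message follows.
    out.push_back(msg ? 1 : 0);
    if (msg) out.insert(out.end(), msg->begin(), msg->end());
    return true;
  }
  // Old encoding: the message always follows, so a null reply cannot be
  // expressed and must not be sent to this peer.
  if (!msg) return false;
  out.insert(out.end(), msg->begin(), msg->end());
  return true;
}
```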

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-19 11:49:34 -07:00
Sage Weil
d328a28cc6 mon: send 'null' reply to requests we won't reply to
This is a no-op if the client was talking to us, but in the forwarded
request case will clean up the request state (and request message) on the
forwarding monitor.  Otherwise, MOSDFailure messages (and probably others)
can accumulate on the non-leader mon indefinitely.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-18 14:39:00 -07:00
Sage Weil
b64641c3dd osd: include boot_epoch in MOSDBoot
This will let the monitor infer whether we were wrongly marked down or
the daemon restarted.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-09-18 14:38:59 -07:00