When the shutdown/stop flag is set, continue to work through the queue.
Process events, but discard messages. This avoids losing, on shutdown,
the reset events that are necessary to clean up ref cycles.
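A minimal sketch of the intended drain behavior, using illustrative types rather than the real dispatch-queue code:

```cpp
#include <deque>

// Toy model of the dispatch loop's shutdown behavior: keep draining the
// queue; deliver events (e.g. resets) but drop plain messages once the
// stop flag is set. Kind/Item are illustrative, not real Ceph types.
enum class Kind { MESSAGE, RESET_EVENT };
struct Item { Kind kind; int id; };

// Returns the ids of items actually delivered.
std::deque<int> drain(std::deque<Item>& q, bool stopped) {
  std::deque<int> delivered;
  while (!q.empty()) {
    Item i = q.front();
    q.pop_front();
    if (stopped && i.kind == Kind::MESSAGE)
      continue;                  // discard messages after stop...
    delivered.push_back(i.id);   // ...but still deliver reset events
  }
  return delivered;
}
```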
Signed-off-by: Sage Weil <sage@inktank.com>
Use the atomic pipe link removal as a signal that we are the one failing
the con and use that to queue the reset event.
This fixes the case where we have an open, the session gets set up via the
handle_accept callback, and then we race with another connection and go into
wait + close, or just close. In that case, fault() needs to queue a reset
event to match the accept.
Signed-off-by: Sage Weil <sage@inktank.com>
This gives the ms_handle_reset call a chance to clean up (for example, by
breaking a con->priv <-> session reference cycle).
Signed-off-by: Sage Weil <sage@inktank.com>
Make RefCountedObject a private parent of Connection so that users are
forced to use ConnectionRef whenever references are taken.
Many methods can still take a raw Connection* when they are using the
caller's reference but not taking their own; this is cheaper than
twiddling the reference count, and the lifetime is still well defined.
Local variables generally use ConnectionRef, though.
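A minimal sketch of the idea, with a hand-rolled intrusive ref standing in for the real RefCountedObject/ConnectionRef machinery (all names and the `live_connections` counter are illustrative; the real code would use something like boost::intrusive_ptr):

```cpp
int live_connections = 0;  // test hook: counts live Connection objects

struct RefCountedObject {
  int nref = 1;
  RefCountedObject* get() { ++nref; return this; }
  void put() { if (--nref == 0) delete this; }
  virtual ~RefCountedObject() {}
};

// Private inheritance: outside code cannot reach get()/put(), so raw
// users can't take (or leak) references by hand.
class Connection : private RefCountedObject {
  friend class ConnectionRef;  // only the smart pointer may touch nref
public:
  Connection() { ++live_connections; }
  ~Connection() { --live_connections; }
  int peer = 0;
};

// Minimal intrusive ref; copy takes a ref, destruction drops one.
class ConnectionRef {
  Connection* c;
public:
  explicit ConnectionRef(Connection* p) : c(p) {}  // adopts the initial ref
  ConnectionRef(const ConnectionRef& o) : c(o.c) { if (c) c->get(); }
  ConnectionRef& operator=(const ConnectionRef&) = delete;
  ~ConnectionRef() { if (c) c->put(); }
  Connection* operator->() const { return c; }
  Connection* raw() const { return c; }  // borrow caller's ref, no count change
};

// Exercise copy + destruction; true if the bookkeeping kept exactly one
// object alive and then freed it.
bool ref_lifetime_ok() {
  {
    ConnectionRef a(new Connection);
    ConnectionRef b(a);                  // second ref, same object
    if (live_connections != 1) return false;
    a->peer = 7;
    if (b->peer != 7) return false;
  }
  return live_connections == 0;
}
```

Passing `raw()` to a callee that only uses the caller's reference is the cheap path described above; holding a `ConnectionRef` local is the safe default.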
Signed-off-by: Sage Weil <sage@inktank.com>
This patch adds an "open-by-ino" helper. It utilizes the backtrace to find
the inode's path and open the inode. The algorithm looks like:
1. Check MDS peers. If any MDS has the inode in its cache, goto step 6.
2. Fetch the backtrace. If the backtrace was previously fetched and we get
the same backtrace again, return -EIO.
3. Traverse the path in the backtrace. If the inode is found, goto step 6;
if a non-auth dirfrag is encountered, goto the next step. If we fail to
find the inode in its parent dir, goto step 1.
4. Request MDS peers to traverse the path in backtrace. If the inode
is found, goto step 6. If MDS peer encounters non-auth dirfrag, it
stops traversing. If any MDS peer fails to find the inode in its
parent dir, goto step 1.
5. Use the same algorithm to open the inode's parent. Goto step 3 if it
succeeds; goto step 1 if it fails.
6. Return the inode's auth MDS ID.
The algorithm has two main assumptions:
1. If an inode is in its auth MDS's cache, its on-disk backtrace
can be out of date.
2. If an inode is not in any MDS's cache, its on-disk backtrace
must be up to date.
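The loop above can be sketched against a toy in-memory model (the Sim struct, all types, and the rank-0 stand-in for "the auth MDS loads it" are illustrative assumptions, not the real MDCache code; only the peer-check, stale-backtrace, and path-walk steps are modeled):

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Ino = unsigned long;

struct Sim {
  std::vector<std::set<Ino>> mds_cache;            // per-rank cached inodes
  std::map<Ino, std::map<std::string, Ino>> dirs;  // on-disk dir contents
  // ancestors, nearest parent first: (parent dir ino, dentry name)
  std::map<Ino, std::vector<std::pair<Ino, std::string>>> backtraces;
};

// Step 1: if any MDS has the inode cached, that rank is (assumed) auth.
int check_peers(const Sim& s, Ino ino) {
  for (std::size_t r = 0; r < s.mds_cache.size(); ++r)
    if (s.mds_cache[r].count(ino))
      return (int)r;
  return -1;
}

// Steps 1-3 with the repeated-backtrace check from step 2.
// Returns the auth rank, or -5 (EIO) when the backtrace is stale.
int open_by_ino(const Sim& s, Ino ino) {
  std::vector<std::pair<Ino, std::string>> prev_bt;
  while (true) {
    int r = check_peers(s, ino);               // step 1
    if (r >= 0) return r;                      // step 6: cached somewhere
    auto it = s.backtraces.find(ino);          // step 2: fetch backtrace
    if (it == s.backtraces.end()) return -5;   // no backtrace at all
    if (it->second == prev_bt) return -5;      // same stale backtrace twice
    prev_bt = it->second;
    // step 3: walk from the root-most ancestor toward the inode
    for (auto bt = it->second.rbegin(); bt != it->second.rend(); ++bt) {
      auto d = s.dirs.find(bt->first);
      if (d == s.dirs.end()) break;            // dirfrag gone: back to step 1
      auto e = d->second.find(bt->second);
      if (e == d->second.end()) break;         // dentry gone: back to step 1
      if (e->second == ino)
        return 0;  // found on disk; rank 0 stands in for the loading auth MDS
    }
  }
}
```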
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Send ping requests to both the front and back hb addrs for peer osds. If
the front hb addr is not present, do not ping it and interpret a reply
as coming from both. This handles the transition from old to new OSDs
seamlessly.
Note both the front and back rx times. Both need to be up to date in order
for the peer to be healthy.
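A minimal sketch of the per-peer bookkeeping this implies (illustrative struct, not the real OSD heartbeat code):

```cpp
#include <cstdint>

// Toy per-peer heartbeat state for the front/back rule.
struct HeartbeatPeer {
  bool has_front_addr;        // old OSDs advertise no front hb addr
  uint64_t last_rx_front = 0;
  uint64_t last_rx_back = 0;

  // A reply arrives on one interface; with no front addr, a back reply
  // stands in for both, so old peers still look healthy.
  void note_reply(bool on_front, uint64_t now) {
    if (on_front || !has_front_addr)
      last_rx_front = now;
    if (!on_front)
      last_rx_back = now;
  }

  // Healthy only if BOTH rx times are within the grace period.
  bool healthy(uint64_t now, uint64_t grace) const {
    return now - last_rx_front <= grace && now - last_rx_back <= grace;
  }
};
```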
Signed-off-by: Sage Weil <sage@inktank.com>
We already have a throttler that lets us limit the amount of memory
consumed by messages from a given source. Currently this is based only
on the size of the message payload. Add a second throttler that limits
the number of messages so that we can effectively throttle small requests
as well.
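A sketch of the pairing, using a simplified non-blocking throttle (the real Throttle can block; all names here are illustrative):

```cpp
#include <cstdint>

// Simplified counting throttle: try_get either takes the whole cost
// or takes nothing.
struct Throttle {
  uint64_t max, cur = 0;
  explicit Throttle(uint64_t m) : max(m) {}
  bool try_get(uint64_t c) {
    if (max && cur + c > max) return false;
    cur += c;
    return true;
  }
  void put(uint64_t c) { cur -= c; }
};

// A message must pass BOTH throttlers before being dispatched; the
// count throttler catches floods of tiny requests the byte throttler
// would wave through.
struct MsgThrottler {
  Throttle bytes;  // limits total payload in flight
  Throttle msgs;   // limits number of in-flight messages
  MsgThrottler(uint64_t max_bytes, uint64_t max_msgs)
      : bytes(max_bytes), msgs(max_msgs) {}

  bool admit(uint64_t payload) {
    if (!msgs.try_get(1)) return false;
    if (!bytes.try_get(payload)) {
      msgs.put(1);  // roll back the count ref on byte-throttle failure
      return false;
    }
    return true;
  }
  void release(uint64_t payload) {
    bytes.put(payload);
    msgs.put(1);
  }
};
```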
Signed-off-by: Sage Weil <sage@inktank.com>
We go to the trouble to exchange our seq numbers during the handshake, but
the bit that then avoids resending old messages was broken because we have
already called requeue_sent() before we get to this point. Fix it by discarding
queued items (in the high prio slot) that we don't need to resend, and
adjust out_seq as needed.
Drop the optional arg to requeue_sent() now that it is unused.
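The discard step can be sketched as follows (illustrative types; the real code operates on the out queue's high-priority list):

```cpp
#include <cstdint>
#include <deque>

struct QueuedMsg { uint64_t seq; };  // seq 0 would mean "never sent"

// After the handshake tells us the peer already received everything up
// to newly_acked_seq, drop the already-delivered messages requeue_sent()
// put back, and bump out_seq for each so the numbering stays aligned.
void discard_requeued_up_to(std::deque<QueuedMsg>& q,
                            uint64_t& out_seq,
                            uint64_t newly_acked_seq) {
  while (!q.empty() && q.front().seq > 0 &&
         q.front().seq <= newly_acked_seq) {
    q.pop_front();  // peer already has this one; don't resend
    ++out_seq;      // account for it as sent
  }
}
```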
Signed-off-by: Sage Weil <sage@inktank.com>
The HealthMonitor builds upon the QuorumService interface, and should be
used to keep track of any and all relevant information about the monitor
cluster (maybe even about the whole cluster if need be).
This patch also introduces the HealthService interface, used to define
a HealthMonitor service, responsible for dispatching 'MMonHealth' messages
(the QuorumService interface dispatches generic 'Message').
Based on the HealthService interface, we introduce the DataHealthService
class, a service that will track disk space consumption by the monitors,
warn when a given threshold is crossed, and gracefully shutdown the monitor
if disk space usage hits critical levels that might affect the correct
monitor behavior.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Back in commit 6339c5d439, we tried to make
this deal with a race between a faulting pipe and new messages being
queued. The sequence is
- fault starts on pipe
- fault drops pipe_lock to unregister the pipe
- user (objecter) queues new message on the con
- submit_message reopens a Pipe (due to this bug)
- the message managed to make it out over the wire
- fault finishes faulting, calls ms_reset
- user (objecter) closes the con
- user (objecter) resends everything
It appears as though the previous patch *meant* to drop *m on the floor in
this case, which is what this patch does. And that fixes the crash I am
hitting; see #4271.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Performance tests on high-end machines have indicated the Linux autotuning
of the receive buffer sizes can cause throughput collapse. See bug
#2100, and this email discussion:
http://marc.info/?l=ceph-devel&m=133009796706284&w=2
Initially default the option to 0, which leaves the system default
(autotuning) in place. We may adjust this default in the future.
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The monitor's synchronization process requires a specific message type
to carry the required information. Since this process significantly
differs from slurping, reusing the MMonProbe message is not an option as
it would require major changes and, for all intents and purposes, it
would be far outside the scope of the MMonProbe message.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Two problems.
First, we need to cap the tokens per bucket. Otherwise, a stream of
items at one priority over time will indefinitely inflate the tokens
available at another priority. The cap should represent how "bursty"
we allow a given bucket to be. Start with 4MB for now.
Second, set a floor on the item cost. Otherwise, we can have an
infinite queue of 0-cost items that starve other queues. More
realistically, we need to balance the overhead of processing small items
with the cost of large items. I.e., a 4 KB item is not 1/1000th as
expensive as a 4 MB item.
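A sketch of one priority bucket with both fixes applied (illustrative struct, not the real prioritized-queue code):

```cpp
#include <algorithm>
#include <cstdint>

// One priority's token bucket: tokens are capped (bounding how bursty
// the bucket may be) and every item has a minimum cost (so floods of
// tiny items can't starve other buckets).
struct TokenBucket {
  uint64_t tokens = 0;
  uint64_t max_tokens;  // burst cap, e.g. 4 MB
  uint64_t min_cost;    // floor applied to every item's cost

  TokenBucket(uint64_t cap, uint64_t floor_cost)
      : max_tokens(cap), min_cost(floor_cost) {}

  void put_tokens(uint64_t t) {
    // Cap instead of inflating indefinitely while other buckets drain.
    tokens = std::min(tokens + t, max_tokens);
  }
  uint64_t cost_of(uint64_t item_size) const {
    return std::max(item_size, min_cost);  // a 0-byte item still costs
  }
  bool try_consume(uint64_t item_size) {
    uint64_t c = cost_of(item_size);
    if (tokens < c) return false;
    tokens -= c;
    return true;
  }
};
```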
Signed-off-by: Sage Weil <sage@inktank.com>
If we
- negotiate cephx AND
- are a server AND
- cephx require signatures = true
then require the MSG_AUTH feature bit. Put this in the Policy struct for
this connection so that the existing feature bit checks and error reporting
are used, and the peer knows what feature it is missing.
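A sketch of the check (the feature-bit value and all names are placeholders, not the real Ceph definitions):

```cpp
#include <cstdint>

// Placeholder feature bit; the real CEPH_FEATURE_MSG_AUTH value differs.
constexpr uint64_t FEATURE_MSG_AUTH = 1ull << 20;

struct Policy {
  uint64_t features_required = 0;
};

// Fold the signature requirement into the policy's required features so
// the existing feature-bit checks and error reporting do the work, and
// the peer learns which feature it is missing.
void apply_signature_policy(Policy& p,
                            bool negotiated_cephx,
                            bool is_server,
                            bool require_signatures) {
  if (negotiated_cephx && is_server && require_signatures)
    p.features_required |= FEATURE_MSG_AUTH;
}
```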
Signed-off-by: Sage Weil <sage@inktank.com>
We cannot trust the Message bufferlists or other structures to be
stable without pipe_lock, as another Pipe may claim and modify the sent
list items while we are writing to the socket.
Related to #3678.
Signed-off-by: Sage Weil <sage@inktank.com>
Fill out the Message header, footer, and calculate CRCs during
encoding, not write_message(). This removes most modifications from
Pipe::write_message().
Signed-off-by: Sage Weil <sage@inktank.com>
This modifies bufferlists in the Message struct, and it is possible
for multiple instances of the Pipe to get references on the Message;
make sure they don't modify those bufferlists concurrently.
Signed-off-by: Sage Weil <sage@inktank.com>
Associate a sending message with the connection inside the pipe_lock.
This way, if a racing thread tries to steal these messages, it will
be sure to reset the con pointer *after* we do, so that the con
pointer is valid in encode_payload() (and later).
This may be part of #3678.
Signed-off-by: Sage Weil <sage@inktank.com>