=================
 Troubleshooting
=================


The Gateway Won't Start
=======================

If you cannot start the gateway (i.e., there is no existing ``pid``),
check whether an ``.asok`` file left behind by another user exists. If
such a file exists and there is no running ``pid``, remove the ``.asok``
file and try to start the process again. This can happen when you start
the process as the ``root`` user while the startup script tries to start
it as the ``www-data`` or ``apache`` user: the existing ``.asok`` file
prevents the script from starting the daemon.
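
The check described above can be scripted. The following is a minimal
sketch, not part of Ceph: the helper names ``list_asok_files``,
``process_running`` and ``stale_asok_files`` are hypothetical, the default
directory ``/var/run/ceph`` is an assumption based on the default socket
location, and the process scan assumes a Linux ``/proc`` filesystem.

.. code-block:: python

   import pwd
   from pathlib import Path


   def list_asok_files(run_dir="/var/run/ceph"):
       """Return (path, owner) pairs for every ``.asok`` file in run_dir."""
       found = []
       for path in sorted(Path(run_dir).glob("*.asok")):
           uid = path.stat().st_uid
           try:
               owner = pwd.getpwuid(uid).pw_name
           except KeyError:
               owner = str(uid)  # no passwd entry for this uid
           found.append((str(path), owner))
       return found


   def process_running(name="radosgw"):
       """Return True if a process with this command name appears in /proc."""
       for comm in Path("/proc").glob("[0-9]*/comm"):
           try:
               if comm.read_text().strip() == name:
                   return True
           except OSError:
               continue  # the process exited while we were scanning
       return False


   def stale_asok_files(run_dir="/var/run/ceph", name="radosgw"):
       """List socket files that remain although no such process is running."""
       return [] if process_running(name) else list_asok_files(run_dir)

The sketch only reports files and never deletes anything. It lists every
``.asok`` file in the directory, which may include sockets that belong to
other Ceph daemons, so review the owner and file name before removing a file.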

The ``radosgw`` init script (``/etc/init.d/radosgw``) also accepts a verbose
argument that can provide some insight into what the issue might be::

  /etc/init.d/radosgw start -v

or ::

  /etc/init.d/radosgw start --verbose

HTTP Request Errors
===================

Examining the web server's access and error logs is usually the first
step in identifying what is going on.  A 500 error usually indicates a
problem communicating with the ``radosgw`` daemon.  Ensure that the daemon
is running, that its socket path is configured, and that the web server
is looking for it in the proper location.


Crashed ``radosgw`` Process
===========================

If the ``radosgw`` process dies, you will normally see a 500 error
from the web server (Apache, nginx, etc.).  In that situation, restarting
``radosgw`` will restore service.

To diagnose the cause of the crash, check the log in ``/var/log/ceph``
and/or the core file (if one was generated).


Blocked ``radosgw`` Requests
============================

If some (or all) radosgw requests appear to be blocked, you can get
some insight into the internal state of the ``radosgw`` daemon via
its admin socket.  By default, there will be a socket configured to
reside in ``/var/run/ceph``, and the daemon can be queried with::

 ceph daemon /var/run/ceph/client.rgw help
 
 help                list available commands
 objecter_requests   show in-progress osd requests
 perfcounters_dump   dump perfcounters value
 perfcounters_schema dump perfcounters schema
 version             get protocol version

Of particular interest::

 ceph daemon /var/run/ceph/client.rgw objecter_requests
 ...

This command dumps information about the requests currently in progress
with the RADOS cluster, which lets you identify any requests that are
blocked by a non-responsive OSD.  For example, one might see::

  { "ops": [
        { "tid": 1858,
          "pg": "2.d2041a48",
          "osd": 1,
          "last_sent": "2012-03-08 14:56:37.949872",
          "attempts": 1,
          "object_id": "fatty_25647_object1857",
          "object_locator": "@2",
          "snapid": "head",
          "snap_context": "0=[]",
          "mtime": "2012-03-08 14:56:37.949813",
          "osd_ops": [
                "write 0~4096"]},
        { "tid": 1873,
          "pg": "2.695e9f8e",
          "osd": 1,
          "last_sent": "2012-03-08 14:56:37.970615",
          "attempts": 1,
          "object_id": "fatty_25647_object1872",
          "object_locator": "@2",
          "snapid": "head",
          "snap_context": "0=[]",
          "mtime": "2012-03-08 14:56:37.970555",
          "osd_ops": [
                "write 0~4096"]}],
  "linger_ops": [],
  "pool_ops": [],
  "pool_stat_ops": [],
  "statfs_ops": []}

In this dump, two requests are in progress.  The ``last_sent`` field is
the time the RADOS request was sent.  If that time is well in the past, it
suggests that the OSD is not responding.  For example, for request 1858,
you could check the OSD status with::

 ceph pg map 2.d2041a48
 
 osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]

This tells us to look at ``osd.1``, the primary OSD for this PG::

 ceph daemon osd.1 ops
 { "num_ops": 651,
  "ops": [
        { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
          "received_at": "1331247573.344650",
          "age": "25.606449",
          "flag_point": "waiting for sub ops",
          "client_info": { "client": "client.4124",
              "tid": 1858}},
 ...

The ``flag_point`` field indicates that the OSD is currently waiting
for replicas to respond, in this case ``osd.0``.
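
The ``last_sent`` check described above can be automated once the
``objecter_requests`` output has been parsed as JSON. The following is a
minimal sketch under stated assumptions: ``stale_requests`` is a hypothetical
helper, the 30-second default threshold is arbitrary, and the timestamp
format is taken from the sample dump in this section (other releases may
print it differently).

.. code-block:: python

   from datetime import datetime, timedelta

   # Timestamp layout assumed from the sample dump in this section.
   TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


   def stale_requests(dump, now, max_age=timedelta(seconds=30)):
       """Return in-progress ops whose last_sent is older than max_age.

       dump is the parsed JSON from ``objecter_requests``; now must use the
       same clock and time zone as the timestamps in the dump.
       """
       stale = []
       for op in dump.get("ops", []):
           sent = datetime.strptime(op["last_sent"], TIME_FORMAT)
           age = now - sent
           if age > max_age:
               stale.append({
                   "tid": op["tid"],
                   "osd": op["osd"],
                   "pg": op["pg"],
                   "age_seconds": age.total_seconds(),
               })
       # Oldest requests first, since they are the most likely to be blocked.
       return sorted(stale, key=lambda entry: entry["age_seconds"], reverse=True)

Each returned entry carries the ``pg`` and ``osd`` values needed for the
``ceph pg map`` and ``ceph daemon osd.N ops`` steps shown above.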


Java S3 API Troubleshooting
===========================


Peer Not Authenticated
----------------------

You may receive an error that looks like this:: 

     [java] INFO: Unable to execute HTTP request: peer not authenticated

The Java SDK for S3 requires a valid certificate from a recognized certificate
authority, because it uses HTTPS by default. If you are just testing the Ceph
Object Storage services, you can resolve this problem in a few ways:  

#. Prepend the IP address or hostname with ``http://``. For example, change this::

	conn.setEndpoint("myserver");

   To:: 

	conn.setEndpoint("http://myserver");

#. After setting your credentials, add a client configuration and set the 
   protocol to ``Protocol.HTTP``. :: 

			AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
			
			ClientConfiguration clientConfig = new ClientConfiguration();
			clientConfig.setProtocol(Protocol.HTTP);
			
			AmazonS3 conn = new AmazonS3Client(credentials, clientConfig);


405 MethodNotAllowed
--------------------

If you receive a 405 error, check whether the S3 subdomain is set up correctly.
You will need a wildcard entry in your DNS record for subdomain functionality
to work properly.
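
One way to spot-check the wildcard DNS entry is to resolve a made-up
subdomain of the gateway's domain. The following is a minimal sketch, not a
definitive test: ``wildcard_resolves`` is a hypothetical helper, and a
successful lookup only suggests that a wildcard record exists.

.. code-block:: python

   import socket
   import uuid


   def wildcard_resolves(domain, resolver=socket.gethostbyname):
       """Return True if a random, made-up subdomain of domain resolves."""
       # A random label is very unlikely to have its own explicit DNS record.
       probe = "{}.{}".format(uuid.uuid4().hex[:12], domain)
       try:
           resolver(probe)
       except OSError:  # socket.gaierror is a subclass of OSError
           return False
       return True

For example, ``wildcard_resolves("gateway.example.com")`` probes a random
name under the placeholder domain ``gateway.example.com``; substitute the
domain your gateway actually serves.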

Also, ensure that the web server's default site is disabled. A 405 error from the Java SDK looks like this::

     [java] Exception in thread "main" Status Code: 405, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: MethodNotAllowed, AWS Error Message: null, S3 Extended Request ID: null
  
  
Numerous Objects in default.rgw.meta Pool
=========================================

Clusters created prior to *jewel* have a metadata archival feature enabled by default, using the ``default.rgw.meta`` pool.
This archive keeps all old versions of user and bucket metadata, resulting in large numbers of objects in the ``default.rgw.meta`` pool.

Disabling the Metadata Heap
---------------------------

Users who want to disable this feature going forward should set the ``metadata_heap`` field to an empty string ``""``::

  $ radosgw-admin zone get --rgw-zone=default > zone.json
  [edit zone.json, setting "metadata_heap": ""]
  $ radosgw-admin zone set --rgw-zone=default --infile=zone.json
  $ radosgw-admin period update --commit

This will stop new metadata from being written to the ``default.rgw.meta`` pool, but it does not remove any existing objects or the pool itself.
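
The manual edit of ``zone.json`` can also be scripted. The following is a
minimal sketch: ``clear_metadata_heap`` is a hypothetical helper, and it
assumes that ``zone.json`` holds the output of the ``radosgw-admin zone get``
command shown above.

.. code-block:: python

   import json


   def clear_metadata_heap(zone_path):
       """Set "metadata_heap" to "" in the zone file; return the old value."""
       with open(zone_path) as handle:
           zone = json.load(handle)
       previous = zone.get("metadata_heap")  # None if the field was absent
       zone["metadata_heap"] = ""
       with open(zone_path, "w") as handle:
           json.dump(zone, handle, indent=4)
           handle.write("\n")
       return previous

The sketch only rewrites the local file. The ``radosgw-admin zone set`` and
``radosgw-admin period update --commit`` steps still have to be run
afterwards.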

Cleaning the Metadata Heap Pool
-------------------------------

Clusters created prior to *jewel* normally use ``default.rgw.meta`` only for the metadata archival feature.

However, from *luminous* onwards, radosgw uses :ref:`Pool Namespaces <radosgw-pool-namespaces>` within ``default.rgw.meta`` for an entirely different purpose, that is, to store ``user_keys`` and other critical metadata.

Users should check the zone configuration before proceeding with any cleanup procedures::

  $ radosgw-admin zone get --rgw-zone=default | grep default.rgw.meta
  [should not match any strings]

Having confirmed that the pool is not used for any purpose, users may safely delete all objects in the ``default.rgw.meta`` pool, or optionally, delete the entire pool itself.
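
The ``grep`` check above shows only the matching lines. As an alternative,
the zone configuration can be searched as parsed JSON so that each match is
reported with the field it belongs to. The following is a minimal sketch, and
``find_pool_references`` is a hypothetical helper.

.. code-block:: python

   def find_pool_references(node, pool="default.rgw.meta", path=""):
       """Return "field = value" strings for every value that mentions pool."""
       hits = []
       if isinstance(node, dict):
           for key, value in node.items():
               child = "{}.{}".format(path, key) if path else str(key)
               hits.extend(find_pool_references(value, pool, child))
       elif isinstance(node, list):
           for index, value in enumerate(node):
               child = "{}[{}]".format(path, index)
               hits.extend(find_pool_references(value, pool, child))
       elif isinstance(node, str) and pool in node:
           # Substring match, like grep, so namespaced pool names also match.
           hits.append("{} = {}".format(path, node))
       return hits

Pass it the parsed output of ``radosgw-admin zone get``. An empty list
corresponds to the "should not match any strings" outcome of the ``grep``
check; any entry means the pool is still referenced and must not be deleted.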