Welcome to the Free Software contributions diary of Loïc Dachary. Although the posts look like blog entries, they really are technical reports about the work done during the day. They are meant to be used as a reference by co-developers and managers.

Improving PGs distribution with CRUSH weight sets

In a Ceph cluster with a single pool of 1024 Placement Groups (PGs), the PG distribution among devices will not be as even as expected (see Predicting Ceph PG placement for details about this uneven distribution).

In the following, the difference between the expected number of PGs on a given host or device and the actual number of PGs can be as high as 25%. For instance, device4 has a weight of 1 and is expected to get 128 PGs but only gets 96. And host2 (containing device4) has 83 fewer PGs than expected.

         ~expected~  ~actual~  ~delta~   ~delta%~  ~weight~
dc1            1024      1024        0   0.000000       1.0
 host0          256       294       38  14.843750       2.0
  device0       128       153       25  19.531250       1.0
  device1       128       141       13  10.156250       1.0
 host1          256       301       45  17.578125       2.0
  device2       128       157       29  22.656250       1.0
  device3       128       144       16  12.500000       1.0
 host2          512       429      -83 -16.210938       4.0
  device4       128        96      -32 -25.000000       1.0
  device5       128       117      -11  -8.593750       1.0
  device6       256       216      -40 -15.625000       2.0

Using minimization methods it is possible to find alternate weights for hosts and devices that significantly reduce the uneven distribution, as shown below.

         ~expected~  ~actual~  ~delta~  ~delta%~  ~weight~
dc1            1024      1024        0  0.000000  1.000000
 host0          256       259        3  1.171875  0.680399
  device0       128       129        1  0.781250  0.950001
  device1       128       130        2  1.562500  1.050000
 host1          256       258        2  0.781250  0.646225
  device2       128       129        1  0.781250  0.867188
  device3       128       129        1  0.781250  1.134375
 host2          512       507       -5 -0.976562  7.879046
  device4       128       126       -2 -1.562500  1.111111
  device5       128       127       -1 -0.781250  0.933333
  device6       256       254       -2 -0.781250  1.955555

For each host or device, we now have two weights. First, there is the target weight, which is used to calculate the expected number of PGs. For instance, host0 should get 1024 * (target weight host0 = 2 / (target weight host0 = 2 + target weight host1 = 2 + target weight host2 = 4)) PGs, that is 1024 * ( 2 / ( 2 + 2 + 4 ) ) = 256. The second weight is the weight set, which is used to choose where a PG is mapped.
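
The expected numbers above follow directly from the target weights: each item gets a share of the PGs proportional to its weight. A minimal sketch of the calculation (the helper name is made up; the weights and PG total come from the example):

```python
def expected_pgs(pg_total, target_weights):
    """Share of PGs each item should get, proportional to its target weight."""
    total = sum(target_weights.values())
    return {name: pg_total * w / total for name, w in target_weights.items()}

# target weights of the hosts in the example above
hosts = {'host0': 2.0, 'host1': 2.0, 'host2': 4.0}
print(expected_pgs(1024, hosts))
# {'host0': 256.0, 'host1': 256.0, 'host2': 512.0}
```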

The problem is that the crushmap format only allows one weight per host or device. Although it would be possible to store the target weight outside of the crushmap, it would be inconvenient and difficult to understand. Instead, the crushmap syntax was extended to allow additional weights (via weight set) to be associated with each item (i.e. disk or host in the example below).

                                                    target
         ~expected~  ~actual~  ~delta~   ~delta%~  ~weight~ ~weight set~
dc1            1024      1024        0   0.000000       1.0     1.000000
 host0          256       294       38  14.843750       2.0     0.680399
  device0       128       153       25  19.531250       1.0     0.950001
  device1       128       141       13  10.156250       1.0     1.050000
 host1          256       301       45  17.578125       2.0     0.646225
  device2       128       157       29  22.656250       1.0     0.867188
  device3       128       144       16  12.500000       1.0     1.134375
 host2          512       429      -83 -16.210938       4.0     7.879046
  device4       128        96      -32 -25.000000       1.0     1.111111
  device5       128       117      -11  -8.593750       1.0     0.933333
  device6       256       216      -40 -15.625000       2.0     1.955555

Posted in ceph, crush, libcrush

Faster Ceph CRUSH computation with smaller buckets

The CRUSH function maps Ceph placement groups (PGs) and objects to OSDs. It is used extensively in Ceph clients and daemons, as well as in the Linux kernel modules, and its CPU cost should be reduced to the minimum.

It is common to define the Ceph CRUSH map so that PGs use OSDs on different hosts. The hierarchy can be as simple as 100 hosts, each containing 7 OSDs. When determining which OSDs belong to a given PG, in this scenario the CRUSH function hashes the PG against each of the 100 hosts, compares the results to select the hosts that score highest, and then does the same with the 7 OSDs of the selected host. That is 100 + 7 calls to the hash function.

If the hierarchy is modified to have 20 racks, each containing 5 hosts with 7 OSDs per host, we only have 20 + 5 + 7 calls to the hash function. The first 20 calls are to select the rack, the next 5 to select the host in the rack and the last 7 to select the OSD within the host.

In other words, this hierarchical subdivision of the CRUSH map reduces the number of calls to the hash function from 107 to 32. It does not have any influence on how the PGs are mapped to OSDs but it saves CPU.
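
The arithmetic above can be checked with a one-line counting helper (a sketch of the counting argument, not of the CRUSH function itself):

```python
def hash_calls(bucket_sizes):
    """Hash calls needed to walk a hierarchy, one level per entry:
    CRUSH hashes every child of each bucket it traverses to rank them."""
    return sum(bucket_sizes)

flat = hash_calls([100, 7])       # 100 hosts, then 7 OSDs in the chosen host
nested = hash_calls([20, 5, 7])   # 20 racks, 5 hosts per rack, 7 OSDs per host
print(flat, nested)  # 107 32
```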

Posted in ceph, libcrush

Predicting Ceph PG placement

When creating a new Ceph pool, deciding on the number of PGs requires some thinking to ensure there are a few hundred PGs per OSD. The distribution can be verified with crush analyze as follows:

$ crush analyze --rule data --type device \
                --replication-count 2 \
                --crushmap crushmap.txt \
                --pool 0 --pg-num 512 --pgp-num 512
         ~id~  ~weight~  ~over/under used %~
~name~
device0     0       1.0                 9.86
device5     5       2.0                 8.54
device2     2       1.0                 1.07
device3     3       2.0                -1.12
device1     1       2.0                -5.52
device4     4       1.0               -14.75

The argument of the --pool option is unknown because the pool has not been created yet, but pool numbers are easy to predict: if the highest pool number is 5, the next pool number will be 6. The output shows the PGs will not be evenly distributed because there are not enough of them. If there were a thousand times more PGs, they would be evenly distributed:

$ crush analyze --rule data --type device \
                --replication-count 2 \
                --crushmap crushmap \
                --pool 0 --pg-num 512000 --pgp-num 512000
         ~id~  ~weight~  ~over/under used %~
~name~
device4     4       1.0                 0.30
device3     3       2.0                 0.18
device2     2       1.0                -0.03
device5     5       2.0                -0.04
device1     1       2.0                -0.13
device0     0       1.0                -0.30

Increasing the number of PGs is not a practical solution because having more than a few hundred PGs per OSD requires too much CPU and RAM. Knowing that device0 will be the first OSD to fill up, reweight-by-utilization should be used when it is too full.
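
The "few hundred PGs per OSD" rule of thumb can be checked with simple arithmetic before creating the pool (a hypothetical helper, not part of the crush tool):

```python
def pgs_per_osd(pg_num, replication_count, osd_count):
    """Average number of PG replicas landing on each OSD."""
    return pg_num * replication_count / osd_count

# 512 PGs with 2 replicas over the 6 devices of the example above
print(pgs_per_osd(512, 2, 6))  # ~170, within the recommended range
```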
Posted in ceph

How many objects will move when changing a crushmap ?

After a crushmap is changed (e.g. addition/removal of devices, modification of weights or tunables), objects may move from one device to another. The crush compare command can be used to show what would happen for a given rule and replication count. In the following example, two new OSDs are added to the crushmap, causing 22% of the objects to move from the existing OSDs to the new ones.

$ crush compare --rule firstn \
                --replication-count 1 \
                --origin before.json --destination after.json
There are 1000 objects.

Replacing the crushmap specified with --origin with the crushmap
specified with --destination will move 229 objects (22.9% of the total)
from one item to another.

The rows below show the number of objects moved from the given
item to each item named in the columns. The objects% at the
end of the rows shows the percentage of the total number
of objects that is moved away from this particular item. The
last row shows the percentage of the total number of objects
that is moved to the item named in the column.

         osd.8    osd.9    objects%
osd.0        3        4       0.70%
osd.1        1        3       0.40%
osd.2       16       16       3.20%
osd.3       19       21       4.00%
osd.4       17       18       3.50%
osd.5       18       23       4.10%
osd.6       14       23       3.70%
osd.7       14       19       3.30%
objects%   10.20%   12.70%   22.90%
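
A matrix like the one above can be derived from any two object-to-devices mappings. A minimal sketch (the mappings below are made up for illustration; crush compare computes the real ones from the two crushmaps):

```python
from collections import defaultdict

def movement_matrix(before, after):
    """Count objects moving between devices across two mappings.

    before and after map an object name to the list of devices holding it.
    Pairing lost devices with gained ones in order is a simplification
    that is exact for a replication count of 1.
    """
    moved = defaultdict(int)
    for obj, src_devs in before.items():
        dst_devs = after[obj]
        lost = [d for d in src_devs if d not in dst_devs]
        gained = [d for d in dst_devs if d not in src_devs]
        for src, dst in zip(lost, gained):
            moved[(src, dst)] += 1
    return dict(moved)

before = {'obj1': ['osd.0'], 'obj2': ['osd.1'], 'obj3': ['osd.0']}
after = {'obj1': ['osd.8'], 'obj2': ['osd.1'], 'obj3': ['osd.9']}
print(movement_matrix(before, after))
# {('osd.0', 'osd.8'): 1, ('osd.0', 'osd.9'): 1}
```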

The crush compare command can also show the impact of a change in one or more “tunables”, such as setting chooseleaf_stable to 1.

$ diff -u original.json destination.json
--- original.json	2017-03-14 23:41:47.334740845 +0100
+++ destination.json	2017-03-04 18:36:00.817610217 +0100
@@ -608,7 +608,7 @@
         "choose_local_tries": 0,
         "choose_total_tries": 50,
         "chooseleaf_descend_once": 1,
-        "chooseleaf_stable": 0,
+        "chooseleaf_stable": 1,
         "chooseleaf_vary_r": 1,
         "straw_calc_version": 1
     }

In the following example some columns were removed for brevity and replaced with dots. It shows that 33% of the objects will move after chooseleaf_stable is changed from 0 to 1. Each device will receive and send more than 1% and less than 3% of these objects.

$ crush compare --origin original.json --destination destination.json \
                --rule replicated_ruleset --replication-count 3
There are 300000 objects.

Replacing the crushmap specified with --origin with the crushmap
specified with --destination will move 99882 objects (33.294% of the total)
from one item to another.

The rows below show the number of objects moved from the given
item to each item named in the columns. The objects% at the
end of the rows shows the percentage of the total number
of objects that is moved away from this particular item. The
last row shows the percentage of the total number of objects
that is moved to the item named in the column.

          osd.0  osd.1 osd.11 osd.13 osd.20 ... osd.8  osd.9 objects%
osd.0         0    116    180      0   3972 ...   138    211    1.89%
osd.1       121      0    129     64    116 ...   112    137    1.29%
osd.11      194    126      0     12      0 ...   168    222    1.94%
osd.13        0     75     19      0    211 ...     0   4552    2.06%
osd.20     4026    120      0    197      0 ...    90      0    1.92%
osd.21      120   2181     65    130    116 ...    85     75    1.29%
osd.24      176    150    265     63      0 ...   160    258    2.29%
osd.25      123     99    190    198     99 ...    92    182    2.19%
osd.26       54     83     62    258    254 ...    51     69    2.27%
osd.27      124    109      0     90     73 ...  1840      0    1.55%
osd.29       43     54      0     98    123 ...  1857      0    1.60%
osd.3        74     82   2112    137    153 ...    61     44    1.62%
osd.37       65    108      0      0    166 ...    67      0    1.66%
osd.38      163    119      0      0     73 ...    58      0    1.68%
osd.44       56     73   2250    148    173 ...    77     43    1.68%
osd.46       60     71    132     67      0 ...    39    125    1.31%
osd.47        0     51     70    126     70 ...     0     73    1.35%
osd.8       151    112    163      0     76 ...     0    175    1.67%
osd.9       197    130    202   4493      0 ...   188      0    2.03%
objects%  1.92%  1.29%  1.95%  2.03%  1.89% ... 1.69%  2.06%   33.29%

Posted in ceph

Predicting which Ceph OSD will fill up first

When a device is added to Ceph, it is assigned a weight that reflects its capacity. For instance if osd.1 is a 1TB disk, its weight will be 1.0, and if osd.2 is a 4TB disk, its weight will be 4.0. It is expected that osd.2 will receive exactly four times more objects than osd.1, so that when osd.1 is 80% full, osd.2 is also 80% full.

But running a simulation on a crushmap with four 4TB disks and one 1TB disk, shows something different:

         WEIGHT     %USED
osd.4       1.0       86%
osd.3       4.0       81%
osd.2       4.0       79%
osd.1       4.0       79%
osd.0       4.0       78%

This happens when these devices are used in a two-replica pool, because the distribution of the second replica depends on the distribution of the first replica. If the pool only has one copy of each object, the distribution is as expected (there is a variation, but it is around 0.2% in this case):

         WEIGHT     %USED
osd.4       1.0       80%
osd.3       4.0       80%
osd.2       4.0       80%
osd.1       4.0       80%
osd.0       4.0       80%
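
The effect can be reproduced with a toy simulation: pick two distinct devices per object, each pick weighted by capacity. This is plain weighted sampling without replacement, not the actual CRUSH algorithm, but it shows the same bias (the weights match the example above):

```python
import random

def pick_two(weights, rng):
    """Pick two distinct devices, each pick weighted by capacity."""
    ids = list(range(len(weights)))
    first = rng.choices(ids, weights=weights, k=1)[0]
    # the second replica is chosen among the remaining devices only
    rest = [i for i in ids if i != first]
    second = rng.choices(rest, weights=[weights[i] for i in rest], k=1)[0]
    return first, second

weights = [4.0, 4.0, 4.0, 4.0, 1.0]   # four 4TB disks and one 1TB disk
counts = [0] * len(weights)
rng = random.Random(42)
for _ in range(100000):
    for osd in pick_two(weights, rng):
        counts[osd] += 1

fair = [w / sum(weights) for w in weights]   # share expected from weights
real = [c / sum(counts) for c in counts]     # share observed in the simulation
print(round(real[4] / fair[4], 2))  # ~1.1: the 1TB disk is ~10% overused
```

With a single replica per object the bias disappears, which matches the second table above.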

This variation is not new, but there previously was no convenient way to show it from the crushmap. It can now be displayed with the crush analyze command. For instance:

    $ ceph osd crush dump > crushmap-ceph.json
    $ crush ceph --convert crushmap-ceph.json > crushmap.json
    $ crush analyze --rule replicated --crushmap crushmap.json
            ~id~  ~weight~  ~over/under used~
    ~name~
    g9       -22  2.299988     10.400604
    g3        -4  1.500000     10.126750
    g12      -28  4.000000      4.573330
    g10      -24  4.980988      1.955702
    g2        -3  5.199982      1.903230
    n7        -9  5.484985      1.259041
    g1        -2  5.880997      0.502741
    g11      -25  6.225967     -0.957755
    g8       -20  6.679993     -1.730727
    g5       -15  8.799988     -7.884220

shows that g9 will be ~90% full when g1 is ~80% full (i.e. 10.40 - 0.50 ~= 10% difference) and g5 will be ~74% full.

By monitoring disk usage on g9 and adding more disk space to the cluster when the disks on g9 reach a reasonable threshold (like 85% or 90%), one can ensure that the cluster will never fill up, since it is known that g9 will always be the first node to become overfull. Another possibility is to run the ceph osd reweight-by-utilization command from time to time and try to even the distribution.
Posted in ceph

logging udev events at boot time

Adapted from Peter Rajnoha's post:

  • create a special systemd unit to monitor udev during boot:
    cat > /etc/systemd/system/systemd-udev-monitor.service <<EOF
    [Unit]
    Description=udev Monitoring
    DefaultDependencies=no
    Wants=systemd-udevd.service
    After=systemd-udevd-control.socket systemd-udevd-kernel.socket
    Before=sysinit.target systemd-udev-trigger.service
    
    [Service]
    Type=simple
    ExecStart=/usr/bin/sh -c "/usr/sbin/udevadm monitor --udev --env > /udev_monitor.log"
    
    [Install]
    WantedBy=sysinit.target
    EOF
    
  • run systemctl daemon-reload
  • run systemctl enable systemd-udev-monitor.service
  • reboot
  • append “systemd.log_level=debug systemd.log_target=kmsg udev.log-priority=debug log_buf_len=8M” to kernel command line
  • collect the logs in /udev_monitor.log

Posted in Uncategorized

Testing Ceph with ARMv8 OpenStack instances

The Ceph integration tests can be run on ARMv8 (aka arm64 or aarch64) OpenStack instances on CloudLab or Runabove.

When logged in to CloudLab, an OpenStack cluster suitable for teuthology must be created. To start an experiment:

  • click Change Profile to select the OpenStackTeuthology profile
  • the description of the profile contains an example credential file (i.e. openrc.sh) that can be copy/pasted to the local machine
  • the m400 default machine type will select ARMv8 hardware
  • in the last step, choose a name for the experiment; the openrc.sh file must be modified to reflect the chosen name because it shows in the URL of the authentication service (if a new experiment by the same name is run a month later, the same openrc.sh file can be used)
  • the page is then updated to show the progress of the provisioning; note that it takes about 15 minutes to complete: even when the page says the experiment is up, the OpenStack setup is still going on and needs a few more minutes
  • finally, click on Profile instructions to display the link to the horizon dashboard and the password of the admin user ("Your OpenStack admin and instance VM password is randomly-generated by Cloudlab, and it is: 0905d783e7e7.")

When the cluster is created, running the smoke integration tests for rados on jewel can be done with ceph-workbench

ceph-workbench --verbose ceph-qa-suite --ceph jewel --suite smoke --filter rados

assuming the openrc.sh file has been set in ~/.ceph-workbench/openrc.sh. When the command returns, it displays the URL of the web interface

...
2016-04-07 11:25:57,625.625 DEBUG:paramiko.transport:EOF in transport thread
2016-04-07 11:25:57,628.628 INFO:teuthology.openstack:
pulpito web interface: http://128.110.155.162:8081/
ssh access           : ssh ubuntu@128.110.155.162 # logs in /usr/share/nginx/html

And when the test completes successfully it will show in green. Otherwise the logs of the failed tests can be downloaded and analyzed.

Posted in ceph, openstack

Semi-reliable GitHub scripting

The githubpy python library provides a thin layer on top of the GitHub V3 API, which is convenient because the official GitHub documentation can be used. The undocumented behavior of GitHub is outside of the scope of this library and needs to be addressed by the caller.

For instance creating a repository is asynchronous and checking for its existence may fail. Something similar to the following function should be used to wait until it exists:

    def project_exists(self, name):
        # the retry only guards against transient API errors: if the
        # listing succeeds but the repository is absent, return False
        # immediately and let the caller poll
        retry = 10
        while retry > 0:
            try:
                for repo in self.github.g.user('repos').get():
                    if repo['name'] == name:
                        return True
                return False
            except github.ApiError:
                time.sleep(5)
            retry -= 1
        raise Exception('error getting the list of repos')

    def add_project(self):
        r = self.github.g.user('repos').post(
            name=GITHUB['repo'],
            auto_init=True)
        assert r['full_name'] == GITHUB['username'] + '/' + GITHUB['repo']
        while not self.project_exists(GITHUB['repo']):
            time.sleep(1)  # poll instead of busy-waiting

Another example is merging a pull request. It sometimes fails (503, cannot be merged error) although it succeeds in the background. To cope with that, the state of the pull request should be checked immediately after the merge failed. It can either be merged or closed (although the GitHub web interface shows it as merged). The following function can be used to cope with that behavior:

    def merge(self, pr, message):
        retry = 10
        while retry > 0:
            try:
                # a previous attempt may have succeeded in the background
                # even though it returned an error: check the state first
                current = self.github.repos().pulls(pr).get()
                if current['state'] in ('merged', 'closed'):
                    return
                logging.info('state = ' + current['state'])
                self.github.repos().pulls(pr).merge().put(
                    commit_message=message)
            except github.ApiError as e:
                logging.error(str(e.response))
                logging.exception('merging ' + str(pr) + ' ' + message)
                time.sleep(5)
            retry -= 1
        assert retry > 0  # fail loudly when all retries are exhausted

These two examples have been implemented as part of the ceph-workbench integration tests. The behavior described above can be reproduced by running the test in a loop during a few hours.
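
Both functions follow the same shape: retry a bounded number of times, sleep after an API error, and fail loudly when the retries are exhausted. A generic sketch of that pattern (the helper and its defaults are assumptions, not part of githubpy or ceph-workbench):

```python
import time

def with_retries(action, is_done, retries=10, delay=5, errors=(Exception,)):
    """Run action() until is_done() reports success, tolerating
    transient errors by sleeping between attempts."""
    for _ in range(retries):
        try:
            if is_done():
                return
            action()
        except errors:
            time.sleep(delay)
    raise RuntimeError('retries exhausted')
```

merge() above fits this shape: action is the merge PUT and is_done checks whether the pull request state is merged or closed.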

Posted in Uncategorized

teuthology forensics with git, shell and paddles

When a teuthology integration test for Ceph fails, the results are analyzed to find the source of the problem. For instance the upgrade suite: pool_create failed with error -4 EINTR issue was reported early October 2015, with multiple integration job failures.
The first step was to look into the teuthology log, which revealed that pools could not be created.

failed: error rados_pool_create(test-rados-api-vpm049-15238-1) \
  failed with error -4"

The -4 stands for EINTR. The paddles database is used by teuthology to store test results and can be queried via HTTP. For instance:

curl --silent http://paddles.front.sepia.ceph.com/runs/ |
  jq '.[] | \
      select(.name | contains("upgrade:firefly-hammer-x")) | \
      select(.branch == "infernalis") | \
      select(.status | contains("finished")) \
      | .name' | \
  while read run ; do eval run=$run ; \
    curl --silent http://paddles.front.sepia.ceph.com/runs/$run/jobs/ | \
      jq '.[] | "http://paddles.front.sepia.ceph.com/runs/\(.name)/jobs/\(.job_id)/"' ; \
  done | \
  while read url ; do eval url=$url ; \
    curl --silent $url | \
      jq 'if((.description != null) and \
             (.description | contains("parallel")) and \
             (.success == true)) then "'$url'" else null end' ; \
  done | grep -v null

shows which successful jobs of the upgrade:firefly-hammer-x suite, run against the infernalis branch (the first jq expression), were involved in a parallel test (that is the name of a subdirectory of the suite). This was not sufficient to figure out the root cause of the problem because:

  • it only provides access to the last 100 runs
  • it does not allow grepping the teuthology log files for a string

With the teuthology logs in the /a directory (it is actually a 100TB CephFS mount, half full), the following shell snippet can be used to find the upgrade tests that failed with the error -4 message in the logs.

for run in *2015-{07,08,09,10}*upgrade* ; do for job in $run/* ; do \
  test -d $job || continue ; \
  config=$job/config.yaml ;   test -f $config || continue ; \
  summary=$job/summary.yaml ; test -f $summary || continue ; \
  if shyaml get-value branch < $config | grep -q hammer && \
     shyaml get-value success < $summary | grep -qi false && \
     grep -q 'error -4' $job/teuthology.log  ; then
       echo $job ;
   fi ; \
done ; done

It looks for all upgrade runs, back to July 2015. shyaml is used to query the branch from the job configuration and only keep those targeting hammer. If the job failed (according to the success value found in the summary file), the error is looked up in the teuthology.log file. The first failed job is found early September:

teuthology-2015-09-11_17:18:07-upgrade:firefly-x-hammer-distro-basic-vps/1051109

It happened on a regular basis after that date but was only reported early October. The commits merged in the hammer branch around that time are displayed with:

git log --merges --since 2015-09-01 --until 2015-09-11 --format='%H' ceph/hammer | \
while read sha1 ; do \
  echo ; git log --format='** %aD "%s":https://github.com/ceph/ceph/commit/%H' ${sha1}^1..${sha1} ; \
done | perl -p -e 'print "* \"PR $1\":https://github.com/ceph/ceph/pull/$1\n" if(/Merge pull request #(\d+)/)'

It can be copy/pasted into a Redmine issue. It turns out that a pull request merged September 6th was responsible for the failure.

Posted in ceph

On demand Ceph packages for teuthology

When a teuthology job installs Ceph, it uses packages created by gitbuilder. These packages are built every time a branch is pushed to the official repository.

Contributors who do not have write access to the official repository can either ask a developer with access to push a branch for them, or set up a gitbuilder repository using autobuild-ceph. Asking a developer is inconvenient because it takes time and also because it creates packages for every supported operating system, even when only one of them would be enough. In addition there often is a long wait queue because the gitbuilder of the sepia lab is very busy. Setting up a gitbuilder repository reduces wait time but has proven too time- and resource-consuming for most contributors.

The buildpackages task can be used to resolve that problem and create the packages required for a particular job on demand. When added to a job that has an install task, it will:

  • always run before the install task regardless of its position in the list of tasks (see the buildpackages_prep function in the teuthology internal tasks for more information).
  • create an http server, unless it already exists
  • set gitbuilder_host in ~/.teuthology.yaml to the http server
  • find the SHA1 of the commit that the install task needs
  • checkout the ceph repository at SHA1 and build the package, in a dedicated server
  • upload the packages to the http server, using directory names that mimic the gitbuilder conventions used in the lab gitbuilder and destroy the server used to build them

When the install task looks for packages, it uses the http server populated by the buildpackages task. The teuthology cluster keeps track of which packages were built for which architecture (via makefile timestamp files). When another job needs the same packages, the buildpackages task will notice they already have been built and uploaded to the http server and do nothing.

A test suite verifies the buildpackages task works as expected and can be run with:

teuthology-openstack --verbose \
   --key-name myself --key-filename ~/Downloads/myself \
   --ceph-git-url http://workbench.dachary.org/ceph/ceph.git \
   --ceph hammer --suite teuthology/buildpackages

The --ceph-git-url is the repository from which the branch specified with --ceph is cloned. It defaults to http://github.com/ceph/ceph which requires write access to the official Ceph repository.

Posted in ceph