Welcome to the Free Software contributions diary of Loïc Dachary. Although the posts look like blog entries, they really are technical reports about the work done during the day. They are meant to be used as a reference by co-developers and managers.

HOWTO test a Ceph crush rule

The crushtool utility can be used to test Ceph crush rules before applying them to a cluster.

$ crushtool --outfn crushmap --build --num_osds 10 \
   host straw 2 rack straw 2 default straw 0
# id	weight	type name	reweight
-9	10	default default
-6	4		rack rack0
-1	2			host host0
0	1				osd.0	1
1	1				osd.1	1
-2	2			host host1
2	1				osd.2	1
3	1				osd.3	1
-7	4		rack rack1
-3	2			host host2
4	1				osd.4	1
5	1				osd.5	1
-4	2			host host3
6	1				osd.6	1
7	1				osd.7	1
-8	2		rack rack2
-5	2			host host4
8	1				osd.8	1
9	1				osd.9	1	

This creates a crushmap from scratch (--build). It assumes a total of 10 OSDs are available (--num_osds 10) and places two OSDs in each host (host straw 2). The resulting five hosts are then placed in racks, at most two per rack (rack straw 2). All racks are placed in the default root (default straw 0): the zero means all of them go into a single bucket. The last rack has only one host because the number of hosts is odd.
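The bucket counts in the tree above follow from a ceiling division at each layer. As a quick sanity check, the arithmetic can be sketched in shell (ceil_div is a hypothetical helper used for illustration, not part of crushtool):

```shell
# ceil_div ITEMS PER_BUCKET: how many buckets are needed when each bucket
# holds at most PER_BUCKET items (hypothetical helper, not part of crushtool)
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

num_osds=10
hosts=$(ceil_div "$num_osds" 2)   # host straw 2: two OSDs per host
racks=$(ceil_div "$hosts" 2)      # rack straw 2: two hosts per rack
echo "hosts=$hosts racks=$racks"  # prints "hosts=5 racks=3"
```

Five hosts spread over three racks is why rack2 ends up with a single host.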
The crush rule to be tested can be injected in the crushmap with:

crushtool --outfn crushmap --build --num_osds 10 host straw 2 rack straw 2 default straw 0
crushtool -d crushmap -o crushmap.txt
cat >> crushmap.txt <<EOF
rule myrule {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 2 type rack
	step chooseleaf firstn 2 type host
	step emit
}
EOF
crushtool -c crushmap.txt -o crushmap

This crushmap should be able to provide two OSDs (for a placement group, for instance), which can be verified with the --test option.

$ crushtool -i crushmap --test --show-statistics --rule 1 --min-x 1 --max-x 2 --num-rep 2
rule 1 (myrule), x = 1..2, numrep = 2..2
CRUSH rule 1 x 1 [0,2]
CRUSH rule 1 x 2 [7,4]
rule 1 (myrule) num_rep 2 result size == 2:	2/2

--rule 1 designates the rule that was injected; --rule 0 is the rule created by default. The x can be thought of as the unique name of the placement group for which OSDs are requested. --min-x 1 --max-x 2 varies the value of x from 1 to 2, so the rule is tried only twice; --min-x 1 --max-x 2048 would create 2048 lines. Each line shows the value of x after the rule number: in rule 1 x 2, the 1 is the rule number and the 2 is the value of x. The last line shows that for all values of x (2/2, i.e. 2 values of x out of 2), when asked to provide 2 OSDs (num_rep 2), the crush rule was able to provide 2 (result size == 2).

If asked for 4 OSDs, the same crush rule may fail because it has barely enough resources to satisfy the requirements.

$ crushtool -i crushmap --test --show-statistics --rule 1 --min-x 1 --max-x 2 --num-rep 4
rule 1 (myrule), x = 1..2, numrep = 4..4
CRUSH rule 1 x 1 [0,2,9]
CRUSH rule 1 x 2 [7,4,1,3]
rule 1 (myrule) num_rep 4 result size == 3:	1/2
rule 1 (myrule) num_rep 4 result size == 4:	1/2

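The statistics at the end show that one of the two mappings failed: the result size == 3 is lower than the required num_rep 4. For x = 1 the mapping contains osd.9, which belongs to rack2: that rack has a single host and can therefore contribute only one OSD. If asked for more OSDs than the rule can provide (at most 2 racks with 2 hosts each, i.e. 4), the rule will always fail.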

$ crushtool -i crushmap --test --show-statistics --rule 1 --min-x 1 --max-x 2 --num-rep 5
rule 1 (myrule), x = 1..2, numrep = 5..5
CRUSH rule 1 x 1 [0,2,9]
CRUSH rule 1 x 2 [7,4,1,3]
rule 1 (myrule) num_rep 5 result size == 3:	1/2
rule 1 (myrule) num_rep 5 result size == 4:	1/2
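When the range of x is large, the few mappings that fall short are easy to miss. The following is a minimal sketch of a filter that lists them. short_mappings is a hypothetical helper name, and the filter assumes the mapping lines have the "CRUSH rule R x X [osds]" format shown above.

```shell
# short_mappings EXPECTED (hypothetical helper): read crushtool --test output
# on stdin and print each x whose mapping has fewer than EXPECTED OSDs.
short_mappings() {
    awk -v expected="$1" '
        /^CRUSH rule [0-9]+ x [0-9]+ \[/ {
            osds = $NF                      # last field, e.g. [0,2,9]
            gsub(/^\[|\]$/, "", osds)       # strip the surrounding brackets
            n = (osds == "") ? 0 : split(osds, parts, ",")
            if (n < expected) print $5      # field 5 is the value of x
        }'
}
```

Piping the output of the --num-rep 4 run above into short_mappings 4 would print 1, the only x that was given three OSDs.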

More examples of crushtool usage can be found in the crushtool directory of the Ceph sources.

Posted in ceph

HOWTO test teuthology tasks

The Ceph integration tests run by teuthology are described with YAML files in the ceph-qa-suite repository. The actual work is carried out on machines provisioned by teuthology via tasks. For instance, the workunit task runs a script found in the qa/workunits directory of the Ceph repository.
The workunit.py script, although small, is complex enough to deserve testing. Creating unit tests would require a lot of mocking and would not catch a typo in a shell command to be run on an actual machine. Another approach is to create lightweight integration tests within the ceph-qa-suite repository itself. For instance, tests/workunit is designed to maximize coverage of the workunit.py script and run as quickly as possible.

Posted in ceph

What cinder volume is missing an RBD object ?

Although it is extremely unlikely to lose an object stored in Ceph, it is not impossible. When it happens to a Cinder volume based on RBD, knowing which volume has an object missing will help with disaster recovery.

Posted in Havana, ceph, openstack

Tell teuthology to use a local ceph-qa-suite directory

By default, teuthology clones the ceph-qa-suite repository and uses the tasks it contains. If tasks have been modified locally, teuthology can be instructed to use a local directory by inserting something like:

suite_path: /home/loic/software/ceph/ceph-qa-suite

in the teuthology job yaml file. The directory must then be added to the PYTHONPATH:

PYTHONPATH=/home/loic/software/ceph/ceph-qa-suite \
   ./virtualenv/bin/teuthology  --owner loic@dachary.org \
   /tmp/work.yaml targets.yaml
Posted in ceph

Temporarily disable Ceph scrubbing to resolve high IO load

In a Ceph cluster with low bandwidth, the root disk of an OpenStack instance became extremely slow for several days.

When an OSD is scrubbing a placement group, it has a significant impact on performance; this is expected and lasts for a short while. In this case, however, the OSD slowed down to the point where it was marked down because it did not reply in time:

2014-07-30 06:43:27.331776 7fcd69ccc700  1
   mon.bm0015@0(leader).osd e287968
   we have enough reports/reporters to mark osd.12 down

To get out of this situation, both scrub and deep scrub were deactivated with:

root@bm0015:~# ceph osd set noscrub
set noscrub
root@bm0015:~# ceph osd set nodeep-scrub
set nodeep-scrub

After a day, the IO load remained stable, confirming that no other factor was causing it, and scrubbing was re-activated. The context that caused the excessive IO load had changed, and the problem did not recur during the following 24 hours, although the logs on the same machine confirmed that scrubbing had resumed:

2014-07-31 15:29:54.783491 7ffa77d68700  0 log [INF] : 7.19 deep-scrub ok
2014-07-31 15:29:57.935632 7ffa77d68700  0 log [INF] : 3.5f deep-scrub ok
2014-07-31 15:37:23.553460 7ffa77d68700  0 log [INF] : 7.1c deep-scrub ok
2014-07-31 15:37:39.344618 7ffa77d68700  0 log [INF] : 3.22 deep-scrub ok
2014-08-01 03:25:05.247201 7ffa77d68700  0 log [INF] : 3.46 deep-scrub ok
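The commands used to re-activate scrubbing are not shown above. Presumably the flags were cleared with ceph osd unset, the counterpart of ceph osd set. The sketch below wraps both calls in a hypothetical resume_scrubbing function; the CEPH variable is also an assumption, added so that the function can be dry-run with echo.

```shell
# Command used to talk to the cluster; set CEPH=echo for a dry run
# (the CEPH variable is an assumption made for this sketch).
CEPH="${CEPH:-ceph}"

# resume_scrubbing (hypothetical helper): clear the two flags set earlier
# so that scrub and deep scrub are scheduled again.
resume_scrubbing() {
    for flag in noscrub nodeep-scrub; do
        $CEPH osd unset "$flag" || return 1
    done
}
```

Calling resume_scrubbing on a monitor host, as root like the ceph osd set commands above, would clear both flags; with CEPH=echo it only prints the two commands it would run.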
Posted in ceph

Global shortcuts for emacs org-mode on Ubuntu

Let's say F7 is bound, in emacs, to the org-clock-out function of Org Mode, as a shortcut to quickly stop the clock accumulating the time spent on the current task.

(global-set-key (kbd "<f7>") 'org-clock-out)

F7 can be sent to the emacs window from the command line with:

xdotool search --name 'emacs@fold' key F7

If emacs needs to be displayed to the user (in case it was iconified or on another desktop), the windowactivate command can be added:

xdotool search --name 'emacs@fold' windowactivate key F7

On Ubuntu 14.04, this command can be bound to F7 regardless of which window has focus, via the shortcuts tab of the keyboard section of System Settings.

Posted in Ubuntu, emacs

Ceph disaster recovery scenario

A datacenter containing three hosts of a non-profit Ceph and OpenStack cluster suddenly lost connectivity, which could not be restored within 24 hours. The corresponding OSDs were marked out manually. The Ceph pool dedicated to this datacenter became unavailable, as expected. However, a pool that was supposed to have at most one copy per datacenter turned out to have a faulty crush ruleset. As a result, some placement groups in this pool were stuck.

$ ceph -s
health HEALTH_WARN 1 pgs degraded; 7 pgs down;
   7 pgs peering; 7 pgs recovering;
   7 pgs stuck inactive; 15 pgs stuck unclean;
   recovery 184/1141208 degraded (0.016%)


Posted in ceph

puppet-ceph update

At the end of last year, a new puppet-ceph module was bootstrapped with the ambitious goal of reuniting the dozens of individual efforts. I’m very happy with what we’ve accomplished. We are making progress although our community is mixed but, more importantly, we do things differently.

Posted in ceph, puppet

Ceph erasure code jerasure plugin benchmarks (Highbank ARMv7)

The benchmark described for Intel Xeon is run on a Highbank ARMv7 Processor rev 0 (v7l), made by Calxeda, using the same codebase:

The encoding speed is ~450MB/s for K=2,M=1 (i.e. a RAID5 equivalent) and ~25MB/s for K=10,M=4.

It is also run with Highbank ARMv7 Processor rev 2 (v7l) (note the 2):

The encoding speed is ~650MB/s for K=2,M=1 (i.e. a RAID5 equivalent) and ~75MB/s for K=10,M=4.

Note: The code of the erasure code plugin does not contain any NEON optimizations.

Posted in ceph

workaround DNSError when running teuthology-suite

Note: this is only useful for people with access to the Ceph lab.

When running Ceph integration tests with teuthology, teuthology-suite may fail because of a DNS resolution problem:

$ ./virtualenv/bin/teuthology-suite --base ~/software/ceph/ceph-qa-suite \
   --suite upgrade/firefly-x \
   --ceph wip-8475 --machine-type plana \
   --email loic@dachary.org --dry-run
2014-06-27 INFO:urllib3.connectionpool:Starting new HTTP connection (1):
  HTTPConnectionPool(host='gitbuilder.ceph.com', port=80):
  Max retries exceeded with
  url: /kernel-rpm-centos6-x86_64-basic/ref/testing/sha1
  (Caused by : [Errno 3] name does not exist)

It may be caused by DNS propagation problems, and querying the primary server for ceph.com directly may work better. If running bind, adding the following to /etc/bind/named.conf.local will forward all ceph.com related DNS queries to the primary server (NS1.DREAMHOST.COM), assuming /etc/resolv.conf is set to use the local DNS server first:

zone "ceph.com." {
   type forward;
   forward only;
   forwarders { ...; };
};
zone "ipmi.sepia.ceph.com" {
   type forward;
   forward only;
   forwarders { ...; };
};
zone "front.sepia.ceph.com" {
   type forward;
   forward only;
   forwarders { ...; };
};

The front.sepia.ceph.com zone will resolve machine names allocated by teuthology-lock and used as targets such as:

  ubuntu@saya001.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABA ... 8r6pYSxH5b
Posted in ceph