Welcome to the Free Software contributions diary of Loïc Dachary. Although the posts look like blog entries, they really are technical reports about the work done during the day. They are meant to be used as a reference by co-developers and managers.
The gf-complete and jerasure libraries implement the erasure code functions used in Ceph. They were copied into the Ceph source tree in 2013 because no reference repositories existed at the time. The copies have since been removed from the Ceph repository and replaced by git submodules, to decouple the release cycles.
The AMT of an ASRock Q87M motherboard is configured to enable remote power control (power cycle) and display of the BIOS and the console. It is a cheap alternative to iLO or IPMI that can be used with Free Software. AMT is a feature of vPro that was available in 2011 with some Sandy Bridge chipsets. It is included in many of the more recent Haswell chipsets.
The following is a screenshot of vinagre connected to the AMT VNC server displaying the BIOS of the ASRock Q87M motherboard.
Erasure code is also RAID5, which lets you lose a hard disk without losing your data. From the user's point of view, the concept is simple and useful, but for the person who has to design the software that does the work, it is a headache. Three-disk RAID5 enclosures can be found in any shop: when one of the disks stops working, you replace it and the files are still there. One could imagine the same thing with six disks, two of which fail simultaneously. But no: instead of a XOR operation, which can be grasped in five minutes, it requires Galois fields, a solid mathematical background and a lot of computation. To make things harder, in a distributed storage system such as Ceph, disks are often temporarily disconnected because of network unavailability.
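The single-failure XOR case mentioned above can be sketched in a few lines of Python. This is an illustration of the parity idea only, not how Ceph or jerasure implement it: with k data blocks and one parity block computed by XOR, any single lost block can be rebuilt by XORing the survivors.

```python
# Illustrative RAID5-style parity: k data blocks plus one XOR parity block.
# Any single missing block can be rebuilt by XORing all the others.

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data = [b'AAAA', b'BBBB', b'CCCC']   # k = 3 data blocks
parity = xor_blocks(data)            # m = 1 parity block

# Lose the second data block, then rebuild it from the survivors:
# A ^ C ^ (A ^ B ^ C) == B
survivors = [data[0], data[2], parity]
assert xor_blocks(survivors) == data[1]
```

Surviving more than one simultaneous failure is exactly what XOR cannot do, which is why jerasure needs Galois field arithmetic.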
The Ceph erasure code plugin benchmark results for jerasure version 1 are compared with those obtained after an upgrade to jerasure version 2, using the same command on the same hardware.
- Encoding: 5.2GB/s which is ~20% better than 4.2GB/s
- Decoding: no processing necessary (because the code is systematic)
- Recovering the loss of one OSD: 11.3GB/s which is ~13% better than 10GB/s
- Recovering the loss of two OSDs: 4.42GB/s which is ~35% better than 3.2GB/s
The relevant lines from the full output of the benchmark are:
seconds KB plugin k m work. iter. size eras.
0.088136 1048576 jerasure 6 2 decode 1024 1048576 1
0.226118 1048576 jerasure 6 2 decode 1024 1048576 2
0.191825 1048576 jerasure 6 2 encode 1024 1048576 0
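The throughput figures quoted above follow directly from these lines: each run processes the amount of data in the KB column (1048576 KB, i.e. 1 GiB) in the given number of seconds.

```python
# Throughput = total data / elapsed time, using the benchmark lines above.
total_kb = 1048576                      # KB column: 1 GiB per run
total_gib = total_kb / (1024 * 1024)

for label, seconds in [("encode", 0.191825),
                       ("decode, 1 erasure", 0.088136),
                       ("decode, 2 erasures", 0.226118)]:
    print("%s: %.2f GB/s" % (label, total_gib / seconds))
# encode: 5.21 GB/s
# decode, 1 erasure: 11.35 GB/s
# decode, 2 erasures: 4.42 GB/s
```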
The improvements are likely to be greater for larger K+M values.
The OpenStack Foundation is delivering a training program to help new OpenStack developers integrate their own roadmap into that of the OpenStack project more quickly. If you are a new OpenStack contributor, or plan on becoming one soon, you should sign up for the next OpenStack Upstream Training in Atlanta, May 10-11. Participation is also strongly advised for first-time attendees of the OpenStack Design Summit.
The addition of erasure code in Ceph started in April 2013 and was discussed during the first Ceph Developer Summit. The implementation reached an important milestone a few days ago and is now ready for alpha testing.
For the record, here is the simplest way to store and retrieve an object in an erasure coded pool as of today:
./ceph osd crush rule create-erasure ecruleset \
./ceph osd pool create ecpool 12 12 erasure \
./rados --pool ecpool put SOMETHING /etc/group
./rados --pool ecpool get SOMETHING /tmp/group
$ tail -3 /tmp/group
The chunks are stored in three objects, and the original object can be reconstructed if any one of them is lost.
find dev | grep SOMETHING
When an OSD handles an operation, it is queued to a PG: it is added to the op_wq work queue (or to the waiting_for_map list if the queue_op method of PG finds that it must wait for an OSDMap) and will be dequeued asynchronously. The dequeued operation is processed by the ReplicatedPG::do_request method, which calls the do_op method because the message is a CEPH_MSG_OSD_OP. An OpContext is allocated and executed.
2014-02-24 09:28:34.571489 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] execute_ctx 0x7fc16c08a3b0
A transaction (either an RPGTransaction for a replicated backend or an ECTransaction for an erasure coded backend) is obtained from the PGBackend. The transaction is attached to the OpContext (which was allocated by do_op). Note that although the following log line shows do_op, it actually comes from the execute_ctx method.
2014-02-24 09:28:34.571563 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_op 847441d7/SOMETHING/head//3 [read 0~4194304] ov 26'1
The execute_ctx method calls prepare_transaction which calls do_osd_ops which prepares the CEPH_OSD_OP_READ.
2014-02-24 09:28:34.571663 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] async_read noted for 847441d7/SOMETHING/head//3
The execute_ctx method continues when prepare_transaction returns and creates the MOSDOpReply object. It then calls start_async_reads, which calls objects_read_async on the backend (either ReplicatedBackend::objects_read_async or ECBackend::objects_read_async). When a read completes (this code path is not explored here), the OnReadComplete::finish method is called, because the OnReadComplete object was given as an argument to objects_read_async. It calls ReplicatedPG::OpContext::finish_read each time a read completes (i.e. on each chunk when reading from an erasure coded pool), which calls ReplicatedPG::complete_read_ctx once there are no pending reads, which sends the reply to the client.
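The "complete only after the last chunk read" control flow described above can be sketched in Python. This is a simplified illustration of the pattern, not Ceph code; the class and method names below are hypothetical stand-ins for OpContext, finish_read and complete_read_ctx.

```python
# Simplified sketch of the async-read completion pattern: the reply is
# sent only once every pending chunk read has finished.

class ReadContext:
    def __init__(self, num_chunks, send_reply):
        self.pending = num_chunks      # chunk reads still in flight
        self.chunks = {}
        self.send_reply = send_reply

    def finish_read(self, index, data):
        """Called once per completed chunk read (cf. finish_read)."""
        self.chunks[index] = data
        self.pending -= 1
        if self.pending == 0:          # cf. complete_read_ctx
            payload = b''.join(self.chunks[i] for i in sorted(self.chunks))
            self.send_reply(payload)

replies = []
ctx = ReadContext(2, replies.append)
ctx.finish_read(1, b'world')           # nothing sent yet
ctx.finish_read(0, b'hello ')          # last read completes, reply goes out
assert replies == [b'hello world']
```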
When compiling Ceph, ccache may miss more often than expected, as shown by the cache miss line of ccache -s:
cache directory /home/loic/.ccache
cache hit (direct) 1
cache hit (preprocessed) 0
cache miss 1
files in cache 3
cache size 392 Kbytes
max cache size 10.0 Gbytes
Compiling Ceph from clones in two different directories does not explain the miss, unless CCACHE_HASHDIR is set: when it is, ccache includes the current directory in the hash, so identical compilations in different directories never hit the cache. It should be unset with:
unset CCACHE_HASHDIR
When a command is sent to the Ceph monitor, such as ceph osd pool create, it adds a pool to the pending changes of the maps. The modification is stashed for paxos propose interval seconds before it is used to build new maps and becomes effective. This guarantees that the mons are not updated more than once a second (the default value of paxos propose interval).
When running make check, changing the paxos propose interval value to 0.01 seconds for the cephtool tests roughly halves their run time (going from ~2.5 to ~1.25 minutes real time).
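For a local test cluster, the interval can be lowered in the ceph.conf it uses, for example:

```
[mon]
    paxos propose interval = 0.01
```

This is only sensible for testing: in production the default of one second keeps the mons from building new maps too frequently.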