Welcome to the Free Software contributions diary of Loïc Dachary. Although the posts look like blog entries, they really are technical reports about the work done during the day. They are meant to be used as a reference by co-developers.

Removing potential backdoors from Tails 3.0

The default Tails 3.0 bootable ISO includes proprietary binary blobs running on network hardware. They may contain backdoors and are silently loaded when Tails boots. No exploit is known at this date, but it may take years before a backdoor is discovered. To remove this security and privacy risk, a new ISO can be built from a pristine Debian GNU/Linux 9 (stretch) installation.

$ sudo apt-get update
$ sudo apt-get install -y git
$ git clone -b stable https://git-tails.immerda.ch/tails
$ cd tails

Edit config/chroot_apt/preferences and remove the following block:

Explanation: src:firmware-nonfree
Package: firmware-linux firmware-linux-nonfree firmware-amd-graphics ...
Pin: release o=Debian,n=sid
Pin-Priority: 999
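
The manual edit can also be scripted. The following is a minimal sketch in Python, assuming the stanzas in config/chroot_apt/preferences are separated by blank lines and that the firmware block is the only one containing the string src:firmware-nonfree. The helper names drop_stanzas and remove_firmware_pin are made up for this illustration and are not part of the Tails build system.

```python
def drop_stanzas(text, marker):
    """Return text without the blank-line separated stanzas that contain marker."""
    stanzas = [s.strip("\n") for s in text.split("\n\n")]
    kept = [s for s in stanzas if s and marker not in s]
    return "\n\n".join(kept) + "\n"


def remove_firmware_pin(path="config/chroot_apt/preferences"):
    """Rewrite the preferences file in place without the firmware stanza."""
    with open(path) as f:
        content = f.read()
    with open(path, "w") as f:
        f.write(drop_stanzas(content, "src:firmware-nonfree"))
```

Calling remove_firmware_pin() from the top of the tails checkout would apply the change. Reviewing the result with git diff before building is still advisable.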

Build the bootable ISO

$ sudo tee /etc/apt/preferences.d/00-builder-jessie-pinning <<EOF
Package: *
Pin: release o=Debian,a=stable
Pin-Priority: 700

Package: *
Pin: origin deb.tails.boum.org
Pin-Priority: 800
EOF
$ sudo apt-get install -y software-properties-common dirmngr
$ sudo add-apt-repository 'deb http://deb.tails.boum.org/ builder-jessie main'
$ sudo apt-key adv --receive-keys C7988EA7A358D82E
$ sudo apt-get update
$ sudo apt-get install -y \
  dpkg-dev \
  gettext \
  intltool \
  libfile-slurp-perl \
  liblist-moreutils-perl \
  libyaml-libyaml-perl \
  libyaml-perl \
  libyaml-syck-perl \
  perlmagick \
  po4a \
  syslinux-utils \
  time
# because lb build sets /etc/resolv.conf in the chroot, a local DNS server is needed
$ sudo apt-get install -y bind9
$ sudo systemctl start bind9
$ sudo apt-get install ikiwiki
Get:6 http://.../main amd64 libmarkdown2 amd64 2.2.1-1~bpo8+1~0.tails1 [35.0 kB]
Get:7 http://.../main amd64 ikiwiki all 3.20160905.0tails1 [1,413 kB]
# because --no-merge-usr is not in builder-jessie debootstrap
$ sudo apt-get install debootstrap=1.0.89
$ sudo apt-get install live-build
$ sudo lb clean --all
$ sudo lb config
$ sudo lb build

The *.iso file can then be installed.

Posted in tails

Run SecureDrop tests without Vagrant

Assuming a fresh installation of Ubuntu 14.04, the SecureDrop repository and its dependencies can be installed with the following:

sudo apt-get update
sudo apt-get install -y python-virtualenv git
sudo apt-get install -y build-essential libssl-dev libffi-dev python-dev
virtualenv /tmp/v
source /tmp/v/bin/activate
pip install --upgrade pip # so it is able to get binary wheels
pip install ansible # so we have version 2+

git clone http://github.com/freedomofpress/securedrop
cd securedrop

cat > /tmp/inventory <<EOF

ansible-playbook -vvvv \
       -e securedrop_repo=$(pwd) \
       -e non_default_securedrop_user=ubuntu \
       -e non_default_securedrop_code=$(pwd)/securedrop \
       -i /tmp/inventory -c local \

And the tests can then be run with

$ cd securedrop
$ DISPLAY=:1 pytest -v tests
Posted in SecureDrop

Shrink an OpenStack image

Over time, openstack image create produces increasingly large files because blocks that were used and then freed are not trimmed, and it is not uncommon for hypervisors to lack fstrim support. The image can be shrunk and the virtual machine recreated from it to reclaim the unused space.

$ openstack image save --file 2017-06-16-gitlab.qcow2 2017-06-16-gitlab
$ qemu-img convert 2017-06-16-gitlab.qcow2 -O raw work.img
$ sudo kpartx -av work.img
add map loop0p1 (252:0): 0 104855519 linear 7:0 2048
$ sudo e2fsck -f /dev/mapper/loop0p1
cloudimg-rootfs: 525796/6400000 files (0.1% non-contiguous), 2967491/13106939 blocks
$ sudo resize2fs -p -M /dev/mapper/loop0p1
The filesystem on /dev/mapper/loop0p1 is now 3190624 (4k) blocks long.
$ sudo kpartx -d work.img
loop deleted : /dev/loop0

Create a smaller image that is big enough for the resized file system.

$ sudo virt-df -h work.img
Filesystem                                Size       Used  Available  Use%
work.img:/dev/sda1                         12G       9.7G       2.0G   83%
$ qemu-img create -f raw smaller.img 13G
Formatting 'smaller.img', fmt=raw size=13958643712
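
The 13G figure can be cross-checked against the outputs above. This is a small sketch, assuming 512-byte sectors and using the block count reported by resize2fs together with the partition offset (2048) reported by kpartx. The function name min_image_gib is illustrative, and the post does not say that 13G was derived this way.

```python
import math


def min_image_gib(fs_blocks, block_size, start_sector, sector_size=512):
    """Smallest whole number of GiB holding the partition offset plus the file system."""
    needed = start_sector * sector_size + fs_blocks * block_size
    return math.ceil(needed / 2**30)


# 3190624 blocks of 4k (resize2fs), partition starting at sector 2048 (kpartx)
print(min_image_gib(3190624, 4096, 2048))  # prints 13
```

The shrunk file system needs a little over 12 GiB, so rounding up to 13G leaves some slack for the partition table.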

Resize the large image into the smaller one:

$ sudo virt-resize --shrink /dev/sda1 work.img smaller.img
Resize operation completed with no errors.  Before deleting the old disk,
carefully check that the resized disk boots and works correctly.
$ ls -lh work.img smaller.img
-rw-r--r-- 1 ubuntu ubuntu 13G Jun 16 08:38 smaller.img
-rw-r--r-- 1 ubuntu ubuntu 50G Jun 16 08:31 work.img
$ qemu-img convert smaller.img -O qcow2 smaller.qcow2

Upload the smaller image

$ time openstack image create --file smaller.qcow2 \
     --disk-format=qcow2 --container-format=bare 2017-06-16-gitlab-smaller
Posted in openstack

Installing Tails with KVM

For test purposes it is useful to bootstrap Tails using virtual machines and files. Here is how it can be done with KVM.

$ wget 'https://tails-dl.urown.net/tails/stable/tails-amd64-3.0/tails-amd64-3.0.iso'
$ qemu-img create -f raw tails-installed.img 4G
$ kvm -m 4096 -cdrom tails-amd64-3.0.iso -device piix3-usb-uhci \
               -boot d \
               -drive id=my_usb_disk,file=tails-installed.img,if=none,format=raw \
               -device usb-storage,drive=my_usb_disk,removable=on \
               -net nic -net user -display sdl

From the virtual machine console, install Tails to the USB device (i.e. tails-installed.img). When the installation is complete, stop KVM with Control-C. Copy tails-installed.img to tails-backup.img in case you want to start over. Run Tails with:

$ kvm -m 4096 -device piix3-usb-uhci \
               -drive id=my_usb_disk,file=tails-installed.img,if=none,format=raw \
               -device usb-storage,drive=my_usb_disk,removable=on \
               -net nic -net user -display sdl

Posted in tails

Installing python-crush on CentOS 7 without network

To install python-crush on a CentOS 7 machine that has no Internet access, the necessary files must be transferred with a USB drive. The python34-pip package must be installed from the EPEL repository the machine uses for maintenance purposes.

On the machine with access to internet:

$ sudo pip3 install wheel
$ sudo pip3 wheel --wheel-dir /usbdrive crush wheel

On the machine with no access to internet:

$ sudo pip3 install /usbdrive/wheel-*
$ sudo pip3 install --no-index --use-wheel --find-links=/usbdrive crush

The crush command can be verified with:

$ crush --help
Posted in crush

A tool to rebalance uneven Ceph pools

The algorithm to fix uneven CRUSH distributions in Ceph was implemented as the crush optimize subcommand. Given the output of ceph report, crush analyze can show buckets that are over/under filled:

$ ceph report > ceph_report.json
$ crush analyze --crushmap ceph_report.json --pool 3
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
cloud3-1363    -6    419424   1084                   7.90
cloud3-1364    -7    427290   1103                   7.77
cloud3-1361    -4    424668   1061                   4.31
cloud3-1362    -5    419424   1042                   3.72
cloud3-1359    -2    419424   1031                   2.62
cloud3-1360    -3    419424    993                  -1.16
cloud3-1396    -8    644866   1520                  -1.59
cloud3-1456   -11    665842   1532                  -3.94
cloud3-1397    -9    644866   1469                  -4.90
cloud3-1398   -10    644866   1453                  -5.93

Worst case scenario if a host fails:

        ~over filled %~
device            30.15
host              10.53
root               0.00
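
The over/under filled column of the first table can be reproduced by comparing the PGs each host received with its weight-proportional share. The sketch below is inferred from the figures in the table, not taken from the crush source, and the function name is illustrative.

```python
def over_under_filled(pgs, weight, total_weight, total_pgs):
    """Percentage by which pgs exceeds or falls short of the weight-proportional share."""
    expected = total_pgs * weight / total_weight
    return (pgs / expected - 1) * 100


# (weight, PGs) for each host, copied from the table above
hosts = {
    "cloud3-1363": (419424, 1084),
    "cloud3-1364": (427290, 1103),
    "cloud3-1361": (424668, 1061),
    "cloud3-1362": (419424, 1042),
    "cloud3-1359": (419424, 1031),
    "cloud3-1360": (419424, 993),
    "cloud3-1396": (644866, 1520),
    "cloud3-1456": (665842, 1532),
    "cloud3-1397": (644866, 1469),
    "cloud3-1398": (644866, 1453),
}
total_weight = sum(w for w, _ in hosts.values())
total_pgs = sum(p for _, p in hosts.values())  # 4096 PGs times 3 replicas

for name, (weight, pgs) in hosts.items():
    print(f"{name} {over_under_filled(pgs, weight, total_weight, total_pgs):6.2f}")
```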

The crush optimize command will create a crushmap rebalancing the PGs:

$ crush optimize --crushmap ceph_report.json \
                 --out-path optimized.crush --pool 3
2017-05-27 20:22:17,638 argv = optimize --crushmap ceph_report.json \
  --out-path optimized.crush --pool 3 --replication-count=3 \
  --pg-num=4096 --pgp-num=4096 --rule=data --out-version=j \
  --no-positions --choose-args=3
2017-05-27 20:22:17,670 default optimizing
2017-05-27 20:22:24,165 default wants to swap 447 PGs
2017-05-27 20:22:24,172 cloud3-1360 optimizing
2017-05-27 20:22:24,173 cloud3-1359 optimizing
2017-05-27 20:22:24,174 cloud3-1361 optimizing
2017-05-27 20:22:24,175 cloud3-1362 optimizing
2017-05-27 20:22:24,177 cloud3-1364 optimizing
2017-05-27 20:22:24,177 cloud3-1363 optimizing
2017-05-27 20:22:24,179 cloud3-1396 optimizing
2017-05-27 20:22:24,188 cloud3-1397 optimizing
2017-05-27 20:22:27,726 cloud3-1360 wants to swap 21 PGs
2017-05-27 20:22:27,734 cloud3-1398 optimizing
2017-05-27 20:22:29,151 cloud3-1364 wants to swap 48 PGs
2017-05-27 20:22:29,176 cloud3-1456 optimizing
2017-05-27 20:22:29,182 cloud3-1362 wants to swap 32 PGs
2017-05-27 20:22:29,603 cloud3-1361 wants to swap 47 PGs
2017-05-27 20:22:31,406 cloud3-1396 wants to swap 77 PGs
2017-05-27 20:22:33,045 cloud3-1397 wants to swap 61 PGs
2017-05-27 20:22:33,160 cloud3-1456 wants to swap 58 PGs
2017-05-27 20:22:33,622 cloud3-1398 wants to swap 47 PGs
2017-05-27 20:23:51,645 cloud3-1359 wants to swap 26 PGs
2017-05-27 20:23:52,090 cloud3-1363 wants to swap 43 PGs

Before uploading the crushmap (with ceph osd setcrushmap -i optimized.crush), crush analyze can be used again to verify it improved as expected:

$ crush analyze --crushmap optimized.crush --pool 3 --replication-count=3 \
                --pg-num=4096 --pgp-num=4096 --rule=data --choose-args=0
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
cloud3-1359    -2    419424   1007                   0.24
cloud3-1363    -6    419424   1006                   0.14
cloud3-1360    -3    419424   1005                   0.04
cloud3-1361    -4    424668   1017                  -0.02
cloud3-1396    -8    644866   1544                  -0.04
cloud3-1397    -9    644866   1544                  -0.04
cloud3-1398   -10    644866   1544                  -0.04
cloud3-1364    -7    427290   1023                  -0.05
cloud3-1456   -11    665842   1594                  -0.05
cloud3-1362    -5    419424   1004                  -0.06

Worst case scenario if a host fails:

        ~over filled %~
device            11.39
host               3.02
root               0.00

Posted in ceph, crush

An algorithm to fix uneven CRUSH distributions in Ceph

The current CRUSH implementation in Ceph does not always provide an even distribution.

The most common cause of unevenness is when only a few thousand PGs, or fewer, are mapped. This is not enough samples and the variations can be as high as 25%. For instance, when there are two OSDs with the same weight, distributing four PGs randomly among them may lead to one OSD having three PGs and the other only one. This problem would be resolved by having thousands of PGs per OSD, but that is not recommended because it would require too many resources.
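
The effect of a small sample can be quantified with a simplified model in which each PG independently picks one of several equally weighted OSDs. This is an assumption made for illustration, not how CRUSH maps PGs, and the function names are hypothetical.

```python
from math import comb, sqrt


def split_probability(pg_num, on_first_osd):
    """Probability that exactly on_first_osd of pg_num PGs land on the first of two equal OSDs."""
    return comb(pg_num, on_first_osd) / 2**pg_num


def relative_std(pg_num, osd_num):
    """Standard deviation of the PG count of one OSD, relative to its mean."""
    p = 1 / osd_num
    return sqrt(pg_num * p * (1 - p)) / (pg_num * p)


# Four PGs on two OSDs: chance of a three/one split, in either direction
print(split_probability(4, 3) + split_probability(4, 1))  # prints 0.5

# Ten OSDs with an average of 100, 1000 and 10000 PGs each
for pgs_per_osd in (100, 1000, 10000):
    print(pgs_per_osd, relative_std(pgs_per_osd * 10, 10))
```

Under this model the relative standard deviation is roughly 9%, 3% and 1% for 100, 1000 and 10000 PGs per OSD, which illustrates why only thousands of PGs per OSD would smooth out the noise.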

The other cause of uneven distribution is conditional probability. For a two-replica pool, PGs are mapped to OSDs that must be different: the second OSD is chosen at random, on the condition that it is not the same as the first OSD. When all OSDs have the same probability, this bias is not significant. But when OSDs have different weights it makes a difference. For instance, given nine OSDs with weight 1 and one OSD with weight 5, the smaller OSDs will be overfilled (from 7% to 10%) and the bigger OSD will be ~15% underfilled.
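
The bias can be computed exactly for the example above with a simplified model: the first OSD is drawn with a probability proportional to its weight, and the second is drawn the same way among the remaining OSDs. This is a sketch of the conditional probability effect only, not of the CRUSH implementation, and the function name is made up for the illustration.

```python
from fractions import Fraction


def two_replica_share(weights):
    """Share of all replicas each OSD receives when two distinct OSDs are drawn
    one after the other, each draw being proportional to the weights."""
    total = sum(weights.values())
    share = {}
    for osd, w in weights.items():
        first = Fraction(w, total)
        # the OSD can only be second if another OSD was drawn first
        second = sum(
            Fraction(wo, total) * Fraction(w, total - wo)
            for other, wo in weights.items()
            if other != osd
        )
        share[osd] = (first + second) / 2
    return share


weights = {f"osd.{i}": 1 for i in range(9)}
weights["osd.9"] = 5
total = sum(weights.values())
for osd, s in two_replica_share(weights).items():
    expected = Fraction(weights[osd], total)
    print(osd, f"{float(s / expected - 1) * 100:+.1f}%")
```

With these weights the model gives about 8.5% overfilled for each small OSD and about 15.4% underfilled for the big one, in line with the figures quoted above.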

The proposed algorithm fixes both cases by producing new weights that can be used as a weight set in Luminous clusters.

For a given pool the input parameters are:

  • pool size
  • numeric id
  • number of PGs
  • root of the CRUSH rule (take step)
  • the CRUSH rule itself

The weights are then adjusted as follows:

   for size in [1,pool size]
     copy all weights to the size - 1 weight set
     recursively walk the root
     for each bucket at a given level in the hierarchy
       repeat until the difference with the expected distribution is small
         map all PGs in the pool, with size instead of pool size
         in the size - 1 weight set
           lower the weight of the most overfilled child
           increase the weight of the most underfilled child

It is common to change the size of a pool in a Ceph cluster. When increasing the size from 2 to 3, the user expects the existing objects to stay where they are, with new objects being created to provide an additional replica. To preserve this property while optimizing the weights, there needs to be a different weight set for each possible size. This is what the outer loop (for size in [1,pool size]) does.

If the distribution is not as expected at the highest level of the hierarchy, there is no way to fix that at the lowest levels. For instance if a host receives 100 more PGs than it should, the OSDs it contains will inevitably be overfilled. This is why the optimization proceeds from the top of the hierarchy.

When a bucket is given precisely the expected number of PGs and fails to distribute them evenly among its children, the children’s weights can be modified to get closer to the ideal distribution. Increasing the weight of the most underfilled item will capture PGs from the other buckets. And decreasing the weight of the most overfilled will push PGs out of it. A simulation is run to determine precisely which PGs will be distributed to which item because there is no known mathematical formula to calculate that. This is why all PGs are mapped to determine which items are over- or underfilled.
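
The inner loop can be illustrated with a toy simulation of a single flat bucket and a single replica. This is a sketch under several assumptions, not the python-crush implementation: the draw imitates straw2 with SHA-256 instead of the hash CRUSH uses, the step and iterations parameters are arbitrary, and the loop keeps the best weights seen so far so that the result is never worse than the starting point.

```python
import hashlib
import math


def draw(pg, item, weight):
    """Straw2-style draw: ln(u) / weight, with u in (0, 1] derived from a hash."""
    digest = hashlib.sha256(f"{pg}:{item}".encode()).digest()
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
    return math.log(u) / weight


def map_pgs(weights, pg_num):
    """Simulate the mapping: each PG goes to the item with the highest draw."""
    counts = dict.fromkeys(weights, 0)
    for pg in range(pg_num):
        winner = max(weights, key=lambda item: draw(pg, item, weights[item]))
        counts[winner] += 1
    return counts


def worst_deviation(weights, target, pg_num):
    """Return (largest absolute deviation in PGs, deviation of each item)."""
    counts = map_pgs(weights, pg_num)
    total = sum(target.values())
    dev = {i: counts[i] - pg_num * target[i] / total for i in target}
    return max(abs(d) for d in dev.values()), dev


def optimize(target, pg_num, step=0.01, iterations=100):
    """Nudge the weights until the simulated distribution is close to the target."""
    weights = dict(target)
    best, best_worst = dict(weights), None
    for _ in range(iterations):
        worst, dev = worst_deviation(weights, target, pg_num)
        if best_worst is None or worst < best_worst:
            best, best_worst = dict(weights), worst
        if worst < 1:
            break
        weights[max(dev, key=dev.get)] *= 1 - step  # most overfilled child
        weights[min(dev, key=dev.get)] *= 1 + step  # most underfilled child
    return best


target = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0, "osd.3": 1.0}
before, _ = worst_deviation(target, target, 256)
after, _ = worst_deviation(optimize(target, 256), target, 256)
print(before, after)
```

The two printed values are the largest deviation, in PGs, before and after the adjustment. Every iteration maps all PGs again, which matches the remark above that a simulation is needed because no formula predicts where each PG lands.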

Posted in ceph, crush, libcrush

Ceph space lost due to overweight CRUSH items

When a CRUSH bucket contains five Ceph OSDs with the following weights:

osd.0   5
osd.1   1
osd.2   1
osd.3   1
osd.4   1

20% of the space in osd.0 will never be used by a pool with two replicas.

osd.0 gets about 55% of the values for the first replica (i.e. 5 / 9), as expected. But it can only get about 45% for the second replica (i.e. 4 / 9), because that is all there is left: the second replica can only land on osd.0 when the first one went elsewhere.

The upper bound for the weight of an item within a bucket that contains either devices or items designated to be the failure domain can be calculated as follows:

  • N is the number of replicas
  • O is the number of overweight items, i.e. items that have a weight greater than (sum of the weights)/N
  • the effective weight of all overweight items is equal to (sum of the weights of non-overweight items) / (N – O)

In the example above, the effective weight of osd.0 is therefore ( 1 + 1 + 1 + 1) / ( 2 – 1 ) = 4.
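
The rule can be transcribed literally as follows. This is a minimal sketch that applies the rule once, as worded above. The function name is hypothetical, and the crush analyze implementation may differ, for instance by repeating the computation after cropping.

```python
def effective_weights(weights, replicas):
    """Apply the rule above once: crop overweight items to their effective weight."""
    threshold = sum(weights.values()) / replicas
    overweight = [item for item, w in weights.items() if w > threshold]
    if not overweight:
        return dict(weights)
    # at most replicas - 1 items can exceed sum / replicas, so the divisor is >= 1
    others = sum(w for item, w in weights.items() if item not in overweight)
    cropped = others / (replicas - len(overweight))
    return {item: (cropped if item in overweight else w) for item, w in weights.items()}


weights = {"osd.0": 5, "osd.1": 1, "osd.2": 1, "osd.3": 1, "osd.4": 1}
print(effective_weights(weights, 2)["osd.0"])  # prints 4.0
```

Going from a weight of 5 to an effective weight of 4 is the 20% of osd.0 that can never be used.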

The crush analyze command detects weights that are above the maximum and uses their effective weight to get meaningful results. For instance:

$ crush analyze ...
        ~id~  ~weight~  ~objects~  ~over/under filled %~
osd.3      5         1        646                26.17
osd.4      6         1        610                19.14
osd.2      4         1        575                12.30
osd.1      3         1        571                11.52
osd.0      2         5       1694               -37.29

Worst case scenario if a osd fails:

        ~overfilled %~
osd             21.14
root             0.00

The following are overweight and should be cropped:

        ~id~  ~weight~  ~cropped weight~  ~cropped %~
osd.0      2         5               4.0         20.0

The osd.0 is reported to be 37.29% underfilled but 20% of that amount comes from the fact that the item is overweight. The remaining 17.29% come from the conditional probability bias and random noise due to a low number of objects.
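
The figures in the table are consistent with the following reading, which is inferred from the numbers and not from the crush source: the over/under filled percentage is computed against the cropped weights, and the cropped 20% is then added for osd.0.

```python
# objects per OSD and cropped weights, copied from the output above
objects = {"osd.0": 1694, "osd.1": 571, "osd.2": 575, "osd.3": 646, "osd.4": 610}
cropped = {"osd.0": 4, "osd.1": 1, "osd.2": 1, "osd.3": 1, "osd.4": 1}


def fill_percent(osd):
    """Over/under filled % of osd relative to its share of the cropped weights."""
    expected = sum(objects.values()) * cropped[osd] / sum(cropped.values())
    return (objects[osd] / expected - 1) * 100


print(round(fill_percent("osd.3"), 2))  # prints 26.17
print(round(fill_percent("osd.0"), 2))  # prints -17.29
```

Adding the 20% that was cropped to the 17.29% obtained for osd.0 gives the 37.29% shown in the table.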

Posted in ceph, crush

Understanding liquid democracy

I have a hard time explaining the idea of liquid democracy, and not for lack of trying. Perhaps the recursive nature of vote delegation does not come naturally to people who are not programmers. Between the two rounds of the presidential election, a project was launched to explain what it is about: mieux.vote. I signed up and will try to lend a hand. To be continued!

Posted in Liquid Democracy

Ceph full ratio and uneven CRUSH distributions

A common CRUSH rule in Ceph is

    step chooseleaf firstn 0 type host

meaning Placement Groups (PGs) will place replicas on different hosts so the cluster can sustain the failure of any host without losing data. The missing replicas are then restored from the surviving replicas (via a process called “backfilling”) and placed on the remaining hosts.

To make sure there is enough space in the cluster to cope with the failure of a host, a certain percentage of free space in the cluster is reserved (from the beginning) and never used. This percentage needs to be adjusted to take into account the most overfull OSD in case the PG distribution is not even.

For instance, in a cluster with five hosts containing two identical disks each, reserving 20% plus 6.45% to account for the most overfull OSD displayed below should be enough:

crush analyze --type device --rule data \
              --replication-count 2 \
              --crushmap mymap.txt \
              --pool 0 --pg-num 1024 --pgp-num 1024
         ~id~  ~weight~  ~objects~  ~over/under used %~
device9     9       1.0        218                 6.45
device5     5       1.0        214                 4.49
device7     7       1.0        214                 4.49
device0     0       1.0        212                 3.52
device8     8       1.0        212                 3.52
device6     6       1.0        208                 1.56
device2     2       1.0        201                -1.86
device3     3       1.0        201                -1.86
device4     4       1.0        192                -6.25
device1     1       1.0        176               -14.06

However this uneven distribution will be different when a host is removed, because that causes a change in the most overfull OSD – and the new most overfull OSD may even be worse (more overfull) than the previous one. In our example cluster, device9 was the most (6.45%) overfull. If the host containing device8 and device9 is removed, device5 becomes the most overfull OSD, and it is worse (8.98%):

         ~id~  ~weight~  ~objects~  ~over/under used %~
device5     5       1.0        279                 8.98
device7     7       1.0        270                 5.47
device2     2       1.0        268                 4.69
device0     0       1.0        267                 4.30
device3     3       1.0        249                -2.73
device6     6       1.0        246                -3.91
device1     1       1.0        241                -5.86
device4     4       1.0        228               -10.94
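
The percentages in the two tables above follow from the object counts, since all devices have the same weight. The sketch below recomputes them and the resulting reserve, assuming the 20% base reserve stands for one host out of five. The function name is illustrative.

```python
def over_used(objects, total_objects, devices):
    """Percentage above the even share for a device holding objects."""
    return (objects / (total_objects / devices) - 1) * 100


total = 1024 * 2                      # PGs times replicas
before = over_used(218, total, 10)    # device9, all five hosts up
after = over_used(279, total, 8)      # device5, one host removed
host_share = 100 / 5                  # capacity of one host out of five

print(round(before, 2), round(after, 2))  # prints 6.45 8.98
print(round(host_share + after, 2))       # prints 28.98
```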

It would therefore be better to reserve 28.98% instead of 26.45% to make sure the cluster does not become too full after a host failure. To help with that, the crush analyze command was modified to display the worst case scenario for each bucket type in the crushmap:

crush analyze --type device --rule data \
              --replication-count 2 \
              --crushmap mymap.txt \
              --pool 0 --pg-num 1024 --pgp-num 1024
         ~id~  ~weight~  ~objects~  ~over/under used %~
device9     9       1.0        218                 6.45
device5     5       1.0        214                 4.49
device7     7       1.0        214                 4.49
device0     0       1.0        212                 3.52
device8     8       1.0        212                 3.52
device6     6       1.0        208                 1.56
device2     2       1.0        201                -1.86
device3     3       1.0        201                -1.86
device4     4       1.0        192                -6.25
device1     1       1.0        176               -14.06

Worst case scenario if a host fails:

        ~over used %~
device           8.98
host             4.49
root             0.00
Posted in ceph, crush, libcrush