Netbooting multiple "workers" RPi from a "master" RPi

Question

My question is a little complex, so I will describe what I have, what I want to achieve, and what I am already able to do.

What I have

I have a group of RPi 3B+, with one RPi serving as the Master of the cluster and the others serving as Workers. The goal is to create a cluster of RPis, manageable from the Master RPi, to do distributed computations. They are all connected via Ethernet through a switch. I have set up a network with a DHCP server on the Master RPi (with dnsmasq) and I am able to SSH to any Worker RPi without any problem.

I have also set up a TFTP server on the Master RPi, for reasons I will explain in the next section.

What I want to achieve

I want to be able to flash the SD cards of the Workers remotely, all from the Master, without any physical interaction with the Worker RPis. I can manually flash a special system to the SD cards of the Workers once, but after that I'd like to be able to update the OS of the Workers without having to move anything.

The "plan" that I have now is the following:

Flash the Workers SD cards with a special system composed of 2 systems:
- A piCore (TinyCore) system that, on boot, launches a script that checks if a new image is available on the Master RPi and, if so, downloads it with TFTP, flashes it and reboots.
- A second partition that contains a "normal" system, set up by the piCore system. Ideally, piCore could set up any kind of OS (Raspbian, another piCore system or whatever).

What I am able to do

I am able to flash an SD card with piCore and use fdisk to create a new partition intended for the "normal" system. I can set up a script in piCore that checks if a new image is present on the Master RPi and downloads it.

What I don't know how to do

Once I have the image on the piCore partition, how can I flash it to the second partition? I only know the Unix command dd, which lets me flash an entire SD card from an .img file, and I obviously don't want to flash the entire SD card of the Worker, only the partition dedicated to the "normal" system.

How can I configure the bootloader of the SD card to make sure it always boots the piCore partition, and how can I, from the special script in the piCore partition, make sure that after a reboot the newly flashed "normal" system is booted and not the piCore system again?

I think I have a correct plan, at least "conceptually", to solve the problem, but I lack the technical knowledge of which tools I should use and how to configure them. Unfortunately, I don't have much experience with this kind of low-level thing.

I have also heard of "netbooting" the Workers to a filesystem located on the Master, but I'm not sure it would fit the requirements of my problem. Mainly, I don't know if it is possible for multiple Workers to netboot from the same Master (and more specifically, to use the same remote filesystem at the same time), and if it is possible to set up a Worker so that it reboots into the local filesystem on the SD card after it has checked whether a new image is available on the Master RPi.

Thanks in advance!

A very well explained question (^.^)d But using piCore and fiddling with partitions to copy a new root file system seems a bit complicated and error-prone to me. You have already looked into netbooting. I would prefer that because it is a well-known technology that uses `nfs` (the main Unix network file system) to mount the root file system locally from a remote server. The only drawback is that you need a stable network, because if it is down the Workers will get stuck. I haven't done it yet but I'm interested. If it is an alternative, I will start with an answer we can improve. — Ingo, Jul 22 '18 at 11:37
Your terminology is confusing. You can **NOT** "flash" a SD Card in the Pi. You **CAN** copy partitions and/or install or copy OS images but **NOT** modify a working OS. See [PINN](https://github.com/procount/pinn/blob/master/README_PINN.md) which is used to do something similar, although controlled by the keyboard. — Milliways, Jul 22 '18 at 12:46
@Milliways Sorry for the possibly wrong terminology. By "flashing", I mean putting a new, clean OS on the SD card. Indeed, what I mean is to copy the content of the `.img` file containing a Raspbian system and put it on the SD card of the Worker RPi. — Longwelwind, Jul 22 '18 at 14:20
@Ingo The more I research the subject, the more it seems like netbooting can help me solve my issue. I have multiple questions though: - Once the Worker RPi is netbooted on a filesystem located on the Master RPi (using `nfs`), will I be able to interact with the SD card inserted in the Worker RPi (will there be a `/dev/mmcblk0`)? - How will I be able to tell the Worker RPi, once the flashing process is done, to reboot into the system on the SD card, and not try to netboot again? The network should normally be "stable". — Longwelwind, Jul 22 '18 at 14:25
@Longwelwind I will give you an answer but I have to do a little bit of testing. Just a moment please. In general: your netbooted system behaves as if it was booted from an SD Card. There is no need to flash anything. If you have inserted an SD Card without a flashed image you should be able to mount it like any other usb stick. With a flashed image it will boot from that because it precedes netbooting. — Ingo, Jul 23 '18 at 08:39

Ingo · Accepted Answer · 2019-03-06T13:13:04.627

Here is a solution with netbooting using systemd-networkd.

Network booting works only for the wired adapter. Booting over wireless LAN is not supported ¹.

It is also important that there is already a working DHCP server on the local network.

We use RPi 3B+. It comes with "Improved PXE network and USB mass-storage booting" ². So PXE booting will work out of the box. Please forget all the quirks, hints and workarounds to netboot with older models you may find on the web. There is no need to prepare the worker for netbooting. It will simply try it if there is no SD Card inserted.

So let's look at what I've tested. I mostly followed the official tutorial ³ for older models but adapted it to the needs of this question and to the RPi 3B+.

For reference I flashed Raspbian Stretch Lite 2018-06-27, enabled ssh and made a full-upgrade. This setup can be done headless. After first boot ssh into the RPi and update Raspbian:

raspberrypi ~$ sudo -Es
raspberrypi ~# apt update
raspberrypi ~# apt full-upgrade

Setup systemd-networkd

For detailed information see ⁴; here is only the short version. Execute these commands:

# Install helpers
raspberrypi ~# apt --yes install rng-tools systemd-container

raspberrypi ~# systemctl mask networking.service
raspberrypi ~# systemctl mask dhcpcd.service
raspberrypi ~# mv /etc/network/interfaces /etc/network/interfaces~
raspberrypi ~# sed -i '1i resolvconf=NO' /etc/resolvconf.conf

raspberrypi ~# systemctl enable systemd-networkd.service
raspberrypi ~# systemctl enable systemd-resolved.service
raspberrypi ~# ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf

We will give our master a static ip address because it works as a server. For example, my master is on:

- subnet 192.168.10.0/24
- static ip address 192.168.10.60
- broadcast address 192.168.10.255
- gateway/router 192.168.10.1
- dns server 192.168.10.10

Of course you have to use the ip addresses from your network. Check what yours are. You may find your dns server with cat /etc/resolv.conf. If in doubt you may use Google's dns server 8.8.8.8. To set the static ip address write this file:

raspberrypi ~# cat > /etc/systemd/network/04-eth.network <<EOF
[Match]
Name=e*
[Network]
Address=192.168.10.60/24
Gateway=192.168.10.1
DNS=192.168.10.10
EOF

Rename hostname from raspberrypi to master:

raspberrypi ~# sed -i 's/raspberrypi/master/' /etc/hostname
raspberrypi ~# sed -i 's/raspberrypi/master/g' /etc/hosts

Reboot.

Master configuration

ssh into your master. Remember that it now has a new static ip address.

This setup will also be used for the worker, so we copy it to a directory we will later mount as root partition for the worker.

master ~$ sudo -Es
master ~# mkdir -p /nfs/worker1
master ~# rsync -xa --exclude /nfs / /nfs/worker1

Don't worry now. Depending on your SD Card copying of 1.1 GByte will take about 15 minutes or longer. Look at the green led on your RasPi.

When finished prepare the network and the name of the worker:

master ~# rm /nfs/worker1/etc/systemd/network/04-eth.network
master ~# sed -i 's/master/worker1/' /nfs/worker1/etc/hostname
master ~# sed -i 's/master/worker1/g' /nfs/worker1/etc/hosts

Now we start the worker in a container. This is similar to chroot but more powerful. We regenerate SSH host keys so ssh will not complain about spoofing ("it has already seen the same host with other ip address"):

master ~# systemd-nspawn -D /nfs/worker1 /sbin/init

Log in and execute the following commands. This will create new SSH2 server keys and try to start the ssh.service, but that will fail because the ethernet interface is already used by the master. Starting the ssh.service (here with an error) is essential because we are headless on the worker. When the worker is running on its own hardware this should work without error.

worker1 ~$ sudo rm /etc/ssh/ssh_host_*
worker1 ~$ sudo dpkg-reconfigure openssh-server
worker1 ~$ logout

Exit from the container by pressing CTRL+] three times in quick succession.

Setup tftp server

Now we will install a tftp server that is needed to send boot files to the worker. The program dnsmasq will provide this. Also we install the network sniffer tcpdump to look if the worker requests its boot files the right way:

master ~# apt --yes install dnsmasq tcpdump
master ~# # Stop dnsmasq breaking DNS resolving:
master ~# rm /etc/resolvconf/update.d/dnsmasq

Now start tcpdump so you can search for DHCP packets from the worker:

master ~# tcpdump -i eth0 port bootpc

Now power on the worker RPi without SD Card. Then you should get packets from it "DHCP/BOOTP, Request from ..."

IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:d3:85:78

Here we have to note the mac address b8:27:eb:d3:85:78 of the worker RPi. You should also see that it gets a reply with an ip address from the DHCP server on your local network, here 192.168.10.1:

IP 192.168.10.1.bootps > 192.168.10.101.bootpc: BOOTP/DHCP, Reply, length 300
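
If you would rather not copy the mac address by hand, it can be filtered out of the tcpdump output. A minimal sketch; the function name `extract_mac` is made up for illustration and it assumes tcpdump prints mac addresses in lower case, as in the line above:

```shell
# Hypothetical helper: read tcpdump output on stdin and print each
# distinct mac address (six colon-separated hex pairs) found in it.
extract_mac() {
    grep -oE '([0-9a-f]{2}:){5}[0-9a-f]{2}' | sort -u
}
```

It could be used like `tcpdump -l -i eth0 -c 5 port bootpc | extract_mac`, where `-c 5` makes tcpdump stop after five packets.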

Exit with CTRL+C. Then we have to configure dnsmasq to serve boot files via tftp. Write this file:

master ~# cat > /etc/dnsmasq.conf <<EOF
port=0
dhcp-range=192.168.10.255,proxy
log-dhcp
enable-tftp
tftp-root=/tftpboot
tftp-unique-root=mac
pxe-service=0,"Raspberry Pi Boot"
EOF

The first address of the dhcp-range is the broadcast address of your network. Now create a /tftpboot directory. The subdirectory for the specific worker (named after the mac address we noted with tcpdump) must contain only lower case characters and dashes:

master ~# mkdir -p /tftpboot/b8-27-eb-d3-85-78
master ~# chmod -R 777 /tftpboot
master ~# systemctl enable dnsmasq.service
master ~# systemctl restart dnsmasq.service
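
The directory name is just the mac address in lower case with the colons replaced by dashes. A small sketch of that conversion; the function name `mac_to_tftp_dir` is hypothetical and not part of dnsmasq:

```shell
# Hypothetical helper: turn a mac address such as B8:27:EB:D3:85:78
# into the directory name used with tftp-unique-root=mac.
mac_to_tftp_dir() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr ':' '-'
}
```

For example, `mkdir -p "/tftpboot/$(mac_to_tftp_dir b8:27:eb:d3:85:78)"` creates the same directory as above.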

Monitor dnsmasq:

master ~# journalctl --unit dnsmasq.service --follow

Now power cycle the worker RPi. You should see something like this:

master dnsmasq-tftp[756]: file /tftpboot/b8-27-eb-d3-85-78/bootcode.bin not found

Next, you will need to copy bootcode.bin and start.elf into the /tftpboot/b8-27-eb-d3-85-78 directory. You should be able to do this by copying the files from /boot, since these are the right ones. We need a kernel, so we might as well copy the entire boot directory. First, use Ctrl+C to exit the monitoring state. Then type the following:

master ~# cp -r /boot/* /tftpboot/b8-27-eb-d3-85-78

Restart dnsmasq for good measure:

master ~# systemctl restart dnsmasq

Edit /tftpboot/b8-27-eb-d3-85-78/cmdline.txt and from root= onwards, replace it with:

root=/dev/nfs nfsroot=192.168.10.60:/nfs/worker1,vers=3 rw ip=dhcp rootwait elevator=deadline

You should substitute the IP address here with the static ip address of your master.
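
If you prefer not to edit the file by hand, the same change can be sketched with sed. This assumes GNU sed (as shipped with Raspbian) and that root= is preceded by a space in your cmdline.txt; the function name `rewrite_cmdline` is only for illustration:

```shell
# Hypothetical helper: rewrite_cmdline FILE MASTER_IP NFS_DIR
# Replaces everything from " root=" to the end of the line with the
# NFS root parameters shown above.
rewrite_cmdline() {
    file=$1
    master_ip=$2
    nfs_dir=$3
    sed -i "s| root=.*| root=/dev/nfs nfsroot=${master_ip}:${nfs_dir},vers=3 rw ip=dhcp rootwait elevator=deadline|" "$file"
}
```

Called as `rewrite_cmdline /tftpboot/b8-27-eb-d3-85-78/cmdline.txt 192.168.10.60 /nfs/worker1`. Check the result with cat afterwards, because everything that followed root= in the original file is dropped.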

Set up NFS root

This should now allow your Raspberry Pi to boot through until it tries to load a root filesystem that is normally located at the second partition of the SD Card (which it doesn't have). All we have to do to get this working is to export the /nfs/worker1 filesystem we created earlier.

master ~# apt install nfs-kernel-server
master ~# echo "/nfs *(rw,sync,no_subtree_check,no_root_squash)" | tee -a /etc/exports
master ~# systemctl enable rpcbind
master ~# systemctl restart rpcbind
master ~# systemctl enable nfs-kernel-server
master ~# systemctl restart nfs-kernel-server

Finally, edit /nfs/worker1/etc/fstab and remove or comment the PARTUUID=efe16111-01 and PARTUUID=efe16111-02 lines (only proc should be left).
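
Commenting the lines can also be done non-interactively. A minimal sketch, assuming GNU sed and that the entries start with PARTUUID= at the beginning of the line; the function name `comment_partuuid_lines` is hypothetical:

```shell
# Hypothetical helper: comment out every active fstab entry that
# starts with PARTUUID=, leaving all other lines (e.g. proc) untouched.
comment_partuuid_lines() {
    sed -i 's|^PARTUUID=|#&|' "$1"
}
```

Called as `comment_partuuid_lines /nfs/worker1/etc/fstab`. Commenting instead of deleting keeps the original entries around in case you want to compare later.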

Now power cycle the worker RPi and it should boot. You can monitor again. You will also see what ip address your worker has:

master ~# exit
master ~$ journalctl --unit dnsmasq.service --follow

Now you should be able to ssh into the worker e.g. with:

master ~$ ssh pi@192.168.10.101

What to do next?

You have now a working base for one worker. It should be no problem to add the next worker2 with e.g. mac address b8:27:eb:0e:3c:6f. Create directories mkdir /tftpboot/b8-27-eb-0e-3c-6f and mkdir /nfs/worker2, copy boot and root data to it and modify /tftpboot/b8-27-eb-0e-3c-6f/cmdline.txt and /nfs/worker2/etc/fstab. Then worker2 should boot.
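
The steps for an additional worker can be sketched as a shell function. This is only an illustration under assumptions that are not part of the tested setup: it clones the boot and root directories of an existing worker (which should be shut down while copying) instead of copying from the master again, it relies on GNU cp and sed, and the function name and argument order are made up:

```shell
# Hypothetical helper, not part of any package.
# usage: add_worker NAME MAC TEMPLATE_NAME TEMPLATE_MAC MASTER_IP [TFTP_ROOT] [NFS_ROOT]
add_worker() {
    name=$1
    mac_dir=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' | tr ':' '-')
    tpl_name=$3
    tpl_mac_dir=$(printf '%s' "$4" | tr '[:upper:]' '[:lower:]' | tr ':' '-')
    master_ip=$5
    tftp_root=${6:-/tftpboot}
    nfs_root=${7:-/nfs}

    # Copy boot files and root filesystem from the template worker.
    mkdir -p "$tftp_root/$mac_dir" "$nfs_root/$name" || return 1
    cp -a "$tftp_root/$tpl_mac_dir/." "$tftp_root/$mac_dir/" || return 1
    cp -a "$nfs_root/$tpl_name/." "$nfs_root/$name/" || return 1

    # Point the kernel command line at the new NFS root.
    sed -i "s|nfsroot=[^ ]*|nfsroot=${master_ip}:${nfs_root}/${name},vers=3|" \
        "$tftp_root/$mac_dir/cmdline.txt" || return 1

    # Give the clone its own hostname.
    sed -i "s/$tpl_name/$name/" "$nfs_root/$name/etc/hostname" || return 1
    sed -i "s/$tpl_name/$name/g" "$nfs_root/$name/etc/hosts" || return 1
}
```

For example, `add_worker worker2 b8:27:eb:0e:3c:6f worker1 b8:27:eb:d3:85:78 192.168.10.60` run as root on the master. The clone inherits the already edited fstab, but also the SSH host keys of the template, so regenerate them in a container as shown above before booting the new worker.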

You can manage your workers from the master by running them in a container as shown above with sudo systemd-nspawn -D /nfs/worker1 /sbin/init, e.g. for maintenance. But this can only be done if the worker is shut down.

Yes, there is much to optimize. But this is out of scope here and can be asked as separate questions.

You need a bit of storage and you can attach an external USB storage (stick or disk) to the master. Most files are identical. It may be possible to work with hard links. There are backup strategies using this. I don't know if it is workable for this purpose.

You can strip down the operating system of the workers to just what they need. First step could be to clean up from old networking (ifupdown), dhcpcd and openresolv ⁴.

If the worker does not need to be persistent across reboots, meaning all runtime changes are forgotten, then you can use a read-only root directory. This has the big advantage that you only need one boot and root directory for all workers. The problem is that the workers need different names on the network (worker1, worker2, ...), but this can be solved with DHCP. To achieve this you can look at special transient directories ⁵ or overlay file systems ⁶.

references:
[1] Network booting
[2] Raspberry Pi 3 Model B+
[3] Network Boot Your Raspberry Pi
[4] Howto migrate from networking to systemd-networkd with dynamic failover
[5] Can a Raspberry Pi be used to create a backup of itself?
[6] How do I make the OS reset itself every time it boots up?

Thanks for the extremely detailed answer! I managed to netboot a RPi (which was easier than I thought it would be), I will check later on how I will handle the SD cards (or if I will use an external storage). — Longwelwind, Jul 25 '18 at 18:23
@Longwelwind This was the first time I've done it, so it is also documentation for myself, shared with the community. If you use it I would appreciate it if you could change the headline to something like *"Netbooting multiple workers from a master"*. That lets others find this issue more easily. Leave me a comment here if you run into problems. And yes, if it runs for you, you could accept the answer ;-) — Ingo, Jul 25 '18 at 18:42