Disk storage and backup #112

Closed
stevenveenma opened this issue Aug 24, 2020 · 10 comments · Fixed by #576
Comments

@stevenveenma

First of all, great that you forked this initiative so that it's still alive!
I've been using IOTstack since the fall of 2019 and have built various MQTT readings, Python scripts that write to InfluxDB, and Grafana visualizations. Recently I ran into some issues.

  • The IOTstack is installed on a 16GB SD card, which seemed to be sufficient. But the system is writing a lot of data to disk, which is now completely full. At first I suspected my InfluxDB files, but these are of limited size. I checked the log files, but these are limited too. When I drilled down using sudo du -hsx * | sort -rh | head -10 I discovered /var/lib/docker is using 11G. But I can't get access to the contents of this folder. I am aware Docker manages this directory itself, but how do I examine which files are responsible for this? Is there any shortcut to reclaim the unnecessary space?
  • I use my NAS to back up the contents of the volumes folder. I did this by mounting it and using a copy script in crontab, which I discussed with Graham in Improvements to Dropbox and Google Drive backups gcgarner/IOTstack#78. This worked quite well, but the contents of the volumes folder (especially InfluxDB) seem to be dynamic, and since the script copies every file change, the backup on the NAS is exploding (32 GB so far). Moreover, this backup consists of a mix of old and new files, so I suspect that when I need it, it won't work. What should work better is to copy the entire volumes folder to the NAS and then remove the old version. Can you give me advice on this?
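For reference, the drill-down into /var/lib/docker needs root privileges (the directory is owned by root, which is why a plain du as user pi reports permission errors). A minimal sketch, with a hypothetical helper name:

```shell
# Sketch: inspect /var/lib/docker one level at a time, largest first.
# Requires sudo because the tree is owned by root. Function name is hypothetical.
docker_disk_usage() {
  local dir="${1:-/var/lib/docker}"
  sudo du -h --max-depth=1 "$dir" | sort -rh | head -10
}
# Usage: docker_disk_usage                         # top-level breakdown
#        docker_disk_usage /var/lib/docker/overlay2  # drill one level deeper
```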
@petinox

petinox commented Aug 24, 2020

Perhaps it's a bit early for me to jump in on this before the people more involved in the project, but have you had a look at the backup script that takes a db snapshot (~/IOTstack/scripts/backup_influxdb.sh) mentioned here?

@Paraphraser

@petinox - you are spot on.

@stevenveenma - this is a response to your second dot point. I'm still thinking about your first dot point.

I apologise in advance if I wind up telling you a whole bunch of things you already know.

It is rarely safe to backup databases at the file system level while the DB engine is active. That's because a single transaction will often involve multiple writes to many files (journals, actual inserts/deletes/modifies, index updates, triggered actions). It's really only the DB engine that fully understands the "state" at any given moment.

I used to think it would always be safe if I stopped the DB engine before copying the file system, but a long-time DBA once told me that you could still be caught out at restore time if the DB engine was taking shortcuts with iNodes. Whether that's actually true is an open question.

The only exception to this rule I'm aware of is SQLite and, even then, only where the "conceptual" database doesn't span more than one physical file. Alarm bells should ring whenever you see an "attach" of a second SQLite database.

All the database packages I've ever used offer their own internal support for backup and restore.

For SQLite, that's the ".dump" command, the output of which is all the SQL statements needed to recreate the schema and re-insert all the data. Postgres has "pg_dumpall" which does the same kind of thing.
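As a concrete illustration of the SQLite case, a dump-and-reload round trip looks roughly like this (a sketch; it assumes the sqlite3 CLI is installed, and the function names and file paths are hypothetical):

```shell
# Sketch: back up an SQLite database as SQL text, then rebuild a fresh
# database from that text. Function names are hypothetical.
sqlite_backup() {
  sqlite3 "$1" .dump > "$2"      # e.g. sqlite_backup app.db app.sql
}
sqlite_restore() {
  sqlite3 "$1" < "$2"            # e.g. sqlite_restore restored.db app.sql
}
```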

In the case of InfluxDB and in the particular context of IOTstack, the journey begins at the "influxdb" section of "docker-compose.yml", where you will find this "volumes" definition:

      - ./backups/influxdb/db:/var/lib/influxdb/backup

In words: the absolute path /var/lib/influxdb/backup inside the container is mapped to the relative path ./backups/influxdb/db outside the container, where the leading "." implies "the directory containing docker-compose.yml" and, accordingly, means the (almost) absolute external path:

~/IOTstack/backups/influxdb/db

That directory is where the output from InfluxDB backups turns up, and where you need to place any backups you want to restore.

To take a manual backup of your InfluxDB databases, do this:

$ sudo rm ~/IOTstack/backups/influxdb/db/*
$ docker exec influxdb influxd backup -portable /var/lib/influxdb/backup

The reason for the (somewhat dangerous) "sudo rm" is that the backup process does not manage the backup directory for you. Every backup produces a complete set of files with a unique timestamp prefix. You get a mess if the backup directory isn't empty before you start.

The documentation says that there is a way of taking incremental backups, in which case you wouldn't erase the backup directory between runs, but I have not explored that.

Once the backup directory has been populated, the simplest way to proceed is to "tar" its contents. Something like:

$ cd ~/IOTstack/backups/influxdb/db/
$ sudo tar -cf ~/IOTstack/backups/influx_backup.tar .

The supplied "backup_influxdb.sh" does not do it like that. It leaves the problem to "docker_backup.sh" which produces a single .tar.gz, including everything in ~/IOTstack/backups/influxdb/db. Personally, I don't find that approach useful because almost all of the InfluxDB backup files are already gzipped. I see no sense in double-compression.

Restoring your InfluxDB databases involves:

  • Erasing the contents of ~/IOTstack/backups/influxdb/db

  • Unpacking the tar into ~/IOTstack/backups/influxdb/db

  • Taking the InfluxDB container down

  • Erasing the contents of ~/IOTstack/volumes/influxdb

  • Bringing the InfluxDB container up (which will re-initialise ~/IOTstack/volumes/influxdb with empty databases).

  • Reloading the databases from the backup via:

     $ docker exec influxdb influxd restore -portable /var/lib/influxdb/backup
    

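Strung together, the restore steps above look something like this. This is only a sketch, assuming the standard IOTstack layout and a tar created as shown earlier; the function name is hypothetical, and stop+rm stands in for "taking the container down":

```shell
#!/usr/bin/env bash
# Sketch of the InfluxDB restore sequence described above (IOTstack layout assumed).
influx_restore() {
  cd ~/IOTstack || return 1
  sudo rm -rf ./backups/influxdb/db/*                               # 1. erase old backup files
  sudo tar -xf ./backups/influx_backup.tar -C ./backups/influxdb/db # 2. unpack the tar
  docker-compose stop influxdb                                      # 3. take the container down
  docker-compose rm -f influxdb
  sudo rm -rf ./volumes/influxdb/*                                  # 4. erase the live databases
  docker-compose up -d influxdb                                     # 5. re-initialise empty databases
  sleep 10                                                          #    give the daemon time to start
  docker exec influxdb influxd restore -portable /var/lib/influxdb/backup  # 6. reload
}
```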
I posted my own backup and restore scripts in response to a Discord question. They are at Paraphraser/IOTstackBackup. They are specific to my needs and are not intended as a general purpose solution, just a source of ideas. I run my backup once a day as a cron job. I routinely restore "the most-recent backup" several times a week to a "test" RPi so I know it works.

In thinking about your NAS arrangement, I might try something like:

  • Explicitly omit ~/IOTstack/volumes/influxdb from the scope of the current cron job
  • Investigate the how-to of incremental backups
  • Add another "volumes" definition to docker-compose.yml so that I had two separate backup directories, one for my "full" backups (which needs to be erased between runs), the other for my "incremental" backups (which obviously can't be erased between runs)
  • Set up another cron job to tell influx to take periodic incremental backups
  • Add the incremental backup folder to the scope of the NAS sync.
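A hypothetical crontab for that scheme might look like the fragment below. The script names (full_backup.sh, incremental_backup.sh) are placeholders, not part of IOTstack:

```
# m  h    dom mon dow  command
0    3    *   *   *    ~/IOTstack/scripts/full_backup.sh         # daily full backup
0    */4  *   *   *    ~/IOTstack/scripts/incremental_backup.sh  # incremental every 4 hours
```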

A few things would worry me:

  1. How long are incremental backups feasible?
    • There must come a point where the advantage you gain (in terms of minimising data loss) from more-frequent incremental backups, will be cancelled out by the time it will take to reload an arbitrarily-large number of increments before your database(s) are fully operational again.
    • There must also come a point where the space devoted to an arbitrarily-large number of increments impacts your SD storage. Then it becomes less of a "mirroring" problem and more of a "copying" problem, which may have restore-time consequences too.
  2. How to test that an arbitrarily-large number of increments actually works? It's easy to test a full backup and restore. You've got one tar containing a single coherent set of backup files. It either works or it doesn't (see this note about the messages that are emitted during a reload). I don't see that it's quite so easy to test whether there's a "gotcha" point in an arbitrarily-large series of increments.

Anyway, I hope this helps you get a bit further down the track.

@Paraphraser

Here are some musings on your first dot point.

I'm running from a 450GB SSD so I don't have quite the same space constraints but, even so, I keep a careful watch on where space is disappearing.

As you've narrowed it down to Docker, this is what I would try next.

Start with:

$ docker images

You'll get something like this:

REPOSITORY            TAG                 IMAGE ID            CREATED             SIZE
grafana/grafana       latest              364e270a7f54        4 days ago          149MB
eclipse-mosquitto     latest              4af162db6b4c        5 days ago          8.65MB
iotstack_nodered      latest              cf074a8ad70c        11 days ago         539MB
eclipse-mosquitto     <none>              7f8207069305        11 days ago         8.65MB
pihole/pihole         latest              4d43d29c9890        2 weeks ago         301MB
grafana/grafana       <none>              9f31b3fc1ea3        2 weeks ago         149MB
nodered/node-red      latest              fa3bc6f20464        2 weeks ago         376MB
influxdb              latest              1ca48fe485f8        2 weeks ago         261MB
kunde21/gitea-arm     latest              b1320f20c065        3 weeks ago         104MB
portainer/portainer   latest              dbf28ba50432        4 weeks ago         62.5MB

Whenever you do a:

$ cd ~/IOTstack
$ docker-compose pull
$ docker-compose up -d

or, if you're updating Node-Red:

$ docker rmi nodered/node-red
$ cd ~/IOTstack
$ docker-compose pull
$ docker-compose up --build -d

and a new image comes down (or a new iotstack_nodered is built), you're going to see duplicate entries such as those above for Mosquitto and Grafana. You will want to delete the obsolete versions (marked "<none>") by using the IMAGE ID, as in:

$ docker rmi 7f8207069305

Sometimes, when you do that, you'll get an error like this:

Error response from daemon: conflict: unable to delete XXXXXXXXXXXX - image is being used by YYYYYYYYYYYY

and you solve that by:

$ docker rm YYYYYYYYYYYY

then retry the rmi. On some occasions, you need to iterate a few times to rm a succession of conflicts until the rmi succeeds.
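The rm-then-rmi iteration can be sketched as a small helper. This is an assumption-laden convenience, not part of IOTstack; the function name is hypothetical and it assumes the docker CLI:

```shell
# Sketch: remove an obsolete image, first removing any containers that
# still reference it. Function name is hypothetical.
remove_image() {
  local image_id="$1"
  # remove containers created from this image (xargs -r: do nothing if none)
  docker ps -a -q --filter "ancestor=$image_id" | xargs -r docker rm
  docker rmi "$image_id"
}
# Usage: remove_image 7f8207069305
```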

Any time you have needed to use rm (either explicitly or implicitly), that can leave dangling volume layers. You can see if that is the case with:

$ docker volume ls

The desired response from that command is an empty list, as in:

DRIVER              VOLUME NAME

If you get anything in the list, first try:

$ docker volume rm $(docker volume ls -q)

which will remove whatever it can remove. Retry docker volume ls and if there's anything left in the list, pick the first one and try removing it with its identifier. Something like:

$ docker volume rm 694ac4af40d6cab85db8b55fd4671791c4cce81499bd5a9003ef983982930645

That may chuck up a dependency, which you deal with by trying docker volume rm on the dependency identifier (and possibly the dependency's dependency identifier, and so on).

Each time you are successful in removing something, retry the:

$ docker volume rm $(docker volume ls -q)

Eventually, you'll achieve an empty list.
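The retry loop described above can be sketched like this (a hypothetical helper; it stops either when the list is empty or when a pass makes no progress, so it won't spin forever on a stubborn dependency):

```shell
# Sketch: repeatedly attempt "docker volume rm" on everything dangling
# until the list is empty or no further progress is made.
remove_dangling_volumes() {
  local before after
  while true; do
    before=$(docker volume ls -q | wc -l)
    [ "$before" -eq 0 ] && break                      # empty list: done
    # unquoted expansion is deliberate: pass every volume id at once
    docker volume rm $(docker volume ls -q) 2>/dev/null
    after=$(docker volume ls -q | wc -l)
    [ "$after" -eq "$before" ] && break               # no progress: stop
  done
  docker volume ls                                    # show what (if anything) is left
}
```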

If what you're experiencing with space going walkabout is due to any/all of the above then you'll probably find, like I did, that you suddenly have a whole lot of extra space.

In theory there are "prune" commands that automate all of this but my experience with those has been, shall we say, less than stellar, so I'm sticking with the primitive commands as per the above.

@stevenveenma
Author

stevenveenma commented Aug 25, 2020

Thank you very much for your very elaborate answers. It will take me some time to interpret all the information and test things. I will first focus on the disk storage problem so that the operation of the RPi is guaranteed.

I did the docker images command and found 7 images of 715MB each with the none tag, which I deleted. I rebooted, but the filesystem was still completely full. Then I did docker volume ls. Two volumes showed up. Unfortunately I couldn't remove them using your instructions, as the volumes were in use. I searched for instructions and found docker container stop $(docker container ls -aq). Then I did the prune command on volumes and the ls was empty. After rebooting, df showed that the volume had 50% of its space restored! But..... now Docker doesn't start any more. I try to connect to Portainer but get nothing. I found out that a stale docker.pid file might obstruct the restart of the Docker daemon, so I renamed it and rebooted. Now Grafana shows up, but with many errors. Portainer and the other applications don't show up. I've messed things up. Enough for today.

@Paraphraser

Paraphraser commented Aug 26, 2020

I mentioned I had tried the docker image prune -a (see ~/IOTstack/scripts/prune-images.sh) with less-than-satisfactory outcomes but I've just stumbled across:

$ docker system prune

When you run it, it responds:

WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all dangling images
  - all dangling build cache

Are you sure you want to continue? [y/N]

I've run it on all of my Pis. Two (RPi4s) were in a "clean" state as per my earlier reply, so it found nothing to do, but on an RPi3 (running from SD) that I knew had some dangling images, it promptly reclaimed 150MB. No muss, no fuss.

It remains to be seen whether it really goes all-the-way and cleans up dangling volumes - are those implied in "build cache"? If it does, it will be a very useful command.

Also take a look at:

$ docker system df

@stevenveenma
Author

I was disappointed that your instructions on this didn't lead to the right result (reclaimed space). So when I found prune instructions at https://linuxize.com/post/how-to-remove-docker-images-containers-volumes-and-networks/ I tried those. A rough road for a task that seems so easy.

OK, my Docker seems to be broken now. Any chance of repairing this, or should I just reinstall the whole thing? Reflashing an image and reinstalling Docker should be easy, but restoring all the scripts and settings needs some attention. I will try to make a backup in advance and then see whether it can be restored without issues.

@Paraphraser

I'm not sure I understand.

I just worked through every "prune" in that linuxize link you provided (thanks) but my three RPis each responded with "Total reclaimed space: 0B".

I had taken docker system prune to be the "super killer" that would supplant the need for any other "prune".

Are you saying that docker system prune didn't work for you? Or are you saying that it worked to some extent but you're still chasing lost space somewhere?


I think the answer to your second question depends on how you've been taking backups. Graham Garner's backup script(s) produce a single tar.gz which contains the current docker-compose.yml, everything in services, everything in volumes with the exception of influxdb and nextcloud, plus the result of telling influxdb to dump its databases.

If you have been running those as well then you should be able to extract the contents and move those into place after you've done a clean install. You'll probably want to take a look at my restore script (link earlier in this issue) to get some ideas of how to proceed (mainly the approach to preserving permissions, and the how-to of reloading influxdb).

I've been tinkering with those scripts and have realised something else. When docker-compose does its thing, it gets upset if anything referred to in the "services" area isn't present. Conversely, it auto-creates anything referred to in the "volumes" area and, in the case of influx, that includes the path ~/IOTstack/backups/influxdb/db. The problem is that it gets the permissions wrong (backups needs pi:pi, while influxdb & db need root:root). I think that's something to be aware of if you're trying a bare-metal restore. If you do nothing else, you'll probably want to:

$ mkdir -p ~/IOTstack/backups/influxdb/db
$ sudo chown -R root:root ~/IOTstack/backups/influxdb

I'm not sure whether I said this before but a backup omits volumes/influxdb so there's nothing there when volumes is moved into place on a restore. Bringing the influxdb container up creates volumes/influxdb/data and then the daemon running inside the container initialises some empty structures. Then it's ready to be told to restore from backups/influxdb/db.

Once or twice when I've been testing things I've done this:

$ cd ~/IOTstack
$ docker-compose down
$ cd
$ sudo mv IOTstack IOTstack.off
$ git clone https://github.com/SensorsIot/IOTstack.git IOTstack
$ cd IOTstack
$ ./menu.sh

and I have never had any trouble with either that instantiation, or when I blow it away and put the IOTstack.off back into place.

I think it's pretty robust, all things considered.

If you don't have what I will call a "classic" backup and everything is in your NAS then, if it were me, I'd probably do a clean checkout, configure things how I wanted, then start pulling stuff back from the NAS, resolving ownership and permissions mismatches in favour of what I saw in the clean install.

You could try recovering InfluxDB from NAS (in the sense of copying stuff into volumes/influxdb/data) but I'll be extremely surprised if it actually works. Pleased for you. But still surprised.

@stevenveenma
Author

Sorry for my late reply, it has been some rough weeks. I just installed Docker again following your instructions and at least I have Pi-hole working now. If I have some more time I will dive into InfluxDB and MQTT again. Thanks so far.

@chapelhillmccl

I have had similar problems running out of disk space with the original IOTstack. One thing to note is that if you run out of disk space, some Docker instances will not be running and can then be removed using the prune command. I was able to use the scripts to refresh the Docker containers and, with a backup, to get running again.

@Paraphraser

Stale issue. Can be closed.
