Overview
This is a quick write up of what I had spent a few weeks trying to work out.
The adventure happened at the beginning of October, so don’t blindly copy paste
queries without making absolutely sure you’re deleting the right stuff. Use
select
generously.
When connected to the DB - run \timing
. It prints the time taken to execute
every query - a really nice thing to get a grasp when things take longer.
I’ve had duplicates in instance
, person
, site
, community
, post
and
received_activity
.
The quick gist of this is the following:
- Clean up
- Reindex
- Full vacuum
I am now certain vacuuming is not, strictly speaking, necessary, but it makes me feel better to have all the steps I had taken written down.
\d
- list tables (look at it as describe database
);
\d tablename
- describe table.
\o filename\ - save all output to a file on a filesystem.
/tmp/query.sql` was my choice.
instance
You need to turn indexscan
and bitmapscan
off to actually get the duplicates
SET enable_indexscan = off;
SET enable_bitmapscan = off;
The following selects the dupes
SELECT
id,
domain,
published,
updated
FROM instance
WHERE
domain IN (
SELECT
domain
FROM
instance
GROUP BY domain
HAVING COUNT(*) > 1
)
ORDER BY domain;
Deleting without using the index is incredibly slow - turn it back on:
SET enable_indexscan = on;
SET enable_bitmapscan = on;
DELETE FROM instance WHERE id = ;
Yes, you can build a fancier query to delete all the older/newer IDs at once. No, I do not recommend it. Delete one, confirm, repeat.
At first I was deleting the newer IDs; then, after noticing the same instances
were still getting new IDs I swapped to targetting the old ones. After noticing
the same god damn instances still getting new duplicate IDs, I had to dig
deeper and, by some sheer luck discovered that I need to reindex
the database
to bring it back to sanity.
Reindexing
the database takes a very long time - don’t do that. Instead
target the table - that should not take more than a few minutes. This, of
course, all depends on the size of the table, but instance
is naturally going
to be small.
REINDEX TABLE instance;
If reindexing
succeeds - you have cleaned up the table. If not - it will yell
at you with the first name that it fails on. Rinse and repeat until it’s happy.
Side note - it is probably enough to only reindex
the index that’s failing,
but at this point I wanted to ensure at least the whole table is in a good
state.
Looking back - if I could redo it - I would delete the new IDs only, keeping the old ones. I have no evidence, but I think getting rid of the old IDs introduced more duplicates in other related tables down the line. At the time, of course, it was hard to tell WTF was going on and making a wrong decision was better than making no decision.
person
The idea is the same for all the tables with duplicates; however, I had to modify the queries a bit due to small differences.
What I did at first, and you shouldn’t do:
SET enable_indexscan = off;
SET enable_bitmapscan = off;
DELETE FROM person
WHERE
id IN (
SELECT id
FROM (
SELECT id, ROW_NUMBER() OVER (PARTITION BY actor_id ORDER BY id)
AS row_num
FROM person) t
WHERE t.row_num > 1 limit 1);
The issue with the above is that it, again, runs a delete
without using the
index. It is horrible, it is sad, it takes forever. Don’t do this. Instead,
split it into a select
without the index and a delete
with the index:
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT
id, actor_id, name
FROM person a
USING person b
WHERE
a.id > b.id
AND
a.actor_id = b.actor_id;
SET enable_indexscan = on;
SET enable_bitmapscan = on;
DELETE FROM person WHERE id = ;
person
had dupes into the thousands - I just didn’t have enough time at that
moment and started deleting them in batches:
DELETE FROM person WHERE id IN (1, 2, 3, ... 99);
Again - yes, it can probably all be done in one go. I hadn’t, and so I’m not
writing it down that way. This is where I used \o
to then manipulate the output to be in batches using coreutils. You can do that, you can make the database do it for you. I’m a better shell user than an SQL user.
Reindex
the table and we’re good to go!
REINDEX table person;
site
, community
and post
Rinse and repeat, really. \d tablename
, figure out which column is the one to
use when looking for duplicates and delete-reindex-move on
.
received_activity
This one deserves a special mention, as it had 64 million rows in the database
when I was looking at it. Scanning such a table takes forever and, upon closer
inspection, I realised there’s nothing useful in it. It is, essentially, a log
file. I don’t like useless shit in my database, so instead of trying to find the
duplicates, I decided to simply wipe most of it in hopes the dupes would go with
it. I did it in 1 million increments, which took ~30 seconds each run on the
single threaded 2GB RAM VM the database is running on. The reason for this was
to keep the site running as lemmy
backend starts timing out otherwise and
that’s not great.
Before deleting anything, though, have a look at how much storage your tables are taking up:
SELECT
nspname AS "schema",
pg_class.relname AS "table",
pg_size_pretty(pg_total_relation_size(pg_class.oid)) AS "total_size",
pg_size_pretty(pg_relation_size(pg_class.oid)) AS "data_size",
pg_size_pretty(pg_indexes_size(pg_class.oid)) AS "index_size",
pg_stat_user_tables.n_live_tup AS "rows",
pg_size_pretty(
pg_total_relation_size(pg_class.oid) /
(pg_stat_user_tables.n_live_tup + 1)
) AS "total_row_size",
pg_size_pretty(
pg_relation_size(pg_class.oid) /
(pg_stat_user_tables.n_live_tup + 1)
) AS "row_size"
FROM
pg_stat_user_tables
JOIN
pg_class
ON
pg_stat_user_tables.relid = pg_class.oid
JOIN
pg_catalog.pg_namespace AS ns
ON
pg_class.relnamespace = ns.oid
ORDER BY
pg_total_relation_size(pg_class.oid) DESC;
Get the number of rows:
SELECT COUNT(*) FORM received_activity;
Delete the rows at your own pace. You can start with a small number to get the
idea of how long it takes (remember \timing
? ;) ).
DELETE FROM received_activity where id < 1000000;
Attention! Do let the autovacuum
finish after every delete query.
I ended up leaving ~3 million rows, which at the time represented ~ 3 days of federation. I chose 3 days as that is the timeout before an instance is marked as dead if no activity comes from it.
Now it’s time to reindex
the table:
REINDEX TABLE received_activity;
Remember the reported size of the table? If you check your system, nothing will have changed - that is because postgres does not release freed up storage to the kernel. It makes sense under normal circumstances, but this situation is anything but.
Clean all the things!
VACUUM FULL received_activity;
Now you have reclaimed all that wasted storage to be put to better use.
In my case, the database (not the table) shrunk by ~52%!
I am now running a cronjob that deletes rows from received_activity
that are
older than 3 days:
DELETE FROM
received_activity
WHERE
published < NOW() - INTERVAL '3 days';
In case you’re wondering if it’s safe deleting such logs from the database - Lemmy developers seem to agree here and here.