Discussion: Soliciting ideas for v2.0
Christopher Browne
2007-06-28 16:31:46 UTC
The v2.0 branch has already taken on the following fairly significant
changes (listed below). Jan and I had a chat this afternoon about
some items still to do:

I'm working on:

- getting the scripted tests to generate some SQL INSERT statements
summarizing the way the tests went; that would very much ease
collecting data about conformance

- more doc improvements on the removal of TABLE ADD KEY

Jan is looking at:

- Making the queries that ultimately determine which sl_log_[12]
data needs to be replicated perform better when a subscription
runs for a very long time

Other items coming along are:

- Improving LISTEN/UNLISTEN logic to be much quieter
- Better listen path generation

There's a further "wish list" we're looking at, for which it seems
timely to solicit additions:

- CANCEL SUBSCRIPTION - e.g. - cancel a failing subscription.

- Some mechanism to permit renaming tables and modifying data on the
way in to a subscriber.

- Clone Node - to allow using pg_dump/PITR to populate a new
subscriber node.

Are there more items we should try to add?

$Id: RELEASE-2.0,v 1.3 2007-06-27 15:54:37 cbbrowne Exp $

Differences from 1.2 stream

- Removal of TABLE ADD KEY

- It drops all support for Postgres versions prior to 8.3.

This is required because we now make use of new functionality in
Postgres, namely the trigger and rule support for session replication
role. As of now, every node (origin/subscriber/mixed) can be dumped with
pg_dump and result in a consistent snapshot of the database.

- Still need alterTableRestore() for the upgrade from 1.2.x to 2.0.
upgradeSchema() will restore the system catalog to a consistent
state and define+configure the new versions of the log and deny_access
triggers.

- Fix EXECUTE SCRIPT so that it records the ev_seqno for WAIT FOR EVENT
and make sure all DDL is executed in session_replication_role "local"
on the origin as well as all subscribers. This will cause the slony
triggers to ignore all DML statements while user triggers follow the
regular configuration options for ENABLE [REPLICA/ALWAYS] or DISABLE.

- Let the logshipping files also switch to session_replication_role =
"replica" or "local" (for DDL).

- Sequence tracking becomes enormously less expensive; rather than
polling *ALL* sequence values for each and every SYNC, the slon
stores the last value, and only records entries in sl_seqlog when
the value changes from that last value. If most sequences are
relatively inactive, they won't require entries in sl_seqlog very
often.

- Change to tools/slony1_dump.sh (used to generate log shipping dump);
change quoting of "\\\backslashes\\\" to get rid of warning

- Cleanup thread revised to push most of the logic to evaluate which
tables are to be vacuumed into a pair of stored functions.

This fairly massively simplifies the C code.

- Revised logging levels so that most of the interesting messages are
spit out at SLON_CONFIG and SLON_INFO levels. This can allow users
to drop out the higher DEBUG levels and still have useful logs.
--
output = reverse("ofni.sailifa.ac" "@" "enworbbc")
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
Raymond O'Donnell
2007-06-29 03:16:31 UTC
Post by Christopher Browne
There's a further "wish list" we're looking at, for which it seems
Maybe I'm missing something blindingly obvious (I'm pretty inexperienced
with Slony), but would it be possible to produce a tool that allows a
dump of a replicated node *without* all the extra stuff Slony adds - the
extra schema, triggers, etc. - in other words, output similar to that
produced by pg_dump on an unreplicated database? - preferably a binary
application rather than a shell script, so it can be used on Windows
machines also.

As I say, maybe I'm missing something.....please point me in the right
direction if I am.

Thanks,

Ray.

---------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
***@iol.ie
---------------------------------------------------------------
Filip Rembiałkowski
2007-06-29 03:58:46 UTC
Post by Raymond O'Donnell
Post by Christopher Browne
There's a further "wish list" we're looking at, for which it seems
Maybe I'm missing something blindingly obvious (I'm pretty inexperienced
with Slony), but would it be possible to produce a tool that allows a
dump of a replicated node *without* all the extra stuff Slony adds - the
extra schema, triggers, etc. - in other words, output similar to that
produced by pg_dump on an unreplicated database?
Good point IMHO; similar need here. I guess many people need this.

To produce a "clean" dump (that is, without the Slony-I schema and
triggers) we have to install a temporary backup node, wait for it to
catch up, DROP NODE it, and dump it. The whole process takes almost
twice as long as a plain pg_dump (and is more failure-prone).
--
Filip Rembiałkowski
Stéphane Schildknecht
2007-06-29 04:02:49 UTC
Post by Filip Rembiałkowski
Post by Raymond O'Donnell
Post by Christopher Browne
There's a further "wish list" we're looking at, for which it seems
Maybe I'm missing something blindingly obvious (I'm pretty inexperienced
with Slony), but would it be possible to produce a tool that allows a
dump of a replicated node *without* all the extra stuff Slony adds - the
extra schema, triggers, etc. - in other words, output similar to that
produced by pg_dump on an unreplicated database?
Good point IMHO; similar need here. I guess many people need this.
To produce a "clean" dump (that is, without the Slony-I schema and
triggers) we have to install a temporary backup node, wait for it to
catch up, DROP NODE it, and dump it. The whole process takes almost
twice as long as a plain pg_dump (and is more failure-prone).
To get the schema I personally use slony1_extract_schema.sh, and
slony1_dump.sh to get the data.

Isn't that what you're looking for?

Regards,
--
Stéphane SCHILDKNECHT
President of PostgreSQLFr
06 13 60 37 44 - 09 53 69 97 12
http://www.PostgreSQLFr.org
Raymond O'Donnell
2007-06-29 04:14:57 UTC
Post by Stéphane Schildknecht
To get the schema I personally use slony1_extract_schema.sh, and
slony1_dump.sh to get the data.
Isn't that what you're looking for?
I do manual backups from time to time from my Windows laptop, so
Unix/Linux shell scripts aren't any use to me.


Ray.


---------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
***@iol.ie
---------------------------------------------------------------
Andrew Sullivan
2007-06-29 07:19:28 UTC
Post by Raymond O'Donnell
I do manual backups from time to time from my Windows laptop, so
Unix/Linux shell scripts aren't any use to me.
You can put a POSIX shell on your laptop ;-) But I agree that this
is an area where the Slony Windows support lags.

A
--
Andrew Sullivan | ***@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun
Mikko Partio
2007-06-29 04:27:02 UTC
Post by Filip Rembiałkowski
Post by Raymond O'Donnell
Post by Christopher Browne
There's a further "wish list" we're looking at, for which it seems
Maybe I'm missing something blindingly obvious (I'm pretty inexperienced
with Slony), but would it be possible to produce a tool that allows a
dump of a replicated node *without* all the extra stuff Slony adds - the
extra schema, triggers, etc. - in other words, output similar to that
produced by pg_dump on an unreplicated database?
Good point IMHO; similar need here. I guess many people need this.
To produce a "clean" dump (that is, without the Slony-I schema and
triggers) we have to install a temporary backup node, wait for it to
catch up, DROP NODE it, and dump it. The whole process takes almost
twice as long as a plain pg_dump (and is more failure-prone).
I think the developers have it on the "TODO" list:

----------------

Differences from 1.2 stream

- Removal of TABLE ADD KEY

- It drops all support for Postgres versions prior to 8.3.

This is required because we now make use of new functionality in
Postgres, namely the trigger and rule support for session replication
role. As of now, every node (origin/subscriber/mixed) can be dumped with
pg_dump and result in a consistent snapshot of the database.

----------------

Regards

MP
Vivek Khera
2007-07-06 11:19:05 UTC
Post by Filip Rembiałkowski
To produce a "clean" dump (that is, without the Slony-I schema and
triggers) we have to install a temporary backup node, wait for it to
catch up, DROP NODE it, and dump it. The whole process takes almost
twice as long as a plain pg_dump (and is more failure-prone).
I do a "plain" pg_dump on the origin node.

To restore, I run pg_restore -l to generate a list of objects,
comment out anything inside the _REPLICATION schema and any non-data
resources, then restore into a clean schema generated via the script
Jan created to make the clean schema :-)
Christopher Browne
2007-06-29 04:57:38 UTC
Post by Raymond O'Donnell
Post by Christopher Browne
There's a further "wish list" we're looking at, for which it seems
Maybe I'm missing something blindingly obvious (I'm pretty
inexperienced with Slony), but would it be possible to produce a tool
that allows a dump of a replicated node *without* all the extra stuff
Slony adds - the extra schema, triggers, etc. - in other words, output
similar to that produced by pg_dump on an unreplicated database? -
preferably a binary application rather than a shell script, so it can
be used on Windows machines also.
As I say, maybe I'm missing something.....please point me in the right
direction if I am.
I think that's going to be an easily supported feature in 2.0 already.

- The dumps will be entirely consistent (e.g. - devoid of breakage due
to pg_catalog hacking) because the new 8.3 trigger handling makes it
unnecessary to "hack up" the system catalogue.

- You can use the pg_dump "--exclude-schema" option to leave
out the extra schema, which will make the triggers, functions, and
such fall away.

It won't quite be "automatic," but it will be plenty easier than it is
now...
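As a concrete sketch of that: Slony-I keeps its objects in a schema named after the cluster with a leading underscore, so excluding that schema from the dump should, per the point above, make the replication objects fall away. The database name "mydb" and cluster name "mycluster" below are hypothetical:

```shell
# Hypothetical names: database "mydb", Slony-I cluster "mycluster".
# Slony-I stores its functions/tables in schema "_mycluster", so
# excluding it yields output closer to a dump of an unreplicated DB.
CLUSTER=mycluster
DB=mydb
CMD="pg_dump --exclude-schema=_${CLUSTER} ${DB}"
echo "$CMD"
```

(Requires pg_dump 8.2 or later, which introduced --exclude-schema.)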
--
let name="cbbrowne" and tld="ca.afilias.info" in String.concat "@" [name;tld];;
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
Andrew Sullivan
2007-06-29 07:18:26 UTC
Post by Raymond O'Donnell
Maybe I'm missing something blindingly obvious (I'm pretty inexperienced
with Slony), but would it be possible to produce a tool that allows a
dump of a replicated node *without* all the extra stuff Slony adds
Is there something that tools/slony1_dump.sh doesn't do that you
want?

A
--
Andrew Sullivan | ***@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.
--Brad Holland
Raymond O'Donnell
2007-06-29 07:36:37 UTC
Post by Andrew Sullivan
Is there something that tools/slony1_dump.sh doesn't do that you
want?
Well, I can't run it on my WinXP laptop! :-)

Ray.

---------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
***@iol.ie
---------------------------------------------------------------
Hiroshi Saito
2007-06-29 10:26:03 UTC
Post by Raymond O'Donnell
Post by Andrew Sullivan
Is there something that tools/slony1_dump.sh doesn't do that you
want?
Well, I can't run it on my WinXP laptop! :-)
Uga.., please keep this as a TODO for the time being. I'd like to
implement it myself, but I have no spare time right now. Of course,
anyone else who wants to build it would be most welcome. :-)

Regards,
Hiroshi Saito
Christopher Browne
2007-06-29 11:05:01 UTC
Post by Hiroshi Saito
Post by Raymond O'Donnell
Post by Andrew Sullivan
Is there something that tools/slony1_dump.sh doesn't do that you
want?
Well, I can't run it on my WinXP laptop! :-)
Uga.., please keep this as a TODO for the time being. I'd like to
implement it myself, but I have no spare time right now. Of course,
anyone else who wants to build it would be most welcome. :-)
I may poke at many parts of Slony-I for version 2.0; I can pretty much
guarantee that I *will not* be working on a Windows version of
slony1_dump.sh...
--
let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;;
http://linuxfinances.info/info/oses.html
Rules of the Evil Overlord #61. "If my advisors ask "Why are you
risking everything on such a mad scheme?", I will not proceed until I
have a response that satisfies them." <http://www.eviloverlord.com/>
Jeff Davis
2007-06-29 14:05:09 UTC
Post by Christopher Browne
Are there more items we should try to add?
Is there a good way to make it more efficient to do large
deletes/updates?

For instance, if the number of tuples that need to be deleted for a
transaction exceeds a certain amount, could we use a different process
for the delete on the subscriber so that it doesn't do millions of
single-tuple deletes?

I don't know exactly how that would work, perhaps by using a temporary
table on the subscriber and doing a single "DELETE FROM foo WHERE id IN
(SELECT id FROM foo_delete_tmp)" or something?

Regards,
Jeff Davis
Christopher Browne
2007-06-29 15:07:11 UTC
Post by Jeff Davis
Post by Christopher Browne
Are there more items we should try to add?
Is there a good way to make it more efficient to do large
deletes/updates?
For instance, if the number of tuples that need to be deleted for a
transaction exceeds a certain amount, could we use a different process
for the delete on the subscriber so that it doesn't do millions of
single-tuple deletes?
I don't know exactly how that would work, perhaps by using a temporary
table on the subscriber and doing a single "DELETE FROM foo WHERE id IN
(SELECT id FROM foo_delete_tmp)" or something?
Ah, yes.

It would be nontrivial to improve this in the case of a multi-column
key, but if the candidate primary key consists of a single column, one
might be able to detect this.

Here's a sample, of sorts...

slonyregress1@[local]:5834=# select * from _slony_regress1.sl_log_1 where log_cmdtype = 'D' order by log_actionseq limit 10;
log_origin | log_xid | log_tableid | log_actionseq | log_cmdtype | log_cmddata
------------+---------+-------------+---------------+-------------+-------------
1 | 3939096 | 2 | 9277 | D | id='1'
1 | 3939096 | 1 | 9278 | D | id='1'
1 | 3939096 | 2 | 9279 | D | id='2'
1 | 3939096 | 1 | 9280 | D | id='2'
1 | 3939096 | 2 | 9281 | D | id='3'
1 | 3939096 | 2 | 9282 | D | id='4'
1 | 3939096 | 2 | 9283 | D | id='6'
1 | 3939096 | 2 | 9284 | D | id='9'
1 | 3939096 | 2 | 9285 | D | id='13'
1 | 3939096 | 2 | 9286 | D | id='18'
(10 rows)

This would normally get turned into:

delete from table2 where id = '1';
delete from table1 where id = '1';
delete from table2 where id = '2';
delete from table1 where id = '2';
... and so forth ...

It should, in principle, be possible to group together requests for one table, so that we could have the following:

delete from table2 where
id = '3'
or
id = '4'
or
id = '6'
or
id = '9'
or
id = '13'
or
id = '18';

where this groups together the consecutive entries for table #2.

This actually still works for multiple-column PKs as long as we put
parentheses around each "log_cmddata" entry. The win isn't in turning
this into a parser-challengingly-huge single query; it is in doing
several entries as one query, so I'd be inclined to let it start a new
DELETE request any time the size of the query reaches 100 tuples.
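As a rough illustration of that grouping (a sketch in Python, not Slony-I code; the `batch_deletes` helper and the table-name mapping are hypothetical), consecutive sl_log rows for the same table get folded into one DELETE, with a new statement started once the 100-tuple cap is hit:

```python
from itertools import groupby

def batch_deletes(log_rows, table_names, cap=100):
    """Fold consecutive (log_tableid, log_cmddata) rows, taken in
    log_actionseq order, into batched DELETE statements."""
    statements = []
    for tableid, group in groupby(log_rows, key=lambda r: r[0]):
        conds = [cmddata for _, cmddata in group]
        # Start a new DELETE whenever we reach the tuple cap.
        for i in range(0, len(conds), cap):
            # Parenthesize each entry so multi-column keys still work.
            where = " or ".join("(%s)" % c for c in conds[i:i + cap])
            statements.append("delete from %s where %s;"
                              % (table_names[tableid], where))
    return statements

# Mimicking the sl_log_1 sample above (table ids 1 and 2):
rows = [(2, "id='1'"), (1, "id='1'"), (2, "id='2'"), (2, "id='3'")]
for s in batch_deletes(rows, {1: "table1", 2: "table2"}):
    print(s)
```

The interleaved entry for table1 forces a statement break, but the trailing run for table2 collapses into one DELETE.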

Looking at the update case...

slonyregress1@[local]:5834=# select * from _slony_regress1.sl_log_1 where log_cmdtype = 'U' order by log_actionseq limit 10;
log_origin | log_xid | log_tableid | log_actionseq | log_cmdtype | log_cmddata
------------+---------+-------------+---------------+-------------+----------------------------
1 | 3943148 | 2 | 18633 | U | data='foo' where id='8340'
1 | 3943148 | 2 | 18634 | U | data='foo' where id='8341'
1 | 3943148 | 2 | 18635 | U | data='foo' where id='8342'
1 | 3943148 | 2 | 18636 | U | data='foo' where id='8343'
1 | 3943148 | 2 | 18637 | U | data='foo' where id='8344'
1 | 3943148 | 2 | 18638 | U | data='foo' where id='8345'
1 | 3943148 | 2 | 18639 | U | data='foo' where id='8346'
1 | 3943148 | 2 | 18640 | U | data='foo' where id='8347'
1 | 3943148 | 2 | 18641 | U | data='foo' where id='8348'
1 | 3943148 | 2 | 18642 | U | data='foo' where id='8349'
(10 rows)

It would take some parsing of the log_cmddata to do this; nonetheless,
I think it ought to be possible to compress this into some smaller
number of queries. Again, if we limited each query to process at most
100 tuples, that would still seem like enough to call it a "win."

I'll take a look...
--
output = reverse("ofni.secnanifxunil" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/rdbms.html
"When we write programs that "learn", it turns out that we do and they
don't." -- Alan J. Perlis
Andrew Hammond
2007-06-29 23:58:04 UTC
Post by Christopher Browne
Post by Jeff Davis
Post by Christopher Browne
Are there more items we should try to add?
Is there a good way to make it more efficient to do large
deletes/updates?
For instance, if the number of tuples that need to be deleted for a
transaction exceeds a certain amount, could we use a different process
for the delete on the subscriber so that it doesn't do millions of
single-tuple deletes?
I don't know exactly how that would work, perhaps by using a temporary
table on the subscriber and doing a single "DELETE FROM foo WHERE id IN
(SELECT id FROM foo_delete_tmp)" or something?
Ah, yes.
It would be nontrivial to improve this in the case of a multi-column
key, but if the candidate primary key consists of a single column, one
might be able to detect this.
Here's a sample, of sorts...
log_origin | log_xid | log_tableid | log_actionseq | log_cmdtype | log_cmddata
------------+---------+-------------+---------------+-------------+-------------
1 | 3939096 | 2 | 9277 | D | id='1'
1 | 3939096 | 1 | 9278 | D | id='1'
1 | 3939096 | 2 | 9279 | D | id='2'
1 | 3939096 | 1 | 9280 | D | id='2'
1 | 3939096 | 2 | 9281 | D | id='3'
1 | 3939096 | 2 | 9282 | D | id='4'
1 | 3939096 | 2 | 9283 | D | id='6'
1 | 3939096 | 2 | 9284 | D | id='9'
1 | 3939096 | 2 | 9285 | D | id='13'
1 | 3939096 | 2 | 9286 | D | id='18'
(10 rows)
delete from table2 where id = '1';
delete from table1 where id = '1';
delete from table2 where id = '2';
delete from table1 where id = '2';
... and so forth ...
delete from table2 where
id = '3'
or
id = '4'
or
id = '6'
or
id = '9'
or
id = '13'
or
id = '18';
where this groups together the consecutive entries for table #2.
This actually still works for multiple-column PKs as long as we put
parentheses around each "log_cmddata" entry. The win isn't in turning
this into a parser-challengingly-huge single query; it is in doing
several entries as one query, so I'd be inclined to let it start a new
DELETE request any time the size of the query reaches 100 tuples.
A really interesting win would be in detecting cases where you can go from

WHERE id IN ( a list )

to

WHERE a < id AND id < b

However, I think this is only possible at the time the transaction
happens (how else would you know whether your sequence is contiguous?).
And that suggests to me that it's not reasonable to do at this time.
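A sketch of that range detection (hypothetical Python helper, not Slony-I code; as noted, only the originating transaction truly knows whether the set was contiguous, so this merely inspects the captured values after the fact):

```python
def collapse_to_range(ids):
    """If the captured ids form one contiguous integer run, emit a
    range predicate; otherwise fall back to an IN list."""
    ids = sorted(ids)
    if ids and ids == list(range(ids[0], ids[-1] + 1)):
        return "%d <= id and id <= %d" % (ids[0], ids[-1])
    return "id in (%s)" % ", ".join(str(i) for i in ids)

print(collapse_to_range([3, 4, 5, 6]))  # contiguous run -> range
print(collapse_to_range([3, 5, 9]))     # gaps -> keep the list
```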

Also, ISTM that the big reason we don't like statement-based
replication is that SQL has many non-deterministic aspects. However,
there is probably a pretty darn big subset of SQL which is provably
deterministic. And for that subset, would it be any less rigorous
to transmit those statements than to transmit the per-row change
statements like we currently do?
Post by Christopher Browne
It would take some parsing of the log_cmddata to do this, nonetheless,
I think it ought to be possible to compress this into some smaller
number of queries. Again, if we limited each query to process 100
tuples, at most, that would still seem like enough to call it a "win."
I can see two places to find these wins: when the statement is parsed
(probably very affordable) and, as you mentioned above, by inspecting
the log tables. I think we'd have to be pretty clever with the log
tables to avoid having it get too expensive. I wonder if full-text
indexing with an "SQL stemmer" might be a clever way to index that
data usefully.

Two downsides of the parser approach that I can see are
1) the postgresql parser / planner is already plenty complex
2) it doesn't group stuff across multiple statements

Just some thoughts.

Andrew
Jeff Davis
2007-07-01 16:06:15 UTC
Post by Andrew Hammond
A really interesting win would be in detecting cases where you can go from
WHERE id IN ( a list )
to
WHERE a < id AND id < b
However, I think this is only possible at the time the transaction
happens (how else would you know whether your sequence is contiguous?).
And that suggests to me that it's not reasonable to do at this time.
If we move the data from the provider to a temp table on the receiver,
we could also use an IN query rather than a range. I don't know when
this would be a win, but it seems like it would be useful in some cases.

A range is much nicer, but like you say, it's harder to detect in a
deterministic way.
Post by Andrew Hammond
Also, ISTM that the big reason we don't like statement-based
replication is that SQL has many non-deterministic aspects. However,
there is probably a pretty darn big subset of SQL which is provably
deterministic. And for that subset, would it be any less rigorous
to transmit those statements than to transmit the per-row change
statements like we currently do?
The pgpool guys have done a lot of research on statement replication
already. I think it's a very interesting line of research that is good
in a lot of cases. It's worth thinking about parts of pgpool that would
be useful in slony.

Regards,
Jeff Davis
Christopher Browne
2007-07-03 09:33:26 UTC
Post by Andrew Hammond
A really interesting win would be in detecting cases where you can go from
WHERE id IN ( a list )
to
WHERE a < id AND id < b
However, I think this is only possible at the time the transaction
happens (how else would you know whether your sequence is contiguous?).
And that suggests to me that it's not reasonable to do at this time.
That also seems near-nondeterministic, in that we'd be capturing data
based on the state of things on the data source when the transactions
(multiple!) happen, while the effects would be based on the state of
things on the destination nodes at a different point in time.

I'll see about doing an experiment on this to see if, for the DELETE
case, it seems to actually help. It may be that the performance
effects are small to none, so that the added code complication isn't
worthwhile.
Post by Andrew Hammond
Also, ISTM that the big reason we don't like statement-based
replication is that SQL has many non-deterministic aspects. However,
there is probably a pretty darn big subset of SQL which is provably
deterministic. And for that subset, would it be any less
rigorous to transmit those statements than to transmit the per-row
change statements like we currently do?
Well, by capturing the values, we have captured a deterministic form
of the update.

Jan and I had a chat last week on ideas of how to do "wilder
transformations" (e.g. - like adding/dropping columns, or of
replicating "WHERE FOO IN ('BAR')"); what we arrived at was that, in
such cases, what we'd need to do is to have custom 'logtrigger'
functions that would have full access to OLD.* and NEW.* (e.g. - the
two sets of columns, old and new), which would then use them, perhaps
with arbitrary complexity, to construct sl_log_n entries.

The "fully general" logtrigger function would be *way* less efficient
than the present ones; you don't get complex transformations for free.
Post by Andrew Hammond
Post by Christopher Browne
It would take some parsing of the log_cmddata to do this, nonetheless,
I think it ought to be possible to compress this into some smaller
number of queries. Again, if we limited each query to process 100
tuples, at most, that would still seem like enough to call it a "win."
I can see two places to find these wins: when the statement is parsed
(probably very affordable) and, as you mentioned above, by inspecting
the log tables. I think we'd have to be pretty clever with the log
tables to avoid having it get too expensive. I wonder if full-text
indexing with an "SQL stemmer" might be a clever way to index that
data usefully.
I have a *small* regret in this; it would be very nice if data in
sl_log_[n].log_cmddata were split into two portions:

1. For an INSERT, split between the column name list and the VALUES
portion;

You could, in principle, join together a set of VALUES entries for
the same table as long as the list of column names match.

2. For an UPDATE, split between the SET portion and the WHERE
portion;

You could, in principle, join together a set of entries which
have identical SET portions by folding together the WHERE
clauses.

3. For DELETE, there's nothing to be split :-).

It's trivial to fold DELETE requests together as I previously
showed.
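The UPDATE folding in point 2 can be sketched like this (hypothetical Python helper, not Slony-I code; it naively splits log_cmddata on " where ", so it assumes that token never occurs inside the SET portion, which real parsing would have to handle):

```python
from itertools import groupby

def fold_updates(table, cmddata_list):
    """Fold consecutive UPDATE log entries whose SET portion is
    identical by OR-ing their WHERE clauses together."""
    parts = [c.split(" where ", 1) for c in cmddata_list]
    statements = []
    for set_part, group in groupby(parts, key=lambda p: p[0]):
        where = " or ".join("(%s)" % w for _, w in group)
        statements.append("update %s set %s where %s;"
                          % (table, set_part, where))
    return statements

# Mimicking the sl_log_1 UPDATE sample earlier in the thread:
for s in fold_updates("table2", ["data='foo' where id='8340'",
                                 "data='foo' where id='8341'",
                                 "data='bar' where id='8342'"]):
    print(s)
```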
Post by Andrew Hammond
Two downsides of the parser approach that I can see are
1) the postgresql parser / planner is already plenty complex
2) it doesn't group stuff across multiple statements
I don't see any possibility of using a parser-based approach; that
jumps us back into statement-based replication, which is susceptible
to nondeterminism problems.

Remember, the thought we started with was:
"What if we could do something that would make mass operations less
expensive?"

I don't want to introduce anything that can materially increase
processing costs.

The more intelligent we try to get, the more expensive the
logtrigger() function gets, and if the price is high enough, then we
gain nothing.

The only "win" I see is if we can opportunistically join some
statements together. If we have to make the log trigger function
universally *WAY* more expensive, well, that's a performance loss :-(.
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/unix.html
Rules of the Evil Overlord #207. "Employees will have conjugal visit
trailers which they may use provided they call in a replacement and
sign out on the timesheet. Given this, anyone caught making out in a
closet while leaving their station unmonitored will be shot."
<http://www.eviloverlord.com/>
Jeff Davis
2007-07-03 11:16:32 UTC
Post by Christopher Browne
I'll see about doing an experiment on this to see if, for the DELETE
case, it seems to actually help. It may be that the performance
effects are small to none, so that the added code complication isn't
worthwhile.
In a simple test I ran, DELETE of the entire 5M record table using
sequential scan was MUCH faster (9.41s) than 5M individual DELETE
statements in a single transaction (552.49s).

5M records is small enough to fit into memory. I expect the difference
would be even greater when the index and table can't both fit into
memory and the deletes are distributed randomly over the table.

I think it is worth exploring ways of solving this problem. Right now
slony is great for small inserts, updates, and deletes. But any large
update/delete on the origin can cause the subscribers to fall way
behind.
Post by Christopher Browne
"What if we could do something that would make mass operations less
expensive?"
I don't want to introduce anything that can materially increase
processing costs.
The more intelligent we try to get, the more expensive the
logtrigger() function gets, and if the price is high enough, then we
gain nothing.
The only "win" I see is if we can opportunistically join some
statements together. If we have to make the log trigger function
universally *WAY* more expensive, well, that's a performance loss :-(.
Is there any way that we could detect (even without 100% confidence)
that a transaction is "big" and we should spend more effort trying to
optimize it?

I know that's a big project (in the general case), but there might be
some simple things that would work.

Regards,
Jeff Davis
Christopher Browne
2007-07-03 14:52:56 UTC
Post by Jeff Davis
Post by Christopher Browne
I'll see about doing an experiment on this to see if, for the DELETE
case, it seems to actually help. It may be that the performance
effects are small to none, so that the added code complication isn't
worthwhile.
In a simple test I ran, DELETE of the entire 5M record table using
sequential scan was MUCH faster (9.41s) than 5M individual DELETE
statements in a single transaction (552.49s).
5M records is small enough to fit into memory. I expect the difference
would be even greater when the index and table can't both fit into
memory and the deletes are distributed randomly over the table.
I think it is worth exploring ways of solving this problem. Right now
slony is great for small inserts, updates, and deletes. But any large
update/delete on the origin can cause the subscribers to fall way
behind.
The "handy" alternative test (which would be a good "smoke test" for
whether it's worth bothering to put *any* effort into this) would be to
try to do some partial groupings to see if they'd help.

Thus, if the whole delete goes across the range id = 0 thru id =
5000000, then things to try would be (each case involving 1 transaction):

1. Delete with the 5M individual DELETE statements. (Which you found
took 552.49s)
2. Delete with 50K DELETE statements, each having a WHERE clause with
100 items in it.
3. Delete with 5K DELETE statements, each having a WHERE clause with 1K
items in it.

If 2. or 3. come *way* closer to 9.41s, then it may be worth exploring
the complexity of folding together adjacent deletes on the same table.
There could also be a case made for trying sequential versus random
orderings (e.g. - in the former case, each DELETE statement takes on a
specific range of items whereas in the latter, each selects items more
or less at random).

I'll see about constructing a series of tests like this; won't be
running before I send this :-). If you have time to generate 2. and/or
3., on your system and get timings there, I'd be much obliged.

;;;; Here's some relevant code :-)
(format t "begin;~%")
(loop for i from 0 to 49999
      do (format t "delete from foo where ")
         (loop for j from 0 to 98
               do (format t "id=~D or " (+ j (* i 100))))
         (format t "id=~D;~%" (+ 99 (* i 100))))
(format t "commit;~%")

I don't think we can save the full 543 seconds, but if we could save a
good portion of it, it's worth trying to pursue...
Post by Jeff Davis
Post by Christopher Browne
"What if we could do something that would make mass operations less
expensive?"
I don't want to introduce anything that can materially increase
processing costs.
The more intelligent we try to get, the more expensive the
logtrigger() function gets, and if the price is high enough, then we
gain nothing.
The only "win" I see is if we can opportunistically join some
statements together. If we have to make the log trigger function
universally *WAY* more expensive, well, that's a performance loss :-(.
Is there any way that we could detect (even without 100% confidence)
that a transaction is "big" and we should spend more effort trying to
optimize it?
Regrettably, no. For us to switch over to a sort of log trigger that
supports "doing something smarter" requires that we add in logic that
will have some (definitely non-zero) cost any time it *isn't*
worthwhile. And "usual sorts of OLTP activity" will fall into the
category where performance would be injured.
Post by Jeff Davis
I know that's a big project (in the general case), but there might be
some simple things that would work.
Well, the cases I suggested (2. and 3.) would fall into "simple cases".
Jeff Davis
2007-07-06 15:51:34 UTC
Post by Christopher Browne
1. Delete with the 5M individual DELETE statements. (Which you found
took 552.49s)
2. Delete with 50K DELETE statements, each having a WHERE clause with
100 items in it.
3. Delete with 5K DELETE statements, each having a WHERE clause with 1K
items in it.
If 2. or 3. come *way* closer to 9.41s, then it may be worth exploring
the complexity of folding together adjacent deletes on the same table.
There could also be a case made for trying sequential versus random
orderings (e.g. - in the former case, each DELETE statement takes on a
specific range of items whereas in the latter, each selects items more
or less at random).
I'll see about constructing a series of tests like this; won't be
running before I send this :-). If you have time to generate 2. and/or
3., on your system and get timings there, I'd be much obliged.
I apologize for the slow response. Here are my results:

DELETE 5000000 at a time: 19.571929
DELETE 1000 at a time: 15.999146
DELETE 100 at a time: 18.946113
DELETE 1 at a time: 770.507714
UPDATE 5000000 at a time: 94.686762
UPDATE 1000 at a time: 86.327752
UPDATE 100 at a time: 103.473719
UPDATE 1 at a time: 963.358307

My test scripts are attached. I am seeing the same kind of slowness
whether it's deleting everything at once or in chunks of 1000, which is
confusing me as well.

Notice in my scripts that the way the table is created leaves it
perfectly clustered.

Regards,
Jeff Davis
-------------- next part --------------
require 'postgres'

db = PGconn.connect(nil,nil,nil,nil,'db02','jdavis',nil)

db.exec %q{
drop table if exists delete_test;
create table delete_test( a1 int8, a2 int8, a3 int8, a4 int8 );
insert into delete_test select
generate_series,
generate_series+1,
generate_series+2,
generate_series+3
from generate_series(1,5000000);
create unique index delete_test_idx on delete_test(a1);
analyze delete_test;
}

sleep 100

t1 = Time.now
db.exec "BEGIN"
i = 0
x = ARGV[0].to_i
while i < 5000000 do
  db.exec "DELETE FROM delete_test WHERE a1 > #{i} and a1 <= #{i+x}"
  i += x
end
db.exec "COMMIT"
t2 = Time.now

puts t2-t1

-------------- next part --------------
require 'postgres'

db = PGconn.connect(nil,nil,nil,nil,'db02','jdavis',nil)

db.exec %q{
drop table if exists update_test;
create table update_test( a1 int8, a2 int8, a3 int8, a4 int8 );
insert into update_test select
generate_series,
generate_series+1,
generate_series+2,
generate_series+3
from generate_series(1,5000000);
create unique index update_test_idx on update_test(a1);
analyze update_test;
}

sleep 100

t1 = Time.now
db.exec "BEGIN"
i = 0
x = ARGV[0].to_i
while i < 5000000 do
  db.exec "UPDATE update_test SET a2=-a2 WHERE a1 > #{i} and a1 <= #{i+x}"
  i += x
end
db.exec "COMMIT"
t2 = Time.now

puts t2-t1
Christopher Browne
2007-07-03 15:07:39 UTC
Post by Jeff Davis
Post by Christopher Browne
I'll see about doing an experiment on this to see if, for the DELETE
case, it seems to actually help. It may be that the performance
effects are small to none, so that the added code complication isn't
worthwhile.
In a simple test I ran, DELETE of the entire 5M record table using
sequential scan was MUCH faster (9.41s) than 5M individual DELETE
statements in a single transaction (552.49s).
I have a test running here...

***@dba2:~/records/2007/2007-07-03> for i in *log; do
for> echo $i
for> echo "-------------------------------------------------------"
for> cat $i
for> echo "======================================================="
for> done
hundreds.log
-------------------------------------------------------
Started
Tue Jul 3 22:03:16 UTC 2007
=======================================================
onedelete.log
-------------------------------------------------------
Started
Tue Jul 3 21:57:04 UTC 2007
Started purging by onedelete
Tue Jul 3 21:58:41 UTC 2007
Completed deletions
Tue Jul 3 21:59:24 UTC 2007
=======================================================
thousands.log
-------------------------------------------------------
Started
Tue Jul 3 21:47:07 UTC 2007
Started purging by thousands
Tue Jul 3 21:47:22 UTC 2007
Completed deletions
Tue Jul 3 21:53:12 UTC 2007
=======================================================


My PC is evidently slower than yours; it took ~43s for the "one big delete".

Doing it in groups of 1K took 4:50 (i.e. 4 minutes 50 seconds).

I'll be running against groups of 100 and against groups of 1 overnight;
presumably both will be worse than my other numbers. It'll be
interesting to see how much worse than 4:50 it gets...
Christopher Browne
2007-07-03 15:53:46 UTC
Post by Christopher Browne
I'll be running against groups of 100 and against groups of 1
overnight; presumably both will be worse than my other numbers. It'll
be interesting to see how much worse than 4:50 it gets...
hundreds.log
-------------------------------------------------------
Started
Tue Jul 3 22:03:16 UTC 2007
Started purging by hundreds
Tue Jul 3 22:04:59 UTC 2007
Completed deletions
Tue Jul 3 22:08:14 UTC 2007
=======================================================
oneatatime.log
-------------------------------------------------------
Started
Tue Jul 3 22:08:15 UTC 2007
Started purging by oneatatime
Tue Jul 3 22:09:50 UTC 2007
Completed deletions
Tue Jul 3 22:25:01 UTC 2007
=======================================================
onedelete.log
-------------------------------------------------------
Started
Tue Jul 3 21:57:04 UTC 2007
Started purging by onedelete
Tue Jul 3 21:58:41 UTC 2007
Completed deletions
Tue Jul 3 21:59:24 UTC 2007
=======================================================
thousands.log
-------------------------------------------------------
Started
Tue Jul 3 21:47:07 UTC 2007
Started purging by thousands
Tue Jul 3 21:47:22 UTC 2007
Completed deletions
Tue Jul 3 21:53:12 UTC 2007
=======================================================


One at a time took about 15 minutes
100 at a time took 3:15
1000 at a time took longer than 100 at a time (curious, that!)
all in one shot took 43 seconds.

Something's fishy about the 1K at a time case; I'm rerunning that.
Retrying, it took 4:22, again, longer than at 100/DELETE.

I expect that we run into some parsing overhead where doing 1000 at a
shot imposes some burden (memory, query plan, whatever) that is greater
than the savings gotten from cutting down on the number of queries.

There is definitely a material savings to be had; deleting a bunch of
tuples in one query *does* provide a considerable savings over doing one
at a time. Whether it's worth implementing may be another question :-).
Csaba Nagy
2007-07-04 01:21:13 UTC
Post by Christopher Browne
One at a time took about 15 minutes
100 at a time took 3:15
1000 at a time took longer than 100 at a time (curious, that!)
all in one shot took 43 seconds.
Check the plans for the 100 vs. 1000 cases: I'm pretty sure 100 goes for
bitmap index scan and 1000 goes for sequential scan... and 10 * 100
bitmap index scans are probably somewhat faster than 1 sequential scan
on your table/box. I guess 1000 is close to the turnover point between
index scan and sequential scan performance on your table/box/setup, but
the cost of the sequential scan is slightly underestimated by the planner.

BTW, the bitmap index scan case should theoretically be the fastest, so
aiming for the highest chunk size where the planner still chooses bitmap
index scan (or downright forcing it to do so if possible) would give the
best performance.
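Finding that turnover point empirically could look like the following
sketch; it only builds the EXPLAIN statements (the table and column
names are borrowed from Jeff's attached test scripts), which one would
then run through psql against a live database:

```ruby
# Sketch: build EXPLAIN probes for growing chunk sizes to see where the
# planner stops choosing the bitmap index scan. delete_test/a1 come from
# the attached test scripts; nothing here touches a database -- the
# statements would be run through psql against a live instance.
def explain_probes(sizes)
  sizes.map do |n|
    "EXPLAIN DELETE FROM delete_test WHERE a1 > 0 AND a1 <= #{n};"
  end
end

puts explain_probes([100, 1_000, 10_000])
```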

Cheers,
Csaba.
Christopher Browne
2007-07-04 08:14:03 UTC
Post by Csaba Nagy
Post by Christopher Browne
One at a time took about 15 minutes
100 at a time took 3:15
1000 at a time took longer than 100 at a time (curious, that!)
all in one shot took 43 seconds.
Check the plans for the 100 vs. 1000 cases: I'm pretty sure 100 goes for
bitmap index scan and 1000 goes for sequential scan... and 10 * 100
bitmap index scans are probably somewhat faster than 1 sequential scan
on your table/box. I guess 1000 is close to the turnover point between
index scan and sequential scan performance on your table/box/setup, but
the cost of the sequential scan is slightly underestimated by the planner.
BTW, the bitmap index scan case should theoretically be the fastest, so
aiming for the highest chunk size where the planner still chooses bitmap
index scan (or downright forcing it to do so if possible) would give the
best performance.
Interestingly, the query plans for 100 and 1000 are much the same:

- Bitmap Heap Scan on foo atop
100 (or 1000) ORed Bitmap Index Scans.

I was fully expecting to discover some differing query plan; it was
quite a surprise to discover both:

a) That the plans were essentially identical, and
b) That it's more expensive to delete 1K tuples at a shot than 100.
Csaba Nagy
2007-07-04 08:29:22 UTC
Post by Christopher Browne
- Bitmap Heap Scan on foo atop
100 (or 1000) ORed Bitmap Index Scans.
I was fully expecting to discover some differing query plan; it was
a) That the plans were essentially identical, and
b) That it's more expensive to delete 1K tuples at a shot than 100.
OK, that surprises me too... and it would be interesting to know why
that is so. It is completely counterintuitive...

Cheers,
Csaba.
Jeff Davis
2007-07-06 15:54:45 UTC
Post by Csaba Nagy
Post by Christopher Browne
One at a time took about 15 minutes
100 at a time took 3:15
1000 at a time took longer than 100 at a time (curious, that!)
all in one shot took 43 seconds.
Check the plans for the 100 vs. 1000 cases: I'm pretty sure 100 goes for
bitmap index scan and 1000 goes for sequential scan... and 10 * 100
bitmap index scans are probably somewhat faster than 1 sequential scan
on your table/box. I guess 1000 is close to the turnover point between
index scan and sequential scan performance on your table/box/setup, but
the cost of the sequential scan is slightly underestimated by the planner.
BTW, the bitmap index scan case should theoretically be the fastest, so
aiming for the highest chunk size where the planner still chooses bitmap
index scan (or downright forcing it to do so if possible) would give the
best performance.
Why would a bitmap index scan be faster than a sequential scan when
deleting the entire table?

Regards,
Jeff Davis
Christopher Browne
2007-07-07 11:00:59 UTC
Post by Jeff Davis
Post by Csaba Nagy
Post by Christopher Browne
One at a time took about 15 minutes
100 at a time took 3:15
1000 at a time took longer than 100 at a time (curious, that!)
all in one shot took 43 seconds.
Check the plans for the 100 vs. 1000 cases: I'm pretty sure 100 goes for
bitmap index scan and 1000 goes for sequential scan... and 10 * 100
bitmap index scans are probably somewhat faster than 1 sequential scan
on your table/box. I guess 1000 is close to the limit between the
performance turnover between the index scan and sequential scan on your
table/box/setup, but the sequential scan is slightly underestimated by
the planner.
BTW, the bitmap index scan case should theoretically be the fastest, so
aiming for the highest chunk size where the planner still chooses bitmap
index scan (or downright forcing it to do so if possible) would give the
best performance.
Why would a bitmap index scan be faster than a sequential scan when
deleting the entire table?
It wouldn't be...

The thing is, on the subscriber side, the deletes take place one tuple
at a time. In THAT case, putting several into one DELETE request does
validly favour a bitmap index scan.

This may remain controversial for a while; it seems evident that there
is a gain to be had by doing several DELETEs together. Unfortunately,
it is also evident that this will complicate the code.
Post by Jeff Davis
From your numbers, it looks like "maxxing out" at 100 in a statement
is likely optimal, *way* faster than doing a delete at a time, and not
materially worse than the "where-less" DELETE FROM FOO;

This is not the next thing to change, but I think it should be changed
someday...
--
output = reverse("ofni.sailifa.ac" "@" "enworbbc")
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
Jan Wieck
2007-07-04 07:10:29 UTC
Post by Christopher Browne
Post by Andrew Hammond
Also, ISTM that the big reason we don't like statement based
replication is that SQL has many non-deterministic aspects. However,
there is probably a pretty darn big subset of SQL which is provably
deterministic. And for that subset, would it be any less
rigorous to transmit those statements than to transmit the per-row
change statements like we currently do?
Well, by capturing the values, we have captured a deterministic form
of the update.
How to figure out what is deterministic and what isn't? A simple

insert into summary select id, sum(value) from detail group by id;

seems pretty deterministic, doesn't it? But the result of it depends on
the exact commit order and the transaction isolation level. We don't
capture the commit order of single transactions, nor do we care for it
anywhere in the Slony-I logic.


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Andrew Sullivan
2007-07-04 07:45:20 UTC
Post by Jan Wieck
seems pretty deterministic, doesn't it? But the result of it depends on
the exact commit order and the transaction isolation level. We don't
capture the commit order of single transactions, nor do we care for it
anywhere in the Slony-I logic.
I think this is key. The current arrangement solves the problem
where the visibility rules as they were in force on the origin are
followed while applying on the replica. You're going to need to do
quite a bit of theoretical work here to show that the agreeable order
rules are followed in any grouping approach you take. Please see the
original concept paper on this exact point. MVCC is hard.

A
--
Andrew Sullivan | ***@crankycanuck.ca
However important originality may be in some fields, restraint and
adherence to procedure emerge as the more significant virtues in a
great many others. --Alain de Botton
Christopher Browne
2007-07-04 07:57:05 UTC
Post by Andrew Sullivan
Post by Jan Wieck
seems pretty deterministic, doesn't it? But the result of it depends on
the exact commit order and the transaction isolation level. We don't
capture the commit order of single transactions, nor do we care for it
anywhere in the Slony-I logic.
I think this is key. The current arrangement solves the problem
where the visibility rules as they were in force on the origin are
followed while applying on the replica. You're going to need to do
quite a bit of theoretical work here to show that the agreeable order
rules are followed in any grouping approach you take. Please see the
original concept paper on this exact point. MVCC is hard.
The only change I'd propose in handling grouping is to
opportunistically see if there are consecutive operations that may be
trivially joined together.

In effect, if the *old* logic generated the sequence of queries:

delete from my_table where id = 25;
delete from my_table where id = 82;
delete from another_table where id = 19;
delete from my_table where id = 45;

then there is only one "joining" possible, which is to combine the
first two delete queries into one, so that the overall sequence of
queries becomes thus:

delete from my_table where id = 25 or id = 82;
delete from another_table where id = 19;
delete from my_table where id = 45;

That doesn't change anything about "agreeable ordering" as far as I
can see.

Of course, that example isn't much of a "win." What would be way more
interesting (from a performance perspective) is the case where there
are 25 deletes in a row from my_table that could be folded together.
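A single-pass sketch of that opportunistic folding (the [op, table, id]
log format, the function name, and the id-only keys are illustrative
assumptions here, not Slony's actual log structures):

```ruby
# Sketch (not Slony code): fold consecutive DELETEs against the same
# table into one statement, in a single pass over the log. Any
# intervening operation on another table ends the current group.
# Non-delete ops are omitted for brevity.
def fold_deletes(log)
  groups = []
  log.each do |op, table, id|
    last = groups.last
    if op == :delete && last && last[:op] == :delete && last[:table] == table
      last[:ids] << id            # same table, still a delete: fold it in
    else
      groups << { op: op, table: table, ids: [id] }
    end
  end
  groups.map do |g|
    preds = g[:ids].map { |i| "id = #{i}" }.join(" or ")
    "delete from #{g[:table]} where #{preds};"
  end
end

log = [[:delete, "my_table", 25],
       [:delete, "my_table", 82],
       [:delete, "another_table", 19],
       [:delete, "my_table", 45]]
puts fold_deletes(log)
```

Run on the four-statement example above, this yields exactly the
three-statement sequence shown.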
--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://linuxdatabases.info/info/sgml.html
Health is merely the slowest possible rate at which one can die.
Andrew Sullivan
2007-07-04 09:30:38 UTC
Post by Christopher Browne
That doesn't change anything about "agreeable ordering" as far as I
can see.
Of course, that example isn't much of a "win." What would be way more
interesting (from a performance perspective) is the case where there
are 25 deletes in a row from my_table that could be folded together.
Right. But the actual key here is to make sure that (1) no agreeable
ordering actually changes _and_ (2) that doing this work for every
sequence is in fact going to be a winner for most cases. Jan and I
talked, IIRC, about this sort of optimisation in the early days of
1.0 design work, and neither of us were able to come up with a set of
tests that could both demonstrate this was a win _and_ that it
wouldn't kill ordinary-case performance (or blow out memory). I'm
not saying it's impossible, but I'm trying to suggest that some
general algorithm work in this area is going to be needed, and it's
not going to be amenable merely to empirical tests. You also have to
have at least _prima facie_ evidence that, for ordinary cases, this
won't suck. I'd like to see that argument.

A
--
Andrew Sullivan | ***@crankycanuck.ca
When my information changes, I alter my conclusions. What do you do sir?
--attr. John Maynard Keynes
Jan Wieck
2007-07-04 11:32:20 UTC
Post by Christopher Browne
Post by Andrew Sullivan
Post by Jan Wieck
seems pretty deterministic, doesn't it? But the result of it depends on
the exact commit order and the transaction isolation level. We don't
capture the commit order of single transactions, nor do we care for it
anywhere in the Slony-I logic.
I think this is key. The current arrangement solves the problem
where the visibility rules as they were in force on the origin are
followed while applying on the replica. You're going to need to do
quite a bit of theoretical work here to show that the agreeable order
rules are followed in any grouping approach you take. Please see the
original concept paper on this exact point. MVCC is hard.
The only change I'd propose in handling grouping is to
opportunistically see if there are consecutive operations that may be
trivially joined together.
delete from my_table where id = 25;
delete from my_table where id = 82;
delete from another_table where id = 19;
delete from my_table where id = 45;
then there is only one "joining" possible, which is to combine the
first two delete queries into one, so that the overall sequence of
delete from my_table where id = 25 or id = 82;
delete from another_table where id = 19;
delete from my_table where id = 45;
Actually, only an insert or update to the same table would interrupt the
possible grouping. But that is academic.

What I see here is that we are trying to come up with a special case
optimization for mass-deletes. No mass insert or update operations will
benefit from any of this. Do people do mass deletes that much that we
really have to worry about them?


Jan
Post by Christopher Browne
That doesn't change anything about "agreeable ordering" as far as I
can see.
Of course, that example isn't much of a "win." What would be way more
interesting (from a performance perspective) is the case where there
are 25 deletes in a row from my_table that could be folded together.
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Csaba Nagy
2007-07-05 01:13:52 UTC
Post by Jan Wieck
What I see here is that we are trying to come up with a special case
optimization for mass-deletes. No mass insert or update operations will
benefit from any of this. Do people do mass deletes that much that we
really have to worry about them?
I occasionally did such mass delete operations, which did cause slony to
lag behind but not so much that I would cry for such a feature...

In fact our postgres DB scaling is done by splitting the DB in 2 when
the customers on it grow too much. We do this by replicating a copy via
slony, and then cut the replication and configure half of the customers
to point to the first one, the rest point to the other one. Then delete
the unneeded data on both DBs. This procedure is transparent to the
customers, the required downtime is a few minutes while the application
is restarted with the new configuration, which can be done in a low
traffic period.

So back to the point, deleting the unneeded half of the data is such a
bulk delete operation, and sometimes I had to do it while slony was
active on the DB.

Cheers,
Csaba.
Csaba Nagy
2007-07-05 06:33:57 UTC
[snip] Then delete
the unneeded data on both DBs. [snip]
OK, I should have said that the deletion is done in reasonably sized
chunks so that each deletion transaction doesn't take ages. Maybe this
is why slony was up to the task, even if the delete statement ratio from
master to slave would be between 1...10K.

So then no, I don't have a scenario where I would delete millions of
rows in one statement and slony would have to replicate that to millions
of single statements... that would kill the master too...

Cheers,
Csaba.
Jan Wieck
2007-07-05 06:51:00 UTC
Post by Csaba Nagy
[snip] Then delete
the unneeded data on both DBs. [snip]
OK, I should have said that the deletion is done in reasonably sized
chunks so that each deletion transaction doesn't take ages. Maybe this
is why slony was up to the task, even if the delete statement ratio from
master to slave would be between 1...10K.
So then no, I don't have a scenario where I would delete millions of
rows in one statement and slony would have to replicate that to millions
of single statements... that would kill the master too...
You could do THOSE mass deletes via EXECUTE SCRIPT ;-)


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Brad Nicholson
2007-07-05 06:25:35 UTC
Post by Jan Wieck
What I see here is that we are trying to come up with a special case
optimization for mass-deletes. No mass insert or update operations will
benefit from any of this. Do people do mass deletes that much that we
really have to worry about them?
Yes, there are a few places that I can think of where we would directly
benefit from this. Trimming data from very active log tables is the
main case. We also recently had a case where we needed to delete a fair
amount of data from a table quickly to prevent degraded performance in a
front line system. We ended up dropping the table, deleting and
re-subbing. We were not in a situation where we could have done the
smaller batches that Slony would have liked.
--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
Andrew Hammond
2007-07-05 12:41:00 UTC
Post by Brad Nicholson
Post by Jan Wieck
What I see here is that we are trying to come up with a special case
optimization for mass-deletes. No mass insert or update operations will
benefit from any of this. Do people do mass deletes that much that we
really have to worry about them?
Yes, there are a few places that I can think of where we would directly
benefit from this. Trimming data from very active log tables is the
main case. We also recently had a case where we needed to delete a fair
amount of data from a table quickly to prevent degraded performance in a
front line system. We ended up dropping the table, deleting and
re-subbing. We were not in a situation where we could have done the
smaller batches that Slony would have liked.
May I rephrase Jan's question? Are there any cases where people do
mass deletion that couldn't be solved using existing table
partitioning approaches? In the case of the log files Brad mentions
above, a solution is to use tables partitioned by temporal range, say
on a monthly basis. Once all the data in a given partition has reached
its retention schedule, simply drop the partition.

This costs some additional EXECUTE DDL scripting, but I think would
solve the problem quite effectively. It also requires no additions or
changes to slony. Finally, it requires the user to be running a more
modern version of PostgreSQL, however if you care about performance,
that seems a reasonable assumption.
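A rough sketch of that approach on the inheritance-based partitioning
PostgreSQL offered at the time (constraint exclusion); the `app_log`
table, `logged_at` column, and helper name are illustrative, and only
the DDL strings are generated here:

```ruby
require "date"

# Sketch: monthly partitions via inheritance, so retiring a month is a
# cheap DROP TABLE instead of a mass DELETE. app_log/logged_at are
# illustrative names; the CHECK constraint enables constraint exclusion.
def partition_ddl(parent, month)   # month: a Date on the 1st
  nxt = month >> 1                 # first day of the next month
  name = "#{parent}_#{month.strftime('%Y%m')}"
  ["CREATE TABLE #{name} (CHECK (logged_at >= '#{month}' AND logged_at < '#{nxt}')) INHERITS (#{parent});",
   "-- retirement later is simply: DROP TABLE #{name};"]
end

puts partition_ddl("app_log", Date.new(2007, 7, 1))
```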

Andrew
Vivek Khera
2007-07-06 11:25:34 UTC
Post by Jan Wieck
What I see here is that we are trying to come up with a special
case optimization for mass-deletes. No mass insert or update
operations will benefit from any of this. Do people do mass deletes
that much that we really have to worry about them?
I do mass deletes twice monthly to prune out old data. Usually
several tens of millions of rows. On the origin it looks like
'delete from XXX where object_id=N' which cascade deletes via FK
relationships to gazillions of rows.
Christopher Browne
2007-07-06 11:42:51 UTC
Post by Vivek Khera
Post by Jan Wieck
What I see here is that we are trying to come up with a special
case optimization for mass-deletes. No mass insert or update
operations will benefit from any of this. Do people do mass deletes
that much that we really have to worry about them?
I do mass deletes twice monthly to prune out old data. Usually
several tens of millions of rows. On the origin it looks like
'delete from XXX where object_id=N' which cascade deletes via FK
relationships to gazillions of rows.
That would be a case where there could be some "win" from joining
deletes together at least somewhat... So I think this feature is one
that can't be so quickly ruled out...
--
output = reverse("ofni.sailifa.ac" "@" "enworbbc")
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)
Andrew Sullivan
2007-07-06 12:37:21 UTC
Post by Christopher Browne
That would be a case where there could be some "win" from joining
deletes together at least somewhat... So I think this feature is one
that can't be so quickly ruled out...
I also wonder, though, whether it mightn't be exposed by EXECUTE or
something instead, presumably with a lowish lock level. I'm just
worried about the potential for this to turn into a big expensive
operation for ordinary cases because we've optimised the admittedly
awful mass-delete behaviour (made even worse because PostgreSQL
itself is sort of brutal with mass deletes, due to all the dead
rows). Something that was special that way would also reduce the
exposure from hairy, complicated bits in the main line code, which as
others have already noted is somewhat mysterious to many developers.

A
--
Andrew Sullivan | ***@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun
Jeff Davis
2007-07-06 16:05:35 UTC
Post by Jan Wieck
What I see here is that we are trying to come up with a special case
optimization for mass-deletes. No mass insert or update operations will
benefit from any of this. Do people do mass deletes that much that we
really have to worry about them?
Why does this not apply to UPDATEs as well?

Regards,
Jeff Davis
Jan Wieck
2007-07-09 05:46:38 UTC
Post by Jeff Davis
Post by Jan Wieck
What I see here is that we are trying to come up with a special case
optimization for mass-deletes. No mass insert or update operations will
benefit from any of this. Do people do mass deletes that much that we
really have to worry about them?
Why does this not apply to UPDATEs as well?
I can see how multiple log rows for DELETE might be combined into one
subscriber side DELETE statement with a WHERE clause using IN (BETWEEN
would be difficult in the case of multi-column primary keys).
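That combining could be sketched as below; the function, table, and
column names and the chunk size are illustrative assumptions, and
row-value IN syntax covers the multi-column-key case where BETWEEN
falls short:

```ruby
# Sketch: build subscriber-side DELETEs using IN over the primary key,
# in chunks. For multi-column keys, row-value IN works where BETWEEN
# cannot. All names here are illustrative, not Slony's actual schema.
def delete_in_chunks(table, pk_cols, keys, chunk = 100)
  keys.each_slice(chunk).map do |slice|
    tuples = slice.map { |k| "(" + k.join(", ") + ")" }.join(", ")
    "DELETE FROM #{table} WHERE (#{pk_cols.join(', ')}) IN (#{tuples});"
  end
end

puts delete_in_chunks("foo", ["a", "b"], [[1, 10], [2, 20], [3, 30]], 2)
```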

How you intend to do the same for INSERT or UPDATE is unclear to me. As
said before, there is no SQL statement logging available and even if it
were, due to MVCC it won't do us any good because Slony does not
replicate single transactions in their exact commit order. This
visibility problem is one of the tricky details that killed pgcluster 1.
I won't let it sneak into Slony.


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Jeff Davis
2007-07-09 12:18:29 UTC
Post by Jan Wieck
Post by Jeff Davis
Why does this not apply to UPDATEs as well?
I can see how multiple log rows for DELETE might be combined into one
subscriber side DELETE statement with a WHERE clause using IN (BETWEEN
would be difficult in the case of multi-column primary keys).
How you intend to do the same for INSERT or UPDATE is unclear to me. As
said before, there is no SQL statement logging available and even if it
were, due to MVCC it won't do us any good because Slony does not
replicate single transactions in their exact commit order. This
visibility problem is one of the tricky details that killed pgcluster 1.
I won't let it sneak into Slony.
I am _not_ suggesting statement replication in Slony, and that was not
the direction I intended to take this discussion (statement replication
is really more the domain of pgpool, which is interesting but a
completely different design).

As I understand it, the subscriber reads the log from the provider and
turns the log into individual statements on the subscriber.

Instead, the subscriber could read the log on the provider into a temp
table on the subscriber, then the subscriber could do a "DELETE FROM foo
WHERE ... IN (select ... from temp_table)" or an "UPDATE ... SET ...
FROM temp_table WHERE ...". This should not be any different than
running individual statements within a single transaction, but is more
efficient in some cases (bulk deletes/updates).
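A minimal sketch of that apply path; `log_tmp`, `foo`, and the
`op`/`id`/`a2` columns are assumptions for illustration, not Slony's
real log schema, and only the SQL strings a subscriber would run are
assembled here:

```ruby
# Sketch of the temp-table apply path; log_tmp/foo and their columns are
# illustrative, not Slony's real log schema. A live run would send each
# statement through the connection instead of collecting strings.
def bulk_apply_sql(table)
  ["CREATE TEMP TABLE log_tmp (id int8, a2 int8, op char) ON COMMIT DROP;",
   "COPY log_tmp FROM STDIN;",  # bulk-load the selected log rows
   "DELETE FROM #{table} WHERE id IN (SELECT id FROM log_tmp WHERE op = 'D');",
   "UPDATE #{table} SET a2 = t.a2 FROM log_tmp t WHERE #{table}.id = t.id AND t.op = 'U';"]
end

puts bulk_apply_sql("foo")
```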

In the quick test I just ran (attached ruby script), it took 160s to
COPY the 5M updates into a temp table, and then run one UPDATE that
would update every record in the table by joining with the temp table
(the COPY was a few seconds, so most of the 160s was the single UPDATE
statement). That's way faster than the 15 minutes it took to update one
record at a time.

Regards,
Jeff Davis
Jan Wieck
2007-07-09 22:12:51 UTC
Post by Jeff Davis
Post by Jan Wieck
Post by Jeff Davis
Why does this not apply to UPDATEs as well?
I can see how multiple log rows for DELETE might be combined into one
subscriber side DELETE statement with a WHERE clause using IN (BETWEEN
would be difficult in the case of multi-column primary keys).
How you intend to do the same for INSERT or UPDATE is unclear to me. As
said before, there is no SQL statement logging available and even if it
were, due to MVCC it won't do us any good because Slony does not
replicate single transactions in their exact commit order. This
visibility problem is one of the tricky details that killed pgcluster 1.
I won't let it sneak into Slony.
I am _not_ suggesting statement replication in Slony, and that was not
the direction I intended to take this discussion (statement replication
is really more the domain of pgpool, which is interesting but a
completely different design).
As I understand it, the subscriber reads the log from the provider and
turns the log into individual statements on the subscriber.
Instead, the subscriber could read the log on the provider into a temp
table on the subscriber, then the subscriber could do a "DELETE FROM foo
WHERE ... IN (select ... from temp_table)" or an "UPDATE ... SET ...
FROM temp_table WHERE ...". This should not be any different than
running individual statements within a single transaction, but is more
efficient in some cases (bulk deletes/updates).
In the quick test I just ran (attached ruby script), it took 160s to
COPY the 5M updates into a temp table, and then run one UPDATE that
would update every record in the table by joining with the temp table
(the COPY was a few seconds, so most of the 160s was the single UPDATE
statement). That's way faster than the 15 minutes it took to update one
record at a time.
Although the ruby script is missing, I think I get the picture. This
would work if the log trigger recorded every column of every updated
row. But it doesn't do that. It only logs the columns that have a new
value that is distinct from the old one.

The speed gain in your test has a couple of sources. The parsing
overhead of 5 million update statements vs one copy is one of them, the
network latency for 5 million round trips another and the execution plan
(which only applies if all or at least a substantial part of the entire
table is updated), of course. Such a mega update of the entire table is
surely better done as a merge over the two sorted tuple sets. But the
planner might choose another strategy if you updated only 500,000 out of
10 million rows.

It would be interesting to see how 5 million updates done via prepared
statement using parameters fare in the comparison. Those mass updates
would at least tend to use a fairly small number of different column
combinations (this might even be generally true for an entire
application). Using prepared statements to apply the changes will deal
with some part of the parsing and planning overhead. It will still force
your case to do 5 million index scans instead of two sorts and a merge.
But that comparison really only applies to cases where you update a very
substantial part of the whole table.
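
A sketch of that grouping idea (table, column, and field names invented): log rows that changed the same column set could share one prepared UPDATE, so parse/plan work happens once per column combination instead of once per row.

```python
# Group log rows by the set of columns they changed; each group reuses
# one hypothetical prepared statement, executed with per-row parameters.
from collections import defaultdict

log_rows = [
    {"pk": 1, "cols": {"a": 5}},
    {"pk": 2, "cols": {"a": 6}},
    {"pk": 3, "cols": {"a": 7, "b": 1}},
    {"pk": 4, "cols": {"a": 8}},
]

plans = defaultdict(list)
for row in log_rows:
    plans[tuple(sorted(row["cols"]))].append(row)

statements = {}
for colset, rows in plans.items():
    sets = ", ".join(f"{c} = ${i + 1}" for i, c in enumerate(colset))
    # prepare once, execute len(rows) times with different parameters
    statements[colset] = f"UPDATE foo SET {sets} WHERE pk = ${len(colset) + 1}"

assert statements[("a",)] == "UPDATE foo SET a = $1 WHERE pk = $2"
```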

There might be a completely different possibility of dealing with the
problem. Unfortunately I will not have the time to implement it. But the
idea goes as follows.

The log is selected by actually doing a COPY over a SELECT (the usual
log select). That COPY result is fed into the current log table on the
subscriber. A trigger on that log table will suppress the actual insert
operation if the subscriber is not in forwarding mode and the operation
is for a subscribed table (the node might be origin to something else).
What it also does is the actual leg work of applying the changes
via prepared SPI statements or maybe even direct heap and index updates.
This method cuts down on the parsing and planning, as well as on the
network round trips.


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Jeff Davis
2007-07-12 11:02:57 UTC
Permalink
Post by Jan Wieck
The speed gain in your test has a couple of sources. The parsing
overhead of 5 million update statements vs one copy is one of them, the
network latency for 5 million round trips another and the execution plan
(which only applies if all or at least a substantial part of the entire
table is updated) of course. Sure is one such mega update of the entire
table better done in a merge over the two sorted tuple sets. But the
planner might chose another strategy if you updated only 500,000 out of
10 million rows.
The time it takes to do 5M "SELECT #" calls, where # is a constant.
These are run over a local socket, not a real network, so the latency is
less: 305s

The time it takes to do 5M "UPDATE" queries on a 5M record table, and
none of the WHERE clauses match any tuples (so only an index lookup is
done), in one transaction: 623s

The time it takes to do 5M of the same update queries, but prepared
first: 398s

I know tests involving this overhead have been done before, but this is
on my machine, and it could be a reference point for the other numbers I
stated earlier.
Post by Jan Wieck
application). Using prepared statements to apply the changes will deal
with some part of the parsing and planning overhead. It will still force
your case to do 5 million index scans instead of two sorts and a merge.
But that comparison really only applies to cases where you update a very
substantial part of the whole table.
Yes, my test was designed to show that there are areas of potential
improvement without moving to statement-based replication.
Post by Jan Wieck
The log is selected by actually doing a COPY over a SELECT (the usual
log select). That COPY result is fed into the current log table on the
subscriber. A trigger on that log table will suppress the actual insert
operation if the subscriber is not in forwarding mode and the operation
is for a subscribed table (the node might be origin to something else).
What it also does is doing the actual leg work of applying the changes
via prepared SPI statements or maybe even direct heap and index updates.
This method cuts down on the parsing and planning, as well as on the
network round trips.
Interesting idea. I don't think I understand the implications of doing
direct heap/index updates. Would you have to log the visibility
information as well?

Regards,
Jeff Davis

Christopher Browne
2007-07-04 07:53:14 UTC
Permalink
Post by Jan Wieck
Post by Christopher Browne
Post by Andrew Hammond
Also, ISTM that the big reason we don't like statement based
replication is that SQL has many non-deterministic aspects. However,
there is probably a pretty darn big subset of SQL which is provably
deterministic. And for that subset, would it be any less
rigorous to transmit those statements than to transmit the per-row
change statements like we currently do?
Well, by capturing the values, we have captured a deterministic form
of the update.
How to figure out what is deterministic and what isn't? A simple
insert into summary select id, sum(value) from detail group by id;
seems pretty deterministic, doesn't it? But the result of it depends
on the exact commit order and the transaction isolation level. We
don't capture the commit order of single transactions, nor do we care
for it anywhere in the Slony-I logic.
But at the time that we apply these changes in log_actionseq order, we
have imposed a deterministic order. (Which happens to be repeatable,
on each node.)
--
"cbbrowne","@","linuxfinances.info"
http://linuxdatabases.info/info/lisp.html
Do not worry about the bullet that has got your name on it. It will
hit you and it will kill you, no questions asked. The rounds to worry
about are the ones marked: TO WHOM IT MAY CONCERN.
Jan Wieck
2007-07-04 11:52:43 UTC
Permalink
Post by Christopher Browne
Post by Jan Wieck
Post by Christopher Browne
Post by Andrew Hammond
Also, ISTM that the big reason we don't like statement based
replication is that SQL has many non-deterministic aspects. However,
there is probably a pretty darn big subset of SQL which is provably
deterministic. And for that subset, would it be any less
rigorous to transmit those statements than to transmit the per-row
change statements like we currently do?
Well, by capturing the values, we have captured a deterministic form
of the update.
How to figure out what is deterministic and what isn't? A simple
insert into summary select id, sum(value) from detail group by id;
seems pretty deterministic, doesn't it? But the result of it depends
on the exact commit order and the transaction isolation level. We
don't capture the commit order of single transactions, nor do we care
for it anywhere in the Slony-I logic.
But at the time that we apply these changes in log_actionseq order, we
have imposed a deterministic order. (Which happens to be repeatable,
on each node.)
The question was, how do we figure out which SQL statement would be
deterministic and thus a candidate for SQL query string propagation -
aside from the fact that there is no standard way in Postgres to capture
query strings or parsetrees anyway. So far we have only established that
the logging of the changes has to be done the way we are doing it now,
on a row basis, where the actionseq determines the repeatable order. I
don't see how going through a lot of effort to group those individual
log rows together again will gain us a lot.

Unless the effort also attempts to group together consecutive insert and
update statements affecting the same columns and using prepared
statements, and unless we have some evidence that doing so will gain
more than the effort of all that grouping costs, I don't think it is a
good idea to make that part of remote_worker.c any more complicated than
it is today. How many developers do we have who actually understand how
that part of slon really works and who could go in and fix some bug in
there?


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Darcy Buskermolen
2007-07-04 12:37:08 UTC
Permalink
Post by Jan Wieck
Post by Christopher Browne
Post by Jan Wieck
Post by Christopher Browne
Post by Andrew Hammond
Also, ISTM that the big reason we don't like statement based
replication is that SQL has many non-deterministic aspects. However,
there is probably a pretty darn big subset of SQL which is provably
deterministic. And for that subset, would it be any less
rigorous to transmit those statements than to transmit the per-row
change statements like we currently do?
Well, by capturing the values, we have captured a deterministic form
of the update.
How to figure out what is deterministic and what isn't? A simple
insert into summary select id, sum(value) from detail group by id;
seems pretty deterministic, doesn't it? But the result of it depends
on the exact commit order and the transaction isolation level. We
don't capture the commit order of single transactions, nor do we care
for it anywhere in the Slony-I logic.
But at the time that we apply these changes in log_actionseq order, we
have imposed a deterministic order. (Which happens to be repeatable,
on each node.)
The question was, how do we figure out which SQL statement would be
deterministic and thus a candidate for SQL query string propagation -
aside from the fact that there is no standard way in Postgres to capture
query strings or parsetrees anyway. So far we have only established that
the logging of the changes has to be done the way we are doing it now,
on a row base where the actionseq determines the repeatable order. I
don't see how going through a lot of effort to group those individual
log rows together again will gain us a lot.
Unless the effort also attempts to group together consecutive insert and
update statements affecting the same columns and using prepared
statements, and unless we have some evidence that doing so will gain
more than the effort of all that grouping costs, I don't think it is a
good idea to make that part of remote_worker.c any more complicated than
it is today. How many developers do we have who actually understand how
that part of slon really works and who could go in and fix some bug in
there?
I have a better understanding of how this works today than I did a month ago,
but it still feels a lot like black magic in there, so I'm with Jan on this
one: unless we can show that there is a significant advantage to doing so,
it's not worth the complication.

Effort spent keeping slony from suffering the ill effects of Long Running
Transactions feels, to me, like a much better basket to place development
eggs into.
Post by Jan Wieck
Jan
--
Darcy Buskermolen
Command Prompt, Inc.
+1.503.667.4564 X 102
http://www.commandprompt.com/
PostgreSQL solutions since 1997
David Fetter
2007-07-02 08:38:16 UTC
Permalink
Post by Christopher Browne
The v2.0 branch has already taken on the following fairly significant
changes (listed below). Jan and I had a chat this afternoon about
[snip]
Post by Christopher Browne
- Clone Node - to allow using pg_dump/PITR to populate a new
subscriber node.
Are there more items we should try to add?
It would be really great to be able to promote a PITR node to a
subscriber, or better still, to be able to use pg_dump + some kind of
ancillary information to do same. This would help a lot with the
initial sync problem that makes replicating large databases so
painful.

Cheers,
D
--
David Fetter <***@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate
Jan Wieck
2007-07-02 10:33:00 UTC
Permalink
Post by David Fetter
Post by Christopher Browne
The v2.0 branch has already taken on the following fairly significant
changes (listed below). Jan and I had a chat this afternoon about
[snip]
Post by Christopher Browne
- Clone Node - to allow using pg_dump/PITR to populate a new
subscriber node.
Are there more items we should try to add?
It would be really great to be able to promote a PITR node to a
subscriber, or better still, to be able to use pg_dump + some kind of
ancillary information to do same. This would help a lot with the
initial sync problem that makes replicating large databases so
painful.
You are aware that only PITR will avoid a long running transaction?

The stuff I am currently (very slowly) working on is that very problem.
Any long running transaction causes the minxid in the SYNCs to be stuck
at that very xid during the entire runtime of the LRT. The problem with
that is that the log selection in the slon worker uses an index scan
whose only index scankey candidates are the minxid of one and the maxxid
of another snapshot; that range is what the scan itself returns. Since
the minxid is stuck, the scan will select larger and larger groups of
log tuples, only to filter out most of them at a higher level in the
query via xxid_le_snapshot().

While the LRT is in progress, we don't have ANY chance of doing
anything at all, because nobody knows whether that transaction makes
changes to the database that the next SYNC after it commits has to pick
up. But in case it aborts, we know it did not change anything. So my
idea right now is to trim the snapshot information by removing aborted
transactions from it. That way slon will only suffer from an increasing
index scankey range while the LRT is in progress, but not any more after
it has aborted. This will at least take care of pg_dump and slony's
copy_set(), because neither of them ever commits any changes. LRTs that
really do updates will continue to cause the problem.
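
A toy illustration of the trimming idea (field names invented, not Slony's actual snapshot representation): the lower scankey bound is pinned by the oldest in-progress xid, and dropping xids known to have aborted lets that bound advance.

```python
# The snapshot's effective lower bound for the log index scan is the
# oldest transaction still considered in progress; removing known-aborted
# xids shrinks the scanned range.

def effective_minxid(in_progress, known_aborted):
    live = [xid for xid in in_progress if xid not in known_aborted]
    return min(live) if live else None

xip = [100, 4500, 4600]                      # 100 is the long-running txn
assert effective_minxid(xip, set()) == 100   # stuck while the LRT runs
assert effective_minxid(xip, {100}) == 4500  # bound advances after abort
```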

pg_dump can now be done against a subscriber, and people tend to move
long running reporting off to slaves anyway. So all that is really
destined to create trouble is long running housekeeping transactions.


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Marko Kreen
2007-07-02 10:45:58 UTC
Permalink
Post by Jan Wieck
The stuff I am currently (very slowly) working on is that very problem.
Any long running transaction causes that the minxid in the SYNC's is
stuck at that very xid during the entire runtime of the LRT. The problem
with that is that the log selection in the slon worker uses an index
scan who's only index scankey candidates are the minxid of one and the
maxxid of another snapshot. That is the range of rows returned by the
scan itself. Since the minxid is stuck, it will select larger and larger
groups of log tuples only to filter out most of them on a higher level
in the query via xxid_le_snapshot().
How the LRT problem is avoided in PGQ:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/skytools/skytools/sql/pgq/functions/pgq.batch_event_sql.sql?rev=1.2&content-type=text/x-cvsweb-markup

The basic idea is that there are only a few LRTs, so it's reasonable
to pick up the bottom half of the range by event txid, one by one.
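
A toy version of that predicate (simplified; the real PGQ code does a dense range scan between the snapshots' xmax values plus one-by-one lookups of the few formerly-in-progress txids): a txid belongs to the batch if it is visible in the new snapshot but was not visible in the old one.

```python
# Each snapshot is modeled as (xmax, frozenset of in-progress txids).
# An event txid is "visible" to a snapshot if it committed before it,
# i.e. txid < xmax and txid is not in the in-progress set.

def batch_txids(event_txids, prev_snap, cur_snap):
    picked = []
    for txid in sorted(event_txids):
        seen_before = txid < prev_snap[0] and txid not in prev_snap[1]
        visible_now = txid < cur_snap[0] and txid not in cur_snap[1]
        if visible_now and not seen_before:
            picked.append(txid)
    return picked

prev_snap = (500, frozenset({100, 480}))  # 100 is the LRT, still open
cur_snap = (900, frozenset({100}))        # 480 committed meanwhile
assert batch_txids({100, 480, 520, 880, 950},
                   prev_snap, cur_snap) == [480, 520, 880]
```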
--
marko
Jan Wieck
2007-07-02 11:03:30 UTC
Permalink
Post by Marko Kreen
Post by Jan Wieck
The stuff I am currently (very slowly) working on is that very problem.
Any long running transaction causes that the minxid in the SYNC's is
stuck at that very xid during the entire runtime of the LRT. The problem
with that is that the log selection in the slon worker uses an index
scan who's only index scankey candidates are the minxid of one and the
maxxid of another snapshot. That is the range of rows returned by the
scan itself. Since the minxid is stuck, it will select larger and larger
groups of log tuples only to filter out most of them on a higher level
in the query via xxid_le_snapshot().
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/skytools/skytools/sql/pgq/functions/pgq.batch_event_sql.sql?rev=1.2&content-type=text/x-cvsweb-markup
Basic idea is that there are only few LRT's, so its reasonable
to pick up bottom half of range by event txid, one-by-one.
Hmmm, that is an interesting idea. And it is (in contrast to what I've
been playing with) node insensitive, since it doesn't need info only
available on the event origin, like CLOG. Thanks.


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Jan Wieck
2007-07-02 12:49:29 UTC
Permalink
Post by Jan Wieck
Post by Marko Kreen
Post by Jan Wieck
The stuff I am currently (very slowly) working on is that very problem.
Any long running transaction causes that the minxid in the SYNC's is
stuck at that very xid during the entire runtime of the LRT. The problem
with that is that the log selection in the slon worker uses an index
scan who's only index scankey candidates are the minxid of one and the
maxxid of another snapshot. That is the range of rows returned by the
scan itself. Since the minxid is stuck, it will select larger and larger
groups of log tuples only to filter out most of them on a higher level
in the query via xxid_le_snapshot().
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/skytools/skytools/sql/pgq/functions/pgq.batch_event_sql.sql?rev=1.2&content-type=text/x-cvsweb-markup
Basic idea is that there are only few LRT's, so its reasonable
to pick up bottom half of range by event txid, one-by-one.
Hmmm, that is an interesting idea. And it is (in contrast to what I've
been playing with) node insensitive, since it doesn't need info only
available on the event origin, like CLOG. Thanks.
Not only is it interesting, but it is astonishingly simple to adapt into
our code. I want to do some more testing before I commit this change,
but the really interesting thing here is that it is only a 3-line change
in the remote_worker.c file, which could easily be backported into 1.2.

I had created a really pathological test case here by SIGSTOP'ing the
slon during copy_set() for a day, so it had a backlog of some 90000
events. About a third into that backlog, it was down to a 60+ second
delay for the first log row and, due to the dynamics of the group size,
was processing on a single-event basis. That same database is now moving
through the backlog in batches covering 5-8 minutes of events each, has
a <1 second delay for the first log row, and processes those groups in
50-70 seconds.

This looks very promising.


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Christopher Browne
2007-07-02 12:53:34 UTC
Permalink
Post by Jan Wieck
Post by Jan Wieck
Post by Marko Kreen
Post by Jan Wieck
The stuff I am currently (very slowly) working on is that very problem.
Any long running transaction causes that the minxid in the SYNC's is
stuck at that very xid during the entire runtime of the LRT. The problem
with that is that the log selection in the slon worker uses an index
scan who's only index scankey candidates are the minxid of one and the
maxxid of another snapshot. That is the range of rows returned by the
scan itself. Since the minxid is stuck, it will select larger and larger
groups of log tuples only to filter out most of them on a higher level
in the query via xxid_le_snapshot().
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/skytools/skytools/sql/pgq/functions/pgq.batch_event_sql.sql?rev=1.2&content-type=text/x-cvsweb-markup
Basic idea is that there are only few LRT's, so its reasonable
to pick up bottom half of range by event txid, one-by-one.
Hmmm, that is an interesting idea. And it is (in contrast to what
I've been playing with) node insensitive, since it doesn't need info
only available on the event origin, like CLOG. Thanks.
Not only is it interesting, but it is astonishing simple to adopt into
our code. I want to do some more testing before I commit this change,
but the really interesting thing here is that it is only a 3 line
change in the remote_worker.c file, which could easily be backported
into 1.2.
I had created a really pathetic test case here by SIGSTOP'ing the slon
while doing the copy_set() for a day, so it had some 90000 events
backlog. About a third into that backlog, it was down to 60+ seconds
delay for first log row and due to the dynamic in the group size,
doing that on a single event base. That same database is now moving
through the backlog in batches of 5-8 minutes each, has a <1 second
delay for first log row and does those groups in 50-70 seconds.
This looks very promising.
Drew Hammond's keen on having some BSD-oriented scripts put into the
1.2 branch that I had only put into HEAD; this might be an excuse for
a 1.2.11.
--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/linux.html
There are two kinds of people in the world: People who think there are
two kinds of people and people who don't.
Andrew Hammond
2007-07-02 16:51:11 UTC
Permalink
Post by Christopher Browne
Drew Hammond's keen on having some BSD-oriented scripts put into the
1.2 branch that I had only put into HEAD; this might be an excuse for
a 1.2.11.
s/BSD/djb daemontools/

Andrew
Jan Wieck
2007-07-02 20:16:22 UTC
Permalink
Post by Christopher Browne
Post by Jan Wieck
Post by Jan Wieck
Post by Marko Kreen
Post by Jan Wieck
The stuff I am currently (very slowly) working on is that very problem.
Any long running transaction causes that the minxid in the SYNC's is
stuck at that very xid during the entire runtime of the LRT. The problem
with that is that the log selection in the slon worker uses an index
scan who's only index scankey candidates are the minxid of one and the
maxxid of another snapshot. That is the range of rows returned by the
scan itself. Since the minxid is stuck, it will select larger and larger
groups of log tuples only to filter out most of them on a higher level
in the query via xxid_le_snapshot().
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/skytools/skytools/sql/pgq/functions/pgq.batch_event_sql.sql?rev=1.2&content-type=text/x-cvsweb-markup
Basic idea is that there are only few LRT's, so its reasonable
to pick up bottom half of range by event txid, one-by-one.
Hmmm, that is an interesting idea. And it is (in contrast to what
I've been playing with) node insensitive, since it doesn't need info
only available on the event origin, like CLOG. Thanks.
Not only is it interesting, but it is astonishing simple to adopt into
our code. I want to do some more testing before I commit this change,
but the really interesting thing here is that it is only a 3 line
change in the remote_worker.c file, which could easily be backported
into 1.2.
I had created a really pathetic test case here by SIGSTOP'ing the slon
while doing the copy_set() for a day, so it had some 90000 events
backlog. About a third into that backlog, it was down to 60+ seconds
delay for first log row and due to the dynamic in the group size,
doing that on a single event base. That same database is now moving
through the backlog in batches of 5-8 minutes each, has a <1 second
delay for first log row and does those groups in 50-70 seconds.
This looks very promising.
Drew Hammond's keen on having some BSD-oriented scripts put into the
1.2 branch that I had only put into HEAD; this might be an excuse for
a 1.2.11.
IF ... if ... (really if) ... this all checks out to do what it is
supposed to do. We have to test whether this sort of change has any
adverse side effects on slony installations running with hundreds of
concurrent DB connections, for example. So far it only looks pretty good
against a simple N1->N2 setup bombarded with a -c5 pgbench. That isn't
quite the testing you want to have done before committing such a
substantial change in the inner core log selection logic of STABLE code,
is it?


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Andrew Sullivan
2007-07-04 09:37:40 UTC
Permalink
Post by Jan Wieck
against a simple N1->N2 setup bombarded with a -c5 pgbench. That isn't
quite the testing you want to have done before committing such a
substantial change in the inner core log selection logic of STABLE code,
is it?
What, we're not gonna pants-seat fly? Sigh. No guts, no glory ;-)

Seriously, I agree with Jan here: let's be _really_ conservative with
this one. Indeed, given that it's a small patch, I'd be inclined to
issue a .11 with a contrib/pgq-approach.patch file and suggest people
try it before back patching for real. The HEAD is a good place for
architectural changes, but the supposedly STABLE releases aren't.
I'm not a fan of Linux-style "rewrite the PCI subsystem in
x.x.8" STABLE releases. And I think this project has been
often enough bitten by such exuberance that we should be cautious.

A
--
Andrew Sullivan | ***@crankycanuck.ca
A certain description of men are for getting out of debt, yet are
against all taxes for raising money to pay it off.
--Alexander Hamilton
Marko Kreen
2007-07-05 02:28:54 UTC
Permalink
Post by Andrew Sullivan
Post by Jan Wieck
against a simple N1->N2 setup bombarded with a -c5 pgbench. That isn't
quite the testing you want to have done before committing such a
substantial change in the inner core log selection logic of STABLE code,
is it?
What, we're not gonna pants-seat fly? Sigh. No guts, no glory ;-)
Seriously, I agree with Jan here: let's be _really_ conservative with
this one. Indeed, given that it's a small patch, I'd be inclined to
issue a .11 with a contrib/pgq-apprach.patch file and suggest people
try it before back patching for real. The HEAD is a good place for
architectural changes, but the supposedly STABLE releases aren't.
I'm not a fan of the Linux-style, "rewrite the PCI subsystem in
x.x.8" STABLE-style releases. And I think this project has been
often enough bitten by such exuberance that we should be cautious.
I think the patch is fine correctness-wise. The main problem
you can have with the new approach is that Postgres gets confused
and turns the whole query into a seqscan.

It should not happen in 8.3 but could be a problem with
7.4 or 8.0.

OTOH, my experience was with the int8 txid; maybe they are
more intelligent when handling int4.
--
marko
Steve Singer
2007-07-03 16:44:07 UTC
Permalink
Is there anything we can do for 2.0 to improve DDL use cases?

I had sent out a patch a while back that lets EXECUTE SCRIPT take a list of
tables to lock (thus not locking everything). I never did get any feedback
on the patch. If there is interest I can try to bring it up to the current
2.0 head and resend it.

Do the new ways of disabling triggers have any effect on the dangers of doing
ALTER TABLEs outside of an EXECUTE SCRIPT? Is there anything we can do to
improve that (at least for DDL where the order it is applied in doesn't need
to match on the slaves)?


Steve
Jan Wieck
2007-07-05 08:10:11 UTC
Permalink
Post by Steve Singer
Is there anything we can do for 2.0 to improve DDL use cases?
I had sent out a patch a while back that lets EXECUTE SCRIPT take a list of
tables to lock (thus not locking everything). I never did get any feedback
on the patch. If there is interest I can try to bring it up to the current
2.0 head and resend it.
Do the new ways of disabling triggers have any effect on the dangers of doing
alter tables outside of an execute script? Is there anything we can do to
improve that (at least for DDL where the order it is applied on doesn't need
to match on the slaves)
They do indeed have an effect on precisely that, because now the system
catalog is clean and you actually can do arbitrary DDL without going
through EXECUTE SCRIPT at all. Whether doing so makes sense or not and
eventually screws up your data because the commands affect different
sets of rows is another question.

Since this also means that EXECUTE SCRIPT does not lock any tables by
itself, one can issue the required LOCK TABLE statements at the
beginning of the script, so I don't think such a patch is required any
more. Unless I am missing something, that is.
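
For example, a hypothetical EXECUTE SCRIPT body could now start with exactly the locks its DDL needs (table name is illustrative):

```sql
-- hypothetical script body: take the needed locks up front, since
-- EXECUTE SCRIPT itself no longer locks anything
LOCK TABLE public.customers IN ACCESS EXCLUSIVE MODE;
ALTER TABLE public.customers ADD COLUMN signup_date date;
```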


Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== ***@Yahoo.com #
Vivek Khera
2007-07-06 11:16:57 UTC
Permalink
Post by Christopher Browne
- getting the scripted tests to generate some SQL INSERT statements
summarizing the way the tests went; that would very much ease
collecting data about conformance
I hope you're not gonna accept these raw from random sources posting
test results on the net... You'd be better off posting a tabular
format from which you can parse the data into SQL.
Christopher Browne
2007-07-06 11:40:42 UTC
Permalink
Post by Vivek Khera
Post by Christopher Browne
- getting the scripted tests to generate some SQL INSERT statements
summarizing the way the tests went; that would very much ease
collecting data about conformance
I hope you're not gonna accept these raw from random sources on the
net posting test results... You'd be better off posting a tabular
format from which you can parse the data into SQL.
Good point. Some might consider it an "SQL injection" attack ;-).

I think I'll still work provisionally with SQL for the time being;
that lets us make stronger inferences about data types and such. And
before letting random sources post test results, this will need to
get turned into something rather less "security challenging."

I have noted this in the TODO.
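
A sketch of the safer eventual route (format and table/column names invented): testers submit tab-separated lines, and the collector binds the parsed fields as query parameters instead of executing submitted SQL text.

```python
# Parse one tab-separated result line into a parameterized INSERT; the
# untrusted values travel only as bound parameters, never as SQL text.

def parse_result_line(line):
    platform, pg_version, test_name, outcome = line.rstrip("\n").split("\t")
    sql = ("INSERT INTO test_results (platform, pg_version, test_name, outcome) "
           "VALUES (%s, %s, %s, %s)")
    return sql, (platform, pg_version, test_name, outcome)

sql, params = parse_result_line("freebsd-6.2\t8.3devel\ttestddl\tpass")
assert params == ("freebsd-6.2", "8.3devel", "testddl", "pass")
assert "freebsd" not in sql   # data stays out of the SQL text
```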
--
"Note that if I can get you to `su and say' something just by asking,
you have a very serious security problem on your system and you should
look into it." -- Paul Vixie, vixie-cron 3.0.1 installation notes
Adam Cassar
2007-07-08 05:04:53 UTC
Permalink
What about an easy way to resync the slaves from the master?
Andrew Sullivan
2007-07-08 09:42:50 UTC
Permalink
Post by Adam Cassar
What about an easy way to resync the slaves from the master?
Please expand on what you mean by that.

A
--
Andrew Sullivan | ***@crankycanuck.ca
When my information changes, I alter my conclusions. What do you do sir?
--attr. John Maynard Keynes