Remove operations pull completed tokens from txn-queue #301

jameinel · 2018-11-19T13:48:38Z

We ran into an odd edge case in our system, where we had a document that was being created and then removed repeatedly. (People were testing if the system worked by creating an object and then immediately removing it.)

The code for Insert and Remove just copies whatever the txn-queue was from the collection to the stash and back. This changes it so that both clean up the txn-queue while it is creating the document in the stash. This should be functionally equivalent to the $pullAll that we do during Update. It should be safe to do this because we are expressly creating the document out of 'whole cloth' with an Insert, so it isn't possible for us to be racing with another thread that is trying to add another txn to the queue. (If it was, this doesn't change the race that we would have already had).

There is an extra pass over the freshly read txn-queue, but in well behaved systems, the queue should always be kept small. This is mostly a case where we found we weren't behaving well.

We are directly inserting the Queue, but instead of just blindly copying what was in the old document, apply the $pullAll that we would have applied if this were an Update. Because we are creating the stash document from scratch, we don't need to ask Mongo to do the work.

Essentially, we don't have to do both Insert and Remove, because one of them must have been done in order to do the other. This way, a series of only Insert and Remove operations won't cause the txn-queue to get out of hand.

jameinel · 2018-11-19T13:53:33Z

I should also note that I added a -fast flag to the txn test suite, as the StashStressTest takes about 2minutes for me, while the rest of the test suite takes something like 15s total time.

jameinel · 2018-11-19T13:54:27Z

hm... I just had a failure on:

FAIL: sim_test.go:110: S.TestSimReinsertCopy4Workers

sim_test.go:308:
c.Check(err, IsNil)
... value *errors.errorString = &errors.errorString{s:"cannot find transaction 5bf2c04f93538f4ab865a02b_3acb2035 in queue for document {accounts 9}"} ("cannot find transaction 5bf2c04f93538f4ab865a02b_3acb2035 in queue for document {accounts 9}")

I'll investigate as that is the sort of race this change could have introduced.

It turns out that I missed we had just reread the document as part of the 'add txn-insert' or 'add txn-remove' to the document. That gives us an updated txn-queue that needs to be handled. Since we are now cleaning directly on the txn-queue instead of the cached queue docs, go ahead and do it on both Insert and Remove.

jameinel · 2018-11-19T14:12:20Z

To follow up, I had missed that the insert that flags the item as marked for removal/marked for insert rereads the txn-queue, and that is the queue that we were using for the stash value. Not the queue that we started the function with. But that should be handled, and I ran
go test -fast -count=20
without any failures in the simulation tests, so I'm reasonably sure this is correct.

manadart · 2018-11-19T14:26:18Z

txn/flusher.go

@@ -998,6 +1004,23 @@ func tokensToPull(dqueue []tokenAndId, pull map[bson.ObjectId]*transaction, dont
 	return result
 }

+// cleanQueue takes an existing queue and removes all the items in pull from it, generating a new txn-queue.
+func cleanQueue(queue []token, pull map[bson.ObjectId]*transaction, dontPull token) ([]token) {
+	cleaned := make([]token, 0, len(queue))


When filtering like this, a nice trick I like is:
cleaned := queue[:0]
Because the slice is created with the same underlying array, there's no re-allocation up to the original length.

I almost went this way, but ended up avoiding it because it modifies the original object. We do use it for debug logging later on, so I didn't want to confuse the situation, especially since it doesn't rewrite info.Queue's length, it means an array that could end up with the tail doubled.

manadart · 2018-11-19T14:31:36Z

txn/flusher.go

@@ -998,6 +1004,23 @@ func tokensToPull(dqueue []tokenAndId, pull map[bson.ObjectId]*transaction, dont
 	return result
 }

+// cleanQueue takes an existing queue and removes all the items in pull from it, generating a new txn-queue.
+func cleanQueue(queue []token, pull map[bson.ObjectId]*transaction, dontPull token) ([]token) {


Is filterQueue a better name for this?

works for me. (done)

howbazaar · 2018-11-19T22:18:27Z

txn/txn_test.go

+	var qdoc txnQueue
+	err = s.accounts.FindId(0).One(&qdoc)
+	c.Assert(err, IsNil)
+	err = s.accounts.FindId(0).One(&qdoc)


Is there are reason to find the doc twice? Or is this an error?

just a copy paste error in the test, fixed.

This is the patch from juju/mgo#4 globalsign/mgo#301 It filters the txn-queue when copying a document from or to the stash, just like we would do during a normal Update operation. This prevents a sequence of Insert + Remove pairs from filling up the txn-queue.

#9486 ## Description of change This is the patch from juju/mgo#4 globalsign/mgo#301 It filters the txn-queue when copying a document from or to the stash, just like we would do during a normal Update operation. This prevents a sequence of Insert + Remove pairs from filling up the txn-queue. ## QA steps The associated test shows the basic change. To trigger this manually you can: ``` $ juju bootstrap lxd $ while true; do juju deploy cs:~jameinel/ubuntu-lite; sleep 90; juju remove-application ubuntu-lite; sleep 30 done ``` While doing that, you should be able to go into the juju controller and either ``` > db.constraints.find() > db.txns.stash.find({"_id.c": "constraints"}) ``` And see that there is a document whose txn-queue is growing quite large. With this patch, the txn-queue should stay around 1 or 2 entries. ## Documentation changes None. ## Bug reference https://bugs.launchpad.net/juju/+bug/1804197

jameinel added 3 commits November 19, 2018 17:25

Skip the very slow test if -fast is supplied

81681e4

Comment on why we aren't Pulling during Insert.

8141feb

Essentially, we don't have to do both Insert and Remove, because one of them must have been done in order to do the other. This way, a series of only Insert and Remove operations won't cause the txn-queue to get out of hand.

jameinel changed the title ~~Txn globalsign remove prunes queue~~ Remove operations Pull completed tokens from txn-queue Nov 19, 2018

jameinel changed the title ~~Remove operations Pull completed tokens from txn-queue~~ Remove operations pull completed tokens from txn-queue Nov 19, 2018

manadart previously approved these changes Nov 19, 2018

View reviewed changes

Rename from cleanQueue to filterQueue.

98832d7

jameinel dismissed manadart’s stale review via 98832d7 November 19, 2018 15:24

howbazaar reviewed Nov 19, 2018

View reviewed changes

No need to read the document 2x in the test.

0ca7a4d

jameinel mentioned this pull request Nov 20, 2018

Remove and Insert filter the txn-queue juju/mgo#4

Merged

jameinel mentioned this pull request Nov 20, 2018

Include an mgo patch to filter txn-queue during Insert and Remove juju/juju#9486

Merged

howbazaar approved these changes Nov 20, 2018

View reviewed changes

eminano added 3 commits November 27, 2018 10:34

Merge branch 'development' into txn-globalsign-remove-prunes-queue

2a46c73

Merge branch 'development' into txn-globalsign-remove-prunes-queue

226ef5b

Merge branch 'development' into txn-globalsign-remove-prunes-queue

6559e40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove operations pull completed tokens from txn-queue #301

Remove operations pull completed tokens from txn-queue #301

jameinel commented Nov 19, 2018 •

edited

Loading

jameinel commented Nov 19, 2018

jameinel commented Nov 19, 2018

jameinel commented Nov 19, 2018

manadart Nov 19, 2018

jameinel Nov 19, 2018

manadart Nov 19, 2018

jameinel Nov 19, 2018

howbazaar Nov 19, 2018

jameinel Nov 20, 2018

Remove operations pull completed tokens from txn-queue #301

Are you sure you want to change the base?

Remove operations pull completed tokens from txn-queue #301

Conversation

jameinel commented Nov 19, 2018 • edited Loading

jameinel commented Nov 19, 2018

jameinel commented Nov 19, 2018

hm... I just had a failure on:

jameinel commented Nov 19, 2018

manadart Nov 19, 2018

Choose a reason for hiding this comment

jameinel Nov 19, 2018

Choose a reason for hiding this comment

manadart Nov 19, 2018

Choose a reason for hiding this comment

jameinel Nov 19, 2018

Choose a reason for hiding this comment

howbazaar Nov 19, 2018

Choose a reason for hiding this comment

jameinel Nov 20, 2018

Choose a reason for hiding this comment

jameinel commented Nov 19, 2018 •

edited

Loading