#241031 POSTPONED #2 p9 fonts-ttf-liberation.git=2.00.4-alt2 $ ssh girar task show 241031 id=241031 locked=no shared=no fail_early=yes test_only=no repo=p9 owner=imz state=POSTPONED try=2 iter=1 depends=241026 build_time=2019-Nov-19/18:53:52 100:dir=/people/imz/packages/fonts-ttf-liberation.git 100:tag_name=2.00.4-alt2 100:tag_id=29d66c17a6d4f4058223097fd63fd18aac93b529 100:tag_author=Ivan Zakharyaschev <imz@altlinux.org> 100:fetched=2019-11-19T18:44:35 100:userid=imz 100:pkgname=fonts-ttf-liberation The dependency was added after the first try. Then, a command to run the task a second time was given. Now, the second try is postponed. The other task (the dependency, 241026) was completed successfully at the same moment: [imz@team ~]$ stat /tasks/archive/done/_$(( 241026 / 1024 ))/241026/logs/commit.log File: /tasks/archive/done/_235/241026/logs/commit.log Size: 1474 Blocks: 8 IO Block: 512 regular file Device: 9ch/156d Inode: 170657490 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 502/ ldv_b1) Gid: ( 502/ ldv_b1) Access: 2019-11-19 21:54:05.259894962 +0300 Modify: 2019-11-19 21:52:56.118469653 +0300 Change: 2019-11-19 21:54:05.259894962 +0300 Birth: - [imz@team ~]$ stat /tasks/archive/done/_$(( 241026 / 1024 ))/241026/logs/events.2.3.log File: /tasks/archive/done/_235/241026/logs/events.2.3.log Size: 3209 Blocks: 8 IO Block: 512 regular file Device: 9ch/156d Inode: 170657499 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 502/ ldv_b1) Gid: ( 502/ ldv_b1) Access: 2019-11-19 21:54:05.259894962 +0300 Modify: 2019-11-19 21:53:29.572741179 +0300 Change: 2019-11-19 21:54:05.259894962 +0300 Birth: - [imz@team ~]$ tail /tasks/archive/done/_$(( 241026 / 1024 ))/241026/logs/events.2.3.log 2019-Nov-19 18:52:56 :: packages update OK 2019-Nov-19 18:53:00 :: [noarch] update OK 2019-Nov-19 18:53:00 :: repo update OK 2019-Nov-19 18:53:13 :: repo save OK 2019-Nov-19 18:53:13 :: src index update OK 2019-Nov-19 18:53:14 :: updated /gears/f/fonts-ttf-liberation.git branch `sisyphus' 2019-Nov-19 18:53:16 :: saved /srpms/f/fonts-ttf-liberation.git branch `sisyphus' as `old/sisyphus-task241026' 2019-Nov-19 18:53:16 :: removed /srpms/f/fonts-ttf-liberation.git branch `sisyphus' 2019-Nov-19 18:53:29 :: gears update OK 2019-Nov-19 18:53:29 :: task #241026 for sisyphus DONE [imz@team ~]$ 18:53:29 (DONE) is earlier than build_time=2019-Nov-19/18:53:52 of the postponed task. [imz@team ~]$ stat /tasks/241031/logs/events.2.1.log File: /tasks/241031/logs/events.2.1.log Size: 0 Blocks: 0 IO Block: 512 regular empty file Device: 1ah/26d Inode: 47842378 Links: 1 Access: (0664/-rw-rw-r--) Uid: ( 812/ UNKNOWN) Gid: ( 59/ UNKNOWN) Access: 2019-11-19 21:53:52.610928162 +0300 Modify: 2019-11-19 21:53:52.610928162 +0300 Change: 2019-11-19 21:53:52.610928162 +0300 Birth: - [imz@team ~]$ tail /tasks/241031/logs/events.2.1.log [imz@team ~]$ It looks like when the pender was finishing the other task, it had yet no postponed task to notify. The check that the dependecy exists seems to have been done some time before, and then after the other task was done and the notifications were attempted, the dependent task was run (and made postponed according to the result of the early check). Such situation could be avoided by introducing a parallel checking process (whether a dependecy has been completed) or adding such a check to awaiter, or the race could be avoided by using a read-lock in the "run" command (when it checks the status of the dependencies and changes the state to either AWAITING or POSTPONED) and a write-lock in the committer process ("pender") during, hmmm, some appropriate time.
I suggest another solution, almost without additional locking (for a more deterministic behavior, we should additionaly change lockf -n to lockf -x in gb-y-awake-tasks). As for testing/debugging the new code in the "run" command in such a situation, I see a way to do this: copy girar-task-run on the server (so that it doesn't change for normal users), add two "breakpoints" to it in the copy: before the first girar-task-change-state "$id" "$next_state and at the end of the script, before the lock on the task is released. A breakpoint might be the shell command "read". To test the 2 kinds of races, we create a new task with a dependency on an existing task. Then run it with the modified "run" command (with breakpoints). Then, before we continue at the breakpoint, complete the other task (the dependency), and then continue. Look at the result. The 2 breakpoints are for the 2 different kind of races. From 4386f86b6762c8b8ff9f1b43c7dd1ab57453ab6d Mon Sep 17 00:00:00 2001 From: Ivan Zakharyaschev <imz@altlinux.org> Date: Wed, 4 Dec 2019 05:11:12 +0300 Subject: [PATCH] girar-task-run: avoid one of the races leaving the task POSTPONED forever The race that is treated by this change: 1. check_depends 2. Another task is done (one of the dependencies of this task), but has nothing to awake yet, because this task is not yet in POSTPONED state. 3. This task's state is changed to POSTPONED ...and it is never awaken. Another race that is not treated: 0. This script locks this task. 1. This task's state is changed to POSTPONED. 2. Another task is done (one of the dependencies of this task), but cannot awake this task, because this task is still locked by this script. 3. This script finishes and unlocks this task. The second race could be treated by using lockf -x in gb-y-awake-tasks rather than lockf -n. --- bin/girar-task-run | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/bin/girar-task-run b/bin/girar-task-run index 036486a..cac2c66 100755 --- a/bin/girar-task-run +++ b/bin/girar-task-run @@ -282,6 +282,30 @@ trap '' HUP INT QUIT PIPE TERM echo $try > task/try echo 1 > task/iter girar-task-change-state "$id" "$next_state" +if [ "$next_state" = POSTPONED ]; then + # The status of the dependencies might have changed since the last check. + # And if so, this task will remain in POSTPONED state forever, + # because it couldn't have been awaken since that last check, + # because it hasn't been in POSTPONED state yet. + # So, we must awake it ourselves. + if check_depends; then + [ -s task/depends ] || { + next_state=AWAITING + girar-task-change-state "$id" "$next_state" + } + else + next_state=FAILED + girar-task-change-state "$id" "$next_state" + girar-webapi-task update "$id" + false + fi + + # At this moment, it's possible that this task is in POSTPONED state + # and is still locked. And if at this moment gb-y-awake-tasks is triggered + # by some of its dependencies, that will have no effect (it uses lockf -n). + # Hence, this task will remain in POSTPONED state forever. This is bad. + # Consider using lockf -x in gb-y-awake-tasks to avoid such situation. +fi girar-webapi-task update "$id" echo >&2 "task #$id: try #$try is $next_state, result will be emailed to $owner@$EMAIL_DOMAIN" -- 2.21.0
I assume this is fixed by commit b27183856ce6b7c32d463867c7ca82a7524809d0.