Summary: | a task is postponed forever although the other task (its dependency) was completed at the same moment (a race) | ||
---|---|---|---|
Product: | Infrastructure | Reporter: | Ivan Zakharyaschev <imz> |
Component: | girar | Assignee: | Dmitry V. Levin <ldv> |
Status: | CLOSED FIXED | QA Contact: | Andrey Cherepanov <cas> |
Severity: | minor | ||
Priority: | P3 | CC: | glebfm, ldv |
Version: | unspecified | ||
Hardware: | all | ||
OS: | Linux |
Description
Ivan Zakharyaschev
2019-11-30 06:42:13 MSK
I suggest another solution, almost without additional locking (for a more deterministic behavior, we should additionaly change lockf -n to lockf -x in gb-y-awake-tasks). As for testing/debugging the new code in the "run" command in such a situation, I see a way to do this: copy girar-task-run on the server (so that it doesn't change for normal users), add two "breakpoints" to it in the copy: before the first girar-task-change-state "$id" "$next_state and at the end of the script, before the lock on the task is released. A breakpoint might be the shell command "read". To test the 2 kinds of races, we create a new task with a dependency on an existing task. Then run it with the modified "run" command (with breakpoints). Then, before we continue at the breakpoint, complete the other task (the dependency), and then continue. Look at the result. The 2 breakpoints are for the 2 different kind of races. From 4386f86b6762c8b8ff9f1b43c7dd1ab57453ab6d Mon Sep 17 00:00:00 2001 From: Ivan Zakharyaschev <imz@altlinux.org> Date: Wed, 4 Dec 2019 05:11:12 +0300 Subject: [PATCH] girar-task-run: avoid one of the races leaving the task POSTPONED forever The race that is treated by this change: 1. check_depends 2. Another task is done (one of the dependencies of this task), but has nothing to awake yet, because this task is not yet in POSTPONED state. 3. This task's state is changed to POSTPONED ...and it is never awaken. Another race that is not treated: 0. This script locks this task. 1. This task's state is changed to POSTPONED. 2. Another task is done (one of the dependencies of this task), but cannot awake this task, because this task is still locked by this script. 3. This script finishes and unlocks this task. The second race could be treated by using lockf -x in gb-y-awake-tasks rather than lockf -n. --- bin/girar-task-run | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/bin/girar-task-run b/bin/girar-task-run index 036486a..cac2c66 100755 --- a/bin/girar-task-run +++ b/bin/girar-task-run @@ -282,6 +282,30 @@ trap '' HUP INT QUIT PIPE TERM echo $try > task/try echo 1 > task/iter girar-task-change-state "$id" "$next_state" +if [ "$next_state" = POSTPONED ]; then + # The status of the dependencies might have changed since the last check. + # And if so, this task will remain in POSTPONED state forever, + # because it couldn't have been awaken since that last check, + # because it hasn't been in POSTPONED state yet. + # So, we must awake it ourselves. + if check_depends; then + [ -s task/depends ] || { + next_state=AWAITING + girar-task-change-state "$id" "$next_state" + } + else + next_state=FAILED + girar-task-change-state "$id" "$next_state" + girar-webapi-task update "$id" + false + fi + + # At this moment, it's possible that this task is in POSTPONED state + # and is still locked. And if at this moment gb-y-awake-tasks is triggered + # by some of its dependencies, that will have no effect (it uses lockf -n). + # Hence, this task will remain in POSTPONED state forever. This is bad. + # Consider using lockf -x in gb-y-awake-tasks to avoid such situation. +fi girar-webapi-task update "$id" echo >&2 "task #$id: try #$try is $next_state, result will be emailed to $owner@$EMAIL_DOMAIN" -- 2.21.0 I assume this is fixed by commit b27183856ce6b7c32d463867c7ca82a7524809d0. |