Bug 37561

Summary: a task is postponed forever although the other task (its dependency) was completed at the same moment (a race)
Product: Infrastructure Reporter: Ivan Zakharyaschev <imz>
Component: girarAssignee: Dmitry V. Levin <ldv>
Status: CLOSED FIXED QA Contact: Andrey Cherepanov <cas>
Severity: minor    
Priority: P3 CC: glebfm, ldv
Version: unspecified   
Hardware: all   
OS: Linux   

Description Ivan Zakharyaschev 2019-11-30 06:42:13 MSK
#241031 POSTPONED #2 p9 fonts-ttf-liberation.git=2.00.4-alt2

$ ssh girar task show 241031
id=241031 locked=no shared=no fail_early=yes test_only=no repo=p9 owner=imz state=POSTPONED try=2 iter=1 depends=241026 build_time=2019-Nov-19/18:53:52
 100:dir=/people/imz/packages/fonts-ttf-liberation.git
 100:tag_name=2.00.4-alt2
 100:tag_id=29d66c17a6d4f4058223097fd63fd18aac93b529
 100:tag_author=Ivan Zakharyaschev <imz@altlinux.org>
 100:fetched=2019-11-19T18:44:35
 100:userid=imz
 100:pkgname=fonts-ttf-liberation

The dependency was added after the first try. Then, a command to run the task a second time was given. Now, the second try is postponed.

The other task (the dependency, 241026) was completed successfully at the same moment:

[imz@team ~]$ stat /tasks/archive/done/_$(( 241026 / 1024 ))/241026/logs/commit.log
  File: /tasks/archive/done/_235/241026/logs/commit.log
  Size: 1474      	Blocks: 8          IO Block: 512    regular file
Device: 9ch/156d	Inode: 170657490   Links: 1
Access: (0644/-rw-r--r--)  Uid: (  502/  ldv_b1)   Gid: (  502/  ldv_b1)
Access: 2019-11-19 21:54:05.259894962 +0300
Modify: 2019-11-19 21:52:56.118469653 +0300
Change: 2019-11-19 21:54:05.259894962 +0300
 Birth: -
[imz@team ~]$ stat /tasks/archive/done/_$(( 241026 / 1024 ))/241026/logs/events.2.3.log
  File: /tasks/archive/done/_235/241026/logs/events.2.3.log
  Size: 3209      	Blocks: 8          IO Block: 512    regular file
Device: 9ch/156d	Inode: 170657499   Links: 1
Access: (0644/-rw-r--r--)  Uid: (  502/  ldv_b1)   Gid: (  502/  ldv_b1)
Access: 2019-11-19 21:54:05.259894962 +0300
Modify: 2019-11-19 21:53:29.572741179 +0300
Change: 2019-11-19 21:54:05.259894962 +0300
 Birth: -
[imz@team ~]$ tail /tasks/archive/done/_$(( 241026 / 1024 ))/241026/logs/events.2.3.log
2019-Nov-19 18:52:56 :: packages update OK
2019-Nov-19 18:53:00 :: [noarch] update OK
2019-Nov-19 18:53:00 :: repo update OK
2019-Nov-19 18:53:13 :: repo save OK
2019-Nov-19 18:53:13 :: src index update OK
2019-Nov-19 18:53:14 :: updated /gears/f/fonts-ttf-liberation.git branch `sisyphus'
2019-Nov-19 18:53:16 :: saved /srpms/f/fonts-ttf-liberation.git branch `sisyphus' as `old/sisyphus-task241026'
2019-Nov-19 18:53:16 :: removed /srpms/f/fonts-ttf-liberation.git branch `sisyphus'
2019-Nov-19 18:53:29 :: gears update OK
2019-Nov-19 18:53:29 :: task #241026 for sisyphus DONE
[imz@team ~]$ 

18:53:29 (DONE) is earlier than build_time=2019-Nov-19/18:53:52 of the postponed task.

[imz@team ~]$ stat /tasks/241031/logs/events.2.1.log 
  File: /tasks/241031/logs/events.2.1.log
  Size: 0         	Blocks: 0          IO Block: 512    regular empty file
Device: 1ah/26d	Inode: 47842378    Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  812/ UNKNOWN)   Gid: (   59/ UNKNOWN)
Access: 2019-11-19 21:53:52.610928162 +0300
Modify: 2019-11-19 21:53:52.610928162 +0300
Change: 2019-11-19 21:53:52.610928162 +0300
 Birth: -
[imz@team ~]$ tail /tasks/241031/logs/events.2.1.log 
[imz@team ~]$ 

It looks like when the pender was finishing the other task, it had yet no postponed task to notify. The check that the dependecy exists seems to have been done some time before, and then after the other task was done and the notifications were attempted, the dependent task was run (and made postponed according to the result of the early check).

Such situation could be avoided by introducing a parallel checking process (whether a dependecy has been completed) or adding such a check to awaiter, or the race could be avoided by using a read-lock in the "run" command (when it checks the status of the dependencies and changes the state to either AWAITING or POSTPONED) and a write-lock in the committer process ("pender") during, hmmm, some appropriate time.
Comment 1 Ivan Zakharyaschev 2019-12-04 16:32:44 MSK
I suggest another solution, almost without additional locking (for a more deterministic behavior, we should additionaly change lockf -n to lockf -x in gb-y-awake-tasks).

As for testing/debugging the new code in the "run" command in such a situation, I see a way to do this: copy girar-task-run on the server (so that it doesn't change for normal users), add two "breakpoints" to it in the copy: before the first

girar-task-change-state "$id" "$next_state

and at the end of the script, before the lock on the task is released. A breakpoint might be the shell command "read".

To test the 2 kinds of races, we create a new task with a dependency on an existing task. Then run it with the modified "run" command (with breakpoints). Then, before we continue at the breakpoint, complete the other task (the dependency), and then continue. Look at the result. The 2 breakpoints are for the 2 different kind of races.

From 4386f86b6762c8b8ff9f1b43c7dd1ab57453ab6d Mon Sep 17 00:00:00 2001
From: Ivan Zakharyaschev <imz@altlinux.org>
Date: Wed, 4 Dec 2019 05:11:12 +0300
Subject: [PATCH] girar-task-run: avoid one of the races leaving the task
 POSTPONED forever

The race that is treated by this change:

1. check_depends
2. Another task is done (one of the dependencies of this task), but
has nothing to awake yet, because this task is not yet in POSTPONED
state.
3. This task's state is changed to POSTPONED

...and it is never awaken.

Another race that is not treated:

0. This script locks this task.
1. This task's state is changed to POSTPONED.
2. Another task is done (one of the dependencies of this task), but
cannot awake this task, because this task is still locked by this script.
3. This script finishes and unlocks this task.

The second race could be treated by using lockf -x in gb-y-awake-tasks
rather than lockf -n.
---
 bin/girar-task-run | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/bin/girar-task-run b/bin/girar-task-run
index 036486a..cac2c66 100755
--- a/bin/girar-task-run
+++ b/bin/girar-task-run
@@ -282,6 +282,30 @@ trap '' HUP INT QUIT PIPE TERM
 echo $try > task/try
 echo 1 > task/iter
 girar-task-change-state "$id" "$next_state"
+if [ "$next_state" = POSTPONED ]; then
+	# The status of the dependencies might have changed since the last check.
+	# And if so, this task will remain in POSTPONED state forever,
+	# because it couldn't have been awaken since that last check,
+	# because it hasn't been in POSTPONED state yet.
+	# So, we must awake it ourselves.
+	if check_depends; then
+		[ -s task/depends ] || {
+			next_state=AWAITING
+			girar-task-change-state "$id" "$next_state"
+		}
+	else
+		next_state=FAILED
+		girar-task-change-state "$id" "$next_state"
+		girar-webapi-task update "$id"
+		false
+	fi
+
+	# At this moment, it's possible that this task is in POSTPONED state
+	# and is still locked. And if at this moment gb-y-awake-tasks is triggered
+	# by some of its dependencies, that will have no effect (it uses lockf -n).
+	# Hence, this task will remain in POSTPONED state forever. This is bad.
+	# Consider using lockf -x in gb-y-awake-tasks to avoid such situation.
+fi
 girar-webapi-task update "$id"
 
 echo >&2 "task #$id: try #$try is $next_state, result will be emailed to $owner@$EMAIL_DOMAIN"
-- 
2.21.0
Comment 2 Dmitry V. Levin 2020-10-30 19:06:14 MSK
I assume this is fixed by commit b27183856ce6b7c32d463867c7ca82a7524809d0.