This is because of the following race condition:
  A                              B
                               |
  lockfd = open(lockfile, ...) |
                               | unlink(lockfile)
  lockf(lockfd, F_LOCK, 0)     |
According to [1], to recover from an ESTALE error, an application must
close the file or directory where the error occurred, and reopen it so
the NFS client can resolve the pathname again and retrieve the new file
handle.
[1] https://nfs.sourceforge.net/#faq_a10
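The close-and-reopen recovery described above could look roughly like the
following. This is a minimal sketch, not the actual abuild-fetch code: the
helper name open_and_lock and the max_tries parameter are made up for
illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path` and take an exclusive lock on it.
 * On ESTALE, close the descriptor and reopen, so that the NFS client
 * resolves the pathname again. Returns the locked fd, or -1 on error. */
int open_and_lock(const char *path, int max_tries)
{
    assert(path != NULL);
    for (int i = 0; i < max_tries; i++) {
        int fd = open(path, O_WRONLY | O_CREAT, 0660);
        if (fd < 0) {
            if (errno == ESTALE)
                continue;       /* stale handle on open: try again */
            return -1;
        }
        if (lockf(fd, F_LOCK, 0) == 0)
            return fd;          /* lock held; closing fd releases it */

        int saved = errno;
        close(fd);              /* drop the stale descriptor */
        if (saved != ESTALE) {
            errno = saved;
            return -1;
        }
        /* the file was replaced under us: loop and reopen */
    }
    errno = ESTALE;
    return -1;
}
```

The retry limit keeps a persistently failing mount from looping forever;
whether a limit is wanted at all is a design choice of this sketch.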
Simplify locking by using lockf(3). It is POSIX compatible and should
work over NFS.
Fix a download race condition when:
1) host A creates the lockfile and acquires the lock to fetch from the
distfiles mirror
2) host B opens the lockfile and waits for the lock
3) host A gets a 404 from distfiles, releases the lock and deletes the
lockfile, which host B still has an open file handle for
4) host B gets the lock on the deleted file and downloads the file
5) host A retries the download and creates a new lockfile, but is not
blocked by host B, even though it should be
Solve this by releasing the lock, giving the other processes a chance
to acquire it (using sleep(0)), and then only deleting the lockfile if:
a) the download was successful (no 404) or b) no one else has a lock.
This reverts commit 281720ec39 (abuild-fetch: aquire a second lock
using flock(2))
fixes #10026
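The release step described above might be sketched as follows. This is an
illustration under assumptions, not the committed code: the function name
release_lock and its signature are hypothetical, and the sleep(0) simply
mirrors the wording of the commit message.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: release the lock on `fd`, give waiters a chance
 * to take it, and unlink `lockfile` only if the download succeeded or
 * nobody else holds the lock. Returns true if the lockfile was removed. */
bool release_lock(int fd, const char *lockfile, bool download_ok)
{
    assert(fd >= 0 && lockfile != NULL);
    bool removed = false;

    lockf(fd, F_ULOCK, 0);      /* release our lock */
    sleep(0);                   /* give a waiting process a chance */

    /* F_TLOCK fails (EACCES or EAGAIN) if another process holds the lock */
    bool nobody_else = (lockf(fd, F_TLOCK, 0) == 0);
    if (download_ok || nobody_else)
        removed = (unlink(lockfile) == 0);

    close(fd);                  /* also drops a lock re-taken by F_TLOCK */
    return removed;
}
```

When the download failed and another process already holds the lock, the
lockfile is left in place, so a retry by this host blocks on the same file
instead of creating a fresh one.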
Abuild-fetch uses curl (falling back to wget) to download files. They
are saved with a ".part" extension first, so that downloads can be
resumed if necessary. When the download is complete, the ".part"
extension gets removed. However, when the server does not support
resuming downloads (e.g. GitHub's on-the-fly generated tarballs), the
".part" extension got removed anyway. Abuild aborts in that case. But
when running a third time, the distfile exists and it is assumed to be
the full download.
Changes:
* abuild-fetch:
  * Only remove the ".part" extension when curl/wget exits with 0
  * Pass the exit code from curl/wget on as the exit code of
    abuild-fetch
  * Wherever abuild-fetch returns an exit code of its own, the codes
    have been changed to be > 200 (so they don't collide with curl's
    exit codes, of which there are 92 as of now)
  * Remove the undocumented feature of downloading multiple source
    URLs at a time. This didn't match the usage description, was not
    used in abuild at all, and would have made it impossible to pass
    on the exit code.
* abuild:
  * After downloading, when curl is installed and abuild-fetch exits
    with 33 (curl's HTTP range error), delete the partfile and try
    the download again.
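The exit-code handling listed above can be sketched in C as follows. abuild
itself is a shell script, so this only illustrates the logic: the function
names and the value 201 are hypothetical, and only the constant 33 (curl's
range error) comes from the text.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define CURL_HTTP_RANGE_ERROR 33    /* curl: the range request failed */
#define ERR_RENAME 201              /* illustrative own code, > 200 */

/* Hypothetical: drop the ".part" extension only when the downloader
 * exited with 0; otherwise pass its exit code through unchanged. */
int finish_download(int exit_code, const char *partfile, const char *dest)
{
    assert(partfile != NULL && dest != NULL);
    if (exit_code == 0 && rename(partfile, dest) != 0)
        return ERR_RENAME;
    return exit_code;
}

/* Hypothetical caller-side check: retry (after deleting the partfile)
 * only when curl is in use and it reported a range error. */
bool should_retry(int exit_code, bool curl_installed)
{
    return curl_installed && exit_code == CURL_HTTP_RANGE_ERROR;
}
```

Keeping abuild-fetch's own codes above 200 is what lets a caller tell
"curl failed with 33" apart from an internal error.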
flock(2) on an NFS mount is converted on the server side to a POSIX
lock (fcntl(F_SETLK)). This means that abuild running on the NFS server
and on a client will create different types of lock (flock vs. POSIX),
and both will try to download the same file at the same time.
We fix this by creating a small abuild-fetch application that will
create a POSIX lock which works with NFS.
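Taking such a POSIX lock directly with fcntl(2) might look like this. It is
a minimal sketch and not necessarily how abuild-fetch does it: the helper
name acquire_posix_lock and the file mode are assumptions.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: block until an exclusive POSIX (fcntl) lock on
 * the whole of `lockfile` is held. Returns the fd, or -1 on error. */
int acquire_posix_lock(const char *lockfile)
{
    assert(lockfile != NULL);
    int fd = open(lockfile, O_WRONLY | O_CREAT, 0660);
    if (fd < 0)
        return -1;

    struct flock fl = {
        .l_type = F_WRLCK,      /* exclusive (write) lock */
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,             /* 0 = up to end of file, i.e. whole file */
    };
    if (fcntl(fd, F_SETLKW, &fl) == -1) {   /* F_SETLKW waits for the lock */
        close(fd);
        return -1;
    }
    return fd;                  /* closing fd releases the lock */
}
```

Because the server and the clients then all use the same lock type, a
second fetcher waits in fcntl() rather than starting a parallel download.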