Created: 5 years, 2 months ago by hanwenn
Modified: 5 years, 1 month ago
CC: lilypond-devel@gnu.org
Visibility: Public.
Description

Add a cooperative FS lock to lilypond-book.

This simplifies the build infrastructure, because it obviates the Makefile hacks that force a single lilypond-book process during the build.
Patch Set 1
Total comments: 6

Patch Set 2: fcntl
Patch Set 3: timing test
Patch Set 4: spaces
Patch Set 5: eager checksums
Patch Set 6: harden
Patch Set 7: lockfilename
Patch Set 8: rebase

Messages

Total messages: 36
The current change leaves a few questions unanswered: What should lilypond-book do if there happens to be an old .lock file around? Right now, it just sits there and does nothing, which is not obvious to the user. Also, what's the benefit of doing this? Is it worth doing in terms of runtime?

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book.py
File scripts/lilypond-book.py (right):

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book...
scripts/lilypond-book.py:458: lock_file = os.path.join(options.lily_output_dir + ".lock")
At first glance I thought this was wrong and should have two arguments to join (otherwise the call is useless). After seeing that you only mkdir lily_output_dir below, maybe you want a file $(basename lily_output_dir).lock in the parent directory? That's a bold assumption that there is no / at the end...

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book...
scripts/lilypond-book.py:460: while 1:
while True

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book...
scripts/lilypond-book.py:477: os.close(lockfd)
You should close the file before removing it.
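For context on the os.path.join remark: with a single argument, os.path.join simply returns its argument, so the line as written just appends ".lock" to the directory name. A minimal sketch, using a hypothetical directory name for illustration (POSIX path separators assumed):

  import os.path

  lily_output_dir = "out/lybook-db"  # hypothetical value, for illustration only

  # A single-argument join simply returns its argument, so this is plain string
  # concatenation: the lock file becomes a sibling of the output directory.
  lock_file = os.path.join(lily_output_dir + ".lock")
  assert lock_file == "out/lybook-db.lock"

  # A two-argument join would instead place the lock file inside the directory.
  assert os.path.join(lily_output_dir, ".lock") == "out/lybook-db/.lock"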
I think this is worth it because it simplifies the build system, and puts the locking in the place where we actually access the resource.

I take your point about dropped files; the best would be fcntl locks, but I am worried that they might not be supported on Windows. Would you know?

Maybe we can just use fcntl locks on Unix, and Windows users should just not try to run parallel lp-book invocations.

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book.py
File scripts/lilypond-book.py (right):

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book...
scripts/lilypond-book.py:458: lock_file = os.path.join(options.lily_output_dir + ".lock")
On 2020/02/23 15:18:26, hahnjo wrote:
> At first glance I thought this was wrong and should have two arguments to join
> (otherwise the call is useless). After seeing that you only mkdir
> lily_output_dir below, maybe you want a file $(basename lily_output_dir).lock in
> the parent directory? That's a bold assumption that there is no / at the end...
leftover; removed.

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book...
scripts/lilypond-book.py:460: while 1:
On 2020/02/23 15:18:26, hahnjo wrote:
> while True
Done.

https://codereview.appspot.com/555360043/diff/551490043/scripts/lilypond-book...
scripts/lilypond-book.py:477: os.close(lockfd)
On 2020/02/23 15:18:26, hahnjo wrote:
> You should close the file before removing it.
On the contrary: we don't have to close it at all (on exit, all files are closed automatically). If we close first, there is a larger chance of leaving the lock file hanging around.
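To illustrate the lock-file scheme being discussed here, a rough sketch only (not the code from the patch; the helper name and polling interval are assumptions), including the remove-before-close ordering argued for above:

  import errno
  import os
  import time

  def with_lock_file(lock_file, work):
      """Cooperative serialization via an exclusively created lock file."""
      while True:
          try:
              # O_EXCL makes creation fail if another process holds the lock.
              lockfd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
              break
          except OSError as e:
              if e.errno != errno.EEXIST:
                  raise
              time.sleep(0.25)  # another lilypond-book is running; poll again
      try:
          work()
      finally:
          # Remove first, then close: if we die between the two calls, the lock
          # file is already gone, and the descriptor is closed on exit anyway.
          os.remove(lock_file)
          os.close(lockfd)

The weakness discussed above remains: if the process is killed before the cleanup runs, the stale lock file blocks later invocations, which is what motivates the fcntl-based variant in the later patch sets.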
On 2020/02/23 15:54:54, hanwenn wrote:
> I think this is worth it because it simplifies the build system, and puts the
> locking in the place where we actually access the resource.

Let me disagree: It complicates lilypond-book with something that no (external) user of the script cares about. So IMHO adding brittle locking requires more justification than that.

> I take your point about dropped files; the best would be fcntl locks, but I am
> worried that they might not be supported on Windows. Would you know?
>
> Maybe we can just use fcntl locks on Unix, and Windows users should just not try
> to run parallel lp-book invocations.

Can we please first take a step back and see how much benefit there actually is?
fcntl
On 2020/02/23 15:54:54, hanwenn wrote:
> I think this is worth it because it simplifies the build system, and puts the
> locking in the place where we actually access the resource.

Is there any indication that letting Make run multiple instances of lilypond-book with every instance except one at a time locking up is going to be a net win for performance?

I still don't see what this is supposed to buy us over using CPU_COUNT for invoking parallel LilyPond instances. In particular since the parallel LilyPond instances are forked off at a time when LilyPond has completed its startup, and in the context of the current Guile-v2 integration, startup times are relevant. Even though considering the number of files processed in one LilyPond process, their overall impact should still be comparatively confined.
timing test
On 2020/02/23 16:05:08, dak wrote:
> On 2020/02/23 15:54:54, hanwenn wrote:
> > I think this is worth it because it simplifies the build system, and puts the
> > locking in the place where we actually access the resource.
>
> Is there any indication that letting Make run multiple instances of lilypond-book
> with every instance except one at a time locking up is going to be a net win for performance?

input/regression/lilypond-book:

  rm -rf out-tst; time make out=tst local-test -j4 CPU_COUNT=4

before: real 1m16.588s
after:  real 0m25.224s

> I still don't see what this is supposed to buy us over using CPU_COUNT for
> invoking parallel LilyPond instances. In particular since the parallel LilyPond
> instances are forked off at a time when LilyPond has completed its startup, and
> in the context of the current Guile-v2 integration, startup times are relevant.
> Even though considering the number of files processed in one LilyPond process,
> their overall impact should still be comparatively confined.

The problem is that several other things are serialized in the build because of lilypond-book.

I used fcntl locking, which is impervious to stale locks (exiting a process drops locks automatically).
On 2020/02/23 16:23:34, hanwenn wrote:
> On 2020/02/23 16:05:08, dak wrote:
> > On 2020/02/23 15:54:54, hanwenn wrote:
> > > I think this is worth it because it simplifies the build system, and puts the
> > > locking in the place where we actually access the resource.
> >
> > Is there any indication that letting Make run multiple instances of lilypond-book
> > with every instance except one at a time locking up is going to be a net win for performance?
>
> input/regression/lilypond-book:
>
>   rm -rf out-tst; time make out=tst local-test -j4 CPU_COUNT=4
>
> before: real 1m16.588s
> after:  real 0m25.224s

So the idea is not as much to run parallel instances of lilypond-book but rather to let lilypond-book itself do the serialization.

The net result will be that Make counts lilypond-book's use of 4 CPUs as just a single CPU, so unless the parallel makes run into a locking instance of lilypond-book, this will now result in a maximum of 7 jobs in parallel, right?
spaces
On 2020/02/23 16:29:20, dak wrote:
> On 2020/02/23 16:23:34, hanwenn wrote:
> > On 2020/02/23 16:05:08, dak wrote:
> > > On 2020/02/23 15:54:54, hanwenn wrote:
> > > > I think this is worth it because it simplifies the build system, and puts the
> > > > locking in the place where we actually access the resource.
> > >
> > > Is there any indication that letting Make run multiple instances of lilypond-book
> > > with every instance except one at a time locking up is going to be a net win for performance?
> >
> > input/regression/lilypond-book:
> >
> >   rm -rf out-tst; time make out=tst local-test -j4 CPU_COUNT=4
> >
> > before: real 1m16.588s
> > after:  real 0m25.224s
>
> So the idea is not as much to run parallel instances of lilypond-book but rather
> to let lilypond-book itself do the serialization.
>
> The net result will be that Make counts lilypond-book's use of 4 CPUs as just a
> single CPU, so unless the parallel makes run into a locking instance of
> lilypond-book, this will now result in a maximum of 7 jobs in parallel, right?

Correct.

I have another separate plan, which is to do

  cat $(find . -name '*.itely' -or -name '*.tely' | grep -Ev '(out|out-www)') > concat.itely

and then have a special rule run that through lp-book as a whole. That should also get rid of a bunch of inefficiencies.
eager checksums
On 2020/02/23 15:59:14, hahnjo wrote:
> On 2020/02/23 15:54:54, hanwenn wrote:
> > I think this is worth it because it simplifies the build system, and puts the
> > locking in the place where we actually access the resource.
>
> Let me disagree: It complicates lilypond-book with something that no (external)
> user of the script cares about. So IMHO adding brittle locking requires more
> justification than that.
>
> > I take your point about dropped files; the best would be fcntl locks, but I am
> > worried that they might not be supported on Windows. Would you know?
> >
> > Maybe we can just use fcntl locks on Unix, and Windows users should just not try
> > to run parallel lp-book invocations.
>
> Can we please first take a step back and see how much benefit there actually is?

To be fair, the current situation is that _anybody_ should just not try to run parallel lp-book invocations, whether from our build system, started manually from different shells with the same database, or in any other manner.

The lilybook database is quite a big hack with its main purpose being speeding up our doc build. I am not quite sure whether normal lilypond-book invocations would even use it. If they do, the lock might be separately useful to what is going on in our build process.
harden
Jonas, did you want to have another look?
On 2020/02/25 08:09:21, hanwenn wrote:
> Jonas, did you want to have another look?

Yes, hopefully later today, no guarantee though.
So I can see a consistent improvement by ~40s for 'make -j4 CPU_COUNT=4 test', going down from ~4m to 3m30s. The patch at https://codereview.appspot.com/547680043 explains that this is due to parallelism in input/regression/lilypond-book/. I see no influence on 'make -j4 CPU_COUNT=4 doc', staying flat at around 29m on my laptop.

If only looking at input/regression/lilypond-book/, can't we just use different 'lily_output_dir's for each target? That should still allow us to run in parallel without the locking solution proposed in this patch. Correct me if I'm wrong: This should have no negative influence since files from input/regression/lilypond-book/ are not reused in other parts of the tests / documentation.

Another solution might be to serialize only lilypond-book and let tex et al. run concurrently. That should also be harmless, right?

In total I'm still not convinced by this complexity.
On Tue, Feb 25, 2020 at 11:09 PM <jonas.hahnfeld@gmail.com> wrote:
> Another solution might be to serialize only lilypond-book and let tex et
> al. run concurrently. That should also be harmless, right?

But this is exactly what this patch does.

I don't understand your objection. Serializing mechanisms in the makefile are obscure and hard to understand, because build systems want to do as many things in parallel as possible.

A lock (a file lock, in this case) is the standard solution for serializing concurrent access to a shared resource (a standard problem). What is your objection against using a standard solution?

On a philosophical level, it is a lilypond-book implementation detail that it can't deal with concurrent invocation, so the remediation for this problem should be in lilypond-book too.

> In total I'm still not convinced by this complexity.
>
> https://codereview.appspot.com/555360043/

--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
On 2020/02/26 07:59:36, hanwenn wrote:
> On Tue, Feb 25, 2020 at 11:09 PM <jonas.hahnfeld@gmail.com> wrote:
> > Another solution might be to serialize only lilypond-book and let tex et
> > al. run concurrently. That should also be harmless, right?
>
> But this is exactly what this patch does.

I meant "serialize only lilypond-book in the Makefile [...]", sorry for not being specific. I agree that this patch attempts to go this way in lilypond-book, and that's what I object to, see below.

> I don't understand your objection. Serializing mechanisms in the
> makefile are obscure and hard to understand, because build systems
> want to do as many things in parallel as possible.

... so it's the build system's responsibility to get things right. In our case this means: Do *not* call lilypond-book in parallel.

> A lock (a file lock, in this case) is the standard solution for
> serializing concurrent access to a shared resource (a standard
> problem). What is your objection against using a standard solution?

Yes, locks are a standard solution, but file locks are brittle. I've seen them fail far too often (ever had your apt-get / yum / pacman error out because there was a lock-file?) so I object to adding this complexity if it only helps for a single case in our build (i.e. input/regression/lilypond-book/).

> On a philosophical level, it is a lilypond-book implementation detail
> that it can't deal with concurrent invocation, so the remediation for
> this problem should be in lilypond-book too.

Let me disagree: It's an implementation detail of make that it runs things in parallel. IMHO a build system should ensure that the result of running with multiple jobs is the same as a sequential run.
On 2020/02/26 08:19:39, hahnjo wrote:
> On 2020/02/26 07:59:36, hanwenn wrote:
> > On Tue, Feb 25, 2020 at 11:09 PM <jonas.hahnfeld@gmail.com> wrote:
> > > Another solution might be to serialize only lilypond-book and let tex et
> > > al. run concurrently. That should also be harmless, right?
> >
> > But this is exactly what this patch does.
>
> I meant "serialize only lilypond-book in the Makefile [...]", sorry for not
> being specific. I agree that this patch attempts to go this way in
> lilypond-book, and that's what I object to, see below.
>
> > I don't understand your objection. Serializing mechanisms in the
> > makefile are obscure and hard to understand, because build systems
> > want to do as many things in parallel as possible.
>
> ... so it's the build system's responsibility to get things right. In our case
> this means: Do *not* call lilypond-book in parallel.
>
> > A lock (a file lock, in this case) is the standard solution for
> > serializing concurrent access to a shared resource (a standard
> > problem). What is your objection against using a standard solution?
>
> Yes, locks are a standard solution, but file locks are brittle. I've seen them
> fail far too often (ever had your apt-get / yum / pacman error out because there
> was a lock-file?) so I object to adding this complexity if it only helps for a
> single case in our build (i.e. input/regression/lilypond-book/).
>
> > On a philosophical level, it is a lilypond-book implementation detail
> > that it can't deal with concurrent invocation, so the remediation for
> > this problem should be in lilypond-book too.
>
> Let me disagree: It's an implementation detail of make that it runs things in
> parallel. IMHO a build system should ensure that the result of running with
> multiple jobs is the same as a sequential run.

That said: I'm also fine if some other developer accepts this patch. See my timing data above to get to your own conclusion. After all, my opinion is just one of a larger range.
On Wed, Feb 26, 2020 at 9:19 AM <jonas.hahnfeld@gmail.com> wrote:
> > A lock (a file lock, in this case) is the standard solution for
> > serializing concurrent access to a shared resource (a standard
> > problem). What is your objection against using a standard solution?
>
> Yes, locks are a standard solution, but file locks are brittle. I've
> seen them fail far too often (ever had your apt-get / yum / pacman error
> out because there was a lock-file?) so I object to adding this

No, not of late.

It's useful to distinguish between "file locks" and "lock files". The latter are a form of the former, but they rely on the locking process to remove the lock file if the process aborts. Git uses these files pervasively, the reason being that this is the only way to make locking work on NFS. Maybe you've seen problems with Git?

The fcntl locks used here are managed by the kernel. If the process holding the lock dies, the lock is freed. So there is no staleness (but they don't work on NFS). I challenge you to come up with a mechanism where one can observe brittle behavior.

In this patch, we create a "xxx.lock" file, which is a little ugly. Let me see if we can lock the directory directly.

> > On a philosophical level, it is a lilypond-book implementation detail
> > that it can't deal with concurrent invocation, so the remediation for
> > this problem should be in lilypond-book too.
>
> Let me disagree: It's an implementation detail of make that it runs
> things in parallel. IMHO a build system should ensure that the result of
> running with multiple jobs is the same as a sequential run.
>
> https://codereview.appspot.com/555360043/

--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
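To make the distinction concrete, here is a minimal sketch of a kernel-managed fcntl lock in Python (not the actual patch; the helper name and file mode are assumptions):

  import fcntl
  import os

  def run_exclusively(lock_path, work):
      """Serialize concurrent invocations via a kernel-managed fcntl lock."""
      # The descriptor must stay open for as long as the lock is to be held.
      fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY, 0o666)
      try:
          # Blocks until any other holder releases the lock; if the holder
          # dies, the kernel drops the lock, so it can never go stale.
          fcntl.lockf(fd, fcntl.LOCK_EX)
          work()
      finally:
          os.close(fd)  # closing the descriptor also releases the lock

With this scheme the lock file itself is never removed; it is only the object the kernel attaches the lock to, so a leftover "xxx.lock" on disk is harmless, unlike with the lock-file scheme sketched earlier in the thread.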
lockfilename
On Wed, Feb 26, 2020 at 9:59 AM Han-Wen Nienhuys <hanwenn@gmail.com> wrote:
> In this patch, we create a "xxx.lock" file, which is a little ugly.
> Let me see if we can lock the directory directly.

You can't (it has to be a file). See https://gavv.github.io/articles/file-locks/#common-features for more background.

--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
On 2020/02/26 08:28:33, hahnjo wrote:
> On 2020/02/26 08:19:39, hahnjo wrote:
> > > On a philosophical level, it is a lilypond-book implementation detail
> > > that it can't deal with concurrent invocation, so the remediation for
> > > this problem should be in lilypond-book too.
> >
> > Let me disagree: It's an implementation detail of make that it runs things in
> > parallel. IMHO a build system should ensure that the result of running with
> > multiple jobs is the same as a sequential run.
>
> That said: I'm also fine if some other developer accepts this patch. See my
> timing data above to get to your own conclusion. After all, my opinion is just
> one of a larger range.

My take on this is that this "implementation detail" of parallel invocation resulting in awkward breakage is something that warrants fixing irrespective of our build system. All that the UG states here is

  ‘--lily-output-dir=DIR’
      Write lily-XXX files to directory DIR, link into ‘--output’
      directory. Use this option to save building time for documents in
      different directories which share a lot of identical snippets.

It doesn't state at all what happens in cases of contention. Fixing contention with a lock is a brute-force solution that just doesn't allow for parallelism, but it is a solution to the contention problem.

It is not a solution to lilypond-book starting more jobs than Make knows about. Or to all but one lilypond-book invocation not making any progress and blocking Make, which could instead start other actual single-process tasks. So I see this patch and its approach as an improvement to lilypond-book. I don't see that it solves the parallel build carnage: it just scales down the impact from having to choose between complete serialization and database failure.
On 2020/02/26 11:59:14, dak wrote:
> On 2020/02/26 08:28:33, hahnjo wrote:
> > On 2020/02/26 08:19:39, hahnjo wrote:
> > > > On a philosophical level, it is a lilypond-book implementation detail
> > > > that it can't deal with concurrent invocation, so the remediation for
> > > > this problem should be in lilypond-book too.
> > >
> > > Let me disagree: It's an implementation detail of make that it runs things in
> > > parallel. IMHO a build system should ensure that the result of running with
> > > multiple jobs is the same as a sequential run.
> >
> > That said: I'm also fine if some other developer accepts this patch. See my
> > timing data above to get to your own conclusion. After all, my opinion is just
> > one of a larger range.
>
> My take on this is that this "implementation detail" of parallel invocation
> resulting in awkward breakage is something that warrants fixing irrespective of
> our build system. All that the UG states here is
>
>   ‘--lily-output-dir=DIR’
>       Write lily-XXX files to directory DIR, link into ‘--output’
>       directory. Use this option to save building time for documents in
>       different directories which share a lot of identical snippets.
>
> It doesn't state at all what happens in cases of contention. Fixing contention
> with a lock is a brute-force solution that just doesn't allow for parallelism,
> but it is a solution to the contention problem.
>
> It is not a solution to lilypond-book starting more jobs than Make knows about.
> Or to all but one lilypond-book invocation not making any progress and blocking
> Make, which could instead start other actual single-process tasks. So I see this
> patch and its approach as an improvement to lilypond-book. I don't see that it
> solves the parallel build carnage: it just scales down the impact from having to
> choose between complete serialization and database failure.

David, I think you are saying this patch is LGTM - could you be explicit, so James understands what is going on?
On 2020/02/28 17:57:06, hanwenn wrote:
> On 2020/02/26 11:59:14, dak wrote:
> > It doesn't state at all what happens in cases of contention. Fixing contention
> > with a lock is a brute-force solution that just doesn't allow for parallelism,
> > but it is a solution to the contention problem.
> >
> > It is not a solution to lilypond-book starting more jobs than Make knows about.
> > Or to all but one lilypond-book invocation not making any progress and blocking
> > Make, which could instead start other actual single-process tasks. So I see this
> > patch and its approach as an improvement to lilypond-book. I don't see that it
> > solves the parallel build carnage: it just scales down the impact from having to
> > choose between complete serialization and database failure.
>
> David, I think you are saying this patch is LGTM - could you be explicit, so
> James understands what is going on?

I think this patch is an improvement over the status quo. It's sort of a crutch that works only on some systems and not on NFS, as far as I understand. And it doesn't actually work well as a job control measure in connection with parallel Make. But it does improve lilypond-book behavior on some systems. I think that a restricted form of locking is better than nothing.

I am incidentally not sure just what kind of file systems minimal VMs without a file system of their own work with: if they get an NFS view, this would not even work with Lilydev, which would be bad. But I don't know how VMs do file systems without a partition of their own.
rebase
On 2020/02/28 18:14:14, dak wrote:
> On 2020/02/28 17:57:06, hanwenn wrote:
> > On 2020/02/26 11:59:14, dak wrote:
> > > It doesn't state at all what happens in cases of contention. Fixing contention
> > > with a lock is a brute-force solution that just doesn't allow for parallelism,
> > > but it is a solution to the contention problem.
> > >
> > > It is not a solution to lilypond-book starting more jobs than Make knows about.
> > > Or to all but one lilypond-book invocation not making any progress and blocking
> > > Make, which could instead start other actual single-process tasks. So I see this
> > > patch and its approach as an improvement to lilypond-book. I don't see that it
> > > solves the parallel build carnage: it just scales down the impact from having to
> > > choose between complete serialization and database failure.
> >
> > David, I think you are saying this patch is LGTM - could you be explicit, so
> > James understands what is going on?
>
> I think this patch is an improvement over the status quo. It's sort of a crutch
> that works only on some systems and not on NFS, as far as I understand. And it
> doesn't actually work well as a job control measure in connection with parallel
> Make. But it does improve lilypond-book behavior on some systems. I think that
> a restricted form of locking is better than nothing.
>
> I am incidentally not sure just what kind of file systems minimal VMs without a
> file system of their own work with: if they get an NFS view, this would not even
> work with Lilydev, which would be bad. But I don't know how VMs do file systems
> without a partition of their own.

Sigh. I just noticed that, as opposed to the patch title, this does not just introduce a file lock for lilypond-book but _also_ changes the build system such that now almost double the number of allocated jobs get used. It would be good if different topics weren't conflated into single issues, so that it's easier to discuss what one is actually dealing with and make decisions based on the respective merits of the individual parts.

"It doesn't actually work well as a job control measure in connection with parallel Make" should likely have been an indicator of what I thought I was talking about.
On Fri, Mar 6, 2020 at 11:18 PM <dak@gnu.org> wrote:
>
> Sigh. I just noticed that, as opposed to the patch title, this does not
> just introduce a file lock for lilypond-book but _also_ changes the
> build system such that now almost double the number of allocated jobs
> get used. It would be good if different topics weren't conflated into
> single issues, so that it's easier to discuss what one is actually
> dealing with and make decisions based on the respective merits of the
> individual parts.
>
> "It doesn't actually work well as a job control measure in connection
> with parallel Make" should likely have been an indicator of what I
> thought I was talking about.

Can you tell me what problem you are currently experiencing?

--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
Han-Wen Nienhuys <hanwenn@gmail.com> writes:

> On Fri, Mar 6, 2020 at 11:18 PM <dak@gnu.org> wrote:
>>
>> Sigh. I just noticed that, as opposed to the patch title, this does not
>> just introduce a file lock for lilypond-book but _also_ changes the
>> build system such that now almost double the number of allocated jobs
>> get used. It would be good if different topics weren't conflated into
>> single issues, so that it's easier to discuss what one is actually
>> dealing with and make decisions based on the respective merits of the
>> individual parts.
>>
>> "It doesn't actually work well as a job control measure in connection
>> with parallel Make" should likely have been an indicator of what I
>> thought I was talking about.
>
> Can you tell me what problem you are currently experiencing?

Harm has a system with memory pressure. That means that he so far has only been able to work with

  CPU_COUNT=2 make -j2 doc

Since now lilypond-doc is no longer serialised, he'd need to reduce to

  CPU_COUNT=1 make -j2 doc

or

  CPU_COUNT=2 make -j1 doc

to get similar memory utilisation, for a considerable loss in performance. I've taken a look at Make's jobserver implementation and it is pretty straightforward.

The real solution would, of course, be to make lilypond-book, with its directory-based database, not lock other instances of lilypond-book but take over their job load. However, the current interaction of lilypond-book is giving the whole work to lilypond, which splits into n copies with a fixed work load. To make that work, one would rather have one "job server" of LilyPond itself which does all the initialisation work and then waits for job requests. Upon receiving them, it forks off copies working on them.

Working with freshly forked copies would have the advantage of having reproducible stats not depending on the exact work distribution, and the disadvantage of things like typical font loading and symbol memoization in frequent code paths happening in each copy. On the other hand, the question of "gc between files?" would not be an issue, since one would just throw the current state of memory away. One would probably want fresh forks for regtests because of the stats and reproducibility, and would accept continuous forks for documentation building (I assume that continuous forks, by which I mean one instance of LilyPond processing several files in sequence like we do now, would be faster in the long run, but probably not all that much).

I previously thought of trying to pin down the job distribution of regtests upon make test-baseline so that only new regtests (rather than the preexisting ones) would get distributed arbitrarily on make check, but starting with fresh forks seems like a much better deal for reproducibility. Of course, that's all for the long haul.

To get back to your question: the consequences are worst when the job count is constrained due to memory pressure. My laptop has uncommonly large memory for its overall age and power, so I am not hit worst. The rough doubling of jobs does not cause me to run into swap space.

--
David Kastrup
On 2020/03/07 12:39:31, dak wrote:
> Han-Wen Nienhuys <hanwenn@gmail.com> writes:
>
> > On Fri, Mar 6, 2020 at 11:18 PM <dak@gnu.org> wrote:
> >>
> >> Sigh. I just noticed that, as opposed to the patch title, this does not
> >> just introduce a file lock for lilypond-book but _also_ changes the
> >> build system such that now almost double the number of allocated jobs
> >> get used. It would be good if different topics weren't conflated into
> >> single issues, so that it's easier to discuss what one is actually
> >> dealing with and make decisions based on the respective merits of the
> >> individual parts.
> >>
> >> "It doesn't actually work well as a job control measure in connection
> >> with parallel Make" should likely have been an indicator of what I
> >> thought I was talking about.
> >
> > Can you tell me what problem you are currently experiencing?
>
> Harm has a system with memory pressure. That means that he so far has
> only been able to work with
>
>   CPU_COUNT=2 make -j2 doc

Well,

  CPU_COUNT=3 make -j3 doc

is mostly no problem.

> Since now lilypond-doc is no longer serialised, he'd need to reduce to
>
>   CPU_COUNT=1 make -j2 doc
>
> or
>
>   CPU_COUNT=2 make -j1 doc

Let me check: putting this patch on top of current master, right? With guile-1 or guile-2?

I've little time atm, thus I'm not sure when I'm able to start testing...
thomasmorley65@gmail.com writes:

> On 2020/03/07 12:39:31, dak wrote:
>>
>> Harm has a system with memory pressure. That means that he so far has
>> only been able to work with
>>
>>   CPU_COUNT=2 make -j2 doc
>
> Well,
>
>   CPU_COUNT=3 make -j3 doc
>
> is mostly no problem.

Ok.

>> Since now lilypond-doc is no longer serialised, he'd need to reduce to
>>
>>   CPU_COUNT=1 make -j2 doc
>>
>> or
>>
>>   CPU_COUNT=2 make -j1 doc
>
> Let me check: putting this patch on top of current master, right?
> With guile-1 or guile-2?

I'd use Guile-1, for the reason that it runs faster, eats less memory, and is more repeatable by virtue of not crashing.

> I've little time atm, thus I'm not sure when I'm able to start
> testing...

The way this works is that running lilypond-book in one directory blocks running lilypond-book in other directories, but nothing else. So you can end up, using CPU_COUNT=3 make -j3, with one job of lilypond-book that starts up 3 copies of LilyPond for large workloads, as well as with 3 jobs in other directories. Once those jobs actually run into lilypond-book, they are stalled within lilypond-book without starting LilyPond processes until the first lilypond-book has finished.

So the worst memory use is when one copy of lilypond-book has finished with its LilyPond part and starts the EPS and PDF processing, another copy of lilypond-book takes over and starts its LilyPond processes, and something else happens in another directory.

--
David Kastrup
On Sat, Mar 7, 2020 at 4:30 PM David Kastrup <dak@gnu.org> wrote:
> that starts up 3 copies of LilyPond for large workloads, as well as with
> 3 jobs in other directories.

Can you point me to places within the build system where that would happen? AFAIK, our build is only parallel per directory, see here:

https://github.com/lilypond/lilypond/blob/825dd87d0b1b58e56d7c66ef1fc1dd672d9...

i.e. we use "&&" to serialize make commands that enter different directories.

--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
On Sat, Mar 7, 2020 at 1:39 PM David Kastrup <dak@gnu.org> wrote:
> >> "It doesn't actually work well as a job control measure in connection
> >> with parallel Make" should likely have been an indicator of what I
> >> thought I was talking about.
> >
> > Can you tell me what problem you are currently experiencing?
>
> Harm has a system with memory pressure. That means that he so far has
> only been able to work with
>
>   CPU_COUNT=2 make -j2 doc
>
> Since now lilypond-doc is no longer serialised, he'd need to reduce to
> to get similar memory utilisation, for a considerable loss in
> performance. I've taken a look at Make's jobserver implementation and
> it is pretty straightforward. The real solution would, of course, be to
> make lilypond-book, with its directory-based database, not lock other
> instances of lilypond-book but take over their job load. However, the
> current interaction of lilypond-book is giving the whole work to
> lilypond, which splits into n copies with a fixed work load.

That's considerable extra complexity, and it wouldn't work for folks that are using lilypond-book for actual work, i.e. without a make jobserver.

Harm, what kind of machine is this? I should note that lilypond takes up to 600M of memory during the regtest, and I am pretty sure the rest of the jobs (tex, ghostscript) are peanuts compared to that (because jobs like TeX and GS process things page-by-page). This means that 1G was too little before, and 2G should be ample, so I am somewhat skeptical of your diagnosis. A 1G so-dimm (used) costs 3 EUR these days. I don't think it makes economic sense to spend time optimizing for this case.

> To get back to your question: the consequences are worst when the job
> count is constrained due to memory pressure. My laptop has uncommonly
> large memory for its overall age and power, so I am not hit worst. The
> rough doubling of jobs does not cause me to run into swap space.

I think something is off with the heap use (on GUILE 1.8 at least). We can do the Carver score (which is 100 pages) in 900M heap easily. The 600M number sounds too high, especially given the fact that the snippets are generally tiny fragments of music.

--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
On Sun, Mar 8, 2020 at 10:35 AM Han-Wen Nienhuys <hanwenn@gmail.com> wrote:
> > To get back to your question: the consequences are worst when the job
> > count is constrained due to memory pressure. My laptop has uncommonly
> > large memory for its overall age and power, so I am not hit worst. The
> > rough doubling of jobs does not cause me to run into swap space.
>
> I think something is off with the heap use (on GUILE 1.8 at least). We
> can do the Carver score (which is 100 pages) in 900M heap easily. The
> 600M number sounds too high, especially given the fact that the
> snippets are generally tiny fragments of music.

GUILE 1.8:

  $ /usr/bin/time -v lilypond input/regression/mozart-hrn-3.ly
  ..
  Maximum resident set size (kbytes): 352280

GUILE 2.2:

  $ /usr/bin/time -v lilypond input/regression/mozart-hrn-3.ly
  ..
  Maximum resident set size (kbytes): 157904

I take some blame for this, because I wrote the heap stretching strategy for GUILE 1.8.

--
Han-Wen Nienhuys - hanwenn@gmail.com - http://www.xs4all.nl/~hanwen
On 2020/03/07 15:30:33, dak wrote:
> thomasmorley65@gmail.com writes:
>
> > On 2020/03/07 12:39:31, dak wrote:
> >>
> >> Harm has a system with memory pressure. That means that he so far has
> >> only been able to work with
> >>
> >>   CPU_COUNT=2 make -j2 doc
> >
> > Well,
> >
> >   CPU_COUNT=3 make -j3 doc
> >
> > is mostly no problem.
>
> Ok.
>
> >> Since now lilypond-doc is no longer serialised, he'd need to reduce to
> >>
> >>   CPU_COUNT=1 make -j2 doc
> >>
> >> or
> >>
> >>   CPU_COUNT=2 make -j1 doc
> >
> > Let me check: putting this patch on top of current master, right?
> > With guile-1 or guile-2?
>
> I'd use Guile-1, for the reason that it runs faster, eats less memory,
> and is more repeatable by virtue of not crashing.
>
> > I've little time atm, thus I'm not sure when I'm able to start
> > testing...
>
> The way this works is that running lilypond-book in one directory
> blocks running lilypond-book in other directories, but nothing else. So
> you can end up, using CPU_COUNT=3 make -j3, with one job of lilypond-book
> that starts up 3 copies of LilyPond for large workloads, as well as with
> 3 jobs in other directories. Once those jobs actually run into
> lilypond-book, they are stalled within lilypond-book without starting
> LilyPond processes until the first lilypond-book has finished.
>
> So the worst memory use is when one copy of lilypond-book has
> finished with its LilyPond part and starts the EPS and PDF processing,
> another copy of lilypond-book takes over and starts its LilyPond
> processes, and something else happens in another directory.
>
> --
> David Kastrup

I did some testing, both using guile-1.

(1) Checking out febe487bb45c97f97377536a5d15da80cce80297 "stepmake: use patsubst for finding build-dir". I.e. the current patch is in.
(2) The same checkout, with 7ab9c8fa4faff7a513d0ecfbc7eecf7efd2b8ea8 "Add a FS lock to lilypond-book" reverted.

In both cases I did:

  time CPU_COUNT=5 make -j5

and

  time CPU_COUNT=5 make -j5 doc

While running 'make' and 'make doc', I did some other work in firefox and jEdit. Usually I get problems (meaning a heavy slowdown of those other tools) if all cores are working (no surprise) and as soon as SWAP exceeds 600 MB. Though, with both tests I don't experience a big difference. SWAP goes up to 1.1 GB (partly up to 1.2 GB). Timing values are comparable.

I've got the impression (1) performs slightly better, judging from the usability of other tools (firefox, jEdit, etc.).

So from my part, no objection against this patch.
commit 7ab9c8fa4faff7a513d0ecfbc7eecf7efd2b8ea8
Author: Han-Wen Nienhuys <hanwen@lilypond.org>
Date:   Sun Mar 1 17:47:53 2020 +0100

    Add a FS lock to lilypond-book