Re: [devuan-dev] Luminously Unparalleled Repository Coalescer design doc

Author: onefang
Date:
To: devuan developers internal list
Subject: Re: [devuan-dev] Luminously Unparalleled Repository Coalescer design doc

Some of what I have done in apt-panopticon might be relevant, as it
consumes the files you are generating, if I understand what lurc is
doing. Amprolla replacement? It helps in this sort of doc to state up
front what the tool actually does; too much software leaves out that
fundamental first step in documentation. Your "Overview:" doesn't
actually say what lurc accomplishes.

Apt-panopticon monitors various details of Devuan's package mirror system,
digging deep to see if it can detect any problems, and keeping historical
data to help diagnose problems. So basically it consumes what amprolla
spits out, and what gets rsynced to the mirrors, to see if it can find
trouble. Big brother onefang is watching you!

On 2020-11-19 11:16:11, Ivan J. wrote:
> Hi!
>
> On Wed, Nov 18, 2020 at 12:45:42PM -0500, Mason Loring Bliss wrote:
> > lurc - the Luminously Unparalleled Repository Coalescer
> > Initial design document
> > ---------------------------------------------------------------------------
> > Overview:
> >
> > Single tool with a collection of single-shot functions, each invoked
> > separately and as separate processes, but with batched versions that invoke
> > copies of the tool with the correct arguments, in series. Goal: Be able to
> > (re-)run each individual piece independently.
> >
> > Individual operations will assert kernel advisory locks (via fcntl) to
> > guarantee coherency during processing. (Id est, no new pulls during a
> > merge, no new merges during a pull, with a configurable timeout.)
>
> In the current amprolla implementations, this locking is done wrong.
> Also "with a configurable timeout" sounds wrong. Instead, I would
> implement proper error handling and cleanup upon error.
>
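
For what it's worth, a non-blocking fcntl lock with cleanup on error could
look roughly like the Python below. It is only a sketch of the idea, not lurc
or amprolla code: the repo_lock and LockBusy names are made up, and it fails
fast when the lock is busy instead of waiting on a timeout.

```python
import errno
import fcntl
import os
from contextlib import contextmanager


class LockBusy(Exception):
    """Raised when another process already holds the lock (hypothetical name)."""


@contextmanager
def repo_lock(path):
    """Hold an exclusive fcntl advisory lock on `path` for the duration of a block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB: fail immediately instead of blocking on a busy lock.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as err:
            if err.errno in (errno.EACCES, errno.EAGAIN):
                raise LockBusy(path) from err
            raise
        yield fd
    finally:
        # Closing the descriptor releases the lock, even if the block raised.
        os.close(fd)
```

A pull and a merge would both wrap their work in "with repo_lock(...)" on the
same lock file under /var/spool/lurc (the file name being whatever lurc picks),
and whichever starts second gets LockBusy and can exit cleanly.
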
> > Will have a --force flag or similar to force re-merging of data sets in the
> > absence of new data, which will be flagged and recorded during download
> > attempts. Will also use --force to insist on redownloading evidently-
> > unchanged datasets. (Freshness trusts HTTP headers.)
>
> I recommend parsing Release files rather than trusting HTTP headers,
> because they will tell you their update time correctly, whereas httpd
> servers _may_ be misconfigured in some corner cases. The Release file
> contains a "Date" which you may use instead.

I agree with that. It's what apt-panopticon does to check if a mirror is
up to date.
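
A rough sketch of that check in Python. This is not lifted from
apt-panopticon, it just shows the idea of comparing the "Date" fields; the
function names and the slack parameter are made up for illustration.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime


def release_date(release_text):
    """Return the Date field of a Release file as an aware UTC datetime, or None."""
    for line in release_text.splitlines():
        if line.startswith("Date:"):
            stamp = parsedate_to_datetime(line.split(":", 1)[1].strip())
            if stamp.tzinfo is None:
                # Assumption: a Date without a zone is meant as UTC.
                stamp = stamp.replace(tzinfo=timezone.utc)
            return stamp.astimezone(timezone.utc)
    return None


def mirror_is_stale(mirror_release, master_release, slack=timedelta(0)):
    """True if the mirror's Release is older than the master's by more than `slack`."""
    mirror = release_date(mirror_release)
    master = release_date(master_release)
    if mirror is None or master is None:
        return True  # a Release without a Date is itself a problem
    return master - mirror > slack
```

The same comparison would work for lurc deciding whether an upstream dist has
changed since the last pull, with the previously stored Release standing in
for the mirror copy.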

> > Configuration syntax will be simple flat text. Blacklisted packages will be
> > per-dist.
> >
> > TBD: Syntax/semantics for specifying precedence amongst repositories.
> > (Example, department > organization > devuan > debian-security > debian,
> > with each step potentially asserting blacklists.) Current favourite: linked
> > list specified in config? Guard against loops or any dist being masked more
> > than once, or directly masking more than one subordinate dist.
> > ---------------------------------------------------------------------------
> > Procedure:
> >
> > 1. Pull down repo data from all specified repositories. (Invocation of tool
> > can specify a single dist to pull or a batch mode that calls the tool to
> > collect all repositories. For single-repository mode only, a --force option
> > will allow re-pull even for files that look unchanged.)
> >
> > a. config will specify dist locations
> >
> > b. config will have a suite mapping
> >
> > c. snag each relevant Packages, Release, Contents file
> >
> > TBD: What's the minimal set of files I need to regenerate, beyond
> > the Packages files?
>
> The files that need to be generated are "Packages" and "Release". In the
> 21st century, you'll also want "InRelease" to sign these repositories,
> and definitely compress the "Packages" files with gzip or xz.

InRelease is a PGP signed version of Release (signature inline), and there
is also Release.gpg, which is the detached PGP signature of Release. Then
there is the collection of Contents-* files and Packages.

Might help to browse pkgmaster's package mirror to see what is there, or
try to read the existing amprolla code, to see what the minimal set of
files is. I suspect all the metadata files are needed by apt. There is
all sorts of oddness that might surprise you -

dists/ascii/contrib/binary-all/by-hash/SHA256/
dists/ascii/contrib/binary-all/Release

As examples.
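
As far as I know the by-hash directories hold copies of the index files
named after their own checksum, so a client can fetch exactly the file that
Release lists. A rough Python sketch of where such a copy would go;
by_hash_path is a made-up helper name and the layout is assumed from the
paths above.

```python
import hashlib
import os


def by_hash_path(index_path):
    """Return the by-hash/SHA256/<digest> path sitting next to an index file."""
    sha256 = hashlib.sha256()
    with open(index_path, "rb") as handle:
        # Read in chunks so large Packages or Contents files are not slurped whole.
        for chunk in iter(lambda: handle.read(65536), b""):
            sha256.update(chunk)
    return os.path.join(
        os.path.dirname(index_path), "by-hash", "SHA256", sha256.hexdigest()
    )
```

So regenerating a Packages file probably also means dropping a copy of it at
that path, or apt clients that fetch by hash will not find it.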

> > 2. Write out merged data where
> >
> > a. higher-precedence packages mask lower-precedence packages and
> > blacklists. (Examples, local apt built without libsystemd0, local
> > Plymouth built without systemd deps, local dist blacklists libsystemd0
> > and pulseaudio.)
> >
> > b. Packages are blacklisted per-dist, with each level offering a
> > blacklist of packages from that level or in subordinate dists.
>
> Keep in mind that not only the Source/Package names themselves should be
> blacklisted, but also any other packages that list those names in their
> dependencies.
>
> > c. Per-dist blacklisting is applied with each successive application of
> > a dist, from lowest-precedence to highest. As such, if a higher-rank
> > repository supplied a package blacklisted below, it will appear in the
> > final results, unless a still-higher-priority dist again blacklists it.
> >
> > 3. Sign.
> >
> > 4. Publish data. We want to be *really* atomic, and not have network
> > latency impact this, so:
> >
> > a. rsync the produced merge to holding directory on destination
> > (pkgmaster)
> >
> > b. Once on pkgmaster, rsync into place - more likely to be atomic
> >
> > TBD: but consider better guarantees? Either way, this is outside
> > of the scope of lurc and merely a suggestion.
>
> Unfortunately transferring files like this will never truly be atomic. I
> gave some thought to actually archiving the whole generated repository(s)
> with cpio and doing a copy-pass to another server. This might actually
> be the most efficient method, but I never found the time to implement
> this in the current amprolla codebase.

One of the new mirror admins recently asked me how we manage to keep
things from getting confused while the mirrors sync. The idea I came up
with was to rsync the metadata files last, which would be easy enough to
do. That is safe because the actual package files all carry their version
in their names, coz multiple versions are stored on the mirrors. In
combination with your cpio idea, it might work: the mirror rsyncs packages
first, then rsyncs the cpio of the metadata, and does its atomic best to
unpack it. Something more efficient than cpio might be better. Off the top
of my head, I think unpacking speed is the thing to optimize, to minimize
the window in which each mirror server is inconsistent.
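
One way to shrink that window further, which is a different technique from
the cpio unpack above and purely my assumption about what a mirror could do:
unpack the metadata into a fresh directory and then flip a symlink to it.
The flip_symlink name and the directory names in the usage note are invented.

```python
import os


def flip_symlink(link_path, new_target):
    """Atomically repoint the symlink `link_path` at `new_target`."""
    tmp = "%s.new.%d" % (link_path, os.getpid())
    if os.path.lexists(tmp):
        os.remove(tmp)  # stale leftover from a crashed run
    os.symlink(new_target, tmp)
    # rename(2) swaps the name in one step, so readers see either the old
    # target or the new one, never a half-unpacked tree.
    os.replace(tmp, link_path)
```

For example flip_symlink("dists", "dists-20201119") after the new tree is
fully unpacked. It only works if the live name is already a symlink (not a
real directory) and both sit on the same filesystem, and the mirror's rsync
and httpd setup would have to cope with the symlink.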

> > ---------------------------------------------------------------------------
> > Open questions:
> >
> > 1. What logging detail do we want? Question listed in weekly meeting doc.
> > ---------------------------------------------------------------------------
> > Config data:
> >
> > set of repositories with dist keys (fields: repo <key> <url>)
> > map of overlays (fields: map <dist> <subordinate>)
> >
> > Blacklist data per-dist in /etc/lurc/blacklist.d.
>
> Please use the amprolla configuration as a reference.
>
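
To make the loop and double-masking guards concrete, here is a rough Python
sketch that reads the "repo <key> <url>" and "map <dist> <subordinate>"
lines from the design doc. Treat everything else as an assumption: the
function names and the "#" comment handling are invented, and it has nothing
to do with how amprolla reads its config.

```python
def parse_config(text):
    """Parse 'repo <key> <url>' and 'map <dist> <subordinate>' lines."""
    repos, overlay, masked = {}, {}, set()
    for number, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):  # comment syntax is an assumption
            continue
        fields = line.split()
        if len(fields) == 3 and fields[0] == "repo":
            repos[fields[1]] = fields[2]
        elif len(fields) == 3 and fields[0] == "map":
            dist, subordinate = fields[1], fields[2]
            if dist in overlay:
                raise ValueError("line %d: %s masks more than one dist" % (number, dist))
            if subordinate in masked:
                raise ValueError("line %d: %s is masked more than once" % (number, subordinate))
            overlay[dist] = subordinate
            masked.add(subordinate)
        else:
            raise ValueError("line %d: cannot parse %r" % (number, raw))
    return repos, overlay


def precedence_chain(overlay, top):
    """Walk the linked list from `top` downwards, refusing loops."""
    chain, seen = [top], {top}
    while chain[-1] in overlay:
        lower = overlay[chain[-1]]
        if lower in seen:
            raise ValueError("loop in overlay map at %s" % lower)
        chain.append(lower)
        seen.add(lower)
    return chain
```

With "map devuan debian-security" and "map debian-security debian" in the
config, precedence_chain(overlay, "devuan") walks devuan, debian-security,
debian, highest precedence first.
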
> > ---------------------------------------------------------------------------
> > Method to merge data:
> >
> > 1. In-memory map of most-subordinate remaining set.
> >
> > 2. Apply blacklist.
> >
> > 3. Overlay next-most-subordinate set atop initial data, apply blacklist.
> > Loop.
> >
> > 4. Write out remaining dataset to file. Preserve deb822(5). (Consider
> > formal use of deb822 for configs?)
> > ---------------------------------------------------------------------------
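
The four steps above can be sketched in Python like so. It is a toy under
stated assumptions, not a proposal for the real thing: the function names
are made up, stanzas are keyed by the Package field only (so one version per
name per layer), and blacklisting is by exact package name without looking
at dependencies.

```python
def parse_packages(text):
    """Map package name -> raw deb822 stanza, keeping each stanza verbatim."""
    stanzas = {}
    for stanza in text.strip().split("\n\n"):
        for line in stanza.splitlines():
            if line.startswith("Package:"):
                stanzas[line.split(":", 1)[1].strip()] = stanza
                break
    return stanzas


def merge(layers):
    """Merge (packages, blacklist) pairs given from lowest to highest precedence."""
    merged = {}
    for packages, blacklist in layers:
        merged.update(packages)      # overlay this dist atop what is below it
        for name in blacklist:
            merged.pop(name, None)   # then apply this dist's blacklist
    return merged


def write_packages(merged):
    """Serialise the merged stanzas back into Packages file text."""
    return "\n\n".join(merged[name] for name in sorted(merged)) + "\n"
```

Because each blacklist is applied right after its own dist is overlaid, a
package blacklisted lower down reappears if a higher dist supplies it, which
matches 2c in the Procedure section.
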
> > Details/notes:
> >
> > Packaged dependencies so far: libhttp-tinyish-perl
> >
> > Modification status: Return code is 200 or 2xx if new, 304 if unmodified
>
> Could you explain this?
>
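
My guess is that the status line is about conditional GET: send
If-Modified-Since with the timestamp from the last pull, and the server
answers 200 with a body if the file is newer or 304 with no body if not.
A sketch with Python's urllib, only to illustrate that guess; fetch_if_newer
is a made-up name and lurc itself would presumably do this in Perl.

```python
import urllib.error
import urllib.request


def fetch_if_newer(url, last_modified=None):
    """Return (status, body, last_modified); body is None on 304 Not Modified."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Unchanged since last pull: keep the old timestamp, nothing to merge.
            return 304, None, last_modified
        raise
```

This is exactly the part that trusts the HTTP headers, so it is cheap but
only as good as the server's Last-Modified.
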
> > Config in /etc/lurc
> > Work in /var/spool/lurc
> > role user: turkey (why? because sudo turkey lurc)
>
> I don't understand this. Why not name the user/group after the
> actual program?

"sudo turkey lurc" might be some sort of joke that Aussies like me that
only have bush turkeys, not real turkeys, might not understand.

>
> > todo: provide bash-completion
> >
> > todo: perldoc as base for docco
> > ---------------------------------------------------------------------------
>
> Best regards,
> Ivan
> _______________________________________________
> devuan-dev internal mailing list
> devuan-dev@???
> https://mailinglists.dyne.org/cgi-bin/mailman/listinfo/devuan-dev

--
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.
