[TUHS] another conversion of the CSRG BSD SCCS archives to Git

Steffen Nurpmeso steffen at sdaoden.eu
Tue Dec 3 04:36:36 AEST 2019


Please excuse the late reply.

Greg A. Woods wrote in <m1ibDzG-0036tPC at more.local>:
 |At Fri, 29 Nov 2019 22:52:58 +0100, Steffen Nurpmeso <steffen at sdaoden.eu> \
 |Subject: Re: [TUHS] another conversion of the CSRG BSD SCCS archives to Git
 |> Greg A. Woods wrote in <m1iVoBV-0036tPC at more.local>:
 |>|I've been fixing and enhancing James Youngman's git-sccsimport to use
 |>|with some of my SCCS archives, and I thought it might be the ultimate
 |>|stress test of it to convert the CSRG BSD SCCS archives.
 |>|The conversion takes about an hour to run on my old-ish Dell server.
 |>|This conversion is unlike others -- there is some mechanical compression
 |>|of related deltas into a single Git commit.
 |> Thanks for taking the time to produce a CSRG repo that seems to
 |> mimic changesets as they really happened.  As i never made it
 |> there on my own, i have switched to yours some weeks ago.  (Mind
 |> you, after doing "gc --aggressive --prune=all" the repository size
 |Ah!  I did indeed forget the "git gc" step that many conversion guides
 |recommend.  I might change the import script to do that automatically,
 |particularly if it has also initialised the repository in the same run.
 |Apparently github themselves run it regularly:
 | https://stackoverflow.com/a/56020315/816536
 |Probably they do this by configuring "gc.auto" in each repository,
 |though I've not found any reference to what they might configure it to.

I do not know either, but i have the impression they work with
individual repositories, possibly doing deduplication on the
filesystem level, if at all.  (Some repositories shrink notably,
while others do not.  And i say that because i think bitbucket,
once they added git support, seemed to have used common storage
for the individual git objects, at least i remember a post
pointing to some git object <-> python <-> mercurial library

 |However it seems that without the "--aggressive" option, nothing will be
 |done in this repository.  With it though I go from 316M down to just 71M.

It throws away intermediate full data and keeps only the deltas.
It can also throw away reflog info (which i have never used).
I always use it.  Now with my new machine i can even use it for
the BSD repositories etc., whereas before each update added its
own pack, and the normal gc only combined the packs, but did not
resolve the intermediate deltas.  (Note however that i learned git
almost a decade ago, and have not reread the documentation or
technical papers since, let alone in full.)
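
For illustration, a minimal sketch of how the effect of the
aggressive repack can be measured.  The throwaway repository, the
file name data.txt and the demo identity are assumptions made only
for this example; the real numbers for the CSRG repository are the
ones quoted above.

```shell
#!/bin/sh
# Sketch: compare object counts before and after an aggressive gc.
# Everything happens in a scratch repository created here.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create 20 small commits so that there are loose objects to pack.
i=1
while [ "$i" -le 20 ]; do
    echo "line $i" >> data.txt
    git add data.txt
    git -c user.name=demo -c user.email=demo@example.invalid \
        commit -q -m "revision $i"
    i=$((i + 1))
done

echo "before gc:"
git count-objects -vH

# Recompute deltas from scratch and drop unreachable objects.
git gc --aggressive --prune=all --quiet

echo "after gc:"
git count-objects -vH

# Machine-readable counters for a quick check.
loose=$(git count-objects -v | awk '/^count:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "loose objects: $loose, packs: $packs"
```

The "size-pack" line of "git count-objects -vH" is the figure to
compare between the two runs.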

 |I don't see any way to force/tell/ask github to run "git gc --aggressive".

That is a very compute-intensive task.  Back when i was subscribed
to the git ML around 2011 i witnessed Hamano asking and Jeff King
(at Github by then) responding something like "[it is ok but] gc
aggressive is a pain".  They must have changed the algorithm since
then; it now works much more in main memory and requires much more
of it, too, not truly honouring the provided pack.windowMemory /
pack.threads options (when i last tried).  It has no recovery path
either: for example my old machine could not garbage collect the
FreeBSD repository, i even let it work almost overnight (5+
hours), and it did not make it, whereas my new one can do it in
a few minutes.  The CPUs are not that much faster, it is
only about the memory (8GB instead of 2GB).
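
A small sketch of how those options can be set for one repository
before running the aggressive gc.  The values (1 thread, 256m, 64m)
are illustrative assumptions, not recommendations, and as said
above i am not sure how strictly git keeps to them;
pack.deltaCacheSize is an additional real option that was not
mentioned above.

```shell
#!/bin/sh
# Sketch: limit the resources of the repack that "git gc" starts.
# The scratch repository and all values are assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hello > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "initial"

# Per-repository settings, stored in .git/config.
git config pack.threads 1           # one delta search thread
git config pack.windowMemory 256m   # cap on delta window memory per thread
git config pack.deltaCacheSize 64m  # cap on the cache of computed deltas

git gc --aggressive --prune=all --quiet

# Read the settings back to show what is in effect.
threads=$(git config --get pack.threads)
window=$(git config --get pack.windowMemory)
echo "pack.threads=$threads pack.windowMemory=$window"
```

The same settings can be passed for a single run only with
"git -c pack.threads=1 -c pack.windowMemory=256m gc --aggressive".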

I sometimes think about the fact that a lot of software seems to
lose its capability to run in restricted environments.  Providing
alternative runtime solutions takes a lot of coding effort, of
course, and gets in the way of a rapid development process, too.

 |Perhaps I can just delete it from github and immediately re-create it
 |with the re-packed repository, and in theory all the hashes should stay
 |the same and any existing clones should be unaffected.  What do you think?

From a technical point of view i think this should simply work.
But there is no need to delete the repository, simply deleting the
branch should be enough.  (Or fooling around with the update-ref
that i often use, as in "update-ref newmaster master", "checkout
newmaster", "branch -D master" (or "update-ref -d master"), then
pushing, then re-renaming newmaster to master, and pushing again,
etc.)
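
A minimal local sketch of that sequence, under some assumptions:
the refs are spelled with their full refs/heads/ names, the pushes
are left out because they depend on the remote, and the scratch
repository and file name exist only for this example.  It also
records the commit hash before and after the repack to show that
repacking does not change it.

```shell
#!/bin/sh
# Sketch: the commit hash survives a repack and the branch juggling.
# The scratch repository is an assumption; no remote is involved.
set -e
work=$(mktemp -d)
cd "$work"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is "master"

echo one > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "first"
echo two >> file.txt
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -a -m "second"

before=$(git rev-parse refs/heads/master)
git gc --aggressive --prune=all --quiet
after=$(git rev-parse refs/heads/master)

# Park the history under a second name, then drop the old branch.
git update-ref refs/heads/newmaster refs/heads/master
git checkout -q newmaster
git branch -q -D master
# (a push of newmaster and of the deletion of master would go here)

# Rename back; a second push would follow.
git branch -q -m newmaster master
final=$(git rev-parse refs/heads/master)

echo "before: $before"
echo "after:  $after"
echo "final:  $final"
```

All three printed hashes are the same, since the object names depend
only on the content and not on how the objects are packed.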

It would be interesting to know how github does deduplication.
The really great ones of Bell Labs/Plan9 developed the venti /
fossil storage with blockhash-based permanent storage, and despite
all the multimedia the curve of new allocation flattened after
some time.  I would assume github would benefit dramatically from

 |Note I have some thoughts of re-doing the whole conversion anyway, with
 |more ideas on dealing with "removed" files (SCCS files renamed
 |to the likes of "S.foo") and also including the many files that were
 |never checked into SCCS, perhaps even on a per-release basis, thus being
 |able to create release tags that can be checked out to match the actual
 |releases on the CDs.  But this will not happen quite so soon.

That would be nice; having the real changesets is a real
improvement already anyway!  And even Spinellis' Unix history
repository seems not to be perfect even after years, as i heard
on some FreeBSD mailing list.

 |     Greg A. Woods <gwoods at acm.org>
 |Kelowna, BC     +1 250 762-7675           RoboHack <woods at robohack.ca>
 |Planix, Inc. <woods at planix.com>     Avoncote Farms <woods at avoncote.ca>
 --End of <m1ibDzG-0036tPC at more.local>

|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
