One part of the so-called “Factory 2.0”
The goal of this document is to describe a high level plan for what kinds of systems will need to be modified and introduced to Fedora Infrastructure to support building, maintaining, and shipping modular things.
There are open questions still which can impact this plan, discussed near the end.
We know from the originally nebulous discussions that led to the formation of a modularity initiative in Fedora that we were going to create a new framework for building and composing the distribution, one which could have negative side-effects on our existing efforts if we weren’t careful. “Modularity will be very powerful. It will give us enough power to shoot ourselves in the foot.” We’re talking about allowing the creation of combinations of components with independent lifecycles. There’s the possibility of a combinatorial explosion in there that we’ll need to contain. We’ll do that in part by vigorously limiting the number of supported modules with policy, to be taken up in another document, and by providing infrastructure automation to reduce the amount of manual work required.
Towards accomplishing the second goal, we set out to study existing workflows and hypothesize what non-automated modularity workflows would entail.
Life of a package update
For reference, here’s our depiction of the life of a package update as it travels through Fedora Infrastructure, today:
Life of a module update
Here’s our depiction of some scenarios for module maintainers, without any automation in place. Take a look at the four starting points, organized into columns, at the top:
Submitting a new module: This is straightforward. There’s no need to invent a new submission framework here. We’ll reuse the existing bugzilla->pkgdb->dist-git workflow we have for packages. It will provide well-known gate point where we can deny new modules that are unfit for Fedora (criteria to be determined) would create unsustainable burden (criteria to be determined) or are problematic in some other way.
Updating an existing module: The flow here is also similar to existing packaging workflows. A maintain can update their module’s metadata definition, commit and push it to dist-git, and then kick off a “module build” in koji (the details for a koji “module build” here are non-trivial, and are being lead by Petr Sabata and Lubos Kocman).
Updating a component: This is what packagers do today when they update their package to a new release in rawhide. We may carry modules that include that package, but they pin their dependency to a specific version of that component. Since this is a new version of that component, we do not have to automatically worry about updating any modules that depend on this component.
Backporting a security fix: This gets tricky. There are currently (at least) two different competing approaches for how a module should pin its dependencies:
- Modules should specify their dependency versions by the git hash of a commit in the dist-git repo of that component.
- Modules should specify their dependency versions by a git branch or tag on the dist-git repo of that component.
Note that the author of this document favors approach #2 and that the workflow diagram above depicts the steps required of module maintainers for approach #1.
There are problems with each of these and full discussion of those is outside the scope of this document. Stephen Tweedie will be taking up the question of versioning and releases in a different document.
In either case, the backporting of a security patch to older versions is a situation that requires lots of manual work. Patches need to be applied to the component’s old (either) commits or branches. Dependant modules either need to have their pinned git hashes updated in their own dist-git repos (or have their pinned git branch names left unmodified). Finally, rebuilds need to be scheduled for the entire tree of dependant modules.
Whichever approach we decide on for pinning component versions, the exercise above in figuring out “what a module maintainer needs to do” has been instructive.
Infrastructure Services Proposal
Here is an overview diagram of the set of systems and services that would be involved in maintaining, building, and shipping a module.
Let’s go through each piece…
Dist-git: We’ll be keeping the definitions of modules in dist-git, using the namespace approach that we already have implemented in pkgdb and dist-git in Fedora Infrastructure.
Branch history: In the event that we decide on version-pinning approach #2, we’re going to need a way to remember what branch refs pointed to which git hashes and what point in time. This is so that, when a patch is applied to a supported branch for a component, we can determine what modules have been built against the old hash and then rebuild them. As noted in the diagram, it may be that we can extract this information from the git history itself - nonetheless, it will be useful to have a queryable network service that exposes that information to other services.
pdc-updater: In Fedora Infrastructure, we currently have a service running that listens to our message bus. When it notices that pungi has completed a new compose, it uploads the metadata about that compose to PDC. In our diagram here we have two depictions of pdc-updater where it needs to be modified to handle two new types of data to be included in PDC On the left hand side, we want to listen for dist-git changes for modules which have added or removed a dependency. We want to push that dependency to the release-component-relationships endpoint in PDC, which we’ll then query later to know out “what depends on what”. There’s a taiga card for this requiring changes to pdc-updater. On the right hand side of the diagram, we want to continue to listen for new pungi composes as we do now, but we want to additionally import information about built modules included in that compose. These are modules that we are preparing to ship.
Orchestrator (ad hoc pungi): We’re reluctant to build a workflow engine directly into koji itself. It makes more sense to relegate koji to building and storing metadata about artifacts and to instead devise a separate service dedicated to scheduling and orchestrating those builds. It will make heavy use of PDC (and perhaps the branch-history service) to know what needs to be rebuilt. When a component changes, the orchestrator will be responsible for asking what depends on that component, and then scheduling rebuilds of those modules directly. Once those module rebuilds have completed and have been validated by CI, the orchestrator will be triggered again to schedule rebuilds of a subsequent tier of dependencies. This cycle will repeat until the tree of dependencies is fully rebuilt. In the event that a rebuild fails, or if CI validation fails, maintainers will be notified in the usual ways (the Fedora notification service). A module maintainer could then respond by manually fixing their module and scheduling another module build, at which point the trio of systems would pick up where they left off and would complete the rebuild of subsequent tiers (stacks).
Taskotron (CI): Taskotron itself will likely need only minor patching, to be aware of modules as an entity that can be tested. There will be much more involved work required of the Modularity Working Group to propose and implement some default tests for all modules, as well as some guidelines for writing tests specific to individual modules.
Koji: As mentioned earlier, Petr Sabata and Lubos Kocman are working on the details here, but here are some highlights: A module defines its own buildroot, which doesn’t inherit from other buildroots. Accordingly, rebuilding a module will entail building its components from source (or from srpm). We’ll be looking for optimizations, so that we can avoid rebuilding binary rpms if the buildroot of a pre-built rpm matches bit-for-bit.
Pungi: Pungi currently works by (at the request of cron or a releng admin) scheduling a number of tasks in koji. Deps are resolved, repos are created, and images (live, install, vagrant, etc..) are created out of those repos. This takes a number of hours to complete and when done, the resulting artifacts are assembled in a directory called The Compose. That compose is then noted in PDC. Some CI and manual QA work is done to validate the compose for final releases, and it is rsynced to the mirrors for distribution. With the introduction of modules, we’ll have an explosion in the amount of time taken to build all of the repos for all of those combinations which is why we’re going to break out a good deal of that work into the orchestrator, which would would like to pre-build the parts the constitute a compose, before we ask for them. Pungi’s job then, primarily, is reduced to harvesting those pre-built artifacts. In the event that those artifacts are not available in koji, pungi will of course have to schedule new builds for them before proceeding. We have a (good) requirement to allow developers to run pungi in a local environment, disconnected from our infrastructure. This will be hard, but worth it. The gist will be to have pungi contain libs that know how to build the artifacts. In production, pungi will schedule a koji task, which in turn makes a builder call that koji lib to do the work. In an offline development environment, we’ll configure pungi to just call that lib itself, directly.
Comps-As-A-Service: More information for you on how pungi currently builds a compose: pungi takes as its input a pungi config file which, while it defines many aspects of the compose, it primarily defines the set of outputs: the artifacts. It furthermore takes in a variants.xml file which defines the variants to be produced, in terms of comps groups. Those comps groups are then defined in another comps.xml file. They are just groups of packages. Our variants are defined in terms of groups of packages from the flat Fedora package namespace. At minimum, we’ll need to modify pungi to accept a definition of the variants in terms of modules, but additionally, we have problems with trying to maintain and copy the comps.xml file all around our infrastructure to build things. We’d like to replace that with CaaS: Comps-as-a-service, so we can query for this stuff over the network and manage it (hopefully) more sanely. The big work item here is defining the variants in terms of modules. We’ll still need to produce a comps.xml file to mash into the repo metadata, but we will generate that file on the fly from CaaS data.
Metadata Service: This is an optional client-facing service which can provide cached pre-computed resolutions of dependencies. We don’t have anything like this currently for RPMs. It could be nice to have a generic system which can serve fast dep resolutions for all kinds of artifacts. It is optional, because we expect that we can build the client tools to work just fine with the metadata lists distributed over the mirrors (or CDN). If we find we have UX issues with long waits for dep resolution, we could invest work in a system like this to supplement.
Build Pipeline Overview: Another optional client-facing service. It could be nice to be able to query and ask “I have module X installed. Do you have a fresh build of X underway? Is it complete, but not yet available on the mirrors?” This is moreso targeted for developers - it would be nice to be able to query and find the status of any kind of component, module, or image in the pipeline through a homogenous interface.
As mentioned before, we have an open question about how to uniquely identify components and modules. That’s outside the scope of this document, but a decision there will have impact on what we need to build.
Not mentioned so far - we need a way to store, query, and view SLA and EOL information for modules. Bear in mind that the “EOL” information for packages currently is tied to the distro-release for that package. So, whenever F17 goes EOL, that’s when the branch for that package goes EOL. One of the principal reasons for getting into this whole modularity mess was to have independent lifecycles between modules; tying it only to a distro release just won’t do.
One idea here is to keep the EOL information in the module’s yaml metadata definition. If we go that route, we’re going to need to also store a cache of that authoritative data somewhere so that it can be queried by our other web services. We’ll need to build an interface to the data for the compose side of the pipeline, so that people putting together a release of the distro can be sure that none of the modules in that release are going to go EOL before the distro release is supposed to go EOL. We’ll need to build a separate interface to the data from the packager side of the pipeline, so that people modifying modules can see what the EOLs of things they depend on are, and what things what what EOLs depend on their module.
Lastly, we need to appreciate what kind of load this could put on the mirror network. We’ll be creating many more repos. What kind of overhead does that bring? Will we need to restructure the way mirrors selectively pull content? Nothing fundamentally changes here. It is a matter of quantity.
It is useful to step back for a moment and think about some of the really cool new systems we have, like Koschei and OSBS.
Koschei is a kind of continuous integration service. It monitors new builds in koji and in response, tries to also rebuild packages that depend on that package (as scratch buidls). In doing so, it attempts to find situations where packages inadvertently fail to rebuild from source, which is really useful. It has to suss out and maintain a dependency graph to accomplish this. For rpms only, it looks somewhat like the orchestrator tool in our diagram above (except less committed. It does scratch builds, not real builds.)
OSBS is a build system that we use to build docker containers. We run it as a kind of child of koji. It performs builds, and submits them back for koji to store via koji’s Content Generator API. It is particularly cool in that it automatically rebuilds containers that depend on one another. For docker containers only, it looks somewhat like the orchestrator tool in our diagram above (except less integrated with our environment. It is buried behind koji.)
You can see that we’re all heading towards some of the same patterns in our approach, but we lack a unified approach to modifying the pipeline as a whole. That’s understandable.. It’s big and hard to change! Furthermore, there’s likely a social explanation at root, with respect to Conway’s Law. We have an opportunity here to change that.
If we can solve the dep-graph modelling, chain-rebuilding, and CI problems generally in the pipeline, then we’ll be all the more situated to easily adapt to the next wave of technological change.