Proposal: Pause the release train after F30 to solve several big challenges
Fedora’s singular lifecycle has been in place for almost a decade and a half. During that time, technology users have changed how they consume platforms and applications. Fedora needs to be closer to the forefront of these changes. More importantly, Fedora needs to be more hospitable to community management of the lifecycle.
Currently Fedora can’t scale for more community ownership of the things we release: (1) Only a few people can build and push out releases; and (2) we manage releases largely based on that staffing. The Fedora community should be able to run releases of content themselves, using tools that work well, with only minimal oversight, and determine their own schedule for doing so.
This implies a great deal of both redesign and reworking of tools and processes. To unblock the community, several things need to happen. We need a faster, more scalable compose to enable CI/CD operations; we need a way to automate more testing and quality measures; and we need to update our delivery tools and processes. We also need a way to track and coordinate this work across teams, since it involves collaboration among Fedora infrastructure, QA, applications, release engineering, CentOS CI, maintainers, and more.
We should skip the F31 release cycle and leave F30 in place longer in order to focus on tooling and testing improvements. These tooling changes will improve the overall reliability of Fedora, and will decrease the manual effort and complexity involved in producing the distribution artifacts. Although we’ve done this before to make “editions” happen, the intent this time would be to track this multi-team effort so we can (1) use the time as well as possible, and (2) give the work maximum transparency.
Challenge #0: Coordination
Fedora’s development has grown organically over the years, which has led to the complex and often disorganized state we have now. Development teams often use different tools, spread across a variety of platforms. The result is a siloing effect where teams often operate in isolation, causing duplicate work, complaints about interoperability, and complex designs.
This problem is numbered 0 because it’s fundamental to how we can solve other challenges in this proposal. It doesn’t technically block them, although failure to coordinate in general surely would.
By standardizing and unifying the tools used to plan for Fedora at a cross-team level, we can reduce the siloing effect currently seen. This proposal would adopt the common kanban approach to development within Fedora. A number of teams in Fedora are already using the Taiga software, either at tree.taiga.io, or taiga.fedorainfracloud.org. By migrating these disparate instances to a single hosted solution, Fedora teams can better coordinate their work, and share tickets between groups.
This proposal also allows Fedora to help fund the Taiga open source project, which benefits the wider open source ecosystem. We encourage anyone working on project planning and execution, or who finds other methods like issue queues or the wiki lacking, to use this shared instance.
- The Fedora Infra group will provide and coordinate a common instance of Taiga for Fedora teams, and help in the migration of existing Taiga data to the new platform.
- The Fedora project should begin using Taiga for higher level planning purposes.
Challenge #1: Faster, more scalable composes
We need faster composes, to enable more continuous testing/integration/delivery. In a continuous process, changes are checked not just for package changes, but in an integrated way, and constantly. If failure is detected, the change is rejected and the maintainer notified so they can fix the issue before trying again.
The current compose process is monolithic and relatively slow. Right now a compose takes between 8 and 12 hours. Given time zones, human factors, and other issues, this means delays of at least a day between composes. That’s too slow to allow for a continuous process. Maintainers should be able to get feedback on their change within a few hours. That means our compose needs to be shorter.
In addition, our compose operations are highly bound to on-premises hardware. This prevents scaling our processes to take advantage of modern, hybrid cloud footprints. In some cases we are necessarily premises-bound, for instance for Power and s390x. However, there’s a broad range of continuous testing we could use to validate general changes before engaging that hardware.
Improve transaction and compose speed in general, with a target of one hour per compose. This means we can pursue desirable goals like gating package changes on integrated testing, not just package validity. It also means we can spend less time waiting for each compose, and more time working against composes on a regular basis.
- Improve underlying transaction speeds that affect composes
- Update the compose process such that we can scale to use hybrid cloud footprints
- Target the use case of compose on demand for CI
Challenge #2: Testing
The current development process for Fedora does not require package testing for Fedora Rawhide, which frequently leads to nightly compose failures, broken installs, or incompatibilities between applications. In addition, Rawhide is used by a relatively small number of contributors and users, resulting in minimal outside feedback. The lack of basic validation in Rawhide means that users repeatedly have to take action to fix locally broken systems. Unnecessary brokenness contributes to the cycle of low usage.
In the stabilization/stable branches, many packages get “thrown over the wall” at the last minute before a freeze in order to try to get features in. Often this results in frantic QE testing, stress, and unplanned work by a variety of teams.
Require integration and functional testing for packages and updates on every package build request. This will require changes to the current testing and tooling environment, which will have an impact on pagure, koji, openqa, greenwave, waiverdb, bodhi, and the CICO pipeline test. Ideally all packages should gain tests, but at minimum critical path packages need to pass a basic defined set of tests for installation, upgrade, and function. Rawhide gating is already a requested feature. Skipping the F31 development cycle allows the testing and tooling developers the time needed to properly build, test and deploy the changes needed.
- Make Rawhide gating a reality.
- Better test reporting through bodhi and pagure.
- Run all tests through CICO and OpenQA.
- Consistent test result format used in resultsdb.
- Define and automate more of our quality measures -- have them owned by community, and tested by machines wherever possible
- Review workflow to see if we need tooling at each step or if some can be removed.
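To make the gating idea concrete, here is a minimal sketch in Python of the decision a gate might make for one build request. It is an illustration under stated assumptions, not a description of how greenwave, bodhi, or any other existing Fedora tool works: the BuildRequest type, the gate function, and the three test names are hypothetical, chosen only to mirror the "installation, upgrade, and function" minimum described above.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the basic test set named in the proposal.
REQUIRED_CRITPATH_TESTS = ("installation", "upgrade", "function")


@dataclass
class BuildRequest:
    """Illustrative stand-in for a package build request (not a real API)."""
    package: str
    maintainer: str
    critical_path: bool
    # Maps a test name to True (passed) or False (failed).
    results: dict = field(default_factory=dict)


def gate(build):
    """Return (accepted, reasons) for a build request.

    Any failed test rejects the build. A critical path package is also
    rejected if one of the required tests has no result at all.
    """
    reasons = []
    for name, passed in sorted(build.results.items()):
        if not passed:
            reasons.append(f"test failed: {name}")
    if build.critical_path:
        for name in REQUIRED_CRITPATH_TESTS:
            if name not in build.results:
                reasons.append(f"required test missing: {name}")
    return (not reasons, reasons)


def rejection_notice(build, reasons):
    """Build the message a maintainer would receive for a rejected change."""
    lines = [f"{build.maintainer}: build of {build.package} was rejected"]
    lines.extend(f"  - {reason}" for reason in reasons)
    return "\n".join(lines)
```

In this sketch a rejected change produces a list of reasons, so the maintainer can fix the issue and resubmit, which matches the reject-and-notify loop the proposal describes. Non-critical-path packages are only gated on the tests they actually have, reflecting the "ideally all packages should gain tests" wording rather than a hard requirement.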
Challenge #3: Delivery tools and processes
As mentioned before, our processes are highly monolithic. They bottleneck a large set of deliverables regularly on a small set of people and time. Inevitably this leads to misses on numerous processes (although not for lack of effort). An example is releasing content that hasn’t been curated properly in the community. Furthermore, to avoid strain on the bottleneck, we enforce a lengthy freeze process across the infrastructure around every delivery cycle. This is precisely the opposite of an agile operation, in that we focus on constraining change. In a typical release cycle (approximately 26 weeks), this has been 6-8 weeks of freeze, or 23-31% of the time.
Focus limited resources funded by Red Hat on a set of critical content the community can build on (Platform). Use automation to a maximum extent, and minimize manual effort, so that both Red Hat funded staff and the community can easily guide release deliverables. Update release processes to be more scalable for community ownership. This will ensure the Fedora Project promotes work our contributors care about passionately. Focus on keeping a short time to recover from problems, and minimize freeze periods.
- Define a small, draft set of content for the Platform which can iterate over time
- Update tools to support automated creation, validation, and release of deliverables and minimize manual effort
- Establish access controls for tools and release validation that allow flexible ownership beyond the current single releng group
- Reduce freeze times for Platform release to < 5% of a cycle, and consider eliminating altogether when possible
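The freeze figures above can be checked with a few lines of arithmetic. The sketch below assumes the approximate 26-week cycle stated in this section; the function names are illustrative only. It shows that 6 to 8 weeks of freeze is roughly 23 to 31 percent of a cycle, and that a target below 5 percent leaves a little over nine days of freeze per cycle.

```python
CYCLE_WEEKS = 26  # approximate cycle length stated in this section


def freeze_share(freeze_weeks, cycle_weeks=CYCLE_WEEKS):
    """Percentage of a release cycle spent in freeze."""
    return 100.0 * freeze_weeks / cycle_weeks


def max_freeze_days(target_percent, cycle_weeks=CYCLE_WEEKS):
    """Days of freeze allowed per cycle for a given percentage target."""
    return cycle_weeks * 7 * target_percent / 100.0


print(round(freeze_share(6)))         # 23
print(round(freeze_share(8)))         # 31
print(round(max_freeze_days(5), 1))   # 9.1
```

Put differently, meeting the 5 percent target on a 26-week cycle means cutting freeze time from 42-56 days down to about 9.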
Challenge #4: Feedback and Metrics
The current Fedora metrics are largely pieced together from spreadsheets, log files, and a loose collection of scripts. In order to better evaluate the current state of package usage and community engagement, we need to rewrite our tooling around feedback and metrics gathering.
Matthew and others have already submitted several RFEs for dnf that will help us get better information from the server logs. Additionally, we’ll need to take some time to actually write a proper toolkit for the data processing side of things. Ultimately we want an automated dashboard of some sort so that real-ish time trend data is available to the community as a whole, with more granular information available to Fedora leadership.
- File and track RFEs for dnf
- Choose toolkit(s) to process data and dashboard results (not new in-house development)
- Establish ACLs that allow statistics users to customize their own displays