From Fedora Project Wiki

Hardlink identical files in packages by default

Summary

A post-build step is added to the package build macros to automatically hardlink all identical files under /usr. Previously, this was done only in some packages; now it is done everywhere by default.

Owner


Current status

Detailed Description

Files can be hardlinked at the end of the %install step in package builds. rpm supports this and will preserve those links in the binary rpm and during installation. This makes the installation a bit more efficient. Hardlinking of read-only files is generally transparent to the user, but has some small benefits: the files are not duplicated in the file system; backup, copy, and search programs will usually make use of the link information and not process the same inode twice. Thus, it's good to hardlink as many packaged files as possible.

Previously, hardlinking was done automatically for a subset of files in Python packages (via the %__os_install_post_python macro), and explicitly in some packages with lots of similar files (usually via the hardlink program).

The %__os_install_post macro is extended to automatically hardlink all identical files under %{buildroot}%{_prefix}, i.e. the /usr directory in packages. It calls a new helper binary (part of the add-determinism package) that does the linking.

Hard links may be confusing if the file is modified. In particular, all links to the same inode share the same ownership and permissions, and obviously the same contents. Thus, we want to apply hardlinking only to files under /usr, which are generally read-only in packages.
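
The shared-inode behaviour described above can be seen in a few lines of Python. This is only an illustrative sketch; the file names and the function name demo_shared_inode are made up for the example and have nothing to do with the helper itself.

```python
import os
import tempfile


def demo_shared_inode():
    """Create a file plus a hard link, then change it through one name only."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")  # hypothetical file names
        b = os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("original\n")

        os.link(a, b)  # b becomes a second name for the same inode
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        link_count = os.stat(a).st_nlink

        # Change permissions and contents through the first name only.
        os.chmod(a, 0o600)
        with open(a, "a") as f:
            f.write("changed\n")

        # Both changes are visible through the second name.
        mode_b = os.stat(b).st_mode & 0o777
        with open(b) as f:
            content_b = f.read()
        return same_inode, link_count, mode_b, content_b
```

Because an in-place change through one name affects every other name, linking is only safe for files that are not expected to be edited, which is why the Change restricts itself to /usr.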

When files are hardlinked, mtime (the modification timestamp) is taken into account. Only files with identical mtime, owner, group, and mode are subject to linking. The new program written to do the linking takes $SOURCE_DATE_EPOCH into account, and will clamp mtimes to it before comparing.
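
The comparison rule above can be sketched roughly as follows. This is not the helper from add-determinism, whose implementation is not shown here; the function names, the use of SHA-256 to compare contents, and the temporary-name strategy are all assumptions made for illustration. The sketch clamps mtimes only for the purpose of comparison and does not rewrite timestamps on disk.

```python
import hashlib
import os
import stat
import tempfile
from collections import defaultdict


def link_key(path, epoch):
    """Attributes that must all match before two files may be linked."""
    st = os.lstat(path)
    mtime = int(st.st_mtime)
    if epoch is not None:
        mtime = min(mtime, epoch)  # clamp: treat nothing as newer than the epoch
    with open(path, "rb") as f:  # fine for a sketch; reads the whole file
        digest = hashlib.sha256(f.read()).hexdigest()
    return (mtime, st.st_uid, st.st_gid, st.st_mode, st.st_size, digest)


def hardlink_identical(root, epoch=None):
    """Replace identical regular files under root with hard links.

    Returns the number of files that were replaced by a link.
    """
    groups = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if stat.S_ISREG(os.lstat(path).st_mode):  # skip symlinks etc.
                groups[link_key(path, epoch)].append(path)

    linked = 0
    for paths in groups.values():
        keep = paths[0]
        keep_ino = os.lstat(keep).st_ino
        for dup in paths[1:]:
            if os.lstat(dup).st_ino == keep_ino:
                continue  # already linked
            tmp = dup + ".hardlink-tmp"  # hypothetical temporary name
            os.link(keep, tmp)
            os.replace(tmp, dup)  # atomically swap the copy for the link
            linked += 1
    return linked


def demo():
    """Link a tiny tree first without, then with, an epoch."""
    with tempfile.TemporaryDirectory() as root:
        for name, data, mtime in (("a", b"x", 2000), ("b", b"x", 3000), ("c", b"y", 2000)):
            path = os.path.join(root, name)
            with open(path, "wb") as f:
                f.write(data)
            os.utime(path, (mtime, mtime))
        without_epoch = hardlink_identical(root)
        with_epoch = hardlink_identical(root, epoch=1000)
        return without_epoch, with_epoch
```

In demo(), files a and b have the same contents but different mtimes, so they are linked only once both timestamps are clamped to the (made-up) epoch of 1000.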

Note: rpm correctly handles the case where a hardlink is between files in two different subpackages. Thus, we can hardlink files in the build root without regard to subpackage boundaries: rpm will store the files as hardlinked if they are in the same output package, adjusting the hardlink counts as appropriate.

Feedback

Benefit to Fedora

As described in the Detailed Description, hardlinking deduplicates the data in rpms and in installations. Backup, copy, and search programs will usually make use of the link information and not process the same inode twice. Thus, by hardlinking files in the packages we make things a bit more efficient. (The impact is small, because rpms generally don't have large duplicated files.)

Hardlinking of files was previously done in some packages explicitly, but it required adding a BuildRequires line and invoking a script, so it wasn't done very often. By handling this automatically, we'll be able to simplify those packages.

A caveat with explicit hardlinking as part of the package build is that newer hardlink versions use reflinks instead of hardlinks by default. (With a hardlink, one inode is connected to the file system tree in two or more places. With a reflink, some blocks of an inode are shared with another inode inside the file system, and the two inodes retain their separate identities.) rpm has no knowledge of reflinks, so reflinks created during the package build have no effect on the binary package and the payload is duplicated. Invocations of hardlink would have to be annotated with --reflink=never to retain the intended effect. By removing that step from packages we avoid this issue.

The Reproducible Builds effort reported that some packages that use hardlinking are not reproducible, see irreproducibility#22. When files are created in the package build, depending on how fast the build machine is, some files might or might not have identical timestamps. The tools that were used to compare files for hardlinking were general tools that did not "know" that we'd clamp the mtimes to $SOURCE_DATE_EPOCH in a subsequent step, so the results of the mtime comparisons were unstable. The tool that is added as part of this Change does the mtime clamping internally for reproducible results. Fixing this issue was the initial motivation for this change.
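
The instability can be illustrated with plain arithmetic. The sketch below is hypothetical: the function names and the epoch value are invented, and it models only the mtime part of the comparison, assuming that clamping means replacing any timestamp newer than $SOURCE_DATE_EPOCH with the epoch itself.

```python
def clamp_mtime(mtime, source_date_epoch):
    """Clamp a timestamp so that it is never newer than the epoch."""
    return min(mtime, source_date_epoch)


def mtimes_match(mtime_a, mtime_b, source_date_epoch=None):
    """mtime part of the "may these two files be linked?" decision."""
    if source_date_epoch is not None:
        mtime_a = clamp_mtime(mtime_a, source_date_epoch)
        mtime_b = clamp_mtime(mtime_b, source_date_epoch)
    return mtime_a == mtime_b


EPOCH = 1_700_000_000  # made-up SOURCE_DATE_EPOCH value

# Two identical files written during the build, i.e. after the epoch.
FAST_MACHINE = (EPOCH + 50, EPOCH + 50)  # both written within the same second
SLOW_MACHINE = (EPOCH + 50, EPOCH + 51)  # second file written one second later
```

Without clamping, mtimes_match(*FAST_MACHINE) is true and mtimes_match(*SLOW_MACHINE) is false, so the same sources produce different link structures on different machines. With the epoch passed as the third argument, both pairs compare equal and the outcome no longer depends on build speed.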

Scope

  • Proposal owners:
    • extend the add-determinism package with a little helper that does file comparisons and hardlinks identical files. The helper takes $SOURCE_DATE_EPOCH into account.
    • open pull request for redhat-rpm-config to insert a call to the helper in %__os_install_post.
    • open pull request for python-srpm-macros to drop their hardlinking step.
  • Other developers:
    • merge the pull requests
    • report issues if the hardlinking has unforeseen consequences or does not work correctly.
    • drop explicit calls to hardlink in their packages.
  • Release engineering:
  • Policies and guidelines: not needed, AFAICT.
  • Trademark approval: N/A (not needed for this Change)
  • Alignment with the Fedora Strategy:

Upgrade/compatibility impact

No impact.

Early Testing (Optional)

Build a package with an invocation of the new helper.

How To Test

Install packages rebuilt with the helper.
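
One possible way to check the result after installation is to look for regular files that share an inode. The following is a minimal sketch, not an official test tool; the function names are invented for this example, and the directory to scan is left to the tester.

```python
import os
import stat
import tempfile
from collections import defaultdict


def hardlink_groups(root):
    """Return lists of paths under root that share an inode (regular files only)."""
    by_inode = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                by_inode[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes seen under more than one name inside root.
    return sorted(sorted(paths) for paths in by_inode.values() if len(paths) > 1)


def demo():
    """Self-check on a throwaway tree: a and b are linked, c is not."""
    with tempfile.TemporaryDirectory() as root:
        a = os.path.join(root, "a")
        with open(a, "w") as f:
            f.write("data")
        os.link(a, os.path.join(root, "b"))
        with open(os.path.join(root, "c"), "w") as f:
            f.write("data")
        return [[os.path.basename(p) for p in group] for group in hardlink_groups(root)]
```

Calling hardlink_groups() on a directory owned by a rebuilt package (somewhere under /usr) should list the files that the helper linked; an empty result for a package known to ship identical files would suggest that the step did not run.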

User Experience

Not visible to users.

Dependencies

Contingency Plan

  • Contingency mechanism:
    • if hardlinking causes a problem in some specific packages, they can be trivially modified to skip the hardlinking step by setting a macro.
    • if there is a general problem, we can easily drop the macro in redhat-rpm-config.
  • Contingency deadline: any time, even after release. Any affected packages would have to be rebuilt.
  • Blocks release? No.

Documentation

The invocation of the helper will be documented inline in the macros files. Other documentation is not needed.

Release Notes

Package builds automatically hardlink identical files. This reduces the installation footprint a bit and also makes package builds more reproducible.