User:Jskladan/Sandbox:Depcheck

= Depcheck process =

The Goal
The goal is not to catch _all_ problems, but to make sure, that _average user_ does not suffer from dependency problems.

This is quite important statement to consider. The thing is, that we are unable to check absolutely everything - i.e. whether update XYZ does not potentialy break some installation.
 * 1) That would require us to test against every single possible combination of packages (currently around 12k).
 * 2) There always will be some conflicts - try "yum install '*'" - it's not even possible to install all the packages at once (conflicts, obsoletes, ...).

What we do want to check is These two are quite interesting points to discuss.
 * 1) Whether there is _some_ way to install the package
 * 2) Whether it can be installed on _average system_

The first one describes the fact, that if a package should be pushed from 'pending' to 'updates', than all the dependencies must be present. We don't care too much, whether a half of already installed packages needs to be removed in order to install the package, because we can't evaluate the idea behind the update (take initv x systemd for example, if we would check for conflicts/obsoletes with the 'core' stuff, depcheck would deny systemd, because it removes initv). But we need to make sure, that there is a way to install the package. Whether the system is bootable, or if the 'removals' caused by the update/installation are reasonabele is not a subject of depchecking.

The second point means, that we do not want to check all possible combinations, and look for the one particular situation which does not allow the package to install. As in the previous point, we need to understand, that the very nature of packaging/repositories is, that not all packages can be installed at once (because if this would not apply, then no depchecking would be required). From the first point, or we can call it 'first stage', we know that the package can be installed. Now we want to check, whether two packages 'commonly installed next to each other' do not suddenly collide. To achieve this, we'll have a 'standard installation', and we'll take it as a base for the upcoming 'installations'.

This is very different from the way Depcheck functions right now. At the moment, Depcheck assumes that 'all packages are installed' (correct me if i'm wrong), and then installs the whole 'pending' at once. Not caring about 'conflicts', 'obsoletes', etc. The original idea was: "It does not matter, that we install all at once, yum will all sort it out for us". We found out, that it's not as simple :)

Of course, we might want to do some specific tests. Like when we detect 'obsoletes' in the dependencies, we can add the obsoleted package to our 'average installation', and check that the 'upgrade' is seamless.

There is also a problem with 'Foo requires Bar == 1.3' in stable, and having a update for Bar in pending. We could list all packages which have Bar as a requirement, install them, and try to update Bar from pending. But some of the packages requiring Bar can conflict each other, and then we'd need to do more 'runs' so we can check all the packages currently in stable... And this goes back to the initial "we can't check it all". For this particular problem, we could do some 'best try' and check the requirements manualy - i.e. make a list of packages which can function with the new Bar, and those which can not by traversing their dependency information and comparing it to the new Bar version. But we still won't be able to see the 'hidden dependencies' and other stuff. But _this is OK_.

There are plenty of other potential problems, but we need to step aside from those (at least for the beginning), and set a reasonable goal for depcheck. Then if we have quite a lot of false positives tied to some particular situation (like the exact version requirement), we can make filters to catch the potentialy problematic packages, and do some additional testing on them. But most of it always boils down to the 'what is the beginning state of our system'.

I (jskladan) belive that it would be for the best to develop the 'Simple Depcheck' (as kparal and I call the alternative implementation) incrementaly. Make a simple base, which just checks whether a package can be 'somehow' installed, not caring about how reasonable the removals/requirements are. Do not check intensions, check functionality.

Also, we'd like to cooperate with yum developers, and convince them into creating a stable, documented API suitable for dependecy checking. I.e. something depcheck uses the undocumented yum magic for. We'd like to have a method to set up virtual package sack, give it a list of packages to 'install', make 'snapshots' of the packagesacks to be able to unroll after 'installing' broken package, and of course a structured output which could be used to detect the root of problem with broken package.

What do we want from depcheck?

 * 1) Subset of packages, which can be installed.
 * 2) Set of 'uninstallable' packages, ideally with logs ;-)
 * 3) We want to be able to do all the checks 'virtually' just using pkgdb & yum api (i.e. without performing an actual package installation)

Simple process
Data
 * All_packages = dictionary of all packages in pending. Contains information about 'installability' (i.e. whether the package is at least in one installable subset)
 * Runs = List of all packages 'installable' in that run

Data structures than can be imagined like this:
 * All_packages = {package_name: {OK: None, was_first = False}, ...}
 * Where OK is boolean indicating, whether the package
 * None - package was not yet a part of any run
 * True - was installed in some installable subset
 * False - package had problems with installation
 * Also for the OK, the update-path is None -> False -> True (i.e. None can change to both True or False, False can change to True, True can not be changed)
 * was_first is a helper, to track whether the package was ever "at the start" of the installable subset creation. This is important for deciding when to end the depcheck process.
 * Runs = [[package_name, ...], ...]
 * Stores the installable subset of the respective run

Algorithm

 * 1) Filter packages from all_packages (let's call the output from this filter undecided)
 * 2) * Take packages which have OK set to None or False
 * 3) * Remove packages which have the was_first mutex set to True
 * 4) If undecided is empty:
 * 5) * End Depcheck
 * 6) Pick random package from undecided
 * 7) Set the was_first mutex to True
 * 8) For each package in undecided (starting with the randomly selected package):
 * 9) * Try to install the package
 * 10) * If installation passed:
 * update the all_packages[package]['OK'] to True
 * 1) * If installation failed:
 * update the all_packages[package]['OK'] to False (if previous state was True, than it is not updated)

Example
Let's have 'pending' packages A..F with these dependencies:
 * A requires B
 * A requires C
 * A conflicts with D
 * D requires C
 * E has no dependencies at all
 * F requires XYZ which is not present in any repo

= Brainstorming, Phase 2 - Three staged depcheck =

Kamil had an excellent idea for depcheck - kind of a 'reversed look on the problem'. Instead of having all the packages from stable/updates 'installed', and then trying to 'install' the whole pending at once, we can achieve the same level of testing by installing one package from pending (having repositories enabled) and then installing each package from the stable/updates/pending set (ony by one, uninstalling the previously installed) to check whether there is a problem or not. This effectively solves the issue with inter-conflicting packages in stable/updates, and still catches all possible dependency issues.

The process
The process has three stages, the first one simply checks, whether package can be installed.

Next step test interactions with the rest of the packages. The tested package is installed, and then the rest of stable/updates is installed ('install test').

The last stage tests 'regression' - e.g. if package failed to install, then we'd like to test whether the 'already stable' version would fail in the same way. If the 'old' version did not fail, and the updated does, than we can report it differently from the situation when both 'stable' and 'pending' packages fail to install with one particular other package.

''The huge advantage of "staged" approach is simpler determination of 'failure points'. We'll know which stages passed and which failed, and based on this, we can examine the results/package with better precision instead of scanning the whole report.''

Stage 1 - Vacuum
First of all, we need to test, whether the package is 'stand-alone installable'. By this, I mean a state, when no packages at all are installed, all the respective repositories are enabled, and  ran.

This will test, whether the package itself can be installed - i.e. has no broken dependencies. Of course, this won't detect any collisions with the rest of the packages (like using conflicts instead of obsolete, etc.), but is a corner-stone for the next stages of depchecking.

If the package fails to install in 'vacuum', then we stop the test and report broken package & failed depcheck.

Algorithm:
 * create vacuum (yum instance with enabled repositories, but no packages installed)
 * install tested_package
 * if installation passed
 * proceed to stage 2
 * else
 * report broken package

Stage 2 - Install
This step will simulate the state, when a package is installed on the system, and then some 'already stable/updates' packages are installed on top of that. While saying 'all packages from stable/updates/pendig' it's important to realize, that when you say "install Foo" yum installs the newest version in all the repositories (can be overriden by specific version, of course, given that version is in some repo).

Fail in this state should catch events, where (for example) the update changes "Provides", so applications requiring it will fail to install.

Algorithm:
 * create vacuum (yum instance with enabled repositories, but no packages installed)
 * install tested_package
 * Create Snapshot of the yum object state
 * For each package in stable/updates/pending (meaning these repos are enabled, and we're installing 'names' not 'name + version'):
 * install package
 * if installation failed:
 * store the name of failed package & log from depsolving process
 * revert to Snapshot (effectively 'uninstalling' the package)

If tested_package fails in this test, we store its name & log, and will proceed to Stage 3

Stage 3 - Regression check
Stage 3 installing the 'old' version of tested_package, which is already in stable/updates and inspecting whether the 'old' version also failed to install with this particular package. This should point out 'new' errors like the 'pidgin requires gtk & libpurple, libpurple conflicts gtk' scenario for instance (not sure if/when kparal will submit the use-cases), or not, and detect change of state.

Algorithm:
 * create vacuum (yum instance with enabled repositories, but no packages installed)
 * disable the 'pending' repo
 * Create Snapshot (empty)'
 * For each package which failed Stage 2 (meaning we're installing 'names' not 'name + version'):
 * install package (effectively installing it's 'stable' version
 * Create Snapshot (one_package) of the yum object state
 * for each package it failed to install with in Stage 2:
 * install the "failed-with" package
 * if installation failed:
 * no change. Report like "info"
 * if installation passed:
 * update changes state -> new error -> fail depcheck & report failed update
 * revert to Snapshot (one_package) (effectively 'uninstalling' the package)
 * revert to Snapshot (empty) (reseting the state of yum to 'vacuum')