ccf794d doc: first draft of architecture document — radicle-native-ci

radicle-native-ci

rad:z3qg5TKmN83afz2fj9z3fQjU8vaYE

Radicle CI adapter for native CI

doc: first draft of architecture document

Lars Wirzenius committed 2 years ago

commit ccf794d2a00a837f078d65d0c982a814ad4acbd9
parent e3007d4

5 files changed +253 -0

added doc/.gitignore

@@ -0,0 +1,2 @@

 *.html
 *.svg

added doc/Makefile

@@ -0,0 +1,15 @@

 .SUFFIXES: .uml .svg .pik .md .html
 .md.html:
 	pandoc --toc --standalone --self-contained $< -o $@
 .uml.svg:
 	plantuml -tsvg --output=. $<
 .pik.svg:
 	pikchr-cli $< > $@.tmp
 	mv $@.tmp $@
 all: architecture.html
 architecture.html: architecture.svg test.svg

added doc/architecture.md

@@ -0,0 +1,200 @@

 ---
 title: Radicle native CI
 subtitle: Requirements and architecture
 ...
 # Introduction
 This document explains the purpose of the Radicle native CI component,
 the requirements put on it, and its software architecture.
 # Overview
 CI support in Radicle consists of several components. For native CI
 they are:
 * the Radicle node
 * the CI broker
 * the native CI executable
 These all have to run on the same host: the node and broker
 communicate via a Unix domain socket, and the broker spawns the native
 CI executable.
 See the CI broker architecture documentation for a more in-depth
 description of CI in Radicle.
 The native CI executable, which the broker spawns as a child process,
 is called "the CI adapter" in this document.
 Native CI works like this:
 * reads a request message from its standard input
 * clones the git repository in the request
 * switches to the commit in the request
 * reads the `.radicle/native.yaml` file in the repository
 * writes a response message to its standard output, saying that it
   has started a run
 * executes the shell snippet in the `.radicle/native.yaml` file
 * writes a response message with the result of the run
 * writes a log file based on what it did
 * updates the `index.html` page that lists all CI runs and their
   results
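
The step that executes the shell snippet could be sketched roughly as follows. This is only an illustration, not the actual implementation: the names `SnippetOutput` and `run_snippet` are hypothetical, and the sketch assumes that `bash` is on the `PATH` and that the snippet is passed to it with `-c`.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Captured result of running a shell snippet (hypothetical type).
#[derive(Debug)]
struct SnippetOutput {
    stdout: String,
    stderr: String,
    /// `None` if the process was terminated by a signal.
    exit_code: Option<i32>,
}

/// Run `snippet` with `bash -c` in `workdir` and capture what it prints.
/// Assumption: the snippet is the text of the `shell` field.
fn run_snippet(snippet: &str, workdir: &Path) -> io::Result<SnippetOutput> {
    let output = Command::new("bash")
        .arg("-c")
        .arg(snippet)
        .current_dir(workdir)
        .output()?; // waits for bash to exit, collecting stdout and stderr
    Ok(SnippetOutput {
        stdout: String::from_utf8_lossy(&output.stdout).into_owned(),
        stderr: String::from_utf8_lossy(&output.stderr).into_owned(),
        exit_code: output.status.code(),
    })
}
```

A caller would pass the directory of the local clone as `workdir` and treat any non-zero or missing exit code as a failed run. The sketch does not enforce a time limit on the snippet.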
 ## Native CI
 ![Sequence diagram for native CI](architecture.svg)
 The diagram above shows the happy path. Various things can go wrong
 after the native CI executable has started; they are listed below.
 (In this document we don't need to consider other possible failures.)
 The test suite for native CI verifies that each of them is handled
 correctly, either by testing the case explicitly or by relying on
 analysis showing that generic error handling copes with it. See the
 `test-suite` program in the source tree.
 * the environment variable specifying the configuration file is not
   set
 * can't read or parse the configuration file
 * the configuration file does not specify all mandatory fields
 * the configuration file specifies values that are wrong in some way
 * stdin is empty
 * stdin does not contain a newline
 * the first line of stdin can't be parsed as a message serialized as
   JSON
 * the message is not a trigger message
 * the repository triggered does not exist
 * the repository can't be cloned
 * the repository does not have the requested commit
 * the repository does not contain `.radicle/native.yaml`
 * `native.yaml` can't be read or parsed as YAML
 * `native.yaml` does not contain a text field `shell`
 * writing first response to stdout fails
 * there is any problem executing the contents of the `shell` field
   using `bash`
 * executing the shell snippet takes too long
 * generating or writing a "run metadata" file fails
 * writing second response to stdout fails
 * finding or parsing all run metadata files fails
 * generating or writing the static web pages listing all runs fails
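
As an illustration of how a few of these cases might be told apart, here is a minimal sketch of reading the request line. The `RequestError` type and the `read_request_line` function are hypothetical names used for this example only. Parsing the line as JSON is left out.

```rust
use std::io::{self, BufRead};

/// Ways in which reading the request can fail (hypothetical type).
#[derive(Debug)]
enum RequestError {
    /// stdin is empty
    EmptyStdin,
    /// stdin does not contain a newline
    NoNewline,
    /// reading failed, or the input was not valid UTF-8
    Io(io::Error),
}

/// Read the first line of `input`, without its trailing newline.
fn read_request_line<R: BufRead>(mut input: R) -> Result<String, RequestError> {
    let mut line = String::new();
    let n = input.read_line(&mut line).map_err(RequestError::Io)?;
    if n == 0 {
        // End of file before any bytes were read.
        return Err(RequestError::EmptyStdin);
    }
    if !line.ends_with('\n') {
        // End of file was reached before a line terminator.
        return Err(RequestError::NoNewline);
    }
    line.pop(); // drop the newline
    Ok(line)
}
```

In the real program the reader would be a lock on standard input, for example `std::io::stdin().lock()`. Taking any `BufRead` makes each case easy to exercise from a test with an in-memory byte slice.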
 # Requirements
 Overall, the native CI engine, or adapter, is very simple. However, it
 must be robust, which makes things more difficult. Here, robust means
 that whatever happens, the node owner finds out what it was. If a run
 fails for whatever reason, the node owner can figure out why. Ideally,
 anyone watching CI on the node can see it as well.
 In the descriptions of the requirements we use the following roles:
 * "developer" makes changes to the repository on which CI is run
 * "node owner" runs the node itself
 The native CI engine has several ways to report what it does:
 * its standard error output
   - in a systemd setup this is captured to the system log or journal
 * a per-node log file for native CI
   - this is for things that interest only the node owner, not the
     developer
   - e.g., finding configuration errors that the developer can't fix,
     such as a missing configuration file
 * a per-run log file
   - this is of interest to both the node owner and the developer
   - this is the primary tool for the developer to figure out what went
     wrong in their CI run, so that they can change their repository to
     fix it
 ## Developer can see the status of each CI run on a node
 _Requirement:_ The developer can see what CI runs a node has
 triggered, and what the current status of each is.
 _Justification:_ This lets them be reassured that CI is working.
 _Implementation:_ Native CI maintains one or more web pages that list
 every run. For each run, the following is recorded:
 ## Developer gets a useful run log
 _Requirement:_ The developer can fetch a useful log of a run that
 helps them find problems in their code.
 _Justification:_ This is crucial for the developer to have any hope of
 fixing a problem found in CI.
 _Implementation:_ The run log is a static file that can be fetched via
 HTTP from the node, or viewed in a web browser. The run log contains
 at least the following information:
 * the repository ID
 * the repository alias, if one is known to the local node
 * the commit ID that triggered the run
 * the commit diff (`git show`)
 * when the run was triggered
 * when the run finished
 * the environment variables of the native CI process
 * every command or other action that was taken during the run
 * the standard output and standard error output, and the exit code, of
   every command
 * whether the run was considered successful or not
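
The items listed above could be collected in a data structure along these lines. This is a sketch under assumptions: the type names, the field names, the use of strings for timestamps, and the plain-text layout produced by `render` are all made up for illustration and are not taken from the implementation.

```rust
/// One command run during a CI run, with its captured output.
struct Action {
    command: String,
    stdout: String,
    stderr: String,
    exit_code: i32,
}

/// Everything the run log is required to contain (hypothetical type).
struct RunLog {
    repo_id: String,
    /// Only present if the local node knows an alias.
    repo_alias: Option<String>,
    commit_id: String,
    /// Output of `git show` for the commit.
    commit_diff: String,
    triggered_at: String,
    finished_at: String,
    env: Vec<(String, String)>,
    actions: Vec<Action>,
    success: bool,
}

impl RunLog {
    /// Render the log as plain text (the layout is an assumption).
    fn render(&self) -> String {
        let mut out = format!("repository: {}\n", self.repo_id);
        if let Some(alias) = &self.repo_alias {
            out.push_str(&format!("alias: {alias}\n"));
        }
        out.push_str(&format!("commit: {}\n", self.commit_id));
        out.push_str(&format!("triggered: {}\n", self.triggered_at));
        out.push_str(&format!("finished: {}\n", self.finished_at));
        out.push_str(&format!("diff:\n{}\n", self.commit_diff));
        for (name, value) in &self.env {
            out.push_str(&format!("env: {name}={value}\n"));
        }
        for a in &self.actions {
            out.push_str(&format!(
                "$ {}\nexit code: {}\nstdout:\n{}\nstderr:\n{}\n",
                a.command, a.exit_code, a.stdout, a.stderr
            ));
        }
        let verdict = if self.success { "success" } else { "failure" };
        out.push_str(&format!("result: {verdict}\n"));
        out
    }
}
```

Keeping the data separate from its rendering means the same structure could feed both the per-run log and the page that lists all runs.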
 ## Node owner is informed via system log if CI fails early
 _Requirement:_ If a native CI run fails early, it writes a message to
 its standard error output.
 _Justification:_ The standard error is captured by systemd, and
 written to the system log or journal, from where the node owner can be
 expected to find it. This gives them a chance to find out what's wrong
 and hopefully fix it.
 "Early" here means any time before the broker has been given a
 "result" response message, and a per-run log file has been created,
 and the web page of all CI runs has been updated.
 _Implementation:_ Use a suitable Rust logging library, with the
 default log level allowing only error messages, and only logging an
 error if something goes wrong early.
 ## Only early failures are logged to the system log
 _Requirement:_ The native CI engine only writes to its standard error
 output when it fails early. Otherwise it only updates its per-run and
 per-node log files.
 _Justification:_ It's easy to spam the system log with many useless
 messages, which make it harder to find important information in the
 log.
 ## The per-node log is updated when an early error occurs
 _Requirement:_ If native CI writes an error message to the standard
 error output, it is also written to the per-node log, with more
 detail.
 _Justification:_ The system log is a bad place to report detailed
 information, as it's quite constrained. A per-node log provides more
 flexibility.
 _Implementation:_ Append to a per-node log, and if that fails, report
 that, too, to the standard error output.
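
A minimal sketch of this behaviour, using only the standard library rather than a logging library, might look like the following. The function name `report_early_error`, its parameters, and the message format are assumptions made for this example.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Report an early failure: a short message to `stderr`, and the same
/// message plus `detail` appended to the per-node log at `node_log`.
/// `stderr` is a parameter so that tests can pass a buffer instead.
fn report_early_error<W: Write>(stderr: &mut W, node_log: &Path, summary: &str, detail: &str) {
    // Short message; under systemd this ends up in the journal.
    // If even this write fails there is nowhere left to report it.
    let _ = writeln!(stderr, "ERROR: {summary}");

    // Append the detailed message, creating the log file if needed.
    let appended = OpenOptions::new()
        .create(true)
        .append(true)
        .open(node_log)
        .and_then(|mut file| writeln!(file, "ERROR: {summary}\n{detail}"));

    // If the per-node log can't be updated, report that as well.
    if let Err(e) = appended {
        let _ = writeln!(
            stderr,
            "ERROR: can't append to per-node log {}: {e}",
            node_log.display()
        );
    }
}
```

In the real program the first argument would be `std::io::stderr()`.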
 # Test architecture
 In order to test the native CI engine, we invoke it in various ways,
 and examine its outputs.
 ![Test setup](test.svg)
 In order for the native CI engine to work, it needs to clone from a
 node. This is awkward for testing. Using a real node is possible, but
 introduces more moving parts that can fail during tests. Using a test
 double, or mock, in place of the node would also be possible, but it
 would be more work, and the somewhat tricky logic involved would be
 likely to introduce bugs.
 We implement the test suite to use a specially set up local node that
 has a repository with the contents the tests need. We will create the
 node as part of the test suite so that it has exactly the content we
 need for the tests.
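
The harness side of such a test could be sketched as below: it starts a program, sends the request on its standard input, and collects what it writes. The function name `invoke_adapter` is hypothetical. Setting up the local node, and any environment variables the engine needs, is not modelled here.

```rust
use std::io::{self, Write};
use std::process::{Command, Output, Stdio};

/// Run `program` with `args`, feed it `request` on stdin, and return
/// its exit status together with its captured stdout and stderr.
fn invoke_adapter(program: &str, args: &[&str], request: &str) -> io::Result<Output> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    {
        // Write the request, then drop the handle at the end of this
        // block to close the pipe, so the child sees end of file.
        let mut stdin = child.stdin.take().expect("stdin was piped");
        stdin.write_all(request.as_bytes())?;
    }
    // Wait for the child to exit while draining stdout and stderr.
    child.wait_with_output()
}
```

A test would call this with the path of the native CI executable and a one-line request, then examine `stdout` for the response messages and `stderr` for any early error. Writing the whole request before reading any output is acceptable only because the request is a single short line.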

added doc/architecture.uml

@@ -0,0 +1,20 @@

 @startuml
 participant "CI broker" as broker
 participant "Native CI" as ci
 participant "Radicle \n node" as node
 participant "Repository \n (local clone)" as repo
 participant "/bin/bash" as shell
 participant "index.html" as index
 broker -> ci : request message
 ci -> node   : git clone
 node -> repo
 ci -> repo   : read native.yaml
 repo -> ci
 ci -> broker : response: triggered
 ci -> shell  : execute desired commands
 shell -> ci  : stdout, stderr, exit
 ci -> broker : response: result
 ci -> index  : generate run index web page
 @enduml

added doc/test.uml

@@ -0,0 +1,16 @@

 @startuml
 participant "Test harness" as harness
 participant "Native CI" as ci
 participant "Local node" as node
 participant "/bin/bash" as shell
 harness -> ci : invoke
 harness -> ci : request via stdin
 ci -> node    : git clone
 ci <- node
 ci -> shell   : run build
 ci <- shell   : stdout, stderr, exit
 harness <- ci : response via stdout
 @enduml