rad:z3gqcJUoA1n9HaHKufZs5FCSGazv5
Radicle Heartwood Protocol & Stack
radicle-metrics
Open · lorenz opened 10 months ago · labels: crate=radicle, crate=radicle-metrics

Motivation

Generally, one might want to track the usage of Radicle and the activity on the Radicle network over time.

One may distinguish metrics at various levels:

  1. By repository (or even more fine grained), such as: How active is a particular repository?
  2. By node: How busy is a particular node in the network? How much bandwidth does it use?
  3. By network: How active is a set of nodes? How much data are they exchanging, how long does it take a change to propagate to all nodes? How many nodes are participating in the network?

Generally, these types of metrics can build on top of each other: “by node” metrics might to some degree aggregate “by repository” metrics, and “by network” metrics might to some degree be derived from “by node” metrics.

Still, it might be beneficial to also be able to collect “by repository” metrics without a running node (requiring only access to the repository in Radicle storage), just as one would not expect to need access to the whole network to collect “by node” metrics.

We propose to add the radicle-metrics crate to heartwood, which for now is focused solely on “by repository” metrics, and possibly aggregation of these to “by store” for convenience.

Then, by collecting these metrics from a well-connected node, such as iris.radicle.xyz, we still can get a pretty good picture of activity.

The crate should avoid I/O as much as possible, so that its output can be served via HTTP but also written to a plain file.

I want to track the usage of Radicle and the activity on the network over time.

Scope

The scope of the crate is per-repository. It depends on radicle, to make access to a Git repository by RID easy.

It does not depend on radicle-cli.

It decidedly does not depend on a Radicle node running or present, i.e. it does not require a node.db file or similar.

Usage

Library

A function that takes a Radicle storage and an RID and produces (iterates?) structs.
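One possible shape for the structs this function would produce, sketched as self-contained Rust. All names here are hypothetical and not part of any existing radicle API; the real function would take a storage handle and an RID and fill one such struct per repository:

```rust
/// Hypothetical per-repository collaboration metrics ("by repository" level).
/// Field names mirror the "Data to collect" list above.
#[derive(Debug, Default, Clone, Copy)]
pub struct RepoMetrics {
    pub issues: u64,
    pub issues_closed: u64,
    pub patches: u64,
    pub patches_merged: u64,
    pub issue_comments: u64,
    pub patch_comments: u64,
}

impl RepoMetrics {
    /// Render the metrics as Prometheus text-exposition lines,
    /// labelled with the repository id. Metric names are placeholders.
    pub fn to_prometheus(&self, rid: &str) -> String {
        let mut out = String::new();
        for (name, value) in [
            ("radicle_issues_total", self.issues),
            ("radicle_issues_closed_total", self.issues_closed),
            ("radicle_patches_total", self.patches),
            ("radicle_patches_merged_total", self.patches_merged),
            ("radicle_issue_comments_total", self.issue_comments),
            ("radicle_patch_comments_total", self.patch_comments),
        ] {
            out.push_str(&format!("{name}{{rid=\"{rid}\"}} {value}\n"));
        }
        out
    }
}

fn main() {
    let m = RepoMetrics { issues: 3, issues_closed: 1, ..Default::default() };
    print!("{}", m.to_prometheus("rad:z3gqcJUoA1n9HaHKufZs5FCSGazv5"));
}
```

Keeping the struct free of any storage or node handles is what makes the no-I/O goal above achievable: computation and rendering stay separate from how the data is obtained.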

Binary

A rad-stats command that takes an RID and produces output in the Prometheus Exposition format.
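For illustration, the output in the text exposition format might look like the following (metric names are placeholders, to be settled according to Prometheus naming conventions; the counts are made up):

```
# HELP radicle_issues_total Number of xyz.radicle.issue COBs in the repository.
# TYPE radicle_issues_total gauge
radicle_issues_total{rid="rad:z3gqcJUoA1n9HaHKufZs5FCSGazv5"} 12
# HELP radicle_patches_merged_total Number of xyz.radicle.patch COBs with status "merged".
# TYPE radicle_patches_merged_total gauge
radicle_patches_merged_total{rid="rad:z3gqcJUoA1n9HaHKufZs5FCSGazv5"} 7
```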

Specification

Data to collect

  • Natural number of COBs of type xyz.radicle.issue
  • Natural number of COBs of type xyz.radicle.issue with status “closed”
  • Natural number of COBs of type xyz.radicle.patch
  • Natural number of COBs of type xyz.radicle.patch with status “merged”
  • Natural number of comments on COBs of type xyz.radicle.patch
  • Natural number of comments on COBs of type xyz.radicle.issue

The intention is for downstream tooling to combine these metrics freely to serve as a proxy of “collaboration activity”.

All of these metrics are time series. That is, for every metric we store data points that have two dimensions:

  1. Time
  2. Value of metric

Example: Alice and Bob collaborate on a repository. First, Alice creates an issue. The next day, Bob comments on the issue (to inform Alice that he will start work on it). It takes him two days to do the coding. Then he creates a patch with the intent to resolve the issue. Alice comments on the patch to thank Bob and merges it. Finally, Alice closes the issue.
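For illustration, this interaction could produce the following data points (one per day; the day numbering and the assumption that Alice comments, merges, and closes on the same day are my own reading of the example):

```
day 1 (Alice opens the issue):        issues=1
day 2 (Bob comments on the issue):    issues=1, issue_comments=1
day 3 (Bob is coding):                unchanged
day 4 (Bob opens the patch; Alice
       comments, merges, closes):     issues=1, issues_closed=1, issue_comments=1,
                                      patches=1, patches_merged=1, patch_comments=1
```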

As output format we use the Prometheus Text Exposition Format.

The names of the metrics are to be decided, considering best practices according to the Prometheus project.

fintohaps commented 10 months ago

Interesting idea, but my first reaction is to think that metrics is usually end-application focused. The second thought is that this is almost like the COB cache, but with a time series.

So I’m wondering if introducing COB events into the node events stream might solve part of this, then the metrics side is just a time-series DB that aggregates over that stream. Caveat being that you said this isn’t going to require the node, but maybe that is too much of a restriction? I guess I’d like to know how you would foresee designing this with just storage in mind?

yorgos-laptop commented 10 months ago

All of these metrics are time series. That is, for every metric we store data points that have two dimensions: Time Value of metric

I would like to propose we don’t store anything. Metrics only need to be exposed. Storing can happen in other purpose-built tools, after metrics have been scraped. Is “store” perhaps a typo here?

yorgos-laptop commented 10 months ago

Example: Alice and Bob collaborate on a repository. First, Alice creates an issue. The next day, Bob comments on the issue (to inform Alice that he will start work on it). It takes him two days to do the coding. Then he creates a patch with the intent to resolve the issue. Alice comments on the patch to thank Bob and merges it. Finally, Alice closes the issue.

I didn’t understand how this example relates to the rest of the description… 🤔

lorenz commented 10 months ago

Interesting idea, but my first reaction is to think that metrics is usually end-application focused.

Not sure what you are implying here. Please elaborate in which way this would affect the proposal.

The second thought is that this almost like the COB cache, but with a time-series.

Yup.

So I’m wondering if introducing COB events into the node events stream might solve part of this, then the metrics side is just a time-series DB that aggregates over that stream.

Firstly, this would only apply to new data. For an event that happened two weeks ago, you would not get such an event when running the node today, right? Or are you thinking about a way to replay events? I think that node events could be a way to update an already existing time series, and this could complement the approach suggested above.

But fundamentally I disagree that just computing things on data that is already accessible via Git and storage should depend on radicle-node in some way. What I suggest is to implement a tool that transforms a repository into multiple time series. It can run without the node, and it can run offline. And then we may think about making updates to these time series smarter, hopefully by consuming events.

Caveat being that you said this isn’t going to require the node, but maybe that is too much of a restriction?

It is too much of a restriction for metrics that actually talk about properties of the node. I think it is a perfectly fine restriction for metrics that only depend on storage.

I guess I’d like to know how you would foresee designing this with just storage in mind?

I don’t understand the question, sorry. It is perfectly possible to walk all repositories, and all COBs in all repositories, and compute a time series for predefined metrics while doing that.

lorenz commented 10 months ago

I would like to propose we don’t store anything. Metrics only need to be exposed.

How can you “expose” data without storing it anywhere? For the integration with radicle-httpd we’re talking about storing the time series data in memory, and in a first iteration caching it for a relatively long period. Since the proposal does not specify any incremental computation of metrics, only one that requires walking commits in potentially many Git repositories, and having metrics that are up to date to the last second is not so important, I would suggest that we cache these metrics so that they are recomputed at most once every 30 minutes or 1 hour.

And in the future, we might get better at incremental updates of the metrics, so that we can allow shorter caching times, or even just computing the latest increment on the fly for new requests.

Storing can happen in other purpose-built tools, after metrics have been scraped.

Sure, a time series DB could scrape the endpoint every few minutes.

Is “store” perhaps a typo here?

No.

lorenz commented 10 months ago

I didn’t understand how this example relates to the rest of the description… 🤔

I am supposed to edit in the time series data that results from this interaction, so that we get a sense of how interaction on Radicle maps to a particular time-series-shaped data set.

I just had to get off a train at that point while writing the issue.

yorgos-laptop commented 10 months ago

How can you “expose” data without storing it anywhere?

Well, one way is by simply - and only - printing out a computed value. Whoever scrapes this output is then responsible for parsing and storing this value in the time series db.

For example, if there was a rad metrics command:

  • that could just print to stdout,
  • then, this output could be redirected to a file,
  • then, something like node-exporter’s textfile collector [1] can scrape those values and import them into prometheus.

With a setup like this, there is no requirement for any type of time series data being stored on the metric-generator side.
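Concretely, such a pipeline could look like the following sketch (assuming the hypothetical rad metrics subcommand; the target directory is whatever node_exporter’s --collector.textfile.directory points at):

```
# e.g. from a daily cron job during low-traffic hours:
rad metrics > /var/lib/node_exporter/textfile/radicle.prom.tmp
mv /var/lib/node_exporter/textfile/radicle.prom.tmp /var/lib/node_exporter/textfile/radicle.prom
# the rename is atomic, so node_exporter never reads a half-written file
```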

I am proposing removing both the incremental computation and all the caches from the scope of this first iteration.

In terms of non-functional requirements here, I don’t think there is any particular performance requirement. Running the metrics generation script is probably something that I’d run once a day, during “low traffic hours”, and I’d be very happy to have something that computes these values in “minutes”. I don’t need this to finish in “seconds”. Once it starts taking “hours” we can reconsider.

[1] - https://github.com/prometheus/node_exporter?tab=readme-ov-file#textfile-collector

rudolfs commented 10 months ago

We already have exactly that: rad stats.

yorgos-laptop commented 10 months ago

I am aware (it is also mentioned in the issue description ;) ). I used a different subcommand on purpose to avoid implying that the content returned should be the same.

z6MkgFq6...nBGz commented 9 months ago

I would like to collect metrics that might enable us to draw conclusions about performance such as:

  • number of inbound / outbound connections
  • limits and relay config
  • usage of environment variables e.g. RUST_BACKTRACE

Would these be covered on the “by node” level?