---
title: Data Collection Workflow
---

# Specification: Data Collection Workflow

## Goal

The collector aggregates information from GitLab (groups, projects, README files, issues) into a unified, enriched dataset that downstream renderers consume. It orchestrates calls to the [GitLab API Client](./spec_api_client.md) and produces structured *overview rows* (see [Model Mapping](./spec_model_mapping.md)).

---

## 1. Happy-Path Flow

1. Retrieve the list of groups visible to the configured API token(s).
2. For every group, retrieve projects.
3. For each project:
   * Determine the branch used for README lookup: prefer `default_branch` from project metadata; fall back to `main`.
   * Request `README.md` raw content; on 404 treat README as *missing*.
   * Request the issue list. Projects without issue tracking yield `issues = None`.
4. Transform raw JSON/text responses into domain objects via model mapping (see related spec) and compose an *overview row* containing:
   * Group object.
   * Project object.
   * Optional README object.
   * Optional list of Issue objects (may be an empty list).
   * Extra metadata extracted from README front-matter (author, priority, etc.).
5. Return the list of rows to the caller, preserving the original discovery order.
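The steps above can be sketched as follows. This is a minimal illustration, not the reference implementation: the `GitLabClient` protocol, its method names, and the `OverviewRow` fields are assumptions made for this example (the real interfaces are defined in the API client and model mapping specs), and raw dictionaries stand in for domain objects. Front-matter extraction is omitted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol


class GitLabClient(Protocol):
    """Hypothetical client interface, assumed for this sketch only."""

    def list_groups(self) -> List[Dict[str, Any]]: ...
    def list_projects(self, group_id: int) -> List[Dict[str, Any]]: ...
    def get_readme(self, project_id: int, ref: str) -> Optional[str]: ...
    def list_issues(self, project_id: int) -> Optional[List[Dict[str, Any]]]: ...


@dataclass
class OverviewRow:
    """Hypothetical row shape; raw dicts stand in for mapped domain objects."""

    group: Dict[str, Any]
    project: Dict[str, Any]
    readme: Optional[str]                    # None means the README is missing
    issues: Optional[List[Dict[str, Any]]]   # None means no issue tracking
    extra: Dict[str, str] = field(default_factory=dict)


def collect(client: GitLabClient) -> List[OverviewRow]:
    """Walk groups, then projects, and build rows in discovery order."""
    rows: List[OverviewRow] = []
    for group in client.list_groups():
        for project in client.list_projects(group["id"]):
            # Prefer the project's default branch; fall back to "main".
            ref = project.get("default_branch") or "main"
            readme = client.get_readme(project["id"], ref)
            issues = client.list_issues(project["id"])
            rows.append(OverviewRow(group, project, readme, issues))
    return rows
```

Appending rows inside the nested loops is what keeps the output in discovery order without any explicit sorting.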

---

## 2. Failure Handling

| Failure point | Result |
|---------------|--------|
| Group listing request returns error | Raise *Collector Error* and abort collection. |
| Project listing for a single group fails | Propagate error → abort collection (no partial results). |
| README fetch returns 404 | Record `readme = None`; continue processing other artefacts. |
| README fetch returns non-404 error | Propagate as *Collector Error*. |
| Issues request fails | Record `issues = None`; continue processing other artefacts. |

Errors are *not* silently ignored, except in the two explicitly graceful cases listed above: a missing README (404) and a failed or unavailable issues request.
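One way to express the table's rules in code is shown below. This is a sketch under stated assumptions: `ApiError` and its `status` attribute are hypothetical stand-ins for whatever the API client raises, `CollectorError` is an illustrative name for the *Collector Error* mentioned above, and the `fetch` callables represent the actual client requests.

```python
from typing import Any, Callable, List, Optional


class ApiError(Exception):
    """Hypothetical client error carrying the HTTP status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message or f"HTTP {status}")
        self.status = status


class CollectorError(Exception):
    """Illustrative name for the error that aborts collection."""


def fetch_readme(fetch: Callable[[], str]) -> Optional[str]:
    """Return README text, None on 404, and abort on any other API error."""
    try:
        return fetch()
    except ApiError as exc:
        if exc.status == 404:
            return None  # graceful case: README is missing
        raise CollectorError(f"README fetch failed: {exc}") from exc


def fetch_issues(fetch: Callable[[], List[Any]]) -> Optional[List[Any]]:
    """Return the issue list, or None when the issues request fails."""
    try:
        return fetch()
    except ApiError:
        return None  # graceful case: issues are recorded as None
```

Group and project listing errors need no wrapper of this kind in the sketch: letting them propagate (or re-raising them as `CollectorError`) already aborts collection without partial results.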

---

## 3. Concurrency & Rate Limits

* Implementations may perform project-level fetches in parallel, but **must honour** the rate-limit handling strategy defined in the API client.
* Parallelism must not reorder the final output; order is defined by the input discovery sequence (§1, step 5).
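A possible way to parallelise project-level fetches without reordering output is sketched below, using a thread pool from the Python standard library. The function name `collect_in_order`, the `build_row` callable, and the default worker count are assumptions for illustration. Rate-limit handling is assumed to live inside the API client calls that `build_row` makes, so it is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def collect_in_order(
    projects: Iterable[P],
    build_row: Callable[[P], R],
    max_workers: int = 4,
) -> List[R]:
    """Build rows concurrently while keeping the input (discovery) order.

    Executor.map yields results in the order of its inputs, regardless of
    which worker finishes first. An exception raised by build_row surfaces
    when its result is consumed, which aborts the collection.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_row, projects))
```

Because ordering is handled by `Executor.map`, no post-hoc sorting step is needed, and the worker count can be tuned independently of the output contract.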

---

## 4. Output Contract

* Returns an **ordered, in-memory collection** of overview rows.
* `None` is always interpreted as "does not have this feature", "not found", or similar. Empty values (such as `""` or `[]`) indicate that the item is present in the API but empty.
* No persistence or caching is performed at this layer.
* The consumer applies sorting/grouping according to their own needs (see [Table Sorting](./spec_table_sorting.md)).
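The distinction between `None` and an empty value matters to consumers, which should test for `None` explicitly rather than relying on truthiness. A small illustrative helper follows; its name and the returned labels are hypothetical and not part of the contract.

```python
from typing import Any, List, Optional


def describe_issues(issues: Optional[List[Any]]) -> str:
    """Summarise the issues field while keeping None and [] distinct."""
    if issues is None:
        # Feature absent or not found.
        return "no issue tracking"
    if not issues:
        # Present in the API, but empty.
        return "issue tracking enabled, no issues"
    return f"{len(issues)} issue(s)"
```

A plain `if not issues:` check on its own would conflate the two cases, which is why the `None` comparison comes first.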

---

## 5. Non-Goals

* Command-line parsing, configuration merging, or environment handling (see [Settings](./spec_settings.md)).
* Rendering concerns of any kind – these are addressed by higher-level specs.
