SolarHub — Nightly Pipeline
Schedule
The nightly pipeline runs every day at 00:30 UTC via a GitHub Actions scheduled trigger.
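GitHub Actions schedules use five-field cron syntax evaluated in UTC, so a 00:30 UTC daily trigger corresponds to the expression 30 0 * * *. The sketch below is only an illustration of that schedule: next_run is a hypothetical helper, not part of the repository, and it expects a timezone-aware datetime.

```python
from datetime import datetime, time, timedelta, timezone

# Five-field cron expression for "every day at 00:30 UTC".
NIGHTLY_CRON = "30 0 * * *"


def next_run(now: datetime) -> datetime:
    """Return the next 00:30 UTC occurrence strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(0, 30), tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

Scheduled workflows on GitHub can start later than the nominal time when the service is under load, so the computed time is best read as the earliest possible start.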
Pipeline Lifecycle
The entire pipeline is orchestrated in a single workflow file, .github/workflows/pipeline.yml, divided into four primary stages to ensure data integrity and synchronization.
graph TD
Stage1[Stage 1: Lock Frontend] --> NodeA[Node A: Pull New URLs]
Stage1 --> NodeB[Node B: Push Annotations to HF]
NodeA --> Stage4[Stage 4: Sync & Unlock]
NodeB --> Stage4
Stage4 --> Error[Stage 99: Unlock on Failure]
Stage Details
Stage 1: Lock Frontend
- Action: Renames the data/ directory to data_processing/.
- Purpose: Prevents the frontend from reading inconsistent or partially updated task files while the crawler is active.
- Outcome: A "locked" repository state where public task data is temporarily unavailable.
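The lock step can be sketched as a single directory rename. This is a minimal illustration under assumptions: lock_frontend is a hypothetical function, and the real workflow may perform the rename with git commands and a commit rather than in Python.

```python
from pathlib import Path


def lock_frontend(repo_root: Path) -> Path:
    """Rename data/ to data_processing/ and return the locked path."""
    live = repo_root / "data"
    locked = repo_root / "data_processing"
    if locked.exists():
        # A leftover data_processing/ suggests an earlier run never unlocked.
        raise RuntimeError("data_processing/ already exists; refusing to lock twice")
    if not live.is_dir():
        raise FileNotFoundError(f"expected a directory at {live}")
    live.rename(locked)
    return locked
```

Refusing to lock when data_processing/ already exists is a design choice of this sketch; it avoids silently overwriting the state of an interrupted run.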
Node A: Pull New URLs
- Script: scripts/pull_new_urls.py
- Action: Crawls the JSOC/HMI imagery database for the previous day's solar captures.
- Outcome: New JSON task files are generated in data_processing/, each containing unique IDs and image URLs.
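As a rough illustration of the outcome above, the sketch below writes one JSON task file per image URL. Everything about the file layout is an assumption: the field names id and image_url, the one-file-per-URL naming, and the use of a URL-derived UUID are hypothetical and may differ from what scripts/pull_new_urls.py actually produces.

```python
import json
import uuid
from pathlib import Path


def write_task_files(image_urls, out_dir: Path):
    """Write one JSON task file per URL and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for url in image_urls:
        # uuid5 is deterministic, so re-crawling the same URL yields the same ID.
        task_id = uuid.uuid5(uuid.NAMESPACE_URL, url).hex
        path = out_dir / f"{task_id}.json"
        path.write_text(json.dumps({"id": task_id, "image_url": url}, indent=2))
        written.append(path)
    return written
```

Deriving the ID from the URL keeps the step idempotent: a repeated crawl overwrites the same file instead of creating a duplicate task.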
Node B: Push Annotations to HF
- Script: scripts/merge_annotations_to_hf.py
- Action: Synchronizes pending user annotations from the annotations/ directory with the SpaceGen/solarhub-* datasets on HuggingFace.
- Outcome: Permanent record of community-labeled data for ML training.
Stage 4: Sync & Unlock
- Action: Merges newly pulled URLs into the annotations/ folder as blank templates and renames data_processing/ back to data/.
- Purpose: Finalizes the daily update and makes new tasks available for labeling.
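The two actions of this stage can be sketched together. The template schema here (id, image_url, and an empty labels list) and the function name sync_and_unlock are assumptions for illustration; the source does not specify what a blank template contains.

```python
import json
from pathlib import Path


def sync_and_unlock(repo_root: Path) -> int:
    """Create blank annotation templates, then rename data_processing/ to data/.

    Returns the number of new templates created.
    """
    processing = repo_root / "data_processing"
    annotations = repo_root / "annotations"
    annotations.mkdir(exist_ok=True)
    created = 0
    for task_file in sorted(processing.glob("*.json")):
        task = json.loads(task_file.read_text())
        template = annotations / task_file.name
        if template.exists():
            continue  # never overwrite annotations that already exist
        blank = {"id": task.get("id"), "image_url": task.get("image_url"), "labels": []}
        template.write_text(json.dumps(blank, indent=2))
        created += 1
    # Unlock last, so the frontend only sees data/ after templates are in place.
    processing.rename(repo_root / "data")
    return created
```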
Stage 99: Unlock on Failure
- Trigger: Runs only if any of the preceding stages fail.
- Action: Attempts to recover the repository to an "unlocked" state (renaming data_processing/ to data/) to ensure the system doesn't remain stuck in maintenance mode.
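A recovery step like this is safest when it is idempotent, since it may run whether or not the lock was ever taken. The sketch below shows one way to write it; unlock_on_failure is a hypothetical name, and in the workflow itself such a job would typically be gated with a failure condition on the earlier jobs.

```python
from pathlib import Path


def unlock_on_failure(repo_root: Path) -> bool:
    """Rename data_processing/ back to data/ if the repository is locked.

    Returns True if a rename happened, False if nothing was changed.
    """
    locked = repo_root / "data_processing"
    live = repo_root / "data"
    if not locked.exists():
        return False  # already unlocked, or the lock stage never ran
    if live.exists():
        return False  # both directories exist; leave this for manual inspection
    locked.rename(live)
    return True
```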
Parallel Execution
To maximize efficiency, Node A (Data Crawling) and Node B (HuggingFace Synchronization) run in parallel once the frontend is locked. The final sync only occurs once both nodes complete successfully.
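The control flow described here (lock, run two nodes in parallel, sync only if both succeed, unlock on any failure) can be sketched in plain Python. This models the ordering only; in the actual workflow the same dependencies would be expressed between GitHub Actions jobs, and run_pipeline and its callable parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(lock, node_a, node_b, sync, unlock):
    """Run the stages in order; call `unlock` and re-raise if any stage fails."""
    try:
        lock()
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(node_a), pool.submit(node_b)]
            for future in futures:
                future.result()  # re-raises any exception from the node
        sync()  # reached only when both nodes completed successfully
    except Exception:
        unlock()  # counterpart of Stage 99
        raise
```

Leaving the executor's with block waits for both nodes to finish, so the unlock step never runs while a node is still writing.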