Scripts Reference
The scripts/ directory contains modular Python automation that powers the SolarHub data lifecycle.
1. pull_new_urls.py
Purpose: Fetches the previous day's solar imagery URLs from the JSOC/HMI API.
- Operation: Crawls http://jsoc.stanford.edu/ and filters for _Ic_1k.jpg (sunspots) and _M_1k.jpg (magnetograms).
- Output: Generates JSON task files in data_processing/.
- Key Features:
  - Automatically handles serial number incrementing for unique task IDs.
  - Prevents duplicates by checking existing URLs in the HuggingFace dataset (requires HF_TOKEN).
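The filtering, de-duplication, and serial numbering described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the task-NNNNNN ID format, the task dictionary keys, and the idea of passing in an already-fetched set of known URLs are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Suffixes taken from the description above; the kind names are illustrative.
SUFFIX_TO_KIND = {
    "_Ic_1k.jpg": "sunspot",
    "_M_1k.jpg": "magnetogram",
}


def classify_url(url):
    """Return the image kind for a URL, or None if it is not a wanted product."""
    path = urlparse(url).path
    for suffix, kind in SUFFIX_TO_KIND.items():
        if path.endswith(suffix):
            return kind
    return None


def build_tasks(urls, known_urls, next_serial):
    """Create one task dict per new, wanted URL, numbering from next_serial.

    known_urls stands in for the URLs already present in the HuggingFace
    dataset; how they are fetched is outside this sketch.
    """
    tasks = []
    seen = set(known_urls)
    for url in urls:
        kind = classify_url(url)
        if kind is None or url in seen:
            continue  # unwanted product or duplicate
        seen.add(url)
        # Hypothetical ID format: zero-padded serial number.
        tasks.append({"task_id": f"task-{next_serial:06d}", "kind": kind, "url": url})
        next_serial += 1
    return tasks
```

Keeping the serial counter as an explicit argument makes the function easy to test; a real run would presumably derive the starting value from the task files that already exist.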
2. parse_issue_annotation.py
Purpose: Extracts user-submitted annotations from GitHub Issue bodies.
- Trigger: Invoked by the Parse Annotation Issue GitHub Actions workflow.
- Workflow:
  - Parses Markdown headers (e.g., ### Task Type) into structured data.
  - Validates task IDs and labels against a whitelist.
  - Merges coordinates into the corresponding task file in annotations/.
- Environment Variables: Requires ISSUE_NUMBER, ISSUE_BODY, and ISSUE_AUTHOR.
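As a rough sketch of the header-parsing and whitelist steps, the snippet below splits an issue body on ### headers and checks one field against an allowed set. The field names other than Task Type, the whitelist contents, and the function names are hypothetical; the real script may structure this differently.

```python
import re

# Matches a level-3 Markdown header such as "### Task Type".
HEADER_RE = re.compile(r"^###[ \t]+(.+?)[ \t]*$", re.MULTILINE)

# Hypothetical whitelist, for illustration only.
ALLOWED_TASK_TYPES = {"sunspot", "magnetogram"}


def parse_issue_body(body):
    """Map each '### Header' to the stripped text that follows it."""
    fields = {}
    matches = list(HEADER_RE.finditer(body))
    for i, match in enumerate(matches):
        # A field's value runs until the next header or the end of the body.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        fields[match.group(1).strip()] = body[match.end():end].strip()
    return fields


def validate_task_type(fields):
    """Raise ValueError if the Task Type field is missing or not whitelisted."""
    task_type = fields.get("Task Type")
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"unsupported task type: {task_type!r}")
    return task_type
```

In a workflow run, the body would presumably come from os.environ["ISSUE_BODY"], for example parse_issue_body(os.environ["ISSUE_BODY"]).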
3. merge_annotations_to_hf.py
Purpose: Synchronizes local annotations/ data with HuggingFace datasets.
- Mechanism: Non-destructive schema reconciliation.
- Algorithm:
  - Reconciliation: If the local schema differs from the HuggingFace schema, it performs a "union merge," filling missing fields with null.
  - De-duplication: Merges records by ID/URL, ensuring that the latest data from GitHub wins on collision.
- Library: Built on top of the datasets and huggingface_hub libraries.
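The two algorithm steps can be illustrated with plain Python dictionaries standing in for dataset rows. This sketch deliberately avoids the datasets and huggingface_hub APIs; the function name, the default key name "id", and the record shapes are assumptions for the example, not the script's actual interface.

```python
def union_merge(remote_rows, local_rows, key="id"):
    """Merge two lists of dict records; local rows win on key collision.

    Every output row carries the union of all field names seen in either
    input, with None for fields a row did not have.
    """
    # Union of all field names, in first-seen order.
    columns = []
    for row in [*remote_rows, *local_rows]:
        for name in row:
            if name not in columns:
                columns.append(name)

    merged = {}
    # Local rows are processed last, so they overwrite remote rows
    # that share the same key.
    for row in [*remote_rows, *local_rows]:
        merged[row[key]] = {name: row.get(name) for name in columns}
    return list(merged.values())
```

Note that in this sketch a colliding local row replaces the remote row wholesale, so a field present only in the remote copy becomes None. Whether the real script merges field by field instead is not stated in this document.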
4. compute_points.py
Purpose: (Work in Progress) Evaluates machine learning model accuracy.
- Responsibility: Compares the latest label from annotations (ground truth) against ml_label (model output) for all labeled records.
- Metric: Computes overall accuracy and per-task-type performance.
- Output: Writes a summary report used for model auditing and progress tracking.
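A minimal sketch of the metric is shown below. Since the script is still a work in progress, the record layout here is an assumption: each record is taken to have a task_type string, an ml_label, and an annotations list whose last entry holds the latest label under a "label" key.

```python
from collections import defaultdict


def compute_accuracy(records):
    """Return overall and per-task-type accuracy over labeled records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        annotations = rec.get("annotations") or []
        if not annotations or rec.get("ml_label") is None:
            continue  # skip records without ground truth or model output
        truth = annotations[-1]["label"]  # assumes the last entry is the latest
        task_type = rec.get("task_type", "unknown")
        totals[task_type] += 1
        correct[task_type] += int(truth == rec["ml_label"])

    n = sum(totals.values())
    return {
        "n": n,
        "overall": sum(correct.values()) / n if n else None,
        "per_task_type": {t: correct[t] / totals[t] for t in totals},
    }
```

Returning None for the overall score when nothing is labeled avoids a division by zero and lets the report distinguish "no data" from "0% accuracy".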
5. setup_platforms.py
Purpose: Initializes external platform integrations.
- Usage: Typically used during the initial repository setup.
- Functions: Sets up HuggingFace repositories and Kaggle environment variables.
Development Guidelines
- Environment: All scripts are designed for Python 3.11+.
- Dependencies: Key libraries include requests, huggingface_hub, and datasets.
- Logging: All scripts use standard Python logging with a consistent format for easy debugging via GitHub Actions logs.
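One way to get a consistent log format across scripts is a small shared helper like the one below. The helper name and the exact format string are assumptions; this document does not specify the format the scripts actually use.

```python
import logging
import sys

# Hypothetical shared format; the real scripts may use a different one.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"


def get_logger(name):
    """Return a logger that writes to stdout with the shared format."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep messages from being printed twice
    return logger
```

A script would then call something like log = get_logger("pull_new_urls") once at startup and use log.info(...) throughout.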