HuggingFace Integration
HuggingFace is the backbone of SolarHub's data storage. We use it both for archiving community-labeled datasets and for versioning our machine-learning models.
Dataset Structure
Each task type in SolarHub corresponds to a separate HuggingFace dataset repository.
- Base URL: https://huggingface.co/datasets/SpaceGen/
- Naming Convention: solarhub-{task-type}
- Examples: solarhub-sunspot, solarhub-magnetogram, solarhub-solar-flare
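As a minimal sketch of the naming convention, the helpers below build a repository ID and URL from a task type. The function names `dataset_repo_id` and `dataset_url` are hypothetical and are not part of the SolarHub codebase; only the organization name and the `solarhub-{task-type}` pattern come from this page.

```python
HF_ORG = "SpaceGen"  # organization name taken from the base URL above


def dataset_repo_id(task_type: str) -> str:
    """Return the dataset repository ID for a task type (hypothetical helper)."""
    return f"{HF_ORG}/solarhub-{task_type}"


def dataset_url(task_type: str) -> str:
    """Return the browser URL of the dataset repository (hypothetical helper)."""
    return f"https://huggingface.co/datasets/{dataset_repo_id(task_type)}"


if __name__ == "__main__":
    for task in ("sunspot", "magnetogram", "solar-flare"):
        print(dataset_repo_id(task), dataset_url(task))
```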
Splits
- train: The primary split containing all community-submitted annotations and historical data.
- tasks: (Internal) Used by the pipeline to track URLs that haven't been labeled yet.
Synchronization Workflow
Local changes from the annotations/ directory are synchronized to HuggingFace during the Nightly Pipeline.
Important: Only records that contain actual user-submitted annotations are pushed to HuggingFace. Raw image URLs without labels remain exclusively in the GitHub repository buffer for 24 hours.
- Detection: merge_annotations_to_hf.py identifies updated .jsonl files in annotations/.
- Filtering: The script filters for records where the annotations list is not empty.
- Reconciliation: If the local schema has new fields, the script automatically updates the HuggingFace dataset schema.
- Merging: New records are appended to the master dataset. If a record with the same ID/URL already exists on HuggingFace, its annotations list is merged and deduplicated, ensuring a complete historical record.
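The filtering and merging steps can be illustrated with plain Python dictionaries. This is a sketch under stated assumptions, not the actual logic of merge_annotations_to_hf.py: it assumes each record is a dict keyed by a `url` field with an `annotations` list, and it treats two annotations as duplicates when their JSON serializations match. The field names and the `merge_records` function are hypothetical.

```python
import json


def has_annotations(record):
    """Filtering step: keep only records with a non-empty annotations list."""
    return bool(record.get("annotations"))


def merge_records(existing, incoming, key="url"):
    """Merge incoming records into existing ones, deduplicating annotations.

    Assumption: `key` uniquely identifies a record, and annotations are
    JSON-serializable so they can be compared by their serialized form.
    """
    # Copy existing records so the inputs are not mutated.
    merged = {
        rec[key]: dict(rec, annotations=list(rec.get("annotations") or []))
        for rec in existing
    }
    for rec in incoming:
        if not has_annotations(rec):
            continue  # unlabeled records are never pushed
        target = merged.get(rec[key])
        if target is None:
            # New record: start from an empty list and fill it below.
            target = dict(rec, annotations=[])
            merged[rec[key]] = target
        seen = {json.dumps(a, sort_keys=True) for a in target["annotations"]}
        for ann in rec["annotations"]:
            fingerprint = json.dumps(ann, sort_keys=True)
            if fingerprint not in seen:
                seen.add(fingerprint)
                target["annotations"].append(ann)
    return list(merged.values())
```

Comparing serialized annotations keeps the sketch independent of any particular annotation schema; a real pipeline might instead deduplicate on a user ID and timestamp.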
Model Hub
Trained model weights are stored in the HuggingFace Model Hub.
- Naming Convention: solarhub-model-{task-type}
- Usage: Kaggle inference kernels pull the latest model version from these repositories to generate predictions for new solar imagery.
Security & Access
Interaction with HuggingFace is authenticated using the HF_TOKEN GitHub Actions secret.
- Requirements: The token must have Write permissions to the SpaceGen organization repositories.
- Library: We use the huggingface_hub and datasets Python libraries for all API interactions.
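A script running in GitHub Actions would typically receive the secret as an environment variable. The sketch below shows one way to read and validate it before making any API calls. It assumes the workflow exports the secret under the name HF_TOKEN, and the `load_hf_token` helper is hypothetical.

```python
import os


def load_hf_token(env=None):
    """Return the HuggingFace token from the environment, or fail early.

    Assumption: the workflow maps the HF_TOKEN secret to an environment
    variable of the same name.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; cannot authenticate with HuggingFace.")
    return token
```

The returned value could then be passed to a client such as `huggingface_hub.HfApi(token=...)`. Failing early gives a clear error message instead of an authentication failure partway through a sync.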
Data Schema Evolution
SolarHub is designed for long-term scientific research. If the data schema needs to change (e.g., adding a "confidence" score to user labels), the merge_annotations_to_hf.py script ensures that old data remains compatible by filling new fields with null values for historical records.
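The null-filling behavior can be sketched as follows. This is an illustration, not the script's actual implementation: it assumes records are flat dicts and, for simplicity, places the new "confidence" field at the top level of each record. The helper names `union_fields` and `align_to_schema` are hypothetical.

```python
def union_fields(*record_sets):
    """Collect every field name seen across the record sets, in first-seen order."""
    fields = []
    for records in record_sets:
        for rec in records:
            for name in rec:
                if name not in fields:
                    fields.append(name)
    return fields


def align_to_schema(records, fields):
    """Return copies of the records with every field present; missing ones become None."""
    return [{name: rec.get(name) for name in fields} for rec in records]


if __name__ == "__main__":
    historical = [{"url": "a", "annotations": [{"label": "x"}]}]
    updated = [{"url": "b", "annotations": [{"label": "y"}], "confidence": 0.9}]
    fields = union_fields(historical, updated)
    for row in align_to_schema(historical + updated, fields):
        print(row)
```

Historical records keep their original values and gain a `None` entry for each field they predate, so old and new rows share a single schema.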