HuggingFace Dataset Schema

This page defines the permanent storage schema for SolarHub datasets hosted on HuggingFace.

Overview

The HuggingFace datasets (SpaceGen/solarhub-*) store the long-term history of all community-labeled solar data. Unlike the GitHub repository, which only maintains a 24-hour window, HuggingFace is an append-only historical archive.

Column Definitions

| Column | Type | Description |
| --- | --- | --- |
| id | string | Unique task ID (e.g., sp-1234). |
| serial_number | int64 | Incremental serial number. |
| url | string | Permanent URL to the solar observation image. |
| task_type | string | The scientific category (sunspot, magnetogram, etc.). |
| created_at | string | ISO-8601 timestamp of record creation. |
| annotations | string (JSON) | A JSON-serialized array of all user contributions. |
| metadata | string (JSON) | A JSON-serialized object containing capture and source metadata. |

The annotations Column

The annotations column stores the full history of human labels for a record. Because HuggingFace datasets prefer flat or simple nested structures, the whole annotation list is stored as a single JSON-serialized string rather than as a nested column.

Decoded Example:

[
  {
    "user": "github_user_1",
    "issue_number": 101,
    "timestamp": "2026-03-18T12:00:00Z",
    "locations": [
      { "label": "class_f", "region": "450000 15 1009 15" }
    ]
  }
]

locations[].region is persisted exactly as submitted by contributors (no RLE conversion or other normalization in the issue parser).
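Reading the column back means parsing the JSON string first. The following is a minimal sketch using only the Python standard library; the row dict and the decode_annotations helper are hypothetical, and treating a null or empty value as "no annotations" is an assumption rather than documented behavior.

```python
import json

# Hypothetical row shaped like the schema above; values are illustrative only.
row = {
    "id": "sp-1234",
    "annotations": json.dumps([
        {
            "user": "github_user_1",
            "issue_number": 101,
            "timestamp": "2026-03-18T12:00:00Z",
            "locations": [{"label": "class_f", "region": "450000 15 1009 15"}],
        }
    ]),
}


def decode_annotations(row):
    """Parse the JSON string in the annotations column into a list of dicts."""
    raw = row.get("annotations")
    # Assumption: a missing, null, or empty value means "no annotations yet".
    return json.loads(raw) if raw else []


annotations = decode_annotations(row)

# region is read back as the raw submitted string; nothing is normalized here.
regions = [loc["region"] for entry in annotations for loc in entry["locations"]]
```

The metadata column could be decoded the same way, with json.loads returning a dict instead of a list.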

Schema Reconciliation

The synchronization script (merge_annotations_to_hf.py) performs automated schema evolution.

  1. Union Merge: If a new version of the Aurora backend adds a field (e.g., confidence_score), the script identifies the missing column on HuggingFace and updates the dataset features automatically.
  2. Backward Compatibility: All historical records are preserved. Missing fields in older records are filled with null.
  3. Deduplication: Records are merged by id. If a record is updated with a new annotation, the new entry is appended to the existing annotations array for that record ID.
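The three rules above can be sketched in plain Python. This is not the code of merge_annotations_to_hf.py; the reconcile function, its in-memory list-of-dicts representation, and the choices to skip annotation entries that are already stored and to let incoming non-null values overwrite other fields are all assumptions made for illustration.

```python
import json


def reconcile(existing, incoming):
    """Merge incoming records into existing ones, keyed by id (illustrative sketch)."""
    merged = {rec["id"]: dict(rec) for rec in existing}

    for rec in incoming:
        current = merged.get(rec["id"])
        if current is None:
            merged[rec["id"]] = dict(rec)
            continue

        # Deduplication: append new annotation entries to the stored array.
        old = json.loads(current.get("annotations") or "[]")
        new = json.loads(rec.get("annotations") or "[]")
        # Assumption: entries identical to stored ones are not appended again.
        combined = old + [entry for entry in new if entry not in old]

        # Assumption: non-null incoming values overwrite the other fields.
        for key, value in rec.items():
            if key != "annotations" and value is not None:
                current[key] = value
        current["annotations"] = json.dumps(combined)

    # Union merge: the column set is the union of every field seen.
    columns = sorted({key for rec in merged.values() for key in rec})
    # Backward compatibility: fields missing from older records become None (null).
    return [{col: rec.get(col) for col in columns} for rec in merged.values()]


existing = [{"id": "sp-1", "annotations": json.dumps([{"user": "a"}])}]
incoming = [
    {"id": "sp-1", "annotations": json.dumps([{"user": "b"}]), "confidence_score": 0.9},
    {"id": "sp-2", "annotations": "[]"},
]
result = reconcile(existing, incoming)
```

In this sketch sp-1 ends up with both annotation entries and the new confidence_score value, while sp-2 carries confidence_score as None.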

Parser Strictness Rule

Aurora enforces a strict write rule before data reaches HuggingFace:

  • For a given record id, the same GitHub user can only submit one annotation entry.
  • A second submission from the same user for the same record is rejected during issue parsing.
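A check of this kind might look like the sketch below. The accept_submission function and the use of ValueError to signal rejection are hypothetical; the source does not say how the issue parser reports a rejected submission.

```python
def accept_submission(existing_annotations, submission):
    """Return the updated annotation list, or raise if this user already annotated the record."""
    user = submission["user"]
    if any(entry["user"] == user for entry in existing_annotations):
        # Assumption: rejection is modeled as an exception in this sketch.
        raise ValueError(f"user {user!r} has already annotated this record")
    # Return a new list so the caller's stored list is not mutated.
    return existing_annotations + [submission]


first = {"user": "github_user_1", "issue_number": 101}
second = {"user": "github_user_1", "issue_number": 102}

stored = accept_submission([], first)

try:
    accept_submission(stored, second)
    rejected = False
except ValueError:
    rejected = True
```

Because the rule is applied per record id, the same user can still annotate a different record; only the annotation list of the record being updated is checked.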