Data Schema

SolarHub uses a standardized JSON Lines (JSONL) format for all solar task and annotation data.

Format: Compressed JSONL

Every data file in the repository (under data/ and annotations/) is a .jsonl file. Each line is a single, independent, minified JSON object representing one solar observation record.
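Because every line is an independent JSON object, a file can be read one record at a time. The following is a minimal sketch of such a reader, not SolarHub's own parser; the function name `iter_records` and the decision to skip blank lines are assumptions made for illustration.

```python
import json
from typing import Iterator, TextIO


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            # Assumption: blank lines (e.g. a trailing newline) are ignored.
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            # Each line is expected to hold a JSON object, not an array or scalar.
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield record
```

Any text stream works as input, for example a file opened with `open(path, encoding="utf-8")` or an `io.StringIO` in tests.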


Task Record Schema

Each task is stored as a single minified line in a .jsonl file, with fields in this order: id, url, task_type, created_at, metadata, annotations. The example below is expanded across multiple lines for readability.

{
  "id": "sp-1234",
  "url": "http://jsoc.stanford.edu/data/hmi/images/2026/03/16/000000_Ic_1k.jpg",
  "task_type": "sunspot",
  "created_at": "2026-03-17T00:30:00Z",
  "metadata": {
    "source": "JSOC_HMI_JPG",
    "captured_at": "2026-03-16"
  },
  "annotations": [
    {
      "user": "github_username",
      "confidence_score": 95.0,
      "locations": [
        { "label": "class_f", "region": "450000 15 1009 15" }
      ],
      "issue_number": 42,
      "timestamp": "2026-03-17T14:30:00Z"
    }
  ]
}
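One way to produce the single-line, ordered form described above is sketched here in Python. The helper `to_jsonl_line` and the constant `FIELD_ORDER` are hypothetical names, and dropping any keys outside the documented six is an assumption of this sketch rather than documented behavior.

```python
import json

# Top-level field order as documented for task records.
FIELD_ORDER = ("id", "url", "task_type", "created_at", "metadata", "annotations")


def to_jsonl_line(record: dict) -> str:
    """Serialize a task record as one minified line with the documented key order."""
    # Raises KeyError if a documented field is missing; extra keys are dropped
    # (an assumption of this sketch).
    ordered = {key: record[key] for key in FIELD_ORDER}
    # separators=(",", ":") removes the spaces json.dumps inserts by default.
    return json.dumps(ordered, separators=(",", ":"), ensure_ascii=False)
```

Python dictionaries preserve insertion order, so rebuilding the dictionary in `FIELD_ORDER` is enough to fix the order of the top-level keys. Nested objects keep whatever order they already had.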

Field Definitions

| Field | Type | Description |
| --- | --- | --- |
| id | string | Primary Key: Unique global identifier (e.g., sp-94, mg-102). Persists across years of data. |
| url | string | Direct link to the solar observation image. |
| task_type | string | Scientific category (sunspot, magnetogram, etc.). |
| created_at | string | Record creation timestamp (ISO 8601). |
| metadata | object | System Only: Reserved for backend metadata (source, capture date). |
| annotations | list | A list of user annotation entries. Each entry contains user, confidence_score, locations, issue_number, and timestamp. |
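The field definitions above translate into a simple structural check. The sketch below is illustrative only: `TASK_FIELDS` and `task_record_errors` are hypothetical names, and it checks presence and JSON type only, not the contents of each value.

```python
from typing import Any

# Documented top-level fields mapped to the Python type json.loads produces.
TASK_FIELDS: dict[str, type] = {
    "id": str,
    "url": str,
    "task_type": str,
    "created_at": str,
    "metadata": dict,
    "annotations": list,
}


def task_record_errors(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the shape is valid."""
    errors = []
    for name, expected in TASK_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a record at once.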

Annotation Entry Structure

Each entry in the annotations list represents a contribution from a single user.

| Field | Type | Description |
| --- | --- | --- |
| user | string | GitHub username. |
| confidence_score | float | Contributor's self-reported confidence (0-100). |
| locations | list | Array of point/region objects. |
| issue_number | integer | Submission source issue. |
| timestamp | string | Contribution timestamp. |
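An annotation entry can be checked in the same spirit. This sketch assumes the 0-100 range is inclusive at both ends and that a whole-number score such as 95 is acceptable alongside 95.0; neither detail is stated in the schema. The name `annotation_errors` is hypothetical.

```python
from typing import Any


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def annotation_errors(entry: dict[str, Any]) -> list[str]:
    """Return problems found in one annotation entry; empty list means valid."""
    errors = []
    if not isinstance(entry.get("user"), str):
        errors.append("user: expected string")
    score = entry.get("confidence_score")
    if not _is_number(score):
        errors.append("confidence_score: expected number")
    elif not 0 <= score <= 100:  # assumption: bounds are inclusive
        errors.append("confidence_score: must be between 0 and 100")
    if not isinstance(entry.get("locations"), list):
        errors.append("locations: expected list")
    issue = entry.get("issue_number")
    if not isinstance(issue, int) or isinstance(issue, bool):
        errors.append("issue_number: expected integer")
    if not isinstance(entry.get("timestamp"), str):
        errors.append("timestamp: expected string")
    return errors
```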

Location Object

{ "label": "class_f", "region": "450000 15 1009 15" }

Labels are attached to specific regions through the region field; there is no image-wide label field. The parser stores each region payload exactly as the contributor submitted it, with no transformation, so the value may be run-length encoding (RLE), polygon coordinates, circles, or any other encoding accepted for the task type.
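Since the region payload is stored untouched and its encoding depends on the task, consuming code can treat it as an opaque value and work only with the label. A small sketch, using the hypothetical helper name `regions_by_label`:

```python
from collections import defaultdict
from typing import Any


def regions_by_label(locations: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Group region payloads by label without interpreting the payloads."""
    grouped: dict[str, list[Any]] = defaultdict(list)
    for location in locations:
        # The region value is passed through as-is; decoding it (RLE, polygon,
        # circle, ...) is left to task-specific code.
        grouped[location["label"]].append(location["region"])
    return dict(grouped)
```

Because labels live on individual regions, a per-image summary such as "which classes appear in this image" has to be derived this way from the locations list.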

Contribution Constraint

For each record id, a given GitHub user can appear at most once in annotations. If the same user tries to submit another annotation for the same record, the parser rejects it and returns an error.
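The one-annotation-per-user rule could be enforced along the following lines. This is a sketch rather than the actual parser: `DuplicateAnnotationError` and `add_annotation` are hypothetical names, and comparing usernames by exact string match is an assumption.

```python
class DuplicateAnnotationError(ValueError):
    """Raised when a user submits a second annotation for the same record."""


def add_annotation(record: dict, entry: dict) -> None:
    """Append entry to record["annotations"] unless the user is already present."""
    user = entry["user"]
    # Assumption: usernames are compared as exact strings.
    if any(existing["user"] == user for existing in record["annotations"]):
        raise DuplicateAnnotationError(
            f"user {user!r} has already annotated record {record['id']!r}"
        )
    record["annotations"].append(entry)
```

The check runs before the append, so a rejected submission leaves the record unchanged.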