Data Process

This guide helps developers integrate quickly with the EO Platform's data processing module. It walks through the complete technical workflow, from frontend interaction and backend scheduling to result retrieval, with code examples and debugging advice.

1. Prerequisites

Before starting the integration, please ensure the following environment is ready:

  • Account Permissions: You need access and write permissions for the "Data Processing" module.
  • Input Data: Data must be uploaded in the "Data Storage" module, preferably in GeoTIFF format (single-band or multi-band).
  • Running Services:
    • data_management_service (Java service) is deployed and connected to PostgreSQL.
    • data_process_service (Python service) is running correctly, connected to PostgreSQL, and can access the shared file system.
    • gis_service: Running correctly.
  • Shared Storage: The frontend, backend, and Python services must share the same object storage, and each must have read and write permissions.

2. Feature Overview

The data processing module provides several remote sensing image processing capabilities that can lay the foundation for subsequent analysis:

| Processing Type | Function Description | Common Scenarios |
| --- | --- | --- |
| Band Merging | Combines multiple single-band images into one multi-band image | Creating RGB or false-color images, enhancing feature recognition |
| Imagery Mosaicking | Seamlessly stitches multiple images | Creating maps covering large areas in batches |
| Imagery Fusion | Fuses panchromatic and multispectral imagery to improve resolution | High-resolution remote sensing image generation |
| Cloud Removal | Automatically identifies and removes clouds | Improving image quality for subsequent classification and monitoring |
| NDVI | Normalized Difference Vegetation Index; measures vegetation cover and health | Agricultural monitoring, ecological assessment |
| EVI | Enhanced Vegetation Index; corrects for shadow and soil background effects | Monitoring dense vegetation areas |
| NDWI | Normalized Difference Water Index; enhances the contrast between water and land | Water body extraction, flood monitoring |
| NDBI | Normalized Difference Built-up Index; highlights urban built-up areas | Urban expansion, land use classification |
| BAI | Burned Area Index; identifies post-fire burned areas | Fire monitoring, disaster assessment |

⚠️ If you need to extend to a new processing type, please register the task type in data_process_service and synchronize it in the task type enum of data_management_service.

2.1 Task Type Enum (TaskType)

| Enum Name | Value | Description | Typical Input | Typical Output |
| --- | --- | --- | --- | --- |
| FUSION | 1 | Imagery fusion (pan-sharpening, etc.) | Panchromatic + multispectral | High-resolution multi-band image |
| MOSAIC | 2 | Imagery mosaicking | Multiple images with the same CRS | Large-scene stitched image |
| Cloud_Remove | 4 | Cloud removal | Cloudy image, cloud mask, cloud-free image | Cloud-free image |
| BAND_MERGED | 6 | Band merging | Multiple single-band images | Multi-band image |
| NDVI | 7 | Index calculation: NDVI | NIR + Red | Single-band index image |
| EVI | 8 | Index calculation: EVI | NIR + Red + Blue | Single-band index image |
| NDWI | 9 | Index calculation: NDWI | Green + NIR | Single-band index image |
| NDBI | 10 | Index calculation: NDBI | SWIR + NIR/Red | Single-band index image |
| BAI | 11 | Index calculation: BAI | Red + NIR | Single-band index image |

The frontend Processing Type field must match the enum above; otherwise the Worker will refuse to execute the task.
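
If you maintain these values in client or worker code, mirroring them as an enum avoids magic numbers. A minimal Python sketch based on the table above (the class name and module placement are illustrative):

```python
from enum import IntEnum

class TaskType(IntEnum):
    """Processing task types; values must match data_management_service."""
    FUSION = 1        # Imagery fusion (pan-sharpening)
    MOSAIC = 2        # Imagery mosaicking
    Cloud_Remove = 4  # Cloud removal (name kept as listed in the table)
    BAND_MERGED = 6   # Band merging
    NDVI = 7          # Index calculations
    EVI = 8
    NDWI = 9
    NDBI = 10
    BAI = 11
```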

3. System Architecture and Flow

Data Process Sequence Diagram Placeholder

The data processing module uses an asynchronous queue architecture, implementing a "frontend submits task → backend polls and processes → frontend queries results" model:

  1. Frontend: Users create tasks and view progress in the UI.
  2. data_management_service (Java): Handles API requests, writing/reading tasks to/from the database.
  3. PostgreSQL: Stores task records, acting as a message queue.
  4. data_process_service (Python): Polls the database and executes the actual data processing logic.
  5. GIS Service: Tiles and publishes the processed imagery.

4. Core Development Workflow

4.1 Create Processing Task (Triggered by Frontend)

  1. The user clicks Add Task to open the task creation dialog.
  2. Select the processing type, input data, output path, and filename.
  3. After submission, the frontend calls the Create Processing Task API to get a task ID.
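
For illustration, a task-creation call might look like the following. The payload field names, response shape, and host are assumptions; the Create Task API reference is authoritative:

```python
import requests

BASE_URL = "https://eo-platform.example.com"  # placeholder host
headers = {"Authorization": "Bearer <token>"}

payload = {
    "taskType": 7,                    # NDVI, per the TaskType enum
    "inputFileIds": ["file-123"],     # illustrative input file ID
    "outputPath": "/results/ndvi/",   # illustrative output path
    "outputFileName": "ndvi_2024.tif",
}

resp = requests.post(f"{BASE_URL}/processtask/addTask", json=payload, headers=headers)
resp.raise_for_status()
task_id = resp.json()["data"]["taskId"]  # response shape is an assumption
print("Created task:", task_id)
```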

4.2 Task Scheduling (data_management_service (Java) / data_process_service (Python))

  1. data_management_service writes the task to PostgreSQL with the status set to NOT_STARTED.
  2. data_process_service polls every 60 seconds (see the sketch after this list):
    • Uses SELECT ... FOR UPDATE to lock the earliest pending task.
    • Updates status to DOWNLOADING and downloads the imagery.
    • Sets status to DOWNLOADED after download is complete.
    • Updates status to PROCESSING and executes the specific algorithm.
    • Sets status to PROCESSING_COMPLETED after the algorithm finishes.
    • If a publishing step is included, it enters PUBLISHING, and is set to PUBLISH_COMPLETED upon completion.
    • In case of exceptions: DOWNLOAD_FAILED for download failure; PROCESSING_FAILED for processing failure; PUBLISH_FAILED for publishing failure, all with an errorMessage recorded.
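
A minimal sketch of this polling loop, assuming psycopg2 and a process_task table with id, status, and create_time columns (table and column names are assumptions):

```python
import time

import psycopg2

# TaskStatus values from Section 6.
NOT_STARTED, DOWNLOADING = 0, 1

conn = psycopg2.connect("dbname=eo user=worker")  # placeholder DSN

while True:
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            # Lock the earliest pending task; SKIP LOCKED lets parallel
            # workers pass over a row another worker already holds.
            cur.execute(
                """
                SELECT id FROM process_task
                WHERE status = %s
                ORDER BY create_time
                LIMIT 1
                FOR UPDATE SKIP LOCKED
                """,
                (NOT_STARTED,),
            )
            row = cur.fetchone()
            if row:
                cur.execute(
                    "UPDATE process_task SET status = %s WHERE id = %s",
                    (DOWNLOADING, row[0]),
                )
                # ...download inputs, then advance through DOWNLOADED,
                # PROCESSING, PROCESSING_COMPLETED, PUBLISHING, ...
    time.sleep(60)  # poll interval from Section 4.2
```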

4.3 Progress Update

The status is refreshed every time the frontend reloads the current list page by calling the Query Task List API.
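
The list refresh is an ordinary paged query; a sketch (pagination field names and response shape are assumptions):

```python
import requests

BASE_URL = "https://eo-platform.example.com"  # placeholder host
headers = {"Authorization": "Bearer <token>"}

resp = requests.post(
    f"{BASE_URL}/processtask/query/page",
    json={"pageNum": 1, "pageSize": 20},  # pagination fields are illustrative
    headers=headers,
)
resp.raise_for_status()
for task in resp.json()["data"]["records"]:  # response shape is an assumption
    print(task["taskId"], task["status"])
```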

4.4 View Results (Triggered by Frontend)

  • When status = PUBLISH_COMPLETED, the API returns result information such as result.outputFileId and result.previewUrl.
  • The task details page displays:
    • Basic task information (name, type, input imagery, etc.).
    • Input data list and band mapping.
    • Result preview image.
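
Once the status reaches PUBLISH_COMPLETED, the detail API exposes the result fields. A sketch (the query parameter name and response shape are assumptions):

```python
import requests

BASE_URL = "https://eo-platform.example.com"  # placeholder host
headers = {"Authorization": "Bearer <token>"}

resp = requests.get(
    f"{BASE_URL}/processtask/query/task/detail",
    params={"taskId": "task-456"},  # parameter name is illustrative
    headers=headers,
)
resp.raise_for_status()
detail = resp.json()["data"]  # response shape is an assumption
if detail["status"] == 7:  # PUBLISH_COMPLETED
    print(detail["result"]["outputFileId"], detail["result"]["previewUrl"])
```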

4.5 Previewing Image Processing Results
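
The previewUrl returned by the Query Task Detail API points to a rendered preview of the result. A minimal sketch that downloads it, assuming the URL is fetchable with the same token (detail and headers come from the Section 4.4 sketch):

```python
import requests

preview_url = detail["result"]["previewUrl"]  # from the Section 4.4 sketch

img = requests.get(preview_url, headers=headers)
img.raise_for_status()
with open("preview.png", "wb") as f:  # output name is illustrative
    f.write(img.content)
```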

5. API Quick Index

| Capability | API | Description |
| --- | --- | --- |
| Query File Band Info | POST /processtask/query/file/bandInfo | Query band information for an input file |
| Create Task | POST /processtask/addTask | Create an asynchronous processing task |
| Query Task Detail | GET /processtask/query/task/detail | Returns status, error message, and results |
| Raster Publish Detail | GET /metadata/query/raster/publishUrl | Get details of a published image |
| Query Task List | POST /processtask/query/page | Supports pagination and filtering |
| Delete Task | DELETE /processtask/delete | Delete a record by task ID |

API parameters, field descriptions, and error codes are provided in the corresponding links and are not repeated here.

6. Task Status (TaskStatus)

The platform uses the following status enums:

  • NOT_STARTED = 0 Not Started (created successfully, not yet queued)
  • DOWNLOADING = 1 Downloading (Python service is downloading imagery to local/cache)
  • DOWNLOADED = 2 Downloaded (input data is ready, awaiting processing)
  • PROCESSING_COMPLETED = 3 Processing Completed (processing stage finished successfully, ready to publish)
  • PROCESSING_FAILED = 4 Processing Failed (algorithm stage failed, includes error message)
  • PROCESSING = 5 Processing (algorithm is executing)
  • PUBLISHING = 6 Publishing (writing/registering results to storage/catalog service)
  • PUBLISH_COMPLETED = 7 Publish Completed (results can be queried/downloaded/previewed)
  • PUBLISH_FAILED = 8 Publish Failed (ingestion or registration failed)
  • DOWNLOAD_FAILED = 9 File Download Failed (input retrieval failed)
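
Mirroring the statuses as an enum keeps frontend and worker checks consistent; a sketch (note that the numbering of PROCESSING_COMPLETED (3) and PROCESSING (5) does not follow execution order):

```python
from enum import IntEnum

class TaskStatus(IntEnum):
    NOT_STARTED = 0
    DOWNLOADING = 1
    DOWNLOADED = 2
    PROCESSING_COMPLETED = 3  # numbering predates PROCESSING (5)
    PROCESSING_FAILED = 4
    PROCESSING = 5
    PUBLISHING = 6
    PUBLISH_COMPLETED = 7
    PUBLISH_FAILED = 8
    DOWNLOAD_FAILED = 9
```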

6.1 Complete Status Flow

The complete task lifecycle should follow this sequence to unify frontend/backend logic and alerting:

  1. NOT_STARTED (0) → Task created successfully, waiting to be queued
  2. DOWNLOADING (1) → Worker pulls input files to local or cache
  3. DOWNLOADED (2) → Input data is ready, enters processing queue
  4. PROCESSING (5) → Execute algorithm (crop/fuse/mosaic/index, etc.)
  5. PROCESSING_COMPLETED (3) → Processing stage ends successfully, ready to publish or write back
  6. PUBLISHING (6) → Register results to storage/catalog/preview service
  7. PUBLISH_COMPLETED (7) → Publish complete, results can be queried/downloaded/previewed

Exception branches:

  • DOWNLOAD_FAILED (9): Input file download failed → Supports retrying the download or terminating the task
  • PROCESSING_FAILED (4): Algorithm execution failed → Display error message, supports "Resubmit"
  • PUBLISH_FAILED (8): Result registration failed → Supports "Retry Publish" or rollback
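
One way to unify frontend/backend logic is to encode this lifecycle as an explicit transition table. A sketch reusing the TaskStatus enum from Section 6 (the retry edges are one interpretation of the exception branches above):

```python
# Allowed transitions derived from the happy path and exception branches.
ALLOWED = {
    TaskStatus.NOT_STARTED: {TaskStatus.DOWNLOADING},
    TaskStatus.DOWNLOADING: {TaskStatus.DOWNLOADED, TaskStatus.DOWNLOAD_FAILED},
    TaskStatus.DOWNLOADED: {TaskStatus.PROCESSING},
    TaskStatus.PROCESSING: {TaskStatus.PROCESSING_COMPLETED, TaskStatus.PROCESSING_FAILED},
    TaskStatus.PROCESSING_COMPLETED: {TaskStatus.PUBLISHING},
    TaskStatus.PUBLISHING: {TaskStatus.PUBLISH_COMPLETED, TaskStatus.PUBLISH_FAILED},
    # Retry edges from the exception branches:
    TaskStatus.DOWNLOAD_FAILED: {TaskStatus.DOWNLOADING},
    TaskStatus.PROCESSING_FAILED: {TaskStatus.PROCESSING},  # "Resubmit"
    TaskStatus.PUBLISH_FAILED: {TaskStatus.PUBLISHING},     # "Retry Publish"
}

def can_transition(current: TaskStatus, new: TaskStatus) -> bool:
    """True if the move respects the lifecycle in Section 6.1."""
    return new in ALLOWED.get(current, set())
```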

6.2 Status Sequence Diagram

Data Process State Machine Diagram Placeholder

Placeholder Note: Please name your generated state machine image guide/data-process-state.svg (or adjust the reference path above) and replace the placeholder image to display it in the document.

7. Best Practices

7.1 Band Merging

  • Input Requirements: single-band or multi-band GeoTIFF images.
  • Performance Suggestion: Crop to the ROI (Region of Interest) before merging to reduce processing load.
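
Conceptually, band merging stacks co-registered single-band rasters into one multi-band file. A sketch with rasterio, assuming all bands share dtype, CRS, and dimensions (file names are illustrative; the platform performs this server-side):

```python
import rasterio

band_paths = ["red.tif", "green.tif", "blue.tif"]  # illustrative inputs

# Take the profile from the first band and widen it to N bands.
with rasterio.open(band_paths[0]) as src:
    profile = src.profile
profile.update(count=len(band_paths))

with rasterio.open("merged_rgb.tif", "w", **profile) as dst:
    for idx, path in enumerate(band_paths, start=1):
        with rasterio.open(path) as src:
            dst.write(src.read(1), idx)
```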

7.2 Imagery Mosaicking

  • Data Preparation: Images need to have some overlap and a consistent coordinate system.
  • Edge Blending: The default GDAL algorithm is used to eliminate seams.
  • Output Size: The mosaicked image can be very large; please estimate storage consumption.
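
The operation itself can be expressed with GDAL; a minimal sketch using the GDAL Python bindings, shown only to illustrate what the service does internally (input names are illustrative):

```python
from osgeo import gdal

# Stitch two overlapping scenes that share a CRS into one GeoTIFF.
ds = gdal.Warp(
    "mosaic.tif",
    ["scene_a.tif", "scene_b.tif"],  # illustrative overlapping inputs
    format="GTiff",
)
ds = None  # dereference to flush and close the output
```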

7.3 Cloud Removal

  • Prioritize using input images with cloud probability or cloud masks.
  • For Sentinel-2 and Landsat data, the quality mask (QA Band) can be used to assist processing.
  • It is recommended to perform a visual check to confirm the quality after processing.
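
When a binary cloud mask is available, the masking step is a simple masked write; a sketch assuming mask value 1 marks cloud and 0 is usable as nodata (the platform's algorithm additionally fills masked pixels, e.g., from a cloud-free image; this shows only the masking):

```python
import rasterio

with rasterio.open("scene.tif") as src, rasterio.open("cloud_mask.tif") as msk:
    data = src.read()            # shape: (bands, rows, cols)
    cloud = msk.read(1) == 1     # assumption: 1 marks cloudy pixels
    profile = src.profile
    profile.update(nodata=0)

data[:, cloud] = 0  # flag cloudy pixels as nodata for later gap filling

with rasterio.open("scene_decloud.tif", "w", **profile) as dst:
    dst.write(data)
```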

7.4 Imagery Fusion

  • Data Preparation: Requires a high-resolution panchromatic image and a corresponding multispectral image of the same scene or region.
  • Resolution and Registration: It is recommended that both have good geometric registration and the same Coordinate Reference System (CRS). Resampling and registration should be performed beforehand if necessary.
  • Typical Use: To improve spatial resolution while preserving spectral characteristics as much as possible.
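
For intuition, the classic Brovey transform rescales each multispectral band by the ratio of the panchromatic band to the per-pixel mean of the multispectral bands. A simplified sketch (the platform's actual fusion algorithm may differ):

```python
import numpy as np

def brovey_pansharpen(ms: np.ndarray, pan: np.ndarray) -> np.ndarray:
    """ms: (bands, rows, cols) multispectral resampled to pan resolution;
    pan: (rows, cols) panchromatic band. Returns sharpened (bands, rows, cols)."""
    intensity = ms.mean(axis=0)
    ratio = pan / np.maximum(intensity, 1e-6)  # avoid division by zero
    return ms * ratio
```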

7.5 Index Calculation

  • Common indices such as NDVI and NDWI are supported. The formula must be specified in the task parameters.
  • Ensure that the input bands match the index requirements, e.g., NDVI requires NIR and Red bands.
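
For reference, NDVI is (NIR - Red) / (NIR + Red); a sketch of the per-pixel computation (band file names are illustrative):

```python
import numpy as np
import rasterio

with rasterio.open("nir.tif") as nir_src, rasterio.open("red.tif") as red_src:
    nir = nir_src.read(1).astype("float32")
    red = red_src.read(1).astype("float32")
    profile = nir_src.profile

ndvi = (nir - red) / np.maximum(nir + red, 1e-6)  # guard against zero sums

profile.update(count=1, dtype="float32")
with rasterio.open("ndvi.tif", "w", **profile) as dst:
    dst.write(ndvi, 1)
```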

8. Debugging and Troubleshooting

| Symptom | Possible Cause | Recommended Action |
| --- | --- | --- |
| Task stuck at NOT_STARTED for a long time | Python service is not running or the database connection is abnormal | Check the data_process_service logs and health status |
| Task fails and errorMessage contains FileNotFound | Invalid input file ID or the file has been deleted | Confirm the file still exists in data storage |
| Task fails with insufficient permissions | No write permission for the output path, or the Worker's object storage credentials are insufficient | Verify storage mount path permissions |
| API returns 401/403 | Token expired or role is missing | Re-apply for a Token and confirm user permissions |
| Result preview is missing | Processing succeeded but no preview was generated | Check whether the preview generation logic in data_process_service was executed |

9. Performance and Extension Suggestions

  • Task Concurrency: It is recommended that each data_process_service instance handle one task at a time to avoid I/O contention; throughput can be increased by scaling Worker instances horizontally.
  • Task Queue Governance: Regularly clean up expired or failed tasks to prevent queue buildup from affecting scheduling.
  • Monitoring Metrics:
    • Task processing duration.
    • Failure rate and distribution of error types.
    • data_process_service CPU/GPU, memory, and disk I/O.

10. Frequently Asked Questions (FAQ)

Q1: How to support a new processing type?
Add a new processing type enum in the Java Service, register the corresponding Task class in the Python Worker, implement the execute_processing logic, and synchronize the frontend enum.
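
A hypothetical sketch of the Worker side of this pattern (the registry and decorator are assumptions; only execute_processing is named in this guide):

```python
# Hypothetical registry; adapt to data_process_service's actual structure.
TASK_REGISTRY: dict[int, type] = {}

def register_task(task_type: int):
    """Class decorator that maps a TaskType value to its implementation."""
    def wrapper(cls):
        TASK_REGISTRY[task_type] = cls
        return cls
    return wrapper

@register_task(12)  # 12 is an illustrative, unused TaskType value
class MyNewTask:
    def execute_processing(self, inputs, output_path):
        # Implement the new algorithm here and write results to output_path.
        raise NotImplementedError
```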

Q2: Is task cancellation supported?
Currently, only pending tasks can be deleted; canceling a running task is not supported.

Q3: Can the results be used as input again?
Yes. The processing results are written to the data storage module and can be referenced again from the "Select Data" selector.

Q4: How to troubleshoot a data_process_service process crash?
Check the status of manage.py runserver and start_celery_scheduler through the process manager, locate the specific exception using the logs, and then fix it based on the error type.