About the project
CBDA NorKyst-800 is a "big data engine" pilot project applied to data from the numerical ocean modelling system NorKyst-800. NorKyst-800 implements the numerical ocean model ROMS (Regional Ocean Modeling System) at high spatial resolution and is used to simulate physical variables such as sea level, temperature, salinity and currents for all coastal areas of Norway and the adjacent seas.
The core of CBDA's work was ingesting the data into the local Hadoop ecosystem, processing it, and outputting it in a standardized and optimized format that can be easily accessed and handled for further analysis.
Specifically, the public data from the NorKyst-800 database was managed as follows:
- ingestion into the local Hadoop infrastructure and choice of an optimal representation of the data for further processing: since the existing interfaces provided by the MapReduce and Spark frameworks cannot efficiently handle array-based data formats such as NetCDF, new interfaces were created and a data model was designed specifically for NetCDF. The developed NetCDF-based interfaces allow both MapReduce and Spark to efficiently extract and transform the datasets, store them in HDFS as ORC tables, and process them.
- output through an accessible interface to make the data easily available for relevant research tasks.
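To illustrate why array-based data needs a dedicated representation before it can be stored as a table, the following minimal Python sketch flattens a small gridded variable into one row per grid cell, which is the general shape a columnar table expects. It is not the project's actual data model or interface: the function name `flatten_grid`, the dimension names (`time`, `y`, `x`) and the sample values are all assumptions made for illustration, and the real interfaces live in the repository linked below.

```python
from itertools import product
from typing import Dict, Iterator, List, Sequence

# Hypothetical helper, not part of the project's code base.
def flatten_grid(
    variable: str,
    dims: Sequence[str],
    coords: Dict[str, List[float]],
    values: Sequence[float],
) -> Iterator[Dict[str, float]]:
    """Yield one row per grid cell from a flattened, row-major array.

    `values` is assumed to be stored with the last dimension varying
    fastest, as in a C-ordered array.
    """
    shape = [len(coords[d]) for d in dims]
    expected = 1
    for n in shape:
        expected *= n
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")

    # itertools.product iterates with the last index varying fastest,
    # so it visits cells in the same order as the flattened array.
    for flat_index, idx in enumerate(product(*(range(n) for n in shape))):
        row = {d: coords[d][i] for d, i in zip(dims, idx)}
        row[variable] = values[flat_index]
        yield row


# Made-up sample data: 2 time steps on a 2 x 3 grid.
dims = ("time", "y", "x")
coords = {"time": [0.0, 3600.0], "y": [0.0, 1.0], "x": [0.0, 1.0, 2.0]}
values = [float(v) for v in range(12)]

rows = list(flatten_grid("temperature", dims, coords, values))
print(len(rows), rows[0])
```

Each resulting row carries its coordinates explicitly, so it could be written to a tabular store and filtered by time or position without reading the whole array. The trade-off is that coordinates are repeated for every cell, which a columnar format with compression is intended to absorb.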
The code base is open source and is made available by the author, P. Thongtra, through her GitHub account: https://github.com/pthongtra/netcdf-load-utils