MetOcean Solutions Observation API connector

This notebook describes how to get observation data from one of MetOcean Solutions' observation API. The data that can be accessed though the API come from different sources such as MetService's automatic weather station network, a number of wave buoys and tidal gages owned by various organisations. Different instances of the API are used for different sources. This connector should work for all of them. The API provides the data as sets of records that belong to a station. Each station is identified by a unique id string. Each station has its own metadata describing location, instrument, ownership, status, license and level of permission. If the station is not open then anyone interested to use the data needs to get in contact with MetOcean to negotiate access to the data. Any access to data non data will be done using the same connector but authentication will need to be set when initialising the connector. For now the connector allows to retrieve data in the shape of xarray datasets. A complete documentation of the API can be found here:

Sebastien Delaux
To Be Determined

Installation of required libraries

Install and import xarray and matplotlib in the current Jupyter kernel. Xarray is a great library to work with data stored using NetCDF format. MetOcean's Observation API can provide data under the form of netCDF-like JSON and hence xarray is a great option to work with that data. Matplotlib will used to plot the data.

In [1]:

import sys !{sys.executable} -m pip install xarray matplotlib import xarray
Collecting xarray Using cached Collecting matplotlib Using cached Collecting numpy>=1.15 (from xarray) Using cached Collecting setuptools>=41.2 (from xarray) Using cached Collecting pandas>=0.25 (from xarray) Using cached Collecting python-dateutil>=2.1 (from matplotlib) Using cached Collecting kiwisolver>=1.0.1 (from matplotlib) Using cached Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib) Using cached Collecting cycler>=0.10 (from matplotlib) Using cached Collecting pytz>=2017.2 (from pandas>=0.25->xarray) Using cached Collecting six>=1.5 (from python-dateutil>=2.1->matplotlib) Using cached Installing collected packages: numpy, setuptools, pytz, six, python-dateutil, pandas, xarray, kiwisolver, pyparsing, cycler, matplotlib Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.2.1 numpy-1.18.2 pandas-1.0.3 pyparsing-2.4.6 python-dateutil-2.8.1 pytz-2019.3 setuptools-46.1.3 six-1.14.0 xarray-0.15.1

Install and import cf-json extension to xarray which will be used to turn netCDF-like json as returned by the API into an xarray Dataset

In [2]:

!{sys.executable} -m pip install git+ssh:// from cfjson.xrdataset import CFJSONinterface
Collecting git+ssh:// Cloning ssh:// (to v0.3.3) to /tmp/pip-ne5mykz6-build Installing collected packages: cfjson Running install for cfjson ... done Successfully installed cfjson-0.3.3

Import request library to query the API and path module of os package to help building url

In [3]:

import requests from os import path
Definition of connector

Declare a new exception class that can be used to identify any exception raised by the connector

In [4]:

class MetOceanObsAPIException(Exception): def __init__(self, message, obj=None, obj_name=None, status=400): self.message = message self.obj = obj self.obj_name = obj_name self.status = status super(ThreddsServerConnector, self).__init__(message)

Definition of connector

In [5]:

class MetOceanObsAPIConnector(object): """ A class used to connect one of MetOcean Solution's instances of their observation API. Attributes ---------- root_url : str The root url of the service Methods ------- get_station_metadata(station_id) return the metadata associated with a given observation station get_data() return observation data retrieved from the API in the form of a xarray dataset object """ def __init__(self, root_url="", username=None, password=None, api_key=None): """ Initialise the connector Parameters ---------- root_url : str The url of the observation API to connect to. (default is username : str The username to use to connect the API password : str The password to use to connect the API api_key : str The api_key to use to connect the API Raises ------ NotImplementedError If any of the authentication parameters is set. """ self.root_url = path.join(root_url, 'api/stations') if username or password or api_key: raise NotImplementedError("Secure authentication is yet to be implemented for the connector") def get_station_metadata(self, station_id): """ Get the metadata associated with a given observation station Parameters ---------- station_id : str The id of the station to return the metadata for. Raises ------ MetOceanObsAPIException If query did not return with 200 status code Returns ------- dict A dictionary containing the metadata associated with the station """ ## Build query url url = path.join(self.root_url, station_id) ## Do request res = requests.get(url) ## Check request successed. If not raise exception if res.status_code != 200: raise MetOceanObsAPIException("getstation_metadata query failed with status code {1} and reasons: %s"\ .format(res.status_code,\ res.text)) ## Decode JSON response and return return res.json() def get_data(self, station_id, time=None, variables=None, max_nrecords=100): """ Get the metadata associated with a given observation station Parameters ---------- station_id : str The id of the station to return the metadata for. time : str ISO8601 time interval with double dash option and empty start/end allowed used to filter the data. Valid exemples are: time=2007-03-01T13:00:00Z--2008-05-11T15:30:00Z time=2007-03-01T13:00:00Z-- time=--2008-05-11T15:30:00Z variables : list List of strings containing the name of the variables to retrieve. Return all by default max_nrecords : int The maximum number of records to retrieve (default is 100) Raises ------ MetOceanObsAPIException If query did not return with 200 status code Returns ------- xarray dataset A xarray dataset containing the data for the station """ ## Build full url to query records url = path.join(self.root_url, station_id, 'records') ## Build dictionary with query parameters params = {} params['time'] = time params['vars'] = variables params['limit'] = max_nrecords ## Excecute request res = requests.get(url, headers={'accept': 'application/'}, params=params) ## Check for success if res.status_code != 200: raise MetOceanObsAPIException("Get data query failed with status code {1} and reasons: %s"\ .format(res.status_code,\ res.text)) ## Turn data into xarray dataset dset = xarray.Dataset() dset.cfjson.from_json(res.json()) ## Return xarray dataset return dset

Instantiate new connector and retrieve metadata for the station of id 'ciwrb'

In [6]:

obs_api_connector = MetOceanObsAPIConnector() metadata = obs_api_connector.get_station_metadata(station_id='ciwrb') print(metadata)
{'id': 'ciwrb', 'name': 'ciwrb', 'tags': ['wave'], 'description': '', 'provider': 'metocean', 'level': 0.0, 'dataset': 'socean', 'coordinates': 'POINT (169.04916 -52.76000000000001)', 'latitude': -52.76000000000001, 'longitude': 169.04915999999997, 'public': True, 'is_active': False, 'last_record_at': '2019-06-19T18:00:00Z'}

For now the metadata available is limited and some metadata like the variables provided for a given station are not available. That information can be retrieved by first querying for a sample of the data.

In [7]:

dset = obs_api_connector.get_data(station_id='ciwrb', max_nrecords=3)

From the sample result, the names of the variables can be listed:

In [8]:

for (var_name, var) in dset.data_vars.items(): print(var.attrs['standard_name'],'-->', var_name)
latitude --> latitude longitude --> longitude altitude --> level sea_surface_wave_period_at_variance_spectral_density_maximum --> tp sea_surface_wave_significant_height --> hs sea_surface_wave_maximum_height --> hmax sea_surface_wave_from_direction_at_variance_spectral_density_maximum --> dpm

Now we can query for specific variables only and a set time range

In [9]:

dset = obs_api_connector.get_data(station_id='ciwrb', time='2019-06-18T15:00:00Z--2019-06-19T21:00:00Z', variables=['hs', 'tp'], max_nrecords=300) print("Retrieved {0} records for station ciwrb".format(len(dset.time)))
Retrieved 9 records for station ciwrb

Now we can do the same for a weather station. We retrieve temperature (tmp) since 2020-01-21 19:00:00Z and we get at most 200 records

In [10]:

dset = obs_api_connector.get_data(station_id='tu0fknyk', time=['2020-01-21 19:00:00Z--'], max_nrecords=200) print("Retrieved {0} records for station tu0fknyk".format(len(dset.time)))
Retrieved 200 records for station tu0fknyk

Now lets plot the data using matplotlib

In [11]:

%matplotlib inline import matplotlib.pyplot as plt from matplotlib.dates import date2num plt.xlabel('time') plt.ylabel('Temperature (degree Celcius)') plt.plot_date(date2num(dset.time),


[<matplotlib.lines.Line2D at 0x14c3179fb2e8>]
Simple numerical data correction example

Now we are going to put together a toy example aiming at correcting numerical data using observation data. That kind of approach could be used in the context of statistical downscaling.

Note that this is a toy example aimed at demonstrating a workflow and one added constraint for the demonstration was that it had to use a neural network model.

The amount of data used here is tiny and clearly not enough to build something scientifically robust. Also the model is only suitable in that it is aimed at working on sequences of data, we did not investigate scientific suitability and it is likely there are simpler and more suitable options.

The data used here is the 2 metres temperature from the nowcast for the last 9 days from the 0.25 degree GFS model at the same location as that of the observation station tu0fknyk.

In [12]:

from datetime import datetime from dateutil.relativedelta import relativedelta import numpy as np

Here is a useful function that converts the datetime format used by xarray to the "standard" python datetimes.

In [13]:

def datetime64_to_datetime(dd): return datetime.utcfromtimestamp(round( * 1e-9))

Get the longitude/latitude coordinates of the observation point

In [14]:

lon0 =[0] lat0 =[0]

Now load one of the GFS dataset to identify the nearest grid point to the observation station

In [76]:

now = datetime.utcnow() gfs = xarray.open_dataset(''%(now.year, now.month,

In [77]:

ilon = np.argmin(np.abs( ilat = np.argmin(np.abs( print("Approximating longitude",lon0,"by",[ilon]) print("Approximating latitude",lat0,"by",[ilat]) print(ilon, ilat)
Approximating longitude 174.971 by 175.0 Approximating latitude -37.03 by -37.0 700 212

For the last 9 days, download all forecast guidances (cycles 0,6,12 and 18) available and keep the 2 metre temperature variable at the point of interest for the first 6 hours (nowcast).

In [18]:

now = datetime.utcnow() t = datetime(now.year, now.month, - relativedelta(days=9) data = [] files = [] while t < now: for cycle in [0,6,12,18]: try: gfs = xarray.open_dataset(''%(t.year,t.month,,cycle)) print(gfs.time[0].data) files.append(gfs) data.append(gfs.tmp2m[:6,ilat,ilon]) except: pass t = t + relativedelta(days=1)
2020-03-21T00:00:00.000000000 2020-03-21T06:00:00.000000000 2020-03-21T12:00:00.000000000 2020-03-21T18:00:00.000000000 2020-03-22T00:00:00.000000000 2020-03-22T06:00:00.000000000 2020-03-22T12:00:00.000000000 2020-03-22T18:00:00.000000000 2020-03-23T00:00:00.000000000 2020-03-23T06:00:00.000000000 2020-03-23T12:00:00.000000000 2020-03-23T18:00:00.000000000 2020-03-24T00:00:00.000000000 2020-03-24T06:00:00.000000000 2020-03-24T12:00:00.000000000 2020-03-24T18:00:00.000000000 2020-03-25T00:00:00.000000000 2020-03-25T06:00:00.000000000 2020-03-25T12:00:00.000000000 2020-03-25T18:00:00.000000000 2020-03-26T00:00:00.000000000 2020-03-26T06:00:00.000000000 2020-03-26T12:00:00.000000000 2020-03-26T18:00:00.000000000 2020-03-27T00:00:00.000000000 2020-03-27T06:00:00.000000000 2020-03-27T12:00:00.000000000 2020-03-27T18:00:00.000000000 2020-03-28T00:00:00.000000000 2020-03-28T06:00:00.000000000 2020-03-28T12:00:00.000000000 2020-03-28T18:00:00.000000000 2020-03-29T00:00:00.000000000 2020-03-29T06:00:00.000000000 2020-03-29T12:00:00.000000000 2020-03-29T18:00:00.000000000 2020-03-30T00:00:00.000000000 2020-03-30T06:00:00.000000000 2020-03-30T12:00:00.000000000

Concatenate all the 6 hours segments into a single one

In [19]:

nowcast = xarray.concat(data,dim='time')

Now we get all the observation available since the start of the nowcast

In [20]:

t0 = datetime64_to_datetime(nowcast.time[0]) obs = obs_api_connector.get_data(station_id='tu0fknyk', time=['%04d-%02d-%02d %02d:00:00Z--'%(t0.year, t0.month,, t0.hour)], max_nrecords=1000)

As the observation API returns the most recent recoords first, we sort them by ascending time.

In [21]:

obs = obs.sortby('time', ascending=True)

Now we match the data. For each observation that has a corresponding time-stamp in the nowcast put data keep the data.

In [22]:

obs_data = [] nowcast_data = [] time = [] ## Parse datetime for nowcast nowcast_time = np.array([datetime64_to_datetime(t) for t in nowcast.time]) for it, t in enumerate(obs.time[:]): ## Check if match match = np.where(nowcast_time == datetime64_to_datetime(t)) if len(match[0]) == 1: obs_data.append([it]) nowcast_data.append(nowcast[match[0]].data[0]) time.append(datetime64_to_datetime(t)) ## Turn list into numpy array (nicer to work with) obs_data = np.array(obs_data) ## Do also conversion from Kelvin to Celcius nowcast_data = np.array(nowcast_data)-273.18 time = np.array(time)

Now let's plot the retrieved data. There does not seem to be much if any bias but it looks like the amplitude of the daily variations in the observations is slightly larger than that of the nowcast data so maybe we can correct for that.

In [23]:

%matplotlib inline plt.plot_date(date2num(time), obs_data, label='Observations') plt.plot_date(date2num(time), nowcast_data, label='Nowcast') plt.legend()


<matplotlib.legend.Legend at 0x14c3112d52b0>

We will have a go at using a simple LSTM model built using Keras with a tensorflow backend. We assume that the image/environment we use here already has CUDA/CUDNN/Tensorflow installed. We just add Keras.

In [28]:

!{sys.executable} -m pip install keras
Collecting keras Using cached Collecting h5py (from keras) Using cached Collecting keras-preprocessing>=1.0.5 (from keras) Using cached Collecting six>=1.9.0 (from keras) Using cached Collecting pyyaml (from keras) Collecting scipy>=0.14 (from keras) Using cached Collecting numpy>=1.9.1 (from keras) Using cached Collecting keras-applications>=1.0.6 (from keras) Using cached Installing collected packages: six, numpy, h5py, keras-preprocessing, pyyaml, scipy, keras-applications, keras Successfully installed h5py-2.10.0 keras-2.3.1 keras-applications-1.0.8 keras-preprocessing-1.1.0 numpy-1.18.2 pyyaml-5.3.1 scipy-1.4.1 six-1.14.0

First we prepare the data to model. We will use sequences of the 15 consecutive nowcast points and attempt at predicting the difference between the observations and the nowcast.

In [24]:

x = np.array([nowcast_data[i:i+15] for i in range(len(nowcast_data)-15)]) y = np.array([obs_data[i:i+15][-1]-nowcast_data[i:i+15][-1] for i in range(len(obs_data)-15)]) ## Needs to be of shape (nsamples, sequence_length, nfeatures) x = x.reshape(x.shape[0], x.shape[1], 1)

We normalise the input data to the neural network

In [27]:

xmin = x.min() xmax = x.max() x = (x-xmin)/(xmax-xmin) print(xmin, xmax)
8.8012085 24.061737

Now we load leras and build the model

In [67]:

import keras from keras.models import Sequential, Model from keras.layers import Dense, LSTM, Input neurons = 10 batch_size = 32 nb_epoch = 1000

In [68]:

ip = Input((x.shape[1], x.shape[2])) xx = LSTM(neurons, stateful=False)(ip) out = Dense(1)(xx) model = Model(ip, out) model.compile(loss='mean_squared_error', optimizer='adam')

We train the model on all data except for the last 50 samples

In [69]:[:-50], y[:-50], epochs=nb_epoch, batch_size=batch_size, verbose=1, shuffle=False)
Epoch 1/1000 168/168 [==============================] - 0s 2ms/step - loss: 2.5169 Epoch 2/1000 168/168 [==============================] - 0s 114us/step - loss: 2.4623 Epoch 3/1000 168/168 [==============================] - 0s 115us/step - loss: 2.4153 Epoch 4/1000 168/168 [==============================] - 0s 125us/step - loss: 2.3740 Epoch 5/1000 168/168 [==============================] - 0s 101us/step - loss: 2.3374 ..... Epoch 996/1000 168/168 [==============================] - 0s 113us/step - loss: 0.6693 Epoch 997/1000 168/168 [==============================] - 0s 105us/step - loss: 0.6434 Epoch 998/1000 168/168 [==============================] - 0s 128us/step - loss: 0.6387 Epoch 999/1000 168/168 [==============================] - 0s 118us/step - loss: 0.6343 Epoch 1000/1000 168/168 [==============================] - 0s 141us/step - loss: 0.6357


<keras.callbacks.callbacks.History at 0x14c2c58c0b70>

We evaluate the model on all data

In [70]:

res = model.predict(x).squeeze()

Plotting the data, the model fits the data OK but performs a lot less well on the last 50 time steps which were not in the training set

In [71]:

plt.plot(time[15:], obs_data[15:] - nowcast_data[15:], label='Error') plt.plot(time[15:], res, label='Correction') plt.legend()


<matplotlib.legend.Legend at 0x14c2c4ff0a58>

Just for illustration here are the data and the corrected data. The results are not great which is not a surprise but the workflow is there.

In [73]:

plt.plot(time, nowcast_data, label='Nowcast') plt.plot(time, obs_data, label='Observations') plt.plot(time[15:], res+nowcast_data[15:], label='Corrected nowcast') plt.legend()


<matplotlib.legend.Legend at 0x14c2c4f79668>

In [ ]: