Venice is drowning - final report
Abstract
The main objective of the project is to analyze tide measurements from the area of the Venice lagoon, producing predictive models whose performance is evaluated on forecast horizons ranging from one hour up to one week.
For this purpose, three models, spanning both linear and machine-learning approaches, are tested:
- ARIMA (AutoRegressive Integrated Moving Average);
- UCM (Unobserved Components Model);
- LSTM (Long Short-Term Memory).
Datasets
Two datasets are the basis for the project pipeline:
- the “main” dataset contains the tide level measurements (in cm) in the Venice lagoon, relative to a reference level and recorded by a sensor, between 1983 and 2018;
- a second dataset holds the information regarding meteorological variables, namely rainfall (in mm), wind direction at 10 meters (in degrees) and wind speed at 10 meters (in m/s), covering the period between 2000 and 2019.
The tide level dataset is assembled from the single-year historical datasets made public by the city of Venice, in particular by the Centro Previsioni e Segnalazioni Maree. The data regarding the meteorological variables, instead, have been provided on request by ARPA Veneto.
Considering that one of the main causes of the tidal phenomenon is known to be the lunar gravitational potential acting on the Earth's seas, an attempt was made to explicitly introduce the physics of the system into the model through the creation of an ad-hoc time series containing the distance between Venice and the Moon (i.e. the radius of the gravitational interaction). To this end, we first reproduced the approximate solution of the (differential) equations of motion governing the lunar orbit around the Earth (treated as a Sun-Earth-Moon three-body problem) presented in Richard Fitzpatrick's book “Newtonian Dynamics”, obtaining a fairly precise analytical description of the aforementioned interaction radius, calculated with respect to the centre of the Earth, as a function of time. Then, in order to access more precise solutions, and for the sake of simplicity and code reuse, we exploited PyEphem, an astronomy library that provides basic astronomical computations for the Python programming language: the target distance is obtained from the API, which internally performs a 4th-order Runge-Kutta integration and triangulates the distance with respect to the observer position (Venice, in this case). In summary, the “lunar motion” time series, consisting of the distance between Venice and the Moon as a function of time, is obtained for further testing.
All the preprocessing operations regarding parsing, inspection and the final union of the cited datasets are available in the following scripts:
- parsing_tides_data builds the tidal dataset, importing and unifying each single annual dataset;
- inspection contains a series of preliminary inspections of the aforementioned data;
- preprocess_weather_data_2000_2019 contains the preprocessing operations of the weather-related dataset;
- parsing_tides_weather reports a summary of the procedure implemented in order to deal with missing data in the weather dataset, and contains the merging operation producing the final dataset.
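The merging step performed in parsing_tides_weather can be sketched as follows, assuming both preprocessed datasets are indexed by hourly timestamps; the column names and the forward-fill gap handling are illustrative stand-ins, not the project's exact procedure.

```python
import pandas as pd

# Illustrative hourly stand-ins for the two preprocessed datasets.
idx = pd.date_range('2018-07-01', periods=6, freq='h')
tides = pd.DataFrame({'level_cm': [52, 55, 60, 58, 50, 47]}, index=idx)
weather = pd.DataFrame({'rain_mm': [0.0, 0.2, 0.0, 1.4, 0.6, 0.0],
                        'wind_speed_ms': [3.1, 2.8, 4.0, 5.2, 4.4, 3.0]},
                       index=idx)

# Left join on the timestamp index keeps every tide observation and
# attaches the matching weather record where available.
merged = tides.join(weather, how='left')

# Simple gap handling: forward-fill short runs of missing weather values.
merged[weather.columns] = merged[weather.columns].ffill()
```

A left join is the natural choice here because the tide series is the target variable: no tide observation should be dropped just because a weather record is missing.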
As a deliberate choice, due to time-related and computational constraints, only the data ranging from 2010 to 2018 are kept after the preprocessing.
Data inspection
During the preprocessing phase, some descriptive visualizations regarding the main time series are produced in order to inspect its characteristics.
Fig.1 reports the entire time series, together with the autocorrelation and partial autocorrelation plots. Observing Fig.2, it is possible to notice that the tidal phenomenon seems to follow a normal distribution. In such a case, it is worth noticing that, from the analytical perspective, the concepts of strict and weak stationarity are equivalent for Gaussian processes.
During the preliminary inspection of the historical data, one of the investigations concerns the mean and variance stationarity of the considered time series. Regarding the former, Fig.1 suggests that the series is characterized by a stationary mean. This hypothesis is further supported by the output of the Augmented Dickey-Fuller test, confirming the in-mean stationarity of the tidal phenomenon.
On the other hand, the latter characteristic (stationary variance) can be verified by observing Fig.4, where the daily average of the tide level values is plotted against the daily aggregation of the standard deviation. In particular, no clear trend emerges in such a plot, supporting the conclusion that the plotted quantities are uncorrelated.
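The daily aggregation behind Fig.4 can be sketched as below; the synthetic tide-like signal (an M2-period sinusoid plus noise) is an illustrative assumption, and the correlation coefficient is the numerical counterpart of the visual “no clear trend” reading.

```python
import numpy as np
import pandas as pd

# Hourly synthetic tide-like series over ~60 days (illustrative stand-in).
idx = pd.date_range('2018-07-01', periods=24 * 60, freq='h')
rng = np.random.default_rng(0)
hours = np.arange(len(idx))
level = 30 * np.sin(2 * np.pi * hours / 12.42) + rng.normal(0, 5, len(idx))
tide = pd.Series(level, index=idx)

# Daily mean and daily standard deviation, as plotted in Fig.4.
daily = tide.resample('D').agg(['mean', 'std'])

# A correlation near zero between the two columns supports
# in-variance stationarity.
corr = daily['mean'].corr(daily['std'])
```

If the variance grew with the level (e.g. under multiplicative effects), this correlation would be markedly positive and a variance-stabilizing transform would be warranted.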
Models
The produced models cover two areas: a purely statistical one, with linear models such as ARIMA and UCM, and a machine learning one, through the investigation of an LSTM model. The preparation and implementation of the models are presented below, followed by a results section where a rapid comparison between the performance of the models on a test set defined a priori is proposed. For the modelling approach, it is worth highlighting the different subsets of the whole dataset exploited by each of the described areas:
- for the linear models, the training set is composed of the last six months of 2018, from July to December;
- for the machine learning approach, given its capability of handling more data within a non-explosive computational time, the training set covers the period between January 2010 and December 2018.
The test set, previously extracted, is common to the different approaches and consists of the last two weeks of December 2018, i.e. from 17/12/2018 23:00:00 to 31/12/2018 23:00:00.
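The split for the linear models can be expressed as plain label-based slicing on the datetime index; the series below is a counter used purely as an illustrative stand-in.

```python
import pandas as pd

# Illustrative hourly series covering the linear models' window.
idx = pd.date_range('2018-07-01 00:00', '2018-12-31 23:00', freq='h')
tide = pd.Series(range(len(idx)), index=idx, dtype=float)

# Training set: July 2018 up to the hour before the held-out window.
train = tide.loc['2018-07-01 00:00':'2018-12-17 22:00']
# Test set: the last two weeks of December 2018 (both endpoints included).
test = tide.loc['2018-12-17 23:00':'2018-12-31 23:00']
```

Note that pandas label slices are inclusive on both ends, so the test window spans exactly 14 days plus one hour of observations (337 hourly points).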
Concerning the linear models, two strategies are considered: the former consists of integrating the meteorological variables and the lunar motion (visualized in Fig.7) into the main analysis, while the latter is based on performing a sort of harmonic analysis by re-using some of the principal periodic components previously extracted in other studies in the field (see Consoli et al., 2014 and Abubakar et al., 2019). The adaptation of such components to our series is done exploiting oce, an R package that helps oceanographers by providing utilities to investigate and elaborate oceanographic data files.
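While the project relies on the R package oce for this step, the underlying idea, a least-squares fit of sinusoids at fixed, known tidal frequencies, can be sketched in Python as below. The M2 and S2 periods are standard constituent values; the synthetic signal and amplitude recovery are illustrative, not the project's actual fit.

```python
import numpy as np

# Known principal tidal constituent periods, in hours (standard values).
PERIODS_H = {'M2': 12.4206, 'S2': 12.0}

def fit_harmonics(t_hours, y, periods=PERIODS_H):
    """Least-squares fit of a constant plus cos/sin pairs at fixed tidal
    frequencies; returns the fitted series and the coefficient vector."""
    cols = [np.ones_like(t_hours)]
    for p in periods.values():
        w = 2 * np.pi / p
        cols += [np.cos(w * t_hours), np.sin(w * t_hours)]
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef, coef

# Synthetic check: a pure M2 oscillation of amplitude 25 cm plus noise.
t = np.arange(24 * 30, dtype=float)
rng = np.random.default_rng(1)
y = 25 * np.cos(2 * np.pi * t / 12.4206) + rng.normal(0, 2, t.size)
fitted, coef = fit_harmonics(t, y)
# Amplitude of the M2 component from its cos/sin coefficient pair.
m2_amplitude = np.hypot(coef[1], coef[2])
```

Because the constituent frequencies are fixed a priori, only amplitudes and phases are estimated, which is what makes re-using components from previous studies meaningful.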