A NEW APPROACH FOR LOW-LATITUDE IONOSPHERIC SCINTILLATION PREDICTION

. The prediction of ionospheric scintillation is a current research topic. Data-oriented models have been proposed since there is not a mathematical model able to simulate the complex ionospheric mechanisms for the prediction of scintillation. Data-oriented models employ machine learning algorithms that are trained with known, post-mortem data, and then perform predictions from new data. An ensem-ble method based on sampling boosting applied for a set of decision trees is employed for the prediction of scintillation in a single low-latitude location in Brazil. The prediction was performed considering two classes, occurrence or absence of scintillation. The method uses as input temporal series of the scintilla-tion index S 4 , the total electron content (TEC), and some geomagnetic and solar indexes for that location. Data encompasses the summer months of the previous solar cycle years (2010-2018), and it was split into mutually-exclusive training and validation/test sets. Prediction performance was promising, showing a potential to be developed for operational use. Data limitations related to time series extension, or also to a balanced distribution of samples/instances covering both classes, since scintillation occurrences are usually scarce, can be further developed as new data are available.


INTRODUCTION
Many socioeconomic activities rely on GNSS (Global Navigation Satellite System) satellite constellations, including the navigation of aircraft, land vehicles, and fluvial/maritime vessels.The Earth's ionosphere presents density irregularities, which cause fluctuation in phase/amplitude of the radio-frequency signals that traverse it.This effect is called ionospheric scintillation and may affect GNSS signals and/or telecommunication signals.Scintillation is associated with complex electromagnetic phenomena in the ionosphere, which are even described by some mathematical models but still require more resolution.
Scintillation is a current research topic due to its relevance.Since low-latitude scintillation is frequent in Brazil, researchers at INPE, the National Institute for Space Research, have been studying ionospheric scintillation and plasma bubbles for more than 50 years (DE PAULA et al., 2021) and participate in related international cooperation programs and projects with other research institutes and universities worldwide.In addition, INPE hosts the EMBRACE program (Brazilian Study and Monitoring of Space Weather)1 .
The absence of accurate mathematical models to simulate the ionosphere, and thus predict the occurrence of scintillation, gave rise to data-oriented models, which are based on machine learning.One such model "learns", using known post-mortem data that may include time series of scintillation, total electron content (TEC), and other related geomagnetic and solar variables and indexes.This work employed scintillation and TEC data from 2010 to 2018, covering almost the entire solar cycle 24.
Scintillation data was acquired from the GNSS networks LISN (Low-latitude Ionospheric Sensor Network, Peru), CIGALA-CALIBRA (Concept for Ionospheric Scintillation Mitigation for Professional GNSS in Latin America -Countering GNSS High Accuracy Applications Limitations due to Ionospheric Disturbances in Brazil), and ICEA (Institute for Airspace Control, Brazil).The S 4 data provided by these networks were then employed to generate S 4 maps by means of inverse distance weighted interpolation (VANI, 2018), which interpolates for each grid point of the map the S 4 values of all available GNSS stations, but weighting each value by the inverse of the distance station-grid point.
TEC data was generated using the RINEX files provided by the RMBC/IBGE (Rede Brasileira de Monitoramento Contínuo dos Sistemas GNSS, Brazilian Institute for Geography and Statistics) using the method adopted by EMBRACE/INPE (OTSUKA et al., 2002; SOUSA DO CARMO, 2018), which generates a TEC map combining the TEC values of the available stations for the considered time interval.These maps have some data limitations related to the quality, periodicity, and eventual out-of-service status of particular GNSS stations.Another limitation is that, for the early years of the considered solar cycle, the number of GNSS stations of these networks was lower than now, affecting the monitoring of scintillation.The number of stations has increased over the years.This work denotes as raw data the S 4 and TEC values extracted from the corresponding maps since these values undergo a specific pre-processing in order to have a more suitable representation for the

Ionospheric Scintillation
Nowadays, Global Navigation Satellite System (GNSS) are commonly employed in many activities, even in critical ones, such as aerial navigation.Besides GPS, other GNSS's were proposed and implemented (or are being implemented) around the world, such as the Russian GLONASS, the Chinese COMPASS, and the European GALILEO.However, GNSS radio-frequency signals may be disturbed by the ionosphere dynamic, mainly by undergoing phase and amplitude fluctuations along the line-of-sight between GNSS satellites and receivers on ground stations or on board vehicles, aircraft, or vessels.This phenomenon is called ionospheric scintillation, being measured in amplitude by the S 4 index, the normalized standard deviation of the signal amplitude I acquired at a rate of 50 Hz: The Earth's ionosphere presents many phenomena directly or indirectly related to the Sun dynamics.
The Sun emits radiation and particles that affect the Space Weather between Sun and Earth, including its electric and magnetic fields.In their turn, Earth's electric and magnetic fields are also affected, as well as the ionosphere density of ions and electrons, generating non-homogeneous and non-isotropic structures of different scales of size, ranging from meters to hundreds of kilometers.Such structures represent ionospheric irregularities, being the most important the ionospheric bubbles, regions with depletion of ions and electrons.The boundaries of these irregularities may present high-density gradients that may cause scintillation.
In addition, there is a very known plasma transport process from the magnetic Equator to low latitudes, in both Earth hemispheres, called the Equatorial Ionization Anomaly (EIA), which causes the rise of equatorial plasma, and eventually, the rise of bubbles that migrate towards the magnetic South and East, and may suffer expansion or contraction in size.Bubbles are formed after sunset and generally end up at the beginning of dawn.The occurrences of ionosphere bubbles start to intensify from October or November of each year, presenting peaks along the Summer, and start to decrease from March on.
The general morphology of bubble occurrence and associated mechanisms is well understood, but bubble occurrences present high day-to-day variability, making their prediction difficult (ABDU, 2019).
This poses a challenge for its prediction since this work proposes the use of time series that may embed such variability hampering the learning phase of the proposed algorithm.

Prediction of Ionospheric Scintillation
Currently, notwithstanding the existence of mathematical models able to simulate the forming and evolution of ionospheric bubbles, they would require higher temporal and spatial resolutions in order to predict scintillation (YOKOYAMA, 2017).There are some empirical models that employ or not historical data, but they are inaccurate for the prediction of scintillation (BÉNIGUEL et al., 2011; RETTERER, 2010; WERNIK et al., 2007).Scintillation may impact different socioeconomic services, and although being the occurrence of scintillation restricted to some regions and periods of time, it can critically affect aerial navigation and landing/take-off procedures, which increasingly rely on GNSS services.The prediction of the occurrence of scintillation is actually the prediction of the scintillation index S 4 in a point of the globe, making the abstraction of the existence of a two-dimensional S 4 field.Such index is actually a measure of the amplitude uniformity of the radio-frequency signal emitted by a GNSS satellite and reaching a ground receiver (there is also another measure considering the phase of the signal).For instance, worldwide institutions acquire raw data from networks of GNSS stations, to derive S 4 values, and the resulting set of these values at discrete points of the globe is then interpolated, providing regional or global S 4 maps that can be thought of as a two-dimensional field.
In the absence of a suitable mathematical model, a common approach is to employ a data-oriented model that is based on a machine learning algorithm.Similar approaches have been employed in different areas such as weather forecast or hydrology prediction, and rely on large amounts of data and on the use of supercomputers (CAMPOREALE, 2019).A machine learning algorithm has training, validation, and test phases using mutually exclusive sets of historical data.In the training phase, the algorithm "learns" using known data, adjusting learning parameters (for instance, weights and biases, in the case of a neural network).In the validation phase, other parameters inherent in the algorithm configuration, called hyper-parameters are optimized, and in the test phase, the resulting data-oriented model is evaluated.
A particular work (McGRANAGHAN et al., 2018) was the first to propose the prediction of scintillation in high magnetic latitudes but relies on the persistence paradigm, i.e. assuming that scintillation may last for some hours, suggesting that the ionosphere has a kind of non-linear memory.However many works, before and after that one, only proposed data-oriented models, as some works involving the authors (LIMA et al., 2014, 2015; REZENDE et al., 2010).In particular, a fresh search on the Google Scholar website revealed only a single work in recent years related to scintillation prediction in low magnetic latitudes, as proposed here.
The referred work proposes scintillation prediction using a data-oriented model based on gradient boosting (ZHAO et al., 2021).It employs S 4 data of the 2012-2020 period from Brazilian GNSS stations, and also the virtual height of the ionospheric F-layer base provided by specific radars called ionosondes.
The proposed prediction is made for the considered GNSS station considering two classes, occurrence or not of scintillation, respectively being below or above the threshold of S 4 = 0.5, in the period 2012-2020.

LOW-LATITUDE SCINTILLATION PREDICTION
The prediction is daily (actually for each night, since scintillation occurs mostly after sunset), yielding a single maximum S 4 value for each night.Training data uses S 4 values averaged every 5 minutes, but only the maximum of these values is considered for each night.
The prediction results of that work are promising, but there is a single S 4 value predicted for the entire night, not taking into account its temporal variability.In addition, the machine learning algorithm was trained using only a single maximum value for each night, not considering that, at the same time, the GNSS station could lock different satellites in different azimuthal angles that may yield S 4 values in a range of values from absent/weak, moderate or strong scintillation.Therefore, besides the limitation of a single-value prediction for each night, there is a strong bias to make the prediction of only high S 4 values that are more likely associated with the occurrence of strong scintillation in any azimuthal direction around the GNSS station.The approach is interesting, but the resulting prediction performance given by the F 1 score, a standard metric explained in the next section, was below 85 % in all cases, showing room for improvement.

DATA AND METHODOLOGY
Machine learning algorithms that perform prediction or classification are the same, only focusing on different problems.The prediction/classification performance relies on a suitable amount of good-quality data, which must undergo selection and pre-processing in order to be suitable for the considered algorithm.For instance, raw data may be discretized or normalized.A predictor/classifier algorithm is then applied, "learning" from known data in the training phase, thus generating a specific data-oriented model.Known data refers to input data composed of predictive features plus the expected numerical or categorical output data.This work employs two target classes, occurrence, and absence of scintillation.
The next phase is the validation, the tuning of the hyper-parameters of the algorithm, in order to improve the model performance.Finally, the test phase uses another "unknown" set of data (i.e.not including the target class).An analysis of the prediction performance is then made by some selected metrics.These steps, although applied to prediction, are part of a more general Data Science process called Knowledge Discovery in Databases (KDD), which is performed iteratively, in order to optimize each step.

Prediction metrics
The standard way to evaluate the prediction/classification performance is by a confusion matrix that summarizes correct and incorrect predictions/classifications for a number of instances to be predicted/classified.
In this case, there are only two classes (occurrence/absence of scintillation), being the matrix is 2 × 2.
The diagonal elements contain the count of correct predictions for the occurrence class (true positives TP and true negatives TN), while the remaining elements are the counts of mispredictions for the same class (false positives FP and false negatives FN).Some prediction/classification metrics can be derived from the confusion matrix, all having values in the range of 0.0 to 1.0, being the unity their optimal value.D r a f t

LOW-LATITUDE SCINTILLATION PREDICTION
These metrics are defined below considering instances of the scintillation occurrence class: Usually, it is intended to predict the majority of the scintillation occurrence instances (TP), while minimizing the misprediction occurrences (FN).As a consequence, the number of mispredicted nonoccurrences (FP) is increased, and some trade-off must the chosen in such adjustment in order to minimize the number of FNs without increasing too much the number of FPs.The F 1 score is a metric that better optimizes these numbers, being adopted in this work.

Data pre-processing
The data employed in this work covers the 2010-2018 period, but only the summer months, which usually present more scintillation events.Data includes time series of the S 4 and TEC indexes, both over SJC.
This work proposes a local S 4 prediction for SJC using local data, and thus no information about the location is employed.S 4 and TEC data of nearby GNSS stations were implicitly considered in the generation of the corresponding maps.
Besides the time series of the S 4 and TEC indexes, other time series were employed to account for the geomagnetic activity and the level of solar activity.The time series of geomagnetic indexes include the AE (Auroral Electrojet), Dst, Ap, and Sym-H/Sym-D, while the time series of solar indexes include the F10.7 (Solar flux), SN (daily sunspot number), IMF (Interplanetary Magnetic Field), and its components B z and IMF B y , solar wind velocity, and pressure, respectively, V sw and P sw .In addition, the time series of the virtual height of the F-layer base (h ′ F ) was also employed.
The sampling rates of S 4 , TEC, and other variables and indexes are higher, but the corresponding time series are downsampled to 30-minute by their maximum, in order to reduce the complexity of the problem and processing needs.The remaining data has a poorer temporal resolution.Predictions refer to finding out if there will be absence or occurrence of scintillation at a given time over SJC using the S 4 threshold value of 0.2 to distinguish between these classes.Many pre-processing steps were required, as described below.

LOW-LATITUDE SCINTILLATION PREDICTION
The resulting data from these pre-processing steps is a feature vector for each instant of time, according to the 30-minute resolution.This vector is composed of all the considered predictive features and may be considered as an instance of the input data.Every time a prediction is devised, the feature vector for that instant of time applies.or Time Series Cross Validation Gap Walk Forward (GapWalkForward).When using GapKFold, both internal and external schemes are GapKFold, the same is valid for the GapWalkForward.

Data partitioning and validation schemes
• Time Series Cross Validation with Gaps (GapKFold): the set of predictive features is divided into k mutually exclusive subsets, each one sorted in time and separated by 30-day gaps, being (k-1) subsets used for training and the remaining one for validation; this scheme is iteratively repeated by replacing the validation set by one of the training sets, rendering k different models (named as k-Cross Validation); the resulting prediction is given by a polling scheme using these k models (here, k was adopted as 5); • Time Series Cross Validation Gap Walk Forward (GapWalkForward): is similar to GapKFold, but generates datasets ordered in time, such that all data in the training dataset precedes the data in the validation dataset.K-different models can be generated: in the first interaction an ordered subset from all samples sets starts in the first sample, this is the training set, for each interaction the sample number in the training set will increase from the same amount while keeping the validation set with a fixed size, in the last interaction all the available samples will be covered when considered the training and validation subsets together (here, k was also adopted as 5).and validation of the same data resulting in two different models.Test results are given by a polling scheme using these two models.The algorithms are the XGBoost (CHEN et al., 2016) and the CatBoost (DOROGUSH et al., 2018; PROKHORENKOVA et al., 2018).However, a specific stopping criterion (early stopping) is applied for each model along the training, in order to avoid over-fitting.At every iteration of the training, the prediction performance based on the validation subset is evaluated.Since such performance degrades for too many iterations, an optimal number of training iterations can be defined for each model.

LOW-LATITUDE SCINTILLATION PREDICTION
It is worth noting that predictions are made for 6 prediction times, from 30 to 180 minutes of antecedence (spaced by 30 minutes), and therefore, there are actually 6 models, each one resulting from the combination of the models derived from XGBoost and CatBoost for a given prediction time using the GapKFold partitioning scheme, and another 6 models using GapWalkForward scheme.These 12 models are evaluated separately resulting in 12 prediction results at each instant of time.Each one implies in training, validating, and testing the related subsets, which comprise the number of instances with 30-minute resolution.
The (Tables 1 and 2) show the number of instances for the 30-minute prediction using respectively the GapKFold and the GapWalkForward schemes, both after performing the balancing of instances according to the classes (occurrence/absence of scintillation).In these tables, there are 5 subsets for each validation level, but in the table related to the GapKFold scheme, only the two first subsets are shown for the outer level.In the table related to the GapWalkForward scheme, only the first (smaller) and the last (larger) subsets of the outer level are shown.In both, for each outer level, all corresponding 5 subsets of the inner level are shown.

RESULTS
This section summarizes the prediction results for the different models, prediction times (30-180 minutes), and data partitioning schemes, using the prediction metrics described before in Section 2. ements for the test subset of each case.As expected, longer prediction times degrade the prediction performance, but for 30 and 60 minute antecedence, results show the F 1 score above 90 % for the Gap-KFold scheme, except for the last subset (V) of the GapKFold.The same tendency was observed for the V subset of the GapWalkForward scheme, but the results were poorer.
The same behavior is observed for all prediction times and both partitioning schemes and can be explained as follows.Regardless of the constant-size subsets of the GapKFold scheme, and the "cumulative" size of the GapWalkForward scheme, both V subsets refer to more recent instances, in this case, from the period 2017-2018 (solar minima).The corresponding training and validation subsets precede the V subset, being of the period 2011-2017 (solar maxima).Therefore, these poor results for the V subset are due to the use of data from different phases of the solar cycle.
In general, considering subsets I to IV, the performance of the GapWalkForward scheme was lower than the GapKFold one.The reason is that the GapWalkForward scheme preserves the temporal ordering of the training, validation, and testing sets, and thus the training employs only "past data" in relation to validation and testing sets.On the other hand, the GapKFold training may include samples of "future data" in relation to the validation and test sets.This is a kind of leakage from one set to another and generally improves the prediction performance since in the training phase the model may learn information about the "future".
A second problem may arise using the GapWalkForward scheme, which generates 5 models using increasing sizes of the training and validations sets, since, in the generation of the earlier models, the size of the training and validation sets may be too small precluding the learning process.
However, despite its lower prediction performance, the GapWalkForward scheme would be better for operational purposes, since it does not rely on "future data".
Tables 3 and 5 present results for 30 and 60 minute predictions, respectively for the GapKFold and the GapWalkForward schemes, showing accuracy, precision, and F 1 score for each case.Tables 4 and   6 present results for 30-60-90-120- machine learning algorithm.In the case of this work, the time series of these maps allowed the time series generation of the S 4 and TEC values for a single location with low magnetic latitude, the city of São José dos Campos (SJC), Brazil.The proposed data-oriented model aims to perform short-term predictions (30 minutes ahead) of the S 4 index during the night over SJC.The prediction performance of the model was evaluated by some specific metrics, showing a potential for future operational use at EMBRACE/INPE.

•
Generation of time series of S 4 maps from GNSS raw data similarly as performed in the GNSS network CIGALA-CALIBRA/INCT(VANI, 2018), but reducing the S 4 values corresponding to the set of neighboring IPPs (Ionosphere Piercing Point) at each grid point using the maximum value, instead of the average, and using a 10-minute integration interval; • Generation of time series of TEC using the method adopted by EMBRACE/INPE(OTSUKA et al.,   2002; SOUSA DO CARMO, 2018), also with 10-minute resolution; • Removal of eventual negative TEC values from the TEC maps, since they have no physical mean-ing; these negative values may be generated as a flaw of the method that estimates the absolute TEC value from the relative one; each negative value is replaced by the average of the lower non-negative 5% quantile; • Extraction of the TEC and S 4 values using the corresponding values of the grid point that is nearest to SJC, obtaining time series of these variables; • Re-sampling of the S 4 and TEC and the other time series for 30-minute resolution; • Filling in of the missing values of the S 4 and TEC time series using the last valid value that precedes each missing value; • Normalization of the S 4 and TEC time series values to the 0.0 to 1.0 interval.• Use of the libraries Tsfresh (CHRIST et al., 2016, 2018) and Tsfel (BARANDAS et al., 2020) for the extraction of predictive features from fixed-length windows of the considered time series, such as maximum value, average, median, Fourier transform coefficients, etc.; • Calculation of the time derivatives of each time series and estimation of nonlinear correlations between different time series or their derivatives; • Calculation of the average and the climatology deviation is given, for each value of the series, by the difference between the value and the average of precedent values for a fixed-length window; • Filtering of the predictive features composed of the original features plus the extra features obtained in the former steps; features with variance lower than 0.1 are filtered out since they are possibly meaningless in the learning process; • Standardization of all predictive features in order to yield zero mean and unitary variance for each one; • Use the SMOTE function of the Imblearn library (LEMAÎTRE et al., 2017) to generate synthetic instances from the set of original ones, in order to obtain a balanced distribution of instances (occurrence/absence of scintillation) in the training set.
scheme generates training, validation, and test sets.Both the internal and external partitioning levels used here employ one of the following schemes: Time Series Cross Validation with Gaps (GapKFold)

Table 1 :
Number of instances for the training, validation, and testing subsets using the GapKFold partitioning for the 30-minute prediction, showing subsets (I) and (II) of the outer level cross-validation.Two machine learning algorithms of this work are state-of-the-art ensemble methods, i.e. based on a finite set of classifiers/predictors.Two Gradient Boosting Ensemble algorithms are used in the training

Table 2 :
Number of instances for the training, validation, and testing subsets using the GapWalkForward partitioning for the 30-minute prediction, showing the smaller (I) and the larger (V) subsets of the outer level cross-validation.
1.It is worth noting that such metrics presented were obtained from the counts of TP, TN, FP, and FN el-

Table 3 :
Prediction results for 30 min and 60 min antecedence using the GapKFold scheme.

Table 4 :
Prediction results for 30 min to 180 min antecedence using the GapKFold scheme.

Table 5 :
Prediction results for 30 min and 60 min antecedence using the GapWalkForward scheme.

Table 6 :
Prediction results for 30 min to 180 min antecedence using the GapWalkForward scheme.
CONCLUSIONSThe monitoring and prediction of ionospheric scintillation occurrence are current research topics and have importance for GNSS warning systems, mainly those concerned with aerial navigation, and aircraft Braz.J. Geophys., 40, 4, 2022