A new accuracy measure based on bounded relative error for time series forecasting

  • Chao Chen,
  • Jamie Twycross,
  • Jonathan M. Garibaldi

PLOS ONE

  • Published: March 24, 2017
  • https://doi.org/10.1371/journal.pone.0174202

Abstract

Many accuracy measures have been proposed in the past for time series forecasting comparisons. However, many of these measures suffer from one or more issues such as poor resistance to outliers and scale dependence. In this paper, while summarising commonly used accuracy measures, a special review is made of the symmetric mean absolute percentage error. Moreover, a new accuracy measure called the Unscaled Mean Bounded Relative Absolute Error (UMBRAE), which combines the best features of various alternative measures, is proposed to address the common problems of existing measures. A comparative evaluation of the proposed and related measures has been made with both synthetic and real-world data. The results indicate that the proposed measure, with a user-selectable benchmark, performs as well as or better than other measures on selected criteria. Though it has been commonly accepted that there is no single best accuracy measure, we suggest that UMBRAE could be a good choice to evaluate forecasting methods, especially for cases where measures based on the geometric mean of relative errors, such as the geometric mean relative absolute error, are preferred.

Introduction

Forecasting has always been an attractive research area since it plays an important role in daily life. As one of the most popular research domains, time series forecasting has received particular attention from researchers [1–5]. Many comparative studies have been conducted with the aim of identifying the most accurate methods for time series forecasting [6]. However, research findings indicate that the performance of forecasting methods varies according to the accuracy measure being used [7]. Various accuracy measures have been proposed as the best to use over the past decades. However, many of these measures are not generally applicable due to issues such as being infinite or undefined under certain circumstances, which may produce misleading results. The criteria required for accuracy measures have been explicitly addressed by Armstrong and Collopy [6] and further discussed by Fildes [8] and Clements and Hendry [9]. As discussed, a good accuracy measure should provide an informative and clear summary of the error distribution. The criteria should also include reliability, construct validity, computational complexity, outlier protection, scale-independency, sensitivity to changes and interpretability. It has been suggested by many researchers that no single measure can be superior to all others on these criteria [6, 10, 11].

The evolution of accuracy measures can be seen through the measures used in the major comparative studies of forecasting methods. Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) can be considered the earliest and most popular accuracy measures. They were the primary measures used in the original M-Competition [12]. Despite well-known issues such as their high sensitivity to outliers, they are still widely used [13–15]. When using these accuracy measures, errors which are small and appear to be good, such as 0.1 by RMSE and 1% by MAPE, can frequently be obtained. Wei et al. [16] employed RMSE as the performance indicator in their research on stock price forecasting. The average error obtained was 84 and it was claimed to be superior to other previous models. However, without comparison, the error 84 as a number is not easy to interpret. In fact, the average fluctuation of the stock indices used was 83, which is smaller than the error of their proposed model. A similar example can be found regarding MAPE. Esfahanipour and Aghamiri [17] proposed a model with an error of 1.3%, which appears to be good. Yet this error was larger than the average daily fluctuation of the stock price, which was approximately 1.2%. The poor interpretation here is mainly due to the lack of a comparable benchmark used by the accuracy measure.

Armstrong and Collopy [6] recommended the use of relative absolute errors as a potential solution to the above issue. Accuracy measures based on relative errors, such as the Mean Relative Absolute Error (MRAE), can provide a better interpretation of how well the evaluated forecasting method performs compared to the benchmark method. However, when the benchmark error is small or equal to zero, the relative error can become extremely large or infinite. This may lead to an undefined mean or at least a distortion of the result. Thus, Armstrong and Collopy suggested a method named 'winsorizing' to overcome this problem by trimming extreme values. However, this process adds some complexity to the calculation and an appropriate trimming level has to be specified [18].

Similarly, MAPE also has the issue of being infinite or undefined due to zeros in the denominator [19]. The symmetric mean absolute percentage error (sMAPE) was first proposed by Armstrong [20] as a modified MAPE which could be a simple way to fix the issue. It was then used in the M3-Competition as an alternative primary measure to MAPE [7]. Nonetheless, Goodwin and Lawton [21] pointed out that sMAPE is not as symmetric as its name suggests. In fact, it gives more penalty to under-estimates than to over-estimates. Thus, the use of sMAPE in the M3-Competition was widely criticized by researchers afterwards [22]. In an unpublished working paper, Chen and Yang [23] defined a modified sMAPE, called msMAPE, by adding an additional component to the denominator of sMAPE. The added component can efficiently avoid the inflation of sMAPE caused by zero-valued observations. However, this does not address the issue of asymmetry for sMAPE.

Hyndman and Koehler [18] proposed the Mean Absolute Scaled Error (MASE) as a generally applicable measure of forecasting accuracy without the problems seen in the other accuracy measures. Nonetheless, this measure can still be dominated by a single large error, though infinite and undefined values are well avoided in most cases [24]. Davydenko and Fildes [24] proposed an altered version of MASE, the average relative MAE (AvgRelMAE), which uses the geometric mean to average the relative efficiencies of adjustments across time series. Although the geometric mean is appropriate for averaging benchmark ratios [25], the accuracy of AvgRelMAE still depends on its component measure RelMAE for each time series.

In this paper, a new accuracy measure is proposed to address the issues mentioned above. Specifically, by introducing a newly defined bounded relative absolute error, the new measure addresses the asymmetry issue of sMAPE while maintaining its other properties, such as scale-independence and outlier resistance. Further, we believe that the new measure improves interpretability, being based on relative errors with a selectable benchmark rather than, as in sMAPE, on percentage errors derived from the observation values. Given that [6] claimed that measures based on relative errors are the most reliable, we believe our measure is reliable in this sense.

Review of accuracy measures

Many accuracy measures have been proposed to evaluate the performance of forecasting methods during the past couple of decades. A table of the most commonly used measures was given in the review of 25 years of time series forecasting [1]. There was also a thorough review of accuracy measures by Hyndman and Koehler [18]. In this section, we mainly focus on new insights and new measures that have been introduced since 2006.

For a time series with n observations, let Y_t denote the observation at time t and F_t denote the forecast of Y_t. Then the forecasting error e_t can be defined as e_t = Y_t − F_t. Let e*_t denote the forecasting error at time t obtained by some benchmark method. That means e*_t = Y_t − F*_t, where F*_t is the forecast at time t by the benchmark method.
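To make the notation concrete, the following minimal sketch computes the errors e_t and the one-step naïve benchmark errors e*_t for an illustrative series. The array names and values are assumptions for illustration only, not taken from the paper.

```python
import numpy as np

# Illustrative data; the names y and f are our own, not from the paper.
y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])  # observations Y_t
f = np.array([110.0, 120.0, 130.0, 131.0, 119.0, 137.0])  # forecasts F_t

e = y - f                    # forecasting errors e_t = Y_t - F_t

# One-step naive benchmark F*_t = Y_{t-1}; it is undefined for t = 1,
# so benchmark errors are only available for t = 2..n.
f_star = y[:-1]
e_star = y[1:] - f_star      # benchmark errors e*_t = Y_t - F*_t
```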

Scale-dependent measures

The measures based on absolute or squared errors are also known as scale-dependent measures since their scale depends on the scale of the data. They are useful for comparing forecasting methods on the same set of data. However, they should not be used across data sets that are on different scales. The most commonly used scale-dependent measures are the Mean Absolute Error (MAE), the Mean Squared Error (MSE) and RMSE:

(1) \mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |e_t|

(2) \mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n} e_t^2

(3) \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2}

MAE has been cited in the very early forecasting literature as a primary measure of performance for forecasting models [26]. As shown in Eq 1, MAE directly calculates the arithmetic mean of absolute errors. Hence, it is very easy to compute and to understand. However, it may produce biased results when extremely large outliers exist in data sets. Specifically, even a single large error can sometimes dominate the result of MAE.

MSE, which calculates the arithmetic mean of squared errors, was used in the first M-Competition [12]. However, its use was widely criticized afterwards as inappropriate [6, 27]. MSE is more vulnerable to outliers since it gives extra weight to large errors. Also, the squared errors are on a different scale from the original data. Thus, RMSE, which is the square root of MSE, is often preferred to MSE as it is on the same scale as the data. However, RMSE is also sensitive to forecasting outliers [28].
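The following sketch implements Eqs 1–3 as plain NumPy helpers. The function names and the illustrative error vector are our own; the deliberately inserted outlier simply demonstrates how strongly a single large error dominates MAE and, even more so, RMSE.

```python
import numpy as np

def mae(e):
    """Mean Absolute Error (Eq 1): arithmetic mean of |e_t|."""
    return np.mean(np.abs(e))

def mse(e):
    """Mean Squared Error (Eq 2): arithmetic mean of e_t squared."""
    return np.mean(np.square(e))

def rmse(e):
    """Root Mean Squared Error (Eq 3): square root of MSE, on the data scale."""
    return np.sqrt(mse(e))

errors = np.array([2.0, -1.5, 0.5, -2.0, 1.0, 50.0])  # last value is a deliberate outlier
print(mae(errors), rmse(errors))  # the outlier dominates both, RMSE more strongly
```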

Percentage-based measures

To be scale-independent, a common approach is to use percentage errors based on the observation values. Two example measures based on percentage errors are MAPE and sMAPE, defined as:

(4) \mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n} \left|\frac{e_t}{Y_t}\right|

(5) \mathrm{sMAPE} = \frac{100}{n}\sum_{t=1}^{n} \frac{2|e_t|}{|Y_t| + |F_t|}

It should be noted that absolute values are used in the denominator of sMAPE as defined in this paper. This definition is different from, but equivalent to, the definition in Makridakis [10] and Makridakis and Hibon [7] when forecasts and actual values are all non-negative. The absolute values in the denominator avoid negative sMAPE, as pointed out by Hyndman and Koehler [18].

MAPE was used as one of the major accuracy measures in the original M-Competition [12]. However, the percentage errors can be excessively large or undefined when the target time series has values close to or equal to zero [19]. Moreover, Armstrong [20] pointed out that MAPE has a bias favouring estimates that are below the actual values. This was illustrated by extremes: "a forecast of 0 can never be off by more than 100%, but there is no limit to errors on the high side". Makridakis [10] discussed the asymmetry issue of MAPE with another example which involves two forecasts on different actual values. However, we believe that the example by Makridakis goes beyond the idea of Armstrong in 1985. To our understanding, the assumptions concerning the asymmetry issue of MAPE described by Armstrong [20] are: i) the estimates are non-negative while the actual value is positive; ii) the forecasting range is asymmetric, in that 0 is the lower bound for lower estimates while there is no upper bound for upper estimates; iii) errors for lower estimates and upper estimates should be symmetric (an extreme case: 0 as the worst lower estimate should have the same absolute error as the worst upper estimate, which is infinite).

sMAPE can produce symmetric errors in the asymmetric forecasting range stated in the above assumption. However, it is more natural to consider the symmetric property in a symmetric forecasting range for lower and upper estimates. Thus, sMAPE was widely criticized as an asymmetric measure [21, 22]. Regardless of the asymmetry issue, an advantage of sMAPE is that it does not share MAPE's problem of being excessively large or infinite. Also, due to the error bounds defined, sMAPE is more resistant to outliers since it gives less significance to outliers than measures which do not bound their errors.
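Below is a small illustrative implementation of Eqs 4 and 5 (the function names are ours). The example reproduces the asymmetry discussed above: for the same 10% deviation, the under-estimate receives a larger sMAPE penalty than the over-estimate.

```python
import numpy as np

def mape(y, f):
    """Mean Absolute Percentage Error (Eq 4), in percent; undefined if any Y_t == 0."""
    return 100.0 * np.mean(np.abs((y - f) / y))

def smape(y, f):
    """Symmetric MAPE (Eq 5), with absolute values in the denominator as defined here."""
    return 100.0 * np.mean(2.0 * np.abs(y - f) / (np.abs(y) + np.abs(f)))

y = np.array([100.0, 100.0])
print(smape(y, np.array([110.0, 110.0])))   # 10% over-estimate  -> about 9.52
print(smape(y, np.array([90.0, 90.0])))     # 10% under-estimate -> about 10.53 (penalised more)
```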

Relative-based measures

Another approach for accuracy measures to be scale-independent is to use relative errors based on the errors produced by a benchmark method (e.g. the naïve method). The most commonly used such measures are MRAE and the geometric mean relative absolute error (GMRAE):

(6) \mathrm{MRAE} = \frac{1}{n}\sum_{t=1}^{n} \left|\frac{e_t}{e^*_t}\right|

(7) \mathrm{GMRAE} = \left(\prod_{t=1}^{n} \left|\frac{e_t}{e^*_t}\right|\right)^{1/n} = \exp\!\left(\frac{1}{n}\sum_{t=1}^{n} \ln\left|\frac{e_t}{e^*_t}\right|\right)

MRAE can provide a clearer intuition of the performance improvement compared to the benchmark method. However, MRAE has a similar limitation to MAPE, in that it can also be excessively large or undefined when e*_t is close to or equal to zero.

GMRAE is favoured since it is generally acknowledged that the geometric mean is more appropriate for averaging relative quantities than the arithmetic mean [6, 8]. According to the alternative representation of GMRAE shown above in Eq 7, a key step in computing GMRAE is to take the arithmetic mean of log-scaled error ratios. This makes GMRAE more resistant to outliers than MRAE, which uses the arithmetic mean of the original error ratios. Nevertheless, GMRAE is still sensitive to outliers. More specifically, GMRAE can be dominated not only by a single large outlier, but also by an extremely small error close to zero. This is because there is neither an upper bound nor a lower bound for the log-scaled error ratios used by GMRAE. It should also be noticed that zero errors, both in e_t and e*_t, have to be excluded from the analysis. Thus, GMRAE may not be sufficiently informative.
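A minimal sketch of Eqs 6 and 7 follows (helper names are ours). The example includes one near-zero benchmark error to show how MRAE is inflated by a single large ratio while GMRAE, working on log-scaled ratios, is pulled upwards far less; zero errors would still have to be excluded before calling gmrae.

```python
import numpy as np

def mrae(e, e_star):
    """Mean Relative Absolute Error (Eq 6): arithmetic mean of |e_t / e*_t|."""
    return np.mean(np.abs(e / e_star))

def gmrae(e, e_star):
    """Geometric Mean Relative Absolute Error (Eq 7).

    Computed as exp of the arithmetic mean of log error ratios; zero errors in
    either e_t or e*_t must be excluded beforehand, as noted in the text.
    """
    ratios = np.abs(e / e_star)
    return np.exp(np.mean(np.log(ratios)))

e      = np.array([1.0, 2.0, 1.5, 0.5])
e_star = np.array([2.0, 2.0, 1.0, 0.01])   # one near-zero benchmark error
print(mrae(e, e_star))    # about 13.25, inflated by the 0.5/0.01 = 50 ratio
print(gmrae(e, e_star))   # about 2.47, less affected but still pulled upwards
```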

Rather than use the average of relative errors, one can also use the ratio of average errors obtained by a base measure. For example, when the base measure is RMSE, the relative RMSE (RelRMSE) is defined as:

(8) \mathrm{RelRMSE} = \frac{\mathrm{RMSE}}{\mathrm{RMSE}^*}

RelRMSE is a commonly used measure proposed by Armstrong and Collopy [6], where RMSE* denotes the RMSE produced by a benchmark method. Similar measures, such as RelMAE and RelMAPE, can easily be defined. They are also called relative measures. An advantage of relative measures is their interpretability [18]. However, the performance of relative measures is restricted by the component measure. For example, RelMAPE is also undefined when MAPE is undefined. Further, RelMAPE can easily be dominated by extremely large outliers since MAPE is not resistant to outliers. Thus, it makes no sense to compute RelMAPE if MAPE, as its component, is skewed.
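For completeness, a one-line sketch of Eq 8 under the same illustrative assumptions (names ours):

```python
import numpy as np

def rel_rmse(e, e_star):
    """Relative RMSE (Eq 8): RMSE of the method divided by RMSE* of the benchmark."""
    rmse = lambda x: np.sqrt(np.mean(np.square(x)))
    return rmse(e) / rmse(e_star)

# A value below 1 means the method outperforms the benchmark on this series.
print(rel_rmse(np.array([1.0, -2.0, 0.5]), np.array([2.0, -3.0, 1.0])))
```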

Another disadvantage of relative measures is that they are only available when there are several forecasts on the same series [18]. As a related idea to relative measures, MASE does not have this issue. It is defined as:

(9) \mathrm{MASE} = \frac{1}{n}\sum_{t=1}^{n} \frac{|e_t|}{\mathrm{MAE}^{**}}

In MASE, the absolute error |e_t| for each observation is scaled by the average in-sample error MAE** produced by a benchmark method (e.g. the one-step naïve method, or the seasonal naïve method for seasonal data). Thus, MASE will not produce infinite or undefined values except in the irrelevant case where all historical data are equal. However, MASE is still vulnerable to outliers [24]. Moreover, it has to be assumed that the period-to-period difference of the time series is stationary, so that the scaling factor is a consistent estimator of the scale of the series.
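A small sketch of Eq 9, assuming a training series for the in-sample naïve MAE** and a test series with its forecasts (all names and numbers are illustrative):

```python
import numpy as np

def mase(y_train, y_test, f_test, m=1):
    """Mean Absolute Scaled Error (Eq 9).

    Each out-of-sample error is scaled by MAE** of the (seasonal) naive method
    computed on the in-sample (training) data; m=1 gives the one-step naive method.
    """
    mae_naive_insample = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_test - f_test)) / mae_naive_insample

y_train = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_test  = np.array([14.0, 13.0])
f_test  = np.array([13.0, 13.5])
print(mase(y_train, y_test, f_test))  # 0.5 here: smaller errors than the in-sample naive scale
```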

For comparisons of forecasting methods on multiple time series, MASE is equivalent to the weighted arithmetic mean of relative MAEs [24]:

(10) \mathrm{MASE} = \frac{\sum_{i=1}^{m} n_i r_i}{\sum_{i=1}^{m} n_i}

where m denotes the number of time series, n_i denotes the number of observations for the i-th time series and r_i is the ratio of the MAE of the evaluated method to the MAE of the benchmark for the i-th time series. As pointed out by Davydenko and Fildes [24], using the arithmetic mean of MAE ratios introduces a bias towards overrating the accuracy of a benchmark method. They proposed the measure AvgRelMAE as an alternative to MASE, based on the geometric mean to average the scaled quantities.

(11) \mathrm{AvgRelMAE} = \left(\prod_{i=1}^{m} r_i^{\,n_i}\right)^{1/\sum_{i=1}^{m} n_i}

It should be noticed that AvgRelMAE uses the out-of-sample benchmark MAE as the scaling factor while MASE uses the in-sample MAE**. Though AvgRelMAE was shown to have many advantages such as interpretability and robustness [24], it still shares an issue with MASE since both are based on RelMAE. As mentioned above, the accuracy of RelMAE is constrained by the accuracy of MAE. Since MAE can be dominated by extreme outliers, the MAE ratio r_i does not necessarily represent an appropriate comparison of forecasting methods based on the errors of the majority of forecasts for the i-th time series.
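The following sketch computes Eq 11 from per-series MAE values; the inputs (per-series MAEs of the method and the benchmark, and the numbers of forecasts n_i) are illustrative assumptions of ours.

```python
import numpy as np

def avg_rel_mae(mae_method, mae_benchmark, n_obs):
    """AvgRelMAE (Eq 11): weighted geometric mean of out-of-sample MAE ratios r_i.

    mae_method, mae_benchmark and n_obs are per-series arrays (one entry per
    time series); the weights are the numbers of forecasts n_i.
    """
    r = np.asarray(mae_method) / np.asarray(mae_benchmark)
    n = np.asarray(n_obs, dtype=float)
    return np.exp(np.sum(n * np.log(r)) / np.sum(n))

# Three series: the method halves MAE on two of them and is much worse on one.
print(avg_rel_mae([0.5, 0.6, 3.0], [1.0, 1.2, 1.0], [18, 18, 18]))  # about 0.91
```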

A new accuracy measure

The criteria for a useful accuracy measure have been explicitly addressed in the literature [6, 8, 9, 11]. As reviewed in the previous section, many measures have been proposed with various advantages and disadvantages. Yet, most of these measures suffer from one or more problems. In this section, we propose a new accuracy measure which adopts the advantages of other measures such as sMAPE and MRAE without having their common problems. Specifically, the proposed measure is expected to have the following properties: (i) Informative: it can provide an informative result without the need to trim errors; (ii) Resistant to outliers: it can hardly be dominated by a single forecasting outlier; (iii) Symmetric: over-estimates and under-estimates are treated fairly; (iv) Scale-independent: it can be applied to data sets on different scales; (v) Interpretable: it is easy to understand and can provide intuitive results.

It has been mentioned in the review above that sMAPE is resistant to outliers due to the bounded error it defines. We would like to propose a new measure in a similar fashion to sMAPE but without its problems. Since relative errors are more general than percentage errors in providing intuitive results, we use the Relative Absolute Error (RAE) as the base from which to derive our new measure.

(12) \mathrm{RAE}_t = \left|\frac{e_t}{e^*_t}\right|

Since RAE has no upper bound, it can be excessively large or undefined when e*_t is small or equal to zero. This issue can easily be addressed by adding a |e_t| term to the denominator of RAE, which yields a bounded RAE (BRAE):

(13) \mathrm{BRAE}_t = \frac{|e_t|}{|e_t| + |e^*_t|}

In BRAE, the added |e_t| ensures that the denominator is never less than the numerator. This means BRAE has a maximum error of 1, while the minimum error is 0 when |e_t| is equal to zero. Due to the upper bound of BRAE, an accuracy measure based on BRAE will be more resistant to forecasting outliers. It can be noticed that the asymmetry issue of sMAPE has also been addressed in BRAE by adding a |e_t| rather than a |F_t| to the denominator. Also, a measure based on BRAE is more appropriate than sMAPE for intermittent demand data which have many zero-valued observations. To avoid the issue of being undefined, BRAE is defined to be 0.5 for the special case when |e_t| and |e*_t| are both equal to zero.
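A minimal sketch of Eq 13, including the 0/0 special case defined above (function and variable names are ours):

```python
import numpy as np

def brae(e, e_star):
    """Bounded Relative Absolute Error (Eq 13) for each observation.

    Values lie in [0, 1]; 0.5 means the method and the benchmark make errors of
    equal size, and the case |e_t| = |e*_t| = 0 is defined as 0.5 as in the text.
    """
    e, e_star = np.abs(e), np.abs(e_star)
    denom = e + e_star
    out = np.full_like(denom, 0.5, dtype=float)   # handles the 0/0 special case
    np.divide(e, denom, out=out, where=denom > 0)
    return out

print(brae(np.array([1.0, 0.0, 5.0, 0.0]), np.array([2.0, 1.0, 0.0, 0.0])))
# -> [0.333..., 0.0, 1.0, 0.5]
```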

In practice, the one-step naïve method is a commonly used benchmark, where F*_t = Y_{t−1}. However, it should be noticed that the naïve method is not necessarily an effective benchmark. For instance, when most forecasting methods can generally produce much smaller errors than the naïve method, BRAE will have the same issue as the percentage-error-based measures stated above. Thus, it is preferable to use a properly competitive method as the benchmark, such that a value of around 0.5 is obtained by BRAE.

Based on BRAE, a measure called the Mean Bounded Relative Absolute Error (MBRAE) can be defined as:

(14) \mathrm{MBRAE} = \frac{1}{n}\sum_{t=1}^{n} \mathrm{BRAE}_t

Though MBRAE is adequate for comparing forecasting methods, it is a scaled error that cannot be directly interpreted as a normal error ratio reflecting the error size. In fact, the process of calculating GMRAE also contains a mean of log-scaled error ratios which is not easily interpretable; that issue is addressed by converting the log-scaled error back to a normal ratio with the exponential function. Similarly, a transformation can be made to MBRAE to obtain a more interpretable measure, which is termed the unscaled MBRAE (UMBRAE):

(15) \mathrm{UMBRAE} = \frac{\mathrm{MBRAE}}{1 - \mathrm{MBRAE}}

With UMBRAE, the performance of a proposed forecasting method can be easily interpreted, in terms of the average relative absolute error based on BRAE, as follows: when UMBRAE is equal to 1, the proposed method performs roughly the same as the benchmark method; when UMBRAE < 1, the proposed method performs roughly (1−UMBRAE)*100% better than the benchmark method; when UMBRAE > 1, the proposed method performs roughly (UMBRAE−1)*100% worse than the benchmark method.
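Putting Eqs 13–15 together, a compact illustrative implementation and its interpretation might look as follows (function and variable names are ours, not from the paper):

```python
import numpy as np

def umbrae(e, e_star):
    """UMBRAE (Eq 15): unscale MBRAE (Eq 14) back to an interpretable error ratio."""
    e, e_star = np.abs(e), np.abs(e_star)
    denom = e + e_star
    brae = np.full_like(denom, 0.5, dtype=float)    # 0/0 case defined as 0.5
    np.divide(e, denom, out=brae, where=denom > 0)  # bounded errors in [0, 1]
    mbrae = brae.mean()                             # Eq 14
    return mbrae / (1.0 - mbrae)                    # Eq 15

e      = np.array([1.0, -0.5, 2.0, 0.8])   # method errors
e_star = np.array([2.0, -1.0, 2.0, 2.0])   # benchmark errors
u = umbrae(e, e_star)
print(u)                       # about 0.57, i.e. better than the benchmark
print((1.0 - u) * 100, "%")    # rough improvement over the benchmark
```

The reading of the printed values follows the interpretation given above: an UMBRAE of about 0.57 would suggest the method is roughly 43% better than the benchmark on this toy data.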

In general, UMBRAE is informative without the need to trim extreme errors. At the same time, based on the bounded errors, UMBRAE is resistant to outliers. It is also symmetric and plainly scale-independent. The benchmark used by UMBRAE is selectable, and the naïve method can easily be applied; a competitive benchmark is preferable to obtain more intuitive results. To the best of our knowledge, UMBRAE has not been proposed before. We suggest it as a generally applicable accuracy measure for time series forecasting. UMBRAE would be especially useful for cases where the performance of forecasting methods is not expected to be dominated by forecasting outliers.

Evaluation and results

In this section, the performance of UMBRAE is evaluated. The naïve method is used as the benchmark for UMBRAE. Properties such as reliability and sensitivity have been well investigated in the study by Armstrong and Collopy [6]. In their study, MAPE and MRAE were assessed as adequate in terms of reliability and good in terms of sensitivity. In fact, these properties, especially reliability, cannot be easily examined. For example, in the reliability tests, if forecasting methods are expected to have the same rankings when they are evaluated by a reliable accuracy measure, these forecasting methods themselves have to perform stably on different time series. It is difficult to find such forecasting methods in the real world. Thus, these properties are not examined in our study. Instead, it is assumed that UMBRAE, being based on relative errors, will also be reliable and sensitive to error changes. Consequently, our evaluation is mainly focused on the expected properties mentioned in the previous section. To make comparisons, other common measures mentioned in the review section are also examined in our evaluation. Comparisons are first made with synthetic time series to specifically examine the required properties. Then the M3-Competition data with 3003 time series [7] are used to demonstrate how these measures perform with real-world data.

Evaluation with synthetic data

Three groups of synthetic time series data are used in the comparative study. These synthetic data are not designed to be representative of real-world data. Rather, they are selected to clearly show the drawbacks of accuracy measures in terms of the required properties. In the synthetic evaluations, the average one-step naïve error is used to scale errors for MASE.

One of the most desired properties of an accuracy measure is the ability to resist outliers. Thus, the first group of synthetic data is made to examine whether an accuracy measure is resistant to a single forecasting outlier. As shown in Fig 1, Y_t is the target time series with 10 observations, which are randomly generated under a normal distribution (mean = 300, sd = 100). The first forecast series of Y_t has no obvious forecasting outlier and its forecasting errors measured by MAPE are approximately 10%. The other three forecast series are the same as the first except that they each have a forecasting outlier at the eighth observation. Though occasionally occurring large errors should also be considered in evaluating the performance of a forecasting method, it is assumed that a single large outlier should not affect the overall performance significantly. However, the results in Fig 1 show that the errors reported by some accuracy measures are significantly dominated by the single forecasting outlier. The worst is RMSE, whose error for the outlier-affected forecasts becomes approximately 36 times larger than its error for the outlier-free forecast. Though MASE is scaled from MAE, it in fact performs the same as MAE in dealing with the forecasting outlier: the errors given by MAE and MASE for the outlier-affected forecasts are distorted to be almost 15 times larger than for the outlier-free forecast. In contrast, sMAPE, GMRAE and UMBRAE are less sensitive to this single forecasting outlier. UMBRAE reports the smallest differences across the four forecast series.


Fig 1. Evaluation on the resistance of accuracy measures to a single forecasting outlier.

A: Synthetic time series data where Y_t is the target series and the other series are forecasts. The only difference between the forecast series is their forecasts of the observation Y_8. B: Results of the single forecasting outlier evaluation, which show that UMBRAE is less sensitive than other measures to a single forecasting outlier.

https://doi.org/10.1371/journal.pone.0174202.g001

The second group of time series data is created to evaluate whether over-estimates and under-estimates are treated 'fairly' by the accuracy measures. As presented in Fig 2, Y_t is the same time series as that used in the single forecasting outlier evaluation. In this scenario, one forecast makes a 10% over-estimate error on all observations in Y_t while the other makes a 10% under-estimate. The results in Fig 2 show that all the accuracy measures except sMAPE give the same error for the two forecasts. sMAPE produces a larger error for the under-estimating forecast, which indicates it puts a heavier penalty on under-estimates than on over-estimates.


Fig 2. Evaluation on the symmetry of accuracy measures to over-estimates and under-estimates.

A: Synthetic time series data where Y_t is the target series and the other series are forecasts. One forecast makes a 10% over-estimate of all observations of Y_t, while the other makes a 10% under-estimate. B: Results of the symmetry evaluation, which show that UMBRAE and all other accuracy measures except sMAPE are symmetric.

https://doi.org/10.1371/journal.pone.0174202.g002

Davydenko and Fildes [24] suggested another scenario to examine the property of symmetry for measures. In this scenario, the reward given for improving on the benchmark is expected to balance the penalty given for falling short of the benchmark by the same quantity. We also use this to examine our measure UMBRAE. Suppose that a time series has only two observations and there is one forecasting method to be compared with a benchmark method. The benchmark method makes forecasts with errors (y − f) of 1 and 2 respectively. In contrast, the forecasting method produces errors of 2 and 1 respectively. As expected, the forecasting method has an error of 1 measured by UMBRAE against the benchmark method, as checked in the short calculation below. Thus, UMBRAE is also symmetric for this case.
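As a quick check of this example using Eqs 13–15 (benchmark errors 1 and 2; method errors 2 and 1):

\mathrm{BRAE}_1 = \frac{|2|}{|2|+|1|} = \frac{2}{3}, \quad \mathrm{BRAE}_2 = \frac{|1|}{|1|+|2|} = \frac{1}{3}, \quad \mathrm{MBRAE} = \frac{1}{2}\left(\frac{2}{3}+\frac{1}{3}\right) = \frac{1}{2}, \quad \mathrm{UMBRAE} = \frac{1/2}{1-1/2} = 1.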

Usually, the scale-dependency issue of accuracy measures concerns their capability of evaluating forecasting performance across data series on different scales. Accuracy measures based on percentages or relative ratios are clearly suited to such evaluations, and no synthetic data are made for this. However, the scale-dependency issue also exists within a data series. Thus, the third group of synthetic data, shown in Fig 3, is made to evaluate how accuracy measures deal with data on different scales within a single time series. In this data set, Y_t is a time series generated by the Fibonacci sequence from 2 to 144. As forecasts of Y_t, all forecasting values of the first series are set to have a 20% over-estimate error relative to the corresponding observation of Y_t. In contrast, the second forecast has the same mean absolute error as the first, but its errors are on different percentage scales, ranging from 1440% to 0.2%. Specifically, the two forecasts contain the same set of absolute errors; for instance, both include an absolute error of 28.8, but assigned to observations on very different scales. As presented in Fig 3, MAE, RMSE, MASE and even GMRAE do not show any difference between the two forecasts. MRAE and MAPE, however, produce substantially different results for the two cases: the errors measured by them for the second forecast are approximately ten times larger than for the first. In contrast, UMBRAE and sMAPE give a moderate difference for the two forecasts.


Fig 3. Evaluation on the scale dependency of accuracy measures.

A: Synthetic time series data where Y_t is the target series and the other series are forecasts. The two forecasts have the same mean absolute error, but their errors are on different percentage scales relative to the corresponding values of Y_t. B: Results of the scale dependency evaluation, where MAE, RMSE, MASE and even GMRAE show no difference between the two forecasts, MRAE and MAPE produce substantially different errors for the two cases, and sMAPE and UMBRAE can reasonably distinguish the two forecasts.

https://doi.org/10.1371/journal.pone.0174202.g003

Evaluation with the M3-Competition data

The M-Competitions are well-known empirical studies which employ various real-world time series data to compare the performance of forecasting methods. In this study, we use the M3-Competition [7] data, which contains 3003 time series, to evaluate our proposed measure. The forecasting data are available in the R package 'Mcomp' maintained by Hyndman, which can be obtained from his website: http://robjhyndman.com/software/mcomp/. Among the 24 forecasting methods in the M3-Competition, 22 are used in our evaluation since their forecasts are available for all the 3003 time series. Since the one-step naïve method is used by many accuracy measures as the benchmark, it is also listed in the results as a forecasting method. As an alternative version of MASE, AvgRelMAE, which uses the geometric mean to average errors across time series, is also included in this evaluation. To simplify the results, errors are only measured at the first 6 forecasting horizons across the 3003 time series, which are available from all of the 22 forecasting methods.

The results are listed in Table 1. It can be noticed that the errors by MAE and RMSE are relatively large numbers which are meaningless without comparisons. UMBRAE is able to give interpretable results, where a forecasting method with an error < 1 can be considered better than the benchmark method in terms of the average relative absolute error based on BRAE. As shown in the results, the naïve method, which is the benchmark used by UMBRAE, has an error of 1. Errors of other forecasting methods measured by UMBRAE are all less than 1, indicating that these forecasting methods are better than the naïve method. However, MRAE gives the reverse result, in which the naïve method is ranked as the best. It should be noticed that all the errors excluding that of the naïve method measured by AvgRelMAE are smaller than 1, whereas all the errors measured by MASE are much larger than 1. The rank correlation coefficients of different measures are shown in Table 2. The correlation between RMSE, or MRAE, and other measures is extremely low. In contrast, UMBRAE shows substantially high agreement with most of the other measures, with an average Spearman rank correlation of 0.516. In particular, UMBRAE has remarkably high correlations with GMRAE and AvgRelMAE, at 0.995 and 0.990 respectively.

To eliminate the influence of outliers and extreme errors, we also use trimmed means to evaluate the accuracy measures. A 3% trimming level is used in our study. As shown in Table 3, most errors measured by MAE, RMSE, MASE, MRAE and MAPE differ significantly from those without trimming shown in Table 1. The rankings of forecasting methods made by these measures also change significantly. In contrast, errors and rankings measured by the other measures change less. In particular, the value of UMBRAE is quite invariant to trimming, with differences appearing only after the third decimal place for most of the forecasting methods. It can also be noticed that the rankings made by UMBRAE in Table 3 remain the same as those in Table 1. In general, all the measures except MRAE have similar rankings. As shown in Table 4, the rank correlations between UMBRAE and the other measures are, on average, much higher after trimming.

To show the error distributions in a similar manner to that in [24], we use the errors produced by the forecasting method ForecastPro as an example. Figs 4 to 11 show the distributions of the eight underlying error measurements used in the nine accuracy measures mentioned in this paper. In each figure, the top plot shows the kernel density estimate of the errors, illustrating their distribution, while the bottom shows a box-and-whisker plot which more clearly highlights the outliers. From these figures, it can be seen that the error measurements used in UMBRAE are more evenly distributed, with fewer outliers than in the other measures.

Discussion

Fig 1 shows that MRAE and MAPE can easily be dominated by a single forecasting outlier. This is because they are based on the arithmetic mean and there is no upper bound defined for a single error. In practice, the poor resistance to forecasting outliers may produce misleading results. This can be illustrated by our evaluation on the M3-Competition data. As shown in Table 1, MRAE gives significantly different rankings from other measures. It suggests the naïve method performs the best, while almost all the other accuracy measures indicate that the naïve method is the worst. By examining the forecasting data, we find that the results measured by MRAE are seriously distorted by the extremely large relative absolute errors that arise where the naïve errors are small. With the geometric mean, GMRAE shows remarkable resistance to the forecasting outliers. However, one disadvantage of measures based on the geometric mean is that zero-error forecasts have to be excluded. Thus, these measures may not be sufficiently informative. In contrast, due to the bounded errors defined, we have shown that UMBRAE can perform as well as GMRAE in resisting forecasting outliers. In fact, the errors and rankings given by UMBRAE are remarkably correlated with those measured by GMRAE, especially in Tables 3 and 4 where extreme errors are trimmed. Thus, for the cases where measures such as GMRAE are preferred, UMBRAE could be an alternative measure since it is much easier to use, with no need to trim errors.

It can also be noticed in Figs 4 to 11 that all the accuracy measures except AvgRelMAE (see Fig 7), GMRAE (see Fig 8) and UMBRAE (see Fig 11) have highly skewed distributions with long tails including extremely large forecasting outliers. Although undefined and zero errors (0.5%) have been trimmed, GMRAE still contains about 10.2% forecasting outliers, including some large log-transformed errors such as -10.76 and 8.08. Although the bounded errors used by sMAPE (see Fig 10) and UMBRAE also contain some outliers, there are no extremely large errors. Specifically, UMBRAE follows a symmetric distribution and produces only about 3% outliers, which will not affect the result significantly.

It has to be noted that UMBRAE does not necessarily always provide the same information as GMRAE. For example, given a time series with a million observations, if the forecasting method and the benchmark method produce errors (y − f), e and e*, following the standard normal distribution, UMBRAE and GMRAE will both be approximately 1. However, if the forecasting method produces errors of 2e, the value of GMRAE will be approximately 2, as one might expect, whereas UMBRAE will give an error of approximately 1.67, which is less than 2. This is because the bounded error used by UMBRAE does not increase much when the error e is doubled in cases where |e| is much larger than |e*|. In other words, a twice-worse forecast will not be assigned twice the significance by UMBRAE when that forecast is much worse than most other forecasts. In fact, this is the key strategy of UMBRAE for resisting outliers. Also, the above expectation of an error of 2 is based on interpretation via the 'relative average error'. However, it is arguable that the 'average relative error' is not necessarily the same as the 'relative average error'. This is more or less reflected by the synthetic test shown in Fig 3. More discussion of this is given later in this section in terms of scale-independency. We believe that the above issue does not invalidate the use of UMBRAE in practice.
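This behaviour can be checked with a small simulation under the stated assumption of standard normal errors (the sample size, seed and variable names are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
e      = rng.standard_normal(1_000_000)   # method errors
e_star = rng.standard_normal(1_000_000)   # benchmark errors

def umbrae_from_errors(e, e_star):
    # Bounded relative absolute errors (Eq 13), then Eqs 14 and 15.
    brae = np.abs(e) / (np.abs(e) + np.abs(e_star))
    m = brae.mean()
    return m / (1.0 - m)

print(umbrae_from_errors(e, e_star))        # about 1.0: same accuracy as the benchmark
print(umbrae_from_errors(2.0 * e, e_star))  # about 1.67 rather than 2, as discussed
```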

One of the common concerns about an accuracy measure is whether it is symmetric. Two different cases were used to evaluate the property of symmetry for accuracy measures. In our view, the first case concerns symmetry in the absolute quantity, i.e. whether equal over-estimates and under-estimates are treated fairly by a measure. As shown in Fig 2, only sMAPE is not symmetric in the absolute quantity (due to the asymmetric bounded errors it uses). This issue has been addressed by UMBRAE with the symmetric bounded errors defined above. The second case is in fact about symmetry in the relative quantity, where measures are expected to give a result of 1 when averaging two relative errors N and 1/N. Usually, a measure which uses the arithmetic mean will not be symmetric in such a relative quantity. However, UMBRAE, which uses the arithmetic mean for part of its calculation, shows a symmetric result. This is because UMBRAE does not work directly on the original error ratios: the original relative errors are converted to bounded relative errors before the arithmetic mean is calculated. In fact, this is quite similar to the process of calculating GMRAE, which is based on the geometric mean. As a result, it is not an issue for UMBRAE to use the arithmetic mean. Figs 8 and 11 show that the errors used by both GMRAE and UMBRAE follow a symmetric distribution.

It is necessary (or, at least, highly desirable) for an accuracy measure to be scale-independent when assessing forecasting methods across data on different scales. Normally, measures based on percentages or ratios in the same range are considered to be scale-independent. However, we argue that it is not enough for these percentages or ratios to be in the same range. To be truly scale-independent, these error percentages or ratios should also be closely related to the scale of the data at specific observations. Otherwise, they may lead to misleading results. For example, in Table 1, the error of MASE for the naïve method is 2.134. This is a somewhat confusing result which may be intuitively interpreted as indicating that the naïve method performs worse than the naïve method itself! In fact, it means the naïve method gives larger errors on average for the forecasting data than for the in-sample data. In contrast, AvgRelMAE does not have this issue since it uses the average error on the out-of-sample data as the scaling factor. Fig 3 shows that MASE fails to distinguish the difference between the two forecasts, which are clearly different considering the error percentages at different observations. This is because every single error used by MASE at different observations is scaled by the same scaling factor. GMRAE also fails in this evaluation. We note that this is because GMRAE, in fact, has the same issue as MASE: every single error of GMRAE can also be considered a scaled error based on a consistent scaling factor GMAE*, which is the geometric mean of the benchmark errors e*. Accordingly, we conclude that MASE, AvgRelMAE and GMRAE are only relatively scale-independent because they assume that the scaling factor is a consistent estimator. In contrast, UMBRAE is scale-independent and is closely related to the error ratios at individual observations. Thus, it can reasonably show the difference between the two forecasts with respect to error percentages.

Another important property of an accuracy measure is its interpretability. As Table 1 shows, the numerical errors measured by MAE and RMSE have little intuitive meaning without comparisons, and have therefore been scored as 'fair'. Comparatively, measures which produce errors as percentages or ratios based on a benchmark are more interpretable. The benchmark used by an accuracy measure is also important for its interpretability. In Table 1, the errors measured by MAPE are all small errors around 10%. However, these small errors are less meaningful without comparisons, because the percentages are based on the original values of the observations. Thus, they do not necessarily indicate a good performance. In contrast, errors measured by UMBRAE are more interpretable: an error of 0.77 indicates that the forecasting method performs approximately 23% better than the benchmark method.

As shown in Table 5, the accuracy measures are rated against the key criteria considered in this paper. Measures are considered less informative if undefined or zero errors have to be excluded. The property of symmetry is rated in both the absolute quantity and the relative quantity as discussed above. Measures are rated as relatively scale-independent if they assume that the scaling factor is a consistent estimator. Relative-based accuracy measures are considered more interpretable than other measures since they can provide more intuitive results in terms of performance without extra comparisons. sMAPE is rated as poor in interpretability since its error, which has a range of (0, 200), is not as easy as MAPE to understand.

In summary, we show that UMBRAE (i) is informative and uses all available errors; (ii) can perform as well as GMRAE in resisting forecasting outliers without the need to trim zero-error forecasts; (iii) is symmetric in both the absolute quantity and the relative quantity; (iv) is scale-independent; (v) is interpretable and can provide intuitive results. As such, UMBRAE combines the best features of various alternative measures into a single new measure. Thus, we believe UMBRAE is an interesting new measure because it constitutes a simple, flexible, easy to use and easy to understand measure that is resistant to outliers. Also, the forecasting benchmark for calculating UMBRAE is selectable, and the ideal choice is a forecasting method to be outperformed. As a well-known benchmark, the naïve method can easily be applied as a default to show whether a forecasting method is generally good or not.

Conclusion

We have proposed a new accuracy measure, UMBRAE, based on bounded relative errors. As discussed in the review of sMAPE, one advantage of the bounded error is that it gives less significance to outliers since it does not suffer from being excessively large or infinite. Evaluation of the proposed measure along with related measures has been made on both synthetic and real-world data. We have shown that UMBRAE combines the best features of various alternative measures without having their common drawbacks. UMBRAE, with a selectable benchmark, can provide an informative and interpretable result based on bounded relative errors. It is less sensitive to forecasting outliers than other measures. It is also symmetric and scale-independent. Though it has been commonly accepted that there cannot be any single best accuracy measure, we suggest that UMBRAE is a good choice for general use when evaluating the performance of forecasting methods. Since UMBRAE, in our study, performs similarly to GMRAE without the need to trim zero-error forecasts, we particularly recommend UMBRAE as an alternative measure for the cases where GMRAE is preferred.

Although we have shown that UMBRAE has many advantages as described above, its statistical properties have not been well studied. For example, the way in which UMBRAE reflects the properties of the error distributions is unclear. Moreover, one possible underlying drawback of UMBRAE is that the bounded error reaches the maximum value of 1.0 when the benchmark error (e*_t) is equal to zero, even if the forecast is good. This may produce a biased estimate, especially when the benchmark method produces a large number of zero errors. Although this drawback may not be relevant for the majority of real-world data, we would like to address this issue in future work.

Author Contributions

  1. Conceptualization: CC.
  2. Data curation: CC.
  3. Formal analysis: CC JT JMG.
  4. Funding acquisition: JT JMG.
  5. Investigation: CC JT JMG.
  6. Methodology: CC JT JMG.
  7. Project administration: JT JMG.
  8. Software: CC.
  9. Supervision: JT JMG.
  10. Validation: CC JT JMG.
  11. Visualization: CC JT JMG.
  12. Writing – original draft: CC JT JMG.
  13. Writing – review & editing: CC JT JMG.

References

  1. De Gooijer JG, Hyndman RJ. 25 Years of Time Series Forecasting. International Journal of Forecasting. 2006;22(3):443–473.
  2. Gao ZK, Jin ND. A directed weighted complex network for characterizing chaotic dynamics from time series. Nonlinear Analysis: Real World Applications. 2012;13(2):947–952.
  3. Gao ZK, Yang YX, Fang PC, Zou Y, Xia CY, Du M. Multiscale complex network for analyzing experimental multivariate time series. Europhysics Letters. 2015;109(3):30005.
  4. Gao ZK, Small M, Kurths J. Complex network analysis of time series. Europhysics Letters. 2016;116(5):50001.
  5. Gao ZK, Cai Q, Yang YX, Dang WD, Zhang SS. Multiscale limited penetrable horizontal visibility graph for analyzing nonlinear time series. Scientific Reports. 2016;6:35622. pmid:27759088
  6. Armstrong JS, Collopy F. Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting. 1992;8(1):69–80.
  7. Makridakis S, Hibon M. The M3-Competition: results, conclusions and implications. International Journal of Forecasting. 2000;16(4):451–476.
  8. Fildes R. The evaluation of extrapolative forecasting methods. International Journal of Forecasting. 1992;8(1):81–98.
  9. Clements MP, Hendry DF. On the limitations of comparing mean square forecast errors. Journal of Forecasting. 1993;12(8):617–637.
  10. Makridakis S. Accuracy measures: theoretical and practical concerns. International Journal of Forecasting. 1993;9(4):527–529.
  11. Armstrong JS, Fildes R. Correspondence on the selection of error measures for comparisons among forecasting methods. Journal of Forecasting. 1995;14(1):67–71.
  12. Makridakis S, Andersen A, Carbone R, Fildes R, Hibon M, Lewandowski R, et al. The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting. 1982;1(2):111–153.
  13. Olaofe ZO. A 5-day wind speed & power forecasts using a layer recurrent neural network (LRNN). Sustainable Energy Technologies and Assessments. 2014;6:1–24.
  14. Svalina I, Galzina V, Lujić R, Šimunović G. An adaptive network-based fuzzy inference system (ANFIS) for the forecasting: The case of close price indices. Expert Systems with Applications. 2013;40(15):6055–6063.
  15. Boyacioglu MA, Avci D. An Adaptive Network-Based Fuzzy Inference System (ANFIS) for the prediction of stock market return: The case of the Istanbul Stock Exchange. Expert Systems with Applications. 2010;37(12):7908–7912.
  16. Wei LY, Chen TL, Ho TH. A hybrid model based on adaptive-network-based fuzzy inference system to forecast Taiwan stock market. Expert Systems with Applications. 2011;38(11):13625–13631.
  17. Esfahanipour A, Aghamiri W. Adapted neuro-fuzzy inference system on indirect approach TSK fuzzy rule base for stock market analysis. Expert Systems with Applications. 2010;37(7):4742–4748.
  18. Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International Journal of Forecasting. 2006;22(4):679–688.
  19. Makridakis SG, Wheelwright SC, Hyndman RJ. Forecasting: Methods and Applications. Wiley Series in Management. Wiley; 1998.
  20. Armstrong JS. Measures of Accuracy. In: Long-Range Forecasting: From Crystal Ball to Computer. A Wiley-Interscience Publication. Wiley; 1985. p. 346–354.
  21. Goodwin P, Lawton R. On the asymmetry of the symmetric MAPE. International Journal of Forecasting. 1999;15(4):405–408.
  22. Ord K. Commentaries on the M3-Competition: An introduction, some comments and a scorecard. International Journal of Forecasting. 2001;17(4):537–541.
  23. Chen Z, Yang Y. Assessing forecast accuracy measures; 2004. Available from: https://www.researchgate.net/publication/228774888_Assessing_forecast_accuracy_measures.
  24. Davydenko A, Fildes R. Measuring forecasting accuracy: The case of judgmental adjustments to SKU-level demand forecasts. International Journal of Forecasting. 2013;29(3):510–522.
  25. Fleming PJ, Wallace JJ. How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM. 1986;29(3):218–221.
  26. Wright DJ, Capon G, Pagé R, Quiroga J, Taseen AA, Tomasini F. Evaluation of forecasting methods for decision support. International Journal of Forecasting. 1986;2(2):139–152.
  27. Chatfield C. Apples, oranges and mean square error. International Journal of Forecasting. 1988;4(4):515–518.
  28. Armstrong JS. Evaluating forecasting methods. In: Principles of Forecasting: A Handbook for Researchers and Practitioners. vol. 30. Springer US; 2001. p. 443–472.
