International Journal of Intelligence Science, 2012, 2, 176-180
http://dx.doi.org/10.4236/ijis.2012.224023 Published Online October 2012 (http://www.SciRP.org/journal/ijis)

Using Data Mining with Time Series Data in Short-Term Stocks Prediction: A Literature Review

José Manuel Azevedo1, Rui Almeida2, Pedro Almeida3
1Department of Mathematics, Instituto Politécnico do Porto, Porto, Portugal
2Department of Mathematics, Faculdade de Ciências, Universidade da Beira Interior, Covilhã, Portugal
3Department of Informatics, Faculdade de Engenharia, Universidade da Beira Interior, Covilhã, Portugal
Email: jazevedo@iscap.ipp.pt, ralmeida@ubi.pt, palmeida@ubi.pt

Received July 1, 2012; revised August 20, 2012; accepted September 1, 2012

ABSTRACT

Data Mining (DM) methods are being increasingly used in prediction with time series data, in addition to traditional statistical approaches. This paper presents a literature review of the use of DM with time series data, focusing on short-term stocks prediction. This is an area that has been attracting a great deal of attention from researchers in the field. The main contribution of this paper is to provide an outline of the use of DM with time series data, using mainly examples related to short-term stocks prediction. This is important for a better understanding of the field. Some of the main trends and open issues are also introduced.

Keywords: Data Mining; Time Series; Fundamental Data; Data Frequency; Application Domain; Short-Term Stocks Prediction

1. Introduction

Data Mining (DM) is a challenging field for research and has been applied successfully in practice in several different areas. DM methods are being increasingly used in prediction with time series data, in addition to traditional statistical approaches [1-3]. DM can be presented as one of the phases of the Knowledge Discovery in Databases (KDD) process [4-6], and is identified as “the means by which the patterns are extracted from data” [7].
Nowadays, the two terms, DM and KDD, are often used interchangeably.

The OECD Glossary of Statistical Terms [8] presents the following definition: “A time series is a set of regular time-ordered observations of a quantitative characteristic of an individual or collective phenomenon taken at successive, in most cases equidistant, periods/points of time”.

There are several application domains of DM with time series data, and one important application domain is short-term stocks prediction, which is the focus of this paper. Short-term stocks prediction is a difficult problem and can be considered an open research issue [9,10]. Intelligent forecasting models have achieved better results than traditional methods, particularly in short-term forecasts [11]. Although intelligent forecasting methods perform better, their results can still be improved in terms of accuracy, in addition to other factors.

The main contribution of this paper is to provide an outline of the use of DM with time series data, using mainly examples related to the prediction of short-term stocks or market indexes. This is important for a better understanding of the field. Some of the main trends and open issues are also introduced.

The paper is organized as follows: DM with time series data is presented in Section 2, the integration of fundamental data is explored in Section 3, and data frequency issues are introduced in Section 4. The paper closes in Section 5 with conclusions and future research directions.

2. Data Mining with Time Series Data

Since the seminal paper of Fayyad in 1996 [4], the Data Mining (DM) area has attracted a great deal of interest and can nowadays be considered an established field. DM applications can be found in a diverse range of application domains. One important application domain is that of time series data. “A time-series data set consists of sequences of numeric values obtained over repeated measurements of time.
The values are typically measured at equal time intervals (e.g., every minute, hour, or day)” [5]. These measurements can be taken over one variable or over several variables, giving univariate or multivariate time series, respectively.

Copyright © 2012 SciRes. IJIS
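As a small illustration of the definitions above, the following sketch represents a univariate and a multivariate daily series in plain Python and checks whether the observations are equally spaced. The dates, values, and the helper name `is_equally_spaced` are assumptions made up for this example; they do not come from the works reviewed.

```python
from datetime import date, timedelta

# Hypothetical daily time points (illustrative values only).
start = date(2012, 1, 2)
timestamps = [start + timedelta(days=i) for i in range(5)]

# Univariate series: one numeric value per time point.
univariate = [10.0, 10.2, 10.1, 10.4, 10.3]

# Multivariate series: several variables observed at each time point.
multivariate = [
    {"close": 10.0, "volume": 1200},
    {"close": 10.2, "volume": 1350},
    {"close": 10.1, "volume": 990},
    {"close": 10.4, "volume": 1500},
    {"close": 10.3, "volume": 1100},
]


def is_equally_spaced(points):
    """Return True when all consecutive time points are the same interval apart."""
    gaps = {later - earlier for earlier, later in zip(points, points[1:])}
    return len(gaps) <= 1


print(is_equally_spaced(timestamps))
```

Dropping one of the time points makes the check fail, which is the situation the OECD definition allows for with the wording "in most cases equidistant".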
2.1. Data Mining with Time Series Data Applications

DM with time series data is popular, and many applications can be found in the literature, for instance earthquake forecasting [12], characterization of ozone behavior [13], and flood prediction [14].

Another application example is financial decision making. A decision support tool for financial forecasting, named EDDIE, is presented in [15]. In [16], a new architecture is presented that implements a binary neural network, AURA, to produce discrete probability distributions as forecasts, using high-frequency data sets. The use of support vector machines and backpropagation neural networks to predict credit ratings is presented in [17].

One important application concerns short-term stocks prediction, which is the main focus of this paper. In [18], an approach is presented to the paradox of obtaining better results with long-horizon forecasts than with short-horizon forecasts, and it is claimed that the paradox is solved, since the proposed model obtains promising results. Nevertheless, investors show a great deal of interest in short-horizon forecasts, so the authors consider that research focusing on this issue is important, namely on using data mining with time series for short-term stocks prediction.

2.2. Data Mining Techniques Used with Time Series Data for Short-Term Stocks Prediction

Several DM techniques are used with time series data in order to obtain short-term stocks predictions. An interesting approach to portfolio management, using the Gaussian temporal factor analysis technique, is introduced in [19]. Neural networks are one of the most popular techniques for stocks prediction; [20-25] are some examples. In [22], rough sets and classification trees are used as well. Rough sets are also used in [26]. Support Vector Machines are used in [27].
No strong evidence has yet been given that any one technique is better than the others, but nonlinear models are more popular.

2.3. Specific Challenges

Using DM with time series data presents several specific challenges. In [28,29] the authors focus on the issue of representing time series data so that DM can be applied effectively and efficiently. In [28], three types of algorithms are presented and compared, namely the sliding window algorithm, the top-down algorithm, and the bottom-up algorithm, and a new approach is introduced that is claimed to overcome the drawbacks of these three algorithms. In [29], a new concept, named median strings, is presented as a simple and, at the same time, powerful representation for time series data.

Another interesting issue is to find out whether different time series, or parts of a time series, have similar behavior. This issue can be approached through the use of similarity measures and indexing techniques. Interesting reviews can be found in [30,31].

Overfitting is a common problem across DM applications, and DM with time series data is no exception. In [32], an approach that intends to overcome this problem is presented.

Another important issue concerns the way to implement each of the phases of the KDD process, taking into account the specificities of time series data. An application of DM with time series data for short-term stock prediction is presented in [1], analyzing all the phases of the KDD process. Promising results were achieved, but it is noted there that including fundamental data could help improve the results obtained.

Table 1 presents a summary of the main techniques and challenges.

3. Including Fundamental Data

Concerning short-term stocks prediction, a possible approach is to collect historical financial data, such as open price, high price, low price, close price, and volume. These can be used at a daily frequency, or at other frequencies considered appropriate.
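To make the idea of deriving indicators from such historical price data concrete, here is a minimal sketch that computes two simple indicators, a simple moving average and a price rate of change, from a list of daily close prices. The function names, the window and lag values, and the price data are illustrative assumptions; the reviewed works may define and parameterize their indicators differently.

```python
def simple_moving_average(values, window):
    """Mean of the last `window` values at each position; None until enough data exists."""
    if window <= 0:
        raise ValueError("window must be positive")
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


def rate_of_change(values, lag):
    """Percentage change of each value relative to the value `lag` periods earlier."""
    if lag <= 0:
        raise ValueError("lag must be positive")
    result = []
    for i in range(len(values)):
        if i < lag:
            result.append(None)  # no earlier value to compare with
        else:
            previous = values[i - lag]
            result.append(100.0 * (values[i] - previous) / previous)
    return result


# Hypothetical daily close prices (made-up values).
close = [10.0, 11.0, 12.0, 11.0, 10.0]
sma3 = simple_moving_average(close, 3)
roc1 = rate_of_change(close, 1)
print(sma3)
print(roc1)
```

Each derived list has the same length as the input, so the indicators can be placed side by side with the raw prices as candidate input features.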
Several indicators can be derived and used for a more adequate analysis. This approach is called technical analysis. Another possible approach is to use statistical data, such as macroeconomic indexes and basic financial indicators of the company. This approach is called fundamental analysis. Table 2 summarizes some of the technical and fundamental features found in the literature. Other studies, for instance [37-39], present similar indicators.

From the literature review it is clear that one of the main issues in obtaining good predictions is related to the first phase of the KDD process, that is to say, the selection of an adequate feature combination, since the same methods can yield different results if different features are selected as inputs.

Table 1. Data mining with time series data: Main techniques and challenges.

| Category | Item | References |
|---|---|---|
| Techniques | Neural networks | [20-25] |
| Techniques | Support vector machines | [27] |
| Techniques | Rough sets | [22,26] |
| Techniques | Classification trees | [22] |
| Techniques | Gaussian temporal factor analysis | [19] |
| Challenges | Data representation | [28,29] |
| Challenges | Similar behavior | [30,31] |
| Challenges | Overfitting | [32] |
| Challenges | Implementing all KDD phases | [1] |

Table 2. Features for technical and fundamental analysis.

| Type | Features | References |
|---|---|---|
| Fundamental | ROA(A); EBI; gross margin; gross margin growth; operating income; operation income growth; net income; net income growth; continued net income; cash flow ratio; sales growth ratio; current ratio; ordinary income growth; continued income growth; total asset growth; return on total asset; quick ratio; liabilities ratio; total asset turnover; account receivable turnover; inventory turnover; fixed asset turnover; days payables outstanding. And several others: gross national product; real GDP; unemployment rate; real economic growth; monetary supply and amount; gross margin growth; CCI; personal income; industrial production; Taiwan export/import volume; operation income growth liabilities; total asset growth fixed asset turnover; monitoring indicator; export foreign exchange volume; WPI; merchandise trade volume export/import | Tsai and Hsiao (2010) [33] |
| Fundamental | Demand index; moving average divergence convergence; relative strength index; positive directional movement index; negative directional movement index; moving average; r-squared; linear regression slope; average true range | Zarandi, Rezaee, Turksen and Neshat (2009) [34] |
| Technical | Price channel (top); price channel (bottom); price per earning per share; volume; open price; range; changes; close price | Zarandi, Rezaee, Turksen and Neshat (2009) [34] |
| Technical | Average position change; Bollinger band %; Cutler's relative strength index; exponential moving average; stochastic oscillator; typical price; volume accumulator; volume weighted RSIMFI; volume weighted RSI; Williams %R; advance decline line; average true range; average position change; Chaikin A/D oscillator; on balance volume; stoch. osc.; typical price | Ince and Trafalis (2007) [35] |
| Fundamental | Money supply (M1B); government consumption level; gross national products; gross domestic products; consumer price index; wholesale products index; rate of exchange | Cheng, Chen and Lin (2010) [22] |
| Technical | Moving average convergence/divergence; price rate of change; stochastic %K; stochastic %D; relative strength index; stochastic oscillator and directional indicator | Cheng, Chen and Lin (2010) [22] |
| Technical | On balance volume; moving average; average stock yield | Shen, Guo, Wu and Wu (2011) [36] |

Another aspect that arises from the literature review is that most researchers use only one of the two types of analysis, technical or fundamental. Thus, the analysis of combinations of both types of indicators is still underexplored.

In addition, most studies use macroeconomic variables, neglecting the important financial indicators of the companies. Considering the application domain, it is clear that the evolution of stock prices is influenced by both types of variables, so considering both could lead to good results.

One of the main issues related to the combination of both types of features is that the time series data have different frequencies (Figure 1). Usually, technical features have daily frequencies and fundamental features have monthly, quarterly, or lower frequencies, which raises some integration issues. These integration issues are very important and have several implications.

Figure 1. Time series with different frequencies (daily, weekly, monthly, quarterly).

4. Integrating Features with Different Frequencies

As stated above, interesting results could be obtained through the integration of time series data with different frequencies. Short-term stocks prediction mainly requires time series with data collected daily, yielding high-frequency time series, as opposed to the low-frequency time series obtained from the collection of fundamental data. Forecasts should be made on a daily basis, so there are some important issues for research.

Some research can be found in the literature approaching the issue of integrating time series features with different frequencies. Traditional approaches use regression algorithms such as MIDAS [37,38]. Nevertheless, this approach does not use DM.

In the literature reviewed, only a few works use DM with time series data with different frequencies; [22,34] are two examples. These studies present promising results, but the use of neural networks is somewhat of a limitation. Neural networks, despite usually yielding good results, function as a “black box”, which makes it difficult to understand the mechanism and the generated model.

From the literature review it can be concluded that these issues need further research, and that it can be useful to test other methods and to explore the selection of different features.

The application domain is an important issue to consider when applying DM, so it should also be considered in this case. Taking the application domain into account is likely to bring useful insights and to improve the results.

5. Conclusions and Future Research Directions

This paper presents a literature review of the use of data mining with time series data. The review brings a better understanding of the field of study, which is an important contribution of this paper. From the literature review it can be concluded that this subject attracts a great deal of interest from researchers. Nevertheless, several research issues remain unexplored. One of those identified during this research relates to the combined use of fundamental and technical indicators. The combined use of both types of indicators also raises the issue of integrating time series with different frequencies.

Feature selection, corresponding to the first phase of the KDD process, is also an issue that requires more research.

Future research directions include the study of ways to select the best features for DM with time series data. The existence of features with different frequencies is a concern, and methods to address this problem will be planned and implemented.

6. Acknowledgements

This work was partially supported by the research project PEst-OE/MAT/UI0212/2011, financed by FEDER through COMPETE (Programa Operacional Factores de Competitividade) and by FCT (Fundação para a Ciência e a Tecnologia).

REFERENCES

[1] P. Almeida, “Previsão do Comportamento de Séries Temporais Financeiras com Apoio de Conhecimento Sobre o Domínio,” Ph.D. Thesis, Universidade da Beira Interior, Covilhã, 2003.
[2] L. Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science, Vol. 18, No. 3, 2001, pp. 199-231. doi:10.1214/ss/1009213726
[3] M. A.
Ruggiero, “Cybernetic Trading Strategies: Developing a Profitable Trading System with State-of-the-Art Technologies,” John Wiley & Sons, New York, 1977.
[4] U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, “From Data Mining to Knowledge Discovery: An Overview,” In: U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, Eds., Advances in Knowledge Discovery and Data Mining, The MIT Press, Cambridge, 1996, pp. 1-34.
[5] J. Han, M. Kamber and J. Pei, “Data Mining: Concepts and Techniques,” Morgan Kaufmann Publishers, California, 2011.
[6] D. Hand, H. Mannila and P. Smyth, “Principles of Data Mining,” The MIT Press, Cambridge, 2011.
[7] A. Azevedo and M. F. Santos, “KDD, SEMMA, and CRISP-DM: A Parallel Overview,” Proceedings of the IADIS European Conference on Data Mining, Amsterdam, 24-26 July 2008, pp. 182-185.
[8] OECD, “Time Series,” 2006. http://stats.oecd.org/glossary/detail.asp?ID=2708
[9] M. A. Ferreira and P. Santa-Clara, “Forecasting Stock Market Returns: The Sum of the Parts Is More than the Whole,” Journal of Financial Economics, Vol. 100, No. 3, 2011, pp. 514-537. doi:10.1016/j.jfineco.2011.02.003
[10] T. Fu, “A Review on Time Series Data Mining,” Engineering Applications of Artificial Intelligence, Vol. 21, No. 1, 2011, pp. 164-181. doi:10.1016/j.engappai.2010.09.007
[11] T. O. Hill, M. Connor and W. Remus, “Neural Network Models for Time Series Forecasts,” Management Science, Vol. 42, No. 7, 1996, pp. 1082-1092. doi:10.1287/mnsc.42.7.1082
[12] S. Fong and Z. Nannan, “Towards an Adaptive Forecasting of Earthquake Time Series from Decomposable and Salient Characteristics,” Proceedings of the 3rd International Conference on Pervasive Patterns and Applications, Rome, 25 September 2011, pp. 53-60.
[13] K. J. Walsh, M. Milligan, M. Woodman and J. Sherwell, “Data Mining to Characterize Ozone Behavior in Baltimore and Washington DC,” Journal of Atmospheric Environment, Vol. 42, No. 18, 2008, pp. 4280-4292. doi:10.1016/j.atmosenv.2008.01.012
[14] C. Damle and A. Yalcin, “Flood Prediction Using Time Series Data Mining,” Journal of Hydrology, Vol. 333, No. 2-4, 2007, pp. 305-316. doi:10.1016/j.jhydrol.2006.09.001
[15] E. Tsang, P. Yung and J. Li, “EDDIE-Automation, a Decision Support Tool for Financial Forecasting,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 559-565. doi:10.1016/S0167-9236(03)00087-3
[16] A. Pasley and J. Austin, “Distribution Forecasting of High Frequency Time Series,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 501-513. doi:10.1016/S0167-9236(03)00083-6
[17] Z. Huang, H. Chen, C. J. Hsu, W. H. Chen and S. Wu, “Credit Ratings Analysis with Support Vector Machines and Neural Networks. A Market Comparative Study,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 542-558. doi:10.1016/S0167-9236(03)00086-1
[18] H. M. Krolzig and J. Toro, “Multiperiod Forecasting in Stock Market: A Paradox Solved,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 531-542. doi:10.1016/S0167-9236(03)00085-X
[19] K. C. Chiu and L. Xu, “Arbitrage Pricing Theory-Based Gaussian Temporal Factor Analysis for Adaptive Portfolio Management,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 485-500. doi:10.1016/S0167-9236(03)00082-4
[20] O. Coupelon, “Neural Network Modeling for Stock Movement Prediction: A State of the Art,” 2007. http://olivier.coupelon.free.fr/Neural_network_modeling_for_stock_movemen_prediction.pdf
[21] M. Kordos and A. Cwiok, “A New Approach to Neural Network Based Stock Trading Strategy,” Proceedings of the 12th International Conference on Intelligent Data Engineering and Automated Learning, Norwich, 7-9 September 2011, pp. 429-436.
[22] J. H. Cheng, H. P. Chen and Y. M. Lin, “A Hybrid Forecast Marketing Timing Model Based on Probabilistic Neural Network, Rough Set and C 4.5,” Expert Systems with Applications, Vol. 37, No. 4, 2010, pp. 1814-1820. doi:10.1016/j.eswa.2009.07.019
[23] Z. Yudong and W. Lenan, “Stock Market Prediction of S & P 500 via Combination of Improved BCO Approach and BP Neural Network,” Expert Systems with Applications, Vol. 36, No. 5, 2009, pp. 8849-8854. doi:10.1016/j.eswa.2008.11.028
[24] X. Lin, Z. Yang and Y. Song, “Short-Term Stock Price Based on Echo State Networks,” Expert Systems with Applications, Vol. 36, No. 3, 2009, pp. 7313-7317. doi:10.1016/j.eswa.2008.09.049
[25] T. Chang, “A Comparative Study of Artificial Neural Networks, and Decision Trees for Digital Game Content Stocks Price Prediction,” Expert Systems with Applications, Vol. 38, No. 12, 2011, pp. 14846-14851. doi:10.1016/j.eswa.2011.05.063
[26] L. Shen and H. T. Loh, “Applying Rough Set to Market Timing Decisions,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 583-597. doi:10.1016/S0167-9236(03)00089-7
[27] Q. Wen, Z. Yang, Y. Song and P. Jia, “Automatic Stock Decision Support System Based on Box Theory and SVM Algorithm,” Expert Systems with Applications, Vol. 37, No. 2, 2010, pp. 1015-1022. doi:10.1016/j.eswa.2009.05.093
[28] E. Keogh, S. Chu, D. Hart and M. Pazzani, “Segmenting Time Series: A Survey and Novel Approach,” In: M. Last, A. Kandel and H. Bunke, Eds., Data Mining in Time Series Databases (Series in Machine Perception Artificial Intelligence), World Scientific, Singapore, 2004, pp. 1-21.
[29] X. Jiang, H. Bunke and J. Csirik, “Median Strings: A Review,” In: M. Last, A. Kandel and H. Bunke, Eds., Data Mining in Time Series Databases (Series in Machine Perception Artificial Intelligence), World Scientific, Singapore, 2004, pp. 173-192.
[30] G. Das and D. Gunopulos, “Time Series Similarity and Indexing,” In: N. Ye, Ed., The Handbook of Data Mining, Lawrence Erlbaum Associates, London, 2003, pp. 279-304.
[31] M. L. Hetland, “A Survey of Recent Methods for Efficient Retrieval of Similar Time Sequences,” In: M. Last, A. Kandel and H. Bunke, Eds., Data Mining in Time Series Databases (Series in Machine Perception Artificial Intelligence), World Scientific, Singapore, 2004, pp. 23-42.
[32] K. Mehta and S. Bhattacharya, “Adequacy of Training Data for Evolutionary Mining of Trading Rules,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 461-474. doi:10.1016/S0167-9236(03)00091-5
[33] C. F. Tsai and Y. C. Hsiao, “Combining Multiple Feature Selection Methods for Stock Prediction: Union, Intersection, and Multi-Intersection Approaches,” Decision Support Systems, Vol. 50, No. 1, 2010, pp. 258-269. doi:10.1016/j.dss.2010.08.028
[34] M. H. F. Zarandi, B. Rezaee, I. B. Turksen and E. Neshat, “A Type-2 Fuzzy Rule-Based Expert System Model for Stock Price Analysis,” Decision Support Systems, Vol. 36, No. 1, 2009, pp. 139-154.
[35] H. Ince and T. Trafalis, “Kernel Principal Component Analysis and Support Vector Machines for Stock Price Prediction,” IIE Transactions, Vol. 39, No. 6, 2007, pp. 629-637. doi:10.1080/07408170600897486
[36] W. Shen, X. Guo, C. Wu and D. Wu, “Forecasting Stock Indices Using Radial Basis Function Neural Networks Optimized by Artificial Fish Swarm Algorithm,” Knowledge-Based Systems, Vol. 24, No. 3, 2011, pp. 378-385. doi:10.1016/j.knosys.2010.11.001
[37] B. C. O. Tas, “Private Information of the Fed and Predictability of Stock Returns,” Applied Economics, Vol. 43, No. 19, 2011, pp. 2381-2398. doi:10.1080/00036840903194220
[38] P. M. Dechow, A. P. Hutton, L. Meulbroek and R. G. Sloan, “Short-Sellers, Fundamental Analysis, and Stock Returns,” Journal of Financial Economics, Vol. 61, No. 1, 2001, pp. 77-106. doi:10.1016/S0304-405X(01)00056-3
[39] M. Lam, “Neural Network Techniques for Financial Performance Prediction: Integrating Fundamental and Technical Analysis,” Decision Support Systems, Vol. 37, No. 4, 2004, pp. 567-581. doi:10.1016/S0167-9236(03)00088-5
[40] K. Wohlrabe, “Forecasting with Mixed-Frequency Time Series Models,” Ph.D. Thesis, Ludwig-Maximilians-Universität, München, 2008.
[41] E. Andreou, E. Ghysels and A. Kourtellos, “Forecasting with Mixed-Frequency Data,” Journal of Econometrics, Vol. 158, No. 2, 2010, pp. 246-261. doi:10.1016/j.jeconom.2010.01.004
