Abstract
Suppose one wants to detect bad or suspicious subsequences in event sequences. Whether an observed pattern of activity (in the form of a particular subsequence) is significant and should be a cause for alarm depends on how likely it is to occur fortuitously. A long-enough sequence of observed events will almost certainly contain any subsequence, and setting thresholds for alarm is an important issue in a monitoring system that seeks to avoid false alarms. Suppose a long sequence, T, of observed events contains a suspicious subsequence pattern, S, within it, where the suspicious subsequence S consists of m events and spans a window of size w within T. We address the fundamental problem: Is a certain number of occurrences of a particular subsequence unlikely to be generated by randomness itself (i.e. indicative of suspicious activity)? If the probability of an occurrence generated by randomness is high and an automated monitoring system flags it as suspicious anyway, then such a system will suffer from generating too many false alarms. This paper quantifies the probability of such an S occurring in T within a window of size w, the number of distinct windows containing S as a subsequence, the expected number of such occurrences, its variance, and establishes its limiting distribution that allows setting up an alarm threshold so that the probability of false alarms is very small. We report on experiments confirming the theory and showing that we can detect bad subsequences with low false alarm rate.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Aho A, Corasick M (1975) Efficient string matching: An aid to biblographic search. Programming techniques
Apostolico A, Atallah M (2002) Compact recognizers of episode sequences. Inform Comput 174:180–192
Billingsley P (1986) Probability and measure. Wiley, New York
Boasson L, Cegielski P, Guessarian I, Matiyasevich Y (1999) Window-accumulated subsequence matching problem is linear. Proc PODS pp 327–336
Crochemore M, Rytter W (1994) Text algorithms. Oxford University Press, New York
Das G, Fleischer R, Gasieniec L, Gunopulos D, Kärkkäinen J (1997) Episode matching. In: Combinatorial pattern matching, 8th annual symposium. Lecture Notes in Computer Science 1264, pp 12–27
Flajolet P, Guivarc’h Y, Szpankowski W, Vallée B (2001) Hidden pattern statistics. ICALP 2001, Crete, Greece, LNCS 2076, pp 152–165
Kucherov G, Rusinowitch M (1997) Matching a set of strings with variable length don’t cares. Theor Comput Sci 178:129–154
Kumar S, Spafford EH (1994) A pattern-matching model for intrusion detection. Proceedings of the National Computer Security Conference, pp 11–21
Mannila H, Toivonen H, Verkamo A (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1:241–258
Nicodème P, Salvy B, Flajolet P (1999) Motif statistics. European symposium on algorithms. Lecture Notes in Computer Science 1643, pp 194–211
Pevzner P (2000) Computational molecular biology: an algorithmic approach. MIT Press
Régnier M, Szpankowski W (1998) On pattern frequency occurrences in a Markovian sequence. Algorithmica 22:631–649
Rigoutsos I, Floratos A, Parida L, Gao Y, Platt D (2000) The emergence of pattern discovery techniques in computational biology. Metabol Eng 2:159–177
Sedgewick R, Flajolet P (1995) An introduction to the analysis of algorithms. Addison-Wesley, Reading, MA
Szpankowski W (2001) Average case analysis of algorithms on sequence. Wiley, New York
Waterman M (1995) Introduction to computational biology. Chapman and Hall, London
Wespi A, Debar H, Dacier M, Nassehi M (2000) Fixed vs variable-length patterns for detecting suspicious process behavior. J Comput Secur 8:159–181
Wu S, Manber U (1995) Fast text searching allowing errors. Comm ACM 35:83–91
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Gwadera, R., Atallah, M. & Szpankowski, W. Reliable detection of episodes in event sequences. Knowl Inf Syst 7, 415–437 (2005). https://doi.org/10.1007/s10115-004-0174-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-004-0174-5