Keywords

1 Introduction

As a main file format for data exchange, the eXtensible Markup Language (XML) has been widely used on the web [1]. The presence of a schema provides a lot of conveniences and advantages for various applications such as data processing, automatic data integration, static analysis of transformations and so on [3, 12, 2023, 27, 31]. DTD (Document Type Definitions) and XSD (XML Schema Definitions) are two popular schema languages recommended by W3C (World Wide Web Consortium) [30]. Most content models used in DTD and XSD essentially consist of restricted subclasses of regular expressions. Therefore for practical purpose many researches focus on the study of subclasses practically used.

In 2004, Bex et al. proposed a subclass called the simple regular expressions after analyzing 109 DTDs and 93 XSDs from Cover Pages [6], which became the basis of later work. Martens et al. discussed the complexity of decision problems for eCHARE, an extension of simple regular expression, and still call it simple regular expression [24]. Later eCHARE was called CHARE in [25]. In 2005, Bex et al. discussed the expressiveness of XSDs based on 819 XSDs harvested from the web in [5]. In this corpus only 225 XSDs remained no errors and \(85\,\%\) were in fact structurally equivalent to a DTD. In 2006, after an analysis of 819 DTDs and XSDs gathered from Cover Pages as well as from the web, Bex et al. proposed two new subclasses: single occurrence regular expression (SORE) and chain regular expression (CHARE) [7]. In [7], CHARE is defined as a subclass of SORE and it has a weaker expressiveness than the CHARE defined in [25]. Using 966 DTDs and XSDs, Feng et al. extended CHARE [7] to Echare to cover more content models [15]. Echare allows two kinds of base symbols a and \(a^+\) where \(a\in \varSigma \).

The above discussion reveals some shortcomings of existing work. First, it is clear that the names of different subclasses of regular expressions above are quite confusing in the literature. Therefor we rename these subclasses and use new names in this paper. The relations between the new and the old names are shown in Table 1. Second, the scale of data sets was far from enough for analysis. One distinguishing feature of this paper is that the data set is sufficiently large compared with previous relevant work. So it will be helpful to get a more accurate result. Using techniques such as proxies, disguised as a browser, multi-threading, we gather a large sample of 2427 DTD and 4859 XSD files from Google after removing duplicate schemas by MD5. The data set covers different fields such as education, agriculture, science, economics, engineering, sports and so on. The whole list for our data set and tools used in this paper can be found in http://lcs.ios.ac.cn/~zhangxl/project.html. Besides, in existing work the above subclasses were separately discussed in different papers using relatively small data sets, while in this paper we simultaneously analyze these subclasses using a new larger data set.

Table 1. Relations between new and old restricted subclass names

In addition, these subclasses are all defined on standard regular expressions. However, counting and interleaving have already been used in XSDs. Björklund et al. made an incremental evaluation of regular expressions with counters based on a data set from three libraries: RegExLib, Snort and XML Schema on the web in [10]. The results show that almost half of the regular expressions use non-trivial counters which exclude the forms of \(a^{[0,0]}\), \(a^{[0,1]}\) and \(a^{[1,1]}\). They also found that the vast majority is the simple form of CHAINs: \(e_1\cdot e_2\cdot \cdot \cdot e_m\) where \(e_i=(a_1+a_2+\cdot \cdot \cdot +a_n)^{[k,l]}\). In [17], Ghelli et al. proposed a restricted subclass defined by \( T {:}{:}= \varepsilon | a^{[m,n]} | T + T | T \cdot T | T \& T\) where \(m\in N\backslash \{0\}\) and \(n\in N\backslash \{0\}\cup \{^*\}\). In L(T), each alphabet symbol can appear at most once. Counters can only occur as a constraint for terminal symbols. For example, \((a?\cdot (b+c+d)^{[1,100]})\) is not allowed. In 2008 they introduced a linear-time membership algorithm [18] for this subclass. Furthermore, determinism of regular expressions with counting and interleaving has been discussed by many researchers [13, 16, 19, 28]. In this paper, we extend CHARE with counting and interleaving operators. We conduct a series of experiments with result of a more popular use of this new subclass (94 %) in real world. We also analyze the determinism of it together with other five commonly used subclasses based on our data set. The main contributions of this paper are listed as follows.

- Considering numerical occurrence constraints and interleaving, we extend CHARE to a new restricted class called extended CHARE with counting and interleaving (eCICHARE). To the best of our knowledge, we are the first to analyze the usage of regular expressions with counting and interleaving together through real world data.

- Based on the large data set, we inspect the properties of different subclasses used in practice including eCICHARE by different measures. Particularly, the proportions of all subclasses used in practice are analyzed. In addition, we are the first to analyze the usage of eCHARE using real world data.

The rest of paper is organized as follows. Section 2 gives the definitions used in this paper. Then we introduce the data set and experiments in Sect. 3. The related work is discussed in Sect. 4 and Sect. 5 gives the conclusion.

2 Definitions

2.1 Regular Expression with Counting and Interleaving

Let \(\varSigma \) be a finite alphabet of terminal symbols. Each string consists of a finite sequence of symbols in \(\varSigma \). \(\varSigma ^*\) means the set of all strings over \(\varSigma \). A regular expression with counting and interleaving over \(\varSigma \) is \(\emptyset , \varepsilon \), or \(a \in \varSigma \), or is the union \(r_1+r_2\), the concatenation \(r_1 \cdot r_2\), the interleaving \( r_1 \& r_2\), the Kleene-star \(r_1^*\), the choice \(r_1?\), the counting \(r_1^{[m,n]}\) with \(m\le n\) and \(n>0\), or the plus \(r_1^+\) where \(r_1\) and \(r_2\) are both regular expressions. \(r_1\cdot r_2\) is also written as \(r_1r_2\). Let s be a string in \(\varSigma ^*\) and |s| denotes its size. We use \( s_1 \& s_2\) to denote the set of strings obtained by \(s_1\) and \(s_2\) in every possible way. For \(s\in \varSigma ^*\), \( s \& \varepsilon =\varepsilon \& s=s\) and \( a\cdot s_1 \& b\cdot s_2=({a}\cdot (s_1 \& b\cdot s_2))\cup ({b}\cdot (a\cdot s_1 \& s_2))\). For example, strings accepted by \( a \& b \& c\) are \(\{abc,acb,bac,bca,cab,cba\}\). Counting is the numerical occurrence constraint which defines the minimal and maximal number of times for a certain symbol in a regular expression. For \(E=(a^{[1,3]}+b)^{[2,3]}b\), it means that two-to-three repeats of a choice between a sequence of one-to-three a elements or a single b, followed by a single b. By \( RE( \& , \#)\) we represent regular expressions extended with interleaving and counting. We use \( RE( \& )\) and \(RE(\#)\) to denote the regular expressions with interleaving and counting respectively. \(\#\) and & operators help us to express the content models more succinctly and concisely.

2.2 eCICHARE

We introduce a new subclass which extends CHARE with interleaving and counting.

Definition 1

(extended CHARE with counting and interleaving (eCICHARE)). Base symbols are a, a?, \(a^*\), or \(a^+\) where \(a \in \varSigma \). A factor is of the form e, e?, \(e^*\), \(e^{[m,n]}\), or \(e^+\) where e is a disjunction of base symbols of the same kind. An eCICHARE is \(\emptyset \), \(\varepsilon \), a concatenation of factors, or an unordered sequence of factors with interleaving among them.

In the definition, each eCICHARE cannot contain both concatenation and interleaving operators at the same time. Numerical occurrence can only be the constraint to factors. This restriction is severe, but the experiment result shows that it is actually met by most of regular expressions in real world data. In addition, two forms of eCICHARE: \(a^{[0,1]}\) and \(a^{[1,1]}\) are always substituted by a? and a in practice.

For instance, \(E_1=(a^*+b^*)?(a+b)^*(c^++d^+)e?\), \(E_2=a^{[1,3]}b?\) and \( E_3=(a^*+b^*)? \& (a+b)^* \& (c^++d^+) \& e?\) are both eCICHAREs while \( E_4=(a^*+b^*)?(a+b^{[0,3]})^*(c^++d^+) \& e?\) is not.

2.3 Determinism

Determinism is required by W3C specification for content models of DTDs and XSDs. It is also called the unique particle attribution (UPA) property by W3C Recommendation. It has the same meaning with one-ambiguity [11], and weak determinism [16, 29]. Suppose that we match a string s against a given regular expression E from left to right. If we always know definitely the next symbol we will match without looking ahead in the string, E is deterministic. The formal definition of determinism is as follows.

Definition 2

(Determinism [4]). An expression E is deterministic if and only if for all words \(uxv, uyv \in L(E')\) where \(|x|=|y|=1\), if \(x\ne y\) then \(x'\ne y'\).

In this definition, \(E'\), \(x'\) and \(y'\) are the marked forms of E, x and y. For example. \(E=(a+b)^*a?b^*\), then \(E'=(a_1+b_2)^*a_3?b_4^*\) and \((E')'=E\). \(E=aa^*\) is a deterministic regular expression while \(E=a^*a\) is not.

2.4 Definitions for Other Five Subclasses

Definition 3

(CHARE [6]). Base symbols are a, a?, \(a^*\), \(a^+\) where \(a \in \varSigma \). A factor is of the form e, e?, \(e^*\), \(e^+\) where e is a disjunction of base symbols of the same kind. A simple regular expression is \(\emptyset \), \(\varepsilon \), or a concatenation of factors.

For instance, \((a^*+b^*)(a+b)?b^*(a+b)^*\) is a CHARE while \((abc+c?)(a^*+b^*)d?\) is not.

Definition 4

(eCHARE [25]). Base symbols are s, s?, \(s^*\), \(s^+\) where s is a non-empty string. A factor is of the form e, e?, \(e^*\), \(e^+\) where e is a disjunction of base symbols of the same kind. That is of the form \((s_1+\cdot \cdot \cdot +s_n)\), \((s_1^*+\cdot \cdot \cdot +s_n^*)\), \((s_1?+\cdot \cdot \cdot +s_n?)\), \((s_1^++\cdot \cdot \cdot +s_n^+)\), where \(n\ge 1\) and \(s_i\) is non-empty string. An eCHARE is \(\emptyset \), \(\varepsilon \) or a concatenation of factors.

For example, \(((abc)^*+b)(a+b)?(ab)^+(ac+b)^*\) is an eCHARE.

Definition 5

(SORE [7]). Let \(\varSigma \) be a finite alphabet. A single-occurrence regular expression (SORE) is a regular expression over \(\varSigma \) in which every terminal symbol occurs at most once.

For instance, \((((a+b),c)?d^+)^*e\) is a SORE while \((a^*+b^*)a\) is not.

Definition 6

(Simplified CHARE [7]). A Simplified CHARE is a SORE over \(\varSigma \) of the form \(f_1 \cdot \cdot \cdot f_n\) where \(n\ge 1\). Every factor \(f_i\) is an expression of the form \((a_1+\cdot \cdot \cdot +a_n)\), \((a_1+\cdot \cdot \cdot +a_n)?\), \((a_1+\cdot \cdot \cdot +a_n)^*\), \((a_1+\cdot \cdot \cdot +a_n)^+\) where \(n\ge 1\) and every \(a_i\) is a terminal symbol.

For example, \((a+b)?c^*\) is a Simplified CHARE while \((a+b?)a\) is not.

Definition 7

(eSimplified CHARE [15]). An eSimplified CHARE is a SORE over \(\varSigma \) of the form \(f_1 \cdot \cdot \cdot f_n\) where \(n\ge 1\). Every factor \(f_i\) is an expression of the form \((a_1+\cdot \cdot \cdot +a_n)\), \((a_1+\cdot \cdot \cdot +a_n)?\), \((a_1+\cdot \cdot \cdot +a_n)^*\), \((a_1+\cdot \cdot \cdot +a_n)^+\) where \(n\ge 1\) and every \(a_i\) is a terminal symbol or the form of \(a_i^+\).

For example, \((a+b^+)c?\) is an eSimplified CHARE.

CHARE and eCHARE require the base symbols in each factor must be the same kind while eSimplified CHARE can be different.

3 Experiments

3.1 Data Set

Data Preprocess. There are two steps in our data preprocess. First, get the DTD and XSD files. Repositories of data set such as GSML, DMTF, DSML, DWML, FACETMAP, GITHUB, GRAPHML, HAPMAP, IOP, KAIST, NCBI, BIOXSD, CORBA, CSML and so on are well-formed. We can harvest DTDs and XSDs from their official websites directly. But others need to be crawled through Google by queries: filetype:dtd or filetype:xsd. Using these two queries, we get many URLs for DTDs (or XSDs). However, not all these URLs point to a DTD (or XSD) directly but to some HTML files. Such HTML file may have a link or path to a DTD (or XSD), or just some information on the website. For links, we make a recursive resolving to download these related DTDs and XSDs. For information on the websites, we use a specific script tool: \(html\_parser\) which is written in Python to analyze the information and obtain the schema. Second, do data cleaning. We remove duplicate files with same URLs. More precisely, we use the technique MD5 to analyze whether two files with different URLs have the same content. If so, redundant files are also removed. At last, we obtain 2427 DTD files and 4859 XSD files and extract 64249 and 67255 regular expressions from DTDs and XSDs files respectively.

For the convenience of discussion, we introduce a uniform syntax to denote subclasses of restricted regular expression by specifying the allowed factors used in [25]. The base symbols are denoted by a, a?, \(a^*\), \(a^+\). s means a string in \(\varSigma ^*\). The disjuncts are denoted by \((a_1+a_2+\cdot \cdot \cdot +a_n)\), \((a_1?+a_2?+\cdot \cdot \cdot +a_n?)\), \((a_1^*+a_2^*+\cdot \cdot \cdot +a_n^*)\), \((a_1^++a_2^++\cdot \cdot \cdot +a_n^+)\) which can be also extended with choice, Kleene-star, and plus respectively. These factors can be abbreviated by the form of \((+\cdot \cdot \cdot )\). We use \(RE((+a?), a^*)\) to illustrate the subclass whose factors can be in the form of \((a_1?+a_2?+\cdot \cdot \cdot +a_n?)\) where \(a_i \in \varSigma \) and \(n\ge 1\) or the form of \(a^*\) for some \(a \in \varSigma \). We list some possible factors in Table 2.

Table 2. Possible factors in subclasses of regular expressions and their abbreviations [25]

Based on 64249 and 67255 regular expressions from DTDs and XSDs, we analyze the occurrence types of regular expressions in practice. We treat the form of \(a^+\) operator as \(aa^*\). The result is shown in Table 3. From Table 3, we can find that the form of \(RE(a,(+a)^*)\) accounts for the most proportion (\(34.6\,\%\)) for DTDs while in XSDs, forms of RE(aa?) and \(RE(a,a^*)\) are more popular. In addition, the vast majority of regular expressions belongs to the subclass of eCICHARE with proportion of \(90.3\,\%\) for DTDs and \(94.1\,\%\) for XSDs respectively.

Table 3. Occurrence of types

3.2 Definitions for Measures

In this section, we first introduce some definitions of measures used in the experiment such as star height, nesting depth, density and so on. Then, using these measures, we analyze the properties and complexity of regular expressions from different aspects.

Definition 8

(Star Height [2]). The star height of a regular expression E over the alphabet \(\varSigma \), denoted by h(E), is a nonnegative integer defined recursively as follows:

  1. 1.

    \(h(E)=0\), if \(E=\emptyset \) or a for \(a \in \varSigma \),

  2. 2.

    \(h(E)=max\{h(E_1), h(E_2)\}\), if \(E=(E_1+E_2)\) or \(E=(E_1\cdot E_2)\), where \(E_1\) and \(E_2\) are regular expressions over \(\varSigma \),

  3. 3.

    \(h(E)=h(E_1)+1\) if \(E=(E_1)^*\) and \(E_1\) is a regular expression over \(\varSigma \).

The star height [2] reflects the maximum nesting depth of Kleene-star occurring in a regular expression. It is an illustration for the complexity of DTDs and XSDs. We give the star height of DTDs and XSDs respectively. We use some substitutions in our experiment of computing the star height. For example, \(E_1=a^+\) and \(E_2=a^{[m,\infty ]}\) can be rewritten as the form of \(E_1'=aa^*\) and \(E_2'=a^*\). We treat interleaving similarly as the operator of concatenation. In fact, other forms of counting and interleaving do not influence the results of star height. From Table 4, we observe that the result of distributions for DTDs and XSDs have no significant differences. Content models with star height larger than 2 are very rare. For XSDs, the proportion with the star height equal to 1 is higher in our result than that in [6] which is \(38\,\%\) and \(17\,\%\) respectively.

Table 4. Star height observed in DTDs and XSDs
Table 5. Nesting depth observed in DTDs and XSDs

Besides, we also consider the counter height for regular expressions of XSDs. Counter height is defined just like the star height. The result indicates that the counter height with 1 of regular expressions with counting accounts for almost \(89.3\,\%\). Star height and counter height are both illustration of iteration depth of regular expressions. In this paper, we introduce Nesting depth to measure the complexity of a regular expression. For example, the star height for regular expressions \(E_1=(a+b?)?c^{[2,4]}d\) and \(E_2=a\) are both zero. But the complexity for \(E_1\) and \(E_2\) should be different. Nesting depth considers all operators possible in regular expressions including counting and interleaving. For example, let \(E=1^++(2?\cdot 3^+)^*+4^{[1,3]}\). Its corresponding nesting depth is 2. We show proportions of DTDs and XSDs by nesting depth in Table 5. From Table 5, we can find that both DTDs and XSDs with nesting depth lower than 2 are more than \(95\,\%\). That means schemas with complex structures are very rare.

Definition 9

(Nesting Depth). The nesting depth of a regular expression E over \(\varSigma \), denoted by ND(E), is a nonnegative integer defined recursively as follows:

  1. 1.

    \(ND(E)=0\), if \(E=\emptyset \), \(\varepsilon \) or a for \(a\in \varSigma \),

  2. 2.

    \(ND(E)=max\{ND(E_1),ND(E_2)\}\), if \(E=(E_1+E_2)\), \( E=(E_1 \& E_2)\) or \(E=(E_1\cdot E_2)\), where \(E_1\) and \(E_2\) are regular expressions over \(\varSigma \),

  3. 3.

    \(ND(E)=ND(E_1)+1\), if \(E=(E_1)^*\), \(E=(E_1)^+\), \(E=(E_1)?\) or \(E=(E_1)^{[m,n]}\) for \(E_1\) is a regular expression over \(\varSigma \).

Then we consider the density distribution of DTDs and XSDs respectively based on the real world data. The density [6] can be another measure to illustrate the complexity degree of rules in content models.

Definition 10

(Density [6]). The density of a schema is defined as the number of elements occurring in the right hand side of its rules divided by the number of elements. The formula is \(d = \frac{1}{N} \sum _{i = 1}^N |A_i|\) where N is the total number of element definitions occurring in this schema, \(A_i\) is the string in the right hand of a rule, \(|A_i|\) denotes the size of \(A_i\).

The density [6] can be a measure to illustrate the complexity degree of regular expressions. From Definition 10, we can easily find that the larger the value is, the more sophisticated rules a schema will have. In our experiment, we treat counting as unary operator and the interleaving operator as concatenation. Theses processings do not influence the results of density. Based on the large data set, we give the distributions of density in Fig. 1 for DTDs and XSDs respectively. From Fig. 1, it is easy to find that the density less than 10 for DTDs and XSDs is 95.8 % and 94.4 %.

Fig. 1.
figure 1

ID number of DTDs (left) and XSDs (right) versus their density

3.3 XML Schema Features Used in Practical

Although DTD is simple, it develops with some shortcomings such as no modularity, limited expressiveness for new domains, limited basic types and so on. The content model of an element in DTD depends only on the element name. In contrast, XML Schema is based on type definitions and allows the content model to depend on the context in which the element is used. With stronger expressiveness, the usage of XML Schema grows gradually though it is complicated to some extent. Table 6 shows the features of XSDs used in practice.

Table 6. XML schema features used in the corpus

3.4 Determinism

We consider the determinism of the regular expressions on our large data set. We use our own tools to check the determinism of regular expressions for DTD and XSD respectively. The result is shown in Table 7. From Table 7, we observe that these subclasses almost all satisfy the deterministic requirement. Take the first number 58536 / 64249 for example. 64249 means the total number of regular expressions for DTDs in the whole data. 58536 means the number of deterministic regular expressions in the whole data set.

Table 7. Determinism of regular expressions
Table 8. Proportions of subclasses of regular expressions

3.5 Usage of Subclasses of Regular Expressions in Practice

In this section, we mainly investigate the usage and proportions of six subclasses in practice based on our large data set. In particular, k-ORE [4] is discussed with different values of k at the same time. k-ORE means each alphabet symbol in regular expression occurs at most k times. The reasons and proportions dissatisfying the corresponding definitions are also discussed. In our experiment, we call regular expressions with counting and interleaving as non-standard regular expressions. They are analyzed separately because counting and interleaving are specific features for XSDs. This is a little different from that in [6, 7]. In [6, 7], they rewritten regular expressions with counting as a new form using choice operator ?. For example, \(E=a^{[1,3]}\) is transformed to \(E'=aa?a?\). This transformation is not reasonable enough which influences the results of CHARE and eCHARE. The result in Table 8 indicates that the existing five subclasses still have been used in a large scale (more that \(80\,\%\)). However, the usage of regular expressions with counting and interleaving increases gradually. They are called non-standard expressions in our experiment and accounts for \(6.67\,\%\). After we extend CHARE to eCICHARE, the proportion rises to \(94.00\,\%\) from \(87.88\,\%\).

Bex et al. concluded that \(92\,\%\) of DTDs and \(97\,\%\) of XSDs were CHARE based on 109 DTDs and 93 XSDs [6]. In [7], Bex et al. found that \(99\,\%\) regular expressions were Simplified CHARE, which were also SORE based on 819 DTDs and XSDs. While in our experiment, the proportions for CHARE, SORE, Simplified CHARE, eSimplified CHARE are lower. The main reason is the difference of data sets. Our data set is larger and more comprehensive, which leads to more accurate results and is closer to the real situation. Another reason is the different process of counting and interleaving operators. The transformation of counting in [6, 7] such as the substitution aa?a? of \(a^{[1,3]}\), increases the proportions of CHARE and eCHARE. Besides, we are the first to analyze the usage of eCHARE through real world data. The proportions of eCHARE are \(90.37\,\%\) for DTDs and \(88.13\,\%\) for XSDs. Reasons dissatisfying the corresponding definitions are shown in the table clearly. For example, non-terminal symbols mean using expressions such as the forms of ab and \(a?b^*\) where \(a, b \in \varSigma \) as base symbols in a disjunction. They are not allowed in four subclasses: CHARE, Simplified CHARE, eSimplified CHARE and eCICHARE. For CHARE, they account for \(8.84\,\%\) and \(3.52\,\%\) for DTDs and XSDs respectively. The proportions for other three subclasses can be found in Table 8. However, strings like the form of ab as the base symbols is valid for eCHARE. The form of \((a?b^*+c)\) is not allowed in eCHARE. We call it as extended strings. The proportion of extended strings is not small and they will be considered in our future work. The proportion of SORE (\(95.01\,\%\) for DTDs and \(92.45\,\%\) for XSDs) means that symbols in the vast majority of regular expressions only occur once most of the time. Simplified CHARE and eSimplified CHARE are both subclasses of SOREs. In [15], eSimplified CHARE accounts for \(84.8\,\%\) based on 2009 regular expressions while in our experiment the proportion is \(89.12\,\%\) for DTDs (64249 regular expressions) and \(86.73\,\%\) for XSDs (67255 regular expressions). The unary operators \(*\) and ? cannot be constraints for base symbols in eSimplified CHARE, which account for \(0.89\,\%\) and \(2.9\,\%\) for DTDs and XSDs respectively. For all six subclasses, proportions of other reasons are not large. For example, different unary operators depict the use of the form \((a?+b^*)c\) as factors which is not valid in the six subclasses. The unary operator * or ? is the form of \((a^*+b^*)\) which is not allowed in eSimplified CHARE. In the definition of eCICHARE, numerical occurrence constraints cannot be nested. The reason of regular expressions with counting is the form of \((a^{[1,3]}+b)^{[1,2]}\) which is not valid for eCICHARE.

4 Related Work

Early in 2002, Choi [14] made an experiment about DTDs. 60 DTDs were extracted from the XML.org DTD repository [32]. He analyzed the features of DTDs and proposed measures to make a deep study of their structural properties such as local properties including syntactic complexity, determinism, ambiguity and global properties including reachability, recursion, simple path and simple cycle, chain of stars, hubs. He found that the majority of DTDs used in real world is the form of chain. This result has provided important suggestion for later study in this field.

Based on 109 DTDs and 93 XSDs crawled from Cover Pages, Bex et al. [6] proposed simple regular expression which was named CHARE in this paper. They found that 92 % and 97 % of all element definitions in DTDs and XSDs are CHAREs. Martens et al. extended CHARE to eCHARE in [25]. Using an improved data set crawled from Cover Pages and the web, about 819 DTDs and XSDs, Bex et al. introduced two new subclasses of regular expressions: SORE and Simplified CHARE in [7]. The results revealed that more than 99 % of the regular expressions occurring in practical schemas are Simplified CHARE (therefore also SORE). In 2006, Bex et al. in [8] proposed the concept of k-occurrence. A regular expression is k-occurrence if every alphabet symbol occurs at most k times in it [8]. According to the same data set in [7], they concluded that regular expressions occurring in practical schemas are such that every alphabet symbol occurs at most k times and actually, in 98 % of the cases \(k=1\). Martens et al. in [26] concluded that in more that 98 % of the XSDs occurring in practice the content model of an element depends only on the label of the element itself, the label of its parent, and (sometimes) the label of its grand-parent after an examination of 225 XSDs gathered from the Cover Pages. Later in 2007, Bex et al. [9] introduced the concept of k-local. An XSD is k-local if its content models depend only on labels up to the k-th ancestor [9]. For most cases in real world, the value of k is 1. This result is conformed with that in [8]. In 2014, inspired by Simplified CHARE, Feng et al. proposed eSimplified CHARE whose base symbol allows the forms of a and \(a^+\) where \(a\in \varSigma \). The data set used in [15] consists of 966 valid DTDs and XSDs which were rewritten as 2009 regular expressions. Based on this data set, the cover ratio of eSimplified CHARE reached 84.8 % from 79.5 % for Simplified CHARE. In addition, two inference algorithms for eSimplified CHARE were given.

In addition, regular expressions with counting and interleaving have been studied by many researchers. Björklund et al. were the first to make an incremental evaluation of regular expressions with counters. They gathered about 3024, 458 and 8830 regular expressions respectively from three libraries: RegExLib, Snort, and XML Schema on the web in [10]. The results show that there are 1705 out of 3024 (about \(56.3\,\%\)), 270 out of 458 (about \(58.9\,\%\)) regular expressions use non-trivial counters in RegExLib and Snort. Regular expressions with non-trivial counters are the forms excluding three specific ones: \(a^{[0,0]}\), \(a^{[0,1]}\) and \(a^{[1,1]}\). In addition, the proportions of CHAINs are \(73.3\,\%\), \(85.1\,\%\) and \(86\,\%\) for the data of three libraries above. In [17], Ghelli et al. proposed a restricted subclass defined by \( T {:}{:}= \varepsilon | a^{[m,n]} | T + T | T \cdot T | T \& T\) where \(m\in N\backslash \{0\}\) and \(n\in N\backslash \{0\}\cup \{^*\}\). In L(T), each alphabet symbol can appear at most once. Counter can only occur as a constraint for terminal symbols. For example, \((a?\cdot (b+c+d+e+f)^{[1,100]})\) is not allowed. Based on this subclass, they first proposed a linear-time translation algorithm of the translation of each regular expression into a set of constraints in [18]. Then they introduced a linear-time membership algorithm to check whether a word satisfies the resulting constraints.

5 Conclusion and Future Work

In this paper, we introduce a new restricted subclass of regular expression with counting and interleaving. The experiment results show that this subclass can cover more content models in real world. Then different features of five subclasses of content models, i.e., CHARE, eCHARE, SORE, Simplified CHARE, eSimplified CHARE together with eCICHARE are also analyzed using different measures. We inspect their usages and give the corresponding proportions based on the real world data. Our data set is much larger than previous work which leads to more accurate results. We believe that our work will be helpful to the applications and further study of DTDs and XSDs. One future work is the study of inference algorithms and complexity problems related to eCICHARE. The strong and weak determinism of eCICHARE will also be considered. Besides, based on our experiment results, it is possible to propose other useful subclasses of content models.