1 Introduction

DNA sequencing has an essential role in molecular biology [1,2,3,4,5,6]; therefore, it is important to find the exact place of each base in the genome [7,8,9,10,11]. In the 1970s, Sanger sequencing technology was introduced and used to sequence the first genome [12, 13]. Then, in 2005, a new sequencing technology known as next-generation sequencing (NGS) appeared [14].

NGS technology can sequence hundreds of thousands of DNA fragments at the same time, in fewer hours and at lower cost [15,16,17,18,19]. But this technology cannot sequence the whole genome at once, so we have to divide the genome into short fragments to be sequenced [20, 21]. NGS derives a large number of reads from the same genome, so each base of the genome is sequenced many times [22,23,24,25,26]. The average number of times that a base is sequenced is called coverage [21, 27, 28]. This redundancy produces additional information that increases the complexity of the processing machine [29].

The de novo sequencing method is used to assemble a genome from sequence reads [30]. De novo sequencing in bioinformatics means merging and aligning DNA fragments to recover the original genome [31]. The importance of de novo sequencing lies in the ability of the technology to assemble new genomes. The complexity of de novo sequencing increases with genome length; a major source of complexity is repeat regions in the genome, which can be simple repeat regions within genes, the genes themselves, or parts of them [32].

In general, the complexity of assembly is affected by the number of reads, the length of reads, the length of repeat regions, and the genome length [33]. Increasing the number of long reads helps to reduce assembly complexity, but the running time of the algorithm increases exponentially with the number and length of reads [34]. On the other hand, short reads assemble very easily, but repeat regions then increase the complexity of assembly. For assembly, we first produce samples from the genome, called reads, then send them to the sequencing machine, and finally process them in the processing machine to assemble the genome [35].

The method used here for human genome assembly is a generalization of the reference method [30]; it eliminates the shortcomings of previous methods and improves speed, accuracy, and sequencing time, which is the novelty of this paper. This algorithm is comparable to the OLC algorithm, which uses graph theory for genome assembly. It works better for duplicate areas: it identifies the exact location of these areas and prevents the creation of redundant paths in the graph.

We apply the de novo sequencing algorithm in three settings: first, to reads extracted from an i.i.d. genome with equal base probabilities; then, to an i.i.d. genome with a repeat region; and finally, to a real genome extracted from human chromosome 19 [36]. To show the accuracy of the algorithm, we first extract different numbers and lengths of reads to show their effect on the number of sequenced bases, the algorithm running time, and the maximum contig length, and then we show the effect of different fragment lengths on the probability of assembly accuracy.

2 Material and methods

2.1 Lander–Waterman’s coverage bound and feedback sequencing system

The number of reads required by Lander–Waterman's bound depends on the length of the genome. The Lander–Waterman bound is based on the coupon-collector problem and shows that, for resequencing a genome, the total number of sequenced bases should be \(O(G\mathrm{ln}G)\) [37]. This is a consequence of a resequencing machine in which the sequencing and processing parts are separate. If we take feedback from the processing machine, the two parts of the machine work cooperatively and the number of sequenced bases is reduced. In [37], for a genome with length \(G\), the number of reads given by Lander–Waterman's coverage bound is at least \(N=\frac{G}{L}\mathrm{ln}(\frac{G}{L})\), and feedback reduces the number of sequenced bases to as low as \(O(G)\). In this paper, we focus on the de novo sequencing problem. We have two cases for assembling reads:

  1. Two segments of reads with length \((\mathcal{l})\) have a common base, such as the \({j}^{th}\) read and \({(j+1)}^{\mathrm{th}}\) read in Fig. 1.

  2. Two segments of reads with length \((\mathcal{l})\) do not have a common base, such as the \({i}^{th}\) read and \({(i+1)}^{\mathrm{th}}\) read in Fig. 2.

Fig. 1

a Common assembly method, b New assembly method in ref [37]

Fig. 2

Two cases for subsequent reads

In the first case, we have reads whose extension does not increase the coverage; in the second case, we have reads whose extension during sequencing does increase the coverage. We assume that the starting points of the reads follow a Poisson distribution with rate \(\lambda =\frac{N}{G}\), so the gaps between the starting points of consecutive reads have independent exponential distributions.

We assume that \({B}_{i}\) is a random variable giving the number of additional bases in the \(i\) th read, so the total number of additional bases is \(B=\sum_{i=1}^{N}{B}_{i}\) and \(E\left\{B\right\}=NE\left\{{B}_{j}\right\}\).

For calculating \(E\left\{{B}_{j}\right\}\), suppose the \((j+1)\) th read starts \(x\) bases after the start of the \(j\) th read. If \(x\le \mathcal{l}\), we have \({B}_{j}=\mathcal{l}-x\). The average number of additional bases over all reads is equal to [37]:

$$ E\left\{ B \right\} = kG\left( {1 - \frac{{e^{k} }}{{e^{{\frac{NL}{G}}} }}} \right) $$
(1)

The maximum value of the bracketed term is one. So with \(k\) constant, the number of additional read bases is of order \(\mathcal{O}(G)\). For example, by choosing \(N=G/\mathcal{l}\), Lander–Waterman's coverage bound is broken.
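Equation (1) is straightforward to evaluate numerically. The sketch below (in Python, while the paper's own experiments used MATLAB) assumes \(k = N\mathcal{l}/G\) as defined above and checks that choosing \(N=G/\mathcal{l}\) makes the number of additional bases grow linearly in \(G\), i.e., \(\mathcal{O}(G)\):

```python
from math import exp

def extra_bases(N, L, l, G):
    """E{B} from Eq. (1): expected number of additional sequenced bases,
    with k = N*l/G and NL the total number of read bases."""
    k = N * l / G
    return k * G * (1 - exp(k) / exp(N * L / G))

# Choosing N = G/l gives k = 1, so E{B} = G*(1 - e^(1 - L/l)):
# exactly linear in G, i.e., O(G) instead of O(G ln G).
e1 = extra_bases(2000, 500, 10, 20000)   # G = 20000, N = G/l
e2 = extra_bases(4000, 500, 10, 40000)   # G = 40000, N = G/l
print(e1, e2)
```

Doubling \(G\) (keeping \(N=G/\mathcal{l}\)) exactly doubles \(E\{B\}\), which is the linear scaling claimed in the text.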

For calculating the number of additional bases in the noisy mode, we use reasoning similar to the noise-free mode. For calculating \(E\left\{{B}_{j}\right\}\) for the \(j\) th read, we assume that \({C}_{\in }\) reads after this read start from the \(x\) th base, which has an Erlang distribution. The average number of overlapping bases for every read is equal to [37]:

$$ E\left\{ B \right\} = \left\{ {\frac{1}{{\left( {C_{\varepsilon } - 1} \right)!}}\left( {\left( {k_{n} - C_{\varepsilon } } \right)\gamma \left( {C_{\varepsilon } ,k_{n} } \right) + k_{n}^{{c_{\varepsilon } }} e^{{ - k_{n} }} } \right)} \right\} \times G $$
(2)

where \({k}_{n}=\frac{N\mathcal{l}}{G}\). So for every fixed \({k}_{n}\), \(E\left\{B\right\}\) is \(\mathcal{O}(G)\). The coverage bound for genome bases covered \({C}_{\epsilon }\) times is equal to [37]:

$$ P \le G\mathop \sum \limits_{j = 0}^{{C_{\varepsilon } - 1}} e^{ - \lambda l} \frac{{\left( {\lambda l} \right)^{j} }}{j!} $$
(3)
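Equation (3) is a union bound over the \(G\) bases using the Poisson distribution of read starts, so it can be evaluated directly. A minimal sketch, assuming \(\lambda = N/G\) and writing `c_eps` for \(C_{\epsilon }\):

```python
from math import exp, factorial

def coverage_bound(G, N, L, c_eps):
    """Eq. (3): union bound on the probability that some genome base is
    covered fewer than c_eps times, with Poisson read starts of rate N/G."""
    lam = N / G
    return G * sum(exp(-lam * L) * (lam * L) ** j / factorial(j)
                   for j in range(c_eps))

# The bound shrinks quickly as the number of reads grows:
print(coverage_bound(20000, 1000, 500, 8))
print(coverage_bound(20000, 2000, 500, 8))
```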

2.2 Suggested method

In the sequencing problem, if we do not have a reference genome, we must use de novo sequencing. The sequencing machine produces reads with length \(L\). Based on Lander–Waterman's coverage bound, the total number of sequenced bases \(NL\) is equal to \(O(G\mathrm{ln}G)\). In this case, the coverage bound is controlled by the processing machine, so the processing machine can end sequencing at any base. In other words, the sequencing machine reads bases one by one from the DNA segment, and it stops only on the stop command of the processing machine.

2.3 Designing a de novo sequencing algorithm for i.i.d genome

In this method, the key strategy we use to end the sequencing is as follows. First, we divide \(L\) by an integer \(K\in \left\{1,\dots ,L\right\}\) to obtain \(\mathcal{l}=\frac{L}{K}\), and we assume that \(\mathcal{l}\) is an integer. In the first level (\(k=1\)), the sequencing machine separates the first \(\mathcal{l}\) bases of every read, called fragments, into the collection \(C=\left\{{\mathcal{R}}_{1}\left(\mathcal{l}\right), {\mathcal{R}}_{2}\left(\mathcal{l}\right),\dots , {\mathcal{R}}_{N}\left(\mathcal{l}\right)\right\}\), where \({\mathcal{R}}_{i}\left(\mathcal{l}\right)\) denotes the first \(\mathcal{l}\) bases of the \(i\) th read; in other words, \({\mathcal{R}}_{i}\left(\mathcal{l}\right)\) is the \(i\) th fragment of collection \(C\). In the \(k\) th level (\(k>1\)), two loops over \(i, j\) compare parts of the \(i\) th and \(j\) th fragments. If the \((\mathcal{l}-1)\)-mer (substring of length \(\mathcal{l}-1\)) at the end of the \(i\) th fragment equals the \((\mathcal{l}-1)\)-mer at the beginning of the \(j\) th fragment, these two fragments are connected in the i.i.d. genome [37]. The machine therefore connects the two fragments, replaces the \(i\) th fragment with the result, and deletes the \(j\) th fragment from collection \(C\); the sequencing of the \(i\) th fragment is finished, and sequencing continues from the \(j\) th fragment, which now stands in place of the \(i\) th fragment. In the \((k+1)\) th level, if \({L}_{MAX}-k+1\ge k+1\), we call the \((\mathcal{l}+k-1)\) th base of the \(i\) th read, connect it to the \(i\) th fragment, and repeat all levels to achieve a contig. This algorithm is exhibited in Algorithm de novo Sequencing.
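The merging step described above can be sketched as follows. This is an illustrative Python reimplementation, not the authors' MATLAB code: instead of the feedback-driven base-by-base extension, it works on full reads and greedily joins the pair with the longest suffix–prefix overlap of at least `min_olap` bases (playing the role of the \((\mathcal{l}-1)\)-mer match):

```python
import random

def overlap(a, b, min_olap):
    """Length of the longest suffix of a that equals a prefix of b,
    provided it is at least min_olap; otherwise 0."""
    for t in range(min(len(a), len(b)), min_olap - 1, -1):
        if a[-t:] == b[:t]:
            return t
    return 0

def greedy_assemble(frags, min_olap):
    """Repeatedly merge the pair of fragments with the largest overlap
    until no pair overlaps by at least min_olap; returns the contigs."""
    frags = list(frags)
    while len(frags) > 1:
        best_t, best_i, best_j = 0, -1, -1
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    t = overlap(frags[i], frags[j], min_olap)
                    if t > best_t:
                        best_t, best_i, best_j = t, i, j
        if best_t == 0:
            break  # no sufficient overlap left: return partial contigs
        frags[best_i] += frags[best_j][best_t:]
        del frags[best_j]
    return frags

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(300))   # i.i.d. genome
reads = [genome[s:s + 40] for s in range(0, 261, 20)]         # 20-base overlaps
contigs = greedy_assemble(reads, min_olap=15)
print(len(contigs), contigs[0] == genome)
```

With an i.i.d. genome, a 15-base spurious match is vanishingly unlikely, so the reads reassemble into a single contig equal to the genome.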

For noisy reads, we have an error rate parameter \(\upepsilon \), a coverage-depth parameter \({\mathrm{C}}_{\upepsilon }\), and a correct resequencing probability \({\text{P}}_{\varepsilon }\). For extracting each base from its \(C_{\varepsilon }\) observations, the target is the base with the maximum count in that region (majority voting). The probability of failure in the noisy setting is equal to:

$$ \begin{gathered} P\left( {n_{0} \le \max \left( {n_{1} ,n_{2} ,n_{3} } \right)} \right) = \sum\limits_{j = 0}^{{C_{i} }} {P\left( {j \le \max \left( {n_{1} ,n_{2} ,n_{3} } \right)\left| {n_{0} = j} \right.} \right)} \times P\left( {n_{0} = j} \right) \hfill \\ = \sum\limits_{j = 0}^{{C_{i} }} {\sum\limits_{{m_{1} ,m_{2} }} {\left[ {P\left( {j \le \max \left( {n_{1} ,n_{2} ,n_{3} } \right)\left| {n_{0} = j,n_{1} = m_{1} ,n_{2} = m_{2} ,n_{3} = C_{i} - m_{1} - m_{2} - j} \right.} \right) \times P\left( {n_{1} = m_{1} ,n_{2} = m_{2} ,n_{3} = C_{i} - m_{1} - m_{2} - j} \right) \times P\left( {n_{0} = j} \right)} \right]} } \hfill \\ \end{gathered} $$
(4)

As we have:

$$ P\left( {n_{1} = m_{1} ,n_{2} = m_{2} ,n_{3} = C_{i} - m_{1} - m_{2} - j} \right) = \frac{{\left( {\begin{array}{*{20}c} {C_{i} - j} \\ {m_{1} ,m_{2} ,C_{i} - m_{1} - m_{2} - j} \\ \end{array} } \right)}}{{3^{{C_{i} - j}} }} $$
(5)
$$ P\left( {n_{0} = j} \right) = \left( {\begin{array}{*{20}c} {C_{i} } \\ j \\ \end{array} } \right)\left( {1 - \varepsilon } \right)^{j} \varepsilon^{{\left( {C_{i} - j} \right)}} $$
(6)
$$ \begin{gathered} P\left( {n_{0} \le \max \left( {n_{1} ,n_{2} ,n_{3} } \right)} \right) \hfill \\ = \sum\limits_{j = 0}^{{C_{i} }} {\sum\limits_{{m_{1} ,m_{2} }} {\left[ {P\left( {j \le \max \left( {n_{1} ,n_{2} ,n_{3} } \right)\left| {n_{0} = j,n_{1} = m_{1} ,n_{2} = m_{2} ,n_{3} = C_{i} - m_{1} - m_{2} - j} \right.} \right)} \right]} } \hfill \\ \times \left( {\begin{array}{*{20}c} {C_{i} } \\ j \\ \end{array} } \right)\left( {1 - \varepsilon } \right)^{j} \varepsilon^{{\left( {C_{i} - j} \right)}} \times \frac{{\left( {\begin{array}{*{20}c} {C_{i} - j} \\ {m_{1} ,m_{2} ,C_{i} - m_{1} - m_{2} - j} \\ \end{array} } \right)}}{{3^{{C_{i} - j}} }} \hfill \\ \end{gathered} $$
(7)

For example, if \(P_{\varepsilon } = 10^{ - 4}\) and \(\varepsilon = 0.07\), then \(C_{\varepsilon }\) is equal to 8.
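This value can be reproduced by evaluating Eq. (4)–(7) exactly. The sketch below enumerates the number of erroneous reads and their split over the three wrong bases (uniform with probability \(\varepsilon/3\) each, as in Eq. (5)), counting ties as failures; brute-force enumeration like this is only practical for small \(C_{\varepsilon }\):

```python
from math import comb

def fail_prob(c, eps):
    """P(n0 <= max(n1, n2, n3)) from Eq. (4): majority voting over c
    noisy observations of one base fails (ties count as failure)."""
    total = 0.0
    for e in range(c + 1):                      # e = number of read errors
        p_e = comb(c, e) * (1 - eps) ** (c - e) * eps ** e
        n0 = c - e                              # correct observations
        cond = 0.0                              # P(fail | e errors)
        for n1 in range(e + 1):
            for n2 in range(e - n1 + 1):
                n3 = e - n1 - n2
                if max(n1, n2, n3) >= n0:
                    cond += comb(e, n1) * comb(e - n1, n2) / 3 ** e
        total += p_e * cond
    return total

def min_coverage(eps, p_target):
    """Smallest coverage depth C with failure probability <= p_target."""
    c = 1
    while fail_prob(c, eps) > p_target:
        c += 1
    return c

print(min_coverage(0.07, 1e-4))  # -> 8, matching the paper's example
```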


To investigate the probability of producing a contig with length less than 90 percent of the genome, we need the number of reads (N), the length of reads (L), and the length of the genome (G); for a given genome length and read length, Lander–Waterman's bound gives the number of reads. We assume that the starting points of the reads follow a Poisson distribution with rate \(\lambda =\frac{N}{G}\), so the starting point of each read relative to the previous read is exponentially distributed, and the intervals between the starts of consecutive reads are independent of each other.

In the de novo sequencing algorithm, we need \(\mathcal{l}\) common bases between two successive reads. So the probability that a read starts in the interval \((L-\mathcal{l},L]\) is the probability of failure, and the probability that it starts in \([1, L-\mathcal{l}]\) is the probability of success.

The probability of a read starting in \((L-\mathcal{l},L]\) is equal to:

$$ P\left( {x > L - \ell } \right) = e^{{ - \frac{N}{G}\left( {L - \ell } \right)}} $$
(8)

The probability of a read starting in \(\left[ {1, L - \ell } \right]\) is equal to:

$$ P\left( {x \le L - \ell } \right) = 1 - e^{{ - \frac{N}{G}\left( {L - \ell } \right)}} $$
(9)

So the probability of success for all reads is equal to:

$$ P_{{{\text{success}}}} = \left( {1 - e^{{ - \frac{N}{G}\left( {L - \ell } \right)}} } \right)^{N} $$
(10)

And the probability of failure is equal to:

$$ P_{{{\text{fail}}}} = 1 - P_{{{\text{success}}}} = 1 - \left( {1 - e^{{ - \frac{N}{G}\left( {L - \ell } \right)}} } \right)^{N} $$
(11)

Equation (11) cannot be solved in closed form for the number of reads that achieves a given failure probability in terms of \(L\), \(\mathcal{l}\), and \(G\). Therefore, to extract the number of reads needed to reach a target failure probability, we plot Equation (11) as a function of the number of reads in [1, 5000] for \(G=20000\), \(\mathcal{l}=14\), and read lengths \(L=[100, 300,\dots ,1500]\), as shown in Fig. 3. We extract the value of \(N\) from the curve for \({P}_{fail}=0.1\) and \({P}_{fail}=0.01\) and compare it with Lander–Waterman's bound (Fig. 4).
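The extraction of \(N\) from Equation (11) can also be done numerically. A sketch, with the exponential tail carrying its minus sign and the paper's values \(G=20000\), \(\mathcal{l}=14\):

```python
from math import exp

def p_fail(N, G, L, l):
    """Eq. (11): probability that at least one of the N reads starts
    more than L - l bases after its predecessor (no l-base overlap)."""
    return 1.0 - (1.0 - exp(-(N / G) * (L - l))) ** N

def min_reads(G, L, l, target):
    """Smallest N in [1, 5000] with p_fail <= target, mirroring the
    curve-reading procedure of Fig. 3."""
    for N in range(1, 5001):
        if p_fail(N, G, L, l) <= target:
            return N
    return None

print(min_reads(20000, 500, 14, 0.1))   # a few hundred reads for L = 500
```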

Fig. 3

Failure probability of the algorithm according to the number reads

Fig. 4

Max number of reads base on length of reads

In Fig. 3, we show the failure probability of the algorithm based on the number of reads for read lengths \(L=[100, 300,\dots , 1500]\). As we see in this figure, for \(L=100\) there are too many errors to make a contig, and therefore we obtain only short contigs.

2.4 i.i.d genome with repeat region

In this section, we study the i.i.d. genome with a repeat region. Our goal is to assemble reads of an i.i.d. genome that contains a repeat region. The challenge of this assembly is that the repeat region can cause mistakes. The solution we offer is to use long reads and a fragment length longer than the repeat region (Fig. 5). In this way, by using the bases around the repeat region, we can locate reads exactly in de novo sequencing.
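Why the fragment length must exceed the repeat length can be illustrated with a toy experiment (hypothetical, not from the paper): plant one 50-base repeat twice in an otherwise random genome and count windows that occur at more than one position. Windows of length at most 50 can lie wholly inside the repeat and are therefore ambiguous; windows of length 51 necessarily include a flanking base, which makes them unique:

```python
import random

def ambiguous_windows(genome, w):
    """Number of length-w windows that occur at more than one position."""
    positions = {}
    for s in range(len(genome) - w + 1):
        positions.setdefault(genome[s:s + w], []).append(s)
    return sum(1 for p in positions.values() if len(p) > 1)

random.seed(0)
rnd = lambda n: "".join(random.choice("ACGT") for _ in range(n))
repeat = rnd(50)
# flanks chosen so the bases adjacent to the two copies differ
genome = rnd(99) + "A" + repeat + "C" + rnd(98) + "G" + repeat + "T" + rnd(99)

print(ambiguous_windows(genome, 50))  # > 0: the repeat itself is ambiguous
print(ambiguous_windows(genome, 51))  # 0: one flanking base resolves it
```

This is exactly the bridging idea of Fig. 5: any fragment longer than the repeat carries at least one unique flanking base.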

Fig. 5
figure 5

Bridging over repetitive areas

2.5 Real genome

Real genomes have repeat regions of various lengths. In this section, we generalize the approach of Sect. 2.2, using it with long reads only, to real genomes. Lander–Waterman's coverage bound is stated for the i.i.d. genome: based on it, the minimum number of reads to cover the genome is \(N\approx \frac{G\mathrm{ln}G}{L}\); in other words, we need \(\mathrm{NL}=G\mathrm{ln}G\) bases to cover the genome. Generalizing the method of [37] for de novo sequencing, we first sequence the initial bases of all reads, and then the processor connects all the fragments that can be connected uniquely to make contigs. A read that does not yet overlap the read on its right is sent back for extension by one base. This loop repeats until the maximum contig length is reached; in this way the number of bases at the sequencing level decreases, and the maximum contig length equals the real genome length.

3 Result and discussion

We check the algorithm on an i.i.d. genome with length \(G\) and \(N\) reads with length \(L\), based on the Lander–Waterman bound. Figure 6 shows an example of producing reads on a fragment with length 20,000, where \(L=500\) and the number of reads is \(N=400\) from \(N\approx \frac{G}{L}\mathrm{ln}G\approx 400\).

Fig. 6

The location of reading extracted from the i.i.d genome

Fig. 7

Effect of number of reads on number of bases

Fig. 8

Effect of the number of reads on execution time

Fig. 9

Effect of number of reads on max contig length

3.1 Effect of number of reads

At first, we run the algorithm on an i.i.d. genome with length 20,000 for \(N=[200, 250,\dots, 600]\), a fixed read length \(L=1700\), and fragment length \(\mathcal{l}=\mathrm{log}\,G \approx 14\), on Windows 8 with MATLAB 2017a and a Core i5-2410 CPU with 4 GB of RAM. All simulations are the result of averaging over 20 runs of the algorithm. It should be noted that much of the computation was done in the supercomputing center of the University of Isfahan, whose systems have 24 cores and 64 GB of RAM.

In Fig. 7, we show the number of sequenced bases based on the number of reads. As we see in Fig. 7, the number of sequenced bases increases with the number of reads, but the rate of growth decreases for \(N \ge 400\).

In Fig. 8, we show the algorithm execution time based on the number of reads. As we see in Fig. 8, execution time increases linearly with the number of reads.

In Fig. 9, we show the maximum contig length based on the number of reads. As we see in Fig. 9, the maximum contig length at \(N=350\), \(N=400\), and \(N=450\) is almost equal to the i.i.d. genome length (20,000). In summary, by running the algorithm on reads with length \(L=[300, 700,\dots ,1700]\), we obtain a maximum contig length equal to the genome length.

3.2 Effect of read length (L)

To examine the effect of read length, we run the algorithm on an i.i.d. genome with length 20,000 for different read lengths \(L=[300, 500,\dots ,1700]\), with the fragment length obtained from \(\mathcal{l}=\mathrm{log}G\) [37]:

In Fig. 10, we show the number of reads based on read length. Our goal in Fig. 10 is for the maximum contig length to reach 90 percent of the genome length. The figure shows that the algorithm cannot make such a contig for \(L\le 200\), so we report the required number of reads as a function of read length.

Fig. 10

Number of reads base on reads length

3.3 Effect of fragment length (\(\mathcal{l}\))

We run the algorithm on an i.i.d. genome with length 20,000 for read length \(L=500\), number of reads \(N=\frac{G}{L}\mathrm{ln}G\), and fragment lengths \(\mathcal{l}=\{5, 6, 7, 8,\dots ,16\}\). All simulations are the result of averaging over 20 runs of the algorithm.

In Fig. 11, we show the number of sequenced bases based on fragment length. As we see in Fig. 11, the number of sequenced bases increases with fragment length, but the rate of increase decreases for larger fragment lengths.

Fig. 11

Effect of fragment length on number of sequenced bases

Fig. 12

Effect of fragment length on execution time

Fig. 13

Effect of fragment length on max contig length

Fig. 14

Effect of fragment length on the probability of de novo sequencing

In Fig. 12, we show execution time based on fragment length. As we see in the figure, execution time increases with fragment length.

In Fig. 13, we show the maximum contig length based on fragment length. As we see in the figure, the maximum contig length is equal to the genome length for \(\mathcal{l}\ge 9\).

In Fig. 14, we show the probability of de novo sequencing based on fragment length. As we see in Fig. 14, the probability of de novo sequencing is equal to one for \(\mathcal{l}\ge 8\).

3.4 Result of simulation on i.i.d genome with repeat region

We run the algorithm on the i.i.d. genome with a repeat region (\(G=20000\)); the connection of fragments along the reads is shown with dashed lines. In Fig. 15, we show the result of running the algorithm on a fragment with a length of 20,000 that has a repeat region with length \({L}_{R}=100\). This genome has 2 × 50 = 100 repeat regions with a length of 50 bases: the number of distinct repeats is \({N}_{r}=50\), and each repeat appears twice in the genome. In this genome, we should use long reads for correct connection of the reads, and the fragment length should be longer than the repeat length, \(\mathcal{l}\) = log \(G+{L}_{r}\approx 64\), so that the repeat region does not affect assembly. All simulations are the result of averaging over 20 runs of the algorithm.

Fig. 15

The location of reading extracted from the i.i.d genome with repeat region

Fig. 16

Effect of number of reads on number of bases

Fig. 17

Effect of number of reads on execution time

3.5 Effect of number of reads

At first, we run the algorithm on an i.i.d. genome with a repeat region, with length 20,000, for \(N=[200, 250,\dots , 600]\), a fixed read length \(L=1700\), and fragment length \(\mathcal{l}=\) log \(G+{L}_{r}\approx 64\). All simulations are the result of averaging over 20 runs of the algorithm.

In Fig. 16, we show the number of sequenced bases based on the number of reads. As we see in the figure, the number of sequenced bases increases linearly with the number of reads.

In Fig. 17, we show the algorithm execution time based on the number of reads. As we see in the figure, execution time increases linearly with the number of reads.

In Fig. 18, we show the maximum contig length based on the number of reads. As we see in the figure, the maximum contig length at \(N=400\) and \(N=450\) is almost equal to the length of the i.i.d. genome with a repeat region (G = 20,000). In summary, by running the algorithm on reads with length \(L=[300, 700,\dots ,1700]\), we obtain a maximum contig length equal to the genome length.

Fig. 18

Effect of number of reads on max contig length

3.6 Effect of fragment length (\(\mathcal{l}\))

We run the algorithm on an i.i.d. genome with a repeat region (G = 20,000) for read length \(L=500\), number of reads \(N=\frac{G}{L}\mathrm{ln}G\), and fragment lengths \(\mathcal{l}=\{6, 7,\dots , 65\}\). All simulations are the result of averaging over 20 runs of the algorithm.

In Fig. 19, we show the number of sequenced bases based on fragment length. As we see in Fig. 19, the number of sequenced bases increases with fragment length.

Fig. 19

Effect of fragment length on number of sequenced bases

In Fig. 20, we show execution time based on fragment length. As we see in Fig. 20, execution time increases with fragment length.

Fig. 20

Effect of fragment length on execution time

In Fig. 21, we show the maximum contig length based on fragment length. As we see in Fig. 21, the maximum contig length is equal to the genome length for \(\mathcal{l}\ge 50\).

Fig. 21

Effect of fragment length on max contig length

In Fig. 22, we show the probability of de novo sequencing based on fragment length. As we see in the figure, the probability of de novo sequencing is equal to one for \(\mathcal{l}\ge 50\).

Fig. 22

Effect of fragment length on probability of de novo sequencing

3.7 Result of simulation on human genome

We run the algorithm in parallel on a human genome segment with a length of 50,000, extracted from human chromosome 19. The server has full information about the genome, the lengths of its repeat regions, and their places in the genome. All this information is taken from reference [38], whose web site provides complete information about the genome and the length, location, and type of its duplicate regions.

3.8 Effect of read length

At first, we run the algorithm in parallel on the human genome segment with length 50,000 for different read lengths, with the number of reads \(N=\mathrm{round}(G\,\mathrm{log}(G)/{L}_{\mathrm{max}})\) and fragment length \(\mathcal{l}=\) log \(G+{L}_{r}\approx 64\). All simulations are the result of averaging over 20 runs of the algorithm.

In Fig. 23, we show the number of sequenced bases based on read length. As we see in Fig. 23, the number of sequenced bases decreases linearly as read length increases.

Fig. 23

Effect of read length on number of bases

In Fig. 24, we show the algorithm execution time based on read length. As we see in the figure, execution time decreases linearly as read length increases.

Fig. 24

Effect of read length on execution time

In Fig. 25, we show the maximum contig length based on read length. As we see in the figure, the maximum contig length at \(L=500\) is almost equal to the human genome segment length (G = 50,000). In summary, by running the algorithm in parallel on reads with length \(L=[500, 1000,\dots ,5000]\), we obtain a maximum contig length equal to the genome length even for shorter reads, while execution time and the number of sequenced bases decrease as read length increases.

Fig. 25

Effect of read length on max contig length

3.9 Effect of fragment length (\(\mathcal{l}\))

We run the algorithm in parallel on the human genome segment with length 50,000 for read length \(L=1000\), number of reads \(N=\frac{G}{L}\mathrm{ln}G\), and fragment lengths \(\mathcal{l}=\{15, \dots , 90\}\). All simulations are the result of averaging over 20 runs of the algorithm.

In Fig. 26, we show the number of sequenced bases based on fragment length. As we see in Fig. 26, the number of sequenced bases increases with fragment length.

Fig. 26

Effect of fragment length on number of sequenced bases

In Fig. 27, we show execution time based on fragment length. As we see in the figure, execution time does not increase with fragment length.

Fig. 27

Effect of fragment length on execution time

In Fig. 28, we show the maximum contig length based on fragment length. As we see in Fig. 28, the maximum contig length is equal to the genome length for \(\mathcal{l}>14\).

Fig. 28

Effect of fragment length on max contig length

In Fig. 29, we show the probability of de novo sequencing based on fragment length. As we see in Fig. 29, the probability of de novo sequencing is equal to one for \(\mathcal{l}\ge 10\).

Fig. 29

Effect of fragment length on the probability of de novo sequencing

4 Conclusion

The amount of information generated by sequencing machines is increasing at a great rate, and we need a strong processing machine to extract information from the data. The sequencing machine produces a lot of additional data, and it takes a lot of money and time to extract the useful information and delete the additional information at the processing level.

Based on Lander–Waterman's coverage bound, a sequencing machine needs, for a genome with length G and reads with length L, at least N reads; this is the Lander–Waterman coverage lower bound. In other words, we need \(N\approx \frac{G}{L}\) ln \(G\) reads to cover the genome with probability 1-\(\varepsilon \). The total number of bases sequenced by the machine is \(NL\), which is equal to \(G\) ln \(G\). These results assume that the sequencing machine is independent of the processing machine: the sequencing machine sequences the total number of bases required to cover the genome, and the reads are then sent to the processing machine.

In this paper, we generalize the presented method for de novo sequencing. At first, we use this method for an i.i.d. genome with length \(\mathrm{G}\) and N reads with length L extracted from it. The results of running this algorithm on this genome show a decrease in running time and in the number of sequenced bases as read length increases.

The maximum contig length for reads with \(L\ge 500\) is usually equal to the genome length. By comparing the sequencing failure probability of the algorithm with Lander–Waterman's bound at p = 0.01, we find that Lander–Waterman's bound is not very accurate for \(L<500\), while the other lengths give equal results.

Then, we run the algorithm for an i.i.d. genome with repeat regions. This genome has \(100=50\times 2\) repeat regions with a length of 50 bases. For genome assembly, we set the fragment length \(\mathcal{l}={L}_{R}+\) log \(G\) to remove the ambiguity of the repeat regions. The results of running this algorithm on the i.i.d. genome with repeats show a decrease in running time and in the number of sequenced bases as read length increases.

Finally, we run the de novo sequencing algorithm in parallel on a real human genome segment with a length of 50,000. As read length increases, running time and the number of sequenced bases are reduced. An increase in fragment length increases the accuracy of genome assembly.