Nowadays, Solid State Drives consume an enormous amount of NAND Flash memory [1], putting relentless pressure on increasing the number of stored bits per mm². Planar memory cells (Fig. 1) have been scaled for decades by improving process technology, circuit design, programming algorithms [2], and lithography.

Fig. 1. NAND string

Unfortunately, when approaching a minimum feature size of 1x-nm, more challenges pop up: the doping concentration in the channel region becomes difficult to control [3], while Random Telegraph Noise (RTN) [4] and electron injection statistics [5] widen the threshold voltage distributions, thus causing a significant hit to both endurance and retention. Furthermore, as the distance between memory cells shrinks, the electric field between adjacent wordlines becomes higher, pushing the bit error rate to an even higher level.

3D arrays are a breakthrough that can fuel a further increase of the bit density. Identifying the right way to go 3D, though, was not so easy.

Historically, Flash memory manufacturers have leveraged lithography to shrink the 2-dimensional (2D) memory cell [6].

However, with 3D architectures, the “simple” reduction of the minimum feature size is running out of steam [7]: a higher number of stacked cells is the only hope for dramatically reducing the real estate of a stored bit.

3D arrays can leverage either Floating Gate (FG) or Charge Trapping (CT) technologies [8]. As a matter of fact, the vast majority of 3D architectures published to date are built with CT cells, mainly because of the simpler fabrication process. Nevertheless, Floating Gate is still around and there are commercial products that have managed to integrate FG into a 3D array.

1 3D Charge Trap NAND Flash Memories

3D arrays can be efficiently built by rotating the planar NAND Flash string into the vertical direction, as displayed in Fig. 2. The solution of choice is a conduction channel completely surrounded by the gate [9]: indeed, the curvature effect helps increase the electric field Et across the tunnel oxide and reduce the electric field Eb across the blocking oxide [10, 11], with a positive impact on oxide reliability and overall power consumption.
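As a rough illustration of the curvature effect, the sketch below evaluates the field of an ideal coaxial capacitor at its inner surface (tunnel oxide side) and at its outer surface (blocking oxide side). It assumes a single uniform dielectric with no stored charge, which is a simplification of the real oxide/nitride/oxide stack; the radii and the voltage are hypothetical values chosen only for illustration.

```python
import math

def radial_field(v_gate, r_inner, r_outer, r):
    """Field magnitude (V/nm) at radius r inside an ideal coaxial capacitor.

    Assumption: one uniform dielectric between the channel surface
    (r_inner) and the gate (r_outer), and no trapped charge.
    """
    return v_gate / (r * math.log(r_outer / r_inner))

# Hypothetical dimensions, for illustration only.
R_CHANNEL = 20.0  # nm, pillar surface (tunnel oxide side)
R_GATE = 40.0     # nm, gate surface (blocking oxide side)
V_GATE = 20.0     # V, gate-to-channel voltage

e_tunnel = radial_field(V_GATE, R_CHANNEL, R_GATE, R_CHANNEL)
e_block = radial_field(V_GATE, R_CHANNEL, R_GATE, R_GATE)
e_planar = V_GATE / (R_GATE - R_CHANNEL)  # flat stack of the same thickness

print(f"Et = {e_tunnel:.2f} V/nm, Eb = {e_block:.2f} V/nm, planar = {e_planar:.2f} V/nm")
```

With these numbers the field at the inner surface is higher than in the planar case, while the field at the outer surface is lower, which is the qualitative behavior described above.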

Fig. 2. The NAND flash string goes vertical

Vertical channel arrays have historically been driven by two architectures: BiCS, which stands for Bit Cost Scalable [12, 13], and P-BiCS, an acronym for Pipe-Shaped BiCS [14–16]; both leverage CT cells [17]. Let’s get started with BiCS, which is sketched in Figs. 3 and 4 [13]. There is a stack of Control Gates (CGs), the lowest being the one of the Source Line Selector (SLS). The whole vertical stack is punched through and the resulting holes are filled with poly-silicon; each filled hole (a.k.a. pillar) forms a series of memory cells vertically connected in a NAND fashion. Bit Line Selectors (BLSs) and Bitlines (BLs) are formed at the top of the structure [18].

Fig. 3. BiCS architecture. Adapted with permission from [19]. ©2017 IEEE

Fig. 4. Equivalent circuit of a BiCS array

The poly-silicon body of the memory cells is either undoped or lightly doped [10, 11]; indeed, considering the high aspect ratio of the vertical polysilicon plug, p-n junctions cannot be easily realized by either diffusion or implantation in a trench structure. As usual, a select transistor (BLS) is used to connect each NAND string to a bitline; there is also another select transistor (SLS), which connects the other side of the string to the common source diffusion.

It is important to highlight that the number of critical and expensive lithography steps does not depend on the number of control gate plates, because the whole 3D stack is drilled at once [20, 21].

As sketched in Fig. 5, vertical transistors have a polysilicon body, and this fact turned out to be one of the critical aspects of the 3D foundation. From a manufacturing perspective, the density of traps at the grain boundaries is very difficult to control with such a vertical shape, and this poor control induces significant fluctuations in the characteristics of vertical transistors.

Fig. 5. BiCS memory cells

The recipe for fixing the trap density fluctuation problem is to manufacture a polysilicon body much thinner than the depletion width. In other words, by shrinking the polysilicon volume, the total number of traps goes down (Fig. 6). This particular structure is usually referred to as Macaroni Body [13]. A filler layer (i.e. a dielectric film) is used in the central part of the macaroni structure, essentially because it makes the manufacturing process easier.
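To get a feeling for how much polysilicon the Macaroni Body removes, the following sketch compares the volume of a full plug with that of a hollow shell of the same outer radius. It assumes, as the paragraph above suggests, that the total number of traps scales with the polysilicon volume; the radius, height, and shell thickness are hypothetical.

```python
import math

def poly_volume(radius_nm, height_nm, shell_nm=None):
    """Polysilicon volume (nm^3) of a full plug, or of a hollow shell of
    thickness shell_nm when the core is replaced by a dielectric filler."""
    outer = math.pi * radius_nm ** 2 * height_nm
    if shell_nm is None:
        return outer
    inner_radius = radius_nm - shell_nm
    return outer - math.pi * inner_radius ** 2 * height_nm

# Hypothetical cell geometry, for illustration only.
full = poly_volume(30.0, 50.0)
macaroni = poly_volume(30.0, 50.0, shell_nm=10.0)

# Assumption: trap count is proportional to polysilicon volume.
trap_ratio = macaroni / full
print(f"Macaroni body keeps {trap_ratio:.0%} of the polysilicon volume")
```

In this example a 10 nm shell on a 30 nm radius pillar keeps a bit more than half of the original volume; a thinner shell reduces it further.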

Fig. 6. A vertical transistor (right) modified with Macaroni body (left)

The fabrication sequence of the BiCS array [22] starts from building the layers for control gates and selectors. Then, BLS stripes are defined. After forming pillars, bitlines are laid out by using a metal layer.

Control gate edges are extended to form a ladder that connects to the fan-out region, as sketched in Fig. 7 [12, 13, 22, 23]. Actually, there are two ladders, but one of them can’t be used because it is masked by the metal lines biasing the bitline selectors.

Fig. 7. Fan-out of the BiCS array. Adapted with permission from [19]. ©2017 IEEE

Over time BiCS evolved into P-BiCS, mainly to improve the Source Line resistance [14, 15]. In a nutshell, two vertical NAND strings are shorted together at the bottom of the 3D structure: in this way, they form a single NAND string whose two ends are connected to the bitline and to the Source Line, respectively (Fig. 8). Thanks to its U-shape, P-BiCS has a few advantages over BiCS:

Fig. 8. P-BiCS NAND strings

  • retention is better because manufacturing causes less damage to the tunnel oxide;

  • being at the top, the Source Line can be connected to a metal mesh, thus lowering its parasitic resistance;

  • Source Line and bitline selectors are at the same height of the stack and, therefore, they can be equally optimized and controlled, thus obtaining a better string functionality.

One of the biggest drawbacks of P-BiCS is that, at the same height of the stack, there are two different control gates which, of course, must be biased independently; therefore, the two layers can’t simply be shorted together. As a result, compared to BiCS, a totally different and more complex fan-out is required [16], as displayed in Fig. 9: basically, a fork-shaped gate is adopted, such that each branch acts on two NAND pages.

Fig. 9. Fork-shaped fan-out. Adapted with permission from [19]. ©2017 IEEE

A major advantage is the easier connection of the source line [14] through the “Top Level Source Line” of Fig. 10. This additional metal mesh guarantees a much better noise immunity for circuits.

Fig. 10. P-BiCS: Source line metal mesh. Adapted with permission from [19]. ©2017 IEEE

Besides BiCS and P-BiCS, many other approaches were tried, including VRAT (Vertical Recess Array Transistor) [24], Z-VRAT (Zigzag VRAT) [24], VSAT (Vertical Stacked Array Transistor) [25], and 3D-VG (Vertical Gate) NAND [26], a unique architecture in which the channel runs along the horizontal direction.

TCAT (Terabit Cell Array Transistor) was disclosed in 2009 [27] and it was the foundation for V-NAND (Fig. 11), the first 3D memory device that reached the market. Except for the SL+ regions, which are n+ diffusions, the equivalent circuit of TCAT is the same as that of BiCS (Fig. 4). All SL+ lines are connected together to form the common Source Line. There are 2 metal layers for decoding wordlines and NAND strings, respectively.

Fig. 11. TCAT NAND flash array

TCAT is based on gate-replacement [27], whereas BiCS is gate-first. Gate-replacement begins with the deposition of multiple oxide/nitride layers. After the stack formation, nitride is removed through an etching process. Afterwards, tungsten metal gates are deposited and, finally, gates are separated by using another etching step. Metal gates translate into a lower wordline parasitic resistance, resulting in faster programming and reading operations.

The bulk erase operation is another significant difference compared to BiCS. Because NAND strings are close to n+ areas, during erasing, holes can come straight from the substrate, thus avoiding GIDL (Gate-Induced Drain Leakage) on the source side, which is a well-known problem for BiCS.

BiCS and TCAT are compared in Fig. 12 [28]. Because TCAT is based on a gate-last process, the charge trap layer is biconcave, and thanks to this particular shape it is much harder for charges to spread out. On the contrary, BiCS is characterized by a charge trapping layer going through all gate plates, thus acting as a charge spreading path: of course, the main consequence of this layout is a degradation of data retention.

Fig. 12. BiCS versus TCAT

TCAT evolved into another architecture called V-NAND [29]. The first generation had 24 wordline layers, plus additional dummy wordline layers (dummy CG) [30–32].

Why dummy layers? Mainly because of the floating body of memory cells with a vertical channel. In fact, during programming operations, hot carriers are generated by the high lateral electric field located at the edge of the NAND string. These hot carriers keep the channel voltage low during the programming operation of the first wordline, causing Program Disturb [33, 34]. Dummy wordlines placed before the first WL are an effective and simple solution to this problem [35, 36].

A 128 Gb TLC (3 bit/cell) device manufactured by using V-NAND Gen2 was published in 2015 [37, 38]. Gen2 had 32 memory layers instead of the previous 24 and introduced the concept of Single-Sequence Programming. Conventional (mainly 2D) TLC programming techniques go through the programming sequence multiple times. To be more specific, each wordline is programmed 3 times, such that VTH distributions can be progressively tightened. Because of the smaller cell-to-cell interference (compared to FG), CT cells exhibit an intrinsically narrower native VTH distribution. As a result, V-NAND Gen2 could write 3 pages of logic data in a single programming sequence. There are 2 benefits to this approach: reduced power consumption and faster programming.
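The bookkeeping behind this comparison can be sketched in a few lines. The example below only counts VTH levels and wordline programming operations for one NAND string; it does not model pulse counts, verify steps, or timing, and the function names are hypothetical.

```python
def vth_levels(bits_per_cell):
    """Number of distinct VTH distributions needed to store bits_per_cell bits."""
    return 2 ** bits_per_cell

def program_operations(wordlines, passes_per_wordline):
    """Wordline programming operations needed to fill one NAND string."""
    return wordlines * passes_per_wordline

WORDLINES = 32  # memory layers of V-NAND Gen2, as stated in the text

multi_pass = program_operations(WORDLINES, passes_per_wordline=3)
single_sequence = program_operations(WORDLINES, passes_per_wordline=1)

print(f"TLC needs {vth_levels(3)} VTH levels")
print(f"multi-pass: {multi_pass} operations, single-sequence: {single_sequence}")
```

Under this simple count, single-sequence programming visits each wordline once instead of three times, which is where the power and speed benefits come from.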

V-NAND Gen3 appeared in 2016 [39], in the form of a 48 layer TLC device. With such a high number of gate layers, the very high aspect ratio of the pillar becomes a serious challenge for the etching technology. To mitigate this problem, the easiest solution is to shrink the thickness of the gate layers. The downside of this approach is that the parasitic RC of the wordline gets higher, thus slowing access operations to the memory array. Moreover, fluctuations of the channel size become critical. Indeed, pillars are holes drilled in the gate layer and they represent a barrier for charges flowing along the wordline: in essence, a distribution of hole diameters generates a distribution of the parasitic resistances of gate layers. In addition, pillars, once manufactured, have the conical shape sketched in Fig. 13. The overall result is that the same voltage applied to different gate layers translates into a different waveform on each layer. An adaptive program pulse scheme can fix the problem: in a nutshell, the program pulse duration has to be tailored to the characteristics of the wordline layer.

As the number of layers increases, the pillar becomes longer, with a negative impact on its aspect ratio. To compensate for that, V-NAND Gen4 [40], which is built on a stack of 64 layers, had to shrink both the layer thickness and the distance between layers (spacing). The downside is an increased wordline parasitic capacitance, which adversely affects cell reliability and timings. Improved circuits and programming algorithms can be used to tackle this problem [40].
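The RC penalty of thinner and more closely spaced gate layers can be estimated with a lumped, parallel-plate model. This is only a first-order sketch: it ignores the holes drilled in the plate, fringing fields, and the distributed nature of the wordline. The geometry is hypothetical, and the tungsten resistivity is an approximate bulk value (thin films are more resistive).

```python
EPS0 = 8.854e-12  # vacuum permittivity, F/m

def wordline_rc(resistivity, length, width, thickness, spacing, rel_permittivity=3.9):
    """Lumped RC (seconds) of one gate plate, parallel-plate approximation.

    R grows as the plate gets thinner; C towards the two neighboring
    plates grows as the dielectric spacing shrinks.
    """
    resistance = resistivity * length / (width * thickness)
    capacitance = 2 * EPS0 * rel_permittivity * length * width / spacing
    return resistance * capacitance

# Hypothetical geometry in SI units, for illustration only.
RHO_W = 5.6e-8  # ohm*m, approximate bulk tungsten
thick_stack = wordline_rc(RHO_W, 1e-3, 1e-6, thickness=40e-9, spacing=40e-9)
thin_stack = wordline_rc(RHO_W, 1e-3, 1e-6, thickness=30e-9, spacing=30e-9)

rc_penalty = thin_stack / thick_stack
print(f"RC grows by a factor of {rc_penalty:.2f}")
```

In this model, shrinking both thickness and spacing from 40 nm to 30 nm raises the RC product by (40/30)², i.e. by about 1.8 times, because resistance and capacitance each increase by the same factor.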

Fig. 13. Ideal versus actual shape of pillars

As discussed, both BiCS [41] and V-NAND use CT cells, but Floating Gate still exists, as explained in the next Section.

2 3D Floating Gate NAND Flash Memories

2D NAND Flash memories use FG cells which have been improved and optimized for decades. Of course, there have been many attempts to reuse this know-how in 3D.

The first 3D attempt is known as 3D Conventional FG (C-FG) or S-SGT (Stacked-Surrounding Gate Transistor) [42–44], and it is sketched in Fig. 14.

Fig. 14. 3D C-FG cell. Adapted with permission from [19]. ©2017 IEEE

A C-FG NAND string is shown in Fig. 15, including select transistors. Please note that both string selectors are manufactured as standard transistors, i.e. they have no floating gate. Figure 16 shows a C-FG array, including the fan-out region. While all wordlines at the same height of the stack are connected together, BLS lines can’t be, because they need to select a single page within each CG layer. On the contrary, SLS transistors can be shorted together, thus saving both power and silicon area.

Fig. 15. C-FG NAND flash string. Adapted with permission from [19]. ©2017 IEEE

Because we are talking about FG cells, FG coupling between neighboring cells is the main hurdle for vertical scaling. With enhancement-mode operations, the high resistance of source/drain (S/D) regions should also be carefully considered. In fact, these regions need high doping, and this is not very easy to accomplish when the conduction channel is made of polysilicon. The solution to this problem is to electrically invert the S/D layer by using higher voltages during read. Unfortunately, this simple solution is hard to apply to C-FG cells because of the thin FG.

The Extended Sidewall Control Gate (ESCG) structure, Fig. 17 [45], is another FG option and it was developed to contain the interference effect. Moreover, by applying a positive voltage to the ESCG structure, the density of electrons on the surface of the pillar can be much higher than in C-FG (even by one order of magnitude): a highly inverted electrical source/drain can significantly lower the S/D resistance.

Fig. 16. C-FG NAND flash array with fan-out

Fig. 17. ESCG NAND flash cell

In addition, the ESCG shielding structure reduces the FG–FG coupling capacitance: the ESCG region is biased as CG, and the CG coupling capacitance (CCG) is significantly increased because of the increased overlap area between CG and FG. A higher CG coupling ratio is one of the key ingredients for achieving effective NAND Flash operations [46].
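The role of the CG coupling ratio can be made concrete with the usual capacitive-divider picture of a floating gate. The sketch below assumes a neutral FG with all other terminals grounded, so that the FG potential is simply the coupling ratio times the CG voltage; the capacitance values are hypothetical and only meant to show the trend when the CG/FG overlap area grows.

```python
def coupling_ratio(c_cg, c_others):
    """CG coupling ratio: CG-to-FG capacitance over total FG capacitance."""
    return c_cg / (c_cg + sum(c_others))

def fg_voltage(v_cg, ratio):
    """FG potential for a neutral floating gate, other terminals grounded."""
    return v_cg * ratio

# Hypothetical capacitances in aF, for illustration only.
C_TUNNEL = 4.0
ratio_small_overlap = coupling_ratio(6.0, [C_TUNNEL])
ratio_large_overlap = coupling_ratio(12.0, [C_TUNNEL])

V_CG = 20.0  # V, hypothetical programming voltage
print(f"small overlap: ratio {ratio_small_overlap:.2f}, VFG {fg_voltage(V_CG, ratio_small_overlap):.1f} V")
print(f"large overlap: ratio {ratio_large_overlap:.2f}, VFG {fg_voltage(V_CG, ratio_large_overlap):.1f} V")
```

A larger CCG pushes the ratio closer to 1, so a larger share of the applied CG voltage reaches the floating gate and drops across the tunnel oxide.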

Another FG cell is DC-SF (Dual Control-Gate with Surrounding Floating Gate, Fig. 18) [47]. This time FG is controlled by two CGs. The impact on the FG/CG coupling ratio is remarkable, thanks to the enlargement of the FG/CG overlap area. Another positive aspect is the reduction of the voltages required for programming and erasing. DC-SF eliminates the FG-FG interference because the CG between two adjacent FGs plays the role of an electrostatic shield [48].

Fig. 18. DC-SF NAND flash cell. Adapted with permission from [19]. ©2017 IEEE

FG is fully isolated by IPD (Inter Poly Dielectric) and capacitively coupled to the upper and lower control gates, CGU and CGL, respectively. The tunnel oxide is located between the channel CH and FG, while IPD is on the sidewall of the CG. In this way, free charges cannot tunnel to the control gates.

BiCS and DC-SF NAND strings are sketched in Fig. 19. In BiCS the nitride layer, going across all gates, makes the cell prone to data retention issues [49]. On the contrary, the surrounding FG is totally isolated: it is much easier for DC-SF to retain electrons [50, 51]. Of course, the downside of DC-SF is that there are two gate layers instead of one, coupled with much more complex biasing schemes [52, 53].

Fig. 19. BiCS versus DC-SF

The Separated Sidewall Control Gate (S-SCG) Flash cell [54] displayed in Fig. 20 is another 3D FG option developed around the sidewall concept.

Fig. 20. S-SCG NAND flash cell. Adapted with permission from [19]. ©2017 IEEE

One of the major drawbacks of this cell is the “direct” disturb to the neighboring passing cells, caused by the high SCG/FG coupling capacitance. We define it as “direct” because the sidewall CG is shared between adjacent cells: as a matter of fact, biasing SCG means biasing both FGs.

To minimize the decoding complexity, all SCGs belonging to one block adopt a common SCG scheme; besides their electrostatic shield functionality, sidewall gates can help all memory operations [55]. For instance, the common SCG is biased at 1 V during read operations, thus electrically inverting the channel (same as ESCG). Compared to ESCG, the electrical inversion happens simultaneously on source and drain, exactly because of the sidewall gates. Same thing happens during programming: the common SCG is biased at a medium voltage to improve the channel boosting efficiency.

Besides the direct disturb, another problem of Sidewall Gates is the limitation of vertical scaling to around 30 nm; indeed, the thicknesses of SCG and IPD can’t be scaled too much, otherwise they would break down when voltages are applied.

Let’s now take a look at examples of 3D FG NAND memory arrays of hundreds of Gb. The first 3D FG device was published in 2015 [56], in the form of a 384 Gb TLC NAND based on C-FG. This memory device was built with a stack of 32 (+ dummy) memory layers.

A 768 Gb 3D FG NAND became public in the following year [57]. What is unique in this case is the fact that the area underneath the array was used for circuitry. More details about this approach are provided in Sect. 3.

3 Key Challenges for 3D Flash Development

In this Section we cover some of the key challenges that technologists and designers are facing to push 3D memories even further.

3.1 Number of Layers

To reduce the bit size, the number of stacked cells needs to go up, but this raises several problems that are hard to solve [6].

The pillar’s Aspect Ratio (AR) is definitely the first challenge to overcome; in a stack of 32 cells, AR can already be as high as 30. In this context, hole etching and gate patterning are extremely difficult, but of paramount importance.

A possible solution to this problem is to divide the stacking process into several steps to reduce the corresponding AR. For example, a NAND string made of 128 cells can be divided into 2 groups of 64 cells each, as shown in Fig. 21. The downside of this solution is the cost of the stacking process (in this example, 4 times higher than the cost of the plain solution).
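The effect of a multi-deck process on the aspect ratio is simple arithmetic, sketched below. The layer pitch and hole diameter are hypothetical values, picked only so that a 32-cell stack gives the AR of about 30 mentioned above; selector and dummy layers are ignored.

```python
def aspect_ratio(layers, layer_pitch_nm, hole_diameter_nm):
    """Etch depth divided by hole diameter for a stack etched in one step."""
    return layers * layer_pitch_nm / hole_diameter_nm

def deck_aspect_ratio(layers, decks, layer_pitch_nm, hole_diameter_nm):
    """Aspect ratio of each etch step when the stack is split into equal decks."""
    return aspect_ratio(layers / decks, layer_pitch_nm, hole_diameter_nm)

# Hypothetical dimensions, for illustration only.
PITCH_NM = 75.0  # one gate layer plus one inter-layer dielectric
HOLE_NM = 80.0

ar_32 = aspect_ratio(32, PITCH_NM, HOLE_NM)
ar_128_single = aspect_ratio(128, PITCH_NM, HOLE_NM)
ar_128_two_decks = deck_aspect_ratio(128, 2, PITCH_NM, HOLE_NM)

print(f"32 cells: AR {ar_32:.0f}")
print(f"128 cells, single etch: AR {ar_128_single:.0f}")
print(f"128 cells, 2 decks of 64: AR {ar_128_two_decks:.0f} per deck")
```

Splitting the 128-cell string into two decks halves the aspect ratio that each etching step has to handle, at the price of repeating the stacking process.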

Fig. 21. Multi-stacked or multi-deck process [6]

The second problem is the small cell current [58]. With 2D sensing schemes, a 200 nA/cell saturation current is considered the right value because it gives a reasonable sensing margin. Unfortunately, already with a stack of 24 layers, the cell current is just ~20% of that of an FG cell, and it becomes lower and lower as the number of cells in the vertical stack increases. There are a couple of possible paths to solve this problem: sensing schemes with higher sensitivity, and the introduction of new materials enabling a higher carrier mobility in the poly-Si channel (i.e. a higher current) [59–62].
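To see why a smaller cell current matters for sensing, consider a simplified scheme in which the cell current discharges the bitline capacitance by a fixed voltage step. The sketch below takes the 200 nA figure as the 2D reference and 20% of it for the 3D cell; the bitline capacitance and the voltage step are hypothetical, and real sense amplifiers are more elaborate than this constant-current model.

```python
def sensing_time(c_bitline, delta_v, i_cell):
    """Time (s) for a constant cell current to move the bitline by delta_v."""
    return c_bitline * delta_v / i_cell

I_2D = 200e-9      # A, reference saturation current from the text
I_3D = 0.2 * I_2D  # A, ~20% of the reference for a 24-layer stack

# Hypothetical sensing parameters, for illustration only.
C_BL = 2e-12   # F, bitline capacitance
DELTA_V = 0.1  # V, voltage swing needed by the sense amplifier

t_2d = sensing_time(C_BL, DELTA_V, I_2D)
t_3d = sensing_time(C_BL, DELTA_V, I_3D)

print(f"2D: {t_2d * 1e6:.1f} us, 3D: {t_3d * 1e6:.1f} us")
```

With everything else unchanged, a current five times smaller takes five times longer to develop the same bitline swing, which is why either a more sensitive sensing scheme or a higher channel current is needed.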

All the above-mentioned problems could be fixed if entire NAND strings were stacked one on top of the other. In this case, either bitlines or source lines are fabricated between NAND strings. This special architecture can reduce the aspect ratio and increase the sensing current at the same time.

3.2 Peripheral Circuits Under Memory Arrays

In the first 3D generations [63, 64], peripheral circuits (charge pumps, logic, etc.) and core circuits (like Page Buffers and Row decoders) were located outside the memory matrix, like in a conventional 2D chip floorplan, as sketched in Fig. 22. However, 3D memory cells are vertically stacked: in other words, memory transistors are not formed on the Si substrate; on the contrary, they are built around a deposited poly-Si (vertical pillar). Therefore, 3D architectures allow placing some circuits directly on the Si substrate under the memory array. Of course, this solution offers a significant reduction of the chip size.
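A toy floorplan model gives an idea of the area that can be saved. All the areas and the fraction of circuitry that fits under the array are hypothetical; the model also ignores routing overhead and pads.

```python
def die_area(array_mm2, circuits_mm2, fraction_under_array=0.0):
    """Die area when a fraction of the circuitry is placed under the array.

    The area moved under the array cannot exceed the array footprint.
    """
    under = min(circuits_mm2 * fraction_under_array, array_mm2)
    return array_mm2 + circuits_mm2 - under

# Hypothetical floorplan, for illustration only.
ARRAY_MM2 = 60.0
CIRCUITS_MM2 = 30.0

conventional = die_area(ARRAY_MM2, CIRCUITS_MM2)
with_cua = die_area(ARRAY_MM2, CIRCUITS_MM2, fraction_under_array=0.8)

print(f"conventional: {conventional:.0f} mm2, circuits under array: {with_cua:.0f} mm2")
```

In this example, moving 80% of the circuitry under the array shrinks the die from 90 mm² to 66 mm².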

Fig. 22. Conventional 3D NAND flash memory layout

Figure 23 shows a layout of a Flash memory with Circuits Under the Array (CuA) [65, 66].

Fig. 23. 3D NAND flash memory layout with Circuits Under Array

This big area saving doesn’t come for free. The most important challenge is manufacturing low resistance metal layers under the array: this is absolutely critical for a reliable circuit functionality. Usually, metal layers used in 2D NAND flash memories are made of Cu. However, when circuits are under the array, the high temperature processes (i.e. >800 °C) that 3D requires can seriously degrade the resistance of metal layers. Therefore, circuits under the array require 3D “low” temperature fabrication processes.

4 Future Trend for 3D NAND Flash

Figure 24 shows the scaling trend of the cell size, based on published die photographs. The 2D cell size flattened out below 20 nm, while the 3D cell showed a significant reduction going from 24 to 64 layers. This 3D scaling pace will continue by increasing the height of the memory stack and by exploiting technological innovations like Multi-stacked and Stacked NAND strings [67].

Fig. 24. Effective cell size trend. Reproduced with permission from [19]. ©2017 IEEE

3D NAND arrays based on CT vertical channel were selected for volume production because the fabrication process is simpler than that of other 3D architectures. Volume production of 3D NAND Flash started in late 2013 with a 24 layer MLC (2 bit/cell) V-NAND [63, 68]. Year after year, the number of stacked cells grew, as shown in [7, 64, 69], thus reducing the cost per bit and fueling an even more pronounced diffusion of Solid State Drives.

In this chapter we have presented many architectural options for building a 3D NAND array, including some of the latest and greatest layout options, but the 3D evolution is just at the beginning. In fact, two fundamentally different technologies, Floating Gate and Charge Trap, are fighting each other, trying to prove that they can win in the long run, i.e. when scaling is pushed to the limit. Flash manufacturers are already shooting for 200 vertical layers with multi-level capabilities, including 4 bit/cell and 5 bit/cell. No doubt we’ll see a lot of innovations in the near future: engineers and scientists are called to give their best effort to make this vertical evolution happen.