
4.1 Introduction

The term “information” has a very broad scope and is used in almost all areas of human scientific, technical, economic, and societal activity. In particular, the study of information covers communication theory, information theory, information technology, informatics, documentation science, and information systems, which are the result of the interactions between technology and human processes.

Very broadly, the development of human communication and information dissemination has spanned three main periods:

  • Speech and language development era.

  • Development of writing era.

  • Information revolution era.

Speech and language enabled humans to exchange knowledge with others who were physically in the same place. Writing enabled communication between people at different places and times and initiated the development of civilization. The information revolution is still ongoing and expanding with advancements in computer science and engineering, telecommunication systems, knowledge science, and computer networks, the biggest and most powerful of which is the Internet.

The above shows that “information” is indeed one of the “basic pillars” of human life and society, as will be addressed in this book. In the present chapter, we guide the reader through the concept of information, including a listing of the key historical landmarks of its particular manifestations and branches. The core of the chapter is a tour of communication and information transmission theory, namely, communication models, modulation/demodulation, Shannon’s information entropy, source coding/channel coding, and the theorems of information theory. The material in the chapter is not intended to constitute a complete treatment of the subject, but it is sufficient for the purposes of this book, which principally aims at highlighting the impact of the “information pillar” on human life and society (Chap. 11). The fields of documentation science, information science, information technology, and information systems will be discussed in the next chapter.

4.2 What Is Information?

According to the Oxford English Dictionary (1989), the two main contexts in which the term information is used are as follows:

  • The act of molding the mind.

  • The act of communicating knowledge.

The two processes, “molding” and “communicating”, are intrinsically related and, in most cases, occur inseparably (although not in a clearly known way).

Other frequently used interpretations of the term information include the following (Dictionary.com):

  • Knowledge communicated or received concerning a particular fact or circumstance.

  • Knowledge gained through study, communication, research, instruction, experience, etc.

  • The act or fact of informing.

  • An indication of the number of possible choices or messages.

  • Important or useful facts obtained as output from a computer by processing input data with a program.

  • Data at any stage of processing (input, storage, output, transmission, etc.).

  • A numerical measure of the uncertainty of an experimental outcome.

  • The result of applying data processing to data, giving it context and meaning. Information can then be further processed to yield knowledge.

  • Directory assistance.

The concept of information as “knowledge communicated” has a dominant position in modern human life and society and has been fully developed after World War II with the advancement and use of information science, informatics, computer science, computer networks, information theory, information technology, and information systems. From an epistemological viewpoint, the information concept has also found extensive and important use in biology, physics, psychology, etc.

The term information has a Latin origin (informatio, informo) [1, 2] and is composed of the prefix “in” and the word “formatio” or “formo”, which has the meaning of giving form to something. The prefix “in” is used here to strengthen the act of this form giving. There are two primary contexts in which the term informatio or informo has been used: the first is a tangible (corporaliter) context and the second an intangible (incorporaliter) one. The tangible context refers to low-level processes, and the intangible context to high-level processes (moral, pedagogical, spiritual, mental). Capurro [1] has studied the Greek origins of the term informatio and its subsequent evolution. The Greek origin is evident in the works of Cicero (106–43 B.C.) and Augustine (354–430 A.D.) [1, 2].

The Greek term for information is pliroforia (πληροφορία), composed of the two words “pliro” (πλήρης = complete) and “foria” (φέρω = carry/bring). The word pliroforia indicates that the common meaning (sense) of a simple or complex symbol shared between two or more subjects is carried completely. Conceptually, the term pliroforia (information) signifies content that is complete and clear (in whatever form). In computer science, information is reflected in the qualitative value of the “bit = binary digit” (0 or 1) or of the quantum bit (qubit) in quantum computers. The computer processes the data (sequences of 0s and 1s) and provides processed data. The human gives meaning to these processed data and converts them into “information”.

It must be remarked that today the word information is used in almost all scientific fields with various different meanings depending on the subject of each field and the processes studied therein (e.g., decrease in entropy, physical organization, a communication pattern, a form of feedback control, the meaning of a linguistic term, the probability of a signal transmitted over a communication channel, etc.). According to Bogdan [3]: “There seems to be no unique idea of information upon which these various concepts converge and hence no proprietary theory of information.”

The issue of whether the concept of “information” should necessarily include a human knower or an interpretative component, or should exclude mental processes and user-oriented intentions and address only an objective magnitude or property of human beings, has been of continuing concern to scientists and philosophers [4]. Several approaches lying between these two extremes, including calls for a unified theory of information, have been proposed [5].

In [6], Machlup and Mansfield present their view that “information is addressed to human minds and is received by human minds”; all other senses are metaphoric and anthropomorphic. That is, the basic senses of information deal with “telling something or with the something that is being told.” Therefore, according to [6], the term information is not appropriate for use in the context of signal transmission. Overall, information is a human phenomenon that involves individuals transmitting and receiving messages in the framework of their possible decisions and actions.

In [7], Capurro defines information as an anthropological class referring to the phenomenon of human messages with vertical and horizontal structures related to the Greek concept of message (αγγελία: aggelia or μήνυμα: menima) and the philosophic discourse (λόγος: logos). The debate about the unification and naturalization of the term information goes back to Boltzmann, Neumann, Nyquist, Wiener, and Hartley.

In [8], R.V.L. Hartley states: “it is desirable to eliminate the psychological factors involved [in electrical transmission systems] and to establish a measure of information in terms of purely physical quantities” (This is because electrical transmission systems have to do not with humans but with machines).

According to C.S. Peirce (1839–1914) and C.W. Morris (1901–1979), the information transmitted through a communication channel between an emitter and a receiver involves three levels:

  • Syntactic level.

  • Semantic level.

  • Pragmatic level.

At the syntactic level, the information deals with the formal bonds that exist between the various elements that make up the information, the rules of the communication code, the capacity of the channels, and the system design and coding methods for the transmission, processing, and storage of the information.

The semantic level is concerned with the ways of expressing the meaning of the information (e.g., written or nonwritten, cultural, societal, or moral rules or traditions valid in the human group or society concerned).

At the pragmatic level, the information is related to its utility. This level is determined to a large extent by the background of the person(s) who receive the information (political, social, economic, psychological, and moral factors).

The above three levels have a hierarchical structure in which the information can be managed (transmitted, processed, stored) at the syntactic level without the need to necessarily know its conceptual content at the semantic level or its practical utility at the pragmatic level.

Indeed, the word information in Shannon’s “Mathematical Theory of Communication” is used in a sense that must not be confused with meaning [9]. Shannon and Weaver state: “the semantic aspects of communication are irrelevant to the engineering aspects” [10]. Of course, as Capurro and Hjorland note [2], “this does not mean that the engineering aspects are necessarily irrelevant to the semantic aspects.”

The philosophical discussion about the concept of information was originated by Norbert Wiener, who stated that “Information is information, not matter or energy. No materialism which does not admit this can survive at the present day” [2]. In the twentieth century, the notions of information and communication were applied at higher levels of abstraction and not to the communication of human knowledge as expressed in the above Wiener quotation. According to Titze [11], information is not a metaphysical principle but refers to a natural tendency for order and evolution. According to Stonier [12], structural and kinetic information is an intrinsic element of the universe, which is independent of whether or not an intelligent agent exists to perceive it. Information may be manifested as “infons”, which are comparable to “photons” [13]. Stonier distinguishes and separates the syntactic from the semantic features of information, and adopts Teilhard de Chardin’s notion of the emergence of a global brain called the “noosphere” (from the Greek “νους” = noos = mind, and “σφαίρα” = sphere = domain/globe) [14].

4.3 Historical Landmarks

The concept of information is generic and today spans all branches of science, technology, and human society that deal with the generation, acquisition, organization, storage, retrieval, interpretation, transformation, processing, transmission, and utilization of information. Therefore, the history of information includes the evolution of computers, (theoretical) computer science, computation, information theory, information technology, information science, information systems, multimedia, and informatics. Clearly, a detailed exposition of the combined history of all these parallel and overlapping branches, each of which involves the concept of information in one way or another, would require too much space, so here only the main historical landmarks will be given.

On the basis of the principal technology employed for the input, processing, output, and communication processes, the history of information technology and systems can be divided into the following four periods [15,16,17]:

  • Pre-mechanical period (3000 B.C.–1450 A.D.).

  • Mechanical period (1450–1840 A.D.).

  • Electromechanical period (1840–1940).

  • Electronic period (1940–Present).

4.3.1 Pre-mechanical Period

Humans have been attempting to facilitate calculation using mechanical devices and to find ways to communicate for thousands of years. Communication using pictures and symbols (an alphabet) was started by the Phoenicians (around 3500 B.C.). Around 2900 B.C., the Egyptians developed hieroglyphic writing, and around 2000 B.C. the Greeks enhanced the Phoenician alphabet through the addition of vowels. The Egyptians wrote on papyrus, and later (around 105 A.D.) the Chinese made paper from rags, on which modern papermaking is based. Around 600 B.C., religious and other books, made by folding papyrus vertically into leaves, were used to store information permanently. The Egyptians developed the first number system, writing 1, 2, …, 9 as vertical lines (with the number 10 as a U-shaped symbol, etc.); a number system similar to the present one was developed in India (around 10–200 A.D.), and the number zero was added around 870 A.D. The ancient Greeks constructed some sophisticated analog computers, such as the Antikythera mechanism (involving corroded metal gears and pointers), which was discovered in 1901 on the island of Antikythera [18]. Around 200 B.C., human messengers on foot or horseback started to be used in Egypt and China, with messenger relay stations available. Fire signals were used on many occasions instead of human messengers.

4.3.2 Mechanical Period

Principal dates in this period are as follows:

  • 10–20 A.D.: Romans establish postal services and use mirrors for sending messages (heliographs).

  • 100–200 A.D.: First wooden printing presses used in China and the first bound books.

  • 1450: Movable metal-type printing process (Johannes Gutenberg, Germany).

  • 1600: Slide rule; an early form of analog computer (William Oughtred, England).

  • 1641: Blaise Pascal’s adding machine.

  • Late 1600s: Gottfried Wilhelm Leibniz’s adding machine.

  • 1822: Charles Babbage engines (difference engine, analytical engine).

  • 1830–1840: Parts and processes similar to those of modern computers (storage, punch cards, binary logic, fixed program, real-time concept). It appears that the first programmer was Ada Augusta Byron (Countess of Lovelace), a friend of Babbage, who wrote a report on Babbage’s machine. The name of the Ada programming language was chosen to honor her.

4.3.3 Electromechanical Period

This period was characterized by the conversion of information and knowledge into electric pulses and the rise of mathematics.

  1. The start-up of telecommunications:

    • Late eighteenth century: Voltaic battery.

    • Early 1800s: Telegraph.

    • 1835: Samuel Morse develops the telegraphic Morse code.

    • 1876: Telephone (Alexander Graham Bell).

    • 1894: Invention and development of radio (Guglielmo Marconi).

  2. Electromechanical computing

    • 1880–1940: Census tabulation machine (Herman Hollerith), early punch cards, punch card workers, IBM Mark 1.

  3. Some landmarks of mathematics

    • 1928: The German mathematician David Hilbert poses the following questions: (a) Is mathematics complete (i.e., can every mathematical statement be either proved or disproved?), (b) Is mathematics consistent (i.e., is it actually true that statements like 0 = 1 cannot be proven by a valid method?), and (c) Is mathematics decidable (i.e., does there exist a mechanical way that can confirm the validity or not of a mathematical statement?).

    • 1931: The answers to two of Hilbert’s questions were given by Kurt Gödel, who proved that every sufficiently powerful formal system is either inconsistent or incomplete, and also that the consistency of a consistent axiom system cannot be proven within the system itself. The third question remained unanswered.

    • 1936: Alan Turing (1912–1954) gave an answer to Hilbert’s third question via his formal model of a computer, known as the Turing machine. He showed that there are problems that no such machine can solve (e.g., the question: given a Pascal program, does it halt on all inputs? This is the so-called halting problem).

4.3.4 Electronic Period

The general-purpose electronic digital computer was developed during World War II. One of the major needs in this period was the automatic calculation of ballistic equations.

  • 1940: At Iowa State University, an electronic computer was built for solving systems of linear equations (John Vincent Atanasoff and Clifford Berry).

  • 1945: Development of EDVAC (Electronic Discrete Variable Computer).

  • 1946: Based on the ideas of Atanasoff and Berry, the ENIAC (Electronic Numerical Integrator and Computer) system was built, originally intended for artillery calculations. This was the first fully working high-speed general-purpose computer using vacuum tubes.

  • 1948: Construction of Manchester Mark I (the first stored-program computer).

  • 1949: Construction of EDSAC (Electronic Delay Storage Automatic Calculator).

  • Late 1940s: UNIVAC (Universal Automatic Computer). This was the first general-purpose computer for commercial use.

We now give a brief description of the development of the four generations of digital computing and information processing.

1950–1958: First generation. Logic circuits using vacuum tubes, punch cards for storage of external data, internal storage of data and programs using magnetic drums, machine language, assembly language, and the need for a compiler.

1959–1963: Second generation. Logic circuits using transistors designed on semiconductors, magnetic disks and tapes, magnetic cores, and high-level languages such as FORTRAN and COBOL.

1964–1979: Third generation. Integrated circuits (ICs) instead of individual transistors, magnetic tapes and disks, metal–oxide–semiconductor (MOS) memories, operating systems, and advanced languages, e.g., BASIC.

1980–Present: Fourth generation. Large-scale integrated (LSI) and very large-scale integrated (VLSI) circuits, and central processing units (CPUs) on a single chip leading to the development of personal computers (PCs), e.g., Apple II (1977), Apple Mac (1984), IBM PC (1981), Microsoft Disk Operating System (MS-DOS), graphical user interfaces (GUIs) for PCs (1980), and MS Windows (version 3, 1990).

4.3.5 Information Theory Landmarks

The field of information theory as we know it today was initiated by Shannon’s celebrated paper “A Mathematical Theory of Communication” [9]. Shannon recognized the need for a communication theory in which the communication signals are treated separately from the meaning of the messages they carry. He also showed that information can be transmitted at rates arbitrarily close to the theoretical channel capacity with arbitrarily small error, even if the signal is contaminated by noise. These initial ideas have inspired and guided information, communication, and computer engineers ever since. Information theory overlaps considerably with communication theory, but it is principally concerned with the basic theoretical constraints on the processing and communication of information and not with the design and operation of individual components and communication devices. The principal historical landmarks of information theory are the following [9, 19,20,21,22,23]:

  • 1929: The publication of Leó Szilard on the decrease in entropy caused by the intervention of intelligent agents [21].

  • 1948: The introduction by Shannon of the entropy concept in information theory as an expected value that expresses the “information content” in a message (in units, such as bits) [9].

  • 1949: Publication of the book of Robert Fano concerning the transmission of information [23].

  • 1956: The publication of the book of Leon Brillouin about information theory and his ambitious attempt to incorporate all of scientific endeavor within the framework of Shannon’s information theory [22].

  • 1957: The paper of Edwin Jaynes on maximum entropy principle (MEP) [24].

  • 1961: Publication of the book of Myron Tribus in which he tried to formulate the laws of thermodynamics via information theory [25].

  • 1988: Publication of the work of George Saridis on the application of MEP methods [24, 26] to control [27, 28].

  • 2006: Publication of the book of George Klir [29] in which he develops a “generalized information theory” (GIT) by viewing uncertainty as a manifestation of some information deficiency, and information as the capacity to reduce uncertainty.

  • 2007: Publication of the book of Ben-Naim [30] on the benefits of the use of the concept (term) information or uncertainty, instead of the term entropy, in statistical mechanics and thermodynamics.

More detailed historical landmarks of information theory, especially regarding the development of error-correcting codes and lossless data compression, can be found in Wikipedia [31].

4.3.6 Computer Networks, Multimedia, and Telematics Landmarks

The development of computer networks and multimedia has, over the years, led to advanced and highly reliable transmission and use of information of any type for the benefit of present-day human society [32]:

Computer networks

Before the widespread adoption of internetworking, which led to the Internet, most communication networks allowed communication only between the stations of the same network. One of the dominant methods of computer networking was based on the central mainframe concept, in which terminals were connected to the mainframe via long leased lines. This method was in use, for example, during the 1950s for research communication between Pittsburgh (Pennsylvania) and Santa Monica (California). During the 1960s, a period in which the telephone network was the primary communication network in the world, many groups were working toward enhancing and implementing “packet switching”, especially in the defense field. The origin of the Internet was the development of the Advanced Research Projects Agency Network (ARPANET), the first permanent ARPANET link being established between the University of California, Los Angeles (UCLA) and the Stanford Research Institute (SRI) on 21 November 1969. ARPANET was enhanced with ideas drawn from the ALOHAnet and grew rapidly. The ARPANET development was based on the Request for Comments (RFC) process, which continues to be used to this day.

Some of the primary landmarks in the history of computer networking are the following [33]:

  • 1971: ALOHAnet, a random-access transmission system and protocol for terrestrial and satellite communication (developed at the University of Hawaii), became operational in June 1971. ALOHA is a Hawaiian word meaning hello, goodbye, or love, derived from “alo” = presence/front/face and “ha” = breath.

  • 1974: The X.25 protocol developed by the International Telecommunication Union (ITU) was used on the British SERCnet network of academic and research sites (which later became JANET). The first ITU standard on X.25 was approved in March 1976.

  • 1982: TCP/IP (Transmission Control Protocol and Internet Protocol) is established as the standard for ARPANET.

  • 1986: TCP/IP is offered on workstations and PCs.

  • 1989: The number of hosts exceeds 100,000.

  • 1990: Several search tools (ARCHIE, Gopher, WAIS, etc.) start to enter the market after the official shutdown of ARPANET.

  • 1991: Development of the World Wide Web (WWW) by Tim Berners-Lee at CERN (the European Organization for Nuclear Research).

  • 1992–Present: The WWW and Internet explode into the world [34].

Multimedia is the branch of computer science and information technology that deals with the computer-controlled integration of text, graphics, drawings, static and moving images (video), audio, animation, and any other media suitable for representing, storing, transmitting, and digitally processing every kind of information. A multimedia application is any application that employs a collection of multiple media sources (e.g., text, graphics, images, audio, animation, and/or video). Hypermedia is one example of a multimedia application [35].

The term multimedia is used in contrast to the term media, which indicates that only conventional types of printed or hand-produced material are used. The term multimedia was introduced by Bob Goldstein at Southampton, Long Island, in July 1966. Some landmark events in the area of multimedia are the following [36]:

  • 1945: Vannevar Bush publishes the first landmark paper that describes what amounts to a hypermedia system called “Memex”.

  • 1960: Ted Nelson coined the concept of “hypertext”.

  • 1969: Development of the hypertext editor at Brown (Nelson and Van Dam).

  • 1985: Negroponte and Wiesner formed the MIT Media Laboratory.

  • 1989: Tim Berners-Lee submits his proposal for the World Wide Web to CERN.

  • 1990: Kristine Hooper Woolsey headed the Apple Multimedia Laboratory (with more than 100 scientists).

  • 1992: JPEG was adopted as the international standard for digital image compression. Also, the first M-Bone audio multicast on the Net was made.

  • 1993: Production of the first full-fledged browser, Mosaic, by NCSA (the National Center for Supercomputing Applications, University of Illinois).

  • 1994: Development of Netscape by Jim Clark and Marc Andreessen.

  • 1995: Introduction of JAVA for platform-independent application development (The first applet created is called Duke).

  • 1996: Introduction of Microsoft’s Internet Explorer.

Further historical information pertaining to multimedia can be found in [37].

Telematics

Telematics is the synergy of telecommunications and informatics (telematics = telecommunications + informatics). Therefore, as an overall term, “telematics” refers to the long-distance transmission and computer processing of information. The term was coined in French by Simon Nora and Alain Minc as “télématique” (télécommunications et informatique) in their 1978 report entitled “L’Informatisation de la Société” [38]. This report was requested by the president of France, Valéry Giscard d’Estaing, in 1976 to explore how computer and informatics applications could be extended to the organizational, economic, and developmental issues of modern society. In their report, Nora and Minc made the following remark: “Today, any consumer of electricity can instantly obtain the electric power he needs without worrying about where it comes from or how much it costs. There is every reason to believe that the same will be true in the future of telematics.” Today, computers are becoming smaller and smaller with ever lower energy requirements, and computing devices are gradually becoming mobile, accompanying us wherever we go. These developments have led to so-called mobile computing, ubiquitous computing, or pervasive computing. The historical landmarks of telematics involve the parallel and combined landmarks of computers and telecommunications. On the computer side, the two major landmarks are as follows [39, 40]:

  • 1971: Development of the first microprocessor, the Intel 4004.

  • 1981: Release of Osborne 1, the first portable computer.

On the telecommunications side, we mention the following:

  • 1860: Invention of the telephone by Antonio Meucci.

  • 1895: Invention of the wireless telegraph by Guglielmo Marconi.

  • 1946: AT&T introduces the first mobile telephone.

  • 1958: Launching of the first communication satellite, SCORE.

  • 1969: The ARPANET goes online, linking two computers for the first time (at the University of California, Los Angeles, and the Stanford Research Institute).

  • 1992: The WWW is released by CERN worldwide.

  • 2000: The first commercial UMTS network is introduced in Japan.

We now present a summarized description of the basic concepts and methods of various manifestations of information that have been discussed in this section from a historical point of view, namely,

  • Communication systems.

  • Information theory.

  • Computers and computer science.

  • Informatics.

  • Telematics.

  • Multimedia.

  • Information systems.

  • Information science.

4.4 Communication Systems

4.4.1 General Issues

In communication systems theory, signals represent information. Information comes neatly packaged in either analog or digital form. Communication systems are used for both manipulating and transmitting information. The main types of communication are the following:

  • Point-to-point communication (from a single place to another place).

  • Broadcast communication (from one place to many other places).

  • Multipoint to multipoint communication (teleconferencing, chat rooms).

Communication systems can be analog (e.g., via radio waves) or digital (e.g., computer networks). The question arising here is: which is better, analog or digital communication? This question was answered by Shannon in his work on information theory [9], which advocates the digital representation of signals and a digital communication strategy. This means that all information-bearing signals are converted into digital ones (discrete-time, amplitude-quantized signals) by applying the sampling theorem, which gives the condition that must be met for this conversion to be accurate. As mentioned in Sect. 4.3.5, Shannon showed that, in digital form, a properly designed system can communicate digital information without error despite the presence of noise in all transmission channels. This result is of fundamental importance and value not only in communications, but also in all areas where digital information is handled, e.g., compact discs (CDs) and digital video disks (DVDs).
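To make the analog-to-digital conversion mentioned above concrete, the following Python sketch samples a band-limited tone above the Nyquist rate and quantizes its amplitude to a fixed number of bits; the tone frequency, sampling rate, and bit depth are illustrative assumptions, not values prescribed in the text.

```python
import numpy as np

# Illustrative (assumed) parameters, not values from the text.
f_m = 1_000.0        # highest frequency in the analog message (Hz)
f_s = 8_000.0        # sampling rate (Hz), comfortably above the Nyquist rate 2*f_m
bits = 8             # quantizer resolution (bits per sample)

t = np.arange(0, 0.01, 1.0 / f_s)            # 10 ms of sample instants
analog = np.cos(2 * np.pi * f_m * t)         # the "analog" message evaluated at those instants

# Uniform quantization of the range [-1, 1] into 2**bits levels.
levels = 2 ** bits
codes = np.round((analog + 1.0) / 2.0 * (levels - 1)).astype(int)   # integer codes 0..255
bitstream = "".join(np.binary_repr(c, width=bits) for c in codes)   # the digital signal

# The only distortion introduced so far is the quantization error.
reconstructed = codes / (levels - 1) * 2.0 - 1.0
print("max quantization error:", np.max(np.abs(reconstructed - analog)))
```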

4.4.2 Shannon–Weaver Communication Model

One of the fundamental concepts in communication and information theory is the “communication model” of Shannon and Weaver [10], which is illustrated in Fig. 4.1b. Figure 4.1a shows the graphical representation (called a block diagram) of a system receiving an input signal x(t) and producing an output signal y(t). In Fig. 4.1b, we have the following components (blocks):

Fig. 4.1 a Block diagram of a system receiving an input signal x(t) and producing an output signal y(t). b Shannon–Weaver communication model

  • The information source that selects a desired signal (message) s(t), out of a set of possible messages, which is to be sent to the destination (sink). The message can have several forms, e.g., speech, music, strings of letters as in telegraph or teletype, or characters typed on the keyboard of a PC, etc.

  • The transmitter that receives the message s(t) and produces a signal x(t) suitable for transmission over the channel. The signal x(t) is either modulated or encoded, depending on the message’s physical nature. In telephony, the transmitter’s operation consists of changing sound pressure into electrical current. In telegraphy, an encoding operation takes place that generates a sequence of dots, dashes, and spaces corresponding to the message. In a multiplexed pulse-code modulation (PCM) system, the various speech functions are sampled, compressed, quantized, encoded, and finally properly interleaved to construct the signal x(t).

  • The channel is the medium over which the signal is transmitted to the receiver. In the channel, the transmitted signal is typically corrupted by noise or distorted or attenuated by various phenomena [giving the corrupted message r(t)].

  • The receiver is a sort of inverse transmitter, changing the transmitted signal back into a received message \(\hat{s}(t)\) that must resemble s(t) as much as possible.

  • The destination (or information sink) that is the person (or thing) for whom the message is intended (and who uses it for the desired purpose).

The “noise” contaminating the transmitted message x(t) may be internal (i.e., coming from the receiver’s attitudes, beliefs, or knowledge) or external (i.e., caused by other sources). This internal or external noise may either strengthen the intended outcome of the message (if the information in the noise confirms the message) or weaken the intended outcome (if the information in the noise disconfirms the original message).
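The roles of the five blocks can be mirrored by a small toy pipeline. The Python sketch below is only an assumption-laden illustration (the “transmitter” is a simple scaling and the channel adds Gaussian noise); it is not the Shannon–Weaver model itself, merely a way to make the symbols s(t), x(t), r(t), and ŝ(t) concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def source(n):                      # information source: selects a message s(t)
    t = np.linspace(0.0, 1.0, n)
    return np.sin(2 * np.pi * 5 * t)

def transmitter(s):                 # produces a line signal x(t); here just a scaling
    return 2.0 * s

def channel(x, noise_std=0.1):      # corrupts x(t) with additive noise, giving r(t)
    return x + rng.normal(0.0, noise_std, size=x.shape)

def receiver(r):                    # acts as an inverse transmitter, giving s_hat(t)
    return r / 2.0

s = source(1000)
s_hat = receiver(channel(transmitter(s)))        # the destination receives s_hat
print("rms error between s and s_hat:", np.sqrt(np.mean((s - s_hat) ** 2)))
```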

The implementation of the Shannon–Weaver communication model requires the following:

  • To understand signals and signal generation.

  • To understand the nature of the information these signals represent.

  • To know how information is transformed between analog and digital forms.

  • To know how information can be processed by systems working on information-bearing signals.

These requirements are met through electrical engineering (which studies the ways signals are represented and manipulated electrically/electronically) and signal engineering (which studies the structure of signals, their information content, and the capabilities that this structure imposes upon communication systems), independently of what the signal sources are.

4.4.3 Other Communication Models

The Shannon–Weaver linear communication model is applicable to pure machine communication, i.e., it is not intended to model human communication. This model is focused on the technical problems associated with the selection and arrangement of discrete units of information and not with the semantic “content” of the messages. Weaver himself pointed out: “It is surprising but true that, from the present viewpoint, two messages, one heavily loaded with meaning and the other pure nonsense, can be equivalent as regards information.”

Two other communication models devised for human communication are the Berlo model of communication [43] and the Schramm model of communication [44], shown in Fig. 4.2a, b [45]:

Fig. 4.2 a Berlo’s SMCR model: a source encodes a message for a channel to a receiver who decodes the message. b Schramm’s communication model

Berlo’s model is an enhanced adaptation of the Shannon–Weaver model; to cover human communication, it provides a mechanism in the source for the following:

  • Communication skills.

  • Attitudes.

  • Knowledge.

  • Social and cultural issues.

In other words, the source is sufficiently flexible to include oral, written, electronic, or any other form of “symbolic” generator of messages. The “message” is considered to be the core element of the model, stressing the transmission of ideas, specifically content, elements, treatment, structure, and code. The receiver is recognized to be very important to communication, since it is actually the target. Like the source, the receiver takes into account and interprets communication skills, attitudes, knowledge, and social and cultural issues. The channel, i.e., the communication medium, involves all the human senses (hearing, seeing, touching, smelling, and tasting). The concepts of “encoding” and “decoding” emphasize the psychophysical problem every person has in translating his/her own thoughts into words or symbols and presenting them in an understandable way to other people. Here, it is tacitly assumed that human communication is similar to machine communication (e.g., telephone, television, and computer information transmission). Clearly, the accuracy of human communication using this model depends on choosing the “proper” symbols, preventing interference, and sending efficient messages. However, even if the proper symbols are used, people may misunderstand each other. All these issues are human-centered, depending on agreements, beliefs, shared values, and attitudes.

Schramm’s model provides the additional features of “feedback”, “field expertise”, and a “psychological reference frame” for the interacting humans. It is noted that Schramm’s model is less linear than the Shannon–Weaver or Berlo models. But again, it is suitable only for bilateral communication between two partners (i.e., it does not cover the case of multilevel communication). Three nonlinear models of communication are the following:

  • Dance’s helical spiral (1967): This spiral depicts communication as a dynamic process, where the helix represents how communication evolves in a human from his/her birth up to the present [41].

  • Westley and MacLean’s conceptual model (1957): This model is based on the fact that the communication of a person begins when he/she starts to respond selectively to the immediate surroundings and not when he/she starts to talk [41].

  • Becker’s mosaic model (1968): Here, it is assumed that most communicative acts link message elements from several social situations (not just from one situation). These complex communicative events are linked to the activity of a receiver who moves through a permanently varying cube or mosaic of information [41]. This model adds a third dimension, but human communication involves many more dimensions. Therefore, many researchers attempted to develop multidimensional communication models. Two such models are the following [41]:

  • Ruesch and Bateson functional model (1951): This involves four levels of analysis, i.e., level 1 (intrapersonal process), level 2 (interpersonal), level 3 (group interaction), and level 4 (cultural level).

  • Barnlund’s transactional model (1970): A very systematic functional model where the key assumptions on which it is based are shown explicitly. Its most important property is that it includes no simple or linear directionality in the interplay between itself and the surrounding world.

4.4.4 Transmitter–Receiver Operations

We now briefly discuss the operations of the transmitter and receiver in the Shannon–Weaver communication model (Fig. 4.1b).

Modulation

The transmitter performs the modulation operation, namely, the superposition of the information (message) onto an electronic or optical carrier signal. Modulation can be applied to direct current (mainly by turning it on or off), to alternating current, and to optical signals. Morse code, used in telegraphy and presently in amateur radio, uses a binary (i.e., two-state) digital code similar to the code used in modern computers. In today’s typical radio and telecommunication systems, the carrier is an alternating current (AC) sent over a given frequency band. Standard modulation methods are as follows:

  • AM (Amplitude modulation), where the voltage (amplitude) applied to the carrier is varied in step with the time-varying message signal.

  • FM (Frequency modulation), where the frequency of the carrier waveform is varied (modulated) in suitably small amounts.

  • PM (Phase modulation), where the phase of the carrier waveform is varied, i.e., the natural flow of the alternating signal is delayed temporarily.

The above are called continuous (analog) signal modulation methods, to distinguish them from PCM (Pulse-Code Modulation), which is employed to encode analog and digital signals in a binary (digital) way. In general, modulation techniques can be classified as follows:

  • Analog versus digital modulation.

  • Baseband (low pass) versus bandpass (passband) modulation.

  • Binary versus M-ary modulation.

  • Linear versus nonlinear modulation.

  • Memoryless modulation versus modulation with memory.

  • Power-efficient versus bandwidth-efficient modulation.

  • Constant envelope versus nonconstant envelope modulation.

  • Digital keying counterparts of the analog methods (amplitude-shift keying: ASK, frequency-shift keying: FSK, phase-shift keying: PSK). Two-way radios employ FM, although some use single sideband (SSB). A combination of ASK and PSK is quadrature-amplitude modulation (QAM); a minimal sketch of the basic keying schemes is given after this list.
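As a minimal sketch of the binary keying schemes named above (ASK, PSK, FSK), the following Python fragment maps a short bit sequence onto a carrier in the three elementary ways; the carrier frequency, bit rate, and frequency deviation are arbitrary illustrative choices.

```python
import numpy as np

bits = np.array([1, 0, 1, 1, 0])     # the binary message to be keyed
f_c = 10_000.0                       # assumed carrier frequency (Hz)
f_s = 80_000.0                       # sampling rate (Hz)
bit_rate = 1_000.0                   # bits per second

spb = int(f_s / bit_rate)                        # samples per bit
t = np.arange(len(bits) * spb) / f_s
b = np.repeat(bits, spb)                         # rectangular baseband waveform

ask = b * np.cos(2 * np.pi * f_c * t)                        # on/off amplitude-shift keying
psk = np.cos(2 * np.pi * f_c * t + np.pi * (1 - b))          # binary phase-shift keying (0 or pi)
f_dev = 2_000.0
fsk = np.cos(2 * np.pi * (f_c + f_dev * (2 * b - 1)) * t)    # frequency-shift keying (f_c +/- f_dev)
```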

Demodulation

This is the inverse operation of modulation: it extracts the original information-bearing signal from the modulated carrier signal. An electronic device that recovers the original message from the modulated carrier signal is called a demodulator. To each modulation technique there corresponds an appropriate demodulation technique and an associated demodulator. For example, for an AM signal (which encodes the message into the carrier wave by varying its amplitude in proportion to the analog message sent), we have two kinds of demodulators. These are the envelope detector (which uses a rectifier allowing the current to flow in one direction only) and the product detector. The latter multiplies the incoming AM-modulated signal by the signal of a local oscillator with the same frequency and phase as the carrier used. After filtering, the original message is obtained. For FM, many typical forms of demodulator exist. For example, a quadrature detector phase-shifts the signal by 90° and multiplies it with the unshifted version. Among the terms produced by this operation is the original message, which is selected and amplified. Another FM demodulator employs two AM demodulators, one tuned to the high end of the frequency band and the other to the low end; the two outputs are then passed to a difference amplifier.

A pair of transmitter (coder, modulator) and receiver (decoder, demodulator) is called a transceiver. The general structure of modern communication systems involving a CODEC (Coder/Decoder) and MODEM (Modulator/Demodulator) has the form shown in Fig. 4.3.

Fig. 4.3 General structure of a digital communication system

Multiplexing

To convey more information in a given amount of time, the bandwidth of a signal carrier is divided so that more than one modulated message can be sent simultaneously on the same carrier. This is called multiplexing; the carrier is sometimes called the channel, and each separate message signal carried is referred to as a subchannel. The multiplexer is the device that puts the individual signals onto the carrier and separates the received transmissions again at the destination. Typical forms of multiplexing are as follows:

  • FDM: Frequency-division multiplexing.

  • TDM: Time-division multiplexing.

FDM, which divides the principal frequency band of the carrier into separate subchannels, is commonly employed in analog communication systems, whereas TDM, which divides the principal signal into time slots, each carrying a separate signal, is used in digital communication systems. A digital cellular phone technology based on TDMA (time-division multiple access), developed in the 1980s and predominant in Europe, is the Global System for Mobile (GSM) communications. GSM is also used worldwide. It operates in the 900 MHz and 1.8 GHz bands in Europe, and in the 1.9 GHz PCS band in the United States. It is based on a circuit-switched system that divides each 200 kHz channel into eight 25 kHz time slots.
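The GSM figures quoted above amount to simple arithmetic (200 kHz shared by eight users), and the essence of TDM is round-robin interleaving of time slots. The Python sketch below illustrates both; the frame layout is a deliberate simplification and not the actual GSM burst structure.

```python
import numpy as np

n_slots = 8                          # users sharing one carrier in time (TDM/TDMA)
carrier_bandwidth_khz = 200.0
print("equivalent bandwidth per user (kHz):", carrier_bandwidth_khz / n_slots)   # 25.0

# Toy TDM interleaver: each frame carries one sample from every user, slot by slot.
users = [np.full(4, u) for u in range(n_slots)]   # each user's stream (here just its own index)
frames = np.stack(users, axis=1)                  # shape: (n_frames, n_slots)
tdm_stream = frames.reshape(-1)                   # serial slot-by-slot stream on the channel
print(tdm_stream[:16])                            # users 0..7, then 0..7 again, ...
```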

4.4.5 Analysis of Analog Modulation–Demodulation

Here, we outline the mathematical analysis of analog modulation. The carrier wave is a sinusoid of the following form:

$$c(t) = A\;{ \cos }\left( {\omega_{c} t + \phi } \right)$$

where the amplitude A, the carrier cyclic frequency \(\omega_{c} = 2\pi f_{c}\) (\(f_{c}\) is the frequency), and the phase \(\phi\) are the parameters that can be varied according to the message-bearing signal:

$$m(t) = M\;{ \cos }\left( {\omega_{m} t + \theta_{0} } \right)$$

where \(\theta_{0}\) is a constant and, without loss of generality, it can be assumed \(\theta_{0} = 0\).

4.4.5.1 Amplitude Modulation

In amplitude modulation (AM), the amplitude A of the carrier signal is modulated in proportion to the message-bearing (low-frequency) signal m(t), to give

$$\begin{aligned} x(t) & = \left( {A + M\,{ \cos }\left( {\omega_{m} t} \right)} \right)\,{ \cos }\left( {\omega_{c} t + \phi } \right) \\ & = A\left( {1 + \mu \,{ \cos }\,\omega_{m} t} \right)\,{ \cos }\left( {\omega_{c} t + \phi } \right) \\ \end{aligned}$$

where

$$\mu = M/A$$

is called the “modulation index” of AM and, for demodulation purposes (i.e., for the recovery of m(t) from the above modulated carrier signal), is typically selected to be less than one (i.e., M < A).

Using the well-known trigonometric identity:

$${ \cos }\,(\theta )\,{ \cos }\,(\psi ) = (1/2)\left[ {{ \cos }\,(\theta + \psi ) + { \cos }\,(\theta - \psi )} \right],$$

the amplitude modulated signal x(t) can be written as

$$\begin{aligned} x(t) & = A\,{ \cos }\left( {\omega_{c} t + \phi } \right) + \frac{A\mu }{2}{ \cos }\left[ {\left( {\omega_{c} + \omega_{m} } \right)t + \phi } \right] \\ & \quad + \frac{A\mu }{2}{ \cos }\left[ {\left( {\omega_{c} - \omega_{m} } \right)t + \phi } \right] \\ \end{aligned}$$

This shows that x(t) has three additive components with frequencies \(\omega_{c}\), \(\omega_{c} + \omega_{m}\), and \(\omega_{c} - \omega_{m}\). The frequency \(\omega_{c} + \omega_{m}\) is called the upper side frequency, and \(\omega_{c} - \omega_{m}\) the lower side frequency. A typical AM signal with \(\mu = 0.8\) is shown in Fig. 4.4.

Fig. 4.4 Typical amplitude-modulated (AM) signal with \(\mu = 0.8\)

In the type of AM described above, both side frequencies \(\omega_{c} - \omega_{m}\) and \(\omega_{c} + \omega_{m}\) are transmitted; it is therefore called Double-Sideband AM (DSB-AM). For greater efficiency, only one of the sidebands may be transmitted, in which case we have so-called Single-Sideband AM (SSB-AM). In particular, SSB-AM is called Lower-Sideband AM (LSB-AM) or Upper-Sideband AM (USB-AM), depending on whether only the lower or only the upper side frequency is transmitted. The case where the modulation index is μ > 1 is called overmodulation. Now, let us calculate the power of a modulated wave. The power is proportional to the square of the voltage (amplitude). Therefore, the power of the carrier wave is

$${\text{Carrier}}\;\text{power} = KA^{2}$$

where the factor ½, required because A is the amplitude (peak value) of c(t), is included in the proportionality constant K.

The total sideband power is equal to

$$\begin{aligned} {\text{Total}}\;{\text{side - band}}\;{\text{power}} & = KA^{2} \left( {\frac{\mu }{2}} \right)^{2} + KA^{2} \left( {\frac{\mu }{2}} \right)^{2} \\ & = \frac{{\mu^{2} }}{2} \times {\text{Carrier}}\;{\text{power}} \\ \end{aligned}$$

Therefore, the total power of the AM signal is equal to

$$P = KA^{2} \left( {1 + \mu^{2} /2} \right)$$

i.e., equal to the sum of the constant carrier power \(KA^{2}\) and the power of the two sidebands, which depends on the value of \(\mu\). For example, if \(\mu = 1\), then \(P = KA^{2} \times \left( {3/2} \right)\), i.e., 150% of the carrier power.
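The sideband and power relations derived above can be checked numerically. The short Python sketch below builds a DSB-AM signal with assumed carrier and message frequencies, verifies that its spectrum contains only the three lines \(\omega_{c}\) and \(\omega_{c} \pm \omega_{m}\), and compares the measured power ratio with \(1 + \mu^{2}/2\).

```python
import numpy as np

A, mu = 1.0, 0.8                     # carrier amplitude and modulation index (mu < 1)
f_c, f_m = 10_000.0, 500.0           # assumed carrier and message frequencies (Hz)
f_s = 200_000.0                      # sampling rate (Hz)
t = np.arange(0, 0.1, 1.0 / f_s)     # 0.1 s: an integer number of periods of both tones

x = A * (1 + mu * np.cos(2 * np.pi * f_m * t)) * np.cos(2 * np.pi * f_c * t)

# Spectrum: lines only at f_c and at the upper/lower side frequencies f_c +/- f_m.
spectrum = np.abs(np.fft.rfft(x)) / len(t)
freqs = np.fft.rfftfreq(len(t), 1.0 / f_s)
print("spectral lines (Hz):", freqs[spectrum > 0.01])     # ~9500, 10000, 10500

# Average power relative to the unmodulated carrier: should equal 1 + mu**2/2 = 1.32.
carrier_power = np.mean((A * np.cos(2 * np.pi * f_c * t)) ** 2)
print("P / carrier power:", np.mean(x ** 2) / carrier_power, "theory:", 1 + mu ** 2 / 2)
```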

4.4.5.2 Amplitude Demodulation

As mentioned in Sect. 4.4.4, for AM we have two kinds of demodulators (detectors): the envelope demodulator and the product demodulator. The signal

$$v\left( t \right) = A\left( {1 + m\left( t \right)} \right)$$

is called the envelope of the AM signal. Clearly, if the envelope v(t) is extracted, the transmitted message m(t) can be recovered. A simple way to obtain the envelope is to use the “envelope demodulator circuit” shown in Fig. 4.5.

Fig. 4.5 a Envelope AM demodulator. b The AM signal passing through the demodulator yields the envelope signal: the capacitor removes the carrier frequency and leaves only the envelope

The envelope detector is of the so-called noncoherent (asynchronous) type. Better performance can be obtained with a coherent (synchronous) detector, in which both the frequency and the phase of the carrier are known at the demodulator. The amplitude of the carrier affects only the level of the demodulated signal, which can be changed by a simple amplifier; therefore, the amplitude of the carrier is not important in the demodulation. The carrier signal is restored at the receiver by a circuit called the carrier-recovery circuit. In the product detector, the amplitude-modulated signal is multiplied by the signal provided by a local oscillator. The simplest case is obtained if the local oscillator has the same frequency as the carrier, i.e., \(\omega_{c}\). Then, the result of the multiplication is the sum of the original message and another AM signal at frequency \(2\omega_{c}\). A low-pass filter is used to suppress this second component. The block diagram of the product detector is shown in Fig. 4.6.

Fig. 4.6 Block diagram of a product detector

In the above block diagram, we have

$$\begin{aligned} x(t)\,c(t) & = A\left( {1 + m(t)} \right)\,{ \cos }\left( {\omega_{c} t} \right)\,{ \cos }\left( {\omega_{c} t} \right) \\ & = \frac{1}{2}A\left( {1 + m(t)} \right)\left( {1 + { \cos }\left( {2\omega_{c} t} \right)} \right) \\ \end{aligned}$$

where, without loss of generality, the phase \(\phi\) of the carrier has been omitted and use has been made of the same trigonometric identity as in the AM modulation. After low-pass filtering, the original message is recovered.
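A product detector is easy to imitate in discrete time: multiply the received AM signal by a synchronized local carrier and low-pass filter the result, exactly as in the equation above. The Python sketch below does this with assumed frequencies and a Butterworth filter from SciPy; it is an illustration of the principle, not a receiver design.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

f_c, f_m, f_s = 10_000.0, 500.0, 200_000.0       # assumed carrier, message, sampling rates (Hz)
t = np.arange(0, 0.1, 1.0 / f_s)
m = 0.8 * np.cos(2 * np.pi * f_m * t)            # message m(t) with |m| < 1
x = (1 + m) * np.cos(2 * np.pi * f_c * t)        # DSB-AM signal (amplitude A = 1)

# Product detection: mix with a local carrier of the same frequency and phase ...
mixed = x * np.cos(2 * np.pi * f_c * t)          # = 0.5*(1 + m)*(1 + cos(2*w_c*t))
# ... then low-pass filter to suppress the component at 2*f_c.
sos = butter(4, 2 * f_m / (f_s / 2), output="sos")
baseband = sosfiltfilt(sos, mixed)               # ~ 0.5*(1 + m)
m_hat = 2 * baseband - 1                         # remove the 1/2 factor and the DC offset

print("rms recovery error:", np.sqrt(np.mean((m_hat - m) ** 2)))
```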

The carrier signal can be available to the product detector in two ways:

  • Through transmission of the carrier.

  • Through recovering of the carrier.

The carrier recovery can be achieved by using a transmitted pilot signal outside the passband of the modulated signal as shown in Fig. 4.7.

Fig. 4.7 Pilot signal outside the AM signal

To recover the carrier signal, two techniques can be applied: (i) recovery by a bandpass filter (Fig. 4.8a), and (ii) recovery by a phase-locked loop—PLL (Fig. 4.8b).

Fig. 4.8 a Recovery of the carrier by a bandpass filter. b Recovery of the carrier by a phase-locked loop (PLL). Here, the symbols \(\hat{c}(t)\) and \(\hat{m}(t)\) indicate approximate copies of \(c(t)\) and \(m(t)\)

It is noted that, with the product demodulator, we can also decode overmodulated AM, SSB, and AM with suppressed carrier, in addition to the standard DSB-AM.

4.4.5.3 Frequency Modulation

The principal advantages of frequency modulation (FM) over AM are as follows: (i) better signal-to-noise ratio, (ii) less interference between neighboring stations, and (iii) less radiated power. The drawbacks are as follows: (i) it requires much more bandwidth than AM (up to 20 times more), and (ii) the transmitter and receiver devices are more complex. In FM, the frequency of the carrier wave is changed in proportion to the message signal m(t), i.e., we have

$$\omega (t) = \omega_{c} \left( {1 + \mu \,{ \cos }\left( {\omega_{m} t} \right)} \right)$$

where \(f_{c} = \omega_{c} \,/\,2\pi\) is called the center frequency, and \(\mu\) is the degree of modulation. A frequency-modulated signal has the form shown in Fig. 4.9b.

Fig. 4.9 Frequency modulation. a The modulating signal. b The modulated signal

Frequency modulation is popular in a variety of radio transmission applications, from broadcasting to general point-to-point transmissions. It is also widely employed for broadcasting in the very high frequency (VHF) band because it provides a very good medium for high-quality transmissions. FM is also widely used in several mobile applications. Here, because \(\omega\) is time varying, the FM signal is given by

$$\begin{aligned} A\,{ \cos }\left[ {\int\limits_{0}^{t} {\omega (t)\,{\text{d}}t} } \right] & = A\,{ \cos }\left[ {\int {\omega_{c} \left( {1 + \mu \,{ \cos }\left( {\omega_{m} t} \right)} \right){\text{d}}t} } \right] \\ & = A\,{ \cos }\left[ {\frac{{\omega_{c} \mu }}{{\omega_{m} }}{ \sin }\left( {\omega_{m} t} \right) + \omega_{c} t + \theta_{0} } \right] \\ \end{aligned}$$

where the constant of integration \(\theta_{0}\) can be neglected as a constant angle. The quantity \(\Delta \omega = \omega_{c} \mu\) is called the cyclic frequency deviation.

The parameter

$$\frac{{\omega_{c} \mu }}{{\omega_{m} }} = \frac{{\Delta \omega }}{{\omega_{m} }} = \mu_{f}$$

is called the FM modulation index (or deviation ratio) and has a value in radians that is different for each value of \(\omega_{m}\). On the basis of this, the expression for an FM signal becomes

$$x(t) = A\,{ \cos }\left( {\omega_{c} t + \mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right)$$

Expanding this signal using the identity for \({ \cos }\,(\theta + \psi )\), we obtain

$$\begin{aligned} x(t) = & \,A\,{ \cos }\left( {\omega_{c} t + \mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right) \\ = & \,A\left[ {{ \cos }\left( {\omega_{c} t} \right)\,{ \cos }\left( {\mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right) - { \sin }\left( {\omega_{c} t} \right)\,{ \sin }\left( {\mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right)} \right] \\ \end{aligned}$$

The two complex functions \({ \cos }\left( {\mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right)\) and \({ \sin }\left( {\mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right)\) can be expressed as an infinite series of Bessel functions, namely [46, 47],

$$\begin{aligned} { \cos }\left( {\mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right) & = J_{o} \left( {\mu_{f} } \right) + 2J_{2} \left( {\mu_{f} } \right){ \cos }\left( {2\omega_{m} t} \right) \\ & \quad + 2J_{4} \left( {\mu_{f} } \right){ \cos }\left( {4\omega_{m} t} \right) + 2J_{6} \left( {\mu_{f} } \right){ \cos }\left( {6\omega_{m} t} \right) + \cdots \\ { \sin }\left( {\mu_{f} \,{ \sin }\left( {\omega_{m} t} \right)} \right) & = 2J_{1} \left( {\mu_{f} } \right){ \sin }\left( {\omega_{m} t} \right) + 2J_{3} \left( {\mu_{f} } \right){ \sin }\left( {3\omega_{m} t} \right) \\ & \quad + 2J_{5} \left( {\mu_{f} } \right){ \sin }\left( {5\omega_{m} t} \right) + 2J_{7} \left( {\mu_{f} } \right){ \sin }\left( {7\omega_{m} t} \right) + \cdots \\ \end{aligned}$$

where the Bessel functions of the first kind \(J_{k} \;\left( {k = 0,1,2, \ldots } \right)\) are defined as

$$J_{k} \left( {\mu_{f} } \right) = \sum\limits_{q = 0}^{\infty } {\frac{{( - 1)^{q} }}{q!(q + k)!}\left( {\frac{{\mu_{f} }}{2}} \right)^{k + 2q} }$$

and are provided in corresponding mathematical tables and computer-mathematics libraries. Using the trigonometric identities for \({ \cos }\,(\theta )\,{ \cos }\,(\psi )\) and \({ \sin }\,(\theta )\,{ \sin }\,(\psi )\), we get

$$\begin{aligned} x(t) & = A\left\{ {J_{0} \left( {\mu_{f} } \right){ \cos }\left( {\omega_{c} t} \right) - J_{1} \left( {\mu_{f} } \right)\left[ {{ \cos }\left( {\left( {\omega_{c} - \omega_{m} } \right)t} \right) - { \cos }\left( {\left( {\omega_{c} + \omega_{m} } \right)t} \right)} \right]} \right. \\ & \quad + J_{2} \left( {\mu_{f} } \right)\left[ {{ \cos }\left( {\left( {\omega_{c} - 2\omega_{m} } \right)t} \right) + { \cos }\left( {\left( {\omega_{c} + 2\omega_{m} } \right)t} \right)} \right] \\ & \quad - J_{3} \left( {\mu_{f} } \right)\left[ {{ \cos }\left( {\left( {\omega_{c} - 3\omega_{m} } \right)t} \right) - { \cos }\left( {\left( {\omega_{c} + 3\omega_{m} } \right)t} \right)} \right] \\ & \quad + J_{4} \left( {\mu_{f} } \right)\left[ {{ \cos }\left( {\left( {\omega_{c} - 4\omega_{m} } \right)t} \right) + { \cos }\left( {\left( {\omega_{c} + 4\omega_{m} } \right)t} \right)} \right] - \cdots \left. \right\} \\ \end{aligned}$$

We see that the FM signal involves the center frequency \(\omega_{c}\) and an infinite number of side frequency pairs, each pair spaced by an amount equal to the modulating frequency \(\omega_{m}\).

The amplitudes of the successive side-frequency pairs are determined by the Bessel function coefficients. Although theoretically the FM signal covers the entire frequency spectrum with sidebands, the J coefficients decrease relatively quickly and the series converges rapidly, so the bandwidth actually required is finite. Because \(\omega_{m}\) and \(\mu_{f}\) are inversely proportional \(\left( {\mu_{f} =\Delta \omega \,/\,\omega_{m} } \right)\), a small \(\omega_{m}\) will result in more side frequencies of significant value than a high \(\omega_{m}\), as shown in Fig. 4.10.

Fig. 4.10 Frequency spectra of two FM signals with \(\Delta f = 75\;{\text{kHz}}\)

It is noted that \(J_{0} \left( {\mu_{f} } \right) = 0\) for \(\mu_{f} = 2.405,\;5.520,\;8.654,\;11.79,\;14.93\), etc. Therefore, for these values of \(\mu_{f}\), the center frequency component is zero (i.e., it does not exist).

A practical rule of thumb is Carson’s rule which says that almost all (~98%) of the power of an FM signal is contained in a bandwidth \(B_{c}\) equal to

$$B_{c} = 2\left( {f_{m} +\Delta f} \right)$$

where \(\Delta f = \mu f_{c}\) is the frequency deviation \(\Delta \omega \,/\,2\pi\) from the central frequency.
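The Bessel-series description of the FM spectrum and Carson's rule lend themselves to a quick numerical check. The Python sketch below (which assumes SciPy for the Bessel functions and broadcast-style values fm = 15 kHz, Δf = 75 kHz) lists the first side-pair amplitudes Jk(μf) and confirms that the side pairs inside the Carson bandwidth (here six pairs) carry about 98–99% of the power.

```python
import numpy as np
from scipy.special import jv               # Bessel functions of the first kind, J_k

f_m = 15_000.0                             # assumed modulating frequency (Hz)
delta_f = 75_000.0                         # assumed peak frequency deviation (Hz), as in WBFM
mu_f = delta_f / f_m                       # FM modulation index (deviation ratio) = 5

# Relative amplitudes of the carrier (k = 0) and of the first side-frequency pairs.
for k in range(9):
    print(f"J_{k}({mu_f:.1f}) = {jv(k, mu_f):+.3f}")

# Fraction of the total power inside the first n side pairs
# (using the identity J_0^2 + 2*sum_k J_k^2 = 1).
for n in (5, 6, 7):
    frac = jv(0, mu_f) ** 2 + 2 * sum(jv(k, mu_f) ** 2 for k in range(1, n + 1))
    print(f"power within +/- {n} side pairs: {frac:.3f}")

# Carson's rule: B_c = 2*(f_m + delta_f) captures almost all of the power.
print("Carson bandwidth (kHz):", 2 * (f_m + delta_f) / 1e3)      # 180.0
```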

4.4.5.4 Frequency Demodulation

In FM, signals with a large frequency deviation support higher quality transmission at the expense of occupying a wider bandwidth. Thus, in practice, several levels of frequency deviation (or modulation index values) are employed, according to the application. Those with low deviation are known as narrowband frequency modulation (NBFM), with typical deviation values of ±3 kHz. NBFM is generally used for point-to-point communications. Higher frequency deviations are required for broadcasting and are called wideband frequency modulation (WBFM). Typical values of WBFM frequency deviation are ±75 kHz.

To receive FM, a receiver must be sensitive to the frequency variations of the incoming signal, which may be either NBFM or WBFM, and insensitive to amplitude variations. To ensure that any amplitude variations are removed, the signals are amplified to such a degree that the amplifier goes into limiting. The demodulator must be frequency-dependent in order to convert frequency variations linearly into voltage variations. This suggests that the FM demodulator must have an S-type voltage–frequency characteristic, as shown in Fig. 4.11.

Fig. 4.11 S-curve characteristic of an FM demodulator

The principal types of FM demodulator circuits are as follows [48]:

  • Slope detector.

  • Ratio detector.

  • Foster–Seeley detector.

  • Phase-locked loop (PLL) detector .

  • Quadrature detector .

Slope detector

This is the simplest form of FM detector. It is essentially a tank circuit that is tuned to a frequency slightly higher or lower than the carrier frequency.

Ratio detector

This detector is very popular because it provides a higher level of amplitude-variation rejection. This results in greater noise immunity (since most noise is amplitude noise) and allows satisfactory operation with lower levels of limiting circuitry. The ratio detector consists of a frequency-sensitive, phase-shift circuit with a transformer and two diodes in series. The transformer performs the detection of the frequency variations of the incoming signal.

Foster–Seeley detector

This detector, also called a phase-shift detector, was invented in 1936 by Dudley Foster and Stuart Seeley [49]. It employs a double-tuned, radio frequency transformer to convert frequency variations of its input signal to amplitude variations. These amplitude variations are then rectified and filtered to give a dc output voltage that varies in both amplitude and polarity as the frequency of the input signal varies. The output voltage is zero when the input frequency is equal to the carrier frequency \(f_{c}\). A typical response curve of the Foster–Seeley detector has the S-form shown in Fig. 4.11, where the frequency band of operation is \(\left[ {f_{1} ,\;f_{2} } \right]\).

Phase-locked-loop detector (PLL)

This detector maintains a fixed (locked) relation to the phase of a “reference” input signal and automatically responds to both the frequency and the phase of the input signal, so that it matches the reference in both frequency and phase. A PLL detector is thus an example of a control system working under negative feedback . PLL FM detectors are very popular and are used in several types of radio equipment, ranging from broadcast receivers to high-performance communications equipment. Their wide use started when integrated circuit technology had advanced enough to allow the manufacture of radio frequency analog circuits. The basic structure of a PLL detector is shown in Fig. 4.12 [42].

Fig. 4.12
figure 12

Basic structure of a PLL detector

The phase comparator compares the phase of the output signal with that of the input signal and sends the phase-error signal to the loop filter, where it is filtered and becomes the control signal u(t) that drives the voltage-controlled oscillator (VCO). The key issue in designing a PLL detector is the loop filter, which must have sufficiently wide bandwidth to allow it to follow the anticipated variations of the frequency-modulated signal.
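
As a rough illustration of this structure, the following discrete-time sketch implements a multiplier-type phase comparator, a proportional–integral loop filter, and a numerically controlled oscillator in Python; the loop gains kp and ki are assumed values chosen only for illustration and are not taken from the text.

```python
import numpy as np

def pll_fm_demod(x, fs, fc, kp=0.08, ki=0.002):
    """Minimal discrete-time PLL FM demodulator (illustrative sketch).
    x: real FM waveform, fs: sampling rate (Hz), fc: nominal carrier (Hz).
    kp, ki: assumed proportional/integral loop-filter gains."""
    phase = 0.0
    freq = 2 * np.pi * fc / fs            # VCO phase increment per sample (rad)
    out = np.zeros(len(x))
    for i, xi in enumerate(x):
        err = xi * (-np.sin(phase))       # phase comparator (multiplier) output
        freq += ki * err                  # integral branch of the loop filter
        phase += freq + kp * err          # VCO: advance phase by the controlled step
        out[i] = fs * freq / (2 * np.pi) - fc   # instantaneous frequency deviation
    return out                            # proportional to the modulating signal
```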

Quadrature detector

The quadrature detector shifts the incoming signal by 90° and multiplies it with the unshifted signal. It does not need a center-tapped transformer, so, unlike the other detectors, it can easily be integrated into a single LSI chip. The 90° phase shift is achieved using a high-reactance capacitor, which passes the phase-shifted signal to an LC circuit tuned to the carrier frequency. Frequency changes then produce an additional leading or lagging phase shift that is sensed by the multiplier. The quadrature FM detector circuit is shown in Fig. 4.13. The operation of the quadrature FM detector is based on the property that multiplying two periodic signals of the same frequency generates a DC voltage that is directly proportional to the signal-phase difference.

Fig. 4.13
figure 13

Quadrature detector circuit (The capacitor \(C_{2}\) shifts the phase of the input signal 90°)

At the resonance frequency, the phase of the LC circuit is zero, below resonance the phase is positive and above resonance the phase is negative, i.e., the phase changes with the variations of the frequency, and the same is also true for the output voltage.

4.4.5.5 Phase Modulation

For a carrier \(c(t) = A\,{ \cos }\left( {\omega_{c} t + \phi_{c} } \right)\) and a modulating signal \(m(t) = \mu \,{ \cos }\left( {\omega_{m} t} \right)\), the phase modulated (PM) signal \(x(t)\) has the following form:

$$x(t) = A\,{ \cos }\left( {\omega_{c} t + \psi (t)} \right),\;\psi (t) = m(t) + \phi_{c}$$

Here, we directly see that, as \(m(t)\) increases or decreases over time, so does the phase shift of the modulated signal. The phase shift can also be viewed as a change of the carrier frequency. Therefore, phase modulation is equivalent to frequency modulation , and it can be analyzed by the methods presented in Sect. 4.4.5.3. We note again that, for small-amplitude modulating signals, we have the undesired result of sidebands and poor efficiency. For a very large single-sinusoid modulating signal, the bandwidth of PM is approximately given by Carson’s rule, as in FM, i.e.,

$$B_{C} = 2\left( {1 + \mu_{PM} } \right)f_{m}$$

where \(\mu_{PM}\) is the PM modulation index defined as \(\mu_{PM} =\Delta \phi\) with \(\Delta \phi\) being the maximum phase shift. An example of PM signal is shown in Fig. 4.14.

Fig. 4.14
figure 14

Typical PM signal

4.4.5.6 Phase Demodulation

Because of the equivalence of FM and PM (the carrier frequency is modulated by the time derivative of the phase shift), phase demodulation can be performed by any FM demodulator. Here, we will outline a phase demodulator of the PLL type, called a “sinusoidal phase detector”, which is shown in Fig. 4.15 [42].

Fig. 4.15
figure 15

Sinusoidal phase detector

In the phase-locked loop detector of Fig. 4.15, the VCO tends to phase lock to the input in “quadrature”, i.e., with a 90° phase difference \(\left( {\phi (t) \to \psi (t) + \pi /2} \right)\). This means that we can define \(\phi (t)\) as \(\phi (t) = \theta (t) + \pi /2\) with \(\theta (t) \to \psi (t)\) as \(t \to \infty\).

In Fig. 4.15, the output \(e(t)\) of the multiplier is equal to

$$\begin{aligned} e(t) & = x(t)y(t) = \left( {A_{x} A_{y} /2} \right)\left[ {\cos \left( {\psi (t) - \phi (t)} \right) + \cos \left( {2\omega_{c} t + \psi (t) + \phi (t)} \right)} \right] \\ & = \left( {A_{x} A_{y} \,/\,2} \right)\left[ {\cos \left( {\psi (t) - \theta (t) - \pi /2} \right) + \cos \left( {2\omega_{c} t + \psi (t) + \theta (t) + \pi /2} \right)} \right] \\ & = \left( {A_{x} A_{y} \,/\,2} \right)\left[ {\sin \left( {\psi (t) - \theta (t)} \right) - \sin \left( {2\omega_{c} t + \psi (t) + \theta (t)} \right)} \right] \\ \end{aligned}$$

Now, assuming that the low-pass filter removes the second (high frequency) term, the output \(u(t)\) of the filter is found to be

$$u(t) = \left( {A_{x} A_{y} /2} \right){ \sin }\left( {\psi (t) - \theta (t)} \right)$$

where \(\theta (t)\) is considered to be the output phase. Here, instead of a sawtooth function \(F\left( \cdot \right)\), we have the sinusoid function:

$$F\left( p \right) = \left( {A_{x} A_{y} /2} \right){ \sin }\left( {p(t)} \right),\;p(t) = \psi (t) - \theta (t)$$

which has the plot shown in Fig. 4.16.

Fig. 4.16
figure 16

The sinusoidal gain function F(·). For comparison, the sawtooth function is also shown

Here, \(\theta (t) \to \psi (t)\), and so the phase \(\phi (t)\) of the detector tends to lock in quadrature with the input. Figure 4.17 shows in the same plot the AM , FM, and PM signals obtained using the same modulating signal and the same carrier in all cases. This plot can be generated by using the proper signal generation (SigGen) modules available in [50].

Fig. 4.17
figure 17

AM , FM, and PM signals compared. The carrier and modulating signals are shown superimposed on the top

4.4.6 Pulse Modulation and Demodulation

General Issues

Analog modulation (AM, FM, PM) is used for transferring an analog low-pass (baseband) signal, such as an audio signal or TV signal, over an analog passband channel, e.g., a restricted radio frequency band or a cable TV network channel. Digital modulation is used for transferring a digital bit stream over an analog passband channel, such as a limited radio frequency band or a public switched telephone network (where a bandpass filter restricts the frequency band to between 300 and 3400 Hz).

Pulse modulation is used to transfer a narrowband analog signal, such as a phone call, over a wideband baseband channel, or in some cases, as a bit stream over another digital transmission system. It is also used in neuron modeling and circuit design. Analog modulation/demodulation techniques were discussed in previous sections. Here, we will briefly discuss the basic pulse modulation schemes. Digital modulation schemes will be presented in the next section. Pulse modulation schemes use a pulse train as the carrier signal. The form of the pulses can be selected from among several types that differ in terms of energy consumption and spectral content. Examples of pulse types are square pulses and raised-cosine pulses. The five basic pulse modulation methods are the following, according to the pulse-train parameter that is modulated (amplitude, width/duration, frequency, and position of the leading edge):

  • Pulse-amplitude modulation (PAM) .

  • Pulse-width modulation (PWM) .

  • Pulse-frequency modulation (PFM) .

  • Pulse-position modulation (PPM).

  • Pulse-code modulation (PCM).

Pulse-Amplitude Modulation and Demodulation

In pulse-amplitude modulation (PAM), the amplitude (height) of individual pulses in the pulse train is varied from its normal value in accordance with the instantaneous amplitude of the modulating signal at the sampling intervals. The width, frequency, and position of the pulses are kept constant. In other words, the information in the modulating signal is carried on a train of pulses, encoded in the amplitude of the pulses.

Figure 4.18 shows a typical PAM signal.

Fig. 4.18
figure 18

Example of PAM in which the amplitude of a square pulse carrier train is modulated by a sinusoid signal

The PAM transmitter design is simple since it is a standard sampler with constant sampling period. Similarly, the receiver (demodulator) is simple.

Pulse-Width Modulation and Demodulation

In pulse-width modulation (PWM), also called pulse-duration modulation (PDM) or pulse-length modulation (PLM), the modulating signal changes the width of individual pulses from their normal values, in proportion to the value of the modulating signal at each sampling interval. The amplitude, frequency, and position of the pulses are kept constant. PWM is extensively used in power electronics and power control applications, because it is a very good method of providing intermediate quantities of electric power between fully on and fully off. It is noted that a standard power switch provides full power when switched on and zero power when switched off. In communication applications, the width of the pulses corresponds to specific data values encoded at one end and decoded at the other end. Pulses of various widths (lengths) that represent the transmitted information are sent at regular intervals determined by the carrier frequency. There is no need for a separate clock signal, because the leading edge of the pulse train plays the role of a clock. However, to avoid a data value being represented by a zero-width pulse, a small leading-edge offset must be added.
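
A natural-sampling PWM signal can be sketched by comparing the message against a sawtooth ramp at the pulse rate; the sample rate, pulse rate, and modulating frequency below are assumed values for illustration only.

```python
import numpy as np

fs, f_pulse, fm = 100_000, 1_000, 50         # assumed sample rate, pulse rate, message frequency (Hz)
t = np.arange(0, 0.1, 1 / fs)
m = 0.5 + 0.4 * np.sin(2 * np.pi * fm * t)   # message mapped into a duty cycle in (0, 1)

saw = (t * f_pulse) % 1.0                    # sawtooth ramp in [0, 1) at the pulse rate
pwm = (m > saw).astype(int)                  # pulse stays high while the ramp is below the message

# Demodulation sketch: a moving-average (low-pass) filter recovers the message,
# since the average value of each pulse equals its duty cycle.
win = int(fs / f_pulse)
recovered = np.convolve(pwm, np.ones(win) / win, mode="same")
```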

Pulse-Frequency Modulation and Demodulation

In pulse-frequency modulation (PFM), the instantaneous amplitude of the modulating signal changes the frequency of the carrier-pulse train, leaving unaltered the amplitude and width of the pulses. PFM is analogous to PWM since the magnitude of the information-bearing (modulating) signal is encoded in the duty cycle of the square pulse train. Compared to PAM, PFM has the advantage of better immunity to noise interference. The drawback is that the design of transmitter and receiver is more complex. For this reason, in practice, PAM or PWM is more commonly used. An example of PFM is shown in Fig. 4.19.

Fig. 4.19
figure 19

Examples of PFM: a modulating signal, b carrier periodic pulse train, and c PFM pulse train

Pulse-Position Modulation

In pulse-position modulation (PPM), the instantaneously sampled values of the modulating signal change the position of each pulse with respect to the position of a periodic reference pulse. The amplitude and width of the pulses are kept constant, so the required transmitter power is constant. PPM has the drawback that it depends on synchronization of the transmitter–receiver pair. PPM is widely used in optical communications. PPM and FSK (frequency-shift keying) modulation are two canonical forms of orthogonal modulation. PPM and FSK are actually dual modulation techniques in the sense that in FSK each signal gets a different slice of the frequency band, whereas in PPM each signal gets a different slice of the signaling interval. A PPM signal can be produced by feeding a PWM signal into a differentiating circuit, which provides positive- and negative-polarity pulses at the rising and falling edges of the PWM pulses. Passing these alternating-polarity pulses through a rectification circuit, which removes the positive fixed (non-modulated) pulses, leaves the negative pulse train, which is the desired PPM signal.

Pulse-Code Modulation and Demodulation

In pulse-code modulation (PCM), the amplitude of the analog modulating signal is sampled with a fixed sampling rate, and then it is quantized to a set of symbols, typically a binary code. The PCM principle is illustrated in Fig. 4.20.

Fig. 4.20
figure 20

The principle of PCM. Sampling and binary coding using 16 quantized levels (i.e., four-bit quantization)

The four-bit coded values of the sinusoid modulating signal at the sampling instants 1 to 12 are shown in Table 4.1.

Table 4.1 PCM of a sinusoid using the four-bit binary code 1-2-4-8
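
A minimal sketch of the sampling and four-bit quantization step is given below; the mapping of the sinusoid’s range \([-1, 1]\) onto the 16 integer levels is an assumption made for illustration and may differ from the level assignment used in the figure and table.

```python
import numpy as np

samples_per_period = 12
t = np.arange(samples_per_period) / samples_per_period
m = np.sin(2 * np.pi * t)                 # one period of the modulating sinusoid

levels = 16                               # four-bit quantization -> 2**4 levels
q = np.round((m + 1) / 2 * (levels - 1)).astype(int)   # map [-1, 1] onto integers 0..15
codes = [format(v, "04b") for v in q]     # four-bit binary code words (1-2-4-8 weighting)

for k, (level, code) in enumerate(zip(q, codes), start=1):
    print(f"sample {k:2d}: level {level:2d} -> {code}")
```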

The general structure of a PCM modulator has the form shown in Fig. 4.21.

Fig. 4.21
figure 21

The pulse-code modulator performs the operations: sampling–holding, quantization, and binary (digital) coding

This structure is implemented in several ways, using suitable single integrated circuits (known as analog-to-digital (A/D) converters).

Pulse-code demodulation reproduces the analog input (message) from the digital output using the reverse sequence of operations. A PCM demodulator circuit is known as a digital-to-analog (D/A) converter (see Fig. 4.22).

Fig. 4.22
figure 22

The PCM demodulator (or D/A converter) is implemented by a binary decoder and an analog holder

Pulse-code modulation has two important features:

  • Noise contamination is almost completely eliminated when the pulse signals exceed the noise levels by at least 20 dB.

  • The signal can be received and retransmitted as many times as desired without distortion.

Other Pulse Modulation Techniques

Besides the five types of pulse modulation just discussed, over the years, several other techniques with associated benefits and drawbacks have been developed. These are as follows [51]:

  • ASK: Amplitude-Shift Keying, where a finite number of amplitudes are used.

  • FSK: Frequency-Shift Keying, where a finite number of frequencies are used.

  • PSK: Phase-Shift Keying, where a finite number of phases are used.

  • BPSK: Binary-Phase Shift Keying.

  • QPSK: Quadrature-Phase Shift Keying.

  • QAM: Quadrature-Amplitude Modulation.

  • ADPCM: Adaptive-Differential PCM.

  • PDM: Pulse-Density Modulation.

Some more specific schemes of quantized modulation (QM) are the following:

  • QAM: Quantized-Amplitude Modulation.

  • QFM: Quantized-Frequency Modulation.

  • QPAM: Quantized-Pulse-Amplitude Modulation.

  • QPM: Quantized-Phase Modulation.

  • QPPM: Quantized-Pulse-Position Modulation.

For descriptions and details about them, the reader is referred to modern textbooks of communication systems [42, 52].

4.5 Information Theory

4.5.1 General Issues

As we saw in Sect. 4.3.5, information theory was formally initiated by Claude Shannon (1948), who introduced the concept of “entropy” as the average “information content” of a message, measured in bits (binary digits) [9]. The technical use of the term “information” goes back to Ralph Hartley’s 1928 paper “Transmission of Information”, where he treated “information” as a measurable quantity, reflecting the receiver’s ability to recognize the sequence of symbols sent by the sender (without any concern about the meaning or semantic aspects of these symbols). Hartley’s formula for information is [8]

$$I = { \log }\,S^{n} = n\,{ \log }\,S$$

where S is the number of possible symbols and n is the number of symbols in a transmission. With the logarithm taken to base 10, the information I is measured in decimal digits, also called hartleys.

Information theory is a mathematical theory principally concerned with coding–decoding. Its primary mathematical tools are probability theory and statistics. Shannon’s formula for entropy is

$$H\left( X \right) = - \sum\limits_{k = 1}^{n} {p_{k} \,{ \log }_{2} \,p_{k} }$$

where \(p_{k} = p\left( {x_{k} } \right)\) are discrete probabilities of the random process (message) X, with possible values \(x_{1} ,x_{2} , \ldots ,x_{n}\), which express the probabilities that a particular message is transmitted. \(H(X)\) is a measure of how much information is contained in the transmitted message.
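
A direct numerical illustration of this formula is given below (a minimal sketch; the example distributions are assumed, not taken from the text).

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_k p_k log_B p_k of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # the term 0*log(0) is taken as 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))                # 1 bit: a fair binary message
print(entropy([0.1, 0.9]))                # ~0.469 bits: a biased binary message
print(entropy([0.25] * 4))                # 2 bits: four equally likely messages
```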

Shannon gave the name “entropy” to the quantity \(H(X)\) upon the suggestion of John von Neumann. Originally, he thought of calling it “information”, but since this word was overly used, he decided to call it “uncertainty”. However, in a private conversation with von Neumann, he was advised to call it “entropy”: “You should call it entropy,” von Neumann suggested, “for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”

In the following, we will present two derivations of the concept of entropy, \(H(X)\), and the four fundamental theorems of information theory , namely,

  • Nyquist–Shannon sampling theorem .

  • Shannon’s source coding theorem.

  • Shannon’s channel capacity theorem.

  • Landau–Pollak bandwidth signal dimensionality theorem .

4.5.2 Information Theory’s Entropy

The following two ways of introducing information theory’s entropy will be discussed:

  • Entropy as the information content of a measurement .

  • Direct definition of entropy .

4.5.2.1 Entropy Derivation via the Information Content of a Measurement

The present derivation is as follows [53]. A real number N is typically expressed by the infinite series:

$$N = d_{ - n} 10^{n} + \cdots + d_{ - 2} 10^{2} + d_{ - 1} 10 + d_{0} + d_{1} 10^{ - 1} + \cdots + d_{n} 10^{ - n} + \cdots$$

where the \(d_{i}\)’s (digits) are integers between 0 and 9, and \(d_{ - n} \ne 0\) for \(n > 0\). In shorthand, N is written in the well-known decimal form:

$$N = d_{ - n} d_{ - n + 1} \cdots d_{ - 1} d_{0 \bullet } d_{1} d_{2} \cdots d_{n}$$

For each number, there exists a unique decimal expansion of this form, provided that we agree to exclude expansions like 2.36999… where nines repeat indefinitely, and express numbers like 2.5000… as 2.5 (omitting the infinitely repeating zeros).

Rational numbers are defined to be ratios of integers, i.e.,

$$a/b = d_{ - n} \cdots d_{ - 1} d_{0 \bullet } d_{1} d_{2} \cdots d_{n} \cdots \quad (b > 0)$$

A decimal expression represents a fraction if and only if some sequence of digits in the expression repeats indefinitely. For example,

$$\begin{aligned} & 1/4 = 0.25000 \ldots ,\quad 1/6 = 0.16666 \ldots \\ & 13/70 = 0.1857142857142857142857 \ldots \\ \end{aligned}$$

With other numbers, the situation is more complex. For example, the decimal expressions for the numbers \(\sqrt 2\) and \(\pi\) (the ratio of the circumference of a circle to its diameter) are as follows:

$$\sqrt 2 = 1.41421356 \ldots ,\;\pi = 3.14159265 \ldots$$

where all digits must be calculated one by one. These classes of numbers differ in the amount of information they convey and in the effort required to obtain that information.

Although in some cases (like the ones mentioned above) we know explicitly all the digits of the expansion of a number, this is not so when we observe or measure a certain physical quantity. What one can do here is to obtain successive refined intervals that bound the actual value m of the quantity M at hand. Suppose that we get an estimate of the true value m in the interval \(\hat{m}_{1} \in \left( {a_{1} ,b_{1} } \right)\). If this is not satisfying, a second more precise measurement is performed which gives

$$\hat{m}_{2} \in \left( {a_{2} ,b_{2} } \right)\;{\text{with}}\;a_{1} < a_{2} < b_{2} < b_{1}$$

Clearly, the new bounding interval \(\left( {a_{2} ,b_{2} } \right)\) provides some additional digits in the decimal expansion of m. The number of decimal digits determined by the original measurement is approximately \(- { \log }_{10} \left( {b_{1} - a_{1} } \right)\), and the number of digits determined by the more accurate measurement is approximately \(- { \log }_{10} \left( {b_{2} - a_{2} } \right)\).

The gain \(I_{10}\) in information when we go from the first interval \(\left( {a_{1} ,b_{1} } \right)\) to the second interval \(\left( {a_{2} ,b_{2} } \right)\) is equal to

$$\begin{aligned} I_{10} = & { \log }_{10} \left( {b_{1} - a_{1} } \right) - { \log }_{10} \left( {b_{2} - a_{2} } \right) \\ = & { \log }_{10} \left[ {\left( {b_{1} - a_{1} } \right)/\left( {b_{2} - a_{2} } \right)} \right]\quad {\text{decimal}}\;{\text{digits}} \\ \end{aligned}$$

If the number N is represented in binary form: \(N = c_{ - n} \cdots c_{ - 1} c_{0} \cdot c_{1} c_{2} \cdots c_{n}\) where \(c_{i} = 0\) or 1 (for all i), the information gain is equal to

$$I_{2} = { \log }_{2} \left[ {\left( {b_{1} - a_{1} } \right)/\left( {b_{2} - a_{2} } \right)} \right]\;{\text{bits}}$$

Since \(b_{1} - a_{1}\) is the length of the interval with limits \(a_{1}\) and \(b_{1}\), and \(b_{2} - a_{2}\) is the corresponding length of the second interval with limits \(a_{2}\) and \(b_{2}\), the ratio \(\left( {b_{2} - a_{2} } \right)/\left( {b_{1} - a_{1} } \right)\) represents the probability p that a random point in the interval \(\left( {a_{1} ,b_{1} } \right)\) lies in the interval \(\left( {a_{2} ,b_{2} } \right)\), i.e.,

$$p = \left( {b_{2} - a_{2} } \right)/\left( {b_{1} - a_{1} } \right)$$

Thus, the gain in information can be expressed as

$$I_{2} = { \log }_{2} \left[ {\left( {b_{1} - a_{1} } \right)/\left( {b_{2} - a_{2} } \right)} \right] = { \log }_{2} \left( {\frac{1}{p}} \right),\quad 0 \le p \le 1$$
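
A small numerical sketch of this formula, with assumed measurement intervals, is given below.

```python
import numpy as np

# Assumed example: a first measurement bounds the value to (2.0, 3.0),
# and a second, more precise one narrows it to (2.25, 2.50).
a1, b1 = 2.0, 3.0
a2, b2 = 2.25, 2.50

p = (b2 - a2) / (b1 - a1)                 # probability that a random point of (a1,b1) lies in (a2,b2)
I2 = np.log2((b1 - a1) / (b2 - a2))       # information gain in bits, log2(1/p)
print(p, I2)                              # 0.25 and 2.0 bits
```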

Now, consider a system that can be in one of the finite sets of states \(x_{1} ,x_{2} , \ldots ,x_{n}\) (arbitrarily ordered), and let \(p_{k}\) be the probability of the system to be in state \(x_{k}\). Clearly, the probabilities \(p_{k}\) must satisfy the total probability condition:

$$\sum\limits_{k = 1}^{n} {p_{k} = 1}$$

Suppose that an observation of the system reveals that the system is in state \(x_{k}\). To calculate how much information was gained after this observation, we assume that our observation system associates with each state a probability interval. For example, state \(x_{1}\) with probability \(p_{1}\) corresponds to the interval \(0 < x < p_{1}\), state \(x_{2}\) with the interval \(p_{1} < x < p_{1} + p_{2}\), and so on. In general, the kth state \(x_{k}\) is associated with the interval:

$$p_{1} + p_{2} + \cdots + p_{k - 1} < x < p_{1} + p_{2} + \cdots + p_{k}$$

with limiting condition for the nth state of

$$p_{1} + p_{2} + \cdots + p_{n - 1} < x < p_{1} + p_{2} + \cdots + p_{n} = 1$$

After observing that the system is in state \(x_{k}\), the length of the measurement interval becomes \(p_{k}\), whereas, at the beginning (without any measurement), we have \(0 < x < 1\), i.e., a length interval equal to one. Therefore, the gain in information compared with the case before the observation is

$${ \log }_{2} \left( {1/p_{k} } \right)$$

The average information \(\overline{I}\) gained per observation, averaged over the states \(x_{1} ,x_{2} , \ldots ,x_{n}\), is equal to

$$\begin{aligned} \overline{I} = & \,p_{1} \,{ \log }_{2} \left( {1/p_{1} } \right) + p_{2} \,{ \log }_{2} \left( {1/p_{2} } \right) + \cdots + p_{n} \,{ \log }_{2} \left( {1/p_{n} } \right) \\ = & \,\sum\limits_{k = 1}^{n} {p_{k} \,{ \log }_{2} \left( {1/p_{k} } \right)} \\ = & \, - \sum\limits_{k = 1}^{n} {p_{k} \,{ \log }_{2} \,p_{k} = H} \\ \end{aligned}$$

where H is exactly the entropy introduced by Shannon in his 1948 paper [9]. Here, if \(p_{k} = 0\) for some k, the value of the term \(0\,{ \log }_{2} 0\) is taken to be 0, consistent with the limit \(\mathop { \lim }\limits_{{p \to 0^{ + } }} p\,{ \log }_{2} \,p = 0\).

4.5.2.2 Direct Definition of Entropy

This direct way of defining entropy is due to Shannon and can be found in most textbooks on information theory or communication systems . Consider a random variable X; if it is continuous in amplitude, it carries an infinite amount of information. Each realization (presentation) of X can be considered to represent a message. Actually, in a physical or biological system, it is not realistic to assume that the amplitude measurements can have infinite accuracy. Therefore, it seems realistic to assume that the value of X can be uniformly quantized to a finite set of values, in which case X is a discrete random variable:

$$X = \left\{ {x_{k} ;k = 0,\, \pm 1,\, \pm 2, \ldots , \pm n} \right\}$$

where 2n + 1 is the total number of discrete levels. Now, suppose that the event \(X = x_{k}\) occurs with probability:

$$p_{k} = P\left( {x_{k} } \right) = P\left( {X = x_{k} } \right)$$

under the conditions:

$$0 \le p_{k} \le 1\quad {\text{and}}\quad \sum\limits_{k = - n}^{n} {p_{k} = 1}$$

If for some k, the event \(X = x_{k}\) occurs with probability \(p_{k} = 1\), in which case all other \(p_{i}\)’s for \(i \ne k\) are zero, the occurrence of the event \(X = x_{k}\) does not add any “ information ” and does not incur any “surprise”, since we know surely what the message is. However, if the various discrete levels occur with different probabilities, the occurrence of some event \(X = x_{k}\) conveys some information (or surprise), which is higher when \(p_{k} = p\left( {x_{k} } \right)\) is lower (i.e., the uncertainty about the occurrence of \(X = x_{k}\) is higher). Thus, the terms “uncertainty”, “surprise”, and “information” can be used interchangeably in the information theory framework. In particular, the amount of information (or surprise) is related to the inverse of the probability of occurrence.

On the basis of this discussion, the amount of information obtained after the observation of the event \(X = x_{k}\) with probability \(p_{k}\) is defined as

$$I_{B} \left( {x_{k} } \right) = { \log }_{B} \left( {1/p_{k} } \right) = - { \log }_{B} \,p_{k}$$

where the base B of the logarithm is arbitrary: when \(B = 2\), \(I_{B} \left( {x_{k} } \right)\) is measured in bits; when \(B = 10\), in decimal digits (hartleys); and when \(B = e\) (natural logarithm), in natural units (nats).

The quantity \(I\left( {x_{k} } \right)\) defined here has the following properties:

  • \(I_{B} \left( {x_{k} } \right) = 0\) for \(p_{k} = 1\) (i.e., no information is gained if the occurrence of \(X = x_{k}\) is absolutely certain).

  • \(I_{B} \left( {x_{k} } \right) \ge 0\) for \(0 \le p_{k} \le 1\) (i.e., a loss of information never occurs).

  • \(I_{B} \left( {x_{k} } \right) > I_{B} \left( {x_{i} } \right)\) for \(p_{k} < p_{i} \;\,\left( {k \ne i} \right)\) (i.e., the less probable an event is, the more information is gained after its occurrence).

Clearly, the quantity \(I_{B} \left( {x_{k} } \right)\) is a random variable with probability \(p_{k}\). Therefore, the mean value \(\overline{I}_{B}\) of \(I_{B} \left( {x_{k} } \right)\) over the entire range of \(2n + 1\) discrete values is equal to

$$\begin{aligned} \overline{I}_{B} (X) = & \,E\left[ {I\left( {x_{k} } \right)} \right] = \sum\limits_{k = - n}^{n} {p_{k} I\left( {x_{k} } \right)} \\ = & \, - \sum\limits_{k = - n}^{n} {p_{k} \,{ \log }\,p_{k} = H(X)} \\ \end{aligned}$$

and is called the entropy of the discrete random variable X, which takes a finite set of \((2n + 1)\) values. The entropy \(H(X)\) provides a measure of the average amount of information provided per message. The basic properties of the entropy \(H(X)\) are as follows:

  • \(H(X)\) is continuous (a small change of the probabilities yields a small change in \(H(X)\));

  • \(0 \le H(X) \le { \log }_{B} (2n + 1)\);

  • \(H(X) = 0\) if and only if \(p_{k} = 1\) for some k (i.e., if there is no uncertainty);

  • \(H(X) = { \log }_{B} (2n + 1)\) if and only if all discrete levels are equally probable (\(p_{k} = 1\,/\,(2n + 1)\) for all k);

  • \(H(X)\) is additive (i.e., the entropy is independent of how the process is divided into parts). This allows us to calculate the entropy of a system via the entropies of its subsystems, if we know the interactions between the subsystems. That is, if the system is divided into M blocks (subsystems) with \(b_{1} ,b_{2} , \ldots ,b_{M}\) elements each, we have

    $$\begin{aligned} H\left( {\frac{1}{2n + 1}, \ldots ,\frac{1}{2n + 1}} \right) & = H_{M} \left( {\frac{{b_{1} }}{2n + 1}, \ldots ,\frac{{b_{M} }}{2n + 1}} \right) \\ & \quad + \sum\limits_{k = 1}^{M} {\frac{{b_{k} }}{2n + 1}} H_{{b_{k} }} \left( {\frac{1}{{b_{k} }}, \ldots ,\frac{1}{{b_{k} }}} \right); \\ \end{aligned}$$
  • \(H(X)\) remains unchanged if the events \(X = x_{k}\) are re-ordered, i.e.,

    $$H\left( {x_{ - n} ,x_{ - n + 1} , \ldots ,x_{0} ,x_{1} , \ldots ,x_{n} } \right) = H\left( {x_{ - n + 1} ,x_{ - n} , \ldots ,x_{0} ,x_{1} , \ldots ,x_{n} } \right);$$
  • For equally probable events, \(H(X)\) increases with the number of possible events, i.e.,

    $$H\left( {\frac{1}{2n + 1}, \ldots ,\frac{1}{2n + 1}} \right) < H\left( {\frac{1}{2n + 2}, \ldots ,\frac{1}{2n + 2}} \right);$$
  • The addition or removal of an event with probability zero does not change the value of the entropy:

    $$H\left( {\frac{1}{2n + 1}, \ldots ,\frac{1}{2n + 1},0} \right) = H\left( {\frac{1}{2n + 1}, \ldots ,\frac{1}{2n + 1}} \right).$$

4.5.2.3 Differential Entropy

The concept of entropy thus far discussed refers to discrete random variables. Here, we will introduce the corresponding concept for the case of a continuous random variable X that has a probability density function \(f(x)\). This concept is the following:

$$\begin{aligned} h(X) = & \, - \int\limits_{ - \infty }^{\infty } {f(x)\,{ \log }_{B} \,f(x)} \,{\text{d}}x \\ = & \, - E\left[ {{ \log }_{B} \,f(x)} \right] \\ \end{aligned}$$

and is called the differential entropy of X (in contrast to the absolute entropy \(H(X)\)). It should be noted at the outset that \(h(X)\) does not by itself provide a measure of the randomness of X. This expression for the differential entropy is justified by using the mean value theorem:

$$f\left( {x_{k} } \right)\delta x = \int\limits_{k\delta x}^{(k + 1)\delta x} {f(x)\,dx}$$

which leads to the Riemannian approximation:

$$\int\limits_{ - \infty }^{\infty } {f(x)} \,dx = \mathop { \lim }\limits_{\delta x \to 0} \sum\limits_{k = - \infty }^{\infty } {f\left( {x_{k} } \right)\delta x}$$

Define

$$\begin{aligned} H_{\delta } = & \, - \sum\limits_{k = - \infty }^{\infty } {f\left( {x_{k} } \right)\delta x\,{ \log }_{B} } \left[ {f\left( {x_{k} } \right)\delta x} \right] \\ = & \, - \sum\limits_{k = - \infty }^{\infty } {f\left( {x_{k} } \right)\delta x\,{ \log }_{B} \,f\left( {x_{k} } \right)} - \sum\limits_{k = - \infty }^{\infty } {f\left( {x_{k} } \right)\delta x\,{ \log }_{B} \,\delta x} \\ \end{aligned}$$

Then, as \(\delta x \to 0\):

$$\mathop { \lim }\limits_{\delta x \to 0} H_{\delta } = - \int\limits_{ - \infty }^{\infty } {f(x)\,{ \log }_{B} \,f(x)} \,{\text{d}}x - \mathop { \lim }\limits_{\delta x \to 0} \left( {{ \log }_{B} \,\delta x} \right)\int\limits_{ - \infty }^{\infty } {f(x)} \,{\text{d}}x$$

Or

$$h(X) = - \int\limits_{ - \infty }^{\infty } {f(x)\,{ \log }_{B} \,f(x)\,} dx = \mathop { \lim }\limits_{\delta x \to 0} \left( {H_{\delta } + { \log }_{B} \,\delta x} \right)$$

where the relation \(\int_{ - \infty }^{\infty }\, {f(x)\,dx = 1}\) was used.

Now, \(\mathop { \lim }\limits_{\delta x \to 0} { \log }_{B} \,\delta x = - \infty\), which implies that the entropy of a continuous variable is infinitely large. This is indeed justified by the fact that a continuous random variable may take any value in the interval \(\left( { - \infty ,\infty } \right)\), and so the uncertainty associated with the variable is infinite. Treating this infinite offset as a reference, the quantity \(h(X)\) is actually a differential entropy . Clearly, the differential entropy is not the limit of the Shannon entropy for \(n \to \infty\); it differs from this limit by the infinite offset \({ \log }\,\delta x \to - \infty\) as \(\delta x \to 0\).
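
As a numerical check of the definition (a sketch under assumed parameters), the differential entropy of a zero-mean Gaussian with standard deviation \(\sigma\) is \(\tfrac{1}{2}\log_{2}(2\pi e\sigma^{2})\) bits, and a Riemann-sum approximation of the defining integral reproduces this value.

```python
import numpy as np

sigma = 2.0
h_analytic = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)    # differential entropy of a Gaussian (bits)

x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
h_numeric = -np.sum(f * np.log2(f)) * dx                   # Riemann approximation of -∫ f log2 f dx

print(h_analytic, h_numeric)              # both close to 3.047 bits for sigma = 2
```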

4.5.2.4 Joint Entropy, Conditional Entropy, and Mutual Information

Joint entropy

Consider two discrete random variables X and Y. The entropy of their pairing (X, Y) is called joint entropy \(H(X,Y)\), i.e.,

$$\begin{aligned} H(X,Y) & = E_{(X,Y)} \left[ { - { \log }_{B} \,p(x,y)} \right] \\ & = - \sum\limits_{x,y} {p(x,y)\,{ \log }_{B} \,p(x,y)} \\ \end{aligned}$$

Conditional entropy

The conditional entropy of the random variable X given the random variable Y is defined as

$$\begin{aligned} H\left( {X\left| Y \right.} \right) & = E_{Y} \left[ {H\left( {X\left| Y \right.} \right)} \right] \\ & = - \sum\limits_{y \in Y} {p(y)} \sum\limits_{x \in X} {p\left( {x\left| y \right.} \right)} \,{ \log }_{B} \,p\left( {x\left| y \right.} \right) \\ & = - \sum\limits_{x,y} {p(x,y)\,{ \log }_{B} \left[ {p(x,y)\,/\,p(y)} \right]} \\ \end{aligned}$$

i.e., \(H\left( {X\left| Y \right.} \right)\) is the average conditional entropy \(E_{Y} \left[ {H\left( {X\left| Y \right.} \right)} \right]\) over Y. It follows that

$$H\left( {X\left| Y \right.} \right) = H(X,Y) - H(Y)$$

Mutual information

The mutual information (or transinformation) is defined to be the amount of information that can be obtained about one random variable X via the observation of another variable Y, i.e.,

$$I(X;Y) = E_{X,Y} \left[ {I_{s} (x,y)} \right] = \sum\limits_{x,y} {p(x,y)\,{ \log }_{B} \frac{p(x,y)}{p(x)\,p(y)}}$$

where \(I_{s} (x,y)\) is the point-wise (specific) mutual information. It follows that \(I(X;Y)\) has the following property:

$$I(X;Y) = H(X) - H\left( {X\left| Y \right.} \right)$$

This means that, if we know Y, we can have a saving of \(I(X;Y)\) bits in encoding X, compared to the case in which we do not know Y. The mutual information has the following property:

$$\begin{aligned} I(X;Y) = & \,H(X) - H\left( {X\left| Y \right.} \right) \\ = & \,H(X) - \left[ {H(X,Y) - H(Y)} \right] \\ = & \,H(X) + H(Y) - H(X,Y) \\ = & \,H(Y) - \left[ {H(X,Y) - H(X)} \right] \\ = & \,H(Y) - H\left( {Y\left| X \right.} \right) = I\left( {Y;X} \right) \\ \end{aligned}$$

i.e., the mutual information is symmetric.
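
These identities can be verified numerically from any joint distribution; the sketch below uses an assumed joint pmf of two binary variables.

```python
import numpy as np

def H(p):
    """Entropy (in bits) of the probabilities in p (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.30, 0.10],             # assumed joint pmf p(x, y): rows x, columns y
                [0.20, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginal distributions

Hx, Hy, Hxy = H(px), H(py), H(pxy)
H_x_given_y = Hxy - Hy                    # H(X|Y) = H(X,Y) - H(Y)
I_xy = Hx + Hy - Hxy                      # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(Hx - H_x_given_y, I_xy)             # the two expressions for I(X;Y) agree
```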

4.5.3 Coding Theory

4.5.3.1 General Issues

Coding theory is the branch of information theory that deals with the transmission of information across communication channels (noiseless, noisy) and the recovery of the messages. Coding theory aims to make messages easy to read, and should not be confused with cryptography, which is concerned with making messages hard to read. One of the principal goals of coding theory is to remove information redundancy (i.e., compress data) and correct the errors that may occur. Specifically, coding theory studies the properties of codes and their suitability for particular applications. Here, a brief outline of coding theory will be given. Full presentations can be found in the respective literature (e.g., [54,55,56,57,58,59,60,61,62,63,64,65,66]).

Coding is categorized into three categories:

  • Source coding .

  • Channel coding.

  • Combined source and channel coding .

Source encoding (or data compression) performs data compression on the information sent by a source, so that the transmission is more efficient. Channel encoding adds extra bits to ensure a transmission that is more robust to the disturbances and noise present on the transmission channel.

4.5.3.2 Source Coding

Source coding (compression) is the process of encoding information using fewer information-bearing units (e.g., bits) than a non-coded representation, through the employment of particular encoding schemes. Compression is very useful since it reduces the consumption of expensive resources. The development of source coding techniques and the design of equipment must be based on compromises among several factors, such as the degree of compression, the level of noise and distortion involved, and the computational resources needed for the coding and recovery of the data.

The two ways of data compression are as follows:

  • Lossless data compression, where the data must be recovered precisely.

  • Lossy data compression, where only as many bits are used as are needed to recover the data with the desired fidelity, as measured by a distortion index.

4.5.3.3 Channel Coding

Channel coding is concerned with maximizing the information rate that the channel can convey reliably, i.e., with acceptable error probability. To this end, codes must be designed that allow fast transmission of data, involve many valid codewords, and can detect and correct various errors. This again requires a number of trade-offs between several conflicting issues. Whereas source coding tries to remove as much redundancy as possible, channel coding designs and investigates error-correcting codes, which add sufficient redundancy (for error detection and correction) to assure efficient and faithful transmission of information across a noisy channel.

4.5.3.4 Error Detecting and Correcting Codes

Error detection

It deals with the detection of errors introduced by noise and other disturbances on the transmission channel from the transmitter to the receiver. The general approach is to add some extra data bits (i.e., redundancy) to a message, which makes possible the detection of any errors in the conveyed message. The additional bits are called check bits and are determined by a suitable algorithm applied to the message bits. To detect any existing errors, the receiver applies the same algorithm to the received data bits and compares the output to the received check bits. If the two do not match, an error has taken place at some point of the transmission.

Error correction

It deals with both the detection of errors and the recovery of the initial, error-free data. The three primary ways of designing the channel code for error correction are as follows:

  • ARQ (Automatic Repeat reQuest): The receiver requests the retransmission of the data that are assumed to contain errors.

  • FEC (Forward Error Correction): The transmitter sends the encoded message with an error-correcting code. The receiver decodes the received message into the most likely data, without sending any request for retransmission.

  • Hybrid ARQ and FEC: Here, minor errors are corrected without retransmission, and major errors are corrected by sending a request for retransmission.

Three ARQ schemes that are suitable for varying or unknown capacity communication channels (such as the Internet ) are as follows:

  • Stop-and-Wait ARQ.

  • Go-back-N ARQ.

  • Selective Repeat ARQ.

FEC is typically employed in lower-layer communication and in storage devices (CDs, DVDs, dynamic RAM). FEC codes are classified into two classes:

  • Block codes which are processed in blocks. In this class, we have, among others, repetition codes, Hamming codes, and multidimensional parity-check codes.

  • Convolutional codes which are processed on a bit-by-bit basis. Here, the so-called Viterbi decoder performs optimal decoding.

4.5.3.5 Block Codes

An \((n,k)\) block code , where k is the number of input bits and n is the number of output bits, is characterized by the following parameters:

$$\begin{aligned} & Code\,rate :\;r = k/n \\ & Channel\,data\,rate :\;R_{0} = R_{n} /r, \\ \end{aligned}$$

where \(R_{n}\) denotes the bit rate of the information source.

A binary block code C of block length n is a subset of the set of all binary n-tuples \({\bar{\mathbf{x}}} = \left[ {x_{0} ,x_{1} , \ldots ,x_{n - 1} } \right]\), where \(x_{i} = 0\) or 1 for \(i = 0,1, \ldots ,n - 1\). An n-tuple belonging to the code is called a code vector or code word.

  • Hamming weight \(w\left( {\left[ {x_{0} ,x_{1} , \ldots ,x_{n - 1} } \right]} \right)\) is the number of nonzero components of the n-tuple.

  • Hamming distance \(d\left( {{\mathbf{x}},{\mathbf{x}}^{{\prime }} } \right)\) between two n-tuples x and \({\mathbf{x}}^{{\prime }}\) is defined as the number of positions in which their components differ. From this definition, it follows that

    $$d\left( {{\mathbf{x}},{\mathbf{x}}^{{\prime }} } \right) = w\left( {{\mathbf{x}} - {\mathbf{x}}^{{\prime }} } \right),$$

    where \({\mathbf{x}} - {\mathbf{x}}^{{\prime }} = \left[ {x_{0} ,x_{1} , \ldots ,x_{n - 1} } \right] - \left[ {x_{0}^{{\prime }} ,x_{1}^{{\prime }} , \ldots ,x_{n - 1}^{{\prime }} } \right] = \left[ {x_{0} - x_{0}^{{\prime }} ,x_{1} - x_{1}^{{\prime }} , \ldots ,x_{n - 1} - x_{n - 1}^{{\prime }} } \right]\)

Minimum Hamming distance, \(d_{ \hbox{min} }\), of the block code C is the smallest Hamming distance between pairs of distinct code words.

Some theorems on the detection and correction ability of block codes are the following [61].

Theorem 1

A code C can detect all patterns of s or fewer errors if and only if \(d_{ \hbox{min} } > s\).

Theorem 2

A code C can correct all patterns of s or fewer errors if and only if \(d_{ \hbox{min} } > 2s\).

This theorem suggests the following implicit decoding rule: “Decode c (the corrupted received version of an actual codeword a) to the nearest codeword in terms of the Hamming distance, i.e., choose the codeword a that minimizes \(w({\mathbf{e}}) = w({\mathbf{c}} - {\mathbf{a}}) = d({\mathbf{c}},{\mathbf{a}})\).”

Definition

A binary code C is linear if, for any a and \({\mathbf{a}}^{{\prime }}\) in C, the (modulo-2) sum \({\mathbf{a}} + {\mathbf{a}}^{{\prime }}\) is also in C. For example, the code \(C = \left\{ {[0,0,0],\;[1,1,0],\;[1,0,1],\;[0,1,1]} \right\}\) is a linear code because \([1,1,0] + [1,0,1] = [0,1,1]\), and so on.

  • The minimum Hamming distance of a linear block code is equal to the smallest weight of the nonzero code words in the code.
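
The sketch below checks these definitions on the four-word linear code of the example: the minimum Hamming distance and the smallest nonzero weight both equal 2.

```python
import numpy as np
from itertools import combinations

C = np.array([[0, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1]])   # the linear code of the example

def weight(x):                  # Hamming weight: number of nonzero components
    return int(np.sum(x != 0))

def distance(x, y):             # Hamming distance: number of differing positions
    return int(np.sum(x != y))

d_min = min(distance(a, b) for a, b in combinations(C, 2))
w_min = min(weight(c) for c in C if weight(c) > 0)
print(d_min, w_min)             # both equal 2

print((C[1] + C[2]) % 2)        # [0 1 1]: closure under modulo-2 addition (linearity)
```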

4.5.3.6 Convolutional Codes

A binary convolutional code is characterized by a 3-tuple (n, k, m), where k is the number of input (message) bits, n is the number of generated (output) bits, and m is the number of memory registers (memory order). As in block codes, the parameter \(r = k\,/\,n\) is called the code rate. Typically, k and n vary from 1 to 8, m from 2 to 10, and the code rate r from 1/8 to 7/8. In deep space applications, code rates r as low as 1/100 or smaller have been used. In many commercial cases, the convolutional codes are described as (r, L), where r is the code rate and L is the constraint length \(L = k(m - 1)\), which represents the number of bits in the encoder memory that affect the generation of the n output bits.

A binary convolutional encoder is represented pictorially by a set of shift registers and modulo-2 adders, where the output bits are modulo-2 sums of selected shift-register contents and present input bits. The diagram of a (2, 1, 2) convolutional encoder is shown in Fig. 4.23.

Fig. 4.23
figure 23

Binary (2, 1, 2) convolutional encoder \(({\text{SR}}_{i} = i{\text{th}}\;{\text{shift}}\;{\text{register}})\)

This is a rate \(r = 1/2\) code. Each input bit is coded into two output bits. The constraint length of the code is \(L = k\left( {m - 1} \right) = 1 \times \left( {2 - 1} \right) = 1\).

The choice of the bits to be added for generating each output bit is determined by the so-called generator polynomial (g) for that output. In Fig. 4.23, the first output has the generator polynomial 111, and the second output has the generator polynomial 101. The output bit streams \({\mathbf{y}}_{1}\) and \({\mathbf{y}}_{2}\) are given by

$$\begin{aligned} y_{1,i} & = {\text{mod}}\;2\left[ u_{1} + u_{0} + u_{ - 1} \right]_{i} = u_{1,i} \, {\oplus}\, u_{0,i} \, {\oplus}\, u_{ - 1,i} \\ y_{2,i} & = {\text{mod}}\;2\left[ u_{1} + u_{ - 1} \right]_{i} = u_{1,i} \, {\oplus}\, u_{ - 1,i} \\ \end{aligned}$$

The generator polynomial gives a unique error protection quality to the code. For example, one code (3, 2, 2) may have entirely different properties from another, depending on the selected polynomial. However, it is noted that not all polynomials lead to good error correction performance. A list of good polynomials for rate 1/2 codes is given in Table 4.2 [54].

Table 4.2 Efficient generator polynomials for 1/2 codes

The encoders of convolutional codes are represented by multivariable (MV) linear time-invariant (LTI) systems as shown in Fig. 4.24.

Fig. 4.24
figure 24

LTI representation of a MV convolutional encoder

The output \({\mathbf{y}}_{j}\) in Fig. 4.24 is given by

$${\mathbf{y}}_{j} = \sum\limits_{i = 1}^{k} {{\mathbf{x}}_{i} * {\mathbf{g}}_{j}^{(i)} }$$

where “*” denotes the convolution operator, and \({\mathbf{g}}_{j}^{(i)}\) is the impulse response of the ith input sequence with respect to the jth output. The impulse responses are the generator polynomials (sequences) of the encoder, previously discussed.

The impulse responses for the binary (2, 1, 2) convolutional code of Fig. 4.23 are

$$\begin{aligned} {\mathbf{g}}_{1} = & \,\left[ {1,1,1,0, \ldots } \right] = \left[ {1,1,1} \right] \\ {\mathbf{g}}_{2} = & \,\left[ {1,0,1,0, \ldots } \right] = \left[ {1,0,1} \right] \\ \end{aligned}$$

The outputs \({\mathbf{y}}_{1}\) and \({\mathbf{y}}_{2}\) corresponding to the input vector \({\mathbf{x}} = \left[ {1\;1\;1\;0\;1} \right]\) are equal to

$$\begin{aligned} {\mathbf{y}}_{1} & = \left[ {1\;1\;1\;0\;1} \right] * \left[ {1\;1\;1} \right] = \left[ {1\;0\;1\;0\;0\;1\;1} \right] \\ {\mathbf{y}}_{2} & = \left[ {1\;1\;1\;0\;1} \right] * \left[ {1\;0\;1} \right] = \left[ {1\;1\;0\;1\;0\;0\;1} \right] \\ \end{aligned}$$
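
These outputs can be reproduced by modulo-2 convolution of the message with the two generator sequences, as in the short sketch below; interleaving the two streams also gives the transmitted sequence obtained below from the generator matrix.

```python
import numpy as np

x  = np.array([1, 1, 1, 0, 1])            # message bits
g1 = np.array([1, 1, 1])                  # generator polynomial 111
g2 = np.array([1, 0, 1])                  # generator polynomial 101

y1 = np.convolve(x, g1) % 2               # [1 0 1 0 0 1 1]
y2 = np.convolve(x, g2) % 2               # [1 1 0 1 0 0 1]

y = np.column_stack([y1, y2]).ravel()     # interleave pair by pair: 11 01 10 01 00 10 11
print(y1, y2, y)
```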

As in block codes , the convolutional codes can be generated by a generator matrix multiplied by the input (information, message) vectors .

Let \(\left\{ {{\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \ldots ,{\mathbf{x}}_{k} } \right\}\) and \(\left\{ {{\mathbf{y}}_{1} ,{\mathbf{y}}_{2} , \ldots ,{\mathbf{y}}_{n} } \right\}\) be the input and output sequences. Arranging the input sequences as

$$\begin{aligned} {\mathbf{x}} & = \left[ {x_{1,0} ,x_{2,0} , \ldots ,x_{k,0} ;x_{1,1} ,x_{2,1} , \ldots ,x_{k,1} ; \ldots } \right] \\ & = \left[ {{\mathbf{u}}_{0} ,{\mathbf{u}}_{1} , \ldots ,{\mathbf{u}}_{p} , \ldots } \right] \\ \end{aligned}$$

and the output sequences as

$$\begin{aligned} {\mathbf{y}} & = \left[ {y_{1,0} ,y_{2,0} , \ldots ,y_{n,0} ;y_{1,1} ,y_{2,1} , \ldots ,y_{n,1} ; \ldots } \right] \\ & = \left[ {{\mathbf{z}}_{0} ,{\mathbf{z}}_{1} , \ldots ,{\mathbf{z}}_{p} , \ldots } \right] \\ \end{aligned}$$

the convolutional encoder is described by the matrix-vector equation:

$${\mathbf{y}} = {\mathbf{xA}}$$

where A is the generator matrix of the code:

$${\mathbf{A}} = \left[ {\begin{array}{*{20}c} {{\mathbf{A}}_{0} } & {{\mathbf{A}}_{1} } & {{\mathbf{A}}_{2} } & \cdots & {{\mathbf{A}}_{m} } & {} & {} \\ {} & {{\mathbf{A}}_{0} } & {{\mathbf{A}}_{1} } & \cdots & {{\mathbf{A}}_{m - 1} } & {{\mathbf{A}}_{m} } & {} \\ {} & {} & {{\mathbf{A}}_{0} } & \cdots & {{\mathbf{A}}_{m - 2} } & {{\mathbf{A}}_{m - 1} } & {{\mathbf{A}}_{m} } \\ {} & {} & \ddots & {} & {} & {} & {} \\ \end{array} } \right]$$

where the \(k \times n\) blocks \({\mathbf{A}}_{p}\) are given by

$${\mathbf{A}}_{p} = \left[ {\begin{array}{*{20}c} {{\mathbf{g}}_{1,p}^{(1)} } & \cdots & {{\mathbf{g}}_{n,p}^{(1)} } \\ {{\mathbf{g}}_{1,p}^{(2)} } & \cdots & {{\mathbf{g}}_{n,p}^{(2)} } \\ \cdots & \cdots & \cdots \\ {{\mathbf{g}}_{1,p}^{(k)} } & \cdots & {{\mathbf{g}}_{n,p}^{(k)} } \\ \end{array} } \right]$$

and \(g_{j,p}^{(i)}\) is the pth element of the impulse response of the ith input with respect to the jth output:

$${\mathbf{g}}_{j}^{(i)} = \left[ {g_{j,0}^{(i)} ,g_{j,1}^{(i)} \ldots g_{j,p}^{(i)} \ldots g_{j,m}^{(i)} } \right]$$

The output vector y of the binary (2, 1, 2) encoder of Fig. 4.23 is given by

$${\mathbf{y}} = {\mathbf{xA}}$$

where \({\mathbf{x}} = \left[ {1\;1\;1\;0\;1} \right]\), and

$${\mathbf{A}} = \left[ {\begin{array}{*{20}c} {11} & {10} & {11} & {} & {} & {} & {} \\ {} & {11} & {10} & {11} & {} & {} & {} \\ {} & {} & {11} & {10} & {11} & {} & {} \\ {} & {} & {} & {11} & {10} & {11} & {} \\ {} & {} & {} & {} & {11} & {10} & {11} \\ \end{array} } \right]$$

i.e.,

$${\mathbf{y}} = \left[ {11,\;01,\;10,\;01,\;00,\;10,\;11} \right]$$

in agreement with the outputs \({\mathbf{y}}_{1}\) and \({\mathbf{y}}_{2}\) obtained above by direct convolution.

A better understanding of the operation of an encoder can be obtained using one or more of the following graphical representations [59,60,61,62,63,64]:

  • State diagram.

  • Tree diagram.

  • Trellis diagram.

The definition and the way these encoder diagrams can be drawn and used are given in the literature [54,55,56,57].

For the decoding of convolutional codes , there are available several different approaches that can be grouped into two basic classes [59,60,61,62,63,64]:

  • Sequential decoding (Fano algorithm).

  • Maximum-likelihood decoding (Viterbi algorithm).

Sequential decoding was one of the earliest techniques developed for decoding convolutionally coded bit streams. It was first introduced by Wozencraft and refined by Fano. Maximum-likelihood decoding is a good alternative that is best implemented by the Viterbi algorithm. For full presentations of the above decoding techniques, the reader is referred to the bibliography [54,55,56,57,58,59,60,61,62,63,64].

4.5.4 Fundamental Theorems of Information Theory

Here, four fundamental theorems of information theory will be reviewed and their roles and usefulness in practice will be explained. These theorems are the following:

  • Nyquist–Shannon sampling theorem .

  • Shannon’s source coding theorem .

  • Shannon’s noisy channel coding and capacity theorem.

  • Landau–Pollak bandwidth signal dimensionality theorem .

4.5.4.1 Nyquist–Shannon Sampling Theorem

This theorem, which was first formulated by Harry Nyquist in 1928 [67] and formally proved by Shannon in 1949 [68], states: “To be able to reconstruct perfectly an analog signal \(x(t)\) from its sampled version \(x(kT),\;k = 0,1,2, \ldots\), the sampling frequency \(1/T\;{\text{Hz}}\) (where T is the sampling period) must be greater than twice the highest frequency W of \(x(t)\).”

If the sampling frequency is less than this limit, then frequencies in the original analog signal that are greater than half the sampling frequency will be “aliased” and will appear in the resulting signal as lower frequencies. If the sampling frequency is exactly twice the highest frequency of the analog input, then “phase mismatching” between the sampler and the signal will distort the signal. Therefore, in practice, an analog low-pass filter must be used before the sampler to guarantee that no components with frequencies above half the sampling frequency remain. This is called an “anti-aliasing filter” and must be very carefully designed, because a poor filter causes phase distortion and other effects. The minimum sampling frequency 2W that permits exact reconstruction of the original signal is known as the Nyquist rate, and the time spacing between samples is known as the “Nyquist time interval”. To express the above concepts mathematically, let a signal \(x(t)\) have a Fourier transform:

$$F\left[ {x(t)} \right] = X(f) = 0\quad for\quad \left| f \right| > W.$$

Then, \(x(t)\) is completely determined by giving the value of the signal at a sequence of points, spaced \(T = 1/2W\) apart. The values \(x_{k} = x(k\,/\,2W) = x(kT)\), T = sampling period, are called the samples of \(x(t)\).

The sampled signal \(x^{*} (t)\) is expressed as the amplitude modulation via \(x(t)\) of a \(\delta -\) pulse (Dirac) train \(\sum\nolimits_{k = - \infty }^{\infty } {\delta (t - kT)}\), i.e.,

$$x^{*} (t) = x(t)\sum\limits_{k = - \infty }^{\infty } {\delta (t - kT)}$$

Since multiplication in the time domain is expressed by convolution “*” in the frequency domain , we have

$$\begin{aligned} X^{ * } (f) = & \,X(f) * \left[ {\frac{1}{T}\sum\limits_{j = - \infty }^{\infty } {\delta \left( {f - j/T} \right)} } \right] \\ = & \,\frac{1}{T}\int\limits_{ - \infty }^{\infty } {X(s)} \sum\limits_{j = - \infty }^{\infty } {\delta \left( {f - s - j/T} \right)\,} {\text{d}}s \\ = & \,\frac{1}{T}\sum\limits_{j = - \infty }^{\infty } {X\left( {f - j/T} \right) = X\left( {{\text{e}}^{j2\pi fT} } \right)} \\ \end{aligned}$$

Figure 4.25 shows pictorially the Fourier transform \(X^{*} \left( f \right) = X\left( {{\text{e}}^{j2\pi fT} } \right)\) of the sampled version \(x^{*} (t)\) of a signal \(x(t)\) that has the Fourier transform \(X(f)\) [42]. In this case, \(X(f) \ne 0\) outside the frequency region determined by the Nyquist frequency \(W = 1/2T\). Therefore, “aliasing distortion” appears, which is due to the overlap of the various periodically repeated sections of \(X^{*} (f)\) in the frequency domain .

Fig. 4.25
figure 25

a Fourier transform of a continuous-time signal, b the corresponding Fourier transform of the sampled version with sampling frequency 1/T. The overlaps indicate that in this case we have “aliasing effects”
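
A minimal numerical illustration of aliasing (with assumed frequencies) is given below: sampling a 3 Hz tone at 4 Hz, i.e., below the Nyquist rate of 6 Hz, produces exactly the same samples as a 1 Hz tone.

```python
import numpy as np

f_signal = 3.0                            # tone frequency (Hz), above half the sampling frequency
fs = 4.0                                  # sampling frequency (Hz); the Nyquist rate would be 6 Hz
k = np.arange(8)

samples = np.cos(2 * np.pi * f_signal * k / fs)
alias   = np.cos(2 * np.pi * (fs - f_signal) * k / fs)   # the 1 Hz alias

print(np.allclose(samples, alias))        # True: the two tones are indistinguishable after sampling
```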

4.5.4.2 Shannon’s Source Coding Theorem

Consider a source coder that assigns to each outcome (sample) x a binary word, which, as we have seen, is called a code word. Here, the vector x is an outcome of a discrete random vector X. The asymptotic equipartition theorem (which is the basis for the entropy’s interpretation) states that for \({\mathbf{x}} \in {\mathbf{X}} = \left\{ {X_{1} ,X_{2} , \ldots ,X_{n} } \right\}\), where \(X_{i}\) are independent trials of X, as \(n \to \infty\) (i.e., asymptotically), there is a set of “typical” outcomes S for which

$$P_{{\mathbf{X}}} ({\mathbf{x}}) \approx 2^{{ - nH({\mathbf{X}})}} ,\;{\mathbf{x}} \in S$$

and the total probability that the outcome is in S is almost one. Since the “typical” outcomes are all equiprobable, this means that there must be approximately \(2^{{nH({\mathbf{X}})}}\) outcomes in S. As n becomes larger, this approximation becomes more accurate [42].

When n is large, the source coder can assign code words only to the “typical” outcomes and omit the “non-typical” ones. By using \(nH({\mathbf{X}})\)-bit code words, we are able to encode each of the \(2^{{nH({\mathbf{X}})}}\) typical outcomes with a unique binary word, for an average of \(H({\mathbf{X}})\) bits per component of the vector x. Since \(H({\mathbf{X}})\) represents the average information obtained from the observation, each outcome of X needs an average of \(H({\mathbf{X}})\) bits. This is true only on average over the n components, not for an individual component.

The source coding theorem states that [42]:

If a source can be modeled as repeated independent trials of a random variable X at r trials per second, then the source can be encoded by a source coder into a bit stream with bit rate less than \(R + \varepsilon\), for any \(\varepsilon > 0\), where \(R = rH({\mathbf{X}})\) is the so-called “rate of the source”.

This source coding theorem establishes the limits of data compression. Stated in another way, Shannon’s source coding theorem says that [69]:

The minimum average number of bits, C, needed to encode n symbols (which are treated as n independent samples of a discrete random vector X with probability \(p_{{\mathbf{X}}} ({\mathbf{x}})\) and entropy \(H({\mathbf{X}})\)) satisfies the relation:

$$H\left( {\mathbf{X}} \right) \le C < H\left( {\mathbf{X}} \right) + 1/n.$$

In practice, \(p_{{\mathbf{X}}} ({\mathbf{x}})\) is not exactly available, but we only have an estimate \(q_{{\mathbf{X}}} ({\mathbf{x}})\) for use in the source coding process. In this case, the corresponding minimum \(C_{q}\) satisfies

$$H({\mathbf{X}}) + KL\left( {p\left\| q \right.} \right) \le C_{q} < H({\mathbf{X}}) + KL\left( {p\left\| q \right.} \right) + 1/n$$

where \(KL\left( {p\left\| q \right.} \right)\) is the relative entropy (or the Kullback–Leibler (KL) divergence), defined as

$$KL\left( {p\left\| q \right.} \right) = \sum\limits_{{\mathbf{x}}} {p_{{\mathbf{X}}} ({\mathbf{x}})\,{ \log }\left[ {p_{{\mathbf{X}}} ({\mathbf{x}})/q_{{\mathbf{X}}} ({\mathbf{x}})} \right] \ge 0}$$

We note that \(KL\left( {p\left\| q \right.} \right)\) measures the discrepancy between the two probability distributions \(p_{{\mathbf{X}}} ({\mathbf{x}})\) and \(q_{{\mathbf{X}}} ({\mathbf{x}})\); by the above inequality, it is the extra average code length paid for coding with the estimate q instead of the true p. Clearly, \(KL\left( {p\left\| q \right.} \right) = 0\) if and only if \(p_{{\mathbf{X}}} ({\mathbf{x}}) = q_{{\mathbf{X}}} ({\mathbf{x}})\).
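A direct numerical evaluation of this quantity is straightforward. The following small Python function (an illustrative sketch; the distributions p and q are made-up examples) computes \(KL(p\|q)\) in bits and confirms the two properties just stated.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy KL(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]                   # "true" distribution (illustrative)
q = [0.4, 0.4, 0.2]                   # estimate used by the source coder
print(kl_divergence(p, q))            # > 0: extra bits per symbol for using q
print(kl_divergence(p, p))            # = 0: no penalty when the estimate is exact
```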

Constructing practical codes that achieve rates close to R is difficult. However, constructing good suboptimal codes is usually easy (see Sect. 4.5.3). In words, the source coding theorem (in either of the two alternative forms above) says that no source coding scheme can do better than the “entropy of the source”, i.e., no code can achieve a bit stream rate less than the “source rate”.
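As one example of such an easily constructed code, the sketch below builds a Huffman code in Python for an assumed (illustrative) four-symbol distribution and checks that its average length L satisfies \(H \le L < H + 1\), the single-symbol (n = 1) form of the bound above. This is only a compact illustration, not the specific constructions discussed in Sect. 4.5.3.

```python
import heapq
from math import log2

def huffman_code_lengths(probs):
    """Return a dict symbol -> code length for a Huffman code over 'probs'."""
    # Each heap entry: (subtree probability, tie-breaker, symbols in the subtree)
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:        # every symbol in the merged subtrees
            lengths[s] += 1            # gets one more code bit
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # illustrative distribution
lengths = huffman_code_lengths(probs)
H = -sum(p * log2(p) for p in probs.values())
L = sum(probs[s] * lengths[s] for s in probs)
print(lengths)    # e.g. {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(H, L)       # H = 1.75 bits, L = 1.75 bits, so H <= L < H + 1
```

For this dyadic distribution the Huffman code attains the entropy exactly; for general distributions the average length lies strictly between H and H + 1.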

Example

Consider a binary random variable X with the set of values (alphabet) \(A = \left\{ {0,1} \right\}\). Its entropy is equal to

$$H({\mathbf{X}}) = - a\,{ \log }_{2} a - (1 - a)\,{ \log }_{2} (1 - a)$$

where a denotes the probability of taking the value 1, \(a = p_{{\mathbf{X}}} (1)\). The entropy \(H({\mathbf{X}})\) is a function of a, i.e., \(H = H(a)\) which has the graphical representation shown in Fig. 4.26 [42].

Fig. 4.26

Graphical representation of the entropy \(H({\mathbf{X}})\) of a binary random variable in terms of the probability \(a = p_{{\mathbf{X}}} (1)\)

We observe that when \(a = 1 - a = 1/2\) (equiprobable values 0 and 1), \(H({\mathbf{X}})\) attains its maximum, in agreement with the theory. The maximum of \(H({\mathbf{X}})\) is one bit. Also, \(H({\mathbf{X}})\) is zero when either \(a = 0\;(1 - a = 1)\) or \(a = 1\), again in agreement with the theory. The fact that, for \(a = 1 - a = 1/2\), the value of the entropy is \(H({\mathbf{X}}) = 1\) means that, to encode repeated samples of X, we need on average one bit per sample (outcome). Here, this is also sufficient for each sample, not just on average, since the source is binary. A source coder that achieves the rate R simply transmits the samples of X unaltered (exactly). Now, suppose that \(a = 0.1\), in which case

$$H({\mathbf{X}}) = - 0.1\,{ \log }_{2} (0.1) - 0.9\,{ \log }_{2} (0.9) \approx 0.47$$

This means that, to encode repeated samples of X, we need on average about 0.47 bits per outcome. Clearly, there are coding methods with an average number of bits greater than 0.47 but less than unity.
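The two numerical values used in this example are easy to reproduce. The following small Python function (an illustrative sketch) evaluates the binary entropy \(H(a)\) of Fig. 4.26 at \(a = 0.5\) and \(a = 0.1\).

```python
from math import log2

def binary_entropy(a):
    """Entropy H(a) in bits of a binary source with P(X = 1) = a."""
    if a in (0.0, 1.0):
        return 0.0                      # no uncertainty at the endpoints
    return -a*log2(a) - (1 - a)*log2(1 - a)

print(binary_entropy(0.5))              # 1.0 bit: the maximum, equiprobable case
print(binary_entropy(0.1))              # ≈ 0.469 bits, quoted as 0.47 in the text
```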

In information theory , when dealing with a language, we speak about the entropy or rate of the language. For example, in the case of the English language, the alphabet consists of 26 letters (symbols) plus some additional symbols such as space, comma, period, etc. These symbols can be treated as n independent samples of a random variable X transmitted via the communication channel, each having a probability \(p_{{\mathbf{X}}} ({\mathbf{x}})\), and entropy \(H({\mathbf{X}}) = - \sum\nolimits_{{\mathbf{x}}} {p_{{\mathbf{X}}} ({\mathbf{x}})\,{ \log }\,p_{{\mathbf{X}}} ({\mathbf{x}})}\). For example, \(p_{{\mathbf{X}}} ({\mathbf{x}} = \text{“s”})\) is much higher than \(p_{{\mathbf{X}}} ({\mathbf{x}} = \text{“z”})\). To minimize the code length for the language, we assign shorter (more compressed) codes to symbols of higher probabilities (here, a shorter code for the letter “s” than for the letter “z”). As the code length for a symbol x with probability \(p_{{\mathbf{X}}} ({\mathbf{x}})\), we can use its “surprise” \(- { \log }\,p_{{\mathbf{X}}} ({\mathbf{x}})\).
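The “surprise” rule can be made concrete with a few letters. The sketch below uses rough, assumed probabilities for “e”, “s”, and “z” (illustrative values only, not measured statistics from the source) and assigns each letter the Shannon code length \(\lceil -\log_2 p \rceil\): frequent letters receive short codes and rare letters long ones.

```python
from math import log2, ceil

# Rough, illustrative letter probabilities (assumed, not taken from the text).
probs = {"e": 0.127, "s": 0.063, "z": 0.0007}

for sym, p in probs.items():
    surprise = -log2(p)          # self-information ("surprise") in bits
    length = ceil(surprise)      # Shannon-code length assigned to the symbol
    print(sym, round(surprise, 2), length)
# "e" needs ~3 bits and "s" ~4 bits, while the rare "z" needs ~11 bits.
```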

4.5.4.3 Shannon’s Noisy Channel Coding and Capacity Theorem

Shannon’s noisy channel coding and capacity theorem deals with the maximum possible efficiency of error-correcting codes (see Sect. 4.5.3) versus levels of data corruption and noise interference [10, 68]. This theorem establishes that a randomly designed error-correcting code is essentially as good as the best possible code, and states [70]:

“The capacity of a discrete-memoryless channel is given by

$$C = { \hbox{max} }_{{p_{{\mathbf{X}}} ({\mathbf{x}})}} \,I(X;Y)\quad ({\text{bits}}/{\text{symbol}})$$

where \(I(X;Y)\) is the mutual information between the channel input X and the output Y:

$$I(X;Y) = H(X) - H\left( {X\left| Y \right.} \right)$$

If the transmission rate R is less than C, then, for any \(\varepsilon > 0\), there exists a code with block length n large enough whose error probability is less than \(\varepsilon\). When \(R > C\), the error probability of any code with any block length is bounded away from zero. If the channel is used m times per second, then the channel capacity in bits per second is \(C^{{\prime }} = mC\).”

Example

Consider a binary symmetric channel with cross-over probability 0.1. Then, \(C \approx 0.5\) bits per transmission. Thus, we can reliably send 0.4 bits per channel use through the channel. This can be achieved by taking (for example) 400 input information bits and mapping them into a code word of length 1000 bits. The whole code word is then transmitted through the channel; although about 100 of the 1000 transmitted bits may be received in error, the 400 information bits can still be decoded correctly. Now, let us consider a continuous-time additive white-Gaussian channel with fully uncorrelated signal and noise. In this case, the channel capacity is found to be [70]:

$$C = W\,{ \log }_{2} \left( {1 + \frac{S}{{N_{0} W}}} \right)\frac{\text{bits}}{\text{s}}$$

where S is the upper bound of the power of the input signal x (measured in watts), \(N_{0}/2\) is the Gaussian noise power spectral density, and W is the channel bandwidth in Hz. Clearly, \(N_{0} W = N\), where N is the total noise or interference power over the bandwidth W.
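The binary symmetric channel figure quoted above is easy to verify numerically. The short Python sketch below (illustrative only) uses the standard expression \(C = 1 - H(\varepsilon)\) for a binary symmetric channel with cross-over probability \(\varepsilon = 0.1\) and confirms that a rate-0.4 code lies below capacity.

```python
from math import log2

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*log2(p) - (1 - p)*log2(1 - p)

eps = 0.1
C = 1.0 - h2(eps)        # capacity of a BSC with cross-over probability eps
print(round(C, 3))       # ≈ 0.531 bits/use, roughly the 0.5 quoted in the example
print(400 / 1000 < C)    # True: a rate-0.4 code (400 info bits in 1000) is below capacity
```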

Example

Using the above formula for the capacity of a Gaussian channel, we can compute the capacity of the voice band of a telephone channel. Typical values in this case are as follows:

$$W = 3000\,{\text{Hz}},\;S/N = 1000\;{\text{or}}\;10\,{ \log }_{10} (1000) = 30\,{\text{dB}}.$$

Thus, \(C = 3000\,{ \log }_{2} (1 + 1000) \approx 30\;{\text{kbits}}/{\text{s}}\)

This means that, under this model, one cannot design modems faster than about 30 kbits/s. Because the signal-to-noise ratio is large, one can expect to be able to transmit \(C/W \approx 10\,{\text{bits}}/{\text{s}}/{\text{Hz}}\) across the telephone channel.
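The arithmetic of this example can be reproduced directly. The Python sketch below (illustrative) evaluates the Gaussian-channel capacity formula for the voice-band values used above.

```python
from math import log2

def awgn_capacity(W_hz, snr):
    """Shannon capacity C = W log2(1 + S/N) in bit/s of a band-limited Gaussian channel."""
    return W_hz * log2(1.0 + snr)

W, snr = 3000.0, 1000.0        # voice-band telephone channel values used above
C = awgn_capacity(W, snr)
print(round(C))                # ≈ 29902 bit/s, i.e., about 30 kbits/s
print(round(C / W, 2))         # ≈ 9.97 bits/s per Hz of bandwidth
```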

Remarks

  (a)

    If the signal and noise are not fully uncorrelated (as in the case of non-white additive noise), the signal-to-noise ratio S/N is not constant with frequency over the bandwidth. In this case, the channel can be modeled as a multitude of narrow independent Gaussian channels in parallel, and so C is given by

    $$C = \int\limits_{0}^{W} {{ \log }_{2} } \left( {1 + \frac{S(f)}{N(f)}} \right)df$$

    where \(S(f)\) and \(N(f)\) are the signal and noise power spectra, respectively, which are functions of the frequency f.

  (b)

    For a large S/N ratio (i.e., \(S\,/\,N \gg 1\)), we get \(C \approx 0.332\,W\,(S/N\;{\text{in}}\;{\text{dB}})\), where \(S/N\;{\text{in}}\;{\text{dB}} = 10\,{ \log }_{10} (S/N)\).

  (c)

    For a very small S/N (i.e., \(S\,/\,N \ll 1\)), we obtain \(C \approx 1.44W(S\,/\,N) = 1.44W\left( {S\,/\,N_{0} W} \right) = 1.44\left( {S\,/\,N_{0} } \right)\), i.e., the channel capacity in this case is (approximately) independent of the bandwidth W. A quick numerical check of the approximations in (b) and (c) is given after these remarks.
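The following short Python sketch (with assumed, illustrative values of W and S/N) compares the exact capacity \(W\log_2(1+S/N)\) with the high-SNR and low-SNR approximations of remarks (b) and (c).

```python
from math import log10, log2

def exact(W, snr):
    """Exact capacity W*log2(1 + S/N) in bit/s."""
    return W * log2(1.0 + snr)

def high_snr(W, snr):
    """Remark (b): C ≈ 0.332 * W * (S/N in dB) for S/N >> 1."""
    return 0.332 * W * 10.0 * log10(snr)

def low_snr(W, snr):
    """Remark (c): C ≈ 1.44 * W * (S/N) for S/N << 1."""
    return 1.44 * W * snr

W = 3000.0                                                   # illustrative bandwidth in Hz
print(round(exact(W, 1000.0)), round(high_snr(W, 1000.0)))   # ≈ 29902 vs ≈ 29880 bit/s
print(round(exact(W, 0.01), 1), round(low_snr(W, 0.01), 1))  # ≈ 43.1 vs ≈ 43.2 bit/s
```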

4.5.4.4 Landau–Pollack Bandwidth Signal Dimensionality Theorem

This theorem is a consequence of the Nyquist–Shannon sampling theorem and states [42]: “A signal cannot be both band-limited and time-limited.”

Actually, a band-limited signal is not time-limited, because its energy cannot be entirely restricted to any finite interval of time. Similarly, a time-limited function is not band-limited because its energy cannot be totally confined to a finite band of frequencies. However, one can assume that a band-limited signal is approximately time-limited, and a time-limited signal is approximately band-limited.

Quantitatively, the Landau–Pollack theorem says: “The signal space of all finite-energy signals is infinite dimensional, but the subset of such signals that are band limited to W Hz and approximately time limited to \(\left[ {0,t_{0} } \right],\;t_{0}\) sufficiently large, is approximately finite dimensional with dimension \(2Wt_{0} + 1\).”

This means that there exists a set of \(2Wt_{0} + 1\) orthonormal basis functions \(\Psi _{i} (t)\), such that for any finite-energy signal \(x(t)\) with energy \(E_{x}\) that is band limited to \(\left| f \right| < W\), for any constant \(\varepsilon\) with \(0 < \varepsilon < 1\), and for any \(t_{0}\) sufficiently large, the following relations hold:

$$\begin{aligned} & \int\limits_{0}^{{t_{0} }} {\left| {x(t)} \right|^{2} } dt > E_{x} (1 - \varepsilon ) \\ & \int\limits_{ - \infty }^{\infty } {\left\{ {x(t) - \sum\limits_{i = 0}^{{2Wt_{0} }} {x_{i}\Psi _{i} (t)} } \right\}^{2} dt < 12\varepsilon E_{x} } \\ \end{aligned}$$

where \(x_{i} ,\;\,i = 0,1, \ldots ,2Wt_{0}\) are the \(2Wt_{0} + 1\) expansion coefficients. In other words, if the maximum fraction of the band-limited signal’s energy lying outside the interval \(\left[ {0,t_{0} } \right]\) is \(\varepsilon\), then this signal can be approximately expressed by a linear combination of a set of \(2Wt_{0} + 1\) orthonormal basis functions, with an energy error less than a fraction \(12\varepsilon\) of the signal’s energy . As \(t_{0}\) gets larger, the fraction of energy outside the dimension-\((2Wt_{0} + 1)\) signal subspace becomes smaller.

To verify that the Landau–Pollack theorem is a consequence of the Nyquist–Shannon sampling theorem , we reason as follows [71]. Suppose that a signal that is both band-limited and time-limited exists and that this signal is sampled at a frequency greater than the Nyquist frequency . Because the signal is time-limited, only finitely many of its samples are nonzero, and these finitely many time-domain coefficients should represent the entire signal. In the same way, the entire spectrum of the band-limited signal should be represented through the finitely many time-domain coefficients resulting from the signal’s sampling. Mathematically, this would require that a (trigonometric) polynomial can have infinitely many zeros, since the spectrum of the band-limited signal must be zero at all frequencies beyond a critical frequency, i.e., on an interval containing infinitely many points. But we know from the fundamental theorem of algebra that a polynomial cannot have more zeros than its degree. This contradiction is due to our incorrect assumption that a signal that is both band-limited and time-limited exists.
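The dimensionality count \(2Wt_0 + 1\) can also be illustrated numerically. The Python sketch below is an illustrative check under assumed values of W and \(t_0\); it uses the orthonormal sinc functions of the sampling theorem as the basis (rather than the prolate spheroidal functions of the original theorem), builds a band-limited signal concentrated in \([0, t_0]\), and verifies that \(2Wt_0 + 1\) expansion coefficients capture nearly all of its energy.

```python
import numpy as np

W, t0 = 4.0, 4.0                       # assumed bandwidth (Hz) and time interval (s)
dim = int(2 * W * t0) + 1              # 2*W*t0 + 1 = 33 basis functions

t = np.linspace(-20.0, 24.0, 200001)   # fine grid extending well beyond [0, t0]
dt = t[1] - t[0]

# Band-limited test signal: a few sinc pulses centred inside [0, t0].
centres, amps = [0.7, 1.9, 3.1], [1.0, -0.8, 0.5]
x = sum(a * np.sinc(2*W*(t - c)) for a, c in zip(amps, centres))

# By the sampling theorem, the expansion coefficients in the orthonormal sinc basis
# psi_n(t) = sqrt(2W) sinc(2W t - n) are proportional to the samples x(n/(2W)); the
# truncated reconstruction below keeps only the dim samples taken inside [0, t0].
n = np.arange(dim)
samples = sum(a * np.sinc(n - 2*W*c) for a, c in zip(amps, centres))
x_hat = sum(s * np.sinc(2*W*t - k) for s, k in zip(samples, n))

E_total = np.sum(x**2) * dt
E_inside = np.sum(x[(t >= 0) & (t <= t0)]**2) * dt
E_error = np.sum((x - x_hat)**2) * dt
print(round(E_inside / E_total, 3))    # close to 1: energy is concentrated in [0, t0]
print(round(E_error / E_total, 4))     # small: 33 coefficients almost reproduce x(t)
```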

4.5.5 Jaynes’ Maximum Entropy Principle

The maximum entropy principle (MEP) was first formulated by Jaynes [26] and is actually a generic optimization problem, namely,

“Find a probability distribution \(p_{X} (x),\;x \in X\), that maximizes the Shannon’s entropy \(H(X)\) subject to a set of given constraints \(c_{1} ,c_{2} , \ldots ,c_{n}\) which express partial information about the probability distribution \(p_{X} (x)\) sought, and also the typical axioms (constraints) of probability theory.”

The most common constraints in practice are expected (mean) values of one or more random variables and/or random functions, or several marginal probability distributions of an unknown joint distribution. More specifically, this optimization problem is formulated as follows:

Suppose we are given a random variable X taking the values \(\left\{ {x_{1} ,x_{2} , \ldots ,x_{n} } \right\}\) where n can be finite or infinite, and the mean (average) values of various functions \(f_{1} (X),f_{2} (X), \ldots ,f_{m} (X)\) where \(m < n\). The problem is to determine the probability assignment \(p_{i} = p\left( {x_{i} } \right)\) which satisfies the given data (constraints):

$$\begin{aligned} & \sum\limits_{i = 1}^{n} {p_{i} = 1,\;p_{i} \ge 0} \\ & \sum\limits_{i = 1}^{n} {p_{i} \,f_{k} \left( {x_{i} } \right)} = E\left[ {f_{k} (X)} \right] =\Xi _{k} ,\quad k = 1,2, \ldots ,m \\ \end{aligned}$$

and maximizes Shannon’s entropy:

$$H(X) = - \sum\limits_{i = 1}^{n} {p_{i} \,{ \log }\,p_{i} }$$

The solution to this mathematical problem can be determined using the well-known method of Lagrange multipliers, which, however, has the drawback that it does not make clear whether a true (global) maximum of \(H(X)\) has been obtained. Here, without loss of generality, we will develop the solution for the case of a single constraint function \(f(X) = X\) (i.e., a given mean value), in which the constraints on \(f_{k} (X)\) reduce to

$$\sum\limits_{k = 1}^{n} {p_{k} x_{k} = E(X) = \xi }$$

Also, for simplicity, the logarithm log in \(H(X)\) is assumed to be the natural logarithm \(\left( {{ \log }_{\text{e}} = { \ln }} \right)\) [29]. We start the solution by converting the above constrained optimization problem into the equivalent unconstrained optimization problem, with respect to \(p_{k} \left( {k = 1,2, \ldots ,n} \right)\) and the Lagrange multipliers \(\lambda\) and \(\mu\), of maximizing the Lagrange function:

$$L = - \sum\limits_{k = 1}^{n} {p_{k} \,{ \ln }\,p_{k} } - \lambda \left( {\sum\limits_{k = 1}^{n} {p_{k} - 1} } \right) - \mu \left( {\sum\limits_{k = 1}^{n} {p_{k} x_{k} - \xi } } \right)$$

Then, we write down the associated canonical (partial derivative) equations:

$$\begin{aligned} \partial L/\partial p_{k} & = - { \ln }\,p_{k} - 1 - \lambda - \mu x_{k} = 0,\;k = 1,2, \ldots ,n \\ \partial L/\partial \lambda & = 1 - \sum\limits_{k = 1}^{n} {p_{k} = 0} \\ \partial L/\partial \mu & = \xi - \sum\limits_{k = 1}^{n} {p_{k} x_{k} = 0} \\ \end{aligned}$$

The first n equations can be rewritten as

$$\begin{aligned} p_{1} & = {\text{e}}^{{ - 1 - \lambda - \mu x_{1} }} = {\text{e}}^{ - (1 + \lambda )} {\text{e}}^{{ - \mu x_{1} }} \\ p_{2} & = {\text{e}}^{{ - 1 - \lambda - \mu x_{2} }} = {\text{e}}^{ - (1 + \lambda )} {\text{e}}^{{ - \mu x_{2} }} \\ & \;\;\vdots \\ p_{n} & = {\text{e}}^{{ - 1 - \lambda - \mu x_{n} }} = {\text{e}}^{ - (1 + \lambda )} {\text{e}}^{{ - \mu x_{n} }} \\ \end{aligned}$$

which, if divided by the sum \(p_{1} + p_{2} + \cdots + p_{n} = 1\), gives

$$p_{k} = {\text{e}}^{{ - \mu x_{k} }} /\sum\limits_{i = 1}^{n} {{\text{e}}^{{ - \mu x_{i} }} } ,\;k = 1,2, \ldots ,n$$

Now, multiplying both sides of the \(p_{k}\) equation by \(x_{k}\) and adding, we get

$$\xi = \sum\limits_{k = 1}^{n} {x_{k} \,{\text{e}}^{{ - \mu x_{k} }} } /\sum\limits_{k = 1}^{n} {{\text{e}}^{{ - \mu x_{k} }} }$$

from which we obtain

$$\sum\limits_{k = 1}^{n} {x_{k} \,{\text{e}}^{{ - \mu x_{k} }} } - \xi \sum\limits_{k = 1}^{n} {{\text{e}}^{{ - \mu x_{k} }} } = 0$$

Finally, multiplying this equation by \({\text{e}}^{\mu \xi }\) gives

$$\sum\limits_{k = 1}^{n} {\left( {x_{k} - \xi } \right)\,{\text{e}}^{{ - \mu \left( {x_{k} - \xi } \right)}} = 0}$$

This is a nonlinear equation (with respect to \(\mu\)) and can be solved numerically for \(\mu\). Introducing this value of \(\mu\) into the \(p_{k}\) equations \((k = 1,2, \ldots ,n)\), we find the desired probabilities \(p_{k} ,\;k = 1,2, \ldots ,n\).

As a simple illustration of the maximum entropy principle, let us consider an unbiased (“honest”) die. Here, \(x_{k} = k\;(k = 1,2,3, \ldots ,6)\), and so \(\xi = (1 + 2 + 3 + 4 + 5 + 6)\,/\,6 = 21\,/\,6 = 3.5\). The last equation here becomes

$$\begin{aligned} & - 2.5{\text{e}}^{2.5\mu } - 1.5{\text{e}}^{1.5\mu } - 0.5{\text{e}}^{0.5\mu } \\ & \quad + 0.5{\text{e}}^{ - 0.5\mu } + 1.5{\text{e}}^{ - 1.5\mu } + 2.5{\text{e}}^{ - 2.5\mu } = 0 \\ \end{aligned}$$

The solution of this equation is \(\mu = 0\), and so we find

$$p_{k} = 1/6\;{\text{for}}\;k = 1,2, \ldots ,6$$
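This numerical procedure is easy to implement. The following Python sketch (illustrative; it assumes SciPy is available for root finding) solves the nonlinear equation for \(\mu\) and returns the maximum-entropy probabilities. For the honest-die constraint \(\xi = 3.5\) it recovers \(\mu = 0\) and \(p_k = 1/6\), while a biased constraint such as \(E(X) = 4.5\) (an assumed value used only for illustration) yields a non-uniform distribution.

```python
import numpy as np
from scipy.optimize import brentq

def maxent_probs(x, xi):
    """Maximum-entropy probabilities on outcomes x subject to E[X] = xi."""
    x = np.asarray(x, float)

    def g(mu):
        # The constraint equation derived above: sum_k (x_k - xi) e^{-mu (x_k - xi)} = 0
        return np.sum((x - xi) * np.exp(-mu * (x - xi)))

    mu = brentq(g, -50.0, 50.0)          # numerical root; mu = 0 for the honest die
    w = np.exp(-mu * x)
    return mu, w / w.sum()               # p_k = e^{-mu x_k} / sum_i e^{-mu x_i}

mu, p = maxent_probs([1, 2, 3, 4, 5, 6], 3.5)
print(round(mu, 6), np.round(p, 4))      # 0.0 and [0.1667, ..., 0.1667]

mu, p = maxent_probs([1, 2, 3, 4, 5, 6], 4.5)   # assumed biased mean, for illustration
print(round(mu, 4), np.round(p, 4))      # negative mu: more weight on larger outcomes
```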

Remark

In this example, the mean value \(\xi = E(X)\) was known and used as a constraint on the corresponding probabilities. If \(E(X)\) were not known, then there would be no constraint on the probabilities (other than normalization). In this case, the maximum entropy principle would give equal probabilities (the uniform or equiprobable distribution \(p_{x} (x)\)), which is the only distribution for which the entropy takes its absolute maximum value (see the properties of the entropy in Sect. 4.5.2.2). If the probabilities are subject to constraints, the MEP gives a constrained maximum entropy, which is usually smaller than the entropy of the uniform distribution.

4.6 Concluding Remarks

This chapter is the first of three chapters that deal with the “information pillar” of human life and society. The topics that have been covered include the general definition of the information concept, the historical landmarks of its manifestations, namely, communication (speech, writing, printing, telegraph, telephone, computing, and computers), information theory , computer networks , multimedia, telematics, and informatics, and a technical review of communication and information theory.

Information storage, flow, and processing are inherent processes in nature and living organisms. Organisms do not live separately from other organisms but interact with each other within the ecosystem in which they exist. Many of the existing technological communication and modulation/demodulation models were inspired by life on Earth, or they can be used to model and better understand biological communication models. The exploration of the physical and biological world provides unexpected surprises and discoveries that increase steadily our information entropy.

In a similar way, modern information transmission and communication techniques are affecting and will continue to increasingly affect the social and economic/business activity of people over the coming decades. An example of this is electronic commerce (e-commerce), which substantially reduces sales and operational costs in comparison with traditional stores. The Internet expands and moves e-commerce into an open and global market with big changes in the market structure, size, and players.

More generally, inter-networked information technology enables people to act in their own self-interests and affect the experiences of other people. Overall, it is anticipated that the prospect of return-on-investment (ROI) research in communications, information theory , and information technology is very promising, with directly measurable results in economic strength and social benefits.