1 Introduction

While there are embedded metadata formats available for images, audio files, and videos, they are often limited to technical characteristics, and many of them are not structured. In contrast to MP3 files, which often provide information about the album, singer or band, release year, genre, and might even include the lyrics, video files typically do not have any information embedded in them about the depicted concepts, actors, or the plot. Online video information retrieval often relies on the text surrounding the media files embedded in web pages, mainly due to the huge “semantic gap” between what computers and humans understand (automatically extractable low-level features versus sophisticated high-level content descriptors) [36].

While some semantic image annotation tools (e.g., K-Space Annotation Tool, PhotoStuff, AktiveMedia, M-OntoMat-Annotizer, SWAD, Annotorius, Pundit, ImageSnippets) could be used for annotating video frames (as still images), semantic video annotation requires specialized tools for representing temporal and other information unique to videos.Footnote 1 For this reason, manual, semi-automatic, and automatic annotation tools have been introduced over the years for the semantic enrichment of audiovisual contents.

Video management systems date back to the early 1990s with video databases and conceptual models annotating audiovisual contents with unstructured data (e.g., OVID [33], Vane [10]), all of which were different in terms of spatial and temporal data representation, semantic expressiveness, and flexibility.

Less than a decade after the introduction of video annotation software tools, they began to support structured data. Video annotation tools generating semi-structured output, such as MuViNo,Footnote 2 EXMARaLDA,Footnote 3 the VideoAnnEx Annotation Tool,Footnote 4 ELAN,Footnote 5 the Video Image Annotation Tool (VIA),Footnote 6 the Semantic Video Annotation Suite (SVAS),Footnote 7 VAnalyzer,Footnote 8 the Semantic Video Content Annotation Tool (SVCAT),Footnote 9 Anvil,Footnote 10 and the video annotation tool of Aydınlılar and Yazıcı [1], have been developed in parallel with tools powered by the Resource Description Framework (RDF); however, this paper focuses only on those video annotation tools whose output is structured and includes, or consists exclusively of, RDF.

2 Semantic video annotation

State-of-the-art structured video annotation incorporates multimedia signal processing and formally grounded knowledge representation including, but not limited to, video feature extraction, machine learning, ontology engineering, and multimedia reasoning.

2.1 Feature extraction for concept mapping

A wide range of well-established algorithms exists for automatically extracting low-level video features, such as fast color quantization to extract dominant colors [44] or Gabor filter banks to extract homogeneous texture descriptors [43]. There are also advanced algorithms for video content analysis, such as the Viola-Jones and Lienhart-Maydt object detection algorithms [26, 41], and the SIFT, SURF, and ORB keypoint detection algorithms [23, 28, 35]. The corresponding descriptors can be used as positive and negative examples in machine learning, such as support vector machines (SVM) and Bayesian networks, for keyframe analysis, face recognition, and video scene understanding.
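As a minimal illustration of such feature extraction, the following Python sketch detects ORB keypoints on a single keyframe using OpenCV; the file name and parameter values are illustrative assumptions rather than settings taken from any of the tools discussed in this paper.

import cv2

# Load a keyframe previously extracted from a video (illustrative file name)
frame = cv2.imread("keyframe_000123.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# ORB is a fast, patent-free alternative to SIFT/SURF for keypoint detection
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(gray, None)

# The resulting binary descriptors (one 32-byte row per keypoint) can serve as
# positive and negative examples for a classifier such as an SVM
print(len(keypoints), "keypoints; descriptor matrix shape:", descriptors.shape)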

While useful, many automatically extracted low-level video features are inadequate for representing video semantics. For example, annotating the dominant color or color distribution of a frame does not convey the meaning of the visual content. In contrast, high-level descriptors are suitable for video concept mapping, but they often rely on human knowledge, experience, and judgment. However, manual video concept tagging is very time-consuming and can be biased, too generic, or inappropriate, which has led to the introduction of collaborative semantic video annotation, where multiple users annotate the same resources and improve each other’s annotations [12]. User-supplied annotations can be curated using natural language processing to eliminate duplicates and typos, and to filter out incorrectly mapped concepts. The integrity of manual annotations captured as structured data can be confirmed automatically using LOD definitions. Research results for high-level concept mapping in constrained videos, such as medical videos [16] or sport videos [3], are already promising; however, concept mapping in unconstrained videos remains a challenge [22].
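A minimal Python sketch of the duplicate and typo filtering mentioned above; the tag list and the similarity threshold are illustrative assumptions, and in practice the surviving labels would subsequently be mapped to LOD concepts and validated against their definitions.

import difflib

# Hypothetical user-supplied tags for the same video scene
raw_tags = ["Mexican standoff", "mexican stand-off", "Mexican standof", "pistol", "Pistol "]

def normalize(tag):
    # Lowercase, trim, and collapse hyphen variants for comparison
    return tag.strip().lower().replace("-", " ")

curated = []
for tag in raw_tags:
    candidates = [normalize(t) for t in curated]
    # Treat near-identical spellings (typos, hyphenation variants) as duplicates
    if not difflib.get_close_matches(normalize(tag), candidates, cutoff=0.85):
        curated.append(tag.strip())

print(curated)  # e.g., ['Mexican standoff', 'pistol']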

The next section details multimedia ontology engineering best practices to create machine-interpretable high-level descriptors and reuse de facto standard definitions to formally represent human knowledge suitable for the automated interpretation of video contents.

2.2 Knowledge representation of video scenes

Logical formalization of video contents can be used for video indexing, scene interpretation, and video understanding [37]. Structured knowledge representations are usually expressed in, or based on, RDF, which describes machine-readable statements in the form of subject-predicate-object triples, e.g., scene-depicts-person. The corresponding concepts are defined in 1) controlled vocabularies, consisting of three countable sets of symbols: a set N_C of concept names, a set N_R of role names, and a set N_I of individual names, or 2) ontologies, i.e., quadruples expressed as O = (C, Σ, R, A), where C is a set of concept expressions, R is a set of binary relationships between concepts from C, 〈C, Σ〉 is the taxonomic structure of concepts from C, and A is a set of axioms. Vocabularies are defined in RDF Schema (RDFS), an extension of RDF for creating vocabularies and taxonomies, while complex ontologies are defined in the fully featured Web Ontology Language (OWL). Related terms and factual data might also be derived from other structured data sources, such as knowledge bases and LOD datasets. For example, to declare a video clip depicting a person in a machine-readable format, a vocabulary or ontology that provides the formal definition of video clips and their features is needed, such as the Clip vocabulary from Schema.org,Footnote 11 which is suitable for declaring the director, file format, language, encoding, etc. of video clips (schema:Clip). The “depicts” relationship is defined by the Friend of a Friend (FOAF) vocabularyFootnote 12 (foaf:depicts). The definition of “Person” can be used from schema:Person, which defines typical properties of a person, including, but not limited to, name, gender, birthdate, and nationality.Footnote 13
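The scene-depicts-person pattern above can also be constructed programmatically; the following sketch uses the rdflib Python library, and the clip and person URIs are illustrative assumptions rather than existing resources.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("foaf", FOAF)

clip = URIRef("http://example.com/clip.mp4")          # illustrative clip identifier
person = URIRef("http://example.com/person/JaneDoe")  # illustrative person identifier

g.add((clip, RDF.type, SCHEMA.Clip))      # the clip is a schema:Clip
g.add((person, RDF.type, SCHEMA.Person))  # the depicted individual is a schema:Person
g.add((clip, FOAF.depicts, person))       # the scene-depicts-person triple

print(g.serialize(format="turtle"))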

Well-established commonsense knowledge bases and their corresponding general-purpose upper ontologies that can be used for describing the concepts depicted in videos include WordNetFootnote 14 and OpenCyc.Footnote 15 There are also more specific ontologies for this purpose, such as the Large-Scale Concept Ontology for Multimedia (LSCOM) [31].

The spatiotemporal annotation of video events requires even more specialized ontologies, such as the SWRL Temporal OntologyFootnote 16 and VidOnt,Footnote 17 along with Media Fragment URI 1.0 identifiers.Footnote 18

If the concepts to be described belong to a knowledge domain not covered by existing ontologies, one can create a new ontology by formally defining the classes, properties, and their relationships, preferably in OWL, with a logical underpinning in description logics (DL). DL-based ontologies do not specify a particular interpretation based on a default assumption; instead, they consider all possible cases in which the axioms are satisfied.
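As a minimal sketch of such ontology authoring, the following Python code uses rdflib to declare a few OWL classes and an object property under a hypothetical namespace (http://example.org/video-ontology#); the class and property names are illustrative assumptions, not part of any existing ontology.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/video-ontology#")

g = Graph()
g.bind("ex", EX)

# Classes and a subsumption relationship
g.add((EX.Scene, RDF.type, OWL.Class))
g.add((EX.Movie, RDF.type, OWL.Class))
g.add((EX.MovieScene, RDF.type, OWL.Class))
g.add((EX.MovieScene, RDFS.subClassOf, EX.Scene))

# An object property relating scenes to the movies they are taken from
g.add((EX.sceneFrom, RDF.type, OWL.ObjectProperty))
g.add((EX.sceneFrom, RDFS.domain, EX.Scene))
g.add((EX.sceneFrom, RDFS.range, EX.Movie))

print(g.serialize(format="turtle"))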

In the case of the most expressive OWL 2 ontologies, the set of role expressions R over a signature is defined as R ::= U | N_R | N_R⁻, where U represents the universal role, N_R is the set of role names, and N_R⁻ is the set of inverse roles. The concept expressions of an OWL 2 ontology are defined as the set C ::= N_C | (C ⊓ C) | (C ⊔ C) | ¬C | ⊤ | ⊥ | ∃R.C | ∀R.C | ⩾n R.C | ⩽n R.C | ∃R.Self | {N_I}, where n is a non-negative integer, C, D represent concepts, and R represents roles. Based on these sets, SROIQ(D) axioms can be defined as general concept inclusions (GCIs) of the form C ⊑ D and C ≡ D for concepts C and D (terminological knowledge, TBox), individual assertions of the form C(N_I), R(N_I, N_I), N_I ≈ N_I, or N_I ≉ N_I (assertional knowledge, ABox), and role assertions of the form R ⊑ S, R ≡ S, R_1 ∘ … ∘ R_n ⊑ S, Asy(R), Ref(R), Irr(R), Dis(R, S) for roles R, R_i, and S (role box, RBox) [24], as summarized in Table 1.

Table 1 Syntax and semantics of SROIQ constructors

An interpretation I consists of a set Δ^I (the domain of I) and an interpretation function ·^I, which maps each atomic concept A to a set A^I ⊆ Δ^I, each atomic role R to a binary relation R^I ⊆ Δ^I × Δ^I, and each individual name a to an element a^I ∈ Δ^I. Similar to the constructors, the formal meaning of the axioms is defined by their model-theoretic semantics, as shown in Table 2.

Table 2 Syntax and semantics of SROIQ axioms

As an example, consider a video file of a scene, namely the climax of the movie “The Good, the Bad, and the Ugly” with the trio, Tuco, Blondie, and Angel Eyes, portrayed by Eli Wallach, Clint Eastwood, and Lee Van Cleef, respectively (United Artists, 1966). The aim is to describe the video scene with spatiotemporal data, maintain provenance data, and annotate the movie characters depicted in the various regions of the scene, along with the actors who played the corresponding roles. Using description logic, the knowledge representation of this video scene can be formalized as follows:Footnote 19

Scene(TRIO)
Movie(THEGOODTHEBADANDTHEUGLY)
sceneFrom(TRIO, THEGOODTHEBADANDTHEUGLY)
hasStartTime(TRIO, 02:40:28)
duration(TRIO, 00:04:40)
hasFinishTime(TRIO, 02:45:08)
depicts(TRIO, MEXICANSTANDOFF)
MovieCharacter(TUCO)
portrayedBy(TUCO, ELIWALLACH)
MovieCharacter(BLONDIE)
portrayedBy(BLONDIE, CLINTEASTWOOD)
BLONDIE ≈ MANWITHNONAME
MovieCharacter(ANGELEYES)
portrayedBy(ANGELEYES, LEEVANCLEEF)
depicts(TRIOROI1, TUCO)
depicts(TRIOROI2, BLONDIE)
depicts(TRIOROI3, ANGELEYES)

The concepts, roles, and individuals of this example are defined by multiple ontologies, which have to be declared in order to obtain the full identifiers according to the corresponding namespaces. Due to the relationship between DLs and OWL, the above example can be translated to any RDF serialization. In Turtle, for example, a shot of the Mexican standoff scene of “The Good, the Bad, and the Ugly” can be described as follows:

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix temporal: <http://swrl.stanford.edu/ontologies/built-ins/3.3/temporal.owl#> .
@prefix schema: <http://schema.org/> .
@prefix vidont: <http://vidont.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

vidont:TheGoodTheBadAndTheUgly a schema:Movie .

<http://example.com/trio.mp4> a vidont:Scene ;
    vidont:sceneFrom vidont:TheGoodTheBadAndTheUgly ;
    temporal:hasStartTime "02:40:28"^^xsd:dateTime ;
    temporal:duration "PT4M40S"^^xsd:duration ;
    temporal:hasFinishTime "02:45:08"^^xsd:dateTime ;
    foaf:depicts dbpedia:Mexican_standoff .

vidont:Tuco a vidont:MovieCharacter ;
    vidont:portrayedBy vidont:EliWallach .

vidont:Blondie a vidont:MovieCharacter ;
    vidont:portrayedBy vidont:ClintEastwood ;
    owl:sameAs dbpedia:Man_with_No_Name .

vidont:AngelEyes a vidont:MovieCharacter ;
    vidont:portrayedBy vidont:LeeVanCleef .

<http://example.com/trio.mp4#t=45,46&xywh=538,258,105,511> foaf:depicts vidont:Tuco .
<http://example.com/trio.mp4#t=45,46&xywh=1161,286,47,157> foaf:depicts vidont:Blondie .
<http://example.com/trio.mp4#t=45,46&xywh=1306,206,166,530> foaf:depicts vidont:AngelEyes .

This example incorporates concept and role definitions and individuals from DBpedia,Footnote 20 FOAF, the SWRL Temporal Ontology, Schema.org, VidOnt, and OWL, as well as XML Schema datatypes. The namespaces are declared using @prefix. Note that a is a shorthand notation for the rdf:type predicate. Also note that a series of RDF triples sharing the same subject is abbreviated by stating the subject once and separating each predicate-object pair with a semicolon.

Spatial information is declared using Media Fragment URI 1.0 identifiers. In this example, the position of the selected shot is specified as Normal Play Time according to RFC 2326, which is the default time scheme for media fragment URIs. The movie characters are represented by the top left corner coordinates and the dimensions of the imaginary surrounding rectangles, as shown in Fig. 1.

Fig. 1
figure 1

The top left corner coordinates and dimensions of RoIs can be used for spatial annotation of movie characters. Movie scene by United Artists
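The Media Fragment URI identifiers used above can also be generated programmatically. The following Python sketch is a hypothetical helper function (not part of any surveyed tool) that appends an NPT temporal fragment and an xywh spatial fragment to a video URL.

def media_fragment(url, start=None, end=None, x=None, y=None, w=None, h=None):
    """Build a Media Fragment URI with optional temporal (NPT) and spatial parts."""
    parts = []
    if start is not None and end is not None:
        parts.append(f"t={start},{end}")        # Normal Play Time, in seconds
    if None not in (x, y, w, h):
        parts.append(f"xywh={x},{y},{w},{h}")   # top left corner and dimensions, in pixels
    return url + ("#" + "&".join(parts) if parts else "")

# Reproduces one of the RoI identifiers from the Turtle example above
print(media_fragment("http://example.com/trio.mp4", 45, 46, 538, 258, 105, 511))
# http://example.com/trio.mp4#t=45,46&xywh=538,258,105,511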

In contrast to the tree structure of XML documents, RDF-based knowledge representations can be visualized as graphs. RDF graphs are directed, labeled graphs in which the nodes are the resources and values, and the arrows represent the predicates (see Fig. 2).

Fig. 2
figure 2

A graph visualizing the RDF triples

Because RDF graphs that share the same resource identifiers naturally merge, interlinking LOD concepts and individuals (e.g., dbpedia:Mexican_standoff, dbpedia:Man_with_No_Name) makes the above graph part of the LOD Cloud.Footnote 21

2.3 Ontology-based video indexing and retrieval

Concept relationships have proven to be valuable knowledge resources that can enhance the effectiveness of video retrieval even for ambiguous queries [46]. RDF-based data is inherently machine-interpretable and unambiguous, which can be exploited in video indexing and retrieval. Video annotation tools often apply concept detection scores for a region, keyframe, shot, video clip, or entire video, which tend to perform better than feature extraction based on local descriptors (e.g., SIFT, HoG, HoF) [29, 30]. Each score is a value between 0 and 1 that indicates the presence or absence of a concept: the higher the score, the more likely the concept is depicted. To improve concept detection accuracy, the relations between depicted concepts can be analyzed by computing co-occurrence, visual descriptors, and hybrid semantic similarity, which leverages contextual information for video classification [5]. Description logic-based semantic video annotations can also be complemented by rule-based representations to improve the integrity and correctness of the interpretation [45]. For example, Mexican standoffs can be described with SWRL rules as follows:

foaf:depicts(?s, ?p1) ∧ foaf:depicts(?s, ?p2) ∧ foaf:depicts(?s, ?p3) ∧
vidont:isHolding(?p1, dbpedia:pistol) ∧ vidont:isHolding(?p2, dbpedia:pistol) ∧ vidont:isHolding(?p3, dbpedia:pistol) ∧
vidont:isLookingAt(?p1, ?p2) ∧ vidont:isLookingAt(?p1, ?p3) ∧ vidont:isLookingAt(?p2, ?p1) ∧ vidont:isLookingAt(?p2, ?p3) ∧ vidont:isLookingAt(?p3, ?p1) ∧ vidont:isLookingAt(?p3, ?p2) ∧
temporal:hasStartTime(?e1, ?ste1) ∧ temporal:hasFinishTime(?e1, ?fte1) ∧ temporal:hasStartTime(?e2, ?ste2) ∧ temporal:hasFinishTime(?e2, ?fte2) ∧ temporal:before(?e1, ?e2)
→ foaf:depicts(?s, dbpedia:Mexican_standoff)

The semantically enriched representation can be used by automated mechanisms to recognize the same type of video scenes in different video resources. Moreover, reasoners can use such machine-interpretable descriptions to automatically infer new statements to achieve knowledge discovery.

Once correctly identified, concepts can be interlinked with related data across LOD datasets. In contrast to website contents retrieved through keyphrase-based web search, RDF-based knowledge representations can be queried and manipulated manually or programmatically through the powerful SPARQL query language [25]. A single SPARQL query can combine multiple questions to answer complex information needs that cannot be formulated as keywords in traditional keyphrase-based web search. Furthermore, SPARQL queries can be executed not only on a single dataset, but also across multiple datasets using federated queries.
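As an illustration of such programmatic access, the following Python sketch sends a SPARQL query to the public DBpedia endpoint using the SPARQLWrapper library; the particular query (films starring Clint Eastwood) is an illustrative assumption rather than the output of any tool discussed in this paper. Federated queries follow the same pattern, extended with the SERVICE keyword to reach additional endpoints.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?film WHERE { ?film dbo:starring dbr:Clint_Eastwood . } LIMIT 10
""")

# Each binding holds the URI of a film resource that stars the queried actor
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["film"]["value"])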

2.4 Primary application areas

Multimedia ontologies can be used for high-level scene interpretation, such as event detection [2], moving object detection and tracking [14], and even human intention detection [27]. High-level scene interpretation is suitable for, among others, classification, video surveillance [34], intelligent video analytics, and real-time activity monitoring [9]. Most of these tasks are performed by reasoning over the video contents to recognize situations and temporal events based on human knowledge formally described as ontology concepts, roles, individuals, and rules. By representing fuzzy relationships between the context and depicted concepts of video contents, both deductive and abductive reasoning can be performed [13].

3 A retrospective survey of semantic video annotation tools

Veggie, one of the first video annotation tools to generate RDF output, was introduced by Hunter and Newmarch in 1999 [20]. The Java application produced Dublin Core-based metadata descriptions and video summaries for MPEG-1 videos.

In 2002, Heggland developed OntoLog, an application for searching and browsing temporal media metadata, leveraging RDF for metadata exchange and SMIL for interoperability between different media players [19]. The software supported high-level descriptors not only for entire videos, but also for video shots and frames. OntoLog incorporated RDFS for representing depicted concepts and the relationships between them.

Vannotea, also released in 2002, was a prototype system for real-time collaborative, synchronous indexing, browsing, annotation, and commentary of MPEG-2 videos (see Fig. 3).

Fig. 3
figure 3

Vannotea, an early implementation of structured video annotation tools [21]

Vannotea was based on W3C’s Annotea,Footnote 22 and used RDF for knowledge representation and XPointer to link the annotations to the video resources.

AdveneFootnote 23 (Annotate Digital Video, Exchange on the Net), also released in 2002, was developed over a decade and is still available for download today. In Advene, users can annotate video fragments at arbitrary positions, save the semantically enriched videos, and play the videos with the associated semantics (see Fig. 4).

Fig. 4
figure 4

Temporal segment annotation with Advene

The native file format of Advene, a proprietary binary format, is not ideal. Nevertheless, the software is Linked Data-ready, because every annotation, relation, and view is identified by a URI, and RDF/XML output is supported. However, the software does not incorporate multimedia ontologies beyond FOAF and Dublin Core.

OntoMediaFootnote 24 was developed in 2006 for large multimedia collection management using Semantic Web technologies. The graphical user interface of this standalone Java application offered easy metadata indexing and video retrieval. OntoMedia accepted any input media supported by QuickTime or the Java Media Framework, and could generate RDF and relational database output.

Also in 2006, Bertini and his colleagues developed the Multimedia Ontology Manager (MOM) to combine multimedia ontology engineering with automatic annotation, and generate textual and auditory commentary for video sequences [7]. The automatic video annotation was performed for entire video clips by using similarity checking between visual ontology concepts and extracted clips, and for video sequences by using composite concept patterns. Video clip sequences were annotated with predefined articulated sentences curated by the RACER reasoner.

Annomation,Footnote 25 published in 2008 as a collaborative Linked Data-driven narrative hypervideo application, allowed users to semantically annotate video resources using controlled vocabularies defined in the LOD Cloud. It was restricted to predefined videos hosted by the service, and the semantic annotations were saved in a local repository, making them inaccessible to external semantic agents.

In 2009, the LEMO Annotation Framework was released, providing a uniform, multimedia-enabled annotation model. LEMO addressed video fragments using the MPEG-21 Part 17 (Fragment Identification of MPEG Resources) standard, and exposed data as Linked Data [17].

IMAS, also published in 2009, was a web-based annotation tool for media resources that generated annotations using a set of proprietary ontologies [42]. IMAS imported images and videos from a media repository, but did not support media fragments. The output of IMAS was suitable for producers only, rather than general-purpose online publishing.

SemWebVidFootnote 26 was an Ajax web application released in 2010 that automatically generated YouTubeFootnote 27 video descriptions in RDF, taking manually added tags and closed captions into account. SemWebVid used natural language processing APIs to analyze the descriptors and mapped the results to LOD concepts using the DBpedia, Uberblic, Any23, and rdf:about APIs, and the now-discontinued Sindice API. Provenance data was color-coded, which was an original idea; however, the resulting text was not always easy to read (see Fig. 5).

Fig. 5
figure 5

Comprehensive concept mapping to LOD in SemWebVid [40]

The application implemented YouTube Data API v2, which was replaced by the backward-incompatible YouTube Data API v3 in April 2015. Consequently, SemWebVid no longer works.

Also in 2010, Choudhury and Breslin introduced a framework to annotate and retrieve online videos with light semantics, and integrate structured video annotations into the Linked Open Data Cloud by reusing important terms from Dublin Core, FOAF, and SKOS [11]. In the same year, the EuropeanaConnect Media Annotation Suite (ECMAS) was released, which used both plain text and semantic tags for the knowledge representation of depicted concepts [18].

Pan, an ontology-based online video annotation tool to import and edit OWL ontologies with MPEG-7 alignment, was also developed in 2010 [6]. Pan can browse videos, provide a mechanism for the user to select concepts from an ontology structure, add and edit annotations, and load previously saved annotations. The annotations are managed by another tool, Orione, an ontology-based search engine. Pan is not future-ready, because it was written in Adobe Flex and ActionScript 3, i.e., it requires the Flash plugin, which is now deprecated in favor of HTML5 and JavaScript.

YUMAFootnote 28 was an annotation tool released in 2011, which supported image, audio, and video files [39]. YUMA suggested DBpedia and GeoNamesFootnote 29 terms, and exported the results to RDF using a proprietary vocabulary, along with LEMO and Open Annotation.Footnote 30

The ConnectME toolset was released in 2012, comprising an HTML5-based semantically enriched video player and an online video annotation tool. The ConnectME framework identified, annotated, and deployed video concepts as Linked Data [32]. The user interface displayed timestamps next to the video player, along with the corresponding labels and LOD URIs (see Fig. 6), although using prefixes would have made the URIs more compact, easier to read, and easier to fit in the program window (leaving more space for the video player, the search box, and the explorer).

Fig. 6
figure 6

The ConnectME hypervideo annotation suite incorporated temporal information with labels and LOD concepts

SemTube, a YouTube video annotation prototype, was also released in 2012. It expressed the context of YouTube videos in RDF/OWL and OAC [15]. SemTube used RDF, Linked Data, SPARQL, and RESTful APIs for data import and export. Data retrieval from SemTube annotations supported keyword-based search, Linked Data-powered faceted search, and SPARQL queries. One of the preferred LOD datasets for concept mapping in both SemTube and SemWebVid was Freebase, which was discontinued in 2015, with some of its articles transferred to Wikidata.Footnote 31

Many of the annotation tools discussed above were built with proprietary APIs, which have changed over the years, breaking the functionality of the original program code. Tools that have not been updated to reflect these API changes have stopped working partially or completely. Also, support is limited for most software prototypes, which often had a domain name registered at the time of their release but have later been discontinued. Veggie, OntoLog, OntoMedia, MOM, the LEMO Annotation Framework, IMAS, ECMAS, ConnectME, SemTube, YUMA, and Vannotea are not available online anymore, while SemWebVid and Annomation were available at the time of writing, but were not working.

4 State-of-the-art structured video annotation tools

The TV Metadata GeneratorFootnote 32 was released by Eurecom as part of the LinkedTV project in 2011. Based on a local or online input video file, TV-Anytime or EXMARaLDA metadata files, or SRT subtitle files, the software automatically converts television content metadata into RDF. However, the software cannot generate RDF based on the video content alone, and is essentially limited to the serialization of existing textual data as structured data. The LinkedTV EditorFootnote 33 provides a user interface for broadcasting services, which uses the automatically generated annotations of LinkedTV for the rapid generation of contextual information queues.

Open Video AnnotationFootnote 34 is based on open source JavaScript libraries, such as Video.js,Footnote 35 Annotator,Footnote 36 and RangeSlider.Footnote 37 The developers claim that the software is compliant with W3C’s Open Annotation data formats. Open Video Annotation was designed to provide an intuitive interface for semantic tagging and the playback of semantically enriched videos (see Fig. 7).

Fig. 7
figure 7

In Open Video Annotation, users can take notes on the timeline, view existing annotations, and play annotated video fragments individually

At the time of writing, Open Video Annotation was still under development, and many functionalities of the demo were not yet working.

MyStoryPlayer is a video player capable of the semantic enrichment of multi-angle videos, and was specifically designed for educational videos. It provides an interface for interactive user annotations to be used in action, gesture, and posture analysis, with a focus on the formal representation of relationships between depicted elements in RDF [4]. MyStoryPlayer powers the website of the European eLibrary for Performing Arts (ECLAP),Footnote 38 and provides not only general and technical metadata, such as title and duration, but also timestamp-based data, which can be used to annotate presentations, human dialogues, and arbitrary video events (see Fig. 8).

Fig. 8
figure 8

In MyStoryPlayer, metadata and classification are coupled with timestamp-based snapshot comments

SemVidLODFootnote 39 is a software prototype for the semantic enrichment of online video resources, video files, and streaming media with high-level descriptors using terms from the LOD Cloud. SemVidLOD implements VidOnt, the most expressive decidable multimedia ontology to date [38], to express administrative, technical, and licensing metadata, as well as sophisticated high-level content descriptions in RDF.

5 Comparison of structured video annotation tools

Based on the review of the state of the art, semantic video annotation tools differ in characteristics and functionality along the following technical features:

  • Expressivity. The semantic richness of annotations is determined by the expressivity of the controlled vocabularies and ontologies used for the knowledge representation of the depicted concepts. Some tools are restricted to proprietary controlled vocabulary terms, while others do not provide suggestions but accept arbitrary data.

  • Annotation level. Annotation tools usually specialize in particular types of metadata (technical, administrative, licensing), content descriptors (high-level descriptors), multimedia descriptors (low-level descriptors), structural descriptors (spatial, temporal, and spatiotemporal descriptors), or a combination of these.

    • Low-level descriptor support. Capability to annotate automatically extractable low-level features of videos, such as motion trajectory.

    • High-level descriptor support. Capability to precisely annotate depicted concepts and individuals, such as a person, a car, or a building.

    • Spatial fragment support. Enables working with a portion of the media (Region of Interest, RoI) to represent information about the depicted space, for example to annotate a tumor in a medical video or an actor in a movie.

    • Temporal fragment support. Enables frame sequence segmentation within videos to represent time and events, such as video scenes or a goal in a soccer match video.

  • Standards alignment. Standards make it possible for various platforms and computer systems to communicate with each other and exchange data efficiently, regardless of their structural and functional differences. Standards alignment determines whether standards and de facto standards are implemented (e.g., MPEG-7, Dublin Core, Open Annotation). Video annotation software prototypes may use proprietary formats and mechanisms, which are difficult to implement in large-scale, heterogeneous multimedia systems. Poor standard support, including proprietary vocabulary use, negatively affects interoperability. Open standards are likely to be implemented globally, so they should be preferred.

  • Supported input and output data formats. Some annotation tools are designed for a particular video compression format or codec only (MPEG-1, MPEG-2, MPEG-4/AVC H.264, etc.), or accept only YouTube videos by URL. Ideally, the set of supported formats would include at least the current industry-standard video file formats. Some video annotation software can handle any kind of video file format, as long as the related codecs are installed on the system.

  • Signal processing integration. By integrating signal processing algorithms into video annotation tools, the annotation of low-level features becomes seamless, although the majority of automatically extracted low-level descriptors cannot be used for high-level scene interpretation, as mentioned earlier.

  • Linked Data support. Supporting best practices for publishing structured data, called Linked Data [8], is crucial for semantic multimedia applications. Linked Data provides unique URIs for each video object, media fragment, keyframe, and RoI, along with a mechanism to interlink depicted concepts with arbitrary definitions from the LOD Cloud, and differentiates media files from web resources that convey information about them. Linked Data support is crucial for future multimedia applications.

  • Automation. While manual annotations can be the most sophisticated and accurate annotations, they depend on the experience and background of the user, can be misspelt and ambiguous, do not always incorporate the most relevant keywords, and might be biased by personal preferences. Semi-automatic (supervised) and automated (unsupervised) annotation are desirable to address the above issues of manual annotation and to generate annotations efficiently for the rapidly growing number of online videos.

  • Provenance data support. Storing data source information (preferably by using the PROV Ontology)Footnote 40 is beneficial for video annotations derived from diverse data sources. Provenance data makes data quality assessment easier, can be used to find similar or related resources, and makes LOD concept interlinking more efficient.

  • RDF output. All structured video annotation software must support RDF output in a standard serialization, such as RDF/XML or Turtle. HTML5 Microdata, RDFa, and JSON-LD output formats are also desirable, as they can be embedded directly in website markup (a minimal serialization sketch is shown after this list).

  • Architecture. Web-based semantic video annotation tools are preferred to their desktop counterparts due to benefits such as platform-independence, interoperability, and global availability.

  • Built-in Video Player. Ideally, video annotation tools are embedded in a video player for seamless annotation and hypermedia playback.
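As a minimal sketch of the RDF output requirement above, the following Python code parses a small Turtle description and re-serializes it as JSON-LD, which can be embedded in website markup; it assumes rdflib version 6 or newer, which bundles a JSON-LD serializer, and the clip URI is illustrative.

from rdflib import Graph

turtle_data = """
@prefix schema: <http://schema.org/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
<http://example.com/trio.mp4> a schema:Clip ;
    foaf:depicts <http://dbpedia.org/resource/Mexican_standoff> .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# The JSON-LD output can be placed in a <script type="application/ld+json"> element
print(g.serialize(format="json-ld"))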

Less objective features include user-friendliness, documentation quality and coverage, user support (examples, tutorial videos, contact), long-term availability, licensing, and whether the software is open source.

The following sections compare structured video annotation tools from the four main perspectives: standards support, input and output data formats, concept mapping sources, and spatiotemporal fragmentation.

5.1 Standards alignment

Several multimedia and web standards are required, and often implemented, in structured video annotation tools to provide backward- and forward-compatibility and interoperability. Standards are vital to gain widespread use, obtain optimality in terms of file structure and code length, and consider global needs. The most common international standards in semantic video annotation are DVD-Video (the media and format of which are defined by multiple standards, e.g., ISO/IEC 16448:2002Footnote 41 and ECMA-267,Footnote 42 as well as ISO/IEC 25434:2008),Footnote 43 MPEG-7 (ISO/IEC 15938),Footnote 44 MPEG-21 Part 17 (ISO/IEC 21000-17:2006),Footnote 45 Uniform Resource Locators (IETF RFC 1738),Footnote 46 and Dublin Core (IETF RFC 5013,Footnote 47 ISO 15836:2009,Footnote 48 ANSI/NISO Z39.85).Footnote 49 The technical specifications used by semantic video annotation tools that have not yet been standardized officially by a standardization body but are used globally are known as de facto standards; these include W3C recommendations, such as RDFFootnote 50 and SKOS,Footnote 51 Open Annotation, and the Media Fragment URI (see Table 3).

Table 3 Standards supported by structured video annotation tools

Open standards are preferred to proprietary implementations, such as the temporal annotation of Annomation, the spatial and temporal fragmentation of Advene and SemTube, and proprietary ontologies, e.g., the SALERO ontologies used by IMAS or the LinkedTV ontology implemented by the LinkedTV Editor.

Standards alignment is a necessary but not always sufficient requirement for the long-term viability of software tools. For example, the development of Vannotea was discontinued despite its MPEG-7 compliance, while the implementation of de facto standards, such as the Media Fragment URI or Open Annotation, can explain the continuing success and ongoing development of the LinkedTV Editor, SemVidLOD, and the TV Metadata Generator.

5.2 Supported data formats

The supported input and output file formats can be crucial for the usability of a software tool, especially given the large variety of video container formats, file formats, and codecs. Open formats are preferred to proprietary formats for two reasons. Firstly, open formats make software development easier and interoperability wider. Secondly, the popularity of tools that support only proprietary file formats tends to decline faster than that of tools implementing standardized, open formats. This might be the reason behind the discontinuation of LEMO, which was designed for the Flash Video format, now replaced by HTML5.

The support for multiple input data formats is a user expectation, which is why many video annotation tools can open a variety of video files and handle video streams (see Table 4).

Table 4 Supported data formats of structured video annotation tools

While Linked Data output is expected from semantic video annotation tools, dependence on a particular LOD dataset can be a major design issue. A good example is the now-discontinued SemTube, which used Freebase, since rendered obsolete and succeeded by Wikidata, as its primary LOD dataset for interlinking. However, the still popular DBpedia and GeoNames were the primary LOD datasets of Annomation, ConnectME, and YUMA, all of which have also been discontinued. This suggests that the long-term viability of the LOD dataset URLs generated by semantic video annotation tools does not guarantee the success of these tools.

The tools for annotating YouTube videos (SemTube, SemWebVid) rely on the proprietary YouTube API. Consequently, such tools cannot be used for annotating videos stored on other video sharing portals, such as VimeoFootnote 52 and LiveLeak,Footnote 53 and because the API might change over time, they require updates for each upcoming API version or they stop working.

Those tools that accept video input via URL can use the corresponding URLs directly to add context to RDF triples and to provide a graph identifier for quads (subject-predicate-object-graph name), so that the statements become globally interpretable (see the sketch below). Software tools that open local files only do not have this kind of unique web identifier for the media resources by default.
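A minimal sketch of this quad-based approach using rdflib’s Dataset class; the video URL is an illustrative assumption and doubles as the graph name, and the triple reuses the earlier foaf:depicts example.

from rdflib import Dataset, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DBR = Namespace("http://dbpedia.org/resource/")

video_url = URIRef("http://example.com/trio.mp4")  # the video's own URL names the graph

ds = Dataset()
g = ds.graph(video_url)                            # quads: subject-predicate-object-graph
g.add((video_url, FOAF.depicts, DBR.Mexican_standoff))

print(ds.serialize(format="nquads"))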

5.3 Ontology use

The primary concept mapping sources vary greatly among structured video annotation tools, and include ontologies such as Dublin Core,Footnote 54 the Ontology for Media Resources,Footnote 55 FOAF,Footnote 56 Open Annotation,Footnote 57 and Representing Content in RDF.Footnote 58 Some tools also allow arbitrary ontologies so that they are not limited to the concepts of the primary concept mapping sources (see Table 5).

Table 5 Ontology use of semantic video annotation tools

As shown above, not all annotation tools allow arbitrary ontologies, which is a major limitation even if standardized ontologies are used as the primary concept mapping sources. However, arbitrary ontology support is not necessarily sufficient to gain global adoption, as was the case with Advene, ConnectME, IMAS, and SemTube.

5.4 Spatiotemporal annotation support

Structured video annotation tools support either spatial or temporal annotations, both spatial and temporal annotations, or neither (see Table 6).

Table 6 Spatiotemporal annotation support of semantic video annotation tools

The most common spatiotemporal annotation format in semantic video annotation is W3C’s Media Fragment URI. LEMO used the MPEG-21 Part 17 standard for the same purpose. Some tools (Advene, Annomation, Vannotea) implemented proprietary mechanisms that cannot be processed by any software tool other than the ones that introduced them.

6 Conclusions

In contrast to review papers of multimedia annotation tools that mix the annotation of still images and videos, or do not differentiate between semi-structured and structured output, this comprehensive review explicitly enumerates the milestones of structured video annotation tools, highlights their limitations, and suggests required features for upcoming software tools.

Semantic video annotation tools face many challenges, including, but not limited to, the wide variety of video codecs, the lack of standardized video ontologies, and the vast number of video resources, not to mention the inherent ambiguity of audiovisual contents. The unstructured comments, labels, and tags of traditional video annotation systems come with a degree of formalism inadequate for efficient automated processing. To address this limitation, OWL ontologies and Linked Data can be used for structured video annotation, which can be generated semi-automatically or automatically with semantic video annotation tools. Multimedia ontology engineering has been demonstrated through structured video annotations that leverage standardized definitions as well as concepts from a state-of-the-art ontology, VidOnt, to combine the representation of video fragments, regions of interest, depicted concepts, and spatiotemporal information.

The strengths and weaknesses of ontology-based video scene representation have also been discussed, and the limitations of structured video annotation tools have been highlighted. Despite the potential of these software tools, the development, maintenance, and support of most semantic multimedia annotation software prototypes mentioned in the literature have been discontinued. Very few structured video annotation tools are being actively developed. The state-of-the-art tools differ significantly in terms of supported input data formats, ontology use, standards alignment, Media Fragment URI implementation, and Linked Data support. Some software tools rely heavily on proprietary APIs and software libraries that might change over time. Fortunately, the implementation of Open Annotation and other de facto standard ontologies is increasingly common. Based on this review, it can be concluded that the global adoption of semantic video annotation tools depends on a number of characteristics, including the implemented technologies and standards, the supported input and output file formats, the primary concept mapping sources, Linked Data integration, and the option to use arbitrary ontologies and spatiotemporal fragmentation.

To meet the challenges of future web applications and improve the efficiency of concept mapping, information fusion is desirable, so that manually added tags, closed captions, and audio analysis can support the selection of the most relevant concepts. To provide Linked Data-powered structured annotations for video resources, online semantic multimedia annotation tools are preferred to desktop tools, using technologies such as HTML5, JavaScript, and Ajax in combination with Semantic Web standards. This can be achieved by a paradigm shift in the software design of semantic multimedia annotation tools, namely by adding the capability to open videos by URL (as opposed to opening video files from local repositories), supporting Linked Data and spatiotemporal fragmentation, and using modern multimedia ontologies for high-level concept descriptors. The resulting interoperable video annotation output leverages Semantic Web standards for easy data distribution, sharing, reuse, and personalization, setting a new direction for online video sharing and next-generation video retrieval.