[1] Welcome to the second lesson in the lecture "Introduction to Digital Image and Artefact Science". The framework topic of the following lessons is Digitisation and Data Management, and today we are talking about Experiencing - Expressing - Understanding: how do you actually digitise culture? If, for example, you scan the works of the physicist Georg Christoph Lichtenberg, or his statue in front of the Pauline Church, nothing is thereby said about the content of those works. Lichtenberg was known above all for the ambiguity of his aphorisms. Can culture in all its complexity be digitised at all? This is a difficult question, and it will lead us to the fundamental differences between the humanities and the natural sciences. [2] One lecture hour is not enough to settle the question of what culture is, for what counts as cultural heritage is subject to selection by the respective society. For today, only this much: the Hague Convention for the Protection of Cultural Property in the Event of Armed Conflict was concluded in 1954 as a treaty under international law. It aims to protect cultural property during war or armed conflict from destruction or damage as well as from theft, looting and other forms of unlawful seizure. The preamble states that "any damage to cultural property, regardless of the people to whom it belongs, is damage to the cultural heritage of all humanity, because each people makes its contribution to the culture of the world". Cultural property is defined here as "movable or immovable property of great importance to the cultural heritage of peoples". Movable cultural property includes, for example, paintings, sculptures, archaeological finds, books, manuscripts and archives. Immovable cultural property includes monuments (such as the Seven Wonders of the World) and other sites such as museums, libraries, archives and excavation sites that serve the exhibition, use, safekeeping and protection of movable cultural property. Cultural heritage, however, also includes intangible assets such as endangered languages or dialects, as well as folk festivals and customs. [3] Everyone will define differently what should be understood as culture. A fundamental polarity within society was pointed out by Charles Percy Snow in his 1959 London lecture "The Two Cultures": "I believe," he said, "the intellectual life of the whole of Western society is increasingly being split into two polar groups. [...] Literary intellectuals at one pole - at the other the scientists, and as the most representative, the physical scientists. Between the two there is a gulf of mutual incomprehension - sometimes (particularly among the young) hostility and dislike, but above all incomprehension. They have a curiously distorted image of each other. Their attitudes are so different that even on the level of emotion they cannot find much common ground." [4] Wilhelm Dilthey is considered the founder of the humanities. He says that "humanity, conceived in terms of perception and cognition, would be a physical fact for us and as such would be accessible only to scientific cognition. As an object of the humanities, however, it arises only insofar as human circumstances are experienced, insofar as they are expressed in expressions of life, and insofar as these expressions are understood." The most important difference, according to Dilthey, is that the natural sciences explain processes in nature, while the humanities attempt to understand historical and cultural events. This understanding is examined by the humanities with the historical-hermeneutic method.
Hermeneutics proceeds from the basic assumption that all human circumstances and actions arise from the human mind, which is fundamentally free and autonomous. They are therefore incalculable expressions of human freedom and individuality. Every human being, and every one of his or her actions, is something unique. No natural law and no quantitative-empirical procedure can grasp a person in his or her individuality; but we can try to understand people and their actions with empathy. [5] For the natural sciences, on the other hand, nomothetic approaches to science are the relevant ones. The German philosopher Wilhelm Windelband put it this way in 1894: "In their knowledge of reality, the empirical sciences seek either the general in the form of the natural law or the particular in the historically determined form; they consider in one part the ever-unchanging form, in the other part the unique, intrinsically determined content of real events. The one are sciences of laws, the other sciences of events; the former teach what always is, the latter what once was. Scientific thought is - if one may coin new technical terms - nomothetic in the one case, idiographic in the other." Nomothetic, i.e. law-establishing, research aims at formulating universally valid laws. Its methods are experimental and the data collected are quantitative. Nomothetic theories abstract from individual phenomena, a way of thinking that is typical of the natural sciences. According to Windelband, the humanities, by contrast, are idiographic, i.e. they describe particulars. Their goal is the comprehensive analysis of concrete objects of investigation that are unique in time and space. [6] Generality is also the goal of computer science, whose central method is considered to be algorithmics. "Let us imagine," says Alan Turing, "that the operations performed by the computer are split up into 'simple operations' which are so elementary that it is not easy to imagine them further divided." This requirement of the founder of modern computer science had already been fulfilled by Euclid when he described, for example, the arithmetic steps one must perform to find the greatest common divisor of two numbers. One could call this an early form of algorithm, for an algorithm is a formalised, unambiguous set of instructions for solving a problem or a class of problems. Algorithms consist of finitely many well-defined individual steps. [7] In all sciences, the basic procedures for gaining knowledge are deduction and induction. Deduction (or derivation) proceeds from the general to the particular: from a premise and a fact, a new fact is derived as a logically compelling consequence. Take, for example, the premise "All human beings are mortal". Together with the fact "Socrates is a human being", it leads to the conclusion that Socrates, too, must be mortal. In induction, one proceeds from the individual to the general: a general principle is derived from individual facts and the consequences presumably connected with them. Plato, Aristotle and Epicurus were human beings, for example, and the fact that Plato, Aristotle and Epicurus died serves as the observed consequence from which the general statement that all human beings are mortal is derived. [8] Inductively, one cannot establish generally valid rules with certainty. No matter how many white swans are observed, it cannot be ruled out that there are also black ones. But one can conclude that swans are mostly white.
Thus the historical-hermeneutic and the empirical sciences primarily use induction (as an inference to what is probable) to gain knowledge. For them, inductively established rules retain their validity even if an exception becomes known. While computer science is thus centred on an unambiguous rule of action that leads to an unambiguous result, the humanities work with limited knowledge and try to arrive at probable statements or practicable solutions through a variety of analogies and other conjectural inferences. The exactness of the computer is thus contrasted with arguments of probability in the humanities. [9] The historical-hermeneutic method can also be described as a set of instructions for action. It begins with heuristics, i.e. the systematic collection and classification of relevant sources. This also includes disclosing the research question and the interest guiding the investigation. This is followed by criticism, i.e. the sifting and processing of information using the methods of source criticism: which sources come into question? Under what conditions were the sources created, and what influence do author, time and circumstances have on my question? How do I have to process the sources in order to obtain the desired information? The third step is interpretation, which should be a clear analysis of the object of investigation that can be substantiated in ever greater depth with observations and sources. The information must be linked together in a meaningful way, including one's own values, on the basis of which what is being examined is interpreted; ultimately, the explanation of historical change is often in the foreground. The fourth step is the presentation of the result: it should convey a meaningful idea of the historical changes, reflecting argumentatively by engaging with other, complementary but also competing positions, so that a broadening of perspective emerges that is open to progress in knowledge. Scientific thinking thus also becomes a matter of rational argumentation across different viewpoints, oriented towards intersubjective verifiability, the deepening of knowledge and consensus building. Data science, as a new information-processing discipline, investigates and develops scientifically sound methods, processes, algorithms and systems to extract insights, patterns and conclusions from data, whether structured or unstructured. In some ways it takes a very similar approach: heuristics here corresponds to data collection, criticism to data processing, interpretation to data analysis, and presentation is simply replaced by the result. [10] Hermeneutics as a method is thus a systematised, practical procedure for understanding and interpreting texts, pictorial works or other human expressions in a reflective way. Comparison and analogy are its most important procedures. Accordingly, many ways of gaining knowledge, and many results, are possible. The humanities are more concerned with the interpretation and evaluation of phenomena than with exact, quantitatively measurable statements. From the perspective of information processing, the methods of the humanities are therefore less standardised, their statements more qualitative than statistically evaluable, and on the whole less formalised. If statistical methods and procedures are used, their results are not the research result itself; rather, they must be seen, interpreted and evaluated in the overall context.
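To make the contrast concrete: the Euclidean procedure mentioned above can be written down as exactly such a finite, unambiguous rule of action. A minimal illustrative sketch in Python (the function name and the sample numbers are, of course, merely illustrative):

    def gcd(a: int, b: int) -> int:
        """Euclid's algorithm: replace the pair (a, b) by (b, a mod b)
        until the remainder is zero; the last non-zero value is the result."""
        while b != 0:
            a, b = b, a % b
        return a

    print(gcd(48, 36))  # -> 12, the same unambiguous result for the same input, every time

Every step is fully determined and depends in no way on interpretation.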
Because the humanities deal with information about facts that cannot be collected empirically, the structuring of the data itself already represents an important hermeneutic act. Accordingly, processing data with the methods and tools of computer science does not lead to unambiguous and neutral results; on the contrary, it is always a scholarly construction of the facts. [11] The different perspectives of the natural sciences and the humanities form the backdrop against which we now want to ask specifically how culture is actually digitised. In doing so, we will stick to Dilthey's terms, which divide this lecture lesson into three parts. "Experiencing", i.e. phenomenology, corresponds here to the acquisition of the research objects, specifically scanning, optical character recognition and layout recognition. "Expressing" corresponds to the formalisation of the research objects; here we are concerned with the logical structure of texts and their representation in HTML and XML, the Text Encoding Initiative, metadata and authority records. And thirdly, we are interested in "understanding", the semantics of the research objects; here we will talk about ontologies, RDF and Linked Open Data. [12] So let us first come to the digital acquisition of the research objects. [13] How do you actually get from a printed text or a handwritten source to digital data and a digital edition of these texts? Let us take Beethoven's Piano Sonata opus 109 as an example, which we have as a handwritten source. We can put this sheet of paper on a scanner and then save it as a digital image. With the help of optical character recognition we can attempt to recognise the musical notation automatically. But we can also enter it manually and store it, together with metadata, in a further document attached to our image. This data can be stored in a database in various formats: as a MIDI file for the sound, as a generally portable PDF, as XSLT and HTML for web pages, and as a JPEG for the image. [14] Digital scanning creates a digital image with a scanner or digital camera. The device creates a digital image of the text (e.g. as PDF or JPEG) by placing a virtual grid over the page and recording the colour or grey value of each cell (i.e. each pixel). Scans, i.e. the digital images, can be viewed with appropriate viewing software (such as a PDF viewer), but they cannot be read by a computer: for the computer, it is an image and not text. [15] The challenges of scanning are to optimise speed and cost: how much can be automated at all? How many sheets can be processed per minute? The fragility of the documents is a crucial point. How much handling is acceptable from a conservation point of view? How far can old books be opened without damaging the binding? How much exposure to light is acceptable at all? The quality and reliability of the digital image - lighting, camera and scanner quality, resolution, colour depth, the use of colour wedges and rulers for calibration, and so on - largely determine the costs and the final result. And which digital format should one actually choose? TIFF, a lossless format, or JPEG, which makes the file much smaller but is lossy? I recommend that you read the DFG's practical guidelines on digitisation. [16] Books, however, should not be scanned with simple flatbed scanners, especially if they are old and valuable, as pressing the book onto the scanner damages the binding and the spine. Special book scanners have been developed for this purpose, in which the book is placed in a cradle and only needs to be opened to 45°.
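What the virtual grid mentioned above means in terms of data volume can be estimated with a little arithmetic. The figures below are purely illustrative (an A4 page scanned at 300 dpi with 24-bit colour depth) and merely show why lossless master scans are so much larger than JPEG derivatives:

    # Rough storage estimate for an uncompressed scan (illustrative figures only).
    dpi = 300                            # scanning resolution in dots per inch
    width_in, height_in = 8.27, 11.69    # A4 page size in inches
    bytes_per_pixel = 3                  # 24-bit colour depth: one byte each for red, green, blue

    pixels = round(dpi * width_in) * round(dpi * height_in)
    raw_bytes = pixels * bytes_per_pixel

    print(f"{pixels:,} pixels")                            # about 8.7 million pixels
    print(f"{raw_bytes / 1024**2:.1f} MiB uncompressed")   # about 24.9 MiB per page before compression

At higher resolutions or greater colour depth the volume grows accordingly, which is why format choice and compression are genuine cost factors.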
In such book scanners, a flat support for the pages is achieved by a glass plate. To avoid reflections, the camera is directed at an angle towards an opposing mirror. In this way, depending on the optics, very high-resolution photographs are possible. This type of book scanner still has to be operated manually, but there are already scanning robots that can scan an entire book without human supervision. [17] The result of scanning or photographing is an image. You can read the image as text, but to the computer it is just a collection of pixels. In addition, the eye overlooks impurities, as can be seen here in the letter n, for example. In the image, however, these are pixels with the same weight as those belonging to letters, which can cause problems in text recognition. And if you zoom in more closely, you see coloured shadows around each letter. [18] So what are the possibilities for converting these pixel images into machine-readable text? The simplest is manual transcription, i.e. entering the text letter by letter on the keyboard - preferably by people who do not speak the language in question, because this produces only typing errors and no changes to the content, and by two independent transcribers using the so-called double-key method. There are fixed guidelines for this and also transcription tools, including online ones. The versions then have to be compared and the errors corrected, and in this way it is also possible to decide which transcription is the best in the cases where they differ. This reduces errors, and a measure of transcription accuracy can even be calculated as the percentage of words or characters that match in both transcriptions; in the best case, almost 100% accuracy is achieved. Of course, there are also automatic methods - optical character recognition, for example, which we will talk about in a moment, and which has the greatest difficulty with texts that are not scanned cleanly, as shown here on the right. [19] The whole thing is easy when it comes to printed books with a conventional layout, but there are very different text structures, especially in old manuscripts with elaborate pictures, large initial letters (the initials), with commentary text that runs around the actual text, and also, as here with the Hebrew example, with scripts that are read not from left to right but from right to left. [20] A particularly tricky case is a manuscript by Isaac Newton, which makes manual transcription absolutely necessary. Old manuscripts are often not written in a completely linear way: there are texts in the margins, crossed-out sections and so on. And what is to be done with non-ASCII characters, special characters and symbols that cannot be mapped to Unicode? How can one avoid intentional or unintentional corrections, i.e. how does one ensure faithful transcriptions, especially of historical material? Manual transcription is extremely time-consuming and therefore very expensive. [21] Nowadays, however, automatic text recognition - Optical Character Recognition, OCR - is used much more frequently. This is software for analysing digital images of texts and recognising the characters and letters in them. OCR is often included in the scope of delivery of scanners, and there are now also inexpensive programs that make scanned texts easily readable. The underlying algorithm typically combines two types of information. One is knowledge of the visual characteristics of the different characters, i.e. the shapes of the letters.
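Returning briefly to the double-key method: the agreement between two independent transcriptions can be quantified quite simply, for instance at character level with Python's standard difflib module. The two sample strings below are invented and stand in for two keyed versions of the same line:

    from difflib import SequenceMatcher

    # Two independently keyed versions of the same (invented) manuscript line.
    key_a = "Die Handschrift ist an manchen Stellen nur schwer zu lesen."
    key_b = "Die Handschrift ist an manchen Stellen nur schwer zu lesen,"

    matcher = SequenceMatcher(None, key_a, key_b)
    print(f"Character agreement: {matcher.ratio():.1%}")   # close to 100% here

    # The places where the two keyers disagree still have to be checked by hand.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            print(tag, repr(key_a[i1:i2]), "vs.", repr(key_b[j1:j2]))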
Such shape-based recognition already works very well with Latin letterforms of the eighteenth and nineteenth centuries, but also with non-European scripts. The second type of information comes from language models: one knows that certain character strings typically occur in this or that language, and knowledge of the typical character sequences of a particular language can greatly facilitate automatic text recognition. [22] Optical layout recognition aims to detect the layout of a page, e.g. by recognising tables, figures and photographs in the text, recognising captions, and recognising the number of text columns. [23] OCR would pose no problem at all if each letter always looked the same. However, this is not the case: the development of printing after the invention of movable type led to many different typefaces and fonts, and many printed products place particular value on an individual typeface. In addition, there are the ligatures, i.e. mergers of frequently co-occurring letters, such as ff or ffl, into a single character - a practice that developed out of handwriting. [24] But these are not the only problems. The OCR-read version of this issue of the "Union in Deutschland" not only fails to differentiate letter sizes, so that headings, continuous text and indentations are not distinguished; hyphenation also, understandably, causes problems. Fortunately, the curved type on the left margin causes few difficulties; only once has a g been read as a 9. [25] There is an international standard, Unicode, for encoding as many meaning-bearing characters as possible, but with different encoding schemes: the Unicode Transformation Format, abbreviated UTF, is a method of mapping Unicode characters to sequences of bytes. There are different transformation formats for representing Unicode characters for electronic data processing, e.g. UTF-8 as the de facto standard or UTF-16, which encodes each character in one or two 16-bit units. Other tables are the ASCII code with 128 characters (it does not contain umlauts) or ISO 8859-1 with 256 characters, including umlauts; here, however, Greek or Cyrillic is not possible, or only with a change of code table. The most widespread today is Unicode itself, with room for theoretically more than one million characters: Greek, Cyrillic, Han, Thai and so on can be expressed with it, and it is an integral part of all newer operating systems. At present, around 90,000 characters are defined in Unicode. [26] OCR errors are more frequent than one might think. They have a negative effect on the capture of the text and hinder further processing. To give a few examples: OCR errors often affect word segmentation, i.e. tokenisation. OCR errors omit full stops and thus hinder sentence segmentation. OCR errors deviate from correctly spelled words more often than typing errors do. Character-to-character mappings are rarely one-to-one: an m is not consistently misread as, say, three i's, otherwise such errors could be corrected at a stroke with a single substitution command. And OCR errors tend to affect several places in a word rather than occurring systematically in one position; because they differ from typical human misspellings, the usual methods for correcting spelling errors often do not work well on them. [27] Even when text recognition is correct, post-editing may be required to remove artefacts of the printed text that can affect machine readability. Words hyphenated at the end of a line, for example, become two words for the machine, and rejoining them is not always trivial:
in the German word "Schiff-fahrt", for example, one f must be removed when the hyphenated parts are rejoined, giving "Schiffahrt" - at least under the old orthography (the reformed spelling keeps all three f's: "Schifffahrt"). Page numbers, which can end up in the middle of sentences or even words and make further processing difficult, are also common. An OCR-read word may, for example, come out as "recogni- Page 7 tion". This pagination information must be removed and stored separately from the plain text. [28] As you can see, it is a long way from a scanned to a machine-readable text, and it can be shortened considerably by improving text recognition. I see three ways of doing this. First, improving the input quality, i.e. improving the quality of the digital images through better scanning techniques, higher resolution or different exposure. Secondly, improving the OCR algorithm itself with regard to rarer characters, special characters or non-Latin scripts, for which the algorithms are not yet as well developed. Thirdly, improving the output by post-processing the optical character recognition, i.e. a clean-up step that removes typical errors before further natural language processing tools are applied. [29] The second part of this lecture is about formalisation, i.e. the way things are expressed and how one manages to make these forms machine-readable. [30] The successful application of computer-based procedures and methods requires an adequate transformation of the object of investigation into a machine-readable form. Formalisation means this standardised modelling, digital encoding and operationalisation of information so that it can be evaluated with the methods and tools of computer science. As early as 1851, the English mathematician Augustus De Morgan developed the idea that the problem of the authorship of the Pauline epistles could be addressed with the help of word-length analyses, assuming that the statistical mean of the word lengths (measured in syllables) could be informative for this purpose. Archaeology, too, engaged early on in the standardised description of forms and also attempted to record them geometrically. [31] You probably already know one way of formalising the appearance of a text. I am referring to markup languages such as HTML (Hypertext Markup Language), which serve to structure a text semantically rather than to format it, as you know it from word-processing programs. The visual presentation is not part of the HTML specification but is determined by the web browser and by design templates such as CSS. In such digital markup, for example, a p in angle brackets indicates the beginning of a paragraph and a b indicates bold type; this designation remains valid until it is ended again with "slash b" in angle brackets. These markers are called tags, and their application tagging. [32] Headings, for example, are marked with head (for header) in angle brackets. The logical structure of a text is likewise not easy for the computer to recognise and must therefore also be marked up. The beginning of Bram Stoker's Dracula, for example, consists of the title, which applies to the entire work, the addition "Chapter 1", followed by "Jonathan Harker's Journal" and "(kept in shorthand)" in the next two lines. The text that follows starts with the date and place "3 May. Bistritz", which indicates that this is a text in journal form. All of this is nested in divisions to which the respective text types are assigned.
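What such logical markup looks like in practice can be sketched in a few lines. The fragment below is a simplified, invented illustration in the spirit of the Dracula example (not the markup of an actual edition); Python's built-in html.parser module is used only to show that the structure, unlike a mere scan, is now machine-readable:

    from html.parser import HTMLParser

    # A simplified, invented fragment in the spirit of the Dracula example.
    DOCUMENT = """
    <div class="chapter">
      <h1>Chapter 1</h1>
      <h2>Jonathan Harker's Journal</h2>
      <p><b>3 May. Bistritz.</b> Left Munich at 8:35 p.m. ...</p>
    </div>
    """

    class StructureLister(HTMLParser):
        """List each element and its attributes: the logical structure of the text."""
        def handle_starttag(self, tag, attrs):
            print(tag, dict(attrs))

    StructureLister().feed(DOCUMENT)
    # Output: div {'class': 'chapter'}, then h1, h2, p and b with no attributes.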
[33] The visual design of the text can then be specified separately: while the class, in this case "chapter", is noted in the text, the font information such as font family, font size and font colour is written in the stylesheet. [34] The most widespread markup language, which we have just dealt with briefly, is the Hypertext Markup Language, HTML. Its advantages lie in its simple form and in the machine-readable structuring of the entire document. HTML, however, offers only a fixed, predefined set of elements, which limits how precisely the content of a document can be described. [35] Somewhat simpler still, and very flexible, is XML (eXtensible Markup Language), which is derived from the Standard Generalized Markup Language (SGML). This extensible markup language defines a set of rules for encoding documents in a format that is readable by both humans and machines. The design of XML focuses particularly on simplicity, generality and ease of use. Originally developed to meet the challenges of large-scale electronic publishing, XML also plays an increasingly important role in the exchange of a wide variety of data on the web and elsewhere. XML is a meta-language for defining markup languages: it formalises the overall structure of textual data according to document type definitions. XML follows a W3C standard and, like HTML, goes back to SGML. [36] Perhaps I can explain this with a simple example. Here, too, tags are used to mark up elements, and as with HTML, a start tag stands for the beginning of an element - here "heading" in angle brackets - and an end tag for its end: "slash heading" in angle brackets. Elements are the central building blocks of XML documents and can contain text or other elements. Even empty elements without content, such as "newline" here, are possible: a slash is simply appended to the element name. [37] Comments such as "is the markup correct here?" can also be placed in the text; they begin with an exclamation mark and are enclosed in double hyphens (<!-- ... -->). Generally speaking, elements form the basis of an XML document. They are to be understood as the nodes of a structure tree: the element person, for example, contains the elements first name, last name and profession. [38] For scholarly text encoding, the rules according to which such XML documents are structured have been defined by the Text Encoding Initiative (TEI). The TEI is a non-profit initiative whose members have developed a widely used XML standard for representing texts in digital form. The consortium has issued a set of guidelines specifying encoding methods for machine-readable texts. [39] Currently, about 450 elements or tags are defined and categorised into three classes. The Core Tag Set is the same for every TEI document. Depending on the text genre, there are also a number of Base Tag Sets to choose from, which can be extended with any number of Additional Tag Sets. Good tutorials with training material on tagging with TEI can be found on the pages of dariah.de. [40] To give a few examples from the Core Tag Set: in the sentence "John eats a croissant every morning", croissant is a foreign word. This can be marked with the element foreign and a language attribute with the value "fr" for French. The element mentioned is used to mark words that are merely mentioned, for example cited as an example, rather than used as a meaningful part of the sentence: "Croissant is difficult to pronounce with your mouth full" would be such a case, and there, too, a language attribute can be added. The element term is used to indicate special concepts, e.g. when they are defined in the text: "A croissant is a crescent-shaped piece of light, buttery pastry that is usually eaten for breakfast, especially in France."
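What this looks like as actual XML can be sketched briefly. The encoding below is an invented illustration along the lines just described (not an excerpt from the TEI Guidelines); Python's standard xml.etree module then reads the annotation back out, showing that it is machine-readable:

    import xml.etree.ElementTree as ET

    # Invented sample encoding of the croissant sentence with a TEI-style <foreign> element.
    SNIPPET = '<p>John eats a <foreign xml:lang="fr">croissant</foreign> every morning.</p>'

    paragraph = ET.fromstring(SNIPPET)

    # Report every foreign-language word together with the language given in xml:lang.
    XML_NS = "http://www.w3.org/XML/1998/namespace"
    for foreign in paragraph.iter("foreign"):
        lang = foreign.get(f"{{{XML_NS}}}lang")
        print(f"foreign word: {foreign.text!r}, language: {lang}")
    # -> foreign word: 'croissant', language: fr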
[41] Base Tag Sets contain elements specifically put together for particular text genres such as prose, poetry, drama or spoken language. There are also sets for dictionaries and terminological data, and with "mixed" and "general" there are also higher-level base tag sets. [42] Unlike in a novel, for example, plays feature a rapid change of speakers, and speakers and speech turns therefore form a central element in the base tag set for drama. Stage directions such as "he leaps forth" or "at her feet" are also important here. [43] In poetry, on the other hand, the verse structure - the stanza and the division into lines - is important, with the individual lines numbered using the attribute n. [44] The state of the source must also be annotated, so there are separate elements for deletions, insertions, substitutions, missing words and so on. [45] In this way, the state of a manuscript can be annotated with source-critical precision and made digitally readable. The element for a deletion, for example, is used like this: beginning of paragraph / "He could not" / deletion of type "immediate correction" / rendered as strikethrough / responsible for the reading: Fotis Jannidis / in the hand of Johann Wolfgang Goethe / the deleted letters "fas" / end of deletion / "believe!" / end of paragraph. [46] Additions can be annotated in the same way with add (for addition): line / "Genius" / addition of type "supralinear" / in Goethe's hand, responsible for the reading: Fotis Jannidis / "regst" / end of addition / "du dich nicht?" / end of line. [47] And for corrections in the manuscript there is the element corr (correction). [48] A TEI document consists of a prologue, a header and the actual text. At the beginning, the document type is declared according to the standardised document type definition, the tag sets are specified, and the character encoding is stated. In the header, the author, the content, the editor and so on are named. [49] Here is an example of what such a prologue might look like. Try reading the code - that should no longer be difficult for you. [50] Now we have moved a good step away from the actual digital copy. These annotations do not represent the text to be digitised itself; rather, they are data about the text, so-called metadata. They describe objects in a structured and uniform or standardised form. The basis of metadata description was defined by the Dublin Core Metadata Initiative. The DCMI is an open organisation concerned with the development of interoperable metadata standards that support a wide range of purposes and business models. [51] These metadata standards apply not only to text but to any kind of object, such as the photos from your digital camera. A basic distinction is made between class (e.g. photo properties), property (e.g. exposure time, at the bottom) with the associated value (e.g. 1/17 or 1/25), and encoding. For example, the metadata element "Title" of the DCMI Metadata Terms in XML format is encoded by default according to the UTF-8 character set. Metadata can be associated with (embedded in) objects, as here, or provided separately in records of their own. [52] The advantages of using metadata are obvious. They describe objects in a structured and uniform or standardised form. They serve to select (search, find, choose) and identify resources, or describe their appropriate use. They can describe anything - literature, paintings, films, people, fossils, clothing, places and so on - and they can be embedded in objects or provided separately in their own records, as we have seen.
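As a concrete illustration of such a record, here is a small, entirely invented Dublin Core description in Python; the field names follow the DCMI element set, but the values describe no real object:

    # An invented Dublin Core record; the keys follow the DCMI element set.
    record = {
        "title":       "Letter to an unknown recipient",
        "creator":     "Example, Erika",
        "date":        "1874-05-03",
        "format":      "image/tiff",
        "language":    "de",
        "rights":      "Public domain",
        "description": "Digitised manuscript letter, 2 pages, scanned at 400 dpi.",
    }

    # Serialised, for example, as simple XML elements with the usual dc: prefix.
    dc_xml = "\n".join(
        f"<dc:{element}>{value}</dc:{element}>" for element, value in record.items()
    )
    print(dc_xml)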
[53] A few types of metadata can be distinguished. Descriptive metadata is the information needed to search for, find and identify relevant objects, such as title, author, date of publication and so on. Administrative metadata provides information on origin, archiving, access rights and other matters that serve the management of the objects, such as licence, producer or rights holder. Technical metadata is the information necessary for the appropriate use of the resource, such as file format or image resolution. Structural metadata is information about the composition of a resource (e.g. whether a digitised book consists of many individual files) and the relationship of the parts to each other. Linkage or relationship metadata is information about the relationships that exist between objects (higher-level wholes, other versions and so on). Content-rating metadata provides information about the possible users of objects, i.e. the target audience, and meta-metadata indicates which models, syntax and formats underlie the metadata, who created it, when, and so on. [54] These different types of metadata are particularly useful when they are standardised. For this purpose, the Dublin Core Metadata Element Set was defined in 1995 in Dublin, Ohio, and has since been maintained and developed further by the Dublin Core Metadata Initiative. The DCMI's main goals are simplicity of semantics and application and the provision of a basis for semantic interoperability. The basic set consists of fifteen elements: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title and type. [55] If, for example, one calls up the metadata on the Mona Lisa - not on photographs of the Mona Lisa but on the painting itself - one finds on the pages of the Louvre very similar information to what we have just seen for the photos: title, maker, subject, the exact details of the time of production, the place, the century, the measurements and so on. [56] If we want to record a document comprehensively in digital form - let us say a magazine - we speak of a digital edition. This includes, first, the layout, i.e. how the characters are arranged on the paper; then of course the encoding, i.e. the sequence of alphanumeric characters; the structure of the text: what is a title, what is a heading, what kind of heading is it, is this a chapter, a paragraph, a verse line or an act; colour, shape and size of the text also play a role, and possibly hyperlinks. Then the metadata that cannot be taken directly from the text, such as the author or artist, the date of creation, access rights, last change, and place and time of publication, must be stored with the document. And of course the content: which genre does the text belong to, which concepts and which fictional worlds are associated with it? And lastly, active references within the text or image to other texts and other media - here we speak of hyperlinks. [57] Once again a good example of good practice is the Göttingen Academy project Blumenbach online, which comprehensively catalogues the works and objects of study of the natural scientist Johann Friedrich Blumenbach. Here, the metadata are only one part of the technical infrastructure, which is arranged in several layers. The client layer is the front end of the application that can be accessed on the web.
Behind this is a layer for the web application, an API layer that can be accessed from outside, and a service layer that handles the processing of the data. On the repository layer there is a TEI repository and a MySQL database that store, manage and deliver the metadata. Which metadata, and how much of it, is collected in the project you can perhaps guess from the overview on the right. [58] A pioneering project is Darwin Online. Since 2006, the complete works of Charles Darwin have been available online. The University of Cambridge digitised 50,000 pages of text as well as 40,000 drawings and pictures, so that the entire oeuvre is available both visually and in an exemplary digital edition, in which all documents can also be searched for terms using a search function. [59] Or, to mention another project that has meanwhile become a classic: the Arnold Schönberg Center presents its entire collection online. Its catalogue of works and sources is based on the structure of the Schönberg complete edition and, thanks to its technical realisation as a content management system, can be used chronologically as well as thematically and systematically. Digitisation is progressing: in 2020, all the diaries of probably the most famous Austrian composer of the modern era were added. [60] At vangoghletters.org you will find not only a digital edition of van Gogh's letters. Rather, starting from the letters, all relevant data are recorded and linked. This includes an index of original works by title and by type of work (paintings, X-rays and photographs of paintings, works on paper, sketches), an index of works by other artists and a sketch index (for example for Gauguin), an index of photographs, documents, periodicals and literary works, an index of biblical quotations (divided into the relevant books and sections), an index of persons in general and by correspondent (you may know that the majority of the letters are addressed to van Gogh's brother Theo), and a gazetteer. [61] Similarly, the complete works of the universal genius Leonardo da Vinci are available on the internet as a digital edition, in part drawing on various scientific methods. Leonardo's Madonna of the Yarnwinder in Edinburgh, for example, can be inspected closely using techniques of X-ray analysis, computed tomography, infrared and ultraviolet imaging, and also a profilometric 3D examination of the painting's surface. [62] In the case of Blumenbach, Leonardo, Schönberg or van Gogh, it is quite clear to everyone which person is meant, because they are generally known. But there are certainly several bearers of the same name, and sometimes the same person appears in different spellings. [63] This is where authority control can help. Let us take an even more complicated example: Basil, the late antique bishop of Caesarea, is called Basil of Caesarea, Basilius Magnus or Basil the Great in the sources. In the respective languages there are also special name forms, such as Vasile cel Mare. In the card catalogues of libraries, this problem was solved with placeholder entries. [64] Authority control is used whenever entities are to be identified unambiguously, regardless of their spelling. Different projects can thus reference a specific entity with the help of the authority record and enrich it. The totality of references to the same target forms an equivalence class, which can be summarised in a data record called an authority record.
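The idea of such an equivalence class can be sketched in a few lines; the identifier and the list of variants below are invented for illustration, and no real authority file is queried:

    # Invented authority record: one identifier, many variant names.
    AUTHORITY_RECORD = {
        "id": "example-authority:0001",      # placeholder, not a real GND or ISNI number
        "preferred_name": "Basilius Caesariensis",
        "variants": {"Basil of Caesarea", "Basil the Great",
                     "Basilius Magnus", "Vasile cel Mare"},
    }

    def resolve(name: str) -> str | None:
        """Map any known spelling to the single identifier of the entity."""
        known = {AUTHORITY_RECORD["preferred_name"], *AUTHORITY_RECORD["variants"]}
        return AUTHORITY_RECORD["id"] if name in known else None

    print(resolve("Vasile cel Mare"))   # -> example-authority:0001
    print(resolve("Basil of Antioch"))  # -> None: a different entity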
[65] Similar to authority records, controlled vocabularies also provide unambiguous reference names. They specify the range of values of a metadata element (property) by listing the permitted values. This avoids synonymous names for identical values and homonymous names for different values. [66] Each entity should be described by an authority record so unambiguously that confusion is ruled out. This requires a key property or individualising feature. Such individualising features can be, for example, dates of life and activity, geographical coordinates, superordinate terms or entities, professions, titles of nobility, places of activity or, in principle, relations to other entities. [67] Accordingly, there are many different types of authority records. In the library sector, for example, authority records are in use for persons, corporate bodies, subject headings, classifications, works or uniform titles, printers and publishers, generic terms, geographical names and provenance characteristics. The use of authority records is conceivable wherever one has to deal with different designations for the same entity. [68] This standardised description flows into the Gemeinsame Normdatei (GND), the integrated authority file of the German-speaking world, which is used by libraries, cultural heritage institutions, various DH projects and others and comprises over 10 million records. Since April 2012 it has brought together the Personennamendatei (PND), the Gemeinsame Körperschaftsdatei (GKD), the Schlagwortnormdatei (SWD) and several local authority files. It is coordinated by the German National Library (DNB) and maintained by the larger academic libraries in Germany and Austria. [69] Internationally, there is the International Standard Name Identifier (ISNI), a 16-digit number that has been assigned since 2012. The identifier is supported by a consortium consisting of the Conference of European National Librarians, the Online Computer Library Center in Dublin, ProQuest, the Confédération Internationale des Sociétés d'Auteurs et Compositeurs, the International Federation of Reproduction Rights Organisations and the International Performers Database Association. Around eight million identifiers have currently been assigned, and the Open Researcher and Contributor ID (ORCID) forms a subset of them. [70] A thesaurus is a controlled vocabulary with various relations. Equivalence relations describe synonyms ("used for"), multilingual variants, descriptors and non-descriptors. Hierarchical relations distinguish superordinate (broader) and subordinate (narrower) terms or categories, and associative relations indicate related terms - where an index would say "see also". [71] Let me take an example from EuroVoc, a thesaurus developed, published and used by the European Union for indexing the documents of the European institutions. For the descriptor "rail transport", the synonyms listed under Used for (UF) are: rail connection, rail traffic, railway, transport by railway. Rail transport belongs to group 48, Transport, with the broader term land transport and the narrower terms CIV Convention, rail network, rolling stock and vehicle on rails. Related terms are transport staff, railway tariff, European Railway Agency and so on, and of course there are many linguistic equivalents of "rail transport", which are likewise standardised. [72] Metadata, authority records and controlled vocabularies show very nicely how far the process of formalisation has already progressed, especially for texts. There is hardly any data left that cannot be clearly read and processed by the computer.
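How such relations can be represented as data may be sketched briefly; the structure below merely mimics the EuroVoc entry just described and is a deliberate simplification:

    # Simplified sketch of one thesaurus entry, modelled on the EuroVoc example.
    THESAURUS = {
        "rail transport": {
            "used_for": ["rail connection", "rail traffic", "railway",
                         "transport by railway"],          # equivalence relations
            "broader":  ["land transport"],                 # hierarchical: broader term
            "narrower": ["CIV Convention", "rail network",
                         "rolling stock", "vehicle on rails"],
            "related":  ["transport staff", "railway tariff",
                         "European Railway Agency"],        # associative: 'see also'
        }
    }

    def descriptor_for(term: str) -> str | None:
        """Return the preferred descriptor for a synonym, enforcing the vocabulary."""
        for descriptor, entry in THESAURUS.items():
            if term == descriptor or term in entry["used_for"]:
                return descriptor
        return None

    print(descriptor_for("rail traffic"))  # -> 'rail transport'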
But what about understanding? How can the computer be taught the semantics of the objects of study? This is what the last section of this lecture lesson is about. [73] Understanding arises primarily from knowledge of the larger context of meaning and from linking pieces of information. In order not only to designate concepts unambiguously but also to provide them with meanings and relationships, ontologies are used. Ontologies are controlled vocabularies with a domain model of their own. The categories and properties of the respective concepts, and the relationships between them, are defined linguistically and mapped formally in an ontology. Ontologies thus form a network of information whose parts are logically linked to one another. They are therefore a good way of limiting the complexity of knowledge content and of representing the properties of a subject area and their relationships to each other. This is done by defining a set of concepts and categories that make up the topic. An ontology describes classes of concepts, the relationships of these classes to each other and the instances of these classes, and thus allows conclusions to be drawn about the instances (so-called inference). As description languages, the World Wide Web Consortium (W3C) has defined RDF Schema (RDFS) for simpler domain ontologies and the Web Ontology Language (OWL) for complex knowledge representation. [74] RDF (Resource Description Framework) is a syntax for representing data and resources in ontologies. RDF breaks every piece of information down into triples consisting of subject, predicate and object. The subject represents a resource that can be identified by a URI. The predicate denotes a property or relationship, itself identified by a URI, and the object denotes a resource or value to which the subject stands in that relationship. Subject and predicate are thus always resources (a document, an entity or a property) identified by a URI. The object can be a resource or a character string, called a literal. If the object is a literal, statements can be made about its data type - e.g. date, numerical value or text - and about its language. [75] The RDF Schema describes classes with four elements: Class, Property, Literal and Resource. Properties are defined using the following elements: range, domain, type, subClassOf, subPropertyOf, label and comment. [76] Only the use of ontologies has made a meaningful use of data on the internet possible. This extension of the World Wide Web to include information that unambiguously describes otherwise unstructured data is called the Semantic Web. The goal of the Semantic Web is to make internet data not only machine-readable but also machine-understandable. This semantic labelling also makes it possible to work with heterogeneous data sources. The advantages over the traditional web are obvious and are summarised here in tabular form. [77] The Semantic Web, as a large network of meaning, uses linked data and, above all, jointly usable Linked Open Data: the linking of different data sets using common URIs for the respective entities. Licensing the data under an open licence enables uncomplicated re-use of the data by others. RDF and the standards based on it are used to encode and link the data so that machines can interpret them correctly in terms of their meaning. Crucial is the ability to combine the individual statements of the different data sources: the information does not have to be modelled anew in each data source, and re-use is easier because triples are easier to handle than whole data sets.
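What such triples look like in practice can be sketched briefly, assuming the third-party rdflib library is installed; the namespace and resources below are invented example URIs, and the statements anticipate the capital example discussed further below:

    from rdflib import Graph, Literal, Namespace   # third-party library: pip install rdflib

    EX = Namespace("http://example.com/exampleOntology#")   # invented example namespace
    g = Graph()

    # Each statement is one subject-predicate-object triple.
    g.add((EX.tunis, EX.isCapitalOf, EX.tunisia))
    g.add((EX.tunisia, EX.isInContinent, EX.Africa))
    g.add((EX.tunis, EX.cityname, Literal("Tunis", lang="en")))

    # Serialise the small graph in the compact Turtle notation.
    print(g.serialize(format="turtle"))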
[78] In 2006, Tim Berners-Lee issued four concise recommendations for Linked Open Data design: first, use URIs as names for things, not words; second, use HTTP URIs so that people can look up those names; third, when someone looks up a URI, provide useful information, using the standards (such as RDF and the query language SPARQL); and fourth, include links to other URIs, so that people can discover even more things. Unfortunately, far too few adhere to this even today! [79] Another advantage of Linked Open Data is its searchability. After all, the data are available as RDF triples, each consisting of a subject, a predicate and an object, such as "Tunis is the capital of Tunisia" or "Tunisia is in Africa". The data are stored in a triplestore or RDF store, a database specifically designed to store triples and retrieve them through semantic queries. SPARQL, a graph-based query language for RDF, is used for querying. In queries, variables are prefixed with "?". The subjects of our query are referenced with the variables ?x and ?y; via the properties cityname and isCapitalOf we ask for the name of the city in the variable ?capital, and via countryname and isInContinent for the name of the country in the variable ?country. As the result of the query in our example, all variable bindings for ?capital and ?country are returned that fulfil the four specified RDF triple patterns. Because writing out the full URIs reduces the readability of a query, prefixes can be used; here "abc:" stands for "http://example.com/exampleOntology#". Such a query can also be distributed to multiple SPARQL endpoints (services that accept SPARQL queries and return results), computed there, and the results collected - a procedure called federated querying. [80] In the Semantic Web, authority records can be understood as information nodes. Our Basil, for example, has the name Basilius Caesariensis. He has Niksar as a place of activity and is an archbishop by profession. He is also a monk by profession. He is the author of De spiritu sancto. He is a friend of Gregory of Nazianzus. He lived from around 330 to 379, was a fellow student of Julian the Apostate and a brother of Gregory of Nyssa. His person is thus linked to places, to other persons, to writings and also to other attributes. These in turn are linked to further things, so that a large network is created - and through the authority data this network as a whole is machine-readable. [81] Through such references, rendered as RDF triples, semantic contexts can be sufficiently formalised. In the field of cultural heritage and museum documentation, the CIDOC Conceptual Reference Model (CRM) has established itself as an extensible ontology for concepts and information. As an international standard for the controlled exchange of cultural heritage information, it is used by libraries, archives, museums and other cultural institutions to improve access to museum-related information and knowledge. [82] A good example is the Deutsche Digitale Bibliothek. Unlike in traditional databases, cultural objects are described here in triples, which are defined according to event types, among other things. Such event types according to CIDOC CRM are, for example, performance, execution, excavation, exhibition, processing, extension, find, use (primary or secondary function), intellectual creation and so on. Such triples could be, for example: The Sorrows of Young Werther is a reworking of the first edition; The Sorrows of Young Werther is the intellectual creation of Goethe; The Sorrows of Young Werther was published in Leipzig; and so on.
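Statements of this kind are queried exactly as in the capital example described above. A rough reconstruction of that query, again assuming the third-party rdflib library and using invented example data:

    from rdflib import Graph   # third-party library: pip install rdflib

    # Invented example data in Turtle notation, matching the capital query described above.
    DATA = """
    @prefix abc: <http://example.com/exampleOntology#> .
    abc:tunis   abc:cityname "Tunis" ;        abc:isCapitalOf abc:tunisia .
    abc:tunisia abc:countryname "Tunisia" ;   abc:isInContinent abc:Africa .
    """

    QUERY = """
    PREFIX abc: <http://example.com/exampleOntology#>
    SELECT ?capital ?country WHERE {
      ?x abc:cityname ?capital ;
         abc:isCapitalOf ?y .
      ?y abc:countryname ?country ;
         abc:isInContinent abc:Africa .
    }
    """

    g = Graph()
    g.parse(data=DATA, format="turtle")
    for capital, country in g.query(QUERY):
        print(capital, "is the capital of", country)   # -> Tunis is the capital of Tunisia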
Thus this kind of event-based description already comes relatively close to an object biography. [83] Our initial question - how does one digitise culture? - can perhaps be answered briefly using the example of digital editions and corpora. One could simply capture the texts with common word-processing programs; then they are at least machine-readable. A conversion into XML, however, is what makes the text, now annotated in a markup language, evaluable and comparable with other texts and parts of texts. If this formalisation is carried out along the lines of the TEI guidelines, the enriched text is interoperable and can also be reused by others. If it is then combined with other data available in databases, it becomes part of a semantic network that can also be evaluated in terms of content. Through output in XHTML and an image viewer, it can finally be accessed and used by all. As you can see, digitisation does not simply mean scanning, but rather preparing the material in a machine-readable and machine-understandable way. In order to make a cultural asset findable, accessible, linkable and reusable, many acquisition steps - by now standardised, but time-consuming and expensive - are necessary. [84] We have now spent a long time on standards, and one might get the impression that the digitisation of cultural property poses no challenge at all. But this is not the case. At the level of experience, for example, we do not yet have suitable methods to adequately represent the materiality and sensory quality of the objects of study. At the level of expression, we have to consider the historical conditionality of the data models and also their perspectivity, i.e. the question of the perspective under which the objects were actually digitised. And at the level of understanding, we must always be aware that our theses are formed on the basis of incomplete, fuzzy and heterogeneous data. So not all problems have been solved by a long shot - the Digital Humanities are actually just getting started! [85] Of course, this also includes specific knowledge that you should acquire in the course of the next few months. For one thing, you should know the properties of the presentation of texts, i.e. everything that has to do with desktop publishing, such as font type and size, typesetting, layout, integration of media and so on. You should at all times be aware of the separation of outline and structure, of content and information, and of presentation and formatting. You should also be familiar with the relevant procedures and good practices for documenting digital research data. I could not demonstrate the relevant metadata standards comprehensively, but the syntax examples have shown you how wide the field is in which you can still develop your knowledge. The relevance of controlled vocabularies and authority records is also an important area in which to orient yourself. [86] The skills you should acquire include selecting appropriate text formats and markup languages. You should have a good command of the use of basic metadata for exchanging and describing text and image sources. You are also expected to be able to assess the consequences of a chosen documentation method for the re-usability of the data. [87] Finally, I would like to give you a number of possible exam questions so that you can get an idea of what might be asked in the exam. For example, I could ask you what an algorithm is, what is meant by the "hermeneutic method", and what the fundamental differences between the two concepts are.
Or I could ask you what an ontology is and ask you to give an internationally used example and its application. Which peculiarities of the humanities - and thus which problems for computer science - are particularly relevant for the Digital Humanities would be a question you could answer in a larger context. Which metadata standards you know for marking up texts and images, on the other hand, would be a question that directly tests your knowledge. How sources become data and how digital editions come into being is a question that brings together several slides in one, and what the problems of automated text recognition are is a question into which you can even bring your own experience. The disadvantage of introductions is that a lot of material has to fit into a small space; I hope you will not be put off by this. Today, images and collection objects came up a little short, but that does not matter, because the digital image will be the topic of the third lesson. With that, I say goodbye, thank you for listening, and wish you all the best and a lot of fun in your future work with the digitisation of cultural assets.