Martin Langner, Introduction to Digital Image and Artefact Science (Summer Semester 2021)
II. Digitisation and Data Management:
Lesson 5. Databases (https://youtu.be/t5zWaRZ9I7M)
[1] Introduction
[2] Databases as knowledge repositories
[8] Content of this lecture lesson
[9] 1. Concepts and terminology
[10] a) Data model, data structure and data type
[14] b) Database system
[21] c) Database models
[33] d) Media theory considerations
[44] 2. Recommendations for the creation of a database
[45] a) Design and implementation
[52] b) Data entry and data quality
[60] c) Collaboration, data backup and export
[64] 3. Image databases of the future
[65] a) User Centricity
[72] b) Image Collection Exploration
[79] c) Visualisation Layouts
[86] Conclusion
[86] Current research questions
[87] What you know and what you should be able to do
[90] Literature
[1] I would like to welcome you very warmly to the fifth lesson of our lecture "Introduction to Digital Image and Artefact Science". Today we are still in the second section, "Digitisation and Data Management", and will look at databases.
[2] Hardly anyone has the possibility of physically compiling, viewing and sorting images on a bench, as imagined for Hermann Göring in the film Monuments Men. Today, however, these functions can fortunately be achieved with the help of an image database. As with Hitler's Führer Museum in Linz, such compilations aim to provide an almost complete overview of a specific area, in this case European and especially German artistic development.
[3] The cabinets of curiosities of the Baroque period already pursued a similar approach to opening up the world. No distinction was made between natural objects, works of art and handicrafts. As the museum's guiding form of universal knowledge, the collections stored objects that stood for the diverse fascinations of the world, even if curiosities and technical marvels in particular dominated. The world was explained in its abundance and diversity through the examination of things.
In the meantime, the art and curiosity cabinets of the Baroque period have become popular again. The exhibition "Art of the Curious", for example, was shown in London in winter 2013, and in some museums, such as the Kestner Museum in Hanover or the Me Collectors Room Berlin / Stiftung Olbricht, the old cabinets of curiosities have been reconstructed and form central components of the permanent exhibition. On the one hand, this may have to do with the fact that we are used to associative compilations of images in our everyday lives, and are even willing to reassemble our lives from snapshots and quotes on Instagram or Facebook, thus using things of personal significance to construct our own self-image. On the other hand, visual sensory overload and selective perception pose no problem for us: we scroll as a matter of course through the merchandise offers of internet mail-order companies and the result lists of search engines, which open up the world to us in its present form.
In this respect, one can agree with Hubert Burda's assertion that "Today's cabinets of curiosities are no longer those of Dresden, but those of Google and Facebook".
[4] However, Google and Facebook are actually just large databases. The principle of the "database" as a system that enables the efficient storage and retrieval of information is very old and also applies to analogue forms such as the book, the archive or the card index. This text-based way of creating directories, categorising and indexing information was already common in the late Middle Ages and ultimately goes back to Aristotle, if one does not want to regard every list as a kind of database anyway. Whenever the amount of information overwhelmed human memory, be it lists of victors, knowledge of herbs and medicinal plants or the number of writings in libraries, contemporary technologies were invented to search for and retrieve the information.
[5] Today, computerisation has given rise to large databases that make objects and the knowledge about them available thematically and in excerpts, thus providing an unmanageable amount of supposedly useful information. The scientific databases in archaeology also gather an unmanageable number of monuments. In May 2020, for example, the ARACHNE database of the German Archaeological Institute alone contained 4.4 million records, and the Beazley Archive database contained 118,202 vases. This means, however, that the digital card index has grown to such an extent that in practice only what is explicitly searched for can be found. Large numbers of results, on the other hand, tend to put users off. The result is increasingly high barriers to searching these databases, so that they are generally used only as reference works for individual finds specifically called up by the researcher, for example to find further literature or factual data such as storage location and inventory number.
[6] Collecting, arranging and preserving are forms of systematic development of reality. They describe the basic principles of archives, museums and libraries. As official classification systems, they define and institutionalise history and memory, for on the basis of fixed selection criteria, often only what is considered worth remembering finds its way into the archives. The question that now arises is whether these principles should also apply to the databases of the future. What is the advantage of collecting and publishing all data? And what is the importance of data curation, i.e. semantic ordering and sustainable storage? The short-lived nature of technology, illustrated here in a witty way by viewers, pagers and cameras of the 1980s, already shows that databases can only be preserved sustainably with great effort; a problem that has not yet been solved and that will therefore not concern us further in this lesson.
[7] As we have seen, the core of creating a database is to structure the world in order to better understand its parts. To find such superordinate categories requires the ability to perceive and name patterns and structures. It is important to group entities, to recognise relationships between them and to make appropriate links. In this way, the database fulfils a basic need of all scientific research, including research in the humanities.
In the following, we will discuss the special features of the database as a storage medium and what is necessary to be able to use databases as knowledge repositories.
[8] For this purpose, I have again divided the lesson into three parts. First, we will deal with concepts and terminology; here we will not only get to know the technical terms, but also talk about different database models and make some media-theoretical observations about the database as an information medium. This will be followed by recommendations on how to create a database. And the third part will be about the image databases of the future, a future that in some respects has already begun, so you can get a fairly good idea of where the journey is heading.
[9] We first need to clarify some concepts and technical terms and start right away with the term "data", which is generally understood as measured values or empirically acquired observations. In computer science, character strings that have a syntax and are used for processing and result representation are called data. They can be stored as character strings (text data) or in binary form (i.e. ones and zeros for code or image data). Only when they also have a semantic level can one speak of 'information'.
[10] The first task of database modelling is to create a conceptual data model, e.g. in the form of a diagram, that describes the relevant real-world objects to be represented in the database and the relationships between these objects. A common approach to this is to develop an entity-relationship model, often using drawing tools.
[11] A successful data model will accurately reflect the possible state of the external world being modelled: For example, if coins have more than one image field and may also carry inscriptions, the data model will allow for the acquisition of this information by providing separate entries for obverse and reverse. Designing a good conceptual data model requires a good understanding of the application domain; this usually involves asking fundamental questions about the items of interest to the data structure,
such as "Can a vase painter also be the potter?", or "If a mixing pot is on a painted stand, is that two vases or just one?", or "If a site goes by more than one name, is that one site or are there two (and how do you determine that)?". The answers to these questions establish definitions of the terminology used for entities (such as producer, image carrier, site, findspot) and their relationships and attributes.
[12] A data model is realised in a data structure, e.g. a list or table. A data structure is thus used to store and organise data, arranged and linked in a certain way, to enable them to be accessed and managed efficiently.
[13] The basic requirement for all data structures is that they support accessing, finding, inserting, sorting and removing data. The data structure takes into account the data type (e.g. integers, floating point numbers or strings) and the data format (such as number, formula, date, text or image).
[14] A database is a structured collection of data. It stores information in the form of records containing the actual factual data, either entered manually by users or generated automatically. If we want to think of a database as a piece of furniture, it would be a drawer or filing cabinet, with the records as the folders that hold the data.
The database content can be in different formats and can enter the database as text, numbers, links or media (photos, drawings, films, etc.).
[15] Collecting data on a particular set of questions or materials and querying that collection are the two main tasks of a database. For example, if we wanted to record all the coins in a coin cabinet, a database could tell us which coins from the time of Emperor Titus are in it, or which coinages were not issued in Rome. The simplest form of such a database would be a list in tabular form, such as one can easily create with any spreadsheet program.
[16] In contrast, a database system consists of one or more databases and a management software called a database management system (DBMS).
The database management system structures and stores the information in the database, while the records consist of a set of user-defined fields. The user therefore does not access the database directly, but the database management system as a control programme.
[17] The database management system is thus the actual management software that internally organises the structured storage of data and executes all read and write accesses to the database. Its tasks include, for example, the entry, modification and deletion of data, the creation of databases including the implementation of the data model, searching in the database contents by means of queries as well as the general administration of users, accesses and access rights.
[18] A database system has three levels of abstraction: The physical level describes the form in which the data is stored on secondary storage. The logical/conceptual level records which data is stored by means of a database schema. The views or sections visualise subsets of the data; they are tailored to the respective needs of the user.
[19] What are databases actually used for? Well, the main task is to record the data in a structured way.
This includes editing and changing the data, searching for it, filtering it and sorting it in order to evaluate it. This evaluation can be produced in the form of reports and exported to other programmes for further processing.
It is therefore primarily a matter of storing large amounts of data efficiently, unambiguously and permanently and of providing the required subsets in various forms of presentation that meet the needs.
[20] There are a number of advantages that database systems have over structured individual files, such as tables:
Firstly, redundancies can be avoided, which makes it easier not only to store the data but also to change it, because the respective entries only have to be changed once, in one place. Secondly, inconsistencies and contradictions, which easily arise for example when several users work on the same table, can be avoided.
Thirdly, dependency relationships between data can be mapped directly in the database. The conditions defined in this way can also assume a control function during data entry and thus help to avoid errors, for example if only certain values are allowed in a data field.
In addition, databases with a defined user management help to avoid data protection problems by protecting the data from unauthorised access through differentiated access regulations.
Depending on the user's access rights, suitable display variants can be provided for each purpose: the so-called views, which show only the relevant subsets of all stored information.
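To make the idea of such views concrete, here is a minimal sketch in Python using the built-in sqlite3 module; the table, columns and values are purely illustrative and not taken from any of the systems mentioned in this lecture.

```python
# A minimal sketch: a view exposes only the subset of the stored
# information that a particular user group needs.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE coins (
        id INTEGER PRIMARY KEY,
        nominal TEXT,
        mint TEXT,
        weight_g REAL,
        storage_location TEXT   -- internal information, not meant for every user
    )
""")
con.execute("INSERT INTO coins VALUES (1, 'Denarius', 'Rome', 3.4, 'Cabinet 12, drawer 3')")

# A view for external users that hides the storage location.
con.execute("CREATE VIEW coins_public AS SELECT id, nominal, mint, weight_g FROM coins")
print(con.execute("SELECT * FROM coins_public").fetchall())
```

In a full database system, the user management would additionally restrict which user group may query which view.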
[21] A data model is used at the conceptual level to formally describe all the data contained in the database, how it is stored and the relationships between them. It determines how the data to be stored is structured and which operations are possible on this data (search, delete, ...).
[22] The hierarchical data model represents the data in a hierarchy, so that the data structure takes the form of a tree. The data are organised sequentially, so that each value of one attribute, such as the Rome mint in our example, is followed by any number of entries of the Nominal attribute, from which further entries for dating are in turn derived. This hierarchical form of relationship is mathematically also called a 1:n relationship. Coins from another mint would in turn form their own tree.
[23] Network databases represent a further development of the hierarchical data model by structuring the data in logical graphs. Since the individual nodes in the graph can be linked to each other arbitrarily via relationships, the restrictions associated with 1:n or parent-child relationships no longer apply. The network database model is thus more complex than the hierarchical database model. It is also possible to make n:m assignments, so that complex data structures can be mapped through networking. One disadvantage, however, is that the database thus has a rigid, complex and confusing overall structure.
The two database models considered so far thus suffer from inefficiencies that often result from redundancies in the underlying data representation. This leads to slow access to the data as well as cumbersome storage mechanisms.
[24] In order to remedy this, the relational database model was developed in the 1970s, which describes the data and the relationships between them in the form of tables (relations). The data is thus stored in uniformly structured records that are organised in tables. The individual entries in different tables can be related to each other by means of references.
[25] We could therefore say that a relational database is a collection of tables (or relations) in which each record is stored in a row (tuple) of a table. Each row consists of a number of attributes (or properties) that determine the contents of the fields and correspond to the columns of the table.
For example, a coin could be described using the following attributes: ID (integer), Nominal (string), Dating (string), Material (string), Diameter (integer), Weight (float), Date of discovery (date). The attribute types that determine what kind of data can be stored in the attribute are named in brackets. "Integer" means integer values, "String" means character strings, "Float" means floating point values and "Date" means dates in the form day/month/year.
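As a small illustration, such a coin relation could be set up as follows in Python with the built-in sqlite3 module; note that SQLite uses the type names INTEGER, TEXT and REAL and stores dates as text, and that all column names and values here are merely illustrative.

```python
# A minimal sketch of the coin relation described above (illustrative names).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE coins (
        id                INTEGER PRIMARY KEY,  -- Integer
        nominal           TEXT,                 -- String
        dating            TEXT,                 -- String
        material          TEXT,                 -- String
        diameter_mm       INTEGER,              -- Integer
        weight_g          REAL,                 -- Float
        date_of_discovery TEXT                  -- Date, stored here as ISO text
    )
""")
con.execute(
    "INSERT INTO coins VALUES (1, 'Denarius', 'Flavian', 'Silver', 19, 3.4, '1998-07-23')"
)
# Each row (tuple) is one record; the columns correspond to the attributes.
for row in con.execute("SELECT nominal, material, weight_g FROM coins"):
    print(row)
```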
[26] Each table only ever records similar real objects (entities), e.g. only coins, only photos, only buildings, etc.; these data are then stored in the database. A record is uniquely referenced with the help of one or more key attributes, the so-called primary key, here called ID. This key is unique and must never change, as it is used to reference the row in the table.
[27] Individual tables can be related to each other using keys to explicitly express relationships between individual records from the different tables. The primary key of a relation, here "ID", may only occur once, i.e. it must be unique in order to guarantee referencing.
A foreign key, here "ID_N", is used to reference the primary key of another table.
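A hedged sketch of this mechanism, again with Python and sqlite3; the column name mint_id stands in for the "ID_N" of the slide and is purely illustrative.

```python
# A foreign key in the coins table references the primary key of the mints table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE mints (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE coins (
        id      INTEGER PRIMARY KEY,
        nominal TEXT,
        mint_id INTEGER REFERENCES mints(id)   -- foreign key
    )
""")
con.execute("INSERT INTO mints VALUES (1, 'Rome')")
con.execute("INSERT INTO coins VALUES (10, 'Denarius', 1)")

# The relationship between the two tables is resolved at query time with a join.
print(con.execute("""
    SELECT coins.nominal, mints.name
    FROM coins JOIN mints ON coins.mint_id = mints.id
""").fetchall())
```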
[28] The relational database model comes closest to the way humanities scholars make references. Its searchable ontologies are at the same time a conceptualisation of the world in terms of relations between persons, things, places, events and actions. Nevertheless, the relational database model has some drawbacks that have also drawn criticism.
On the one hand, the artificial key attributes, which are necessary for internal management information, increase the amount of data. On the other hand, the lack of homogeneity with programming languages that use a different syntax and other data types often makes external programming interfaces necessary, which in turn can have some limitations.
[29] The relational database model has proven to be extremely successful. Earlier models - hierarchical databases and network databases - are rarely used today, and the relational model will probably remain the dominant model for many years to come. However, there are several other models that have been the subject of active research over the last three decades. I would like to introduce two of them.
First of all, there is the object-oriented database model, which shows its strengths especially in the management of complex data structures (such as 3D modelling, GIS or multimedia applications). The impetus for its development came from object-oriented programming, which is widely used among software engineers. While the relational scheme is based on an older conception of programming in which the data and the processes for handling that data (such as input and query) are strictly separated, the object-oriented model proposes to transform both the data and the processes into discrete objects. This is roughly equivalent to real-world objects that have certain properties and behave in a certain way. In object orientation, they are modelled by classes with certain attribute and behaviour definitions. So the data for the coin table and the elements of the operations that can be performed on them belong to the same basic structure here.
The advantage is that this facilitates maintenance by having fewer dependencies between data elements and by allowing reusable data modules that can easily be moved from one database context to another.
Another advantage is the creation of inheritance hierarchies. That is, one table "inherits" properties from another without duplicating them, creating a certain semantic richness that is not achievable with more conventional methods.
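The following is only a loose illustration of the object-oriented idea in a programming language (Python), not of an actual object-oriented database system: data and behaviour live in one class, and a subclass inherits both. All class and attribute names are invented for the example.

```python
# Data and the operations on it belong to the same structure; Coin inherits
# attributes and behaviour from Artefact without duplicating them.
class Artefact:
    def __init__(self, inv_no, material):
        self.inv_no = inv_no
        self.material = material

    def label(self):                      # behaviour stored together with the data
        return f"{self.inv_no} ({self.material})"


class Coin(Artefact):                     # inherits inv_no, material and label()
    def __init__(self, inv_no, material, nominal, weight_g):
        super().__init__(inv_no, material)
        self.nominal = nominal
        self.weight_g = weight_g

    def label(self):                      # specialises the inherited behaviour
        return f"{self.nominal} {super().label()}, {self.weight_g} g"


print(Coin("1999.17", "silver", "Denarius", 3.4).label())
```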
[30] Finally, there are the graph databases, which, as successors to the network databases, represent the data as nodes and the relationships between them as edges, where both the nodes and the edges can have properties. Such graph structures with nodes, edges and properties are particularly suitable for the representation and storage of highly interconnected data.
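A minimal sketch of this idea, here with the Python library networkx rather than a full graph database system; the node names and relation labels are invented for the example.

```python
# Both nodes and edges carry properties; relationships are first-class data.
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("coin_10", kind="coin", nominal="Denarius")
g.add_node("rome", kind="mint")
g.add_node("titus", kind="person", role="emperor")
g.add_edge("coin_10", "rome", relation="minted_in")
g.add_edge("coin_10", "titus", relation="depicts")

# Traversing the highly interconnected data: everything directly related to the coin.
for _, target, attrs in g.out_edges("coin_10", data=True):
    print(f"coin_10 --{attrs['relation']}--> {target}")
```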
[31] The website DB-Engines.com publishes a monthly ranking of the most popular database management systems. It is made up of the number of hits in search engines, the frequency of searches in Google Trends, the frequency of technical discussions in the usual forums, the number of job offers, the number of profiles in LinkedIn and Upwork and the number of Twitter tweets in which the system is mentioned. Here Oracle, which is mainly used in industry, and MySQL are in the top two places. However, the dominance of relational databases is particularly striking.
[32] The long-term trend also shows little change here. Please note the logarithmic scale: the gap to the top three is much larger than it appears. Because of their user-friendliness, I have added MS Access and FileMaker, which continue to be successful, especially in the humanities.
[33] Creating a database requires the formalisation and categorisation of information. This can be objectively given, as in the case of books, with author, title, publisher and year of publication. Personal data or stocks of goods can also be recorded unambiguously per se. This is often not so easy with data in the humanities, because the complexity and heterogeneity of artefacts and cultural circumstances cannot be subsumed under a few generic terms in the same way. And even more difficult than the assignment to categories is the relation of the entities to each other, which are usually not objectively given, but must first be determined by way of interpretation. Where, for example, relations can be expressed in terms such as "is resident in" or "is married to" in the registration office, in the humanities the relations are named with suggestive uncertainties such as "is at the same time as", "resembles", "belongs to" or "is derived from".
[34] The creation of a database is therefore a scientific achievement in its own right, just as much as the results of its evaluation, which are very easily predetermined by the parameters set during conception.
Behind almost all visualisations, text and image archives or multimedia applications there is a database; usually to efficiently and dynamically load a hit list, generate a website or fill a map with labels. But even in these relatively simple applications, it becomes clear that the underlying ontology has considerable intellectual value. For each database contains not only the individual static values, such as "Michelangelo", "David", "1501-1504", and "marble statue", but also the associated ontological relationships between these data, such as "Michelangelo created a marble statue named David in 1501-1504."
[35] However, research does not only arise in the basic sorting of information, but also in the exploratory work with the data in the database. The more flexible the database model, the easier it is for new and unexpected references to emerge. In this respect, databases are perhaps less precedent-setting constellations than formations of knowledge accompanying interpretation.
[36] This specific formation of knowledge results to a not inconsiderable extent from the technical and medial peculiarities of databases, for databases have three different modes of access to the objects they store and manage. The subsurface, to take up Frieder Nake's term from the third lesson again, i.e. the procedural level of the computer where the data is stored and processed, follows an internal storage logic. What the user sees on the surface, however, is a different arrangement of the data, one that follows the external logic of use and often has little to do with the actual management of the data or with the conceptual description of information that the database model specifies. As with the digital image, information that is arranged identically on the subsurface can be visualised differently on the surface, depending on the query and layout, and can in addition be conceptually linked or related in yet another way.
[37] This situation has very far-reaching effects in that the user often does not even notice what kind of data collection is involved and how it is structured, i.e. according to which criteria the information is filtered and displayed. I am thinking primarily of search engines like Google or sales platforms like eBay or Amazon, which produce customer-specific selections. The same applies no less to scientific databases where, depending on the database model, very different hits can be achieved.
[38] This has to do with the fact that databases can be used in a wide variety of ways and therefore appear very different to us. The media scientist Marcus Burkhardt distinguishes between three manifestations of digital databases. Firstly, the database as a latent infrastructure (as in content management systems), where the database is only used as a container for the dynamic display of entered data and not for its evaluation. Secondly, the database as an information collection and research tool that makes it possible to find the one in the many, and thirdly, the database as a Big Data application that makes it possible to evaluate the many and, above all, to visualise it.
[39] The suitability of any database depends very much on the selection and preparation of the data entered. But not only. Its basic media structure also shows it to be a two-faced source of information. By this I mean the computer's peculiarity, already discussed in the case of the digital image, of representing the data twice, namely as the surface of use and the subsurface of signal processing. Or to put it in the words of the well-known sociologist Niklas Luhmann: "Above all, however, the computer changes the relationship between (accessible) surface and depth, compared to what was traditionally defined by religion and art. [...] The surface is now the screen with extremely limited use of human senses, the depth, on the other hand, is the invisible machine that is now able to reconstruct itself from moment to moment, for example in response to use. The connection between surface and depth can be established via commands that instruct the machine to make something visible on the screen or by printing it out. It itself remains invisible."
[40] So on the one hand there is the interface that visualises the data and where the user enters his commands, and on the other hand there are the signals that affect the data. The relationship between surface and subsurface is characterised by the intertwining of command and data structures: in the case of databases, views that mediate downwards and database commands that configure the possibilities upwards.
[41] But the decoupling of the two sides (i.e. the surface and the subsurface) remains in principle indissoluble. Databases are therefore always only "information potentials" that update themselves differently in each case. The unambiguous instance "query" is confronted with a multitude of independent, manipulable entities as input. Ultimately, it is impossible for the user to know whether there would not have been 'better' information if the input had been formulated differently.
But the user is often willing to find out and thus rearranges the information through new search terms and links. "Searching becomes a creative act in which unknown connections can be explored and investigated." (Burkhardt 257)
[42] On the other hand, databases, in their need to formalise and abstract, actively co-produce information by visualising and thus consolidating assignments and links. This visualisation, like the level of abstraction, has an ordering character with a certain facticity. As a pre-logical structuring of reality, the databases construct an image of reality, just as the cabinets of curiosities once did: information about reality thus becomes information as reality.
[43] The database as a symbolic form of digital media culture, as Lev Manovich has formulated it, thus has powerful potential to filter and structure information again and again. Epistemologically, this phenomenon is of enormous importance and demands an extremely critical approach from the scientific user. This is why it is so important to have a thorough knowledge of the data basis, the database models and the forms of selection and annotation of the data.
[44] Let us now move on to the second part of our lesson, the recommendations for creating a database. Please do not expect a database cookbook here where every step is explained. Nevertheless, I would like to advise you to follow the recommendations like a recipe. You will be able to create a database without them, but for a consistently good result, some ingredients and steps are essential.
[45] If you want to create a database, you do not need to start from scratch in the conceptual design. There are a number of standards, and knowing them will help you to stay up to date and compatible.
[46] Programming a database is relatively easy to learn. There are a number of YouTube tutorials, and we also offer regular exercises on this. Therefore, the following is more about basic things that you should take to heart when creating a database.
[47] As we have already seen, before designing a database, some preliminary questions regarding content and organisation need to be asked in order to determine the requirements for the database: What is the nature of the data? What exactly is to be achieved with the database? Who is to use the system and in what way?
Please then record this requirements analysis in writing.
To get further ideas, it may be useful to look for other databases with a similar content or technical orientation. Check whether you can adopt these systems or parts of them.
[48] Once the requirements are clear, they need to be formalised and put into a database concept. Define the entity types and their relations, preferably using an entity-relationship model, which depicts the entities (the tables) as boxes, their attributes (the columns) as ovals and the relationships as diamonds.
You need to specify exactly which fields are to be assigned to which entity types. In principle, it is better to distribute the contents over several fields with smaller information than to provide a few large free-text fields. For example, it makes more sense not to write the storage location of an object in one field, but to record it separately by location, museum, department and inventory number. On the one hand, numbers have a different data format than text, and on the other hand, this information can be searched for and sorted in a more targeted manner. Merging data from several fields is not a big effort, but distributing the information of one field properly over several is.
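A small sketch of this asymmetry in Python (the field names and values are invented): merging small, typed fields into one label is trivial, whereas splitting a free-text field apart again is fragile.

```python
# Easy direction: combine small, typed fields into one display string when needed.
record = {
    "location": "Athens",
    "museum": "National Archaeological Museum",
    "department": "Vase Collection",
    "inventory_no": "CC 1234",
}
print(", ".join(record[k] for k in ("location", "museum", "department", "inventory_no")))

# Hard direction: a single free-text field forces string parsing, which breaks
# as soon as a value itself contains a comma, and targeted sorting is lost.
free_text = "Athens, National Archaeological Museum, Vase Collection, CC 1234"
parts = [p.strip() for p in free_text.split(",")]
print(parts)
```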
It is important that you document the database design completely so that you can still understand it later and make specific changes. Of course, this is especially true for databases that are operated by several users.
And already in this phase you should think about data backup! Develop a workflow for short-term data backup and long-term, sustainable archiving of data.
[49] For collection objects, the Documentation Working Group of the International Council of Museums (ICOM) has developed the LIDO scheme: LIDO, the abbreviation for Lightweight Information Describing Objects, is an XML schema for exchanging and providing metadata on museum and collection objects. Only three mandatory sections are defined there: the "object or work" for classifying the object, the "title or name" for identifying the work and the record section for the administrative metadata.
In addition, the LIDO record identifier and the language information for the metadata are mandatory.
[50] The real strength of the LIDO scheme, however, lies in its event orientation. Here, the information on the collection objects is assigned to an event, i.e. one (or more) events in the life of the object.
A Greek amphora, for example, goes through many stages in its long life, at which data accrue. The first event is the production; data on the material, the production, the form or design and the decoration are assigned to it. The place of this event is the potter's workshop. It leaves this with the second event, trade or sale, which may have taken place at the market, in a shop or at the harbour. Trade tokens or special devices for stacking and packing are data that could be associated with the second event. The longest phase in the life of an object usually takes place during use and this is where all traces of use should be recorded. A distinction is made between the primary and secondary function. An amphora is primarily used for storing wine, but may also have been used secondarily as a container for cremated remains. If residue analyses or other chemical examinations were carried out, these data would be assigned to the event "use" just like scrapings and other signs of use.
Use ends with the deposition of the amphora. In many cases this is a waste context, but secondary deposition in a sanctuary or tomb is also common. Information on the (usually reconstructed) situation of this deposition belongs to this event.
The amphora came to us through an excavation. Data on the find site, the exact location of the find, the excavator and the date of the excavation, the find assemblage and much more can be assigned to this event.
This has given our Greek amphora a new function as a collection object and all collection events and museum data are now assigned to the sixth event. In addition to the inventory and the current identification data, this also includes information on earlier collections in which the vase was located before it came to its current place of storage.
Data on restoration as well as on the state of preservation are to be kept separate from this, as is the information on the various exhibitions and the associated loans.
Fortunately, our vase has already been published, so that bibliographical information and references to reproductions can be included as the ninth event in our database.
And as the tenth event, our attention is directed to data entry and all meta- and paradata concerning the actual inclusion in our database system.
In this exemplary run-through, it became clear how much data can be collected for a collection object. The event-based approach represents a semantic grouping of the data, from which it is possible to draw conclusions about the object biography as well as the network-like linking of the data with those of other collection objects, persons, places and events.
[51] The next step would be to assign attributes to these events. We had subdivided the production event into four sub-areas: material, production, design, decoration. In addition to naming the material, we could also record material properties such as clay composition / grain, clay colour, strength, colour and density of the coating, each with the corresponding number in the Munsell colour chart. These details can then be combined again in a summary field for better searching or for printing.
Please remember to always keep a separate data field (i.e. attribute) for remarks and for literature for each section, so that no information is lost during entry.
These suggestions are useful for a general acquisition of museum objects. In detail, however, the design of the database depends on your research question. You would want all fields in a database to be filled in, but for a variety of reasons this is never actually the case. Therefore, also consider conceptualisation as a phase in which you take another hard look at what you actually want to do with the data.
[52] It can be sensible to start by recording the data in a spreadsheet programme, because often the exact structure of a database only emerges during data recording. Claire Lemercier and Claire Zalc have defined ten commandments of data entry in their book "Quantitative Methods in the Humanities"; a small sketch that follows several of these rules comes after the list.
1. Reserve the first line only for the names of the variables.
The rest of the worksheet should contain data about the entities. This is important to allow sorting. You should keep notes, calculations and other data in additional files or sheets.
2. Use the first column to assign an ID to each entity.
If you find (or reasonably decide) that two entities are actually the same, change the ID accordingly. It is not a problem if some numbers are not used as identifiers (they do not have to start with 1 either).
3. Keep the wording of the source as much as possible.
If simplifications, modernised spellings or other changes are needed, make them later in a separate column or another file. This will save you the trouble of having to consult the source again if you change your interpretation or if you want to quote from the source.
4. Always note the source reference (insert as many "comment" columns as necessary for this purpose).
Depending on whether all the data comes from one source, or only the data for one person or variable or single item, the solution will vary, but there should always be a reference to the source. For example, there may be one source per file, one source per sheet of a spreadsheet file, one source per row or column (indicated at the head of that column), one column for 'source of information in previous column', etc. The point is to maintain detailed information about the source in at least one version of the database, even if the data is later reorganised or simplified. Columns that mix information from different sources without specifying what comes from where (e.g. "date of birth" if you are looking for biographical information in many sources) cause many problems at the stages of data processing and publication. More generally, include as many "comment" columns as necessary, or put brackets in regular columns to mention a particular source of information, a choice about transcription or categorisation, a doubt about a symbol, and so on. Avoid the kind of "comment" function that creates bubbles. These are difficult to read if you have more than a few, and they cannot be searched or sorted.
5. Enter missing data with "missing" or "not applicable".
There should be no empty cells in your file; more precisely, all empty cells should mean "I have not entered this yet". This is not just a practical point (software programmes often react badly to empty cells). The lack of information, when you think about it, is often information in itself. A standardised category such as "no information about occupation" or "no information about marital status" could yield substantial results when correlated with other variables.
[53] 6. Divide the information as much as possible into different columns.
For example, "Mr./Mrs./Missing", "Surname", "Name 1", "Name 2", "Name 3", "Noble title", "Maiden name", "Pseudonym" and possibly other variants instead of a single "Name" column. The same applies to addresses: Not that the number of a street address is interesting in itself, but putting it in a separate column allows you to sort by street name.
7. Avoid using the data format "date". Instead, spread the day, month and year (in numerical format) over three columns.
8. Note the exact wording of the date. Split intervals into start and end dates.
9. Familiarise yourself with the context menu of your software, i.e. everything that can be done with a right-click on a PC (or Ctrl+click on a Macintosh).
Consult your spreadsheet's help file or online about: freezing areas of the sheet, showing/hiding columns, automatic column width adjustment, multiple selection, automatic copy and increment, insert special, replace, "rand" (random numbers for samples), concatenate, sort, filter and "pivot tables" (contingency tables).
10. Save your data as often as possible (create new files regularly; do not just replace the previous version with the new one).
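As announced above, here is a small sketch of a worksheet, written with Python's csv module, that follows several of these rules: variable names in the first row, an ID in the first column, split date columns, explicit "missing" values and a source reference per row. All file names and values serve only as examples.

```python
import csv

rows = [
    ["id", "surname", "name_1", "birth_day", "birth_month", "birth_year", "occupation", "source"],
    [1, "Beazley", "John", 13, 9, 1885, "archaeologist", "register, p. 12"],
    [2, "Furtwängler", "Adolf", "missing", "missing", 1853, "missing", "register, p. 47"],
]

# One row per entity, no empty cells, no notes or calculations in this sheet.
with open("entities.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```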
[54] In order to save a lot of time later, these decisions should always be made before you start typing. This also means that you define and document binding terms and terminology. After the first experiences with data entry, however, you should regularly check the thesauri and ontologies and readjust them if necessary.
You should also clarify how to deal with the null value problem, i.e. which attributes must always contain a value and where fields may remain empty. In addition to the primary key, it is advisable to always require a value for the central instances, such as the object category, the inventory number or the author's name.
And use controlled vocabulary in the form of value lists, norm data or thesauri whenever possible! Some institutions provide such norm data free of charge.
[55] Often database projects are started with great zeal and then come to a standstill because not enough time, money and personnel have been planned for the ongoing operation. This is because, in addition to the actual data entry, checking and standardising the entry, updating the documentation, troubleshooting and much more are very time-consuming, which is usually underestimated in the beginning.
[56] Therefore, develop concepts for data entry and for ensuring consistent data quality right from the start. Enable exports of data to other environments via appropriate interfaces. Plan for the very regular storage and backup of data and also consider to what extent multilingual data entry can be made possible right from the start. There are now some useful systems for automatic translation, but it is even better to choose the norm data in such a way that a large part of the terms in the ontologies is also available in multiple languages.
[57] During the entire project, it must be clear to all editors where which data is to be entered. Therefore, support the input with external documentation and descriptive input forms where the definition of the attributes can be accessed at any time via roll-over explanations or help functions. Well thought-out labels such as "height in cm" instead of simply "height" considerably standardise the input.
Where possible, favour predefined terms such as value/code lists, thesauri or vocabularies. This way you achieve a high degree of uniformity in the data and avoid typing errors. Sometimes it can also be useful to use technical functions such as indices or auto-completion.
Such specifications help not only with multi-user systems, but also you alone. This is because the semantics of attributes can shift imperceptibly in the editing process if you have not made clear distinctions at the beginning.
[58] Data entry pursues the goal of formalising the data in such a way that it becomes machine-readable and machine-interpretable as structured data. Longer free-text fields, on the other hand, are unstructured. Therefore, it is advisable to additionally provide a content-equivalent attribute with controlled vocabulary for properties of entities that are modelled as free-text attributes (such as detailed descriptions). This can be done, for example, by keywords that summarise the free description.
Longer free text contents can also be divided into several sections and thus assigned to different attributes.
Provide short descriptions for each record in the manner of catalogue headings. Even if it looks like a duplication at first, a short title, like a nickname, can sometimes make the object more quickly recognisable than the information dissected across the database attributes.
[59] Often it is not clear why certain fields have not been filled in. This uncertainty can also be modelled on the input side, for example by reserving certain characters for the reason of the null value.
For example, one could enter "NULL" in the field for information that is currently unknown or irrelevant. Information that is to be added later can be marked with a hashtag (#), and statements that are currently not clearly possible or cannot be answered could be marked with a question mark. The prerequisite, however, is that your database management system allows a search for these special characters.
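A minimal sketch of this convention in Python with sqlite3, assuming the database management system lets you search for these characters; the field names and values are invented.

```python
# 'NULL' = unknown or irrelevant, '#' = to be added later, '?' = currently undecidable.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, findspot TEXT)")
con.executemany("INSERT INTO objects VALUES (?, ?)", [
    (1, "Athens, Agora"),
    (2, "NULL"),
    (3, "# check excavation report"),
    (4, "? Attica"),
])

# All records whose findspot still needs to be added later:
print(con.execute("SELECT id FROM objects WHERE findspot LIKE '#%'").fetchall())
```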
[60] Interfaces and export options are especially important to consider if you want to process the data in another programme. For word processing, spreadsheets or quantifying evaluations, simple text-based file formats such as CSV or TSV are particularly well suited. More complex structured file formats such as XML and JSON are suitable for semantic queries to the dataset,
and the universal interfaces ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity), which allow access to and data exchange with a database, ensure full compatibility with other database applications.
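As a small, hedged illustration, the following Python sketch exports records from an sqlite3 database both as CSV (for spreadsheets and quantitative analysis) and as JSON (for more complex, structured exchange); the file and column names are invented.

```python
import csv, json, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE coins (id INTEGER PRIMARY KEY, nominal TEXT, weight_g REAL)")
con.executemany("INSERT INTO coins VALUES (?, ?, ?)",
                [(1, "Denarius", 3.4), (2, "Sestertius", 25.1)])
rows = con.execute("SELECT id, nominal, weight_g FROM coins").fetchall()

# Flat, text-based export.
with open("coins.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "nominal", "weight_g"])
    writer.writerows(rows)

# Structured export that preserves field names and allows nesting.
with open("coins.json", "w", encoding="utf-8") as f:
    json.dump([{"id": i, "nominal": n, "weight_g": w} for i, n, w in rows], f, indent=2)
```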
[61] Regular saving and backing up of data should be a matter of course. To avoid major problems, you should work with different access rights and link the deletion of data records to security queries.
It is also highly recommended to create backup copies or data exports, especially before making major changes to the data stock, e.g. if you work with replacement commands or merge several versions.
In addition, it should be possible to fall back on older versions with the help of versioning, which saves all states of the database contents.
[62] If you want to enable multilingual input, you must be aware that this has a strong impact on the conceptual and logical design of the database schema.
A simple way would be to label the user interfaces and forms in different languages (as well as the headings, labels next to the input fields and help texts). However, the field contents remain monolingual. Attributes with controlled vocabulary can help here, which can be automatically transferred into another language, provided that concordance lists (preferably as international reference systems) are available.
If necessary, one must define several language-specific attributes for a property of an object (for example Darstellung_de and Darstellung_en), which are then filled in in the respective language (e.g. Darstellung_de with the German description "The back shows Victoria holding a shield, facing to the left." and Darstellung_en with the English one: "The back shows Victory holding a shield, facing to the left.").
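A small sketch of the two strategies in Python; the attribute names follow the example above, while the German sentence and the concordance entries are merely illustrative.

```python
# Language-specific attributes for free text, one attribute per language.
record = {
    "Darstellung_de": "Die Rückseite zeigt Victoria mit Schild nach links.",  # illustrative German text
    "Darstellung_en": "The back shows Victory holding a shield, facing to the left.",
    "material": "silver",   # controlled vocabulary, stored only once
}

# A concordance list maps the controlled term into other languages automatically.
concordance = {"silver": {"de": "Silber", "en": "silver", "fr": "argent"}}
print(concordance[record["material"]]["de"])
```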
[63] For larger projects with distributed staff, it is advisable to enquire about the possibilities of server-client operation in good time (i.e. before the actual deployment) and to clarify potential requirements for hardware, software and infrastructure (such as network access).
This also includes determining which user group is allowed to see and change which fields and in what way. Consider how the input masks should be designed and define the rights accordingly. It is essential to ensure uniformity and clarity!
And agree among yourselves on terms and understandings of field contents, so that everyone means the same thing when entering and querying data and enters the same data in the same fields.
[64] Thirdly, I would like to talk about the image databases of the future - actually of the present, but which has not yet fully arrived in the humanities - and at the same time spin a bit further what might be the rule in the future.
[65] "After the novel and then the cinema privileged narrative as the key form of cultural expression in modernity, the computer age introduces its correlate - the database," wrote Lev Manovich in 2001. "Many new media objects did not tell stories. They had no beginning or end; in fact, there was no thematic, formal or other development that would summarise their elements into a sentence. Instead, they are collections of individual objects, each object having the same meaning as any other." The database is therefore incapable of providing a grand world narrative.
However, websites such as "Dickens Dark London" and multimedia applications such as "Ancient Rome 3.0" use databases to let the user create the narrative himself by determining how he navigates through the database; usually, however, without noticing that he is in a database application. Nevertheless, a kind of montage emerges as a hyper-narrative to describe objects and ideas. For Lev Manovich, the database is therefore the symbolic form of digital media culture, but it is not yet fully understood.
[66] Computer games can perhaps be considered the mirror image and narrative form of digital media culture. They are perceived by the user as an interactive narrative: there is a clear beginning and a goal, which is why the presentation of objects seems to follow a certain logic, unlike in databases. Often the narrative of a game, such as "Glory of Rome", consists of a simple algorithm: found new cities, kill as many barbarians as possible and thus reach the next level. Unlike in a database, the player's expectation is generated by an algorithm, which in turn requires the player to execute the algorithm in order to win. Or as Will Wright, the inventor of SimCity, put it: "Playing the game is a continuous loop between the user (viewing the outcomes and inputting decisions) and the computer (calculating outcomes and displaying them back to the user). The user is trying to build a mental model of the computer model." The further the player progresses and the more levels he reaches, the better he understands how the algorithm works - in other words: he understands its hidden logic. Every process and every requirement can be reduced to an algorithm that the computer must execute, just as every object is modelled as a data structure that can be retrieved in an organised way in a database.
If it were possible to give our databases a narrative, two great advantages would arise: firstly, the organisation of the database (i.e. its structure, its way of describing and managing the entries) would be intuitively understood by the user.
Secondly, the attention of the users and thus their interest in the contents would increase, which could be used to correct and expand the data, so that those interested in cultural-historical data would ultimately also participate in the development of the database.
[67] Market research has long since developed a differentiated scanning of user behaviour and offers us a comparable article for purchase with every query. The number of users of archaeological databases is admittedly too small to be able to display sentences such as "Whoever searched for Augustus was also interested in Livia". But the method of offering related records as an option makes sense. One could think of simple links with which further records could be displayed as suggestions, as has long been common in libraries through reading lists compiled by lecturers or enthusiasts.
[68] Ontologies and semantic annotations also make related hits possible, just as in this example the search for Virtual Museum also shows "3D Art Gallery" as a result. Through related keywords and user monitoring, such offers can be well adapted to the respective target audience.
[69] Also on the cutting edge are sales sites like Houzz.com, which display furniture and home appliances in a changing hierarchy, depending on browsing history and search query. Every click is used to offer the prospective buyer a user-supported set of results.
In a scientific application, however, one would like to know according to which criteria the search results are displayed. For the search for images, for example, search terms, colour distribution, semantic links, images shown together in virtual exhibitions and pinboards, or works of art also discussed in texts would come into consideration.
[70] These links do not necessarily have to be created manually, but can also be automated using metadata: the Open Library initiative of the Internet Archive is a good example of serious user guidance through faceted browsing. Here, the search results can be modified by switching various filters on and off (drill-down), so that results are added or hidden by user actions.
In this way, the search is increasingly refined, so that one zooms into the database, so to speak.
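A minimal sketch of this drill-down principle in Python with sqlite3: every activated facet adds a filter condition, so each click narrows the result set further. The table, columns and values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE finds (id INTEGER PRIMARY KEY, category TEXT, material TEXT, century INTEGER)")
con.executemany("INSERT INTO finds VALUES (?, ?, ?, ?)", [
    (1, "coin", "silver", 1), (2, "coin", "bronze", 2), (3, "lamp", "clay", 1),
])

def faceted_query(facets):
    """Build a query from the currently active facets (drill-down)."""
    where = " AND ".join(f"{column} = ?" for column in facets)
    sql = "SELECT id FROM finds" + (f" WHERE {where}" if facets else "")
    return con.execute(sql, list(facets.values())).fetchall()

print(faceted_query({}))                                            # no facet: everything
print(faceted_query({"category": "coin"}))                          # first filter
print(faceted_query({"category": "coin", "material": "silver"}))    # zoomed in further
```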
[71] You may also refer to the pages of the Athenian Agora excavation by the American School of Classical Studies at Athens. Here, the data record for an excavated find is linked to the corresponding entries in the categories "Publications", "Reports", "Plans and Drawings", "Images", "Monuments", "Coins", "Deposits", "Catalog Entries", "Catalog Cards" and "Coin Envelopes", so that one can not only call up all the relevant information on the object one is looking for, but can also easily access the other finds in the same context or, for example, find the corresponding comparative pieces via the linked publications. Unlike in the Open Library, however, the links are not offered here in a faceted manner as a drill-down. The result list therefore remains static and the display of further search options does not depend on the initial query. It would certainly be helpful for the user if, for example, the often five-digit number of linked objects were again subdivided according to genres or epochs, which would probably be technically easy to implement, since this information is uniformly and completely available in almost all data sets.
[72] The enormous amount of digital images uploaded to the internet every day poses challenges for storing, indexing and accessing these volumes of data. As we have already seen in lesson three, companies like Google use image pattern recognition to support keyword-based search.
[73] The process is called image collection exploration and is a method of searching large image databases and repositories to find, display, summarise and browse image data quickly, effectively and intuitively. It attempts to find an answer to the semantic gap problem that arises in content-based image retrieval due to heterogeneous or multimodal data.
[74] The first step of image collection exploration consists of clustering by prototype, i.e. a larger image collection is decomposed into representative image sets so that the search is no longer performed on the huge total set, but only on a subset. Summarisation as a problem in computer science thus deals with the selection of a representative set of images for a search query or with a summary selection as an overview of an image collection.
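A hedged sketch of clustering by prototype with scikit-learn, assuming that feature vectors have already been extracted from the images (for example by a neural network); the numbers of images, features and clusters are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(1000, 128)      # placeholder for 1,000 image feature vectors

# k-means decomposes the collection into groups; the image closest to each
# cluster centre serves as the representative prototype for the overview.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(features)
prototypes = []
for c, centre in enumerate(kmeans.cluster_centers_):
    members = np.where(kmeans.labels_ == c)[0]
    closest = members[np.argmin(np.linalg.norm(features[members] - centre, axis=1))]
    prototypes.append(int(closest))

print("Indices of the representative images:", prototypes)
```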
[75] This is followed by a visualisation of the image sets via a visualisation metaphor: relationships between the images are displayed in a special layout on the basis of a similarity function. This visualisation, as an interactive element of the search algorithm, improves the system, which is able to learn from the users' reactions and feedback.
[76] Based on the learned image relationships, suggestions for semantically similar images can be made. Here you can see a visualisation of three clusters that Pixolution has created based on object recognition. Already at first glance you can see that the colour of the background plays a role here, while "tomato" was probably identified via a key term that occurs in a core group of these sets.
[77] Image Collection Exploration is also able to sort a large number of images, in this case over a thousand sunflower images, according to their visual characteristics. The quality of the result depends on the learning methods and the training sets.
[78] The first platform that consistently relied on image collection exploration in 2014, the FROMPO social discovery tool, advertised that it could be used to share content, images and videos. In the meantime, however, this is mainly sexually explicit material, and the platform therefore also requires proof of age.
[79] Many image databases now consistently rely on a visual, object-oriented appearance by using an image browser for the database entries. You know this from Instagram, but monument authorities too, like the one in Amsterdam, present their excavation material on the website "Below the Surface" as a chronological arrangement of photos, which serves as the entry point to the excavation database.
[80] Similarly to faceted browsing, visual facets could be offered, perhaps generated as image stacks, or simply as image overviews generated according to the user's input. In the database on Attic pictorial vases of the 4th century BC, for example, the vase images can be grouped in a similar way according to motifs, sites and product groups.
The possibilities of grouping archaeological objects in terms of their biography are manifold: with regard to their design, i.e. to depict the processes of creation and shaping, an arrangement according to genre, format, pictorial motif, workshop or production suggests itself. Spatio-temporal perspectives become clear through a grouping according to sites, regions (or provinces or domains), contexts of use, trade routes and events up to museumisation, while the respective actors can be derived from the two groupings mentioned above. I am thinking of links such as "Works of Praxiteles", "Philip's Tomb", "Building Programme of Pericles", "Rome under Trajan", "Collection of the Habsburgs", "Strangers in Greek Sanctuaries", which could be derived from inscriptions, dedications and imported objects, or "Visitors to the Casa dei Vetti", insofar as graffiti, traces of use or clues to the reception of the wall painting make this possible.
[81] Automated search results could also be displayed as a timeline, as in ARTES, or on a map, as in Historypin, where photos and films of monuments stand for events that took place there and can be accessed in this way. The advantage here is that not only the location, but even more the thumbnails as eye-catchers attract attention and arouse curiosity.
[82] In a similar way, the Tate Modern in London staged a "Gallery of Lost Art" as a virtual exhibition in 2013. The user looked from a kind of bird's eye view onto tables on which photos and documents on lost or destroyed works of art were laid out.
Clicking on them opened up background information so that the user could browse interactively through the history of the works of art. Database contents, which per se have no narrative, could thus be arranged narratively through their connection with space and time, because the time axis lends linearity to the documentation, even if the material can probably be arranged again and again in different ways depending on the interest of the user.
[83] In order to visually link data collected empirically through interviews with the spaces where the interviews took place, the Datarama was developed here in Göttingen at the Max Planck Society. In a circular room six metres in diameter, projectors cast 360° panoramas onto the wall, which can be controlled interactively with a touch pad. In this way, the storylines and influences that may have been relevant to the respondents' answers can be reconstructed in retrospective visualisation.
[84] This spatio-temporal reference of the data was implemented very consistently in the Venice Time Machine project from 2012 to 2019. More than 190,000 documents were digitised and indexed in one of the largest databases concerning cultural heritage. The aim was to provide a collaborative multidimensional model of Venice by creating an open digital archive of the city that spans more than a thousand years of development. In this 3D city model, for example, all inhabitants of the respective period, as far as they appear in the documents, are located at their place of residence. The project, which uses almost all significant DH methods, aims to show the distribution of information, of money and trade goods as well as of artistic and architectural patterns in a networked way, to produce Big Data of the past, so to speak.
[85] In a similar way, but on a much smaller scale, in the EsteVirtuell project we are publishing an antiquities collection from the late 18th century as a scientific image database, with 3D scans of the nearly one thousand marble sculptures, by reconstructing the installation and making a virtual museum the interface of the database. My proposal for a scientific image database of the future would therefore be to create a narrative for the databases by reconstructing a biography of the objects in time, space and materiality and to achieve a stronger bond between the user and the database through forms of creative browsing in the manner of a museum visit. This is also connected with the hope of opening up new ideas and research perspectives through an unconventional linking of monuments.
[86] This actually brings us to the challenges for scientific image archives and repositories.
Databases are primarily text-based. This is not unproblematic for images and objects. A future task is therefore to analyse the data structures with regard to image and object evidence. What images are created in the mind of the user of a database? How can the data be prepared in such a way that the peculiar effect of images and objects can be mapped better than before? And how can image-immanent evidence and content interpretation be stored adequately and efficiently?
Furthermore, a critical, reflective approach to image pattern recognition and search results is necessary. We are only at the beginning here because we hardly understand how neural networks work. But we also have to be careful about the search results in the databases and always ask how the hit list came about.
As has become clear, I also see great potential in a stronger user-centricity in the visualisation of search results. Creativity is needed here as to what the relationship between the backend and the frontend of the databases could look like in the future.
[87] § Basics of structuring and visualising information in databases
§ Differences in database models (relational, object-oriented, hierarchical, etc.)
§ Recommendations of IANUS / German Museums Association / CIDOC Working Groups for the creation of databases
§ Databases of your subject (e.g. Classical Archaeology), their material basis, history and conditions
§ Relevance of controlled vocabularies / norm data, thesauri etc. for use in databases
[88] § Selection of suitable database (systems) taking into account their significant properties for different usage scenarios
§ Development of a MySQL image database (e.g. in LibreOffice BASE) for scientific questions
§ Practical experience in the use of image databases (searching, sorting and replacing, importing and exporting data, creating views, relations and evaluations)
[89] Which image databases do you know? How are they structured?
What purpose do scientific image databases currently serve? Where do you think the future of scientific image archives lies?
Explain the difference between data model, data structure and data type!
What is a database system?
What possibilities do you know for image-based data search?
Briefly characterise three database models in terms of structure and usefulness
What would the perfect image database have to look like in your opinion?
[90] With a look at the literature, especially textbooks on database systems and application-oriented manuals (as well as two titles on media theory), I bid you farewell and wish you a good week.