Martin Langner, Introduction to Digital Image and Object Science (Summer Semester 2021)
III Analysis: Lesson 10. Data Visualisation and Exploration (https://youtu.be/b0R8vpxMShU)

[1] Introduction
[2] History of Information Graphics
[10] Representation of Knowledge in Diagrams and Graphics
[16] Content of this Lesson
[17] 1. Multivariate Methods
[18] Exploratory Statistics
[20] Principal Component Analysis
[30] Correspondence Analysis and Seriation
[40] 2. Network Analysis
[41] Basics
[45] Complex Systems
[48] Social Network Analysis
[50] Historical Network Analysis
[54] Tools
[56] 3. Data Visualisation
[57] Data Visualisation in Comparison
[62] Chart Types
[76] Basic Principles of Visualisation I (Data)
[82] Basic Principles of Visualisation II (Design)
[86] Tools
[98] Graphic Design as a Model
[100] Conclusion: Current Research Questions
[101] What You Know and What You Should Be Able to Do
[104] Literature

[1] Welcome to the tenth lesson of our introduction to Digital Image and Artefact Science. Today we are talking about data visualisation and data exploration. Visualisation means making visible something hidden. In this sense, today's topic will be the exploration of large data sets with visual methods, but also what is usually associated with information graphics: the clarification of statistical evaluations in diagrams and charts.

[2] Using points, lines and areas to visualise mathematical relationships is very old and a basic component of geometry. However, in the analogue procedure with pencil, compass and paper, people used to think more carefully about the individual visualisation steps than today, when diagrams are created at the touch of a button. It is easy to overlook the fact that both the acquisition and the visualisation of data are independent interpretations. Data are not given, they are made! And visualisations do not just show results, they create them!

[3] So a visualisation is not just an illustrative representation of the facts, but serves to explore, analyse or communicate. An early example of all three functions is provided by the Ebstorf map of the world, produced in northern Germany around 1300. Used as an altarpiece, this display panel, 3.57 m in diameter, aimed to communicate the medieval conception of the world to the congregation. Geographically, the Mediterranean Sea is in the lower right quarter, dividing the known world into Asia (top), Europe (lower left) and Africa (lower right). In the middle is the walled Jerusalem, but the centre is Christ, whose head you see at the top and whose feet are at the bottom.

[4] To give an overview of facts that are not visible to the eye is also the task of mathematical-technical and medical diagrams, which illustrate the corresponding facts and thus record them for posterity. As examples I show Roger Bacon's famous study on the nature of light from 1280 and a schematic depiction of the circulatory system in the so-called Anatomy of Mansur ibn Ilyas, created in Iran in the late 14th century.

[5] With the further development of the natural and economic sciences, the communication of counts, measurements and ratios became increasingly important. The simplest graphic arrangement is the table. However, alternatives were sought early on.
For example, the Scottish engineer and economist William Playfair developed the first bar chart in 1786, with which he visualised Scotland's exports and imports of 1781. His famous line chart of the same year, which connected the values with lines and coloured the resulting areas, made England's export surplus vis-à-vis Denmark and Norway, which had been increasing since 1753, clear at a glance.

[6] Keeping track of statistical data on area, state income, military expenditure and manpower was of central importance for the kingdoms of the early 19th century. For this reason, August Friedrich Wilhelm Crome drew up an area chart for the Prussian king in 1820, which listed all these data for the German states and put them in relation to each other. For this purpose, he further developed his size map of Europe, which had already been published in 1785.

[7] Out of the desire not only to reproduce ratios but also to relate several factors to each other, thematic cartography also experienced a great upswing in the 19th century and, enriched by quantitative attributes, acquired a new quality. As an early example, I show a depiction from 1858 by the French civil engineer Charles Joseph Minard, who not only coloured the individual areas of the map to depict the cattle sent to Paris from all over France for consumption, but also provided them with circular diagrams.

[8] As a basis for discussion on the routing of the railway line between Dijon and Mulhouse, he had already mapped the traffic volume on the existing roads in the area in 1845 by drawing the roads in different thicknesses and also varying the brightness. This hybrid form of map and flow chart, the flow map, is also suitable for depicting changes over time. Famous is Minard's 1869 chart comparing Hannibal's loss-ridden march across the Alps in 218 BC (above) with Napoleon's catastrophic losses during the Russian campaign of 1812 (below). The width of the band illustrates in each case the size of the armies at specific geographical points during their advance (and, in black, their retreat). It shows six types of data in two dimensions: the strength of the troops, the distance covered, the temperature, latitude and longitude, the direction of movement, and the location in relation to specific dates.

[9] If we pointed out in the last lesson that sampling always means reducing data, this is even more true for data visualisation. A defining characteristic of the map, namely its dimensional accuracy, no longer plays a major role in thematic maps. More important is the visualisation of certain topological relations. Taken to the extreme, this can be found, for example, in the transit plans of today's large cities, which go back to Harry Beck's diagrammatic plan of the London Underground lines of 1933, where, for the first time, all lines run straight or bend at 45° angles and the distance between the stations is uniform. A comparison with the older plan from 1926 makes it evident how much the reduction to basic geometric shapes has increased the readability of the plan.

[10] And the need is constantly growing: statistics have a fixed place in daily newspapers (not only since the pandemic), interfaces have to be designed to be ever more user-friendly, and even in presentations like this one, the demand for visual reinforcement of the message is becoming ever more pronounced. This does not simply result from the data; it has to be developed. The flood of information must be countered by an easily comprehensible visualisation of the facts.
But this can only be done by concentrating on the main statements, i.e. by abstraction and information reduction.

[11] I have already spoken several times about information visualisation. The terms information visualisation and data visualisation are generally used synonymously. But there is a fundamental difference between diagrams and schematic drawings. In the one case, data is quantified and presented in an abstract way; such non-natural diagrams are usually based on numerical values. In the other case, information is presented qualitatively, e.g. by explaining details in schematic drawings. The most common schematic drawing is the map, but explanatory graphics, especially of technical matters or building instructions, are also popular.

[12] Typically for French structuralism, the cartographer Jacques Bertin, in his still readable standard work "Sémiologie graphique. Les diagrammes, les réseaux, les cartes" (Paris 1967), dissected graphics into "graphic variables". Points, lines and areas are the basic vocabulary, which can be varied by size, brightness, texture, colour, direction and shape and linked with text labels. The design process aims at information reduction through abstraction. Jacques Bertin puts it this way (and I'll translate it in a moment): "This point is fundamental. It is the inner mobility of the image that characterises modern graphics. We no longer 'draw' a graph once and for all. It is 'constructed' and reconstructed (or manipulated) until all the relations it contains have been perceived." (Jacques Bertin, Graphical Representation and the Graphical Processing of Information, Berlin: de Gruyter, 1982, p. 16)

[13] However, the use of visual variables is not arbitrary. Rather, a consistent usage has emerged since the first appearance of diagrams. Let us look at a few examples: already the first line graph in 1724, the first bar chart in 1786 and the first pie chart in 1801 transferred the measured values to the plane and represented the ratios as geometric relationships. This is still true today: the first dimension is usually visualised in spatial extension, i.e. as a quantity or direction. A second dimension is then added in another, non-spatial variable: as colour, brightness, texture or shape. This is especially the case when the data is to be compared divided into groups.

[14] This is probably because we grasp spatial phenomena more readily than coloured ones. If you use more than eight or twelve colours in a diagram, the individual colours become hard to tell apart. The same applies to black-and-white textures, which were common in times when colour printing was considerably more expensive.

[15] Of course, this rule only applies to abstract forms of visualisation. If the data is mapped onto something that exists in the real world, such as maps, body parts or buildings, and must therefore already be represented spatially, one has no choice but to implement the projection of the data through labelling or even through colour or texture. Basically, however, these conventions are important because through them we have learned to read and quickly grasp these information graphics. Making complex numerical relationships visible must increase the readability of the data (e.g. compared to a table or matrix), otherwise the graphic is just an unnecessary accessory.

[16] As you have probably come to expect, this lesson is again divided into three parts.
First, it is about multivariate methods such as principal component analysis and correspondence analysis. I have reserved the second part for network analysis. And the third part is about the important question of correct visualisation. Here I give you a number of useful tips for creating scientific diagrams.

[17] Let's start with multivariate methods. By this is meant that one examines several statistical variables at the same time.

[18] In addition to descriptive and inductive statistics, we should also mention exploratory or hypothesis-generating statistics, which are becoming increasingly important. In the DH field, this is also referred to as data mining. It systematically (and independently of an initial hypothesis) searches for significant patterns in Big Data, i.e. for possible correlations and differences in the data. In other words, one asks what is characteristic or unusual about the distribution of a feature. These hypothetical statements, usually based on sample data, must then be made probable with the help of inductive statistics.

[19] The goal of visual analytics is to gain insights from extremely large and complex data sets by visually mapping the numerical relationships within them. Visual analytics attempts to combine techniques of information visualisation with techniques of computer-assisted transformation and analysis of data in order to enable a discourse between humans and information. Or, in other words, the computer's advantage of being able to analyse data automatically is combined with the human ability to grasp patterns or trends visually at a glance. With the help of interactive visualisation tools, one can influence the analysis process accordingly. In contrast to the information visualisation discussed so far, not only results are presented; rather, the user has the possibility to control the analysis himself. Characteristic is the constant alternation between visual and automatic processes: the data are processed or filtered, then corresponding models are generated with the help of data mining, which can be visualised and thus checked in a third step. Analytical thinking is of central importance here in order to be able to draw conclusions from the combination of facts and hypotheses.

[20] Comparing large sets of images with each other is extraordinarily difficult. This is because, as we learned in the lesson on image analysis, there is a variety of formal criteria by which images can be categorised. On the one hand, there are colour values and colour spectra, brightness, saturation or contrasts as well as lines and brush strokes; on the other hand, there are elements of pictorial composition, format, proportions, iconographic patterns and stylistic peculiarities. All of these together make up the special character of a work of art and its effect. It is possible to capture all these features, but to relate each one to all the others at the same time is beyond the capabilities of human consciousness. Our brain does, however, combine, structure and group visual impressions and picks out a few, supposedly relevant aspects with which we then consciously argue.

[21] The statistical method of principal component analysis proceeds in a similar way, using a mathematical approximation to combine a large number of statistical variables into a smaller number of linear combinations that are as meaningful as possible. You have to imagine this like the projection of three-dimensional vectors onto a plane, only in multidimensional space.
A group of components thus becomes a principal component. In this way, extensive data sets are simplified and visualised in a coordinate system.

[22] Let us take an example: we want to compare the early paintings of the Dutch painter Piet Mondrian, who developed from a realistic into an abstract painter within a few years, and group these paintings according to similarities. To this end, Lev Manovich and his team measured 60 features in each of 128 paintings, resulting in 7680 measurements. Of course, the selection of the features determines the result. If, for example, half of the features concern colour properties, we should not be surprised if the paintings are grouped primarily according to their colour; abstraction would thus be understood as a reduction of colour variety. This data set can also be described as a set of 128 points in 60-dimensional space. The goal of principal component analysis is now to project these data points into a two-dimensional subspace in such a way that as little information as possible is lost. Since there are always correlations between the various properties that we have collected here as measurements, this redundancy can be summarised in a vector and visualised spatially as a data point.

[23] We have imagined this data as a point cloud in a multidimensional space. We now determine a straight line that best approximates this point cloud, i.e. one that keeps the distances of the individual points to the line as small as possible and thus captures the greatest possible variance. This straight line forms the first principal component. Then we look for a second straight line that is perpendicular to the first and captures as much of the remaining variance as possible. This forms the second principal component. Both straight lines thus form a coordinate system onto which we can project the points of the point cloud in two dimensions. Each painting forms a point, with points that are particularly close to each other representing particularly similar paintings.

[24] In a statistics programme such as SPSS, the table for the first 20 features or variables, which are output here as 20 components, looks like this. Let's say we only wanted to consider eigenvalues that are greater than one, a rule of thumb also known as the Kaiser-Guttman criterion, which has proven its worth. Then, in our case, only four variables fulfil this criterion. Or to put it another way: with the help of the Kaiser-Guttman criterion, we obtain four components for our analysis. Let us now look at the variance: the first component comprises the largest variance and thus already captures more than 42% of the variance in our data set. Each additional component explains less and less additional variance. Generally we only use components whose share of the variance is above 10%; this means that we are satisfied with capturing 90% of the phenomena in our data set. Thus we are left with two principal components, since already the third component, with only 5.78% of the variance, can be disregarded.

[25] Since every structuring and summarising always means a weighting in a certain direction, individual features can be weighted more heavily or greater variance can be allowed. Therefore, principal component analysis is a good method to search exploratively for common features in large sets of images or complex groups of objects by modelling the degree of variance accordingly. After defining the 60 features, we made a second data-reducing decision in order to steer our result in a certain direction.
This is important in order to be able to clearly identify certain features in the data, but we must also be aware that we are thus clearly limiting the diversity in the data.

[26] But how do you read such a coordinate system that describes the two principal components? There are two simple basic rules for this: The closer two points are to each other, the more similar they are. And: the two axes divide the points into four quadrants, which roughly divide the data into four groups. The further away two points are from the centre of the coordinate system, the less similar they are to each other. In our example, let us first take a look at the outliers: at each of the edges are quite mono- or bichromatic paintings, with bright reds and blues at the top, very bright colours on the left, shades of green-brown on the right and contrasts of light and dark in the lower half. Basically, the degree of abstraction seems to decrease from left to right. This overall impression is also confirmed by paintings that are close to the zero point.

[27] If one evaluates the visualisation of the two principal components chronologically, it is noticeable that Mondrian's early, realistic works are on the right-hand side and the abstract works from 1914 and 1917 form two groups on the left. The works of 1909, when Mondrian made the change from representational to abstract within a few months, indicated here as green dots, are consequently scattered along the X-axis.

[28] Let us summarise once again: principal component analysis uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability of the data as possible) and each subsequent component in turn has the largest possible variance under the constraint that it is orthogonal to the previous components.
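To make the workflow tangible, here is a minimal sketch in Python of how such an analysis could be run with scikit-learn. The feature matrix is a random stand-in for Manovich's 128 paintings with 60 measurements each; this illustrates the general procedure, not his actual code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the real data set:
# 128 paintings, 60 measured features per painting.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 60))

# Standardise, so that features on different scales
# (brightness, saturation, proportions ...) become comparable.
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)

# Kaiser-Guttman criterion: keep only components with eigenvalue > 1.
kept = int(np.sum(pca.explained_variance_ > 1))
print(f"{kept} components with eigenvalue > 1")

# Share of the total variance captured by the first two components.
print("explained variance:", pca.explained_variance_ratio_[:2])

# The first two columns of `scores` are the coordinates of the 128
# paintings in the plane of PC1 and PC2; points lying close together
# correspond to similar paintings.
xy = scores[:, :2]
```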
[30] A special form is correspondence analysis, in which, in contrast to principal component analysis, only categorical data are allowed. This involves variables that can only take one of a limited number of possible values, whereby each individual is assigned to a certain group or category on the basis of its qualitative characteristics. In other words, one does not evaluate measurement results here, but only the affiliation to predefined categories.

[31] The basis of correspondence analysis is formed by cross tables, also called contingency tables, which show the (multivariate) frequency distribution of the variables. This provides a basic picture of the correlation between two variables. The columns and rows of a cross table are in turn visualised as points in a space whose coordinate axes are formed by the respective characteristics. The procedure of correspondence analysis is used not only in market research or empirical social research, but also, for example, in ethnology and archaeology.

[32] We already saw this, for example, in the work of the Egyptologist Flinders Petrie, who in the 19th century sorted the inventories of Egyptian tombs according to characteristics. Finding direct dependencies between two factors is relatively easy. In the case of grave analysis, however, he was faced with the problem of finding a (chronological) dependency structure for the variables in the high-dimensional contingency tables. To do this, he used the method of seriation.

[33] For the dating of the individual tombs in the necropolis of Taranto, for example, Daniel Graepler assumed that tombs with similar grave goods were closer in time than those with completely different grave goods. Graepler determined the similarity of the individual graves by evaluating the morphology of the pottery present in the necropolis, subdivided into phases A-G. It was important that the morphological separation of types followed chronological, culturally determined developments and not functional differences. Then all the grave goods and types were recorded and evaluated in tables.

[34] With the help of a computer program that took over the seriation, the graves could be brought into a chronological sequence, this time arranged in the rows; again by sorting the rows and columns until an approximately diagonal sequence of entries emerged. In conjunction with the other grave goods, a chronological framework for Taranto's graves could thus be created for the first time. Ideally, if the finds were normally distributed, the points ordered by seriation would describe a bell curve; since finds are not normally distributed, the actual graph looks somewhat different.
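The sorting that such a seriation program performs can be sketched in a few lines of Python. The incidence matrix below is invented for illustration (rows as graves, columns as pottery types); the routine repeatedly reorders rows and columns by the average position of their entries, a simple reciprocal-averaging seriation whose result is closely related to the first axis of a correspondence analysis.

```python
import numpy as np

def seriate(A, iterations=50):
    """Reorder rows and columns of a 0/1 incidence matrix so that
    its entries concentrate along the diagonal (reciprocal averaging)."""
    row_order = np.arange(A.shape[0])
    col_order = np.arange(A.shape[1])
    for _ in range(iterations):
        M = A[np.ix_(row_order, col_order)]
        # Score each grave by the mean column position of its types ...
        row_scores = (M * np.arange(M.shape[1])).sum(axis=1) / M.sum(axis=1)
        row_order = row_order[np.argsort(row_scores)]
        M = A[np.ix_(row_order, col_order)]
        # ... and each type by the mean row position of its occurrences.
        col_scores = (M * np.arange(M.shape[0])[:, None]).sum(axis=0) / M.sum(axis=0)
        col_order = col_order[np.argsort(col_scores)]
    return row_order, col_order

# Invented find table: rows = graves, columns = pottery types.
A = np.array([[0, 0, 1, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 1, 0, 0]])

rows, cols = seriate(A)
print(A[np.ix_(rows, cols)])  # entries now cluster along the diagonal
```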
[35] This classification was confirmed by Hempel's correspondence analysis, since the individual graves that Graepler had assigned to a phase lie close to each other on the graph of the correspondence analysis; after each break on the graph, a new one of Graepler's phases follows. Thus, the correspondence analysis confirmed Graepler's assignment, and some graves which could not previously be assigned precisely could now be placed in a phase. However, regional deviations or gender-specific or, for example, ethnic differences can distort the result. Therefore, correspondence analysis works best with data from one and the same location.

[36] The simultaneous observation and analysis of more than one outcome variable is also possible with the help of correspondence analysis; one could speak here of multivariate seriation. Let us take as an example the grave goods and customs in Britain under Roman rule. The visualisation of the principal components in a coordinate system shows very nicely that the Y-axis divides the categories into inhumation and cremation, while above the X-axis lie cremated material and primary grave goods; below it, graves without grave goods are located. Thus, it can now be stated generally that tools appear mainly as secondary grave goods in cremations; shoes or belts, however, are primary grave goods, both in body burials and in cremations. The correspondence analysis has brought about a clear spatial separation on the graph, so that one can also see, for example, that coffins were never used for cremations.

[37] In the north of Britain, body AND cremation burials with cremated material and primary grave goods are attested, while in Wales cremation burials WITH cremated material and primary grave goods dominate the picture. In the west and southwest, inhumations predominated, while in the southeast, cremation burials WITHOUT cremated material and primary grave goods are mainly attested. Marion Struck attributes this on the one hand to ritual differences of the pre-Roman period, and on the other to the blurring of burial customs under Roman occupation.

[38] Correspondence analysis can also be used to make a clear distinction between grave goods that are clearly male or female. For example, correspondence analysis has shown that weapons were found only in the tombs of men, but earrings only in those of women, something that was previously impossible to evaluate due to the vast amount of data.

[39] In summary, we can say with Jörg Blasius: "Correspondence analysis is an explorative procedure for the graphical and numerical representation of rows and columns of arbitrary contingency tables. [...] (There is) a distance interpretation between the variables (expressions) in this procedure - and likewise one between the objects." Thus, with the help of correspondence analysis, a complex table with dependencies can easily be represented graphically not only from row to column, but also from column to column and row to row. Here again, the position of the individual markers on the graph is decisive: the further apart two points are from each other, the less the information they represent has in common.

[40] Another excellent way to search for structures in Big Data is network analysis, which we will look at now. Networks consist of relationships in an information system. These can be, for example, the relationships between painters and their paintings, between family members, between communication partners such as letter writers, or between traders, customers and goods, to name just a few examples.

[41] Archaeological or historical network analysis has evolved from quantitative network analysis in the social sciences, where it is widely used. Its aim is to capture and analyse complex social structures by diagramming and assessing social actors (i.e. individuals as well as groups and societies) and their relationships to each other.

[42] Thinking in networks is not a result of computer-aided analysis, but accommodates the human brain's way of making linkages. We had already established with databases that the relational database system is by far the most common, although it is not the best system from a computer science point of view. And so history and art history, for example, were and are also thought of as networks. Here, relationships can be represented very concretely as teacher-pupil relationships, or they can be taken further as a model of influence by describing networks of relationships between different movements and styles, as Alfred H. Barr already demonstrated in his legendary work "Cubism and Abstract Art" of 1936.

[43] Network analysis thus defines culture as a (regular) network of relationships. All other aspects (such as individuals, resources or norms) are subordinate to the relations between entities. Accordingly, the individual is defined by his or her position in the network and not by gender, ethnicity, age, education, wealth, ideology or behaviour. Network analysis is therefore particularly suited to a macroscopic view of complex networks of relationships with a multitude of actors and relationships, and is thus ideal for data exploration.

[44] The basic element of any network is two nodes connected by an edge; or, in other words, two entities linked by a relation. We have already met such triples as RDF triples, which consist of subject, predicate and object. Here, subject and object are the nodes, and the predicate describes the type of connection. The Semantic Web as a large network of meaning is constructed in this way and therefore makes the contents of structured data understandable for computers.

[45] Network analysis goes one step further and measures the activities in a network.
To do this, one first defines in the data model (depending on the research question) the level of activity assigned to each node and the various attributes that can be assigned to each relationship or edge.

[46] However, many cultural processes are not predictable. This is because relationships between people are constantly changing, as are people's responses to their environment or the conditions to which the world is exposed. In statistics, such cases of non-linear relationships are referred to as complex systems. Their complexity consists in the fact that the behaviour of entities cannot simply be derived from their properties. Network analysis therefore does not attempt to design a static model of relationships, but to capture the complexity of the respective relationships in a flexible way.

[47] The study of complex systems as networks enables the application of graph theory. The fact that the number of edges in a complete graph grows quadratically with the number of nodes (n nodes allow n(n-1)/2 edges) sheds additional light on the source of complexity in large networks: as a network grows, the number of relationships between entities quickly exceeds the number of entities in the network.

[48] The classic work in historical network analysis is John Padgett and Christopher Ansell's paper on the rise of the Medici family in Renaissance Florence. With the help of network analysis, they were able to show that in the second half of the 14th century a large number of economic upstarts gained access to political office. With these families, spurned by the nobility, the Medici established extensive economic and family relations and thus secured their power, while the established families were left behind.

[49] With the daily increase of images on the internet, there is a growing need for effective techniques to visualise image sets and multimodal data collections. After all, network analysis specialises in visualising multiple relationships in large data collections, making them visible in an interactive and exploratory way. And, of course, specific properties of image data such as colour or brightness, but also image patterns and image contents, can be made connecting properties in such networks in order to visually highlight varying degrees of similarity.

[50] Thus, similarities of objects in terms of shape and decoration are also seen as indicators of communication processes between workshops and described as a network. In the example shown here, the focus was on similarities in pottery decoration, which serve as an indication of communication networks between settlements of the so-called "Bandkeramik" culture in the Rhineland.

[51] In general, certain characteristics of objects, such as material, quality or place of use, can be categorised, combined and then brought into a network structure. In small-scale studies, for example, connections in the household goods of a place or a social group can be visualised that would otherwise be lost in the sheer quantity of attributes. In our example, it is argued on the basis of these criteria that there must have been four social user groups.

[52] Other examples are trade relations, the diffusion of social or technical innovations, the transfer of resources, but also interpersonal relations and role patterns.

[53] The possibilities and problems of historical analysis that can result from network models are intensively discussed, since network analysis does not provide a uniform interpretation of the results.
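As a small, concrete illustration of such measurements: the marriage network from Padgett's Florentine data is bundled with the Python library networkx, so the centrality measures typical of social network analysis can be tried out in a few lines (a minimal sketch, assuming networkx is installed):

```python
import networkx as nx

# Marriage ties between fifteenth-century Florentine families
# (Padgett's data set, shipped with networkx).
G = nx.florentine_families_graph()

# In a complete graph the number of edges grows quadratically:
# n nodes allow n * (n - 1) / 2 ties.
n = G.number_of_nodes()
print(f"{n} families, {G.number_of_edges()} of {n * (n - 1) // 2} possible ties")

# Betweenness centrality: how often a family lies on the shortest
# path between two others - a measure of brokerage power.
centrality = nx.betweenness_centrality(G)
for family, value in sorted(centrality.items(), key=lambda kv: -kv[1])[:3]:
    print(family, round(value, 3))
# The Medici come out on top, matching Padgett and Ansell's finding.
```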
[54] A number of useful tools with mathematical, statistical and imaging functions are available for the conceptualisation, preparation and modelling of the data, which I have compiled here.

[55] ... and here are some more.

[56] Data visualisations are numerical values output as graphics. Charts and diagrams are easier to grasp than tables or other number- or text-based data compilations and thus make patterns and connections visible.

[57] Let's start with a practical example. We had already noticed that Mondrian's paintings became more abstract from year to year. In order to investigate whether the brightness of the works also changed with this, corresponding measurements were carried out in Lev Manovich's Cultural Analytics Lab and the average brightness of all of Mondrian's early paintings created between 1905 and 1917 was calculated. For the first years, it was even possible to distinguish between the first and the second half of the year. You can see the results in the table on the left and the bar chart on the right. For the chart, we plotted the years in the first column on the X-axis and the measured values in the second column on the Y-axis. And indeed, because we are following established conventions that have been valid since René Descartes, the chart is easy for us to read. The table makes exact assignments, but patterns in the data are hard to see in it; the bar chart, on the other hand, makes fluctuations and outliers visible at a glance. These could be made even clearer with additional lines. In any case, the visualisation allows us, so to speak, to look into the data.

[58] We could also output our data as a line chart. In comparison, however, our data analysis takes on a completely different emphasis. Connecting the measured values with a line always suggests a development from the origin on the left onwards to the right. It is true that the values are arranged by year, but within the years they are ordered not chronologically but by size; that is why there is such an up and down here. If one wanted to work out the chronological development, one would have to summarise the values for individual years. But then the average no longer reflects the range of fluctuation.

[59] A pie chart, on the other hand, gives ratios in percentages and assumes that our data collection represents a population, of which each piece of pie stands for a certain part. One might therefore think that the number of paintings per year had been evaluated, but here we have plotted the averages of the mean brightness, which do not add up to a meaningful value.

[60] All the charts shown are therefore unsuitable for an investigation of the development of brightness in Mondrian's paintings. For the visualisation of a distribution such as the one we have before us here, it would make more sense to use a dot distribution chart, as you see on the right. Here the brightness values of all paintings are plotted as points and the median is indicated as a line for orientation.

[61] For the analysis of large sets of paintings, however, the form of visualisation developed by Lev Manovich is even better suited. Here the points of the distribution chart are replaced by miniatures of the paintings. Now you can see at first glance that Mondrian's paintings gradually lighten slightly, but it is even clearer that around 1909 his palette in particular changes, from brown-green to pink-violet.
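How strongly the choice of chart type shapes the statement can also be tried out directly. The following sketch, with invented brightness values standing in for the measured ones, renders the same data once as a bar chart of yearly means and once as a dot distribution plot with a median line:

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented stand-in for the measurements: one (year, brightness)
# pair per painting, 1905-1917.
rng = np.random.default_rng(1)
years = rng.integers(1905, 1918, size=80)
brightness = np.clip(rng.normal(120 + 2 * (years - 1905), 15), 0, 255)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# Bar chart of yearly means: good for comparing discrete values,
# but the spread within each year disappears into the average.
unique_years = np.unique(years)
means = [brightness[years == y].mean() for y in unique_years]
ax1.bar(unique_years, means)
ax1.set(xlabel="year", ylabel="mean brightness")

# Dot distribution plot: every painting stays visible, spread and
# outliers can be read off; the median line gives orientation.
ax2.scatter(years, brightness, s=10)
ax2.axhline(np.median(brightness), linestyle="--")
ax2.set(xlabel="year", ylabel="brightness")

plt.tight_layout()
plt.show()
```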
Because all data can be represented graphically in any form, we need to understand what pictorial statement is being made with each visualisation. And often it is not the data that we visualise in the charts, but our presuppositions and our approaches to data preparation and analysis!

[62] Therefore, let's briefly sum up the most common forms of charts and their use. Most common are comparison charts, which help to convey the differences or similarities between values in a data set. These charts are often used to make comparisons between categories or to communicate rankings between categories.

[63] Related are charts that show the frequency in the data or how widely the data values are distributed over an interval. Often these chart types are useful for creating shapes or patterns that give insight into the nature of the distribution in the data set.

[64] If the X-axis contains date values, the data is displayed over time. These charts are therefore used to show the change in data over a period of time and to communicate or analyse trends and patterns in a data set. The charts at the end, such as the timeline or the Gantt chart, are more commonly used to communicate the sequence of events.

[65] Charts that use area sizes to communicate differences or similarities are shown here in red. The proportions of the areas are used to represent orders of magnitude and comparisons of values.

[66] A special case of proportion charts are those that use proportions to show the relationship of parts to a whole. If the goal is to show how the parts of a variable relate to a whole, or how the data are divided up, these charts are useful.

[67] Data related to geographical regions are of course best visualised on a map.

[68] Correlations between two (or three) variables can be made clear in the charts shown here. The scatterplot serves the somewhat more neutral purpose of data exploration, while elements of different sizes already highlight the respective significance.

[69] Various varieties of network diagrams that connect points (nodes) with lines (edges) serve to reproduce relationships and relationship structures contained in the data.

[70] Some of the charts already shown explicitly express hierarchical relationships and are therefore mentioned again here.

[71] The movement or flow of entities, or how a process or system works, is rendered in so-called flowcharts.

[72] The deviations between the upper and lower limits of a scale are best shown with range charts. Here again, one has to bear in mind that these charts are not meant chronologically.

[73] Visually related are charts that indicate changes between two points. Perhaps best known are representations of voter migration after federal or state elections, but technological or chronological changes can of course also be visualised in this way.

[74] Results of multivariate or multidimensional analysis can be represented in networks and coordinate systems. But there are also alternatives for finding or representing relationships or patterns between many different variables, even if they are not quite as common.

[75] And finally, let's have a look at the depiction of measurement uncertainty and error margins within a data set. As you can see, there is a great variety of chart types. It was important to me to group them according to what they say, because, as we saw at the beginning, only the right choice of chart is able to visualise the right statement.
However, not only the different types of charts are important, but also data selection and design. Therefore, some principles of good practice in information visualisation should be discussed.

[76] In general, one could establish the following rules for information visualisation: First, good graphics are based on good data! A study with insufficiently collected data will not be improved by perfect visualisation, even if many people might fall for it.

[77] Distinguish between discrete measured values and continuously changing data! The body weight of a person in kg is such a discrete measured value: my body weight has no temporal or consecutive relationship to yours. If, on the other hand, I measure my body weight every week and compare the values, there is a chronological relationship. Let's take unemployment in Germany as an example. If you want to illustrate the differences between the federal states, a simple column or bar chart is the right choice for discrete measured values that are independent of each other. A continuous graph, on the other hand, is better suited to temporal or consecutive relationships in the data.

[78] Arrange the data in a meaningful order! An alphabetical order is helpful if you want to find individual entries quickly in long lists. However, it says nothing about peaks or correlations in the data.

[79] Choose the right dimensions for the visualisation! It makes a big difference visually whether you give a weight in kilograms or in grams, for example. If measured values lie close together, they must be visually spread apart in order to be easily readable. However, treat all entries visually in the same way; do not make the mistake of showing distances or areas in different sizes and scales! Rather, interrupt the course of a bar and make the resulting gap visually clear. You still have to be clear about the statement that results: in the upper example, there is apparently no problem with the weight of young girls in Germany, although a quarter are not of normal weight; in the lower one, it is clear that despite increasing anorexia, obesity is the bigger problem.

[80] It is important to avoid distortions in the graphic implementation. Perspective representations, which obscure the actual proportions, are particularly popular. But pie charts are also unfavourable for measured values that lie close together. Equally confusing are charts with percentages that do not add up to one hundred.

[81] Always indicate the size of the sample (n)! Only then can the data be understood correctly, unless you wanted to deliberately cheat with the data and show trends that are not actually present in it.
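Several of these rules can be seen at work in a single chart. A minimal sketch with invented figures: the bars are sorted by value rather than alphabetically, the axis starts honestly at zero, and the sample size is stated:

```python
import matplotlib.pyplot as plt

# Invented unemployment rates by federal state (illustrative only).
data = {"Bremen": 11.2, "Berlin": 9.8, "Saxony": 7.1,
        "Hesse": 5.6, "Bavaria": 3.9}

# Meaningful order: sort by value instead of alphabetically.
states, rates = zip(*sorted(data.items(), key=lambda kv: kv[1]))

fig, ax = plt.subplots()
ax.barh(states, rates)
ax.set_xlim(0, max(rates) * 1.1)  # honest baseline at zero
ax.set_xlabel("unemployment rate (%)")
ax.set_title("Unemployment by federal state (invented data, n = 5 states)")
plt.tight_layout()
plt.show()
```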
[82] Make it easy for the viewer to assign information! Increase the readability of your charts through information reduction and appropriate labelling! This is possible, for example, by placing measured values or labels close together or even combining them into one value. So make sure you have the right amount of data. Visualisation does not simplify, it clarifies. Good graphics therefore have a balanced design; they are descriptive and attract the reader's attention.

[83] Avoid superfluous design elements! Illustrations, pictures or the exaggeration of individual shapes influence perception and set a theme or create a certain mood. But be aware that these forms of design will not make your graphics look very scientific if they are used too boldly.

[84] Revise the illustration to make your thesis stand out even more! Ask yourself whether all the elements depicted are semantically significant. And use the completed chart to develop an even better understanding of your research result. Perhaps there are outliers in the data that you are only now really noticing, or details that are important to you have fallen by the wayside.

[85] Document your decisions in a study and put a reference to it below your graphic! Before you made the graphic, a number of decisions were necessary: How was the data collected, processed and categorised? According to which question and under which focal points was the analysis carried out? What degree of reduction does the data show in the visualisation? Where can the source data be found? Who was involved and on which studies is the work based?

[86] Almost all spreadsheet programmes in office packages offer the possibility of producing charts from imported data yourself, and there is also a large number of free, web-based consumer applications on the net. So if you just want to make a quick chart, you will find what you need there.

[87] And just to pick out one: with RAWGraphs you can create a wide variety of charts in a very convenient yet sophisticated way and customise them in terms of size, colour and labelling. This way you can easily try out the advantages and disadvantages of the different chart types, especially since there is a corresponding explanation for each type.

[88] For geodata visualisation you can also use the online platform MapScholar. It enables scholars in the humanities and social sciences to create digital "atlases" with high-resolution images of historical maps, superimpose them and save them with a precise fit.

[89] If, for example, you are simply looking for a web-based tool that automatically outputs your uploaded data as a chart, map, timeline, video, network graph or word cloud, SHIVA has it. The results can be exported and embedded in websites.

[90] In the digital humanities, however, visualisations are also used to explore the data. I would like to at least briefly mention a few tools created specifically for the DH community, and I will start with analyses of textual sources. There are some web-based visualisation tools for large corpora. At the word level, you can visualise frequencies particularly well with WordSeer. Our example shows the frequency of the words "great" and "good" in the speeches of several American presidents.

[91] You may already know another great tool for visualising correlations in text corpora, namely Voyant. Here, linguistic features are also displayed for data mining and text analysis. Take Goethe's Werther, for example. With Voyant, you can display the most frequently occurring words as a word cloud, examine each occurrence of a word in its syntactic context and output it as a lexical distribution plot. On the right, five words, namely "O", "Lotte", "soul", "heart" and "people", are plotted according to their frequency at different passages in the book.

[92] If you are interested in identifying the distribution of certain themes and motifs in your texts, i.e. topic modelling, TOME is quite suitable; it uses machine-learning algorithms to identify themes in a corpus and outputs them chronologically or in networks, ordered by frequency.

[93] Another tool is Narrelations, which is designed for the visual analysis of specific narrative phenomena. Here you can link the nesting and distribution of narrative levels and their correlations in an uploaded text to specific temporal phenomena.
In addition to assessing overall patterns in a story, the interface allows you to focus on individual passages and annotations by coupling visual and textual analysis.

[94] With the web-based tool Palladio, you can upload data in tabular form, view it as a list, sort and edit it, and re-export the list visualisations as a .csv file. You can also output this data as points on a map and display the relationships between different points connected by lines. Such relationships are visualised as a network in the graph view. The nodes can be scaled to reflect their relative size within your data, and the display of links and labels can be switched on and off. There is also a gallery view where the data is displayed in a grid for quick orientation. Here the dimensions of your data can also be linked to external web-based information and sorted by the different dimensions.

[95] VisualEyes is an interactive storytelling tool for embedding images, maps, charts, videos and other data in dynamic visualisations. The example on the home page shows us the life of the third US president, Thomas Jefferson.

[96] With MAX, a tool for the comparative analysis of museum data sets has been available online since 2019. Here, both art-historical and collection-historical problems can be tackled on the basis of Big Data. It would be possible, for example, to analyse the image carriers wood and canvas on the basis of the total stock of all museum data represented there. For many questions, however, a lot of data preparation is necessary beforehand, taking into account our considerations on sampling.

[97] Lev Manovich has developed a macro for the image-processing programme ImageJ that implements his way of visualising image sets as a scatterplot and runs on almost any platform. Even temporal sequences can be displayed, as I would like to show you here for the paintings of Mondrian under consideration.

[98] If one is aware that all graphs and charts are not only the result of data selection, but also that design guides interpretation, one will also want to take some care with one's own information visualisations. The art of illustration is the subject of graphic design, and so it is natural to use graphic design techniques and design languages in the visualisation process when it comes to the relationships of type, space and colour, for example. In the example on the left, the colour merely distinguishes the different columns without adding any information. In the example on the right, colour was used to convey information: here the traffic-light colours underline the harmfulness of the nutritional values.

[99] Reflecting on more aesthetically pleasing and functionally effective information design goes hand in hand with understanding the basics of graphic design and making a conscious choice of appropriate design elements. An example is the poster "The Grammy Gap" by Emma Duncliffe, where the distribution of the annual Grammys to male and female artists is clearly visualised. Yvette Shen has developed a modular approach that helps to take design concerns into account in the visualisation. A basic knowledge of graphic design thus promotes an implementation better suited to the goal, the context, the content and the audience. However, just as beautiful graphics do not always lie, one must not put design above content. What remains central is the research question and the analysis.

[100] And this brings us to the current research questions.
In recent years, much research has been done on the history of information visualisation. It has become very clear that the representations develop a life of their own and are to be regarded as historical sources in their own right. In the future, these findings must be reflected back even more broadly into the disciplines, where awareness of visualisation errors is not yet very widespread. This is because information visualisations are to be understood as independent sources of information. Accordingly, one should also think more about the appropriate form of representation and its message. In the digital humanities, conventions of visualisation have already emerged for depicting the results of explorative analysis, such as networks or correspondence analyses, in a generally comprehensible way. It would be desirable if not only the diagrams were printed, but more and more often the data were also made available for one's own exploration. From this, entirely new forms of scientific discourse could develop. This process has already begun in the natural sciences, but the initial euphoria of the Digital Humanities for collaborative work has unfortunately subsided again somewhat, which is also due to the increasingly competitive research situation at universities and on career paths. The analysis of large image sets is only just beginning, and we are also seeing new, promising approaches in computer science almost daily. Lev Manovich has demonstrated with his ImagePlot tool how relationships in image data can be visualised in an image-based way. Here, we need to continue to develop creative ideas in order to be able to represent image-to-image relations and spatial relationships in visual networks and virtual spaces.

[101] What you know:
- various quantitative methods and multivariate procedures
- basics of exploratory statistics
- concepts and theories of network analysis
- analysis procedures and visualisation methods for common measures and structures
- basic principles of information visualisation
- data collections and tools for data analysis and visualisation

[102] What you should be able to do:
- structural and relational analysis, dealing with relational issues, selecting suitable analysis software (e.g. R, Python, PAST, Gephi, Palladio)
- reproducing / checking diagrams and other forms of visualisation
- analysing common measures and structures and outputting the results as a suitable visualisation

[103] Questions:
- How could the results of a multivariate analysis of large image sets be visualised?
- On which mathematical methods is correspondence analysis based? What can it be used for?
- Give an example of the convincing use of historical network analysis.
- Which of the two visualisations would you choose? Please justify your answer.

[104] And finally, again a compilation of textbooks that you can use to deepen your knowledge. This time I have intentionally embedded the extraordinarily complex multivariate procedures in somewhat lighter fare and hope that this will again give you enough time to learn. I wish you much success and

[105] say goodbye to you for today with a video that the Norwegian band Röyksopp released for their song "Remind Me" back in 2002. It depicts the life of a Londoner in infographics. This video struck a chord and received the then coveted MTV Europe Music Award. Now I hope it inspires you rather than giving you nightmares, and I wish you all the best! Until next time, when we will deal with virtual spaces.