
The task in this section differs from the content of the session. Over the course as a whole we try to cover the important methods of text analysis. Because that would be too much to cover in one session, we have split the tasks. The task in this section therefore relates more closely to the text mining section of the lecture.
Edgar Allan Poe was the pioneer of the modern horror and detective genre. Poe, as well as Sir Arthur Conan Doyle and H.P. Lovecraft, wrote fantastic literature with horror and mystery themes. Poe influenced both Doyle and Lovecraft. In this unit we will calculate similarities between the short stories of the three authors. Stylometry will be used to compare stylistic features of different texts, for example to determine authorship.
You will carry out the following steps:
  • Download the corpus
  • Upload the corpus to Lexos
  • Analyse the texts with various analysis tools
The corpus to be analysed consists of ten texts by each of the three authors in plain text format, plus three wildcards. The wildcards are texts that may or may not have been written by one of the three authors.
Lexos is an online tool that you do not need to install. Visit the following website:
http://lexos.wheatoncollege.edu/upload
Here you can upload the corpus via drag and drop or via "Browse". Then click on the Analyze tab and try out different analysis tools.
Recommended analysis tools: dendrograms and K-means graphs
In stylometry, dendrograms and K-means graphs are used to analyse and visualise the stylistic similarities between texts. Both tools are essential for uncovering patterns and relationships in authorship attribution and other stylometric analyses. Both also let you adjust how the data is tokenised, normalised and culled.
Tokenisation
Tokenising by tokens breaks the text down into words or phrases, whereas tokenising by characters splits it into single characters for a more fine-grained analysis.
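To make the difference concrete, here is a minimal Python sketch of the two tokenisation modes described above. It is not how Lexos is implemented; the function names, the regular expression used to find words, and the `n` parameter for n-gram size are assumptions made for illustration.

```python
import re

def word_tokens(text, n=1):
    """Lowercase the text, extract words, and return word n-grams."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text, n=1):
    """Return overlapping character n-grams (spaces are kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "Once upon a midnight dreary"
print(word_tokens(sample))        # ['once', 'upon', 'a', 'midnight', 'dreary']
print(word_tokens(sample, n=2))   # ['once upon', 'upon a', 'a midnight', 'midnight dreary']
print(char_tokens("raven", n=3))  # ['rav', 'ave', 'ven']
```

Word tokens capture vocabulary choices, while character n-grams also pick up spelling habits, prefixes and suffixes.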
Normalisation
Proportional normalisation adjusts word frequencies relative to text length, raw counts keep the original counts unchanged, and TF-IDF (Term Frequency-Inverse Document Frequency) highlights words that are important in a text but uncommon across the corpus.
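The three weighting schemes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the formula Lexos necessarily uses: the TF-IDF variant here is relative term frequency multiplied by ln(N / document frequency), and the toy documents are invented.

```python
import math
from collections import Counter

def raw_counts(tokens):
    """Raw: keep the original counts."""
    return dict(Counter(tokens))

def proportional(tokens):
    """Proportional: divide each count by the length of the text."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}

def tf_idf(docs):
    """TF-IDF for a dict of name -> token list.
    Assumed variant: (count / text length) * ln(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        name: {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in Counter(tokens).items()
        }
        for name, tokens in docs.items()
    }

# Invented toy documents of different lengths.
docs = {
    "short": ["the", "raven", "the", "door"],
    "long": ["the", "hound", "moor", "the", "the", "night"],
}
print(raw_counts(docs["short"])["the"], raw_counts(docs["long"])["the"])      # 2 3
print(proportional(docs["short"])["the"], proportional(docs["long"])["the"])  # 0.5 0.5
print(tf_idf(docs)["short"]["the"])                                           # 0.0
```

The raw counts of "the" differ only because the texts differ in length, the proportional values are identical, and TF-IDF gives "the" a weight of zero because it occurs in every document.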
Culling
Culling by top terms keeps only the most frequent terms, while requiring words to appear in a minimum number of documents ensures that only widely present terms are analysed.
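The two culling options can behave quite differently, as the following Python sketch suggests. The function names, parameters and toy corpus are hypothetical and only illustrate the idea; Lexos may handle details such as ties differently.

```python
from collections import Counter

def cull_top_terms(docs, top_n):
    """Keep only the top_n most frequent terms across the whole corpus."""
    totals = Counter()
    for tokens in docs.values():
        totals.update(tokens)
    keep = {term for term, _ in totals.most_common(top_n)}
    return {name: [t for t in tokens if t in keep] for name, tokens in docs.items()}

def cull_min_docs(docs, min_docs):
    """Keep only terms that occur in at least min_docs documents."""
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    keep = {term for term, n in doc_freq.items() if n >= min_docs}
    return {name: [t for t in tokens if t in keep] for name, tokens in docs.items()}

# Invented toy corpus: "raven" is frequent overall but occurs in one text only.
docs = {
    "poe": ["the", "raven", "raven", "raven", "raven", "the", "night"],
    "doyle": ["the", "hound", "night"],
    "lovecraft": ["the", "night", "cthulhu"],
}
print(cull_top_terms(docs, 2)["poe"])  # ['the', 'raven', 'raven', 'raven', 'raven', 'the']
print(cull_min_docs(docs, 2)["poe"])   # ['the', 'the', 'night']
```

Top-term culling keeps "raven" because it is frequent overall, whereas the minimum-document rule drops it because it occurs in a single text.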
Dendrograms
Dendrograms are hierarchical tree diagrams that show the arrangement of clusters formed through hierarchical clustering, helping to identify groups of texts with similar writing styles. If you are not familiar with reading dendrograms yet, the developers of Lexos provide the following instructions:
https://wheatoncollege.edu/wp-content/uploads/2012/08/How-to-Read-a-Dendrogram-Web-Ready.pdf
Identify Clusters: Look for points where branches on the dendrogram join together. These points represent clusters of data points that are similar to each other. Branches joining further left indicate more similarity, while joins further to the right indicate greater dissimilarity.
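The joins in a dendrogram come from repeatedly merging the two closest clusters. The Python sketch below illustrates this with average linkage and Euclidean distance; both choices, the function names and the toy two-dimensional vectors are assumptions for illustration, not a description of what Lexos does internally.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(a, b, vectors):
    """Mean distance between all pairs of texts from clusters a and b."""
    dists = [euclidean(vectors[x], vectors[y]) for x in a for y in b]
    return sum(dists) / len(dists)

def agglomerate(vectors):
    """Merge the two closest clusters until one remains.
    Returns the merges in order as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in vectors]  # every text starts alone
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], vectors))
        merges.append((a, b, average_linkage(a, b, vectors)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented feature vectors; real ones would hold term frequencies.
vectors = {
    "poe_1": [0.0, 0.0], "poe_2": [0.0, 1.0],
    "doyle_1": [5.0, 5.0], "doyle_2": [5.0, 7.0],
}
for a, b, d in agglomerate(vectors):
    print(a, "+", b, "at distance", round(d, 2))
```

Each merge corresponds to one join in the dendrogram, and the distance at which it happens determines where that join is drawn.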
Experimenting with the Distance Metric setting shows the results of different formulas for calculating similarity.
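To see why the choice of metric matters, here is a small Python sketch of three common distance formulas. Which metrics Lexos offers and how it labels them is not covered here; the function names and the example vectors are assumptions.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences per term."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors (ignores length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Invented raw counts: same proportions, but the second text is ten times longer.
short_text = [1, 2, 0]
long_text = [10, 20, 0]
print(euclidean(short_text, long_text))
print(manhattan(short_text, long_text))   # 27
print(cosine_distance(short_text, long_text))
```

For these two vectors the Euclidean and Manhattan distances are large because the texts differ in length, while the cosine distance is practically zero because the proportions of the terms are the same.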
K-Means Graphs
K-means graphs partition texts into a specified number of clusters based on their stylistic features, allowing distinct groups within a dataset to be identified. Since our corpus consists of texts by three different authors, setting the number of clusters to three makes sense. Because some wildcards may have been written by none of the three authors, experimenting with the number of clusters could yield interesting results.
(Example of a Corpus divided into three clusters with a centroid each)
Setting the graph to 3D Scatter will aid in detecting even more details regarding the similarities of the documents.
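The idea behind K-means can be sketched in plain Python: assign every text to its nearest centroid, move each centroid to the mean of its texts, and repeat until nothing changes. The version below is a simplified illustration, not the Lexos implementation. In particular, the deterministic farthest-point seeding is an assumption chosen to keep the example reproducible, and the two-dimensional points are invented stand-ins for real feature vectors.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def farthest_point_init(points, k):
    """Assumed seeding: start at the first point, then keep adding the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids

def k_means(points, k, max_iter=100):
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
labels, centroids = k_means(points, k=3)
print(labels)     # [0, 0, 2, 2, 1, 1]
print(centroids)  # [[0.0, 0.5], [10.0, 0.5], [5.0, 5.5]]
```

Calling `k_means` with a different `k` on the same points shows how the grouping changes when the number of clusters does not match the number of natural groups.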
Task: Assignment of texts to authors with Lexos
Suggest an author for each of the three wildcard texts in the downloaded corpus: Wildcard1, Wildcard2, Wildcard3. Explain why you think a given text can be attributed to a given author.
Apply different analysis tools in Lexos and use at least five different settings. You are welcome to upload screenshots to support your argument.