
eAQUA – Extraktion von strukturiertem Wissen aus Antiken Quellen für die Altertumswissenschaft

(Engl.: Extraction of structured knowledge from ancient sources for the study of antiquity)

New methods in the humanities

The growing digitization of complete text collections led to a new situation in scholarship at the end of the 20th and the beginning of the 21st century. In this context, the term text mining is often used. It is an umbrella term that covers a range of statistical and linguistic techniques.

In eAQUA, originally a BMBF-funded project in the program “Interactions between Humanities and Sciences”, some of these methods were investigated and further developed for the historical languages Greek and Latin. The result is the eAQUA portal, in which the tools developed can be tested for their usability on closed corpora in these historical languages. Two of these tools, co-occurrence analysis and citation analysis, are explained in more detail below.

Language processing

Different language technology components are used to extract structured information from texts, depending on the application. Ancient texts have a few peculiarities, for example the absence of metadata, so not all components come into play. The following gives a rough outline of the language technology used.

Broadly, three areas of language processing are distinguished within data mining:

  • domain-specific processing
  • document-specific processing
  • language-specific processing

This list is thematic, not chronological.

Domain-specific processing

Subtask | Explanation
Proper Name Extraction | The extraction of specific entities, mostly on the basis of manually annotated data sets. Only entities typical of the domain (corpus) are meant here. 1)
Create Stop Word List | A stop word list is a list of terms that are excluded from later processing. 2)
Topic Modeling | Automatically assigns terms to topics based on word properties and context information.
Fact Extraction | Predefined types of information are modeled. Many methods use the sequence of different words in a sentence for this. 3)
Relation Extraction | Detects relationships between entities in a text.
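
Footnote 2 notes that eAQUA derives its stop word lists from word counts over the entire corpus. A minimal sketch of that idea follows. The function name `build_stop_words` and the cutoff `top_n` are hypothetical, since the source does not say which threshold eAQUA applies.

```python
from collections import Counter

def build_stop_words(tokens, top_n=100):
    """Return the top_n most frequent tokens as a stop word set.

    top_n is an assumed cutoff, not a value taken from eAQUA.
    """
    counts = Counter(token.lower() for token in tokens)
    return {word for word, _ in counts.most_common(top_n)}

# Tiny illustrative corpus; a real corpus would be far larger.
tokens = "the cat and the dog and the bird".split()
stop_words = build_stop_words(tokens, top_n=2)
```

With a frequency-based cutoff the list adapts to each corpus automatically, which suits the domain-specific lists the footnote mentions.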

Document-specific processing

Subtask | Explanation
Metadata Collection | Metadata, in the case of corpus analysis e.g. place of origin, time of origin, authorship, editor or time of edition, are valuable sources of information in text analysis, for example to narrow down the selection of data to be processed.
Cleansing and Normalization | Depending on how the data was captured, it must be cleansed of all irrelevant information, such as the tags of markup languages, before analysis. Differing character encodings, such as Beta Code transcriptions of ancient Greek, must be converted to a uniform character encoding before processing.
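
The cleansing step can be illustrated with a short sketch that strips angle-bracket markup and brings the text into one Unicode normalization form. It assumes XML/HTML-style tags and does not cover Beta Code conversion, which needs a dedicated mapping table. The name `cleanse` is hypothetical.

```python
import re
import unicodedata

TAG = re.compile(r"<[^>]+>")  # assumes simple XML/HTML-style tags

def cleanse(raw):
    """Remove markup tags, normalize to NFC, and collapse whitespace."""
    text = TAG.sub(" ", raw)
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

cleaned = cleanse("<p>Arma  <b>virumque</b> cano</p>")
```

Normalizing to a single form matters for Greek in particular, because accented letters can be stored either precomposed or as a base letter plus combining marks, and the two would otherwise be counted as different strings.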

Language-specific processing

Subtask | Explanation
Language Recognition | The languages used are determined. 4)
Segmentation | Structures the text into individual parts that can be examined separately. Segmentation into sentences based on punctuation marks is common.
Tokenization | Segmentation into individual parts (tokens) at the word level, for example by taking the space as a word boundary.
Word Stem Reduction | Reduces words to their stem so that inflected forms are found in a later search.
Lemmatization | Determines the base form (lemma) of a word.
Part-of-Speech Tagging | Assigns words and punctuation marks to parts of speech.
Parsing | The text is transformed into a new syntactic structure. For the parser, a token is the atomic input unit.
Resolve Coreference (reference identity) | A coreference exists if two linguistic expressions within an utterance refer to the same linguistic object, for example through the use of pronouns.
Proper Name Extraction | In proper name recognition, also called Named Entity Recognition (NER), the terms of a text are assigned to certain types (e.g. place or person).
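
Segmentation and tokenization, the two steps the later analyses build on, can be sketched as follows. This is a deliberately naive illustration that splits sentences at ., ! or ? and words at whitespace. The function names are hypothetical and the rules are not those of eAQUA.

```python
import re

PUNCTUATION = ".,;:!?\"'()"

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Split at whitespace and strip surrounding punctuation from each token."""
    tokens = (part.strip(PUNCTUATION) for part in sentence.split())
    return [token for token in tokens if token]

text = "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae."
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]
```

A rule this simple would also split after abbreviations such as the speaker labels in footnote 1, which is one reason domain knowledge is needed on top of purely language-specific processing.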

Co-occurrence calculation

In linguistics, co-occurrence generally refers to the joint occurrence of two lexical units within a superordinate segment. If, for example, two terms frequently occur together in a sentence, it is reasonable to assume a dependency between them, whether semantic or grammatical. Statistical calculations provide measures of this presumed dependency. Several conditions must be fulfilled for this:

  • A total corpus must be defined in which the occurrences of units, e.g. words, can be counted. These counts form the basis for the calculation.
  • The corpus must be segmented. For non-neighbor co-occurrences, i.e. co-occurrences that are not direct left or right neighbors, a maximum distance must be defined; otherwise every word in a text would be connected to every other word. In eAQUA, the sentence was chosen as this unit. In addition to neighborhood co-occurrences, sentence co-occurrences are therefore reported.
  • The corpus must be very large, or only the most frequent words may be included in the calculation. 5)
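
The counting that underlies both kinds of co-occurrence can be sketched as follows, assuming the corpus is already segmented into sentences and tokenized. The function name and the data layout (a list of token lists) are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count neighbor pairs and sentence pairs in a list of token lists."""
    neighbors = Counter()
    in_sentence = Counter()
    for tokens in sentences:
        # Neighborhood co-occurrence: direct left/right neighbors, in order.
        for left, right in zip(tokens, tokens[1:]):
            neighbors[(left, right)] += 1
        # Sentence co-occurrence: each unordered pair once per sentence.
        for a, b in combinations(sorted(set(tokens)), 2):
            in_sentence[(a, b)] += 1
    return neighbors, in_sentence

sentences = [["arma", "virumque", "cano"], ["arma", "cano"]]
neighbors, in_sentence = cooccurrence_counts(sentences)
```

Bounding the pairs by the sentence keeps the number of candidate pairs manageable, which is the purpose of the segmentation condition above.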

In eAQUA, neighborhood and sentence co-occurrences were computed primarily with likelihood functions, using log-likelihood as the significance measure. The values obtained are meaningful only in their relative order, unlike, for example, the Dice or Jaccard coefficients, which always yield an absolute value between 0 and 1. In principle, the following applies to the so-called “lgl” value: the larger the value, the more probable a correlation, although the algorithm can also produce negative values.
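
To make the difference between the measures concrete, the following sketch computes the textbook log-likelihood ratio statistic for a 2×2 contingency table alongside the Dice and Jaccard coefficients. It is not eAQUA's implementation: the exact “lgl” variant is not specified in the source, and the textbook form shown here is never negative. The parameters are assumptions for illustration: `n` is the number of sentences, `n_a` and `n_b` the number of sentences containing each word, and `n_ab` the number containing both.

```python
import math

def log_likelihood(n_ab, n_a, n_b, n):
    """Textbook log-likelihood ratio (G^2) for a 2x2 contingency table."""
    observed = [
        [n_ab, n_a - n_ab],
        [n_b - n_ab, n - n_a - n_b + n_ab],
    ]
    rows = [sum(row) for row in observed]
    cols = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total

def dice(n_ab, n_a, n_b):
    """Dice coefficient, always between 0 and 1."""
    return 2.0 * n_ab / (n_a + n_b)

def jaccard(n_ab, n_a, n_b):
    """Jaccard coefficient, always between 0 and 1."""
    return n_ab / (n_a + n_b - n_ab)

weak = log_likelihood(10, 20, 20, 100)
strong = log_likelihood(20, 20, 20, 100)
```

The log-likelihood value grows with corpus size and has no fixed upper bound, which is why only the ranking of the values is informative, whereas Dice and Jaccard can be compared directly across corpora.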

Citation analysis

As a subfield of bibliometrics, citation analysis deals with the qualitative investigation of cited and citing works. The results are presented visually in a citation graph, from which various regularities and structures of an author or a group of authors can be read. If the appropriate metadata are available 6), the representations can be narrowed down with user-defined search filters.

Citation analysis is performed using string matching algorithms. String matching algorithms search a text for matches of a pattern, subject to defined tolerance criteria. In the citation analysis of eAQUA, these criteria are defined as follows.

After all punctuation marks and a list of frequently used words 7) have been removed, the corpus is decomposed into sequences of five consecutive terms, and each sequence is searched for exact matches in the remaining corpus using a so-called naive algorithm. The remaining corpus is not narrowed down by metadata such as the time of origin. One peculiarity of this approach is that self-citations are found for some authors, i.e., places where they evidently repeat themselves. Another is that a citation can consist of several entries 8) and only becomes recognizable as a whole via the sorting function.
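
The matching step can be sketched as follows. Note one deliberate substitution: instead of the naive scan mentioned above, this sketch collects the five-term sequences in a dictionary, which finds the same exact matches with less work. All function names and the sample data are hypothetical.

```python
from collections import defaultdict

def term_sequences(tokens, stop_words, size=5):
    """Return all runs of `size` consecutive terms after filtering."""
    kept = [token.lower().strip(".,;:!?") for token in tokens]
    kept = [token for token in kept if token and token not in stop_words]
    return [tuple(kept[i:i + size]) for i in range(len(kept) - size + 1)]

def find_parallels(segments, stop_words, size=5):
    """Map each term sequence to the segments sharing it (two or more)."""
    index = defaultdict(set)
    for segment_id, tokens in segments.items():
        for sequence in term_sequences(tokens, stop_words, size):
            index[sequence].add(segment_id)
    return {seq: ids for seq, ids in index.items() if len(ids) > 1}

segments = {
    "A": "The quick brown fox jumps over the lazy dog.".split(),
    "B": "A quick brown fox jumps over a lazy dog today.".split(),
}
parallels = find_parallels(segments, stop_words={"the", "a"})
```

In this example the shared passage spans seven terms after filtering, so it produces three overlapping five-term matches rather than one, which is the effect footnote 8 describes.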

Finally, the parallel passages are assigned a similarity value based on the edit distance, which lies between 0 (not identical) and 1 (completely identical). The calculation is based on the Similar-Text algorithm, which Oliver 9) describes in pseudo-code.

sim = { n_{ab} * 2 } / { n_a + n_b }

where n_a and n_b are the string lengths of the two subtexts to be compared, and n_{ab} is the number of identical characters, i.e., the length of the longer string minus the Levenshtein distance 10):

sim = { (max(n_a,n_b) - lev(a,b))  * 2 } / { n_a + n_b }
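
The Levenshtein-based form of the formula can be sketched as follows. This is an illustration of the formula as stated here, not Oliver's Similar-Text algorithm itself. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """sim = 2 * (max(n_a, n_b) - lev(a, b)) / (n_a + n_b)."""
    if not a and not b:
        return 1.0  # assumed convention: two empty strings are identical
    n_ab = max(len(a), len(b)) - levenshtein(a, b)
    return 2.0 * n_ab / (len(a) + len(b))

value = similarity("kitten", "sitting")
```

Because the Levenshtein distance never exceeds the length of the longer string, the result always stays between 0 and 1.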

Example: Similar-Text
String a = Example text 1
String b = Example text 2
n_a | n_b | lev(a,b) | max(n_a,n_b) - lev(a,b) | sim | Similar-Text
14 | 14 | 1 | 13 | {13 * 2} / {14 + 14} = 26 / 28 | 0.93

The calculated similarity values always refer to the complete tokenized segments, not just to the search mask. This means that even completely identical passages can be assigned a value below 1 if they occur within a larger segment. In the following example, the deviation results from the insertion of “quick brown”.

Example: Similar-Text
String a = The quick brown fox jumps over the lazy dog
String b = The fox jumps over the lazy dog
n_a | n_b | lev(a,b) | max(n_a,n_b) - lev(a,b) | sim | Similar-Text
43 | 31 | 12 | 31 | {31 * 2} / {43 + 31} = 62 / 74 | 0.84

Similar-Text calculations are only useful for short segments, such as the sentences produced by tokenization in eAQUA, because the values tend to decrease as the segments examined get longer.

1)
For example, the speaker labels “North.” and “West.” in Shakespeare's play “King Henry the Fourth” are abbreviated personal names, not cardinal directions.
2)
Such lists can be cross-domain, for example typical for a language, or domain-specific, for example typical for an authorship. In eAQUA, these lists are created using word counts of the entire corpus.
3)
In eAQUA, for example, this has been accomplished with co-occurrence analysis.
4)
If the languages are not annotated in the metadata, this is a non-trivial problem, especially for multilingual texts. It is often solved with language-specific (key-)word lists.
5)
[Dunning 93]. Dunning, T. “Accurate Methods for the Statistics of Surprise and Coincidence”. In: Computational Linguistics 19, 1 (1993), 61–74.
6)
Place, time and author are not a matter of course with ancient texts: some authorships are marked as (Pseudo-), for example; dates are sometimes estimates because only periods are known; and place data may be based on the place of death.
7)
stop word list: This list is determined anew for each corpus by counting all words.
8)
The search mask, i.e. the pattern, consists of only 5 terms. Longer parallel passages therefore yield more than one search mask and a corresponding number of hits.
9)
[Oliver 93]. Oliver, Ian. Programming Classics: Implementing the World's Best Algorithms. Prentice Hall PTR, New York, 1993.
10)
A method introduced by Russian mathematician Vladimir I. Levenshtein in 1965 to compare two strings by counting the minimum number of insertion, deletion, and replacement operations to convert one to the other.