eAQUA – Extraktion von strukturiertem Wissen aus Antiken Quellen für die Altertumswissenschaft
(Engl.: Extraction of structured knowledge from ancient sources for the study of antiquity)
New methods in the humanities
The large-scale transfer of complete text collections into electronic systems has created a new situation in scholarship at the end of the 20th and beginning of the 21st century. In this context one frequently encounters the term text mining, an umbrella term that subsumes various statistical and linguistic techniques.
eAQUA, originally a BMBF-funded project in the program “Interactions between the Humanities and the Natural Sciences”, investigated and further developed some of these methods for the historical languages Greek and Latin. The result is the eAQUA portal, in which the developed tools can be tested for their applicability to closed corpora in these historical languages. Two of these tools, the so-called co-occurrence analysis and citation analysis, are explained in more detail below.
Language processing
Depending on the application, different language technology components are used to extract structured information from texts. Processing ancient texts involves a few peculiarities, for example the absence of so-called metadata, so that not all components come into consideration. The following gives a rough outline of the language technology used.
Within text mining, three areas are commonly distinguished when processing language:
- domain-specific processing
- document-specific processing
- language-specific processing
This list is thematic, not chronological.
Domain-specific processing
Subtask | Explanation |
---|---|
Proper Name Extraction | The extraction of specific entities, mostly on the basis of manually annotated data sets. Here, only entities typical of the domain (corpus) are meant. 1) |
Create Stop Word List | A stop word list is a list of terms that are to be excluded from later processing (see the sketch below the table). 2) |
Topic Modeling | Terms are automatically assigned to topics on the basis of word properties and context information. |
Fact Extraction | Predefined types of information are modeled during processing. Many methods rely on the sequence of words within a sentence for this. 3) |
Relation Extraction | Detect relationships between entities in a text. |
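To illustrate the stop-word step from the table above, here is a minimal Python sketch; the (tiny) stop-word list and the example sentence are invented for the illustration and are not taken from eAQUA:

```python
# Minimal sketch of stop-word filtering. The stop-word list and the
# example sentence are invented for illustration, not taken from eAQUA.

# A deliberately tiny stop-word list of Latin function words.
STOP_WORDS = {"et", "in", "est", "non", "ad", "cum", "ut", "sed"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "Gallia est omnis divisa in partes tres".split()
print(remove_stop_words(tokens))
# ['Gallia', 'omnis', 'divisa', 'partes', 'tres']
```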
Document-specific processing
Subtask | Explanation |
---|---|
Metadata Collection | Metadata, in the case of corpus analysis e.g. place of origin, time of origin, authorship, editor, or time of edition, are valuable sources of information in text analysis, for example to narrow down the selection of data to be processed. |
Cleansing and Normalization | Depending on how the data was captured, it must be cleansed of all irrelevant information before analysis, such as the markup tags common in markup languages. Differing character encodings, such as ancient Greek transcribed in Beta Code, must be converted to a uniform character encoding before processing (see the sketch below the table). |
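The cleansing step can be sketched as follows in Python. The markup stripping and Unicode normalization are generic; the Beta Code table is a deliberately tiny, invented fragment (a real converter also has to handle diacritics, capitalization, and the final sigma):

```python
import html
import re
import unicodedata

# Fragment of a beta-code mapping, for illustration only. A real
# converter covers the full alphabet, diacritics, and final sigma.
BETA_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε",
    "l": "λ", "o": "ο", "s": "σ",
}

def cleanse(raw):
    text = re.sub(r"<[^>]+>", " ", raw)        # drop markup tags
    text = html.unescape(text)                 # resolve entities such as &amp;
    text = unicodedata.normalize("NFC", text)  # uniform character encoding
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def beta_to_unicode(beta):
    """Map each beta-code character to its Greek letter, if known."""
    return "".join(BETA_TO_GREEK.get(c, c) for c in beta.lower())

print(cleanse("<p>lo&amp;gos</p>"))  # lo&gos
print(beta_to_unicode("logos"))      # λογοσ
```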
Language-specific processing
Subtask | Explanation |
---|---|
Language Recognition | The languages used in the text are determined. 4) |
Segmentation | Structures the text into individual parts that can be examined separately. Segmentation into sentences based on punctuation marks is common. |
Tokenization | Segmentation into individual units (tokens) at the word level, for example by taking the space character as the word boundary (see the sketch below the table). |
Word Stem Reduction | Reduces words to their stem so that inflected forms can be found in a later search. |
Lemmatization | The basic form of a word (lemma) is derived. |
Part-of-Speech Tagging | Words and punctuation marks are assigned to word classes. |
Parsing | The text is transformed into a syntactic structure. For the parser, a token is the atomic input unit. |
Resolve Coreference (reference identity) | A coreference exists when two linguistic expressions within an utterance refer to the same entity, for example through the use of pronouns. |
Proper Name Extraction | In proper name recognition, also called Named Entity Recognition (NER), terms in a text are assigned to specific types (e.g. place or person). |
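As an illustration of segmentation and tokenization, the following Python sketch splits a text into sentences at sentence-final punctuation and treats whitespace and punctuation as token boundaries. Real pipelines, including the eAQUA tools, are considerably more robust; the example text is invented:

```python
import re

def segment_sentences(text):
    """Split on sentence-final punctuation marks. The Greek question
    mark ';' would need extra handling and is ignored here."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Treat whitespace as the word boundary and split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Ueni, uidi, uici. Alea iacta est!"
for sentence in segment_sentences(text):
    print(tokenize(sentence))
# ['Ueni', ',', 'uidi', ',', 'uici', '.']
# ['Alea', 'iacta', 'est', '!']
```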
Co-occurrence calculation
In linguistics, co-occurrence generally refers to the joint occurrence of two lexical units within a superordinate segment. If, for example, two terms frequently occur together in a sentence, there is a justified assumption of a dependency relationship, whether semantic or grammatical in nature. Statistical calculations are used to quantify the presumed dependency. Several conditions must be fulfilled for this:
- A total corpus must be defined in which the occurrence of units, e.g. words, can be counted. These statistical parameters form the basis for the calculation.
- The corpus must be segmented. For non-neighbor co-occurrences, i.e. co-occurrences that are not direct left or right neighbors, maximum distances must be defined, since otherwise every word in a text would be connected to every other word. In eAQUA, the sentence was chosen as this lexical unit. In addition to the neighborhood co-occurrences, sentence co-occurrences are therefore reported.
- The corpus must be very large, or the calculation must include only the most frequent words. 5)
In eAQUA, the neighborhood and sentence co-occurrences were computed primarily with the log-likelihood significance measure. Unlike, for example, the Dice or Jaccard coefficients, which always yield an absolute value between 0 and 1, the values obtained are meaningful only in their relative order. In principle, the following applies for the so-called “lgl” value: the larger the value, the more probable a correlation, although negative values can also occur due to the algorithm.
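How such a log-likelihood score can be computed from sentence counts is sketched below, following Dunning's G² statistic. Whether eAQUA uses exactly this variant is an assumption, and the counts are invented; note also that this unsigned form never becomes negative, unlike the signed “lgl” value described above:

```python
from math import log

# Sketch of a log-likelihood (G²) significance score for sentence
# co-occurrences, after Dunning (1993). Whether eAQUA computes exactly
# this variant is an assumption; the example counts are invented.

def log_likelihood(n_a, n_b, n_ab, n):
    """n_a, n_b: sentences containing word A / word B;
    n_ab: sentences containing both; n: total number of sentences."""
    observed = [
        n_ab,                  # A and B together
        n_a - n_ab,            # A without B
        n_b - n_ab,            # B without A
        n - n_a - n_b + n_ab,  # neither A nor B
    ]
    rows = (n_a, n - n_a)
    cols = (n_b, n - n_b)
    expected = [rows[i] * cols[j] / n for i in (0, 1) for j in (0, 1)]
    # Terms with an observed count of 0 contribute 0 in the limit.
    return 2 * sum(k * log(k / e) for k, e in zip(observed, expected) if k > 0)

# Two words that co-occur far more often than chance predicts
# yield a large positive score:
print(round(log_likelihood(n_a=120, n_b=80, n_ab=40, n=10_000), 2))
```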
Citation analysis
As a subfield of bibliometrics, citation analysis deals with the quantitative investigation of cited and citing works. The results are presented visually in a citation graph, from which various regularities and structures of an author or a group of authors can be read. If the appropriate metadata are available 6), the representations can be narrowed down with custom search filters.
Citation analysis is performed using string-matching algorithms, which search for exact matches of a pattern in a text subject to defined tolerance criteria. For the citation analysis in eAQUA, these criteria were defined as follows.
After being reduced by all punctuation marks and by a list of frequently used words 7), the corpus is decomposed into sequences of five consecutive terms, each of which is checked for exact matches in the remaining corpus using a so-called naive algorithm. The remaining corpus is not restricted by metadata such as the time of origin. One peculiarity of this approach is that for some authors self-citations are found, i.e. passages where they evidently repeat themselves. Another is that a citation can consist of several entries 8) and only becomes recognizable as a whole via the sorting function. The sketch below illustrates the matching step.
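A minimal sketch of this matching step follows. For brevity it replaces the naive string search with a dictionary index over all 5-grams, which finds the same exact matches; the two corpus snippets and their names are invented:

```python
# Sketch of the 5-gram matching step: the cleaned token stream is cut
# into sequences of five consecutive terms, and each sequence is looked
# up in the rest of the corpus. The naive search from the text is
# emulated with a dictionary index; the snippets are invented.

def five_grams(tokens):
    """All sequences of five consecutive terms in a token list."""
    return [tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)]

corpus = {
    "work_a": "arma virumque cano troiae qui primus ab oris".split(),
    "work_b": "ille arma virumque cano troiae qui canebat".split(),
}

# Index every 5-gram by the works it occurs in ...
index = {}
for work, tokens in corpus.items():
    for gram in five_grams(tokens):
        index.setdefault(gram, set()).add(work)

# ... and report every sequence that appears in more than one work.
for gram, works in index.items():
    if len(works) > 1:
        print(" ".join(gram), "->", sorted(works))
# arma virumque cano troiae qui -> ['work_a', 'work_b']
```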
The parallel passages are finally assigned a similarity value using the edit distance; this value lies between 0 (not identical) and 1 (completely identical). The calculation is based on the Similar-Text algorithm, which Oliver 9) describes in pseudo-code:
sim(a, b) = 2 · (max(len(a), len(b)) − lev(a, b)) / (len(a) + len(b))

where len(a) and len(b) are the string lengths of the subtexts to be compared, and max(len(a), len(b)) − lev(a, b) is the number of identical characters, i.e. the difference between the length of the longer string and the Levenshtein distance 10).
Example: Similar-Text with string a = example text 1 and string b = example text 2

len(a) | len(b) | lev(a,b) | max(len(a),len(b)) − lev(a,b) | sim |
---|---|---|---|---|
14 | 15 | 6 | 9 | 0.62 |
The calculated similarity values always refer to the completely tokenized segments, not just to the search mask. This means that even completely identical passages can be assigned a value different from 1 if they occur within a larger segment. In the following example, the deviation results from the insertion of “quick brown”.
Example: Similar-Text with string a = The quick brown fox jumps over the lazy dog and string b = The fox jumps over the lazy dog

len(a) | len(b) | lev(a,b) | max(len(a),len(b)) − lev(a,b) | sim |
---|---|---|---|---|
43 | 31 | 12 | 31 | 0.84 |
Similar-Text calculations are only useful for short segments, such as the sentence tokenization used in eAQUA, because the values tend to decrease as the segments examined grow longer. The sketch below reproduces the second example.
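The similarity value can be reproduced from the formula above. The following Python sketch computes the Levenshtein distance with the standard dynamic program and derives sim(a, b) from it; note that Oliver's Similar-Text pseudo-code counts matching characters directly, and equating that count with max(len(a), len(b)) − lev(a, b) follows the description above:

```python
# Sketch of the similarity value from the formula above:
# sim(a, b) = 2 * (max(len(a), len(b)) - lev(a, b)) / (len(a) + len(b))

def lev(a, b):
    """Classic Levenshtein distance (insertions, deletions, substitutions),
    computed row by row with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sim(a, b):
    return 2 * (max(len(a), len(b)) - lev(a, b)) / (len(a) + len(b))

a = "The quick brown fox jumps over the lazy dog"
b = "The fox jumps over the lazy dog"
print(lev(a, b))            # 12
print(round(sim(a, b), 2))  # 0.84, matching the table above
```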