Cognate Detection (LexStat)

class lingpy.compare.lexstat.LexStat(filename, **keywords)

Basic class for automatic cognate detection.

Parameters:

filename : str

The name of the file that shall be loaded.

model : Model

The sound-class model that shall be used for the analysis. Defaults to the SCA sound-class model.

merge_vowels : bool (default=True)

Indicate whether consecutive vowels should be merged into single tokens or kept apart as separate tokens.
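The effect of this option can be pictured with a small stand-alone sketch. It is not LingPy's tokenizer: the vowel inventory and the helper name merge_vowel_runs are hypothetical and only illustrate what "merging consecutive vowels" means.

```python
# Hypothetical vowel inventory, for illustration only.
VOWELS = set("aeiouɛɔə")


def merge_vowel_runs(tokens):
    """Merge runs of consecutive vowel tokens into single tokens."""
    merged = []
    prev_was_vowel = False
    for tok in tokens:
        is_vowel = all(ch in VOWELS for ch in tok)
        if is_vowel and prev_was_vowel:
            # Extend the previous vowel token instead of starting a new one.
            merged[-1] += tok
        else:
            merged.append(tok)
        prev_was_vowel = is_vowel
    return merged


print(merge_vowel_runs(["h", "a", "u", "s"]))  # ['h', 'au', 's']
```

With merging switched off, the same word would keep "a" and "u" as two separate tokens.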

transform : dict

A dictionary that indicates how prosodic strings should be simplified (or generally transformed), using a simple key-value structure with the key referring to the original prosodic context and the value to the new value. Currently, prosodic strings (see prosodic_string()) offer 11 different prosodic contexts. Since not all these are helpful in preliminary analyses for cognate detection, it is useful to merge some of these contexts into one. The default settings distinguish only 5 instead of 11 available contexts, namely:

  • C for all consonants in prosodically ascending position,

  • c for all consonants in prosodically descending position,

  • V for all vowels,

  • T for all tones, and

  • _ for word-breaks.

Also check the “vowels” keyword when initialising a LexStat object, since the symbols you use for vowels and tones must be identical to the ones you define in your transform dictionary.
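A transform dictionary of this kind might look as follows. The eleven source symbols used here are an assumption about what prosodic_string() emits and should be checked against your LingPy version; the helper simplify_prostring is hypothetical and only shows how the key-value mapping is applied.

```python
# Assumed 11-symbol inventory; verify against prosodic_string() before use.
TRANSFORM = {
    "A": "C", "B": "C", "C": "C",  # consonants in ascending position
    "L": "c", "M": "c", "N": "c",  # consonants in descending position
    "X": "V", "Y": "V", "Z": "V",  # vowels
    "T": "T",                      # tones
    "_": "_",                      # word breaks
}


def simplify_prostring(prostring, transform=TRANSFORM):
    """Replace every prosodic context symbol by its simplified value."""
    return "".join(transform[symbol] for symbol in prostring)


print(simplify_prostring("AXMBYN"))  # CVcCVc
```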

vowels : str (default=”VT_”)

When creating a scoring function with get_scorer, you can reduce the scores for matches of tones and vowels via the “vscale” parameter, which defaults to 0.5. For vowels and tones to be detected properly, the prosodic symbols that represent them must match the ones listed in this keyword. Thus, if you change the prosodic strings using the “transform” keyword, you also need to change the vowel string so that “vscale” works as intended in get_scorer.
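Why the two keywords have to agree can be shown with a minimal sketch. The function scale_score is hypothetical and is only one plausible reading of how a vowel scale could be applied; it is not the code used by get_scorer.

```python
def scale_score(score, pros_a, pros_b, vowels="VT_", vscale=0.5):
    """Scale a match score if both prosodic symbols count as vowel-like.

    Assumption: scaling applies only when both symbols are listed in
    `vowels`. The real get_scorer logic may differ.
    """
    if pros_a in vowels and pros_b in vowels:
        return score * vscale
    return score


# Default symbols: the vowel-vowel match is scaled.
print(scale_score(10.0, "V", "V"))                # 5.0
# A transform that renames vowels to "v" breaks detection ...
print(scale_score(10.0, "v", "v"))                # 10.0
# ... unless the vowel string is changed accordingly.
print(scale_score(10.0, "v", "v", vowels="vT_"))  # 5.0
```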

check : bool (default=False)

If set to True, the input file will first be checked for errors before the calculation is carried out. Errors will be written to the file named by the errors keyword, which defaults to errors.log. See also apply_checks.

apply_checks : bool (default=False)

If set to True, any errors identified by check will be handled silently.

no_bscorer : bool (default=False)

If set to True, this will suppress the creation of a language-specific scoring function (which may become quite large and is additional ballast if the method “lexstat” is not used after all). If you use the “lexstat” method, however, this needs to be set to False.

errors : str

The name of the error log.

segments : str (default=”tokens”)

The name of the column in your data which contains the segmented transcriptions, or in which the segmented transcriptions should be placed.

transcription : str (default=”ipa”)

The name of the column in your data which contains the unsegmented transcriptions.

classes : str (default=”classes”)

The name of the column in the data which contains the sound class representation of the transcriptions, or in which this information shall be placed after automatic conversion.

numbers : str (default=”numbers”)

The language-specific triples consisting of language id (numeric), sound class string (one character only), and prosodic string (one character only). Usually, numbers are automatically created from the columns “classes”, “prostrings”, and “langid”, but you can also provide them in your data.
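How such triples could be assembled is sketched below. The helper make_numbers and the sample sound classes are hypothetical; the dot-separated layout follows the description of the chars attribute in this documentation.

```python
def make_numbers(langid, classes, prostring):
    """Combine language id, sound classes and prosodic symbols into triples."""
    if len(classes) != len(prostring):
        raise ValueError("classes and prostring must have the same length")
    return [f"{langid}.{cls}.{pro}" for cls, pro in zip(classes, prostring)]


# One sound-class symbol and one prosodic symbol per segment.
print(make_numbers(1, "KATA", "CVCV"))
# ['1.K.C', '1.A.V', '1.T.C', '1.A.V']
```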

langid : str (default=”langid”)

Name of the column that contains a numerical language identifier, needed to produce the language-specific character triples (“numbers”). Unless specified explicitly, this is automatically created.

prostrings : str (default=”prostrings”)

Name of the column containing prosodic strings (see List2014d for more details) of the segmented transcriptions, containing one character per segment. Prostrings add a contextual component to phonetic sequences. They are automatically created, but can likewise be supplied with the initial data.

weights : str (default=”weights”)

The name of the column which stores the individual gap-weights for each sequence. Gap weights are positive floats for each segment in a string, which modify the gap opening penalty during alignment.
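As a rough illustration, gap weights can be thought of as per-segment factors on a base gap opening penalty. The sketch below assumes simple multiplication and an illustrative base penalty of -2.0; both are assumptions, and the function positional_gap_penalties is hypothetical rather than part of LingPy.

```python
def positional_gap_penalties(weights, gop=-2.0):
    """Return one gap opening penalty per segment (assumed: gop * weight)."""
    if any(w <= 0 for w in weights):
        raise ValueError("gap weights must be positive")
    return [gop * w for w in weights]


# Opening a gap at the first segment costs most in this example.
print(positional_gap_penalties([2.0, 1.5, 1.0]))  # [-4.0, -3.0, -2.0]
```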

tokenize : function (default=ipa2tokens)

The function which should be used to tokenize the entries in the column storing the transcriptions in case no segmentation is provided by the user.

get_prostring : function (default=prosodic_string)

The function which should be used to create prosodic strings from the segmented transcription data. If you want to completely ignore prosodic strings in LexStat calculations, you could just pass the following function:

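>>> from lingpy.compare.lexstat import LexStat
>>> # assign the same dummy symbol "x" to every segment
>>> lex = LexStat('inputfile.tsv', get_prostring=lambda x: ["x" for y in x])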

cldf : bool (default=True)

If set to True, as by default, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved when internally converting tokens to classes (e.g., laryngeal h₂ in Indo-European). Following the CLDF specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the CLTS initiative), in cases of insecurity of pronunciation, users can adopt a `source/target` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the EDICTOR tool.
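The `source/target` convention can be illustrated with a short sketch. The helper split_slash_token is hypothetical, and the assumption that the target half is the one passed on to sound-class conversion is an interpretation of the description above, not a statement about LingPy's internals.

```python
def split_slash_token(token):
    """Split a `source/target` token; plain tokens map to themselves."""
    source, sep, target = token.partition("/")
    if sep and source and target:
        return source, target
    return token, token


tokens = ["h₂/x", "a", "t"]
# Assumed: the target half is what gets converted to a sound class.
targets = [split_slash_token(tok)[1] for tok in tokens]
print(targets)  # ['x', 'a', 't']
```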

Notes

Instantiating this class does not require a lot of parameters. However, the user may modify its behaviour by providing additional attributes in the input file.

Attributes

pairs

dict

A dictionary with tuples of language names as key and indices as value, pointing to unique combinations of words with the same meaning in all language pairs.
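The general shape of such a dictionary is sketched below with hypothetical data. The sketch is a simplification: it is not LingPy's own construction, and details such as key ordering or the treatment of pairs within one language may differ.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: word index -> (language, concept)
words = {
    1: ("German", "hand"),
    2: ("English", "hand"),
    3: ("German", "foot"),
    4: ("English", "foot"),
    5: ("Dutch", "hand"),
}


def build_pairs(words):
    """Collect index pairs of words sharing a concept, per language pair."""
    pairs = defaultdict(list)
    for (idx_a, (lang_a, con_a)), (idx_b, (lang_b, con_b)) in combinations(
        sorted(words.items()), 2
    ):
        if lang_a != lang_b and con_a == con_b:
            pairs[(lang_a, lang_b)].append((idx_a, idx_b))
    return dict(pairs)


pairs = build_pairs(words)
print(pairs[("German", "English")])  # [(1, 2), (3, 4)]
```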

model

Model

The sound class model instance which serves to convert the phonetic data into sound classes.

chars

list

A list of all unique language-specific character types in the instantiated LexStat object. The characters in this list consist of

  • the language identifier (numeric, referenced as “langid” as a default, but customizable via the keyword “langid”)

  • the sound class symbol for the respective IPA transcription value

  • the prosodic class value

All values are represented in the above order as one string, separated by a dot. Gaps are also included in this collection. They are traditionally represented as “X” for the sound class and “-” for the prosodic string.

rchars

list

A list containing all unique character types across languages. In contrast to the chars-attribute, the “rchars” (raw chars) do not contain the language identifier, thus they only consist of two values, separated by a dot, namely, the sound class symbol, and the prosodic class value.
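The relation between the two attributes can be sketched as follows. The sample values and the helper to_rchars are hypothetical; the sketch only relies on the dot-separated layout described above.

```python
# Hypothetical language-specific characters (langid.class.prosody),
# including one gap character ("X" and "-").
chars = ["1.K.C", "1.A.V", "2.K.C", "2.T.c", "1.X.-"]


def to_rchars(chars):
    """Drop the language identifier and keep each raw character once."""
    return sorted({char.split(".", 1)[1] for char in chars})


print(to_rchars(chars))  # ['A.V', 'K.C', 'T.c', 'X.-']
```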

scorer

dict

A collection of ScoreDict objects, which are used to score the strings. LexStat distinguishes two different scoring functions:

  • rscorer: A “raw” scorer that is not language-specific and consists only of sound class values and prosodic string values. This scorer is traditionally used to carry out the first alignment in order to calculate the language-specific scorer. It is directly accessible as an attribute of the LexStat class (rscorer). The characters which constitute the values in this scorer are accessible via the “rchars” attribute of each LexStat class.

  • bscorer: The language-specific scorer. This scorer is made of unique language-specific characters. These are accessible via the “chars” attribute of each LexStat class. Like the “rscorer”, the “bscorer” can also be accessed directly as an attribute of the LexStat class (bscorer).

Methods

align_pairs(idxA, idxB[, concept])

Align all or some words of a given pair of languages.

cluster([method, cluster_method, threshold, ...])

Function for flat clustering of words into cognate sets.

get_distances([method, mode, gop, scale, ...])

Method calculates different distance estimates for language pairs.

get_random_distances([method, runs, mode, ...])

Method calculates random scores for unrelated words in a dataset.

get_scorer(**keywords)

Create a scoring function based on sound correspondences.

output(fileformat, **keywords)

Write data to file.

Inherited WordList Methods

get_entries(entry)

Return all entries matching the given entry-type as a two-dimensional list.

add_entries(entry, source, function[, override])

Add new entry-types to the word list by modifying given ones.

calculate(data[, taxa, concepts, ref])

Function calculates specific data.

export(fileformat[, sections, entries, ...])

Export the wordlist to specific fileformats.

get_dict([col, row, entry])

Function returns dictionaries of the cells matched by the indices.

get_etymdict([ref, entry, modify_ref])

Return an etymological dictionary representation of the word list.

get_list([row, col, entry, flat])

Function returns lists of rows and columns specified by their name.

get_paps([ref, entry, missing, modify_ref])

Function returns a list of present-absent-patterns of a given word list.

output(fileformat, **keywords)

Write wordlist to file.

renumber(source[, target, override])

Renumber a given set of string identifiers by replacing the ids by integers.