lingpy.basic package

Submodules

lingpy.basic.ops module

Module provides basic operations on Wordlist objects.

lingpy.basic.ops.calculate_data(wordlist, data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)

Manipulate a wordlist object by adding different kinds of data.

Parameters:

data : str

The type of data that shall be calculated. Currently supports

  • “tree”: calculate a reference tree based on shared cognates

  • “dst”: get distances between taxa based on shared cognates

  • “cluster”: cluster the taxa into groups using different methods
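The “dst” computation above rests on shared cognate sets. A minimal pure-Python sketch of the idea, with hypothetical toy data (real wordlists store cognate identifiers in a column such as “cogid”):

```python
# Hypothetical data: concept -> cognate ID, per taxon.
cogids = {
    "German":  {"hand": 1, "leg": 2, "head": 3},
    "English": {"hand": 1, "leg": 4, "head": 3},
}

def shared_cognate_distance(taxA, taxB):
    """1 minus the proportion of shared cognate IDs over common concepts."""
    common = cogids[taxA].keys() & cogids[taxB].keys()
    shared = sum(1 for c in common if cogids[taxA][c] == cogids[taxB][c])
    return 1 - shared / len(common)

print(shared_cognate_distance("German", "English"))  # 1 - 2/3
```

The resulting pairwise distances are what “tree” and “cluster” then operate on.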

lingpy.basic.ops.clean_taxnames(wordlist, column='doculect', f=<function <lambda>>)

Function cleans taxon names for use in Newick files.

lingpy.basic.ops.coverage(wordlist)

Determine the average coverage of a wordlist.
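As a rough sketch of what “coverage” means here (hypothetical toy data, not LingPy's actual implementation): count how many entries each doculect has and average over doculects.

```python
# Hypothetical toy wordlist: doculect -> list of attested forms.
words = {
    "German":  ["hant", "bain"],
    "English": ["hænd", "leg", "hɛd"],
}

def average_coverage(wl):
    """Mean number of entries per doculect."""
    return sum(len(entries) for entries in wl.values()) / len(wl)

print(average_coverage(words))  # 2.5
```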

lingpy.basic.ops.get_score(wl, ref, mode, taxA, taxB, concepts_attr='concepts', ignore_missing=False)
lingpy.basic.ops.iter_rows(wordlist, *values)

Function generates a list of the specified values in a wordlist.

Parameters:

wordlist : ~lingpy.basic.wordlist.Wordlist

A wordlist object or one of the daughter classes of wordlists.

value : str

A value as defined in the header of the wordlist.

Returns:

iterator : generator

A generator that yields, for each row in the wordlist, a list containing the row's key and the corresponding cells, as specified in the header.

Notes

Use this function to quickly iterate over specified fields in the wordlist. For example, when trying to access all pairs of language names and concepts, you may write:

>>> for k, language, concept in iter_rows(wl, 'language', 'concept'):
...     print(k, language, concept)

Note that this function returns the key of the given row as a first value. So if you do not specify anything, the output will just be the key.

lingpy.basic.ops.renumber(wordlist, source, target='', override=False)

Create numerical identifiers from string identifiers.

lingpy.basic.ops.triple2tsv(triples_or_fname, output='table')

Function reads a triple file and converts it to a tabular data structure.

lingpy.basic.ops.tsv2triple(wordlist, outfile=None)

Function converts a wordlist to a triple data structure.

Notes

The basic values of which the triples consist are:
  • ID (the ID in the TSV file)

  • COLUMN (the column in the TSV file)

  • VALUE (the entry in the TSV file)
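The triple structure above can be sketched in a few lines of plain Python (toy rows, not LingPy's internal representation):

```python
# Hypothetical rows keyed by integer ID, as in a wordlist TSV file.
rows = {
    1: {"DOCULECT": "German", "CONCEPT": "hand", "IPA": "hant"},
    2: {"DOCULECT": "English", "CONCEPT": "hand", "IPA": "hænd"},
}

def to_triples(rows):
    """Turn every cell into an (ID, COLUMN, VALUE) triple."""
    return [(idx, column, value)
            for idx, row in sorted(rows.items())
            for column, value in row.items()]

triples = to_triples(rows)
print(triples[0])  # (1, 'DOCULECT', 'German')
```

triple2tsv performs the inverse operation, regrouping triples by ID into table rows.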

lingpy.basic.ops.wl2dict(wordlist, sections, entries, exclude=None)

Convert a wordlist to a complex dictionary with headings as keys.

lingpy.basic.ops.wl2dst(wl, taxa='taxa', concepts='concepts', ref='cogid', refB='', mode='swadesh', ignore_missing=False, **keywords)

Function converts wordlist to distance matrix.

lingpy.basic.ops.wl2multistate(wordlist, ref, missing)

Function converts a wordlist to multistate format (compatible with PAUP).

lingpy.basic.ops.wl2qlc(header, data, filename='', formatter='concept', **keywords)

Write the basic data of a wordlist to file.

lingpy.basic.parser module

Basic parser for text files in QLC format.

class lingpy.basic.parser.QLCParser(filename, conf='')

Bases: object

Basic class for the handling of text files in QLC format.

add_entries(entry, source, function, override=False, **keywords)

Add new entry-types to the word list by modifying given ones.

Parameters:

entry : string

A string specifying the name of the new entry-type to be added to the word list.

source : string

A string specifying the basic entry-type that shall be modified. If multiple entry-types shall be used to create a new entry, they should be passed in a single string, separated by commas.

function : function

A function which is used to convert the source into the target value.

keywords : {dict}

A dictionary of keywords that are passed as parameters to the function.

Notes

This method can be used to add new entry-types to the data by converting given ones. There are a lot of possibilities for adding new entries, but the most basic procedure is to use an existing entry-type and to modify it with help of a function.
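A pure-Python sketch of that procedure, using a hypothetical dict-based wordlist rather than a real QLCParser instance: a new entry-type is derived by applying a function to an existing one.

```python
# Hypothetical toy wordlist: row ID -> {entry-type: value}.
wordlist = {
    1: {"ipa": "hant"},
    2: {"ipa": "bain"},
}

def add_entries(wl, entry, source, function):
    """Derive a new entry-type from an existing one, row by row."""
    for row in wl.values():
        row[entry] = function(row[source])

# Derive a naive character-level "tokens" entry from the "ipa" entry.
add_entries(wordlist, "tokens", "ipa", list)
print(wordlist[1]["tokens"])  # ['h', 'a', 'n', 't']
```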

class lingpy.basic.parser.QLCParserWithRowsAndCols(filename, row, col, conf)

Bases: QLCParser

get_entries(entry)

Return all entries matching the given entry-type as a two-dimensional list.

Parameters:

entry : string

The entry-type of the data that shall be returned in tabular format.

lingpy.basic.parser.read_conf(conf='')

lingpy.basic.tree module

Basic module for the handling of language trees.

class lingpy.basic.tree.Tree(tree, **keywords)

Bases: PhyloNode

Basic class for the handling of phylogenetic trees.

Parameters:

tree : {str, file, list}

A string or a file containing trees in Newick format. As an alternative, you can also simply pass a list containing taxon names. In that case, a random tree will be created from the list of taxa.

branch_lengths : bool (default=False)

When set to True, and a list of taxa is passed instead of a Newick string or a file containing a Newick string, a random tree with random branch lengths will be created, the branch lengths being on the order of the total number of internal branches.

getDistanceToRoot(node)

Return the distance from the given node to the root.

Parameters:

node : str

The name of a given node in the tree.

Returns:

distance : int

The distance of the given node to the root of the tree.

get_distance(other, distance='grf', debug=False)

Function returns the Robinson-Foulds distance between the two trees.

Parameters:

other : lingpy.basic.tree.Tree

A tree object. It should have the same number of taxa as the initial tree.

distance : { “grf”, “rf”, “branch”, “symmetric”} (default=”grf”)

The distance which shall be calculated. Select between:

  • “grf”: the generalized Robinson-Foulds Distance

  • “rf”: the Robinson-Foulds Distance

  • “symmetric”: the symmetric difference between all partitions of the trees

lingpy.basic.tree.random_tree(taxa, branch_lengths=False)

Create a random tree from a list of taxa.

Parameters:

taxa : list

The list containing the names of the taxa from which the tree will be created.

branch_lengths : bool (default=False)

When set to True, a random tree with random branch lengths will be created, the branch lengths being on the order of the total number of internal branches.

Returns:

tree_string : str

A string representation of the random tree in Newick format.
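One simple way to build such a string (an illustrative sketch, not necessarily LingPy's algorithm) is to repeatedly join two randomly chosen subtrees until a single Newick expression remains:

```python
import random

def random_tree(taxa, seed=None):
    """Join random pairs of subtrees until one Newick string is left."""
    rng = random.Random(seed)
    nodes = list(taxa)
    while len(nodes) > 1:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append("({0},{1})".format(a, b))
    return nodes[0] + ";"

tree = random_tree(["German", "English", "Dutch"], seed=42)
print(tree)  # e.g. '((German,Dutch),English);'
```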

lingpy.basic.wordlist module

This module provides a basic class for the handling of word lists.

class lingpy.basic.wordlist.BounceAsID

Bases: object

A helper class for CLDF conversion when tables are missing.

A class with trivial ‘item lookup’:

>>> b = BounceAsID()
>>> b[5]
{"ID": 5}
>>> b["long_id"]
{"ID": "long_id"}
class lingpy.basic.wordlist.Wordlist(filename, row='concept', col='doculect', conf=None)

Bases: QLCParserWithRowsAndCols

Basic class for the handling of multilingual word lists.

Parameters:

filename : { string, dict }

The input file that contains the data. Otherwise a dictionary with consecutive integers as keys and lists as values with the key 0 specifying the header.

row : str (default = “concept”)

A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.

col : str (default = “doculect”)

A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.

conf : string (default=None)

A string defining the path to the configuration file (more information in the notes).

Notes

A word list is created from a dictionary containing the data. The idea is a three-dimensional representation of (linguistic) data. The first dimension is called col (column, usually “language”), the second one is called row (row, usually “concept”), the third is called entry, and in contrast to the first two dimensions, which have to consist of unique items, it contains flexible values, such as “ipa” (phonetic sequence), “cogid” (identifier for cognate sets), “tokens” (tokenized representation of phonetic sequences). The LingPy website offers some tutorials for word lists which we recommend to read in case you are looking for more information.

A couple of methods is provided along with the word list class in order to access the multi-dimensional input data. The main idea is to provide an easy way to access two-dimensional slices of the data by specifying which entry type should be returned. Thus, if a word list consists not only of simple orthographical entries but also of IPA encoded phonetic transcriptions, both the orthographical source and the IPA transcriptions can be easily accessed as two separate two-dimensional lists.

add_entries(entry, source, function, override=False, **keywords)

Add new entry-types to the word list by modifying given ones.

Parameters:

entry : string

A string specifying the name of the new entry-type to be added to the word list.

source : string

A string specifying the basic entry-type that shall be modified. If multiple entry-types shall be used to create a new entry, they should be passed in a single string, separated by commas.

function : function

A function which is used to convert the source into the target value.

keywords : {dict}

A dictionary of keywords that are passed as parameters to the function.

Notes

This method can be used to add new entry-types to the data by converting given ones. There are a lot of possibilities for adding new entries, but the most basic procedure is to use an existing entry-type and to modify it with help of a function.

calculate(data, taxa='taxa', concepts='concepts', ref='cogid', **keywords)

Function calculates specific data.

Parameters:

data : str

The type of data that shall be calculated. Currently supports

  • “tree”: calculate a reference tree based on shared cognates

  • “dst”: get distances between taxa based on shared cognates

  • “cluster”: cluster the taxa into groups using different methods

coverage(stats='absolute')

Function determines the coverage of a wordlist.

export(fileformat, sections=None, entries=None, entry_sep='', item_sep='', template='', **keywords)

Export the wordlist to specific fileformats.

Notes

The difference between export and output is that the latter mostly serves for internal purposes and formats, while the former serves for publication of data, using specific, nested statements to create, for example, HTML or LaTeX files from the wordlist data.

classmethod from_cldf(path, columns=('parameter_id', 'concept_name', 'language_id', 'language_name', 'value', 'form', 'segments', 'language_glottocode', 'concept_concepticon_id', 'language_latitude', 'language_longitude', 'cognacy'), namespace=(('concept_name', 'concept'), ('language_id', 'doculect'), ('segments', 'tokens'), ('language_glottocode', 'glottolog'), ('concept_concepticon_id', 'concepticon'), ('language_latitude', 'latitude'), ('language_longitude', 'longitude'), ('cognacy', 'cognacy'), ('cogid_cognateset_id', 'cogid')), filter=<function Wordlist.<lambda>>, **kwargs)

Load a CLDF dataset.

Open a CLDF dataset – with metadata or metadata-free – (only Wordlist datasets are supported for now, because other modules don’t seem to make sense for LingPy) and transform it into this class. Columns from the FormTable are imported in lowercase; columns from the LanguageTable, ParameterTable and CognateTable are prefixed with `language_`, `concept_` and `cogid_`, respectively, and converted to lowercase.

Parameters:

columns : list or tuple

The list of columns to import. (default: all columns)

filter : function (rowdict → bool)

A condition function for importing only some rows. (default: lambda row: row[“form”])

All other parameters are passed on to `cls`.

Returns:

A `cls` object representing the CLDF dataset.

Notes

CLDFs default column names for wordlists are different from LingPy’s, so you probably have to use:

>>> lingpy.Wordlist.from_cldf("Wordlist-metadata.json")

in order to avoid errors from LingPy not finding required columns.

get_dict(col='', row='', entry='', **keywords)

Function returns dictionaries of the cells matched by the indices.

Parameters:

col : string (default=””)

The column index evaluated by the method. It should contain one of the column values of the Wordlist instance, usually a taxon (language) name.

row : string (default=””)

The row index evaluated by the method. It should contain one of the row values of the Wordlist instance, usually a concept.

entry : string (default=””)

The index for the entry evaluated by the method. It can be used to specify the datatype of the rows or columns selected. As a default, the indices of the entries are returned.

Returns:

entries : dict

A dictionary of keys and values specifying the selected part of the data. Typically, this can be a dictionary of a given language with keys for the concept and values as specified in the “entry” keyword.

Notes

The “col” and “row” keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:

>>> Wordlist.get_dict(language='LANGUAGE')

and for the selection of a concept, one may type something like:

>>> Wordlist.get_dict(concept='CONCEPT')

See the examples below for details.

Examples

Load the harry_potter.csv file:

>>> wl = Wordlist('harry_potter.csv')

Select all IPA-entries for the language “German”:

>>> wl.get_dict(language='German',entry='ipa')
{'Harry': ['haralt'], 'hand': ['hant'], 'leg': ['bain']}

Select all words (orthographical representation) for the concept “Harry”:

>>> wl.get_dict(concept="Harry",entry="words")
{'English': ['hæri'], 'German': ['haralt'], 'Russian': ['gari'], 'Ukrainian': ['gari']}

Note that the values of the dictionary that is returned are always lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept).

get_distances(**kw)
get_etymdict(ref='cogid', entry='', modify_ref=False)

Return an etymological dictionary representation of the word list.

Parameters:

ref : string (default = “cogid”)

The reference entry which is used to store the cognate ids.

entry : string (default = ‘’)

The entry-type which shall be selected.

modify_ref : function (default=False)

Use a function to modify the reference. If your cognate identifiers are numerical, for example, and negative values are assigned as loans, but you want to suppress this behaviour, just set this keyword to “abs”, and all cognate IDs will be converted to their absolute value.

Returns:

etym_dict : dict

An etymological dictionary representation of the data.

Notes

In contrast to the word-list representation of the data, an etymological dictionary representation sorts the counterparts according to the cognate sets of which they are reflexes. If more than one cognate ID are assigned to an entry, for example in cases of fuzzy cognate IDs or partial cognate IDs, the etymological dictionary will return one cognate set for each of the IDs.
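A sketch of that regrouping in plain Python, with hypothetical (doculect, form, cogid) rows instead of a real Wordlist:

```python
# Hypothetical reflexes: (doculect, form, cognate ID).
rows = [
    ("German", "hant", 1),
    ("English", "hænd", 1),
    ("German", "bain", 2),
]
doculects = ["German", "English"]

def get_etymdict(rows):
    """Group forms by cognate ID, with one slot per doculect."""
    etym = {}
    for doculect, form, cogid in rows:
        slots = etym.setdefault(cogid, {d: [] for d in doculects})
        slots[doculect].append(form)
    return etym

etym = get_etymdict(rows)
print(etym[1])  # {'German': ['hant'], 'English': ['hænd']}
```

Each cognate set now lists its reflexes per doculect, with empty slots marking doculects that lack a reflex.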

get_list(row='', col='', entry='', flat=False, **keywords)

Function returns lists of rows and columns specified by their name.

Parameters:

row : string (default = ‘’)

The row name whose entries are selected from the data.

col : string (default = ‘’)

The column name whose entries are selected from the data.

entry : string (default = ‘’)

The entry-type which is selected from the data.

flat : bool (default = False)

Specify whether the returned list should be one- or two-dimensional, or whether it should contain gaps or not.

Returns:

data : list

A list representing the selected part of the data.

Notes

The ‘col’ and ‘row’ keywords in the function are all aliased according to the description in the wordlist.rc file. Thus, instead of using these attributes, the aliases can also be taken. For selecting a language, one may type something like:

>>> Wordlist.get_list(language='LANGUAGE')

and for the selection of a concept, one may type something like:

>>> Wordlist.get_list(concept='CONCEPT')

See the examples below for details.

Examples

Load the harry_potter.csv file:

>>> wl = Wordlist('harry_potter.csv')

Select all IPA-entries for the language “German”:

>>> wl.get_list(language='German',entry='ipa')
['bain', 'hant', 'haralt']

Note that this function returns 0 for missing values (concepts that don’t have a word in the given language). If one wants to avoid this, the ‘flat’ keyword should be set to True.

Select all words (orthographical representation) for the concept “Harry”:

>>> wl.get_list(concept="Harry",entry="words")
[['Harry', 'Harald', 'Гари', 'Гарi']]

Note that the values of the list that is returned are always two-dimensional lists, since it is possible that the original file contains synonyms (multiple words corresponding to the same concept). If one wants to have a flat representation of the entries, the ‘flat’ keyword should be set to True:

>>> wl.get_list(concept="Harry",entry="words",flat=True)
['hæri', 'haralt', 'gari', 'hari']

get_paps(ref='cogid', entry='concept', missing=0, modify_ref=False)

Function returns a list of present-absent-patterns of a given word list.

Parameters:

ref : string (default = “cogid”)

The reference entry which is used to store the cognate ids.

entry : string (default = “concept”)

The field which is used to check for missing data.

missing : {str, int} (default = 0)

The marker for missing items.
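A present-absent pattern marks, for each cognate set, which doculects have a reflex. A minimal sketch with hypothetical data:

```python
# Hypothetical cognate sets: cognate ID -> doculects that have a reflex.
cogsets = {1: {"German", "English"}, 2: {"German"}}
doculects = ["German", "English"]

def get_paps(cogsets, missing=0):
    """1 where a doculect attests the cognate set, else the missing marker."""
    return {cogid: [1 if d in members else missing for d in doculects]
            for cogid, members in cogsets.items()}

print(get_paps(cogsets))  # {1: [1, 1], 2: [1, 0]}
```

Such patterns are the input format expected by phylogenetic software like PAUP.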

get_tree(**kw)
iter_cognates(ref, *entries)

Iterate over cognate sets in a wordlist.

iter_rows(*entries)

Iterate over the columns in a wordlist.

Parameters:

entries : list

The names of the columns over which to iterate.

Returns:

iterator : iterator

An iterator yielding lists in which the first entry is the ID of the wordlist row and the following entries are the content of the columns as specified.

Examples

Load a wordlist from LingPy’s test data:

>>> from lingpy.tests.util import test_data
>>> from lingpy import Wordlist
>>> wl = Wordlist(test_data("KSL.qlc"))
>>> list(wl.iter_rows('ipa'))[:10]
[[1, 'ɟiθ'],
 [2, 'ɔl'],
 [3, 'tut'],
 [4, 'al'],
 [5, 'apa.u'],
 [6, 'ʔayɬʦo'],
 [7, 'bytyn'],
 [8, 'e'],
 [9, 'and'],
 [10, 'e']]

So as you can see, the function returns the key of the wordlist as well as the specified entry.

output(fileformat, **keywords)

Write wordlist to file.

Parameters:

fileformat : {“tsv”,”tre”,”nwk”,”dst”, “taxa”, “starling”, “paps.nex”, “paps.csv”}

The format that is written to file. This corresponds to the file extension, thus ‘tsv’ creates a file in extended tsv-format, ‘dst’ creates a file in Phylip-distance format, etc.

filename : str

Specify the name of the output file (defaults to a filename that indicates the creation date).

subset : bool (default=False)

If set to True, return only a subset of the data. The subset is specified by the keywords ‘cols’ and ‘rows’.

cols : list

If subset is set to True, specify the columns that shall be written to the csv-file.

rows : dict

If subset is set to True, use a dictionary consisting of keys that specify a column and values that give a Python statement in raw text, such as, e.g., “== ‘hand’”. The content of the specified column will then be checked against the statement passed in the dictionary, and if it evaluates to True, the respective row will be written to file.

ref : str

Name of the column that contains the cognate IDs if ‘starling’ is chosen as an output format.

missing : { str, int } (default=0)

If ‘paps.nex’ or ‘paps.csv’ is chosen as fileformat, this character will be inserted as an indicator of missing data.

tree_calc : {‘neighbor’, ‘upgma’}

If no tree has been calculated and ‘tre’ or ‘nwk’ is chosen as output format, the method that is used to calculate the tree.

threshold : float (default=0.6)

The threshold that is used to carry out a flat cluster analysis if ‘groups’ or ‘cluster’ is chosen as output format.

ignore : {list, “all”} (default=”all”)

Modifies the “tsv” output format and allows you to ignore certain blocks in extended “tsv”, like “msa”, “taxa”, “json”, etc., which should be passed as a list. If you choose “all” as a plain string and not a list, all additional blocks will be ignored and only plain “tsv” will be written.

prettify : bool (default=False)

Inserts comment characters between concepts in the “tsv” file output format, which makes it easier to see blocks of words denoting the same concept. Switching this off will output the file in plain “tsv”.

renumber(source, target='', override=False)

Renumber a given set of string identifiers by replacing the ids by integers.

Parameters:

source : str

The source column to be manipulated.

target : str (default=’’)

The name of the target column. If no name is chosen, the target column will be named by adding “id” to the name of the source column.

override : bool (default=False)

Force to overwrite the data if the target column already exists.

Notes

In addition to the new column, a further entry is added to the “_meta” attribute of the wordlist, by which the newly coined ids can be retrieved from the former string attributes. This attribute is called “source2target” and can be accessed either via the “_meta” dictionary or directly as an attribute of the wordlist.
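The mapping itself is straightforward: each distinct string identifier gets the next free integer. A sketch with hypothetical identifiers:

```python
# Hypothetical string identifiers, e.g. cognate-set labels.
values = ["cogset-a", "cogset-b", "cogset-a"]

def renumber(values):
    """Map each distinct string to a consecutive integer, keep the mapping."""
    mapping = {}
    ids = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
        ids.append(mapping[v])
    return ids, mapping

ids, source2target = renumber(values)
print(ids)            # [1, 2, 1]
print(source2target)  # {'cogset-a': 1, 'cogset-b': 2}
```

The returned mapping plays the role of the “source2target” attribute described above.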

lingpy.basic.wordlist.from_cldf(path, to=<class 'lingpy.basic.wordlist.Wordlist'>, concept='Name', concepticon='Concepticon_ID', glottocode='Glottocode', language='Name')

Load data from CLDF into a LingPy Wordlist object or similar.

Parameters:

path : str

The path to the metadata-file of your CLDF dataset.

to : ~lingpy.basic.wordlist.Wordlist

A ~lingpy.basic.wordlist.Wordlist object or one of its descendants (LexStat, Alignments).

concept : str (default=’Name’)

The name of the column storing the basic gloss in the parameters.csv table.

glottocode : str (default=’Glottocode’)

The name of the column storing the Glottolog ID in the languages.csv table.

language : str (default=’Name’)

The name of the column storing the language name in the languages.csv table.

concepticon : str (default=’Concepticon_ID’)

The name of the column storing the concept set in the parameters.csv table.

Notes

This function does not offer absolute flexibility regarding the data you can input so far. However, it can regularly read CLDF-formatted data into LingPy and thus allow you to use CLDF data in LingPy analyses.

lingpy.basic.wordlist.get_wordlist(path, delimiter=',', quotechar='"', normalization_form='NFC', **keywords)

Load a wordlist from a normal CSV file.

Parameters:

path : str

The path to your CSV file.

delimiter : str

The delimiter in the CSV file.

quotechar : str

The quote character in your data.

row : str (default = “concept”)

A string indicating the name of the row that shall be taken as the basis for the tabular representation of the word list.

col : str (default = “doculect”)

A string indicating the name of the column that shall be taken as the basis for the tabular representation of the word list.

conf : string (default=’’)

A string defining the path to the configuration file.

Notes

This function returns a Wordlist object. In contrast to the normal way of loading a wordlist from a tab-separated file, it allows you to load a wordlist directly from any “normal” CSV file, with your own delimiters and quote characters. If the first cell in the first row of your CSV file is not named “ID”, the integer identifiers required by LingPy will be created automatically.
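The loading step can be sketched with the standard library alone (an illustrative reader, not LingPy's implementation; the target structure is the header-at-key-0 dictionary that Wordlist accepts):

```python
import csv
import io

def read_csv(fileobj, delimiter=","):
    """Read a CSV into {0: header, id: row, ...}, adding IDs if absent."""
    reader = csv.reader(fileobj, delimiter=delimiter)
    header = next(reader)
    if header[0] != "ID":
        # No ID column: create consecutive integer identifiers.
        return {0: ["ID"] + header,
                **{i: [i] + row for i, row in enumerate(reader, 1)}}
    return {0: header, **{int(row[0]): row for row in reader}}

text = "doculect,concept,ipa\nGerman,hand,hant\nEnglish,hand,hænd\n"
data = read_csv(io.StringIO(text))
print(data[1])  # [1, 'German', 'hand', 'hant']
```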

Module contents

This module provides basic classes for the handling of linguistic data.

The basic idea is to provide classes that allow the user to handle basic linguistic datatypes (spreadsheets, wordlists) in a consistent way.