semasioFlow.socmodels module

semasioFlow.socmodels.createSoc(token_dir, registers, soc_pos, lengths, socMTX, output_dir=None, input_suffix='.tcmx.weight.pac', output_suffix='.tcmx.soc.pac', store_focdists=False)

Multiply token-by-feature matrix by its second-order matrix.

The resulting matrices are also stored in output_dir.

Parameters

token_dir (str) – Path to the directory where the token-level matrices to load (those ending in input_suffix) are stored.
registers (pandas.DataFrame) – Register of model information, with the names of the models in the index.
soc_pos (dict) – The keys are the names of the “SOC-POS” values; the values are filtered nephosem.Vocab objects.
lengths (list) – Integer elements select that many of the most frequent items in the soc_pos lists; any other kind of element triggers the use of the FOC items as SOC items.
output_dir (str, optional) – Directory where the matrices will be stored. Defaults to token_dir.
input_suffix (str, default=".tcmx.weight.pac") – Suffix of the filenames to load.
output_suffix (str, default=".tcmx.soc.pac") – Suffix of the filenames to save.
store_focdists (bool or str, default=False) – Whether to store the context-word distance matrix. If False, it doesn’t; if True, it stores them in output_dir; if it’s a string, it is taken to be the directory to store them in.
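The defaults described for output_dir and store_focdists can be sketched as follows. This is only an illustration of the documented behaviour, not the library's code: resolve_dirs is a hypothetical helper name, and the order of the checks is an assumption.

```python
def resolve_dirs(token_dir, output_dir=None, store_focdists=False):
    """Return (output_dir, focdists_dir) following the documented defaults.

    Hypothetical helper: it is not part of semasioFlow.
    """
    # output_dir falls back to token_dir when it is not given.
    out = token_dir if output_dir is None else output_dir
    # Test for a string first, because a non-empty string is also truthy.
    if isinstance(store_focdists, str):
        foc = store_focdists  # explicit directory for the distance matrices
    elif store_focdists:
        foc = out             # True: store them next to the SOC matrices
    else:
        foc = None            # False: do not store them
    return out, foc
```

For example, resolve_dirs("tokens", store_focdists=True) yields ("tokens", "tokens").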

Returns

A register with one row per model and all the parameter settings as columns.

Return type

dict of pandas.DataFrame
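The core operation named in the summary, multiplying a token-by-feature matrix by a second-order matrix, can be illustrated with plain Python lists. This is a minimal sketch under assumptions: matmul and the toy matrices are hypothetical, and any selection, normalisation or file handling that createSoc performs is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("columns of a must match rows of b")
    columns = list(zip(*b))  # transpose b to iterate over its columns
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a]

# Toy data: 2 tokens x 3 first-order context words (FOC) ...
token_by_foc = [[1.0, 0.0, 2.0],
                [0.0, 1.5, 0.0]]
# ... and 3 FOC x 2 second-order dimensions (SOC).
foc_by_soc = [[0.5, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]]

# Each token row becomes a weighted sum of the SOC vectors of its FOC items.
token_by_soc = matmul(token_by_foc, foc_by_soc)
print(token_by_soc)  # [[2.5, 2.0], [0.0, 1.5]]
```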

semasioFlow.socmodels.targetPPMI(targets, vocabs, collocs, type_name=None, main_matrix=None, fname=None, output_dir=None)

Register PPMI values of the target lemma(s) with all possible context words.

Computes PPMI values between target lemma(s) and context words, for weighting. It also stores PMI values, raw frequencies and raw co-occurrences.

Parameters

targets (list of str) – Lemma(s) of the target(s).
vocabs (dict) – Vocabularies to extract raw frequency information from; the keys are their names and the values are Vocab objects.
collocs (dict) – Frequency matrices to extract raw co-occurrence frequency and PPMI information from; the keys are their names and the values are TypeTokenMatrix objects.
type_name (str, optional) – Name of the type, prefix for file names.
main_matrix (str, optional) – Key in collocs indicating the matrix used for the weighting of the matrix to return.
fname (str, optional) – Filename to store the frequency data in. By default it combines type_name with “ppmi.tsv”.
output_dir (str, optional) – Directory where the matrices will be stored. It is required if fname is not provided.

Returns

ppmi – Type-level co-occurrence matrix with the target type(s) as row(s) and PPMI values based on the values in collocs[main_matrix].

Return type

TypeTokenMatrix
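For reference, the standard definition of PPMI from raw counts can be sketched as below. The ppmi function and its argument names are hypothetical, and the natural logarithm and the single corpus total are assumptions; the exact formula used by the library may differ.

```python
import math

def ppmi(cooc, freq_target, freq_context, total):
    """Positive pointwise mutual information from raw counts.

    PMI = log( p(t, c) / (p(t) * p(c)) ), with negative values clipped to 0.
    """
    if cooc == 0:
        return 0.0  # log(0) is undefined; no co-occurrence means no association
    pmi = math.log((cooc * total) / (freq_target * freq_context))
    return max(0.0, pmi)

# A pair that co-occurs more often than chance gets a positive score ...
print(ppmi(cooc=10, freq_target=10, freq_context=10, total=100))
# ... while a pair that co-occurs less often than chance is clipped to 0.0.
print(ppmi(cooc=1, freq_target=50, freq_context=50, total=100))
```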

semasioFlow.socmodels.weightTokens(token_dir, weighting, registers, output_dir=None, input_suffix='.tcmx.bool.pac', output_suffix='.tcmx.weight.pac')

Apply (or not) weighting to all current token-level matrices across multiple weighting values.

The resulting matrices are also stored in output_dir.

Parameters

token_dir (str) – Path to the directory where the boolean matrices are stored.
weighting (dict) – Keys are the names of the PPMI parameter values; values are the matrices to use for weighting (nephosem.TypeTokenMatrix) or None.
registers (pandas.DataFrame) – Register of model information, with the names of the models in the index.
output_dir (str, optional) – Directory where the matrices will be stored. Defaults to token_dir.
input_suffix (str, default=".tcmx.bool.pac") – Suffix of the filenames to load.
output_suffix (str, default=".tcmx.weight.pac") – Suffix of the filenames to save.

Returns

data – A “model_register” dataframe with one row per model and the parameter settings as columns and a “token_register” dataframe with one row per token and the number and lists of context words as columns.

Return type

dict of pandas.DataFrame
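One way to picture "apply (or not) weighting" on a single boolean token row is the sketch below. It assumes element-wise weighting, that context words without a weight drop to zero, and that None leaves the row unchanged; weight_row is a hypothetical name and plain dicts stand in for the weighting matrices.

```python
def weight_row(bool_row, context_words, weights=None):
    """Weight one boolean token row by per-context-word values.

    weights maps context words to weights (e.g. PPMI with the target);
    None means "no weighting", so the row is returned as a plain copy.
    """
    if weights is None:
        return list(bool_row)
    # Assumption: context words missing from the weights get weight 0.
    return [value * weights.get(word, 0.0)
            for value, word in zip(bool_row, context_words)]

row = [1, 0, 1]
words = ["a", "b", "c"]
print(weight_row(row, words))                        # [1, 0, 1]
print(weight_row(row, words, {"a": 2.0, "c": 0.5}))  # [2.0, 0.0, 0.5]
```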