semasioFlow.socmodels module

semasioFlow.socmodels.createSoc(token_dir, registers, soc_pos, lengths, socMTX, output_dir=None, input_suffix='.tcmx.weight.pac', output_suffix='.tcmx.soc.pac', store_focdists=False)

Multiply token-by-feature matrix by its second-order matrix.

The resulting matrices are also saved to disk.

Parameters
  • token_dir (str) – Path to the directory where the input token-level matrices (the files ending in input_suffix) are stored.

  • registers (pandas.DataFrame) – Register of model information, with the names of the models in the index.

  • soc_pos (dict) – The keys are the names of the “SOC-POS” values, the values are filtered ~nephosem.Vocab objects.

  • lengths (list) – Integer elements select that many of the most frequent items in the soc_pos lists; any other kind of element triggers the use of the FOC items as SOC items.

  • output_dir (str, optional) – Directory where the matrices will be stored. Defaults to token_dir.

  • input_suffix (str, default=".tcmx.weight.pac") – Suffix of the filenames to load.

  • output_suffix (str, default=".tcmx.soc.pac") – Suffix of the filenames to save.

  • store_focdists (bool or str, default=False) – Whether to store the context-word distance matrix. If False, it doesn’t; if True, it stores them in output_dir; if it’s a string, it is taken to be the directory to store them in.
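The handling of lengths and store_focdists described above can be sketched in plain Python. This is only an illustration of the documented conventions, not the code used by semasioFlow: the helper names select_soc_items and resolve_focdists_dir, the plain frequency dictionary standing in for a Vocab, and the "FOC" label are all assumptions made for the example.

```python
import os


def select_soc_items(freq_by_item, length, foc_items):
    """Pick SOC items for one element of `lengths` (hypothetical helper)."""
    if isinstance(length, int) and not isinstance(length, bool):
        # Integer element: keep the `length` most frequent items.
        ranked = sorted(freq_by_item, key=freq_by_item.get, reverse=True)
        return ranked[:length]
    # Any other element (here the made-up label "FOC"): reuse the FOC items.
    return list(foc_items)


def resolve_focdists_dir(store_focdists, output_dir):
    """Map the documented bool-or-str setting to a directory or None."""
    if not store_focdists:
        return None                      # False: do not store the distances
    if store_focdists is True:
        return output_dir                # True: store them in output_dir
    return os.fspath(store_focdists)     # str: use it as the directory


freqs = {"animal": 5, "house": 10, "tree": 1}
print(select_soc_items(freqs, 2, ["bark", "leash"]))
print(select_soc_items(freqs, "FOC", ["bark", "leash"]))
print(resolve_focdists_dir("dists", "out"))
```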

Returns

A register with one row per model and all the parameter settings as columns.

Return type

dict of pandas.DataFrame
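As a rough picture of the multiplication that createSoc performs, the sketch below multiplies a small token-by-context-word matrix by a context-word-by-SOC-feature matrix, using nested dictionaries in place of the nephosem matrix classes. The function name multiply_by_soc and the toy data are assumptions for illustration; the real function operates on the stored matrix files.

```python
def multiply_by_soc(token_by_foc, foc_by_soc):
    """Sparse product: (tokens x context words) times (context words x SOC features)."""
    result = {}
    for token, foc_weights in token_by_foc.items():
        row = {}
        for foc, weight in foc_weights.items():
            # Add this context word's second-order vector, scaled by its weight.
            for soc, value in foc_by_soc.get(foc, {}).items():
                row[soc] = row.get(soc, 0.0) + weight * value
        result[token] = row
    return result


# Toy weighted token-level matrix: one row per token.
token_by_foc = {
    "tok1": {"dog": 2.0, "bark": 1.0},
    "tok2": {"cat": 3.0},
}
# Toy second-order matrix: one row per context word.
foc_by_soc = {
    "dog": {"animal": 1.0, "loud": 0.5},
    "bark": {"loud": 2.0},
    "cat": {"animal": 1.0},
}

soc_tokens = multiply_by_soc(token_by_foc, foc_by_soc)
print(soc_tokens["tok1"])
```

Each token ends up represented by the weighted sum of the second-order vectors of its context words.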

semasioFlow.socmodels.targetPPMI(targets, vocabs, collocs, type_name=None, main_matrix=None, fname=None, output_dir=None)

Register the PPMI values of the target lemma(s) with all possible context words.

Computes PPMI values between target lemma(s) and context words, for weighting. It also stores PMI values, raw frequencies and raw co-occurrences.

Parameters
  • targets (list of str) – Lemma(s) for the target(s).

  • vocabs (dict) – Vocabularies to extract raw frequency information from; the keys are their names and the values are Vocab objects.

  • collocs (dict) – Frequency matrices to extract raw co-occurrence frequency and PPMI information from; the keys are their names and the values are TypeTokenMatrix objects.

  • type_name (str, optional) – Name of the type, prefix for file names.

  • main_matrix (str, optional) – Key in collocs indicating the matrix used for the weighting of the matrix to return.

  • fname (str, optional) – Filename to store the frequency data in. By default it combines type_name with “ppmi.tsv”.

  • output_dir (str, optional) – Directory where the matrices will be stored. It is necessary if fname is not provided.

Returns

ppmi – Type-level co-occurrence matrix with the target type(s) as row(s) and PPMI values based on the values in collocs[main_matrix].

Return type

TypeTokenMatrix
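For readers unfamiliar with the measure, the sketch below computes PPMI from raw frequencies and co-occurrence counts using the textbook definition, PPMI = max(0, log(f(t,c) · N / (f(t) · f(c)))). It is an illustration under stated assumptions: the helper names, the natural logarithm, the toy counts and the use of a single corpus size N are choices made for the example and may not match how nephosem computes the values.

```python
import math


def ppmi(cooc, freq_target, freq_context, total):
    """Positive PMI from raw counts (textbook formula, natural log)."""
    if cooc <= 0:
        return 0.0  # no co-occurrence: PMI is undefined, PPMI is set to 0
    pmi = math.log(cooc * total / (freq_target * freq_context))
    return max(0.0, pmi)


def target_ppmi_row(target, freqs, cooc_row, total):
    """One row of a type-level matrix: PPMI of `target` with each context word."""
    return {
        cw: ppmi(n, freqs[target], freqs[cw], total)
        for cw, n in cooc_row.items()
    }


# Toy raw frequencies and co-occurrence counts (made up for the example).
freqs = {"dog": 100, "bark": 50, "the": 5000}
cooc_row = {"bark": 10, "the": 40}

row = target_ppmi_row("dog", freqs, cooc_row, total=10000)
print(row)
```

In this toy data "bark" co-occurs with "dog" more often than chance predicts and gets a positive value, while "the" co-occurs less often than chance predicts and is clipped to 0.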

semasioFlow.socmodels.weightTokens(token_dir, weighting, registers, output_dir=None, input_suffix='.tcmx.bool.pac', output_suffix='.tcmx.weight.pac')

Apply (or not) weighting to all current token-level matrices across multiple weighting values.

The resulting matrices are also saved to disk.

Parameters
  • token_dir (str) – Path to the directory where the boolean matrices are stored.

  • weighting (dict) – Keys are the names of the PPMI parameter values; values are the matrices to use for weighting, (~nephosem.TypeTokenMatrix) or None.

  • registers (pandas.DataFrame) – Register of model information, with the names of the models in the index.

  • output_dir (str, optional) – Directory where the matrices will be stored. Defaults to token_dir.

  • input_suffix (str, default=".tcmx.bool.pac") – Suffix of the filenames to load.

  • output_suffix (str, default=".tcmx.weight.pac") – Suffix of the filenames to save.

Returns

data – A dictionary with two dataframes: “model_register”, with one row per model and the parameter settings as columns, and “token_register”, with one row per token and the number and lists of context words as columns.

Return type

dict of pandas.DataFrame
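The "apply (or not)" behaviour can be pictured with a small sketch: each entry of weighting either supplies values that replace the 1s of a boolean token-level matrix, or is None, in which case the matrix is left as it is. The function names, the nested-dictionary representation and the setting labels "no" and "weight" are assumptions for illustration, not the semasioFlow implementation.

```python
def weight_token_matrix(bool_rows, ppmi_by_cw=None):
    """Scale a boolean token-by-context-word matrix by per-word weights."""
    if ppmi_by_cw is None:
        # No weighting requested: return an unchanged copy.
        return {tok: dict(row) for tok, row in bool_rows.items()}
    return {
        tok: {cw: present * ppmi_by_cw.get(cw, 0.0) for cw, present in row.items()}
        for tok, row in bool_rows.items()
    }


def weight_all(bool_rows, weighting):
    """Produce one weighted matrix per named weighting setting."""
    return {
        name: weight_token_matrix(bool_rows, weights)
        for name, weights in weighting.items()
    }


bool_rows = {
    "tok1": {"bark": 1, "the": 1},
    "tok2": {"the": 1, "leash": 1},
}
# Hypothetical setting names; None means "do not weight".
weighting = {"no": None, "weight": {"bark": 3.0, "leash": 1.5}}

weighted = weight_all(bool_rows, weighting)
print(weighted["weight"]["tok1"])
```

Context words without a value in the weighting row receive 0 here, which is one possible convention for dropping them from the token's representation.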