semasioFlow.socmodels module
- semasioFlow.socmodels.createSoc(token_dir, registers, soc_pos, lengths, socMTX, output_dir=None, input_suffix='.tcmx.weight.pac', output_suffix='.tcmx.soc.pac', store_focdists=False)
Multiply token-by-feature matrix by its second-order matrix.
The resulting matrices are also stored.
- Parameters
token_dir (str) – Path to the directory where the boolean matrices are stored.
registers (pandas.DataFrame) – Register of model information, with names of the models in the index.
soc_pos (dict) – The keys are the names of the “SOC-POS” values; the values are filtered nephosem.Vocab objects.
lengths (list) – Integer elements select that many of the most frequent items in the soc_pos vocabularies, while elements of any other kind trigger using the FOC items as SOC items.
output_dir (str, optional) – Directory where the matrices will be stored. Defaults to token_dir.
input_suffix (str, default=".tcmx.weight.pac") – Suffix of the filenames to load.
output_suffix (str, default=".tcmx.soc.pac") – Suffix of the filenames to save.
store_focdists (bool or str, default=False) – Whether to store the context-word distance matrices. If False, they are not stored; if True, they are stored in output_dir; if a string, it is taken as the directory to store them in.
- Returns
A register with one row per model and all the parameter settings as columns.
- Return type
dict of pandas.DataFrame
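The core operation described above, multiplying a token-by-feature matrix by a second-order matrix, can be illustrated with a minimal sketch. This is not the semasioFlow implementation: plain lists stand in for the nephosem matrix objects, and the names `matmul`, `token_by_foc` and `foc_by_soc` are hypothetical, as are all the values.

```python
# Illustrative only: plain-Python stand-ins, not semasioFlow or nephosem objects.

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "column/row mismatch"
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for row in a
    ]

# Hypothetical token-by-feature matrix:
# 2 tokens (rows) x 3 first-order context words, FOC (columns).
token_by_foc = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
]

# Hypothetical second-order matrix:
# 3 FOC items (rows) x 2 second-order dimensions, SOC (columns).
foc_by_soc = [
    [0.5, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Each token is now described by the SOC dimensions of its context words.
token_by_soc = matmul(token_by_foc, foc_by_soc)
print(token_by_soc)
```

The result has one row per token and one column per SOC item, which is the shape one would expect for the matrices saved with output_suffix.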
- semasioFlow.socmodels.targetPPMI(targets, vocabs, collocs, type_name=None, main_matrix=None, fname=None, output_dir=None)
Register PPMI values of the target lemma(s) with all possible context words.
Computes PPMI values between target lemma(s) and context words, for weighting. It also stores PMI values, raw frequencies and raw co-occurrences.
- Parameters
targets (list of str) – Lemma(s) of the target(s).
vocabs (dict) – Vocabularies to extract raw frequency information from; the keys are their names and the values are Vocab objects.
collocs (dict) – Frequency matrices to extract raw co-occurrence frequency and PPMI information from; the keys are their names and the values are TypeTokenMatrix objects.
type_name (str, optional) – Name of the type, prefix for file names.
main_matrix (str, optional) – Key in collocs indicating the matrix used for the weighting of the matrix to return.
fname (str, optional) – Filename to store the frequency data in. By default it combines type_name with “ppmi.tsv”.
output_dir (str, optional) – Directory where the matrices will be stored. Required if fname is not provided.
- Returns
ppmi – Type-level co-occurrence matrix with target type(s) as row(s) and PPMI values based on the values in collocs[main_matrix].
- Return type
TypeTokenMatrix
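For readers unfamiliar with the measure, the sketch below shows how a PPMI value can be derived from the raw frequencies and raw co-occurrences mentioned above. It is an assumption-laden illustration, not the code used by targetPPMI or nephosem: the function `ppmi`, its arguments, the natural logarithm and the normalisation by a corpus total are all choices made here for the example.

```python
import math

def ppmi(cooc, freq_target, freq_context, total):
    """Positive PMI from raw counts (hypothetical helper).

    PMI = log( p(target, context) / (p(target) * p(context)) ),
    with negative values clipped to 0. A pair that never co-occurs
    gets 0.0 instead of an undefined logarithm.
    """
    if cooc == 0:
        return 0.0
    pmi = math.log((cooc * total) / (freq_target * freq_context))
    return max(0.0, pmi)

# Made-up counts: the pair co-occurs 8 times, the target occurs 16 times,
# the context word 4 times, in a corpus of 64 co-occurrence events.
attracted = ppmi(cooc=8, freq_target=16, freq_context=4, total=64)

# A pair that co-occurs less often than chance has negative PMI, clipped to 0.
repelled = ppmi(cooc=1, freq_target=16, freq_context=8, total=64)

print(attracted, repelled)
```

Clipping at zero is what distinguishes PPMI from PMI, which is why the function documented above can store both.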
- semasioFlow.socmodels.weightTokens(token_dir, weighting, registers, output_dir=None, input_suffix='.tcmx.bool.pac', output_suffix='.tcmx.weight.pac')
Apply (or not) weighting to all current token-level matrices across multiple weighting values.
The resulting matrices are also stored.
- Parameters
token_dir (str) – Path to the directory where the boolean matrices are stored.
weighting (dict) – Keys are the names of the PPMI parameter values; values are the matrices to use for weighting (nephosem.TypeTokenMatrix) or None.
registers (pandas.DataFrame) – Register of model information, with names of the models in the index.
output_dir (str, optional) – Directory where the matrices will be stored. Defaults to token_dir.
input_suffix (str, default=".tcmx.bool.pac") – Suffix of the filenames to load.
output_suffix (str, default=".tcmx.weight.pac") – Suffix of the filenames to save.
- Returns
data – A dictionary with two dataframes: a “model_register” with one row per model and the parameter settings as columns, and a “token_register” with one row per token and the number and lists of context words as columns.
- Return type
dict of pandas.DataFrame
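The "apply (or not)" behaviour of the weighting dictionary can be pictured with a small sketch. It only illustrates the idea of looping over named weighting settings where None means "leave the boolean values as they are"; it is not how weightTokens is implemented. Dictionaries stand in for matrix rows, and the helper `weight_token_row`, the setting names "weight" and "no", and all values are hypothetical.

```python
# Illustrative only: dictionaries stand in for matrix rows; all names are hypothetical.

def weight_token_row(bool_row, weights):
    """Weight one boolean token vector by per-context-word values.

    If weights is None, the boolean values are kept (the "no weighting" setting).
    Context words missing from weights are given a weight of 0.0.
    """
    if weights is None:
        return {cw: float(present) for cw, present in bool_row.items()}
    return {cw: present * weights.get(cw, 0.0) for cw, present in bool_row.items()}

# One token: 1 means the context word occurs in the token's window.
bool_row = {"drink": 1, "hot": 1, "the": 0}

# Mirrors the shape of the weighting parameter: setting name -> weights or None.
weighting = {
    "weight": {"drink": 2.5, "hot": 0.0, "the": 1.2},
    "no": None,
}

# One weighted version of the token per weighting setting.
weighted = {name: weight_token_row(bool_row, w) for name, w in weighting.items()}
print(weighted)
```

Each key of the weighting dictionary yields its own set of output matrices, so the number of models grows with the number of weighting settings.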