semasioFlow.focmodels module

semasioFlow.focmodels.createBow(query, settings, type_name=None, fnames=None, foc_win=None, foc_pos={'all': []}, bound={'match': '<artikel>', 'values': [False]}, tokenlist=None, dummy_sentbound='<artikel>', suffix='.tcmx.bool.pac', output_dir=None)

Create multiple bag-of-words token-level models in a loop.

Parameters

query (Vocab) – Types to collect tokens from
settings (dict) –
type_name (str, optional) – Name of the type, prefix for file names
fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.
foc_win (list of tuples, optional) – List of window size settings; each tuple is one setting giving the left and right spans, respectively. The default value is the one in the settings.
foc_pos (dict, optional) – The keys are the labels of the part-of-speech settings and the values are lists of context words to filter the matrix. The default value is “all”, with no filters.
bound (dict, optional) – The “match” value is the regex for the sentence boundary; “values” is a list of booleans indicating whether the boundary is respected. The default uses “<artikel>” as the regex and ignores sentence boundaries.
tokenlist (list, optional) – List of token IDs to filter the matrix. By default, the rows are not filtered.
dummy_sentbound (str, default="<artikel>") – String that will not match anything relevant in the corpus and therefore cancels sentence boundaries.
suffix (str, default=".tcmx.bool.pac") – Suffix for the filenames of the stored matrices.
output_dir (str, optional) – Directory where the matrices will be stored. By default it is the subdirectory type_name within the subdirectory “tokens” within settings[‘output-path’]. If the directory does not exist, it will be created.
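The shapes of foc_win, foc_pos and bound are easier to see side by side. The sketch below only builds the three arguments and counts the combinations; it does not call createBow. It assumes that one model is built per combination of window, part-of-speech setting and boundary option, which the "one row per model" register suggests but this page does not state. The helper parameter_grid, the row keys and the context words are made up for illustration.

```python
from itertools import product

# Window settings: (left span, right span) tuples, as described for foc_win.
foc_win = [(3, 3), (5, 5), (10, 10)]

# Part-of-speech settings: label -> list of context words used as a filter.
# The words under "nouns" are made-up placeholders.
foc_pos = {"all": [], "nouns": ["dog/noun", "house/noun"]}

# Sentence-boundary settings: a regex plus the boolean options to try.
bound = {"match": "<artikel>", "values": [True, False]}


def parameter_grid(foc_win, foc_pos, bound):
    """Enumerate one hypothetical register row per parameter combination."""
    rows = []
    # Iterating over the foc_pos dict yields its labels (the keys).
    for (left, right), pos_label, respected in product(
        foc_win, foc_pos, bound["values"]
    ):
        rows.append(
            {"left": left, "right": right, "pos": pos_label, "bound": respected}
        )
    return rows


rows = parameter_grid(foc_win, foc_pos, bound)
print(len(rows))  # 3 windows x 2 POS settings x 2 boundary options = 12
```

Under that assumption, adding one more window or one more part-of-speech label multiplies the number of stored matrices, so it is worth checking the count before running on a full corpus.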

Returns

Register of model parameters: it has one row per model and the parameter settings as columns.

Return type

pandas.DataFrame

Note

As a side effect, the function stores all the token-by-feature boolean matrices.

semasioFlow.focmodels.createPath(query, settings, path_macros, type_name=None, fnames=None, tokenlist=None, foc_filter=None, suffix='.tcmx.bool.pac', output_dir=None)

Create multiple PATH token-level models in a loop.

Parameters

query (Vocab) – Types to collect tokens from
settings (dict) –
path_macros (list of tuples) – Each tuple is a LEMMAPATH group. The first element of each tuple is its label (for the name of the model). The second element is a list of nephosem.core.graph.MacroGraph objects, which can be obtained with semasioFlow.load.loadMacro(). The third element is a boolean indicating whether the weight of each template is taken into account.
type_name (str, optional) – Name of the type, prefix for file names
fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.
tokenlist (list, optional) – List of token IDs to filter the matrix. By default, the rows are not filtered.
foc_filter (list, optional) – List of context words to filter the matrix. By default, the columns are not filtered.
suffix (str, default=".tcmx.bool.pac") – Suffix for the filenames of the stored matrices.
output_dir (str, optional) – Directory where the matrices will be stored. By default it is the subdirectory type_name within the subdirectory “tokens” within settings[‘output-path’]. If the directory does not exist, it will be created.
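Because path_macros is a list of three-element tuples, a malformed entry is easy to introduce. The following sketch checks the documented shape (label, list of macros, boolean) before the list is passed on. The helper check_path_macros and the two labels are hypothetical, and plain placeholder objects stand in for MacroGraph instances so that the snippet runs without nephosem installed.

```python
def check_path_macros(path_macros):
    """Return the group labels after checking each tuple's shape."""
    labels = []
    for group in path_macros:
        if len(group) != 3:
            raise ValueError(
                f"expected (label, macros, weighted), got {len(group)} elements"
            )
        label, macros, weighted = group
        if not isinstance(label, str):
            raise TypeError("the label must be a string")
        if not isinstance(macros, list):
            raise TypeError("the macros must be given as a list")
        if not isinstance(weighted, bool):
            raise TypeError("the third element must be a boolean")
        labels.append(label)
    return labels


# Placeholder objects stand in for MacroGraph instances here.
templates = [object(), object()]

# Hypothetical labels: one unweighted and one weighted group.
path_macros = [
    ("LEMMAPATHselection", templates, False),
    ("LEMMAPATHweight", templates, True),
]

print(check_path_macros(path_macros))
```

The same check, minus the boolean, would apply to the two-element tuples that createRel expects in rel_macros.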

Returns

Register of model parameters: it has one row per model and the parameter settings as columns.

Return type

pandas.DataFrame

Note

As a side effect, the function stores all the token-by-feature boolean matrices.

semasioFlow.focmodels.createRel(query, settings, rel_macros, type_name=None, fnames=None, tokenlist=None, foc_filter=None, suffix='.tcmx.bool.pac', output_dir=None)

Create multiple LEMMAREL token-level models in a loop.

Parameters

query (Vocab) – Types to collect tokens from
settings (dict) –
rel_macros (list of tuples) – Each tuple is a LEMMAREL group. The first element of each tuple is its label (for the name of the model). The second element is a list of nephosem.core.graph.MacroGraph objects, which can be obtained with semasioFlow.load.loadMacro().
type_name (str, optional) – Name of the type, prefix for file names
fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.
tokenlist (list, optional) – List of token IDs to filter the matrix. By default, the rows are not filtered.
foc_filter (list, optional) – List of context words to filter the matrix. By default, the columns are not filtered.
suffix (str, default=".tcmx.bool.pac") – Suffix for the filenames of the stored matrices.
output_dir (str, optional) – Directory where the matrices will be stored. By default it is the subdirectory type_name within the subdirectory “tokens” within settings[‘output-path’]. If the directory does not exist, it will be created.
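The default location described for output_dir can be reproduced with the standard library, which helps when locating the stored matrices afterwards. This is a minimal sketch of the documented behaviour, not the library's own code: resolve_output_dir is a hypothetical helper, "heet" is a made-up type name, and a temporary directory stands in for settings['output-path'].

```python
import tempfile
from pathlib import Path


def resolve_output_dir(settings, type_name, output_dir=None):
    """Mimic the documented default: <output-path>/tokens/<type_name>."""
    if output_dir is None:
        output_dir = Path(settings["output-path"]) / "tokens" / type_name
    output_dir = Path(output_dir)
    # Create the directory (and any missing parents) if it does not exist.
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


with tempfile.TemporaryDirectory() as tmp:
    settings = {"output-path": tmp}
    target = resolve_output_dir(settings, "heet")
    print(target.relative_to(tmp).as_posix())  # tokens/heet
    print(target.is_dir())  # True
```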

Returns

Register of model parameters: it has one row per model and the parameter settings as columns.

Return type

pandas.DataFrame

Note

As a side effect, the function stores all the token-by-feature boolean matrices.

semasioFlow.focmodels.tokensFromMacro(query, macros, settings, fnames=None, weight=1)

Obtain dependency-based token-level model.

Parameters

query (Vocab) – Types to collect tokens from.
macros (list of nephosem.core.graph.MacroGraph) – Can be obtained with semasioFlow.load.loadMacro().
settings (dict) – It MUST include an appropriate ‘separator-line-machine’ value.
fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.
weight (int, default=1) – Constant by which the values are multiplied (for weighting mechanisms).
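The effect of a constant weight can be illustrated on a small matrix. The sketch below uses nested lists rather than a TypeTokenMatrix, and apply_weight is a hypothetical helper, so it shows the arithmetic only and makes no claim about how the library applies the constant internally.

```python
def apply_weight(matrix, weight=1):
    """Multiply every cell of a boolean matrix by a constant."""
    return [[int(cell) * weight for cell in row] for row in matrix]


# Rows are tokens, columns are context features (made-up values).
boolean_matrix = [
    [True, False, True],
    [False, False, True],
]

print(apply_weight(boolean_matrix))            # [[1, 0, 1], [0, 0, 1]]
print(apply_weight(boolean_matrix, weight=2))  # [[2, 0, 2], [0, 0, 2]]
```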

Returns

res – Token-level boolean matrix.

Return type

TypeTokenMatrix