semasioFlow.focmodels module

semasioFlow.focmodels.createBow(query, settings, type_name=None, fnames=None, foc_win=None, foc_pos={'all': []}, bound={'match': '<artikel>', 'values': [False]}, tokenlist=None, dummy_sentbound='<artikel>', suffix='.tcmx.bool.pac', output_dir=None)

Create multiple bag-of-words token-level models in a loop.

Parameters
  • query (Vocab) – Types to collect tokens from

  • settings (dict) –

  • type_name (str, optional) – Name of the type, prefix for file names

  • fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.

  • foc_win (list of tuples, optional) – List of window size settings; each tuple is one setting indicating left and right spans correspondingly. The default value is the one in the settings.

  • foc_pos (dict, optional) – The keys are the labels of the part-of-speech settings and the values are lists of context words to filter the matrix. The default value is “all”, with no filters.

  • bound (dict, optional) – The “match” key holds the regex for the sentence boundary and the “values” key holds a list of boolean values indicating whether the boundary is respected. The default uses “<artikel>” as the regex and does not take sentence boundaries into account.

  • tokenlist (list, optional) – List of token IDs to filter the matrix. By default, the rows are not filtered.

  • dummy_sentbound (str, default="<artikel>") – String that will not match anything relevant in the corpus and therefore cancels sentence boundaries.

  • suffix (str, default=".tcmx.bool.pac") – Suffix for the filenames of the stored matrices.

  • output_dir (str, optional) – Directory where the matrices will be stored. By default it is a subdirectory type_name within the subdirectory “tokens” within settings['output-path']. If the directory does not exist, it will be created.

Returns

Register of model parameters: it has one row per model and the parameter settings as columns.

Return type

pandas.DataFrame

Note

As a side effect, the function stores all the token-by-feature boolean matrices.
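
Example

The shapes of foc_win, foc_pos and bound can be sketched without the library. The snippet below assumes, as “one row per model” suggests, that one model is built per combination of window, part-of-speech setting and boundary value. The helper enumerate_models, the “lex” label and the context words are hypothetical and only illustrate the argument structure.

```python
from itertools import product

# Argument values in the documented shapes (contents are illustrative only).
foc_win = [(3, 3), (5, 5), (10, 10)]                    # (left, right) spans
foc_pos = {"all": [], "lex": ["run/verb", "dog/noun"]}  # label -> context words
bound = {"match": "<artikel>", "values": [False, True]}


def enumerate_models(foc_win, foc_pos, bound):
    """Return one parameter record per combination.

    Assumption: the settings are combined as a full cross product.
    """
    records = []
    for (left, right), pos_label, respect in product(
        foc_win, foc_pos, bound["values"]
    ):
        records.append(
            {"left": left, "right": right, "pos": pos_label, "bound": respect}
        )
    return records


models = enumerate_models(foc_win, foc_pos, bound)
print(len(models))
```

Under this assumption, three windows, two part-of-speech settings and two boundary values would yield twelve parameter records.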

semasioFlow.focmodels.createPath(query, settings, path_macros, type_name=None, fnames=None, tokenlist=None, foc_filter=None, suffix='.tcmx.bool.pac', output_dir=None)

Create multiple PATH token-level models in a loop.

Parameters
  • query (Vocab) – Types to collect tokens from

  • settings (dict) –

  • path_macros (list of tuples) – Each tuple is a LEMMAPATH group. The first element of each tuple is its label (for the name of the model). The second element is a list of nephosem.core.graph.MacroGraph objects, which can be obtained with semasioFlow.load.loadMacro(). The third element is a boolean indicating whether the weight of each template is used.

  • type_name (str, optional) – Name of the type, prefix for file names

  • fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.

  • tokenlist (list, optional) – List of token IDs to filter the matrix. By default, the rows are not filtered.

  • foc_filter (list, optional) – List of context words to filter the matrix. By default, the columns are not filtered.

  • suffix (str, default=".tcmx.bool.pac") – Suffix for the filenames of the stored matrices.

  • output_dir (str, optional) – Directory where the matrices will be stored. By default it is a subdirectory type_name within the subdirectory “tokens” within settings['output-path']. If the directory does not exist, it will be created.

Returns

Register of model parameters: it has one row per model and the parameter settings as columns.

Return type

pandas.DataFrame

Note

As a side effect, the function stores all the token-by-feature boolean matrices.
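
Example

The tuple structure expected in path_macros can be sketched as follows. The FakeMacro class is a hypothetical stand-in for nephosem.core.graph.MacroGraph objects, the labels are invented, and check_path_macros is not part of the library; it only checks the documented shape (label, list of macros, boolean).

```python
class FakeMacro:
    """Hypothetical placeholder for a MacroGraph object (shape only)."""

    def __init__(self, name):
        self.name = name


short_paths = [FakeMacro("path1"), FakeMacro("path2")]
long_paths = short_paths + [FakeMacro("path3")]

# Each tuple: (label, list of macros, boolean third element).
path_macros = [
    ("PATHshort", short_paths, False),
    ("PATHlong", long_paths, False),
    ("PATHweighted", long_paths, True),
]


def check_path_macros(groups):
    """Validate the documented tuple shape and return the group labels."""
    labels = []
    for group in groups:
        if len(group) != 3:
            raise ValueError("each group needs a label, a macro list and a boolean")
        label, macros, flag = group
        if not isinstance(label, str):
            raise TypeError("the label must be a string")
        if not isinstance(macros, list):
            raise TypeError("the macros must be given as a list")
        if not isinstance(flag, bool):
            raise TypeError("the third element must be a boolean")
        labels.append(label)
    return labels


labels = check_path_macros(path_macros)
print(labels)
```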

semasioFlow.focmodels.createRel(query, settings, rel_macros, type_name=None, fnames=None, tokenlist=None, foc_filter=None, suffix='.tcmx.bool.pac', output_dir=None)

Create multiple LEMMAREL token-level models in a loop.

Parameters
  • query (Vocab) – Types to collect tokens from

  • settings (dict) –

  • rel_macros (list of tuples) – Each tuple is a LEMMAREL group. The first element of each tuple is its label (for the name of the model). The second element is a list of nephosem.core.graph.MacroGraph objects, which can be obtained with semasioFlow.load.loadMacro().

  • type_name (str, optional) – Name of the type, prefix for file names

  • fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.

  • tokenlist (list, optional) – List of token IDs to filter the matrix. By default, the rows are not filtered.

  • foc_filter (list, optional) – List of context words to filter the matrix. By default, the columns are not filtered.

  • suffix (str, default=".tcmx.bool.pac") – Suffix for the filenames of the stored matrices.

  • output_dir (str, optional) – Directory where the matrices will be stored. By default it is a subdirectory type_name within the subdirectory “tokens” within settings['output-path']. If the directory does not exist, it will be created.

Returns

Register of model parameters: it has one row per model and the parameter settings as columns.

Return type

pandas.DataFrame

Note

As a side effect, the function stores all the token-by-feature boolean matrices.
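
Example

The default location described for output_dir can be sketched with the standard library. The helper resolve_output_dir and the type name “horde” are hypothetical; the snippet only mirrors the documented rule (type_name inside “tokens” inside settings['output-path'], created if missing) and is not the library's own code.

```python
import tempfile
from pathlib import Path


def resolve_output_dir(settings, type_name, output_dir=None):
    """Return the storage directory, creating it when it does not exist."""
    if output_dir is None:
        # Documented default: <output-path>/tokens/<type_name>
        output_dir = Path(settings["output-path"]) / "tokens" / type_name
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


with tempfile.TemporaryDirectory() as tmp:
    settings = {"output-path": tmp}
    target = resolve_output_dir(settings, "horde")
    created = target.is_dir()
    relative = target.relative_to(tmp).as_posix()

print(relative)
```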

semasioFlow.focmodels.tokensFromMacro(query, macros, settings, fnames=None, weight=1)

Obtain a dependency-based token-level model.

Parameters
  • query (Vocab) – Types to collect tokens from.

  • macros (list of nephosem.core.graph.MacroGraph) – Can be obtained with semasioFlow.load.loadMacro().

  • settings (dict) – It MUST include an appropriate ‘separator-line-machine’ value.

  • fnames (str or list, optional) – Path to list of filenames or list of filenames to search tokens in. Default is the full corpus.

  • weight (int, default=1) – Constant by which the values are multiplied (for weighting mechanisms).

Returns

res – Token-level boolean matrix.

Return type

TypeTokenMatrix
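
Example

The effect of weight, as described above, can be illustrated on a plain nested list. This is a minimal sketch that assumes the constant simply scales every cell of a boolean matrix. The helper apply_weight is hypothetical and does not reproduce how the library handles TypeTokenMatrix objects.

```python
def apply_weight(bool_rows, weight=1):
    """Multiply each boolean cell (True -> 1, False -> 0) by a constant."""
    return [[weight * int(cell) for cell in row] for row in bool_rows]


# Rows stand for tokens, columns for features (illustrative values).
matrix = [
    [True, False, True],
    [False, False, True],
]

weighted = apply_weight(matrix, weight=3)
print(weighted)
```

With the default weight of 1, the values stay at 0 and 1.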