Weight contexts

library(semcloud)

This vignette shows how to generated tailored contexts for the visualization tools.

Required input

The context-weighting procedure assumes the following datasets produced with semasioFlow (Python):

A dataframe with one row per context word and columns with frequency and weight information (most importantly, a column with the weight you would use for first-order filtering/weighting).
- The context word ID column is called 'cw'.
- The file would be typically called '[lemma].ppmi.tsv' and stored with your nephovis data (i.e. the output_dir in vignette('processClouds'). It is created by semasioFlow.socmodels.targetPPMI().
A dataframe with one row per context word per token, with corpus-based information.
- This is generated by semasioFlow.contextwords.listContextwords() and will at least have:
  - a cw column with the type of the context word
  - a target_lemma column with the type of the target
  - a word column with the word form
  - token_id and cw_id columns
  - distance between context word and target and similar computed values.
- It will be typically called '[lemma].cws.detail.tsv'.
A dataframe with one row per token and columns with annotation values or other variables, as well as columns with semicolon-separated context words.
- The token ID column is called _id and the context word columns start with _cws. followed by the (first part of) the name of the model. Crucially, it should be possible to parse the relevant parameter settings from the name of your model.
- The file would be typically called '[lemma].variables.tsv' and stored with your nephovis data (i.e. the output_dir in vignette('processClouds'). It is the ‘token_register’ element in the list returned by semasioFlow.socmodels.weightTokens() but it could also include manually added variables.

The first and second dataframes are loaded and merged by setupConcordancer(), in order to add weighting information to the token-based register. The pmi_columnname in this function identifies the name of the column with weighting information (by default 'pmi_4') and renames it as myweight. There are two ways to point to the files in this function. If their names correspond to the naming convention and are stored in the same directory nephodir, you can run setupConcordancer(lemma, nephodir). Otherwise, you can use setupConcordancer(cws_detail_path = 'path/to/cws.detail.tsv', ppmi_path = 'path/to/ppmi.tsv').

cws <- setupConcordancer(cws_detail_path = file.path(cws_detail_dir, paste0(lemma, '.cws.detail.tsv')),
                         ppmi_path = file.path(nephodir, lemma, paste0(lemma, '.ppmi.tsv')))

The third dataframe is loaded separately.

variables <- readr::read_tsv(file.path(nephodir, lemma, paste0(lemma, ('.variables.tsv'))),
                            show_col_types = F, quote = "")

Build weighted concordances

The main function, weightConcordance(), takes both dataframes and the name of the lemma¹ as input and returns an enriched version of the variables dataframe with a raw context in the _ctxt.raw column and tailored contexts in columns starting with _ctxt., followed by (the first part of) the name of the model.

ctxts <- weightConcordance(variables, cws, lemma)
write_tsv(ctxts, file.path(nephodir, lemma, paste0(lemma, ('.variables.tsv'))), escape = 'none')

A tailored context wraps the target token and the words captured by the model with HTML tags to be read by the visualization tools:

The target is wrapped in <span> tags with class = "target", which NephoVis reads as bold and coloured based on the colour of the dot the token is connected to.²
Context words captured by a model for a token are wrapped in <strong> tags and are thus rendered in bold. It is also possible to add a weight as superscript.

Adapt the settings

The default settings of these functions assume a certain pattern in the names of the columns and how to extract parameter settings from them, but they can be adapted with various optional arguments. They allow the user to customize (1) how to read the model name, (2) how to translate names of parameters into selections of context words and (3) other corpus-related issues.

Read the model name

The default model name created by semasioFlow is made of two strings joined by a period (e.g. 'bound5-5lex.PPMIweight'): the first encodes the first-order selection parameters (e.g. 'bound5-5lex') and the second, the weighting information (e.g. 'PPMIweight'). If this information is not encoded in this way in your model names, you can adjust it with the foc_param_fun and weight_param_fun arguments (see ?weightLemma).

foc_param_fun: A function that takes the name of the model and returns a string with the first-order filters.; The default is function(m) stringr::str_split(m, '\\.')[[1]][[1]]: the first element in a .-separated sequence.
weight_param_fun: A function that takes the name of the model and returns a string with the weighting filters.; The default is function(m) stringr::str_split(m, '\\.')[[1]][[2]]: the second element in a .-separated sequence.
sup_weight_fun: A function that takes a string with weighting information as read by the function in weight_param_fun (e.g. 'PPMIweight') and returns a boolean value indicating whether PPMI values should be included as superindices in the tailored concordance. See ?weightLemma.; The default is function(weightparam) stringr::str_ends(weightparam, 'weight'): context words are weighted if the weighting parameter ends in 'weight'.; If you wanted PPMI values to always be included next to captured context words, you could run weightConcordance(variables, cws, lemma, sup_weight_fun = function(x) return(TRUE)).

Interpret parameter settings

The default values for interpreting the parameter settings are tailored to the corpus and setting used in Montes (2021). You would want to instruct the function how to read your parameter settings instead. See ?filterWeight and ?filterFoc. In these descriptions, foc_param refers to the first-order information as read by the function in foc_param_fun (e.g. 'bound5-5lex' and weight_param to the weighting information as read by the function in weight_param_fun (e.g. 'PPMIweight').

These instructions are used to identify which context words are captured by the model for each token; in addition, the semicolon-separated list of context words in the original file is also used.

is_dep_fun and max_steps_fun: Functions that take foc_param and returns TRUE if dependency information should be collected (for the former) and the number of steps in the dependency path to accept as viable context words (for the latter).; The default value of is_dep_fun is function(foc_param) stringr::str_starts(foc_param, 'LEMMA'): dependency information is only used if the string starts with 'LEMMA'.; The default value of max_steps_fun is function(foc_param) if (foc_param == 'LEMMAPATH2') 2 else 3, which means that it restricts the steps for LEMMAPATH2 models but only to 3 otherwise. This is only relevant if the result of is_dep_fun is TRUE.
window_filter_fun: Function that takes foc_param and returns a vector or list with two elements, viz. the left and right window spans. Not used if the result of is_dep_fun is FALSE.; The default value is windowFilter() shown below: it extracts digits separated by a hyphen and reads them as the left and right span respectively. If you only used one digit, e.g. '5' to refer to both windows, you could use something like function(x) stringr::str_replace(x, '[^\\d]+([\\d]+)[^\\d]+', '\\1-\\1').

windowFilter <- function(foc_param) {
  windows <- foc_param %>%
    stringr::str_extract('\\d+-\\d+') %>%
    stringr::str_split('-')
  readr::parse_integer(windows[[1]])
}

pos_filter_fun: Function that takes foc_param and returns a vector; if it is empty, no filter is implemented, while if it has values, only the rows for which the column pos matches an element in the list will be included. Not used if the result of is_dep_fun is FALSE.; The default value is posFilter() shown below: if foc_param ends in 'lex', only context words with 'noun', 'adj', 'adv' or 'verb' as pos will be considered captured; otherwise, all of them will.; If the 'lex' ending means something else in your data, you can replace the function with a customized version that used your own list of parts of speech as filter. If the name of the parameter setting is different, or even if you have multiple possibilities, you can adapt the function to those requirements as well.

posFilter <- function(foc_param) {
  if (stringr::str_ends(foc_param, 'lex')) {
    c('noun', 'adj', 'adv', 'verb')
  } else {
    c()
  }
}

bound_filter_fun: Function that takes foc_param and returns TRUE if the words outside the sentence are also captured and FALSE if they should be excluded.; The default is function (foc_param) stringr::str_starts(foc_param, 'nobound'): words outside the sentence are accepted if foc_param starts with 'nobound'. If this parameter does not vary in your models and is not even recorded in the model name, you should use a function such as function(x) return TRUE.
weight_filter_fun and threshold: The former is a function that takes weight_param and returns TRUE if weighting should be included and FALSE if it should be ignored. Including weighting implies that the values in the myweight column (as defined by setupConcordance()) are used to filter context words. The threshold arguments sets the value used for filtering context words based on myweight if the output of weight_filter_fun is TRUE.; The default is function(weightparam) stringr::str_ends(weightparam, 'no', negate = TRUE): it only returns FALSE if the weighting parameter ends in 'no'. The default threshold is 0.

Other adjustments

Depending on your corpus, there might be other adjustments you might want to do to how the output of semasioFlow.contextwords.listContextwords() is read. The clean_word_fun function takes the value of the word column and cleans it, e.g. replacing backticks with single quotation marks, or transforming '</sentence>' into '<br>' to start new sentences in new lines. In addition, the to_remove argument is a vector of strings matching word values that should be ignored. The default value is c('<sentence>'), meaning that rows with that value in the word column will be removed from the start.

References

Montes, Mariana. 2021. “Cloudspotting: Visual Analytics for Distributional Semantics.” PhD thesis, Leuven: KU Leuven.

Mariana Montes