
Weight contexts
Mariana Montes
2022-04-04
Source:vignettes/weightConcordance.Rmd
This vignette shows how to generate tailored contexts for the visualization tools.
Required input
The context-weighting procedure assumes the following datasets produced with semasioFlow (Python):
- A dataframe with one row per context word and columns with frequency and weight information (most importantly, a column with the weight you would use for first-order filtering/weighting). The context word ID column is called `cw`. The file would typically be called `'[lemma].ppmi.tsv'` and stored with your nephovis data (i.e. the `output_dir` in `vignette('processClouds')`). It is created by `semasioFlow.socmodels.targetPPMI()`.
- A dataframe with one row per context word per token, with corpus-based information. This is generated by `semasioFlow.contextwords.listContextwords()` and will at least have:
  - a `cw` column with the type of the context word
  - a `target_lemma` column with the type of the target
  - a `word` column with the word form
  - `token_id` and `cw_id` columns
  - distance between context word and target and similar computed values

  It will typically be called `'[lemma].cws.detail.tsv'`.
- A dataframe with one row per token and columns with annotation values or other variables, as well as columns with semicolon-separated context words. The token ID column is called `_id` and the context word columns start with `_cws.` followed by the (first part of) the name of the model. Crucially, it should be possible to parse the relevant parameter settings from the name of your model. The file would typically be called `'[lemma].variables.tsv'` and stored with your nephovis data (i.e. the `output_dir` in `vignette('processClouds')`). It is the 'token_register' element in the list returned by `semasioFlow.socmodels.weightTokens()`, but it could also include manually added variables.
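As a toy illustration of these three files (the values below are invented; only the column names follow the conventions described above), the loaded dataframes could look like this:

```r
# Toy 'ppmi' table: one row per context word, with a weight column ('pmi_4').
ppmi <- data.frame(
  cw = c('little/adj', 'boy/noun'),
  pmi_4 = c(1.2, 0.8)
)

# Toy 'cws.detail' table: one row per context word per token.
cws_detail <- data.frame(
  cw = c('little/adj', 'boy/noun'),
  target_lemma = 'girl',
  word = c('little', 'boys'),
  token_id = 'girl/1',
  cw_id = c('little/1', 'boy/2'),
  distance = c(-1, 1)
)

# Toy 'variables' table: one row per token; '_cws.' columns list the context
# words captured by each model, separated by semicolons.
variables <- data.frame(
  '_id' = 'girl/1',
  '_cws.bound5-5lex' = 'little/1;boy/2',
  check.names = FALSE  # keep the underscore-prefixed column names as-is
)
```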
The first and second dataframes are loaded and merged by `setupConcordancer()`, in order to add weighting information to the token-based register. The `pmi_column` argument of this function identifies the name of the column with weighting information (by default `'pmi_4'`) and renames it as `myweight`. There are two ways to point to the files in this function. If their names follow the naming convention and they are stored in the same directory `nephodir`, you can run `setupConcordancer(lemma, nephodir)`. Otherwise, you can use `setupConcordancer(cws_detail_path = 'path/to/cws.detail.tsv', ppmi_path = 'path/to/ppmi.tsv')`.
cws <- setupConcordancer(cws_detail_path = file.path(cws_detail_dir, paste0(lemma, '.cws.detail.tsv')),
                         ppmi_path = file.path(nephodir, lemma, paste0(lemma, '.ppmi.tsv')))

The third dataframe is loaded separately.
Build weighted concordances
The main function, `weightConcordance()`, takes both dataframes and the name of the lemma as input and returns an enriched version of the variables dataframe with a raw context in the `_ctxt.raw` column and tailored contexts in columns starting with `_ctxt.`, followed by (the first part of) the name of the model.
ctxts <- weightConcordance(variables, cws, lemma)
write_tsv(ctxts, file.path(nephodir, lemma, paste0(lemma, '.variables.tsv')), escape = 'none')

A tailored context wraps the target token and the words captured by the model with HTML tags to be read by the visualization tools:
- The target is wrapped in `<span>` tags with `class = "target"`, which NephoVis renders in bold, coloured based on the colour of the dot the token is connected to.
- Context words captured by a model for a token are wrapped in `<strong>` tags and are thus rendered in bold. It is also possible to add the weight as a superscript.
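As an illustration, a tailored context might look like the string below. The sentence and the weight value are invented, but the markup follows the tag conventions just described:

```r
# Invented example sentence for the target 'girl' in a model capturing
# 'little' (with a PPMI weight as superscript) and 'boys'.
raw  <- 'the little girl and the boys'
ctxt <- paste0('the <strong>little<sup>1.2</sup></strong> ',
               '<span class="target">girl</span> and the <strong>boys</strong>')

# <span class="target"> marks the token itself; <strong> marks the context
# words captured by the model; the optional <sup> adds the weight.
cat(ctxt)
```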
Adapt the settings
The default settings of these functions assume a certain pattern in the names of the columns and how to extract parameter settings from them, but they can be adapted with various optional arguments. They allow the user to customize (1) how to read the model name, (2) how to translate names of parameters into selections of context words and (3) other corpus-related issues.
Read the model name
The default model name created by semasioFlow is made of two strings joined by a period (e.g. 'bound5-5lex.PPMIweight'): the first encodes the first-order selection parameters (e.g. 'bound5-5lex') and the second, the weighting information (e.g. 'PPMIweight'). If this information is not encoded in this way in your model names, you can adjust it with the foc_param_fun and weight_param_fun arguments (see ?weightLemma).
- `foc_param_fun`: A function that takes the name of the model and returns a string with the first-order filters. The default is `function(m) stringr::str_split(m, '\\.')[[1]][[1]]`: the first element in a `.`-separated sequence.
- `weight_param_fun`: A function that takes the name of the model and returns a string with the weighting filters. The default is `function(m) stringr::str_split(m, '\\.')[[1]][[2]]`: the second element in a `.`-separated sequence.
- `sup_weight_fun`: A function that takes a string with weighting information, as read by the function in `weight_param_fun` (e.g. `'PPMIweight'`), and returns a boolean value indicating whether PPMI values should be included as superindices in the tailored concordance. See `?weightLemma`. The default is `function(weightparam) stringr::str_ends(weightparam, 'weight')`: context words are weighted if the weighting parameter ends in `'weight'`. If you wanted PPMI values to always be included next to captured context words, you could run `weightConcordance(variables, cws, lemma, sup_weight_fun = function(x) return(TRUE))`.
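For instance, applying these default functions to the example model name `'bound5-5lex.PPMIweight'`:

```r
library(stringr)

model <- 'bound5-5lex.PPMIweight'

# The default functions described above, written out for illustration.
foc_param_fun    <- function(m) str_split(m, '\\.')[[1]][[1]]
weight_param_fun <- function(m) str_split(m, '\\.')[[1]][[2]]
sup_weight_fun   <- function(weightparam) str_ends(weightparam, 'weight')

foc_param_fun(model)                     # 'bound5-5lex': first-order settings
weight_param_fun(model)                  # 'PPMIweight': weighting settings
sup_weight_fun(weight_param_fun(model))  # TRUE: weights shown as superscript
```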
Interpret parameter settings
The default values for interpreting the parameter settings are tailored to the corpus and settings used in Montes (2021). You will probably want to instruct the function how to read your own parameter settings instead; see `?filterWeight` and `?filterFoc`. In these descriptions, `foc_param` refers to the first-order information as read by the function in `foc_param_fun` (e.g. `'bound5-5lex'`) and `weight_param` to the weighting information as read by the function in `weight_param_fun` (e.g. `'PPMIweight'`).
These instructions are used to identify which context words are captured by the model for each token; in addition, the semicolon-separated list of context words in the original file is also used.
- `is_dep_fun` and `max_steps_fun`: Functions that take `foc_param` and return, respectively, `TRUE` if dependency information should be collected (for the former) and the number of steps in the dependency path to accept as viable context words (for the latter). The default value of `is_dep_fun` is `function(foc_param) stringr::str_starts(foc_param, 'LEMMA')`: dependency information is only used if the string starts with `'LEMMA'`. The default value of `max_steps_fun` is `function(foc_param) if (foc_param == 'LEMMAPATH2') 2 else 3`: the number of steps is restricted to 2 for `LEMMAPATH2` models and to 3 otherwise. This is only relevant if the result of `is_dep_fun` is `TRUE`.
- `window_filter_fun`: A function that takes `foc_param` and returns a vector or list with two elements, viz. the left and right window spans. Not used if the result of `is_dep_fun` is `TRUE`. The default value is `windowFilter()`, shown below: it extracts digits separated by a hyphen and reads them as the left and right span respectively. If you only used one digit, e.g. `'5'`, to refer to both windows, you could use something like `function(x) stringr::str_replace(x, '[^\\d]+([\\d]+)[^\\d]+', '\\1-\\1')`.
windowFilter <- function(foc_param) {
windows <- foc_param %>%
stringr::str_extract('\\d+-\\d+') %>%
stringr::str_split('-')
readr::parse_integer(windows[[1]])
}

- `pos_filter_fun`: A function that takes `foc_param` and returns a vector; if the vector is empty, no filter is applied, while if it has values, only the rows whose `pos` column matches one of its elements will be included. Not used if the result of `is_dep_fun` is `TRUE`. The default value is `posFilter()`, shown below: if `foc_param` ends in `'lex'`, only context words with `'noun'`, `'adj'`, `'adv'` or `'verb'` as `pos` will be considered captured; otherwise, all of them will. If the `'lex'` ending means something else in your data, you can replace the function with a customized version that uses your own list of parts of speech as a filter. If the name of the parameter setting is different, or if you have multiple possibilities, you can adapt the function to those requirements as well.
posFilter <- function(foc_param) {
if (stringr::str_ends(foc_param, 'lex')) {
c('noun', 'adj', 'adv', 'verb')
} else {
c()
}
}

- `bound_filter_fun`: A function that takes `foc_param` and returns `TRUE` if words outside the sentence are also captured and `FALSE` if they should be excluded. The default is `function(foc_param) stringr::str_starts(foc_param, 'nobound')`: words outside the sentence are accepted if `foc_param` starts with `'nobound'`. If this parameter does not vary in your models and is not even recorded in the model name, you should use a function such as `function(x) return(TRUE)`.
- `weight_filter_fun` and `threshold`: The former is a function that takes `weight_param` and returns `TRUE` if weighting should be included and `FALSE` if it should be ignored. Including weighting implies that the values in the `myweight` column (as defined by `setupConcordancer()`) are used to filter context words. The `threshold` argument sets the value used for filtering context words based on `myweight` if the output of `weight_filter_fun` is `TRUE`. The default for `weight_filter_fun` is `function(weightparam) stringr::str_ends(weightparam, 'no', negate = TRUE)`: it only returns `FALSE` if the weighting parameter ends in `'no'`. The default `threshold` is 0.
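To make these defaults concrete, the snippet below runs the helper functions from this section on example parameter strings. `windowFilter()` and `posFilter()` are redefined here so the block is self-contained (`readr::parse_integer()` is swapped for base `as.integer()` to keep it dependency-light), and `weightFilter` and `singleWindow` are names invented for this sketch:

```r
library(stringr)

# As shown above: extract hyphen-separated digits as left and right spans.
windowFilter <- function(foc_param) {
  windows <- str_split(str_extract(foc_param, '\\d+-\\d+'), '-')
  as.integer(windows[[1]])
}

# As shown above: restrict to lexical parts of speech for 'lex' models.
posFilter <- function(foc_param) {
  if (str_ends(foc_param, 'lex')) c('noun', 'adj', 'adv', 'verb') else c()
}

# The default weight_filter_fun, given a name for illustration.
weightFilter <- function(weightparam) str_ends(weightparam, 'no', negate = TRUE)

windowFilter('bound5-5lex')  # c(5, 5): left and right window spans
posFilter('bound5-5lex')     # only lexical parts of speech are captured
posFilter('bound5-5all')     # empty: no part-of-speech filter
weightFilter('PPMIweight')   # TRUE: filter context words by 'myweight'
weightFilter('PPMIno')       # FALSE: ignore weighting

# For model names with a single digit, e.g. a hypothetical 'bound5lex',
# one could first duplicate the digit as suggested above:
singleWindow <- function(x) str_replace(x, '[^\\d]+([\\d]+)[^\\d]+', '\\1-\\1')
windowFilter(singleWindow('bound5lex'))  # c(5, 5)
```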
Other adjustments
Depending on your corpus, there might be other adjustments you want to make to how the output of `semasioFlow.contextwords.listContextwords()` is read. The `clean_word_fun` function takes the value of the `word` column and cleans it, e.g. replacing backticks with single quotation marks, or transforming `'</sentence>'` into `'<br>'` to start new sentences on new lines. In addition, the `to_remove` argument is a vector of strings matching `word` values that should be ignored. The default value is `c('<sentence>')`, meaning that rows with that value in the `word` column will be removed from the start.
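For example, a custom cleaning function along the lines described above could be sketched as follows (the specific replacements are only illustrative):

```r
library(stringr)

# Illustrative cleaning function for clean_word_fun: replaces backticks with
# single quotation marks and turns sentence-closing tags into HTML line breaks.
cleanWord <- function(word) {
  word <- str_replace_all(word, '`', "'")
  str_replace_all(word, fixed('</sentence>'), '<br>')
}

cleanWord('``quoted``')   # "''quoted''"
cleanWord('</sentence>')  # '<br>'

# The default to_remove value: rows whose 'word' is '<sentence>' are dropped.
to_remove <- c('<sentence>')
```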