Weight contexts
Mariana Montes
2022-04-04
Source: vignettes/weightConcordance.Rmd
This vignette shows how to generate tailored contexts for the visualization tools.
Required input
The context-weighting procedure assumes the following datasets produced with semasioFlow (Python):

- A dataframe with one row per context word and columns with frequency and weight information (most importantly, a column with the weight you would use for first-order filtering/weighting). The context word ID column is called `cw`. The file would typically be called `'[lemma].ppmi.tsv'` and stored with your nephovis data (i.e. the `output_dir` in `vignette('processClouds')`). It is created by `semasioFlow.socmodels.targetPPMI()`.
- A dataframe with one row per context word per token, with corpus-based information. This is generated by `semasioFlow.contextwords.listContextwords()` and will at least have:
  - a `cw` column with the type of the context word
  - a `target_lemma` column with the type of the target
  - a `word` column with the word form
  - `token_id` and `cw_id` columns
  - distance between context word and target and similar computed values.

  It will typically be called `'[lemma].cws.detail.tsv'`.
- A dataframe with one row per token and columns with annotation values or other variables, as well as columns with semicolon-separated context words. The token ID column is called `_id` and the context word columns start with `_cws.` followed by (the first part of) the name of the model. Crucially, it should be possible to parse the relevant parameter settings from the name of your model. The file would typically be called `'[lemma].variables.tsv'` and stored with your nephovis data (i.e. the `output_dir` in `vignette('processClouds')`). It is the `token_register` element in the list returned by `semasioFlow.socmodels.weightTokens()`, but it could also include manually added variables.
The first and second dataframes are loaded and merged by `setupConcordancer()` in order to add weighting information to the token-based register. The `pmi_column` argument of this function identifies the name of the column with weighting information (by default `'pmi_4'`) and renames it as `myweight`. There are two ways to point to the files in this function. If their names follow the naming convention and they are stored in the same directory `nephodir`, you can run `setupConcordancer(lemma, nephodir)`. Otherwise, you can use `setupConcordancer(cws_detail_path = 'path/to/cws.detail.tsv', ppmi_path = 'path/to/ppmi.tsv')`.
cws <- setupConcordancer(cws_detail_path = file.path(cws_detail_dir, paste0(lemma, '.cws.detail.tsv')),
ppmi_path = file.path(nephodir, lemma, paste0(lemma, '.ppmi.tsv')))
The third dataframe is loaded separately.
Build weighted concordances
The main function, `weightConcordance()`, takes both dataframes and the name of the lemma as input and returns an enriched version of the `variables` dataframe with a raw context in the `_ctxt.raw` column and tailored contexts in columns starting with `_ctxt.`, followed by (the first part of) the name of the model.
ctxts <- weightConcordance(variables, cws, lemma)
write_tsv(ctxts, file.path(nephodir, lemma, paste0(lemma, '.variables.tsv')), escape = 'none')
A tailored context wraps the target token and the words captured by the model with HTML tags to be read by the visualization tools:
- The target is wrapped in `<span>` tags with `class = "target"`, which NephoVis renders as bold and coloured based on the colour of the dot the token is connected to.
- Context words captured by a model for a token are wrapped in `<strong>` tags and are thus rendered in bold. It is also possible to add a weight as superscript.
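To make the markup concrete, here is a minimal sketch with a hypothetical target (*horse*) and made-up context words and weights; the real strings are produced by `weightConcordance()`, not by this helper:

```r
# Hypothetical helper mimicking the markup described above: the target gets a
# <span class="target"> tag, captured context words get <strong> tags, and a
# weight can optionally be added as superscript.
captured <- function(word, weight = NULL) {
  sup <- if (is.null(weight)) '' else paste0('<sup>', weight, '</sup>')
  paste0('<strong>', word, '</strong>', sup)
}
target <- '<span class="target">horse</span>'
ctxt <- paste('the', captured('wild', 2.1), target, captured('galloped'), 'away')
cat(ctxt)
#> the <strong>wild</strong><sup>2.1</sup> <span class="target">horse</span> <strong>galloped</strong> away
```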
Adapt the settings
The default settings of these functions assume a certain pattern in the names of the columns and how to extract parameter settings from them, but they can be adapted with various optional arguments. They allow the user to customize (1) how to read the model name, (2) how to translate names of parameters into selections of context words and (3) other corpus-related issues.
Read the model name
The default model name created by semasioFlow is made of two strings joined by a period (e.g. `'bound5-5lex.PPMIweight'`): the first encodes the first-order selection parameters (e.g. `'bound5-5lex'`) and the second, the weighting information (e.g. `'PPMIweight'`). If this information is not encoded in this way in your model names, you can adjust it with the `foc_param_fun` and `weight_param_fun` arguments (see `?weightLemma`).
- `foc_param_fun`: A function that takes the name of the model and returns a string with the first-order filters. The default is `function(m) stringr::str_split(m, '\\.')[[1]][[1]]`: the first element in a `.`-separated sequence.
- `weight_param_fun`: A function that takes the name of the model and returns a string with the weighting filters. The default is `function(m) stringr::str_split(m, '\\.')[[1]][[2]]`: the second element in a `.`-separated sequence.
- `sup_weight_fun`: A function that takes a string with weighting information as read by the function in `weight_param_fun` (e.g. `'PPMIweight'`) and returns a boolean value indicating whether PPMI values should be included as superindices in the tailored concordance (see `?weightLemma`). The default is `function(weightparam) stringr::str_ends(weightparam, 'weight')`: context words are weighted if the weighting parameter ends in `'weight'`. If you wanted PPMI values to always be included next to captured context words, you could run `weightConcordance(variables, cws, lemma, sup_weight_fun = function(x) TRUE)`.
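For instance, if your model names joined the two parts with an underscore instead of a period (a hypothetical `'bound5-5lex_PPMIweight'`), the two readers could be redefined as follows (using base `strsplit()` here so the snippet is self-contained):

```r
# Hypothetical model names of the form 'focpart_weightpart',
# e.g. 'bound5-5lex_PPMIweight', joined by '_' instead of '.'.
my_foc_param_fun <- function(m) strsplit(m, '_', fixed = TRUE)[[1]][[1]]
my_weight_param_fun <- function(m) strsplit(m, '_', fixed = TRUE)[[1]][[2]]

my_foc_param_fun('bound5-5lex_PPMIweight')    # "bound5-5lex"
my_weight_param_fun('bound5-5lex_PPMIweight') # "PPMIweight"
```

These would then be supplied as `weightConcordance(variables, cws, lemma, foc_param_fun = my_foc_param_fun, weight_param_fun = my_weight_param_fun)`.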
Interpret parameter settings
The default values for interpreting the parameter settings are tailored to the corpus and settings used in Montes (2021). You will want to instruct the function how to read your own parameter settings instead; see `?filterWeight` and `?filterFoc`. In these descriptions, `foc_param` refers to the first-order information as read by the function in `foc_param_fun` (e.g. `'bound5-5lex'`) and `weight_param` to the weighting information as read by the function in `weight_param_fun` (e.g. `'PPMIweight'`).
These instructions are used to identify which context words are captured by the model for each token; in addition, the semicolon-separated list of context words in the original file is also used.
- `is_dep_fun` and `max_steps_fun`: Functions that take `foc_param` and return, respectively, `TRUE` if dependency information should be collected, and the number of steps in the dependency path to accept as viable context words. The default value of `is_dep_fun` is `function(foc_param) stringr::str_starts(foc_param, 'LEMMA')`: dependency information is only used if the string starts with `'LEMMA'`. The default value of `max_steps_fun` is `function(foc_param) if (foc_param == 'LEMMAPATH2') 2 else 3`, which restricts the path to 2 steps for `LEMMAPATH2` models and to 3 steps otherwise. This is only relevant if the result of `is_dep_fun` is `TRUE`.
- `window_filter_fun`: A function that takes `foc_param` and returns a vector or list with two elements, viz. the left and right window spans. Not used if the result of `is_dep_fun` is `TRUE`. The default value is `windowFilter()`, shown below: it extracts digits separated by a hyphen and reads them as the left and right span respectively. If you only used one digit, e.g. `'5'`, to refer to both windows, you could use something like `function(x) stringr::str_replace(x, '[^\\d]+([\\d]+)[^\\d]+', '\\1-\\1')`.
windowFilter <- function(foc_param) {
windows <- foc_param %>%
stringr::str_extract('\\d+-\\d+') %>%
stringr::str_split('-')
readr::parse_integer(windows[[1]])
}
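As a quick check of the expected behaviour, the same logic can be exercised with base R only (so that this snippet runs without stringr, readr or the pipe; the extraction pattern matches `windowFilter()` above):

```r
# Base-R equivalent of windowFilter(): extract the 'digits-digits' span
# from the parameter string and split it into left and right window sizes.
windowFilterBase <- function(foc_param) {
  spans <- regmatches(foc_param, regexpr('[0-9]+-[0-9]+', foc_param))
  as.integer(strsplit(spans, '-', fixed = TRUE)[[1]])
}
windowFilterBase('bound5-5lex')    # 5 5
windowFilterBase('nobound10-2all') # 10 2
```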
- `pos_filter_fun`: A function that takes `foc_param` and returns a vector; if it is empty, no filter is implemented, while if it has values, only the rows for which the column `pos` matches an element of the vector will be included. Not used if the result of `is_dep_fun` is `TRUE`. The default value is `posFilter()`, shown below: if `foc_param` ends in `'lex'`, only context words with `'noun'`, `'adj'`, `'adv'` or `'verb'` as `pos` will be considered captured; otherwise, all of them will. If the `'lex'` ending means something else in your data, you can replace the function with a customized version that uses your own list of parts of speech as filter. If the name of the parameter setting is different, or if you have multiple possibilities, you can adapt the function to those requirements as well.
posFilter <- function(foc_param) {
if (stringr::str_ends(foc_param, 'lex')) {
c('noun', 'adj', 'adv', 'verb')
} else {
c()
}
}
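If, say, your corpus used an uppercase Universal-Dependencies-style tagset and signalled the lexical filter with a hypothetical `'open'` suffix instead of `'lex'`, the customized version could look like this:

```r
# Hypothetical variant of posFilter() for an uppercase tagset and an
# 'open' suffix; an empty vector means "no part-of-speech filter".
posFilterUD <- function(foc_param) {
  if (endsWith(foc_param, 'open')) {
    c('NOUN', 'ADJ', 'ADV', 'VERB')
  } else {
    character(0)
  }
}
posFilterUD('bound5-5open') # "NOUN" "ADJ" "ADV" "VERB"
posFilterUD('bound5-5all')  # character(0)
```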
- `bound_filter_fun`: A function that takes `foc_param` and returns `TRUE` if words outside the sentence are also captured and `FALSE` if they should be excluded. The default is `function(foc_param) stringr::str_starts(foc_param, 'nobound')`: words outside the sentence are accepted if `foc_param` starts with `'nobound'`. If this parameter does not vary in your models and is not even recorded in the model name, you should use a function such as `function(x) TRUE`.
. -
weight_filter_fun
andthreshold
- The former is a function that takes
weight_param
and returnsTRUE
if weighting should be included andFALSE
if it should be ignored. Including weighting implies that the values in themyweight
column (as defined bysetupConcordance()
) are used to filter context words. Thethreshold
arguments sets the value used for filtering context words based onmyweight
if the output ofweight_filter_fun
isTRUE
. - The default is
function(weightparam) stringr::str_ends(weightparam, 'no', negate = TRUE)
: it only returnsFALSE
if the weighting parameter ends in'no'
. The default threshold is 0.
Other adjustments
Depending on your corpus, there might be other adjustments you want to make to how the output of `semasioFlow.contextwords.listContextwords()` is read. The `clean_word_fun` function takes the value of the `word` column and cleans it, e.g. replacing backticks with single quotation marks, or transforming `'</sentence>'` into `'<br>'` to start new sentences on new lines. In addition, the `to_remove` argument is a vector of strings matching `word` values that should be ignored. The default value is `c('<sentence>')`, meaning that rows with that value in the `word` column will be removed from the start.
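A minimal sketch of such a cleaning function, implementing just the two substitutions mentioned above with base R (the name `cleanWord` is illustrative; the function would be passed as `clean_word_fun`):

```r
# Replace backticks with single quotes and turn closing sentence tags into
# HTML line breaks, as suggested in the text above.
cleanWord <- function(word) {
  word <- gsub('`', "'", word, fixed = TRUE)
  gsub('</sentence>', '<br>', word, fixed = TRUE)
}
cleanWord('`morning')    # "'morning"
cleanWord('</sentence>') # "<br>"
```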