Prepare dataframes for getContext.

Usage

setupConcordancer(
  lemma = "",
  input_dir = "",
  cws_detail_path = file.path(input_dir, paste0(lemma, ".cws.detail.tsv")),
  ppmi_path = file.path(input_dir, paste0(lemma, ".ppmi.tsv")),
  pmi_columnname = "pmi_4",
  distance_corrector_func = function(word) !stringr::str_starts(word, "<"),
  lemma_from_tid_fun = function(tid) paste(stringr::str_split(tid, "/")[[1]][-c(3, 4)],
    collapse = "/")
)
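As a minimal sketch of how the default values of cws_detail_path and ppmi_path in the signature above are put together, the snippet below evaluates the same two expressions outside the function. The lemma "heet" and the directory "data/tokens" are hypothetical values chosen for illustration; only base R is used.

```r
# Hypothetical values, for illustration only.
lemma <- "heet"
input_dir <- file.path("data", "tokens")

# Same expressions as the defaults in the signature above.
cws_detail_path <- file.path(input_dir, paste0(lemma, ".cws.detail.tsv"))
ppmi_path <- file.path(input_dir, paste0(lemma, ".ppmi.tsv"))

cws_detail_path
#> [1] "data/tokens/heet.cws.detail.tsv"
ppmi_path
#> [1] "data/tokens/heet.ppmi.tsv"
```

Since lemma and input_dir both default to "", leaving them unset yields paths such as "/.ppmi.tsv", so in practice either both are supplied or the two paths are passed explicitly.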

Arguments

lemma

Name of the lemma, used to build the default filenames.

input_dir

Directory where the input files are stored, used to build the default paths.

cws_detail_path

Path to the file holding a dataframe with one row per context word per token, in which each context word comes with information from its token. Created by listContextWords in the semasioFlow Python module.

ppmi_path

Path to the file holding a dataframe with one context word per row and frequency information.

pmi_columnname

Name (or prefix) of the column in the dataframe found in ppmi_path where weighting values are stored.

distance_corrector_func

Function used to filter the rows of the dataframe in cws_detail_path based on the values of the word column, so that distances between words can be recalculated. The default returns FALSE for words that start with "<".

lemma_from_tid_fun

Function to extract the target lemma from the tokenID. The default splits the tokenID on "/" and drops the third and fourth fields.
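To make the two function arguments more concrete, here is a small sketch that applies their default definitions, copied from the signature, to toy input. The words and tokenIDs are hypothetical, and the four-field tokenID layout is an assumption made for illustration. The snippet requires the stringr package and does not show how setupConcordancer applies these functions internally or how it recomputes the distances.

```r
# Default definitions, copied from the function signature (need stringr).
distance_corrector_func <- function(word) !stringr::str_starts(word, "<")
lemma_from_tid_fun <- function(tid) {
  paste(stringr::str_split(tid, "/")[[1]][-c(3, 4)], collapse = "/")
}

# Hypothetical values of the `word` column.
words <- c("<s>", "de", "soep", "</s>", "is")

# TRUE for every word that does not start with "<".
keep <- distance_corrector_func(words)
keep
#> [1] FALSE  TRUE  TRUE FALSE  TRUE
words[keep]
#> [1] "de"   "soep" "is"

# Hypothetical tokenIDs with four "/"-separated fields (assumed layout).
tids <- c("heet/adj/file_a/12", "heet/adj/file_b/345")

# The default only looks at the first element of its input ([[1]]),
# so it is applied to one tokenID at a time here.
lemmas <- vapply(tids, lemma_from_tid_fun, character(1), USE.NAMES = FALSE)
lemmas
#> [1] "heet/adj" "heet/adj"
```

A custom function passed to either argument would presumably need the same input and output shape as the default: a logical vector for distance_corrector_func and a single string per tokenID for lemma_from_tid_fun.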

Value

Enriched dataframe with one row per context word per token, weight values, corrected distances and a column indicating the right target lemma (in case you have more than one).