From semasioFlow to NephoVis
Mariana Montes
2022-04-04
This package is not autonomous: it is designed to support a specific section of a pipeline, meaning that it takes as input the output of the semasioFlow Python package and prepares the data to be used in NephoVis and the Level3 ShinyApp. This vignette will show the procedure to prepare data for NephoVis, whereas vignette('HDBSCAN') will go through the HDBSCAN clustering for Level3 and the classification of clouds.
First, next to semcloud itself, we will load some packages from the tidyverse, as well as cluster and rjson.
library(semcloud)
library(readr)
library(dplyr)
library(purrr)
library(tibble)
library(cluster) # to compute medoids
library(rjson) # to store json files
# Do not show column types when loading dataframes:
options(readr.show_col_types = FALSE)
Then we want to set up the most important paths, which may or may not share the same base directory. By default, the semasioFlow workflow will store the distance matrices in an output directory, with a tokens subdirectory for token-level matrices and a cws subdirectory for type-level matrices. These are going to be the input for this processing. On the other hand, semasioFlow stores registers (frequency data of the context words, parameter settings of the different models and lists of context words per token) in a nephovis directory, which you can use to feed data to NephoVis. This is going to be our output directory. Each of these will have subdirectories for the different lemmas or concepts you might be working on, even if you only have one. For a semasiological workflow (e.g. Montes 2021) you will use lemmas, whereas for an onomasiological workflow (e.g. Montes, Franco, and Heylen 2021) you will use concepts; the only difference is that the latter combines different variants per model. Any mention of lemmas in this vignette can be substituted with concepts in the onomasiological workflow.
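Schematically, and taking the placeholder lemma 'name_of_lemma' used in the code below, the layout described above looks roughly like this:
# path/to/data/                      <- shared base directory
# ├── output/
# │   ├── tokens/name_of_lemma/      <- token-level distance matrices (input)
# │   └── cws/name_of_lemma/         <- type-level distance matrices (input)
# └── nephovis/name_of_lemma/        <- registers and files for NephoVis (output)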
First we will look at the different steps when working with one lemma and then sum up with a loop over multiple lemmas.
Single lemma workflow
# based on the directory structure resulting from the Python workflow
lemma <- 'name_of_lemma'
base_dir <- "path/to/data" # shared path
input_dir <- file.path(base_dir, "output", "tokens", lemma) # where the data is stored
cw_dir <- file.path(base_dir, "output", "cws", lemma)
output_dir <- file.path(base_dir, "nephovis", lemma) # where the data will go
Token coordinates
The first step is to go through the token-level distance matrices and apply dimensionality reduction to obtain only two coordinates for visualization. For that purpose we need to decide which methods we are interested in (in case we want to compare them). The options are:
'mds': nonmetric multidimensional scaling (run with vegan::metaMDS())
'tsne': t-distributed stochastic neighbor embedding (run with Rtsne::Rtsne()), with different perplexities indicated by appending a number to the label, e.g. 'tsne20', 'tsne30'…
'umap': uniform manifold approximation and projection (run with umap::umap()).
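For reference, the underlying calls look roughly as follows. This is only a minimal sketch on a toy distance matrix (the objects toy_dist, mds_fit, tsne_fit and umap_fit are made up for illustration), and the exact arguments that getClouds() passes to these functions may differ.
library(vegan)
library(Rtsne)
library(umap)

set.seed(2022)
toy_dist <- dist(matrix(rnorm(200), nrow = 20)) # toy distance object, for illustration only

mds_fit <- vegan::metaMDS(toy_dist, k = 2)                                        # nonmetric MDS
tsne_fit <- Rtsne::Rtsne(as.matrix(toy_dist), is_distance = TRUE, perplexity = 5) # t-SNE; the perplexity varies per solution
umap_fit <- umap::umap(as.matrix(toy_dist), input = "dist")                       # UMAP on a distance matrix

head(mds_fit$points)  # two columns: x and y coordinates
head(tsne_fit$Y)
head(umap_fit$layout)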
For each technique, we obtain a different solution, which results in a tab-separated file with one row per token, a column with IDs and two columns per model: the x and the y coordinates. A named list as shown below will inform the code of which techniques to use (the names in the list) and which infixes to use in the file names.
# This list works for a loop in the function below and should then be stored as a json file
# in the github directory of each lemma, to tell the visualization what is being used
solutions_old <- list("mds" = ".mds")
for (perp in c(10, 20, 30, 50)) {
solutions_old[[paste0("tsne", perp)]] = paste0(".tsne.", perp)
}
solutions_old
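Printed, that list maps each technique name to the infix used in the file names:
# $mds
# [1] ".mds"
#
# $tsne10
# [1] ".tsne.10"
#
# ... and likewise for tsne20, tsne30 and tsne50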
For example, the code line below will only run Rtsne::Rtsne() with perplexity 30 and store the coordinates in a file called 'name_of_lemma.tsne.30.tsv'. This list must also be stored, to inform NephoVis of the available visualization options and how to find their data.
solutions <- list("tsne30" = ".tsne.30")
write(rjson::toJSON(solutions), file.path(output_dir, paste0(lemma, ".solutions.json")))
We also need a list of filenames to obtain the distances from, which we can either build by listing the contents of our input_dir or by appending the file extension to the names of the models in our register, as the code below does.
suffix <- ".ttmx.dist.pac"
models_file <- file.path(output_dir, paste0(lemma, '.models.tsv'))
files_list <- paste0(read_tsv(models_file)$`_model`, suffix)
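The other option, reading the file names straight from the directory, could look like this (a sketch; it assumes that every .ttmx.dist.pac file in input_dir corresponds to a model you want to include):
# alternative: list the token-level distance matrices in the input directory itself
files_list <- list.files(input_dir, pattern = "\\.ttmx\\.dist\\.pac$")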
The getClouds() function groups the full “workflow”:
1. It sets up one empty dataframe per item in solutions, which will be filled in with coordinates and stored in a '[lemma].[solution].tsv' file.
2. For each file in files_list:
2.1 It extracts the model name.
2.2 It loads the file with tokensFromPac().
2.3 If logrank = TRUE (the default), it applies the log-rank transformation; if logdist = TRUE it also computes euclidean distances on the log-transformed values instead of just making them symmetric. The clouds in Cloudspotting were generated with logrank = TRUE, logdist = FALSE.
2.4 It applies the corresponding dimensionality reduction algorithm and extracts the coordinates.
2.5 It appends the coordinates to the corresponding dataframe as columns preceded by the name of the model: '[modelname].x', '[modelname].y'.
In addition, if “mds” is one of the algorithms, it will return a list with the stress values.
getClouds(input_dir, output_dir, files_list, lemma, solutions)
Then you can read your files with readr::read_tsv().
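For instance, with the 'tsne30' solution requested above, a quick check could look like this:
# read back the coordinates generated for the tsne30 solution
coords <- read_tsv(file.path(output_dir, paste0(lemma, ".tsne.30.tsv")))
head(coords) # one row per token: an ID column plus [modelname].x and [modelname].y per model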
Context words coordinates
For the context words, the workflow is exactly the same as for the tokens. The difference is that the files are saved as '.csv' (because for some reason R cannot read them when they are .wwmx...pac), so getClouds() uses the focdistsFromCsv() function instead; this is indicated by specifying type = "focdists".
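For the single-lemma workflow this boils down to the same call with a different suffix and input directory, mirroring the context-words loop at the end of this vignette:
suffix <- ".wwmx.dist.csv"
files_list <- paste0(read_tsv(models_file)$`_model`, suffix)
getClouds(cw_dir, output_dir, files_list, lemma, solutions, type = "focdists")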
Model distances and coordinates
The files created by getClouds()
take care of the visualization in Levels 2 and 3 of NephoVis. For the Level 1 plot, as well as the distance matrix in Level 2 and the selection of medoids, we first need to obtain distances between the models. For that purpose, compLemma()
loads the '[lemma].models.tsv'
file in the output_dir
in order to modify it by appending the coordinates from vegan::metaMDS()
on the distances between the models. By default, it will compute “euclidean” distances on the transformed matrices (see eucliMats()
), but the function can be changed with the fun
argument, and the transformation can be turned off with the transformed
argument. It returns some data for a register, which I tend to combine across lemmas and store as euclidean_register.tsv
to tell the index of the visualization which lemmas to offer.
Under the hood, it also stores the distance matrix as [lemma].models.dist.tsv
. If the file already exists, it loads it instead of recomputing the distances. This distance matrix is then used in Level 2 of NephoVis.
reg <- compLemma(lemma, input_dir, output_dir)
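With a single lemma the register can be written out as it is; in the multi-lemma setting shown at the end of this vignette it is first combined across lemmas:
# store the register next to the lemma subdirectories, where the visualization index expects it
write_tsv(reg, file.path(base_dir, "nephovis", "euclidean_register.tsv"))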
Medoids
The medoids are simply calculated with cluster::pam()
and some basic information is stored in a [lemma].medoids.tsv
file. The only important column for the visualization is medoids
, which is what NephoVis reads to automatically select the right models.
k <- 8 # number of medoids you want
# read the distance matrix between models and turn it back into a 'dist' object
distmtx <- read_tsv(file.path(output_dir, paste0(lemma, ".models.dist.tsv"))) %>%
  matricizeCloud() %>% as.dist()
# partitioning around medoids
pam_data <- pam(distmtx, k = k)
# keep the per-cluster summary plus the names of the medoid models
medoid_data <- pam_data$clusinfo %>%
  as_tibble() %>%
  mutate(medoids = pam_data$medoids, medoid_i = seq(k))
write_tsv(medoid_data, file.path(output_dir, paste0(lemma, ".medoids.tsv")))
We can also add the clustering information to the models register.
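A sketch of that step for the single-lemma case, mirroring the multi-lemma loop below:
models_file <- file.path(output_dir, paste0(lemma, ".models.tsv"))
read_tsv(models_file) %>%
  mutate(
    pam_cluster = pam_data$clustering[`_model`], # cluster number assigned by pam()
    medoid = pam_data$medoids[pam_cluster]       # name of that cluster's medoid
  ) %>%
  write_tsv(models_file)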
Further steps
The visualization can also take advantage of tailored contexts for the different models; the code to obtain these columns, based on the output of semasioFlow.contextwords.listContextwords()
and the data in '[lemma].variables.tsv'
, is described in vignette('weightConcordance')
.
In addition, HDBSCAN clustering and classification of the clusters is covered in vignette('HDBSCAN')
.
Loops across multiple lemmas
Below is some code to run the workflow above across multiple lemmas.
lemmas <- c('lemma1', 'lemma2', 'lemma3') #or whatever they are
base_dir <- '/path/to/data'
input_dir <- file.path(base_dir, "output", "tokens") # where the data is stored
cw_dir <- file.path(base_dir, "output", "cws")
output_dir <- file.path(base_dir, "nephovis") # where the data will go
solutions <- list("tsne30" = ".tsne.30")
# Token level ----
suffix <- ".ttmx.dist.pac"
for (lemma in lemmas) {
models_file <- file.path(output_dir, lemma, paste0(lemma, '.models.tsv'))
files_list <- paste0(read_tsv(models_file)$`_model`, suffix)
write(toJSON(solutions), file.path(output_dir, lemma, paste0(lemma, ".solutions.json")))
getClouds(file.path(input_dir, lemma), file.path(output_dir, lemma), files_list, lemma, solutions)
}
# Context words level ----
suffix <- ".wwmx.dist.csv"
for (lemma in lemmas) {
models_file <- file.path(output_dir, lemma, paste0(lemma, '.models.tsv'))
files_list <- paste0(read_tsv(models_file)$`_model`, suffix)
getClouds(file.path(cw_dir, lemma), file.path(output_dir, lemma), files_list, lemma, solutions,
type = 'focdists')
}
# Distances between models ----
reg <- map_dfr(lemmas, ~compLemma(.x, file.path(input_dir, .x), file.path(output_dir, .x)))
write_tsv(reg, file.path(output_dir, "euclidean_register.tsv"))
# Medoids ----
k <- 8
for (lemma in lemmas) {
distmtx <- read_tsv(file.path(output_dir, lemma, paste0(lemma, ".models.dist.tsv"))) %>%
matricizeCloud() %>% as.dist()
pam_data <- pam(distmtx, k = k)
medoid_data <- pam_data$clusinfo %>%
as_tibble() %>%
mutate(medoids = pam_data$medoids, medoid_i = seq(k))
write_tsv(medoid_data, file.path(output_dir, lemma, paste0(lemma, ".medoids.tsv")))
models_file <- file.path(output_dir, lemma, paste0(lemma, ".models.tsv"))
read_tsv(models_file) %>%
mutate(
pam_cluster = pam_data$clustering[`_model`], # add pam-cluster number
medoid = pam_data$medoids[pam_cluster] # add name of medoid
) %>%
write_tsv(models_file)
}
References
Montes, Mariana. 2021. “Cloudspotting: Visual Analytics for Distributional Semantics.” PhD thesis, Leuven: KU Leuven.
Montes, Mariana, Karlien Franco, and Kris Heylen. 2021. “Indestructible Insights. A Case Study in Distributional Prototype Semantics.” In Cognitive Sociolinguistics Revisited, edited by Gitte Kristiansen, Karlien Franco, Stefano De Pascale, Laura Rosseel, and Weiwei Zhang, 251–63. De Gruyter. https://doi.org/10.1515/9783110733945-021.