Documentation
Getting started
After installing and starting snaut and you can start working with the semantic spaces by opening your Web browser and going to http://localhost:5005.
The first thing you need to do is to load a semantic space which you want to
use. If there is no space loaded at the moment, a window listing available
semantic spaces (the ones you put in the data
folder) should appear
automatically. You should select the semantic space from the dropdown menu and
click at the Load
button.
When the space is loaded, its name will apprear in the left upper corner of the screen. At this point, you can start working with the semantic space.
The interface is organized into four menus - each allows to explore the
semantic space in a different way. You can read more about these menus
here. The simplest way in which you can start working with the
semantic space is to pick a few words or phrases and type them in the form in
the Neighbours
tab (each word/phrase in a separate line; words in the phrase
separated by space) and click on the Calculate
button. You should see small
tables showing words that are most similar to the words or phrases you typed in
according to the semantic space.
Loading spaces
To add more spaces to the menu you need to drop a semantic space file in the
data
folder. It will then appear in the space loading menu. If you are unsure
about where you should drop the files you can click on the Change
button and
the target folder path will be displayed along with other information. You can
customize location of the spaces by changing the semantic_spaces
setting in
the configuration file (see here for more details).
Words and phrases
The interface allows to work with single words or with phrases composed of multiple words. In fact you can even think about a whole document as a very long phrese. If you enter a multi-word phrase to the input field it will be represented by the model as a sum of the vectors of all its words. If any of the words used in the phrase is not present in the loaded semantic space, snaut will not be able to compute the vector for this phrase and, as the result, it will ignore the whole phrase.
The phrases need to be entered as a list of words separated with spaces (interpuction must be removed).
Menus
Neighbours
This menu allows to look up nearest neighbours of a set of words.
For example, in order to check what are the words with a smallest distance to
brain and dinosaur, type in brain
and dinosaur
into the input box on
separate lines and press Calculate
. You can choose the metric that is used to
compute distance in the space.
You can also try to enter phrases composed of multiple words, for instance
compare behavior
, research
and behavior research
.
Matrix
If you need to obtain measurements for a large number of words, you can use the matrix menu.
The words for which you need the scores should be entered in the input form on the left. Each word or phrase should be entered in a separate line. Next, in the dropdown menu you can choose what kind of comparison do you want to make. The available options are:
- distances between all pairs of words in the list
- distances between the words in the list and all other words in the loaded semantic space
- distances between all pairs of words between the left input field and the right input field
When you click on Calculate
, snaut will compute the scores and, after this
is finished, it will initialize download of a file with the results. The file
is in a CSV format: it contains a table in a plain text with columns separated
with commas.
You can read in the list of words to the text field from a file on your disc by
clicking on the Load from a file
button below the target input field. The
Check availability
button can be used check whether all words specified in
the input field are present in the semantic space. Keep in mind that, if some
of the words are not in the space, snaut will ignore them when computing the
semantic distance measures.
Pairwise
You can use this menu to investigate distances between individual pairs of
words/documents. Each pair should be entered on a separate row in the input
field and elements of the pair should be speparated with a colon (':') . After
clicking on the Calculate
button, snaut will do the calculation and a
download of a CSV file with the result will be initialized.
For instance, in order to calculate the distance between pairs of words: home
and window
, car
and wheel
, fast car
and slow car
. You should enter:
home : window
car : wheel
car : cloud
fast car : slow car
Similarily to the Matrix interface you can load the list of pairs from a text file or check the availability of the words in the semantic space.
Analogy
snaut implements an offset method described by Mikolov, Yih, & Zweig (2013). The analogy interface allows you to perform algebraic operations using vector semantic space and capture some regularities in the language.
The classical example involves the computation king - man + woman which results in a vector very close to queen.
The computation can be performed by entering the words vectors of which you
want to have positive or negative contribution in the calculation. For
instance, in order to calculate king - man + woman you need to enter
king, woman
in the field positive vectors and man
in the field negative
vectors.
Configuration
snaut comes with a set of options that should work well for most usecases
running on a local computer. Nevertheless, you may want to tweak some of the
available options. This can be done by adjusting the settings in config.ini
,
which resides in the snaut main folder.
The configuration file is divided in two sections: server
and
semantic_space
. The server
section has the following options:
host
- this allows to specify on which IP address snaut is supposed to listen for requests. Two useful values are127.0.0.1
(default; listen only for requests coming from the local machine) and0.0.0.0
(listen for requests from any IP address; if set anyone will be able to communicate with the interface running on this computer)port
- port on which snaut will listen
The semantic_space
allows to configure settings directly related to how
snaut handles the semantic spaces, using the following options:
semspaces_dir
- a directory in which available semantic spaces are located ( default./data/
)preload_space
- ifyes
load a semantic space on startup (default:yes
)preload_space_file
- ifpreload_space
is set toyes
the space in this path will be preloaded (default: xxx)preload_space_format
- the format of the space that will be preloadedprenormalize
- when loading a space normalize all vectors to have length 1. This speeds up computation of cosine distances but does not allow to compute the other metrics (default:no
)matrix_size_limit
- a limit on the size of the computation that can be performed using snaut, in general this setting specifies the number of distance value which can be computed in each request, if set to -1 no limit will be enforced (default:-1
)allow_space_change
- if set toyes
allow the user to change the loaded semantic space using the web interface (using theChange
button in the semantic space menu), ifno
snaut only the preloaded space can be used (default:yes
)
Usage as a Web-server
To run snaut as a webserver on a Linux system you need to follow the following steps:
1) Install gunicorn. In Ubuntu:
sudo apt-get install gunicorn
2) Clone the snaut repository and install dependencies:
git clone https://github.com/pmandera/snaut.git
3) Install dependencies:
cd snaut
sudo pip install -r requirements.txt
4) Install CoffeeScript and compile coffee files to javascript. In Ubuntu:
sudo apt-get install coffeescript
coffee -c -o snaut/static/js/snaut snaut/coffee
5) Adjust the configuration file.
You need to keep in mind that computing semantic distances often involves performing operations over large matrices and can be computationally expensive: you need to make sure that you want to allow external users to make such extensive use of your computational resources.
If you intend to expose the space in the server mode you will need to make
some adjustments in the config.ini
file.
You need to disallow changing the semantic space loaded by the server by
allow_space_change
to no
. In oreder for snaut to load a semantic space
when it is loaded set preload_space
to yes
and select the space by
setting and preload_space_file
and preload_space_format
.
You probably also do not want your users to be able to compute huge matrices
including billions of cells. In order to prevent the users from doing that set
the matrix_size_limit
to a reasonable value (in our experience values like
1,000,000 work pretty well).
You may also want to adjust the documentation that is shown to the user. You
can do so by changing the doc_dir
setting. You can use the adjusted
documentation from ./doc/server/
or use your custom documentation directory.
6) Start snaut. For example to start snaut on port 9000, with 4 working instances run:
gunicorn \
--workers 5 \
--bind 127.0.0.1:15008\
--pid snaut.pid \
--log-file snaut.log \
snaut.wsgi:app
7) It is recommended to run gunicorn behind a reverse proxy. For more information see gunicorn documentation
File formats
CSV
snaut works with CSV and Matrix Market file formats.
Values in the CSV file should be separated by spaces and contain words in the first column and vector values in the following columns. snaut treats lines starting with '#' as comments. Optionally, you can provide additional information about the space in the comments opening the file:
- If the first line contains a comment starting with 'TITLE: ', snaut treat it as a title of the space, that will be displayed in the status field of the interface.
- The following lines are treated as the description of the space and will be
visible after clicking on
More info
.
If the first uncommented line contains two integer values, it will be treated as an information about the number of rows and columns in the file.
If the filename ends with *.gz
, snaut will assume that the file is
compressed using gzip.
The CSV format in snaut is compatible with a non-binary output of the Google word2vec tool.
The CSV files in this files can also be easily read in to R and used for example with the LSAfun package. To read in the space into a matrix, run:
space <- as.matrix(read.table('space.w2v.gz', sep=' ', row.names=1))
# for gzipped files:
space <- as.matrix(read.table(gzfile('space.w2v.gz'), sep=' ', row.names=1))
Semantic space market
In the case of semantic spaces which contain a large number of 0.0 values, such as those created when counting word co-occurrences without a dimensionality reduction step, the CSV format is not practical. For such cases it is better to use the Matrix Market format which handles sparse matrices more efficiently. For more details see here
The space based on the Matrix Market format consist of the following files:
- data.mtx - matrix with word vectors as rows
- row-labels - a text with one word on a line, in an order corresponding to row vectors in the data.mtx file
Optionally, you can provide a README.md file which contains a title of the space in the first line and a description in the following lines. The title should be separated from the description with one blank line.
The files should be included in one folder or zipped in one file to which you need to point snaut.