# Constructing binary axes

Unlike many sociological concepts, "community" isn't obviously one side of a binary distinction. I explore here whether binary embedding axes can be usefully constructed and applied to the concept of community. I choose to do so because binary concepts are pervasive in the social sciences, and because binary axes are one of the main conceptual and methodological contributions social scientists have made to the study of embeddings (starting with Kozlowski et al 2019).

Binaries are pervasive. While social structures like class and gender can divide the world into more than two categories, they can also orient social life along binary axes: rich-poor, masculine-feminine, etc. Indeed, Durkheim argues that the sacred-profane binary is fundamental to the social act of classification in general.

But what's the opposite of community? Social scientists and social actors write and talk about lack, or loss, or absence, without necessarily giving it a name of its own. Again, Durkheim lurks here, when he distinguishes between lack of integration and lack of regulation. The latter is *anomie*, but what's the former? Egoism? Individualism? I'm not sure that's a fruitful path to explore, though it might be.

Instead of Durkheim, I'll turn to Tönnies, and explore the distinction between "community" and "society". The basic method for constructing a binary axis from word vectors takes one of two forms:

- create an axis from a single pair, by subtraction
- create an axis from multiple pairs, by averaging
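
As a minimal sketch of those two options, here is the arithmetic on made-up 3-d vectors (the values and the helper names `pair_axis` and `multi_pair_axis` are hypothetical, not taken from any of the packages discussed below):

```python
import numpy as np

# Toy 3-d vectors standing in for real embeddings (values are made up).
vecs = {
    "community": np.array([1.0, 0.5, 0.0]),
    "society":   np.array([0.0, 0.5, 1.0]),
    "local":     np.array([0.8, 0.1, 0.2]),
    "global":    np.array([0.1, 0.2, 0.9]),
}

def pair_axis(pos, neg):
    """Axis from a single pair: subtract the second pole from the first."""
    return vecs[pos] - vecs[neg]

def multi_pair_axis(pairs):
    """Axis from several pairs: average the per-pair difference vectors."""
    return np.mean([pair_axis(p, n) for p, n in pairs], axis=0)

single = pair_axis("community", "society")
multi = multi_pair_axis([("community", "society"), ("local", "global")])
```

Whether "local" - "global" really belongs in the same axis as "community" - "society" is exactly the open question; the sketch only shows the mechanics.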

I don't know of obvious synonyms for either "community" or "society" to construct multiple pairs with, so I'll start with that single pair here. But based on my prior exploration of words similar to community, I suspect a "local" - "global" axis might be correlated with the "community" - "society" one.

Like Kozlowski et al, I might plot where various other words fall relative to this community-society axis. What series of words should I pick to compare? A list of abstract social science concepts would be nice. Since I don't have that, I'll start by just going over the whole vocabulary and listing the words nearest one pole or the other. Would that then work to construct pairs for axis expansion?

Conversely, I might plot where "community" and related words fall on another axis, like the local-global one -- or even axes related to class, gender, morality, etc that prior papers have used. This would be akin to what Arseniev-Koehler and Foster do for fatness.

## Notes on prior methods and code

### CMDist package (Stoltz and Taylor 2019)

https://github.com/dustinstoltz/CMDist/blob/master/R/get_relations.R

CMDist includes examples of averaging concepts - which differs from their method for creating a multiple-word pseudo-document. `get_direction()` and `get_centroid()` are the functions to look at. 

CMDist implements 3 versions for binary axes in `get_direction()`: 

- difference then average (Kozlowski et al)
- average then difference
- Euclidean norm (How is this different from Kozlowski et al? Don't they norm in their original code? This method comes from the Bolukbasi et al paper that the whatlies package also references, so hopefully they're doing the same thing.)
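
One thing worth checking about the first two versions: on raw vectors, averaging is linear, so "difference then average" and "average then difference" give the same vector. Any divergence has to come from norming somewhere in between. The sketch below illustrates that with random stand-in vectors; it is not CMDist's code, and `unit` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 4 word pairs in a 5-d space.
pos = rng.normal(size=(4, 5))   # "positive" pole words, one per row
neg = rng.normal(size=(4, 5))   # "negative" pole words, one per row

def unit(x):
    """Scale each row (or a single vector) to unit l2 length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Without any norming, the two orders of operation agree.
diff_then_avg = (pos - neg).mean(axis=0)
avg_then_diff = pos.mean(axis=0) - neg.mean(axis=0)
same_without_norming = np.allclose(diff_then_avg, avg_then_diff)

# Norming each difference before averaging gives every pair equal weight,
# which generally points the averaged axis in a different direction.
normed_diff_then_avg = unit(pos - neg).mean(axis=0)
direction_changes = not np.allclose(unit(normed_diff_then_avg),
                                    unit(diff_then_avg))
```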

### Geometry of Culture (Kozlowski et al 2019)

https://github.com/KnowledgeLab/GeometryofCulture/blob/master/code/build_cultural_dimensions.R

This is what the Geometry of Culture code does:

- norm each vector (divide by sqrt(sum(x^2)), the l2-norm)
- take differences between pairs
- norm again
- take average of difference vectors
- norm again
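
The order of operations listed above can be sketched in numpy as follows. This is my paraphrase of the steps, not a port of the R script; `build_dimension` and `l2_normalize` are hypothetical names, and the input vectors are made up.

```python
import numpy as np

def l2_normalize(x):
    """Divide each row (or a single vector) by sqrt(sum(x^2))."""
    return x / np.sqrt(np.sum(x ** 2, axis=-1, keepdims=True))

def build_dimension(pos, neg):
    """pos and neg are (n_pairs, n_dims) arrays of paired word vectors."""
    pos, neg = l2_normalize(pos), l2_normalize(neg)  # norm each vector
    diffs = l2_normalize(pos - neg)                  # take differences, norm again
    return l2_normalize(diffs.mean(axis=0))          # average, norm again

# Two made-up pairs in 3 dimensions.
pos = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
neg = np.array([[0.0, 0.0, 4.0], [0.0, 0.0, 1.0]])
axis = build_dimension(pos, neg)
```

Because every intermediate result is rescaled to unit length, long vectors (often frequent words) don't dominate the average.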

### Machine learning / cultural learning (Arseniev-Koehler and Foster 2020)

https://github.com/arsena-k/Word2Vec-bias-extraction/blob/master/dimension.py

https://github.com/arsena-k/Word2Vec-bias-extraction/blob/master/build_lexicon.py

### whatlies package (Warmerdam et al 2020)

The whatlies package is a new tool from the NLP company RASA. It's meant to provide a consistent way to explore, manipulate, and visualize embeddings from different sources. How similar is the vector algebra outlined in the papers I describe above to the methods used and demoed in the whatlies package? Can I use tools from whatlies to implement those methods, or something close enough?

The advantage of using whatlies over gensim seems to be that the result of whatever vector algebra remains either an EmbeddingSet or an Embedding, which makes it easier to calculate similarities, plot projections, etc., downstream. I think gensim might involve more low-level fiddling with numpy arrays, but the same things could be done.

Most of the relevant methods live in the EmbeddingSet class, not the Embedding class. 

There's a transformer to normalize an EmbeddingSet: https://github.com/RasaHQ/whatlies/blob/master/whatlies/transformers/_normalizer.py#L8

Subtraction is implemented for individual Embeddings, but division isn't, so norming without using an EmbeddingSet and transformer isn't possible.

There's also a method to take the average of an embedding set, which could be done with a set of differences. https://github.com/RasaHQ/whatlies/blob/master/whatlies/embeddingset.py#L523

The `from_names_X()` method provides a way to turn something like gensim KeyedVectors into an EmbeddingSet directly, without saving the vectors to a file and loading them as a GensimLanguage object. (The GensimLanguage object doesn't necessarily have all the same methods? I'm unsure which way is better.) https://github.com/RasaHQ/whatlies/blob/master/whatlies/embeddingset.py#L328

I also now understand that the default metric for plot_interactive() *isn't* cosine similarity or cosine distance; it's normalized scalar projection, which they represent as the `>` operator. 

- https://github.com/RasaHQ/whatlies/blob/master/whatlies/embeddingset.py#L1119
- https://github.com/RasaHQ/whatlies/blob/master/whatlies/embedding.py#L115
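
To make the difference between the two metrics concrete, here is a small numpy sketch. It assumes that the `>` operator computes `a·b / b·b`, which is how I read the linked source; the function names are mine, not whatlies's.

```python
import numpy as np

def scalar_projection(a, b):
    """Length of a's projection onto b, measured in units of b's length."""
    return a.dot(b) / b.dot(b)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores both lengths."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([2.0, 2.0])
b = np.array([4.0, 0.0])

proj = scalar_projection(a, b)
cos = cosine_similarity(a, b)

# Doubling a doubles the projection but leaves the cosine unchanged.
proj_doubled = scalar_projection(2 * a, b)
cos_doubled = cosine_similarity(2 * a, b)
```

So under the projection metric, words with longer vectors land further from the origin of a plot, even when they point in the same direction.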

I'm not convinced that the interactive scatterplots I make here are the clearest way to show my findings, but they're a starting point.

### Other papers and packages

Waller and Anderson 2020 have an interesting method of constructing and expanding binary axes for *community embeddings* based on pairs of subreddits, but I couldn't find any of their code posted publicly.

In gensim, .init_sims() norms vectors. That method will be replaced by .fill_norms() when gensim 4.0.0 is available, but that version is still in beta as of now (https://github.com/RaRe-Technologies/gensim/releases). 

## A note on English and other languages

I'm doing my main analysis in English. To understand the limits of my analysis, I might think a bit about how Anglocentric the concept of "community" might be.

Based on Benedict Anderson's discussion of the international translations of his book *Imagined Communities*, I have reason to think that "community" doesn't necessarily translate well into a language like French. *Communautarisme*, I've heard, has something of a negative connotation.

Tönnies, however, was writing about community in German, which is to say he was really writing about *Gemeinschaft.* English-speaking sociologists sometimes use that word, *Gemeinschaft*, to emphasize and invoke a moral, resonant experience of community. So it might be interesting to find or train German-language word embeddings and construct a Gemeinschaft-Gesellschaft axis for comparison. An ambitious extension would be to train a model on Tönnies's work and see how that model looked similar or different.

A few sources for pretrained German-language word vectors:

- https://deepset.ai/german-word-embeddings
- https://spacy.io/models/de
- https://fasttext.cc/docs/en/crawl-vectors.html

(I wish these pages had more metadata about when the models were trained and posted online.)

Text of Tönnies, Gemeinschaft und Gesellschaft (1887 edition):

- http://www.deutschestextarchiv.de/book/view/toennies_gemeinschaft_1887?p=9

## Load packages and embeddings

In [1]:
# load packages
import os

import gensim.downloader as api

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

from whatlies import Embedding, EmbeddingSet
from whatlies.language import GensimLanguage
from whatlies.transformers import Normalizer

In [2]:
# set up figures so the resolution is sharper
# borrowed from https://blakeaw.github.io/2020-05-25-improve-matplotlib-notebook-inline-res/
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("darkgrid")
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')

# borrowed from https://github.com/altair-viz/altair/issues/1021
def my_theme(*args, **kwargs):
    return {'width': 500, 'height': 400}
alt.themes.register('my-chart', my_theme)
alt.themes.enable('my-chart')

ThemeRegistry.enable('my-chart')

In [3]:
# load gensim vectors (as KeyedVectors)
wv_wiki = api.load("glove-wiki-gigaword-200")
if not os.path.isfile("glove-wiki-gigaword-200.kv"):
    wv_wiki.save("glove-wiki-gigaword-200.kv")

In [4]:
# load vectors as whatlies GensimLanguage
lang_wiki = GensimLanguage("glove-wiki-gigaword-200.kv")

In [5]:
# load vectors as whatlies EmbeddingSet
emb_wiki = EmbeddingSet.from_names_X(names=wv_wiki.index2word, 
                                     X=wv_wiki.vectors)

What's the advantage of using GensimLanguage anyway? Is it faster or more efficient? It's annoying to have to write out the vectors after you load them. (More annoying for a model you've trained yourself, I'd imagine.)

## Community - society dimension

In [6]:
# subtraction with gensim (averaging comes later)
wv_wiki['community'] - wv_wiki['society']

array([ 0.21125801, -0.20572   ,  0.25577   , -1.07679   , -0.5692    ,
        0.02297999,  0.21760401, -0.020017  , -1.2587    ,  0.041086  ,
       -0.218533  , -0.12349999,  0.411472  ,  0.141909  ,  0.19098002,
       -0.421192  ,  0.012209  , -0.05559   ,  0.03202   ,  0.61028904,
       -0.65889   ,  0.5112002 , -0.38690004, -0.02410999,  0.52295   ,
       -0.19727   ,  0.11050999,  0.30345   ,  0.175234  , -0.6936    ,
       -1.3031931 , -0.54581   , -0.37711   ,  0.052096  ,  0.251908  ,
       -0.732768  , -0.37988997, -0.08363   , -0.28567997,  0.241564  ,
       -0.02117001,  0.13353002, -0.08291   , -0.22328001,  0.10966   ,
        0.33075   , -0.43351   , -0.06479999, -0.545506  ,  0.07736999,
        0.725142  , -0.801156  , -0.11023   ,  0.153733  ,  0.26353002,
       -0.06733999, -0.172911  , -0.49693003, -0.02404001,  0.19105   ,
        0.13389   , -0.21914999,  0.03555   ,  0.39821002, -0.23501399,
       -0.09658003, -1.01461   ,  0.01980007,  0.52537   , -0.66

In [7]:
# subtraction with whatlies (averaging comes later)
diff = emb_wiki["community"] - emb_wiki["society"]

In [8]:
# what's nice is that this is an Embedding object too
diff

Emb[(community - society)]

In [9]:
diff.vector

array([ 0.21125801, -0.20572   ,  0.25577   , -1.07679   , -0.5692    ,
        0.02297999,  0.21760401, -0.020017  , -1.2587    ,  0.041086  ,
       -0.218533  , -0.12349999,  0.411472  ,  0.141909  ,  0.19098002,
       -0.421192  ,  0.012209  , -0.05559   ,  0.03202   ,  0.61028904,
       -0.65889   ,  0.5112002 , -0.38690004, -0.02410999,  0.52295   ,
       -0.19727   ,  0.11050999,  0.30345   ,  0.175234  , -0.6936    ,
       -1.3031931 , -0.54581   , -0.37711   ,  0.052096  ,  0.251908  ,
       -0.732768  , -0.37988997, -0.08363   , -0.28567997,  0.241564  ,
       -0.02117001,  0.13353002, -0.08291   , -0.22328001,  0.10966   ,
        0.33075   , -0.43351   , -0.06479999, -0.545506  ,  0.07736999,
        0.725142  , -0.801156  , -0.11023   ,  0.153733  ,  0.26353002,
       -0.06733999, -0.172911  , -0.49693003, -0.02404001,  0.19105   ,
        0.13389   , -0.21914999,  0.03555   ,  0.39821002, -0.23501399,
       -0.09658003, -1.01461   ,  0.01980007,  0.52537   , -0.66

gensim and whatlies are doing the same thing — that's good!

The methods I researched above make it sound like normalizing the vectors is an important part of the process, so I explore how to do that next. 

In [10]:
# this is the vector's length (its l2 norm, numpy's default for a 1-D array),
# not a normalized vector, so it's not what I want here
diff.norm

5.8058085

In [11]:
# this rescales the vector to unit l2 length, which *is* what I want
diff_norm = EmbeddingSet(diff).transform(Normalizer(norm='l2'))

In [12]:
diff_norm['(community - society)']

Emb[(community - society)]

Now I compare similarity scores for the raw and normalized binary axis vectors.

In [13]:
# toward the "community" end of the axis
emb_wiki.score_similar(diff)

[(Emb[modding], 0.589404284954071),
 (Emb[community], 0.595145583152771),
 (Emb[master-planned], 0.614898145198822),
 (Emb[unincorporated], 0.6228582262992859),
 (Emb[baraki], 0.6278692483901978),
 (Emb[mixed-income], 0.6375738382339478),
 (Emb[homa], 0.6425093412399292),
 (Emb[communities], 0.6557735204696655),
 (Emb[clarkston], 0.6563817262649536),
 (Emb[mechanicsville], 0.6569202542304993)]

In [14]:
# toward the "society" end
emb_wiki.score_similar(-diff)

[(Emb[society], 0.5049312710762024),
 (Emb[microscopical], 0.5116949081420898),
 (Emb[cymmrodorion], 0.5465895533561707),
 (Emb[meteoritical], 0.5658960342407227),
 (Emb[linnean], 0.579145610332489),
 (Emb[ophthalmological], 0.5990228652954102),
 (Emb[anti-vivisection], 0.5998556613922119),
 (Emb[entomological], 0.6023805141448975),
 (Emb[dilettanti], 0.6036361455917358),
 (Emb[speleological], 0.6155304908752441)]

In [15]:
emb_wiki_norm = emb_wiki.transform(Normalizer(norm='l2'))

In [16]:
emb_wiki_norm.score_similar(diff_norm['(community - society)'])

[(Emb[modding], 0.589404284954071),
 (Emb[community], 0.595145583152771),
 (Emb[master-planned], 0.614898145198822),
 (Emb[unincorporated], 0.6228581666946411),
 (Emb[baraki], 0.6278692483901978),
 (Emb[mixed-income], 0.6375738382339478),
 (Emb[homa], 0.6425093412399292),
 (Emb[communities], 0.6557735204696655),
 (Emb[clarkston], 0.6563817262649536),
 (Emb[mechanicsville], 0.6569201946258545)]

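That's ... reassuring? that the metrics are essentially the same. It also makes sense: the scores here are cosine distances, which depend only on the angle between vectors and not on their lengths, so rescaling changes nothing beyond floating-point noise. I'll use the raw vectors here, for simplicity. But norming probably matters when averaging is involved.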
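
A quick numpy check of that invariance, using random vectors in place of real embeddings (`cosine_distance` is a hypothetical helper, assumed to match the 1 minus cosine similarity definition):

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
word = rng.normal(size=200)   # stand-in for a word vector
axis = rng.normal(size=200)   # stand-in for a difference vector

raw = cosine_distance(word, axis)
normed = cosine_distance(word / np.linalg.norm(word),
                         axis / np.linalg.norm(axis))
```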

Next, I'll explore similarity to each end of the community-society axis across the entire vocabulary. 

In [17]:
sim_diff = emb_wiki.score_similar(diff, n=len(emb_wiki), metric='cosine')

In [18]:
sim_diff[0:100]

[(Emb[modding], 0.589404284954071),
 (Emb[community], 0.595145583152771),
 (Emb[master-planned], 0.614898145198822),
 (Emb[unincorporated], 0.6228582262992859),
 (Emb[baraki], 0.6278692483901978),
 (Emb[mixed-income], 0.6375738382339478),
 (Emb[homa], 0.6425093412399292),
 (Emb[communities], 0.6557735204696655),
 (Emb[clarkston], 0.6563817262649536),
 (Emb[mechanicsville], 0.6569202542304993),
 (Emb[gated], 0.6586361527442932),
 (Emb[neighborhood], 0.662484884262085),
 (Emb[cybersitter], 0.6649418473243713),
 (Emb[mixed-use], 0.6691757440567017),
 (Emb[taizé], 0.6723114252090454),
 (Emb[har], 0.6735500693321228),
 (Emb[pittsylvania], 0.6744325160980225),
 (Emb[age-restricted], 0.6762539148330688),
 (Emb[conroe], 0.6776710748672485),
 (Emb[leflore], 0.6826707124710083),
 (Emb[mahru], 0.6833871603012085),
 (Emb[neighbourhood], 0.6848275661468506),
 (Emb[rossmoor], 0.6849431991577148),
 (Emb[perushim], 0.6870222687721252),
 (Emb[taize], 0.6876176595687866),
 (Emb[viejo], 0.688064694404602

In [19]:
sim_diff[-100:]

[(Emb[kautilya], 1.3038641214370728),
 (Emb[gynaecological], 1.303879737854004),
 (Emb[physikalische], 1.3040287494659424),
 (Emb[silurians], 1.3041560649871826),
 (Emb[soerensen], 1.3042516708374023),
 (Emb[honus], 1.304616928100586),
 (Emb[xiaokang], 1.304952621459961),
 (Emb[genootschap], 1.305346131324768),
 (Emb[fleischner], 1.305643081665039),
 (Emb[academician], 1.3060909509658813),
 (Emb[ichthyologists], 1.3062193393707275),
 (Emb[idsa], 1.3071272373199463),
 (Emb[ethnological], 1.30740487575531),
 (Emb[teratology], 1.3079063892364502),
 (Emb[f.r.s.], 1.3086650371551514),
 (Emb[saint-jean-baptiste], 1.3093072175979614),
 (Emb[fruitbearing], 1.3098504543304443),
 (Emb[anatomy], 1.3098732233047485),
 (Emb[feudalist], 1.3105204105377197),
 (Emb[mycological], 1.3109546899795532),
 (Emb[gulbarg], 1.3126049041748047),
 (Emb[bnhs], 1.313199758529663),
 (Emb[mammalogists], 1.313377857208252),
 (Emb[neurochemistry], 1.3137081861495972),
 (Emb[matriarchal], 1.3138256072998047),
 (Emb[dec

These results make sense, but there's a downside to using all 400,000 words in the vocabulary - it means that a lot of rare words are included. That's why you see things like "Mattachine Society" or "Linnean Society". 

Briefly, I'll look at the words in the middle of the list.

In [20]:
sim_diff[len(sim_diff)//2 - 50 : len(sim_diff)//2 + 50]

[(Emb[drolet], 1.000519871711731),
 (Emb[109,500], 1.0005199909210205),
 (Emb[http://www.nytsyn.com], 1.0005203485488892),
 (Emb[kloeden], 1.0005204677581787),
 (Emb[fita], 1.000520944595337),
 (Emb[27.19], 1.0005214214324951),
 (Emb[zohur], 1.0005216598510742),
 (Emb[maierhofer], 1.0005218982696533),
 (Emb[suff], 1.0005223751068115),
 (Emb[2,928], 1.000523567199707),
 (Emb[untidiness], 1.0005238056182861),
 (Emb[single-sideband], 1.0005238056182861),
 (Emb[kumiko], 1.0005241632461548),
 (Emb[ﬁnds], 1.0005241632461548),
 (Emb[visionics], 1.0005242824554443),
 (Emb[horwell], 1.0005242824554443),
 (Emb[anti-globalisation], 1.0005245208740234),
 (Emb[hoshangabad], 1.0005253553390503),
 (Emb[o'donnells], 1.0005265474319458),
 (Emb[logjammed], 1.0005278587341309),
 (Emb[vissel], 1.0005289316177368),
 (Emb[clínica], 1.0005289316177368),
 (Emb[nguyên], 1.0005310773849487),
 (Emb[maheswaran], 1.0005319118499756),
 (Emb[inauthenticity], 1.0005319118499756),
 (Emb[tziona], 1.0005320310592651),
 

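That isn't actually how you get the most orthogonal words, is it? Strictly, the middle of the list holds the words with the median cosine distance, not necessarily the most orthogonal ones, though here the two nearly coincide because the scores all sit at about 1.0005. This again helps show the issues with using the full vocabulary: many of these "words" are uncommon (e.g. names) or garbage (e.g. numbers, urls, misspellings).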
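
A more direct route to orthogonal words would be to rank by the absolute value of cosine similarity to the axis. Here is a sketch on a toy vocabulary; `most_orthogonal` is a hypothetical helper, and with real data `names` and `X` would be the vocabulary list and vector matrix.

```python
import numpy as np

def most_orthogonal(names, X, axis, n=3):
    """Return the n words whose cosine similarity to `axis` is closest to 0."""
    sims = (X @ axis) / (np.linalg.norm(X, axis=1) * np.linalg.norm(axis))
    order = np.argsort(np.abs(sims))[:n]
    return [(names[i], float(sims[i])) for i in order]

# Made-up 2-d vocabulary.
names = ["a", "b", "c", "d"]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.1]])
axis = np.array([1.0, 0.0])

result = most_orthogonal(names, X, axis, n=2)
```

Note that "d" points almost directly away from the axis, so it counts as far from orthogonal even though its similarity is negative.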

The entire vocabulary isn't terribly useful for producing more pairs of opposite words. What might I do instead? 

Is there a list of just common English vocabulary in general? E.g. https://stackoverflow.com/questions/28339622/is-there-a-corpora-of-english-words-in-nltk

Ideal might be a set of common *social science* words. Could I approximate that through vector averaging, using words like "community", "society", "sociology", etc.? Then I could expand the set by getting a list of words most similar to that average.

To begin with, I'll *just* average "community" and "society", to get words in the general neighborhood of both. Then I'll use that shorter list of words as the comparison set.

In [21]:
avg = emb_wiki[['community', 'society']].average(name="avg(community, society)")

In [22]:
emb_wiki.score_similar(avg, n=100)

[(Emb[society], 0.10151195526123047),
 (Emb[community], 0.1131206750869751),
 (Emb[communities], 0.28810155391693115),
 (Emb[societies], 0.29189276695251465),
 (Emb[established], 0.3642843961715698),
 (Emb[culture], 0.3817620277404785),
 (Emb[social], 0.3840060234069824),
 (Emb[organizations], 0.3851216435432434),
 (Emb[organization], 0.3865753412246704),
 (Emb[cultural], 0.4050767421722412),
 (Emb[public], 0.4058724641799927),
 (Emb[founded], 0.4087352752685547),
 (Emb[association], 0.4117599129676819),
 (Emb[institution], 0.4144657850265503),
 (Emb[country], 0.42236191034317017),
 (Emb[life], 0.4231107234954834),
 (Emb[citizens], 0.423681378364563),
 (Emb[local], 0.4252852201461792),
 (Emb[part], 0.4261268377304077),
 (Emb[the], 0.4262123107910156),
 (Emb[education], 0.42667531967163086),
 (Emb[institutions], 0.4284619092941284),
 (Emb[well], 0.4290911555290222),
 (Emb[member], 0.42947137355804443),
 (Emb[establishment], 0.4309902787208557),
 (Emb[educational], 0.4321936368942261),
 

This is an improvement. In the long run, I still should probably filter out stopwords and, uh, punctuation apparently.
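
A first pass at that filtering could look like the sketch below. The stopword set is a tiny illustrative stand-in (a real list from NLTK or spaCy would be much longer), and the regular expression is my own assumption about what counts as a word.

```python
import re

# Tiny illustrative stopword set, not a real stopword list.
STOPWORDS = {"the", "and", "of", "in"}

# Lowercase ASCII letters, with optional internal hyphens.
WORD_RE = re.compile(r"^[a-z]+(?:-[a-z]+)*$")

def keep_token(token):
    """True for word-like tokens that are not stopwords."""
    return bool(WORD_RE.match(token)) and token not in STOPWORDS

candidates = ["society", "the", ",", "109,500", "mixed-income",
              "http://www.nytsyn.com", "local"]
kept = [t for t in candidates if keep_token(t)]
```

This pattern would also drop tokens with accents or periods, like "taizé" or "f.r.s.", which may or may not be desirable.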

In [23]:
sim_avg = emb_wiki.embset_similar(avg, n=100)

In [24]:
# quick check of norming - it still produces the same scores
(EmbeddingSet(avg, *emb_wiki)
 .transform(Normalizer(norm='l2'))
 .score_similar(avg, n=11))

[(Emb[avg(community, society)], 0.0),
 (Emb[society], 0.1015118956565857),
 (Emb[community], 0.1131206750869751),
 (Emb[communities], 0.28810155391693115),
 (Emb[societies], 0.29189276695251465),
 (Emb[established], 0.3642843961715698),
 (Emb[culture], 0.38176196813583374),
 (Emb[social], 0.384006142616272),
 (Emb[organizations], 0.38512158393859863),
 (Emb[organization], 0.3865753412246704),
 (Emb[cultural], 0.40507662296295166)]

It's easier to plot if I have some sort of second axis, I think, even if that second axis isn't terribly informative itself. I'll use the average as that axis at first.

Kozlowski et al actually plot the *angles* of the vectors, using sports words as their comparison set. I could probably figure out how to do this eventually, but I don't see how to immediately. Maybe that would be similar to the arrow diagrams that whatlies uses in its non-interactive plots.

In [25]:
# default metric - projection
sim_avg.plot_interactive(x_axis=diff, y_axis=avg)

In [26]:
# when cosine similarity is used, there's a clear sharp lower bound
# on the y axis, which makes sense because that's how the word set 
# was defined
sim_avg.plot_interactive(x_axis=diff, y_axis=avg, 
                         axis_metric="cosine_similarity")

In [27]:
# add a second binary axis, local-global
sim_avg.plot_interactive(x_axis=diff, 
                         y_axis=emb_wiki['local'] - emb_wiki['global'])

In [28]:
sim_avg.plot_interactive(x_axis=diff, 
                         y_axis=emb_wiki['local'] - emb_wiki['global'], 
                         axis_metric='cosine_similarity')

Like Kozlowski et al, I can calculate cosine similarity between the axes directly. (Note: the Embedding class has a method that does cosine *distance*.)

In [29]:
diff_loc = emb_wiki['local'] - emb_wiki['global']

In [30]:
diff.distance(diff_loc, metric='cosine')

0.8468398
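
To read that number as a similarity or an angle, it can be converted by hand. This assumes the distance is defined as 1 minus cosine similarity, which is consistent with the range of scores seen earlier (from 0 up to about 1.3).

```python
import math

cos_distance = 0.8468398   # the value reported by diff.distance(...)

cos_similarity = 1.0 - cos_distance
angle_deg = math.degrees(math.acos(cos_similarity))
```

Under that assumption the similarity is about 0.15 and the angle is a little over 81 degrees, so the two axes are only weakly aligned in these vectors.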

A bigger comparison set of words might help me see more trends, but will make the plots themselves harder to read.

In [31]:
sim_avg2 = emb_wiki.embset_similar(avg, n=1000)

In [32]:
sim_avg2.plot_interactive(x_axis=diff, y_axis=avg, 
                          axis_metric="cosine_similarity")

In fact, now words like "neighborhood" show up strongly on the "community" side of the axis.

When I explored similarity to community, I found that the most similar words to community differed between twitter- and wikipedia-based glove vectors. So, what does a comparison set look like with the twitter vectors? Are there obvious similarities and differences at a glance?

In [None]:
wv_twitter = api.load("glove-twitter-200")

In [None]:
emb_twitter = EmbeddingSet.from_names_X(names=wv_twitter.index2word, 
                                        X=wv_twitter.vectors)

In [None]:
diff_twitter = emb_twitter['community'] - emb_twitter['society']
avg_twitter = emb_twitter[['community', 'society']].average(name="avg(community, society)")

In [None]:
(emb_twitter
 .embset_similar(avg_twitter, n=1000)
 .plot_interactive(x_axis=diff_twitter,
                   y_axis=avg_twitter))

In [None]:
(emb_twitter
 .embset_similar(avg_twitter, n=1000)
 .plot_interactive(x_axis=diff_twitter,
                   y_axis=emb_twitter['local'] - emb_twitter['global'],
                   axis_metric='cosine_similarity'))

In [None]:
diff_loc_twitter = emb_twitter['local'] - emb_twitter['global']

In [None]:
diff_twitter.distance(diff_loc_twitter)

## Once more, in German

As noted above, the sociological tradition that discusses community actually originates in German. (The French classical sociologists seem to prefer to talk about *solidarité*... See Aldous 1972 for a contextualization of the historical dialogue between Durkheim and Tönnies.)

Because of that origin, I'm curious about what modern German word vectors might show about Gemeinschaft (community) and Gesellschaft (society). Are there any obvious differences from English?

I'll choose some pretrained German-language embeddings. The deepset German glove embeddings are 3.5GB, which is larger than the largest spacy model... So I'll download that spacy model instead:

```
python -m spacy download de_core_news_lg
```

Note that I can read a little German, but I don't know it particularly well. To make this part of the analysis more serious, actual German speakers might need to weigh in.

In [None]:
from whatlies.language import SpacyLanguage

In [None]:
emb_wiki_de = SpacyLanguage("de_core_news_lg")

In [None]:
diff_de = emb_wiki_de['gemeinschaft'] - emb_wiki_de['gesellschaft']

In [None]:
avg_de = emb_wiki_de[["gemeinschaft", "gesellschaft"]].average(name="avg(gemeinschaft, gesellschaft)")

In [None]:
(emb_wiki_de
 .embset_similar(avg_de, n=1000)
 .plot_interactive(x_axis=diff_de,
                   y_axis=avg_de)
 .properties(width=500, height=400))

In [None]:
# most similar to gemeinschaft
(emb_wiki_de
 .embset_similar(avg_de, n=1000)
 .score_similar(diff_de, n=50))

I notice words like zusammenarbeiten, einheiten, mitglieder... Some religious words? (evangelischen, kirchlichen, konfessionelle)

In [None]:
# most similar to gesellschaft, as opposed to gemeinschaft
(emb_wiki_de
 .embset_similar(avg_de, n=1000)
 .score_similar(-diff_de, n=10))

politik and demokratie are "gesellschaft" words, not "gemeinschaft" words.

In [None]:
# backing up - what are the words most similar to gemeinschaft overall?
emb_wiki_de.score_similar('gemeinschaft', n=50)

I notice lots of adjectives? It might be better if words were stemmed... I see e.g. gemeinschaftlich, gemeinschaftliche, gemeinschaftlichen.

One key takeaway: apparently, under this model, gesellschaft is *the* most similar word to gemeinschaft.

Except for "nachbarschaftlichen" I don't see the same local connotations that "community" has in English?

## TODO: Train a word2vec model on Tönnies

What results would I get if I train a model on a classical sociological text?

This corpus explorer is pretty sweet: http://voyant-tools.org/?view=corpusset&stopList=stop.de.german.txt&input=http://www.deutschestextarchiv.de/book/download_lemmaxml/toennies_gemeinschaft_1887

But how do I get that lemmatized text into python?

https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format
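
One possible route, sketched with the standard library only: parse the XML, map token IDs to lemmas, and rebuild each sentence as a list of lemmas. The sample below is a hypothetical, heavily simplified TCF-like fragment; the namespace URIs and attribute names are illustrative assumptions that would need checking against the real download.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TCF-like fragment (namespaces are illustrative).
SAMPLE = """<D-Spin xmlns="http://www.dspin.de/data">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <tokens>
      <token ID="t1">Gemeinschaften</token>
      <token ID="t2">sind</token>
      <token ID="t3">alt</token>
    </tokens>
    <lemmas>
      <lemma tokenIDs="t1">Gemeinschaft</lemma>
      <lemma tokenIDs="t2">sein</lemma>
      <lemma tokenIDs="t3">alt</lemma>
    </lemmas>
    <sentences>
      <sentence tokenIDs="t1 t2 t3"/>
    </sentences>
  </TextCorpus>
</D-Spin>"""

def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def lemma_sentences(xml_text):
    """Return one list of lowercased lemmas per sentence element."""
    root = ET.fromstring(xml_text)
    lemma_by_token = {}
    for el in root.iter():
        if local(el.tag) == "lemma":
            for tok_id in el.get("tokenIDs", "").split():
                lemma_by_token[tok_id] = (el.text or "").lower()
    sentences = []
    for el in root.iter():
        if local(el.tag) == "sentence":
            ids = el.get("tokenIDs", "").split()
            sentences.append([lemma_by_token[i] for i in ids
                              if i in lemma_by_token])
    return sentences

corpus = lemma_sentences(SAMPLE)
```

A list of token lists like `corpus` is, as far as I know, the shape that gensim's `Word2Vec` accepts for its `sentences` argument, so this could feed directly into training.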