It would be interesting to work in ML with texts of a certain quality, for example literary and philosophical works. It would be even better, if only dreaming: if we could make the machine understand and practice dialectical thinking. Being is the contrary of nothing, but it is also the same … But first we have to get the texts into the right format.
Hegel wrote in German, but most of the scholars who read his books (somebody does!) will read them in translation, mostly in English, perhaps in Chinese.
We know translating philosophical texts is an arduous task. We know many of these transpositions into other languages are misleading: they should mirror both the ideas and the language, and often they render neither. But judging is not my task here. When we talk about Hegel or dialectical thinking as an existing thing, we are talking about something not a priori definable by “original texts” or “author’s intentions”. Rather, we have to refer to what people have actually read and thought somewhere in the world, be it in German or in other languages.
I will start with the translation of the “Phenomenology of Spirit” elaborated by William Wallace. Simply, it is the first one I found when searching the Net. I also feel drawn to a philosopher who died after a bicycle accident.
Having extracted the text from Project Gutenberg (as plain text, cutting off everything not written by Hegel) and saved it, we can make the machine produce some first statistics (after having tokenized the text and constructed the dfm with R quanteda, that is). It looks like this:
> textstat_frequency(phen_feat, n = 20)
   feature frequency rank docfreq group
1      the      8896    1    8896   all
2       of      5859    2    5859   all
3      and      4551    3    4551   all
4       is      3236    4    3236   all
5       in      2992    5    2992   all
6       to      2733    6    2733   all
7        a      2337    7    2337   all
8       it      1983    8    1983   all
9       as      1787    9    1787   all
10   which      1475   10    1475   all
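The steps just described might look roughly like this in R (a sketch; the file name and the intermediate object names are assumptions, only phen_feat appears in the output above):

```r
library(quanteda)
library(quanteda.textstats)

# read the plain-text file saved from Gutenberg (file name assumed)
phen_raw  <- paste(readLines("phenomenology.txt"), collapse = "\n")

phen_toks <- tokens(corpus(phen_raw))    # tokenize the raw text
phen_feat <- dfm(phen_toks)              # build the document-feature matrix

textstat_frequency(phen_feat, n = 20)    # the frequency table shown above
```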
A list of “stopwords”: the functional words without specific meaning, like articles, conjunctions and auxiliary verbs. How can we find the words that characterize Hegel’s text? “Easy!” most data analysts will reply. “Let us just remove the stopwords!” But a quick test with the Snowball stopword list, the most widely used one, makes our particular difficulty evident.
The beginning of the list of Snowball stopwords looks like this:
> head(stopwords("en", source = "snowball"))
[1]  "i"      "me"     "my"
[4]  "myself" "we"     "our"    […]
[43] "be"     "been"   "being"
[55] "should" "could"  "ought"
These are the words to be eliminated? I, me, my, being? They all evoke classical philosophical problems: who am I, me, myself?
By eliminating stopwords with a predefined list, we would kick the philosophical topics out.
Always remember: most instruments for data analysis have been developed for marketing experts looking for “content” and “sentiments”. Philosophy, instead, is all about stopwords.
It might properly be called a Stopwordery.
But even a philosopher will not have chosen “the”, “of” and “and” as the main topics of his or her reasoning (better to check first, though; Hegel has not). This fact seems to permit the creation of a list of words that have no specific meaning within the discourse we analyze and can be eliminated. These are, for example, “a”, “an” and “the” (for the complete list, see below).
Creating a wordcloud from the cleaned text, we get a rather philosophical-looking, but confusing, picture:
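Such a wordcloud can be drawn from the cleaned tokens along these lines (a sketch; komplett_tokens is the token object built below, with my_stopwords already removed):

```r
library(quanteda)
library(quanteda.textplots)

# turn the cleaned tokens into a dfm and plot the most frequent features
phen_clean <- dfm(komplett_tokens)
textplot_wordcloud(phen_clean, max_words = 100)
```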
Even after adapting the usual lists of stopwords to our field, we are not able to produce a list of relevant terms. Instead of removing words a priori identified as “meaningless”, we should use other instruments to get at the relevant text material. The spacyr package will take us further: it permits grammatical parsing of the text. By isolating the nouns and adjectives, we will automatically eliminate most of the irrelevant material.
Technically:
# custom stopword list adapted to philosophical texts
my_stopwords <- c("that's", "who's", "what's", "when's", "where's", "why's",
                  "how's", "a", "an", "the", "and", "but", "if", "or",
                  "because", "as", "until", "while", "of", "at", "by", "for",
                  "itself", "their", "into", "with", "what", "about", "which",
                  "that", "this", "above", "below", "to", "from", "up", "down",
                  "in", "out", "off", "over", "under", "again", "further",
                  "then", "once", "when", "where", "why", "how", "all", "any",
                  "each", "few", "more", "most", "some", "such", "no", "they",
                  "them", "nor", "not", "only", "own", "same", "so", "than",
                  "too", "very", "thus", "will")

komplett_tokens <- quanteda::tokens(komplett_Ph_c,
                                    remove_punct = TRUE,
                                    remove_symbols = FALSE,
                                    remove_numbers = TRUE) %>%
  tokens_select(min_nchar = 4) %>%   # drop tokens shorter than 4 characters
  tokens_remove(my_stopwords)        # remove the custom stopwords
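The spacyr approach mentioned above might be sketched as follows (untested; it requires a working Python spaCy installation, and the model name is an assumption):

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")   # assumed English model

# part-of-speech tagging of the whole text
parsed <- spacy_parse(komplett_Ph_c)

# keep only nouns and adjectives, discarding the functional material
content_words <- subset(parsed, pos %in% c("NOUN", "ADJ"))$token
head(content_words, 20)
```

This sidesteps the stopword problem entirely: instead of deciding which words are “meaningless”, we let the grammatical categories do the filtering.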