R Tutorial: Token Distribution and Regular Expressions

Introduction

“This chapter explains how to use the positions of words in a vector to create distribution plots showing where words occur across a narrative. We introduce the grep function and show how to use regular expressions for more nuanced pattern matching.” (Jockers 2020: 37)

This tutorial is based on Jockers’ Text Analysis with R For Students of Literature, Chapter 4: “Token Distribution and Regular Expressions”. The grep() function used in that chapter for search with regular expressions does not work well with Arabic script, so we will use the search and replace functions from the tidyverse’s stringr library instead.

The tidyverse is a collection of R packages for data science that work together well because they share the same philosophy, grammar and data structures. The most important packages for us will be stringr (for working with text strings) and ggplot2 (for creating plots and other graphics).

To install all packages in the tidyverse collection, including stringr, simply run:

> install.packages("tidyverse")

Remember, a package needs to be installed only once; but in every session you want to use that package, you need to load it using the library() function.
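The install-once, load-every-session pattern can be sketched as follows. The requireNamespace() check is an illustrative addition, not part of the original tutorial; calling install.packages() once by hand works just as well.

```r
# Install stringr only if it is not yet available on this machine
# (illustrative check; not part of the original tutorial).
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}

# Load the package; this line is needed in every new R session.
library(stringr)

# Once the package is loaded, its functions can be called directly.
is_found <- str_detect("tidyverse", "verse")
is_found
```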

Setting up

We will start the tutorial with some code we created in the previous classes. Run the script below to load the example text (al-Ṭabarī’s History) and tokenize it:

library("stringr")  # importing libraries is always done at the top of a script

# make sure Arabic is displayed correctly (the locale name may need adjusting on your system):
Sys.setlocale(category = "LC_ALL", locale = "C.UTF-8")

url <- "https://raw.githubusercontent.com/OpenITI/0325AH/master/data/0310Tabari/0310Tabari.Tarikh/0310Tabari.Tarikh.Shamela0009783BK1-ara1.completed"
text_v <- scan(url, what="character", sep="\n", encoding="UTF-8")
splitter_index <- which(text_v == "#META#Header#End#")
lines_v <- text_v[(splitter_index+1):length(text_v)]
book_v <- paste(lines_v, collapse = "\n")
book_word_l <- str_split(book_v, "\\W+")
book_word_v <- unlist(book_word_l)

We will mostly be using the variable book_word_v, which contains the tokenized text of al-Ṭabarī’s History.

A word about coding style

(Jockers p. 38)

It is good practice to stick to some principles while coding:

- variable names:
  - use lower-case characters, and separate words with underscores
  - Jockers often uses a single character at the end of a variable name to indicate which data type the variable contains (_v for vector, _l for list, etc.)
- use spaces before and after operators like <- and =

These are not fixed rules in R (R accepts upper-case characters in variable names, and it does not care whether you put spaces before and after =), but adhering to a consistent coding style makes your code easier for others to read, and makes it easier for you to remember how you named your variables.
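A small sketch of the difference; the variable names and values are made up for illustration. Both assignments are valid R and store the same data, and only the readability differs.

```r
# Style followed in this tutorial: lower case, underscores, a type suffix,
# and spaces around operators.
word_count_v <- c(12, 7, 30)

# Equally valid R, but harder to read and to remember.
WordCount=c(12,7,30)

# Both variables hold exactly the same values.
identical(word_count_v, WordCount)
```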

For an example of a well-developed style guide for writing R code, see https://style.tidyverse.org/syntax.html.

Dispersion plots

(Jockers p. 38-)

Chapter 2 showed how to calculate and display raw and relative frequencies of words on the level of an entire book. In this chapter, we will look at how words are distributed within one book. In Jockers’ Moby Dick example: “At what points, for example, does Melville really get into writing about whales?”

Instead of collapsing an entire book into a frequency table, which does not take the sequence of words into account, we will now plot words in a sequence from the first word of the book to the last.

In order to visualize the distribution of the use of a word in the text, we will create a dispersion plot. A dispersion plot looks like a barcode:

Jockers’ dispersion plot for the word “whale” in Moby Dick

The (horizontal) X axis represents the position of each word in the book (1 is the first word, 20,000 is the 20,000th word). In the example above, each time the word “whale” is mentioned in Melville’s Moby Dick, a vertical black line is drawn at the position of that word in the text (e.g., if the 200th word of the book is “whale”, a black line is drawn at position 200 on the X axis). Note that every line in the plot has the same width; if some lines appear wider than others, that is because they are actually many lines positioned closely together.

In order to create such a plot, we will have to go through a number of steps:

1. choose a word whose distribution in the text we want to visualize
2. tokenize the text (already done in the setup script; stored in the book_word_v variable)
3. identify the positions of the word in the book
4. plot these positions in a graph.

We first have to identify where our word is located in the text. As in chapter two, we will use the tokenized text (stored in the book_word_v variable) as the basis for our analysis.

To identify the positions in which our word is used, we will use the str_detect() function from the stringr package. This function takes two arguments: a character vector containing one or more strings, and a regular expression pattern that describes the word(s) you want to match. The function checks for every string in the vector whether it matches the regular expression you provided. It returns a vector that contains TRUE for every string that matched the regular expression, and FALSE for every string that did not match.
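As a minimal sketch on a made-up English word vector (toy_word_v is hypothetical, not the tutorial's data), the logical vector returned by str_detect() can be turned into numeric positions with base R's which():

```r
library(stringr)

# Hypothetical toy data: a tiny "tokenized text".
toy_word_v <- c("call", "me", "ishmael", "the", "whale", "and", "whales")

# One TRUE/FALSE value per token: does the token match the pattern?
is_match_v <- str_detect(toy_word_v, "whale")
is_match_v

# which() returns the index positions of the TRUE values,
# i.e. the positions of the matching tokens in the text.
position_v <- which(is_match_v)
position_v
```

Note that the pattern matches substrings, so "whales" counts as a match as well.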

A preparatory example

In programming, it is often useful to test your code on some dummy data, so that you can better understand what happens when you run it. Let’s do this for the str_detect() function, and see how it works:

library("stringr")  # load the `stringr` package, which contains the `str_detect` function
test_v <- c("ab", "ba", "ad", "aa")  # create a character vector with some dummy values
match_v <- str_detect(test_v, "a.")  # regular expression: match the character "a" followed by another character
match_v

If you run the code above, you will see the output of the str_detect() function: a vector containing the value TRUE for every string in the test_v vector that contains the character “a” followed by another character, and the value FALSE for every string that does not match the regular expression (here: TRUE, FALSE, TRUE, TRUE).

We can now use the plot() function to create a dispersion plot from the output of the str_detect() function:

library("stringr")
test_v <- c("ab", "ba", "ad", "aa")
match_v <- str_detect(test_v, "a.")
plot(match_v,
     type = "h",         # "h" stands for histogram (see below)
     ylim = c(0, 1),     # set the maximum value of the Y axis to 1
     yaxp = c(0, 1, 1),  # set the y axis values (1 interval, between 0 and 1)
     xaxp = c(1, 4, 3)   # set the x axis values (3 intervals, between 1 and 4)
)

Because we set the plot type to h (for “histogram”; this type draws histogram-like vertical lines), the plot() function will, for every element in the match_v vector, draw a vertical line from the X axis; the length of the vertical line is defined by the element’s value.
This works for our match_v vector because that vector contains TRUE and FALSE values, which the plot() function automatically converts to their numerical equivalents, 1 and 0:

- for the first element in the match_v vector (TRUE), the plot() function will draw a line of height 1 at position 1 of the x axis;
- for the second element in the match_v vector (FALSE), the plot() function will draw a line of height 0 at position 2 of the x axis;
- for the third element in the match_v vector (TRUE), the plot() function will draw a line of height 1 at position 3 of the x axis;
- for the fourth element in the match_v vector (TRUE), the plot() function will draw a line of height 1 at position 4 of the x axis.
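The conversion from logical to numeric values can also be made visible explicitly. This is a small sketch using the same four values; as.numeric() is used here only to show what plot() does implicitly.

```r
match_v <- c(TRUE, FALSE, TRUE, TRUE)  # the four values from the example

# TRUE becomes 1 and FALSE becomes 0; these are the line heights in the plot.
height_v <- as.numeric(match_v)
height_v

# Because TRUE counts as 1, sum() gives the number of matches.
sum(match_v)
```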

A real-world example

We will use the terms ḥaddathanī/ḥaddathanā, “he transmitted to me/us” as an example. These terms are very important in the context of the isnād, a common citation practice in Arabic texts, in which a report is quoted together with every person (transmitter) who links the source to the original event reported.

This is an example of an isnād:

فحدثني محمد بن عمارة الأسدي ومحمد بن منصور قالا: حدثنا عبيد الله بن موسى، قال: أخبرنا موسى بن عبيده عن اياس ابن سلمة بن الأكوع، عن أبيه، قال: بعثت قريش

Muḥammad b. ʿUmāra al-Asadī and Muḥammad b. Manṣūr transmitted to me (ḥaddathanī), both saying: ʿUbayd Allāh b. Mūsā transmitted to us (ḥaddathanā), saying: Mūsā b. ʿUbayda transmitted to us (akhbaranā), on the authority of Iyās ibn Salama b. al-Akwaʿ, on the authority of his father, who said: “Quraysh sent etc.”

The verb ḥaddatha is related to the term ḥadīth, which is used for transmitted reports on the words and deeds of the Prophet Muḥammad. The implications of the term ḥaddathanī/ḥaddathanā are not fully understood; in ḥadīth studies, it is generally accepted that the term indicates direct oral/aural transmission of a report from a teacher to a student (as opposed to citation from a written book outside of a teaching context). There is less of a consensus on the use of the term in genres other than ḥadīth works.

We will try to use dispersion plots to gain insight into the distribution of this term, and of another frequently used transmission term, akhbaranī/akhbaranā, “he transmitted to me/us”, in al-Ṭabarī’s History. That work is a universal history from the creation to the year 302 AH / 915 CE; the period after the hijra is organized annalistically (the events of each year are narrated in a separate chapter).

First, we will create a vector that records which tokens in the text match a regular expression pattern that describes all possible variations of the term: حدثن[ياى] (that is, the string “ḥaddathan” followed by either alif (ā), yā' (ī), or alif maqṣūra (which is often used to represent a yā' in final position in printed books)).
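Before applying the pattern to the whole book, it can be tried on a few hand-picked words. The vector toy_v below is made up for illustration and is not taken from the text.

```r
library(stringr)

# Hypothetical test words: three spellings that should match, one prefixed
# form, and two words that should not match.
toy_v <- c("حدثني", "حدثنا", "حدثنى", "فحدثني", "حدثهم", "قال")

# The character class [ياى] matches exactly one of the three listed letters.
toy_match_v <- str_detect(toy_v, "حدثن[ياى]")
toy_match_v
```

The first four words match and the last two do not. The fourth word shows that the pattern is not anchored to the start of the token, so forms with a prefixed conjunction such as fa- are found as well.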

NB: RStudio, like many other text editors, has trouble displaying right-to-left and left-to-right text on the same line in a way that is easily understandable for a human reader.

In the regular expression above, we first typed “حدثن”, followed by the opening bracket; RStudio automatically adds a closing bracket, and the brackets jump to the right of the Arabic word automatically. When you start writing Arabic text into the brackets, the opening bracket jumps to the left of “حدثن”, but the closing bracket remains to the right.

To make your patterns more readable, you can break your regex pattern down into chunks that R can display well on one line, and then use the paste() function to concatenate the partial patterns into a single pattern:

ptrn_1 <- "حدثن"
ptrn_2 <- "[ياى]"
ptrn <- paste(ptrn_1, ptrn_2, sep="")  # concatenate the two patterns into a single pattern

The following code creates a dispersion plot for ḥaddathanī/ā (you will have to write the regular expression into the code yourself!):

# write the pattern here:
ptrn_1 <- ""  # fill in the first part of the pattern
ptrn_2 <- ""  # fill in the second part of the pattern

ptrn <- paste(ptrn_1, ptrn_2, sep="")  # concatenate both patterns into a single pattern

haddathani_v <- str_detect(book_word_v, ptrn)

plot(haddathani_v,
     type = "h",      # "h" stands for histogram
     yaxt = "n",      # do not include tick marks for values on the Y axis
     ylim = c(0, 1),  # set the maximum value of the Y axis to 1
     xlim = c(0, length(haddathani_v)), # set the maximum value of the X axis to the number of tokens in the text
     main = "Dispersion plot for haddathani/a",  # title for the plot
     xlab = "Index positions",                   # label for the x axis
     ylab = ""                                   # label for the y axis
)

The sheer number of matches makes it difficult to draw far-reaching conclusions from this plot. But at least we can see that

- the use of the term ḥaddatha is much more prevalent in the first two thirds of the work than in the last third;
- there are a number of sections in the first third of the work in which the term is not mentioned.

Now, modify the code above yourself to create a dispersion plot for the terms أخبرني and أخبرنا:

library("stringr")

# write the pattern here:
ptrn_1 <- ""  # fill in the first part of the pattern
ptrn_2 <- ""  # fill in the second part of the pattern

ptrn <- paste(ptrn_1, ptrn_2, sep="")  # concatenate both patterns into a single pattern
akhbarani_v <-   # fill in the str_detect() call

plot(
  # fill in the arguments
)

Solution (you still have to write the patterns yourself):

akhbarani_v <- str_detect(book_word_v, ptrn)

plot(akhbarani_v,
     type = "h",      # "h" stands for histogram
     yaxt = "n",      # do not include tick marks for values on the Y axis
     ylim = c(0, 1),  # set the maximum value of the Y axis to 1
     xlim = c(0, length(akhbarani_v)), # set the maximum value of the X axis to the number of tokens in the text
     main = "Dispersion plot for akhbarani/a",   # title for the plot
     xlab = "Index positions",                   # label for the x axis
     ylab = ""                                   # label for the y axis
)

The plot shows that the term akhbara is used much less intensively in the book than ḥaddatha, especially in the first third of the book.
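A visual impression like this can be backed up with counts: because TRUE counts as 1, sum() over a str_detect() result gives the number of matching tokens. The sketch below uses a made-up six-word vector rather than book_word_v, so the numbers say nothing about al-Ṭabarī's text.

```r
library(stringr)

# Hypothetical toy tokens, for illustration only.
toy_word_v <- c("حدثنا", "قال", "أخبرنا", "حدثني", "عن", "حدثنا")

haddathani_toy_v <- str_detect(toy_word_v, "حدثن[ياى]")
akhbarani_toy_v  <- str_detect(toy_word_v, "[أا]خبرن[ياى]")

# Number of matching tokens for each term.
haddathani_n <- sum(haddathani_toy_v)
akhbarani_n  <- sum(akhbarani_toy_v)
haddathani_n
akhbarani_n

# Share of all tokens that match (relative frequency).
mean(haddathani_toy_v)
```

The same two calls to sum() on haddathani_v and akhbarani_v would give the totals for the whole book.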

These observations do not prove anything by themselves, but they can serve as starting points for further analysis of citation patterns in al-Ṭabarī’s history.

Multi-word patterns

The approach outlined above works only with tokenized text.

If the pattern we are looking for is not limited to one specific word, we cannot use a text tokenized into separate words, but have to tokenize it in a different way.

NB: we can also use non-tokenized text, and use the index location of each character in the text as anchor points for the visualization.
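A minimal sketch of this character-index approach, assuming stringr's str_locate_all() and a made-up English string (toy_text is hypothetical; for the book one would pass book_v instead):

```r
library(stringr)

# Hypothetical non-tokenized text.
toy_text <- "the whale swam; a whale sang"

# str_locate_all() returns a list with one matrix per input string;
# each row holds the start and end character position of one match.
loc_m <- str_locate_all(toy_text, "whale")[[1]]
start_v <- loc_m[, "start"]
start_v

# Draw one vertical line of height 1 at each start position.
plot(start_v, rep(1, length(start_v)),
     type = "h",
     yaxt = "n",
     ylim = c(0, 1),
     xlim = c(0, str_length(toy_text)),
     xlab = "Character positions",
     ylab = ""
)
```

Here the X axis counts characters rather than words, so the same approach also works for patterns that span several words.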

One way to tokenize the text differently is to use each line of text as a token. During the tokenization process (see the setting up code above), we have already created a vector that contains every line as a separate string: lines_v. This line-by-line approach would work well, for example, if we want to contrast the distribution of ḥaddatha in the book in general with the distribution of the term as the first element in the chain of transmission (that is, the immediate source from which the author of the book took the report). Since our text is formatted in OpenITI mARkdown format, we can use the mARkdown tag for the start of a paragraph (hashtag # followed by a space) to identify the cases where ḥaddathanā/ī is the first word of a transmission chain.
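The idea of anchoring a pattern to the paragraph tag can be tried out on a few made-up English lines first. Both toy_lines_v and the pattern are hypothetical stand-ins for the Arabic case: ^ anchors the match at the start of the line, and (and )? makes the conjunction optional.

```r
library(stringr)

# Hypothetical lines imitating the layout: paragraphs start with "# ".
toy_lines_v <- c("# and said Zayd: we set out",
                 "~~he said that the army had left",
                 "# said Amr: the city fell",
                 "# in that year the river flooded")

# Start of line, paragraph tag, optional conjunction, then the search term.
toy_ptrn <- "^# (and )?said"

toy_first_v <- str_detect(toy_lines_v, toy_ptrn)
toy_first_v
```

Only the first and third lines match: the second line contains "said" but does not start a paragraph, and the fourth starts a paragraph with a different word.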

Write the regular expression that can be used to identify the instances where ḥaddathanā/ī is the first word of a transmission chain, fill in the plot function, and run the code:

# build the regular expression that can be used to identify the instances where haddathana/i is the first word of a transmission chain:
ptrn_0 <- ""  # start of line, followed by hashtag and space, and optionally one of the conjunctions wa- or fa-
ptrn_1 <- ""  # haddathan
ptrn_2 <- ""  # alif, ya or alif maqsura
ptrn <- paste(ptrn_0, ptrn_1, ptrn_2, sep="")  # concatenate all patterns into a single pattern
haddathani_first_v <- str_detect(lines_v, ptrn)

# plot the vector:
plot(
  # fill in the arguments
)

Hint: you can use the expression ^# .? for the beginning of the line; or, to be more precise, use the letters wāw and fā' between square brackets instead of the full stop. The full solution can be found at the bottom of this page.

This plot is much more informative than the first plot we made for the term ḥaddatha: it shows that al-Ṭabarī’s use of the term ḥaddatha in the first position of a chain of transmitters (that is, to describe his direct source), is strongly clustered in two sections at the beginning of the work, and gradually declines in the later years of his History.

Now do the same for akhbaranī/ā as the first word of a transmission chain:

# write the pattern here:
ptrn_1 <- ""  # start of line, followed by hashtag and space, and optionally one of the conjunctions wa- or fa-
ptrn_2 <- ""  # alif or alif-with-hamza
ptrn_3 <- ""  # khbaran
ptrn_4 <- ""  # i/a

ptrn <- paste(ptrn_1, ptrn_2, ptrn_3, ptrn_4, sep="")  # concatenate all patterns into a single pattern

# complete the variable assignment using the str_detect() function:
akhbarani_first_v <-

plot(
  # fill in the arguments
)

Hint: you can use the expression ^# .? for the beginning of the line; or, to be more precise, use the letters wāw and fā' between square brackets instead of the full stop. The full solution can be found at the bottom of this page.

The contrast between akhbara and ḥaddatha becomes much clearer here: al-Ṭabarī clearly used the term akhbara very rarely in the citation of his own sources, even though the term is used frequently in other positions in the transmission chains.

These graphs can be used as starting-off points for deeper analysis of the citation patterns in al-Ṭabarī’s History.

Full exercise solutions:

ḥaddathani/a as first word in a paragraph:

library("stringr")

# build the regular expression that can be used to identify the instances where haddathana/i is the first word of a transmission chain:
ptrn_0 <- "^# [وف]?"  # start of line, followed by hashtag and space, and optionally one of the conjunctions wa- or fa-
ptrn_1 <- "حدثن"  # haddathan
ptrn_2 <- "[ياى]"  # alif, ya or alif maqsura
ptrn <- paste(ptrn_0, ptrn_1, ptrn_2, sep="")  # concatenate all patterns into a single pattern

haddathani_first_v <- str_detect(lines_v, ptrn)

# plot the vector (we add some more arguments here to prettify the graph):
plot(haddathani_first_v,
     type = "h",      # "h" stands for histogram
     yaxt = "n",      # do not include tick marks for values on the Y axis
     ylim = c(0, 1),  # set the maximum value of the Y axis to 1
     xlim = c(0, length(lines_v)), # set the maximum value of the X axis to the number of lines in the text
     main = "Dispersion plot for haddathani/a in initial position",  # title for the plot
     xlab = "Index positions",                   # label for the x axis
     ylab = ""                                   # label for the y axis
)

akhbarani/a as first word in a paragraph:

library("stringr")
# build the pattern from its parts:
ptrn_1 <- "^# [وف]?"  # start of line, followed by hashtag and space, and optionally one of the conjunctions wa- or fa-
ptrn_2 <- "[أا]"  # alif or alif-with-hamza
ptrn_3 <- "خبرن"  # khbaran
ptrn_4 <- "[ياى]"  # i/a

ptrn <- paste(ptrn_1, ptrn_2, ptrn_3, ptrn_4, sep="")  # concatenate all patterns into a single pattern

akhbarani_first_v <- str_detect(lines_v, ptrn)

plot(akhbarani_first_v,
     type = "h",      # "h" stands for histogram
     yaxt = "n",      # do not include tick marks for values on the Y axis
     ylim = c(0, 1),  # set the maximum value of the Y axis to 1
     xlim = c(0, length(lines_v)), # set the maximum value of the X axis to the number of lines in the text
     main = "Dispersion plot for akhbarani/a in initial position",  # title for the plot
     xlab = "Index positions",                   # label for the x axis
     ylab = ""                                   # label for the y axis
)