Codes and Other Data for Nominal Data Linkage


Nominal data linkage is a process long used by genealogists and more recently by historical demographers to reconstitute the demographic histories of individuals and link them to one another (see Hammel, "Family Reconstitution", Oxford Encyclopedia of Economic History, Oxford: Oxford University Press, 2003). Two processes are ordinarily involved: standardization of the spelling of entries used in linkage (which we call "tokenization"), and the linkage itself.

Various schemes have been developed for tokenization; they are of course sensitive to the particular languages involved. Because our work involved materials written both in Latin (before about 1848) and Croatian (after about 1848), with orthographic and other influences from German and Hungarian, we developed our own scheme. It is not perfect, especially where nicknames are involved. Tokenized forms of baptismal names, surnames, and placenames are all reduced to standard lengths if their original form is longer than the standard. First names have 5 characters, surnames 10, and place names 5. Much of the tokenization depended on the linguistic intuitions of the researchers, since most variant spellings of a name are intuitively obvious to those with sufficient knowledge of the languages. That said, it should be noted that priests varied widely in the way they wrote their data, some variation being attributable to the natural difficulties of rendering a Slavic language in an orthography designed for Latin and adopted in Hungarian, and some to the lamentable fact that many priests were less literate than one might have hoped.
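The two steps described above (standardize the spelling, then cut to a standard length) can be sketched as follows. This is a minimal illustration, not the project's actual scheme: only the lengths 5, 10, and 5 come from the text. The function name `tokenize`, the `variants` lookup, the lower-casing, and the sample variant spelling are all assumptions made for the example.

```python
# Standard token lengths stated in the text:
# first names 5, surnames 10, place names 5.
TOKEN_LENGTHS = {"fname": 5, "lname": 10, "place": 5}


def tokenize(name, kind, variants=None):
    """Standardize a spelling, then cut it to the standard length for its kind.

    `variants` is a hypothetical lookup from variant spellings to one
    standard spelling. The project's real standardization relied on the
    researchers' linguistic judgment and is not reproduced here.
    """
    key = name.strip().lower()  # lower-casing is an assumption
    standard = (variants or {}).get(key, key)
    # Names shorter than the standard length are left as they are.
    return standard[:TOKEN_LENGTHS[kind]]


# Made-up variant table, for illustration only.
example_variants = {"kovachevich": "kovacevic"}

print(tokenize("Kovachevich", "lname", example_variants))
print(tokenize("Stanisavljevic", "lname"))
print(tokenize("Gradiska", "place"))
```

Truncation only shortens names longer than the standard, which matches the statement that tokens are reduced "if their original form is longer than the standard."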

The classic approaches in reconstitution developed (for demographers) by Louis Henry involve starting with a marriage record and then finding the baptismal records that can be plausibly linked to it. Further, one looks for the baptismal and burial records of the spouses, as well as any remarriage records, and the burial records of the children. We went further than this, looking for marriages of the children, baptismal records of the spouses, and so on, since we were interested in kinship beyond the nuclear family.

The database of baptismal, burial, and marriage records is large. After discarding illegible records we used 23,307 marriages (1717-1864), 112,181 baptisms (1714-1898), and 94,077 burials (1717-1898). One sees immediately that there were 34 years (1864-98) where baptisms and burials were recorded but no marriages. Some individuals recorded in those terminal years cannot be linked to their parents' marriages (for example the number of first births recorded in those years declines because there were no new marriages recorded). The transcribing of records was limited by a number of factors: parish books from before 1848 were available in the central State Archive of Croatia, but those from later dates had to be recovered from local police stations.  There were also limitations of time and money, and we concentrated on getting as much information on the history of individual marriages as possible; if we had gathered marriage data up to 1900, for example, we would have required baptismal data up to 1950 and burial data up to 2000.  The great emigrations of the early 20th century, the first and second World Wars, the civil wars of the 1990s, and the increasing importance of civil versus parish recording would have made such an endeavor very difficult.

Because the database is so large, we were obliged to employ computers. The first linkages were done by Ruth Deuel in Fortran, and the results were examined manually by Deuel, Čapo, and Hammel. Some rules and scoring systems were employed for the resolution of ambiguities (such as 2 children of identical names born on the same date to 2 different sets of parents), and stubborn cases were resolved by hand. Some published analyses rested on these data. At a later date, further work on computerized linkage was done by Marcia Feitel, but manual resolution of ambiguities was still employed. Finally, Hammel decided, in consultation with Carl Mason, that it would be preferable to use a completely automated linkage algorithm, because the resolution of ambiguities would leave a clear record in the algorithms themselves. To that end, Hammel began writing and testing such algorithms in the Perl language, which is well suited to the sophisticated manipulation of text. These algorithms worked and gave results scarcely different from those achieved earlier, but when run on the entire dataset they were too slow (even if faster than manual work). Aaron Gullickson improved the efficiency of the scripts. In general, no record was linked to another unless 3 of 4 tokenized personal names matched. For example, a child's baptism was not linked to its parents' marriage unless 3 of the 4 parental first and last names on the baptismal record matched the names of the spouses on the marriage record. (We note in passing that a match of 4 is possible, since it was not unusual for the maiden name of the wife to be given on the baptismal certificate of her early children, reflecting the way in which her identity was known to midwives and priests. Other ethnographic data, admittedly from similar populations in Serbia, suggest that a woman was only gradually incorporated socially into her husband’s kin group, and that that incorporation was symbolized by how she was identified.)
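The 3-of-4 rule for linking a baptism to a marriage can be illustrated with a short sketch. It is not the project's Perl code: the `Names` record, its field names, and the position-by-position comparison are assumptions, and the sample names are adapted from the illustrative baptismal entry quoted elsewhere on this page.

```python
from typing import NamedTuple


class Names(NamedTuple):
    """Four tokenized personal names for a couple (field names are assumptions)."""
    husband_fname: str
    husband_lname: str
    wife_fname: str
    wife_lname: str


def count_matches(a: Names, b: Names) -> int:
    """Count how many of the four names agree, position by position."""
    return sum(x == y for x, y in zip(a, b))


def may_link(baptism_parents: Names, marriage_spouses: Names, threshold: int = 3) -> bool:
    """Allow a link only if at least `threshold` of the 4 names match."""
    return count_matches(baptism_parents, marriage_spouses) >= threshold


marriage = Names("petar", "kovacevic", "maria", "kostic")

# Mother recorded under her husband's surname: 3 of 4 names agree.
baptism_a = Names("petar", "kovacevic", "maria", "kovacevic")
# Mother recorded under her maiden name: all 4 names agree.
baptism_b = Names("petar", "kovacevic", "maria", "kostic")
# Different father's first name as well: only 2 of 4 agree.
baptism_c = Names("ivan", "kovacevic", "maria", "kovacevic")

for label, baptism in (("a", baptism_a), ("b", baptism_b), ("c", baptism_c)):
    print(label, count_matches(baptism, marriage), may_link(baptism, marriage))
```

A threshold of 3 tolerates the common case in which the mother's surname on the baptismal record is her husband's, while the marriage record carries her maiden name.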


Basic Data Files and Codebooks

 

The original files of baptisms, marriages, and burials were of disparate format. The earliest were simply unformatted notations written as linear descriptions of a ceremony, such as “Today, 25th February, 1722, I, Jovan, baptized Anica, daughter of Petar Kovacevic and Maria Kostic.”  By the end of the century these statements had become standardized, and indeed formatted, and eventually the record books were ruled in columns and rows for the entry of the required data. Our data entry procedures evolved to include newly specified information (such as the legitimacy of a birth) and the formatting of the source documents.  As a first step in analysis we standardized the formats of the various underlying files: baptisms, marriages, and burials, of course using the most complex form of each kind of file as a template. These are the *.uni files. There are 21 *.uni files, one for each kind of ceremony in each of seven parishes. Some of the information in those files was of no importance to record linkage, for example the name of the recording priest. The *.uni files were reduced to the *.w files, which contained the information essential to nominal data linkage. The *.w files contain a link back to the original *.uni file for each record. There are three *.w files, one for each kind of ceremony, each containing the records from all seven parishes.

Names in the *.uni files are as written in the original data and are not tokenized. The modern Croatian characters for consonants with diacritics are not employed, since they were not used in much of the earlier recording and were not available for key-to-disk data entry when most of the data were coded. Spelling conventions derived from Latin, German, or Hungarian were commonly used for some consonants in the original data. The coders were all native speakers of Croatian, and they used the modern consonant spellings if those were in the data but stripped the diacritics. (The absence of the diacritics poses no problems to speakers of the language.) Names in the *.w files are tokenized: 5 characters for baptismal and place names, 10 characters for surnames. However, some untokenized names are left in the *.w files at their original length, in special fields. These will be clear from the field lengths in the *.w.format files. "fname" means first or baptismal name; "lname" means last name or surname. Records in *.w files contain a pointer to the original record in the relevant *.uni file.

The .uni and .w files are so named on these web pages. Users will note, however, that they are specified in the hyperlinks as .txt files (for example, cern.birth.uni.txt) and are actually so named on this server. The .txt suffix was employed so that they would be visible, downloadable, and usable under most operating systems. (Mac OS X, for example, interprets a *.uni file as a unix executable file, not as a text file.)

The seven parishes lie northeast of the Sava River in a plain about 10 km from the river, up to the adjacent hills, from Bogičevci at the northeast (near Jasenovac) to Oriovac at the southwest (near Slavonski Brod). Six were in the former Military Border of Croatia, but Cernik was in Civil Croatia. The parishes, with their name codes, are:

Name                 Code1   Code2
Bogičevci            bogic   B
Cernik               cern    C
Nova Gradiška        grad    G
Oriovac              orio    O
Staro Petrovo Selo   petro   P
Štivice              stiv    S
Vrbje                vrbje   V

"Book" means the catalog number of the original record book in the Archive of Croatia. "Page" is the page in that book.

Binary (yes/no) variables such as the legitimacy of a birth are coded 1 for positive, 0 for negative. Other codes are indicated in the format files.

The website files do not include the "comment" codes, which were primarily used to alert us to problems during data entry.

"Parish" in the *.uni files is inherent in the file name. In the *.w files it is indicated by the initial letter of the parish name.

Users please note that parishes begin and end recording for different events at different times and that there are occasional gaps in the data where events were not recovered.
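As an illustration of how the two parish codes relate to the file names, the sketch below maps the initial letter used in the *.w files to the corresponding *.uni file. The parish codes, the event names (birth, death, mar), and the .txt suffix come from this page; the function name `uni_filename` and the dictionary layout are assumptions made for the example.

```python
# Code2 (initial letter used in *.w files) -> (parish name, Code1 used in file names)
PARISHES = {
    "B": ("Bogičevci", "bogic"),
    "C": ("Cernik", "cern"),
    "G": ("Nova Gradiška", "grad"),
    "O": ("Oriovac", "orio"),
    "P": ("Staro Petrovo Selo", "petro"),
    "S": ("Štivice", "stiv"),
    "V": ("Vrbje", "vrbje"),
}

# Event names as they appear in the file names.
EVENTS = ("birth", "death", "mar")


def uni_filename(parish_letter, event, on_server=True):
    """Build the *.uni file name for a parish letter and an event type.

    With on_server=True the .txt suffix used on the server is appended.
    """
    if event not in EVENTS:
        raise ValueError(f"unknown event type: {event!r}")
    _, code1 = PARISHES[parish_letter]
    name = f"{code1}.{event}.uni"
    return name + ".txt" if on_server else name


# All 21 files: 7 parishes x 3 event types.
all_uni_files = [uni_filename(p, e) for p in PARISHES for e in EVENTS]

print(uni_filename("C", "birth"))
print(len(all_uni_files))
```

A record in birth.w whose parish field is C would thus point back to a record in cern.birth.uni.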

Downloading Files

When you display a data file in your browser, you can download it by saving it to a local file. All data files are straight ASCII text files. They vary in size from less than a kilobyte to more than 14 megabytes. Consult the table below before downloading. Read the *.format files first; these are .htm files.

File Sizes (bytes)

File               Bytes

Formats
birth.uni.format   759
birth.w.format     855
death.uni.format   633
death.w.format     812
mar.uni.format     1,064
mar.w.format       1,381

*.w Files
birth.w            14,139,972
death.w            7,433,742
mar.w              4,104,496

*.uni Files
bogic.birth.uni    1,413,024
bogic.death.uni    991,924
bogic.mar.uni      300,800
cern.birth.uni     8,672,484
cern.death.uni     4,191,984
cern.mar.uni       2,053,760
grad.birth.uni     6,636,854
grad.death.uni     4,132,816
grad.mar.uni       1,653,120
orio.birth.uni     3,583,728
orio.death.uni     2,168,576
orio.mar.uni       1,122,560
petro.birth.uni    3,699,348
petro.death.uni    2,396,992
petro.mar.uni      1,299,520
stiv.birth.uni     1,164,072
stiv.death.uni     888,552
stiv.mar.uni       430,405
vrbje.birth.uni    2,437,122
vrbje.death.uni    1,498,120
vrbje.mar.uni      566,080
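One use of the listed sizes is to check that a download is complete. The sketch below is only a suggestion: the byte counts for the three *.w files are copied from the table, while the function name `size_matches` and the local paths are assumptions. A browser or editor that converts line endings could change a file's size even when its content is intact.

```python
import os

# Sizes in bytes copied from the table of file sizes (the three *.w files only).
EXPECTED_BYTES = {
    "birth.w": 14_139_972,
    "death.w": 7_433_742,
    "mar.w": 4_104_496,
}


def size_matches(path, listed_name, expected=EXPECTED_BYTES):
    """Return True if the local file at `path` has the size listed for `listed_name`."""
    return os.path.getsize(path) == expected[listed_name]


# Example use, assuming the file was saved locally under its server name:
# size_matches("birth.w.txt", "birth.w")
```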

 

*.uni files. 21 files for 7 parishes x 3 events plus 3 codebooks

 

Parish/File Type     Event Type
                     Baptismal Data     Burial Data        Marriage Data

Codebook             birth.uni.format   death.uni.format   mar.uni.format

Bogičevci            bogic.birth.uni    bogic.death.uni    bogic.mar.uni
Cernik               cern.birth.uni     cern.death.uni     cern.mar.uni
Nova Gradiška        grad.birth.uni     grad.death.uni     grad.mar.uni
Oriovac              orio.birth.uni     orio.death.uni     orio.mar.uni
Staro Petrovo Selo   petro.birth.uni    petro.death.uni    petro.mar.uni
Štivice              stiv.birth.uni     stiv.death.uni     stiv.mar.uni
Vrbje                vrbje.birth.uni    vrbje.death.uni    vrbje.mar.uni



   


*.w files. 3 files, one for each event type across all 7 parishes, plus 3 codebooks

 

File Type   Event Type
            Baptismal Data   Burial Data      Marriage Data

Codebook    birth.w.format   death.w.format   mar.w.format
Data        birth.w          death.w          mar.w




                        

