Codes and Other Data for Nominal Data Linkage
Nominal data linkage is a process long used by genealogists and more recently
by historical demographers to reconstitute the demographic histories of
individuals and link them to one another (see Hammel, "Family
Reconstitution", Oxford Encyclopedia of Economic History, Oxford: Oxford University Press, 2003). Two
processes are ordinarily involved: standardization of the spelling of entries
used in linkage (which we call "tokenization"), and the linkage
itself.
Various schemes have been developed for tokenization; they are of course
sensitive to the particular languages involved. Because our work involved
materials written both in Latin (before about 1848) and Croatian (after about
1848), with orthographic and other influences from German and Hungarian, we
developed our own scheme. It is not perfect, especially where nicknames are
involved. Tokenized forms of baptismal names, surnames, and placenames are all
reduced to standard lengths if their original form is longer than the standard.
First names have 5 characters, surnames 10, and place names 5. Much of the
tokenization depended on the linguistic intuitions of the researchers, since
most variant spellings of a name are intuitively obvious to those with
sufficient knowledge of the languages. That said, it should be noted that
priests varied widely in the way they wrote their data, some variation being
attributable to the natural difficulties of rendering a Slavic language in an
orthography designed for Latin and adopted in Hungarian, and some to the
lamentable fact that many priests were less literate than one might have hoped.
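The length reduction described above can be sketched in a few lines of Python. This is only an illustration, not the scheme we actually used: it covers the truncation to standard lengths (5, 10, and 5 characters) and the removal of diacritics, and it does not attempt the standardization of variant spellings, which rested on the judgment of the researchers. The helper names, the dictionary keys, and the choice of upper case are assumptions made for the example.

```python
import unicodedata

# Standard token lengths stated in the text: first names 5, surnames 10,
# place names 5. The dictionary keys are hypothetical labels.
TOKEN_LENGTHS = {"fname": 5, "lname": 10, "place": 5}


def strip_diacritics(text: str) -> str:
    """Drop combining marks after decomposition (for example, 'Š' becomes 'S')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def truncate_token(name: str, kind: str) -> str:
    """Hypothetical helper: clean a name and cut it to the standard length.

    Upper-casing is an assumption made here so that comparisons ignore case;
    names shorter than the standard length are left at their own length.
    """
    cleaned = strip_diacritics(name).strip().upper()
    return cleaned[:TOKEN_LENGTHS[kind]]


if __name__ == "__main__":
    print(truncate_token("Bartholomaeus", "fname"))
    print(truncate_token("Kovačević", "lname"))
    print(truncate_token("Štivice", "place"))
```

Names shorter than the standard are not padded in this sketch; whether the actual files pad them is determined by the field lengths in the format files.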
The classic approaches in reconstitution developed (for demographers) by Louis
Henry involve starting with a marriage record and then finding the baptismal
records that can be plausibly linked to it. Further, one looks for the
baptismal and burial records of the spouses, as well as any remarriage records,
and the burial records of the children. We went further than this, looking for
marriages of the children, baptismal records of the spouses, and so on, since
we were interested in kinship beyond the nuclear family.
The database of baptismal, burial, and marriage records is large. After
discarding illegible records we used 23,307 marriages (1717-1864), 112,181
baptisms (1714-1898), and 94,077 burials (1717-1898). One sees immediately that
there were 34 years (1864-98) where baptisms and burials were recorded but no
marriages. Some individuals recorded in those terminal years cannot be linked
to their parents' marriages (for example the number of first births recorded in
those years declines because there were no new marriages recorded). The
transcribing of records was limited by a number of factors: parish books from
before 1848 were available in the central State Archive of Croatia, but those
from later dates had to be recovered from local police stations. There
were also limitations of time and money, and we concentrated on getting as much
information on the history of individual marriages as possible; if we had
gathered marriage data up to 1900, for example, we would have required
baptismal data up to 1950 and burial data up to 2000. The great
emigrations of the early 20th century, the first and second World Wars, the
civil wars of the 1990s, and the increasing importance of civil versus parish
recording would have made such an endeavor very difficult.
Because the database is so large, we were obliged to employ computers. The
first linkages were done by Ruth Deuel in Fortran, and the results were
examined manually by Deuel, Čapo and Hammel. Some rules and scoring systems
were employed for the resolution of ambiguities (such as 2 children of
identical names born on the same date to 2 different sets of parents), and
stubborn cases were resolved by hand. Some published analyses rested on these
data. At a later date, further work on computerized linkage was done by Marcia
Feitel, but manual resolution of ambiguities was still employed. Finally,
Hammel decided, in consultation with Carl Mason, that it would be preferable to
use a completely automated linkage algorithm, because the resolution of
ambiguities would leave a clear record in the algorithms themselves. To that
end, Hammel began writing and testing such algorithms in the Perl language,
which is well suited to the sophisticated manipulation of text. These
algorithms worked and gave results scarcely different from those achieved
earlier, but when run on the entire dataset, they were too slow (even if faster
than manual work). Aaron Gullickson improved the efficiency of the scripts. In
general, no record was linked to another unless 3 of 4 tokenized personal names
matched. For example, a child's baptism was not linked to its parents' marriage
unless 3 of the 4 parental first and last names on the baptismal record matched
the names of the spouses on the marriage record. (We note in passing that a
match of 4 is possible, since it was not unusual for the maiden name of the
wife to be given on the baptismal certificate of her early children, reflecting
the way in which her identity was known to midwives and priests. Other
ethnographic data, admittedly from similar populations in Serbia, suggest that
a woman was only gradually incorporated socially into her husband’s kin group, and
that this incorporation was symbolized by how she was identified.)
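The 3-of-4 rule can be illustrated with a short Python sketch. It is not the Perl code used for the project; the `Couple` type, the function names, and the treatment of blank names (a blank never counts as a match) are assumptions made for the example. Names are taken to be already tokenized.

```python
from typing import NamedTuple


class Couple(NamedTuple):
    """Four tokenized personal names for a husband and wife (hypothetical layout)."""
    husband_fname: str
    husband_lname: str
    wife_fname: str
    wife_lname: str


def count_matches(a: Couple, b: Couple) -> int:
    """Count positions where both names are present and identical."""
    return sum(1 for x, y in zip(a, b) if x and y and x == y)


def may_link(baptism_parents: Couple, marriage_spouses: Couple,
             threshold: int = 3) -> bool:
    """Allow a link only if at least `threshold` of the 4 names agree."""
    return count_matches(baptism_parents, marriage_spouses) >= threshold


if __name__ == "__main__":
    marriage = Couple("PETAR", "KOVACEVIC", "MARIA", "KOSTIC")
    # Mother recorded under her husband's surname: 3 of 4 names agree.
    later_child = Couple("PETAR", "KOVACEVIC", "MARIA", "KOVACEVIC")
    # Mother recorded under her maiden name: all 4 names agree.
    early_child = Couple("PETAR", "KOVACEVIC", "MARIA", "KOSTIC")
    print(may_link(later_child, marriage), may_link(early_child, marriage))
```

A record that passes this test is only a candidate for linkage; the rules and scoring used to resolve ambiguities among candidates are not shown here.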
Basic Data Files and Codebooks
The original
files of baptisms, marriages, and burials were of disparate format. The
earliest were simply unformatted notations written as linear descriptions of a
ceremony, such as “Today, 25th February, 1722, I, Jovan, baptized
Anica, daughter of Petar Kovacevic and Maria Kostic.” By the end of the century these statements had become standardized,
and indeed formatted, and eventually the record books were ruled in columns and
rows for the entry of the required data. Our data entry procedures evolved to
include newly specified information (such as the legitimacy of a birth) and the
formatting of the source documents.
As a first step in analysis we standardized the formats of the various
underlying files: baptisms, marriages, and burials, of course using the most
complex form of each kind of file as a template. These are the *.uni files. There are 21 *.uni files, one for
each kind of ceremony in each of seven parishes. Some of the information in
those files was of no importance to record linkage, for example the name of the
recording priest. The *.uni files were reduced to the *.w files, which contained the information
essential to nominal data linkage. The *.w files contain a link back to the
original *.uni file for each record. There are three *.w files, one for each
kind of ceremony, each containing the records from all seven parishes.
Names in the
*.uni files are written as in the original data and are not tokenized. The
modern Croatian characters for consonants with diacritics are not used,
since they do not appear in many of the earlier records and were not
available for key-to-disk data entry when most of the data were coded. Spelling
conventions derived from Latin, German, or Hungarian were commonly used for
some consonants in the original data. The coders were all native speakers of
Croatian, and they used the modern consonant spellings if those were in the
data but stripped the diacritics. (The absence of the diacritics poses no
problems to speakers of the language.) Names in the *.w files are tokenized, 5
characters for baptismal and place names, 10 characters for surnames; however,
some untokenized names are left in the *.w files at their original length, in
special fields. These will be clear from the field lengths in the *.w.format
files. "fname" means first or baptismal name, "lname" means
last or surname. Records in *.w files contain a pointer to the original record
in the relevant *.uni file.
The .uni and .w
files are so named on these web pages. Users will note, however, that they are
specified in the hyperlinks as .txt files (for example, cern.birth.uni.txt) and are actually so named on this server.
The .txt suffix was employed so that they would be visible, downloadable, and
usable under most operating systems. (Mac OS X, for example, interprets a *.uni
file as a unix executable file, not as a text file.)
The seven
parishes lie northeast of the Sava River in a plain about 10 km from the river,
up to the adjacent hills, from Bogičevci at the northwest (near Jasenovac) to
Oriovac at the southeast (near Slavonski Brod). Six were in the former Military
Border of Croatia, but Cernik was in Civil Croatia. The parishes, with their
name codes, are:
| Name | Code1 | Code2 |
|---|---|---|
| Bogičevci | bogic | B |
| Cernik | cern | C |
| Nova Gradiška | grad | G |
| Oriovac | orio | O |
| Staro Petrovo Selo | petro | P |
| Štivice | stiv | S |
| Vrbje | vrbje | V |
"Book"
means the catalog number of the original record book in the Archive of Croatia.
"Page" is the page in that book.
Binary (yes/no)
variables such as the legitimacy of a birth are coded 1 for positive, 0 for negative.
Other codes are indicated in the format files.
The website files
do not include the "comment" codes, which were primarily used to
alert us to problems during data entry.
"Parish"
in the *.uni files is inherent in the file name. In the *.w files it is
indicated by the initial letter of the parish name.
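Because a record in a *.w file carries only the parish letter, going back to the source file means translating that letter into the parish prefix of the *.uni file name. A minimal Python sketch follows, using the codes from the table of parishes and the .txt suffix used on this server. The function names are hypothetical, and how the pointer to the original record is stored within a *.w record is not assumed here; consult the *.w.format files for that.

```python
# Code2 (parish letter used in *.w files) mapped to Code1 (file-name prefix).
PARISH_CODES = {
    "B": "bogic",
    "C": "cern",
    "G": "grad",
    "O": "orio",
    "P": "petro",
    "S": "stiv",
    "V": "vrbje",
}

# Event names as they appear in the file names.
EVENTS = ("birth", "death", "mar")


def uni_filename(parish_letter: str, event: str) -> str:
    """Hypothetical helper: build the server file name of a *.uni file."""
    if event not in EVENTS:
        raise ValueError(f"unknown event type: {event!r}")
    prefix = PARISH_CODES[parish_letter.upper()]
    return f"{prefix}.{event}.uni.txt"


def all_uni_filenames() -> list[str]:
    """List every parish-by-event combination (7 parishes x 3 events)."""
    return [uni_filename(letter, event)
            for letter in PARISH_CODES for event in EVENTS]


if __name__ == "__main__":
    print(uni_filename("C", "birth"))
    print(len(all_uni_filenames()))
```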
Users please note
that parishes begin and end recording for different events at different times
and that there are occasional gaps in the data where events were not recovered.
Downloading Files
When you display
a data file in your browser, you can download it by saving it to a local file.
All data files are straight ASCII text files. They vary in size from less than
a kilobyte to more than 14 megabytes. Consult the table below before
downloading. Read the *.format files first; these are .htm files.
File Sizes (bytes)

| File | Bytes |
|---|---|
| Formats | |
| birth.uni.format | 759 |
| birth.w.format | 855 |
| death.uni.format | 633 |
| death.w.format | 812 |
| mar.uni.format | 1,064 |
| mar.w.format | 1,381 |
| *.w Files | |
| birth.w | 14,139,972 |
| death.w | 7,433,742 |
| mar.w | 4,104,496 |
| *.uni Files | |
| bogic.birth.uni | 1,413,024 |
| bogic.death.uni | 991,924 |
| bogic.mar.uni | 300,800 |
| cern.birth.uni | 8,672,484 |
| cern.death.uni | 4,191,984 |
| cern.mar.uni | 2,053,760 |
| grad.birth.uni | 6,636,854 |
| grad.death.uni | 4,132,816 |
| grad.mar.uni | 1,653,120 |
| orio.birth.uni | 3,583,728 |
| orio.death.uni | 2,168,576 |
| orio.mar.uni | 1,122,560 |
| petro.birth.uni | 3,699,348 |
| petro.death.uni | 2,396,992 |
| petro.mar.uni | 1,299,520 |
| stiv.birth.uni | 1,164,072 |
| stiv.death.uni | 888,552 |
| stiv.mar.uni | 430,405 |
| vrbje.birth.uni | 2,437,122 |
| vrbje.death.uni | 1,498,120 |
| vrbje.mar.uni | 566,080 |
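For users who want to gauge a download before starting it, the byte counts in the table can be added up and converted to megabytes. The short Python sketch below does this for the three *.w files only; the figures are copied from the table, while the variable and function names are made up for the example.

```python
# Byte counts of the three *.w files, copied from the table of file sizes.
W_FILE_BYTES = {
    "birth.w": 14_139_972,
    "death.w": 7_433_742,
    "mar.w": 4_104_496,
}


def megabytes(n_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes)."""
    return n_bytes / 1_000_000


total_w_bytes = sum(W_FILE_BYTES.values())

if __name__ == "__main__":
    for name, size in W_FILE_BYTES.items():
        print(f"{name}: {megabytes(size):.1f} MB")
    print(f"all *.w files: {megabytes(total_w_bytes):.1f} MB")
```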
*.uni files. 21 files for 7 parishes x 3 events plus 3 codebooks
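| Parish / Event Type | Baptismal Data | Burial Data | Marriage Data |
|---|---|---|---|
| Codebook | birth.uni.format | death.uni.format | mar.uni.format |
| Bogičevci | bogic.birth.uni | bogic.death.uni | bogic.mar.uni |
| Cernik | cern.birth.uni | cern.death.uni | cern.mar.uni |
| Nova Gradiška | grad.birth.uni | grad.death.uni | grad.mar.uni |
| Oriovac | orio.birth.uni | orio.death.uni | orio.mar.uni |
| Staro Petrovo Selo | petro.birth.uni | petro.death.uni | petro.mar.uni |
| Štivice | stiv.birth.uni | stiv.death.uni | stiv.mar.uni |
| Vrbje | vrbje.birth.uni | vrbje.death.uni | vrbje.mar.uni |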
*.w files. 3 files, one for each event type across all 7 parishes, plus 3 codebooks
| File Type / Event Type | Baptismal Data | Burial Data | Marriage Data |
|---|---|---|---|
| Codebook | birth.w.format | death.w.format | mar.w.format |
| Data | birth.w | death.w | mar.w |