README
Aaron Gullickson
5/11/2001
UPDATED: 01/14/2002
(Minor edits by Hammel in May 2008)
This readme is intended as an overview of all the files employed in the
family reconstitution of the Croatian dataset. The Perl and Splus scripts
each contain further instructions and directions.
THE LINKING PROGRAMS
The Perl programs read the raw data files sortedbirths, sortedmars, and
sorteddeaths, which contain the parish records for these events, taken from
the relevant *.w files and sorted by date of event. Since these sorted*
files are redundant, they are not posted on the web page. The Perl programs
use this raw data to perform linkages, constructing family histories.
Six Perl programs are used to do this. Their names indicate their function:
for example, b2m links births to marriages, m2d links marriages to deaths,
and so on. In general, they follow a similar pattern. Hashes are built from
the relevant raw data sets using names as the keys. Entries are then matched
across two kinds of records (e.g. birth and marriage), and each candidate
match is scored on various factors (same parish, age match, etc.). In
addition, some matches are rejected outright because they would conflict
with previous matches (a woman dying before she gives birth, for example).
Three of the five linking programs follow this format. The remaining two
(m2m and m2d) simply impute the links from Marcia Feitel's previous work on
remarriage links and marriage-to-death links. The sixth program,
createdatasets.pl, assembles the results into life histories. I list each of
the programs below in the order they should be run and describe them
briefly. I do not describe the explicit scoring routines or values here, as
we may adjust them later. You should examine the actual Perl scripts for
that information.
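The general pattern described above can be sketched roughly as follows. This
is an illustration in Python, not the Perl code itself. The record fields
(first, last, parish, year), the scoring weights, and the age bounds are all
hypothetical; the real factors and values are in the Perl scripts.

```python
from collections import defaultdict

def index_by_name(records):
    """Build a hash (dict) from a name key to all records carrying that name."""
    index = defaultdict(list)
    for rec in records:
        index[(rec["first"], rec["last"])].append(rec)
    return index

def score_pair(birth, marriage):
    """Hypothetical scoring: one point per agreeing factor."""
    score = 0
    if birth["parish"] == marriage["parish"]:
        score += 1                      # same parish
    age = marriage["year"] - birth["year"]
    if 15 <= age <= 50:                 # illustrative age bounds only
        score += 1                      # plausible age at marriage
    return score

def candidate_matches(births, marriages):
    """Return (birth id, marriage id, score) for every name-matched pair."""
    marriages_by_name = index_by_name(marriages)
    candidates = []
    for b in births:
        for m in marriages_by_name.get((b["first"], b["last"]), []):
            if m["year"] <= b["year"]:
                continue                # rejected outright: impossible ordering
            candidates.append((b["id"], m["id"], score_pair(b, m)))
    return candidates
```

Keying the hash on names means each birth is only compared against the few
marriages that share its name, rather than against every marriage record.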
m2b.pl - This script looks at birth records and, for each birth, attempts to
find the marriage from which it came. It does this primarily by matching at
least three of the four names of the spouses in the marriage against the
names of the parents in the birth record. It is highly reliable, since
combinations of three or four names are seldom repeated; that is why it is
run first. m2b.pl also links together children whose parents cannot be found
in the marriage records but who likely share the same parents.
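As a rough illustration of the three-of-four name rule, the sketch below (in
Python, not taken from m2b.pl) compares the four names position by position.
Which four names are used and the order they are held in are assumptions
here; I assume husband's first name, husband's surname, wife's first name,
and wife's surname.

```python
def count_name_matches(marriage_names, parent_names):
    """Count positions at which both names are present and identical."""
    return sum(
        1
        for m, p in zip(marriage_names, parent_names)
        if m and p and m == p
    )

def is_candidate(marriage_names, parent_names, required=3):
    """True if at least `required` of the four names agree."""
    return count_name_matches(marriage_names, parent_names) >= required

# Hypothetical example: the wife's surname is spelled differently in the
# birth record, but the other three names agree.
marriage = ("Ivan", "Horvat", "Mara", "Kovac")
parents = ("Ivan", "Horvat", "Mara", "Kovacic")
print(is_candidate(marriage, parents))
```

Missing names (empty strings or None) never count as agreeing, so a record
with two blank names cannot reach the threshold of three.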
b2m.pl - This script attempts to match births to their subsequent marriages.
It matches on names and scores individuals on a variety of factors. In
addition, it rejects matches in which the marriage would occur before or
after certain ages (see the script for the exact ages, as we may adjust
them) and rejects matches that contradict the matches from m2b.pl (we do
allow some "shotgun" weddings if the birth occurs within a reasonable span
before the marriage).
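The age check in b2m.pl might look something like the following Python
sketch. The date representation and the function names are assumptions, and
the bounds in the example call are placeholders only; the actual ages are
set in the script.

```python
from datetime import date

def age_in_years(birth, event):
    """Completed years between a birth date and a later event date."""
    before_birthday = (event.month, event.day) < (birth.month, birth.day)
    return event.year - birth.year - int(before_birthday)

def within_age_window(birth, marriage, min_age, max_age):
    """True if the age at marriage falls inside the allowed window."""
    return min_age <= age_in_years(birth, marriage) <= max_age

# Placeholder bounds, for illustration only.
print(within_age_window(date(1800, 1, 1), date(1825, 1, 1), min_age=15, max_age=60))
```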
m2m.pl - This script simply goes through Marcia's remarriage links in
combodat2 and assigns them if they do not contradict our earlier matches
(i.e. if they do not cut off childbearing from the previous marriage too
early).
b2d.pl - This script attempts to match births to their own deaths. It
follows the classic matching and scoring routine and rejects matches that
would place a death before a marriage or before the last linked birth for
the person (or more than 9 months before the last linked birth, for men).
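The rejection rule for death links can be sketched as below. This is a
Python illustration under stated assumptions, not the code in b2d.pl: dates
are held as date objects, sex is coded "M" or "F", and 9 months is
approximated as 270 days.

```python
from datetime import date, timedelta

NINE_MONTHS = timedelta(days=270)  # assumption: 9 months taken as 270 days

def death_is_consistent(death, sex, last_marriage=None, last_child_birth=None):
    """Return False if a proposed death date conflicts with earlier links."""
    # A person cannot die before his or her own marriage.
    if last_marriage is not None and death < last_marriage:
        return False
    if last_child_birth is not None:
        # A father may die up to 9 months before his last child is born;
        # a mother must be alive at the birth itself.
        if sex == "M":
            earliest_death = last_child_birth - NINE_MONTHS
        else:
            earliest_death = last_child_birth
        if death < earliest_death:
            return False
    return True

# Hypothetical case: a man who dies five months before his last child's birth.
print(death_is_consistent(date(1850, 1, 1), "M", last_child_birth=date(1850, 6, 1)))
```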
createdatasets.pl - This program reads the output from the previous
programs, constructs a life history for each person, and writes these to a
file called croatdata.txt.
m2d.pl - This script cycles through croatdata.txt and, where a death record
is missing and can be filled in from Marcia's marriage-to-death links,
assigns it. The resulting data file is called croatdata2.txt.
There is also a seventh Perl program, lastevent.pl, which assigns the last
event for each person. It creates a new data file called croatdata3.txt.
croatdata4.txt is created by an eighth program, addgptoloe.pl, which uses an
individual's last recorded service as a godparent or marriage witness as the
last event if that service occurs later than the last observed event
assigned by lastevent.pl.
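The rule applied by addgptoloe.pl amounts to taking the later of two dates.
A minimal Python sketch, assuming dates are comparable date objects and that
a person with no recorded service is represented by None (both assumptions
of this illustration):

```python
from datetime import date
from typing import Optional

def final_last_event(last_observed: date, last_service: Optional[date] = None) -> date:
    """Use the godparent/witness service date only if it is later."""
    if last_service is not None and last_service > last_observed:
        return last_service
    return last_observed

# Hypothetical person last observed in 1840 who served as a witness in 1846.
print(final_last_event(date(1840, 5, 2), date(1846, 9, 30)))
```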
Each of these programs uses a subroutine file called generalsubs.pm, which
contains some general subroutines. In addition, there is a matchrecs
subroutine at the bottom of m2b, b2m, and b2d.
The entire matching routine can be run with one command from the directory
containing all the relevant code and data files:
./runmatches
This will overwrite any previous output.
OTHER FILES
The other files are designed to check the validity of the matching and
compare it to Marcia's findings.
OUTPUT FILES
Each of the five principal programs produces a file called *.diag.txt (where
* is b2m, b2d, etc.). This file contains diagnostics from the matching
routine. All except m2d also produce files containing the final matches,
called *.matches.txt (b2m actually produces two: b2m.mmatches.txt,
referenced by the marriage, and b2m.bmatches.txt, referenced by the birth).
The format of each of these is given below:
------------------------------------------------------------------------
b2d.matches.txt
birth id   death id   score   age at death   deathdate
------------------------------------------------------------------------
m2b.matches.txt
marriage id*   last birthdate   # of kids   birthid1   birthid2   ...
*marriage id above 30000 indicates a kinset with missing parents
------------------------------------------------------------------------
b2m.mmatches.txt
mar. id   husband's bid   wife's bid   husband's age   wife's age   score
------------------------------------------------------------------------
b2m.bmatches.txt
bid   marriage id   score   age   sex
------------------------------------------------------------------------
m2m.matches.txt
marriage id   spouse type   remarriage id   ?
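As an example of reading one of these files, the Python sketch below parses
a single line in the m2b.matches.txt layout. It assumes the fields are
separated by whitespace and that the last birthdate is a single token;
neither is confirmed here, so check an actual output file first. The sample
line is made up.

```python
KINSET_THRESHOLD = 30000  # marriage ids above this mark kinsets with missing parents

def parse_m2b_line(line):
    """Split one m2b.matches.txt line into named fields."""
    fields = line.split()
    marriage_id = int(fields[0])
    n_kids = int(fields[2])
    return {
        "marriage_id": marriage_id,
        "last_birthdate": fields[1],           # kept as text; format not assumed
        "n_kids": n_kids,
        "birth_ids": fields[3:3 + n_kids],     # one id per linked child
        "is_kinset": marriage_id > KINSET_THRESHOLD,
    }

# Made-up line: a kinset with two linked births.
print(parse_m2b_line("30012 18450312 2 501 502"))
```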
In addition, m2b, b2m, and b2d produce further files. *.match.prelim.txt
contains the scores for all potential matches. These are then sorted and the
highest score for each potential match is selected. *.ties.txt contains any
ties in scoring that occurred.
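The selection step can be pictured with the following Python sketch. It
assumes each preliminary match is a (source id, target id, score) triple and
that tied top scores are set aside rather than resolved; how the Perl
scripts actually handle ties is not described here, so treat that part as an
assumption.

```python
from collections import defaultdict

def select_best(prelim):
    """Pick the top-scoring target for each source; report ties separately."""
    by_source = defaultdict(list)
    for source_id, target_id, score in prelim:
        by_source[source_id].append((score, target_id))

    matches, ties = {}, {}
    for source_id, candidates in by_source.items():
        top = max(score for score, _ in candidates)
        winners = sorted(t for score, t in candidates if score == top)
        if len(winners) == 1:
            matches[source_id] = (winners[0], top)
        else:
            ties[source_id] = (winners, top)   # analogous to *.ties.txt
    return matches, ties

# Made-up scores: record 1 has a clear winner, record 2 has a tie.
prelim = [(1, 10, 5), (1, 11, 3), (2, 20, 4), (2, 21, 4)]
matches, ties = select_best(prelim)
print(matches)
print(ties)
```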
The final output files are croatdata.txt, croatdata2.txt, croatdata3.txt,
and croatdata4.txt.
croatdata.txt  - before m2d links are added
croatdata2.txt - m2d links added, but missing last observed event
croatdata3.txt - m2d links and last observed event added
croatdata4.txt - service as godparent or marriage witness added as possible
                 last observed event
OTHER INPUT/OUTPUT FILES
The same output directory also contains the initial input datasets. These
are sortedbirths, sortedmars, sorteddeaths, and combodat2.txt. combodat2.txt
contains Marcia's links and is the same as combodat1.txt, except that empty
cells have been replaced by an explicit "NA".
The other files all relate to the Splus programs used to verify and check
the links. I will not discuss them here.
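For reference, the combodat1.txt to combodat2.txt conversion mentioned above
(empty cells replaced by an explicit "NA") could be reproduced along these
lines. This is a Python sketch that assumes the file is tab-delimited; the
actual delimiter and the tool originally used are not recorded in this
readme.

```python
def fill_empty_cells(line, delimiter="\t", placeholder="NA"):
    """Replace empty or blank cells in one delimited line with a placeholder."""
    cells = line.rstrip("\n").split(delimiter)
    return delimiter.join(c if c.strip() else placeholder for c in cells)

# Made-up row with two empty cells.
print(fill_empty_cells("12\t\t7\t\n"))
```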