Pool selection (potential duplicates)
A pool of potential duplicates is obtained from the database by searching on the following data elements from the
incoming record:
- LCCN (010 a - normalized)
- The LCCN has the following format:
010 ##$a###79139101#/AC/MN
The index should include only the numeric portion. Skip any leading non-numeric characters, and index to either the first blank or the end of the subfield.
- ISBN (020 a - normalized)
- The ISBN in MARC records has the following format:
020 ##$a0394502884 (Random House) :$c$12.50
020 a 074253779X (pbk. : alk. paper)
020 a 9780742537798 (pbk. : alk. paper)
The ISBN is at the beginning of the subfield, but other data can follow. Select the first string in the subfield, including the ending "X" if it is there. Because of the recent change to a 13-character ISBN, it is probably best to convert all incoming ISBNs to ISBN-13. The alternative is to allow an embedded match between the first 9 digits of the shorter ISBN and the longer one.
- First 25 characters of the title (245 a, b - normalized)
- Normalization generally removes subfield coding, punctuation, and diacritics. Depending on how you have handled the conversion to Unicode, the removal of diacritics may not be possible.
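The three search keys above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, "#" in the printed LCCN example is read as a blank, and lowercasing the title is an assumption that the text does not state.

```python
import re
import unicodedata


def normalize_lccn(subfield_a: str) -> str:
    """Return the 010 $a value from the first digit up to the first blank."""
    text = subfield_a.replace("#", " ")   # '#' is read as a blank (assumption)
    text = re.sub(r"^\D+", "", text)      # skip leading non-numeric characters
    tokens = text.split()
    return tokens[0] if tokens else ""


def isbn_to_13(subfield_a: str) -> str:
    """Take the first string of an 020 $a and return it as an ISBN-13, or ''."""
    tokens = subfield_a.split()
    if not tokens:
        return ""
    isbn = re.sub(r"[^0-9Xx]", "", tokens[0]).upper()
    if len(isbn) == 13 and isbn.isdigit():
        return isbn
    if len(isbn) == 10 and isbn[:9].isdigit():
        core = "978" + isbn[:9]           # drop the old check character
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return ""


def title_key(title: str, length: int = 25) -> str:
    """Strip diacritics and punctuation, then keep the first `length` characters."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]|_", " ", no_marks.lower())  # lowercasing is an assumption
    return " ".join(cleaned.split())[:length]
```

Converting every ISBN to its 13-character form keeps the index to a single key per number, which avoids the embedded-match alternative mentioned above.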
We should add OCLC numbers to this search. They are a bit more complex because there is no standard place to store them in MARC records. An OCLC number can be identified by the use of "OCoLC" in conjunction with the number. The number may carry a prefix of "ocm" or "ocl7"; remove that prefix from the beginning of the number for both searching and matching. OCLC numbers are most commonly found in two places:
- In the 001/003 of the MARC record
001 62525112
003 OCoLC
- Or in an 035 field
035 __ |a (OCoLC)ocm51050179
In this latter case, the OCLC number is identified by the (OCoLC) preceding the number.
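A short Python sketch of pulling OCLC numbers from the two locations just described. The record layout (a dictionary with "001", "003" and a list under "035a") is a hypothetical stand-in for whatever MARC parser is in use, and only the two prefixes named in the text are stripped.

```python
import re

OCLC_PREFIXES = ("ocm", "ocl7")  # the prefixes named in the text


def strip_oclc_prefix(value: str) -> str:
    """Remove a leading 'ocm' or 'ocl7' from an OCLC number."""
    value = value.strip()
    for prefix in OCLC_PREFIXES:
        if value.lower().startswith(prefix):
            return value[len(prefix):]
    return value


def oclc_numbers(record: dict) -> list:
    """Collect OCLC numbers from 001/003 and from 035 $a values.

    `record` is a hypothetical layout:
    {"001": "...", "003": "...", "035a": ["...", ...]}
    """
    found = []
    # Case 1: the control number belongs to OCLC when 003 says OCoLC.
    if record.get("003", "").strip() == "OCoLC" and record.get("001"):
        found.append(strip_oclc_prefix(record["001"]))
    # Case 2: an 035 $a whose value starts with (OCoLC).
    for value in record.get("035a", []):
        match = re.match(r"\s*\(OCoLC\)\s*(\S+)", value)
        if match:
            found.append(strip_oclc_prefix(match.group(1)))
    return list(dict.fromkeys(found))  # drop repeats, keep first-seen order
```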
When all of these searches have been done, you have the pool of records against which the merge algorithm will act. Depending on how searching is done, you may have the same record in the pool more than once. If possible, eliminate those duplicates (or design the search and pool so that they don't happen).
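One way to keep repeats out of the pool is to key it on a record identifier as the search results come in. The sketch below assumes each search returns (record_id, record) pairs; that shape and the function name are hypothetical.

```python
def build_pool(search_results):
    """Merge the hits of several searches into one pool without repeats.

    `search_results` is an iterable of hit lists, one per search
    (LCCN, ISBN, title, OCLC number); each hit is (record_id, record).
    """
    pool = {}
    for hits in search_results:
        for record_id, record in hits:
            pool.setdefault(record_id, record)  # keep the first copy only
    return list(pool.values())
```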
Level 1 Merge
The merge takes place in two stages. The first level merge allows records to merge on a limited algorithm. This merge is
only possible when both the incoming record and the pool record contain identifiers (LCCN, ISBN, or OCLC number). A threshold weight is
set for each format in the global parameters table. If a record receives a weight below the threshold, it is not considered
a match and the record proceeds to Level 2. If the record receives a weight equal to or above the threshold, it is
considered a match.
The threshold for monograph record merging is 875 points (weights are listed in Appendix A). The Level 1 merge is done
on:
- LCCN/ISBN/OCLC# [If more than one number is available, the higher points value is used.]
- Publication date (from 008 pos. 07-10)
- First 25 characters of the normalized title (245 a, b combined)
For efficiency, it is probably best to test each record in the pool against the Level 1 algorithm, stopping when a match has been found. This ignores multiple matches in the pool, but if the database is seeded with a non-duplicative source (e.g., Library of Congress records) and each new source is matched against the database, the number of duplicate matches should be low (and may even be suspect).
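The Level 1 pass might be sketched as below, using the LCCN, ISBN, date and short-title weights from Appendix A. Several points are assumptions: records are plain dictionaries with hypothetical keys ("lccn", "isbn", "year", "title") that already hold normalized values, "higher points value" is read as the maximum over identifiers present in both records, and OCLC numbers are left out because Appendix A lists no weight for them.

```python
# (match, no-match) weights copied from Appendix A.
ID_WEIGHTS = {"lccn": (200, -320), "isbn": (85, -225)}


def identifier_weight(incoming: dict, candidate: dict):
    """Best weight over identifiers present in both records, or None."""
    weights = []
    for name, (match, mismatch) in ID_WEIGHTS.items():
        a, b = incoming.get(name), candidate.get(name)
        if a and b:
            weights.append(match if a == b else mismatch)
    return max(weights) if weights else None


def date_weight(a, b) -> int:
    a, b = (a or "").strip(), (b or "").strip()
    if not a or not b:
        return 0                      # value missing
    if a == b:
        return 200                    # exact match
    if a.isdigit() and b.isdigit() and abs(int(a) - int(b)) <= 2:
        return -25                    # +/-2 years, as listed in Appendix A
    return -250                       # no match


def short_title_weight(a: str, b: str) -> int:
    return 450 if a and a[:25] == b[:25] else 0


def level1_weight(incoming: dict, candidate: dict):
    """Total Level 1 weight, or None when no identifier is shared."""
    ids = identifier_weight(incoming, candidate)
    if ids is None:
        return None                   # Level 1 needs identifiers on both sides
    return (ids
            + date_weight(incoming.get("year"), candidate.get("year"))
            + short_title_weight(incoming.get("title", ""), candidate.get("title", "")))


def first_level1_match(incoming: dict, pool: list, threshold: int):
    """Return the first pool record at or above the threshold, else None."""
    for candidate in pool:
        weight = level1_weight(incoming, candidate)
        if weight is not None and weight >= threshold:
            return candidate
    return None
```

The threshold is passed in rather than hard-coded because the text says it is set per format in the global parameters table.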
Level 2 Merge
If a merge is not obtained at Level 1, then Level 2 steps are performed.
Full title match (245 a, b - normalized)
The title gets different values depending on how "perfect" the match is.
- Exact match (whole string matches)
- Embedded match (one title is embedded in the other, left-anchored)
- Keyword match (a percentage of the keywords that match between the titles, with additional points for having the keywords
in the same order in the title)
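The three kinds of title match can be sketched with the Full-Title weights from Appendix A. The order in which the tests are applied, the choice of denominator for the keyword percentage (the larger of the two keyword sets), and the treatment of "no keywords in common" as a non-match are all assumptions; the text does not settle them.

```python
def same_order(words_a, words_b, common) -> bool:
    """True when the shared keywords appear in the same order in both titles."""
    order_a = [w for w in dict.fromkeys(words_a) if w in common]
    order_b = [w for w in dict.fromkeys(words_b) if w in common]
    return order_a == order_b


def full_title_weight(a: str, b: str) -> int:
    """Weight two normalized 245 a,b strings against each other."""
    if not a or not b:
        return 0
    if a == b:
        return 600                            # exact match
    if len(a) < 9 or len(b) < 9:
        return 0                              # either title shorter than 9 characters
    if a.startswith(b) or b.startswith(a):
        return 350                            # embedded match, left-anchored
    words_a, words_b = a.split(), b.split()
    common = set(words_a) & set(words_b)
    if not common:
        return -600                           # non match
    denominator = max(len(set(words_a)), len(set(words_b)))  # assumption
    weight = round(450 * len(common) / denominator)
    if same_order(words_a, words_b, common):
        weight += 50
    return weight
```

For example, "modern art in europe" against "modern painting in europe today" shares three of five keywords in the same order, giving 270 + 50 = 320 under these assumptions.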
Country of publication
This is an exact match between values from MARC 008 pos. 15-17.
Main Entry (100, 110, 111 - normalized)
Comparison of 1XX fields from the records. Since not all records have 1XX fields, there are default values assigned when
one record has a 1XX and one doesn't, and when both are missing the 1XX.
- Exact match
- Keyword match (a percentage, useful mainly for 110 and 111 fields, which are corporate authors and conference names)
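A sketch of the main entry comparison using the Main Entry weights and the second footnote of Appendix A. It assumes both strings are already normalized, that an empty string stands for a missing 1XX field, that the percentage is taken against the larger keyword set, and that fewer than half the keywords in common counts as a non-match.

```python
def main_entry_weight(a: str, b: str) -> int:
    """Weight two normalized 1XX strings; '' means the field is missing."""
    if not a and not b:
        return 75                     # fields missing from both records
    if not a or not b:
        return -25                    # field missing from one record
    if a == b:
        return 125                    # exact match
    words_a, words_b = a.split(), b.split()
    common = set(words_a) & set(words_b)
    denominator = max(len(set(words_a)), len(set(words_b)))  # assumption
    if 2 * len(common) < denominator:
        return -200                   # fewer than half in common: non match
    weight = round(80 * len(common) / denominator)
    # Add 10 when the shared keywords appear in the same order in both entries.
    order_a = [w for w in dict.fromkeys(words_a) if w in common]
    order_b = [w for w in dict.fromkeys(words_b) if w in common]
    if order_a == order_b:
        weight += 10
    return weight
```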
Pagination
Pagination is derived from the 300 $a field using the highest number found in that string. Pagination values are
only used if they are greater than 10.
Examples:
300 $a viii, 235 p. : = 235
300 $a xxvi, 468 p., [32] p. of plates = 468
300 $a 374 p. : = 374
300 $a 4 v. in one box : = 4
- Exact match
- Match within +/- 10
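The pagination rule can be sketched as follows. The weights follow the Pagination rows of Appendix A, including the rows for values below 10; treating a value of exactly 10 like the smaller case and giving a missing value a weight of 0 are assumptions, since neither case is listed.

```python
import re


def pagination(subfield_a: str):
    """Return the highest number in a 300 $a string, or None if there is none."""
    numbers = [int(n) for n in re.findall(r"\d+", subfield_a)]
    return max(numbers) if numbers else None


def pagination_weight(a, b) -> int:
    if a is None or b is None:
        return 0                              # missing value (assumption)
    if a == b:
        return 100 if a > 10 else 50          # exact match
    if abs(a - b) <= 10:
        return 50 if min(a, b) > 10 else 20   # match within 10
    return -225                               # non match (by more than 10)
```

Roman-numeral preliminaries such as "viii" or "xxvi" contain no digits, so they drop out of the comparison without special handling.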
Publisher
Publisher names in the 260 b field are not highly normalized by library cataloging so this field is used only in rare
cases where a match has not been attained up to this point.
- Exact match
- Embedded match
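A minimal sketch of the publisher comparison with the Publisher weights from Appendix A. It assumes both names are already normalized and that an empty string stands for a missing 260 b.

```python
def publisher_weight(a: str, b: str) -> int:
    """Weight two normalized 260 b strings; '' means the field is missing."""
    if not a or not b:
        return 0                      # either missing
    if a == b or a in b or b in a:
        return 100                    # exact match, or one occurs within the other
    return -25                        # non match
```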
Weights
The table of weights for our program allows us to assign negative and positive weights, with up to 5 different positive
weights (not counting zero).
Record Merge Algorithm for ONIX records
ONIX records appear to have only a small number of fields filled in, so we can assume that an incoming ONIX record will only use the Level 1 match algorithm. ONIX records in the database should not match to each other, since each record represents a single publisher's edition of a work. ONIX records may match to MARC records, and getting this match to work should be our goal.
Data Elements
ISBN
ONIX records will probably have both a 10-character and a 13-character ISBN. Here is an example:
<productidentifier>
<b221>02</b221>
<b244>0002154129</b244>
</productidentifier>
<productidentifier>
<b221>03</b221>
<b244>9780002154123</b244>
</productidentifier>
The field coded "b221" with a value of "02" is the 10-character ISBN. The field coded "b221" with a value of "03" is the 13-character ISBN (which is equivalent to an EAN). Depending on how the software is handling ISBNs, either or both of these need to be stored and indexed.
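Reading both ISBNs out of an ONIX fragment could look like the sketch below, which uses Python's standard xml.etree.ElementTree. The wrapping element passed in, the function name and the returned key names are hypothetical.

```python
import xml.etree.ElementTree as ET

ID_TYPES = {"02": "isbn10", "03": "isbn13"}  # b221 codes described in the text


def onix_isbns(product_xml: str) -> dict:
    """Return {'isbn10': ..., 'isbn13': ...} for the identifiers present."""
    root = ET.fromstring(product_xml)
    found = {}
    for identifier in root.iter("productidentifier"):
        id_type = (identifier.findtext("b221") or "").strip()
        value = (identifier.findtext("b244") or "").strip()
        if id_type in ID_TYPES and value:
            found[ID_TYPES[id_type]] = value
    return found
```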
Title
The title is in the "title" field, labeled "b203". There are various kinds of titles that can be found in the ONIX standard. If there is more than one title, we should prefer the one with the "b202" code of "01", which means "Distinctive title".
<title>
<b202>01</b202>
<b203>Wealth Protection Secrets of a Millionaire Real Estate Investor</b203>
</title>
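Choosing the preferred title might be sketched as follows with xml.etree.ElementTree. Falling back to the first title when none is coded "01" is an assumption, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def onix_title(product_xml: str) -> str:
    """Return the b203 text of the title coded b202 = 01, if there is one."""
    root = ET.fromstring(product_xml)
    titles = []
    for title in root.iter("title"):
        code = (title.findtext("b202") or "").strip()
        text = (title.findtext("b203") or "").strip()
        if text:
            titles.append((code, text))
    for code, text in titles:
        if code == "01":              # "Distinctive title"
            return text
    # No distinctive title: fall back to the first one (assumption).
    return titles[0][1] if titles else ""
```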
One "catch" is that the ONIX titles may begin with initial articles, like "The" and "A". It is common in the library world to index titles with the initial articles removed, following the indicator value in the title field. We either need to create an index of MARC titles with the initial articles left on, or remove the most common ones from the ONIX titles so we can retrieve them with a string match, or something else that I haven't thought of.
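The second option, removing common initial articles from ONIX titles, could be sketched as below next to its MARC counterpart. The article list is a deliberately short English-only assumption, and the MARC helper assumes the second indicator of the 245 field holds the number of nonfiling characters.

```python
INITIAL_ARTICLES = ("the", "a", "an")  # English only; extend as needed (assumption)


def strip_initial_article(title: str) -> str:
    """Drop a leading article from an ONIX title, unless it is the whole title."""
    words = title.split()
    if len(words) > 1 and words[0].lower() in INITIAL_ARTICLES:
        return " ".join(words[1:])
    return " ".join(words)


def marc_filing_title(title: str, second_indicator: str) -> str:
    """Skip the number of nonfiling characters given by the 245 indicator."""
    skip = int(second_indicator) if second_indicator.isdigit() else 0
    return title[skip:]
```

Both helpers should yield the same string for the same book, so the results can be fed to the same title normalization and compared directly.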
Dates
The ONIX publication date is in YYYYMMDD format. The MARC coded date (from the 008) is YYYY. We should create an indexed form of the ONIX publication date with just the YYYY portion.
<b003>20060901</b003>
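Reducing both dates to a four-character year is a one-line operation on each side. In this sketch the function names are hypothetical, and an empty string is returned when the value does not start with four digits.

```python
def onix_year(b003: str) -> str:
    """Return the YYYY portion of an ONIX YYYYMMDD publication date."""
    value = b003.strip()
    return value[:4] if value[:4].isdigit() and len(value) >= 4 else ""


def marc_year(field_008: str) -> str:
    """Return positions 07-10 of a MARC 008 field when they are all digits."""
    year = field_008[7:11]
    return year if len(year) == 4 and year.isdigit() else ""
```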
Merging with ONIX
Using these three data elements should give us enough that 1) incoming MARC records will retrieve ONIX records into their pool and possibly match with them and 2) incoming ONIX records will retrieve MARC records into their pool and possibly match. This theory will need to be tested.
Appendix A: Weights for Monographs
(The minimum weight required for merging is +875.)
Field                  | MARC tag and subfields | Condition                                   | Weight
LCCN                   | 010 a                  | Match on subfield a                         | 200
                       |                        | Field present in both records but no match  | -320
                       |                        | Either record or both records missing       | 0
ISBN                   | 020 a                  | Match on subfield a                         | 85
                       |                        | Field present in both records but no match  | -225
                       |                        | Either record or both records missing       | 0
Date                   | 008 7-10               | Exact match                                 | 200
                       |                        | +/-2 years                                  | -25
                       |                        | No match                                    | -250
                       |                        | Value missing                               | 0
Short-Title            | 245 a,b                | Exact match on first 25 characters          | 450
                       |                        | Non match on first 25 characters            | 0
Full-Title             | 245 a,b                | Exact match                                 | 600
                       |                        | Either title contained within other title   | 350
                       |                        | Either title shorter than 9 characters      | 0
                       |                        | Non match                                   | -600
                       |                        | Matching keywords                           | *
Country of Publication | 008 15-17              | Exact match                                 | 40
                       |                        | Either one missing                          | 0
                       |                        | Non match                                   | -205
Pagination             | 300 a                  | Match exactly and > 10                      | 100
                       |                        | Match exactly and < 10                      | 50
                       |                        | Match within 10 and both are > 10           | 50
                       |                        | Match within 10 and either are < 10         | 20
                       |                        | Non match (by more than 10)                 | -225
Publisher              | 260 b                  | Exact match                                 | 100
                       |                        | Either missing                              | 0
                       |                        | Occur within the other                      | 100
                       |                        | Non match                                   | -25
Main Entry             | 100 a,b,c,d,k,q        | Exact match                                 | 125
                       | 110 a,b,c,d,k,n        | Matching keywords                           | **
                       | 111 a,b,c,d,e,g,k,n,q  | Field missing from one record               | -25
                       |                        | Fields missing from both records            | 75
                       |                        | Non match                                   | -200
* Calculate weight based on the percentage of full title keywords common to the incoming record and the database record: (% in common) x 450. If keywords are in the same order then add 50.
** If half or more of the main entry keywords are in common, calculate weight based on the percentage of keywords in common x 80. If keywords are in the same order then add 10.