Trouble in Online Paradise: An Analysis of MARC 856 Usage at One Institution
Roy Tennant March 23, 2007
A Note on Technique
In this unscientific analysis of the usage of 856 MARC fields at one institution, 1,000,000 MARC records were obtained without using scientific random sampling methods. Nonetheless, the records were not selected by any particular criteria and the sample size represents a significant percentage of the whole (the UC Berkeley catalog). This analysis, however, should be considered anecdotal in nature.
User Needs
Numerous user studies1 by the California Digital Library strongly indicate that academic faculty and students (and students in particular) enjoy knowing when the full content of an item they seek is available on the Internet (for example, see "Earth Sciences Metasearch Portal Usability Testing," 2006, p. 8). This desire can manifest itself in at least a couple of ways: 1) the desire to immediately see on a screen of search results which items are completely available online, and 2) the desire to filter searches based on the availability of full content (that is, "show me only items I can get to on the Internet"). Meanwhile, most of our library catalogs do a very poor job of serving either of these user desires.2
To serve these needs well we need the ability to know, via software algorithms, which items in our library catalogs are fully available on the Internet. In library catalogs, this information is recorded in the MARC 856 field.3 As more content becomes available in full-text via mass digitization projects such as those by Google and the Open Content Alliance, the ability of libraries to specify when full-text is available only becomes more important. Therefore, this analysis seeks to discover how difficult it is for libraries to ascertain which content is fully available online, and to recommend improvements that would make such determinations easier and more accurate.
The 856 Field
The 856 field of the MARC record is where the information needed to locate and access an electronic resource is recorded. There are a number of subfields and options (for example, indicators) that do not apply to this analysis, so the documents cited should be consulted for a full explication of the field and subfields. This analysis will focus on only the subfields used within the sample data.
Analysis
1st Indicator
Explanation: The 1st indicator is the access method, which for our purposes is nearly always 4, "HTTP (Hypertext Transfer Protocol)", although it potentially could also be 1, FTP, or 7, which refers to the method specified in subfield $2.
Discussion: Only ten items (0.05%) in the sample were discovered to have a first indicator of 1, and some of them appear to have been miscoded. Another nearly 300 items (about 1.5% of the total) are coded with 7, 156 of which have an additional subfield of $2 that specifies "http" as the method. The vast majority were coded with 4 (98.42%). Unfortunately, this indicator only indicates the access method, not whether the item being accessed represents the complete item.
Conclusion: This indicator cannot be used to determine whether an item is fully available.
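A tally like the one in the discussion above could be produced with a few lines of code. The following is a minimal sketch in Python, not the method actually used for this analysis. It assumes each 856 field is available as a display string in a fixed layout (the tag, one space, two indicator characters, then subfields introduced by "$"), and the function names parse_856 and first_indicator_shares are hypothetical.

```python
from collections import Counter


def parse_856(line):
    """Split a display-form 856 string into its indicators and subfields.

    Assumed layout (hypothetical, fixed positions): '856', one space, two
    indicator characters, then subfields each introduced by '$' and a
    one-character code. A URL containing a literal '$' would be split wrongly.
    """
    if not line.startswith("856 ") or len(line) < 6:
        raise ValueError("not a display-form 856 field: %r" % line)
    ind1, ind2 = line[4], line[5]
    subfields = [(chunk[0], chunk[1:]) for chunk in line[6:].split("$") if chunk]
    return {"ind1": ind1, "ind2": ind2, "subfields": subfields}


def first_indicator_shares(lines):
    """Return the percentage of fields carrying each first-indicator value."""
    counts = Counter(parse_856(line)["ind1"] for line in lines)
    total = sum(counts.values())
    return {ind: round(100.0 * n / total, 2) for ind, n in counts.items()}


if __name__ == "__main__":
    sample = [
        "856 41$uhttp://www.mip.berkeley.edu/cgi-bin/csmp?000338",
        "856 41$uhttp://sunsite.berkeley.edu/TechRepPages/CSD-91-633",
    ]
    print(first_indicator_shares(sample))
```

For production use, a real MARC parsing library would be a safer choice than string slicing, since display formats vary from system to system.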
2nd Indicator
Explanation: The second indicator specifies the relationship of the item referenced by the URL in the 856 field to the item described in the MARC record. Almost exclusively three values appear in the sample. Indicator 0 indicates that the URL points to the item described by the record. Indicator 1 specifies a version of the resource. Indicator 2 specifies a related resource.
Discussion: The majority (64.86%) of the 856s in the sample have an indicator of 1, followed by those with an indicator of 0 (20.08%), and trailed by those with an indicator of 2 (10.62%). In this sample, records with an indicator of 2 seemed the least likely to be helpful in finding items fully available online, since most entries were for credits from the Internet Movie Database, a publisher description, or an archival finding aid. An indicator of 0, which specifies that the URL points to the item described by the MARC record, is generally a good sign that the URL in the 856 points to the complete item. The presence of indicator 1 does not by itself indicate the presence of the complete item. Rather, it is necessary to use additional criteria to determine complete availability.
Conclusion: When a second indicator of 0 exists, it can be considered an indicator of full content availability. When a second indicator of 1 exists, it cannot be assumed that the complete item exists online, but it may be possible to use it in conjunction with other indicators to so determine, at least in some cases. A second indicator of 2 can be considered to indicate that the 856 does not point to the complete item.
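The conclusion above amounts to a three-way lookup, which can be sketched as follows. This is an illustrative Python fragment only; the names Availability and verdict_from_second_indicator are hypothetical, and treating any other value (including a blank indicator) as undetermined is my assumption rather than something the sample data establishes.

```python
from enum import Enum


class Availability(Enum):
    FULL = "full"
    NOT_FULL = "not full"
    UNDETERMINED = "undetermined"


# Mapping taken from the conclusion above: 0 = the resource itself,
# 1 = a version of the resource, 2 = a related resource.
SECOND_INDICATOR_VERDICT = {
    "0": Availability.FULL,
    "1": Availability.UNDETERMINED,
    "2": Availability.NOT_FULL,
}


def verdict_from_second_indicator(ind2):
    """Return the availability verdict implied by the second indicator alone.

    Values outside 0, 1 and 2 are treated as undetermined (an assumption).
    """
    return SECOND_INDICATOR_VERDICT.get(ind2, Availability.UNDETERMINED)


if __name__ == "__main__":
    for value in ("0", "1", "2", " "):
        print(repr(value), verdict_from_second_indicator(value).value)
```

Keeping "undetermined" as an explicit third outcome leaves room for the other checks discussed below to settle the cases that an indicator of 1 cannot.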
Subfield $3
Explanation: Subfield $3 defines the part of the described materials to which the field applies.
Discussion: Over 50% of the 856 fields in the sample have a $3 subfield. At least 42% of the $3 subfields specify the part to be "Table of contents". At least another 7% specify a publisher description. Over 7% identify a PDF version, with a Text version being identified in an additional 7+% of the cases. Since this is a free-text field, however, descriptions can vary. Examples of entries in this field include "Summary (HTML) and complete full text (PDF)", "Finding aid :", "Current issue:", "Sample text", "Abstract and full text", "maps", "v.1(2001)-", etc.
Conclusion: When a clearly negative (e.g., "Table of contents") or positive (e.g., "PDF version") declaration exists, this subfield could be helpful in determining whether or not full content is available. But there is an unknown percentage of edge cases that will require sophisticated machine processing to properly determine.
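A simple phrase-matching pass over $3 might look like the sketch below. It is written in Python under stated assumptions: the phrase lists are drawn from the examples quoted above, but sorting "Publisher description", "Finding aid" and "Sample text" into the negative list and "Text version" and "full text" into the positive list is my own reading, and the function name classify_part is hypothetical. Anything that matches neither list, or both, is left undetermined.

```python
import re

# Assumed phrase lists, based on the examples quoted in the discussion.
NEGATIVE_PARTS = (
    "table of contents",
    "publisher description",
    "finding aid",
    "sample text",
)
POSITIVE_PARTS = ("full text", "pdf version", "text version")


def classify_part(subfield_3):
    """Classify a free-text $3 value as 'full', 'not full' or 'undetermined'."""
    # Normalize case and runs of whitespace before matching.
    text = re.sub(r"\s+", " ", subfield_3).strip().lower()
    negative = any(phrase in text for phrase in NEGATIVE_PARTS)
    positive = any(phrase in text for phrase in POSITIVE_PARTS)
    if positive and not negative:
        return "full"
    if negative and not positive:
        return "not full"
    # No match, or conflicting matches: leave for closer inspection.
    return "undetermined"


if __name__ == "__main__":
    for value in ("Table of contents", "PDF version", "Current issue:", "maps"):
        print(value, "->", classify_part(value))
```

Substring matching of this kind handles only the clear declarations; entries such as "Current issue:" or "v.1(2001)-" are exactly the edge cases that would need more sophisticated processing.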
Subfield $z
Explanation: Subfield $z is an optional note designed to explain
to end-users anything that relates to accessing the item referenced in the
856.
Discussion: Over 37% of the sampled 856s have a $z subfield. Of this total, almost half of the $z fields contain a message related to restricted access to purchased or licensed content. Over 15% of the $z subfields (5.7% of the entire sample) specified "Adobe Acrobat Reader required", which seems to indicate the availability of full-text. Variations on that theme such as "Click here to download PDF file" indicate that the availability of full-text is likely higher once all the permutations of this are taken into account. Other entries in this subfield include "full text available in pdf format at same site;", "Credits from Internet Movie Database", and "Freely available."
Conclusion: When a clearly negative (e.g., "Credits from Internet Movie Database") or positive (e.g., "Adobe Acrobat Reader required") declaration exists, this subfield could be helpful in determining whether or not full content is available. But there is an unknown percentage of edge cases that will require sophisticated machine processing to properly determine.
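Before any rules can be written for $z, the permutations mentioned above have to be found. One way to surface them, sketched here in Python, is to normalize each note and count the results so that the most common phrasings can be reviewed by hand. The normalization rules (lowercasing, collapsing whitespace, dropping trailing punctuation) and the names normalize_note and note_frequencies are assumptions for illustration.

```python
import re
from collections import Counter


def normalize_note(note):
    """Lowercase a $z note, collapse whitespace, drop trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", note).strip().lower()
    return collapsed.rstrip(" .;:")


def note_frequencies(notes):
    """Count normalized $z notes, most frequent first."""
    return Counter(normalize_note(note) for note in notes).most_common()


if __name__ == "__main__":
    notes = [
        "Adobe Acrobat Reader required",
        "adobe acrobat reader required.",
        "Click here to download PDF file",
        "Freely available.",
    ]
    for text, count in note_frequencies(notes):
        print(count, text)
```

A reviewer could then sort the frequent phrasings into positive, negative and ambiguous groups instead of reading every raw note.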
Other Techniques
By analyzing the 856 fields within a given database it is possible to determine patterns that reliably indicate full content availability. For example, within the UC Berkeley dataset analyzed here there are numerous 856 fields that look like this:
856 41$uhttp://www.mip.berkeley.edu/cgi-bin/csmp?000338
and
856 41$uhttp://sunsite.berkeley.edu/TechRepPages/CSD-91-633
Once patterns like this are identified, it would be possible to programmatically assign them to one category or another based on URL string matching (in these cases full content availability).
In another set of cases 856 fields such as
856 41$uhttp://www.srs.fs.fed.us/pubs/gtr/gtr%5Fsrs010.pdf
can reasonably be assumed to lead the user to the full-text due to the filename extension (.pdf).
There may be exceptions to some of these assumptions, but I would guess that they would be very few in number if the decisions were carefully made, and the benefits of including such items in the pool of "fully available" are potentially substantial.
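The two URL-based techniques just described, prefix matching and filename extension checking, can be sketched together in Python. The prefixes below are taken from the UC Berkeley examples above; the constant and function names are hypothetical, and any other institution would need to derive its own list from its own data.

```python
from urllib.parse import unquote, urlsplit

# URL prefixes observed in this dataset to lead to full content.
FULL_CONTENT_PREFIXES = (
    "http://www.mip.berkeley.edu/cgi-bin/csmp?",
    "http://sunsite.berkeley.edu/TechRepPages/",
)
# Filename extensions assumed to signal a complete document.
FULL_CONTENT_EXTENSIONS = (".pdf",)


def url_suggests_full_content(url):
    """Return True when the URL matches a known prefix or extension."""
    if url.startswith(FULL_CONTENT_PREFIXES):
        return True
    # Look only at the path, so query strings do not hide the extension,
    # and decode escapes such as %5F before comparing.
    path = unquote(urlsplit(url).path).lower()
    return path.endswith(FULL_CONTENT_EXTENSIONS)


if __name__ == "__main__":
    print(url_suggests_full_content(
        "http://www.srs.fs.fed.us/pubs/gtr/gtr%5Fsrs010.pdf"))
```

A result of False here means only that no known pattern matched, not that the item is unavailable.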
Recommendations
Short-Term Strategies
It is possible with enough review of the existing data, similar to what I have done here, to specify an algorithm to check multiple conditions to make a determination of full online availability. Some checks will result in a clear negative determination (e.g., "Table of contents"), while others will have a clear positive (e.g., the existence of a second indicator of 0). A study of options and trade-offs will reveal whether it is better to do this in a batch process and add an unambiguous indicator to the record, or do these checks at the point of display. In the absence of any guidance from a centralized cataloging authority, the latter may be the safer route if the processing overhead is not too onerous.
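Such a multi-condition check could be organized as an ordered chain of small rules, each of which either gives a verdict or passes. The Python sketch below is one possible arrangement, not a tested algorithm: the field is assumed to be a plain dictionary keyed by "ind2", "3", "z" and "u", every function name is hypothetical, each rule uses only the single example phrase quoted in this analysis, and the order of the checks (free-text declarations first, then the second indicator, then the URL) is an assumption that the study of trade-offs would need to confirm.

```python
def check_part(field):
    """$3: one clear negative and one clear positive phrase."""
    part = (field.get("3") or "").lower()
    if "table of contents" in part:
        return False
    if "pdf version" in part:
        return True
    return None


def check_note(field):
    """$z: one clear negative and one clear positive phrase."""
    note = (field.get("z") or "").lower()
    if "internet movie database" in note:
        return False
    if "adobe acrobat reader required" in note:
        return True
    return None


def check_second_indicator(field):
    """Second indicator: 0 is positive, 2 is negative, others pass."""
    return {"0": True, "2": False}.get(field.get("ind2"))


def check_url(field):
    """$u: a .pdf filename extension is treated as positive."""
    url = (field.get("u") or "").lower()
    return True if url.split("?", 1)[0].endswith(".pdf") else None


# Assumed order: the first rule that returns a verdict wins.
CHECKS = [check_part, check_note, check_second_indicator, check_url]


def is_fully_available(field):
    """Return True, False, or None when no rule reaches a verdict."""
    for check in CHECKS:
        verdict = check(field)
        if verdict is not None:
            return verdict
    return None


if __name__ == "__main__":
    print(is_fully_available({"ind2": "1", "3": "Table of contents"}))
```

The same function could be run once in a batch job that writes its verdict back to the record, or called at the point of display; keeping each rule separate makes it easier to measure which ones carry the processing cost.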
Long-Term Strategies
Within the existing work to refactor our bibliographic infrastructure (for example, the present work on Resource Description and Access), create a clear and unambiguous method to specify when a URL points to the complete item being described.
Conclusions
The students and faculty at the University of California have made it clear that they highly value full online access to the items they locate, and I doubt there are many users who would disagree with this opinion. Therefore it is important to solve the current problems we have with making it easy for our users to both discover this content and easily retrieve it.
As I have outlined here in anecdotal form, our MARC/AACR2 infrastructure and how it has been put into practice within at least one library is preventing us from effectively fulfilling these valid user needs. For the short-term I believe there are some strategies that can provide a decent solution to these issues for the bulk of our material. But going forward, we need to find and implement an unambiguous, machine-processable method to specify when a URL will fetch the complete item. We cannot allow optional local practices4 to continue to be the basis upon which we rest our solutions.
1 California Digital Library. Evaluation and Assessment Reports, <http://www.cdlib.org/inside/assess/evaluation_activities/>.
2 Tennant, Roy. "The Trouble with Online," Library Journal (September 15, 2004), <http://libraryjournal.com/article/CA452319.html>.
3 OCLC Online Computer Library Center. 856 Electronic Location and Access, <http://www.oclc.org/bibformats/en/8xx/856.shtm> and Library of Congress. MARC 21 Concise Bibliographic: Holdings, Location, Alternate Graphics, etc. Fields (841-88X), <http://www.loc.gov/marc/bibliographic/ecbdhold.html#mrcb856>.
4 More information on UC local practice can be found at CDL Catalog Guidelines: CDL Conventions for Cataloging Electronic Resources, <http://libraries.universityofcalifornia.edu/hots/tfer/tferguidcon.html>, 2003.