Mass Digitization of Books

by Karen Coyle

Preprint. Published in the Journal of Academic Librarianship, v. 32, n. 6

Mass digitization of the bound volumes that we generally call "books" has begun, and, thanks to the interest in Google and all that it does, it is getting widespread media attention. The Open Content Alliance (OCA), a library initiative formed after Google announced its library book digitization project, has brought library digitization projects into the public eye, even though libraries had been experimenting with digitization for at least a decade. What is different today from some earlier digitization of books is not just the scale of these new initiatives, but the quality of "mass."

What is "mass digitization"?

Mass digitization is more than just a large-scale project. It is the conversion of materials on an industrial scale: that is, the conversion of whole libraries without making a selection of individual materials. This is the opposite of the discrete digital collections that we see in online archives like the Library of Congress's Making of America, or the Online Archive of California. The goal of mass digitization is not to create collections but to digitize everything, or in this case, every book ever printed. To do this economically and with some speed, mass digitization is based on the efficient photographing of books, page by page, and subjecting those images to optical character recognition (OCR) software to produce searchable text. Human intervention is reduced to a minimum, so the OCR output is generally used without undergoing additional revision. Also, only limited structural markup, such as page numbers, tables of contents, and indices, is included because these cannot be detected automatically by the OCR software and therefore require human intervention in the scanning process.

When we talk of mass digitization we are of course talking about Google and its intention to digitize all of the books in five major US libraries.^¹ Google has made its goals for its Google Books service very clear: this is a search service, with some viewing of the context of the search terms. Google does not attempt (at this point) to provide a reading environment. Instead, they see themselves as an index, with digitization a process to create an index to books in libraries and bookstores. Details of Google's scanning process are not available, but we do know that they are doing high-volume scanning and that searching is done against the uncorrected OCR results that are obtained. Google has made no public announcement on the rate of digitizing, but John Price Wilkin of the University of Michigan has been quoted as saying that the scanning of the seven million volumes in Michigan's library would be completed within six years. He also said that an operator digitizes about 50 books a day. "At that rate, with, say, 20 people working every single day of the year, digitizing the library's 7 million volumes would take 19 years."^² Clearly, mass digitization will take an industrial-strength workforce as well as industrialized workflows.
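The arithmetic behind the quoted estimate can be checked with a short sketch. The figures (7 million volumes, 50 books per operator per day, 20 operators, a six-year target) come from the paragraph above; the function names and the assumption of a constant rate over a 365-day year are illustrative only.

```python
import math

def years_to_digitize(volumes, books_per_operator_per_day, operators, days_per_year=365):
    """Years needed to scan `volumes` at a constant daily rate."""
    books_per_year = books_per_operator_per_day * operators * days_per_year
    return volumes / books_per_year

def operators_needed(volumes, books_per_operator_per_day, target_years, days_per_year=365):
    """Smallest whole number of operators that meets the target."""
    per_operator_total = books_per_operator_per_day * days_per_year * target_years
    return math.ceil(volumes / per_operator_total)

# Figures taken from the quoted estimate.
estimate = years_to_digitize(7_000_000, 50, 20)
# Hypothetical rearrangement: staffing needed to meet the six-year figure.
six_year_staff = operators_needed(7_000_000, 50, 6)

print(f"{estimate:.1f} years with 20 operators")
print(f"{six_year_staff} operators to finish in 6 years")
```

Under these simplified assumptions, closing the gap between 19 years and six years would take roughly three times the staffing in the quoted example, or a comparable gain in per-operator speed.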

In October 2005, a second library-related mass digitization project was announced: The Open Content Alliance (OCA).^³ The OCA distinguished itself from Google in a number of ways. First, it would only digitize works in the public domain. Second, it would be "open." That is, it would make information about its technology available to others. Third, it was library-driven (although it receives funding from significant technology companies like Adobe and Microsoft). These latter two were in direct response to some criticisms of the commercial and secretive nature of Google's project. Scanning would be done by the Internet Archive using a system that they developed called "Scribe." The Internet Archive claims its digitization process costs around 10 cents a page and takes from 30 to 60 minutes for each book, depending on length. Announcements that the OCA is currently scanning about ten thousand books a month are impressive, although the math (120,000 per year means 8-9 years to reach the first million) shows that the scanning speed will need to increase for it to compete with Google's reported goals. The OCA appears to have plans for creating a reading environment, based on the demonstration books on the Internet Archive web site, which will host the OCA content.
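The OCA figures above can be worked through in the same way. The monthly rate and the per-page cost are the ones reported in the paragraph; the 300-page book length is an assumption made purely for illustration and does not come from the OCA or the Internet Archive.

```python
BOOKS_PER_MONTH = 10_000        # scanning rate from OCA announcements
COST_PER_PAGE_CENTS = 10        # cost claimed by the Internet Archive
ASSUMED_PAGES_PER_BOOK = 300    # assumption for illustration only

books_per_year = BOOKS_PER_MONTH * 12
years_to_first_million = 1_000_000 / books_per_year

# Work in whole cents to avoid floating-point rounding, then convert to dollars.
cost_per_book_usd = COST_PER_PAGE_CENTS * ASSUMED_PAGES_PER_BOOK / 100
cost_first_million_usd = cost_per_book_usd * 1_000_000

print(f"{books_per_year:,} books per year")
print(f"{years_to_first_million:.1f} years to reach one million books")
print(f"${cost_per_book_usd:.2f} per assumed {ASSUMED_PAGES_PER_BOOK}-page book")
print(f"${cost_first_million_usd:,.0f} for the first million books")
```

The cost lines scale directly with the assumed page count, so they should be read as an order-of-magnitude illustration rather than a budget.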

Not to be left out of this trend, Microsoft has announced its own online book search, and has arranged to scan about 100,000 out-of-copyright books for the British Library. It is also one of the corporate partners in the OCA. The MSN® Book Search was not available at the time this article was written, but it is expected to be announced in beta on the Windows Live™ Ideas site.

Non-mass Digitization

The opposite of mass digitization involves the careful and individual selection of materials to be digitized. This is the type of digitization that arises out of preservation projects whose aim is to produce replacement copies of texts that are deteriorating or to make rare physical collections more widely accessible. Among these latter we have the Virginia eText project^⁴ or the beautifully rendered Octavo Editions^⁵ of rare and precious books.

Non-mass also has to do with the process of digitizing the books. Project Gutenberg^⁶, which now claims to have 18,000 books available in digital format, has taken a cottage industry approach to producing its catalog of books, using volunteers who convert and submit the books. Although the number of books produced in this way is impressive, the project began in 1971, and has produced an average of only about 500 digital texts per year in its 35-year history.

Another characteristic of non-mass digitization is its end-product. The primary results of mass digitization are photographic renditions of book pages backed up by searchable OCR, while non-mass forms of digitization may produce richly marked-up text that can be used to provide a variety of services, from linking out to reference books, to selection and copying of passages. Services like ebrary^⁷ and Questia^⁸ use highly structured books (and other documents) to provide a kind of online research workstation that supports a range of activities common to higher education research and writing. These services would not be possible with an underlying database resulting from mass digitization.

Large-scale Digitization

There have been numerous digitization projects that could be called "large-scale." Large-scale projects are more discriminating than mass-digitization projects. Although they do produce a lot of scanned pages, they are concerned about the creation of collections and about producing complete sets of documents. One of the more impressive of these is not about books but about journals: JSTOR.^⁹ JSTOR has digitized the back files of nearly 1,000 journals, reaching back into the mid-19th century for some titles. A key goal of the JSTOR project is to create a complete run of each journal, which means doing careful monitoring of gaps in journal runs (including missing or damaged pages) by gathering physical copies from a number of different libraries.

There have been large-scale book digitization projects, such as the Carnegie Mellon Million Book project.^¹⁰ This project began in 2001 and is digitizing books in China, India, and Egypt. The stated goal is quite similar to that of the Google project: "The primary long-term objective is to capture all books in digital format."^¹¹ This project is clearly a precursor to today's mass digitization movement. It lacks only the speed and the indiscriminate nature of today's projects. Recently the Million Book Project announced that it has partnered both with the Internet Archive and the Open Content Alliance, joining the mass movement.

Stanford University also has a large-scale digitization project in place after setting up a robotic scanning lab for use by the university's libraries. Their goal was to digitize all of their books that were in the public domain.^¹² Stanford is now one of the libraries participating with Google in their mass digitization project.

Another interesting project in the large-scale digitization of books is that being done by Amazon in support of their Search Inside the Book capability.^¹³ Amazon is digitizing books that it sells for which permission is granted by publishers. The books in Amazon's catalog will be those that are in print, which creates a certain subset of books that will be included in Amazon's digitization project. The books are scanned and users can view page images in a browser display. Amazon also provides some features beyond page shots and searchable text: there are links to the table of contents, an excerpt, and the Amazon page includes a display of the first line of the book.

Technology for Mass Digitization

Digitization on a large scale is possible today due to improvements in the scanning technology itself. There are two main parts to this technology: the photography process that creates a digital image of the work, and the optical character recognition that interprets the photograph to extract a version of the text on the pages.

Previous scanning technology required that books be unbound so that pages would be flat for photographing and so that they could be fed automatically through the scanning machine. If books were not unbound, flattening them on the scanner glass was very damaging to the spines and binding, and required a person to position each page on the scanning surface. Scanning is now done with digital cameras pointing at open but bound books, and software acts on the digital images to adjust for the curvature of the open page, making the image flat even though the book is not. Software also allows the scanning of text and illustrations together, adjusting resolution and other characteristics as needed. These improvements mean that less human intervention is needed, and in addition the scanning and OCR technologies are faster than ever before. Advertised scan rates range from 1200 to 3000 pages per hour. Images are captured at 600 DPI. This means that scanning produces large files: one page can be 20 megabytes, and a book can easily be 6 gigabytes, but the improvement in hard drive technology over the last decade means that the scanning systems can handle the files that result from the scanning activity.
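A rough calculation shows where file sizes of this order come from. The 600 DPI figure is from the paragraph above; the 6 x 9 inch page, the single byte per pixel (uncompressed grayscale), and the 300-page book are assumptions chosen for illustration, and real scanners may use color, compression, or other page sizes.

```python
def uncompressed_page_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Size of one uncompressed page image in bytes."""
    width_px = width_in * dpi
    height_px = height_in * dpi
    return width_px * height_px * bytes_per_pixel

DPI = 600                     # capture resolution stated in the text
ASSUMED_PAGE_INCHES = (6, 9)  # assumption: a typical trade-size page
ASSUMED_BYTES_PER_PIXEL = 1   # assumption: 8-bit grayscale, no compression
ASSUMED_PAGES_PER_BOOK = 300  # assumption for illustration only

page_bytes = uncompressed_page_bytes(*ASSUMED_PAGE_INCHES, DPI, ASSUMED_BYTES_PER_PIXEL)
page_mb = page_bytes / 1_000_000
book_gb = page_bytes * ASSUMED_PAGES_PER_BOOK / 1_000_000_000

print(f"{page_mb:.2f} MB per page")
print(f"{book_gb:.2f} GB per {ASSUMED_PAGES_PER_BOOK}-page book")
```

With these assumptions the result lands close to the 20 megabytes per page and 6 gigabytes per book cited above; 24-bit color capture would roughly triple both figures.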

There are two main companies making the basic hardware and software that are in use in book digitization projects: Kirtas Technology^¹⁴ and 4DigitalBooks.^¹⁵ The Kirtas Technology scanner is reported to cost about $150,000. Other, smaller systems are becoming available, such as the ATIZ BookDrive,^¹⁶ an automatic book scanner priced at about $6500 that its maker reports can scan 1800 pages per hour. Although high-end scanners can be rented, mass digitization still requires a large funding commitment on the part of a library, which explains the interest in partnering with companies like Google or Microsoft, who have the deep pockets that libraries do not.

OCR technology has improved greatly over its lifetime, and the increase in computational power makes this aspect of the mass digitization task faster and more accurate. In addition, OCR capabilities are available for a wide variety of languages, such as those that would be found in the book collections of major research libraries. A widely used OCR package, Abbyy,^¹⁷ can handle 177 different languages, although its makers do not recommend processing more than two languages at a time, as this decreases the accuracy of the OCR output. Accuracy rates, although impressive, have some caveats. Most OCR software today can claim accuracy of 98% to 99.9%, but much depends on the original text and the physical quality of the item being scanned. Accuracy tests are often done on selected works, so one should expect that some works will not achieve this rate. Note that 99.9% accuracy still means that one character in 1000 is wrong, averaging over one error per modern book page.
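To make the accuracy figures concrete, the following sketch converts a character accuracy rate into an expected number of wrong characters per page. The accuracy values are the ones cited above; the figure of 2000 characters per page is an assumption for illustration, and errors are treated as evenly spread, which real OCR output is not.

```python
ASSUMED_CHARS_PER_PAGE = 2000  # assumption: a rough figure for a printed page

def expected_errors_per_page(accuracy_percent, chars_per_page=ASSUMED_CHARS_PER_PAGE):
    """Expected number of misrecognized characters on one page."""
    error_rate = (100 - accuracy_percent) / 100
    return chars_per_page * error_rate

for accuracy in (98.0, 99.0, 99.9):
    errors = expected_errors_per_page(accuracy)
    print(f"{accuracy}% accuracy: about {errors:.0f} wrong characters per page")
```

Even at the top of the claimed range, a search against uncorrected OCR will miss some occurrences of a term, and at the bottom of the range errors appear on nearly every line.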

Issues in Mass Digitization

There are numerous technology and management issues that must be considered in large-scale and mass digitization projects. The actual details of these issues are more than can be covered in a short article, but this should give a short overview of areas that will need to be investigated further.

Workflow

The manufacturers of scanning technology promote their products with figures on the number of pages that can be scanned per hour. Scanning, however, is only one part of the digitization work flow, so one cannot assume that the total throughput will be represented by that scanning figure alone. The Stanford library digitization project describes its work flow in ten steps including creation of metadata, scanning, quality control, OCR process, creation of technical metadata, and storage.^¹⁸ There are also decisions to be made that will affect the efficiency of the project, such as: what books will we scan? Within a book, what do we scan – cover-to-cover, even blank pages? Note that not all items are selected, not even in "mass digitization." Some items are eliminated because they are rare or fragile. Others are eliminated because they will not scan well, for a variety of reasons (those with tipped-in folded maps, as an example, or odd sizes that don't work well with the scanning technology). This means that selection for these characteristics must be part of the workflow.

Output and Book Structure

One aspect of the book that is obvious in the physical world is its solidity as a package. Binding maintains the physical order of the pages and the integrity of the whole. Books also have logical sub-elements, like chapters, numbered figures, tables of contents and indexes. These sub-elements provide a conceptual structure for the human reader.

During digitization, a bound book becomes a series of files representing pages. There is nothing inherent in the scanning process that creates a binding for the digital book. The output of OCR renders text, but the automatic recognition of book structures (page numbers, chapter beginnings, etc.) is not yet available although it is the object of research.^¹⁹ The encoding of structural elements by the human operators of the scanning equipment, including page numbers, is part of the workflow software that comes with scanning technology. As an example, the books digitized for Google Book Search have the scanned images linked to page numbers and four structural elements: table of contents, title page, copyright (generally the verso of the title page), and index. In books that are in the public domain there are often links from the page numbers in the table of contents and the index, although the accuracy of these is uneven. In fact, one often finds that the links in Google Book Search open a page near the desired page, such as the second or third page of the index rather than the first, or a numbered page one or two pages away.

User Interface

The weakest point of the mass digitization projects so far is the development of a user interface to the digitized materials. For public domain books and books for which it has permission to display the pages, Google Book Search shows the pages as images in a web page, with the ability to go back and forward one page or to jump to a particular page number. Page images can be captured or printed from the public domain books only. Because the pages are simply images, there is no capability to highlight and copy text from the page, and books cannot be downloaded to be read offline.

The Internet Archive offers its books in a variety of formats. To view the books in a web page the user must download an application called DJVU. Books can also be downloaded in PDF format or in the DJVU format. The Internet Archive is also experimenting with software it calls 'flip book' that imitates the look of an open book, with left and right pages and animated page turning. The download formats allow highlight and paste features, but do not have the range of features that are considered desirable in electronic books.^²⁰

Standards

The digitizing formats used in mass digitization are primarily industry standards, such as the Tagged Image File Format (TIFF) and the Portable Document Format (PDF). There are no common standards for the overall package containing the images. Although the Metadata Encoding and Transmission Standard (METS)^²¹ format is used by some book digitization projects, it has not been employed by those doing mass digitization. The library community has produced some recognized statements of best practices relating to digitization in general that designate both technology and elements of quality.^²² ^²³ ^²⁴ None of these are specific to mass digitization, and we must hope that appropriate guidelines for these new digitization projects will be developed.

Preservation

Although our field seemed to have established a few years ago that digital formats themselves have issues for preservation, there is the assumption that the digitized books from the mass digitization projects will be used for preservation of those texts.^²⁵ Because the new digitization technologies are less destructive to originals than any previous ones, they are highly suitable for items that are in need of duplication for preservation. However, the production of a preservation-quality copy is somewhat contrary to the desire to digitize whole libraries quickly and inexpensively with the least amount of human intervention. In fact, the books currently available through the Google Book Project show quality control issues such as missing page images and blurred or unreadable images. Clearly additional quality control must be applied to arrive at a sufficient level of quality for preservation purposes. There is an undeniable conflict between "mass" and "preservation" for the digitization of hard copy materials.

Scoping the Mass Digitization Project

There are two assumptions that are often made about mass digitization. The first is that you can digitize everything, and the second is that you can save money by not digitizing the same item more than once. For the first assumption, libraries will find that some items are either too fragile to be put through the mass digitization process, or are too far from the norm to be suitable to that process. Some books will be too large or too small; others will have odd-sized plates or folded maps that will need special handling. So digitizing an entire library will require some mass digitization and some special digitization projects.

The other part of the "digitize everything" goal is the desire to create at least one digital copy of every book available in any library. Google and the OCA are beginning this process by focusing on some large libraries in the Western world with impressively broad collections. How much of the world's literature will be digitized in this way? A statistical study of the five original Google Book Search collections^²⁶ shows that at the end of this project Google will have digitized 33 percent of the items in OCLC's WorldCat. The most important revelation from this study is that 40 percent of the items in WorldCat are held uniquely by only one institution. The long tail of the Google Book Search project will require involving many hundreds or thousands of libraries if they really intend to create an index to all of the books on library shelves today.

The second assumption is that time and money will be saved by keeping a registry of digitized books so that the work is not duplicated by other libraries.^²⁷ ^²⁸ In the arena of mass digitization, this assumption is being challenged by some with the argument that it may be more economical to scan a full shelf of books than to determine if a true duplicate exists elsewhere. This is in part because of the difficulty of defining "same" in a world with many similar but not identical editions. It is also because the mass digitization process may not produce true duplicates due to the error rate of OCR programs, and because of differences in decisions made at the time of scanning.

Conclusion

Although a significant number of large research libraries are engaging in mass digitization projects, other than the Google Book Search, which is available today, we have little idea how the digitized books will be used. There are many questions that need to be answered, such as: who does this digitized library serve? How does it serve users? How will the system respond when there are 10 million books in a database and a user enters the query "civil war"? (Note that Google has not yet determined how it will create an ordering principle for books.) Will some users read these books online in spite of the relative inconvenience of their formats and the computer screen's technology? Will it be possible to use the digitized pages to produce something more e-book like?

Google has clearly stated that their book project is solely aimed at providing a searchable index to the books on library shelves. They are quite careful not to promise an online reading experience, which would increase the quality control effort of their project and possibly make rapid digitization of the libraries impossible. Library leaders seem to be enticed by the speed of mass digitization, but seem unable to give up their desire to provide online access to the content of the books themselves. If mass digitization is the best way to bring all of the world's knowledge together in a single format, we are going to have to make some reconciliation between the economy of "mass" and the satisfaction of the needs of library users.

1 http://books.google.com/googlebooks/partners.html

2 Said, C. (2004). Revolutionary Chapter: Google's ambitious book-scanning plan seen as key shift in paper-based culture. San Francisco Chronicle: F-1.

3 http://www.opencontentalliance.org/

4 http://etext.lib.virginia.edu/ebooks/

5 http://www.octavo.com/

6 http://www.gutenberg.org/

7 http://www.ebrary.com

8 http://www.questia.com

9 http://www.jstor.org

10 Carnegie Mellon Libraries: Million Book Project FAQ. http://www.library.cmu.edu/Libraries/MBP_FAQ.html

11 Reddy, Raj and Gloriana StClair. The Million Book Digital Library Project. Carnegie Mellon University. (December, 2001) Available: http://www.rr.cs.cmu.edu/mbdl.htm (Accessed July 31, 2006)

12 "SUL Books in the Public Domain" (http://library.stanford.edu/depts/dlp/collections/detail_pre23.html)

13 http://www.amazon.com/exec/obidos/tg/browse/-/10197021/103-2965365-1144655

14 http://www.kirtastech.com/

15 http://www.4digitalbooks.com/

16 http://www.atiz.com/

17 http://www.abbyy.com/

18 Robotic Book Scanning at the Stanford University libraries and Academic Information Resources: Report on the Status of Digitization Facilities and Services for Bound Library Materials. 7 May 2003. Available: http://library.stanford.edu/depts/diroff/DLStatement.html (Accessed August 4, 2006).

19 (HP) (http://www.hpl.hp.com/techreports/2004/HPL-2004-167.pdf)

20 Susan Gibbons, Tom Peters, Robin Bryan. E-Book Functionality White Paper. (January, 2003) http://www.lib.rochester.edu/main/ebooks/ebookwg/white.pdf (Accessed August 10, 2006)

21 Metadata Encoding and Transmission Standard (METS). http://www.loc.gov/standards/mets/

22 Best Practices. http://www.oclc.org/community/topics/digitization/bestpractices/

23 The Indiana University Digitization Project. Best Practices. http://www.statelib.lib.in.us/www/isl/diglibin/

24 California Digital Library. CDL Guidelines for Digital Objects. http://cdlib.org/inside/diglib/guidelines/ (Accessed August 5, 2006)

25 Said, op cit.

26 Lavoie, B., L. S. Connaway, et al. (2005). "Anatomy of Aggregate Collections: The Example of Google Print for Libraries." D-Lib Magazine 11(9).

27 OCLC Registry of Digital Masters. http://www.oclc.org/digitalpreservation/why/digitalregistry/

28 John Price Wilkin. Registering digitized monographs and serials. (May, 2001) http://www.umdl.umich.edu/pubs/dlf-registry.html