Here's what a Joseph Conrad novel might look like when you find it in the library...

Like just about every professional writer and reader, I have been curious about Google’s much-debated library of scanned books — for personal reasons. After critics of the Google Books project charged the company with copyright infringement, a tentative agreement was reached last year that promises to pay authors $60 for the rights to copy each of their publications, with other fees to come. But I’m less interested, frankly, in any future royalties than in the benefits of instant access to a library that is estimated to eventually top 20 million books.
So when a mobile version of Google Book Search showed up among the apps offered on my relatively new iPhone, I tried it out. I was delighted to find that I could browse every issue of Life magazine, from 1936 on, much as I had as a child (though I no longer retreated to the dark closet under the stairs in my great-aunt Olive's decrepit old house). And I learned some surprising things.
One of the sample texts on Book Search was a Joseph Conrad novella from 1917, The Shadow Line. I read it while stranded in an airport waiting room, happy for the emergency material. The novella put me in mind of Conrad’s Under Western Eyes, a 1911 novel about terrorism that struck me as having renewed relevance for our time. I searched the free-books list on Book Search, and there it was, in the public domain.
The type was clear, and I found it easy to drag the text down the screen with my thumb — easier than with the Kindle, or the Sony Reader, with its irritating page-refresh flicker. But I noticed that the scanning process had occasionally stuttered. Bits of grit or loose paper appeared to throw off the character-recognition software. French phrases so confused the device that it threw in asterisks, tildes and carets. Now and then, it would give up completely and erupt in a string of dingbats like comic-book cursing. A couple of underlined sentences were suddenly reproduced photographically in the original book type rather than the screen type. Then the device seemed to hit the virtual carriage return a few times, producing a three-quarter-inch blank space.
I was surprised that Google didn’t make use of a higher class of scanner. And I was really surprised by what happened next: like a dirty photo falling from between the pages of a book, a photograph popped up.
...And here's what it might look like when you download it from Google Books

It showed the hand of whoever fed pages into the scanner — a hand with a latex sheath on its index finger, like a condom. The person’s nails were nothing to brag about. The condom and the nails, combined with the sudden, unexpected appearance, made the picture seem obscene and unhealthy. I thought with horror of the guy who found a finger in his bowl of fast-food chili.
Was this the literal hand of Google? The fickle finger of the company that holds my copyrights? The sticky fingers that, to hear some tell it, threaten to grab our literary heritage?
I wondered what such sloppiness said about the book-scanning project — about how much we can trust Google and how much we should fear it. Even as people involved with publishing have debated the issue of Google’s right to digital content, most of us, impressed with the company’s search engine and maps, have assumed it would at least get the technical part right.
Rereading press coverage of Google Books, I learned that others had found finger photos, and some had posted them online. But these technical concerns were crowded out by the lovefest that important writers lavished on the project. Take Jeffrey Toobin's sloppy kiss to the deal in the February 5, 2007, New Yorker. In Toobin’s account, details about the scanning process are not so easy to pin down. He depicts Google’s chief scanner, Dan Clancy, a NASA veteran, as a lovable geek with granola-bar crumbs clinging to his clothes.
Clancy tells Toobin that the project’s enormous scope required the development of special scanning tools and leaves it at that. Says Toobin, “Google will not discuss its proprietary scanning technology, but, rather than investing in page-turning equipment, the company employs people to operate the machines, I was told by someone familiar with the process. ‘Automatic page-turners are optimized for a normal book, but there is no such thing as a normal book,’ Clancy said. ‘There is a great deal of variability over books in a library, in terms of size or dust or brittle pages.’”
According to a Wikipedia contributor, Google currently uses Elphel cameras for book scanning. These were apparently adapted from models used to capture street imagery for Google Maps. (Elphel is a little-known company based in Utah that, ironically, given Google’s secrecy, uses open-source software to operate its equipment.)
Some critics, of course, have highlighted concerns about the technical side of Google Books. In August, the linguist Geoffrey Nunberg, writing in The Chronicle of Higher Education, attacked the project for errors in the most basic classification data used to file the books: author, title, subject and year of publication. Nunberg wrote that the “book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess..."
To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux's La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams's Culture and Society 1780–1950, and Robert Shelton's biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf's letters is dated 1900, when she would have been 8 years old. Tom Wolfe's Bonfire of the Vanities is dated 1888, and an edition of Henry James's What Maisie Knew is dated 1848.
Part of the problem is simple stupidity, whether in the software or in the grayware behind it. But scanning technology is also at fault, Nunberg believes. For instance, a simple misreading of the copyright page seems to lie behind many incorrect datings.
Nowhere in Google’s FAQs, or anywhere else, is there a clear answer to the question of how books are physically scanned: whether they are disassembled in the process; what measures are taken to avert damage, especially to older, more fragile volumes with dry bindings and acidic paper; and what recourse readers or authors have if they encounter errors in scanning, dating or classification.
Nor has Google's press department answered my email asking these questions.
So it is likely that the company will also ignore this question: If the process of creating Google Books is open and its motives good, why is there so much secrecy about the nuts and bolts? Many experts feel there is room for only a single digital super library, and that Google is it. Geoffrey Nunberg writes, “No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project." Why, then, does Google seem to fear competition from the disclosure of information about the process?
In an October 9 New York Times op-ed piece, Sergey Brin promised to improve on the bibliographic information in Google Books. But he said nothing about scanning errors, and he seems to dispute the prediction that the service is likely to emerge as a de facto monopoly. Writing about the millions of out-of-print books threatened with extinction, the books he aims to preserve, he said, “I wish there were a hundred services with which I could easily look at such a book; it would have saved me a lot of time, and it would have spared Google a tremendous amount of effort. But despite a number of important digitization efforts to date (Google has even helped fund others, including some by the Library of Congress), none have been at a comparable scale, simply because no one else has chosen to invest the requisite resources. At least one such service will have to exist if there are ever to be one hundred. If Google Books is successful, others will follow.”
If there are to be many libraries, it is all the more important to get the quality of the original scans right. The same files might serve as material not only for other libraries but also for other formats, including Kindle or open-source-based readers.
Concentrating power and responsibility for any purpose in the hands of a single entity is rarely positive. You don't have to read millions of scanned books to glean that lesson. Just try Suetonius, The Federalist Papers, Barbarians at the Gate or All the King’s Men. You can find them for free — at your public library.
Comments [16]
10.12.09
09:53
If the image showed workers at the Post Office or a library, or an administrative assistant, or any other worker whose daily job involves touching a high volume of paper, would you say they were wearing condoms?
Sad you are reading books on your iPhone, but I'm reading Design Observer online, so go figure.
10.12.09
10:41
Uh? Are you implying that Google's 'secret' approach to doing things should use 'secret' tools (as in non-open source software)?
You are confusing a process with a tool. The irony in the passage above is misconceived.
10.13.09
07:17
That statement is somewhat misleading - we (Elphel) are not just _using_ free and open source software (like some of our customers), our products _are_ licensed under GNU licenses (GNU GPL v.3 for the software and FPGA code, GNU FDL for the circuit diagrams and PCB layout).
As for our customers - these licenses mandate releasing the derivative code only if the (derivative) products themselves are distributed. As long as they are used in-house it is OK to be secretive.
Andrey
10.13.09
09:57
For Anthony Grafton's recent essay discussing his reservations and placing the Google project in a broader historical context see:
Future Reading, Digitization and its discontents.
http://www.newyorker.com/reporting/2007/11/05/071105fa_fact_grafton?printable=true
10.13.09
12:55
I also have to heartily applaud the wonderful volunteer efforts of Project Gutenberg. Anyone can become a transcriber. I am interested in the implications of paying people simply to transcribe books into digital format. Interesting how we are OK with it being free (volunteer) but would suddenly consider it objectionable to pay people $.01, $.10, etc. per page to do transcription.
In a time when many are excited by the potential of social networking, crowdsourcing, and coordinated action mediated by the internet, it seems beneficial to have volunteers and/or paid individuals use their wonderful grey matter to digitize these texts.
10.13.09
03:08
Very good design blog! great finger! thanks from Argentina.
10.13.09
09:48
"How much should we trust Google?" About as much as anything else on the internet.
10.14.09
03:38
http://en.wikipedia.org/wiki/Rainbows_End
10.16.09
02:28
Unfortunately, going to the library is not a choice for some people in America! Perhaps you are from a city and do not realize this -- many people are in that boat -- but there are vast areas of this country where there are NO LIBRARIES!
Imagine ... entire communities and counties with no libraries. That is rural America for you! Please think outside your big-city box.