Project Ocean: Google to Digitize Stanford's Public Domain Holdings

From The Importance of and LISNews comes this note about Project Ocean, a collaborative project between Stanford and Google to put millions of texts online. Information about the project was revealed in this recent NYT article:

The company has also been pushing hard to find new sources of information to index, beyond material that is already stored in a digital form. In December, it began an experiment with book publishers to index parts of books, reviews and other bibliographic information for Web surfers.

And Google has embarked on an ambitious secret effort known as Project Ocean, according to a person involved with the operation. With the cooperation of Stanford University, the company now plans to digitize the entire collection of the vast Stanford Library published before 1923, which is no longer limited by copyright restrictions. The project could add millions of digitized books that would be available exclusively via Google.


This does seem like good news in terms of making texts available, but the "available exclusively via Google" part gives me pause. While I'm sure for the short term there is some legal mumbo jumbo about making the collection freely available, what happens when Google goes public and has to feed the stock market beast? The New York Times article is from the business section, and it's clear Google sees this as a business move. While there's nothing wrong with business, I think we've all seen what a traditional business model can do to academics and publishing--just look at what Elsevier has done over the past couple of years. The Importance of post that Charlie links to raises this very same point--see the post.

I'm also wondering how this fits with earlier plans by Stanford to digitize its collection. The LISNews thread has a blurb from an earlier NYT article about this.

I did a very quick Google search on this and didn't find a Project Ocean website, so I'm a bit curious about their openness.

Certainly, very true. But if we are to assume from the LISNews thread that they are digitizing these public domain texts by just scanning them in, then I think I'm less concerned than you are, because I'm happy to see that they are working on making elements of the public domain more accessible--whether pay-for-use or not. I think it would be unreasonable to assume they are producing annotated editions, which can be copyrighted.

So because the texts are public domain, there are different market forces than are at work with what Elsevier is doing. If access to the texts is too expensive, then someone could see about reproducing the important texts and making them available at reduced or no cost. And as a scholar, if you wanted to use/reproduce a large portion of the text in your work, copyright does not prohibit you from doing so. Compare this to getting permission from a publisher for working with a copyrighted text.

So I guess what I'm rambling toward here is that I'm fine with letting market forces determine what happens to texts in the public commons; the right to read is preserved. It's the heavy legislation that levies copyright in the content industries' favor that I have a problem with.

My post below has garnered some very thoughtful responses, some of which took the route of email or phone calls. It's clear that I need to clarify and revise my ramblings on this subject, and I'm grateful that my thinking-out-loud is worthy of such consideration. Maybe by the time the book is finished, I will have gotten this right, but in the spirit of iteration, here's another hack at it...and this still feels not quite right, like I need a good joints-after-midnight talk with a few more folks before the dead obvious stuff hits me between the eyes...

First, some of you challenge my distinction of Yahoo as more media-driven and mercantile, and Google as more "pure" and technology-driven. And many point out that in the end, even making a distinction between "media businesses" and "technology businesses" is a distinction without a difference - both are in the business of creating value for customers. Such a distinction is vulnerable to a charge of easy thinking, and I agree.

Let me clarify. In the end, I believe both companies are in the same business, and if I were forced to name that business in one word, I'd argue that business is media. Yes, Google started its life as an algorithm in a PhD program, and Yahoo started as an edited guide to the web, but they are clearly converging into the same business: they mediate information and services for consumers, and derive value from those services using the traditional revenue streams of the media business: advertising and subscriptions (if you don't think Google is in the subs business, think again). My point was that understanding both companies' DNA and culture is important in understanding where they might go with relation to content businesses like music, movies, and...