Crowdsourcing and Open Access: Collaborative Techniques for Disseminating Legal Materials and Scholarship
Timothy K. Armstrong
This short essay surveys the state of open access to primary legal source materials (statutes, judicial opinions and the like) and legal scholarship. The ongoing digitization phenomenon (illustrated, although by no means typified, by massive scanning endeavors such as the Google Books project and the Library of Congress’s efforts to digitize United States historical documents) has made a wealth of information, including legal information, freely available online, and a number of open-access collections of legal source materials have been created. Many of these collections, however, suffer from similar flaws: they devote too much effort to collecting case law rather than other authorities, they overemphasize recent works (especially those originally created in digital form), they do not adequately hyperlink between related documents in the collection, their citator functions are haphazard and rudimentary, and they do not enable easy user authentication against official reference sources.
The essay explores whether some of these problems might be alleviated by enlarging the pool of contributors who are working to bring paper records into the digital era. The same “peer production” process that has allowed far-flung communities of volunteers to build large-scale informational goods like the Wikipedia encyclopedia or the Linux operating system might be harnessed to build a digital library. The essay critically reviews two projects that have sought to “crowdsource” proofreading and archiving of texts: Distributed Proofreaders, a project frequently held up as a model in the academic literature on peer production; and Wikisource, a sister site of Wikipedia that improves on Distributed Proofreaders in a number of ways. The essay concludes by offering a few illustrations meant to show the potential for using Wikisource as an open-access repository for primary source materials and scholarship, and considers some possible drawbacks of the crowdsourced approach.
I. Introduction
The digital era has exposed the limitations of paper as an archival medium. Although paper (like other forms of hard-copy) makes an excellent tool for transmitting knowledge across lengthy spans of time, it makes a poor tool for transmitting knowledge across lengthy spans of distance. A wealth of knowledge, including legal knowledge, remains effectively trapped inside paper records, where it can be used only by those with access to the physical medium in which it is contained.
The movement to digitize paper records and make them freely available online promises to liberate information, including legal information, from these physical constraints and make it accessible around the globe. The scope of the task, however, is massive and daunting. Even the best organized (and best funded) efforts, such as the Google Books project (currently the subject of copyright litigation) and the Library of Congress’s efforts to scan American historical documents, can only scratch the surface. Indeed, the National Archives recently estimated that, at its present pace, it will take almost two thousand years to digitize the nine billion text records it presently holds in its collection.
Wikis and other collaborative tools change this picture in potentially important ways. Just as other informational projects have benefited by opening themselves to participation by a distributed community of volunteers, the means now exist to harness the efforts of legal professionals, students, and even interested members of the public at large to improve access to legal information, court decisions, statutes and regulations, and legal scholarship. In 2008, for example, the participants in one such project (initiated by the present author) succeeded in making crucial portions of the legislative history of the landmark Copyright Act of 1976 freely available online for the first time. The online version of the Copyright Act’s legislative history improves access not only by duplicating the text of the original report, but—perhaps more importantly—by making it possible for other online works that cite the report to hyperlink to it. This creates a seamless web of knowledge that improves upon the practical experience of using reference sources in paper form. If we multiply this isolated example by dozens, hundreds, or thousands of interested online users of legal texts, the possibility of a transformative moment in access to legal knowledge begins to appear ever closer.
This essay begins with a review of the open access imperative, which may be normatively grounded in considerations of transparency, democratic legitimacy, and the fulfillment of the university’s public service mandate. It then surveys the current status of a number of projects aimed at improving public access to legal materials and scholarship, and explores whether “crowdsourced,” Wiki-centered efforts may achieve comparable results at lower cost. It concludes with an assessment of some of the drawbacks and limitations of the “crowdsourced” approach.
II. Policy Background: The Open Access Imperative
A. Open Access to Scholarship
“Open access,” in the sense of making documentary materials available over the Internet for reading and copying without charge, is an emerging phenomenon in the legal academy. In the legal academic community, the “open access” label is associated primarily with free distribution of scholarly works. The discussion has revolved around whether to improve access to faculty scholarship, how best to do so, and what it might mean for the traditional legal publishing paradigm.
At one level, enlisting faculty support for scholarly open-access initiatives consists merely of appealing to personal and institutional self-interest. Inaccessible scholarship is unpersuasive scholarship, and studies have tended to suggest that opening access to scholarly works correlates with greater scholarly impact (as measured by citation counts). Researchers’ growing reliance on the Internet as a complement (and perhaps, one day, a successor) to proprietary databases or library hard copies feeds the demand for open access to scholarly works. Furthermore, the same technologies that enable open access to traditional legal scholarship also give scholars new ways to express themselves, creating forms of scholarly discourse that would have been uneconomical to produce in the pre-Internet era.
The movement to assure open access to scholarship is more advanced outside the legal academy. The difference is partly explained by differing market dynamics: University libraries, driven by eye-popping increases in subscription costs for specialized research journals, responded by dropping subscriptions, creating a risk that scholars working in those specialized fields would find it more difficult both to remain abreast of developments and to ensure dissemination of their own work to their peers. Open access publishing solves both problems by making current scholarship available worldwide at little expense. For that reason, faculty at several influential research institutions have voted to authorize archiving and distribution of their scholarship on open-access terms. Harvard University’s Faculty of Arts and Sciences did so (by unanimous vote) early in 2008, and the Harvard Law School faculty unanimously followed suit a few months later. The Massachusetts Institute of Technology adopted a university-wide open access mandate in early 2009, and similar measures are pending or have been adopted by other universities.
The adoption of open-access mandates by university faculty has led to the creation of institutional electronic repositories of scholarly works. Duke Law School’s faculty scholarship repository includes faculty papers dating back over half a century. Harvard’s new DASH repository may be unique in including student-authored papers alongside faculty scholarship. Nor is the push for scholarly open access confined to elite institutions: the Oklahoma City University School of Law, for example, maintains a repository of faculty scholarship extending back four decades. Cross-institutional repositories such as SSRN and BEPress hold even larger collections of faculty scholarship from universities worldwide.
Some law journals have also committed to publishing on an open-access model. The Science Commons organization (an affiliate of Creative Commons) has an Open Access Law Program (“OALP”) that is intended to foster open access to legal scholarship by permitting authors to retain sufficient rights in their published works to enable those works to be hosted in open-access repositories. Several law journals have committed to honor the principles of open access in works published in their pages, and authors may obtain similar results even when publishing in journals that have not formally committed themselves to the OALP’s principles.
B. Open Access to Primary Source Materials
As important as the movement to open access to legal (and other) scholarship is, the public may derive still greater benefit from making primary legal source materials—statutes, regulations, case law, and the like—more broadly accessible. Positive law directly regulates individual behavior, and partly for that reason, individual due process interests in access to the law have long been recognized. Principles of democratic legitimacy also favor open access by citizens to information necessary to police the functioning of government.
Although open access to primary legal source materials would appear to present an uncontroversial imperative, both legal and practical obstacles remain. The Copyright Act places federal statutes and judicial decisions in the public domain. But the absence of any comparable statutory provision regarding the copyright status of primary legal source materials below the federal level has occasionally sparked controversy. Even at the federal level, a lack of consensus on the normative desirability of open access is evident in the competing bills now pending on the subject of open access to federally funded research.
C. Roles of Universities and Public Institutions
As prodigious producers and consumers of information, public and private universities stand among the most important institutional actors in the networked information infrastructure. Organized action by the academic community would go a long way towards making open access to information the norm rather than the still-developing exception. As of yet, however, progress towards inculcating open-access norms in the academic community has been halting and sporadic. The praiseworthy example shown by the prominent research institutions whose faculties have adopted open-access mandates only highlights the vastly greater number of institutions that have not.
Fostering open access naturally fits with the mission of the modern university along multiple dimensions. Faculty research benefits from a regime in which source materials are freely accessible, and faculty scholarship that is made available in an open-access forum promises greater impact. Furthermore, the widespread adoption of open-access initiatives may yield substantial benefits even outside the directly involved university community. Most universities conceive of their missions as including substantial public service and public education components (indeed, for state-funded institutions, such mandates may be enacted as positive law), and it is not difficult to situate efforts to make information more widely available within the broad domain of public service.
In dealing with the open-access phenomenon, university libraries in particular confront imperatives that do not point uniformly in a single direction. To be sure, many librarians rightly see themselves as natural allies of the open-access movement and as well-positioned advocates for open-access policies because those policies best meet the needs of the library’s core constituencies. On the other hand, the logic of open access enables disintermediation to occur simultaneously at many levels: just as open access may reduce the role and importance of publishers, so too may it diminish the historical status of libraries themselves as informational gatekeepers. It is no simple task for libraries and librarians to balance the conflicting incentives presented by the open-access phenomenon, which makes some of the pro-access steps that libraries and universities have been taking recently all the more remarkable.
In early 2009, the directors of some of the largest and most prestigious law libraries in the United States issued the “Durham Statement on Open Access to Legal Scholarship,” a document that has been signed by over 50 librarians and other supporters nationwide. The Durham Statement envisions a wholesale restructuring of the endeavor of producing and disseminating legal scholarship: not only does it call for the increased deployment of stable digital repositories of faculty scholarship (as already exist in many forms), it goes further and calls upon law schools to rely solely on such digital repositories and to cease publishing law journals in printed form.
On the question of open access to primary legal source materials, Professor Ian Gallacher believes law schools (and libraries) should take a leading role. Making primary source materials freely available would do much to serve the large number of law school graduates who practice outside large-firm settings. Gallacher goes on to argue that law schools’ institutional incentives may make them more desirable and effective custodians of primary source materials than private publishers (whose proprietary incentives may discourage widespread or free distribution) or even the government itself. Gallacher’s essay concludes by articulating a number of design standards that should be adopted by open-access archives of primary legal source materials, including: (1) universal openness and accessibility (whether with or without charge); (2) completeness; (3) flexibility in access and presentation; (4) flexibility in search and indexing methods; (5) speed; (6) reliability; (7) permanence; (8) vendor neutrality; (9) a citator or validator to identify doubtful precedents; and (10) community involvement in the development and maintenance of the archive. Some of the most robust nonproprietary legal databases already share many of these characteristics, with the last—community-centered development—perhaps most in need of strengthening.
III. Technical Background: Open Access Initiatives
Efforts to make legal information freely accessible online represent currents within a much broader stream that is the mass digitization movement. Exemplified by large-scale projects like Google Books, the Internet Archive, and Project Gutenberg, the digitization movement has long since succeeded in making the complete works of Shakespeare and Dickens, among countless others, freely accessible to a global audience. The all-encompassing ambitions of the largest digitization initiatives have, almost by happenstance, swept some legal materials within their ambit; it is possible, for example, to find early volumes of the Harvard Law Review on Google Books, or to read the United States Reports on the Internet Archive. The very scope of those projects, however, may diminish their utility: a search for “Harvard Law Review” on Google Books yields in excess of 10,000 “hits,” most of which are mere mentions of the Review in otherwise unrelated publications. A search for “United States Reports” on the Internet Archive returns 164 results (far fewer than the actual number of published volumes) arranged seemingly at random; again, the results include mentions of the United States Reports in other publications arrayed indiscriminately alongside actual copies of the Reports themselves.
Digitization initiatives focused specifically on legal materials, although less well developed overall, have already reached significant milestones in coverage and usability.
- Cornell Law School’s Legal Information Institute (LII) hosts a number of regularly updated federal law resources, including the United States Code and the Code of Federal Regulations. Although LII posts helpful information on each page about how up-to-date that page’s content is, the site suffers most notably from the lack of a citator; that is to say, it is not possible to find cases construing any of the statutes or regulations hosted on LII.
- The student-organized AltLaw project hosts federal appellate case law, including nearly all Supreme Court cases and nearly six decades’ worth of lower federal appellate decisions. AltLaw’s cases are impressively, if incompletely, hyperlinked both forwards and backwards in time: most citations to other content available on AltLaw appear in the form of clickable hyperlinks, and the site includes a functional citator service via the “citations to/from this” link on each page. The site also offers a variety of search tools for sifting through its voluminous case law repository that, if perhaps not yet as fully developed as the query languages provided by proprietary database operators, nevertheless improve upon the basic functionality of searching for key words or phrases.
- The Justia project includes its own browseable copies of both federal and state legislation, along with a case law repository that, unlike most competing alternatives, also hosts district court decisions and dockets.
- Many of the foregoing projects draw data from the bulk collections maintained by public.resource.org, which has scanned thousands of pages of judicial and other records and made the results available for free download in a variety of formats.
- The goal of making court records more accessible is shared by RECAP, a software extension that permits users to “liberate” records from the federal judiciary’s fee-based PACER service and make them available for free on the Internet Archive.
The movement to make primary legal source materials freely accessible experienced what may one day prove to have been a transformative moment on November 17, 2009, when search-engine giant Google announced that it had made several recent decades’ worth of state and federal appellate, district, tax, and bankruptcy court decisions available for searching via its Google Scholar portal. The site includes a rudimentary citator (accessible by clicking the “How cited” tab from each page) that usefully includes citations to secondary sources, such as articles and treatises online at Google Books. Moreover, because it harnesses the market-tested Google search engine, Google Scholar instantly became a “player” in the burgeoning market of alternatives to the traditional proprietary legal database publishers once case law was added. Google’s entry into the open-access case law world has drawn enthusiastic reactions from open-content advocates, although the near-term effect may be simply that Google crowds out other open-content projects. In the wake of Google’s announcement, for example, the AltLaw project essentially declared that it was no longer relevant and would shut down.
Despite their noteworthy accomplishments in a comparatively short time, the many projects working to make legal information freely available online share some common drawbacks. Work focused on ameliorating these shared flaws would do a great deal to make open-access projects viable substitutes for proprietary legal databases.
- The bias towards collecting case law. First, nearly all the projects discussed above have focused on collecting the works of the judicial branch—a worthy endeavor, but one that risks bypassing the most important sources of governing authority in our “age of statutes.” Only the LII devotes substantial resources to collecting and updating federal executive and legislative materials such as the Code of Federal Regulations and United States Code.
- The bias towards contemporary sources. Legal history, if the extant online resources are any guide, began in the mid-20th century. This is still a substantial improvement over the state of play in the open-access world as recently as three or four years ago, when history seemed to begin circa 1994. But to a researcher seeking to illuminate the sources and development of doctrine, the absence of sources antedating 1950 or thereabouts represents a potentially serious impediment.
- Poor hyperlinking and citator functionality. Many open-access sites fail to take advantage of the improved capabilities that hypertext offers over publication of the exact same document in paper form. That is to say, although they reproduce the text of the courts’ opinions as published, they do nothing more than that: it is not possible to click and follow a citation that appears in the text of the court’s opinion and view the cited source, even if the other source is also online. Nor do most sites include any citator functionality whereby the opinion being displayed can be followed forward in time to locate references that cite it.
- Authentication against official referents. Reading an opinion hosted on the official site of the issuing court sends a powerful message that the text is authentic. Reading the same opinion on the site of a large proprietary publisher sends perhaps a different message, but not one that would cause most users to question the fidelity of the reproduction. Reading the same opinion on an open-access site, however, may prompt uncertainty as to the provenance and authenticity of the text. The availability of mechanisms for verifying a text’s authenticity will be an important functionality in encouraging more widespread use of open-access alternatives to proprietary publishers.
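The hyperlinking and citator gap noted in the list above can be made concrete with a minimal sketch. It assumes a toy collection keyed by citation, recognizes only citations of the form "volume U.S. page," and uses a hypothetical URL scheme (`/opinions/...`); the sample opinions are invented placeholders, not real decisions. Under those assumptions, one pass over the hosted texts can yield both clickable citations and a "cited by" index.

```python
import re
from collections import defaultdict

# Assumption: only citations shaped like "<volume> U.S. <page>" are recognized.
CITATION = re.compile(r"\b(\d{1,3}) U\.S\. (\d{1,4})\b")


def find_citations(text):
    """Return the citations found in text, in order, without duplicates."""
    seen = []
    for match in CITATION.finditer(text):
        cite = match.group(0)
        if cite not in seen:
            seen.append(cite)
    return seen


def build_citator(opinions):
    """Map each hosted citation to the hosted opinions that cite it."""
    cited_by = defaultdict(list)
    for citing, text in opinions.items():
        for cited in find_citations(text):
            # Only record links between two documents that are both hosted.
            if cited in opinions and cited != citing:
                cited_by[cited].append(citing)
    return dict(cited_by)


def hyperlink(text, opinions, base_url="/opinions/"):
    """Wrap citations to hosted opinions in HTML links; leave others as text."""
    def link(match):
        cite = match.group(0)
        if cite not in opinions:
            return cite
        slug = f"{match.group(1)}-us-{match.group(2)}"  # hypothetical URL slug
        return f'<a href="{base_url}{slug}">{cite}</a>'
    return CITATION.sub(link, text)


# Placeholder opinions, invented for illustration only.
OPINIONS = {
    "100 U.S. 1": "The first opinion cites nothing.",
    "200 U.S. 2": "We follow 100 U.S. 1 and decline to extend 999 U.S. 9.",
    "300 U.S. 3": "See 100 U.S. 1; see also 200 U.S. 2.",
}
```

A production citator would need a far richer citation grammar (regional reporters, statutes, short-form citations) and some way to flag negative treatment; the point here is only that the forward links and the reverse index come from the same scan.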
In summary, although the legal open-access movement has attained some noteworthy successes, its shortcomings remain prominently visible. For recent federal appellate case law, multiple open-access alternatives exist, some of which have attained great sophistication. Once one moves beyond those types of legal materials, however, the situation becomes far murkier, with haphazard substantive coverage and an underdeveloped suite of tools available to deal with the posted content.
IV. Crowdsourcing as Force Multiplier
A. Building an Informational Commons
A great deal has been written about the “commons-based peer production” phenomenon that began in the world of open-source software and has expanded in the past decade to support mass creation of a wide variety of expressive works. Open-content projects like Wikipedia harness the creative energies of a far-flung community of volunteers and enable them to collaborate asynchronously—their efforts mediated by the distributed architecture of the Internet and given legal stability through a family of specialized copyright licenses. Such mass collaboration—now frequently labeled “crowdsourcing”—enables the distributed creation of works whose scope comfortably exceeds anything that an individual or small group of dedicated professionals could produce. Crowdsourcing can be viewed as a force multiplier: companies and other entities can sometimes get far more work done by opening their projects to collaborative input than they could have accomplished solely through the efforts of their own employees.
Certain types of works lend themselves more easily to crowdsourced production than others. The world still awaits the first peer-produced hit song, blockbuster film, or acclaimed novel. Nevertheless, the crowdsourced approach has proved its value in the creation of an informational commons: a wide variety of informational goods have been created through the internet-mediated efforts of a distributed community of volunteers. The success of peer-production projects in the information economy necessarily raises the question whether an archetypal informational good—the library—might be created through similar means.
B. Crowdsourced Library-Building
The National Archives presently holds a collection of nine billion texts that exist in paper-only form. It has estimated that, by working steadily at an expected pace of 500,000 texts a year, it can digitize those nine billion paper records in 1,800 years. The figure is, one suspects, purposefully outlandish; it is difficult to imagine any human endeavor (outside, perhaps, the realm of religion) that can be sustained over such a gulf of time. If the numerator (the number of texts to be digitized) cannot be changed, perhaps we can focus on the denominator (the number of texts digitized per year). If the National Archives, working alone, can digitize half a million texts a year, then perhaps it should not work alone. Enlarging the pool of contributors who are working to digitize historical texts and make them freely available online would appear to be one obvious solution.
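The numerator-and-denominator argument can be worked through in a few lines. The sketch below uses only the figures quoted above and adds one simplifying assumption that is not in the text: every additional contributor digitizes at the same annual pace. Taken at face value, the quoted figures do not quite reconcile. Nine billion texts at 500,000 a year comes to 18,000 years, while a span of 1,800 years would correspond to roughly five million texts a year. The code therefore treats the rate as an input rather than choosing between the two readings.

```python
def years_to_digitize(total_texts, texts_per_year, contributors=1):
    """Years required when `contributors` equally paced teams share the work.

    Assumption (not from the essay): each contributor sustains the same rate.
    """
    return total_texts / (texts_per_year * contributors)


def implied_rate(total_texts, years):
    """Annual rate that would finish `total_texts` in `years`."""
    return total_texts / years


TOTAL_TEXTS = 9_000_000_000   # figure quoted in the text
STATED_RATE = 500_000         # texts per year, as quoted
STATED_YEARS = 1_800          # years, as quoted

if __name__ == "__main__":
    print(years_to_digitize(TOTAL_TEXTS, STATED_RATE))
    print(implied_rate(TOTAL_TEXTS, STATED_YEARS))
    # Enlarging the denominator: a hundred equally paced contributors.
    print(years_to_digitize(TOTAL_TEXTS, STATED_RATE, contributors=100))
```

On either reading the conclusion is the same: the time required falls in direct proportion to the number of contributors, which is the sense in which crowdsourcing acts as a force multiplier.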
1. Distributed Proofreaders and Project Gutenberg
Project Gutenberg is one of the oldest digital library projects, having been launched in 1971 at the University of Illinois. In the decades since, founder Michael Hart and many other Project Gutenberg contributors have made tens of thousands of books available for online browsing or download in a variety of formats.
In 2000, a group of Project Gutenberg contributors launched a companion Web site, Distributed Proofreaders (“DP”), with the aim of using collaborative techniques to help expand the library of texts available through Project Gutenberg. Their efforts have paid handsome dividends; indeed, as of the time of this writing, a majority of all texts available through Project Gutenberg were contributed via Distributed Proofreaders.
Distributed Proofreaders hosts scanned images (that is to say, pictures) of the pages of new texts that are candidates to be added to Project Gutenberg. Registered users of the site may select a text of interest to them from the listing of currently active proofreading projects. The site then displays to the user a split-screen window showing both a scanned image of the selected page of the work and the corresponding text that appears on that page (generated initially via optical character recognition (OCR) software). Because the uncorrected OCR output frequently contains errors, the text in the lower portion of the split-screen display may not exactly match the page image in the upper portion. Users of the site provide the human judgment that is necessary to match the text to the scanned page image and make any necessary corrections in the text box. The Distributed Proofreaders site provides special instructions concerning how users should mark punctuation and special characters that appear in the scanned page image.
Proofreading at Distributed Proofreaders proceeds in multiple stages, each bringing the document closer to a completed text that multiple persons have verified against the scanned source images. First, each document goes through at least two, and optionally three, proofreading rounds, designated “P1” through “P3” in the nomenclature of the site. At the P1 round, users correct the raw OCR output to match the appearance of the corresponding scanned page image. The P2 and P3 rounds, in turn, take as their input the corrected text produced during the P1 and P2 rounds, respectively. After all the proofreading rounds have been completed, the document proceeds through two formatting rounds (“F1” and “F2”), where the goal is to check that the visual appearance of the proofread text (not merely its wording) mimics the scanned original. There is a final, optional, smooth reading (“SR”) round aimed at ensuring that the final digitized text has been correctly transcribed and formatted.
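The sequence of rounds just described can be summarized as a small ordered pipeline. This is an illustrative sketch rather than a description of the site's actual software: the names `ROUNDS`, `plan_rounds`, and `input_for` are hypothetical. It also assumes that each formatting round, like the P2 and P3 rounds, begins from the output of the round immediately before it.

```python
# Rounds in the order described above; the boolean marks optional rounds.
ROUNDS = [
    ("P1", False),  # correct raw OCR output against the page image
    ("P2", False),  # second proofreading pass
    ("P3", True),   # optional third proofreading pass
    ("F1", False),  # first formatting pass
    ("F2", False),  # second formatting pass
    ("SR", True),   # optional smooth reading
]


def plan_rounds(optional=()):
    """Return the ordered rounds for a project, adding any chosen optional ones."""
    chosen = set(optional)
    return [name for name, is_optional in ROUNDS
            if not is_optional or name in chosen]


def input_for(round_name, plan):
    """Describe what a round starts from: raw OCR, or the prior round's output.

    Assumption: every round after the first consumes its predecessor's output.
    """
    position = plan.index(round_name)
    if position == 0:
        return "raw OCR output"
    return f"{plan[position - 1]} output"


if __name__ == "__main__":
    plan = plan_rounds(optional={"P3", "SR"})
    for name in plan:
        print(name, "<-", input_for(name, plan))
```

Seen this way, a text passes through at least four and at most six separate rounds of human review before release.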
Unlike wiki-based projects, Distributed Proofreaders does not permit every user to participate in each of the site’s activities. It limits users’ eligibility to engage in various proofreading activities according to whether the user has created an account and the user’s prior history with the project. Access to all the later proofreading and formatting stages is granted only via application to the site’s administrators based upon certain eligibility criteria, as follows:
- Unregistered users of the site may participate only in SR rounds.
- Registered DP users initially may participate only in P1 proofreading rounds.
- To become eligible to participate in P2 proofreading rounds, a user must (1) complete 300 proofread pages at the P1 level, (2) be a member of the site for at least 21 days, and (3) pass a five-part proofreading quiz aimed at testing the user’s familiarity with the proofreading markup conventions used at the site.
- To become eligible to participate in F1 formatting rounds, a user must (1) complete 300 proofread pages at the P1 level, (2) be a member of the site for at least 21 days, and (3) pass a five-part formatting quiz aimed at assessing the user’s familiarity with DP’s formatting markup.
- To become eligible to participate in P3 proofreading rounds, a user must (1) proofread a total of 400 pages, of which at least 50 must be from a “P2” round, (2) complete a further 50 pages in an “F1” formatting round, and (3) pass a proofreading quiz.
- Finally, to become eligible to participate in F2 formatting rounds, a user must (1) complete 400 pages in an “F1” formatting round, and (2) be a member of the site for at least 91 days.
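The eligibility criteria listed above amount to a small rule set, and writing them out as code shows how layered they are. The sketch below is hypothetical throughout: the `Volunteer` record and the `rounds_met` function are illustrative names, not part of any Distributed Proofreaders software. It also makes three assumptions the text does not settle: that smooth reading stays open to registered users, that the "total" of 400 pages for P3 counts pages from all proofreading rounds, and that the same proofreading quiz serves for both P2 and P3. Because access is granted only on application to the site's administrators, the function reports which criteria are met, not which access has actually been granted.

```python
from dataclasses import dataclass, field


@dataclass
class Volunteer:
    """Hypothetical record of one user's history with the project."""
    registered: bool = True
    days_as_member: int = 0
    pages: dict = field(default_factory=dict)   # round name -> pages completed
    quizzes: set = field(default_factory=set)   # e.g. {"proofreading"}

    def done(self, round_name):
        return self.pages.get(round_name, 0)


def rounds_met(v):
    """Return the rounds whose stated criteria the volunteer satisfies."""
    if not v.registered:
        return {"SR"}
    met = {"SR", "P1"}  # assumption: SR remains open after registration
    # Assumption: the 400-page total counts all proofreading rounds.
    proofread_total = v.done("P1") + v.done("P2") + v.done("P3")
    if v.done("P1") >= 300 and v.days_as_member >= 21:
        if "proofreading" in v.quizzes:
            met.add("P2")
        if "formatting" in v.quizzes:
            met.add("F1")
    if (proofread_total >= 400 and v.done("P2") >= 50
            and v.done("F1") >= 50 and "proofreading" in v.quizzes):
        met.add("P3")
    if v.done("F1") >= 400 and v.days_as_member >= 91:
        met.add("F2")
    return met


if __name__ == "__main__":
    newcomer = Volunteer()
    veteran = Volunteer(days_as_member=100,
                        pages={"P1": 350, "P2": 60, "F1": 400},
                        quizzes={"proofreading", "formatting"})
    print(sorted(rounds_met(newcomer)))
    print(sorted(rounds_met(veteran)))
```

Even in this simplified form, a volunteer needs hundreds of completed pages and about three months of membership before every round is open, which gives a sense of the gap between this model and a wiki that anyone may edit.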
Regardless of the level of access they have attained, ordinary users of Distributed Proofreaders are unable to add new texts to the site. The ability to upload new scanned images and to initiate new proofreading projects is reserved for DP Project Managers, a status that must be granted separately by the site’s administrators. The requirements for a DP user to be given Project Manager status include: (1) familiarity with the site’s workflow process and guidelines; (2) a minimum of 400 pages completed at the “F1” level; and (3) identification of an existing Project Manager who is willing to serve as a mentor.
Professor Yochai Benkler rightly celebrates Distributed Proofreaders as a paradigmatic crowdsourcing success story, and others have emphasized the site’s potential value as a tool for crowdsourced library-building. The numbers are difficult to quarrel with. In the decade since its founding, volunteers coordinating their efforts through Distributed Proofreaders have proofread and released in electronic form (through Project Gutenberg) over 17,000 texts. DP’s strengths include a large and supportive user community (with over 3,000 contributors active in the last 30 days at the time of this writing) and a rapid proofreading process, with completion times even for lengthy works measured in weeks (at least in the early rounds).
Nevertheless, a number of structural weaknesses may limit DP’s utility as a tool for improving access to primary legal source materials. Unlike many of the most vibrant peer-produced informational projects, DP maintains a bureaucratic, hierarchical structure: site administrators adjudicate users’ compliance with the site’s daunting criteria for promotion to higher levels of access, and all but the most senior users are barred entirely from contributing new works. Furthermore, DP’s selection of texts is driven by the philosophy underlying its senior partner, Project Gutenberg, which expressly aims to maximize the inclusion of texts popular with a mass audience. In consequence, Distributed Proofreaders and Project Gutenberg include comparatively few texts of interest to the legal community. Project Gutenberg’s mission discourages the addition of such texts, and the DP architecture makes adding them difficult even for users inclined to do so. Both factors hinder efforts to broaden the scope of the project’s coverage.
2. Crowdsourcing the Wiki Way
Of the nine wiki projects operated by the nonprofit Wikimedia Foundation (“WMF”), one—Wikipedia—has garnered most of the scholarly praise and criticism. WMF’s other projects (Wikibooks, Wikinews, Wikiquote, Wikisource, Wikispecies, Wikiversity, Wiktionary, and the Wikimedia Commons) have their own communities of dedicated users, who use a common set of wiki-based tools to contribute content within the scope of their respective missions. They have so far failed, however, to capture the academic imagination in quite the same way as Wikipedia—which, like an open flame, seems to have the power to draw all the oxygen out of academic discourse on law and wiki technologies. This is unfortunate, because WMF’s projects include another candidate that shares many of Wikipedia’s strengths, omits its most prominent weaknesses, and offers a natural fit with the interests and concerns of academics and others who study and value the public domain. That project is Wikisource.
Wikisource is a digital library of previously published free-content works. The project’s eligibility criteria are strict; only works that are in the public domain in the United States or are licensed under terms that allow free copying, modification, and reuse (including commercial use) are permitted to be hosted on the site. The requirement of prior publication is intended to ease verification (that is, to make it possible for the site’s geographically far-flung users to confirm that the text posted at the site matches the published original) and to deter misuse of the site for self-publication.
Wikisource’s mission differs from Wikipedia’s in ways that tend to insulate it against some of the criticisms often aimed at its larger sibling. Wikipedia’s stated goal is to describe the world from a neutral point of view—a goal that may be epistemologically unattainable, and at a minimum invites ongoing debate over the “neutrality” of articles published on the site. Wikisource’s polestar, in contrast, is not neutrality, but faithful reproduction of a source text as published. It is easy to imagine users reasonably holding differing opinions about whether the Wikipedia biographies of Presidents Barack Obama or George W. Bush adhere to the stated standard of neutrality; it is less easy to imagine users reasonably adhering to different views about whether the text reproduced at Wikisource matches the content of the published source.
Like Distributed Proofreaders, Wikisource now draws most new content from users who proofread and correct the text extracted from scanned page images of a published source. Unlike Distributed Proofreaders, however, Wikisource was not originally engineered with proofreading of page scans in mind. This functionality has been in place only during the last two to three years of the project’s existence. Nevertheless, the site now offers a clean and well-organized user interface that at least rivals, and perhaps exceeds, the usefulness and intuitive functionality of Distributed Proofreaders.
First, each scanned volume image accessible at Wikisource (which typically, although not always, corresponds to a separately bound hard copy volume of a work as originally published) has a so-called “Index page” that reproduces identifying information about that volume as a whole, such as the title, author, publisher, year of publication, and possibly a table of contents. The volume’s Index page also includes links to each individual page contained within the volume. Each page link is color-coded using a standard schema that applies site-wide and reflects, in essence, the level of confidence of the project’s users that the text reproduced at that link accurately reflects the content of the corresponding scanned page. Thus, the Index page reveals at a glance how much progress the site’s users have made towards finalizing the proofreading and correction of the work. The color codes used on the site are:
- Red (“Not Proofread”): Signifies that the linked page contains text, but no user of the site has checked the text for accuracy. This color code is typically applied where the text included on the linked page consists entirely of the raw output of OCR software.
- Yellow (“Proofread”): Signifies that one user of the site has proofread and corrected the linked text so that it matches the content and formatting of the corresponding scanned page image.
- Green (“Validated”): Signifies that two or more users of the site have proofread and corrected the text of the linked page. This is the highest rating of page quality available on Wikisource.
In addition, there are three further color codes used on the site that convey additional information about the status of the corresponding linked page:
- Purple (“Problematic”): Signifies that the text on the linked page does not match the scanned original due to an error in the scanned image (such as a blurry or misaligned page), or because the content of the scanned page cannot be accurately reproduced for some other reason.
- Gray (“Unnecessary to Proofread”): Signifies that the corresponding page either is blank, or contains some content other than text (for example, an image or illustration).
- White (“Empty Page”): Signifies that, although the linked page includes a scanned image of the original source, no corresponding text of any kind exists yet on Wikisource. The site offers easy ways for users to fill in empty pages (generally upgrading the corresponding link from “Empty” to “Not Proofread” in the process), either by extracting text embedded in the image file, or by running an on-site OCR tool on the image.
The volume index page for a given work available for proofreading on Wikisource thus can appear, at any given moment, as an information-rich (and colorful) mosaic, instantly reflecting the validation level of each page included within that volume.
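The color scheme described above can be modeled in a few lines of code. The following is only an illustrative sketch in Python, not a description of how Wikisource itself is implemented: the names `PageStatus` and `summarize_index` and the ten-page sample volume are hypothetical, and the decision to leave "Unnecessary to Proofread" pages out of the progress calculation is an assumption made for the example.

```python
from collections import Counter
from enum import Enum


class PageStatus(Enum):
    """Page quality levels from the text, keyed by index-page link color."""
    EMPTY = "white"          # scanned image exists, but no text yet
    NOT_PROOFREAD = "red"    # text present (often raw OCR), unchecked
    PROBLEMATIC = "purple"   # text cannot be matched to the scan
    UNNECESSARY = "gray"     # blank page or non-text content
    PROOFREAD = "yellow"     # checked by one user
    VALIDATED = "green"      # checked by two or more users


def summarize_index(statuses):
    """Return per-status counts and the share of text pages already checked."""
    counts = Counter(statuses)
    # Assumption: pages needing no proofreading are left out of the denominator.
    text_pages = len(statuses) - counts[PageStatus.UNNECESSARY]
    checked = counts[PageStatus.PROOFREAD] + counts[PageStatus.VALIDATED]
    share = checked / text_pages if text_pages else 0.0
    return counts, share


# Hypothetical ten-page volume, for illustration only.
volume = (
    [PageStatus.UNNECESSARY] * 2
    + [PageStatus.VALIDATED] * 3
    + [PageStatus.PROOFREAD] * 2
    + [PageStatus.NOT_PROOFREAD] * 2
    + [PageStatus.EMPTY]
)
counts, share = summarize_index(volume)
print(f"{share:.1%} of text pages proofread or validated")
```

A summary of this kind conveys in a single figure what the color mosaic on an Index page conveys visually.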
Clicking on any of the individual page links from the Index page opens a page-level view where proofing and correction of the text actually occurs. In an improvement over the Distributed Proofreaders interface, Wikisource displays the extracted text and the corresponding scanned page image side by side. Clicking on the “Edit” tab at the top of the page displays a scrolling text box side-by-side with the scanned image of the page. Users may enter any necessary corrections in the text window and update the contents of the page by clicking the “Save” button. If the user indicates a change in the page’s overall proofreading level (by clicking an adjacent radio button for whichever color coding is appropriate), the color of the corresponding link from the volume’s index will automatically be updated to reflect the changed status of that page.
As with Distributed Proofreaders, when the scanned pages of a work have been proofread to a satisfactory quality level, the proofread text of all (or some) of the pages within the work can be joined together to form a single electronic file of the complete proofread text. Unlike at Distributed Proofreaders, however, this process resides entirely within the control of the users of the site and requires no additional software. Subject only to certain technical constraints imposed by the underlying architecture, the corrected text from dozens or hundreds of scanned original pages may be automatically joined together to form a single Web page with the complete text of the entire original work. The common practice on Wikisource is to keep the scanned page images available even after proofreading is complete in order to ease authentication; for most works recently added to the site, users may verify for themselves (by clicking a small page number link that typically appears in the margin of the displayed text) that the text displayed at the site matches the content of the scanned page image.
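The joining step can be pictured with a short sketch. The Python below is a simplified illustration under stated assumptions rather than the mechanism Wikisource actually uses: the function `join_pages`, the bracketed page markers, and the placeholder page text are all hypothetical. It shows only the general idea of concatenating a range of corrected pages while keeping a page reference that lets a reader trace each passage back to its scan.

```python
def join_pages(pages, first, last):
    """Concatenate the corrected text of pages first..last (inclusive).

    `pages` maps a page number to its corrected text. Each page's text is
    preceded by a marker such as "[p. 2]" so that a reader can locate the
    corresponding scanned page image.
    """
    parts = []
    for number in range(first, last + 1):
        if number not in pages:
            raise KeyError(f"page {number} has no corrected text yet")
        parts.append(f"[p. {number}] {pages[number].strip()}")
    return "\n".join(parts)


# Placeholder text for three hypothetical scanned pages.
pages = {
    1: "first page text ",
    2: " second page text",
    3: "third page text",
}
document = join_pages(pages, 1, 2)
print(document)
```

Joining only a sub-range (here, pages 1 and 2) mirrors the point made above that all or only some of a volume's pages may be combined into a single text.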
Wikisource, unlike Distributed Proofreaders, is a wiki: with the exception of a small number of "locked" pages, any user of the site may add or edit any work in the library. Thus, Wikisource removes some of the obstacles that make Distributed Proofreaders and Project Gutenberg unpromising candidates for hosting source materials of interest to the legal community. Indeed, the barriers to adding a new work to Wikisource are exceptionally low: so long as a user can locate (or create) an electronic file containing scanned images of the source work as originally published—and there are many scanned legal texts already available online at sites such as Google Books or the Internet Archive—the only indispensable step consists of uploading a set of suitable scans to Wikimedia Commons where it will be accessible by Wikisource. Every other step of the process—creating an index page, extracting (or creating) uncorrected OCR text from the scanned file, proofreading and correcting the text, and joining the corrected text pages together to form a consolidated e-text—can be crowdsourced.
The Wikisource process has already been used to make some texts of interest to the legal community freely accessible online. Indeed, Wikisource now hosts some texts that are not yet freely available anywhere else, such as key portions of the legislative history for the landmark Copyright Act of 1976. In an ongoing experiment to use the site to expand the availability of historical materials, the present author made over 70,000 pages of scanned images (taken mostly from the Library of Congress’s outstanding American Memory project) and raw OCR text, representing the first forty-three volumes of the United States Statutes at Large, available for proofreading and correction on Wikisource. At the time of this writing, all the public and private laws and resolutions of the First United States Congress, which sat in three sessions from March 4, 1789 to March 3, 1791, have been proofread and made publicly available by users of the site. Other selected statutes and proclamations within the scanned collection have also been proofread by users with an interest in particular issues or periods in American legal history. The material proofread to date represents a very small fraction of the full dataset comprising the early volumes of the Statutes at Large. Nevertheless, sufficient progress has occurred so far to at least demonstrate the viability of crowdsourced proofreading of legal texts, specifically works that would be unlikely to be included at Distributed Proofreaders.
Wikisource can also serve as a repository for legal scholarship that meets the site’s inclusion criteria, thus potentially bringing together scholarship and primary source materials in a way not presently replicated by any other open-access repository.
By virtue of its design, Wikisource comports with many (although certainly not all) of Professor Ian Gallacher’s proposed design standards for open-access archives of primary legal source materials. Wikisource’s collection is accessible worldwide. It can be presented in a variety of formats (or downloaded freely and further processed to meet a user’s specific presentation needs), and its contents are open to indexing by Google or other standard search engines. The output format of any work hosted on Wikisource is an XHTML web page, an open vendor-neutral format that nevertheless enables preservation of a great deal of the original work’s formatting. (When viewing any page within Wikisource or any of the other Wikimedia Foundation wikis, such as Wikipedia, using the “View Page Source” function within one’s web browser will indicate, in the <!DOCTYPE> declaration on the first line of the page source, the type of document being viewed.) The Wikimedia Foundation’s globally distributed server architecture yields adequate response speeds in ordinary use. The site offers permanence in the form of downloadable snapshots of the full database as it existed at various points in time; if Wikisource itself were to ever go offline (for example, if WMF were ever to become insolvent), the content of the site (except for the most recent edits) could be swiftly restored by anyone with a mirror copy of the most recent database dump. A citator of sorts is available from the “What Links Here” link on every page of any WMF wiki, and the entire project is open to public development and maintenance.
For the purpose of assessing its potential value as a possible open-access repository for legal source texts, Wikisource’s strengths include: (1) a well-developed and stable architecture that enables contributions by any user familiar with the standardized MediaWiki editing syntax (which is much easier to learn than HTML); (2) the openness of its database, which any user may edit or expand; (3) the relative sophistication and user-friendliness of the site’s user interface; (4) the existence of a community of users within the site who are interested in legal topics and have already made several legal source texts available; and (5) the ease of authentication provided by the site’s use and preservation of scanned page images from the original published sources. Wikisource’s most evident weaknesses stem from the comparatively small community of users of the site: by any measure, Wikisource is a tiny project compared with Wikipedia or Distributed Proofreaders. The smaller number of users at the site translates into substantially greater time required to complete any given proofreading project and has also limited the number of texts that have been added to the site. Thus, Wikisource remains very far from approaching Professor Gallacher’s ideal of completeness for an open-access repository. Nevertheless, Wikisource offers an interesting alternative to Distributed Proofreaders as a platform for mass collaboration in making a variety of works freely available to the public, and the successes the site’s users have achieved to date offer a hint of its promise.
C. Pros and Cons
Crowdsourcing methods represent territory mostly unexplored by the various projects currently working to provide open access to legal source materials. The “peer production” approach, which has been ably used to create a wide variety of other informational goods, holds at least some promise as a tool for making legal and historical materials available more widely and without restriction.
Most fundamentally, crowdsourcing techniques alleviate resource constraints that otherwise limit the scope and operations of typical open-access efforts. Many of the organizations that have launched legal open-access sites are arms of educational or nonprofit institutions, and their reach is constrained by available resources.
Reduced organizational overhead is a second identifiable benefit of crowdsourcing. As the example of the Statutes at Large illustrates, launching a new crowdsourced open-access initiative is a project within the means of a dedicated individual acting on his or her own initiative. The need to build committee structures or to lobby for consensus-based decision-making is not an impediment; texts within any single user’s areas of interest and expertise may be added to a project almost effortlessly, with other users of the site free to contribute as their own interest and curiosity dictates. Inviting interested members of the legal community and the public to collaborate in building a free commons of legal source materials removes the resource constraints of any single organizing entity as a limiting factor.
The wiki-based architecture of a project like Wikisource offers another potential benefit in the form of reducing barriers to participation. Wikisource’s approach differs markedly from Distributed Proofreaders’: the open wiki-based architecture invites and facilitates participation by users of widely varying expertise. Some users may be competent in proofreading the scanned OCR text and marking rudimentary corrections, others may be knowledgeable about the MediaWiki formatting markup used across all the WMF sites, others may excel at categorization and indexing, and still others may have the type of skills that are necessary to program templates or scripts for managing more complex tasks. The architecture of the site permits users to contribute according to their respective expertise.
Like any other organizational tool, however, crowdsourced methods have weaknesses as well as strengths. Goal-setting and prioritization of work is a recurring issue for projects situated outside any formal organizational structure. For example, Wikisource, like Wikipedia or any number of similar open-content projects, has no “benevolent dictator” assigning tasks or ensuring that user effort flows to where it is most needed. User contributions are largely self-directed towards those areas where their interests happen to gravitate.
The problem of sustaining user engagement over time in the absence of traditional incentives (such as the payment of a salary) is also endemic to many crowdsourced projects. Users of peer-produced projects are free to come and go, and there is no guarantee that a user who launches any given project will see it through to completion. Although some users diligently perform work (such as archiving past discussions and rationalizing the site’s frequently confusing categorization structures) that improves the quality and usefulness of the site overall, most users appear to be focused on expanding the library by adding new content. In consequence, Wikisource is an unruly patchwork, with comparatively stable and well-organized content existing alongside fragmentary works organized only according to the idiosyncratic whim of a particular contributor. Although Wikipedia seems not to be in any imminent danger of failing, it is hardly difficult to locate examples of essentially moribund projects on Wikisource or any of the smaller WMF sites. In contrast, the Distributed Proofreaders architecture channels user participation by requiring users to select from a small number of currently ongoing projects if they wish to participate, and may actually provide some structural advantages here.
Despite remarkable successes in the past fifteen years or so, no open-access project for primary legal source materials approaches the size and sophistication of the large proprietary legal databases. Proprietary database publishers benefit from an inflow of subscriber revenues that no open-access project can hope to match; indeed, the whole point of the open-access movement is to provide an alternative to the proprietary subscriber-access paradigm and make information freely accessible to all.
A variety of high-quality informational goods have been produced using nonproprietary production processes that aggregate the individual contributions of a wide community of volunteers. As a matter of principle, there is no reason why such a crowdsourced production process might not be employed to extend access to legal materials and scholarship. The technological architecture for building new open-access projects in the legal arena is already in place; all that is missing is a sufficiently large pool of contributors willing to assist in building the informational commons as their interests and abilities permit.
To maximize the overall benefit to the information commons, crowdsourced projects should aim to supplement rather than to supplant existing open-access repositories for legal works. New projects should aim at building strength in areas where existing repositories are weak: they should focus more on legislative and executive materials rather than case law, and more on historical rather than contemporary works. Adding contextual richness with hyperlinking and verifiability against official sources will make such projects more attractive for everyday use and provide a practical alternative to proprietary legal databases.
|Site [a]||Content Pages [b]||Registered Users [c]||Active Users [d]||Database Size [e]|
|Wikimedia Commons||94,283||984,709||21,127||1.5 GB|
a All references are to the English-language versions of the listed sites (so, the statistics for Wikipedia, for example, are those of en.wikipedia.org), with the exception of Wikimedia Commons. The Commons is a cross-language repository used by all Wikimedia Foundation sites to store graphic images, audio or video clips, and scanned page images that are intended for use at any Wikimedia Foundation site.
b The figures presented in the next three columns are taken from the Statistics page of each indicated site, which may be accessed by typing Special:Statistics into the Search box on each site’s home page. The data in these columns are current as of January 15, 2010.
c The figures in this column also come from the Statistics page of the indicated site and are current as of January 15, 2010. Although all the WMF sites permit editing by users who have not registered and created an account, certain practical advantages accrue from registration. The figures listed in this column represent the total number of users who have registered and created an account on the indicated site.
d The figures in this column also come from the Statistics page of the indicated site and are current as of January 15, 2010. The figures represent the number of registered users who have edited the site within the preceding 30 days.
e Snapshots of the complete database of each of the listed sites are made available for download periodically at download.wikimedia.org. Snapshots are not prepared for each site according to the same schedule; thus, it is not possible to compare the size of each of the listed sites as of a single common date. The relative sizes of the download archives, however, are generally reflective of the quantity of content available at each site listed. The sizes listed represent the size of the complete database (with all editing history intact) as a single compressed file, in gigabytes (GB) or megabytes (MB), as indicated. The quoted figures are taken from the following snapshot dates: Oct. 22, 2009 (Wikipedia); Jan. 8, 2010 (Wikimedia Commons); Jan. 10, 2010 (Wikinews and Meta-Wiki); Jan. 11, 2010 (Wikisource); Jan. 12, 2010 (Wikiversity); and Jan. 13, 2010 (Wiktionary, Wikibooks, Wikispecies, and Wikiquote).
|Language||Abbr||Content Pages||Registered Users||Active Users||Database Size|
- The home page of each of the listed sites is accessible by prepending the two-character language abbreviation to the common suffix wikisource.org—thus, en.wikisource.org, zh.wikisource.org, and so forth.
- The figures in the first five columns are taken from the Statistics pages of each listed site. The statistics page for the English-language Wikisource is available at http://en.wikisource.org/wiki/Special:Statistics. Replacing "en" with another site’s language code in that URL takes the user to the Statistics page of that language’s site. The quoted statistics are as of January 15, 2010.
- Database size figures are taken from download.wikimedia.org as explained in Note e to Table 1. The quoted sizes for each downloadable snapshot are taken from the following dates: Jan. 8, 2010 (fr, ar); Jan. 10, 2010 (pt, ru, de, he); Jan. 11, 2010 (en, zh); Jan. 12, 2010 (es); and Jan. 14, 2010 (it).
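The URL pattern described in these notes lends itself to a brief illustration. The following Python sketch simply substitutes a language code into the Statistics URL quoted above; the function name `statistics_url` and the `project` parameter are hypothetical conveniences introduced for this example, and the sketch makes no attempt to retrieve or parse the statistics themselves.

```python
def statistics_url(language_code, project="wikisource"):
    """Build the Special:Statistics URL for one language edition of a site."""
    code = language_code.strip().lower()
    if not code.isalpha():
        raise ValueError(f"unexpected language code: {language_code!r}")
    # Pattern taken from the note above: replace "en" with the language code.
    return f"http://{code}.{project}.org/wiki/Special:Statistics"


# Language codes mentioned in the notes to this table.
for code in ["en", "zh", "fr"]:
    print(statistics_url(code))
```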
Figure 1. Proofreading User Interface at Distributed Proofreaders
Figure 2. A sample Wikisource index page
Figure 3. Side-by-side page view at Wikisource
Figure 4. Proofreading a scanned page at Wikisource
- Associate Professor of Law, University of Cincinnati College of Law. B.A. 1989, M.P.Aff. 1993, J.D. 1993, The University of Texas at Austin; LL.M. 2005, Harvard Law School. This work began as a series of presentations which I delivered at the summer 2008 and 2009 CALI Conferences for Law School Computing, and an updated version was presented at the meeting of the Section of Internet and Computer Law at the 2010 Annual Meeting of the Association of American Law Schools (AALS). I appreciate the thoughtful comments of the attendees at each of these sessions. Research support from the Harold C. Schott Foundation is gratefully acknowledged, as is the research assistance of Ron Jones. Copyright © 2010, Timothy K. Armstrong.
This work is licensed under the Creative Commons Attribution-Share Alike 3.0 United States license. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/, or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California, 94105, USA. For purposes of Paragraph 4(c) of said license, proper attribution must include the name of the original author and the name of the Santa Clara Computer & High Technology Law Journal as publisher, the title of the Article, the Uniform Resource Identifier, as described in the license, and, if applicable, credit indicating that the Article has been used in a derivative work.
- See, e.g., Authors Guild, Inc. v. Google Inc., No. 05 CV 8136(DC), 2009 WL 5576331 (S.D.N.Y. Nov. 19, 2009) (preliminarily approving proposed amended settlement agreement).
- See infra note 69.
- See Katie Hafner, History, Digitized (and Abridged), N.Y. Times, Mar. 11, 2007, § 3 (Magazine), at 1.
- See, e.g., Yochai Benkler, The Wealth of Networks 68–90 (2006) (collecting examples).
- See infra note 142 and accompanying text.
- See Wikisource, Pages that link to “Copyright Law Revision (House Report No. 94-1476)”, at http://en.wikisource.org/wiki/Special:WhatLinksHere/Copyright_Law_Revision_(House_Report_No._94-1476) (accessed Apr. 15, 2010); Wikisource, Pages that link to “Copyright Law Revision (Senate Report No. 94-473)”, at http://en.wikisource.org/wiki/Special:WhatLinksHere/Copyright_Law_Revision_(Senate_Report_No._94-473) (accessed Apr. 15, 2010).
- As one writer put it: “[M]any nerds believe that a billion readers can reliably weave together the pages of old books, one hyperlink at a time. Those with a passion for a special subject, obscure author or favorite book will, over time, link up its important parts. Multiply that simple generous act by millions of readers, and the universal library can be integrated in full, by fans for fans.” Kevin Kelly, Scan This Book!, N.Y. Times, May 14, 2006, § 6 (Magazine), at 42, 45.
- See Peter Suber, Open Access Overview, http://www.earlham.edu/~peters/fos/overview.htm (last visited Sept. 29, 2009) (“Open-access (OA) literature is digital, online, free of charge, and free of most copyright and licensing restrictions.”). There is no single settled definition of “open access,” although most conventional understandings of the term share common traits (the most important being the relative ease and low cost of access as compared with the traditional proprietary publication paradigm). See generally John Willinsky, The Access Principle: The Case For Open Access to Research and Scholarship App. A (2006) (cataloging “ten flavors of open access”); Lawrence B. Solum, Download It While It’s Hot: Open Access and Legal Scholarship, 10 Lewis & Clark L. Rev. 841, 856–57 (2006). The open access movement is a global phenomenon guided and informed by a number of declarations of principles issued by international groups, a full cataloging of which lies beyond the scope of the present essay. See, e.g., Richard A. Danner, Applying the Access Principle in Law: The Responsibilities of the Legal Scholar, 35 Int’l J. Leg. Info. 355, 359–66 (2007) (summarizing several of the pertinent declarations); David W. Opderbeck, The Penguin’s Paradox: The Political Economy of International Intellectual Property and the Paradox of Open Intellectual Property Models, 18 Stan. L. & Pol’y Rev. 101, 107–09 (2007) (recounting pertinent history).
By focusing on issues involving the legality of access to the underlying content, most discussions of open access elide related issues such as the openness of the software platforms used in creating and reading the content or the openness of the networks over which the content flows. See, e.g., Access Denied: The Practice and Policy of Global Internet Filtering (Ronald Deibert et al., eds., 2008) (surveying state actors’ controls over Internet information flows); Stephen Murgatroyd, Access to Knowledge in an e-Connected World, in The E-Connected World: Risks and Opportunities 79 (Stephen Coleman, ed., 2003) (acknowledging interrelationships among these concerns). This essay adheres to convention in focusing on the question of open access to content, while recognizing that other issues may carry greater force in particular circumstances.
- See, e.g., Joseph Scott Miller, Foreword: Why Open Access to Scholarship Matters, 10 Lewis & Clark L. Rev. 733 (2006); Nicholas Bramble, Preparing Academic Scholarship for an Open Access World, 20 Harv. J.L. & Tech. 209 (2006).
- See Willinsky, supra note 8, at 22 (“[O]pen access is associated with increased citations for authors and journals, when compared to similar work that is not open access”); id. at 22–24 (summarizing research). For a look at some of the methodological pitfalls of studies of this type, as well as some possible solutions, see Bernard S. Black & Paul L. Caron, Ranking Law Schools: Using SSRN to Measure Scholarly Performance, 81 Ind. L.J. 83, 92–95 (2006).
- See Solum, supra note 8, at 859 (“There will come a day when the saying, ‘If it isn’t on the net, it doesn’t exist,’ is true. Open access legal scholarship will be the only legal scholarship that is actually read. Closed access legal scholarship will be the tree that falls with no one in the forest.”); Carol A. Parker, Institutional Repositories and the Principle of Open Access: Changing the Way We Think About Legal Scholarship, 37 N.M. L. Rev. 431, 431 (2007) (suggesting that “open access to legal scholarship will soon be adopted and implemented by every law school in the United States”); Richard A. Danner et al., The Twenty-First Century Law Library, 101 Law Libr. J. 143, 146 (2009) (“[T]he fact that young people are going to Google and to Wikipedia first is a call to arms in a way”) (comments of Richard A. Danner).
- See, e.g., Marci Hoffman & Katherine Topulos, Tyranny of the Available: Under-Represented Topics, Approaches, and Viewpoints, 35 Syracuse J. Int’l L. & Com. 175, 188–90 (2008); Paul L. Caron, Bloggership: How Blogs Are Transforming Legal Scholarship, 84 Wash. U.L. Rev. 1025 (2006).
- See Willinsky, supra note 8, ch. 2; Dan Hunter, Walled Gardens, 62 Wash. & Lee L. Rev. 607, 613–17 (2005). The effect of subscription costs as a driver of open access is surely greater in technical fields, where subscription rates for specialty publications may run into the thousands (or even tens of thousands) of dollars a year, than in law. But see Danner, supra note 8, at 377 (“[B]ecause they enjoy unlimited (and apparently cost-free) access to law journals and other information through Westlaw, LexisNexis, Hein Online, and other databases, it might be hard for law students and faculty to appreciate the impacts of access costs on researchers outside the U.S. legal education environment.”); Solum, supra note 8, at 863 (“[A]s you move from major research universities to regional universities to local colleges, the access of faculty and students to closed electronic databases (Westlaw, LexisNexis, JSTOR, etc.) begins to become very sketchy. In the least-developed countries, such access is virtually nonexistent.”).
- See, e.g., Michael J. Madison et al., The University as Constructed Cultural Commons, 30 Wash. U. J.L. & Pol’y 365, 399–400 (2009).
- See Harvard Law faculty votes for ‘open access’ to scholarly articles, http://www.law.harvard.edu/news/2008/05/07_openaccess.html (last visited Oct. 6, 2009).
- See Natasha Plotkin, MIT Will Publish All Faculty Articles Free In Online Repository, The Tech, Mar. 20, 2009, available at http://tech.mit.edu/V129/PDF/N14.pdf.
- On the other hand, the news is not uniformly favorable. In April 2009, the faculty of the University of Maryland defeated a resolution encouraging (but not requiring) that faculty members make their scholarship available in open-access repositories.
- See Duke Law School, Faculty Scholarship Repository, http://www.law.duke.edu/scholarship/repository (last visited Oct. 6, 2009). The earliest work presently found in the collection is Robinson O. Everett, Securing Security, 16 Law & Contemp. Probs. 49 (1951), available at http://eprints.law.duke.edu/365/. See generally Danner, supra note 8, at 393–94.
- The DASH repository is online at http://dash.harvard.edu/ (last visited Oct. 6, 2009).
- See Oklahoma City University School of Law, Faculty Scholarship Repository, http://www.okcu.edu/law/facultyandadministration/publications/index.php (last visited Oct. 6, 2009).
- See, e.g., Parker, supra note 11, at 431–32; Jessica Litman, The Economics of Open Access Law Publishing, 10 Lewis & Clark L. Rev. 779, 791–92 (2006); Black & Caron, supra note 10.
- See Science Commons: Open Access Law Program, http://sciencecommons.org/projects/publishing/oalaw/ (last visited Feb. 12, 2010).
- See Science Commons: Open Access Law: Adopting Journals, http://sciencecommons.org/projects/publishing/oalaw/oalawjournals/ (last visited Feb. 12, 2010).
- See, e.g., infra note 149 and accompanying text.
- Cf. Michael W. Carroll, The Movement for Open Access Law, 10 Lewis & Clark L. Rev. 741, 742–43 (2006) (arguing that open access to legal scholarship also confers public benefits by lowering litigants’ costs of access to novel legal theories that may persuade courts to rule in their favor).
- See, e.g., id. at 746; Justin Hughes, Created Facts and the Flawed Ontology of Copyright Law, 83 Notre Dame L. Rev. 43, 77–78 (2007) (considering several justifications for open access to court decisions); Nash v. Lathrop, 6 N.E. 559, 560 (Mass. 1886) (“it needs no argument to show that justice requires that all should have free access to the opinions, and that it is against sound public policy to prevent this, or to suppress and keep from the earliest knowledge of the public the statutes, or the decisions and opinions of the justices.”).
- See Timothy K. Armstrong, Chevron Deference and Agency Self-Interest, 13 Cornell J.L. & Pub. Pol’y 203, 273–74 (2004). For an argument that principles of governmental accountability support the adoption of data transparency practices by federal agencies, see David Robinson, et al., Government Data and the Invisible Hand, 11 Yale J.L. & Tech. 159 (2008–2009).
- See 17 U.S.C. § 105 (2006).
- See, e.g., Veeck v. Southern Building Code Cong. Int’l, 293 F.3d 791 (5th Cir. 2002) (en banc) (reasoning that copyright protection in text of privately authored model building code evaporated when that code was enacted as positive law by two municipalities); L. Ray Patterson & Craig Joyce, Monopolizing the Law: The Scope of Copyright Protection for Law Reports and Statutory Compilations, 36 UCLA L. Rev. 719, 809–10 (1989) (reasoning that pagination and chapter or section numbering of primary legal source materials are insufficiently expressive to qualify for copyright protection, and decrying efforts to enforce such protections as “in effect impos[ing] a tax for the use of the law”); Katie Fortney, Ending Copyright Claims in State Primary Legal Materials: Towards an Open Source Legal Operating System, http://ssrn.com/abstract=1347158 (last visited Oct. 20, 2009).
In the spring of 2008, the Office of Legislative Counsel of the State of Oregon, with what might charitably be described as a veneer of legal justification, took the remarkable step of asserting copyright protection over its own official compilation of state statutes, and sent out cease-and-desist letters to organizations that had posted the text of those statutes on the Internet. The state backed down following an outcry in the blogosphere, but similar issues might well recur as the open-access phenomenon continues to disrupt proprietary publication models for legal materials. See James Grimmelmann, Copyright, Technology, and Access to the Law: An Opinionated Primer, at http://james.grimmelmann.net/essays/CopyrightTechnologyAccess (June 19, 2008) (analyzing the Oregon incident); Fortney, supra note 28, at 1–2 (collecting assertions of copyright in statutes from other states); Carl Malamud, Three Revolutions in American Law ¶¶ 49–62 (2009), available at http://www.scribd.com/doc/21818472/Three-Revolutions-in-American-Law.
- Compare H.R. 801, 111th Cong. (2009) (forbidding federal agencies to condition research funding upon recipients’ archiving of findings in open-access repositories) with S. 1373, 111th Cong. (2009) (encouraging agencies to adopt such policies).
- See supra notes 14–17 and accompanying text.
- See, e.g., Ohio Rev. Code Ann. § 3333.04(A), (E) (West 2009) (directing state board of regents to consider “the needs of the people,” among other criteria, in identifying the “public services which should be offered” by state-supported higher education institutions).
- See Willinsky, supra note 8, at 65, 227–32 (arguing for establishment of open-access publishing and archiving cooperative based partly on fulfillment of participating institutions’ public service mandates).
- See, e.g., James G. Neal, A Lay Perspective on the Copyright Wars: A Report from the Trenches of the Section 108 Study Group, 32 Colum. J.L. & Arts 193, 194 (2008) (“Universities and libraries are committed to openness—general and barrier-free access to information framed by the rhetoric of open source, open standards, open archives and open knowledge.”); id. at 197–98 (discussing principles and policies developed by library community, several of which involve greater access to information).
- See id. at 194 (noting Professor Clayton Christensen’s definition of “disruptive technologies” that “enable a larger population of less skilled people to do the things that historically only an expert could do”) (footnote omitted). On the complex pattern of incentives that libraries may face to limit free access to digital collections, see Guy Pessach, The Role of Libraries in A2K: Taking Stock and Looking Ahead, 2007 Mich. St. L. Rev. 257, 261–62 (2007). The possibility that mass digitization projects may threaten adverse effects on the operation of libraries is an important factor behind the equivocal stance the library community has adopted towards the proposed settlement of the ongoing Google Book Search copyright litigation. See, e.g., Supplemental Library Association Comments on the Proposed Settlement, Authors Guild, Inc. v. Google Inc., No. 05-CV-8136-DC (S.D.N.Y.), at http://www.arl.org/bm~doc/library-associations-supp-filing-sept-2-09.pdf.
- See Durham Statement on Open Access to Legal Scholarship, at http://cyber.law.harvard.edu/publications/durhamstatement (last visited Nov. 5, 2009) [hereafter “Durham Statement”].
- See id. (“We therefore urge every U.S. law school to commit to ending print publication of its journals and to making definitive versions of journals and other scholarship produced at the school immediately available upon publication in stable, open, digital formats, rather than in print.”).
- See Ian Gallacher, “Aux Armes, Citoyens!”: Time for Law Schools to Lead the Movement for Free and Open Access to the Law, 40 U. Toledo L. Rev. 1 (2008).
- See id. at 14–19.
- See id. at 21–31.
- See id. at 32–49.
- The home page of Google Books (formerly known as Google Book Search, which was itself formerly known as Google Print) is available at http://books.google.com/ (last visited Nov. 12, 2009). For an overview of the history and legal issues raised by Google Books, see, e.g., Dan L. Burk, The Mereology of Digital Copyright, 18 Fordham Intell. Prop. Media & Ent. L.J. 711, 713–22 (2008).
- The home page of the Internet Archive is available at http://www.archive.org/ (last visited Nov. 12, 2009). One of the Internet Archive’s distinguishing features is its effort to create a digital archive of the World Wide Web itself (which it labels the “Wayback Machine”) by taking and storing periodic “snapshots” of every site accessible to its software. See, e.g., Internet Archive v. Shell, 505 F. Supp. 2d 755, 760–61 (D. Colo. 2007) (explaining operation of the Wayback Machine); Diane Leenheer Zimmerman, Can Our Culture Be Saved?: The Future of Digital Archiving, 91 Minn. L. Rev. 989, 995–96 (2007). Perhaps less controversially, the Internet Archive also maintains an extensive collection of scanned public-domain texts. See http://www.archive.org/details/texts (last visited Nov. 12, 2009); Peter S. Menell, Knowledge Accessibility and Preservation Policy for the Digital Age, 44 Hous. L. Rev. 1013, 1040–41 (2007) (situating the Internet Archive’s scanning and preservation efforts in historical context).
- Project Gutenberg’s home page is available at http://www.gutenberg.org (last visited Nov. 12, 2009). The project’s activities are considered in Zimmerman, supra note 43, at 995; Hannibal Travis, Building Universal Digital Libraries: An Agenda for Copyright Reform, 33 Pepp. L. Rev. 761, 784 (2006). Project Gutenberg’s crowdsourced (but not Wiki-driven) proofreading engine, Distributed Proofreaders, is discussed infra at notes 86–106 and accompanying text.
- Lists of Shakespeare’s works, with links to the text of each, are available at http://www.gutenberg.org/browse/authors/s (last visited Nov. 12, 2009); http://en.wikisource.org/wiki/Author:William_Shakespeare (last visited Nov. 12, 2009).
- Lists of Dickens’s works, with links to the text of each, are available at http://www.gutenberg.org/browse/authors/d (accessed Nov. 12, 2009); http://en.wikisource.org/wiki/Author:Charles_Dickens (accessed Nov. 12, 2009). Were he alive today, Dickens himself would likely react unfavorably to the discovery that the natives of far-flung locales could easily read his great works without paying a penny in royalties. See, e.g., Larisa T. Castillo, Natural Authority in Charles Dickens’s Martin Chuzzlewit and the Copyright Act of 1842, 62 Nineteenth-Century Literature 435, 436 (2008).
- A Google Books search for “Harvard Law Review,” enclosed in quotation marks, returned 10,938 results when performed on November 13, 2009. Most of the top ten results were actual scans of early volumes of the journal, but after that the results quickly veered into unrelated publications—snippets of autobiographies whose authors mentioned their time on the Review, for example. Performing a narrower author search for “Harvard Law Review Association” yielded far fewer hits, but again, the results included a number of publications besides issues of the Review.
- I performed a search for “United States Reports” limited to “media type: texts” on the Internet Archive on November 13, 2009 and received 164 results, most of which seemed to be scans of fairly recent volumes of the Reports. Again, the results list included many other publications also hosted at the Internet Archive that happened to mention the Reports.
My legal historian and librarian friends will no doubt be quick to remind me that “United States Reports” is a colloquialism of sorts; we now apply that name to a collection of materials that includes many early volumes never published with that title, which understandably might not turn up in a search for “United States Reports.” Nevertheless, given that more than 550 volumes of the Reports have been published (including over 400 since “United States Reports” became the official title of the series), a search that returns fewer than one-third that number of hits (and includes among the total many documents that are not themselves part of the United States Reports) still reveals something about the fragmentary and sporadic coverage of legal materials found in many of the general-purpose digitization projects.
- The LII’s home page is available at http://www.law.cornell.edu/ (last visited Nov. 17, 2009). See also Gallacher, supra note 38, at 26 (praising LII as “the most visible example” of a law school’s “active engage[ment] in making the law accessible to everyone”).
- LII’s United States Code portal is available at http://www.law.cornell.edu/uscode/ (last visited Nov. 17, 2009). Users may jump directly to a particular title and section, search for legislation by popular name, or browse the entire Code by following a hierarchical arrangement of links.
- LII’s Code of Federal Regulations portal is available at http://www.law.cornell.edu/cfr/ (last visited Nov. 17, 2009). Users may jump directly to a particular title and section, or browse the full Code by following a set of links arranged hierarchically.
- AltLaw’s home page is available at http://www.altlaw.org/ (last visited Nov. 17, 2009). See also Gallacher, supra note 38, at 26 (singling out AltLaw as a particularly promising open-access resource).
- See http://www.altlaw.org/v1/about/coverage (last visited Nov. 17, 2009).
- See http://www.altlaw.org/v1/search/advanced (last visited Nov. 19, 2009); http://www.altlaw.org/v1/search/boolean (last visited Nov. 19, 2009). I do not mean to fault AltLaw’s designers for failing to match their impressive accomplishments in organizing and hyperlinking their hosted content with an equally sophisticated search engine. As David Weinberger has pointed out, one of the distinctive advantages of storing information digitally is that multiple overlapping organizational or searching schema may be adopted simultaneously without displacing or superseding other, equally valid, organizational schema for the same underlying content. See David Weinberger, Everything is Miscellaneous: The Power of the New Digital Disorder 19–23 (2007). The important part of the process is the one at which AltLaw has excelled, namely, simply getting the content online and hyperlinked; higher-order indexing and search functions can follow later (or be developed by others) so long as they have an underlying pool of content upon which to work.
- Justia’s home page is available at http://www.justia.com/ (last visited Nov. 19, 2009).
- On the significance of Justia’s court document repository, see Peter W. Martin, Online Access to Court Records—From Documents to Data, Particulars to Patterns, 53 Vill. L. Rev. 855, 885–87 (2008).
- The project’s home page, unsurprisingly, is http://public.resource.org/ (last visited Nov. 19, 2009); see also John Markoff, A Quest to Get More Court Rulings Online, and Free, N.Y. Times, Aug. 20, 2007, at http://www.nytimes.com/2007/08/20/technology/20westlaw.html (last visited Apr. 16, 2010).
- See http://www.recapthelaw.org/ (last visited Nov. 19, 2009).
- Google Scholar is available at http://scholar.google.com/ (last visited Nov. 19, 2009). For Google’s announcement of its new case law search function, see Anurag Acharya, Finding the laws that govern us, at http://googleblog.blogspot.com/2009/11/finding-laws-that-govern-us.html (last visited Nov. 17, 2009). At the time of this writing, Google Scholar includes “US state appellate and supreme court cases since 1950, US federal district, appellate, tax and bankruptcy courts since 1923 and US Supreme Court cases since 1791.” http://scholar.google.com/intl/en/scholar/help.html (last visited Nov. 19, 2009).
- The service, however, works in one direction only; that is, the Google Books treatises do not (yet?) link back to the cases they cite at Google Scholar. Nor is there a citator offered in either direction for statutes, none of which is included among the materials recently added to Google Scholar.
- See, e.g., John J. DiGilio, Bridging the DiGital Divide: A New Vendor in Town? Google Scholar Now Includes Case Law, LLRX.com, at http://www.llrx.com/featres/googlescholarcaselaw (Nov. 18, 2009) (surveying pros and cons of Google’s new database); Mikhail Koulikov, Indexing and Full-Text Coverage of Law Review Articles in Nonlegal Databases: An Initial Study, 102 Law Libr. J. 39, 52 ¶ 37 (2010) (noting that “the emergence of search engines such as Google Scholar, which are not subscription-based, has presented an entirely new set of issues regarding the relationship between academics and published scholarship”); Eugene Volokh, The Future of Books Related to the Law?, 108 Mich. L. Rev. 823, 826 (2010) (predicting that advent of e-reader technologies, coupled with increasing availability of primary source materials in open-access repositories, will make it easier for reference sources commonly used in legal education, such as casebooks and treatises, to hyperlink directly to the cases and statutes cited therein).
- See, e.g., Tim Stanley, Free US Case Law from Google!—US Federal + 50 State Case Law, at http://onward.justia.com/useful-tools-web-sites-220-free-us-case-law-from-google-us-federal-50-state-case-law.html (last visited Nov. 17, 2009).
- See Richard Leiter, Google Scholar—(Almost) Great Free Legal Search, at http://thelifeofbooks.blogspot.com/2009/11/google-scholar-almost-great-free-legal.html (last visited Nov. 17, 2009) (anticipating that Google Scholar will compete more with free alternatives than with proprietary databases).
- See Lee Sims, In the Face of Google Assault, AltLaw Hangs It Up, at http://www.law.uconn.edu/content/face-google-assault-altlaw-hangs-it (last visited Dec. 1, 2009).
- See Guido Calabresi, A Common Law for the Age of Statutes ch. 1 (1982). In recognition of the growing importance of statutory and regulatory interpretation skills to contemporary practice, Harvard Law School overhauled its first-year curriculum in 2006 to incorporate a new (and mandatory) “Legislation and Regulation” course. See Elena Kagan, The Harvard Law School Revisited, 11 Green Bag 2d 475, 477–78 (2008); Legislation and Regulation, at http://www.law.harvard.edu/prospective/jd/about/legislation-regulation.html (accessed Dec. 18, 2009).
- See supra notes 49–51 and accompanying text.
- See, e.g., supra notes 53, 59 and references cited.
- Vestiges of this earlier and more limited open-access world survive today. For example, the FindLaw web site maintains its own archive of federal appeals court decisions that date only from the mid-1990s. See Federal Courts of Appeal—Judicial Branch—Federal Resources, http://www.findlaw.com/10fedgov/judicial/appeals_courts.html (last visited Dec. 18, 2009), and the circuit-by-circuit archives linked from that page. The United States Supreme Court, to take another example, has posted bound volumes of its own decisions online, but the earliest available is Volume 502, which collects opinions from the October 1991 Term of the Court. Supreme Court—Bound Volumes, http://www.supremecourtus.gov/opinions/boundvolumes.html (last visited Dec. 18, 2009).
1995, roughly speaking, marks the point at which the courts of appeals began posting electronic copies of their own decisions online. Because these reported decisions were “born digital”—that is to say, created and disseminated initially in electronic form—storing and organizing them online entailed no digitization expense, which led to their rapid proliferation. In contrast, digitizing and archiving earlier works has frequently entailed substantial labor and expense. See, e.g., Markoff, supra note 57.
- By way of an isolated counterexample, the Library of Congress’s American Memory Project has undertaken a massive and praiseworthy digitization initiative aimed at early American source texts. Of particular value is its collection entitled: A Century of Lawmaking for a New Nation: U.S. Congressional Documents and Debates, http://memory.loc.gov/ammem/amlaw/lawhome.html (last visited Jan. 4, 2010). Although this site hosts scanned images of a number of valuable texts not widely available elsewhere, such as documents produced by the Continental Congress and a number of works from the first years after the ratification of the Constitution, the Library of Congress has imposed a substantial technological impediment to easy access and use of the voluminous materials in its collection. In many instances, users may only view a single page image on screen at a time, and may only navigate forward and backward a page at a time, or jump to a specific, known, page number. Furthermore, for many of the most valuable works in its collection, the Library provides only page images, not text, making it impossible to (for example) copy-and-paste the language of early legislative enactments into another document. The present author’s Early United States Statutes project represents one effort to build upon the Library’s scanned document repository and make it more useful. http://homepages.uc.edu/~armstrty/statutes.html (last visited Jan. 4, 2010). For another such effort, see infra notes 143–148 and accompanying text.
- The AltLaw and Google Scholar sites, as already noted, are exceptions to this general rule. See supra notes 52–54, 59–60 and accompanying text.
- See Gallacher, supra note 38, at 40–41.
- See Yochai Benkler, Coase’s Penguin, or, Linux and The Nature of the Firm, 112 Yale L.J. 369, 375 (2002).
- See, e.g., Steven Weber, The Success of Open Source ch. 4 (2004); Eric S. Raymond, A Brief History of Hackerdom and The Cathedral and the Bazaar, in The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary 5, 23–25 (Tim O’Reilly ed., O’Reilly & Associates 2001) (1999).
- See, e.g., Benkler, supra note 4.
- This brief, descriptive summary surely understates the aspects of Wikipedia that are the most interesting and worthy of study; for a better assessment, see Jonathan Zittrain, The Future of the Internet—and How to Stop It ch. 6 (2008).
- For peer-produced works, copyright licensing arrangements substitute for the legal authorizations that would otherwise be provided by the hierarchical structure of a private firm, enabling persons who are legal strangers to share, reuse, and expand one another’s expressive works. See Timothy K. Armstrong, Shrinking the Commons: Termination of Copyright Licenses and Transfers for the Benefit of the Public, 47 Harv. J. on Legis. 359, 407 (2010).
- Accessible overviews of the psychological and economic considerations that drive the crowdsourcing phenomenon are available in, e.g., Jeff Howe, Crowdsourcing: Why the Power of the Crowd is Driving the Future of Business (2008); Clay Shirky, Here Comes Everybody: The Power of Organizing Without Organizations (2008); Don Tapscott & Anthony D. Williams, Wikinomics: How Mass Collaboration Changes Everything (2006).
- See Weber, supra note 73, at 59 (noting complexities of software development process that tend to place relatively inflexible limits on what can be accomplished by one or two programmers).
- The entry for “crowdsourcing” in the Wikipedia encyclopedia includes an interesting list of crowdsourced projects involving a wide variety of private and public entities. http://en.wikipedia.org/wiki/Crowdsourcing (last visited Jan. 5, 2010).
- See Armstrong, supra note 76, at 361 (collecting examples).
- See Hafner, supra note 3.
- See id.
- Project Gutenberg Main Page, http://www.gutenberg.org (last visited Feb. 10, 2010). A 1992 essay describing the history and goals of the project, written by the project’s founder, is available at Michael Hart, Gutenberg: The History and Philosophy of Project Gutenberg, http://www.gutenberg.org/wiki/Gutenberg:The_History_and_Philosophy_of_Project_Gutenberg_by_Michael_Hart (last visited Feb. 10, 2010). A lengthier history of the project is Marie Lebert, Project Gutenberg (1971–2008), http://www.gutenberg.org/etext/27045 (last visited Feb. 10, 2010).
- The total exceeds 31,000 titles at the time of this writing. http://www.gutenberg.org/dirs/GUTINDEX-2010.txt (last visited Feb. 10, 2010).
- The home page of Distributed Proofreaders is online at http://www.pgdp.net (last visited Apr. 17, 2010). A short history of the project is available via the site’s entry in Wikipedia. See Wikipedia, Distributed Proofreaders, at http://en.wikipedia.org/wiki/Distributed_Proofreaders (last visited Apr. 17, 2010).
- The listing of texts that have achieved “Completed” status on Distributed Proofreaders (indicating that they have passed through all stages of the site’s multi-step proofreading process) included over 17,000 titles at the time of this writing. See DP: Complete Gold E-Texts, http://www.pgdp.net/c/list_etexts.php?x=g&sort=5 (last visited Feb. 10, 2010).
- See Figure 1, infra, at 31.
- See DP: Proofreading Guidelines, at http://www.pgdp.net/c/faq/proofreading_guidelines.php (last visited Feb. 10, 2010). A pocket summary of the lengthy Proofreading Guidelines is available at http://www.pgdp.net/c/faq/proofing_summary.pdf (last visited Feb. 10, 2010).
- A diagram illustrating the full workflow of a Distributed Proofreaders project (including preparation and post-processing activities that occur largely “behind the scenes,” invisible to ordinary users of the site) is available at http://www.pgdp.net/c/faq/DPflow.php (last visited Feb. 10, 2010).
- See id.
- See id.
- See id.
- See id.
- See id.
- See DP: P2: Proofreading Round 2, http://www.pgdp.net/c/tools/proofers/round.php?round_id=P2 (last visited Feb. 10, 2010).
- See DP: F1: Formatting Round 1, http://www.pgdp.net/c/tools/proofers/round.php?round_id=F1 (last visited Feb. 10, 2010).
- See DP: P3: Proofreading Round 3, http://www.pgdp.net/c/tools/proofers/round.php?round_id=P3 (last visited Feb. 10, 2010).
- See DP: F2: Formatting Round 2, http://www.pgdp.net/c/tools/proofers/round.php?round_id=F2 (last visited Feb. 10, 2010).
- See Access Requirements, at http://www.pgdp.net/wiki/Access_requirements#Project_Manager (accessed Feb. 10, 2010).
- See id. The site also recommends, but does not require, DP membership for a period of 4–6 months as a condition of Project Manager status. Id.
- See, e.g., Benkler, supra note 4, at 81; Benkler, supra note 72, at 398–99.
- See Travis, supra note 44, at 784 (“[c]ommons-based peer production has created what is arguably the largest and most successful digital library, and in a remarkably speedy, efficient, and user-friendly way.”).
- See supra note 86.
- See supra notes 93–100 and accompanying text.
- See supra notes 93–100 and accompanying text.
- As Project Gutenberg’s founder explained:
Project Gutenberg selects etexts targeted a bit on the “bang for the buck” philosophy … we choose etexts we hope extremely large portions of the audience will want and use frequently. We are constantly asked to prepare etext from out of print editions of esoteric materials, but this does not provide for usage by the audience we have targeted, 99% of the general public.
Hart, supra note 83.
- See generally The Wikimedia Foundation home page, http://wikimediafoundation.org/ (last visited Aug. 13, 2009). Links to each of the Foundation’s wiki projects appear at the bottom of the Foundation’s home page, and at the bottom of the home pages of each of the respective projects. See generally Descriptive Project Summaries, http://wikimediafoundation.org/wiki/Our_projects (last visited Aug. 13, 2009).
- See, e.g., Lawrence Lessig, Code: Version 2.0 (2006) (dedicated to Wikipedia); Tim Wu, Can Wiki Travel?, Apr. 6, 2007, http://www.slate.com/id/2163727/ (2007) (law professor Tim Wu declares himself “a confessed Wikipedia addict, sometime contributor, and true believer”).
- See, e.g., Suzanna Sherry, Democracy and the Death of Knowledge, 75 U. Cin. L. Rev. 1053, 1055 (2007); Robert McHenry, The Faith-Based Encyclopedia, available at http://www.tcsdaily.com/Article.aspx?id=111504A.
- See Michael J. Tonsing, The Wiki Family of Web Sites, Fed. Law., July 2009, at 14.
- At the 2010 Annual Meeting of the AALS Section on Internet and Computer Law where this paper (along with three others) was presented, Professor Paul Ohm observed that two of the four pieces presented by the members of the panel focused their analysis entirely on Wikipedia, signifying that perhaps the chosen theme of the Section’s panel (“Law and Wikis”) should have been revised to “Law and Wikipedia.” As discussed below, Wikipedia is, by a vast margin, the largest of WMF’s many projects (comfortably larger, indeed, than all the other WMF wikis combined), and may attract disproportionate attention for that reason alone. See infra Table 1, at 29.
Professor Ohm’s casual observation seems to have some empirical foundation. A search for “Wikipedia” in Westlaw’s JLR database on February 15, 2010 returned 2,258 “hits,” compared with just 30 for “Wikiquote,” 14 for “Wiktionary,” 10 for “Wikinews,” 8 for “Wikibooks,” 5 for “Wikimedia Commons,” 4 for “Wikiversity” (of which two appeared to be duplicates of one another), 3 for “Wikispecies” (with the same duplicates), and just 2 for “Wikisource.”
- Like all of WMF’s wikis, Wikisource consists of not one project, but many, each serving the needs of speakers of a particular language. The home page of the English-language version of Wikisource is online. See generally Wikisource Homepage, http://en.wikisource.org/ (last visited Aug. 13, 2009). At present, the English-language Wikisource library is, by several different units of measure, the largest; the top ten Wikisource libraries are listed infra Table 2, at 30.
- For Wikisource’s complete inclusion policy, see Wikisource: What Wikisource includes, http://en.wikisource.org/wiki/Wikisource:What_Wikisource_includes (last visited Feb. 10, 2010) [hereinafter “What Wikisource includes”]. The site also provides guidance as to which non-public domain works are sufficiently “free” to qualify for inclusion; for example, works issued under a simple Creative Commons Attribution license would qualify, but works issued under the more restrictive Attribution-Non Commercial-No Derivatives license would not. See Wikisource: Copyright Policy, http://en.wikisource.org/wiki/Wikisource:Copyright_policy (last visited Feb. 10, 2010).
- See What Wikisource includes, supra note 113.
- See Wikipedia:Neutral point of view, http://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view (last visited Feb. 10, 2010); Lessig, supra note 108, at 243–44.
- See, e.g., Benkler, supra note 4, at 70–71 (“An effort to represent sympathetically all views on a subject, rather than to achieve objectivity, is the core operative characteristic of this effort.”). Wikipedia is a uniquely self-critical work, and lengthy discussions of the practical difficulty of achieving the site’s objective of substantive neutrality are easily located on the site itself. See generally Wikipedia:Why Wikipedia is not so great, http://en.wikipedia.org/wiki/Wikipedia:Why_Wikipedia_is_not_so_great (last visited Feb. 10, 2010).
- At the time of this writing, over 6,500 Wikipedia articles have been flagged as possibly violating the site’s neutrality principle, with most of them flagged as problematic for over a year. See Category:NPOV disputes, http://en.wikipedia.org/wiki/Category:NPOV_disputes (last visited Feb. 10, 2010).
- Issues involving “neutrality” do sneak in through the back door at Wikisource under the heading of completeness; that is to say, a user’s faithful reproduction of only a nonrepresentative or misleading excerpt from a work may prompt other users to add the rest of the work to provide necessary context. See Wikisource:What is Wikisource?, http://en.wikisource.org/wiki/Wikisource:What_is_Wikisource%3F (last visited Feb. 10, 2010) (briefly addressing neutrality issue in context of publication of misleading extracts from a work).
- See generally Barack Obama, http://en.wikipedia.org/wiki/Barack_Obama (last visited Feb. 10, 2010).
- See generally George W. Bush, http://en.wikipedia.org/wiki/George_W._Bush (last visited Feb. 10, 2010).
- This risk seems particularly low for works more recently added to Wikisource, many of which are assembled from scanned page images of the published original sources and which permit easy user verification of the text against the source image.
- Statistical descriptions of Wikisource (or, indeed, any of the WMF projects) involve substantial risks of error due to the constant flux of additions and deletions to the project. With that caveat in mind, however, it is possible to make some very broad points to illustrate the relative magnitude of the works available at the English-language Wikisource. As of January 2010, Wikisource included approximately 321,000 individual pages of scanned text—a figure that may undercount the actual number of scanned images available at the site, not all of which have yet been used to produce a corresponding text page. See Wikisource Statistics—Tables—English—Database records per namespace, http://stats.wikimedia.org/wikisource/EN/TablesWikipediaEN.htm#namespaces (last visited Feb. 10, 2010) (the column heading “104” in this table corresponds to the “Page” namespace used on the project and marks the number of text records that match a scanned page image at Wikisource).
- The necessary “ProofreadPage” software extension was added to the MediaWiki software that underlies all WMF sites in mid-2007. See Extension:ProofreadPage, http://www.mediawiki.org/wiki/Extension:Proofread_Page (last visited Feb. 10, 2010).
- A few of these files are hosted at Wikisource itself, although the more common practice appears to be to host the files at Wikimedia Commons, where they are equally usable by all WMF projects.
- See, e.g., Index:Le Morte d’Arthur—Volume 1, http://en.wikisource.org/wiki/Index:Le_Morte_d%27Arthur_-_Volume_1.djvu (last visited Feb. 10, 2010). Where a single work is originally published in multiple separately bound volumes, it is common for each volume’s Index page to include links to the Index pages of the other volumes in the series to aid navigation. See id. The “djvu” suffix refers to a common file format optimized for storing scanned images. See DjVu, http://en.wikipedia.org/wiki/DjVu (last visited Feb. 10, 2010).
- See Help:Page Status, at http://en.wikisource.org/wiki/Help:Page_Status (last visited Feb. 10, 2010).
- See id.
- See id. At the time of this writing, the numbers of pages that had reached each quality tier on the English Wikisource were: Not Proofread, 252,667; Proofread, 36,582; Validated, 15,190. The author of the present Essay is partly to blame for the predominance of pages consisting entirely of raw OCR output, having personally uploaded some 70,000 such pages to the site using automated scripts. See infra notes 143–148 and accompanying text (discussing project to host the United States Statutes at Large on Wikisource).
- See Help:Page Status, supra note 126.
- See id.
- See id. The availability of “Empty” pages on the site—for which an image, but no corresponding text, exists—explains why the existing Page statistics understate the true dimensions of Wikisource. See supra note 122.
- See Figure 2, infra, at 32. The site’s architecture thus gives the proofreading process some aspects of a rudimentary video game, with users’ proofreading activities yielding progressive “rewards” in the form of perceptible changes in the appearance of the work’s index page. This feature is, if nothing else, an ingenious way of encouraging sustained user involvement, even if the makers of World of Warcraft or other actual video games probably have little to fear from the competition posed by Wikisource.
- See Figure 3, infra, at 33.
- See id.
- See Figure 4, infra, at 34. Depending on the resolution of the scanned source image, the user may “zoom in” for a closer view of the image to ease proofreading of small text.
- See Figure 2, infra, at 32.
- Cf. supra note 90.
- See, e.g., infra note 149 and references cited therein.
- See id.
- See Help:Editing Wikisource, http://en.wikisource.org/wiki/Help:Editing_Wikisource (last visited Apr. 17, 2010).
- See supra note 124.
- See H.R. Rep. No. 94-1476 (1976), http://en.wikisource.org/wiki/Copyright_Law_Revision_%28House_Report_No._94-1476%29 (last visited Feb. 12, 2010); S. Rep. No. 94-473 (1975), http://en.wikisource.org/wiki/Copyright_Law_Revision_%28Senate_Report_No._94-473%29 (last visited Feb. 12, 2010).
- See supra note 69.
- This portion of the Statutes at Large represents nearly a century and a half of American statutory law (1789–1925), as well as early treaties, Presidential proclamations, and the first version of the Revised Statutes (which would grow in time to become the work we now know as the United States Code).
- See United States Statutes at Large, Vol. 1, First Congress, http://en.wikisource.org/wiki/United_States_Statutes_at_Large/Volume_1/1st_Congress (last visited Feb. 12, 2010).
- For example, some Wikisource editors have proofread all four of the so-called “Alien and Sedition Acts” passed by Congress in 1798 and made the proofread text available on Wikisource, with links to additional explanatory content hosted on Wikipedia. See Alien and Sedition Acts, http://en.wikisource.org/wiki/Alien_and_Sedition_Acts (last visited Apr. 17, 2010). Other Wikisource users have been proofreading and posting the proclamations of President Theodore Roosevelt in essentially chronological order. See United States Statutes at Large, Vol. 33, Part 2, http://en.wikisource.org/wiki/Index:United_States_Statutes_at_Large_Volume_33_Part_2.djvu (last visited Apr. 17, 2010) and pages linked therefrom.
- The most complete volume at present is Volume 1, with approximately 25% of the pages proofread as of the time of this writing. See United States Statutes at Large Vol. 1, http://en.wikisource.org/wiki/Index:United_States_Statutes_at_Large_Volume_1.djvu (last visited Apr. 17, 2010). Clicking the volume links for any of the other scanned volumes in the Statutes at Large (all of which are linked from the page for Volume 1) will reveal the overwhelming predominance of page links that appear against a red background, signifying “not proofread.” See supra note 126.
- In addition to the sheer size of the dataset (the Statutes at Large scans alone presently account for over 20% of all the scanned pages available at Wikisource), the process of proofreading and correction is doubtless slowed by (1) the complex, multi-column page format employed in the original work; and (2) the poor quality of the raw OCR output from the software employed to date, which necessitates substantial human effort to proofread and correct a single page. There is nothing inevitable or irremediable about either of these problems; more technologically skilled users of the site may, in time, identify common OCR errors that may be auto-corrected across many pages at once using search-and-replace scripts, or may apply improved OCR software to the stored page images to yield a better baseline text that may be proofread more rapidly.
- See Timothy K. Armstrong, Fair Circumvention, 74 Brook. L. Rev. 1 (2008), http://en.wikisource.org/wiki/Fair_Circumvention. In the version of the article online at Wikisource, many citations to key statutes or cases appear as clickable hyperlinks that take the user directly to the work referenced by the citation. Links to explanatory content available on Wikipedia or other WMF wikis also appear throughout the document.
- See supra note 41 and accompanying text.
- The contents of all WMF wikis, including Wikisource and Wikipedia, are available for download at http://download.wikimedia.org/ (last visited Feb. 12, 2010).
- See supra notes 140–141 and accompanying text.
- See Table 1, infra, at 29; compare supra note 104 and accompanying text. Measured by the number of users who have participated at each project during the last thirty days, Wikisource is approximately one-tenth the size of Distributed Proofreaders, and barely one five-hundredth the size of Wikipedia. Wikisource’s small size and the relatively recent redesign of the site’s architecture to facilitate proofreading have also meant restricted throughput of works. As noted above, Distributed Proofreaders has completed over 17,000 texts, while the comparable statistic for Wikisource (consisting of those works that have achieved a quality level of “Validated” on all their pages) is only slightly over 100 texts at the time of this writing. See Category:Index Validated, at http://en.wikisource.org/wiki/Category:Index_Validated (last visited Feb. 12, 2010).
- See supra notes 49–58 and accompanying text. The Google Scholar case-law service may be that rare open-access project developed with minimal constraints as to resources. See supra notes 59–64 and accompanying text.
- Of course, making it possible for a wide variety of users to contribute irrespective of expertise may not represent an unalloyed blessing. A certain portion of the editing work on a site like Wikisource necessarily involves correcting erroneous contributions made by inexpert users of the site, although the benefit of allowing such users to participate and thereby to acquire greater familiarity with the site’s tools and culture surely outweighs the occasional need to undo mistaken or malicious edits.
- For examples of open-source development projects that do employ management structures guided by a “benevolent dictator,” at least to help make final decisions about which contributions will be accepted into the project, see Eric S. Raymond, Homesteading the Noosphere and The Magic Cauldron, in The Cathedral and The Bazaar, supra note 73, at 79, 124–26.
- This characteristic is typical of the open-source approach to development of expressive content. See Eric S. Raymond, Homesteading the Noosphere, in The Cathedral and The Bazaar, supra note 73, at 100–02 (explaining open-source software development as driven, at least in part, by the satisfaction users derive from practicing skills in areas of personal interest to them); Weber, supra note 73, at 62 (“The key element of the open source process, as an ideal type, is voluntary participation and voluntary selection of tasks.”).
- See, e.g., Aaron Krowne, Building a Digital Library the Commons-Based Peer Production Way, 9 D-lib, Oct. 2003, http://www.dlib.org/dlib/october03/krowne/10krowne.html (last visited Feb. 12, 2010).
- But see Eric Goldman, Wikipedia’s Labor Squeeze and its Consequences, 8 J. Telecomm. & High Tech. L. 157 (2010) (arguing that, despite its successes to date, Wikipedia’s architecture and the lack of traditional user incentives make the past pace of user contribution unsustainable).
- It is difficult to know whether the larger active user base at Distributed Proofreaders reflects superior architecture, or simply longer existence; after all, Distributed Proofreaders has been around for a decade, and may also draw contributions from users interested in Project Gutenberg, a nearly four-decade-old project.
- See Olufunmilayo B. Arewa, Open Access in a Closed Universe: Lexis, Westlaw, Law Schools, and the Legal Information Market, 10 Lewis & Clark L. Rev. 797, 827–28 (2006).