User:AdamBMorgan/Other help/Hierarchy of texts

{{essay}}

Hierarchy of texts

Opinion about the quality of texts and sources of texts on Wikisource.

The "rank" of a text held on Wikisource is subjective but it can affect how the text is treated. Rank can be judged based on the quality of the text, the reliability of the source and the ability to easily correct or confirm the integrity of the text at any time. This essay is intended to highlight the best type of work on Wikisource, explain why this is the case, and show how to improve a text's "ranking."

High

Circular green icon with a plus symbol in the centre.

The highest rank of text on Wikisource is one that has been proofread and validated from a scanned copy of the original using the ProofreadPage system. Each of these qualities (scan + proofreading + validation) contributes towards the reliability of the text and the credibility of Wikisource. The scan also allows any user to easily make corrections or put suspicions to rest.

A scan that is accessible through the ProofreadPage extension gives additional advantages over other sources. It allows multiple, dispersed users to transcribe the text by crowdsourcing, and then maintain it through the same process. If a transcription project is abandoned or sidelined, the scan remains and the project may be picked up by another interested user at any time. If Wikisource's policies or style change following transcription, the availability of the scan would make adjustments easier to implement. Scans are versatile.

Texts in this rank have the best chance of being selected as a featured text.

Mid

Wikisource is not the only digital library, nor is it the only one with reliable proofreading. Plain text copied and pasted from another trusted digital library is acceptable, as long as the source is clearly indicated so that a reader can check the fidelity against the text on the source library.

It is not always possible to work from a scanned text, so it is also acceptable to manually transcribe text from a physical copy such as a library book. This can be important with older, rarer works that are too fragile to be scanned or whose owners either choose not to, or do not have the ability to, make a scan. Again, the source and as much metadata about the original as possible are required. Ideally another Wikisource user will be able to double check and validate the transcription. Future readers may also want to use the metadata to check the fidelity of the transcription.

The main flaw with works in this rank is that, if a mistake is suspected, it cannot reliably be checked or corrected. Some readers may assume a correction is necessary and make this change, which may fix a real error, introduce an error to the correct text, or compound the existing error with a new, additional error. Without the scan, there is no easy way to confirm which is correct. If an error persists, it undermines confidence in the rest of the text and the Wikisource project in general.

To a lesser degree, the project is also limited by the source. By using the text of a digital library we are bound by their policies and choices, with no ability to make different decisions or apply our own policies and style if different to that of the source library. Some digital libraries may support a synthesis of editions to create their text, which is not supported by Wikisource, but by using their text we would be copying their policy in place of our own. If, in another example, the source library or original transcriber did not attempt to match the typography of the text, Wikisource cannot do so either. As texts are copied from one digital library to another, with no ability to refer back to the physical source of the text, it is possible errors may intrude or metadata may be lost.

Texts in this rank may be upgraded to the higher rank by making or acquiring a scan of the original physical text and uploading it as normal. The text can be matched to the scan to produce a work sourced directly to an online scan.

Low

In the ranks above, the importance of a source and metadata has been stressed. If this information is present, it is possible to check the fidelity of the transcription, albeit not as easily as it would be if a scan were available. If this information is not present, there is no ability to check the work.

With no ability to check the work, there is no ability to improve the work or validate its presence on Wikisource. A reader has no indication that the work they are viewing is a true transcript of a real work. This undermines the credibility of Wikisource as a project and may lead to disappointment in the reader. It is hard to determine that a work in this condition meets Wikisource's inclusion criteria.

Web searches on specific lines of text may reveal the origin of the text, which may lead to at least a little metadata if not a verifiable source. Even if a source is located, the text will need to be checked for errors, omissions or alterations. There is a chance works in this rank may be deleted. If a source is found for the text, the text can move up in its ranking. There is a chance that it may be superseded and overwritten by the product of the ProofreadPage extension. However, if a scan of the correct text is found, and the edition is also the same, the text can be matched to the scan. It may require some additional proofreading but this will result in the text moving further up in ranking and, because the scan is online and readily accessible, the proofreading can be done at any time by anyone.

Lowest

Circular red icon with a minus symbol in the centre.

Worse than plain, sourceless texts are texts that have been copied and pasted from the raw, un-proofread products of OCR'ed[1] text, complete with OCR errors. These include scannos (typos created automatically by a machine), gibberish, and headers and footers included in the body of the text.

Texts such as these combine all of the problems mentioned above. They contain clear errors and problems but, without a scan to back them up, they can never be fixed or improved. By clearly being incomplete and inferior to the point of being awkward (or even impossible) to read, they create a bad impression of the project and undermine the credibility of Wikisource as a whole. There is no reliability, no possibility of verification and no way to accurately confirm the fidelity of these texts.

Works in this rank may (although it is not guaranteed) have a stated source. This source may have a scan that can be uploaded. If so, the product of the ProofreadPage system will probably override the error-ridden pasted version. If no source is stated, it may be possible to find further information through web searches on specific lines of text.

Texts in this state on Wikisource have a strong chance of being deleted.

Notes

  1. OCR: Optical Character Recognition, the method by which computers attempt to read page scans and convert them to readable text.